Category Archives: Libraries

Talking About Data: Lessons from Science Communication

As a person who worked for years in psychology and neuroscience laboratories before coming to work in academic libraries, I have particularly strong feelings about ambiguous definitions. One of my favorite anecdotes about my first year of graduate school involves watching two researchers argue about the definition of “attention” for several hours, multiple times a week, for an entire semester. One of the researchers was a clinical psychologist, the other a cognitive psychologist. Though they both devised research projects and wrote papers on the topic of attention, their theories and methods could not have been more different. The communication gap between them was so wide that they were never able to move forward productively. The punchline is that, after sitting through hours of their increasingly abstract and contentious arguments, I would go on to study attention using yet another set of theories and methods as a cognitive neuroscientist. Funny story aside, this anecdote illustrates the degree to which people with different perspectives and levels of expertise can define the same problem in strikingly different ways.

A facsimile of a visual search array used by cognitive psychologists to study attention. Spot the horizontal red rectangle.

In the decade that has elapsed since those arguments, I have undergone my own change in perspective: from a person who primarily collects and analyzes their own research data to a person who primarily thinks about ways to help other researchers manage and share their data. While my day-to-day activities look rather different, there is one aspect of my work as a library post-doc that is similar to my work as a neuroscientist: many of my colleagues ostensibly working on the same things often have strikingly different definitions, methods, and areas of expertise. Fortunately, I have been able to draw on a body of work that addresses this very thing: science communication.

Wicked Problems

A “wicked problem” is a problem that is extremely difficult to solve because different stakeholders define and address it in different ways. In my anecdote about argumentative professors, understanding attention can be considered a wicked problem. Without getting too much into the weeds, the clinical psychologist understood attention mostly in the context of diagnoses like Attention Deficit Disorder, while the cognitive psychologist understood it in the context of scanning visual environments for particular elements or features. As a cognitive neuroscientist, I came to understand it mostly in terms of its effects within neural networks as measured by brain imaging methods like fMRI.

Research data management (RDM) has been described as a wicked problem. A data service provider in an academic library may define RDM as “the documentation, curation, and preservation of research data”, while a researcher may define RDM as either simply part of their daily work or, in the case of something like a data management plan written for a grant proposal, as an extra burden placed upon such work. Other RDM stakeholders, including those affiliated with IT, research support, and university administration, may define it in yet other ways.

Science communication is chock full of wicked problems, including concepts like climate change and the use of stem cells. Actually, given the significant amount of scholarship devoted to defining terms like “scientific literacy” and the multitudes of things that the term describes, science communication may itself be a wicked problem.

What is Science Communication?

Like attention and RDM, it is difficult to give a comprehensive definition of science communication. Documentaries like “Cosmos” are probably the most visible examples, but science communication actually comes in a wide variety of forms including science journalism, initiatives aimed at science outreach and advocacy, and science art. What these activities have in common is that they all generally aim to help people make informed decisions in a world dominated by science and technology. In parallel, there is also a burgeoning body of scholarship devoted to the science of science communication which, among other things, examines how effective different communication strategies are for changing people’s perceptions and behaviors around scientific topics.

For decades, the prevailing theory in science communication was the “Deficit Model”, which posits that scientific illiteracy is due to a simple lack of information. In the deficit model, skepticism about topics such as climate change is assumed to be due to a lack of comprehension of the science behind them. Thus, at least according to the deficit model, the “solution” to the problem of science communication is as straightforward as providing people with all the facts. In this conception, the audience is generally assumed to be homogeneous and communication is assumed to be one way (from scientists to the general public).

Though the deficit model persists, study after study (after meta-analysis) has shown that merely providing people with facts about a scientific topic does not cause them to change their perceptions or behaviors related to that topic. Instead, it turns out that presenting facts that conflict with a person’s worldview can actually cause them to double down on that worldview. Also, audiences are not homogenous. Putting aside differences in political and social worldviews, people have very different levels of scientific knowledge and relate to that knowledge in very different ways. For this reason, more modern models of science communication focus not on one-way transmissions of information but on fostering active engagement, re-framing debates, and meeting people where they are. For example, one of the more effective strategies for getting people to pay attention to climate change is not to present them with a litany of (dramatic and terrifying) facts, but to link it to their everyday emotions and concerns.

Find the same rectangle as before. It takes a little longer now that the other objects have a wider variety of features, right? Read more about visual search tasks here.
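
For the curious, here is a rough Python/matplotlib sketch that generates a toy conjunction-search array like the one described above (an illustrative facsimile using my own assumptions about colors and sizes, not the actual stimuli from the original figures):

```python
import random

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

random.seed(1)
fig, ax = plt.subplots(figsize=(5, 5))

# Distractors are red verticals and blue horizontals; the lone target shares
# one feature with each distractor type, which makes this a "conjunction"
# search and slower than a single-feature "pop-out" search.
for _ in range(30):
    x, y = random.random(), random.random()
    if random.random() < 0.5:
        color, w, h = "red", 0.02, 0.06    # red vertical distractor
    else:
        color, w, h = "blue", 0.06, 0.02   # blue horizontal distractor
    ax.add_patch(Rectangle((x, y), w, h, facecolor=color))

# The target: the only horizontal red rectangle in the display.
ax.add_patch(Rectangle((0.47, 0.62), 0.06, 0.02, facecolor="red"))

ax.set_xlim(0, 1.1)
ax.set_ylim(0, 1.1)
ax.set_aspect("equal")
ax.axis("off")
plt.show()
```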

Communicating About Data

If we adapt John Durant’s nicely succinct definition of science literacy, “What the general public ought to know about science,” to an RDM context, the result is something like “What researchers ought to know about handling data.” Thus, data services in academic libraries can be said to be a form of science communication. As with “traditional” science communicators, data service providers interact with audiences possessing different perspectives and levels of knowledge than their own. The major difference, of course, is that the audience for data service providers is specifically the research community.

There is converging evidence that many current approaches to fostering better RDM have had mixed results. Recent studies of NSF data management plans have revealed a significant amount of variability in the degree to which researchers address data management-related concepts like metadata, data sharing, and long-term preservation. The audience of data service providers is, like those of more “traditional” science communicators, quite heterogeneous, so perhaps adopting methods from the repertoire of science communication could help foster more active engagement and the adoption of better practices. Many libraries and data service providers have already adopted some of these methods, perhaps without realizing their application in other domains. I don’t mean to criticize existing efforts to engage researchers on the topic of RDM. If I’ve learned one thing from doing different forms of science communication over the years, it is that outreach is difficult and change is slow.

In a series of upcoming blog posts, I’ll write about some of my current projects that incorporate what I’ve written here. First up: I’ll provide an update of the RDM Maturity Model project that I previously described here and here. Coming soon!


Ensuring access to critical research data

For the last two months, UC3 has been working with the teams at Data.gov, Data Refuge, Internet Archive, and Code for Science (creators of the Dat Project) to aggregate government data.

Data that spans the globe

There are currently volunteers across the country working to discover and preserve publicly funded research, especially climate data, at risk of being deleted or lost from the public record. The largest initiative is called Data Refuge and is led by librarians and scientists. They are holding events across the UC campuses and the US that you can attend to help out in person, and they are organizing the library community to band together to curate the data and ensure it is preserved and accessible.

Our initiative builds on this work and aims to assemble a corpus of government data and corresponding metadata. We are focusing on public research data, especially data at risk of disappearing. The initiative was nicknamed “Svalbard” by Max Ogden of the Dat Project, after the Svalbard Global Seed Vault in the Arctic. As of today, our friends at Code for Science have released 38GB of metadata: over 30 million hashes and URLs of research data files.

The Svalbard Global Seed Vault in the Arctic

To aid in this effort

We have assembled the following metadata as part of the Code for Science’s Svalbard v1:

  • 2.7 million SHA-256 hashes for all downloadable resources linked from Data.gov, representing around 40TB of data
  • 29 million SHA-1 hashes of files archived by the Internet Archive and the Archive Team from federal websites and FTP servers, representing over 120TB of data
  • All metadata from Data.gov, about 2.1 million datasets
  • A list of ~750 .gov and .mil FTP servers

There are additional sources, such as Archivers.Space, EDGI, Climate Mirror, and Azimuth Data Backup, that we are working on adding metadata for in future releases.

Following the principles set forth by the librarians behind Data Refuge, we believe it’s important to establish a clear and trustworthy chain of custody for research datasets so that mirror copies can be trusted. With this project, we are working to curate metadata that includes strong cryptographic hashes of data files in addition to metadata that can be used to reproduce a download procedure from the originating host.
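
To make that verification step concrete, here is a minimal Python sketch of checking a mirrored file against a published SHA-256 hash. The file path and expected digest below are hypothetical placeholders, not values from the actual release:

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical example values: a locally mirrored file and the hash
# published for it in the metadata release.
local_copy = "mirror/example_agency/example_dataset.csv"
published_hash = "0000000000000000000000000000000000000000000000000000000000000000"

if sha256_of_file(local_copy) == published_hash:
    print("Mirror copy matches the published hash.")
else:
    print("Hash mismatch: the mirrored file differs from the original.")
```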

We are hoping the community can use this data in the following ways:

  • To independently verify that the mirroring processes that produced these hashes can be reproduced
  • To aid in developing new forms of redundant dataset distribution (such as peer to peer networks)
  • To seed additional web crawls or scraping efforts with additional dataset source URLs
  • To encourage other archiving efforts to publish their metadata in an easily accessible format
  • To cross reference data across archives, for deduplication or verification purposes

What about the data?

The metadata is great, but the initial release of 30 million hashes and URLs is just part of our project. The actual content (the files from which the hashes were derived) has also been downloaded. It is stored either at the Internet Archive or on our California Digital Library servers.

The Dat Project carried out a Data.gov HTTP mirror (~40TB) and uploaded it to our servers at California Digital Library. We are working with them to access ~160TB of data in the future and have partnered with UC Riverside to offer longer-term storage.

Download

You can download the metadata here using Dat Desktop or Dat CLI tool.  We are using the Dat Protocol for distribution so that we can publish new metadata releases efficiently while still keeping the old versions around. Dat provides a secure cryptographic ledger, similar in concept to a blockchain, that can verify integrity of updates.
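
The Dat protocol’s ledger is considerably more sophisticated, but the basic idea of an append-only, hash-chained log in which each release commits to everything before it can be sketched in a few lines of Python (a toy illustration of the concept, not Dat’s actual data structures):

```python
import hashlib
import json

def chain(entries):
    """Build a toy append-only log in which each entry commits to the previous one."""
    log, prev = [], ""
    for entry in entries:
        payload = json.dumps({"prev": prev, "entry": entry}, sort_keys=True)
        prev = hashlib.sha256(payload.encode()).hexdigest()
        log.append({"entry": entry, "hash": prev})
    return log

def verify(log):
    """Recompute every link; tampering with any earlier entry breaks the chain."""
    prev = ""
    for item in log:
        payload = json.dumps({"prev": prev, "entry": item["entry"]}, sort_keys=True)
        if hashlib.sha256(payload.encode()).hexdigest() != item["hash"]:
            return False
        prev = item["hash"]
    return True

releases = chain(["svalbard-v1 metadata", "svalbard-v2 metadata"])
print(verify(releases))  # True; altering the v1 entry afterwards would print False
```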

Feedback

If you want to learn more about how CDL and the UC3 team are involved, contact us at uc3@ucop.edu or @UC3CDL. If you have suggestions or questions, you can join the Code for Science Community Chat. And if you are a technical user, you can report issues or get involved at the Svalbard GitHub.

This is crossposted here: https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab#.f933mmts8

Science Boot Camp West

Last week Stanford Libraries hosted the third annual Science Boot Camp West (SBCW 2015),

“… building on the great Science Boot Camp events held at the University of Colorado, Boulder in 2013 and at the University of Washington, Seattle in 2014. Started in Massachusetts and spreading throughout the USA, science boot camps for librarians are 2.5 day events featuring workshops and educational presentations delivered by scientists with time for discussion and information sharing among all the participants. Most of the attendees are librarians involved in supporting research in the sciences, engineering, medicine or technology although anybody with an interest in science research is welcome.”

As a former researcher and newcomer to the library and research data management (RDM) scenes, I was already familiar with many of the considerable challenges on both sides of the equation (Jake Carlson recently summarized the plight of data librarians). What made SBCW 2015 such an excellent event is that it brought researchers and librarians together to identify immediate opportunities for collaboration. It also showcased examples of Stanford libraries and librarians directly facilitating the research process, from the full-service Stanford Geospatial Center to organizing Software and Data Carpentry workshops (more on this below, and from an earlier post).

Collaboration: Not just a fancy buzzword

The mostly Stanford-based researchers were generous with their time, introducing us to high-level concerns (e.g., why electrons do what they do in condensed matter) as well as more practical matters (e.g., shopping for alternatives to Evernote—yikes—for electronic lab notebooks [ELNs]). They revealed the intimate details of their workflows and data practices (Dr. Audrey Ellerbee admitted that it felt like letting guests into her home to find dirty laundry strewn everywhere, a common anxiety among researchers that in her case was unwarranted), flagged the roadblocks, and presented a constant stream of ideas for building relationships across disciplines and between librarians and researchers.

From the myriad opportunities for library involvement, here are some of the highlights:

  • Facilitate community discussions of best practices, especially for RDM issues such as programming, digital archiving, and data sharing
  • Consult with researchers about available software solutions (e.g., ELNs such as Labguru and LabArchives; note: representatives from both of these companies gave presentations and demonstrations at SBCW 2015), connect them with other users on campus, and provide help with licensing
  • Provide local/basic IT support for students and researchers using commercial products such as ELNs (e.g., maintain FAQ lists to field common questions)
  • Leverage experience with searching databases to improve delivery of informatics content to researchers (e.g., chemical safety data)
  • Provide training in and access to GIS and other data visualization tools

A winning model

The final half-day was dedicated to computer science-y issues. Following a trio of presentations involving computational workflows and accompanying challenges (the most common: members of the same research group writing the same pieces of code over and over with scant documentation and zero version control), Tracy Teal (Executive Director of Data Carpentry) and Amy Hodge (Science Data Librarian at Stanford) introduced a winning model for improving everyone’s research lives.

Software Carpentry and Data Carpentry are extremely affordable 2-day workshops that present basic concepts and tools for more effective programming and data handling, respectively. Training materials are openly licensed (CC-BY) and workshops are led by practitioners for practitioners allowing them to be tailored to specific domains (genomics, geosciences, etc.). At present the demand for these (international) workshops exceeds the capacity to meet it … except at Stanford. With local, library-based coordination, Amy has brokered (and in some cases taught) five workshops for individual departments or research groups (who covered the costs themselves). This is the very thing I wished for as a graduate student—muddling through databases and programming in R on my own—and I think it should be replicated at every research institution. Better yet, workshops aren’t restricted to the sciences; Data Carpentry is developing training materials for techniques used in the digital humanities such as text mining.

Learning to live outside of the academic bubble

Another, subtler theme that ran throughout the program was the need/desire to strengthen connections between the academy and industry. Efforts along these lines stand to improve the science underlying matters of public policy (e.g., water management in California) and public health (e.g., new drug development). They also address the mounting pressure placed on researchers to turn knowledge into products. Mark Smith addressed this topic directly during his presentation on ChEM-H: a new Stanford initiative for supporting research across Chemistry, Engineering, and Medicine to understand and advance Human Health. I appreciated that Mark—a medicinal chemist with extensive experience in both sectors—and others emphasized the responsibility to prepare students for jobs in a rapidly shifting landscape with increasing demand for technical skills.

Over the course of SBCW 2015 I met engaged librarians, data managers, researchers, and product managers, including some repeat attendees who raved about the previous two SBCW events; the consensus seemed to be that the third was another smashing success. Helen Josephine (Head of the Engineering Library at Stanford who chaired the organizing committee) is already busy gathering feedback for next year.

SBCW 2015 at Stanford included researchers from:

Gladstone Institutes in San Francisco

ChEM-H Stanford’s lab for Chemistry, Engineering & Medicine for Human Health

Water in the West Institute at Stanford

NSF Engineering Research Center for Re-inventing the Nation’s Urban Water Infrastructure (ReNUWIt)

DeepDive

Special project topics on Software and Data Carpentry with Physics and BioPhysics faculty and Tracy Teal from Software Carpentry.

Many thanks to:

Helen Josephine, Suzanne Rose Bennett, and the rest of the Local Organizing Committee at Stanford. Sponsored by the National Network of Libraries of Medicine – Pacific Southwest Region, Greater Western Library Alliance, Stanford University Libraries, SPIE, IEEE, Springer Science+Business Media, Annual Reviews, Elsevier.

From Flickr by Paula Fisher (It was just like this, but indoors, with coffee, and powerpoints.)


Does Your Library Delight You?

In a recent opinion piece in Forbes, Steve Denning provocatively asks, “Do we need libraries?”

As a digital librarian, my short answer is “Yes, of course we need libraries!” But, Denning makes many excellent points in cautioning that the same disruptive threats faced by many industries — think taxis and Uber, or hotels and AirBnB, for example — are also a threat to libraries. Denning argues that in today’s world, libraries must change their management practices and offerings in order to remain relevant. The computer age is not just about computerizing, he explains, but also about a fundamental shift that puts the customer or the user in control:

[From wikimedia user: Pumbaa80]

“… the most important thing that computers and the internet have done is not just to make things faster and easier for organizations. Even more importantly, they have shifted the balance of power in the marketplace from the seller to the buyer. The customer is now in charge. The customer has choices and good information about those choices. Unless customers and users are delighted, they can and will take their business elsewhere.”

To be clear, I would never suggest “Uber-izing” libraries, but there is much that those of us in the library world can learn from these evolving user-centered models.

Denning suggests a handful of “right” and “wrong” approaches to the future of libraries. Among the right approaches is the importance of focusing on how to “delight the user or customer.” We need to create services that truly meet or exceed the expectations of library users. We need to restructure ourselves in a way that ignites continuous innovation. And, we need to think about how to create services for users that they haven’t even thought of yet, while also continuing to perform the services that our users really love about our libraries, only faster and better.

Shifting the focus of academic research libraries to new models and areas of focus is not an easy task. But that’s exactly what’s happening at the UC Berkeley Libraries with the launch of the UC Berkeley’s Research Data Management (RDM) program, a joint venture between UCB Libraries and Berkeley’s Research Information Technologies (RIT) group. I recently attended the first public workshop for this program, and I’d say this initiative is a continuous affirmation that there’s a clear and compelling role for libraries in the future.

The UCB Libraries continue to re-tool themselves to meet the exponentially growing need for solutions for managing, preserving, and providing access to research data. They are proving innovative in their partnership with Research Information Technologies (RIT). Together, the Libraries and RIT bring to the table an excellent complement of staff and skills that, through collaboration, will help tackle the complex challenges of data management. From the get-go, the Research Data Management program has focused on being inclusive. At the first workshop, for instance, they cast a wide net to ensure attendance from a variety of disciplines and departments. They also sought everyone’s input and challenged us to think creatively about new solutions. And finally, they are focusing their efforts on connecting things that are working well in the library and across the campus with external resources that users can tap into.

The Research Data Management program’s three goals for the coming year include:

  • Training and Workshop Series: An in-person space to learn and share ideas across the campus, including hands-on training as well as tackling big picture topics such as policy, best practices and governance issues.
  • Rich, Online Resource Guide: A one-stop shop for researchers to find resources to support their work all along the research cycle.
  • Consultative Services: A personalized service to support research needs.

With the implementation of new funder requirements, the increased pressure to share data, and the fragility of digital media, researchers are feeling the need to come up with sustainable solutions for data management. Through the RDM program, the UC Berkeley Libraries are taking steps toward providing new services that users need, and others that they may not even know that they need.

At this first workshop, there was great energy and excitement in the room. I was certainly delighted and I think UCB faculty, students, and staff will also be.

Meet the Team

The group spearheading the program includes:

  • Norm Cheng, Senior Project Manager
  • Harrison Dekker, Coordinator Data Services
  • Susan Edwards, Head, Social Sciences Division
  • Mary Elings, Archivist for Digital Collections
  • David Greenbaum, Director, Research Information Technologies (RIT)
  • Chris Hoffman, Manager, Informatics Services
  • Rick Jaffe, Web Developer
  • John Lowe, Technical Lead and Manager for the CollectionSpace service
  • Erik Mitchell, Associate University Librarian, Director of Digital Initiatives and Collaborative Services
  • Felicia Poe, Interim UC Curation Center Director, California Digital Library

The First UC Libraries Code Camp

This post was co-authored by Stephen Abrams.

Military camp on Coronado Island, California. Contributed to Calisphere by the San Diego History Center.

So 30 coders walk into a conference center in Oakland… No, it’s not a bad joke in need of a punch line, it instead describes the start of the first UC Libraries Code Camp, which took place in downtown Oakland last week. These coders were all from the University of California system (8 out of 10 campuses were represented!) and work with or for the UC libraries. CDL sponsored the event and was well represented among the attendees.

The event consisted of two days of lively collaborative brainstorming on ways to provide better, more sustainable library services to the UC community. Camp participants represented a variety of library roles (curatorial, development, and IT), providing a useful synergistic approach to common problems and solutions. The camp was organized according to the participatory unconference format, in which topics of discussion were arrived at through group consensus. The final schedule included 10 breakout sessions on topics as diverse as the UC Libraries Digital Collection (UCLDC), data visualization, agile methodology, cloud computing, and use of APIs. There was also a plenary session of “dork shorts” in which campus representatives gave summary presentations on selected services and initiatives of common interest.

The conference agenda, with notes from the various breakouts, is available on the event website. For those of us that work in the very large and expansive UC system, get-togethers like this one are crucial for ensuring we are efficiently and effectively supporting the UC community.

Of Note

  • We established a GitHub organization: UCLT. Join by emailing your GitHub username to uc3@ucop.edu.
  • We are establishing a Listserv: uclibrarytech-l@ucop.edu
  • Next code camp to take place in the south, in January or February 2015. (We need a southern campus to volunteer!)

Next Steps

  1. Establish a new Common Knowledge Group for Libraries Information Technologists. We need to draft a charter and establish the initial principles of the group. Status: in progress, being led by Rosalie Lack, CDL
  2. Help articulate the need for more resources (staff, knowledge, skills, funding) that would allow libraries to better support data and the researchers creating/managing data. Status: database of skills table is being filled out. Will help guide discussions about library resources across the UC.
  3. Build up a database of UC libraries technologists; help share expertise and skills. Status: table being filled out. Will be moved to GitHub wiki once completed.
  4. Establish a collaborative space for us to share war stories, questions, concerns, approaches to problems, etc. Status: GitHub Organization created. Those interested should join by emailing us at uc3@ucop.edu with their GitHub username.
  5. Have more Code Camp style events, and rotate locations between campuses and regions (e.g., North versus South). Status: can plan these via GitHub organization + listserv
  6. Keep UC Code Camp conversations going, drilling down into some specific topics via virtual conferencing. Status: can plan these via GitHub organization + listserv. Can create specific “teams” within the GitHub organization to help organize more specific groups within the organization.
  7. Develop teams of IT + librarians to help facilitate outreach and education on campuses.
  8. Have CDL visit campuses more often to run informational sessions.
  9. Have space for sharing outreach and education materials around data management, tools and services available, etc. Status: can use GitHub organization or …?

It takes a data management village

A couple of weeks ago, information scientists, librarians, social scientists, and their compatriots gathered in Toronto for the 2014 IASSIST meeting. IASSIST is, of course, an acronym which I always have to look up to remember – International Association for Social Science Information Service & Technology. Despite its forgettable name, this conference is one of the better meetings I’ve attended. The conference leadership manages to put together a great couple of days, chock full of wonderful plenaries and interesting presentations, and even arranged a hockey game for the opening reception.

Yonge Street crowds celebrating the end of the Boer War, Toronto, Canada. This image is available from the City of Toronto Archives, and is in the public domain.

Although there were many interesting talks, and I’m still processing the great discussions I had in Toronto, a couple really rang true for me. I’m going to now shamelessly paraphrase one of these talks (with permission, of course) about building a “village” of data management experts at institutions to best service researchers’ needs. All credit goes to Alicia Hofelich Mohr and Thomas Lindsay, both from University of Minnesota. Their presentation was called “It takes a village: Strengthening data management through collaboration with diverse institutional offices.” I’m sure IASSIST will make the slides available online in the near future, but I think this information is too important to not share asap.

Mohr and Lindsay first described the data life cycle, and emphasized the importance of supporting data throughout its life – especially early on, when small things can make a big difference down the road. They asserted that in order to provide support for data management, librarians need to connect with other service providers at their institutions. They then described who these providers are, and where they fit into the broader picture. Below I’ve summarized Mohr and Lindsay’s presentation.

Grants coordinators

Faculty writing grants are constantly interacting with these individuals. They are on the “front lines” of data management planning, in particular, since they can point researchers to other service providers who can help over the course of the project. Bonus – grants offices often have a deep knowledge of agency requirements for data management.

Sponsored projects

The sponsored projects office is another service provider that often has early interactions with researchers during their project planning. Researchers are often required to submit grants directly to this office, which ensures compliance and focuses on the requirements needed for proposals to be complete.

College research deans

Although this might be an intimidating group to connect with, they are likely to be the most aware of the current research climate and can help you target your services to the needs of their researchers. They can also help advocate for your services, especially via things like new faculty orientation. Generally, this group is an important ally in facilitating data sharing and reuse.

IT system administrators

This group is often underused by researchers, despite their ability to potentially provide researchers with server space, storage, collaboration solutions, and software licenses. They are also useful allies in ensuring security for sensitive data.

Research support services & statistical consulting offices

Some universities have support for researchers in the designing, collecting, and analyzing of their data. These groups are sometimes housed within specific departments, and therefore might have discipline-specific knowledge about repositories, metadata standards, and cultural norms for that discipline. They are often formally trained as researchers and can therefore better relate to your target audience. In addition, these groups have the opportunity to promote replicable workflows and help researchers integrate best practices for data management into their everyday processes.

Data security offices, copyright/legal offices, & commercialization offices

Groups such as these are often overlooked by librarians looking to build a community of support around data management. Individuals in these offices may be able to provide invaluable expertise to your network, however. These groups contribute to and implement University security, data, and governance policies, and are knowledgeable about the legal implications of data sharing, especially related to sensitive data. Intellectual property rights, commercialization, and copyright are all complex topics that require expertise not often found among other data stewardship stakeholders. Partnering with experts can help reduce the potential for future problems, plus ensure data are shared to the fullest extent possible.

Library & institutional repository

The library is, of course, distinct from an institutional repository. However, often the institution’s library plays a key role in supporting, promoting, and often implementing the repository. I often remind researchers that librarians are experts in information, and data is one of many types of information. Researchers often underuse librarians and their specialized skills in metadata, curation, and preservation. The researchers’ need for a data repository and the strong link between repositories and librarians will change this in the coming years, however. Mohr and Lindsay ended with this simple statement, which nicely sums up their stellar presentation:

The data support village exists across levels and boundaries of the institution as well as across the lifecycle of data management.


Institutional Repositories: Part 1

If you aren’t a member of the library and archiving world, you probably aren’t aware of the phrase institutional repository (IR for short). I certainly wasn’t aware of IRs prior to joining the CDL, and I’m guessing most researchers are similarly ignorant. In the next two blog posts, I plan to first explain IRs, then lay out the case for their importance – nay, necessity – as part of the academic ecosphere. I should mention up front that although the IR’s inception focused on archiving traditional publications by researchers, I am speaking about them here as potential homes for preserving all scholarship, including data.

Academic libraries have a mission to archive scholarly work, including theses. These are at The Hive in Worcester, England. From Flickr by israelcsus.

If you read this blog, I’m sure you are aware that there is increased awareness about the importance of open science, open access to publications, data sharing, and reproducibility. Most of these concepts were easily accomplished in the olden days of pen-and-paper: you simply took great notes in your notebook, and shared that notebook as necessary with colleagues (this assumes, of course, geographic proximity and/or excellent mail systems). These days, that landscape has changed dramatically due to the increasingly computationally complex nature of research. Digital inputs and outputs of research might include software, spreadsheets, databases, images, websites, text-based corpuses, and more. But these “digital assets”, as the archival world might call them, are more difficult to store than a lab notebook. What does a virtual filing cabinet or file storage box look like that can house all of these different bits? In my opinion, it looks like an IR.

So what’s an IR?

An IR is a data repository run by an institution. Many of the large research universities have IRs. To name a few, Harvard has DASH, the University of California system has eScholarship and Merritt, Purdue has PURR, and MIT has DSpace. Many of these systems have been set up in the last 10 years or so to serve as archives for publications. For a great overview and history of IRs, check out this eHow article (which is surprisingly better than the relevant Wikipedia article).

So why haven’t more people heard of IRs? Mostly this is because there have never been any mandates or requirements for researchers to deposit their works in IRs. Some libraries take on this task themselves: for example, I found out a few years ago that the MBL-WHOI Library graciously stored open access copies of all of my publications for me in their IR. But more and more these “works” include digital assets that are not publications, and the burden of collecting all of the digital scholarship produced by an institution is a near-insurmountable task for a small group of librarians; there has to be either buy-in from researchers or mandates from the top.

The Case for IRs

I’m not the first one to recognize the importance of IRs. Back in 2002 the Scholarly Publishing and Academic Resources Coalition (SPARC) put out a position paper titled “The Case for Institutional Repositories” (see their website for more information). They defined an IR as having four major qualities:

  1. Institutionally defined,
  2. Scholarly,
  3. Cumulative and perpetual, and
  4. Open and interoperable.

Taking the point of view of the academic institution (rather than the researcher), the paper cited two roles that institutional repositories play for academic institutions:

  1. Reform scholarly communication – Reassert control over scholarship, reduce monopoly power of journals, and bring relevance to libraries
  2. Promote the university – Serve as an indicator of the university’s quality; showcase the university’s research; demonstrate public value and increase status.

In general, IRs are run by information professionals (e.g., librarians), who are experts at documenting, archiving, preserving, and generally curating information. All of those digital assets that we produce as researchers fit the bill perfectly.

As a researcher, you might not be convinced of the importance of IRs by the arguments above. Part of the indifference researchers may feel about IRs might have something to do with the existence of disciplinary repositories.

Disciplinary Repositories

There are many, many, many repositories out there for storing digital assets. To get a sense, check out re3data.org or databib.org and start browsing. Both of these websites are searchable databases for research data repositories. If you are a researcher, you probably know of at least one or two repositories for datasets in your field. For example, geneticists have GenBank, evolutionary biologists have TreeBase, ecologists have the KNB, and marine biologists have BCO-DMO. These are all examples of disciplinary repositories (DRs) for data. As any researcher who’s aware of these sites knows, you can both deposit and download data from these repositories, which makes them indispensable resources for their respective fields.
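
As a hypothetical illustration of how these registries can also be explored programmatically, here is a short Python sketch that queries re3data’s public registry and filters repositories by a keyword. The endpoint URL and the XML element names are assumptions on my part and may differ from the actual API:

```python
import requests
import xml.etree.ElementTree as ET

# Assumed endpoint for re3data's repository registry (returns XML).
RE3DATA_LIST = "https://www.re3data.org/api/v1/repositories"

resp = requests.get(RE3DATA_LIST, timeout=30)
resp.raise_for_status()

root = ET.fromstring(resp.content)
for repo in root.iter("repository"):
    # "repository" and "name" are assumed element names; adjust to the schema.
    name = repo.findtext("name", default="")
    if "ecolog" in name.lower():  # crude keyword filter, e.g. for ecology repositories
        print(name)
```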

So where should a researcher put data?

The short answer is both an IR and a DR. I’ll expand on this and make the case for IRs to researchers in the next blog post.


Libraries & the Future of Scholarly Communication at #BTPDF2

Let’s hope this doesn’t become the uniform of academic librarians. From allposters.com

Last week I attended the Beyond the PDF 2 Meeting, sponsored by FORCE11.  For those unaware of BTPDF2, it’s a spinoff event from the Beyond the PDF meeting, which took place in San Diego a few years back. BTPDF2 was a meeting of the minds for digital scholarship, with representatives from publishing, libraries, academia, software development, and everything in between. The room was full of heavy hitters and passionate advocates, with participant ages ranging from 19 to 70. The energy in the room was palpable, and was amplified by the amazing meeting space in Amsterdam.

There are plenty of ways to find out what happened at BTPDF2 (see a list of links below). In this post, I want to focus on the outcomes relevant to the stakeholders dear to my heart: librarians. Here I provide three observations related to libraries and the BTPDF2 meeting.

1. Missing Librarians

Lukas Koster, who works at the Library of the University of Amsterdam, wrote a terrific blog post about this topic titled Beyond the Library, where he summarized one of my first observations:

…any big changes in the way that scholarly communication is being carried out in the near and far future definitely affects the role of academic libraries… So I was surprised to see that the library representation at the conference was so low compared to researchers, publishers, students and tech/tools people.

There are many possible explanations for the dearth of librarians at BTPDF2; travel costs inevitably rise to the top. But what concerns me is that there wasn’t much action on the Twitter feed from the libraries, and in almost every conversation I had in which librarians were brought up, colleagues would say something to the effect of “Where are the librarians?” They were not only referring to the lack of librarians in Amsterdam; they were also asking the bigger question: Why haven’t libraries stepped up?

2. Librarians as both panacea and scapegoat

In discussions of stakeholder responsibilities and who should be leading the charge, librarians were mentioned repeatedly. They are at the center of the campus (sometimes physically as well as metaphorically), and can therefore facilitate discussions among IT, researchers, publishers, and administrators. The role of librarians has changed, regardless of the opinions of the librarians themselves. Publishers in attendance were among the most vocal in touting the library’s role in the future of scholarly communication: this is the community with which publishers primarily interact, and they clearly believed that it was the library’s responsibility to convey the needs of the researchers and their institutions.

But what about the actual handling of digital objects, creation of metadata, et cetera? During one discussion involving who should take on what responsibility in this space, one attendee said “Libraries are good at storing data. That’s what they do.” I think this would be news to many librarians.

3. Libraries are not promoting themselves

One prominent startup developer made a statement while on stage: while he was a researcher, he (1) never went to the library, (2) didn’t know about the institutional repository available to him, (3) wasn’t aware the library could help him with data, and (4) assumed librarians’ primary role was to “ensure researchers had access to online journals”, which he accessed daily. He then went on to state that libraries should be running themselves more like businesses: determine what services are needed and the most cost-effective way to deliver them.

I wish I could say I disagree with him, or that he does not represent the majority of researchers; I can’t. I would have made those same statements 3 years ago, before I started working with DataONE. Even more upsetting? Some librarians are not willing to swallow this information and rectify the situation. As one example, a senior librarian who shall go unnamed once said to me “No one is coming to me and asking for help with data or any of this stuff. Until they do, I’m going to continue doing what I’ve been doing for years”. Ouch. That’s a short path to irrelevance.

Next week I’ll post a bit more about other outcomes from BTPDF2, but suffice it to say that libraries have some work to do…

BTPDF2 Link Roundup:


A Potpourri of DC Meetings

I’ve been in our nation’s capital since Sunday for three meetings, all while battling a particularly tenacious cold.  I’m using this post as a debrief, as well as to tell you about a few nifty projects.

First, the University of North Texas folks put on a symposium about the DataRes Project.  UNT librarians are quite the players in the data curation landscape these days – check out their website Data Management @UNT for more information. The DataRes Project is funded by the IMLS and “investigates how the library and information science (LIS) profession can best respond to emerging needs of research data management in universities.” Although I’ve only been involved with libraries since 2011, I’m pretty darn excited about the role that libraries are poised to play in data management.  Sounds like UNT agrees!

The second meeting was the Coalition for Networked Information 2012 Fall Members Meeting.  The Coalition for Networked Information (CNI) is an institutional membership organization, with members that include universities, publishers, libraries, IT companies, governmental folks, and others.  These groups have a common interest in figuring out ways to facilitate communication, collaboration, and innovation in information management. I presented on the DMPTool, which was greeted with excitement by members of the audience. I also attended quite a few “project briefings” (i.e., sessions), wherein I heard about other interesting goings-on in the world of information.

The briefing I enjoyed most was about FORCE11. It’s all caps because it’s an acronym: the Future of Research Communications and e-Scholarship. The “11” is because the group was founded in 2011.  FORCE11 is a “virtual community working to transform scholarly communications toward improved knowledge creation and sharing.” I plan to join up with this group for their meeting Beyond the PDF 2 in March. Stay tuned for more on that group – I think they have the potential to really shape the future of scholarly communication.

The third and final meeting this week is still going on – the E-Science Institute. I blogged about E-Science last week, so I won’t go into detail on that aspect of the meeting.  But the basic idea is that libraries attend this meeting to think about ways to shape their “Strategic Agenda” for supporting science in this age of digital, big, complicated data and analyses. You can see how this might fit in with the DataRes project.  I like the idea of empowering libraries to take on all things data!

I could get some serious studying done here. The Library of Congress Reading Room. From Flickr by shoupiest


A brief thought: What is E-Science?

I’m not sure when I first heard the term “E-Science”, but it wasn’t that long ago. My first impression was that it sounds like one of those words that should be unsucked (i.e., jargon). Now that I know more about it, I’m inclined to think that jargon is in the ear of the beholder. Here’s why:

The most commonly used definition of E-Science is that it is a type of scientific research that uses large-scale computing infrastructure to process very large datasets (i.e., “Big Science”, which generates “Big Data”).  However, I more often hear E-Science used as an umbrella term describing science of any size that involves digital data and/or analysis.  These days, that pretty much covers all science.  I therefore contend that E-Science as a phrase is redundant – it was describing what used to be a subset of science, but now more correctly describes all science. So why is there an “E” at all?

There are journals, websites, and meetings focused on E-Science (I blogged about attending the Microsoft eScience Workshop just a few months ago). In fact, I’m currently participating in an E-Science Institute, sponsored by the Association of Research Libraries, the Digital Library Federation, and DuraSpace.  The goal of the Institute is to provide opportunities for “academic and research libraries to boost institutional support of e-research and the management and preservation of our scientific and scholarly record.” Libraries are facing the new digital frontier head-on: they are interested in providing services that meet researchers’ needs, and these services have changed dramatically in the last few decades.

The argument for keeping the “E”: Although science researchers have no need for the distinction between Science and E-Science, it is a helpful distinction for groups that provide services to academia at large. Not all disciplines are as digital as the sciences: think about art history, studies of ancient texts, or observations of other cultures. Those groups that provide services or assistance for the broader academic community should, therefore, continue to consider E-Science.

Perhaps one day emails will just be mails… And Banksy will return them. From Flickr by Bruno Girin (More on Banksy: http://en.wikipedia.org/wiki/Banksy)

Some readings, recommended by the E-Science Institute organizers (and me!):

  • Jim Gray on e-Science, A Transformed Scientific Method, from The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey et al., Microsoft Research, 2009.  Link
  • E-Science and the Life Cycle of Research, Charles Humphrey.  June, 2008.  Link
  • Special Online Collection: Dealing with Data, Science Magazine, AAAS.  February 11, 2011.  Link (free registration available)