Category Archives: Archiving Data

Ensuring access to critical research data

For the last two months, UC3 has been working with the teams at Data.gov, Data Refuge, Internet Archive, and Code for Science (creators of the Dat Project) to aggregate government data.

Data that spans the globe

There are currently volunteers across the country working to discover and preserve publicly funded research, especially climate data, so that it is not deleted or lost from the public record. The largest initiative is called Data Refuge and is led by librarians and scientists. They are holding events across the UC campuses and the US that you can attend to help out in person, and they are organizing the library community to band together to curate the data and ensure it’s preserved and accessible.

Our initiative builds on this work and aims to build a corpus of government data and corresponding metadata. We are focusing on public research data, especially data at risk of disappearing. The initiative was nicknamed “Svalbard” by Max Ogden of the Dat project, after the Svalbard Global Seed Vault in the Arctic. As of today, our friends at Code for Science have released 38GB of metadata: over 30 million hashes and URLs of research data files.

The Svalbard Global Seed Vault in the Arctic

To aid in this effort

We have assembled the following metadata as part of the Code for Science’s Svalbard v1:

  • 2.7 million SHA-256 hashes for all downloadable resources linked from Data.gov, representing around 40TB of data
  • 29 million SHA-1 hashes of files archived by the Internet Archive and the Archive Team from federal websites and FTP servers, representing over 120TB of data
  • All metadata from Data.gov, about 2.1 million datasets
  • A list of ~750 .gov and .mil FTP servers

There are additional sources, such as Archivers.Space, EDGI, Climate Mirror, and Azimuth Data Backup, for which we are working on adding metadata in future releases.

Following the principles set forth by the librarians behind Data Refuge, we believe it’s important to establish a clear and trustworthy chain of custody for research datasets so that mirror copies can be trusted. With this project, we are working to curate metadata that includes strong cryptographic hashes of data files in addition to metadata that can be used to reproduce a download procedure from the originating host.
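As a rough illustration of how a published hash lets anyone check a mirror copy, here is a minimal Python sketch. It assumes you already have a local copy of a file and the SHA-256 value published for it; the function names are hypothetical and are not part of any Svalbard or Dat tooling.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that very large datasets do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """True if the local copy hashes to the published value.
    The published value is normalized (whitespace, letter case) first."""
    return sha256_of_file(path) == published_hex.strip().lower()
```

If the check fails, the mirror copy differs from the file that was originally hashed, either because the source changed or because the copy was corrupted or altered along the way.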

We are hoping the community can use this data in the following ways:

  • To independently verify that the mirroring processes that produced these hashes can be reproduced
  • To aid in developing new forms of redundant dataset distribution (such as peer to peer networks)
  • To seed additional web crawls or scraping efforts with additional dataset source URLs
  • To encourage other archiving efforts to publish their metadata in an easily accessible format
  • To cross reference data across archives, for deduplication or verification purposes
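The cross-referencing use case in the list above could look something like the following Python sketch. The record layout (a hex digest paired with a source URL) and the function names are assumptions made for illustration; the actual release format may differ. Note that hashes are only comparable when both archives used the same algorithm: in this release the Data.gov hashes are SHA-256 and the Internet Archive hashes are SHA-1, so comparing those two sets directly would require rehashing one side.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (hex digest, source URL)
Record = Tuple[str, str]


def index_by_hash(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Group source URLs by digest; one digest may appear at several URLs."""
    index: Dict[str, List[str]] = {}
    for digest, url in records:
        index.setdefault(digest.lower(), []).append(url)
    return index


def cross_reference(
    archive_a: Iterable[Record], archive_b: Iterable[Record]
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return digests present in both archives, with the URLs from each side.
    Both archives must have been hashed with the same algorithm."""
    index_a = index_by_hash(archive_a)
    index_b = index_by_hash(archive_b)
    shared = sorted(index_a.keys() & index_b.keys())
    return {digest: (index_a[digest], index_b[digest]) for digest in shared}
```

Digests that appear in both archives point to byte-identical files, which is useful both for deduplication and for confirming that two independent mirrors captured the same content.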

What about the data?

The metadata is great, but the initial release of 30 million hashes and URLs is just part of our project. The actual content (from which the hashes were derived) has also been downloaded. It is stored either at the Internet Archive or on our California Digital Library servers.

The Dat Project carried out a Data.gov HTTP mirror (~40TB) and uploaded it to our servers at California Digital Library. We are working with them to access ~160TB of data in the future and have partnered with UC Riverside to offer longer term storage.

Download

You can download the metadata here using Dat Desktop or the Dat CLI tool. We are using the Dat Protocol for distribution so that we can publish new metadata releases efficiently while still keeping the old versions around. Dat provides a secure cryptographic ledger, similar in concept to a blockchain, that can verify the integrity of updates.
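To give a feel for what “a cryptographic ledger that can verify the integrity of updates” means in general, here is a toy append-only hash chain in Python. This is not Dat’s actual data structure or API; all names are hypothetical, and the sketch only shows the general idea that each entry’s hash depends on everything before it, so older versions cannot be quietly rewritten.

```python
import hashlib
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(previous_hash: str, payload: str) -> str:
    """Hash an entry together with the hash of the entry before it."""
    material = (previous_hash + "\n" + payload).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def append(log: List[Dict[str, str]], payload: str) -> None:
    """Append a new release to the log, chaining it to the previous entry."""
    previous = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "hash": entry_hash(previous, payload)})


def verify(log: List[Dict[str, str]]) -> bool:
    """Recompute every hash from the start; any edited entry breaks the chain."""
    previous = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(previous, entry["payload"]):
            return False
        previous = entry["hash"]
    return True
```

In this toy version, changing the payload of any earlier release makes `verify` return `False`, which is the property that lets readers trust old versions after new ones are published.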

Feedback

If you want to learn more about how CDL and the UC3 team are involved, contact us at uc3@ucop.edu or @UC3CDL. If you have suggestions or questions, you can join the Code for Science Community Chat. And if you are a technical user, you can report issues or get involved at the Svalbard GitHub.

This is crossposted here: https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab#.f933mmts8

Government Data At Risk

Government data is at risk, but that is nothing new.  

The existence of Data.gov, the Federal Open Data Policy, and open government data belies the fact that, historically, a vast amount of government data and digital information is at risk of disappearing in the transition between presidential administrations. For example, between 2008 and 2012, over 80 percent of the PDFs hosted on .gov domains disappeared. To track these and other changes, California Digital Library (CDL) joined with the University of North Texas, the Library of Congress, the Internet Archive, and the U.S. Government Publishing Office to create the End of Term (EOT) Archive. After archiving the web presence of federal agencies in 2008 and 2012, the team initiated a new crawl in September of 2016.

In light of recent events, tools and infrastructure initially developed for EOT and other projects have been taken up by efforts to back up “at risk” datasets, including those related to the environment, climate change, and social justice. Data Refuge, coordinated by the Penn Program in Environmental Humanities (PPEH), has organized a series of “Data Rescue” events across the country where volunteers nominate webpages for submission to the End of Term Archive and harvest “uncrawlable” data to be bagged and submitted to an open data archive. Efforts such as the Azimuth Climate Data Backup Project and Climate Mirror do not involve submitting data or information directly to the End of Term Archive, but have similar aims and workflows.

These efforts are great for raising awareness and building back-ups of key collections. In the background, CDL and the team behind the Dat Project have worked to back up Data.gov itself. The goal is to preserve not only the datasets catalogued by Data.gov but also the associated metadata and organization that make it such a useful location for finding and using government data. As a result of this partnership, for the first time ever, the entire Data.gov metadata catalog of over 2 million datasets will soon be available for bulk download. This will allow the various backup efforts to coordinate and cross reference their data sets with those on Data.gov. To allow for further coordination and cross referencing, the Dat team has also begun acquiring the metadata for all the files acquired by Data Refuge, the Azimuth Climate Data Project, and Climate Mirror.

In an effort to keep track of all these efforts to preserve government data and information, we’re maintaining the following annotated list. As new efforts emerge or existing efforts broaden or change their focus, we’ll make sure the list is updated. Feel free to send additional info on government data projects to: uc3@ucop.edu

Get involved: Ongoing Efforts to Preserve Scientific Data or Support Science

Data.gov – The home of the U.S. Government’s open data, much of which is non-biological and non-environmental. Data.gov has a lightweight system for reporting and tracking datasets that aren’t represented and functions as a single point of discovery for federal data. Newly archived data can and should be reported there. CDL and the Dat team are currently working to back up the data catalogued on Data.gov and also the associated metadata.

End of Term – A collaborative project to capture and save U.S. Government websites at the end of presidential administrations. The initial partners in EOT included CDL, the Internet Archive, the Library of Congress, the University of North Texas, and the U.S. Government Publishing Office. Volunteers at many Data Rescue events use the URL nomination and BagIt/Bagger tools developed as part of the EOT project.

Data Refuge – A collaborative effort that aims to back up research-quality copies of federal climate and environmental data, advocate for environmental literacy, and build a consortium of research libraries to scale their tools and practices to make copies of other kinds of federal data. Find a Data Rescue event near you.

Azimuth Climate Data Backup Project – An urgent project to back up US government climate databases. Initially started by statistician Jan Galkowski and John Baez, a mathematician and science blogger at UC Riverside.

Climate Mirror – A distributed volunteer effort to mirror and back up U.S. Federal Climate Data. This project is currently being led by Data Refuge.

The Environmental Data and Governance Initiative – An international network of academics and non-profits that addresses potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate, inform, and enforce. EDGI has built many of the tools used at Data Rescue events.

March for Science – A celebration of science and a call to support and safeguard the scientific community. The main march in Washington DC and satellite marches around the world are scheduled for April 22nd (Earth Day).

314 Action – A nonprofit that intends to leverage the goals and values of the greater science, technology, engineering, and mathematics community to aggressively advocate for science.


USING AMAZON S3 AND GLACIER FOR MERRITT- An Update

The integration of the Merritt repository with Amazon’s S3 and Glacier cloud storage services, previously described in an August 16 post on the Data Pub blog, is now mostly complete. The new Amazon storage supplements Merritt’s longstanding reliance on UC private cloud offerings at UCLA and UCSD. Content tagged for public access is now routed to S3 for primary storage, with automatic replication to UCSD and UCLA. Private content is routed first to UCSD, and then replicated to UCLA and Glacier. Content is served for retrieval from the primary storage location; in the unlikely event of a failure, Merritt automatically retries from secondary UCSD or UCLA storage. Glacier, which provides near-line storage with four-hour retrieval latency, is not used to respond to user-initiated retrieval requests.

Content Type   Primary Storage   Secondary Storage   Primary Retrieval   Secondary Retrieval
Public         S3                UCSD, UCLA          S3                  UCSD, UCLA
Private        UCSD              UCLA, Glacier       UCSD                UCLA
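The routing and retrieval rules in the table above can be summarized in a short Python sketch. This is not Merritt’s actual code: the dictionary layout, the function names, and the `fetch` callback are all hypothetical, and the sketch only mirrors the behavior described in this post (write to the primary location, replicate to the secondaries, retry retrieval from secondary UC storage, never retrieve from Glacier automatically).

```python
from typing import Callable, Dict, List, Optional

# Storage and retrieval locations per content type, as described in the post.
ROUTING: Dict[str, Dict[str, List[str]]] = {
    "public": {
        "storage": ["S3", "UCSD", "UCLA"],       # primary first, then replicas
        "retrieval": ["S3", "UCSD", "UCLA"],     # primary first, then fallbacks
    },
    "private": {
        "storage": ["UCSD", "UCLA", "Glacier"],
        "retrieval": ["UCSD", "UCLA"],           # Glacier is never tried automatically
    },
}


def storage_targets(content_type: str) -> List[str]:
    """Locations a new object is written to: primary, then replicas."""
    return list(ROUTING[content_type]["storage"])


def retrieve(
    content_type: str, fetch: Callable[[str], Optional[bytes]]
) -> Optional[bytes]:
    """Try each retrieval location in order; `fetch` returns None on failure."""
    for location in ROUTING[content_type]["retrieval"]:
        data = fetch(location)
        if data is not None:
            return data
    return None
```

Keeping the rules in a single table-like structure makes it easy to see at a glance that Glacier holds a copy of private content but is absent from every retrieval path.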

In preparation for this integration, all retrospective public content, over 1.1 million objects and 3 TB, was copied from UCSD to S3, a process that took about six days to complete. A similar move from UCSD to Glacier is now underway for the much larger corpus of private content, 1.5 million objects and 71 TB, which is expected to take about five weeks to complete.
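For a sense of scale, the figures above imply the following average transfer rates. This is a back-of-the-envelope Python calculation that assumes decimal terabytes (10^12 bytes), a continuous transfer, and five weeks meaning 35 days; the post itself only gives approximate durations.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def average_rate_mb_per_s(terabytes: float, days: float) -> float:
    """Average sustained transfer rate in megabytes per second."""
    total_bytes = terabytes * BYTES_PER_TB
    total_seconds = days * SECONDS_PER_DAY
    return total_bytes / total_seconds / 10**6


# Public content: 3 TB copied to S3 in about six days.
public_rate = average_rate_mb_per_s(3, 6)

# Private content: 71 TB expected to reach Glacier in about five weeks.
private_rate = average_rate_mb_per_s(71, 35)
```

Under these assumptions the public copy averaged roughly 5.8 MB/s, while the private move would need to average roughly 23.5 MB/s to finish on schedule.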

The Merritt-Amazon integration enables more optimized internal workflows and increased levels of reliability and preservation assurance. It also holds the promise of lowering overall storage costs, and thus, the recharge price of Merritt for our campus customers.  Amazon has, for example, recently announced significant price reductions for S3 and Glacier storage capacity, although their transactional fees remain unchanged.  Once the long-term impact of S3 and Glacier pricing on Merritt costs is understood, CDL will be able to revise Merritt pricing appropriately.

CDL is also investigating the possible use of the Oracle archive cloud as a lower-cost alternative, or supplement, to Glacier for dark archival content hosting. While it offers similar functionality to Glacier, including four-hour retrieval latency, Oracle’s price point is about one quarter of Glacier’s for storage capacity.

UC3 to Explore Amazon S3 and Glacier Use for Merritt Storage

The UC Curation Center (UC3) has offered innovative digital content access and preservation services to the UC community for over six years through its Merritt repository.  Merritt was developed by UC3 to address unique needs for high-quality curation services at scale and a low price point.   Recently, UC3 started looking into Amazon’s S3 and Glacier cloud storage products as a way to address cost concerns, fine-tune reliability issues, increase service options, and keep pace with ever-increasing scale in the volume, variety, and velocity of new content contributions.

The current Merritt pricing model, in effect since July 1, 2015, is based on recovering the costs of storage use, currently totaling over 73 TB contributed from all 10 UC campuses.  This content is now being replicated in UC private clouds supported by UCLA and UCSD.   Since the closure earlier this year of the UCOP data center, the computational processes underlying Merritt, along with all other CDL services, have been moved to virtual machines in the Amazon AWS cloud.  Collocating storage alongside this computational presence in AWS will provide increased data transfer throughput during Merritt deposit and retrieval.  In addition, the integration of online S3 with near-line Glacier storage offers opportunities to lower storage costs by moving archival materials with no expectation of direct end-user access to Glacier.  The cost for Glacier storage is about one quarter of that for S3, which is comparable with UCLA and UCSD pricing.  Of course, the additional dispersed replication of Merritt-managed data in AWS will also increase overall reliability and long-term preservation assurance.

The integration of S3 and Glacier will supplement Merritt’s existing use of UC storage.  Merritt’s storage function acts as a broker that automatically routes submitted content to the appropriate storage location based on its curatorially-defined access characteristics.  Once Amazon storage has been added to Merritt, content tagged for public access will be routed to S3 for primary storage, from which it will be automatically replicated to a UC cloud.  Retrieval requests for this content will be served from the S3 copy; should these requests fail (for example, if S3 is temporarily non-responsive), Merritt automatically retries from its secondary copy.

The path for content tagged for private access is somewhat different.  It is initially routed to S3 for temporary storage until the replication to a UC cloud completes.  The content is then moved into Glacier for permanent low-cost primary storage.  Retrieval requests will be served from the UC cloud.  In the unlikely event that this retrieval doesn’t succeed, there is no automatic retry from Glacier, since Glacier, while inexpensive for static storage, is costly for systematic retrieval.  UC3 staff can, however, intervene manually to retrieve from Glacier if it becomes necessary.  In the case of both public and private access, the digital content will continue to be managed with at least five copies spread across independent storage infrastructures and data centers.

The integration of Amazon S3 and Glacier into Merritt’s storage architecture will increase overall reliability and performance, while possibly leading to future reduction in costs.  Once the integration is complete, UC3 will monitor AWS storage usage and associated costs through the end of the current Merritt service year on June 30, 2017, to determine the impact on Merritt pricing.


Lit Review: #PLOSFail and Data Sharing Drama

Turn and face the strange, researchers. From pipedreamsfromtheshire.wordpress.com


I know what you’re thinking: how can yet another post on the #PLOSfail hoopla say anything new? Fear not. I say nothing particularly new here, but I do offer a three-weeks-out lit review of the hoopla, in hopes of finding a pattern in the noise. For those new to the #PLOSFail drama, the short version is this: PLOS enacted a mandatory data sharing policy. Researchers flipped out. See the sources at the end of this post for more background.

 Arguments made against data sharing

1) My data is my lifeblood. I won’t just give it away.

Terry McGlynn, a biologist writing at Small Pond Science, argues that “Regardless of the trajectory of open science, the fact remains that, at the moment, we are conducting research in a culture of data ownership.” Putting the ownership issue aside for now, let’s focus on the crux of McGlynn’s argument: he contends that data sharing results in turning a private resource (data) into a community resource. This is especially burdensome for small labs (like his) since each data point takes relatively more effort to produce. If this resource is available to anyone, the benefits to the former owner are greatly reduced since they are now shared with the broader community.

Although these are valid concerns, they are not in the best interest of science. I argue that what we are really talking about here is the incentive problem (see more in the section below). That is, publications are valued in performance evaluation of academics, while data are not. Everyone can agree that data is indispensable to scientific advancement, so why hasn’t the incentive structure caught up yet? If McGlynn were able to offset the loss of benefits caused by data sharing by getting mad props for making his data available and useful, this issue would be less problematic. Jeff Leek, a biostatistician blogging at Simply Statistics, makes a great point with regard to this: to paraphrase him, the culture of credit hasn’t caught up with the culture of science. There is no appropriate form of credit for data generators – it’s either citation (seems chintzy) or authorship (not always appropriate). Solution: improve incentives for data sharing. Find a way to appropriately credit data producers.

2) My datasets are special, unique snowflakes. You can’t understand/use them.

Let’s examine what McGlynn says about this with regard to researchers re-using his data: “…anybody working on these questions wouldn’t want the raw data anyway, and there’s no way these particular data would be useful in anybody’s meta analysis. It’d be a huge waste of my time.”

Rather than try to come up with a new, witty way to answer this argument, I’ll shamelessly quote from the MacManes Lab blog post, Corner cases and the PLOS data policy:

 There are other objections – one type is the ‘my raw data are so damn special that nobody can over make sense of them’, while another is ‘I use special software and stuff, so they are probably not useful to anybody else’. I call BS on both of these arguments. Maybe you have the world’s most complicated data, but why not release them and not worry about whether or not people find them useful – that is not your concern (though it should be).

I couldn’t have said it better. The snowflake refrain from researchers is not new. I’ve heard it time and again when talking to them about data archiving. There is certainly truth to this argument: most (all?) datasets are unique. Why else would we be collecting data? This doesn’t make them useless to others, especially if we are sharing data to promote reproducibility of reported results.

DrugMonkey, an anonymous blogger and biomedical researcher, took this “my data are unique” argument to paranoia level. In their post, PLoS is letting the inmates run the asylum and it will kill them, s/he contends that researchers will somehow be forced to use all the same methods to facilitate data reuse. “…diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that. So we’ll have PLoS [sic] inserting itself in the role of how experiments are to be conducted and interpreted!”

I imagine DrugMonkey pictures future scientists in grey overalls, trudging to a factory to do “science”. This is just ridiculous. The idiosyncrasies of how individual researchers handle their data will always be part of the challenge of reproducibility and data curation. But I have never (ever) heard of anyone suggesting that all researchers in a given field should be doing science in the exact same way. There are certainly best practices for handling datasets. If everyone followed these to the best of their ability, we would have an easier time reusing data. But no one is punching a time card at the factory.

 3) Data sharing is hard | time-consuming | new-fangled.

This should probably be #1 in the list of arguments from researchers. Even for those who cite other reasons for not sharing their data, this is probably at the root of the hoarding. Full disclosure – only a small portion of the datasets I have generated as a researcher are available to the public. The only explanation is that it’s time-consuming and I have other things on my plate. So I hear you, researchers. That said, the time has come to start sharing.

DrugMonkey says that the PLOS data policy requires much additional data curation which will take time. “The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time…” McGlynn states this point succinctly: “Why am I sour on required data archiving? Well, for starters, it is more work for me… To get these numbers into a downloadable and understandable condition would be, frankly, an annoying pain in the ass.”

Fair enough. But I argue here (along with others) that making data available is not an optional side note of research: it is research. In the comments of David Crotty’s post at The Scholarly Kitchen, “PLOS’ bold data policy“, there was a comment that I loved. The commenter, Mike Taylor, said this:

 …data curation is research. I’d argue that a researcher who doesn’t make available the data necessary to reproduce his conclusions isn’t getting his job done. Complaining about having to spend time on preparing the data for others to use is like complaining about having to spend time writing the paper, or indeed running experiments.

When I read that comment, I might have fist pumped a little. Of course, we still have that pesky incentive issue to work out… As Crotty puts it, “Perhaps the biggest practical problem with [data sharing] is that it puts an additional time and effort burden on already time-short, over-burdened researchers. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.” Sigh.

What about that “new-fangled” bit? Well, researchers often complain that data management and curation requires skills that are not taught. I 100% agree with this statement – see my paper on the lack of data management education for even undergrads. But as my ex-cop dad likes to say, “ignorance of the law is not a defense”. In continuation of my shameless quoting from others, here’s what Ted Hart (Staff Scientist at NEON) has to say in his post, “Just Get Over Yourself and Share Your Data“:

Sharing is hard. but not an intractable problem… Is the alternative is that everyone just does everything in secret with myriad idiosyncrasies ferociously milking least publishable units from a data set? That just seems like a recipe for science moving slowly and in the dark. …I think we just need to own up to the fact being a scientist these days requires new skills, and it always have. You didn’t have to know how to do PCR prior to 1983, but now you do. In the 21st century to do science better, we need more than spreadsheets with a few rows, we need to implement best practices for data management.

More fist pumping! No, things won’t change overnight. Leek at Simply Statistics rightly stated that the transition to open data will be rough for two reasons: (1) there is no education on data handling, and (2) there is a disconnect between the incentives for individual researchers and the actions that will benefit science as a whole. Sigh. Back to that incentive issue again.

Highlights & Takeaways

At risk of making this blog post way too long, I want to showcase a few highlights and takeaways from my deep dive into the #PLOSfail blogging world.

1) The Incentives Problem

We have a big incentives problem, which was probably obvious from my repeated mentions of it above. What’s good for researchers’ careers is not conducive to data sharing. If we expect behavior to change, we need to work on giving appropriate credit where it’s due.

Biologist Björn Brembs puts it well in his post, What is the Difference Between Text, Data, and Code?: “…it is unrealistic to expect tenure committees and grant evaluators to assess software and data contributions before anybody even is contributing and sharing data or code.” Yes, there is a bit of a chicken-and-egg situation. We need movement on both sides to get somewhere. Share the data, and they will start to recognize it.

2) Empiricism Versus Theory

There is a second plot line to the data sharing rants: empiricists versus theoreticians. See ecologist Timothée Poisot‘s blog, “Of the value of datasets and methods in open science” for a more extensive review of this issue as it relates to data sharing. Of course, this tension is not a new debate in science. But terms like “data vultures” get thrown about, and feelings get hurt. Due to the nature of their work, most theoreticians’ “data” is equations, methods, and code that are shared via publication. Meanwhile, empiricists generate data and can hoard it until they see fit to share it, only offering a glimpse of the entire suite of their research outputs. To paraphrase Hart again: science is equal parts data and analysis/methods. We need both, so let’s stop fighting and encourage open science all around.

3) Data Ownership Issues

There are lots of potential data owners: the funders who paid for the work, the institution where the research was performed, the researcher who collected the data, the principal investigator of the lab where the researcher works, etc. The complications around data ownership make this a tricky subject to work out. Zen Faulkes, a neurobiologist at University of Texas, blogged about who owns data, in particular, his data. He did a little research and found what many (most?) researchers at universities might find: “I do not own research data I generate. Neither do the funding agencies. The University of Texas system Board of Regents own research data I generate.” Faulkes goes on to state that the regents probably don’t care what he does with his data unless/until they can make money off of it… very true. To make things more complicated, Crotty over at Scholarly Kitchen reminded us that “under US law (the Bayh-Dole Act), the intellectual property (IP) generated as the result of federal research funds belongs to the researcher and their institution.” What does that even mean?!

To me, the issue is not about who owns the data outright. Instead, it’s about my role as an open science “waccaloon” who is interested in what’s best for the scientific process. To that extent, I am going to borrow from Hart again. Hart makes a comparison between having data and having a pet: in Boulder CO, there are no pet “owners” – only pet “guardians”. We can think of our data in this same way: we don’t own it; we simply care for it, love it, and are intellectually (and sometimes emotionally!) invested in it.

4) PLOS is Part of a Much Bigger Movement

Open science mandates are already here. The OSTP memo released last year is a huge leap forward in this direction – it requires that federally funded research outputs (including data) be made available to the public. Crotty draws a link between OSTP and PLOS policies in his blog: “Once this policy goes into effect, PLOS’ requirements would seem to be an afterthought for authors funded in this manner. The problem is that the OSTP policy seems nowhere near being implemented.”

That last part is most definitely true. One way to work on implementing this policy? Get the journals involved. The current incentive structure is not well-suited for ensuring compliance with OSTP, but journals have a role as gatekeepers to the traditional incentives. Crotty states it this way:

PLOS has never been a risk averse organization, and this policy would seem to fit well with their ethos of championing access and openness as keys to scientific progress. Even if one suspects this policy is premature and too blunt an instrument, one still has to respect PLOS for remaining true to their stated goals.

So I say kudos to PLOS!

In Conclusion…

I’ll end with a quote from the MacManes Lab blog post:

How about this, make an honest effort to make the data accessible and useful to others, and chances are you’re probably good to go.

Final fist pump.

Sources

  1. Timothée Poisot, Ecologist. Of the value of datasets and methods in open science.
  2. Terry McGlynn, Biologist. I own my data until I don’t. Blog at Small Pond Science @hormiga
  3. David Crotty, publisher & former researcher. PLOS’ bold data policy Blog at The Scholarly Kitchen @scholarlykitchn
  4. Edmund Hart, Staff Scientist at NEON. Just Get Over Yourself and Share Your Data. @DistribEcology
  5. MacManes Lab, genomics. Corner cases and the PLOS data policy.
  6. DrugMonkey, biomedical research. PLoS is letting the inmates run the asylum and it will kill them. @DrugMonkey
  7. Zen Faulkes, Neurobiologist. Who owns data. Blog at NeuroDojo @DoctorZen
  8. Björn Brembs, biologist. What is the Difference Between Text, Data, and Code? @brembs
  9. Jeff Leek, biostatistician. PLoS One, I have an idea for what to do with all your profits: buy hard drives Blog at Simply Statistics. @leekgroup

Twitter feed for #PLOSfail

From PLOS

Finding Disciplinary Data Repositories with DataBib and re3data

This post is by Natsuko Nicholls and John Kratz.  Natsuko is a CLIR/DLF Postdoctoral Fellow in Data Curation for the Sciences and Social Sciences at the University of Michigan.

The problem: finding a repository

Everyone tells researchers not to abandon their data on a departmental server, hard drive, USB stick, CD-ROM, stack of Zip disks, or quipu: put it in a repository! But most researchers don’t know what repository might be appropriate for their data. If your organization has an Institutional Repository (IR), that’s one good home for the data. However, not everyone has access to an IR, and data in IRs can be difficult for others to discover, so it’s important to consider the other major (and not mutually exclusive!) option: deposit in a Disciplinary Repository (DR).

Many disciplinary repositories exist to handle data from a particular field or of a particular type (e.g. WormBase cares about nematode biology, while GenBank takes only DNA sequences). Some may be asking if the co-existence of IRs and DRs means competition or is mutually beneficial to both universities and research communities, some may be wondering how many repositories are out there for archiving digital assets, but most librarians and researchers just want to find an appropriate repository in a sea of choices.

For those involved in assisting researchers with data management, helping to find the right place to put data for sharing and preservation has become a crucial part of data services. This is certainly true at the University of Michigan—during a recent data management workshop for faculty, faculty members expressed their interest in receiving more guidance on disciplinary repositories from librarians.

The help: directories of data repositories

Fortunately, there is help to be found in the form of repository directories.  The Open Access Directory maintains a subdirectory of data repositories.  In the Life Sciences, BioSharing collects data policies, standards, and repositories.  Here, we’ll be looking at two large directories that list repositories from any discipline: DataBib and the REgistry of REsearch data REpositories (re3data.org).

DataBib originated in a partnership between Purdue and Penn State University, and it’s hosted by Purdue. The 600 repositories in DataBib are each placed in a single discipline-level category and tagged with more detailed descriptors of the contents.

re3data.org, which is sponsored by the German Research Foundation, started indexing relatively recently, in 2012, but it already lists 628 repositories.  Unlike DataBib, repositories aren’t assigned to a single category, but instead tagged with subjects, content types, and keywords.  Last November, re3data and BioSharing agreed to share records.  re3data is more completely described in this paper.

Given the similar number of repositories listed in DataBib and re3data, one might expect that their contents would be roughly similar and conclude that there are something around 600 operating DRs.  To test this possibility and get a better sense of the DR landscape, we examined the contents of both directories.

The question: how different are DataBib and re3data?

Repository overlap is only 19%

Contrary to expectation, there is little overlap between the two directories.  At least 1,037 disciplinary data repositories currently exist, and only 18% (191) are listed in both.  That’s a lot of repositories to search for the one right place to put data, and the count leaves out most IRs: with few exceptions, IRs are not listed in re3data or DataBib (you can find a long list of academic open access repositories).  Of the repositories in both directories, a majority (72%) are categorized into STEM fields. Below is a breakdown of the overlap by discipline (as assigned by DataBib).

CrossoverRepositories
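
The overlap figures above follow from simple set arithmetic, and a minimal sketch may make the relationship between the numbers easier to check. The counts (600, 628, 191) come from this post; the function names and the name-normalization step are assumptions for illustration, since the post does not say how entries were matched between the two directories, and the example listings are hypothetical.

```python
def overlap_summary(size_a: int, size_b: int, shared: int) -> dict:
    """Combine two directory sizes by inclusion-exclusion."""
    union = size_a + size_b - shared  # each shared entry is counted once
    return {
        "union": union,
        "only_a": size_a - shared,
        "only_b": size_b - shared,
        "shared_pct": round(100 * shared / union),
    }


def normalize(name: str) -> str:
    """Assumed matching rule: ignore case and extra whitespace."""
    return " ".join(name.lower().split())


def shared_entries(listing_a, listing_b) -> set:
    """Return the normalized names that appear in both listings."""
    return {normalize(n) for n in listing_a} & {normalize(n) for n in listing_b}


# Counts reported in this post: DataBib (600), re3data (628), listed in both (191).
summary = overlap_summary(600, 628, 191)

# Hypothetical listings, only to show the matching step.
example = shared_entries(
    ["GenBank", "TreeBASE", "WormBase"],
    ["genbank ", "KNB", "Wormbase"],
)
```

With the reported counts, the union works out to 1,037 repositories, which matches the total given above. Real directory records would likely need fuzzier matching (alternate names, URLs) than the simple normalization assumed here.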

Another way of characterizing the repository collections in re3data and DataBib is by each repository’s host country. In re3data, the top three contributing countries (US 36%, Germany 15%, UK 12%) form the majority, whereas in DataBib 58% of repositories are hosted by the US, followed by the UK (12%) and Canada (7%). This finding may not be too surprising, since re3data is based in Germany and DataBib is in the US.  If you are a researcher looking for the right disciplinary data repository, the host country may matter, depending on your funding agencies (national or international, private or public) and the scale of collaboration.

The full list of repositories is available here.

The conclusion: check both

Going forward, help with disciplinary repository selection will increasingly be a part of data management workflows; the Data Management Planning Tool (DMPTool) plans to incorporate repository recommendations through DataBib, and DataCite may integrate with re3data. Further simplifying matters, DataBib and re3data plan to merge their services in some as-yet-undefined way.  But, for now, it’s safe to say that anyone looking for a disciplinary repository should check both DataBib and re3data.


Institutional Repositories: Part 2

A few weeks back I wrote a post describing institutional repositories (IRs for short). IRs have been around for a while, with the impetus of making scholarly publications open access. More recently, however, IRs have been cited as potential repositories for datasets, code, and other scholarly outputs. Here I continue the discussion of IRs and compare their utility to DRs. Please note – although IRs are typically associated with open access publications, I discuss them here as potential repositories for data.

Honest criticism of IRs

In my discussions with colleagues at conferences and meetings, I have found that some are skeptical about the role of IRs in data access and preservation. I posit that this skepticism has a few origins:

  • IRs are often not intended for “self-service”, i.e., a researcher would need to connect with IR support staff (often via a face-to-face meeting), in order to deposit material into the IR.
  • Many IRs were created at minimum 5 years ago, with interfaces that sometimes appear to pre-date Facebook. Academic institutions often have no budget for a redesign of the user interface, which means those that visit an IR might be put off by the appearance and/or functionality.
  • IRs are run by libraries and IT departments, neither of which is known for self-promotion. Many (most?) researchers are likely unaware of an IR’s existence, and would not think to check in with the libraries regarding their data preservation needs.

These are all valid issues associated with many of the existing IRs. But there is one huge advantage to IRs over other data repositories: they are owned and operated by academic institutions that have a vested interest in preserving and providing access to scholarly work.

The bright side

IRs aren’t all bad, or I wouldn’t be blogging about them. I believe that they are undergoing a rebirth of sorts: they are now seen as viable places for datasets and other scholarly outputs. Institutions like Purdue are putting IRs at the center of their initiatives around data management, access, and preservation. Here at the CDL, the UC3 group is pursuing the implementation of a data curation platform, DataShare, to allow self-service deposit of datasets into the Merritt Repository (see the UCSF DataShare site). Recent mandates from above requiring access to data resulting from federal grants mean that funders (like IMLS) and organizations (like ARL) are taking an interest in improving the utility of IRs.

IRs versus discipline-specific repositories

In my last post, I mentioned that selecting a repository for your data need not be a choice between an IR and a discipline-specific repository (DR). These repositories each have advantages and disadvantages, so using both makes sense.

DRs: ideal for data discovery and reuse

Often, DRs have collection policies for the specific types of data they are willing to accept. GenBank, for example, has standardized how you deposit your data, what types and formats of data they accept, and the metadata accompanying that data. This all means that searching for and using the data in GenBank is easy, and data users are able to easily download data for use. Another advantage of having a collection of similar, standardized data is the ability to build tools on top of these datasets, making reuse and meta-analyses easier.

The downside of DRs

By nature, DRs are selective in the types of data that they accept. Consider this scenario, typical of many research projects: what if someone worked on a project that combined sequencing genes, collecting population demographics, and documenting location with GIS? Many DRs would not want to (or be able to) handle these disparate types of data. The result is that some of the data gets shared via a DR, while data less suitable for the DR goes unshared.

In my work with the DataONE Community Engagement and Education working group, I reviewed what datasets were shared from NSF grants awarded between 2005 and 2009 (see Panel 1 in Hampton et al. 2013). Many of the resulting publications relied on multiple types of data.  The percentage of those that shared all of the data produced was around 28%. However, of the data that was shared, 81% was in GenBank or TreeBase – likely due to the culture of data sharing around genetic work. That means most of the non-genetic data is not available, and potentially lost, despite its importance for the project as a whole. Enter: institutional repositories.

IRs: the whole enchilada

Unlike many DRs, IRs have the potential to host entire collections of data around a project – regardless of the type of data, its format, etc. My postdoctoral work on modeling the effects of temperature and salinity on copepod populations involved field collection, laboratory copepod growth experiments (which included logs of environmental conditions), food growth (algal density estimates and growth rates, nutrient concentrations), population size counts, R scripts, and the development of the mathematical models themselves. An IR could take all of these disparate datasets as a package, which I could then refer to in the publications that resulted from the work. A big bonus is that this package could sit next to other packages I’ve generated over the course of my career, making it easier for me to point people to the entire corpus of research work. The biggest bonus of all: having all of the data that produced a publication, available at a single location, helps ensure reproducibility and transparency.

Maybe you can have your cake (DRs) and eat it too (IRs). From Flickr by Mayaevening

There are certainly some repositories that could handle the type of data package I just described. The Knowledge Network for Biocomplexity is one such relatively generic repository (although I might argue that KNB is more like an IR than a discipline repository). Another is figshare, although this is a repository ultimately owned by a publisher. But as researchers start hunting for places to put their datasets, I would hope that they look to academic institutions rather than commercial publishers. (Full disclosure – I have data stored in figshare!)

Good news! You can have your cake and eat it too. Putting data in both the relevant DRs and more generic IRs is a good solution to ensure discoverability (DRs) and provenance (IRs).


Institutional Repositories: Part 1

If you aren’t a member of the library and archiving world, you probably aren’t aware of the phrase institutional repository (IR for short). I certainly wasn’t aware of IRs prior to joining the CDL, and I’m guessing most researchers are similarly ignorant. In the next two blog posts, I plan to first explain IRs, then lay out the case for their importance – nay, necessity – as part of the academic ecosphere. I should mention up front that although the IR’s inception focused on archiving traditional publications by researchers, I am speaking about them here as potential homes for preserving all scholarship, including data.

Academic libraries have a mission to archive scholarly work, including theses. These are at The Hive in Worcester, England. From Flickr by israelcsus.

If you read this blog, I’m sure you know that there is increased awareness about the importance of open science, open access to publications, data sharing, and reproducibility. Most of these concepts were easily accomplished in the olden days of pen-and-paper: you simply took great notes in your notebook, and shared that notebook as necessary with colleagues (this assumes, of course, geographic proximity and/or excellent mail systems). These days, that landscape has changed dramatically due to the increasingly computationally complex nature of research. Digital inputs and outputs of research might include software, spreadsheets, databases, images, websites, text-based corpuses, and more. But these “digital assets”, as the archival world might call them, are more difficult to store than a lab notebook. What does a virtual filing cabinet or file storage box look like that can house all of these different bits? In my opinion, it looks like an IR.

So what’s an IR?

An IR is a data repository run by an institution. Many of the large research universities have IRs. To name a few, Harvard has DASH, the University of California system has eScholarship and Merritt, Purdue has PURR, and MIT has DSpace. Many of these systems have been set up in the last 10 years or so to serve as archives for publications. For a great overview and history of IRs, check out this eHow article (which is surprisingly better than the relevant Wikipedia article).

So why haven’t more people heard of IRs? Mostly this is because there have never been any mandates or requirements for researchers to deposit their works in IRs. Some libraries take on this task; for example, I found out a few years ago that the MBL-WHOI Library graciously stored open access copies of all of my publications for me in their IR. But more and more these “works” include digital assets that are not publications, and collecting all of the digital scholarship produced by an institution is a near-insurmountable task for a small group of librarians; there has to be either buy-in from researchers or mandates from the top.

The Case for IRs

I’m not the first one to recognize the importance of IRs. Back in 2002 the Scholarly Publishing and Academic Resources Coalition (SPARC) put out a position paper titled “The Case for Institutional Repositories” (see their website for more information). They defined an IR as having four major qualities:

  1. Institutionally defined,
  2. Scholarly,
  3. Cumulative and perpetual, and
  4. Open and interoperable.

Taking the point of view of the academic institution (rather than the researcher), the paper cited two roles that institutional repositories play for academic institutions:

  1. Reform scholarly communication – Reassert control over scholarship, reduce monopoly power of journals, and bring relevance to libraries
  2. Promote the university – Serve as an indicator of the university’s quality; showcase the university’s research; demonstrate public value and increase status.

In general, IRs are run by information professionals (e.g., librarians), who are experts at documenting, archiving, preserving, and generally curating information. All of those digital assets that we produce as researchers fit the bill perfectly.

As a researcher, you might not be convinced of the importance of IRs given the arguments above. Part of the indifference researchers may feel about IRs might have something to do with the existence of disciplinary repositories.

Disciplinary Repositories

There are many, many, many repositories out there for storing digital assets. To get a sense, check out re3data.org or databib.org and start browsing. Both of these websites are searchable databases for research data repositories. If you are a researcher, you probably know of at least one or two repositories for datasets in your field. For example, geneticists have GenBank, evolutionary biologists have TreeBase, ecologists have the KNB, and marine biologists have BCO-DMO. These are all examples of disciplinary repositories (DRs) for data. As any researcher who’s aware of these sites knows, you can both deposit and download data from these repositories, which makes them indispensable resources for their respective fields.

So where should a researcher put data?

The short answer is both an IR and a DR. I’ll expand on this and make the case for IRs to researchers in the next blog post.


Thanks in Advance For Sharing Your Data

Barbara Bates says to be sure to dress your turkey properly this season! Then invite him to eat some tofurky with you. From Flickr by carbonated

It’s American Thanksgiving this week, which means that hall traffic at your local university is likely to dwindle down to zero by Wednesday afternoon.  Because it’s a short week, this is a short post.  I wanted to briefly touch on data sharing policies in journals.

Will you be required to share your data next time you publish? If you are looking for a short answer, it’s probably not. Depending on the field you are in, the requirements for data sharing are not very… forceful. They often involve phrases like “strongly encourage” or “provided on demand”, rather than requiring researchers to archive their data, obtain an identifier, and submit that information alongside the journal article.  The journal Nature just beefed up their wording a bit; still no requirements for archiving though. Read the Nature policy on availability of data and materials.

Despite the slow progress towards data sharing mandates, there is a growing list of journals that sign up for the Joint Data Archiving Policy (JDAP), the brainchild of folks over at the Dryad Repository. The JDAP verbiage, which journals can use in their instructions for authors, states that supporting data must be publicly available:

<< Journal >> requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as << list of approved archives here >>. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.

The bold face emphasis was mine, which I did because it’s important: the journal requires, as a condition for publication, that you share your data.  Now we’re cooking with gas!

The JDAP was adopted in a joint and coordinated fashion by many leading journals in the field of evolution in 2011, and JDAP has since been adopted by other journals across various disciplines. A list of journals that require data sharing via the JDAP verbiage is below.

Two other interesting bits about data sharing, in this case in PLOS:

List of Journals that require data sharing:

  • The American Naturalist
  • Biological Journal of the Linnean Society
  • BMC Ecology
  • BMC Evolutionary Biology
  • BMJ
  • BMJ Open
  • Ecological Applications
  • Ecological Monographs
  • Ecology
  • Ecosphere
  • Evolution
  • Evolutionary Applications
  • Frontiers in Ecology and the Environment
  • Functional Ecology
  • Genetics
  • Heredity
  • Journal of Applied Ecology
  • Journal of Ecology
  • Journal of Evolutionary Biology
  • Journal of Fish and Wildlife Management
  • Journal of Heredity
  • Journal of Paleontology
  • Molecular Biology and Evolution
  • Molecular Ecology and Molecular Ecology Resources
  • Nature
  • Nucleic Acids Research
  • Paleobiology
  • PLOS
  • Science
  • Systematic Biology
  • ZooKeys