Data Publication: Sharing, Crediting, and Re-Using Research Data

In the most basic terms, Data Publishing is the process of making research data publicly available for re-use. But even this simple statement invites many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: (1) data that are publicly accessible and preserved for an indefinite amount of time, (2) descriptive information about the data (metadata), and (3) a citation for the data (giving credit to the data). Why are these elements essential? Together, these three features make research data reusable and reproducible, which is the goal of a Data Publication.

Data are publicly accessible and preserved indefinitely

There are many ways for researchers to make their data publicly available, be it within Supporting Information files of a journal article or within an institutional, field-specific, or general repository. For a true Data Publication, data should be submitted to a stable repository that can ensure the data will be available and stored for an indefinite amount of time. There are over a thousand repositories registered with re3data, and many publishers have repository guides to help with field-specific guidance. When data are not suitable for public deposition, e.g. when they contain sensitive information, they should still be stored in a preserved and compliant space. While this restriction is a more difficult hurdle to clear in advocating for data publishing and data preservation, it is important to ensure that these data neither violate ethical requirements nor end up locked in a filing cabinet and eventually thrown out. Preservation of data is a necessity for the future.

Data are described (data have metadata)

Data without proper documentation or descriptive metadata are about as useful as research without data. If a Data Publication is a citable piece of scholarly work, it should contain the information that allows it to be a useful and valued piece of scholarly work. Documentation and metadata range from information regarding software used for analysis to who funded the work. While these examples serve separate purposes (one for re-use and the other for credit), it is important that all information about the creation of the dataset (who, where, how, related publications) is available.

Data are citable and credible

We’ve established that datasets are essential to research output and are an important piece of scholarly work, so they should receive the same benefits. Data need to have a persistent identifier (a stable link) that can be referenced. While many repositories use a DataCite DOI to fulfill this, some field-specific repositories use accession numbers (e.g. NCBI repositories) that can be referenced within a URL. This is one of the reasons data need to be available in a stable repository. It’s a bit difficult to reference and credit data that are on your hard drive!
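As a small illustration of what a persistent identifier buys you, the sketch below turns a DOI into a resolvable link and assembles a citation string from a few metadata fields. The field order is loosely modeled on the pattern DataCite recommends (creators, year, title, publisher, identifier). The `Dataset` class, the helper names, and every value in the example record, including the DOI, are placeholders invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    creators: list[str]
    year: int
    title: str
    publisher: str  # the repository that holds the data
    doi: str        # bare DOI without the resolver prefix


def doi_url(doi: str) -> str:
    """Turn a bare DOI into a link that goes through the doi.org resolver."""
    return f"https://doi.org/{doi}"


def format_citation(ds: Dataset) -> str:
    """Assemble a citation: Creators (Year): Title. Publisher. DOI link."""
    creators = "; ".join(ds.creators)
    return f"{creators} ({ds.year}): {ds.title}. {ds.publisher}. {doi_url(ds.doi)}"


# Hypothetical dataset; every value here is a placeholder.
example = Dataset(
    creators=["Doe, Jane", "Roe, Richard"],
    year=2017,
    title="Example soil moisture readings",
    publisher="Example Repository",
    doi="10.5072/example-1234",
)
print(format_citation(example))
```

Because the citation is built from the same metadata that describes the dataset, a repository can generate it automatically, which is one reason description and citation go hand in hand.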

If it’s so clear, why are there barriers?

Data publishing has become more widely accepted in the last ten years, with new standards from funders and publishers and a growth in stable repositories. However, there’s still work to be done and more questions to be answered before we reach mass adoption. Let’s start that conversation (you can be the questioner and I’ll be the advocate):

Organizing and submitting data are time intensive and in turn, costly

Trying to replicate a data set from scratch takes much more time (and money) than publishing your data (see robotics example here). Taking the time to search your old computer files or get in touch with your last institution to get your data is more complicated than publishing your data. Having your paper retracted because your data are called into question and you can’t share them or no longer have them would cost more time, more money, and more damage to your reputation than proactively publishing your datasets.

As an important side note: Data Publications do not need to be linked to a journal publication. While it may take extra time to submit a Data Publication in proper form, if used as an intermediate step in the research process you can reduce time later, get credit, and benefit the research community in the meantime.

What’s the incentive?

Credit. Next question?

But beyond credit for a citable piece of work, publishing data as a common practice will shift the focus from publications being an end point in the research cycle to being a starting point, and this shift is crucial for transparency and reproducibility in published works. Incentives will become clear once Data Citations become common practice within the publisher and research community, and resources are available for researchers to know how (and have the time/funds) to submit Data Publications.

Too few resources for understanding Data Publishing

Many great papers have been posted and published in the last ten years about what a Data Publication is; however, fewer resources have been made available to the research community on how to integrate Data Publishing into the research life cycle and how to organize data to even be suitable for a Data Publication. Data Management Plans, courses on research data management, and pressure from various funder and publisher policies will help, but there’s a serious need for education on data planning/organization (including metadata and format requirements) as well as awareness of data publishing platforms and their benefits. This is a call to the community to release these materials and engage in the Research Data Management (RDM) community to get as many of these conversations going as possible. The more resources, answers, and guidance that institutions can provide to researchers, the less the “it takes too much time and money” argument will arise, the easier it will be to achieve the incentive, and the further we will push the boundaries of transparency in scholarly communications.

There’s no better time than now to re-evaluate what resources are available for research output. If we strive for re-use and reproducibility of research data within the community, then now is the time to increase awareness and adoption of Data Publication.

For more information about research data organizations, machine actionable Data Management Plans, or Data Publication platforms, please utilize UC3 resources or get in touch at uc3@ucop.edu.

Ensuring access to critical research data

For the last two months, UC3 has been working with the teams at Data.gov, Data Refuge, Internet Archive, and Code For Science (creators of the Dat Project) to aggregate government data.

Data that spans the globe

There are currently volunteers across the country working to discover and preserve publicly funded research, especially climate data, from being deleted or lost from the public record. The largest initiative is called Data Refuge and is led by librarians and scientists. They are holding events across the UC campuses and the US that you should attend and help out in person, and are organizing the library community to band together to curate the data and ensure it’s preserved and accessible.

Our initiative builds on this and is looking to build a corpus of government data and corresponding metadata.  We are focusing on public research data, especially those at risk of disappearing. The initiative was nicknamed “Svalbard” by Max Ogden of the Dat project, after the Svalbard Global Seed Vault in the Arctic.  As of today, our friends at Code for Science have released 38GB of metadata, over 30 million hashes and URLs of research data files.

The Svalbard Global Seed Vault in the Arctic

To aid in this effort

We have assembled the following metadata as part of Code for Science’s Svalbard v1:

  • 2.7 million SHA-256 hashes for all downloadable resources linked from Data.gov, representing around 40TB of data
  • 29 million SHA-1 hashes of files archived by the Internet Archive and the Archive Team from federal websites and FTP servers, representing over 120TB of data
  • All metadata from Data.gov, about 2.1 million datasets
  • A list of ~750 .gov and .mil FTP servers

There are additional sources, such as Archivers.Space, EDGI, Climate Mirror, and Azimuth Data Backup, that we are working on adding metadata for in future releases.

Following the principles set forth by the librarians behind Data Refuge, we believe it’s important to establish a clear and trustworthy chain of custody for research datasets so that mirror copies can be trusted. With this project, we are working to curate metadata that includes strong cryptographic hashes of data files in addition to metadata that can be used to reproduce a download procedure from the originating host.

We are hoping the community can use this data in the following ways:

  • To independently verify that the mirroring processes that produced these hashes can be reproduced
  • To aid in developing new forms of redundant dataset distribution (such as peer to peer networks)
  • To seed additional web crawls or scraping efforts with additional dataset source URLs
  • To encourage other archiving efforts to publish their metadata in an easily accessible format
  • To cross reference data across archives, for deduplication or verification purposes
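As a rough sketch of the first and last uses in the list above, the snippet below computes the SHA-256 digest of a local copy of a file and compares it with a hash taken from published metadata. The function names are our own, the demo file is a stand-in, and how the hash is looked up in the released metadata is left out because it depends on the release format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published_hex: str) -> bool:
    """Compare a local copy against a hash taken from published metadata."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Demo with a throwaway file standing in for a mirrored dataset.
with tempfile.TemporaryDirectory() as tmp:
    mirror_copy = Path(tmp) / "dataset.csv"
    content = b"station,temp\nA,12.5\n"
    mirror_copy.write_bytes(content)
    published = hashlib.sha256(content).hexdigest()  # stand-in for a metadata entry
    print(matches_published_hash(mirror_copy, published))
```

Reading in chunks keeps memory use flat, which matters when the files being verified run to gigabytes.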

What about the data?

The metadata is great, but the initial release of 30 million hashes and URLs is just part of our project. The actual content (the files from which the hashes were derived) has also been downloaded. It is stored either at the Internet Archive or on our California Digital Library servers.

The Dat Project carried out a Data.gov HTTP mirror (~40TB) and uploaded it to our servers at California Digital Library. We are working with them to access ~160TB of data in the future and have partnered with UC Riverside to offer longer-term storage.

Download

You can download the metadata here using Dat Desktop or Dat CLI tool.  We are using the Dat Protocol for distribution so that we can publish new metadata releases efficiently while still keeping the old versions around. Dat provides a secure cryptographic ledger, similar in concept to a blockchain, that can verify integrity of updates.
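To give a feel for how an append-only ledger lets a reader detect tampering with earlier releases, here is a toy hash chain. It is not Dat’s actual data structure or wire format; the `Ledger` class and its entry layout are assumptions made purely for illustration.

```python
import hashlib
import json


class Ledger:
    """Toy append-only log: each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    @staticmethod
    def _entry_hash(prev_hash: str, payload: str) -> str:
        body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def append(self, payload: str) -> str:
        """Add a new entry that is chained to the one before it."""
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry_hash = self._entry_hash(prev_hash, payload)
        self.entries.append({"prev": prev_hash, "payload": payload, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash from the start; any edit breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._entry_hash(prev_hash, entry["payload"]):
                return False
            prev_hash = entry["hash"]
        return True


ledger = Ledger()
ledger.append("metadata release v1")
ledger.append("metadata release v2")
print(ledger.verify())  # True for an untouched log

ledger.entries[0]["payload"] = "tampered"
print(ledger.verify())  # False once an old entry is edited
```

The point of the chain is that old versions stay in place and new releases only add to the log, so anyone holding the latest hash can check the whole history.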

Feedback

If you want to learn more about how CDL and the UC3 team are involved, contact us at uc3@ucop.edu or @UC3CDL. If you have suggestions or questions, you can join the Code for Science Community Chat. And, if you are a technical user, you can report issues or get involved at the Svalbard GitHub.

This is crossposted here: https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab#.f933mmts8

Government Data At Risk

Government data is at risk, but that is nothing new.  

The existence of Data.gov, the Federal Open Data Policy, and open government data belies the fact that, historically, a vast amount of government data and digital information is at risk of disappearing in the transition between presidential administrations. For example, between 2008 and 2012, over 80 percent of the PDFs hosted on .gov domains disappeared. To track these and other changes, California Digital Library (CDL) joined with the University of North Texas, the Library of Congress, the Internet Archive, and the U.S. Government Publishing Office to create the End of Term (EOT) Archive. After archiving the web presence of federal agencies in 2008 and 2012, the team initiated a new crawl in September of 2016.

In light of recent events, tools and infrastructure initially developed for EOT and other projects have been taken up by efforts to back up “at risk” datasets, including those related to the environment, climate change, and social justice. Data Refuge, coordinated by the Penn Program of Environmental Humanities (PPEH), has organized a series of “Data Rescue” events across the country where volunteers nominate webpages for submission to the End of Term Archive and harvest “uncrawlable” data to be bagged and submitted to an open data archive. Efforts such as the Azimuth Climate Data Backup Project and Climate Mirror do not involve submitting data or information directly to the End of Term Archive, but have similar aims and workflows.

These efforts are great for raising awareness and building back-ups of key collections. In the background, CDL and the team behind the Dat Project have worked to back up Data.gov itself. The goal is to preserve not only the datasets catalogued by Data.gov but also the associated metadata and organization that make it such a useful location for finding and using government data. As a result of this partnership, for the first time ever, the entire Data.gov metadata catalog of over 2 million datasets will soon be available for bulk download. This will allow the various backup efforts to coordinate and cross reference their data sets with those on Data.gov. To allow for further coordination and cross referencing, the Dat team has also begun acquiring the metadata for all the files acquired by Data Refuge, the Azimuth Climate Data Backup Project, and Climate Mirror.

In an effort to keep track of all these efforts to preserve government data and information, we’re maintaining the following annotated list. As new efforts emerge or existing efforts broaden or change their focus, we’ll make sure the list is updated. Feel free to send additional info on government data projects to: uc3@ucop.edu

Get involved: Ongoing Efforts to Preserve Scientific Data or Support Science

Data.gov – The home of the U.S. Government’s open data, much of which is non-biological and non-environmental. Data.gov has a lightweight system for reporting and tracking datasets that aren’t represented and functions as a single point of discovery for federal data. Newly archived data can and should be reported there. CDL and the Dat team are currently working to back up the data catalogued on Data.gov and also the associated metadata.

End of Term – A collaborative project to capture and save U.S. Government websites at the end of presidential administrations. The initial partners in EOT included CDL, the Internet Archive, the Library of Congress, the University of North Texas, and the U.S. Government Publishing Office. Volunteers at many Data Rescue events use the URL nomination and BagIt/Bagger tools developed as part of the EOT project.

Data Refuge – A collaborative effort that aims to back up research-quality copies of federal climate and environmental data, advocate for environmental literacy, and build a consortium of research libraries to scale their tools and practices to make copies of other kinds of federal data. Find a Data Rescue event near you.

Azimuth Climate Data Backup Project – An urgent project to back up US government climate databases. Initially started by statistician Jan Galkowski and John Baez, a mathematician and science blogger at UC Riverside.

Climate Mirror – A distributed volunteer effort to mirror and back up U.S. Federal Climate Data. This project is currently being led by Data Refuge.

The Environmental Data and Governance Initiative – An international network of academics and non-profits that addresses potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate, inform, and enforce. EDGI has built many of the tools used at Data Rescue events.

March for Science – A celebration of science and a call to support and safeguard the scientific community. The main march in Washington DC and satellite marches around the world are scheduled for April 22nd (Earth Day).

314 Action – A nonprofit that intends to leverage the goals and values of the greater science, technology, engineering, and mathematics community to aggressively advocate for science.


Understanding researcher needs and values related to software

Software is as important as data when it comes to building upon existing scholarship. However, while there has been a small amount of research into how researchers find, adopt, and credit it, there is a comparative lack of empirical data on how researchers use, share, and value their software.

The UC Berkeley Library and the California Digital Library are investigating researchers’ perceptions, values, and behaviors with regard to software generated as part of the research process. If you are a researcher, it would be greatly appreciated if you could spare 10-15 minutes to complete the following survey:

Take the survey now!

The results of this survey will help us better understand researcher needs and values related to software and may also inform the development of library services related to software best practices, code sharing, and the reproducibility of scholarly activity.

If you have questions about our study or any problems accessing the survey, please contact yasminal@berkeley.edu or John.Borghi@ucop.edu.


csv conf is back in 2017!

csv,conf,v3 is happening!

This time the community-run conference will be in Portland, Oregon, USA on the 2nd and 3rd of May 2017. It will feature stories about data sharing and data analysis from science, journalism, government, and open source. We want to bring together data makers/doers/hackers from backgrounds like science, journalism, open government and the wider software industry to share knowledge and stories.

csv,conf is a non-profit community conference run by people who love data and sharing knowledge. This isn’t just a conference about spreadsheets. CSV Conference is a conference about data sharing and data tools. We are curating content about advancing the art of data collaboration, from putting your data on GitHub to producing meaningful insight by running large scale distributed processing on a cluster.

Submit a Talk!  Talk proposals for csv,conf close Feb 15, so don’t delay, submit today! The deadline is fast approaching and we want to hear from a diverse range of voices from the data community.

Talks are 20 minutes long and can be about any data-related concept that you think is interesting. There are no rules for our talks, we just want you to propose a topic you are passionate about and think a room full of data nerds will also find interesting. You can check out some of the past talks from csv,conf,v1 and csv,conf,v2 to get an idea of what has been pitched before.

If you are passionate about data and the many applications it has in society, then join us in Portland!


Speaker perks:

  • Free pass to the conference
  • Limited number of travel awards available for those unable to pay
  • Did we mention it’s in Portland in the Spring????

Submit a talk proposal today at csvconf.com

Early bird tickets are now on sale here.

If you have colleagues or friends who you think would be a great addition to the conference, please forward this invitation along to them! csv,conf,v3 is committed to bringing a diverse group together to discuss data topics. 

– UC3 and the entire csv,conf,v3 team

For questions, please email csv-conf-coord@googlegroups.com, DM @csvconference or join the csv,conf public slack channel.

This was cross-posted from the Open Knowledge International Blog: http://blog.okfn.org/2017/01/12/csvconf-is-back-in-2017-submit-talk-proposals-on-the-art-of-data-analysis-and-collaboration/

Software Carpentry / Data Carpentry Instructor Training for Librarians

We are pleased to announce that we are partnering with Software Carpentry (http://software-carpentry.org) and Data Carpentry (http://datacarpentry.org) to offer an open instructor training course on May 4-5, 2017 geared specifically for the Library Carpentry movement.  

Open call for Instructor Training

This course will take place in Portland, OR, in conjunction with csv,conf,v3, a community conference for data makers everywhere. It’s open to anyone, but the two-day event will focus on preparing members of the library community as Software and Data Carpentry instructors. The sessions will be led by Library Carpentry community members, Belinda Weaver and Tim Dennis.

If you’d like to participate, please apply by filling in the form at https://amy.software-carpentry.org/forms/request_training/  Application closed

What is Library Carpentry?

For those who don’t know, Library Carpentry is a global community of library professionals that is customizing Software Carpentry and Data Carpentry modules for training the library community in software and data skills. You can follow us on twitter @LibCarpentry.

Library Carpentry is actively creating training modules for librarians and holding workshops around the world. It’s a relatively new movement that has already been a huge success. You can learn more by reading the recently published article: Library Carpentry: software skills training for library professionals.

Why should I get certified?

Library Carpentry is a movement tightly coupled with the Software Carpentry and Data Carpentry organizations. Since all are based on a train-the-trainer model, one of our challenges has been how to get more experience as instructors. This issue is handled within Software and Data Carpentry by requiring instructor certification.

Although certification is not a requirement to be involved in Library Carpentry, we know that doing so will help us refine workshops and teaching modules and grow the movement. Also, by getting certified, you can start hosting your own Library Carpentry, Software Carpentry, or Data Carpentry events on your campus. It’s a great way to engage with your campuses and library community!

Prerequisites

Applicants will learn how to teach people the skills and perspectives required to work more effectively with data and software. The focus will be on evidence-based education techniques and hands-on practice; as a condition of taking part, applicants must agree to:

  1. Abide by our code of conduct, which can be found at http://software-carpentry.org/conduct/ and http://datacarpentry.org/code-of-conduct/,
  2. Teach at a Library Carpentry, Software Carpentry, or Data Carpentry workshop within 12 months of the course, and
  3. Complete three short tasks after the course in order to complete the certification. The tasks take a total of approximately 8-10 hours: see http://swcarpentry.github.io/instructor-training/checkout/ for details.

Costs

This course will be held in Portland, OR, in conjunction with csv,conf,v3 and is sponsored by csv,conf,v3 and the California Digital Library. To help offset the costs of this event, we will ask attendees to contribute an optional fee (tiered prices will be recommended based on your or your employer’s ability to pay). No one will be turned down based on inability to pay and a small number of travel awards will be made available (more information coming soon).  

Application

Hope to see you there! To apply for this Software Carpentry / Data Carpentry Instructor Training course, please submit the application by Jan 31, 2017:

  https://amy.software-carpentry.org/forms/request_training/  Application closed

Under Group Name, use “CSV (joint)” if you wish to attend both the training and the conference, or “CSV (training only)” if you only wish to attend the training course.

More information

If you have any questions about this Instructor Training course, please contact admin@software-carpentry.org. And if you have any questions about the Library Carpentry movement, please contact us via email at uc3@ucop.edu or via twitter @LibCarpentry, or join the Gitter chatroom.

Dispatches from PIDapalooza

Last month, California Digital Library, ORCID, Crossref, and DataCite brought together the brightest minds in scholarly infrastructure to do the impossible: make a conference on persistent identifiers fun!


Usually, discussions about persistent identifiers (PIDs) and networked research are dry and hard to get through, or we find ourselves discussing the basics and never getting to the meat.

We designed PIDapalooza to attract kindred spirits who are passionate about improving interoperability and the overall quality of our scholarly infrastructure. We knew if we built it, they would come!

The results were fantastic and there was a great showing from the University of California community:

All PIDapalooza presentations are being archived on Figshare: https://pidapalooza.figshare.com

Take a look and make sure you are following @pidapalooza for word on future PID fun!


There’s a new Dash!

Dash: an open source, community approach to data publication

We have great news! Last week we refreshed our Dash data publication service.  For those of you who don’t know, Dash is an open source, community driven project that takes a unique approach to data publication and digital preservation.

Dash focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated. It is a project based at the University of California Curation Center (UC3), a program at California Digital Library (CDL) that aims to develop interdisciplinary research data infrastructure.

Dash employs a multi-tenancy user interface, providing partners with extensive opportunities for local branding and customization, use of existing campus login credentials, and, importantly, the Dash service under a tenant-specific URL, an important consideration in driving adoption. We welcome collaborations with other organizations wishing to provide a simple, intuitive data publication service on top of more cumbersome legacy systems.

There are currently seven live instances of Dash:

  • UC Berkeley
  • UC Irvine
  • UC Merced
  • UC Office of the President
  • UC Riverside
  • UC Santa Cruz
  • UC San Francisco
  • ONEshare (in partnership with DataONE)

Architecture and Implementation

Dash is completely open source. Our code is made publicly available on GitHub (http://cdluc3.github.io/dash/). Dash is based on an underlying Ruby-on-Rails data publication platform called Stash. Stash encompasses three main functional components: Store, Harvest, and Share.

  • Store: The Store component is responsible for the selection of datasets; their description in terms of configurable metadata schemas, including specification of ORCID and FundRef identifiers for researcher and funder disambiguation; the assignment of DOIs for stable citation and retrieval; designation of an optional limited-time embargo; and packaging and submission to the integrated repository
  • Harvest: The Harvest component is responsible for retrieval of descriptive metadata from that repository for inclusion into a Solr search index
  • Share: The Share component, based on GeoBlacklight, is responsible for the faceted search and browse interface

Dash Architecture Diagram

Individual dataset landing pages are formatted as an online version of a data paper, presenting all appropriate descriptive and administrative metadata in a form that can be downloaded as an individual PDF file, or as part of the complete dataset download package, incorporating all data files for all versions.

To facilitate flexible configuration and future enhancement, all support for the various external service providers and repository protocols is fully encapsulated into pluggable modules. Metadata modules are available for the DataCite and Dublin Core metadata schemas. Protocol modules are available for the SWORD 2.0 deposit protocol and the OAI-PMH and ResourceSync harvesting protocols. Authentication modules are available for InCommon/Shibboleth and Google/OAuth2 identity providers (IdPs). We welcome collaborations to develop additional modules for additional metadata schemas and repository protocols. Please email UC3 (uc3 at ucop dot edu) or visit GitHub (http://cdluc3.github.io/dash/) for more information.
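The pluggable-module idea can be sketched in a few lines. The example below is not taken from the Stash codebase (which is written in Ruby); it is a Python illustration in which protocol modules register themselves by name and the harvester looks up whichever one is configured. The registry, the function names, and the repository URL are all hypothetical. The only real-world detail is the shape of an OAI-PMH `ListRecords` request asking for Dublin Core (`oai_dc`) records.

```python
from typing import Callable, Dict
from urllib.parse import urlencode

# Registry mapping a protocol name to a function that builds a harvest request.
HARVESTERS: Dict[str, Callable[[str], str]] = {}


def register(protocol: str):
    """Decorator that plugs a request builder into the registry by protocol name."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        HARVESTERS[protocol] = func
        return func
    return wrap


@register("oai-pmh")
def oai_pmh_list_records(base_url: str) -> str:
    """Build an OAI-PMH ListRecords request for Dublin Core metadata."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return f"{base_url}?{query}"


def harvest_request(protocol: str, base_url: str) -> str:
    """Look up the configured protocol module and build its request URL."""
    try:
        builder = HARVESTERS[protocol]
    except KeyError:
        raise ValueError(f"no harvester registered for {protocol!r}") from None
    return builder(base_url)


# Hypothetical repository endpoint.
print(harvest_request("oai-pmh", "https://repository.example.org/oai"))
```

Supporting another protocol then means adding one more registered function, with no change to the code that calls `harvest_request`.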

Features of the newly refreshed Dash service

What are the new features on our refresh of the Dash services?  Take a look.

Feature | Tech-focused | User-focused | Description
Open Source | X | | All components open source, MIT licensed code (http://cdluc3.github.io/dash/)
Standards compliant | X | | Dash integrates with any SWORD/OAI-PMH-compliant repository
Pluggable Framework | X | | Inherent extensibility for supporting additional protocols and metadata schemas
Flexible metadata schemas | X | | Supports the DataCite metadata schema out of the box, but can be configured to support any schema
Innovation | X | | Our modular framework will make new feature development easier and quicker
Mobile/responsive design | X | X | Built mobile-first, from the ground up, for better user experience
Geolocation – Metadata | X | X | For applicable research outputs, we have an easy-to-use way to capture the location of your datasets
Persistent Identifiers – ORCID | X | X | Dash allows researchers to attach their ORCID, allowing them to track and get credit for their work
Persistent Identifiers – DOIs | X | X | Dash issues DOIs for all datasets, allowing researchers to track and get credit for their work
Persistent Identifiers – FundRef | X | X | Dash tracks funder information using FundRef, allowing researchers and funders to track their research outputs
Login – Shibboleth/OAuth2 | X | X | We offer easy single sign-on with your campus credentials or Google account
Versioning | X | X | Datasets can change. Dash offers a quick way for you to upload new versions of your datasets and a simple process for tracking updates
Accessibility | X | X | The technology, design, and user workflows have all been built with accessibility in mind
Better user experience | | X | Self-depositing made easy. Simple workflow, drag-and-drop upload, simple navigation, clean data publication pages, user dashboards
Geolocation – Search | | X | With GeoBlacklight, we can offer search by location
Robust Search | | X | Search by subject, filetype, keywords, campus, location, etc.
Discoverability | | X | Indexing by search engines such as Google, Bing, etc.
Build Relationships | | X | Many datasets are related to publications or other data. Dash offers a quick way to describe these relationships
Supports Best Practices | | X | Data publication can be confusing. But with Dash, you can trust that Dash is following best practices
Data Metrics | | X | See the reach of your datasets through usage and download metrics
Data Citations | | X | Quick access to a well-formed citation reference (with DOI) for every data publication. Easy for your peers to quickly grab
Open License | | X | Dash supports open Creative Commons licensing for all data deposits; can be configured for other licenses
Lower Barrier to Entry | | X | For those in a hurry, Dash offers a quick interface to self-deposit. Only three steps and few required fields
Support Data Reuse | | X | Focus researchers on describing methods and explaining ways to reuse their datasets
Satisfies Data Availability Requirements | | X | Many publishers and funders require researchers to make their data available. Dash is a readily accepted and easy way to comply

A little Dash history

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health. Dash version 2 was developed by UC3 and partners with funding from the Alfred P. Sloan Foundation (our funded proposal). Read more about the code, the project, and contributing to development on the Dash GitHub site.

A little Dash future

We will continue the development of the new Dash platform and will keep you posted. Next up: support for timed deposits and embargoes.  Stay tuned!


USING AMAZON S3 AND GLACIER FOR MERRITT: An Update

The integration of the Merritt repository with Amazon’s S3 and Glacier cloud storage services, previously described in an August 16 post on the Data Pub blog, is now mostly complete. The new Amazon storage supplements Merritt’s longstanding reliance on UC private cloud offerings at UCLA and UCSD. Content tagged for public access is now routed to S3 for primary storage, with automatic replication to UCSD and UCLA. Private content is routed first to UCSD, and then replicated to UCLA and Glacier. Content is served for retrieval from the primary storage location; in the unlikely event of a failure, Merritt automatically retries from secondary UCSD or UCLA storage. Glacier, which provides near-line storage with four hour retrieval latency, is not used to respond to user-initiated retrieval requests.

Content Type   Primary Storage   Secondary Storage   Primary Retrieval   Secondary Retrieval
Public         S3                UCSD, UCLA          S3                  UCSD, UCLA
Private        UCSD              UCLA, Glacier       UCSD                UCLA
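
For readers who think in code, the routing in the table above can be sketched roughly as follows. This is a minimal illustration, not Merritt's actual implementation: the names StoragePolicy, POLICIES, and retrieve are hypothetical, and the fetch callback stands in for whatever mechanism really reads from a storage node.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class StoragePolicy:
    """Hypothetical description of where one content type is stored and served."""
    primary: str
    replicas: Tuple[str, ...]
    retrieval_order: Tuple[str, ...]  # primary first, then secondary locations


# Node labels mirror the table; Glacier is a replica but never a retrieval source.
POLICIES = {
    "public": StoragePolicy("S3", ("UCSD", "UCLA"), ("S3", "UCSD", "UCLA")),
    "private": StoragePolicy("UCSD", ("UCLA", "Glacier"), ("UCSD", "UCLA")),
}


def retrieve(content_type: str, fetch: Callable[[str], bytes]) -> Tuple[str, bytes]:
    """Try each retrieval location in order and return (node, data).

    fetch(node) is an assumed callback that returns bytes or raises OSError.
    """
    policy = POLICIES[content_type]
    last_error = None
    for node in policy.retrieval_order:
        try:
            return node, fetch(node)
        except OSError as err:
            last_error = err  # remember the failure and fall through to the next node
    raise RuntimeError(f"all retrieval locations failed for {content_type}") from last_error
```

In this sketch, a failed read from S3 for public content falls through to UCSD and then UCLA, while Glacier appears only in the replica list, matching the statement that it is not used for user-initiated retrieval.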

In preparation for this integration, all retrospective public content, over 1.1 million objects and 3 TB, was copied from UCSD to S3, a process that took about six days to complete. A similar move from UCSD to Glacier is now underway for the much larger corpus of private content, 1.5 million objects and 71 TB, which is expected to take about five weeks to complete.
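
As a back-of-the-envelope check on those figures, the short sketch below works out the implied transfer rates and average object sizes. The inputs are the approximate numbers quoted above ("over 1.1 million", "about six days"), decimal units are assumed, and the helper name summarize is ours, so treat the results as rough estimates only.

```python
# Figures quoted in the post; all are approximate ("over", "about").
MOVES = {
    "public (UCSD to S3)": {"objects": 1.1e6, "terabytes": 3, "days": 6},
    "private (UCSD to Glacier)": {"objects": 1.5e6, "terabytes": 71, "days": 5 * 7},
}


def summarize(move):
    """Return (TB per day, objects per day, mean object size in MB)."""
    tb_per_day = move["terabytes"] / move["days"]
    objects_per_day = move["objects"] / move["days"]
    # Decimal units assumed: 1 TB = 1,000,000 MB.
    mean_mb = move["terabytes"] * 1_000_000 / move["objects"]
    return tb_per_day, objects_per_day, mean_mb


for name, move in MOVES.items():
    tb, objs, mb = summarize(move)
    print(f"{name}: {tb:.2f} TB/day, {objs:,.0f} objects/day, {mb:.1f} MB/object")
```

On these numbers the public copy ran at roughly 0.5 TB per day, while the private move is expected to average roughly 2 TB per day. The private objects are also much larger on average (roughly 47 MB versus under 3 MB), which is consistent with fewer objects being moved per day.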

The Merritt-Amazon integration enables more optimized internal workflows and increased levels of reliability and preservation assurance. It also holds the promise of lowering overall storage costs, and thus, the recharge price of Merritt for our campus customers.  Amazon has, for example, recently announced significant price reductions for S3 and Glacier storage capacity, although their transactional fees remain unchanged.  Once the long-term impact of S3 and Glacier pricing on Merritt costs is understood, CDL will be able to revise Merritt pricing appropriately.

CDL is also investigating the possible use of the Oracle archive cloud as a lower-cost alternative, or supplement, to Glacier for dark archival content hosting. While offering similar functionality to Glacier, including four-hour retrieval latency, Oracle’s price point is about one quarter of Glacier’s for storage capacity.

An RDM Model for Researchers: What we’ve learned

Thanks to everyone who gave feedback on our previous blog post describing our data management tool for researchers. We received a great deal of input related to our guide’s use of the term “data sharing” and our guide’s position in relation to other RDM tools as well as quite a few questions about what our guide will include as we develop it further.

As stated in our initial post, we’re building a tool to enable individual researchers to assess the maturity of their data management practices within an institutional or organizational context. To do this, we’ve taken the concept of RDM maturity from existing tools like the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide and placed it within a framework familiar to researchers, the research data lifecycle.


A visualization of our guide as presented in our last blog post. An updated version, including changes made in response to reader feedback, is presented later in this post.

Data Sharing

The most immediate feedback we received was about the term “Data Sharing”. Several commenters pointed out the ambiguity of this term in the context of the research data life cycle. In the last iteration of our guide, we intended “Data Sharing” as a shorthand to describe activities related to the communication of data. Such activities may range from describing data in a traditional scholarly publication to depositing a dataset in a public repository or publishing a data paper. Because existing data sharing policies (e.g. PLOS, The Gates Foundation, and The Moore Foundation) refer specifically to the latter over the former, the term is clearly too imprecise for our guide.

Like “Data Sharing”, “Data Publication” is a popular term for describing activities surrounding the communication of data. Even more than “Sharing”, “Publication” relays our desire to advance practices that treat data as a first class research product. Unfortunately, the term is simultaneously too precise and too ambiguous to be useful in our guide. On one hand, the term “Data Publication” can refer specifically to a peer reviewed document that presents a dataset without offering any analysis or conclusion. While data papers may be a straightforward way of inserting datasets into the existing scholarly communication ecosystem, they represent a single point on the continuum of data management maturity. On the other hand, there is currently no clear consensus among researchers about what it means to “publish” data.

For now, we’ve given that portion of our guide the preliminary label of “Data Output”. As the development process proceeds, this row will include a full range of activities, from description of data in traditional scholarly publications (that may or may not include a data availability statement) to depositing data into public repositories and the publication of data papers.

Other Models and Guides

While we correctly identified that there is a range of rubrics, tools, and capability models with aims similar to our guide’s, we overstated the claim that ours uniquely allows researchers to assess where they are and where they want to be with regard to data management. Several of the tools we cited in our initial post can be applied by researchers to measure the maturity of data management practices within a project or institutional context.

Below we’ve profiled four such tools and indicated how we believe our guide differs from each. In differentiating our guide, we do not mean to position it strictly as an alternative. Rather, we believe that our guide could be used in concert with these other tools.

Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)

CARDIO is a benchmarking tool designed to be used by researchers, service providers, and coordinators for collaborative data management strategy development. Designed to be applied at a variety of levels, from entire institutions down to individual research projects, CARDIO enables its users to collaboratively assess data management requirements, activities, and capacities using an online interface. Users of CARDIO rate their data management infrastructure relative to a series of statements concerning their organization, technology, and resources. After completing CARDIO, users are given a comprehensive set of quantitative capability ratings as well as a series of practical recommendations for improvement.

Unlike CARDIO, our guide does not necessarily assume its users are in contact with data-related service providers at their institution. As we stated in our initial blog post, we intend to guide researchers to specialist knowledge without necessarily turning them into specialists. Therefore, we would consider a researcher making contact with their local data management, research IT, or library service providers for the first time as a positive application of our guide.

Community Capability Model Framework (CCMF)

The Community Capability Model Framework is designed to evaluate a community’s readiness to perform data intensive research. Intended to be used by researchers, institutions, and funders to assess current capabilities, identify areas requiring investment, and develop roadmaps for achieving a target state of readiness, the CCMF encompasses eight “capability factors” including openness, skills and training, research culture, and technical infrastructure. When used alongside the Capability Profile Template, the CCMF provides its users with a scorecard containing multiple quantitative scores related to each capability factor.   

Unlike the CCMF, our guide does not necessarily assume that its users should all be striving towards the same level of data management maturity. We recognize that data management practices may vary significantly between institutions or research areas and that what works for one researcher may not necessarily work for another. Therefore, we would consider researchers understanding the maturity of their data management practices within their local contexts to be a positive application of our guide.

Data Curation Profiles (DCP) and DMVitals

The Data Curation Profile toolkit is intended to address the needs of an individual researcher or research group with regards to the “primary” data used for a particular project. Taking the form of a structured interview between an information professional and a researcher, a DCP can allow an individual research group to consider their long-term data needs, enable an institution to coordinate their data management services, or facilitate research into broader topics in digital curation and preservation.

DMVitals is a tool designed to take information from a source like a Data Curation Profile and use it to systematically assess a researcher’s data management practices in direct comparison to institutional and domain standards. Using the DMVitals, a consultant matches a list of evaluated data management practices with responses from an interview and ranks the researcher’s current practices by their level of data management “sustainability.” The tool then generates customized and actionable recommendations, which a consultant then provides to the researcher as guidance to improve his or her data management practices.  

Unlike DMVitals, our guide does not calculate a quantitative rating to describe the maturity of data management practices. From a measurement perspective, the range of practice maturity may differ between the four stages of our guide (e.g. the “Project Planning” stage could have greater or fewer steps than the “Data Collection” stage), which would significantly complicate the interpretation of any quantitative ratings derived from our guide. We also recognize that data management practices are constantly evolving and likely dependent on disciplinary and institutional context. On the other hand, we also recognize the utility of quantitative ratings for benchmarking. Therefore, if, after assessing the maturity of their data management practices with our guide, a researcher chooses to apply a tool like DMVitals, we would consider that a positive application of our guide.

Our Model (Redux)

Perhaps the biggest takeaway from the response to our last blog post is that it is very difficult to give detailed feedback on a guide that is mostly whitespace. Below is an updated mock-up, which describes a set of RDM practices along the continuum of data management maturity. At present, we are not aiming to illustrate a full range of data management practices. Rather, this mock-up is intended to show the types of practices that could be described by our guide once it is complete.


An updated visualization of our guide based on reader feedback. At this stage, the example RDM practices are intended to be representative, not comprehensive.

Project Planning

The “Project Planning” stage describes practices that occur prior to the start of data collection. Our examples are all centered around data management plans (DMPs), but other considerations at this stage could include training in data literacy, engagement with local RDM services, inclusion of “sharing” in project documentation (e.g. consent forms), and project pre-registration.

Data Collection

The “Data Collection” stage describes practices related to the acquisition, accumulation, measurement, or simulation of data. Our examples relate mostly to standards around file naming and structuring, but other considerations at this stage could include the protection of sensitive or restricted data, validation of data integrity, and specification of linked data.

Data Analysis

The “Data Analysis” stage describes practices that involve the inspection, modeling, cleaning, or transformation of data. Our examples mostly relate to documenting the analysis workflow, but other considerations at this stage could include the generation and annotation of code and the packaging of data within sharable files or formats.

Data Output

The “Data Output” stage describes practices that involve the communication of either the data itself or conclusions drawn from the data. Our examples are mostly related to the communication of data linked to scholarly publications, but other considerations at this stage could include journal and funder mandates around data sharing, the publication of data papers, and the long term preservation of data.

Next Steps

Now that we’ve solicited a round of feedback from the community that works on issues around research support, data management, and digital curation, our next step is to broaden our scope to include researchers.

Specifically we are looking for help with the following:

  • Do you find the divisions within our model useful? We’ve used the research data lifecycle as a framework because we believe it makes our tool user-friendly for researchers. At the same time, we also acknowledge that the lines separating planning, collection, analysis, and output can be quite blurry. We would be grateful to know if researchers or data management service providers find these divisions useful or overly constrained.
  • Should there be more discrete “steps” within our framework? Because we view data management maturity as a continuum, we have shied away from creating discrete steps within each division. We would be grateful to know how researchers or data management service providers view this approach, especially when compared to the more quantitative approach employed by CARDIO, the Capability Profile Template, and DMVitals.
  • What else should we put into our model? Researchers are faced with changing expectations and obligations with regard to data management. We want our model to reflect that. We also want our model to reflect the relationship between research data management and broader issues like openness and reproducibility. With that in mind, what other practices and considerations should our model include?