Category Archives: Data Publication

Disambiguating Dash and Merritt

What’s Dash? What’s Merritt? What’s the difference? After numerous questions about where things should go and what the differences are between our UC3 services, we got the hint that we are not communicating clearly.

Clearing things up

A group of us sat down, talked through different use cases and the wording that was causing such confusion, and came up with what we hope is a disambiguation of Dash versus Merritt.


Different intentions, different target users

While Dash and Merritt interact at a technical level, they have different intentions, and users should not treat the two services as comparable. Dash is optimized for researchers: its user interface, user experience, and metadata schema are designed for use by individual researchers. Merritt is designed for use by institutional librarians, archivists, and curators.

Because of the different intended purposes, features, and users, UC3 does not recommend that Merritt be advertised to researchers on Research Data Management (RDM) sites or researcher-facing Library Guides.

Below are quick descriptions of each service that should clarify intentions and target users:

  • Dash is an open data publication platform for researchers. Self-service depositing of research data through Dash fulfills publisher, funder, and data management plan requirements regarding data sharing and preservation. When researchers publish their datasets through Dash, their datasets are issued a DOI to optimize citability, are publicly available for download and re-use under a CC BY 4.0 or CC0 license, and are preserved in Merritt, California Digital Library’s preservation repository. Dash is available to researchers at participating UC campuses, as well as researchers in Environmental and Earth Sciences through the DataONE network.
  • Merritt is a preservation repository for mediated deposits by UC organizations. We work with staff at UC libraries, archives, and departments to preserve digital assets and collections. Merritt offers bit-level preservation and replication with either public or private access. Merritt is also the preservation repository that preserves Dash-deposited data.

The cost of service vs. the cost of storage

California Digital Library does not charge individual users for the Dash or Merritt services. However, we do recharge your institution for the amount of storage used in Merritt (remember, Dash preserves data in Merritt) on an annual basis.  On most campuses, the Library fully subsidizes Dash storage costs, so there is no extra financial obligation to individual researchers depositing data into Dash.

Follow-up

If you have any questions about edge cases or would like to know any more details about the architecture of the Dash platform or Merritt repository, please get in touch at uc3@ucop.edu.

And while you’re here: check out Dash’s new features for uploading large data sets, and uploading directly from the cloud.

Cirrus-ly Convenient Uploading

That was a cloud pun! Following our release two weeks ago, the Dash team is thrilled to present our newest functionality: you may now upload files directly from Box, Dropbox, and Google Drive!

Let’s get you publishing (and citing and getting credit for your data):

  • Using the “upload from server” option, you may enter up to 1,000 URLs (and up to 100 GB per submission) by pasting in sharing links from Box, Dropbox, or Google Drive.


  • Validate the files, and your URLs will appear along with each file’s name and size (a sketch of this validation step follows the list below).


  • Submit & download.
    • Files uploaded from Box, Dropbox, and Google Drive will download exactly as they were uploaded to the cloud.
    • Google Docs, Sheets, and Slides files will download as Microsoft Word documents, Excel spreadsheets, and PowerPoint presentations.
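For the technically curious, here is a rough sketch of what the validation step does conceptually: resolve each sharing link over HTTP and read the file name and size from the response headers. This is an illustration only, not Dash’s actual implementation, and the example link is hypothetical.

```python
# Illustrative only: probe a sharing link and report the file name and size,
# roughly the information Dash surfaces when you validate your URLs.
import urllib.request

def probe(url):
    """HEAD-request a URL and return (filename, size) as reported by the server."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=30) as resp:
        size = resp.headers.get("Content-Length", "unknown")
        disposition = resp.headers.get("Content-Disposition", "")
        if "filename=" in disposition:
            name = disposition.split("filename=")[-1].strip('"')
        else:
            name = url.rsplit("/", 1)[-1]  # fall back to the last path segment
        return name, size

# Hypothetical Dropbox sharing link (?dl=1 asks Dropbox for the file itself):
print(probe("https://www.dropbox.com/s/abc123/mydata.csv?dl=1"))
```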

We will be updating our help and FAQ pages this week to reflect our new features, but in the meantime please let us know if you have any questions or feedback. 

Manifesting Large and Bulk File Data Publications – Now a Reality!

The Dash team is excited to announce our June feature release: Large and Bulk File upload. Taking into consideration datasets with large files and many files, as well as the practicality of server timeouts, we have developed a new feature that allows up to 1,000 files or 100 GB* of data to be published per DOI.

To accomplish this we are using a “manifest” workflow, which means that instead of uploading data directly from your computer, you may enter URLs for where your data are located (on a server or public site). Once uploaded, Dash will display the data in the same manner as a direct upload. To reflect this new option, we have updated the Upload page so you can choose between uploading locally (from your computer) or via a server. Information about file size limits (2 GB/file and 10 GB total for local uploads, or 1,000 files of any size up to 100 GB* via a server) is listed on this landing page.
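As a back-of-the-envelope illustration of those limits, the sketch below checks a list of validated URL entries against the 1,000-file and 100 GB ceilings described above. This is not Dash’s actual code, the file entries are invented, and real limits vary by tenant.

```python
# A sketch of the submission limits described above; not Dash's actual code.
MAX_FILES = 1000
MAX_TOTAL_BYTES = 100 * 1024**3  # 100 GB (varies per institutional tenant)

def check_manifest(entries):
    """entries: list of (url, size_in_bytes) pairs gathered during validation."""
    if len(entries) > MAX_FILES:
        raise ValueError(f"too many files: {len(entries)} > {MAX_FILES}")
    total = sum(size for _, size in entries)
    if total > MAX_TOTAL_BYTES:
        raise ValueError(f"submission too large: {total} bytes")
    return total

# Two hypothetical server-hosted files, well under the limits:
check_manifest([("https://data.example.edu/run1.csv", 2 * 1024**3),
                ("https://data.example.edu/images.zip", 5 * 1024**3)])
```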

Step 1: Enter URLs where data are located


Step 2: Validated files will appear in the Uploaded Files table, along with any other data files associated with current or former versions


The benefit of this workflow is that you do not have to watch your screen for hours while the data upload; instead, your data are uploaded in the back end, without tying up your computer. This upload mechanism is also not limited to large files: it can be an easy way to transfer your data directly from a server, regardless of size.

One complication with this process is that you cannot upload local data and server-hosted data in the same version. Though this may seem tricky, remember that Dash supports versioning: after successful publication of the server-uploaded data, you can go back in and add local files (or vice versa).

While at the moment we do not allow uploads from Google Drive, Box, or Dropbox, we are investigating the sharing links necessary for integrating uploads from the cloud. If you have any feedback on how to make this feature, or any other features, more accessible or valuable for researchers, please do get in touch. Happy Data Publishing!

Note: To use this feature and publish your datasets, your data will need to be hosted on a server. Many institutions, departments, and labs have servers used to host data and information (good examples exist across the UC campuses, MIT, the University of Iowa, etc.). If you have any questions about servers on your campus or about external resources, please consult your campus librarians.

*Size limits vary per institutional tenant. Please check with your UC data librarians if you have any questions.

Make Data Count: Building a System to Support Recognition of Data as a First Class Research Output

The Alfred P. Sloan Foundation has made a 2-year, $747K award to the California Digital Library, DataCite, and DataONE to support collection of usage and citation metrics for data objects. Building on pilot work, this award will result in the launch of a new service that will collate and expose data-level metrics.

The impact of research has traditionally been measured by citations to journal publications: journal articles are the currency of scholarly research.  However, scholarly research is made up of a much larger and richer set of outputs beyond traditional publications, including research data. In order to track and report the reach of research data, methods for collecting metrics on complex research data are needed.  In this way, data can receive the same credit and recognition that is assigned to journal articles.

“Recognition of data as valuable output from the research process is increasing and this project will greatly enhance awareness around the value of data and enable researchers to gain credit for the creation and publication of data.” – Ed Pentz, Crossref

This project will work with the community to create a clear set of guidelines on how to define data usage. In addition, the project will develop a central hub for the collection of data-level metrics. These metrics will include data views, downloads, citations, saves, and social media mentions, and they will be exposed through customized user interfaces deployed at partner organizations. Working in an open source environment, and including extensive user experience testing and community engagement, the products of this project will be available for data repositories, libraries, and other organizations to deploy within their own environments, serving their communities of data authors.
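To make the idea concrete, a collated data-level metrics record might look something like the sketch below. The field names are purely hypothetical; the actual guidelines and schema are among the outputs this project will define with the community.

```python
# Hypothetical shape of a collated data-level metrics record; the real
# schema is one of the things this project will define with the community.
example_record = {
    "dataset": "https://doi.org/10.5072/FK2EXAMPLE",  # 10.5072 is DataCite's test prefix
    "views": 420,
    "downloads": 87,
    "citations": 3,
    "saves": 12,
    "social_media_mentions": 5,
}
```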

Are you working in the data metrics space? Let’s collaborate.

Find out more and follow us at: www.makedatacount.org, @makedatacount

About the Partners

California Digital Library was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was published and accessed. The University of California Curation Center (UC3), one of four main programs within the CDL, helps researchers and the UC libraries manage, preserve, and provide access to their important digital assets, and develops tools and services that serve the community throughout the research and data life cycles.

DataCite is a leading global non-profit organization that provides persistent identifiers (DOIs) for research data. Our goal is to help the research community locate, identify, and cite research data with confidence. Through collaboration, DataCite supports researchers by helping them to find, identify, and cite research data; data centres by providing persistent identifiers, workflows and standards; and journal publishers by enabling research articles to be linked to the underlying data/objects.

DataONE (Data Observation Network for Earth) is an NSF DataNet project which is developing a distributed framework and sustainable cyber infrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

Announcing New Dash Features – April 2017

The Dash team is pleased to announce the release of our newest features. Taking in requests from users as well as standards in the field, we have adapted the platform with the following releases: Private for Peer Review (Timed-Release of Data), ORCID integration, email capture for corresponding authors, user-friendly downloads, and a variety of search and view enhancements.

Private for Peer Review (Timed-Release of Data)

As mentioned in a previous post, this feature was formerly referred to as embargoing data, but we are releasing it in the context of keeping data private for the length of peer review. We have now implemented a feature that allows researchers to keep data private, for the purposes of peer review, for up to six months. If a researcher decides to use this option, they will be given a private Reviewer URL that can be used by an external party to download the data.

This URL will redirect to the landing page, where the data are available for download, as soon as the data are public. If external parties have any questions or would like to request a download, they will also now have the ability to reach the corresponding author.
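For the technically curious, a private Reviewer URL of this kind is typically just a landing-page address extended with a long random token, so it cannot be guessed but can be shared. The sketch below shows the general idea; the URL pattern is hypothetical, not Dash’s actual scheme.

```python
# General idea behind an unguessable Reviewer URL; the pattern is hypothetical.
import secrets

def reviewer_url(doi: str) -> str:
    token = secrets.token_urlsafe(24)  # ~32 URL-safe random characters
    return f"https://dash.example.org/review/{doi}?token={token}"

print(reviewer_url("10.5072/FK2EXAMPLE"))  # 10.5072 is DataCite's test prefix
```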

Corresponding Author Email Capture & ORCID Integration

Corresponding authors (and contributing authors) will now have the ability to enter their email address and ORCID iD, both of which will appear on the landing page beneath the author name. Just as journal articles have corresponding authors, we believe Data Publications should have a corresponding author contact who can be reached with questions about the dataset.

User Friendly Downloads & Interface Improvements

What one uploads is what another may download: when choosing to download the data files, only the files uploaded by the corresponding author will be downloaded.

Some other fixes and features include:

  • improved wording for our search filters and browse options
  • a checkbox at the file upload stage to ensure researchers are not uploading sensitive or identifying information 
  • explanatory information within the metadata submission for usage notes and related work
  • a preview of the dataset’s size on the download button

What’s up next?

  • Next Feature: large file upload and bulk file upload
  • Future Feature: a curation layer that will allow for administration capabilities

For more information or if you have any questions please check for updates on the @uc3cdl twitter feed, or get in touch at uc3@ucop.edu.

 

Embargoing the Term “Embargoes” Indefinitely

I’m two months into a position that lends part of its time to overseeing Dash, a Data Publication platform for the University of California. On my first day I was told that a big priority for Dash was to build out an embargo feature. Coming to the California Digital Library (CDL) from PLOS, an OA publisher with an OA data policy, I couldn’t understand why I would be leading endeavors to embargo data rather than open it up, so I met this embargo directive with apprehension.

I began to acquaint myself with the campuses, and a couple of weeks ago, while at UCSF, I presented the prototype for what this “embargo” feature would look like and questioned why researchers wanted to close data on an open data platform. This is where it gets fun.

“Our researchers really just want a feature to keep their data private while their associated paper is under peer review. We see this frequently when people submit to PLOS.”

Yes, I had contributed to my own conflict.

While I laughed about how I was previously the person at PLOS convincing UC researchers to make their data public, I recognized that this would be an easy issue to clarify. And here we are.

The term “embargo” carries a negative connotation in the open community, and I ask that moving forward we not use it to describe keeping data private until an associated manuscript has been accepted. Let us call this “Private for Peer Review” or “Timed Release,” with a “Peer Review URL” that is available for sharing data during the peer review process, as Dryad does.

  • Embargoes imply that data are being held private for reasons other than the peer review process.
  • Embargoes are not appropriate if you have a funder, publisher, or other mandate to open up your data.
  • Embargoes are not appropriate for sensitive data, as these data should not be held in a public repository, even embargoed, unless access is mediated by a data access committee and the repository has proper security.
  • Embargoes are not appropriate for open Data Publications.

To embargo your data beyond the peer review process (or for other reasons) is to shield your data from being used, built upon, or validated. This is contrary to “Open” as a strategy for furthering scientific findings and scholarly communications.

Dash is implementing features that will allow researchers to choose, in line with what we believe is reasonable for peer review and revisions, a publication date up to six months after submission. If researchers choose to use this feature, they will be given a Peer Review URL that can be shared to download the data until the data are public. It is important to note though that while the data may be private during this time, the DOI for the data and associated metadata will be public and should be used for citation. These features will be for the use of Peer Review; we do not believe that data should be held private for a period of time on an open data publication platform for other reasons.

Opening up data, publishing data, and giving credit to data are all important in emphasizing that data are a credible and necessary piece of scholarly work. Dash and other repositories will allow data to be private through peer review (with the intent that data become public and accessible in the near future). However, my hope is that as the data revolution evolves, incentives to open up data sooner will become apparent. The first step is to check our vocabulary and limit the use of the term “embargo” to cases where data are being held private without an open data intention.


Data Publication: Sharing, Crediting, and Re-Using Research Data

In the most basic terms, Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: (1) data that are publicly accessible and preserved for an indefinite amount of time, (2) descriptive information about the data (metadata), and (3) a citation for the data (giving credit to the data). Why are these elements essential? Because together they make research data reusable and reproducible, which is the goal of a Data Publication.

Data are publicly accessible and preserved indefinitely

There are many ways for researchers to make their data publicly available, be it within the Supporting Information files of a journal article or within an institutional, field-specific, or general repository. For a true Data Publication, data should be submitted to a stable repository that can ensure the data will be available and stored for an indefinite amount of time. There are over a thousand repositories registered with re3data, and many publishers have repository guides to help with field-specific guidance. When data are not suitable for public deposition, i.e. when data contain sensitive information, they should still be stored in a preserved and compliant space. While this restriction makes advocating for data publishing and preservation harder, it is important to ensure these data neither violate ethical requirements nor end up locked in a filing cabinet and eventually thrown out. Preservation of data is a necessity for the future.

Data are described (data have metadata)

Data without proper documentation or descriptive metadata are about as useful as research without data. If a Data Publication is a citable piece of scholarly work, it should contain information that allows it to be a useful and valued piece of scholarly work. Documentation and metadata range from information about the software used for analysis to who funded the work. While these examples serve separate purposes (one for re-use and the other for credit), it is important that all information about the creation of the dataset (who, where, how, related publications) is available.
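For a sense of what “described” means in practice, here is a minimal record sketched along the lines of DataCite’s required properties (identifier, creators, titles, publisher, publication year, resource type). All values are invented for illustration.

```python
# A minimal dataset description modeled on DataCite's required properties.
# Values are invented; 0000-0002-1825-0097 is ORCID's documentation example.
minimal_metadata = {
    "identifier": {"identifierType": "DOI", "identifier": "10.5072/FK2EXAMPLE"},
    "creators": [{"creatorName": "Rivera, Ana",
                  "nameIdentifier": "0000-0002-1825-0097"}],  # ORCID iD
    "titles": [{"title": "Soil moisture measurements, Sierra Nevada, 2015-2016"}],
    "publisher": "UC Example Campus",
    "publicationYear": "2017",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}
```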

Data are citable and credible

We’ve established that datasets are an essential research output and an important piece of scholarly work, and they should receive the same benefits. Data need a persistent identifier (a stable link) that can be referenced. While many repositories use a DataCite DOI to fulfill this, some field-specific repositories use accession numbers (e.g. NCBI repositories) that can be referenced within a URL. This is one of the reasons data need to be available in a stable repository. It’s a bit difficult to reference and credit data that are on your hard drive!
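Putting the pieces together, DataCite recommends citing data in the form Creator (PublicationYear). Title. Publisher. Identifier. A small sketch with invented values:

```python
# DataCite's recommended citation form:
# Creator (PublicationYear). Title. Publisher. Identifier.
def data_citation(creator, year, title, publisher, doi):
    return f"{creator} ({year}). {title}. {publisher}. https://doi.org/{doi}"

print(data_citation("Rivera, Ana", 2017,
                    "Soil moisture measurements, Sierra Nevada, 2015-2016",
                    "UC Example Campus", "10.5072/FK2EXAMPLE"))
```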

If it’s so clear- why are there barriers?

Data publishing has become more widely accepted in the last ten years, with new standards from funders and publishers and a growth in stable repositories. However, there’s still work to be done and more questions to be answered before we reach mass adoption. Let’s start that conversation (you can be the questioner and I’ll be the advocate):

Organizing and submitting data are time intensive and in turn, costly

Trying to replicate a dataset from scratch takes much more time (and money) than publishing your data (see the robotics example here). Taking the time to search your old computer files or get in touch with your last institution to retrieve your data is more complicated than publishing your data. And having your paper retracted because your data are called into question and you can’t share or no longer have them would cost more in time, money, and reputation than proactively publishing your datasets.

As an important side note: Data Publications do not need to be linked to a journal publication. While it may take extra time to submit a Data Publication in proper form, used as an intermediate step in the research process it can save you time later, earn you credit, and benefit the research community in the meantime.

What’s the incentive?

Credit. Next question?

But beyond credit for a citable piece of work, publishing data as a common practice will shift publications from being an end point of the research cycle to a starting point, and this shift is crucial for transparency and reproducibility in published works. Incentives will become clear once Data Citations become common practice within the publisher and research community, and resources are available for researchers to know how (and have the time and funds) to submit Data Publications.

Too few resources for understanding Data Publishing

Many great papers have been posted and published in the last ten years about what a Data Publication is; however, fewer resources have been made available to the research community on how to integrate Data Publishing into the research life cycle and how to organize data so it is even suitable for a Data Publication. Data Management Plans, courses on research data management, and pressure from various funder and publisher policies will help, but there’s a serious need for education on data planning and organization (including metadata and format requirements), as well as awareness of data publishing platforms and their benefits. This is a call to the community to release these materials and engage the Research Data Management (RDM) community to get as many of these conversations going as possible. The more resources, answers, and guidance institutions can provide to researchers, the less the “it takes too much time and money” argument will arise, the easier it will be to realize the incentives, and the further we will push the boundaries of transparency in scholarly communications.

There’s no better time than now to re-evaluate what resources are available for research output. If we strive for re-use and reproducibility of research data within the community, then now is the time to increase awareness and adoption of Data Publication.

For more information about research data organizations, machine actionable Data Management Plans, or Data Publication platforms, please utilize UC3 resources or get in touch at uc3@ucop.edu.

There’s a new Dash!

Dash: an open source, community approach to data publication

We have great news! Last week we refreshed our Dash data publication service.  For those of you who don’t know, Dash is an open source, community driven project that takes a unique approach to data publication and digital preservation.

Dash focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated. It is a project based at the University of California Curation Center (UC3), a program at California Digital Library (CDL) that aims to develop interdisciplinary research data infrastructure.

Dash employs a multi-tenant user interface, providing partners with extensive opportunities for local branding and customization, use of existing campus login credentials, and, importantly, the Dash service under a tenant-specific URL, a consideration that helps drive adoption. We welcome collaborations with other organizations wishing to provide a simple, intuitive data publication service on top of more cumbersome legacy systems.

There are currently seven live UC instances of Dash, plus ONEshare:

  • UC Berkeley
  • UC Irvine
  • UC Merced
  • UC Office of the President
  • UC Riverside
  • UC Santa Cruz
  • UC San Francisco
  • ONEshare (in partnership with DataONE)

Architecture and Implementation

Dash is completely open source. Our code is made publicly available on GitHub (http://cdluc3.github.io/dash/). Dash is based on an underlying Ruby-on-Rails data publication platform called Stash. Stash encompasses three main functional components: Store, Harvest, and Share.

  • Store: The Store component is responsible for the selection of datasets; their description in terms of configurable metadata schemas, including specification of ORCID and FundRef identifiers for researcher and funder disambiguation; the assignment of DOIs for stable citation and retrieval; designation of an optional limited-time embargo; and packaging and submission to the integrated repository.
  • Harvest: The Harvest component is responsible for retrieval of descriptive metadata from that repository for inclusion in a Solr search index (a protocol sketch follows this list).
  • Share: The Share component, based on GeoBlacklight, is responsible for the faceted search and browse interface.
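As promised above, here is a minimal sketch of the protocol step behind the Harvest component: pulling Dublin Core records over OAI-PMH so they can be indexed (in Dash’s case, into Solr). Stash itself is Ruby on Rails; this Python sketch uses the third-party sickle library, and the endpoint URL is hypothetical.

```python
# Minimal OAI-PMH harvest, the protocol step behind the Harvest component.
# Requires the third-party "sickle" library; the endpoint is hypothetical.
from sickle import Sickle

client = Sickle("https://repository.example.org/oai")
for record in client.ListRecords(metadataPrefix="oai_dc"):
    md = record.metadata  # dict of Dublin Core fields, e.g. title, identifier
    print(md.get("title"), md.get("identifier"))
    break  # show just the first record in this sketch
```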

Dash Architecture Diagram

Individual dataset landing pages are formatted as an online version of a data paper, presenting all appropriate descriptive and administrative metadata in a form that can be downloaded as an individual PDF file, or as part of the complete dataset download package, incorporating all data files for all versions.

To facilitate flexible configuration and future enhancement, all support for the various external service providers and repository protocols is fully encapsulated in pluggable modules. Metadata modules are available for the DataCite and Dublin Core metadata schemas. Protocol modules are available for the SWORD 2.0 deposit protocol and the OAI-PMH and ResourceSync harvesting protocols. Authentication modules are available for InCommon/Shibboleth and Google/OAuth2 identity providers (IdPs). We welcome collaborations to develop modules for additional metadata schemas and repository protocols. Please email UC3 (uc3 at ucop dot edu) or visit GitHub (http://cdluc3.github.io/dash/) for more information.

Features of the newly refreshed Dash service

What are the new features in our refresh of the Dash service? Take a look.

| Feature | Tech-focused | User-focused | Description |
| --- | --- | --- | --- |
| Open Source | X | | All components are open source, MIT-licensed code (http://cdluc3.github.io/dash/) |
| Standards compliant | X | | Dash integrates with any SWORD/OAI-PMH-compliant repository |
| Pluggable Framework | X | | Inherent extensibility for supporting additional protocols and metadata schemas |
| Flexible metadata schemas | X | | Supports the DataCite metadata schema out of the box, but can be configured to support any schema |
| Innovation | X | | Our modular framework makes new feature development easier and quicker |
| Mobile/responsive design | X | X | Built mobile-first, from the ground up, for a better user experience |
| Geolocation – Metadata | X | X | For applicable research outputs, an easy way to capture the location of your datasets |
| Persistent Identifiers – ORCID | X | X | Dash allows researchers to attach their ORCID iD, allowing them to track and get credit for their work |
| Persistent Identifiers – DOIs | X | X | Dash issues DOIs for all datasets, allowing researchers to track and get credit for their work |
| Persistent Identifiers – FundRef | X | X | Dash tracks funder information using FundRef, allowing researchers and funders to track their research outputs |
| Login – Shibboleth/OAuth2 | X | X | Easy single sign-on with your campus credentials or Google account |
| Versioning | X | X | Datasets change. Dash offers a quick way to upload new versions of your datasets and a simple process for tracking updates |
| Accessibility | X | X | The technology, design, and user workflows have all been built with accessibility in mind |
| Better user experience | | X | Self-deposit made easy: simple workflow, drag-and-drop upload, simple navigation, clean data publication pages, user dashboards |
| Geolocation – Search | | X | With GeoBlacklight, we can offer search by location |
| Robust Search | | X | Search by subject, file type, keywords, campus, location, etc. |
| Discoverability | | X | Indexing by search engines such as Google and Bing |
| Build Relationships | | X | Many datasets are related to publications or other data. Dash offers a quick way to describe these relationships |
| Supports Best Practices | | X | Data publication can be confusing, but you can trust that Dash follows best practices |
| Data Metrics | | X | See the reach of your datasets through usage and download metrics |
| Data Citations | | X | Quick access to a well-formed citation reference (with DOI) for every data publication, easy for your peers to grab |
| Open License | | X | Dash supports open Creative Commons licensing for all data deposits; can be configured for other licenses |
| Lower Barrier to Entry | | X | For those in a hurry, Dash offers a quick interface to self-deposit: only three steps and few required fields |
| Supports Data Reuse | | X | Focuses researchers on describing methods and explaining ways to reuse their datasets |
| Satisfies Data Availability Requirements | | X | Many publishers and funders require researchers to make their data available. Dash is a readily accepted and easy way to comply |

A little Dash history

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health. Dash version 2 was developed by UC3 and partners with funding from the Alfred P. Sloan Foundation (our funded proposal). Read more about the code, the project, and contributing to development on the Dash GitHub site.

A little Dash future

We will continue the development of the new Dash platform and will keep you posted. Next up: support for timed deposits and embargoes.  Stay tuned!


An RDM Model for Researchers: What we’ve learned

Thanks to everyone who gave feedback on our previous blog post describing our data management tool for researchers. We received a great deal of input related to our guide’s use of the term “data sharing” and our guide’s position in relation to other RDM tools as well as quite a few questions about what our guide will include as we develop it further.

As stated in our initial post, we’re building a tool to enable individual researchers to assess the maturity of their data management practices within an institutional or organizational context. To do this, we’ve taken the concept of RDM maturity from existing tools like the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide and placed it within a framework familiar to researchers: the research data lifecycle.


A visualization of our guide as presented in our last blog post. An updated version, including changes made in response to reader feedback, is presented later in this post.

Data Sharing

The most immediate feedback we received was about the term “Data Sharing”. Several commenters pointed out the ambiguity of this term in the context of the research data life cycle. In the last iteration of our guide, we intended “Data Sharing” as a shorthand to describe activities related to the communication of data. Such activities may range from describing data in a traditional scholarly publication to depositing a dataset in a public repository or publishing a data paper. Because existing data sharing policies (e.g. PLOS, The Gates Foundation, and The Moore Foundation) refer specifically to the latter over the former, the term is clearly too imprecise for our guide.

Like “Data Sharing”, “Data Publication” is a popular term for describing activities surrounding the communication of data. Even more than “Sharing”, “Publication” conveys our desire to advance practices that treat data as a first-class research product. Unfortunately, the term is simultaneously too precise and too ambiguous to be useful in our guide. On one hand, “Data Publication” can refer specifically to a peer-reviewed document that presents a dataset without offering any analysis or conclusion. While data papers may be a straightforward way of inserting datasets into the existing scholarly communication ecosystem, they represent a single point on the continuum of data management maturity. On the other hand, there is currently no clear consensus among researchers about what it means to “publish” data.

For now, we’ve given that portion of our guide the preliminary label of “Data Output”. As the development process proceeds, this row will include a full range of activities, from description of data in traditional scholarly publications (which may or may not include a data availability statement) to depositing data into public repositories and publishing data papers.

Other Models and Guides

While we correctly identified that there is a range of rubrics, tools, and capability models with aims similar to our guide’s, we overstated that ours uniquely allows researchers to assess where they are and where they want to be in regards to data management. Several of the tools we cited in our initial post can be applied by researchers to measure the maturity of data management practices within a project or institutional context.

Below we’ve profiled four such tools and indicated how we believe our guide differs from each. In differentiating our guide, we do not mean to position it strictly as an alternative. Rather, we believe that our guide could be used in concert with these other tools.

Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)

CARDIO is a benchmarking tool designed to be used by researchers, service providers, and coordinators for collaborative data management strategy development. Designed to be applied at a variety of levels, from entire institutions down to individual research projects, CARDIO enables its users to collaboratively assess data management requirements, activities, and capacities using an online interface. Users of CARDIO rate their data management infrastructure relative to a series of statements concerning their organization, technology, and resources. After completing CARDIO, users are given a comprehensive set of quantitative capability ratings as well as a series of practical recommendations for improvement.

Unlike CARDIO, our guide does not necessarily assume its users are in contact with data-related service providers at their institution. As we stated in our initial blog post, we intend to guide researchers to specialist knowledge without necessarily turning them into specialists. Therefore, we would consider a researcher making contact with their local data management, research IT, or library service providers for the first time as a positive application of our guide.

Community Capability Model Framework (CCMF)

The Community Capability Model Framework is designed to evaluate a community’s readiness to perform data intensive research. Intended to be used by researchers, institutions, and funders to assess current capabilities, identify areas requiring investment, and develop roadmaps for achieving a target state of readiness, the CCMF encompasses eight “capability factors” including openness, skills and training, research culture, and technical infrastructure. When used alongside the Capability Profile Template, the CCMF provides its users with a scorecard containing multiple quantitative scores related to each capability factor.   

Unlike the CCMF, our guide does not necessarily assume that its users should all be striving towards the same level of data management maturity. We recognize that data management practices may vary significantly between institutions or research areas and that what works for one researcher may not necessarily work for another. Therefore, we would consider researchers understanding the maturity of their data management practices within their local contexts to be a positive application of our guide.

Data Curation Profiles (DCP) and DMVitals

The Data Curation Profile toolkit is intended to address the needs of an individual researcher or research group with regards to the “primary” data used for a particular project. Taking the form of a structured interview between an information professional and a researcher, a DCP can allow an individual research group to consider their long-term data needs, enable an institution to coordinate their data management services, or facilitate research into broader topics in digital curation and preservation.

DMVitals is a tool designed to take information from a source like a Data Curation Profile and use it to systematically assess a researcher’s data management practices in direct comparison to institutional and domain standards. Using the DMVitals, a consultant matches a list of evaluated data management practices with responses from an interview and ranks the researcher’s current practices by their level of data management “sustainability.” The tool then generates customized and actionable recommendations, which a consultant then provides to the researcher as guidance to improve his or her data management practices.  

Unlike DMVitals, our guide does not calculate a quantitative rating to describe the maturity of data management practices. From a measurement perspective, the range of practice maturity may differ between the four stages of our guide (e.g. the “Project Planning” stage could have greater or fewer steps than the “Data Collection” stage), which would significantly complicate the interpretation of any quantitative ratings derived from our guide. We also recognize that data management practices are constantly evolving and likely dependent on disciplinary and institutional context. On the other hand, we also recognize the utility of quantitative ratings for benchmarking. Therefore, if, after assessing the maturity of their data management practices with our guide, a researcher chooses to apply a tool like DMVitals, we would consider that a positive application of our guide.

Our Model (Redux)

Perhaps the biggest takeaway from the response to our  last blog post is that it is very difficult to give detailed feedback on a guide that is mostly whitespace. Below is an updated mock-up, which describes a set of RDM practices along the continuum of data management maturity. At present, we are not aiming to illustrate a full range of data management practices. More simply, this mock-up is intended to show the types of practices that could be described by our guide once it is complete.


An updated visualization of our guide based on reader feedback. At this stage, the example RDM practices are intended to be representative not comprehensive.

Project Planning

The “Project Planning” stage describes practices that occur prior to the start of data collection. Our examples are all centered around data management plans (DMPs), but other considerations at this stage could include training in data literacy, engagement with local RDM services, inclusion of “sharing” in project documentation (e.g. consent forms), and project pre-registration.

Data Collection

The “Data Collection” stage describes practices related to the acquisition, accumulation, measurement, or simulation of data. Our examples relate mostly to standards around file naming and structuring, but other considerations at this stage could include the protection of sensitive or restricted data, validation of data integrity, and specification of linked data.

Data Analysis

The “Data Analysis” stage describes practices that involve the inspection, modeling, cleaning, or transformation of data. Our examples mostly relate to documenting the analysis workflow, but other considerations at this stage could include the generation and annotation of code and the packaging of data within sharable files or formats.

Data Output

The “Data Output” stage describes practices that involve the communication of either the data itself or conclusions drawn from the data. Our examples are mostly related to the communication of data linked to scholarly publications, but other considerations at this stage could include journal and funder mandates around data sharing, the publication of data papers, and the long-term preservation of data.

Next Steps

Now that we’ve solicited a round of feedback from the community that works on issues around research support, data management, and digital curation, our next step is to broaden our scope to include researchers.

Specifically we are looking for help with the following:

  • Do you find the divisions within our model useful? We’ve used the research data lifecycle as a framework because we believe it makes our tool user-friendly for researchers. At the same time, we also acknowledge that the lines separating planning, collection, analysis, and output can be quite blurry. We would be grateful to know if researchers or data management service providers find these divisions useful or overly constrained.
  • Should there be more discrete “steps” within our framework? Because we view data management maturity as a continuum, we have shied away from creating discrete steps within each division. We would be grateful to know how researchers or data management service providers view this approach, especially when compared to the more quantitative approach employed by CARDIO, the Capability Profile Template, and DMVitals.
  • What else should we put into our model? Researchers are faced with changing expectations and obligations in regards to data management. We want our model to reflect that. We also want our model to reflect the relationship between research data management and broader issues like openness and reproducibility. With that in mind, what other practices and considerations should our model include?

CC BY and data: Not always a good fit

This post was originally published on the University of California Office of Scholarly Communication blog.

Last post I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools — and how they work with data.
