Category Archives: Tools for Data

Disambiguating Dash and Merritt

What’s Dash? What’s Merritt? What’s the difference? After numerous questions about where things should go and what the differences are between our UC3 services, we got the hint that we are not communicating clearly.

Clearing things up

A group of us sat down and talked through different use cases and what wording we were using that was causing such confusion, and have come up with what we hope is a disambiguation of Dash versus Merritt. 


Different intentions, different target users

While Dash and Merritt interact with each other at a technical level, they have different intentions and users should not be looking at these two services as a comparison. Dash is optimized for researchers and therefore its user interface, user experience, and metadata schema are optimized for use by individual researchers. Merritt is designed for use by institutional librarians, archivists, and curators.

Because of the different intended purposes, features, and users, UC3 does not recommend that Merritt be advertised to researchers on Research Data Management (RDM) sites or researcher-facing Library Guides.

Below are quick descriptions of each service that should clarify intentions and target users:

  • Dash is an open data publication platform for researchers. Self-service depositing of research data through Dash fulfills publisher, funder, and data management plan requirements regarding data sharing and preservation. When researchers publish their datasets through Dash, their datasets are issued a DOI to optimize citability, are publicly available for download and re-use under a CC BY 4.0 or CC0 license, and are preserved in Merritt, California Digital Library’s preservation repository. Dash is available to researchers at participating UC campuses, as well as researchers in Environmental and Earth Sciences through the DataONE network.
  • Merritt is a preservation repository for mediated deposits by UC organizations. We work with staff at UC libraries, archives, and departments to preserve digital assets and collections. Merritt offers bit-level preservation and replication with public or private access. Merritt is also the preservation repository that preserves Dash-deposited data.

The cost of service vs. the cost of storage

California Digital Library does not charge individual users for the Dash or Merritt services. However, we do recharge your institution for the amount of storage used in Merritt (remember, Dash preserves data in Merritt) on an annual basis.  On most campuses, the Library fully subsidizes Dash storage costs, so there is no extra financial obligation to individual researchers depositing data into Dash.

Follow-up

If you have any questions about edge cases or would like to know any more details about the architecture of the Dash platform or Merritt repository, please get in touch at uc3@ucop.edu.

And while you’re here: check out Dash’s new features for uploading large data sets, and uploading directly from the cloud.

Talking About Data: Lessons from Science Communication

As a person who worked for years in psychology and neuroscience laboratories before coming to work in academic libraries, I have particularly strong feelings about ambiguous definitions. One of my favorite anecdotes about my first year of graduate school involves watching two researchers argue about the definition of “attention” for several hours, multiple times a week, for an entire semester. One of the researchers was a clinical psychologist, the other a cognitive psychologist. Though they both devised research projects and wrote papers on the topic of attention, their theories and methods could not have been more different. The communication gap between them was so wide that they were never able to move forward productively. The punchline is that, after sitting through hours of their increasingly abstract and contentious arguments, I would go on to study attention using yet another set of theories and methods as a cognitive neuroscientist. Funny story aside, this anecdote illustrates the degree to which people with different perspectives and levels of expertise can define the same problem in strikingly different ways.


A facsimile of a visual search array used by cognitive psychologists to study attention. Spot the horizontal red rectangle.

In the decade that has elapsed since those arguments, I have undergone my own change in perspective: from a person who primarily collects and analyzes their own research data to a person who primarily thinks about ways to help other researchers manage and share their data. While my day-to-day activities look rather different, there is one aspect of my work as a library post-doc that is similar to my work as a neuroscientist: many of my colleagues ostensibly working on the same things often have strikingly different definitions, methods, and areas of expertise. Fortunately, I have been able to draw on a body of work that addresses this very thing: science communication.

Wicked Problems

A “wicked problem” is a problem that is extremely difficult to solve because different stakeholders define and address it in different ways. In my anecdote about argumentative professors, understanding attention can be considered a wicked problem. Without getting too far into the weeds, the clinical psychologist understood attention mostly in the context of diagnoses like Attention Deficit Disorder, while the cognitive psychologist understood it in the context of scanning visual environments for particular elements or features. As a cognitive neuroscientist, I came to understand it mostly in terms of its effects within neural networks as measured by brain imaging methods like fMRI.

Research data management (RDM) has been described as a wicked problem. A data service provider in an academic library may define RDM as “the documentation, curation, and preservation of research data”, while a researcher may define RDM as either simply part of their daily work or, in the case of something like a data management plan written for a grant proposal, as an extra burden placed upon such work. Other RDM stakeholders, including those affiliated with IT, research support, and university administration, may define it in yet other ways.

Science communication is chock full of wicked problems, including concepts like climate change and stem cell research. Actually, given the significant amount of scholarship devoted to defining terms like “scientific literacy” and the multitude of things that the term describes, science communication may itself be a wicked problem.

What is Scientific Communication?

Like attention and RDM, it is difficult to give a comprehensive definition of science communication. Documentaries like “Cosmos” are probably the most visible examples, but science communication actually comes in a wide variety of forms including science journalism, initiatives aimed at science outreach and advocacy, and science art. What these activities have in common is that they all generally aim to help people make informed decisions in a world dominated by science and technology. In parallel, there is also a burgeoning body of scholarship devoted to the science of science communication which, among other things, examines how effective different communication strategies are for changing people’s perceptions and behaviors around scientific topics.

For decades, the prevailing theory in science communication was the “Deficit Model”, which posits that scientific illiteracy is due to a simple lack of information. In the deficit model, skepticism about topics such as climate change is assumed to stem from a lack of comprehension of the science behind them. Thus, at least according to the deficit model, the “solution” to the problem of science communication is as straightforward as providing people with all the facts. In this conception, the audience is assumed to be homogeneous and communication is assumed to flow one way, from scientists to the general public.

Though the deficit model persists, study after study (after meta-analysis) has shown that merely providing people with facts about a scientific topic does not cause them to change their perceptions or behaviors related to that topic. Instead, it turns out that presenting facts that conflict with a person’s worldview can actually cause them to double down on that worldview. Also, audiences are not homogeneous. Putting aside differences in political and social worldviews, people have very different levels of scientific knowledge and relate to that knowledge in very different ways. For this reason, more modern models of science communication focus not on one-way transmission of information but on fostering active engagement, re-framing debates, and meeting people where they are. For example, one of the more effective strategies for getting people to pay attention to climate change is not to present them with a litany of (dramatic and terrifying) facts, but to link it to their everyday emotions and concerns.


Find the same rectangle as before. It takes a little longer now that the other objects have a wider variety of features, right? Read more about visual search tasks here.

Communicating About Data

If we adapt John Durant’s nicely succinct definition of science literacy, “What the general public ought to know about science,” to an RDM context, the result is something like “What researchers ought to know about handling data.” Thus, data services in academic libraries can be said to be a form of science communication. As with “traditional” science communicators, data service providers interact with audiences possessing perspectives and levels of knowledge different from their own. The major difference, of course, is that the audience for data service providers is specifically the research community.

There is converging evidence that many of the current efforts to foster better RDM have yielded mixed results. Recent studies of NSF data management plans have revealed a significant amount of variability in the degree to which researchers address data management-related concepts like metadata, data sharing, and long-term preservation. The audience of data service providers is, like those of more “traditional” science communicators, quite heterogeneous, so perhaps adopting methods from the repertoire of science communication could help foster more active engagement and the adoption of better practices. Many libraries and data service providers have already adopted some of these methods, perhaps without realizing their application in other domains. This is not to criticize existing efforts to engage researchers on the topic of RDM. If I’ve learned one thing from doing different forms of science communication over the years, it is that outreach is difficult and change is slow.

In a series of upcoming blog posts, I’ll write about some of my current projects that incorporate what I’ve written here. First up: I’ll provide an update of the RDM Maturity Model project that I previously described here and here. Coming soon!


Make Data Count: Building a System to Support Recognition of Data as a First Class Research Output

The Alfred P. Sloan Foundation has made a 2-year, $747K award to the California Digital Library, DataCite, and DataONE to support the collection of usage and citation metrics for data objects. Building on pilot work, this award will result in the launch of a new service that will collate and expose data-level metrics.

The impact of research has traditionally been measured by citations to journal publications: journal articles are the currency of scholarly research.  However, scholarly research is made up of a much larger and richer set of outputs beyond traditional publications, including research data. In order to track and report the reach of research data, methods for collecting metrics on complex research data are needed.  In this way, data can receive the same credit and recognition that is assigned to journal articles.

“Recognition of data as valuable output from the research process is increasing, and this project will greatly enhance awareness around the value of data and enable researchers to gain credit for the creation and publication of data.” – Ed Pentz, Crossref.

This project will work with the community to create a clear set of guidelines on how to define data usage. In addition, the project will develop a central hub for the collection of data-level metrics. These metrics will include data views, downloads, citations, saves, and social media mentions, and will be exposed through customized user interfaces deployed at partner organizations. Working in an open source environment, and including extensive user experience testing and community engagement, the products of this project will be available to data repositories, libraries, and other organizations to deploy within their own environments, serving their communities of data authors.
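As a rough illustration of what a collated data-level metrics record might contain (a sketch only; the actual schema, field names, and interfaces will be worked out with the community during the project), here is one way such a record could look:

```python
# Hypothetical record shape for illustration; the real Make Data Count
# schema is still to be defined. 10.5072 is a reserved DOI test prefix.
dataset_metrics = {
    "doi": "10.5072/example-dataset",
    "views": 1542,
    "downloads": 310,
    "citations": 12,
    "saves": 45,                      # e.g. bookmarks in reference managers
    "social_media_mentions": 27,
}

def total_engagement(record: dict) -> int:
    """Roll all numeric metrics into a single crude engagement figure."""
    return sum(value for value in record.values() if isinstance(value, int))

print(dataset_metrics["doi"], "total engagement:", total_engagement(dataset_metrics))
```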

Are you working in the data metrics space? Let’s collaborate.

Find out more and follow us at: www.makedatacount.org, @makedatacount

About the Partners

California Digital Library was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. The University of California Curation Center (UC3), one of four main programs within the CDL, helps researchers and the UC libraries manage, preserve, and provide access to their important digital assets, and develops tools and services that serve the community throughout the research and data life cycles.

DataCite is a leading global non-profit organization that provides persistent identifiers (DOIs) for research data. Our goal is to help the research community locate, identify, and cite research data with confidence. Through collaboration, DataCite supports researchers by helping them to find, identify, and cite research data; data centres by providing persistent identifiers, workflows and standards; and journal publishers by enabling research articles to be linked to the underlying data/objects.

DataONE (Data Observation Network for Earth) is an NSF DataNet project that is developing a distributed framework and sustainable cyberinfrastructure to meet the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

Describing the Research Process

We at UC3 are constantly developing new tools and resources to help researchers manage their data. However, while working on projects like our RDM guide for researchers, we’ve noticed that researchers, librarians, and people working in the broader digital curation space often talk about the research process in very different ways.

To help bridge this gap, we are conducting an informal survey to understand the terms researchers use when talking about the various stages of a research project.

If you are a researcher and can spare about 5 minutes, we would greatly appreciate it if you would click the link below to participate in our survey.

http://survey.az1.qualtrics.com/jfe/form/SV_a97IJAEMwR7ifRP

Thank you.

Ensuring access to critical research data

For the last two months, UC3 has been working with the teams at Data.gov, Data Refuge, Internet Archive, and Code for Science (creators of the Dat Project) to aggregate government data.

Data that spans the globe

There are currently volunteers across the country working to discover and preserve publicly funded research data, especially climate data, before it can be deleted or lost from the public record. The largest initiative is called Data Refuge and is led by librarians and scientists. They are holding events across UC campuses and the US that you can attend to help out in person, and they are organizing the library community to band together to curate the data and ensure it is preserved and accessible.

Our initiative builds on this work and aims to assemble a corpus of government data and corresponding metadata. We are focusing on public research data, especially data at risk of disappearing. The initiative was nicknamed “Svalbard” by Max Ogden of the Dat Project, after the Svalbard Global Seed Vault in the Arctic. As of today, our friends at Code for Science have released 38GB of metadata: over 30 million hashes and URLs of research data files.

The Svalbard Global Seed Vault in the Arctic

To aid in this effort

We have assembled the following metadata as part of Code for Science’s Svalbard v1:

  • 2.7 million SHA-256 hashes for all downloadable resources linked from Data.gov, representing around 40TB of data
  • 29 million SHA-1 hashes of files archived by the Internet Archive and the Archive Team from federal websites and FTP servers, representing over 120TB of data
  • All metadata from Data.gov, about 2.1 million datasets
  • A list of ~750 .gov and .mil FTP servers

There are additional sources, such as Archivers.Space, EDGI, Climate Mirror, and Azimuth Data Backup, that we are working on adding metadata for in future releases.

Following the principles set forth by the librarians behind Data Refuge, we believe it’s important to establish a clear and trustworthy chain of custody for research datasets so that mirror copies can be trusted. With this project, we are working to curate metadata that includes strong cryptographic hashes of data files in addition to metadata that can be used to reproduce a download procedure from the originating host.
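As an example of the kind of verification this enables, a mirror holder could check a downloaded file against a published SHA-256 hash with a few lines of Python. This is a minimal sketch: the URL and hash below are placeholders, and the actual layout of the released metadata may differ.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Placeholder metadata entry: a source URL and the hash of its content.
expected = {
    "url": "https://example.gov/dataset.csv",
    "sha256": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

local_copy = "dataset.csv"  # the mirrored file to check
if sha256_of_file(local_copy) == expected["sha256"]:
    print("OK: local copy matches the published hash")
else:
    print("MISMATCH: local copy differs from the published hash")
```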

We are hoping the community can use this data in the following ways:

  • To independently verify that the mirroring processes that produced these hashes can be reproduced
  • To aid in developing new forms of redundant dataset distribution (such as peer to peer networks)
  • To seed additional web crawls or scraping efforts with additional dataset source URLs
  • To encourage other archiving efforts to publish their metadata in an easily accessible format
  • To cross reference data across archives, for deduplication or verification purposes

What about the data?

The metadata is great, but the initial release of 30 million hashes and URLs is just one part of our project. The actual content (the files from which the hashes were derived) has also been downloaded. It is stored either at the Internet Archive or on our California Digital Library servers.

The Dat Project carried out a Data.gov HTTP mirror (~40TB) and uploaded it to our servers at California Digital Library. We are working with them to access ~160TB of data in the future and have partnered with UC Riverside to offer longer-term storage.

Download

You can download the metadata here using Dat Desktop or the Dat CLI tool. We are using the Dat Protocol for distribution so that we can publish new metadata releases efficiently while still keeping the old versions around. Dat provides a secure cryptographic ledger, similar in concept to a blockchain, that can verify the integrity of updates.
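To give a feel for the ledger concept, here is a conceptual sketch (not Dat’s actual data structures or wire format): each release is modeled as a log entry whose hash covers both its content and the previous entry’s hash, so any alteration of past releases breaks verification.

```python
import hashlib
import json

def entry_hash(content: bytes, prev_hash: str) -> str:
    """Hash an entry together with its predecessor, chaining the log."""
    return hashlib.sha256(prev_hash.encode() + content).hexdigest()

# Conceptual append-only log of two metadata releases.
log = []
prev = ""  # genesis entry has no predecessor
for release in [b"svalbard-v1-metadata", b"svalbard-v2-metadata"]:
    prev = entry_hash(release, prev)
    log.append({"content": release.decode(), "hash": prev})

# A reader re-derives every hash to confirm no past release was altered.
check = ""
for entry in log:
    check = entry_hash(entry["content"].encode(), check)
    assert check == entry["hash"], "ledger verification failed"

print("ledger verified:")
print(json.dumps(log, indent=2))
```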

Feedback

If you want to learn more about how CDL and the UC3 team are involved, contact us at uc3@ucop.edu or @UC3CDL. If you have suggestions or questions, you can join the Code for Science Community Chat. And, if you are a technical user, you can report issues or get involved at the Svalbard GitHub.

This is crossposted here: https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab#.f933mmts8

csv conf is back in 2017!

csv,conf,v3 is happening!

This time the community-run conference will be in Portland, Oregon, USA on the 2nd and 3rd of May 2017. It will feature stories about data sharing and data analysis from science, journalism, government, and open source. We want to bring together data makers/doers/hackers from backgrounds like science, journalism, open government, and the wider software industry to share knowledge and stories.

csv,conf is a non-profit community conference run by people who love data and sharing knowledge. This isn’t just a conference about spreadsheets. CSV Conference is a conference about data sharing and data tools. We are curating content about advancing the art of data collaboration, from putting your data on GitHub to producing meaningful insight by running large scale distributed processing on a cluster.

Submit a Talk!  Talk proposals for csv,conf close Feb 15, so don’t delay, submit today! The deadline is fast approaching and we want to hear from a diverse range of voices from the data community.

Talks are 20 minutes long and can be about any data-related concept that you think is interesting. There are no rules for our talks; we just want you to propose a topic you are passionate about and think a room full of data nerds will also find interesting. You can check out some of the past talks from csv,conf,v1 and csv,conf,v2 to get an idea of what has been pitched before.

If you are passionate about data and the many applications it has in society, then join us in Portland!


Speaker perks:

  • Free pass to the conference
  • Limited number of travel awards available for those unable to pay
  • Did we mention it’s in Portland in the Spring????

Submit a talk proposal today at csvconf.com

Early bird tickets are now on sale here.

If you have colleagues or friends who you think would be a great addition to the conference, please forward this invitation along to them! csv,conf,v3 is committed to bringing a diverse group together to discuss data topics. 

– UC3 and the entire csv,conf,v3 team

For questions, please email csv-conf-coord@googlegroups.com, DM @csvconference or join the csv,conf public slack channel.

This was cross-posted from the Open Knowledge International Blog: http://blog.okfn.org/2017/01/12/csvconf-is-back-in-2017-submit-talk-proposals-on-the-art-of-data-analysis-and-collaboration/

Dispatches from PIDapalooza

Last month, California Digital Library, ORCID, Crossref, and DataCite brought together the brightest minds in scholarly infrastructure to do the impossible: make a conference on persistent identifiers fun!


Usually, discussions about persistent identifiers (PIDs) and networked research are dry and hard to get through, or we find ourselves discussing the basics and never getting to the meat.

We designed PIDapalooza to attract kindred spirits who are passionate about improving interoperability and the overall quality of our scholarly infrastructure. We knew if we built it, they would come!

The results were fantastic, and there was a great showing from the University of California community.

All PIDapalooza presentations are being archived on Figshare: https://pidapalooza.figshare.com

Take a look and make sure you are following @pidapalooza for word on future PID fun!


There’s a new Dash!

Dash: an open source, community approach to data publication

We have great news! Last week we refreshed our Dash data publication service.  For those of you who don’t know, Dash is an open source, community driven project that takes a unique approach to data publication and digital preservation.

Dash focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated. It is a project based at the University of California Curation Center (UC3), a program at California Digital Library (CDL) that aims to develop interdisciplinary research data infrastructure.

Dash employs a multi-tenant user interface, providing partners with extensive opportunities for local branding and customization, use of existing campus login credentials, and, importantly, the Dash service under a tenant-specific URL, a consideration that helps drive adoption. We welcome collaborations with other organizations wishing to provide a simple, intuitive data publication service on top of more cumbersome legacy systems.

There are currently seven live UC instances of Dash, plus ONEshare:

  • UC Berkeley
  • UC Irvine
  • UC Merced
  • UC Office of the President
  • UC Riverside
  • UC Santa Cruz
  • UC San Francisco
  • ONEshare (in partnership with DataONE)

Architecture and Implementation

Dash is completely open source. Our code is made publicly available on GitHub (http://cdluc3.github.io/dash/). Dash is based on an underlying Ruby-on-Rails data publication platform called Stash. Stash encompasses three main functional components: Store, Harvest, and Share.

  • Store: The Store component is responsible for the selection of datasets; their description in terms of configurable metadata schemas, including specification of ORCID and FundRef identifiers for researcher and funder disambiguation; the assignment of DOIs for stable citation and retrieval; designation of an optional limited-time embargo; and packaging and submission to the integrated repository
  • Harvest: The Harvest component is responsible for retrieval of descriptive metadata from that repository for inclusion in a Solr search index (a minimal harvesting sketch follows this list)
  • Share: The Share component, based on GeoBlacklight, is responsible for the faceted search and browse interface
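To give a flavor of what the Harvest step involves, the sketch below issues a standard OAI-PMH ListRecords request and pulls out Dublin Core titles, the kind of fields that would feed a Solr index. This is a minimal illustration against a hypothetical endpoint, not Stash’s actual harvester code.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint; any OAI-PMH provider answers the same protocol.
BASE_URL = "https://repository.example.edu/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# ListRecords with the oai_dc prefix returns Dublin Core for each record.
url = BASE_URL + "?verb=ListRecords&metadataPrefix=oai_dc"
with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

# Each title (and the other Dublin Core fields) could then be indexed.
for record in tree.iter(OAI + "record"):
    for title in record.iter(DC + "title"):
        print(title.text)
```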

Dash Architecture Diagram

Individual dataset landing pages are formatted as an online version of a data paper, presenting all appropriate descriptive and administrative metadata in a form that can be downloaded as an individual PDF file, or as part of the complete dataset download package, incorporating all data files for all versions.

To facilitate flexible configuration and future enhancement, all support for the various external service providers and repository protocols is fully encapsulated in pluggable modules. Metadata modules are available for the DataCite and Dublin Core metadata schemas. Protocol modules are available for the SWORD 2.0 deposit protocol and the OAI-PMH and ResourceSync harvesting protocols. Authentication modules are available for InCommon/Shibboleth and Google/OAuth2 identity providers (IdPs). We welcome collaborations to develop additional modules for additional metadata schemas and repository protocols. Please email UC3 (uc3 at ucop dot edu) or visit GitHub (http://cdluc3.github.io/dash/) for more information.
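As a rough sketch of what this encapsulation buys (illustrative Python rather than Stash’s actual Ruby module interface, with heavily simplified schema fields), each metadata schema can sit behind a common interface so the rest of the platform never hard-codes a particular schema:

```python
from abc import ABC, abstractmethod

class MetadataModule(ABC):
    """Common interface that every pluggable metadata schema implements."""

    @abstractmethod
    def serialize(self, dataset: dict) -> str:
        """Render a dataset's fields in this module's schema."""

class DublinCoreModule(MetadataModule):
    def serialize(self, dataset: dict) -> str:
        return (f"<dc:title>{dataset['title']}</dc:title>"
                f"<dc:creator>{dataset['creator']}</dc:creator>")

class DataCiteModule(MetadataModule):
    def serialize(self, dataset: dict) -> str:
        return (f"<creatorName>{dataset['creator']}</creatorName>"
                f"<title>{dataset['title']}</title>")

# The platform picks a module by configuration, not by hard-coded schema.
modules = {"dublin_core": DublinCoreModule(), "datacite": DataCiteModule()}
dataset = {"title": "Example dataset", "creator": "Doe, Jane"}
print(modules["datacite"].serialize(dataset))
```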

Features of the newly refreshed Dash service

What’s new in our refresh of the Dash service? Take a look.

| Feature | Tech-focused | User-focused | Description |
| --- | --- | --- | --- |
| Open Source | X | | All components open source, MIT-licensed code (http://cdluc3.github.io/dash/) |
| Standards compliant | X | | Dash integrates with any SWORD/OAI-PMH-compliant repository |
| Pluggable Framework | X | | Inherent extensibility for supporting additional protocols and metadata schemas |
| Flexible metadata schemas | X | | Supports the DataCite metadata schema out of the box, but can be configured to support any schema |
| Innovation | X | | Our modular framework will make new feature development easier and quicker |
| Mobile/responsive design | X | X | Built mobile-first, from the ground up, for a better user experience |
| Geolocation – Metadata | X | X | For applicable research outputs, an easy-to-use way to capture the location of your datasets |
| Persistent Identifiers – ORCID | X | X | Dash allows researchers to attach their ORCID, allowing them to track and get credit for their work |
| Persistent Identifiers – DOIs | X | X | Dash issues DOIs for all datasets, allowing researchers to track and get credit for their work |
| Persistent Identifiers – FundRef | X | X | Dash tracks funder information using FundRef, allowing researchers and funders to track their research outputs |
| Login – Shibboleth/OAuth2 | X | X | Easy single sign-on with your campus credentials or Google account |
| Versioning | X | X | Datasets can change. Dash offers a quick way to upload new versions of your datasets and a simple process for tracking updates |
| Accessibility | X | X | The technology, design, and user workflows have all been built with accessibility in mind |
| Better user experience | | X | Self-depositing made easy: simple workflow, drag-and-drop upload, simple navigation, clean data publication pages, user dashboards |
| Geolocation – Search | | X | With GeoBlacklight, we can offer search by location |
| Robust Search | | X | Search by subject, filetype, keywords, campus, location, etc. |
| Discoverability | | X | Indexing by search engines such as Google and Bing |
| Build Relationships | | X | Many datasets are related to publications or other data. Dash offers a quick way to describe these relationships |
| Supports Best Practices | | X | Data publication can be confusing, but you can trust that Dash follows best practices |
| Data Metrics | | X | See the reach of your datasets through usage and download metrics |
| Data Citations | | X | Quick access to a well-formed citation reference (with DOI) for every data publication, easy for your peers to grab |
| Open License | | X | Dash supports open Creative Commons licensing for all data deposits; can be configured for other licenses |
| Lower Barrier to Entry | | X | For those in a hurry, Dash offers a quick self-deposit interface: only three steps and few required fields |
| Supports Data Reuse | | X | Focuses researchers on describing methods and explaining ways to reuse their datasets |
| Satisfies Data Availability Requirements | | X | Many publishers and funders require researchers to make their data available. Dash is a readily accepted and easy way to comply |

A little Dash history

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health. Dash version 2 was developed by UC3 and partners with funding from the Alfred P. Sloan Foundation (our funded proposal). Read more about the code, the project, and contributing to development on the Dash GitHub site.

A little Dash future

We will continue the development of the new Dash platform and will keep you posted. Next up: support for timed deposits and embargoes.  Stay tuned!


An RDM Model for Researchers: What we’ve learned

Thanks to everyone who gave feedback on our previous blog post describing our data management tool for researchers. We received a great deal of input related to our guide’s use of the term “data sharing” and our guide’s position in relation to other RDM tools, as well as quite a few questions about what our guide will include as we develop it further.

As stated in our initial post, we’re building a tool to enable individual researchers to assess the maturity of their data management practices within an institutional or organizational context. To do this, we’ve taken the concept of RDM maturity from existing tools like the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide and placed it within a framework familiar to researchers: the research data lifecycle.


A visualization of our guide as presented in our last blog post. An updated version, including changes made in response to reader feedback, is presented later in this post.

Data Sharing

The most immediate feedback we received was about the term “Data Sharing”. Several commenters pointed out the ambiguity of this term in the context of the research data life cycle. In the last iteration of our guide, we intended “Data Sharing” as a shorthand to describe activities related to the communication of data. Such activities may range from describing data in a traditional scholarly publication to depositing a dataset in a public repository or publishing a data paper. Because existing data sharing policies (e.g. PLOS, The Gates Foundation, and The Moore Foundation) refer specifically to the latter over the former, the term is clearly too imprecise for our guide.

Like “Data Sharing”, “Data Publication” is a popular term for describing activities surrounding the communication of data. Even more than “Sharing”, “Publication” conveys our desire to advance practices that treat data as a first-class research product. Unfortunately, the term is simultaneously too precise and too ambiguous to be useful in our guide. On one hand, the term “Data Publication” can refer specifically to a peer-reviewed document that presents a dataset without offering any analysis or conclusion. While data papers may be a straightforward way of inserting datasets into the existing scholarly communication ecosystem, they represent a single point on the continuum of data management maturity. On the other hand, there is currently no clear consensus among researchers about what it means to “publish” data.

For now, we’ve given that portion of our guide the preliminary label of “Data Output”. As the development process proceeds, this row will include a full range of activities, from the description of data in traditional scholarly publications (which may or may not include a data availability statement) to the deposit of data into public repositories and the publication of data papers.

Other Models and Guides

While we correctly identified that there is a range of rubrics, tools, and capability models with aims similar to those of our guide, we overstated that ours uniquely allows researchers to assess where they are and where they want to be in regards to data management. Several of the tools we cited in our initial post can be applied by researchers to measure the maturity of data management practices within a project or institutional context.

Below we’ve profiled four such tools and indicated how we believe our guide differs from each. In differentiating our guide, we do not mean to position it strictly as an alternative. Rather, we believe that our guide could be used in concert with these other tools.

Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)

CARDIO is a benchmarking tool designed to be used by researchers, service providers, and coordinators for collaborative data management strategy development. Designed to be applied at a variety of levels, from entire institutions down to individual research projects, CARDIO enables its users to collaboratively assess data management requirements, activities, and capacities using an online interface. Users of CARDIO rate their data management infrastructure relative to a series of statements concerning their organization, technology, and resources. After completing CARDIO, users are given a comprehensive set of quantitative capability ratings as well as a series of practical recommendations for improvement.

Unlike CARDIO, our guide does not necessarily assume its users are in contact with data-related service providers at their institution. As we stated in our initial blog post, we intend to guide researchers to specialist knowledge without necessarily turning them into specialists. Therefore, we would consider a researcher making contact with their local data management, research IT, or library service providers for the first time as a positive application of our guide.

Community Capability Model Framework (CCMF)

The Community Capability Model Framework is designed to evaluate a community’s readiness to perform data intensive research. Intended to be used by researchers, institutions, and funders to assess current capabilities, identify areas requiring investment, and develop roadmaps for achieving a target state of readiness, the CCMF encompasses eight “capability factors” including openness, skills and training, research culture, and technical infrastructure. When used alongside the Capability Profile Template, the CCMF provides its users with a scorecard containing multiple quantitative scores related to each capability factor.   

Unlike the CCMF, our guide does not necessarily assume that its users should all be striving towards the same level of data management maturity. We recognize that data management practices may vary significantly between institutions or research areas and that what works for one researcher may not necessarily work for another. Therefore, we would consider researchers understanding the maturity of their data management practices within their local contexts to be a positive application of our guide.

Data Curation Profiles (DCP) and DMVitals

The Data Curation Profile toolkit is intended to address the needs of an individual researcher or research group with regards to the “primary” data used for a particular project. Taking the form of a structured interview between an information professional and a researcher, a DCP can allow an individual research group to consider their long-term data needs, enable an institution to coordinate their data management services, or facilitate research into broader topics in digital curation and preservation.

DMVitals is a tool designed to take information from a source like a Data Curation Profile and use it to systematically assess a researcher’s data management practices in direct comparison to institutional and domain standards. Using DMVitals, a consultant matches a list of evaluated data management practices with responses from an interview and ranks the researcher’s current practices by their level of data management “sustainability.” The tool then generates customized and actionable recommendations, which the consultant provides to the researcher as guidance to improve his or her data management practices.

Unlike DMVitals, our guide does not calculate a quantitative rating to describe the maturity of data management practices. From a measurement perspective, the range of practice maturity may differ between the four stages of our guide (e.g. the “Project Planning” stage could have greater or fewer steps than the “Data Collection” stage), which would significantly complicate the interpretation of any quantitative ratings derived from our guide. We also recognize that data management practices are constantly evolving and likely dependent on disciplinary and institutional context. On the other hand, we also recognize the utility of quantitative ratings for benchmarking. Therefore, if, after assessing the maturity of their data management practices with our guide, a researcher chooses to apply a tool like DMVitals, we would consider that a positive application of our guide.

Our Model (Redux)

Perhaps the biggest takeaway from the response to our  last blog post is that it is very difficult to give detailed feedback on a guide that is mostly whitespace. Below is an updated mock-up, which describes a set of RDM practices along the continuum of data management maturity. At present, we are not aiming to illustrate a full range of data management practices. More simply, this mock-up is intended to show the types of practices that could be described by our guide once it is complete.


An updated visualization of our guide based on reader feedback. At this stage, the example RDM practices are intended to be representative not comprehensive.

Project Planning

The “Project Planning” stage describes practices that occur prior to the start of data collection. Our examples are all centered around data management plans (DMPs), but other considerations at this stage could include training in data literacy, engagement with local RDM services, inclusion of “sharing” in project documentation (e.g. consent forms), and project pre-registration.

Data Collection

The “Data Collection” stage describes practices related to the acquisition, accumulation, measurement, or simulation of data. Our examples relate mostly to standards around file naming and structuring, but other considerations at this stage could include the protection of sensitive or restricted data, validation of data integrity, and specification of linked data.

Data Analysis

The “Data Analysis” stage describes practices that involve the inspection, modeling, cleaning, or transformation of data. Our examples mostly relate to documenting the analysis workflow, but other considerations at this stage could include the generation and annotation of code and the packaging of data within sharable files or formats.

Data Output

The “Data Output” stage describes practices that involve the communication of either the data itself or conclusions drawn from the data. Our examples are mostly related to the communication of data linked to scholarly publications, but other considerations at this stage could include journal and funder mandates around data sharing, the publication of data papers, and the long-term preservation of data.

Next Steps

Now that we’ve solicited a round of feedback from the community that works on issues around research support, data management, and digital curation, our next step is to broaden our scope to include researchers.

Specifically we are looking for help with the following:

  • Do you find the divisions within our model useful? We’ve used the research data lifecycle as a framework because we believe it makes our tool user-friendly for researchers. At the same time, we also acknowledge that the lines separating planning, collection, analysis, and output can be quite blurry. We would be grateful to know if researchers or data management service providers find these divisions useful or overly constrained.
  • Should there be more discrete “steps” within our framework? Because we view data management maturity as a continuum, we have shied away from creating discrete steps within each division. We would be grateful to know how researchers or data management service providers view this approach, especially when compared to the more quantitative approach employed by CARDIO, the Capability Profile Template, and DMVitals.
  • What else should we put into our model? Researchers are faced with changing expectations and obligations in regards to data management. We want our model to reflect that. We also want our model to reflect the relationship between research data management and broader issues like openness and reproducibility. With that in mind, what other practices and considerations should our model include?

Building a user-friendly RDM maturity model

UC3 is developing a guide to help researchers assess and improve the maturity of their data management practices.

What are we doing?

Researchers are increasingly faced with new expectations and obligations in regards to data management. To help researchers navigate this changing landscape and to complement existing instruments that enable librarians and other data managers to assess the maturity of data management practices at an institutional or organizational level, we’re developing a guide that will enable researchers to assess the maturity of their individual practices within an institutional or organizational context.

Our aim is to be descriptive rather than prescriptive. We do not assume every researcher will want or need to achieve the same level of maturity for all their data management practices. Rather, we aim to provide researchers with a guide to specialist knowledge without necessarily turning researchers into specialists. We want to help researchers understand where they are and, where appropriate, how to get to where they want or need to be.

Existing Models

As a first step in building our own guide, we’ve researched the range of related tools, rubrics, and capability models. Many, including the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide developed by the Australian National Data Service, draw heavily from the SEI Capability Maturity Model and are intended to assist librarians, repository managers, and other data management service providers in benchmarking the policies, infrastructure, and services of their organization or institution. Others, including the Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO), DMVitals, and the Community Capability Framework, incorporate feedback from researchers and are intended to assist in benchmarking a broad set of data management-related topics for a broad set of stakeholders, from organizations and institutions down to individual research groups.

We intend for our guide to build on these tools but to have a different, and we think novel, focus. While we believe it could be a useful tool for data management service providers, the intended audience of our guide is research practitioners. While integration with service providers in the library, research IT, and elsewhere will be included where appropriate, the focus will be on equipping researchers to assess and refine their own data management activities. While technical infrastructure will be included where appropriate, the focus will be on behaviors, “soft skills”, and training.

Our Guide

Below is a preliminary mockup of our guide. Akin to the “How Open Is It?” guide developed by SPARC, PLOS, and the OASPA, our aim is to provide a tool that is comprehensive, user-friendly, and provides tangible recommendations.  


Obviously we still have a significant amount of work to do to refine the language and fill in the details. At the moment, we are using elements of the research data lifecycle to broadly describe research activities and very general terms to describe the continuum of practice maturity. Our next step is to fill in the blanks: to more precisely describe research activities and more clearly delineate the stages of practice maturity. From there, we will work to outline the behaviors, skills, and expertise present for each research activity at each stage.

Next Steps

Now that we’ve researched existing tools for assessing data management services and sketched out a preliminary framework for our guide, our next step is to elicit feedback from the broader community that works on issues around research support, data management, and digital curation and preservation.

Specifically we are looking for help on the following:

  • Have we missed anything? There is a range of data management-related rubrics, tools, and capability models – from the community-focused frameworks described above to frameworks focused on the preservation and curation of digital assets (e.g. the Data Asset Framework and DRAMBORA). As far as we’re aware, there isn’t a complementary tool that allows researchers to assess where they are and where they want to be in regards to data management. Are there efforts that have already met this need? We’d be grateful for any input about the existence of frameworks with similar goals.
  • What would be the most useful divisions and steps within our framework? The “three-legged stool” developed by the Digital Preservation Management workshop has been highly influential for community- and data management provider-focused tools. Though examining policies, resources, and infrastructure is also important for researchers when self-assessing their data management practices, we believe it would be more useful for our guide to reflect how data is generated, managed, and disseminated in a research context. We’d be grateful for any insight into how we could incorporate related models – such as those depicting the research data lifecycle – into our framework.