An RDM Model for Researchers: What we’ve learned

Thanks to everyone who gave feedback on our previous blog post describing our data management tool for researchers. We received a great deal of input on our guide’s use of the term “data sharing” and on its position in relation to other RDM tools, as well as quite a few questions about what our guide will include as we develop it further.

As stated in our initial post, we’re building a tool to enable individual researchers to assess the maturity of their data management practices within an institutional or organizational context. To do this, we’ve taken the concept of RDM maturity from existing tools like the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide and placed it within a framework familiar to researchers, the research data lifecycle.


A visualization of our guide as presented in our last blog post. An updated version, including changes made in response to reader feedback, is presented later in this post.

Data Sharing

The most immediate feedback we received was about the term “Data Sharing”. Several commenters pointed out the ambiguity of this term in the context of the research data life cycle. In the last iteration of our guide, we intended “Data Sharing” as a shorthand to describe activities related to the communication of data. Such activities may range from describing data in a traditional scholarly publication to depositing a dataset in a public repository or publishing a data paper. Because existing data sharing policies (e.g. PLOS, The Gates Foundation, and The Moore Foundation) refer specifically to the latter over the former, the term is clearly too imprecise for our guide.

Like “Data Sharing”, “Data Publication” is a popular term for describing activities surrounding the communication of data. Even more than “Sharing”, “Publication” relays our desire to advance practices that treat data as a first-class research product. Unfortunately, the term is simultaneously too precise and too ambiguous to be useful in our guide. On one hand, the term “Data Publication” can refer specifically to a peer-reviewed document that presents a dataset without offering any analysis or conclusion. While data papers may be a straightforward way of inserting datasets into the existing scholarly communication ecosystem, they represent a single point on the continuum of data management maturity. On the other hand, there is currently no clear consensus among researchers about what it means to “publish” data.

For now, we’ve given that portion of our guide the preliminary label of “Data Output”. As the development process proceeds, this row will include a full range of activities, from description of data in traditional scholarly publications (that may or may not include a data availability statement) to depositing data into public repositories and the publication of data papers.

Other Models and Guides

While we correctly identified that there are a range of rubrics, tools, and capability models with aims similar to those of our guide, we overstated that ours uniquely allows researchers to assess where they are and where they want to be with regard to data management. Several of the tools we cited in our initial post can be applied by researchers to measure the maturity of data management practices within a project or institutional context.

Below we’ve profiled four such tools and indicated how we believe our guide differs from each. In differentiating our guide, we do not mean to position it strictly as an alternative. Rather, we believe that our guide could be used in concert with these other tools.

Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)

CARDIO is a benchmarking tool designed to be used by researchers, service providers, and coordinators for collaborative data management strategy development. Designed to be applied at a variety of levels, from entire institutions down to individual research projects, CARDIO enables its users to collaboratively assess data management requirements, activities, and capacities using an online interface. Users of CARDIO rate their data management infrastructure relative to a series of statements concerning their organization, technology, and resources. After completing CARDIO, users are given a comprehensive set of quantitative capability ratings as well as a series of practical recommendations for improvement.

Unlike CARDIO, our guide does not necessarily assume its users are in contact with data-related service providers at their institution. As we stated in our initial blog post, we intend to guide researchers to specialist knowledge without necessarily turning them into specialists. Therefore, we would consider a researcher making contact with their local data management, research IT, or library service providers for the first time as a positive application of our guide.

Community Capability Model Framework (CCMF)

The Community Capability Model Framework is designed to evaluate a community’s readiness to perform data-intensive research. Intended to be used by researchers, institutions, and funders to assess current capabilities, identify areas requiring investment, and develop roadmaps for achieving a target state of readiness, the CCMF encompasses eight “capability factors” including openness, skills and training, research culture, and technical infrastructure. When used alongside the Capability Profile Template, the CCMF provides its users with a scorecard containing multiple quantitative scores related to each capability factor.

Unlike the CCMF, our guide does not necessarily assume that its users should all be striving towards the same level of data management maturity. We recognize that data management practices may vary significantly between institutions or research areas and that what works for one researcher may not necessarily work for another. Therefore, we would consider researchers understanding the maturity of their data management practices within their local contexts to be a positive application of our guide.

Data Curation Profiles (DCP) and DMVitals

The Data Curation Profile toolkit is intended to address the needs of an individual researcher or research group with regard to the “primary” data used for a particular project. Taking the form of a structured interview between an information professional and a researcher, a DCP can allow an individual research group to consider their long-term data needs, enable an institution to coordinate their data management services, or facilitate research into broader topics in digital curation and preservation.

DMVitals is a tool designed to take information from a source like a Data Curation Profile and use it to systematically assess a researcher’s data management practices in direct comparison to institutional and domain standards. Using DMVitals, a consultant matches a list of evaluated data management practices with responses from an interview and ranks the researcher’s current practices by their level of data management “sustainability.” The tool then generates customized and actionable recommendations, which a consultant then provides to the researcher as guidance to improve his or her data management practices.

Unlike DMVitals, our guide does not calculate a quantitative rating to describe the maturity of data management practices. From a measurement perspective, the range of practice maturity may differ between the four stages of our guide (e.g. the “Project Planning” stage could have greater or fewer steps than the “Data Collection” stage), which would significantly complicate the interpretation of any quantitative ratings derived from our guide. We also recognize that data management practices are constantly evolving and likely dependent on disciplinary and institutional context. On the other hand, we also recognize the utility of quantitative ratings for benchmarking. Therefore, if, after assessing the maturity of their data management practices with our guide, a researcher chooses to apply a tool like DMVitals, we would consider that a positive application of our guide.
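
As a rough illustration of the measurement concern above, here is a minimal Python sketch. The step counts and the function name `relative_position` are hypothetical and are not part of our guide; they only show how the same raw rating can land at different points on the continuum when stages have different numbers of steps.

```python
# Hypothetical step counts per stage; our guide defines no such numbers.
STEPS_PER_STAGE = {
    "Project Planning": 3,
    "Data Collection": 5,
}


def relative_position(stage: str, step: int) -> float:
    """Return where `step` falls within a stage's range, from 0.0 to 1.0."""
    total = STEPS_PER_STAGE[stage]
    if not 1 <= step <= total:
        raise ValueError(f"step must be between 1 and {total} for {stage}")
    return (step - 1) / (total - 1)


# The same raw rating of 3 sits at the top of a 3-step stage
# but only in the middle of a 5-step stage.
planning = relative_position("Project Planning", 3)
collection = relative_position("Data Collection", 3)
print(planning, collection)
```

Under these assumed counts, a rating of 3 would mean "most mature" in one stage and "halfway" in the other, which is why we hesitate to report a single number across stages.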

Our Model (Redux)

Perhaps the biggest takeaway from the response to our last blog post is that it is very difficult to give detailed feedback on a guide that is mostly whitespace. Below is an updated mock-up, which describes a set of RDM practices along the continuum of data management maturity. At present, we are not aiming to illustrate a full range of data management practices. More simply, this mock-up is intended to show the types of practices that could be described by our guide once it is complete.


An updated visualization of our guide based on reader feedback. At this stage, the example RDM practices are intended to be representative, not comprehensive.

Project Planning

The “Project Planning” stage describes practices that occur prior to the start of data collection. Our examples are all centered around data management plans (DMPs), but other considerations at this stage could include training in data literacy, engagement with local RDM services, inclusion of “sharing” in project documentation (e.g. consent forms), and project pre-registration.

Data Collection

The “Data Collection” stage describes practices related to the acquisition, accumulation, measurement, or simulation of data. Our examples relate mostly to standards around file naming and structuring, but other considerations at this stage could include the protection of sensitive or restricted data, validation of data integrity, and specification of linked data.

Data Analysis

The “Data Analysis” stage describes practices that involve the inspection, modeling, cleaning, or transformation of data. Our examples mostly relate to documenting the analysis workflow, but other considerations at this stage could include the generation and annotation of code and the packaging of data within sharable files or formats.

Data Output

The “Data Output” stage describes practices that involve the communication of either the data itself or conclusions drawn from the data. Our examples are mostly related to the communication of data linked to scholarly publications, but other considerations at this stage could include journal and funder mandates around data sharing, the publication of data papers, and the long-term preservation of data.

Next Steps

Now that we’ve solicited a round of feedback from the community that works on issues around research support, data management, and digital curation, our next step is to broaden our scope to include researchers.

Specifically we are looking for help with the following:

  • Do you find the divisions within our model useful? We’ve used the research data lifecycle as a framework because we believe it makes our tool user-friendly for researchers. At the same time, we also acknowledge that the lines separating planning, collection, analysis, and output can be quite blurry. We would be grateful to know if researchers or data management service providers find these divisions useful or overly constrained.
  • Should there be more discrete “steps” within our framework? Because we view data management maturity as a continuum, we have shied away from creating discrete steps within each division. We would be grateful to know how researchers or data management service providers view this approach, especially when compared to the more quantitative approach employed by CARDIO, the Capability Profile Template, and DMVitals.
  • What else should we put into our model? Researchers are faced with changing expectations and obligations with regard to data management. We want our model to reflect that. We also want our model to reflect the relationship between research data management and broader issues like openness and reproducibility. With that in mind, what other practices and considerations should our model include?

6 thoughts on “An RDM Model for Researchers: What we’ve learned”

  1. Angus Whyte says:

    Just some reflections based on experience with CARDIO and the new models we’ve been developing in DCC called RISE and ReCap. The models are aimed at institutional research data service providers, rather than at researchers themselves (and will be online soon).

    The steps/continuum: Three levels of description is just right I believe, at least that is what we have used in RISE and ReCap. The 5 levels of description in CARDIO made it feel cumbersome in use. I have seen other models using 3 levels of description work with a 5 point scale, on the basis that people often want to self-assess their situation as somewhere between two levels described in the rubric.

    Like you, we wanted to avoid conveying to users that the end point on the continuum is an ideal that everyone can or should strive for, on every area of practice (row of the table/rubric). However for that reason we also avoided a maturity scale of the kind you have used (based I think on the CMM?). We have three levels intended to describe different levels of capability, described as ‘minimally compliant’, ‘institutionalised’, and ‘sector leading’. But in your case I wonder if something like ‘personalised’ and ‘practice leading’ might work?

    Descriptors: They are short which is good, though I realise they are not meant to be comprehensive. We have found it takes quite a few iterations to get the statements in these rubrics right, i.e. describing what we see as key aspects of the activities we want to represent, in a form that’s consistent with the levels we are describing, and in terms that users can readily relate to their practice.

    Activities: I think this inevitably ends up as a similar discussion to those about lifecycle models, so it makes sense to base the framework on whichever of those works best for your purpose. In DCC we revised our model of 10 ’RDM service components’ and used that for our capability model. We decomposed some further, so we have ended up with 21 capabilities. The feedback we got in our validation workshop was that this was about right for the purpose. Our rule of thumb is that it should be possible to get something useful out of the model within an hour. The full CARDIO model had 30 capabilities, which was a lot to deal with in that time. We also used a more concise model for institutions to benchmark progress on complying with funder expectations.

    • John Borghi says:

      Hi Angus,

      Thanks so much for your feedback! You are correct that we adopted the “Idiosyncratic” to “Optimized” language from the CMM. I really like the suggestion of changing it to something more like “Personalized” and “Practice Leading”. I’ll have to think a bit about how that fits into our descriptors.

      Thanks again!

  2. Lærke Friis Neergaard says:

    Hi John

    The Danish universities are currently working on a similar project: the goal is to make specifications and a mock-up of a digital data management guide. Like yours, it should be adjustable to institutional differences and is aimed to be used directly by researchers. Like yours, it is primarily intended to develop “soft skills”, but will probably also point to local support in library, IT, and technical solutions, when applicable.
    We try to make columns in our matrix based on stages of research, rather than stages of data, but they are very much linked.
    Our project involves workshops with researchers from different fields.

    Right now, one of our concerns is how to motivate researchers to use the tool.
    One idea is that the guide should be linked in some way to DMPOnline (all Danish universities are probably going to implement DMPOnline as the preferred DMP tool).

    Do you have similar concerns/ideas?

    • John Borghi says:

      Hi Lærke,

      That’s something we’ve thought about. One of our big priorities when starting this project was that we wanted to create a tool that researchers would actually use. I’m meeting with a group of researchers soon to discuss how our guide would fit into their concern about open science and reproducibility, but other than that, we don’t have any firm ideas. I have a notion of providing recommendations about different tools (like DMPTool), but it’s not very developed yet.

  3. […] Work continues on the RDM guide for researchers! I’ve written up what we’ve learned over the last few months over on the DataPub blog. […]

  4. Liz Stokes says:

    I think this is a great tool, I think you have a good balance of detail and explanation, especially with the descriptions.

    I wonder how this tool might demonstrate institutional policies and support that may affect the answers that researchers give. For example, I can imagine a translucent layer that identifies practical support from local research data, technology or library services to ‘level up’ their responses. Another layer could indicate where institutional policy encourages or mandates specific data management practices (or not at all). Just thinking aloud here. I have the results of the COAR RDM survey fresh in my mind (https://www.coar-repositories.org/files/COAR-RDM-Survey-Jan-2017.pdf) and these additional layers could be used to promote local RDM services and address some of the challenges mentioned in the report.
