Category Archives: Data Citation

An RDM Model for Researchers: What we’ve learned

Thanks to everyone who gave feedback on our previous blog post describing our data management tool for researchers. We received a great deal of input related to our guide’s use of the term “data sharing” and our guide’s position in relation to other RDM tools, as well as quite a few questions about what the guide will include as we develop it further.

As stated in our initial post, we’re building a tool to enable individual researchers to assess the maturity of their data management practices within an institutional or organizational context. To do this, we’ve taken the concept of RDM maturity from existing tools like the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide and placed it within a framework familiar to researchers: the research data lifecycle.


A visualization of our guide as presented in our last blog post. An updated version, including changes made in response to reader feedback, is presented later in this post.

Data Sharing

The most immediate feedback we received was about the term “Data Sharing”. Several commenters pointed out the ambiguity of this term in the context of the research data life cycle. In the last iteration of our guide, we intended “Data Sharing” as a shorthand to describe activities related to the communication of data. Such activities may range from describing data in a traditional scholarly publication to depositing a dataset in a public repository or publishing a data paper. Because existing data sharing policies (e.g. PLOS, The Gates Foundation, and The Moore Foundation) refer specifically to the latter over the former, the term is clearly too imprecise for our guide.

Like “Data Sharing”, “Data Publication” is a popular term for describing activities surrounding the communication of data. Even more than “Sharing”, “Publication” conveys our desire to advance practices that treat data as a first-class research product. Unfortunately, the term is simultaneously too precise and too ambiguous to be useful in our guide. On one hand, the term “Data Publication” can refer specifically to a peer-reviewed document that presents a dataset without offering any analysis or conclusion. While data papers may be a straightforward way of inserting datasets into the existing scholarly communication ecosystem, they represent a single point on the continuum of data management maturity. On the other hand, there is currently no clear consensus among researchers about what it means to “publish” data.

For now, we’ve given that portion of our guide the preliminary label of “Data Output”. As the development process proceeds, this row will include a full range of activities, from the description of data in traditional scholarly publications (which may or may not include a data availability statement) to the deposit of data into public repositories and the publication of data papers.

Other Models and Guides

While we correctly identified that there is a range of rubrics, tools, and capability models with aims similar to our guide’s, we overstated that ours uniquely allows researchers to assess where they are and where they want to be in regards to data management. Several of the tools we cited in our initial post can be applied by researchers to measure the maturity of data management practices within a project or institutional context.

Below we’ve profiled four such tools and indicated how we believe our guide differs from each. In differentiating our guide, we do not mean to position it strictly as an alternative. Rather, we believe that our guide could be used in concert with these other tools.

Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)

CARDIO is a benchmarking tool designed to be used by researchers, service providers, and coordinators for collaborative data management strategy development. Designed to be applied at a variety of levels, from entire institutions down to individual research projects, CARDIO enables its users to collaboratively assess data management requirements, activities, and capacities using an online interface. Users of CARDIO rate their data management infrastructure relative to a series of statements concerning their organization, technology, and resources. After completing CARDIO, users are given a comprehensive set of quantitative capability ratings as well as a series of practical recommendations for improvement.

Unlike CARDIO, our guide does not necessarily assume its users are in contact with data-related service providers at their institution. As we stated in our initial blog post, we intend to guide researchers to specialist knowledge without necessarily turning them into specialists. Therefore, we would consider a researcher making contact with their local data management, research IT, or library service providers for the first time as a positive application of our guide.

Community Capability Model Framework (CCMF)

The Community Capability Model Framework is designed to evaluate a community’s readiness to perform data-intensive research. Intended to be used by researchers, institutions, and funders to assess current capabilities, identify areas requiring investment, and develop roadmaps for achieving a target state of readiness, the CCMF encompasses eight “capability factors” including openness, skills and training, research culture, and technical infrastructure. When used alongside the Capability Profile Template, the CCMF provides its users with a scorecard containing multiple quantitative scores related to each capability factor.

Unlike the CCMF, our guide does not necessarily assume that its users should all be striving towards the same level of data management maturity. We recognize that data management practices may vary significantly between institutions or research areas and that what works for one researcher may not necessarily work for another. Therefore, we would consider researchers understanding the maturity of their data management practices within their local contexts to be a positive application of our guide.

Data Curation Profiles (DCP) and DMVitals

The Data Curation Profile toolkit is intended to address the needs of an individual researcher or research group with regards to the “primary” data used for a particular project. Taking the form of a structured interview between an information professional and a researcher, a DCP can allow an individual research group to consider their long-term data needs, enable an institution to coordinate their data management services, or facilitate research into broader topics in digital curation and preservation.

DMVitals is a tool designed to take information from a source like a Data Curation Profile and use it to systematically assess a researcher’s data management practices in direct comparison to institutional and domain standards. Using DMVitals, a consultant matches a list of evaluated data management practices with responses from an interview and ranks the researcher’s current practices by their level of data management “sustainability.” The tool then generates customized and actionable recommendations, which a consultant then provides to the researcher as guidance to improve his or her data management practices.

Unlike DMVitals, our guide does not calculate a quantitative rating to describe the maturity of data management practices. From a measurement perspective, the range of practice maturity may differ between the four stages of our guide (e.g. the “Project Planning” stage could have greater or fewer steps than the “Data Collection” stage), which would significantly complicate the interpretation of any quantitative ratings derived from our guide. We also recognize that data management practices are constantly evolving and likely dependent on disciplinary and institutional context. On the other hand, we also recognize the utility of quantitative ratings for benchmarking. Therefore, if, after assessing the maturity of their data management practices with our guide, a researcher chooses to apply a tool like DMVitals, we would consider that a positive application of our guide.

Our Model (Redux)

Perhaps the biggest takeaway from the response to our last blog post is that it is very difficult to give detailed feedback on a guide that is mostly whitespace. Below is an updated mock-up, which describes a set of RDM practices along the continuum of data management maturity. At present, we are not aiming to illustrate a full range of data management practices. More simply, this mock-up is intended to show the types of practices that could be described by our guide once it is complete.


An updated visualization of our guide based on reader feedback. At this stage, the example RDM practices are intended to be representative, not comprehensive.

Project Planning

The “Project Planning” stage describes practices that occur prior to the start of data collection. Our examples are all centered around data management plans (DMPs), but other considerations at this stage could include training in data literacy, engagement with local RDM services, inclusion of “sharing” in project documentation (e.g. consent forms), and project pre-registration.

Data Collection

The “Data Collection” stage describes practices related to the acquisition, accumulation, measurement, or simulation of data. Our examples relate mostly to standards around file naming and structuring, but other considerations at this stage could include the protection of sensitive or restricted data, validation of data integrity, and specification of linked data.

Data Analysis

The “Data Analysis” stage describes practices that involve the inspection, modeling, cleaning, or transformation of data. Our examples mostly relate to documenting the analysis workflow, but other considerations at this stage could include the generation and annotation of code and the packaging of data within sharable files or formats.

Data Output

The “Data Output” stage describes practices that involve the communication of either the data itself or conclusions drawn from the data. Our examples are mostly related to the communication of data linked to scholarly publications, but other considerations at this stage could include journal and funder mandates around data sharing, the publication of data papers, and the long-term preservation of data.

Next Steps

Now that we’ve solicited a round of feedback from the community that works on issues around research support, data management, and digital curation, our next step is to broaden our scope to include researchers.

Specifically, we are looking for help with the following:

  • Do you find the divisions within our model useful? We’ve used the research data lifecycle as a framework because we believe it makes our tool user-friendly for researchers. At the same time, we also acknowledge that the lines separating planning, collection, analysis, and output can be quite blurry. We would be grateful to know if researchers or data management service providers find these divisions useful or overly constrained.
  • Should there be more discrete “steps” within our framework? Because we view data management maturity as a continuum, we have shied away from creating discrete steps within each division. We would be grateful to know how researchers or data management service providers view this approach, especially when compared to the more quantitative approach employed by CARDIO, the Capability Profile Template, and DMVitals.
  • What else should we put into our model? Researchers are faced with changing expectations and obligations in regards to data management. We want our model to reflect that. We also want our model to reflect the relationship between research data management and broader issues like openness and reproducibility. With that in mind, what other practices and considerations should our model include?

Data metrics survey results published

Today, we are pleased to announce the publication of Making Data Count in Scientific Data. John Kratz and Carly Strasser led the research effort to understand the needs and values of both the researchers who create and use data and of the data managers who preserve and publish it. The Making Data Count project is a collaboration between the CDL, PLOS, and DataONE to define and implement a practical suite of metrics for evaluating the impact of datasets, which is a necessary prerequisite to widespread recognition of datasets as first-class scholarly objects.

We started the project with research to understand what metrics would be meaningful to stakeholders and what metrics we can practically collect. We conducted a literature review, focus groups, and – the subject of today’s paper – a pair of online surveys for researchers and data managers.

In November and December of 2014, 247 researchers and 73 data repository managers answered our questions about data sharing, use, and metrics. The survey and anonymized data are available in the Dash repository. These responses told us, among other things, which existing Article Level Metrics (ALMs) might be profitably applied to data:

  • Social media: We should not worry excessively about capturing social media (Twitter, Facebook, etc.) activity around data yet, because there is not much to capture. Only 9% of researchers said they would “definitely” use social media to look for a dataset.
  • Page views: Page views are widely collected by repositories but neither researchers nor data managers consider them meaningful. (It stands to reason that, unlike a paper, you can’t have engaged very deeply with a dataset if all you’ve done is read about it.)
  • Downloads: Download counts, on the other hand, are both highly valuable and practical to collect. Downloads were a resounding second-choice metric for researchers and 85% of repositories already track them.
  • Citations: Citations are the coin of the academic realm. They were by far the most interesting metric to both researchers and data managers. Unfortunately, citations are much more difficult than download counts to work with, and relatively few repositories track them. Beyond technical complexity, the biggest challenge is cultural: data citation practices are inconsistent at best, and formal data citation is rare. Despite the difficulty, the value of citations is too high to ignore, even in the short term.

We have already begun to collect data on the sample project corpus – the entire DataONE collection of 100k+ datasets. Using this pilot corpus, we see preliminary indications of researcher engagement with data across a number of online channels not previously thought to be in use by scholars. The results of this pilot will complement the survey described in today’s paper with real measurement of data-related activities “in the wild.”

For more conclusions and in-depth discussion of the initial research, see the paper, which is open access and available here: http://dx.doi.org/10.1038/sdata.2015.39. Stay tuned for analysis and results of the DataONE data-level metrics data on the Making Data Count project page: http://lagotto.io/MDC/.

Data: Do You Care? The DLM Survey

We all know that data is important for research. So how can we quantify that? How can you get credit for the data you produce? What do you want to know about how your data is used?

If you are a researcher or data manager, we want to hear from you. Take this 5-10 minute survey and help us craft data-level metrics:

surveymonkey.com/s/makedatacount

Please share widely! The survey will be open until December 1st.

Read more about the project at mdc.plos.org or check out our previous post. Thanks to John Kratz for creating the survey and jumping through IRB hoops!


What do you think of data metrics? We’re listening.
From gizmodo.com. Click for more pics of dogs + radios.


UC3, PLOS, and DataONE join forces to build incentives for data sharing

We are excited to announce that UC3, in partnership with PLOS and DataONE, is launching a new project to develop data-level metrics (DLMs). This 12-month project is funded by an Early Concept Grants for Exploratory Research (EAGER) grant from the National Science Foundation, and will result in a suite of metrics that track and measure data use. The proposal is available via CDL’s eScholarship repository: http://escholarship.org/uc/item/9kf081vf. More information is also available on the NSF Website.

Why DLMs? Sharing data is time consuming and researchers need incentives for undertaking the extra work. Metrics for data will provide feedback on data usage, views, and impact that will help encourage researchers to share their data. This project will explore and test the metrics needed to capture activity surrounding research data.

The DLM pilot will build from the successful open source Article-Level Metrics community project, Lagotto, originally started by PLOS in 2009. ALMs provide a view into the activity surrounding an article after publication, across a broad spectrum of ways in which research is disseminated and used (e.g., viewed, shared, discussed, cited, and recommended).

About the project partners

PLOS (Public Library of Science) is a nonprofit publisher and advocacy organization founded to accelerate progress in science and medicine by leading a transformation in research communication.

Data Observation Network for Earth (DataONE) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

The University of California Curation Center (UC3) at the California Digital Library is a creative partnership bringing together the expertise and resources of the University of California. Together with the UC libraries, we provide high quality and cost-effective solutions that enable campus constituencies – museums, libraries, archives, academic departments, research units and individual researchers – to have direct control over the management, curation and preservation of the information resources underpinning their scholarly activities.


The official mascot for our new project: Count von Count. From muppet.wikia.com


The DataCite Meeting in Nancy, France

Last week I took a lovely train ride through the cow-dotted French countryside to attend the 2014 DataCite Annual Conference. The event was held at the Institut de l’information Scientifique et Technique (INIST) in Nancy, France, which is about 1.5 hours by train outside of Paris. INIST is the French DataCite member (more on DataCite later). I was invited to the meeting to represent the CDL, which has been an active participant in DataCite since its inception (see my slides). But before I can provide an overview of the DataCite meeting, we need to back up and make sure everyone understands the concept of identifiers, plus a few other bits of key background information.

Background

Identifiers

An identifier is a string of characters that uniquely identifies an object. The object might be a dataset, software, or other research product. Most researchers are familiar with a particular type of identifier, the digital object identifier (DOI). These have been used by the academic publishing industry for uniquely identifying digital versions of journal articles for the last 15 years or so, and their use recently has expanded to other types of digital objects (posters, datasets, code, etc.). Although the DOI is the most widely known type of identifier, there are many, many other identifier schemes. Researchers do not necessarily need to understand the nuances of identifiers, however, since the data repository often chooses the scheme. The most important thing for researchers to understand is that their data needs an identifier to be easy to find, and to facilitate getting credit for that data.
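To make the DOI idea concrete: any DOI can be turned into a resolvable web address through the doi.org proxy. Here is a minimal sketch (the helper name and the prefix handling are my own, purely illustrative):

```python
# Sketch: turning a DOI string into its resolver URL (illustrative only).
# A DOI like "10.1038/sdata.2015.39" resolves via the doi.org proxy.

def doi_to_url(doi: str) -> str:
    """Return the canonical doi.org resolver URL for a DOI string."""
    doi = doi.strip()
    # Strip a leading "doi:" label or an existing resolver prefix, if present
    for prefix in ("doi:", "https://doi.org/", "http://dx.doi.org/"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return f"https://doi.org/{doi}"

print(doi_to_url("10.1038/sdata.2015.39"))
# https://doi.org/10.1038/sdata.2015.39
```

Regardless of which identifier scheme a repository uses, the point is the same: the identifier gives the dataset a stable, unambiguous handle that citations can rely on.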

The DataCite Organization

For those unfamiliar with DataCite, it’s a nonprofit organization founded in 2009. According to their website, their aims are to:

  • establish easier access to research data on the Internet
  • increase acceptance of research data as legitimate, citable contributions to the scholarly record
  • support data archiving that will permit results to be verified and re-purposed for future study.

In this capacity, DataCite has working groups, participates in large initiatives, and partners with national and international groups. Arguably, they are best known for their work in helping organizations issue DOIs. CDL was a founding member of DataCite, and has representation on the advisory board and in the working groups.

EZID: Identifiers made easy

The CDL has a service that provides DataCite DOIs to researchers and those that support them, called EZID. The EZID service allows its users to create and manage long-term identifiers (they do more than just DOIs). Note, however, that individuals currently cannot go to the EZID website and obtain an identifier directly. They must instead work with one of the EZID clients, of which there are many, including academic groups, private industry, government organizations, and publishers. Figshare, Dryad, many UC libraries, and the Fred Hutchinson Cancer Research Center are among those who obtain their DataCite DOIs from EZID.

Highlights from the meeting

#1: Enabling culture shifts

Andrew Treloar from the Australian National Data Service (ANDS) presented a great way to think about how we can enable the shift to a world where research data is valued, documented, and shared. The new paradigm first needs to be possible: this means supporting infrastructure at the institutional and national levels, giving institutions and researchers the tools to properly manage research data outputs, and providing ways to count data citations and help incentivize data stewardship. Second, the paradigm needs to be encouraged/required. We are making slow but steady headway on this front, with new initiatives for open data from government-funded research and requirements for data management plans. Third, the new paradigm needs to be adopted/embraced. That is, researchers should be asking for DOIs for their data, citing the data they use, and understanding the benefits of managing and sharing their data. This is perhaps the most difficult of the three. These three aspects of a new paradigm can help frame tool development, strategies for large initiatives, and arguments for institutional support.

#2: ZENODO’s approach to meeting research data needs

Lars Holm Nielsen from the European Organization for Nuclear Research (CERN) provided a great overview of the repository ZENODO. If you are familiar with figshare, this repository has similar aspects: anyone can deposit their information, regardless of country, institution, etc. This was a repository created to meet the needs of researchers interested in sharing research products. One of the interesting features of Zenodo is their openness to multiple types of licenses, including those that do not result in fully open data. Although I feel strongly about ensuring data are shared with open, machine-readable waivers/licenses, Nielsen made an interesting point: step one is actually getting the data into a repository. If this is accomplished, then opening the data up with an appropriate license can be discussed at a later date with the researcher. While I’m not sure I agree with this strategy (I envision repositories full of data no one can actually search or use), it’s an interesting take.

Full disclosure: I might have a small crush on CERN due to the recent release of Particle Fever, a documentary on the discovery of the Higgs boson.

#3: the re3data-databib merger

Maxi Kindling from Humboldt University Berlin (representing re3data) and Michael Witt from Purdue University Libraries (representing databib) co-presented on plans for merging their two services, both searchable databases of repositories. Both re3data and databib have extensive metadata on data repositories available for depositing research data, covering a wide range of data types and disciplines. This merger makes sense since the two services emerged within X months of one another and there is no need for running them separately, with separate support, personnel, and databases. Kindling and Witt described the five principles of agreement for the merge: openness, optimal quality assurance, innovative functionality development, shared leadership (i.e., the two are equal partners), and sustainability. Regarding this last principle, the service that will result from the merge has been “adopted” by DataCite, which will support it for the long term. The service that will be born of the merge will be called re3data, with an advisory board called databib.

Attendees of the DataCite meeting had interesting lunchtime conversations around future integrations and tools development in conjunction with the new re3data. What about a repository “match-making” service, which could help researchers select the perfect repository for their data? Or integration with tools like the DMPTool? The re3data-databib group is likely coming up with all kinds of great ideas as a result of their new partnership, which will surely benefit the community as a whole.

#4: Lots of other great stuff

There were many other interesting presentations at the meeting: Amye Kenall from BioMed Central (BMC) talking about their GigaScience data journal; Mustapha Mokrane from the ICSU-World Data System on data publishing efforts; and Nigel Robinson from Thomson-Reuters on the Data Citation Index, to name a few. DataCite plans on making all of the presentations available on the conference website, so be sure to check that out in the next few weeks.

My favorite non-data part? The light show at the central square of Nancy, Place Stanislas. 20 minutes well-spent.

Related on Data Pub:


Data Citation Developments

Citation is a defining feature of scholarly publication, and if we want to say that a dataset has been published, we have to be able to cite it. The purposes of traditional paper citations – to recognize the work of others and to allow readers to judge the basis of the author’s assertions – align with the purposes of data citations. Check out previous posts on the topic here.

Although in the past, datasets and databases have usually been mentioned haphazardly, if at all, in the body of a paper and left out of the list of references, this no longer has to be the case.

Last month, there was quite a bit of activity on the data citation front, most notably around a set of data citation principles:

  1. Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.

  2. Credit and Attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.

  3. Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.

  4. Unique Identifiers: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

  5. Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.

  6. Persistence: Metadata describing the data, and unique identifiers should persist, even beyond the lifespan of the data they describe.

  7. Versioning and Granularity: Data citations should facilitate identification and access to different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.

  8. Interoperability and Flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data citation practices across communities.

In the simplest case – when a researcher wants to cite the entirety of a static dataset – there seems to be a consensus set of core elements among DataCite, CODATA, and others. There is less agreement with respect to more complicated cases, so let’s tackle the easy stuff first.

(Nearly) Universal Core Elements

  • Creator(s): Essential, of course, to publicly credit the researchers who did the work. One complication here is that datasets can have large (into the hundreds) numbers of authors, in which case an organizational name might be used.
  • Date: The year of publication or, occasionally, when the dataset was finalized.
  • Title: As is the case with articles, the title of a dataset should help the reader decide whether your dataset is potentially of interest. The title might contain the name of the organization responsible, or information such as the date range covered.
  • Publisher: Many standards split the publisher into separate producer and distributor fields. Sometimes the physical location (City, State) of the organization is included.
  • Identifier: A Digital Object Identifier (DOI), Archival Resource Key (ARK), or other unique and unambiguous label for the dataset.

Common Additional Elements

  • Location: A web address from which the dataset can be accessed. DOIs and ARKs can be used to locate the resource cited, so this field is often redundant.
  • Version: May be necessary for getting the correct dataset when revisions have been made.
  • Access Date: The date the data was accessed for this particular publication.
  • Feature Name: May be a formal feature from a controlled vocabulary, or some other description of the subset of the dataset used.
  • Verifier: Information that can be used to make sure you have the right dataset.
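To show how the elements above fit together, here is a minimal sketch that assembles them into a citation string. The layout loosely follows a "Creator (Year): Title. Publisher. Identifier" pattern; the function, the example names, and the DOI are all hypothetical, not a prescribed format:

```python
# Sketch: assembling the core (and some optional) citation elements
# into a single citation string. Layout is illustrative only.

def format_data_citation(creators, year, title, publisher, identifier,
                         version=None, access_date=None):
    """Build a citation string from core and optional elements."""
    creator_part = "; ".join(creators)
    parts = [f"{creator_part} ({year}): {title}.", f"{publisher}."]
    if version:  # optional: helps when revisions have been made
        parts.append(f"Version {version}.")
    parts.append(identifier)
    if access_date:  # optional: useful for dynamic datasets
        parts.append(f"(Accessed {access_date})")
    return " ".join(parts)

# Hypothetical example values, including a made-up DOI
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2015,
    title="Example survey dataset",
    publisher="Example Data Repository",
    identifier="https://doi.org/10.1234/example",
)
print(citation)
```

Note how the optional elements simply extend the core string; a real citation style would be dictated by the journal or repository, not by code like this.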

Complications

Datasets are different from journal articles in ways that can make them more difficult to cite. The first issue is deep citation or granularity, and the second is dynamic data.

Deep Citation

Traditional journal articles are cited as a whole, and it is left to the reader to sort through the article to find the relevant information. When citing a dataset, more precision is sometimes necessary. If an analysis is done on part of a dataset, it can only be repeated by extracting exactly that subset of the data. Consequently, there is a desire for mechanisms allowing precise citation of data subsets. A number of solutions have been put forward:

  • Most common and least useful is to describe how you extracted the subset in the text of the article.

  • For some applications, such as time series, you may be able to specify a date or geographic range, or a limited number of variables within the citation.

  • Another approach is to mint a new identifier that refers to only the subset used, and refer back to the source dataset in the metadata of the subset. The DataCite DOI metadata scheme includes a flexible mechanism to specify relationships between objects, including that one is part of another.

  • The citation can include a Universal Numeric Fingerprint (UNF) as a verifier for the subset. A UNF can be used to test whether two datasets are identical, even if they are stored in different file formats. This won’t help you to find the subset you want, but it will tell you whether you’ve succeeded.
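The verifier idea in the last bullet can be sketched in a few lines. This is emphatically not the actual UNF algorithm (which involves careful normalization of numeric precision, encodings, and missing values); it only illustrates the underlying concept of a format-independent fingerprint computed from data values:

```python
# Simplified stand-in for a data fingerprint like UNF: hash a canonical
# text rendering of the values, so identical data yields an identical
# digest regardless of the file format it was stored in.
import hashlib

def subset_fingerprint(rows):
    """Return a short hex digest of a canonical rendering of the rows."""
    canonical = "\n".join(",".join(str(v) for v in row) for row in rows)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Two copies of the same values produce the same fingerprint...
rows_from_csv = [(1, 2.5), (2, 3.5)]
rows_from_other_format = [(1, 2.5), (2, 3.5)]
assert subset_fingerprint(rows_from_csv) == subset_fingerprint(rows_from_other_format)

# ...while a different subset does not match.
assert subset_fingerprint([(1, 2.5)]) != subset_fingerprint(rows_from_csv)
```

As with UNF itself, such a fingerprint won’t help you locate a subset, but it will tell you whether the subset you extracted is the one that was cited.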

Dynamic Data

When a journal article is published, it’s set in stone. Corrections and retractions are rare occurrences, and small errors like typos are allowed to stand. In contrast, some datasets can be expected to change over time. There is no consensus as to whether or how much change is permitted before an object must be issued a new identifier. DataCite recommends, but does not require, that DOIs point to a static object.

Broadly, dynamic datasets can be split into two categories:

  • Appendable datasets get new data over time, but the existing data is never changed. If timestamps are applied to each entry, inclusion of an access date or a date range in the citation may allow a user to confidently reconstruct the state of the dataset. The Federation of Earth Science Information Partners (ESIP), for instance, specifies that an appendable dataset be issued a DOI only once, with a time range specified in the citation. On the other hand, the Dataverse standard and DCC guidelines require new DOIs for any change. If the dataset is impractically large, the new DOI may cover a “time slice” containing only the new data. For instance, each year of data from a sensor could be issued its own DOI.

  • Data in revisable datasets may be inserted, altered, or deleted. Citations to revisable datasets are likely to include version numbers or access dates. In this case ESIP specifies that a new DOI should be minted for each “major” but not “minor” version. If a new DOI is required for each version, a “snapshot” of the dataset can be frozen from time to time and issued its own DOI.
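For an appendable, timestamped dataset, reconstructing the state cited with an access date is straightforward in principle. A minimal sketch, with an invented record format purely for illustration:

```python
# Sketch: reconstructing the cited state of an appendable dataset from an
# access date, assuming every record carries a timestamp. The record format
# is invented for illustration.
from datetime import date

records = [
    {"timestamp": date(2012, 1, 10), "value": 4.2},
    {"timestamp": date(2012, 6, 3), "value": 5.1},
    {"timestamp": date(2013, 2, 20), "value": 3.8},  # added after the citation
]

def as_of(records, access_date):
    """Return only the records that existed on the cited access date."""
    return [r for r in records if r["timestamp"] <= access_date]

cited_state = as_of(records, date(2012, 12, 31))
print(len(cited_state))  # 2: the record appended in 2013 is excluded
```

This only works because existing entries are never altered; for revisable datasets, an access date alone cannot recover the cited state, which is why versioned snapshots are needed.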

NSF now allows data in biosketch accomplishments

Hip hip hooray for data! Contributed to Calisphere by Sourisseau Academy for State and Local History (click for more information)

Back in October, the National Science Foundation announced changes to its Grant Proposal Guidelines (Full GPG for January 2013 here).  I blogged about this back when the announcement was made, but now that the changes are official, I figure it warrants another mention.

As of January 2013, you can now list products in your biographical sketches, not just publications. This is big (and very good) news for data advocates like myself.

The change is that the biosketch for senior personnel should contain a list of five products closely related to the project and five other significant products that may or may not be related to it. But what counts as a product? In the NSF’s words, “products are…including but not limited to publications, data sets, software, patents, and copyrights.”

To make it count, however, it needs to be both citable and accessible. How to do this?

  1. Archive your data in a repository (find help picking a repo here)
  2. Obtain a unique, persistent identifier for your dataset (e.g., a DOI or ARK)
  3. Start citing your product!

For the librarians, data nerds, and information specialists in the group, the UC3 has put together a flyer you can use to promote listing data as a product. It’s available as a PDF (click on the image to the right to download). For the original PPT that you can customize for your institution and/or repository, send me an email.

NSF_products_flyer

Direct from the digital mouths of NSF:

Summary of changes: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_sigchanges.jsp

Chapter II.C.2.f(i)(c), Biographical Sketch(es), has been revised to rename the “Publications” section to “Products” and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

New wording: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_2.jsp

(c) Products

A list of: (i) up to five products most closely related to the proposed project; and (ii) up to five other significant products, whether or not related to the proposed project. Acceptable products must be citable and accessible including but not limited to publications, data sets, software, patents, and copyrights. Unacceptable products are unpublished documents not yet submitted for publication, invited lectures, and additional lists of products. Only the list of 10 will be used in the review of the proposal.

Each product must include full citation information including (where applicable and practicable) names of all authors, date of publication or release, title, title of enclosing work such as journal or book, volume, issue, pages, website and Uniform Resource Locator (URL) or other Persistent Identifier.


Resources, and Versions, and Identifiers! Oh, my!

The only constant is change.  —Heraclitus

Data publication, management, and citation would all be so much easier if data never changed, or at least, if it never changed after publication. But as the Greeks observed so long ago, change is here to stay. We must accept that data will change, and given that fact, we are probably better off embracing change rather than avoiding it. Because the very essence of data citation is identifying what was referenced at the time it was referenced, we need to be able to put a name on that referenced quantity, which leads to the requirement of assigning named versions to data. With versions we are providing the x that enables somebody to say, “I used version x of dataset y.”

Since versions are ultimately names, the problem of defining versions is inextricably bound up with the general problem of identification. Key questions that must be asked when addressing data versioning and identification include:

  • What is being identified by a version? This can be a surprisingly subtle question. Is a particular set of bits being identified? A conceptual quantity (to use FRBR terms, an expression or manifestation)? A location? A conceptual quantity at a location? For a resource that changes rapidly or predictably, such as a data stream that accumulates over time, it will probably be necessary to address the structure of the stream separately from the content of the stream, and to support versions and/or citation mechanisms that allow the state of the stream to be characterized at the time of reference. In any case, the answer to the question of what is being identified will greatly impact both what constitutes change (and therefore what constitutes a version) and the appropriateness of different identifier technologies to identifying those versions.
  • When does a change constitute a new version? Always? Even when only a typographical error is being corrected? Or, in a hypertext document, when updating a broken hyperlink? (This is a particularly difficult case, since updating a hyperlink requires updating the document, of course, but a URL is really a property of the identifiee, not the identifier.) In the case of a science dataset, does changing the format of the data constitute a new version? Reorganizing the data within a format (e.g., changing from row-major to column-major order)? Re-computing the data on different floating-point hardware? Versions are often divided into “major” versions and “minor” versions to help characterize the magnitude and backward-compatibility of changes.
  • Is each version an independent resource? Or is there one resource that contains multiple versions? This may seem a purely semantic distinction, but the question has implications on how the resource is managed in practice. The W3C struggled with this question in identifying the HTML specification. It could have created one HTML resource with many versions (3.1, 4.2, 5, …), but for manageability it settled on calling HTML3 one resource (with versions 3.1, 3.2, etc.), HTML4 a separate resource (with analogous versions 4.1, 4.2, etc.), and continuing on to HTML5 as yet another resource.

So far we have only raised questions, and that’s the nature of dealing with versions: the answers tend to be very situation-specific. Fortunately, some broad guidelines have emerged:

  • Assign an identifier to each version to support identification and citation.
  • Assign an identifier to the resource as a whole, that is, to the resource without considering any particular version of the resource. There are many situations where it is desirable to be able to make a version-agnostic reference. Consider that, in the text above, we were able to refer to something called “HTML4” without having to name any particular version of that resource. What if that were not possible?
  • Provide linkages between the versions, and between the versions and the resource as a whole.

These guidelines still leave unanswered the question of how to actually assign identifiers to versions. One approach is to assign a different, unrelated identifier to each version. For example, doi:10.1234/FOO might refer to version 1 of a resource and doi:10.5678/BAR to version 2. Linkages, stored in the resource versions themselves or externally in a database, can record the relationships between these identifiers. This approach may be appropriate in many cases, but it places a burden on both the resource maintainer (every link that must be maintained represents a breakage point) and the user (there is no easily visible or otherwise obvious relationship between the identifiers).

Another approach is to syntactically encode version information in the identifiers. With this approach, we might start with doi:10.1234/FOO as a base identifier for the resource, and then append version information in a visually apparent way: doi:10.1234/FOO/v1 might refer to version 1, doi:10.1234/FOO/v2 to version 2, and so forth. In a logical extension, we could then treat the version-less identifier doi:10.1234/FOO as identifying the resource as a whole. This is exactly the approach used by the arXiv preprint service.
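The suffix convention can be implemented with simple string handling. A sketch, using the hypothetical identifiers from the paragraph above:

```python
# Sketch: encoding and parsing version suffixes on a base identifier,
# following the doi:10.1234/FOO/v1 convention described above. The
# identifiers themselves are the hypothetical examples from the text.

def versioned(base, version=None):
    """Return the identifier for a specific version, or the base
    (version-agnostic) identifier if no version is given."""
    return f"{base}/v{version}" if version is not None else base

def parse(identifier):
    """Split an identifier into (base, version); version is None for a
    version-agnostic identifier."""
    head, sep, tail = identifier.rpartition("/v")
    if sep and tail.isdigit():
        return head, int(tail)
    return identifier, None

assert versioned("doi:10.1234/FOO", 2) == "doi:10.1234/FOO/v2"
assert parse("doi:10.1234/FOO/v2") == ("doi:10.1234/FOO", 2)
assert parse("doi:10.1234/FOO") == ("doi:10.1234/FOO", None)
```

Note how the version-agnostic reference falls out of the syntax for free, satisfying the second guideline above, while the shared prefix makes the relationship between versions visible to a human reader.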

Resources, versions, identifiers, citations: the issues they present tend to get bound up in a Gordian knot.  Oh, my!

Further reading:

ESIP Interagency Data Stewardship/Citations/Provider Guidelines

DCC “Cite Datasets and Link to Publications” How-to Guide

Resources, Versions, and URIs


DataCite Metadata Schema update

This spring, work is underway on a new version of the DataCite metadata schema. DataCite is a worldwide consortium, founded in 2009, dedicated to “helping you find, access, and reuse data.” The principal mechanism for doing so is the registration of digital object identifiers (DOIs) via the member organizations. To make sure dataset citations are easy to find, each registration of a DataCite DOI must be accompanied by a small set of citation metadata. It is small on purpose: this is intended to be a “big tent” for all research disciplines. DataCite has specified these requirements in a metadata schema.

The team in charge of this task is the Metadata Working Group. This group responds to suggestions from DataCite clients and community members. I chair the group, and my colleagues on the group come from the British Library, GESIS, the TIB, CISTI, and TU Delft.

The new version of the schema, 2.3, will be the first to be paired with a corresponding version in the Dublin Core Application Profile format. It fulfills a commitment that the Working Group made with its first release in January of 2011. The hope is that the application profile will promote interoperability with Dublin Core, a common metadata format in the library community, going forward. We intend to maintain synchronization between the schema and the profile with future versions.

Additional changes will include new selections for the optional fields, including support for a new relationType (isIdenticalTo), and we’re considering a way to specify the temporal collection characteristics of the resource being registered. This would mean describing, in simple terms and optionally, a dataset collected between two dates. There are a few other changes under discussion as well, so stay tuned.

DataCite metadata is available in the Search interface to the DataCite Metadata Store. The metadata is also exposed for harvest via the OAI-PMH protocol. California Digital Library is a founding member, and our DataCite implementation is the EZID service, which also offers ARKs, an alternative identifier scheme. Please let me know if you have any questions by contacting uc3 at ucop.edu.
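OAI-PMH harvesting works over plain HTTP with verb parameters, so a harvest request is just a URL. A minimal sketch of constructing one; the endpoint shown is illustrative, so check DataCite’s documentation for the actual harvest URL and supported sets:

```python
# Sketch: constructing an OAI-PMH ListRecords request URL. The endpoint and
# set name are illustrative placeholders, not DataCite's actual values.
from urllib.parse import urlencode

def oai_list_records(base_url, metadata_prefix="oai_dc", **kwargs):
    """Build a ListRecords request; extra keyword arguments become
    OAI-PMH parameters such as a set or a date range."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix, **kwargs}
    return f"{base_url}?{urlencode(params)}"

url = oai_list_records("https://example.org/oai", set="EXAMPLE")
print(url)
```

Fetching the resulting URL returns an XML envelope of records in the requested metadata format, which a harvester then pages through with resumption tokens.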


EZID: now even easier to manage identifiers

EZID, the easy long-term identifier service, just got a new look. EZID lets you create and maintain ARKs and DataCite Digital Object Identifiers (DOIs), and now it’s even easier to use:

  • One stop for EZID and all EZID information, including webinars, FAQs, and more.

    • A clean, bright new look.
    • No more hunting across two locations for the materials and information you need.
  • NEW Manage IDs functions:
    • View all identifiers created by logged-in account;
    • View the 10 most recent interactions, based on the account rather than the session;
    • See the scope of your identifier work without any API programming.
  • NEW in the UI: Reserve an Identifier
    • Create identifiers early in the research cycle;
    • Choose whether or not you want to make your identifiers public; reserve them if you don’t;
    • On the Manage screen, view the identifier’s status (public, reserved, unavailable/just testing).

In the coming months, we will also be introducing these EZID user interface enhancements:

  • Enhanced support for DataCite metadata in the UI;
  • Reporting support for institution-level clients.

So, stay tuned: EZID just gets better and better!
