Category Archives: Data Metrics

Make Data Count: Building a System to Support Recognition of Data as a First Class Research Output

The Alfred P. Sloan Foundation has made a 2-year, $747K award to the California Digital Library, DataCite and DataONE to support collection of usage and citation metrics for data objects. Building on pilot work, this award will result in the launch of a new service that will collate and expose data level metrics.

The impact of research has traditionally been measured by citations to journal publications: journal articles are the currency of scholarly research.  However, scholarly research is made up of a much larger and richer set of outputs beyond traditional publications, including research data. In order to track and report the reach of research data, methods for collecting metrics on complex research data are needed.  In this way, data can receive the same credit and recognition that is assigned to journal articles.

“Recognition of data as valuable output from the research process is increasing, and this project will greatly enhance awareness around the value of data and enable researchers to gain credit for the creation and publication of data” – Ed Pentz, Crossref.

This project will work with the community to create a clear set of guidelines on how to define data usage. In addition, the project will develop a central hub for the collection of data-level metrics. These metrics will include data views, downloads, citations, saves, and social media mentions, and will be exposed through customized user interfaces deployed at partner organizations. Developed in an open source environment, with extensive user experience testing and community engagement, the products of this project will be available to data repositories, libraries, and other organizations to deploy within their own environments, serving their communities of data authors.

Are you working in the data metrics space? Let’s collaborate.

Find out more and follow us at @makedatacount.

About the Partners

California Digital Library was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. University of California Curation Center (UC3), one of four main programs within the CDL, helps researchers and the UC libraries manage, preserve, and provide access to their important digital assets as well as developing tools and services that serve the community throughout the research and data life cycles.

DataCite is a leading global non-profit organization that provides persistent identifiers (DOIs) for research data. Our goal is to help the research community locate, identify, and cite research data with confidence. Through collaboration, DataCite supports researchers by helping them to find, identify, and cite research data; data centres by providing persistent identifiers, workflows and standards; and journal publishers by enabling research articles to be linked to the underlying data/objects.

DataONE (Data Observation Network for Earth) is an NSF DataNet project which is developing a distributed framework and sustainable cyber infrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

California Digital Library Supports the Initiative for Open Citations

California Digital Library (CDL) is proud to announce our formal endorsement for the Initiative for Open Citations (I4OC). CDL has long supported free and reusable scholarly work, as well as organizations and initiatives supporting citations in publication. With a growing database of literature and research data citations, there is a need for an open global network of citation data.

The Initiative for Open Citations will work with Crossref and their Cited-by service to open up all references indexed in Crossref. Many publishers and stakeholders have opted in to participate in opening up their citation data, and we hope that each year this list will grow to encompass all fields of publication. Furthermore, we are looking forward to seeing how research data citations will be a part of this discussion.

CDL is a firm believer in and advocate for data citations and persistent identifiers in scholarly work. However, if research publications are cited but those citations are not freely accessible and searchable, our goal is not accomplished. We are proud to support the Initiative for Open Citations and invite you to get in touch with any questions you may have about the need for open citations or ways to be an advocate for this necessary change.

Below are some Frequently Asked Questions about the need, ways to get involved, and misconceptions regarding citations. The answers are provided by the Board and founders of the I4OC Initiative:

I am a scholarly publisher not enrolled in the Cited-by service. How do I enable it?

A Crossref member not already participating in Cited-by can register for this service free of charge. Because participation in Cited-by alone does not automatically make references available via Crossref’s standard APIs, the publisher must also give its consent to Crossref to ‘open’ its reference data; beyond that, nothing further is required.

I am a scholarly publisher already depositing references to Crossref. How do I publicly release them?

We encourage all publishers to make their reference metadata publicly available. If you are already submitting article metadata to Crossref as a participant in their Cited-by service, opening your references can be achieved, easily and free of charge, in a matter of days:

  • either by contacting Crossref support directly by e-mail, asking them to turn on reference distribution for all of the relevant DOI prefixes;
  • or by themselves setting the <reference_distribution_opt> metadata element to “any” for each DOI deposit for which they want to make references openly available.
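As an illustration of the second option, the fragment below patches a deposit record so the element is set to “any”. This is a minimal sketch: the skeleton XML is invented for illustration, and the element is treated here as a child element per the wording above; consult the current Crossref deposit schema for its exact placement and the full set of required elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal deposit skeleton; real Crossref deposits
# carry many more required elements and namespaces.
deposit = ET.fromstring(
    "<journal_article>"
    "<doi_data><doi>10.5555/example</doi></doi_data>"
    "</journal_article>"
)

def open_references(article: ET.Element) -> None:
    """Set <reference_distribution_opt> to 'any' so that deposited
    references are distributed openly."""
    opt = article.find("reference_distribution_opt")
    if opt is None:
        opt = ET.SubElement(article, "reference_distribution_opt")
    opt.text = "any"

open_references(deposit)
print(ET.tostring(deposit, encoding="unicode"))
```

Running `open_references` on each deposit before submission would mark all of those DOIs for open reference distribution.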

How do I access open citation data?

Once made open, the references for individual scholarly publications may be accessed immediately through the Crossref REST API.
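As a sketch of what such access looks like, the snippet below queries the Crossref REST API for one DOI and reads the reference list from the response. The `/works/{doi}` route and the `message.reference` field follow Crossref’s public REST API documentation; the helper names are our own, and the field is present only once a publisher has opened its references.

```python
import json
from urllib.request import urlopen

CROSSREF_API = "https://api.crossref.org/works/"

def reference_count(payload: dict) -> int:
    """Count references in a Crossref works response; the 'reference'
    key appears only when the publisher has opened its references."""
    return len(payload.get("message", {}).get("reference", []))

def fetch_references(doi: str) -> list:
    """Fetch the open reference list for a single DOI (network call)."""
    with urlopen(CROSSREF_API + doi) as resp:
        payload = json.load(resp)
    return payload.get("message", {}).get("reference", [])
```

For a publication with closed references, the `reference` key is simply absent and `reference_count` returns zero.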

Open citations are also available from the OpenCitations Corpus, a database created to house scholarly citations that progressively and systematically harvests citation data from Crossref and other sources. An advantage of accessing citation data from the OpenCitations Corpus is that they are available in standards-compliant, machine-readable RDF format, and include information about both incoming and outgoing citations of bibliographic resources (published articles and books).

Does this initiative cover future citations only or also historical data?

Both. All DOIs under a prefix set for open reference distribution will have open references through Crossref, for past, present, and future publications.

Past and present publications that lack DOIs are not dealt with by Crossref, and gaining access to their citation data will require separate initiatives by their publishers or others to extract and openly publish those references.

Under what licensing terms is citation data being made available?

Crossref exposes article and reference metadata without a license, since it regards these as raw facts that cannot be licensed.

The structured citation metadata within the OpenCitations Corpus are published under a Creative Commons CC0 public domain dedication, to make it explicitly clear that these data are open.

My journal is open access. Aren’t its articles’ citations automatically available?

No. Although Open Access articles may be open and freely available to read on the publisher’s website, their reference lists are not separately available, and are not necessarily structured or accessible programmatically. Additionally, although their reference metadata may be submitted to Crossref, Crossref historically set the default for references to “closed,” with a manual opt-in required for public references. Many publisher members have not been aware that they could simply instruct Crossref to make references open, and, as a neutral party, Crossref has not promoted the public-reference option. All publishers therefore have to opt in to open distribution of references via Crossref.

Is there a programmatic way to check whether a publisher’s or journal’s citation data is free to reuse?

For Crossref metadata, the Crossref REST API reveals how many and which publishers have opened references: any system or tool (or a JSON viewer) can be pointed at the appropriate members query to show the count and the list of publishers with “public-references”: true.

To query a specific publisher’s status, add, for example, query=springer to the members query, then find the public-references tag in the response. In some cases it will be set to false.
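A minimal sketch of such a check is below. The members route and the `flags["public-references"]` field reflect the Crossref REST API as documented; the function names are our own, and the exact response shape should be verified against the current API docs.

```python
import json
from urllib.request import urlopen
from urllib.parse import quote

MEMBERS_API = "https://api.crossref.org/members?query="

def has_public_references(member: dict) -> bool:
    """Read the public-references flag from a single member record."""
    return bool(member.get("flags", {}).get("public-references", False))

def check_publisher(name: str) -> dict:
    """Look up publishers by name and report each one's flag
    (network call)."""
    with urlopen(MEMBERS_API + quote(name)) as resp:
        payload = json.load(resp)
    items = payload.get("message", {}).get("items", [])
    return {m.get("primary-name"): has_public_references(m) for m in items}
```

A tool built on `check_publisher` could, for instance, screen a journal list before deciding where citation data can be freely reused.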


You can contact the founding group by e-mail.

Data metrics survey results published

Today, we are pleased to announce the publication of Making Data Count in Scientific Data. John Kratz and Carly Strasser led the research effort to understand the needs and values of both the researchers who create and use data and the data managers who preserve and publish it. The Making Data Count project is a collaboration between the CDL, PLOS, and DataONE to define and implement a practical suite of metrics for evaluating the impact of datasets, which is a necessary prerequisite to widespread recognition of datasets as first-class scholarly objects.

We started the project with research to understand what metrics would be meaningful to stakeholders and what metrics we can practically collect. We conducted a literature review, focus groups, and (the subject of today’s paper) a pair of online surveys for researchers and data managers.

In November and December of 2014, 247 researchers and 73 data repository managers answered our questions about data sharing, use, and metrics. The survey instruments and anonymized data are available in the Dash repository. These responses told us, among other things, which existing Article-Level Metrics (ALMs) might be profitably applied to data:

  • Social media: We should not worry excessively about capturing social media (Twitter, Facebook, etc.) activity around data yet, because there is not much to capture. Only 9% of researchers said they would “definitely” use social media to look for a dataset.
  • Page views: Page views are widely collected by repositories but neither researchers nor data managers consider them meaningful. (It stands to reason that, unlike a paper, you can’t have engaged very deeply with a dataset if all you’ve done is read about it.)
  • Downloads: Download counts, on the other hand, are both highly valuable and practical to collect. Downloads were a resounding second-choice metric for researchers and 85% of repositories already track them.
  • Citations: Citations are the coin of the academic realm. They were by far the most interesting metric to both researchers and data managers. Unfortunately, citations are much more difficult than download counts to work with, and relatively few repositories track them. Beyond technical complexity, the biggest challenge is cultural: data citation practices are inconsistent at best, and formal data citation is rare. Despite the difficulty, the value of citations is too high to ignore, even in the short term.

We have already begun to collect data on the sample project corpus– the entire DataONE collection of 100k+ datasets. Using this pilot corpus, we see preliminary indications of researcher engagement with data across a number of online channels not previously thought to be in use by scholars. The results of this pilot will complement the survey described in today’s paper with real measurement of data-related activities “in the wild.”

For more conclusions and in-depth discussion of the initial research, see the paper, which is open access. Stay tuned for analysis and results of the DataONE data-level metrics on the Making Data Count project page.

Make Data Rain

Last October, UC3,  PLOS, and DataONE launched Making Data Count, a collaboration to develop data-level metrics (DLMs). This 12-month National Science Foundation-funded project will pilot a suite of metrics to track and measure data use that can be shared with funders, tenure & promotion committees, and other stakeholders.

[image from Freepik]

To understand how DLMs might work best for researchers, we conducted an online survey and held a number of focus groups, which culminated on a very (very) rainy night last December in a discussion at the PLOS offices with researchers in town for the 2014 American Geophysical Union Fall Meeting.

Six eminent researchers participated.

Much of the conversation concerned how to motivate researchers to share data. Sources of external pressure that came up included publishers, funders, and peers. Publishers can require (as PLOS does) that, at a minimum, the data underlying every figure be available. Funders might refuse to ‘count’ publications based on unavailable data, and refuse to renew funding for projects that don’t release data promptly. Finally, other researchers– in some communities, at least– are already disinclined to work with colleagues who won’t share data.

However, Making Data Count is particularly concerned with the inverse– not punishing researchers who don’t share, but rewarding those who do. For a researcher, metrics demonstrating data use serve not only to prove to others that their data is valuable, but also to affirm for themselves that taking the time to share their data is worthwhile. The researchers present regarded altmetrics with suspicion and overwhelmingly affirmed that citations are the preferred currency of scholarly prestige.

Many of the technical difficulties with data citation (e.g., citing dynamic data or a particular subset) came up in the course of the conversation. One interesting point was raised by many: when citing a data subset, the needs of reproducibility and credit diverge. For reproducibility, you need to know exactly what data has been used, at a maximum level of granularity. But credit is about resolving to a single product that the researcher gets credit for, regardless of how much of the dataset or what version of it was used, so less granular is better.
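This divergence can be made concrete with a toy resolver. The identifier scheme below (“parent/version/subset”) is invented purely for illustration; real data citations use DOIs and repository-specific subset conventions.

```python
def split_citation(identifier: str) -> tuple:
    """Separate a fine-grained data citation into the exact string
    needed for reproducibility and the canonical dataset identifier
    that should accrue credit."""
    reproducible = identifier               # keep full granularity
    creditable = identifier.split("/")[0]   # roll up to the parent dataset
    return reproducible, creditable

exact, credit = split_citation("dataset123/v2/rows-1-500")
# 'exact' preserves the subset for reproducibility;
# 'credit' rolls all subsets and versions up to one product.
```

A citation-counting service built this way would record the exact identifier for provenance while aggregating all counts onto the parent dataset.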

We would like to thank everyone who attended any of the focus groups. If you have ideas about how to measure data use, please let us know in the comments!
