Data Citation

The value of formal citations to datasets is widely recognized within the scholarly communication community.  This agreement led to the release of the Joint Declaration of Data Citation Principles, which can be read and endorsed here: Joint Declaration of Data Citation Principles

Basic Data Citation

There is good consensus around the minimal components of a data citation:

Creator (Year) Title. Publisher. Identifier

Core Elements

  • Creator(s): Individual(s) or organization responsible for creating the dataset.
  • Year: Year the dataset was published, not necessarily created.
  • Title: Should be as descriptive as possible.
  • Publisher: Organization that provides access to the dataset (e.g. Dryad, Zenodo).
  • Identifier: Persistent, unique identifier (e.g. a DOI).
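
As a minimal sketch, the five core elements above can be assembled into the "Creator (Year) Title. Publisher. Identifier" pattern as follows. The class name, function name, and the example record (authors, title, DOI) are all hypothetical and only illustrate the layout; real journal styles differ in punctuation and ordering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataCitation:
    """The five core elements of a data citation."""
    creators: List[str]
    year: int          # year of publication, not necessarily of creation
    title: str
    publisher: str
    identifier: str    # persistent identifier, e.g. a DOI


def format_citation(c: DataCitation) -> str:
    """Render 'Creator (Year) Title. Publisher. Identifier'."""
    creators = ", ".join(c.creators)
    # Avoid a doubled period when the title already ends with punctuation.
    title = c.title if c.title.endswith((".", "?", "!")) else c.title + "."
    return f"{creators} ({c.year}) {title} {c.publisher}. {c.identifier}"


# Hypothetical record: the authors, title, and DOI are placeholders.
example = DataCitation(
    creators=["Doe J", "Roe R"],
    year=2013,
    title="Example soil moisture measurements",
    publisher="Dryad",
    identifier="https://doi.org/10.5072/example",
)
print(format_citation(example))
```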

Additional Elements

  • Location / Availability: The web address of the dataset is essential when the identifier can’t be used to reach the dataset.
  • Version / Edition: Version of the dataset used in the present publication.  Needed to reproduce analysis of versioned dynamic datasets.
  • Access Date: Date of access for analysis in the present publication. Needed to reproduce  analysis of continuously updated dynamic datasets.
  • Format / Material Designator: e.g. database, CD-ROM.
  • Feature Name: A description of the subset of the dataset used.  May be a formal title or a list of variables  (e.g. concentration, optical density).
  • Verifier: Used to confirm that two datasets are identical.  Most commonly a UNF or MD5 checksum.
  • Series: Used if the dataset is part of a series of releases (e.g. monthly, yearly).
  • Contributor: e.g. editor, compiler
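
To illustrate the Verifier element, here is a rough sketch of computing an MD5 checksum with Python's standard hashlib module and comparing it with the value given in a citation. The function names are made up for this example. Note that an MD5 checksum is computed over raw bytes, so it only confirms that two files are byte-for-byte identical; a UNF is calculated differently and is not covered here.

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_verifier(stream: BinaryIO, expected_md5: str) -> bool:
    """Compare a stream's MD5 with the verifier quoted in a citation."""
    return md5_of_stream(stream) == expected_md5.strip().lower()


# In-memory stand-in for a dataset file; a real file would be opened
# with open(path, "rb") and passed in the same way.
demo = io.BytesIO(b"hello")
print(md5_of_stream(demo))
```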

For datasets that have DOIs, DataCite and CrossRef provide a citation formatter  to generate a citation matching any of a wide array of journal styles.
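
Formatted citations can also be requested programmatically through DOI content negotiation, in which a request to the DOI resolver asks for the `text/x-bibliography` media type and names a citation style. The sketch below only shows the shape of such a request. The function names are hypothetical, the DOI in the usage note is a placeholder, and it assumes that the DOI's registration agency supports this media type and the requested style.

```python
import urllib.request
from urllib.parse import quote


def citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = "https://doi.org/" + quote(doi, safe="/")
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)


def fetch_citation(doi: str, style: str = "apa", timeout: float = 10.0) -> str:
    """Send the request and return the formatted citation text.

    Requires network access; raises urllib.error.URLError on failure.
    """
    request = citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling `fetch_citation("10.5072/example")` with a real dataset DOI in place of the placeholder should return one formatted reference as plain text, provided the service is reachable.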

Citing Dynamic Data

Unlike traditional publications, datasets are often dynamic: new data may be added over time, or existing data may be revised and corrected.  Several groups recommend or require methods for citing specific versions of dynamic datasets:

  • DataCite: Starr & Gastl (2011)
    • “DataCite does not enforce any validation rule that a resource ought to be re-registered each time it undergoes a version change. However, this is considered a recommended best practice for resource citation.”
    • The DataCite metadata scheme includes version number and, more powerfully, can specify a variety of relationships to another object, including versions & variants.
  • Dataverse: Altman & Crosas (2013)
    • All versions of a dataset share one DOI, but each is issued a distinct version number and citation.
    • Previous versions are accessible from the Dataverse landing page.
  • Digital Curation Centre (DCC): Ball & Duke (2012)
    • New version is a new dataset with a new identifier.
    • “Time Slice” (data can be appended, but not revised): new identifier for only material added since last slice.  Users compile the slices to construct the complete dataset.
    • “Snapshot” (revisable): every so often the dataset is frozen and the static copy is issued a new identifier.  Only the snapshots are citable.
  • Earth Science Information Partners (ESIP)
    • Additions to time series are not new versions, don’t get new identifiers.  Citations include access date or the time range analyzed.
    • When revision is possible, data steward must distinguish major and minor version changes and describe the nature and file/record range of every version. Something that affects the whole data set (e.g. a complete reprocessing) would be considered a major version.
    • Major versions get a new identifier and collection-level metadata record.  The old metadata points to the new version and explains the status of the older version data.
    • New minor versions are explained in documentation, ideally in file-level metadata.
  • Lawrence, et al. (2011)
    • Any change, whether revision or addition, triggers a new review and a new identifier.
  • Natural Environment Research Council (NERC): Callaghan (2012)
    • Any change is a new version with a new identifier.
  • Organisation for Economic Co-operation and Development (OECD): Green (2009)
    • Cite a DOI that leads to a landing page, whether the data is revisable, growing, or static.
    • Use “wikipedia-like” versioning when appropriate.
  • ZooKeys: Penev et al. (2009)
    • Distinguishes between static data tables (e.g. spreadsheets) and dynamic databases.
    • Data table is assigned a DOI and does not change.
    • Database citation “must include careful specification of the version, date and time of accessioning.”
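
The policies above differ mainly in which kinds of change trigger a new identifier. The sketch below encodes a simplified reading of four of them as a lookup table. The three change categories, the policy labels, and the function name are assumptions made for illustration; the table reflects only the summaries in this list and not the full text of each recommendation.

```python
from enum import Enum


class Change(Enum):
    ADDITION = "addition"              # new records appended
    MINOR_REVISION = "minor revision"  # small corrections
    MAJOR_REVISION = "major revision"  # e.g. a complete reprocessing


# Which changes call for a new identifier, per the summaries above.
NEW_IDENTIFIER = {
    "Dataverse": set(),                  # all versions share one DOI
    "ESIP": {Change.MAJOR_REVISION},     # only major versions
    "Lawrence et al.": set(Change),      # any change
    "NERC": set(Change),                 # any change
}


def needs_new_identifier(policy: str, change: Change) -> bool:
    """Return True if the named policy mints a new identifier."""
    return change in NEW_IDENTIFIER[policy]


for policy in NEW_IDENTIFIER:
    triggers = [c.value for c in Change if needs_new_identifier(policy, c)]
    print(f"{policy}: {', '.join(triggers) or 'none'}")
```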

Deep Citation

If a publication uses only part of a dataset, the citation should ideally specify that precise subset.  Because datasets vary greatly in structure and their creators cannot fully anticipate the needs of future users, a general solution is difficult to reach.  However, several approaches address the problem in at least some cases:

  • DataCite: Starr & Gastl (2011)
    • The DataCite metadata scheme enables the specification of a variety of relationships to another object, including IsPartOf and HasPart.
    • Granularity can be handled by minting a new DOI for a subset of a dataset, and linking the two through metadata.
  • Dataverse: Altman & King (2007)
    • Cite the full dataset and describe in the text how the subset was derived.
    • If the subset derivation is simple, include it in citation.
    • Always include a UNF of the subset in the citation, which can be used to confirm that two subsets are identical.
  • Digital Curation Centre (DCC): Ball & Duke (2012)
    • Cite the smallest appropriate unit given an identifier by the repository, and describe in the text how the subset was derived.
  • Earth Science Information Partners (ESIP)
    • No general policy: “Data stewards should suggest how to reference subsets of their data”
    • If the subset derivation is simple, include it in citation.
  • Lawrence, et al. (2011)
    • If possible, use a “defined feature name from a defined registry” to refer to a subset in the citation.  Syntax:  [featureName, URI of the registry]
    • If further subsetting is necessary, use the measure of extent appropriate to the feature(s).
  • National Snow & Ice Data Center
    • Describe subset in citation (e.g. temporal or spatial range).
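
As one illustration of the DataCite approach listed above, the following sketch builds the pair of related-identifier entries that would link a subset's DOI to its parent dataset through IsPartOf and HasPart. The dictionary keys follow the property names of the DataCite metadata schema, but the fragments are not complete metadata records. The function name and both DOIs are placeholders.

```python
from typing import Dict, List, Tuple

Metadata = Dict[str, List[Dict[str, str]]]


def link_subset(parent_doi: str, subset_doi: str) -> Tuple[Metadata, Metadata]:
    """Return metadata fragments linking a parent dataset and a subset."""
    parent: Metadata = {
        "relatedIdentifiers": [{
            "relatedIdentifier": subset_doi,
            "relatedIdentifierType": "DOI",
            "relationType": "HasPart",    # the parent has the subset as a part
        }]
    }
    subset: Metadata = {
        "relatedIdentifiers": [{
            "relatedIdentifier": parent_doi,
            "relatedIdentifierType": "DOI",
            "relationType": "IsPartOf",   # the subset is part of the parent
        }]
    }
    return parent, subset


# Placeholder DOIs, for illustration only.
parent_md, subset_md = link_subset("10.5072/full-dataset", "10.5072/subset-a")
print(subset_md["relatedIdentifiers"][0]["relationType"])
```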

Data Citation Standards & Styles

The recommendations mentioned above represent only a handful of the many approaches to citing data that have been put forth going back at least as far as Sue Dodd’s paper of 1979.

A more extensive compilation of material on data citation can be found here:

CiteULike Data Citation Library
