Citation is a defining feature of scholarly publication, and if we want to say that a dataset has been published, we have to be able to cite it. The purposes of traditional paper citations, to recognize the work of others and to allow readers to judge the basis of the author's assertions, align with the purposes of data citations. Check out previous posts on the topic here.
In the past, datasets and databases were usually mentioned haphazardly, if at all, in the body of a paper and left out of the list of references. This no longer has to be the case.
Last month, there was quite a bit of activity on the data citation front:
The Committee on Data for Science and Technology (CODATA) released a thorough report on data citation.
A synthesized set of data citation principles, combining work from the Future of Research Communication and e-Scholarship (FORCE11), the Digital Curation Centre, CODATA, and DataCite, was released. The principles are:
Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
Credit and Attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.
Unique Identifiers: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
Persistence: Unique identifiers, and the metadata describing the data, should persist even beyond the lifespan of the data they describe.
Versioning and Granularity: Data citations should facilitate identification and access to different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.
Interoperability and Flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities but should not differ so much that they compromise interoperability of data citation practices across communities.
In the simplest case, when a researcher wants to cite the entirety of a static dataset, there seems to be a consensus set of core elements among DataCite, CODATA, and others. There is less agreement with respect to more complicated cases, so let's tackle the easy stuff first.
(Nearly) Universal Core Elements
- Creator(s): Essential, of course, to publicly credit the researchers who did the work. One complication here is that datasets can have large numbers of authors (into the hundreds), in which case an organizational name might be used.
- Date: The year of publication or, occasionally, when the dataset was finalized.
- Title: As is the case with articles, the title of a dataset should help the reader decide whether your dataset is potentially of interest. The title might contain the name of the organization responsible, or information such as the date range covered.
- Publisher: Many standards split the publisher into separate producer and distributor fields. Sometimes the physical location (City, State) of the organization is included.
- Identifier: A Digital Object Identifier (DOI), Archival Resource Key (ARK), or other unique and unambiguous label for the dataset.
Common Additional Elements
- Location: A web address from which the dataset can be accessed. DOIs and ARKs can be used to locate the resource cited, so this field is often redundant.
- Version: May be necessary for getting the correct dataset when revisions have been made.
- Access Date: The date the data was accessed for this particular publication.
- Feature Name: May be a formal feature from a controlled vocabulary, or some other description of the subset of the dataset used.
- Verifier: Information that can be used to make sure you have the right dataset.
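To make the elements above concrete, here is a minimal sketch of assembling a human-readable citation from them. The field names, the layout ("Creator (Year): Title. Version. Publisher. Identifier."), and the example dataset are all assumptions for illustration, not a mandated format:

```python
# Illustrative sketch: building a data citation string from the core
# elements (plus two optional ones). The layout and all example values
# are made up for demonstration.

def format_citation(creator, year, title, publisher, identifier,
                    version=None, access_date=None):
    """Assemble a citation string from core (and optional) elements."""
    parts = [f"{creator} ({year}): {title}."]
    if version:
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(f"https://doi.org/{identifier}")  # the identifier doubles as a locator
    if access_date:
        parts.append(f"(Accessed {access_date})")
    return " ".join(parts)

citation = format_citation(
    creator="Smith, J.; Lee, K.",
    year=2013,
    title="Hypothetical Ocean Temperature Records",  # fabricated example dataset
    publisher="Example Data Archive",
    identifier="10.1234/example-doi",                # fabricated DOI
    version="2.1",
    access_date="2014-03-01",
)
print(citation)
```

Note that because a DOI resolves to the dataset's landing page, the identifier serves as the location element as well, which is why a separate URL is often redundant.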
Datasets are different from journal articles in ways that can make them more difficult to cite. The first issue is deep citation or granularity, and the second is dynamic data.
Traditional journal articles are cited as a whole, and it is left to the reader to sort through the article to find the relevant information. When citing a dataset, more precision is sometimes necessary. If an analysis is done on part of a dataset, it can only be repeated by extracting exactly that subset of the data. Consequently, there is a desire for mechanisms allowing precise citation of data subsets. A number of solutions have been put forward:
- Most common and least useful is to describe, in the text of the article, how you extracted the subset.
- For some applications, such as time series, you may be able to specify a date or geographic range, or a limited number of variables, within the citation.
- Another approach is to mint a new identifier that refers only to the subset used, and to refer back to the source dataset in the metadata of the subset. The DataCite DOI metadata scheme includes a flexible mechanism to specify relationships between objects, including that one is part of another.
- The citation can include a Universal Numeric Fingerprint (UNF) as a verifier for the subset. A UNF can be used to test whether two datasets are identical, even if they are stored in different file formats. This won't help you find the subset you want, but it will tell you whether you've succeeded.
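The idea behind a format-independent verifier can be sketched simply: normalize each value to a canonical text form, then hash the result, so that the same numbers produce the same fingerprint no matter how they were stored. This is only an illustration of the concept; the real UNF specification defines its own normalization, rounding, and encoding rules:

```python
# Simplified fingerprint in the spirit of a UNF. NOT the actual UNF
# algorithm -- just a demonstration that canonicalizing values before
# hashing makes the fingerprint independent of storage format.
import hashlib

def simple_fingerprint(values, digits=7):
    """Hash a column of numbers in a storage-format-independent way."""
    # Canonical form: every value rendered as scientific notation text.
    canonical = "\n".join(f"{float(v):.{digits}e}" for v in values)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# The same numbers yield the same fingerprint whether they arrive as
# ints, floats, or strings parsed out of a CSV:
a = simple_fingerprint([1, 2.0, 3])
b = simple_fingerprint(["1.0", "2", "3.00"])
assert a == b
```

A reader who re-extracts the cited subset can recompute the fingerprint and compare it to the one in the citation: a match confirms they have the right data.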
When a journal article is published, it's set in stone. Corrections and retractions are rare occurrences, and small errors like typos are allowed to stand. In contrast, some datasets can be expected to change over time. There is no consensus as to whether, or how much, change is permitted before an object must be issued a new identifier. DataCite recommends, but does not require, that DOIs point to a static object.
Broadly, dynamic datasets can be split into two categories:
Appendable datasets get new data over time, but the existing data is never changed. If timestamps are applied to each entry, inclusion of an access date or a date range in the citation may allow a user to confidently reconstruct the state of the dataset. The Federation of Earth Science Information Partners (ESIP), for instance, specifies that an appendable dataset be issued a DOI only once, with a time range specified in the citation. On the other hand, the Dataverse standard and DCC guidelines require new DOIs for any change. If the dataset is impractically large, the new DOI may cover a "time slice" containing only the new data. For instance, each year of data from a sensor could be issued its own DOI.
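The reason timestamps make an appendable dataset citable by access date can be shown in a few lines: any cited state is recoverable by filtering on the timestamp. The record layout here is a made-up example:

```python
# Sketch: reconstructing the cited state of an appendable dataset from
# per-record timestamps and the access date given in the citation.
# The record structure is hypothetical.
from datetime import date

records = [
    {"timestamp": date(2013, 1, 15), "value": 10.2},
    {"timestamp": date(2013, 6, 1),  "value": 11.7},
    {"timestamp": date(2014, 2, 20), "value": 9.8},  # appended after the citation
]

def as_of(records, access_date):
    """Return the dataset as it stood on the cited access date."""
    return [r for r in records if r["timestamp"] <= access_date]

# A citation with access date 2013-12-31 resolves to the first two
# records; the later addition is excluded.
cited_state = as_of(records, date(2013, 12, 31))
assert len(cited_state) == 2
```

This only works because appendable data is never altered in place; for revisable datasets, filtering by date is not enough.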
Data in revisable datasets may be inserted, altered, or deleted. Citations to revisable datasets are likely to include version numbers or access dates. In this case, ESIP specifies that a new DOI should be minted for each "major" but not "minor" version. If a new DOI is required for each version, a "snapshot" of the dataset can be frozen from time to time and issued its own DOI.