The DataCite Meeting in Nancy, France

Last week I took a lovely train ride through the cow-dotted French countryside to attend the 2014 DataCite Annual Conference. The event was held at the Institut de l’information Scientifique et Technique (INIST) in Nancy, France, which is about 1.5 hours by train outside of Paris. INIST is the French DataCite member (more on DataCite later). I was invited to the meeting to represent the CDL, which has been an active participant in DataCite since its inception (see my slides). But before I can provide an overview of the DataCite meeting, we need to back up and make sure everyone understands the concept of identifiers, plus a few other bits of key background information.

Background

Identifiers

An identifier is a string of characters that uniquely identifies an object. The object might be a dataset, software, or other research product. Most researchers are familiar with a particular type of identifier, the digital object identifier (DOI). DOIs have been used by the academic publishing industry for uniquely identifying digital versions of journal articles for the last 15 years or so, and their use has recently expanded to other types of digital objects (posters, datasets, code, etc.). Although the DOI is the most widely known type of identifier, there are many, many other identifier schemes. Researchers do not necessarily need to understand the nuances of identifiers, however, since the data repository often chooses the scheme. The most important thing for researchers to understand is that their data needs an identifier both to be easy to find and to make it possible to get credit for that data.

The DataCite Organization

For those unfamiliar with DataCite, it’s a nonprofit organization founded in 2009. According to its website, its aims are to:

  • establish easier access to research data on the Internet
  • increase acceptance of research data as legitimate, citable contributions to the scholarly record
  • support data archiving that will permit results to be verified and re-purposed for future study.

In this capacity, DataCite has working groups, participates in large initiatives, and partners with national and international groups. Arguably, DataCite is best known for its work in helping organizations issue DOIs. CDL was a founding member of DataCite, and has representation on the advisory board and in the working groups.

EZID: Identifiers made easy

The CDL runs a service called EZID that provides DataCite DOIs to researchers and those who support them. EZID allows its users to create and manage long-term identifiers (it handles more than just DOIs). Individuals currently cannot go to the EZID website and obtain an identifier on their own, however; they must instead work with one of the many EZID clients, which include academic groups, private industry, government organizations, and publishers. Figshare, Dryad, many UC libraries, and the Fred Hutchinson Cancer Research Center are among those who obtain their DataCite DOIs from EZID.

Highlights from the meeting

#1: Enabling culture shifts

Andrew Treloar from the Australian National Data Service (ANDS) presented a great way to think about how we can enable the shift to a world where research data is valued, documented, and shared. The new paradigm first needs to be possible: this means supporting infrastructure at the institutional and national levels, giving institutions and researchers the tools to properly manage research data outputs, and providing ways to count data citations and help incentivize data stewardship. Second, the paradigm needs to be encouraged/required. We are making slow but steady headway on this front, with new initiatives for open data from government-funded research and requirements for data management plans. Third, the new paradigm needs to be adopted/embraced. That is, researchers should be asking for DOIs for their data, citing the data they use, and understanding the benefits of managing and sharing their data. This is perhaps the most difficult of the three. These three aspects of a new paradigm can help frame tool development, strategies for large initiatives, and arguments for institutional support.

#2: ZENODO’s approach to meeting research data needs

Lars Holm Nielsen from the European Organization for Nuclear Research (CERN) provided a great overview of the repository ZENODO. If you are familiar with figshare, this repository has similar aspects: anyone can deposit their information, regardless of country, institution, etc. It is a repository created to meet the needs of researchers interested in sharing research products. One of the interesting features of ZENODO is its openness to multiple types of licenses, including those that do not result in fully open data. Although I feel strongly about ensuring data are shared with open, machine-readable waivers/licenses, Nielsen made an interesting point: step one is actually getting the data into a repository. If this is accomplished, then opening the data up with an appropriate license can be discussed at a later date with the researcher. Although I’m not sure I agree with this strategy (I envision repositories full of data no one can actually search or use), it’s an interesting take.

Full disclosure: I might have a small crush on CERN due to the recent release of Particle Fever, a documentary on the discovery of the Higgs boson.

#3: The re3data-databib merger

Maxi Kindling from Humboldt University Berlin (representing re3data) and Michael Witt from Purdue University Libraries (representing databib) co-presented on plans for merging their two services, both searchable databases of repositories. Both re3data and databib have extensive metadata on data repositories available for depositing research data, covering a wide range of data types and disciplines. The merger makes sense since the two services emerged within X months of one another and there is no need to run them separately, with separate support, personnel, and databases. Kindling and Witt described the five principles of agreement for the merger: openness, optimal quality assurance, innovative functionality development, shared leadership (i.e., the two are equal partners), and sustainability. Regarding this last principle, the merged service has been “adopted” by DataCite, which will support it for the long term. The new service will be called re3data, with an advisory board called databib.

Attendees of the DataCite meeting had interesting lunchtime conversations around future integrations and tools development in conjunction with the new re3data. What about a repository “match-making” service, which could help researchers select the perfect repository for their data? Or integration with tools like the DMPTool? The re3data-databib group is likely coming up with all kinds of great ideas as a result of their new partnership, which will surely benefit the community as a whole.

#4: Lots of other great stuff

There were many other interesting presentations at the meeting: Amye Kenall from BioMed Central (BMC) talking about their GigaScience data journal; Mustapha Mokrane from the ICSU World Data System on data publishing efforts; and Nigel Robinson from Thomson Reuters on the Data Citation Index, to name a few. DataCite plans on making all of the presentations available on the conference website, so be sure to check that out in the next few weeks.

My favorite non-data part? The light show at the central square of Nancy, Place Stanislas. 20 minutes well-spent.


Sharing is caring, but should it count?

The following is a guest post by Shea Swauger, Data Management Librarian at Colorado State University. Shea and I both participated in a meeting for the Colorado Alliance of Research Libraries on 11 July 2014, where he presented survey results described below.


 

Vanilla Ice has a timely message for the data community. From Flickr by wiredforlego.

It shouldn’t be a surprise that many of the people who collect and generate research data are academic faculty members. One of the gauntlets that these individuals must face is the tenure and promotion process, an evaluation system that measures and rewards professional excellence and scholarly impact, and that can greatly affect the career arc of an aspiring scholar. As a result, tenure and promotion metrics naturally influence the kind and quantity of scholarly products that faculty produce.

Some advocates of data sharing have suggested using the tenure and promotion process as a way to incentivize data sharing. I thought this was a brilliant idea and had designs to advocate its implementation to members of the executive administration at my university, but first I wanted to gather some evidence to support my argument. My colleagues Beth Oehlerts, Daniel Draper, and Don Zimmerman and I sent a survey to all faculty members asking how they felt about incorporating shared research data as an assessment measure in the tenure and promotion process. Only about 10% (202) responded, so while generalizations about the larger population can’t be made, their answers are still interesting.

This is how I expected the survey to work:

Me: “If sharing your research data counted, in some way, towards you achieving tenure and promotion, would you be more likely to do it?”

Faculty: “Yes, of course!”

I’d bring this evidence to the university, sweeping changes would be made, data sharing would proliferate and all would be well.

I was wrong.

Speaking broadly, only about half of the faculty members surveyed said that changing the tenure and promotion process would make them more likely to share their data.

While 76% of the faculty were interested in sharing data in the future, and 84% said that data generation or collection is important to their research, half of faculty said that shared research data has little to no impact on their scholarly community and almost a quarter of faculty said they are unable to judge the impact.

Okay, let’s back up.

The tenure system is supposed to measure, among several things like teaching, service, etc., someone’s impact on their scholarly community. According to this idea there should be a correlation between the things that impact your scholarly community and the things that impact you achieving tenure. Now, back to the survey.

I asked faculty to rate the impact of several research products on their scholarly community as well as on their tenure and promotion. 94% of faculty rated ‘peer-reviewed journal articles’ at ‘high impact’ (the top of the scale) for impact upon their scholarly community, and 96% of faculty rated ‘peer-reviewed journal articles’ at ‘high impact’ upon their tenure and promotion. This supports the idea that because peer-reviewed journal articles have a high impact on the scholarly community, they have a high impact on the tenure and promotion process.

Shared research data had a similar impact correlation, though on the opposite end of the impact spectrum. Little impact on the scholarly community means little impact on the tenure and promotion process. Bad news for data sharing. Reductively speaking, I believe this to be the essence of the argument: contributions that are valuable to a research community should be rewarded in the tenure and promotion process; shared research data isn’t valuable to the research community; therefore, data sharing should not be rewarded.

Also, I received several responses from faculty saying that they were obligated not to share their data because of the kind of research they were doing, be it in defense, the private sector, or working with personally identifiable or sensitive data.  They felt that if the university started rewarding data sharing, they would be unfairly punished because of the nature of their research. Some suggested that a more local implementation of a data sharing policy, perhaps on a departmental basis or an individual opt-in system might be fairer to researchers who can’t share their data for one reason or another.

So what does this mean?

Firstly, it means that there’s a big perception gap between the importance of ‘my data to my research’ and the importance of ‘my data to someone else’s research’. Closing this gap could go a long way toward increasing data sharing. Secondly, it means that the tenure and promotion system is a complicated, political mechanism, and trying to leverage it as a way to incentivize data sharing is not easy or straightforward. For now, I’ve decided not to pursue amending the local tenure system; however, I have hope that as interest in data sharing grows we can find meaningful ways to reward people who choose to share their data.

Note: the work described above is being prepared for publication in 2015.


Unicorn Data Sharing

A few years ago I created a little video about data sharing using an online application called Xtranormal. Alas, the application has gone bust and it’s hard to access the videos created on that site. As a result, I’m adding my video here so you can still enjoy it.

It takes a data management village

A couple of weeks ago, information scientists, librarians, social scientists, and their compatriots gathered in Toronto for the 2014 IASSIST meeting. IASSIST is, of course, an acronym which I always have to look up to remember – International Association for Social Science Information Service & Technology. Despite its forgettable name, this conference is one of the better meetings I’ve attended. The conference leadership manages to put together a great couple of days, chock full of wonderful plenaries and interesting presentations, and even arranged a hockey game for the opening reception.

Yonge Street crowds celebrating the end of the Boer War, Toronto, Canada. This image is available from the City of Toronto Archives, and is in the public domain.

Although there were many interesting talks, and I’m still processing the great discussions I had in Toronto, a couple really rang true for me. I’m now going to shamelessly paraphrase one of these talks (with permission, of course) about building a “village” of data management experts at institutions to best serve researchers’ needs. All credit goes to Alicia Hofelich Mohr and Thomas Lindsay, both from the University of Minnesota. Their presentation was called “It takes a village: Strengthening data management through collaboration with diverse institutional offices.” I’m sure IASSIST will make the slides available online in the near future, but I think this information is too important not to share ASAP.

Mohr and Lindsay first described the data life cycle, and emphasized the importance of supporting data throughout its life – especially early on, when small things can make a big difference down the road. They asserted that in order to provide support for data management, librarians need to connect with other service providers at their institutions. They then described who these providers are, and where they fit into the broader picture. Below I’ve summarized Mohr and Lindsay’s presentation.

Grants coordinators

Faculty writing grants are constantly interacting with these individuals. They are on the “front lines” of data management planning, in particular, since they can point researchers to other service providers who can help over the course of the project. Bonus – grants offices often have a deep knowledge of agency requirements for data management.

Sponsored projects

The sponsored projects office is another service provider that often has early interactions with researchers during their project planning. Researchers are often required to submit grants directly to this office, which ensures compliance and focuses on the requirements needed for proposals to be complete.

College research deans

Although this might be an intimidating group to connect with, they are likely to be the most aware of the current research climate and can help you target your services to the needs of their researchers. They can also help advocate for your services, especially via things like new faculty orientation. Generally, this group is an important ally in facilitating data sharing and reuse.

IT system administrators

This group is often underused by researchers, despite their ability to potentially provide researchers with server space, storage, collaboration solutions, and software licenses. They are also useful allies in ensuring security for sensitive data.

Research support services & statistical consulting offices

Some universities have support for researchers in the designing, collecting, and analyzing of their data. These groups are sometimes housed within specific departments, and therefore might have discipline-specific knowledge about repositories, metadata standards, and cultural norms for that discipline. They are often formally trained as researchers and can therefore better relate to your target audience. In addition, these groups have the opportunity to promote replicable workflows and help researchers integrate best practices for data management into their everyday processes.

Data security offices, copyright/legal offices, & commercialization offices

Groups such as these are often overlooked by librarians looking to build a community of support around data management. Individuals in these offices may be able to provide invaluable expertise to your network, however. These groups contribute to and implement University security, data, and governance policies, and are knowledgeable about the legal implications of data sharing, especially related to sensitive data. Intellectual property rights, commercialization, and copyright are all complex topics that require expertise not often found among other data stewardship stakeholders. Partnering with experts can help reduce the potential for future problems, plus ensure data are shared to the fullest extent possible.

Library & institutional repository

The library is, of course, distinct from an institutional repository. However, often the institution’s library plays a key role in supporting, promoting, and often implementing the repository. I often remind researchers that librarians are experts in information, and data is one of many types of information. Researchers often underuse librarians and their specialized skills in metadata, curation, and preservation. The researchers’ need for a data repository and the strong link between repositories and librarians will change this in the coming years, however. Mohr and Lindsay ended with this simple statement, which nicely sums up their stellar presentation:

The data support village exists across levels and boundaries of the institution as well as across the lifecycle of data management.


Fifteen ideas about data validation (and peer review)

It’s easy to evaluate a person by the shape of their head, but datasets are more complicated. Phrenology diagram from Vaught’s Practical Character Reader in the Internet Archive.

Many open issues drift around data publication, but validation is both the biggest and the haziest. Some form of validation at some stage in a data publication process is essential; data users need to know that they can trust the data they want to use, data creators need a stamp of approval to get credit for their work, and the publication process must avoid getting clogged with unusable junk. However, the scientific literature’s validation mechanisms don’t translate as directly to data as its mechanism for, say, citation.

This post is in part a very late response to a data publication workshop I attended last February at the International Digital Curation Conference (IDCC). In a breakout discussion of models for data peer review, there were far more ideas about data review than time to discuss them. Here, for reference purposes, is a longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work. I’ve tried to stay away from deeper consideration of what data quality means (which I’ll discuss in a future post) and from the broader issues of peer review associated with the literature, but they inevitably pop up anyway.

  1. Data validation is like peer review of the literature: Peer review is an integral part of science; even when they resent the process, scientists understand and respect it. If we are to ask them to start reviewing data, it behooves us to slip data into existing structures. Data reviewed in conjunction with a paper fits this approach. Nature Publishing Group’s Scientific Data publishes data papers through a traditional review process that considers the data as well as the paper. Peer review at F1000Research follows a literature-descended (although decidedly non-traditional) process that asks reviewers to examine underlying data together with the paper.
  2. Data validation is not like peer review of the literature: Data is fundamentally different from literature, and shouldn’t be treated the same way. As Mark Parsons put it at the workshop, “literature is an argument; data is a fact.” The fundamental question in peer review of an article is “did the authors actually demonstrate what they claim?” This involves evaluation of the data, but in the context of a particular question and conclusion. Without a question, there is no context, and no way to meaningfully evaluate the data.
  3. Divide the concerns: Separate out aspects of data quality and consider them independently. For example, Sarah Callaghan divides data quality into technical and scientific quality. Technical quality demands complete data and metadata and appropriate file formats; scientific quality requires appropriate collection methods and high overall believability.
  4. Divvy up the roles: Separate concerns need not be evaluated by the same person or even the same organization. For instance, GigaScience assigns a separate data reviewer for technical review. Data paper publishers generally coordinate scientific review and leave at least some portion of the technical review to the repository that houses the data. Third party peer-review services like LIBRE or Rubriq could conceivably take up data review.
  5. Review data and metadata together: A reviewer must assess data in conjunction with its documentation and metadata. Assessing data quality without considering documentation is both impossible and pointless; it’s impossible to know that data is “good” without knowing exactly what it is and, even if one could, it would be pointless because no one will ever be able to use it. This idea is at least implicit in any data review scheme. In particular, data paper journals explicitly raise evaluation of the documentation to the same level as evaluation of the data. Biodiversity Data Journal’s peer review guidelines are not unusual in addressing not only the quality of the data and the quality of the documentation, but the consistency between them.
  6. Experts should review the data: Like a journal article, a dataset should pass review by experts in the field. Datasets are especially prone to cross-disciplinary use, in which case the user may not have the background to evaluate the data themselves. Sarah Callaghan illustrated how peer review might work– even without a data paper– by reviewing a pair of (already published) datasets.
  7. The community should review the data: Like a journal article, the real value of a dataset emerges over time as a result of community engagement. After a slow start, post-publication commenting on journal articles (e.g. through PubMed Commons) seems to be gaining momentum.
  8. Users should review the data: Data review can be a byproduct of use. A researcher using a dataset interrogates it more thoroughly than someone just reviewing it. And, because they were doing it anyway, the only “cost” is the effort of capturing their opinion. In a pilot study, the Dutch Data Archiving and Networked Services repository solicited feedback by emailing a link to an online form to researchers who had downloaded their data.
  9. Use is review: “Indeed, data use in its own right provides a form of review.” Even without explicit feedback, evidence of successful use is itself evidence of quality. Such evidence could be presented by collecting a list of papers that cite the dataset.
  10. Forget quality, consider fitness for purpose: A dataset may be good enough for one purpose but not another. Trying to assess the general “quality” of a dataset is hopeless; consider instead whether the dataset is suited to a particular use. Extending the previous idea, documentation of how and in what contexts a dataset has been used may be more informative than an assessment of abstract quality.
  11. Rate data with multiple levels of quality: The binary accept/reject of traditional peer review (or, for that matter, fit/unfit for purpose) is overly reductive. A one-to-five (or one-to-ten) scale, familiar from pretty much the entire internet, affords a more nuanced view. The Public Library of Science (PLOS) Open Evaluation Tool applies a five-point scale to journal articles, and DANS users rated datasets on an Amazon-style five-star scale.
  12. Offer users multiple levels of assurance: Not all data, even in one place, needs to be reviewed to the same extent. It may be sensible to invest limited resources to most thoroughly validate those datasets which are most likely to be used. For example, Open Context offers five different levels of assurance, ranging from “demonstration, minimal editorial acceptance” to “peer-reviewed.” This idea could also be framed as levels of service ranging (as Mark Parsons put it at the workshop) from “just thrown out there” to “someone answers the phone.”
  13. Rate data along multiple facets: Data can be validated or rated along multiple facets or axes. DANS datasets are rated on quality, completeness, consistency, and structure; two additional facets address documentation quality and usefulness of file formats. This is arguably a different framing of divided concerns, with a difference in application: there, independent assessments are ultimately synthesized into a single verdict; here, the facets are presented separately.
  14. Dynamic datasets need ongoing review: Datasets can change over time, either through addition of new data or revision and correction of existing data. Additions and changes to datasets may necessitate a new (perhaps less extensive) review. Lawrence (2011) asserts that any change to a dataset should trigger a new review.
  15. Unknown users will put the data to unknown uses: Whereas the audience for, and findings of, a journal article are fairly well understood by the author, a dataset may be used by a researcher from a distant field for an unimaginable purpose. Such a person is both the most important to provide validation for– because they lack the expertise to evaluate the data themselves– and the most difficult– because no one can guess who they will be or what they will want to do.

Have an idea about data review that I left out? Let us know in the comments!

Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter“. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and divides individuals into one of five categories depending on at what point in the curve they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” – GitHub, in this case –will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems – most notably, Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way for efficiently developing software projects, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how git works, its structure, or how to incorporate git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science“, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each others’ work. No more v23.docx or appended file names with initials. Instead, a co-author can submit changes and document those with “commit messages” (read about them on GitHub here).

3. Feedback and review is easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded in the way that it did. For the super savvy GitHubber, you can make available your entire manuscript, from the first datapoint collected to the final submitted version, traceable on your site. This is my goal for my next manuscript.

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.

References and other good reads


Abandon all hope, ye who enter dates in Excel

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.1 After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.2 Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).3 Furthermore, the 1900 date system deliberately, and erroneously, treats 1900 as a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.
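Before the tale, here is a minimal sketch in R of where that four-years-and-a-day shift comes from (the serial number below is invented purely for illustration): the same day count, read against the two different origins, yields calendar dates exactly 1,462 days apart.

# a made-up Excel serial number, purely for illustration
serial <- 21931

# interpreted with the 1900 (Windows) convention; "1899-12-30" is the origin
# commonly used in R because it absorbs Excel's phantom February 29, 1900
as.Date(serial, origin = "1899-12-30")

# interpreted with the 1904 (Mac) convention
as.Date(serial, origin = "1904-01-01")

# the two readings always differ by exactly 1462 days, which for dates like
# the ones in the dataset below works out to four years and a day
as.Date(serial, origin = "1904-01-01") - as.Date(serial, origin = "1899-12-30")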

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
# load the workbook and pull the first worksheet into a data frame
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
# give the columns readable names
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.


This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.

My solution for these particular data was as follows:

# function to add four years and a day to a given date
fix_excel_dates <- function(date) {
    require(lubridate)
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26
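As a quick sanity check (a minimal sketch using the same dplyr and lubridate packages as above), you can also confirm programmatically, rather than by eyeballing head(), that the corrected dates agree with the year column:

# rows where the corrected date still disagrees with the recorded year;
# this should come back empty if the correction worked (December 31 entries,
# which roll over into January, would be a legitimate exception to check by hand)
filter(secchi_main, lubridate::year(corrected_date) != year)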

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say:

and-thats-why-excel


  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”

Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan


Mountain Observatories in Reno

A few months ago, I blogged about my experiences at the NSF Large Facilities Workshop. “Large Facilities” encompass things like NEON (National Ecological Observatory Network), IRIS PASSCAL Instrument Center (Incorporated Research Institutions for Seismology Program for Array Seismic Studies of the Continental Lithosphere), and the NRAO (National Radio Astronomy Observatory). I found the event itself to be an eye-opening experience: much to my surprise, there was some resistance to data sharing in this community. I had always assumed that large, government-funded projects had strict data sharing requirements, but this is not the case. I had stimulating arguments with Large Facilities managers who considered their data too big and complex to share and who (more worrisome) believed that their researchers would be very resistant to opening up the data they generated at these large facilities.

Why all this talk about large facilities? Because I’m getting the chance to make my arguments again, to a group with overlapping interests to that of the Large Facilities community. I’m very excited to be speaking at Mountain Observatories: A Global Fair and Workshop  this July in Reno, Nevada. Here’s a description from the organizers:

The event is focused on observation sites, networks, and systems that provide data on mountain regions as coupled human-natural systems. So the meeting is expected to bring together biophysical as well as socio-economic researchers to discuss how we can create a more comprehensive and quantitative mountain observing network using the sites, initiatives, and systems already established in various regions of the world.

I must admit, I’m ridiculously excited to geek out with this community. I’ll get to hear about the GLORIA Project (GLObal Robotic-telescopes Intelligent Array), something called “Mountain Ethnobotany“, and “Climate Change Adaptation Governance”. See a full list of the proposed sessions here. The conference is geared towards researchers and managers, which means I’ll have the opportunity to hear about data sharing proclivities straight from their mouths. The roster of speakers joining me includes a hydroclimatologist (Mike Dettinger, USGS) and a researcher focused on socio-cultural systems (Courtney Flint, Utah State University), plus representatives from the NSF, a sensor networks company, and others. The conference should be a great one – the abstract submission deadline was just extended, so there’s still time to join me and nerd out about science!

Reno! From Flickr by Ravensmagiclantern


Lit Review: #PLOSFail and Data Sharing Drama

Turn and face the strange, researchers. From pipedreamsfromtheshire.wordpress.com


I know what you’re thinking– how can yet another post on the #PLOSfail hoopla say anything new? Fear not. I say nothing particularly new here, but I do offer a three-weeks-out lit review of the hoopla, in hopes of finding a pattern in the noise. For those new to the #PLOSFail drama, the short version is this: PLOS enacted a mandatory data sharing policy. Researchers flipped out. See the sources at the end of this post for more background.

 Arguments made against data sharing

1) My data is my lifeblood. I won’t just give it away.

Terry McGlynn, a biologist writing at Small Pond Science, argues that “Regardless of the trajectory of open science, the fact remains that, at the moment, we are conducting research in a culture of data ownership.” Putting the ownership issue aside for now, let’s focus on the crux of McGlynn’s argument: he contends that data sharing results in turning a private resource (data) into a community resource. This is especially burdensome for small labs (like his) since each data point takes relatively more effort to produce. If this resource is available to anyone, the benefits to the former owner are greatly reduced since they are now shared with the broader community.

Although these are valid concerns, they are not in the best interest of science. I argue that what we are really talking about here is the incentive problem (see more in the section below). That is, publications are valued in performance evaluation of academics, while data are not. Everyone can agree that data is indispensable to scientific advancement, so why hasn’t the incentive structure caught up yet? If McGlynn were able to offset the loss of benefits caused by data sharing by getting mad props for making his data available and useful, this issue would be less problematic. Jeff Leek, a biostatistician blogging at Simply Statistics, makes a great point with regard to this: to paraphrase him, the culture of credit hasn’t caught up with the culture of science. There is no appropriate form of credit for data generators – it’s either citation (seems chintzy) or authorship (not always appropriate). Solution: improve incentives for data sharing. Find a way to appropriately credit data producers.

2) My datasets are special, unique snowflakes. You can’t understand/use them.

Let’s examine what McGlynn says about this with regard to researchers re-using his data: “…anybody working on these questions wouldn’t want the raw data anyway, and there’s no way these particular data would be useful in anybody’s meta analysis. It’d be a huge waste of my time.”

Rather than try to come up with a new, witty way to answer to this argument, I’ll shamelessly quote from MacManes Lab blog post, Corner cases and the PLOS data policy:

There are other objections – one type is the ‘my raw data are so damn special that nobody can ever make sense of them’, while another is ‘I use special software and stuff, so they are probably not useful to anybody else’. I call BS on both of these arguments. Maybe you have the world’s most complicated data, but why not release them and not worry about whether or not people find them useful – that is not your concern (though it should be).

I couldn’t have said it better. The snowflake refrain from researchers is not new. I’ve heard it time and again when talking to them about data archiving. There is certainly truth to this argument: most (all?) datasets are unique. Why else would we be collecting data? This doesn’t make them useless to others, especially if we are sharing data to promote reproducibility of reported results.

DrugMonkey, an anonymous blogger and biomedical researcher, took this “my data are unique” argument to paranoia level. In their post, PLoS is letting the inmates run the asylum and it will kill them, s/he contends that researchers will somehow be forced to use all the same methods to facilitate data reuse. “…diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that. So we’ll have PLoS [sic] inserting itself in the role of how experiments are to be conducted and interpreted!”

I imagine DrugMonkey pictures future scientists in grey overalls, trudging to a factory to do “science”. This is just ridiculous. The idiosyncrasies of how individual researchers handle their data will always be part of the challenge of reproducibility and data curation. But I have never (ever) heard of anyone suggesting that all researchers in a given field should be doing science in the exact same way. There are certainly best practices for handling datasets. If everyone followed these to the best of their ability, we would have an easier time reusing data. But no one is punching a time card at the factory.

 3) Data sharing is hard | time-consuming | new-fangled.

This should probably be #1 in the list of arguments from researchers. Even for those who cite other reasons for not sharing their data, this is probably at the root of the hoarding. Full disclosure – only a small portion of the datasets I have generated as a researcher are available to the public. The only explanation is that it’s time-consuming and I have other things on my plate. So I hear you, researchers. That said, the time has come to start sharing.

DrugMonkey says that the PLOS data policy requires much additional data curation which will take time. “The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time…” McGlynn states this point succinctly: “Why am I sour on required data archiving? Well, for starters, it is more work for me… To get these numbers into a downloadable and understandable condition would be, frankly, an annoying pain in the ass.”

Fair enough. But I argue here (along with others) that making data available is not an optional side note of research: it is research. In the comments of David Crotty’s post at The Scholarly Kitchen, “PLOS’ bold data policy“, there was a comment that I loved. The commenter, Mike Taylor, said this:

 …data curation is research. I’d argue that a researcher who doesn’t make available the data necessary to reproduce his conclusions isn’t getting his job done. Complaining about having to spend time on preparing the data for others to use is like complaining about having to spend time writing the paper, or indeed running experiments.

When I read that comment, I might have fist pumped a little. Of course, we still have that pesky incentive issue to work out… As Crotty puts it, “Perhaps the biggest practical problem with [data sharing] is that it puts an additional time and effort burden on already time-short, over-burdened researchers. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.” Sigh.

What about that “new-fangled” bit? Well, researchers often complain that data management and curation requires skills that are not taught. I 100% agree with this statement – see my paper on the lack of data management education for even undergrads. But as my ex-cop dad likes to say, “ignorance of the law is not a defense”. In continuation of my shameless quoting from others, here’s what Ted Hart (Staff Scientist at NEON) has to say in his post, “Just Get Over Yourself and Share Your Data“:

Sharing is hard, but not an intractable problem… Is the alternative that everyone just does everything in secret with myriad idiosyncrasies ferociously milking least publishable units from a data set? That just seems like a recipe for science moving slowly and in the dark. …I think we just need to own up to the fact that being a scientist these days requires new skills, and it always has. You didn’t have to know how to do PCR prior to 1983, but now you do. In the 21st century to do science better, we need more than spreadsheets with a few rows, we need to implement best practices for data management.

More fist pumping! No, things won’t change overnight. Leek at Simply Statistics rightly stated that the transition to open data will be rough for two reasons: (1) there is no education on data handling, and (2) there is a disconnect between the incentives for individual researchers and the actions that will benefit science as a whole. Sigh. Back to that incentive issue again.

Highlights & Takeaways

At risk of making this blog post way too long, I want to showcase a few highlights and takeaways from my deep dive into the #PLOSfail blogging world.

1) The Incentives Problem

We have a big incentives problem, which was probably obvious from my repeated mentions of it above. What’s good for researchers’ careers is not conducive to data sharing. If we expect behavior to change, we need to work on giving appropriate credit where it’s due.

Biologist Björn Brembs puts it well in his post, What is the Difference Between Text, Data, and Code?“…it is unrealistic to expect tenure committees and grant evaluators to assess software and data contributions before anybody even is contributing and sharing data or code.” Yes, there is a bit of a chicken-and-egg situation. We need movement on both sides to get somewhere. Share the data, and they will start to recognize it.

2) Empiricism Versus Theory

There is a second plot line to the data sharing rants: empiricists versus theoreticians. See ecologist Timothée Poisot‘s blog, “Of the value of datasets and methods in open science” for a more extensive review of this issue as it relates to data sharing. Of course, this tension is not a new debate in science. But terms like “data vultures” get thrown about, and feelings get hurt. Due to the nature of their work, most theoreticians’ “data” is equations, methods, and code that are shared via publication. Meanwhile, empiricists generate data and can hoard it until they see fit to share it, only offering a glimpse of the entire suite of their research outputs. To paraphrase Hart again: science is equal parts data and analysis/methods. We need both, so let’s stop fighting and encourage open science all around.

3) Data Ownership Issues

There are lots of potential data owners: the funders who paid for the work, the institution where the research was performed, the researcher who collected the data, the principal investigator of the lab where the researcher works, etc. etc. The complications around data ownership make this a tricky subject to work out. Zen Faulkes, a neurobiologist at the University of Texas, blogged about who owns data, in particular, his data. He did a little research and found what many (most?) researchers at universities might find: “I do not own research data I generate. Neither do the funding agencies. The University of Texas system Board of Regents own research data I generate.” Faulkes goes on to state that the regents probably don’t care what he does with his data unless/until they can make money off of it… very true. To make things more complicated, Crotty over at Scholarly Kitchen reminded us that “under US law (the Bayh-Dole Act), the intellectual property (IP) generated as the result of federal research funds belongs to the researcher and their institution.” What does that even mean?!

To me, the issue is not about who owns the data outright. Instead, it’s about my role as an open science “waccaloon” who is interested in what’s best for the scientific process. To that extent, I am going to borrow from Hart again. Hart makes a comparison between having data and having a pet: in Boulder CO, there are no pet “owners” – only pet “guardians”. We can think of our data in this same way: we don’t own it; we simply care for it, love it, and are intellectually (and sometimes emotionally!) invested in it.

4) PLOS is Part of a Much Bigger Movement

Open science mandates are already here. The OSTP memo released last year is a huge leap forward in this direction – it requires that federally funded research outputs (including data) be made available to the public. Crotty draws a link between OSTP and PLOS policies in his blog: “Once this policy goes into effect, PLOS’ requirements would seem to be an afterthought for authors funded in this manner. The problem is that the OSTP policy seems nowhere near being implemented.”

That last part is most definitely true. One way to work on implementing this policy? Get the journals involved. The current incentive structure is not well-suited for ensuring compliance with OSTP, but journals have a role as gatekeepers to the traditional incentives. Crotty states it this way:

PLOS has never been a risk averse organization, and this policy would seem to fit well with their ethos of championing access and openness as keys to scientific progress. Even if one suspects this policy is premature and too blunt an instrument, one still has to respect PLOS for remaining true to their stated goals.

So I say kudos to PLOS!

In Conclusion…

I’ll end with a quote from MacManes Lab blog post:

How about this, make an honest effort to make the data accessible and useful to others, and chances are you’re probably good to go.

Final fist pump.

Sources

  1. Timothée Poisot, ecologist. Of the value of datasets and methods in open science.
  2. Terry McGlynn, biologist. I own my data until I don’t. Blog at Small Pond Science. @hormiga
  3. David Crotty, publisher & former researcher. PLOS’ bold data policy. Blog at The Scholarly Kitchen. @scholarlykitchn
  4. Edmund Hart, Staff Scientist at NEON. Just Get Over Yourself and Share Your Data. @DistribEcology
  5. MacManes Lab, genomics. Corner cases and the PLOS data policy.
  6. DrugMonkey, biomedical researcher. PLoS is letting the inmates run the asylum and it will kill them. @DrugMonkey
  7. Zen Faulkes, neurobiologist. Who owns data. Blog at NeuroDojo. @DoctorZen
  8. Björn Brembs, biologist. What is the Difference Between Text, Data, and Code? @brembs
  9. Jeff Leek, biostatistician. PLoS One, I have an idea for what to do with all your profits: buy hard drives. Blog at Simply Statistics. @leekgroup

Twitter feed for #PLOSfail

From PLOS
