DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open-source and we encourage community discussion and contribution to our GitHub repository/site.

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID.
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can’t properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform. Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly. It is more efficient to move forward with a single platform, and we chose Dash as the more sustainable basis for this consolidated service. Dash is implemented in the Ruby on Rails framework, which is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data.  All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Also, users will not be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.

 

It’s the dawn of a new day for DataUp! From Flickr by David Yu.


The First UC Libraries Code Camp

This post was co-authored by Stephen Abrams.

Military camp on Coronado Island, California. Contributed to Calisphere by the San Diego History Center. Click on the image for more information.

So 30 coders walk into a conference center in Oakland… No, it’s not a bad joke in need of a punch line; rather, it describes the start of the first UC Libraries Code Camp, which took place in downtown Oakland last week. These coders were all from the University of California system (8 out of 10 campuses were represented!) and work with or for the UC libraries. CDL sponsored the event and was well represented among the attendees.

The event consisted of two days of lively collaborative brainstorming on ways to provide better, more sustainable library services to the UC community. Camp participants represented a variety of library roles (curatorial, development, and IT), providing a useful synergistic approach to common problems and solutions. The camp was organized according to the participatory unconference format, in which topics of discussion were arrived at through group consensus. The final schedule included 10 breakout sessions on topics as diverse as the UC Libraries Digital Collection (UCLDC), data visualization, agile methodology, cloud computing, and use of APIs. There was also a plenary session of “dork shorts” in which campus representatives gave summary presentations on selected services and initiatives of common interest.

The conference agenda, with notes from the various breakouts, is available on the event website. For those of us that work in the very large and expansive UC system, get-togethers like this one are crucial for ensuring we are efficiently and effectively supporting the UC community.

Of Note

  • We established a GitHub organization: UCLT. Join by emailing your GitHub username to uc3@ucop.edu.
  • We are establishing a Listserv: uclibrarytech-l@ucop.edu
  • Next code camp to take place in the south, in January or February 2015. (we need a southern campus to volunteer!)

Next Steps

  1. Establish a new Common Knowledge Group for Libraries Information Technologists. We need to draft a charter and establish the initial principles of the group. Status: in progress, being led by Rosalie Lack, CDL
  2. Help articulate the need for more resources (staff, knowledge, skills, funding) that would allow libraries to better support data and the researchers creating and managing data. Status: skills database table is being filled out. Will help guide discussions about library resources across the UC.
  3. Build up a database of UC libraries technologists; help share expertise and skills. Status: table being filled out. Will be moved to GitHub wiki once completed.
  4. Establish a collaborative space for us to share war stories, questions, concerns, approaches to problems, etc. Status: GitHub Organization created. Those interested should join by emailing us at uc3@ucop.edu with their GitHub username.
  5. Have more Code Camp style events, and rotate locations between campuses and regions (e.g., North versus South). Status: can plan these via GitHub organization + listserv
  6. Keep UC Code Camp conversations going, drilling down into some specific topics via virtual conferencing. Status: can plan these via GitHub organization + listserv. Can create specific “teams” within the GitHub organization to help organize more specific groups within the organization.
  7. Develop teams of IT + librarians to help facilitate outreach and education on campuses.
  8. Have CDL visit campuses more often to run informational sessions.
  9. Have space for sharing outreach and education materials around data management, tools and services available, etc. Status: can use GitHub organization or …?

The DataCite Meeting in Nancy, France

Last week I took a lovely train ride through the cow-dotted French countryside to attend the 2014 DataCite Annual Conference. The event was held at the Institut de l’information Scientifique et Technique (INIST) in Nancy, France, which is about 1.5 hours by train outside of Paris. INIST is the French DataCite member (more on DataCite later). I was invited to the meeting to represent the CDL, which has been an active participant in DataCite since its inception (see my slides). But before I can provide an overview of the DataCite meeting, we need to back up and make sure everyone understands the concept of identifiers, plus a few other bits of key background information.

Background

Identifiers

An identifier is a string of characters that uniquely identifies an object. The object might be a dataset, software, or other research product. Most researchers are familiar with a particular type of identifier, the digital object identifier (DOI). These have been used by the academic publishing industry for uniquely identifying digital versions of journal articles for the last 15 years or so, and their use recently has expanded to other types of digital objects (posters, datasets, code, etc.). Although the DOI is the most widely known type of identifier, there are many, many other identifier schemes. Researchers do not necessarily need to understand the nuances of identifiers, however, since the data repository often chooses the scheme. The most important thing for researchers to understand is that their data needs an identifier to be easy to find, and to facilitate getting credit for that data.
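As a quick illustration, any DOI can be turned into a stable link by prepending the doi.org resolver. Here is a minimal R sketch of my own (using the httr package and a made-up placeholder identifier, not a real dataset DOI):

library(httr)                          # assumes the httr package is installed
doi <- "10.5072/FK2EXAMPLE"            # placeholder; 10.5072 is a test prefix, not a real dataset DOI
resp <- GET(paste0("https://doi.org/", doi))
status_code(resp)                      # a registered DOI redirects to the dataset's landing page

Citing that https://doi.org/ URL, rather than a repository’s own web address, is what keeps a dataset findable (and creditable) even if the repository later reorganizes its site.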

The DataCite Organization

For those unfamiliar with DataCite, it’s a nonprofit organization founded in 2009. According to their website, their aims are to:

  • establish easier access to research data on the Internet
  • increase acceptance of research data as legitimate, citable contributions to the scholarly record
  • support data archiving that will permit results to be verified and re-purposed for future study.

In this capacity, DataCite has working groups, participates in large initiatives, and partners with national and international groups. Arguably they are most known for their work in helping organizations issue DOIs. CDL was a founding member of DataCite, and has representation on the advisory board and in the working groups.

EZID: Identifiers made easy

The CDL has a service that provides DataCite DOIs to researchers and those that support them, called EZID. The EZID service allows its users to create and manage long term identifiers (they do more than just DOIs). Note that individuals currently cannot go to the EZID website and obtain an identifier, however. They must instead work with one of the EZID clients, of which there are many, including academic groups, private industry, government organizations, and publishers. Figshare, Dryad, many UC libraries, and the Fred Hutchinson Cancer Research Center are among those who obtain their DataCite DOIs from EZID.

Highlights from the meeting

#1: Enabling culture shifts

Andrew Treloar from the Australian National Data Service (ANDS) presented a great way to think about how we can enable the shift to a world where research data is valued, documented, and shared. The new paradigm first needs to be possible: this means supporting infrastructure at the institutional and national levels, giving institutions and researchers the tools to properly manage research data outputs, and providing ways to count data citations and help incentivize data stewardship. Second, the paradigm needs to be encouraged/required. We are making slow but steady headway on this front, with new initiatives for open data from government-funded research and requirements for data management plans. Third, the new paradigm needs to be adopted/embraced. That is, researchers should be asking for DOIs for their data, citing the data they use, and understanding the benefits of managing and sharing their data. This is perhaps the most difficult of the three. These three aspects of a new paradigm can help frame tool development, strategies for large initiatives, and arguments for institutional support.

#2: ZENODO’s approach to meeting research data needs

Lars Holm Nielsen from the European Organization for Nuclear Research (CERN) provided a great overview of the repository ZENODO. If you are familiar with figshare, this repository has similar aspects: anyone can deposit their information, regardless of country, institution, etc. This was a repository created to meet the needs of researchers interested in sharing research products. One of the interesting features of ZENODO is its openness to multiple types of licenses, including those that do not result in fully open data. Although I feel strongly about ensuring data are shared with open, machine-readable waivers/licenses, Nielsen made an interesting point: step one is actually getting the data into a repository. If this is accomplished, then opening the data up with an appropriate license can be discussed at a later date with the researcher. While I’m not sure I agree with this strategy (I envision repositories full of data no one can actually search or use), it’s an interesting take.

Full disclosure: I might have a small crush on CERN due to the recent release of Particle Fever, a documentary on the discovery of the Higgs boson particle.

#3: the re3data-databib merger

Maxi Kindling from Humboldt University Berlin (representing re3data) and Michael Witt from Purdue University Libraries (representing databib) co-presented on plans for merging their two services, both searchable databases of repositories. Both re3data and databib have extensive metadata on data repositories available for depositing research data, covering a wide range of data types and disciplines. This merger makes sense since the two services emerged within X months of one another and there is no need for running them separately, with separate support, personnel, and databases. Kindling and Witt described the five principles of agreement for the merge: openness, optimal quality assurance, innovative functionality development, shared leadership (i.e., the two are equal partners), and sustainability. Regarding this last principle, the merged service has been “adopted” by DataCite, which will support it for the long term. It will be called re3data, with an advisory board called databib.

Attendees of the DataCite meeting had interesting lunchtime conversations around future integrations and tools development in conjunction with the new re3data. What about a repository “match-making” service, which could help researchers select the perfect repository for their data? Or integration with tools like the DMPTool? The re3data-databib group is likely coming up with all kinds of great ideas as a result of their new partnership, which will surely benefit the community as a whole.

#4: Lots of other great stuff

There were many other interesting presentations at the meeting: Amye Kenall from BioMed Central (BMC) talking about their GigaScience data journal; Mustapha Mokrane from the ICSU-World Data System on data publishing efforts; and Nigel Robinson from Thomson-Reuters on the Data Citation Index, to name a few. DataCite plans on making all of the presentations available on the conference website, so be sure to check that out in the next few weeks.

My favorite non-data part? The light show at the central square of Nancy, Place Stanislas. 20 minutes well-spent.



Sharing is caring, but should it count?

The following is a guest post by Shea Swauger, Data Management Librarian at Colorado State University. Shea and I both participated in a meeting for the Colorado Alliance of Research Libraries on 11 July 2014, where he presented survey results described below.


 

Vanilla Ice has a timely message for the data community. From Flickr by wiredforlego.

It shouldn’t be a surprise that many of the people who collect and generate research data are academic faculty members. One of the gauntlets that these individuals must face is the tenure and promotion process, an evaluation system that measures and rewards professional excellence and scholarly impact, and that can greatly affect the career arc of an aspiring scholar. As a result, tenure and promotion metrics naturally influence the kind and quantity of scholarly products that faculty produce.

Some advocates of data sharing have suggested using the tenure and promotion process as a way to incentivize data sharing. I thought this was a brilliant idea and had designs to advocate its implementation to members of the executive administration at my university, but first I wanted to gather some evidence to support my argument. My colleagues Beth Oehlerts, Daniel Draper, and Don Zimmerman and I sent a survey to all faculty members asking how they felt about incorporating shared research data as an assessment measure in the tenure and promotion process. Only about 10% (202) responded, so while generalizations about the larger population can’t be made, their answers are still interesting.

This is how I expected the survey to work:

Me: “If sharing your research data counted, in some way, towards you achieving tenure and promotion, would you be more likely to do it?”

Faculty: “Yes, of course!”

I’d bring this evidence to the university, sweeping changes would be made, data sharing would proliferate and all would be well.

I was wrong.

Speaking broadly, only about half of the faculty members surveyed said that changing the tenure and promotion process would make them more likely to share their data.

While 76% of the faculty were interested in sharing data in the future, and 84% said that data generation or collection is important to their research, half of faculty said that shared research data has little to no impact on their scholarly community and almost a quarter of faculty said they are unable to judge the impact.

Okay, let’s back up.

The tenure system is supposed to measure, among other things such as teaching and service, someone’s impact on their scholarly community. According to this idea, there should be a correlation between the things that impact your scholarly community and the things that impact you achieving tenure. Now, back to the survey.

I asked faculty to rate the impact of several research products on their scholarly community as well as on their tenure and promotion. 94% of faculty rated ‘peer-reviewed journal articles’ at ‘high impact’ (the top of the scale) for impact upon their scholarly community, and 96% of faculty rated ‘peer-reviewed journal articles’ at ‘high impact’ upon their tenure and promotion. This supports the idea that because peer-reviewed journal articles have a high impact on the scholarly community, they have a high impact on the tenure and promotion process.

Shared research data had a similar impact correlation, though on the opposite end of the impact spectrum. Little impact on the scholarly community means little impact on the tenure and promotion process. Bad news for data sharing. Reductively speaking, I believe this to be the essence of the argument: contributions that are valuable to a research community should be rewarded in the tenure and promotion process; shared research data isn’t valuable to the research community; therefore, data sharing should not be rewarded.

Also, I received several responses from faculty saying that they were obligated not to share their data because of the kind of research they were doing, be it in defense, the private sector, or working with personally identifiable or sensitive data.  They felt that if the university started rewarding data sharing, they would be unfairly punished because of the nature of their research. Some suggested that a more local implementation of a data sharing policy, perhaps on a departmental basis or an individual opt-in system might be fairer to researchers who can’t share their data for one reason or another.

So what does this mean?

Firstly, it means that there’s a big perception gap between the importance of ‘my data to my research’ and the importance of ‘my data to someone else’s research’. Closing this gap could go a long way toward increasing data sharing. Secondly, it means that the tenure and promotion system is a complicated, political mechanism, and trying to leverage it as a way to incentivize data sharing is not easy or straightforward. For now, I’ve decided not to pursue amending the local tenure system; however, I have hope that as interest in data sharing grows we can find meaningful ways to reward people who choose to share their data.

Note: the work described above is being prepared for publication in 2015.


Unicorn Data Sharing

A few years ago I created a little video about data sharing using an online application called Xtranormal. Alas, the application has gone bust and it’s hard to access the videos created on that site. As a result, I’m adding my video here so you can still enjoy it.

It takes a data management village

A couple of weeks ago, information scientists, librarians, social scientists, and their compatriots gathered in Toronto for the 2014 IASSIST meeting. IASSIST is, of course, an acronym which I always have to look up to remember – International Association for Social Science Information Service & Technology. Despite its forgettable name, this conference is one of the better meetings I’ve attended. The conference leadership manages to put together a great couple of days, chock full of wonderful plenaries and interesting presentations, and even arranged a hockey game for the opening reception.

Yonge Street crowds celebrating the end of the Boer War, Toronto, Canada. This image is available from the City of Toronto Archives, and is in the public domain.

Although there were many interesting talks, and I’m still processing the great discussions I had in Toronto, a couple really rang true for me. I’m now going to shamelessly paraphrase one of these talks (with permission, of course) about building a “village” of data management experts at institutions to best serve researchers’ needs. All credit goes to Alicia Hofelich Mohr and Thomas Lindsay, both from University of Minnesota. Their presentation was called “It takes a village: Strengthening data management through collaboration with diverse institutional offices.” I’m sure IASSIST will make the slides available online in the near future, but I think this information is too important not to share asap.

Mohr and Lindsay first described the data life cycle, and emphasized the importance of supporting data throughout its life – especially early on, when small things can make a big difference down the road. They asserted that in order to provide support for data management, librarians need to connect with other service providers at their institutions. They then described who these providers are, and where they fit into the broader picture. Below I’ve summarized Mohr and Lindsay’s presentation.

Grants coordinators

Faculty writing grants are constantly interacting with these individuals. They are on the “front lines” of data management planning, in particular, since they can point researchers to other service providers who can help over the course of the project. Bonus – grants offices often have a deep knowledge of agency requirements for data management.

Sponsored projects

The sponsored projects office is another service provider that often has early interactions with researchers during project planning. Researchers are often required to submit grants directly to this office, which ensures compliance and focuses on the requirements needed for proposals to be complete.

College research deans

Although this might be an intimidating group to connect with, they are likely to be the most aware of the current research climate and can help you target your services to the needs of their researchers. They can also help advocate for your services, especially via things like new faculty orientation. Generally, this group is an important ally in facilitating data sharing and reuse.

IT system administrators

This group is often underused by researchers, despite their ability to potentially provide researchers with server space, storage, collaboration solutions, and software licenses. They are also useful allies in ensuring security for sensitive data.

Research support services & statistical consulting offices

Some universities have support for researchers in designing, collecting, and analyzing their data. These groups are sometimes housed within specific departments, and therefore might have discipline-specific knowledge about repositories, metadata standards, and cultural norms for that discipline. They are often formally trained as researchers and can therefore better relate to your target audience. In addition, these groups have the opportunity to promote replicable workflows and help researchers integrate best practices for data management into their everyday processes.

Data security offices, copyright/legal offices, & commercialization offices

Groups such as these are often overlooked by librarians looking to build a community of support around data management. Individuals in these offices may be able to provide invaluable expertise to your network, however. These groups contribute to and implement University security, data, and governance policies, and are knowledgeable about the legal implications of data sharing, especially related to sensitive data. Intellectual property rights, commercialization, and copyright are all complex topics that require expertise not often found among other data stewardship stakeholders. Partnering with experts can help reduce the potential for future problems, plus ensure data are shared to the fullest extent possible.

Library & institutional repository

The library is, of course, distinct from an institutional repository; however, the institution’s library often plays a key role in supporting, promoting, and implementing the repository. I often remind researchers that librarians are experts in information, and data is one of many types of information. Researchers often underuse librarians and their specialized skills in metadata, curation, and preservation. The researchers’ need for a data repository and the strong link between repositories and librarians will change this in the coming years, however. Mohr and Lindsay ended with this simple statement, which nicely sums up their stellar presentation:

The data support village exists across levels and boundaries of the institution as well as across the lifecycle of data management.


Fifteen ideas about data validation (and peer review)

Phrenology diagram showing honest and dishonest head shapes

It’s easy to evaluate a person by the shape of their head, but datasets are more complicated. From Vaught’s Practical Character Reader in the Internet Archive.

Many open issues drift around data publication, but validation is both the biggest and the haziest. Some form of validation at some stage in a data publication process is essential; data users need to know that they can trust the data they want to use, data creators need a stamp of approval to get credit for their work, and the publication process must avoid getting clogged with unusable junk. However, the scientific literature’s validation mechanisms don’t translate as directly to data as its mechanism for, say, citation.

This post is in part a very late response to a data publication workshop I attended last February at the International Digital Curation Conference (IDCC). In a breakout discussion of models for data peer review, there were far more ideas about data review than time to discuss them. Here, for reference purposes, is a longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work. I’ve tried to stay away from deeper consideration of what data quality means (which I’ll discuss in a future post) and from the broader issues of peer review associated with the literature, but they inevitably pop up anyway.

  1. Data validation is like peer review of the literature: Peer review is an integral part of science; even when they resent the process, scientists understand and respect it. If we are to ask them to start reviewing data, it behooves us to slip data into existing structures. Data reviewed in conjunction with a paper fits this approach. Nature Publishing Group’s Scientific Data publishes data papers through a traditional review process that considers the data as well as the paper. Peer review at F1000Research follows a literature-descended (although decidedly non-traditional) process that asks reviewers to examine underlying data together with the paper.
  2. Data validation is not like peer review of the literature: Data is fundamentally different from literature, and shouldn’t be treated as such. As Mark Parsons put it at the workshop, “literature is an argument; data is a fact.” The fundamental question in peer review of an article is “did the authors actually demonstrate what they claim?” This involves evaluation of the data, but in the context of a particular question and conclusion. Without a question, there is no context, and no way to meaningfully evaluate the data.
  3. Divide the concerns: Separate out aspects of data quality and consider them independently. For example, Sarah Callaghan divides data quality into technical and scientific quality. Technical quality demands complete data and metadata and appropriate file formats; scientific quality requires appropriate collection methods and high overall believability.
  4. Divvy up the roles: Separate concerns need not be evaluated by the same person or even the same organization. For instance, GigaScience assigns a separate data reviewer for technical review. Data paper publishers generally coordinate scientific review and leave at least some portion of the technical review to the repository that houses the data. Third party peer-review services like LIBRE or Rubriq could conceivably take up data review.
  5. Review data and metadata together: A reviewer must assess data in conjunction with its documentation and metadata. Assessing data quality without considering documentation is both impossible and pointless; it’s impossible to know that data is “good” without knowing exactly what it is and, even if one could, it would be pointless because no one will ever be able to use it. This idea is at least implicit in any data review scheme. In particular, data paper journals explicitly raise evaluation of the documentation to the same level as evaluation of the data. Biodiversity Data Journal’s peer review guidelines are not unusual in addressing not only the quality of the data and the quality of the documentation, but the consistency between them.
  6. Experts should review the data: Like a journal article, a dataset should pass review by experts in the field. Datasets are especially prone to cross-disciplinary use, in which case the user may not have the background to evaluate the data themselves. Sarah Callaghan illustrated how peer review might work (even without a data paper) by reviewing a pair of (already published) datasets.
  7. The community should review the data: Like a journal article, the real value of a dataset emerges over time as a result of community engagement. After a slow start, post-publication commenting on journal articles (e.g. through PubMed Commons) seems to be gaining momentum.
  8. Users should review the data: Data review can be a byproduct of use. A researcher using a dataset interrogates it more thoroughly than someone just reviewing it. And, because they were doing it anyway, the only “cost” is the effort of capturing their opinion. In a pilot study, the Dutch Data Archiving and Networked Services repository solicited feedback by emailing a link to an online form to researchers who had downloaded their data.
  9. Use is review: “Indeed, data use in its own right provides a form of review.” Even without explicit feedback, evidence of successful use is itself evidence of quality. Such evidence could be presented by collecting a list of papers that cite the dataset.
  10. Forget quality, consider fitness for purpose: A dataset may be good enough for one purpose but not another. Trying to assess the general “quality” of a dataset is hopeless; consider instead whether the dataset is suited to a particular use. Extending the previous idea, documentation of how and in what contexts a dataset has been used may be more informative than an assessment of abstract quality.
  11. Rate data with multiple levels of quality: The binary accept/reject of traditional peer review (or, for that matter, fit/unfit for purpose) is overly reductive. A one-to-five (or one-to-ten) scale, familiar from pretty much the entire internet, affords a more nuanced view. The Public Library of Science (PLOS) Open Evaluation Tool applies a five-point scale to journal articles, and DANS users rated datasets on an Amazon-style five-star scale.
  12. Offer users multiple levels of assurance: Not all data, even in one place, needs to be reviewed to the same extent. It may be sensible to invest limited resources to most thoroughly validate those datasets which are most likely to be used. For example, Open Context offers five different levels of assurance, ranging from “demonstration, minimal editorial acceptance” to “peer-reviewed.” This idea could also be framed as levels of service ranging (as Mark Parsons put it at the workshop) from “just thrown out there” to “someone answers the phone.”
  13. Rate data along multiple facets: Data can be validated or rated along multiple facets or axes. DANS datasets are rated on quality, completeness, consistency, and structure; two additional facets address documentation quality and usefulness of file formats. This is arguably a different framing of divided concerns, with a difference in application: there, independent assessments are ultimately synthesized into a single verdict; here, the facets are presented separately.
  14. Dynamic datasets need ongoing review: Datasets can change over time, either through addition of new data or revision and correction of existing data. Additions and changes to datasets may necessitate a new (perhaps less extensive) review. Lawrence (2011) asserts that any change to a dataset should trigger a new review.
  15. Unknown users will put the data to unknown uses: Whereas the audience for, and findings of, a journal article are fairly well understood by the author, a dataset may be used by a researcher from a distant field for an unimaginable purpose. Such a person is both the most important to provide validation for (because they lack the expertise to evaluate the data themselves) and the most difficult (because no one can guess who they will be or what they will want to do).

Have an idea about data review that I left out? Let us know in the comments!

Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and divides individuals into one of five categories depending on the point in the curve at which they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” (GitHub, in this case) will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems, most notably Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way for efficiently developing software projects, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how git works, its structure, or how to incorporate git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science”, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or file names appended with initials. Instead, a co-author can submit changes and document those with “commit messages” (read about them on GitHub here).
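For instance, on the command line the basic rhythm looks something like this (a minimal sketch; the file name and commit message are invented for illustration):

git add analysis.R                                 # stage the edited file (hypothetical name)
git commit -m "Refit model with 2013 field data"   # record the change with a commit message
git push origin master                             # share the commit with co-authors on GitHub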

3. Feedback and review is easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded in the way that it did. For the super savvy GitHubber, you can make your entire manuscript available and traceable on your site, from the first data point collected to the final submitted version. This is my goal for my next manuscript.

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.

References and other good reads


Abandon all hope, ye who enter dates in Excel

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.1 After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.2 Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).3 Furthermore, the 1900 date system deliberately, and erroneously, treats 1900 as a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.

[Screenshot: excel_secchi_data, showing the rows above as displayed in Excel, with the dates matching the year column]

This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.
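To see what is going on underneath, here is a minimal sketch of my own (not part of the cleanup script) showing how the same Excel serial number yields two different dates depending on which origin is assumed; 1899-12-30 is the usual convention for the 1900 system, which accounts for the phantom 1900 leap day:

serial <- 21931                           # the serial that the 1900 system maps to 1960-01-16, the first date above
as.Date(serial, origin = "1899-12-30")    # 1900 system (Windows default): "1960-01-16"
as.Date(serial, origin = "1904-01-01")    # 1904 system (Mac default):     "1964-01-17"
# the two interpretations differ by 1462 days, i.e., four years and one day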

My solution for these particular data was as follows:

# function to add four years and a day to a given date
fix_excel_dates <- function(date) {
    require(lubridate)
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?
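One partial safeguard, where a dataset happens to carry a redundant year (or month) column as this one did, is to automate the comparison my colleague made by eye. A minimal sketch:

library(lubridate)
# flag rows where the year encoded in the date disagrees with the separate year column
mismatches <- secchi_main[year(secchi_main$date) != secchi_main$year, ]
nrow(mismatches)    # anything greater than zero deserves a closer look

That check doesn’t tell you what Excel itself displays, but it would have flagged this dataset immediately.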

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say:

[Image: and-thats-why-excel meme]


  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”

Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan
