Tag Archives: open data

There’s a new Dash!

Dash: an open source, community approach to data publication

We have great news! Last week we refreshed our Dash data publication service. For those of you who don’t know, Dash is an open source, community-driven project that takes a unique approach to data publication and digital preservation.

Dash focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated. It is a project based at the University of California Curation Center (UC3), a program at California Digital Library (CDL) that aims to develop interdisciplinary research data infrastructure.

Dash employs a multi-tenant user interface, providing partners with extensive opportunities for local branding and customization, use of existing campus login credentials, and, importantly, the Dash service under a tenant-specific URL, a consideration that helps drive adoption. We welcome collaborations with other organizations wishing to provide a simple, intuitive data publication service on top of more cumbersome legacy systems.

There are currently eight live instances of Dash:

  • UC Berkeley
  • UC Irvine
  • UC Merced
  • UC Office of the President
  • UC Riverside
  • UC Santa Cruz
  • UC San Francisco
  • ONEshare (in partnership with DataONE)

Architecture and Implementation

Dash is completely open source. Our code is made publicly available on GitHub (http://cdluc3.github.io/dash/). Dash is based on an underlying Ruby-on-Rails data publication platform called Stash. Stash encompasses three main functional components: Store, Harvest, and Share.

  • Store: The Store component is responsible for the selection of datasets; their description in terms of configurable metadata schemas, including specification of ORCID and FundRef identifiers for researcher and funder disambiguation; the assignment of DOIs for stable citation and retrieval; designation of an optional limited-time embargo; and packaging and submission to the integrated repository
  • Harvest: The Harvest component is responsible for retrieval of descriptive metadata from that repository for inclusion into a Solr search index
  • Share: The Share component, based on GeoBlacklight, is responsible for the faceted search and browse interface
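To make the Harvest step concrete: an OAI-PMH ListRecords response is plain XML, and extracting record identifiers and Dublin Core titles for a Solr index is a short exercise. Dash/Stash itself is written in Ruby; this Python sketch, with an invented sample record, is purely illustrative.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def parse_list_records(xml_text):
    """Pull (identifier, title) pairs out of an OAI-PMH ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.find(OAI + "header").findtext(OAI + "identifier")
        title = rec.find(".//" + DC + "title")
        records.append((ident, title.text if title is not None else None))
    return records

# A made-up, minimal ListRecords response for illustration:
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(parse_list_records(SAMPLE))  # each tuple would feed a Solr document
```

Each parsed record would then be posted to Solr as a search document, which is the role the Harvest component plays between the repository and the Share interface.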

Dash Architecture Diagram


Individual dataset landing pages are formatted as an online version of a data paper, presenting all appropriate descriptive and administrative metadata in a form that can be downloaded as an individual PDF file, or as part of the complete dataset download package, incorporating all data files for all versions.

To facilitate flexible configuration and future enhancement, all support for the various external service providers and repository protocols is fully encapsulated in pluggable modules. Metadata modules are available for the DataCite and Dublin Core metadata schemas. Protocol modules are available for the SWORD 2.0 deposit protocol and the OAI-PMH and ResourceSync harvesting protocols. Authentication modules are available for InCommon/Shibboleth and Google/OAuth2 identity providers (IdPs). We welcome collaborations to develop modules for other metadata schemas and repository protocols. Please email UC3 (uc3 at ucop dot edu) or visit GitHub (http://cdluc3.github.io/dash/) for more information.
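For a flavor of what the SWORD deposit module handles, here is a hedged sketch of the HTTP headers a SWORD 2.0 binary (zipped package) deposit carries. The helper name and credentials are hypothetical, and Dash’s real Store component is Ruby; only the header names follow the SWORD 2.0 profile.

```python
import base64
import hashlib

def build_sword_deposit_headers(zip_bytes, filename, user, password):
    """Headers for a SWORD 2.0 binary deposit of a zipped data package.

    A real client would POST zip_bytes to the repository's collection URL
    with these headers; values here are illustrative only.
    """
    auth = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(zip_bytes).hexdigest(),   # integrity check
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "false",                              # deposit is complete
        "Authorization": f"Basic {auth}",
    }

headers = build_sword_deposit_headers(b"fake zip bytes", "dataset.zip", "dash", "secret")
print(headers["Packaging"])  # → http://purl.org/net/sword/package/SimpleZip
```

Because the protocol details live in a module like this, swapping in a different deposit protocol means writing a new module, not rewriting the application.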

Features of the newly refreshed Dash service

What are the new features in our refreshed Dash service? Take a look.

| Feature | Tech-focused | User-focused | Description |
|---------|:------------:|:------------:|-------------|
| Open Source | X |  | All components open source, MIT-licensed code (http://cdluc3.github.io/dash/) |
| Standards compliant | X |  | Dash integrates with any SWORD/OAI-PMH-compliant repository |
| Pluggable Framework | X |  | Inherent extensibility for supporting additional protocols and metadata schemas |
| Flexible metadata schemas | X |  | Supports the DataCite metadata schema out of the box, but can be configured to support any schema |
| Innovation | X |  | Our modular framework will make new feature development easier and quicker |
| Mobile/responsive design | X | X | Built mobile-first, from the ground up, for a better user experience |
| Geolocation – Metadata | X | X | For applicable research outputs, an easy way to capture the location of your datasets |
| Persistent Identifiers – ORCID | X | X | Dash allows researchers to attach their ORCID iD, allowing them to track and get credit for their work |
| Persistent Identifiers – DOIs | X | X | Dash issues DOIs for all datasets, allowing researchers to track and get credit for their work |
| Persistent Identifiers – FundRef | X | X | Dash tracks funder information using FundRef, allowing researchers and funders to track their research outputs |
| Login – Shibboleth/OAuth2 | X | X | Easy single sign-on with your campus credentials or Google account |
| Versioning | X | X | Datasets can change. Dash offers a quick way to upload new versions of your datasets and a simple process for tracking updates |
| Accessibility | X | X | The technology, design, and user workflows have all been built with accessibility in mind |
| Better user experience |  | X | Self-deposit made easy: simple workflow, drag-and-drop upload, simple navigation, clean data publication pages, user dashboards |
| Geolocation – Search |  | X | With GeoBlacklight, we can offer search by location |
| Robust Search |  | X | Search by subject, file type, keywords, campus, location, etc. |
| Discoverability |  | X | Indexing by search engines such as Google and Bing |
| Build Relationships |  | X | Many datasets are related to publications or other data. Dash offers a quick way to describe these relationships |
| Supports Best Practices |  | X | Data publication can be confusing, but you can trust that Dash follows best practices |
| Data Metrics |  | X | See the reach of your datasets through usage and download metrics |
| Data Citations |  | X | Quick access to a well-formed citation (with DOI) for every data publication, easy for your peers to grab |
| Open License |  | X | Dash supports open Creative Commons licensing for all data deposits; can be configured for other licenses |
| Lower Barrier to Entry |  | X | For those in a hurry, Dash offers a quick self-deposit interface: only three steps and few required fields |
| Support Data Reuse |  | X | Focuses researchers on describing methods and explaining ways to reuse their datasets |
| Satisfies Data Availability Requirements |  | X | Many publishers and funders require researchers to make their data available; Dash is a readily accepted and easy way to comply |

A little Dash history

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health. Dash version 2 was developed by UC3 and partners with funding from the Alfred P. Sloan Foundation (our funded proposal). Read more about the code, the project, and contributing to development on the Dash GitHub site.

A little Dash future

We will continue the development of the new Dash platform and will keep you posted. Next up: support for timed deposits and embargoes.  Stay tuned!


CC BY and data: Not always a good fit

This post was originally published on the University of California Office of Scholarly Communication blog.

Last post I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools — and how they work with data.


The 10 Things Every New Grad Student Should Do

It’s now mid-October, and I’m guessing that first-year graduate students are knee-deep in courses, barely considering their potential thesis projects. But for those who can multi-task, I have compiled this list of 10 things that you should undertake in your first year as a grad student. These aren’t just any 10 things… they are 10 steps you can take to make sure you contribute to a culture shift towards open science. Some are big steps, and others are small, but they will all get you (and the rest of your field) one step closer to reproducible, transparent research.

1. Learn to code in some language. Any language.

Here’s the deal: it’s easier to use black-box applications to run your analyses than to create scripts. Everyone knows this. You put in some numbers and out pop your results; you’re ready to write up your paper and get that H-index headed upwards. But this approach will not cut the mustard for much longer in the research world. Researchers need to know how to code. Growing amounts and diversity of data, more interdisciplinary collaborators, and increasing complexity of analyses mean that no longer can black-box models, software, and applications be used in research. The truth is, if you want your research to be reproducible and transparent, you must code. In a 2013 article, “The Big Data Brain Drain: Why Science is in Trouble”, Jake Vanderplas argues that

In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well.

I learned MATLAB in graduate school, and experimented with R during a postdoc. I wish I’d delved into this world earlier, and had more skills and knowledge about best practices for scientific software. Basically, I wish I had attended a Software Carpentry bootcamp.

The growing number of Software Carpentry (SWC) bootcamps is further evidence that researchers are increasingly aware of the importance of coding and reproducibility. These bootcamps teach researchers the basics of coding, version control, and similar topics, with the potential for customizing the course’s content to the primary discipline of the audience. I’m a big fan of SWC – read more in my blog post on the organization. Check out SWC founder Greg Wilson’s article on some insights from his years in teaching bootcamps: Software Carpentry: Lessons Learned.

2. Stop using Excel. Or at least stop ONLY using Excel.

Most seasoned researchers know that Microsoft Excel can be potentially problematic for data management: there are loads of ways to manipulate, edit, reorder, and change your data without really knowing exactly what you did. In nerd terms, the trail of dataset changes is known as provenance; generally Excel is terrible at documenting provenance. I wrote about this a few years ago on the blog, and we mentioned a few of the more egregious ways people abuse Excel in our F1000Research publication on the DataUp tool. More recently guest blogger Kara Woo wrote a great post about struggles with dates in Excel.

Of course, everyone uses Excel. In our surveys for the DataUp project, about 88% of the researchers we interviewed used Excel at some point in their research. And we can’t expect folks to stop using it: it’s a great tool! It should, however, be used carefully. For instance, don’t manipulate the sole copy of your raw data in Excel; keep your raw data raw. Use Excel to explore your data, but use other tools to clean and analyze it, such as R, Python, or MATLAB (see #1 above on learning to code). For more help with spreadsheets, see our list of resources and tools: UC3 Spreadsheet Help.
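The pattern worth internalizing is simple: treat the raw file as read-only and script every cleaning step, so the script itself becomes the provenance record. A minimal stdlib-Python illustration (the messy sample data is invented):

```python
import csv
import io

# Pretend this is the contents of your untouched raw file:
RAW = "site,temp_c\nA, 21.5\nB,\nC,19.0\n"  # stray space, missing value

def clean_rows(raw_text):
    """Return cleaned records; the raw data is read, never edited."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        temp = row["temp_c"].strip()
        if not temp:          # drop records with missing measurements
            continue
        cleaned.append({"site": row["site"].strip(), "temp_c": float(temp)})
    return cleaned

print(clean_rows(RAW))  # → [{'site': 'A', 'temp_c': 21.5}, {'site': 'C', 'temp_c': 19.0}]
```

Because every transformation lives in code, anyone (including future you) can see exactly how the analysis dataset was derived from the raw one, which is precisely what Excel’s click-and-edit workflow fails to record.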

3. Learn about how to properly care for your data.

You might know more about your data than anyone else, but that doesn’t mean you know how best to steward them. There are some great guidelines for how best to document, manage, and generally care for your data; I’ve collected some of my favorites here on CiteULike with the tag best_practices. Pick one (or all of them) to read and make sure your data don’t get short shrift.

4. Write a data management plan.

I know, it sounds like the ultimate boring activity for a Friday night. But these three words (data management plan) can make a HUGE difference in the time and energy spent dealing with data during your thesis. Basically, if you spend some time thinking about file organization, sample naming schemes, backup plans, and quality control measures, you can save many hours of heartache later. Creating a data management plan also forces you to better understand best practices related to data (#3 above). Don’t know how to start? Head over to the DMPTool to write a data management plan. It’s free to use, and you can get an idea for the types of things you should consider when embarking on a new project. Most funders require data management plans alongside proposal submissions, so you might as well get the experience now.

5. Read Reinventing Discovery by Michael Nielsen.

Reinventing Discovery: The New Era of Networked Science by Michael Nielsen was published in 2011, and I’ve since heard it referred to as the Bible for Open Science, and the must-read book for anyone interested in engaging in the new era of 4th paradigm research. I’ve only just recently read the book, and wow. I was fist-pumping quite a bit while reading it, which must have made fellow airline passengers wonder what the fuss was about. If they had asked, I would have told them about Nielsen’s stellar explanation of the necessity for and value of openness and transparency in research, the problems with current incentive structures in science, and the steps we should all take towards shifting the culture of research to enable more connectivity and faster progress. Just writing this blog post makes me want to re-read the book.

6. Learn version control.

My blog post, Git/GitHub: a Primer for Researchers covers much of the importance of version control. Here’s an excerpt:

From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Version control systems can make this much easier, but they are not necessarily intuitive for the fledgling coder. It might take a little time (plus attending a Software Carpentry Bootcamp) to understand version control, but it will be well worth your time. As an added bonus, your work can be more reproducible and transparent by using version control. Read Karthik Ram’s great article, Git can facilitate greater reproducibility and increased transparency in science.
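The core idea behind tools like git is simple enough to sketch: store each version of a file under a hash of its contents, so every version stays permanently addressable and identical content is never stored twice. This is a toy illustration only; git’s real object model (trees, commits, branches) is much richer.

```python
import hashlib

store = {}    # object store: content hash -> file bytes
history = []  # ordered log of snapshot hashes (a bare-bones "commit log")

def commit(content: bytes) -> str:
    """Snapshot the content; its hash doubles as a permanent version ID."""
    digest = hashlib.sha1(content).hexdigest()
    store[digest] = content
    history.append(digest)
    return digest

v1 = commit(b"Methods: we measured things.")
v2 = commit(b"Methods: we measured things, carefully.")
print(store[v1])  # the old version is still retrievable after the edit
```

No more “thesis_final_v2_REALLY_final.doc”: every revision has a stable identifier, and nothing is ever overwritten.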

7. Pick a way to communicate your science to the public. Then do it.

You don’t have to have a black belt in Twitter or run a weekly stellar blog to communicate your work. But you should communicate somehow. I have plenty of researcher friends who feel exasperated by the idea that they need to talk to the public about their work. But the truth is, in the US this communication is critical to our research future. My local NPR station recently ran a great piece called Why Scientists are seen as untrustworthy and why it matters. It points out that many (most?) scientists aren’t keen to spend a lot of time engaging with the broader public about their work. However:

…This head-in-the-sand approach would be a big mistake for lots of reasons. One is that public mistrust may eventually translate into less funding and so less science. But the biggest reason is that a mistrust of scientists and science will have profound effects on our future.

Basically, we are avoiding the public at our own peril. Science funding is on the decline, we are facing increasing scrutiny, and it wouldn’t be hyperbole to say that we are at war without even knowing it. Don’t believe me? Read this recent piece in Science (paywall warning): Battle between NSF and House science committee escalates: How did it get this bad?

So start talking. Participate in public lecture series, write a guest blog post, talk about your research to a crotchety relative at Thanksgiving, or write your congressman about the governmental attack on science.

8. Let everyone watch.

Consider going open. That is, do all of your science out in the public eye, so that others can see what you’re up to. One way to do this is by keeping an open notebook. This concept throws out the idea that you should be a hoarder, not telling others of your results until the Big Reveal in the form of a publication. Instead, you keep your lab notebook (you do have one, right?) out in a public place, for anyone to peruse. Most often an open notebook takes the form of a blog or a wiki, and the researcher updates their notebook daily, weekly, or whatever is most appropriate. There are links to data, code, relevant publications, or other content that helps readers, and the researcher themselves, understand the research workflow. Read more in these two blog posts: Open Up  and Open Science: What the Fuss is About.

9. Get your ORCID.

ORCID stands for “Open Researcher & Contributor ID”. The ORCID Organization is an open, non-profit group working to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. The endgame is to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors. Basically, researcher identifiers are like social security numbers for scientists. They unambiguously identify you throughout your research life.

Lots of funders, tools, publishers, and universities are buying into the ORCID system. It’s going to make identifying researchers and their outputs much easier. If you have a generic, complicated, compound, or foreign name, you will especially benefit from claiming your ORCID and “stamping” your work with it. It allows you to claim what you’ve done and keep you from getting mixed up with that weird biochemist who does studies on the effects of bubble gum on pet hamsters. Still not convinced? I wrote a blog post a while back that might help.
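ORCID iDs even carry a built-in typo guard: the final character is a check digit computed with the ISO 7064 MOD 11-2 algorithm, per ORCID’s published documentation. A short Python sketch of that check:

```python
def orcid_checksum_ok(orcid: str) -> bool:
    """Validate the final character of an ORCID iD (ISO 7064 MOD 11-2)."""
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)  # 10 is written as 'X'
    return digits[-1] == expected

# 0000-0002-1825-0097 is the sample iD from ORCID's own documentation
print(orcid_checksum_ok("0000-0002-1825-0097"))  # → True
```

A mistyped digit almost always breaks the checksum, so systems can reject a bad iD before it ever gets attached to the wrong researcher’s record.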

10. Publish in OA journals, or make your work OA afterward.

A wonderful post by Michael White, Why I don’t care about open access to research: and why you should, captures this issue well:

It’s hard for me to see why I should care about open access…. My university library can pay for access to all of the scientific journals I could wish for, but that’s not true of many corporate R&D departments, municipal governments, and colleges and schools that are less well-endowed than mine. Scientific knowledge is not just for academic scientists at big research universities.

It’s easy to forget that you are (likely) among the privileged academics. Not all researchers have access to publications, and this is even more true for the general public. Why are we locking our work in the Ivory Tower, allowing for-profit publishers to determine who gets to read our hard-won findings? The Open Access movement is going full throttle these days, as evidenced by increasing media coverage (see “Steal this research paper: you already paid for it” from MotherJones, or The Guardian’s blog post “University research: if you believe in openness, stand up for it“). So what can you do?

Consider publishing only in open access journals (see the Directory of Open Access Journals). Does this scare you? Are you tied to a disciplinary favorite journal with a high impact factor? Then make your work open access after publishing in a standard journal. Follow my instructions here: Researchers! Make Your Previous Work #OA.

Openness is one of the pillars of a stellar academic career. From Flickr by David Pilbrow.


UC3, PLOS, and DataONE join forces to build incentives for data sharing

We are excited to announce that UC3, in partnership with PLOS and DataONE, is launching a new project to develop data-level metrics (DLMs). This 12-month project is funded by an Early Concept Grants for Exploratory Research (EAGER) grant from the National Science Foundation, and will result in a suite of metrics that track and measure data use. The proposal is available via CDL’s eScholarship repository: http://escholarship.org/uc/item/9kf081vf. More information is also available on the NSF Website.

Why DLMs? Sharing data is time-consuming, and researchers need incentives for undertaking the extra work. Metrics for data will provide feedback on data usage, views, and impact that will help encourage researchers to share their data. This project will explore and test the metrics needed to capture activity surrounding research data.

The DLM pilot will build on Lagotto, the successful open source Article-Level Metrics (ALM) community project originally started by PLOS in 2009. ALMs provide a view into the activity surrounding an article after publication, across a broad spectrum of the ways research is disseminated and used (viewed, shared, discussed, cited, recommended, and so on).

About the project partners

PLOS (Public Library of Science) is a nonprofit publisher and advocacy organization founded to accelerate progress in science and medicine by leading a transformation in research communication.

Data Observation Network for Earth (DataONE) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

The University of California Curation Center (UC3) at the California Digital Library is a creative partnership bringing together the expertise and resources of the University of California. Together with the UC libraries, we provide high quality and cost-effective solutions that enable campus constituencies – museums, libraries, archives, academic departments, research units and individual researchers – to have direct control over the management, curation and preservation of the information resources underpinning their scholarly activities.

The official mascot for our new project: Count von Count. From muppet.wikia.com


DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open-source and we encourage community discussion and contribution to our GitHub site.

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID.
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.
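For the curious, minting a DataCite DOI through EZID is an HTTP POST whose body is metadata in ANVL (simple name-value lines). This is a hedged sketch of the encoding step only: the field names follow EZID’s DataCite profile, the sample values are invented, and no network call is made.

```python
def anvl_encode(metadata: dict) -> str:
    """Encode a metadata dict as an ANVL request body for EZID.

    Percent-encode characters that are structural in ANVL:
    '%' and newlines everywhere, plus ':' in element names.
    """
    def esc(text, is_name=False):
        text = text.replace("%", "%25").replace("\n", "%0A")
        return text.replace(":", "%3A") if is_name else text
    return "\n".join(f"{esc(k, True)}: {esc(v)}" for k, v in metadata.items())

body = anvl_encode({
    "datacite.title": "Example dataset",
    "datacite.creator": "Researcher, Example",
    "datacite.publicationyear": "2014",
})
print(body)
```

A real client would POST this body (as UTF-8 text/plain, with authentication) to an EZID minting endpoint and get back the newly assigned DOI.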

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can’t properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform. Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly. It is more efficient to continue forward with a single platform, and we chose Dash as the more sustainable basis for this consolidated service. Dash is implemented in the Ruby on Rails framework that is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data.  All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Nor will users be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.


It’s the dawn of a new day for DataUp! From Flickr by David Yu.


Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan



Two Altmetrics Workshops in San Francisco

Last week, a group of forward-thinking individuals interested in measuring scholarly impact gathered at Fort Mason in San Francisco to talk about altmetrics. The Alfred P. Sloan Foundation funded the events at Fort Mason, which included (1) an altmetrics-focused workshop run by the open-access publisher (and leader in ALM) PLOS, and (2) a NISO Alternative Assessment Initiative Project Workshop to discuss standards and best practices for altmetrics.

In lieu of a blog post for Data Pub, I wrote up something for the folks over at the London School of Economics Impact of Social Sciences Blog. Here’s a snippet that explains altmetrics:

Altmetrics focuses on broadening the things we are measuring, as well as how we measure them. For instance, article-level metrics (ALMs) report on aspects of the article itself, rather than the journal in which it can be found. ALM reports might include the number of article views, the number of downloads, and the number of references to the article in social media such as Twitter. In addition to measuring the impact of articles in new ways, the altmetrics movement is also striving to expand what scholarly outputs are assessed – rather than focusing on journal articles, we could also be giving credit for other scholarly outputs such as datasets, software, and blog posts.

So head on over and read up on the role of higher education institutions in altmetrics: “Universities can improve academic services through wider recognition of altmetrics and alt-products.”

Related Data Pub posts:


Closed Data… Excuses, Excuses

If you are a fan of data sharing, open data, open science, and generally openness in research, you’ve heard them all: excuses for keeping data out of the public domain. If you are NOT a fan of openness, you should be. For both groups (the fans and the haters), I’ve decided to construct a “Frankenstein monster” blog post composed of other people’s suggestions for how to deal with the excuses.

Yes, I know. Frankenstein was the doctor, not the monster. From Flickr by Chop Shop Garage.

I have drawn some comebacks from Christopher Gutteridge, University of Southampton, and Alexander Dutton, University of Oxford. They created an open Google Doc of excuses for closing off data and appropriate responses, and generously provided access to the document under a CC-BY license. I also reference the UK Data Archive‘s list of barriers and solutions to data sharing, available via the Digital Curation Centre‘s PDF, “Research Data Management for Librarians” (pages 14-15).

People will contact me to ask about stuff

Christopher and Alex (C&A) say: “This is usually an objection of people who feel overworked and that [data sharing] isn’t part of their job…” I would add to this that science is all about learning from each other – if a researcher is opposed to the idea of discussing their datasets, collaborating with others, and generally being a good science citizen, then they should be outed by their community as a poor participant.

People will misinterpret the data

C&A suggest this: “Document how it should be interpreted. Be prepared to help and correct such people; those that misinterpret it by accident will be grateful for the help.” From the UK Data Archive: “Producing good documentation and providing contextual information for your research project should enable other researchers to correctly use and understand your data.”

It’s worth mentioning, however, a second point C&A make: “Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as one can quickly point to the real data on the web to refute the wrong interpretation.”

My data is not very interesting

C&A: “Let others judge how interesting or useful it is — even niche datasets have people that care about them.” I’d also add that it’s impossible to predict whether your dataset will be valuable to future research. Consider the many datasets collected before “climate change” was a research topic, which have since become invaluable for documenting and understanding the phenomenon. From the UK Data Archive: “Who would have thought that amateur gardener’s diaries would one day provide essential data for climate change research?”

I might want to use it in a research paper

Anyone who’s discussed data sharing with a researcher is familiar with this excuse. The operative word here is might. How many papers have we all considered writing, only to have them shift to the back burner due to other obligations? That said, this is a real concern.

C&A suggest the embargo route: “One option is to have an automatic or optional embargo; require people to archive their data at the time of creation but it becomes public after X months. You could even give the option to renew the embargo so only things that are no longer cared about become published, but nothing is lost and eventually everything can become open.” Researchers like to have a say in the use of their datasets, but I would caution to have any restrictions default to sharing. That is, after X months the data are automatically made open by the repository.

I would also add that, as the original collector of the data, you are at a huge advantage compared to others that might want to use your dataset. You have knowledge about your system, the conditions during collection, the nuances of your methods, et cetera that could never be fully described in the best metadata.
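The embargo-with-open-default policy described above is simple enough to sketch in code. This is a purely hypothetical illustration (the function and policy names are mine, not any real repository’s API), assuming a fixed embargo window that can be renewed but that eventually defaults to open:

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical sketch of an "open by default" embargo policy: data become
# public automatically once the embargo window, plus any renewals, elapses.
# EMBARGO_MONTHS stands in for the "X months" a repository would choose.
EMBARGO_MONTHS = 12


def is_public(deposit_date: date, renewals: int = 0,
              today: Optional[date] = None) -> bool:
    """Return True once the embargo (plus any renewals) has expired."""
    today = today or date.today()
    total_months = EMBARGO_MONTHS * (1 + renewals)
    # Approximate a month as a 30-day period to keep the sketch simple.
    expiry = deposit_date + timedelta(days=30 * total_months)
    return today >= expiry
```

The key design choice is the default: if the researcher does nothing, the data open automatically; renewals must be requested, so only datasets someone still actively cares about stay embargoed.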

I’m not sure I own the data

No doubt, there are a lot of stakeholders involved in data collection: the collector, the PI (if different), the funder, the institution, the publisher, … C&A have the following suggestions:

  • Sometimes it’s as easy as just finding out who does own the data.
  • Sometimes nobody knows who owns the data. This often seems to occur when someone has moved into a post and isn’t aware that they are now the data owner.
  • Going up the management chain can help. If you can find someone who clearly has management responsibility over the area the dataset belongs to, they can either assign an owner or give permission.
  • Get someone very senior to appoint someone who can make decisions about apparently “orphaned” data.

My data is too complicated

C&A: “Don’t be too smug. If it turns out it’s not that complicated, it could harm your professional [standing].” I would add that if your data are too complicated to share, they are too complicated to reproduce, which means the work is arguably not real scientific progress. More thorough documentation usually solves this.

My data is embarrassingly bad

C&A: “Many eyes will help you improve your data (e.g. spot inaccuracies)… people will accept your data for what it is.” I agree. All researchers have been on the back end of making the sausage. We know it’s not pretty most of the time, and we can accept that. Plus, it motivates you to do better at managing and organizing data during your next collection phase.

It’s not a priority and I’m busy

Good news! Funders are making it your priority! New sharing mandates in the OSTP memorandum state that any research conducted with federal funds must be accessible. You can expect these mandates to trickle down to you, the researcher, in the very near future (6-12 months).


The Who’s Who of Publishing Research

This week’s blog post is a bit more of a sociology-of-science topic, perhaps only marginally related to the usual content surrounding data, but still worth consideration. I recently heard a talk by Laura Czerniewicz, from the University of Cape Town’s Centre for Educational Technology. She was among the speakers during the Context session at Beyond the PDF2, and she asked the following questions about research and science:

Whose interests are being served? Who participates? Who is enabled? Who is constrained?

She brought up points I had never really considered, related to the distribution of wealth and how that affects scientific outputs. First, she examined who actually produces the bulk of knowledge. Based on an editorial in Science in 2008, she reported that US academics produce about 30% of the articles published in international peer-reviewed journals, while developing countries (China, India, Brazil) produce another 20%. Sub-Saharan Africa? A mere 1%.

She then explored what factors are shaping knowledge production and dissemination. She cited infrastructure (i.e., high speed internet, electricity, water, etc.), funding, culture, and reward systems. For example, South Africa produces more articles than other countries on the continent, perhaps because the government gives universities $13,000 for every article published in a “reputable journal”, and 21 of 23 universities surveyed give a cut of that directly to the authors.

Next, she asked “Who’s doing the publishing? What research are they publishing?” She put up some convincing graphics showing the number of articles published by authors from various countries, in which the US and Western Europe were leading the pack by sixfold. I couldn’t hunt down the original publication, so take this rough statistic with a grain of salt. What about book publishing? The Atlantic Wire published a great chart back in October (based on an original article in Digital Book World) that scaled a country’s size based on the value of their domestic publishing markets:

Scaled map of the world based on book publishing. From Digital Book World via Atlantic Wire.

When asking whose interests are served by international journals, she focused on a commentary by R. Horton, titled “Medical journals: Evidence of bias against the diseases of poverty” (The Lancet 361, 1 March 2003 – behind paywall). Granted, it’s a bit out of date, but it still has interesting points to consider. Horton reported that the five top medical journals have little or no representation on their editorial boards from countries with low Human Development Indices. Horton then postulates that this might be the cause of the so-called 10/90 gap – where 90% of research funding is allocated to diseases that affect only 10% of the world’s population. Although Horton does not go so far as to blame the commercial nature of publishing, he points out that editorial boards for journals must consider their readership and cater to those who can afford subscription fees.

I wonder how this commentary holds up, 10 years later. I would like to think that we’ve made a lot of progress towards better representation of research affecting humans that live in poverty. I’m not sure, however, we’ve done better with access to published research. I’ll leave you with something Laura said during her talk (paraphrased): “If half of the world is left out of knowledge exchange and dissemination, science will suffer.”

Check out Laura Czerniewicz’s Blog for more on this. She’s also got a Twitter feed.


The New OSTP Policy & What it Means

Last week, the White House Office of Science and Technology Policy (OSTP) responded to calls for broader access to federally funded research. I was curious as to whether this policy had any teeth, so I actually read the official memorandum. Here I summarize and have a few thoughts.

The overall theme of the document is best represented by this phrase:

…wider availability of peer-reviewed publications and scientific data in digital formats will create innovative economic markets for services related to curation, preservation, analysis, and visualization.

OSTP must have fielded early concerns from journal publishers, because the memo repeatedly includes sentiments like this:

The Administration also recognizes that publishers provide valuable services, including the coordination of peer review, that are essential for ensuring the high quality and integrity of many scholarly publications. It is critical that these services continue to be made available.

And now we get to the big change:

Federal agencies investing in research and development (more than $100 million in annual expenditures) must have clear and coordinated policies for increasing public access to research products

Each of the agency plans is required to outline strategies to:

  • leverage existing archives and partnerships with journals
  • improve the public’s ability to locate and access data
  • provide optimized search, archival, and dissemination features that encourage accessibility and interoperability
  • notify researchers of their new obligations for increasing access to research products (e.g., guidance, conditions for funding)
  • measure and enforce researcher compliance

Draft plans for each agency are due within 6 months of the memo. This is all great news for open science advocates: agencies must require researchers to comply with open data mandates and help them do it.

Hopefully the teeth in this new OSTP memo won’t be slowed down by its tiny arms. From Flickr by Hammerhead27

The memo then outlines what agency plans should include, breaking the guidelines into those for scientific articles, and those for data.

Scientific Articles:

New agency plans must include provisions for open access to scientific articles reporting on research. The memo provides two main guidelines related to this:

  • public access to research articles (including the ability to read, download, and analyze digitally) should happen within about 12 months post-publication
  • there should be free, full public access to the research article’s metadata, in standard format

Scientific Data:

First, the memo defines data:

…digital recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications, but does not include laboratory notebooks, preliminary analyses, drafts of scientific papers, plans for future research, peer review reports, communications with colleagues, or physical objects, such as laboratory specimens.

It then sets the following guidelines. The agency plans should:

  1. Maximize free public access while keeping in mind privacy/confidentiality, proprietary interests, and that not all data should be kept forever
  2. Ensure researchers create data management plans
  3. Allow costs for data preservation and access in proposal budgets
  4. Ensure evaluation of data management plan merits
  5. Ensure researchers comply with their data management plans
  6. Promote data deposition into public repositories
  7. Encourage public/private partnerships to ensure interoperability
  8. Develop approaches for identification and attribution of datasets
  9. Educate folks about data stewardship
  10. Assess long-term needs for repositories and infrastructure

This list got me excited: there might actually be some teeth in #4 and #5 above. We all know that the NSF’s data management plan requirements have been rather weak up to now, but this implies that the requirement will now actually be evaluated and enforced.

I’m also quite pleased to see #6: data should be deposited in public repositories. The icing on the cake is #8: datasets need identification and attribution. Overall, my feelings about this list can be summed up by one word – hooray!

Official versions of related documents:
