Category Archives: Open Access etc.

There’s a new Dash!

Dash: an open source, community approach to data publication

We have great news! Last week we refreshed our Dash data publication service. For those of you who don’t know, Dash is an open source, community-driven project that takes a unique approach to data publication and digital preservation.

Dash focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated. It is a project based at the University of California Curation Center (UC3), a program at California Digital Library (CDL) that aims to develop interdisciplinary research data infrastructure.

Dash employs a multi-tenancy user interface, providing partners with extensive opportunities for local branding and customization, use of existing campus login credentials, and, importantly, the Dash service under a tenant-specific URL, an important consideration in driving adoption. We welcome collaborations with other organizations wishing to provide a simple, intuitive data publication service on top of more cumbersome legacy systems.

There are currently eight live instances of Dash:

  • UC Berkeley
  • UC Irvine
  • UC Merced
  • UC Office of the President
  • UC Riverside
  • UC Santa Cruz
  • UC San Francisco
  • ONEshare (in partnership with DataONE)

Architecture and Implementation

Dash is completely open source, and our code is publicly available on GitHub. Dash is based on an underlying Ruby-on-Rails data publication platform called Stash. Stash encompasses three main functional components: Store, Harvest, and Share.

  • Store: The Store component is responsible for the selection of datasets; their description in terms of configurable metadata schemas, including specification of ORCID and FundRef identifiers for researcher and funder disambiguation; the assignment of DOIs for stable citation and retrieval; designation of an optional limited-time embargo; and packaging and submission to the integrated repository
  • Harvest: The Harvest component is responsible for retrieval of descriptive metadata from that repository for inclusion into a Solr search index
  • Share: The Share component, based on GeoBlacklight, is responsible for the faceted search and browse interface
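
To give a feel for what the Harvest step involves, here is a minimal sketch in Python (not the Ruby code Stash actually uses). It parses an OAI-PMH ListRecords response carrying Dublin Core metadata and flattens each record into a dictionary of the sort a search index could ingest. The sample response, the identifier, and the function name harvest_records are all made up for illustration; a real harvester would fetch the XML from the repository and push the results to Solr.

```python
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body; a real harvester would retrieve this over HTTP.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:dataset-1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example tide pool survey</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:subject>ecology</dc:subject>
          <dc:subject>intertidal</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def harvest_records(xml_text):
    """Turn an OAI-PMH ListRecords response into a list of flat dicts."""
    root = ET.fromstring(xml_text.strip())
    docs = []
    for record in root.findall("oai:ListRecords/oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:
            # Deleted records have a header but no metadata; skip them.
            continue
        docs.append({
            "id": identifier,
            "title": dc.findtext("dc:title", default="", namespaces=NS),
            "creators": [e.text for e in dc.findall("dc:creator", NS)],
            "subjects": [e.text for e in dc.findall("dc:subject", NS)],
        })
    return docs


if __name__ == "__main__":
    for doc in harvest_records(SAMPLE_RESPONSE):
        print(doc)
```

Repeated fields such as creators and subjects are kept as lists, since faceted search generally relies on multi-valued fields.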

Dash Architecture Diagram

Individual dataset landing pages are formatted as an online version of a data paper, presenting all appropriate descriptive and administrative metadata in a form that can be downloaded as an individual PDF file, or as part of the complete dataset download package, incorporating all data files for all versions.

To facilitate flexible configuration and future enhancement, all support for the various external service providers and repository protocols is fully encapsulated in pluggable modules. Metadata modules are available for the DataCite and Dublin Core metadata schemas. Protocol modules are available for the SWORD 2.0 deposit protocol and the OAI-PMH and ResourceSync harvesting protocols. Authentication modules are available for InCommon/Shibboleth and Google/OAuth2 identity providers (IdPs). We welcome collaborations to develop additional modules for other metadata schemas and repository protocols. Please email UC3 (uc3 at ucop dot edu) or visit GitHub for more information.
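
For readers curious what "pluggable" can look like in practice, here is a rough Python sketch of the general pattern. To be clear, this is not Stash's actual Ruby implementation: the class names, the registry, and the simplified output fields are all hypothetical. The idea is simply that each metadata schema lives in its own module behind a shared interface, so supporting a new schema means adding a module rather than editing the core.

```python
from abc import ABC, abstractmethod

# Hypothetical registry mapping a schema name to the module that handles it.
REGISTRY = {}


class MetadataModule(ABC):
    """Illustrative interface that every metadata schema module implements."""

    schema_name = ""

    @abstractmethod
    def serialize(self, dataset):
        """Return the dataset description in this module's schema."""


def register(cls):
    """Class decorator that adds a module to the registry under its schema name."""
    REGISTRY[cls.schema_name] = cls
    return cls


@register
class DublinCoreModule(MetadataModule):
    schema_name = "dublin_core"

    def serialize(self, dataset):
        return {"dc:title": dataset["title"], "dc:creator": dataset["creators"]}


@register
class DataCiteModule(MetadataModule):
    schema_name = "datacite"

    def serialize(self, dataset):
        # Heavily simplified; the real DataCite schema has many more fields.
        return {
            "titles": [{"title": dataset["title"]}],
            "creators": [{"creatorName": name} for name in dataset["creators"]],
        }


def serialize_with(schema_name, dataset):
    """Look up the configured module by name and delegate to it."""
    module = REGISTRY[schema_name]()
    return module.serialize(dataset)


if __name__ == "__main__":
    example = {"title": "Example tide pool survey", "creators": ["Doe, Jane"]}
    print(serialize_with("dublin_core", example))
    print(serialize_with("datacite", example))
```

The core code only ever calls serialize_with, so which schema an instance uses becomes a configuration choice.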

Features of the newly refreshed Dash service

What are the new features in our refresh of the Dash service? Take a look.

Feature | Tech-focused | User-focused | Description
Open Source | X | - | All components are open source, with MIT-licensed code
Standards compliant | X | - | Dash integrates with any SWORD/OAI-PMH-compliant repository
Pluggable Framework | X | - | Inherent extensibility for supporting additional protocols and metadata schemas
Flexible metadata schemas | X | - | Supports the DataCite metadata schema out of the box, but can be configured to support any schema
Innovation | X | - | Our modular framework will make new feature development easier and quicker
Mobile/responsive design | X | X | Built mobile-first, from the ground up, for better user experience
Geolocation – Metadata | X | X | For applicable research outputs, we have an easy-to-use way to capture the location of your datasets
Persistent Identifiers – ORCID | X | X | Dash allows researchers to attach their ORCID, allowing them to track and get credit for their work
Persistent Identifiers – DOIs | X | X | Dash issues DOIs for all datasets, allowing researchers to track and get credit for their work
Persistent Identifiers – FundRef | X | X | Dash tracks funder information using FundRef, allowing researchers and funders to track their research outputs
Login – Shibboleth/OAuth2 | X | X | We offer easy single sign-on with your campus credentials or Google account
Versioning | X | X | Datasets can change. Dash offers a quick way for you to upload new versions of your datasets and a simple process for tracking updates
Accessibility | X | X | The technology, design, and user workflows have all been built with accessibility in mind
Better user experience | - | X | Self-depositing made easy. Simple workflow, drag-and-drop upload, simple navigation, clean data publication pages, user dashboards
Geolocation – Search | - | X | With GeoBlacklight, we can offer search by location
Robust Search | - | X | Search by subject, filetype, keywords, campus, location, etc.
Discoverability | - | X | Indexing by search engines such as Google, Bing, etc.
Build Relationships | - | X | Many datasets are related to publications or other data. Dash offers a quick way to describe these relationships
Supports Best Practices | - | X | Data publication can be confusing. But with Dash, you can trust that Dash is following best practices
Data Metrics | - | X | See the reach of your datasets through usage and download metrics
Data Citations | - | X | Quick access to a well-formed citation reference (with DOI) for every data publication. Easy for your peers to quickly grab
Open License | - | X | Dash supports open Creative Commons licensing for all data deposits; can be configured for other licenses
Lower Barrier to Entry | - | X | For those in a hurry, Dash offers a quick interface to self-deposit. Only three steps and few required fields
Support Data Reuse | - | X | Focus researchers on describing methods and explaining ways to reuse their datasets
Satisfies Data Availability Requirements | - | X | Many publishers and funders require researchers to make their data available. Dash is a readily accepted and easy way to comply

A little Dash history

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health. Dash version 2 was developed by UC3 and partners with funding from the Alfred P. Sloan Foundation (our funded proposal). Read more about the code, the project, and contributing to development on the Dash GitHub site.

A little Dash future

We will continue the development of the new Dash platform and will keep you posted. Next up: support for timed deposits and embargoes.  Stay tuned!


Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together.

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and divides individuals into one of five categories depending on the point in the curve at which they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” – GitHub, in this case – will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. By one definition, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems, most notably Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.
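
If "recording changes between versions" still feels abstract, here is a tiny Python sketch using the standard library's difflib module. It compares two made-up versions of a small CSV file and lists exactly which lines were removed and added. The file names and contents are invented for illustration, and this is not how Git stores history internally; it only shows the kind of line-by-line comparison a version control system can report for you.

```python
import difflib


def describe_changes(old_text, new_text, old_name="data_v1.csv", new_name="data_v2.csv"):
    """Return a unified diff between two versions of a text file as a list of lines."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",  # input lines carry no newlines, so headers should not either
    )
    return list(diff)


# Two hypothetical versions of the same small dataset.
v1 = "site,count\nA,10\nB,12\n"
v2 = "site,count\nA,10\nB,15\nC,7\n"

if __name__ == "__main__":
    for line in describe_changes(v1, v2):
        print(line)
```

Lines starting with a minus sign were removed and lines starting with a plus sign were added, which is a lot easier than eyeballing two spreadsheets side by side.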

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way for efficiently developing software projects, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how Git works, its structure, or how to incorporate Git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science”, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. A recent piece about digital lab notebooks provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or appended file names with initials. Instead, a co-author can submit changes and document those with “commit messages” (read about them on GitHub here).

3. Feedback and review are easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded in the way that it did. For the super savvy GitHubber, you can make available your entire manuscript, from the first datapoint collected to the final submitted version, traceable on your site. This is my goal for my next manuscript.

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.


Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan




Lit Review: #PLOSFail and Data Sharing Drama

Turn and face the strange, researchers.


I know what you’re thinking: how can yet another post on the #PLOSfail hoopla say anything new? Fear not. I say nothing particularly new here, but I do offer a three-weeks-out lit review of the hoopla, in hopes of finding a pattern in the noise. For those new to the #PLOSFail drama, the short version is this: PLOS enacted a mandatory data sharing policy. Researchers flipped out. See the sources at the end of this post for more background.

Arguments made against data sharing

1) My data is my lifeblood. I won’t just give it away.

Terry McGlynn, a biologist writing at Small Pond Science, argues that “Regardless of the trajectory of open science, the fact remains that, at the moment, we are conducting research in a culture of data ownership.” Putting the ownership issue aside for now, let’s focus on the crux of McGlynn’s argument: he contends that data sharing results in turning a private resource (data) into a community resource. This is especially burdensome for small labs (like his) since each data point takes relatively more effort to produce. If this resource is available to anyone, the benefits to the former owner are greatly reduced since they are now shared with the broader community.

Although these are valid concerns, they are not in the best interest of science. I argue that what we are really talking about here is the incentive problem (see more in the section below). That is, publications are valued in performance evaluation of academics, while data are not. Everyone can agree that data is indispensable to scientific advancement, so why hasn’t the incentive structure caught up yet? If McGlynn were able to offset the loss of benefits caused by data sharing by getting mad props for making his data available and useful, this issue would be less problematic. Jeff Leek, a biostatistician blogging at Simply Statistics, makes a great point with regard to this: to paraphrase him, the culture of credit hasn’t caught up with the culture of science. There is no appropriate form of credit for data generators – it’s either citation (seems chintzy) or authorship (not always appropriate). Solution: improve incentives for data sharing. Find a way to appropriately credit data producers.

2) My datasets are special, unique snowflakes. You can’t understand/use them.

Let’s examine what McGlynn says about this with regard to researchers re-using his data: “…anybody working on these questions wouldn’t want the raw data anyway, and there’s no way these particular data would be useful in anybody’s meta analysis. It’d be a huge waste of my time.”

Rather than try to come up with a new, witty way to answer this argument, I’ll shamelessly quote from the MacManes Lab blog post, Corner cases and the PLOS data policy:

 There are other objections – one type is the ‘my raw data are so damn special that nobody can over make sense of them’, while another is ‘I use special software and stuff, so they are probably not useful to anybody else’. I call BS on both of these arguments. Maybe you have the world’s most complicated data, but why not release them and not worry about whether or not people find them useful – that is not your concern (though it should be).

I couldn’t have said it better. The snowflake refrain from researchers is not new. I’ve heard it time and again when talking to them about data archiving. There is certainly truth to this argument: most (all?) datasets are unique. Why else would we be collecting data? This doesn’t make them useless to others, especially if we are sharing data to promote reproducibility of reported results.

DrugMonkey, an anonymous blogger and biomedical researcher, took this “my data are unique” argument to paranoia level. In their post, PLoS is letting the inmates run the asylum and it will kill them, s/he contends that researchers will somehow be forced to use all the same methods to facilitate data reuse. “…diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that. So we’ll have PLoS [sic] inserting itself in the role of how experiments are to be conducted and interpreted!”

I imagine DrugMonkey pictures future scientists in grey overalls, trudging to a factory to do “science”. This is just ridiculous. The idiosyncrasies of how individual researchers handle their data will always be part of the challenge of reproducibility and data curation. But I have never (ever) heard of anyone suggesting that all researchers in a given field should be doing science in the exact same way. There are certainly best practices for handling datasets. If everyone followed these to the best of their ability, we would have an easier time reusing data. But no one is punching a time card at the factory.

3) Data sharing is hard | time-consuming | new-fangled.

This should probably be #1 in the list of arguments from researchers. Even for those who cite other reasons for not sharing their data, this is probably at the root of the hoarding. Full disclosure – only a small portion of the datasets I have generated as a researcher are available to the public. The only explanation is that it’s time-consuming and I have other things on my plate. So I hear you, researchers. That said, the time has come to start sharing.

DrugMonkey says that the PLOS data policy requires much additional data curation which will take time. “The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time…” McGlynn states this point succinctly: “Why am I sour on required data archiving? Well, for starters, it is more work for me… To get these numbers into a downloadable and understandable condition would be, frankly, an annoying pain in the ass.”

Fair enough. But I argue here (along with others) that making data available is not an optional side note of research: it is research. In the comments of David Crotty’s post at The Scholarly Kitchen, “PLOS’ bold data policy”, there was a comment that I loved. The commenter, Mike Taylor, said this:

 …data curation is research. I’d argue that a researcher who doesn’t make available the data necessary to reproduce his conclusions isn’t getting his job done. Complaining about having to spend time on preparing the data for others to use is like complaining about having to spend time writing the paper, or indeed running experiments.

When I read that comment, I might have fist pumped a little. Of course, we still have that pesky incentive issue to work out… As Crotty puts it, “Perhaps the biggest practical problem with [data sharing] is that it puts an additional time and effort burden on already time-short, over-burdened researchers. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.” Sigh.

What about that “new-fangled” bit? Well, researchers often complain that data management and curation requires skills that are not taught. I 100% agree with this statement – see my paper on the lack of data management education for even undergrads. But as my ex-cop dad likes to say, “ignorance of the law is not a defense”. In continuation of my shameless quoting from others, here’s what Ted Hart (Staff Scientist at NEON) has to say in his post, “Just Get Over Yourself and Share Your Data”:

Sharing is hard. but not an intractable problem… Is the alternative is that everyone just does everything in secret with myriad idiosyncrasies ferociously milking least publishable units from a data set? That just seems like a recipe for science moving slowly and in the dark. …I think we just need to own up to the fact being a scientist these days requires new skills, and it always have. You didn’t have to know how to do PCR prior to 1983, but now you do. In the 21st century to do science better, we need more than spreadsheets with a few rows, we need to implement best practices for data management.

More fist pumping! No, things won’t change overnight. Leek at Simply Statistics rightly stated that the transition to open data will be rough for two reasons: (1) there is no education on data handling, and (2) there is a disconnect between the incentives for individual researchers and the actions that will benefit science as a whole. Sigh. Back to that incentive issue again.

Highlights & Takeaways

At risk of making this blog post way too long, I want to showcase a few highlights and takeaways from my deep dive into the #PLOSfail blogging world.

1) The Incentives Problem

We have a big incentives problem, which was probably obvious from my repeated mentions of it above. What’s good for researchers’ careers is not conducive to data sharing. If we expect behavior to change, we need to work on giving appropriate credit where it’s due.

Biologist Björn Brembs puts it well in his post, What is the Difference Between Text, Data, and Code? He writes: “…it is unrealistic to expect tenure committees and grant evaluators to assess software and data contributions before anybody even is contributing and sharing data or code.” Yes, there is a bit of a chicken-and-egg situation. We need movement on both sides to get somewhere. Share the data, and they will start to recognize it.

2) Empiricism Versus Theory

There is a second plot line to the data sharing rants: empiricists versus theoreticians. See ecologist Timothée Poisot’s blog, “Of the value of datasets and methods in open science”, for a more extensive review of this issue as it relates to data sharing. Of course, this tension is not a new debate in science. But terms like “data vultures” get thrown about, and feelings get hurt. Due to the nature of their work, most theoreticians’ “data” is equations, methods, and code that are shared via publication. Meanwhile, empiricists generate data and can hoard it until they see fit to share it, only offering a glimpse of the entire suite of their research outputs. To paraphrase Hart again: science is equal parts data and analysis/methods. We need both, so let’s stop fighting and encourage open science all around.

3) Data Ownership Issues

There are lots of potential data owners: the funders who paid for the work, the institution where the research was performed, the researcher who collected the data, the principal investigator of the lab where the researcher works, etc. The complications around data ownership make this a tricky subject to work out. Zen Faulkes, a neurobiologist at University of Texas, blogged about who owns data, in particular, his data. He did a little research and found what many (most?) researchers at universities might find: “I do not own research data I generate. Neither do the funding agencies. The University of Texas system Board of Regents own research data I generate.” Faulkes goes on to state that the regents probably don’t care what he does with his data unless/until they can make money off of it… very true. To make things more complicated, Crotty over at Scholarly Kitchen reminded us that “under US law (the Bayh-Dole Act), the intellectual property (IP) generated as the result of federal research funds belongs to the researcher and their institution.” What does that even mean?!

To me, the issue is not about who owns the data outright. Instead, it’s about my role as an open science “waccaloon” who is interested in what’s best for the scientific process. To that extent, I am going to borrow from Hart again. Hart makes a comparison between having data and having a pet: in Boulder CO, there are no pet “owners” – only pet “guardians”. We can think of our data in this same way: we don’t own it; we simply care for it, love it, and are intellectually (and sometimes emotionally!) invested in it.

4) PLOS is Part of a Much Bigger Movement

Open science mandates are already here. The OSTP memo released last year is a huge leap forward in this direction – it requires that federally funded research outputs (including data) be made available to the public. Crotty draws a link between OSTP and PLOS policies in his blog: “Once this policy goes into effect, PLOS’ requirements would seem to be an afterthought for authors funded in this manner. The problem is that the OSTP policy seems nowhere near being implemented.”

That last part is most definitely true. One way to work on implementing this policy? Get the journals involved. The current incentive structure is not well-suited for ensuring compliance with OSTP, but journals have a role as gatekeepers to the traditional incentives. Crotty states it this way:

PLOS has never been a risk averse organization, and this policy would seem to fit well with their ethos of championing access and openness as keys to scientific progress. Even if one suspects this policy is premature and too blunt an instrument, one still has to respect PLOS for remaining true to their stated goals.

So I say kudos to PLOS!

In Conclusion…

I’ll end with a quote from the MacManes Lab blog post:

How about this, make an honest effort to make the data accessible and useful to others, and chances are you’re probably good to go.

Final fist pump.


  1. Timothée Poisot, Ecologist. Of the value of datasets and methods in open science.
  2. Terry McGlynn, Biologist. I own my data until I don’t. Blog at Small Pond Science @hormiga
  3. David Crotty, publisher & former researcher. PLOS’ bold data policy Blog at The Scholarly Kitchen @scholarlykitchn
  4. Edmund Hart, Staff Scientist at NEON. Just Get Over Yourself and Share Your Data. @DistribEcology
  5. MacManes Lab, genomics. Corner cases and the PLOS data policy.
  6. DrugMonkey, biomedical research. PLoS is letting the inmates run the asylum and it will kill them. @DrugMonkey
  7. Zen Faulkes, Neurobiologist. Who owns data. Blog at NeuroDojo @DoctorZen
  8. Björn Brembs, biologist. What is the Difference Between Text, Data, and Code? @brembs
  9. Jeff Leek, biostatistician. PLoS One, I have an idea for what to do with all your profits: buy hard drives Blog at Simply Statistics. @leekgroup

Twitter feed for #PLOSfail


UC Open Access: How to Comply

Free access to UC research is almost as good as free hugs! From Flickr by mhauri


My last two blog posts have been about the new open access policy that applies to the entire University of California system. For big open science nerds like myself, this is exciting progress and deserves much ado. For the on-the-ground researcher at a UC, knee-deep in grants and lecture preparation, the ado could probably be skipped in favor of a straightforward explanation of how to comply with the policy. So here goes.

Who & When:

  • 1 November 2013: Faculty at UC Irvine, UCLA, and UCSF
  • 1 November 2014: Faculty at UC Berkeley, UC Merced, UC Santa Cruz, UC Santa Barbara, UC Davis, UC San Diego, UC Riverside

Note: The policy applies only to ladder-rank faculty members. Of course, graduate students and postdocs should strongly consider participating as well.

To comply, faculty members have two options:

Option 1: Out-of-the-box open access

There are two ways to do this:

  1. Publishing in an open access-only journal (see examples here). Some have fees and others do not.
  2. Publishing with a more traditional publisher, but paying a fee to ensure the manuscript is publicly available. These are article-processing charges (APCs) and vary widely depending on the journal. For example, Elsevier’s Ecological Informatics charges $2,500, while Nature charges $5,200.

Learn more about different journals’ fees and policies at the Directory of Open Access Journals.

Option 2: Deposit your final manuscript in an open access repository.

In this scenario, you can publish in whatever journal you prefer – regardless of its openness. Once the manuscript is published, you take action to make a version of the article freely and openly available.

As UC faculty (or any UC researcher, including grad students and postdocs), you can comply via Option 2 above by depositing your publications in UC’s eScholarship open access repository. The CDL Access & Publishing Group is currently perfecting a user-friendly, efficient workflow for managing article deposits into eScholarship. The new workflow will be available as of November 1st. Learn more.

Does this still sound like too much work? Good news! The Publishing Group is also working on a harvesting tool that will automate deposit into eScholarship. Stay tuned – the estimated release of this tool is June 2014.

An Addendum: Are you not a UC affiliate? Don’t fret! You can find your own version of eScholarship (i.e., an open access repository) by going to OpenDOAR. Also see my full blog post about making your publications open access.


Academic libraries must pay exorbitant fees to provide their patrons (researchers) with access to scholarly publications.  The very patrons who need these publications are the ones who provide the content in the form of research articles.  Essentially, the researchers are paying for their own work, by proxy via their institution’s library.

What if you don’t have access? Individuals without institutional affiliations (e.g., between jobs), or who are affiliated with institutions that have no library or a poorly funded one (e.g., in 2nd or 3rd world countries), depend on open access articles for keeping up with the scholarly literature. The need for OA isn’t limited to jobless or international folks, though. For proof, one only has to notice that the Twitter community has developed a hashtag around this, #Icanhazpdf (hat tip to the Lolcats phenomenon). Basically, you tweet the name of the article you can’t access and add the hashtag in hopes that someone out in the Twittersphere can help you out and send it to you.

Special thanks to Catherine Mitchell from the CDL Access & Publishing Group for help on this post.

A Closer Look at the New UC Open Access Policy

The UC is opening up their research locker.  From Flickr by sam.d

Last week, the University of California announced a new Open Access Policy. Here I will explore the policy in a bit more detail.  The gist of the policy is this: research articles authored by UC faculty will be made available to the public at no charge.

I’m sure most of this blog’s readers are familiar with paywalls and the nuances of scholarly publishing, but for those that aren’t – if you don’t have a license to get content from particular journals (via your institution’s library, for example) then you may pay upwards of $100 per article. For example, if I publish an amazing article in Nature (and don’t pay the $5,200 fee to make my article open access), my mom can’t get a copy of the article to hang on her fridge without either (1) getting a copy from someone with access, or (2) paying a big fee. Considering that my mom pays taxes that fund the NSF which funded my work, this is rather strange.

The UC policy is trying to change that. The idea is that faculty at the UC will grant a license to the UC prior to any contractual arrangement with publishers. The faculty member then has the right to make their research widely and publicly available, re-use it for various purposes, or modify it for future research publications, regardless of the publisher’s wishes for locking down the work.

Faculty will continue to publish their work in the most appropriate journal (open access or not). The big change is that now they can also place a copy of the publication in UC’s open access repository, eScholarship, which is freely accessible to anyone. To re-emphasize: This policy does NOT require that faculty publish in particular journals or pay “Article Processing Charges” to ensure their article is open access.

From the policy’s FAQ  page:

Faculty are strongly encouraged to continue to publish as normal, in the most appropriate and prestigious journals. Faculty are not required to pay to publish articles or pay to deposit them in an open-access repository under this policy, unless they choose to do so.

How faculty can comply (from the FAQ page):

By passing the policy on July 24, 2013, UC faculty members have committed themselves to making their scholarly articles available to the public by granting a license to UC and depositing a copy of their publications in eScholarship, UC’s open access repository. The policy automatically grants UC a license to make any scholarly articles available in an open access repository. UC will not do so, however, until an author takes the action of depositing an article in UC’s eScholarship repository or confirms the availability of the article in another open access venue – i.e., a repository (such as PubMed Central, ArXiv or SSRN) or an open access journal.

The California Digital Library and the campus libraries will assist faculty by providing a streamlined deposit system into eScholarship and an automated ‘harvesting’ tool in order to ease the process of depositing articles; the tool is expected to be in place by June 2014.

And now, the downside. Michael Eisen, co-founder of the open access publisher PLOS, points out a major weakness of the new policy in his blog post:

This policy has a major, major hole – an optional faculty opt-out. This is there because enough faculty wanted the right to publish their works in ways that were incompatible with the policy that the policy would not have passed without the provision.  Unfortunately, this means that the policy is completely toothless.

Eisen goes on to say

…because of the opt out, this is a largely symbolic gesture – a minor event in the history of open access, not the watershed event that some people are making it out to be.

Although I agree with Eisen that the opt-out clause significantly weakens this policy, I still believe this move on the UC’s part represents a major step forward in the battle to reclaim our scholarly work from some publishers. Perhaps it isn’t “watershed” but it’s certainly exciting, and it’s stimulating conversations about open science and accessibility to research.

Read more on the new policy and related topics:

UC Faculty Senate Passes #OA Policy

Big news! I just got this email regarding the new Open Access Policy for the University of California System. I’ll write a full blog post next week but wanted to share this as soon as possible. (emphasis is mine)

The Academic Senate of the University of California has passed an Open Access Policy, ensuring that future research articles authored by faculty at all 10 campuses of UC will be made available to the public at no charge. “The Academic Council’s adoption of this policy on July 24, 2013, came after a six-year process culminating in two years of formal review and revision,” said Robert Powell, chair of the Academic Council. “Council’s intent is to make these articles widely—and freely— available in order to advance research everywhere.”  Articles will be available to the public without charge via eScholarship (UC’s open access repository) in tandem with their publication in scholarly journals.  Open access benefits researchers, educational institutions, businesses, research funders and the public by accelerating the pace of research, discovery and innovation and contributing to the mission of advancing knowledge and encouraging new ideas and services.

Chris Kelty, Associate Professor of Information Studies, UCLA, and chair of the UC University Committee on Library and Scholarly Communication (UCOLASC), explains, “This policy will cover more faculty and more research than ever before, and it sends a powerful message that faculty want open access and they want it on terms that benefit the public and the future of research.”

The policy covers more than 8,000 UC faculty at all 10 campuses of the University of California, and as many as 40,000 publications a year. 

It follows more than 175 other universities who have adopted similar so-called “green” open access policies.  By granting a license to the University of California prior to any contractual arrangement with publishers, faculty members can now make their research widely and publicly available, re-use it for various purposes, or modify it for future research publications.  Previously, publishers had sole control of the distribution of these articles.  All research publications covered by the policy will continue to be subjected to rigorous peer review; they will still appear in the most prestigious journals across all fields; and they will continue to meet UC’s standards of high quality.  Learn more about the policy and its implementation here:

UC is the largest public research university in the world and its faculty members receive roughly 8% of all research funding in the U.S.

With this policy UC Faculty make a commitment to the public accessibility of research, especially, but not only, research paid for with public funding by the people of California and the United States.  This initiative is in line with the recently announced White House Office of Science and Technology Policy (OSTP) directive requiring “each Federal Agency with over $100 million in annual conduct of research and development expenditures to develop a plan to support increased public access to results of the research funded by the Federal Government.” The new UC Policy also follows a similar policy passed in 2012 by the Academic Senate at the University of California, San Francisco, which is a health sciences campus.

“The UC Systemwide adoption of an Open Access (OA) Policy represents a major leap forward for the global OA movement and a well-deserved return to taxpayers who will now finally be able to see first-hand the published byproducts of their deeply appreciated investments in research,” said Richard A. Schneider, Professor, Department of Orthopaedic Surgery and chair of the Committee on Library and Scholarly Communication at UCSF. “The ten UC campuses generate around 2-3% of all the peer-reviewed articles published in the world every year, and this policy will make many of those articles freely available to anyone who is interested anywhere, whether they are colleagues, students, or members of the general public.”

The adoption of this policy across the UC system also signals to scholarly publishers that open access, in terms defined by faculty and not by publishers, must be part of any future scholarly publishing system.  The faculty remains committed to working with publishers to transform the publishing landscape in ways that are sustainable and beneficial to both the University and the public.

More information:


University of California, Berkeley campus, 1901. Contributed to Calisphere by the Berkeley Public Library.

The Data Lineup for #ESA2013

Why am I excited about Minneapolis? Potential Prince sightings, of course!

In less than a week, the Ecological Society of America’s 2013 Meeting will commence in Minneapolis, MN. There will be zillions of talks and posters on topics ranging from microbes to biomes, along with special sessions on education, outreach, and citizen science. So why am I going?

For starters, I’m a marine ecologist by training, and this is an excuse to meet up with old friends. But of course the bigger draw is to educate my ecological colleagues about all things data: data management planning, open data, data stewardship, archiving and sharing data, et cetera et cetera. Here I provide a rundown of must-see talks, sessions, and workshops related to data. Many of these are tied to the DataONE group and the rOpenSci folks; see DataONE’s activities and rOpenSci’s activities. Follow the full ESA meeting on Twitter at #ESA2013. See you in Minneapolis!

Sunday August 4th

0800-1130 / WK8: Managing Ecological Data for Effective Use and Re-use: A Workshop for Early Career Scientists

For this 3.5 hour workshop, I’ll be part of a DataONE team that includes Amber Budden (DataONE Community Engagement Director), Bill Michener (DataONE PI), Viv Hutchison (USGS), and Tammy Beaty (ORNL). This will be a hands-on workshop for researchers interested in learning about how to better plan for, collect, describe, and preserve their datasets.

1200-1700 / WK15: Conducting Open Science Using R and DataONE: A Hands-on Primer (Open Format)

Matt Jones from NCEAS/DataONE will be assisted by Karthik Ram (UC Berkeley & rOpenSci), Carl Boettiger (UC Davis & rOpenSci), and Mark Schildhauer (NCEAS) to highlight the use of open software tools for conducting open science in ecology, focusing on the interplay between R and DataONE.

Monday August 5th

1015-1130 / SS2: Creating Effective Data Management Plans for Ecological Research

Amber, Bill and I join forces again to talk about how to create data management plans (like those now required by the NSF) using the free online DMPTool. This session is only 1.25 hours long, but we will allow ample time for questions and testing out the tool.

1130-1315 / WK27: Tools for Creating Ecological Metadata: Introduction to Morpho and DataUp

Matt Jones and I will be introducing two free, open-source software tools that can help ecologists describe their datasets with standard metadata. The Morpho tool can be used to locally manage data and upload it to data repositories. The DataUp tool helps researchers not only create metadata, but check for potential problems in their dataset that might inhibit reuse, and upload data to the ONEShare repository.

Tuesday August 6th

0800-1000 / IGN2: Sharing Makes Science Better

This two-hour session organized by Sandra Chung of NEON is composed of 5-minute long “ignite” talks, which guarantees you won’t nod off. The topics look pretty great, and the crackerjack list of presenters includes Ethan White, Ben Morris, Amber Budden, Matt Jones,  Ed Hart, Scott Chamberlain, and Chris Lortie.

1330-1700 / COS41: Education: Research And Assessment

In my presentation at 1410, “The fractured lab notebook: Undergraduates are not learning ecological data management at top US institutions”, I’ll give a brief talk on results from my recent open-access publication with Stephanie Hampton on data management education.

2000-2200 / SS19: Open Science and Ecology

Karthik Ram and I are getting together with Scott Chamberlain (Simon Fraser University & rOpenSci), Carl Boettiger, and Russell Neches (UC Davis) to lead a discussion about open science. Topics will include open data, open workflows and notebooks, open source software, and open hardware.

2000-2200 / SS15: DataNet: Demonstrations of Data Discovery, Access, and Sharing Tools

Amber Budden will demo and discuss DataONE alongside folks from other DataNet projects like the Data Conservancy, SEAD, and Terra Populus.

Large Facilities & the Data they Produce

Last week I spent three days in the desert, south of Albuquerque, at the NSF Large Facilities Workshop. What are these “large facilities”, you ask? I did too… this was a new world for me, but the workshop ended up being a great learning experience.

The NSF has a Large Facilities Office within the Office of Budget, Finance and Award Management, which supports “Major Research Equipment and Facilities Construction” (MREFC for short). Examples of these Large Facilities include NEON (National Ecological Observatory Network), IRIS PASSCAL Instrument Center (Incorporated Research Institutions for Seismology Program for Array Seismic Studies of the Continental Lithosphere), and the NRAO (National Radio Astronomy Observatory). Needless to say, I spent half of the workshop googling acronyms.

I was there to talk about data management, which made me a bit of an anomaly. Other attendees administered, managed, and worked at large facilities. In the course of my conversations with attendees, I was surprised to learn that these facilities aren’t too concerned with data sharing, and most of these administrator types implied that the data were owned by the researcher; it was therefore the researcher’s prerogative to share or not to share. From what I understand, the scenario is this: the NSF pays huge piles of money to get these facilities up and running, with hardware, software, technicians, managers, and on and on. The researchers then write a grant to the NSF or the facilities themselves to do work using these facilities. The researcher is then under no obligation to share the data with their colleagues. Does this seem fishy to anyone else?

I understand the point of view of the administrators that attended this conference: they have enough on their plate to worry about, without dealing with the myriad problems that accompany data management, archiving, sharing et cetera. These problems are only compounded by researchers’ general resistance to sharing. For example, an administrator told me that, upon completion of their study, one researcher had gone into their system and deleted all of the data related to their project to make sure no one else could get it. I nearly fell over from shock.

Whatever cultural hangups the researchers have, aren’t these big datasets, being collected by expensive equipment, among the most important to be shared? Observations of the sky at a single point and time are not reproducible. You only get one shot at collecting data on an earthquake or the current spread rate for a rift zone. Not sharing these datasets is tantamount to scientific malpractice.

The Very Large Array, near Socorro NM. This was the best workshop field trip EVER. CC-BY, Carly Strasser

One administrator respectfully disagreed with my charge that they should be doing more to promote data sharing. He said that their workflow for data processing was so complex and nuanced that no one could ever reproduce the dataset, and certainly no one could ever understand what exactly was done to obtain results. This marks the second time I nearly fell over during a conversation. If science isn’t reproducible because it’s too complex, you aren’t doing it right. Yes, I realize that exactly reproducing results is nearly impossible under the best of circumstances. But to not even try? With datasets this important? When all analyses are done via computers? It seems ludicrous.

So, after three days of dry skin and Mexican food, my takeaway from the workshop was this: All large facilities sponsored by NSF need to have thorough, clear policies about data produced using their equipment. These policies should include provisions for sharing, access, use, and archiving. They will most certainly be met with skepticism and resistance, but in these tight fiscal times, data sharing is of utmost importance when equipment this expensive is being used.

Closed Data… Excuses, Excuses

If you are a fan of data sharing, open data, open science, and generally openness in research, you’ve heard them all: excuses for keeping data out of the public domain. If you are NOT a fan of openness, you should be. For both groups (the fans and the haters), I’ve decided to construct a “Frankenstein monster” blog post composed of other people’s suggestions for how to deal with the excuses.

Yes, I know. Frankenstein was the doctor, not the monster. From Flickr by Chop Shop Garage.

I have drawn some comebacks from Christopher Gutteridge, University of Southampton, and Alexander Dutton, University of Oxford. They created an open Google Doc of excuses for closing off data and appropriate responses, and generously provided access to the document under a CC-BY license. I also reference the UK Data Archive’s list of barriers and solutions to data sharing, available via the Digital Curation Centre’s PDF, “Research Data Management for Librarians” (pages 14-15).

People will contact me to ask about stuff

Christopher and Alex (C&A) say: “This is usually an objection of people who feel overworked and that [data sharing] isn’t part of their job…” I would add to this that science is all about learning from each other – if a researcher is opposed to the idea of discussing their datasets, collaborating with others, and generally being a good science citizen, then they should be outed by their community as a poor participant.

People will misinterpret the data

C&A suggest this: “Document how it should be interpreted. Be prepared to help and correct such people; those that misinterpret it by accident will be grateful for the help.” From the UK Data Archive: “Producing good documentation and providing contextual information for your research project should enable other researchers to correctly use and understand your data.”

It’s worth mentioning, however, a second point C&A make: “Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as one can quickly point to the real data on the web to refute the wrong interpretation.”

My data is not very interesting

C&A: “Let others judge how interesting or useful it is — even niche datasets have people that care about them.” I’d also add that it’s impossible to decide whether your dataset has value to future research. Consider the many datasets collected before “climate change” was a research topic which have now become invaluable to documenting and understanding the phenomenon. From the UK Data Archive: “Who would have thought that amateur gardener’s diaries would one day provide essential data for climate change research?”

I might want to use it in a research paper

Anyone who’s discussed data sharing with a researcher is familiar with this excuse. The operative word here is might. How many papers have we all considered writing, only to have them shift to the back burner due to other obligations? That said, this is a real concern.

C&A suggest the embargo route: “One option is to have an automatic or optional embargo; require people to archive their data at the time of creation but it becomes public after X months. You could even give the option to renew the embargo so only things that are no longer cared about become published, but nothing is lost and eventually everything can become open.” Researchers like to have a say in the use of their datasets, but I would recommend that any restrictions default to sharing. That is, after X months the data are automatically made open by the repository.
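For the repository-minded, here is a minimal sketch of what "default to sharing" could look like in code. It is only an illustration under my own assumptions: the `Deposit` class, its field names, and the 12-month default are hypothetical and are not drawn from C&A's document or from any real repository's software. The point is that release is computed from the deposit date, so doing nothing results in open data, and staying closed requires an explicit renewal.

```python
import calendar
from dataclasses import dataclass
from datetime import date


def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


@dataclass
class Deposit:
    """Hypothetical record for a dataset archived at the time of creation."""
    title: str
    deposited: date
    embargo_months: int = 12  # the "X months"; 12 is an arbitrary placeholder

    def release_date(self) -> date:
        """Date on which the dataset becomes public if nobody intervenes."""
        return add_months(self.deposited, self.embargo_months)

    def renew(self, extra_months: int) -> None:
        """Extend the embargo; the owner has to act in order to stay closed."""
        self.embargo_months += extra_months

    def is_public(self, today: date) -> bool:
        """True once the embargo has lapsed."""
        return today >= self.release_date()


survey = Deposit("Tide pool survey", date(2013, 8, 31), embargo_months=6)
print(survey.release_date())
print(survey.is_public(date(2014, 3, 1)))
```

A real repository would also want to cap the number of renewals, or "eventually everything can become open" never actually happens.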

I would also add that, as the original collector of the data, you are at a huge advantage compared to others that might want to use your dataset. You have knowledge about your system, the conditions during collection, the nuances of your methods, et cetera that could never be fully described in the best metadata.

I’m not sure I own the data

No doubt, there are a lot of stakeholders involved in data collection: the collector, the PI (if different), the funder, the institution, the publisher, … C&A have the following suggestions:

  • Sometimes it’s as easy as just finding out who does own the data
  • Sometimes nobody knows who owns the data. This often seems to occur when someone has moved into a post and isn’t aware that they are now the data owner.
  • Going up the management chain can help. If you can find someone who clearly has management over the area the dataset belongs to they can either assign an owner or give permission.
  • Get someone very senior to appoint someone who can make decisions about apparently “orphaned” data.

My data is too complicated

C&A: “Don’t be too smug. If it turns out it’s not that complicated, it could harm your professional [standing].” I would add that if it’s too complicated to share, then it’s too complicated to reproduce, which means it’s arguably not real scientific progress. This can be solved by more documentation.

My data is embarrassingly bad

C&A: “Many eyes will help you improve your data (e.g. spot inaccuracies)… people will accept your data for what it is.” I agree. All researchers have been on the back end of making the sausage. We know it’s not pretty most of the time, and we can accept that. Plus, it helps you strive to be better at managing and organizing data during your next collection phase.

It’s not a priority and I’m busy

Good news! Funders are making it your priority! New sharing mandates in the OSTP memorandum state that any research conducted with federal funds must be accessible. You can expect these sharing mandates to drift down to you, the researcher, in the very near future (6-12 months).
