Lit Review: #PLOSFail and Data Sharing Drama

Turn and face the strange, researchers. From pipedreamsfromtheshire.wordpress.com

I know what you’re thinking: how can yet another post on the #PLOSFail hoopla say anything new? Fear not. I say nothing particularly new here, but I do offer a three-weeks-out lit review of the hoopla, in hopes of finding a pattern in the noise. For those new to the #PLOSFail drama, the short version is this: PLOS enacted a mandatory data sharing policy. Researchers flipped out. See the sources at the end of this post for more background.

Arguments made against data sharing

1) My data is my lifeblood. I won’t just give it away.

Terry McGlynn, a biologist writing at Small Pond Science, argues that “Regardless of the trajectory of open science, the fact remains that, at the moment, we are conducting research in a culture of data ownership.” Putting the ownership issue aside for now, let’s focus on the crux of McGlynn’s argument: he contends that data sharing turns a private resource (data) into a community resource. This is especially burdensome for small labs (like his) since each data point takes relatively more effort to produce. If this resource is available to anyone, the benefits to the former owner are greatly reduced since they are now shared with the broader community.

Although these are valid concerns, they are not in the best interest of science. I argue that what we are really talking about here is the incentive problem (see more in the section below). That is, publications are valued in performance evaluation of academics, while data are not. Everyone can agree that data is indispensable to scientific advancement, so why hasn’t the incentive structure caught up yet? If McGlynn were able to offset the loss of benefits caused by data sharing by getting mad props for making his data available and useful, this issue would be less problematic. Jeff Leek, a biostatistician blogging at Simply Statistics, makes a great point with regard to this: to paraphrase him, the culture of credit hasn’t caught up with the culture of science. There is no appropriate form of credit for data generators – it’s either citation (seems chintzy) or authorship (not always appropriate). Solution: improve incentives for data sharing. Find a way to appropriately credit data producers.

2) My datasets are special, unique snowflakes. You can’t understand/use them.

Let’s examine what McGlynn says about this with regard to researchers re-using his data: “…anybody working on these questions wouldn’t want the raw data anyway, and there’s no way these particular data would be useful in anybody’s meta analysis. It’d be a huge waste of my time.”

Rather than try to come up with a new, witty way to answer this argument, I’ll shamelessly quote from the MacManes Lab blog post, Corner cases and the PLOS data policy:

 There are other objections – one type is the ‘my raw data are so damn special that nobody can over make sense of them’, while another is ‘I use special software and stuff, so they are probably not useful to anybody else’. I call BS on both of these arguments. Maybe you have the world’s most complicated data, but why not release them and not worry about whether or not people find them useful – that is not your concern (though it should be).

I couldn’t have said it better. The snowflake refrain from researchers is not new. I’ve heard it time and again when talking to them about data archiving. There is certainly truth to this argument: most (all?) datasets are unique. Why else would we be collecting data? This doesn’t make them useless to others, especially if we are sharing data to promote reproducibility of reported results.

DrugMonkey, an anonymous blogger and biomedical researcher, took this “my data are unique” argument to paranoia level. In their post, PLoS is letting the inmates run the asylum and it will kill them, s/he contends that researchers will somehow be forced to use all the same methods to facilitate data reuse. “…diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that. So we’ll have PLoS [sic] inserting itself in the role of how experiments are to be conducted and interpreted!”

I imagine DrugMonkey pictures future scientists in grey overalls, trudging to a factory to do “science”. This is just ridiculous. The idiosyncrasies of how individual researchers handle their data will always be part of the challenge of reproducibility and data curation. But I have never (ever) heard of anyone suggesting that all researchers in a given field should be doing science in the exact same way. There are certainly best practices for handling datasets. If everyone followed these to the best of their ability, we would have an easier time reusing data. But no one is punching a time card at the factory.

3) Data sharing is hard | time-consuming | new-fangled.

This should probably be #1 in the list of arguments from researchers. Even for those who cite other reasons for not sharing their data, this is probably at the root of the hoarding. Full disclosure – only a small portion of the datasets I have generated as a researcher are available to the public. The only explanation is that it’s time-consuming and I have other things on my plate. So I hear you, researchers. That said, the time has come to start sharing.

DrugMonkey says that the PLOS data policy requires much additional data curation which will take time. “The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time…” McGlynn states this point succinctly: “Why am I sour on required data archiving? Well, for starters, it is more work for me… To get these numbers into a downloadable and understandable condition would be, frankly, an annoying pain in the ass.”

Fair enough. But I argue here (along with others) that making data available is not an optional side note of research: it is research. In the comments of David Crotty’s post at The Scholarly Kitchen, “PLOS’ bold data policy”, there was a comment that I loved. The commenter, Mike Taylor, said this:

 …data curation is research. I’d argue that a researcher who doesn’t make available the data necessary to reproduce his conclusions isn’t getting his job done. Complaining about having to spend time on preparing the data for others to use is like complaining about having to spend time writing the paper, or indeed running experiments.

When I read that comment, I might have fist pumped a little. Of course, we still have that pesky incentive issue to work out… As Crotty puts it, “Perhaps the biggest practical problem with [data sharing] is that it puts an additional time and effort burden on already time-short, over-burdened researchers. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.” Sigh.

What about that “new-fangled” bit? Well, researchers often complain that data management and curation require skills that are not taught. I 100% agree with this statement – see my paper on the lack of data management education for even undergrads. But as my ex-cop dad likes to say, “ignorance of the law is not a defense”. In continuation of my shameless quoting from others, here’s what Ted Hart (Staff Scientist at NEON) has to say in his post, “Just Get Over Yourself and Share Your Data”:

Sharing is hard. but not an intractable problem… Is the alternative is that everyone just does everything in secret with myriad idiosyncrasies ferociously milking least publishable units from a data set? That just seems like a recipe for science moving slowly and in the dark. …I think we just need to own up to the fact being a scientist these days requires new skills, and it always have. You didn’t have to know how to do PCR prior to 1983, but now you do. In the 21st century to do science better, we need more than spreadsheets with a few rows, we need to implement best practices for data management.

More fist pumping! No, things won’t change overnight. Leek at Simply Statistics rightly stated that the transition to open data will be rough for two reasons: (1) there is no education on data handling, and (2) there is a disconnect between the incentives for individual researchers and the actions that will benefit science as a whole. Sigh. Back to that incentive issue again.

Highlights & Takeaways

At risk of making this blog post way too long, I want to showcase a few highlights and takeaways from my deep dive into the #PLOSfail blogging world.

1) The Incentives Problem

We have a big incentives problem, which was probably obvious from my repeated mentions of it above. What’s good for researchers’ careers is not conducive to data sharing. If we expect behavior to change, we need to work on giving appropriate credit where it’s due.

Biologist Björn Brembs puts it well in his post, What is the Difference Between Text, Data, and Code? “…it is unrealistic to expect tenure committees and grant evaluators to assess software and data contributions before anybody even is contributing and sharing data or code.” Yes, there is a bit of a chicken-and-egg situation. We need movement on both sides to get somewhere. Share the data, and they will start to recognize it.

2) Empiricism Versus Theory

There is a second plot line to the data sharing rants: empiricists versus theoreticians. See ecologist Timothée Poisot’s blog, “Of the value of datasets and methods in open science” for a more extensive review of this issue as it relates to data sharing. Of course, this tension is not a new debate in science. But terms like “data vultures” get thrown about, and feelings get hurt. Due to the nature of their work, most theoreticians’ “data” is equations, methods, and code that are shared via publication. Meanwhile, empiricists generate data and can hoard it until they see fit to share it, only offering a glimpse of the entire suite of their research outputs. To paraphrase Hart again: science is equal parts data and analysis/methods. We need both, so let’s stop fighting and encourage open science all around.

3) Data Ownership Issues

There are lots of potential data owners: the funders who paid for the work, the institution where the research was performed, the researcher who collected the data, the principal investigator of the lab where the researcher works, etc. The complications around data ownership make this a tricky subject to work out. Zen Faulkes, a neurobiologist at University of Texas, blogged about who owns data, in particular, his data. He did a little research and found what many (most?) researchers at universities might find: “I do not own research data I generate. Neither do the funding agencies. The University of Texas system Board of Regents own research data I generate.” Faulkes goes on to state that the regents probably don’t care what he does with his data unless/until they can make money off of it… very true. To make things more complicated, Crotty over at Scholarly Kitchen reminded us that “under US law (the Bayh-Dole Act), the intellectual property (IP) generated as the result of federal research funds belongs to the researcher and their institution.” What does that even mean?!

To me, the issue is not about who owns the data outright. Instead, it’s about my role as an open science “waccaloon” who is interested in what’s best for the scientific process. To that extent, I am going to borrow from Hart again. Hart makes a comparison between having data and having a pet: in Boulder CO, there are no pet “owners” – only pet “guardians”. We can think of our data in this same way: we don’t own it; we simply care for it, love it, and are intellectually (and sometimes emotionally!) invested in it.

4) PLOS is Part of a Much Bigger Movement

Open science mandates are already here. The OSTP memo released last year is a huge leap forward in this direction – it requires that federally funded research outputs (including data) be made available to the public. Crotty draws a link between OSTP and PLOS policies in his blog: “Once this policy goes into effect, PLOS’ requirements would seem to be an afterthought for authors funded in this manner. The problem is that the OSTP policy seems nowhere near being implemented.”

That last part is most definitely true. One way to work on implementing this policy? Get the journals involved. The current incentive structure is not well-suited for ensuring compliance with OSTP, but journals have a role as gatekeepers to the traditional incentives. Crotty states it this way:

PLOS has never been a risk averse organization, and this policy would seem to fit well with their ethos of championing access and openness as keys to scientific progress. Even if one suspects this policy is premature and too blunt an instrument, one still has to respect PLOS for remaining true to their stated goals.

So I say kudos to PLOS!

In Conclusion…

I’ll end with a quote from the MacManes Lab blog post:

How about this, make an honest effort to make the data accessible and useful to others, and chances are you’re probably good to go.

Final fist pump.

Sources

  1. Timothée Poisot, ecologist. Of the value of datasets and methods in open science.
  2. Terry McGlynn, biologist. I own my data until I don’t. Blog at Small Pond Science. @hormiga
  3. David Crotty, publisher & former researcher. PLOS’ bold data policy. Blog at The Scholarly Kitchen. @scholarlykitchn
  4. Edmund Hart, Staff Scientist at NEON. Just Get Over Yourself and Share Your Data. @DistribEcology
  5. MacManes Lab, genomics. Corner cases and the PLOS data policy.
  6. DrugMonkey, biomedical research. PLoS is letting the inmates run the asylum and it will kill them. @DrugMonkey
  7. Zen Faulkes, neurobiologist. Who owns data. Blog at NeuroDojo. @DoctorZen
  8. Björn Brembs, biologist. What is the Difference Between Text, Data, and Code? @brembs
  9. Jeff Leek, biostatistician. PLoS One, I have an idea for what to do with all your profits: buy hard drives. Blog at Simply Statistics. @leekgroup

Twitter feed for #PLOSfail

From PLOS

6 thoughts on “Lit Review: #PLOSFail and Data Sharing Drama”

  1. sal says:

    I think you will find, that one reason some folks are hostile to this whole line of argumentation is that they find it disingenuous. The argument ‘but it’s good for science!’ is made by those who stand to benefit most from ‘open data’ policies, while belittling the concerns of the data generators for being selfish and careerist. Also, you may want to consider that your understanding of what comprises ‘data’ may be strongly influenced by your particular training and field (marine ecology?). Data archiving that is useful and practical in your field may not be in another field.
    PLoS still has also not clarified to what end the ‘open data’ policy is implemented: re-use? fraud checking? just trying to drive people bonkers? I still don’t know.
    Finally, most of those who object to the ‘publish data up front’ model of data sharing are not at all opposed to sharing per se (‘hoarding’, not a loaded word at all) but are opposed to the work involved to deposit something that 999 times out of 1000 no one will ever want to use. Nor should they, because other labs should be replicating and extending the results under varying conditions, not rehashing data that’s already been analyzed. This may be one difference between ecology data sets (that you don’t want to have to recollect, or perhaps simply can’t possibly replicate) and typical biomedical wet lab work.

    • Thanks for your comments, Sal!

      Re. the “good for science” argument being disingenuous… Open data is good for lots of reasons outside of the benefit of individual researchers. See this great post:
      Top 10 reasons to not share your data (and why you should anyway) [http://proteinsandwavefunctions.blogspot.com/2014/03/top-10-reasons-to-not-share-your-data.html]. I will say, however, that open data is idealistic given the current incentive structure. Yes, data generators are “selfish and careerist”, just like the rest of us. They are working within the current system, which does not (unfortunately) give them any reason to act otherwise.

      Re. my understanding of data – It’s true that my research has been marine ecology and math biology, but for the last three years I’ve been working on promoting data sharing and open science at the University of California campuses and beyond, which has given me opportunities to think about all kinds of data in many, many different fields (even digital humanities!). One of the points that I wanted to make in my post (but that ended up on the cutting room floor due to length) was this: there are experts in the field of information science and technology who are thinking about these problems, and who are very good at their jobs. They are working on some pretty amazing stuff related to the semantic web, text mining, workflow captures, etc. that will make data useful in ways we can’t even think about right now. Many of these new awesome projects don’t even require a researcher to change how they work in order to make their data more usable. The potential for making data usable across fields, disciplines, and types is growing every day. No one can predict how a dataset might be useful in the future, which is one of the reasons that science is fun.

      Re. PLOS clarifying the reasoning, I think their policy is more of the idealistic nature I talk about above. They are about promoting open science. The policy is in that vein. Sharing the data helps with all of the things you mention (even the driving people bonkers bit), but it is also just a shift in the culture.

  2. mbode says:

    Thanks for a thoughtful round-up, and for reminding me of the awesomeness of early 70s Bowie.

  3. michaelbode says:

    Clearly one of the big obstacles to open data is the lack of recognition that flows to the dataset creator when their information is used by someone else for analysis (or at least, the perception that citations are insufficient recognition, since most critiques seem to suggest that a co-authorship would be satisfactory), both in the literature and on grant/tenure/job panels.

    I wonder if the “lack of recognition” problem is partly a lack of imagination. There is a lot of talk in the #plosfail reaction about forcing institutions to recognise the value of data creation, as though the re-balancing of the reward system has to be top-down. However, I have a colleague who developed a powerful and widely-used ecological analysis toolbox. I don’t think that the broad use of this toolbox is fairly reflected in his citation statistics, nor do I think that a citation is really adequate summation of the impact of his work. However, he has no trouble convincing various panels (who presumably grow bored reading the same tables of citations, applicant after applicant) that he’s the best applicant for a job, grant, etc. I realise that this is anecdata, but if I were reading an application, I would certainly pay attention to an altmetric more than I would another random publication. I guess it could be hard to benchmark them, but at the same time that could easily work in one’s favour (“25 page views per day? That does sound impressive!”)

  4. Matt Jones says:

    Carly — nice article. I appreciate all the time you put into these posts – they are informative and balanced. Regarding the data ownership meme that you raise, the concept of “ownership” arises out of property law, which in the US is often conflated with copyright and patent protection under the term “intellectual property” (see http://www.theguardian.com/technology/2008/feb/21/intellectual.property). However, under US law, facts are not subject to copyright protection per se, including data such as measurements. This is well established, even though some University offices might like to espouse a different view, such as the one you cite. It is possible to copyright a compilation of data if that compilation meets the copyright statute requirements of originality and creativity. Not all compilations do. But the data underlying a compilation are still unprotected, and can be copied and used by anyone — they are not owned. Bitlaw (http://www.bitlaw.com/copyright/database.html#data), the University of Minnesota (https://www.lib.umn.edu/datamanagement/copyright), and Dryad (http://blog.datadryad.org/2011/10/05/why-does-dryad-use-cc0/) all have good treatments of this. Rather than speaking of ownership, I think it is better to frame this discussion in terms of scientific ethics (which certainly apply) rather than ownership per se.

  5. […] amount of work and dedication it needed to create them coincided with a cascade of blog posts on data sharing, triggered by a change of PLOS’s data sharing policy (required to deposit raw data). […]
