Tag Archives: sharing

Dispatches from PIDapalooza

Last month, California Digital Library, ORCID, Crossref, and DataCite brought together the brightest minds in scholarly infrastructure to do the impossible: make a conference on persistent identifiers fun!


Usually, discussions about persistent identifiers (PIDs) and networked research are dry and hard to get through, or we find ourselves discussing the basics and never getting to the meat.

We designed PIDapalooza to attract kindred spirits who are passionate about improving interoperability and the overall quality of our scholarly infrastructure. We knew if we built it, they would come!

The results were fantastic and there was a great showing from the University of California community:

All PIDapalooza presentations are being archived on Figshare: https://pidapalooza.figshare.com

Take a look and make sure you are following @pidapalooza for word on future PID fun!


Institutional Repositories: Part 1

If you aren’t a member of the library and archiving world, you probably aren’t aware of the phrase institutional repository (IR for short). I certainly wasn’t aware of IRs prior to joining the CDL, and I’m guessing most researchers are similarly ignorant. In the next two blog posts, I plan to first explain IRs, then lay out the case for their importance – nay, necessity – as part of the academic ecosphere. I should mention up front that although IRs were originally conceived for archiving researchers’ traditional publications, I am speaking about them here as potential homes for all scholarship, including data.

Academic libraries have a mission to archive scholarly work, including theses. These are at The Hive in Worcester, England. From Flickr by israelcsus.


If you read this blog, I’m sure you know that there is increased awareness about the importance of open science, open access to publications, data sharing, and reproducibility. Most of these concepts were easily accomplished in the olden days of pen and paper: you simply took great notes in your notebook, and shared that notebook as necessary with colleagues (this assumes, of course, geographic proximity and/or excellent mail systems). These days, that landscape has changed dramatically due to the increasingly computationally complex nature of research. Digital inputs and outputs of research might include software, spreadsheets, databases, images, websites, text-based corpora, and more. But these “digital assets”, as the archival world might call them, are more difficult to store than a lab notebook. What does a virtual filing cabinet or file storage box look like that can house all of these different bits? In my opinion, it looks like an IR.

So what’s an IR?

An IR is a data repository run by an institution. Many of the large research universities have IRs. To name a few, Harvard has DASH, the University of California system has eScholarship and Merritt, Purdue has PURR, and MIT has DSpace. Many of these systems have been set up in the last 10 years or so to serve as archives for publications. For a great overview and history of IRs, check out this eHow article (which is surprisingly better than the relevant Wikipedia article).

So why haven’t more people heard of IRs? Mostly this is because there have never been any mandates or requirements for researchers to deposit their works in IRs. Some libraries take on this task – for example, I found out a few years ago that the MBL-WHOI Library graciously stored open access copies of all of my publications for me in their IR. But more and more, these “works” include digital assets that are not publications, and collecting all of the digital scholarship produced by an institution is a near-insurmountable task for a small group of librarians; there has to be either buy-in from researchers or mandates from the top.

The Case for IRs

I’m not the first one to recognize the importance of IRs. Back in 2002 the Scholarly Publishing and Academic Resources Coalition (SPARC) put out a position paper titled “The Case for Institutional Repositories” (see their website for more information). They defined an IR as having four major qualities:

  1. Institutionally defined,
  2. Scholarly,
  3. Cumulative and perpetual, and
  4. Open and interoperable.

Taking the point of view of the academic institution (rather than the researcher), the paper cited two roles that institutional repositories play for academic institutions:

  1. Reform scholarly communication – Reassert control over scholarship, reduce monopoly power of journals, and bring relevance to libraries
  2. Promote the university – Serve as an indicator of the university’s quality; showcase the university’s research; demonstrate public value and increase status.

In general, IRs are run by information professionals (e.g., librarians), who are experts at documenting, archiving, preserving, and generally curating information. All of those digital assets that we produce as researchers fit the bill perfectly.

As a researcher, you might not be convinced of the importance of IRs given the arguments above. Part of the indifference researchers may feel about IRs might have something to do with the existence of disciplinary repositories.

Disciplinary Repositories

There are many, many, many repositories out there for storing digital assets. To get a sense, check out re3data.org or databib.org and start browsing. Both of these websites are searchable databases for research data repositories. If you are a researcher, you probably know of at least one or two repositories for datasets in your field. For example, geneticists have GenBank, evolutionary biologists have TreeBase, ecologists have the KNB, and marine biologists have BCO-DMO. These are all examples of disciplinary repositories (DRs) for data. As any researcher who’s aware of these sites knows, you can both deposit and download data from these repositories, which makes them indispensable resources for their respective fields.

So where should a researcher put data?

The short answer is both an IR and a DR. I’ll expand on this and make the case for IRs to researchers in the next blog post.


Closed Data… Excuses, Excuses

If you are a fan of data sharing, open data, open science, and generally openness in research, you’ve heard them all: excuses for keeping data out of the public domain. If you are NOT a fan of openness, you should be. For both groups (the fans and the haters), I’ve decided to construct a “Frankenstein monster” blog post composed of other people’s suggestions for how to deal with the excuses.

Yes, I know. Frankenstein was the doctor, not the monster. From Flickr by Chop Shop Garage.


I have drawn some comebacks from Christopher Gutteridge, University of Southampton, and Alexander Dutton, University of Oxford. They created an open Google Doc of excuses for closing off data and appropriate responses, and generously provided access to the document under a CC-BY license. I also reference the UK Data Archive’s list of barriers and solutions to data sharing, available via the Digital Curation Centre’s PDF, “Research Data Management for Librarians” (pages 14-15).

People will contact me to ask about stuff

Christopher and Alex (C&A) say: “This is usually an objection of people who feel overworked and that [data sharing] isn’t part of their job…” I would add to this that science is all about learning from each other – if a researcher is opposed to the idea of discussing their datasets, collaborating with others, and generally being a good science citizen, then they should be outed by their community as a poor participant.

People will misinterpret the data

C&A suggest this: “Document how it should be interpreted. Be prepared to help and correct such people; those that misinterpret it by accident will be grateful for the help.” From the UK Data Archive: “Producing good documentation and providing contextual information for your research project should enable other researchers to correctly use and understand your data.”

It’s worth mentioning, however, a second point C&A make: “Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as one can quickly point to the real data on the web to refute the wrong interpretation.”

My data is not very interesting

C&A: “Let others judge how interesting or useful it is — even niche datasets have people that care about them.” I’d also add that it’s impossible to decide whether your dataset has value to future research. Consider the many datasets collected before “climate change” was a research topic which have now become invaluable to documenting and understanding the phenomenon. From the UK Data Archive: “Who would have thought that amateur gardener’s diaries would one day provide essential data for climate change research?”

I might want to use it in a research paper

Anyone who’s discussed data sharing with a researcher is familiar with this excuse. The operative word here is might. How many papers have we all considered writing, only to have them shift to the back burner due to other obligations? That said, this is a real concern.

C&A suggest the embargo route: “One option is to have an automatic or optional embargo; require people to archive their data at the time of creation but it becomes public after X months. You could even give the option to renew the embargo so only things that are no longer cared about become published, but nothing is lost and eventually everything can become open.” Researchers like to have a say in the use of their datasets, but I would recommend that any restrictions default to sharing. That is, after X months the data are automatically made open by the repository.
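To make the "default to sharing" idea concrete, here is a minimal sketch of how a repository might compute an automatic release date. Everything in it is an assumption for illustration: the function names, the `max_renewals` cap on renewals, and the choice to count the embargo in calendar months are hypothetical and not taken from C&A's document or any particular repository.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month,
    so January 31 plus one month gives the end of February.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(deposited: date, embargo_months: int,
                 renewals: int = 0, max_renewals: int = 1) -> date:
    """Date on which the data become public (hypothetical policy).

    Each granted renewal adds one more embargo period, but renewals
    are capped so that the data always open up eventually.
    """
    granted = min(renewals, max_renewals)
    return add_months(deposited, embargo_months * (1 + granted))


def is_public(deposited: date, embargo_months: int, today: date,
              renewals: int = 0, max_renewals: int = 1) -> bool:
    """True once the (possibly renewed) embargo has run out."""
    return today >= release_date(deposited, embargo_months,
                                 renewals, max_renewals)
```

The point of the cap is that the researcher never has to take action for the data to become open; doing nothing results in sharing.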

I would also add that, as the original collector of the data, you are at a huge advantage compared to others that might want to use your dataset. You have knowledge about your system, the conditions during collection, the nuances of your methods, et cetera that could never be fully described in the best metadata.

I’m not sure I own the data

No doubt, there are a lot of stakeholders involved in data collection: the collector, the PI (if different), the funder, the institution, the publisher, … C&A have the following suggestions:

  • Sometimes it’s as easy as just finding out who does own the data
  • Sometimes nobody knows who owns the data. This often seems to occur when someone has moved into a post and isn’t aware that they are now the data owner.
  • Going up the management chain can help. If you can find someone who clearly has management over the area the dataset belongs to they can either assign an owner or give permission.
  • Get someone very senior to appoint someone who can make decisions about apparently “orphaned” data.

My data is too complicated

C&A: “Don’t be too smug. If it turns out it’s not that complicated, it could harm your professional [standing].” I would add that if it’s too complicated to share, then it’s too complicated to reproduce, which means it’s arguably not real scientific progress. This can be solved by more documentation.

My data is embarrassingly bad

C&A: “Many eyes will help you improve your data (e.g. spot inaccuracies)… people will accept your data for what it is.” I agree. All researchers have been on the back end of making the sausage. We know it’s not pretty most of the time, and we can accept that. Plus, it will help you strive to be better at managing and organizing data during your next collection phase.

It’s not a priority and I’m busy

Good news! Funders are making it your priority! New sharing mandates in the OSTP memorandum state that any research conducted with federal funds must be accessible. You can expect these sharing mandates to drift down to you, the researcher, in the very near future (6-12 months).


The Who’s Who of Publishing Research

This week’s blog post is a bit more of a sociology-of-science topic… Perhaps only marginally related to the usual content surrounding data, but still worth consideration. I recently heard a talk by Laura Czerniewicz, from the University of Cape Town’s Centre for Educational Technology. She was among the speakers during the Context session at Beyond the PDF2, and she asked the following questions about research and science:

Whose interests are being served? Who participates? Who is enabled? Who is constrained?

She brought up points I had never really considered, related to the distribution of wealth and how that affects scientific outputs. First, she examined who actually produces the bulk of knowledge. Based on an editorial in Science in 2008, she reported that US academics produce about 30% of the articles published in international peer-reviewed journals, while developing countries (China, India, Brazil) produce another 20%. Sub-Saharan Africa? A mere 1%.

She then explored what factors are shaping knowledge production and dissemination. She cited infrastructure (i.e., high speed internet, electricity, water, etc.), funding, culture, and reward systems. For example, South Africa produces more articles than other countries on the continent, perhaps because the government gives universities $13,000 for every article published in a “reputable journal”, and 21 of 23 universities surveyed give a cut of that directly to the authors.

Next, she asked “Who’s doing the publishing? What research are they publishing?” She put up some convincing graphics showing the number of articles published by authors from various countries, in which the US and Western Europe led the pack by a factor of six. I couldn’t hunt down the original publication, so take this rough statistic with a grain of salt. What about book publishing? The Atlantic Wire published a great chart back in October (based on an original article in Digital Book World) that scaled each country’s size based on the value of its domestic publishing market:

Scaled map of the world based on book publishing. From Digital Book World via Atlantic Wire.


When asking whose interests are served by international journals, she focused on a commentary by R. Horton, titled “Medical journals: Evidence of bias against the diseases of poverty” (The Lancet 361, 1 March 2003 – behind paywall). Granted, it’s a bit out of date, but it still has interesting points to consider. Horton reported that the editorial boards of the five top medical journals have little or no representation from countries with low Human Development Indices. Horton then postulated that this might be the cause of the so-called 10/90 gap – where 90% of research funding is allocated to diseases that affect only 10% of the world’s population. Although Horton does not go so far as to blame the commercial nature of publishing, he points out that editorial boards for journals must consider their readership and cater to those who can afford subscription fees.

I wonder how this commentary holds up, 10 years later. I would like to think that we’ve made a lot of progress towards better representation of research affecting humans that live in poverty. I’m not sure, however, we’ve done better with access to published research. I’ll leave you with something Laura said during her talk (paraphrased): “If half of the world is left out of knowledge exchange and dissemination, science will suffer.”

Check out Laura Czerniewicz’s Blog for more on this. She’s also got a Twitter feed.


Collecting Journal Data Policies: JoRD

My last two posts have related to IDCC 2013; that makes this post three in a row. Apparently IDCC is a gift that just keeps giving (albeit a rather short post in this case).

Today the topic is the JoRD project, funded by JISC. JoRD stands for Journal Research Data; the JoRD Policy Bank is basically a project to collect and summarize data policies for a range of academic journals.

From the JISC project website, this project aims to

provide researchers, managers of research data and other stakeholders with an easy source of reference to understand and comply with Research Data policies.

How to go about this? The project’s objectives (cribbed and edited from the project site):

  1. Identify and consult with stakeholders; develop stakeholder requirements
  2. Investigate the current state of data sharing policies within journals
  3. Deliver recommendations on a central service to summarize journal research data policies and provide a reference for guidance and information on journal policies.

I’m most interested in #2: what are journals saying about data sharing? To tackle this, project members are collecting information about data sharing policies on the top 100 and bottom 100 Science Journals, and the top 100 and bottom 100 Social Science Journals. Based on the stated journal policies about data sharing, they fill out an extensive spreadsheet. I’m anxious to see the final outcome of this data collection – my hunch is that most journals “encourage” or “recommend” data sharing, but do not mandate it.

I think of the JoRD Policy Bank as having two major benefits:

Educating Researchers. As you may be aware, many researchers are a bit slow to jump on the data sharing bandwagon. This is the case despite the fact that all signs point to future requirements for sharing at the time of publication (see my post about it, Thanks in Advance for Sharing Your Data). Once researchers come to terms with the fact that soon data sharing will not be optional, they will need to know how to comply. Enter JoRD Policy Bank!

Encouraging Publishers. The focus on stakeholder needs and requirements suggests that the outcomes of this project will provide guidance to publishers about how to proceed in their requirements surrounding data sharing. There might be a bit of peer pressure, as well: Journals don’t want to seem behind the times when it comes to data sharing, lest their credibility be threatened.

In general, the JoRD website is chock full of information about data sharing policies, open data, and data citation. Check it out!

C'mon researchers! Jump on the data sharing band wagon! From purlem.com



Thoughts on Data Publication

If you read last week’s post on the IDCC meeting in Amsterdam, you may know that today’s post was inspired by a post-conference workshop on Data Publication, sponsored by the PREPARDE group. The workshop was “Data publishing, peer review and repository accreditation: everyone a winner?” (to access the workshop agenda, goals, and slides, go to the conference workshop website and scroll down to Workshop 6).

Basically the workshop focused on all things data publication, and incited lively discussion among those in attendance. Check out the workshop’s Twitter backchannel via this Storify by Sarah Callaghan of STFC.  My previous blog post about data publication sums it up like this:

The concept of data publication is rather simple in theory: rather than relying on journal articles alone for scholarly communication, let’s publish data sets as “first class citizens”.  Data sets have inherent value that makes them standalone scholarly objects— they are more likely to be discovered by researchers in other domains and working on other questions if they are not associated with a specific journal and all of the baggage that entails.

Stealing shamelessly from Sarah’s presentation, I’m providing a brief overview of issues surrounding data publication for those not well-versed:

First, the benefits of data publication:

  • Allows credit to data producers and curators (via data citation and emerging altmetrics)
  • Encourages reuse of datasets and discourages duplication of effort
  • Encourages proper curation and management of data (you don’t want to share messy data, right?)
  • Ensures completeness of the scientific record, as well as transparency and reproducibility of research (fundamental tenets of the scientific method!)
  • Improves discoverability of datasets (they will never be discovered on that old hard drive in your desk drawer)

We had an internal meeting here at CDL yesterday about data publication. After running through this list of benefits for those in attendance, one of my colleagues asked the question: “Does listing these benefits work? Do researchers want to publish their data?” I didn’t hesitate to answer “No”.

Why not? The biggest reason is a lack of time. Preparing data for sharing and publication is laborious, and overstretched researchers aren’t motivated by these benefits given the current incentive structures in research (papers, papers, papers. And citation of those papers.). Of course, I think this is changing in the very near future. Check out my post on data sharing mandates in the works. So let’s go with the assumption that researchers want to publish. How do they go about this?

Methods for “publishing” data:

  • A personal or lab webpage. This is a common choice for researchers who wish to share data since they can maintain control of the datasets. However, there are issues with the stability, persistence, and discoverability of data siloed on individual websites. Plus, website maintenance often falls to the bottom of a researcher’s to-do list.
  • A disciplinary repository. This is a common solution for only a select few data types (e.g., genetic data). Most disciplines are still awaiting a culture change that will motivate researchers to share their data in this way.
  • An institutional repository. Of course, researchers have to know that this is an option (most don’t), and must then properly prepare their data for deposit.
  • Supplementary materials.  In this case, the data accompany a primary journal article as supporting information. I recently shared data this way, but recognized that the data should also be placed in a curated repository.  There are a few reasons for this apparent duplication:
    • Supplemental materials are sometimes not available many years after publication due to broken links.
    • Journals are not particularly excited about archiving lots of supplementary data, especially if it’s a large volume of data. This is not their area of expertise, after all.
  • Data article. This is a new-ish option: basically, you publish your data in a proper data journal (see this semi-complete list of data journals on the PREPARDE blog).

Wondering what a “data article” is? Let’s look to Sarah again:

A data article describes a dataset, giving details of its collection, processing, software, file formats, et cetera, without the requirement of  novel analyses or ground-breaking conclusions.

That is, it’s a standalone product of research that can be cited as such. There is much debate surrounding such data articles. Among the issues are:

  • Is it really “publication”? How is this different from a landing page for the dataset that’s stored in a repository?
  • Traditional academic use of “publication” implies peer review. How do you review datasets?
  • How should publication differ depending on the discipline?

There are no easy answers to these questions, but I love hearing the debate. I’m optimistic that the forthcoming person we hire as a data publication postdoc will have some great ideas to contribute. Stay tuned!

Amsterdam! CC-BY license, C. Strasser



NSF now allows data in biosketch accomplishments

Hip hip hooray for data! Contributed to Calisphere by Sourisseau Academy for State and Local History.

Back in October, the National Science Foundation announced changes to its Grant Proposal Guide (Full GPG for January 2013 here). I blogged about this back when the announcement was made, but now that the changes are official, I figure it warrants another mention.

As of January 2013, you can now list products in your biographical sketches, not just publications. This is big (and very good) news for data advocates like myself.

The change is that the biosketch for senior personnel should contain a list of 5 products closely related to the project and 5 other significant products that may or may not be related to the project. But what counts as a product? “products are…including but not limited to publications, data sets, software, patents, and copyrights.”

To make it count, however, it needs to be both citable and accessible. How to do this?

  1. Archive your data in a repository (find help picking a repo here)
  2. Obtain a unique, persistent identifier for your dataset (e.g., a DOI or ARK)
  3. Start citing your product!

For the librarians, data nerds, and information specialists in the group, the UC3 has put together a flyer you can use to promote listing data as a product. It’s available as a PDF (click on the image to the right to download). For the original PPT that you can customize for your institution and/or repository, send me an email.


Direct from the digital mouths of NSF:

Summary of changes: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_sigchanges.jsp

Chapter II.C.2.f(i)(c), Biographical Sketch(es), has been revised to rename the “Publications” section to “Products” and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

New wording: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_2.jsp

(c) Products

A list of: (i) up to five products most closely related to the proposed project; and (ii) up to five other significant products, whether or not related to the proposed project. Acceptable products must be citable and accessible including but not limited to publications, data sets, software, patents, and copyrights. Unacceptable products are unpublished documents not yet submitted for publication, invited lectures, and additional lists of products. Only the list of 10 will be used in the review of the proposal.

Each product must include full citation information including (where applicable and practicable) names of all authors, date of publication or release, title, title of enclosing work such as journal or book, volume, issue, pages, website and Uniform Resource Locator (URL) or other Persistent Identifier.
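As a rough illustration of what "citable and accessible" might look like in practice, the sketch below assembles a dataset citation from the fields NSF lists and refuses to build one without a persistent identifier. The `Product` class, its field names, the citation layout, and the sample values (including the identifier) are all hypothetical; NSF does not prescribe a specific citation format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Product:
    """A research product for a biosketch (hypothetical field names)."""
    authors: List[str]
    year: int
    title: str
    publisher: str    # repository, journal, or other enclosing work
    identifier: str   # URL or other persistent identifier (e.g., DOI or ARK)


def format_citation(product: Product) -> str:
    """Build a one-line citation; the layout is an assumption, not an NSF rule."""
    if not product.identifier.strip():
        # Without an identifier the product is neither citable nor findable.
        raise ValueError("a product needs a URL or persistent identifier")
    authors = ", ".join(product.authors)
    return (f"{authors} ({product.year}). {product.title}. "
            f"{product.publisher}. {product.identifier}")


# Made-up example values; the identifier is a placeholder, not a real DOI.
example = Product(
    authors=["Researcher, A.", "Colleague, B."],
    year=2013,
    title="Example coastal survey dataset",
    publisher="Example Repository",
    identifier="doi:10.5072/example",
)
print(format_citation(example))
```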


Ocean Health and Data Sharing

How do we go about measuring the health of complex ecosystems, especially with humans – the ultimate complicating factor? We use good data and smart science, that’s how. Of course, this “good data” can’t be collected by any one scientist. Instead, we should rely on years of data collection by scientists who are experts at what they do.

If you are a science news junkie like me, you might have noticed a lot of recent buzz about the Ocean Health Index.  This is an incredible project with one simple goal: measure the health of the oceans. Of course, the word simple is hyperbole at best in this case.  Oceans (especially coastal oceans with nearby humans) are about as complicated as nature can get. There are:

  • biological factors like fishes, algae, invertebrates, and marine mammals
  • geological components like sand, mud, rocks, and cliffs
  • physical factors like tides, storms, and upwelling
  • chemical factors like pollutants, water-rock interactions, and freshwater runoff

Layer on top of this system the biggest impact of all: humans.  We have cruise ships, sand castles, scuba divers, marinas, tourists, surfers, cliff divers, and fishermen, to name a few. How do you manage to take all of this into account to measure the health of oceans?

Basically, the OHI factors in 10 different “public goals”, including tourism, clean waters, biodiversity, food provision, and coastal economies. They have equations for calculating a number from 0 to 100 for each goal, and taken together, these numbers indicate the health of a particular region of the ocean. The equations take into account both the current status and likely future status of each goal, making the OHI robust for prediction.
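To give a feel for how per-goal numbers could roll up into a single score, here is a deliberately simplified sketch. It assumes each goal score is the plain average of its current status and likely future status, and that the regional index is a weighted mean of the goal scores. Both of those formulas, the function names, and the sample numbers are assumptions for illustration; the actual equations are in the OHI paper and its supplemental material.

```python
from typing import Dict, Optional


def goal_score(current_status: float, likely_future_status: float) -> float:
    """Combine current and likely future status into one 0-100 goal score.

    Assumption: a simple average of the two components.
    """
    for value in (current_status, likely_future_status):
        if not 0 <= value <= 100:
            raise ValueError("status values must be between 0 and 100")
    return (current_status + likely_future_status) / 2


def index_score(goal_scores: Dict[str, float],
                weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of goal scores; equal weights unless others are given."""
    if not goal_scores:
        raise ValueError("at least one goal score is required")
    if weights is None:
        weights = {goal: 1.0 for goal in goal_scores}
    total_weight = sum(weights[goal] for goal in goal_scores)
    weighted_sum = sum(weights[goal] * score
                       for goal, score in goal_scores.items())
    return weighted_sum / total_weight


# Made-up numbers for two of the ten goals.
scores = {
    "tourism": goal_score(80, 60),
    "clean waters": goal_score(95, 85),
}
print(index_score(scores))
```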

So what goes into these equations? DATA, of course! Many (many) scientists contributed data to the Ocean Health Index. Check out the main paper on OHI and its supplemental material.  Relevant datasets were compiled to provide parameters for the goal equations.  The more data the better the model (usually), and the OHI folks took that to heart – the list of contributing datasets is daunting.

I don’t have to ask to know that compiling those datasets was no easy feat. Poor documentation (i.e. metadata), bad file formats, icky table organization, and missing information likely plagued the OHI researchers pulling this information together. And here’s the DataUp connection: if only they had all used a tool to create well documented data that follows best practices for data management!

I’m sure Billy Ocean is all about Ocean health. From Flickr by Eva Rinaldi Celebrity and Live Music Photographer

I’m really excited to see this OHI released. The website is pretty amazing, and definitely NOT geared towards the nerdy science types (although we can find that raw data pretty easily if we want it!).  Go play with it, share it with your family and friends, and help raise awareness about the importance of well documented datasets to society’s well being. Share with your colleagues and lab mates to emphasize that their data might be used in unimaginable ways in the future – which means good data management is critical.

More on the OHI:


Your Time is Gonna Come

You know what they say: Timing is everything. Time enters into the data management and stewardship equation at several points and warrants discussion here. Why timeliness? Last week at the University of North Texas Open Access Symposium, there were several great speakers who touched on timeliness of data management, organization, and sharing. It led me to wonder whether there is agreement about the timeliness of data-related activities, so here I’ve posted my opinions about time at a few points in the life cycle of data. Feel free to comment on this post with your own opinions.

1. When should you start thinking about data management? The best answer to this question is as soon as possible. The sooner you plan, the less likely you are to be surprised by issues like metadata standards or funder requirements (see my previous DCXL post about things you will wish you had thought about documenting). The NSF mandate for data management plans is a great motivator for thinking sooner rather than later, but let’s face facts: the DMP requirement is only two pages, and you can create one that might pass muster without really thinking too carefully about your data. I encourage everyone to go well beyond funder requirements and thoughtfully plan out your approach to data stewardship. Spend plenty of time doing this, and return to your plan often during your project to update it.

If you have never watched the Wizard of Oz while listening to Pink Floyd’s Dark Side of the Moon album, you should. Of course, timing is everything: start the album on the third roar of the MGM lion. Image from horrorhomework.com.

2. When should you start archiving your data? By archiving, I do not mean backing up your data (that answer is constantly).  I am referring to the action of putting your data into a repository for long-term (20+ years) storage. This is a more complicated question of timeliness. Issues that should be considered include:

  • Is your data collection ongoing? Continuously updated sensor or instrument data should begin being archived as soon as collection begins.
  • Is your dataset likely to undergo a lot of versions? You might wait to begin archiving until you get close to your final version.
  • Are others likely to want access to your data soon?  Especially colleagues or co-authors? If the answer is yes, begin archiving early so that you are all using the same datasets for analysis.

3. When should you make your data publicly accessible?  My favorite answer to this question is also as soon as possible.  But this might mean different things for different scientists.  For instance, making your data available in near-real time, either on a website or in a repository that supports versioning, allows others to use it, comment on it, and collaborate with you while you are still working on the project.  This approach has its benefits, but also tends to scare off some scientists who are worried about being scooped.  So if you aren’t an open data kind of person, you should make your data publicly available at the time of publication.  Some journals are already requiring this, and more are likely to follow.

There are some who would still balk at making data available at publication: What if I want to publish more papers with this dataset in the future? In that case, have an honest conversation with yourself. What do you mean by “future”? Are you really likely to follow through on those future projects that might use the dataset? If the answer is no, you should make the data available to enhance your chances for collaboration. If the answer is yes, give yourself a little bit of temporal padding, but not too much. Think about enforcing a deadline of two years, at which point you make the data available whether you have finished those dream projects or not. Alternatively, find out if your favorite data repository will enforce your deadline for you – you may be able to provide them with a release date for your data, whether or not they hear from you first.
