Tag Archives: data publication

Announcing The Dash Tool: Data Sharing Made Easy

We are pleased to announce the launch of Dash – a new self-service tool from the UC Curation Center (UC3) and partners that allows researchers to describe, upload, and share their research data. Dash helps researchers perform the following tasks:

  • Prepare data for curation by reviewing best practice guidance for the creation or acquisition of digital research data.
  • Select data for curation through local file browse or drag-and-drop operation.
  • Describe data in terms of the DataCite metadata schema (a sketch of a minimal record follows this list).
  • Identify data with a persistent digital object identifier (DOI) for permanent citation and discovery.
  • Preserve, manage, and share data by uploading to a public Merritt repository collection.
  • Discover and retrieve data through faceted search and browse.
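
For the curious, here is a minimal sketch of the kind of record the DataCite step involves. The field names follow the DataCite schema's mandatory properties; the values, and the use of Python as notation, are our own illustration rather than Dash's internal format (10.5072 is the DataCite test prefix).

```python
# A hedged sketch of the mandatory DataCite properties for a deposit.
# Field names follow the DataCite metadata schema; all values are hypothetical.
dataset_metadata = {
    "identifier": {"identifierType": "DOI", "identifier": "10.5072/example"},  # test prefix
    "creators": [{"creatorName": "Smith, Jane"}],
    "titles": [{"title": "Copepod growth rates under varying temperature"}],
    "publisher": "University of California Curation Center (UC3)",
    "publicationYear": "2014",
}
```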

Who can use Dash?

There are multiple instances of the Dash tool that all have similar functions, look, and feel. We took this approach because our UC campus partners were interested in their Dash tool having local branding. It also allows us to create new Dash instances for projects or partnerships outside of the UC (e.g., DataONE Dash and our Site Descriptors project).

Researchers at UC Merced, UCLA, UC Irvine, UC Berkeley, or UCOP can use their campus-specific Dash instance (e.g., dash.ucmerced.edu).

Other researchers can use DataONE Dash (oneshare.cdlib.org). This instance is available to anyone, free of charge. Use your Google credentials to deposit data.

Note: Data deposited into any Dash instance is visible throughout all of Dash. For example, if you are a UC Merced researcher and use dash.ucmerced.edu to deposit data, your dataset will appear in search results for individuals looking for data via any of the Dash instances, regardless of campus affiliation.

See the Users Guide to get started using Dash.

Stay connected to the Dash project:

Dash Origins

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health (Grant Number UL1 TR000004).

Sound the horns! Dash is live! “Fontana del Nettuno” by Sorin P. from Flickr.


Dash Project Receives Funding!

We are happy to announce that the Alfred P. Sloan Foundation has funded our project to improve the user interface and functionality of our Dash tool! You can read the full grant text at http://escholarship.org/uc/item/2mw6v93b.

More about Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit, and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using standard protocols (such as SWORD) during our Sloan-funded work. The Dash project is open source; read more on our GitHub site. We encourage community discussion and contribution via GitHub Issues.

Currently there are five instances of the Dash tool available:

We plan to launch the new DataONE Dash instance in two weeks; this tool will replace the existing DataUp tool and allow anyone to deposit data into the DataONE infrastructure via the ONEShare repository using their Google credentials. Along with the release of DataONE Dash, we will release Dash 1.1 for the live sites listed above. There will be improvements to the user interface and experience.

The Newly Funded Sloan Project

Problem Statement

Researchers are not archiving and sharing their data in sustainable ways. Often data sharing involves using commercially owned solutions, posting data on personal websites, or submitting data alongside articles as supplemental material. A better option for data archiving is community repositories, which are owned and operated by trusted organizations (i.e., institutional or disciplinary repositories). Although disciplinary repositories are often known and used by researchers in the relevant field, institutional repositories are less well known as a place to archive and share data.

Why aren’t researchers using institutional repositories?

First, the repositories are often not set up for self-service operation by individual researchers who wish to deposit a single dataset without assistance. Second, many (or perhaps most) institutional repositories were created with publications in mind, rather than datasets, which may in part account for their less-than-ideal functionality. Third, user interfaces for the repositories are often poorly designed and do not take into account the user's experience (or inexperience) and expectations. As more and more of our activities are conducted on the Internet, we are exposed to many high-quality, commercial-grade user interfaces in the course of a workday. Correspondingly, researchers have come to expect clean, simple interfaces that can be learned quickly, with minimal need for contacting repository administrators.

Our Solution

We propose to address the three issues above with Dash, a well-designed, user-friendly data curation platform that can be layered on top of existing community repositories. Rather than creating a new repository or rebuilding community repositories from the ground up, Dash will provide a way for organizations to allow self-service deposit of datasets via a simple, intuitive interface that is designed with individual researchers in mind. Researchers will be able to document, preserve, and publicly share their own data with minimal support required from repository staff, as well as be able to find, retrieve, and reuse data made available by others.

Three Phases of Work

  1. Requirements gathering: Before the design process begins, we will gather requirements from researchers via interviews and surveys.
  2. Design work: Based on those interviews and surveys (Phase 1), we will develop a researcher-focused user interface that is visually appealing and easy to use.
  3. Technical work: Dash will be an added-value data sharing platform that integrates with any repository that supports community protocols such as SWORD (Simple Web-service Offering Repository Deposit); a rough sketch of such an integration follows this list.
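
As a rough illustration of the kind of integration we have in mind, here is a minimal sketch of a SWORD v2 deposit. The endpoint URL and credentials are hypothetical; the headers are standard SWORD 2.0, but any real repository will have its own collection IRIs and packaging requirements.

```python
import requests

# Hypothetical SWORD v2 collection endpoint and credentials.
COL_IRI = "https://repository.example.edu/sword/collection/datasets"

with open("dataset.zip", "rb") as payload:
    resp = requests.post(
        COL_IRI,
        data=payload,
        auth=("depositor", "secret"),
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "attachment; filename=dataset.zip",
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
            "In-Progress": "false",  # the deposit is complete, not a draft
        },
    )
resp.raise_for_status()  # a successful deposit returns 201 Created
```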

The dash is a critical component of any good ASCII art. By Reddit user Haleljacob


UC3, PLOS, and DataONE join forces to build incentives for data sharing

We are excited to announce that UC3, in partnership with PLOS and DataONE, is launching a new project to develop data-level metrics (DLMs). This 12-month project is funded by an Early Concept Grants for Exploratory Research (EAGER) grant from the National Science Foundation and will result in a suite of metrics that track and measure data use. The proposal is available via CDL's eScholarship repository: http://escholarship.org/uc/item/9kf081vf. More information is also available on the NSF Website.

Why DLMs? Sharing data is time consuming and researchers need incentives for undertaking the extra work. Metrics for data will provide feedback on data usage, views, and impact that will help encourage researchers to share their data. This project will explore and test the metrics needed to capture activity surrounding research data.

The DLM pilot will build on the successful open source Article-Level Metrics community project, Lagotto, originally started by PLOS in 2009. ALMs provide a view into the activity surrounding an article after publication, across a broad spectrum of ways in which research is disseminated and used (e.g., viewed, shared, discussed, cited, and recommended).
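
To give a flavor of what "a view into the activity surrounding an article" looks like in practice, here is a hedged sketch of querying a Lagotto-style ALM API. The host and API key are hypothetical, and the v5 endpoint shape shown is an assumption based on Lagotto's public API; individual installations may differ.

```python
import requests

# Hypothetical Lagotto host and API key; the endpoint path and response
# shape are assumed from Lagotto's v5 API and may vary across installs.
BASE = "https://alm.example.org/api/v5/articles"
params = {"ids": "10.1371/journal.pone.0000001", "api_key": "YOUR_KEY"}

resp = requests.get(BASE, params=params)
resp.raise_for_status()
for article in resp.json().get("data", []):
    # Each source (views, shares, citations, ...) reports its own totals.
    for source in article.get("sources", []):
        print(source["name"], source["metrics"]["total"])
```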

About the project partners

PLOS (Public Library of Science) is a nonprofit publisher and advocacy organization founded to accelerate progress in science and medicine by leading a transformation in research communication.

Data Observation Network for Earth (DataONE) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

The University of California Curation Center (UC3) at the California Digital Library is a creative partnership bringing together the expertise and resources of the University of California. Together with the UC libraries, we provide high quality and cost-effective solutions that enable campus constituencies – museums, libraries, archives, academic departments, research units and individual researchers – to have direct control over the management, curation and preservation of the information resources underpinning their scholarly activities.

The official mascot for our new project: Count von Count. From muppet.wikia.com


DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit, and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open source, and we encourage community discussion and contribution via our GitHub site.
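
For the curious, OAI-PMH is a simple HTTP-and-XML harvesting protocol. The verb and argument names below are standard OAI-PMH; the base URL is a hypothetical stand-in for whatever endpoint a Dash-connected repository would expose.

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH endpoint; ListRecords and oai_dc are standard.
BASE = "https://repository.example.edu/oai"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

resp = requests.get(BASE, params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)
for record in root.iter(OAI + "record"):
    title = record.find(".//" + DC + "title")  # Dublin Core title, if present
    if title is not None:
        print(title.text)
```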

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID (see the sketch after this list).
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.
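
As a sketch of what the EZID step looks like, here is a hedged example of minting a test DOI with DataCite metadata through EZID's HTTP API. The credentials, target URL, and metadata values are hypothetical; doi:10.5072/FK2 is EZID's test shoulder, so nothing minted this way is permanent.

```python
import requests

# Hedged sketch: mint a test DOI via EZID's mint API (ANVL metadata).
# Credentials, target URL, and metadata values are hypothetical.
anvl = "\n".join([
    "_target: https://merritt.cdlib.org/m/example_object",
    "datacite.creator: Smith, Jane",
    "datacite.title: Example ONEShare dataset",
    "datacite.publisher: ONEShare",
    "datacite.publicationyear: 2014",
])

resp = requests.post(
    "https://ezid.cdlib.org/shoulder/doi:10.5072/FK2",  # EZID's DOI test shoulder
    data=anvl.encode("utf-8"),
    auth=("username", "password"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
)
print(resp.text)  # e.g. "success: doi:10.5072/FK2..." when the mint works
```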

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can't properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform. Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly. It is more efficient to continue forward with a single platform, and we chose Dash as the more sustainable basis for this consolidated service. Dash is implemented in the Ruby on Rails framework, which is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data.  All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Also, users will not be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.

 

It’s the dawn of a new day for DataUp! From Flickr by David Yu.


Finding Disciplinary Data Repositories with DataBib and re3data

This post is by Natsuko Nicholls and John Kratz.  Natsuko is a CLIR/DLF Postdoctoral Fellow in Data Curation for the Sciences and Social Sciences at the University of Michigan.

The problem: finding a repository

Everyone tells researchers not to abandon their data on a departmental server, hard drive, USB stick, CD-ROM, stack of Zip disks, or quipu: put it in a repository! But most researchers don't know what repository might be appropriate for their data. If your organization has an Institutional Repository (IR), that's one good home for the data. However, not everyone has access to an IR, and data in IRs can be difficult for others to discover, so it's important to consider the other major (and not mutually exclusive!) option: deposit in a Disciplinary Repository (DR).

Many disciplinary repositories exist to handle data from a particular field or of a particular type (e.g., WormBase cares about nematode biology, while GenBank takes only DNA sequences). Some may ask whether the coexistence of IRs and DRs means competition or mutual benefit for universities and research communities; some may wonder how many repositories are out there for archiving digital assets; but most librarians and researchers just want to find an appropriate repository in a sea of choices.

For those involved in assisting researchers with data management, helping to find the right place to put data for sharing and preservation has become a crucial part of data services. This is certainly true at the University of Michigan: during a recent data management workshop, faculty members expressed interest in receiving more guidance on disciplinary repositories from librarians.

The help: directories of data repositories

Fortunately, there is help to be found in the form of repository directories.  The Open Access Directory maintains a subdirectory of data repositories.  In the Life Sciences, BioSharing collects data policies, standards, and repositories.  Here, we’ll be looking at two large directories that list repositories from any discipline: DataBib and the REgistry of REsearch data REpositories (re3data.org).

DataBib originated in a partnership between Purdue and Penn State University, and it’s hosted by Purdue. The 600 repositories in DataBib are each placed in a single discipline-level category and tagged with more detailed descriptors of the contents.

re3data.org, which is sponsored by the German Research Foundation, started indexing relatively recently, in 2012, but it already lists 628 repositories.  Unlike DataBib, repositories aren’t assigned to a single category, but instead tagged with subjects, content types, and keywords.  Last November, re3data and BioSharing agreed to share records.  re3data is more completely described in this paper.

Given the similar number of repositories listed in DataBib and re3data, one might expect that their contents would be roughly the same and conclude that something like 600 DRs are in operation. To test this possibility and get a better sense of the DR landscape, we examined the contents of both directories.

The question: how different are DataBib and re3data?

Contrary to expectation, there is little overlap between the databases. At least 1,037 disciplinary data repositories currently exist, and only 18% (191) are listed in both databases. That is a lot of repositories to sift through in search of the one right place to put data, especially since, with few exceptions, IRs are not listed in re3data or Databib (long lists of academic open access repositories are available elsewhere). Of the repositories in both databases, a majority (72%) are categorized into STEM fields.

Breakdown of the overlap by discipline (as assigned by DataBib).
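
For those keeping score, the set arithmetic behind the headline numbers above is easy to verify; a quick sketch using the directory counts quoted earlier:

```python
# Directory counts quoted above; the union is the number of distinct DRs.
databib_total = 600
re3data_total = 628
in_both = 191

distinct = databib_total + re3data_total - in_both  # 1,037 distinct repositories
print(distinct, f"{in_both / distinct:.1%}")        # 1037 18.4%
```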

Another way of characterizing the collections of re3data and Databib is by the repository's host country. In re3data, the top three contributing countries (US 36%, Germany 15%, UK 12%) together form the majority, whereas in Databib 58% of repositories are hosted in the US, followed by the UK (12%) and Canada (7%). This finding may not be too surprising, since re3data is based in Germany and Databib in the US. If you are a researcher looking for the right disciplinary data repository, the host country may matter, depending on your (national-international/private-public) funding agencies and the scale of collaboration.

The full list of repositories is available here.

The conclusion: check both

Going forward, help with disciplinary repository selection will increasingly be a part of data management workflows; the Data Management Planning Tool (DMPTool) plans to incorporate repository recommendations through DataBib, and DataCite may integrate with re3data. Further simplifying matters, DataBib and re3data plan to merge their services in some as-yet-undefined way. But, for now, it's safe to say that anyone looking for a disciplinary repository should check both DataBib and re3data.


Institutional Repositories: Part 2

A few weeks back I wrote a post describing institutional repositories (IRs for short). IRs have been around for a while, with the impetus of making scholarly publications open access. More recently, however, IRs have been cited as potential repositories for datasets, code, and other scholarly outputs. Here I continue the discussion of IRs and compare their utility to that of disciplinary repositories (DRs). Please note: although IRs are typically associated with open access publications, I discuss them here as potential repositories for data.

Honest criticism of IRs

In my discussions with colleagues at conferences and meetings, I have found that some are skeptical about the role of IRs in data access and preservation. I posit that this skepticism has a couple of origins:

  • IRs are often not intended for “self-service”, i.e., a researcher would need to connect with IR support staff (often via a face-to-face meeting), in order to deposit material into the IR.
  • Many IRs were created at least five years ago, with interfaces that sometimes appear to pre-date Facebook. Academic institutions often have no budget for a redesign of the user interface, which means those who visit an IR might be put off by the appearance and/or functionality.
  • IRs are run by libraries and IT departments, neither of which are known for self-promotion. Many (most?) researchers are likely unaware of an IR’s existence, and would not think to check in with the libraries regarding their data preservation needs.

These are all valid criticisms of many of the existing IRs. But there is one huge advantage of IRs over other data repositories: they are owned and operated by academic institutions that have a vested interest in preserving and providing access to scholarly work.

The bright side

IRs aren't all bad, or I wouldn't be blogging about them. I believe that they are undergoing a rebirth of sorts: they are now seen as viable places for datasets and other scholarly outputs. Institutions like Purdue are putting IRs at the center of their initiatives around data management, access, and preservation. Here at the CDL, the UC3 group is pursuing the implementation of a data curation platform, DataShare, to allow self-service deposit of datasets into the Merritt Repository (see the UCSF DataShare site). Recent mandates from above requiring access to data resulting from federal grants mean that funders (like IMLS) and organizations (like ARL) are taking an interest in improving the utility of IRs.

IRs versus discipline-specific repositories

In my last post, I mentioned that selecting a repository for your data doesn’t need to be either an IR or discipline-specific repository (DR). These repositories each have advantages and disadvantages, so using both makes sense.

DRs: ideal for data discovery and reuse

Often, DRs have collection policies for the specific types of data they are willing to accept. GenBank, for example, has standardized how you deposit your data, what types and formats of data they accept, and the metadata accompanying that data. This all means that searching for and using the data in GenBank is easy, and users can readily download data for reuse. Another advantage of having a collection of similar, standardized data is the ability to build tools on top of these datasets, making reuse and meta-analyses easier.

The downside of DRs

DRs are, by their nature, selective in the types of data they accept. Consider this scenario, typical of many research projects: what if someone worked on a project that combined sequencing genes, collecting population demographics, and documenting locations with GIS? Many DRs would not want to (or be able to) handle these disparate types of data. The result is that some of the data gets shared via a DR, while data less suitable for the DR does not get shared.

In my work with the DataONE Community Engagement and Education working group, I reviewed what datasets were shared from NSF grants awarded between 2005 and 2009 (see Panel 1 in Hampton et al. 2013). Many of the resulting publications relied on multiple types of data. Only around 28% of those projects shared all of the data produced. However, of the data that was shared, 81% was in GenBank or TreeBase – likely due to the culture of data sharing around genetic work. That means most of the non-genetic data is not available, and potentially lost, despite its importance for the project as a whole. Enter: institutional repositories.

IRs: the whole enchilada

Unlike many DRs, IRs have the potential to host entire collections of data around a project – regardless of the type of data, its format, etc. My postdoctoral work on modeling the effects of temperature and salinity on copepod populations involved field collection, laboratory copepod growth experiments (which included logs of environmental conditions), food growth (algal density estimates and growth rates, nutrient concentrations), population size counts, R scripts, and the development of the mathematical models themselves. An IR could take all of these disparate datasets as a package, which I could then refer to in the publications that resulted from the work. A big bonus is that this package could sit next to other packages I've generated over the course of my career, making it easier for me to point people to the entire corpus of my research work. The biggest bonus of all: having all of the data that produced a publication available at a single location helps ensure reproducibility and transparency.

Maybe you can have your cake (DRs) and eat it too (IRs). From Flickr by Mayaevening

There are certainly some repositories that could handle the type of data package I just described. The Knowledge Network for Biocomplexity is one such relatively generic repository (although I might argue that KNB is more like an IR than a discipline repository). Another is figshare, although this is a repository ultimately owned by a publisher. But as researchers start hunting for places to put their datasets, I would hope that they look to academic institutions rather than commercial publishers. (Full disclosure – I have data stored in figshare!)

Good news! You can have your cake and eat it too. Putting data in both the relevant DRs and more generic IRs is a good solution to ensure discoverability (DRs) and provenance (IRs).


Institutional Repositories: Part 1

If you aren't a member of the library and archiving world, you probably aren't aware of the phrase institutional repository (IR for short). I certainly wasn't aware of IRs prior to joining the CDL, and I'm guessing most researchers are similarly ignorant. In the next two blog posts, I plan to first explain IRs, then lay out the case for their importance – nay, necessity – as part of the academic ecosphere. I should mention up front that although the IR's inception focused on archiving traditional publications by researchers, I am speaking about them here as potential venues for preserving all scholarship, including data.

Academic libraries have a mission to archive scholarly work, including theses. These are at The Hive in Worcester, England. From Flickr by israelcsus.

If you read this blog, I'm sure you are aware of the growing emphasis on open science, open access to publications, data sharing, and reproducibility. Most of these concepts were easily accomplished in the olden days of pen-and-paper: you simply took great notes in your notebook, and shared that notebook as necessary with colleagues (this assumes, of course, geographic proximity and/or excellent mail systems). These days, that landscape has changed dramatically due to the increasingly computationally complex nature of research. Digital inputs and outputs of research might include software, spreadsheets, databases, images, websites, text-based corpora, and more. But these "digital assets", as the archival world might call them, are more difficult to store than a lab notebook. What does a virtual filing cabinet or file storage box look like that can house all of these different bits? In my opinion, it looks like an IR.

So what’s an IR?

An IR is a data repository run by an institution. Many of the large research universities have IRs. To name a few, Harvard has DASH, the University of California system has eScholarship and Merritt, Purdue has PURR, and MIT has DSpace. Many of these systems have been set up in the last 10 years or so to serve as archives for publications. For a great overview and history of IRs, check out this eHow article (which is surprisingly better than the relevant Wikipedia article).

So why haven't more people heard of IRs? Mostly this is because there have never been any mandates or requirements for researchers to deposit their works in IRs. Some libraries take on this task themselves; for example, I found out a few years ago that the MBL-WHOI Library graciously stored open access copies of all of my publications for me in their IR. But more and more, these "works" include digital assets that are not publications, and the burden of collecting all of the digital scholarship produced by an institution is a near-insurmountable task for a small group of librarians; there has to be either buy-in from researchers or mandates from the top.

The Case for IRs

I’m not the first one to recognize the importance of IRs. Back in 2002 the Scholarly Publishing and Academic Resources Coalition (SPARC) put out a position paper titled “The Case for Institutional Repositories” (see their website for more information). They defined an IR as having four major qualities:

  1. Institutionally defined,
  2. Scholarly,
  3. Cumulative and perpetual, and
  4. Open and interoperable.

Taking the point of view of the academic institution (rather than the researcher), the paper cited two roles that institutional repositories play for academic institutions:

  1. Reform scholarly communication – Reassert control over scholarship, reduce monopoly power of journals, and bring relevance to libraries
  2. Promote the university – Serve as an indicator of the university’s quality; showcase the university’s research; demonstrate public value and increase status.

In general, IRs are run by information professionals (e.g., librarians), who are experts at documenting, archiving, preserving, and generally curating information. All of those digital assets that we produce as researchers fit the bill perfectly.

As a researcher, you might not be convinced of the importance of IRs by the arguments above. Part of the indifference researchers may feel about IRs might have something to do with the existence of disciplinary repositories.

Disciplinary Repositories

There are many, many, many repositories out there for storing digital assets. To get a sense, check out re3data.org or databib.org and start browsing. Both of these websites are searchable databases for research data repositories. If you are a researcher, you probably know of at least one or two repositories for datasets in your field. For example, geneticists have GenBank, evolutionary biologists have TreeBase, ecologists have the KNB, and marine biologists have BCO-DMO. These are all examples of disciplinary repositories (DRs) for data. As any researcher who’s aware of these sites knows, you can both deposit and download data from these repositories, which makes them indispensable resources for their respective fields.

So where should a researcher put data?

The short answer is both an IR and a DR. I’ll expand on this and make the case for IRs to researchers in the next blog post.


Hello Data Publication World

Greetings, all. I'm a new postdoc at the CDL and I'm very excited to be spending the next couple of years thinking about data publication. Carly has discussed data publication several times before, but briefly, the goal is to improve dataset reproduction and reuse by publishing datasets as "first class" scholarly objects akin to journal articles, with the attendant opportunities for preservation, citation, and award of credit.

I spent most of grad school tickling worms with an eyebrow hair glued to a toothpick (this is true), but now I’m moving from lab to library as a CLIR/DLF Postdoctoral Fellow in Data Curation for the Sciences and Social Sciences.  The Sloan Foundation funds these fellowships to, as Josh Greenberg puts it, train “professionals with one foot in research and one foot in data curation”.

Partly for my own edification, I'm starting with a thorough survey of the data publication landscape. I'll be looking at current practices and proposals for data publication, citation, and peer-review. I'm interested in questions like: How can the quality of a dataset be evaluated? How does the creator of a dataset get credit for it? How do datasets remain findable, accessible, and usable in the future? Does it even make sense to apply the terms "publication" or "peer-review" to data at all?

Where things go from there depends on how the survey turns out, so that’s much more up in the air.  One possibility is to put a workflow for data publication together from existing tools.  Another is to identify a need not met by existing tools that the CDL could address.

If you have ideas you’d like to share, please comment here or email me.

Shameless Plug: Applications for 2014 CLIR/DLF Fellowships are opening soon!


Impact Factors: A Broken System

How big is your impact? Sedan Plowshare Crater, 1962. From Flickr by The Official CTBTO Photostream

If you are a researcher, you are very familiar with the concept of a journal’s Impact Factor (IF). Basically, it’s a way to grade journal quality. From Wikipedia:

The impact factor (IF) of an academic journal is a measure reflecting the average number of citations to recent articles published in the journal. It is frequently used as a proxy for the relative importance of a journal within its field, with journals with higher impact factors deemed to be more important than those with lower ones.
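
For reference, the standard two-year impact factor is just a ratio; the formula below restates the definition quoted above.

```latex
% Standard two-year impact factor for year Y
\mathrm{IF}_{Y} =
  \frac{\text{citations in year } Y \text{ to items published in years } Y{-}1 \text{ and } Y{-}2}
       {\text{citable items published in years } Y{-}1 \text{ and } Y{-}2}
```

So, to use hypothetical numbers, a journal whose 2011 and 2012 articles drew 2,400 citations in 2013, across 800 citable items, has a 2013 impact factor of 2400/800 = 3.0.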

The IF was devised in the 1970s as a tool for research libraries to judge the relative merits of journals when allocating their subscription budgets. However, it is now being used as a way to evaluate the merits of individual scientists – something for which it was never intended. As Björn Brembs puts it, "…scientific careers are made and broken by the editors at high-ranking journals."

In his great post, “Sick of Impact Factors“, Stephen Curry says that the real problem started when impact factors began to be applied to papers and people.

I can’t trace the precise origin of the growth but it has become a cancer that can no longer be ignored. The malady seems to particularly afflict researchers in science, technology and medicine who, astonishingly for a group that prizes its intelligence, have acquired a dependency on a valuation system that is grounded in falsity. We spend our lives fretting about how high an impact factor we can attach to our published research because it has become such an important determinant in the award of the grants and promotions needed to advance a career. We submit to time-wasting and demoralising rounds of manuscript rejection, retarding the progress of science in the chase for a false measure of prestige.

Curry isn't alone. Just last week Bruce Alberts, Editor-in-Chief of Science, wrote a compelling editorial about Impact Factor distortions. Alberts' editorial was inspired by the recently released San Francisco Declaration on Research Assessment (DORA). I think this is one of the more important declarations/manifestoes peppering the internet right now, and it has the potential to really change the way scholarly publishing is approached by researchers.

DORA was created by a group of editors and publishers who met up at the Annual Meeting of the American Society for Cell Biology (ASCB) in 2012. Basically, it lays out all the problems with impact factors and provides a set of general recommendations for different stakeholders (funders, institutions, publishers, researchers, etc.). The goal of DORA is to improve “the way in which the quality of research output is evaluated”.  Read more on the DORA website and sign the declaration (I did!).

An alternative to IF?

If most of us can agree that impact factors are not a great way to assess researchers or their work, then what’s the alternative? Curry thinks the solution lies in Web 2.0 (quoted from this post):

…we need to find ways to attach to each piece of work the value that the scientific community places on it though use and citation. The rate of accrual of citations remains rather sluggish, even in today’s wired world, so attempts are being made to capture the internet buzz that greets each new publication…

That’s right, skeptical scientists: he’s talking about buzz on the internet as a way to assess impact. Read more about “alternative metrics” in my blog post on the subject: The Future of Metrics in Science.  Also check out the list of altmetrics-related tools at altmetrics.org. The great thing about altmetrics is that they don’t rely solely on citation counts, plus they are capable of taking other research products into account (like blog posts and datasets).

Other good reads on this subject:


Large Facilities & the Data they Produce

Last week I spent three days in the desert, south of Albuquerque, at the NSF Large Facilities Workshop. What are these “large facilities”, you ask? I did too… this was a new world for me, but the workshop ended up being a great learning experience.

The NSF has a Large Facilities Office within the Office of Budget, Finance and Award Management, which supports “Major Research Equipment and Facilities Construction” (MREFC for short). Examples of these Large Facilities include NEON (National Ecological Observatory Network), IRIS PASSCAL Instrument Center (Incorporated Research Institutions for Seismology Program for Array Seismic Studies of the Continental Lithosphere), and the NRAO (National Radio Astronomy Observatory). Needless to say, I spent half of the workshop googling acronyms.

I was there to talk about data management, which made me a bit of an anomaly. Other attendees administered, managed, and worked at large facilities. In the course of my conversations with attendees, I was surprised to learn that these facilities aren't too concerned with data sharing, and most of these administrator types implied that the data were owned by the researcher; it was therefore the researcher's prerogative to share or not to share. From what I understand, the scenario is this: the NSF pays huge piles of money to get these facilities up and running, with hardware, software, technicians, managers, and on and on. The researchers then write a grant to the NSF or the facilities themselves to do work using these facilities. The researcher is then under no obligation to share the data with their colleagues. Does this seem fishy to anyone else?

I understand the point of view of the administrators that attended this conference: they have enough on their plate to worry about without dealing with the myriad problems that accompany data management, archiving, sharing, et cetera. These problems are only compounded by researchers' general resistance to sharing. For example, an administrator told me that, upon completion of their study, one researcher had gone into their system and deleted all of the data related to their project to make sure no one else could get it. I nearly fell over from shock.

Whatever cultural hangups the researchers have, aren’t these big datasets, being collected by expensive equipment, among the most important to be shared? Observations of the sky at a single point and time are not reproducible. You only get one shot at collecting data on an earthquake or the current spread rate for a rift zone. Not sharing these datasets is tantamount to scientific malpractice.

The Very Large Array, near Socorro, NM. This was the best workshop field trip EVER. CC-BY, Carly Strasser

One administrator respectfully disagreed with my charge that they should be doing more to promote data sharing. He said that their workflow for data processing was so complex and nuanced that no one could ever reproduce the dataset, and certainly no one could ever understand what exactly was done to obtain results. This marks the second time I nearly fell over during a conversation. If science isn’t reproducible because it’s too complex, you aren’t doing it right. Yes, I realize that exactly reproducing results is nearly impossible under the best of circumstances. But to not even try? With datasets this important? When all analyses are done via computers? It seems ludicrous.

So, after three days of dry skin and Mexican food, my takeaway from the workshop was this: all large facilities sponsored by NSF need thorough, clear policies about data produced using their equipment. These policies should include provisions for sharing, access, use, and archiving. They will most certainly be met with skepticism and resistance, but in these tight fiscal times, data sharing is of utmost importance when equipment this expensive is being used.
