Tag Archives: tools

UC3, PLOS, and DataONE join forces to build incentives for data sharing

We are excited to announce that UC3, in partnership with PLOS and DataONE, is launching a new project to develop data-level metrics (DLMs). This 12-month project is funded by an Early Concept Grants for Exploratory Research (EAGER) grant from the National Science Foundation and will result in a suite of metrics that track and measure data use. The proposal is available via CDL’s eScholarship repository: http://escholarship.org/uc/item/9kf081vf. More information is also available on the NSF website.

Why DLMs? Sharing data is time consuming and researchers need incentives for undertaking the extra work. Metrics for data will provide feedback on data usage, views, and impact that will help encourage researchers to share their data. This project will explore and test the metrics needed to capture activity surrounding research data.

The DLM pilot will build from the successful open source Article-Level Metrics community project, Lagotto, originally started by PLOS in 2009. ALMs provide a view into the activity surrounding an article after publication, across a broad spectrum of ways in which research is disseminated and used (e.g., viewed, shared, discussed, cited, and recommended).
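For the curious, here is a rough sketch of the kind of query a Lagotto server answers. The host (PLOS’s public ALM instance), the /api/v5/articles endpoint, the placeholder DOI, and the API key are assumptions based on the Lagotto documentation rather than part of the announcement above; the response lists per-source activity counts for an article.

```python
import json
import urllib.parse
import urllib.request

doi = "10.1371/journal.pone.0000000"                      # placeholder DOI
params = urllib.parse.urlencode({"ids": doi, "api_key": "YOUR_KEY"})
url = "http://alm.plos.org/api/v5/articles?" + params     # assumed Lagotto endpoint

with urllib.request.urlopen(url) as response:
    result = json.load(response)

# Inspect whatever comes back: per-source counts of views, shares, citations, etc.
print(json.dumps(result, indent=2)[:1000])
```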

About the project partners

PLOS (Public Library of Science) is a nonprofit publisher and advocacy organization founded to accelerate progress in science and medicine by leading a transformation in research communication.

Data Observation Network for Earth (DataONE) is an NSF DataNet project which is developing a distributed framework and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

The University of California Curation Center (UC3) at the California Digital Library is a creative partnership bringing together the expertise and resources of the University of California. Together with the UC libraries, we provide high quality and cost-effective solutions that enable campus constituencies – museums, libraries, archives, academic departments, research units and individual researchers – to have direct control over the management, curation and preservation of the information resources underpinning their scholarly activities.

The official mascot for our new project: Count von Count. From muppet.wikia.com


DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open-source and we encourage community discussion and contribution to our GitHub site.
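As a taste of what OAI-PMH compatibility would mean in practice, here is a minimal harvesting sketch using only the Python standard library. The base URL is a hypothetical placeholder (not a real Merritt or Dash endpoint); the verb, metadata prefix, and namespaces come from the OAI-PMH specification.

```python
import urllib.request
import xml.etree.ElementTree as ET

BASE = "https://repository.example.org/oai"        # hypothetical OAI-PMH endpoint
url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as response:
    tree = ET.parse(response)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}
for record in tree.findall(".//oai:record", ns):
    title = record.find(".//dc:title", ns)
    print(title.text if title is not None else "(no title)")
```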

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID (see the sketch after this list).
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.
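For readers who wonder what minting a DataCite DOI through EZID involves, here is a hedged sketch based on the public EZID API documentation (POST an ANVL-formatted metadata block to a shoulder). The shoulder shown is EZID’s test shoulder, and the credentials, target URL, and metadata values are placeholders; Dash handles all of this behind the scenes.

```python
import requests

EZID = "https://ezid.cdlib.org"
SHOULDER = "doi:10.5072/FK2"          # EZID's test shoulder; real accounts use their own

# ANVL-formatted metadata: one "key: value" pair per line
metadata = "\n".join([
    "_target: https://example.org/my-dataset",        # landing page (placeholder)
    "datacite.title: Example ecological dataset",
    "datacite.creator: Researcher, Jane",
    "datacite.publisher: ONEShare",
    "datacite.publicationyear: 2014",
])

resp = requests.post(
    EZID + "/shoulder/" + SHOULDER,
    data=metadata.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=("username", "password"),                     # placeholder credentials
)
print(resp.text)    # e.g. "success: doi:10.5072/FK2..." on success
```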

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can’t properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform.  Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly.  It is more efficient to continue forward with a single platform and we chose to use Dash as a more sustainable basis for this consolidated service.  Dash is implemented in the  Ruby on Rails framework that is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data.  All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Also, users will not be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.

 

It’s the dawn of a new day for DataUp! From Flickr by David Yu.


Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and divides individuals into one of five categories depending on the point in the curve at which they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” – GitHub, in this case – will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if the authors disagree about which system to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems – most notably, Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way to develop software projects efficiently, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how Git works, its structure, or how to incorporate Git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science“, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. Naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or file names with appended initials. Instead, a co-author can submit changes and document them with “commit messages” (read about them on GitHub here, and see the short sketch after point 4).

3. Feedback and review is easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work and understand why the project proceeded the way it did. For the super-savvy GitHubber, you can make your entire manuscript available and traceable on your site, from the first data point collected to the final submitted version. This is my goal for my next manuscript.
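To make point 2 a little more concrete, here is a minimal sketch of the commit-message workflow, driving the git command line from Python (typing the same commands in a terminal is equivalent). It assumes git is installed and configured with your name and email; the file name and message are placeholders I made up.

```python
import subprocess
from pathlib import Path

def git(*args):
    """Run a git command and print its output."""
    out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    print(out.stdout)

Path("analysis.R").write_text("# model-fitting code (placeholder)\n")

git("init")                      # start tracking this directory
git("add", "analysis.R")         # stage the file you changed
git("commit", "-m", "Refit model with 2013 field data; update methods section")
git("log", "--oneline")          # every change is documented by its commit message
```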

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.



Researchers – get your ORCID

Yesterday I remotely joined a lab meeting at my old stomping grounds, Woods Hole Oceanographic Institution. My former advisor, Mike Neubert, asked me to join his math ecology lab meeting to “convince them to get ORCID Identifiers. (Or try anyway!)”. As a result, I’ve spent a little bit of time thinking about ORCIDs in the last few days. I figured I might put the proverbial pen to paper and write a blog post about it for the benefit of other researchers.

What is ORCID?

An acronym, of course! ORCID stands for “Open Researcher & Contributor ID”. The ORCID Organization is an open, non-profit group working to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers (from their website). The endgame is to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors.

Wait – let’s back up.

What is a “Researcher Identifier”?

Wikipedia’s entry on ORCIDs might summarize researcher identifiers best:

An ORCID [i.e., researcher identifier] is a nonproprietary alphanumeric code to uniquely identify scientific and other academic authors. This addresses the problem that a particular author’s contributions to the scientific literature can be hard to recognize electronically, as most personal names are not unique; they can change (such as with marriage), have cultural differences in name order, contain inconsistent use of first-name abbreviations, and employ different writing systems. It would provide for humans a persistent identity – an “author DOI” – similar to that created for content-related entities on digital networks by digital object identifiers (DOIs).

Basically, researcher identifiers are like social security numbers for scientists. They unambiguously identify you throughout your research life. It’s important to note that, unlike SSNs, there isn’t just one researcher ID system. Existing researcher identifier systems include ORCID, ResearcherID, Scopus Author Identifier, arXiv Author ID, and eRA Commons Username. So why ORCID?

ORCID is an open system – that means web application developers, publishers, grants administrators, and institutions can hook into ORCID and use those identifiers for all kinds of stuff. It’s like having one identifier to rule them all – imagine logging into all kinds of websites, entering your ORCID ID, and having them know who you are, what you’ve published, and what impacts you have had on scientific research. A bonus of the ORCID organization is that they are committed to “transcending discipline, geographic, national and institutional boundaries” and ensuring that ORCID services will be based on transparent and non-discriminatory terms posted on the ORCID website.
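As an illustration of what “hooking into ORCID” can look like, here is a hedged sketch that fetches the public record for an ORCID iD. The v3.0 endpoint reflects ORCID’s current public API rather than what existed when this post was written, and the top-level keys printed are only an expectation.

```python
import json
import urllib.request

orcid_id = "0000-0001-9592-2339"
req = urllib.request.Request(
    "https://pub.orcid.org/v3.0/" + orcid_id + "/record",
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as response:
    record = json.load(response)

print(sorted(record.keys()))    # expected to include 'person', 'activities-summary', ...
```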

How does this differ from Google Scholar, Research Gate and the like?

This is one of the first questions most researchers ask. In fact, CV creation sites like Google Scholar profiles, Academia.edu, Research Gate, and the like are a completely different thing. ORCID is an identifier system, so comparing ORCIDs to Research Gate is like comparing your social security number to your Facebook profile. Note, however, that ORCID could work with these CV creation sites in the future – which would make identifying your research outputs even easier. The confusion probably stems from the fact that you can create an ORCID profile on their website. This is not required; however, it helps ensure that past research products are connected to your ORCID ID.

Metrics + ORCID

One of the most exciting things about ORCID is its potential to influence the way we think about credit and metrics for researchers. If researchers have unique identifiers, it makes it easier to round up all of their products (data, blog posts, technical documents, theses) and determine how much they have influenced the field. In other words, ORCID plays nice with altmetrics. Read more about altmetrics in these previous Data Pub blog posts on the subject. A 2009 Nature Editorial sums up this topic about altmetrics and identifiers nicely:

…But perhaps the largest challenge will be cultural. Whether ORCID or some other author ID system becomes the accepted standard, the new metrics made possible will need to be taken seriously by everyone involved in the academic-reward system — funding agencies, university administrations, and promotion and tenure committees. Every role in science should be recognized and rewarded, not just those that produce high-profile publications.

What should you do?

  1. Go to orcid.org
  2. Follow the Register Now Link and fill out the necessary fields (name, email, password)

You can stop here – you’ve claimed your ORCID ID! It will be a numeric string that looks something like this: 0000-0001-9592-2339 (that’s my ORCID ID!).
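If you’re curious how that final character is derived, ORCID iDs carry a check digit computed with the ISO 7064 MOD 11-2 algorithm. Here is a small sketch that verifies the iD above; the algorithm is standard, but the function name is just mine.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Return the check character for the first 15 digits of an ORCID iD (MOD 11-2)."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

orcid = "0000-0001-9592-2339"
digits = orcid.replace("-", "")
print(orcid_check_digit(digits[:-1]) == digits[-1])   # True: the iD is well-formed
```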

…OR you can go ahead and build out your ORCID profile. To add previous work:

  1. On your profile page (which opens after you’ve registered), select the “Import Works” button.
  2. A window will pop up with organizations that have partnered with ORCID. When in doubt, start with “CrossRef Metadata Search”. CrossRef provides DOIs for publishers, which means that if you’ve published articles in journals, they will probably show up in this metadata search (a rough sketch of this kind of lookup appears after this list).
  3. Grant approval for ORCID to access your CrossRef information. Then peruse the list and identify which works are yours.
  4. By default, the list of works on your ORCID profile will be private. You can change your viewing permission to allow others to see your profile.
  5. Consider adding a link to your ORCID profile on your CV and/or website. I’ve done it on mine.
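Here is the rough sketch promised above of the kind of DOI metadata lookup that sits behind CrossRef Metadata Search, using CrossRef’s public REST API. The author name is a placeholder, and ORCID’s import wizard does this matching for you; this is only to show what the plumbing looks like.

```python
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({"query.author": "Jane Researcher", "rows": 5})
with urllib.request.urlopen("https://api.crossref.org/works?" + query) as response:
    items = json.load(response)["message"]["items"]

for item in items:
    # Each item is one work CrossRef thinks matches the author query
    print(item["DOI"], "-", item.get("title", ["(untitled)"])[0])
```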

ORCID is still quite new – that means it won’t find all of your work, and you might need to manually add some of your products. But given their recently-awarded funding from the Alfred P. Sloan Foundation, and interest from many web application developers and companies, you can be sure that the system will only get better from here.


Orchis morio (Green-winged Orchid) Specimen in Derby Museum herbarium. From Flickr by Derby Museum.


UC Open Access: How to Comply

Free access to UC research is almost as good as free hugs! From Flickr by mhauri

My last two blog posts have been about the new open access policy that applies to the entire University of California system. For big open science nerds like myself, this is exciting progress and deserves much ado. For the on-the-ground researcher at a UC, knee-deep in grants and lecture preparation, the ado could probably be skipped in lieu of a straightforward explanation of how to comply with the procedure. So here goes.

Who & When:

  • 1 November 2013: Faculty at UC Irvine, UCLA, and UCSF
  • 1 November 2014: Faculty at UC Berkeley, UC Merced, UC Santa Cruz, UC Santa Barbara, UC Davis, UC San Diego, UC Riverside

Note: The policy applies only to ladder-rank faculty members. Of course, graduate students and postdocs should strongly consider participating as well.

To comply, faculty members have two options:

Option 1: Out-of-the-box open access

There are two ways to do this:

  1. Publishing in an open access-only journal (see examples here). Some have fees and others do not.
  2. Publishing with a more traditional publisher, but paying a fee to ensure the manuscript is publicly available. These are article-processing charges (APCs) and vary widely depending on the journal. For example, Elsevier’s Ecological Informatics charges $2,500, while Nature charges $5,200.

Learn more about different journals’ fees and policies: Directory of Open Access Journals: www.doaj.org

Option 2: Deposit your final manuscript in an open access repository.

In this scenario, you can publish in whatever journal you prefer – regardless of its openness. Once the manuscript is published, you take action to make a version of the article freely and openly available.

As UC faculty (or any UC researcher, including grad students and postdocs), you can comply via Option 2 above by depositing your publications in UC’s eScholarship open access repository. The CDL Access & Publishing Group is currently perfecting a user-friendly, efficient workflow for managing article deposits into eScholarship. The new workflow will be available as of November 1st. Learn more.

Does this still sound like too much work? Good news! The Publishing Group is also working on a harvesting tool that will automate deposit into eScholarship. Stay tuned – the estimated release of this tool is June 2014.

An Addendum: Are you not a UC affiliate? Don’t fret! You can find your own version of eScholarship (i.e., an open access repository) by going to OpenDOAR. Also see my full blog post about making your publications open access.

Why?

Academic libraries must pay exorbitant fees to provide their patrons (researchers) with access to scholarly publications.  The very patrons who need these publications are the ones who provide the content in the form of research articles.  Essentially, the researchers are paying for their own work, by proxy via their institution’s library.

What if you don’t have access? Individuals without institutional affiliations (e.g., between jobs), or who are affiliated with institutions that have no library or a poorly funded one (e.g., in developing countries), depend on open access articles to keep up with the scholarly literature. The need for OA isn’t limited to jobless or international folks, though. For proof, one only has to notice that the Twitter community has developed a hashtag around this, #Icanhazpdf (hat tip to the Lolcats phenomenon). Basically, you tweet the name of the article you can’t access and add the hashtag in hopes that someone out in the Twittersphere can help you out and send it to you.

Special thanks to Catherine Mitchell from the CDL Publishing & Access Group for help on this post.


The Data Lineup for #ESA2013

Why am I excited about Minneapolis? Potential Prince sightings, of course! From http://www.emusic.com

In less than a week, the Ecological Society of America’s 2013 Meeting will commence in Minneapolis, MN. There will be zillions of talks and posters on topics ranging from microbes to biomes, along with special sessions on education, outreach, and citizen science. So why am I going?

For starters, I’m a marine ecologist by training, and this is an excuse to meet up with old friends. But of course the bigger draw is to educate my ecological colleagues about all things data: data management planning, open data, data stewardship, archiving and sharing data, et cetera et cetera. Here I provide a rundown of must-see talks, sessions, and workshops related to data. Many of these are tied to the DataONE group and the rOpenSci folks; see DataONE’s activities and rOpenSci’s activities. Follow the full ESA meeting on Twitter at #ESA2013. See you in Minneapolis!

Sunday August 4th

0800-1130 / WK8: Managing Ecological Data for Effective Use and Re-use: A Workshop for Early Career Scientists

For this 3.5 hour workshop, I’ll be part of a DataONE team that includes Amber Budden (DataONE Community Engagement Director), Bill Michener (DataONE PI), Viv Hutchison (USGS), and Tammy Beaty (ORNL). This will be a hands-on workshop for researchers interested in learning about how to better plan for, collect, describe, and preserve their datasets.

1200-1700 / WK15: Conducting Open Science Using R and DataONE: A Hands-on Primer (Open Format)

Matt Jones from NCEAS/DataONE will be assisted by Karthik Ram (UC Berkeley & rOpenSci), Carl Boettiger (UC Davis & rOpenSci), and Mark Schildhauer (NCEAS) to highlight the use of open software tools for conducting open science in ecology, focusing on the interplay between R and DataONE.

Monday August 5th

1015-1130 / SS2: Creating Effective Data Management Plans for Ecological Research

Amber, Bill and I join forces again to talk about how to create data management plans (like those now required by the NSF) using the free online DMPTool. This session is only 1.25 hours long, but we will allow ample time for questions and testing out the tool.

1130-1315 / WK27: Tools for Creating Ecological Metadata: Introduction to Morpho and DataUp

Matt Jones and I will be introducing two free, open-source software tools that can help ecologists describe their datasets with standard metadata. The Morpho tool can be used to locally manage data and upload it to data repositories. The DataUp tool helps researchers not only create metadata, but check for potential problems in their dataset that might inhibit reuse, and upload data to the ONEShare repository.

Tuesday August 6th

0800-1000 / IGN2: Sharing Makes Science Better

This two-hour session organized by Sandra Chung of NEON is composed of 5-minute long “ignite” talks, which guarantees you won’t nod off. The topics look pretty great, and the crackerjack list of presenters includes Ethan White, Ben Morris, Amber Budden, Matt Jones,  Ed Hart, Scott Chamberlain, and Chris Lortie.

1330-1700 / COS41: Education: Research And Assessment

In my presentation at 1410, “The fractured lab notebook: Undergraduates are not learning ecological data management at top US institutions”, I’ll give a brief talk on results from my recent open-access publication with Stephanie Hampton on data management education.

2000-2200 / SS19: Open Science and Ecology

Karthik Ram and I are getting together with Scott Chamberlain (Simon Fraser University & rOpenSci), Carl Boettiger, and Russell Neches (UC Davis) to lead a discussion about open science. Topics will include open data, open workflows and notebooks, open source software, and open hardware.

2000-2200 / SS15: DataNet: Demonstrations of Data Discovery, Access, and Sharing Tools

Amber Budden will demo and discuss DataONE alongside folks from other DataNet projects like the Data Conservancy, SEAD, and Terra Populus.


Software Carpentry and Data Management

About a year ago, I started hearing about Software Carpentry. I wasn’t sure exactly what it was, but I envisioned tech-types showing up at your house with routers, hard drives, and wireless mice to repair whatever software was damaged by careless fumblings. Of course, this is completely wrong. I now know that it is actually an ambitious and awesome project that was recently adopted by Mozilla and got a boost from the Alfred P. Sloan Foundation (how is it that they always seem to be involved in the interesting stuff?).

From their website:

Software Carpentry helps researchers be more productive by teaching them basic computing skills. We run boot camps at dozens of sites around the world, and also provide open access material online for self-paced instruction.

SWC got its start in the 1990s, when its founder, Greg Wilson, realized that many of the scientists who were trying to use supercomputers didn’t actually know how to build and troubleshoot their code, much less use things like version control. More specifically, most had never been shown how to do four basic tasks that are fundamentally important to any science involving computation (which is increasingly all science):

  • growing a program from 10 to 100 to 1,000 lines without creating a mess
  • automating repetitive tasks (see the short sketch after this list)
  • basic quality assurance
  • managing and sharing data and code
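As a taste of the second item on that list, here is a short sketch that automates a repetitive task: summarizing every CSV file in a directory instead of opening each one by hand. The directory and column names are made up for illustration.

```python
import csv
from pathlib import Path

for path in sorted(Path("field_data").glob("*.csv")):      # hypothetical data directory
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    # "biomass" is a placeholder column name; swap in whatever your data actually record
    values = [float(row["biomass"]) for row in rows if row.get("biomass")]
    mean = sum(values) / len(values) if values else float("nan")
    print(f"{path.name}: {len(rows)} rows, mean biomass = {mean:.2f}")
```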
Software Carpentry is too cool for a reference to the Carpenters. From marshallmatlock.com (click for more).

Greg started teaching these topics (and others) at Los Alamos National Laboratory in 1998. After a bit of stop and start, he left a faculty position at the University of Toronto in April 2010 to devote himself to it full-time. Fast forward to January 2012, and Software Carpentry became the first project of what is now the Mozilla Science Lab, supported by funding from the Alfred P. Sloan Foundation.

This new incarnation of Software Carpentry has focused on offering intensive, two-day workshops aimed at grad students and postdocs. These workshops (which they call “boot camps”) are usually small – typically 40 learners – with low student-teacher ratios, ensuring that those in attendance get the attention and help they need.

Other than Greg himself, whose role is increasingly to train new trainers, Software Carpentry is a volunteer organization. More than 50 people are currently qualified to instruct, and the number is growing steadily. The basic framework for a boot camp is this:

  1. Someone decides to host a Software Carpentry workshop for a particular group (e.g., a flock of macroecologists, or a herd of new graduate students at a particular university). This can be fellow researchers, department chairs, librarians, advisors — you name it.
  2. Organizers round up funds to pay for travel expenses for the instructors and any other anticipated workshop expenses.
  3. Software Carpentry matches them with instructors according to the needs of their group; together, they and the organizers choose dates and open up enrolment.
  4. The boot camp itself runs eight hours a day for two consecutive days (though there are occasionally variations). Learning is hands-on: people work on their own laptops, and see how to use the tools listed below to solve realistic problems.

That’s it! They have a great webpage on how to run a bootcamp, which includes checklists and thorough instructions on how to ensure your boot camp is a success. About 2300 people have gone through a SWC bootcamp, and the organization hopes to double that number by mid-2014.

The core curriculum for the two-day boot camp usually covers the Unix shell, version control, basic programming in Python, and databases/SQL.

Software Carpentry also offers over a hundred short video lessons online, all of which are CC-BY licensed  (go to the SWC webpage for a hyperlinked list):

  • Version Control
  • The Shell
  • Python
  • Testing
  • Sets and Dictionaries
  • Regular Expressions
  • Databases
  • Using Access
  • Data
  • Object-Oriented Programming
  • Program Design
  • Make
  • Systems Programming
  • Spreadsheets
  • Matrix Programming
  • MATLAB
  • Multimedia Programming
  • Software Engineering

Why focus on grad students and postdocs? Professors are often too busy with teaching, committees, and proposal writing to improve their software skills, while undergrads have less incentive to learn since they don’t yet have a longer-term project in mind. SWC is also playing a long game: today’s grad students are tomorrow’s professors, and the day after that, they will be the ones setting parameters for funding programs, editing journals, and shaping science in other ways. Teaching them these skills now is one way – maybe the only way – to make computational competence a “normal” part of scientific practice.

So why am I blogging about this? When Greg started thinking about training researchers to understand the basics of good computing practice and coding, he couldn’t have predicted the huge explosion in the availability of data, the number of software programs to analyze those datasets, and the shortage of training that researchers receive in dealing with this new era. I believe that part of the reason funders stepped up to help the mission of Software Carpentry is that now, more than ever, researchers need these skills to successfully do science. Reproducibility and accountability are in more demand, and data sharing mandates will likely morph into workflow sharing mandates. Ensuring reproducibility in analysis is next to impossible without the skills Software Carpentry’s volunteers teach.

My secret motive for talking about SWC? I want UC librarians to start organizing bootcamps for groups of researchers on their campuses!


Software for Reproducibility Part 2: The Tools

From the Calisphere collection “American war posters from the Second World War”, Contributed by UC Berkeley Bancroft Library (Click on image for more information)

Last week I wrote about the workshop I attended (Workshop on Software Infrastructure for Reproducibility in Science), held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows. I provided some broad-strokes overviews last week; this week, I’ve created a list of some of the tools we saw during the workshop. Note: the level of detail for tools is consistent with my level of fatigue during their presentation!

Sumatra

  • Presenter: Andrew Davison
  • Short description: Sumatra is a library of components and graphical user interfaces (GUIs) for viewing and using tools via Python. This is an “electronic lab notebook” kind of idea.
  • Sumatra needs to interact with version control systems, such as Subversion, Git, Mercurial, or Bazaar.
  • Future plans: Integrate with R, Fortran, C/C++, Ruby.

From the Sumatra website:

The solution we propose is to develop a core library, implemented as a Python package, sumatra, and then to develop a series of interfaces that build on top of this: a command-line interface, a web interface, a graphical interface. Each of these interfaces will enable: (1) launching simulations/analyses with automated recording of provenance information; and (2) managing a computational project: browsing, viewing, deleting simulations/analyses.

Taverna

  • Presenter: Carole Goble
  • Short description: Taverna is an “execution environment”, i.e. a way to design and execute formal workflows.
  • Other details: Written in Java. Consists of the Taverna Engine (the workhorse), the Taverna Workbench (desktop client) and Taverna Server (remote workflow execution server) that sit on top of the Engine.
  • Links up with myExperiment, a publication environment for sharing workflows. It allows you to put together the workflow, description, files, data, documents etc. and upload/share with others.

From the Taverna website:

Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.

IPython Notebook

  • Presenter: Brian Granger
  • IPython notebook is a web-based computing environment and open document format for telling stories about code and data. Spawn of the IPython Project (focused on interactive computing)
  • Very code-centric. Text (i.e., Markdown), LaTeX, images, code, etc. can be interwoven (see the tiny sketch after this entry). Links up with GitHub.
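Here is the tiny sketch mentioned above of how a single notebook cell can interleave rich text, LaTeX, and computed output. Run it inside an IPython/Jupyter notebook; the logistic-growth example is mine, not from the presentation.

```python
from IPython.display import Markdown, Math, display

display(Markdown("**Logistic growth** of a population with carrying capacity $K$:"))
display(Math(r"\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right)"))

r, K, N = 0.5, 100.0, 10.0
for _ in range(5):               # a few Euler steps, so there is code output too
    N += r * N * (1 - N / K)
print(round(N, 1))
```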

Galaxy

  • Presenter: James Taylor
  • Galaxy is an open source, free web service integrating a wealth of tools, resources, etc. to simplify researcher workflows in genomics.
  • Biggest challenge is archiving. They currently have 1 PB of user data due to the abundance of sequences used.

From their website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

Madagascar

  • Presenter: Sergey Fomel
  • Geophysics-focused project management system.

From the Madagascar website:

Madagascar is an open-source software package for multidimensional data analysis and reproducible computational experiments. Its mission is to provide

  • a convenient and powerful environment
  • a convenient technology transfer tool

for researchers working with digital image and data processing in geophysics and related fields. Technology developed using the Madagascar project management system is transferred in the form of recorded processing histories, which become “computational recipes” to be verified, exchanged, and modified by users of the system.

VisTrails

  • Presenter: David Koop
  • VisTrails is an open-source scientific workflow and provenance management system that provides support for simulations, data exploration and visualization.
  • It was built with ideas of provenance and transparency in mind.  Essentially the research “trail” is followed as users generate and test a hypothesis.
  • Focuses on change-based provenance: it keeps track of all changes via a version tree.
  • There’s also execution provenance – start time and end time, where it was done, etc.; this is instrument-specific metadata.
  • Crowdlabs.org: associated social website for sharing workflows and provenance

RCloud

  • No formal website; GitHub site only
  • Presenter: Carlos Scheidegger
  • RCloud was developed for programmers who use R at AT&T Labs. They were having interactive sessions in R for data exploration and exploratory data analysis. Their idea: what if every R session was transparent and automatically versioned?
  • I know this is a bit of a thin description… but this is an ongoing, active project. I’m anxious to see where it goes next.

ReproZip 

  • No formal website, but see a 15 minute demo here
  • Presenter: Fernando Chirigati
  • The premise: few computational experiments are reproducible. To get closer to reproducibility, we need a record of: data description, experiment specs, description of environment (in addition to code, data etc).
  • ReproZip automatically and systematically captures required provenance of existing experiments. It does this by “packing” experiments. How it works:
    • ReproZip executes the experiment; SystemTap captures the provenance; each node of the process has details on it (provenance tree).
    • Necessary components are identified.
    • A specification of the workflow is generated and fed into a workflow system.
  • When you are ready to examine/explore/verify experiments, the package is extracted and ReproZip unloads the experiment and workflows.

Open Science Framework

  • Presenters: Brian Nosek & Jeff Spies
  • In brief, the OSF is a nifty way to track projects, work with collaborators, and link together tools from your workflow.
  • You simply go to the website and start a project (for free). Then add contributors and components to the project. Voila!
  • A neat feature – you can fork projects.
  • Provides a log of activities (i.e., version control) & access control (i.e., who can see your stuff).
  • As long as two software systems work with OSF, they can work together – OSF allows the APIs to “talk”.

From the OSF website:

The Open Science Framework (OSF) is part network of research materials, part version control system, and part collaboration software. The purpose of the software is to support the scientist’s workflow and help increase the alignment between scientific values and scientific practices.

RunMyCode

  • Presenter: Victoria Stodden
  • The idea behind this service is that researchers create “companion websites” associated with their publications. These websites allow others to implement the methods used in the paper.

From the RunMyCode website:

RunMyCode is a novel cloud-based platform that enables scientists to openly share the code and data that underlie their research publications. This service is based on the innovative concept of a companion website associated with a scientific publication. The code is run on a computer cloud server and the results are immediately displayed to the user.

Dexy

  • Presenter: Ana Nelson
  • Tagline: make | docs | sexy (what more can I say?)
  • Dexy is an Open Source platform that allows those writing up documents that have underpinning code to combine the two.
  • Can mix/match coding languages and documentation formats. (e.g., Python, C, Markdown to HTML, data from APIs, WordPress, etc.)

From their website:

Dexy lets you to continue to use your favorite documentation tools, while getting more out of them than ever, and being able to combine them in new and powerful ways. With Dexy you can bring your scripting and testing skills into play in your documentation, bringing project automation and integration to new levels.

DuraSpace 

  • Presenter: Jonathan Markow
  • DuraSpace’s mission is access, discovery, and preservation of scholarly digital data. Their primary stakeholders and audience are libraries. As part of this role, DuraSpace is a steward of open source projects (Fedora, DSpace, VIVO).
  • Their main service is DuraCloud: an online storage and preservation service that does things like repair corrupt files, move online copies offsite, distribute content geographically, and scale up or down as needed.

Dataverse 

  • Presenter: Merce Crosas
  • Dataverse is a virtual archive for “research studies”. These research studies are containers for data, documentation, and code that are needed for reproducibility.

From their website:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.


Software for Reproducibility

The ultimate replication machine: DNA. Sculpture at Lawrence Berkeley School of Science, Berkeley CA. From Flickr by D.H. Parks.

Last week I thought a lot about one of the foundational tenets of science: reproducibility. I attended the Workshop on Software Infrastructure for Reproducibility in Science, held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows.

New to workflows? Read more about workflows in old blog posts on the topic, here and here. Basically, a workflow is a formalization of “process metadata”.  Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as a precise description of the scientific procedures you follow.

After sitting through demos and presentations on the different tools folks have created, my head was spinning, in a good way. A few of my takeaways are below. For my next Data Pub post I will provide list of the tools we discussed.

Takeaway #1: Reuse is different from reproducibility.

The end-goal of documenting and archiving a workflow may be different for different people/systems. Reuse of a workflow, for instance, is potentially much easier than exactly reproducing the results. Any researcher will tell you: reproducibility is virtually impossible. Of course, this differs a bit depending on discipline: anything involving a living thing is much more unpredictable (i.e., biology), while engineering experiments are more likely to be spot-on when reproduced. The level of detail needed to reproduce results is likely to dwarf the details and information needed for reuse of workflows.

Takeaway #2: Think of reproducibility as archiving.

This was something Josh Greenberg said, and it struck a chord with me. It was said in the context of considering exactly how much stuff should be captured for reproducibility. Josh pointed out that there is a whole body of work out there addressing this very question: archival science.

Example: an archivist at a library gets boxes of stuff from a famous author who recently passed away. How does s/he decide what is important? What should be kept, and what should be thrown out? How should the items be arranged to ensure that they are useful? What metadata, context, or other information (like a finding aid) should be provided?

The situation with archiving workflows is similar: how much information is needed? What are the likely uses for the workflow? How much detail is too much? Too little? I like considering the issues around capturing the scientific process as similar to archival science scenarios – it makes the problem seem a bit more manageable.

Takeaway #3: High-quality APIs are critical for any tool developed.

We talked about MANY different tools. The one thing we could all agree on was that they should play nice with other tools. In the software world, this means having a nice, user-friendly Application Program Interface (API) that basically tells two pieces of software how to talk to one another.

Takeaway #4: We’ve got the tech-savvy researchers covered. Others? not so much.

The software we discussed is very nifty. That said, many of these tools are geared towards researchers with some impressive tech chops. The tools focus on helping capture code-based work, and integrate with things like LaTeX, Git/GitHub, and the command line. Did I lose you there? You aren’t alone… many of the researchers I interact with are not familiar with these tools, and would therefore not be able to effectively use the software we discussed.

Takeaway #5: Closing the gap between the tools and the researchers that should use them is hard. But not impossible.

There are three basic approaches that we can take:

  1. Focus on better user experience design
  2. Emphasize researcher training via workshops, one-on-one help from experts, et cetera
  3. Force researchers to close the gap on their own. (i.e., Wo/man up).

The reality is that it’s likely to be some combination of these three. Those at the workshop recognized the need for better user interfaces, and some projects here at the CDL are focusing on extensive usability testing prior to release. Funders are beginning to see the value of funding new positions for “human bridges” to help sync up researcher skill sets with available tools. And finally, researchers are slowly recognizing the need to learn basic coding– note the massive uptake of R in the Ecology community as an example.
