Category Archives: Best practices

Make Data Count: Building a System to Support Recognition of Data as a First Class Research Output

The Alfred P. Sloan Foundation has made a 2-year, $747K award to the California Digital Library, DataCite and DataONE to support collection of usage and citation metrics for data objects. Building on pilot work, this award will result in the launch of a new service that will collate and expose data level metrics.

The impact of research has traditionally been measured by citations to journal publications: journal articles are the currency of scholarly research.  However, scholarly research is made up of a much larger and richer set of outputs beyond traditional publications, including research data. In order to track and report the reach of research data, methods for collecting metrics on complex research data are needed.  In this way, data can receive the same credit and recognition that is assigned to journal articles.

“Recognition of data as valuable output from the research process is increasing, and this project will greatly enhance awareness around the value of data and enable researchers to gain credit for the creation and publication of data” – Ed Pentz, Crossref.

This project will work with the community to create a clear set of guidelines on how to define data usage. In addition, the project will develop a central hub for the collection of data level metrics. These metrics will include data views, downloads, citations, saves, and social media mentions, and they will be exposed through customized user interfaces deployed at partner organizations. Developed as open source, with extensive user experience testing and community engagement, the products of this project will be available to data repositories, libraries, and other organizations to deploy within their own environments, serving their communities of data authors.

Are you working in the data metrics space? Let’s collaborate.

Find out more and follow us at: www.makedatacount.org, @makedatacount

About the Partners

California Digital Library was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. The University of California Curation Center (UC3), one of four main programs within the CDL, helps researchers and the UC libraries manage, preserve, and provide access to their important digital assets, and develops tools and services that serve the community throughout the research and data life cycles.

DataCite is a leading global non-profit organization that provides persistent identifiers (DOIs) for research data. Our goal is to help the research community locate, identify, and cite research data with confidence. Through collaboration, DataCite supports researchers by helping them to find, identify, and cite research data; data centres by providing persistent identifiers, workflows and standards; and journal publishers by enabling research articles to be linked to the underlying data/objects.

DataONE (Data Observation Network for Earth) is an NSF DataNet project which is developing a distributed framework and sustainable cyber infrastructure that meets the needs of science and society for open, persistent, robust, and secure access to well-described and easily discovered Earth observational data.

Describing the Research Process

We at UC3 are constantly developing new tools and resources to help researchers manage their data. However, while working on projects like our RDM guide for researchers, we’ve noticed that researchers, librarians, and people working in the broader digital curation space often talk about the research process in very different ways.

To help bridge this gap, we are conducting an informal survey to understand the terms researchers use when talking about the various stages of a research project.

If you are a researcher and can spare about 5 minutes, we would greatly appreciate it if you would click the link below to participate in our survey.

http://survey.az1.qualtrics.com/jfe/form/SV_a97IJAEMwR7ifRP

Thank you.

CC BY and data: Not always a good fit

This post was originally published on the University of California Office of Scholarly Communication blog.

Last post I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools — and how they work with data.


Data Science meets Academia

(guest post by Johannes Otterbach)

First Big Data and Data Science, then Data Driven and Data Informed. Even before I changed job titles—from Physicist to Data Scientist—I spent a good bit of time pondering what makes everyone so excited about these things, and whether they have a place in the academy.

Data Science is an incredibly young and flaming hot field (searching for ‘Data Science’ on Google Search yields about 283,000,000 results in 0.48 seconds [!] and the count is rising). The promises—and accordingly the stakes—of Data Science are high, and seem to follow a classic Hype Cycle. Nevertheless, Data Science is already having major impacts on all aspects of life, with personalized advertisement and self-quantification leading the charge. But is there a place for Data Science in Academia? To try and answer this question, first we have to understand more about Data Science itself, from lofty promises to practical workflows, and later I’ll offer some potential (big-picture) academic applications.

Yet another attempt at defining Data Science

There are gazillions of blogs, articles, diagrams, and other information channels that aim to define this new and still-fuzzy term ‘Data Science,’ and it will still be some years before we achieve consensus. At least for now there is some agreement surrounding the main ingredients; Drew Conway summarizes them nicely in his Venn diagram:

Data_Science_VD

In this popular tweet, Josh Wills defines a Data Scientist as an individual ‘who is better at statistics than any software engineer and better at software engineering than any statistician.’  This definition just barely captures some of the basics. Referring back to the Venn diagram, a Data Scientist finds her/himself at the intersection of Statistics, Machine Learning, and a particular business need (in academic parlance, a research question).

  • Statistics is perhaps the most obvious component, as Data Science is partially about analyzing data using summary statistics (e.g., averages, standard deviations, correlations, etc.) and more complex mathematical tools (a toy example in R follows this list). This is supplemented by
  • Machine Learning, which subsumes the programming and data munging aspects of a Data Scientist’s toolkit. Machine Learning is used to automatically sift through data that are too unwieldy for humans to analyze. (This is sometimes an aspect of defining Big Data.) As an example, just try to imagine how many dimensions you could define to monitor student performance: past and current grades, participation, education history, family and social circles, and physical and mental health, to name just a few categories, each of which could be exploded into several subcategories. Typically, the output of Machine Learning is a set of features that are important within a given business problem and that can provide insight when evaluated in the context of
  • the Domain Knowledge. Domain Knowledge is essential in order to identify and explore the questions that will drive business actions. It is the one ingredient that’s not generalizable across different segments of industry (disciplines or domains), and as such a Data Scientist must acquire new Domain Knowledge for each new problem that she/he encounters.
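To make the first two ingredients concrete, here is a toy sketch in R, using the built-in mtcars data set rather than anything domain-specific: a few summary statistics, followed by a simple learned model.

# Toy illustration of the Statistics and Machine Learning ingredients,
# using R's built-in mtcars data (no Domain Knowledge required here).
data(mtcars)

# Summary statistics: central tendency, spread, and a correlation.
mean(mtcars$mpg)
sd(mtcars$mpg)
cor(mtcars$mpg, mtcars$wt)

# A minimal "learned" model: linear regression of fuel economy on weight
# and horsepower, i.e., an automated way of weighing candidate features.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)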

The most formalized definition I’ve come across is from NIST’s Big Data Framework:

Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

I won’t elaborate on these terms here, but I do want to draw your attention to the modest word actionable. This is the key component that distinguishes Data Science from mere data analysis, and its implementation gives rise to the dichotomy of Data Driven vs. Data Informed.

Promises and shortcomings of Data Science: The Hype Cycle

The Gartner Hype Cycle report (2014) on emerging technologies places Data Science just past the threshold of inflated expectations.

gartner_2014_emergingTech_hypecycle

This hype inflation contributes to unreasonable expectations about the problem-solving power of Data Science. All the way back in 2008, one of the early proponents of Big Data and Data Science, the Editor-in-Chief of Wired, Chris Anderson, blogged that the new data age would bring The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. He claimed that by using sufficiently advanced Machine Learning algorithms, gaining insight into a problem would become trivial. This ignores the role of Domain Knowledge in understanding and posing the right questions, and by now it’s not hard to see that his projection was off. If we consider highly complex processes where sufficient data are not and might never be available, we can only make advances by means of educated guesses and by building appropriate models and hypotheses. This requires a substantial amount of Domain Knowledge. Nick Barrowman formulated a detailed argument (one that goes beyond just a response to Anderson’s opinion) in his article on Correlation, Causation and Confusion.

Data Science, and in particular Applied Machine Learning, is not completely agnostic of the problem space in which it’s applied; this has serious implications for the analyst’s approach to unknown data. Most importantly, the Domain Knowledge is indispensable for correctly evaluating the predictions of the algorithms and making smart decisions rather than placing blind faith in the computational output. As Yoshua Bengio frames it in his book, Deep Learning [Ch. 5.3.1, p.110]:

The most sophisticated [Machine Learning] algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class. Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

Actionable business insights: Data Driven vs. Data Informed

The oft-quoted expression ‘Be data informed, not data driven’ seems to originate with a 2010 talk by Adam Mosseri of Facebook. He coined these terms to distinguish two different approaches to a data problem.

  • The Data-Driven approach involves analyzing the data and then adjusting the system to optimize a certain metric. Ad placement on a website provides a simple example: we move the ad slightly until we maximize the number of clicks on it. The problem with this approach is that we can get trapped in locally optimal points, i.e., points where any small deviation decreases the click rate; yet we can’t be sure that there’s not an even better way of displaying the ad (a toy sketch of this trap follows the list below). Joshua Porter summarizes the pitfalls of a Data-Driven approach in the context of UX design. To find the absolute best solution, a tremendous amount of data and time are necessary (technically, an infinite amount of both).

Another shortcoming of the Data-Driven approach is that not everything can be formulated as an optimization problem, which is the fundamental mathematical formulation of Machine Learning. As a result, we can’t always guarantee that proper data have been collected, particularly in cases where we don’t have a good idea of what a satisfying answer would look like. To circumvent these problems we can apply

  • The Data-Informed way of viewing a problem, which avoids micro-optimization as mentioned above. Furthermore it allows us to include decision-making inputs that cannot be cast into a ‘standard Machine Learning form,’ such as:
    – Qualitative data
    – Strategic interests
    – Regulatory bodies
    – Business interests
    – Competition
    – Market

Data-Informed decisions leverage the best of both worlds: the analysis of data given a hypothesis, followed by a well-rounded decision that in turn leads to the collection of new data to improve the business. Joe Blitzstein’s visualization summarizes the Data Science Process, and there’s even an industry standard known as CRISP-DM:

Blitzstein_DataScientistWorkflow

What about Data Science in Academia?

There have long been calls to Academia to better prepare students (especially Ph.D. graduates) for the job market. The explosion of Data Science as the sexiest job of the 21st century is fueling the creation of an increasing number of Data Science Master’s programs. The value of these programs remains to be tested, as few graduates have hit the market, but the trend reveals that Academia is at least trying to respond to calls for reform.

Apart from preparing students for careers outside the academy, is there space for applying Data Science to traditional academic fields, and maybe establishing it as a field unto itself? Data Science involves much more than statistical data analysis, encompassing aspects of data management, data warehousing, reproducibility, and data best practices. To advance science as a whole, it will be necessary for researchers and staff to develop a pi-shaped skills profile (as coined by Alex Szalay):

pi_shaped_skills

The first leg, a.k.a. the domain specialty or Domain Knowledge, is already established after years of efforts to advance a field. However, this hints at a fundamental problem for Data Science as a domain-agnostic, standalone field. Data Science as a Service (DSaaS) is likely to fail. Instead, Data Scientists should be embedded in a field and possess domain expertise, in addition to the cross-disciplinary techniques required to tackle the data challenges at hand.

This feeds into the second, to-be-developed, leg, which represents advanced computational literacy. As more and more researchers leave the academy, it’s obvious that the current system disincentivizes this development. However, it also reveals some low-hanging fruit. An easy win would be adopting simple best practices to improve how scientific data are handled and encouraging students to develop solid data skills. Another win would be to reward researchers for their efforts to make studies transparent and reproducible. Without such cultural changes, Academia will fail to advance ever-more-diversified scientific fields into the next century. Perpetuating current practices will only undermine scientific research and make it increasingly undiscoverable. As Denis Diderot put it in his 1755 Encyclopédie:

As long as the centuries continue to unfold, the number of books will grow continually, and one can predict that a time will come when it will be almost as difficult to learn anything from books as from the direct study of the whole universe. It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes.

Next steps

It’s clear that Data Science will have major impacts on our digital and non-digital lives. The Internet of Things already transcends our individual internet presence by connecting everyday devices—such as thermostats, fridges, and cars—to the internet, and thus makes them available for optimization using Data Science. The extent of these impacts, though, will depend on our ability to make sense of the data and develop tools and intuitions to check computerized predictions against reality. Moreover, we require a better understanding of the limitations of Data Science as well as its mathematical-statistical foundations. Without thorough basic knowledge, Data Science and Machine Learning will be seen as belonging to the Dark Arts and will raise skepticism. This is true for data of all sizes and depends strongly on whether we succeed in making data discoverable and processable. Data Science has a role to play in this (both in industry and the academy). To succeed, we first need to rethink the way scientific information is produced, stored, and prepared for further investigations. And this goal hinges on overdue changes of incentives within the academy.

About the author

Johannes Otterbach is a Data Scientist at LendUp with a passion for big data technologies and their applications to real-world problems. He earned his Ph.D. in Physics, working on topics related to Quantum Computing.


Data (Curation) Viz.

Data management and data curation are related concepts, but they do not refer to precisely the same things. I use these terms so often now that sometimes the distinctions, fuzzy as they are, become indistinguishable. When this happens I return to visual abstractions to clarify —in my own mind—what I mean by one vs. the other. Data management is more straightforward and almost always comes in the guise of something like this:

The obligatory research data management life cycle slide. Everyone uses it, myself included, in just about every presentation I give these days. This simple (arguably oversimplified) but useful model defines more-or-less discrete data activities that correspond with different phases of the research process. It conveys what it needs to convey; namely, that data management is a dynamic cycle of activities that constantly influence one another. Essentially, we can envision a feedback loop.

Data curation, on the other hand, is a complex beastie. Standard definitions cluster around something like this one from the Digital Curation Centre in the UK:

Data curation involves maintaining, preserving, and adding value to digital research data throughout its lifecycle.

When pressed for a definition, this is certainly an elegant response. But, personally, I don’t find it to be helpful at all when I try to wrap my head around the myriad activities that go into curating anything, much less distinguishing management activities from curation activities. Moreover, I’m talking about all kinds of activities in the context of “data,” a squishy concept in and of itself. (We’ll go with the NSF’s definition: the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.)

I suppose I should mention sooner or later that the point of defining “data” and all these terms appended to it is the following: there’s a lot of it [data] and we need to figure out what on earth to do with it, ergo the proliferation of new positions with the phrases “data management” and “data curation” in their titles. It’s important to make sure we’re speaking the same language.

There are other, more expansive approaches to defining data curation and a related post on this very blog, but to really grasp what I’m talking about when I’m saying the words “data curation,” I invariably come back to this visualization created by Tim Norris. Tim is a geographer turned CLIR Postdoctoral Fellow in Data Curation at the University of Miami. Upon assuming a new post with an unfamiliar title, he decided to draw a map of his job to explain (to himself and to others) what he means by data curation. Many thanks to Tim for sharing this exercise with the rest of our CLIR cohort and now with the blogo-world-at-large.

Below is an abbreviated caption, in Tim’s own words, as well as short- (3 min) and long-format (9 min) tours of the map narrated by Tim. And here is a handy PNG file for those occasions when the looping life cycle visualizations just won’t do.

This map of data curation has two visual metaphors. The first is that of a stylized mandala: a drawing that implies both inwards and outwards motion that is in balance. And the second is that of a Zen koan: first there is a mountain, then there’s none, and then there is. We start with visual complexity—the mountain. To build the data curation mountain we start with a definition of the word “curation” as a five step process that moves inwards. The final purpose of this curation is to move what is being curated back into the world for re-use, publication and dissemination. This can be understood as stewardship. Next we think about the sources of data in the outside world. These sources have been abstracted into three data spaces: library digital collections, external data sources, and research data products. As this data moves “inwards” we can think of verbs that describe the ingestion processes. Metadata creation, or describing the data, is a key that enables later data linkages to be identified with the final goal of making data interoperable. Once the data is “inside” the curation space it passes through a standard process that begins with storage and ends with discovery. Specific to data in this process are the formats in which the data is stored and the difference between preservation and conservation for data. To enable this work we need hardware, software, and human interfaces to the curated data. Finally, as the data moves back out into the world, we must pay attention to institutions of property rights and access. If we get this all right we will have a system that is sustainable, secure, and increases the value of our research data collections. Once again we have a mountain.


The 10 Things Every New Grad Student Should Do

It’s now mid-October, and I’m guessing that first year graduate students are knee-deep in courses, barely considering their potential thesis project. But for those that can multi-task, I have compiled this list of 10 things that you should undertake in your first year as a grad student. These aren’t just any 10 things… they are 10 steps you can take to make sure you contribute to a culture shift towards open science. Some are big steps, and others are small, but they will all get you (and the rest of your field) one step closer to reproducible, transparent research.

1. Learn to code in some language. Any language.

Here’s the deal: it’s easier to use black-box applications to run your analyses than to create scripts. Everyone knows this. You put in some numbers and out pop your results; you’re ready to write up your paper and get that H-index headed upwards. But this approach will not cut the mustard for much longer in the research world. Researchers need to know how to code. Growing amounts and diversity of data, more interdisciplinary collaborators, and increasing complexity of analyses mean that no longer can black-box models, software, and applications be used in research. The truth is, if you want your research to be reproducible and transparent, you must code. In a 2013 article “The Big Data Brain Drain: Why Science is in Trouble“, Jake Vanderplas argues that

In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well.

I learned MATLAB in graduate school, and experimented with R during a postdoc. I wish I’d delved into this world earlier, and had more skills and knowledge about best practices for scientific software. Basically, I wish I had attended a Software Carpentry bootcamp.

The growing number of Software Carpentry (SWC) bootcamps are more evidence that researchers are increasingly aware of the importance of coding and reproducibility. These bootcamps teach researchers the basics of coding, version control, and similar topics, with the potential for customizing the course’s content to the primary discipline of the audience. I’m a big fan of SWC – read more in my blog post on the organization. Check out SWC founder Greg Wilson’s article on some insights from his years in teaching bootcamps: Software Carpentry: Lessons Learned.

2. Stop using Excel. Or at least stop ONLY using Excel.

Most seasoned researchers know that Microsoft Excel can be potentially problematic for data management: there are loads of ways to manipulate, edit, reorder, and change your data without really knowing exactly what you did. In nerd terms, the trail of dataset changes is known as provenance; generally Excel is terrible at documenting provenance. I wrote about this a few years ago on the blog, and we mentioned a few of the more egregious ways people abuse Excel in our F1000Research publication on the DataUp tool. More recently guest blogger Kara Woo wrote a great post about struggles with dates in Excel.

Of course, everyone uses Excel. In our surveys for the DataUp project, about 88% of the researchers we interviewed used Excel at some point in their research. And we can’t expect folks to stop using it: it’s a great tool! It should, however, be used carefully. For instance, don’t manipulate the sole copy of your raw data in Excel; keep your raw data raw. Use Excel to explore your data, but use other tools to clean and analyze it, such as R, Python, or MATLAB (see #1 above on learning to code). For more help with spreadsheets, see our list of resources and tools: UC3 Spreadsheet Help.
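A minimal sketch of that habit in R (the file and column names here are hypothetical): read the raw file, clean a copy, and write the cleaned version somewhere else so the raw data stay untouched.

# Keep your raw data raw: read it, clean a copy, and save the cleaned
# version to a new file. File and column names are made up for illustration.
raw <- read.csv("data/raw/field_survey_2014.csv", stringsAsFactors = FALSE)

cleaned <- raw
cleaned$species <- trimws(tolower(cleaned$species))  # tidy up a text column
cleaned <- cleaned[!is.na(cleaned$abundance), ]      # drop rows missing a key value

# The raw file is never overwritten; the cleaned copy lives elsewhere.
write.csv(cleaned, "data/cleaned/field_survey_2014_clean.csv", row.names = FALSE)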

3. Learn about how to properly care for your data.

You might know more about your data than anyone else, but you aren’t so smart when it comes to stewardship of your data. There are some great guidelines for how best to document, manage, and generally care for your data; I’ve collected some of my favorites here on CiteULike with the tag best_practices. Pick one (or all of them) to read and make sure your data don’t get short shrift.

4. Write a data management plan.

I know, it sounds like the ultimate boring activity for a Friday night. But these three words (data management plan) can make a HUGE difference in the time and energy spent dealing with data during your thesis. Basically, if you spend some time thinking about file organization, sample naming schemes, backup plans, and quality control measures, you can save many hours of heartache later. Creating a data management plan also forces you to better understand best practices related to data (#3 above). Don’t know how to start? Head over to the DMPTool to write a data management plan. It’s free to use, and you can get an idea for the types of things you should consider when embarking on a new project. Most funders require data management plans alongside proposal submissions, so you might as well get the experience now.

5. Read Reinventing Discovery by Michael Nielsen.

Reinventing Discovery: The New Era of Networked Science by Michael Nielsen was published in 2013, and I’ve since heard it referred to as the Bible for Open Science, and the must-read book for anyone interested in engaging in the new era of 4th paradigm research. I’ve only just recently read the book, and wow. I was fist-pumping quite a bit while reading it, which must have made fellow airline passengers wonder what the fuss was about. If they had asked, I would have told them about Nielsen’s stellar explanation of the necessity for and value of openness and transparency in research, the problems with current incentive structures in science, and the steps we should all take towards shifting the culture of research to enable more connectivity and faster progress. Just writing this blog post makes me want to re-read the book.

6. Learn version control.

My blog post, Git/GitHub: a Primer for Researchers covers much of the importance of version control. Here’s an excerpt:

From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.”  We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Version control systems can make this much easier, but they are not necessarily intuitive for the fledgling coder. It might take a little time (plus attending a Software Carpentry Bootcamp) to understand version control, but it will be well worth your time. As an added bonus, your work can be more reproducible and transparent by using version control. Read Karthik Ram’s great article, Git can facilitate greater reproducibility and increased transparency in science.

7. Pick a way to communicate your science to the public. Then do it.

You don’t have to have a black belt in Twitter or run a weekly stellar blog to communicate your work. But you should communicate somehow. I have plenty of researcher friends who feel exasperated by the idea that they need to talk to the public about their work. But the truth is, in the US this communication is critical to our research future. My local NPR station recently ran a great piece called Why Scientists are seen as untrustworthy and why it matters. It points out that many (most?) scientists aren’t keen to spend a lot of time engaging with the broader public about their work. However:

…This head-in-the-sand approach would be a big mistake for lots of reasons. One is that public mistrust may eventually translate into less funding and so less science. But the biggest reason is that a mistrust of scientists and science will have profound effects on our future.

Basically, we are avoiding the public at our own peril. Science funding is on the decline, we are facing increasing scrutiny, and it wouldn’t be hyperbole to say that we are at war without even knowing it. Don’t believe me? Read this recent piece in Science (paywall warning): Battle between NSF and House science committee escalates: How did it get this bad?

So start talking. Participate in public lecture series, write a guest blog post, talk about your research to a crotchety relative at Thanksgiving, or write your congressman about the governmental attack on science.

8. Let everyone watch.

Consider going open. That is, do all of your science out in the public eye, so that others can see what you’re up to. One way to do this is by keeping an open notebook. This concept throws out the idea that you should be a hoarder, not telling others of your results until the Big Reveal in the form of a publication. Instead, you keep your lab notebook (you do have one, right?) out in a public place, for anyone to peruse. Most often an open notebook takes the form of a blog or a wiki, and the researcher updates their notebook daily, weekly, or whatever is most appropriate. There are links to data, code, relevant publications, or other content that helps readers, and the researcher themselves, understand the research workflow. Read more in these two blog posts: Open Up  and Open Science: What the Fuss is About.

9. Get your ORCID.

ORCID stands for “Open Researcher & Contributor ID”. The ORCID Organization is an open, non-profit group working to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. The endgame is to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors. Basically, researcher identifiers are like social security numbers for scientists. They unambiguously identify you throughout your research life.

Lots of funders, tools, publishers, and universities are buying into the ORCID system. It’s going to make identifying researchers and their outputs much easier. If you have a generic, complicated, compound, or foreign name, you will especially benefit from claiming your ORCID and “stamping” your work with it. It allows you to claim what you’ve done and keep you from getting mixed up with that weird biochemist who does studies on the effects of bubble gum on pet hamsters. Still not convinced? I wrote a blog post a while back that might help.

10. Publish in OA journals, or make your work OA afterward.

A wonderful post by Michael White, Why I don’t care about open access to research: and why you should, captures this issue well:

It’s hard for me to see why I should care about open access…. My university library can pay for access to all of the scientific journals I could wish for, but that’s not true of many corporate R&D departments, municipal governments, and colleges and schools that are less well-endowed than mine. Scientific knowledge is not just for academic scientists at big research universities.

It’s easy to forget that you are (likely) among the privileged academics. Not all researchers have access to publications, and this is even more true for the general public. Why are we locking our work in the Ivory Tower, allowing for-profit publishers to determine who gets to read our hard-won findings? The Open Access movement is going full throttle these days, as evidenced by increasing media coverage (see “Steal this research paper: you already paid for it” from MotherJones, or The Guardian’s blog post “University research: if you believe in openness, stand up for it“). So what can you do?

Consider publishing only in open access journals (see the Directory of Open Access Journals). Does this scare you? Are you tied to a disciplinary favorite journal with a high impact factor? Then make your work open access after publishing in a standard journal. Follow my instructions here: Researchers! Make Your Previous Work #OA.

Openness is one of the pillars of a stellar academic career. From Flickr by David Pilbrow.


Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter“. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and divides individuals into one of five categories depending on the point in the curve at which they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” – GitHub, in this case – will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power GitHub for science, you need to understand the concept of version control. From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.”  We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems – most notably, Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.
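Before getting to GitHub, here is a minimal sketch of the basic rhythm of version control, driven from R via system() so it stays in the same language as the other examples on this blog. It assumes git is installed and that you are working inside a project directory; the file name is made up.

# A bare-bones version control rhythm, run from R with system().
# Assumes git is installed and the working directory is your project folder.
system("git init")                            # start tracking this project

writeLines("# Analysis notes", "notes.md")    # a hypothetical file to track
system("git add notes.md")                    # stage the new file
system('git commit -m "Add analysis notes"')  # record it, with a commit message

# Later edits get their own commits, so every version remains recoverable.
cat("Tried the model with a temperature term.\n", file = "notes.md", append = TRUE)
system('git commit -am "Note the temperature model attempt"')

system("git log --oneline")                   # the project history so far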

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way to develop software projects efficiently, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how git works, its structure, or how to incorporate git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science“, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or appended file names with initials. Instead, a co-author can submit changes and document those with “commit messages” (read about them on GitHub here).

3. Feedback and review is easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded in the way that it did. For the super-savvy GitHubber, you can make your entire manuscript, from the first datapoint collected to the final submitted version, available and traceable on your site. This is my goal for my next manuscript.

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.

References and other good reads


Abandon all hope, ye who enter dates in Excel

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.1 After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.2 Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).3 Furthermore, the 1900 date system deliberately, and erroneously, assumes that 1900 was a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.

excel_secchi_data

This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.
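The root of the mix-up can be seen directly in R: the same serial number that Excel stores under the hood comes out 1,462 days apart depending on which origin is assumed, which for these 1960s dates works out to exactly four years and a day. The serial number below is made up, not taken from the Secchi file.

# The same underlying Excel serial number, interpreted under each date system.
# R's conventional origins: "1899-12-30" for the 1900 (Windows) system and
# "1904-01-01" for the 1904 (Mac) system. The serial number here is invented.
serial <- 22000
as.Date(serial, origin = "1899-12-30")  # how the 1900 system reads it
as.Date(serial, origin = "1904-01-01")  # how the 1904 system reads it: 1,462 days later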

My solution for these particular data was as follows:

# function to add four years and a day to a given date
fix_excel_dates <- function(date) {
    require(lubridate)
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?
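For files that do carry a redundant year or month column, the check my colleague did by eye can at least be automated. A rough sketch along those lines, using the column names from this dataset:

# Sanity check: flag rows where the parsed date disagrees with the redundant
# year and month columns. A non-empty result means something is off.
library(lubridate)
bad_rows <- secchi_main[year(secchi_main$corrected_date) != secchi_main$year |
                        month(secchi_main$corrected_date) != secchi_main$month, ]
nrow(bad_rows)  # should be 0 if the dates were read (or fixed) correctly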

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say:

and-thats-why-excel


  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”

Software Carpentry and Data Management

About a year ago, I started hearing about Software Carpentry. I wasn’t sure exactly what it was, but I envisioned tech-types showing up at your house with routers, hard drives, and wireless mice to repair whatever software was damaged by careless fumblings. Of course, this is completely wrong. I now know that it is actually an ambitious and awesome project that was recently adopted by Mozilla and got a boost from the Alfred P. Sloan Foundation (how is it that they always seem to be involved in the interesting stuff?).

From their website:

Software Carpentry helps researchers be more productive by teaching them basic computing skills. We run boot camps at dozens of sites around the world, and also provide open access material online for self-paced instruction.

SWC got its start in the 1990s, when its founder, Greg Wilson, realized that many of the scientists who were trying to use supercomputers didn’t actually know how to build and troubleshoot their code, much less use things like version control. More specifically, most had never been shown how to do four basic tasks that are fundamentally important to any science involving computation (which is increasingly all science):

  • growing a program from 10 to 100 to 1,000 lines without creating a mess
  • automating repetitive tasks
  • basic quality assurance
  • managing and sharing data and code

Software Carpentry is too cool for a reference to the Carpenters. From marshallmatlock.com (click for more).

Greg started teaching these topics (and others) at Los Alamos National Laboratory in 1998. After a bit of stop and start, he left a faculty position at the University of Toronto in April 2010 to devote himself to it full-time. Fast forward to January 2012, and Software Carpentry became the first project of what is now the Mozilla Science Lab, supported by funding from the Alfred P. Sloan Foundation.

This new incarnation of Software Carpentry has focused on offering intensive, two-day workshops aimed at grad students and postdocs. These workshops (which they call “boot camps”) are usually small – typically 40 learners – with low student-teacher ratios, ensuring that those in attendance get the attention and help they need.

Other than Greg himself, whose role is increasingly to train new trainers, Software Carpentry is a volunteer organization. More than 50 people are currently qualified to instruct, and the number is growing steadily. The basic framework for a boot camp is this:

  1. Someone decides to host a Software Carpentry workshop for a particular group (e.g., a flock of macroecologists, or a herd of new graduate students at a particular university). This can be fellow researchers, department chairs, librarians, advisors — you name it.
  2. Organizers round up funds to pay for travel expenses for the instructors and any other anticipated workshop expenses.
  3. Software Carpentry matches them with instructors according to the needs of their group; together, they and the organizers choose dates and open up enrolment.
  4. The boot camp itself runs eight hours a day for two consecutive days (though there are occasionally variations). Learning is hands-on: people work on their own laptops, and see how to use the tools listed below to solve realistic problems.

That’s it! They have a great webpage on how to run a bootcamp, which includes checklists and thorough instructions on how to ensure your boot camp is a success. About 2300 people have gone through a SWC bootcamp, and the organization hopes to double that number by mid-2014.

The core curriculum for the two-day boot camp is usually:

Software Carpentry also offers over a hundred short video lessons online, all of which are CC-BY licensed  (go to the SWC webpage for a hyperlinked list):

  • Version Control
  • The Shell
  • Python
  • Testing
  • Sets and Dictionaries
  • Regular Expressions
  • Databases
  • Using Access
  • Data
  • Object-Oriented Programming
  • Program Design
  • Make
  • Systems Programming
  • Spreadsheets
  • Matrix Programming
  • MATLAB
  • Multimedia Programming
  • Software Engineering

Why focus on grad students and postdocs? They focus on graduate students and post-docs because professors are often too busy with teaching, committees, and proposal writing to improve their software skills, while undergrads have less incentive to learn since they don’t have a longer-term project in mind yet. They’re also playing a long game: today’s grad students are tomorrow’s professors, and the day after that, they will be the ones setting parameters for funding programs, editing journals, and shaping science in other ways. Teaching them these skills now is one way – maybe the only way – to make computational competence a “normal” part of scientific practice.

So why am I blogging about this? When Greg started thinking about training researchers to understand the basics of good computing practice and coding, he couldn’t have predicted the huge explosion in the availability of data, the number of software programs to analyze those datasets, and the shortage of training that researchers receive in dealing with this new era. I believe that part of the reason funders stepped up to help the mission of Software Carpentry is that now, more than ever, researchers need these skills to successfully do science. Reproducibility and accountability are in more demand, and data sharing mandates will likely morph into workflow sharing mandates. Ensuring reproducibility in analysis is next to impossible without the skills Software Carpentry’s volunteers teach.

My secret motive for talking about SWC? I want UC librarians to start organizing bootcamps for groups of researchers on their campuses!


Software for Reproducibility Part 2: The Tools

From the Calisphere collection “American war posters from the Second World War”, Contributed by UC Berkeley Bancroft Library (Click on image for more information)

Last week I wrote about the workshop I attended (Workshop on Software Infrastructure for Reproducibility in Science), held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows. I provided some broad-strokes overviews last week; this week, I’ve created a list of some of the tools we saw during the workshop. Note: the level of detail for tools is consistent with my level of fatigue during their presentation!

Sumatra

  • Presenter: Andrew Davison
  • Short description: Sumatra is a library of components and graphical user interfaces (GUIs) for viewing and using tools via Python. This is an “electronic lab notebook” kind of idea.
  • Sumatra needs to interact with version control systems, such as Subversion, Git, Mercurial, or Bazaar.
  • Future plans: Integrate with R, Fortran, C/C++, Ruby.

From the Sumatra website:

The solution we propose is to develop a core library, implemented as a Python package, sumatra, and then to develop a series of interfaces that build on top of this: a command-line interface, a web interface, a graphical interface. Each of these interfaces will enable: (1) launching simulations/analyses with automated recording of provenance information; and (2) managing a computational project: browsing, viewing, deleting simulations/analyses.

Taverna

  • Presenter: Carole Goble
  • Short description: Taverna is an “execution environment”, i.e. a way to design and execute formal workflows.
  • Other details: Written in Java. Consists of the Taverna Engine (the workhorse), the Taverna Workbench (desktop client) and Taverna Server (remote workflow execution server) that sit on top of the Engine.
  • Links up with myExperiment, a publication environment for sharing workflows. It allows you to put together the workflow, description, files, data, documents etc. and upload/share with others.

From the Taverna website:

Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.

IPython Notebook

  • Presenter: Brian Granger
  • IPython Notebook is a web-based computing environment and open document format for telling stories about code and data. An offshoot of the IPython Project (which is focused on interactive computing).
  • Very code-centric: text (in Markdown), LaTeX, images, and code can be interwoven. Linked up with GitHub.

Galaxy

  • Presenter: James Taylor
  • Galaxy is an open source, free web service integrating a wealth of tools, resources, etc. to simplify researcher workflows in genomics.
  • Biggest challenge is archiving. They currently have 1 PB of user data due to the abundance of sequences used.

From their website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

Madagascar

  • Presenter: Sergey Fomel
  • Geophysics-focused project management system.

From the Madagascar website:

Madagascar is an open-source software package for multidimensional data analysis and reproducible computational experiments. Its mission is to provide

  • a convenient and powerful environment
  • a convenient technology transfer tool

for researchers working with digital image and data processing in geophysics and related fields. Technology developed using the Madagascar project management system is transferred in the form of recorded processing histories, which become “computational recipes” to be verified, exchanged, and modified by users of the system.

VisTrails

  • Presenter: David Koop
  • VisTrails is an open-source scientific workflow and provenance management system that provides support for simulations, data exploration and visualization.
  • It was built with ideas of provenance and transparency in mind.  Essentially the research “trail” is followed as users generate and test a hypothesis.
  • Focuses on change-based provenance: it keeps track of all changes via a version tree.
  • There’s also execution provenance – start time and end time, where it was run, etc.; this is instrument-specific metadata.
  • Crowdlabs.org: associated social website for sharing workflows and provenance

RCloud

  • No formal website; GitHub site only
  • Presenter: Carlos Scheidegger
  • RCloud was developed for programmers who use R at AT&T Labs. They were having interactive sessions in R for data exploration and exploratory data analysis. Their idea: what if every R session was transparent and automatically versioned?
  • I know this is a bit of a thin description… but this is an ongoing, active project. I’m anxious to see where it goes next.

ReproZip 

  • No formal website, but see a 15 minute demo here
  • Presenter: Fernando Chirigati
  • The premise: few computational experiments are reproducible. To get closer to reproducibility, we need a record of: data description, experiment specs, description of environment (in addition to code, data etc).
  • ReproZip automatically and systematically captures required provenance of existing experiments. It does this by “packing” experiments. How it works:
    • ReproZip executes the experiment; SystemTap captures the provenance, and each node of the process has details attached to it (a provenance tree).
    • The necessary components are identified.
    • A workflow specification is generated and fed into a workflow system.
  • When you’re ready to examine, explore, or verify an experiment, the package is extracted; ReproZip unloads the experiment and workflows.

Open Science Framework

  • Presenters: Brian Nosek & Jeff Spies
  • In brief, the OSF is a nifty way to track projects, work with collaborators, and link together tools from your workflow.
  • You simply go to the website and start a project (for free). Then add contributors and components to the project. Voila!
  • A neat feature – you can fork projects.
  • Provides a log of activities (i.e., version control) & access control (i.e., who can see your stuff).
  • As long as two software systems work with OSF, they can work together – OSF allows the APIs to “talk”.

From the OSF website:

The Open Science Framework (OSF) is part network of research materials, part version control system, and part collaboration software. The purpose of the software is to support the scientist’s workflow and help increase the alignment between scientific values and scientific practices.

RunMyCode

  • Presenter: Victoria Stodden
  • The idea behind this service is that researchers create “companion websites” associated with their publications. These websites allow others to implement the methods used in the paper.

From the RunMyCode website:

RunMyCode is a novel cloud-based platform that enables scientists to openly share the code and data that underlie their research publications. This service is based on the innovative concept of a companion website associated with a scientific publication. The code is run on a computer cloud server and the results are immediately displayed to the user.

Dexy

  • Presenter: Ana Nelson
  • Tagline: make | docs | sexy (what more can I say?)
  • Dexy is an open source platform that allows those writing documents with underpinning code to combine the two.
  • Can mix/match coding languages and documentation formats. (e.g., Python, C, Markdown to HTML, data from APIs, WordPress, etc.)

From their website:

Dexy lets you continue to use your favorite documentation tools, while getting more out of them than ever, and being able to combine them in new and powerful ways. With Dexy you can bring your scripting and testing skills into play in your documentation, bringing project automation and integration to new levels.

DuraSpace 

  • Presenter: Jonathan Markow
  • DuraSpace’s mission is access, discovery, and preservation of scholarly digital data. Their primary stakeholders and audience are libraries. As part of this role, DuraSpace is a steward of open source projects (Fedora, DSpace, VIVO).
  • Their main service is DuraCloud: an online storage and preservation service that does things like repair corrupt files, move online copies offsite, distribute content geographically, and scale up or down as needed.

Dataverse 

  • Presenter: Merce Crosas
  • Dataverse is a virtual archive for “research studies”. These research studies are containers for data, documentation, and code that are needed for reproducibility.

From their website:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.
