Workflows Part II: Formal

[Image: a sequined cummerbund and matching bow tie]

Nothing says formal like a sequined cummerbund and matching bow tie. From awkwardfamilyphotos.com.

In my last blog post, I provided an overview of scientific workflows in general. I also covered the basics of informal workflows, i.e. flow charts and commented scripts. Well, put away the tuxedo t-shirt and pull out your cummerbund and bow tie, folks, because we are moving on to formal workflow systems.

A formal workflow (FW for short) is essentially an “analytical pipeline” that takes data in one end and spits out results at the other.  The major difference between a FW and a commented script (one example of an informal workflow) is that a FW can span multiple software systems.  A commented R script for estimating parameters works for R, but what about those simulations you need to run in MATLAB afterward?  Saving the outputs from one program, importing them into another, and continuing the analysis there is a very common practice in modern science.

So how do you link together multiple software systems automatically? You have two options: become one of those geniuses who use the command line for all of their analyses, or use a FW software system developed by one of those geniuses.  The former requires a level of expertise that many (most?) Earth, environmental, and ecological scientists do not possess, myself included.  It involves writing code that will access different software programs on your machine, load data into them, perform analyses, save the results, and use those results as input for a completely different set of analyses, often in a different software program.  FW are often called “executable workflows” because they let you push a single button (e.g., Enter) and obtain your results.
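To make the “one button” idea concrete, here is a minimal sketch of what such a command-line driver script can look like. The step names are hypothetical: in a real pipeline each step might be `Rscript estimate.R` or `matlab -batch simulate`, but plain Unix tools stand in here so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
# A toy "executable workflow": one command runs every step and hands
# each program's output to the next.
set -euo pipefail                                # abort if any step fails

printf 'site,temp\nA,12\nB,\nC,15\n' > raw.csv  # stand-in for your raw data

# Step 1: "clean" the data (drop rows with a missing value)
awk -F, 'NR == 1 || $2 != ""' raw.csv > clean.csv

# Step 2: "analyze" the cleaned data (mean temperature)
awk -F, 'NR > 1 { s += $2; n++ } END { print s / n }' clean.csv > result.txt

cat result.txt                                  # prints 13.5
```

Running the script re-executes the whole pipeline from raw data to result, which is exactly what makes the workflow both executable and reproducible.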

What about FW software systems? These are a bit more accessible for the average scientist.  FW software has been around for about 10 years, with the first user-friendly(ish) breakthrough being the Kepler Workflow System.  Kepler was developed with researchers in mind, and allows the user to drag and drop chunks of analytical tasks into a window.  The user can indicate which data files should be used as inputs and where the outputs should be sent, connecting the analytical tasks with arrows.  Kepler is still in a beta version, and most researchers will find the work required to set up a workflow prohibitive.

The groups that have managed to incorporate workflows into their community of sharing are genomicists; this is because they tend to have predictable data as inputs, with a comparatively small set of analyses performed on those data.  Interestingly, a social networking site called myExperiment has grown up around genomicists’ use of workflows, where researchers can share workflows, download others’ workflows, and comment on those that they have tried.

The benefit of FW is that each step in the analytical pipeline, including any parameters or requirements, is formally recorded.  This means that researchers can reuse both individual steps (e.g., the data cleaning step in R or the maximum likelihood estimation in MATLAB) and the overall workflow.  Analyses can be re-run much more quickly, and repetitive tasks can be automated to reduce the chance of manual error.  Because the workflow can be saved and re-used, it is a great way to ensure reproducibility and transparency in the scientific process.

Although Kepler is not in wide use, it is a great example of something that will likely become commonplace in the researcher’s toolbox over the next decade.  Other FW software includes Taverna, VisTrails, and Pegasus – all with varying levels of user-friendliness and different communities of use.  As the complexity of analyses and the variety of software systems used by scientists continue to increase, FW are going to become a more common part of the research process.  Perhaps more importantly, it is likely that funders will start requiring the archiving of FW alongside data to ensure accountability and reproducibility, and to promote reuse.



12 thoughts on “Workflows Part II: Formal”

  1. Malcolm says:

One should also be careful that data isn’t modified when it changes from one format to another: XDR is a good way to accomplish this.

  2. Ethan White says:

Great post Carly. My take on the matter is that the command line isn’t really that scary. We’ve built it up as such, but our experiences with Software Carpentry (software-carpentry.org) suggest that even half a day of training can get folks using pipes and bash scripts to solve this problem. I suspect that the vast majority of folks who won’t be comfortable with the command line after less training than necessary for Kepler won’t be using multi-language workflows anyway.

  3. Ethan White says:

    Sloppy tagging, sorry, should be:
    Software Carpentry

    For anyone looking for the basics see the Shell Lectures.

  4. Two points: you say “level of expertise…do not possess.” What’s to keep you from attaining it?

    Then there’s “These are a bit more accessible for the average scientist.” Should scientists be average? Or should they be willing to learn new things and develop new methods?

    • Carly Strasser says:

      Joel – In theory, there is nothing to keep scientists (and myself) from attaining a level of expertise necessary for using the CL for their research. However in reality, the short answer is time. Yes, the CL will often save time in the long term, but it’s hard to convince scientists that they should take a week or two to completely revamp their current practices, then another year to become efficient at using the new tool set.

Similarly, I agree that scientists should (and could) become better scientists by learning the CL and related computational skill sets. The reality is, however, that tenure and promotion, along with writing publications, do not require that these skills be learned or used. That means they fall to the bottom of the priority heap.

      Some call me a pessimist, but based on my many conversations and interviews with researchers, it’s just the reality of the incentive structure.

      • Ethan White says:

Carly – While I absolutely agree with the limitations that you point out, I have yet to encounter a set of graphical tools for this sort of thing that actually requires less training than those that already exist (e.g., the sample workflows for Kepler look pretty scary to me). This means that we spend a lot of time and money building tools that aren’t widely adopted because they require just as much time to pick up as more generic tools that already exist. This is of particular concern because if the tool doesn’t achieve broad adoption, then the workflow that results isn’t really much better than writing the steps down in text.

        I also think it’s important to remember that we’re talking about building multi-language workflows, so if folks are getting that sophisticated then investing a little time in learning basic tools seems pretty reasonable. Most folks who don’t have time for more complicated process will just do everything in their single tool of choice anyway, in which case a well constructed script is a formal workflow.

This of course is all part of my overarching view (bias?) that in EEB at the moment we lean towards building big complicated tools to solve problems that could be more easily addressed, with less money, through training; and that this training would then also yield more valuable long-term benefits.

      • Carly Strasser says:

        Okay, okay… I will now show my secret membership card to the scientists-should-suck-it-up society. The more I learn about workflow systems and other software being developed to “help” scientists “streamline” the process, the more I realize that we are going about this ALL WRONG.

        If I could go back in time, I would tell grad student me to learn the CL and how to use scripts for documenting and implementing my workflows. I learned MATLAB, and later R, but that’s not quite the same level of expertise needed. Learning those made me (naively) pat myself on the back for being such a comp-savvy marine scientist. Perhaps this is more of the problem: we aren’t teaching the up-and-coming scientists that it’s no longer adequate to depend on GUI-based software for their needs; they need to know what’s going on “under the hood”.

        I wholeheartedly agree that Kepler and other software systems are more trouble than they are worth. In fact, a source-who-shall-go-unnamed, who has worked with scientists trying to implement Kepler, said that in most cases it would have been easier and faster to write a CL script.

      • “The reality is, however, that tenure and promotion, along with writing publications, does not require that these skills be learned or used.” You’re right about this, in the short term, people are not looking for “are you a flexible thinker who’s willing to learn new things?” However, that really ought to be the mark of a scientist. I found out really early in my research career (i.e. in high school) that being a scientist means learning new things. I never hesitate to learn new things, which is probably why I’m spending more time learning those things than publishing papers as a graduate student. It will probably impact my job prospects, but I will have learned a hell of a lot more than I would have otherwise.

        It shouldn’t be about “convincing scientists” to do things differently: it should be about emphasizing that if you want to really have a workflow, you just need to learn how to adapt things to solving your problems. Science is problem solving, and solving those problems with hammers and nails is better than inventing a snark and a blodger that does the same thing, because the snark is a one-off. The hammer, on the other hand, can be adapted and combined to solve whatever problem comes up next.

  5. Ethan White says:

This is an excellent example of the value of blogs in the sciences. A big thanks to Carly for a great post and a great back and forth.

    The one last thing I would add is that it’s never too late to learn this stuff. The payoff is so immediate that most folks that we teach in Software Carpentry report they are net positive on the time investment within a month or so of the workshops. I learned the shell as an Assistant Professor and I’ve taught it to several other profs, including a full professor who is now bash scripting and piping like a pro.

Wow, excellent discussion and perspectives. It’s nice to see workflow management discussed frankly. Carly & Ethan have both emphasized weaknesses in the “just build a fancy new GUI tool” approach, which is not often reflected elsewhere, such as in the current NSF call for Data Infrastructure Building Blocks.

I agree with the sentiment of Carly, Ethan & Joel that such a reproducible workflow “can” be accomplished with some pipes in a bash script, but there’s also the obvious but important observation that neither the ability nor the practice of stringing together a bunch of scripts in different languages with a few bash pipes guarantees a reproducible workflow. I’d agree that this is more a challenge of learning and teaching than a challenge of building new tools, but I think we should also emphasize that it’s more about learning the right practices than learning the right software.

There certainly are some lessons for good software design along the way. I share Ethan’s sense that as a community we’re perhaps too fond of rich GUI interfaces that turn out to be harder to use even (particularly?) for the non-computer-savvy (i.e. the difficulty of learning Kepler, or the way R has largely displaced Java GUI widgets like Mesquite from the previous decade).

    So, if everyone learns bash, what else still has to happen to make sure those scripts and pipes are truly intelligible and reproducible?

  7. […] Short description: Taverna is an “execution environment”, i.e. a way to design and execute formal workflows. […]
