Google Refine: An Interesting Take on Data Organization

“A powerful tool for working with messy data.”  This is the tagline for Google Refine, a web-based application that can be used to manipulate and clean up data sets.  As for its history: in 2010 Google acquired Metaweb Technologies, Inc., the company that developed Freebase Gridworks, and rebranded the application as Google Refine.

I certainly don’t claim to be an expert on exactly how Google Refine works, but it has great potential.  You download the application, which then runs in your browser.  From within Google Refine, you can upload a spreadsheet from your computer or import one from the web.  You can then manipulate your data: remove duplicates, rename cell entries in bulk, etc.  The underlying code is available, and it appears that developers are encouraged to participate.  Alternatively, if you are generally fearful of code, Google Refine “protects users from all that nasty command line stuff,” as my smart friend Karthik says.
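For readers who are not fearful of code, here is a rough idea of what removing duplicates and renaming cell entries in bulk amounts to.  This is only a minimal Python sketch of the general idea, not Google Refine's own code or API; the sample data and the helper names (normalize, bulk_rename, remove_duplicates) are made up for illustration.

```python
import csv
import io

# Hypothetical messy data; the column names and values are invented for this sketch.
RAW = """species,site
Ursus arctos,Site A
ursus arctos ,Site A
Ursus arctos,Site A
Canis lupus,site a
"""


def normalize(value):
    """Trim stray whitespace and collapse internal runs of spaces."""
    return " ".join(value.split())


def bulk_rename(rows, column, mapping):
    """Replace every cell in `column` whose value appears as a key in `mapping`."""
    return [{**row, column: mapping.get(row[column], row[column])} for row in rows]


def remove_duplicates(rows):
    """Drop rows that exactly repeat an earlier row, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Read the spreadsheet-like text and tidy the whitespace in every cell.
rows = [
    {column: normalize(value) for column, value in row.items()}
    for row in csv.DictReader(io.StringIO(RAW))
]

# Rename inconsistent entries in bulk, then drop the rows that are now identical.
rows = bulk_rename(rows, "species", {"ursus arctos": "Ursus arctos"})
rows = bulk_rename(rows, "site", {"site a": "Site A"})
cleaned = remove_duplicates(rows)

for row in cleaned:
    print(row["species"], "|", row["site"])
```

The order matters here: the inconsistent entries are renamed first, so that rows which differ only in capitalization or spacing become identical and can then be caught as duplicates.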

The trajectory of the DCXL project is still in flux, but I can say with certainty that Google Refine is a pretty great web-based application we can aspire to learn from in the course of our development.  Just yesterday the blog iPhylo had a great post about using Google Refine along with taxonomic databases.  This is one of the features we would like to incorporate into the DCXL project, so it’s great to hear that others have been hammering away at the problem of linking controlled vocabularies and data sets.

Want to know a bit more? Here’s Google’s blog entry about Google Refine.  FlowingData also posted a blog entry about Google Refine, which is where I first heard of it.  Freebase (the database developed by Metaweb Technologies, Inc.) has a Twitter feed, @fbase, that mentions Google Refine quite a bit.

And in keeping with the organization theme of this post, here are some links to one of my latest artist crushes: Ursus Wehrli.  He’s the embodiment of organization, in beautiful art form.  One of his photographs is below, but check out his TED Talk, this Visual News post about him, or Google image search him for more amazing visuals.

If you love organizing as much as me, check out the artist Ursus Wehrli. He tidies up in amazing, artsy ways. From Flickr by Lawrence

3 thoughts on “Google Refine: An Interesting Take on Data Organization”

  1. Todd Vision says:

    Also relevant, the recent post from Rod Page who has been using Google Refine to integrate different biodiversity datasets by reconciling taxonomic names: http://iphylo.blogspot.com/2012/02/using-google-refine-and-taxonomic.html

  2. Tom Morris says:

    A few clarifications:
    – Metaweb (the company) was acquired by Google
    – Metaweb developed both Freebase (the database) and Freebase Gridworks (the tool)
    – Freebase Gridworks was renamed Google Refine after the Metaweb acquisition
    – the Refine project definitely encourages developers to participate. I’m a contributor who doesn’t work for Google.
    – the tool is “web-based” from a technology point of view, but the web server runs locally on your own machine, so you retain complete control of your data


    Microsoft’s Excel format as well as OpenOffice Calc and a whole raft of other formats are supported natively.

    • cstrasser says:

      Thanks for the clarifications, Tom! I figured I was getting some of those “who’s on first” details wrong. Thanks for keeping me honest.
