Ecology Spreadsheets 101

While at ESA, I wore two hats- one as the DCXL project manager, and one as a DataONE postdoc working on engaging the scientific community in data management, education, and archiving (see more about my work with DataONE here).  Part of my DataONE tasks at ESA involved participating in two workshops related to data management (slides for both workshops are available here).

For one of the workshops, attendees brought their Excel spreadsheets and allowed organizers to take a look.  The goal was to help them better organize their own data, but I might have learned just as much as they did.  I observed several common practices that represent poor data management, and were almost universally used among Ecologists.  So if you are an Excel user, here’s some tips that might help you get one step closer to being a rockstar at data organization (and more prepared for data archiving):

  1. Create only one table per spreadsheet. I too am guilty of creating multiple tables, separated by empty columns, on a single spreadsheet.  Perhaps these tables represent different sites, or species, or days.  A better option, however, is to separate these tables into multiple spreadsheets (or different tabs in a single Excel file).
  2. Many researchers use comments, graph things, and make notes on a spreadsheet while pondering how best to display their data. However all of these things will make your spreadsheet unreadable to many statistical programs and make exporting as a text file for archiving more difficult.  If eschewing these tools inhibits your creativity, consider maintaining a “raw” tab that is not cluttered by miscellaneous information.
  3. Following up on that last point, keep your raw data raw. Once you’ve entered all of your data into your spreadsheet, you should save it with the title “DO_NOT_TOUCH_UNDER_ANY_CIRCUMSTANCES_RAW_DATA.xls” (or perhaps some other, more informative title of your choosing).  Raw data should stay raw- it guarantees that you always have a starting point if you screw something up during analysis, quality control, or other points in your research.  Manipulate and graph your data in a separate spreadsheet from your raw data.
  4. Use good organization in your tables. Ideally, a row is a complete record, and a column consists of all of the data that make up that record.
  5. Construct your spreadsheet so that new information adds rows, not columns. If the overall structure of your spreadsheet doesn’t change with each field season or sample site, it will be easier to import into other programs or reanalyze updated data files.
  6. Atomize: make sure each cell has only one piece of information. For instance, if your spreadsheet has a column titled Location with an entry “Austin, TX”, an improvement would be to create two columns for location, one for City and the other State. Commas and spaces in data cells might cause problems for archiving data with certain data centers, and also inhibits importing spreadsheets into other software programs.
  7. Use plain text ASCII characters for variable names and data.  Again, this ensures your data are accurate if you export them to other programs

Do you have some suggestions for good data organization in Excel spreadsheets? Pass them along by commenting below.

Tagged , , , ,

One thought on “Ecology Spreadsheets 101

  1. Sky says:

    I’m interested in the DCXL project, heard about it through DataONE previously, and will pay attention to where this goes. I might, however, disagree with at least your example on atomization of data within cells. It depends on the information and its intended use. For instance, using “Austin, TX” as a value is not a big deal if the intention is to run a rough geocoding on place names, in which case that whole string is sufficient to turn up a general mappable location.

    This points to a somewhat larger problem inherent in using Excel as a raw data management platform. The gist of this project seems to be about embedding some native capability in Excel to facilitate better documentation and archiving. I hope the intent is to provide some capacity such that metadata is contained within the resulting archived file to avoid the problem in many data formats where metadata is in some way separate from data. Are you thinking about various ways that individual columns in a raw data table might be documented as to their provenance, details about units, and intended use? Tools like Morpho for creating EML metadata have one take on this problem, but how can we embed that deeper information with the data themselves such that computer programs can do smart things with a little data introspection? Some simple methods employed today include reserving several rows in a spreadsheet to describe in human- and machine-readable terms details about the attribute – using standards or not.

    I’d be interested to hear more about the level of metadata depth the DCXL project is thinking of tackling. I’ve searched a little bit here on the site but don’t see those details as yet.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: