Abandon all hope, ye who enter dates in Excel

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.1 After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.2 Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).3 Furthermore, the 1900 date system purposely erroneously assumes that 1900 was a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.

excel_secchi_data

This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.

My solution for these particular data was as follows:

# function to add four years and a day to a given date
fix_excel_dates <- function(date) {
    require(lubridate)
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say:

and-thats-why-excel


  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”
Tagged , , ,

18 thoughts on “Abandon all hope, ye who enter dates in Excel

  1. Python properly handles the frightening Excel date import issue http://nbviewer.ipython.org/gist/ethanwhite/10403398 … #otherpeoplesdata #pydata @ethanwhite @kara_woo @wesmckinn

  2. Great post Kara! Scary stuff.
    Kara kindly provided an example dataset so that I could check for this problem in Python. Fortunately Python imports of Excel files don’t seem to have this issues: http://nbviewer.ipython.org/gist/ethanwhite/10403398

  3. Also, did you report this as a bug to the XLConnect folks? Since it works in Pandas it must be fixable in concept and I think it’s an important thing to get fixed.

  4. Kate says:

    Great post!

    Excel is highly problematic in that it assumes 7/32 is July 1932, not July 1832, or July 1732 or just seven thirty-seconds of an inch. It is a pain to have to remember to format the fields as text so that Excel will not change the essential meaning of data by first interpreting it as a *date* and then interpreting the presumed year.

  5. Austin says:

    I usually just convert to the day of the year in excel using =A1-DATE(YEAR(A1),1,0)
    Should i still worry?

  6. Nice post, thanks. This bit of code solves your problem of having to first open the file in Excel in order to check the date format (at least it worked on a test file):

    library(XLConnect)
    # load the wokbook
    wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
    # create a new worksheet in the workbook of interest
    createSheet(wb1,"dateTest")
    # write the current date to the newly created sheet
    writeWorksheet(wb1,Sys.Date(), "dateTest")
    # save the workbook
    saveWorkbook(wb1)
    # read the current date back
    dateQ <- readWorksheetFromFile("Baikal_Secchi_64to02.xlsx", sheet = "dateTest")[1,1]
    # check to see if the years match; if not fix the dates
    if(format(dateQ,"%Y")!=format(Sys.Date(),"%Y")) {
    # call function to correct dates…
    }

  7. […] Heads up: Do not trust Microsoft Excel, whatsoever, with dates. It doesn’t know how to handle them. […]

  8. Does this issue with dates extend to Open Office/LibreOffice?

    • Kara Woo says:

      So far I haven’t read anything about multiple date systems in LibreOffice or Open Office. And if you’re storing your data as .csv instead of .xls/.xlsx you should be pretty safe.

  9. […] Abandon All Hope, Ye Who Enter Dates in Excel by Kara Woo. Time and time again time/date formats show themselves to be the bane of every developer working with timestamped data. If you’re working in Excel with date dependent data this is a must read. […]

  10. Neil Tangri says:

    Another alternative: when you have to work in Excel, save all your files into CSV format before uploading into R. That eliminates the date problem, but it also means that your files won’t expire when Microsoft decides to stop supporting old Excel formats (yes, that does happen).

    • Kara Woo says:

      Oh yes, absolutely. I save all my data in CSVs or other text formats, and encourage others to do the same. Unfortunately when working with other people’s data I don’t always have a choice (I prefer not to make any changes to the original data files if I can avoid it), so I’m hoping to at least make people aware of some of the issues associated with storing data in Excel files.

  11. […] wait. That’s the default on my version of Excel. Turns out there are multiple defaults for different versions of the software. And you can switch between them all willy-nilly. So… […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 322 other followers

%d bloggers like this: