Tag Archives: Excel

DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open-source and we encourage community discussion and contribution to our GitHub site.

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID.
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can’t properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform.  Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly.  It is more efficient to continue forward with a single platform and we chose to use Dash as a more sustainable basis for this consolidated service.  Dash is implemented in the  Ruby on Rails framework that is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data.  All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Also, users will not be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.

 

It’s the dawn of a new day for DataUp! From Flickr by David Yu.


Abandon all hope, ye who enter dates in Excel

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.[1] After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.[2] Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).[3] Furthermore, the 1900 date system deliberately (and erroneously) assumes that 1900 was a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.
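To make the difference concrete, here is a minimal R sketch (my own illustration, not from any package) showing how the same serial number maps to two different calendar dates depending on which origin you assume. The conventional origin for the 1900 system is December 30, 1899, with the extra day of offset accounting for that phantom leap day; the 1904 system counts from January 1, 1904.

# Excel stores a date as a serial number; the calendar date depends on the origin.
serial <- 21961  # an arbitrary example serial number

as.Date(serial, origin = "1899-12-30")  # 1900 (Windows default) date system
as.Date(serial, origin = "1904-01-01")  # 1904 (old Mac default) date system

# The two interpretations differ by exactly 1,462 days:
as.numeric(as.Date("1904-01-01") - as.Date("1899-12-30"))  # 1462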

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
# load the workbook and read the first worksheet into a data frame
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
# give the columns tidy names
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.

[Screenshot: the same rows viewed in Excel, where the dates display with the correct 1964 years]

This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.
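As an aside, an .xlsx file is really just a zip archive, and the workbook’s date system is recorded inside it in xl/workbook.xml (a date1904 flag on the workbookPr element). So in principle you can peek at that flag from R without ever launching Excel; a hypothetical sketch, not something I have folded into my workflow:

# check whether an .xlsx file declares the 1904 date system
# (the flag lives in xl/workbook.xml inside the zip archive)
uses_1904_system <- function(path) {
    workbook_xml <- readLines(unz(path, "xl/workbook.xml"), warn = FALSE)
    any(grepl('date1904="(1|true)"', workbook_xml))
}

uses_1904_system("Baikal_Secchi_64to02.xlsx")  # should be TRUE for this file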

My solution for these particular data was as follows:

# function to add four years and a day to a given date
# (shifts dates that were read with the 1900 origin onto the 1904 system)
fix_excel_dates <- function(date) {
    require(lubridate)  # for ymd(), years(), and days()
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?
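One small consolation: when a dataset does carry a redundant year or month column, the comparison that saved me here is easy to automate. A hypothetical sketch, using the column names from the example above:

library(lubridate)

# flag rows where the imported date disagrees with the recorded year or month
suspect <- year(secchi_main$date)  != secchi_main$year |
           month(secchi_main$date) != secchi_main$month

if (any(suspect)) {
    warning(sum(suspect), " rows have dates that conflict with the year/month columns")
}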

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say:

[Meme image: “and-thats-why-excel”]


  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”

DataUp-Date

It’s been over a year since the DataUp tool went live, and we figure it’s time for an update. I’m co-writing this blog post with Susan Borda from UC Merced, who joined the UC3 DataUp project a few months ago.

DataUp Version 1

We went live with the DataUp tool in November 2012. Since then, more than 600 people have downloaded the add-in for Excel, and countless others have accessed the web application. We have had more than 50 submissions of datasets to the ONEShare Repository via DataUp, and many more inquiries about using the free repository. Although the DataUp tool was considered a success by many measures, we recognized that it had even more potential for improvement and expanded features (see our list of suggested improvements and fixes on BitBucket).

"Going Up". From Flickr by vsai

“Going Up”. From Flickr by vsai

Unfortunately, development on DataUp stopped once we went live. The typical reasons apply here – lack of staff and resources to devote to the project. We therefore partnered with DataONE and requested funds from the National Science Foundation to continue work on the tool (full text of the grant available on eScholarship). Shortly after learning that the grant had been awarded, the UC3 team met with Microsoft Research, our original partners on DataUp. We discovered that our interests were still aligned, and that Microsoft had been using in-house resources to continue work on DataUp as an internal project titled “Sequim”. Rather than work in parallel, we decided to join forces and work on DataUp Version 2 (see more below).

In the interim, we published our work on DataUp Version 1 at F1000Research, an open access journal that focuses on rapid dissemination of results and open peer review. In this publication, we describe the project background, the requirements-gathering process (including researcher surveys), and the tool’s implementation.

DataUp Version 2

The NSF grant allowed us to hire Susan Borda, a librarian at UC Merced with a background in IT and knowledge of the DataUp project. She has been serving as the project manager for DataUp Version 2, and has liaised with Microsoft Research on the project. Susan will take over from here to describe what’s on the horizon for DataUp.

The new version of DataUp will be available after February 24th, 2014. This version will have a new, clean web interface with functionality for both users and administrators. A DataUp administrator (i.e., repository manager) will be able to define the file-level metadata that will be captured from the user upon data deposit. In addition, an administrator will be able to activate the “Data Quality Check”, which allows the DataUp tool to verify whether a user’s uploaded file meets certain requirements for their repository. The “Best Practices” and file “Citation” features from DataUp version 1 are still available in version 2.

Note that we will be phasing out DataUp version 1 over the next few weeks, which means the add-in for Excel will no longer be operational.

Dying to see the new tool?

Microsoft Research will be at the International Digital Curation Conference (#IDCC14) in San Francisco at the end of February, demoing and discussing their suite of research tools, including DataUp. Susan will also be at IDCC, demoing DataUp version 2 more informally during the poster session with the goal of getting feedback from delegates.


DataUp is Live!

We are celebrating. From Boston Public Library via Flickr.

That’s right: DataUp is LIVE! I’m so excited I needed to type it twice.  So what does “DataUp is Live!” mean? Several things:

  • The DataUp website (dataup.cdlib.org) is up and running, and is chock full of information about the project, how to participate, and how to get the tool (in either web app or add-in form).
  • The DataUp web application is up and running (www.dataup.org). Anyone with internet access can start creating high-quality, archive-ready data! Would you rather use the tool within Excel? Download the add-in instead (available via the main site).
  • The DataUp code is available. DataUp is an open source project, and we strongly encourage community members to participate in the tool’s continued improvement. Check out the code on BitBucket.
  • The special repository for housing DataUp data, ONEShare, is up and running. This new repository is a special instance of the CDL’s Merritt Repository, and is connected to the DataONE project. ONEShare is the result of collaborations between CDL, University of New Mexico, and DataONE.  Read more in my blog post about ONEShare.
  • Please note that the current version of DataUp is Beta: this means it’s a work in progress. We apologize for any hiccups you may encounter; in particular, there is a known issue that currently prevents spreadsheets archived via DataUp from appearing in DataONE searches.

Today also marks the integration of the old DCXL/DataUp blog with the Data Pub Blog. You probably noticed that they are combined, since the banner at the top says “Data Pub”. I will be posting here from now on, rather than at dataup.cdlib.org. The DataUp URL now holds the main DataUp website. Read more about these changes in my blog post about it. The Data Pub Blog is intended to hold “Conversations About Data”. That means we will run the gamut of potential topics, including (but not limited to) data publication, data sharing, open data, metadata, digital archiving, etc. There are likely to be posts from others at CDL from time to time, which means you will have access to more than just my myopic views on all things data.

The DataUp project’s core team included yours truly, Patricia Cruse (UC3 Director), John Kunze (UC3 Associate Director), and Stephen Abrams (UC3 Associate Director). Of course, no project at CDL is an island. We had SO MUCH help from the great folks here:

  • DataUp Website: Eric Satzman, Abhishek Salve, Robin Davis-White, Rob Valentine, Felicia Poe
  • DataUp Communications: Ellen Meltzer (DataUp Press Release PDF)
  • DataUp development: Mark Reyes, David Loy, Scott Fisher, Marisa Strong
  • Machine configuration: Joseph Somontan
  • Administrative support: Beaumont Yung, Rondy Epting-Day, Stephanie Lew

Thanks to all of you!


Progress & Plans for DataUp Release

Unlike Mick Jagger, our beta testers are satisfied. From wikipedia.org

It was one year ago today that I moved up to the Bay Area to work on DataUp (then DCXL) in earnest.  It seems fitting that this milestone be marked by some significant progress on the project.  No, we haven’t released DataUp to the public yet, but we have a release date slated for this September.  This is very exciting news, especially since the project got off to a bit of a slow  start.  We have been cooking with gas since March, however, and the DataUp tool promises to do much of what I had envisioned on my drive from Santa Barbara last year.

If you are wondering what DataUp looks like, you will need to be patient.  You can, however, see some preliminary responses from our very gracious beta testers.  The good news is this: most folks seem pretty happy with the tool as-is, and many offered some really great feedback that will improve the tool as we move into the community involvement phase of the development effort.

We asked 21 beta testers what they thought of DataUp features, and here are the results:

We expect that the DataUp tool will only improve from here on out, so stay tuned for our big debut in less than two months!


Strange Uses for Excel

Happy Independence Day (Americans) and a belated Happy Canada Day (Canadians)!  We are smack dab in the middle of the lazy days of summer, which means lots of folks are on vacation this week.  To honor these lazy days, I’m providing a complement to my post a few months back on “Fun Uses for Excel”.  In this edition: strange uses for Excel.  I must admit this post will be based primarily on a similar post at chandoo.org that I found fascinating.

  • Floor layouts. Will that couch you love fit underneath the window? You can use Excel as graph paper, mapping out rooms and furniture layouts. This also works for garden layouts and quilt designing (HT @Whitney!)
  • Amuse co-workers with the Speak on Enter feature.  On the chandoo.org post, Jeff Weir said

On those tortuously long work days where the clock seems to be running backward, I often turn on SPEAK ON ENTER and get Excel to speak the words “Take this job and shove it” to my co-workers (It’s actually from a country song, but has famously been covered by the punk group The Dead Kennedys). This really cracks them up. Speak on Enter is one of Excel’s most underrated functions, if you ask me. Why they didn’t put it right there on the ribbon in 2007 is a travesty. In fact, I’m not going to upgrade my version of Excel until they do.

I checked it out, and my Windows version of Excel 2007 has “Speak on Enter”, which you can add to your toolbar by going to “Excel Options” –> “Customize”.  Of course, you can also use the Excel “Speak” features to check data entry. Read more here.

  • Naming kids (?!). It’s actually quite strange how many people mentioned this on chandoo.org as a use for Excel.  Here’s one example from Brian S.:

For each kid, my wife and I separately brainstormed a list of viable first and middle names. I entered them into a workbook to identify any matches. (Thankfully there always have been matches.) Then I had formulas to display all possible combinations of those matches, as well as up to 2 additional “favorites” from each of us. Those results were manually whittled down based on their sound (which combinations appear fine), and whether the associated first/last or first/middle/last initials create an unexpected result. (I, with initials B.S., threw that requirement in.) This always led us to a 1st and 2nd choice. But if necessary, I was ready to move to a Web data extract to determine an additional “name uniqueness” value.

Choose baby names carefully… Nobody wants Jaques Strap for a kid. Moe knew this. (Don’t get it? Read more at http://www.snpp.com/guides/moe_calls.html.) From ign.com.


Survey says…

A few weeks ago we reached out to the scientific community for help on the direction of the DCXL project.  The major issue at hand was whether we should develop a web-based application or an add-in for Microsoft Excel.  Last week, I reported that we decided that rather than choose, we will develop both.  This might seem like a risky proposition: the DCXL project has a one-year timeline, meaning this all needs to be developed before August (!).  As someone in a DCXL meeting recently put it, aren’t we settling for “twice the product and half the features”?  We discussed what features might need to be dropped from our list of desirables based on the change in trajectory; however, we are confident that both of the DCXL products we develop will be feature-rich and meet the needs of the target scientific community.  Of course, this is made easier by the fact that the features in the two products will be nearly identical.

What would Richard Dawson want? Add-in or web app? From Wikipedia. Source: J Graham (1988). Come on Down!!!: the TV Game Show Book. Abbeville Press

How did we arrive at developing an add-in and a web app? By talking to scientists. It became obvious that there were aspects of both products that appeal to our user communities based on feedback we collected.  Here’s a summary of what we heard:

Show of hands: I ran a workshop on Data Management for Scientists at the Ocean Sciences 2012 Meeting in February.  At the close of the workshop, I described the DCXL project and went over the pros and cons of the add-in option and the web app option.  By show of hands, folks in the audience voted about 80% for the web app (n~150).

Conversations: here’s a sampling of some of the things folks told me about the two options:

  • “I don’t want to go to the web. It’s much easier if it’s incorporated into Excel.” (add-in)
  • “As long as I can create metadata offline, I don’t mind it being a web app. It seems like all of the other things it would do require you to be online anyway” (either)
  • “If there’s a link in the spreadsheet, that seems sufficient. (either)  It would be better to have something that stays on the menu bar no matter what file is open.” (Add-in)
  • “The updates are the biggest issue for me. If I have to update software a lot, I get frustrated. It seems like Microsoft is always making me update something. I would rather go to the web and know it’s the most recent version.” (web app)
  • Workshop attendee: “Can it work like Zotero, where there’s ways to use it both offline and online?” (both)

Survey: I created a very brief survey using the website SurveyMonkey. I then sent the link to the survey out via social media and listservs.  Within about a week, I received over 200 responses.

Education level of respondents: [chart not shown]

Survey questions & answers: [charts not shown]

So with those results, there was a resounding “both!” emanating from the scientific community.  First we will develop the add-in since it best fits the needs of our target users (those who use Excel heavily and need assistance with good data management skills).  We will then develop the web application, with the hope that the community at large will adopt and improve on the web app over time.  The internet is a great place for building a community with shared needs and goals; we can only hope that DCXL will be adopted as wholeheartedly as other internet sources offering help and information.


Hooray for Progress!

Great news on the DCXL front! We are moving forward with the Excel add-in and will have something to share with the community this summer.  If you missed it, back in January the DCXL project had an existential crisis: add-in or web-based application? I posted on the subject here and here. We spent a lot of time talking to the community and collating feedback, weighing the pros and cons of each option, and carefully considering how best to proceed with the DCXL project.

And the conclusion we came to… let’s develop both!

Comparing web-based applications and add-ins (aka plug-ins) is really an apples-and-oranges comparison.  How could we discount that a web-based application is yet another piece of software for scientists to learn? Or that an add-in is only useful to those running Excel on a Windows operating system? Instead, we have chosen to first create an add-in (this was the original intent of the project), then move that functionality to a web-based application that will have more flexibility for the longer term.

What do Camus, The Cure, and DCXL have in common? Existentialists at heart. From http://www.openlettersmonthly.com

The capabilities of the add-in and the web-based application will be similar: we are still aiming to create metadata, check the data file for .csv compatibility, generate a citation, and upload the data set to a data repository.  For a full read of the requirements (updated last week), check out the Requirements page on  this site. The implementation of these requirements might be slightly different, but the goals of the DCXL project will be met in both cases: we will facilitate good data management, data archiving, and data sharing.

It’s true that the DCXL project is running a bit behind schedule, but we believe that it will be possible to create the two prototypes before the end of the summer.  Check back here for updates on our progress.


Fun Uses for Excel

"Excel can do WHAT?" Image from Friday (the movie), from newsone.com

It’s Friday! Better still, it’s Friday afternoon!  To honor all of the hard work we’ve done this week, let’s have some fun with Excel.  Check out these interesting uses for Excel that have nothing to do with your data:

Want to see some silly spreadsheet movies? Here ya go.

Excel Hero: Download .xls files that create nifty optical illusions.  Here’s one of them.

From PCWorld, Fun uses for Excel, including a Web radio player that plays inside your worksheet (click to download the zip file and then select a station), or simulating dice rolls in case of a lack-of-dice emergency during Yahtzee.

Here are the results of a Google Image Search for “Excel art”:

[Image: Google Image Search results for “Excel art”]

 

Mona Lisa never looked so smart.  Want to know more? Check out the YouTube video tutorial or read Creating art with Microsoft Excel from the blog digital inspiration.

 
