Tag Archives: web application

Announcing The Dash Tool: Data Sharing Made Easy

We are pleased to announce the launch of Dash – a new self-service tool from the UC Curation Center (UC3) and partners that allows researchers to describe, upload, and share their research data. Dash helps researchers perform the following tasks:

  • Prepare data for curation by reviewing best practice guidance for the creation or acquisition of digital research data.
  • Select data for curation through local file browse or drag-and-drop operation.
  • Describe data in terms of the DataCite metadata schema.
  • Identify data with a persistent digital object identifier (DOI) for permanent citation and discovery.
  • Preserve, manage, and share data by uploading to a public Merritt repository collection.
  • Discover and retrieve data through faceted search and browse.
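
To make the "describe" and "identify" steps above a little more concrete, here is a minimal sketch of a metadata record and the citation that could be built from it. The field names loosely follow DataCite's mandatory properties (identifier, creator, title, publisher, publication year). The record values, the placeholder DOI, and the `format_citation` helper are all made up for illustration and are not how Dash itself is implemented.

```python
# Hypothetical sketch: a DataCite-style record and a citation built from it.
# Field names mirror DataCite's mandatory properties; all values are made up.

def format_citation(record):
    """Build 'Creators (Year): Title. Publisher. Identifier' from a record."""
    creators = "; ".join(record["creators"])
    return (
        f"{creators} ({record['publication_year']}): {record['title']}. "
        f"{record['publisher']}. https://doi.org/{record['identifier']}"
    )


example_record = {
    "identifier": "10.5072/FK2EXAMPLE",  # placeholder DOI, not a real dataset
    "creators": ["Doe, Jane", "Roe, Richard"],
    "title": "Stream temperature measurements",
    "publisher": "Example University",
    "publication_year": 2014,
}

print(format_citation(example_record))
```

Keeping the description as structured fields, rather than free text, is what lets one record feed both a citation and a search index.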

Who can use Dash?

There are multiple instances of the Dash tool that all have similar functions, look, and feel.  We took this approach because our UC campus partners were interested in their Dash tool having local branding (read more). It also allows us to create new Dash instances for projects or partnerships outside of the UC (e.g., DataONE Dash and our Site Descriptors project).

Researchers at UC Merced, UCLA, UC Irvine, UC Berkeley, or UCOP can use their campus-specific Dash instance:

Other researchers can use DataONE Dash (oneshare.cdlib.org). This instance is available to anyone, free of charge. Use your Google credentials to deposit data.

Note: Data deposited into any Dash instance is visible throughout all of Dash. For example, if you are a UC Merced researcher and use dash.ucmerced.edu to deposit data, your dataset will appear in search results for individuals looking for data via any of the Dash instances, regardless of campus affiliation.

See the Users Guide to get started using Dash.

Stay connected to the Dash project:

Dash Origins

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health (Grant Number UL1 TR000004).


Sound the horns! Dash is live! “Fontana del Nettuno” by Sorin P. from Flickr.


New Project: Citing Physical Spaces

A few months ago, the UC3 group was contacted by some individuals interested in solving a problem: how should we reference field stations? Rob Plowes from University of Texas/Brackenridge Field Lab emailed us:

I am on a [National Academy of Sciences] panel reviewing aspects of field stations, and we have been discussing a need for data archiving. One idea proposed is for each field station to generate a simple document with a DOI reference to enable use in publications that make reference to the field station. Having this DOI document would enable a standardized citation that could be tracked by an online data aggregator.

We thought this was a great idea and started having a few conversations with other groups (LTER, NEON, etc.) about its feasibility. Fast forward to two weeks ago, when Plowes and Becca Fenwick of UC Merced presented our more fleshed-out idea to the OBFS/NAML Joint Meeting in Woods Hole, MA (OBFS: Organization of Biological Field Stations; NAML: National Association of Marine Laboratories). The response was overwhelmingly positive, so we are proceeding with the idea in earnest here at the CDL.

The intent of this blog post is to gather feedback from the broader community about our idea, including our proposed metadata fields, our plans for implementation, and whether there are existing initiatives or groups that we should be aware of and/or partner with moving forward.

In a Nutshell

Problem: Tracking publications associated with a field station or site is difficult. There is no clear or standard way to cite field station descriptions.

Proposal: Create individual, citable “publications” with associated persistent identifiers for each field station (more generically called a “site”). Collect these Site Descriptors in the general use DataONE repository, ONEShare. The user interface will be a new instance of the existing UC3 Dash service (under development) with some modifications for Site Descriptors.

What we need from you: 

Moving forward: We plan on gathering community feedback for the next few months, with an eye towards completing a pilot version of the interface by February 2015. We will be ramping up Dash development over the next 12 months thanks to recent funding from the Alfred P. Sloan Foundation, and this development work will include creating a more robust version of the Site Descriptors database.

Project Partners:

  • Rob Plowes, UT Austin/Brackenridge Field Lab
  • Mark Stromberg, UC Berkeley/UC Natural Reserve System
  • Kevin Browne, UC Natural Reserve System Information Manager
  • Becca Fenwick, UC Merced
  • UC3 group
  • DataONE organization

Lovers Point Laboratory (1930), which was later renamed Hopkins Marine Laboratory. From Calisphere, contributed by Monterey County Free Libraries.


DataUp is Merging with Dash!

Exciting news! We are merging the DataUp tool with our new data sharing platform, Dash.

About Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we have plans to make the platform compatible with other repositories using protocols such as SWORD and OAI-PMH. The Dash project is open-source and we encourage community discussion and contribution to our GitHub site.

About the Merge

There is significant overlap in functionality for Dash and DataUp (see below), so we will merge these two projects to enable better support for our users. This merge is funded by an NSF grant (available on eScholarship) supplemental to the DataONE project.

The new service will be an instance of our Dash platform (to be available in late September), connected to the DataONE repository ONEShare. Previously the only way to deposit datasets into ONEShare was via the DataUp interface, thereby limiting deposits to spreadsheets. With the Dash platform, this restriction is removed and any dataset type can be deposited. Users will be able to log in with their Google ID (other options being explored). There are no restrictions on who can use the service, and therefore no restrictions on who can deposit datasets into ONEShare, and the service will remain free. The ONEShare repository will continue to be supported by the University of New Mexico in partnership with CDL/UC3. 

The NSF grant will continue to fund a developer to work with the UC3 team on implementing the DataONE-Dash service, including enabling login via Google and other identity providers, ensuring that metadata produced by Dash will meet the conditions of harvest by DataONE, and exploring the potential for implementing spreadsheet-specific functionality that existed in DataUp (e.g., the best practices check). 

Benefits of the Merge

  • We will be leveraging work that UC3 has already completed on Dash, which has fully-implemented functionality similar to DataUp (upload, describe, get identifier, and share data).
  • ONEShare will continue to exist and be a repository for long tail/orphan datasets.
  • Because Dash is an existing UC3 service, the project will move much more quickly than if we were to start from “scratch” on a new version of DataUp in a language that we can support.
  • Datasets will get DataCite digital object identifiers (DOIs) via EZID.
  • All data deposited via Dash into ONEShare will be discoverable via DataONE.

FAQ about the change

What will happen to DataUp as it currently exists?

The current version of DataUp will continue to exist until November 1, 2014, at which point we will discontinue the service and the dataup.org website will be redirected to the new service. The DataUp codebase will still be available via the project’s GitHub repository.

Why are you no longer supporting the current DataUp tool?

We have limited resources and can’t properly support DataUp as a service due to a lack of local experience with the C#/.NET framework and the Windows Azure platform.  Although DataUp and Dash were originally started as independent projects, over time their functionality converged significantly.  It is more efficient to continue forward with a single platform and we chose to use Dash as a more sustainable basis for this consolidated service.  Dash is implemented in the  Ruby on Rails framework that is used extensively by other CDL/UC3 service offerings.

What happens to data already submitted to ONEShare via DataUp?

All datasets now in ONEShare will be automatically available in the new Dash discovery environment alongside all newly contributed data.  All datasets also continue to be accessible directly via the Merritt interface at https://merritt.cdlib.org/m/oneshare_dataup.

Will the same functionality exist in Dash as in DataUp?

Users will be able to describe their datasets, get an identifier and citation for them, and share them publicly using the Dash tool. The initial implementation of DataONE-Dash will not have capabilities for parsing spreadsheets and reporting on best practices compliance. Also, the user will not be able to describe column-level (i.e., attribute) metadata via the web interface. Our intention, however, is to develop these functions and other enhancements in the future. Stay tuned!

Still want help specifically with spreadsheets?

  • We have pulled together some best practices resources: Spreadsheet Help 
  • Check out the Morpho Tool from the KNB – free, open-source data management software you can download to create/edit/share spreadsheet metadata (both file- and column-level). Bonus – The KNB is part of the DataONE Network.

 


It’s the dawn of a new day for DataUp! From Flickr by David Yu.


DataUp-Date

It’s been over a year since the DataUp tool went live, and we figure it’s time for an update. I’m co-writing this blog post with Susan Borda from UC Merced, who joined the UC3 DataUp project a few months ago.

DataUp Version 1

We went live with the DataUp tool in November 2012. Since then, more than 600 people have downloaded the add-in for Excel, and countless others have accessed the web application. We have had more than 50 submissions of datasets to the ONEShare Repository via DataUp, and many more inquiries about using the free repository. Although the DataUp tool was considered a success by many measures, we recognized that it had even more potential for improvement and expanded features (see our list of suggested improvements and fixes on BitBucket).

"Going Up". From Flickr by vsai


Unfortunately, development on DataUp stopped once we went live. The typical reasons apply here – lack of staff and resources to devote to the project. We therefore partnered with DataONE and requested funds from the National Science Foundation to continue work on the tool (full text of the grant available on eScholarship). Shortly after learning that we had received the requested grant, the UC3 team met with Microsoft Research, our original partners on DataUp. We discovered that our interests were still aligned, and that Microsoft had been using in-house resources to continue work on DataUp as an internal project titled “Sequim”. Rather than work in parallel, we decided to join forces and work on DataUp Version 2 (see more below).

In the interim, we published our work on DataUp Version 1 at F1000Research, an open access journal that focuses on rapid dissemination of results and open peer review. In this publication, we describe the project background, requirements gathering including researcher surveys, and a description of the tool’s implementation.

DataUp Version 2

The NSF grant allowed us to hire Susan Borda, a librarian at UC Merced with a background in IT and knowledge of the DataUp project. She has been serving as the project manager for DataUp Version 2, and has liaised with Microsoft Research on the project. Susan will take over from here to describe what’s on the horizon for DataUp.

The new version of DataUp will be available after February 24th, 2014. This version will have a new, clean web interface with functionality for both users and administrators. A DataUp administrator (i.e., repository manager) will be able to define the file-level metadata that will be captured from the user upon data deposit. In addition, an administrator will be able to activate the “Data Quality Check”, which allows the DataUp tool to verify whether a user’s uploaded file meets certain requirements for their repository. The “Best Practices” and file “Citation” features from DataUp version 1 are still available in version 2.

Note that we will be phasing out DataUp version 1 over the next few weeks, which means the add-in for Excel will no longer be operational.

Dying to see the new tool?

Microsoft Research will be at the International Digital Curation Conference (#IDCC14) in San Francisco at the end of February, demoing and discussing their suite of research tools, including DataUp. Susan will also be at IDCC, demoing DataUp version 2 more informally during the poster session with the goal of getting feedback from delegates.


DataUp is Live!


We are celebrating. From Boston Public Library via Flickr.

That’s right: DataUp is LIVE! I’m so excited I needed to type it twice.  So what does “DataUp is Live!” mean? Several things:

  • The DataUp website (dataup.cdlib.org) is up and running, and is chock full of information about the project, how to participate, and how to get the tool (in either web app or add-in form).
  • The DataUp web application is up and running (www.dataup.org). Anyone with internet access can start creating high-quality, archive-ready data! Would you rather use the tool within Excel? Download the add-in instead (available via the main site).
  • The DataUp code is available. DataUp is an open source project, and we strongly encourage community members to participate in the tool’s continued improvement. Check out the code on BitBucket.
  • The special repository for housing DataUp data, ONEShare, is up and running. This new repository is a special instance of the CDL’s Merritt Repository, and is connected to the DataONE project. ONEShare is the result of collaborations between CDL, University of New Mexico, and DataONE.  Read more in my blog post about ONEShare.
  • Please note that the current version of DataUp is Beta: this means it’s a work in progress. We apologize for any hiccups you may encounter; in particular, there is a known issue that currently prevents spreadsheets archived via DataUp from appearing in DataONE searches.

Today also marks the integration of the old DCXL/DataUp blog with the Data Pub Blog. You probably noticed that they are combined since the banner at the top says “Data Pub”. I will be posting here from now on, rather than at dataup.cdlib.org. The DataUp URL now holds the DataUp main website. Read more about these changes in my blog post about it.  The Data Pub Blog is intended to hold “Conversations About Data”. That means we will run the gamut of potential topics, including (but not limited to) data publication, data sharing, open data, metadata, and digital archiving.  There are likely to be posts from others at CDL from time to time, which means you will have access to more than just my myopic views on all things data.

The DataUp project’s core team included yours truly, Patricia Cruse (UC3 Director), John Kunze (UC3 Associate Director), and Stephen Abrams (UC3 Associate Director). Of course, no project at CDL is an island. We had SO MUCH help from the great folks here:

  • DataUp Website: Eric Satzman, Abhishek Salve, Robin Davis-White, Rob Valentine, Felicia Poe
  • DataUp Communications: Ellen Meltzer (DataUp Press Release PDF)
  • DataUp development: Mark Reyes, David Loy, Scott Fisher, Marisa Strong
  • Machine configuration: Joseph Somontan
  • Administrative support: Beaumont Yung, Rondy Epting-Day, Stephanie Lew

Thanks to all of you!


Did you notice? We tidied up.

If you didn’t notice, check out the URL above for this post: unbeknownst to you, you have been rerouted from DataUp to Data Pub. If you are still reeling from our first change (DCXL to DataUp), we apologize. Keep in mind, however, that change is good. Turn and face the strain.

The newest move is a harbinger of many changes that are coming up in the next eight days: on September 18, we will be releasing the DataUp tool! In preparation for this release, a little housekeeping needed to be done:

It’s time for DataUp housekeeping! From Flickr by clotho98

First, we created a lovely new website for DataUp (hat tip to the crackerjack team of user experience design folks here at the California Digital Library).  The new website will have all of the bells and whistles needed to fully enjoy DataUp: links to the add-in, the web application, users guides and documentation, and the code to name a few. Where should this website live? At dataup.cdlib.org, of course! But this requires a bit of musical chairs. So…

We are moving the DataUp blog (formerly the DCXL blog) to the Data Pub URL (datapub.cdlib.org). The CDL already has a blog residing at this URL; however, it is in dire need of sustenance.  And let’s face it: although they are all data-related, many of the blog posts you’ve read here are not specific to the DataUp project. So as of now, Data Pub will be the official blog for all things data-related at CDL, but not exclusively related to DataUp. It will be written by yours truly (with the occasional guest post), so if you are hungry for more blog content with tenuous links to music and pop culture, then re-bookmark now.

On Tuesday next week, check out the new dataup.cdlib.org website. Stay tuned for the announcement blog post, found here on Data Pub! This URL/website will be re-branded Data Pub on Tuesday next week.


Workflows Part II: Formal


Nothing says formal like a sequined cummerbund and matching bow tie. From awkwardfamilyphotos.com (click the pic for more)

In my last blog post, I provided an overview of scientific workflows in general. I also covered the basics of informal workflows, i.e. flow charts and commented scripts.  Well, put away the tuxedo t-shirt and pull out your cummerbund and bow tie, folks, because we are moving on to formal workflow systems.

A formal workflow (let’s call them FW) is essentially an “analytical pipeline” that takes data in one end and spits out results on the other.  The major difference between FW and commented scripts (one example of informal workflows) is that FW can be implemented in different software systems.  A commented R script for estimating parameters works for R, but what about those simulations you need to run in MATLAB afterward?  Saving the outputs from one program, importing them into another, and continuing analysis there is a very common practice in modern science.

So how do you link together multiple software systems automatically? You have two options: become one of those geniuses that use the command line for all of your analyses, or use a FW software system developed by one of those geniuses.  The former requires a level of expertise that many (most?) Earth, environmental, and ecological scientists do not possess, myself included.  It involves writing code that will access different software programs on your machine, load data into them, perform analyses, save results, and use those results as input for a completely different set of analyses, often using a different software program.  FW are often called “executable workflows” because they are a way for you to push only one button (e.g., enter) and obtain your results.

What about FW software systems? These are a bit more accessible for the average scientist.  FW software has been around for about 10 years, with the first user-friendly(ish) breakthrough being the Kepler Workflow System.  Kepler was developed with researchers in mind, and allows the user to drag and drop chunks of analytical tasks into a window.  The user can indicate which data files should be used as inputs and where the outputs should be sent, connecting the analytical tasks with arrows.  Kepler is still in a beta version, and most researchers will find the work required to set up a workflow prohibitive.

Genomicists are one group that has managed to incorporate workflows into its community of sharing; this is because they tend to have predictable data as inputs, with a comparatively small set of analyses performed on those data.  Interestingly, a social networking site called myExperiment has grown up around genomicists’ use of workflows, where researchers can share workflows, download others’ workflows, and comment on those that they have tried.

The main benefit of FW is that each step in the analytical pipeline, including any parameters or requirements, is formally recorded.  This means that researchers can reuse both individual steps (e.g., the data cleaning step in R or the maximum likelihood estimation in MATLAB) and the overall workflow.  Analyses can be re-run much more quickly, and repetitive tasks can be automated to reduce chances for manual error.  Because the workflow can be saved and re-used, it is a great way to ensure reproducibility and transparency in the scientific process.
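
For readers who like to see an idea in code, here is a minimal sketch of what "formally recording each step" can look like. It is not how Kepler or any other FW system works internally. The `Workflow` class and the two toy steps are hypothetical, and plain Python functions stand in for steps that in practice might run in R or MATLAB.

```python
# Hypothetical sketch of a workflow that records each step and its parameters.
# Real formal-workflow systems do far more; this only shows the basic idea.

class Workflow:
    def __init__(self):
        self.steps = []  # (name, function, parameters) in execution order
        self.log = []    # record of what was run, kept for reproducibility

    def add_step(self, name, func, **params):
        self.steps.append((name, func, params))
        return self  # returning self allows steps to be chained

    def run(self, data):
        self.log = []
        for name, func, params in self.steps:
            data = func(data, **params)  # output of one step feeds the next
            self.log.append({"step": name, "params": params})
        return data


def drop_missing(values, missing=None):
    """Data cleaning step: remove placeholder values."""
    return [v for v in values if v != missing]


def mean(values):
    """Analysis step: arithmetic mean of the cleaned values."""
    return sum(values) / len(values)


workflow = (
    Workflow()
    .add_step("clean", drop_missing, missing=-999)
    .add_step("summarize", mean)
)

result = workflow.run([2.0, -999, 4.0, 6.0])
print(result)
print(workflow.log)
```

Because the steps and their parameters live in one object, the same workflow can be re-run on new data, and the log can be saved alongside the results.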

Although Kepler is not in wide use, it is a great example of something that will likely become commonplace in the researcher’s toolbox over the next decade.  Other FW software includes Taverna, VisTrails, and Pegasus – all with varying levels of user-friendliness and varied communities of use.  As the complexity of analyses and the variety of software systems used by scientists continue to increase, FW are going to become a more common part of the research process.  Perhaps more importantly, it is likely that funders will start requiring the archiving of FW alongside data to ensure accountability and reproducibility, and to promote reuse.

A few resources for more info:


Survey says…

A few weeks ago we reached out to the scientific community for help on the direction of the DCXL project.  The major issue at hand was whether we should develop a web-based application or an add-in for Microsoft Excel.  Last week, I reported that we decided that rather than choose, we will develop both.  This might seem like a risky proposition: the DCXL project has a one-year timeline, meaning this all needs to be developed before August (!).  As someone in a DCXL meeting recently put it, aren’t we settling for “twice the product and half the features”?  We discussed what features might need to be dropped from our list of desirables based on the change in trajectory; however, we are confident that both of the DCXL products we develop will be feature-rich and meet the needs of the target scientific community.  Of course, this is made easier by the fact that the features in the two products will be nearly identical.


What would Richard Dawson want? Add-in or web app? From Wikipedia. Source: J Graham (1988). Come on Down!!!: the TV Game Show Book. Abbeville Press

How did we arrive at developing an add-in and a web app? By talking to scientists. It became obvious that there were aspects of both products that appeal to our user communities based on feedback we collected.  Here’s a summary of what we heard:

Show of hands:  I ran a workshop on Data Management for Scientists at the Ocean Sciences 2012 Meeting in February.  At the close of the workshop, I described the DCXL project and went over the pros and cons of the add-in option and the web app option.  By show of hands, about 80% of the audience voted for the web app (n~150).

Conversations: here’s a sampling of some of the things folks told me about the two options:

  • “I don’t want to go to the web. It’s much easier if it’s incorporated into Excel.” (add-in)
  • “As long as I can create metadata offline, I don’t mind it being a web app. It seems like all of the other things it would do require you to be online anyway” (either)
  • “If there’s a link in the spreadsheet, that seems sufficient. (either)  It would be better to have something that stays on the menu bar no matter what file is open.” (Add-in)
  • “The updates are the biggest issue for me. If I have to update software a lot, I get frustrated. It seems like Microsoft is always making me update something. I would rather go to the web and know it’s the most recent version.” (web app)
  • Workshop attendee: “Can it work like Zotero, where there’s ways to use it both offline and online?” (both)

Survey: I created a very brief survey using the website SurveyMonkey. I then sent the link to the survey out via social media and listservs.  Within about a week, I received over 200 responses.

Education level of respondents:

Survey questions & answers:

 

So with those results, there was a resounding “both!” emanating from the scientific community.  First we will develop the add-in since it best fits the needs of our target users (those who use Excel heavily and need assistance with good data management skills).  We will then develop the web application, with the hope that the community at large will adopt and improve on the web app over time.  The internet is a great place for building a community with shared needs and goals, and we can only hope that DCXL will be adopted as wholeheartedly as other internet sources offering help and information.


Hooray for Progress!

Great news on the DCXL front! We are moving forward with the Excel add-in and will have something to share with the community this summer.  If you missed it, back in January the DCXL project had an existential crisis: add-in or web-based application? I posted on the subject here and here. We spent a lot of time talking to the community and collating feedback, weighing the pros and cons of each option, and carefully considering how best to proceed with the DCXL project.

And the conclusion we came to… let’s develop both!

Comparing web-based applications and add-ins (aka plug-ins) is really an apples-and-oranges comparison.  How could we discount that a web-based application is yet another piece of software for scientists to learn? Or that an add-in is only useful for Excel spreadsheets running on a Windows operating system? Instead, we have chosen to first create an add-in (this was the original intent of the project), then move that functionality to a web-based application that will have more flexibility for the longer term.


What do Camus, The Cure, and DCXL have in common? Existentialists at heart. From http://www.openlettersmonthly.com

The capabilities of the add-in and the web-based application will be similar: we are still aiming to create metadata, check the data file for .csv compatibility, generate a citation, and upload the data set to a data repository.  For a full read of the requirements (updated last week), check out the Requirements page on  this site. The implementation of these requirements might be slightly different, but the goals of the DCXL project will be met in both cases: we will facilitate good data management, data archiving, and data sharing.
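
To give a flavor of what a ".csv compatibility" check might involve, here is a small sketch. The specific checks (blank or duplicate column headers, rows with the wrong number of fields) and the function name are my own assumptions for illustration. They are not taken from the DCXL requirements, and the real tool may check quite different things.

```python
import csv
import io


def csv_compatibility_issues(text):
    """Return a list of human-readable problems found in CSV text.

    Hypothetical checks only: blank headers, duplicate headers, and
    rows whose field count differs from the header row.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    issues = []
    header = rows[0]
    for position, name in enumerate(header, start=1):
        if not name.strip():
            issues.append(f"column {position} has no header")
    if len(set(header)) != len(header):
        issues.append("header names are not unique")

    # Line numbers start at 2 because line 1 is the header row.
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            issues.append(
                f"row {line_number} has {len(row)} fields, expected {len(header)}"
            )
    return issues


sample = "site,date,temp_c\nA,2012-03-01,4.5\nB,2012-03-02\n"
print(csv_compatibility_issues(sample))
```

Returning a list of messages, rather than a simple pass/fail, lets a tool show the scientist exactly which rows need attention before archiving.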

It’s true that the DCXL project is running a bit behind schedule, but we believe that it will be possible to create the two prototypes before the end of the summer.  Check back here for updates on our progress.
