
Building an RDM Guide for Researchers – An (Overdue) Update

It has been a little while since I last wrote about the work we’re doing to develop a research data management (RDM) guide for researchers. Since then, we’ve thought a lot about the goals of this project and settled on a concrete plan for building out our materials. Because we will soon be proactively seeking feedback on the different elements of this project, I wanted to provide an update on what we’re doing and why.


A section of the Rosetta Stone. Though it won’t help decipher Egyptian hieroglyphs, we hope our RDM guide will help researchers and data service providers speak the same language. Image from the British Museum.

Communication Barriers and Research Data Management

Several weeks ago I wrote about addressing Research Data Management (RDM) as a “wicked problem”, a problem that is difficult to solve because different stakeholders define and address it in different ways. My own experience as a researcher and library postdoc bears this out. Researchers and librarians often think and talk about data in very different ways! But as researchers face changing expectations from funding agencies, academic publishers, their own peers, and other RDM stakeholders about how they should manage and share their data, overcoming such communication barriers becomes increasingly important.

From visualizations like the ubiquitous research data lifecycle to instruments like the Data Curation Profiles, there are a wide variety of excellent tools that can be used to facilitate communication between different RDM stakeholders. Likewise, there are also discipline-specific best practice guidelines and tools like the Research Infrastructure Self Evaluation Framework (RISE) that allow researchers and organizations to assess and advance their RDM activities. What’s missing is a tool that combines these two elements: one that enables researchers to easily self-assess where they stand with regard to RDM and allows data service providers to offer easily customizable guidance about how to advance data-related practices.

Enter our RDM guide for researchers.

Our RDM Guide for Researchers

What I want to emphasize most about our RDM guide is that it is, first and foremost, designed to be a communication tool. The research and library communities both have a tremendous amount of knowledge and expertise related to data management. Our guide is not intended to supplant tools developed by either, but to assist in overcoming communication barriers in a way that removes confusion, grows confidence, and helps people in both communities find direction.

While the shape of the RDM guide has not changed significantly since my last post, we have refined its basic structure and have begun filling in the details.

The latest iteration of our guide consists of two main elements:

  1. An RDM rubric that allows researchers to self-assess their data-related practices using language and terminology with which they are familiar.
  2. A series of one page guides that provide information about how to advance data-related practices as necessary, appropriate, or desired.

The two components of our RDM Guide for Researchers. The rubric is intended to help researchers orient themselves in the ever-changing landscape of RDM, while the guides are intended to help them move forward.

The rubric is similar to the “maturity model”  described in my earlier blog posts. In this iteration, it consists of a grid containing three columns and a number of rows. The leftmost column contains descriptions of different phases of the research process. At present, the rubric contains four such phases: Planning, Collection, Analysis, and Sharing. These research data lifecycle-esque terms are in place to provide a framing familiar to data service providers in the library and elsewhere.

The next column includes phrases that describe specific research activities using language and terminology familiar to researchers. The language in this column is, in part, derived from the unofficial survey we conducted to understand how researchers describe the research process. By placing these activities beside those drawn from the research data lifecycle, we hope to ground our model in terms that both researchers and RDM service providers can relate to.

The rightmost column then contains a series of declarative statements which a researcher can use to identify their individual practices in terms of the degree to which they are defined, communicated, and forward thinking.

Each element of the rubric is designed to be customizable. We understand that RDM service providers at different institutions may wish to emphasize different services tied to different parts of the data lifecycle, and that researchers in different disciplines may have different ways of describing their data-related activities. For example, while we are working on refining the language of the declarative statements, I have left them out of the diagram above because they are likely the element of the rubric that will remain most open to customization.

Each row within the rubric will be complemented by a one page guide that will provide researchers with concrete information about data-related best practices. If the purpose of the rubric is to allow researchers to orient themselves in the RDM landscape, the purpose of these guides is to help them move forward.

Generating Outputs

Now that we’ve refined the basic structure of our model, it’s time to start creating some outputs. Throughout the remainder of the summer and into the autumn, members of the UC3 team will be meeting regularly to review the content of the first set of one page guides. This process will inform our continual refinement of the RDM rubric which will, in turn, shape the writing of a formal paper.

Moving forward, I hope to workshop this project with as many interested parties as I can, both to receive feedback on what we’ve done so far and to potentially crowdsource some of the content. Over the next few weeks I’ll be soliciting feedback on various aspects of the RDM rubric. If you’d like to provide feedback, please either click through the links below (more to be added in the coming weeks) or contact me directly.


Provide feedback on our guide!

Planning for Data

More coming soon!

Software Carpentry / Data Carpentry Instructor Training for Librarians

We are pleased to announce that we are partnering with Software Carpentry and Data Carpentry to offer an open instructor training course on May 4-5, 2017, geared specifically toward the Library Carpentry movement.

Open call for Instructor Training

This course will take place in Portland, OR, in conjunction with csv,conf,v3, a community conference for data makers everywhere. It’s open to anyone, but the two-day event will focus on preparing members of the library community as Software and Data Carpentry instructors. The sessions will be led by Library Carpentry community members, Belinda Weaver and Tim Dennis.

Applications for this course are now closed.

What is Library Carpentry?

For those who don’t know, Library Carpentry is a global community of library professionals that is customizing Software Carpentry and Data Carpentry modules for training the library community in software and data skills. You can follow us on Twitter @LibCarpentry.

Library Carpentry is actively creating training modules for librarians and holding workshops around the world. It’s a relatively new movement that has already been a huge success. You can learn more by reading the recently published article: Library Carpentry: software skills training for library professionals.

Why should I get certified?

Library Carpentry is a movement tightly coupled with the Software Carpentry and Data Carpentry organizations. Since all are based on a train-the-trainer model, one of our challenges has been how to get more experience as instructors. This issue is handled within Software and Data Carpentry by requiring instructor certification.

Although certification is not a requirement to be involved in Library Carpentry, we know that getting certified will help us refine workshops and teaching modules, and grow the movement. Also, once certified, you can start hosting your own Library Carpentry, Software Carpentry, or Data Carpentry events on your campus. It’s a great way to engage with your campus and library community!


Applicants will learn how to teach people the skills and perspectives required to work more effectively with data and software. The focus will be on evidence-based education techniques and hands-on practice; as a condition of taking part, applicants must agree to:

  1. Abide by our code of conduct,
  2. Agree to teach at a Library Carpentry, Software Carpentry, or Data Carpentry workshop within 12 months of the course, and
  3. Complete three short tasks after the course in order to complete the certification. The tasks take a total of approximately 8-10 hours.


This course will be held in Portland, OR, in conjunction with csv,conf,v3 and is sponsored by csv,conf,v3 and the California Digital Library. To help offset the costs of this event, we will ask attendees to contribute an optional fee (tiered prices will be recommended based on your or your employer’s ability to pay). No one will be turned down based on inability to pay and a small number of travel awards will be made available (more information coming soon).  


Hope to see you there! Applications for this Software Carpentry / Data Carpentry Instructor Training course closed on Jan 31, 2017.

Under Group Name, use “CSV (joint)” if you wish to attend both the training and the conference, or “CSV (training only)” if you only wish to attend the training course.

More information

If you have any questions about this Instructor Training course, please contact us. And if you have any questions about the Library Carpentry movement, you can reach the community on Twitter @LibCarpentry or join the Gitter chatroom.


The integration of the Merritt repository with Amazon’s S3 and Glacier cloud storage services, previously described in an August 16 post on the Data Pub blog, is now mostly complete. The new Amazon storage supplements Merritt’s longstanding reliance on UC private cloud offerings at UCLA and UCSD. Content tagged for public access is now routed to S3 for primary storage, with automatic replication to UCSD and UCLA. Private content is routed first to UCSD, and then replicated to UCLA and Glacier. Content is served for retrieval from the primary storage location; in the unlikely event of a failure, Merritt automatically retries from secondary UCSD or UCLA storage. Glacier, which provides near-line storage with four-hour retrieval latency, is not used to respond to user-initiated retrieval requests.

Content Type | Primary Storage | Secondary Storage | Primary Retrieval | Secondary Retrieval
Public       | S3              | UCSD, UCLA        | S3                | UCSD, UCLA
Private      | UCSD            | UCLA, Glacier     | UCSD              | UCLA
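The routing and retrieval rules described above can be sketched roughly as follows. This is a simplified illustration of the described policy, not Merritt’s actual implementation; all names here are hypothetical:

```python
# Hypothetical sketch of the Merritt storage routing policy described above.
# Illustrative only; not actual Merritt code.

PUBLIC_ROUTE = {"primary": "S3", "replicas": ["UCSD", "UCLA"]}
PRIVATE_ROUTE = {"primary": "UCSD", "replicas": ["UCLA", "Glacier"]}

def storage_route(is_public):
    """Return the primary store and replication targets for an object."""
    return PUBLIC_ROUTE if is_public else PRIVATE_ROUTE

def retrieval_order(is_public):
    """Retrieval is served from primary storage, falling back to secondary
    copies on failure. Glacier (near-line, ~4-hour latency) is never used
    for user-initiated retrieval requests."""
    route = storage_route(is_public)
    return [route["primary"]] + [r for r in route["replicas"] if r != "Glacier"]
```

For a public object this yields the retrieval order S3, then UCSD, then UCLA; for a private object, UCSD then UCLA, with Glacier held back as a dark archival copy.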

In preparation for this integration, all retrospective public content, over 1.1 million objects and 3 TB, was copied from UCSD to S3, a process that took about six days to complete. A similar move from UCSD to Glacier is now underway for the much larger corpus of private content, 1.5 million objects and 71 TB, which is expected to take about five weeks to complete.

The Merritt-Amazon integration enables more optimized internal workflows and increased levels of reliability and preservation assurance. It also holds the promise of lowering overall storage costs, and thus, the recharge price of Merritt for our campus customers.  Amazon has, for example, recently announced significant price reductions for S3 and Glacier storage capacity, although their transactional fees remain unchanged.  Once the long-term impact of S3 and Glacier pricing on Merritt costs is understood, CDL will be able to revise Merritt pricing appropriately.

CDL is also investigating the possible use of the Oracle archive cloud as a lower-cost alternative, or supplement, to Glacier for dark archival content hosting. While offering similar function to Glacier, including four-hour retrieval latency, Oracle’s price point for storage capacity is about one quarter of Glacier’s.

Collaborative Web Archiving with Cobweb

A partnership between the CDL, Harvard Library, and UCLA Library has been awarded funding from IMLS to create Cobweb, a collaborative collection development platform for web archiving.

The demands of archiving the web in comprehensive breadth or thematic depth easily exceed the technical and financial capacity of any single institution. To ensure that the limited resources of archiving programs are deployed most effectively, it is important that their curators know something about the collection development priorities and holdings of other, similarly-engaged institutions. Cobweb will meet this need by supporting three key functions: nominating, claiming, and holdings. The nomination function will let curators and stakeholders suggest web sites pertinent to specific thematic areas; the claiming function will allow archival programs to indicate an intention to capture some subset of nominated sites; and the holdings function will allow programs to document sites that have actually been captured.
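The three functions could be modeled as a small amount of shared state per nominated site, along these lines. This is purely an illustrative sketch under our own assumptions; the post does not describe Cobweb’s actual data model, and all names below are hypothetical:

```python
# Illustrative model of Cobweb's three key functions: nominating, claiming,
# and holdings. Hypothetical sketch; not Cobweb's actual design.

class CobwebProject:
    def __init__(self, theme):
        self.theme = theme
        # url -> {"claimed_by": institutions, "held_by": institutions}
        self.nominations = {}

    def nominate(self, url):
        """A curator or stakeholder suggests a site for the thematic area."""
        self.nominations.setdefault(url, {"claimed_by": set(), "held_by": set()})

    def claim(self, url, institution):
        """An archiving program signals its intention to capture a nominated site."""
        self.nominations[url]["claimed_by"].add(institution)

    def record_holding(self, url, institution):
        """After capture, the program documents that the site is held.
        Only metadata is recorded; the archived content itself stays with
        the collecting organization."""
        self.nominations[url]["held_by"].add(institution)
```

The key design point this sketch reflects is that Cobweb coordinates intent and disclosure across institutions while storing only metadata about who has claimed and captured what.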

How will Cobweb work? Imagine a fast-moving news event unfolding online via news reports, videos, blogs, and social media. Recognizing the importance of recording this event, a curator immediately creates a new Cobweb project and issues an open call for nominations. Scholars, subject area specialists, interested members of the public, and event participants themselves quickly respond, contributing to a site list more comprehensive than could be created by any one curator or institution. Archiving institutions review the site list and publicly claim responsibility for capturing portions of it that are consistent with their local policies and technical capabilities. After capture, the institutions’ holdings information is updated in Cobweb to disclose the various collections containing newly available content. It’s important to note that Cobweb collects only metadata; the actual archived web content would continue to be managed by the individual collecting organizations. Nevertheless, by distributing the responsibility, more content will be captured more quickly with less overall effort than would otherwise be possible.

Cobweb will help libraries and archives make better informed decisions regarding the allocation of their individual programmatic resources, and promote more effective institutional collaboration and sharing.

This project was made possible in part by the Institute of Museum and Library Services, #LG-70-16-0093-16.


Thoughts on Digital Humanities

This week I’m lucky enough to be in Amsterdam for the Beyond the PDF 2 Meeting, sponsored by FORCE11.  I’m sure I will be blogging about this meeting for weeks to come, however something came up today that has me inspired to do a blog post: digital humanities.

For those unaware of BTPDF2, it’s a spinoff event from the Beyond the PDF meeting, which took place in San Diego a few years back. Both events are a meeting of the minds for digital scholarship, with representatives from publishing, libraries, academia, software development, and everything in between. This group has customarily been dominated by bioscience data, and to a lesser extent social science. But this year, digital humanities just keeps cropping up. Next week I will talk about BTPDF2, but this week, I’m using my blog as a reason to educate myself about the digital humanities.

Digital Humanities: What does that mean? Let’s go to Wikipedia:

The digital humanities is an area of research concerned with the intersection of computing and the disciplines of the humanities. [It] embraces a variety of topics ranging from curating online collections to data mining large cultural data sets. Digital Humanities combines the methodologies from the traditional humanities disciplines (such as history, philosophy, linguistics, literature, art, archaeology, music, and cultural studies), as well as social sciences.

So there ya go – it’s just like it sounds. Humanities + computers. I must admit, I’ve been avoiding DH in my time as a data-centric person. First of all, the field is intimidating to a natural science person like me – it all seems so… human. The unpredictable element of humanity makes me nervous, especially since I thought clams were pretty darn complex back in my grad school days. Despite my biases, I’ve learned more about the wide array of interesting projects that DH encompasses, and have been impressed by the unique challenges associated with DH data collection.

A digital humanities finding: mummies had atherosclerosis. Read more from NPR by clicking on this photo. Image from Flickr by Brooklyn Museum

A great example of a DH project was written up in the New York Times back in 2011, which featured the work of DH scholars who use modern spatial tools (GIS, Google Earth) to understand human history, including the Salem Witch Trials, the Battle of Gettysburg, and ancient Greece. I actually posted about one such project back in May, after meeting a digital humanist at a UCLA Libraries panel; read that entry here. One thing I have noticed about digital humanities projects: they are all GREAT party conversations. Certainly better than those softshell clams.

A great example of a specific branch of digital humanities is digital archaeology – see the tDAR (the Digital Archaeological Record) website for an introduction. This work sounds like a cross between Indiana Jones and The Matrix, which has led me to wonder whether I’ve seen a movie in the last 15 years.

The point? Digital humanities are kinda awesome. They have a HUGE diversity of data, and much of the work sits right on the fence between quantitative and qualitative data. It’s an interesting area I’m now embracing as an opportunity for learning about cool stuff. For an overview, check out the Twitter hash tag; the short answer is they are having conferences, getting new funding from the NEH, and establishing new academic units (e.g., Stanford, the University of Nebraska, King’s College London).


We have a new theme song!

Thanks to DataONE‘s own Amber Budden for passing along this gem.  It starts getting good around the one minute mark. Enjoy!

Trending: Big Data

Last week, the White House Office of Science and Technology Policy  hosted a “Big Data” R&D event, which was broadcast live on the internet (recording available here, press release available as a pdf).  GeekWire did a great piece on the event that provides context.  Wondering what “Big Data” means? Keep reading.

Big Tex

“Howdy Folks!” Big Tex from the State Fair of Texas thinks Big Data is the best kind of data. From Flickr by StevenM_61.

Big Data is a phrase used to describe the huge volume of data being produced by modern technological infrastructure. Some examples include social media and remote sensing instruments. Facebook, Twitter, and other social media are producing huge amounts of data that can be analyzed to understand trends on the Internet. Satellites and other scientific instruments are producing constant streams of data that can be used to assess the state of the environment and understand patterns in the global ecosphere. In general, Big Data is just what it sounds like: a sometimes overwhelming amount of information, flooding scientists, statisticians, economists, and analysts with an ever-increasing pile of fodder for understanding the world.

Big Data is often used alongside the “Data Deluge,” a phrase used to describe the onslaught of data from multiple sources, all waiting to be collated and analyzed. The phrase brings to mind images of being overwhelmed by data: check out The Economist‘s graphic that represents the concept. From Wikipedia:

…datasets are growing so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing.

Despite the challenges of Big Data, folks are hungry for big datasets to analyze. Just this week, the 1940 US Census data was released; there was so much interest in downloading and analyzing the data that the servers crashed. You only need to follow the Twitter hash tag #bigdata to see it’s a very hot topic right now. Of course, Big Data should not be viewed as a bad thing. There is no such thing as too much information; it’s simply a matter of finding the best tools for handling all of those data.

Big Data goes hand-in-hand with Big Science, which is a term first coined back in 1961 by Alvin Weinberg, then the director of the Oak Ridge National Laboratory.  Weinberg used “Big Science” to describe large, complex scientific endeavors in which society makes big investments in science, often via government funding.  Examples include the US space program, the Sloan Digital Sky Survey, and the National Ecological Observatory Network.  These projects produce mountains of data, sometimes continuously 24 hours a day, 7 days a week.  Therein lies the challenge and awesomeness of Big Data.

What does all of this mean for small datasets, like those managed and organized in Excel?  The individual scientist with their unique, smaller scale dataset has a big role in the era of Big Data.  New analytics tools for meta-analysis offer a way for individuals to participate in Big Science, but we have to be willing to make our data standardized, useable, and available.  The DCXL add-in will facilitate all three of these goals.

In the past, meta-analysis of small data sets meant digging through old papers, copying data out of tables or reconstructing data from graphs.  Wondering about the gland equivalent of phenols from castoreum? Dig through this paper and reconstruct the data table in Excel.  Would you like to combine that data set with data on average amounts of neutral compounds found in one beaver castor sac? That’s another paper to download and more data to reconstruct.  By making small datasets available publicly (with links to the datasets embedded in the paper), and adhering to discipline-wide standards, meta-analysis will be much easier and small datasets can be incorporated into the landscape of Big Science.  In essence, the whole is greater than the sum of the parts.

Think you can take on the Data Deluge? NSF’s funding call for big data proposals is available here.


The Science of the DeepSea Challenge

Recently the film director and National Geographic explorer-in-residence James Cameron descended to the deepest spot on Earth: the Challenger Deep in the Mariana Trench. He partnered with lots of sponsors, including National Geographic and Rolex, to make this amazing trip happen. A lot of folks outside of the scientific community might not realize this, but until this week, there had been only one successful descent to the trench by a human-occupied vehicle (that’s a submarine for you non-oceanographers). You can read more about that 1960 exploration here and here.

I could go on about how astounding it is that we know more about the moon than the bottom of the ocean, or discuss the seemingly intolerable physical conditions found at those depths– most prominently the extremely high pressure.  However what I immediately thought when reading the first few articles about this expedition was where are the scientists?

Before Cameron, Swiss oceanographer Jacques Piccard and Navy officer Don Walsh descended into the virgin waters of the deep.

After combing through many news stories, several National Geographic sites including the site for the expedition, and a few press releases, I discovered (to my relief) that there are plenty of scientists involved.  The team that’s working with Cameron includes scientists from Scripps Institution of Oceanography (the primary scientific partner and long-time collaborator with Cameron),  Jet Propulsion Lab, University of Hawaii, and University of Guam.

While I firmly believe that the success of this expedition will be a HUGE accomplishment for science in the United States, I wonder if we are sending the wrong message to aspiring scientists and youngsters in general. We are celebrating the celebrity film director involved in the project rather than the huge team of well-educated, interesting, and devoted scientists who are also responsible for this spectacular feat (I found fewer than five names of scientists in my internet hunt). Certainly Cameron deserves the bulk of the credit for enabling this descent, but I would like there to be a bit more emphasis on the scientists as well.

Better yet, how about emphasis on the science in general? It’s too early for them to release any footage from the journey down; however, I’m interested in how the samples will be/were collected, how they will be stored, what analyses will be done, whether there are experiments planned, and how the resulting scientific advances will be made just as public as Cameron’s trip was. The expedition site has plenty of information about the biology and geology of the trench, but it’s just background: there appears to be nothing about scientific methods or plans to ensure that this project will yield the maximum scientific advancement.

How does all of this relate to data and DCXL? I suppose this post falls in the category of data is important.  The general public and many scientists hear the word “data” and glaze over.  Data isn’t inherently interesting as a concept (except to a sick few of us).  It needs just as much bolstering from big names and fancy websites as the deep sea does.  After all, isn’t data exactly what this entire trip is about?  Collecting data on the most remote corners of our planet? Making sure we document what we find so others can learn from it?

Here’s a roundup of some great reads about the Challenger expedition:


Tweeting for Science

At risk of veering off course of this blog’s typical topics, I am going to post about tweeting.  This topic is timely given my previous post about the lack of social media use in Ocean Sciences, the blog post that it spawned at Words in mOcean,  and the Twitter hash tag #NewMarineTweep. A grad school friend asked me recently what I like about tweeting (ironically, this was asked using Facebook).  So instead of touting my thoughts on Twitter to my limited Facebook friends, I thought I would post here and face the consequences of avoiding DCXL almost completely this week on the blog.

First, there’s no need to reinvent the wheel.  Check out these resources about tweeting in science:

That being said, I will now pontificate on the value of Twitter for science, in handy numbered list form.

  1. It saves me time. This might seem counter-intuitive, but it’s absolutely true. If you are a head-in-the-sand kind of person, this point might not be for you. But I like to know what’s going on in science, science news, the world of science publishing, science funding, etc. That doesn’t even include regular news or local events. The point here is that instead of checking websites, digging through RSS feeds, or having an overfull email inbox, I have filtered all of these things through HootSuite. HootSuite is one of several free services for organizing your Twitter feeds; mine looks like a bunch of columns arranged by topic. That way I can quickly and easily check on the latest info, in a single location. Here’s a screenshot of my HootSuite page, to give you an idea of the possibilities.
  2. It is great for networking.  I’ve met quite a few folks via Twitter that I probably never would have encountered otherwise.  Some have become important colleagues, others have become friends, and all of them have helped me find resources, information, and insight.  I’ve been given academic opportunities based on these relationships and connections.  How does this happen? The Twittersphere is intimate and small enough that you can have meaningful interactions with folks.  Plus, there’s tweetups, where Twitter folks meet up at a designated physical location for in-person interaction and networking.
  3. It’s the best way to experience a conference, whether or not you are physically there. This is what spawned that previous post about Oceanography and the lack of social media use.  I was excited to experience my first Ocean Sciences meeting with all of the benefits of Twitter, only to be disappointed at the lack of participation.  In a few words, here’s how conference (or any event) tweeting works:
    1. A hash tag is declared. It’s something short and pithy, like #Oceans2012. How do you find out about the tag? Usually the organizing committee tells you, or in lieu of that you rely on your Twitter network to let you know.
    2. Everyone who tweets about a conference, interaction, talk, etc. uses the hash tag in their tweet.
    3. Hash tags are ephemeral, but they allow you to see exactly who’s talking about something, whether you follow them or not.  They are a great way to find people on Twitter that you might want to network with… I’m looking at you, @rejectedbanana @miriamGoldste.
    4. If you are not able to attend a conference, you can “follow along” on your computer and get real-time feeds of what’s happening.  I’ve followed several conferences like this- over the course of the day, I will check in on the feed a few times and see what’s happening. It’s the next best thing to being there.

I could continue expounding the greatness of Twitter, but as I said before, others have done a better job than I could (see links above).  No, it’s not for everyone. But keep in mind that you can follow people, hash tags, etc. without actually ever tweeting. You can reap the benefits of everything I mentioned above, except for the networking.  Food for thought.

My friend from WHOI, who also attended the Ocean Sciences meeting, emailed me this comment later:

…I must say those “#tweetstars” were pretty smug about their tweeting, like they were sitting at the cool kids’ table during lunch or something…

I countered that it was more like those tweeting at OS were incredulous at the lack of tweets, but yes, we are definitely the cool kids.
