Data Science meets Academia

(guest post by Johannes Otterbach)

First Big Data and Data Science, then Data Driven and Data Informed. Even before I changed job titles—from Physicist to Data Scientist—I spent a good bit of time pondering what makes everyone so excited about these things, and whether they have a place in the academy.

Data Science is an incredibly young and flaming hot field (searching for ‘Data Science’ on Google Search yields about 283,000,000 results in 0.48 seconds [!] and the count is rising). The promises—and accordingly the stakes—of Data Science are high, and seem to follow a classic Hype Cycle. Nevertheless, Data Science is already having major impacts on all aspects of life, with personalized advertisement and self-quantification leading the charge. But is there a place for Data Science in Academia? To try and answer this question, first we have to understand more about Data Science itself, from lofty promises to practical workflows, and later I’ll offer some potential (big-picture) academic applications.

Yet another attempt at defining Data Science

There are gazillions of blogs, articles, diagrams, and other information channels that aim to define this new and still-fuzzy term ‘Data Science,’ and it will still be some years before we achieve consensus. At least for now there is some agreement surrounding the main ingredients; Drew Conway summarizes them nicely in his Venn diagram:


In a popular tweet, Josh Wills defines a Data Scientist as an individual ‘who is better at statistics than any software engineer and better at software engineering than any statistician.’ This definition just barely captures some of the basics. Referring back to the Venn diagram, a Data Scientist finds her/himself at the intersection of Statistics, Machine Learning, and a particular business need (in academic parlance, a research question).

  • Statistics is perhaps the most obvious component, as Data Science is partially about analyzing data using summary statistics (e.g., averages, standard deviations, correlations, etc.) and more complex mathematical tools. This is supplemented by
  • Machine Learning, which subsumes the programming and data munging aspects of a Data Scientist’s toolkit. Machine Learning is used to automatically sift through data that are too unwieldy for humans to analyze. (This is sometimes an aspect of defining Big Data.) As an example, just try to imagine how many dimensions you could define to monitor student performance: past and current grades, participation, education history, family and social circles, physical and mental health, just to name a few categories that you could explode into several subcategories (a toy sketch of ranking such features follows this list). Typically the output of Machine Learning is a certain number of features that are important within a given business problem and that can provide insight when evaluated in the context of
  • the Domain Knowledge. Domain Knowledge is essential in order to identify and explore the questions that will drive business actions. It is the one ingredient that’s not generalizable across different segments of industry (disciplines or domains) and as such a Data Scientist must acquire new Domain Knowledge for each new problem that she/he encounters.
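
To make that feature-ranking idea concrete, here is a minimal, hedged Python sketch using scikit-learn; the dataset, column names, and coefficients are all invented stand-ins for real student-performance data, not anything measured.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Invented stand-in for a student-performance dataset
rng = np.random.default_rng(42)
features = ["past_gpa", "participation", "attendance", "study_hours", "commute_time"]
X = pd.DataFrame(rng.random((200, len(features))), columns=features)
# Hypothetical outcome: driven mostly by past GPA, a little by study hours
y = 2.0 * X["past_gpa"] + 0.5 * X["study_hours"] + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(random_state=0).fit(X, y)
# Rank the dimensions by how much they drive the model's predictions
for name, score in sorted(zip(features, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {score:.2f}")
```

On this toy data the ranking recovers the two features that actually drive the outcome, which is exactly the kind of output that only becomes useful once Domain Knowledge enters the picture.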

The most formalized definition I’ve come across is from NIST’s Big Data Framework:

Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

I won’t elaborate on these terms here, but I do want to draw your attention to the modest word actionable. This is the key component that distinguishes Data Science from mere data analysis, and its implementation gives rise to the dichotomy of Data Driven vs. Data Informed.

Promises and shortcomings of Data Science: The Hype Cycle

The Gartner Hype Cycle report (2014) on emerging technologies places Data Science just past the threshold of inflated expectations.


This hype inflation contributes to unreasonable expectations about the problem-solving power of Data Science. All the way back in 2008, one of the early proponents of Big Data and Data Science, the Editor-in-Chief of Wired, Chris Anderson, blogged that the new data age would bring The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. He claimed that by using sufficiently advanced Machine Learning algorithms, gaining insight into a problem would become trivial. This ignores the role of Domain Knowledge in understanding and posing the right questions, and by now it’s not hard to see that his projection was off. If we consider highly complex processes where sufficient data are not and might never be available, we can only make advances by means of educated guesses and by building appropriate models and hypotheses. This requires a substantial amount of Domain Knowledge. Nick Barrowman formulated a detailed argument (one that goes beyond just a response to Anderson’s opinion) in his article on Correlation, Causation and Confusion.

Data Science, and in particular Applied Machine Learning, is not completely agnostic of the problem space in which it’s applied; this has serious implications for the analyst’s approach to unknown data. Most importantly, the Domain Knowledge is indispensable for correctly evaluating the predictions of the algorithms and making smart decisions rather than placing blind faith in the computational output. As Goodfellow, Bengio, and Courville frame it in their book Deep Learning [Ch. 5.3.1, p. 110]:

The most sophisticated [Machine Learning] algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class. Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.
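
This ‘no free lunch’ point is easy to demonstrate. Below is a minimal Python sketch (assuming scikit-learn and NumPy; the data are synthetic and invented for illustration): on labels that have no relationship to the inputs, even a capable algorithm performs at chance, while a mild distributional assumption, here that labels follow a simple rule over two features, lets the very same algorithm do well.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y_noise = rng.integers(0, 2, size=500)              # labels independent of X: nothing to learn
y_structured = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels follow a simple rule over two features

clf = RandomForestClassifier(random_state=0)
print(cross_val_score(clf, X, y_noise, cv=5).mean())       # hovers around 0.5 (chance level)
print(cross_val_score(clf, X, y_structured, cv=5).mean())  # well above chance
```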

Actionable business insights: Data Driven vs. Data Informed

The oft-quoted expression ‘Be data informed, not data driven’ seems to originate with a 2010 talk by Adam Mosseri of Facebook. He coined these terms to distinguish two different approaches to a data problem.

  • The Data-Driven approach involves analyzing the data and then adjusting the system to optimize a certain metric. Ad placement on a website provides a simple example: we move the ad slightly until we maximize the number of clicks on it. The problem with this approach is that we can get trapped in locally optimal points, i.e., points where any deviation leads to a decreasing click rate; however, we can’t be sure that there’s not an even better way of displaying the ad (a toy sketch of this trap follows this discussion). Joshua Porter summarizes the pitfalls of a Data-Driven approach in the context of UX design. To find the absolute best solution, a tremendous amount of data and time are necessary (technically, an infinite amount of both).

Another shortcoming of the Data-Driven approach is that not everything can be formulated as an optimization problem, the fundamental mathematical formulation of Machine Learning. As a result, we can’t always guarantee that proper data have been collected, particularly in cases where we don’t have a good idea of what a satisfying answer would look like. To circumvent these problems we can apply

  • The Data-Informed way of viewing a problem, which avoids micro-optimization as mentioned above. Furthermore, it allows us to include decision-making inputs that cannot be cast into a ‘standard Machine Learning form,’ such as:
    – Qualitative data
    – Strategic interests
    – Regulatory bodies
    – Business interests
    – Competition
    – Market

Data-Informed decisions leverage the best of two worlds: the analysis of data given a hypothesis, followed by a well-rounded decision that, in turn, leads to the collection of new data to improve business. Joe Blitzstein’s visualization summarizes the Data Science Process, and there’s even an industry standard known as CRISP-DM.
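
As promised in the Data-Driven bullet above, here is a toy Python sketch of the local-optimum trap. The click-rate curve, starting point, and step size are all invented for illustration; a real ad system would be noisier, but the failure mode is the same.

```python
import numpy as np

def click_rate(x):
    # Invented two-peaked response: a local peak near x = 2
    # and a higher (global) peak near x = 8.
    return np.exp(-(x - 2) ** 2) + 2 * np.exp(-((x - 8) ** 2) / 4)

position, step = 0.0, 0.1
while True:
    candidates = [position - step, position, position + step]
    best = max(candidates, key=click_rate)
    if best == position:  # no neighboring placement improves the metric: stop
        break
    position = best

print(f"Converged at x = {position:.1f}, click rate {click_rate(position):.2f}")
# Starting at x = 0 we climb to the local peak near x = 2 and never
# discover the better placement near x = 8.
```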


What about Data Science in Academia?

There have long been calls to Academia to better prepare students (especially Ph.D. graduates) for the job market. The explosion of Data Science as the sexiest job of the 21st century is fueling the creation of an increasing number of Data Science Masters programs. The value of these programs remains to be tested, as few graduates have hit the market, but the trend reveals that Academia is at least trying to respond to calls for reform.

Apart from preparing students for careers outside the academy, is there space for applying Data Science to traditional academic fields, and maybe establishing it as a field unto itself? Data Science involves much more than statistical data analysis, encompassing aspects of data management, data warehousing, reproducibility, and data best practices. To advance science as a whole, it will be necessary for researchers and staff to develop a pi-shaped skills profile (as coined by Alex Szalay):


The first leg, a.k.a. the domain specialty or Domain Knowledge, is already established after years of efforts to advance a field. However, this hints at a fundamental problem for Data Science as a domain-agnostic, standalone field. Data Science as a Service (DSaaS) is likely to fail. Instead, Data Scientists should be embedded in a field and possess domain expertise, in addition to the cross-disciplinary techniques required to tackle the data challenges at hand.

This feeds into the second, to-be-developed, leg, which represents advanced computational literacy. As more and more researchers leave the academy, it’s obvious that the current system disincentivizes this development. However, it also reveals some low-hanging fruit. An easy win would be adopting simple best practices to improve how scientific data are handled and encouraging students to develop solid data skills. Another win would be to reward researchers for their efforts to make studies transparent and reproducible. Without such cultural changes, Academia will fail to advance ever-more-diversified scientific fields into the next century. Perpetuating current practices will only undermine scientific research and make it increasingly undiscoverable. As Denis Diderot put it in his 1755 Encyclopédie:

As long as the centuries continue to unfold, the number of books will grow continually, and one can predict that a time will come when it will be almost as difficult to learn anything from books as from the direct study of the whole universe. It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes.

Next steps

It’s clear that Data Science will have major impacts on our digital and non-digital lives. The Internet of Things already transcends our individual internet presence by connecting everyday devices—such as thermostats, fridges, cars, etc.—to the internet, and thus makes them available for optimization using Data Science. The extent of these impacts, though, will depend on our ability to make sense of the data and to develop tools and intuitions to check computerized predictions against reality. Moreover, we require a better understanding of the limitations of Data Science as well as its mathematical-statistical foundations. Without thorough basic knowledge, Data Science and Machine Learning will be seen as belonging to the Dark Arts and will raise skepticism. This is true for data of all sizes and depends strongly on whether we succeed in making data discoverable and processable. Data Science has a role to play here (in industry as well as the academy). To succeed, we first need to rethink the way scientific information is produced, stored, and prepared for further investigation. And this goal hinges on overdue changes to incentives within the academy.

About the author

Johannes Otterbach is a Data Scientist at LendUp with a passion for big data technologies and their applications to real-world problems. He earned his Ph.D. in Physics, working on topics related to Quantum Computing.


Data (Curation) Viz.

Data management and data curation are related concepts, but they do not refer to precisely the same things. I use these terms so often now that sometimes the distinctions, fuzzy as they are, blur away entirely. When this happens I return to visual abstractions to clarify—in my own mind—what I mean by one vs. the other. Data management is more straightforward and almost always comes in the guise of something like this:

The obligatory research data management life cycle slide. Everyone uses it, myself included, in just about every presentation I give these days. This simple (arguably oversimplified) but useful model defines more-or-less discrete data activities that correspond with different phases of the research process. It conveys what it needs to convey; namely, that data management is a dynamic cycle of activities that constantly influence one another. Essentially, we can envision a feedback loop.

Data curation, on the other hand, is a complex beastie. Standard definitions cluster around something like this one from the Digital Curation Centre in the UK:

Data curation involves maintaining, preserving, and adding value to digital research data throughout its lifecycle.

When pressed for a definition, this is certainly an elegant response. But, personally, I don’t find it to be helpful at all when I try to wrap my head around the myriad activities that go into curating anything, much less distinguishing management activities from curation activities. Moreover, I’m talking about all kinds of activities in the context of “data,” a squishy concept in and of itself. (We’ll go with the NSF’s definition: the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.)

I suppose I should mention sooner or later the point of defining “data” and all these terms appended to it: there’s a lot of it [data], and we need to figure out what on earth to do with it, ergo the proliferation of new positions with “data management” and “data curation” in their titles. It’s important to make sure we’re speaking the same language.

There are other, more expansive approaches to defining data curation and a related post on this very blog, but to really grasp what I’m talking about when I’m saying the words “data curation,” I invariably come back to this visualization created by Tim Norris. Tim is a geographer turned CLIR Postdoctoral Fellow in Data Curation at the University of Miami. Upon assuming a new post with an unfamiliar title, he decided to draw a map of his job to explain (to himself and to others) what he means by data curation. Many thanks to Tim for sharing this exercise with the rest of our CLIR cohort and now with the blogo-world-at-large.

Below is an abbreviated caption, in Tim’s own words, as well as short- (3 min) and long-format (9 min) tours of the map narrated by Tim. And here is a handy PNG file for those occasions when the looping life cycle visualizations just won’t do.

This map of data curation has two visual metaphors. The first is that of a stylized mandala: a drawing that implies both inwards and outwards motion that is in balance. And the second is that of a Zen koan: first there is a mountain, then there’s none, and then there is. We start with visual complexity—the mountain. To build the data curation mountain we start with a definition of the word “curation” as a five-step process that moves inwards. The final purpose of this curation is to move what is being curated back into the world for re-use, publication and dissemination. This can be understood as stewardship. Next we think about the sources of data in the outside world. These sources have been abstracted into three data spaces: library digital collections, external data sources, and research data products. As this data moves “inwards” we can think of verbs that describe the ingestion processes. Metadata creation, or describing the data, is a key that enables later data linkages to be identified, with the final goal of making data interoperable. Once the data is “inside” the curation space it passes through a standard process that begins with storage and ends with discovery. Specific to data in this process are the formats in which the data is stored and the difference between preservation and conservation for data. To enable this work we need hardware, software, and human interfaces to the curated data. Finally, as the data moves back out into the world, we must pay attention to institutions of property rights and access. If we get this all right we will have a system that is sustainable, secure, and increases the value of our research data collections. Once again we have a mountain.


Science Boot Camp West

Last week Stanford Libraries hosted the third annual Science Boot Camp West (SBCW 2015),

“… building on the great Science Boot Camp events held at the University of Colorado, Boulder in 2013 and at the University of Washington, Seattle in 2014. Started in Massachusetts and spreading throughout the USA, science boot camps for librarians are 2.5 day events featuring workshops and educational presentations delivered by scientists with time for discussion and information sharing among all the participants. Most of the attendees are librarians involved in supporting research in the sciences, engineering, medicine or technology although anybody with an interest in science research is welcome.”

As a former researcher and newcomer to the library and research data management (RDM) scenes, I was already familiar with many of the considerable challenges on both sides of the equation (Jake Carlson recently summarized the plight of data librarians). What made SBCW 2015 such an excellent event is that it brought researchers and librarians together to identify immediate opportunities for collaboration. It also showcased examples of Stanford libraries and librarians directly facilitating the research process, from the full-service Stanford Geospatial Center to organizing Software and Data Carpentry workshops (more on this below, and from an earlier post).

Collaboration: Not just a fancy buzzword

The mostly Stanford-based researchers were generous with their time, introducing us to high-level concerns (e.g., why electrons do what they do in condensed matter) as well as more practical matters (e.g., shopping for alternatives to Evernote—yikes—for electronic lab notebooks [ELNs]). They revealed the intimate details of their workflows and data practices (Dr. Audrey Ellerbee admitted that it felt like letting guests into her home to find dirty laundry strewn everywhere, a common anxiety among researchers that in her case was unwarranted), flagged the roadblocks, and presented a constant stream of ideas for building relationships across disciplines and between librarians and researchers.

From the myriad opportunities for library involvement, here are some of the highlights:

  • Facilitate community discussions of best practices, especially for RDM issues such as programming, digital archiving, and data sharing
  • Consult with researchers about available software solutions (e.g., ELNs such as Labguru and LabArchives; note: representatives from both of these companies gave presentations and demonstrations at SBCW 2015), connect them with other users on campus, and provide help with licensing
  • Provide local/basic IT support for students and researchers using commercial products such as ELNs (e.g., maintain FAQ lists to field common questions)
  • Leverage experience with searching databases to improve delivery of informatics content to researchers (e.g., chemical safety data)
  • Provide training in and access to GIS and other data visualization tools

A winning model

The final half-day was dedicated to computer science-y issues. Following a trio of presentations involving computational workflows and accompanying challenges (the most common: members of the same research group writing the same pieces of code over and over with scant documentation and zero version control), Tracy Teal (Executive Director of Data Carpentry) and Amy Hodge (Science Data Librarian at Stanford) introduced a winning model for improving everyone’s research lives.

Software Carpentry and Data Carpentry are extremely affordable 2-day workshops that present basic concepts and tools for more effective programming and data handling, respectively. Training materials are openly licensed (CC-BY) and workshops are led by practitioners for practitioners allowing them to be tailored to specific domains (genomics, geosciences, etc.). At present the demand for these (international) workshops exceeds the capacity to meet it … except at Stanford. With local, library-based coordination, Amy has brokered (and in some cases taught) five workshops for individual departments or research groups (who covered the costs themselves). This is the very thing I wished for as a graduate student—muddling through databases and programming in R on my own—and I think it should be replicated at every research institution. Better yet, workshops aren’t restricted to the sciences; Data Carpentry is developing training materials for techniques used in the digital humanities such as text mining.

Learning to live outside of the academic bubble

Another, subtler theme that ran throughout the program was the need/desire to strengthen connections between the academy and industry. Efforts along these lines stand to improve the science underlying matters of public policy (e.g., water management in California) and public health (e.g., new drug development). They also address the mounting pressure placed on researchers to turn knowledge into products. Mark Smith addressed this topic directly during his presentation on ChEM-H: a new Stanford initiative for supporting research across Chemistry, Engineering, and Medicine to understand and advance Human Health. I appreciated that Mark—a medicinal chemist with extensive experience in both sectors—and others emphasized the responsibility to prepare students for jobs in a rapidly shifting landscape with increasing demand for technical skills.

Over the course of SBCW 2015 I met engaged librarians, data managers, researchers, and product managers, including some repeat attendees who raved about the previous two SBCW events; the consensus seemed to be that the third was another smashing success. Helen Josephine (Head of the Engineering Library at Stanford who chaired the organizing committee) is already busy gathering feedback for next year.

SBCW 2015 at Stanford included researchers from:

  • Gladstone Institutes in San Francisco
  • ChEM-H, Stanford’s lab for Chemistry, Engineering & Medicine for Human Health
  • Water in the West Institute at Stanford
  • NSF Engineering Research Center for Re-inventing the Nation’s Urban Water Infrastructure (ReNUWIt)

There were also special project topics on Software and Data Carpentry with Physics and BioPhysics faculty and Tracy Teal from Software Carpentry.

Many thanks to:

Helen Josephine, Suzanne Rose Bennett, and the rest of the Local Organizing Committee at Stanford. Sponsored by the National Network of Libraries of Medicine – Pacific Southwest Region, Greater Western Library Alliance, Stanford University Libraries, SPIE, IEEE, Springer Science+Business Media, Annual Reviews, Elsevier.

From Flickr by Paula Fisher (It was just like this, but indoors, with coffee, and powerpoints.)


Data metrics survey results published

Today, we are pleased to announce the publication of Making Data Count in Scientific Data. John Kratz and Carly Strasser led the research effort to understand the needs and values of both the researchers who create and use data and the data managers who preserve and publish it. The Making Data Count project is a collaboration between the CDL, PLOS, and DataONE to define and implement a practical suite of metrics for evaluating the impact of datasets, which is a necessary prerequisite to widespread recognition of datasets as first-class scholarly objects.

We started the project with research to understand what metrics would be meaningful to stakeholders and what metrics we can practically collect. We conducted a literature review, focus groups, and (the subject of today’s paper) a pair of online surveys for researchers and data managers.

In November and December of 2014, 247 researchers and 73 data repository managers answered our questions about data sharing, use, and metrics. Survey and anonymized data are available in the Dash repository. These responses told us, among other things, which existing Article Level Metrics (ALMs) might be profitably applied to data:

  • Social media: We should not worry excessively about capturing social media (Twitter, Facebook, etc.) activity around data yet, because there is not much to capture. Only 9% of researchers said they would “definitely” use social media to look for a dataset.
  • Page views: Page views are widely collected by repositories but neither researchers nor data managers consider them meaningful. (It stands to reason that, unlike a paper, you can’t have engaged very deeply with a dataset if all you’ve done is read about it.)
  • Downloads: Download counts, on the other hand, are both highly valuable and practical to collect (a toy tally sketch follows this list). Downloads were a resounding second-choice metric for researchers and 85% of repositories already track them.
  • Citations: Citations are the coin of the academic realm. They were by far the most interesting metric to both researchers and data managers. Unfortunately, citations are much more difficult than download counts to work with, and relatively few repositories track them. Beyond technical complexity, the biggest challenge is cultural: data citation practices are inconsistent at best, and formal data citation is rare. Despite the difficulty, the value of citations is too high to ignore, even in the short term.
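
One reason download counts are so practical, as noted in the list above, is that a repository can tally them straight from its own access logs. Here is a toy Python sketch; the log format and identifiers are invented for illustration, not any repository’s actual schema.

```python
from collections import Counter

# Toy repository access log: (dataset_id, action) pairs, invented for illustration.
access_log = [
    ("doi:10.5072/FK2AAA", "download"),
    ("doi:10.5072/FK2AAA", "page_view"),
    ("doi:10.5072/FK2BBB", "download"),
    ("doi:10.5072/FK2AAA", "download"),
]

# Count only the downloads, per dataset
downloads = Counter(ds for ds, action in access_log if action == "download")
print(downloads.most_common())  # [('doi:10.5072/FK2AAA', 2), ('doi:10.5072/FK2BBB', 1)]
```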

We have already begun to collect data on the sample project corpus: the entire DataONE collection of 100k+ datasets. Using this pilot corpus, we see preliminary indications of researcher engagement with data across a number of online channels not previously thought to be in use by scholars. The results of this pilot will complement the survey described in today’s paper with real measurement of data-related activities “in the wild.”

For more conclusions and in-depth discussion of the initial research, see the paper, which is open access. Stay tuned for analysis and results of the DataONE data-level metrics on the Making Data Count project page.

Does Your Library Delight You?

In a recent opinion piece in Forbes, Steve Denning provocatively asks, “Do we need libraries?”

As a digital librarian, my short answer is “Yes, of course we need libraries!” But, Denning makes many excellent points in cautioning that the same disruptive threats faced by many industries — think taxis and Uber, or hotels and AirBnB, for example — are also a threat to libraries. Denning argues that in today’s world, libraries must change their management practices and offerings in order to remain relevant. The computer age is not just about computerizing, he explains, but also about a fundamental shift that puts the customer or the user in control:


[From wikimedia user: Pumbaa80]

“… the most important thing that computers and the internet have done is not just to make things faster and easier for organizations. Even more importantly, they have shifted the balance of power in the marketplace from the seller to the buyer. The customer is now in charge. The customer has choices and good information about those choices. Unless customers and users are delighted, they can and will take their business elsewhere.”

To be clear, I would never suggest “Uber-izing” libraries, but there is much that those of us in the library world can learn from these evolving user-centered models.

Denning suggests a handful of “right” and “wrong” approaches to the future of libraries. Among the right approaches is the importance of focusing on how to “delight the user or customer.” We need to create services that truly meet or exceed the expectations of library users. We need to restructure ourselves in a way that ignites continuous innovation. And, we need to think about how to create services for users that they haven’t even thought of yet, while also continuing to perform the services that our users really love about our libraries, only faster and better.

Shifting the focus of academic research libraries to new models and areas of focus is not an easy task. But that’s exactly what’s happening at the UC Berkeley Libraries with the launch of UC Berkeley’s Research Data Management (RDM) program, a joint venture between the UCB Libraries and Berkeley’s Research Information Technologies (RIT) group. I recently attended the first public workshop for this program, and I’d say this initiative affirms that there’s a clear and compelling role for libraries in the future.

The UCB Libraries continue to re-tool themselves to meet the exponentially growing need to provide solutions for managing, preserving, and providing access to research data. They are proving innovative in their partnership with Research Information Technologies (RIT). Together, the Libraries and RIT bring to the table an excellent complement of staff and skills that, through collaboration, will help tackle the complex challenges of data management. From the get-go, the Research Data Management program has focused on being inclusive. At the first workshop, for instance, they cast a wide net to ensure attendance from a variety of disciplines and departments. They also sought everyone’s input and challenged us to think creatively about new solutions. And finally, they are focusing their efforts on connecting things that are working well in the library and across the campus with external resources that users can tap into.

The Research Data Management program’s three goals for the coming year include:

  • Training and Workshop Series: An in-person space to learn and share ideas across the campus, including hands-on training as well as tackling big picture topics such as policy, best practices and governance issues.
  • Rich, Online Resource Guide: A one-stop shop for researchers to find resources to support their work all along the research cycle.
  • Consultative Services: A personalized service to support research needs.

With the implementation of new funder requirements, the increased pressure to share data, and the fragility of digital media, researchers are feeling the pressure to come up with sustainable solutions for data management. Through the RDM program, the UC Berkeley Libraries are taking steps toward providing new services that users need, and others that they may not even know that they need.

At this first workshop, there was great energy and excitement in the room. I was certainly delighted, and I think UCB faculty, students, and staff will be too.

Meet the Team

The group spearheading the program includes:

  • Norm Cheng, Senior Project Manager
  • Harrison Dekker, Coordinator, Data Services
  • Susan Edwards, Head, Social Sciences Division
  • Mary Elings, Archivist for Digital Collections
  • David Greenbaum, Director, Research Information Technologies (RIT)
  • Chris Hoffman, Manager, Informatics Services
  • Rick Jaffe, Web Developer
  • John Lowe, Technical Lead and Manager for the CollectionSpace service
  • Erik Mitchell, Associate University Librarian, Director of Digital Initiatives and Collaborative Services
  • Felicia Poe, Interim UC Curation Center Director, California Digital Library

Make Data Rain

Last October, UC3, PLOS, and DataONE launched Making Data Count, a collaboration to develop data-level metrics (DLMs). This 12-month National Science Foundation-funded project will pilot a suite of metrics to track and measure data use that can be shared with funders, tenure & promotion committees, and other stakeholders.

[image from Freepik]

To understand how DLMs might work best for researchers, we conducted an online survey and held a number of focus groups, which culminated on a very (very) rainy night last December in a discussion at the PLOS offices with researchers in town for the 2014 American Geophysical Union Fall Meeting.

Six eminent researchers participated:

Much of the conversation concerned how to motivate researchers to share data. Sources of external pressure that came up included publishers, funders, and peers. Publishers can require (as PLOS does) that, at a minimum, the data underlying every figure be available. Funders might refuse to ‘count’ publications based on unavailable data, and refuse to renew funding for projects that don’t release data promptly. Finally, other researchers– in some communities, at least– are already disinclined to work with colleagues who won’t share data.

However, Making Data Count is particularly concerned with the inverse: not punishing researchers who don’t share, but rewarding those who do. For a researcher, metrics demonstrating data use serve not only to prove to others that their data is valuable, but also to affirm for themselves that taking the time to share their data is worthwhile. The researchers present regarded altmetrics with suspicion and overwhelmingly affirmed that citations are the preferred currency of scholarly prestige.

Many of the technical difficulties with data citation (e.g., citing dynamic data or a particular subset) came up in the course of the conversation. One interesting point was raised by many: when citing a data subset, the needs of reproducibility and credit diverge. For reproducibility, you need to know exactly what data have been used, at a maximum level of granularity. But credit is about resolving to a single product that the researcher gets credit for, regardless of how much of the dataset or what version of it was used, so less granular is better.
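
To make that divergence concrete, here is a small, hypothetical Python sketch of the two kinds of reference; the DOI (on DataCite’s 10.5072 test prefix), version tag, and query are all invented.

```python
# Reproducibility wants maximum granularity: exactly which slice of which version.
reproducibility_reference = {
    "doi": "10.5072/FK2EXAMPLE",  # invented DOI on the DataCite test prefix
    "version": "v2.1",
    "subset_query": "SELECT * FROM observations WHERE year = 2014",
}

# Credit wants minimum granularity: one resolvable product, whatever slice was used.
credit_reference = {
    "doi": "10.5072/FK2EXAMPLE",  # the dataset-level DOI and nothing more
}
```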

We would like to thank everyone who attended any of the focus groups. If you have ideas about how to measure data use, please let us know in the comments!


We are Hiring a DMPTool Manager!

Do you love all things data management as much as we do? Then join our team! We are hiring a person to help manage the DMPTool, including development prioritization, promotion, outreach, and education. The position is funded for two years with the potential for an extension pending funding and budgets. You would be based in the amazing city of Oakland, CA, home of the California Digital Library. Read more in the PDF description: Data Management Product Manager (4116).

Job Duties

Product Management (30%): Ensures the DMPTool remains a viable and relevant application: updates funder requirements, maintains the integrity of publicly available DMPs, contacts partner institutions to report issues, and reviews DMPTool guidance and content for currency. Evaluates and presents new technologies and industry trends. Recommends those that are applicable to current products or services and the organization’s long-range, strategic plans. Identifies, organizes, and participates in technical discussions with key advisory groups and other customers/clients. Identifies additional opportunities for value-added product/service delivery based on customer/client interaction and feedback.

Marketing and Outreach (20%): Develops and implements strategies for promoting the DMPTool. Creates marketing materials, updates website content, contacts institutions, and presents at workshops and/or conferences. Develops and participates in marketing and professional outreach activities and informational campaigns to raise awareness of the product or service, including communicating developments and updates to the community via social media. This includes maintaining the DMPTool blog, Twitter and Facebook accounts, GitHub Issues, and listservs.

Project Management (30%): Develops project plans including goals, deliverables, resources, budget, and timelines for enhancements of the DMPTool. Acts as product/service liaison across the organization, external agencies, and customers to ensure effective production, delivery, and operation of the DMPTool.

Strategic Planning (10%): Assists in strategic planning, prioritizing, and guiding future development of the DMPTool. Pursues outside collaborations and funding opportunities for future DMPTool development, including developing an engaged community of DMPTool users (researchers) and software developers to contribute to the codebase. Fosters and engages the open source community for future maintenance and enhancement.

Reporting (10%): Provides periodic content progress reports outlining key activities and progress toward achieving overall goals. Develops and reports on metrics/key performance indicators and provides corresponding analysis.

To apply, visit (Requisition No. 20140735)

From Flickr by Brenda Gottsabend

Announcing The Dash Tool: Data Sharing Made Easy

We are pleased to announce the launch of Dash – a new self-service tool from the UC Curation Center (UC3) and partners that allows researchers to describe, upload, and share their research data. Dash helps researchers perform the following tasks:

  • Prepare data for curation by reviewing best practice guidance for the creation or acquisition of digital research data.
  • Select data for curation through local file browse or drag-and-drop operation.
  • Describe data in terms of the DataCite metadata schema (see the sketch after this list).
  • Identify data with a persistent digital object identifier (DOI) for permanent citation and discovery.
  • Preserve, manage, and share data by uploading to a public Merritt repository collection.
  • Discover and retrieve data through faceted search and browse.
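
For a sense of what describing data in terms of the DataCite metadata schema involves, here is a minimal Python sketch of a record covering DataCite’s required properties. The dict layout and every value are illustrative assumptions, not Dash’s actual payload.

```python
# Hypothetical dataset description built around DataCite's required properties.
dataset_metadata = {
    "identifier": {
        "identifier": "10.5072/FK2EXAMPLE",  # invented DOI (10.5072 is the test prefix)
        "identifierType": "DOI",
    },
    "creators": [{"creatorName": "Otterbach, Johannes"}],
    "titles": [{"title": "Example research dataset"}],
    "publisher": "UC Curation Center (UC3)",
    "publicationYear": "2014",
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}
```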

Who can use Dash?

There are multiple instances of the Dash tool that all have similar functions, look, and feel.  We took this approach because our UC campus partners were interested in their Dash tool having local branding (read more). It also allows us to create new Dash instances for projects or partnerships outside of the UC (e.g., DataONE Dash and our Site Descriptors project).

Researchers at UC Merced, UCLA, UC Irvine, UC Berkeley, or UCOP can use their campus-specific Dash instance:

Other researchers can use DataONE Dash. This instance is available to anyone, free of charge; use your Google credentials to deposit data.

Note: Data deposited into any Dash instance is visible throughout all of Dash. For example, if you are a UC Merced researcher and use your campus instance to deposit data, your dataset will appear in search results for individuals looking for data via any of the Dash instances, regardless of campus affiliation.

See the Users Guide to get started using Dash.

Stay connected to the Dash project:

Dash Origins

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health (Grant Number UL1 TR000004).


Sound the horns! Dash is live! “Fontana del Nettuno” by Sorin P. from Flickr.


Data: Do You Care? The DLM Survey

We all know that data is important for research. So how can we quantify that? How can you get credit for the data you produce? What do you want to know about how your data is used?

If you are a researcher or data manager, we want to hear from you. Take this 5-10 minute survey and help us craft data-level metrics.

Please share widely! The survey will be open until December 1st.

Read more about the project or check out our previous post. Thanks to John Kratz for creating the survey and jumping through IRB hoops!

What do you think of data metrics? We’re listening. Click for more pics of dogs + radios.


Dash Project Receives Funding!

We are happy to announce that the Alfred P. Sloan Foundation has funded our project to improve the user interface and functionality of our Dash tool! The full grant text is available online.

More about Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit, and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we plan to make the platform compatible with other repositories using community protocols during our Sloan-funded work. The Dash project is open source; read more on our GitHub site. We encourage community discussion and contribution via GitHub Issues.

Currently there are five instances of the Dash tool available:

We plan to launch the new DataONE Dash instance in two weeks; this tool will replace the existing DataUp tool and allow anyone to deposit data into the DataONE infrastructure via the ONEShare repository using their Google credentials. Along with the release of DataONE Dash, we will release Dash 1.1 for the live sites listed above. There will be improvements to the user interface and experience.

The Newly Funded Sloan Project

Problem Statement

Researchers are not archiving and sharing their data in sustainable ways. Often data sharing involves using commercially owned solutions, posting data on personal websites, or submitting data alongside articles as supplemental material. A better option for data archiving is community repositories, which are owned and operated by trusted organizations (i.e., institutional or disciplinary repositories). Although disciplinary repositories are often known and used by researchers in the relevant field, institutional repositories are less well known as a place to archive and share data.

Why aren’t researchers using institutional repositories?

First, the repositories are often not set up for self-service operation by individual researchers who wish to deposit a single dataset without assistance. Second, many (or perhaps most) institutional repositories were created with publications in mind, rather than datasets, which may in part account for their less-than-ideal functionality. Third, user interfaces for the repositories are often poorly designed and do not take into account the user’s experience (or inexperience) and expectations. Because more of our activities are conducted on the Internet, we are exposed to many high-quality, commercial-grade user interfaces in the course of a workday. Correspondingly, researchers have expectations for clean, simple interfaces that can be learned quickly, with minimal need for contacting repository administrators.

Our Solution

We propose to address the three issues above with Dash, a well-designed, user friendly data curation platform that can be layered on top of existing community repositories. Rather than creating a new repository or rebuilding community repositories from the ground up, Dash will provide a way for organizations to allow self-service deposit of datasets via a simple, intuitive interface that is designed with individual researchers in mind. Researchers will be able to document, preserve, and publicly share their own data with minimal support required from repository staff, as well as be able to find, retrieve, and reuse data made available by others.

Three Phases of Work

  1. Requirements gathering: Before the design process begins, we will gather requirements from researchers via interviews and surveys.
  2. Design work: Based on surveys and interviews with researchers (Phase 1), we will develop requirements for a researcher-focused user interface that is visually appealing and easy to use.
  3. Technical work: Dash will be an added-value data sharing platform that integrates with any repository that supports community protocols, e.g., SWORD (Simple Web-service Offering Repository Deposit); a hedged deposit sketch follows this list.
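
To illustrate what a SWORD-style deposit might look like, here is a minimal Python sketch using the requests library. The endpoint URL, credentials, and file name are hypothetical, and Dash’s actual integration may well differ; the Packaging header value is SWORD v2’s standard SimpleZip URI.

```python
import requests

SWORD_COLLECTION = "https://repository.example.edu/sword/collection"  # hypothetical endpoint

# POST a zipped dataset to a SWORD v2 collection as a binary deposit.
with open("dataset.zip", "rb") as payload:
    response = requests.post(
        SWORD_COLLECTION,
        data=payload,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "filename=dataset.zip",  # per SWORD deposit convention
            "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        },
        auth=("depositor", "secret"),  # hypothetical credentials
    )

print(response.status_code)  # a successful SWORD deposit returns 201 Created
```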

The dash is a critical component of any good ASCII art. By Reddit user Haleljacob

