There’s a new Dash!

Dash: an open source, community approach to data publication

We have great news! Last week we refreshed our Dash data publication service. For those of you who don’t know, Dash is an open source, community-driven project that takes a unique approach to data publication and digital preservation.

Dash focuses on search, presentation, and discovery and delegates the responsibility for the data preservation function to the underlying repository with which it is integrated. It is a project based at the University of California Curation Center (UC3), a program at California Digital Library (CDL) that aims to develop interdisciplinary research data infrastructure.

Dash employs a multi-tenancy user interface, providing partners with extensive opportunities for local branding and customization, use of existing campus login credentials, and a tenant-specific URL for the Dash service, an important consideration in driving adoption. We welcome collaborations with other organizations wishing to provide a simple, intuitive data publication service on top of more cumbersome legacy systems.

Dash currently has the following live instances:

  • UC Berkeley
  • UC Irvine
  • UC Merced
  • UC Office of the President
  • UC Riverside
  • UC Santa Cruz
  • UC San Francisco
  • ONEshare (in partnership with DataONE)

Architecture and Implementation

Dash is completely open source. Our code is made publicly available on GitHub (http://cdluc3.github.io/dash/). Dash is based on an underlying Ruby-on-Rails data publication platform called Stash. Stash encompasses three main functional components: Store, Harvest, and Share.

  • Store: The Store component is responsible for the selection of datasets; their description in terms of configurable metadata schemas, including specification of ORCID and FundRef identifiers for researcher and funder disambiguation; the assignment of DOIs for stable citation and retrieval; designation of an optional limited time embargo; and packaging and submission to the integrated repository
  • Harvest: The Harvest component is responsible for retrieval of descriptive metadata from that repository for inclusion into a Solr search index
  • Share: The Share component, based on GeoBlacklight, is responsible for the faceted search and browse interface
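The division of labor among the three components can be sketched in a few lines. This is an illustration only: Stash itself is a Ruby on Rails platform, and the Python below is not its code. The `Dataset`, `Repository`, `store`, `harvest`, and `share` names are hypothetical, the DOI uses a made-up example value, a dictionary stands in for the Solr index, and a title match stands in for faceted search.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Hypothetical, heavily simplified description of a dataset.
    doi: str
    title: str
    creators: list
    files: list = field(default_factory=list)


class Repository:
    """Stand-in for the integrated preservation repository."""

    def __init__(self):
        self._packages = {}

    def deposit(self, dataset):
        self._packages[dataset.doi] = dataset

    def list_metadata(self):
        # Expose descriptive metadata only, not the data files.
        return [
            {"doi": d.doi, "title": d.title, "creators": d.creators}
            for d in self._packages.values()
        ]


def store(repo, dataset):
    """Store: package a described, DOI-bearing dataset and submit it."""
    repo.deposit(dataset)


def harvest(repo):
    """Harvest: pull descriptive metadata into a search index (a dict here)."""
    return {record["doi"]: record for record in repo.list_metadata()}


def share(index, term):
    """Share: search the index (a case-insensitive title match here)."""
    return [
        record["doi"]
        for record in index.values()
        if term.lower() in record["title"].lower()
    ]


repo = Repository()
store(repo, Dataset("10.5555/example1", "Kelp forest survey", ["A. Researcher"]))
index = harvest(repo)
matches = share(index, "kelp")
```

The point of the sketch is that the search side only ever sees harvested metadata, while the data files stay in the repository.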

Dash Architecture Diagram

Individual dataset landing pages are formatted as an online version of a data paper, presenting all appropriate descriptive and administrative metadata in a form that can be downloaded as an individual PDF file, or as part of the complete dataset download package, incorporating all data files for all versions.

To facilitate flexible configuration and future enhancement, all support for the various external service providers and repository protocols is fully encapsulated in pluggable modules. Metadata modules are available for the DataCite and Dublin Core metadata schemas. Protocol modules are available for the SWORD 2.0 deposit protocol and the OAI-PMH and ResourceSync harvesting protocols. Authentication modules are available for InCommon/Shibboleth and Google/OAuth2 identity providers (IdPs). We welcome collaborations to develop modules for additional metadata schemas and repository protocols. Please email UC3 (uc3 at ucop dot edu) or visit GitHub (http://cdluc3.github.io/dash/) for more information.
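One common way to get this kind of pluggability is a registry that maps a schema name to the module that handles it. The sketch below assumes that pattern for illustration; it is not how Stash (a Ruby on Rails codebase) actually wires its modules. The `metadata_module` decorator, the `serialize` function, and the output field names are all hypothetical, and the field names are simplified rather than faithful renderings of either schema.

```python
from typing import Callable, Dict

# Registry of metadata modules, keyed by schema name (hypothetical design).
METADATA_MODULES: Dict[str, Callable[[dict], dict]] = {}


def metadata_module(schema_name):
    """Decorator that plugs a serializer into the registry."""
    def register(func):
        METADATA_MODULES[schema_name] = func
        return func
    return register


@metadata_module("datacite")
def to_datacite(record):
    # Simplified illustration, not the full DataCite schema.
    return {
        "titles": [{"title": record["title"]}],
        "creators": [{"name": name} for name in record["creators"]],
    }


@metadata_module("dublin_core")
def to_dublin_core(record):
    # Simplified illustration, not the full Dublin Core element set.
    return {"dc:title": record["title"], "dc:creator": list(record["creators"])}


def serialize(record, schema_name):
    """Dispatch to whichever module is registered for the schema."""
    try:
        module = METADATA_MODULES[schema_name]
    except KeyError:
        raise ValueError(f"no metadata module registered for {schema_name!r}") from None
    return module(record)


record = {"title": "Kelp forest survey", "creators": ["A. Researcher"]}
dublin_core = serialize(record, "dublin_core")
```

Under this design, supporting another schema means registering one more function; the calling code does not change.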

Features of the newly refreshed Dash service

What’s new in the refreshed Dash service? Take a look.

| Feature | Tech-focused | User-focused | Description |
| --- | --- | --- | --- |
| Open Source | X | | All components open source, MIT licensed code (http://cdluc3.github.io/dash/) |
| Standards compliant | X | | Dash integrates with any SWORD/OAI-PMH-compliant repository |
| Pluggable Framework | X | | Inherent extensibility for supporting additional protocols and metadata schemas |
| Flexible metadata schemas | X | | Supports the DataCite metadata schema out of the box, but can be configured to support any schema |
| Innovation | X | | Our modular framework will make new feature development easier and quicker |
| Mobile/responsive design | X | X | Built mobile-first, from the ground up, for better user experience |
| Geolocation – Metadata | X | X | For applicable research outputs, we have an easy-to-use way to capture the location of your datasets |
| Persistent Identifiers – ORCID | X | X | Dash allows researchers to attach their ORCID, allowing them to track and get credit for their work |
| Persistent Identifiers – DOIs | X | X | Dash issues DOIs for all datasets, allowing researchers to track and get credit for their work |
| Persistent Identifiers – FundRef | X | X | Dash tracks funder information using FundRef, allowing researchers and funders to track their research outputs |
| Login – Shibboleth/OAuth2 | X | X | We offer easy single sign-on with your campus credentials or Google account |
| Versioning | X | X | Datasets can change. Dash offers a quick way for you to upload new versions of your datasets and a simple process for tracking updates |
| Accessibility | X | X | The technology, design, and user workflows have all been built with accessibility in mind |
| Better user experience | | X | Self-depositing made easy. Simple workflow, drag-and-drop upload, simple navigation, clean data publication pages, user dashboards |
| Geolocation – Search | | X | With GeoBlacklight, we can offer search by location |
| Robust Search | | X | Search by subject, filetype, keywords, campus, location, etc. |
| Discoverability | | X | Indexing by search engines such as Google, Bing, etc. |
| Build Relationships | | X | Many datasets are related to publications or other data. Dash offers a quick way to describe these relationships |
| Supports Best Practices | | X | Data publication can be confusing. With Dash, you can trust that best practices are being followed |
| Data Metrics | | X | See the reach of your datasets through usage and download metrics |
| Data Citations | | X | Quick access to a well-formed citation reference (with DOI) for every data publication. Easy for your peers to quickly grab |
| Open License | | X | Dash supports open Creative Commons licensing for all data deposits; can be configured for other licenses |
| Lower Barrier to Entry | | X | For those in a hurry, Dash offers a quick interface to self-deposit. Only three steps and few required fields |
| Support Data Reuse | | X | Focus researchers on describing methods and explaining ways to reuse their datasets |
| Satisfies Data Availability Requirements | | X | Many publishers and funders require researchers to make their data available. Dash is a readily accepted and easy way to comply |

A little Dash history

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health. Dash version 2 was developed by UC3 and partners with funding from the Alfred P. Sloan Foundation (our funded proposal). Read more about the code, the project, and contributing to development on the Dash GitHub site.

A little Dash future

We will continue the development of the new Dash platform and will keep you posted. Next up: support for timed deposits and embargoes.  Stay tuned!


Using Amazon S3 and Glacier for Merritt: An Update

The integration of the Merritt repository with Amazon’s S3 and Glacier cloud storage services, previously described in an August 16 post on the Data Pub blog, is now mostly complete. The new Amazon storage supplements Merritt’s longstanding reliance on UC private cloud offerings at UCLA and UCSD. Content tagged for public access is now routed to S3 for primary storage, with automatic replication to UCSD and UCLA. Private content is routed first to UCSD, and then replicated to UCLA and Glacier. Content is served for retrieval from the primary storage location; in the unlikely event of a failure, Merritt automatically retries from secondary UCSD or UCLA storage. Glacier, which provides near-line storage with four-hour retrieval latency, is not used to respond to user-initiated retrieval requests.

| Content Type | Primary Storage | Secondary Storage | Primary Retrieval | Secondary Retrieval |
| --- | --- | --- | --- | --- |
| Public | S3 | UCSD, UCLA | S3 | UCSD, UCLA |
| Private | UCSD | UCLA, Glacier | UCSD | UCLA |
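The routing and failover rules described above can be written down as a small policy table plus a retry loop. This is a minimal sketch under stated assumptions, not Merritt’s implementation: the `STORAGE_POLICY` structure, the function names, and the convention that a failed fetch raises `OSError` are all hypothetical.

```python
# Where each content type is stored, per the description above.
STORAGE_POLICY = {
    "public": {"primary": "S3", "replicas": ["UCSD", "UCLA"]},
    "private": {"primary": "UCSD", "replicas": ["UCLA", "Glacier"]},
}

# Near-line storage is kept for preservation, not for serving user requests.
NEAR_LINE = {"Glacier"}


def retrieval_order(content_type):
    """Primary location first, then any replica that can serve requests."""
    policy = STORAGE_POLICY[content_type]
    fallbacks = [node for node in policy["replicas"] if node not in NEAR_LINE]
    return [policy["primary"]] + fallbacks


def retrieve(content_type, fetch):
    """Try each location in order; fetch(node) returns bytes or raises OSError."""
    last_error = None
    for node in retrieval_order(content_type):
        try:
            return node, fetch(node)
        except OSError as err:
            last_error = err
    raise RuntimeError(
        f"all retrieval locations failed for {content_type} content"
    ) from last_error


def s3_is_down(node):
    # Simulated outage of the primary public store, for illustration only.
    if node == "S3":
        raise OSError("S3 unavailable")
    return b"object bytes"


served_from, payload = retrieve("public", s3_is_down)
```

In the simulated outage, the public object is served from UCSD, the first secondary location. Glacier never appears in a retrieval order because of its retrieval latency.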

In preparation for this integration, all retrospective public content, over 1.1 million objects and 3 TB, was copied from UCSD to S3, a process that took about six days to complete. A similar move from UCSD to Glacier is now underway for the much larger corpus of private content, 1.5 million objects and 71 TB, which is expected to take about five weeks to complete.
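For a rough sense of scale, the figures above imply the following average rates. The calculation treats the quoted numbers as exact and assumes decimal units (1 TB = 1,000,000 MB), although the post gives them only approximately ("over 1.1 million", "about six days", "about five weeks"). The `migration_stats` helper is hypothetical.

```python
def migration_stats(objects, terabytes, days):
    """Average throughput and object size implied by a bulk copy."""
    return {
        "tb_per_day": round(terabytes / days, 2),
        "objects_per_day": round(objects / days),
        "avg_object_mb": round(terabytes * 1_000_000 / objects, 1),
    }


# Public content: UCSD to S3, about six days.
public = migration_stats(objects=1_100_000, terabytes=3, days=6)

# Private content: UCSD to Glacier, expected to take about five weeks.
private = migration_stats(objects=1_500_000, terabytes=71, days=5 * 7)
```

On these assumptions the public copy averaged about 0.5 TB per day, while the private copy is expected to average about 2 TB per day. Private objects are also much larger on average, roughly 47 MB against under 3 MB for public ones.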

The Merritt-Amazon integration enables more optimized internal workflows and increased levels of reliability and preservation assurance. It also holds the promise of lowering overall storage costs, and thus, the recharge price of Merritt for our campus customers.  Amazon has, for example, recently announced significant price reductions for S3 and Glacier storage capacity, although their transactional fees remain unchanged.  Once the long-term impact of S3 and Glacier pricing on Merritt costs is understood, CDL will be able to revise Merritt pricing appropriately.

CDL is also investigating the possible use of the Oracle archive cloud as a lower-cost alternative, or supplement, to Glacier for dark archival content hosting. While offering similar function to Glacier, including four-hour retrieval latency, Oracle’s price point is about one quarter of Glacier’s for storage capacity.

An RDM Model for Researchers: What we’ve learned

Thanks to everyone who gave feedback on our previous blog post describing our data management tool for researchers. We received a great deal of input related to our guide’s use of the term “data sharing” and our guide’s position in relation to other RDM tools as well as quite a few questions about what our guide will include as we develop it further.

As stated in our initial post, we’re building a tool to enable individual researchers to assess the maturity of their data management practices within an institutional or organizational context. To do this, we’ve taken the concept of RDM maturity from existing tools like the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide and placed it within a framework familiar to researchers, the research data lifecycle.

A visualization of our guide as presented in our last blog post. An updated version, including changes made in response to reader feedback, is presented later in this post.

Data Sharing

The most immediate feedback we received was about the term “Data Sharing”. Several commenters pointed out the ambiguity of this term in the context of the research data life cycle. In the last iteration of our guide, we intended “Data Sharing” as a shorthand to describe activities related to the communication of data. Such activities may range from describing data in a traditional scholarly publication to depositing a dataset in a public repository or publishing a data paper. Because existing data sharing policies (e.g. PLOS, The Gates Foundation, and The Moore Foundation) refer specifically to the latter over the former, the term is clearly too imprecise for our guide.

Like “Data Sharing”, “Data Publication” is a popular term for describing activities surrounding the communication of data. Even more than “Sharing”, “Publication” relays our desire to advance practices that treat data as a first class research product. Unfortunately the term is simultaneously too precise and too ambiguous to be useful in our guide. On one hand, the term “Data Publication” can refer specifically to a peer reviewed document that presents a dataset without offering any analysis or conclusion. While data papers may be a straightforward way of inserting datasets into the existing scholarly communication ecosystem, they represent a single point on the continuum of data management maturity. On the other hand, there is currently no clear consensus among researchers about what it means to “publish” data.

For now, we’ve given that portion of our guide the preliminary label of “Data Output”. As the development process proceeds, this row will include a full range of activities, from description of data in traditional scholarly publications (that may or may not include a data availability statement) to depositing data into public repositories and the publication of data papers.

Other Models and Guides

While we correctly identified that there is a range of rubrics, tools, and capability models with aims similar to those of our guide, we overstated the claim that ours uniquely allows researchers to assess where they are and where they want to be in regards to data management. Several of the tools we cited in our initial post can be applied by researchers to measure the maturity of data management practices within a project or institutional context.

Below we’ve profiled four such tools and indicated how we believe our guide differs from each. In differentiating our guide, we do not mean to position it strictly as an alternative. Rather, we believe that our guide could be used in concert with these other tools.

Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)

CARDIO is a benchmarking tool designed to be used by researchers, service providers, and coordinators for collaborative data management strategy development. Designed to be applied at a variety of levels, from entire institutions down to individual research projects, CARDIO enables its users to collaboratively assess data management requirements, activities, and capacities using an online interface. Users of CARDIO rate their data management infrastructure relative to a series of statements concerning their organization, technology, and resources. After completing CARDIO, users are given a comprehensive set of quantitative capability ratings as well as a series of practical recommendations for improvement.

Unlike CARDIO, our guide does not necessarily assume its users are in contact with data-related service providers at their institution. As we stated in our initial blog post, we intend to guide researchers to specialist knowledge without necessarily turning them into specialists. Therefore, we would consider a researcher making contact with their local data management, research IT, or library service providers for the first time as a positive application of our guide.

Community Capability Model Framework (CCMF)

The Community Capability Model Framework is designed to evaluate a community’s readiness to perform data intensive research. Intended to be used by researchers, institutions, and funders to assess current capabilities, identify areas requiring investment, and develop roadmaps for achieving a target state of readiness, the CCMF encompasses eight “capability factors” including openness, skills and training, research culture, and technical infrastructure. When used alongside the Capability Profile Template, the CCMF provides its users with a scorecard containing multiple quantitative scores related to each capability factor.   

Unlike the CCMF, our guide does not necessarily assume that its users should all be striving towards the same level of data management maturity. We recognize that data management practices may vary significantly between institutions or research areas and that what works for one researcher may not necessarily work for another. Therefore, we would consider researchers understanding the maturity of their data management practices within their local contexts to be a positive application of our guide.

Data Curation Profiles (DCP) and DMVitals

The Data Curation Profile toolkit is intended to address the needs of an individual researcher or research group with regards to the “primary” data used for a particular project. Taking the form of a structured interview between an information professional and a researcher, a DCP can allow an individual research group to consider their long-term data needs, enable an institution to coordinate their data management services, or facilitate research into broader topics in digital curation and preservation.

DMVitals is a tool designed to take information from a source like a Data Curation Profile and use it to systematically assess a researcher’s data management practices in direct comparison to institutional and domain standards. Using DMVitals, a consultant matches a list of evaluated data management practices with responses from an interview and ranks the researcher’s current practices by their level of data management “sustainability.” The tool then generates customized and actionable recommendations, which the consultant provides to the researcher as guidance to improve his or her data management practices.

Unlike DMVitals, our guide does not calculate a quantitative rating to describe the maturity of data management practices. From a measurement perspective, the range of practice maturity may differ between the four stages of our guide (e.g. the “Project Planning” stage could have greater or fewer steps than the “Data Collection” stage), which would significantly complicate the interpretation of any quantitative ratings derived from our guide. We also recognize that data management practices are constantly evolving and likely dependent on disciplinary and institutional context. On the other hand, we also recognize the utility of quantitative ratings for benchmarking. Therefore, if, after assessing the maturity of their data management practices with our guide, a researcher chooses to apply a tool like DMVitals, we would consider that a positive application of our guide.

Our Model (Redux)

Perhaps the biggest takeaway from the response to our  last blog post is that it is very difficult to give detailed feedback on a guide that is mostly whitespace. Below is an updated mock-up, which describes a set of RDM practices along the continuum of data management maturity. At present, we are not aiming to illustrate a full range of data management practices. More simply, this mock-up is intended to show the types of practices that could be described by our guide once it is complete.

An updated visualization of our guide based on reader feedback. At this stage, the example RDM practices are intended to be representative, not comprehensive.

Project Planning

The “Project Planning” stage describes practices that occur prior to the start of data collection. Our examples are all centered around data management plans (DMPs), but other considerations at this stage could include training in data literacy, engagement with local RDM services, inclusion of “sharing” in project documentation (e.g. consent forms), and project pre-registration.

Data Collection

The “Data Collection” stage describes practices related to the acquisition, accumulation, measurement, or simulation of data. Our examples relate mostly to standards around file naming and structuring, but other considerations at this stage could include the protection of sensitive or restricted data, validation of data integrity, and specification of linked data.

Data Analysis

The “Data Analysis” stage describes practices that involve the inspection, modeling, cleaning, or transformation of data. Our examples mostly relate to documenting the analysis workflow, but other considerations at this stage could include the generation and annotation of code and the packaging of data within sharable files or formats.

Data Output

The “Data Output” stage describes practices that involve the communication of either the data itself or conclusions drawn from the data. Our examples are mostly related to the communication of data linked to scholarly publications, but other considerations at this stage could include journal and funder mandates around data sharing, the publication of data papers, and the long term preservation of data.

Next Steps

Now that we’ve solicited a round of feedback from the community that works on issues around research support, data management, and digital curation, our next step is to broaden our scope to include researchers.

Specifically we are looking for help with the following:

  • Do you find the divisions within our model useful? We’ve used the research data lifecycle as a framework because we believe it makes our tool user-friendly for researchers. At the same time, we also acknowledge that the lines separating planning, collection, analysis, and output can be quite blurry. We would be grateful to know if researchers or data management service providers find these divisions useful or overly constrained.
  • Should there be more discrete “steps” within our framework? Because we view data management maturity as a continuum, we have shied away from creating discrete steps within each division. We would be grateful to know how researchers or data management service providers view this approach, especially when compared to the more quantitative approach employed by CARDIO, the Capability Profile Template, and DMVitals.
  • What else should we put into our model? Researchers are faced with changing expectations and obligations in regards to data management. We want our model to reflect that. We also want our model to reflect the relationship between research data management and broader issues like openness and reproducibility. With that in mind, what other practices and considerations should our model include?

Collaborative Web Archiving with Cobweb

A partnership between the CDL, Harvard Library, and UCLA Library has been awarded funding from IMLS to create Cobweb, a collaborative collection development platform for web archiving.

The demands of archiving the web in comprehensive breadth or thematic depth easily exceed the technical and financial capacity of any single institution. To ensure that the limited resources of archiving programs are deployed most effectively, it is important that their curators know something about the collection development priorities and holdings of other, similarly-engaged institutions. Cobweb will meet this need by supporting three key functions: nominating, claiming, and holdings. The nomination function will let curators and stakeholders suggest web sites pertinent to specific thematic areas; the claiming function will allow archival programs to indicate an intention to capture some subset of nominated sites; and the holdings function will allow programs to document sites that have actually been captured.

How will Cobweb work? Imagine a fast-moving news event unfolding online via news reports, videos, blogs, and social media. Recognizing the importance of recording this event, a curator immediately creates a new Cobweb project and issues an open call for nominations. Scholars, subject area specialists, interested members of the public, and event participants themselves quickly respond, contributing to a site list more comprehensive than could be created by any one curator or institution. Archiving institutions review the site list and publicly claim responsibility for capturing portions of it that are consistent with their local policies and technical capabilities. After capture, the institutions’ holdings information is updated in Cobweb to disclose the various collections containing newly available content. It’s important to note that Cobweb collects only metadata; the actual archived web content would continue to be managed by the individual collecting organizations. Nevertheless, by distributing the responsibility, more content will be captured more quickly with less overall effort than would otherwise be possible.
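The three functions can be pictured as operations on a shared, metadata-only record. The sketch below is a hypothetical data model for illustration, not Cobweb’s actual design: the `Project` class, its method names, the rule that only nominated sites can be claimed, and the institution names and URLs are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Metadata-only record of one collecting project (hypothetical model)."""
    title: str
    nominations: set = field(default_factory=set)  # suggested site URLs
    claims: dict = field(default_factory=dict)     # URL -> institutions intending to capture
    holdings: dict = field(default_factory=dict)   # URL -> institutions that have captured

    def nominate(self, url):
        self.nominations.add(url)

    def claim(self, institution, url):
        # Assumption: only nominated sites can be claimed.
        if url not in self.nominations:
            raise ValueError(f"{url} has not been nominated")
        self.claims.setdefault(url, set()).add(institution)

    def record_holding(self, institution, url):
        self.holdings.setdefault(url, set()).add(institution)

    def unclaimed(self):
        """Nominated sites that no institution has yet claimed."""
        return sorted(self.nominations - set(self.claims))


project = Project("Fast-moving news event")
project.nominate("https://example.org/news")
project.nominate("https://example.org/blog")
project.claim("Library A", "https://example.org/news")
project.record_holding("Library A", "https://example.org/news")
still_open = project.unclaimed()
```

A curator could consult something like `unclaimed()` to see where coverage is still missing, which is the kind of informed allocation decision the platform is meant to support. No archived content is held here, only statements about it.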

Cobweb will help libraries and archives make better informed decisions regarding the allocation of their individual programmatic resources, and promote more effective institutional collaboration and sharing.

This project was made possible in part by the Institute of Museum and Library Services, #LG-70-16-0093-16.


CC BY and data: Not always a good fit

This post was originally published on the University of California Office of Scholarly Communication blog.

Last post I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools, and how they work with data.


Building a user-friendly RDM maturity model

UC3 is developing a guide to help researchers assess and progress the maturity of their data management practices.

What are we doing?

Researchers are increasingly faced with new expectations and obligations in regards to data management. To help researchers navigate this changing landscape and to complement existing instruments that enable librarians and other data managers to assess the maturity of data management practices at an institutional or organizational level, we’re developing a guide that will enable researchers to assess the maturity of their individual practices within an institutional or organizational context.

Our aim is to be descriptive rather than prescriptive. We do not assume every researcher will want or need to achieve the same level of maturity for all their data management practices. Rather, we aim to provide researchers with a guide to specialist knowledge without necessarily turning researchers into specialists. We want to help researchers understand where they are and, where appropriate, how to get to where they want or need to be.

Existing Models

As a first step in building our own guide, we’ve researched the range of related tools, rubrics, and capability models. Many, including the Five Organizational Stages of Digital Preservation, the Scientific Data Management Capability Model, and the Capability Maturity Guide developed by the Australian National Data Service, draw heavily from the SEI Capability Maturity Model and are intended to assist librarians, repository managers, and other data management service providers in benchmarking the policies, infrastructure, and services of their organization or institution. Others, including the Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO), DMVitals, and the Community Capability Model Framework, incorporate feedback from researchers and are intended to assist in benchmarking a broad set of data management-related topics for a broad set of stakeholders, from organizations and institutions down to individual research groups.

We intend for our guide to build on these tools but to have a different, and we think novel, focus. While we believe it could be a useful tool for data management service providers, the intended audience of our guide is research practitioners. While integration with service providers in the library, research IT, and elsewhere will be included where appropriate, the focus will be on equipping researchers to assess and refine their own individual data management activities. While technical infrastructure will be included where appropriate, the focus will be on behaviors, “soft skills”, and training.

Our Guide

Below is a preliminary mockup of our guide. Akin to the “How Open Is It?” guide developed by SPARC, PLOS, and the OASPA, our aim is to provide a tool that is comprehensive and user-friendly and that offers tangible recommendations.

[Figure: preliminary mockup of the guide]

Obviously we still have a significant amount of work to do to refine the language and fill in the details. At the moment, we are using elements of the research data lifecycle to broadly describe research activities and very general terms to describe the continuum of practice maturity. Our next step is to fill in the blanks: to more precisely describe research activities and more clearly delineate the stages of practice maturity. From there, we will work to outline the behaviors, skills, and expertise present for each research activity at each stage.

Next Steps

Now that we’ve researched existing tools for assessing data management services and sketched out a preliminary framework for our guide, our next step is to elicit feedback from the broader community that works on issues around research support, data management, and digital curation and preservation.

Specifically we are looking for help on the following:

  • Have we missed anything? There is a range of data management-related rubrics, tools, and capability models – from the community-focused frameworks described above to frameworks focused on the preservation and curation of digital assets (e.g. the Digital Asset Framework, DRAMBORA). As far as we’re aware, there isn’t a complementary tool that allows researchers to assess where they are and where they want to be in regards to data management. Are there efforts that have already met this need? We’d be grateful for any input about the existence of frameworks with similar goals.
  • What would be the most useful divisions and steps within our framework? The “three legged stool” developed by the Digital Preservation Management workshop has been highly influential for community and data management provider-focused tools. Though examining policies, resources, and infrastructure is also important for researchers when self-assessing their data management practices, we believe it would be more useful for our guide to be more reflective of how data is generated, managed, and disseminated in a research context. We’d be grateful for any insight into how we could incorporate related models, such as those depicting the research data lifecycle, into our framework.

Who “owns” your data?

This post was originally published on the University of California Office of Scholarly Communication blog.

Which of these is true?

“The PI owns the data.”

“The university owns the data.”

“Nobody can own it; data isn’t copyrightable.”

You’ve probably heard somebody say at least one of these things — confidently. Maybe you’ve heard all of them. Maybe about the same dataset (but in that case, hopefully not from the same person). So who really owns research data? Well, the short answer is “it depends.”

A longer answer is that determining ownership (and whether there’s even anything to own) can be frustratingly complicated, and, even when obvious, ownership only determines some of what can be done with data. Other things like policies, contracts, and laws may dictate certain terms in circumstances where ownership isn’t relevant, or even augment or overrule an owner where it is. To avoid an unpleasant surprise about what you can or can’t do with your data, you’ll want to plan ahead and think beyond the simple question of ownership.

PIDapalooza – What, Why, When, Who?


PIDapalooza, a community-led conference on persistent identifiers
November 9-10, 2016
Radisson Blu Saga Hotel
pidapalooza.org

PIDapalooza will bring together creators and users of persistent identifiers (PIDs) from around the world to shape the future PID landscape through the development of tools and services for the research community. PIDs support proper attribution and credit, promote collaboration and reuse, enable reproducibility of findings, foster faster and more efficient progress, and facilitate effective sharing, dissemination, and linking of scholarly works.

If you’re doing something interesting with persistent identifiers, or you want to, come to PIDapalooza and share your ideas with a crowd of committed innovators.

Conference themes include:

  1. PID myths. Are PIDs better in our minds than in reality? PID stands for Persistent IDentifier, but what does that mean and does such a thing exist?
  2. Achieving persistence. So many factors affect persistence: mission, oversight, funding, succession, redundancy, governance. Is open infrastructure for scholarly communication the key to achieving persistence?
  3. PIDs for emerging uses. Long-term identifiers are no longer just for digital objects. We have use cases for people, organizations, vocabulary terms, and more. What additional use cases are you working on?
  4. Legacy PIDs. There are thousands of venerable old identifier systems that people want to continue using and bring into the modern data citation ecosystem. How can we manage this effectively?
  5. The I-word. What would make heterogeneous PID systems “interoperate” optimally? Would standardized metadata and APIs across PID types solve many of the problems, and if so, how would that be achieved? What about standardized link/relation types?
  6. PIDagogy. It’s a challenge for those who provide PID services and tools to engage the wider community. How do you teach, learn, persuade, discuss, and improve adoption? What’s it mean to build a pedagogy for PIDs?
  7. PID stories. Which strategies worked? Which strategies failed? Tell us your horror stories! Share your victories!
  8. Kinds of persistence. What are the frontiers of ‘persistence’? We hear lots about fraud prevention with identifiers for scientific reproducibility, but what about data papers promoting PIDs for long-term access to reliably improving objects (software, pre-prints, datasets) or live data feeds?

PIDapalooza is organized by California Digital Library, Crossref, DataCite, and ORCID.  

We believe that bringing together everyone who’s working with PIDs for two days of discussions, demos, workshops, brainstorming, and updates on the state of the art will catalyze the development of PID community tools and services.  

And you can help by getting involved!

Propose a session

Please send us your session ideas by September 18. We will notify you about your proposals in the first week of October.

Register to attend

Registration is now open — come join the festival with a crowd of like-minded innovators. And please help us spread the word about PIDapalooza in your community!

Stay tuned

Keep updated with the latest news at the PIDapalooza website and on Twitter (@PIDapalooza) in the coming weeks.

See you in November!

UC3 to Explore Amazon S3 and Glacier Use for Merritt Storage

The UC Curation Center (UC3) has offered innovative digital content access and preservation services to the UC community for over six years through its Merritt repository.  Merritt was developed by UC3 to address unique needs for high-quality curation services at scale and a low price point.   Recently, UC3 started looking into Amazon’s S3 and Glacier cloud storage products as a way to address cost concerns, fine-tune reliability issues, increase service options, and keep pace with ever-increasing scale in the volume, variety, and velocity of new content contributions.

The current Merritt pricing model, in effect since July 1, 2015, is based on recovering the costs of storage use, currently totaling over 73 TB contributed from all 10 UC campuses. This content is now being replicated in UC private clouds supported by UCLA and UCSD. Since the closure earlier this year of the UCOP data center, the computational processes underlying Merritt, along with all other CDL services, have been moved to virtual machines in the Amazon AWS cloud. Collocating storage alongside this computational presence in AWS will provide increased data transfer throughput during Merritt deposit and retrieval. In addition, the integration of online S3 with near-line Glacier storage offers opportunities to lower storage costs by moving archival materials with no expectation of direct end-user access to Glacier. The cost for Glacier storage is about one quarter of that for S3, which is comparable with UCLA and UCSD pricing. Of course, the additional dispersed replication of Merritt-managed data in AWS will also increase overall reliability and long-term preservation assurance.
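To get a rough feel for what the "one quarter" ratio could mean, here is a back-of-the-envelope sketch. It is not a pricing model: costs are expressed in units of "one TB held in S3", the 50/50 split between public and archival content is purely hypothetical, and retrieval fees, request charges, and the cost of the UC cloud replicas are all ignored.

```python
# Rough relative-cost sketch. The only figures taken from the post are the
# 73 TB total and the "about one quarter" Glacier-to-S3 ratio. Everything
# else here is a hypothetical assumption for illustration.
GLACIER_TO_S3_RATIO = 0.25  # relative price, not an actual AWS rate


def relative_primary_cost(s3_tb: float, glacier_tb: float) -> float:
    """Return primary-storage cost in units of 'one TB held in S3'."""
    return s3_tb * 1.0 + glacier_tb * GLACIER_TO_S3_RATIO


total_tb = 73.0
archival_share = 0.5  # hypothetical fraction with no direct end-user access

all_in_s3 = relative_primary_cost(total_tb, 0.0)
mixed = relative_primary_cost(
    total_tb * (1 - archival_share), total_tb * archival_share
)
print(f"All in S3: {all_in_s3} units; S3 plus Glacier: {mixed} units")
```

The larger the archival share, the closer the total moves toward a quarter of the all-S3 figure; the actual effect on Merritt pricing depends on numbers not given here.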

The integration of S3 and Glacier will supplement Merritt’s existing use of UC storage.  Merritt’s storage function acts as a broker that automatically routes submitted content to the appropriate storage location based on its curatorially-defined access characteristics.  Once Amazon storage has been added to Merritt, content tagged for public access will be routed to S3 for primary storage, from which it will be automatically replicated to a UC cloud.  Retrieval requests for this content will be served from the S3 copy; should these requests fail (for example, if S3 is temporarily non-responsive), Merritt automatically retries from its secondary copy.

The path for content tagged for private access is somewhat different. It is initially routed to S3 for temporary storage until the replication to a UC cloud completes. The content is then moved into Glacier for permanent low-cost primary storage. Retrieval requests will be served from the UC cloud. In the unlikely event that this retrieval doesn’t succeed, there is no automatic retry from Glacier, since Glacier, while inexpensive for static storage, is costly for systematic retrieval. UC3 staff can, however, intervene manually to retrieve from Glacier if it becomes necessary. In the case of both public and private access, the digital content will continue to be managed with at least five copies spread across independent storage infrastructures and data centers.
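The two paths described above can be summarized in a small sketch. This is not Merritt's actual code: the names `Access`, `StoragePlan`, `plan_storage`, and `retrieve`, and the location labels `"s3"`, `"glacier"`, and `"uc_cloud"`, are all hypothetical, and the sketch models only one primary and one secondary copy rather than the full set of replicas.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Access(Enum):
    PUBLIC = "public"
    PRIVATE = "private"


@dataclass
class StoragePlan:
    primary: str                  # where the permanent primary copy lives
    serve_from: str               # which copy answers retrieval requests
    auto_fallback: Optional[str]  # copy retried automatically, if any
    steps: List[str] = field(default_factory=list)


def plan_storage(access: Access) -> StoragePlan:
    """Route content according to its access tag (hypothetical broker)."""
    if access is Access.PUBLIC:
        return StoragePlan(
            primary="s3",
            serve_from="s3",
            auto_fallback="uc_cloud",
            steps=["store in s3", "replicate s3 -> uc_cloud"],
        )
    return StoragePlan(
        primary="glacier",
        serve_from="uc_cloud",
        auto_fallback=None,  # Glacier retrieval is manual only
        steps=["stage in s3", "replicate s3 -> uc_cloud", "move s3 -> glacier"],
    )


def retrieve(plan: StoragePlan, fetch: Callable[[str], bytes]) -> bytes:
    """Serve from the designated copy; retry once only if a fallback exists."""
    try:
        return fetch(plan.serve_from)
    except IOError:
        if plan.auto_fallback is None:
            raise  # private content: staff would have to intervene manually
        return fetch(plan.auto_fallback)
```

For example, if a hypothetical `fetch` function raises `IOError` for `"s3"`, a public-access plan falls through to the UC cloud copy, while a failed private-access retrieval simply raises the error and leaves any Glacier retrieval to staff.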

The integration of Amazon S3 and Glacier into Merritt’s storage architecture will increase overall reliability and performance, while possibly leading to future reduction in costs. Once the integration is complete, UC3 will monitor AWS storage usage and associated costs through the end of the current Merritt service year on June 30, 2017, to determine the impact on Merritt pricing.

We’re hiring a new Product Manager!

CDL is recruiting for a new Product Manager.  This position will oversee the product management and outreach activities for the Dash project and service, as well as offer research data management and digital preservation consulting for the UC community.

We are looking for an experienced professional with a full understanding of product/service development and production practices.  This position (officially titled “UC3 Service Manager, Dash”) will focus on the successful development, outreach, and adoption of the Dash service.  A complete revamp of the UI and technical architecture of Dash is nearing completion.  More detail about Dash is available here. A recent presentation on the project is also available here. Because this position will focus on continuous development of Dash, it requires an enthusiastic advocate for research data management best practices, open source community building, and digital curation skills development.

A successful candidate will advocate for the needs of our constituents and translate those needs into detailed enhancements of diverse scope, size, impact, and budget. This Dash Product Manager will have a large support network: the UC3 Director, other UC3 product managers, UC3 development team, other California Digital Library departments, plus the library/IT teams across the 10 UC campuses.

Learn more and apply here.

What is Dash?

Dash is an open source, online data publication service that makes research data sharing easy.  While Dash gives the appearance of being a full-fledged data repository, it is actually a lightweight overlay layer that sits on top of, and freely interoperates with, standards-compliant repositories supporting common protocols for submission and harvesting.  UC3 has integrated Dash with its Merritt curation repository. The Dash system provides intuitive, easy-to-use interfaces for dataset submission, description, publication, and discovery.  Dash imposes minimal prescriptive eligibility and submission requirements, and automates and hides the mechanical details of DOI assignment, data packaging, and repository deposit from the user.  It features a streamlined, self-service user experience that can be integrated easily and unobtrusively into multifarious scholarly workflows.  

What is UC3?

This position is within the University of California Curation Center (UC3) at the California Digital Library (CDL), an administrative unit of the University of California Office of the President (UCOP).  UC3 works within CDL and across the 10 UC campuses to deliver leading-edge digital curation services.  We plan, create, maintain, enhance, and operate robust services responsive to the evolving needs of UC stakeholders.  UC3’s current initiatives include digital preservation, research data management, data publication, alternative metrics for usage and impact, and web archiving. Reporting to the UC3 Director, this position is responsible for managing the development and maintenance of the Dash service, including playing a key role in promoting  and setting the strategic direction for Dash. As a member of this dynamic team, a successful candidate will be asked to contribute to furthering our work advancing digital curation concepts across the UC community.  More information about UC3 can be found at http://www.cdlib.org/uc3.  

More information about this position can be found here.