
Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and it divides individuals into one of five categories depending on where in the curve they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” – GitHub, in this case – will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about which systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source, distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems – most notably Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.
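To make the idea concrete, here is a minimal sketch of the core Git workflow: initialize a repository, stage a file, and record a commit. I’ve wrapped the commands in Python’s standard subprocess module purely so the example runs as a script; the file name and commit message are hypothetical, and in practice you would type the same git commands directly at the command line.

    import subprocess
    from pathlib import Path

    def git(*args):
        """Run a git command and raise an error if it fails."""
        subprocess.run(["git", *args], check=True)

    git("init")  # start tracking the current project directory

    # Make a change to a (hypothetical) analysis script...
    Path("analysis.py").write_text("# preliminary analysis\n")

    # ...then record a snapshot of it, with a message explaining the change.
    git("add", "analysis.py")
    git("commit", "-m", "Add first draft of the analysis script")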

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way to develop software projects efficiently, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how Git works, its structure, or how to incorporate Git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science”, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and I remember going through similar struggles myself. Enter online lab notebooks. naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or file names appended with initials. Instead, a co-author can submit changes and document them with “commit messages” (read about them on GitHub here).

3. Feedback and review are easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded the way it did. A super-savvy GitHubber can make an entire manuscript available, from the first data point collected to the final submitted version, traceable on their site. This is my goal for my next manuscript.
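For a rough picture of what that traceability looks like in practice, the sketch below prints a repository’s commit history, one decision point per line. Again, this is just an illustrative example using Python’s subprocess module for runnability; run it (or the equivalent git command) inside any Git repository.

    import subprocess

    # Show the history of the repository in the current directory:
    # one line per commit (short hash plus message), oldest first.
    subprocess.run(["git", "log", "--reverse", "--oneline"], check=True)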

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.



Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan

 


Institutional Repositories: Part 2

A few weeks back I wrote a post describing institutional repositories (IRs for short). IRs have been around for a while, with the impetus of making scholarly publications open access. More recently, however, IRs have been cited as potential repositories for datasets, code, and other scholarly outputs. Here I continue the discussion of IRs and compare their utility to that of discipline-specific repositories (DRs). Please note: although IRs are typically associated with open access publications, I discuss them here as potential repositories for data.

Honest criticism of IRs

In my discussions with colleagues at conferences and meetings, I have found that some are skeptical about the role of IRs in data access and preservation. I posit that this skepticism has a couple of origins:

  • IRs are often not intended for “self-service”; i.e., a researcher would need to connect with IR support staff (often via a face-to-face meeting) in order to deposit material into the IR.
  • Many IRs were created at least five years ago, with interfaces that sometimes appear to pre-date Facebook. Academic institutions often have no budget for a redesign of the user interface, which means those who visit an IR might be put off by its appearance and/or functionality.
  • IRs are run by libraries and IT departments, neither of which are known for self-promotion. Many (most?) researchers are likely unaware of an IR’s existence, and would not think to check in with the library regarding their data preservation needs.

These are all valid criticisms of many existing IRs. But there is one huge advantage to IRs over other data repositories: they are owned and operated by academic institutions that have a vested interest in preserving and providing access to scholarly work.

The bright side

IRs aren’t all bad, or I wouldn’t be blogging about them. I believe that they are undergoing a rebirth of sorts: they are now seen as viable places for datasets and other scholarly outputs. Institutions like Purdue are putting IRs at the center of their initiatives around data management, access, and preservation. Here at the CDL, the UC3 group is pursuing the implementation of a data curation platform, DataShare, to allow self-service deposit of datasets into the Merritt Repository (see the UCSF DataShare site). Recent mandates from above requiring access to data resulting from federal grants mean that funders (like IMLS) and organizations (like ARL) are taking an interest in improving the utility of IRs.

IRs versus discipline-specific repositories

In my last post, I mentioned that selecting a repository for your data doesn’t need to be an either/or choice between an IR and a discipline-specific repository (DR). These repositories each have advantages and disadvantages, so using both makes sense.

DRs: ideal for data discovery and reuse

Often, DRs have collection policies for the specific types of data they are willing to accept. GenBank, for example, has standardized how you deposit your data, what types and formats of data it accepts, and the metadata accompanying that data. This all means that searching for and using the data in GenBank is easy, and data users can readily download data for reuse. Another advantage of having a collection of similar, standardized data is the ability to build tools on top of these datasets, making reuse and meta-analyses easier.
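As a small illustration of how much that standardization buys you, here is a sketch of pulling a single GenBank record programmatically through NCBI’s Entrez interface using the Biopython library. The email address and accession number below are placeholders, not real values from any particular study.

    from Bio import Entrez, SeqIO

    # NCBI asks for a contact email with every Entrez request (placeholder here).
    Entrez.email = "you@example.edu"

    # Fetch one GenBank record by accession number (hypothetical accession).
    handle = Entrez.efetch(db="nucleotide", id="AB123456",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    print(record.id, record.description, len(record.seq))

Because every GenBank record follows the same format, the same few lines work for any accession, which is exactly the kind of reuse a standardized DR makes possible.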

The downside of DRs

The nature of DRs is that they are selective about the types of data they accept. Consider this scenario, typical of many research projects: what if someone worked on a project that combined gene sequencing, collecting population demographics, and documenting locations with GIS? Many DRs would not want to (or be able to) handle these disparate types of data. The result is that some of the data gets shared via a DR, while data less suitable for the DR does not get shared.

In my work with the DataONE Community Engagement and Education working group, I reviewed which datasets were shared from NSF grants awarded between 2005 and 2009 (see Panel 1 in Hampton et al. 2013). Many of the resulting publications relied on multiple types of data. The percentage of those that shared all of the data produced was around 28%. However, of the data that was shared, 81% was in GenBank or TreeBase – likely due to the culture of data sharing around genetic work. That means most of the non-genetic data is not available, and potentially lost, despite its importance for the project as a whole. Enter: institutional repositories.

IRs: the whole enchilada

Unlike many DRs, IRs have the potential to host entire collections of data around a project – regardless of the type of data, its format, etc. My postdoctoral work on modeling the effects of temperature and salinity on copepod populations involved field collection, laboratory copepod growth experiments (which included logs of environmental conditions), food growth (algal density estimates and growth rates, nutrient concentrations), population size counts, R scripts, and the development of the mathematical models themselves. An IR could take all of these disparate datasets as a package, which I could then refer to in the publications that resulted from the work. A big bonus is that this package could sit next to other packages I’ve generated over the course of my career, making it easier for me to point people to the entire corpus of my research. The biggest bonus of all: having all of the data that produced a publication available at a single location helps ensure reproducibility and transparency.
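There is no single required structure for such a package, but as a minimal sketch of what “deposit it all together” might look like, the snippet below writes a simple manifest describing each file before the whole directory is handed to an IR. All of the file names are hypothetical stand-ins for the copepod project described above.

    import json
    from pathlib import Path

    # Hypothetical files making up one project's data package.
    package = {
        "title": "Copepod responses to temperature and salinity",
        "files": [
            {"name": "field_collections.csv",
             "description": "Field collection records"},
            {"name": "growth_experiments.csv",
             "description": "Lab growth experiments with environmental logs"},
            {"name": "algal_food.csv",
             "description": "Algal density, growth rates, and nutrient concentrations"},
            {"name": "population_counts.csv",
             "description": "Population size counts"},
            {"name": "model_fits.R",
             "description": "R script fitting the population models"},
        ],
    }

    # Write the manifest next to the data files so it travels with the package.
    Path("MANIFEST.json").write_text(json.dumps(package, indent=2))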

Maybe you can have your cake (DRs) and eat it too (IRs). From Flickr by Mayaevening

There are certainly some repositories that could handle the type of data package I just described. The Knowledge Network for Biocomplexity is one such relatively generic repository (although I might argue that KNB is more like an IR than a discipline repository). Another is figshare, although this is a repository ultimately owned by a publisher. But as researchers start hunting for places to put their datasets, I would hope that they look to academic institutions rather than commercial publishers. (Full disclosure – I have data stored in figshare!)

Good news! You can have your cake and eat it too. Putting data in both the relevant DRs and more generic IRs is a good solution to ensure discoverability (DRs) and provenance (IRs).


Institutional Repositories: Part 1

If you aren’t a member of the library and archiving world, you probably aren’t aware of the phrase institutional repository (IR for short). I certainly wasn’t aware of IRs prior to joining the CDL, and I’m guessing most researchers are similarly ignorant. In the next two blog posts, I plan to first explain IRs, then lay out the case for their importance – nay, necessity – as part of the academic ecosphere. I should mention up front that although the IR’s inception focused on archiving traditional publications by researchers, I am speaking about them here as potential homes for all scholarship, including data.

Academic libraries have a mission to archive scholarly work, including theses. These are at The Hive in Worcester, England. From Flickr by israelcsus.

If you read this blog, I’m sure you are aware that there is increased awareness of the importance of open science, open access to publications, data sharing, and reproducibility. Most of these concepts were easily accomplished in the olden days of pen and paper: you simply took great notes in your notebook, and shared that notebook as necessary with colleagues (this assumes, of course, geographic proximity and/or excellent mail systems). These days, that landscape has changed dramatically due to the increasingly computationally complex nature of research. Digital inputs and outputs of research might include software, spreadsheets, databases, images, websites, text-based corpora, and more. But these “digital assets”, as the archival world might call them, are more difficult to store than a lab notebook. What does a virtual filing cabinet or file storage box look like that can house all of these different bits? In my opinion, it looks like an IR.

So what’s an IR?

An IR is a data repository run by an institution. Many of the large research universities have IRs. To name a few, Harvard has DASH, the University of California system has eScholarship and Merritt, Purdue has PURR, and MIT has DSpace. Many of these systems have been set up in the last 10 years or so to serve as archives for publications. For a great overview and history of IRs, check out this eHow article (which is surprisingly better than the relevant Wikipedia article).

So why haven’t more people heard of IRs? Mostly this is because there have never been any mandates or requirements for researchers to deposit their works in IRs. Some libraries take on this task – for example, I found out a few years ago that the MBL-WHOI Library graciously stored open access copies of all of my publications in their IR. But more and more, these “works” include digital assets that are not publications, and the burden of collecting all of the digital scholarship produced by an institution is a near-insurmountable task for a small group of librarians; there has to be either buy-in from researchers or mandates from the top.

The Case for IRs

I’m not the first one to recognize the importance of IRs. Back in 2002 the Scholarly Publishing and Academic Resources Coalition (SPARC) put out a position paper titled “The Case for Institutional Repositories” (see their website for more information). They defined an IR as having four major qualities:

  1. Institutionally defined,
  2. Scholarly,
  3. Cumulative and perpetual, and
  4. Open and interoperable.

Taking the point of view of the academic institution (rather than the researcher), the paper cited two roles that institutional repositories play for academic institutions:

  1. Reform scholarly communication – Reassert control over scholarship, reduce monopoly power of journals, and bring relevance to libraries
  2. Promote the university – Serve as an indicator of the university’s quality; showcase the university’s research; demonstrate public value and increase status.

In general, IRs are run by information professionals (e.g., librarians), who are experts at documenting, archiving, preserving, and generally curating information. All of those digital assets that we produce as researchers fit the bill perfectly.

As a researcher, you might not be convinced of the importance of IRs by the arguments above. Part of the indifference researchers may feel about IRs might have something to do with the existence of disciplinary repositories.

Disciplinary Repositories

There are many, many, many repositories out there for storing digital assets. To get a sense, check out re3data.org or databib.org and start browsing. Both of these websites are searchable databases for research data repositories. If you are a researcher, you probably know of at least one or two repositories for datasets in your field. For example, geneticists have GenBank, evolutionary biologists have TreeBase, ecologists have the KNB, and marine biologists have BCO-DMO. These are all examples of disciplinary repositories (DRs) for data. As any researcher who’s aware of these sites knows, you can both deposit and download data from these repositories, which makes them indispensable resources for their respective fields.

So where should a researcher put data?

The short answer is both an IR and a DR. I’ll expand on this and make the case for IRs to researchers in the next blog post.


UC Open Access: How to Comply

Free access to UC research is almost as good as free hugs! From Flickr by mhauri

My last two blog posts have been about the new open access policy that applies to the entire University of California system. For big open science nerds like myself, this is exciting progress and deserves much ado. For the on-the-ground researcher at a UC, knee-deep in grants and lecture preparation, the ado can probably be skipped in favor of a straightforward explanation of how to comply with the policy. So here goes.

Who & When:

  • 1 November 2013: Faculty at UC Irvine, UCLA, and UCSF
  • 1 November 2014: Faculty at UC Berkeley, UC Merced, UC Santa Cruz, UC Santa Barbara, UC Davis, UC San Diego, UC Riverside

Note: The policy applies only to ladder-rank faculty members. Of course, graduate students and postdocs should strongly consider participating as well.

To comply, faculty members have two options:

Option 1: Out-of-the-box open access

There are two ways to do this:

  1. Publishing in an open access-only journal (see examples here). Some have fees and others do not.
  2. Publishing with a more traditional publisher, but paying a fee to ensure the manuscript is publicly available. These fees are called article-processing charges (APCs) and vary widely depending on the journal. For example, Elsevier’s Ecological Informatics charges $2,500, while Nature charges $5,200.

Learn more about different journals’ fees and policies at the Directory of Open Access Journals: www.doaj.org

Option 2: Deposit your final manuscript in an open access repository.

In this scenario, you can publish in whatever journal you prefer – regardless of its openness. Once the manuscript is published, you take action to make a version of the article freely and openly available.

As UC faculty (or any UC researcher, including grad students and postdocs), you can comply via Option 2 above by depositing your publications in UC’s eScholarship open access repository. The CDL Access & Publishing Group is currently perfecting a user-friendly, efficient workflow for managing article deposits into eScholarship. The new workflow will be available as of November 1st. Learn more.

Does this still sound like too much work? Good news! The Publishing Group is also working on a harvesting tool that will automate deposit into eScholarship. Stay tuned – the estimated release of this tool is June 2014.

An Addendum: Are you not a UC affiliate? Don’t fret! You can find your own version of eScholarship (i.e., an open access repository) by going to OpenDOAR. Also see my full blog post about making your publications open access.

Why?

Academic libraries must pay exorbitant fees to provide their patrons (researchers) with access to scholarly publications.  The very patrons who need these publications are the ones who provide the content in the form of research articles.  Essentially, the researchers are paying for their own work, by proxy via their institution’s library.

What if you don’t have access? Individuals without institutional affiliations (e.g., between jobs), or who are affiliated with institutions that have no library or a poorly funded one (e.g., in developing countries), depend on open access articles to keep up with the scholarly literature. The need for OA isn’t limited to jobless or international folks, though. For proof, one only has to notice that the Twitter community has developed a hashtag around this: #Icanhazpdf (hat tip to the Lolcats phenomenon). Basically, you tweet the name of the article you can’t access and add the hashtag, in hopes that someone out in the Twittersphere can help you out and send it to you.

Special thanks to Catherine Mitchell from the CDL Publishing & Access Group for help on this post.


Sustaining Data

Last week, folks from DataONE gathered in Berkeley to discuss sustainability (new to DataONE? Read my post about it). Of course, lots of people are talking about sustainability in Berkeley, but this discussion focused on sustaining scientific data and its support systems. The truth is, no one wants to pay for data sustainability. Purchasing servers and software and paying IT personnel is not cheap. Given our current grim financial times, room in the budget is not likely to be made. So who should pay? Let’s first think about the different groups that might pay:

  1. Private foundations
  2. Public agencies (e.g., NSF, NIH)
  3. Institutions
  4. Professional societies and organizations
  5. Researchers

Although the NSF provides funds for organizations like DataONE to get off the ground, it is not interested in funding “sustainability”. It is in the business of funding research, which means that come 2019, when NSF funding for DataONE ends, someone else is going to have to pick up the tab.

Any researcher (including myself) will tell you that the thought of paying for data archiving and personnel is not appealing.  Budgets are already tight in proposals (which have record low acceptance rates); combine that with the lack of clarity about data management and archiving costs, and researchers are not eager to take on sustainability.

Many researchers see data sustainability as the domain of their institutions: providing data management and archiving services in bulk to their faculty would allow institutions both to regulate how their researchers handle their data and to remove the guesswork and confusion for the researchers themselves. However, with budget crises plaguing higher education due to rising costs and decreasing revenue, this is not a cost that institutions are likely to take on in the near future.

Obviously I was going to reference Pink Floyd for this post on money… From Wikipedia.

Lack of funds for critical data infrastructure is a systematic problem, and DataUp is no exception. Although we have funds to promote DataUp and publish our findings in the course of the project, we do not have funds to continue development. There is also the question of storage for datasets. Storage is not free, and we have not yet solved the problem of who will pay in the long term for storing data ingested into the ONEShare repository via DataUp.

Now that I’ve completed this post, it seems rather bleak. I am confident, however, that we have the right people working on the problem of data sustainability. It is certainly a critical piece in the current landscape of digital data.

Love Pink Floyd AND The Flaming Lips? Check out the FL cover album for Dark Side of the Moon, including a spectacular version of “Money”.


Looking for something? DataONE can help

I’m not overselling it – DataONE is the Ginsu knife of data tools. From Flickr by inspector_81

At long last, DataONE has gone live.  For veterans of the DCXL/DataUp blog, you are probably well aware of the DataONE organization and project, but for newcomers I will provide a brief overview.  Fine print: this is NOT the official DataONE stance on DataONE.  This is merely my interpretation of it.

To explain DataONE, let’s go through a little thought exercise. Let’s pretend I’m a researcher starting a project on copepods in estuaries of the Pacific Northwest. I’m wondering who else has worked on them, what they have found, and whether I can use their data to help me parameterize my model. Any researcher will tell you the best way to do this is to start searching for relevant journal articles. I can then weave in and out of reference lists to home in on the authors, topics, and species that might be of most use, continually refining my searches until I am satisfied.

Imagine I need the data from some of those articles I found.  I look for datasets on the authors’ websites, in the papers themselves, and online.  Some of the work was funded by NOAA, so I check there for data. I Google like crazy.  Alas, the data are nowhere to be found.

In real life, this is where I ended my search and started contacting authors directly. Although I should have also checked data repositories, I didn’t. This was mostly because I wasn’t aware of them when I did this work back in 2008. Sadly, many researchers are in a similar state of ignorance to the one I was in.

The good news is that there are A LOT of data repositories out there (check out Databib.org for an intimidating list).  The bad news is it’s very difficult to know about and search all of the potential repositories with data you might want to use.

ENTER DataONE.

DataONE is all about linking together existing data repositories, allowing researchers to access, search, and discover all of the data through a single portal.  It’s basically cyber-glue for the different data centers out there. The idea is that you go to the DataONE search engine (ONEMercury) and hunt for data. It tells you where the data are housed, gives you lots of metadata, and gives you access to data when the authors have allowed this.
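For the programmatically inclined, here is a rough sketch of what a search against DataONE can look like under the hood. It assumes the coordinating node’s Solr query endpoint (the exact URL and API version may differ from what is shown, so treat this as illustrative rather than authoritative) and uses the requests library.

    import requests

    # Assumed DataONE coordinating-node search endpoint; check the current API docs.
    SOLR_URL = "https://cn.dataone.org/cn/v2/query/solr/"

    # Search the federated metadata index for copepod datasets.
    params = {
        "q": "copepod AND estuary",
        "fl": "identifier,title,datasource",
        "rows": 10,
        "wt": "json",
    }
    response = requests.get(SOLR_URL, params=params, timeout=30)
    response.raise_for_status()

    for doc in response.json()["response"]["docs"]:
        print(doc.get("identifier"), "|", doc.get("title"))

A search through ONEMercury is just a search box, of course; this sketch is roughly what the portal does on your behalf.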

But wait, there’s MORE!

DataONE is also all about providing tools for researchers to find, use, organize, and manage their data throughout the research life cycle.  This is where DataUp connects with DataONE: DataUp will be part of the Investigator Toolkit, which also includes nifty things like the DMPTool, ONE-R (an R package for DataONE), and ONE-Drive (a Dropbox-esque way to look at data in DataONE, in production).

The exciting news this week is that DataONE’s search and discovery tool has gone live (check out the NSF press release or the DataONE press release).  You can now start looking for data that might be housed in any participating repository.  There are only a few data repositories (called member nodes in DataONE speak) currently on board, but the number is expected to increase exponentially over the coming years.

More questions about DataONE? I can help, or at least direct you to the person that can. Alternatively start poking around the DataONE website and ONEMercury, and give feedback so we can make it better.

 


ONEShare and #OR2012

From Flickr by ~Coqui

One of my UC3 colleagues is at the Open Repositories 2012 Meeting (#OR2012) in Edinburgh, Scotland this week.  This prompted me to ask two questions: (1) What does open repositories mean? and (2) Why didn’t I get to go to Scotland?  Of course, (2) is easily answered by my lack of knowledge about open repositories, i.e. question (1).  After a little internet sleuthing, I’ve figured out what they mean by “Open Repositories”, and I realized that I have first-hand knowledge of a repository that contributes to the ideas of OR, ONEShare.  In this post I will share my newfound OR knowledge and give you the lowdown on ONEShare.

First, Open Repositories.  Just in case you are new to the dataverse (that’s dweeb speak for data universe), a repository is basically a place to put your data.  There are loads of data repositories, and picking one to suit your needs is an important step in data management planning.  So what is this about open repositories?

Here is a bit of text from the OR2011 website:

Open Repositories is an annual conference that brings together an international community of stakeholders engaged in the development, management, and application of digital repositories. …attendees  exchange knowledge, best practices and ideas on strategic, technical, theoretical and practical issues.

Basically, the idea of the Open Repositories group is to share knowledge among those facing similar challenges.  It’s similar to the concepts of Open Science, Open Data, and Open Access: we can accomplish more if we pool our intellectual resources.  Follow the OR2012 meeting via the #OR2012 hashtag.

Now for ONEShare.  This is the data repository we’ve created specially for DataUp users.

The name: ONEShare is called this because it’s closely intertwined with DataONE, the group enabling federation of Earth, environmental, and ecological repositories. Many of the DataONE tools have “ONE” in the title (e.g., ONE-R, ONEMercury, and ONEDrive).

The concept: One of the major features of DataUp is connecting Excel users to a data repository – essentially streamlining the process for depositing and sharing your data. Although there are many data repositories, none of them allow just anyone to deposit data [Correction! Several allow this. See the comment below]. ONEShare is meant to be a “catch-all” repository for data owners who have no relationship with an existing repository. Think of it as a sort of SlideShare for data – there is a low bar for participation, and anyone can join.

In a sense, ONEShare is the epitome of the “Open Repositories” concept: a repository that’s truly open to anyone.  Maybe I can represent ONEShare at OR2013 on Prince Edward Island (Oh Canada, how I miss you!).


Trailblazers in Demography

Last week I had the great pleasure of visiting Rostock, Germany.  If your geography lessons were a long time ago, you are probably wondering “where’s Rostock?” I sure did… Rostock is located very close to the Baltic Sea, in northeast Germany.  It’s a lovely little town with bumpy streets, lots of sausage, and great public transportation.  I was there, however, to visit the prestigious Max Planck Institute for Demographic Research (MPIDR).

Demography is the study of populations, especially their birth rates, death rates, and growth rates.  For humans, this data might be used for, say, calculating premiums for life insurance.  For other organisms, these types of data are useful for studying population declines, increases, and changes.  Such areas of study are especially important for endangered populations, invasive species, and commercially important plants and animals.

Sharing demography data saves adorable endangered species. From Flickr by haiwan42

I was invited to MPIDR because there is a group of scientists interested in creating a repository for non-human demography data.  Luckily, they aren’t starting from scratch.  They have a few existing collections of disparate data sets, some more refined and public-facing than others; their vision is to merge these datasets and create a useful, integrated database chock full of demographic data.  Although the group has significant challenges ahead (metadata standards, security, data governance policies, long term sustainability), their enthusiasm for the project will go a long way towards making it a reality.

The reason I am blogging about this meeting is because for me, the group’s goals represent something much bigger than a demography database.  In the past two years, I have been exposed to a remarkable range of attitudes towards data sharing (check out blog posts about it here, here, here, and here).  Many of the scientists with whom I spoke needed convincing to share their datasets.  But even in this short period of time that I have been involved in issues surrounding data, I have seen a shift towards the other end of the range.  The Rostock group is one great example of scientists who are getting it.

More and more scientists are joining the open data movement, and a few of them are even working to convert others to the cause. This group that met in Rostock could put their heads down, continue to work on their separate projects, and perhaps share data occasionally with a select few vetted colleagues whom they trust and know well. But they are choosing instead to venture into the wilderness of scientific data sharing. Let them be an inspiration to data hoarders everywhere.

It is our intention that the DCXL project will result in an add-in and web application that will facilitate all of the good things the Rostock group is trying to promote in the demography community.  Demographers use Microsoft Excel, in combination with Microsoft Access, to organize and manage their large datasets.  Perhaps in the future our open-source add-in and web application will be linked up with the demography database; open source software, open data, and open minds make this possible.
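As one small, concrete piece of that vision, the sketch below shows how a demography spreadsheet could be exported from Excel to an open, plain-text format before deposit. It uses the pandas library, and the file and sheet names are hypothetical; this is not the DCXL add-in itself, just an illustration of the kind of step it aims to smooth over.

    import pandas as pd

    # Read one sheet from a (hypothetical) Excel workbook of demography data.
    rates = pd.read_excel("rhino_demography.xlsx", sheet_name="vital_rates")

    print(rates.head())  # quick sanity check of what was read

    # Save as CSV, an open format that any repository (or future reader) can handle.
    rates.to_csv("rhino_vital_rates.csv", index=False)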
