Tag Archives: government

Government Data At Risk

Government data is at risk, but that is nothing new.  

The existence of Data.gov, the Federal Open Data Policy, and open government data belies the fact that, historically, a vast amount of government data and digital information has been at risk of disappearing in the transition between presidential administrations. For example, between 2008 and 2012, over 80 percent of the PDFs hosted on .gov domains disappeared. To track these and other changes, California Digital Library (CDL) joined with the University of North Texas, the Library of Congress, the Internet Archive, and the U.S. Government Publishing Office to create the End of Term (EOT) Archive. After archiving the web presence of federal agencies in 2008 and 2012, the team initiated a new crawl in September 2016.

In light of recent events, tools and infrastructure initially developed for EOT and other projects have been taken up by efforts to back up “at risk” datasets, including those related to the environment, climate change, and social justice. Data Refuge, coordinated by the Penn Program in Environmental Humanities (PPEH), has organized a series of “Data Rescue” events across the country where volunteers nominate webpages for submission to the End of Term Archive and harvest “uncrawlable” data to be bagged and submitted to an open data archive. Efforts such as the Azimuth Climate Data Backup Project and Climate Mirror do not involve submitting data or information directly to the End of Term Archive, but have similar aims and workflows.
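For readers curious what “bagging” involves, here is a minimal sketch of packaging a folder of harvested files in the BagIt layout (RFC 8493): a data/ directory for the payload, a bagit.txt declaration, and a checksum manifest. This is not the Bagger tool used at Data Rescue events; the function name make_bag and the directory names are hypothetical, and a real bag would usually also carry a bag-info.txt with descriptive metadata.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: str, bag_dir: str) -> Path:
    """Copy source_dir into a minimal BagIt-style bag at bag_dir.

    Hypothetical helper for illustration; bag_dir must not already exist.
    """
    source = Path(source_dir)
    bag = Path(bag_dir)
    payload = bag / "data"

    # Payload files live under data/ inside the bag.
    shutil.copytree(source, payload)

    # Bag declaration: the required tag file naming the BagIt version.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file,
    # with paths relative to the bag root.
    lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag).as_posix()}\n")
    (bag / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag
```

The checksums in the manifest are what let an archive confirm, later on, that the files it received are the files that were harvested.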

These efforts are great for raising awareness and building back-ups of key collections. In the background, CDL and the team behind the Dat Project have worked to back up Data.gov itself. The goal is to preserve not only the datasets catalogued by Data.gov but also the associated metadata and organization that make it such a useful location for finding and using government data. As a result of this partnership, for the first time ever, the entire Data.gov metadata catalog of over 2 million datasets will soon be available for bulk download. This will allow the various backup efforts to coordinate and cross-reference their datasets with those on Data.gov. To allow for further coordination and cross-referencing, the Dat team has also begun acquiring the metadata for all the files acquired by Data Refuge, the Azimuth Climate Data Backup Project, and Climate Mirror.
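As a rough illustration of what cross-referencing could look like once a bulk catalog is available, the sketch below matches backed-up files to catalog records by source URL. Everything about the data shape is an assumption made for the example: the field names ("id", "urls", "name", "source_url") are hypothetical and are not the actual Data.gov metadata schema, and matching on a normalized URL is just one simple strategy.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to lowercase host plus path, ignoring scheme,
    query string, and any trailing slash."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def cross_reference(catalog, backups):
    """Match backup items to catalog records by normalized source URL.

    catalog: iterable of dicts with hypothetical keys "id" and "urls".
    backups: iterable of dicts with hypothetical keys "name" and "source_url".
    Returns (matched, unmatched): a dict of backup name -> catalog id,
    and a list of backup names with no catalog match.
    """
    index = {}
    for record in catalog:
        for url in record["urls"]:
            index[normalize_url(url)] = record["id"]

    matched, unmatched = {}, []
    for item in backups:
        key = normalize_url(item["source_url"])
        if key in index:
            matched[item["name"]] = index[key]
        else:
            unmatched.append(item["name"])
    return matched, unmatched
```

The unmatched list is the interesting output for a rescue effort: it points at archived files that the catalog does not yet describe, which are candidates for reporting back.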

To keep track of all these efforts to preserve government data and information, we’re maintaining the following annotated list. As new efforts emerge or existing efforts broaden or change their focus, we’ll make sure the list is updated. Feel free to send additional info on government data projects to: uc3@ucop.edu

Get involved: Ongoing Efforts to Preserve Scientific Data or Support Science

Data.gov – The home of the U.S. Government’s open data, much of which is non-biological and non-environmental. Data.gov has a lightweight system for reporting and tracking datasets that aren’t represented and functions as a single point of discovery for federal data. Newly archived data can and should be reported there. CDL and the Dat team are currently working to back up both the data catalogued on Data.gov and the associated metadata.

End of Term – A collaborative project to capture and save U.S. Government websites at the end of presidential administrations. The initial partners in EOT included CDL, the Internet Archive, the Library of Congress, the University of North Texas, and the U.S. Government Publishing Office. Volunteers at many Data Rescue events use the URL nomination and BagIt/Bagger tools developed as part of the EOT project.

Data Refuge – A collaborative effort that aims to back up research-quality copies of federal climate and environmental data, advocate for environmental literacy, and build a consortium of research libraries to scale their tools and practices to make copies of other kinds of federal data. Find a Data Rescue event near you.

Azimuth Climate Data Backup Project – An urgent project to back up U.S. government climate databases. Started by statistician Jan Galkowski and John Baez, a mathematician and science blogger at UC Riverside.

Climate Mirror – A distributed volunteer effort to mirror and back up U.S. federal climate data. This project is currently being led by Data Refuge.

The Environmental Data and Governance Initiative – An international network of academics and non-profits that addresses potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate, inform, and enforce. EDGI has built many of the tools used at Data Rescue events.

March for Science – A celebration of science and a call to support and safeguard the scientific community. The main march in Washington, DC, and satellite marches around the world are scheduled for April 22nd (Earth Day).

314 Action – A nonprofit that intends to leverage the goals and values of the greater science, technology, engineering, and mathematics community to aggressively advocate for science.


Popular Demand for Public Data

Scanned image of a 1940 Census Schedule (from http://1940census.archives.gov)

The National Archives and Records Administration digitized 3.9 million schedules from the 1940 U.S. census

When talking about data publication, many of us get caught up in protracted conversations aimed at carefully anticipating and building solutions for every possible permutation and use case. Last week’s release of U.S. census data, in its raw, un-indexed form, however, supports the idea that we don’t have to have all the answers to move forward.

Genealogists, statisticians and legions of casual web surfers have been buzzing about last week’s release of the complete, un-redacted collection of scanned 1940 U.S. census data schedules. Though census records are routinely made available to the public after a 72-year privacy embargo, this most recent release marks the first time that the census data set has been made available in such a widely accessible way: by publishing the schedules online.

In the first 3 hours that the data was available, 22.5 million hits crippled the 1940census.archives.gov servers. The following day, nearly 3 times that number of requests continued to hammer the servers as curious researchers scoured the census data, looking for relatives of missing soldiers, hoping to find out a little bit more about their own family members, or trying to piece together a picture of life in post-Great Depression, pre-WWII America.

For the time being, scouring the data is a somewhat laborious task of narrowing in on the census schedules for a particular district, then performing a quick visual scan for people’s names. The 3.9 million scanned images that make up the data set are not, in other words, fully indexed. In fact, only a single field (the Enumeration District number field) is searchable. Encoding that field alone took 6 full-time archivists 3 months.

The task of encoding the remaining 5.3 billion fields is being taken up by an army of volunteers. Some major genealogy websites (such as Ancestry.com and MyHeritage.com) hope the crowd-sourced effort will result in a fully indexed, fully searchable database by the end of the year.

Release day for the census has been described as “the Super Bowl for genealogists.” This excitement about data, and the public’s participation in transforming the data set into a more usable, indexed form, are encouraging indications that those of us interested in how best to facilitate even more sharing and publishing of data online are doing work that has enormous, widely appreciated value. The crowd-sourced volunteer effort also reminds us that we don’t necessarily have to have all the answers when thinking about publishing data. In some cases, functionality that seems absolutely essential (such as the ability to search through the data set) is work that can (and will) be taken up by others.

So, how about your data set(s)? Who are the professional and armchair domain enthusiasts who will line up to download your data? What are some of the functionality roadblocks that are preventing you from publishing your data, and how might a third party (or a crowd-sourced effort) work as a solution? (Feel free to answer in the comments section below.)
