
Government Data At Risk

Government data is at risk, but that is nothing new.  

The existence of Data.gov, the Federal Open Data Policy, and open government data belies the fact that, historically, a vast amount of government data and digital information has been at risk of disappearing in the transition between presidential administrations. For example, between 2008 and 2012, over 80 percent of the PDFs hosted on .gov domains disappeared. To track these and other changes, the California Digital Library (CDL) joined with the University of North Texas, the Library of Congress, the Internet Archive, and the U.S. Government Publishing Office to create the End of Term (EOT) Archive. After archiving the web presence of federal agencies in 2008 and 2012, the team initiated a new crawl in September of 2016.

In light of recent events, tools and infrastructure initially developed for EOT and other projects have been taken up by efforts to back up “at risk” datasets, including those related to the environment, climate change, and social justice. Data Refuge, coordinated by the Penn Program in Environmental Humanities (PPEH), has organized a series of “Data Rescue” events across the country where volunteers nominate webpages for submission to the End of Term Archive and harvest “uncrawlable” data to be bagged and submitted to an open data archive. Efforts such as the Azimuth Climate Data Backup Project and Climate Mirror do not involve submitting data or information directly to the End of Term Archive, but have similar aims and workflows.
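The bagging step in that workflow typically relies on the BagIt packaging format. As a rough illustration (not Data Refuge’s exact tooling), here is a minimal sketch using the Library of Congress’s bagit-python library; the directory name and bag-info values are invented:

    # Rough sketch of packaging harvested files as a BagIt bag with the
    # Library of Congress bagit-python library (pip install bagit).
    # The directory name and bag-info values below are invented for illustration.
    import bagit

    bag = bagit.make_bag(
        "harvested-dataset",  # hypothetical directory of rescued files
        {"Source-Organization": "Data Rescue volunteers",
         "External-Description": "Dataset harvested at a Data Rescue event"},
    )
    print(bag.is_valid())  # verify checksums before handing the bag off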

These efforts are great for raising awareness and building backups of key collections. In the background, CDL and the team behind the Dat Project have worked to back up Data.gov itself. The goal is not only to preserve the datasets catalogued by Data.gov but also the associated metadata and organization that make it such a useful location for finding and using government data. As a result of this partnership, for the first time ever, the entire Data.gov metadata catalog of over 2 million datasets will soon be available for bulk download. This will allow the various backup efforts to coordinate and cross-reference their datasets with those on Data.gov. To allow for further coordination and cross-referencing, the Dat team has also begun acquiring the metadata for all the files acquired by Data Refuge, the Azimuth Climate Data Backup Project, and Climate Mirror.
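Because the Data.gov catalog runs on CKAN, its metadata can also be harvested through the standard CKAN action API. A rough sketch of paging through the catalog is below; the page size and output file are arbitrary, and the bulk-download effort described above uses its own tooling rather than this exact script:

    # Rough sketch: page through the Data.gov (CKAN) catalog API and save
    # dataset metadata. Uses the standard CKAN package_search action; the
    # page size and output filename are arbitrary choices for illustration.
    import json
    import urllib.request

    API = "https://catalog.data.gov/api/3/action/package_search"
    page_size = 1000
    datasets = []
    start = 0

    while True:
        with urllib.request.urlopen(f"{API}?rows={page_size}&start={start}") as resp:
            result = json.load(resp)["result"]
        datasets.extend(result["results"])
        start += page_size
        if start >= result["count"]:
            break

    with open("datagov-metadata.json", "w") as out:
        json.dump(datasets, out)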

In an effort to keep track of all these efforts to preserve government data and information, we’re maintaining the following annotated list. As new efforts emerge or existing efforts broaden or change their focus, we’ll make sure the list is updated. Feel free to send additional info on government data projects to: uc3@ucop.edu

Get involved: Ongoing Efforts to Preserve Scientific Data or Support Science

Data.gov – The home of the U.S. Government’s open data, much of which is non-biological and non-environmental. Data.gov has a lightweight system for reporting and tracking datasets that aren’t represented and functions as a single point of discovery for federal data. Newly archived data can and should be reported there. CDL and the Dat team are currently working to back up the data catalogued on Data.gov along with the associated metadata.

End of Term – A collaborative project to capture and save U.S. Government websites at the end of presidential administrations. The initial partners in EOT included CDL, the Internet Archive, the Library of Congress, the University of North Texas, and the U.S. Government Publishing Office. Volunteers at many Data Rescue events use the URL nomination and BagIt/Bagger tools developed as part of the EOT project.

Data Refuge – A collaborative effort that aims to backup research-quality copies of federal climate and environmental data, advocate for environmental literacy, and build a consortium of research libraries to scale their tools and practices to make copies of other kinds of federal data. Find a Data Rescue event near you.

Azimuth Climate Data Backup Project – An urgent project to back up US government climate databases. Initially started by statistician Jan Galkowski and John Baez, a mathematician and science blogger at UC Riverside.

Climate Mirror – A distributed volunteer effort to mirror and back up U.S. Federal Climate Data. This project is currently being led by Data Refuge.

The Environmental Data and Governance Initiative – An international network of academics and non-profits that addresses potential threats to federal environmental and energy policy, and to the scientific research infrastructure built to investigate, inform, and enforce. EDGI has built many of the tools used at Data Rescue events.

March for Science – A celebration of science and a call to support and safeguard the scientific community. The main march in Washington DC and satellite marches around the world are scheduled for April 22nd (Earth Day).

314 Action – A nonprofit that intends to leverage the goals and values of the greater science, technology, engineering, and mathematics community to aggressively advocate for science.


Software Carpentry / Data Carpentry Instructor Training for Librarians

We are pleased to announce that we are partnering with Software Carpentry (http://software-carpentry.org) and Data Carpentry (http://datacarpentry.org) to offer an open instructor training course on May 4-5, 2017 geared specifically for the Library Carpentry movement.  

Open call for Instructor Training

This course will take place in Portland, OR, in conjunction with csv,conf,v3, a community conference for data makers everywhere. It’s open to anyone, but the two-day event will focus on preparing members of the library community as Software and Data Carpentry instructors. The sessions will be led by Library Carpentry community members, Belinda Weaver and Tim Dennis.

If you’d like to participate, please apply by filling in the form at https://amy.software-carpentry.org/forms/request_training/ (applications are now closed).

What is Library Carpentry?

For those who don’t know, Library Carpentry is a global community of library professionals that is customizing Software Carpentry and Data Carpentry modules for training the library community in software and data skills. You can follow us on Twitter @LibCarpentry.

Library Carpentry is actively creating training modules for librarians and holding workshops around the world. It’s a relatively new movement that has already been a huge success. You can learn more by reading the recently published article: Library Carpentry: software skills training for library professionals.

Why should I get certified?

Library Carpentry is a movement tightly coupled with the Software Carpentry and Data Carpentry organizations. Since all are based on a train-the-trainer model, one of our challenges has been how to get more experience as instructors. This issue is handled within Software and Data Carpentry by requiring instructor certification.

Although certification is not a requirement for involvement in Library Carpentry, we know that certifying more instructors will help us refine workshops and teaching modules, and grow the movement. Also, by getting certified, you can start hosting your own Library Carpentry, Software Carpentry, or Data Carpentry events on your campus. It’s a great way to engage with your campus and library community!

Prerequisites

Applicants will learn how to teach people the skills and perspectives required to work more effectively with data and software. The focus will be on evidence-based education techniques and hands-on practice; as a condition of taking part, applicants must agree to:

  1. Abide by our code of conduct, which can be found at http://software-carpentry.org/conduct/ and http://datacarpentry.org/code-of-conduct/,
  2. Teach at a Library Carpentry, Software Carpentry, or Data Carpentry workshop within 12 months of the course, and
  3. Complete three short tasks after the course in order to finish certification. The tasks take a total of approximately 8-10 hours: see http://swcarpentry.github.io/instructor-training/checkout/ for details.

Costs

This course will be held in Portland, OR, in conjunction with csv,conf,v3 and is sponsored by csv,conf,v3 and the California Digital Library. To help offset the costs of this event, we will ask attendees to contribute an optional fee (tiered prices will be recommended based on your or your employer’s ability to pay). No one will be turned down based on inability to pay and a small number of travel awards will be made available (more information coming soon).  

Application

Hope to see you there! To apply for this Software Carpentry / Data Carpentry Instructor Training course, please submit the application by Jan 31, 2017:

  https://amy.software-carpentry.org/forms/request_training/ (applications are now closed)

Under Group Name, use “CSV (joint)” if you wish to attend both the training and the conference, or “CSV (training only)” if you only wish to attend the training course.

More information

If you have any questions about this Instructor Training course, please contact admin@software-carpentry.org. And if you have any questions about the Library Carpentry movement, please contact us via email at uc3@ucop.edu, via Twitter @LibCarpentry, or join the Gitter chatroom.

PIDapalooza – What, Why, When, Who?


PIDapalooza, a community-led conference on persistent identifiers
November 9-10, 2016
Radisson Blu Saga Hotel
pidapalooza.org

PIDapalooza will bring together creators and users of persistent identifiers (PIDs) from around the world to shape the future PID landscape through the development of tools and services for the research community. PIDs support proper attribution and credit, promote collaboration and reuse, enable reproducibility of findings, foster faster and more efficient progress, and facilitate effective sharing, dissemination, and linking of scholarly works.
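As one small, concrete example of what a PID makes possible: a DOI can be resolved to machine-readable citation metadata through content negotiation at the doi.org resolver. A rough sketch is below; the DOI value is a placeholder, so substitute a real one before running it:

    # Sketch: resolve a DOI to citation metadata via content negotiation.
    # The DOI value is a placeholder; Crossref and DataCite both serve
    # CSL JSON when this Accept header is sent to the doi.org resolver.
    import json
    import urllib.request

    doi = "10.1234/example"  # placeholder, substitute a real DOI
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(req) as resp:
        metadata = json.load(resp)

    print(metadata.get("title"))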

If you’re doing something interesting with persistent identifiers, or you want to, come to PIDapalooza and share your ideas with a crowd of committed innovators.

Conference themes include:

  1. PID myths. Are PIDs better in our minds than in reality? PID stands for Persistent IDentifier, but what does that mean and does such a thing exist?
  2. Achieving persistence. So many factors affect persistence: mission, oversight, funding, succession, redundancy, governance. Is open infrastructure for scholarly communication the key to achieving persistence?
  3. PIDs for emerging uses. Long-term identifiers are no longer just for digital objects. We have use cases for people, organizations, vocabulary terms, and more. What additional use cases are you working on?
  4. Legacy PIDs. There are thousands of venerable old identifier systems that people want to continue using and bring into the modern data citation ecosystem. How can we manage this effectively?
  5. The I-word. What would make heterogeneous PID systems “interoperate” optimally? Would standardized metadata and APIs across PID types solve many of the problems, and if so, how would that be achieved? What about standardized link/relation types?
  6. PIDagogy. It’s a challenge for those who provide PID services and tools to engage the wider community. How do you teach, learn, persuade, discuss, and improve adoption? What does it mean to build a pedagogy for PIDs?
  7. PID stories. Which strategies worked? Which strategies failed? Tell us your horror stories! Share your victories!
  8. Kinds of persistence. What are the frontiers of ‘persistence’? We hear lots about fraud prevention with identifiers for scientific reproducibility, but what about data papers promoting PIDs for long-term access to reliably improving objects (software, pre-prints, datasets) or live data feeds?

PIDapalooza is organized by California Digital Library, Crossref, DataCite, and ORCID.  

We believe that bringing together everyone who’s working with PIDs for two days of discussions, demos, workshops, brainstorming, and updates on the state of the art will catalyze the development of PID community tools and services.  

And you can help by getting involved!

Propose a session

Please send us your session ideas by September 18. We will notify you about your proposals in the first week of October.

Register to attend

Registration is now open — come join the festival with a crowd of like-minded innovators. And please help us spread the word about PIDapalooza in your community!

Stay tuned

Keep updated with the latest news at the PIDapalooza website and on Twitter (@PIDapalooza) in the coming weeks.

See you in November!


We’re hiring a new Product Manager!

CDL is recruiting for a new Product Manager.  This position will oversee the product management and outreach activities for the Dash project and service, as well as offer research data management and digital preservation consulting for the UC community.

We are looking for an experienced professional with a full understanding of product/service development and production practices.  This position (officially titled “UC3 Service Manager, Dash”) will focus on the successful development, outreach, and adoption of the Dash service.  A complete revamp of the UI and technical architecture of Dash is nearing completion.  More detail about Dash is available here. A recent presentation on the project is also available here. Because this position will focus on continuous development of Dash, it requires an enthusiastic advocate for research data management best practices, open source community building, and digital curation skills development.

A successful candidate will advocate for the needs of our constituents and translate those needs into detailed enhancements of diverse scope, size, impact, and budget. The Dash Product Manager will have a large support network: the UC3 Director, other UC3 product managers, the UC3 development team, other California Digital Library departments, plus the library/IT teams across the 10 UC campuses.

Learn more and apply here.

What is Dash?

Dash is an open source, online data publication service that makes research data sharing easy. While Dash gives the appearance of being a full-fledged data repository, it is actually a lightweight overlay that sits on top of, and freely interoperates with, standards-compliant repositories supporting common protocols for submission and harvesting. UC3 has integrated Dash with its Merritt curation repository. The Dash system provides intuitive, easy-to-use interfaces for dataset submission, description, publication, and discovery. Dash imposes minimal prescriptive eligibility and submission requirements, and automates and hides the mechanical details of DOI assignment, data packaging, and repository deposit from the user. It features a streamlined, self-service user experience that can be integrated easily and unobtrusively into multifarious scholarly workflows.
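To give a sense of the description step Dash automates: the dataset metadata it collects maps onto the DataCite schema used for DOI registration. A rough sketch of a minimal, DataCite-style record is below; every value is invented for illustration, and Dash gathers this information through its submission forms rather than raw JSON:

    # Rough sketch of a minimal DataCite-style metadata record, limited to the
    # schema's required properties. All values are invented for illustration.
    dataset_metadata = {
        "identifier": {"identifier": "10.5072/FK2EXAMPLE",  # 10.5072 is a test prefix
                       "identifierType": "DOI"},
        "creators": [{"creatorName": "Garcia, Maria"}],
        "titles": [{"title": "Soil moisture observations, hypothetical field site"}],
        "publisher": "University of California",
        "publicationYear": "2014",
        "resourceType": {"resourceTypeGeneral": "Dataset"},
    }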

What is UC3?

This position is within the University of California Curation Center (UC3) at the California Digital Library (CDL), an administrative unit of the University of California Office of the President (UCOP). UC3 works within CDL and across the 10 UC campuses to deliver leading-edge digital curation services. We plan, create, maintain, enhance, and operate robust services responsive to the evolving needs of UC stakeholders. UC3’s current initiatives include digital preservation, research data management, data publication, alternative metrics for usage and impact, and web archiving. Reporting to the UC3 Director, this position is responsible for managing the development and maintenance of the Dash service, including playing a key role in promoting Dash and setting its strategic direction. As a member of this dynamic team, a successful candidate will be asked to contribute to our work advancing digital curation concepts across the UC community. More information about UC3 can be found at http://www.cdlib.org/uc3.

More information about this position can be found here.

We are Hiring a DMPTool Manager!

Do you love all things data management as much as we do? Then join our team! We are hiring a person to help manage the DMPTool, including development prioritization, promotion, outreach, and education. The position is funded for two years with the potential for an extension pending funding and budgets. You would be based in the amazing city of Oakland CA, home of the California Digital Library. Read more at jobs.ucop.edu or download the PDF description: Data Management Product Manager (4116).

Job Duties

Product Management (30%): Ensures the DMPTool remains a viable and relevant application: updates funder requirements, maintains the integrity of publicly available DMPs, contacts partner institutions to report issues, and reviews DMPTool guidance and content for currency. Evaluates and presents new technologies and industry trends, and recommends those that are applicable to current products or services and the organization’s long-range, strategic plans. Identifies, organizes, and participates in technical discussions with key advisory groups and other customers/clients. Identifies additional opportunities for value-added product/service delivery based on customer/client interaction and feedback.

Marketing and Outreach (20%): Develops and implements strategies for promoting the DMPTool: creates marketing materials, updates website content, contacts institutions, and presents at workshops and/or conferences. Develops and participates in marketing and professional outreach activities and informational campaigns to raise awareness of the product or service, including communicating developments and updates to the community via social media. This includes maintaining the DMPTool blog, Twitter and Facebook accounts, GitHub Issues, and listservs.

Project Management (30%): Develops project plans, including goals, deliverables, resources, budget, and timelines, for enhancements of the DMPTool. Acts as product/service liaison across the organization, external agencies, and customers to ensure effective production, delivery, and operation of the DMPTool.

Strategic Planning (10%): Assists in strategic planning, prioritizing, and guiding future development of the DMPTool. Pursues outside collaborations and funding opportunities for future DMPTool development, including building an engaged community of DMPTool users (researchers) and software developers who contribute to the codebase. Fosters and engages the open source community for future maintenance and enhancement.

Reporting (10%): Provides periodic progress reports outlining key activities and progress toward achieving overall goals. Develops and reports on metrics/key performance indicators and provides corresponding analysis.

To apply, visit jobs.ucop.edu (Requisition No. 20140735)

From Flickr by Brenda Gottsabend

Announcing The Dash Tool: Data Sharing Made Easy

We are pleased to announce the launch of Dash – a new self-service tool from the UC Curation Center (UC3) and partners that allows researchers to describe, upload, and share their research data. Dash helps researchers perform the following tasks:

  • Prepare data for curation by reviewing best practice guidance for the creation or acquisition of digital research data.
  • Select data for curation through local file browse or drag-and-drop operation.
  • Describe data in terms of the DataCite metadata schema.
  • Identify data with a persistent digital object identifier (DOI) for permanent citation and discovery.
  • Preserve, manage, and share data by uploading to a public Merritt repository collection.
  • Discover and retrieve data through faceted search and browse.

Who can use Dash?

There are multiple instances of the Dash tool that all have similar functions, look, and feel.  We took this approach because our UC campus partners were interested in their Dash tool having local branding (read more). It also allows us to create new Dash instances for projects or partnerships outside of the UC (e.g., DataONE Dash and our Site Descriptors project).

Researchers at UC Merced, UCLA, UC Irvine, UC Berkeley, or UCOP can use their campus-specific Dash instance.

Other researchers can use DataONE Dash (oneshare.cdlib.org). This instance is available to anyone, free of charge. Use your Google credentials to deposit data.

Note: Data deposited into any Dash instance is visible throughout all of Dash. For example, if you are a UC Merced researcher and use dash.ucmerced.edu to deposit data, your dataset will appear in search results for individuals looking for data via any of the Dash instances, regardless of campus affiliation.

See the Users Guide to get started using Dash.

Stay connected to the Dash project:

Dash Origins

The Dash project began as DataShare, a collaboration among UC3, the University of California San Francisco Library and Center for Knowledge Management, and the UCSF Clinical and Translational Science Institute (CTSI). CTSI is part of the Clinical and Translational Science Award program funded by the National Center for Advancing Translational Sciences at the National Institutes of Health (Grant Number UL1 TR000004).

Sound the horns! Dash is live! “Fontana del Nettuno” by Sorin P. from Flickr.


Data: Do You Care? The DLM Survey

We all know that data is important for research. So how can we quantify that? How can you get credit for the data you produce? What do you want to know about how your data is used?

If you are a researcher or data manager, we want to hear from you. Take this 5-10 minute survey and help us craft data-level metrics:

surveymonkey.com/s/makedatacount

Please share widely! The survey will be open until December 1st.

Read more about the project at mdc.plos.org or check out our previous post. Thanks to John Kratz for creating the survey and jumping through IRB hoops!

What do you think of data metrics? We’re listening. From gizmodo.com. Click for more pics of dogs + radios.


Dash Project Receives Funding!

We are happy to announce the Alfred P. Sloan Foundation has funded our project to improve the user interface and functionality of our Dash tool! You can read the full grant text at http://escholarship.org/uc/item/2mw6v93b.

More about Dash

Dash is a University of California project to create a platform that allows researchers to easily describe, deposit, and share their research data publicly. Currently the Dash platform is connected to the UC3 Merritt Digital Repository; however, we plan to make the platform compatible with other repositories via standard protocols during our Sloan-funded work. The Dash project is open source; read more on our GitHub site. We encourage community discussion and contribution via GitHub Issues.

Currently there are five instances of the Dash tool available:

We plan to launch the new DataONE Dash instance in two weeks; this tool will replace the existing DataUp tool and allow anyone to deposit data into the DataONE infrastructure via the ONEShare repository using their Google credentials. Along with the release of DataONE Dash, we will release Dash 1.1 for the live sites listed above. There will be improvements to the user interface and experience.

The Newly Funded Sloan Project

Problem Statement

Researchers are not archiving and sharing their data in sustainable ways. Often data sharing involves using commercially owned solutions, posting data on personal websites, or submitting data alongside articles as supplemental material. A better option for data archiving is community repositories, which are owned and operated by trusted organizations (i.e., institutional or disciplinary repositories). Although disciplinary repositories are often known and used by researchers in the relevant field, institutional repositories are less well known as a place to archive and share data.

Why aren’t researchers using institutional repositories?

First, the repositories are often not set up for self-service operation by individual researchers who wish to deposit a single dataset without assistance. Second, many (or perhaps most) institutional repositories were created with publications in mind, rather than datasets, which may in part account for their less-than-ideal functionality. Third, user interfaces for the repositories are often poorly designed and do not take into account the user’s experience (or inexperience) and expectations. Because more and more of our activities are conducted on the Internet, we are exposed to many high-quality, commercial-grade user interfaces in the course of a workday. Correspondingly, researchers have come to expect clean, simple interfaces that can be learned quickly, with minimal need for contacting repository administrators.

Our Solution

We propose to address the three issues above with Dash, a well-designed, user friendly data curation platform that can be layered on top of existing community repositories. Rather than creating a new repository or rebuilding community repositories from the ground up, Dash will provide a way for organizations to allow self-service deposit of datasets via a simple, intuitive interface that is designed with individual researchers in mind. Researchers will be able to document, preserve, and publicly share their own data with minimal support required from repository staff, as well as be able to find, retrieve, and reuse data made available by others.

Three Phases of Work

  1. Requirements gathering: Before the design process begins, we will build requirements for researchers via interviews and surveys.
  2. Design work: Based on the surveys and interviews with researchers (Phase 1), we will develop requirements for a researcher-focused user interface that is visually appealing and easy to use.
  3. Technical work: Dash will be an added-value data sharing platform that integrates with any repository that supports community protocols, e.g., SWORD (Simple Web-service Offering Repository Deposit); a rough sketch of such a deposit follows this list.
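To make the protocol piece concrete, a SWORD v2 deposit is, at bottom, an HTTP POST of a packaged dataset to a repository collection URI with a few packaging headers. A very rough sketch using the requests library is below; the URL, credentials, and file name are entirely hypothetical, and a real integration would first read the repository’s SWORD service document:

    # Very rough sketch of a SWORD v2 deposit: POST a zipped package to a
    # repository collection URI. The URL, credentials, and file name are
    # hypothetical; a real client would discover the collection URI from
    # the repository's SWORD service document.
    import requests

    with open("dataset-package.zip", "rb") as pkg:
        resp = requests.post(
            "https://repository.example.edu/sword/collections/datasets",  # hypothetical
            data=pkg,
            headers={
                "Content-Type": "application/zip",
                "Content-Disposition": "attachment; filename=dataset-package.zip",
                "Packaging": "http://purl.org/net/sword/package/SimpleZip",
                "In-Progress": "false",
            },
            auth=("depositor", "secret"),  # hypothetical credentials
        )

    print(resp.status_code)  # expect 201 Created with a deposit receipt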

The dash is a critical component of any good ASCII art. By Reddit user Haleljacob

Tagged , , , , ,

New Project: Citing Physical Spaces

A few months ago, the UC3 group was contacted by some individuals interested in solving a problem: how should we reference field stations? Rob Plowes from University of Texas/Brackenridge Field Lab emailed us:

I am on a [National Academy of Sciences] panel reviewing aspects of field stations, and we have been discussing a need for data archiving. One idea proposed is for each field station to generate a simple document with a DOI reference to enable use in publications that make reference to the field station. Having this DOI document would enable a standardized citation that could be tracked by an online data aggregator.

We thought this was a great idea and started having a few conversations with other groups (LTER, NEON, etc.) about its feasibility. Fast forward to two weeks ago, when Plowes and Becca Fenwick of UC Merced presented a more fully fleshed-out version of the idea at the OBFS/NAML Joint Meeting in Woods Hole, MA (OBFS: Organization of Biological Field Stations; NAML: National Association of Marine Laboratories). The response was overwhelmingly positive, so we are proceeding with the idea in earnest here at CDL.

The intent of this blog post is to gather feedback from the broader community about our idea, including our proposed metadata fields, our plans for implementation, and whether there are existing initiatives or groups that we should be aware of and/or partner with moving forward.

In a Nutshell

Problem: Tracking publications associated with a field station or site is difficult. There is no clear or standard way to cite field station descriptions.

Proposal: Create individual, citable “publications” with associated persistent identifiers for each field station (more generically called a “site”). Collect these Site Descriptors in the general use DataONE repository, ONEShare. The user interface will be a new instance of the existing UC3 Dash service (under development) with some modifications for Site Descriptors.

What we need from you: 

Moving forward: We plan on gathering community feedback for the next few months, with an eye towards completing a pilot version of the interface by February 2015. We will be ramping up Dash development over the next 12 months thanks to recent funding from the Alfred P. Sloan Foundation, and this development work will include creating a more robust version of the Site Descriptors database.

Project Partners:

  • Rob Plowes, UT Austin/Brackenridge Field Lab
  • Mark Stromberg, UC Berkeley/UC Natural Reserve System
  • Kevin Browne, UC Natural Reserve System Information Manager
  • Becca Fenwick, UC Merced
  • UC3 group
  • DataONE organization

Lovers Point Laboratory (1930), which was later renamed Hopkins Marine Laboratory. From Calisphere, contributed by Monterey County Free Libraries.


The 10 Things Every New Grad Student Should Do

It’s now mid-October, and I’m guessing that first-year graduate students are knee-deep in courses, barely considering their potential thesis project. But for those who can multi-task, I have compiled this list of 10 things that you should undertake in your first year as a grad student. These aren’t just any 10 things… they are 10 steps you can take to make sure you contribute to a culture shift towards open science. Some are big steps, and others are small, but they will all get you (and the rest of your field) one step closer to reproducible, transparent research.

1. Learn to code in some language. Any language.

Here’s the deal: it’s easier to use black-box applications to run your analyses than to create scripts. Everyone knows this. You put in some numbers and out pop your results; you’re ready to write up your paper and get that H-index headed upwards. But this approach will not cut the mustard for much longer in the research world. Researchers need to know how to code. Growing amounts and diversity of data, more interdisciplinary collaborators, and increasing complexity of analyses mean that no longer can black-box models, software, and applications be used in research. The truth is, if you want your research to be reproducible and transparent, you must code. In a 2013 article “The Big Data Brain Drain: Why Science is in Trouble“, Jake Vanderplas argues that

In short, the new breed of scientist must be a broadly-trained expert in statistics, in computing, in algorithm-building, in software design, and (perhaps as an afterthought) in domain knowledge as well.

I learned MATLAB in graduate school, and experimented with R during a postdoc. I wish I’d delved into this world earlier, and had more skills and knowledge about best practices for scientific software. Basically, I wish I had attended a Software Carpentry bootcamp.

The growing number of Software Carpentry (SWC) bootcamps are more evidence that researchers are increasingly aware of the importance of coding and reproducibility. These bootcamps teach researchers the basics of coding, version control, and similar topics, with the potential for customizing the course’s content to the primary discipline of the audience. I’m a big fan of SWC – read more in my blog post on the organization. Check out SWC founder Greg Wilson’s article on some insights from his years in teaching bootcamps: Software Carpentry: Lessons Learned.

2. Stop using Excel. Or at least stop ONLY using Excel.

Most seasoned researchers know that Microsoft Excel can be problematic for data management: there are loads of ways to manipulate, edit, reorder, and change your data without really knowing exactly what you did. In nerd terms, the trail of dataset changes is known as provenance; generally Excel is terrible at documenting provenance. I wrote about this a few years ago on the blog, and we mentioned a few of the more egregious ways people abuse Excel in our F1000Research publication on the DataUp tool. More recently guest blogger Kara Woo wrote a great post about struggles with dates in Excel.

Of course, everyone uses Excel. In our surveys for the DataUp project, about 88% of the researchers we interviewed used Excel at some point in their research. And we can’t expect folks to stop using it: it’s a great tool! It should, however, be used carefully. For instance, don’t manipulate the sole copy of your raw data in Excel; keep your raw data raw. Use Excel to explore your data, but use other tools to clean and analyze it, such as R, Python, or MATLAB (see #1 above on learning to code). For more help with spreadsheets, see our list of resources and tools: UC3 Spreadsheet Help.
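One way to put “keep your raw data raw” into practice is to script your cleaning steps so the original file is never edited by hand. A small sketch using Python and pandas is below; the file and column names are made up for illustration:

    # Small sketch of "keep your raw data raw": read the untouched raw file,
    # do all cleaning in code, and write the cleaned copy to a separate file.
    # File and column names are made up for illustration.
    import pandas as pd

    raw = pd.read_csv("data/raw/field_measurements.csv")  # never edited by hand

    clean = (
        raw.dropna(subset=["site_id"])  # drop rows missing a site ID
           .assign(sampled_on=lambda d: pd.to_datetime(d["sampled_on"]))
    )

    clean.to_csv("data/clean/field_measurements_clean.csv", index=False)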

3. Learn about how to properly care for your data.

You might know more about your data than anyone else, but you aren’t so smart when it comes to stewardship of your data. There are some great guidelines for how best to document, manage, and generally care for your data; I’ve collected some of my favorites here on CiteULike with the tag best_practices. Pick one (or all of them) to read and make sure your data don’t get short shrift.

4. Write a data management plan.

I know, it sounds like the ultimate boring activity for a Friday night. But these three words (data management plan) can make a HUGE difference in the time and energy spent dealing with data during your thesis. Basically, if you spend some time thinking about file organization, sample naming schemes, backup plans, and quality control measures, you can save many hours of heartache later. Creating a data management plan also forces you to better understand best practices related to data (#3 above). Don’t know how to start? Head over to the DMPTool to write a data management plan. It’s free to use, and you can get an idea for the types of things you should consider when embarking on a new project. Most funders require data management plans alongside proposal submissions, so you might as well get the experience now.

5. Read Reinventing Discovery by Michael Nielsen.

Reinventing Discovery: The New Era of Networked Science by Michael Nielsen was published in 2011, and I’ve since heard it referred to as the Bible for Open Science, and the must-read book for anyone interested in engaging in the new era of 4th paradigm research. I’ve only just recently read the book, and wow. I was fist-pumping quite a bit while reading it, which must have made fellow airline passengers wonder what the fuss was about. If they had asked, I would have told them about Nielsen’s stellar explanation of the necessity for and value of openness and transparency in research, the problems with current incentive structures in science, and the steps we should all take towards shifting the culture of research to enable more connectivity and faster progress. Just writing this blog post makes me want to re-read the book.

6. Learn version control.

My blog post, Git/GitHub: a Primer for Researchers covers much of the importance of version control. Here’s an excerpt:

From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Version control systems can make this much easier, but they are not necessarily intuitive for the fledgling coder. It might take a little time (plus attending a Software Carpentry Bootcamp) to understand version control, but it will be well worth your time. As an added bonus, your work can be more reproducible and transparent by using version control. Read Karthik Ram’s great article, Git can facilitate greater reproducibility and increased transparency in science.

7. Pick a way to communicate your science to the public. Then do it.

You don’t have to have a black belt in Twitter or run a weekly stellar blog to communicate your work. But you should communicate somehow. I have plenty of researcher friends who feel exasperated by the idea that they need to talk to the public about their work. But the truth is, in the US this communication is critical to our research future. My local NPR station recently ran a great piece called Why Scientists are seen as untrustworthy and why it matters. It points out that many (most?) scientists aren’t keen to spend a lot of time engaging with the broader public about their work. However:

…This head-in-the-sand approach would be a big mistake for lots of reasons. One is that public mistrust may eventually translate into less funding and so less science. But the biggest reason is that a mistrust of scientists and science will have profound effects on our future.

Basically, we are avoiding the public at our own peril. Science funding is on the decline, we are facing increasing scrutiny, and it wouldn’t be hyperbole to say that we are at war without even knowing it. Don’t believe me? Read this recent piece in Science (paywall warning): Battle between NSF and House science committee escalates: How did it get this bad?

So start talking. Participate in public lecture series, write a guest blog post, talk about your research to a crotchety relative at Thanksgiving, or write your congressman about the governmental attack on science.

8. Let everyone watch.

Consider going open. That is, do all of your science out in the public eye, so that others can see what you’re up to. One way to do this is by keeping an open notebook. This concept throws out the idea that you should be a hoarder, not telling others of your results until the Big Reveal in the form of a publication. Instead, you keep your lab notebook (you do have one, right?) out in a public place, for anyone to peruse. Most often an open notebook takes the form of a blog or a wiki, and the researcher updates their notebook daily, weekly, or whatever is most appropriate. There are links to data, code, relevant publications, or other content that helps readers, and the researcher themselves, understand the research workflow. Read more in these two blog posts: Open Up  and Open Science: What the Fuss is About.

9. Get your ORCID.

ORCID stands for “Open Researcher & Contributor ID”. The ORCID Organization is an open, non-profit group working to provide a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. The endgame is to support the creation of a permanent, clear and unambiguous record of scholarly communication by enabling reliable attribution of authors and contributors. Basically, researcher identifiers are like social security numbers for scientists. They unambiguously identify you throughout your research life.

Lots of funders, tools, publishers, and universities are buying into the ORCID system. It’s going to make identifying researchers and their outputs much easier. If you have a generic, complicated, compound, or foreign name, you will especially benefit from claiming your ORCID and “stamping” your work with it. It allows you to claim what you’ve done and keep you from getting mixed up with that weird biochemist who does studies on the effects of bubble gum on pet hamsters. Still not convinced? I wrote a blog post a while back that might help.
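An ORCID iD is also machine-actionable. As a rough sketch, a public record can be fetched from the public ORCID API; the iD below is ORCID’s own documentation example, standing in for yours, and the response field names used here are my assumption about the API’s JSON layout:

    # Sketch: fetch a public ORCID record as JSON from the public ORCID API.
    # The iD below is ORCID's documentation example record; the response
    # field names used here are an assumption about the API's JSON layout.
    import json
    import urllib.request

    orcid = "0000-0002-1825-0097"  # ORCID's example iD, stand-in for your own
    req = urllib.request.Request(
        f"https://pub.orcid.org/v3.0/{orcid}/record",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        record = json.load(resp)

    name = record["person"]["name"]
    print(name["given-names"]["value"], name["family-name"]["value"])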

10. Publish in OA journals, or make your work OA afterward.

A wonderful post by Michael White, Why I don’t care about open access to research: and why you should, captures this issue well:

It’s hard for me to see why I should care about open access…. My university library can pay for access to all of the scientific journals I could wish for, but that’s not true of many corporate R&D departments, municipal governments, and colleges and schools that are less well-endowed than mine. Scientific knowledge is not just for academic scientists at big research universities.

It’s easy to forget that you are (likely) among the privileged academics. Not all researchers have access to publications, and this is even more true for the general public. Why are we locking our work in the Ivory Tower, allowing for-profit publishers to determine who gets to read our hard-won findings? The Open Access movement is going full throttle these days, as evidenced by increasing media coverage (see “Steal this research paper: you already paid for it” from MotherJones, or The Guardian’s blog post “University research: if you believe in openness, stand up for it“). So what can you do?

Consider publishing only in open access journals (see the Directory of Open Access Journals). Does this scare you? Are you tied to a disciplinary favorite journal with a high impact factor? Then make your work open access after publishing in a standard journal. Follow my instructions here: Researchers! Make Your Previous Work #OA.

Openness is one of the pillars of a stellar academic career. From Flickr by David Pilbrow.
