Versioning Data with Git and Noms

It has become common for researchers to track their datasets with Git by exporting the data to CSV files and committing those files. They also often publish those datasets on GitHub. This approach runs into problems because Git was not designed to track datasets. Noms, by contrast, was designed specifically to track datasets. In this workshop you will use each of these two tools to track a dataset so that you can compare and contrast them.

Prerequisites

Before proceeding with the lessons in this workshop, you should have both Git and Noms installed. You should also be familiar with the material covered in these workshops:

Learning Objectives

After this workshop you will know how to

Lessons

  1. Find an open source dataset on GitHub and fork it
  2. Import the dataset into Noms and publish the dataset
  3. Make some changes to the dataset and import them
  4. Make a lot of changes to the dataset and import them
  5. Compare Git and Noms: Compare the experience of using these two tools to track versions of datasets
  6. Compare CSV and Code: Discuss how tabular data such as CSV is similar to other text files and how it differs
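
As a preview of the discussion in lesson 6, here is a minimal sketch of one way tabular data can differ from other text files. It assumes that row order carries no meaning in the dataset, which is not true of every CSV file. The sample data and the helper names `changed_lines` and `as_row_set` are made up for illustration and are not part of Git or Noms. The sketch uses Python's standard `difflib` module as a stand-in for a line-based diff; it is not Git's own diff implementation.

```python
import csv
import difflib
import io

# Hypothetical sample data: the same three rows in two different orders.
ORIGINAL = "id,name,score\n1,ada,90\n2,bob,85\n3,cy,70\n"
REORDERED = "id,name,score\n3,cy,70\n1,ada,90\n2,bob,85\n"


def changed_lines(old: str, new: str) -> int:
    """Count the added and removed lines reported by a line-based diff."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(), lineterm="", n=0
    )
    return sum(
        1
        for line in diff
        # Skip the "---" / "+++" file headers and the "@@" hunk headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )


def as_row_set(text: str) -> set:
    """Parse CSV text into a set of rows, deliberately ignoring row order."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return {tuple(zip(header, row)) for row in reader}


line_changes = changed_lines(ORIGINAL, REORDERED)
same_data = as_row_set(ORIGINAL) == as_row_set(REORDERED)

print("lines reported as changed:", line_changes)
print("same set of rows:", same_data)
```

The line-based comparison reports changes even though both texts contain exactly the same rows. Source code rarely behaves this way, because the order of lines in a program usually matters.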