Coursera - Getting and Cleaning Data - Week 1 - Components of Tidy Data

Four things you should have after going from a raw data set to a tidy data set:

  1. the raw data
  2. a tidy data set
  3. a code book describing each variable and its values in the tidy data set
  4. an explicit and exact recipe you used to go from item 1 to items 2 and 3

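Raw data is in the right format if you did not:

  1. run any software on the data
  2. manipulate any of the numbers in the data
  3. remove any data from the data set
  4. summarize the data in any way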

The following standards are described in the guide "How to share data with a statistician".

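Tidy data has the following properties:

  1. each variable you measure is in one column
  2. each different observation of that variable is in a different row
  3. there is one table for each "kind" of variable
  4. if you have multiple tables, they include a column that allows them to be linked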
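
As a small illustration of the "one variable per column, one observation per row" idea usually associated with tidy data, here is a minimal Python sketch that reshapes a wide table into a long one. The sample data, the column names, and the melt_wide helper are all hypothetical; they only show the shape of the transformation and are not taken from the course.

```python
# Hypothetical wide table: one row per subject, one column per visit.
wide_rows = [
    {"subject": "s1", "visit1": 5.1, "visit2": 5.4},
    {"subject": "s2", "visit1": 4.8, "visit2": 5.0},
]


def melt_wide(rows, id_col, value_cols, var_name, value_name):
    """Return one row per (id, variable) pair so each observation has its own row."""
    tidy = []
    for row in rows:
        for col in value_cols:
            tidy.append({id_col: row[id_col], var_name: col, value_name: row[col]})
    return tidy


tidy_rows = melt_wide(wide_rows, "subject", ["visit1", "visit2"], "visit", "measurement")
for r in tidy_rows:
    print(r)
```

In the long form, "visit" and "measurement" are each a single column, and every measurement sits in its own row.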

A common format for the code book is a Word/text file (or Markdown). There should be a section called "Study Design" that has a thorough description of how you collected the data. There must be a section called "Code Book" that describes each variable and its units.
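
One possible way to produce such a file is to generate it from the same script that builds the tidy data. The sketch below assumes a Markdown layout; the build_code_book function, the variable names, and the study description are invented for illustration.

```python
def build_code_book(study_design, variables):
    """Return Markdown text with a "Study Design" and a "Code Book" section."""
    lines = ["# Study Design", "", study_design, "", "# Code Book", ""]
    for var in variables:
        # One bullet per variable: name, units, and a short description.
        lines.append(f"- {var['name']} ({var['units']}): {var['description']}")
    return "\n".join(lines) + "\n"


# Hypothetical study description and variables.
study_design = "Measurements were collected from volunteers at two clinic visits."
variables = [
    {"name": "subject", "units": "none", "description": "anonymous subject identifier"},
    {"name": "weight_kg", "units": "kilograms", "description": "body weight at the visit"},
]

code_book = build_code_book(study_design, variables)
print(code_book)
```

The returned string can be written to a file such as CodeBook.md (a hypothetical name) and kept next to the tidy data set.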

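The Code Book should contain information about:

  1. the variables (including units) in the data set that is not contained in the tidy data
  2. the summary choices you made
  3. the experimental study design you used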

The Instruction List

In some cases, it will not be possible to script every step. In that case, you should provide instructions like the following steps:

  1. Take the raw file, run version 3.1.2 of the summarize software with parameters a=1, b=2, c=3
  2. Run the software separately for each sample.
  3. Take column three of outputfile.txt for each sample and that is the corresponding row in the output data set.
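
Where a step like the last one can be scripted, it might look roughly like the Python sketch below. It assumes the output files are whitespace-delimited and uses made-up sample names and file contents; the real format of outputfile.txt is not specified here, so treat this as an assumption rather than the actual procedure.

```python
def third_column(text):
    """Return the third whitespace-separated field of every non-empty line."""
    return [line.split()[2] for line in text.splitlines() if line.strip()]


# Hypothetical contents of outputfile.txt for two samples.
outputs = {
    "sampleA": "g1 x 0.5\ng2 x 0.7\n",
    "sampleB": "g1 y 1.5\ng2 y 1.7\n",
}

# Column three of each sample's output becomes that sample's row.
rows = {sample: third_column(text) for sample, text in outputs.items()}
for sample, values in rows.items():
    print(sample, values)
```

Keeping even this small step in code means it can be rerun on the raw files rather than repeated by hand.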

Why is the instruction list important?

Published January 13, 2015