Lion in a what?

Digital Media, Organizational, and Personal Development


Coursera - Reproducible Research - Week 1 - Structure of a Data Analysis

Steps in a data analysis

  1. Define the question
  2. Define the ideal data set
  3. Determine what data you can access
  4. Obtain the data
  5. Clean the data
  6. Exploratory data analysis
  7. Statistical prediction/modeling
  8. Interpret results
  9. Challenge results
  10. Synthesize/write up results
  11. Create reproducible code

You will usually have either too much or too little information to solve your problem. Defining the question as narrowly as possible helps reduce the noise.

Start with a general question: Can I automatically detect emails that are spam? Then make it concrete: Can I use quantitative characteristics of the emails to classify them as spam or not spam?
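To make "quantitative characteristics" a little more tangible, here is a minimal Python sketch that turns an email body into a few numbers. The function name `email_features` and the particular features (word count, dollar signs, exclamation marks, share of capital letters) are my own illustrative choices, not the ones used in the course.

```python
def email_features(text: str) -> dict:
    """Turn an email body into a few numeric characteristics (illustrative only)."""
    words = text.split()
    letters = [c for c in text if c.isalpha()]
    return {
        "n_words": len(words),            # whitespace-separated tokens
        "n_dollar": text.count("$"),      # number of dollar signs
        "n_exclaim": text.count("!"),     # number of exclamation marks
        # share of letters that are upper case; 0.0 if there are no letters
        "frac_upper": sum(c.isupper() for c in letters) / len(letters) if letters else 0.0,
    }


# A made-up message, just to show the shape of the output.
example = email_features("WIN $$$ NOW! Click here")
print(example)
```

Each email becomes one row of numbers, which is the form a classifier needs.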

Define the ideal data set

The ideal data set depends on the goal of the analysis:

  • descriptive - a whole population
  • exploratory - a random sample with many variables measured
  • inferential - the right population, randomly sampled
  • predictive - a training and test data set from the same population
  • causal - data from a randomized study
  • mechanistic - data about all components of the system

Determine what data you can access

  • Sometimes you can find free data on the web
  • Other times you may need to buy the data
  • Be sure to respect Terms of Use
  • If the data doesn't exist, you may need to generate it yourself

Obtain the data

  • Try to obtain the raw data
  • Be sure to reference the source
  • Polite emails go a long way
  • If you load the data from an internet source, record the URL and time accessed
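One way to record the URL and access time is to write a small provenance record next to the downloaded file. The sketch below is in Python; `provenance_record` is a hypothetical helper, the URL is a placeholder, and the SHA-256 checksum is an extra I added so you can later check that the file has not changed.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(url: str, payload: bytes) -> dict:
    """Describe where a download came from and when it was fetched."""
    return {
        "url": url,
        "accessed_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "sha256": hashlib.sha256(payload).hexdigest(),  # optional integrity check
        "n_bytes": len(payload),
    }


# In real use the payload would come from something like
# urllib.request.urlopen(url).read(); a literal stands in here so the
# sketch runs without network access.
payload = b"id,label\n1,spam\n2,nonspam\n"
record = provenance_record("https://example.com/spam.csv", payload)
print(json.dumps(record, indent=2))
```

Saving the printed JSON alongside the raw data keeps the source and the access time with the file itself.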

Clean the data

  • Raw data often needs to be processed
  • If it is pre-processed, make sure you understand how
  • Understand the source of the data (census sample, convenience sample, etc.)
  • The data may need reformatting or subsampling; record those steps
  • Determine if the data are good enough - if not, quit or change data

Subsampling our data set

  • For a prediction problem, we need to split the data into a training set and a test set
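A reproducible split can be sketched in a few lines of Python. The function name `train_test_split`, the default 50/50 fraction, and the seed value are assumptions for illustration; the point is that fixing the seed makes the subsampling step repeatable.

```python
import random


def train_test_split(rows, test_fraction=0.5, seed=3435):
    """Randomly assign rows to a training set and a test set."""
    rng = random.Random(seed)          # fixed seed, so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    # Preserve the original row order inside each subset.
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(10))                 # stand-in for ten emails
train, test = train_test_split(rows, test_fraction=0.3)
print(len(train), len(test))
```

Both sets come from the same population, and every row ends up in exactly one of them.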

Exploratory data analysis

  • Look at summaries of the data
  • Check for missing data
  • Create exploratory plots
  • Perform exploratory analyses (e.g., clustering)
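As a rough illustration of the first two bullets, this Python sketch summarizes one numeric column and counts its missing values. It assumes missing entries are stored as `None` and that at least one value is present; the sample values are invented.

```python
import statistics


def summarize(values):
    """Basic summary of one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "median": statistics.median(present),
        "mean": statistics.mean(present),
        "max": max(present),
    }


# Invented counts of dollar signs in five emails, one of them missing.
summary = summarize([0, 2, None, 4, 10])
print(summary)
```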

Statistical prediction/modeling

  • Should be informed by the results of your exploratory analysis
  • Exact methods depend on the question of interest
  • Transformations/processing should be accounted for when necessary
  • Measures of uncertainty should be reported
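To show what reporting a measure of uncertainty might look like for the spam question, here is a Python sketch of a deliberately simple rule (flag an email as spam if it has at least one dollar sign) together with its accuracy and an approximate 95% interval. The rule, the threshold, and the toy data are all invented for illustration, and the normal-approximation interval is only one of several possible choices.

```python
import math


def predict_spam(n_dollar, threshold=1):
    """Hypothetical rule: at least `threshold` dollar signs means spam."""
    return "spam" if n_dollar >= threshold else "nonspam"


def accuracy_with_interval(y_true, y_pred, z=1.96):
    """Accuracy plus a normal-approximation interval, clipped to [0, 1]."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    acc = correct / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


# Toy test set (made up): dollar-sign counts and the true labels.
counts = [0, 3, 0, 1, 0, 0, 2, 0, 0, 5]
labels = ["nonspam", "spam", "nonspam", "spam", "spam",
          "nonspam", "spam", "nonspam", "nonspam", "spam"]

preds = [predict_spam(c) for c in counts]
acc, lower, upper = accuracy_with_interval(labels, preds)
print(f"accuracy = {acc:.2f}, approx. 95% interval = ({lower:.2f}, {upper:.2f})")
```

With only ten test emails the interval is very wide, which is exactly the kind of caveat a bare accuracy number would hide.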

Interpret results

  • Use the appropriate language (describes, correlates with/associated with, leads to/causes, predicts)
  • Give an explanation
  • Interpret coefficients
  • Interpret measures of uncertainty

Challenge results

  • Challenge all steps: question, data source, processing, analysis, conclusions
  • Challenge measures of uncertainty
  • Challenge choices of terms to include in models
  • Think of potential alternative analyses

Synthesize/write up results

  • Lead with the question
  • Summarize the analyses into the story
  • Don't include every analysis; include it if it is needed for the story or to address a challenge
  • Order analyses according to the story, rather than chronologically
  • Include "pretty" figures that contribute to the story

Lastly, create reproducible code using Markdown, knitr, and RStudio. Being able to rerun the analysis makes the evidence for your conclusions much stronger.