Steps in a data analysis

- Define the question
- Define the ideal data set
- Determine what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpret results
- Challenge results
- Synthesize/write up results
- Create reproducible code

You will usually have either too much or too little information to solve your problem. Defining the question as narrowly as possible helps reduce the noise you have to work through.

For example:

Start with a general question: Can I automatically detect whether an email is spam?
Make it concrete: Can I use quantitative characteristics of the emails to classify them as spam?
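
As a rough illustration of what "quantitative characteristics" could mean, here is a minimal sketch that turns the text of an email into a few numbers. The function name `email_features` and the specific features chosen (word count, dollar signs, exclamation marks, fraction of capital letters) are assumptions made for this example, not a prescribed feature set.

```python
import re


def email_features(text: str) -> dict:
    """Compute a few hypothetical quantitative characteristics of an email."""
    # Words are runs of letters (and apostrophes); digits are ignored here.
    words = re.findall(r"[A-Za-z']+", text)
    letters = [c for c in text if c.isalpha()]
    capitals = sum(c.isupper() for c in letters)
    return {
        "n_words": len(words),
        "dollar_signs": text.count("$"),
        "exclamations": text.count("!"),
        # Guard against emails that contain no letters at all.
        "capital_fraction": capitals / len(letters) if letters else 0.0,
    }


# Made-up example text, used only to show the shape of the output.
example = "WIN $1000 NOW! Click here!"
features = email_features(example)
print(features)
```

Each email becomes a row of numbers like this, which is the form a classifier would need.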

Define the ideal data set

- descriptive - a whole population
- exploratory - a random sample with many variables measured
- inferential - the right population, randomly sampled
- predictive - a training and test data set from the same population
- causal - data from a randomized study
- mechanistic - data about all components of the system

Determine what data you can access

- Sometimes you can find free data on the web
- Other times you may need to buy the data
- Be sure to respect Terms of Use
- If the data doesn't exist, you may need to generate it yourself

Obtain the data

- Try to obtain the raw data
- Be sure to reference the source
- Polite emails go a long way
- If you load the data from an internet source, record the URL and time accessed
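
One possible way to record the URL and access time is to write a small provenance record alongside the data. The sketch below only builds the record; it does not download anything. The helper name `provenance_record`, the field names, and the URL are placeholders invented for this example.

```python
import json
from datetime import datetime, timezone


def provenance_record(url: str) -> dict:
    """Return a record of where the data came from and when it was accessed."""
    return {
        "source_url": url,
        # Store the time in UTC so the record does not depend on local time zone.
        "accessed_utc": datetime.now(timezone.utc).isoformat(),
    }


# Placeholder URL, not a real data source.
record = provenance_record("https://example.com/data.csv")
print(json.dumps(record, indent=2))
```

Saving this record next to the raw file keeps the source and access time with the data itself.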

Clean the data

- Raw data often needs to be processed
- If it is pre-processed, make sure you understand how
- Understand the source of the data (census sample, convenience sample, etc.)
- Data may need reformatting or subsampling; record those steps
- **Determine if the data are good enough** - if not, quit or change data

Subsampling our data set

- For a prediction problem, we need to split the data into a training set and a test set
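
A minimal sketch of such a split, assuming the data fit in memory as a list of rows. The function name `train_test_split`, the default fraction, and the seed are choices made for this example; fixing the seed is what makes the split repeatable.

```python
import random


def train_test_split(rows, test_fraction=0.5, seed=0):
    """Randomly split rows into (train, test) without modifying the input."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in data: row indices 0..9 instead of real observations.
rows = list(range(10))
train, test = train_test_split(rows, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

Both sets are drawn from the same shuffled data, and the test set should be put aside until the model is ready to be evaluated.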

Exploratory data analysis

- Look at summaries of the data
- Check for missing data
- Create exploratory plots
- Perform exploratory analyses (e.g., clustering)
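
The first two checks above (summaries and missing data) can be sketched with the standard library alone. The records below are invented toy values, the column names are hypothetical, and missing entries are assumed to be stored as `None`.

```python
import statistics

# Toy records for illustration only; not real data.
records = [
    {"n_words": 120, "capital_fraction": 0.05},
    {"n_words": 45, "capital_fraction": None},
    {"n_words": None, "capital_fraction": 0.40},
    {"n_words": 300, "capital_fraction": 0.02},
]


def summarize(records, column):
    """Summarize one numeric column, counting None as missing."""
    values = [r[column] for r in records]
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": statistics.median(present),
        "mean": statistics.mean(present),
        "max": max(present),
    }


for column in ("n_words", "capital_fraction"):
    print(column, summarize(records, column))
```

Seeing the count of missing values next to the summary makes it harder to overlook gaps before modeling.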

Statistical prediction/modeling

- Should be informed by the results of your exploratory analysis
- Exact methods depend on the question of interest
- Transformations/processing should be accounted for when necessary
- Measures of uncertainty should be reported
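
As one example of reporting uncertainty, the sketch below computes a classifier's accuracy on a test set together with an approximate 95% interval. It uses the normal approximation to the binomial, which is only one of several possible choices and is rough for small samples. The function name and the labels are made up for this example.

```python
import math


def accuracy_with_interval(y_true, y_pred, z=1.96):
    """Return (accuracy, lower, upper) using a normal-approximation interval."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    acc = correct / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    # Clip to [0, 1], since an accuracy cannot fall outside that range.
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)


# Invented labels (1 = spam, 0 = not spam), for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
acc, low, high = accuracy_with_interval(y_true, y_pred)
print(f"accuracy = {acc:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

With only ten test cases the interval is very wide, which is exactly the kind of information a bare accuracy figure hides.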

Interpret results

- Use the appropriate language (describes, correlates with/associated with, leads to/causes, predicts)
- Give an explanation
- Interpret coefficients
- Interpret measures of uncertainty

Challenge results

- Challenge all steps: question, data source, processing, analysis, conclusions
- Challenge measures of uncertainty
- Challenge choices of terms to include in models
- Think of potential alternative analyses

Synthesize/write up results

- Lead with the question
- Summarize the analyses into the story
- Don't include every analysis; include one only if it is needed for the story or to address a challenge
- Order analyses according to the story, rather than chronologically
- Include "pretty" figures that contribute to the story

Lastly, create reproducible code using Markdown, knitr, and RStudio. Reproducible code makes the evidence for your conclusions much more powerful.