Neil Thawani - Blog - Coursera - R Programming - Week 1

Reading Tabular Data

read.table, read.csv - used for reading tabular data

readLines - used for reading lines of a text file

source - used for reading in R code files (inverse of dump)

dget - used for reading in R code files (inverse of dput)

load - used for reading in saved workspaces

unserialize - used for reading single R objects in binary form

Writing Data

write.table
writeLines
dump
dput
save
serialize
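As a rough illustration of how the read and write functions above pair up, the sketch below sends a small, made-up data frame through several of them. The object names and temporary files are assumptions for this example only.

```r
# Round-trip a small example data frame through several read/write pairs.
# All files are temporary files created just for this sketch.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# write.table <-> read.table (plain text, tabular)
txt_file <- tempfile(fileext = ".txt")
write.table(df, txt_file, row.names = FALSE)
from_table <- read.table(txt_file, header = TRUE)

# writeLines <-> readLines (plain text, one element per line)
lines_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), lines_file)
from_lines <- readLines(lines_file)

# save <-> load (binary file; load restores the object under its own name)
rda_file <- tempfile(fileext = ".rda")
save(df, file = rda_file)
rm(df)
load(rda_file)

# serialize <-> unserialize (a single object as a raw vector)
raw_bytes <- serialize(df, connection = NULL)
from_raw <- unserialize(raw_bytes)
```

dump/source and dput/dget follow the same pattern and are covered further down.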

Reading Data Files with read.table

Arguments:

file - the name of a file, or a connection

header - a logical indicating whether the file has a header line

sep - a string indicating how the columns are separated

colClasses - a character vector indicating the class of each column in the dataset

nrows - the number of rows in the dataset

comment.char - a character string indicating the comment character

skip - the number of lines to skip from the beginning

stringsAsFactors - should character variables be coded as factors?
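To see the arguments above working together, here is a minimal sketch that writes a small file and reads it back with every argument set explicitly. The file contents and the name people are invented for illustration.

```r
# Create a small example file (hypothetical contents) to read back in.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "generated by an example script",   # junk first line, removed via skip
  "name,age,city",
  "# a commented-out record",
  "Ann,34,Oslo",
  "Bob,27,Lima",
  "Cy,41,Rome"
), path)

people <- read.table(
  file = path,                 # a file name (a connection also works)
  header = TRUE,               # first line read holds the column names
  sep = ",",                   # columns are comma-separated
  colClasses = c("character", "integer", "character"),
  nrows = 2,                   # read only the first two data rows
  comment.char = "#",          # lines starting with # are ignored
  skip = 1,                    # skip the junk line at the top
  stringsAsFactors = FALSE     # keep character columns as character
)
```

Only Ann and Bob are read here: the first line is skipped, the commented line is ignored, and nrows stops the read after two data rows.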

For small to moderately sized datasets, you can call read.table without any other arguments.

    > data <- read.table("foo.txt")

R will automatically

skip lines that begin with #
figure out how many rows there are and how much memory needs to be allocated
figure out what type of variable is in each column (stating the types explicitly with colClasses makes read.table run faster)

read.csv is identical to read.table, except that the default separator is a comma and header defaults to TRUE
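A quick way to check the relationship between the two functions is to read the same comma-separated file both ways. This is only a sketch on a tiny invented file. For files containing quotes or # characters the two calls can differ, because read.csv also changes a few other defaults.

```r
# A tiny comma-separated file with a header line (made-up values)
path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10.5", "2,20.25"), path)

via_csv <- read.csv(path)
via_table <- read.table(path, header = TRUE, sep = ",")

# Compare the two results
identical(via_csv, via_table)
```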

Reading Large Tables

The help page for read.table contains many hints. Memorizing it is advised.

Make a rough calculation of the memory required to store your dataset. If the dataset needs more memory than the amount of RAM on your computer, you won't be able to read it in.

Set comment.char = "" (an empty string) if there are no commented lines in your file.

Reading in Larger Datasets with read.table

Specifying the colClasses argument can make read.table run significantly faster. To figure out the classes of each column:

    > initial <- read.table("datatable.txt", nrows = 100) # or 1000
    > classes <- sapply(initial, class)
    > tabAll <- read.table("datatable.txt", colClasses = classes)

Set nrows. This doesn't make R run faster, but it helps with memory usage, and a mild overestimate is fine. Use the Unix tool wc to count the number of lines in a file.
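If wc isn't available, the line count can also be obtained from within R. The sketch below is one simple way, assuming the file is small enough that reading all of its lines once is acceptable. The file and its contents are generated just for the example.

```r
# Write an example file with a header line and 500 data rows
path <- tempfile(fileext = ".txt")
write.table(data.frame(x = 1:500, y = rnorm(500)), path, row.names = FALSE)

# Count lines from within R; the Unix equivalent is `wc -l <file>`
n_lines <- length(readLines(path))

# One line is the header, so there are n_lines - 1 data rows
tab <- read.table(path, header = TRUE, nrows = n_lines - 1)
```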

When using R with larger data sets, it helps to know:

how much memory is available?
what other applications are in use?
are there other users logged on to the same system?
what OS?
is the OS 32- or 64-bit?

Example: Calculating Memory Requirements

A data frame has 1,500,000 rows and 120 columns, all of which are numeric data. Roughly how much memory is required to store this data?

1,500,000 rows * 120 columns * 8 bytes/numeric = 1,440,000,000 bytes
1,440,000,000 bytes / 1,024 bytes/KB / 1,024 KB/MB / 1,024 MB/GB = 1.34 GB = N

Note: There are 2^20 bytes/mb, since 2^10 = 1,024.

A rule of thumb: You'll need 2N RAM to read in a dataset that requires N memory.
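The same arithmetic can be done in R. This is a back-of-the-envelope sketch with variable names chosen for the example. The object.size call at the end is an optional sanity check of the 8-bytes-per-numeric assumption on a much smaller data frame.

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8

# Raw storage for the numeric values alone
bytes <- rows * cols * bytes_per_numeric

# 2^30 bytes per gigabyte (1,024 * 1,024 * 1,024)
gigabytes <- bytes / 2^30
round(gigabytes, 2)   # 1.34

# Rule of thumb from above: plan for roughly twice that much RAM
round(2 * gigabytes, 2)

# Sanity check on a small all-numeric data frame: the reported size is
# a little over 1000 * 10 * 8 bytes because of per-object overhead
small <- as.data.frame(matrix(0, nrow = 1000, ncol = 10))
object.size(small)
```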

Textual Data Formats: dput() and dump()

dumping and dputting are useful because the resulting textual format is editable and, in case of corruption, potentially recoverable

dump and dput preserve the metadata (unlike write.table or writeLines) so that the user doesn't have to specify it again

textual data formats can work better with version control programs (like subversion or git), which can only track meaningful changes in text files
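To make the metadata point concrete, here is a small sketch comparing the two routes on a data frame with a Date column. The data and file names are invented. The point is only that the column class survives dput/dget but not write.table/read.table.

```r
df <- data.frame(day = as.Date(c("2020-01-01", "2020-01-02")), n = c(5L, 7L))

# Round trip through write.table / read.table: the Date class is lost,
# so it would have to be specified again when reading
txt_file <- tempfile(fileext = ".txt")
write.table(df, txt_file, row.names = FALSE)
via_table <- read.table(txt_file, header = TRUE)
class(via_table$day)   # no longer "Date"

# Round trip through dput / dget: the class travels with the data
r_file <- tempfile(fileext = ".R")
dput(df, file = r_file)
via_dput <- dget(r_file)
class(via_dput$day)    # "Date"
```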

dput takes an arbitrary R object and will create some R code that will reconstruct the object in R.

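In the example below, y is a data frame with two columns, a and b. dput produces R code that rebuilds it: a call to structure() on a list with two elements, plus the metadata (names, row names, and class). Printing this to the console is not particularly useful, but the same output can be written to a file and read back later with dget. (The factor in the output comes from R versions before 4.0.0, where stringsAsFactors defaulted to TRUE.)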

    > y <- data.frame(a = 1, b = "a")
    > dput(y)
    structure(list(a = 1,
                   b = structure(1L, .Label = "a",
                                 class = "factor")),
              .Names = c("a", "b"), row.names = c(NA, -1L),
              class = "data.frame")
    > dput(y, file = "y.R")
    > new.y <- dget("y.R")
    > new.y
      a b
    1 1 a

dput (and its inverse, dget) works on a single R object, whereas dump can write multiple R objects to one file, which source then reads back in.

    > x <- "foo"
    > y <- data.frame(a = 1, b = "a")
    > dump(c("x", "y"), file = "data.R")
    > rm(x, y)
    > source("data.R")
    > y
      a b
    1 1 a
    > x
    [1] "foo"

R Connections - Interfaces to the Outside World

Data are read in using connection interfaces. Connections can be made to files or to other, more exotic things.

file - opens a connection to a file

gzfile - opens a connection to a file compressed with gzip

bzfile - opens a connection to a file compressed with bzip2

url - opens a connection to a webpage

File connections

    > str(file)
    function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"))

description is the name of the file

open is a code indicating:

r - read only
w - writing (and initializing a new file)
a - appending
rb, wb, ab - reading, writing or appending in binary mode (Windows)
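The open modes can be tried out on a temporary file. This is a minimal sketch with invented file contents, using the "w", "a", and "r" codes listed above.

```r
path <- tempfile(fileext = ".txt")

# "w": create (or overwrite) the file and write to it
con <- file(path, "w")
writeLines(c("alpha", "beta"), con)
close(con)

# "a": append to the existing file
con <- file(path, "a")
writeLines("gamma", con)
close(con)

# "r": read only; an open connection remembers its position
con <- file(path, "r")
first_two <- readLines(con, n = 2)   # "alpha" and "beta"
the_rest <- readLines(con)           # continues after the first two lines
close(con)
```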

In practice, we often don't need to deal with the connection interface directly.

    > con <- file("foo.txt", "r")
    > data <- read.csv(con)
    > close(con)

is the same as

    > data <- read.csv("foo.txt")

Reading Lines of a Text File

    > con <- gzfile("words.gz")
    > x <- readLines(con, 10)
    > x
    [1] "a" "d" "g"
    [5] "b" "e" "h"
    [9] "c" "f"

writeLines takes a character vector and writes each element one line at a time to a text file.
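As a small sketch of writeLines and readLines working through a compressed connection, the example below writes a character vector to a gzip file and reads part of it back. The words and the temporary file are made up for illustration.

```r
gz_path <- tempfile(fileext = ".gz")

# Write three elements, one per line, through a gzip connection
con <- gzfile(gz_path, "w")
writeLines(c("apple", "banana", "cherry"), con)
close(con)

# Read back only the first two lines
con <- gzfile(gz_path)
first_two <- readLines(con, 2)
close(con)

first_two
```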

readLines can be useful for reading in lines of webpages.

    > con <- url("http://www.jhsph.edu", "r")
    > x <- readLines(con)
    > head(x)
    # prints out HTML of the webpage, line by line
