Lion in a what?

Digital Media, Organizational, and Personal Development

Coursera - R Programming - Week 1 - Reading and Writing Data

Reading Tabular Data

read.table, read.csv - used for reading tabular data

readLines - used for reading lines of a text file

source - used for reading in R code files (inverse of dump)

dget - used for reading in R code files (inverse of dput)

load - used for reading in saved workspaces

unserialize - used for reading single R objects in binary form

Writing Data

  • write.table
  • writeLines
  • dump
  • dput
  • save
  • serialize
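
The reading and writing functions above come in pairs. As a minimal sketch (the objects `scores` and `label` are made up for illustration), two of the binary pairs round-trip like this:

```r
# Hypothetical example objects; any R objects would do.
scores <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))
label <- "week1"

# save/load: a binary file holding one or more named objects.
rdata_path <- tempfile(fileext = ".RData")
save(scores, label, file = rdata_path)
rm(scores, label)
load(rdata_path)   # restores both objects under their original names

# serialize/unserialize: a single object converted to and from raw bytes.
bytes <- serialize(scores, connection = NULL)
scores_copy <- unserialize(bytes)
```

Passing `connection = NULL` makes serialize return a raw vector instead of writing to a connection, which keeps the sketch free of extra files.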

Reading Data Files with read.table

Arguments:

  • file - the name of a file, or a connection
  • header - a logical indicating whether the file has a header line
  • sep - a string indicating how the columns are separated
  • colClasses - a character vector indicating the class of each column in the dataset
  • nrows - the number of rows in the dataset
  • comment.char - a character string indicating the comment character
  • skip - the number of lines to skip from the beginning
  • stringsAsFactors - should character variables be coded as factors?

For small to moderately sized datasets, you can usually call read.table without specifying any other arguments.

    > data <- read.table("foo.txt")
    

    R will automatically:

    • skip lines that begin with #
    • figure out how many rows there are and how much memory needs to be allocated
    • figure out what type of variable is in each column (specifying the types yourself with colClasses makes read.table run faster)

    read.csv is identical to read.table, except that the default separator is a comma and header defaults to TRUE
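
To make the arguments concrete, here is a small sketch that writes a made-up file (the names and values are invented for illustration) and reads it back with the main arguments spelled out:

```r
# Write a small, made-up file so the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("# a comment line, skipped by read.table by default",
             "name,age",
             "alice,30",
             "bob,25"), path)

people <- read.table(path,
                     header = TRUE,             # first data line holds column names
                     sep = ",",                 # columns are comma-separated
                     colClasses = c("character", "integer"),
                     stringsAsFactors = FALSE)  # keep strings as character vectors

# read.csv already assumes header = TRUE and sep = ",", but its default
# comment.char is "", so the comment character is set explicitly here.
people_csv <- read.csv(path, comment.char = "#")
```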

    Reading Large Tables

    The help page for read.table contains many hints. Memorizing it is advised.

    Make a rough calculation of the memory required to store your dataset. If the dataset needs more memory than the amount of RAM on your computer, you won't be able to read it in.

    Set comment.char = "" if there are no commented lines in your file.

    Reading in Larger Datasets with read.table

    Specifying the colClasses argument can make read.table run significantly faster. To figure out the classes of each column:

    > initial <- read.table("datatable.txt", nrows = 100) # or 1000
    > classes <- sapply(initial, class)
    > tabAll <- read.table("datatable.txt", colClasses = classes)
    

    Set nrows. This helps memory usage, and a mild overestimate is fine. Use the Unix tool wc to calculate the number of lines in a file.
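
The tips above can be combined. The following is a sketch on a made-up dataset written to a temporary file; counting lines with readLines stands in for wc here so that the example stays inside R:

```r
# Made-up dataset written to a temporary file.
path <- tempfile(fileext = ".txt")
full <- data.frame(id = 1:500,
                   value = seq(0.5, 250, by = 0.5),
                   group = rep(c("a", "b"), 250))
write.table(full, path, row.names = FALSE)

# Count the lines from within R (header line included).
n_lines <- length(readLines(path))

# Infer the column classes from the first 100 rows, then reuse them.
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)
tab_all <- read.table(path, header = TRUE,
                      colClasses = classes,
                      nrows = n_lines - 1,   # minus the header line
                      comment.char = "")     # no comment lines in this file
```

One caveat with this approach: if a numeric column happens to contain only whole numbers in the sampled rows, it will be classed as integer, and the full read will fail on the first fractional value. Sampling more rows makes that less likely.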

    When using R with larger data sets, it helps to know:

    • how much memory is available?
    • what other applications are in use?
    • are there other users logged on to the same system?
    • what OS?
    • is the OS 32- or 64-bit?

    Example: Calculating Memory Requirements
    A data frame has 1,500,000 rows and 120 columns, all of which are numeric data. Roughly how much memory is required to store this data?

    1,500,000 rows * 120 columns * 8 bytes/numeric = 1,440,000,000 bytes
    1,440,000,000 bytes / 2^20 bytes/MB = 1,373.29 MB
    1,373.29 MB / 2^10 MB/GB = 1.34 GB = N

    Note: There are 2^20 bytes per MB, since there are 2^10 = 1,024 bytes per KB and 2^10 KB per MB.

    A rule of thumb: You'll need 2N RAM to read in a dataset that requires N memory.
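
The same arithmetic can be done in R. This is only a sketch of the calculation above; the variable names are made up, and object.size is used on a smaller vector as a sanity check on the 8 bytes per numeric figure:

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8

bytes <- rows * cols * bytes_per_numeric
mb <- bytes / 2^20          # 2^20 bytes per megabyte
gb <- bytes / 2^30          # 2^30 bytes per gigabyte
ram_needed_gb <- 2 * gb     # rule of thumb: about twice the dataset size

# Sanity check: 1,000,000 doubles take about 8,000,000 bytes
# plus a small fixed overhead for the vector itself.
small <- numeric(1e6)
small_bytes <- as.numeric(object.size(small))
```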

    Textual Data Formats: dput() and dump()

    Dumping and dputting are useful because the resulting textual format is editable (and recoverable in case of corruption).

    dump and dput preserve the metadata (unlike write.table or writeLines), so the user doesn't have to specify it again.

    Textual data formats can work better with version control programs (like Subversion or Git), which can track changes meaningfully only in text files.

    dput takes an arbitrary R object and will create some R code that will reconstruct the object in R.

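    In the example below, y is a data frame with two columns, a and b. dput writes R code that rebuilds it: a list with two elements, plus the metadata (column names, row names, and class). The output is not especially readable, but it can be written to a file and used to reconstruct the object later. The output shown comes from a version of R in which strings were converted to factors by default (before R 4.0.0).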

    > y <- data.frame(a = 1, b = "a")
    > dput(y)
    structure(list(a = 1,
            b = structure(1L, .Label = "a",
                    class = "factor")),
        .Names = c("a", "b"), row.names = c(NA, -1L),
        class = "data.frame")
    > dput(y, file = "y.R")
    > new.y <- dget("y.R")
    > new.y
        a   b
    1   1   a
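
A quick way to convince yourself that the round trip keeps the metadata is to compare the original and the restored object. This is a sketch using a temporary file; stringsAsFactors is set explicitly only so that the result does not depend on the R version's default:

```r
# Same data frame as above, with the factor conversion made explicit.
y <- data.frame(a = 1, b = "a", stringsAsFactors = TRUE)

path <- tempfile(fileext = ".R")
dput(y, file = path)
new_y <- dget(path)

same <- isTRUE(all.equal(y, new_y))  # names, class, and factor levels survive
code_text <- readLines(path)         # the file is plain, editable R code
```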
    

    dput (and its inverse, dget) works on a single R object, whereas dump can write multiple R objects to one file, which source then reads back.

    > x <- "foo"
    > y <- data.frame(a = 1, b = "a")
    > dump(c("x", "y"), file = "data.R")
    > rm(x, y)
    > source("data.R")
    > y
        a   b
    1   1   a
    > x
    [1] "foo"
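
Because source runs whatever code is in the file, it can be safer to restore dumped objects into a separate environment first. The sketch below assumes this variation (the `restored` environment is not part of the course example):

```r
x <- "foo"
y <- data.frame(a = 1, b = "a")

path <- tempfile(fileext = ".R")
dump(c("x", "y"), file = path)
rm(x, y)

# Source into a fresh environment instead of the global workspace,
# so the restored objects cannot overwrite anything by accident.
restored <- new.env()
source(path, local = restored)
restored_names <- sort(ls(restored))
```

After this, `restored$x` and `restored$y` hold the recovered objects, and they can be copied into the workspace once you have checked them.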
    

    R Connections - Interfaces to the Outside World

    Data are read in using connection interfaces. Connections can be made to files or to other, more exotic things.

    file - opens a connection to a file

    gzfile - opens a connection to a file compressed with gzip

    bzfile - opens a connection to a file compressed with bzip2

    url - opens a connection to a webpage

    File connections

    > str(file)
    function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"))
    

    description is the name of the file

    open is a code indicating the mode in which the file is opened:

    • r - read only
    • w - writing (and initializing a new file)
    • a - appending
    • rb, wb, ab - reading, writing or appending in binary mode (Windows)
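
The open codes listed above can be sketched with a temporary file (the file contents are made up for illustration):

```r
path <- tempfile(fileext = ".txt")

con <- file(path, "w")    # "w": create the file (or truncate it) and write
writeLines(c("id,score", "1,90"), con)
close(con)

con <- file(path, "a")    # "a": append to the existing file
writeLines("2,85", con)
close(con)

con <- file(path, "r")    # "r": read only
data <- read.csv(con)
close(con)
```

Each connection is closed before the next one is opened, so the writes are flushed to disk before the file is read back.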

    In general, we often don't need to deal with the connection interface directly.

    > con <- file("foo.txt", "r")
    > data <- read.csv(con)
    > close(con)
    

    is the same as

    > data <- read.csv("foo.txt")

    Reading Lines of a Text File

    > con <- gzfile("words.gz")
    > x <- readLines(con, 10)
    > x
    [1] "a" "d" "g"
    [4] "b" "e" "h"
    [7] "c" "f"
    

    writeLines takes a character vector and writes each element one line at a time to a text file.
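
writeLines and readLines work through any connection, including compressed ones. A minimal sketch with a made-up word list and a temporary gzip file:

```r
words <- c("apple", "banana", "cherry", "date", "elderberry")  # made-up content
path <- tempfile(fileext = ".gz")

con <- gzfile(path, "w")   # write through a gzip-compressing connection
writeLines(words, con)     # one element per line
close(con)

con <- gzfile(path, "r")
first_three <- readLines(con, 3)   # read only the first 3 lines
rest <- readLines(con)             # continue from where the last read stopped
close(con)
```

Because the connection stays open between the two readLines calls, the second call picks up at line 4 rather than starting over.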

    readLines can be useful for reading in lines of webpages.

    > con <- url("http://www.jhsph.edu", "r")
    > x <- readLines(con)
    > head(x)
    # prints out HTML of the webpage, line by line