Reading in data

As empirical biologists, you’ll generally have data to read in, probably stored in an Excel spreadsheet. The simplest way to load it into R is to save a copy of the data as a comma-separated values (CSV) file and work with that.

It is actually possible to read directly from Excel (see the gdata package, which has a read.xls function, and see this page for other alternatives). This is usually more hassle than it’s worth, and going through a comma-separated file is easy enough.

To load the data into R:

data <- read.csv("data/seed_root_herbivores.csv")

(this doesn’t usually produce any output – the data is “just there” now).

Clicking the little table icon next to the data in the Workspace browser will view the data. Running View(data) will do the same thing.

The data variable contains a data.frame object. It is a number of columns of the same length, arranged like a matrix. That sentence is tricky, for reasons that will become apparent.
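To make that description concrete, here is a minimal sketch that builds a small data.frame by hand. The name toy and its values are typed in purely for illustration (this is not the seed_root_herbivores data); the point is only that every column is a vector of the same length, and each column can have its own type.

```r
# A hand-made data.frame: three columns of equal length, each with its own type.
toy <- data.frame(Plot = c("plot-2", "plot-4", "plot-6"),
                  Seed.herbivore = c(TRUE, FALSE, TRUE),
                  Height = c(31, 57, 40))

# One class per column, unlike a matrix, which has a single type overall.
sapply(toy, class)

# Number of rows and number of columns, in that order.
dim(toy)

# A compact overview of the whole structure.
str(toy)
```
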

Often, looking at the first few rows is all you need to remind yourself about what is in a data set.

head(data)

Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-2           TRUE           TRUE        1     31   4.16         83
plot-2           TRUE           TRUE        3     41   5.82        175
plot-2           TRUE           TRUE        1     42   3.51         72
plot-2           TRUE          FALSE        1     64   7.16        125
plot-2           TRUE          FALSE        1     47   6.17        212
plot-2           TRUE          FALSE        1     52   5.32        114
  Seeds.in.25.heads
               7
               0
              32
              22
               3
              19

You can get a vector of the column names:

names(data)

[1] "Plot"              "Seed.herbivore"    "Root.herbivore"   
[4] "No.stems"          "Height"            "Weight"           
[7] "Seed.heads"        "Seeds.in.25.heads"

You can get the number of rows:

nrow(data)

[1] 169

and the number of columns

ncol(data)

[1] 8

Aside from issues around factors and character vectors (that we’ll cover shortly) this is most of what you need to know about loading data.

However, it’s useful to know a few things about saving it:

Column names should be consistent and contain no whitespace or special characters.
For missing values, either leave them blank or use NA. But be consistent and don’t use -999 or ? or your cat’s name.
Be careful with whitespace: “x” will be treated differently to “x ”, and Excel makes it easy to accidentally enter the latter. Consider the strip.white=TRUE argument to read.csv.
Think about the type of the data. We’ll cover this more, but are you dealing with a TRUE/FALSE, a category, a count, or a measurement?
Dates and times will cause you nothing but pain. Excel and R both have issues with dates and times, and exporting through CSV can make them worse. I had a case with two different year-zero offsets being used in one exported file. I recommend Year-Month-Day (ISO 8601 format), or different columns for different entries that you combine later.
Watch out for dashes between numbers. Excel will convert these into dates. So if you have “Site-Plant” style numbers, 5-20 will get converted into the 20th of May 1904 or something equally useless. Similar problems happen to gene names in bioinformatics!
Merged rows and columns will not work (or at least not in an easily predictable way).
Spare rows at the top, or double header rows will not work without jumping through hoops.
Equations will (should) convert to the value displayed in Excel on export.
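A few of these points can be seen in a minimal sketch. The file below is written to a temporary location and its contents are invented for illustration (the column names Date and Treatment are not part of the course data set). It contains a blank cell, an NA, a stray trailing space and ISO 8601 dates.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained.
csv.path <- tempfile(fileext = ".csv")
writeLines(c("Plot,Date,Height,Treatment",
             "plot-2,2013-05-20,31,x",
             "plot-4,2013-05-21,,x ",
             "plot-6,2013-05-22,NA,y"),
           csv.path)

# Default settings: the trailing space in "x " is kept, giving three "treatments".
raw <- read.csv(csv.path, stringsAsFactors = FALSE)
unique(raw$Treatment)

# strip.white = TRUE trims unquoted text fields; na.strings lists what counts as missing.
clean <- read.csv(csv.path, strip.white = TRUE, na.strings = c("", "NA"),
                  stringsAsFactors = FALSE)
unique(clean$Treatment)

# Both the blank cell and the literal NA are read as missing values.
is.na(clean$Height)

# Year-Month-Day text converts to a Date without needing a format string.
dates <- as.Date(clean$Date)
class(dates)
```

The stringsAsFactors = FALSE argument is only there so that the text columns stay as plain character vectors regardless of the R version.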

Looking at your data

Next we want to have a look at our data. The summary function works with most types, and gives a by-column summary of the data set:

summary(data)

  Plot     Seed.herbivore  Root.herbivore     No.stems    
 plot-40:  9   Mode :logical   Mode :logical   Min.   : 1.00  
 plot-20:  8   FALSE:90        FALSE:58        1st Qu.: 1.00  
 plot-50:  8   TRUE :79        TRUE :111       Median : 1.00  
 plot-60:  8   NA's :0         NA's :0         Mean   : 1.98  
 plot-16:  7                                   3rd Qu.: 3.00  
 plot-30:  7                                   Max.   :10.00  
 (Other):122                                                  
 Height         Weight        Seed.heads   Seeds.in.25.heads
 Min.   :16.0   Min.   : 0.26   Min.   :   3   Min.   :  0.0    
 1st Qu.:44.0   1st Qu.: 4.08   1st Qu.:  93   1st Qu.: 10.0    
 Median :54.0   Median : 8.05   Median : 175   Median : 19.0    
 Mean   :55.5   Mean   :11.20   Mean   : 226   Mean   : 22.1    
 3rd Qu.:67.0   3rd Qu.:14.77   3rd Qu.: 303   3rd Qu.: 32.0    
 Max.   :97.0   Max.   :55.51   Max.   :1003   Max.   :100.0    
                                                            

Subsetting

R has many powerful subset operators, and mastering them will allow you to easily perform complex operations on any kind of dataset. They also let you manipulate data very succinctly.

There are a bunch of different ways of extracting bits of your data.

Columns of `data.frames`

Get the column Plot

data$Plot

  [1] plot-2  plot-2  plot-2  plot-2  plot-2  plot-2  plot-4  plot-6 
  [9] plot-6  plot-6  plot-8  plot-8  plot-8  plot-8  plot-8  plot-10
 [17] plot-10 plot-10 plot-10 plot-12 plot-12 plot-12 plot-12 plot-12
 [25] plot-14 plot-14 plot-14 plot-14 plot-14 plot-16 plot-16 plot-16
 [33] plot-16 plot-16 plot-16 plot-16 plot-18 plot-18 plot-18 plot-18
 [41] plot-18 plot-18 plot-20 plot-20 plot-20 plot-20 plot-20 plot-20
 [49] plot-20 plot-20 plot-22 plot-22 plot-24 plot-24 plot-24 plot-24
 [57] plot-24 plot-24 plot-26 plot-26 plot-26 plot-26 plot-26 plot-26
 [65] plot-28 plot-28 plot-28 plot-28 plot-28 plot-28 plot-30 plot-30
 [73] plot-30 plot-30 plot-30 plot-30 plot-30 plot-32 plot-32 plot-32
 [81] plot-34 plot-34 plot-34 plot-34 plot-36 plot-36 plot-36 plot-36
 [89] plot-36 plot-36 plot-36 plot-38 plot-38 plot-38 plot-38 plot-38
 [97] plot-40 plot-40 plot-40 plot-40 plot-40 plot-40 plot-40 plot-40
[105] plot-40 plot-42 plot-42 plot-44 plot-44 plot-44 plot-44 plot-44
[113] plot-44 plot-46 plot-46 plot-46 plot-46 plot-46 plot-46 plot-46
[121] plot-48 plot-48 plot-48 plot-48 plot-48 plot-48 plot-50 plot-50
[129] plot-50 plot-50 plot-50 plot-50 plot-50 plot-50 plot-52 plot-52
[137] plot-52 plot-52 plot-52 plot-52 plot-52 plot-54 plot-54 plot-54
[145] plot-54 plot-54 plot-54 plot-54 plot-56 plot-56 plot-56 plot-56
[153] plot-56 plot-56 plot-58 plot-58 plot-58 plot-58 plot-58 plot-58
[161] plot-58 plot-60 plot-60 plot-60 plot-60 plot-60 plot-60 plot-60
[169] plot-60
30 Levels: plot-10 plot-12 plot-14 plot-16 plot-18 plot-2 ... plot-8
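As a rough sketch of the other common ways to pull out columns, the example below uses a small hand-typed data.frame called toy (a stand-in, not the real data set). The distinction worth noticing is whether you get back the column itself or a smaller data.frame.

```r
toy <- data.frame(Plot = c("plot-2", "plot-4", "plot-6"),
                  Height = c(31, 57, 40),
                  Weight = c(4.16, 23.44, 4.01))

# $ and [[ ]] both return the column itself (here a numeric vector).
toy$Height
toy[["Height"]]

# Single brackets with one name return a data.frame with just that column.
toy["Height"]

# Single brackets with several names return a data.frame with those columns.
toy[c("Height", "Weight")]
```
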

Looking at your data (cont.)

Plotting is covered in the next R module, but it’s one of the best things about R so I can’t resist showing how to do it:

Here is a histogram of the height variable:

hist(data$Height)

(it will appear in the bottom right of your screen)

Here is a scatter plot of Height against Weight:

plot(data$Weight, data$Height)

The order of arguments is x-variable, y-variable.

There is an alternative interface using R’s “formulae” – you’ll see this a lot in statistical models. Read this as “Height is a function of Weight”. It makes nicer axis labels, too.

plot(Height ~ Weight, data)

Here is a series of bivariate plots for height, weight and the number of seed heads:

pairs(data[c("Height", "Weight", "Seed.heads")])

The take-home message is that R makes it very easy to create graphs, and most people who use it casually just make plots of whatever they’re looking at. The plots can vary from quick and dirty like these to really beautiful pieces of art.

Rows of `data.frames`

Extracting a row always returns a new data.frame:

data[10,]

 Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
10 plot-6           TRUE           TRUE        1     33   2.58         43
   Seeds.in.25.heads
10                 8

data[10:20,]

  Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-6           TRUE           TRUE        1     33   2.58         43
plot-8          FALSE          FALSE        1     51   5.98        114
plot-8          FALSE           TRUE        1     41   2.75         60
plot-8          FALSE           TRUE        1     38   2.15         68
plot-8          FALSE          FALSE        4     61  21.59        330
plot-8          FALSE          FALSE        1     46   6.86        134
plot-10           TRUE           TRUE        1     34   3.36         97
plot-10           TRUE           TRUE        1     50  15.76        395
plot-10           TRUE           TRUE        1     51   8.73        280
plot-10           TRUE           TRUE        3     33   3.24         68
plot-12          FALSE           TRUE        1     41   2.49         58
   Seeds.in.25.heads
               8
               5
              50
              52
              19
              40
               9
              24
              11
              21
              11

data[c(1, 5, 10),]

 Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-2           TRUE           TRUE        1     31   4.16         83
plot-2           TRUE          FALSE        1     47   6.17        212
plot-6           TRUE           TRUE        1     33   2.58         43
   Seeds.in.25.heads
                7
                3
               8
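The same behaviour can be checked on a small hand-typed data.frame (toy is a stand-in, not the real data set). This sketch confirms that a single extracted row is still a data.frame, and that the original row names travel with the rows.

```r
toy <- data.frame(Plot = c("plot-2", "plot-4", "plot-6"),
                  Height = c(31, 57, 40),
                  Weight = c(4.16, 23.44, 4.01))

# One row: still a data.frame, with one row and all the columns.
row2 <- toy[2, ]
is.data.frame(row2)
dim(row2)

# Several rows picked by position; the row names record where they came from.
picked <- toy[c(1, 3), ]
rownames(picked)
```
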

Be careful with indexing by location

The above all index by name or by location (index). However, you generally want to avoid referencing by number in your saved code, e.g.:

data.height <- data[[5]]

This is because if you change the order of the columns in your spreadsheet (or add or delete a column), everything that depends on data.height may change. You may also see people do this in their code:

data.height <- data[,5]

This should really be avoided. Indexing by name is much more robust and easier to read later on, even if it is more typing at first.

data.height <- data$Height
data.height <- data[["Height"]]
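A minimal sketch of why this matters, using a small hand-typed data.frame (toy and reordered are illustrative names, not part of the course data). Rearranging the columns stands in for someone reorganising the spreadsheet.

```r
toy <- data.frame(Plot = c("plot-2", "plot-4", "plot-6"),
                  Height = c(31, 57, 40),
                  Weight = c(4.16, 23.44, 4.01))

# The same data with the columns in a different order.
reordered <- toy[c("Weight", "Plot", "Height")]

# Indexing by position: column 2 is Height in one layout and Plot in the other.
toy[[2]]
reordered[[2]]

# Indexing by name gives the same answer for both layouts.
identical(toy$Height, reordered$Height)
```

Note that the positional version does not raise an error after the reordering; it quietly returns a different column.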

When should you index by location?

When you are computing the indices. As an example: suppose that you wanted every other row (perhaps you’re trying to generate a nonrandom sample of the data?). Remember seq from above? We can generate a sequence of integers 1, 3, …, up to the last (or second to last) row in our data set like this:

idx <- seq(1, nrow(data), by=2)

Then subset like this:

data.oddrows <- data[idx,]

Our new data set has about half the rows of the old data set:

nrow(data.oddrows)

[1] 85

nrow(data)

[1] 169

Because row names are preserved, you can see the odd numbers in the row names.

head(data.oddrows)

 Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-2           TRUE           TRUE        1     31   4.16         83
plot-2           TRUE           TRUE        1     42   3.51         72
plot-2           TRUE          FALSE        1     47   6.17        212
plot-4          FALSE           TRUE        3     57  23.44        522
plot-6           TRUE           TRUE        1     40   4.01         81
plot-8          FALSE          FALSE        1     51   5.98        114
   Seeds.in.25.heads
                7
               32
                3
                7
               46
               5
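To see how the computed indices behave at the end of the data, here is a small sketch on made-up lengths. It shows why the sequence stops at the last row when the number of rows is odd and at the second-to-last row when it is even.

```r
# Odd number of rows: the sequence ends on the last row.
seq(1, 7, by = 2)   # 1 3 5 7

# Even number of rows: the sequence ends on the second-to-last row.
seq(1, 8, by = 2)   # 1 3 5 7

# With 169 rows there are 85 odd indices.
length(seq(1, 169, by = 2))

# The same idea applied to a made-up vector.
x <- c("a", "b", "c", "d", "e", "f", "g")
x[seq(1, length(x), by = 2)]
```
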

Indexing by logical vector

This is one of the most powerful ways of indexing.

A comparison such as the one below returns a logical vector with one element per row. If we index by it, we get back the rows for which the condition is TRUE:

data$Height > 50

  [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
 [14]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
 [27] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
 [40]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [53]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
 [66] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
 [79] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
 [92]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[105]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
[118] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[131]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
[144]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
[157]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

You can convert a logical (TRUE/FALSE) vector to an integer vector with the which function. It gives you the indices of the elements that are TRUE.

which(data$Height > 50)

 [1]   4   6   7  11  14  18  21  25  30  31  32  33  34  35  36  39  40  41  42  43
[21]  44  50  53  56  57  61  64  65  67  68  69  71  73  75  76  82  88  89  90  92
[41]  93  95  96  97  98  99 100 101 102 103 104 105 110 111 112 114 115 116 119 121
[61] 122 123 124 127 128 129 130 131 133 134 135 136 137 138 139 140 141 142 144 146
[81] 147 148 149 150 152 153 155 156 157 160 162 163 164 165 166 167 169
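On a vector short enough to check by eye, the relationship between the logical vector and which looks like this. The heights below are made up for illustration.

```r
height <- c(31, 64, 47, 52, 41)

# One TRUE/FALSE per element.
tall <- height > 50
tall                 # FALSE TRUE FALSE TRUE FALSE

# Positions of the TRUE elements.
which(tall)          # 2 4

# Indexing by the logical vector or by the positions selects the same elements.
height[tall]
height[which(tall)]
```
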

Subsetting can be useful when you want to look at bits of your data. For example, all the rows where the Height is more than 10 and there was a seed herbivore:

data[data$Height > 10 & data$Seed.herbivore,]

   Plot Seed.herbivore Root.herbivore No.stems Height Weight
  plot-2           TRUE           TRUE        1     31   4.16
  plot-2           TRUE           TRUE        3     41   5.82
  plot-2           TRUE           TRUE        1     42   3.51
  plot-2           TRUE          FALSE        1     64   7.16
  plot-2           TRUE          FALSE        1     47   6.17
  plot-2           TRUE          FALSE        1     52   5.32
  plot-6           TRUE           TRUE        2     27   1.76
  plot-6           TRUE           TRUE        1     40   4.01
 plot-6           TRUE           TRUE        1     33   2.58
plot-10           TRUE           TRUE        1     34   3.36
plot-10           TRUE           TRUE        1     50  15.76
plot-10           TRUE           TRUE        1     51   8.73
plot-10           TRUE           TRUE        3     33   3.24
plot-14           TRUE           TRUE        1     61  12.74
plot-14           TRUE          FALSE        1     50   7.58
plot-14           TRUE           TRUE        1     44   1.57
plot-14           TRUE          FALSE        1     50   7.96
plot-14           TRUE           TRUE        3     45   5.59
plot-18           TRUE          FALSE        1     37   3.92
plot-18           TRUE          FALSE        1     49   3.99
plot-18           TRUE          FALSE        1     72  13.62
Seed.heads Seeds.in.25.heads
         83                 7
        175                 0
         72                32
        125                22
        212                 3
        114                19
         28                24
         81                46
        43                 8
        97                 9
       395                24
       280                11
        68                21
       371                21
       145                13
        36                55
       162                25
       173                16
        98                 1
        96                 5
       348                29
 [ reached getOption("max.print") -- omitted 58 rows ]

The & operator here is a logical “and” (read x & y as “x and y”):

TRUE & TRUE is TRUE
TRUE & FALSE is FALSE
FALSE & TRUE is FALSE
FALSE & FALSE is FALSE

In contrast, the | operator is a logical “or” (read x | y as “x or y”):

TRUE | TRUE is TRUE
TRUE | FALSE is TRUE
FALSE | TRUE is TRUE
FALSE | FALSE is FALSE

The other, less common, operator is the exclusive or (to select things that are only one or the other):

xor(TRUE, TRUE) is FALSE
xor(TRUE, FALSE) is TRUE
xor(FALSE, TRUE) is TRUE
xor(FALSE, FALSE) is FALSE
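These operators work element by element on whole vectors, which is what makes them useful for indexing. A small sketch with two made-up logical vectors (a and b are illustrative names) reproduces all three truth tables at once:

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b       # TRUE FALSE FALSE FALSE
a | b       # TRUE TRUE TRUE FALSE
xor(a, b)   # FALSE TRUE TRUE FALSE
```
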

So you can do all sorts of crazy things like:

data[data$Plot == "plot-2" & data$Seed.herbivore & data$Root.herbivore,]

Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-2           TRUE           TRUE        1     31   4.16         83
plot-2           TRUE           TRUE        3     41   5.82        175
plot-2           TRUE           TRUE        1     42   3.51         72
  Seeds.in.25.heads
               7
               0
              32

and get all the cases in plot 2 where there were both seed herbivores and root herbivores. Or

data[data$Height > 75 & (data$Seed.herbivore | data$Root.herbivore),]

   Plot Seed.herbivore Root.herbivore No.stems Height Weight
plot-16          FALSE           TRUE        3     76  18.55
plot-18           TRUE           TRUE        1     77  16.08
plot-30           TRUE           TRUE        1     89  27.70
plot-44          FALSE           TRUE        3     80  19.40
plot-44          FALSE           TRUE        1     80  15.61
plot-46           TRUE           TRUE        1     84  14.64
plot-50           TRUE          FALSE        2     94  55.51
plot-50           TRUE          FALSE        2     77  12.34
plot-50           TRUE          FALSE        3     97  41.21
plot-52          FALSE           TRUE        1     76  10.54
plot-54           TRUE           TRUE        1     80  28.16
plot-58           TRUE          FALSE        1     80  11.56
Seed.heads Seeds.in.25.heads
       379                19
       261                27
       561                 3
      278                41
      255                 3
      288                34
      963                39
      182                49
      685                25
      231                35
      376                12
      220                20

and get all the plants that are quite tall in treatments with either a seed herbivore or a root herbivore (or both).

You can build these up if you want:

idx.tall <- data$Height > 75
idx.herbivore <- data$Seed.herbivore | data$Root.herbivore
idx.select <- idx.tall & idx.herbivore
data[idx.select,]

   Plot Seed.herbivore Root.herbivore No.stems Height Weight
plot-16          FALSE           TRUE        3     76  18.55
plot-18           TRUE           TRUE        1     77  16.08
plot-30           TRUE           TRUE        1     89  27.70
plot-44          FALSE           TRUE        3     80  19.40
plot-44          FALSE           TRUE        1     80  15.61
plot-46           TRUE           TRUE        1     84  14.64
plot-50           TRUE          FALSE        2     94  55.51
plot-50           TRUE          FALSE        2     77  12.34
plot-50           TRUE          FALSE        3     97  41.21
plot-52          FALSE           TRUE        1     76  10.54
plot-54           TRUE           TRUE        1     80  28.16
plot-58           TRUE          FALSE        1     80  11.56
Seed.heads Seeds.in.25.heads
       379                19
       261                27
       561                 3
      278                41
      255                 3
      288                34
      963                39
      182                49
      685                25
      231                35
      376                12
      220                20

Use whichever form you find easiest to read and write.

“Programs must be written for people to read, and only incidentally for machines to execute.” (“Structure and Interpretation of Computer Programs” by Abelson and Sussman)

The `subset` function to simplify writing complex subsets

There is a function subset that may help you write complex subsets.

subset(data, Height > 75 & (Seed.herbivore | Root.herbivore))

   Plot Seed.herbivore Root.herbivore No.stems Height Weight
plot-16          FALSE           TRUE        3     76  18.55
plot-18           TRUE           TRUE        1     77  16.08
plot-30           TRUE           TRUE        1     89  27.70
plot-44          FALSE           TRUE        3     80  19.40
plot-44          FALSE           TRUE        1     80  15.61
plot-46           TRUE           TRUE        1     84  14.64
plot-50           TRUE          FALSE        2     94  55.51
plot-50           TRUE          FALSE        2     77  12.34
plot-50           TRUE          FALSE        3     97  41.21
plot-52          FALSE           TRUE        1     76  10.54
plot-54           TRUE           TRUE        1     80  28.16
plot-58           TRUE          FALSE        1     80  11.56
Seed.heads Seeds.in.25.heads
       379                19
       261                27
       561                 3
      278                41
      255                 3
      288                34
      963                39
      182                49
      685                25
      231                35
      376                12
      220                20

This can help, especially interactively, but it can also bite you. It is not always obvious where the values of the variables in the second argument are coming from. For example:

subset(data, idx.tall & (Seed.herbivore | Root.herbivore))

   Plot Seed.herbivore Root.herbivore No.stems Height Weight
plot-16          FALSE           TRUE        3     76  18.55
plot-18           TRUE           TRUE        1     77  16.08
plot-30           TRUE           TRUE        1     89  27.70
plot-44          FALSE           TRUE        3     80  19.40
plot-44          FALSE           TRUE        1     80  15.61
plot-46           TRUE           TRUE        1     84  14.64
plot-50           TRUE          FALSE        2     94  55.51
plot-50           TRUE          FALSE        2     77  12.34
plot-50           TRUE          FALSE        3     97  41.21
plot-52          FALSE           TRUE        1     76  10.54
plot-54           TRUE           TRUE        1     80  28.16
plot-58           TRUE          FALSE        1     80  11.56
Seed.heads Seeds.in.25.heads
       379                19
       261                27
       561                 3
      278                41
      255                 3
      288                34
      963                39
      182                49
      685                25
      231                35
      376                12
      220                20

This works fine, because it found idx.tall. So when you read your code, you need to think carefully about which values are coming from the data.frame and which are coming from elsewhere.

This is an unfortunate example of a function designed to be used by beginners, but it is only really understandable once you understand more of what is going on. You’ll see it used widely, and it can simplify things. But be careful.
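One way the lookup can surprise you is sketched below, under the assumption that your workspace contains a variable with the same name as a column. The data.frame toy, the workspace variable Weight and the name threshold are all invented for this illustration.

```r
toy <- data.frame(Height = c(31, 80, 94),
                  Weight = c(4.16, 15.61, 55.51))

# A workspace variable that happens to share its name with a column.
Weight <- c(100, 100, 100)

# Not a column, so subset() has to find it in the workspace.
threshold <- 50

# Inside subset(), Weight means the column toy$Weight.
from.subset <- subset(toy, Weight > threshold)

# With plain indexing, Weight means the workspace variable, so every row passes.
from.brackets <- toy[Weight > threshold, ]

nrow(from.subset)     # 1
nrow(from.brackets)   # 3
```

The two calls look almost identical but select different rows, which is why it pays to know which names are columns and which are not.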

Adding new columns

It is easy to add new columns, perhaps based on old ones:

data$small.plant <- data$Height < 50
head(data)

Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-2           TRUE           TRUE        1     31   4.16         83
plot-2           TRUE           TRUE        3     41   5.82        175
plot-2           TRUE           TRUE        1     42   3.51         72
plot-2           TRUE          FALSE        1     64   7.16        125
plot-2           TRUE          FALSE        1     47   6.17        212
plot-2           TRUE          FALSE        1     52   5.32        114
  Seeds.in.25.heads small.plant
               7        TRUE
               0        TRUE
              32        TRUE
              22       FALSE
               3        TRUE
              19       FALSE

You can delete a column by setting it to NULL:

data$small.plant <- NULL
head(data)

Plot Seed.herbivore Root.herbivore No.stems Height Weight Seed.heads
plot-2           TRUE           TRUE        1     31   4.16         83
plot-2           TRUE           TRUE        3     41   5.82        175
plot-2           TRUE           TRUE        1     42   3.51         72
plot-2           TRUE          FALSE        1     64   7.16        125
plot-2           TRUE          FALSE        1     47   6.17        212
plot-2           TRUE          FALSE        1     52   5.32        114
  Seeds.in.25.heads
               7
               0
              32
              22
               3
              19
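The same add-then-delete pattern, sketched on a small hand-typed data.frame so the result is easy to check. The column Weight.per.cm is a hypothetical derived column made up for this example.

```r
toy <- data.frame(Height = c(31, 57, 40),
                  Weight = c(4.16, 23.44, 4.01))

# A logical column computed from an existing one.
toy$small.plant <- toy$Height < 50

# A numeric column computed from two existing ones (hypothetical example).
toy$Weight.per.cm <- toy$Weight / toy$Height
names(toy)

# Setting a column to NULL removes it; the other columns are untouched.
toy$Weight.per.cm <- NULL
names(toy)
```
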

Exercises

Let’s practise subsetting:

Make a subset of the data only for plants with 1 stem (column No.stems). Plot Height vs. Weight for the subset.
Count how many plants have seed and root herbivores using the table function (columns Seed.herbivore and Root.herbivore). Try to find out whether plants with seed herbivores also tend to have root herbivores.
Make a boxplot to examine whether root herbivores (column Root.herbivore) have an effect on Height. Use the function boxplot.
Is the difference statistically significant? Test it using the t.test function.

A possible solution

data.1stem <- data[data$No.stems == 1, ]

or:

data.1stem <- subset(data, No.stems == 1)
plot(data.1stem$Height, data.1stem$Weight)

table(data$Root.herbivore, data$Seed.herbivore)

Alternatively:

nrow(subset(data, Seed.herbivore == TRUE & Root.herbivore == FALSE))

[1] 28

...for all possible combinations, or graphically:

mosaicplot(data$Root.herbivore ~ data$Seed.herbivore)

boxplot(data$Height ~ data$Root.herbivore)

t.test(data$Height ~ data$Root.herbivore)

(bonus topic) Indexing need not make things smaller

Given this vector with the first five letters of the alphabet:

x <- c("a", "b", "c", "d", "e")

Repeat the first letter once, the second letter twice, etc.:

x[c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)]

 [1] "a" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e" "e"

This is much better:

x[rep(1:5, 1:5)]

 [1] "a" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e" "e"

rep is incredibly useful, and can be used in many ways. See the help page ?rep.
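As a brief sketch of a few of those ways, the calls below use the times and each arguments of rep. The input vectors are made up for illustration.

```r
# Repeat the whole vector.
rep(1:3, times = 2)                   # 1 2 3 1 2 3

# Repeat each element in turn.
rep(1:3, each = 2)                    # 1 1 2 2 3 3

# A different number of repeats for each element.
rep(c("a", "b"), times = c(3, 1))     # "a" "a" "a" "b"

# rep can also be applied to the letters directly, skipping the index step.
x <- c("a", "b", "c", "d", "e")
identical(x[rep(1:5, 1:5)], rep(x, 1:5))
```
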