Introduction to R

Martin Morgan
February 2, 2015

R

Language and environment for statistical computing and graphics

Full-featured programming language
Interactive and interpretted – convenient and forgiving
Coherent, extensive documentation
Statistical, e.g. factor(), NA
Extensible – CRAN, Bioconductor, github, …

Vector, class, object

Efficient vectorized calculations on 'atomic' vectors logical, integer, numeric, complex, character, byte
Atomic vectors are building blocks for more complicated objects
- matrix – atomic vector with 'dim' attribute
- data.frame – list of equal length atomic vectors
Formal classes represent complicated combinations of vectors, e.g., the return value of lm(), below

Function, generic, method

Functions transform inputs to outputs, perhaps with side effects, e.g., rnorm(1000)
- Argument matching first by name, then by position
- Functions may define (some) arguments to have default values
Generic functions dispatch to specific methods based on class of argument(s), e.g., print().
Methods are functions that implement specific generics, e.g., print.factor; methods are invoked indirectly, via the generic.

Introspection

General properties, e.g., class(), str()
Class-specific properties, e.g., dim()

Help

?print: help on the generic print
?print.data.frame: help on print method for objects of class data.frame.

Example

x <- rnorm(1000)                   # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y)         # object of class 'data.frame'
plot(y ~ x, df)                    # generic plot, method plot.formula

plot of chunk unnamed-chunk-1

fit <- lm(y ~x, df)                # object of class 'lm'
methods(class=class(fit))          # introspection

##  [1] add1.lm*           alias.lm*          anova.lm*         
##  [4] case.names.lm*     confint.lm         cooks.distance.lm*
##  [7] deviance.lm*       dfbeta.lm*         dfbetas.lm*       
## [10] drop1.lm*          dummy.coef.lm      effects.lm*       
## [13] extractAIC.lm*     family.lm*         formula.lm*       
## [16] hatvalues.lm*      influence.lm*      kappa.lm          
## [19] labels.lm*         logLik.lm*         model.frame.lm*   
## [22] model.matrix.lm    nobs.lm*           plot.lm*          
## [25] predict.lm         print.lm*          proj.lm*          
## [28] qr.lm*             residuals.lm       rstandard.lm*     
## [31] rstudent.lm*       simulate.lm*       summary.lm        
## [34] variable.names.lm* vcov.lm*          
## 
##    Non-visible functions are asterisked

Lab

1. R data manipulation

This exercise servers as a refresher / tutorial on basic input and manipulation of data.

Input a file that contains ALL (acute lymphoblastic leukemia) patient information

fname <- file.choose()   ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)

Check out the help page ?read.delim for input options, and explore basic properties of the object you've created, for instance…

class(pdata)

## [1] "data.frame"

colnames(pdata)

##  [1] "id"             "diagnosis"      "sex"            "age"           
##  [5] "BT"             "remission"      "CR"             "date.cr"       
##  [9] "t.4.11."        "t.9.22."        "cyto.normal"    "citog"         
## [13] "mol.biol"       "fusion.protein" "mdr"            "kinet"         
## [17] "ccr"            "relapse"        "transplant"     "f.u"           
## [21] "date.last.seen"

dim(pdata)

## [1] 127  21

head(pdata)

##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 6 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE
##   cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE  simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 6       FALSE complex alt.      NEG           <NA> NEG hyperd. FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997
## 6    TRUE      FALSE               REL     12/15/1997

summary(pdata$sex)

##    F    M NA's 
##   42   83    2

summary(pdata$cyto.normal)

##    Mode   FALSE    TRUE    NA's 
## logical      69      24      34

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]

##   sex age
## 1   M  53
## 2   M  19
## 3   F  52
## 4   M  38
## 5   M  57

pdata[1:5, ]

##     id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 2 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE
## 3 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
##   cyto.normal       citog mol.biol fusion.protein mdr   kinet   ccr
## 1       FALSE     t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 2       FALSE simple alt.      NEG           <NA> POS dyploid FALSE
## 3          NA        <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4       FALSE     t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5       FALSE     del(6q)      NEG           <NA> NEG dyploid FALSE
##   relapse transplant               f.u date.last.seen
## 1   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2    TRUE      FALSE               REL      8/28/2000
## 3    TRUE      FALSE               REL     10/15/1999
## 4    TRUE      FALSE               REL      1/23/1998
## 5    TRUE      FALSE               REL      11/4/1997

head(pdata[, 3:5])

##   sex age BT
## 1   M  53 B2
## 2   M  19 B2
## 3   F  52 B4
## 4   M  38 B1
## 5   M  57 B2
## 6   M  17 B1

tail(pdata[, 3:5], 3)

##     sex age BT
## 125   M  19 T2
## 126   M  30 T3
## 127   M  29 T2

head(pdata$age)

## [1] 53 19 52 38 57 17

head(pdata$sex)

## [1] M M F M M M
## Levels: F M

head(pdata[pdata$age > 21,])

##      id diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22.
## 1  1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE
## 3  3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA
## 4  4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE
## 5  4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE
## 10 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE
## 11 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE
##    cyto.normal        citog mol.biol fusion.protein mdr   kinet   ccr
## 1        FALSE      t(9;22)  BCR/ABL           p210 NEG dyploid FALSE
## 3           NA         <NA>  BCR/ABL           p190 NEG dyploid FALSE
## 4        FALSE      t(4;11) ALL1/AF4           <NA> NEG dyploid FALSE
## 5        FALSE      del(6q)      NEG           <NA> NEG dyploid FALSE
## 10       FALSE     del(p15)  BCR/ABL           p190 NEG    <NA> FALSE
## 11       FALSE del(p15/p16)  BCR/ABL      p190/p210 NEG dyploid FALSE
##    relapse transplant               f.u date.last.seen
## 1    FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3     TRUE      FALSE               REL     10/15/1999
## 4     TRUE      FALSE               REL      1/23/1998
## 5     TRUE      FALSE               REL      11/4/1997
## 10    TRUE      FALSE               REL      7/11/1997
## 11   FALSE       TRUE BMT / DEATH IN CR           <NA>

It seems from below that there are 17 females over 40 in the data set, but when sub-setting pdata to contain just those individuals 19 rows are selected. Why? What can we do to correct this?

idx <- pdata$sex == "F" & pdata$age > 40
table(idx)

## idx
## FALSE  TRUE 
##   108    17

dim(pdata[idx,])

## [1] 19 21

Use the mol.biol column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g.,

bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]

The mol.biol column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?

bcrabl$mol.biol <- factor(bcrabl$mol.biol)

The BT column is a factor describing B- and T-cell subtypes

levels(bcrabl$BT)

##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T

table(bcrabl$BT)

## 
##  B B1 B2 B3 B4  T T1 T2 T3 T4 
##  4  9 35 22  9  4  1 15  9  2

levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)

## 
##  B  T 
## 79 31

Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

xtabs(~ BT + mol.biol, bcrabl)

##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  31

Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

aggregate(age ~ mol.biol + sex, bcrabl, mean)

##   mol.biol sex      age
## 1  BCR/ABL   F 39.93750
## 2      NEG   F 30.42105
## 3  BCR/ABL   M 40.50000
## 4      NEG   M 27.21154

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

t.test(age ~ mol.biol, bcrabl)

## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##              40.25000              28.07042

boxplot(age ~ mol.biol, bcrabl)

plot of chunk ALL-age

Resources

StackOverflow for R programming questions; also R-help mailing list.

Publications (General R)