Martin Morgan
February 2, 2015
Language and environment for statistical computing and graphics
Statistical, e.g., factor(), NA

Vector, class, object

Atomic vectors: logical, integer, numeric, complex, character, raw (bytes)
matrix – atomic vector with 'dim' attribute
data.frame – list of equal length atomic vectors
Classes represent more complicated objects, e.g., the return value of lm(), below

Function, generic, method

Functions, e.g., rnorm(1000)
Generic functions, e.g., print()
Methods, e.g., print.factor; methods are invoked indirectly, via the generic.

Introspection

Inspect objects with, e.g., class(), str(), dim()

Help

?print: help on the generic print
?print.data.frame: help on print method for objects of class data.frame.

Example
x <- rnorm(1000) # atomic vectors
y <- x + rnorm(1000, sd=.5)
df <- data.frame(x=x, y=y) # object of class 'data.frame'
plot(y ~ x, df) # generic plot, method plot.formula
fit <- lm(y ~ x, df) # object of class 'lm'
methods(class=class(fit)) # introspection
## [1] add1.lm* alias.lm* anova.lm*
## [4] case.names.lm* confint.lm cooks.distance.lm*
## [7] deviance.lm* dfbeta.lm* dfbetas.lm*
## [10] drop1.lm* dummy.coef.lm effects.lm*
## [13] extractAIC.lm* family.lm* formula.lm*
## [16] hatvalues.lm* influence.lm* kappa.lm
## [19] labels.lm* logLik.lm* model.frame.lm*
## [22] model.matrix.lm nobs.lm* plot.lm*
## [25] predict.lm print.lm* proj.lm*
## [28] qr.lm* residuals.lm rstandard.lm*
## [31] rstudent.lm* simulate.lm* summary.lm
## [34] variable.names.lm* vcov.lm*
##
## Non-visible functions are asterisked
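The relationship between a generic and its methods can be sketched with a small S3 example. The generic describe() and its two methods are hypothetical names invented for this illustration; they are not part of base R.

```r
# Hypothetical generic: UseMethod() dispatches on the class of 'x'
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no class-specific method exists
describe.default <- function(x, ...) {
    paste("object of class", class(x)[1])
}

# Method for objects of class 'data.frame'
describe.data.frame <- function(x, ...) {
    paste("data.frame with", nrow(x), "rows and", ncol(x), "columns")
}

df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
res_df <- describe(df)     # invokes describe.data.frame via the generic
res_int <- describe(1:3)   # no describe.integer, so describe.default is used
res_df
res_int
```

The user always calls the generic, describe(); which method runs depends on the class of the argument.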
This exercise serves as a refresher / tutorial on basic input and manipulation of data.
Input a file that contains ALL (acute lymphoblastic leukemia) patient information
fname <- file.choose() ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)
Check out the help page ?read.delim for input options, and explore basic properties of the object you've created, for instance…
class(pdata)
## [1] "data.frame"
colnames(pdata)
## [1] "id" "diagnosis" "sex" "age"
## [5] "BT" "remission" "CR" "date.cr"
## [9] "t.4.11." "t.9.22." "cyto.normal" "citog"
## [13] "mol.biol" "fusion.protein" "mdr" "kinet"
## [17] "ccr" "relapse" "transplant" "f.u"
## [21] "date.last.seen"
dim(pdata)
## [1] 127 21
head(pdata)
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 6 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 6 FALSE complex alt. NEG <NA> NEG hyperd. FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 6 TRUE FALSE REL 12/15/1997
summary(pdata$sex)
## F M NA's
## 42 83 2
summary(pdata$cyto.normal)
## Mode FALSE TRUE NA's
## logical 69 24 34
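The summaries above rely on text columns such as sex being read in as factors. That was the default behavior of read.delim() before R 4.0.0; from R 4.0.0 onward the default is stringsAsFactors = FALSE, so text columns arrive as character vectors unless factors are requested. A minimal sketch, using a small made-up file rather than ALLphenoData.tsv:

```r
# Made-up three-column, tab-delimited file (not the real ALL data)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tsex\tage", "1\tM\t53", "2\tF\t52", "3\tNA\t19"), tmp)

# Request factors explicitly so the result does not depend on the R version
toy <- read.delim(tmp, stringsAsFactors = TRUE)
class(toy$sex)     # a factor
levels(toy$sex)    # the string "NA" is read as a missing value, not a level
summary(toy$sex)   # counts per level, plus the number of NA's
```

With the real file, read.delim(fname, stringsAsFactors = TRUE) should reproduce the factor-based output shown in this exercise under recent versions of R.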
Remind yourselves about various ways to subset and access columns of a data.frame
pdata[1:5, 3:4]
## sex age
## 1 M 53
## 2 M 19
## 3 F 52
## 4 M 38
## 5 M 57
pdata[1:5, ]
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 2 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 2 FALSE simple alt. NEG <NA> POS dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 2 TRUE FALSE REL 8/28/2000
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
head(pdata[, 3:5])
## sex age BT
## 1 M 53 B2
## 2 M 19 B2
## 3 F 52 B4
## 4 M 38 B1
## 5 M 57 B2
## 6 M 17 B1
tail(pdata[, 3:5], 3)
## sex age BT
## 125 M 19 T2
## 126 M 30 T3
## 127 M 29 T2
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
## id diagnosis sex age BT remission CR date.cr t.4.11. t.9.22.
## 1 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE
## 3 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA
## 4 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE
## 5 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE
## 10 8001 1/15/1997 M 40 B2 CR CR 3/26/1997 FALSE FALSE
## 11 8011 8/21/1998 M 33 B3 CR CR 10/8/1998 FALSE FALSE
## cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## 1 FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 3 NA <NA> BCR/ABL p190 NEG dyploid FALSE
## 4 FALSE t(4;11) ALL1/AF4 <NA> NEG dyploid FALSE
## 5 FALSE del(6q) NEG <NA> NEG dyploid FALSE
## 10 FALSE del(p15) BCR/ABL p190 NEG <NA> FALSE
## 11 FALSE del(p15/p16) BCR/ABL p190/p210 NEG dyploid FALSE
## relapse transplant f.u date.last.seen
## 1 FALSE TRUE BMT / DEATH IN CR <NA>
## 3 TRUE FALSE REL 10/15/1999
## 4 TRUE FALSE REL 1/23/1998
## 5 TRUE FALSE REL 11/4/1997
## 10 TRUE FALSE REL 7/11/1997
## 11 FALSE TRUE BMT / DEATH IN CR <NA>
It seems from below that there are 17 females over 40 in the data set, but when subsetting pdata to contain just those individuals, 19 rows are selected. Why? What can we do to correct this?
idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
## idx
## FALSE TRUE
## 108 17
dim(pdata[idx,])
## [1] 19 21
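One likely explanation is missing values: summary(pdata$sex) reports 2 NA's, a comparison involving NA yields NA, table() omits NA by default, and subsetting a data.frame with an NA index returns a row of NAs. The sketch below reproduces the effect on a small made-up data.frame (toy is not part of the original data) and shows two possible corrections.

```r
# Made-up data: two individuals have a missing value for 'sex'
toy <- data.frame(sex = factor(c("F", "M", NA, "F", NA)),
                  age = c(45, 50, 60, 30, 41))

idx <- toy$sex == "F" & toy$age > 40
idx                        # TRUE FALSE NA FALSE NA
table(idx)                 # NA values are not counted
nrow(toy[idx, ])           # 3: one TRUE row plus two all-NA rows

# Correction 1: which() keeps only the positions that are TRUE
nrow(toy[which(idx), ])    # 1

# Correction 2: subset() treats NA in the condition as FALSE
nrow(subset(toy, sex == "F" & age > 40))   # 1
```

An explicit alternative is idx & !is.na(idx), which turns the NA entries into FALSE before subsetting.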
Use the mol.biol column to subset the data to contain just individuals with 'BCR/ABL' or 'NEG', e.g.,
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
The mol.biol column is a factor, and retains all levels even after subsetting. How might you drop the unused factor levels?
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
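Re-creating the factor, as above, is one way to drop unused levels; base R's droplevels() is an alternative that does the same thing. A short sketch on a made-up factor (f and sub are illustrative names):

```r
# Made-up factor with three levels
f <- factor(c("BCR/ABL", "NEG", "ALL1/AF4", "NEG"))

# Subsetting keeps the full set of levels, including the unused one
sub <- f[f %in% c("BCR/ABL", "NEG")]
levels(sub)

# Either approach drops the unused level
lev_factor <- levels(factor(sub))
lev_drop <- levels(droplevels(sub))
lev_factor
lev_drop
```

droplevels() also accepts a data.frame, in which case unused levels are dropped from every factor column at once.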
The BT column is a factor describing B- and T-cell subtypes
levels(bcrabl$BT)
## [1] "B" "B1" "B2" "B3" "B4" "T" "T1" "T2" "T3" "T4"
How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T?
table(bcrabl$BT)
##
## B B1 B2 B3 B4 T T1 T2 T3 T4
## 4 9 35 22 9 4 1 15 9 2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
##
## B T
## 79 31
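The approach above works because assigning duplicate names to the levels of a factor merges the corresponding levels. A small sketch on a made-up factor (bt is an illustrative name) makes the intermediate step visible:

```r
# Made-up factor of subtypes
bt <- factor(c("B1", "B2", "T", "T2", "B", "B2"))
levels(bt)                            # "B" "B1" "B2" "T" "T2"

# First character of each level: "B" "B" "B" "T" "T"
new_levels <- substring(levels(bt), 1, 1)
new_levels

# Assigning the duplicated names merges levels that share a name
levels(bt) <- new_levels
levels(bt)
table(bt)
```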
Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups
xtabs(~ BT + mol.biol, bcrabl)
## mol.biol
## BT BCR/ABL NEG
## B 37 42
## T 0 31
Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG groups.
aggregate(age ~ mol.biol + sex, bcrabl, mean)
## mol.biol sex age
## 1 BCR/ABL F 39.93750
## 2 NEG F 30.42105
## 3 BCR/ABL M 40.50000
## 4 NEG M 27.21154
Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that the variance of ages in the two groups is identical. What parts of the test output change?
t.test(age ~ mol.biol, bcrabl)
##
## Welch Two Sample t-test
##
## data: age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL mean in group NEG
## 40.25000 28.07042
boxplot(age ~ mol.biol, bcrabl)
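The equal-variance version of the test is requested with the var.equal = TRUE argument documented on ?t.test. The sketch below compares the two versions on simulated data; the data.frame toy, its group sizes, and the simulation parameters are made up for illustration and are not the ALL data.

```r
set.seed(123)  # simulated data; the seed only makes the sketch repeatable
toy <- data.frame(
    age = c(rnorm(30, mean = 40, sd = 12), rnorm(60, mean = 28, sd = 8)),
    group = factor(rep(c("BCR/ABL", "NEG"), c(30, 60)))
)

welch <- t.test(age ~ group, toy)                     # default: Welch test
pooled <- t.test(age ~ group, toy, var.equal = TRUE)  # pooled variance

# Components that differ between the two results
welch$method;    pooled$method
welch$parameter; pooled$parameter   # pooled df is 30 + 60 - 2 = 88
welch$conf.int;  pooled$conf.int

# The group means are the same in both
welch$estimate
pooled$estimate
```

With the exercise data, the corresponding call would be t.test(age ~ mol.biol, bcrabl, var.equal = TRUE). The title of the test, the t statistic, the degrees of freedom, the p-value, and the confidence interval change; the sample estimates do not.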
Publications (General R)