SummarizedExperiment
The material in this course requires R version 3.2 and Bioconductor version 3.2
stopifnot(
getRversion() >= '3.2' && getRversion() < '3.3',
BiocInstaller::biocVersion() == "3.2"
)
Your boss has been working on Acute Lymphocytic Lukemia (ALL) for many years. One data set consists of microarray gene expression values for 12625 genes in 128 different samples. Your boss would like to analyze different subsets of the data, and has given you a couple of tab-delimited files. One file (ALLphenoData.tsv) describes the samples, the other (ALLassay.tsv) contains pre-processed gene expression data. You are supposed to come up with a way to create the subsets your boss asks you for. You realize that you could read the data in to Excel and manipulate it there, but you’re concerned about being able to do reproducible research and you’re nervous about the bookkeeping errors that always seem to come up. So you think you’ll give Bioconductor a try…
Download the ALLphenoData.tsv and ALLassay.tsv files to the current workign directory, getwd()
.
read.table()
to read ALLphenoData.tsvfname = "ALLphenoData.tsv" ## use file.choose() to find the file
pdata = read.table(fname)
Check out the help page ?read.delim
for input options, and explore basic properties of the object you’ve created, for instance…
class(pdata)
## [1] "data.frame"
colnames(pdata)
## [1] "cod" "diagnosis" "sex" "age" "BT"
## [6] "remission" "CR" "date.cr" "t.4.11." "t.9.22."
## [11] "cyto.normal" "citog" "mol.biol" "fusion.protein" "mdr"
## [16] "kinet" "ccr" "relapse" "transplant" "f.u"
## [21] "date.last.seen"
dim(pdata)
## [1] 128 21
head(pdata)
## cod diagnosis sex age BT remission CR date.cr t.4.11. t.9.22. cyto.normal citog
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE FALSE t(9;22)
## 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE FALSE simple alt.
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA NA <NA>
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE FALSE t(4;11)
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE FALSE del(6q)
## 04008 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE FALSE FALSE complex alt.
## mol.biol fusion.protein mdr kinet ccr relapse transplant f.u date.last.seen
## 01005 BCR/ABL p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
## 01010 NEG <NA> POS dyploid FALSE TRUE FALSE REL 8/28/2000
## 03002 BCR/ABL p190 NEG dyploid FALSE TRUE FALSE REL 10/15/1999
## 04006 ALL1/AF4 <NA> NEG dyploid FALSE TRUE FALSE REL 1/23/1998
## 04007 NEG <NA> NEG dyploid FALSE TRUE FALSE REL 11/4/1997
## 04008 NEG <NA> NEG hyperd. FALSE TRUE FALSE REL 12/15/1997
summary(pdata$sex)
## F M NA's
## 42 83 3
summary(pdata$cyto.normal)
## Mode FALSE TRUE NA's
## logical 69 24 35
Remind yourselves about various ways to subset and access columns of a data.frame
pdata[1:5, 3:4]
## sex age
## 01005 M 53
## 01010 M 19
## 03002 F 52
## 04006 M 38
## 04007 M 57
pdata[1:5, ]
## cod diagnosis sex age BT remission CR date.cr t.4.11. t.9.22. cyto.normal citog
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE FALSE t(9;22)
## 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE FALSE FALSE simple alt.
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA NA <NA>
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE FALSE t(4;11)
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE FALSE del(6q)
## mol.biol fusion.protein mdr kinet ccr relapse transplant f.u date.last.seen
## 01005 BCR/ABL p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
## 01010 NEG <NA> POS dyploid FALSE TRUE FALSE REL 8/28/2000
## 03002 BCR/ABL p190 NEG dyploid FALSE TRUE FALSE REL 10/15/1999
## 04006 ALL1/AF4 <NA> NEG dyploid FALSE TRUE FALSE REL 1/23/1998
## 04007 NEG <NA> NEG dyploid FALSE TRUE FALSE REL 11/4/1997
head(pdata[, 3:5])
## sex age BT
## 01005 M 53 B2
## 01010 M 19 B2
## 03002 F 52 B4
## 04006 M 38 B1
## 04007 M 57 B2
## 04008 M 17 B1
tail(pdata[, 3:5], 3)
## sex age BT
## 65003 M 30 T3
## 83001 M 29 T2
## LAL4 <NA> NA T
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
## cod diagnosis sex age BT remission CR date.cr t.4.11. t.9.22. cyto.normal citog
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE TRUE FALSE t(9;22)
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA NA NA <NA>
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE FALSE FALSE t(4;11)
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE FALSE FALSE del(6q)
## 08001 8001 1/15/1997 M 40 B2 CR CR 3/26/1997 FALSE FALSE FALSE del(p15)
## 08011 8011 8/21/1998 M 33 B3 CR CR 10/8/1998 FALSE FALSE FALSE del(p15/p16)
## mol.biol fusion.protein mdr kinet ccr relapse transplant f.u date.last.seen
## 01005 BCR/ABL p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
## 03002 BCR/ABL p190 NEG dyploid FALSE TRUE FALSE REL 10/15/1999
## 04006 ALL1/AF4 <NA> NEG dyploid FALSE TRUE FALSE REL 1/23/1998
## 04007 NEG <NA> NEG dyploid FALSE TRUE FALSE REL 11/4/1997
## 08001 BCR/ABL p190 NEG <NA> FALSE TRUE FALSE REL 7/11/1997
## 08011 BCR/ABL p190/p210 NEG dyploid FALSE FALSE TRUE BMT / DEATH IN CR <NA>
read.table()
to read the expression valuesfname <- "ALLassay.tsv"
exprs <- as.matrix(read.table(fname, check.names=FALSE))
Use dim()
to figure out the number of rows and columns in the expression data. Use subscripts to look at the first few rows and columns exprs[1:5, 1:5]
. What are the row names? Do the column names agree with the row names of the pdata
object? What is the range()
of the expression data? Can you create a histogram (hint: hist()
) of the data? What is plot(density(exprs))
? Can you use plot()
and lines()
to plot the density of each sample, in a single figure?
You could work with the matrix and data frame directly, but it is better to put these related parts of the data into a single object, a SummarizedExperiment.
Load the appropriate Bioconductor package
if (BiocInstaller::biocVersion() >= "3.2") {
library(SummarizedExperiment)
} else {
library(GenomicRanges)
}
and create a single SummarizedExperiment object from the two parts of the data. Some Bioconductor objects enhance the behavior of base R objects; an example of this is DataFrame()
se <- SummarizedExperiment(exprs, colData=DataFrame(pdata))
Explore the object, noting that you can retrieve the original elements, and can subset in a coordinated fashion.
head(colData(se))
## DataFrame with 6 rows and 21 columns
## cod diagnosis sex age BT remission CR date.cr t.4.11.
## <factor> <factor> <factor> <integer> <factor> <factor> <factor> <factor> <logical>
## 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/1997 FALSE
## 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2000 FALSE
## 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1998 NA
## 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/1997 TRUE
## 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1997 FALSE
## 04008 4008 7/30/1997 M 17 B1 CR CR 9/27/1997 FALSE
## t.9.22. cyto.normal citog mol.biol fusion.protein mdr kinet ccr
## <logical> <logical> <factor> <factor> <factor> <factor> <factor> <logical>
## 01005 TRUE FALSE t(9;22) BCR/ABL p210 NEG dyploid FALSE
## 01010 FALSE FALSE simple alt. NEG NA POS dyploid FALSE
## 03002 NA NA NA BCR/ABL p190 NEG dyploid FALSE
## 04006 FALSE FALSE t(4;11) ALL1/AF4 NA NEG dyploid FALSE
## 04007 FALSE FALSE del(6q) NEG NA NEG dyploid FALSE
## 04008 FALSE FALSE complex alt. NEG NA NEG hyperd. FALSE
## relapse transplant f.u date.last.seen
## <logical> <logical> <factor> <factor>
## 01005 FALSE TRUE BMT / DEATH IN CR NA
## 01010 TRUE FALSE REL 8/28/2000
## 03002 TRUE FALSE REL 10/15/1999
## 04006 TRUE FALSE REL 1/23/1998
## 04007 TRUE FALSE REL 11/4/1997
## 04008 TRUE FALSE REL 12/15/1997
assay(se)[1:5, 1:5]
## 01005 01010 03002 04006 04007
## 1000_at 7.597323 7.479445 7.567593 7.384684 7.905312
## 1001_at 5.046194 4.932537 4.799294 4.922627 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
## 1004_at 5.925260 5.912780 5.893209 6.170245 5.615210
se$sex %in% "M"
## [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [31] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [46] TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [61] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [76] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
## [91] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
males <- se[,se$sex %in% "M"]
males
## class: SummarizedExperiment0
## dim: 12625 83
## metadata(0):
## assays(1): ''
## rownames(12625): 1000_at 1001_at ... AFFX-YEL021w/URA3_at AFFX-YEL024w/RIP1_at
## metadata column names(0):
## colnames(83): 01005 01010 ... 65003 83001
## colData names(21): cod diagnosis ... f.u date.last.seen
assay(males)[1:5, 1:5]
## 01005 01010 04006 04007 04008
## 1000_at 7.597323 7.479445 7.384684 7.905312 7.065914
## 1001_at 5.046194 4.932537 4.922627 4.844565 5.147762
## 1002_f_at 3.900466 4.208155 4.206798 3.416923 3.945869
## 1003_s_at 5.903856 6.169024 6.116890 5.687997 6.208061
## 1004_at 5.925260 5.912780 6.170245 5.615210 5.923487
Use vignette("SummarizedExperiment")
to read about other operations on SummarizedExperiment.
Quickly create the following subsets of data for your boss:
All women in the study.
All women over 40
An object bcrabl
containing individuals with mol.biol
belonging either to “BCR/ABL” or “NEG”.
Can you…?
Create a new column that simplifies the BT
column (which lists different B- and T-cell subtypes) to contain just B
or T
, e.g., re-coding B, B1, B2, B3 and B4 to simply B
, and likewise for T
?
Use aggregate()
to calculate the average age of males and females in the BCR/ABL and NEG treatment groups?
Use t.test()
to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot()
. In both cases, use the formula
interface. Consult the help page ?t.test
and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?
Summarize the exercises above in a simple script. Can you figure out how to write a ‘markdown’ document that includes R code chunks, as well as text describing what you did, and figures and tables showing the results?
Acknowledgements
Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum.
The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.
sessionInfo()
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux stretch/sid
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ALL_1.11.0 org.Hs.eg.db_3.2.3
## [3] RSQLite_1.0.0 DBI_0.3.1
## [5] ggplot2_1.0.1 airway_0.103.1
## [7] limma_3.25.18 DESeq2_1.9.51
## [9] RcppArmadillo_0.6.100.0.0 Rcpp_0.12.1
## [11] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.37.6
## [13] rtracklayer_1.29.28 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [15] GenomicFeatures_1.21.33 AnnotationDbi_1.31.19
## [17] SummarizedExperiment_0.3.11 Biobase_2.29.1
## [19] GenomicRanges_1.21.32 GenomeInfoDb_1.5.16
## [21] microbenchmark_1.4-2 Biostrings_2.37.8
## [23] XVector_0.9.4 IRanges_2.3.26
## [25] S4Vectors_0.7.23 BiocGenerics_0.15.11
## [27] BiocStyle_1.7.9
##
## loaded via a namespace (and not attached):
## [1] splines_3.2.2 Formula_1.2-1 latticeExtra_0.6-26
## [4] Rsamtools_1.21.21 yaml_2.1.13 lattice_0.20-33
## [7] digest_0.6.8 RColorBrewer_1.1-2 colorspace_1.2-6
## [10] sandwich_2.3-4 htmltools_0.2.6 plyr_1.8.3
## [13] XML_3.98-1.3 biomaRt_2.25.3 genefilter_1.51.1
## [16] zlibbioc_1.15.0 xtable_1.7-4 mvtnorm_1.0-3
## [19] scales_0.3.0 BiocParallel_1.3.54 annotate_1.47.4
## [22] TH.data_1.0-6 nnet_7.3-11 proto_0.3-10
## [25] survival_2.38-3 magrittr_1.5 evaluate_0.8
## [28] MASS_7.3-44 foreign_0.8-66 BiocInstaller_1.19.14
## [31] tools_3.2.2 formatR_1.2.1 multcomp_1.4-1
## [34] stringr_1.0.0 munsell_0.4.2 locfit_1.5-9.1
## [37] cluster_2.0.3 lambda.r_1.1.7 futile.logger_1.4.1
## [40] grid_3.2.2 RCurl_1.95-4.7 labeling_0.3
## [43] bitops_1.0-6 rmarkdown_0.8.1 gtable_0.1.2
## [46] codetools_0.2-14 reshape2_1.4.1 GenomicAlignments_1.5.18
## [49] gridExtra_2.0.0 zoo_1.7-12 knitr_1.11
## [52] Hmisc_3.17-0 futile.options_1.0.0 stringi_0.5-5
## [55] geneplotter_1.47.0 rpart_4.1-10 acepack_1.3-3.3