Your boss has been working on Acute Lymphocytic Lukemia (ALL) for many years. One data set consists of microarray gene expression values for 12625 genes in 128 different samples. Your boss would like to analyze different subsets of the data, and has given you a couple of tab-delimited files. One file (ALLphenoData.tsv) describes the samples, the other (ALLassay.tsv) contains pre-processed gene expression data. You are supposed to come up with a way to create the subsets your boss asks you for. You realize that you could read the data in to Excel and manipulate it there, but you’re concerned about being able to do reproducible research and you’re nervous about the bookkeeping errors that always seem to come up. So you think you’ll give Bioconductor a try…

1 Read the data in to R

Download the ALLphenoData.tsv and ALLassay.tsv files to the current workign directory, getwd().

1.1 Use `read.table()` to read ALLphenoData.tsv

fname = "ALLphenoData.tsv"   ## use file.choose() to find the file
pdata = read.table(fname)

Check out the help page ?read.delim for input options, and explore basic properties of the object you’ve created, for instance…

class(pdata)

## [1] "data.frame"

colnames(pdata)

##  [1] "cod"            "diagnosis"      "sex"            "age"            "BT"            
##  [6] "remission"      "CR"             "date.cr"        "t.4.11."        "t.9.22."       
## [11] "cyto.normal"    "citog"          "mol.biol"       "fusion.protein" "mdr"           
## [16] "kinet"          "ccr"            "relapse"        "transplant"     "f.u"           
## [21] "date.last.seen"

dim(pdata)

## [1] 128  21

head(pdata)

##        cod diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal        citog
## 01005 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE      t(9;22)
## 01010 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE       FALSE  simple alt.
## 03002 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA         <NA>
## 04006 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE      t(4;11)
## 04007 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE      del(6q)
## 04008 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE       FALSE complex alt.
##       mol.biol fusion.protein mdr   kinet   ccr relapse transplant               f.u date.last.seen
## 01005  BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 01010      NEG           <NA> POS dyploid FALSE    TRUE      FALSE               REL      8/28/2000
## 03002  BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE               REL     10/15/1999
## 04006 ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      1/23/1998
## 04007      NEG           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      11/4/1997
## 04008      NEG           <NA> NEG hyperd. FALSE    TRUE      FALSE               REL     12/15/1997

summary(pdata$sex)

##    F    M NA's 
##   42   83    3

summary(pdata$cyto.normal)

##    Mode   FALSE    TRUE    NA's 
## logical      69      24      35

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]

##       sex age
## 01005   M  53
## 01010   M  19
## 03002   F  52
## 04006   M  38
## 04007   M  57

pdata[1:5, ]

##        cod diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal       citog
## 01005 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE     t(9;22)
## 01010 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE       FALSE simple alt.
## 03002 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA        <NA>
## 04006 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE     t(4;11)
## 04007 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE     del(6q)
##       mol.biol fusion.protein mdr   kinet   ccr relapse transplant               f.u date.last.seen
## 01005  BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 01010      NEG           <NA> POS dyploid FALSE    TRUE      FALSE               REL      8/28/2000
## 03002  BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE               REL     10/15/1999
## 04006 ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      1/23/1998
## 04007      NEG           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      11/4/1997

head(pdata[, 3:5])

##       sex age BT
## 01005   M  53 B2
## 01010   M  19 B2
## 03002   F  52 B4
## 04006   M  38 B1
## 04007   M  57 B2
## 04008   M  17 B1

tail(pdata[, 3:5], 3)

##        sex age BT
## 65003    M  30 T3
## 83001    M  29 T2
## LAL4  <NA>  NA  T

head(pdata$age)

## [1] 53 19 52 38 57 17

head(pdata$sex)

## [1] M M F M M M
## Levels: F M

head(pdata[pdata$age > 21,])

##        cod diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal        citog
## 01005 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE      t(9;22)
## 03002 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA         <NA>
## 04006 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE      t(4;11)
## 04007 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE      del(6q)
## 08001 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE       FALSE     del(p15)
## 08011 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE       FALSE del(p15/p16)
##       mol.biol fusion.protein mdr   kinet   ccr relapse transplant               f.u date.last.seen
## 01005  BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 03002  BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE               REL     10/15/1999
## 04006 ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      1/23/1998
## 04007      NEG           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      11/4/1997
## 08001  BCR/ABL           p190 NEG    <NA> FALSE    TRUE      FALSE               REL      7/11/1997
## 08011  BCR/ABL      p190/p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>

1.2 Use `read.table()` to read the expression values

fname <- "ALLassay.tsv"
exprs <- as.matrix(read.table(fname, check.names=FALSE))

Use dim() to figure out the number of rows and columns in the expression data. Use subscripts to look at the first few rows and columns exprs[1:5, 1:5]. What are the row names? Do the column names agree with the row names of the pdata object? What is the range() of the expression data? Can you create a histogram (hint: hist()) of the data? What is plot(density(exprs))? Can you use plot() and lines() to plot the density of each sample, in a single figure?

2 Make a SummarizedExperiment object

You could work with the matrix and data frame directly, but it is better to put these related parts of the data into a single object, a SummarizedExperiment.

Load the appropriate Bioconductor package

if (BiocInstaller::biocVersion() >= "3.2") {
    library(SummarizedExperiment)
} else {
    library(GenomicRanges)
}

and create a single SummarizedExperiment object from the two parts of the data. Some Bioconductor objects enhance the behavior of base R objects; an example of this is DataFrame()

se <- SummarizedExperiment(exprs, colData=DataFrame(pdata))

Explore the object, noting that you can retrieve the original elements, and can subset in a coordinated fashion.

head(colData(se))

## DataFrame with 6 rows and 21 columns
##            cod diagnosis      sex       age       BT remission       CR   date.cr   t.4.11.
##       <factor>  <factor> <factor> <integer> <factor>  <factor> <factor>  <factor> <logical>
## 01005     1005 5/21/1997        M        53       B2        CR       CR  8/6/1997     FALSE
## 01010     1010 3/29/2000        M        19       B2        CR       CR 6/27/2000     FALSE
## 03002     3002 6/24/1998        F        52       B4        CR       CR 8/17/1998        NA
## 04006     4006 7/17/1997        M        38       B1        CR       CR  9/8/1997      TRUE
## 04007     4007 7/22/1997        M        57       B2        CR       CR 9/17/1997     FALSE
## 04008     4008 7/30/1997        M        17       B1        CR       CR 9/27/1997     FALSE
##         t.9.22. cyto.normal        citog mol.biol fusion.protein      mdr    kinet       ccr
##       <logical>   <logical>     <factor> <factor>       <factor> <factor> <factor> <logical>
## 01005      TRUE       FALSE      t(9;22)  BCR/ABL           p210      NEG  dyploid     FALSE
## 01010     FALSE       FALSE  simple alt.      NEG             NA      POS  dyploid     FALSE
## 03002        NA          NA           NA  BCR/ABL           p190      NEG  dyploid     FALSE
## 04006     FALSE       FALSE      t(4;11) ALL1/AF4             NA      NEG  dyploid     FALSE
## 04007     FALSE       FALSE      del(6q)      NEG             NA      NEG  dyploid     FALSE
## 04008     FALSE       FALSE complex alt.      NEG             NA      NEG  hyperd.     FALSE
##         relapse transplant               f.u date.last.seen
##       <logical>  <logical>          <factor>       <factor>
## 01005     FALSE       TRUE BMT / DEATH IN CR             NA
## 01010      TRUE      FALSE               REL      8/28/2000
## 03002      TRUE      FALSE               REL     10/15/1999
## 04006      TRUE      FALSE               REL      1/23/1998
## 04007      TRUE      FALSE               REL      11/4/1997
## 04008      TRUE      FALSE               REL     12/15/1997

assay(se)[1:5, 1:5]

##              01005    01010    03002    04006    04007
## 1000_at   7.597323 7.479445 7.567593 7.384684 7.905312
## 1001_at   5.046194 4.932537 4.799294 4.922627 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 4.206798 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 6.116890 5.687997
## 1004_at   5.925260 5.912780 5.893209 6.170245 5.615210

se$sex %in% "M"

##   [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [16] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [31] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
##  [46]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
##  [61]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE
##  [76]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
##  [91]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [106]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

males <- se[,se$sex %in% "M"]
males

## class: SummarizedExperiment0 
## dim: 12625 83 
## metadata(0):
## assays(1): ''
## rownames(12625): 1000_at 1001_at ... AFFX-YEL021w/URA3_at AFFX-YEL024w/RIP1_at
## metadata column names(0):
## colnames(83): 01005 01010 ... 65003 83001
## colData names(21): cod diagnosis ... f.u date.last.seen

assay(males)[1:5, 1:5]

##              01005    01010    04006    04007    04008
## 1000_at   7.597323 7.479445 7.384684 7.905312 7.065914
## 1001_at   5.046194 4.932537 4.922627 4.844565 5.147762
## 1002_f_at 3.900466 4.208155 4.206798 3.416923 3.945869
## 1003_s_at 5.903856 6.169024 6.116890 5.687997 6.208061
## 1004_at   5.925260 5.912780 6.170245 5.615210 5.923487

Use vignette("SummarizedExperiment") to read about other operations on SummarizedExperiment.

3 Show off your skills

Quickly create the following subsets of data for your boss:

All women in the study.
All women over 40
An object bcrabl containing individuals with mol.biol belonging either to “BCR/ABL” or “NEG”.

Can you…?

Create a new column that simplifies the BT column (which lists different B- and T-cell subtypes) to contain just B or T, e.g., re-coding B, B1, B2, B3 and B4 to simply B, and likewise for T?
Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups?
Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

4 Document your work

Summarize the exercises above in a simple script. Can you figure out how to write a ‘markdown’ document that includes R code chunks, as well as text describing what you did, and figures and tables showing the results?

5 Resources

Acknowledgements

Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum.
The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

5.1 `sessionInfo()`

sessionInfo()

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux stretch/sid
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ALL_1.11.0                              org.Hs.eg.db_3.2.3                     
##  [3] RSQLite_1.0.0                           DBI_0.3.1                              
##  [5] ggplot2_1.0.1                           airway_0.103.1                         
##  [7] limma_3.25.18                           DESeq2_1.9.51                          
##  [9] RcppArmadillo_0.6.100.0.0               Rcpp_0.12.1                            
## [11] BSgenome.Hsapiens.UCSC.hg19_1.4.0       BSgenome_1.37.6                        
## [13] rtracklayer_1.29.28                     TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [15] GenomicFeatures_1.21.33                 AnnotationDbi_1.31.19                  
## [17] SummarizedExperiment_0.3.11             Biobase_2.29.1                         
## [19] GenomicRanges_1.21.32                   GenomeInfoDb_1.5.16                    
## [21] microbenchmark_1.4-2                    Biostrings_2.37.8                      
## [23] XVector_0.9.4                           IRanges_2.3.26                         
## [25] S4Vectors_0.7.23                        BiocGenerics_0.15.11                   
## [27] BiocStyle_1.7.9                        
## 
## loaded via a namespace (and not attached):
##  [1] splines_3.2.2            Formula_1.2-1            latticeExtra_0.6-26     
##  [4] Rsamtools_1.21.21        yaml_2.1.13              lattice_0.20-33         
##  [7] digest_0.6.8             RColorBrewer_1.1-2       colorspace_1.2-6        
## [10] sandwich_2.3-4           htmltools_0.2.6          plyr_1.8.3              
## [13] XML_3.98-1.3             biomaRt_2.25.3           genefilter_1.51.1       
## [16] zlibbioc_1.15.0          xtable_1.7-4             mvtnorm_1.0-3           
## [19] scales_0.3.0             BiocParallel_1.3.54      annotate_1.47.4         
## [22] TH.data_1.0-6            nnet_7.3-11              proto_0.3-10            
## [25] survival_2.38-3          magrittr_1.5             evaluate_0.8            
## [28] MASS_7.3-44              foreign_0.8-66           BiocInstaller_1.19.14   
## [31] tools_3.2.2              formatR_1.2.1            multcomp_1.4-1          
## [34] stringr_1.0.0            munsell_0.4.2            locfit_1.5-9.1          
## [37] cluster_2.0.3            lambda.r_1.1.7           futile.logger_1.4.1     
## [40] grid_3.2.2               RCurl_1.95-4.7           labeling_0.3            
## [43] bitops_1.0-6             rmarkdown_0.8.1          gtable_0.1.2            
## [46] codetools_0.2-14         reshape2_1.4.1           GenomicAlignments_1.5.18
## [49] gridExtra_2.0.0          zoo_1.7-12               knitr_1.11              
## [52] Hmisc_3.17-0             futile.options_1.0.0     stringi_0.5-5           
## [55] geneplotter_1.47.0       rpart_4.1-10             acepack_1.3-3.3

Working With Data: `SummarizedExperiment`

Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015

Contents

1 Read the data in to R

1.1 Use `read.table()` to read ALLphenoData.tsv

1.2 Use `read.table()` to read the expression values

2 Make a SummarizedExperiment object

3 Show off your skills

4 Document your work

5 Resources

5.1 `sessionInfo()`

Working With Data: SummarizedExperiment

Martin Morgan (martin.morgan@roswellpark.org) Roswell Park Cancer Institute, Buffalo, NY 5 - 9 October, 2015

Contents

1 Read the data in to R

1.1 Use read.table() to read ALLphenoData.tsv

1.2 Use read.table() to read the expression values

2 Make a SummarizedExperiment object

3 Show off your skills

4 Document your work

5 Resources

5.1 sessionInfo()

Working With Data: `SummarizedExperiment`

Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015

1.1 Use `read.table()` to read ALLphenoData.tsv

1.2 Use `read.table()` to read the expression values

5.1 `sessionInfo()`