---
title: "2. Working With Data: `SummarizedExperiment`"
author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{2. Working With Data: SummarizedExperiment}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(ALL)
library(airway)
})
```
The material in this course requires R version 3.2 and Bioconductor
version 3.2
```{r configure-test}
stopifnot(
getRversion() >= '3.2' && getRversion() < '3.3',
BiocInstaller::biocVersion() == "3.2"
)
```
Your boss has been working on Acute Lymphocytic Lukemia (ALL) for many
years. One data set consists of microarray gene expression values for
12625 genes in 128 different samples. Your boss would like to analyze
different subsets of the data, and has given you a couple of
tab-delimited files. One file (_ALLphenoData.tsv_) describes the samples,
the other (_ALLassay.tsv_) contains pre-processed gene expression
data. You are supposed to come up with a way to create the subsets
your boss asks you for. You realize that you could read the data in to
Excel and manipulate it there, but you're concerned about being able
to do reproducible research and you're nervous about the bookkeeping
errors that always seem to come up. So you think you'll give
_Bioconductor_ a try...
# Read the data in to _R_
Download the [ALLphenoData.tsv][] and [ALLassay.tsv][] files to the
current workign directory, `getwd()`.
## Use `read.table()` to read _ALLphenoData.tsv_
```{r read.table}
fname = "ALLphenoData.tsv" ## use file.choose() to find the file
pdata = read.table(fname)
```
Check out the help page `?read.delim` for input options, and explore
basic properties of the object you've created, for instance...
```{r ALL-properties}
class(pdata)
colnames(pdata)
dim(pdata)
head(pdata)
summary(pdata$sex)
summary(pdata$cyto.normal)
```
Remind yourselves about various ways to subset and access columns of a
data.frame
```{r ALL-subset}
pdata[1:5, 3:4]
pdata[1:5, ]
head(pdata[, 3:5])
tail(pdata[, 3:5], 3)
head(pdata$age)
head(pdata$sex)
head(pdata[pdata$age > 21,])
```
## Use `read.table()` to read the expression values
```{r exprs}
fname <- "ALLassay.tsv"
exprs <- as.matrix(read.table(fname, check.names=FALSE))
```
Use `dim()` to figure out the number of rows and columns in the
expression data. Use subscripts to look at the first few rows and
columns `exprs[1:5, 1:5]`. What are the row names? Do the column names
agree with the row names of the `pdata` object? What is the `range()`
of the expression data? Can you create a histogram (hint: `hist()`) of
the data? What is `plot(density(exprs))`? Can you use `plot()` and
`lines()` to plot the density of each sample, in a single figure?
# Make a _SummarizedExperiment_ object
You could work with the matrix and data frame directly, but it is
better to put these related parts of the data into a single object, a
_SummarizedExperiment_.
Load the appropriate _Bioconductor_ package
```{r SummarizedExperiment}
if (BiocInstaller::biocVersion() >= "3.2") {
library(SummarizedExperiment)
} else {
library(GenomicRanges)
}
```
and create a single _SummarizedExperiment_ object from the two parts
of the data. Some _Bioconductor_ objects enhance the behavior of base
_R_ objects; an example of this is `DataFrame()`
```{r make-SE}
se <- SummarizedExperiment(exprs, colData=DataFrame(pdata))
```
Explore the object, noting that you can retrieve the original
elements, and can subset in a coordinated fashion.
```{r se-ops}
head(colData(se))
assay(se)[1:5, 1:5]
se$sex %in% "M"
males <- se[,se$sex %in% "M"]
males
assay(males)[1:5, 1:5]
```
Use `vignette("SummarizedExperiment")` to read about other operations
on _SummarizedExperiment_.
# Show off your skills
Quickly create the following subsets of data for your boss:
1. All women in the study.
2. All women over 40
3. An object `bcrabl` containing individuals with `mol.biol` belonging
either to "BCR/ABL" or "NEG".
Can you...?
1. Create a new column that simplifies the `BT` column (which lists
different B- and T-cell subtypes) to contain just `B` or `T`, e.g.,
re-coding B, B1, B2, B3 and B4 to simply `B`, and likewise for `T`?
2. Use `aggregate()` to calculate the average age of males and females
in the BCR/ABL and NEG treatment groups?
3. Use `t.test()` to compare the age of individuals in the BCR/ABL
versus NEG groups; visualize the results using `boxplot()`. In both
cases, use the `formula` interface. Consult the help page `?t.test`
and re-do the test assuming that variance of ages in the two groups
is identical. What parts of the test output change?
# Document your work
Summarize the exercises above in a simple script. Can you figure out
how to write a 'markdown' document that includes R code chunks, as
well as text describing what you did, and figures and tables showing
the results?
# Resources
Acknowledgements
- Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester,
Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan
Tenenbaum.
- The research reported in this presentation was supported by the
National Cancer Institute and the National Human Genome Research
Institute of the National Institutes of Health under Award numbers
U24CA180996 and U41HG004059, and the National Science Foundation
under Award number 1247813. The content is solely the responsibility
of the authors and does not necessarily represent the official views
of the National Institutes of Health or the National Science
Foundation.
## `sessionInfo()`
```{r sessionInfo}
sessionInfo()
```
[ALLphenoData.tsv]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/ALLphenoData.tsv
[ALLassay.tsv]: https://raw.githubusercontent.com/Bioconductor/BiocUruguay2015/master/vignettes/ALLassay.tsv