---
title: "1. Introduction to _Bioconductor_"
author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
5 - 9 October, 2015"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{1. Introduction to Bioconductor}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(airway)
library(DESeq2)
library(ggplot2)
library(org.Hs.eg.db)
})
```
The material in this course requires R version 3.2 and Bioconductor
version 3.2
```{r configure-test}
stopifnot(
getRversion() >= '3.2' && getRversion() < '3.3',
BiocInstaller::biocVersion() == "3.2"
)
```
# What is _Bioconductor_?
Physically
- Collection of 1024 R packages.
- Web site (http://bioconductor.org) for package distribution and
other resources.
- Support site (https://support.bioconductor.org) for user questions.
Conceptually
- Analysis and comprehension of high throughput genomic data
# Core principles
## High-throughput analysis needs statistics!
Volume of data
Type of research question
- Designed experiments
- Population samples
- ...
Technological artifacts
- Differences in sequencing depth between samples
- Bias in the genomic regions sampled
## Scientific research needs to be reproducible
### A motivating case study
- Cisplatin-resistant non-small-cell lung cancer gene sets
- Hsu et al. 2007 J Clin Oncol 25:
[4350-4357](http://jco.ascopubs.org/content/25/28/4350.abstract)
[retracted](http://jco.ascopubs.org/content/28/35/5229.long)
![](our_figures/HsuEtAl-F1-large-a.jpg)
- Baggerly & Coombes 2009 Ann Appl Stat
[3: 1309-1334](http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942)
![](our_figures/BaggerlyCoombes2009-fig2a.jpg)
Lessons
- Record each step of the analysis
- Coordinated manipulation of feature, sample, and assay data
- Informative labels on visualizations
### How to be reproducible?
- Use software 'objects' that take care of some of the tedious
book-keeping
- Document our analysis in scripts and 'markdown' documents
### Example: `SummarizedExperiment`
![](our_figures/SE_Description.png)
Underlying data is a matrix
- Regions of interest (e.g., genes) x samples
- `assay()` -- e.g., matrix of counts of reads overlapping genes
Include information about rows
- `rowRanges()` -- gene identifiers, or _genomic ranges_ describing
the coordinates of each gene
Include information about columns
- `colData()` -- describing samples, experimental design, ...
```{r airway-SummarizedExperiment}
library(airway) # An 'ExperimentData' package...
data(airway) # ...with a sample data set...
airway # ...that is a SummarizedExperiment
head(assay(airway)) # contains a matrix of counts
head(rowRanges(airway)) # information about the genes...
colData(airway)[, 1:3] # ...and samples
## coordinated subsetting
untrt <- airway[, airway$dex == 'untrt']
head(assay(untrt))
colData(untrt)[, 1:3]
```
## We can 'stand on the shoulders of giants'
Packages!
- [biocViews](http://bioconductor.org/packages/)
- Vignettes
- Workflows
- Course material and tutorials
## We should explore our data
Visualization
Inter-operability between packages
- Made easier by using similar data structures
Examples (details later)
- `SummarizedExperiment`
- `DNAStringSet`
- `GenomicRanges`
## Comprehension is more than statistical analysis
Annotation
- Mapping from technical to user-friendly identifiers
- Assigning genes to pathways
- Placing our results in the context of large-scale analyses
Case studies
- e.g., [airway][]!
# _Bioconductor_'s role in sequence analysis
## Overall work flow
### General steps
1. Experimental design
- Keep it simple!
- Replication!
- Avoid or track batch effects
2. Wet-lab preparation
3. High-throughput sequencing
- Output: FASTQ files of reads and their quality scores
@ERR127302.1703 HWI-EAS350_0441:1:1:1460:19184#0/1
CCTGAGTGAAGCTGATCTTGATCTACGAAGAGAGATAGATCTTGATCGTCGAGGAGATGCTGACCTTGACCT
+
HHGHHGHHHHHHHHDGG>CE?=896=:
@ERR127302.1704 HWI-EAS350_0441:1:1:1460:16861#0/1
GCGGTATGCTGGAAGGTGCTCGAATGGAGAGCGCCAGCGCCCCGGCGCTGAGCCGCAGCCTCAGGTCCGCCC
+
DE?DD>ED4>EEE>DE8EEEDE8B?EB<@3;BA79?,881B?@73;1?########################
4. Alignment
- Many different aligners, some specialized for different purposes
- Output: BAM files of aligned reads
ERR127306.7941162 403 chr14 19653689 3 72M = 19652348 -1413 ...
ERR127306.22648137 145 chr14 19653692 1 72M = 19650044 -3720 ...
... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****
... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:2 CC:Z:chr22 CP:i:16189276 HI:i:0
... AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:72 YT:Z:UU NH:i:3 CC:Z:= CP:i:19921600 HI:i:0
5. Summary
- e.g., RNA-Seq: _count_ of reads overlapping regions of interest
(e.g., genes)
- e.g., ChIP-Seq: _ranges_ where regulatory elements bind
- Output: '.csv', BED, or WIG files
6. Statistical analysis
7. Comprehension
### An example: RNA-Seq differential expression of known genes
More detail later!
Example: 'airway' data set used in a later lab
- Airway smooth muscle cells treated with dexamethasone, a synthetic
glucocorticoid steroid with anti-inflammatory
effects. Glucocorticoids are used in asthma patients to prevent or
reduce inflammation of the airways.
- Four primary human airway smooth muscle cell lines
- Each cell line: a control sample and a treated sample. Treatment: 1
micromolar dexamethasone for 18 hours.
- Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker
RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R
Jr, Tantisira KG, Weiss ST, Lu Q. "RNA-Seq Transcriptome Profiling
Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that
Modulates Cytokine Function in Airway Smooth Muscle Cells." PLoS
One. 2014 Jun 13;9(6):e99625. PMID:
[24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665). GEO:
[GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778).
Steps
1. Experimental design
- One covariate: cell line
- One experimental factor with two levels: control, and treated
with dexamethasone
```{r airway-colData}
library(airway) # An 'ExperimentData' package...
data(airway) # ...with a sample data set...
colData(airway)[, 1:3] # ...represented as a SummarizedExperiment
```
2. Wet-lab preparation
3. High-throughput sequencing
- Paired-end reads
- Output: FASTQ files
4. Alignment
- [STAR](https://code.google.com/p/rna-star/) aligner
- Aligned to Ensembl release 75 of the human reference genome
- Output: BAM files
5. Summarization
- `GenomicRanges::summarizeOverlaps()`
- Output: matrix of the _count_ of reads overlapping regions of
interest. Each row is a gene. Each column is a sample.
```{r airway-assay}
head(assay(airway))
```
6. Statistical analysis
- Test each gene for statistical difference between control and
treatement group
- Output: _top table_ of differentially expressed genes. For each
gene: 'log fold change' describing how large a change occurred,
and a test statistic (e.g., adjusted p-value) summarizing
statistical evidence for the change
```{r airway-toptable}
library(DESeq2) # package implementing statistical methods
dds <- # data and experimental design
DESeqDataSet(airway, design = ~ cell + dex)
dds <- DESeq(dds) # initial analysis
res <- results(dds) # summary results
ridx <- # order from largest to smallest absolute log fold change
order(abs(res$log2FoldChange), decreasing=TRUE)
res <- res[ridx,]
head(res) # top-table
```
7. Comprehension
- Visualization
```{r airway-viz}
library(ggplot2)
ggplot(as.data.frame(res),
aes(x=log2FoldChange, y=-10 * log10(pvalue))) +
geom_point()
```
- From Ensembl gene identifiers to gene symbols, pathways, ...
```{r airway-mapids}
library(org.Hs.eg.db)
ensid <- head(rownames(res))
select(org.Hs.eg.db, ensid, c("SYMBOL", "GENENAME"), "ENSEMBL")
```
## Bioinformatic steps and _Bioconductor_ packages
![Alt Sequencing Ecosystem](our_figures/SequencingEcosystem.png)
## Two _shiny_ examples
BAMSpector -- display gene models and underlying support across BAM
(aligned read) files
```{r shiny-BAMSpector, eval=FALSE}
app <- system.file(package="BiocUruguay2015", "BAMSpector")
shiny::runApp(app)
```
MAPlotExplorer -- summarize two-group differential expression,
including drill-down of individual genes. Based on [CSAMA
2015](http://github.com/CSAMA2015/materials/labs/4_Thursday/Interactive_data_visualization_with_Shiny/)
lab by Andrzej Oles.
```{r shiny-MAPlotExplorer, eval=FALSE}
app <- system.file(package="BiocUruguay2015", "MAPlotExplorer")
shiny::runApp(app)
```
Some uses illustrated by these applications
- Working with large data -- [GenomicAlignments][], [GenomicFiles][]
- Identifier mapping -- [AnnotationDbi][]
- Representing gene structure -- [GenomicFeatures][]
- Annotation resources -- [org.Hs.eg.db][],
[TxDb.Hsapiens.UCSC.hg19.knownGene][], [AnnotationHub][]
- Statistical analysis of differential expression -- [DESeq2][]
- Experiment data packages -- [RNAseqData.HNRNPC.bam.chr14][],
[airway][]
- Visualization -- [Gviz][], [shiny][]
# Resources
Acknowledgements
- Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester,
Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan
Tenenbaum.
- The research reported in this presentation was supported by the
National Cancer Institute and the National Human Genome Research
Institute of the National Institutes of Health under Award numbers
U24CA180996 and U41HG004059, and the National Science Foundation
under Award number 1247813. The content is solely the responsibility
of the authors and does not necessarily represent the official views
of the National Institutes of Health or the National Science
Foundation.
## Key references
- Irizarry R, et al. (2015) Biomedical Data
Science. Course Notes, EdX PH525.1x.
- Huber W, et al. (2015) Orchestrating
high-throughput genomic analysis with
Bioconductor. Nature Methods 12:115-121;
doi:10.1038/nmeth.3252 (full-text free with registration).
- Lawrence M, Huber W, Pagès;s H, Aboyoun P, Carlson M, et al. (2013) Software for
Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8):
e1003118. doi: 10.1371/journal.pcbi.1003118
## `sessionInfo()`
```{r sessionInfo}
sessionInfo()
```
[GenomicAlignments]: http://bioconductor.org/packages/GenomicAlignments
[GenomicFiles]: http://bioconductor.org/packages/GenomicFiles
[AnnotationDbi]: http://bioconductor.org/packages/AnnotationDbi
[GenomicFeatures]: http://bioconductor.org/packages/GenomicFeatures
[org.Hs.eg.db]: http://bioconductor.org/packages/org.Hs.eg.db
[TxDb.Hsapiens.UCSC.hg19.knownGene]: http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg19.knownGene
[AnnotationHub]: http://bioconductor.org/packages/AnnotationHub
[DESeq2]: http://bioconductor.org/packages/DESeq2
[RNAseqData.HNRNPC.bam.chr14]: http://bioconductor.org/packages/RNAseqData.HNRNPC.bam.chr14
[airway]: http://bioconductor.org/packages/airway
[Gviz]: http://bioconductor.org/packages/Gviz
[shiny]: http://bioconductor.org/packages/shiny