---
title: "2. Case Studies"
author: "Martin Morgan (martin.morgan@roswellpark.org)
Roswell Park Cancer Institute, Buffalo, NY
19 October, 2015"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
vignette: >
% \VignetteIndexEntry{2. Case Studies}
% \VignetteEngine{knitr::rmarkdown}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r packages, eval=TRUE, echo=FALSE, warning=FALSE, message=FALSE}
suppressPackageStartupMessages({
library(BiocEMBO2015)
})
```
Version: `r packageDescription("BiocEMBO2015")$Version`
Compiled: `r date()`
# _R_ data manipulation
This case study servers as a refresher / tutorial on basic input and
manipulation of data.
Input a file that contains ALL (acute lymphoblastic leukemia) patient
information
```{r echo=TRUE, eval=FALSE}
fname <- file.choose() ## "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)
```
```{r echo=FALSE}
fname <- "ALLphenoData.tsv"
stopifnot(file.exists(fname))
pdata <- read.delim(fname)
```
Check out the help page `?read.delim` for input options, and explore
basic properties of the object you've created, for instance...
```{r ALL-properties}
class(pdata)
colnames(pdata)
dim(pdata)
head(pdata)
summary(pdata$sex)
summary(pdata$cyto.normal)
```
Remind yourselves about various ways to subset and access columns of a
data.frame
```{r ALL-subset}
pdata[1:5, 3:4]
pdata[1:5, ]
head(pdata[, 3:5])
tail(pdata[, 3:5], 3)
head(pdata$age)
head(pdata$sex)
head(pdata[pdata$age > 21,])
```
It seems from below that there are 17 females over 40 in the data set,
but when sub-setting `pdata` to contain just those individuals 19 rows
are selected. Why? What can we do to correct this?
```{r ALL-subset-NA}
idx <- pdata$sex == "F" & pdata$age > 40
table(idx)
dim(pdata[idx,])
```
Use the `mol.biol` column to subset the data to contain just
individuals with 'BCR/ABL' or 'NEG', e.g.,
```{r ALL-BCR/ABL-subset}
bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]
```
The `mol.biol` column is a factor, and retains all levels even after
subsetting. How might you drop the unused factor levels?
```{r ALL-BCR/ABL-drop-unused}
bcrabl$mol.biol <- factor(bcrabl$mol.biol)
```
The `BT` column is a factor describing B- and T-cell subtypes
```{r ALL-BT}
levels(bcrabl$BT)
```
How might one collapse B1, B2, ... to a single type B, and likewise
for T1, T2, ..., so there are only two subtypes, B and T
```{r ALL-BT-recode}
table(bcrabl$BT)
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
```
Use `xtabs()` (cross-tabulation) to count the number of samples with
B- and T-cell types in each of the BCR/ABL and NEG groups
```{r ALL-BCR/ABL-BT}
xtabs(~ BT + mol.biol, bcrabl)
```
Use `aggregate()` to calculate the average age of males and females in
the BCR/ABL and NEG treatment groups.
```{r ALL-aggregate}
aggregate(age ~ mol.biol + sex, bcrabl, mean)
```
Use `t.test()` to compare the age of individuals in the BCR/ABL versus
NEG groups; visualize the results using `boxplot()`. In both cases,
use the `formula` interface. Consult the help page `?t.test` and re-do
the test assuming that variance of ages in the two groups is
identical. What parts of the test output change?
```{r ALL-age}
t.test(age ~ mol.biol, bcrabl)
boxplot(age ~ mol.biol, bcrabl)
```
# Short read quality assessment
Option 1: `fastqc`
1. Start _fastqc_
2. Select fastq.gz files from the File --> Open menu. Files are in
`/mnt/nfs/practicals/day1/martin_morgan/`
3. Press `OK`
4. Study plots and the Help -> Contents menu
Option 2: [ShortRead][]
```{r ShortRead, messages=FALSE}
## 1. attach ShortRead and BiocParallel
library(ShortRead)
library(BiocParallel)
## 2. create a vector of file paths
## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/'
fls <- dir("bigdata", pattern="*fastq.gz", full=TRUE)
stopifnot(all(file.exists(fls)))
## 3. collect statistics
stats <- qa(fls)
## 4. generate and browse the report
browseURL(report(stats))
```
Check out the qa report from all lanes
```{r ShortRead-qa-all}
## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/'
load("bigdata/qa_all.Rda")
browseURL(report(qa_all))
```
# Annotation
`org` packages
- symbol mapping
```{r org}
library(airway)
data(airway)
library(org.Hs.eg.db)
ensid <- head(rownames(airway))
mapIds(org.Hs.eg.db, ensid, "SYMBOL", "ENSEMBL")
keytypes(org.Hs.eg.db)
```
`TxDb` packages
- known gene models, as _GRanges_ / _GRangesList_
- Easy to make your own, from GFF files
`GenomicFeatures::makeTxDbFromGFF()` & friends
```{r txdb}
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exons(txdb)
exonsBy(txdb, "tx")
p <- promoters(txdb)
```
`BSgenome`
- Whole-genome sequences
- Possible to make your own, or use other formats --
`Rsamtools::FaFile()`, `tracklayer::TwoBitFile()`
```{r BSgenome}
library(BSgenome.Hsapiens.UCSC.hg19)
bsgenome <- BSgenome.Hsapiens.UCSC.hg19
ps <- getSeq(bsgenome, p)
ps
hist(letterFrequency(ps, "GC", as.prob=TRUE))
```
[AnnotationHub][]
- easily accessible genome-scale resources
Example: Ensembl 'GTF' files to _R_ / _Bioconductor_ GRanges and _TxDb_
```{r annotationhub-gtf, eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
hub
query(hub, c("Ensembl", "80", "gtf"))
## ensgtf = display(hub) # visual choice
hub["AH47107"]
gtf <- hub[["AH47107"]]
gtf
txdb <- GenomicFeatures::makeTxDbFromGRanges(gtf)
```
Example: non-model organism `OrgDb` packages
```{r annotationhub-orgdb, eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
query(hub, "OrgDb")
```
Example: Map Roadmap epigenomic marks to hg38
- Roadmap BED file as _GRanges_
```{r annotationhub-roadmap, eval=FALSE}
library(AnnotationHub)
hub <- AnnotationHub()
query(hub , c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126 <- hub[["AH29817"]]
```
- UCSC 'liftOver' file to map coordinates
```{r annotationhub-liftover, eval=FALSE}
query(hub , c("hg19", "hg38", "chainfile"))
chain <- hub[["AH14150"]]
```
- lift over -- possibly one-to-many mapping, so _GRanges_ to _GRangesList_
```{r liftover, eval=FALSE}
library(rtracklayer)
E126hg38 <- liftOver(E126, chain)
E126hg38
```
# Alignments
Integrative Genomics Viewer
1. Create an 'igv' directory (if it does not already exist) and add
the file hg19_alias.tab to it. This is a simple tab-delimited file
that maps between the sequence names used by the alignment, and the
sequence names known to IGV.
2. Start igv.
3. Choose hg19 from the drop-down menu at the top left of
the screen
4. Use File -> Load from File menu to load a bam file, e.g.,
`/mnt/nfs/practicals/day1/martin_morgan/SRR1039508_sorted.bam`
5. Zoom in to a particular gene, e.g., SPARCL1, by entering the gene
symbol in the box toward the center of the browser window. Adjust
the zoom until reads come in to view, and interpret the result.
```
mkdir -p ~/igv/genomes
cp bigdata/hg19_alias.tab ~/igv/genomes/
igv
```
_Bioconductor_: we'll explore how to map between different types of
identifiers, how to navigate genomic coordinates, and how to query BAM
files for aligned reads.
1. Attach 'Annotation' packages containing information about gene
symbols `r Biocannopkg("org.Hs.eg.db")` and genomic coordinates
(e.g., genes, exons, cds, transcripts) `r
Biocannopkg(TxDb.Hsapiens.UCSC.hg19.knownGene)`. Arrange for the
'seqlevels' (chromosome names) in the TxDb package to match those
in the BAM files.
2. Use the `org.*` package to map from gene symbol to Entrez gene id,
and the `TxDb.*` package to retrieve gene coordinates of the
SPARCL1 gene. N.B. -- The following uses a single gene symbol, but
we could have used 1, 2, or all gene symbols in a _vectorized_ fashion.
3. Attach the [GenomicAlignments][] package for working with aligned
reads. Use `range()` to get the genomic coordinates spanning the
first and last exon of SPARCL1. Input paired reads overlapping
SPARCL1.
4. What questions can you easily answer about these alignments? E.g.,
how many reads overlap this region of interest?
```{r setup-view, message=FALSE, warning=FALSE}
## 1.a 'Annotation' packages
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## 1.b -- map 'seqlevels' as recorded in the TxDb file to those in the
## BAM file
fl <- "~/igv/genomes/hg19_alias.tab"
map <- with(read.delim(fl, header=FALSE, stringsAsFactors=FALSE),
setNames(V1, V2))
seqlevels(txdb, force=TRUE) <- map
## 2. Symbol -> Entrez ID -> Gene coordinates
sym2eg <- mapIds(org.Hs.eg.db, "SPARCL1", "ENTREZID", "SYMBOL")
exByGn <- exonsBy(txdb, "gene")
sparcl1exons <- exByGn[[sym2eg]]
## 3. Aligned reads
library(GenomicAlignments)
## replace 'bigdata' with '/mnt/nfs/practicals/day1/martin_morgan/'
fl <- "bigdata/SRR1039508_sorted.bam"
sparcl1gene <- range(sparcl1exons)
param <- ScanBamParam(which=sparcl1gene)
aln <- readGAlignmentPairs(fl, param=param)
```
5. As another exercise we ask how many of the reads we've input are
compatible with the known gene model. We have to find the
transcripts that belong to our gene, and then exons grouped by
transcript.
```{r compatibleAlignments, warning=FALSE}
## 5.a. exons-by-transcript for our gene of interest
txids <- select(txdb, sym2eg, "TXID", "GENEID")$TXID
exByTx <- exonsBy(txdb, "tx")[txids]
## 5.b compatible alignments
hits <- findCompatibleOverlaps(query=aln, subject=exByTx)
good <- seq_along(aln) %in% queryHits(hits)
table(good)
```
6. Finally, let's go from gene model to protein coding
sequence. (a) Extract CDS regions grouped by transcript, select just
transcripts we're interested in, (b) attach and then extract the coding
sequence from the appropriate reference genome. Translating the
coding sequences to proteins.
```{r coding-sequence, warning=FALSE}
## reset seqlevels
restoreSeqlevels(txdb)
## a. cds coordinates, grouped by transcript
txids <- mapIds(txdb, sym2eg, "TXID", "GENEID")
cdsByTx <- cdsBy(txdb, "tx")[txids]
## b. coding sequence from relevant reference genome
library(BSgenome.Hsapiens.UCSC.hg19)
dna <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cdsByTx)
protein <- translate(dna)
```
# _biomaRt_ annotations
**Exercises** Visit the [biomart](http://biomart.org) web service to
explore the diversity of annotation offerings available.
Load the [biomaRt][] package and list the available marts. Choose the
_ensembl_ mart and list the datasets for that mart. Set up a mart to
use the _ensembl_ mart and the _hsapiens_gene_ensembl_ dataset.
A [biomaRt][] dataset can be accessed via `getBM()`. In addition to
the mart to be accessed, this function takes filters and attributes as
arguments. Use `filterOptions()` and `listAttributes()` to discover
values for these arguments. Call `getBM()` using filters and
attributes of your choosing.
**Solutions**
```{r biomaRt1, eval=FALSE, results="hide"}
library(biomaRt)
head(listMarts(), 3) ## list the marts
head(listDatasets(useMart("ensembl")), 3) ## mart datasets
ensembl <- ## fully specified mart
useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listFilters(ensembl), 3) ## filters
myFilter <- "chromosome_name"
head(filterOptions(myFilter, ensembl), 3) ## return values
myValues <- c("21", "22")
head(listAttributes(ensembl), 3) ## attributes
myAttributes <- c("ensembl_gene_id","chromosome_name")
## assemble and query the mart
res <- getBM(attributes = myAttributes, filters = myFilter,
values = myValues, mart = ensembl)
```
[AnnotationHub]: https://bioconductor.org/packages/AnnotationHub
[BiocParallel]: https://bioconductor.org/packages/BiocParallel
[Biostrings]: https://bioconductor.org/packages/Biostrings
[DESeq2]: https://bioconductor.org/packages/DESeq2
[GenomicFiles]: https://bioconductor.org/packages/GenomicFiles
[GenomicRanges]: https://bioconductor.org/packages/GenomicRanges
[GenomicAlignments]: https://bioconductor.org/packages/GenomicAlignments
[ShortRead]: https://bioconductor.org/packages/ShortRead
[biomaRt]: https://bioconductor.org/packages/biomaRt