1 Project Overview

1.1 About

Bioconductor: Analysis and comprehension of high-throughput genomic data

Statistical analysis: large data, technological artifacts, designed experiments; rigorous
Comprehension: biological context, visualization, reproducibility
High-throughput
- Sequencing: RNASeq, ChIPSeq, variants, copy number, …
- Microarrays: expression, SNP, …
- Flow cytometry, proteomics, images, …

Packages, vignettes, work flows

1296 software packages; also…
- ‘Annotation’ packages – static data bases of identifier maps, gene models, pathways, etc; e.g., TxDb.Hsapiens.UCSC.hg19.knownGene
- ‘Experiment’ packages – data sets used to illustrate software functionality, e.g., airway
Discover and navigate via biocViews
Package ‘landing page’
- Title, author / maintainer, short description, citation, installation instructions, …, download statistics
All user-visible functions have help pages, most with runnable examples
‘Vignettes’ are an important feature of Bioconductor: narrative documents illustrating how to use the package, with integrated code
‘Release’ (every six months) and ‘devel’ branches
Support site; videos, recent courses

Package installation and use

A package needs to be installed once, using the instructions on the package landing page (e.g., DESeq2).
```
source("https://bioconductor.org/biocLite.R")
biocLite(c("DESeq2", "org.Hs.eg.db"))
```
biocLite() installs Bioconductor, CRAN, and GitHub packages.
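Because a package needs to be installed only once, a script can check what is already present before calling the installer. The sketch below is one way to do that; `missing_packages()` is a hypothetical helper, not part of Bioconductor, and relies only on base R's `requireNamespace()`. Note that on Bioconductor 3.8 and later, `BiocManager::install()` replaces `biocLite()`.

```r
## Hypothetical helper (not part of Bioconductor): which of `pkgs`
## cannot be loaded from the current library paths?
## Side effect: namespaces that are found get loaded (not attached).
missing_packages <- function(pkgs) {
    installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!installed]
}

to_install <- missing_packages(c("DESeq2", "org.Hs.eg.db"))

## Install only what is missing, and only in an interactive session.
if (length(to_install) > 0 && interactive()) {
    source("https://bioconductor.org/biocLite.R")
    biocLite(to_install)
}
```

Restricting the install step to interactive sessions is a design choice for this sketch, so that batch scripts never trigger a download unexpectedly.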

Once installed, the package can be loaded into an R session

library(GenomicRanges)

and the help system queried interactively:

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges

1.2 Key concepts

Goals

Reproducibility
Interoperability
Use

What a few lines of R have to say

x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X=x, Y=y)
plot(Y ~ X, df)
fit <- lm(Y ~ X, df)
anova(fit)

## Analysis of Variance Table
## 
## Response: Y
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## X           1 1001.14 1001.14    1013 < 2.2e-16 ***
## Residuals 998  986.27    0.99                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

abline(fit)
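The fitted object can also be queried directly rather than only plotted. The following sketch repeats the simulation with a fixed seed (the seed value is an arbitrary assumption, so the numbers will differ slightly from the table above) and pulls out the quantities the ANOVA table summarizes.

```r
set.seed(123)                 # arbitrary seed, only for repeatability
x <- rnorm(1000)
y <- x + rnorm(1000)
df <- data.frame(X = x, Y = y)
fit <- lm(Y ~ X, df)

coef(fit)                     # intercept near 0, slope near 1
summary(fit)$r.squared        # near 0.5: about half the variance of Y comes from X
df.residual(fit)              # 998 = 1000 observations - 2 coefficients
```

The slope is near 1 because `y` was simulated as `x` plus independent noise of the same variance.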

Classes and methods – “S3”

data.frame()
Defines class to coordinate data
Creates an instance or object
plot(), lm(), anova(), abline(): methods defined on generics to transform instances
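To make the S3 vocabulary concrete, here is a minimal sketch of a class, a generic, and two methods. The class name `temperature` and the generic `to_fahrenheit()` are hypothetical, invented purely for illustration.

```r
## Constructor: creates an instance of the hypothetical class "temperature"
temperature <- function(celsius) {
    stopifnot(is.numeric(celsius))
    structure(list(celsius = celsius), class = "temperature")
}

## Generic: dispatches on the class of its first argument
to_fahrenheit <- function(x, ...) UseMethod("to_fahrenheit")

## Method: the implementation chosen for "temperature" objects
to_fahrenheit.temperature <- function(x, ...) x$celsius * 9 / 5 + 32

## A method on an existing generic, print()
print.temperature <- function(x, ...) {
    cat("<temperature>", format(x$celsius), "degrees C\n")
    invisible(x)
}

boiling <- temperature(100)
class(boiling)                # "temperature"
to_fahrenheit(boiling)        # 212
```

In S3 the link between generic and method is the naming convention `generic.class`; nothing beyond the `class` attribute is declared formally.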

Discovery and help

class(fit)
methods(class=class(fit))
methods(plot)
?"plot"
?"plot.formula"

tab completion!
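The same discovery steps can be scripted. This sketch uses the `cars` data set that ships with R instead of the simulated data, and relies on `getS3method()` and `apropos()` from the utils and base packages.

```r
fit <- lm(dist ~ speed, data = cars)   # `cars` ships with R

class(fit)                     # "lm"
inherits(fit, "lm")            # TRUE

## Retrieve the method that anova() dispatches to for "lm" objects
anova_lm <- getS3method("anova", "lm")
is.function(anova_lm)          # TRUE

## apropos() finds visible objects by name pattern,
## a scripted cousin of tab completion
head(apropos("^plot"))
```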

Bioconductor classes and methods – “S4”

Example: working with DNA sequences

library(Biostrings)
dna <- DNAStringSet(c("AACAT", "GGCGCCT"))
reverseComplement(dna)

##   A DNAStringSet instance of length 2
##     width seq
## [1]     5 ATGTT
## [2]     7 AGGCGCC

data(phiX174Phage)
phiX174Phage

##   A DNAStringSet instance of length 6
##     width seq                                                                   names               
## [1]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Genbank
## [2]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA RF70s
## [3]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA SS78
## [4]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA Bull
## [5]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA G97
## [6]  5386 GAGTTTTATCGCTTCCATGACGCAGAAGTTAAC...TTCGATAAAAATGATTGGCGTATCCAACCTGCA NEB03

letterFrequency(phiX174Phage, "GC", as.prob=TRUE)

##            G|C
## [1,] 0.4476420
## [2,] 0.4472707
## [3,] 0.4472707
## [4,] 0.4470850
## [5,] 0.4472707
## [6,] 0.4470850
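For intuition about what the `G|C` column reports, the fraction of letters that are G or C can be computed in base R. This is only a sketch on plain character strings; `gc_content()` is a hypothetical helper, and Biostrings' `letterFrequency()` remains the appropriate tool for real sequence data.

```r
## Hypothetical helper: fraction of G or C letters in each string
gc_content <- function(seqs) {
    letters_by_seq <- strsplit(toupper(seqs), "")
    vapply(letters_by_seq,
           function(s) mean(s %in% c("G", "C")),
           numeric(1))
}

gc_content(c("AACAT", "GGCGCCT"))   # 1/5 and 6/7
```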

Discovery and help

class(dna)
?"DNAStringSet-class"
?"reverseComplement,DNAStringSet-method"
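The S4 system used by Bioconductor differs from S3 in that classes, generics, and methods are declared formally. The sketch below shows the mechanics with a toy class; `SimpleDNA` and `revComp()` are hypothetical names, and this is not how Biostrings implements `DNAStringSet` or `reverseComplement()`.

```r
library(methods)

## Hypothetical class with one slot; validity restricts letters to A, C, G, T
setClass("SimpleDNA",
    slots = c(seq = "character"),
    validity = function(object) {
        if (all(grepl("^[ACGT]*$", object@seq))) TRUE
        else "sequences may contain only A, C, G, T"
    })

## Generic, then a method registered for the class
setGeneric("revComp", function(x, ...) standardGeneric("revComp"))

setMethod("revComp", "SimpleDNA", function(x, ...) {
    complemented <- chartr("ACGT", "TGCA", x@seq)
    reversed <- vapply(strsplit(complemented, ""),
                       function(s) paste(rev(s), collapse = ""),
                       character(1))
    new("SimpleDNA", seq = reversed)
})

toy <- new("SimpleDNA", seq = c("AACAT", "GGCGCCT"))
revComp(toy)@seq                      # "ATGTT" "AGGCGCC"
existsMethod("revComp", "SimpleDNA")  # TRUE
```

The validity function is what lets an S4 class reject malformed input at construction time, one reason formal classes help with interoperability between packages.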

1.3 High-throughput sequence analysis work flows

Experimental design
Wet-lab sequence preparation (figure from http://rnaseq.uoregon.edu/)
(Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517)
- Primary output: FASTQ files of short reads and their quality scores
Alignment
- Choose an aligner to match the task: e.g., Rsubread and Bowtie2 are good for ChIPSeq and some forms of RNASeq; BWA and GMAP are better for variant calling
- Primary output: BAM files of aligned reads
- More recently: kallisto and similar programs that produce tables of reads aligned to transcripts
Reduction
- e.g., RNASeq ‘count table’ (a simple spreadsheet), DNASeq called variants (VCF files), ChIPSeq peaks (BED, WIG files)
Analysis
- Differential expression, peak identification, …
Comprehension
- Biological context
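The FASTQ files named as the primary sequencing output store each read as four lines: an identifier, the sequence, a separator, and one quality character per base. The sketch below parses a single made-up record in base R, assuming the Phred+33 quality encoding; `parse_fastq_record()` is a hypothetical helper, and real files are better read with a dedicated Bioconductor package.

```r
## One made-up FASTQ record: identifier, sequence, separator, qualities
record <- c("@read1", "GATTACA", "+", "IIIIII!")

## Hypothetical helper: parse a single four-line record
parse_fastq_record <- function(lines) {
    stopifnot(length(lines) == 4,
              startsWith(lines[1], "@"),
              startsWith(lines[3], "+"),
              nchar(lines[2]) == nchar(lines[4]))
    list(id = substring(lines[1], 2),
         sequence = lines[2],
         quality = utf8ToInt(lines[4]) - 33L)   # Phred+33 offset assumed
}

read <- parse_fastq_record(record)
read$quality                  # 40 40 40 40 40 40 0
10^(-read$quality / 10)       # per-base error probability
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the final base in this toy read (score 0) carries no confidence at all.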

1.4 Bioconductor sequencing ecosystem

[Figure: Sequencing Ecosystem]

B.1 – Introduction to Bioconductor

Martin Morgan Martin.Morgan@RoswellPark.org
Lori Shepherd Lori.Shepherd@RoswellPark.org

3 March 2017

