0.1 What is Bioconductor

Analysis and comprehension of high-throughput genomic data

Statistical analysis: large data, technological artifacts, designed experiments; rigorous
Comprehension: biological context, visualization, reproducibility
High-throughput
- Sequencing: RNASeq, ChIPSeq, variants, copy number, …
- Microarrays: expression, SNP, …
- Flow cytometry, proteomics, images, …

Packages, vignettes, work flows

‘Release’ (every six months) and ‘devel’ branches
Choose from 1045 packages.
- Discover packages via biocViews
- Each Package has a Package ‘landing page’ with Title, author / maintainer, short description, citation, installation instructions, shields, documentation, download statistics
- All user-visible functions have help pages, most with runnable examples
- ‘Vignettes’ an important feature in Bioconductor – narrative documents illustrating how to use the package, with integrated code

Helpful Links

Documentation Previous Course Material, Workflows, Videos, Developers
Ask a question Support site
Connect with us Twitter, Newsletter

0.2 Overall Workflow

A typical workflow consists of the following steps.
- Experimental design
- Wet-lab preparation
- High-throughput sequencing
+ Output: FASTQ files of reads and their quality scores
- Alignment + Many different aligners, some specialized for different purposes
+ Output: BAM files of aligned reads
- Summary
+ e.g., count of reads overlapping regions of interest (e.g., genes)
- Statistical analysis
- Comprehension

Alt Sequencing Ecosystem

0.3 Where does Bioconductor fit in

0.3.1 Infrastructure

One of the biggest strengths of Bioconductor is the classes defined to make simple tasks extremely easy and streamlined.

0.3.1.1 GenomicRanges objects

Represent annotations – genes, variants, regulatory elements, copy number regions, …
Represent data – aligned reads, ChIP peaks, called variants, …

Alt Genomic Ranges

Many biologically interesting questions represent operations on ranges

Count overlaps between aligned reads and known genes – GenomicRanges::summarizeOverlaps()
Genes nearest to regulatory regions – GenomicRanges::nearest(), [ChIPseeker][]
Called variants relevant to clinical phenotypes – [VariantFiltering][]

GRanges Algebra

Intra-range methods
- Independent of other ranges in the same object
- GRanges variants strand-aware
- shift(), narrow(), flank(), promoters(), resize(), restrict(), trim()
- See ?"intra-range-methods"
Inter-range methods
- Depends on other ranges in the same object
- range(), reduce(), gaps(), disjoin()
- coverage() (!)
- see ?"inter-range-methods"
Between-range methods
- Functions of two (or more) range objects
- findOverlaps(), countOverlaps(), …, %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()

0.3.1.2 SummarizedExperiment

The SummarizedExperiment class is a matrix-like container where rows represent ranges of interest (as a ‘GRanges or GRangesList-class’) and columns represent samples (with sample data summarized as a ‘DataFrame-class’)

Alt Ranges Algebra

0.3.2 Reading in Various file formats using R/Bioconductor

Alt Ranges Algebra

Example - Reading in BAM files
The GenomicAlignments package is used to input reads aligned to a reference genome. In this next example, we will read in a BAM file and specifically read in reads supporting an apparent
exon splice junction spanning position 19653773 of chromosome 14.

The package RNAseqData.HNRNPC.bam.chr14_BAMFILES contains 8 BAM files. We will use only the first BAM file. We will load the software packages and the data package, construct a GRanges with our region of interest, and use summarizeJunctions() to find reads in our region of interest.

## 1. load software packages
library(GenomicRanges)
library(GenomicAlignments)

## 2. load sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)

## 3. define our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1))

## 4. alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]

## 5. supporting reads
paln[j_overlap$revmap[[1]]]

## GAlignmentsList object of length 8:
## [[1]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand      cigar qwidth    start      end width njunc
##   [1]    chr14      -  66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      + 7M1270N65M     72 19652348 19653689  1342     1
## 
## [[2]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      - 66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      +       72M     72 19653686 19653757    72     0
## 
## [[3]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      +       72M     72 19653675 19653746    72     0
##   [2]    chr14      - 65M120N7M     72 19653708 19653899   192     1
## 
## ...
## <5 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome

0.3.3 Annotations

0.3.3.1 AnnotationHub: Bioconductor Package to Manage & Download files

AnnotationHub is a web client with which one can browse and
download biological files from various databases such as UCSC, NCBI.
Using this package allows the user to directly get the file, without needing
to figure out where the file is located on UCSC, downloading it and managing
multiple files on their local machine.

 library(AnnotationHub)
 ah = AnnotationHub()

 ## data is available from the following sources
 unique(ah$dataprovider)

##  [1] "Ensembl"                               "EncodeDCC"                            
##  [3] "UCSC"                                  "Inparanoid8"                          
##  [5] "NCBI"                                  "NHLBI"                                
##  [7] "ChEA"                                  "Pazar"                                
##  [9] "NIH Pathway Interaction Database"      "RefNet"                               
## [11] "Haemcode"                              "GEO"                                  
## [13] "BroadInstitute"                        "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/"
## [15] "dbSNP"

 ## following types of files can be retrieved from the hub 
 unique(ah$sourcetype)

##  [1] "FASTA"         "BED"           "UCSC track"    "GTF"           "Inparanoid"    "NCBI/blast2GO"
##  [7] "TwoBit"        "Chain"         "GRASP"         "Zip"           "CSV"           "BioPax"       
## [13] "BioPaxLevel2"  "RData"         "BigWig"        "tar.gz"        "tab"           "NCBI/ensembl" 
## [19] "VCF"

 ## We will download all _Homo sapiens_ cDNA sequences from the FASTA file
 ## 'Homo_sapiens.GRCh38.cdna.all.fa' from Ensembl using
 ## `r Biocpkg("AnnotationHub")`.
 ah2 <- query(ah, c("fasta", "homo sapiens", "Ensembl"))
 fa <- ah2[["AH18522"]]
 fa

## class: FaFile 
## path: /home/ubuntu/.AnnotationHub/22617
## index: /home/ubuntu/.AnnotationHub/25666
## isOpen: FALSE 
## yieldSize: NA

Alt Annotation Packages

0.3.3.2 TxDb objects

Curatated annotation resources – http://bioconductor.org/packages/biocViews
Underlying sqlite database – dbfile(txdb)
Make your own: GenomicFeatures::makeTxDbFrom*()
Accessing gene models
- exons(), transcripts(), genes(), cds() (coding sequence)
- promoters() & friends
- exonsBy() & friends – exons by gene, transcript, …
- ‘select’ interface: keytypes(), columns(), keys(), select(), mapIds()

library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # TaxID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-05-12 10:59:39 -0700 (Tue, 12 May 2015)
## # GenomicFeatures version at creation time: 1.21.3
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

methods(class=class(txdb))

##  [1] $                      $<-                    ExpressionSet          annotatedDataFrameFrom
##  [5] as.list                asBED                  asGFF                  assayData             
##  [9] assayData<-            cds                    cdsBy                  cdsByOverlaps         
## [13] coerce                 columns                combine                contents              
## [17] dbInfo                 dbconn                 dbfile                 dbmeta                
## [21] dbschema               disjointExons          distance               exons                 
## [25] exonsBy                exonsByOverlaps        extractUpstreamSeqs    featureNames          
## [29] featureNames<-         fiveUTRsByTranscript   genes                  initialize            
## [33] intronsByTranscript    isActiveSeq            isActiveSeq<-          isNA                  
## [37] keys                   keytypes               mapIds                 mapToTranscripts      
## [41] mappedkeys             metadata               microRNAs              nhit                  
## [45] organism               promoters              revmap                 sample                
## [49] sampleNames            sampleNames<-          saveDb                 select                
## [53] seqinfo                seqinfo<-              seqlevels0             show                  
## [57] species                storageMode            storageMode<-          tRNAs                 
## [61] taxonomyId             threeUTRsByTranscript  transcripts            transcriptsBy         
## [65] transcriptsByOverlaps  updateObject          
## see '?methods' for accessing help and source code

genes(txdb)

## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames                 ranges strand   |     gene_id
##            <Rle>              <IRanges>  <Rle>   | <character>
##       1    chr19 [ 58858172,  58874214]      -   |           1
##      10     chr8 [ 18248755,  18258723]      +   |          10
##     100    chr20 [ 43248163,  43280376]      -   |         100
##    1000    chr18 [ 25530930,  25757445]      -   |        1000
##   10000     chr1 [243651535, 244006886]      -   |       10000
##     ...      ...                    ...    ... ...         ...
##    9991     chr9 [114979995, 115095944]      -   |        9991
##    9992    chr21 [ 35736323,  35743440]      +   |        9992
##    9993    chr22 [ 19023795,  19109967]      -   |        9993
##    9994     chr6 [ 90539619,  90584155]      +   |        9994
##    9997    chr22 [ 50961997,  50964905]      -   |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

0.3.3.3 OrgDb objects

Curated resources, underlying sqlite data base, like TxDb
‘select’ interface: keytypes(), columns(), keys(), select(), mapIds()
Vector of keys, desired columns
Specification of key type

library(org.Hs.eg.db)
select(org.Hs.eg.db, c("BRCA1", "PTEN"), c("ENTREZID", "GENENAME"), "SYMBOL")

## 'select()' returned 1:1 mapping between keys and columns

##   SYMBOL ENTREZID                       GENENAME
## 1  BRCA1      672   breast cancer 1, early onset
## 2   PTEN     5728 phosphatase and tensin homolog

keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
##  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [19] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"    
##  [7] "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"       
## [13] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [19] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [25] "UNIGENE"      "UNIPROT"

0.3.3.4 Other internet resources

biomaRt Ensembl and other annotations
PSICQUIC Protein interactions
uniprot.ws Protein annotations
KEGGREST KEGG pathways
SRAdb Sequencing experiments
rtracklayer USCS genome tracks
GEOquery Array and other data
ArrayExpress Array and other data
…

0.3.4 Downstream Statistical Analysis

Bioconductor packages are organized by biocViews. One can answer a number of Biological Questions using various packages. Some of the entries under Sequencing and other terms, and representative packages, include:

RNASeq, e.g., edgeR, DESeq2, edgeR, derfinder, and QuasR.
ChIPSeq, e.g.,DiffBind, csaw, ChIPseeker, ChIPQC.
SNPs and other variants, e.g., VariantAnnotation, VariantFiltering, h5vc.
CopyNumberVariation e.g., DNAcopy, crlmm, fastseg.
Microbiome and metagenome sequencing, e.g., metagenomeSeq, phyloseq, DirichletMultinomial.

0.4 `sessionInfo()`

sessionInfo()

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.1.2                      RSQLite_1.0.0                          
##  [3] DBI_0.3.1                               TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.3
##  [5] GenomicFeatures_1.21.13                 AnnotationDbi_1.31.17                  
##  [7] AnnotationHub_2.1.30                    RNAseqData.HNRNPC.bam.chr14_0.7.0      
##  [9] GenomicAlignments_1.5.11                Rsamtools_1.21.14                      
## [11] Biostrings_2.37.2                       XVector_0.9.1                          
## [13] SummarizedExperiment_0.3.2              Biobase_2.29.1                         
## [15] GenomicRanges_1.21.16                   GenomeInfoDb_1.5.8                     
## [17] IRanges_2.3.14                          S4Vectors_0.7.10                       
## [19] BiocGenerics_0.15.3                     ggplot2_1.0.1                          
## [21] BiocStyle_1.7.4                        
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.1               colorspace_1.2-6             htmltools_0.2.6             
##  [4] rtracklayer_1.29.12          yaml_2.1.13                  interactiveDisplayBase_1.7.0
##  [7] XML_3.98-1.3                 BiocParallel_1.3.34          lambda.r_1.1.7              
## [10] plyr_1.8.3                   stringr_1.0.0                zlibbioc_1.15.0             
## [13] munsell_0.4.2                gtable_0.1.2                 futile.logger_1.4.1         
## [16] codetools_0.2-14             evaluate_0.7                 labeling_0.3                
## [19] knitr_1.10.5                 biomaRt_2.25.1               httpuv_1.3.2                
## [22] BiocInstaller_1.19.8         curl_0.9.1                   proto_0.3-10                
## [25] Rcpp_0.11.6                  xtable_1.7-4                 scales_0.2.5                
## [28] formatR_1.2                  mime_0.3                     digest_0.6.8                
## [31] stringi_0.5-5                shiny_0.12.1                 grid_3.2.1                  
## [34] tools_3.2.1                  bitops_1.0-6                 magrittr_1.5                
## [37] RCurl_1.95-4.7               futile.options_1.0.0         MASS_7.3-43                 
## [40] rmarkdown_0.7                httr_1.0.0                   R6_2.1.0

Introduction to Bioconductor

Sonali Arora

Contents

0.1 What is Bioconductor

0.2 Overall Workflow

0.3 Where does Bioconductor fit in

0.3.1 Infrastructure

0.3.1.1 GenomicRanges objects

0.3.1.2 SummarizedExperiment

0.3.2 Reading in Various file formats using R/Bioconductor

0.3.3 Annotations

0.3.3.1 AnnotationHub: Bioconductor Package to Manage & Download files

0.3.3.2 TxDb objects

0.3.3.3 OrgDb objects

0.3.3.4 Other internet resources

0.3.4 Downstream Statistical Analysis

0.4 `sessionInfo()`

Introduction to Bioconductor

Sonali Arora

Contents

0.1 What is Bioconductor

0.2 Overall Workflow

0.3 Where does Bioconductor fit in

0.3.1 Infrastructure

0.3.1.1 GenomicRanges objects

0.3.1.2 SummarizedExperiment

0.3.2 Reading in Various file formats using R/Bioconductor

0.3.3 Annotations

0.3.3.1 AnnotationHub: Bioconductor Package to Manage & Download files

0.3.3.2 TxDb objects

0.3.3.3 OrgDb objects

0.3.3.4 Other internet resources

0.3.4 Downstream Statistical Analysis

0.4 sessionInfo()

0.4 `sessionInfo()`