Author: Martin Morgan (mtmorgan@fredhutch.org)
Date: 7 September, 2015
The material in this document requires R version 3.2 and Bioconductor version 3.1
stopifnot(
    getRversion() >= '3.2' && getRversion() < '3.3',
    BiocInstaller::biocVersion() >= "3.1"
)
This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.
Sequence analysis is specialized
Additional considerations
Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.
                   VariantAnnotation
                           |
                           v
                    GenomicFeatures
                           |
                           v
                       BSgenome
                           |
                           v
                      rtracklayer
                           |
                           v
                    GenomicAlignments
                      |           |
                      v           v
     SummarizedExperiment   Rsamtools  ShortRead
                  |         |      |      |
                  v         v      v      v
                GenomicRanges     Biostrings
                        |          |
                        v          v
               GenomeInfoDb   (XVector)
                        |     |
                        v     v
                        IRanges
                           |
                           v 
                      (S4Vectors)
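The idea of classes with methods operating on them can be sketched with base R's S4 system, which the packages above build on. This is only an illustration: SimpleRanges and rangeEnd() are hypothetical names invented here, not part of any Bioconductor package.

```r
library(methods)

## A hypothetical class holding integer ranges as start / width pairs
setClass("SimpleRanges",
    representation(start="integer", width="integer"),
    validity=function(object) {
        if (length(object@start) != length(object@width))
            "'start' and 'width' must have the same length"
        else if (any(object@width < 0L))
            "'width' must be non-negative"
        else TRUE
    })

## Constructor: coerce to integer and recycle 'width' to match 'start'
SimpleRanges <- function(start, width)
    new("SimpleRanges", start=as.integer(start),
        width=rep_len(as.integer(width), length(start)))

## A generic, and a method implementing it for the class
setGeneric("rangeEnd", function(x, ...) standardGeneric("rangeEnd"))
setMethod("rangeEnd", "SimpleRanges",
    function(x, ...) x@start + x@width - 1L)

sr <- SimpleRanges(c(10, 20, 30), 5)
rangeEnd(sr)
```

The validity function is what makes the class 'well-defined': an attempt such as SimpleRanges(1, -1) fails at construction rather than producing a malformed object.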
The IRanges package defines an important class for specifying integer ranges, e.g.,
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges of length 3
##     start end width
## [1]    10  14     5
## [2]    20  24     5
## [3]    30  34     5
There are many interesting operations to be performed on ranges, e.g.,
flank() identifies adjacent ranges
flank(ir, 3)
## IRanges of length 3
##     start end width
## [1]     7   9     3
## [2]    17  19     3
## [3]    27  29     3
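A few more operations on the same ranges, as a brief sketch. The extra range in ir2 and the query ranges are made up for illustration; the functions are all from IRanges.

```r
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)

## Operations applied to each range independently
shift(ir, 2)                  # move every range 2 positions to the right
resize(ir, 10)                # widen to 10, keeping each start fixed
flank(ir, 3, start=FALSE)     # 3 positions following each end

## Operations that depend on the other ranges in the object
ir2 <- c(ir, IRanges(start=12, width=10))   # add a range spanning 12-21
reduce(ir2)                   # merge overlapping ranges
gaps(ir2)                     # positions between the merged ranges

## Operations between two sets of ranges
query <- IRanges(start=c(13, 40), width=3)
countOverlaps(query, ir)      # number of ranges in ir hit by each query
overlapsAny(query, ir)        # does each query overlap anything in ir?
```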
The IRanges class is part of a class hierarchy. To see this, ask R for
the class of ir, and for the class definition of the IRanges class
class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClass(class(ir))
## Class "IRanges" [package "IRanges"]
## 
## Slots:
##                                                                                       
## Name:            start           width           NAMES     elementType elementMetadata
## Class:         integer         integer characterORNULL       character DataTableORNULL
##                       
## Name:         metadata
## Class:            list
## 
## Extends: 
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
## 
## Known Subclasses: "NormalIRanges"
Notice that IRanges extends the Ranges class. Now try entering
?flank (?"flank,<tab>" if not using RStudio, where <tab> means
to press the tab key to ask for tab completion). You can see that
there are help pages for flank operating on several different
classes. Select the completion
?"flank,Ranges-method" 
and verify that you're at the page that describes the method relevant
to an IRanges instance.  Explore other range-based operations.
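The same discovery can be done programmatically. The sketch below uses only functions from the methods package; the output is omitted because the class hierarchy printed depends on the installed version of IRanges.

```r
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)

## The class of ir followed by every class it inherits from
is(ir)

## The signatures for which a flank() method has been defined
showMethods("flank")

## The method that dispatch would select for an IRanges instance;
## this succeeds through inheritance even when no method is
## defined directly on "IRanges"
m <- selectMethod("flank", class(ir))
class(m)
```

selectMethod() signals an error when no applicable method exists, so it is also a quick way to check whether a generic can be used on an object at all.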
The GenomicRanges package extends the notion of ranges to include
features relevant to application of ranges in sequence analysis,
particularly the ability to associate a range with a sequence name
(e.g., chromosome) and a strand. Create a GRanges instance based on
our IRanges instance, as follows
library(GenomicRanges)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [10, 14]      +
##   [2]     chr1  [20, 24]      -
##   [3]     chr2  [30, 34]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
The notion of flanking sequence has a more nuanced meaning in
biology. In particular we might expect that flanking sequence on the
+ strand would precede the range, but on the minus strand would
follow it. Verify that flank applied to a GRanges object has this
behavior.
flank(gr, 3)
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1  [ 7,  9]      +
##   [2]     chr1  [25, 27]      -
##   [3]     chr2  [27, 29]      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
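The strand-aware behavior can be compared directly with its alternatives. This is a small sketch using the start= and ignore.strand= arguments of the flank() method for genomic ranges; the variable names are chosen here for illustration.

```r
library(GenomicRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))

## Default: upstream flank, so it follows the range on the minus strand
upstream <- flank(gr, 3)

## start=FALSE: downstream flank, the mirror image of the above
downstream <- flank(gr, 3, start=FALSE)

## ignore.strand=TRUE: treat every range as if on the plus strand,
## reproducing the result obtained for the plain IRanges object
unstranded <- flank(gr, 3, ignore.strand=TRUE)

start(upstream)
start(downstream)
start(unstranded)
```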
Discover what classes GRanges extends, find the help page
documenting the behavior of flank when applied to a GRanges object,
and verify that the help page documents the behavior we just observed.
class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
getClass(class(gr))
## Class "GRanges" [package "GenomicRanges"]
## 
## Slots:
##                                                                                       
## Name:         seqnames          ranges          strand elementMetadata         seqinfo
## Class:             Rle         IRanges             Rle       DataFrame         Seqinfo
##                       
## Name:         metadata
## Class:            list
## 
## Extends: 
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGenomicRangesList", by class "GenomicRanges", distance 2
## Class "RangedDataORGenomicRanges", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
?"flank,GenomicRanges-method"
Notice that the available flank() methods have been augmented by the
methods defined in the GenomicRanges package.
It seems like there might be a number of helpful methods available for
working with genomic ranges; we can discover some of these from the
command line, indicating that the methods should be on the current
search() path
methods(class="GRanges")
##   [1] !=                  $                   $<-                 %in%               
##   [5] <                   <=                  ==                  >                  
##   [9] >=                  BamViews            GenomicFiles        NROW               
##  [13] Ops                 ROWNAMES            ScanBamParam        ScanBcfParam       
##  [17] [                   [<-                 aggregate           anyNA              
##  [21] append              as.character        as.complex          as.data.frame      
##  [25] as.env              as.integer          as.list             as.logical         
##  [29] as.numeric          as.raw              bamWhich<-          blocks             
##  [33] browseGenome        c                   chrom               chrom<-            
##  [37] coerce              coerce<-            compare             countOverlaps      
##  [41] coverage            disjoin             disjointBins        distance           
##  [45] distanceToNearest   duplicated          elementMetadata     elementMetadata<-  
##  [49] end                 end<-               eval                export             
##  [53] extractROWS         extractUpstreamSeqs findOverlaps        flank              
##  [57] follow              gaps                getPromoterSeq      granges            
##  [61] head                high2low            intersect           isDisjoint         
##  [65] length              liftOver            mapCoords           mapFromTranscripts 
##  [69] mapToTranscripts    match               mcols               mcols<-            
##  [73] metadata            metadata<-          mstack              names              
##  [77] names<-             narrow              nearest             order              
##  [81] overlapsAny         pack                parallelSlotNames   parallelVectorNames
##  [85] pgap                pintersect          pmapCoords          pmapFromTranscripts
##  [89] pmapToTranscripts   precede             promoters           psetdiff           
##  [93] punion              range               ranges              ranges<-           
##  [97] rank                reduce              reduceByFile        reduceByRange      
## [101] relist              relistToClass       rename              rep                
## [105] rep.int             replaceROWS         resize              restrict           
## [109] rev                 rowRanges<-         scanFa              scanTabix          
## [113] score               score<-             seqinfo             seqinfo<-          
## [117] seqlevelsInUse      seqnames            seqnames<-          setdiff            
## [121] shift               shiftApply          show                showAsCell         
## [125] sort                split               split<-             start              
## [129] start<-             strand              strand<-            subset             
## [133] subsetByOverlaps    summarizeOverlaps   table               tail               
## [137] tapply              tile                trim                union              
## [141] unique              update              updateObject        values             
## [145] values<-            width               width<-             window             
## [149] window<-            with                xtfrm              
## see '?methods' for accessing help and source code
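With a list this long it can help to search it rather than read it. The following is a rough sketch that treats the result of methods() as a plain character vector; the exact names returned depend on which packages are attached and on the R version, so no output is shown.

```r
library(GenomicRanges)

## Drop the printing class to obtain an ordinary character vector
m <- as.vector(methods(class="GRanges"))

## Methods whose name mentions overlaps, in any letter case
grep("overlap", m, ignore.case=TRUE, value=TRUE)
```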
Use help() to list the help pages in the GenomicRanges package,
and vignette() to view and access available vignettes; these are
also available in the RStudio 'Help' tab.
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
GRanges and GRangesList classes
Aside: 'TxDb' packages provide an R representation of gene models
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exons(): GRanges
exons(txdb)
## GRanges object with 289969 ranges and 1 metadata column:
##            seqnames               ranges strand   |   exon_id
##               <Rle>            <IRanges>  <Rle>   | <integer>
##        [1]     chr1       [11874, 12227]      +   |         1
##        [2]     chr1       [12595, 12721]      +   |         2
##        [3]     chr1       [12613, 12721]      +   |         3
##        [4]     chr1       [12646, 12697]      +   |         4
##        [5]     chr1       [13221, 14409]      +   |         5
##        ...      ...                  ...    ... ...       ...
##   [289965]     chrY [27607404, 27607432]      -   |    277746
##   [289966]     chrY [27635919, 27635954]      -   |    277747
##   [289967]     chrY [59358329, 59359508]      -   |    277748
##   [289968]     chrY [59360007, 59360115]      -   |    277749
##   [289969]     chrY [59360501, 59360854]      -   |    277750
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
exonsBy(): GRangesList
exonsBy(txdb, "tx")
## GRangesList object of length 82960:
## $1 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand |   exon_id   exon_name exon_rank
##          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
##   [1]     chr1 [11874, 12227]      + |         1        <NA>         1
##   [2]     chr1 [12613, 12721]      + |         3        <NA>         2
##   [3]     chr1 [13221, 14409]      + |         5        <NA>         3
## 
## $2 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12595, 12721]      + |       2      <NA>         2
##   [3]     chr1 [13403, 14409]      + |       6      <NA>         3
## 
## $3 
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames         ranges strand | exon_id exon_name exon_rank
##   [1]     chr1 [11874, 12227]      + |       1      <NA>         1
##   [2]     chr1 [12646, 12697]      + |       4      <NA>         2
##   [3]     chr1 [13221, 14409]      + |       5      <NA>         3
## 
## ...
## <82957 more elements>
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
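The relationship between the two classes is easier to see on a toy example. The sketch below builds a GRangesList by hand from four made-up exons assigned to two hypothetical transcripts, tx1 and tx2; it does not use the TxDb package.

```r
library(GenomicRanges)

## Four made-up exons on the plus strand of chr1
exons <- GRanges("chr1",
    IRanges(start=c(100, 300, 100, 500), width=c(50, 50, 50, 100)),
    strand="+")

## Group exons by transcript: a GRangesList with one GRanges per transcript
grl <- split(exons, c("tx1", "tx1", "tx2", "tx2"))
grl

length(grl)             # number of transcripts
sum(width(grl))         # total exonic width of each transcript
unlist(range(grl))      # genomic span of each transcript
```

Accessors such as width() return a list-like object with one element per transcript, which is why a summary like sum() gives one value per transcript.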
GRanges / GRangesList are incredibly useful
Many biologically interesting questions represent operations on ranges
GenomicAlignments::summarizeOverlaps()
GenomicRanges::nearest(), ChIPseeker
GRanges Algebra
shift(), narrow(), flank(), promoters(), resize(), restrict(), trim()
?"intra-range-methods"
range(), reduce(), gaps(), disjoin()
coverage() (!)
?"inter-range-methods"
findOverlaps(), countOverlaps(), …, %over%, %within%, %outside%; union(), intersect(), setdiff(), punion(), pintersect(), psetdiff()
Classes
Methods
reverseComplement()
letterFrequency()
matchPDict(), matchPWM()
Related packages
Example
Whole-genome sequences are distributed by ENSEMBL, NCBI, and others
as FASTA files; model organism whole genome sequences are packaged
into more user-friendly BSgenome packages. The following
calculates GC content across chr14.
library(BSgenome.Hsapiens.UCSC.hg19)
chr14_range <- GRanges("chr14", IRanges(1, seqlengths(Hsapiens)["chr14"]))
chr14_dna <- getSeq(Hsapiens, chr14_range)
letterFrequency(chr14_dna, "GC", as.prob=TRUE)
##           G|C
## [1,] 0.336276
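The same methods can be tried on sequences small enough to check by eye. This sketch uses two made-up sequences, named a and b here, and needs only the Biostrings package rather than a whole genome.

```r
library(Biostrings)

## Two short, made-up DNA sequences
dna <- DNAStringSet(c(a="AACGT", b="GGGCCCAT"))

## Reverse complement of each sequence
rc <- reverseComplement(dna)
rc

## Proportion of G or C in each sequence, one row per sequence
gc <- letterFrequency(dna, "GC", as.prob=TRUE)
gc

## Positions at which a pattern occurs in a single sequence
hits <- matchPattern("CCC", dna[["b"]])
start(hits)
```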
Classes – GenomicRanges-like behavior
Methods
readGAlignments(), readGAlignmentsList()
summarizeOverlaps()
Example
Find reads supporting the junction identified above, at position 19653707 + 66M = 19653773 of chromosome 14
library(GenomicRanges)
library(GenomicAlignments)
library(Rsamtools)
## our 'region of interest'
roi <- GRanges("chr14", IRanges(19653773, width=1)) 
## sample data
library('RNAseqData.HNRNPC.bam.chr14')
bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[[1]], asMates=TRUE)
## alignments, junctions, overlapping our roi
paln <- readGAlignmentsList(bf)
j <- summarizeJunctions(paln, with.revmap=TRUE)
j_overlap <- j[j %over% roi]
## supporting reads
paln[j_overlap$revmap[[1]]]
## GAlignmentsList object of length 8:
## [[1]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand      cigar qwidth    start      end width njunc
##   [1]    chr14      -  66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      + 7M1270N65M     72 19652348 19653689  1342     1
## 
## [[2]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      - 66M120N6M     72 19653707 19653898   192     1
##   [2]    chr14      +       72M     72 19653686 19653757    72     0
## 
## [[3]] 
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand     cigar qwidth    start      end width njunc
##   [1]    chr14      +       72M     72 19653675 19653746    72     0
##   [2]    chr14      - 65M120N7M     72 19653708 19653899   192     1
## 
## ...
## <5 more elements>
## -------
## seqinfo: 93 sequences from an unspecified genome
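The arithmetic behind the region of interest can be checked from the CIGAR string alone. The sketch below uses the CIGAR utilities in GenomicAlignments on the first alignment shown above, which starts at 19653707 with CIGAR 66M120N6M.

```r
library(GenomicAlignments)

cigar <- "66M120N6M"
pos <- 19653707L

## Reference positions covered by the aligned (M) portions of the read
aligned <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops="M")[[1]]
aligned

## Reference positions skipped (N), i.e., the junction itself
junction <- cigarRangesAlongReferenceSpace(cigar, pos=pos, ops="N")[[1]]
junction

## Total extent of the alignment on the reference
cigarWidthAlongReferenceSpace(cigar)
```

The first 66 aligned bases occupy 19653707 to 19653772, so the skipped region begins at 19653773, the position used for roi.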
Classes – GenomicRanges-like behavior
Functions and methods
readVcf(), readGeno(), readInfo(), readGT(), writeVcf(), filterVcf()
locateVariants() (variants overlapping ranges), predictCoding(), summarizeVariants()
genotypeToSnpMatrix(), snpSummary()
Example
Read variants from a VCF file, and annotate with respect to a known gene model
## input variants
library(VariantAnnotation)
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf) <- "chr22"
## known gene model
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
coding <- locateVariants(rowRanges(vcf),
    TxDb.Hsapiens.UCSC.hg19.knownGene,
    CodingVariants())
head(coding)
## GRanges object with 6 ranges and 9 metadata columns:
##     seqnames               ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
##        <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
##   1    chr22 [50301422, 50301422]      - |   coding       939       939        24       75253
##   2    chr22 [50301476, 50301476]      - |   coding       885       885        25       75253
##   3    chr22 [50301488, 50301488]      - |   coding       873       873        26       75253
##   4    chr22 [50301494, 50301494]      - |   coding       867       867        27       75253
##   5    chr22 [50301584, 50301584]      - |   coding       777       777        28       75253
##   6    chr22 [50302962, 50302962]      - |   coding       698       698        57       75253
##             CDSID      GENEID       PRECEDEID        FOLLOWID
##     <IntegerList> <character> <CharacterList> <CharacterList>
##   1        218562       79087                                
##   2        218562       79087                                
##   3        218562       79087                                
##   4        218562       79087                                
##   5        218562       79087                                
##   6        218563       79087                                
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
Related packages
Reference
import(): BED, GTF, WIG, 2bit, etc.
export(): GRanges to BED, GTF, WIG, …
Functions and methods
assay() / assays(), rowData() / rowRanges(), colData(), metadata()
subsetByOverlaps()
GenomicAlignments
Recall: overall workflow
BAM files of aligned reads
Header
@HD     VN:1.0  SO:coordinate
@SQ     SN:chr1 LN:249250621
@SQ     SN:chr10        LN:135534747
@SQ     SN:chr11        LN:135006516
...
@SQ     SN:chrY LN:59373566
@PG     ID:TopHat       VN:2.0.8b       CL:/home/hpages/tophat-2.0.8b.Linux_x86_64/tophat --mate-inner-dist 150 --solexa-quals --max-multihits 5 --no-discordant --no-mixed --coverage-search --microexon-search --library-type fr-unstranded --num-threads 2 --output-dir tophat2_out/ERR127306 /home/hpages/bowtie2-2.1.0/indexes/hg19 fastq/ERR127306_1.fastq fastq/ERR127306_2.fastq
Alignments
ID, flag, alignment and mate
ERR127306.7941162       403     chr14   19653689        3       72M             =       19652348        -1413  ...
ERR127306.22648137      145     chr14   19653692        1       72M             =       19650044        -3720  ...
Sequence and quality
... GAATTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCC        *'%%%%%#&&%''#'&%%%)&&%%$%%'%%'&*****$))$)'')'%)))&)%%%%$'%%%%&"))'')%))
... TTGATCAGTCTCATCTGAGAGTAACTTTGTACCCATCACTGATTCCTTCTGAGACTGCCTCCACTTCCCCAG        '**)****)*'*&*********('&)****&***(**')))())%)))&)))*')&***********)****
Tags
... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:2  CC:Z:chr22      CP:i:16189276   HI:i:0
... AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:72 YT:Z:UU NH:i:3  CC:Z:=  CP:i:19921600   HI:i:0
Typically, sorted (by position) and indexed ('.bai' files)
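The flag column in the alignments above (403 and 145) packs several yes / no properties into one integer, one bit per property. A minimal base R sketch of decoding it follows; the bit values come from the SAM specification, while the labels and the decode_flag() helper are names invented here.

```r
## Bit values defined by the SAM specification, with informal labels
sam_flag_bits <- c(
    paired=1L, proper_pair=2L, unmapped=4L, mate_unmapped=8L,
    reverse_strand=16L, mate_reverse_strand=32L,
    first_in_pair=64L, second_in_pair=128L,
    secondary=256L, qc_fail=512L, duplicate=1024L)

## Return the labels of the bits that are set in a single flag
decode_flag <- function(flag) {
    flag <- rep(as.integer(flag), length(sam_flag_bits))
    names(sam_flag_bits)[bitwAnd(flag, sam_flag_bits) != 0L]
}

decode_flag(403)
decode_flag(145)
```

Both reads are therefore paired, aligned to the reverse strand, and the second read of their pair; the first is additionally marked as properly paired and as a secondary alignment.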
Use an example BAM file (fl could be the path to your own BAM file)
## example BAM data
library(RNAseqData.HNRNPC.bam.chr14)
## one BAM file
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
## Let R know that this is a BAM file, not just a character vector
library(Rsamtools)
bfl <- BamFile(fl)
Input the data into R
aln <- readGAlignments(bfl)
aln
## GAlignments object with 800484 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width     njunc
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer>
##        [1]    chr14      +         72M        72  19069583  19069654        72         0
##        [2]    chr14      +         72M        72  19363738  19363809        72         0
##        [3]    chr14      -         72M        72  19363755  19363826        72         0
##        [4]    chr14      +         72M        72  19369799  19369870        72         0
##        [5]    chr14      -         72M        72  19369828  19369899        72         0
##        ...      ...    ...         ...       ...       ...       ...       ...       ...
##   [800480]    chr14      -         72M        72 106989780 106989851        72         0
##   [800481]    chr14      +         72M        72 106994763 106994834        72         0
##   [800482]    chr14      -         72M        72 106994819 106994890        72         0
##   [800483]    chr14      +         72M        72 107003080 107003151        72         0
##   [800484]    chr14      -         72M        72 107003171 107003242        72         0
##   -------
##   seqinfo: 93 sequences from an unspecified genome
readGAlignmentPairs() / readGAlignmentsList() if paired-end data
methods(class=class(aln))
##   [1] !=                     %in%                   <                      <=                    
##   [5] ==                     >                      >=                     NROW                  
##   [9] ROWNAMES               [                      [<-                    aggregate             
##  [13] anyNA                  append                 as.character           as.complex            
##  [17] as.data.frame          as.env                 as.integer             as.list               
##  [21] as.logical             as.numeric             as.raw                 c                     
##  [25] cigar                  coerce                 compare                countOverlaps         
##  [29] coverage               duplicated             elementMetadata        elementMetadata<-     
##  [33] end                    eval                   export                 extractROWS           
##  [37] findCompatibleOverlaps findOverlaps           findSpliceOverlaps     granges               
##  [41] grglist                head                   high2low               junctions             
##  [45] length                 mapCoords              mapFromAlignments      mapToAlignments       
##  [49] match                  mcols                  mcols<-                metadata              
##  [53] metadata<-             mstack                 names                  names<-               
##  [57] narrow                 njunc                  overlapsAny            parallelSlotNames     
##  [61] pintersect             pmapCoords             pmapFromAlignments     pmapToAlignments      
##  [65] qnarrow                qwidth                 ranges                 rank                  
##  [69] relist                 relistToClass          rename                 rep                   
##  [73] rep.int                replaceROWS            rev                    rglist                
##  [77] rname                  rname<-                seqinfo                seqinfo<-             
##  [81] seqlevelsInUse         seqnames               seqnames<-             shiftApply            
##  [85] show                   showAsCell             sort                   split                 
##  [89] split<-                start                  strand                 strand<-              
##  [93] subset                 subsetByOverlaps       summarizeOverlaps      table                 
##  [97] tail                   tapply                 unique                 update                
## [101] updateObject           values                 values<-               width                 
## [105] window                 window<-               with                   xtfrm                 
## see '?methods' for accessing help and source code
Caveat emptor: BAM files are large. Normally you will
restrict the input to particular genomic ranges, or iterate
through the BAM file. Key Bioconductor functions (e.g.,
GenomicAlignments::summarizeOverlaps()) do this data management
step for you. See next section!
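Restricting input to a genomic range might look like the following sketch. The coordinates of the region are arbitrary, chosen here only to surround the junction discussed earlier; which= requires that the BAM file be indexed, as the example data are.

```r
library(GenomicAlignments)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)

bfl <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])

## An arbitrary 1 kb window on chr14 (illustrative coordinates)
roi <- GRanges("chr14", IRanges(19653000, 19654000))

## which: only alignments overlapping roi; what: also input the read sequence
param <- ScanBamParam(which=roi, what="seq")
aln_roi <- readGAlignments(bfl, param=param)

aln_roi
mcols(aln_roi)$seq
```

Only the alignments overlapping the window are read into memory, together with the one extra 'column' requested.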
BiocParallel, GenomicFiles
ScanBamParam()
which: genomic ranges of interest
what: 'columns' of BAM file, e.g., 'seq', 'flag'
BamFile(..., yieldSize=100000)
Iterative programming model
Use GenomicFiles::reduceByYield()
library(GenomicFiles)
yield <- function(bfl) {
    ## input a chunk of alignments
    library(GenomicAlignments)
    readGAlignments(bfl, param=ScanBamParam(what="seq"))
}
map <- function(aln) { 
    ## Count G or C nucleotides per read
    library(Biostrings)
    gc <- letterFrequency(mcols(aln)$seq, "GC")
    ## Summarize number of reads with 0, 1, ... G or C nucleotides
    tabulate(1 + gc, 73)                # max. read length: 72
}
reduce <- `+`
Example
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bf <- BamFile(fls[1], yieldSize=100000)
gc <- reduceByYield(bf, yield, map, reduce)
plot(gc, type="h",
     xlab="GC Content per Aligned Read", ylab="Number of Reads")
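It may help to see the iteration written out by hand. The loop below is a simplified sketch of the pattern that reduceByYield() manages, not its actual implementation: it opens the file, reads chunks of yieldSize records until none remain, and keeps a running total, here simply the number of alignments.

```r
library(GenomicAlignments)
library(Rsamtools)
library(RNAseqData.HNRNPC.bam.chr14)

bf <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1], yieldSize=100000)

open(bf)                          # an open file remembers its position
n <- 0L
repeat {
    chunk <- readGAlignments(bf)  # the next (at most) 100000 alignments
    if (length(chunk) == 0L)      # nothing left to read
        break
    n <- n + length(chunk)        # 'map' a chunk to a count, 'reduce' by adding
}
close(bf)

n
```

Memory use is bounded by the chunk size rather than by the size of the file, which is the point of the iterative model.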
 
Many problems are embarrassingly parallel – lapply()-like –
especially in bioinformatics where parallel evaluation is across
files
Example: GC content in several BAM files
library(BiocParallel)
gc <- bplapply(BamFileList(fls), reduceByYield, yield, map, reduce)
library(ggplot2)
df <- stack(as.data.frame(lapply(gc, cumsum)))
df$GC <- 0:72
ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) +
    xlab("Number of GC Nucleotides per Read") +
    ylab("Number of Reads")
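How the work is distributed is controlled by the BPPARAM argument of bplapply(). A small sketch with a toy function, square(), which is a name made up here:

```r
library(BiocParallel)

square <- function(x) x^2

## Serial evaluation: useful for debugging, no workers are started
serial <- bplapply(1:4, square, BPPARAM=SerialParam())

## Two worker processes; SnowParam() is available on all platforms,
## MulticoreParam() is an alternative on Linux and Mac
parallel <- bplapply(1:4, square, BPPARAM=SnowParam(workers=2))

identical(serial, parallel)
```

Because the result has the same structure as that of lapply(), code can be developed serially and switched to parallel evaluation by changing only the BPPARAM argument.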