Epigenomics 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 24 August, 2014
(Illumina) Sequencing (Bentley et al., 2008, doi:10.1038/nature07517)
Alignment
Analysis a. Reduction, e.g., to 'count table' b. Differential expression, peak identification, …
Comprehension
Analysis and comprehension of high-throughput genomic data
Packages, vignettes, work flows
Objects
getClass()
, showMethods(..., where=search())
,
selectMethod()
method?"substr,<tab>"
to select help on methods, class?D<tab>
for help on classesExample
require(Biostrings) # Biological sequences
data(phiX174Phage) # sample data, see ?phiX174Phage
phiX174Phage
## A DNAStringSet instance of length 6
## width seq names
## [1] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Genbank
## [2] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA RF70s
## [3] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA SS78
## [4] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA Bull
## [5] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA G97
## [6] 5386 GAGTTTTATCGCTTCCATGAC...ATTGGCGTATCCAACCTGCA NEB03
m <- consensusMatrix(phiX174Phage)[1:4,] # nucl. x position counts
polymorphic <- which(colSums(m != 0) > 1)
m[, polymorphic]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## A 4 5 4 3 0 0 5 2 0
## C 0 0 0 0 5 1 0 0 5
## G 2 1 2 3 0 0 1 4 0
## T 0 0 0 0 1 5 0 0 1
showMethods(class=class(phiX174Phage), where=search())
vignette(package="Biostrings")
. Add another argument to the
vignette
function to view the 'BiostringsQuickOverview' vignette. library(Biostrings)
data(phiX174Phage)
Explain what the following code does, and how it works
m <- consensusMatrix(phiX174Phage)[1:4,]
polymorphic <- which(colSums(m != 0) > 1)
mapply(substr, polymorphic, polymorphic, MoreArgs=list(x=phiX174Phage))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## Genbank "G" "G" "A" "A" "C" "C" "A" "G" "C"
## RF70s "A" "A" "A" "G" "C" "T" "A" "G" "C"
## SS78 "A" "A" "A" "G" "C" "T" "A" "G" "C"
## Bull "G" "A" "G" "A" "C" "T" "A" "A" "T"
## G97 "A" "A" "G" "A" "C" "T" "G" "A" "C"
## NEB03 "A" "A" "A" "G" "T" "T" "A" "G" "C"
This section focuses on classes, methods, and packages, with the goal being to learn to navigate the help system and interactive discovery facilities.
Sequence analysis is specialized
Additional considerations
Solution: use well-defined classes to represent complex data; methods operate on the classes to perform useful functions. Classes and methods are placed together and distributed as packages so that we can all benefit from the hard work and tested code of others.
The IRanges package defines an important class for specifying integer ranges, e.g.,
library(IRanges)
ir <- IRanges(start=c(10, 20, 30), width=5)
ir
## IRanges of length 3
## start end width
## [1] 10 14 5
## [2] 20 24 5
## [3] 30 34 5
There are many interesting operations to be performed on ranges, e.g,
flank()
identifies adjacent ranges
flank(ir, 3)
## IRanges of length 3
## start end width
## [1] 7 9 3
## [2] 17 19 3
## [3] 27 29 3
Consult the help page for flank, ?flank
, and explore other
range-based operations.
The IRanges
class is part of a class hierarchy. To see this, ask R for
the class of ir
, and for the class definition of the IRanges
class
class(ir)
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
getClassDef(class(ir))
## Class "IRanges" [package "IRanges"]
##
## Slots:
##
## Name: start width NAMES elementType
## Class: integer integer characterORNULL character
##
## Name: elementMetadata metadata
## Class: DataTableORNULL list
##
## Extends:
## Class "Ranges", directly
## Class "IntegerList", by class "Ranges", distance 2
## Class "RangesORmissing", by class "Ranges", distance 2
## Class "AtomicList", by class "Ranges", distance 3
## Class "List", by class "Ranges", distance 4
## Class "Vector", by class "Ranges", distance 5
## Class "Annotated", by class "Ranges", distance 6
##
## Known Subclasses: "NormalIRanges"
Notice that IRanges
extends the Ranges
class. Now try entering
?"flank,<tab>
, where <tab>
means to press the tab key to ask for
tab completion (may not be necessary in Rstudio). You can see that
there are help pages for several different classes. Tab-complete to
?"flank,Ranges-method"
and verify that you're at the page that describes the method relevant
to an IRanges
instance.
The GenomicRanges package extends the notion of ranges to include
features relevant to application of ranges in sequence analysis,
particularly the ability to associate a range with a sequence name
(e.g., chromosome) and a strand. Create a GRanges
instance based on
our IRanges
instance, as follows
library(GenomicRanges)
## Loading required package: GenomeInfoDb
gr <- GRanges(c("chr1", "chr1", "chr2"), ir, strand=c("+", "-", "+"))
gr
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [10, 14] +
## [2] chr1 [20, 24] -
## [3] chr2 [30, 34] +
## ---
## seqlengths:
## chr1 chr2
## NA NA
The notion of flanking sequence has a more nuanced meaning in
biology. In particular we might expect that flanking sequence on the
+
strand would precede the range, but on the minus strand would
follow it. Verify that flank
applied to a GRanges
object has this
behavior.
flank(gr, 3)
## GRanges with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [ 7, 9] +
## [2] chr1 [25, 27] -
## [3] chr2 [27, 29] +
## ---
## seqlengths:
## chr1 chr2
## NA NA
Discover what classes GRanges
extends, find the help page
documenting the behavior of flank
when applied to a GRanges object,
and verify that the help page documents the behavior we just observed.
class(gr)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
getClassDef(class(gr))
## Class "GRanges" [package "GenomicRanges"]
##
## Slots:
##
## Name: seqnames ranges strand elementMetadata
## Class: Rle IRanges Rle DataFrame
##
## Name: seqinfo metadata
## Class: Seqinfo list
##
## Extends:
## Class "GenomicRanges", directly
## Class "Vector", by class "GenomicRanges", distance 2
## Class "GenomicRangesORmissing", by class "GenomicRanges", distance 2
## Class "GenomicRangesORGRangesList", by class "GenomicRanges", distance 2
## Class "Annotated", by class "GenomicRanges", distance 3
?"flank,GenomicRanges-method"
Notice that the available flank()
methods have been augmented by the
methods defined in the GenomicRanges package.
It seems like there might be a number of helpful methods available for
working with genomic ranges; we can discover some of these from the
command line, indicating that the methods should be on the current
search()
path
showMethods(class="GRanges", where=search())
Use help()
to list the help pages in the GenomicRanges
package,
and vignettes()
to view and access available vignettes; these are
also available in the Rstudio 'Help' tab.
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; many of the material will be covered in greater detail in subsequent labs and lectures.
Basics
library(GenomicRanges)
and the help system queried interactively, as outlined above:
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges", "GenomicRangesHOWTOs")
?GRanges
Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.
Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure that we will encounter later in the course.
?consensusMatrix
,
for instance. Also check out the BSgenome package for working
with whole genome sequences, e.g., ?"getSeq,BSgenome-method"
?readGAlignments
help
page and vigentte(package="GenomicAlignments",
"summarizeOverlaps")
import
and export
functions can read in many
common file types, e.g., BED, WIG, GTF, …, in addition to querying
and navigating the UCSC genome browser. Check out the ?import
page
for basic usage.Annotation: Bioconductor provides extensive access to 'annotation' resources (see the AnnotationData biocViews hierarchy); these are covered in greater detail in Thursday's lab, but some interesting examples to explore during this lab include:
?select
?exonsBy
page to retrieve all
exons grouped by gene or transcript.Bioconductor is a large collection of R packages for the analysis and comprehension of high-throughput genomic data. Bioconductor relies on formal classes to represent genomic data, so it is important to develop a rudimentary comfort with classes, including seeking help for classes and methods. Bioconductor uses vignettes to augment traditional help pages; these can be very valuable in illustrating overall package use.