This very open-ended topic points to some of the most prominent Bioconductor packages for sequence analysis. Use the opportunity in this lab to explore the package vignettes and help pages highlighted below; much of this material will be covered in greater detail in subsequent labs and lectures.
Domain-specific analysis – explore the landing pages, vignettes, and reference manuals of two or three of the following packages.
Working with sequences, alignments, common web file formats, and raw data; these packages rely very heavily on the IRanges / GenomicRanges infrastructure.

- Biostrings: DNA, RNA, and amino acid sequences and their manipulation; see ?consensusMatrix, for instance, and the short sketch after this list. Also check out the BSgenome package for working with whole-genome sequences, e.g., ?"getSeq,BSgenome-method".
- GenomicAlignments: working with aligned reads; see the ?readGAlignments help page and vignette(package="GenomicAlignments", "summarizeOverlaps").
- rtracklayer: its import and export functions can read in many common file types, e.g., BED, WIG, GTF, …, in addition to querying and navigating the UCSC genome browser. Check out the ?import page for basic usage.
Visualization – see, e.g., the Gviz and ggbio packages.

Efficient R code
The goal of this section is to highlight practices for writing correct, robust and efficient R code.

- Correct: check results against hand-worked examples (identical(), all.equal()).
- Robust: support realistic inputs, e.g., NA values, …
- Efficient: time code systematically with system.time() or the microbenchmark package; profile it with the Rprof() function, or packages such as lineprof or aprof.

Vectorize – operate on vectors, rather than explicit loops:
x <- 1:10
log(x) ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101 2.0794415 2.1972246
## [10] 2.3025851
Pre-allocate memory, then fill in the result
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
    result[i] <- runif(1) * result[i - 1]
result
## [1] 0.4951848784 0.2865352417 0.1203113475 0.0282785626 0.0117993505 0.0040204726 0.0025406804
## [8] 0.0011436998 0.0010444164 0.0002276832
Two further strategies: hoist common sub-expressions out of loops — from a simple constant multiplier in a for loop up to, e.g., using lm.fit() rather than repeatedly fitting the same design matrix (see the sketch below) — and re-use existing, tested implementations of common operations, e.g., tabulate(), rowSums() and friends, %in%, …
Here's an obviously inefficient function:
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()            ## not pre-allocated: grows on each iteration
    for (i in seq_len(n))
        result[[i]] <- a * log(i)  ## multiplier applied inside the loop
    result
}
Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.
system.time(f0(10000))
## user system elapsed
## 0.284 0.004 0.285
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
Remember the current ‘correct’ value, and an approximate time
n <- 10000
system.time(expected <- f0(n))
## user system elapsed
## 0.272 0.000 0.270
head(expected)
## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519
Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same. Use the microbenchmark package to compare the two versions:
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result                     ## multiplier hoisted out of the loop
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f0(n) 246.0423 264.9063 261.8002 264.9214 265.3211 267.8099 5 a
## f1(n) 219.2913 221.7509 246.0515 261.9951 263.2040 264.0162 5 a
Adopt a ‘pre-allocate and fill’ strategy
f2 <- function(n, a=2) {
    result <- numeric(n)           ## pre-allocate to final length
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f0(n) 214.56280 236.09284 268.35744 244.4207 323.00509 323.7058 5 b
## f2(n) 11.68582 11.75021 12.18971 11.8478 12.04433 13.6204 5 a
Use an *apply() function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f0(n) 210.639096 212.705845 254.638491 214.667187 315.358004 321.404387 10 b
## f2(n) 11.724926 11.803303 12.009622 11.916915 12.079235 12.733139 10 a
## f3(n) 6.126175 6.152834 6.335576 6.339932 6.441334 6.720913 10 a
Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## f0(n) 207472.674 210034.700 251660.864 212280.688 312108.901 317604.029 10 b
## f3(n) 6043.009 6085.972 6148.863 6122.600 6228.416 6292.337 10 a
## f4(n) 364.354 365.775 374.082 373.359 382.867 392.421 10 a
f4() definitely seems to be the winner. How does it scale with n? (Repeat several times.)
n <- 10 ^ (5:8) # 10 to 10,000 times larger than for f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
Any explanations for the different pattern of response?
Lessons learned:

*apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.

When data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.
- Iteration – open() a file, read chunk(s), close(); e.g., the yieldSize argument to Rsamtools::BamFile(), with GenomicFiles::reduceByYield() providing a general framework (see the sketch after this list).
- Restriction – read only the fields or genomic ranges of interest, e.g., via Rsamtools::ScanBamParam(), Rsamtools::PileupParam(), or VariantAnnotation::ScanVcfParam().
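A minimal sketch of both ideas, assuming a local, indexed BAM file 'my.bam' (the file name, yield size, and region are placeholders):

library(Rsamtools)
library(GenomicAlignments)
library(GenomicFiles)
## Iteration: read up to 100,000 records per yield; sum per-chunk counts
bf <- BamFile("my.bam", yieldSize=100000)
nAligned <- reduceByYield(bf,
    YIELD=function(x) readGAlignments(x),  ## read the next chunk
    MAP=length,                            ## records in this chunk
    REDUCE=`+`)                            ## running total
## Restriction: read only alignments overlapping a region of interest
param <- ScanBamParam(which=GRanges("chr1", IRanges(1, 1e6)))
aln <- readGAlignments(BamFile("my.bam"), param=param)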
Parallel evaluation

BiocParallel provides a standardized interface for simple parallel evaluation. The package provides access to the snow and multicore functionality in the parallel package, as well as to BatchJobs for running cluster jobs.
General ideas:

- Use bplapply() instead of lapply().
- The BPPARAM argument influences how parallel evaluation occurs (a short sketch follows this list):
  - MulticoreParam(): forked processes on a single (non-Windows) machine
  - SnowParam(): processes on the same or different machines
  - BatchJobsParam(): resource scheduler on a cluster
  - DoparParam(): parallel execution with foreach()
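For example, a minimal sketch of selecting a back-end explicitly (the worker count is arbitrary; SnowParam() works on all platforms):

library(BiocParallel)
param <- SnowParam(workers=2)          ## two worker processes
res <- bplapply(1:4, function(i) Sys.getpid(), BPPARAM=param)
unique(unlist(res))                    ## typically two distinct process ids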
This small example motivates the use of parallel execution and demonstrates how bplapply() can be a drop-in replacement for lapply().
Use system.time() to explore how long this takes to execute as n increases from 1 to 10. Use identical() and microbenchmark to compare alternatives f0() and f1() for both correctness and performance.
fun sleeps for 1 second, then returns i.
library(BiocParallel)
fun <- function(i) {
    Sys.sleep(1)
    i
}
## serial
f0 <- function(n)
    lapply(seq_len(n), fun)
## parallel
f1 <- function(n)
    bplapply(seq_len(n), fun)
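One way to carry out the comparison, as a sketch (elapsed times depend on how many workers the registered back-end provides):

n <- 5
system.time(f0(n))      ## ~5 s elapsed: serial, n x 1 s of sleeping
system.time(f1(n))      ## ~1 s elapsed with 5 or more workers
identical(f0(n), f1(n)) ## both return list(1, 2, ..., n)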