Authors: Valerie Obenchain (vobencha@fredhutch.org), Martin Morgan (mtmorgan@fredhutch.org)
Date: 22 July, 2015
The goal of this section is to highlight practices for writing correct, robust and efficient R code.
identical()
, all.equal()
)NA
values, …system.time()
or the microbenchmark package.Rprof()
function, or packages such as lineprof or aprofVectorize – operate on vectors, rather than explicit loops
x <- 1:10
log(x) ## NOT for (i in seq_along) x[i] <- log(x[i])
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
Pre-allocate memory, then fill in the result
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
result[i] <- runif(1) * result[i - 1]
result
## [1] 0.610156070 0.326579479 0.319540555 0.182668253 0.078093233
## [6] 0.056413616 0.021594612 0.017351821 0.006025564 0.004570092
for
looplm.fit()
rather than repeatedly fitting the same design matrix.tabulate()
, rowSums()
and friends, %in%
, …Here’s an obviously inefficient function:
f0 <- function(n, a=2) {
## stopifnot(is.integer(n) && (length(n) == 1) &&
## !is.na(n) && (n > 0))
result <- numeric()
for (i in seq_len(n))
result[[i]] <- a * log(i)
result
}
Use system.time()
to investigate how this algorithm scales with n
, focusing on elapsed time.
system.time(f0(10000))
## user system elapsed
## 0.150 0.001 0.150
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
Remember the current ‘correct’ value, and an approximate time
n <- 10000
system.time(expected <- f0(n))
## user system elapsed
## 0.141 0.000 0.142
head(expected)
## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519
Revise the function to hoist the common multiplier, a
, out of the loop. Make sure the result of the ‘optimization’ and the original calculation are the same. Use the microbenchmark package to compare the two versions
f1 <- function(n, a=2) {
result <- numeric()
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
## expr min lq mean median uq max neval
## f0(n) 98.11787 154.91307 145.9875 158.2557 158.9396 159.7114 5
## f1(n) 96.93034 97.99369 133.8523 156.3321 158.9998 159.0055 5
Adopt a ‘pre-allocate and fill’ strategy
f2 <- function(n, a=2) {
result <- numeric(n)
for (i in seq_len(n))
result[[i]] <- log(i)
a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
## expr min lq mean median uq max neval
## f0(n) 102.20437 162.75624 152.84236 163.04595 167.98829 168.21692 5
## f2(n) 11.05322 11.35959 13.71776 13.29358 15.93749 16.94491 5
Use an *apply()
function to avoid having to explicitly pre-allocate, and make opportunities for vectorization more apparent.
f3 <- function(n, a=2)
a * sapply(seq_len(n), log)
identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
## expr min lq mean median uq max
## f0(n) 100.830765 107.964489 145.823291 160.782134 164.901310 166.962340
## f2(n) 9.370014 11.162675 12.739058 11.709209 15.512331 17.017115
## f3(n) 4.640934 4.703644 5.799298 5.444969 6.617305 8.305891
## neval
## 10
## 10
## 10
Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:
f4 <- function(n, a=2)
a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
## expr min lq mean median uq max
## f0(n) 101977.842 158804.803 150045.0529 161249.9995 162876.991 165036.864
## f3(n) 4221.855 4419.104 4925.7432 4651.7185 4885.559 6511.562
## f4(n) 396.465 401.305 412.4188 411.3425 418.137 441.464
## neval
## 10
## 10
## 10
f4()
definitely seems to be the winner. How does it scale with n
? (Repeat several times)
n <- 10 ^ (5:8) # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
Any explanations for the different pattern of response?
Lessons learned:
*apply()
functions help avoid need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth itWhen data are too large to fit in memory, we can iterate through files in chunks or subset the data by fields or genomic positions.
Iteration - Chunk-wise - open()
, read chunk(s), close()
. - e.g., yieldSize
argument to Rsamtools::BamFile()
- Framework: GenomicFiles::reduceByYield()
Restriction - Limit to columns and / or rows of interest - Exploit domain-specific formats – BAM files and Rsamtools::ScanBamParam()
– BAM files and Rsamtools::PileupParam()
– VCF files and VariantAnnotation::ScanVcfParam()
- Use a data base
Iterate through files: `GenomicFiles::reduceByYield()
suppressPackageStartupMessages({
library(GenomicFiles)
library(GenomicAlignments)
library(Rsamtools)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
})
yield <- # how to input the next chunk of data
function(X, ...)
{
readGAlignments(X)
}
map <- # what to do to each chunk
function(VALUE, ..., roi)
{
olaps <- findOverlaps(VALUE, roi, type="within", ignore.strand=TRUE)
count <- tabulate(subjectHits(olaps), subjectLength(olaps))
setNames(count, names(roi))
}
reduce <- `+` # how to combine mapped chunks
Improvement: “yield factory” to keep track of how many records input
yieldFactory <- # return a function with local state
function()
{
n_records <- 0L
function(X, ...) {
aln <- readGAlignments(X)
n_records <<- n_records + length(aln)
message(n_records)
aln
}
}
Regions of interest, named like the chromosomes in the bam file.
exByTx <- exonsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")
fl <- "/home/ubuntu/data/vobencha/LargeData/srarchive/hg19_alias.tab"
map0 <- read.delim(fl, header=FALSE, stringsAsFactors=FALSE)
seqlevels(exByTx, force=TRUE) <- setNames(map0$V1, map0$V2)
A function to iterate through a bam file.
count1 <- function(filename, roi) {
message(filename)
## Create and open BAM file
bf <- BamFile(filename, yieldSize=1000000)
reduceByYield(bf, yieldFactory(), map, reduce, roi=roi)
}
In action
bam <- "/home/ubuntu/data/vobencha/LargeData/srarchive/SRR1039508_sorted.bam"
count <- count1(bam, exByTx)
Type | Example use | Name | Package |
---|---|---|---|
.bed | Range annotations | BedFile() |
rtracklayer |
.wig | Coverage | WigFile() , BigWigFile() |
rtracklayer |
.gtf | Transcript models | GTFFile() |
rtracklayer |
makeTxDbFromGFF() |
GenomicFeatures | ||
.2bit | Genomic Sequence | TwoBitFile() |
rtracklayer |
.fastq | Reads & qualities | FastqFile() |
ShortRead |
.bam | Aligned reads | BamFile() |
Rsamtools |
.tbx | Indexed tab-delimited | TabixFile() |
Rsamtools |
.vcf | Variant calls | VcfFile() |
VariantAnnotation |
## rtracklayer menagerie
suppressPackageStartupMessages(library(rtracklayer))
names(getClass("RTLFile")@subclasses)
## [1] "UCSCFile" "GFFFile" "BEDFile"
## [4] "WIGFile" "BigWigFile" "ChainFile"
## [7] "TwoBitFile" "FastaFile" "TabSeparatedFile"
## [10] "CompressedFile" "GFF1File" "GFF2File"
## [13] "GFF3File" "BEDGraphFile" "BED15File"
## [16] "BWFile" "2BitFile" "GTFFile"
## [19] "GVFFile" "GZFile" "BGZFile"
## [22] "BZ2File" "XZFile"
Notes
open()
, close()
, import()
/ yield()
/ read*()
.bai
, bam index); selection (‘columns’); restriction (‘rows’)*FileList()
classes
reduceByYield()
– iterate through a single large filebplapply()
(BiocParallel) – perform independent operations on several files, in parallelGenomicFiles()
reduceByRange()
, reduceByFile()
: collapse maps into summary representationA couple of caveats -
Iteration / restriction techniques keep the memory requirements under control while parallel evaluation distributes computational load across nodes. Keep in mind that parallel computations are still restricted by the amount of memory available on each node.
There is overhead in setting up and tearing down a cluster and more so when computing in distributed memory. For small calculations, the parallel overhead may outweigh the benefits with no improvement in performance.
Jobs that benefit the most from parallel execution are CPU-intensive and operate on data chunks that fits into memory.
BiocParallel provides a standardized interface for parallel evaluation and supports the major parallel computing styles: forks and processes on a single computer, ad hoc clusters, batch schedulers and cloud computing. By default, BiocParallel chooses a parallel back-end appropriate for the OS and is supported across Unix, Mac and Windows.
General ideas:
bplapply()
instead of lapply()
Argument BPPARAM
influences how parallel evaluation occurs
MulticoreParam()
: threads on a single (non-Windows) machineSnowParam()
: processes on the same or different machinesBatchJobsParam()
: resource scheduler on a clusterDoparParam()
: parallel execution with foreach()
This small example motivates the use of parallel execution and demonstrates how bplapply()
can be a drop in for lapply
.
Use system.time()
to explore how long this takes to execute as n
increases from 1 to 10. Use identical()
and microbenchmark to compare alternatives f0()
and f1()
for both correctness and performance.
fun
sleeps for 1 second, then returns i
.
library(BiocParallel)
fun <- function(i) {
Sys.sleep(1)
i
}
## serial
f0 <- function(n)
lapply(seq_len(n), fun)
## parallel
f1 <- function(n)
bplapply(seq_len(n), fun)
bpredo()
BiocParallel “catches and returns” errors along with successful results. This exercise demonstrates how to access the traceback()
of a failed task and how to re-run the failed tasks with ‘BPREDO’. Full details on error handling, logging and debugging are in the Errors, Logs and Debugging vignette.
param <- MulticoreParam(workers = 3)
param
Call the sqrt()
function across ‘X’; the second element is a character and will throw and error.
X <- list(1, "2", 3)
res <- bplapply(X, sqrt, BPPARAM = param)
res
Along with the error message, the traceback is stored in the return list element and can be accessed with attr()
:
attr(res[[2]], "traceback")
Re-run the failed results by repeating the call to bplapply()
this time with corrected input data and the partial results as ‘BPREDO’.
X.redo <- list(1, 2, 3)
bplapply(X.redo, sqrt, BPREDO = res)
BiocParallel uses the futile.logger package for logging. The package has a flexible system for filtering messages of varying severity thresholds such as INFO, DEBUG, ERROR etc. (For a list of all thresholds see the ?bpthreshold man page.) BiocParallel captures messages written in futile.logger format as well as messages written to stdout and stderr.
This function does some argument checking and has DEBUG, WARN and INFO-level log messages.
FUN <- function(i) {
flog.debug(paste0("value of 'i': ", i))
if (!length(i)) {
flog.warn("'i' is missing")
NA
} else if (!is(i, "numeric")) {
flog.info("coercing to numeric")
as.numeric(i)
} else {
i
}
}
Turn logging on in the param and set the threshold to WARN.
param <- SnowParam(3, log = TRUE, threshold = "WARN")
bplapply(list(1, "2", integer()), FUN, BPPARAM = param)
Lower the threshold to INFO and DEBUG (i.e., use bpthreshold<-
) to see how messages are filtered on severity.
For long running jobs or untested code it can be useful to set a time limit. The timeout field is the time, in seconds, allowed for each worker to complete a task. If a task takes longer than timeout the worker returns an error.
timeout can be set during param construction,
param <- SnowParam(timeout = 20)
param
## class: SnowParam
## bpjobname:BPJOB; bpworkers:2; bptasks:0; bptimeout:20; bpRNGseed:; bpisup:FALSE
## bplog:FALSE; bpthreshold:INFO; bplogdir:NA
## bpstopOnError:FALSE; bpprogressbar:FALSE
## bpresultdir:NA
## cluster type: SOCK
or with the setter:
bptimeout(param) <- 2
param
## class: SnowParam
## bpjobname:BPJOB; bpworkers:2; bptasks:0; bptimeout:2; bpRNGseed:; bpisup:FALSE
## bplog:FALSE; bpthreshold:INFO; bplogdir:NA
## bpstopOnError:FALSE; bpprogressbar:FALSE
## bpresultdir:NA
## cluster type: SOCK
Use this function to explore different _timeout_s over a numeric vector of ‘X’ values with bplapply()
. ‘X’ values less than timeout return successfully while those over threshold return an error.
fun <- function(i) {
Sys.sleep(i)
i
}
Distribute files over workers: GenomicFiles::reduceByFile()
The previous counting example used GenomicFiles::reduceByYield()
which operates on a single file and implements a yield, map, reduce paradigm. In this exercise we’ll use GenomicFiles::reduceByFile()
which uses bplaply()
under the hood to operate on multiple files in parallel.
Primary arguments to reduceByFile()
are a set of files and a set of ranges. Files are sent to the workers and data subsets are extracted based on the ranges. The bulk of the work is done in the MAP function and an optional REDUCE function combines the output on each worker.
suppressPackageStartupMessages({
library(BiocParallel)
library(GenomicFiles)
library(GenomicAlignments)
library(Rsamtools)
})
On Unix or Mac, configure a MulticoreParam()
with 4 workers. Turn on logging and set a timeout of 60 seconds.
param <- MulticoreParam(4, log = TRUE, timeout = 60)
On Windows do the same with SnowParam()
:
param <- SnowParam(4, log = TRUE, timeout = 60)
Point to the collection of bam files.
fls <- dir("/home/ubuntu/data/vobencha/LargeData/copynumber", ".bam$", full=TRUE)
names(fls) <- basename(fls)
bfl <- BamFileList(fls)
Defining ranges (region of interest) restricts the amount of data on the workers and keeps memory requirements under control. We’ll use a set of ranges on the Major Histocompatibility Complex locus on chromosome 6.
ranges <- GRanges("chr6", IRanges(c(28477797, 29527797, 32448354),
c(29477797, 30527797, 33448354)))
The MAP function reads in records and counts overlaps. readGAlignments()
reads in bam records that overlap with any portion of the ranges defined in the scanBamParam (i.e., they could be overlapping the start or the end). Once we’ve got the records in R, we want to count only those that fall ‘within’ the ranges.
MAP <- function(range, file, ...) {
library(GenomicAlignments) ## readGAlignments(), ScanBamParam()
param = ScanBamParam(which=range) ## restriction
gal = readGAlignments(file, param=param)
## log messages
flog.info(paste0("file: ", basename(file)))
flog.debug(paste0("records: ", length(gal)))
## overlaps
olaps <- findOverlaps(gal, range, type="within", ignore.strand=TRUE)
tabulate(subjectHits(olaps), subjectLength(olaps))
}
Count …
cts <- reduceByFile(ranges, fls, MAP, BPPARAM = param)
The result is a list the same length as the number of files.
length(cts)
Each list element is the length of the number of ranges.
elementLengths(cts)
Tables of counts for each range are extracted with ‘[[’:
cts[[1]]
GoogleGenomics to interact with google compute cloud and resources
Lawrence, M, and Morgan, M. 2014. Scalable Genomics with R and Bioconductor. Statistical Science 2014, Vol. 29, No. 2, 214-226. http://arxiv.org/abs/1409.2864v1
BiocParallel: http://bioconductor.org/packages/release/bioc/html/BiocParallel.html
GenomicFiles: http://bioconductor.org/packages/release/bioc/html/GenomicFiles.html