Authors: Martin Morgan (mtmorgan@fhcrc.org), Sonali Arora (sarora@fredhutch.org)
Date: 19 June, 2015
The goal of this section is to learn to write correct, robust, simple and efficient R code. We do this through a couple of case studies.
Along the way we rely on a few general strategies:

- Check correctness against hand-worked examples or known results, using identical() or all.equal().
- Write robust code that supports realistic inputs, e.g., NA values, …
- Time selected lines or chunks of code with system.time() or the microbenchmark package.
- Profile whole computations with the Rprof() function, or packages such as lineprof or aprof (a short Rprof() sketch appears after this list).
- Vectorize – operate on vectors, rather than explicit loops:
x <- 1:10
log(x)    ## NOT for (i in seq_along(x)) x[i] <- log(x[i])
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
## [8] 2.0794415 2.1972246 2.3025851
- Pre-allocate memory, then fill in the result:
result <- numeric(10)
result[1] <- runif(1)
for (i in 2:length(result))
    result[i] <- runif(1) * result[i - 1]
result
## [1] 0.7386850875 0.3696589691 0.3542664926 0.0510763549 0.0254611460
## [6] 0.0070750491 0.0038783618 0.0011608073 0.0007160663 0.0006206700
- Hoist common sub-expressions out of repeated calculations, e.g., move a constant multiplier outside a for loop, or use lm.fit() rather than repeatedly fitting the same design matrix.
- Re-use existing, tested code, e.g., tabulate(), rowSums() and friends, %in%, …
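Profiling is not demonstrated elsewhere in this section, so here is a minimal sketch of Rprof() in action; the output file name and the profiled expression are arbitrary examples, not part of the original material.

## Profile an arbitrary computation with R's built-in sampling profiler
Rprof("example-profile.out")                      # start writing samples to a file
invisible(replicate(50, sum(sort(rnorm(1e5)))))   # some work to profile
Rprof(NULL)                                       # stop profiling
head(summaryRprof("example-profile.out")$by.self) # time spent in each function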
Here's an obviously inefficient function:
f0 <- function(n, a=2) {
    ## stopifnot(is.integer(n) && (length(n) == 1) &&
    ##           !is.na(n) && (n > 0))
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- a * log(i)
    result
}
Use system.time() to investigate how this algorithm scales with n, focusing on elapsed time.
system.time(f0(10000))
## user system elapsed
## 0.119 0.004 0.122
n <- 1000 * seq(1, 20, 2)
t <- sapply(n, function(i) system.time(f0(i))[[3]])
plot(t ~ n, type="b")
Remember the current ‘correct’ value, and an approximate time
n <- 10000
system.time(expected <- f0(n))
## user system elapsed
## 0.121 0.000 0.120
head(expected)
## [1] 0.000000 1.386294 2.197225 2.772589 3.218876 3.583519
Revise the function to hoist the common multiplier, a, out of the loop. Make sure the result of the 'optimization' and the original calculation are the same. Use the microbenchmark package to compare the two versions:
f1 <- function(n, a=2) {
    result <- numeric()
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f1(n))
## [1] TRUE
library(microbenchmark)
microbenchmark(f0(n), f1(n), times=5)
## Unit: milliseconds
## expr min lq mean median uq max neval
## f0(n) 109.7719 109.9072 130.5134 141.9948 142.5390 148.3542 5
## f1(n) 108.5630 139.3563 133.2842 139.4147 139.4894 139.5979 5
Adopt a ‘pre-allocate and fill’ strategy
f2 <- function(n, a=2) {
    result <- numeric(n)
    for (i in seq_len(n))
        result[[i]] <- log(i)
    a * result
}
identical(expected, f2(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), times=5)
## Unit: milliseconds
## expr min lq mean median uq max
## f0(n) 121.201141 121.739866 134.761569 143.617813 143.620317 143.628707
## f2(n) 7.684828 7.849474 8.559228 8.415322 8.888803 9.957714
## neval
## 5
## 5
Use an *apply() function to avoid having to explicitly pre-allocate, and to make opportunities for vectorization more apparent.
f3 <- function(n, a=2)
    a * sapply(seq_len(n), log)
identical(expected, f3(n))
## [1] TRUE
microbenchmark(f0(n), f2(n), f3(n), times=10)
## Unit: milliseconds
## expr min lq mean median uq max
## f0(n) 112.120796 143.855378 142.577849 145.567746 146.700210 178.674194
## f2(n) 7.655573 7.698159 8.250884 8.177678 8.452503 9.566792
## f3(n) 3.579709 3.677523 4.083759 4.048497 4.455364 4.736191
## neval
## 10
## 10
## 10
Now that the code is presented in a single line, it is apparent that it could be easily vectorized. Seize the opportunity to vectorize it:
f4 <- function(n, a=2)
    a * log(seq_len(n))
identical(expected, f4(n))
## [1] TRUE
microbenchmark(f0(n), f3(n), f4(n), times=10)
## Unit: microseconds
## expr min lq mean median uq max
## f0(n) 112256.477 145432.576 149337.9525 146027.0340 169576.336 179843.968
## f3(n) 3583.222 3925.399 4090.4899 3987.5965 4017.998 5427.407
## f4(n) 364.700 378.439 395.9539 402.8765 407.481 422.962
## neval
## 10
## 10
## 10
Returning to our explicit iteration f2(): in these situations it can be helpful to compile the code to a more efficient representation. Do this using the compiler package.
library(compiler)
f2c <- cmpfun(f2)
n <- 10000
identical(f2(n), f2c(n))
## [1] TRUE
microbenchmark(f2(n), f2c(n), times=10)
## Unit: milliseconds
## expr min lq mean median uq max neval
## f2(n) 7.957343 8.582533 8.721356 8.747018 9.005959 9.45067 10
## f2c(n) 2.000680 2.045304 2.157064 2.164375 2.216226 2.38820 10
f4() definitely seems to be the winner. How does it scale with n? (Repeat several times.)
n <- 10 ^ (5:8) # 100x larger than f0
t <- sapply(n, function(i) system.time(f4(i))[[3]])
plot(t ~ n, log="xy", type="b")
Any explanations for the different pattern of response?
Lessons learned:
- Vectorized calculations (f4()) are dramatically faster than explicit iteration.
- When iteration is unavoidable, pre-allocate-and-fill (f2()) is much better than growing the result inside the loop.
- *apply() functions help avoid the need for explicit pre-allocation and make opportunities for vectorization more apparent. This may come at a small performance cost, but is generally worth it.
- compiler::cmpfun() can speed up code, like f2(), that is not easily vectorized.

It can be very helpful to reason about an algorithm in an abstract sense, to gain understanding about how an operation might scale. Here's an interesting problem, taken from StackOverflow: suppose one has a very long sorted vector
vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
with the simple goal being to identify the number of values less than zero. The original post and many responses suggested a variation of scanning the vector for values less than zero, then summing
f0 <- function(v) sum(v < 0)
This algorithm compares each element of vec to zero, creating a logical vector as long as the original, length(v). This logical vector is then scanned by sum() to count the number of elements satisfying the condition.
Questions:

- How many vectors as long as v need to be allocated by this algorithm? (A rough way to check is sketched after the figure below.)
- Based on the number of comparisons that need to be performed, how would you expect this algorithm to scale with the length of v? Verify this with a simple figure.
N <- seq(1, 11, 2) * 1e6
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f0(v))[[3]]
})
plot(Time ~ N, type="b")
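A rough way to probe the allocation question above is to look at the size of the intermediate logical vector directly; object.size() and the input length used here are illustrative, not part of the original exercise.

## The intermediate logical vector created inside f0() is as long as the input
v <- sort(rnorm(1e6))
object.size(v)        # size of the input vector itself
object.size(v < 0)    # size of the logical vector allocated by 'v < 0'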
Is there a better algorithm, i.e., an approach that arrives at the same answer but takes less time (and / or space)? The vector is sorted, and we can take advantage of that by doing a binary search. The algorithm is surprisingly simple: create indexes to the minimum (first) and maximum (last) elements. Check whether the element half way between is greater than or equal to zero; if so, move the maximum index to just before that point, otherwise move the minimum index to just after it. Repeat until the maximum index is less than the minimum index.
f3 <- function(x, threshold=0) {
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= threshold)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}
Approximately half of the possible values are discarded at each iteration, so we expect on average to arrive at the end after about log2(length(v)) iterations – the algorithm scales with the log of the length of v, rather than with the length of v, and no long intermediate vectors are created. These differences become increasingly important as v becomes long.
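As a quick, informal check of the log2() claim, iterations can be counted with a slightly modified version of the search; f3_count() below is a hypothetical helper added only for illustration.

## Binary search with an iteration counter, to compare against log2(length(x))
f3_count <- function(x, threshold=0) {
    iter <- 0L
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        iter <- iter + 1L
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= threshold)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    c(result=imax, iterations=iter)
}
f3_count(vec)        # compare the 'iterations' element with ...
log2(length(vec))    # ... the predicted number of iterations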
Questions:

- Check that f3() and f0() result in identical() answers.
- Compare how the timing of f3() scales with vector length.
## identity
stopifnot(
    identical(f0((-2):2), f3((-2):2)),
    identical(f0(2:4), f3(2:4)),
    identical(f0(-(4:2)), f3(-(4:2))),
    identical(f0(vec), f3(vec)))
## scale
N <- 10^(1:7)
Time <- sapply(N, function(n) {
    v <- sort(rnorm(n))
    system.time(f3(v))[[3]]
})
plot(Time ~ N, type="b")
Compare the speed of f0() and f3() with the original data, vec. R code can be compiled, and compilation helps most when doing non-vectorized operations like those in f3(). Use compiler::cmpfun() to compile f3(), and compare the result using microbenchmark.
## relative time
library(microbenchmark)
microbenchmark(f0(vec), f3(vec))
## Unit: microseconds
## expr min lq mean median uq max
## f0(vec) 15659.97 16503.9970 17429.61386 17468.603 17947.157 23199.554
## f3(vec) 28.08 30.9575 47.23053 47.764 52.141 113.544
## neval
## 100
## 100
library(compiler)
f3c <- cmpfun(f3)
microbenchmark(f3(vec), f3c(vec))
## Unit: microseconds
## expr min lq mean median uq max neval
## f3(vec) 28.470 29.8355 33.27442 34.6335 36.2500 48.503 100
## f3c(vec) 6.578 7.0270 7.85183 7.3085 7.6945 18.156 100
We could likely gain additional speed by writing the binary search algorithm in C, but we are already so happy with the performance improvement that we won’t do that!
It is useful to ask what is lost by f3() compared to f0(). For instance, does the algorithm work on character vectors? What about when the vector contains NA values? How are ties at 0 treated?
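A few informal probes of these questions, using toy inputs that are not part of the original exercise:

f0(c(-1, NA, 1))            # sum() propagates the NA
try(f3(c(-1, NA, 1)))       # the 'if' condition fails on NA
f3(c("apple", "banana", "cherry"), threshold="b")  # lexicographic comparison: one value sorts before "b"
f0(c(-1, 0, 0, 1))          # ties at 0 are 'not less than zero' ...
f3(c(-1, 0, 0, 1))          # ... for both implementations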
findInterval() is probably a better built-in way to solve the original problem, and generalizes to additional situations. The idea is to take the query that we are interested in, 0, and find the interval specified by vec in which it occurs.
f4 <- function(v, query=0)
    findInterval(query - .Machine$double.eps, v)
identical(f0(vec), f4(vec))
## [1] TRUE
microbenchmark(f0(vec), f3(vec), f4(vec))
## Unit: microseconds
## expr min lq mean median uq max
## f0(vec) 15408.806 16331.254 16768.227 16472.9370 16934.660 19656.259
## f3(vec) 28.265 30.392 44.537 43.5655 48.658 97.324
## f4(vec) 13645.076 13698.410 13946.125 13777.6360 14172.281 15026.055
## neval
## 100
## 100
## 100
The fact that findInterval() is flexible and well tested means that it would often be preferred to f3(), even though it is less speedy. For instance, compare the time it takes to query 10000 different points using f3() and iteration, versus findInterval() and vectorization.
threshold <- rnorm(10000)
identical(sapply(threshold, f3, x=vec), f4(vec, threshold))
## [1] TRUE
microbenchmark(sapply(threshold, f3, x=vec), f4(vec, threshold))
## Unit: microseconds
## expr min lq mean median uq
## sapply(x, f3) 38.121 40.8785 76.45288 83.7155 89.028
## f4(vec, x) 13650.604 13695.7095 13811.30469 13728.5565 13830.765
## max neval
## 154.216 100
## 14673.133 100
Some R functions that implement efficient algorithms are sort() (including radix sort), match() (hash table look-up), and tabulate(); these can be useful in your own code.
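For example, a common idiom combines match() and tabulate() to count occurrences of a known set of values; the values below are arbitrary and only for illustration.

## Count how many times each of a fixed set of levels occurs in a vector
lvls <- c("A", "C", "G", "T")
vals <- sample(lvls, 1000, replace=TRUE)
setNames(tabulate(match(vals, lvls), length(lvls)), lvls)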
Lessons learned:

- A better algorithm (here, a binary search on a sorted vector) can change how a computation scales, from linear to logarithmic in the length of the input.
- Flexible, well-tested built-in functions such as findInterval() are often preferable to hand-written implementations, even when they are not the fastest option.
- Compilation with compiler::cmpfun() helps most for code, like f3(), that cannot easily be vectorized.
This is an advanced exercise; proceed with enthusiastic caution.
This extended example illustrates how one might calculate the distribution of GC content of aligned reads across several BAM files. We start by processing one BAM file sequentially, and then process many BAM files in parallel.
Find paths to the following sample BAM files (these are small, but large enough to illustrate the principle).
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
BAM files are large, so cannot fit into memory. In addition, we will eventually process several BAM files in parallel, so we need to further manage the amount of memory we consume while processing each BAM file. We take two approaches.
The first is to iterate through the BAM file in chunks that are large enough to benefit from R's efficient vectorized calculations, but not so large as to consume excessive memory. We do this by using BamFileList() to indicate that we would like to input aligned reads in chunks of size 100,000.
library(Rsamtools)
bfls <- BamFileList(fls, yieldSize=100000)
Each time we read from a BAM file, we'll input the next 100,000 records. We'll adopt our second strategy for managing memory by restricting the data read from the BAM file to that necessary to calculate GC content, specifically the DNA sequence of each read in addition to its alignment coordinates. We'll do this by writing a function yield() that uses GenomicAlignments::readGAlignments() to input the required data; see the help pages for functions that we use but you do not understand, e.g., ?ScanBamParam.
library(GenomicAlignments)
yield <- function(bfl) {
    ## input a chunk of alignments
    library(GenomicAlignments)
    readGAlignments(bfl, param=ScanBamParam(what="seq"))
}
Next we'll transform our aligned reads to GC content. We will do this using Biostrings::letterFrequency() to count the number of G or C nucleotides in each read, then tabulate() these counts so that each chunk is summarized as the number of reads with 0, 1, …, 72 GC nucleotides.
library(Biostrings)
map <- function(aln) {
    ## GC count per read, tabulated into bins of 0, 1, ..., 72
    gc <- letterFrequency(mcols(aln)$seq, "GC")
    tabulate(1 + gc, 73)                # max. read length: 72
}
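As a quick sanity check (not part of the original workflow), yield() and map() can be run by hand on the first file:

## One manual round of yield() and map() on the first BAM file
bf <- BamFile(fls[1], yieldSize=100000)
open(bf)
aln <- yield(bf)     # the first chunk of up to 100,000 alignments
head(map(aln))       # first few bins: reads with 0, 1, 2, ... GC nucleotides
close(bf)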
map() will be applied to each chunk of data returned by yield(); we'll write a function reduce() that combines the results of two calls to map() into a single summary. In our case, reduce is simply the addition of the return values of two successive calls to map().
reduce <- `+`
The GenomicFiles package provides a way to stitch these pieces together, specifically the reduceByYield() function, illustrated in the following code chunk:
library(GenomicFiles)
bf <- BamFile(fls[1], yieldSize=100000)
reduceByYield(bf, yield, map, reduce)
## [1] 0 0 0 0 0 0 0 0 0 0 0
## [12] 0 1 1 4 24 41 87 238 490 907 1397
## [23] 2127 3208 4706 6220 8737 11052 14680 17020 19360 21804 27047
## [34] 29212 31249 35395 39807 40259 41722 42304 44108 44073 42317 41260
## [45] 38372 35689 32447 27815 22153 18960 14188 10768 7887 6182 4817
## [56] 3332 2101 1652 1455 860 459 235 116 73 34 22
## [67] 6 4 0 0 0 0 0
The result printed out above is the number of aligned reads with 0, 1, …, 72 G or C nucleotides. There are never more than 100,000 BAM records in memory at any one time, so memory consumption is modest. Nonetheless, we have processed the entire file.
Now that we can iterate through a single file to generate GC content in a modest amount of memory, it is very easy to process all files in parallel: use bplapply() to invoke reduceByYield() on each file, passing the additional arguments yield, map, and reduce.
library(BiocParallel)
gc <- bplapply(bfls, reduceByYield, yield, map, reduce)
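By default bplapply() uses the first registered parallel back-end. A minimal sketch of choosing the back-end explicitly is shown below; the worker count is arbitrary, and MulticoreParam() is not supported on Windows, where SnowParam() is an alternative.

## Optional: register a specific back-end before calling bplapply()
library(BiocParallel)
register(MulticoreParam(workers=4))   # or SnowParam(workers=4) on Windows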
The result is a list of GC-count vectors, one element for each file.
The result can be transformed to a data.frame()
library(ggplot2)
df <- stack(as.data.frame(lapply(gc, cumsum)))
df$GC <- 0:72
and visualized, e.g.,
library(ggplot2)
ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) +
    xlab("Number of GC Nucleotides per Read") +
    ylab("Number of Reads")