ATLAS, ACML BLAS library
'parallel' package
mclapply
and friends: parallel computation on lists (lapply(...)
)pvec
: parallel computations on vectors (fun(...)
)mcparallel
, mccollect
: fork and retrievelapply
-like, etcmpirun
,
a StackOverflow example), which makes sense
(responsibility for managing cluster structure should not be R's
responsibility).bplapply
for any back-endMotivation: StackOverflow question about calculating correlation coefficients between columns in a large (1M x 400) numeric matrix.
Plain R: How does cor()
scale? Some set-up:
set.seed(123)
timer <- function(dim, FUN, nrep=3) {
print(dim)
m <- matrix(runif(dim[1] * dim[2]), dim[1])
mean(replicate(nrep, system.time(FUN(m))["elapsed"]))
}
parm <- expand.grid(m=10^(4:5),
n=as.integer(seq(20, 300, length.out=3)))
And then calculation:
parm$cor <- apply(parm[,1:2], 1, timer, cor)
xtabs(cor ~ m + n, parm)
A little math: (1) center; (2) standardize; (3) cross-product. Step (3) is the slow part, and is performed by R's math library.
Implementation:
fastcor <- function(m) {
m <- t(m)
m <- m - rowMeans(m) # center
m <- m / sqrt(rowSums(m^2)) # scale
tcrossprod(m) # cross-product
}
How are we doing?
## 'small' data set initially
m <- 100000; n <- 50
mat <- matrix(runif(m * n), m)
Timing and correctness:
system.time(c0 <- cor(mat))
## user system elapsed
## 0.49 0.00 0.49
system.time(c1 <- fastcor(mat))
## user system elapsed
## 0.270 0.025 0.314
all.equal(c0, c1) # why not identical()?
## [1] TRUE
Scaling:
parm$fastcor <- apply(parm[,1:2], 1, timer, fastcor)
parm$crossprod <- apply(parm[,1:2], 1, timer, crossprod)
Visualizing:
library(lattice)
xyplot(sqrt(cor) + sqrt(fastcor) + sqrt(crossprod) ~ n,
group=m, parm, type="b", pch=20, cex=2, layout=c(3, 1),
xlab="Columns", ylab="sqrt(Time)", main="Native",
key=simpleKey(text=sprintf("%d", unique(parm$m)), lines=TRUE,
points=FALSE, x=.02, y=.95, title="Rows", cex.title=1))
Same scaling, worse coefficient, worse algorithm!
But wait! Use a parallel BLAS library:
'Easy' to arrange for parallel evaluation on Linux / Mac: (1)
install appropriate library using a package manager, e.g., apt get
install libatlas3gf-base
; (2) download R source and configure to
use installed library (see R Installation and
Administration); (3) make -j
; (4) Use.
Basic observations
Approach
Common solutions 1. Restrict data input to just that required 2. Draw a sample and infer statistical properties if appropriate, e.g., QA 3. Iterate through large data
Rsamtools::filterBam
,
GenomicRanges::summarizeOverlaps
Case study: Counting reads overlapping regions of interest, Intermediate Sequence Analysis 2013 Chapter 7.