Efficient R code
system.time()
, Rprof()
, microbenchmarkIteration
open()
, read chunk(s), close()
.yieldSize
argument to Rsamtools::BamFile()
Restriction
Rsamtools::ScanBamParam()
Sampling
ShortRead::FastqSampler()
Parallel evaluation
lapply()
-like operationsType | Example use | Name | Package |
---|---|---|---|
.bed | Range annotations | BedFile() |
rtracklayer |
.wig | Coverage | WigFile() , BigWigFile() |
rtracklayer |
.gtf | Transcript models | GTFFile() |
rtracklayer |
makeTxDbFromGFF() |
GenomicFeatures | ||
.2bit | Genomic Sequence | TwoBitFile() |
rtracklayer |
.fastq | Reads & qualities | FastqFile() |
ShortRead |
.bam | Aligned reads | BamFile() |
Rsamtools |
.tbx | Indexed tab-delimited | TabixFile() |
Rsamtools |
.vcf | Variant calls | VcfFile() |
VariantAnnotation |
## rtracklayer menagerie
library(rtracklayer)
names(getClass("RTLFile")@subclasses)
## [1] "UCSCFile" "GFFFile" "BEDFile"
## [4] "WIGFile" "BigWigFile" "ChainFile"
## [7] "TwoBitFile" "FastaFile" "TabSeparatedFile"
## [10] "CompressedFile" "GFF1File" "GFF2File"
## [13] "GFF3File" "BEDGraphFile" "BED15File"
## [16] "BWFile" "2BitFile" "GTFFile"
## [19] "GVFFile" "GZFile" "BGZFile"
## [22] "BZ2File" "XZFile"
Notes
open()
, close()
, import()
/ yield()
/ read*()
.bai
, bam index); selection (‘columns’); restriction (‘rows’)*FileList()
classes
reduceByYield()
– iterate through a single large filebplapply()
(BiocParallel) – perform independent operations on several files, in parallelGenomicFiles()
reduceByRange()
, reduceByFile()
: collapse maps into summary representationStandardized interface for simple parallel evaluation.
bplapply()
instead of lapply()
Argument BPPARAM
influences how parallel evaluation occurs
MulticoreParam()
: threads on a single (non-Windows) machineSnowParam()
: processes on the same or different machinesBatchJobsParam()
: resource scheduler on a clusterOther resources
GoogleGenomics to interact with google compute cloud and resources
Define following as a function.
f0 <- function(n) {
## inefficient!
ans <- numeric()
for (i in seq_len(n))
ans <- c(ans, exp(i))
ans
}
Use system.time()
to explore how long this takes to execute as n
increases from 100 to 10000. Use identical()
and microbenchmark to compare alternatives f1()
, f2()
, and f3()
for both correctness and performance of these three different functions. What strategies are these functions using?
f1 <- function(n) {
ans <- numeric(n)
for (i in seq_len(n))
ans[[i]] <- exp(i)
ans
}
f2 <- function(n)
sapply(seq_len(n), exp)
f3 <- function(n)
exp(seq_len(n))
Go to sleep for 1 second, then return i
. This takes 8 seconds.
library(BiocParallel)
fun <- function(i) {
Sys.sleep(1)
i
}
## serial
f0 <- function(n)
lapply(seq_len(n), fun)
## parallel
f1 <- function(n)
bplapply(seq_len(n), fun)
This exercise uses the following packages:
library(GenomicAlignments)
library(GenomicFiles)
library(BiocParallel)
library(Rsamtools)
library(GenomeInfoDb)
This is a re-implementation of the basic csaw binned counts algorithm. It supposes that ChIP fragment lengths are 110 nt, and that we bin coverage in windows of width 50. We focus on chr1.
frag.len <- 110
spacing <- 50
chr <- "chr1"
Here we point to the bam files, indicating that we’ll process the files in chunks of size 1,000,000.
fls <- dir("~/UseBioconductor-data/ChIPSeq/NFYA/", ".BAM$", full=TRUE)
names(fls) <- sub(".fastq.*", "", basename(fls))
bfl <- BamFileList(fls, yieldSize=1000000)
We’ll creating the counting bins using tileGenome()
, focusing the ‘standard’ chromosomes’
len <- seqlengths(keepStandardChromosomes(seqinfo(bfl)))[chr]
tiles <- tileGenome(len, tilewidth=spacing, cut.last.tile.in.chrom=TRUE)
We’ll use reduceByYield()
to iterate through a single file. We read to tell this function we’ll YIELD
a chunk of the file, how we’ll MAP
the chunk from it’s input representation to the per-window counts, and finally how we’ll REDUCE
successive chunks into a final representation.
YIELD
is supposed to be a function that takes one argument, the input source, and returns a chunk of records
yield <- function(x, ...)
readGAlignments(x)
MAP
must take the output of yield and perhaps additional arguments, and return a vector of counts. We’ll resize the genomic ranges describing the alignment so that they have a width equal to the fragment length
map <- function(x, tiles, frag.len, ...) {
gr <- keepStandardChromosomes(granges(x))
countOverlaps(tiles, resize(gr, frag.len))
}
REDUCE
takes two results from MAP
(in our case, vectors of counts) and combines them into a single result. We simply add our vectors (+
is actually a function!)
reduce <- `+`
To process one file, we use reduceByYield()
, passing the file we want to process, the yield, map, and reduce functions. Our ‘wrapper’ function passes any additional arguments through to reduceByYield()
using ...
:
count1file <- function(bf, ...)
reduceByYield(bf, yield, map, reduce, ...)
Using yieldSize
and reduceByYield()
means that we do not consume too much memory processing each file, so that we can process files in parallel using BiocParallel. The simplify2array()
function transforms a list-of-vectors to a matrix.
counts <- bplapply(bfl, count1file, tiles=tiles, frag.len=frag.len)
counts <- simplify2array(counts)
dim(counts)
colSums(counts)