The material in this course requires R version 3.2 and Bioconductor version 3.2
stopifnot(
getRversion() >= '3.2' && getRversion() < '3.3',
BiocInstaller::biocVersion() == "3.2"
)
BiocParallel
, GenomicFiles
ScanBamParam()
which
: genomic ranges of interestwhat
: ‘columns’ of BAM file, e.g., ‘seq’, ‘flag’BamFile(..., yieldSize=100000)
Iterative programming model
Use GenomicFiles::reduceByYield()
library(GenomicFiles)
yield <- function(bfl) {
## input a chunk of alignments
library(GenomicAlignments)
readGAlignments(bfl, param=ScanBamParam(what="seq"))
}
map <- function(aln) {
## Count G or C nucleotides per read
library(Biostrings)
gc <- letterFrequency(mcols(aln)$seq, "GC")
## Summarize number of reads with 0, 1, ... G or C nucleotides
tabulate(1 + gc, 73) # max. read length: 72
}
reduce <- `+`
Example
library(RNAseqData.HNRNPC.bam.chr14)
fls <- RNAseqData.HNRNPC.bam.chr14_BAMFILES
bf <- BamFile(fls[1], yieldSize=100000)
gc <- reduceByYield(bf, yield, map, reduce)
plot(gc, type="h",
xlab="GC Content per Aligned Read", ylab="Number of Reads")
Many problems are embarassingly parallel – lapply()
-like – especially in bioinformatics where parallel evaluation is across files
Example: GC content in several BAM files
library(BiocParallel)
gc <- bplapply(BamFileList(fls), reduceByYield, yield, map, reduce)
library(ggplot2)
df <- stack(as.data.frame(lapply(gc, cumsum)))
df$GC <- 0:72
ggplot(df, aes(x=GC, y=values)) + geom_line(aes(colour=ind)) +
xlab("Number of GC Nucleotides per Read") +
ylab("Number of Reads")
Acknowledgements
Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum.
The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.
sessionInfo()
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux stretch/sid
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] GenomicFiles_1.5.8 BiocParallel_1.3.54
## [3] Homo.sapiens_1.3.1 GO.db_3.2.2
## [5] OrganismDbi_1.11.43 biomaRt_2.25.3
## [7] AnnotationHub_2.1.45 VariantAnnotation_1.15.34
## [9] RNAseqData.HNRNPC.bam.chr14_0.7.0 GenomicAlignments_1.5.18
## [11] Rsamtools_1.21.21 ALL_1.11.0
## [13] org.Hs.eg.db_3.2.3 RSQLite_1.0.0
## [15] DBI_0.3.1 ggplot2_1.0.1
## [17] airway_0.103.1 limma_3.25.18
## [19] DESeq2_1.9.51 RcppArmadillo_0.6.100.0.0
## [21] Rcpp_0.12.1 BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [23] BSgenome_1.37.6 rtracklayer_1.29.28
## [25] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.21.33
## [27] AnnotationDbi_1.31.19 SummarizedExperiment_0.3.11
## [29] Biobase_2.29.1 GenomicRanges_1.21.32
## [31] GenomeInfoDb_1.5.16 microbenchmark_1.4-2
## [33] Biostrings_2.37.8 XVector_0.9.4
## [35] IRanges_2.3.26 S4Vectors_0.7.23
## [37] BiocGenerics_0.15.11 BiocStyle_1.7.9
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 RColorBrewer_1.1-2 httr_1.0.0
## [4] tools_3.2.2 R6_2.1.1 rpart_4.1-10
## [7] Hmisc_3.17-0 colorspace_1.2-6 nnet_7.3-11
## [10] gridExtra_2.0.0 graph_1.47.2 formatR_1.2.1
## [13] sandwich_2.3-4 labeling_0.3 scales_0.3.0
## [16] mvtnorm_1.0-3 genefilter_1.51.1 RBGL_1.45.1
## [19] stringr_1.0.0 digest_0.6.8 foreign_0.8-66
## [22] rmarkdown_0.8.1 htmltools_0.2.6 BiocInstaller_1.19.14
## [25] shiny_0.12.2 zoo_1.7-12 acepack_1.3-3.3
## [28] RCurl_1.95-4.7 magrittr_1.5 Formula_1.2-1
## [31] futile.logger_1.4.1 munsell_0.4.2 proto_0.3-10
## [34] stringi_0.5-5 multcomp_1.4-1 yaml_2.1.13
## [37] MASS_7.3-44 zlibbioc_1.15.0 plyr_1.8.3
## [40] grid_3.2.2 lattice_0.20-33 splines_3.2.2
## [43] annotate_1.47.4 locfit_1.5-9.1 knitr_1.11
## [46] geneplotter_1.47.0 reshape2_1.4.1 codetools_0.2-14
## [49] futile.options_1.0.0 XML_3.98-1.3 evaluate_0.8
## [52] latticeExtra_0.6-26 lambda.r_1.1.7 httpuv_1.3.3
## [55] gtable_0.1.2 mime_0.4 xtable_1.7-4
## [58] survival_2.38-3 cluster_2.0.3 TH.data_1.0-6
## [61] interactiveDisplayBase_1.7.3