The material in this course requires R version 3.2 and Bioconductor version 3.2:
stopifnot(
getRversion() >= '3.2' && getRversion() < '3.3',
BiocInstaller::biocVersion() == "3.2"
)
Key references

- Lun, BioC 2015
- Kharchenko et al. (2008)
ChIP-seq for differential binding

- Novel statistical issues
- Experimental design and execution
  - Single sample
  - Designed experiments
- Sequencing & alignment
- Peak calling
  - MACS: Model-based Analysis of ChIP-Seq (Zhang et al., 2008)
- Downstream analysis
  - ‘Known’ ranges
  - de novo windows
  - de novo peak calling
ID | Mode | Library | Operation |
---|---|---|---|
1 | Single-sample | Individual | Union |
2 | Single-sample | Individual | Intersection |
3 | Single-sample | Individual | At least 2 |
4 | Single-sample | Pooled over group | Union |
5 | Single-sample | Pooled over group | Intersection |
6 | Two-sample | Pooled over group | Union |
7 | Single-sample | Pooled over all | - |
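To make the ‘Operation’ column concrete, here is a toy sketch (using small, hypothetical peak sets; not part of the original material) of how union, intersection, and ‘at least 2’ combinations of per-library peak calls could be computed with GenomicRanges:

library(GenomicRanges)
## two hypothetical per-library peak sets on chr1, for illustration only
peaks1 <- GRanges("chr1", IRanges(start=c(100, 500, 2000), width=200))
peaks2 <- GRanges("chr1", IRanges(start=c(150, 900, 2050), width=200))
union_peaks <- reduce(c(peaks1, peaks2))       # regions called in any library
intersect_peaks <- intersect(peaks1, peaks2)   # regions called in every library
## 'at least 2': union regions supported by peak calls from >= 2 libraries
n_libs <- countOverlaps(union_peaks, GRangesList(peaks1, peaks2))
atleast2 <- union_peaks[n_libs >= 2]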
How to choose? See Lun & Smyth (2014).
Under the null hypothesis, p-values are uniformly distributed, so the observed type I error rate matches the nominal threshold:
## 100,000 t-tests under the null, n = 6
n <- 6; m <- matrix(rnorm(n * 100000), ncol=n)
P <- genefilter::rowttests(m, factor(rep(1:2, each=3)))$p.value
quantile(P, c(.001, .01, .05))
## 0.1% 1% 5%
## 0.000965875 0.010195115 0.049456354
hist(P, breaks=20)
Table 2: The observed type I error rate when testing for differential enrichment using counts from each peak calling strategy. Error rates for a range of specified error thresholds are shown. All values represent the mean of 10 simulation iterations with the standard error shown in brackets. RA: reference analysis using 10 000 randomly chosen true peaks.
ID | Error rate at 0.01 | Error rate at 0.05 | Error rate at 0.1 |
---|---|---|---|
RA | 0.010 (0.000) | 0.051 (0.001) | 0.100 (0.002) |
1 | 0.002 (0.000) | 0.019 (0.001) | 0.053 (0.001) |
2 | 0.003 (0.000) | 0.030 (0.000) | 0.073 (0.001) |
3 | 0.006 (0.000) | 0.042 (0.001) | 0.092 (0.001) |
4 | 0.033 (0.001) | 0.145 (0.001) | 0.261 (0.002) |
5 | 0.000 (0.000) | 0.001 (0.000) | 0.005 (0.000) |
6 | 0.088 (0.006) | 0.528 (0.013) | 0.893 (0.006) |
7 | 0.010 (0.000) | 0.049 (0.001) | 0.098 (0.001) |
The ChIPseeker vignette is an excellent resource, and we’ll walk through parts of it for our lab.
vignette("ChIPseeker")
This exercise is based on the csaw vignette, where more detail can be found. The csaw approach is innovative in two ways: (1) it doesn’t call ‘peaks’, but instead analyzes windows spanning the entire genome; and (2) it emphasizes comparison of ChIP patterns across samples, looking for differential binding between treatment groups.
Start by downloading csaw-data.Rds and csaw-normfacs.Rds.
The experiment involves changes in binding profiles of the NFYA protein between embryonic stem cells and terminal neurons. It is a subset of the data provided by Tiwari et al. (2012), available as GSE25532. There are two es (embryonic stem cell) and two tn (terminal neuron) replicates. Single-end FASTQ files were extracted from GEO, aligned using Rsubread, and post-processed (sorted and indexed) using Rsamtools with this script:
## SystemRequirements: ascp; fastq-dump
source("setup.R")
library(SRAdb)
library(Rsubread)
library(Rsamtools)
library(BiocParallel)
sradb <- "SRAmetadb.sqlite"
key <- "/app/aspera-connect/3.1.1/etc/asperaweb_id_dsa.openssh"
cmd = sprintf("ascp -TT -l300m -i %s", key)
source("setup.R")
if (!file.exists(sradb))
getSRAdbFile()
con = dbConnect(dbDriver("SQLite"), sradb)
## `files` is the data.frame of sample accessions and file names shown further
## below; it must be defined before running this script
accs <- rownames(files)[!file.exists(files$sra)]
for (acc in accs)
sraFiles = ascpSRA(acc, con, cmd, fileType="sra", destDir=getwd())
sras <- files$sra[!file.exists(files$fastq)]
bplapply(sras, function(sra) system(sprintf("fastq-dump --gzip %s", sra)))
fastqs <- files$fastq[!file.exists(files$bam)]
if (length(fastqs))
Rsubread::align("../mm10/mm10.Rsubread.index", fastqs,
nthreads=parallel::detectCores() / 2L)
bams <- files$bam[!file.exists(sprintf("%s.bai", files$bam))]
bams_sorted <- sub(".BAM", ".sorted.bam", bams)
sorted <- bpmapply(sortBam, bams, bams_sorted)
## replace the unsorted BAM files with their sorted versions, keeping the
## original names so they match files$bam, then index them
file.rename(sorted, names(sorted))
bplapply(names(sorted), indexBam)
This generates 4 BAM files, one per sample. The BAM files are about 2 GB each. The files are summarized by the following data frame:
files <- local({
acc <- c(es_1="SRR074398", es_2="SRR074399", tn_1="SRR074417",
tn_2="SRR074418")
data.frame(Treatment=sub("_.*", "", names(acc)),
Replicate=sub(".*_", "", names(acc)),
sra=sprintf("%s.sra", acc),
fastq=sprintf("%s.fastq.gz", acc),
bam=sprintf("%s.fastq.gz.subread.BAM", acc),
row.names=acc, stringsAsFactors=FALSE)
})
To process the data, I’ll change to the directory where the BAM files are located…
setwd("~/UseBioconductor-data/ChIPSeq/NFYA")
…and then load the csaw library and count reads in overlapping windows.
library(csaw)
library(GenomicRanges)
frag.len <- 110
system.time({
data <- windowCounts(files$bam, width=10, ext=frag.len)
}) # 156 seconds
acc <- sub(".fastq.*", "", data$bam.files)
colData(data) <- cbind(files[acc,], colData(data))
Load this data into your own R session with the following command:
data <- readRDS("csaw-data.Rds")
data is a SummarizedExperiment, so explore it a bit…
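For example, with a few standard SummarizedExperiment accessors:

dim(data)                 # windows x samples
head(rowRanges(data))     # the genomic windows
colData(data)             # sample information, including Treatment and Replicate
head(assay(data))         # read counts per window and sample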
Filtering (vignette Chapter 3)

Start by filtering low-count windows. There are likely to be many of these (how many?). Is there a rational way to choose the filtering threshold?
library(edgeR) # for aveLogCPM()
keep <- aveLogCPM(assay(data)) >= -1
data <- data[keep,]
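A quick check on the effect of the chosen threshold (a sketch; the exact numbers depend on the counting step):

table(keep)                        # windows kept vs. discarded
summary(aveLogCPM(assay(data)))    # abundance distribution of the retained windows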
Normalization (composition bias) (vignette Chapter 4)

csaw uses binned counts in normalization. The bins are large relative to the ChIP peaks, on the assumption that the bins primarily represent non-differentially bound regions. The per-sample bin counts are normalized using the edgeR TMM (trimmed mean of M-values) method seen in the RNA-seq differential expression lab. Explore vignette Chapter 4 for more on normalization (this is a useful resource when seeking to develop normalization methods for other protocols!).
This requires access to the big data, so we don’t run the following code…
system.time({
binned <- windowCounts(files$bam, bin=TRUE, width=10000)
}) # 139 seconds
normfacs <- normalize(binned)
…but instead load the summary into our session:
normfacs <- readRDS("csaw-normfacs.Rds")
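normfacs is a vector of TMM normalization factors, one per library, in the column order of the count data; values near 1 indicate little composition bias between samples.

normfacs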
Experimental design and differential binding (vignette Chapter 5)

Differential binding will be assessed using edgeR, so we need to specify the experimental design:
design <- model.matrix(~Treatment, colData(data))
Apply a standard edgeR workflow to identify differentially bound regions. Creatively explore the results.
y <- asDGEList(data, norm.factors=normfacs)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
results <- glmQLFTest(fit, contrast=c(0, 1))
head(results$table)
## logFC logCPM F PValue
## 1 -0.6739731 -1.360819 0.6314884 0.43210089
## 2 -0.7693277 -1.401691 0.7997943 0.37718999
## 3 -0.3620229 -1.330818 0.1863866 0.66855710
## 4 0.4943534 -1.373149 0.3463071 0.55994411
## 5 1.4757585 -1.168808 3.1350061 0.08523217
## 6 2.1921162 -1.241795 5.7634503 0.02174394
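For instance, two quick explorations: the windows with the smallest p-values, and the overall p-value distribution (an excess of small p-values suggests genuine differential binding):

topTags(results)
hist(results$table$PValue, breaks=50, main="Per-window p-values", xlab="P value")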
Multiple testing (vignette Chapter 6)

The challenge is that the FDR across all detected differentially bound regions is what one is interested in, but what is immediately available is the FDR across differentially bound windows; a region will often consist of multiple overlapping windows. As a first step, we’ll take a ‘quick and dirty’ approach to identifying regions by merging ‘high-abundance’ windows that are within, e.g., 1 kb of one another:
merged <- mergeWindows(rowRanges(data), tol=1000L)
Combine test results across windows within regions. Several strategies are explored in section 6.5 of the vignette.
tabcom <- combineTests(merged$id, results$table)
head(tabcom)
## nWindows logFC.up logFC.down PValue FDR
## 1 1 0 1 0.43210089 0.6095340
## 2 2 0 1 0.66855710 0.8044231
## 3 4 3 0 0.08261763 0.2125610
## 4 1 1 0 0.38981304 0.5705249
## 5 2 0 2 0.19036362 0.3609546
## 6 1 0 0 0.83202268 0.9134050
Section 6.6 of the vignette discusses approaches to identifying the ‘best’ windows within regions.
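One of the approaches described there is getBestTest(), which reports the strongest window within each region; a sketch (see the vignette for the statistical details):

tab.best <- getBestTest(merged$id, results$table)
head(tab.best)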
Finally, create a GRangesList that associates the two result tables with the genomic ranges over which the results were calculated.
gr <- rowRanges(data)
mcols(gr) <- as(results$table, "DataFrame")
grl <- split(gr, merged$id)
mcols(grl) <- as(tabcom, "DataFrame")
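With the region-level statistics attached, regions can be ranked, e.g. by combined FDR, to pull out the most strongly differentially bound ones:

o <- order(mcols(grl)$FDR)
grl[head(o)]    # regions with the smallest combined FDR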
Acknowledgements
Core (Seattle): Sonali Arora, Marc Carlson, Nate Hayden, Jim Hester, Valerie Obenchain, Hervé Pagès, Paul Shannon, Dan Tenenbaum.
The research reported in this presentation was supported by the National Cancer Institute and the National Human Genome Research Institute of the National Institutes of Health under Award numbers U24CA180996 and U41HG004059, and the National Science Foundation under Award number 1247813. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.
sessionInfo()
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux stretch/sid
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] clusterProfiler_2.3.8 ChIPseeker_1.5.11
## [3] edgeR_3.11.5 csaw_1.3.14
## [5] genefilter_1.51.1 RColorBrewer_1.1-2
## [7] gplots_2.17.0 GenomicFiles_1.5.8
## [9] BiocParallel_1.3.54 Homo.sapiens_1.3.1
## [11] GO.db_3.2.2 OrganismDbi_1.11.43
## [13] biomaRt_2.25.3 AnnotationHub_2.1.45
## [15] VariantAnnotation_1.15.34 RNAseqData.HNRNPC.bam.chr14_0.7.0
## [17] GenomicAlignments_1.5.18 Rsamtools_1.21.21
## [19] ALL_1.11.0 org.Hs.eg.db_3.2.3
## [21] RSQLite_1.0.0 DBI_0.3.1
## [23] ggplot2_1.0.1 airway_0.103.1
## [25] limma_3.25.18 DESeq2_1.9.51
## [27] RcppArmadillo_0.6.100.0.0 Rcpp_0.12.1
## [29] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.37.6
## [31] rtracklayer_1.29.28 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [33] GenomicFeatures_1.21.33 AnnotationDbi_1.31.19
## [35] SummarizedExperiment_0.3.11 Biobase_2.29.1
## [37] GenomicRanges_1.21.32 GenomeInfoDb_1.5.16
## [39] microbenchmark_1.4-2 Biostrings_2.37.8
## [41] XVector_0.9.4 IRanges_2.3.26
## [43] S4Vectors_0.7.23 BiocGenerics_0.15.11
## [45] BiocStyle_1.7.9
##
## loaded via a namespace (and not attached):
## [1] TH.data_1.0-6 colorspace_1.2-6 qvalue_2.1.0
## [4] futile.logger_1.4.1 topGO_2.21.0 interactiveDisplayBase_1.7.3
## [7] mvtnorm_1.0-3 codetools_0.2-14 splines_3.2.2
## [10] GOSemSim_1.27.4 geneplotter_1.47.0 knitr_1.11
## [13] Formula_1.2-1 gridBase_0.4-7 annotate_1.47.4
## [16] cluster_2.0.3 png_0.1-7 graph_1.47.2
## [19] shiny_0.12.2 httr_1.0.0 assertthat_0.1
## [22] formatR_1.2.1 acepack_1.3-3.3 htmltools_0.2.6
## [25] tools_3.2.2 igraph_1.0.1 gtable_0.1.2
## [28] DO.db_2.9 reshape2_1.4.1 dplyr_0.4.3
## [31] gdata_2.17.0 stringr_1.0.0 proto_0.3-10
## [34] mime_0.4 gtools_3.5.0 statmod_1.4.21
## [37] XML_3.98-1.3 DOSE_2.7.12 zlibbioc_1.15.0
## [40] MASS_7.3-44 zoo_1.7-12 scales_0.3.0
## [43] BiocInstaller_1.19.14 RBGL_1.45.1 sandwich_2.3-4
## [46] SparseM_1.7 lambda.r_1.1.7 yaml_2.1.13
## [49] gridExtra_2.0.0 UpSetR_0.0.5 rpart_4.1-10
## [52] latticeExtra_0.6-26 stringi_0.5-5 plotrix_3.5-12
## [55] caTools_1.17.1 boot_1.3-17 bitops_1.0-6
## [58] evaluate_0.8 lattice_0.20-33 labeling_0.3
## [61] plyr_1.8.3 magrittr_1.5 R6_2.1.1
## [64] Hmisc_3.17-0 multcomp_1.4-1 foreign_0.8-66
## [67] KEGGREST_1.9.1 survival_2.38-3 RCurl_1.95-4.7
## [70] nnet_7.3-11 futile.options_1.0.0 KernSmooth_2.23-15
## [73] rmarkdown_0.8.1 locfit_1.5-9.1 grid_3.2.2
## [76] digest_0.6.8 xtable_1.7-4 httpuv_1.3.3
## [79] munsell_0.4.2