Chapter 7 Correction for multiple testing

7.1 Overview

The false discovery rate (FDR) is usually the most appropriate measure of error for high-throughput experiments. Control of the FDR can be provided by applying the Benjamini-Hochberg (BH) method (Benjamini and Hochberg 1995) to a set of \(p\)-values. This is less conservative than the alternatives (e.g., Bonferroni) yet still provides some measure of error control. The most obvious approach is to apply the BH method to the set of \(p\)-values across all windows. This will control the FDR across the set of putative DB windows.

However, the FDR across all detected windows is not necessarily the most relevant error rate. Interpretation of ChIP-seq experiments is more concerned with regions of the genome in which (differential) protein binding is found, rather than the individual windows. In other words, the FDR across all detected DB regions is usually desired. This is not equivalent to that across all DB windows as each region will often consist of multiple overlapping windows. Control of one will not guarantee control of the other (Lun and Smyth 2014).

To illustrate this difference, consider an analysis where the FDR across all window positions is controlled at 10%. In the results, there are 18 adjacent window positions in one region and 2 windows in a separate region. The first set of windows is a truly DB region whereas the second set is a false positive. A window-based interpretation of the FDR is correct as only 2 of the 20 window positions are false positives. However, a region-based interpretation results in an actual FDR of 50%.
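
The same arithmetic can be written out explicitly; the toy calculation below is purely illustrative and is not part of the csaw workflow.

# Toy illustration of the window-level versus region-level FDR for this example.
window.status <- c(rep("true DB", 18), rep("false positive", 2))
mean(window.status == "false positive") # window-level FDR: 2/20 = 10%
1/2 # region-level FDR: one of the two reported regions is a false positive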

To avoid misinterpretation of the FDR, csaw provides a number of strategies to obtain region-level results. This involves defining the regions of interest - possibly from the windows themselves - and converting per-window statistics into a \(p\)-value for each region. Application of the BH method to the per-region \(p\)-values will then control the relevant FDR across regions. These strategies are demonstrated below using the NF-YA data.

7.2 Grouping windows into regions

7.2.1 Quick and dirty clustering

The mergeWindows() function provides a simple single-linkage algorithm to cluster windows into regions. Windows that are less than tol apart are considered to be adjacent and are grouped into the same cluster. The chosen tol represents the minimum distance at which two binding events are treated as separate sites. Large values (500 - 1000 bp) reduce redundancy and favor a region-based interpretation of the results, while smaller values (< 200 bp) allow resolution of individual binding sites.

#--- loading-files ---#
library(chipseqDBData)
tf.data <- NFYAData()
tf.data
bam.files <- head(tf.data$Path, -1) # skip the input.
bam.files

#--- counting-windows ---#
library(csaw)
frag.len <- 110
win.width <- 10
param <- readParam(minq=20)
data <- windowCounts(bam.files, ext=frag.len, width=win.width, param=param)

#--- filtering ---#
binned <- windowCounts(bam.files, bin=10000, param=param)
fstats <- filterWindowsGlobal(data, binned)
filtered.data <- data[fstats$filter > log2(5),]

#--- normalization ---#
filtered.data <- normFactors(binned, se.out=filtered.data)

#--- modelling ---#
cell.type <- sub("NF-YA ([^ ]+) .*", "\\1", head(tf.data$Description, -1))
design <- model.matrix(~cell.type)
colnames(design) <- c("intercept", "cell.type")

library(edgeR)
y <- asDGEList(filtered.data)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef="cell.type")

rowData(filtered.data) <- cbind(rowData(filtered.data), res$table)
library(csaw)
merged <- mergeWindows(filtered.data, tol=1000L)
merged$regions
## GRanges object with 3577 ranges and 0 metadata columns:
##                      seqnames            ranges strand
##                         <Rle>         <IRanges>  <Rle>
##      [1]                 chr1   7397901-7398110      *
##      [2]                 chr1   9541401-9541510      *
##      [3]                 chr1   9545301-9545360      *
##      [4]                 chr1 10007401-10007460      *
##      [5]                 chr1 13134451-13134510      *
##      ...                  ...               ...    ...
##   [3573] chrX_GL456233_random     336801-336910      *
##   [3574]                 chrY     143051-143060      *
##   [3575]                 chrY     259151-259210      *
##   [3576]                 chrY 90808851-90808860      *
##   [3577]                 chrY 90812851-90812910      *
##   -------
##   seqinfo: 66 sequences from an unspecified genome

If many adjacent windows are present, very large clusters may be formed that are difficult to interpret. We perform a simple check below to determine whether most clusters are of an acceptable size. Huge clusters indicate that more aggressive filtering from Chapter 4 is required. This mitigates chaining effects by reducing the density of windows in the genome.

summary(width(merged$regions))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10      60     110     165     160   15660

Alternatively, chaining can be limited by setting max.width to restrict the size of the merged intervals. Clusters substantially larger than max.width are split into several smaller subclusters of roughly equal size. The chosen value should be small enough to separate DB regions from unchanged neighbors, yet large enough to avoid misinterpretation of the FDR. Any value from 2000 to 10000 bp is recommended. This parameter can also be interpreted as the maximum distance at which two binding sites are considered part of the same event.

merged.max <- mergeWindows(filtered.data, tol=1000L, max.width=5000L)
summary(width(merged.max$regions))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10      60     110     164     160    4860

7.2.2 Using external information

Another approach is to group together windows that overlap a pre-specified region of interest. The most obvious source of pre-specified regions is annotated features such as promoters or gene bodies. Alternatively, called peaks can be used, provided that sufficient care has been taken to avoid loss of error control from data snooping (Lun and Smyth 2014). Regardless of how they are specified, each region of interest corresponds to a group that contains all overlapping windows, as identified by the findOverlaps() function from the GenomicRanges package.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
broads <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
broads <- resize(broads, width(broads)+3000, fix="end")

olap <- findOverlaps(broads, rowRanges(filtered.data))
olap
## Hits object with 12867 hits and 0 metadata columns:
##           queryHits subjectHits
##           <integer>   <integer>
##       [1]         7        6995
##       [2]        18        8323
##       [3]        18        8324
##       [4]        18        8325
##       [5]        18        8326
##       ...       ...         ...
##   [12863]     24521        6840
##   [12864]     24521        6841
##   [12865]     24524        6601
##   [12866]     24524        6602
##   [12867]     24524        6603
##   -------
##   queryLength: 24528 / subjectLength: 12352

At this point, one might imagine that it would be simpler to just collect and analyze counts over the pre-specified regions. This is a valid strategy but will yield different results. Consider a promoter containing two separate sites that are identically DB in opposite directions. Counting reads across the promoter will give equal counts for each condition so changes within the promoter will not be detected. Similarly, imprecise peak boundaries can lead to loss of detection power due to “contamination” by reads in background regions. Window-based methods may be more robust as each interval of the promoter/peak region is examined separately (Lun and Smyth 2014), avoiding potential problems with peak-calling errors and incorrect/incomplete annotation.
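
For reference, such region-level counts could be collected with csaw's regionCounts() function. The sketch below reuses the bam.files, frag.len and param objects defined earlier, and is shown only for comparison with the window-based approach rather than as part of the main analysis.

# Direct counting over the pre-specified gene bodies, for comparison only.
broad.counts <- regionCounts(bam.files, regions=broads, ext=frag.len, param=param)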

7.3 Obtaining per-region \(p\)-values

7.3.1 Combining window-level \(p\)-values

We compute a combined \(p\)-value for each region based on the \(p\)-values of the constituent windows (Simes 1986). This tests the joint null hypothesis for each region, i.e., that no enrichment is observed across any of its windows. Any DB within the region will reject the joint null and yield a low \(p\)-value for the entire region. The combined \(p\)-values are then adjusted using the BH method to control the region-level FDR.

tabcom <- combineTests(merged$ids, rowData(filtered.data))
is.sig.region <- tabcom$FDR <= 0.05
summary(is.sig.region)
##    Mode   FALSE    TRUE 
## logical    1593    1984
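
For intuition, the combined \(p\)-value for a single cluster can be reproduced by hand using Simes' procedure. This is a minimal sketch assuming the default equal weights used above; combineTests() performs the equivalent calculation for every cluster.

# Manual Simes calculation for the first cluster, for illustration only.
in.first <- merged$ids == 1
p.first <- rowData(filtered.data)$PValue[in.first]
n <- length(p.first)
min(sort(p.first) * n / seq_len(n)) # should agree with tabcom$PValue[1]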

Summarizing the direction of DB for each cluster requires some care, as the direction of DB can differ between constituent windows. The num.up.logFC and num.down.logFC fields contain the number of windows that change in each direction, and can be used to gauge whether binding increases or decreases across the cluster. A complex DB event may be present if both num.up.logFC and num.down.logFC are non-zero (i.e., opposing changes within the region) or if the total number of windows is much larger than either number (e.g., an interval of constant binding adjacent to the DB interval).
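
As a simple heuristic sketch of this check, clusters containing windows that change in both directions can be flagged directly from these fields.

# Flag clusters with windows changing in both directions.
both.dir <- tabcom$num.up.logFC > 0 & tabcom$num.down.logFC > 0
summary(both.dir)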

Alternatively, the direction field specifies which DB direction contributes to the combined \(p\)-value. If "up", the combined \(p\)-value for this cluster is driven by \(p\)-values of windows with positive log-fold changes. If "down", the combined \(p\)-value is driven by windows with negative log-fold changes. If "mixed", windows with both positive and negative log-fold changes are involved. This allows the dominant DB in significant clusters to be quickly summarized, as shown below.

table(tabcom$direction[is.sig.region])
## 
## down   up 
##  178 1806

For pre-specified regions, the combineOverlaps() function will combine the \(p\)-values for all windows in each region. This is a wrapper around combineTests() for Hits objects. It returns a single combined \(p\)-value (and its BH-adjusted value) for each region. Regions that do not overlap any windows have values of NA in all fields for the corresponding rows.

tabbroad <- combineOverlaps(olap, rowData(filtered.data))
head(tabbroad[!is.na(tabbroad$PValue),])
## DataFrame with 6 rows and 8 columns
##    num.tests num.up.logFC num.down.logFC      PValue        FDR   direction
##    <integer>    <integer>      <integer>   <numeric>  <numeric> <character>
## 7          1            1              0 2.55121e-05 0.00047832          up
## 18         4            4              0 1.57806e-04 0.00141790          up
## 23         3            3              0 2.56708e-02 0.04520196          up
## 25         2            0              0 7.33955e-01 0.75625162          up
## 28         3            3              0 1.05997e-04 0.00112671          up
## 36         4            4              0 5.20676e-03 0.01441129          up
##     rep.test rep.logFC
##    <integer> <numeric>
## 7       6995  3.396599
## 18      8326  3.445921
## 23       315  1.331105
## 25      9977  0.201873
## 28      8774  3.503338
## 36      2716  2.105375
is.sig.gene <- tabbroad$FDR <= 0.05
table(tabbroad$direction[is.sig.gene])
## 
##  down mixed    up 
##    94    30  1991

7.3.2 Based on the most significant window

Another approach is to use the single window with the strongest DB as a representative of the entire region. This is useful when a log-fold change is required for each cluster, e.g., for plotting. (In contrast, taking the average log-fold change across all windows in a region will understate the magnitude of DB, especially if the region includes some non-DB background intervals of the genome.) Identification of the most significant (i.e., “best”) window is performed using the getBestTest() function. This reports the index of the window with the lowest \(p\)-value in each cluster as well as the associated statistics.

tab.best <- getBestTest(merged$ids, rowData(filtered.data))
head(tab.best)
## DataFrame with 6 rows and 8 columns
##   num.tests num.up.logFC num.down.logFC    PValue       FDR   direction
##   <integer>    <integer>      <integer> <numeric> <numeric> <character>
## 1         5            2              0 0.0172116 0.0360667          up
## 2         3            2              0 0.0121190 0.0277526          up
## 3         2            2              0 0.0275654 0.0517865          up
## 4         2            0              0 0.1881740 0.2400494          up
## 5         2            0              0 0.1627457 0.2123829          up
## 6         5            1              0 0.0361941 0.0637137          up
##    rep.test rep.logFC
##   <integer> <numeric>
## 1         3   1.57358
## 2         8   1.98350
## 3        10   1.69016
## 4        11   1.12021
## 5        14   1.08339
## 6        17   1.32711

A Bonferroni correction is applied to the \(p\)-value of the best window in each region, based on the number of constituent windows in that region. This is necessary to account for the implicit multiple testing across all windows in each region. The corrected \(p\)-value is reported as PValue in tab.best, and can be used for correction across regions using the BH method to control the region-level FDR.
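
This correction can be reproduced by hand for a single cluster, as a minimal sketch of what getBestTest() reports.

# Manual Bonferroni correction for the best window in the first cluster.
in.first <- merged$ids == 1
p.first <- rowData(filtered.data)$PValue[in.first]
min(min(p.first) * length(p.first), 1) # should agree with tab.best$PValue[1]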

In addition, it is often useful to report the start location of the best window within each cluster. This allows users to easily identify a relevant DB subinterval in large regions. For example, the sequence of the DB subinterval can be extracted for motif discovery.

tabcom$rep.start <- start(rowRanges(filtered.data))[tab.best$rep.test]
head(tabcom[,c("rep.logFC", "rep.start")])
## DataFrame with 6 rows and 2 columns
##   rep.logFC rep.start
##   <numeric> <integer>
## 1   1.57358   7398001
## 2   1.81156   9541501
## 3   1.57462   9545351
## 4   1.03337  10007401
## 5   1.08339  13134501
## 6   1.32711  13372551

The same approach can be applied to the overlaps between windows and pre-specified regions, using the getBestOverlaps() wrapper function. This is demonstrated below for the broad gene body example. As with combineOverlaps(), regions with no windows are assigned NA in the output table, but these are removed here to show some actual results.

tab.best.broad <- getBestOverlaps(olap, rowData(filtered.data))
tabbroad$rep.start <- start(rowRanges(filtered.data))[tab.best.broad$rep.test]
head(tabbroad[!is.na(tabbroad$PValue),c("rep.logFC", "rep.start")])
## DataFrame with 6 rows and 2 columns
##    rep.logFC rep.start
##    <numeric> <integer>
## 7   3.396599  32657101
## 18  3.445921   8259301
## 23  1.331105  92934601
## 25  0.201873  71596101
## 28  3.503338   4137001
## 36  2.105375 100187601

7.3.3 Wrapper functions

For convenience, the steps of merging windows and computing statistics are implemented in a single wrapper function. This simply calls mergeWindows() followed by combineTests() and getBestTest().

merge.res <- mergeResults(filtered.data, rowData(filtered.data), tol=100,
    merge.args=list(max.width=5000))
names(merge.res)
## [1] "regions"  "combined" "best"

An equivalent wrapper function is also available for handling overlaps to pre-specified regions. This simply calls findOverlaps() followed by combineOverlaps() and getBestOverlaps().

broad.res <- overlapResults(filtered.data, regions=broads,
    tab=rowData(filtered.data))
names(broad.res)
## [1] "regions"  "combined" "best"

7.4 Squeezing out more detection power

7.4.1 Integrating across multiple window sizes

Repeating the analysis with different window sizes may uncover new DB events at different resolutions. Multiple sets of DB results are integrated by clustering adjacent windows together (even if they differ in size) and combining \(p\)-values within each of the resulting clusters. The example below uses the H3 acetylation data from Chapter 5. Some filtering is performed to avoid excessive chaining in this demonstration. Corresponding tables of DB results should also be obtained – for brevity, mock results are used here.

library(chipseqDBData)
ac.files <- H3K9acData()$Path
ac.small <- windowCounts(ac.files, width=150L, spacing=100L, 
    filter=25, param=param)
ac.large <- windowCounts(ac.files, width=1000L, spacing=500L, 
    filter=35, param=param)

# TODO: actually do the analysis here.
# In the meantime, mocking up results for demonstration purposes.
ns <- nrow(ac.small)
mock.small <- data.frame(logFC=rnorm(ns), logCPM=0, PValue=runif(ns)) 
nl <- nrow(ac.large)
mock.large <- data.frame(logFC=rnorm(nl), logCPM=0, PValue=runif(nl)) 

The mergeResultsList() function merges windows of all sizes into a single set of regions, and computes a combined \(p\)-value from the associated \(p\)-values for each region. Equal contributions from each window size are enforced by setting equiweight=TRUE, which uses a weighted version of Simes’ method (Benjamini and Hochberg 1997). The weight assigned to each window is inversely proportional to the number of windows of that size in the same cluster. Otherwise, a cluster containing many small windows would be dominated by the DB results from the small-window analysis, which is not ideal when results from all window sizes are of equal interest.
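
To make the weighting concrete, consider a hypothetical cluster containing 10 small windows and 2 large windows. The toy calculation below (not csaw code) shows that each window size then contributes a total weight of 1.

# Toy illustration of the equal-weighting scheme enforced by equiweight=TRUE.
sizes <- rep(c("small", "large"), c(10, 2))
weights <- 1/as.numeric(table(sizes)[sizes])
tapply(weights, sizes, sum) # each window size sums to a total weight of 1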

cons.res <- mergeResultsList(list(ac.small, ac.large), 
    tab.list=list(mock.small, mock.large), 
    equiweight=TRUE, tol=1000)
cons.res$regions
## GRanges object with 30486 ranges and 0 metadata columns:
##           seqnames            ranges strand
##              <Rle>         <IRanges>  <Rle>
##       [1]     chr1   4774001-4776500      *
##       [2]     chr1   4784501-4787000      *
##       [3]     chr1   4806501-4809000      *
##       [4]     chr1   4856501-4860000      *
##       [5]     chr1   5082501-5084500      *
##       ...      ...               ...    ...
##   [30482]     chrY 38230601-38230750      *
##   [30483]     chrY 73037501-73039000      *
##   [30484]     chrY 75445901-75446150      *
##   [30485]     chrY 88935501-88937000      *
##   [30486]     chrY 90812501-90814000      *
##   -------
##   seqinfo: 66 sequences from an unspecified genome
cons.res$combined
## DataFrame with 30486 rows and 8 columns
##       num.tests num.up.logFC num.down.logFC    PValue       FDR   direction
##       <integer>    <integer>      <integer> <numeric> <numeric> <character>
## 1             4            0              0  0.849144  0.996691        down
## 2            13            0              0  0.685623  0.996410       mixed
## 3            11            0              0  0.123892  0.990116          up
## 4            22            0              0  0.656478  0.996410       mixed
## 5             9            0              0  0.877641  0.997979       mixed
## ...         ...          ...            ...       ...       ...         ...
## 30482         1            0              0  0.861439  0.997282        down
## 30483         5            0              0  0.859316  0.997243       mixed
## 30484         2            0              0  0.542721  0.996347        down
## 30485         5            0              0  0.649004  0.996347          up
## 30486         2            0              0  0.245640  0.990116        down
##        rep.test  rep.logFC
##       <integer>  <numeric>
## 1        238292  -0.702177
## 2             3   0.953827
## 3        238299   0.517360
## 4        238304  -0.892078
## 5        238309   1.841532
## ...         ...        ...
## 30482    238280 -0.9734619
## 30483    391248 -0.2581099
## 30484    238284 -0.9382387
## 30485    391250  0.0996835
## 30486    391253 -1.7236858

Similarly, the overlapResultsList() function is used to merge windows of varying size that overlap pre-specified regions.

cons.broad <- overlapResultsList(list(ac.small, ac.large),
    tab.list=list(mock.small, mock.large), 
    equiweight=TRUE, region=broads)
cons.broad$regions
## GRanges object with 24528 ranges and 1 metadata column:
##             seqnames              ranges strand |     gene_id
##                <Rle>           <IRanges>  <Rle> | <character>
##   100009600     chr9   21062393-21076096      - |   100009600
##   100009609     chr7   84935565-84967115      - |   100009609
##   100009614    chr10   77708457-77712009      + |   100009614
##   100009664    chr11   45805087-45841171      + |   100009664
##      100012     chr4 144157557-144165663      - |      100012
##         ...      ...                 ...    ... .         ...
##       99889     chr3   84496093-85890516      - |       99889
##       99890     chr3 110246109-110253998      - |       99890
##       99899     chr3 151730922-151752960      - |       99899
##       99929     chr3   65525410-65555518      + |       99929
##       99982     chr4 136550540-136605723      - |       99982
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

In this manner, DB results from multiple window widths can be gathered together and reported as a single set of regions. Consolidation is most useful for histone marks and other analyses involving diffuse regions of enrichment. For such studies, the ideal window size is not known or may not even exist, e.g., if the widths of the enriched regions or DB subintervals are variable.

7.4.2 Weighting windows on abundance

Windows that are more likely to be DB can be upweighted to improve detection power. For example, in TF ChIP-seq data, the window of highest abundance within each enriched region probably contains the binding site. It is reasonable to assume that this window will also have the strongest DB. To improve power, the weight assigned to the most abundant window is increased relative to that of other windows in the same cluster. This means that the \(p\)-value of this window will have a greater influence on the final combined \(p\)-value.

Weights are computed so as to minimize conservativeness relative to the optimal unweighted approach in each possible scenario. If the strongest DB event is at the most abundant window, the weighted approach will yield a combined \(p\)-value that is no larger than twice the \(p\)-value of the most abundant window. (Here, the optimal approach would be to use the \(p\)-value of the most abundant window directly as a proxy for the \(p\)-value of the cluster.) If the strongest DB event is not at the most abundant window, the weighted approach will yield a combined \(p\)-value that is no larger than twice the combined \(p\)-value without weighting (which is optimal as all windows have equal probabilities of containing the strongest DB). All windows have non-zero weights, which ensures that any DB events in the other windows will still be considered when the \(p\)-values are combined.

The application of this weighting scheme is demonstrated in the example below. First, the getBestTest() function is used with by.pval=FALSE to identify the most abundant window in each cluster. Window-specific weights are then computed using the upweightSummit() function and supplied to combineTests() for use in computing the combined \(p\)-values.

tab.ave <- getBestTest(merged$ids, rowData(filtered.data), by.pval=FALSE)
weights <- upweightSummit(merged$ids, tab.ave$rep.test)
head(weights)
## [1] 1 5 1 1 1 1
tabcom.w <- combineTests(merged$ids, rowData(filtered.data), weight=weights)
head(tabcom.w)
## DataFrame with 6 rows and 8 columns
##   num.tests num.up.logFC num.down.logFC     PValue       FDR   direction
##   <integer>    <integer>      <integer>  <numeric> <numeric> <character>
## 1         5            3              0 0.01034505 0.0241858          up
## 2         3            2              0 0.00752378 0.0195728          up
## 3         2            2              0 0.02507497 0.0460910          up
## 4         2            0              0 0.11661410 0.1541495          up
## 5         2            0              0 0.12205930 0.1602224          up
## 6         5            2              0 0.01302987 0.0282643          up
##    rep.test rep.logFC
##   <integer> <numeric>
## 1         2   1.40804
## 2         7   1.81156
## 3         9   1.57462
## 4        12   1.03337
## 5        14   1.08339
## 6        17   1.32711

The weighting approach can also be applied to the clusters from the broad gene body example. This is done by replacing the call to getBestTest() with one to getBestOverlaps(), as before. Similarly, upweightSummit() can be replaced with summitOverlaps(). These wrappers are designed to minimize book-keeping problems when one window overlaps multiple regions.

broad.best <- getBestOverlaps(olap, rowData(filtered.data), by.pval=FALSE)
head(broad.best[!is.na(broad.best$PValue),])
## DataFrame with 6 rows and 8 columns
##    num.tests num.up.logFC num.down.logFC      PValue         FDR   direction
##    <integer>    <integer>      <integer>   <numeric>   <numeric> <character>
## 7          1            1              0 2.55121e-05 0.000483176          up
## 18         4            1              0 5.99315e-03 0.016489994          up
## 23         3            1              0 2.56708e-02 0.046995910          up
## 25         2            0              0 7.33955e-01 0.754168860          up
## 28         3            1              0 3.53323e-05 0.000580726          up
## 36         4            1              0 2.39975e-03 0.008891211          up
##     rep.test rep.logFC
##    <integer> <numeric>
## 7       6995  3.396599
## 18      8324  1.690217
## 23       315  1.331105
## 25      9977  0.201873
## 28      8774  3.503338
## 36      2717  1.957662
broad.weights <- summitOverlaps(olap, region.best=broad.best$rep.test)
tabbroad.w <- combineOverlaps(olap, rowData(filtered.data), o.weight=broad.weights) 

7.4.3 Filtering after testing but before correction

Most of the filters in Chapter 4 are applied before the statistical analysis. However, some of the approaches may be too aggressive, e.g., filtering to retain only local maxima or based on pre-defined regions. In such cases, it may be preferable to initially apply one of the other, milder filters. This ensures that sufficient windows are retained for stable normalization and/or EB shrinkage. The aggressive filters can then be applied after the window-level statistics have been calculated, but before clustering into regions and calculation of cluster-level statistics. This is still beneficial as it removes irrelevant windows that would increase the severity of the BH correction. It may also reduce chaining effects during clustering.
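
One possible implementation is sketched below, using the gene bodies in broads from Section 7.2.2 as the pre-defined regions for the aggressive post-hoc filter; the milder abundance filter was already applied when constructing filtered.data.

# Post-hoc filtering: retain only windows overlapping pre-defined regions,
# after computing window-level statistics but before clustering and correction.
keep <- overlapsAny(rowRanges(filtered.data), broads)
post.merged <- mergeWindows(filtered.data[keep,], tol=1000L)
post.tabcom <- combineTests(post.merged$ids, rowData(filtered.data)[keep,])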

7.5 FDR control in difficult situations

7.5.1 Clustering only on DB windows for diffuse marks

The clustering procedures described above rely on independent filtering to remove irrelevant windows. This ensures that the regions of interest are reasonably narrow and can be easily interpreted, which is typically the case for most protein targets, e.g., TFs, narrow histone marks. However, enriched regions may be very large for more diffuse marks. Such regions may be difficult to interpret when only the DB subinterval is of interest. To overcome this, a post-hoc analysis can be performed whereby only significant windows are used for clustering.

postclust <- clusterWindows(rowRanges(filtered.data), rowData(filtered.data),
                            target=0.05, tol=100, max.width=1000)
postclust$FDR
## [1] 0.04981
postclust$region
## GRanges object with 2068 ranges and 0 metadata columns:
##                      seqnames              ranges strand
##                         <Rle>           <IRanges>  <Rle>
##      [1]                 chr1     7397951-7398010      *
##      [2]                 chr1     9541451-9541510      *
##      [3]                 chr1   13372551-13372560      *
##      [4]                 chr1   13590001-13590010      *
##      [5]                 chr1   15805551-15805660      *
##      ...                  ...                 ...    ...
##   [2064]                 chrX 104482701-104482760      *
##   [2065]                 chrX 106187051-106187060      *
##   [2066]                 chrX 136741351-136741360      *
##   [2067]                 chrX 140456551-140456560      *
##   [2068] chrX_GL456233_random       336801-336910      *
##   -------
##   seqinfo: 66 sequences from an unspecified genome

This will define and cluster significant windows in a manner that controls the cluster-level FDR at 5%. The clustering step itself is performed using mergeWindows() with the specified parameters. Each cluster consists entirely of DB windows and can be directly interpreted as a DB region or a DB subinterval of a larger enriched region. This reduces the pressure on abundance filtering to obtain well-separated regions prior to clustering, e.g., for diffuse marks or in data sets with weak IP signal. That said, users should be aware that calculation of the cluster-level FDR is not entirely rigorous. As such, independent clustering and FDR control via Simes’ method should be considered as the default for routine analyses.

7.5.2 Using the empirical FDR for noisy data

Some analyses involve comparisons of ChIP samples to negative controls. In such cases, any region exhibiting enrichment in the negative control over the ChIP samples must be a false positive. The number of significant regions that change in the “wrong” direction can be used as an estimate of the number of false positives at any given \(p\)-value threshold. Division by the number of discoveries changing in the “right” direction yields an estimate of the FDR, i.e., the empirical FDR (Zhang et al. 2008). This strategy is implemented in the empiricalFDR() function, which controls the empirical FDR across clusters based on their combined \(p\)-values. Its use is demonstrated below, though the output is not meaningful in this situation as genuine changes in binding can be present in both directions.

empres <- empiricalFDR(merged$ids, rowData(filtered.data))
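
The estimate itself involves simple arithmetic; a toy example (not csaw code) is shown below.

# Toy arithmetic for the empirical FDR: 25 significant regions in the "wrong"
# direction out of 500 in the "right" direction gives an estimated FDR of 5%.
25/500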

The empirical FDR is useful for analyses of noisy data with high levels of non-specific binding. This is because the estimate of the number of false positives adapts to the observed number of regions exhibiting enrichment in the negative controls. In contrast, the standard BH method in combineTests() relies on proper type I error control during hypothesis testing. As non-specific binding events tend to be condition-specific, they are indistinguishable from DB events and assigned low \(p\)-values, resulting in loss of FDR control. Thus, for noisy data, use of the empirical FDR may be more appropriate to control the proportion of “experimental” false positives. However, calculation of the empirical FDR is not as statistically rigorous as that of the BH method, so users are advised to only apply it when necessary.

7.5.3 Detecting complex DB

Complex DB events involve changes to the shape of the binding profile, not just a scaling increase/decrease to binding intensity. Such regions may contain multiple sites that change in binding strength in opposite directions, or peaks that change in width or position between conditions. This often manifests as DB in opposite directions in different subintervals of a region. Some of these events can be identified using the mixedTests() function.

tab.mixed <- mixedTests(merged$ids, rowData(filtered.data))
tab.mixed
## DataFrame with 3577 rows and 10 columns
##      num.tests num.up.logFC num.down.logFC    PValue       FDR   direction
##      <integer>    <integer>      <integer> <numeric> <numeric> <character>
## 1            5            5              0  0.998279         1       mixed
## 2            3            3              0  0.997980         1       mixed
## 3            2            2              0  0.993109         1       mixed
## 4            2            0              0  0.952956         1       mixed
## 5            2            0              0  0.959314         1       mixed
## ...        ...          ...            ...       ...       ...         ...
## 3573         3            0              3  1.000000         1       mixed
## 3574         1            0              0  0.936589         1       mixed
## 3575         2            0              0  0.857010         1       mixed
## 3576         1            1              0  0.971229         1       mixed
## 3577         2            0              0  0.754204         1       mixed
##      rep.up.test rep.up.logFC rep.down.test rep.down.logFC
##        <integer>    <numeric>     <integer>      <numeric>
## 1              3      1.57358             3        1.57358
## 2              7      1.81156             8        1.98350
## 3              9      1.57462            10        1.69016
## 4             12      1.03337            11        1.12021
## 5             14      1.08339            14        1.08339
## ...          ...          ...           ...            ...
## 3573       12345    -2.607148         12345      -2.607148
## 3574       12347    -0.987683         12347      -0.987683
## 3575       12349    -0.636868         12348      -0.440184
## 3576       12350     1.246127         12350       1.246127
## 3577       12351    -0.430187         12352      -0.163263

mixedTests() converts the \(p\)-value for each window into two one-sided \(p\)-values. The one-sided \(p\)-values in each direction are combined using Simes’ method, and the two one-sided combined \(p\)-values are themselves combined using an intersection-union test (Berger and Hsu 1996). The resulting \(p\)-value is only low if a region contains strong DB in both directions.
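
The logic can be sketched for a single cluster as follows. This is only a rough reconstruction for intuition; the actual implementation may differ in detail.

# Rough sketch of the mixed test for the first cluster: split each two-sided
# p-value into one-sided p-values using the sign of the log-fold change,
# combine each direction with Simes' method, then take the larger of the two.
in.first <- merged$ids == 1
tab.first <- rowData(filtered.data)[in.first,]
p.up <- ifelse(tab.first$logFC > 0, tab.first$PValue/2, 1 - tab.first$PValue/2)
p.down <- ifelse(tab.first$logFC > 0, 1 - tab.first$PValue/2, tab.first$PValue/2)
simes <- function(p) min(sort(p) * length(p) / seq_along(p))
max(simes(p.up), simes(p.down)) # compare to tab.mixed$PValue[1]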

combineTests() also computes some statistics for informal detection of complex DB. For example, the num.up.logFC and num.down.logFC fields can be used to identify regions with changes in both directions. The direction field will also label some regions as "mixed", though this is not comprehensive. Indeed, regions labelled as "up" or "down" in the direction field may also correspond to complex DB events, but will not be labelled as "mixed" if the significance calculations are dominated by windows changing in only one direction.

7.5.4 Enforcing a minimal number of DB windows

On occasion, we may be interested in genomic regions that contain at least a minimal number or proportion of DB windows. This is motivated by the desire to avoid detecting DB regions where only a small subinterval exhibits a change, instead favoring more systematic changes throughout the region that are easier to interpret. We can identify these regions using the minimalTests() function.

tab.min <- minimalTests(merged$ids, rowData(filtered.data),
    min.sig.n=3, min.sig.prop=0.5)
tab.min
## DataFrame with 3577 rows and 8 columns
##      num.tests num.up.logFC num.down.logFC      PValue         FDR   direction
##      <integer>    <integer>      <integer>   <numeric>   <numeric> <character>
## 1            5            2              0   0.0834885   0.1512860          up
## 2            3            2              0   0.0987863   0.1706221          up
## 3            2            2              0   0.0275654   0.0719194          up
## 4            2            0              0   0.1881740   0.2739193          up
## 5            2            0              0   0.2330409   0.3208573          up
## ...        ...          ...            ...         ...         ...         ...
## 3573         3            0              3 7.95107e-06 0.000634123        down
## 3574         1            0              0 1.26823e-01 0.204805974        down
## 3575         2            0              0 5.71959e-01 0.669776383        down
## 3576         1            0              0 5.75426e-02 0.116844527          up
## 3577         2            0              0 9.83184e-01 1.000000000        down
##       rep.test rep.logFC
##      <integer> <numeric>
## 1            1  1.238147
## 2            6  1.094822
## 3            9  1.574622
## 4           12  1.033365
## 5           13  0.739039
## ...        ...       ...
## 3573     12346 -2.534971
## 3574     12347 -0.987683
## 3575     12348 -0.440184
## 3576     12350  1.246127
## 3577     12352 -0.163263

minimalTests() applies a Holm-Bonferroni correction to all windows in the same cluster and picks the \(x\)th-smallest adjusted \(p\)-value (where \(x\) is defined from min.sig.n and min.sig.prop). This tests the joint null hypothesis for each cluster, i.e., that the per-window null hypothesis is false for fewer than \(x\) windows in that cluster. If the \(x\)th-smallest adjusted \(p\)-value is low, this provides strong evidence against the joint null for that cluster.
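
A rough sketch of this calculation for the first cluster is shown below, assuming that \(x\) is the larger of min.sig.n and min.sig.prop times the number of windows (capped at the cluster size); the actual implementation may differ in its details.

# Manual version of the minimalTests() statistic for the first cluster.
in.first <- merged$ids == 1
p.first <- rowData(filtered.data)$PValue[in.first]
x <- min(max(3, ceiling(0.5 * length(p.first))), length(p.first))
sort(p.adjust(p.first, method="holm"))[x] # compare to tab.min$PValue[1]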

As an aside, this function also has some utility outside of ChIP-seq contexts. For example, we might want to obtain a single \(p\)-value for a gene set based on the presence of a minimal percentage of differentially expressed genes. Alternatively, we may be interested in ranking genes in loss-of-function screens based on a minimal number of shRNA/CRISPR guides that exhibit a significant effect. These problems are equivalent to that of identifying a genomic region with a minimal number of DB windows.

Session information

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] chipseqDBData_1.21.0                     
 [2] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
 [3] GenomicFeatures_1.58.0                   
 [4] AnnotationDbi_1.68.0                     
 [5] csaw_1.40.0                              
 [6] SummarizedExperiment_1.36.0              
 [7] Biobase_2.66.0                           
 [8] MatrixGenerics_1.18.0                    
 [9] matrixStats_1.4.1                        
[10] GenomicRanges_1.58.0                     
[11] GenomeInfoDb_1.42.0                      
[12] IRanges_2.40.0                           
[13] S4Vectors_0.44.0                         
[14] BiocGenerics_0.52.0                      
[15] BiocStyle_2.34.0                         
[16] rebook_1.16.0                            

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1         dplyr_1.1.4              blob_1.2.4              
 [4] filelock_1.0.3           Biostrings_2.74.0        bitops_1.0-9            
 [7] fastmap_1.2.0            RCurl_1.98-1.16          BiocFileCache_2.14.0    
[10] GenomicAlignments_1.42.0 XML_3.99-0.17            digest_0.6.37           
[13] mime_0.12                lifecycle_1.0.4          statmod_1.5.0           
[16] KEGGREST_1.46.0          RSQLite_2.3.7            magrittr_2.0.3          
[19] compiler_4.4.1           rlang_1.1.4              sass_0.4.9              
[22] tools_4.4.1              utf8_1.2.4               yaml_2.3.10             
[25] rtracklayer_1.66.0       knitr_1.48               S4Arrays_1.6.0          
[28] bit_4.5.0                curl_5.2.3               DelayedArray_0.32.0     
[31] abind_1.4-8              BiocParallel_1.40.0      withr_3.0.2             
[34] purrr_1.0.2              CodeDepends_0.6.6        grid_4.4.1              
[37] fansi_1.0.6              ExperimentHub_2.14.0     edgeR_4.4.0             
[40] cli_3.6.3                rmarkdown_2.28           crayon_1.5.3            
[43] generics_0.1.3           metapod_1.14.0           httr_1.4.7              
[46] rjson_0.2.23             DBI_1.2.3                cachem_1.1.0            
[49] zlibbioc_1.52.0          parallel_4.4.1           BiocManager_1.30.25     
[52] XVector_0.46.0           restfulr_0.0.15          vctrs_0.6.5             
[55] Matrix_1.7-1             jsonlite_1.8.9           dir.expiry_1.14.0       
[58] bookdown_0.41            bit64_4.5.2              locfit_1.5-9.10         
[61] limma_3.62.0             jquerylib_0.1.4          glue_1.8.0              
[64] codetools_0.2-20         BiocVersion_3.20.0       BiocIO_1.16.0           
[67] UCSC.utils_1.2.0         tibble_3.2.1             pillar_1.9.0            
[70] rappdirs_0.3.3           htmltools_0.5.8.1        graph_1.84.0            
[73] GenomeInfoDbData_1.2.13  dbplyr_2.5.0             R6_2.5.1                
[76] evaluate_1.0.1           lattice_0.22-6           AnnotationHub_3.14.0    
[79] png_0.1-8                Rsamtools_2.22.0         memoise_2.0.1           
[82] bslib_0.8.0              Rcpp_1.0.13              SparseArray_1.6.0       
[85] xfun_0.48                pkgconfig_2.0.3         

Bibliography

Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J. R. Stat. Soc. Series B 57: 289–300.

Benjamini, Y., and Y. Hochberg. 1997. “Multiple Hypotheses Testing with Weights.” Scand. J. Stat. 24: 407–18.

Berger, R. L., and J. C. Hsu. 1996. “Bioequivalence Trials, Intersection-Union Tests and Equivalence Confidence Sets.” Statist. Sci. 11 (4): 283–319.

Lun, A. T., and G. K. Smyth. 2014. “De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly.” Nucleic Acids Res. 42 (11): e95.

Simes, R. J. 1986. “An Improved Bonferroni Procedure for Multiple Tests of Significance.” Biometrika 73 (3): 751–54.

Zhang, Y., T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, et al. 2008. “Model-based analysis of ChIP-Seq (MACS).” Genome Biol. 9 (9): R137.