--- title: "Introduction to betterChromVAR" author: - name: Pierre-Luc Germain affiliation: - ETH Zurich, Switzerland email: pierre-luc.germain@hest.ethz.ch date: "`r doc_date()`" output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show bibliography: ../inst/REFERENCES.bib package: "`r pkg_ver('betterChromVAR')`" vignette: > %\VignetteIndexEntry{Introduction to betterChromVAR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction The `r Biocpkg("chromVAR")` R package was originally developed by Alicia Schep and colleagues from the Greenleaf lab [@schep2017chromvar]. Although originally designed for single-cell ATAC-seq data, it has been shown to be highly sensitive for bulk data as well [@gerbaldo2024identification]. The aim of the method is to infer, based on the accessibility of motif matches, the relative activity of transcription factors (TFs) in each sample or cell, adjusting for technical biases (GC content and enrichment bias). It is recommended that you read the `r Biocpkg("chromVAR")` documentation before using this package. `r Biocpkg("betterChromVAR")` is first and foremost a considerably faster, analytical re-implementation of the original method (it is also considerably faster than the C++ reimplementation in ArchR [@granja2021archr]. Contrarily to the original chromVAR, it is entirely deterministic, and achieves much higher efficiency by computing expectations and variance at the level of bias bins, instead of in the peak-space. In addition, `r Biocpkg("betterChromVAR")` includes a few additions, such as simpler weighted expectations, bias shrinkage, and an ATAC-seq normalization method based on the chromVAR logic. Importantly, however, not all functionalities of chromVAR have been reimplemented here: `r Biocpkg("betterChromVAR")` chiefly focuses on the key task of efficiently computing motif deviations. # Installing `betterChromVAR` `r Biocpkg("betterChromVAR")` is a `R` package available via the [Bioconductor](http://bioconductor.org) repository. It can be installed using the following commands in your `R` session: ```{r "install", eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) { install.packages("BiocManager") } BiocManager::install("betterChromVAR") ## Check that you have a valid Bioconductor installation BiocManager::valid() ``` # Computing motif deviations with betterChromVAR `r Biocpkg("betterChromVAR")` takes two primary inputs: 1) the counts in peaks across cells or samples, and 2) an annotation of which TFs/motifs match which peak^[The annotation can be binary or probabilistic (i.e. from 0 to 1). If using motif matches, however, it is recommended to input binary matches, as the magnitude of motif scores is generally poorly correlated to actual binding.]. The peak counts should be provided as a `RangedSummarizedExperiment` object^[See `r Biocpkg("SummarizedExperiment")` if you're not familiar with those], while the annotation can either be in that format too or provided as a (sparse) matrix. There are multiple ways of generating your peak counts; the original `r Biocpkg("chromVAR")` package includes such a function (`getCounts()`), and the `r Githubpkg("ETHZ-INS/epiwraps", "epiwraps")` package has some with more options (functions `peakCountsFromBAM()` and `peakCountsFromFrags()`). Similarly, the `r Biocpkg("motifmatchr")` package can be used to generate the motif matching annotation. An important consideration is that the peaks or regions used for the purpose of this analysis should have similar widths. It is thus highly recommended, before generating the count and annotation matrices, to resize your regions (e.g. using `peaks <- resize(peaks, width=300, fix="center")`). The exact size can be something not too large (otherwise the presence/absence of a motif becomes meaningless) and ideally close to the median size of your original peaks. In addition, is it advisable to restrict your peaks to those that are on standard chromosomes^[See `keepStandardChromosomes()` from the `r Biocpkg("GenomeInfoDb")` package]. Here we'll use dummy data as example: ```{r loadData} suppressPackageStartupMessages({ library(SummarizedExperiment) library(betterChromVAR) }) attach(getDummyData()) counts head(motifMatches) ``` In this case, we can see that `rowData(counts)` already has a `bias` column indicating the GC content of the regions: ```{r inspectRowData} rowData(counts) ``` Had this not been the case, we would first need to add this using: ```{r addGCBias, eval=FALSE} # not run counts <- addGCBias(counts, genome=my_genome) ``` where `my_genome` is a `r Biocpkg("BSgenome")` object or similar (e.g. an `FaFile`^[See the `r Biocpkg("Rsamtools")` package.]). Once we have this, we can launch the computation of the deviations : ```{r betterChromVAR} dev <- betterChromVAR(counts, motifMatches) dev ``` The resulting `dev` object is a SummarizedExperiment with the same columns (and colData) as the original `counts` object, but with the motifs as rows (instead of the original peaks). The values can be interpreted as the relative activity, across samples or cells, of the corresponding TFs^[Note, however, that similar motifs will be given similar activity estimates, so that it is often hard to know which of a set of highly-similar motifs is in fact responsible for the observed signal.]. When the function is used with default arguments, the results are virtually the same as the original chromVAR (averaging over the noise coming from the random background selection of the original method). The object contains two assays: the `deviations` assay contains the bias-adjusted deviations from the expectation (by default, the average), i.e. the difference to the expectation divided by the expectation, and the `z` assay contains z-scores, i.e. the difference to the expectation divided by the variance of the expectation. If your samples/cells have similar library sizes, it is recommended that you use the `z` assay for downstream analysis (such as differential TF activity using `limma`). If the samples have very different library sizes, the `z` scores will be influenced by that, and it might be preferable to use the `deviations` assay. The variability of each motif across the dataset is stored in the rowData of the object : ```{r variability} rowData(dev) ``` If bootstrap confidence intervals of the variability are additionally desired, one can simply use the `computeVariability` function of `r Biocpkg("chromVAR")` on the present output (i.e. `dev`). ## Individual steps of betterChromVAR The `betterChromVAR()` function is actually a wrapper around three steps, which can also be executed individually for more customization, or to avoid repeating some computation multiple times. The first step is creating the bias bins and their pairwise sampling probabilities. This is achieved with: ```{r getBackgroundBins} bg <- getBackgroundBins(counts) bg ``` The second step is computing the per-bin background expectations and variances for each sample or cell: ```{r computeBackgrounds} bg <- computeBackgrounds(counts, bg) bg ``` The `bg` object has now been filled with the additional information. It can now be used for: ```{r computeDeviations} dev <- computeDeviationsAnalytic(counts, background=bg, annotations=motifMatches) dev ``` This last call could be made with different annotations, without needing to recompute the previous steps. ## Including fragment length bias In addition to the enrichment and GC bias taken into account by the original `r Biocpkg("chromVAR")`, `r Biocpkg("betterChromVAR")` supports an optional third dimension: fragment length bias (see `?getBackgroundBins()` for more information). This requires the compilation of an additional bias component (the log10-transformed mean or median fragment length per region). If desired, desired, we recommended using the counting functions from the `r Githubpkg("ETHZ-INS/epiwraps", "epiwraps")` package, which can provide this information. ## Note on single-cell data If you are using betterChromVAR (or ChromVAR, for that matter) on single-cell data with multiple cell types, the way to best run it depends on whether your are interested in differences within cell types, or between cell types. If interested in differences between cell types, run it on the entire dataset, specifying the cell types as `grouping` argument to `betterChromVAR()`. In this way, rare and abundant cell types will be given the same weight in computing the expectation. If you are interested in differences within cell types (e.g. between samples/conditions), you should instead run the method separately for each cell type. In this way, the background bins will be defined based on the enrichment in the given cell type, leading to better capture of the bias. The drawback, however, is that the values won't be comparable across cell types. To test across conditions, we also recommend using it on pseudobulk data. ### Execution on Atlas-scale datasets The only step of the process that needs to be done across the entire dataset is the computing of the expectation, i.e. the mean (or weighted mean) for each region across all cells, and the creation of the background bins, which depends on it. Everything else can easily be executed in chunks (of cells), as is done for multi-threading in the `betterChromVAR()` function. This can be done manually on the full dataset, and then passed to downstream functions for computations on subsets of the data: ```{r getExpectations} ex <- getExpectation(counts) bg <- getBackgroundBins(ex, bias=rowData(counts)$bias) # this is equivalent to bg <- getBackgroundBins(counts) ``` Then we can apply the next steps only on subset of the data: ```{r chunkCompute} bg2 <- computeBackgrounds(counts[,1:3], bg, expectation = ex) dev2 <- computeDeviationsAnalytic(counts[,1:3], bg2, motifMatches) # this should be identical to what we had run on the whole object: identical(assay(dev)[,1:3], assay(dev2)) ``` If data is on disk, rather than in memory, and you are adjusting the expectation based on cell-type abundance using the `grouping` argument, `getBackgroundBins` will load the entire count matrix into memory. To avoid this, compute the mean per group using block-wise operations, for instance using the `aggregateAcrossCells()` function of the `r Biocpkg("scrapper")` package, and then use the `rowMeans()` of that as expectation.

# Bias-normalization of bulk ATAC-seq data The general chromVAR approach, and in particular the deterministic version implemented here, is also amenable to be used to normalize GC- and enrichment bias out of bulk ATAC-seq data. This is implemented in the `CVnorm()` function. In addition, if given a grouping of the samples (e.g. experimental conditions), the function applies a variance-based smoothing of the bias correction, inspired by smoothed quantile normalization (see the `r Biocpkg("qsmooth")` package or @hicks2018smooth ). In a nutshell, if the bias in a certain background bin is explained by experimental groups, it will be less corrected than if it varies across samples of the groups^[Note that this is different from the original `r Biocpkg("qsmooth")` approach -- see `?CVnorm` for more detail.]. Example usage: ```{r CVnorm} # we assign arbitrary groups to the samples: counts$group <- rep(LETTERS[1:2], each=5) # we run the smoothed CVnorm: counts <- CVnorm(counts, grouping=counts$group) # (the normal CVnorm could be run by omitting the grouping) counts ``` This adds a `corrected` assay to the object. Note that although the assay is corrected for enrichment- and GC-bias, it is not corrected for library size differences. Rather, it is on the original count scale (although not integer anymore), so that it is amenable to use in downstream count-based analysis methods such as `r Biocpkg("edgeR")`. # Appendix ## References
## Session info ```{r sessionInfo, echo=FALSE} sessionInfo() ```