---
title: "Introduction to betterChromVAR"
author:
- name: Pierre-Luc Germain
affiliation:
- ETH Zurich, Switzerland
email: pierre-luc.germain@hest.ethz.ch
date: "`r doc_date()`"
output:
BiocStyle::html_document:
self_contained: yes
toc: true
toc_float: true
toc_depth: 2
code_folding: show
bibliography: ../inst/REFERENCES.bib
package: "`r pkg_ver('betterChromVAR')`"
vignette: >
%\VignetteIndexEntry{Introduction to betterChromVAR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
The `r Biocpkg("chromVAR")` R package was originally developed by Alicia Schep
and colleagues from the Greenleaf lab [@schep2017chromvar].
Although originally designed for single-cell ATAC-seq data, it has been shown
to be highly sensitive for bulk data as well [@gerbaldo2024identification]. The
aim of the method is to infer, based on the accessibility of motif matches, the
relative activity of transcription factors (TFs) in each sample or cell,
adjusting for technical biases (GC content and enrichment bias). It is
recommended that you read the `r Biocpkg("chromVAR")` documentation before
using this package.
`r Biocpkg("betterChromVAR")` is first and foremost a considerably faster,
analytical re-implementation of the original method (it is also considerably
faster than the C++ reimplementation in ArchR [@granja2021archr].
Contrarily to the original chromVAR, it is entirely deterministic, and achieves
much higher efficiency by computing expectations and variance at the level of
bias bins, instead of in the peak-space.
In addition, `r Biocpkg("betterChromVAR")` includes a few additions, such as
simpler weighted expectations, bias shrinkage, and an ATAC-seq normalization
method based on the chromVAR logic. Importantly, however, not all
functionalities of chromVAR have been reimplemented here:
`r Biocpkg("betterChromVAR")` chiefly focuses on the key task of efficiently
computing motif deviations.
# Installing `betterChromVAR`
`r Biocpkg("betterChromVAR")` is a `R` package available via the
[Bioconductor](http://bioconductor.org) repository. It can be installed using
the following commands in your `R` session:
```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("betterChromVAR")
## Check that you have a valid Bioconductor installation
BiocManager::valid()
```
# Computing motif deviations with betterChromVAR
`r Biocpkg("betterChromVAR")` takes two primary inputs: 1) the counts in peaks
across cells or samples, and 2) an annotation of which TFs/motifs match which
peak^[The annotation can be binary or probabilistic (i.e. from 0 to 1). If using
motif matches, however, it is recommended to input binary matches, as the
magnitude of motif scores is generally poorly correlated to actual binding.].
The peak counts should be provided as a `RangedSummarizedExperiment`
object^[See `r Biocpkg("SummarizedExperiment")` if you're not familiar with
those], while the annotation can either be in that format too or provided as a
(sparse) matrix.
There are multiple ways of generating your peak counts; the original
`r Biocpkg("chromVAR")` package includes such a function (`getCounts()`), and
the `r Githubpkg("ETHZ-INS/epiwraps", "epiwraps")` package has some with more
options (functions `peakCountsFromBAM()` and `peakCountsFromFrags()`). Similarly,
the `r Biocpkg("motifmatchr")` package can be used to generate the motif
matching annotation. An important consideration is that the peaks or regions
used for the purpose of this analysis should have similar widths. It is thus
highly recommended, before generating the count and annotation matrices, to
resize your regions (e.g. using
`peaks <- resize(peaks, width=300, fix="center")`). The exact size can be
something not too large (otherwise the presence/absence of a motif becomes
meaningless) and ideally close to the median size of your original peaks. In
addition, is it advisable to restrict your peaks to those that are on standard
chromosomes^[See `keepStandardChromosomes()` from the
`r Biocpkg("GenomeInfoDb")` package].
Here we'll use dummy data as example:
```{r loadData}
suppressPackageStartupMessages({
library(SummarizedExperiment)
library(betterChromVAR)
})
attach(getDummyData())
counts
head(motifMatches)
```
In this case, we can see that `rowData(counts)` already has a `bias` column
indicating the GC content of the regions:
```{r inspectRowData}
rowData(counts)
```
Had this not been the case, we would first need to add this using:
```{r addGCBias, eval=FALSE}
# not run
counts <- addGCBias(counts, genome=my_genome)
```
where `my_genome` is a `r Biocpkg("BSgenome")` object or similar (e.g. an
`FaFile`^[See the `r Biocpkg("Rsamtools")` package.]).
Once we have this, we can launch the computation of the deviations :
```{r betterChromVAR}
dev <- betterChromVAR(counts, motifMatches)
dev
```
The resulting `dev` object is a SummarizedExperiment with the same columns
(and colData) as the original `counts` object, but with the motifs as rows
(instead of the original peaks). The values can be interpreted as the relative
activity, across samples or cells, of the corresponding TFs^[Note, however,
that similar motifs will be given similar activity estimates, so that it is
often hard to know which of a set of highly-similar motifs is in fact
responsible for the observed signal.]. When the function is used with default
arguments, the results are virtually the same as the original chromVAR
(averaging over the noise coming from the random background selection of the
original method).
The object contains two assays: the `deviations` assay contains the
bias-adjusted deviations from the expectation (by default, the average), i.e.
the difference to the expectation divided by the expectation, and the `z` assay
contains z-scores, i.e. the difference to the expectation divided by the
variance of the expectation.
If your samples/cells have similar library sizes, it is recommended that you
use the `z` assay for downstream analysis (such as differential TF activity
using `limma`). If the samples have very different library sizes, the `z` scores
will be influenced by that, and it might be preferable to use the `deviations`
assay.
The variability of each motif across the dataset is stored in the rowData of the
object :
```{r variability}
rowData(dev)
```
If bootstrap confidence intervals of the variability are additionally desired,
one can simply use the `computeVariability` function of `r Biocpkg("chromVAR")`
on the present output (i.e. `dev`).
## Individual steps of betterChromVAR
The `betterChromVAR()` function is actually a wrapper around three steps, which
can also be executed individually for more customization, or to avoid repeating
some computation multiple times.
The first step is creating the bias bins and their pairwise sampling
probabilities. This is achieved with:
```{r getBackgroundBins}
bg <- getBackgroundBins(counts)
bg
```
The second step is computing the per-bin background expectations and variances
for each sample or cell:
```{r computeBackgrounds}
bg <- computeBackgrounds(counts, bg)
bg
```
The `bg` object has now been filled with the additional information. It can now
be used for:
```{r computeDeviations}
dev <- computeDeviationsAnalytic(counts, background=bg, annotations=motifMatches)
dev
```
This last call could be made with different annotations, without needing to
recompute the previous steps.
## Including fragment length bias
In addition to the enrichment and GC bias taken into account by the original
`r Biocpkg("chromVAR")`, `r Biocpkg("betterChromVAR")` supports an optional
third dimension: fragment length bias (see `?getBackgroundBins()` for more
information). This requires the compilation of an additional bias component
(the log10-transformed mean or median fragment length per region). If desired,
desired, we recommended using the counting functions from the
`r Githubpkg("ETHZ-INS/epiwraps", "epiwraps")` package, which can provide this
information.
## Note on single-cell data
If you are using betterChromVAR (or ChromVAR, for that matter) on single-cell
data with multiple cell types, the way to best run it depends on whether your
are interested in differences within cell types, or between cell types. If
interested in differences between cell types, run it on the entire dataset,
specifying the cell types as `grouping` argument to `betterChromVAR()`. In this
way, rare and abundant cell types will be given the same weight in computing
the expectation.
If you are interested in differences within cell types (e.g. between
samples/conditions), you should instead run the method separately for each cell
type. In this way, the background bins will be defined based on the enrichment
in the given cell type, leading to better capture of the bias. The drawback,
however, is that the values won't be comparable across cell types. To test
across conditions, we also recommend using it on pseudobulk data.
### Execution on Atlas-scale datasets
The only step of the process that needs to be done across the entire dataset is
the computing of the expectation, i.e. the mean (or weighted mean) for each
region across all cells, and the creation of the background bins, which depends
on it. Everything else can easily be executed in chunks (of cells), as is done
for multi-threading in the `betterChromVAR()` function. This can be done
manually on the full dataset, and then passed to downstream functions for
computations on subsets of the data:
```{r getExpectations}
ex <- getExpectation(counts)
bg <- getBackgroundBins(ex, bias=rowData(counts)$bias)
# this is equivalent to bg <- getBackgroundBins(counts)
```
Then we can apply the next steps only on subset of the data:
```{r chunkCompute}
bg2 <- computeBackgrounds(counts[,1:3], bg, expectation = ex)
dev2 <- computeDeviationsAnalytic(counts[,1:3], bg2, motifMatches)
# this should be identical to what we had run on the whole object:
identical(assay(dev)[,1:3], assay(dev2))
```
If data is on disk, rather than in memory, and you are adjusting the expectation
based on cell-type abundance using the `grouping` argument, `getBackgroundBins`
will load the entire count matrix into memory. To avoid this, compute the
mean per group using block-wise operations, for instance using the
`aggregateAcrossCells()` function of the `r Biocpkg("scrapper")` package, and
then use the `rowMeans()` of that as expectation.
# Bias-normalization of bulk ATAC-seq data
The general chromVAR approach, and in particular the deterministic version
implemented here, is also amenable to be used to normalize GC- and enrichment
bias out of bulk ATAC-seq data. This is implemented in the `CVnorm()` function.
In addition, if given a grouping of the samples (e.g. experimental conditions),
the function applies a variance-based smoothing of the bias correction,
inspired by smoothed quantile normalization (see the `r Biocpkg("qsmooth")`
package or @hicks2018smooth ).
In a nutshell, if the bias in a certain background bin is explained by
experimental groups, it will be less corrected than if it varies across samples
of the groups^[Note that this is different from the original
`r Biocpkg("qsmooth")` approach -- see `?CVnorm` for more detail.].
Example usage:
```{r CVnorm}
# we assign arbitrary groups to the samples:
counts$group <- rep(LETTERS[1:2], each=5)
# we run the smoothed CVnorm:
counts <- CVnorm(counts, grouping=counts$group)
# (the normal CVnorm could be run by omitting the grouping)
counts
```
This adds a `corrected` assay to the object. Note that although the assay is
corrected for enrichment- and GC-bias, it is not corrected for library size
differences. Rather, it is on the original count scale (although not integer
anymore), so that it is amenable to use in downstream count-based analysis
methods such as `r Biocpkg("edgeR")`.
# Appendix
## References