---
title: "Introduction to betterChromVAR"
author: 
  - name: Pierre-Luc Germain
    affiliation:
    - ETH Zurich, Switzerland
    email: pierre-luc.germain@hest.ethz.ch
date: "`r doc_date()`"
output: 
  BiocStyle::html_document:
    self_contained: yes
    toc: true
    toc_float: true
    toc_depth: 2
    code_folding: show
bibliography: ../inst/REFERENCES.bib
package: "`r pkg_ver('betterChromVAR')`"
vignette: >
  %\VignetteIndexEntry{Introduction to betterChromVAR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

# Introduction

The `r Biocpkg("chromVAR")` R package was originally developed by Alicia Schep 
and colleagues from the Greenleaf lab [@schep2017chromvar].
Although originally designed for single-cell ATAC-seq data, it has been shown 
to be highly sensitive for bulk data as well [@gerbaldo2024identification]. The 
aim of the method is to infer, based on the accessibility of motif matches, the 
relative activity of transcription factors (TFs) in each sample or cell,
adjusting for technical biases (GC content and enrichment bias). It is 
recommended that you read the `r Biocpkg("chromVAR")` documentation before 
using this package.

`r Biocpkg("betterChromVAR")` is first and foremost a considerably faster, 
analytical re-implementation of the original method (it is also considerably 
faster than the C++ reimplementation in ArchR [@granja2021archr].
Contrarily to the original chromVAR, it is entirely deterministic, and achieves 
much higher efficiency by computing expectations and variance at the level of 
bias bins, instead of in the peak-space.

In addition, `r Biocpkg("betterChromVAR")` includes a few additions, such as 
simpler weighted expectations, bias shrinkage, and an ATAC-seq normalization 
method based on the chromVAR logic. Importantly, however, not all 
functionalities of chromVAR have been reimplemented here: 
`r Biocpkg("betterChromVAR")` chiefly focuses on the key task of efficiently 
computing motif deviations.


# Installing `betterChromVAR`

`r Biocpkg("betterChromVAR")` is a `R` package available via the 
[Bioconductor](http://bioconductor.org) repository. It can be installed using
the following commands in your `R` session:

```{r "install", eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")
  }

BiocManager::install("betterChromVAR")

## Check that you have a valid Bioconductor installation
BiocManager::valid()
```


# Computing motif deviations with betterChromVAR

`r Biocpkg("betterChromVAR")` takes two primary inputs: 1) the counts in peaks 
across cells or samples, and 2) an annotation of which TFs/motifs match which 
peak^[The annotation can be binary or probabilistic (i.e. from 0 to 1). If using 
motif matches, however, it is recommended to input binary matches, as the 
magnitude of motif scores is generally poorly correlated to actual binding.]. 
The peak counts should be provided as a `RangedSummarizedExperiment` 
object^[See `r Biocpkg("SummarizedExperiment")` if you're not familiar with 
those], while the annotation can either be in that format too or provided as a 
(sparse) matrix.

There are multiple ways of generating your peak counts; the original 
`r Biocpkg("chromVAR")` package includes such a function (`getCounts()`), and 
the `r Githubpkg("ETHZ-INS/epiwraps", "epiwraps")` package has some with more 
options (functions `peakCountsFromBAM()` and `peakCountsFromFrags()`). Similarly, 
the `r Biocpkg("motifmatchr")` package can be used to generate the motif 
matching annotation. An important consideration is that the peaks or regions 
used for the purpose of this analysis should have similar widths. It is thus 
highly recommended, before generating the count and annotation matrices, to 
resize your regions (e.g. using 
`peaks <- resize(peaks, width=300, fix="center")`). The exact size can be 
something not too large (otherwise the presence/absence of a motif becomes 
meaningless) and ideally close to the median size of your original peaks. In 
addition, is it advisable to restrict your peaks to those that are on standard 
chromosomes^[See `keepStandardChromosomes()` from the 
`r Biocpkg("GenomeInfoDb")` package].

Here we'll use dummy data as example:

```{r loadData}
suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(betterChromVAR)
})
attach(getDummyData())
counts
head(motifMatches)
```

In this case, we can see that `rowData(counts)` already has a `bias` column 
indicating the GC content of the regions:

```{r inspectRowData}
rowData(counts)
```

Had this not been the case, we would first need to add this using:

```{r addGCBias, eval=FALSE}
# not run
counts <- addGCBias(counts, genome=my_genome)
```

where `my_genome` is a `r Biocpkg("BSgenome")` object or similar (e.g. an 
`FaFile`^[See the `r Biocpkg("Rsamtools")` package.]).

Once we have this, we can launch the computation of the deviations :

```{r betterChromVAR}
dev <- betterChromVAR(counts, motifMatches)
dev
```

The resulting `dev` object is a SummarizedExperiment with the same columns 
(and colData) as the original `counts` object, but with the motifs as rows 
(instead of the original peaks). The values can be interpreted as the relative 
activity, across samples or cells, of the corresponding TFs^[Note, however, 
that similar motifs will be given similar activity estimates, so that it is 
often hard to know which of a set of highly-similar motifs is in fact 
responsible for the observed signal.]. When the function is used with default 
arguments, the results are virtually the same as the original chromVAR 
(averaging over the noise coming from the random background selection of the 
original method).

The object contains two assays: the `deviations` assay contains the 
bias-adjusted deviations from the expectation (by default, the average), i.e. 
the difference to the expectation divided by the expectation, and the `z` assay 
contains z-scores, i.e. the difference to the expectation divided by the 
variance of the expectation.

If your samples/cells have similar library sizes, it is recommended that you 
use the `z` assay for downstream analysis (such as differential TF activity 
using `limma`). If the samples have very different library sizes, the `z` scores 
will be influenced by that, and it might be preferable to use the `deviations` 
assay.

The variability of each motif across the dataset is stored in the rowData of the 
object :

```{r variability}
rowData(dev)
```

If bootstrap confidence intervals of the variability are additionally desired, 
one can simply use the `computeVariability` function of `r Biocpkg("chromVAR")` 
on the present output (i.e. `dev`).


## Individual steps of betterChromVAR

The `betterChromVAR()` function is actually a wrapper around three steps, which 
can also be executed individually for more customization, or to avoid repeating 
some computation multiple times.

The first step is creating the bias bins and their pairwise sampling 
probabilities. This is achieved with:

```{r getBackgroundBins}
bg <- getBackgroundBins(counts)
bg
```

The second step is computing the per-bin background expectations and variances 
for each sample or cell:

```{r computeBackgrounds}
bg <- computeBackgrounds(counts, bg)
bg
```

The `bg` object has now been filled with the additional information. It can now 
be used for:

```{r computeDeviations}
dev <- computeDeviationsAnalytic(counts, background=bg, annotations=motifMatches)
dev
```

This last call could be made with different annotations, without needing to 
recompute the previous steps.


## Including fragment length bias

In addition to the enrichment and GC bias taken into account by the original 
`r Biocpkg("chromVAR")`, `r Biocpkg("betterChromVAR")` supports an optional 
third dimension: fragment length bias (see `?getBackgroundBins()` for more 
information). This requires the compilation of an additional bias component 
(the log10-transformed mean or median fragment length per region). If desired, 
desired, we recommended using the counting functions from the 
`r Githubpkg("ETHZ-INS/epiwraps", "epiwraps")` package, which can provide this
information.

## Note on single-cell data

If you are using betterChromVAR (or ChromVAR, for that matter) on single-cell 
data with multiple cell types, the way to best run it depends on whether your 
are interested in differences within cell types, or between cell types. If 
interested in differences between cell types, run it on the entire dataset, 
specifying the cell types as `grouping` argument to `betterChromVAR()`. In this 
way, rare and abundant cell types will be given the same weight in computing 
the expectation.

If you are interested in differences within cell types (e.g. between 
samples/conditions), you should instead run the method separately for each cell 
type. In this way, the background bins will be defined based on the enrichment 
in the given cell type, leading to better capture of the bias. The drawback, 
however, is that the values won't be comparable across cell types. To test 
across conditions, we also recommend using it on pseudobulk data.

### Execution on Atlas-scale datasets

The only step of the process that needs to be done across the entire dataset is
the computing of the expectation, i.e. the mean (or weighted mean) for each 
region across all cells, and the creation of the background bins, which depends
on it. Everything else can easily be executed in chunks (of cells), as is done 
for multi-threading in the `betterChromVAR()` function. This can be done 
manually on the full dataset, and then passed to downstream functions for 
computations on subsets of the data:

```{r getExpectations}
ex <- getExpectation(counts)
bg <- getBackgroundBins(ex, bias=rowData(counts)$bias)
# this is equivalent to bg <- getBackgroundBins(counts)
```
Then we can apply the next steps only on subset of the data:

```{r chunkCompute}
bg2 <- computeBackgrounds(counts[,1:3], bg, expectation = ex)
dev2 <- computeDeviationsAnalytic(counts[,1:3], bg2, motifMatches)
# this should be identical to what we had run on the whole object:
identical(assay(dev)[,1:3], assay(dev2))
```

If data is on disk, rather than in memory, and you are adjusting the expectation
based on cell-type abundance using the `grouping` argument, `getBackgroundBins` 
will load the entire count matrix into memory. To avoid this, compute the 
mean per group using block-wise operations, for instance using the 
`aggregateAcrossCells()` function of the `r Biocpkg("scrapper")` package, and
then use the `rowMeans()` of that as expectation.

<br/><br/>

# Bias-normalization of bulk ATAC-seq data

The general chromVAR approach, and in particular the deterministic version 
implemented here, is also amenable to be used to normalize GC- and enrichment 
bias out of bulk ATAC-seq data. This is implemented in the `CVnorm()` function. 
In addition, if given a grouping of the samples (e.g. experimental conditions), 
the function applies a variance-based  smoothing of the bias correction, 
inspired by smoothed quantile normalization (see the `r Biocpkg("qsmooth")` 
package or @hicks2018smooth ). 
In a nutshell, if the bias in a certain background bin is explained by 
experimental groups, it will be less corrected than if it varies across samples 
of the groups^[Note that this is different from the original 
`r Biocpkg("qsmooth")` approach -- see `?CVnorm` for more detail.].

Example usage:

```{r CVnorm}
# we assign arbitrary groups to the samples:
counts$group <- rep(LETTERS[1:2], each=5)
# we run the smoothed CVnorm:
counts <- CVnorm(counts, grouping=counts$group)
# (the normal CVnorm could be run by omitting the grouping)
counts
```

This adds a `corrected` assay to the object. Note that although the assay is 
corrected for enrichment- and GC-bias, it is not corrected for library size 
differences. Rather, it is on the original count scale (although not integer 
anymore), so that it is amenable to use in downstream count-based analysis 
methods such as `r Biocpkg("edgeR")`.


# Appendix

## References

<div id="refs"></div>

## Session info

```{r sessionInfo, echo=FALSE}
sessionInfo()
```