---
title: "transformGamPoi Quickstart"
author: Constantin Ahlmann-Eltze
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{glmGamPoi Quickstart}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
set.seed(1)
par(cex = 1.5)
```


_transformGamPoi_ provides variance stabilizing transformations to handle the heteroskedasticity of count data. For example single-cell RNA sequencing counts vary more for highly expressed genes than for lowly expressed genes. However, many classical statistical methods perform best on data with uniform variance. This package provides a set of different variance stabilizing transformations to make the subsequent application of generic statistical methods more palatable.


# Installation

You can install _transformGamPoi_ from [Bioconductor](https://bioconductor.org/packages/transformGamPoi/) after it has been accepted using the following command

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("transformGamPoi")
```

In the mean time or to get the latest development version, you can install _transformGamPoi_ directly from [GitHub](https://github.com/const-ae/transformGamPoi) using the [devtools](https://devtools.r-lib.org/) package

```r
# install.packages("devtools")
devtools::install_github("const-ae/transformGamPoi")
```

# Example

The functions in _transformGamPoi_ take any kind of matrix-like object (e.g., `matrix`, `dgCMatrix`, `DelayedArray`, `SummarizedExperiment`, `SingleCellExperiment`) and return the corresponding transformed matrix objects. For sparse input the functions try to preserve sparsity. For container objects like `SummarizedExperiment`, _transformGamPoi_ extracts the `"counts"` assay and returns an object of the same type as that `"counts"` assay.

We start by loading the package to make the transformation functions available in our R session:
```{r setup}
library(transformGamPoi)
```

In the next step, we load some example data. Here, we use a single-cell RNA sequencing experiment of 4096 blood cells. For convenience, we subset the data to 1,000 genes and 5,00 cells
```{r loadData}
# Load the 'TENxPBMCData' as a SingleCellExperiment object
sce <- TENxPBMCData::TENxPBMCData("pbmc4k")
# Subset the data to 1,000 x 500 random genes and cells
sce <- sce[sample(nrow(sce), 1000), sample(ncol(sce), 500)]
```
If we take a look at the stored counts (mainly zeros) stored as a sparse matrix of type `DelayedMatrix`. Fortunately, the precise meaning of that storage type is not important, because _transformGamPoi_ handles this automatically.
```{r}
assay(sce, "counts")[1:10, 1:5]
```

To see what we mean by heteroskedasticity, let us compare the mean and variance for each gene across cells. We will use the [_MatrixGenerics_](https://bioconductor.org/packages/MatrixGenerics/) package to calculate the row means and row variances. You might be familiar with the [_matrixStats_](https://cran.r-project.org/package=matrixStats) package; _MatrixGenerics_ provides the same set of functions but depending on the type of the matrix automatically dispatches the call to either _matrixStats_, [_DelayedMatrixStats_](https://bioconductor.org/packages/DelayedMatrixStats/), or [_sparseMatrixStats_](https://bioconductor.org/packages/sparseMatrixStats/).
```{r}
library(MatrixGenerics)
# Exclude genes where all counts are zero
sce <- sce[rowMeans2(counts(sce)) > 0, ]
gene_means <- rowMeans2(counts(sce))
gene_var <- rowVars(counts(sce))
plot(gene_means, gene_var, log = "xy", main = "Log-log scatter plot of mean vs variance")
abline(a = 0, b = 1)
sorted_means <- sort(gene_means)
lines(sorted_means, sorted_means + 0.2 * sorted_means^2, col = "purple")
```
The purple line shows a quadratic mean-variance relation ($\text{Var} = \mu + 0.2 \mu^2$) typical for data that is Gamma-Poisson distributed. For example a gene with a mean expression of 5 the corresponding variance is 10, whereas for a gene with a mean expression of 500 the variance ~50,000. Here we used an overdispersion of $\alpha = 0.2$, _transformGamPoi_ provides options to either fit $\alpha$ on the data or fix it to a user-defined value.

_transformGamPoi_ implements two approaches for variance stabilization: (1) based on the delta method, (2) based on model residuals.

## Delta method-based variance stabilizing transformations

The delta method relates the standard deviation of a transformed random variable $g(X_i)$ to the standard deviation of the original random variable $X_i$. This can be used to find a function such that $g(X_i) = \text{const}$. For a quadratic mean variance relation this function is
$$
g(x) = \frac{1}{\sqrt{\alpha}} \text{acosh}(2 \alpha x + 1).
$$

We can apply this transformation to the counts:
```{r}
assay(sce, "acosh") <- acosh_transform(assay(sce, "counts"))
# Equivalent to 'assay(sce, "acosh") <- acosh_transform(sce)'
```


We plot the variance of the `acosh` transformed counts and find that for $\mu < 0.5$ the variance still increases for higher average gene expression. However, for larger expression values the variance for a gene is approximately independent of the corresponding average gene expression (note that the y-axis is not log transformed anymore!).
```{r}
acosh_var <- rowVars(assay(sce, "acosh"))
plot(gene_means, acosh_var, log = "x", main = "Log expression vs variance of acosh stabilized values")
abline(h = 1)
```

The most popular transformation for single cell data is $g(x) = \log(x + c)$ with pseudo-count $c=1$. It turns out that this transformation is closely related to the `acosh` transformation. When we choose $c = 1/(4\alpha)$ the two converge rapidly, only for small values the `acosh` is closer to $g(x) = 2\sqrt{x}$:
```{r}
x <- seq(0, 30, length.out = 1000)
y_acosh <- acosh_transform(x, overdispersion = 0.1)
y_shiftLog <- shifted_log_transform(x, pseudo_count = 1/(4 * 0.1))
y_sqrt <- 2 * sqrt(x) # Identical to acosh_transform(x, overdispersion = 0)
```
The plot looks as follows:
```{r}
plot(x, y_acosh, type = "l", col = "black", lwd = 3, ylab = "g(x)", ylim = c(0, 10))
lines(x, y_shiftLog, col = "red", lwd = 3)
lines(x, y_sqrt, col = "blue", lwd = 3)
legend("bottomright", legend = c(expression(2*sqrt(x)), 
                                 expression(1/sqrt(alpha)~acosh(2*alpha*x+1)),
                                 expression(1/sqrt(alpha)~log(x+1/(4*alpha))+b)),
       col = c("blue", "black", "red"), lty = 1, inset = 0.1, lwd = 3)
```
The offset $b$ for the shifted logarithm has no influence on the variance stabilization. We choose $b$ such that sparsity of the input is retained (i.e., $g(0) = 0$).


## Model residuals-based variance stabilizing transformations

An alternative approach for variance stabilization was suggested by Hafemeister and Satija (2019). They used the Pearson residuals from a Gamma-Poisson generalized linear model fit as the variance stabilized values. The advantage of this approach is that the variance is also stabilized for lowly expressed genes unlike the delta method-based transformations:
```{r}
assay(sce, "pearson") <- residual_transform(sce, "pearson", clipping = TRUE, on_disk = FALSE)
```

```{r}
pearson_var <- rowVars(assay(sce, "pearson"))
plot(gene_means, pearson_var, log = "x", main = "Log expression vs variance of Pearson residuals")
abline(h = 1)
```

Pearson residuals are by definition a linear transformation. This means that for genes with strong expression differences across subgroups they cannot achieve variance stabilization. As an alternative, _transformGamPoi_ provides  randomized quantile residuals which are non-linear and exploit randomization to work around the discrete nature of counts:
```{r}
assay(sce, "rand_quantile") <- residual_transform(sce, "randomized_quantile", on_disk = FALSE)
```


```{r}
rand_quant_var <- rowVars(assay(sce, "rand_quantile"))
plot(gene_means, rand_quant_var, log = "x", main = "Log expression vs variance of Randomized Quantile residuals")
abline(h = 1)
```


# Session Info

```{r}
sessionInfo()
```