Sanity (SAmpling-Noise-corrected Inference of Transcription activitY) provides a biophysics-inspired normalization method for single-cell RNA-seq data. It uses a Bayesian framework with minimal number of assumptions in order to model physically meaningful quantities and their corresponding uncertainties, facilitating comparisons between cells and experiments. The full details of the method are described in the original publication and its supplementary (Breda, Zavolan, and Nimwegen 2021).
To install this package, start R (version “4.5”) and enter:
The basic functionality of the SanityR
package is to compute the logcounts assay from UMI
counts of a SingleCellExperiment
object. It thus acts as a replacement for the logNormCounts
from scuttle
on any other function with similar functionality.
To demonstrate the functionality of SanityR, we first simulate a single-cell dataset using the method described in the Sanity paper. In short, the data are generated by sampling from the Bayesian prior of the model and the correlation between the cells’ gene expression is generated by simulating a random walk process in the gene expression space so that the mean value of transcriptional activity in one cell is close to that of its predecessor in the walk. The number of cells, genes and the other prior parameters are tuned to match the dataset generated in (Baron et al. 2016). Here we down-sample the number of genes and cells for speed.
SanityR
is designed to work with the SingleCellExperiment
class and thus the simulation methods return a
SingleCellExperiment object. The latent (“true”) log
normalized counts are stored in the logFC assay so as to
not interfere with the logcounts assay.
library(SingleCellExperiment)
library(SanityR)
set.seed(42)
sce <- simulate_branched_random_walk(
N_gene=1000,
N_path=10,
length_path=12
)
sce
#> class: SingleCellExperiment
#> dim: 981 120
#> metadata(0):
#> assays(2): counts logFC
#> rownames(981): Gene_1 Gene_2 ... Gene_999 Gene_1000
#> rowData names(2): ltq_mean ltq_var
#> colnames(120): Cell_1 Cell_2 ... Cell_119 Cell_120
#> colData names(2): cell_size predecessor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):Genes with no UMI in any of the cells are removed that is why the number of rows may be less than 1000.
Besides the logFC assay, the following parameters are
stored in the sce to fully specify the model:
ltq_mean: the mean of the log transcriptional
activity, defined as the quotient of the number of molecules
transcribed from each gene versus the total number of molecules in the
cell.ltq_var: the variance of the log transcriptional
activity.cell_size: the total number of molecules in the
cell.predecessor: the index of the cell that precedes of the
current cell in the random walk. Thus if
predecessor[n] == m, then \(m \to
n\) in the random walk.The Sanity function is the main function of the package.
When called on a SingleCellExperiment object, it will
perform normalization for each gene and independently and store the
results in the logcounts assay.
sf <- colSums(counts(sce))
sizeFactors(sce) <- sf / mean(sf)
sce <- Sanity(sce)
sce
#> class: SingleCellExperiment
#> dim: 981 120
#> metadata(0):
#> assays(4): counts logFC logcounts logcounts_sd
#> rownames(981): Gene_1 Gene_2 ... Gene_999 Gene_1000
#> rowData names(6): ltq_mean ltq_var ... sanity_activity_sd
#> sanity_entropy
#> colnames(120): Cell_1 Cell_2 ... Cell_119 Cell_120
#> colData names(3): cell_size predecessor sizeFactor
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):Below we visualize the accuracy of the Sanity
normalization method by partially reproducing Figure 3A of the original
publication. First, at estimating the overall mean and variance of a
gene expression.
par(mfrow=c(1, 2))
plot(
sanity_log_activity_mean ~ ltq_mean,
data=rowData(sce),
frame=FALSE,
pch=16,
col="#00000044",
xlab="Mean UMI counts",
ylab="Sanity Estimate",
main="Mean Log Transcriptional Activity"
)
abline(0, 1, col=2, lwd=2, lty=2)
legend("top", legend="y=x", col=2, lwd=2, lty=2, bty="n")
plot(
sanity_activity_sd^2 ~ ltq_var, data=rowData(sce),
log="xy",
frame=FALSE,
pch=16,
col="#00000044",
xlab="True LTQ Variance",
ylab="Sanity Estimate",
main="Variance of Log Transcriptional Activity"
)
abline(0, 1, col=2, lwd=2, lty=2)
legend("top", legend="y=x", col=2, lwd=2, lty=2, bty="n")Gene Level Parameters. Comparison of Sanity estimates with latent values used to generate the data.
Second, at a cell-level by calculating the correlation of latent and
inferred logcounts using different using genes of different
expression level.
umi_mean <- rowMeans(counts(sce))
umi_quantiles <- quantile(umi_mean, seq(0, 1, .2))
umi_quantiles <- cut(
umi_mean,
umi_quantiles,
include.lowest=TRUE,
labels=paste0("Q", 1:5)
)
rho <- sapply(
split(1:nrow(sce), umi_quantiles),
function(ix) {
sce_q <- sce[ix,]
diag(cor(assay(sce_q, "logFC"), logcounts(sce_q)))
}
)
boxplot(
rho,
frame=FALSE,
main="Correlation between latent and inferred logcounts",
xlab="Quantiles of Mean UMI Count",
ylab="Pearson Correlation"
)Cell Level Estimates. Correlation of Sanity inferred logcounts with latent logcounts using genes at different quantiles of expression.
The logcounts calculated by Sanity can be
used for downstream analyses. Typically, these analyses involve
computing distances between cells which can be done by using the
functions that BiocNeighbors
provides. However, SanityR
provides a more principled approach that takes into account the
uncertainty of the estimated log normalized counts. This is implemented
in the calculateSanityDistance function.
Below is an example of using the distances calculated by
calculateSanityDistance as input to Rtsne to embed
the cells in a 2D space which captures the underlying random walk
structure that generated the data.
cell_dist <- calculateSanityDistance(sce)
reducedDim(sce, "TSNE") <- Rtsne::Rtsne(cell_dist, is_distance=TRUE)$Y
# Visualize the t-SNE embedding
sce$pseudotime <- 0
for (i in 2:ncol(sce)) # cell 1 is the stem cell
sce$pseudotime[i] <- sce$pseudotime[sce$predecessor[i]] + 1
scater::plotTSNE(sce, color_by="pseudotime")t-SNE visualization using the Sanity calculated distances. Pseudotime is computed based the predecessor relations between cells.
Unlike many functions of BiocNeighbors1,
calculateSanityDistance computes all the pairwise
distances and thus is more resource intensive, both in terms of memory
and time. To ease the computational burden, the user can limit the
number of genes involved in the calculations by filtering only the
highly variable ones. calculateSanityDistance uses the
signal-to-noise ratio (snr_cutoff) internally to filter the
genes.
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SanityR_1.0.0 scuttle_1.20.0
#> [3] SingleCellExperiment_1.32.0 SummarizedExperiment_1.40.0
#> [5] Biobase_2.70.0 GenomicRanges_1.62.0
#> [7] Seqinfo_1.0.0 IRanges_2.44.0
#> [9] S4Vectors_0.48.0 BiocGenerics_0.56.0
#> [11] generics_0.1.4 MatrixGenerics_1.22.0
#> [13] matrixStats_1.5.0 BiocStyle_2.38.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 beeswarm_0.4.0 xfun_0.54
#> [4] bslib_0.9.0 ggplot2_4.0.0 ggrepel_0.9.6
#> [7] lattice_0.22-7 vctrs_0.6.5 tools_4.5.2
#> [10] parallel_4.5.2 BiocNeighbors_2.4.0 Matrix_1.7-4
#> [13] RColorBrewer_1.1-3 S7_0.2.0 lifecycle_1.0.4
#> [16] compiler_4.5.2 farver_2.1.2 codetools_0.2-20
#> [19] vipor_0.4.7 htmltools_0.5.8.1 sys_3.4.3
#> [22] buildtools_1.0.0 sass_0.4.10 yaml_2.3.10
#> [25] jquerylib_0.1.4 BiocParallel_1.44.0 DelayedArray_0.36.0
#> [28] cachem_1.1.0 viridis_0.6.5 abind_1.4-8
#> [31] rsvd_1.0.5 digest_0.6.38 Rtsne_0.17
#> [34] BiocSingular_1.26.0 labeling_0.4.3 maketools_1.3.2
#> [37] fastmap_1.2.0 grid_4.5.2 cli_3.6.5
#> [40] SparseArray_1.10.1 scater_1.38.0 S4Arrays_1.10.0
#> [43] withr_3.0.2 scales_1.4.0 ggbeeswarm_0.7.2
#> [46] rmarkdown_2.30 XVector_0.50.0 gridExtra_2.3
#> [49] ScaledMatrix_1.18.0 beachmat_2.26.0 evaluate_1.0.5
#> [52] knitr_1.50 irlba_2.3.5.1 viridisLite_0.4.2
#> [55] rlang_1.1.6 Rcpp_1.1.0 glue_1.8.0
#> [58] BiocManager_1.30.26 jsonlite_2.0.0 R6_2.6.1which compute only the k nearest
neighbors↩︎