densvis 1.10.3
Non-linear dimensionality reduction techniques such as t-SNE (Maaten and Hinton 2008)
and UMAP (McInnes, Healy, and Melville 2020) produce a low-dimensional embedding that summarises
the global structure of high-dimensional data. These techniques can be
particularly useful when visualising high-dimensional data in a biological
setting.
However, these embeddings may not accurately represent the local density
of data in the original space, resulting in misleading visualisations where
the space given to clusters of data does not represent the fraction of the
high-dimensional space that they occupy.
densvis implements the density-preserving objective function described by
Narayan, Berger, and Cho (2020), which aims to address this deficiency by
including a density-preserving term in the t-SNE and UMAP optimisation
procedures.
This can enable the creation of visualisations that accurately capture
differing degrees of transcriptional heterogeneity within different cell
subpopulations in scRNAseq experiments, for example.
We will illustrate the use of densvis using simulated data.
We will first load the densvis, Rtsne, uwot and ggplot2 libraries
and set a random seed to ensure the visualisations are reproducible
(note: it is good practice to check that a t-SNE embedding is robust
by running the algorithm multiple times).
library("densvis")
library("Rtsne")
library("uwot")
library("ggplot2")
theme_set(theme_bw())
set.seed(14)
data <- data.frame(
    x = c(rnorm(1000, 5), rnorm(1000, 0, 0.2)),
    y = c(rnorm(1000, 5), rnorm(1000, 0, 0.2)),
    class = c(rep("Class 1", 1000), rep("Class 2", 1000))
)
ggplot() +
    aes(data[, 1], data[, 2], colour = data$class) +
    geom_point(pch = 19) +
    scale_colour_discrete(name = "Class") +
    ggtitle("Original co-ordinates")
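To put a number on the difference in density between the two classes, one option is to compute a per-point local radius in the original space. The sketch below is a simplified stand-in, not the definition densvis uses internally: it takes the mean Euclidean distance to the k nearest neighbours. The names knn_radius, sim, n and k are illustrative assumptions. It regenerates a smaller copy of the simulated data so that it runs on its own.

```r
set.seed(14)

# Smaller copy of the simulated data (n points per class; n is an assumption).
n <- 200
sim <- data.frame(
    x = c(rnorm(n, 5), rnorm(n, 0, 0.2)),
    y = c(rnorm(n, 5), rnorm(n, 0, 0.2)),
    class = rep(c("Class 1", "Class 2"), each = n)
)

# Mean distance from each point to its k nearest neighbours.
# A larger radius indicates a sparser neighbourhood.
knn_radius <- function(coords, k = 10) {
    d <- as.matrix(dist(coords))
    # sort(row)[1] is the zero distance of a point to itself, so skip it.
    apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))
}

sim$radius <- knn_radius(sim[, c("x", "y")])

# Median local radius per class.
radius_by_class <- vapply(split(sim$radius, sim$class), median, numeric(1))
print(radius_by_class)
```

Class 1 is simulated with a standard deviation five times that of Class 2, so its median radius should be correspondingly larger.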
Density-preserving t-SNE can be generated using the densne function.
This function returns a matrix of t-SNE co-ordinates.
We set dens_frac (the fraction of optimisation steps that consider
the density preservation) and dens_lambda (the weight given to density
preservation relative to the standard t-SNE objective) each to 0.5.
fit1 <- densne(data[, 1:2], dens_frac = 0.5, dens_lambda = 0.5)
ggplot() +
    aes(fit1[, 1], fit1[, 2], colour = data$class) +
    geom_point(pch = 19) +
    scale_colour_discrete(name = "Class") +
    ggtitle("Density-preserving t-SNE") +
    labs(x = "t-SNE 1", y = "t-SNE 2")
For comparison, if we run standard t-SNE on the same data, we can see that the density-preserving objective better represents the relative density of the two classes:
fit2 <- Rtsne(data[, 1:2])
ggplot() +
    aes(fit2$Y[, 1], fit2$Y[, 2], colour = data$class) +
    geom_point(pch = 19) +
    scale_colour_discrete(name = "Class") +
    ggtitle("Standard t-SNE") +
    labs(x = "t-SNE 1", y = "t-SNE 2")
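The earlier note about checking that a t-SNE embedding is robust across runs can be made concrete by comparing the neighbourhoods of two embeddings. The sketch below is one possible check, not part of densvis or Rtsne: the helper names knn_index and knn_overlap are hypothetical, and the measure (mean fraction of shared k nearest neighbours) is an assumption chosen because it ignores rotations and reflections of the embedding. It is demonstrated on toy matrices so that it runs without fitting any model.

```r
# Indices of the k nearest neighbours of every point (self excluded).
knn_index <- function(coords, k) {
    d <- as.matrix(dist(coords))
    diag(d) <- Inf
    nn <- apply(d, 1, function(row) order(row)[seq_len(k)])
    # One row per point, one column per neighbour.
    matrix(nn, ncol = k, byrow = TRUE)
}

# Mean fraction of each point's k nearest neighbours shared by two embeddings.
knn_overlap <- function(emb_a, emb_b, k = 15) {
    nn_a <- knn_index(emb_a, k)
    nn_b <- knn_index(emb_b, k)
    shared <- vapply(
        seq_len(nrow(nn_a)),
        function(i) length(intersect(nn_a[i, ], nn_b[i, ])),
        integer(1)
    )
    mean(shared / k)
}

set.seed(14)
emb <- matrix(rnorm(400), ncol = 2)

# A rotated copy has the same neighbourhoods as the original.
angle <- pi / 3
rotation <- matrix(
    c(cos(angle), sin(angle), -sin(angle), cos(angle)),
    ncol = 2
)
rotated <- emb %*% rotation

# Shuffling the rows destroys the correspondence between points.
shuffled <- emb[sample(nrow(emb)), ]

overlap_rotated <- knn_overlap(emb, rotated)
overlap_shuffled <- knn_overlap(emb, shuffled)
```

To apply this to real runs, one could fit Rtsne (or densne) twice under different seeds and pass the two co-ordinate matrices to knn_overlap; values near 1 would suggest that local neighbourhoods are stable between runs.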
A density-preserving UMAP embedding can be generated using the densmap
function. This function returns a matrix of UMAP co-ordinates. As with t-SNE,
we set dens_frac (the fraction of optimisation steps that consider
the density preservation) and dens_lambda (the weight given to density
preservation relative to the standard UMAP objective) each to 0.5.
fit1 <- densmap(data[, 1:2], dens_frac = 0.5, dens_lambda = 0.5)
ggplot() +
    aes(fit1[, 1], fit1[, 2], colour = data$class) +
    geom_point(pch = 19) +
    scale_colour_discrete(name = "Class") +
    ggtitle("Density-preserving UMAP") +
    labs(x = "UMAP 1", y = "UMAP 2")
For comparison, if we run standard UMAP on the same data, we can see that the density-preserving objective better represents the relative density of the two classes:
fit2 <- umap(data[, 1:2])
ggplot() +
    aes(fit2[, 1], fit2[, 2], colour = data$class) +
    geom_point(pch = 19) +
    scale_colour_discrete(name = "Class") +
    ggtitle("Standard UMAP") +
    labs(x = "UMAP 1", y = "UMAP 2")
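The visual comparisons above can be complemented by a numerical score. The sketch below is one possible measure and is not a function provided by densvis: it correlates the log of a simple local radius (mean distance to the k nearest neighbours) in the original space with the same quantity in the embedding. The names local_radius and density_preservation are hypothetical, and the two toy "embeddings" are constructed by hand so that the example runs without fitting any model.

```r
# Mean distance from each point to its k nearest neighbours (self excluded).
local_radius <- function(coords, k = 10) {
    d <- as.matrix(dist(coords))
    apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))
}

# Correlation of log local radii between the original space and an embedding.
density_preservation <- function(original, embedding, k = 10) {
    cor(log(local_radius(original, k)), log(local_radius(embedding, k)))
}

set.seed(14)
n <- 200
original <- rbind(
    matrix(rnorm(2 * n, mean = 5, sd = 1), ncol = 2),
    matrix(rnorm(2 * n, mean = 0, sd = 0.2), ncol = 2)
)
cluster <- rep(1:2, each = n)

# Toy embedding that keeps relative densities: a uniform rescaling.
faithful <- original * 3

# Toy embedding that gives both clusters the same spread,
# mimicking an embedding that discards density information.
equalised <- original
for (g in 1:2) {
    idx <- cluster == g
    equalised[idx, ] <- scale(original[idx, ])
}
# Shift the first cluster so the two clusters remain separated.
equalised[cluster == 1, ] <- equalised[cluster == 1, ] + 10

score_faithful <- density_preservation(original, faithful)
score_equalised <- density_preservation(original, equalised)
```

The uniformly rescaled embedding should score close to 1, while the equalised one should score noticeably lower. The same function could be applied to the matrices returned by densne, densmap, Rtsne (the Y element) and umap, with data[, 1:2] as the original space.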
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.4.2 uwot_0.1.16 Matrix_1.6-0 Rtsne_0.16
#> [5] densvis_1.10.3 BiocStyle_2.28.0
#>
#> loaded via a namespace (and not attached):
#> [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3
#> [4] lattice_0.21-8 digest_0.6.33 magrittr_2.0.3
#> [7] evaluate_0.21 grid_4.3.1 bookdown_0.34
#> [10] fastmap_1.1.1 rprojroot_2.0.3 jsonlite_1.8.7
#> [13] BiocManager_1.30.21.1 fansi_1.0.4 scales_1.2.1
#> [16] jquerylib_0.1.4 cli_3.6.1 rlang_1.1.1
#> [19] basilisk.utils_1.12.1 munsell_0.5.0 withr_2.5.0
#> [22] cachem_1.0.8 yaml_2.3.7 FNN_1.1.3.2
#> [25] tools_4.3.1 dir.expiry_1.8.0 parallel_4.3.1
#> [28] dplyr_1.1.2 colorspace_2.1-0 filelock_1.0.2
#> [31] here_1.0.1 basilisk_1.12.1 reticulate_1.30
#> [34] assertthat_0.2.1 vctrs_0.6.3 R6_2.5.1
#> [37] png_0.1-8 lifecycle_1.0.3 magick_2.7.4
#> [40] pkgconfig_2.0.3 bslib_0.5.0 pillar_1.9.0
#> [43] gtable_0.3.3 glue_1.6.2 Rcpp_1.0.11
#> [46] xfun_0.39 tibble_3.2.1 tidyselect_1.2.0
#> [49] highr_0.10 knitr_1.43 farver_2.1.1
#> [52] htmltools_0.5.5 rmarkdown_2.23 labeling_0.4.2
#> [55] compiler_4.3.1
Maaten, Laurens van der, and Geoffrey Hinton. 2008. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (Nov): 2579–2605.
McInnes, Leland, John Healy, and James Melville. 2020. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” http://arxiv.org/abs/1802.03426.
Narayan, Ashwin, Bonnie Berger, and Hyunghoon Cho. 2020. “Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability.” bioRxiv. https://doi.org/10.1101/2020.05.12.077776.