--- title: "Incorporating sketching into a typical single-cell analysis workflow" date: "`r Sys.Date()`" output: rmarkdown::html_document vignette: > %\VignetteIndexEntry{sketching_workflows} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console bibliography: bibliography.bib --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, crop = NULL ) library(BiocStyle) ``` ## Introduction In this vignette, we outline how geometric sketching can be incorporated into the steps of a typical single-cell analysis workflow. ## Preparation We start by loading the required packages and preparing an example data set. ```{r load-pkg} suppressPackageStartupMessages({ library(sketchR) library(TENxPBMCData) library(scuttle) library(scran) library(scater) library(SingleR) library(celldex) library(cowplot) library(SummarizedExperiment) library(SingleCellExperiment) library(beachmat.hdf5) library(snifter) library(uwot) library(bluster) library(class) }) ``` We will use the PBMC3k data set from the `r Biocpkg("TENxPBMCData")` Bioconductor package for illustration. The chunk below prepares the data set by calculating log-transformed normalized counts, finding highly variable genes, applying principal component analysis, t-SNE and UMAP, and predicting cell types based on a reference data set. ```{r process-data} # load data pbmc3k <- TENxPBMCData(dataset = "pbmc3k") # set row and column names colnames(pbmc3k) <- paste0("Cell", seq_len(ncol(pbmc3k))) rownames(pbmc3k) <- uniquifyFeatureNames( ID = rowData(pbmc3k)$ENSEMBL_ID, names = rowData(pbmc3k)$Symbol_TENx ) # normalize and log-transform counts pbmc3k <- logNormCounts(pbmc3k) # find highly variable genes dec <- modelGeneVar(pbmc3k) top_hvgs <- getTopHVGs(dec, n = 2000) # perform dimensionality reduction set.seed(100) pbmc3k <- runPCA(pbmc3k, subset_row = top_hvgs) reducedDim(pbmc3k, "TSNE") <- fitsne( reducedDim(pbmc3k, "PCA"), n_jobs = 2L, random_state = 123L, perplexity = 30, n_components = 2L, pca = FALSE) reducedDim(pbmc3k, "UMAP") <- umap( X = reducedDim(pbmc3k, "PCA"), n_components = 2, n_neighbors = 50, pca = NULL, scale = FALSE, ret_model = TRUE, n_threads = 2L)$embedding # predict cell type labels ref_monaco <- MonacoImmuneData() pred_monaco_main <- SingleR(test = pbmc3k, ref = ref_monaco, labels = ref_monaco$label.main) pbmc3k$labels_main <- pred_monaco_main$labels ``` ## Sketch cells Next, we extract a subset of the cells using geometric sketching. We store information about which cells were sketched in the `colData` of `pbmc3k`. ```{r sketch} idx800gs <- geosketch(reducedDim(pbmc3k, "PCA"), N = 800, seed = 123) pbmc3k$sketched <- seq_len(ncol(pbmc3k)) %in% idx800gs table(pbmc3k$sketched) ``` We also generate a new set of highly variable genes, using only the sketched cells. This allows cells of different types to contribute more equally to the set of highly variable genes. ```{r hvgs} ## Find highly variable genes dec_sketched <- modelGeneVar(pbmc3k[, pbmc3k$sketched]) top_hvgs_sketched <- getTopHVGs(dec_sketched, n = 2000) ``` ## Principal component analysis (PCA) In this section, we will illustrate how to apply PCA to the sketched cells, followed by projection of all cells onto the obtained principal components. In this way, all cells are retained for further analysis, but only the sketched ones are used to determine the principal components. ```{r pca} #| fig.width: 8 #| fig.height: 4 set.seed(123) # don't scale features before PCA doscale <- FALSE # subset to sketched cells and run PCA, using the HVGs from the sketched subset pbmc3k_sketched <- pbmc3k[, pbmc3k$sketched] pca <- calculatePCA(pbmc3k_sketched, ncomponents = 30, subset_row = top_hvgs_sketched, scale = doscale) # store the PCA projection for the sketched cells as well as the weights for # the HVGs and the mean value for each HVG across the sketched cells projection <- pca rotation <- attr(pca, "rotation") centers <- rowMeans(assay(pbmc3k_sketched[top_hvgs_sketched, ], "logcounts")) # center the full data set using the mean values from the sketched subset and # project all cells onto the extracted principal components # note that if doscale = TRUE, we also need to calculate the standard # deviations across the sketched cells and use them here pca <- scale(t(assay(pbmc3k[names(centers), ], "logcounts")), center = centers, scale = doscale) %*% rotation # add the PCA coordinates to pbmc3k stopifnot(all(rownames(pca) == colnames(pbmc3k))) reducedDim(pbmc3k, "PCA_sketch") <- as.matrix(pca) # check that the "re-projections" of the sketched cells are the same as the # original coordinates obtained from applying PCA to the sketched cells # ... via correlations vapply(seq_len(ncol(projection)), function(i) { cor(projection[, i], reducedDim(pbmc3k, "PCA_sketch")[rownames(projection), i] ) }, NA_real_) # ... via plotting plot_grid(plotlist = lapply(seq_len(ncol(projection))[1:2], function(i) { ggplot(data.frame( Original = projection[, i], Reprojected = reducedDim(pbmc3k, "PCA_sketch")[pbmc3k$sketched, i]), aes(x = Original, y = Reprojected)) + geom_point() + labs(title = paste0("PCA dimension ", i)) + theme_bw() })) ``` For comparison, we plot the first two principal components obtained with and without sketching. ```{r plot-pca} #| fig.width: 9 #| fig.height: 4 plot_grid(plotlist = list( plotReducedDim(pbmc3k, dimred = "PCA", colour_by = "labels_main", ncomponents = c(1, 2)) + ggtitle("PCA"), plotReducedDim(pbmc3k, dimred = "PCA_sketch", colour_by = "labels_main", ncomponents = c(1, 2)) + ggtitle("PCA (sketch)") )) ``` We also visualize the location of the sketched cells in the PCA, noting that as expected, the sampling fraction is higher in regions of lower density. ```{r plot-pca-sketched} #| fig.width: 5 #| fig.height: 4 plotReducedDim(pbmc3k, dimred = "PCA_sketch", colour_by = "sketched", ncomponents = c(1, 2)) ``` # t-SNE Next, we illustrate how to pull the full data set into a tSNE representation derived from the sketched cells. This workflow is also described in the `r Biocpkg("snifter")` vignette. Note that, unlike for PCA, for tSNE the projected coordinates of the sketched cells themselves may not be identical to their original coordinates in the tSNE representation, as the initial placement of a point in the embedding is obtained by averaging the coordinates of its `k` nearest neighbors. Unless `k=1`, this initial placement will therefore be influenced by other sketched cells as well. If we want to make sure that the low-dimensional coordinates of the sketched cells are exactly those obtained by the original tSNE embedding, this can be achieved by only projecting the non-sketched cells, and concatenating these projections with the original embeddings of the sketched cells. ```{r tsne} #| fig.width: 8 #| fig.height: 4 set.seed(123) # extract tSNE embedding from sketched cells tsne <- fitsne( reducedDim(pbmc3k[, pbmc3k$sketched], "PCA_sketch"), n_jobs = 2L, random_state = 123L, perplexity = 30, n_components = 2L, pca = FALSE) # project all cells proj <- project( x = tsne, new = reducedDim(pbmc3k, "PCA_sketch"), old = reducedDim(pbmc3k[, pbmc3k$sketched], "PCA_sketch"), perplexity = 30, k = 1L) rownames(proj) <- colnames(pbmc3k) reducedDim(pbmc3k, "TSNE_sketch") <- proj # representations of sketched cells are approximately retained # ... correlations vapply(seq_len(ncol(tsne)), function(i) { cor(tsne[, i], reducedDim(pbmc3k, "TSNE_sketch")[pbmc3k$sketched, i] ) }, NA_real_) # ... plotting plot_grid(plotlist = lapply(seq_len(ncol(tsne)), function(i) { ggplot(data.frame( Original = tsne[, i], Reprojected = reducedDim(pbmc3k, "TSNE_sketch")[pbmc3k$sketched, i]), aes(x = Original, y = Reprojected)) + geom_point() + labs(title = paste0("tSNE dimension ", i)) + theme_bw() })) ``` As for the PCA, we plot the t-SNE representations obtained with and without sketching. ```{r plot-tsne} #| fig.width: 9 #| fig.height: 4 plot_grid(plotlist = list( plotReducedDim(pbmc3k, dimred = "TSNE", colour_by = "labels_main", ncomponents = c(1, 2)) + ggtitle("t-SNE"), plotReducedDim(pbmc3k, dimred = "TSNE_sketch", colour_by = "labels_main", ncomponents = c(1, 2)) + ggtitle("t-SNE (sketch)") )) ``` We also visualize the location of the sketched cells in the t-SNE. ```{r plot-tsne-sketched} #| fig.width: 5 #| fig.height: 4 plotReducedDim(pbmc3k, dimred = "TSNE_sketch", colour_by = "sketched", ncomponents = c(1, 2)) ``` # UMAP Similarly to t-SNE, we can also project new data into an existing UMAP embedding, using the `r CRANpkg("uwot")` package. ```{r umap} #| fig.width: 8 #| fig.height: 4 set.seed(123) # extract UMAP embedding from sketched cells umap <- umap( X = reducedDim(pbmc3k[, pbmc3k$sketched], "PCA_sketch"), n_components = 2, n_neighbors = 50, pca = NULL, scale = FALSE, ret_model = TRUE, n_threads = 2L) # project all cells proj <- umap_transform( X = reducedDim(pbmc3k, "PCA_sketch"), model = umap, n_threads = 2L) rownames(proj) <- colnames(pbmc3k) reducedDim(pbmc3k, "UMAP_sketch") <- proj # representations of sketched cells are approximately retained # ... correlations vapply(seq_len(ncol(umap$embedding)), function(i) { cor(umap$embedding[, i], reducedDim(pbmc3k, "UMAP_sketch")[pbmc3k$sketched, i] ) }, NA_real_) # ... plotting plot_grid(plotlist = lapply(seq_len(ncol(umap$embedding)), function(i) { ggplot(data.frame( Original = umap$embedding[, i], Reprojected = reducedDim(pbmc3k, "UMAP_sketch")[pbmc3k$sketched, i]), aes(x = Original, y = Reprojected)) + geom_point() + labs(title = paste0("UMAP dimension ", i)) + theme_bw() })) ``` As before, we plot the UMAP representations obtained with and without sketching. ```{r plot-umap} #| fig.width: 9 #| fig.height: 4 plot_grid(plotlist = list( plotReducedDim(pbmc3k, dimred = "UMAP", colour_by = "labels_main", ncomponents = c(1, 2)) + ggtitle("UMAP"), plotReducedDim(pbmc3k, dimred = "UMAP_sketch", colour_by = "labels_main", ncomponents = c(1, 2)) + ggtitle("UMAP (sketch)") )) ``` We also visualize the location of the sketched cells in the UMAP. ```{r plot-umap-sketched} #| fig.width: 5 #| fig.height: 4 plotReducedDim(pbmc3k, dimred = "UMAP_sketch", colour_by = "sketched", ncomponents = c(1, 2)) ``` # Clustering Also clustering can be beneficial to apply only to the sketched cells, followed by prediction of cell type labels using, e.g., k-nearest neighbors. Following a similar reasoning as for tSNE above, unless we use `k=1` in the k-nearest neighbor prediction, the predicted cluster labels of the sketched cells may not be the same as the ones obtained directly from the clustering. Here we cluster the sketched cells using functions from the `r Biocpkg("bluster")` package. ```{r clustering} # subset to only the sketched cells # note that since the PCA_sketch embedding was obtained by applying PCA to the # sketched cells only, subsetting like this is equivalent to applying PCA # again to the sketched cells only (see above for an illustration) pca_sketch <- reducedDim(pbmc3k[, pbmc3k$sketched], "PCA_sketch") set.seed(123) # run clustering on the sketched cells clst <- as.character(clusterRows( pca_sketch, BLUSPARAM = SNNGraphParam( k = 10, type = "rank", num.threads = 2L, cluster.fun = "leiden", cluster.args = list(resolution = 0.15)))) # predict labels of all cells using 1-nearest neighbors and add cluster labels # to pbmc3k preds <- knn(train = pca_sketch, test = reducedDim(pbmc3k, "PCA_sketch"), cl = clst, k = 1) |> as.character() pbmc3k$clusters <- preds # check that labels are retained for sketched cells all(clst == pbmc3k$clusters[pbmc3k$sketched]) ``` # Session info ```{r session-info} sessionInfo() ```