--- title: "Cell states" date: "Compiled: `r format(Sys.Date(), '%d/%m/%Y')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Cell States} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.align = "center") ```

___

## Dataset
The single-cell RNA sequencing dataset of mammary epithelial cells published by [Bach et al., (2017)](https://doi.org/10.1038/s41467-017-02001-5) will be used to demonstrated how cell cluster probability can be used to identify cell states and associated differential gene expression programs. This dataset was imported as a `SingleCellExperiment` object via the `scRNAseq` `R` package using the function `BachMammaryData()`. Additionally, the dataset was normalized with the `scater` function `logNormCounts()`, and then subset to 2,000 highly variable genes with the `scran` function `getTopHVGs()`. ```{r packages, message=FALSE, warning=FALSE} # Import packages library("dplyr") library("scater") library("ggplot2") library("scRNAseq") library("Coralysis") library("ComplexHeatmap") library("SingleCellExperiment") ``` ```{r import data, message=FALSE, warning=FALSE} # Import object sce <- BachMammaryData() colnames(sce) <- paste0("cell", 1:ncol(sce)) # create cell names ```

___

## Normalization
`Coralysis` requires log-normalised data as input. `scran` normalization can be performed, which is particularly beneficial if rare cell types exist (see the following [vignette](https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html)). ```{r Seurat normalisation, message=FALSE, warning=FALSE} ## Normalize the data set.seed(123) sce <- scater::logNormCounts(sce) counts(sce) <- NULL ```

___

## HVG selection
Highly variable genes (HVG) can be selected using the R package `scran`. ```{r hvg} # Feature selection with 'scran' package nhvg <- 500 hvg <- scran::getTopHVGs(sce, n = nhvg) hvg.idx <- which(row.names(sce) %in% hvg) sce <- sce[hvg.idx, ] row.names(sce) <- rowData(sce)$Symbol ```

___

## Multi-level integration
Multi-level integration can be performed with `Coralysis` by running the function `RunParallelDivisiveICP()`. Since there is not a batch, the function `RunParallelDivisiveICP()` is run with `batch.label=NULL`. The `RunParallelDivisiveICP()` function can be run in parallel by providing the number of `threads` to reduce the run time. Consult the full documentation of this function with `?RunParallelDivisiveICP`. For this example, the algorithm was run 10 times (`L = 25`), but generally, this number should be higher (with the default being `L = 50`). A train set can be built to improve the modeling step in `Coralysis` (by default, `build.train.set=TRUE`), with the respective named `list` of parameters provided to `build.train.params`. In this case, no training set was built; instead, the entire dataset was downsampled to 1,000 cells before each ICP run (`icp.batch.size=1000`). ```{r multi-level integration, warning=FALSE} # Perform multi-level integration set.seed(123) sce <- RunParallelDivisiveICP(object = sce, L = 25, icp.batch.size = 1000, build.train.set = FALSE, threads = 2) ``` `Coralysis` returns a list of cell cluster probabilities saved at `metadata(sce)$coralysis$joint.probability` of length equal to the number of icp runs, i.e., `L` (by default `L=20`), times the number of icp rounds, i.e., log2(`k`) (by default `k=16`, thus 4 rounds of divisive icp).

___

## Dimensional reduction
`Coralysis` returns a list of cell cluster probabilities saved at `metadata(sce)$coralysis$joint.probability` of length equal to the number of icp runs, i.e., `L` (by default `L=20`), times the number of icp rounds, i.e., log2(`k`) (by default `k=16`, thus 4 rounds of divisive icp). The cell cluster probability can be concatenated to compute a PCA in order to obtain an integrated embedding. By default only the cell cluster probability corresponding to the last icp round is used. The `Coralysis` integrated embedding can be used downstream for non-linear dimensional reduction, _t_-SNE or UMAP, and clustering. Set to `TRUE` the `return.model` argument to allow reference mapping later. ```{r integrated dimred} # Dimensional reduction - unintegrated set.seed(123) sce <- RunPCA( object = sce, assay.name = "joint.probability", return.model = TRUE ) # UMAP set.seed(123) sce <- RunUMAP( object = sce, umap.method = "uwot", dims = 1:30, n_neighbors = 15, min_dist = 0.5, return.model = TRUE ) ``` Check the difference between basal (expressing _Krt5_) and luminal (expressing _Krt18_) epithelial cells. ```{r basal vs luminal cells, fig.width=7, fig.height=2.75} # Detect basal versus luminal cells markers <- c("Krt5", "Krt18") plots <- lapply(markers, function(x) { PlotExpression(sce, color.by = x, point.size = 0.2, point.stroke = 0.2, scale.values = TRUE ) }) cowplot::plot_grid(plotlist = plots, ncol = 2) ```

___

## Cell state identification
The cell cluster probability aggregated by mean or median across the icp runs can be obtained with the function `SummariseCellClusterProbability()` and plotted to infer transient and steady cell states. ```{r cell state identification} # Summarise cell cluster probability sce <- SummariseCellClusterProbability(object = sce, icp.round = 4) # save result in 'colData' # colData(sce) # check the colData ``` It is clear that the largest basal cell population has the lowest probability compared with the luminal cell populations. ```{r plot probability, fig.width=5, fig.height=4} # Plot cell cluster probabilities - mean # possible options: "mean_probs", "median_probs", "scaled_median_probs" PlotExpression( object = sce, color.by = "scaled_mean_probs", color.scale = "viridis", point.size = 0.2, point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)" ) ```

___

## ICP clusters
The ICP clusters corresponding to the ICP probability table with the highest standard deviation (i.e., 14) for the last clustering round (i.e., 4) was obtained below (`"icp_run_round_14_4_clusters"`) and the respective clusters highlighted in UMAP plot. ```{r icp clusters, fig.width=5, fig.height=6} # ICP clusters: identify the ICP probability table with the highest standard deviation probs <- GetCellClusterProbability(object = sce, icp.round = 4, concatenate = FALSE) probs.sd <- lapply(X = probs, FUN = function(x) { sd(x) }) icp.run <- which.max(probs.sd) # 7 clt <- paste0("icp_run_round_", icp.run, "_4_clusters") sce[[clt]] <- factor(sce[[clt]], levels = as.character(1:16)) PlotDimRed(sce, color.by = clt, point.size = 0.1, point.stroke = 0.1, legend.nrow = 5 ) ```

The gene coefficients for the above ICP clusters corresponding to the ICP run no. 14 and clustering round 4 can be retrieved with the function `GetFeatureCoefficients()`. The positive coefficients at least for one of the clusters were represented in the heatmap plot below. ```{r gene coefficients, message=FALSE, warning=FALSE, fig.width = 5.5, fig.height=14} # Get gene coefficients gene.coeff <- GetFeatureCoefficients(sce, icp.run = icp.run, icp.round = 4) row.names(gene.coeff$icp_56) <- gene.coeff$icp_56$feature gene.coeff$icp_56 <- gene.coeff$icp_56[, -1] # Plot top positive coefficients in heatmap positive.coeff <- (rowSums(gene.coeff$icp_56 > 0) > 0) heat <- ComplexHeatmap::Heatmap( matrix = as.matrix(gene.coeff$icp_56)[positive.coeff, ], name = "Gene coef.", cluster_columns = FALSE, show_row_dend = FALSE, show_row_names = TRUE, row_names_gp = grid::gpar(fontsize = 7) ) print(heat) ```

One of the largest population of luminal cells is constituted by three ICP clusters (3, 11, 12), one could check below the main coefficients for each one of these clusters. ```{r gene coefficients per luminal cluster, fig.width=20, fig.height=10} # Check top gene coefficients per basal cluster: 3, 11, 12 gene.coeff.basal <- gene.coeff$icp_56[positive.coeff, ] %>% mutate("gene" = row.names(.)) top5.luminal <- top5.luminal.plts <- list() for (i in as.character(c(3, 11, 12))) { clt.no <- paste0("clt", i) top5.luminal[[clt.no]] <- gene.coeff.basal %>% dplyr::select(all_of(c("gene", paste0("coeff_", clt.no)))) %>% arrange(desc(.data[[paste0("coeff_", clt.no)]])) %>% head(5) %>% pull(gene) top5.luminal.plts[[clt.no]] <- cowplot::plot_grid(plotlist = lapply(top5.luminal[[clt.no]], function(x) { PlotExpression(sce, color.by = x, point.size = 0.1, point.stroke = 0.1, scale.values = TRUE ) + ggplot2::ggtitle(gsub("clt", "cluster: ", clt.no)) }), ncol = 5, align = "vh") } cowplot::plot_grid(plotlist = top5.luminal.plts, nrow = 3, align = "vh") ```

___

## Graph-based clustering
Graph-based clustering can be obtained by running the `scran` function `clusterCells()` using the `Coralysis` integrated embedding. The community detection algorithm used was Louvain with a resolution value of 0.8. ```{r graph-based clustering, fig.width=5, fig.height=6} ### Graph-based clustering with scran ## Coralysis integrated PCA embedding bluster.params <- bluster::SNNGraphParam( k = 30, cluster.fun = "louvain", cluster.args = list(resolution = 0.8) ) set.seed(1024) sce$Cluster <- scran::clusterCells(sce, use.dimred = "PCA", BLUSPARAM = bluster.params ) PlotDimRed(sce, color.by = "Cluster", point.size = 0.2, point.stroke = 0.2, legend.nrow = 4 ) ```

___

## Cell cluster probability bins
Cell cluster probability bins per cluster `label` can be obtained by running `BinCellClusterProbability()`, which returns a `SingleCellExperiment` object with `logcounts` average per `label` per cell probability bin. The number of probability bins can be provided to the parameter `bins`. Below it was used as `label` the graph-based clustering obtained above and `bins` set to 30. In addition, the `Coralysis` probability bins were projected onto single cell UMAP. See the plots below. ```{r cell cluster prob. bins, fig.width=9, fig.height=4.5} # cell states SCE object sce.bins <- BinCellClusterProbability(sce, label = "Cluster", bins = 30) # Project Coralysis bins onto single-cell UMAP sce.bins <- ReferenceMapping(sce, sce.bins, ref.label = "Cluster", project.umap = TRUE) umap.bins.labels <- PlotDimRed(sce.bins, color.by = "coral_labels") umap.bins.probs <- PlotExpression(sce.bins, color.by = "aggregated_probability_bins", color.scale = "viridis", legend.title = "Prob. bins" ) cowplot::plot_grid(umap.bins.labels, umap.bins.probs, ncol = 2, align = "vh") ```
```{r coralysis labels, fig.width=12, fig.height=6} # Coralysis labels: single cells vs bins cowplot::plot_grid( PlotDimRed(sce, color.by = "Cluster", point.size = 0.2, point.stroke = 0.2, legend.nrow = 2 ), umap.bins.labels ) ```
```{r coralysis probability, fig.width=12, fig.height=5} # Coralysis probability: single cells vs bins cowplot::plot_grid( PlotExpression( object = sce, color.by = "scaled_mean_probs", color.scale = "viridis", point.size = 0.2, point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)" ), umap.bins.probs ) ```

___

## Differential expression programs
The correlation between cell cluster probability bins from cluster 10 and averaged gene expression can be obtained with the function `CellBinsFeatureCorrelation()`. The 30 genes with the highest absolute Pearson correlation were represented in the heatmap below. ```{r differential expression, fig.width=10, fig.height=6} # Differential expression programs corr.features <- CellBinsFeatureCorrelation(object = sce.bins, labels = "10") top30.corr.clt10 <- corr.features %>% arrange(desc(abs(`10`))) %>% head(30) # Heatmap of top 30 genes with the highest absolute Pearson correlation pick.genes <- row.names(top30.corr.clt10) mtx <- as.matrix(logcounts(sce.bins[pick.genes, paste0("10_bin", 1:30)])) mtx <- t(scale(t(mtx))) col_fun2 <- circlize::colorRamp2(c(0.8, 0.9, 1.0), scales::viridis_pal()(3)) heat.diff.exp <- Heatmap( matrix = mtx, name = "Row Z-score\n(normalized data)", cluster_rows = TRUE, cluster_columns = FALSE, show_column_dend = FALSE, show_row_dend = FALSE, use_raster = TRUE, show_column_names = TRUE, top_annotation = HeatmapAnnotation( "Probability" = colData(sce.bins[, paste0("10_bin", 1:30)])$aggregated_probability_bins, simple_anno_size = unit(0.25, "cm"), col = list("Probability" = col_fun2), show_annotation_name = FALSE, show_legend = TRUE, annotation_legend_param = list(Probability = list( direction = "horizontal", title_position = "topcenter" )) ) ) print(heat.diff.exp) ```

For example, the expression of the gene _Glycam1_ (glycosylation-dependent cell adhesion molecule 1) or _Fabp3_ (fatty acid binding protein 3) are almost restricted to cluster 10, and they exhibit a gradient that correlates with both transient and steady-state cells. ```{r plot expression, fig.width=12, fig.height=10} # Look into Cited1 expression genes <- c("Glycam1", "Fabp3") cowplot::plot_grid( PlotExpression(sce, color.by = genes[1], point.size = 0.5, point.stroke = 0.5, scale.values = TRUE ), PlotExpression(sce.bins, color.by = genes[1], point.size = 1, point.stroke = 1, scale.values = TRUE ), PlotExpression(sce, color.by = genes[2], point.size = 0.5, point.stroke = 0.5, scale.values = TRUE ), PlotExpression(sce.bins, color.by = genes[2], point.size = 1, point.stroke = 1, scale.values = TRUE ), ncol = 2 ) ```

___

## R session
```{r rsession} # R session sessionInfo() ```

___

## References
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). "Orchestrating single-cell analysis with Bioconductor." _Nature Methods_, *17*, 137-145. [https://www.nature.com/articles/s41592-019-0654-x](https://www.nature.com/articles/s41592-019-0654-x). Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC, Khaled WT (2017). "Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing." _Nature communications_, 8, 1:1-11. [doi:10.1038/s41467-017-02001-5](https://doi.org/10.1038/s41467-017-02001-5) Gu, Z. (2022) "Complex heatmap visualization." _iMeta_, 1(3):e43. [doi: 10.1002/imt2.43](https://doi.org/10.1002/imt2.43). Lun ATL, McCarthy DJ, Marioni JC (2016). "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." _F1000Res._, *5*, 2122. [doi:10.12688/f1000research.9501.2](https://doi.org/10.12688/f1000research.9501.2). McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). "Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R." _Bioinformatics_, *33*, 1179-1186. [doi:10.1093/bioinformatics/btw777](https://doi.org/10.1093/bioinformatics/btw777). Sousa A, Smolander J, Junttila S, Elo L (2025). "Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration." _bioRxiv_. [doi:10.1101/2025.02.07.637023](https://doi.org/10.1101/2025.02.07.637023) Wickham H (2016). "ggplot2: Elegant Graphics for Data Analysis." _Springer-Verlag New York_.