---
title: "Cell states"
date: "Compiled: `r format(Sys.Date(), '%d/%m/%Y')`"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Cell States}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = "center")
```
___
## Dataset
The single-cell RNA sequencing dataset of mammary epithelial cells published by [Bach et al., (2017)](https://doi.org/10.1038/s41467-017-02001-5) will be used to demonstrated how cell cluster probability can be used to identify cell states and associated differential gene expression programs.
This dataset was imported as a `SingleCellExperiment` object via the `scRNAseq` `R` package using the function `BachMammaryData()`. Additionally, the dataset was normalized with the `scater` function `logNormCounts()`, and then subset to 2,000 highly variable genes with the `scran` function `getTopHVGs()`.
```{r packages, message=FALSE, warning=FALSE}
# Import packages
library("dplyr")
library("scater")
library("ggplot2")
library("scRNAseq")
library("Coralysis")
library("ComplexHeatmap")
library("SingleCellExperiment")
```
```{r import data, message=FALSE, warning=FALSE}
# Import object
sce <- BachMammaryData()
colnames(sce) <- paste0("cell", 1:ncol(sce)) # create cell names
```
___
## Normalization
`Coralysis` requires log-normalised data as input. `scran` normalization can be performed, which is particularly beneficial if rare cell types exist (see the following [vignette](https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html)).
```{r Seurat normalisation, message=FALSE, warning=FALSE}
## Normalize the data
set.seed(123)
sce <- scater::logNormCounts(sce)
counts(sce) <- NULL
```
___
## HVG selection
Highly variable genes (HVG) can be selected using the R package `scran`.
```{r hvg}
# Feature selection with 'scran' package
nhvg <- 500
hvg <- scran::getTopHVGs(sce, n = nhvg)
hvg.idx <- which(row.names(sce) %in% hvg)
sce <- sce[hvg.idx, ]
row.names(sce) <- rowData(sce)$Symbol
```
___
## Multi-level integration
Multi-level integration can be performed with `Coralysis` by running the function `RunParallelDivisiveICP()`. Since there is not a batch, the function `RunParallelDivisiveICP()` is run with `batch.label=NULL`. The `RunParallelDivisiveICP()` function can be run in parallel by providing the number of `threads` to reduce the run time. Consult the full documentation of this function with `?RunParallelDivisiveICP`. For this example, the algorithm was run 10 times (`L = 25`), but generally, this number should be higher (with the default being `L = 50`). A train set can be built to improve the modeling step in `Coralysis` (by default, `build.train.set=TRUE`), with the respective named `list` of parameters provided to `build.train.params`. In this case, no training set was built; instead, the entire dataset was downsampled to 1,000 cells before each ICP run (`icp.batch.size=1000`).
```{r multi-level integration, warning=FALSE}
# Perform multi-level integration
set.seed(123)
sce <- RunParallelDivisiveICP(object = sce, L = 25,
icp.batch.size = 1000,
build.train.set = FALSE,
threads = 2)
```
`Coralysis` returns a list of cell cluster probabilities saved at `metadata(sce)$coralysis$joint.probability` of length equal to the number of icp runs, i.e., `L` (by default `L=20`), times the number of icp rounds, i.e., log2(`k`) (by default `k=16`, thus 4 rounds of divisive icp).
___
## Dimensional reduction
`Coralysis` returns a list of cell cluster probabilities saved at `metadata(sce)$coralysis$joint.probability` of length equal to the number of icp runs, i.e., `L` (by default `L=20`), times the number of icp rounds, i.e., log2(`k`) (by default `k=16`, thus 4 rounds of divisive icp). The cell cluster probability can be concatenated to compute a PCA in order to obtain an integrated embedding. By default only the cell cluster probability corresponding to the last icp round is used. The `Coralysis` integrated embedding can be used downstream for non-linear dimensional reduction, _t_-SNE or UMAP, and clustering. Set to `TRUE` the `return.model` argument to allow reference mapping later.
```{r integrated dimred}
# Dimensional reduction - unintegrated
set.seed(123)
sce <- RunPCA(
object = sce, assay.name = "joint.probability",
return.model = TRUE
)
# UMAP
set.seed(123)
sce <- RunUMAP(
object = sce, umap.method = "uwot",
dims = 1:30, n_neighbors = 15,
min_dist = 0.5, return.model = TRUE
)
```
Check the difference between basal (expressing _Krt5_) and luminal (expressing _Krt18_) epithelial cells.
```{r basal vs luminal cells, fig.width=7, fig.height=2.75}
# Detect basal versus luminal cells
markers <- c("Krt5", "Krt18")
plots <- lapply(markers, function(x) {
PlotExpression(sce,
color.by = x, point.size = 0.2,
point.stroke = 0.2, scale.values = TRUE
)
})
cowplot::plot_grid(plotlist = plots, ncol = 2)
```
___
## Cell state identification
The cell cluster probability aggregated by mean or median across the icp runs can be obtained with the function `SummariseCellClusterProbability()` and plotted to infer transient and steady cell states.
```{r cell state identification}
# Summarise cell cluster probability
sce <- SummariseCellClusterProbability(object = sce, icp.round = 4) # save result in 'colData'
# colData(sce) # check the colData
```
It is clear that the largest basal cell population has the lowest probability compared with the luminal cell populations.
```{r plot probability, fig.width=5, fig.height=4}
# Plot cell cluster probabilities - mean
# possible options: "mean_probs", "median_probs", "scaled_median_probs"
PlotExpression(
object = sce, color.by = "scaled_mean_probs",
color.scale = "viridis", point.size = 0.2,
point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
)
```
___
## ICP clusters
The ICP clusters corresponding to the ICP probability table with the highest standard deviation (i.e., 14) for the last clustering round (i.e., 4) was obtained below (`"icp_run_round_14_4_clusters"`) and the respective clusters highlighted in UMAP plot.
```{r icp clusters, fig.width=5, fig.height=6}
# ICP clusters: identify the ICP probability table with the highest standard deviation
probs <- GetCellClusterProbability(object = sce, icp.round = 4, concatenate = FALSE)
probs.sd <- lapply(X = probs, FUN = function(x) {
sd(x)
})
icp.run <- which.max(probs.sd) # 7
clt <- paste0("icp_run_round_", icp.run, "_4_clusters")
sce[[clt]] <- factor(sce[[clt]], levels = as.character(1:16))
PlotDimRed(sce,
color.by = clt, point.size = 0.1,
point.stroke = 0.1, legend.nrow = 5
)
```
The gene coefficients for the above ICP clusters corresponding to the ICP run no. 14 and clustering round 4 can be retrieved with the function `GetFeatureCoefficients()`. The positive coefficients at least for one of the clusters were represented in the heatmap plot below.
```{r gene coefficients, message=FALSE, warning=FALSE, fig.width = 5.5, fig.height=14}
# Get gene coefficients
gene.coeff <- GetFeatureCoefficients(sce, icp.run = icp.run, icp.round = 4)
row.names(gene.coeff$icp_56) <- gene.coeff$icp_56$feature
gene.coeff$icp_56 <- gene.coeff$icp_56[, -1]
# Plot top positive coefficients in heatmap
positive.coeff <- (rowSums(gene.coeff$icp_56 > 0) > 0)
heat <- ComplexHeatmap::Heatmap(
matrix = as.matrix(gene.coeff$icp_56)[positive.coeff, ],
name = "Gene coef.",
cluster_columns = FALSE, show_row_dend = FALSE, show_row_names = TRUE,
row_names_gp = grid::gpar(fontsize = 7)
)
print(heat)
```
One of the largest population of luminal cells is constituted by three ICP clusters (3, 11, 12), one could check below the main coefficients for each one of these clusters.
```{r gene coefficients per luminal cluster, fig.width=20, fig.height=10}
# Check top gene coefficients per basal cluster: 3, 11, 12
gene.coeff.basal <- gene.coeff$icp_56[positive.coeff, ] %>%
mutate("gene" = row.names(.))
top5.luminal <- top5.luminal.plts <- list()
for (i in as.character(c(3, 11, 12))) {
clt.no <- paste0("clt", i)
top5.luminal[[clt.no]] <- gene.coeff.basal %>%
dplyr::select(all_of(c("gene", paste0("coeff_", clt.no)))) %>%
arrange(desc(.data[[paste0("coeff_", clt.no)]])) %>%
head(5) %>%
pull(gene)
top5.luminal.plts[[clt.no]] <- cowplot::plot_grid(plotlist = lapply(top5.luminal[[clt.no]], function(x) {
PlotExpression(sce,
color.by = x, point.size = 0.1, point.stroke = 0.1,
scale.values = TRUE
) + ggplot2::ggtitle(gsub("clt", "cluster: ", clt.no))
}), ncol = 5, align = "vh")
}
cowplot::plot_grid(plotlist = top5.luminal.plts, nrow = 3, align = "vh")
```
___
## Graph-based clustering
Graph-based clustering can be obtained by running the `scran` function `clusterCells()` using the `Coralysis` integrated embedding. The community detection algorithm used was Louvain with a resolution value of 0.8.
```{r graph-based clustering, fig.width=5, fig.height=6}
### Graph-based clustering with scran
## Coralysis integrated PCA embedding
bluster.params <- bluster::SNNGraphParam(
k = 30, cluster.fun = "louvain",
cluster.args = list(resolution = 0.8)
)
set.seed(1024)
sce$Cluster <- scran::clusterCells(sce,
use.dimred = "PCA",
BLUSPARAM = bluster.params
)
PlotDimRed(sce,
color.by = "Cluster", point.size = 0.2,
point.stroke = 0.2, legend.nrow = 4
)
```
___
## Cell cluster probability bins
Cell cluster probability bins per cluster `label` can be obtained by running `BinCellClusterProbability()`, which returns a `SingleCellExperiment` object with `logcounts` average per `label` per cell probability bin. The number of probability bins can be provided to the parameter `bins`. Below it was used as `label` the graph-based clustering obtained above and `bins` set to 30. In addition, the `Coralysis` probability bins were projected onto single cell UMAP. See the plots below.
```{r cell cluster prob. bins, fig.width=9, fig.height=4.5}
# cell states SCE object
sce.bins <- BinCellClusterProbability(sce, label = "Cluster", bins = 30)
# Project Coralysis bins onto single-cell UMAP
sce.bins <- ReferenceMapping(sce, sce.bins, ref.label = "Cluster", project.umap = TRUE)
umap.bins.labels <- PlotDimRed(sce.bins, color.by = "coral_labels")
umap.bins.probs <- PlotExpression(sce.bins,
color.by = "aggregated_probability_bins",
color.scale = "viridis",
legend.title = "Prob. bins"
)
cowplot::plot_grid(umap.bins.labels, umap.bins.probs, ncol = 2, align = "vh")
```
```{r coralysis labels, fig.width=12, fig.height=6}
# Coralysis labels: single cells vs bins
cowplot::plot_grid(
PlotDimRed(sce,
color.by = "Cluster", point.size = 0.2,
point.stroke = 0.2, legend.nrow = 2
),
umap.bins.labels
)
```
```{r coralysis probability, fig.width=12, fig.height=5}
# Coralysis probability: single cells vs bins
cowplot::plot_grid(
PlotExpression(
object = sce, color.by = "scaled_mean_probs",
color.scale = "viridis", point.size = 0.2,
point.stroke = 0.1, legend.title = "Mean prob.\n(min-max)"
),
umap.bins.probs
)
```
___
## Differential expression programs
The correlation between cell cluster probability bins from cluster 10 and averaged gene expression can be obtained with the function `CellBinsFeatureCorrelation()`. The 30 genes with the highest absolute Pearson correlation were represented in the heatmap below.
```{r differential expression, fig.width=10, fig.height=6}
# Differential expression programs
corr.features <- CellBinsFeatureCorrelation(object = sce.bins, labels = "10")
top30.corr.clt10 <- corr.features %>%
arrange(desc(abs(`10`))) %>%
head(30)
# Heatmap of top 30 genes with the highest absolute Pearson correlation
pick.genes <- row.names(top30.corr.clt10)
mtx <- as.matrix(logcounts(sce.bins[pick.genes, paste0("10_bin", 1:30)]))
mtx <- t(scale(t(mtx)))
col_fun2 <- circlize::colorRamp2(c(0.8, 0.9, 1.0), scales::viridis_pal()(3))
heat.diff.exp <- Heatmap(
matrix = mtx, name = "Row Z-score\n(normalized data)",
cluster_rows = TRUE, cluster_columns = FALSE,
show_column_dend = FALSE, show_row_dend = FALSE,
use_raster = TRUE, show_column_names = TRUE,
top_annotation = HeatmapAnnotation(
"Probability" = colData(sce.bins[, paste0("10_bin", 1:30)])$aggregated_probability_bins,
simple_anno_size = unit(0.25, "cm"),
col = list("Probability" = col_fun2), show_annotation_name = FALSE,
show_legend = TRUE,
annotation_legend_param = list(Probability = list(
direction = "horizontal",
title_position = "topcenter"
))
)
)
print(heat.diff.exp)
```
For example, the expression of the gene _Glycam1_ (glycosylation-dependent cell adhesion molecule 1) or _Fabp3_ (fatty acid binding protein 3) are almost restricted to cluster 10, and they exhibit a gradient that correlates with both transient and steady-state cells.
```{r plot expression, fig.width=12, fig.height=10}
# Look into Cited1 expression
genes <- c("Glycam1", "Fabp3")
cowplot::plot_grid(
PlotExpression(sce,
color.by = genes[1], point.size = 0.5,
point.stroke = 0.5, scale.values = TRUE
),
PlotExpression(sce.bins,
color.by = genes[1], point.size = 1,
point.stroke = 1, scale.values = TRUE
),
PlotExpression(sce,
color.by = genes[2], point.size = 0.5,
point.stroke = 0.5, scale.values = TRUE
),
PlotExpression(sce.bins,
color.by = genes[2], point.size = 1,
point.stroke = 1, scale.values = TRUE
),
ncol = 2
)
```
___
## R session
```{r rsession}
# R session
sessionInfo()
```
___
## References
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). "Orchestrating single-cell analysis with Bioconductor." _Nature Methods_, *17*, 137-145. [https://www.nature.com/articles/s41592-019-0654-x](https://www.nature.com/articles/s41592-019-0654-x).
Bach K, Pensa S, Grzelak M, Hadfield J, Adams DJ, Marioni JC, Khaled WT (2017). "Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing." _Nature communications_, 8, 1:1-11. [doi:10.1038/s41467-017-02001-5](https://doi.org/10.1038/s41467-017-02001-5)
Gu, Z. (2022) "Complex heatmap visualization." _iMeta_, 1(3):e43. [doi: 10.1002/imt2.43](https://doi.org/10.1002/imt2.43).
Lun ATL, McCarthy DJ, Marioni JC (2016). "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." _F1000Res._, *5*, 2122. [doi:10.12688/f1000research.9501.2](https://doi.org/10.12688/f1000research.9501.2).
McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). "Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R." _Bioinformatics_, *33*, 1179-1186. [doi:10.1093/bioinformatics/btw777](https://doi.org/10.1093/bioinformatics/btw777).
Sousa A, Smolander J, Junttila S, Elo L (2025). "Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration." _bioRxiv_. [doi:10.1101/2025.02.07.637023](https://doi.org/10.1101/2025.02.07.637023)
Wickham H (2016). "ggplot2: Elegant Graphics for Data Analysis." _Springer-Verlag New York_.