--- title: "Coralysis: sensitive integration of single-cell data" date: "Compiled: `r format(Sys.Date(), '%d/%m/%Y')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Get started} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.align = "center") ```

___

## Introduction
`Coralysis` is an R package for sensitive integration, reference-mapping and cell-state identification of single-cell data (Sousa et al., 2025). It relies on an adapted version of our previously introduced Iterative Clustering Projection (ICP) algorithm (Smolander et al., 2021) to identify shared cell clusters across heterogeneous datasets by leveraging multiple rounds of divisive clustering. Inspired by the process of assembling a puzzle - where one begins by grouping pieces based on low-to high-level features, such as color and shading, before looking into shape and patterns - this multi-level integration algorithm progressively blends the batch effects while separating cell types across multiple runs of divisive clustering. The trained ICP models can then be used for various purposes, including prediction of cluster identities of related, unannotated single-cell datasets through reference-mapping, and inference of cell states and their differential expression programs using the cell cluster probabilities that represent the likelihood of each cell belonging to each cluster. While state-of-the-art single-cell integration methods often struggle with imbalanced cell types across heterogeneous datasets, `Coralysis` effectively differentiates similar yet unshared cell types across batches (Sousa et al., 2025).

___

## Installation
`Coralysis` can be installed from `Bioconductor` by running the following commands: ```{r installation, eval=FALSE} # Installation if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("Coralysis") ```

___

## Dataset
To quickly demonstrate the ability of `Coralysis`'s multi-level integration algorithm to integrate data that is unevenly distributed across batches or completely absent from some datasets, we selected two brain datasets — **zeisel** (Zeisel et al., 2018) and **romanov** (Romanov et al., 2017). These datasets share only a few cell types, with some similar yet distinct cell types across datasets, such as interneurons in **zeisel** and neurons in **romanov**. This makes them an ideal example to highlight the strengths of `Coralysis`. The **zeisel** (Zeisel et al., 2018) and **romanov** datasets can be imported as `SingleCellExperiment` objects via the `scRNAseq` `R` package using the functions `ZeiselBrainData()` and `RomanovBrainData()`. Run the R code below to load the packages `Coralysis`, `scRNAseq` and `SingleCellExperiment` as well as the datasets **zeisel** and **romanov**. ```{r packages, message=FALSE, warning=FALSE} # Import packages library("scRNAseq") library("Coralysis") library("SingleCellExperiment") ``` ```{r data} # Import data zeisel <- ZeiselBrainData() romanov <- RomanovBrainData() ```

___

## Preprocess
The datasets were normalized with the `scater` function `logNormCounts()` below. Additionaly they were merged based on shared genes and later subset to 2,000 highly variable genes with the `scran` function `getTopHVGs()`. Cell type labels correspond to level 1 class annotations provided in both `SingleCellExperiment` objects. They were stored under `cell_type` in `colData(sce)`, while batch labels (i.e., `Zeisel et al.` or `Romanov et al.`) were saved under `batch` in `colData(sce)`. ```{r preprocess} # Normalize zeisel <- scater::logNormCounts(zeisel) counts(zeisel) <- NULL altExps(zeisel) <- list() romanov <- scater::logNormCounts(romanov) counts(romanov) <- NULL # Intersected genes genes <- intersect(row.names(zeisel), row.names(romanov)) zeisel <- zeisel[genes, ] romanov <- romanov[genes, ] # Remove colData colData(zeisel) <- colData(zeisel)[, "level1class", drop = FALSE] colnames(colData(zeisel)) <- "cell_type" colData(romanov) <- colData(romanov)[, "level1 class", drop = FALSE] colnames(colData(romanov)) <- "cell_type" # Batch zeisel[["batch"]] <- "Zeisel et al." romanov[["batch"]] <- "Romanov et al." # Merge cells sce <- cbind(zeisel, romanov) rm(list = c("zeisel", "romanov")) # HVG hvg <- scran::getTopHVGs(sce, n = 2000) sce <- sce[hvg, ] ``` The `PrepareData()` function checks whether a sparse matrix is available in the `logcounts` assay (which corresponds to log-normalized data) and removes non-expressed features preparing the object to run `RunParallelDivisiveICP()` next. ```{r prepare data} # Prepare data: # checks 'logcounts' format & removes non-expressed genes sce <- PrepareData(object = sce) ```

___

## Multi-level integration
The multi-level integration algorithm is implemented in the `RunParallelDivisiveICP()` function, the main function in `Coralysis`. It only requires a `SingleCellExperiment` object, which in this case is `sce`. To perform integration across a batch, a `batch.label` available in `colData(sce)` must be provided. In this case, it is `"batch"`, created above to hold the information about the brain dataset origin, i.e., _Zeisel et al._ or _Romanov et al._. The ensemble algorithm runs 50 times by default, but for illustrative purposes, this has been reduced to 10 (`L=10`). Two threads are allocated to speed up the process (`threads=2`), though by default, the function uses all available system threads. Specify one thread if you prefer not to use any additional threads. In addition, the cells are downsampled to 1,000 cells per batch to speed up the process (`icp.batch.size = 1000`). The seed provided in `set.seed` is required to ensure the reproducibility of the sampling taken before parallelization. To ensure reproducibility during parallelization, the `RNGseed` parameter is set to `123` by default. The result consists of a set of models and their respective probability matrices (n = 40; log2(`k`) * `L`), stored in `metadata(sce)$coralysis$models` and `metadata(sce)$coralysis$joint.probability`, respectively. ```{r integration} # Multi-level integration set.seed(4591) # reproducibility of downsampling - 'icp.batch.size' sce <- RunParallelDivisiveICP( object = sce, batch.label = "batch", icp.batch.size = 1000, L = 10, threads = 2, RNGseed = 123 ) ```

___

## Integrated embedding
The integrated result of `Coralysis` consist of an integrated embedding which can be obtained by running the function `RunPCA`. This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as `RunTSNE` or `RunUMAP`. The function `RunPCA` runs by default the PCA method implemented the `R` package `irlba` (`pca.method="irlba"`), which requires a seed to ensure the same PCA result. ```{r integrated emb} # Integrated embedding set.seed(125) sce <- RunPCA(object = sce) ```

___

## Nonlinear dimensional reduction
Compute UMAP by running the function `RunUMAP()`. ```{r umap} # UMAP set.seed(1204) sce <- RunUMAP(object = sce, min_dist = 0.5, n_neighbors = 15) ```

___

### Visualize batch & cell types
Finally, the integration can be visually inspected by highlighting the batch and cell type labels into the UMAP projection. `Coralysis` is sensitive enough to integrate cell types shared across datasets such as oligodendrocytes, microglia and endothelial cells, but not ependymal or pyramidal cells. This difference arises because Romanov et al., 2017 focused specifically on the hypothalamus, whereas Zeisel et al., 2018 provided a broader characterization of the entire mouse nervous system, including multiple brain regions. ```{r viz categorical vars, fig.width=10, fig.height=5.5} # Visualize categorical variables integrated emb. vars <- c("batch", "cell_type") plots <- lapply(X = vars, FUN = function(x) { PlotDimRed( object = sce, color.by = x, point.size = 0.25, point.stroke = 0.5, legend.nrow = 4 ) }) cowplot::plot_grid(plotlist = plots, ncol = 2, align = "vh") # join plots together ```

To allow a better distinction between batches, plot cell types by batch below. ```{r viz categorical vars by batch, fig.width=10, fig.height=5.5} plots[[2]] + ggplot2::facet_grid(~batch) ```

Interestingly, `Coralysis` integrated VSM (Vascular Smooth Muscle) cells from Romanov et al., 2017 with a subset of endothelial mural cells from Zeisel et al., 2018. Gene expression visualization of cell type markers for VSM — _Acta2_, _Myh11_, _Tagln_ — and endothelial — _Cdh5_, _Pecam1_ — cells support this integration. ```{r viz gene exp markers, fig.width=8.5, fig.height=6} # Visualization of gene expression markers markers <- c("Acta2", "Myh11", "Tagln", "Cdh5", "Pecam1") plot.markers <- lapply(markers, function(x) { PlotExpression(sce, color.by = x, point.size = 0.5, point.stroke = 0.5) + ggplot2::facet_grid(~batch) }) cowplot::plot_grid(plotlist = plot.markers, ncol = 2) ```

___

### Graph-based clustering
Graph-based clustering can be obtained by running the `scran` function `clusterCells()` using the `Coralysis` integrated embedding. The clustering result can be saved directly into the `SingleCellExperiment` object (`sce$cluster`). ```{r clustering, fig.width=5, fig.height=5.5} # Graph-based clustering set.seed(123) sce$cluster <- scran::clusterCells(sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 30, cluster.fun = "louvain") ) # Plot the clustering result plots[[3]] <- PlotDimRed( object = sce, color.by = "cluster", point.size = 0.25, point.stroke = 0.5, legend.nrow = 4 ) plots[[3]] ```

The comparison below highlights the original cell type annotations and Louvain clustering obtained with the `Coralysis` integrated embedding, allowing us to assess the extent of cell population overlap between datasets. For instance, `Coralysis` integrated shared populations (e.g., clusters 11, 12 and 14), identified similar but unshared cell populations (e.g., clusters 10 versus 12), and preserved unique populations specific to each dataset (e.g., clusters 7 and 13). ```{r viz clustering by batch, fig.width=8, fig.height=10.5} cowplot::plot_grid( plots[[2]] + ggplot2::facet_grid(~batch), plots[[3]] + ggplot2::facet_grid(~batch), ncol = 1 ) ```

___

### Gene coefficients
Next, we can examine the gene coefficients from the ICP models that best explain the clusters identified above, allowing us to explore the expression of genes that most strongly support these clusters. Below we will look into the identified similar but unshared cell clusters 10 and 12. Gene coefficients can be obtained for the `cluster` label through majority voting with the function `MajorityVotingFeatures()`. The majority voting is performed by searching for the most representative cluster for a given cell label across all possible clusters (i.e., across all icp runs and rounds). The most representative cluster corresponds to the cluster with the highest majority voting score. This corresponds to the geometric mean between the proportion of cells from the given label intersected with a cluster and the proportion of cells from that same cluster that intersected with the cells from the given label. The higher the score the better. A cluster that scores 1 indicates that all its cells correspond to the assigned label, and vice versa; i.e., all the cells with the assigned label belong to this cluster. For example, `MajorityVotingFeatures()` by providing the `colData` variable `"cluster"` (i.e., `label="cluster"`). ```{r gene coefficients, message=FALSE, warning=FALSE} # Get label gene coefficients by majority voting clts.gene.coeff <- MajorityVotingFeatures(object = sce, label = "cluster") # Print summary clts.gene.coeff$summary ```
Visualize below the expression for the top 6 positive coefficients for the identified similar but unshared cell populations: clusters 10 and 12. ```{r viz top coefficients} ## Visualize top coefficients for cluster 10 and 12 clts <- c("10", "12") plot.coeff.markers <- list() for (i in clts) { top.markers <- clts.gene.coeff$feature_coeff[[i]] top.markers <- top.markers[order(top.markers[,2], decreasing = TRUE), ] top.markers <- top.markers[1:6, "feature"] plot.coeff.markers[[i]] <- lapply(top.markers, function(x) { PlotExpression(sce, color.by = x, point.size = 0.3, point.stroke = 0.5) + ggplot2::facet_grid(~batch) }) } ``` Visualize the expression of the top 6 positive coefficients for cluster 10. The top 6 positive coefficients identified are highly specific to cluster 10. ```{r viz - cluster 10, fig.width=9.5, fig.height=3.5} # Cluster 10 cowplot::plot_grid(plotlist = plot.coeff.markers[[1]], ncol = 3) ```
Visualize the expression of the top six positive coefficients for cluster 12. Among the highly specific positive coefficients, myelination-related genes _Ptgds_ and _Mal_ are enriched in cluster 12 compared to cluster 10, where they are practically absent. This suggests that cluster 12 represents a mature, myelinating oligodendrocyte population, in contrast to the more immature cluster 10. ```{r viz - cluster 12, fig.width=9.5, fig.height=3.5} # Cluster 12 cowplot::plot_grid(plotlist = plot.coeff.markers[[2]], ncol = 3) ```
Overall, ICP gene coefficients supported, to some extent, the identification of similar but unshared cell populations (clusters 10 and 12) with `Coralysis`.

___

## R session
```{r rsession} # R session sessionInfo() ```

___

## References
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). "Orchestrating single-cell analysis with Bioconductor." _Nature Methods_, *17*, 137-145. [https://www.nature.com/articles/s41592-019-0654-x](https://www.nature.com/articles/s41592-019-0654-x). Smolander J, Junttila S, Venäläinen MS, Elo LL (2021) "ILoReg: a tool for high-resolution cell population identification from single-cell RNA-seq data." _Bioinformatics_, *37(8)*, 1107–1114. [https://doi.org/10.1093/bioinformatics/btaa919](https://doi.org/10.1093/bioinformatics/btaa919) Lun ATL, McCarthy DJ, Marioni JC (2016). "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." _F1000Res._, *5*, 2122. [doi:10.12688/f1000research.9501.2](https://doi.org/10.12688/f1000research.9501.2). McCarthy DJ, Campbell KR, Lun ATL, Willis QF (2017). "Scater: pre-processing, quality control, normalisation and visualisation of single-cell RNA-seq data in R." _Bioinformatics_, *33*, 1179-1186. [doi:10.1093/bioinformatics/btw777](https://doi.org/10.1093/bioinformatics/btw777). Romanov RA, Zeisel A, Bakker J, Girach F, Hellysaz A, Tomer R, Alpár A, Mulder J, Clotman F, Keimpema E, Hsueh B, Crow AK, Martens H, Schwindling C, Calvigioni D, Bains JS, Máté Z, Szabó G, Yanagawa Y, Zhang M, Rendeiro A, Farlik M, Uhlén M, Wulff P, Bock C, Broberger C, Deisseroth K, Hökfelt T, Linnarsson S, Horvath TL, Harkany T (2017). "A novel organizing principle of the hypothalamus reveals molecularly segregated periventricular dopamine neurons." _Nature neuroscience_, *20(2)*, 176. [https://doi.org/10.1038/nn.4462](https://doi.org/10.1038/nn.4462) Sousa A, Smolander J, Junttila S, Elo L (2025). "Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration." _bioRxiv_. [doi:10.1101/2025.02.07.637023](https://doi.org/10.1101/2025.02.07.637023) Wickham H (2016). "ggplot2: Elegant Graphics for Data Analysis." _Springer-Verlag New York_. Zeisel A, Hochgerner H, Lönnerberg P, Johnsson A, Memic F, Van Der Zwan J, Häring M, Braun E, Borm LE, La Manno G, Codeluppi S, Furlan A, Lee K, Skene N, Harris KD, Hjerling-Leffler J, Arenas E, Ernfors P, Marklund U, Linnarsson S (2018). "Molecular architecture of the mouse nervous system." _Cell_, *174(4)*, 999-1014. [https://doi.org/10.1016/j.cell.2018.06.021](https://doi.org/10.1016/j.cell.2018.06.021)