--- title: "Integration" date: "Compiled: `r format(Sys.Date(), '%d/%m/%Y')`" output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{Integration} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, fig.align = "center") ```

___

## Dataset
To illustrate the multi-level integration algorithm, we will use in this vignette two samples from the `MouseGastrulationData` `R` package: **1** and **3**. The samples correspond to different pool and embryonic stages: + **1**: pool index 1 and embryonic stage E6.5 with 360 cells + **3**: pool index 3 and embryonic stage E7.5 with 458 cells
The samples were imported below with the function `EmbryoAtlasData()` from `MouseGastrulationData` package as a `SingleCellExperiment` object, the object (class) required by most functions in `Coralysis` (see [Chapter 4 The SingleCellExperiment class - OSCA manual](https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html)). The `SCE` object provided comprises `counts` (raw count data), cell `colData` (which includes batch and cell labels, designated as `pool` and `celltype`, respectively), among other data layers. Run the code below to import the `R` packages and data required to reproduce this vignette. ```{r packages, message=FALSE, warning=FALSE} # Packages library("scran") library("ggplot2") library("Coralysis") library("SingleCellExperiment") library("MouseGastrulationData") ``` ```{r data, message=FALSE, warning=FALSE} # Import data sce <- EmbryoAtlasData(samples = c(1, 3)) ```

___

## Normalization
`Coralysis` requires log-normalised data as input. Run the code below to apply the basic log-normalisation method implemented in `Seurat` (see [NormalizeData](https://satijalab.org/seurat/reference/normalizedata)). ```{r Seurat normalisation, echo=TRUE, message=FALSE, warning=FALSE} ## Normalize the data # log1p normalization SeuratNormalisation <- function(object) { # 'SeuratNormalisation()' function applies the basic Seurat normalization to # a SingleCellExperiment object with a 'counts' assay. Normalized data # is saved in the 'logcounts' assay. logcounts(object) <- apply(counts(object), 2, function(x) { log1p(x / sum(x) * 10000) }) # log1p normalization w/ 10K scaling factor logcounts(object) <- as(logcounts(object), "sparseMatrix") return(object) } sce <- SeuratNormalisation(object = sce) # Remove 'counts' to reduce data size object counts(sce) <- NULL ``` In alternative, `scran` normalization can be performed, which is particularly beneficial if rare cell types exist (see the following [vignette](https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html)). In addition, have a look into [Chapter 7 Normalization - OSCA manual](https://bioconductor.org/books/3.12/OSCA/normalization.html).

___

## HVG selection
Highly variable genes (HVG) can be selected using the R package `scran`. The variable `pool` from `colData` is used as batch label. The `SingleCellExperiment` object allows alternative experiments to the main experiment. This is important to keep a backup of all genes in the same `SingleCellExperiment` object before selecting HVGs (see [SingleCellExperiment vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html)). ```{r hvg} # Feature selection with 'scran' package nhvg <- 500 batch.label <- "pool" sce[[batch.label]] <- factor(sce[[batch.label]]) m.hvg <- scran::modelGeneVar(sce, block = sce[[batch.label]]) top.hvg <- getTopHVGs(m.hvg, n = nhvg) # Subset object sce <- sce[top.hvg, ] # Rename genes: 'Ensembl ID' to SYMBOL row.names(sce) <- rowData(sce)$SYMBOL rowData(sce) <- NULL ```

___

## DimRed: pre-integration
The batch effect between assays can be inspected below by projecting the data onto _t_-distributed Stochastic Neighbor Embedding (t-SNE). This can be achieved by running sequentially the `Coralysis` functions `RunPCA` and `RunTSNE`. Provide a seed before running each one of these functions to ensure reproducibility. The function `RunPCA` runs by default the PCA method implemented in the `R` package `irlba` (`pca.method="irlba"`), which requires a seed to ensure the same PCA result. In addition, the `assay.name` argument needs to be provided, otherwise uses by default the Coralysis probabilities which are obtained only after integration (after running `RunParallelDivisiveICP`). The assay `logcounts`, corresponding to the log-normalized data, and number of principal components to use `p` were provided. Any categorical variable available in `colData(sce)`, such as `pool` or `celltype`, can be visualized in a low dimensional embedding stored in `reducedDimNames(sce)` with the `Coralysis` function `PlotDimRed`. ```{r dimred: pre-integration, fig.width=8.5, fig.height=7.25} # Compute PCA & TSNE set.seed(123) sce <- RunPCA( object = sce, assay.name = "logcounts", p = 30, dimred.name = "unintPCA" ) set.seed(123) sce <- RunTSNE(sce, dimred.type = "unintPCA", dimred.name = "unintTSNE" ) # Plot TSNE highlighting the batch & cell type unint.batch.plot <- PlotDimRed( object = sce, color.by = "pool", dimred = "unintTSNE", point.size = 0.01, legend.nrow = 1, seed.color = 1024 ) unint.cell.plot <- PlotDimRed( object = sce, color.by = "celltype", dimred = "unintTSNE", point.size = 0.01, legend.nrow = 14, seed.color = 7 ) cowplot::plot_grid(unint.batch.plot, unint.cell.plot, ncol = 2, align = "vh") ```

___

## Multi-level integration
Integrate assays with the multi-level integration algorithm implemented in `Coralysis` by running the function `RunParallelDivisiveICP`. The only arguments required by this function are `object` and `batch.label`. The `object` requires a `SingleCellExperiment` object with the assay `logcounts`. The matrix in `logcounts` should be sparse, i.e., `is(logcounts(sce), "dgCMatrix")` is `TRUE`, and it should not contain non-expressing genes. This is ensured by running `PrepareData` before. The `batch.label` argument requires a label column name in `colData(sce)` corresponding to the batch label that should be used for integration. In the absence of a batch, the same function, `RunParallelDivisiveICP`, can be run without providing `batch.label` (i.e., `batch.label = NULL`), in which case the data will be modeled through the algorithm to identify fine-grained populations that do not required batch correction. An higher number of `threads` can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (`L = 10`), but generally, this number should be higher (with the default being `L = 50`). A train set can be built to improve the modeling step in `Coralysis`. In this case, the option was set to `FALSE` (`build.train.set = FALSE`) as it is faster and requires less memory. Otherwise, using this option is recommended as it can make the modeling step faster and consume less memory for larger datasets. ```{r multi-level integration, message=TRUE, warning=FALSE} # Prepare data for integration: # remove non-expressing genes & logcounts is from `dgCMatrix` class sce <- PrepareData(object = sce) # Perform integration with Coralysis set.seed(1024) sce <- RunParallelDivisiveICP( object = sce, batch.label = "pool", build.train.set = FALSE, L = 10, threads = 2 ) ```

___

## DimRed: post-integration
The integration result can be visually inspected by running sequentially the functions `RunPCA` and `RunTSNE`. The `assay.name` provided to `RunPCA` must be `joint.probability` (the default), the primary output of integration with `Coralysis`. The probability matrices from `Coralysis` (i.e., `joint.probability`) can be used to obtain an integrated embedding by running `RunPCA(..., assay.name = "joint.probability")`. This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as `RunTSNE`. Below, the integrated PCA was named `intPCA`. ```{r dimred: post-integration, fig.width=8.5, fig.height=7.25} # Compute PCA with joint cluster probabilities & TSNE set.seed(123) sce <- RunPCA(sce, assay.name = "joint.probability", dimred.name = "intPCA" ) set.seed(123) sce <- RunTSNE(sce, dimred.type = "intPCA", dimred.name = "intTSNE" ) # Plot TSNE highlighting the batch & cell type int.batch.plot <- PlotDimRed( object = sce, color.by = "pool", dimred = "intTSNE", point.size = 0.01, legend.nrow = 1, seed.color = 1024 ) int.cell.plot <- PlotDimRed( object = sce, color.by = "celltype", dimred = "intTSNE", point.size = 0.01, legend.nrow = 14, seed.color = 7 ) cowplot::plot_grid(int.batch.plot, int.cell.plot, ncol = 2, align = "vh" ) ```

___

## Clustering
Run graph-based clustering with the `scran` function `clusterCells` (see [Chapter 5 Clustering - OSCA manual](https://bioconductor.org/books/3.14/OSCA.basic/clustering.html)). ```{r clustering, fig.width=11.5, fig.height=7.25} # Graph-based clustering on the integrated PCA w/ 'scran' package blusparams <- bluster::SNNGraphParam(k = 15, cluster.fun = "louvain") set.seed(123) sce$cluster <- scran::clusterCells(sce, use.dimred = "intPCA", BLUSPARAM = blusparams ) # Plot clustering clt.plot <- PlotDimRed( object = sce, color.by = "cluster", dimred = "intTSNE", point.size = 0.01, legend.nrow = 3, seed.color = 65 ) cowplot::plot_grid(int.batch.plot, int.cell.plot, clt.plot, ncol = 3, align = "h" ) ```

___

## Cluster markers
Identify the cluster markers by running the `Coralysis` function `FindAllClusterMarkers`. Provide the `clustering.label`, in this case, the label used above, i.e., `cluster`. The top three positive markers per cluster were retrieved and plotted below using the `Coralysis` function `HeatmapFeatures`. ```{r cluster markers, fig.width=5, fig.height=5} # Cluster markers cluster.markers <- FindAllClusterMarkers(object = sce, clustering.label = "cluster") # Select the top 3 positive markers per cluster top3.markers <- lapply(X = split(x = cluster.markers, f = cluster.markers$cluster), FUN = function(x) { head(x[order(x$log2FC, decreasing = TRUE), ], n = 3) }) top3.markers <- do.call(rbind, top3.markers) top3.markers <- top3.markers[order(as.numeric(top3.markers$cluster)), ] # Heatmap of the top 3 positive markers per cluster HeatmapFeatures( object = sce, clustering.label = "cluster", features = top3.markers$marker, seed.color = 65 ) ```

___

## DGE
Differential gene expression (DGE) between cluster 3 and 6 corresponding roughly to `Visceral endoderm` and `ExE endoderm`, respectively. The genes _Ttr_, _Rbp4_, _Apoa1_, _Apom_ and _Ctsl_ are upregulated in cluster 6, while _Tmsb10_ is upregulated in cluster 3. ```{r dge, fig.width=9, fig.height=4.5} # DGE analysis: cluster 3 vs 6 dge.clt3vs6 <- FindClusterMarkers(sce, clustering.label = "cluster", clusters.1 = "3", clusters.2 = "6" ) head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC), decreasing = TRUE), ]) top6.degs <- head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC), decreasing = TRUE ), "marker"]) exp.plots <- lapply(X = top6.degs, FUN = function(x) { PlotExpression( object = sce, color.by = x, scale.values = TRUE, point.size = 0.5, point.stroke = 0.5 ) }) cowplot::plot_grid(plotlist = exp.plots, align = "vh", ncol = 3) ```

___

## R session
```{r rsession} # R session sessionInfo() ```

___

## References
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). "Orchestrating single-cell analysis with Bioconductor." _Nature Methods_, *17*, 137-145. [https://www.nature.com/articles/s41592-019-0654-x](https://www.nature.com/articles/s41592-019-0654-x). Griffiths J, Lun A (2024). "MouseGastrulationData: Single-Cell-omics Data across Mouse Gastrulation and Early Organogenesis". [10.18129/B9.bioc.MouseGastrulationData](https://doi.org/10.18129/B9.bioc.MouseGastrulationData). R package version 1.18.0. Lun ATL, McCarthy DJ, Marioni JC (2016). "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." _F1000Res._, *5*, 2122. [doi:10.12688/f1000research.9501.2](https://doi.org/10.12688/f1000research.9501.2). Sousa A, Smolander J, Junttila S, Elo L (2025). "Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration." _bioRxiv_. [doi:10.1101/2025.02.07.637023](https://doi.org/10.1101/2025.02.07.637023) Wickham H (2016). "ggplot2: Elegant Graphics for Data Analysis." _Springer-Verlag New York_.