---
title: "Integration"
date: "Compiled: `r format(Sys.Date(), '%d/%m/%Y')`"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Integration}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align = "center")
```
___
## Dataset
To illustrate the multi-level integration algorithm, we will use in this vignette two samples from the `MouseGastrulationData` `R` package: **1** and **3**. The samples correspond to different pool and embryonic stages:
+ **1**: pool index 1 and embryonic stage E6.5 with 360 cells
+ **3**: pool index 3 and embryonic stage E7.5 with 458 cells
The samples were imported below with the function `EmbryoAtlasData()` from `MouseGastrulationData` package as a `SingleCellExperiment` object, the object (class) required by most functions in `Coralysis` (see [Chapter 4 The SingleCellExperiment class - OSCA manual](https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html)). The `SCE` object provided comprises `counts` (raw count data), cell `colData` (which includes batch and cell labels, designated as `pool` and `celltype`, respectively), among other data layers.
Run the code below to import the `R` packages and data required to reproduce this vignette.
```{r packages, message=FALSE, warning=FALSE}
# Packages
library("scran")
library("ggplot2")
library("Coralysis")
library("SingleCellExperiment")
library("MouseGastrulationData")
```
```{r data, message=FALSE, warning=FALSE}
# Import data
sce <- EmbryoAtlasData(samples = c(1, 3))
```
___
## Normalization
`Coralysis` requires log-normalised data as input. Run the code below to apply the basic log-normalisation method implemented in `Seurat` (see [NormalizeData](https://satijalab.org/seurat/reference/normalizedata)).
```{r Seurat normalisation, echo=TRUE, message=FALSE, warning=FALSE}
## Normalize the data
# log1p normalization
SeuratNormalisation <- function(object) {
# 'SeuratNormalisation()' function applies the basic Seurat normalization to
# a SingleCellExperiment object with a 'counts' assay. Normalized data
# is saved in the 'logcounts' assay.
logcounts(object) <- apply(counts(object), 2, function(x) {
log1p(x / sum(x) * 10000)
}) # log1p normalization w/ 10K scaling factor
logcounts(object) <- as(logcounts(object), "sparseMatrix")
return(object)
}
sce <- SeuratNormalisation(object = sce)
# Remove 'counts' to reduce data size object
counts(sce) <- NULL
```
In alternative, `scran` normalization can be performed, which is particularly beneficial if rare cell types exist (see the following [vignette](https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html)). In addition, have a look into [Chapter 7 Normalization - OSCA manual](https://bioconductor.org/books/3.12/OSCA/normalization.html).
___
## HVG selection
Highly variable genes (HVG) can be selected using the R package `scran`. The variable `pool` from `colData` is used as batch label. The `SingleCellExperiment` object allows alternative experiments to the main experiment. This is important to keep a backup of all genes in the same `SingleCellExperiment` object before selecting HVGs (see [SingleCellExperiment vignette](https://www.bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html)).
```{r hvg}
# Feature selection with 'scran' package
nhvg <- 500
batch.label <- "pool"
sce[[batch.label]] <- factor(sce[[batch.label]])
m.hvg <- scran::modelGeneVar(sce, block = sce[[batch.label]])
top.hvg <- getTopHVGs(m.hvg, n = nhvg)
# Subset object
sce <- sce[top.hvg, ]
# Rename genes: 'Ensembl ID' to SYMBOL
row.names(sce) <- rowData(sce)$SYMBOL
rowData(sce) <- NULL
```
___
## DimRed: pre-integration
The batch effect between assays can be inspected below by projecting the data onto _t_-distributed Stochastic Neighbor Embedding (t-SNE). This can be achieved by running sequentially the `Coralysis` functions `RunPCA` and `RunTSNE`. Provide a seed before running each one of these functions to ensure reproducibility. The function `RunPCA` runs by default the PCA method implemented in the `R` package `irlba` (`pca.method="irlba"`), which requires a seed to ensure the same PCA result. In addition, the `assay.name` argument needs to be provided, otherwise uses by default the Coralysis probabilities which are obtained only after integration (after running `RunParallelDivisiveICP`). The assay `logcounts`, corresponding to the log-normalized data, and number of principal components to use `p` were provided. Any categorical variable available in `colData(sce)`, such as `pool` or `celltype`, can be visualized in a low dimensional embedding stored in `reducedDimNames(sce)` with the `Coralysis` function `PlotDimRed`.
```{r dimred: pre-integration, fig.width=8.5, fig.height=7.25}
# Compute PCA & TSNE
set.seed(123)
sce <- RunPCA(
object = sce,
assay.name = "logcounts",
p = 30, dimred.name = "unintPCA"
)
set.seed(123)
sce <- RunTSNE(sce,
dimred.type = "unintPCA",
dimred.name = "unintTSNE"
)
# Plot TSNE highlighting the batch & cell type
unint.batch.plot <- PlotDimRed(
object = sce,
color.by = "pool",
dimred = "unintTSNE",
point.size = 0.01,
legend.nrow = 1,
seed.color = 1024
)
unint.cell.plot <- PlotDimRed(
object = sce,
color.by = "celltype",
dimred = "unintTSNE",
point.size = 0.01,
legend.nrow = 14,
seed.color = 7
)
cowplot::plot_grid(unint.batch.plot, unint.cell.plot, ncol = 2, align = "vh")
```
___
## Multi-level integration
Integrate assays with the multi-level integration algorithm implemented in `Coralysis` by running the function `RunParallelDivisiveICP`. The only arguments required by this function are `object` and `batch.label`. The `object` requires a `SingleCellExperiment` object with the assay `logcounts`. The matrix in `logcounts` should be sparse, i.e., `is(logcounts(sce), "dgCMatrix")` is `TRUE`, and it should not contain non-expressing genes. This is ensured by running `PrepareData` before. The `batch.label` argument requires a label column name in `colData(sce)` corresponding to the batch label that should be used for integration. In the absence of a batch, the same function, `RunParallelDivisiveICP`, can be run without providing `batch.label` (i.e., `batch.label = NULL`), in which case the data will be modeled through the algorithm to identify fine-grained populations that do not required batch correction. An higher number of `threads` can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (`L = 10`), but generally, this number should be higher (with the default being `L = 50`). A train set can be built to improve the modeling step in `Coralysis`. In this case, the option was set to `FALSE` (`build.train.set = FALSE`) as it is faster and requires less memory. Otherwise, using this option is recommended as it can make the modeling step faster and consume less memory for larger datasets.
```{r multi-level integration, message=TRUE, warning=FALSE}
# Prepare data for integration:
# remove non-expressing genes & logcounts is from `dgCMatrix` class
sce <- PrepareData(object = sce)
# Perform integration with Coralysis
set.seed(1024)
sce <- RunParallelDivisiveICP(
object = sce,
batch.label = "pool",
build.train.set = FALSE,
L = 10, threads = 2
)
```
___
## DimRed: post-integration
The integration result can be visually inspected by running sequentially the functions `RunPCA` and `RunTSNE`. The `assay.name` provided to `RunPCA` must be `joint.probability` (the default), the primary output of integration with `Coralysis`. The probability matrices from `Coralysis` (i.e., `joint.probability`) can be used to obtain an integrated embedding by running `RunPCA(..., assay.name = "joint.probability")`. This integrated PCA can, in turn, be used downstream for clustering or non-linear dimensional reduction techniques, such as `RunTSNE`. Below, the integrated PCA was named `intPCA`.
```{r dimred: post-integration, fig.width=8.5, fig.height=7.25}
# Compute PCA with joint cluster probabilities & TSNE
set.seed(123)
sce <- RunPCA(sce,
assay.name = "joint.probability",
dimred.name = "intPCA"
)
set.seed(123)
sce <- RunTSNE(sce,
dimred.type = "intPCA",
dimred.name = "intTSNE"
)
# Plot TSNE highlighting the batch & cell type
int.batch.plot <- PlotDimRed(
object = sce,
color.by = "pool",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 1,
seed.color = 1024
)
int.cell.plot <- PlotDimRed(
object = sce,
color.by = "celltype",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 14,
seed.color = 7
)
cowplot::plot_grid(int.batch.plot, int.cell.plot,
ncol = 2, align = "vh"
)
```
___
## Clustering
Run graph-based clustering with the `scran` function `clusterCells` (see [Chapter 5 Clustering - OSCA manual](https://bioconductor.org/books/3.14/OSCA.basic/clustering.html)).
```{r clustering, fig.width=11.5, fig.height=7.25}
# Graph-based clustering on the integrated PCA w/ 'scran' package
blusparams <- bluster::SNNGraphParam(k = 15, cluster.fun = "louvain")
set.seed(123)
sce$cluster <- scran::clusterCells(sce,
use.dimred = "intPCA",
BLUSPARAM = blusparams
)
# Plot clustering
clt.plot <- PlotDimRed(
object = sce,
color.by = "cluster",
dimred = "intTSNE",
point.size = 0.01,
legend.nrow = 3,
seed.color = 65
)
cowplot::plot_grid(int.batch.plot, int.cell.plot,
clt.plot,
ncol = 3, align = "h"
)
```
___
## Cluster markers
Identify the cluster markers by running the `Coralysis` function `FindAllClusterMarkers`. Provide the `clustering.label`, in this case, the label used above, i.e., `cluster`. The top three positive markers per cluster were retrieved and plotted below using the `Coralysis` function `HeatmapFeatures`.
```{r cluster markers, fig.width=5, fig.height=5}
# Cluster markers
cluster.markers <- FindAllClusterMarkers(object = sce, clustering.label = "cluster")
# Select the top 3 positive markers per cluster
top3.markers <- lapply(X = split(x = cluster.markers, f = cluster.markers$cluster), FUN = function(x) {
head(x[order(x$log2FC, decreasing = TRUE), ], n = 3)
})
top3.markers <- do.call(rbind, top3.markers)
top3.markers <- top3.markers[order(as.numeric(top3.markers$cluster)), ]
# Heatmap of the top 3 positive markers per cluster
HeatmapFeatures(
object = sce,
clustering.label = "cluster",
features = top3.markers$marker,
seed.color = 65
)
```
___
## DGE
Differential gene expression (DGE) between cluster 3 and 6 corresponding roughly to `Visceral endoderm` and `ExE endoderm`, respectively. The genes _Ttr_, _Rbp4_, _Apoa1_, _Apom_ and _Ctsl_ are upregulated in cluster 6, while _Tmsb10_ is upregulated in cluster 3.
```{r dge, fig.width=9, fig.height=4.5}
# DGE analysis: cluster 3 vs 6
dge.clt3vs6 <- FindClusterMarkers(sce,
clustering.label = "cluster",
clusters.1 = "3",
clusters.2 = "6"
)
head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC), decreasing = TRUE), ])
top6.degs <- head(dge.clt3vs6[order(abs(dge.clt3vs6$log2FC),
decreasing = TRUE
), "marker"])
exp.plots <- lapply(X = top6.degs, FUN = function(x) {
PlotExpression(
object = sce, color.by = x,
scale.values = TRUE, point.size = 0.5, point.stroke = 0.5
)
})
cowplot::plot_grid(plotlist = exp.plots, align = "vh", ncol = 3)
```
___
## R session
```{r rsession}
# R session
sessionInfo()
```
___
## References
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). "Orchestrating single-cell analysis with Bioconductor." _Nature Methods_, *17*, 137-145. [https://www.nature.com/articles/s41592-019-0654-x](https://www.nature.com/articles/s41592-019-0654-x).
Griffiths J, Lun A (2024). "MouseGastrulationData: Single-Cell-omics Data across Mouse Gastrulation and Early Organogenesis". [10.18129/B9.bioc.MouseGastrulationData](https://doi.org/10.18129/B9.bioc.MouseGastrulationData). R package version 1.18.0.
Lun ATL, McCarthy DJ, Marioni JC (2016). "A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor." _F1000Res._, *5*, 2122. [doi:10.12688/f1000research.9501.2](https://doi.org/10.12688/f1000research.9501.2).
Sousa A, Smolander J, Junttila S, Elo L (2025). "Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration." _bioRxiv_. [doi:10.1101/2025.02.07.637023](https://doi.org/10.1101/2025.02.07.637023)
Wickham H (2016). "ggplot2: Elegant Graphics for Data Analysis." _Springer-Verlag New York_.