---
title: "Interoperability with SingleCellExperiment"
author: "Givanna Putri"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Using SuperCellCyto with Single-Cell Based Objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

## Introduction

This vignette demonstrates how to integrate SuperCellCyto results with 
cytometry data stored in SingleCellExperiment (SCE) objects,
and how to analyse SuperCellCyto output using Bioconductor packages that take 
SCE objects as input.

We use a subsampled Levine_32dim dataset stored in an SCE object to 
illustrate how to create supercells and conduct downstream analyses.

```{r setup, message=FALSE, warning=FALSE}
library(SuperCellCyto)
library(qs2)
library(scran)
library(BiocSingular)
library(scater)
library(bluster)
library(data.table)
```

## Preparing SCE object

We first load the subsampled Levine_32dim data, stored as a 
[qs2](https://cran.r-project.org/web/packages/qs2/index.html)
using the `qs_read` function.

```{r load_sce_obj}
sce <- qs_read(
    system.file(
        "extdata", 
        "Levine_32dim_sce_sub.qs2", 
        package = "SuperCellCyto"
    )
)
sce
```
The data is stored in the `counts` assay.
We will subset it to include only markers we need to perform downstream 
analysis, transform it using arcsinh transformation, and store the transformed 
data in the `logcounts` assay.

```{r subset_and_transform}
markers <- c(
    "CD45RA", "CD133", "CD19", "CD22", "CD11b", "CD4", "CD8",
    "CD34", "Flt3", "CD20", "CXCR4", "CD235ab", "CD45", "CD123", "CD321",
    "CD14", "CD33", "CD47", "CD11c", "CD7", "CD15", "CD16", "CD44", "CD38",
    "CD13", "CD3", "CD61", "CD117",
    "CD49d", "HLA-DR", "CD64", "CD41"
)

# keep only the relevant markers
sce <- sce[markers, ]

# to store arcsinh transformed data
exprs(sce) <- asinh(counts(sce) / 5)

sce
```

## Run SuperCellCyto

SuperCellCyto requires input data in a `data.table` format.
Therefore, we need to extract the arcsinh-transformed data into a 
`data.table` object, and add the sample information and IDs of the cells.

Note that SCE typically stores cells as columns and features as rows.
SuperCellCyto, conversely, requires cells as rows and features as columns, 
a format typical for cytometry data where we typically have more cells 
than features.
Hence, we will transpose the extracted data accordingly when creating the 
`data.table` object.

```{r extract_dt_and_run_supercellcyto}
dt <- data.table(t(exprs(sce)))
dt$sample <- colData(sce)$sample
dt$cell_id <- colnames(sce)

supercells <- runSuperCellCyto(
    dt = dt,
    markers = markers,
    sample_colname = "sample",
    cell_id_colname = "cell_id",
    gam = 5
)

head(supercells$supercell_expression_matrix)
```

We can now embed the supercell ID in the `colData` of our SCE object.

```{r add_supercell_id_to_coldata}
colData(sce)$supercell_id <-  factor(supercells$supercell_cell_map$SuperCellID)
head(colData(sce))
```

## Analyse Supercells as SCE object

As the number of supercells is less than the number of cells in our SCE object,
we store the supercell expression matrix as a separate SCE object.
This then allows us to use Bioconductor packages to analyse our supercells.

```{r create_supercell_sce}
supercell_sce <- SingleCellExperiment(
    list(logcounts = t(
        supercells$supercell_expression_matrix[, markers, with = FALSE]
    )),
    colData = DataFrame(
        SuperCellId = supercells$supercell_expression_matrix$SuperCellId,
        sample = supercells$supercell_expression_matrix$sample
    )
)
colnames(supercell_sce) <- colData(supercell_sce)$SuperCellId
supercell_sce
```

The code above essentially transpose the supercell expression matrix, 
making supercells columns and markers rows,
and store it in the `logcounts` assay of our new SCE object.
We also populate the `colData` with SuperCellId and the sample name for 
each supercell.

With the supercell expression matrix now in an SCE format, we can perform 
downstream analyses such as clustering and and drawing UMAP plots using 
Bioconductor packages.

```{r cluster_and_umap_supercells}
set.seed(42)

supercell_sce <- fixedPCA(
    supercell_sce,
    rank = 10,
    subset.row = NULL,
    BSPARAM = RandomParam()
)
supercell_sce <- runUMAP(supercell_sce, dimred = "PCA")

clusters <- clusterCells(
    supercell_sce, use.dimred = "PCA",
    BLUSPARAM = SNNGraphParam(cluster.fun = "leiden")
)

colLabels(supercell_sce) <- clusters

plotReducedDim(supercell_sce, dimred = "UMAP", colour_by = "label")
```

Any functions which operate on SCE object should work.
For example, we can use `plotExpression` in `scater` package to 
create violin plots of the markers against clusters.

Note, the y-axis says "logcounts", but the data is actually arcsinh 
transformed, not log transformed.

```{r plot_marker_expression_supercells}
plotExpression(
    supercell_sce, c("CD4", "CD8", "CD19", "CD34", "CD11b"),
    x = "label", colour_by = "sample"
)
```

## Transfer information from supercells SCE object to single cell SCE object

To transfer analysis results (e.g., clusters) from the supercell SCE object 
back to the single-cell SCE object, we need to do some data wrangling.
It is vital to ensure that the order of the analysis results (e.g., clusters)
aligns with the cell order in the SCE object.

Using the cluster information as an example, we need to first extract 
the `colData` of the SCE objects into two separate `data.table` objects.
We then use `merge.data.table` to match and merge them using the supercell 
ID as the common identifiers.
Make sure you set the `sort` parameter to FALSE and set `x` to the `colData` 
of your single cell SCE object.
This ensures that the order of the resulting `data.table` aligns with the 
order of the `colData` of our single-cell SCE object.

```{r transfer_cluster_to_singlecell}
cell_id_sce <- data.table(as.data.frame(colData(sce)))
supercell_cluster <- data.table(as.data.frame(colData(supercell_sce)))
cell_id_sce_with_clusters <- merge.data.table(
    x = cell_id_sce,
    y = supercell_cluster,
    by.x = "supercell_id",
    by.y = "SuperCellId",
    sort = FALSE
)
```

Finally, we can then add the cluster assignment as a column in the 
`colData` of our single-cell SCE object.

```{r add_cluster_to_coldata}
colData(sce)$cluster <- cell_id_sce_with_clusters$label
```

Visualise them as UMAP plot.

```{r umap_singlecell_colored_by_cluster}
sce <- fixedPCA(sce, rank = 10, subset.row = NULL, BSPARAM = RandomParam())
sce <- runUMAP(sce, dimred = "PCA")

plotReducedDim(sce, dimred = "UMAP", colour_by = "cluster")

```

Or violin plot to see the distribution of their marker expressions.
Note, the y-axis says "logcounts", but the data is actually arcsinh 
transformed, not log transformed.

```{r plot_marker_expression_singlecell}
plotExpression(
    sce, c("CD4", "CD8", "CD19", "CD34", "CD11b"),
    x = "cluster", colour_by = "sample"
)
```


## Session information

```{r session_info}
sessionInfo()
```