---
title: "How to create supercells"
author: "Givanna Putri"
package: SuperCellCyto
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{How to create supercells}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

```{r load_packages, echo=FALSE, message=FALSE}
library(SuperCellCyto)
library(parallel)
library(BiocParallel)
```

## Introduction

This vignette describes the steps to generate supercells for 
cytometry data using SuperCellCyto R package.

Briefly, supercells are "mini" clusters of cells that are similar in their 
marker expressions. 
The motivation behind supercells is that instead of analysing millions of 
individual cells, you can analyse thousands of supercells, 
making downstream analysis much faster while maintaining 
biological interpretability.


See other vignettes for how to:

* [Prepare your data for SuperCellCyto](how_to_prepare_data.html)
* [Use SuperCellCyto for downsampling](
    using_supercellcyto_for_stratified_summarising.html)
* [Integrate SuperCellCyto output with SingleCellExperiment objects](
    interoperability_with_sce.html)
* [Integrate SuperCellCyto output with Seurat objects](
    interoperability_with_seurat.html)

## Installation

You can install stable version of SuperCellCyto from Bioconductor using:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("SuperCellCyto")
```

For the latest development version, you can install it from GitHub using `pak`:

```{r, eval=FALSE}
if (!requireNamespace("pak", quietly = TRUE))
    install.packages("pak")
pak::install_github("phipsonlab/SuperCellCyto")
```

## Preparing your dataset

The function which creates supercells is called `runSuperCellCyto`, and it
operates on a `data.table` object, an enhanced version of R native 
`data.frame`.

In addition to needing the data stored in a `data.table` object it also 
requires:

1. The markers you will be using to create supercells to have been 
appropriately transformed, typically using either arcsinh transformation or 
linear binning (using FlowJo).
**`runSuperCellCyto` does not perform any data transformation or scaling.**
2. The object to have a column denoting the unique ID of each cell.
You most likely have to create this column yourself, and it can simply just be
a numerical value ranging from 1 to however many cells you have in your data.
3. The object to have a column denoting the biological sample each cell 
comes from.
This column is critical to ensure that cells from different samples will 
not be mixed in a supercell.

If you are not sure how to import CSV or FCS files into `data.table` 
object, and/or how to subsequently prepare the object ready for 
SuperCellCyto, please consult this [vignette](how_to_prepare_data.html).
In that vignette, we also provide an explanation behind why we need to have the
cell ID and sample column.

For this vignette, we will simulate some toy data using the `
simCytoData` function.
Specifically, we will simulate 15 markers and 3 samples, 
with each sample containing 10,000 cells.
Hence in total, we will have a toy dataset containing 15 markers and 
30,000 cells.

```{r simulate_data}
n_markers <- 15
n_samples <- 3
dat <- simCytoData(nmarkers = n_markers, ncells = rep(10000, n_samples))
head(dat)
```

For our toy dataset, we will transform our data using arcsinh transformation.
We will use the base R `asinh` function to do this:

```{r arcsinh_transformation}
# Specify which columns are the markers to transform
marker_cols <- paste0("Marker_", seq_len(n_markers))
# The co-factor for arc-sinh
cofactor <- 5

# Do the transformation
dat_asinh <- asinh(dat[, marker_cols, with = FALSE] / cofactor)

# Rename the new columns
marker_cols_asinh <- paste0(marker_cols, "_asinh")
names(dat_asinh) <- marker_cols_asinh

# Add them our previously loaded data
dat <- cbind(dat, dat_asinh)

head(dat[, marker_cols_asinh, with = FALSE])
```

We will also create a column *Cell_id_dummy* which uniquely identify each cell.
It will have values such as `Cell_1, Cell_2,` all the way until `Cell_x`
where x is the number of cells in the dataset.

```{r create_cell_id}
dat$Cell_id_dummy <- paste0("Cell_", seq_len(nrow(dat)))
head(dat$Cell_id_dummy, n = 10)
```

By default, the `simCytoData` function will generate cells for multiple samples,
and that the resulting `data.table` object will already have a column 
called *Sample* that denotes the sample the cells come from.

```{r check_sample_col}
unique(dat$Sample)
```

Let's take note of the sample and cell id column for later.

```{r set_colnames}
sample_col <- "Sample"
cell_id_col <- "Cell_id_dummy"
```

## Creating supercells

Now that we have our data, let's create some supercells.
To do this, we will use `runSuperCellCyto` function and pass the markers,
sample and cell ID columns as parameters.

The reason why we need to specify the markers is because the function will
create supercells based on only the expression of those markers.
We highly recommend creating supercells using all markers in your data, let 
that be cell type or cell state markers.
However, if for any reason you only want to only use a subset of the markers in
your data, then make sure you specify them in a vector that you later pass to
`runSuperCellCyto` function.

For this tutorial, we will use all the arcsinh transformed markers in the 
toy data.

```{r run_supercellcyto}
supercells <- runSuperCellCyto(
    dt = dat,
    markers = marker_cols_asinh,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col
)
```

Let's dig deeper into the object it created:

```{r check_supercells_class}
class(supercells)
```

It is a list containing 3 elements:

```{r check_supercells_names}
names(supercells)
```

### Supercell object

The `supercell_object` contains the metadata used to create the supercells.
It is a list, and each element contains the metadata used to create the 
supercells for a sample.
This will come in handy if we need to either regenerate the supercells using
different gamma values (so we get more or less supercells) or do some 
debugging later down the line.
More on regenerating supercells on
[Controlling supercells granularity](#controlling-supercells-granularity) 
section below.

### Supercell expression matrix

The `supercell_expression_matrix` contains the marker expression of each 
supercell.
These are calculated by taking the average of the marker expression of 
all the cells contained within a supercell.

```{r show_supercell_expr_matrix}
head(supercells$supercell_expression_matrix)
```

Therein, we will have the following columns:

1. All the markers we previously specified in the `markers_col` variable.
In this example, they are the arcsinh transformed markers in our toy data.
2. A column (`Sample` in this case) denoting which sample a supercell 
belongs to, (note the column name is the same as what is stored in `sample_col` 
variable).
3. The `SuperCellId` column denoting the unique ID of the supercell.

#### SuperCellId

Let's have a look at `SuperCellId`:

```{r show_supercell_ids}
head(unique(supercells$supercell_expression_matrix$SuperCellId))
```

Let's break down one of them, `SuperCell_1_Sample_Sample_1`.
`SuperCell_1` is a numbering (1 to however many supercells there are in 
a sample) used to uniquely identify each supercell in a sample.
Notably, you may encounter this (`SuperCell_1`, `SuperCell_2`) being repeated
across different samples, e.g.,

```{r show_supercell_1_ids}
supercell_ids <- unique(supercells$supercell_expression_matrix$SuperCellId)
supercell_ids[grep("SuperCell_1_", supercell_ids)]
```

While these 3 supercells' id are pre-fixed with `SuperCell_1`, it does 
not make them equal to one another!
`SuperCell_1_Sample_Sample_1` will only contain cells from `Sample_1` while
`SuperCell_1_Sample_Sample_2` will only contain cells from `Sample_2`.

By now, you may have noticed that we appended the sample name into each
supercell id.
This aids in differentiating the supercells in different samples.

### Supercell cell map

`supercell_cell_map` maps each cell in our dataset to the supercell it 
belongs to.

```{r show_supercell_cell_map}
head(supercells$supercell_cell_map)
```

This map is very useful if we later need to expand the supercells out.
Additionally, this is also the reason why we need to have a column in 
the dataset which uniquely identify each cell.

## Running `runSuperCellCyto` in parallel

By default, `runSuperCellCyto` will process each sample one after the other.
As each sample is processed independent of one another, strictly speaking, we 
can process all of them in parallel.

To do this, we need to:

1. Create a `BiocParallelParam` object from the BiocParallel package.
This object can either be of type `MulticoreParam`or `SnowParam`.
We highly recommend consulting their vignette for more information.
2. Set the number of tasks for the `BiocParallelParam` object to the number of
samples we have in the dataset.
3. Set the `load_balancing` parameter for `runSuperCellCyto` function to TRUE.
This is to ensure even distribution of the supercell creation jobs.
As each sample will be processed by a parallel job, we don't want a job that
processs large sample to also be assigned other smaller samples if possible.
If you want to know more how this feature works, please refer to our manuscript.

```{r run_supercellcyto_parallel}
supercell_par <- runSuperCellCyto(
    dt = dat,
    markers = marker_cols_asinh,
    sample_colname = sample_col,
    cell_id_colname = cell_id_col,
    BPPARAM = MulticoreParam(tasks = n_samples),
    load_balancing = TRUE
)
```

## Controlling supercells granularity

This is described in the `runSuperCellCyto` function's documentation, but let's
briefly go through it here.

The `runSuperCellCyto` function is equipped with various parameters which 
can be customised to alter the composition of the supercells.
The one that is very likely to be used the most is the gamma parameter, 
denoted as `gam` in the function.
By default, the value for `gam` is set to 20, which we found work well for 
most cases.

The gamma parameter controls how many supercells to generate, and
indirectly, how many cells are captured within each supercell.
This parameter is resolved into the following formula 
`gamma=n_cells/n_supercells` where `n_cell` denotes the number of cells and 
`n_supercells` denotes the number of supercells.

In general, the larger gamma parameter is set to, the less supercells we 
will get.
Say for instance we have 10,000 cells.
If gamma is set to 10, we will end up with about 1,000 supercells, whereas
if gamma is set to 50, we will end up with about 200 supercells.

You may have noticed, after reading the sections above, `runSuperCellCyto`
is ran on each sample independent of each other, and that we can only set 
1 value as the gamma parameter.
Indeed, for now, the same gamma value will be used across all samples, 
and that depending on how many cells we have in each sample, we will end up 
with different number of supercells for each sample.
For instance, say we have 10,000 cells for sample 1, and 100,000 cells for 
sample 2.
If gamma is set to 10, for sample 1, we will get 1,000 supercells (10,000/10)
while for sample 2, we will get 10,000 supercells (100,000/10).

*Do note*: whatever gamma value you chose, you should not expect each 
supercell to contain exactly the same number of cells.
This behaviour is intentional to ensure rare cell types are not intermixed with
non-rare cell types in a supercell.

### Adjusting gamma value after one run of runSuperCellCyto

If you have run `runSuperCellCyto` once and have not discarded the 
SuperCell object it generated (no serious, please don't!), 
you can use the object to **quickly**
regenerate supercells using different gamma values.

As an example, using the SuperCell object we have generated for our 
toy dataset, we will regenerate the supercells using gamma of 10 and 50.
The function to do this is `recomputeSupercells`.
We will store the output in a list, one element per gamma value.

```{r recompute_supercells}
addt_gamma_vals <- c(10, 50)
supercells_addt_gamma <- lapply(addt_gamma_vals, function(gam) {
    recomputeSupercells(
        dt = dat,
        sc_objects = supercells$supercell_object,
        markers = marker_cols_asinh,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        gam = gam
    )
})
```

We should end up with a list containing 2 elements.
The 1st element contains supercells generated using gamma = 10,
and the 2nd contains supercells generated using gamma = 50.

```{r show_supercells_gamma10}
supercells_addt_gamma[[1]]
```

The output generated by `recomputeSupercells` is essentially a list:

1. `supercell_expression_matrix`: A data.table object that contains the marker 
expression for each supercell.
2. `supercell_cell_map`: A data.table that maps each cell to its 
corresponding supercell.

As mentioned before, gamma dictates the granularity of supercells.
Compared to the previous run where gamma was set to 20, we should get more 
supercells for gamma = 10, and less for gamma = 50.
Let's see if that's the case.

```{r count_supercells}
n_supercells_gamma20 <- nrow(supercells$supercell_expression_matrix)
n_supercells_gamma10 <- nrow(
    supercells_addt_gamma[[1]]$supercell_expression_matrix
)
n_supercells_gamma50 <- nrow(
    supercells_addt_gamma[[2]]$supercell_expression_matrix
)
```

```{r gamma10_gt_gamma20}
n_supercells_gamma10 > n_supercells_gamma20
```

```{r gamma50_lt_gamma20}
n_supercells_gamma50 < n_supercells_gamma20
```

### Specifying different gamma value for different samples

In the future, we may add the ability to specify different `gam` 
value for different samples.
For now, if we want to do this, we will need to break down our data 
into multiple `data.table` objects, each containing data from 1 sample, 
and run `runSuperCellCyto`
function on each of them with different `gam` parameter value.
Something like the following:

```{r diff_gamma_per_sample}
n_markers <- 10
dat <- simCytoData(nmarkers = n_markers)
markers_col <- paste0("Marker_", seq_len(n_markers))
sample_col <- "Sample"
cell_id_col <- "Cell_Id"

samples <- unique(dat[[sample_col]])
gam_values <- c(10, 20, 10)

supercells_diff_gam <- lapply(seq_len(length(samples)), function(i) {
    sample <- samples[i]
    gam <- gam_values[i]
    dat_samp <- dat[dat$Sample == sample, ]
    supercell_samp <- runSuperCellCyto(
        dt = dat_samp,
        markers = markers_col,
        sample_colname = sample_col,
        cell_id_colname = cell_id_col,
        gam = gam
    )
    return(supercell_samp)
})
```

Subsequently, to extract and combine the `supercell_expression_matrix` and
`supercell_cell_map`, we will need to use `rbind`:

```{r combine_supercell_results}
supercell_expression_matrix <- do.call(
    "rbind", lapply(
        supercells_diff_gam, function(x) x[["supercell_expression_matrix"]]
    )
)

supercell_cell_map <- do.call(
    "rbind", lapply(
        supercells_diff_gam, function(x) x[["supercell_cell_map"]]
    )
)
```

```{r show_combined_expr_matrix}
rbind(
    head(supercell_expression_matrix, n = 3),
    tail(supercell_expression_matrix, n = 3)
)
```

```{r show_combined_cell_map}
rbind(head(supercell_cell_map, n = 3), tail(supercell_cell_map, n = 3))
```

## Mixing cells from different samples in a supercell

If for whatever reason you don't mind (or perhaps more to the point want) 
each supercell to contain cells from different biological samples, 
you still need to have the sample column in your `data.table`.
However, what you need to do is essentially set the value in the column 
to exactly one ***unique*** value.
That way, SuperCellCyto will treat all cells as coming from one sample.

Just note, the parallel processing feature in SuperCellCyto won't work for this
as you will essentially only have 1 sample and nothing for SuperCellCyto 
to parallelise.

## I have more cells than RAM in my computer

Is your dataset so huge that you are constantly running out of RAM when 
generating supercells?
This thing happens and we have a solution for it.

Since supercells are generated for each sample independent of others you can 
easily break up the process.
For example:

1. Load up a subset of the samples (say 1-10).
2. Generate supercells for those samples.
3. Save the output using the 
[qs package](https://cran.r-project.org/web/packages/qs/index.html).
4. Extract the `supercell_expression_matrix` and `supercell_cell_map`, 
and export them out as a csv file using `data.table`'s `fwrite` function.
5. Load another sets of samples (say 11-20), rinse and repeat step 2-4.

Once you have processed all the samples, you can then load all 
`supercell_expression_matrix` and `supercell_cell_map` csv files and 
analyse them.

If you want to regenerate the supercells using different gamma values, 
load the relevant output saved using the qs package and the relevant data 
(remember to note which output belongs to which sets of samples!), and run 
`recomputeSupercells` function.

## Session information
```{r session_info}
sessionInfo()
```