---
title: "How to Prepare Data for SuperCellCyto"
author: "Givanna Putri"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{how_to_prepare_data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

## Performing Quality Control

Prior to creating supercells, it's crucial to ensure that your dataset has
undergone thorough quality control (QC).
We want to retain only single, live cells and remove any debris, 
doublets, or dead cells.
It is also important to perform compensation to correct for 
fluorescence spillover (for Flow data) or to adjust for signal overlap or 
spillover between different metal isotopes (for Cytof data).
A well-prepared dataset is key to obtaining reliable supercells 
from SuperCellCyto.

Several R packages are available for performing QC on cytometry data.
Notable among these are 
[PeacoQC](https://onlinelibrary.wiley.com/doi/10.1002/cyto.a.24501),
[CATALYST](https://bioconductor.org/packages/release/bioc/html/CATALYST.html),
and [CytoExploreR](https://dillonhammill.github.io/CytoExploreR/).
These packages are well maintained and are continuously updated.
To make sure that the information we provide does not quickly go out of date, 
we highly recommend consulting the packages' respective vignettes for 
detailed guidance on how to use them to QC your data.

In our manuscript, we used `CytoExploreR` to QC the `Oetjen_bcell` 
flow cytometry data and `CATALYST` to QC the `Trussart_cytofruv` Cytof data.

The specific scripts used can be found on 
[GitHub](https://github.com/phipsonlab/SuperCellCyto-analysis/tree/master/code):

1. `b_cell_identification/gate_flow_data.R` for `Oetjen_bcell` data.
2. `batch_correction/prepare_data.R` for `Trussart_cytofruv` data.
These scripts were adapted from those used in the 
[CytofRUV manuscript](https://elifesciences.org/articles/59630).

For the `Oetjen_bcell` data, we used the following gating strategy after
compensation:

1. FSC-H and FSC-A to isolate only the single events. 
(Also check SSC-H vs SSC-A).
2. FSC-A and SSC-A to remove debris.
3. Live/Dead and SSC-A to isolate live cells.

The following figure shows the resulting single live cells, manually gated, 
for the `Oetjen_bcell` data.

```{r add_fig}
knitr::include_graphics(
    "figures/oetjen_bcell_single_live_cells.png", 
    error = FALSE
)
```

After completing the QC process, you will have clean data in either CSV or 
FCS file formats.
The next section will guide you on how to load these files and proceed with 
preparing your data for SuperCellCyto.

## Preparing FCS/CSV files for SuperCellCyto

To use SuperCellCyto, your input data must be formatted as a
[data.table](https://cran.r-project.org/web/packages/data.table/index.html)
object.
Briefly, a `data.table` is an enhanced version of R's native `data.frame`
object.
It is provided by the `data.table` package, which offers fast processing of
large tabular data.
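
As a minimal sketch of how the two relate, the chunk below converts a small,
made-up `data.frame` into a `data.table`.
The object and column names (`toy_df`, `toy_dt`, `CD3`, `CD19`) are purely
illustrative and do not come from any dataset used in this vignette.

```{r toy_data_table}
library(data.table)

# A made-up data.frame with two marker columns (illustrative values only)
toy_df <- data.frame(
    CD3 = c(0.5, 12.1, 3.3),
    CD19 = c(8.2, 0.1, 0.4)
)

# as.data.table() returns a data.table copy and leaves toy_df untouched
toy_dt <- as.data.table(toy_df)

# A data.table is still a data.frame, so data.frame code keeps working
class(toy_dt)
is.data.frame(toy_dt)
```

`setDT()` performs the same conversion in place, without copying, which can
matter for very large datasets.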

### Cell ID column

Each cell in your `data.table` must also have a unique identifier.
The purpose of this ID is to allow SuperCell to uniquely identify each cell 
in the dataset.
It will come in handy later if we need to work out which cells
belong to which supercells, i.e., when we need to expand the supercells out.
Generally, we will need to create this ID ourselves, as most datasets do not
come with one already embedded.

For this tutorial, we will call the column that denotes the cell ID *cell_id*.
For your own dataset, you can name this column however you like, e.g., 
id, cell_identity, etc.
Just make sure you note the column name as we will need it later to 
create supercells.
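
A minimal sketch of creating and checking such an ID on made-up data is shown
below.
The table `toy_cells` and its values are hypothetical; only the `cell_id` 
column name follows the convention used in this tutorial.

```{r toy_cell_id}
library(data.table)

# Four made-up cells with two marker columns
toy_cells <- data.table(
    CD3 = c(0.5, 12.1, 3.3, 7.8),
    CD19 = c(8.2, 0.1, 0.4, 2.6)
)

# .N is the number of rows, so this numbers the cells 1, 2, 3, ...
toy_cells[, cell_id := paste0("Cell_", seq_len(.N))]

# Confirm that no ID is repeated
anyDuplicated(toy_cells$cell_id) == 0
```

Any scheme works as long as the IDs are unique across the whole dataset, not
just within one sample.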

### Sample column

Lastly, each cell in the `data.table` object must also be associated 
with a sample.
This information must be stored in a column that we later on pass to the 
function that creates supercells.
Here, sample typically refers to the biological sample the 
cell came from.

To create supercells, it is necessary to have this column in our dataset.
This is to ensure that each supercell will only have cells from exactly 
one sample.
In most cases, it does not make sense to mix cells from different biological 
samples in one supercell.
Less importantly, SuperCellCyto can process multiple samples in parallel,
and to do that it needs to know which sample each cell belongs to.

But what if we only have 1 biological sample in our dataset?
It does not matter.
We still need to have the sample column in our dataset.
The only difference is that this column will only have 1 unique value.

You can name the column however you like, e.g., Samp, Cell_Samp, etc.
For this tutorial, we will call the column *sample*.
Just make sure you note the column name as we will need it later to 
create supercells.
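
For the single-sample case described above, a minimal sketch is to fill the
column with one constant value.
The table `toy_single` and the label `"sample_1"` are made up for
illustration.

```{r toy_single_sample}
library(data.table)

# Three made-up cells that all come from one sample
toy_single <- data.table(
    CD3 = c(0.5, 12.1, 3.3),
    CD19 = c(8.2, 0.1, 0.4)
)

# Assign the same (hypothetical) sample label to every cell
toy_single[, sample := "sample_1"]

# The column should hold exactly one unique value and no missing values
uniqueN(toy_single$sample)
anyNA(toy_single$sample)
```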

### Preparing CSV files

Loading CSV files into a `data.table` object is straightforward.
We can use the `fread` function from the `data.table` package.

For this example, let's load two CSV files containing subsampled data from the
`Levine_32dim` dataset we used in the SuperCellCyto manuscript.
Each file represents a sample (H1 and H2), with the sample name included 
in the file name:

```{r load_csv_data}
library(data.table)

csv_files <- system.file(
    "extdata",
    c("Levine_32dim_H1_sub.csv", "Levine_32dim_H2_sub.csv"),
    package = "SuperCellCyto"
)

samples <- c("H1", "H2")

dat <- lapply(seq_along(samples), function(i) {
    csv_file <- csv_files[i]
    sample <- samples[i]

    dat_a_sample <- fread(csv_file)
    dat_a_sample$sample <- sample

    return(dat_a_sample)
})
dat <- rbindlist(dat)

dat[, cell_id := paste0("Cell_", seq_len(nrow(dat)))]

head(dat)
```

Let's break down what we have done.

We specify the location of the csv files in the `csv_files` vector
and their corresponding sample names in the `samples` vector.
`Levine_32dim_H1_sub.csv` belongs to sample H1 while 
`Levine_32dim_H2_sub.csv` belongs to sample H2.

We use `lapply` to iterate over the indices of the `csv_files` and `samples` 
vectors, so that each csv file is paired with its sample name.
For each csv file and the corresponding sample, we read the csv file 
into the variable `dat_a_sample` using the `fread` function.
We then store the sample name in a new column called `sample`.
As a result, we get a list `dat` containing 2 `data.table` objects, 
1 object per csv file.

We use the `rbindlist` function from the `data.table` package to merge the 
list into one `data.table` object.

We create a new column `cell_id` which gives each cell a unique id such as 
`Cell_1`, `Cell_2`, etc.
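
As an alternative sketch, the sample names can be attached to the file paths
as names and turned into a column by the `idcol` argument of `rbindlist`.
The chunk below writes two tiny made-up CSV files to a temporary directory so
that it is self-contained; the file names, sample names (`A`, `B`), and values
are all hypothetical.

```{r toy_rbindlist_idcol}
library(data.table)

# Write two tiny made-up CSV files to a temporary directory
toy_dir <- tempfile("toy_csv_")
dir.create(toy_dir)
toy_files <- file.path(toy_dir, c("toy_A.csv", "toy_B.csv"))
fwrite(data.table(CD3 = c(0.5, 12.1), CD19 = c(8.2, 0.1)), toy_files[1])
fwrite(
    data.table(CD3 = c(3.3, 7.8, 1.2), CD19 = c(0.4, 2.6, 5.0)),
    toy_files[2]
)

# Name each file by its sample so the names travel with the data
names(toy_files) <- c("A", "B")

# lapply keeps the names, and idcol turns them into a "sample" column
toy_dat <- rbindlist(lapply(toy_files, fread), idcol = "sample")
toy_dat[, cell_id := paste0("Cell_", seq_len(.N))]

toy_dat
```

This avoids indexing two vectors by position, at the cost of relying on the
names of the list passed to `rbindlist`.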

### Preparing FCS files

FCS files, commonly used in cytometry, require specific handling.
You can read in FCS files using the `flowCore` package available from 
Bioconductor and convert them to a `data.table` object.

Let's load two small FCS files for the Anti-PD1 data from
[FlowRepository](
http://flowrepository.org/public_experiment_representations/1124).

```{r load_fcs_data}
library(flowCore)
library(data.table)

fs <- read.flowSet(
    path = system.file(
        "extdata",
        package = "SuperCellCyto"
    ),
    pattern = "\\.fcs$"
)

dat_list <- lapply(seq_along(fs), function(i) {
    df <- as.data.table(exprs(fs[[i]]))

    # use the marker names as column names
    # (this assumes every channel has a marker name)
    names(df) <- markernames(fs[[i]])

    # add a column showing the filename
    df$file_name <- sampleNames(fs)[i]

    return(df)
})

# collate all the files into one
dat <- rbindlist(dat_list)

dat
```

The code above used `flowCore`'s `read.flowSet` function to first 
read FCS files into a `flowSet` object.

`lapply` and `rbindlist` are then used to convert it to one 
`data.table` object containing data from all FCS files.

The FCS files belong to two different patients, patient 9 and patient 15.
We shall use the patient as the sample ID.
To make sure that we correctly map the file names to the patients,
we will first create a new `data.table` object containing the mapping between
`file_name` and the sample name, and then use `merge.data.table` to add 
the sample name to our `data.table` object.

We will also create a new column `cell_id` which gives each cell a 
unique id such as `Cell_1`, `Cell_2`, etc.

```{r add_sample_and_cellid}
sample_info <- data.table(
    sample = c("patient9", "patient15"),
    file_name = c(
        "Data23_Panel3_base_NR4_Patient9.fcs",
        "Data23_Panel3_base_R5_Patient15.fcs"
    )
)
dat <- merge.data.table(
    x = dat,
    y = sample_info,
    by = "file_name"
)

dat[, cell_id := paste0("Cell_", seq_len(nrow(dat)))]
```
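
By default, `merge.data.table` keeps only the rows whose key appears in both
tables, so a misspelled file name in the mapping would silently drop every
cell from that file.
The sketch below, on made-up data, shows one way to guard against this by
keeping all cells with `all.x = TRUE` and then looking for missing sample
names.
The tables `toy_events` and `toy_map` and their file names are hypothetical.

```{r toy_merge_check}
library(data.table)

# Three made-up cells from two made-up files
toy_events <- data.table(
    CD3 = c(0.5, 12.1, 3.3),
    file_name = c("toy_A.fcs", "toy_A.fcs", "toy_B.fcs")
)

# The mapping deliberately omits toy_B.fcs to mimic a typo or missing entry
toy_map <- data.table(sample = "A", file_name = "toy_A.fcs")

# all.x = TRUE keeps every cell; unmatched cells get NA in the sample column
toy_merged <- merge.data.table(
    x = toy_events,
    y = toy_map,
    by = "file_name",
    all.x = TRUE
)

# List the file names that did not receive a sample name
unique(toy_merged[is.na(sample), file_name])
```

Note also that `merge.data.table` sorts the result by the key column by
default, so the row order can change; creating `cell_id` after the merge, as
done above, keeps the IDs aligned with the final row order.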

With CSV and FCS files loaded as `data.table` objects, the next step is 
to transform the data appropriately for SuperCellCyto.

## Data Transformation

Before using SuperCellCyto, it's essential to apply appropriate 
data transformations.

A common method for data transformation in cytometry is the arcsinh 
transformation, i.e., the [inverse hyperbolic sine](
https://mathworld.wolfram.com/InverseHyperbolicSine.html) of each value 
divided by a cofactor.
The transformation requires specifying a cofactor, which affects the 
representation of the low-end data.
Typically, a cofactor of 5 is used for Cytof data and 150 for Flow data.
This vignette will focus on the transformation process rather than 
cofactor selection.
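
To give a feel for what the cofactor does, the sketch below applies the two
cofactors mentioned above to a handful of made-up raw intensities.
The values in `toy_raw` are illustrative only and are not taken from any
dataset.

```{r toy_cofactor}
# Made-up raw intensities spanning several orders of magnitude
toy_raw <- c(0, 1, 10, 100, 1000, 10000)

# The same values transformed with the two cofactors mentioned above
toy_cf5 <- asinh(toy_raw / 5)
toy_cf150 <- asinh(toy_raw / 150)

data.frame(raw = toy_raw, cofactor_5 = toy_cf5, cofactor_150 = toy_cf150)
```

Values well below the cofactor are scaled roughly linearly, since `asinh(z)` 
is close to `z` near zero, while values well above it are compressed roughly 
logarithmically, since `asinh(z)` approaches `log(2 * z)` for large `z`.
A larger cofactor therefore widens the near-linear region at the low end.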

We'll use the `Levine_32dim` dataset loaded earlier from CSV files.

First, we need to select the markers to be transformed.
Usually, all markers should be transformed for SuperCellCyto.
However, you can choose to exclude specific markers if needed:

```{r define_markers}
markers <- c(
    "209Bi_CD11b", "162Dy_CD11c", "163Dy_CD7", "166Er_CD209", "167Er_CD38",
    "151Eu_CD123", "153Eu_CD62L", "152Gd_CD66b", "154Gd_ICAM-1", "155Gd_CD1c",
    "156Gd_CD86", "160Gd_CD14", "165Ho_CD16", "191Ir_DNA1", "193Ir_DNA2",
    "175Lu_PD-L1", "142Nd_CD19", "146Nd_CD64", "195Pt", "196Pt",
    "198Pt_Dead", "147Sm_CD303", "148Sm_CD34", "149Sm_CD141", "150Sm_CD61",
    "169Tm_CD33", "89Y_CD45", "170Yb_CD3", "173Yb_CD56", "174Yb_HLA-DR"
)
```

For transformation, we'll use a cofactor of 5 and apply the 
arcsinh transformation.

```{r arcsinh_transformation}
new_cols <- paste0(markers, "_asinh")
cf <- 5
dat[, (new_cols) := lapply(.SD, function(x) asinh(x / cf)), .SDcols = markers]
```

After transformation, the transformed values are stored in new columns whose 
names have "_asinh" appended, while the original columns are left unchanged.

```{r}
head(dat)
```
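
Should only the transformed values be at hand later on, the transformation can
be reversed with `sinh`.
The sketch below demonstrates this on made-up data; `toy_expr`, 
`toy_markers`, and `toy_cofactor` are hypothetical names chosen so as not to
interfere with the objects created above.

```{r toy_back_transform}
library(data.table)

toy_cofactor <- 5
toy_expr <- data.table(CD3 = c(0, 2.5, 40), CD19 = c(1, 15, 300))
toy_markers <- c("CD3", "CD19")
toy_new_cols <- paste0(toy_markers, "_asinh")

# Forward transformation, same pattern as in the chunk above
toy_expr[
    , (toy_new_cols) := lapply(.SD, function(x) asinh(x / toy_cofactor)),
    .SDcols = toy_markers
]

# sinh() is the inverse of asinh(), so multiplying by the cofactor
# recovers the raw values up to floating point error
toy_back <- toy_expr[
    , lapply(.SD, function(x) sinh(x) * toy_cofactor),
    .SDcols = toy_new_cols
]
all.equal(toy_back$CD3_asinh, toy_expr$CD3)
```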

With your data now transformed, you're ready to create supercells 
using SuperCellCyto.
Please refer to the
[How to create supercells](SuperCellCyto.html) vignette 
for detailed instructions.

## Session information
```{r session_info}
sessionInfo()
```