---
title: "Tutorial: Running the pipeline"
output: 
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Tutorial: Running the pipeline}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.height = 6,
  fig.width = 8,
  fig.align = "center"
)
```


# Introduction

staRgate is an automated gating pipeline to process and analyze flow cytometry data to characterize the lineage, differentiation and functional states of T-cells. 

This pipeline is designed to mimic the manual gating strategy of defining flow biomarker positive populations relative to a unimodal background population to include cells with varying intensities of marker expression. This is achieved via estimating the kernel density of the intensity distribution and corresponding derivatives. This pipeline integrates the density gating method in conjuction with some pre-processing steps achieved via the R package `r BiocStyle::Biocpkg("openCyto")` and an optional step with `r BiocStyle::Biocpkg("flowAI")`. The flow data is stored within R as a `GatingSet` object, which makes it easily transferable to other flow cytometry workflows available on BioConductor.

This vignette will walk through how to run the {staRgate} pipeline starting from importing an flow cytometry standard (FCS) file into R to preprocessing and gating, as well as identifying T-cell subpopulations for downstream analysis. This pipeline returns results at the single-cell level as well as the summarized sample-level data (for the percentages of positively expressing cells when identifying the subpopulations).

For illustration purposes, the example FCS file used in this vignette and stored in the package data is a concatenated file limited to the first 30k events/cells acquired to reduce the run time and file size. 

After running {staRgate} to gate flow cytometry data, it is recommended to perform some quality checks (QC) on the gate placements to ensure they are reasonable. We suggest to use ridgeplots in addition to the `ggcyto::autoplot` to visualize the density distributions per marker across samples. When examining a large batch of samples, downsampling, such as to a random sample of 10k CD3+ cells, will make the QC process more manageable. In addition, random spot checks of a few samples would also be helpful QC to detect any edge cases.

Currently in this tutorial, we do not extend to the QC steps and suggest to lean on other examples for how to put together a ridgeplot for example. In the near future, we hope to incorporate some examples for the additional QC steps as well, stay tuned!

# Installation

The {staRgate} package relies on a few Biocondcutor R packages.
Before installing {staRgate}, first setup Bioconductor and install all dependencies. 

[Follow instructions here for installing Bioconductor](https://www.bioconductor.org/install/)

The essential dependencies for running the {staRgate} package include: `r BiocStyle::Biocpkg("flowCore")`, and `r BiocStyle::Biocpkg("flowWorkspace")`.

The following packages are needed to fully run the pipeline as shown in this Tutorial, but not required for the {staRgate} code chunks to run: `r BiocStyle::Biocpkg("openCyto")`, `r BiocStyle::CRANpkg("ggplot2")`, `r BiocStyle::Biocpkg("ggcyto")`, `r BiocStyle::CRANpkg("gt")`


```{r setup}
# Load libraries
library(staRgate)
library(openCyto)
library(flowWorkspace)
library(flowCore)
# Just for plotting in the vignette
library(ggplot2)
library(ggcyto)

# Set up dynamic variables
pt_samp_nm <- "flow_sample_1"

## File path to the FCS file
path_fcs <- system.file("extdata", "example_fcs.fcs", package = "staRgate", mustWork = TRUE)

## File path to the compensation matrix csv file
## Expect format to match flowJo exported version
path_comp_mat <- system.file("extdata", "comp_mat_example_fcs.csv", package = "staRgate", mustWork = TRUE)

## File path for outputs/saving
# Maybe not the best sol, but create a temp dir?
path_out <- tempdir()
# Print the path_out for user to see
path_out

## File path Gating template
gtFile <- system.file("extdata", "gating_template_x50_tcell.csv", package = "staRgate", mustWork = TRUE)

## File path to biexp parameters
## Expects 4 columns: full_name, ext_neg_dec, width_basis, positive_dec
## full name should contain the channel/dye name
# 3 remaining cols fill in with desired parameter values
path_biexp_params <- system.file("extdata", "biexp_transf_parameters_x50.csv", package = "staRgate", mustWork = TRUE)

## File path to positive peak thresholds
path_pos_peak_thresholds <- system.file("extdata", "pos_peak_thresholds.csv", package = "staRgate", mustWork = TRUE)
```

```{r, echo = FALSE}
# Some preferred ggplot settings.
# Not required/relevant for gating

plot_font_size <- 11

ggplot2::theme_set(ggplot2::theme_bw() +
  ggplot2::theme(
    text = ggplot2::element_text(size = plot_font_size),
    axis.text = ggplot2::element_text(color = "black"),
    legend.position = "bottom"
  ))
```

# Input files and function parameters

In order to run the pipeline, the user must have the data in flow cytometry standard (FCS) format. This is usually the output from flowJo. 
All of these input files except the `bin size` are expected to be comma-separated values (csv) files. Please see the `inst` folder of the package for examples of the formats.

+ Compensation matrix from manual gating-
    + Matrix where the column and row names correspond to the channel names, cell values correspond to the spillover correction to be applied.
    + This can be exported as a csv in flowJo's options
+ Biexponential transformation parameters-
    + This is a table specifying the parameters (negative decades, width basis and positive decades) to be applied to the listed channels
+ Gating template
    + A gating template is required to run the pre-gating via `r BiocStyle::Biocpkg("openCyto")`
    + The package includes a gating template tailored for gating this panel of T-cell markers. 
    + For examples of how to modify the gating template, please refer to the [openCyto documentation](https://www.bioconductor.org/packages/devel/bioc/vignettes/openCyto/inst/doc/HowToWriteCSVTemplate.html).
+ Bin size 
    + From a systematic grid search, we have found bin sizes of 40 to 50 works well for the density gating

A few things to keep in mind when debugging/iterating through gating: 

+ If saving at the same path with same name (i.e., rerunning the same code), the GatingSet folder from the `flowWorkspace::save_gs()` command needs to be deleted for `r BiocStyle::Biocpkg("openCyto")` to save again, otherwise, will encounter an error related to an invalid path from the `flowWorkspace::save_gs()` function


# Example

Below is an example of gating 1 FCS sample.

## Import FCS

```{r}
# Read in gating template
dtTemplate <- data.table::fread(gtFile)

# Load the FCS
gt_tcell <- openCyto::gatingTemplate(gtFile)

cs <- flowWorkspace::load_cytoset_from_fcs(path_fcs)

# Create a GatingSet of 1 sample
gs <- flowWorkspace::GatingSet(cs)

# Check- how many cells is in the FCS file?
n_root <- flowWorkspace::gh_pop_get_count(gs, "root")

# The example FCS has 30000 cells
n_root
```

## Compensation

```{r}
# Apply comp
gs <- getCompGS(gs, path_comp_mat = path_comp_mat)

# Can check that the comp was applied
chk_cm <- flowWorkspace::gh_get_compensations(gs)

# Not aware of an accessor that we can use for this
head(methods::slot(chk_cm, "spillover"), 2)
```

## Transformation

The transformation applied to all channels is the same: biexponential with `extra negative decades = 0.5`, `positive decades = 4.5` and `width basis = -30`

The structure of the table of parameters (as .csv format) should be first column for the flurochrome names corresponding to the panel, followed by the parameters. 

```{r}
tbl_biexp_params <-
  utils::read.csv(path_biexp_params) |>
  janitor::clean_names(case="all_caps")

head(tbl_biexp_params, 2)
```

Currently, the package only supports biexponetial transformation for all channels with the `getBiexpTransformGS()` function. However, the user may choose to create a transformation list explicitly if other transformations (e.g., archsin) are desired.

Note that the `r BiocStyle::Biocpkg("flowWorkspace")` package also allows for an automated transformation calculation "guessing" appropriate parameters. We chose to explicitly specify the biexponential transformation with fixed parameters for all channels to match the manual gating strategy used in flowJo for a more direct comparison of {staRgate} when benchmarking against the manual gating results.

```{r}
# Save the pre-transformed data to compare ranges 
# And check that transformation was applied
dat_pre_transform <-
  flowWorkspace::gh_pop_get_data(gs) |>
  flowCore::exprs()

# Apply biexp trans
gs <- getBiexpTransformGS(gs, path_biexp_params = path_biexp_params)

## **Optional**-- to check what pre-transformed data against post
# save the post-transformed data
dat_post_transform <-
  flowWorkspace::gh_pop_get_data(gs) |>
  flowCore::exprs()

## **Optional**-- to check that the transformation worked on all provided channels!
## Commented out for ease of length 
# summary(dat_pre_transform)
# summary(dat_post_transform)
```

## Pre-gating

In this context, pre-gating is defined as gating from the root population (all cells acquired) up to key parent populations: CD3+, or CD4+/CD8+ subsets. Then we will gate each marker indpendently. 

The `r BiocStyle::Biocpkg("flowAI")` step serves as a quality control (QC) to match the first Time gate step that is typically done in  manual gating. It is possible, however, that the user may choose to skip this step if `r BiocStyle::Biocpkg("flowAI")` excludes too many cells. 

The first step of the gating template is a QC step that is especially important to include if the user chooses to exclude the `r BiocStyle::Biocpkg("flowAI")` step.

In this tutorial, we will skip the `r BiocStyle::Biocpkg("flowAI")` step to ease the length.

```{r}
# Pre-gating up to CD4/8+ with `r BiocStyle::Biocpkg("openCyto")`
## Set seed using today's date
set.seed(glue::glue({
  format(Sys.Date(), format = "%Y%m%d")
}))

openCyto::gt_gating(gt_tcell, gs)
```


```{r fig.height = 10, fig.width = 8, warning = FALSE, message = FALSE}
## Check autoplot
ggcyto::autoplot(gs[[1]])
```


## Extract intensity matrix

Grab the channel and marker names in the `gs` object

+ When extracting `intensity_matrix`, it is labeled with the channel names rather than the marker names.
+ The `marker_chnl_names` mapping created below will be used to rename the column names to the marker names, which will make calling the appropriate columns easier when analyzing the data

```{r}
## Grab marker names from GatingSet for labeling col names in intensity matrix
## Can skip this step if you know the names of the channels that correspond to your marker names in your FCS files
# In that case, supply strings for `chnl` and `marker_full` is fine. Such as:
# chnl = c("BV750-A", "BUV496-A")
# marker_full = c("CD3", "CD4)

# This returns a named character vector
# With the channel names as names and marker names as the values
marker_chnl_names <- flowWorkspace::markernames(flowWorkspace::gh_pop_get_data(gs))

## Specify which markers to gate based on individual density distributions
# For our Tcell panel, we only want to apply the density gating on
# these 23 markers
markers_to_gate = c("CD45RA", "ICOS", "CD25", "TIM3", 
                    "CD27", "CD57", "CXCR5", "CCR4", 
                    "CCR7", "HLADR", "CD28", "PD1", 
                    "LAG3", "CD127", "CD38", "TIGIT", 
                    "EOMES", "CTLA4", "FOX_P3", "GITR",
                    "TBET", "KI67", "GZM_B")

```

Next we grab the intensity values and indicators for the pre-gating steps. 

```{r}
# Extract intensity matrix from GatingSet object
## Grab the intensity matrix from GatingSet
intensity_dat <-
  cbind(
    # This grabs the intensity matrix with intensity values
    flowWorkspace::gh_pop_get_data(gs) |>
      flowCore::exprs(),
    # the gh_pop_get_indices grabs the 0/1 for whether gated as pos/neg 
    # for each step specified
    "fsc_ssc_qc" = flowWorkspace::gh_pop_get_indices(gs, y = "fsc_ssc_qc"),
    "nonDebris" = flowWorkspace::gh_pop_get_indices(gs, y = "nonDebris"),
    "singlets" = flowWorkspace::gh_pop_get_indices(gs, y = "singlets"),
    "cd14_neg_19_neg" = flowWorkspace::gh_pop_get_indices(gs, y = "cd14-cd19-"),
    "live" = flowWorkspace::gh_pop_get_indices(gs, y = "live"),
    "cd3_pos" = flowWorkspace::gh_pop_get_indices(gs, y = "cd3"),
    "cd4_pos" = flowWorkspace::gh_pop_get_indices(gs, y = "cd4+"),
    "cd8_pos" = flowWorkspace::gh_pop_get_indices(gs, y = "cd8+")
  ) |>
  # The intensity matrix is a matrix object. Convert to tibble.
  tibble::as_tibble() |>
  # Rename with colnames which are the channel names
  # to marker names because it's easier to call columns by markers
  # Rename using the marker_chnl_names we created above
  # But we need it flipped when supplying to dplyr::rename()
  # Where the names = value to rename to (marker names) 
  # The values of the vector = current names (Channel names)
  dplyr::rename(
    stats::setNames(names(marker_chnl_names),
                    # Clean up the marker names, make them all caps
                    janitor::make_clean_names(marker_chnl_names, 
                                              case = "all_caps", 
                                              replace = c("-" = "", "_" = "", " " = "")))
                ) |>
  dplyr::mutate(
    # Create a 4-level category for cd4, cd8 neg/pos
    # in order to calculate the percentages individually within these parent populations
    cd4_pos_cd8_pos = dplyr::case_when(
      cd3_pos == 1 & cd4_pos == 1 & cd8_pos == 1 ~ "cd4_pos_cd8_pos",
      cd3_pos == 1 & cd4_pos == 1 & cd8_pos == 0 ~ "cd4_pos_cd8_neg",
      cd3_pos == 1 & cd4_pos == 0 & cd8_pos == 1 ~ "cd4_neg_cd8_pos",
      cd3_pos == 1 & cd4_pos == 0 & cd8_pos == 0 ~ "cd4_neg_cd8_neg"
    )
  )

## Preview of intensity matrix
head(intensity_dat, 2)
```


## Gating on T-cell subsets


### Pseudo-negative control 

For this T-cell panel, we recommend using the CD3- cells as a pseudo-negative control when gating CD127 and CD28 as these two markers are expected to be predominantly negatively-expressing on CD3- cells. This mimics the isotype-based gating approach. Here we recommend using the 95th percentile of the CD3- distributions.

To ensure we only capture the CD3- cells, we first filter to the live cells then to `cd3_pos == 0`

```{r}
gates_pseudo_neg = 
  dplyr::filter(intensity_dat, live == 1, cd3_pos == 0) |>
  dplyr::select(CD127, CD28) |>
  dplyr::summarise(
    dplyr::across(c(CD127, CD28),
                  ~ quantile(.x, 0.95))
  ) 
```

### Empirical gating

For the other functional and differentiation markers where we cannot borrow CD3- as the pseudo-negative control, we gate empirically based on the CD3+ density distribution per marker. 

+ The suggested strategy is based on all CD3+, but this can be customized based on the string corresponding to the column name supplied to `subset_col` in the `get_density_gates` function
+ The suggested number of bins for density estimation is `40` as some level of smoothing is required to reduce the noise and picking up false peaks.
+ When multiple samples from the same batch or experiment run (e.g., samples are processed on the same day), we recommend to borrow information from other samples by pooling all CD3+ across the same batch before applying the density gating.
+ If interested in pooling across experiment runs, consider visualizing for any batch-level intensity shifts and variations before pooling samples together.

For illustration purposes, we will only apply density gating on a few markers.

```{r}
# Density gating parameters
# peak detection ratio where any peak < 1/10 of the tallest peak will be 
# considered as noise
peak_r <- 10 
# smoothing to apply to the density estimation
# Using the default of 512 creates many little bumps/noise that are artifacts 
# From a systematic grid search, we found the bin sizes of ~40-50 works best
bin_i <- 40

# Remove any very negative values that are artifacts from autogating
# -1000 on the biexp transformed scale corresponds to roughly -3300 on the original
# intensity scale so this is quite conservative.
neg_intensity_thres <- -1000

# select a few markers to gate
example_markers <- c("LAG3", "CCR7", "CD45RA")

# Read in positive peak thresholds
pos_thres <- utils::read.csv(path_pos_peak_thresholds) |>
  janitor::clean_names(case = "all_caps")
```

```{r}
# calculate the gates
dens_gates_pre <-
  dplyr::filter(intensity_dat, cd3_pos == 1) |>
  getDensityGates(
    intens_dat = _,
    marker = example_markers,
    subset_col = "cd3_pos",
    bin_n = bin_i,
    peak_detect_ratio = peak_r,
    pos_peak_threshold = pos_thres,
    neg_intensity_threshold = neg_intensity_thres
  )

# Since we apply density gating on CD3+ cells but
# Would like to calculate subpopulations with CD4+ and CD8+ as
# the starting parent population, we need to add corresponding rows
# to pass into getGatedDat()
dens_gates <-
  # Stack the pseudo-neg gated markers and empirically gated markers
  dplyr::bind_cols(dens_gates_pre, gates_pseudo_neg) |>
  tibble::add_row() |>
  tibble::add_row() |>
  tibble::add_row() |>
  dplyr::mutate(cd4_pos_cd8_pos = c("cd4_neg_cd8_neg", "cd4_pos_cd8_neg", "cd4_neg_cd8_pos", "cd4_pos_cd8_pos")) |>
  tidyr::fill(-cd4_pos_cd8_pos, .direction = "down")

# View updated gates with the col for CD4/CD8
dens_gates

# get indicator col
example_intensity_gated <-
  getGatedDat(
    # Only gate T-cells, which are CD3+ 
    dplyr::filter(intensity_dat, cd3_pos == 1),
    subset_col = "cd4_pos_cd8_pos",
    cutoffs = dens_gates
  )
```

## Visualizing the gates {.tabset .tabset-fade .unlisted .unnumbered}

### LAG3 for all CD3+ {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# Plot the gate for visual
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(LAG3)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = LAG3),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of LAG3 intensity on all CD3+. Gate identifed by {staRgate} in blue.")
```

### LAG3 by CD4 and CD8 subsets {.unlisted .unnumbered}


```{r, fig.width=8, fig.height=6}
# If by CD4/CD8,
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(LAG3)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = LAG3),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of LAG3 intensity by CD4/CD8 subsets. Gate identifed by {staRgate} in blue.") +
  facet_wrap(~cd4_pos_cd8_pos)

```

### CCR7 for all CD3+  {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# For CCR7
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CCR7)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CCR7),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CCR7 intensity on all CD3+. Gate identifed by {staRgate} in blue.")
```

### CCR7 by CD4 and CD8 subsets {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# If by CD4/CD8,
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CCR7)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CCR7),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CCR7 intensity for CD4/CD8 subsets. Gate identifed by {staRgate} in blue.") +
  facet_wrap(~cd4_pos_cd8_pos)
```


### CD45RA for all CD3+  {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# For CD45RA
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CD45RA)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CD45RA),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CD45RA intensity on all CD3+. Gate identifed by {staRgate} in blue.")
```

### CD45RA by CD4 and CD8 subsets {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# If by CD4/CD8,
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CD45RA)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CD45RA),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CD45RA intensity for CD4/CD8 subsets. Gate identifed by {staRgate} in blue.") +
  facet_wrap(~cd4_pos_cd8_pos)
```


### CD127 for all CD3+  {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# For CD127
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CD127)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CD127),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CD127 intensity on all CD3+. Gate identifed by {staRgate} in blue.")
```

### CD127 by CD4 and CD8 subsets {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# If by CD4/CD8,
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CD127)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CD127),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CD127 intensity for CD4/CD8 subsets. Gate identifed by {staRgate} in blue.") +
  facet_wrap(~cd4_pos_cd8_pos)
```

### CD28 for all CD3+  {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# For CD28
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CD28)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CD28),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CD28 intensity on all CD3+. Gate identifed by {staRgate} in blue.")
```

### CD28 by CD4 and CD8 subsets {.unlisted .unnumbered}

```{r, fig.width=8, fig.height=6}
# If by CD4/CD8,
intensity_dat |>
  dplyr::filter(cd3_pos == 1) |>
  # additional step to remove large intensity values only when density gating.
  # Still kept in the data
  dplyr::filter(!(dplyr::if_any(dplyr::all_of(markers_to_gate), ~ .x < neg_intensity_thres))) |>
  ggplot() +
  geom_density(aes(CD28)) +
  geom_vline(
    data = dens_gates,
    aes(xintercept = CD28),
    color = "blue",
    linetype = "dashed"
  ) +
  labs(subtitle = "Distribution of CD28 intensity for CD4/CD8 subsets. Gate identifed by {staRgate} in blue.") +
  facet_wrap(~cd4_pos_cd8_pos)
```

## {.unlisted .unnumbered}

## Getting percentage data

```{r, echo = FALSE}
panel_n <- 23
panel_n_d <- 2
```

We can then summarize the single-cell level data to counts and percentages of cells for all combinations of markers. 

For the subpopulations, the `denominator` is defined as the parent population and 
`numerator` is the population of interest out of the parent population. For example, the subpopulation CD4+ of CD3+ cells correspond to the CD4+ as the `numerator` and CD3+ as the `denominator`.

The $n_d$ refers to the number of 
markers considered for the denominator and $n$ for the number of markers considered for the numerator. 

For the 29-marker panel, if the `denominator` is specified as the CD4 and CD8 subsets, then $n_d = 2$ and $n = 23$ for the markers of interest.

The `getPerc` function allows user to list the markers of interest for the `numerator` and
`denominator`

In the example below, we will consider CD4 and CD8 subsets as the key parent populations of interest (`denominator`) and the three markers we gated on using `c("LAG3", "CCR7", "CD45RA")` (`numerator` markers). 

The additional arguments `expand_num` and `expand_denom` generates different lists of
subpopulations to calculate counts/percentages for: 

+ `expand_num`: should calculations consider up to pairs of numerator markers included?,
+ `expand_denom`: should the calculations consider combinations of each numerator marker and parent populations specified in the denominator?

<details><summary>Currently, we support the four scenarios listed:</summary>

```{r, echo = FALSE, warning = FALSE}
# Follows supp table 1 in the paper
tibble::tibble(
  expand_num = c("FALSE", "TRUE", "FALSE", "TRUE"),
  expand_denom = c("FALSE", "FALSE", "TRUE", "TRUE"),
  gen = c(
    # For expand_num, expand_denom combos
    # FALSE, FALSE
    "- Numerator: Positive and negative for each marker specified
- Denominator: Combinations of positive and negative for markers specified",
    # TRUE, FALSE
    "- Numerator: Positive and negative for each marker specified, and combinations of positive/negative for pairs of markers
    - At least 2 markers must be included in the numerator
-	Denominator: Combinations of positive and negative for marker(s)",
    # FALSE, TRUE
    "- Numerator: Positive and negative for each marker specified
    - At least 2 markers must be included in the numerator
- Denominator: Combinations of positive and negative for marker(s), and combinations of positive and negative for marker(s) in denominator with one marker from the numerator",
    # TRUE, TRUE
    "- Numerator: Positive and negative for each marker specified, and combinations of positive/negative for pairs of markers
    - At least 3 markers must be included in the numerator
- Denominator: Combinations of positive and negative for marker(s), and combinations of positive and negative for marker(s) in denominator with one marker from the numerator"
  ),
  examples = c(
    "Numerator specified to (“LAG3”) and denominator specified as (“CD4”, “CD8”):
-	LAG3+/- of CD4- & CD8-
-	LAG3+/- of CD4+ & CD8-
-	LAG3+/- of CD4+ & CD8+
-	LAG3+/- of CD4- & CD8+
",
    "Numerator specified as (“LAG3”, “KI67”) and denominator as (“CD4”):
-	LAG3+/- of CD4+/-
-	KI67+/- of CD4+/-
-	LAG3- KI67-  of CD4+/-
-	LAG3+ KI67-  of CD4+/-
-	LAG3- KI67+  of CD4+/-
-	LAG3+ KI67+  of CD4+/-
",
    "Numerator specified as (“LAG3”, “KI67”) and denominator as (“CD4”)
-	LAG3+/- of CD4+/-
-	KI67+/- of CD4+/-
-	LAG3+/- of CD4-KI67-
-	LAG3+/- of CD4-KI67+
-	LAG3+/- of CD4+KI67-
-	LAG3+/- of CD4+KI67+
-	KI67+/- of CD4-LAG3-
-	KI67+/- of CD4-LAG3+
-	KI67+/- of CD4+LAG3-
-	KI67+/- of CD4+LAG3+",
    "Numerator specified as (“LAG3”, “KI67”, “CTLA4”) and denominator as (“CD4”)
-	LAG3+/- of CD4+/-
-	KI67+/- of CD4+/-
-	CTLA4+/- of CD4+/-
-	LAG3+/- of CD4-KI67-
-	LAG3+/- of CD4-KI67+
-	LAG3+/- of CD4+KI67-
-	LAG3+/- of CD4+KI67+
-	KI67+/- of CD4-LAG3-
-	KI67+/- of CD4-LAG3+
-	KI67+/- of CD4+LAG3-
-	KI67+/- of CD4+LAG3+
-	….
-	LAG3- KI67-  of CD4+/-
-	LAG3+ KI67-  of CD4+/-
-	LAG3- KI67+  of CD4+/-
-	LAG3+ KI67+  of CD4+/-
-	…
-	LAG3+ KI67+  of CD4+/- & CTLA4+/-
-	…"
  ),
  n_subpops =
    c(
      "2n\\*(2<sup>n<sub>d</sub></sup>)",
      "2n\\*(2<sup>n<sub>d</sub></sup>) +<br>2<sup>n<sub>d</sub></sup>\\*(n choose 2)\\*(2<sup>2</sup>)",
      "2n\\*(2<sup>n<sub>d</sub></sup>) +<br>4\\*2<sup>n<sub>d</sub></sup>\\*n\\*(n-1)",
      "2n\\*(2<sup>n<sub>d</sub></sup>) +<br>2<sup>n<sub>d</sub></sup>\\*(n choose 2)\\*(2<sup>2</sup>) +<br>4\\*2<sup>n<sub>d</sub></sup>\\*n\\*(n-1) +<br>2<sup>2</sup>\\*n\\*((n-1) choose 2)\\*(2<sup>n<sub>d</sub>+1</sup>)"
    ),
  combos =
    c(
      2 * panel_n * 2^(panel_n_d), # 184,
      2 * panel_n * 2^(panel_n_d) + 2^(panel_n_d) * (choose(panel_n, 2)) * 2^2, # 4232,
      2 * panel_n * 2^(panel_n_d) + 4 * 2^(panel_n_d) * panel_n * (panel_n - 1), # 8280,
      2 * panel_n * 2^(panel_n_d) + 2^(panel_n_d) * (choose(panel_n, 2)) * 2^2 +
        4 * 2^(panel_n_d) * panel_n * (panel_n - 1) +
        2^2 * (panel_n) * (choose(panel_n - 1, 2)) * (2^(panel_n_d + 1)) # 182344)
    )
) |>
  gt::gt() |>
  gt::fmt_markdown(
    columns = dplyr::everything()
  ) |>
  gt::cols_label(
    expand_num = "expand_num",
    expand_denom = "expand_denom",
    gen = "The subpopulations to generate",
    examples = "Examples",
    n_subpops = "Expected number of subpopulations",
    combos = "Expected number of subpopulations<br>for the 29-marker T-cell panel",
    .fn = gt::md
  ) |>
  gt::cols_align(
    align = "center",
    columns = c("n_subpops", "combos")
  ) |>
  gt::tab_footnote(
    footnote = gt::md("n, number of markers specified in the numerator; n<sub>d</sub>,  number of markers specified in the denominator"),
    locations = gt::cells_column_labels(columns = "n_subpops")
  )
```

</details>

The `keep_indicators` argument provides the 0/1 for which marker is considered in the numerator and denominator
for each subpopulation. This is especially useful when merging onto other data that does not have the same naming conventions.

For example, when matching strings: "CD4+ & CD8- of CD3+" is different from "CD8- & CD4+ of CD3+" and "CD4+ and CD8- of CD3+". But using indicator columns, we can match the subpopulations regardless of the subpopulation naming convention.


Below is we show examples of each of the four combinations for the `expand_num` and `expand_denom` arguments. 

For the example of when `expand_num = FALSE` and `expand_denom = FALSE`, `keep_indicators = TRUE` to illustrate
the columns we get for the `_POS` and `_POS_D`. Other examples use `keep_indicators = FALSE`. 

<details><summary>Expand example for `expand_num = FALSE` and `expand_denom = FALSE`, and `keep_indicators = TRUE`</summary>

```{r}
example_perc1 <-
  # Should only count the CD3+ cells
  dplyr::filter(example_intensity_gated, cd3_pos == 1) |> 
  getPerc(
    intens_dat = _,
    num_marker = example_markers,
    denom_marker = c("CD4", "CD8"),
    expand_num = FALSE,
    expand_denom = FALSE,
    keep_indicators = TRUE
  )

# For display only, group based on the denominators and
# simplify the names to be numerators
example_perc1 |> 
  tidyr::separate_wider_delim(subpopulation,
    delim = "_OF_",
    names = c("num", "denom"),
    cols_remove = FALSE
  ) |>
  dplyr::mutate(denom = paste("Denom = ", denom)) |>
  dplyr::group_by(denom) |>
  dplyr::select(-subpopulation) |>
  gt::gt() |>
  gt::fmt_number(
    columns = "perc",
    decimals = 1
  )
```

</details>


<details><summary>Expand example for `expand_num = TRUE` and `expand_denom = FALSE`, and `keep_indicators = FALSE`</summary>

```{r}
example_perc2 <-
  # Should only count the CD3+ cells
  dplyr::filter(example_intensity_gated, cd3_pos == 1) |>
  getPerc(
    intens_dat = _,
    num_marker = example_markers,
    denom_marker = c("CD4", "CD8"),
    expand_num = TRUE,
    expand_denom = FALSE,
    keep_indicators = FALSE
  )

# For display only, group based on the denominators and
# simplify the names to be numerators
example_perc2 |>
  tidyr::separate_wider_delim(subpopulation,
    delim = "_OF_",
    names = c("num", "denom"),
    cols_remove = FALSE
  ) |>
  dplyr::mutate(denom = paste("Denom = ", denom)) |>
  dplyr::group_by(denom) |>
  dplyr::select(-subpopulation) |>
  gt::gt() |>
  gt::fmt_number(
    columns = "perc",
    decimals = 1
  )
```

</details>


<details><summary>Expand example for `expand_num = FALSE` and `expand_denom = TRUE`, and `keep_indicators = FALSE`</summary>
 

```{r}
example_perc3 <-
  # Should only count the CD3+ cells
  dplyr::filter(example_intensity_gated, cd3_pos == 1) |>
  getPerc(
    intens_dat = _,
    num_marker = example_markers,
    denom_marker = c("CD4", "CD8"),
    expand_num = FALSE,
    expand_denom = TRUE,
    keep_indicators = FALSE
  )

# For display only, group based on the denominators and
# simplify the names to be numerators
example_perc3 |> 
  tidyr::separate_wider_delim(subpopulation,
    delim = "_OF_",
    names = c("num", "denom"),
    cols_remove = FALSE
  ) |>
  dplyr::mutate(denom = paste("Denom = ", denom)) |>
  dplyr::group_by(denom) |>
  dplyr::select(-subpopulation) |>
  gt::gt() |>
  gt::fmt_number(
    columns = "perc",
    decimals = 1
  )
```
</details>

<details><summary>Expand example for `expand_num = TRUE` and `expand_denom = TRUE` , and `keep_indicators = FALSE`</summary>

```{r}
example_perc4 <-
  # Should only count the CD3+ cells
  dplyr::filter(example_intensity_gated, cd3_pos == 1) |>
  getPerc(
    intens_dat = _,
    num_marker = example_markers,
    denom_marker = c("CD4", "CD8"),
    expand_num = TRUE,
    expand_denom = TRUE,
    keep_indicators = FALSE
  )

# For display only, group based on the denominators and
# simplify the names to be numerators
example_perc4 |>
  tidyr::separate_wider_delim(subpopulation,
    delim = "_OF_",
    names = c("num", "denom"),
    cols_remove = FALSE
  ) |>
  dplyr::mutate(denom = paste("Denom = ", denom)) |>
  dplyr::group_by(denom) |>
  dplyr::select(-subpopulation) |>
  gt::gt() |>
  gt::fmt_number(
    columns = "perc",
    decimals = 1
  )
```

</details>

## Optional: Adding density gates back to GatingSet

Let's add the gate for LAG3 of CD4+ and CD8+. This is a good visualization to see all sequential gating steps applied to the sample.

```{r, warning = FALSE, message = FALSE, fig.height = 10, fig.width = 8}
# Grab gate as a numeric
current_gate <-
  dens_gates |>
  dplyr::filter(cd4_pos_cd8_pos == "cd4_neg_cd8_pos") |>
  dplyr::pull(LAG3)

# Apply using gs_add_gating-method and
# We want a boundary gate
openCyto::gs_add_gating_method(
  gs,
  alias = "lag3_cd8",
  pop = "+",
  parent = "cd4-cd8+",
  dims = "LAG3",
  gating_method = "boundary",
  gating_args = list(min = current_gate, max = Inf)
)

current_gate <-
  dens_gates |>
  dplyr::filter(cd4_pos_cd8_pos == "cd4_pos_cd8_neg") |>
  dplyr::pull(LAG3)

openCyto::gs_add_gating_method(
  gs,
  alias = "lag3_cd4",
  pop = "+",
  parent = "cd4+cd8-",
  dims = "LAG3",
  gating_method = "boundary",
  gating_args = list(min = current_gate, max = Inf)
)

ggcyto::autoplot(gs[[1]])
```

After running {staRgate} to gate flow cytometry data, it is recommended to perform some quality checks (QC) on the gate placements to ensure they are reasonable. We suggest to use ridgeplots in addition to the `ggcyto::autoplot` function to visualize the density distributions per marker across samples. When examining a large batch of samples, downsampling, such as to a random sample of 10k CD3+ cells, will make the QC process more manageable. In addition, random spot checks of a few samples would also be helpful QC to detect any edge cases.

Currently in this tutorial, we do not extend to the QC steps and suggest to lean on other examples for how to put together a ridgeplot for example.

In the near future, we hope to incorporate some examples for the additional QC steps as well, stay tuned!


# Session info 

```{r}
sessionInfo()
```


# References 

G. Finak, J. Frelinger, W. Jiang, E. W. Newell, J. Ramey, M. M. Davis, S. A. Kalams, S. C. De Rosa and R. Gottardo, "OpenCyto: An Open Source Infrastructure for Scalable, Robust, Reproducible, and Automated, End-to-End Flow Cytometry Data Analysis," PLoS Computational Biology, vol. 10, p. e1003806, August 2014. 

G. Finak and M. Jiang, "flowWorkspace: Infrastructure for representing and interacting with gated and ungated cytometry data sets.," 2023.

G. Finak, W. Jiang and R. Gottardo, "CytoML for cross-platform cytometry data sharing," Cytometry Part A, vol. 93, pp. -7, 2018. 

G. Monaco, H. Chen, M. Poidinger, J. Chen, J. P. de Magalhães and A. Larbi, "flowAI: automatic and interactive anomaly discerning tools for flow cytometry data," Bioinformatics, vol. 32, p. 2473–2480, April 2016. 
P. Van, W. Jiang, R. Gottardo and G. Finak, "ggcyto: Next-generation open-source visualization software for cytometry," Bioinformatics, 2018.