---
title: "VDJdive Workflow"
author:
- name: Kelly Street
  email: street.kelly@gmail.com
- name: Mercedeh Movassagh
- name: Jill Lundell
package: VDJdive
output:
  BiocStyle::html_document
abstract: |
  This vignette demonstrates the utility of the `VDJdive` package. The package provides functions for handling and analyzing immune receptor repertoire data, as produced by the CellRanger V(D)J pipeline. This includes reading the data into R, merging it with paired single-cell RNA-seq data, assigning clonotype labels, calculating diversity metrics, and producing common plots.
vignette: |
  %\VignetteIndexEntry{VDJdive Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  message = FALSE, 
  warning = FALSE
)
```

# Introduction

Targeted sequencing of immune receptors is a powerful tool for interrogating the adaptive immune system. Single-cell technology has increased the resolution of this type of data tremendously. While many tools exist to analyze single-cell RNA sequencing data, there are fewer options available for targeted assays like adaptive immune receptor sequencing (AIRRseq).

The `r Biocpkg("VDJdive")` package provides functionality for incorporating AIRRseq (or TCRseq) data into the Bioconductor single-cell ecosystem and analyzing it in a variety of ways. We believe this will unlock many powerful tools for the analysis of immune receptor data and make it easily accessible to users already familiar with `SingleCellExperiment` objects and the Bioconductor framework. This vignette will give you a brief overview of the methods available in the package and demonstrate their usage on a simple dataset.

```{r loadPackage, echo = FALSE}
require(utils)
require(VDJdive)
```

# Installation

To install `VDJdive` from Bioconductor, you can use the following code:

```{r installation, eval = FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VDJdive")
```

Additionally, some features of `VDJdive` make use of Python functionality, which is made available through the [`basilisk`](https://bioconductor.org/packages/release/bioc/html/basilisk.html) package.

# Read in 10X data

The 10X data are in a file called `filtered_contig_annotations.csv`, containing one entry (row) for each unique contig in each cell. When we read this data into R, we could produce a single `data.frame`, but we want to keep entries from the same cell together. Hence, we read the data in as a `SplitDataFrameList`, which efficiently stores a collection of data frames all containing the same column names. The following snippet demonstrates how to read in the `filtered_contig_annotations.csv` file and convert it to a `CompressedSplitDFrameList` for downstream use. 

The function `readVDJcontigs` takes a character vector that contains directory names. Each directory should contain a file called `filtered_contig_annotations.csv`. `readVDJcontigs` will read in the 10X files from the directory, split by cell barcode, and create a `SplitDataFrameList` from the files. The barcodes will be unique across samples because the sample name is appended to the barcode. 

```{r, eval = FALSE}
# Read in a single file
contigs <- readVDJcontigs("path_to_10X_files")

# Read in files for multiple samples
path1 <- "sample1"
path2 <- "sample2"
contigs <- readVDJcontigs(c("path1", "path2"))
```

```{r, echo = FALSE}
data("contigs")
```


# Merging V(D)J and scRNAseq Data

Besides keeping data on the same cells together, the primary advantage of the `SplitDataFrameList` is that it has length equal to the number of cells and can therefore be added to the column data of a `SingleCellExperiment` object. Cells are matched between the two objects by their barcodes. However, this merging can lead to information loss, as cells that are not already represented in the SCE object will be dropped from the `SplitDataFrameList`. When this happens, `addVDJtoSCE` will issue a warning with the number of cells that have been dropped.

```{r}
require(SingleCellExperiment)
ncells <- 24
u <- matrix(rpois(1000 * ncells, 5), ncol = ncells)
barcodes <- vapply(contigs[,'barcode'], function(x){ x[1] }, 'A')
samples <- vapply(contigs[,'sample'], function(x){ x[1] }, 'A')
sce <- SingleCellExperiment(assays = list(counts = u),
                            colData = data.frame(Barcode = barcodes,
                                                 group = samples))
sce <- addVDJtoSCE(contigs, sce)
sce$contigs
```


# Assign Clonotypes and Calculate Summaries

`r Biocpkg("VDJdive")` contains two methods for assigning clonotype lables. The first considers only cells with complete, productive alpha and beta chains (exactly one of each). These are the cells which can only be assigned a single, unique clonotype label and the resulting abudances are always whole numbers. Cells that do not meet this criteria are not assigned a clonotype and do not contribute to downstream analysis. If two cells have identical amino acid sequences in their CDR3 regions, they are considered to be the same clonotype. We assign clonotypes in this manner by running `clonoStats` with the option `method = 'unique'`:

```{r}
UNstats <- clonoStats(contigs, method = 'unique')
class(UNstats)
```

The other method for assigning clonotype lables is probabalistic and makes use of the Expectation-Maximization (EM) algorithm. Rather than filtering out ambiguous cells (ie. cells with no alpha chain, no beta chain, or more than one of either), this method allows for partial assignment. For example, if a cell has one productive alpha chain and two productive beta chains, the EM algorithm will be used to make a partial clonotype assignment based on the prevalence of each clonotype in the sample. That means that a cell may have a count of 0.6 for one clonotype and 0.4 for a different clonotype rather than a count of 1 for a single clonotype. We assign clonotypes in this manner by running `clonoStats` with the option `method = 'EM'` (which is the default): 

```{r}
EMstats <- clonoStats(contigs, method = "EM")
class(EMstats)
```

Similarly, we can call `clonoStats` on a `SingleCellExperiment` object that contains V(D)J data. This will add a `clonoStats` object to the metadata of the SCE:
```{r}
sce <- clonoStats(sce, method = 'EM')
metadata(sce)
```

Functions to access the clonoStats class.
```{r}
head(clonoAbundance(sce)) # access output of abundance for clonotypes for clonoStats class
clonoFrequency(sce) # access output of frequency for clonotypes for clonoStats class
clonoFrequency(sce) # access output of clonotypes assignment for clonoStats class
clonoGroup(sce) # access output of clonotypes grouping for clonoStats class
clonoNames(sce) # access output of clonotypes samples for clonoStats class
```

# Diversity

Diversity metrics can be computed from the `clonoStats` object or a `SingleCellExperiment` object containing the relevant output. The `calculateDiversity` function can compute (normalized) Shannon entropy, Simpson index, inverse-Simpson index, Chao diversity, and Chao-Bunge diversity. The Chao and Chao-Bunge diversity measures require clonotype frequencies that may be philosophically incompatible with the expected counts generated by the EM algorithm. In these cases, the results for the EM counts are taken as Bernoulli probabilities and we calculate the expected number of singletons, doubletons, etc. The entropy measures and Simpson indices do not require integer counts so no additional calculation is needed for those measures. 

All of the diversity measures can be computed for each sample with the following: 

```{r}
div <- calculateDiversity(EMstats, methods = "all")
div
```

We also provide a function for estimation of species richness through breakaway (Willis et al, biometrics 2015). For this section please ensure to install the breakaway package from CRAN https://cran.r-project.org/web/packages/breakaway/index.html . The method uses the frequency (number of singletons, doubletons etc to provide a more accurate estimate for richness with standard error and confidence intervals).

```{r}
#divBreakaway <- runBreakaway(EMstats, methods = 'unique')
#divBreakaway
```

# Cluster Diversity

Rather than grouping cells by sample, we may be interested in other groupings, such as clusters based on RNA expression. In this case, if we have the original clonotype assignment matrix, we can calculate new summary statistics based on a different grouping variable. This is preferable to grouping by cluster in the original call to `clonoStats` because the most accurate clonotype assignment is achieved cells are grouped by sample. Otherwise, we may see unwanted crosstalk between samples leading to nonsensical counts (ie. a clonotype that never appears in a particular sample may receive non-zero counts from some ambiguous cells in that sample).

```{r}
EMstats <- clonoStats(contigs, method = "EM", assignment = TRUE)

clus <- sample(1:2, 24, replace = TRUE)

EMstats.clus <- clonoStats(EMstats, group = clus)
```

To reiterate: we always recommend running `clonoStats` by sample of origin first, to get the most accurate clonotype assignment. It is easy to re-calculate summary statistics for other grouping variables later.


# Visualization

`r Biocpkg("VDJdive")` has many options for visualization. This section demonstrates the graphs that can be created. 

## Clonotype Abundance Bar Plot

Barplot function which shows the number of T cells in each group (sample) as well as the clonotype abundances within each group (sample). The coloring indicates the number of cells assigned to each clonotype, with darker colors denoting singletons and
lighter colors denoting expanded clonotypes (scale = log count of clontypes).

```{r fig.width= 3, fig.height= 2}
barVDJ(EMstats, title = "contigs", legend = TRUE)
```

## Clonotype Abundance Pie Chart

Similar to the bar plot above, these pie charts show the clonotype abundances
within each group (sample), as a percentage of cells within that group. The coloring indicates the number of cells assigned to each clonotype, with darker colors denoting singletons and lighter colors denoting expanded clonotypes (scale = log count of clontypes).

```{r fig.width= 2.5, fig.height= 2}
pieVDJ(EMstats)
```

## Richness vs. Evenness Scatter Plot

This scatter plot shows two measures of group-level (sample-level) diversity.
The "richness" (total number of clonotypes) is shown on the x-axis and the
"evenness" (mixture of clonotypes) on the y-axis. Diversity measures such as
Shannon entropy contain information about both the evenness and the abundance of
a sample, but because both characteristics are combined into one number,
comparison between samples or groups of samples is difficult. Other measures,
such as the breakaway measure of diversity, only express the abundance of the
sample and not the evenness. The scatterplot shows how evenness and abundance
differ between each sample and between each group of samples.

```{r}
sampleGroups <- data.frame(Sample = c("sample1", "sample2"), 
                           Group = c("Tumor", "Normal"))
scatterVDJ(div, sampleGroups = NULL, 
       title = "Evenness-abundance plot", legend = TRUE)
```


## Clonotype Abundance Dot Plot

This dot plot shows the clonotype abundances in each group (sample) above a
specified cutoff. The most abundant clonotypes for each group (sample) are annotated on
the plot and ordered from most abundant to least abundant.

```{r}
abundanceVDJ(EMstats)
```


## Session Info
```{r}
sessionInfo()
```