---
title: "User Manual: IgGeneUsage"
author: "SK"
date: "Oct 20, 2021"
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{User Manual: IgGeneUsage}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include = FALSE, warning = FALSE}
knitr::opts_chunk$set(comment = FALSE, 
                      warning = FALSE, 
                      message = FALSE)
```


```{r}
require(IgGeneUsage)
require(knitr)
require(ggplot2)
require(ggforce)
require(gridExtra)
require(ggrepel)
require(rstan)
require(reshape2)
rstan_options(auto_write = TRUE)
```


# Introduction
Decoding the properties of immune repertoires is key in understanding the 
response of adaptive immunity to challenges such as viral infection. One 
important property is biases in immunoglobulin (Ig) gene usage between 
biological conditions (e.g. healthy vs tumor). Yet, most analyses for 
differential gene usage (DGU) are performed qualitatively, or with 
inadequate statistical methods. Here we introduce `r Biocpkg("IgGeneUsage")`,
a computational tool for DGU analysis.


# Input
The main input of `r Biocpkg("IgGeneUsage")` is a data.frame that has the 
following 4 columns:

  1. sample_id: name of the repertoire (e.g. Patient-1)
  2. condition: name of the condition to which each repertoire 
  belongs (e.g. healthy or tumor)
  3. gene_name: gene name (e.g. IGHV1-10 or family TRVB1)
  4. gene_usage_count: numeric (count) of usage related in sample x gene x 
     condition specified in columns 1-3

The sum of all gene usage counts (column 4) for a given repertoire is equal to
the repertoire size (number of cells in the repertoire).


# Model
`r Biocpkg("IgGeneUsage")` transforms the provided input in the following 
way. Given $R$ repertoires, each having $G$ genes, `r Biocpkg("IgGeneUsage")` 
generates a gene usage matrix $Y^{R \times G}$. Row sums in $Y$ define the 
total usage in each repertoire ($N$). The design variable $X$ is set to 
$X = 1$ for repertoires that belong to the first condition, and $X = -1$ 
otherwise.

For the analysis of DGU between two biological conditions, we designed the 
following Bayesian model ($M$) for zero-inflated beta-binomial regression. 
This model can fit over-dispersed gene usage data. The immune repertoire 
data is also not exhaustive, which leads to misdetection of genes that are 
systematically rearranged at low probability. The zero-inflated component of 
our model accounts for this:

\begin{align}
p(Y_{ij} \mid M) = \begin{cases} 
\kappa + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{i}, \theta_{ij}, \phi 
\right), & \text{if $Y_{ij}$ = 0} \\
(1 - \kappa) \operatorname{BB}\left(Y_{ij} \mid N_{i}, \theta_{ij}, \phi 
\right), & \text{if $Y_{ij}$ > 0}
\end{cases}\\
\theta_{ij}=\operatorname{logit^{-1}}\left(\alpha_{j}+\beta_{ij}X_{i}\right)\\
\beta_{ij}\sim\operatorname{Normal}\left(\gamma_{j},\gamma_{\sigma} \right)\\
\gamma_{j}\sim\operatorname{Normal}\left(\hat{\gamma},\hat{\gamma}_{\sigma}
\right) \\
\alpha_{j}\sim\operatorname{Normal}\left(\hat{\alpha},\hat{\alpha}_{\sigma}
\right) \\
\hat{\gamma} \sim \operatorname{Normal}\left(0, 5\right) \\
\hat{\alpha} \sim \operatorname{Normal}\left(0, 10\right) \\
\gamma_{\sigma}, \hat{\gamma}_{\sigma}, \hat{\alpha}_{\sigma} \sim 
\operatorname{Cauchy^{+}}\left(0, 1\right) \\
\phi \sim \operatorname{Exponential}\left(\tau\right) \\
\tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\
\kappa \sim \operatorname{Beta}\left(1, 3\right)
\end{align}

Model $M$ legend:

  * $i$ and $j$: index of different repertoires and genes, respectively 
  * $\kappa$: zero-inflation probability
  * $\theta$: probability of gene usage
  * $\phi$: dispersion
  * $\alpha$: intercept/baseline gene usage
  * $\beta$: slope/within-repertoire DGU coefficient
  * $\gamma$, $\gamma_{\sigma}$: slope/gene-specific DGU coefficient; 
  standard deviation
  * $\hat{\gamma}$, $\hat{\gamma}_{\sigma}$: mean and standard deviation of 
  the population of gene-specific DGU coefficients
  * $\hat{\alpha}$, $\hat{\alpha}_{\sigma}$: mean and standard deviation of 
  the population of gene-specific baseline usages
  * $\operatorname{BB}$: beta-binomial probability mass function (pmf)
  * $\operatorname{Normal}$: normal probability density function (pdf)
  * $\operatorname{Cauchy^{+}}$: half-Cauchy pdf
  * $\operatorname{Exponential}$: exponential pdf
  * $\operatorname{Gamma}$: gamma pdf
  * $\operatorname{Beta}$: beta pdf
  * $\operatorname{logit^{-1}}$: inverse logistic function

In the output of `r Biocpkg("IgGeneUsage")`, we report the mean effect 
size ($\gamma$) and its 95% highest density interval (HDI). Genes with 
$\gamma \neq 0$ (e.g. if 95% HDI of $\gamma$ excludes 0) are most likely 
to experience differential usage. Additionally, we report the probability of 
differential gene usage ($\pi$):
\begin{align}
\pi = 2 \cdot \max\left(\int_{\gamma = -\infty}^{0} p(\gamma)\mathrm{d}\gamma, 
\int_{\gamma = 0}^{\infty} p(\gamma)\mathrm{d}\gamma\right) - 1
\end{align}
with $\pi = 1$ for genes with strong differential usage, and $\pi = 0$ for 
genes with negligible differential gene usage. Both metrics are computed based
on the posterior distribution of $\gamma$, and are thus related. We find $\pi$ 
slightly easier to interpret.


# Updated model for `r Biocpkg("IgGeneUsage")` version > 1.7.25
\begin{align}
p(Y_{ij} \mid M) = \begin{cases} 
\kappa + (1 - \kappa) \operatorname{BB}\left(0 \mid N_{i}, \theta_{ij}, \phi 
\right), & \text{if $Y_{ij}$ = 0} \\
(1 - \kappa) \operatorname{BB}\left(Y_{ij} \mid N_{i}, \theta_{ij}, \phi 
\right), & \text{if $Y_{ij}$ > 0}
\end{cases}\\
\theta_{ij}=\operatorname{logit^{-1}}\left(\alpha_{ij}+\beta_{ij}X_{i}\right)\\
\alpha_{ij}\sim\operatorname{Normal}\left(\delta_{j},\delta_{\sigma} \right)\\
\beta_{ij}\sim\operatorname{Normal}\left(\gamma_{j},\gamma_{\sigma} \right)\\
\gamma_{j}\sim\operatorname{Normal}\left(0.0,\hat{\gamma}_{\sigma} \right) \\
\delta_{j}\sim\operatorname{Normal}\left(0.0,\hat{\delta}_{\sigma} \right) \\
\gamma_{\sigma}, \hat{\gamma}_{\sigma}, \delta_{\sigma}, \hat{\delta}_{\sigma} 
\sim \operatorname{Cauchy^{+}}\left(0, 1\right) \\
\phi \sim \operatorname{Exponential}\left(\tau\right) \\
\tau \sim \operatorname{Gamma}\left(3, 0.1\right) \\
\kappa \sim \operatorname{Beta}\left(1, 3\right)
\end{align}


# Case Study A
`r Biocpkg("IgGeneUsage")` has a couple of built-in Ig gene usage datasets. 
Some were obtained from studies and others were simulated.

Lets look into the simulated dataset `d_zibb`. This dataset was generated by a
zero-inflated beta-binomial (ZIBB) model, and `r Biocpkg("IgGeneUsage")` 
was designed to fit ZIBB-distributed data.

```{r}
data("d_zibb", package = "IgGeneUsage")
knitr::kable(head(d_zibb))
```

We can also visualize `d_zibb` with `r CRANpkg("ggplot")`:

```{r, fig.width=6, fig.height=3}
ggplot(data = d_zibb)+
  geom_point(aes(x = gene_name, y = gene_usage_count, col = condition),
             position = position_dodge(width = .7), shape = 21)+
  theme_bw(base_size = 11)+
  ylab(label = "Gene usage")+
  xlab(label = '')+
  theme(legend.position = "top")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
```


## DGU analysis
As main input `r Biocpkg("IgGeneUsage")` uses a data.frame formatted as 
`d_zibb`. Other input parameters allow you to configure specific settings 
of the `r CRANpkg("rstan")` sampler.

In this example we analyze `d_zibb` with 3 MCMC chains, 1500 iterations 
each including 500 warm-ups using a single CPU core (Hint: for parallel 
chain execution set parameter `mcmc.cores` = 3). We report for each model 
parameter its mean and 95% highest density interval (HDIs).

**Important remark:** you should run DGU analyses using default 
`r Biocpkg("IgGeneUsage")` parameters. If warnings or errors are reported 
with regard to the MCMC sampling, please consult the Stan manual[^2] and 
adjust the inputs accordingly. If the warnings persist, please submit an 
issue with a reproducible script at the Bioconductor support site or on 
Github[^3].

```{r}
M <- DGU(usage.data = d_zibb, # input data
         mcmc.warmup = 500, # how many MCMC warm-ups per chain (default: 500)
         mcmc.steps = 1500, # how many MCMC steps per chain (default: 1,500)
         mcmc.chains = 3, # how many MCMC chain to run (default: 4)
         mcmc.cores = 1, # how many PC cores to use? (e.g. parallel chains)
         hdi.level = 0.95, # highest density interval level (de fault: 0.95)
         adapt.delta = 0.8, # MCMC target acceptance rate (default: 0.95)
         max.treedepth = 10) # tree depth evaluated at each step (default: 12)
```


## Output format
The following objects are provided as part of the output of DGU:
  
  * `glm.summary` (main results of `r Biocpkg("IgGeneUsage")`): quantitative 
    DGU summary 
  * `test.summary`: quantitative DGU summary from frequentist methods: 
    Welch's t-test (T-test) and Wilcoxon signed-rank test (U-test)
  * `glm`: rstan ('stanfit') object of the fitted model $rightarrow$ used for
     model checks (see section 'Model checking')
  * `ppc.data`: posterior predictive checks data (see section 'Model checking')

```{r}
summary(M)
```


## Model checking
* **Check your model fit**. For this, you can use the object glm.

  * Minimal checklist of successful MCMC sampling[^2]:
      * no divergences
      * no excessive warnings from rstan
      * Rhat < 1.05
      * high Neff
  * Minimal checklist for valid model:
      * posterior predictive checks (PPCs): is model consistent with reality, 
        i.e. is there overlap between simulated and observed data?
      * leave-one-out analysis

[^2]: https://mc-stan.org/misc/warnings.html
[^3]: https://github.com/snaketron/IgGeneUsage/issues


### MCMC sampling

  * divergences, tree-depth, energy
  
```{r}
rstan::check_hmc_diagnostics(M$glm)
```

  * Rhat and Neff

```{r, fig.height = 3, fig.width = 6}
gridExtra::grid.arrange(rstan::stan_rhat(object = M$glm),
                        rstan::stan_ess(object = M$glm),
                        nrow = 1)
```


### PPCs: repertoire-specific
The model used by `r Biocpkg("IgGeneUsage")` is generative, i.e. with the 
model we can generate usage of each Ig gene in a given repertoire (y-axis). 
Error bars show 95% HDI of mean posterior prediction. The predictions can be 
compared with the observed data (x-axis). For points near the diagonal 
$\rightarrow$ accurate prediction.

```{r, fig.height = 5.5, fig.width = 6}
ggplot(data = M$ppc.data$ppc.repertoire)+
  facet_wrap(facets = ~sample_name, nrow = 3)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_count, y = ppc_mean_count, 
                    ymin = ppc_L_count, ymax = ppc_H_count), col = "darkgray")+
  geom_point(aes(x = observed_count, y = ppc_mean_count, 
                 fill = condition), shape = 21, size = 1)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_x_log10()+
  scale_y_log10()+
  xlab(label = "Observed usage [counts]")+
  ylab(label = "Predicted usage [counts]")+
  annotation_logticks(base = 10, sides = "lb")
```


### PPCs: overall
Prediction of generalized gene usage within a biological condition is also 
possible. We show the predictions (y-axis) of the model, and compare them 
against the observed mean usage (x-axis). If the points are near the diagonal 
$\rightarrow$ accurate prediction. Errors are 95% HDIs of the mean.

```{r, fig.height = 4, fig.width = 4}
ggplot(data = M$ppc.data$ppc.gene)+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "darkgray")+
  geom_errorbar(aes(x = observed_prop*100, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100), col = "darkgray")+
  geom_point(aes(x = observed_prop*100, y = ppc_mean_prop*100, 
                 col = condition), size = 2)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  xlab(label = "Observed usage [%]")+
  ylab(label = "Predicted usage [%]")+
  scale_x_log10()+
  scale_y_log10()+
  annotation_logticks(base = 10, sides = "lb")
```


## Results
Each row of `glm.summary` summarizes the degree of DGU observed for specific 
Igs. Two metrics are reported: 

  * `es` (also referred to as `\gamma`): effect size encoded in(parameter 
    $\gamma$ from model $M$) on DGU, where `contrast` gives the direction of 
    the effect (e.g. tumor - healthy or healthy - tumor)
  * `pmax` (also referred to as `\pi`): probability of DGU (parameter $\pi$ 
    from model $M$)
  
For `es` we also have the mean, median standard error (se), standard 
deviation (sd), L (low bound of 95% HDI), H (high bound of 95% HDI)

```{r}
kable(x = head(M$glm.summary), row.names = FALSE, digits = 3)
```


We know that the values of `\gamma` and `\pi` are related to each other. 
Lets visualize them for all genes (shown as a point). Names are shown for 
genes associated with $\pi \geq 0.9$. Dashed horizontal line represents 
null-effect ($\gamma = 0$). 

Notice that the gene with $\pi \approx 1$ also has an effect size whose 
95% HDI (error bar) does not overlap the null-effect. The genes with high 
degree of differential usage are easy to detect with this figure.

```{r, fig.height = 4, fig.width = 6}
# format data
stats <- M$glm.summary
stats <- stats[order(abs(stats$es_mean), decreasing = FALSE), ]
stats$gene_fac <- factor(x = stats$gene_name, levels = stats$gene_name)

stats <- merge(x = stats, y = M$test.summary, by = "gene_name")

ggplot(data = stats)+
  geom_hline(yintercept = 0, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = pmax, y = es_mean, ymin = es_L, ymax = es_H), 
                col = "darkgray")+
  geom_point(aes(x = pmax, y = es_mean), col = "darkgray")+
  geom_text_repel(data = stats[stats$pmax >= 0.9, ],
                  aes(x = pmax, y = es_mean, label = gene_fac),
                  min.segment.length = 0, size = 2.75)+
  theme_bw(base_size = 11)+
  xlab(label = expression(pi))+
  xlim(c(0, 1))
```


### Promising hits
Lets visualize the observed data of the genes with high probability of 
differential gene usage ($\pi \geq 0.9$). Here we show the gene usage in %.

```{r, fig.height = 2.5, fig.width = 3}
promising.genes <- stats$gene_name[stats$pmax >= 0.9]

ppc.gene <- M$ppc.data$ppc.gene
ppc.gene <- ppc.gene[ppc.gene$gene_name %in% promising.genes, ]

ppc.rep <- M$ppc.data$ppc.repertoire
ppc.rep <- ppc.rep[ppc.rep$gene_name %in% promising.genes, ]


ggplot()+
  geom_point(data = ppc.rep,
             aes(x = gene_name, y = observed_prop*100, col = condition),
             size = 1.5, fill = "black",
             position = position_jitterdodge(jitter.width = 0.1, 
                                             jitter.height = 0, 
                                             dodge.width = 0.35))+
  geom_errorbar(data = ppc.gene, 
                aes(x = gene_name, ymin = ppc_L_prop*100, 
                    ymax = ppc_H_prop*100, group = condition),
                position = position_dodge(width = 0.35), width = 0.15)+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "Usage [%]")+
  xlab(label = '')
```


### Promising hits [count]
Lets also visualize the gene usage frequencies. Point size represents 
total usage in repertoire.

```{r, fig.height = 2.5, fig.width = 3}
ggplot()+
  geom_point(data = ppc.rep,
             aes(x = gene_name, y = observed_count, col = condition),
             size = 1.5, fill = "black",
             position = position_jitterdodge(jitter.width = 0.1, 
                                             jitter.height = 0, 
                                             dodge.width = 0.35))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  ylab(label = "Usage count")+
  xlab(label = '')
```


## Comparison with the Welch's t-test (T-test)
Despite the fact that the data is not normally distributed (see first figure). 
Nevertheless, we performed DGU analysis with the T-test, and compare $\pi$ 
with the FDR corrected P-values (-log10 scale) from the T-test:

  * Consistent results: Gene G4 has low p-value and $\pi=1$
  * Inconsistent results: Gene 11 has high $\pi\approx 0.9$ yet 
    p-value$\approx 1$

```{r, fig.height = 4, fig.width = 4}
ggplot()+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(data = stats, col = "red", size = 2,
             aes(x = pmax, y = -log10(t.test.fdr.pvalue)))+
  geom_text_repel(data = stats[stats$pmax >= 0.5, ], 
                  aes(x = pmax, y = -log10(t.test.fdr.pvalue), 
                      label = gene_name), size = 2.75, 
                  min.segment.length = 0)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from T-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```

## Comparison with the Wilcoxon signed-rank test (U-test)
The nonparametric U-test can also be used for the analysis of DGU. The U-test 
assumes data with equal shape in both groups (also not met by our data). We 
compare $\pi$ with the FDR corrected P-values (-log10 scale) from the U-test:

  * Inconsistent results: Genes 4 and 11 are associated with high 
    $\pi\approx1$, yet high p-value$\approx 1$ (false negatives)

```{r, fig.height = 4, fig.width = 4}
ggplot()+
  geom_hline(yintercept = c(-log10(0.05), -log10(0.01)), 
             linetype = "dashed", col = "darkgray")+
  geom_point(data = stats, col = "red", size = 2,
             aes(x = pmax, y = -log10(u.test.fdr.pvalue)))+
  geom_text_repel(data = stats[stats$pmax >= 0.5, ], 
                  aes(x = pmax, y = -log10(u.test.fdr.pvalue), 
                      label = gene_name), size = 2.75, 
                  min.segment.length = 0)+
  xlim(0, 1)+
  ylab(label = "-log10 (P-value) from U-test [FDR corrected]")+
  xlab(label = expression(pi))+
  theme_bw(base_size = 11)+
  theme(legend.position = "top")+
  scale_color_discrete(name = '')
```
# Session

```{r}
sessionInfo()
```