---
title: "Effect size estimation with apeglm"
author: 
- name: Anqi Zhu
  affiliation: Department of Biostatistics, UNC-Chapel Hill
- name: Joseph G. Ibrahim
  affiliation: Department of Biostatistics, UNC-Chapel Hill
- name: Michael I. Love
  affiliation: Department of Biostatistics, UNC-Chapel Hill
output:
  BiocStyle::html_document:
    toc_float: true
package: apeglm
abstract: |
  *apeglm* provides Bayesian shrinkage estimators for effect sizes for a
  variety of GLM models; *apeglm* stands for "approximate posterior
  estimation for GLM".
vignette: |
  %\VignetteIndexEntry{Effect size estimation with apeglm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!-- run this document with library(rmarkdown); render("apeglm.Rmd") -->

```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(tidy=FALSE, cache=TRUE,
                      dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)
```

# Typical RNA-seq call from DESeq2

**Note:** the typical RNA-seq workflow for users would be to call
*apeglm* estimation from within the `lfcShrink` function in *DESeq2*,
with the following code (not run here). See the DESeq2 vignette for
more details. The `lfcShrink` wrapper function takes care of many
details below, and unifies the interface for multiple shrinkage
estimators. 

```{r,eval=FALSE}
res <- lfcShrink(dds, coef=2, type="apeglm")
```

# Example RNA-seq analysis

Here we show example code which mimics what will happen inside the
`lfcShrink` function when using the *apeglm* method. 

Load a prepared `SummarizedExperiment`:

```{r}
library(airway)
data(airway)
head(assay(airway))
```

For demonstration, we will just use the first 5000 genes of the `airway` dataset.

```{r}
keep <- head(which(rowSums(assay(airway)) > 0), 5000)
airway <- airway[keep,]
```

Run a *DESeq2* differential expression analysis
(size factors, and dispersion estimates could similarly
be estimated using *edgeR*):

```{r}
library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)
```

Defining data and parameter objects necessary for `apeglm`. 
We must multiply the coefficients from *DESeq2* by a factor,
because *apeglm* provides natural log coefficients. Again,
this would be handled inside of `lfcShrink` in *DESeq2* for a typical
RNA-seq analysis.

```{r}
x <- model.matrix(design(dds), colData(dds))
param <- dispersions(dds)
mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE)
offset <- matrix(log(sizeFactors(dds)),
                 ncol=ncol(dds),
                 nrow=nrow(dds),byrow=TRUE)
```

# Running apeglm

Here `apeglm` on 5000 genes takes less than a minute on a laptop.
It scales with number of genes, the number of samples and the number
of variables in the design formula, where here we have 5 coefficients
(one for the four cell cultures and one for the difference due to
dexamethasone treatment).

We provide `apeglm` with the *SummarizedExperiment* although the
function can also run on a *matrix* of counts or other observed data.
We specify a `coef` as well as a `threshold` which we discuss
below. Note that we multiple the `threshold` by `log(2)` to convert
from log2 scale to natural log scale. 

```{r}
library(apeglm)
system.time({
  fit <- apeglm(Y=airway, x=x,
                log.lik=logLikNB,
                param=param,
                coef=ncol(x),
                threshold=log(2) * 1,
                mle=mle,
                offset=offset)
})
str(fit$prior.control)
```

Among other output, we have the estimated coefficients attached to the
ranges of the *SummarizedExperiment* used as input:

```{r}
class(fit$ranges)
mcols(fit$ranges, use.names=TRUE)
```

We can compare the coefficients from *apeglm* with those from *DESeq2*.
*apeglm* provides coefficients on the natural log scale, so we must
convert to log2 scale by multiplying by `log2(exp(1))`. Note that
*DESeq2*'s `lfcShrink` function converts *apeglm* coefficients to the log2
scale internally.

```{r}
res.shr <- lfcShrink(dds, coef=5)
DESeq2.lfc <- res.shr$log2FoldChange
apeglm.lfc <- log2(exp(1)) * fit$map[,5]
```

Here we plot *apeglm* estimators against *DESeq2*:

```{r apevsdeseq}
plot(DESeq2.lfc, apeglm.lfc)
abline(0,1)
```

Here we plot *MLE*, *DESeq2* and *apeglm* estimators against the mean
of normalized counts:

```{r maplots, fig.width=9, fig.height=3.5}
par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
     ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
     ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
     ylim=lims, main="apeglm", xlab=xlab)
hline()
```

# Specific coefficients

Note that p-values and FSR define different events, and
are not on the same scale. An FSR of 0.5 means that the 
estimated sign is as bad as random guess. 

```{r pvalcomp, fig.width=8, fig.height=4}
par(mfrow=c(1,2),mar=c(5,5,1,1))
plot(res$pvalue, fit$fsr, col="blue",
     xlab="DESeq2 pvalue", ylab="apeglm local FSR",
     xlim=c(0,1), ylim=c(0,.5))
abline(0,1)
plot(-log10(res$pvalue), -log10(fit$fsr),
     xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR",
     col="blue")
abline(0,1)
```

We recommend using a lower threshold on s-values than typically
used for adjusted p-values, for example one might
be interested in sets with 0.01 or 0.005 aggregate FSR.

```{r pvalcomp2}
plot(res$padj, fit$svalue, col="blue",
     xlab="DESeq2 padj", ylab="apeglm svalue",
     xlim=c(0,.2), ylim=c(0,.02))
```

More scrutiny can be applied by using an LFC threshold greater
than zero, and asking for the probability of a false sign
or that the effect size is not further from zero in
distance than the threshold amount.

As we already specfified a threshold of 1 on the log2 scale,
we can see how this changes the local FSRs:

```{r maplotthresh}
cols <- ifelse(svalue(fit$thresh) < .01, "red", "black")
plot(res$baseMean, log2(exp(1)) * fit$map[,5], log="x", col=cols,
     xlab=xlab, ylab="LFC")
abline(h=c(-1,0,1), col=rgb(0,0,1,.5), lwd=2)
```

# Modeling ratios of counts 

We also show an short example using an alternative likelihood
to the negative binomial. Suppose we have allele-specific counts
for n=20 vs 20 samples across 5000 genes. We can define a 
binomial model and test for the allelic balance across groups 
of samples.

Here we will *simulate* allele counts from our existing dataset
for demonstration. We spike in 10 genes with allelic imbalance.

```{r}
library(emdbook)
n <- 20
f <- factor(rep(1:2,each=n))
mu <- ifelse(res$baseMean > 10, res$baseMean, 10)
set.seed(1)
cts <- matrix(rnbinom(nrow(dds)*2*n, 
                      mu=mu,
                      size=1/dispersions(dds)),
              ncol=2*n)
theta <- runif(nrow(cts),1,100)
prob <- rnorm(nrow(cts),.5,.01)
ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob,
                             size=cts, theta=rep(theta,ncol(cts))),
                  nrow=nrow(cts))
idx <- 1:10
idx2 <- which(f == 2)
theta[idx] <- 100
ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=.75,
                                       size=cts[idx,idx2], theta=100),
                            nrow=length(idx))
```

We define a beta-binomial likelihood function which uses the total counts
as a parameter:

```{r}
betabinom.log.lik <- function(y, x, beta, param, offset) {
  xbeta <- x %*% beta
  p.hat <- (1+exp(-xbeta))^-1
  dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE)
}
```

We first need to estimate MLE coefficients and standard errors.

```{r}
theta.hat.0 <- 100 # rough estimate of dispersion
param <- cbind(theta.hat.0, cts)
x <- model.matrix(~f)
system.time({
  fit.mle <- apeglm(Y=ase.cts, x=x,
                    log.lik=betabinom.log.lik,
                    param=param,
                    no.shrink=TRUE,
                    log.link=FALSE)
})
```

```{r mlemaplot}
coef <- 2
plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="LFC")
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
```

```{r}
theta.hat <- bbEstDisp(success=ase.cts, size=cts,
                       x=x, beta=fit.mle$map,
                       minDisp=1, maxDisp=500)
```

**Note:** here we use the `se` slot of the returned fit object,
although in future versions of *apeglm*, this slot is named `sd`
(posterior standard deviation is a more accurate description of this
quantity).

```{r}
mle <- cbind(fit.mle$map[,coef], fit.mle$se[,coef])
param <- cbind(theta.hat, cts)
system.time({
  fit2 <- apeglm(Y=ase.cts, x=x,
                 log.lik=betabinom.log.lik,
                 param=param,
                 coef=coef,
                 mle=mle,
                 log.link=FALSE)
})
```

```{r asemaplot, fig.width=8, fig.height=4}
par(mfrow=c(1,2))
ylim <- c(-3,3)
plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="LFC", ylim=ylim)
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
cols <- ifelse(fit2$svalue < .001, "red", "black")
plot(res$baseMean, fit2$map[,coef], log="x", xlab=xlab, ylab="LFC",
     col=cols, ylim=ylim)
points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3)
```

```{r}
log(.75/(1-.75))
```

# Session Info

```{r}
sessionInfo()
```