---
title: "Effect size estimation with apeglm"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
author: "Anqi Zhu, Joseph G. Ibrahim, and Michael I. Love"
output:
  rmarkdown::html_document:
    highlight: pygments
    toc: true
    toc_float: true
abstract: |
  *apeglm* provides empirical Bayes shrinkage estimators for effect sizes for a
  variety of GLM models; *apeglm* stands for "Approximate Posterior
  Estimation for GLM".
  apeglm package version: `r packageVersion("apeglm")`
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{Effect size estimation with apeglm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<!-- run this document with library(rmarkdown); render("apeglm.Rmd") -->

```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(tidy=FALSE, cache=FALSE,
                      dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)
```

# Typical RNA-seq call from DESeq2

**Note:** the typical RNA-seq workflow for users would be to call
*apeglm* estimation from within the `lfcShrink` function from the
*DESeq2* package. The unevaluated code chunk shows how to obtain
*apeglm* shrinkage estimates after running `DESeq`. See the DESeq2 vignette for
more details. The `lfcShrink` wrapper function takes care of many
details below, and unifies the interface for multiple shrinkage
estimators. The coefficient to shrink can be specified either by name
or by number (following the order in `resultsNames(dds)`). Be aware
that *DESeq2*'s `lfcShrink` interface provides LFCs on the log2 scale,
while *apeglm* provides coefficients on the natural log scale.

```{r, eval=FALSE}
res <- lfcShrink(dds, coef=2, type="apeglm")
```
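
For example, assuming `resultsNames(dds)` contains a coefficient named
`"dex_trt_vs_untrt"` (an illustrative name; check `resultsNames(dds)` for the
exact one in your analysis), the equivalent call by name would be:

```{r, eval=FALSE}
# equivalent call, specifying the coefficient to shrink by name;
# the name is illustrative -- check resultsNames(dds) for the exact one
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
```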

# Acknowledgments

Joshua Zitovsky contributed fast C++ code for the beta-binomial
likelihood, demonstrated in a later section of the vignette.

We have benefited in the development of `apeglm` from feedback or
contributions from the following individuals:

Wolfgang Huber,
Cecile Le Sueur,
Charlotte Soneson

# Example RNA-seq analysis

Here we show example code which mimics what will happen inside the
`lfcShrink` function when using the *apeglm* method [@Zhu2018]. 

Load a prepared `SummarizedExperiment`:

```{r}
library(airway)
data(airway)
head(assay(airway))
```

For demonstration, we will use 3000 genes of the `airway` dataset:
the first 3000 of the genes having a total count of at least 10 across all samples.

```{r}
keep <- head(which(rowSums(assay(airway)) >= 10), 3000)
airway <- airway[keep,]
```

First run a *DESeq2* differential expression analysis [@Love2014]
(size factors and dispersion estimates could similarly
be estimated using *edgeR*):

```{r}
library(DESeq2)
dds <- DESeqDataSet(airway, ~cell + dex)
dds$dex <- relevel(dds$dex, "untrt")
dds <- DESeq(dds)
res <- results(dds)
```

We now define the data and parameter objects necessary for `apeglm`.
We must multiply the coefficients and standard errors from *DESeq2* by `log(2)`,
because *apeglm* provides coefficients on the natural log scale. Again,
this would be handled inside of `lfcShrink` in *DESeq2* for a typical
RNA-seq analysis.

```{r}
x <- model.matrix(design(dds), colData(dds))
param <- dispersions(dds)
mle <- log(2) * cbind(res$log2FoldChange, res$lfcSE)
offset <- matrix(log(sizeFactors(dds)),
                 ncol=ncol(dds),
                 nrow=nrow(dds),byrow=TRUE)
```

# Running apeglm

Here `apeglm` on 3000 genes takes a few seconds on a laptop.
It scales with the number of genes, the number of samples, and the number
of variables in the design formula, where here we have 5 coefficients
(an intercept, three for the cell culture differences, and one for the
difference due to dexamethasone treatment).
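
For reference, the five coefficients can be listed by name:

```{r}
# coefficients estimated by DESeq() above
resultsNames(dds)
```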

We provide `apeglm` with the *SummarizedExperiment*, although the
function can also run on a *matrix* of counts or other observed data.
We specify a `coef` as well as a `threshold`, which we discuss
below. Note that we multiply the `threshold` by `log(2)` to convert
from the log2 scale to the natural log scale.

The original interface to `apeglm` looked as follows, but it was slow;
we therefore recommend using the faster interfaces shown below.

```{r eval=FALSE}
# original interface, also may give Lapack error
fit <- apeglm(Y=airway, x=x, log.lik=logLikNB, param=param, coef=ncol(x),
              threshold=log(2) * 1, mle=mle, offset=offset)
```

There are better, faster implementations of apeglm specifically for
negative binomial likelihoods. The version `nbinomR` is ~5 times
faster than the default `method="general"`. 

We will use this run for downstream analysis. Here we pass the
*SummarizedExperiment*, which will pass back the ranges. For faster
`apeglm` runs, provide only the counts matrix (as seen in subsequent
chunks).

```{r}
library(apeglm)
system.time({
  fitR <- apeglm(Y=airway, x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomR")
})
fit <- fitR
names(fit)
str(fit$prior.control)
```

The version `nbinomCR` is ~10 times faster than the default
`general`. 

```{r}
system.time({
  fitCR <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomCR")
})
```

The version `nbinomC` returns only the MAP coefficients and
can be ~50-100 times faster than the default `general`. The MAP
coefficients are the same as those returned by `nbinomCR` above; we just
skip the calculation of the posterior SD. A variant of `nbinomC` is
`nbinomC*`, which includes random starts.

```{r}
system.time({
  fitC <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                 threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC")
})
```
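
The `nbinomC*` variant with random starts would be called in the same way,
only changing the `method` argument (a sketch, not evaluated here):

```{r, eval=FALSE}
# like nbinomC (MAP coefficients only), but with random starts
fitCstar <- apeglm(Y=assay(airway), x=x, log.lik=NULL, param=param, coef=ncol(x),
                   threshold=log(2) * 1, mle=mle, offset=offset, method="nbinomC*")
```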

Among other output, we have the estimated coefficients attached to the
ranges of the *SummarizedExperiment* used as input:

```{r}
class(fit$ranges)
mcols(fit$ranges, use.names=TRUE)
```

We can compare the coefficients from *apeglm* with those of the `"normal"`
shrinkage type from the original *DESeq2* paper [@Love2014]. This method,
which makes use of a Normal-based prior, is no longer the default
shrinkage estimator for `lfcShrink`.
*apeglm* provides coefficients on the natural log scale, so we must
convert to log2 scale by multiplying by `log2(exp(1))`. Note that
*DESeq2*'s `lfcShrink` function converts *apeglm* coefficients to the log2
scale internally.

```{r}
system.time({
  res.shr <- lfcShrink(dds, coef=5, type="normal")
})
DESeq2.lfc <- res.shr$log2FoldChange
apeglm.lfc <- log2(exp(1)) * fit$map[,5]
```

Here we plot *apeglm* estimators against *DESeq2*:

```{r apevsdeseq}
plot(DESeq2.lfc, apeglm.lfc)
abline(0,1)
```

Here we plot *MLE*, *DESeq2* and *apeglm* estimators against the mean
of normalized counts:

```{r maplots, fig.width=9, fig.height=3.5}
par(mfrow=c(1,3))
lims <- c(-8,8)
hline <- function() abline(h=c(-4:4 * 2),col=rgb(0,0,0,.2))
xlab <- "mean of normalized counts"
plot(res$baseMean, res$log2FoldChange, log="x",
     ylim=lims, main="MLE", xlab=xlab)
hline()
plot(res$baseMean, DESeq2.lfc, log="x",
     ylim=lims, main="DESeq2", xlab=xlab)
hline()
plot(res$baseMean, apeglm.lfc, log="x",
     ylim=lims, main="apeglm", xlab=xlab)
hline()
```

# Specific coefficients

When a specific `coef` is supplied, as above, `apeglm` also returns the
local false sign rate (FSR) for that coefficient: the per-gene posterior
probability that the sign of the effect has been estimated incorrectly.
Note that p-values and FSR define different events, and
are not on the same scale. An FSR of 0.5 means that the
estimated sign is as bad as a random guess.

```{r pvalcomp, fig.width=8, fig.height=4}
par(mfrow=c(1,2),mar=c(5,5,1,1))
plot(res$pvalue, fit$fsr, col="blue",
     xlab="DESeq2 pvalue", ylab="apeglm local FSR",
     xlim=c(0,1), ylim=c(0,.5))
abline(0,1)
plot(-log10(res$pvalue), -log10(fit$fsr),
     xlab="-log10 DESeq2 pvalue", ylab="-log10 apeglm local FSR",
     col="blue")
abline(0,1)
```

The s-value was proposed by @Stephens2016 as a statistic giving the
aggregate false sign rate for the set of tests with s-values equal to
or lower than the one considered.
We recommend using a lower threshold on s-values than is typically
used for adjusted p-values; for example, one might
be interested in sets with an aggregate FSR of 0.01 or 0.005.

```{r pvalcomp2}
plot(res$padj, fit$svalue, col="blue",
     xlab="DESeq2 padj", ylab="apeglm svalue",
     xlim=c(0,.2), ylim=c(0,.02))
```
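
For example, a simple way to define gene sets with a bounded aggregate FSR
is to threshold the s-values directly:

```{r}
# sizes of gene sets with aggregate FSR bounded at 0.01 and 0.005
sum(fit$svalue < .01, na.rm=TRUE)
sum(fit$svalue < .005, na.rm=TRUE)
```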

More scrutiny can be applied by using an LFC threshold greater
than zero, and asking for the probability of a "false-sign-or-small"
(FSOS) event:
that the effect size is not further from zero in distance than
the threshold amount. Because we specified `threshold=log(2) * 1` above,
`apeglm` outputs a vector `thresh` in the results list that gives the
per-gene probabilities of false-sign-or-small events. We can then run
the `svalue` function on these per-gene probabilities to produce
s-values that bound the FSOS rate for sets of genes.

```{r maplotthresh}
s.val <- svalue(fit$thresh)
cols <- ifelse(s.val < .01, "red", "black")
keep <- res$baseMean > 10 # filter for plot
plot(res$baseMean[keep],
     log2(exp(1)) * fit$map[keep,5],
     log="x", col=cols[keep],
     xlab=xlab, ylab="LFC")
abline(h=c(-1,0,1), col=rgb(1,0,0,.5), lwd=2)
```

# Modeling zero-inflated counts

We have created a separate GitHub repository giving an example of how
the *apeglm* estimator can be used for Zero-Inflated Negative Binomial
data. This approach uses the *zinbwave* method and Bioconductor
package to estimate the probability of each zero belonging to the
excess zero component. We compare using a Negative Binomial likelihood
with the excess zeros down-weighted and using a Zero-Inflated Negative
Binomial likelihood. These two approaches with *apeglm* perform
similarly but we note that the first approach involves less additional
code and is faster to compute.

<https://github.com/mikelove/zinbwave-apeglm>

# Modeling ratios of counts 

We also show a short example using an alternative likelihood
to the negative binomial. Suppose we have allele-specific counts
for n=20 vs 20 samples across the 3000 genes used above. We can define a
beta-binomial model and test for differences in allelic balance across
the groups of samples.

Here we will *simulate* allele counts from our existing dataset
for demonstration. We spike in 10 genes with strong allelic imbalance
(instead of an allelic ratio close to 0.5, these will have a ratio of
0.75). 

```{r}
library(emdbook)
n <- 20
f <- factor(rep(1:2,each=n))
mu <- ifelse(res$baseMean > 50, res$baseMean, 50)
set.seed(1)
cts <- matrix(rnbinom(nrow(dds)*2*n, 
                      mu=mu,
                      size=1/dispersions(dds)),
              ncol=2*n)
theta <- runif(nrow(cts),1,1000)
prob <- rnorm(nrow(cts),.5,.05) # close to 0.5
ase.cts <- matrix(rbetabinom(prod(dim(cts)), prob=prob,
                             size=cts, theta=rep(theta,ncol(cts))),
                  nrow=nrow(cts))
idx <- 1:10
idx2 <- which(f == 2)
theta[idx] <- 1000
prob[idx] <- 0.75
# the spiked in genes have an allelic ratio of 0.75
ase.cts[idx,idx2] <- matrix(rbetabinom(length(idx)*length(idx2), prob=prob[idx],
                                       size=cts[idx,idx2], theta=theta[idx]),
                            nrow=length(idx))
```

We first need to estimate MLE coefficients and standard errors.

One option for running `apeglm` would be to define a beta-binomial
likelihood function which uses the total counts as a parameter and
the logit function as a link. This function could then be provided
to the `log.lik` argument.

```{r}
betabinom.log.lik <- function(y, x, beta, param, offset) {
  xbeta <- x %*% beta
  p.hat <- (1+exp(-xbeta))^-1
  dbetabinom(y, prob=p.hat, size=param[-1], theta=param[1], log=TRUE)
}
```
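
For reference, this R-defined likelihood could be supplied to `apeglm` as in
the sketch below (not evaluated here, since the C++ implementation used in the
following chunks is faster). The `param` matrix bundles a per-gene dispersion
with the per-sample total counts, matching how `betabinom.log.lik` unpacks it.

```{r, eval=FALSE}
# sketch: MLE fit with the R-defined beta-binomial likelihood
theta.hat <- 100 # rough initial estimate of the beta-binomial dispersion
param <- cbind(theta.hat, cts)
fit.mle <- apeglm(Y=ase.cts, x=model.matrix(~f), log.lik=betabinom.log.lik,
                  param=param, no.shrink=TRUE, log.link=FALSE)
```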

However, `apeglm` has faster C++ implementations for the beta-binomial,
which were implemented and tested by Joshua Zitovsky. Here, using
`method="betabinCR"` is ~3 times faster than using the R-defined
`log.lik`, and the `"betabinCR"` method also scales significantly
better with more samples and more coefficients. As with the negative
binomial, `method="betabinC"` can be used if the standard errors are
not needed, and this will be ~5 times faster than the R-defined
`log.lik` approach on this dataset.

The following code performs a few iterations, alternating between
estimating the MLE coefficients and estimating the beta-binomial
dispersion.

```{r}
theta.hat <- 100 # rough initial estimate of dispersion
x <- model.matrix(~f)
niter <- 3
system.time({
  for (i in 1:niter) {
    param <- cbind(theta.hat, cts)
    fit.mle <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                      no.shrink=TRUE, log.link=FALSE, method="betabinCR")
    theta.hat <- bbEstDisp(success=ase.cts, size=cts,
                           x=x, beta=fit.mle$map,
                           minDisp=.01, maxDisp=5000)
  }
})
```

We can then plot the MLE estimates over the mean:

```{r mlemaplot}
coef <- 2
xlab <- "mean of normalized counts"
plot(res$baseMean, fit.mle$map[,coef], log="x", xlab=xlab, ylab="log odds")
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
```

Now we run the posterior estimation, including a prior on the second
coefficient:

```{r}
mle <- cbind(fit.mle$map[,coef], fit.mle$sd[,coef])
param <- cbind(theta.hat, cts)
system.time({
  fit2 <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                 coef=coef, mle=mle, threshold=0.5,
                 log.link=FALSE, method="betabinCR")
})
```
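
If the posterior standard errors are not needed, the `method="betabinC"`
variant mentioned above could be used instead; a minimal sketch reusing the
objects from the previous chunk (not evaluated here):

```{r, eval=FALSE}
# MAP coefficients only, skipping the posterior SD calculation
fit2C <- apeglm(Y=ase.cts, x=x, log.lik=NULL, param=param,
                coef=coef, mle=mle, threshold=0.5,
                log.link=FALSE, method="betabinC")
```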

In the `apeglm` plot, we color in red the genes with a low aggregate
probability of false-sign-or-small (FSOS) events (s-value < .01), where
we've again defined "small" on the log odds scale using the
`threshold` argument above.

```{r asemaplot, fig.width=8, fig.height=4}
par(mfrow=c(1,2))
ylim <- c(-1,1.5)
s.val <- svalue(fit2$thresh) # small-or-false-sign value
plot(res$baseMean, fit.mle$map[,coef], main="MLE",
     log="x", xlab=xlab, ylab="log odds", ylim=ylim)
points(res$baseMean[idx], fit.mle$map[idx,coef], col="dodgerblue", cex=3)
abline(h=0,col=rgb(1,0,0,.5))
cols <- ifelse(s.val < .01, "red", "black")
plot(res$baseMean, fit2$map[,coef], main="apeglm",
     log="x", xlab=xlab, ylab="log odds", col=cols, ylim=ylim)
points(res$baseMean[idx], fit2$map[idx,coef], col="dodgerblue", cex=3)
abline(h=0,col=rgb(1,0,0,.5))
```

Finally, we can check whether the genes detected at an s-value < .01
truly have an absolute log odds larger than the `threshold` of 0.5 used
above (the spiked-in genes have a log odds of `logit(.75)`):

```{r}
logit <- function(x) log(x/(1-x))
logit(.75)
table(abs(logit(prob[s.val < .01])) > .5)
```

# Session Info

```{r}
sessionInfo()
```

# References