---
title: "Using eisaR for Exon-Intron Split Analysis (EISA)"
author: "Michael Stadler"
date: "`r Sys.Date()`"
bibliography: refs.bib
output: 
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Using eisaR for Exon-Intron Split Analysis (EISA)}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction
Exon-Intron Split Analysis has been described in [@eisa]. It consists of separately
quantifying exonic and intronic alignments in RNA-seq data, in order to measure
changes in mature RNA and pre-mRNA reads across different experimental conditions.
We have shown that this allows quantification of transcriptional and post-transcriptional
regulation of gene expression. 

The `eisaR` package contains convenience functions to facilitate the steps in an
exon-intron split analysis, which consists of:  
1. preparing the annotation (exonic and gene body coordinate ranges)  
2. quantifying RNA-seq alignments in exons and introns  
3. calculating and comparing exonic and intronic changes across conditions  
4. visualizing the results  

For the steps 1. and 2. above, this vignette makes use of Bioconductor annotation and
the `r Biocpkg("QuasR")` package. It is also possible to obtain count tables for exons and
introns using some other pipeline or approach, and directly start with step 3.

# Installation
To install the `eisaR` package, start R and enter:

```{r install_eisaR, eval=FALSE}
# BiocManager is needed to install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install eisaR
BiocManager::install("eisaR")
```


# Preparing the annotation{#annotation}
As mentioned, `eisaR` uses gene annotations from Bioconductor.
They are provided in the form of `TxDb` or `EnsDb` objects, e.g. via packages such as `r Biocpkg("TxDb.Mmusculus.UCSC.mm10.knownGene")` or `r Biocpkg("EnsDb.Hsapiens.v86")`.
You can see available annotations using the following code:  

```{r availableOnline, eval=FALSE}
pkgs <- c(BiocManager::available("TxDb")
          BiocManager::available("EnsDb"))
```

If you would like to use an alternative source of gene annotations, you might
still be able to use `eisaR` by first converting your annotations into a `TxDb`
or an `EnsDb` (for creating a `TxDb` see `makeTxDb` in the `r Biocpkg("GenomicFeatures")`
package, for creating an `EnsDb` see `makeEnsembldbPackage` in the `r Biocpkg("ensembldb")`
package).

For this example, `eisaR` contains a small `TxDb` to illustrate how regions are extracted.
We will load it from a file. Alternatively, the object would be loaded using `library(...)`,
for example using `library(TxDb.Mmusculus.UCSC.mm10.knownGene)`.

```{r annotation, message=FALSE}
# load package
library(eisaR)

# get TxDb object
txdbFile <- system.file("extdata", "hg19sub.sqlite", package = "eisaR")
txdb <- AnnotationDbi::loadDb(txdbFile)
```

Exon and gene body regions are then extracted from the `TxDb`:  

```{r regions}
# extract filtered exonic and gene body regions
regS <- getRegionsFromTxDb(txdb = txdb, strandedData = TRUE)
regU <- getRegionsFromTxDb(txdb = txdb, strandedData = FALSE)

lengths(regS)
lengths(regU)

regS$exons
```

As you can see, the filtering procedure removes slightly more genes for unstranded data
(`strandedData = FALSE`), as overlapping genes cannot be discriminated even if
they reside on opposite strands.

You can also export the obtained regions into files. This may be useful if
you plan to align and/or quantify reads outside of R. For example, you can use
`r Biocpkg("rtracklayer")` to export the regions in `regS` into `.gtf` files: 

```{r exportregions}
library(rtracklayer)
export(regS$exons, "hg19sub_exons_stranded.gtf")
export(regS$genebodies, "hg19sub_genebodies_stranded.gtf")
```

# Quantify RNA-seq alignments in exons and introns
For this example we will use the `r Biocpkg("QuasR")` package for indexing and
alignment of short reads, and a small RNA-seq dataset that is contained in that
package. As mentioned, it is also possible to align or also quantify your reads
using an alternative aligner/counter, and skip over these steps.

## Align reads{#align}
Let's first copy the sample data from the `r Biocpkg("QuasR")` package to the
current working directory, all contained in a folder named `extdata`:  

```{r extdata}
library(QuasR)
file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE)
```

We next align the reads to a mini-genome (fasta file `extdata/hg19sub.fa`) using
`qAlign`:  

```{r align}
sampleFile <- "extdata/samples_chip_single.txt"
genomeFile <- "extdata/hg19sub.fa"

proj <- qAlign(sampleFile = "extdata/samples_rna_single.txt", 
               genome = "extdata/hg19sub.fa",
               aligner = "Rhisat2", splicedAlignment = TRUE)
alignmentStats(proj)
```

## Count alignments in exons and gene bodies{#count}
Alignments in exons and gene bodies can now be counted using `qCount` and the
`regU` that we have generated earlier (assuming that the data is unstranded).
Intronic counts can then be obtained from the difference between gene bodies and
exons:  

```{r count}
cntEx <- qCount(proj, regU$exons, orientation = "any")
cntGb <- qCount(proj, regU$genebodies, orientation = "any")
cntIn <- cntGb - cntEx
head(cntEx)
head(cntIn)
```

As mentioned, both alignments and counts can also be obtained using alternative
approaches. It is required that the two resulting exon and intron count tables
have identical structure (genes in rows, samples in columns, the same order of
rows and columns in both tables).

## Load full count tables
The above example only contains very few genes. For the rest of the vignette,
we will use count tables from a real RNA-seq experiment that are provided in the
`eisaR` package. The counts correspond to the rawdata used in Figure 3a of [@eisa]
and are also available online from the [supplementary material](https://fmicompbio.github.io/projects/EISA/EISA.html): 

```{r loadcounts}
cntEx <- readRDS(system.file("extdata",
                             "Fig3abc_GSE33252_rawcounts_exonic.rds",
                             package = "eisaR"))
cntIn <- readRDS(system.file("extdata",
                             "Fig3abc_GSE33252_rawcounts_intronic.rds",
                             package = "eisaR"))
```

# Run EISA conveniently{#convenient}
All the further steps in exon-intron split analysis can now be performed using
a single function `runEISA`. If you prefer to perform the analysis step-by-step,
you can skip now to section \@ref(stepwise).

```{r runEISA}
# remove "width" column
Rex <- cntEx[, colnames(cntEx) != "width"]
Rin <- cntIn[, colnames(cntIn) != "width"]

# create condition factor (contrast will be TN - ES)
cond <- factor(c("ES", "ES", "TN", "TN"))

# run EISA
res <- runEISA(Rex, Rin, cond)
```

## Alternative implementations of EISA
There are five arguments in `runEISA` (`modelSamples`, `geneSelection`, `effects`,
`statFramework` and `pscnt`) that control gene filtering, calculation of contrasts
and the statistical method used, summarized in the bullet list below.

The default values of these arguments correspond to the currently recommended way
of running EISA. You can also run EISA exactly as it was described in [@eisa], by
setting `method = "Gaidatzis2015"`. This will override the values of the five
other arguments and set them according to the published algorithm (as indicated
below). 

* `modelSamples`: Account for individual samples in statistical model? Possible values are:  
  - `FALSE` (default and `method="Gaidatzis2015"`): use a model of the form `~ condition * region`  
  - `TRUE`: use a model of the form `~ sample + condition * region`  

* `geneSelection`: How to select detected genes. Possible values are:  
  - `"filterByExpr"` (default): First, counts are normalized using `edgeR::calcNormFactors`,
    treating intonic and exonic counts as individual samples. Then, the 
    `edgeR::filterByExpr` function is used with default parameters to select
    quantifyable genes.
  - `"none"`: This will use all the genes provided in the count tables, assuming
    that an appropriate selection of quantifyable genes has already been done.  
  - `"Gaidatzis2015"` (`method="Gaidatzis2015"`): First, intronic and exonic counts
    are linearly scaled to the mean library size (estimated as the sum of all intronic
    or exonic counts, respectively). Then, quantifyable genes are selected as the
    genes with counts `x` that fulfill `log2(x + 8) > 5` in both exons and introns.  
    
* `statFramework`: The framework within `edgeR` that is used for the statistical analysis.
  Possible values are:  
  - `"QLF"` (default): quasi-likelihood F-test using `edgeR::glmQLFit` and
    `edgeR::glmQLFTest`. This framework is highly recommended as it gives stricter
    error rate control by accounting for the uncertainty in dispersion estimation.  
  - `"LRT"` (`method="Gaidatzis2015"`): likelyhood ratio test using `edgeR::glmFit`
    and `edgeR::glmLRT`.  
  
* `effects`: How the effects (log2 fold-changes) are calculated. Possible values are:  
  - `"predFC"` (default): Fold-changes are calculated using the fitted model with
    `edgeR::predFC` and the value provided to `pscnt`. Please note that if a
    sample factor is included in the statistical model (`modelSamples=TRUE`), 
    effects cannot be obtained from that model. In that case, effects are obtained
    from a simpler model without sample effects.
  - `"Gaidatzis2015"` (`method="Gaidatzis2015"`): Fold-changes are calculated
    using the formula `log2((x + pscnt)/(y + pscnt))`. If `pscnt` is not set to 8,
    `runEISA` will warn that this deviates from the method used in Gaidatzis et al., 2015.
    
* `pscnt`: The pseudocount that is added to normalized counts before log transformation.
  For `geneSelection="Gaidatzis2015"`, `pscnt` is used both in gene selection as well as
  in the calculation of log2 fold-changes. Otherwise, `pscnt` is only used in the calculation
  of log2 fold-changes in `edgeR::predFC(, prior.count = pscnt)`.

While different values for these arguments typically yield similar results,
the defaults are often less stringent compared to `method="Gaidatzis2015"` when
selecting quantifyable genes, but more stringent when calling significant changes
(especially with low numbers of replicates).

Here is an illustration of how the results differ `method="Gaidatzis2015"` and
the defaults:  

```{r compare}
res1 <- runEISA(Rex, Rin, cond, method = "Gaidatzis2015")
res2 <- runEISA(Rex, Rin, cond)

# number of quantifyable genes
nrow(res1$DGEList)
nrow(res2$DGEList)

# number of genes with significant post-transcriptional regulation
sum(res1$tab.ExIn$FDR < 0.05)
sum(res2$tab.ExIn$FDR < 0.05)

# method="Gaidatzis2015" results contain most of default results
summary(rownames(res2$contrasts)[res2$tab.ExIn$FDR < 0.05] %in%
        rownames(res1$contrasts)[res1$tab.ExIn$FDR < 0.05])

# comparison of deltas
ids <- intersect(rownames(res1$DGEList), rownames(res2$DGEList))
cor(res1$contrasts[ids,"Dex"], res2$contrasts[ids,"Dex"])
cor(res1$contrasts[ids,"Din"], res2$contrasts[ids,"Din"])
cor(res1$contrasts[ids,"Dex.Din"], res2$contrasts[ids,"Dex.Din"])
plot(res1$contrasts[ids,"Dex.Din"], res2$contrasts[ids,"Dex.Din"], pch = "*",
     xlab = expression(paste(Delta, "exon", - Delta, "intron for method='Gaidatzis2015'")),
     ylab = expression(paste(Delta, "exon", - Delta, "intron for default parameters")))
```

## On the estimation of interactions in a split-plot design experiment
The calculation of the significance of interactions (here whether the fold-changes
differ between exonic or intronic data) is well defined for experimental designs
were all samples are independent from one another. Within EISA, this is not the
case (each sample yields two data points, on for exons and one for introns). That
results in a dependency between datapoints: If a sample is affected by a problem
in the experiment, it might at the same time give rise to outliers in both exonic
and intronic counts.

In statistics, such an experimental design is often referred to as a split-plot
design, and a recommended way to analyze interactions in such experiments would
be to use a mixed effect model with the plot (in our case, the sample) as a random
effect. The disadvantage here however would be that available packages for mixed
effect models are not designed for count data, and we therefore use an alternative
approach to explicitely model the sample dependency, by introducing a sample
factor into the design matrix (for `modelSamples=TRUE`). That sample factor is
nested in the condition factor (no sample can belong to more than one condition),
and therefore not all coefficients can be estimated. We therefore drop one of
the columns of the design matrix (identified by `limma::nonEstimable`).

This has no impact on the effects (the log2 fold-changes of `modelSamples=TRUE`
and `modelSamples=FALSE` are nearly identical). However, in the presence of sample effects,
`modelSamples=TRUE` increases the sensitivity of detecting genes with significant
interactions. Here is a comparison of the EISA results with and without accounting
for the sample in the model:  

```{r modelSamples}
res3 <- runEISA(Rex, Rin, cond, modelSamples = FALSE)
res4 <- runEISA(Rex, Rin, cond, modelSamples = TRUE)
ids <- intersect(rownames(res3$contrasts), rownames(res4$contrasts))

# number of genes with significant post-transcriptional regulation
sum(res3$tab.ExIn$FDR < 0.05)
sum(res4$tab.ExIn$FDR < 0.05)

# modelSamples=TRUE results are a super-set of
# modelSamples=FALSE results
summary(rownames(res3$contrasts)[res3$tab.ExIn$FDR < 0.05] %in%
        rownames(res4$contrasts)[res4$tab.ExIn$FDR < 0.05])

# comparison of contrasts
diag(cor(res3$contrasts[ids, ], res4$contrasts[ids, ]))
plot(res3$contrasts[ids, 3], res4$contrasts[ids, 3], pch = "*",
     xlab = "Interaction effects for modelSamples=FALSE",
     ylab = "Interaction effects for modelSamples=TRUE")

# comparison of interaction significance
plot(-log10(res3$tab.ExIn[ids, "FDR"]), -log10(res4$tab.ExIn[ids, "FDR"]), pch = "*",
     xlab = "-log10(FDR) for modelSamples=FALSE",
     ylab = "-log10(FDR) for modelSamples=TRUE")
abline(a=0, b=1, col="gray")
legend("bottomright", "y = x", bty = "n", lty = 1, col = "gray")
```


# Visualize EISA results{#plot}
We can now visualize the results by plotting intronic changes versus exonic
changes (genes with signficant interactions, which are likely to be post-transcriptionally
regulated, are color coded):  
```{r plotEISA}
plotEISA(res)
```

# Run EISA step-by-step{#stepwise}
As an alternative to `runEISA` (section \@ref(convenient)) and `plotEISA`
(section \@ref(plot)) described above, you can also analyse the data step-by-step
as described in [@eisa]. This may be preferrable to understand each
individual step and be able to access intermediate results.

The results obtained in this way are identical to what you get with
`runEISA(..., method = "Gaidatzis2015")`, so if you are happy with `runEISA` you can
skip the rest of the vignette.

## Normalization
Normalization is performed separately on exonic and intronic counts, assuming
that varying exon over intron ratios between samples are of technical origin.
```{r normalization}
# remove column "width"
Rex <- cntEx[,colnames(cntEx) != "width"]
Rin <- cntIn[,colnames(cntIn) != "width"]
Rall <- Rex + Rin
fracIn <- colSums(Rin)/colSums(Rall)
summary(fracIn)

# scale counts to the mean library size,
# separately for exons and introns
Nex <- t(t(Rex) / colSums(Rex) * mean(colSums(Rex)))
Nin <- t(t(Rin) / colSums(Rin) * mean(colSums(Rin)))

# log transform (add a pseudocount of 8)
NLex <- log2(Nex + 8)
NLin <- log2(Nin + 8)
```

## Identify quantifyable genes
Genes with very low counts in either exons or introns are better removed, as
they will be too noisy to yield useful results. We use here a fixed cut-off
on the normalized, log-transformed intron and exonic counts:  
```{r quantgenes}
quantGenes <- rownames(Rex)[ rowMeans(NLex) > 5.0 & rowMeans(NLin) > 5.0 ]
length(quantGenes)
```

## Calculate $\Delta I$, $\Delta E$ and $\Delta E - \Delta I$
The count tables were obtained from a total RNA-seq experiments in mouse
embryonic stem (MmES) cells and derived terminal neurons (MmTN), with two replicates
for each condition.

We will now calculate the changes between neurons and ES cells in introns ($\Delta I$),
in exons ($\Delta E$), and the difference between the two ($\Delta E - \Delta I$):  

```{r dIdE}
Dex <- NLex[,c("MmTN_RNA_total_a","MmTN_RNA_total_b")] - NLex[,c("MmES_RNA_total_a","MmES_RNA_total_b")]
Din <- NLin[,c("MmTN_RNA_total_a","MmTN_RNA_total_b")] - NLin[,c("MmES_RNA_total_a","MmES_RNA_total_b")]
Dex.Din <- Dex - Din

cor(Dex[quantGenes,1], Dex[quantGenes,2])
cor(Din[quantGenes,1], Din[quantGenes,2])
cor(Dex.Din[quantGenes,1], Dex.Din[quantGenes,2])
```
Both exonic and intronic changes are correlated across replicates, and so are the
differences, indicating a reproducible signal for post-transcriptional regulation.

## Statistical analysis
Finally, we use the replicate measurement in the `r Biocpkg("edgeR")` framework to
calculate the significance of the changes:  
```{r sig}
# create DGEList object with exonic and intronic counts
library(edgeR)
cnt <- data.frame(Ex = Rex, In = Rin)
y <- DGEList(counts = cnt, genes = data.frame(ENTREZID = rownames(cnt)))

# select quantifyable genes and normalize
y <- y[quantGenes, ]
y <- calcNormFactors(y)

# design matrix with interaction term
region <- factor(c("ex","ex","ex","ex","in","in","in","in"), levels = c("in", "ex"))
cond <- rep(factor(c("ES","ES","TN","TN")), 2)
design <- model.matrix(~ region * cond)
rownames(design) <- colnames(cnt)
design

# estimate model parameters
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

# calculate likelihood-ratio between full and reduced models
lrt <- glmLRT(fit)

# create results table
tt <- topTags(lrt, n = nrow(y), sort.by = "none")
head(tt$table[order(tt$table$FDR, decreasing = FALSE), ])
```

## Visualize the results
Finally, we visualize the results by plotting intronic changes versus exonic
changes (genes with signficant interactions, which are likely to be post-transcriptionally
regulated, are color coded):  
```{r plot}
sig     <- tt$table$FDR < 0.05
sum(sig)
sig.dir <- sign(tt$table$logFC[sig])
cols <- ifelse(sig, ifelse(tt$table$logFC > 0, "#E41A1CFF", "#497AB3FF"), "#22222244")

# volcano plot
plot(tt$table$logFC, -log10(tt$table$FDR), col = cols, pch = 20,
     xlab = expression(paste("RNA change (log2 ",Delta,"exon/",Delta,"intron)")),
     ylab = "Significance (-log10 FDR)")
abline(h = -log10(0.05), lty = 2)
abline(v = 0, lty = 2)
text(x = par("usr")[1] + 3 * par("cxy")[1], y = par("usr")[4], adj = c(0,1),
     labels = sprintf("n=%d", sum(sig.dir == -1)), col = "#497AB3FF")
text(x = par("usr")[2] - 3 * par("cxy")[1], y = par("usr")[4], adj = c(1,1),
     labels = sprintf("n=%d", sum(sig.dir ==  1)), col = "#E41A1CFF")

# Delta I vs. Delta E
plot(rowMeans(Din)[quantGenes], rowMeans(Dex)[quantGenes], pch = 20, col = cols,
     xlab = expression(paste(Delta,"intron (log2 TN/ES)")),
     ylab = expression(paste(Delta,"exon (log2 TN/ES)")))
legend(x = "bottomright", bty = "n", pch = 20, col = c("#E41A1CFF","#497AB3FF"),
       legend = sprintf("%s (%d)", c("Up","Down"), c(sum(sig.dir == 1), sum(sig.dir == -1))))
```

# Session information
The output in this vignette was produced under:

```{r sessionInfo}
sessionInfo()
```

# References