---
title: "RNA-seq workflow: gene-level exploratory analysis and differential expression"
subtitle: CSAMA 2019 version
author:
- name: Michael I. Love
  affiliation:
  - Department of Biostatistics, UNC-Chapel Hill, Chapel Hill, NC, US
  - Department of Genetics, UNC-Chapel Hill, Chapel Hill, NC, US
- name: Charlotte Soneson
  affiliation:
  - Friedrich Miescher Institute for Biomedical Research, Switzerland
  - SIB Swiss Institute of Bioinformatics, Switzerland
- name: Simon Anders
  affiliation: Centre for Molecular Biology, Univ. Heidelberg (ZMBH), Germany
- name: Vladislav Kim
  affiliation: &EMBL European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- name: Johannes Rainer
  affiliation: Institute for Biomedicine, Eurac Research, Bolzano, Italy
- name: Wolfgang Huber
  affiliation: *EMBL
date: 22 July 2019
output: BiocStyle::html_document
bibliography: bibliography.bib
vignette: >
  %\VignetteIndexEntry{RNA-seq workflow at the gene level}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---
```{r style, echo=FALSE, message=FALSE, warning=FALSE, results="asis"}
library("BiocStyle")
library("knitr")
library("rmarkdown")
opts_chunk$set(message=FALSE, error=FALSE, warning=FALSE,
cache=TRUE, fig.width=5, fig.height=5, eval=FALSE)
```
# Abstract
We walk through an end-to-end gene-level RNA-seq differential
expression workflow using Bioconductor packages. We will start from
the FASTQ files, show how these were quantified with respect to a
reference transcriptome, and prepare a count matrix which tallies the
number of RNA-seq fragments mapped to each gene for each sample. We will
perform exploratory data analysis (EDA) for quality assessment and to
explore the relationship between samples, perform differential gene
expression analysis (DGE) and visually explore the results.
# Introduction
Bioconductor has many packages which support analysis of
high-throughput sequence data, including RNA sequencing (RNA-seq). The
packages which we will use in this workflow include core packages
maintained by the Bioconductor core team for storing genomic data sets
and working with gene annotations [@Huber2015Orchestrating]. We will
also use contributed packages for statistical analysis and
visualization of sequencing data.
![A conductor ensures the orchestra plays in harmony. [(source)](https://commons.wikimedia.org/wiki/File:Tomohiro_Oura_conductor.jpg)](img/320px-Tomohiro_Oura_conductor.jpg)
Through scheduled releases every 6
months, the Bioconductor project ensures that all the packages within
a release will work together in harmony (hence the "conductor"
metaphor). The packages used in this workflow are loaded with the
*library* function and can be installed by following the Bioconductor
[installation instructions](https://bioconductor.org/install/#install-bioconductor-packages).
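For example, a minimal sketch of installing the packages loaded later in this
document (assuming R >= 3.5 and the *BiocManager* package from CRAN; the list
can be trimmed to the parts of the workflow you plan to run):
```{r install-sketch, eval=FALSE}
## Sketch of package installation (run once, not at every session).
install.packages("BiocManager")
BiocManager::install(c("tximport", "tximeta", "DESeq2", "edgeR",
                       "ensembldb", "AnnotationHub", "vsn", "apeglm",
                       "IHW", "pheatmap", "RColorBrewer", "PoiClaClu",
                       "ggbeeswarm", "dplyr", "ggplot2", "hexbin"))
```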
A published (but essentially similar) version of this workflow,
including reviewer reports and comments is available
at [F1000Research](http://f1000research.com/articles/4-1070).
If you have questions about this workflow or any Bioconductor
software, please post these to the
[Bioconductor support site](https://support.bioconductor.org/).
If the questions concern a specific package, you can tag the post with
the name of the package, or for general questions about the workflow,
tag the post with `rnaseqgene`. Note the
[posting guide](https://www.bioconductor.org/help/support/posting-guide/)
for crafting an optimal question for the support site.
## Experimental data
The data used in this workflow is stored in an R package, *airway2*,
containing quantification data for eight RNA-seq samples.
The quantification files summarize an RNA-seq experiment
wherein airway smooth muscle cells were treated with dexamethasone, a
synthetic glucocorticoid steroid with anti-inflammatory effects
[@Himes2014RNASeq]. Glucocorticoids are used, for example,
by people with asthma to reduce inflammation of the airways. In the
experiment, four primary human airway smooth muscle cell lines were treated with 1
micromolar dexamethasone for 18 hours. For each of the four cell
lines, we have a treated and an untreated sample. For more description
of the experiment see the
[PubMed entry 24926665](http://www.ncbi.nlm.nih.gov/pubmed/24926665)
and for raw data see the
[GEO entry GSE52778](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778).
An alternative version of the workflow uses the `r Biocexptpkg("airway")`
package, but this version uses the newer
*airway2* because it contains *Salmon* quantification files, which
will be discussed later.
# Preparing count matrices
As input, the count-based statistical methods, such as
`r Biocpkg("DESeq2")` [@Love2014Moderated],
`r Biocpkg("edgeR")` [@Robinson2009EdgeR],
`r Biocpkg("limma")` with the voom method [@Law2014Voom],
`r Biocpkg("DSS")` [@Wu2013New],
`r Biocpkg("EBSeq")` [@Leng2013EBSeq] and
`r Biocpkg("baySeq")` [@Hardcastle2010BaySeq],
expect data as obtained, e.g., from RNA-seq or another
high-throughput sequencing experiment, in the form of a matrix of counts.
The value in the *i*-th row and the *j*-th column of
the matrix tells how many reads (or fragments, for paired-end RNA-seq)
have been assigned to gene *i* in sample *j*. Analogously, for other
types of assays, the rows of the matrix might correspond e.g., to
binding regions (with ChIP-Seq), species of bacteria (with metagenomic
datasets), or peptide sequences (with quantitative mass spectrometry).
The values in the matrix should be counts of sequencing
reads/fragments. This is important for the statistical models used by
*DESeq2* and *edgeR* to hold, as only counts allow assessing the
measurement precision correctly. It is important to not provide
counts that were pre-normalized for sequencing depth/library size, as
the statistical model is most powerful when applied to un-normalized
counts and is designed to account for library size differences
internally.
## Transcript abundances and the *tximport* pipeline
In this workflow, we will use transcript abundances as quantified by the
[Salmon](https://combine-lab.github.io/salmon/) [@Patro2017Salmon]
software package. *Salmon* and other methods, such as
[Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/) [@Patro2014Sailfish],
[kallisto](https://pachterlab.github.io/kallisto/) [@Bray2016Near], or
[RSEM](http://deweylab.github.io/RSEM/) [@Li2011RSEM],
estimate the relative abundances of all (known, annotated) transcripts
without aligning reads. Because estimating the abundance of the
transcripts involves an inference step, the counts are *estimated*.
Most methods either use a statistical framework called
Expectation-Maximization or Bayesian techniques to estimate the
abundances and counts.
Following quantification, we will use the `r Biocpkg("tximeta")` (or `r
Biocpkg("tximport")` [@Soneson2015Differential])
package for assembling estimated count and offset matrices for use
with Bioconductor differential gene expression packages.
The advantages of using the transcript abundance quantifiers in
conjunction with *tximeta*/*tximport* to produce gene-level count matrices and
normalizing offsets are: this approach corrects for any potential
changes in gene length across samples (e.g. from differential isoform
usage) [@Trapnell2013Differential]; some of these methods are
substantially faster and require less memory and disk usage compared
to alignment-based methods; and it is possible to avoid discarding
those fragments that can align to multiple genes with homologous
sequence [@Robert2015Errors]. Note that transcript abundance
quantifiers skip the generation of large files which store read
alignments (SAM or BAM files), instead producing smaller files which
store estimated abundances, counts and effective lengths per
transcript. For more details, see the manuscript describing this approach
[@Soneson2015Differential] and the `r Biocpkg("tximport")` package
vignette for software details.
A full tutorial on how to use the *Salmon* software for quantifying
transcript abundance can be
found [here](https://combine-lab.github.io/salmon/getting_started/),
but in the next section we will provide the specific command line code
that was used to quantify the eight samples that will be used in this workflow.
## *Salmon* quantification
![The Salmon software logo](img/salmonlogo.png)
We begin by providing *Salmon* with the sequence of all of the
reference transcripts, which we will call the *reference
transcriptome*. We used version 27 of the GENCODE human
transcripts, downloaded from
the [GENCODE website](https://www.gencodegenes.org/). As of writing,
GENCODE has already released newer versions, but for computational
reproducibility, all previous versions are available from the website.
On the command line, creating the transcriptome index looks like:
```
salmon index -i gencode.v27_salmon_0.8.2 -t gencode.v27.transcripts.fa.gz
```
The `0.8.2` refers to the version of *Salmon* that was used. As of writing,
*Salmon* is up to version 0.14.1, and it is generally best to use the
latest available version of the software.
To quantify an individual sample, `sample_01`, the following command
can be used:
```
salmon quant -i gencode.v27_salmon_0.8.2 -p 6 --libType A \
--gcBias --biasSpeedSamp 5 \
-1 sample_01_1.fastq.gz -2 sample_01_2.fastq.gz \
-o sample_01
```
In simple English, this command says to "quantify a sample using this
transcriptome index, with 6 threads, using automatic
[library type](http://salmon.readthedocs.io/en/latest/library_type.html) detection,
using GC bias correction (the bias speed part is no longer
needed with current versions of *Salmon*), here are the first and second
read, and use this output directory." The output directory will be
created if it doesn't exist, though if earlier parts of the path do
not exist, it will give an error. A single sample of human RNA-seq
usually takes ~5 minutes to quantify with GC bias correction enabled.
Rather than writing the above command on the command line, for
quantifying these eight samples, a simple
[R script](https://github.com/mikelove/airway2/blob/master/inst/scripts/salmon.R)
was written to perform the following *Salmon* calls from R, looping
over the samples. It is also possible to loop over files using a
bash loop, or more advanced workflow management
systems such as Snakemake [@Koster2012Snakemake] or Nextflow
[@Di2017Nextflow].
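As an illustration only, such a loop from R might look like the sketch below;
the sample names and index path are placeholders, not the actual values used
for this dataset (see the linked script for the real calls):
```{r salmon-loop-sketch, eval=FALSE}
## Hypothetical sketch: loop over sample names and call Salmon via system().
samples <- paste0("sample_", sprintf("%02d", 1:8))
for (s in samples) {
  cmd <- paste("salmon quant -i gencode.v27_salmon_0.8.2 -p 6 --libType A",
               "--gcBias",
               "-1", paste0(s, "_1.fastq.gz"),
               "-2", paste0(s, "_2.fastq.gz"),
               "-o", s)
  system(cmd)
}
```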
## Importing into R with *tximport*
Following quantification, we can use *tximport* to import the data
into R and perform statistical analysis using Bioconductor packages.
Normally, we would simply point *tximport* to the `quant.sf` files on
our machine. However, because we are distributing these files as part
of an R package, we have to take a few extra steps to figure out where
the R package, and hence the files, are located on *your* machine.
The output directories from the above *Salmon* quantification calls have been
stored in the `extdata` directory of the R package called *airway2*.
```{r eval=FALSE}
library("devtools")
install_github("mikelove/airway2")
```
The R function *system.file* can be used to find out where on your
computer the files from a package have been installed. Here we ask for
the full path to the `extdata` directory, where R packages store
external data, that is part of the *airway2* package.
```{r}
library("airway2")
dir <- system.file("extdata", package="airway2")
dir
list.files(dir)
```
The quantification directories are in the `quants` directory.
```{r}
list.files(file.path(dir,"quants"))
```
The identifiers used here are the *SRR* identifiers from the
[Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra). For this
experiment, we downloaded a table that links the *SRR* identifiers to
the sample information about each experiment. From this
[Run Selector](https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRP033351) view
of the experiment, we clicked the button: *RunInfo Table*. This
downloads a file called `SraRunTable.txt`. We included this file in
the *airway2* package, and we read it in now:
```{r}
coldata <- read.delim(file.path(dir, "SraRunTable.txt"))
colnames(coldata)
idx <- c("Run","cell_line","treatment")
coldata[,idx]
```
For simplicity we restrict to these three columns, and do a little
cleanup of the column names and the levels of the factors.
```{r}
coldata <- coldata[,idx]
colnames(coldata) <- c("run","cell","dex")
coldata$cell
coldata$dex
```
We can rename the levels of the `dex` column, but we have to be
careful that our new level names correspond 1-1 to the previous
levels. Here we replace `Dexamethasone` with `trt` and `Untreated`
with `untrt`. There is nothing wrong with the longer names, but
sometimes shorter factor levels are easier to work with, and as long
as the shorter levels can be understood by other analysts, it saves
keystrokes.
```{r}
levels(coldata$dex)
levels(coldata$dex) <- c("trt","untrt")
levels(coldata$dex)
```
**Note:** it is preferred in R that the first level of a factor be the
reference level (e.g. control, or untreated samples), so we can
*relevel* the `dex` factor like so:
```{r}
coldata$dex <- relevel(coldata$dex, "untrt")
```
Now we can create a named vector that points to the *Salmon*
quantification files. Because we have gzipped the quants for
distribution, we point to `quant.sf.gz`; normally this would just be
`quant.sf`. (The importing software can import directly from gzipped
files.)
```{r}
files <- file.path(dir,"quants",coldata$run,"quant.sf.gz")
names(files) <- coldata$run
all(file.exists(files))
```
Because we perform a gene-level analysis we need the mapping of transcripts to
genes for GENCODE v27. Since GENCODE uses Ensembl transcript identifiers, we use
the `r Biocpkg("ensembldb")` package [@Rainer:2019jd] to extract these from one
of the available annotation databases. Pre-built `EnsDb` annotation databases
for all species for a large number of Ensembl releases are available through the
`r Biocpkg("AnnotationHub")` resource. We thus load below an `EnsDb` database
with human annotations for Ensembl release 91, which corresponds to GENCODE
v27. The parameter `localHub = TRUE` ensures that we list and load only locally
stored resources, due to the very limited internet connectivity at the lab site.
```{r annotationhub, message = FALSE}
library("AnnotationHub")
library("ensembldb")
ah <- AnnotationHub(localHub = TRUE)
ah_91 <- query(ah, "EnsDb.Hsapiens.v91")
ah_91
edb <- ah[[names(ah_91)]]
```
We next define a *data.frame* that connects transcripts to genes. Note that the
information in `tx2gene` needs to directly match the reference transcriptome
used by *Salmon*.
```{r}
tx2gene <- transcripts(edb, columns = c("tx_id_version", "gene_id", "tx_id"),
return.type = "data.frame")
colnames(tx2gene) <- c("TXNAME", "GENEID", "tx_id")
head(tx2gene)
```
Finally the following line of code imports *Salmon* transcript
quantifications into R, collapsing to the gene level using the
information in `tx2gene`.
```{r}
library("tximport")
library("jsonlite")
library("readr")
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
```
The `txi` object is simply a list of matrices (and one character
vector):
```{r}
names(txi)
txi$counts[1:3,1:3]
txi$length[1:3,1:3]
txi$abundance[1:3,1:3]
txi$countsFromAbundance
```
## *Alternative*: Importing into R with *tximeta*
In the previous section we showed how to import the quantifications from
*Salmon* into a list of matrices in R, using the *tximport* package. The
*tximeta* package provides a wrapper around *tximport*, which in addition to
importing the quantifications also generates a *SummarizedExperiment* object
(see below). Moreover, it automatically identifies the transcriptome that was
used for the quantification, and downloads the corresponding annotation
information. Note that this requires an internet connection. Here, we will
illustrate how to use the *tximeta* package to import the data above.
The input to *tximeta* is a data frame with sample information. This data frame
must contain at least two columns, named `names` and `files`, and containing the
sample IDs and the paths to the *Salmon* output files, respectively. Since we
already have the `coldata` table above, we just have to rename the `Run` column
and add the paths to the *Salmon* files.
```{r}
coldata2 <- coldata
colnames(coldata2)[colnames(coldata2) == "run"] <- "names"
coldata2$files <- file.path(dir,"quants",coldata2$names,"quant.sf.gz")
all(file.exists(coldata2$files))
coldata2
```
Next, we call *tximeta*, using the sample information data frame as input
argument.
```{r}
library("tximeta")
(txim <- tximeta(coldata = coldata2))
```
By default, `tximeta` returns a `SummarizedExperiment` object with
transcript-level quantifications. Since we are interested in gene-level
differential expression analysis, we summarize the abundances on the gene level.
```{r}
(txim <- summarizeToGene(txim))
```
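A quick look at the returned object (a sketch; the exact assays and metadata
columns depend on the annotation that *tximeta* retrieved):
```{r tximeta-inspect, eval=FALSE}
## Inspect the gene-level SummarizedExperiment returned by tximeta.
assayNames(txim)   # typically "counts", "abundance" and "length"
dim(txim)
head(rownames(txim))
```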
## Branching point
At this point, we have quantified the RNA-seq fragments and summarized
to gene-level counts and abundances. This is a branching point where we
could use a variety of Bioconductor packages for exploration and
differential expression of the count data, including
`r Biocpkg("edgeR")` [@Robinson2009EdgeR],
`r Biocpkg("limma")` with the voom method [@Law2014Voom],
`r Biocpkg("DSS")` [@Wu2013New],
`r Biocpkg("EBSeq")` [@Leng2013EBSeq] and
`r Biocpkg("baySeq")` [@Hardcastle2010BaySeq].
@Schurch2016How
[compared performance](https://www.ncbi.nlm.nih.gov/pmc/articles/pmid/27022035/)
of different statistical methods
for RNA-seq using a large number of biological replicates and can help
users to decide which tools make sense to use, and how many
biological replicates are necessary to obtain a certain sensitivity.
We will continue using `r Biocpkg("DESeq2")` [@Love2014Moderated] and
`r Biocpkg("edgeR")` [@Robinson2009EdgeR] below.
## SummarizedExperiment
The *DESeq2* package will store the data and intermediate results in
an object called a *DESeqDataSet*. This object builds on top of a
general Bioconductor class of object called the
*SummarizedExperiment*. We will therefore first introduce the
*SummarizedExperiment* and we begin with a diagram of the three
component pieces:
```{r sumexp, echo = FALSE, out.width="80%", fig.cap="The component parts of a *SummarizedExperiment* object. The 'assay' (pink block) contains the matrix of counts, the 'rowRanges' or 'rowData' (blue block) contains information about the genomic ranges and the 'colData' (green block) contains information about the samples. The highlighted line in each block represents the first row (note that the first row of 'colData' lines up with the first column of the 'assay')."}
par(mar = c(0,0,0,0))
plot(1,1,xlim = c(0,100),ylim = c(0,100),bty = "n",
type="n",xlab="",ylab="",xaxt="n",yaxt="n")
polygon(c(45,90,90,45),c(5,5,70,70),col="pink",border=NA)
polygon(c(45,90,90,45),c(68,68,70,70),col="pink3",border=NA)
text(67.5,40,"assay")
text(67.5,35,'e.g. "counts"')
polygon(c(10,40,40,10),c(5,5,70,70),col="skyblue",border=NA)
polygon(c(10,40,40,10),c(68,68,70,70),col="skyblue3",border=NA)
text(25,40,"rowRanges,")
text(25,35,"or rowData")
polygon(c(45,90,90,45),c(75,75,95,95),col="palegreen",border=NA)
polygon(c(45,47,47,45),c(75,75,95,95),col="palegreen3",border=NA)
text(67.5,85,"colData")
```
The *SummarizedExperiment* container is diagrammed in the Figure above
and discussed in the latest Bioconductor paper [@Huber2015Orchestrating].
In our case we have created a single matrix named `"counts"` that
contains the estimated counts for each gene and sample, which is
stored in `assay`. It is also possible to store multiple matrices,
accessed with `assays`. The `rowData` for our object will keep track
of the intermediate calculations for each gene: e.g. the mean of
normalized counts, the dispersion estimates, the estimated
coefficients and their standard error, etc. We can also include
information about the location of each gene, in which case, we would
use `rowRanges` to access the *GenomicRanges* or *GenomicRangesList*
associated with each gene. The component parts of the
*SummarizedExperiment* are accessed with an R function of the same
name: `assay` (or `assays`), `rowRanges`/`rowData` and `colData`.
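For example, the *SummarizedExperiment* returned by *tximeta* above can be
inspected with these accessors (a quick sketch):
```{r se-accessors, eval=FALSE}
## Accessor functions share names with the component parts they return.
head(assay(txim), 3)   # the first assay, here the estimated counts
rowRanges(txim)        # genomic ranges for each gene
colData(txim)          # sample information
```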
# The *DESeqDataSet* object, sample information and the design formula
Bioconductor software packages often define and use a custom class for
storing data that makes sure that all the needed data slots are
consistently provided and fulfill the requirements. In addition,
Bioconductor has general data classes (such as the
*SummarizedExperiment*) that can be used to move data between
packages. Additionally, the core Bioconductor classes provide useful
functionality: for example, subsetting or reordering the rows or
columns of a *SummarizedExperiment* automatically subsets or reorders
the associated *rowData* and *colData*, which can help to prevent
accidental sample swaps that would otherwise lead to spurious
results. With *SummarizedExperiment* this is all taken care of behind
the scenes.
In *DESeq2*, the custom class is called *DESeqDataSet*. One of the
main differences with *SummarizedExperiment* is that we will use the
*counts* function to access the `assay` matrix, and it is required
that this matrix stores non-negative integers (this is to prevent
mistaken use of the package for non-count data).
A second difference is that the *DESeqDataSet* has an associated
*design formula*. The experimental design is specified at the
beginning of the analysis, as it will inform many of the *DESeq2*
functions how to treat the samples in the analysis (one exception is
the size factor estimation, i.e., the adjustment for differing library
sizes, which does not depend on the design formula). The design
formula tells which columns in the sample information table (`colData`)
specify the experimental design and how these factors should be used
in the analysis.
The simplest design formula for differential expression would be
`~ condition`, where `condition` is a column in `colData(dds)` that
specifies which of two (or more) groups the samples belong to. For
the airway experiment, we will specify `~ cell + dex`
meaning that we want to test for the effect of dexamethasone (`dex`)
controlling for the effect of different cell line (`cell`).
We can construct a *DESeqDataSet* object from the *tximport* object
using the function *DESeqDataSetFromTximport*:
```{r}
library("DESeq2")
dds <- DESeqDataSetFromTximport(txi, coldata, design = ~cell + dex)
```
For running *DESeq2* or *edgeR* models, you can use R's formula notation to
express any fixed-effects experimental design.
Note that *DESeq2* and *edgeR* use the same formula notation as, for instance, the *lm*
function of base R. If the research aim is to determine for which
genes the effect of treatment is different across groups, then
interaction terms can be included and tested using a design such as
`~ group + treatment + group:treatment`. See the manual page for
`?results` for more examples.
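As a small illustration of this formula notation, the sketch below builds the
design matrix for a hypothetical two-group, two-treatment experiment (the
`group` and `treatment` factors here are made up and not part of the airway
data):
```{r interaction-sketch, eval=FALSE}
## Hypothetical design with an interaction term between group and treatment.
toy <- data.frame(group = factor(rep(c("A", "B"), each = 4)),
                  treatment = factor(rep(c("ctrl", "trt"), times = 4)))
model.matrix(~ group + treatment + group:treatment, data = toy)
```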
## Creating a DGEList for use with *edgeR*
The *edgeR* package uses another type of data container,
namely a *DGEList* object. It is just as easy to create a *DGEList* object using
the count matrix and information about samples. We can additionally add
information about the genes:
```{r}
genetable <- data.frame(gene.id = rownames(txi$counts))
```
Because we are using counts from *tximport*, we have a slightly
different procedure to import them, which involves calculating a
statistical *offset* that will account for differences in gene length
across samples. The following code is from the *tximport* vignette in
the *edgeR* section.
```{r message=FALSE}
library("edgeR")
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))
```
Creating the *DGEList* object simply entails providing the
counts, sample information, and gene information. The second line adds
an appropriate offset.
```{r}
dge <- DGEList(counts = cts,
samples = coldata,
genes = genetable)
dge <- scaleOffset(dge, t(t(log(normMat)) + o))
names(dge)
```
A much simpler approach, in terms of the amount of code, is to generate
counts-from-abundance using *tximport*, which corrects for gene
length changes directly. The code for this is:
```{r}
txi2 <- tximport(files, type="salmon", tx2gene=tx2gene,
countsFromAbundance="lengthScaledTPM")
dge2 <- DGEList(counts = txi2$counts,
samples = coldata,
genes = genetable)
names(dge2)
```
The *counts + offset* method and the *counts-from-abundance* method
are not statistically identical (obviously), but tend to output
roughly similar groups of genes. In this dataset, using
counts-from-abundance gives 96% of genes in common with *counts + offset*
at a target 5% FDR.
Just like the *SummarizedExperiment* and the *DESeqDataSet* the
*DGEList* contains all the information we need: the count matrix,
information about the samples (columns of the count matrix), and
information about the genes (rows of the count matrix).
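These pieces can be inspected directly, as the *DGEList* behaves much like a
list (a quick sketch):
```{r dgelist-inspect, eval=FALSE}
## Peek at the components of the DGEList.
head(dge$counts, 3)   # estimated counts
dge$samples           # sample information (plus library sizes, norm factors)
head(dge$genes, 3)    # gene information
```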
# Exploratory analysis and visualization
There are two separate paths in this workflow; the one we will
see first involves *transformations of the counts*
in order to visually explore sample relationships.
In the second part, we will go back to the original raw counts
for *statistical testing*. This is critical because
the statistical testing methods rely on original count data
(not scaled or transformed) for calculating the precision of measurements.
## Pre-filtering the dataset
The count matrix in our *DESeqDataSet* contains many rows with only zeros, and
additionally many rows with only a few fragments in total. In order to reduce
the size of the object, and to increase the speed of our functions, we can
remove the rows that have no or nearly no information about the amount of gene
expression. Here we apply the most minimal filtering rule: removing rows of the
*DESeqDataSet* that have no counts, or only a single count across all samples.
Additional weighting/filtering to improve power is applied at a later step in
the workflow.
```{r}
nrow(dds)
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
```
Here, we also filter the *DGEList* for *edgeR* in the same way. Note, however,
that *edgeR* does not apply filtering internally, and for this reason, typically
a stronger prefiltering criterion is used at this stage for *edgeR*.
```{r}
dge <- dge[rowSums(round(dge$counts)) > 1, ]
all(rownames(dge) == rownames(dds))
```
We use the above code mostly for aiding comparison with *DESeq2* in
later sections. The recommended filtering code for *edgeR* (here not
evaluated) is as follows:
```{r, eval=FALSE}
dge <- dge[filterByExpr(dge),]
```
## Variance stabilizing transformations
Many common statistical methods for exploratory analysis of
multidimensional data, for example clustering and *principal
components analysis* (PCA), work best for data that generally has the
same range of variance at different ranges of the mean values. When
the expected amount of variance is approximately the same across
different mean values, the data is said to be *homoskedastic*. For
RNA-seq counts, however, the expected variance grows with the mean. For
example, if one performs PCA directly on a matrix of
counts or normalized counts (e.g. correcting for differences in
sequencing depth), the resulting plot typically depends mostly
on the genes with *highest* counts because they show the largest
absolute differences between samples. A simple and often used
strategy to avoid this is to take the logarithm of the normalized
count values plus a pseudocount of 1; however, depending on the
choice of pseudocount, now the genes with the very *lowest* counts
will contribute a great deal of noise to the resulting plot, because
taking the logarithm of small counts actually inflates their variance.
We can quickly show this property of counts with some simulated
data (here, Poisson counts with a range of lambda from 0.1 to 100).
We plot the standard deviation of each row (genes) against the mean:
```{r meanSdCts, fig.cap="SD over mean plot for Poisson-distributed counts."}
lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library("vsn")
meanSdPlot(cts, ranks = FALSE)
```
And for logarithm-transformed counts after adding a pseudocount of 1:
```{r meanSdLogCts, fig.cap="SD over mean plot for log(x+1) transformed counts."}
log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)
```
The logarithm with a small pseudocount amplifies differences when the
values are close to 0. The low count genes with low signal-to-noise
ratio will overly contribute to sample-sample distances and PCA
plots.
As a solution, *DESeq2* offers two transformations for count data that
stabilize the variance across the mean: (1) the
*variance stabilizing transformation* (VST)
for negative binomial data with a dispersion-mean trend
[@Anders2010Differential], implemented in the *vst* function,
and (2) the *regularized-logarithm transformation* or *rlog*
[@Love2014Moderated].
For genes with high counts, the VST and rlog will give similar results
to the ordinary log2 transformation of normalized counts. For genes
with lower counts, however, the values are shrunken towards the genes'
averages across all samples. The VST or rlog-transformed data then
becomes approximately homoskedastic, and can be used directly for
computing distances between samples, making PCA plots, or as input to
downstream methods which perform best with homoskedastic data.
**Which transformation to choose?** Mike tends to prefer the VST. The
VST is much faster to compute via the function *vst* and is less
sensitive to high count outliers than the rlog. Perhaps further
development on the rlog would make it less sensitive to outliers.
(Why have the rlog at all then? In the DESeq2 paper, the rlog
sometimes outperformed the VST in terms of recovering clusters when
there was a large range of sequencing depth across samples, e.g. 10x
from lowest to highest sequencing depth.)
Note that the two transformations offered by DESeq2 are provided for
applications *other* than differential testing.
For differential testing we recommend the
*DESeq* function applied to raw counts, as described later
in this workflow, which also takes into account the dependence of the
variance of counts on the mean value during the dispersion estimation
step.
Both transformation functions return an object based on the *SummarizedExperiment*
class that contains the transformed values in its *assay* slot.
```{r vst, fig.cap="SD over mean plot for VST values from counts used in this workflow."}
vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
meanSdPlot(assay(vsd), ranks = FALSE)
```
```{r rlog, fig.cap="SD over mean plot for rlog-transformed counts from this workflow."}
rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
meanSdPlot(assay(rld), ranks = FALSE)
```
In the above function calls, we specified `blind = FALSE`, which means
that differences between cell lines and treatment (the variables in
the design) will not contribute to the expected variance-mean trend of
the experiment. The experimental design is not used directly in the
transformation, only in estimating the global amount of variability in
the counts. For a fully *unsupervised* transformation, one can set
`blind = TRUE` (which is the default).
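For example, the fully blind version of the VST could be computed as follows
(a sketch, shown for comparison only):
```{r vst-blind-sketch, eval=FALSE}
## Fully unsupervised VST: the design is ignored when estimating the
## dispersion-mean trend (blind = TRUE is also the default).
vsd_blind <- vst(dds, blind = TRUE)
```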
To show the effect of the transformation, in the figure below we plot
the first sample against the second, first simply using the *log2*
function of normalized counts (after adding 1, to avoid taking the log
of zero), and then using the VST and rlog-transformed values.
```{r transform-plot, out.width="100%", fig.width=6, fig.height=2.5, fig.cap="Scatterplot of transformed counts from two samples. Shown are scatterplots using the log2 transform of normalized counts (left), using the VST (middle), and using the rlog (right). While the rlog is on roughly the same scale as the log2 counts, the VST has an upward shift for the smaller values. It is the differences between samples (deviation from y = x in these scatterplots) which will contribute to the distance calculations and the PCA plot."}
library("dplyr")
library("ggplot2")
library("hexbin")
ntd <- normTransform(dds)
df <- bind_rows(
as_data_frame(assay(ntd)[, 1:2]) %>% mutate(transformation = "log2(x + 1)"),
as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
colnames(df)[1:2] <- c("x", "y")
ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
coord_fixed() + facet_grid(. ~ transformation)
```
We can see how genes with low counts (bottom left-hand corner) seem to
be excessively variable on the ordinary logarithmic scale, while the
VST and rlog compress differences for the low count genes
for which the data provide little information about differential
expression.
## Sample distances
A useful first step in an RNA-seq analysis is often to assess overall
similarity between samples: Which samples are similar to each other,
which are different? Does this fit to the expectation from the
experiment's design?
We use the R function *dist* to calculate the Euclidean distance
between samples. To ensure we have a roughly equal contribution from
all genes, we use it on the VST data. We need to
transpose the matrix of values using *t*, because the *dist* function
expects the different samples to be rows of its argument, and
different dimensions (here, genes) to be columns.
```{r}
sampleDists <- dist(t(assay(vsd)))
sampleDists
```
We visualize the distances in a heatmap in a figure below, using the function
*pheatmap* from the `r CRANpkg("pheatmap")` package.
```{r}
library("pheatmap")
library("RColorBrewer")
```
In order to plot the sample distance matrix with the rows/columns
arranged by the distances in our distance matrix,
we manually provide `sampleDists` to the `clustering_distance`
argument of the *pheatmap* function.
Otherwise the *pheatmap* function would assume that the matrix contains
the data values themselves, and would calculate distances between the
rows/columns of the distance matrix, which is not desired.
We also manually specify a blue color palette using the
*colorRampPalette* function from the `r CRANpkg("RColorBrewer")` package.
```{r distheatmap, fig.width = 6.1, fig.height = 4.5, fig.cap="Heatmap of sample-to-sample distances using the VST values."}
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$dex, vsd$cell, sep = " - ")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)
```
Note that we have changed the row names of the distance matrix to
contain treatment type and cell line instead of sample ID, so
that we have all this information in view when looking at the heatmap.
Another option for calculating sample distances is to use the
*Poisson Distance* [@Witten2011Classification], implemented in the
`r CRANpkg("PoiClaClu")` package.
This measure of dissimilarity between counts
also takes the inherent variance
structure of counts into consideration when calculating the distances
between samples. The *PoissonDistance* function takes the original
count matrix (not normalized) with samples as rows instead of columns,
so we need to transpose the counts in `dds`.
```{r}
library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds)))
```
We plot the heatmap in a Figure below.
```{r poisdistheatmap, fig.width = 6.1, fig.height = 4.5, fig.cap="Heatmap of sample-to-sample distances using the *Poisson Distance*."}
samplePoisDistMatrix <- as.matrix(poisd$dd)
rownames(samplePoisDistMatrix) <- paste(vsd$dex, vsd$cell, sep = " - ")
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
clustering_distance_rows = poisd$dd,
clustering_distance_cols = poisd$dd,
col = colors)
```
## PCA plot
Another way to visualize sample-to-sample distances is a
principal components analysis (PCA). In this ordination method, the
data points (here, the samples) are projected onto the 2D plane
such that they spread out in the two directions that explain most of
the differences (figure below). The x-axis is the direction that separates the data
points the most. The values of the samples in this direction are
written *PC1*. The y-axis is a direction (it must be *orthogonal* to
the first direction) that separates the data the second most. The
values of the samples in this direction are written *PC2*.
The percent of the total variance that is contained in the direction
is printed in the axis label. Note that these percentages do not add to
100%, because there are more dimensions that contain the remaining
variance (although each of these remaining dimensions will explain
less than the two that we see).
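The same quantities can be computed by hand with *prcomp* (a sketch; note that
*plotPCA* by default uses only the 500 most variable genes, so the percentages
will differ slightly):
```{r prcomp-sketch, eval=FALSE}
## PCA on the VST values "by hand"; samples must be rows, hence t().
pca <- prcomp(t(assay(vsd)))
## proportion of variance explained by the first two principal components
summary(pca)$importance["Proportion of Variance", 1:2]
```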
```{r plotpca, fig.width = 6, fig.height = 4.5, fig.cap="PCA plot using the VST values. Each unique combination of treatment and cell line is given its own color."}
plotPCA(vsd, intgroup = c("dex", "cell"))
```
Here, we have used the function *plotPCA* that comes with *DESeq2*.
The two terms specified by `intgroup` are the interesting groups for
labeling the samples; they tell the function to use them to choose
colors. We can also build the PCA plot from scratch using the
`r CRANpkg("ggplot2")` package [@Wickham2009Ggplot2].
This is done by asking the *plotPCA* function
to return the data used for plotting rather than building the plot.
See the *ggplot2* [documentation](http://docs.ggplot2.org/current/)
for more details on using *ggplot*.
```{r}
pcaData <- plotPCA(vsd, intgroup = c("dex", "cell"), returnData = TRUE)
pcaData
percentVar <- round(100 * attr(pcaData, "percentVar"))
```
We can then use these data to build up a second plot in a figure below, specifying that the
color of the points should reflect dexamethasone treatment and the
shape should reflect the cell line.
```{r ggplotpca, fig.width = 6, fig.height = 4.5, fig.cap="PCA plot using the VST values with custom *ggplot2* code. Here we specify cell line (plotting symbol) and dexamethasone treatment (color)."}
ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
geom_point(size = 3) +
xlab(paste0("PC1: ", percentVar[1], "% variance")) +
ylab(paste0("PC2: ", percentVar[2], "% variance")) +
coord_fixed()
```
From the PCA plot, we see that the differences between cells (the
different plotting shapes) are considerable, though not stronger than the differences due to
treatment with dexamethasone (red vs blue color). This shows why it will be important to
account for this in differential testing by using a paired design
("paired", because each dex treated sample is paired with one
untreated sample from the *same* cell line). We are already set up for
this design by assigning the formula `~ cell + dex` earlier.
## MDS plot
Another plot, very similar to the PCA plot, can be made using the
*multidimensional scaling* (MDS) function in base R. This is useful when we
don't have a matrix of data, but only a matrix of distances. Here we
compute the MDS for the distances calculated from the VST data
and plot these in a figure below.
```{r mdsvst, fig.width = 6, fig.height = 4.5, fig.cap="MDS plot using VST values."}
mds <- as.data.frame(colData(vsd)) %>%
cbind(cmdscale(sampleDistMatrix))
ggplot(mds, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
geom_point(size = 3) + coord_fixed()
```
In a figure below we show the same plot for the *PoissonDistance*:
```{r mdspois, fig.width = 6, fig.height = 4.5, fig.cap="MDS plot using the *Poisson Distance*."}
mdsPois <- as.data.frame(colData(dds)) %>%
cbind(cmdscale(samplePoisDistMatrix))
ggplot(mdsPois, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
geom_point(size = 3) + coord_fixed()
```
# Differential expression analysis
## Performing differential expression testing with *DESeq2*
As we have already specified an experimental design when we created
the *DESeqDataSet*, we can run the differential expression pipeline on
the raw counts with a single call to the function *DESeq*:
```{r airwayDE}
dds <- DESeq(dds)
```
This function will print out a message for the various steps it
performs. These are described in more detail in the manual page for
*DESeq*, which can be accessed by typing `?DESeq`. Briefly these are:
the estimation of size factors (controlling for differences in the
sequencing depth of the samples), the estimation of
dispersion values for each gene, and fitting a generalized linear model.
A *DESeqDataSet* is returned that contains all the fitted
parameters within it, and the following section describes how to
extract out results tables of interest from this object.
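For reference, the wrapper is roughly equivalent to calling the individual
steps in sequence (a sketch, not evaluated here):
```{r deseq-steps-sketch, eval=FALSE}
## The steps performed by the DESeq() wrapper, spelled out individually.
dds <- estimateSizeFactors(dds)   # account for sequencing depth
dds <- estimateDispersions(dds)   # per-gene dispersion estimation
dds <- nbinomWaldTest(dds)        # GLM fit and Wald tests
```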
## Building the results table
Calling *results* without any arguments will extract the estimated
log2 fold changes and *p* values for the last variable in the design
formula. If there are more than 2 levels for this variable, *results*
will extract the results table for a comparison of the last level over
the first level. The comparison is printed at the top of the output:
`dex trt vs untrt`.
```{r}
res <- results(dds)
res
```
We could have equivalently produced this results table with the
following more specific command. Because `dex` is the last variable in
the design, we could optionally leave off the `contrast` argument to extract
the comparison of the two levels of `dex`.
```{r}
res <- results(dds, contrast = c("dex", "trt", "untrt"))
```
As `res` is a *DataFrame* object, it carries metadata
with information on the meaning of the columns:
```{r}
mcols(res, use.names = TRUE)
```
The first column, `baseMean`, is just the average of the normalized
count values (i.e., counts divided by the size factors), taken over all
samples in the *DESeqDataSet*.
The remaining four columns refer to a specific contrast, namely the
comparison of the `trt` level over the `untrt` level for the factor
variable `dex`. We will find out below how to obtain other contrasts.
The column `log2FoldChange` is the effect size estimate. It tells us
how much the gene's expression seems to have changed due to treatment
with dexamethasone in comparison to untreated samples. This value is
reported on a logarithmic scale to base 2: for example, a log2 fold
change of 1.5 means that the gene's expression is increased by a
multiplicative factor of \(2^{1.5} \approx 2.82\).
Of course, this estimate has an uncertainty associated with it, which
is available in the column `lfcSE`, the standard error estimate for
the log2 fold change estimate. We can also express the uncertainty of
a particular effect size estimate as the result of a statistical
test. The purpose of a test for differential expression is to test
whether the data provides sufficient evidence to conclude that this
value is really different from zero. *DESeq2* performs for each gene a
*hypothesis test* to see whether evidence is sufficient to decide
against the *null hypothesis* that there is zero effect of the treatment
on the gene and that the observed difference between treatment and
control was merely caused by experimental variability (i.e., the type
of variability that you can expect between different
samples in the same treatment group). As usual in statistics, the
result of this test is reported as a *p* value, and it is found in the
column `pvalue`. Remember that a *p* value indicates the probability
that a fold change as strong as the observed one, or even stronger,
would be seen under the situation described by the null hypothesis.
We can also summarize the results with the following line of code,
which reports some additional information that will be covered in
later sections.
```{r}
summary(res)
```
Note that there are many genes with differential expression due to
dexamethasone treatment at the FDR level of 10%. This makes sense, as
the smooth muscle cells of the airway are known to react to
glucocorticoid steroids. However, there are two ways to be more strict
about which set of genes are considered significant:
* lower the false discovery rate threshold (the threshold on `padj` in
the results table)
* raise the log2 fold change threshold from 0 using the `lfcThreshold`
argument of *results*
If we lower the false discovery rate threshold, we should also
inform the `results()` function about it, so that the function can use this
threshold for the optimal independent filtering that it performs:
```{r}
res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)
```
If we want to raise the log2 fold change threshold, so that we test
for genes that show more substantial changes due to treatment, we
simply supply a value on the log2 scale. For example, by specifying
`lfcThreshold = 1`, we test for genes that show significant effects of
treatment on gene counts more than doubling or less than halving,
because \(2^1=2\).
```{r}
resLFC1 <- results(dds, lfcThreshold = 1)
table(resLFC1$padj < 0.05)
```
Sometimes a subset of the *p* values in `res` will be `NA` ("not
available"). This is *DESeq*'s way of reporting that all counts for
this gene were zero, and hence no test was applied. In addition, *p*
values can be assigned `NA` if the gene was excluded from analysis
because it contained an extreme count outlier. For more information,
see the outlier detection section of the *DESeq2* vignette.
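A quick way to tally how many genes received an `NA` (a sketch):
```{r na-tally, eval=FALSE}
## Count NA entries; padj can additionally be NA for genes removed by
## independent filtering (see the DESeq2 vignette).
table(is.na(res$pvalue))
table(is.na(res$padj))
```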
## Independent hypothesis weighting within *DESeq2*
The *results* function in *DESeq2* integrates with a new method for
dealing with genes that have low counts and therefore low power for
detecting differential gene expression. This method is called
Independent Hypothesis Weighting (IHW) [@Ignatiadis2016], and
generalizes the idea of filtering low count genes to reduce the
multiple testing burden. While the specifics of the
method are beyond the scope of this workflow, we demonstrate its usage
here:
```{r}
library("IHW")
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
```
The results are similar but not identical to the results table built
above, which uses a simple optimization of a filter over mean
normalized counts. The new procedure results in a gain of 50 genes.
```{r}
table(old=res$padj < .1, IHW=resIHW$padj < .1)
```
We can plot the weighting function which is fit by the *IHW* package:
```{r ihw-plot, fig.cap="IHW weighting function."}
plot(metadata(resIHW)$ihwResult)
```
## Other comparisons
In general, the results for a comparison of any two levels of a
variable can be extracted using the `contrast` argument to
*results*. The user should specify three values: the name of the
variable, the name of the level for the numerator, and the name of the
level for the denominator. Here we extract results for the log2 of the
fold change of one cell line over another:
```{r}
results(dds, contrast = c("cell", "N061011", "N61311"))
```
There are additional ways to build results tables for certain
comparisons after running *DESeq* once.
If results for an interaction term are desired, the `name`
argument of *results* should be used. Please see the
help page for the *results* function for details on the additional
ways to build results tables. In particular, the **Examples** section of
the help page for *results* gives some pertinent examples.
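The coefficient names that can be passed to the `name` argument can be listed
with *resultsNames* (a quick sketch):
```{r resultsnames-sketch, eval=FALSE}
## List the coefficients estimated by DESeq(), usable with the `name` argument.
resultsNames(dds)
```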
## Annotation of RNA-seq results
One of the central steps in the analysis of genomic data is the annotation of
the quantified entities to biologically more relevant representations such as
transcripts, genes or proteins. Such annotations enable for example pathway
enrichment analyses and ease the interpretation of the results. Bioconductor
provides a large variety of annotation packages and resources, most of them
supporting the `r Biocpkg("AnnotationDbi")` interface which enables a unified
way to retrieve annotations for given identifiers. The annotation resources from
Bioconductor range from web-based tools such as `r Biocpkg("biomaRt")` to
packages working with pre-built databases containing all identifier mappings for
a certain species (the `*.org` packages such as `org.Hs.eg.db`) or providing
gene and transcript models, such as databases built by
`r Biocpkg("GenomicFeatures")` (`TxDb` databases/packages) and
`r Biocpkg("ensembldb")` (`EnsDb` databases/packages) [@Rainer:2019jd].
In this section we use the `ensembldb` package to annotate all tested genes
which are identified by Ensembl gene IDs. The Salmon quantification used in the
analysis leading to this table was based on GENCODE v27 transcripts, which are
linked to Ensembl release 91. We thus load below an `EnsDb` database with human
annotations for Ensembl release 91 from the `r Biocpkg("AnnotationHub")`
resource. The parameter `localHub = TRUE` ensures that we list and load only
locally stored resources (due to the very limited internet connectivity at the
lab site).
```{r annotationhub-ensdb, message = FALSE}
library("AnnotationHub")
library("ensembldb")
ah <- AnnotationHub(localHub = TRUE)
ah_91 <- query(ah, "EnsDb.Hsapiens.v91")
ah_91
#' Load the database
edb <- ah[[names(ah_91)]]
```
If we had a working internet connection we could simply list all available
`EnsDb` databases with `query(ah, "EnsDb")` (after loading `AnnotationHub` with
`localHub = FALSE`). We can now use the `edb` variable to access the data in
the database. With the `listColumns` function we can for example list all of the
available annotation fields (columns) in the database.
```{r listcolumns}
listColumns(edb)
```
We could get data from any of these database columns by passing the respective
column name(s) with the `columns` parameter to any function retrieving
annotations from an `EnsDb` database. Gene-related annotations can be retrieved
with the `genes` function, which by default (and similar to the `genes` function
from the `r Biocpkg("GenomicFeatures")` package) returns a `GRanges` object with
the genomic coordinates of the genes and additional annotation columns in the
`GRanges`' *metadata* columns. Below we use the `genes` call to retrieve
annotations for all human genes. With the parameter `return.type = "DataFrame"`
we tell the function to return the result as a `DataFrame` instead of the
default `GRanges`.
```{r annotations-all-genes}
anns <- genes(edb, return.type = "DataFrame")
rownames(anns) <- anns$gene_id
```
We can now combine the RNA-seq result with the annotation.
```{r annotation-combine}
resAnn <- BiocGenerics::cbind(anns[BiocGenerics::rownames(res), ], res)
resAnn
```
If we had only a reduced set of genes to annotate, e.g. the set of the most
significant genes, it is faster and more elegant to extract annotations for the
subset of genes using the filter framework from `ensembldb`. Below we create a
*top table* by selecting the 20 genes with the smallest p-values.
```{r res-top20}
resTop20 <- res[order(res$pvalue), ][1:20, ]
```
To get only annotations for the selected genes we pass the filter expression
`~ gene_id == rownames(resTop20)` with the `filter` parameter to the `genes`
function. Filter expressions have to be written in the form of a formula
(i.e. starting with `~`) and support any logical R expression and any database
column/field in the `EnsDb` database (use `supportedFilters(edb)` to list all
supported fields). Also, by specifying the respective field names with the
`columns` parameter we extract only the gene name (equivalent to the HGNC
symbol for most genes), the description, the gene biotype and the NCBI Entrezgene ID
from the database.
```{r annotations-top20}
anns <- genes(edb, filter = ~ gene_id == rownames(resTop20),
columns = c("gene_name", "description", "gene_biotype",
"entrezid"), return.type = "DataFrame")
```
Note that the order of the genes in the returned `data.frame` is **not** the
same as in `resTop20`. We thus below re-order the annotations to match the
order of genes in the top table and subsequently join the two data frames.
```{r join-resTop20}
resTop20 <- cbind(anns[match(rownames(resTop20), anns$gene_id), ],
resTop20)
```
Be aware that the mapping between Ensembl and NCBI identifiers is not
necessarily 1:1. In our case we got more than one NCBI Entrezgene ID for some
Ensembl gene IDs. The column `"entrezid"` in our result table is thus a `list`
with possibly more than one Entrezgene ID per row. Exporting such a table could
be problematic, so we therefore collapse the Entrezgene IDs in this column into
a single character string, with multiple IDs separated by a semicolon (`";"`).
```{r collapse-entrezid}
resTop20$entrezid <- sapply(resTop20$entrezid, function(z) {
if (any(is.na(z))) z
else paste(z, collapse = ";")
})
```
The 20 most significantly regulated genes in the present analysis are listed
below.
```{r resTop20Table, results = "asis", echo = FALSE}
knitr::kable(resTop20)
```
### Using *ensembldb* to query for genes encoding a specific protein domain
For some experiments it might be interesting to search for genes, or rather
transcripts, that encode a protein with a certain protein domain. For the
present data set we could for example search for genes with proteins encoding
the ligand binding domain of nuclear hormone receptors (such as the
glucocorticoid receptor, the receptor activated by treatment with the synthetic
glucocorticoid dexamethasone and hence mainly responsible for the
transcriptional changes observed in the present dataset). Below we thus query
the database for all such genes using the protein domain ID
[PF00104](http://pfam.xfam.org/family/PF00104) from Pfam. The filter used in the
query below will return all genes with a transcript annotated to the specific
protein domain, restricting to genes that are detected in the present analysis.
```{r protein-domain-query}
hormoneR <- genes(edb,
filter = ~ protein_domain_id == "PF00104" &
gene_id == rownames(res),
return.type = "data.frame")
rownames(hormoneR) <- hormoneR$gene_id
nrow(hormoneR)
```
How many genes were identified? Next we identify nuclear
hormone receptors which are significantly differentially expressed at a 5% FDR
in the present data set.
```{r hormone-receptors-res}
resHormoneR <- res[hormoneR$gene_id, ]
resHormoneR <- resHormoneR[which(resHormoneR$padj < 0.05), ]
resHormoneR <- cbind(hormoneR[rownames(resHormoneR), ],
resHormoneR)
```
The table below lists the significantly differentially expressed hormone
receptors.
```{r hormone-receptors-res-table, results = "asis"}
knitr::kable(resHormoneR[order(resHormoneR$pvalue),
c("gene_name", "padj", "log2FoldChange")])
```
Among the regulated hormone receptors are NR4A3, encoding a member of the
steroid-thyroid hormone-retinoid receptor superfamily (4-fold upregulated),
and the glucocorticoid receptor (NR3C1) itself, which is known to
down-regulate its own gene in certain cell types as part of a negative feedback loop.
## Performing differential expression testing with *edgeR*
Next we will show how to perform differential expression analysis with *edgeR*.
Recall that we have a `DGEList` object `dge`, containing the count matrix together with the sample and gene information:
```{r}
names(dge)
```
We first define a design matrix, using the same formula syntax as for *DESeq2*
above.
```{r}
design <- model.matrix(~ cell + dex, dge$samples)
```
Then, we calculate normalization factors and estimate the dispersion for each
gene. Note that we need to specify the design in the dispersion calculation.
```{r}
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design)
```
Finally, we fit the generalized linear model and perform the test. In the
`glmQLFTest` function, we indicate which coefficient (which column in the design
matrix) we would like to test. It is possible to test more general
contrasts as well, and the user guide contains many examples on how to do this.
The `topTags` function extracts the top-ranked genes. You can indicate the
adjusted p-value cutoff, and/or the number of genes to keep.
```{r edgeRcall}
fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = ncol(design))
tt <- topTags(qlf, n = nrow(dge), p.value = 0.1)
tt10 <- topTags(qlf) # just the top 10 by default
tt10
```
We can compare to see how the results from the two software packages
overlap:
```{r}
tt.all <- topTags(qlf, n = nrow(dge), sort.by = "none")
table(DESeq2 = res$padj < 0.1, edgeR = tt.all$table$FDR < 0.1)
```
We can also compare the two lists by the ranks:
```{r plot-ranks, fig.cap="Comparison of genes ranked by *DESeq2* and *edgeR*."}
common <- !is.na(res$padj)
plot(rank(res$padj[common]),
rank(tt.all$table$FDR[common]), cex = .1,
xlab = "DESeq2", ylab = "edgeR")
```
Also with edgeR we can test for significance relative to a fold-change
threshold, using the function `glmTreat` instead of `glmQLFTest`. Below we set the
log fold-change threshold to 1 (i.e., fold change threshold equal to 2), as for
DESeq2 above. We also compare the results to those obtained with DESeq2.
```{r}
treatres <- glmTreat(fit, coef = ncol(design), lfc = 1)
tt.treat <- topTags(treatres, n = nrow(dge), sort.by = "none")
table(DESeq2 = resLFC1$padj < 0.1, edgeR = tt.treat$table$FDR < 0.1)
```
## Citing packages used in research
If you use the results from an R analysis package in published
research, you can find the proper citation for the software by typing
`citation("pkgName")`, where you would substitute the name of the
package for `pkgName`. Citing methods papers helps to support and
reward the individuals who put time into open source software for
genomic data analysis.
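For example, to print the citation entry for the *DESeq2* package used in this
workflow:
```{r citeexample}
## print the citation for the DESeq2 package
citation("DESeq2")
```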
# Plotting results
## Counts plot with *DESeq2*
A quick way to visualize the counts for a particular gene is to use
the *plotCounts* function that takes as arguments the
*DESeqDataSet*, a gene name, and the group over which to plot the
counts (figure below).
```{r plotcounts, fig.cap="Normalized counts for a single gene over treatment group."}
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = topGene, intgroup = c("dex"))
```
We can also make custom plots using the *ggplot* function from the
`r CRANpkg("ggplot2")` package (figures below).
```{r ggplotcountsjitter, out.width="80%", fig.width = 4, fig.height = 3, fig.cap="Normalized counts using ggbeeswarm"}
library("ggbeeswarm")
geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex", "cell"),
                         returnData = TRUE)
ggplot(geneCounts, aes(x = dex, y = count, color = cell)) +
  scale_y_log10() + geom_beeswarm(cex = 3)
```
```{r ggplotcountsgroup, out.width="80%", fig.width = 4, fig.height = 3, fig.cap="Normalized counts with lines connecting cell lines. Note that the *DESeq* test actually takes into account the cell line effect, so this figure more closely depicts the difference being tested."}
ggplot(geneCounts, aes(x = dex, y = count, color = cell, group = cell)) +
  scale_y_log10() + geom_point(size = 3) + geom_line()
```
## MA plot with *DESeq2*
An *MA plot* [@Dudoit2002Statistical] provides a useful overview of
the distribution of the estimated coefficients in the model,
e.g. the comparisons of interest, across all genes.
On the y-axis, the "M" stands for "minus" --
subtraction of log values is equivalent to the log of the ratio -- and
on the x-axis, the "A" stands for "average". You may also hear this plot
referred to as a mean-difference plot, or a Bland-Altman plot.
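Written out, for two expression values $x$ and $y$ being compared, the two axes
correspond to

$$
M = \log_2 x - \log_2 y = \log_2 \frac{x}{y}, \qquad
A = \frac{1}{2}\left(\log_2 x + \log_2 y\right).
$$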
Before making the MA plot, we use the
*lfcShrink* function to shrink the log2 fold changes for the
comparison of dex treated vs untreated samples. This typically takes
about 20 seconds on a laptop.
```{r plotma, out.width="80%", fig.cap="An MA plot of changes induced by treatment. The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis. Each gene is represented with a dot. Genes with an adjusted *p* value below a threshold (here 0.1, the default) are shown in red."}
library("apeglm")
resultsNames(dds)
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-5,5))
```
The *DESeq2* package uses a Bayesian procedure to moderate (or
"shrink") log2 fold changes from genes with very low counts and highly
variable counts, as can be seen by the narrowing of the vertical
spread of points on the left side of the MA plot. As shown above, the
*lfcShrink* function performs this operation. For a detailed
explanation of the rationale of moderated fold changes, please see the
*DESeq2* paper [@Love2014Moderated].
If we had not used statistical moderation to shrink the noisy log2
fold changes, we would have instead seen the following plot:
```{r plotmaNoShr, out.width="80%", fig.cap="MA plot without shrinkage."}
res.noshr <- results(dds)
DESeq2::plotMA(res.noshr, xlim=c(5,1e6), ylim=c(-5,5))
```
We can label individual points on the MA plot as well. Here we use the
*with* R function to plot a circle and text for a selected row of the
results object. Within the *with* function, only the `baseMean` and
`log2FoldChange` values for the selected rows of `res` are used.
```{r plotmalabel, out.width="80%", fig.cap="Example of labeling an individual gene on an MA plot."}
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col = "dodgerblue", cex = 2, lwd = 2)
  text(baseMean, log2FoldChange, topGene, pos = 2, col = "dodgerblue")
})
```
Another useful diagnostic plot is the histogram of the *p* values
(figure below). This plot is best formed by excluding genes with very
small counts, which otherwise generate spikes in the histogram.
```{r histpvalue2, fig.cap="Histogram of *p* values for genes with mean normalized count larger than 1."}
hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")
```
## Counts plot with *edgeR*
As we made a counts plot with *DESeq2* previously, we can make an analogous
plot for the top gene from the *edgeR* analysis:
```{r plot-counts-edger, fig.cap="CPM per sample for the top gene from the *edgeR* analysis."}
cpms <- cpm(dge)
topGene <- as.character(tt10$table$gene.id[1])
ord <- order(dge$samples$dex)
col <- as.integer(dge$samples$dex[ord])
par(mar=c(10,4,4,2))
barplot(cpms[topGene, ord],
        col = c("skyblue", "orange")[col],
        names.arg = paste(dge$samples$dex[ord], "--", dge$samples$cell[ord]),
        las = 2, ylab = "counts per million (CPM)")
```
## MA / Smear plot with *edgeR*
In *edgeR*, the MA plot is obtained via the `plotSmear` function.
```{r plot-smear, out.width="80%", fig.cap="The *edgeR* package includes the `plotSmear` function for generating MA plots."}
plotSmear(qlf, de.tags = tt$table$gene.id)
```
## Gene clustering
In the sample distance heatmap made previously, the dendrogram at the
side shows us a hierarchical clustering of the samples. Such a
clustering can also be performed for the genes. Since the clustering
is only relevant for genes that actually carry a signal, one usually
would only cluster a subset of the most highly variable genes. Here,
for demonstration, let us select the 20 genes with the highest
variance across samples. We will work with the VST data:
```{r}
library("genefilter")
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)
```
The heatmap becomes more interesting if we do not look at absolute
expression strength but rather at the amount by which each gene
deviates in a specific sample from the gene's average across all
samples. Hence, we center each gene's values across samples,
and plot a heatmap (figure below). We provide a *data.frame* that instructs the
*pheatmap* function how to label the columns.
```{r genescluster, out.width="100%", fig.cap="Heatmap of relative VST values across samples. Treatment status and cell line information are shown with colored bars at the top of the heatmap. Blocks of genes covary across samples. Note that a set of genes at the top of the heatmap separates the N061011 cell line from the others. In the center of the heatmap, we see a set of genes for which the dexamethasone-treated samples have higher gene expression."}
mat <- assay(vsd)[ topVarGenes, ]
mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)
```
## Exporting results
You can easily save the results table in a CSV file that you can
then share or load with a spreadsheet program such as Excel. The call to
*as.data.frame* is necessary to convert the *DataFrame* object
(`r Biocpkg("IRanges")` package) to a *data.frame* object that can be
processed by *write.csv*. Here, we take just the top 100 genes for
demonstration.
```{r eval = FALSE}
resOrderedDF <- as.data.frame(res[order(res$pvalue), ])[1:100, ]
write.csv(resOrderedDF, file = "results.csv")
```
A more sophisticated way of exporting results is provided by the Bioconductor
package `r Biocpkg("ReportingTools")` [@Huntley2013ReportingTools].
*ReportingTools* will automatically generate dynamic HTML documents,
including links to external databases using gene identifiers
and boxplots summarizing the normalized counts across groups.
See the *ReportingTools* vignettes for full details. The simplest
version of creating a dynamic *ReportingTools* report is performed
with the following code:
```{r eval = FALSE}
library("ReportingTools")
htmlRep <- HTMLReport(shortName = "report", title = "My report",
                      reportDirectory = "./report")
publish(resOrderedDF %>% tibble::rownames_to_column("gene"), htmlRep)
url <- finish(htmlRep)
browseURL(url)
```
## Exploring results interactively
The `r Biocpkg("iSEE")` package enables interactive exploration of any data
stored in a *SummarizedExperiment* container. Since the *DESeqDataSet* class
directly extends *SummarizedExperiment*, it can also be used as input to *iSEE*.
To open an *iSEE* instance, we simply type
```{r}
library("iSEE")
if (interactive()) {
  iSEE(dds)
}
```
We can also add additional information, such as the results of the statistical
test, to the *rowData* of the object before providing it to *iSEE*.
```{r}
stopifnot(all(rownames(dds) == rownames(res)))
res$neglog10pvalue <- -log10(res$pvalue)
rowData(dds)$res <- res
if (interactive()) {
  iSEE(dds)
}
```
# Removing hidden batch effects
Suppose we did not know that there were different cell lines involved
in the experiment, only that there was treatment with
dexamethasone. The cell line effect on the counts then would represent
some hidden and unwanted variation that might be affecting
many or all of the genes in the dataset. We can use statistical
methods designed for RNA-seq from the `r Biocpkg("sva")` package
[@Leek2014Svaseq] or the `r Biocpkg("RUVSeq")` package
[@Risso2014Normalization] in Bioconductor to
detect such groupings of the samples, and then we can add these to the
*DESeqDataSet* design, in order to account for them.
The *SVA* package uses the term *surrogate variables* for the estimated
variables that we want to account for in our analysis, while the *RUV*
package uses the term *factors of unwanted variation* (RUV is an acronym for
"Remove Unwanted Variation"). We first use *SVA* to find hidden batch effects,
and then demonstrate the use of *RUV*.
## Using SVA with DESeq2
```{r}
library("sva")
```
Below we obtain a matrix of normalized counts for which the average count across
samples is larger than 1. As we described above, we are trying to
recover any hidden batch effects, supposing that we do not know the
cell line information. So we use a full model matrix with the
*dex* variable, and a reduced, or null, model matrix with only
an intercept term. Finally we specify that we want to estimate 2
surrogate variables. For more information read the manual page for the
*svaseq* function by typing `?svaseq`.
```{r}
dat <- counts(dds, normalized = TRUE)
idx <- rowMeans(dat) > 1
dat <- dat[idx, ]
mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
svseq$sv
```
Because we actually do know the cell lines, we can see how well the
SVA method did at recovering these variables (figure below).
```{r svaplot, fig.cap="Surrogate variables 1 and 2 plotted over cell line. Here, we know the hidden source of variation (cell line), and therefore can see how the SVA procedure is able to identify a source of variation which is correlated with cell line."}
par(mfrow = c(2, 1), mar = c(3,5,3,1))
for (i in 1:2) {
  stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))
  abline(h = 0)
}
```
Finally, in order to use SVA to remove any effect on the counts from
our surrogate variables, we simply add these two surrogate variables
as columns to the *DESeqDataSet* and then add them to the design:
```{r}
ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex
```
We could then produce results controlling for surrogate variables by
running *DESeq* with the new design.
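As a minimal sketch of this last step (assuming the same `trt` vs `untrt`
comparison for the `dex` factor as above; the result name `resSVA` is only
illustrative):
```{r svadeseq}
## re-run the DESeq pipeline with the surrogate variables in the design
ddssva <- DESeq(ddssva)
## extract results for the treatment effect, now adjusted for SV1 and SV2
resSVA <- results(ddssva, contrast = c("dex", "trt", "untrt"))
summary(resSVA)
```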
## Using RUV with DESeq2
We can also use the *RUV* method in the *RUVSeq* package to detect the
hidden batch effects.
```{r}
library("RUVSeq")
```
We can use the `RUVg` function to estimate *factors of unwanted
variation*, analogous to *SVA*'s *surrogate variables*. A difference
compared to the *SVA* procedure above is that we would first run
*DESeq* and *results* to obtain the p-values for the analysis without
knowing about the batches, e.g. just `~ dex`. Supposing that we have
this results table `res`, we then pull out a set of *empirical control
genes* by looking at the genes that do not have a small p-value.
```{r}
set <- newSeqExpressionSet(counts(dds))
idx <- rowSums(counts(set) > 5) >= 2
set <- set[idx, ]
set <- betweenLaneNormalization(set, which="upper")
not.sig <- rownames(res)[which(res$pvalue > .1)]
empirical <- rownames(set)[ rownames(set) %in% not.sig ]
set <- RUVg(set, empirical, k=2)
pData(set)
```
We can plot the factors estimated by *RUV*:
```{r ruvplot, fig.cap="Factors of unwanted variation plotted over cell line."}
par(mfrow = c(2, 1), mar = c(3,5,3,1))
for (i in 1:2) {
  stripchart(pData(set)[, i] ~ dds$cell, vertical = TRUE, main = paste0("W", i))
  abline(h = 0)
}
```
As before, if we wanted to control for these factors, we simply
add them to the *DESeqDataSet* and to the design:
```{r}
ddsruv <- dds
ddsruv$W1 <- set$W_1
ddsruv$W2 <- set$W_2
design(ddsruv) <- ~ W1 + W2 + dex
```
We would then run *DESeq* with the new design to re-estimate the
parameters and results.
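Analogous to the SVA case above, a brief sketch of this step (the default call
to *results* returns the test for the last design variable, here `dex`):
```{r ruvdeseq}
## re-estimate parameters with the RUV factors in the design
ddsruv <- DESeq(ddsruv)
## summarize the adjusted results for the treatment effect
summary(results(ddsruv))
```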
# Session information
As the last part of this document, we call the function *session_info* from the
*devtools* package, which reports the version numbers of R and all the packages
used in this session. It is good practice to always keep such a record, as it
will help to track down what has happened in case an R script ceases to work or
gives different results because the functions have been changed in a newer
version of one of your packages. By including it at the bottom of a script,
your reports will become more reproducible.
The session information should also *always*
be included in any emails to the
[Bioconductor support site](https://support.bioconductor.org) along
with all code used in the analysis.
```{r}
devtools::session_info()
```
# References