Contents

1 Abstract

We walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were quantified with respect to a reference transcriptome, and prepare a count matrix which tallies the number of RNA-seq fragments mapped to each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis (DGE) and visually explore the results.

2 Introduction

Bioconductor has many packages which support analysis of high-throughput sequence data, including RNA sequencing (RNA-seq). The packages which we will use in this workflow include core packages maintained by the Bioconductor core team for storing genomic data sets and working with gene annotations (Huber et al. 2015). We will also use contributed packages for statistical analysis and visualization of sequencing data.

A conductor ensures the orchestra plays in harmony. (source)

A conductor ensures the orchestra plays in harmony. (source)

Through scheduled releases every 6 months, the Bioconductor project ensures that all the packages within a release will work together in harmony (hence the “conductor” metaphor). The packages used in this workflow are loaded with the library function and can be installed by following the Bioconductor installation instructions.

A published (but essentially similar) version of this workflow, including reviewer reports and comments is available at F1000Research.

If you have questions about this workflow or any Bioconductor software, please post these to the Bioconductor support site. If the questions concern a specific package, you can tag the post with the name of the package, or for general questions about the workflow, tag the post with rnaseqgene. Note the posting guide for crafting an optimal question for the support site.

2.1 Experimental data

The data used in this workflow is stored in an R package, airway2, containing quantification data for eight RNA-seq samples. The quantification files summarize an RNA-seq experiment wherein airway smooth muscle cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects (Himes et al. 2014). Glucocorticoids are used, for example, by people with asthma to reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. For more description of the experiment see the PubMed entry 24926665 and for raw data see the GEO entry GSE52778. An alternative version of the workflow uses the airway package, but this version uses the newer airway2 because it contains Salmon quantification files, which will be discussed later.

3 Preparing count matrices

As input, the count-based statistical methods, such as DESeq2 (Love, Huber, and Anders 2014), edgeR (Robinson, McCarthy, and Smyth 2009), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and baySeq (Hardcastle and Kelly 2010), expect data as obtained, e.g., from RNA-seq or another high-throughput sequencing experiment, in the form of a matrix of counts. The value in the i-th row and the j-th column of the matrix tells how many reads (or fragments, for paired-end RNA-seq) have been assigned to gene i in sample j. Analogously, for other types of assays, the rows of the matrix might correspond e.g., to binding regions (with ChIP-Seq), species of bacteria (with metagenomic datasets), or peptide sequences (with quantitative mass spectrometry).

The values in the matrix should be counts of sequencing reads/fragments. This is important for the statistical models used by DESeq2 and edgeR to hold, as only counts allow assessing the measurement precision correctly. It is important to not provide counts that were pre-normalized for sequencing depth/library size, as the statistical model is most powerful when applied to un-normalized counts and is designed to account for library size differences internally.

3.1 Transcript abundances and the tximport pipeline

In this workflow, we will use transcript abundances as quantified by the Salmon (Patro et al. 2017) software package. Salmon and other methods, such as Sailfish (Patro, Mount, and Kingsford 2014), kallisto (Bray et al. 2016), or RSEM (Li and Dewey 2011), estimate the relative abundances of all (known, annotated) transcripts without aligning reads. Because estimating the abundance of the transcripts involves an inference step, the counts are estimated. Most methods either use a statistical framework called Expectation-Maximization or Bayesian techniques to estimate the abundances and counts. Following quantification, we will use the tximeta (or r Biocpkg("tximport") (Soneson, Love, and Robinson 2015)) package for assembling estimated count and offset matrices for use with Bioconductor differential gene expression packages.

The advantages of using the transcript abundance quantifiers in conjunction with tximeta/tximport to produce gene-level count matrices and normalizing offsets, are: this approach corrects for any potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013); some of these methods are substantially faster and require less memory and disk usage compared to alignment-based methods; and it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence (Robert and Watson 2015). Note that transcript abundance quantifiers skip the generation of large files which store read alignments (SAM or BAM files), instead producing smaller files which store estimated abundances, counts and effective lengths per transcript. For more details, see the manuscript describing this approach (Soneson, Love, and Robinson 2015) and the tximport package vignette for software details.

A full tutorial on how to use the Salmon software for quantifying transcript abundance can be found here, but in the next section we will provide the specific command line code that was used to quantify the eight samples that will be used in this workflow.

3.2 Salmon quantification

The Salmon software logo)

The Salmon software logo)

We begin by providing Salmon with the sequence of all of the reference transcripts, which we will call the reference transcriptome. We used version 27 of the GENCODE human transcripts, downloaded from the GENCODE website. As of writing, GENCODE has already released newer versions, but for computational reproducibility, all previous versions are available from the website. On the command line, creating the transcriptome index looks like:

salmon index -i gencode.v27_salmon_0.8.2 -t gencode.v27.transcripts.fa.gz

The 0.8.2 refers to the version of Salmon that was used. As of writing, Salmon is up to version 0.14.1 and it’s always best to use the latest version of software available.

To quantify an individual sample, sample_01, the following command can be used:

salmon quant -i gencode.v27_salmon_0.8.2 -p 6 --libType A \
  --gcBias --biasSpeedSamp 5 \
  -1 sample_01_1.fastq.gz -2 sample_01_2.fastq.gz \
  -o sample_01

In simple English, this command says to “quantify a sample using this transcriptome index, with 6 threads, using automatic library type detection, using GC bias correction (the bias speed part is now longer needed with current versions of Salmon), here are the first and second read, and use this output directory.” The output directory will be created if it doesn’t exist, though if earlier parts of the path do not exist, it will give an error. A single sample of human RNA-seq usually takes ~5 minutes with the GC bias correction.

Rather than writing the above command on the command line, for quantifying these eight samples, a simple R script was written to perform the following Salmon calls from R, looping over the samples. It is also possible to loop over files using a bash loop, or more advanced workflow management systems such as Snakemake (Köster and Rahmann 2012) or Nextflow (Di Tommaso et al. 2017).

3.3 Importing into R with tximport

Following quantification, we can use tximport to import the data into R and perform statistical analysis using Bioconductor packages. Normally, we would simply point tximport to the quant.sf files on our machine. However, because we are distributing these files as part of an R package, we have to do some extra steps, to figure out where the R package, and so the files, are located on your machine.

The output directories from the above Salmon quantification calls has been stored in the extdata directory of the R package called airway2.

library("devtools")
install_github("mikelove/airway2")

The R function system.file can be used to find out where on your computer the files from a package have been installed. Here we ask for the full path to the extdata directory, where R packages store external data, that is part of the airway2 package.

library("airway2")
dir <- system.file("extdata", package="airway2")
dir
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata"
list.files(dir)
## [1] "gencode.v27.annotation.tx2gene.csv.gz"
## [2] "multiqc_config.yaml"                  
## [3] "multiqc_report.html"                  
## [4] "quants"                               
## [5] "res_lfc-1_FDR-5.csv"                  
## [6] "SraRunTable.txt"

The quantification directories are in the quants directory.

list.files(file.path(dir,"quants"))
## [1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516"
## [6] "SRR1039517" "SRR1039520" "SRR1039521"

The identifiers used here are the SRR identifiers from the Sequence Read Archive. For this experiment, we downloaded a table that links the SRR identifiers to the sample information about each experiment. From this Run Selector view of the experiment, we clicked the button: RunInfo Table. This downloads a file called SraRunTable.txt. We included this file in the airway2 package, and we read it in now:

coldata <- read.delim(file.path(dir, "SraRunTable.txt"))
colnames(coldata)
##  [1] "AvgSpotLen"         "BioSample"          "Experiment"        
##  [4] "MBases"             "MBytes"             "Run"               
##  [7] "SRA_Sample"         "Sample_Name"        "cell_line"         
## [10] "ercc_mix"           "treatment"          "Assay_Type"        
## [13] "BioProject"         "Center_Name"        "Consent"           
## [16] "DATASTORE_filetype" "DATASTORE_provider" "InsertSize"        
## [19] "Instrument"         "LibraryLayout"      "LibrarySelection"  
## [22] "LibrarySource"      "LoadDate"           "Organism"          
## [25] "Platform"           "ReleaseDate"        "SRA_Study"         
## [28] "cell_type"          "source_name"        "tissue"
idx <- c("Run","cell_line","treatment")
coldata[,idx]
##          Run cell_line     treatment
## 1 SRR1039508    N61311     Untreated
## 2 SRR1039509    N61311 Dexamethasone
## 3 SRR1039512   N052611     Untreated
## 4 SRR1039513   N052611 Dexamethasone
## 5 SRR1039516   N080611     Untreated
## 6 SRR1039517   N080611 Dexamethasone
## 7 SRR1039520   N061011     Untreated
## 8 SRR1039521   N061011 Dexamethasone

For simplicity we restrict to these three columns, and do a little cleanup of the column names and the levels of the factors.

coldata <- coldata[,idx]
colnames(coldata) <- c("run","cell","dex")
coldata$cell
## [1] N61311  N61311  N052611 N052611 N080611 N080611 N061011 N061011
## Levels: N052611 N061011 N080611 N61311
coldata$dex
## [1] Untreated     Dexamethasone Untreated     Dexamethasone Untreated    
## [6] Dexamethasone Untreated     Dexamethasone
## Levels: Dexamethasone Untreated

We can rename the levels of the dex column, but we have to be careful that our new level names correspond 1-1 to the previous levels. Here we replace Dexamethasone with trt and Untreated with untrt. There is nothing wrong with the longer names, but sometimes shorter factor levels are easier to work with, and as long as the shorter levels can be understood by other analysts, it saves keystrokes.

levels(coldata$dex)
## [1] "Dexamethasone" "Untreated"
levels(coldata$dex) <- c("trt","untrt")
levels(coldata$dex)
## [1] "trt"   "untrt"

Note: it is prefered in R that the first level of a factor be the reference level (e.g. control, or untreated samples), so we can relevel the dex factor like so:

coldata$dex <- relevel(coldata$dex, "untrt")

Now we can create a named vector that points to the Salmon quantification files. Because we have gzipped the quants for distribution, we point to quant.sf.gz. Normally would just be quant.sf. (The importing software can import directly from gzipped files.)

files <- file.path(dir,"quants",coldata$run,"quant.sf.gz")
names(files) <- coldata$run
all(file.exists(files))
## [1] TRUE

Because we perform a gene-level analysis we need the mapping of transcripts to genes for GENCODE v27. Since GENCODE uses Ensembl transcript identifiers, we use the ensembldb package (Rainer, Gatto, and Weichenberger 2019) to extract these from one of the available annotation databases. Pre-built EnsDb annotation databases for all species for a large number of Ensembl releases are available through the AnnotationHub resource. We thus load below an EnsDb database with human annotations for Ensembl release 91, which corresponds to GENCODE v27. The parameter localHub = TRUE ensures that we list and load only locally stored resources, due to the very limited internet connectivity at the lab site.

library("AnnotationHub")
library("ensembldb")
ah <- AnnotationHub(localHub = TRUE)

ah_91 <- query(ah, "EnsDb.Hsapiens.v91")
ah_91
## AnnotationHub with 1 record
## # snapshotDate(): 2019-05-20 
## # names(): AH60773
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: EnsDb
## # $rdatadateadded: 2017-12-21
## # $title: Ensembl 91 EnsDb for Homo Sapiens
## # $description: Gene and protein annotations for Homo Sapiens based on En...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
## #   "Annotation", "91", "AHEnsDbs") 
## # retrieve record with 'object[["AH60773"]]'
edb <- ah[[names(ah_91)]]

We next define a data.frame that connects transcripts to genes. Note that the information in tx2gene needs to directly match the reference transcriptome used by Salmon.

tx2gene <- transcripts(edb, columns = c("tx_id_version", "gene_id", "tx_id"),
                       return.type = "data.frame")
colnames(tx2gene) <- c("TXNAME", "GENEID", "tx_id")
head(tx2gene)
##              TXNAME          GENEID           tx_id
## 1 ENST00000373020.8 ENSG00000000003 ENST00000373020
## 2 ENST00000496771.5 ENSG00000000003 ENST00000496771
## 3 ENST00000494424.1 ENSG00000000003 ENST00000494424
## 4 ENST00000612152.4 ENSG00000000003 ENST00000612152
## 5 ENST00000614008.4 ENSG00000000003 ENST00000614008
## 6 ENST00000373031.4 ENSG00000000005 ENST00000373031

Finally the following line of code imports Salmon transcript quantifications into R, collapsing to the gene level using the information in tx2gene.

library("tximport")
library("jsonlite")
library("readr")
txi <- tximport(files, type="salmon", tx2gene=tx2gene)

The txi object is simply a list of matrices (and one character vector):

names(txi)
## [1] "abundance"           "counts"              "length"             
## [4] "countsFromAbundance"
txi$counts[1:3,1:3]
##                 SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003   718.5683   471.6148   913.2719
## ENSG00000000005     0.0000     0.0000     0.0000
## ENSG00000000419   466.0003   515.0000   621.9999
txi$length[1:3,1:3]
##                 SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003  2041.7517  2014.3074  2049.5581
## ENSG00000000005   156.5000   156.5000   156.5000
## ENSG00000000419   880.4449   880.7965   879.3954
txi$abundance[1:3,1:3]
##                 SRR1039508 SRR1039509 SRR1039512
## ENSG00000000003   26.61683   19.12435   27.99310
## ENSG00000000005    0.00000    0.00000    0.00000
## ENSG00000000419   40.02907   47.75923   44.43429
txi$countsFromAbundance
## [1] "no"

3.4 Alternative: Importing into R with tximeta

In the previous section we showed how to import the quantifications from Salmon into a list of matrices in R, using the tximport package. The tximeta package provides a wrapper around tximport, which in addition to importing the quantifications also generates a SummarizedExperiment object (see below). Moreover, it automatically identifies the transcriptome that was used for the quantification, and downlaods the corresponding annotation information. Note that this requires an internet connection. Here, we will illustrate how to use the tximeta package to import the data above.

The input to tximeta is a data frame with sample information. This data frame must contain at least two columns, named names and files, and containing the sample IDs and the paths to the Salmon output files, respectively. Since we already have the coldata table above, we just have to rename the Run column and add the paths to the Salmon files.

coldata2 <- coldata
colnames(coldata2)[colnames(coldata2) == "run"] <- "names"
coldata2$files <- file.path(dir,"quants",coldata2$names,"quant.sf.gz")
all(file.exists(coldata2$files))
## [1] TRUE
coldata2
##        names    cell   dex
## 1 SRR1039508  N61311 untrt
## 2 SRR1039509  N61311   trt
## 3 SRR1039512 N052611 untrt
## 4 SRR1039513 N052611   trt
## 5 SRR1039516 N080611 untrt
## 6 SRR1039517 N080611   trt
## 7 SRR1039520 N061011 untrt
## 8 SRR1039521 N061011   trt
##                                                                                                          files
## 1 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039508/quant.sf.gz
## 2 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039509/quant.sf.gz
## 3 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039512/quant.sf.gz
## 4 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039513/quant.sf.gz
## 5 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039516/quant.sf.gz
## 6 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039517/quant.sf.gz
## 7 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039520/quant.sf.gz
## 8 /Library/Frameworks/R.framework/Versions/3.6/Resources/library/airway2/extdata/quants/SRR1039521/quant.sf.gz

Next, we call tximeta, using the sample information data frame as input argument.

library("tximeta")
(txim <- tximeta(coldata = coldata2))
## class: RangedSummarizedExperiment 
## dim: 200401 8 
## metadata(5): tximetaInfo quantInfo countsFromAbundance txomeInfo
##   txdbInfo
## assays(3): counts abundance length
## rownames(200401): ENST00000456328.2 ENST00000450305.2 ...
##   ENST00000387460.2 ENST00000387461.2
## rowData names(3): tx_id gene_id tx_name
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell dex

By default, tximeta returns a SummarizedExperiment object with transcript-level quantifications. Since we are interested in gene-level differential expression analysis, we summarize the abundances on the gene level.

(txim <- summarizeToGene(txim))
## class: RangedSummarizedExperiment 
## dim: 58288 8 
## metadata(5): tximetaInfo quantInfo countsFromAbundance txomeInfo
##   txdbInfo
## assays(3): counts abundance length
## rownames(58288): ENSG00000000003.14 ENSG00000000005.5 ...
##   ENSG00000284747.1 ENSG00000284748.1
## rowData names(1): gene_id
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(3): names cell dex

3.5 Branching point

At this point, we have quantified the RNA-seq fragments and summarized to gene-level counts and abundances. This is a branching point where we could use a variety of Bioconductor packages for exploration and differential expression of the count data, including edgeR (Robinson, McCarthy, and Smyth 2009), limma with the voom method (Law et al. 2014), DSS (Wu, Wang, and Wu 2013), EBSeq (Leng et al. 2013) and baySeq (Hardcastle and Kelly 2010). Schurch et al. (2016) compared performance of different statistical methods for RNA-seq using a large number of biological replicates and can help users to decide which tools make sense to use, and how many biological replicates are necessary to obtain a certain sensitivity. We will continue using DESeq2 (Love, Huber, and Anders 2014) and edgeR (Robinson, McCarthy, and Smyth 2009) below.

3.6 SummarizedExperiment

The DESeq2 package will store the data and intermediate results in an object called a DESeqDataSet. This object builds on top of a general Bioconductor class of object called the SummarizedExperiment. We will therefore first introduce the SummarizedExperiment and we begin with a diagram of the three component pieces:

The component parts of a *SummarizedExperiment* object. The 'assay' (pink block) contains the matrix of counts, the 'rowRanges' or 'rowData' (blue block) contains information about the genomic ranges and the 'colData' (green block) contains information about the samples. The highlighted line in each block represents the first row (note that the first row of 'colData' lines up with the first column of the 'assay').

Figure 1: The component parts of a SummarizedExperiment object
The ‘assay’ (pink block) contains the matrix of counts, the ‘rowRanges’ or ‘rowData’ (blue block) contains information about the genomic ranges and the ‘colData’ (green block) contains information about the samples. The highlighted line in each block represents the first row (note that the first row of ‘colData’ lines up with the first column of the ‘assay’).

The SummarizedExperiment container is diagrammed in the Figure above and discussed in the latest Bioconductor paper (Huber et al. 2015). In our case we have created a single matrix named "counts" that contains the estimated counts for each gene and sample, which is stored in assay. It is also possible to store multiple matrices, accessed with assays. The rowData for our object will keep track of the intermediate calculations for each gene: e.g. the mean of normalized counts, the dispersion estimates, the estimated coefficients and their standard error, etc. We can also include information about the location of each gene, in which case, we would use rowRanges to access the GenomicRanges or GenomicRangesList associated with each gene.The component parts of the SummarizedExperiment are accessed with an R function of the same name: assay (or assays), rowRanges/rowData and colData.

4 The DESeqDataSet object, sample information and the design formula

Bioconductor software packages often define and use a custom class for storing data that makes sure that all the needed data slots are consistently provided and fulfill the requirements. In addition, Bioconductor has general data classes (such as the SummarizedExperiment) that can be used to move data between packages. Additionally, the core Bioconductor classes provide useful functionality: for example, subsetting or reordering the rows or columns of a SummarizedExperiment automatically subsets or reorders the associated rowData and colData, which can help to prevent accidental sample swaps that would otherwise lead to spurious results. With SummarizedExperiment this is all taken care of behind the scenes.

In DESeq2, the custom class is called DESeqDataSet. One of the main differences with SummarizedExperiment is that we will use the counts function to access the assay matrix, and it is required that this matrix stores non-negative integers (this is to prevent mistaken use of the package for non-count data).

A second difference is that the DESeqDataSet has an associated design formula. The experimental design is specified at the beginning of the analysis, as it will inform many of the DESeq2 functions how to treat the samples in the analysis (one exception is the size factor estimation, i.e., the adjustment for differing library sizes, which does not depend on the design formula). The design formula tells which columns in the sample information table (colData) specify the experimental design and how these factors should be used in the analysis.

The simplest design formula for differential expression would be ~ condition, where condition is a column in colData(dds) that specifies which of two (or more groups) the samples belong to. For the airway experiment, we will specify ~ cell + dex meaning that we want to test for the effect of dexamethasone (dex) controlling for the effect of different cell line (cell).

We can construct a DESeqDataSet object from the tximport object using the function DESeqDataSetFromTximport:

library("DESeq2")
dds <- DESeqDataSetFromTximport(txi, coldata, design = ~cell + dex)

For running DESeq2 or edgeR models, you can use R’s formula notation to express any fixed-effects experimental design. Note that DESeq2 and edgeR use the same formula notation as, for instance, the lm function of base R. If the research aim is to determine for which genes the effect of treatment is different across groups, then interaction terms can be included and tested using a design such as ~ group + treatment + group:treatment. See the manual page for ?results for more examples.

4.1 Creating a DGEList for use with edgeR

The edgeR package uses another type of data container, namely a DGEList object. It is just as easy to create a DGEList object using the count matrix and information about samples. We can additionally add information about the genes:

genetable <- data.frame(gene.id = rownames(txi$counts))

Because we are using counts from tximport, we have a slightly different procedure to import them, which involves calculating a statistical offset that will account for differences in gene length across samples. The following code is from the tximport vignette in the edgeR section.

library("edgeR")
cts <- txi$counts
normMat <- txi$length
normMat <- normMat/exp(rowMeans(log(normMat)))
o <- log(calcNormFactors(cts/normMat)) + log(colSums(cts/normMat))

Actually creating the DGEList object simply entails providing the counts, sample information, gene and information. The second line adds an appropriate offset.

dge <- DGEList(counts = cts,
               samples = coldata,
               genes = genetable)
dge <- scaleOffset(dge, t(t(log(normMat)) + o))
names(dge)
## [1] "counts"  "samples" "genes"   "offset"

A much simpler approach with respect to the amount of code is to generate counts-from-abundance using tximport, which correct for the gene length changes directly. Example for this code is:

txi2 <- tximport(files, type="salmon", tx2gene=tx2gene,
                 countsFromAbundance="lengthScaledTPM")
dge2 <- DGEList(counts = txi2$counts,
                samples = coldata,
                genes = genetable)
names(dge2)
## [1] "counts"  "samples" "genes"

The counts + offset method and the counts-from-abundance method are not statistically identical (obviously), but tend to output roughly similar groups of genes. In this dataset, using counts-from-abundance has 96% of genes in common with count + offset at a target 5% FDR.

Just like the SummarizedExperiment and the DESeqDataSet the DGEList contains all the information we need: the count matrix, information about the samples (columns of the count matrix), and information about the genes (rows of the count matrix).

5 Exploratory analysis and visualization

There are two separate paths in this workflow; the one we will see first involves transformations of the counts in order to visually explore sample relationships. In the second part, we will go back to the original raw counts for statistical testing. This is critical because the statistical testing methods rely on original count data (not scaled or transformed) for calculating the precision of measurements.

5.1 Pre-filtering the dataset

Our count matrix with our DESeqDataSet contains many rows with only zeros, and additionally many rows with only a few fragments in total. In order to reduce the size of the object, and to increase the speed of our functions, we can remove the rows that have no or nearly no information about the amount of gene expression. Here we apply the most minimal filtering rule: removing rows of the DESeqDataSet that have no counts, or only a single count across all samples. Additional weighting/filtering to improve power is applied at a later step in the workflow.

nrow(dds)
## [1] 58243
dds <- dds[ rowSums(counts(dds)) > 1, ]
nrow(dds)
## [1] 32748

Here, we also filter the DGEList for edgeR in the same way. Note, however, that edgeR does not apply filtering internally, and for this reason, typically a stronger prefiltering criterion is used at this stage for edgeR.

dge <- dge[rowSums(round(dge$counts)) > 1, ]
all(rownames(dge) == rownames(dds))
## [1] TRUE

We use the above code mostly for aiding comparison with DESeq2 in later sections. The recommended filtering code for edgeR (here not evaluated) is as follows:

dge <- dge[filterByExpr(dge),]

5.2 Variance stabilizing transformations

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. When the expected amount of variance is approximately the same across different mean values, the data is said to be homoskedastic. For RNA-seq counts, however, the expected variance grows with the mean. For example, if one performs PCA directly on a matrix of counts or normalized counts (e.g. correcting for differences in sequencing depth), the resulting plot typically depends mostly on the genes with highest counts because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a pseudocount of 1; however, depending on the choice of pseudocount, now the genes with the very lowest counts will contribute a great deal of noise to the resulting plot, because taking the logarithm of small counts actually inflates their variance. We can quickly show this property of counts with some simulated data (here, Poisson counts with a range of lambda from 0.1 to 100). We plot the standard deviation of each row (genes) against the mean:

lambda <- 10^seq(from = -1, to = 2, length = 1000)
cts <- matrix(rpois(1000*100, lambda), ncol = 100)
library("vsn")
meanSdPlot(cts, ranks = FALSE)
SD over mean plot for Poisson-distributed counts.

Figure 2: SD over mean plot for Poisson-distributed counts

And for logarithm-transformed counts after adding a pseudocount of 1:

log.cts.one <- log2(cts + 1)
meanSdPlot(log.cts.one, ranks = FALSE)
SD over mean plot for log(x+1) transformed counts.

Figure 3: SD over mean plot for log(x+1) transformed counts

The logarithm with a small pseudocount amplifies differences when the values are close to 0. The low count genes with low signal-to-noise ratio will overly contribute to sample-sample distances and PCA plots.

As a solution, DESeq2 offers two transformations for count data that stabilize the variance across the mean: (1) the the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend (Anders and Huber 2010), implemented in the vst function, and (2) the regularized-logarithm transformation or rlog (Love, Huber, and Anders 2014).

For genes with high counts, the VST and rlog will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. The VST or rlog-transformed data then becomes approximately homoskedastic, and can be used directly for computing distances between samples, making PCA plots, or as input to downstream methods which perform best with homoskedastic data.

Which transformation to choose? Mike tends to prefer the VST. The VST is much faster to compute via the function vst and is less sensitive to high count outliers than the rlog. Perhaps further development on the rlog would make it less sensitive to outliers. (Why have the rlog at all then? In the DESeq2 paper, the rlog sometimes outperformed the VST in terms of recovering clusters when there was a large range of sequencing depth across samples e.g. x10 from lowest to highest sequencing depth.)

Note that the two transformations offered by DESeq2 are provided for applications other than differential testing. For differential testing we recommend the DESeq function applied to raw counts, as described later in this workflow, which also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step.

Both transformation functions return an object based on the SummarizedExperiment class that contains the transformed values in its assay slot.

vsd <- vst(dds, blind = FALSE)
head(assay(vsd), 3)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003  10.221687   9.978170  10.277697  10.113774  10.521661
## ENSG00000000419   9.851450  10.042274   9.945383   9.946456   9.892338
## ENSG00000000457   9.615346   9.497622   9.538941   9.603572   9.447866
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003  10.312734  10.409570  10.121840
## ENSG00000000419  10.019459   9.830019   9.980094
## ENSG00000000457   9.572822   9.639053   9.613115
meanSdPlot(assay(vsd), ranks = FALSE)
SD over mean plot for VST values from counts used in this workflow.

Figure 4: SD over mean plot for VST values from counts used in this workflow

rld <- rlog(dds, blind = FALSE)
head(assay(rld), 3)
##                 SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000000003   9.501184   9.184020   9.571559   9.364104   9.867477
## ENSG00000000419   8.907946   9.163535   9.035810   9.037123   8.963851
## ENSG00000000457   8.391302   8.217315   8.279013   8.373579   8.139702
##                 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003   9.615245   9.732977   9.373479
## ENSG00000000419   9.134549   8.878498   9.082016
## ENSG00000000457   8.329449   8.425159   8.387897
meanSdPlot(assay(rld), ranks = FALSE)
SD over mean plot for rlog-transformed counts from this workflow.

Figure 5: SD over mean plot for rlog-transformed counts from this workflow

In the above function calls, we specified blind = FALSE, which means that differences between cell lines and treatment (the variables in the design) will not contribute to the expected variance-mean trend of the experiment. The experimental design is not used directly in the transformation, only in estimating the global amount of variability in the counts. For a fully unsupervised transformation, one can set blind = TRUE (which is the default).

To show the effect of the transformation, in the figure below we plot the first sample against the second, first simply using the log2 function of normalized counts (after adding 1, to avoid taking the log of zero), and then using the VST and rlog-transformed values.

library("dplyr")
library("ggplot2")
library("hexbin")
ntd <- normTransform(dds)

df <- bind_rows(
  as_data_frame(assay(ntd)[, 1:2]) %>% mutate(transformation = "log2(x + 1)"),
  as_data_frame(assay(vsd)[, 1:2]) %>% mutate(transformation = "vst"),
  as_data_frame(assay(rld)[, 1:2]) %>% mutate(transformation = "rlog"))
colnames(df)[1:2] <- c("x", "y")  

ggplot(df, aes(x = x, y = y)) + geom_hex(bins = 80) +
  coord_fixed() + facet_grid(. ~ transformation)
Scatterplot of transformed counts from two samples. Shown are scatterplots using the log2 transform of normalized counts (left), using the VST (middle), and using the rlog (right). While the rlog is on roughly the same scale as the log2 counts, the VST has a upward shift for the smaller values. It is the differences between samples (deviation from y = x in these scatterplots) which will contribute to the distance calculations and the PCA plot.

Figure 6: Scatterplot of transformed counts from two samples
Shown are scatterplots using the log2 transform of normalized counts (left), using the VST (middle), and using the rlog (right). While the rlog is on roughly the same scale as the log2 counts, the VST has a upward shift for the smaller values. It is the differences between samples (deviation from y = x in these scatterplots) which will contribute to the distance calculations and the PCA plot.

We can see how genes with low counts (bottom left-hand corner) seem to be excessively variable on the ordinary logarithmic scale, while the VST and rlog compress differences for the low count genes for which the data provide little information about differential expression.

5.3 Sample distances

A useful first step in an RNA-seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment’s design?

We use the R function dist to calculate the Euclidean distance between samples. To ensure we have a roughly equal contribution from all genes, we use it on the VST data. We need to transpose the matrix of values using t, because the dist function expects the different samples to be rows of its argument, and different dimensions (here, genes) to be columns.

sampleDists <- dist(t(assay(vsd)))
sampleDists
##            SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## SRR1039509   37.50000                                                       
## SRR1039512   30.33886   42.43052                                            
## SRR1039513   48.29300   35.04028   39.50752                                 
## SRR1039516   33.17285   44.70749   32.27577   48.94303                      
## SRR1039517   48.23758   39.22630   43.95400   38.24421   37.54620           
## SRR1039520   30.42424   44.44568   28.41660   45.46122   34.74551   47.58855
## SRR1039521   48.59103   35.80141   44.45068   29.90182   49.61229   39.00323
##            SRR1039520
## SRR1039509           
## SRR1039512           
## SRR1039513           
## SRR1039516           
## SRR1039517           
## SRR1039520           
## SRR1039521   40.63466

We visualize the distances in a heatmap in a figure below, using the function pheatmap from the pheatmap package.

library("pheatmap")
library("RColorBrewer")

In order to plot the sample distance matrix with the rows/columns arranged by the distances in our distance matrix, we manually provide sampleDists to the clustering_distance argument of the pheatmap function. Otherwise the pheatmap function would assume that the matrix contains the data values themselves, and would calculate distances between the rows/columns of the distance matrix, which is not desired. We also manually specify a blue color palette using the colorRampPalette function from the RColorBrewer package.

sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(vsd$dex, vsd$cell, sep = " - ")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists,
         col = colors)
Heatmap of sample-to-sample distances using the VST values.

Figure 7: Heatmap of sample-to-sample distances using the VST values

Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap.

Another option for calculating sample distances is to use the Poisson Distance (Witten 2011), implemented in the PoiClaClu package. This measure of dissimilarity between counts also takes the inherent variance structure of counts into consideration when calculating the distances between samples. The PoissonDistance function takes the original count matrix (not normalized) with samples as rows instead of columns, so we need to transpose the counts in dds.

library("PoiClaClu")
poisd <- PoissonDistance(t(counts(dds)))

We plot the heatmap in a Figure below.

samplePoisDistMatrix <- as.matrix(poisd$dd)
rownames(samplePoisDistMatrix) <- paste(vsd$dex, vsd$cell, sep = " - ")
colnames(samplePoisDistMatrix) <- NULL
pheatmap(samplePoisDistMatrix,
         clustering_distance_rows = poisd$dd,
         clustering_distance_cols = poisd$dd,
         col = colors)
Heatmap of sample-to-sample distances using the *Poisson Distance*.

Figure 8: Heatmap of sample-to-sample distances using the Poisson Distance

5.4 PCA plot

Another way to visualize sample-to-sample distances is a principal components analysis (PCA). In this ordination method, the data points (here, the samples) are projected onto the 2D plane such that they spread out in the two directions that explain most of the differences (figure below). The x-axis is the direction that separates the data points the most. The values of the samples in this direction are written PC1. The y-axis is a direction (it must be orthogonal to the first direction) that separates the data the second most. The values of the samples in this direction are written PC2. The percent of the total variance that is contained in the direction is printed in the axis label. Note that these percentages do not add to 100%, because there are more dimensions that contain the remaining variance (although each of these remaining dimensions will explain less than the two that we see).

plotPCA(vsd, intgroup = c("dex", "cell"))
PCA plot using the VST values. Each unique combination of treatment and cell line is given its own color.

Figure 9: PCA plot using the VST values
Each unique combination of treatment and cell line is given its own color.

Here, we have used the function plotPCA that comes with DESeq2. The two terms specified by intgroup are the interesting groups for labeling the samples; they tell the function to use them to choose colors. We can also build the PCA plot from scratch using the ggplot2 package (Wickham 2009). This is done by asking the plotPCA function to return the data used for plotting rather than building the plot. See the ggplot2 documentation for more details on using ggplot.

pcaData <- plotPCA(vsd, intgroup = c("dex", "cell"), returnData = TRUE)
pcaData
##                   PC1        PC2         group   dex    cell       name
## SRR1039508 -13.733856 -1.8922672  untrt:N61311 untrt  N61311 SRR1039508
## SRR1039509   7.850058  0.2823076    trt:N61311   trt  N61311 SRR1039509
## SRR1039512  -8.854522 -3.9485044 untrt:N052611 untrt N052611 SRR1039512
## SRR1039513  13.978649 -3.6144982   trt:N052611   trt N052611 SRR1039513
## SRR1039516 -11.919822 10.3630468 untrt:N080611 untrt N080611 SRR1039516
## SRR1039517   8.798182 13.9783088   trt:N080611   trt N080611 SRR1039517
## SRR1039520 -10.334421 -8.2904258 untrt:N061011 untrt N061011 SRR1039520
## SRR1039521  14.215733 -6.8779675   trt:N061011   trt N061011 SRR1039521
percentVar <- round(100 * attr(pcaData, "percentVar"))

We can then use these data to build up a second plot in a figure below, specifying that the color of the points should reflect dexamethasone treatment and the shape should reflect the cell line.

ggplot(pcaData, aes(x = PC1, y = PC2, color = dex, shape = cell)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  coord_fixed()
PCA plot using the VST values with custom *ggplot2* code. Here we specify cell line (plotting symbol) and dexamethasone treatment (color).

Figure 10: PCA plot using the VST values with custom ggplot2 code
Here we specify cell line (plotting symbol) and dexamethasone treatment (color).

From the PCA plot, we see that the differences between cells (the different plotting shapes) are considerable, though not stronger than the differences due to treatment with dexamethasone (red vs blue color). This shows why it will be important to account for this in differential testing by using a paired design (“paired”, because each dex treated sample is paired with one untreated sample from the same cell line). We are already set up for this design by assigning the formula ~ cell + dex earlier.

5.5 MDS plot

Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don’t have a matrix of data, but only a matrix of distances. Here we compute the MDS for the distances calculated from the VST data and plot these in a figure below.

mds <- as.data.frame(colData(vsd))  %>%
         cbind(cmdscale(sampleDistMatrix))
ggplot(mds, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
  geom_point(size = 3) + coord_fixed()
MDS plot using VST values.

Figure 11: MDS plot using VST values

In a figure below we show the same plot for the PoissonDistance:

mdsPois <- as.data.frame(colData(dds)) %>%
   cbind(cmdscale(samplePoisDistMatrix))
ggplot(mdsPois, aes(x = `1`, y = `2`, color = dex, shape = cell)) +
  geom_point(size = 3) + coord_fixed()
MDS plot using the *Poisson Distance*.

Figure 12: MDS plot using the Poisson Distance

6 Differential expression analysis

6.1 Performing differential expression testing with DESeq2

As we have already specified an experimental design when we created the DESeqDataSet, we can run the differential expression pipeline on the raw counts with a single call to the function DESeq:

dds <- DESeq(dds)

This function will print out a message for the various steps it performs. These are described in more detail in the manual page for DESeq, which can be accessed by typing ?DESeq. Briefly these are: the estimation of size factors (controlling for differences in the sequencing depth of the samples), the estimation of dispersion values for each gene, and fitting a generalized linear model.

A DESeqDataSet is returned that contains all the fitted parameters within it, and the following section describes how to extract out results tables of interest from this object.

6.2 Building the results table

Calling results without any arguments will extract the estimated log2 fold changes and p values for the last variable in the design formula. If there are more than 2 levels for this variable, results will extract the results table for a comparison of the last level over the first level. The comparison is printed at the top of the output: dex trt vs untrt.

res <- results(dds)
res
## log2 fold change (MLE): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 32748 rows and 6 columns
##                          baseMean      log2FoldChange             lfcSE
##                         <numeric>           <numeric>         <numeric>
## ENSG00000000003  748.016996954203  -0.353757056733999 0.106951396185109
## ENSG00000000419  522.868589367106   0.205250588223682 0.124923884026677
## ENSG00000000457  320.347867978525  0.0279097457757637 0.156868246893547
## ENSG00000000460  80.8581320979381  -0.216903040608035 0.302150308730343
## ENSG00000000938 0.308650211610144   -1.37447285564756  3.50375579763067
## ...                           ...                 ...               ...
## ENSG00000284738  4.40253030246072   0.665763550870547  0.98942622304473
## ENSG00000284740  2.95721859427428   -2.34387920763935  1.43542703692369
## ENSG00000284744 0.989456372823932  -0.216737576253391  2.65649387440176
## ENSG00000284747  14.2137963831892    1.27009488804022  0.68686593415504
## ENSG00000284748 0.669382049804924 -0.0528044401203826  3.08832551896903
##                                stat               pvalue                padj
##                           <numeric>            <numeric>           <numeric>
## ENSG00000000003   -3.30764318515042 0.000940846012638153 0.00652127155153376
## ENSG00000000419    1.64300517729542    0.100381862078792   0.269633053284747
## ENSG00000000457    0.17791838902046    0.858787068904449   0.939497621921653
## ENSG00000000460  -0.717864699590996    0.472840715549353   0.711209893537231
## ENSG00000000938  -0.392285574404761    0.694847221158491                  NA
## ...                             ...                  ...                 ...
## ENSG00000284738   0.672878417171736    0.501024644033436                  NA
## ENSG00000284740    -1.6328793782947    0.102494368051487                  NA
## ENSG00000284744 -0.0815878321203359    0.934974477595501                  NA
## ENSG00000284747    1.84911614462675   0.0644410436338607   0.196342147214049
## ENSG00000284748 -0.0170980810785808    0.986358369766931                  NA

We could have equivalently produced this results table with the following more specific command. Because dex is the last variable in the design, we could optionally leave off the contrast argument to extract the comparison of the two levels of dex.

res <- results(dds, contrast = c("dex", "trt", "untrt"))

As res is a DataFrame object, it carries metadata with information on the meaning of the columns:

mcols(res, use.names = TRUE)
## DataFrame with 6 rows and 2 columns
##                        type                               description
##                 <character>                               <character>
## baseMean       intermediate mean of normalized counts for all samples
## log2FoldChange      results  log2 fold change (MLE): dex trt vs untrt
## lfcSE               results          standard error: dex trt vs untrt
## stat                results          Wald statistic: dex trt vs untrt
## pvalue              results       Wald test p-value: dex trt vs untrt
## padj                results                      BH adjusted p-values

The first column, baseMean, is a just the average of the normalized count values, divided by the size factors, taken over all samples in the DESeqDataSet. The remaining four columns refer to a specific contrast, namely the comparison of the trt level over the untrt level for the factor variable dex. We will find out below how to obtain other contrasts.

The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed due to treatment with dexamethasone in comparison to untreated samples. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of \(2^{1.5} \approx 2.82\).

Of course, this estimate has an uncertainty associated with it, which is available in the column lfcSE, the standard error estimate for the log2 fold change estimate. We can also express the uncertainty of a particular effect size estimate as the result of a statistical test. The purpose of a test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero. DESeq2 performs for each gene a hypothesis test to see whether evidence is sufficient to decide against the null hypothesis that there is zero effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability (i.e., the type of variability that you can expect between different samples in the same treatment group). As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue. Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis.

We can also summarize the results with the following line of code, which reports some additional information, that will be covered in later sections.

summary(res)
## 
## out of 32748 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2345, 7.2%
## LFC < 0 (down)     : 1920, 5.9%
## outliers [1]       : 0, 0%
## low counts [2]     : 16508, 50%
## (mean count < 11)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Note that there are many genes with differential expression due to dexamethasone treatment at the FDR level of 10%. This makes sense, as the smooth muscle cells of the airway are known to react to glucocorticoid steroids. However, there are two ways to be more strict about which set of genes are considered significant:

  • lower the false discovery rate threshold (the threshold on padj in the results table)
  • raise the log2 fold change threshold from 0 using the lfcThreshold argument of results

If we lower the false discovery rate threshold, we should also inform the results() function about it, so that the function can use this threshold for the optimal independent filtering that it performs:

res.05 <- results(dds, alpha = 0.05)
table(res.05$padj < 0.05)
## 
## FALSE  TRUE 
## 13383  3492

If we want to raise the log2 fold change threshold, so that we test for genes that show more substantial changes due to treatment, we simply supply a value on the log2 scale. For example, by specifying lfcThreshold = 1, we test for genes that show significant effects of treatment on gene counts more than doubling or less than halving, because \(2^1=2\).

resLFC1 <- results(dds, lfcThreshold = 1)
table(resLFC1$padj < 0.05)
## 
## FALSE  TRUE 
## 19227   188

Sometimes a subset of the p values in res will be NA (“not available”). This is DESeq’s way of reporting that all counts for this gene were zero, and hence no test was applied. In addition, p values can be assigned NA if the gene was excluded from analysis because it contained an extreme count outlier. For more information, see the outlier detection section of the DESeq2 vignette.

6.3 Independent hypothesis weighting within DESeq2

The results function in DESeq2 integrates with a new method for dealing with genes that have low counts and therefore low power for detecting differential gene expression. This method is called Independent Hypothesis Weighting (IHW) (Ignatiadis et al. 2016), and generalizes the idea of filtering low count genes to reduce the multiple testing burden. While the specifics of the method are beyond the scope of this workflow, we demonstrate its usage here:

library("IHW")
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
## 
## out of 32748 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2405, 7.3%
## LFC < 0 (down)     : 1921, 5.9%
## outliers [1]       : 0, 0%
## [1] see 'cooksCutoff' argument of ?results
## see metadata(res)$ihwResult on hypothesis weighting

The results are similar but not identical to the results table built above, which uses a simple optimization of a filter over mean normalized counts. The new procedure results in a gain of 50 genes.

table(old=res$padj < .1, IHW=resIHW$padj < .1)
##        IHW
## old     FALSE  TRUE
##   FALSE 11813   162
##   TRUE    118  4147

We can plot the weighting function which is fit by the IHW package:

plot(metadata(resIHW)$ihwResult)
IHW weighting function.

Figure 13: IHW weighting function

6.4 Other comparisons

In general, the results for a comparison of any two levels of a variable can be extracted using the contrast argument to results. The user should specify three values: the name of the variable, the name of the level for the numerator, and the name of the level for the denominator. Here we extract results for the log2 of the fold change of one cell line over another:

results(dds, contrast = c("cell", "N061011", "N61311"))
## log2 fold change (MLE): cell N061011 vs N61311 
## Wald test p-value: cell N061011 vs N61311 
## DataFrame with 32748 rows and 6 columns
##                          baseMean      log2FoldChange             lfcSE
##                         <numeric>           <numeric>         <numeric>
## ENSG00000000003  748.016996954203    0.26405210396441 0.152270063295407
## ENSG00000000419  522.868589367106 -0.0729290429325449 0.177572255294167
## ENSG00000000457  320.347867978525   0.139028755688018 0.222510619516051
## ENSG00000000460  80.8581320979381 -0.0494079532724753 0.423446388387593
## ENSG00000000938 0.308650211610144                   0   4.9975798104491
## ...                           ...                 ...               ...
## ENSG00000284738  4.40253030246072   -1.05831123904053  1.38923738352172
## ENSG00000284740  2.95721859427428    -1.3711188072016  2.00626873053414
## ENSG00000284744 0.989456372823932                   0  3.97321970079478
## ENSG00000284747  14.2137963831892   0.226234777609869 0.983124140103056
## ENSG00000284748 0.669382049804924    2.45829279074629  4.39131190913349
##                               stat             pvalue              padj
##                          <numeric>          <numeric>         <numeric>
## ENSG00000000003   1.73410385633152 0.0828996574187213 0.403806369670076
## ENSG00000000419 -0.410700662734336  0.681292040920636 0.934174081426105
## ENSG00000000457  0.624818518731368   0.53209017498986 0.882460596343736
## ENSG00000000460 -0.116680540033915  0.907113211730804 0.983806501837068
## ENSG00000000938                  0                  1                NA
## ...                            ...                ...               ...
## ENSG00000284738 -0.761792945967021  0.446183591154616                NA
## ENSG00000284740   -0.6834173240773  0.494343176131897                NA
## ENSG00000284744                  0                  1                NA
## ENSG00000284747  0.230118220458053  0.817999906910469 0.969365306934029
## ENSG00000284748  0.559808285454123  0.575610211594061                NA

There are additional ways to build results tables for certain comparisons after running DESeq once. If results for an interaction term are desired, the name argument of results should be used. Please see the help page for the results function for details on the additional ways to build results tables. In particular, the Examples section of the help page for results gives some pertinent examples.

6.5 Annotation of RNA-seq results

One of the central steps in the analysis of genomic data is the annotation of the quantified entities to biologically more relevant representations such as transcripts, genes or proteins. Such annotations enable for example pathway enrichment analyses and ease the interpretation of the results. Bioconductor provides a large variety of annotation packages and resources, most of them supporting the AnnotationDbi interface which enables a unified way to retrieve annotations for given identifiers. The annotation resources from Bioconductor range from web-based tools such as biomaRt to packages working with pre-built databases containing all identifier mappings for a certain species (the *.org packages such as org.Hs.eg.db) or providing gene and transcript models, such as databases build by GenomicFeatures (TxDb databases/packages) and ensembldb (EnsDb databases/packages) (Rainer, Gatto, and Weichenberger 2019).

In this section we use the ensembldb package to annotate all tested genes which are identified by Ensembl gene IDs. The Salmon quantification used in the analysis leading to this table was based on GENCODE v27 transcripts which is linked to Ensembl release 91. We thus load below an EnsDb database with human annotations for Ensembl release 91 from the AnnotationHub resource. The parameter localHub = TRUE ensures that we list and load only locally stored resources (due to the very limited internet connectivity at the lab site).

library("AnnotationHub")
library("ensembldb")
ah <- AnnotationHub(localHub = TRUE)

ah_91 <- query(ah, "EnsDb.Hsapiens.v91")
ah_91
## AnnotationHub with 1 record
## # snapshotDate(): 2019-05-20 
## # names(): AH60773
## # $dataprovider: Ensembl
## # $species: Homo sapiens
## # $rdataclass: EnsDb
## # $rdatadateadded: 2017-12-21
## # $title: Ensembl 91 EnsDb for Homo Sapiens
## # $description: Gene and protein annotations for Homo Sapiens based on En...
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: ensembl
## # $sourceurl: http://www.ensembl.org
## # $sourcesize: NA
## # $tags: c("EnsDb", "Ensembl", "Gene", "Transcript", "Protein",
## #   "Annotation", "91", "AHEnsDbs") 
## # retrieve record with 'object[["AH60773"]]'
#' Load the database
edb <- ah[[names(ah_91)]]

If we had a working internet connection we could simply list all available EnsDb databases with query(ah, "EnsDb") (after loading AnnotationHub with localHub = FALSE). We can now use the edb variable to access the data in the database. With the listColumns function we can for example list all of the available annotation fields (columns) in the database.

listColumns(edb)
##  [1] "seq_name"              "seq_length"            "is_circular"          
##  [4] "gene_id"               "entrezid"              "exon_id"              
##  [7] "exon_seq_start"        "exon_seq_end"          "gene_name"            
## [10] "gene_biotype"          "gene_seq_start"        "gene_seq_end"         
## [13] "seq_strand"            "seq_coord_system"      "description"          
## [16] "gene_id_version"       "symbol"                "name"                 
## [19] "value"                 "tx_id"                 "protein_id"           
## [22] "protein_sequence"      "protein_domain_id"     "protein_domain_source"
## [25] "interpro_accession"    "prot_dom_start"        "prot_dom_end"         
## [28] "tx_biotype"            "tx_seq_start"          "tx_seq_end"           
## [31] "tx_cds_seq_start"      "tx_cds_seq_end"        "tx_support_level"     
## [34] "tx_id_version"         "tx_name"               "exon_idx"             
## [37] "uniprot_id"            "uniprot_db"            "uniprot_mapping_type"

We could get data from any of these database columns by passing the respective column name(s) along with the column parameter to any function retrieving annotations from an EnsDb database. Gene-related annotations can be retrieved with the genes function, that by default (and similar to the genes function from the GenomicFeatures package) returns a GRanges object with the genomic coordinates of the genes and additional annotation columns in the GRangesmetadata columns. Below we use the genes call to retrieve annotations for all human genes. With the parameter return.type = "DataFrame" we tell the function to return the result as a DataFrame instead of the default GRanges.

anns <- genes(edb, return.type = "DataFrame")
rownames(anns) <- anns$gene_id

We can now combine the RNA-seq result with the annotation.

resAnn <- BiocGenerics::cbind(anns[BiocGenerics::rownames(res), ], res)

resAnn
## DataFrame with 32748 rows and 18 columns
##                         gene_id   gene_name
##                     <character> <character>
## ENSG00000000003 ENSG00000000003      TSPAN6
## ENSG00000000419 ENSG00000000419        DPM1
## ENSG00000000457 ENSG00000000457       SCYL3
## ENSG00000000460 ENSG00000000460    C1orf112
## ENSG00000000938 ENSG00000000938         FGR
## ...                         ...         ...
## ENSG00000284738 ENSG00000284738  AL358472.5
## ENSG00000284740 ENSG00000284740  AL645728.2
## ENSG00000284744 ENSG00000284744  AL591163.1
## ENSG00000284747 ENSG00000284747  AL034417.4
## ENSG00000284748 ENSG00000284748  AL513220.1
##                                       gene_biotype gene_seq_start
##                                        <character>      <integer>
## ENSG00000000003                     protein_coding      100627109
## ENSG00000000419                     protein_coding       50934867
## ENSG00000000457                     protein_coding      169849631
## ENSG00000000460                     protein_coding      169662007
## ENSG00000000938                     protein_coding       27612064
## ...                                            ...            ...
## ENSG00000284738                      antisense_RNA      153923337
## ENSG00000284740 transcribed_unprocessed_pseudogene        1503250
## ENSG00000284744                            lincRNA        6767954
## ENSG00000284747                      antisense_RNA        7991134
## ENSG00000284748      bidirectional_promoter_lncRNA       37596126
##                 gene_seq_end    seq_name seq_strand seq_coord_system
##                    <integer> <character>  <integer>      <character>
## ENSG00000000003    100639991           X         -1       chromosome
## ENSG00000000419     50958555          20         -1       chromosome
## ENSG00000000457    169894267           1         -1       chromosome
## ENSG00000000460    169854080           1          1       chromosome
## ENSG00000000938     27635277           1         -1       chromosome
## ...                      ...         ...        ...              ...
## ENSG00000284738    153935240           1          1       chromosome
## ENSG00000284740      1509452           1          1       chromosome
## ENSG00000284744      6770038           1          1       chromosome
## ENSG00000284747      8005312           1          1       chromosome
## ENSG00000284748     37607336           1          1       chromosome
##                                                                                                    description
##                                                                                                    <character>
## ENSG00000000003                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## ENSG00000000419 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
## ENSG00000000457                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## ENSG00000000460                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
## ENSG00000000938              FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
## ...                                                                                                        ...
## ENSG00000284738                                                                                           NULL
## ENSG00000284740                                                                                           NULL
## ENSG00000284744                                                                                           NULL
## ENSG00000284747                                                                                           NULL
## ENSG00000284748                                                                                           NULL
##                    gene_id_version      symbol  entrezid          baseMean
##                        <character> <character>    <list>         <numeric>
## ENSG00000000003 ENSG00000000003.14      TSPAN6      7105  748.016996954203
## ENSG00000000419 ENSG00000000419.12        DPM1      8813  522.868589367106
## ENSG00000000457 ENSG00000000457.13       SCYL3     57147  320.347867978525
## ENSG00000000460 ENSG00000000460.16    C1orf112     55732  80.8581320979381
## ENSG00000000938 ENSG00000000938.12         FGR      2268 0.308650211610144
## ...                            ...         ...       ...               ...
## ENSG00000284738  ENSG00000284738.1  AL358472.5 101928059  4.40253030246072
## ENSG00000284740  ENSG00000284740.1  AL645728.2        NA  2.95721859427428
## ENSG00000284744  ENSG00000284744.1  AL591163.1        NA 0.989456372823932
## ENSG00000284747  ENSG00000284747.1  AL034417.4        NA  14.2137963831892
## ENSG00000284748  ENSG00000284748.1  AL513220.1        NA 0.669382049804924
##                      log2FoldChange             lfcSE                stat
##                           <numeric>         <numeric>           <numeric>
## ENSG00000000003  -0.353757056733999 0.106951396185109   -3.30764318515042
## ENSG00000000419   0.205250588223682 0.124923884026677    1.64300517729542
## ENSG00000000457  0.0279097457757637 0.156868246893547    0.17791838902046
## ENSG00000000460  -0.216903040608035 0.302150308730343  -0.717864699590996
## ENSG00000000938   -1.37447285564756  3.50375579763067  -0.392285574404761
## ...                             ...               ...                 ...
## ENSG00000284738   0.665763550870547  0.98942622304473   0.672878417171736
## ENSG00000284740   -2.34387920763935  1.43542703692369    -1.6328793782947
## ENSG00000284744  -0.216737576253391  2.65649387440176 -0.0815878321203359
## ENSG00000284747    1.27009488804022  0.68686593415504    1.84911614462675
## ENSG00000284748 -0.0528044401203826  3.08832551896903 -0.0170980810785808
##                               pvalue                padj
##                            <numeric>           <numeric>
## ENSG00000000003 0.000940846012638153 0.00652127155153376
## ENSG00000000419    0.100381862078792   0.269633053284747
## ENSG00000000457    0.858787068904449   0.939497621921653
## ENSG00000000460    0.472840715549353   0.711209893537231
## ENSG00000000938    0.694847221158491                  NA
## ...                              ...                 ...
## ENSG00000284738    0.501024644033436                  NA
## ENSG00000284740    0.102494368051487                  NA
## ENSG00000284744    0.934974477595501                  NA
## ENSG00000284747   0.0644410436338607   0.196342147214049
## ENSG00000284748    0.986358369766931                  NA

If we had only a reduced set of genes to annotate, e.g. the set of the most significant genes, it is faster and more elegant to extract annotations for the subset of genes using the filter framework from ensembldb. Below we create a top table by selecting the 20 genes with the smallest p-values.

resTop20 <- res[order(res$pvalue), ][1:20, ]

To get only annotations for the selected genes we pass the filter expression ~ gene_id == rownames(resTop20) with the filter parameter to the genes function. Filter expressions have to be written in the form of a formula (i.e. starting with ~) and support any logical R expression and any database column/field in the EnsDb database (use supportedFilters(edb) to list all supported fields). Also, by specifying the respective field names with the columns parameter we extract only the gene name (equivalent with the HGNC symbol for most genes), description the gene biotype and the NCBI Entrezgene ID from the database.

anns <- genes(edb, filter = ~ gene_id == rownames(resTop20),
              columns = c("gene_name", "description", "gene_biotype",
                          "entrezid"), return.type = "DataFrame")

Note that the order of the genes in the returned data.frame is not the same as in resTop20. We thus below re-order the annotations to match the order of genes in the top table and subsequently join the two data frames.

resTop20 <- cbind(anns[match(rownames(resTop20), anns$gene_id), ],
                  resTop20)

Be aware that mappings between Ensembl and NCBI identifiers is not necessarily a 1:1 mapping. In our case we got for some Ensembl gene IDs more than one NCBI Entrezgene ID. The column "entrezid" in our result table is thus a list with eventually more than one Entrezgene ID in one row. Exporting such a table could be problematic and we collapse therefore Entrezgene IDs in this columns into a single character string, with multiple IDs separated by a semicolon (";").

resTop20$entrezid <- sapply(resTop20$entrezid, function(z) {
    if (any(is.na(z))) z
    else paste(z, collapse = ";")
})

The 20 most significantly regulated genes in the present analysis are listed below.

gene_name description gene_biotype entrezid gene_id baseMean log2FoldChange lfcSE stat pvalue padj
MAOA monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833] protein_coding 4128 ENSG00000189221 2422.4722 3.422182 0.1343714 25.46809 0 0
DUSP1 dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064] protein_coding 1843 ENSG00000120129 3468.9003 2.967590 0.1202484 24.67883 0 0
SAMHD1 SAM and HD domain containing deoxynucleoside triphosphate triphosphohydrolase 1 [Source:HGNC Symbol;Acc:HGNC:15925] protein_coding 25939 ENSG00000101347 14201.8878 3.764005 0.1579738 23.82676 0 0
GPX3 glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555] protein_coding 2878 ENSG00000211445 12652.1625 3.767896 0.1652127 22.80634 0 0
SERPINA3 serpin family A member 3 [Source:HGNC Symbol;Acc:HGNC:16] protein_coding 12 ENSG00000196136 2745.2239 3.240508 0.1437457 22.54335 0 0
NEXN nexilin F-actin binding protein [Source:HGNC Symbol;Acc:HGNC:29557] protein_coding 91624 ENSG00000162614 5674.3163 1.991769 0.0896421 22.21912 0 0
MT2A metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406] protein_coding 4502 ENSG00000125148 3720.1844 2.226467 0.1054555 21.11287 0 0
ZBTB16 zinc finger and BTB domain containing 16 [Source:HGNC Symbol;Acc:HGNC:12930] protein_coding 7704 ENSG00000109906 415.9916 6.325861 0.3032907 20.85742 0 0
SPARCL1 SPARC like 1 [Source:HGNC Symbol;Acc:HGNC:11220] protein_coding 8404 ENSG00000152583 966.0441 4.417228 0.2194653 20.12723 0 0
ADAMTS1 ADAM metallopeptidase with thrombospondin type 1 motif 1 [Source:HGNC Symbol;Acc:HGNC:217] protein_coding 9510 ENSG00000154734 29966.0657 2.233413 0.1117150 19.99205 0 0
SORT1 sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186] protein_coding 6272 ENSG00000134243 5594.1519 2.187524 0.1111422 19.68220 0 0
STEAP2 STEAP2 metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885] protein_coding 261729 ENSG00000157214 3051.7626 2.001168 0.1024791 19.52758 0 0
CCDC69 coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487] protein_coding 26112 ENSG00000198624 2043.6957 2.860487 0.1526328 18.74097 0 0
KCTD12 potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678] protein_coding 115207 ENSG00000178695 2712.3559 -2.533603 0.1371684 -18.47075 0 0
FGD4 FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125] protein_coding 121512 ENSG00000139132 1187.2946 2.048211 0.1152607 17.77025 0 0
NNMT nicotinamide N-methyltransferase [Source:HGNC Symbol;Acc:HGNC:7861] protein_coding 4837 ENSG00000166741 7590.6769 2.241321 0.1278966 17.52448 0 0
PER1 period circadian regulator 1 [Source:HGNC Symbol;Acc:HGNC:8845] protein_coding 5187 ENSG00000179094 741.0586 2.892767 0.1662671 17.39831 0 0
TRNP1 TMF1-regulated nuclear protein 1 [Source:HGNC Symbol;Acc:HGNC:34348] protein_coding 388610 ENSG00000253368 1060.5358 2.015573 0.1164440 17.30937 0 0
CACNB2 calcium voltage-gated channel auxiliary subunit beta 2 [Source:HGNC Symbol;Acc:HGNC:1402] protein_coding 783 ENSG00000165995 502.9501 3.025600 0.1755551 17.23448 0 0
KLF15 Kruppel like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536] protein_coding 28999 ENSG00000163884 576.5858 4.417895 0.2568764 17.19853 0 0

6.5.1 Using ensembldb to query for genes encoding a specific protein domain

For some experiments it might be interesting to search for genes, or rather transcripts, that encode a protein with a certain protein domain. For the present data set we could for example search for genes with proteins encoding the ligand binding domain of nuclear hormone receptors (such as the glucocorticoid receptor, the gene activated by treatment with the synthetic glucocorticoid dexamethasone and hence being mainly responsible for the transcriptional changes observed in the present dataset). Below we thus query the database for all such genes using the protein domain ID PF00104 from Pfam. The filter used in the query below will return all genes with a transcript annotated to the specific protein domain restricting to genes that are detected in the present analysis.

hormoneR <- genes(edb,
                  filter = ~ protein_domain_id == "PF00104" &
                      gene_id == rownames(res),
                  return.type = "data.frame")
rownames(hormoneR) <- hormoneR$gene_id
nrow(hormoneR)
## [1] 46

In total 46 genes were identified. Next we identify nuclear hormone receptors which are significantly differentially expressed at a 5% FDR in the present data set.

resHormoneR <- res[hormoneR$gene_id, ]
resHormoneR <- resHormoneR[which(resHormoneR$padj < 0.05), ]
resHormoneR <- cbind(hormoneR[rownames(resHormoneR), ],
                     resHormoneR)

The table below lists the significantly differentially expressed hormone receptors.

knitr::kable(resHormoneR[order(resHormoneR$pvalue),
                         c("gene_name", "padj", "log2FoldChange")])
gene_name padj log2FoldChange
ENSG00000119508 NR4A3 0.0000000 2.0347679
ENSG00000113580 NR3C1 0.0000000 -0.9459376
ENSG00000123358 NR4A1 0.0000000 1.3837656
ENSG00000132170 PPARG 0.0000000 1.0746543
ENSG00000175745 NR2F1 0.0000278 -0.5088171
ENSG00000185551 NR2F2 0.0001163 0.7837373
ENSG00000131408 NR1H2 0.0005496 0.3629349
ENSG00000174738 NR1D2 0.0020364 -0.6147602
ENSG00000151623 NR3C2 0.0022242 -1.1918428
ENSG00000091831 ESR1 0.0326543 -1.1631755

Among the regulated hormone receptors are NR4A3, encoding a member of the steroid-thyroid hormone-retinoid receptor superfamily (4-fold upregulated), but also the glucocorticoid receptor (NR3C1) itself, which is known to down-regulate in certain cell types its own gene as part of a negative feedback loop.

6.6 Performing differential expression testing with edgeR

Next we will show how to perform differential expression analysis with edgeR. Recall that we have a DGEList dge, containing three objects:

names(dge)
## [1] "counts"  "samples" "genes"   "offset"

We first define a design matrix, using the same formula syntax as for DESeq2 above.

design <- model.matrix(~ cell + dex, dge$samples)

Then, we calculate normalization factors and estimate the dispersion for each gene. Note that we need to specify the design in the dispersion calculation.

dge <- calcNormFactors(dge)
dge <- estimateDisp(dge, design)

Finally, we fit the generalized linear model and perform the test. In the glmQLFTest function, we indicate which coefficient (which column in the design matrix) that we would like to test for. It is possible to test more general contrasts as well, and the user guide contains many examples on how to do this. The topTags function extracts the top-ranked genes. You can indicate the adjusted p-value cutoff, and/or the number of genes to keep.

fit <- glmQLFit(dge, design)
qlf <- glmQLFTest(fit, coef = ncol(design))
tt <- topTags(qlf, n = nrow(dge), p.value = 0.1)
tt10 <- topTags(qlf) # just the top 10 by default
tt10
## Coefficient:  dextrt 
##                         gene.id    logFC   logCPM         F       PValue
## ENSG00000109906 ENSG00000109906 6.295485 4.109371 1194.3763 3.296559e-37
## ENSG00000101347 ENSG00000101347 3.748587 9.298121 1111.9553 1.911432e-36
## ENSG00000211445 ENSG00000211445 3.750816 9.113950 1046.1195 8.531279e-36
## ENSG00000189221 ENSG00000189221 3.403750 6.722805  933.9829 1.358415e-34
## ENSG00000152583 ENSG00000152583 4.390957 5.494222  910.3392 2.534658e-34
## ENSG00000196136 ENSG00000196136 3.225391 6.914366  818.5211 3.335556e-33
## ENSG00000120129 ENSG00000120129 2.951575 7.260226  790.8825 7.644427e-33
## ENSG00000096060 ENSG00000096060 3.996190 6.859061  756.9908 2.195437e-32
## ENSG00000163884 ENSG00000163884 4.418589 4.650387  748.9891 2.834664e-32
## ENSG00000127954 ENSG00000127954 4.857722 4.270540  729.4205 5.353765e-32
##                          FDR
## ENSG00000109906 1.079557e-32
## ENSG00000101347 3.129778e-32
## ENSG00000211445 9.312744e-32
## ENSG00000189221 1.112134e-30
## ENSG00000152583 1.660100e-30
## ENSG00000196136 1.820547e-29
## ENSG00000120129 3.576281e-29
## ENSG00000096060 8.987020e-29
## ENSG00000163884 1.031440e-28
## ENSG00000127954 1.753251e-28

We can compare to see how the results from the two software overlap:

tt.all <- topTags(qlf, n = nrow(dge), sort.by = "none")
table(DESeq2 = res$padj < 0.1, edgeR = tt.all$table$FDR < 0.1)
##        edgeR
## DESeq2  FALSE  TRUE
##   FALSE 11099   876
##   TRUE    194  4071

We can also compare the two lists by the ranks:

common <- !is.na(res$padj)
plot(rank(res$padj[common]), 
     rank(tt.all$table$FDR[common]), cex = .1,
     xlab = "DESeq2", ylab = "edgeR")
Comparison of genes ranked by *DESeq2* and *edgeR*.

Figure 14: Comparison of genes ranked by DESeq2 and edgeR

Also with edgeR we can test for significance relative to a fold-change threshold, using the function glmTreat instead of glmLRT. Below we set the log fold-change threshold to 1 (i.e., fold change threshold equal to 2), as for DESeq2 above. We also compare the results to those obtained with DESeq2.

treatres <- glmTreat(fit, coef = ncol(design), lfc = 1)
tt.treat <- topTags(treatres, n = nrow(dge), sort.by = "none")
table(DESeq2 = resLFC1$padj < 0.1, edgeR = tt.treat$table$FDR < 0.1)
##        edgeR
## DESeq2  FALSE  TRUE
##   FALSE 18940   270
##   TRUE      0   205

6.7 Citing packages used in research

If you use the results from an R analysis package in published research, you can find the proper citation for the software by typing citation("pkgName"), where you would substitute the name of the package for pkgName. Citing methods papers helps to support and reward the individuals who put time into open source software for genomic data analysis.

7 Plotting results

7.1 Counts plot with DESeq2

A quick way to visualize the counts for a particular gene is to use the plotCounts function that takes as arguments the DESeqDataSet, a gene name, and the group over which to plot the counts (figure below).

topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = topGene, intgroup = c("dex"))
Normalized counts for a single gene over treatment group.

Figure 15: Normalized counts for a single gene over treatment group

We can also make custom plots using the ggplot function from the ggplot2 package (figures below).

library("ggbeeswarm")
geneCounts <- plotCounts(dds, gene = topGene, intgroup = c("dex","cell"),
                         returnData = TRUE)
ggplot(geneCounts, aes(x = dex, y = count, color = cell)) +
  scale_y_log10() +  geom_beeswarm(cex = 3)
Normalized counts using ggbeeswarm

Figure 16: Normalized counts using ggbeeswarm

ggplot(geneCounts, aes(x = dex, y = count, color = cell, group = cell)) +
  scale_y_log10() + geom_point(size = 3) + geom_line()
Normalized counts with lines connecting cell lines. Note that the *DESeq* test actually takes into account the cell line effect, so this figure more closely depicts the difference being tested.

Figure 17: Normalized counts with lines connecting cell lines
Note that the DESeq test actually takes into account the cell line effect, so this figure more closely depicts the difference being tested.

7.2 MA plot with DESeq2

An MA plot (Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”. You may hear this plot also referred to as a mean-difference plot, or a Bland-Altman plot.

Before making the MA plot, we use the lfcShrink function to shrink the log2 fold changes for the comparison of dex treated vs untreated samples. This typically takes about 20 seconds on a laptop.

library("apeglm")
resultsNames(dds)
## [1] "Intercept"               "cell_N061011_vs_N052611"
## [3] "cell_N080611_vs_N052611" "cell_N61311_vs_N052611" 
## [5] "dex_trt_vs_untrt"
res <- lfcShrink(dds, coef="dex_trt_vs_untrt", type="apeglm")
DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-5,5))
An MA plot of changes induced by treatment. The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis. Each gene is represented with a dot. Genes with an adjusted *p* value below a threshold (here 0.1, the default) are shown in red.

Figure 18: An MA plot of changes induced by treatment
The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis. Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red.

The DESeq2 package uses a Bayesian procedure to moderate (or “shrink”) log2 fold changes from genes with very low counts and highly variable counts, as can be seen by the narrowing of the vertical spread of points on the left side of the MA plot. As shown above, the lfcShrink function performs this operation. For a detailed explanation of the rationale of moderated fold changes, please see the DESeq2 paper (Love, Huber, and Anders 2014).

If we had not used statistical moderation to shrink the noisy log2 fold changes, we would have instead seen the following plot:

res.noshr <- results(dds)
DESeq2::plotMA(res.noshr, xlim=c(5,1e6), ylim=c(-5,5))
MA plot without shrinkage.

Figure 19: MA plot without shrinkage

We can label individual points on the MA plot as well. Here we use the with R function to plot a circle and text for a selected row of the results object. Within the with function, only the baseMean and log2FoldChange values for the selected rows of res are used.

DESeq2::plotMA(res, xlim=c(5,1e6), ylim=c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
  points(baseMean, log2FoldChange, col = "dodgerblue", cex = 2, lwd = 2)
  text(baseMean, log2FoldChange, topGene, pos = 2, col = "dodgerblue")
})
Example of labeling an individual gene on an MA plot.

Figure 20: Example of labeling an individual gene on an MA plot

Another useful diagnostic plot is the histogram of the p values (figure below). This plot is best formed by excluding genes with very small counts, which otherwise generate spikes in the histogram.

hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
     col = "grey50", border = "white")
Histogram of *p* values for genes with mean normalized count larger than 1.

Figure 21: Histogram of p values for genes with mean normalized count larger than 1

7.3 Counts plot with edgeR

As we made a counts plot previously, we can also make this with the top gene from the edgeR analysis:

cpms <- cpm(dge)
topGene <- as.character(tt10$table$gene.id[1])
ord <- order(dge$samples$dex)
col <- as.integer(dge$samples$dex[ord])
par(mar=c(10,4,4,2))
barplot(cpms[topGene,ord],
        col=c("skyblue","orange")[col],
        names.arg=paste(dge$sample$dex[ord],"--",dge$sample$cell[ord]),
        las=2, ylab="counts per million (CPM)")
CPM per sample for the top gene from the *edgeR* analysis.

Figure 22: CPM per sample for the top gene from the edgeR analysis

7.4 MA / Smear plot with edgeR

In edgeR, the MA plot is obtained via the plotSmear function.

plotSmear(qlf, de.tags = tt$table$gene.id)
The *edgeR* package includes the `plotSmear` function for generating MA plots.

Figure 23: The edgeR package includes the plotSmear function for generating MA plots

7.5 Gene clustering

In the sample distance heatmap made previously, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry a signal, one usually would only cluster a subset of the most highly variable genes. Here, for demonstration, let us select the 20 genes with the highest variance across samples. We will work with the VST data:

library("genefilter")
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 20)

The heatmap becomes more interesting if we do not look at absolute expression strength but rather at the amount by which each gene deviates in a specific sample from the gene’s average across all samples. Hence, we center each genes’ values across samples, and plot a heatmap (figure below). We provide a data.frame that instructs the pheatmap function how to label the columns.

mat  <- assay(vsd)[ topVarGenes, ]
mat  <- mat - rowMeans(mat)
anno <- as.data.frame(colData(vsd)[, c("cell","dex")])
pheatmap(mat, annotation_col = anno)
Heatmap of relative VST values across samples. Treatment status and cell line information are shown with colored bars at the top of the heatmap. Blocks of genes that covary across patients. Note that a set of genes at the top of the heatmap are separating the N061011 cell line from the others. In the center of the heatmap, we see a set of genes for which the dexamethasone treated samples have higher gene expression.

Figure 24: Heatmap of relative VST values across samples
Treatment status and cell line information are shown with colored bars at the top of the heatmap. Blocks of genes that covary across patients. Note that a set of genes at the top of the heatmap are separating the N061011 cell line from the others. In the center of the heatmap, we see a set of genes for which the dexamethasone treated samples have higher gene expression.

7.6 Exporting results

You can easily save the results table in a CSV file that you can then share or load with a spreadsheet program such as Excel. The call to as.data.frame is necessary to convert the DataFrame object (IRanges package) to a data.frame object that can be processed by write.csv. Here, we take just the top 100 genes for demonstration.

resOrderedDF <- as.data.frame(res[order(res$pvalue), ])[1:100, ]
write.csv(resOrderedDF, file = "results.csv")

A more sophisticated way for exporting results the Bioconductor package ReportingTools (Huntley et al. 2013). ReportingTools will automatically generate dynamic HTML documents, including links to external databases using gene identifiers and boxplots summarizing the normalized counts across groups. See the ReportingTools vignettes for full details. The simplest version of creating a dynamic ReportingTools report is performed with the following code:

library("ReportingTools")
htmlRep <- HTMLReport(shortName = "report", title = "My report",
                      reportDirectory = "./report")
publish(resOrderedDF %>% tibble::rownames_to_column("gene"), htmlRep)
url <- finish(htmlRep)
browseURL(url)

7.7 Exploring results interactively

The iSEE package enables interactive exploration of any data stored in a SummarizedExperiment container. Since the DESeqDataSet class directly extends SummarizedExperiment, it can also be used as input to iSEE. To open an iSEE instance, we simply type

library("iSEE")
if (interactive()) {
  iSEE(dds)
}

We can also add additional information, such as the results of the statistical test, to the rowData of the object before providing it to iSEE.

stopifnot(all(rownames(dds) == rownames(res)))
res$neglog10pvalue <- -log10(res$pvalue)
rowData(dds)$res <- res
if (interactive()) {
  iSEE(dds)
}

8 Removing hidden batch effects

Suppose we did not know that there were different cell lines involved in the experiment, only that there was treatment with dexamethasone. The cell line effect on the counts then would represent some hidden and unwanted variation that might be affecting many or all of the genes in the dataset. We can use statistical methods designed for RNA-seq from the sva package (Leek 2014) or the RUVSeq package (Risso et al. 2014) in Bioconductor to detect such groupings of the samples, and then we can add these to the DESeqDataSet design, in order to account for them.

The SVA package uses the term surrogate variables for the estimated variables that we want to account for in our analysis, while the RUV package uses the terms factors of unwanted variation with the acronym “Remove Unwanted Variation” explaining the package title. We first use SVA to find hidden batch effects and then RUV following.

8.1 Using SVA with DESeq2

library("sva")

Below we obtain a matrix of normalized counts for which the average count across samples is larger than 1. As we described above, we are trying to recover any hidden batch effects, supposing that we do not know the cell line information. So we use a full model matrix with the dex variable, and a reduced, or null, model matrix with only an intercept term. Finally we specify that we want to estimate 2 surrogate variables. For more information read the manual page for the svaseq function by typing ?svaseq.

dat  <- counts(dds, normalized = TRUE)
idx  <- rowMeans(dat) > 1
dat  <- dat[idx, ]
mod  <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~   1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5
svseq$sv
##            [,1]        [,2]
## [1,]  0.2518590 -0.50192957
## [2,]  0.2639320 -0.59195468
## [3,]  0.1082580  0.28208393
## [4,]  0.1918983  0.43378288
## [5,] -0.6051171 -0.07768411
## [6,] -0.6102564 -0.03799004
## [7,]  0.1972668  0.22990958
## [8,]  0.2021595  0.26378202

Because we actually do know the cell lines, we can see how well the SVA method did at recovering these variables (figure below).

par(mfrow = c(2, 1), mar = c(3,5,3,1))
for (i in 1:2) {
  stripchart(svseq$sv[, i] ~ dds$cell, vertical = TRUE, main = paste0("SV", i))
  abline(h = 0)
 }
Surrogate variables 1 and 2 plotted over cell line. Here, we know the hidden source of variation (cell line), and therefore can see how the SVA procedure is able to identify a source of variation which is correlated with cell line.

Figure 25: Surrogate variables 1 and 2 plotted over cell line
Here, we know the hidden source of variation (cell line), and therefore can see how the SVA procedure is able to identify a source of variation which is correlated with cell line.

Finally, in order to use SVA to remove any effect on the counts from our surrogate variables, we simply add these two surrogate variables as columns to the DESeqDataSet and then add them to the design:

ddssva <- dds
ddssva$SV1 <- svseq$sv[,1]
ddssva$SV2 <- svseq$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex

We could then produce results controlling for surrogate variables by running DESeq with the new design.

8.2 Using RUV with DESeq2

We can also use the RUV method in the RUVSeq package to detect the hidden batch effects.

library("RUVSeq")

We can use the RUVg function to estimate factors of unwanted variation, analogous to SVA’s surrogate variables. A difference compared to the SVA procedure above, is that we first would run DESeq and results to obtain the p-values for the analysis without knowing about the batches, e.g. just ~ dex. Supposing that we have this results table res, we then pull out a set of empirical control genes by looking at the genes that do not have a small p-value.

set <- newSeqExpressionSet(counts(dds))
idx  <- rowSums(counts(set) > 5) >= 2
set  <- set[idx, ]
set <- betweenLaneNormalization(set, which="upper")
not.sig <- rownames(res)[which(res$pvalue > .1)]
empirical <- rownames(set)[ rownames(set) %in% not.sig ]
set <- RUVg(set, empirical, k=2)
pData(set)
##                    W_1         W_2
## SRR1039508 -0.20037701  0.47313731
## SRR1039509 -0.24252327  0.61488878
## SRR1039512  0.01456952 -0.07906549
## SRR1039513 -0.19164053 -0.10156504
## SRR1039516  0.60096460 -0.01372652
## SRR1039517  0.58434561 -0.02106782
## SRR1039520 -0.25376560 -0.44458727
## SRR1039521 -0.31157332 -0.42801396

We can plot the factors estimated by RUV:

par(mfrow = c(2, 1), mar = c(3,5,3,1))
for (i in 1:2) {
  stripchart(pData(set)[, i] ~ dds$cell, vertical = TRUE, main = paste0("W", i))
  abline(h = 0)
 }
Factors of unwanted variation plotted over cell line.

Figure 26: Factors of unwanted variation plotted over cell line

As before, if we wanted to control for these factors, we simply add them to the DESeqDataSet and to the design:

ddsruv <- dds
ddsruv$W1 <- set$W_1
ddsruv$W2 <- set$W_2
design(ddsruv) <- ~ W1 + W2 + dex

We would then run DESeq with the new design to re-estimate the parameters and results.

9 Session information

As the last part of this document, we call the function sessionInfo, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record of this as it will help to track down what has happened in case an R script ceases to work or gives different results because the functions have been changed in a newer version of one of your packages. By including it at the bottom of a script, your reports will become more reproducible.

The session information should also always be included in any emails to the Bioconductor support site along with all code used in the analysis.

devtools::session_info()
## ─ Session info ────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.0 (2019-04-26)
##  os       macOS High Sierra 10.13.6   
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Europe/Zurich               
##  date     2019-07-09                  
## 
## ─ Packages ────────────────────────────────────────────────────────────────
##  package                * version    date       lib
##  acepack                  1.4.1      2016-10-29 [1]
##  affy                     1.62.0     2019-05-02 [1]
##  affyio                   1.54.0     2019-05-02 [1]
##  airway2                * 0.0.1      2019-07-08 [1]
##  annotate                 1.62.0     2019-05-02 [1]
##  AnnotationDbi          * 1.46.0     2019-05-02 [1]
##  AnnotationFilter       * 1.8.0      2019-05-02 [1]
##  AnnotationHub          * 2.16.0     2019-05-02 [1]
##  apeglm                 * 1.6.0      2019-05-02 [1]
##  aroma.light              3.14.0     2019-05-02 [1]
##  assertthat               0.2.1      2019-03-21 [1]
##  backports                1.1.4      2019-04-10 [1]
##  base64enc                0.1-3      2015-07-28 [1]
##  bbmle                    1.0.20     2017-10-30 [1]
##  beeswarm                 0.2.3      2016-04-25 [1]
##  Biobase                * 2.44.0     2019-05-02 [1]
##  BiocFileCache          * 1.8.0      2019-05-02 [1]
##  BiocGenerics           * 0.30.0     2019-05-02 [1]
##  BiocManager              1.30.4     2018-11-13 [1]
##  BiocParallel           * 1.18.0     2019-05-02 [1]
##  BiocStyle              * 2.12.0     2019-05-02 [1]
##  biomaRt                  2.40.1     2019-06-27 [1]
##  Biostrings             * 2.52.0     2019-05-02 [1]
##  bit                      1.1-14     2018-05-29 [1]
##  bit64                    0.9-7      2017-05-08 [1]
##  bitops                   1.0-6      2013-08-17 [1]
##  blob                     1.1.1      2018-03-25 [1]
##  bookdown                 0.11       2019-05-28 [1]
##  callr                    3.3.0      2019-07-04 [1]
##  checkmate                1.9.4      2019-07-04 [1]
##  cli                      1.1.0      2019-03-19 [1]
##  cluster                  2.1.0      2019-06-19 [1]
##  coda                     0.19-3     2019-07-05 [1]
##  codetools                0.2-16     2018-12-24 [1]
##  colorspace               1.4-1      2019-03-18 [1]
##  colourpicker             1.0        2017-09-27 [1]
##  cowplot                  0.9.4      2019-01-08 [1]
##  crayon                   1.3.4      2017-09-16 [1]
##  curl                     3.3        2019-01-10 [1]
##  data.table               1.12.2     2019-04-07 [1]
##  DBI                      1.0.0      2018-05-02 [1]
##  dbplyr                 * 1.4.2      2019-06-17 [1]
##  DelayedArray           * 0.10.0     2019-05-02 [1]
##  desc                     1.2.0      2018-05-01 [1]
##  DESeq                    1.36.0     2019-05-02 [1]
##  DESeq2                 * 1.24.0     2019-05-02 [1]
##  devtools                 2.1.0      2019-07-06 [1]
##  digest                   0.6.20     2019-07-04 [1]
##  dplyr                  * 0.8.3      2019-07-04 [1]
##  DT                       0.7        2019-06-11 [1]
##  EDASeq                 * 2.18.0     2019-05-02 [1]
##  edgeR                  * 3.26.5     2019-06-21 [1]
##  emdbook                  1.3.11     2019-02-12 [1]
##  ensembldb              * 2.8.0      2019-05-02 [1]
##  evaluate                 0.14       2019-05-28 [1]
##  fdrtool                  1.2.15     2015-07-08 [1]
##  foreign                  0.8-71     2018-07-20 [1]
##  Formula                  1.2-3      2018-05-03 [1]
##  fs                       1.3.1      2019-05-06 [1]
##  genefilter             * 1.66.0     2019-05-02 [1]
##  geneplotter              1.62.0     2019-05-02 [1]
##  GenomeInfoDb           * 1.20.0     2019-05-02 [1]
##  GenomeInfoDbData         1.2.1      2019-05-05 [1]
##  GenomicAlignments      * 1.20.1     2019-06-18 [1]
##  GenomicFeatures        * 1.36.3     2019-06-26 [1]
##  GenomicRanges          * 1.36.0     2019-05-02 [1]
##  ggbeeswarm             * 0.6.0      2017-08-07 [1]
##  ggplot2                * 3.2.0      2019-06-16 [1]
##  glue                     1.3.1      2019-03-12 [1]
##  gridExtra                2.3        2017-09-09 [1]
##  gtable                   0.3.0      2019-03-25 [1]
##  hexbin                 * 1.27.3     2019-05-14 [1]
##  highr                    0.8        2019-03-20 [1]
##  Hmisc                    4.2-0      2019-01-26 [1]
##  hms                      0.4.2      2018-03-10 [1]
##  htmlTable                1.13.1     2019-01-07 [1]
##  htmltools                0.3.6      2017-04-28 [1]
##  htmlwidgets              1.3        2018-09-30 [1]
##  httpuv                   1.5.1      2019-04-05 [1]
##  httr                     1.4.0      2018-12-11 [1]
##  hwriter                  1.3.2      2014-09-10 [1]
##  igraph                   1.2.4.1    2019-04-22 [1]
##  IHW                    * 1.12.0     2019-05-02 [1]
##  interactiveDisplayBase   1.22.0     2019-05-02 [1]
##  IRanges                * 2.18.1     2019-05-31 [1]
##  iSEE                   * 1.4.0      2019-05-02 [1]
##  jsonlite               * 1.6        2018-12-07 [1]
##  knitr                  * 1.23       2019-05-18 [1]
##  labeling                 0.3        2014-08-23 [1]
##  later                    0.8.0      2019-02-11 [1]
##  lattice                  0.20-38    2018-11-04 [1]
##  latticeExtra             0.6-28     2016-02-09 [1]
##  lazyeval                 0.2.2      2019-03-15 [1]
##  limma                  * 3.40.2     2019-05-17 [1]
##  locfit                   1.5-9.1    2013-04-20 [1]
##  lpsymphony               1.12.0     2019-05-02 [1]
##  magrittr                 1.5        2014-11-22 [1]
##  MASS                     7.3-51.4   2019-03-31 [1]
##  Matrix                   1.2-17     2019-03-22 [1]
##  matrixStats            * 0.54.0     2018-07-23 [1]
##  memoise                  1.1.0      2017-04-21 [1]
##  mgcv                   * 1.8-28     2019-03-21 [1]
##  mime                     0.7        2019-06-11 [1]
##  miniUI                   0.1.1.1    2018-05-18 [1]
##  munsell                  0.5.0      2018-06-12 [1]
##  nlme                   * 3.1-140    2019-05-12 [1]
##  nnet                     7.3-12     2016-02-02 [1]
##  numDeriv                 2016.8-1.1 2019-06-06 [1]
##  pheatmap               * 1.0.12     2019-01-04 [1]
##  pillar                   1.4.2      2019-06-29 [1]
##  pkgbuild                 1.0.3      2019-03-20 [1]
##  pkgconfig                2.0.2      2018-08-16 [1]
##  pkgload                  1.0.2      2018-10-29 [1]
##  plyr                     1.8.4      2016-06-08 [1]
##  PoiClaClu              * 1.0.2.1    2019-01-04 [1]
##  preprocessCore           1.46.0     2019-05-02 [1]
##  prettyunits              1.0.2      2015-07-13 [1]
##  processx                 3.4.0      2019-07-03 [1]
##  progress                 1.2.2      2019-05-16 [1]
##  promises                 1.0.1      2018-04-13 [1]
##  ProtGenerics             1.16.0     2019-05-02 [1]
##  ps                       1.3.0      2018-12-21 [1]
##  purrr                    0.3.2      2019-03-15 [1]
##  R.methodsS3              1.7.1      2016-02-16 [1]
##  R.oo                     1.22.0     2018-04-22 [1]
##  R.utils                  2.9.0      2019-06-13 [1]
##  R6                       2.4.0      2019-02-14 [1]
##  rappdirs                 0.3.1      2016-03-28 [1]
##  RColorBrewer           * 1.1-2      2014-12-07 [1]
##  Rcpp                     1.0.1      2019-03-17 [1]
##  RCurl                    1.95-4.12  2019-03-04 [1]
##  readr                  * 1.3.1      2018-12-21 [1]
##  remotes                  2.1.0      2019-06-24 [1]
##  rentrez                  1.2.2      2019-05-02 [1]
##  reshape2                 1.4.3      2017-12-11 [1]
##  rintrojs                 0.2.2      2019-05-29 [1]
##  rlang                    0.4.0      2019-06-25 [1]
##  rmarkdown              * 1.13       2019-05-22 [1]
##  rpart                    4.1-15     2019-04-12 [1]
##  rprojroot                1.3-2      2018-01-03 [1]
##  Rsamtools              * 2.0.0      2019-05-02 [1]
##  RSQLite                  2.1.1      2018-05-06 [1]
##  rstudioapi               0.10       2019-03-19 [1]
##  rtracklayer              1.44.0     2019-07-02 [1]
##  RUVSeq                 * 1.18.0     2019-05-02 [1]
##  S4Vectors              * 0.22.0     2019-05-02 [1]
##  scales                   1.0.0      2018-08-09 [1]
##  sessioninfo              1.1.1      2018-11-05 [1]
##  shiny                    1.3.2      2019-04-22 [1]
##  shinyAce                 0.3.3      2019-01-03 [1]
##  shinydashboard           0.7.1      2018-10-17 [1]
##  shinyjs                  1.0        2018-01-08 [1]
##  ShortRead              * 1.42.0     2019-05-02 [1]
##  SingleCellExperiment   * 1.6.0      2019-05-02 [1]
##  slam                     0.1-45     2019-02-26 [1]
##  stringi                  1.4.3      2019-03-12 [1]
##  stringr                  1.4.0      2019-02-10 [1]
##  SummarizedExperiment   * 1.14.0     2019-05-02 [1]
##  survival                 2.44-1.1   2019-04-01 [1]
##  sva                    * 3.32.1     2019-05-22 [1]
##  testthat                 2.1.1      2019-04-23 [1]
##  tibble                   2.1.3      2019-06-06 [1]
##  tidyselect               0.2.5      2018-10-11 [1]
##  tximeta                * 1.2.2      2019-07-06 [1]
##  tximport               * 1.12.3     2019-06-25 [1]
##  usethis                  1.5.1      2019-07-04 [1]
##  vipor                    0.4.5      2017-03-22 [1]
##  viridisLite              0.3.0      2018-02-01 [1]
##  vsn                    * 3.52.0     2019-05-02 [1]
##  withr                    2.1.2      2018-03-15 [1]
##  xfun                     0.8        2019-06-25 [1]
##  XML                      3.98-1.20  2019-06-06 [1]
##  xtable                   1.8-4      2019-04-21 [1]
##  XVector                * 0.24.0     2019-05-02 [1]
##  yaml                     2.2.0      2018-07-25 [1]
##  zlibbioc                 1.30.0     2019-05-02 [1]
##  source                           
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  Github (mikelove/airway2@771fe45)
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  Bioconductor (R 3.6.0)           
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
##  CRAN (R 3.6.0)                   
##  Bioconductor                     
## 
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

References

Anders, Simon, and Wolfgang Huber. 2010. “Differential expression analysis for sequence count data.” Genome Biology 11 (10). BioMed Central Ltd: R106+. https://doi.org/10.1186/gb-2010-11-10-r106.

Bray, Nicolas L, Harold Pimentel, Páll Melsted, and Lior Pachter. 2016. “Near-Optimal RNA-Seq Quantification.” Nat. Biotechnol.

Di Tommaso, Paolo, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. 2017. “Nextflow Enables Reproducible Computational Workflows.” Nature Biotechnology 35 (4). Nature Research: 316–19.

Dudoit, Rine, Yee H. Yang, Matthew J. Callow, and Terence P. Speed. 2002. “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.” In Statistica Sinica, 111–39. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.117.9702.

Hardcastle, Thomas, and Krystyna Kelly. 2010. “baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data.” BMC Bioinformatics 11 (1): 422+. https://doi.org/10.1186/1471-2105-11-422.

Himes, Blanca E., Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M. Whitaker, et al. 2014. “RNA-Seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells.” PloS One 9 (6). https://doi.org/10.1371/journal.pone.0099625.

Huber, Wolfgang, Vincent J. Carey, Robert Gentleman, Simon Anders, Marc Carlson, Benilton S. Carvalho, Hector Corrada C. Bravo, et al. 2015. “Orchestrating high-throughput genomic analysis with Bioconductor.” Nature Methods 12 (2). Nature Publishing Group: 115–21. https://doi.org/10.1038/nmeth.3252.

Huntley, Melanie A., Jessica L. Larson, Christina Chaivorapol, Gabriel Becker, Michael Lawrence, Jason A. Hackney, and Joshua S. Kaminker. 2013. “ReportingTools: an automated result processing and presentation toolkit for high-throughput genomic analyses.” Bioinformatics 29 (24). Oxford University Press: 3220–1. https://doi.org/10.1093/bioinformatics/btt551.

Ignatiadis, Nikolaos, Bernd Klaus, Judith Zaugg, and Wolfgang Huber. 2016. “Data-Driven Hypothesis Weighting Increases Detection Power in Genome-Scale Multiple Testing.” Nature Methods. https://doi.org/10.1038/nmeth.3885.

Köster, Johannes, and Sven Rahmann. 2012. “Snakemake—a Scalable Bioinformatics Workflow Engine.” Bioinformatics 28 (19). Oxford University Press: 2520–2.

Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. 2014. “Voom: precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biology 15 (2). BioMed Central Ltd: R29+. https://doi.org/10.1186/gb-2014-15-2-r29.

Leek, Jeffrey T. 2014. “svaseq: removing batch effects and other unwanted noise from sequencing data.” Nucleic Acids Research 42 (21). Oxford University Press: 000. https://doi.org/10.1093/nar/gku864.

Leng, N., J. A. Dawson, J. A. Thomson, V. Ruotti, A. I. Rissman, B. M. G. Smits, J. D. Haag, M. N. Gould, R. M. Stewart, and C. Kendziorski. 2013. “EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments.” Bioinformatics 29 (8). Oxford University Press: 1035–43. https://doi.org/10.1093/bioinformatics/btt087.

Li, Bo, and Colin N. Dewey. 2011. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC Bioinformatics 12: 323+. https://doi.org/10.1186/1471-2105-12-3231.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology 15 (12). BioMed Central Ltd: 550+. https://doi.org/10.1186/s13059-014-0550-8.

Patro, Rob, Geet Duggal, Michael I Love, Rafael A Irizarry, and Carl Kingsford. 2017. “Salmon Provides Fast and Bias-Aware Quantification of Transcript Expression.” Nat. Methods 14: 417–19.

Patro, Rob, Stephen M. Mount, and Carl Kingsford. 2014. “Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms.” Nature Biotechnology 32: 462–64. https://doi.org/10.1038/nbt.2862.

Rainer, Johannes, Laurent Gatto, and Christian X Weichenberger. 2019. “ensembldb: an R package to create and use Ensembl-based annotation resources.” Bioinformatics 14 (January): 925. https://doi.org/10.1093/bioinformatics/btz031.

Risso, Davide, John Ngai, Terence P. Speed, and Sandrine Dudoit. 2014. “Normalization of RNA-seq data using factor analysis of control genes or samples.” Nature Biotechnology 32 (9). Nature Publishing Group: 896–902. https://doi.org/10.1038/nbt.2931.

Robert, Christelle, and Mick Watson. 2015. “Errors in RNA-Seq quantification affect genes of relevance to human disease.” Genome Biology. https://doi.org/10.1186/s13059-015-0734-x.

Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2009. “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics 26 (1). Oxford University Press: 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Schurch, Nicholas J, Pietá Schofield, Marek Gierliński, Christian Cole, Alexander Sherstnev, Vijender Singh, Nicola Wrobel, et al. 2016. “How Many Biological Replicates Are Needed in an RNA-seq Experiment and Which Differential Expression Tool Should You Use?” RNA 22 (6): 839–51.

Soneson, Charlotte, Michael I. Love, and Mark Robinson. 2015. “Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences.” F1000Research 4 (1521). https://doi.org/10.12688/f1000research.7563.1.

Trapnell, Cole, David G Hendrickson, Martin Sauvageau, Loyal Goff, John L Rinn, and Lior Pachter. 2013. “Differential analysis of gene regulation at transcript resolution with RNA-seq.” Nature Biotechnology. https://doi.org/10.1038/nbt.2450.

Wickham, Hadley. 2009. ggplot2. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-98141-3.

Witten, Daniela M. 2011. “Classification and clustering of sequencing data using a Poisson model.” The Annals of Applied Statistics 5 (4): 2493–2518. https://doi.org/10.1214/11-AOAS493.

Wu, Hao, Chi Wang, and Zhijin Wu. 2013. “A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data.” Biostatistics 14 (2). Oxford University Press: 232–43. https://doi.org/10.1093/biostatistics/kxs033.