Contents

1 Overview

augere.de implements augere pipelines for differential expression (DE) analysis of RNA-seq data based on edgeR and voom(). Each analysis pipeline creates a self-contained Rmarkdown report that documents each step in the DE analysis, after which it compiles the report and returns the analysis results to the user in the current R session. We can then inspect the report to determine exactly how the analysis was performed. We can also modify the report to customize the analysis beyond the options provided by the pipeline function.

2 Quick start

To install this package, follow the usual instructions for Bioconductor:

install.packages("BiocManager") # if not already available
BiocManager::install("augere.de")

Let’s demonstrate on a dataset derived from the airway package. This has two groups (treated and untreated) so the parametrization of the analysis is fairly simple:

library(augere.de)
se <- loadExampleDataset()

# Testing for DE between 'trt' and 'untrt' in the 'dex' grouping factor.
output.dir <- tempfile()
res <- runEdgeR(se, group="dex", comparisons=c("trt", "untrt"), output.dir=output.dir)
# List of tables of DE results.
names(res$results)
## [1] "Increase in `trt` over `untrt`"
res$results[[1]]
## DataFrame with 63677 rows and 5 columns
##                      LogFC   AveExpr         F    PValue       FDR
##                  <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3862784   5.05487 6.7849013 0.0287436  0.130627
## ENSG00000000005         NA        NA        NA        NA        NA
## ENSG00000000419  0.1967653   4.60947 6.0508824 0.0364214  0.150396
## ENSG00000000457  0.0254907   3.48214 0.0637723 0.8063598  0.911506
## ENSG00000000460 -0.1225528   1.48030 0.2075723 0.6595808  0.827011
## ...                    ...       ...       ...       ...       ...
## ENSG00000273489         NA        NA        NA        NA        NA
## ENSG00000273490         NA        NA        NA        NA        NA
## ENSG00000273491         NA        NA        NA        NA        NA
## ENSG00000273492         NA        NA        NA        NA        NA
## ENSG00000273493         NA        NA        NA        NA        NA
# Copy of 'se' with normalized expression values.
res$normalized
## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(0):
## assays(2): counts logCPM
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(11): gene_id gene_name ... symbol retained
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(11): SampleName cell ... lib.size norm.factors

As mentioned, runEdgeR() generates an Rmarkdown report containing (almost) all of the steps required to reproduce the analysis. Let’s have a look at it:

fname <- file.path(output.dir, "report.Rmd")
cat(head(readLines(fname), 50), sep="\n")
## ---
## title: Differential expression with edgeR
## author:
##   - biocbuild
## date: "`r Sys.Date()`"
## output: BiocStyle::html_document
## ---
## 
## ```{r preamble, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
## 
## # Data loading
## 
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
## 
## ```{r} 
## se <- local({ # augere.core input (1)
##     stop("insert commands to generate 'se' here")
## })
## se
## ```
## 
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
## This ensures that our variance estimates are not affected by irrelevant variability in other groups.
## 
## ```{r}
## se <- se[,SummarizedExperiment::colData(se)[,"dex"] %in% c("trt", "untrt")]
## ncol(se)
## ```
## 
## We extract the information into a `DGEList` object in preparation for analysis.
## 
## ```{r}
## y <- edgeR::DGEList(SummarizedExperiment::assay(se, 1))
## 
## # Also tracking the original row index from the SummarizedExperiment,
## # so that we can match them up after filtering out low-abundance genes.
## y$genes <- data.frame(origin=seq_len(nrow(se)))
## ```
## 
## Finally, we build the design matrix.
## 
## ```{r}
## model.data <- list()
## cd <- SummarizedExperiment::colData(se)
## model.data$group. <- factor(cd[,"dex"])
## design <- model.matrix(~ 0 + group., data=model.data)
## ```

The function will also save the results to file inside a results subdirectory, which can be loaded back into the session using the readResult() function:

reloaded <- readResult(file.path(output.dir, "results", "de-1"))
reloaded$x # the result itself
## DataFrame with 63677 rows and 5 columns
##                      LogFC   AveExpr         F    PValue       FDR
##                  <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3862784   5.05487 6.7849013 0.0287436  0.130627
## ENSG00000000005         NA        NA        NA        NA        NA
## ENSG00000000419  0.1967653   4.60947 6.0508824 0.0364214  0.150396
## ENSG00000000457  0.0254907   3.48214 0.0637723 0.8063598  0.911506
## ENSG00000000460 -0.1225528   1.48030 0.2075723 0.6595808  0.827011
## ...                    ...       ...       ...       ...       ...
## ENSG00000273489         NA        NA        NA        NA        NA
## ENSG00000273490         NA        NA        NA        NA        NA
## ENSG00000273491         NA        NA        NA        NA        NA
## ENSG00000273492         NA        NA        NA        NA        NA
## ENSG00000273493         NA        NA        NA        NA        NA
str(reloaded$metadata) # along with some metadata
## List of 4
##  $ title                       : chr "Increase in `trt` over `untrt`"
##  $ authors                     :List of 1
##   ..$ : chr "biocbuild"
##  $ type                        : chr "differential_gene_expression"
##  $ differential_gene_expression:List of 2
##   ..$ contrast:List of 3
##   .. ..$ type : chr "versus"
##   .. ..$ left :List of 1
##   .. .. ..$ : chr "trt"
##   .. ..$ right:List of 1
##   .. .. ..$ : chr "untrt"
##   ..$ method  : chr "edgeR QL"

runVoom() does the same but for voom() instead of edgeR. For simplicity, we will focus on runEdgeR() in the rest of this vignette.

3 More experimental designs

We can block on uninteresting factors of variation - for example, the cell line of origin:

output.block.dir <- tempfile()
res.block <- runEdgeR(
    se,
    group="dex",
    block="cell",
    comparisons=c("trt", "untrt"),
    output.dir=output.block.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

If continuous covariates are of interest, we can include them in the model and test their effects on expression:

# Making up a continuous covariate.
se$tumor_fraction <- runif(ncol(se))

# 'comparisons' of length 1 indicates that we want to test a single covariate.
output.cov.dir <- tempfile()
res.cov <- runEdgeR(
    se,
    group="dex",
    covariate="tumor_fraction",
    comparisons="tumor_fraction",
    output.dir=output.cov.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

For more complex cases, users can also directly pass in their own design matrices and contrasts:

output.custom.dir <- tempfile()
res.custom <- runEdgeR(
    se,
    design=~0 + dex + tumor_fraction,
    contrasts=c(dextrt = 1, dexuntrt = -1),
    output.dir=output.custom.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

4 More comparison types

We perform an ANOVA-like comparison by providing more than two groups in comparisons. Here, the null hypothesis is that there is no DE between any of the cell lines.

output.anova.dir <- tempfile()
res.anova <- runEdgeR(
    se,
    group="cell",
    comparisons=c("N052611", "N061011", "N080611", "N61311"),
    output.dir=output.anova.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

We test for differences between the averages of two sets of groups by passing a named character vector to comparisons=. Each unique name defines a set of groups, e.g., resistant versus sensitive.

output.sets.dir <- tempfile()

# Specify a named vector between the 'resistant' and 'sensitive' sets of cell lines.
res.sets <- runEdgeR(
    se,
    group="cell", 
    comparisons=c(resistant="N052611", resistant="N061011", sensitive="N080611", sensitive="N61311"),
    output.dir=output.sets.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

We perform multiple comparisons in a single call by passing a list of character vectors to comparisons. This can be convenient but does have some implications for subsetting, see comments below.

output.multiple.dir <- tempfile()
res.multiple <- runEdgeR(
    se,
    group="cell",
    comparisons=list(c("N052611", "N061011"), c("N080611", "N61311")),
    output.dir=output.multiple.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)
names(res.multiple$results)
## [1] "Increase in `N052611` over `N061011`"
## [2] "Increase in `N080611` over `N61311`"

We can test against a log-fold change threshold to focus on genes with larger changes rather than those with small variances. This is better than post hoc filtering of the DE table as it captures the uncertainty of the log-fold change estimates, otherwise, we’d enrich for genes with variable log-fold changes that only pass the threshold by chance.

output.lfc.dir <- tempfile()
res.lfc <- runEdgeR(
    se,
    group="dex",
    lfc.threshold=0.5,
    comparisons=c("trt", "untrt"),
    output.dir=output.lfc.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

5 Subsetting samples

If we only care about some of the samples, we can subset them by a pre-defined factor. For example, say we only care about a subset of the cell lines:

output.sub.dir <- tempfile()
res.sub <- runEdgeR(
    se, 
    group="dex", 
    comparisons=c("trt", "untrt"),
    subset.factor="cell",
    subset.levels=c("N052611", "N080611"),
    output.dir=output.lfc.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

By default, even if no subsetting is specified by the user, runEdgeR() will automatically subset the dataset to only those groups that are involved in the comparisons. This ensures that the variance/dispersion estimates are not affected by any unusual behavior in other (irrelevant) groups. So, for example, the runEdgeR() call below would only consider samples in the N052611 and N061011 groups:

output.sub.group.dir <- tempfile()
res.sub.group <- runEdgeR(
    se,
    group="cell",
    comparisons=c("N052611", "N061011"),
    output.dir=output.sub.group.dir,

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

The subsetting is only done once per runEdgeR() call, so if multiple comparisons are specified, the subdataset will retain groups from all comparisons. This is occasionally helpful for meta-analyses as it ensures that the dispersion estimates are consistent across comparisons. Most of the time, though, it is better to call runEdgeR() separately for each comparison:

comparisons <- list(c("N052611", "N061011"), c("N080611", "N61311"))

# Multiple comparisons in one call:
combined.results <- runEdgeR(
    se,
    group="cell",
    comparisons=comparisons,
    output.dir=tempfile(),

    # Not writing to disk or generating plots to reduce runtime.
    save.results=FALSE,
    suppress.plots=TRUE
)

# Or, repeated calls with different comparisons, which is usually the safer
# approach unless you know better.
separate.results <- list()
for (i in seq_along(comparisons)) {
    separate.results[[i]] <- runEdgeR(
        se,
        group="cell",
        comparisons=comparisons[[i]],
        output.dir=tempfile(),

        # Not writing to disk or generating plots to reduce runtime.
        save.results=FALSE,
        suppress.plots=TRUE
    )
}

6 More output options

As shown above, we can set save.results=FALSE to instruct runEdgeR() to not write results to disk. This is more efficient if we only need the results in the current R session.

Setting dry.run=TRUE will instruct runEdgeR() to only generate the Rmarkdown report but not evaluate it. This can be helpful for generating a draft report for further refinement.

Careful readers will note that the auto-generated report is not quite self-contained as runEdgeR() doesn’t know how se was constructed. In the absence of this information, the report contains a placeholder stop() to remind the user to insert the relevant commands:

## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
## 
## ```{r} 
## se <- local({ # augere.core input (1)
##     stop("insert commands to generate 'se' here")
## })
## se
## ```
## 
## We subset the `SummarizedExperiment` to the groups involved in the comparison.

If we know how the object was generated, we can pass this information to runEdgeR():

output.dry.dir <- tempfile()
runEdgeR(
    wrapInput(se, commands="augere.de::loadExampleDataset()"),
    group="dex",
    comparisons=c("trt", "untrt"),
    output.dir=output.dry.dir,
    dry.run=TRUE
)
## NULL

This instructs runEdgeR() to insert the relevant commands into the report for full reproducibility:

## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
## 
## ```{r} 
## se <- local({ # augere.core input (1)
##     augere.de::loadExampleDataset()
## })
## se
## ```
## 
## We subset the `SummarizedExperiment` to the groups involved in the comparison.

We can also attach rowData() columns to the result tables or add custom fields to the result metadata:

output.custom.dir <- tempfile()
res <- runEdgeR(
    se,
    group="dex",
    comparisons=c("trt", "untrt"),
    row.data=c("symbol", "gene_biotype"),
    metadata=list(
        project="my_phd_project",
        collaborators=list("anne", "bob", "charlie"),
        data_source="https://foo.bar.com"
    ),
    output.dir=output.custom.dir
)
# rowData columns are now included in the DF.
res$results[[1]]
## DataFrame with 63677 rows and 7 columns
##                        symbol   gene_biotype      LogFC   AveExpr         F
##                   <character>    <character>  <numeric> <numeric> <numeric>
## ENSG00000000003        TSPAN6 protein_coding -0.3862784   5.05487 6.7849013
## ENSG00000000005          TNMD protein_coding         NA        NA        NA
## ENSG00000000419          DPM1 protein_coding  0.1967653   4.60947 6.0508824
## ENSG00000000457         SCYL3 protein_coding  0.0254907   3.48214 0.0637723
## ENSG00000000460      C1orf112 protein_coding -0.1225528   1.48030 0.2075723
## ...                       ...            ...        ...       ...       ...
## ENSG00000273489 RP11-180C16.1      antisense         NA        NA        NA
## ENSG00000273490        TSEN34 protein_coding         NA        NA        NA
## ENSG00000273491  RP11-138A9.2        lincRNA         NA        NA        NA
## ENSG00000273492    AP000230.1        lincRNA         NA        NA        NA
## ENSG00000273493  RP11-80H18.4        lincRNA         NA        NA        NA
##                    PValue       FDR
##                 <numeric> <numeric>
## ENSG00000000003 0.0287436  0.130627
## ENSG00000000005        NA        NA
## ENSG00000000419 0.0364214  0.150396
## ENSG00000000457 0.8063598  0.911506
## ENSG00000000460 0.6595808  0.827011
## ...                   ...       ...
## ENSG00000273489        NA        NA
## ENSG00000273490        NA        NA
## ENSG00000273491        NA        NA
## ENSG00000273492        NA        NA
## ENSG00000273493        NA        NA
# The custom metadata is also saved to disk.
str(readResult(file.path(output.custom.dir, "results", "de-1"))$metadata)
## List of 7
##  $ project                     : chr "my_phd_project"
##  $ collaborators               :List of 3
##   ..$ : chr "anne"
##   ..$ : chr "bob"
##   ..$ : chr "charlie"
##  $ data_source                 : chr "https://foo.bar.com"
##  $ title                       : chr "Increase in `trt` over `untrt`"
##  $ authors                     :List of 1
##   ..$ : chr "biocbuild"
##  $ type                        : chr "differential_gene_expression"
##  $ differential_gene_expression:List of 2
##   ..$ contrast:List of 3
##   .. ..$ type : chr "versus"
##   .. ..$ left :List of 1
##   .. .. ..$ : chr "trt"
##   .. ..$ right:List of 1
##   .. .. ..$ : chr "untrt"
##   ..$ method  : chr "edgeR QL"

Session information

sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] augere.de_0.99.2 BiocStyle_2.41.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10                 generics_0.1.4             
##  [3] SparseArray_1.13.2          lattice_0.22-9             
##  [5] h5mread_1.5.0               alabaster.se_1.13.0        
##  [7] magrittr_2.0.5              digest_0.6.39              
##  [9] evaluate_1.0.5              grid_4.6.0                 
## [11] augere.core_0.99.3          bookdown_0.46              
## [13] fastmap_1.2.0               jsonlite_2.0.0             
## [15] Matrix_1.7-5                alabaster.schemas_1.13.0   
## [17] limma_3.69.0                tinytex_0.59               
## [19] BiocManager_1.30.27         HDF5Array_1.41.0           
## [21] jquerylib_0.1.4             abind_1.4-8                
## [23] cli_3.6.6                   rlang_1.2.0                
## [25] XVector_0.53.0              Biobase_2.73.1             
## [27] cachem_1.1.0                DelayedArray_0.39.1        
## [29] yaml_2.3.12                 otel_0.2.0                 
## [31] S4Arrays_1.13.0             tools_4.6.0                
## [33] Rhdf5lib_2.1.0              locfit_1.5-9.12            
## [35] alabaster.ranges_1.13.0     SummarizedExperiment_1.43.0
## [37] BiocGenerics_0.59.0         alabaster.base_1.13.0      
## [39] alabaster.matrix_1.13.0     R6_2.6.1                   
## [41] magick_2.9.1                matrixStats_1.5.0          
## [43] stats4_4.6.0                lifecycle_1.0.5            
## [45] rhdf5_2.57.0                Seqinfo_1.3.0              
## [47] S4Vectors_0.51.1            edgeR_4.11.0               
## [49] IRanges_2.47.0              bslib_0.10.0               
## [51] Rcpp_1.1.1-1.1              statmod_1.5.1              
## [53] xfun_0.57                   GenomicRanges_1.65.0       
## [55] MatrixGenerics_1.25.0       knitr_1.51                 
## [57] rhdf5filters_1.25.0         htmltools_0.5.9            
## [59] rmarkdown_2.31              compiler_4.6.0