augere.de 0.99.2
augere.de implements augere pipelines for differential expression (DE) analysis of RNA-seq data based on edgeR and voom().
Each analysis pipeline creates a self-contained Rmarkdown report that documents each step in the DE analysis,
after which it compiles the report and returns the analysis results to the user in the current R session.
We can then inspect the report to determine exactly how the analysis was performed.
We can also modify the report to customize the analysis beyond the options provided by the pipeline function.
To install this package, follow the usual instructions for Bioconductor:
install.packages("BiocManager") # if not already available
BiocManager::install("augere.de")
Let’s demonstrate on a dataset derived from the airway package. This has two groups (treated and untreated) so the parametrization of the analysis is fairly simple:
library(augere.de)
se <- loadExampleDataset()
# Testing for DE between 'trt' and 'untrt' in the 'dex' grouping factor.
output.dir <- tempfile()
res <- runEdgeR(se, group="dex", comparisons=c("trt", "untrt"), output.dir=output.dir)
# List of tables of DE results.
names(res$results)
## [1] "Increase in `trt` over `untrt`"
res$results[[1]]
## DataFrame with 63677 rows and 5 columns
## LogFC AveExpr F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3862784 5.05487 6.7849013 0.0287436 0.130627
## ENSG00000000005 NA NA NA NA NA
## ENSG00000000419 0.1967653 4.60947 6.0508824 0.0364214 0.150396
## ENSG00000000457 0.0254907 3.48214 0.0637723 0.8063598 0.911506
## ENSG00000000460 -0.1225528 1.48030 0.2075723 0.6595808 0.827011
## ... ... ... ... ... ...
## ENSG00000273489 NA NA NA NA NA
## ENSG00000273490 NA NA NA NA NA
## ENSG00000273491 NA NA NA NA NA
## ENSG00000273492 NA NA NA NA NA
## ENSG00000273493 NA NA NA NA NA
# Copy of 'se' with normalized expression values.
res$normalized
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(0):
## assays(2): counts logCPM
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(11): gene_id gene_name ... symbol retained
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(11): SampleName cell ... lib.size norm.factors
As mentioned, runEdgeR() generates an Rmarkdown report containing (almost) all of the steps required to reproduce the analysis.
Let’s have a look at it:
fname <- file.path(output.dir, "report.Rmd")
cat(head(readLines(fname), 50), sep="\n")
## ---
## title: Differential expression with edgeR
## author:
## - biocbuild
## date: "`r Sys.Date()`"
## output: BiocStyle::html_document
## ---
##
## ```{r preamble, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
##
## # Data loading
##
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
## This ensures that our variance estimates are not affected by irrelevant variability in other groups.
##
## ```{r}
## se <- se[,SummarizedExperiment::colData(se)[,"dex"] %in% c("trt", "untrt")]
## ncol(se)
## ```
##
## We extract the information into a `DGEList` object in preparation for analysis.
##
## ```{r}
## y <- edgeR::DGEList(SummarizedExperiment::assay(se, 1))
##
## # Also tracking the original row index from the SummarizedExperiment,
## # so that we can match them up after filtering out low-abundance genes.
## y$genes <- data.frame(origin=seq_len(nrow(se)))
## ```
##
## Finally, we build the design matrix.
##
## ```{r}
## model.data <- list()
## cd <- SummarizedExperiment::colData(se)
## model.data$group. <- factor(cd[,"dex"])
## design <- model.matrix(~ 0 + group., data=model.data)
## ```
The function will also save the results to file inside a results subdirectory, which can be loaded back into the session using the readResult() function:
reloaded <- readResult(file.path(output.dir, "results", "de-1"))
reloaded$x # the result itself
## DataFrame with 63677 rows and 5 columns
## LogFC AveExpr F PValue FDR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3862784 5.05487 6.7849013 0.0287436 0.130627
## ENSG00000000005 NA NA NA NA NA
## ENSG00000000419 0.1967653 4.60947 6.0508824 0.0364214 0.150396
## ENSG00000000457 0.0254907 3.48214 0.0637723 0.8063598 0.911506
## ENSG00000000460 -0.1225528 1.48030 0.2075723 0.6595808 0.827011
## ... ... ... ... ... ...
## ENSG00000273489 NA NA NA NA NA
## ENSG00000273490 NA NA NA NA NA
## ENSG00000273491 NA NA NA NA NA
## ENSG00000273492 NA NA NA NA NA
## ENSG00000273493 NA NA NA NA NA
str(reloaded$metadata) # along with some metadata
## List of 4
## $ title : chr "Increase in `trt` over `untrt`"
## $ authors :List of 1
## ..$ : chr "biocbuild"
## $ type : chr "differential_gene_expression"
## $ differential_gene_expression:List of 2
## ..$ contrast:List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "trt"
## .. ..$ right:List of 1
## .. .. ..$ : chr "untrt"
## ..$ method : chr "edgeR QL"
runVoom() does the same but for voom() instead of edgeR.
For simplicity, we will focus on runEdgeR() in the rest of this vignette.
We can block on uninteresting factors of variation - for example, the cell line of origin:
output.block.dir <- tempfile()
res.block <- runEdgeR(
se,
group="dex",
block="cell",
comparisons=c("trt", "untrt"),
output.dir=output.block.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
If continuous covariates are of interest, we can include them in the model and test their effects on expression:
# Making up a continuous covariate.
se$tumor_fraction <- runif(ncol(se))
# 'comparisons' of length 1 indicates that we want to test a single covariate.
output.cov.dir <- tempfile()
res.cov <- runEdgeR(
se,
group="dex",
covariate="tumor_fraction",
comparisons="tumor_fraction",
output.dir=output.cov.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
For more complex cases, users can also directly pass in their own design matrices and contrasts:
output.custom.dir <- tempfile()
res.custom <- runEdgeR(
se,
design=~0 + dex + tumor_fraction,
contrasts=c(dextrt = 1, dexuntrt = -1),
output.dir=output.custom.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
We perform an ANOVA-like comparison by providing more than two groups in comparisons.
Here, the null hypothesis is that there is no DE between any of the cell lines.
output.anova.dir <- tempfile()
res.anova <- runEdgeR(
se,
group="cell",
comparisons=c("N052611", "N061011", "N080611", "N61311"),
output.dir=output.anova.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
We test for differences between the averages of two sets of groups by passing a named character vector to comparisons=.
Each unique name defines a set of groups, e.g., resistant versus sensitive.
output.sets.dir <- tempfile()
# Specify a named vector between the 'resistant' and 'sensitive' sets of cell lines.
res.sets <- runEdgeR(
se,
group="cell",
comparisons=c(resistant="N052611", resistant="N061011", sensitive="N080611", sensitive="N61311"),
output.dir=output.sets.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
We perform multiple comparisons in a single call by passing a list of character vectors to comparisons.
This can be convenient but does have some implications for subsetting, see comments below.
output.multiple.dir <- tempfile()
res.multiple <- runEdgeR(
se,
group="cell",
comparisons=list(c("N052611", "N061011"), c("N080611", "N61311")),
output.dir=output.multiple.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
names(res.multiple$results)
## [1] "Increase in `N052611` over `N061011`"
## [2] "Increase in `N080611` over `N61311`"
We can test against a log-fold change threshold to focus on genes with larger changes rather than those with small variances. This is better than post hoc filtering of the DE table as it captures the uncertainty of the log-fold change estimates, otherwise, we’d enrich for genes with variable log-fold changes that only pass the threshold by chance.
output.lfc.dir <- tempfile()
res.lfc <- runEdgeR(
se,
group="dex",
lfc.threshold=0.5,
comparisons=c("trt", "untrt"),
output.dir=output.lfc.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
If we only care about some of the samples, we can subset them by a pre-defined factor. For example, say we only care about a subset of the cell lines:
output.sub.dir <- tempfile()
res.sub <- runEdgeR(
se,
group="dex",
comparisons=c("trt", "untrt"),
subset.factor="cell",
subset.levels=c("N052611", "N080611"),
output.dir=output.lfc.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
By default, even if no subsetting is specified by the user, runEdgeR() will automatically subset the dataset to only those groups that are involved in the comparisons.
This ensures that the variance/dispersion estimates are not affected by any unusual behavior in other (irrelevant) groups.
So, for example, the runEdgeR() call below would only consider samples in the N052611 and N061011 groups:
output.sub.group.dir <- tempfile()
res.sub.group <- runEdgeR(
se,
group="cell",
comparisons=c("N052611", "N061011"),
output.dir=output.sub.group.dir,
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
The subsetting is only done once per runEdgeR() call, so if multiple comparisons are specified, the subdataset will retain groups from all comparisons.
This is occasionally helpful for meta-analyses as it ensures that the dispersion estimates are consistent across comparisons.
Most of the time, though, it is better to call runEdgeR() separately for each comparison:
comparisons <- list(c("N052611", "N061011"), c("N080611", "N61311"))
# Multiple comparisons in one call:
combined.results <- runEdgeR(
se,
group="cell",
comparisons=comparisons,
output.dir=tempfile(),
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
# Or, repeated calls with different comparisons, which is usually the safer
# approach unless you know better.
separate.results <- list()
for (i in seq_along(comparisons)) {
separate.results[[i]] <- runEdgeR(
se,
group="cell",
comparisons=comparisons[[i]],
output.dir=tempfile(),
# Not writing to disk or generating plots to reduce runtime.
save.results=FALSE,
suppress.plots=TRUE
)
}
As shown above, we can set save.results=FALSE to instruct runEdgeR() to not write results to disk.
This is more efficient if we only need the results in the current R session.
Setting dry.run=TRUE will instruct runEdgeR() to only generate the Rmarkdown report but not evaluate it.
This can be helpful for generating a draft report for further refinement.
Careful readers will note that the auto-generated report is not quite self-contained as runEdgeR() doesn’t know how se was constructed.
In the absence of this information, the report contains a placeholder stop() to remind the user to insert the relevant commands:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
If we know how the object was generated, we can pass this information to runEdgeR():
output.dry.dir <- tempfile()
runEdgeR(
wrapInput(se, commands="augere.de::loadExampleDataset()"),
group="dex",
comparisons=c("trt", "untrt"),
output.dir=output.dry.dir,
dry.run=TRUE
)
## NULL
This instructs runEdgeR() to insert the relevant commands into the report for full reproducibility:
## Our `SummarizedExperiment` object contains all experimental data and metadata for this study.
## Each row corresponds to a gene while each column corresponds to a sample.
##
## ```{r}
## se <- local({ # augere.core input (1)
## augere.de::loadExampleDataset()
## })
## se
## ```
##
## We subset the `SummarizedExperiment` to the groups involved in the comparison.
We can also attach rowData() columns to the result tables or add custom fields to the result metadata:
output.custom.dir <- tempfile()
res <- runEdgeR(
se,
group="dex",
comparisons=c("trt", "untrt"),
row.data=c("symbol", "gene_biotype"),
metadata=list(
project="my_phd_project",
collaborators=list("anne", "bob", "charlie"),
data_source="https://foo.bar.com"
),
output.dir=output.custom.dir
)
# rowData columns are now included in the DF.
res$results[[1]]
## DataFrame with 63677 rows and 7 columns
## symbol gene_biotype LogFC AveExpr F
## <character> <character> <numeric> <numeric> <numeric>
## ENSG00000000003 TSPAN6 protein_coding -0.3862784 5.05487 6.7849013
## ENSG00000000005 TNMD protein_coding NA NA NA
## ENSG00000000419 DPM1 protein_coding 0.1967653 4.60947 6.0508824
## ENSG00000000457 SCYL3 protein_coding 0.0254907 3.48214 0.0637723
## ENSG00000000460 C1orf112 protein_coding -0.1225528 1.48030 0.2075723
## ... ... ... ... ... ...
## ENSG00000273489 RP11-180C16.1 antisense NA NA NA
## ENSG00000273490 TSEN34 protein_coding NA NA NA
## ENSG00000273491 RP11-138A9.2 lincRNA NA NA NA
## ENSG00000273492 AP000230.1 lincRNA NA NA NA
## ENSG00000273493 RP11-80H18.4 lincRNA NA NA NA
## PValue FDR
## <numeric> <numeric>
## ENSG00000000003 0.0287436 0.130627
## ENSG00000000005 NA NA
## ENSG00000000419 0.0364214 0.150396
## ENSG00000000457 0.8063598 0.911506
## ENSG00000000460 0.6595808 0.827011
## ... ... ...
## ENSG00000273489 NA NA
## ENSG00000273490 NA NA
## ENSG00000273491 NA NA
## ENSG00000273492 NA NA
## ENSG00000273493 NA NA
# The custom metadata is also saved to disk.
str(readResult(file.path(output.custom.dir, "results", "de-1"))$metadata)
## List of 7
## $ project : chr "my_phd_project"
## $ collaborators :List of 3
## ..$ : chr "anne"
## ..$ : chr "bob"
## ..$ : chr "charlie"
## $ data_source : chr "https://foo.bar.com"
## $ title : chr "Increase in `trt` over `untrt`"
## $ authors :List of 1
## ..$ : chr "biocbuild"
## $ type : chr "differential_gene_expression"
## $ differential_gene_expression:List of 2
## ..$ contrast:List of 3
## .. ..$ type : chr "versus"
## .. ..$ left :List of 1
## .. .. ..$ : chr "trt"
## .. ..$ right:List of 1
## .. .. ..$ : chr "untrt"
## ..$ method : chr "edgeR QL"
sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] augere.de_0.99.2 BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4
## [3] SparseArray_1.13.2 lattice_0.22-9
## [5] h5mread_1.5.0 alabaster.se_1.13.0
## [7] magrittr_2.0.5 digest_0.6.39
## [9] evaluate_1.0.5 grid_4.6.0
## [11] augere.core_0.99.3 bookdown_0.46
## [13] fastmap_1.2.0 jsonlite_2.0.0
## [15] Matrix_1.7-5 alabaster.schemas_1.13.0
## [17] limma_3.69.0 tinytex_0.59
## [19] BiocManager_1.30.27 HDF5Array_1.41.0
## [21] jquerylib_0.1.4 abind_1.4-8
## [23] cli_3.6.6 rlang_1.2.0
## [25] XVector_0.53.0 Biobase_2.73.1
## [27] cachem_1.1.0 DelayedArray_0.39.1
## [29] yaml_2.3.12 otel_0.2.0
## [31] S4Arrays_1.13.0 tools_4.6.0
## [33] Rhdf5lib_2.1.0 locfit_1.5-9.12
## [35] alabaster.ranges_1.13.0 SummarizedExperiment_1.43.0
## [37] BiocGenerics_0.59.0 alabaster.base_1.13.0
## [39] alabaster.matrix_1.13.0 R6_2.6.1
## [41] magick_2.9.1 matrixStats_1.5.0
## [43] stats4_4.6.0 lifecycle_1.0.5
## [45] rhdf5_2.57.0 Seqinfo_1.3.0
## [47] S4Vectors_0.51.1 edgeR_4.11.0
## [49] IRanges_2.47.0 bslib_0.10.0
## [51] Rcpp_1.1.1-1.1 statmod_1.5.1
## [53] xfun_0.57 GenomicRanges_1.65.0
## [55] MatrixGenerics_1.25.0 knitr_1.51
## [57] rhdf5filters_1.25.0 htmltools_0.5.9
## [59] rmarkdown_2.31 compiler_4.6.0