augere.core 0.99.3
augere.core provides utilities to implement pipelines for automatic generation of analysis reports within the augere framework. Each pipeline function performs a standard bioinformatics analysis while generating a fully-parametrized Rmarkdown report that contains all of the relevant commands. The generated report documents the analysis for greater transparency and reproducibility, and can be easily modified to fine-tune the analysis with custom steps or parameters. (Check out downstream packages like augere.de and augere.screen for examples of such pipelines.) This vignette aims to provide some guidance on how to write such pipeline functions.
To install this package, follow the usual instructions for Bioconductor:
install.packages("BiocManager") # if not already available
BiocManager::install("augere.core")
An augere pipeline typically consists of an Rmarkdown template that describes most of the analysis, and a user-visible function that realizes the template into a fully-parametrized report. To demonstrate, we’ll consider a simple differential expression (DE) analysis based on the following template:
# Data loading
```{r}
:BEGIN data
:END
```
# Preprocessing
```{r}
y <- SummarizedExperiment::assay(se)
d <- edgeR::DGEList(y)
d <- d[edgeR::filterByExpr(d),]
d <- edgeR::normLibSizes(d)
d$samples
```
# Variance modelling
```{r}
design <- model.matrix(<%= FORMULA %>, colData(se))
v <- limma::voom(d, design, plot=TRUE)
fit <- limma::lmFit(v, design)
fit <- limma::eBayes(fit, robust=<%= ROBUST %>)
```
# Testing for differences
```{r}
res <- limma::topTable(fit, n=Inf)
```
Saving the result:
```{r saveme}
write.csv(res, 'de.csv')
```
Our pipeline function fills in the placeholder expressions (<%= .. %>) and tags (:BEGIN, :END) in the template.
Some work is required to manage the input data and output results, which is explained in more detail below.
library(augere.core)
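demo.pipeline <- function(
    x,
    formula,
    output.dir='.',
    robust=TRUE,
    dry.run=FALSE,
    save.results=TRUE
) {
    # Reset the input cache, restoring the previous state on exit.
    restore.cache <- resetInputCache()
    on.exit(restore.cache(), add=TRUE, after=FALSE)
    # Parse the template and insert the commands that define the input data.
    report.text <- parseRmdTemplate(template)
    report.text[["data"]] <- processInputCommands(x, "se")
    # Fill in the placeholders with the deparsed parameter values.
    report.text <- replacePlaceholders(
        report.text,
        c(
            FORMULA=paste(as.character(formula), collapse=""),
            ROBUST=deparseToString(robust)
        )
    )
    # Write the report into the output directory.
    dir.create(output.dir, showWarnings=FALSE, recursive=TRUE)
    fname <- file.path(output.dir, "report.Rmd")
    writeRmd(report.text, fname)
    # A dry run stops after writing the report.
    if (dry.run) {
        return(invisible(NULL))
    }
    # Compile the report in its own environment, optionally skipping the save chunk.
    env <- new.env()
    to.skip <- NULL
    if (!save.results) {
        to.skip <- "saveme"
    }
    compileReport(fname, env, skip.chunks=to.skip)
    env$res
}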
For users, executing the pipeline is as simple as:
library(airway)
data(airway)
output.dir <- tempfile()
res <- demo.pipeline(airway, formula=~dex, output.dir=output.dir)
head(res)
                    logFC  AveExpr         t      P.Value    adj.P.Val        B
ENSG00000152583 -4.563976 4.165462 -18.83164 1.495306e-08 0.0001640445 9.494016
ENSG00000134686 -1.370138 6.837469 -16.88415 3.905774e-08 0.0001640445 9.366849
ENSG00000179094 -3.174442 4.418980 -16.81466 4.049618e-08 0.0001640445 9.105028
ENSG00000125148 -2.186793 7.022630 -16.31429 5.276623e-08 0.0001640445 9.070672
ENSG00000120129 -2.942589 6.643013 -15.82037 6.919764e-08 0.0001640445 8.823060
ENSG00000148175 -1.429101 8.854588 -16.12567 5.841799e-08 0.0001640445 8.815325
This produces an Rmarkdown report with all the details filled out:
# Looking at the top few lines of the report.
fname <- file.path(output.dir, "report.Rmd")
contents <- readLines(fname)
cat(head(contents, 30), sep="\n")
# Data loading
```{r}
se <- local({ # augere.core input (1)
stop("insert commands to generate 'se' here")
})
```
# Preprocessing
```{r}
y <- SummarizedExperiment::assay(se)
d <- edgeR::DGEList(y)
d <- d[edgeR::filterByExpr(d),]
d <- edgeR::normLibSizes(d)
d$samples
```
# Variance modelling
```{r}
design <- model.matrix(~dex, colData(se))
v <- limma::voom(d, design, plot=TRUE)
fit <- limma::lmFit(v, design)
fit <- limma::eBayes(fit, robust=TRUE)
```
# Testing for differences
```{r}
The Rmarkdown template should contain almost all of the analysis steps, such that it can be realized into a reproducible analysis report with just a few modifications. We use two syntax markers to indicate where/how the report can be modified:
<%= ... %> notation represents a placeholder to be substituted by the replacePlaceholders() function.
:BEGIN and :END tags delimit named blocks that are extracted by the parseRmdTemplate() function.
These blocks can be deleted, used as insertion points for more content, duplicated, etc.
These two markers can be inserted into an existing Rmarkdown report to turn it into a template,
e.g., by replacing hard-coded parameters with <%= ... %> placeholders that will be filled in by the pipeline function.
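To make the placeholder idea concrete, here is a minimal base-R sketch of this kind of substitution. It is not the implementation of replacePlaceholders(); the fill_placeholders() helper is hypothetical and exists only for illustration, and it assumes that the replacement values contain no backslashes.

```r
# Hypothetical helper (not part of augere.core): substitute each
# <%= NAME %> placeholder in 'lines' with the matching value.
fill_placeholders <- function(lines, values) {
    for (name in names(values)) {
        # Match the delimiters, allowing optional whitespace around the name.
        pattern <- paste0("<%=\\s*", name, "\\s*%>")
        lines <- gsub(pattern, values[[name]], lines, perl=TRUE)
    }
    lines
}

# Two lines taken from the demonstration template.
template.lines <- c(
    "design <- model.matrix(<%= FORMULA %>, colData(se))",
    "fit <- limma::eBayes(fit, robust=<%= ROBUST %>)"
)

filled <- fill_placeholders(template.lines, c(FORMULA="~dex", ROBUST="TRUE"))
cat(filled, sep="\n")
```

The filled lines match the corresponding lines of the generated report shown earlier.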
Given a template, the parseRmdTemplate() function will load in the template as a nested list.
This can be modified with replacePlaceholders() or by list operations to add/delete blocks.
Once all modifications are complete, the writeRmd() function will write the report to disk.
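The list operations mentioned above can be sketched with a toy example in base R. The structure below is only a stand-in and does not reproduce the exact object returned by parseRmdTemplate(); the block names and the 'se.rds' path are hypothetical.

```r
# A toy stand-in for a parsed template: unnamed elements hold fixed text,
# while named elements mimic blocks delimited by :BEGIN and :END.
report <- list(
    "# Data loading",
    data = character(0),
    "# Preprocessing",
    extra = "plot(1)"
)

# Use a named block as an insertion point for new content.
report[["data"]] <- "se <- readRDS('se.rds')"

# Delete a block that is not needed.
report[["extra"]] <- NULL

# Flatten the list into lines of text and write them to disk.
out.file <- tempfile(fileext=".Rmd")
writeLines(unlist(report, use.names=FALSE), out.file)
written <- readLines(out.file)
print(written)
```

In an actual pipeline, writeRmd() takes care of the last step.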
We suggest writing the report to a location inside a user-supplied output directory, so that the compiled HTML and results do not interfere with those of other analyses.
We typically use :: rather than library() in the various code chunks.
The former avoids the unpredictability of traversing the search path in the user’s R session, which might end up finding an irrelevant function with the same name.
Similarly, it also does not attach more namespaces to the user’s session, which would otherwise be an unexpected side-effect of running the pipeline.
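Both points can be checked in a small base-R example, where a user-defined function masks stats::sd(). The masking function is made up purely for demonstration.

```r
# Simulate a user session that defines its own function named 'sd',
# masking the one from the 'stats' package on the search path.
sd <- function(x) "user-defined sd"

unqualified <- sd(c(1, 2, 3))       # finds the user's function first
qualified <- stats::sd(c(1, 2, 3))  # always uses the 'stats' version

print(unqualified)
print(qualified)

# Using '::' loads a namespace if needed but does not attach it,
# so the search path is left unchanged.
before <- search()
invisible(tools::file_ext("report.Rmd"))
after <- search()
print(identical(before, after))
```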
Each pipeline function typically has one or more arguments that represent the input data.
The processInputCommands() function creates R commands to define these data inputs,
which should be inserted into the Rmarkdown report for documentation and reproducibility.
If the user supplies an object directly, processInputCommands() will just create a stop() statement.
This reminds the user to supply the code used to create that object before reproducing the analysis from the Rmarkdown report.
We do not attempt to deparse these objects as they might be arbitrarily large.
Alternatively, the user can use wrapInput() to define how each input object was created.
The user-provided commands will be returned verbatim by processInputCommands().
This can be helpful if the user does not want to modify the report afterwards to document the data generation.
Regardless of the input object, processInputCommands() will cache the supplied object for use in compileReport().
When compileReport() encounters the code generated by processInputCommands(), it will use the cached object rather than re-running the commands.
This saves some time and ensures that the compilation can proceed even if processInputCommands() returned a stop().
Each pipeline function should call resetInputCache() to protect against interference from other calls to the same or different pipeline.
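The reset-and-restore pattern can be illustrated with a self-contained sketch in base R. Everything here is hypothetical: the cache environment, reset_cache() and toy_pipeline() are simplified stand-ins and do not reflect how augere.core stores its inputs internally.

```r
# Hypothetical cache, standing in for the package's internal input cache.
cache <- new.env()

# Hypothetical reset function: empties the cache and returns a function
# that restores the previous contents.
reset_cache <- function() {
    old <- as.list(cache, all.names=TRUE)
    rm(list=ls(cache, all.names=TRUE), envir=cache)
    function() {
        rm(list=ls(cache, all.names=TRUE), envir=cache)
        list2env(old, envir=cache)
        invisible(NULL)
    }
}

toy_pipeline <- function(x) {
    restore <- reset_cache()
    on.exit(restore(), add=TRUE, after=FALSE)
    assign("se", x, envir=cache)  # cache the input for this call only
    ls(cache)
}

assign("other", 1, envir=cache)  # state left over from some outer call
seen <- toy_pipeline(42)         # inside the call, only 'se' is cached
remaining <- ls(cache)           # afterwards, the outer state is back
print(seen)
print(remaining)
```

Registering the restore function with on.exit() means the previous state comes back even if the pipeline fails partway through.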
Pipelines may also take any number of non-data arguments to define the parameters of the analysis.
These values can be of any type so long as they can be deparsed (typically with deparseToString()) and embedded in the report directly.
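As a rough illustration of deparsing, the sketch below uses only base R's deparse(). The to_code() helper is hypothetical and is not a substitute for deparseToString(); the parameter values are made up. Note that pasting the output of as.character() on a formula, as in the demonstration pipeline, assumes a one-sided formula, whereas deparse() also handles two-sided formulas.

```r
# Hypothetical helper: convert a value into the code that recreates it,
# collapsing in case deparse() returns multiple lines.
to_code <- function(x) paste(deparse(x), collapse="")

robust.code <- to_code(TRUE)
groups.code <- to_code(c("treated", "control"))
cutoff.code <- to_code(0.05)

# The deparsed text can be embedded directly into a line of report code.
chunk <- sprintf("keep <- res$adj.P.Val <= %s", cutoff.code)
print(chunk)

# Evaluating the deparsed text gives back the original value.
round.trip <- eval(parse(text=groups.code))
print(identical(round.trip, c("treated", "control")))

# Two ways of converting a formula to text.
one.sided <- paste(as.character(~dex), collapse="")
two.sided <- paste(deparse(y ~ dex), collapse="")
print(one.sided)
print(two.sided)
```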
The compileReport() function will compile the report (duh) within a self-contained environment.
Specifically, it executes the Rmarkdown code and populates an environment containing all of the generated variables.
The relevant results can then be extracted from this environment and returned to the user.
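The idea of collecting results in a dedicated environment can be sketched in base R as follows. This is not how compileReport() is implemented; it only shows code being evaluated in a fresh environment, from which the variables of interest are then retrieved. The code lines are made up for illustration.

```r
# Code standing in for the chunks of a report.
chunk.code <- c(
    "x <- c(1, 2, 3)",
    "res <- data.frame(value=x, squared=x^2)"
)

# Evaluate in a fresh environment, so that the variables created by
# the code do not leak into the caller's workspace.
env <- new.env()
eval(parse(text=chunk.code), envir=env)

# All generated variables are now available in 'env'.
print(ls(env))
print(env$res)
```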
It is also helpful to provide an option to save the results to file. The exact format is left to the developer, but for more complex Bioconductor objects, we suggest using the alabaster.base framework. We can also skip the relevant code chunk if we are only interested in looking at the results in memory.
A dry-run of the pipeline function would just create the Rmarkdown report without actually running it. This can be useful in some cases, e.g., if the user knows that some modifications are required.
sessionInfo()
R version 4.6.0 RC (2026-04-17 r89917)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] airway_1.33.0 SummarizedExperiment_1.43.0
[3] Biobase_2.73.1 GenomicRanges_1.65.0
[5] Seqinfo_1.3.0 IRanges_2.47.0
[7] S4Vectors_0.51.1 BiocGenerics_0.59.0
[9] generics_0.1.4 MatrixGenerics_1.25.0
[11] matrixStats_1.5.0 augere.core_0.99.3
[13] BiocStyle_2.41.0
loaded via a namespace (and not attached):
[1] Matrix_1.7-5 limma_3.69.0 jsonlite_2.0.0
[4] compiler_4.6.0 BiocManager_1.30.27 Rcpp_1.1.1-1.1
[7] tinytex_0.59 magick_2.9.1 jquerylib_0.1.4
[10] statmod_1.5.1 yaml_2.3.12 fastmap_1.2.0
[13] lattice_0.22-9 R6_2.6.1 XVector_0.53.0
[16] S4Arrays_1.13.0 knitr_1.51 DelayedArray_0.39.1
[19] bookdown_0.46 bslib_0.10.0 rlang_1.2.0
[22] cachem_1.1.0 xfun_0.57 sass_0.4.10
[25] otel_0.2.0 SparseArray_1.13.2 cli_3.6.6
[28] magrittr_2.0.5 locfit_1.5-9.12 digest_0.6.39
[31] grid_4.6.0 edgeR_4.11.0 lifecycle_1.0.5
[34] evaluate_1.0.5 abind_1.4-8 rmarkdown_2.31
[37] tools_4.6.0 htmltools_0.5.9