augere.solo 0.99.3
augere.solo implements augere pipelines for a variety of simple single-cell analyses. This use scrapper for routine steps such as quality control, clustering and marker detection. We also use SingleR for automatic cell type annotation based on pre-labelled references. Each pipeline creates a self-contained Rmarkdown report that documents each step in the analysis.
To install this package, follow the usual instructions for Bioconductor:
install.packages("BiocManager") # if not already available
BiocManager::install("augere.solo")
Let’s load up our favorite example dataset from the scRNAseq package:
library(scRNAseq)
sce.z <- ZeiselBrainData()
sce.z
## class: SingleCellExperiment
## dim: 20006 3005
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
## 1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
Now we call runSolo() to perform a routine analysis of this dataset:
library(augere.solo)
outdir.z <- tempfile()
res.z <- runSolo(
sce.z,
qc.mito.regex = "^mt-",
num.threads = 2, # speed it up a little
output.dir = outdir.z
)
This returns a list of various results, including QC metrics, marker rankings and a SingleCellExperiment decorated with clusterings and reduced dimensions:
# Checking out the QC statistics.
res.z$qc.rna
## DataFrame with 3005 rows and 4 columns
## sum detected subset.proportion.mito keep
## <numeric> <integer> <numeric> <logical>
## 1772071015_C02 22354 4871 0.03462468 TRUE
## 1772071017_G12 22869 4712 0.04901832 TRUE
## 1772071017_A05 32594 6055 0.02920783 TRUE
## 1772071014_B06 33525 5852 0.01822521 TRUE
## 1772067065_H06 21694 4724 0.00755969 TRUE
## ... ... ... ... ...
## 1772067059_B04 5707 2250 0.19660067 TRUE
## 1772066097_D04 2574 1441 0.00582751 TRUE
## 1772063068_D01 4993 2001 0.19587422 TRUE
## 1772066098_A12 3099 1510 0.06550500 TRUE
## 1772058148_F03 6534 1809 0.31741659 FALSE
# Checking out the reduced dimensions.
library(scater)
plotReducedDim(res.z$sce, "TSNE", colour_by = "graph.cluster")
# Checking out the marker rankings for each cluster.
# We use previewMarkers() just for a prettier print.
scrapper::previewMarkers(res.z$markers.rna[["1"]])
## DataFrame with 10 rows and 3 columns
## mean detected lfc
## <numeric> <numeric> <numeric>
## Tspyl4 3.36948 1.000000 2.15064
## Mllt11 3.43235 1.000000 2.05507
## Ndrg4 4.40295 0.996491 2.63200
## Gad1 4.80299 1.000000 4.57610
## Celf4 3.81657 0.996491 2.09285
## Gad2 4.44021 0.996491 4.25366
## Stmn3 4.71924 0.992982 2.70762
## Atp1a3 3.91374 0.996491 2.26505
## Rcan2 2.33158 0.936842 1.75198
## Slc6a1 3.76102 0.992982 3.08077
The function generates an Rmarkdown report containing (almost) all of the steps required to reproduce the analysis. Let’s have a look at it:
fname <- file.path(outdir.z, "report.Rmd")
cat(head(readLines(fname), 50), sep="\n")
## ---
## title: "Simple analysis of scRNA-seq data"
## author: list("biocbuild")
## date: "`r Sys.Date()`"
## output:
## BiocStyle::html_document:
## toc_float: yes
## titlecaps: false
## md_document:
## preserve_yaml: yes
## ---
##
## ```{r preamble, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
##
## # Data loading
##
## Our `SummarizedExperiment` object contains all of the experimental data and metadata for this study.
## Each column is assumed to correspond to a single cell.
## The main experiment contains RNA data where each row corresponds to a gene.
##
## ```{r}
## se <- local({ # augere.core input (1)
## stop("insert commands to generate 'se' here")
## })
## se
## ```
##
## We convert each relevant experiment into a `SingleCellExperiment` if it wasn't already one.
##
## ```{r}
## loadNamespace("SingleCellExperiment")
## sce <- as(se, "SingleCellExperiment")
## ```
##
## We also coerce each input matrix to Bioconductor's sparse format.
## This isn't strictly necessary but it avoids silent performance degradation from inefficient formats (e.g., file-backed, `dgTMatrix` objects).
##
## ```{r}
## loadNamespace("SparseArray")
## SummarizedExperiment::assay(sce, 1, withDimnames=FALSE) <- local({
## mat <- SummarizedExperiment::assay(sce, 1, withDimnames=FALSE)
## DelayedArray::DelayedArray(as(mat, "SVT_SparseMatrix"))
## })
## ```
##
## # Quality control
##
## ## For RNA
The function will also save the results to file inside a results subdirectory, which can be loaded back into the session using the readResult() function:
reloaded <- readResult(file.path(outdir.z, "results", "markers-rna", "2"))
scrapper::previewMarkers(reloaded$x) # preview of the marker rankings
## DataFrame with 10 rows and 3 columns
## mean detected lfc
## <numeric> <numeric> <numeric>
## Syt1 3.62958 0.979487 0.761914
## Mef2c 2.37713 0.825641 1.013806
## Snhg11 3.70648 0.964103 1.146727
## Meg3 5.00525 1.000000 1.395784
## Plp1 8.88389 1.000000 4.135423
## Mog 5.43628 0.994872 3.109208
## Cryab 4.54240 1.000000 2.332521
## Cnp 4.84884 1.000000 2.458323
## Mobp 4.46912 1.000000 2.418916
## Ermn 4.49302 0.994872 2.485988
str(reloaded$metadata) # along with some metadata
## List of 4
## $ title : chr "Marker gene for cluster 2"
## $ authors :List of 1
## ..$ : chr "biocbuild"
## $ type : chr "marker_gene_detection"
## $ marker_gene_detection:List of 3
## ..$ group : chr "2"
## ..$ order : chr "cohens.d.min.rank"
## ..$ factor: chr "graph.cluster"
Larger datasets often have uninteresting variation associated with a blocking factor, e.g., batches. To illustrate, let’s combine the Zeisel dataset with another brain dataset:
sce.t <- TasicBrainData()
common.brain <- intersect(rownames(sce.z), rownames(sce.t))
combined.brain <- combineCols(
sce.z[common.brain,],
sce.t[common.brain,]
)
combined.brain$study <- rep(c("zeisel", "tasic"), c(ncol(sce.z), ncol(sce.t)))
combined.brain
## class: SingleCellExperiment
## dim: 18698 4814
## metadata(0):
## assays(1): counts
## rownames(18698): Tspan12 Tshz1 ... Uty Usp9y
## rowData names(1): featureType
## colnames(4814): 1772071015_C02 1772071017_G12 ... Rbp4_CTX_250ng_2
## Trib2_CTX_250ng_1
## colData names(22): tissue group # ... aibs_vignette_id study
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC
We now instruct runSolo() to block on the study of origin for each cell.
In many steps (e.g., variance calculations, differential expression), this will force all calculations to be performed within each level of the blocking factor.
This avoids interference from uninteresting differences between studies.
outdir.com <- tempfile()
res.com <- runSolo(
combined.brain,
block = "study",
qc.mito.regex = "^mt-",
output.dir = outdir.com,
# Just setting these options to reduce the runtime,
# otherwise this vignette takes too long to build.
num.threads = 2,
suppress.plots = TRUE,
save.results = FALSE
)
We will also apply mutual nearest neighbors (MNN) correction to remove differences between studies in the low-dimensional space; the MNN-corrected coordinates are then used for all downstream analyses.
# Now we have an extra set of MNN-corrected coordinates.
reducedDimNames(res.com$sce)
## [1] "PCA" "MNN" "TSNE" "UMAP"
# Examining the distribution of clusters between studies.
table(res.com$sce$graph.cluster, res.com$sce$study)
##
## tasic zeisel
## 1 201 121
## 2 77 17
## 3 0 63
## 4 386 47
## 5 53 82
## 6 0 358
## 7 60 109
## 8 1 227
## 9 13 454
## 10 155 94
## 11 360 134
## 12 106 15
## 13 205 133
## 14 7 246
## 15 18 431
## 16 1 195
## 17 22 118
## 18 20 158
# Visualizing the distribution in the UMAP space.
plotReducedDim(res.com$sce, "UMAP", colour_by = "study")
runSolo() can also handle proteomics data from CITE-seq experiments.
We’ll demonstrate on yet another example dataset, where the ADT counts are held in an alternative experiment:
sce.k <- KotliarovPBMCData()
sce.k <- sce.k[,1:1000] # save some time for the build system.
sce.k
## class: SingleCellExperiment
## dim: 32738 1000
## metadata(0):
## assays(1): counts
## rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
## rowData names(0):
## colnames(1000): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGAGGCGTACA_H1B1ln1 ...
## ATAACGCGTTTGGGCC_H1B1ln1 ATAACGCTCAACGCTA_H1B1ln1
## colData names(24): nGene nUMI ... dmx_hto_match timepoint
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): ADT
We specify the experiment containing the ADT counts in runSolo().
outdir.k <- tempfile()
res.k <- runSolo(
sce.k,
qc.mito.regex = "^MT-",
adt.experiment = "ADT",
output.dir = outdir.k,
# Just setting these options to reduce the runtime,
# otherwise this vignette takes too long to build.
num.threads = 2,
suppress.plots = TRUE,
save.results = FALSE
)
This applies quality control, normalization and dimensionality reduction on the ADT data, and then combines the principal components with those from RNA into a single embedding for downstream analyses like clustering and visualization. The goal is to incorporate information from both modalities in the distance calculations, such that significant differences in either modality is capable of creating a new cluster.
# Now we have an extra set of combined coordinates.
reducedDimNames(res.k$sce)
## [1] "PCA" "combined" "TSNE" "UMAP"
# Checking out the ADT-specific QC metrics.
res.k$qc.adt
## DataFrame with 1000 rows and 4 columns
## sum detected subset.sum.igg keep
## <numeric> <integer> <numeric> <logical>
## AAACCTGAGAGCCCAA_H1B1ln1 3558 71 6 TRUE
## AAACCTGAGGCGTACA_H1B1ln1 3114 64 0 TRUE
## AAACCTGCAGGTGGAT_H1B1ln1 2429 70 1 TRUE
## AAACCTGCAGTATCTG_H1B1ln1 2669 67 0 TRUE
## AAACCTGCATCACAAC_H1B1ln1 3869 74 2 TRUE
## ... ... ... ... ...
## ATAACGCGTAGTAGTA_H1B1ln1 1609 66 1 TRUE
## ATAACGCGTCAACATC_H1B1ln1 1809 68 3 TRUE
## ATAACGCGTCTCTCTG_H1B1ln1 2305 64 1 TRUE
## ATAACGCGTTTGGGCC_H1B1ln1 1896 80 3 TRUE
## ATAACGCTCAACGCTA_H1B1ln1 2945 71 2 TRUE
# Checking out the reduced dimensions.
plotReducedDim(res.k$sce, "TSNE", colour_by = "graph.cluster")
# We can also look at the ADT-specific markers for a cluster.
scrapper::previewMarkers(res.k$markers.adt[[1]])
## DataFrame with 10 rows and 3 columns
## mean detected lfc
## <numeric> <numeric> <numeric>
## CD244_PROT 7.25785 1.000000 2.689771
## CD16_PROT 8.32580 1.000000 4.979206
## CD45RA_PROT 8.13694 1.000000 2.392828
## CD7_PROT 8.72043 1.000000 2.668172
## CD38_PROT 6.92452 1.000000 2.236772
## CD56_PROT 3.80793 0.987952 3.166101
## CD161_PROT 4.60555 1.000000 2.245237
## CD18_PROT 9.31142 1.000000 0.922413
## CD11b_PROT 5.09110 1.000000 0.753909
## CD314 _PROT 4.37376 1.000000 1.321706
We could also analyze the ADT modality by itself by setting rna.experiment=NULL.
augere.solo implements a wrapper around SingleR for cell type annotation. We can either use one of the pre-defined references from the celldex package:
outdir.celldex <- tempfile()
sub.z <- sce.z[,1:1000] # subsetting to save some time for the build system.
pred.celldex <- runAnnotate(
sub.z,
configureReferenceAnnotation("MouseRNAseqData", "label.main"),
output.dir = outdir.celldex,
# Just setting these options to reduce the runtime,
# otherwise this vignette takes too long to build.
num.threads = 2,
suppress.plots = TRUE,
save.results = FALSE
)
table(pred.celldex$predictions[[1]]$labels)
##
## Neurons Oligodendrocytes
## 981 19
Or we can use another single-cell dataset.
Here, we set ref.aggregate=TRUE to speed up the classification by aggregating single-cells in the reference into a smaller number of representative clusters.
outdir.tasic <- tempfile()
pred.tasic <- runAnnotate(
sub.z,
configureReferenceAnnotation(sce.t, "primary_type", ref.assay="counts", ref.aggregate=TRUE),
output.dir = outdir.tasic,
# Just setting these options to reduce the runtime,
# otherwise this vignette takes too long to build.
num.threads = 2,
suppress.plots = TRUE,
save.results = FALSE
)
table(pred.tasic$predictions[[1]]$labels)
##
## Igtp L2 Ngb L2/3 Ptgs2 L4 Arf5 L4 Ctxn3
## 15 62 160 60 8
## L4 Scnn1a L5 Chrna6 L5 Ucma L5a Batf3 L5a Hsd11b1
## 2 1 26 19 2
## L5a Pde1c L5a Tcerg1l L5b Cdh13 L5b Tph2 L6a Car12
## 12 6 27 11 139
## L6a Mgp L6a Sla L6a Syt17 L6b Rgs12 L6b Serpinb11
## 36 65 16 3 10
## Ndnf Car4 Ndnf Cxcl14 Pvalb Cpne5 Pvalb Gpx3 Pvalb Obox3
## 15 48 5 1 2
## Smad3 Sncg Sst Cbln4 Sst Cdk6 Sst Chodl
## 35 39 19 1 7
## Sst Myh8 Sst Tacstd2 Sst Th Unclassified Vip Chat
## 3 1 14 45 2
## Vip Gpc3 Vip Mybpc1 Vip Parm1 Vip Sncg
## 7 11 11 54
We can even pass multiple references, in which case runAnnotate() will annotate against each reference separately and then choose between the best labels over all references.
outdir.multi <- tempfile()
pred.multi <- runAnnotate(
sub.z,
list(
rnaseq=configureReferenceAnnotation("MouseRNAseqData", "label.main"),
immgen=configureReferenceAnnotation("ImmGenData", "label.main")
),
output.dir = outdir.multi,
# Just setting these options to reduce the runtime,
# otherwise this vignette takes too long to build.
num.threads = 2,
suppress.plots = TRUE,
save.results = FALSE
)
# Results for annotation against the individual references:
names(pred.multi$predictions)
## [1] "rnaseq" "immgen"
# As well as the combined results:
table(pred.multi$combined$labels, pred.multi$combined$reference)
##
## 1
## Neurons 983
## Oligodendrocytes 17
sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.41.1 ggplot2_4.0.3
## [3] scuttle_1.23.0 augere.solo_0.99.3
## [5] scRNAseq_2.27.0 SingleCellExperiment_1.35.0
## [7] SummarizedExperiment_1.43.0 Biobase_2.73.1
## [9] GenomicRanges_1.65.0 Seqinfo_1.3.0
## [11] IRanges_2.47.0 S4Vectors_0.51.1
## [13] BiocGenerics_0.59.0 generics_0.1.4
## [15] MatrixGenerics_1.25.0 matrixStats_1.5.0
## [17] BiocStyle_2.41.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0
## [3] magrittr_2.0.5 magick_2.9.1
## [5] ggbeeswarm_0.7.3 GenomicFeatures_1.65.0
## [7] gypsum_1.9.0 farver_2.1.2
## [9] rmarkdown_2.31 BiocIO_1.23.3
## [11] vctrs_0.7.3 DelayedMatrixStats_1.35.0
## [13] memoise_2.0.1 Rsamtools_2.29.0
## [15] RCurl_1.98-1.18 tinytex_0.59
## [17] htmltools_0.5.9 S4Arrays_1.13.0
## [19] BiocBaseUtils_1.15.0 AnnotationHub_4.3.0
## [21] curl_7.1.0 BiocNeighbors_2.7.1
## [23] Rhdf5lib_2.1.0 SparseArray_1.13.2
## [25] rhdf5_2.57.0 sass_0.4.10
## [27] alabaster.base_1.13.0 bslib_0.10.0
## [29] alabaster.sce_1.13.0 httr2_1.2.2
## [31] cachem_1.1.0 GenomicAlignments_1.49.0
## [33] lifecycle_1.0.5 pkgconfig_2.0.3
## [35] rsvd_1.0.5 Matrix_1.7-5
## [37] R6_2.6.1 fastmap_1.2.0
## [39] digest_0.6.39 AnnotationDbi_1.75.0
## [41] irlba_2.3.7 ExperimentHub_3.3.0
## [43] RSQLite_2.4.6 beachmat_2.29.0
## [45] labeling_0.4.3 filelock_1.0.3
## [47] httr_1.4.8 abind_1.4-8
## [49] compiler_4.6.0 withr_3.0.2
## [51] bit64_4.8.0 S7_0.2.2
## [53] BiocParallel_1.47.0 viridis_0.6.5
## [55] DBI_1.3.0 augere.core_0.99.3
## [57] HDF5Array_1.41.0 alabaster.ranges_1.13.0
## [59] alabaster.schemas_1.13.0 rappdirs_0.3.4
## [61] DelayedArray_0.39.1 rjson_0.2.23
## [63] tools_4.6.0 vipor_0.4.7
## [65] otel_0.2.0 beeswarm_0.4.0
## [67] glue_1.8.1 h5mread_1.5.0
## [69] restfulr_0.0.16 rhdf5filters_1.25.0
## [71] grid_4.6.0 gtable_0.3.6
## [73] ensembldb_2.37.0 BiocSingular_1.29.0
## [75] ScaledMatrix_1.21.0 XVector_0.53.0
## [77] ggrepel_0.9.8 BiocVersion_3.24.0
## [79] pillar_1.11.1 dplyr_1.2.1
## [81] BiocFileCache_3.3.0 lattice_0.22-9
## [83] rtracklayer_1.73.0 bit_4.6.0
## [85] tidyselect_1.2.1 Biostrings_2.81.1
## [87] knitr_1.51 gridExtra_2.3
## [89] scrapper_1.7.0 bookdown_0.46
## [91] ProtGenerics_1.45.0 xfun_0.57
## [93] pheatmap_1.0.13 UCSC.utils_1.9.0
## [95] lazyeval_0.2.3 yaml_2.3.12
## [97] evaluate_1.0.5 codetools_0.2-20
## [99] cigarillo_1.3.0 tibble_3.3.1
## [101] alabaster.matrix_1.13.0 BiocManager_1.30.27
## [103] cli_3.6.6 jquerylib_0.1.4
## [105] dichromat_2.0-0.1 Rcpp_1.1.1-1.1
## [107] GenomeInfoDb_1.49.0 dbplyr_2.5.2
## [109] png_0.1-9 XML_3.99-0.23
## [111] parallel_4.6.0 blob_1.3.0
## [113] SingleR_2.15.0 AnnotationFilter_1.37.0
## [115] sparseMatrixStats_1.25.0 bitops_1.0-9
## [117] viridisLite_0.4.3 alabaster.se_1.13.0
## [119] scales_1.4.0 crayon_1.5.3
## [121] rlang_1.2.0 celldex_1.23.0
## [123] cowplot_1.2.0 KEGGREST_1.53.0