Contents

1 Overview

augere.solo implements augere pipelines for a variety of simple single-cell analyses. This use scrapper for routine steps such as quality control, clustering and marker detection. We also use SingleR for automatic cell type annotation based on pre-labelled references. Each pipeline creates a self-contained Rmarkdown report that documents each step in the analysis.

2 Quick start

To install this package, follow the usual instructions for Bioconductor:

install.packages("BiocManager") # if not already available
BiocManager::install("augere.solo")

Let’s load up our favorite example dataset from the scRNAseq package:

library(scRNAseq)
sce.z <- ZeiselBrainData()
sce.z
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC

Now we call runSolo() to perform a routine analysis of this dataset:

library(augere.solo)
outdir.z <- tempfile()
res.z <- runSolo(
    sce.z,
    qc.mito.regex = "^mt-",
    num.threads = 2, # speed it up a little
    output.dir = outdir.z
)

This returns a list of various results, including QC metrics, marker rankings and a SingleCellExperiment decorated with clusterings and reduced dimensions:

# Checking out the QC statistics.
res.z$qc.rna
## DataFrame with 3005 rows and 4 columns
##                      sum  detected subset.proportion.mito      keep
##                <numeric> <integer>              <numeric> <logical>
## 1772071015_C02     22354      4871             0.03462468      TRUE
## 1772071017_G12     22869      4712             0.04901832      TRUE
## 1772071017_A05     32594      6055             0.02920783      TRUE
## 1772071014_B06     33525      5852             0.01822521      TRUE
## 1772067065_H06     21694      4724             0.00755969      TRUE
## ...                  ...       ...                    ...       ...
## 1772067059_B04      5707      2250             0.19660067      TRUE
## 1772066097_D04      2574      1441             0.00582751      TRUE
## 1772063068_D01      4993      2001             0.19587422      TRUE
## 1772066098_A12      3099      1510             0.06550500      TRUE
## 1772058148_F03      6534      1809             0.31741659     FALSE
# Checking out the reduced dimensions.
library(scater)
plotReducedDim(res.z$sce, "TSNE", colour_by = "graph.cluster")

# Checking out the marker rankings for each cluster.
# We use previewMarkers() just for a prettier print.
scrapper::previewMarkers(res.z$markers.rna[["1"]])
## DataFrame with 10 rows and 3 columns
##             mean  detected       lfc
##        <numeric> <numeric> <numeric>
## Tspyl4   3.36948  1.000000   2.15064
## Mllt11   3.43235  1.000000   2.05507
## Ndrg4    4.40295  0.996491   2.63200
## Gad1     4.80299  1.000000   4.57610
## Celf4    3.81657  0.996491   2.09285
## Gad2     4.44021  0.996491   4.25366
## Stmn3    4.71924  0.992982   2.70762
## Atp1a3   3.91374  0.996491   2.26505
## Rcan2    2.33158  0.936842   1.75198
## Slc6a1   3.76102  0.992982   3.08077

The function generates an Rmarkdown report containing (almost) all of the steps required to reproduce the analysis. Let’s have a look at it:

fname <- file.path(outdir.z, "report.Rmd")
cat(head(readLines(fname), 50), sep="\n")
## ---
## title: "Simple analysis of scRNA-seq data"
## author: list("biocbuild")
## date: "`r Sys.Date()`"
## output:
##   BiocStyle::html_document:
##     toc_float: yes
##     titlecaps: false 
##   md_document:
##     preserve_yaml: yes
## ---
## 
## ```{r preamble, echo=FALSE}
## knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
## ```
## 
## # Data loading
## 
## Our `SummarizedExperiment` object contains all of the experimental data and metadata for this study.
## Each column is assumed to correspond to a single cell.
## The main experiment contains RNA data where each row corresponds to a gene.
## 
## ```{r}
## se <- local({ # augere.core input (1)
##     stop("insert commands to generate 'se' here")
## })
## se
## ```
## 
## We convert each relevant experiment into a `SingleCellExperiment` if it wasn't already one.
## 
## ```{r}
## loadNamespace("SingleCellExperiment")
## sce <- as(se, "SingleCellExperiment")
## ```
## 
## We also coerce each input matrix to Bioconductor's sparse format.
## This isn't strictly necessary but it avoids silent performance degradation from inefficient formats (e.g., file-backed, `dgTMatrix` objects).
## 
## ```{r}
## loadNamespace("SparseArray")
## SummarizedExperiment::assay(sce, 1, withDimnames=FALSE) <- local({
##     mat <- SummarizedExperiment::assay(sce, 1, withDimnames=FALSE)
##     DelayedArray::DelayedArray(as(mat, "SVT_SparseMatrix"))
## })
## ```
## 
## # Quality control
## 
## ## For RNA

The function will also save the results to file inside a results subdirectory, which can be loaded back into the session using the readResult() function:

reloaded <- readResult(file.path(outdir.z, "results", "markers-rna", "2"))
scrapper::previewMarkers(reloaded$x) # preview of the marker rankings
## DataFrame with 10 rows and 3 columns
##             mean  detected       lfc
##        <numeric> <numeric> <numeric>
## Syt1     3.62958  0.979487  0.761914
## Mef2c    2.37713  0.825641  1.013806
## Snhg11   3.70648  0.964103  1.146727
## Meg3     5.00525  1.000000  1.395784
## Plp1     8.88389  1.000000  4.135423
## Mog      5.43628  0.994872  3.109208
## Cryab    4.54240  1.000000  2.332521
## Cnp      4.84884  1.000000  2.458323
## Mobp     4.46912  1.000000  2.418916
## Ermn     4.49302  0.994872  2.485988
str(reloaded$metadata) # along with some metadata
## List of 4
##  $ title                : chr "Marker gene for cluster 2"
##  $ authors              :List of 1
##   ..$ : chr "biocbuild"
##  $ type                 : chr "marker_gene_detection"
##  $ marker_gene_detection:List of 3
##   ..$ group : chr "2"
##   ..$ order : chr "cohens.d.min.rank"
##   ..$ factor: chr "graph.cluster"

3 Multi-batch analyses

Larger datasets often have uninteresting variation associated with a blocking factor, e.g., batches. To illustrate, let’s combine the Zeisel dataset with another brain dataset:

sce.t <- TasicBrainData()
common.brain <- intersect(rownames(sce.z), rownames(sce.t))
combined.brain <- combineCols(
    sce.z[common.brain,],
    sce.t[common.brain,]
)
combined.brain$study <- rep(c("zeisel", "tasic"), c(ncol(sce.z), ncol(sce.t)))
combined.brain
## class: SingleCellExperiment 
## dim: 18698 4814 
## metadata(0):
## assays(1): counts
## rownames(18698): Tspan12 Tshz1 ... Uty Usp9y
## rowData names(1): featureType
## colnames(4814): 1772071015_C02 1772071017_G12 ... Rbp4_CTX_250ng_2
##   Trib2_CTX_250ng_1
## colData names(22): tissue group # ... aibs_vignette_id study
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC

We now instruct runSolo() to block on the study of origin for each cell. In many steps (e.g., variance calculations, differential expression), this will force all calculations to be performed within each level of the blocking factor. This avoids interference from uninteresting differences between studies.

outdir.com <- tempfile()
res.com <- runSolo(
    combined.brain,
    block = "study",
    qc.mito.regex = "^mt-",
    output.dir = outdir.com,

    # Just setting these options to reduce the runtime,
    # otherwise this vignette takes too long to build.
    num.threads = 2,
    suppress.plots = TRUE,
    save.results = FALSE
)

We will also apply mutual nearest neighbors (MNN) correction to remove differences between studies in the low-dimensional space; the MNN-corrected coordinates are then used for all downstream analyses.

# Now we have an extra set of MNN-corrected coordinates. 
reducedDimNames(res.com$sce)
## [1] "PCA"  "MNN"  "TSNE" "UMAP"
# Examining the distribution of clusters between studies.
table(res.com$sce$graph.cluster, res.com$sce$study)
##     
##      tasic zeisel
##   1    201    121
##   2     77     17
##   3      0     63
##   4    386     47
##   5     53     82
##   6      0    358
##   7     60    109
##   8      1    227
##   9     13    454
##   10   155     94
##   11   360    134
##   12   106     15
##   13   205    133
##   14     7    246
##   15    18    431
##   16     1    195
##   17    22    118
##   18    20    158
# Visualizing the distribution in the UMAP space.
plotReducedDim(res.com$sce, "UMAP", colour_by = "study")

4 With proteomics

runSolo() can also handle proteomics data from CITE-seq experiments. We’ll demonstrate on yet another example dataset, where the ADT counts are held in an alternative experiment:

sce.k <- KotliarovPBMCData()
sce.k <- sce.k[,1:1000] # save some time for the build system.
sce.k
## class: SingleCellExperiment 
## dim: 32738 1000 
## metadata(0):
## assays(1): counts
## rownames(32738): MIR1302-10 FAM138A ... AC002321.2 AC002321.1
## rowData names(0):
## colnames(1000): AAACCTGAGAGCCCAA_H1B1ln1 AAACCTGAGGCGTACA_H1B1ln1 ...
##   ATAACGCGTTTGGGCC_H1B1ln1 ATAACGCTCAACGCTA_H1B1ln1
## colData names(24): nGene nUMI ... dmx_hto_match timepoint
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): ADT

We specify the experiment containing the ADT counts in runSolo().

outdir.k <- tempfile()
res.k <- runSolo(
    sce.k,
    qc.mito.regex = "^MT-",
    adt.experiment = "ADT",
    output.dir = outdir.k,

    # Just setting these options to reduce the runtime,
    # otherwise this vignette takes too long to build.
    num.threads = 2,
    suppress.plots = TRUE,
    save.results = FALSE
)

This applies quality control, normalization and dimensionality reduction on the ADT data, and then combines the principal components with those from RNA into a single embedding for downstream analyses like clustering and visualization. The goal is to incorporate information from both modalities in the distance calculations, such that significant differences in either modality is capable of creating a new cluster.

# Now we have an extra set of combined coordinates. 
reducedDimNames(res.k$sce)
## [1] "PCA"      "combined" "TSNE"     "UMAP"
# Checking out the ADT-specific QC metrics.
res.k$qc.adt
## DataFrame with 1000 rows and 4 columns
##                                sum  detected subset.sum.igg      keep
##                          <numeric> <integer>      <numeric> <logical>
## AAACCTGAGAGCCCAA_H1B1ln1      3558        71              6      TRUE
## AAACCTGAGGCGTACA_H1B1ln1      3114        64              0      TRUE
## AAACCTGCAGGTGGAT_H1B1ln1      2429        70              1      TRUE
## AAACCTGCAGTATCTG_H1B1ln1      2669        67              0      TRUE
## AAACCTGCATCACAAC_H1B1ln1      3869        74              2      TRUE
## ...                            ...       ...            ...       ...
## ATAACGCGTAGTAGTA_H1B1ln1      1609        66              1      TRUE
## ATAACGCGTCAACATC_H1B1ln1      1809        68              3      TRUE
## ATAACGCGTCTCTCTG_H1B1ln1      2305        64              1      TRUE
## ATAACGCGTTTGGGCC_H1B1ln1      1896        80              3      TRUE
## ATAACGCTCAACGCTA_H1B1ln1      2945        71              2      TRUE
# Checking out the reduced dimensions.
plotReducedDim(res.k$sce, "TSNE", colour_by = "graph.cluster")

# We can also look at the ADT-specific markers for a cluster.
scrapper::previewMarkers(res.k$markers.adt[[1]])
## DataFrame with 10 rows and 3 columns
##                  mean  detected       lfc
##             <numeric> <numeric> <numeric>
## CD244_PROT    7.25785  1.000000  2.689771
## CD16_PROT     8.32580  1.000000  4.979206
## CD45RA_PROT   8.13694  1.000000  2.392828
## CD7_PROT      8.72043  1.000000  2.668172
## CD38_PROT     6.92452  1.000000  2.236772
## CD56_PROT     3.80793  0.987952  3.166101
## CD161_PROT    4.60555  1.000000  2.245237
## CD18_PROT     9.31142  1.000000  0.922413
## CD11b_PROT    5.09110  1.000000  0.753909
## CD314 _PROT   4.37376  1.000000  1.321706

We could also analyze the ADT modality by itself by setting rna.experiment=NULL.

5 Cell type annotation

augere.solo implements a wrapper around SingleR for cell type annotation. We can either use one of the pre-defined references from the celldex package:

outdir.celldex <- tempfile()
sub.z <- sce.z[,1:1000] # subsetting to save some time for the build system.
pred.celldex <- runAnnotate(
    sub.z,
    configureReferenceAnnotation("MouseRNAseqData", "label.main"),
    output.dir = outdir.celldex,

    # Just setting these options to reduce the runtime,
    # otherwise this vignette takes too long to build.
    num.threads = 2,
    suppress.plots = TRUE,
    save.results = FALSE
)
table(pred.celldex$predictions[[1]]$labels)
## 
##          Neurons Oligodendrocytes 
##              981               19

Or we can use another single-cell dataset. Here, we set ref.aggregate=TRUE to speed up the classification by aggregating single-cells in the reference into a smaller number of representative clusters.

outdir.tasic <- tempfile()
pred.tasic <- runAnnotate(
    sub.z,
    configureReferenceAnnotation(sce.t, "primary_type", ref.assay="counts", ref.aggregate=TRUE),
    output.dir = outdir.tasic,

    # Just setting these options to reduce the runtime,
    # otherwise this vignette takes too long to build.
    num.threads = 2,
    suppress.plots = TRUE,
    save.results = FALSE
)
table(pred.tasic$predictions[[1]]$labels)
## 
##          Igtp        L2 Ngb    L2/3 Ptgs2       L4 Arf5      L4 Ctxn3 
##            15            62           160            60             8 
##     L4 Scnn1a     L5 Chrna6       L5 Ucma     L5a Batf3   L5a Hsd11b1 
##             2             1            26            19             2 
##     L5a Pde1c   L5a Tcerg1l     L5b Cdh13      L5b Tph2     L6a Car12 
##            12             6            27            11           139 
##       L6a Mgp       L6a Sla     L6a Syt17     L6b Rgs12 L6b Serpinb11 
##            36            65            16             3            10 
##     Ndnf Car4   Ndnf Cxcl14   Pvalb Cpne5    Pvalb Gpx3   Pvalb Obox3 
##            15            48             5             1             2 
##         Smad3          Sncg     Sst Cbln4      Sst Cdk6     Sst Chodl 
##            35            39            19             1             7 
##      Sst Myh8   Sst Tacstd2        Sst Th  Unclassified      Vip Chat 
##             3             1            14            45             2 
##      Vip Gpc3    Vip Mybpc1     Vip Parm1      Vip Sncg 
##             7            11            11            54

We can even pass multiple references, in which case runAnnotate() will annotate against each reference separately and then choose between the best labels over all references.

outdir.multi <- tempfile()
pred.multi <- runAnnotate(
    sub.z,
    list(
        rnaseq=configureReferenceAnnotation("MouseRNAseqData", "label.main"),
        immgen=configureReferenceAnnotation("ImmGenData", "label.main")
    ),
    output.dir = outdir.multi,

    # Just setting these options to reduce the runtime,
    # otherwise this vignette takes too long to build.
    num.threads = 2,
    suppress.plots = TRUE,
    save.results = FALSE
)

# Results for annotation against the individual references:
names(pred.multi$predictions)
## [1] "rnaseq" "immgen"
# As well as the combined results:
table(pred.multi$combined$labels, pred.multi$combined$reference)
##                   
##                      1
##   Neurons          983
##   Oligodendrocytes  17

Session information

sessionInfo()
## R version 4.6.0 RC (2026-04-17 r89917)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.41.1               ggplot2_4.0.3              
##  [3] scuttle_1.23.0              augere.solo_0.99.3         
##  [5] scRNAseq_2.27.0             SingleCellExperiment_1.35.0
##  [7] SummarizedExperiment_1.43.0 Biobase_2.73.1             
##  [9] GenomicRanges_1.65.0        Seqinfo_1.3.0              
## [11] IRanges_2.47.0              S4Vectors_0.51.1           
## [13] BiocGenerics_0.59.0         generics_0.1.4             
## [15] MatrixGenerics_1.25.0       matrixStats_1.5.0          
## [17] BiocStyle_2.41.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3        jsonlite_2.0.0           
##   [3] magrittr_2.0.5            magick_2.9.1             
##   [5] ggbeeswarm_0.7.3          GenomicFeatures_1.65.0   
##   [7] gypsum_1.9.0              farver_2.1.2             
##   [9] rmarkdown_2.31            BiocIO_1.23.3            
##  [11] vctrs_0.7.3               DelayedMatrixStats_1.35.0
##  [13] memoise_2.0.1             Rsamtools_2.29.0         
##  [15] RCurl_1.98-1.18           tinytex_0.59             
##  [17] htmltools_0.5.9           S4Arrays_1.13.0          
##  [19] BiocBaseUtils_1.15.0      AnnotationHub_4.3.0      
##  [21] curl_7.1.0                BiocNeighbors_2.7.1      
##  [23] Rhdf5lib_2.1.0            SparseArray_1.13.2       
##  [25] rhdf5_2.57.0              sass_0.4.10              
##  [27] alabaster.base_1.13.0     bslib_0.10.0             
##  [29] alabaster.sce_1.13.0      httr2_1.2.2              
##  [31] cachem_1.1.0              GenomicAlignments_1.49.0 
##  [33] lifecycle_1.0.5           pkgconfig_2.0.3          
##  [35] rsvd_1.0.5                Matrix_1.7-5             
##  [37] R6_2.6.1                  fastmap_1.2.0            
##  [39] digest_0.6.39             AnnotationDbi_1.75.0     
##  [41] irlba_2.3.7               ExperimentHub_3.3.0      
##  [43] RSQLite_2.4.6             beachmat_2.29.0          
##  [45] labeling_0.4.3            filelock_1.0.3           
##  [47] httr_1.4.8                abind_1.4-8              
##  [49] compiler_4.6.0            withr_3.0.2              
##  [51] bit64_4.8.0               S7_0.2.2                 
##  [53] BiocParallel_1.47.0       viridis_0.6.5            
##  [55] DBI_1.3.0                 augere.core_0.99.3       
##  [57] HDF5Array_1.41.0          alabaster.ranges_1.13.0  
##  [59] alabaster.schemas_1.13.0  rappdirs_0.3.4           
##  [61] DelayedArray_0.39.1       rjson_0.2.23             
##  [63] tools_4.6.0               vipor_0.4.7              
##  [65] otel_0.2.0                beeswarm_0.4.0           
##  [67] glue_1.8.1                h5mread_1.5.0            
##  [69] restfulr_0.0.16           rhdf5filters_1.25.0      
##  [71] grid_4.6.0                gtable_0.3.6             
##  [73] ensembldb_2.37.0          BiocSingular_1.29.0      
##  [75] ScaledMatrix_1.21.0       XVector_0.53.0           
##  [77] ggrepel_0.9.8             BiocVersion_3.24.0       
##  [79] pillar_1.11.1             dplyr_1.2.1              
##  [81] BiocFileCache_3.3.0       lattice_0.22-9           
##  [83] rtracklayer_1.73.0        bit_4.6.0                
##  [85] tidyselect_1.2.1          Biostrings_2.81.1        
##  [87] knitr_1.51                gridExtra_2.3            
##  [89] scrapper_1.7.0            bookdown_0.46            
##  [91] ProtGenerics_1.45.0       xfun_0.57                
##  [93] pheatmap_1.0.13           UCSC.utils_1.9.0         
##  [95] lazyeval_0.2.3            yaml_2.3.12              
##  [97] evaluate_1.0.5            codetools_0.2-20         
##  [99] cigarillo_1.3.0           tibble_3.3.1             
## [101] alabaster.matrix_1.13.0   BiocManager_1.30.27      
## [103] cli_3.6.6                 jquerylib_0.1.4          
## [105] dichromat_2.0-0.1         Rcpp_1.1.1-1.1           
## [107] GenomeInfoDb_1.49.0       dbplyr_2.5.2             
## [109] png_0.1-9                 XML_3.99-0.23            
## [111] parallel_4.6.0            blob_1.3.0               
## [113] SingleR_2.15.0            AnnotationFilter_1.37.0  
## [115] sparseMatrixStats_1.25.0  bitops_1.0-9             
## [117] viridisLite_0.4.3         alabaster.se_1.13.0      
## [119] scales_1.4.0              crayon_1.5.3             
## [121] rlang_1.2.0               celldex_1.23.0           
## [123] cowplot_1.2.0             KEGGREST_1.53.0