Contents

1 Overview

scrapper implements bindings to C++ code for analyzing single-cell data, mostly from the libscran libraries. Each function performs an individual analysis step ranging from quality control to clustering and marker detection. Users can also conveniently perform end-to-end analyses with a single call to the analyze.se() function.

2 Basic analysis

Let’s fetch a small single-cell RNA-seq dataset for demonstration purposes:

library(scRNAseq)
sce.z <- ZeiselBrainData()
sce.z
## class: SingleCellExperiment 
## dim: 20006 3005 
## metadata(0):
## assays(1): counts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(1): featureType
## colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
##   1772058148_F03
## colData names(9): tissue group # ... level1class level2class
## reducedDimNames(0):
## mainExpName: gene
## altExpNames(2): repeat ERCC

We run it through the scrapper analysis pipeline. This performs the usual steps - quality control, normalization, selection of highly variable genes, PCA, clustering, visualization and marker detection, see the OSCA book for more theoretical details.

library(scrapper)
is.mito <- grep("^mt-", rownames(sce.z))
results <- analyze.se(
    sce.z,
    rna.qc.subsets=list(Mito=is.mito),
    num.threads=2 # limit number of threads to keep R CMD check happy.
)
results$x # most outputs are stored in a copy of the input SingleCellExperiment. 
## class: SingleCellExperiment 
## dim: 20006 2874 
## metadata(2): qc PCA
## assays(2): counts logcounts
## rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
## rowData names(6): featureType means ... residuals hvg
## colnames(2874): 1772071015_C02 1772071017_G12 ... 1772063068_D01
##   1772066098_A12
## colData names(15): tissue group # ... sizeFactor graph.cluster
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: gene
## altExpNames(2): repeat ERCC

Now we can have a look at some of the results. For example, we can make a t-SNE plot:

library(scater)
plotReducedDim(results$x, "TSNE", colour_by="graph.cluster")

We can also look at the markers defining each cluster:

# Ordering by the median AUC across pairwise comparisons involving the cluster.
previewMarkers(results$markers$rna[[1]], order.by="auc.median")
## DataFrame with 10 rows and 4 columns
##             mean  detected       lfc auc.median
##        <numeric> <numeric> <numeric>  <numeric>
## Gad1     4.79503  1.000000   4.56949   0.997684
## Gad2     4.44192  0.996503   4.25766   0.995434
## Ndrg4    4.40310  0.996503   2.59179   0.990182
## Stmn3    4.71546  0.993007   2.64538   0.986791
## Tspyl4   3.36568  1.000000   2.15128   0.983035
## Syp      2.79635  0.993007   1.53131   0.976301
## Rab3a    3.34302  0.989510   1.82740   0.975972
## Scg5     4.74990  1.000000   2.06392   0.975421
## Snap25   5.40529  1.000000   2.61350   0.974922
## Slc6a1   3.75820  0.993007   3.08908   0.974192

analyze.se() can also be broken down to its component functions for further customization:

test <- sce.z
test <- quickRnaQc.se(test, subsets=list(mt=is.mito), num.threads=2)
test <- test[,test$keep]
test <- normalizeRnaCounts.se(test, size.factors=test$sum)
test <- chooseRnaHvgs.se(test, num.threads=2)
test <- runPca.se(test, features=rowData(test)$hvg, num.threads=2)
test <- runAllNeighborSteps.se(test, num.threads=2)
markers <- scoreMarkers.se(test, groups=test$clusters, num.threads=2)

3 Blocking on batches

Let’s fetch another brain dataset and combine it with the previous one.

sce.t <- TasicBrainData()

# Only using the common genes in both datasets.
common <- intersect(rownames(sce.z), rownames(sce.t))
combined <- combineCols(sce.t[common,], sce.z[common,])
block <- rep(c("tasic", "zeisel"), c(ncol(sce.t), ncol(sce.z)))

We set block= in each step to account for the batch structure. This ensures that the various calculations are not affected by inter-block differences. It also uses MNN correction to batch effects in the low-dimensional space prior to further analyses.

# No mitochondrial genes in the combined set, so we'll just skip it.
blocked_res <- analyze.se(combined, block=block, num.threads=2)

Hopefully, the correction is able to get rid of the batch effect and merge the datasets together.

plotReducedDim(blocked_res$x, "TSNE", colour_by="block")

We can also check the distribution of clusters across batches. The presence of batch-specific clusters might indicate that the batch effect was not fully removed; or they might represent cell types that are unique to one of the studies, who knows.

table(blocked_res$x$graph.cluster, blocked_res$x$block)
##     
##      tasic zeisel
##   1    122     90
##   2    477     70
##   3    364    130
##   4    183    128
##   5     43    362
##   6     22    121
##   7     91     99
##   8     20    150
##   9     31     12
##   10    48     18
##   11   248    106
##   12    20    701
##   13     7    252
##   14     1    213
##   15     8     87
##   16     0    463

Finally, some markers for another cluster:

previewMarkers(blocked_res$markers$rna[[1]])
## DataFrame with 10 rows and 3 columns
##             mean  detected       lfc
##        <numeric> <numeric> <numeric>
## Gad1     4.99380  1.000000   3.26141
## Igf1     4.22957  0.990566   3.22416
## Synpr    5.06317  0.985849   3.61784
## Cnr1     6.27028  1.000000   4.46956
## Slc6a1   4.07537  1.000000   2.19651
## Gad2     3.45237  1.000000   2.42218
## Vstm2a   3.37646  0.995283   2.26351
## Vip      5.68657  0.863208   5.08033
## Nap1l5   3.72545  1.000000   2.35075
## Nrxn3    4.17803  1.000000   2.27554

We can also compute pseudo-bulk profiles for each cluster-dataset combination, e.g., for differential expression analyses.

aggregates <- aggregateAcrossCells.se(
    blocked_res$x,
    list(cluster=blocked_res$x$graph.cluster, dataset=blocked_res$x$block)
)
aggregates
## class: SummarizedExperiment 
## dim: 18698 31 
## metadata(1): aggregated
## assays(2): sums detected
## rownames(18698): Tspan12 Tshz1 ... Uty Usp9y
## rowData names(6): featureType means ... residuals hvg
## colnames: NULL
## colData names(30): factor.cluster factor.dataset ... block
##   graph.cluster

4 Combining multiple modalities

Let’s fetch a single-cell dataset with both RNA-seq and CITE-seq data. To keep the run-time short, we’ll only consider the first 5000 cells.

sce.s <- StoeckiusHashingData(mode=c("human", "adt1", "adt2"))
sce.s <- sce.s[,1:5000]

# Combining the various ADT-related alternative experiments into a single object.
altExp(sce.s, "all.adts") <- rbind(altExp(sce.s, "adt1"), altExp(sce.s, "adt2"))
altExp(sce.s, "adt1") <- altExp(sce.s, "adt2") <- NULL

sce.s
## class: SingleCellExperiment 
## dim: 27679 5000 
## metadata(0):
## assays(1): counts
## rownames(27679): A1BG A1BG-AS1 ... hsa-mir-8072 snoU2-30
## rowData names(0):
## colnames(5000): TGCCAAATCTCTAAGG ACTGCTCAGGTGTTAA ... CGCTATCGTCCGTCAG
##   GGCGACTGTAAGGGAA
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(1): all.adts

We specify adt.altexp= to indicate that one of the alternative experiments contains ADT data. This instructs the analysis to use data from both modalities during clustering and visualization.

is.mito <- grepl("^MT-", rownames(sce.s))
is.igg <- grepl("^IgG", rownames(altExp(sce.s, 'all.adts')))
multi_res <- analyze.se(
    sce.s,
    adt.altexp="all.adts",
    rna.qc.subsets=list(mito=is.mito),
    adt.qc.subsets=list(igg=is.igg),
    num.threads=2
)

Here’s an obligatory plot:

plotReducedDim(multi_res$x, "UMAP", colour_by="graph.cluster")

We can inspect the top markers among the ADTs for a cluster:

# By default, markers are sorted by the mean Cohen's d.
# We just replace the column name when printing the dataframe here.
previewMarkers(multi_res$markers$adt[[5]], c(effect="cohens.d.mean"))
## DataFrame with 10 rows and 4 columns
##                                   mean  detected         lfc     effect
##                              <numeric> <numeric>   <numeric>  <numeric>
## CD4-TAGACAGCTG                 9.23177   1.00000  3.71501845  3.6884442
## CD3-TCTTCGTCCA                 8.91729   1.00000  1.15035918  0.7931371
## CD45RA-AGTCCATTGG              8.44019   1.00000  0.02945903  0.1395605
## CCR7-GTAAACCGCA                7.32269   1.00000  0.05250743  0.0754008
## IgG_control-CATGATTGGCTC       1.52623   0.84375 -0.06576363 -0.0190229
## IgM_control-AAGCGCTTGGCA       0.00000   0.00000 -0.00416226 -0.0508764
## IgG2b_2_control-CATCGGTGTACA   1.49816   0.78750 -0.15456722 -0.1411518
## PD-1-GTACTTCCAG                7.63058   1.00000 -0.19576466 -0.3935047
## IgG1_2_control-GTCTAGACTTCG    4.71182   0.99375 -0.35686513 -0.5329377
## CD8-AGATGAACCC                 4.87258   1.00000 -0.84859423 -0.5452329

… along with the RNA markers, as before.

# Using the delta-detected to focus on silent-to-expressed genes.
previewMarkers(multi_res$markers$rna[[5]], order.by="delta.detected.mean")
## DataFrame with 10 rows and 4 columns
##              mean  detected       lfc delta.detected.mean
##         <numeric> <numeric> <numeric>           <numeric>
## ENAH      1.36986   0.94375  0.820350            0.598778
## CNN3      2.09056   0.97500  1.224571            0.597853
## WBP5      2.01220   0.98125  1.145416            0.594153
## PLS3      1.75284   0.96875  1.012275            0.591067
## CENPV     1.90134   0.96250  1.130238            0.589109
## S100A13   1.35069   0.93750  0.793204            0.588513
## PCBD1     1.70390   0.96875  0.931640            0.587040
## UCHL1     1.71843   0.96250  0.954069            0.585129
## CALD1     1.60756   0.93750  0.957796            0.579067
## RDX       1.26275   0.93125  0.739840            0.577328

5 Other useful functions

The runAllNeighborSteps() fucntion will run runUmap(), runTsne(), buildSnnGraph() and clusterGraph() in a single call. This runs the UMAP/t-SNE iterations and the clustering in parallel to maximize use of multiple threads.

The scoreGeneSet() function will compute a gene set score based on the input expression matrix. This can be used to summarize the activity of pathways into a single per-cell score for visualization.

The subsampleByNeighbors() function will deterministically select a representative subset of cells based on their local neighborhood. This can be used to reduce the compute time of the various steps downstream of the PCA.

For CRISPR data, quality control can be performed using computeCrisprQcMetrics(), suggestCrisprQcThresholds() and filterCrisprQcMetrics(). To normalize, we use size factors defined by centering the total sum of guide counts for each cell.

Session information

sessionInfo()
## R Under development (unstable) (2025-12-22 r89219)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.39.0               ggplot2_4.0.1              
##  [3] scuttle_1.21.0              scrapper_1.5.5             
##  [5] scRNAseq_2.25.0             SingleCellExperiment_1.33.0
##  [7] SummarizedExperiment_1.41.0 Biobase_2.71.0             
##  [9] GenomicRanges_1.63.1        Seqinfo_1.1.0              
## [11] IRanges_2.45.0              S4Vectors_0.49.0           
## [13] BiocGenerics_0.57.0         generics_0.1.4             
## [15] MatrixGenerics_1.23.0       matrixStats_1.5.0          
## [17] knitr_1.51                  BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3       jsonlite_2.0.0           magrittr_2.0.4          
##   [4] magick_2.9.0             ggbeeswarm_0.7.3         GenomicFeatures_1.63.1  
##   [7] gypsum_1.7.0             farver_2.1.2             rmarkdown_2.30          
##  [10] BiocIO_1.21.0            vctrs_0.6.5              memoise_2.0.1           
##  [13] Rsamtools_2.27.0         RCurl_1.98-1.17          tinytex_0.58            
##  [16] htmltools_0.5.9          S4Arrays_1.11.1          AnnotationHub_4.1.0     
##  [19] curl_7.0.0               BiocNeighbors_2.5.0      Rhdf5lib_1.33.0         
##  [22] SparseArray_1.11.10      rhdf5_2.55.12            sass_0.4.10             
##  [25] alabaster.base_1.11.1    bslib_0.9.0              alabaster.sce_1.11.0    
##  [28] httr2_1.2.2              cachem_1.1.0             GenomicAlignments_1.47.0
##  [31] lifecycle_1.0.4          pkgconfig_2.0.3          rsvd_1.0.5              
##  [34] Matrix_1.7-4             R6_2.6.1                 fastmap_1.2.0           
##  [37] digest_0.6.39            AnnotationDbi_1.73.0     irlba_2.3.5.1           
##  [40] ExperimentHub_3.1.0      RSQLite_2.4.5            beachmat_2.27.1         
##  [43] labeling_0.4.3           filelock_1.0.3           httr_1.4.7              
##  [46] abind_1.4-8              compiler_4.6.0           bit64_4.6.0-1           
##  [49] withr_3.0.2              S7_0.2.1                 BiocParallel_1.45.0     
##  [52] viridis_0.6.5            DBI_1.2.3                HDF5Array_1.39.0        
##  [55] alabaster.ranges_1.11.0  alabaster.schemas_1.11.0 rappdirs_0.3.3          
##  [58] DelayedArray_0.37.0      rjson_0.2.23             tools_4.6.0             
##  [61] vipor_0.4.7              otel_0.2.0               beeswarm_0.4.0          
##  [64] glue_1.8.0               h5mread_1.3.1            restfulr_0.0.16         
##  [67] rhdf5filters_1.23.3      grid_4.6.0               gtable_0.3.6            
##  [70] ensembldb_2.35.0         BiocSingular_1.27.1      ScaledMatrix_1.19.0     
##  [73] XVector_0.51.0           ggrepel_0.9.6            BiocVersion_3.23.1      
##  [76] pillar_1.11.1            dplyr_1.1.4              BiocFileCache_3.1.0     
##  [79] lattice_0.22-7           rtracklayer_1.71.3       bit_4.6.0               
##  [82] tidyselect_1.2.1         Biostrings_2.79.3        gridExtra_2.3           
##  [85] bookdown_0.46            ProtGenerics_1.43.0      xfun_0.55               
##  [88] UCSC.utils_1.7.1         lazyeval_0.2.2           yaml_2.3.12             
##  [91] evaluate_1.0.5           codetools_0.2-20         cigarillo_1.1.0         
##  [94] tibble_3.3.0             alabaster.matrix_1.11.0  BiocManager_1.30.27     
##  [97] cli_3.6.5                jquerylib_0.1.4          dichromat_2.0-0.1       
## [100] Rcpp_1.1.0.8.1           GenomeInfoDb_1.47.2      dbplyr_2.5.1            
## [103] png_0.1-8                XML_3.99-0.20            parallel_4.6.0          
## [106] blob_1.2.4               AnnotationFilter_1.35.0  bitops_1.0-9            
## [109] viridisLite_0.4.2        alabaster.se_1.11.0      scales_1.4.0            
## [112] purrr_1.2.0              crayon_1.5.3             rlang_1.1.6             
## [115] cowplot_1.2.0            KEGGREST_1.51.1