Chapter 14 Dealing with big data

14.1 Motivation

Advances in scRNA-seq technologies have increased the number of cells that can be assayed in routine experiments. Public databases such as GEO are continually expanding with more scRNA-seq studies, while large-scale projects such as the Human Cell Atlas are expected to generate data for billions of cells. For effective data analysis, the computational methods need to scale with the increasing size of scRNA-seq data sets. This section discusses how we can use various aspects of the Bioconductor ecosystem to tune our analysis pipelines for greater speed and efficiency.

14.2 Fast approximations

14.2.1 Nearest neighbor searching

Identification of neighbouring cells in PC or expression space is a common procedure that is used in many functions, e.g., buildSNNGraph(), doubletCells(). The default is to favour accuracy over speed by using an exact nearest neighbour (NN) search, implemented with the \(k\)-means for \(k\)-nearest neighbours algorithm (Wang 2012). However, for large data sets, it may be preferable to use a faster approximate approach. The BiocNeighbors framework makes it easy to switch between search options by simply changing the BNPARAM= argument in compatible functions. To demonstrate, we will use the 10X PBMC data:

#--- loading ---#
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "pbmc4k")
untar(raw.path, exdir=out.path)

library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/GRCh38")
sce.pbmc <- read10xCounts(fname, col.names=TRUE)

#--- gene-annotation ---#
library(scater)
rownames(sce.pbmc) <- uniquifyFeatureNames(
    rowData(sce.pbmc)$ID, rowData(sce.pbmc)$Symbol)

library(EnsDb.Hsapiens.v86)
location <- mapIds(EnsDb.Hsapiens.v86, keys=rowData(sce.pbmc)$ID, 
    column="SEQNAME", keytype="GENEID")

#--- cell-detection ---#
set.seed(100)
e.out <- emptyDrops(counts(sce.pbmc))
sce.pbmc <- sce.pbmc[,which(e.out$FDR <= 0.001)]

#--- quality-control ---#
stats <- perCellQCMetrics(sce.pbmc, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.pbmc <- sce.pbmc[,!high.mito]

#--- normalization ---#
library(scran)
set.seed(1000)
clusters <- quickCluster(sce.pbmc)
sce.pbmc <- computeSumFactors(sce.pbmc, cluster=clusters)
sce.pbmc <- logNormCounts(sce.pbmc)

#--- variance-modelling ---#
set.seed(1001)
dec.pbmc <- modelGeneVarByPoisson(sce.pbmc)
top.pbmc <- getTopHVGs(dec.pbmc, prop=0.1)

#--- dimensionality-reduction ---#
set.seed(10000)
sce.pbmc <- denoisePCA(sce.pbmc, subset.row=top.pbmc, technical=dec.pbmc)

set.seed(100000)
sce.pbmc <- runTSNE(sce.pbmc, dimred="PCA")

set.seed(1000000)
sce.pbmc <- runUMAP(sce.pbmc, dimred="PCA")

#--- clustering ---#
g <- buildSNNGraph(sce.pbmc, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.pbmc) <- factor(clust)
sce.pbmc
## class: SingleCellExperiment 
## dim: 33694 3985 
## metadata(1): Samples
## assays(2): counts logcounts
## rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
## rowData names(2): ID Symbol
## colnames(3985): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ...
##   TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
## colData names(4): Sample Barcode sizeFactor label
## reducedDimNames(3): PCA TSNE UMAP
## mainExpName: NULL
## altExpNames(0):

We had previously clustered on a shared nearest neighbor graph generated with an exact neighbour search (Basic Section 5.2). We repeat this below using an approximate search, implemented using the Annoy algorithm. This involves constructing a AnnoyParam object to specify the search algorithm and then passing it to the buildSNNGraph() function. The results from the exact and approximate searches are consistent with most clusters from the former re-appearing in the latter. This suggests that the inaccuracy from the approximation can be largely ignored.

library(scran)
library(BiocNeighbors)
snn.gr <- buildSNNGraph(sce.pbmc, BNPARAM=AnnoyParam(), use.dimred="PCA")
clusters <- igraph::cluster_walktrap(snn.gr)
table(Exact=colLabels(sce.pbmc), Approx=clusters$membership)
##      Approx
## Exact   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
##    1  205   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##    2    0 727   0   0   0   0   0   0   0   0   0   4   0   0   0
##    3    0   0 599   0   0   0   0   0  11   0   0   0   7   0   0
##    4    0   3   1  51   0   0   0   0   0   0   0   0   0   1   0
##    5    0   0   1   0 540   0   0   0   0   0   0   0   0   0   0
##    6    0   0   2   0   0 350   0   0   0   0   0   0   0   0   0
##    7    0   0   0   0   0   0 125   0   0   0   0   0   0   0   0
##    8    0   0   0   0   0   0   0  46   0   0   0   0   0   0   0
##    9    0   0   1   0   0   0   0   0 818   0   0   0   0   0   0
##    10   0   0   0   0   0   0   0   0   0  47   0   0   0   0   0
##    11   0   0   0   0   0   4   0   0   0   0 149   0   0   0   0
##    12   0   0   0   0   0   0   0   0   0   0   0  61   0   0   0
##    13   0   0   0   0   0   0   0   0   0   0   0   0 129   0   0
##    14   0   0   0   0   0   0   0   0   0   0   0   0   0  87   0
##    15   0   0   0   0   0   0   0   0   0   0   0   0   0   0  16

Note that Annoy writes the NN index to disk prior to performing the search. Thus, it may not actually be faster than the default exact algorithm for small datasets, depending on whether the overhead of disk write is offset by the computational complexity of the search. It is also not difficult to find situations where the approximation deteriorates, especially at high dimensions, though this may not have an appreciable impact on the biological conclusions.

set.seed(1000)
y1 <- matrix(rnorm(50000), nrow=1000)
y2 <- matrix(rnorm(50000), nrow=1000)
Y <- rbind(y1, y2)
exact <- findKNN(Y, k=20)
approx <- findKNN(Y, k=20, BNPARAM=AnnoyParam())
mean(exact$index!=approx$index)
## [1] 0.5619

14.2.2 Singular value decomposition

The singular value decomposition (SVD) underlies the PCA used throughout our analyses, e.g., in denoisePCA(), fastMNN(), doubletCells(). (Briefly, the right singular vectors are the eigenvectors of the gene-gene covariance matrix, where each eigenvector represents the axis of maximum remaining variation in the PCA.) The default base::svd() function performs an exact SVD that is not performant for large datasets. Instead, we use fast approximate methods from the irlba and rsvd packages, conveniently wrapped into the BiocSingular package for ease of use and package development. Specifically, we can change the SVD algorithm used in any of these functions by simply specifying an alternative value for the BSPARAM= argument.

library(scater)
library(BiocSingular)

# As the name suggests, it is random, so we need to set the seed.
set.seed(101000)
r.out <- runPCA(sce.pbmc, ncomponents=20, BSPARAM=RandomParam())
str(reducedDim(r.out))
##  num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3985] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "varExplained")= num [1:20] 85.36 40.43 23.22 8.99 6.66 ...
##  - attr(*, "percentVar")= num [1:20] 19.85 9.4 5.4 2.09 1.55 ...
##  - attr(*, "rotation")= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:500] "LYZ" "S100A9" "S100A8" "HLA-DRA" ...
##   .. ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
set.seed(101001)
i.out <- runPCA(sce.pbmc, ncomponents=20, BSPARAM=IrlbaParam())
str(reducedDim(i.out))
##  num [1:3985, 1:20] 15.05 13.43 -8.67 -7.74 6.45 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3985] "AAACCTGAGAAGGCCT-1" "AAACCTGAGACAGACC-1" "AAACCTGAGGCATGGT-1" "AAACCTGCAAGGTTCT-1" ...
##   ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "varExplained")= num [1:20] 85.36 40.43 23.22 8.99 6.66 ...
##  - attr(*, "percentVar")= num [1:20] 19.85 9.4 5.4 2.09 1.55 ...
##  - attr(*, "rotation")= num [1:500, 1:20] 0.203 0.1834 0.1779 0.1063 0.0647 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:500] "LYZ" "S100A9" "S100A8" "HLA-DRA" ...
##   .. ..$ : chr [1:20] "PC1" "PC2" "PC3" "PC4" ...

Both IRLBA and randomized SVD (RSVD) are much faster than the exact SVD with negligible loss of accuracy. This motivates their default use in many scran and scater functions, at the cost of requiring users to set the seed to guarantee reproducibility. IRLBA can occasionally fail to converge and require more iterations (passed via maxit= in IrlbaParam()), while RSVD involves an explicit trade-off between accuracy and speed based on its oversampling parameter (p=) and number of power iterations (q=). We tend to prefer IRLBA as its default behavior is more accurate, though RSVD is much faster for file-backed matrices (Section 14.4).

14.3 Parallelization

Parallelization of calculations across genes or cells is an obvious strategy for speeding up scRNA-seq analysis workflows. The BiocParallel package provides a common interface for parallel computing throughout the Bioconductor ecosystem, manifesting as a BPPARAM= argument in compatible functions. We can pick from a diverse range of parallelization backends depending on the available hardware and operating system. For example, we might use forking across 2 cores to parallelize the variance calculations on a Unix system:

library(BiocParallel)
dec.pbmc.mc <- modelGeneVar(sce.pbmc, BPPARAM=MulticoreParam(2))
dec.pbmc.mc
## DataFrame with 33694 rows and 6 columns
##                     mean       total        tech          bio   p.value
##                <numeric>   <numeric>   <numeric>    <numeric> <numeric>
## RP11-34P13.3 0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## FAM138A      0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## OR4F5        0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## RP11-34P13.7 0.002166050 0.002227438 0.002227459 -2.15811e-08  0.500027
## RP11-34P13.8 0.000522431 0.000549601 0.000537242  1.23586e-05  0.436496
## ...                  ...         ...         ...          ...       ...
## AC233755.2     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC233755.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC240274.1     0.0102893   0.0121099   0.0105809   0.00152901  0.157639
## AC213203.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## FAM231B        0.0000000   0.0000000   0.0000000   0.00000000       NaN
##                    FDR
##              <numeric>
## RP11-34P13.3       NaN
## FAM138A            NaN
## OR4F5              NaN
## RP11-34P13.7  0.756443
## RP11-34P13.8  0.756443
## ...                ...
## AC233755.2         NaN
## AC233755.1         NaN
## AC240274.1    0.756443
## AC213203.1         NaN
## FAM231B            NaN

Another approach would be to distribute jobs across a network of computers, which yields the same result:

dec.pbmc.snow <- modelGeneVar(sce.pbmc, BPPARAM=SnowParam(5))
dec.pbmc.snow
## DataFrame with 33694 rows and 6 columns
##                     mean       total        tech          bio   p.value
##                <numeric>   <numeric>   <numeric>    <numeric> <numeric>
## RP11-34P13.3 0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## FAM138A      0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## OR4F5        0.000000000 0.000000000 0.000000000  0.00000e+00       NaN
## RP11-34P13.7 0.002166050 0.002227438 0.002227459 -2.15811e-08  0.500027
## RP11-34P13.8 0.000522431 0.000549601 0.000537242  1.23586e-05  0.436496
## ...                  ...         ...         ...          ...       ...
## AC233755.2     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC233755.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## AC240274.1     0.0102893   0.0121099   0.0105809   0.00152901  0.157639
## AC213203.1     0.0000000   0.0000000   0.0000000   0.00000000       NaN
## FAM231B        0.0000000   0.0000000   0.0000000   0.00000000       NaN
##                    FDR
##              <numeric>
## RP11-34P13.3       NaN
## FAM138A            NaN
## OR4F5              NaN
## RP11-34P13.7  0.756443
## RP11-34P13.8  0.756443
## ...                ...
## AC233755.2         NaN
## AC233755.1         NaN
## AC240274.1    0.756443
## AC213203.1         NaN
## FAM231B            NaN

For high-performance computing (HPC) systems with a cluster of compute nodes, we can distribute jobs via the job scheduler using the BatchtoolsParam class. The example below assumes a SLURM cluster, though the settings can be easily configured for a particular system (see here for details).

# 2 hours, 8 GB, 1 CPU per task, for 10 tasks.
bpp <- BatchtoolsParam(10, cluster="slurm",
    resources=list(walltime=7200, memory=8000, ncpus=1))

Parallelization is best suited for CPU-intensive calculations where the division of labor results in a concomitant reduction in compute time. It is not suited for tasks that are bounded by other compute resources, e.g., memory or file I/O (though the latter is less of an issue on HPC systems with parallel read/write). In particular, R itself is inherently single-core, so many of the parallelization backends involve (i) setting up one or more separate R sessions, (ii) loading the relevant packages and (iii) transmitting the data to that session. Depending on the nature and size of the task, this overhead may outweigh any benefit from parallel computing.

14.4 Out of memory representations

The count matrix is the central structure around which our analyses are based. In most of the previous chapters, this has been held fully in memory as a dense matrix or as a sparse dgCMatrix. Howevever, in-memory representations may not be feasible for very large data sets, especially on machines with limited memory. For example, the 1.3 million brain cell data set from 10X Genomics (Zheng et al. 2017) would require over 100 GB of RAM to hold as a matrix and around 30 GB as a dgCMatrix. This makes it challenging to explore the data on anything less than a HPC system.

The obvious solution is to use a file-backed matrix representation where the data are held on disk and subsets are retrieved into memory as requested. While a number of implementations of file-backed matrices are available (e.g., bigmemory, matter), we will be using the implementation from the HDF5Array package. This uses the popular HDF5 format as the underlying data store, which provides a measure of standardization and portability across systems. We demonstrate with a subset of 20,000 cells from the 1.3 million brain cell data set, as provided by the TENxBrainData package.

library(TENxBrainData)
sce.brain <- TENxBrainData20k() 
sce.brain
## class: SingleCellExperiment 
## dim: 27998 20000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): Ensembl Symbol
## colnames: NULL
## colData names(4): Barcode Sequence Library Mouse
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Examination of the SingleCellExperiment object indicates that the count matrix is a HDF5Matrix. From a comparison of the memory usage, it is clear that this matrix object is simply a stub that points to the much larger HDF5 file that actually contains the data. This avoids the need for large RAM availability during analyses.

counts(sce.brain)
## <27998 x 20000> HDF5Matrix object of type "integer":
##              [,1]     [,2]     [,3]     [,4] ... [,19997] [,19998] [,19999]
##     [1,]        0        0        0        0   .        0        0        0
##     [2,]        0        0        0        0   .        0        0        0
##     [3,]        0        0        0        0   .        0        0        0
##     [4,]        0        0        0        0   .        0        0        0
##     [5,]        0        0        0        0   .        0        0        0
##      ...        .        .        .        .   .        .        .        .
## [27994,]        0        0        0        0   .        0        0        0
## [27995,]        0        0        0        1   .        0        2        0
## [27996,]        0        0        0        0   .        0        1        0
## [27997,]        0        0        0        0   .        0        0        0
## [27998,]        0        0        0        0   .        0        0        0
##          [,20000]
##     [1,]        0
##     [2,]        0
##     [3,]        0
##     [4,]        0
##     [5,]        0
##      ...        .
## [27994,]        0
## [27995,]        0
## [27996,]        0
## [27997,]        0
## [27998,]        0
object.size(counts(sce.brain))
## 2496 bytes
file.info(path(counts(sce.brain)))$size
## [1] 76264332

Manipulation of the count matrix will generally result in the creation of a DelayedArray object from the DelayedArray package. This remembers the operations to be applied to the counts and stores them in the object, to be executed when the modified matrix values are realized for use in calculations. The use of delayed operations avoids the need to write the modified values to a new file at every operation, which would unnecessarily require time-consuming disk I/O.

tmp <- counts(sce.brain)
tmp <- log2(tmp + 1)
tmp
## <27998 x 20000> DelayedMatrix object of type "double":
##              [,1]     [,2]     [,3] ... [,19999] [,20000]
##     [1,]        0        0        0   .        0        0
##     [2,]        0        0        0   .        0        0
##     [3,]        0        0        0   .        0        0
##     [4,]        0        0        0   .        0        0
##     [5,]        0        0        0   .        0        0
##      ...        .        .        .   .        .        .
## [27994,]        0        0        0   .        0        0
## [27995,]        0        0        0   .        0        0
## [27996,]        0        0        0   .        0        0
## [27997,]        0        0        0   .        0        0
## [27998,]        0        0        0   .        0        0

Many functions described in the previous workflows are capable of accepting HDF5Matrix objects. This is powered by the availability of common methods for all matrix representations (e.g., subsetting, combining, methods from DelayedMatrixStats) as well as representation-agnostic C++ code using beachmat (A. T. L. Lun, Pages, and Smith 2018). For example, we compute QC metrics below with the same calculateQCMetrics() function that we used in the other workflows.

library(scater)
is.mito <- grepl("^mt-", rowData(sce.brain)$Symbol)
qcstats <- perCellQCMetrics(sce.brain, subsets=list(Mt=is.mito))
qcstats
## DataFrame with 20000 rows and 6 columns
##             sum  detected subsets_Mt_sum subsets_Mt_detected subsets_Mt_percent
##       <numeric> <numeric>      <numeric>           <numeric>          <numeric>
## 1          3060      1546            123                  10            4.01961
## 2          3500      1694            118                  11            3.37143
## 3          3092      1613             58                   9            1.87581
## 4          4420      2050            131                  10            2.96380
## 5          3771      1813            100                   8            2.65182
## ...         ...       ...            ...                 ...                ...
## 19996      4431      2050            127                   9           2.866170
## 19997      6988      2704             60                   9           0.858615
## 19998      8749      2988            305                  11           3.486113
## 19999      3842      1711            129                   8           3.357626
## 20000      1775       945             26                   6           1.464789
##           total
##       <numeric>
## 1          3060
## 2          3500
## 3          3092
## 4          4420
## 5          3771
## ...         ...
## 19996      4431
## 19997      6988
## 19998      8749
## 19999      3842
## 20000      1775

Needless to say, data access from file-backed representations is slower than that from in-memory representations. The time spent retrieving data from disk is an unavoidable cost of reducing memory usage. Whether this is tolerable depends on the application. One example usage pattern involves performing the heavy computing quickly with in-memory representations on HPC systems with plentiful memory, and then distributing file-backed counterparts to individual users for exploration and visualization on their personal machines.

Session Info

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] TENxBrainData_1.24.0        HDF5Array_1.32.0           
 [3] rhdf5_2.48.0                DelayedArray_0.30.1        
 [5] SparseArray_1.4.8           S4Arrays_1.4.1             
 [7] abind_1.4-5                 Matrix_1.7-0               
 [9] BiocParallel_1.38.0         BiocSingular_1.20.0        
[11] scater_1.32.0               ggplot2_3.5.1              
[13] bluster_1.14.0              BiocNeighbors_1.22.0       
[15] scran_1.32.0                scuttle_1.14.0             
[17] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
[19] Biobase_2.64.0              GenomicRanges_1.56.0       
[21] GenomeInfoDb_1.40.1         IRanges_2.38.0             
[23] S4Vectors_0.42.0            BiocGenerics_0.50.0        
[25] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[27] BiocStyle_2.32.0            rebook_1.14.0              

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 gridExtra_2.3            
 [3] CodeDepends_0.6.6         rlang_1.1.4              
 [5] magrittr_2.0.3            RSQLite_2.3.7            
 [7] compiler_4.4.0            dir.expiry_1.12.0        
 [9] DelayedMatrixStats_1.26.0 png_0.1-8                
[11] vctrs_0.6.5               pkgconfig_2.0.3          
[13] crayon_1.5.2              fastmap_1.2.0            
[15] dbplyr_2.5.0              XVector_0.44.0           
[17] utf8_1.2.4                rmarkdown_2.27           
[19] graph_1.82.0              UCSC.utils_1.0.0         
[21] ggbeeswarm_0.7.2          purrr_1.0.2              
[23] bit_4.0.5                 xfun_0.44                
[25] zlibbioc_1.50.0           cachem_1.1.0             
[27] beachmat_2.20.0           jsonlite_1.8.8           
[29] blob_1.2.4                rhdf5filters_1.16.0      
[31] Rhdf5lib_1.26.0           irlba_2.3.5.1            
[33] parallel_4.4.0            cluster_2.1.6            
[35] R6_2.5.1                  bslib_0.7.0              
[37] limma_3.60.2              jquerylib_0.1.4          
[39] Rcpp_1.0.12               bookdown_0.39            
[41] knitr_1.47                snow_0.4-4               
[43] igraph_2.0.3              tidyselect_1.2.1         
[45] yaml_2.3.8                viridis_0.6.5            
[47] codetools_0.2-20          curl_5.2.1               
[49] lattice_0.22-6            tibble_3.2.1             
[51] KEGGREST_1.44.0           withr_3.0.0              
[53] evaluate_0.23             BiocFileCache_2.12.0     
[55] ExperimentHub_2.12.0      Biostrings_2.72.1        
[57] pillar_1.9.0              BiocManager_1.30.23      
[59] filelock_1.0.3            generics_0.1.3           
[61] BiocVersion_3.19.1        sparseMatrixStats_1.16.0 
[63] munsell_0.5.1             scales_1.3.0             
[65] glue_1.7.0                metapod_1.12.0           
[67] tools_4.4.0               AnnotationHub_3.12.0     
[69] ScaledMatrix_1.12.0       locfit_1.5-9.9           
[71] XML_3.99-0.16.1           grid_4.4.0               
[73] AnnotationDbi_1.66.0      edgeR_4.2.0              
[75] colorspace_2.1-0          GenomeInfoDbData_1.2.12  
[77] beeswarm_0.4.0            vipor_0.4.7              
[79] cli_3.6.2                 rsvd_1.0.5               
[81] rappdirs_0.3.3            fansi_1.0.6              
[83] viridisLite_0.4.2         dplyr_1.1.4              
[85] gtable_0.3.5              sass_0.4.9               
[87] digest_0.6.35             ggrepel_0.9.5            
[89] dqrng_0.4.1               memoise_2.0.1            
[91] htmltools_0.5.8.1         lifecycle_1.0.4          
[93] httr_1.4.7                mime_0.12                
[95] statmod_1.5.0             bit64_4.0.5              

Abid, A., M. J. Zhang, V. K. Bagaria, and J. Zou. 2018. “Exploring patterns enriched in a dataset with contrastive principal component analysis.” Nat Commun 9 (1): 2134.

Bach, K., S. Pensa, M. Grzelak, J. Hadfield, D. J. Adams, J. C. Marioni, and W. T. Khaled. 2017. “Differentiation dynamics of mammary epithelial cells revealed by single-cell RNA sequencing.” Nat Commun 8 (1): 2128.

Bais, A. S., and D. Kostka. 2020. “Scds: Computational Annotation of Doublets in Single-Cell RNA Sequencing Data.” Bioinformatics 36 (4): 1150–8.

Bakken, T. E., R. D. Hodge, J. A. Miller, Z. Yao, T. N. Nguyen, B. Aevermann, E. Barkan, et al. 2018. “Single-nucleus and single-cell transcriptomes compared in matched cortical cell types.” PLoS ONE 13 (12): e0209648.

Bergen, Volker, Marius Lange, Stefan Peidli, F. Alexander Wolf, and Fabian J. Theis. 2019. “Generalizing Rna Velocity to Transient Cell States Through Dynamical Modeling.” bioRxiv. https://doi.org/10.1101/820936.

Bertoli, C., J. M. Skotheim, and R. A. de Bruin. 2013. “Control of cell cycle transcription during G1 and S phases.” Nat. Rev. Mol. Cell Biol. 14 (8): 518–28.

Boileau, P., N. S. Hejazi, and S. Dudoit. 2020. “Exploring high-dimensional biological data with sparse contrastive principal component analysis.” Bioinformatics 36 (11): 3422–30.

Brennecke, P., S. Anders, J. K. Kim, A. A. Kołodziejczyk, X. Zhang, V. Proserpio, B. Baying, et al. 2013. “Accounting for technical noise in single-cell RNA-seq experiments.” Nat. Methods 10 (11): 1093–5.

Buettner, F., K. N. Natarajan, F. P. Casale, V. Proserpio, A. Scialdone, F. J. Theis, S. A. Teichmann, J. C. Marioni, and O. Stegle. 2015. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells.” Nat. Biotechnol. 33 (2): 155–60.

Conboy, C. M., C. Spyrou, N. P. Thorne, E. J. Wade, N. L. Barbosa-Morais, M. D. Wilson, A. Bhattacharjee, et al. 2007. “Cell cycle genes are the evolutionarily conserved targets of the E2F4 transcription factor.” PLoS ONE 2 (10): e1061.

Dahlin, J. S., F. K. Hamey, B. Pijuan-Sala, M. Shepherd, W. W. Y. Lau, S. Nestorowa, C. Weinreb, et al. 2018. “A single-cell hematopoietic landscape resolves 8 lineage trajectories and defects in Kit mutant mice.” Blood 131 (21): e1–e11.

Gavish, M., and D. L. Donoho. 2014. “The Optimal Hard Threshold for Singular Values Is \(4/\sqrt {3}\).” IEEE Transactions on Information Theory 60 (8): 5040–53.

Germain, Pierre-Luc, Aaron Lun, Will Macnair, and Mark D. Robinson. 2021. “Doublet Identification in Single-Cell Sequencing Data Using scDblFinder.” F1000Research, no. 10:979. https://doi.org/10.12688/f1000research.73600.1.

Germain, P. L., A. Sonrel, and M. D. Robinson. 2020. “pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools.” Genome Biol. 21 (1): 227.

Godec, J., Y. Tan, A. Liberzon, P. Tamayo, S. Bhattacharya, A. J. Butte, J. P. Mesirov, and W. N. Haining. 2016. “Compendium of Immune Signatures Identifies Conserved and Species-Specific Biology in Response to Inflammation.” Immunity 44 (1): 194–206.

Griffiths, J. A., A. C. Richard, K. Bach, A. T. L. Lun, and J. C. Marioni. 2018. “Detection and removal of barcode swapping in single-cell RNA-seq data.” Nat Commun 9 (1): 2667.

Grun, D., M. J. Muraro, J. C. Boisset, K. Wiebrands, A. Lyubimova, G. Dharmadhikari, M. van den Born, et al. 2016. “De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data.” Cell Stem Cell 19 (2): 266–77.

Gulati, G. S., S. S. Sikandar, D. J. Wesche, A. Manjunath, A. Bharadwaj, M. J. Berger, F. Ilagan, et al. 2020. “Single-cell transcriptional diversity is a hallmark of developmental potential.” Science 367 (6476): 405–11.

Guo, M., E. L. Bao, M. Wagner, J. A. Whitsett, and Y. Xu. 2017. “SLICE: determining cell differentiation and lineage based on single cell entropy.” Nucleic Acids Res. 45 (7): e54.

Hastie, T., and W. Stuetzle. 1989. “Principal Curves.” J Am Stat Assoc 84 (406): 502–16.

Hermann, B. P., K. Cheng, A. Singh, L. Roa-De La Cruz, K. N. Mutoji, I. C. Chen, H. Gildersleeve, et al. 2018. “The Mammalian Spermatogenesis Single-Cell Transcriptome, from Spermatogonial Stem Cells to Spermatids.” Cell Rep 25 (6): 1650–67.

Kang, H. M., M. Subramaniam, S. Targ, M. Nguyen, L. Maliskova, E. McCarthy, E. Wan, et al. 2018. “Multiplexed droplet single-cell RNA-sequencing using natural genetic variation.” Nat. Biotechnol. 36 (1): 89–94.

Klein, A. M., L. Mazutis, I. Akartuna, N. Tallapragada, A. Veres, V. Li, L. Peshkin, D. A. Weitz, and M. W. Kirschner. 2015. “Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells.” Cell 161 (5): 1187–1201.

Kotliarov, Y., R. Sparks, A. J. Martins, M. P. Mul?, Y. Lu, M. Goswami, L. Kardava, et al. 2020. “Broad immune activation underlies shared set point signatures for vaccine responsiveness in healthy individuals and disease activity in patients with lupus.” Nat Med 26 (4): 618–29.

Kozar, K., M. A. Ciemerych, V. I. Rebel, H. Shigematsu, A. Zagozdzon, E. Sicinska, Y. Geng, et al. 2004. “Mouse development and cell proliferation in the absence of D-cyclins.” Cell 118 (4): 477–91.

La Manno, G., R. Soldatov, A. Zeisel, E. Braun, H. Hochgerner, V. Petukhov, K. Lidschreiber, et al. 2018. “RNA velocity of single cells.” Nature 560 (7719): 494–98.

Law, C. W., Y. Chen, W. Shi, and G. K. Smyth. 2014. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts.” Genome Biol. 15 (2): R29.

Lawlor, N., J. George, M. Bolisetty, R. Kursawe, L. Sun, V. Sivakamasundari, I. Kycia, P. Robson, and M. L. Stitzel. 2017. “Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes.” Genome Res. 27 (2): 208–22.

Leng, N., L. F. Chu, C. Barry, Y. Li, J. Choi, X. Li, P. Jiang, R. M. Stewart, J. A. Thomson, and C. Kendziorski. 2015. “Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments.” Nat. Methods 12 (10): 947–50.

Linderman, George C., Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, and Yuval Kluger. 2019. “Fast Interpolation-Based T-SNE for Improved Visualization of Single-Cell RNA-Seq Data.” Nature Methods 16 (3): 243–45. https://doi.org/10.1038/s41592-018-0308-4.

Lun, A. 2018. “Overcoming Systematic Errors Caused by Log-Transformation of Normalized Single-Cell Rna Sequencing Data.” bioRxiv.

Lun, A., S. Riesenfeld, T. Andrews, T. P. Dao, T. Gomes, participants in the 1st Human Cell Atlas Jamboree, and J. Marioni. 2019. “EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data.” Genome Biol. 20 (1): 63.

Lun, A. T. L., F. J. Calero-Nieto, L. Haim-Vilmovsky, B. Gottgens, and J. C. Marioni. 2017. “Assessing the reliability of spike-in normalization for analyses of single-cell RNA sequencing data.” Genome Res. 27 (11): 1795–1806.

Lun, A. T. L., and J. C. Marioni. 2017. “Overcoming confounding plate effects in differential expression analyses of single-cell RNA-seq data.” Biostatistics 18 (3): 451–64.

Lun, A. T. L., H. Pages, and M. L. Smith. 2018. “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol. 14 (5): e1006135.

Macosko, E. Z., A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5): 1202–14.

McCarthy, D. J., and G. K. Smyth. 2009. “Testing significance relative to a fold-change threshold is a TREAT.” Bioinformatics 25 (6): 765–71.

McGinnis, C. S., D. M. Patterson, J. Winkler, D. N. Conrad, M. Y. Hein, V. Srivastava, J. L. Hu, et al. 2019. “MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices.” Nat. Methods 16 (7): 619–26.

McInnes, Leland, John Healy, and James Melville. 2018. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv E-Prints, February, arXiv:1802.03426. http://arxiv.org/abs/1802.03426.

Messmer, T., F. von Meyenn, A. Savino, F. Santos, H. Mohammed, A. T. L. Lun, J. C. Marioni, and W. Reik. 2019. “Transcriptional heterogeneity in naive and primed human pluripotent stem cells at single-cell resolution.” Cell Rep 26 (4): 815–24.

Morgan, D. O. 2007. The Cell Cycle: Principles of Control. New Science Press.

Narayan, Ashwin. 2021. “Assessing Single-Cell Transcriptomic Variability Through Density-Preserving Data Visualization.” Nature Biotechnology, 19. https://doi.org/10.1038/s41587-020-00801-7.

Nestorowa, S., F. K. Hamey, B. Pijuan Sala, E. Diamanti, M. Shepherd, E. Laurenti, N. K. Wilson, D. G. Kent, and B. Gottgens. 2016. “A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation.” Blood 128 (8): 20–31.

Phipson, B., and G. K. Smyth. 2010. “Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.” Stat. Appl. Genet. Mol. Biol. 9: Article 39.

Pijuan-Sala, B., J. A. Griffiths, C. Guibentif, T. W. Hiscock, W. Jawaid, F. J. Calero-Nieto, C. Mulas, et al. 2019. “A Single-Cell Molecular Map of Mouse Gastrulation and Early Organogenesis.” Nature 566 (7745): 490–95.

Richard, A. C., A. T. L. Lun, W. W. Y. Lau, B. Gottgens, J. C. Marioni, and G. M. Griffiths. 2018. “T cell cytolytic capacity is independent of initial stimulation strength.” Nat. Immunol. 19 (8): 849–58.

Roccio, M., D. Schmitter, M. Knobloch, Y. Okawa, D. Sage, and M. P. Lutolf. 2013. “Predicting stem cell fate changes by differential cell cycle progression patterns.” Development 140 (2): 459–70.

Saelens, W., R. Cannoodt, H. Todorov, and Y. Saeys. 2019. “A comparison of single-cell trajectory inference methods.” Nat. Biotechnol. 37 (5): 547–54.

Scialdone, A., K. N. Natarajan, L. R. Saraiva, V. Proserpio, S. A. Teichmann, O. Stegle, J. C. Marioni, and F. Buettner. 2015. “Computational assignment of cell-cycle stage from single-cell transcriptome data.” Methods 85 (September): 54–61.

Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4): 593–607.

Shekhar, K., S. W. Lapan, I. E. Whitney, N. M. Tran, E. Z. Macosko, M. Kowalczyk, X. Adiconis, et al. 2016. “Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptomics.” Cell 166 (5): 1308–23.

Sinha, Rahul, Geoff Stanley, Gunsagar S. Gulati, Camille Ezran, Kyle J. Travaglini, Eric Wei, Charles K. F. Chan, et al. 2017. “Index Switching Causes “Spreading-of-Signal” Among Multiplexed Samples in Illumina Hiseq 4000 Dna Sequencing.” bioRxiv. https://doi.org/10.1101/125724.

Soneson, C., A. Srivastava, R. Patro, and M. B. Stadler. 2020. “Preprocessing Choices Affect Rna Velocity Results for Droplet scRNA-Seq Data.” bioRxiv. https://doi.org/10.1101/2020.03.13.990069.

Soufi, A., and S. Dalton. 2016. “Cycling through developmental decisions: how cell cycle dynamics control pluripotency, differentiation and reprogramming.” Development 143 (23): 4301–11.

Steinman, R. A. 2002. “Cell cycle regulators and hematopoiesis.” Oncogene 21 (21): 3403–13.

Stoeckius, M., C. Hafemeister, W. Stephenson, B. Houck-Loomis, P. K. Chattopadhyay, H. Swerdlow, R. Satija, and P. Smibert. 2017. “Simultaneous epitope and transcriptome measurement in single cells.” Nat. Methods 14 (9): 865–68.

Stoeckius, M., S. Zheng, B. Houck-Loomis, S. Hao, B. Z. Yeung, W. M. Mauck, P. Smibert, and R. Satija. 2018. “Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics.” Genome Biol. 19 (1): 224.

Street, K., D. Risso, R. B. Fletcher, D. Das, J. Ngai, N. Yosef, E. Purdom, and S. Dudoit. 2018. “Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics.” BMC Genomics 19 (1): 477.

Teschendorff, A. E., and T. Enver. 2017. “Single-cell entropy for accurate estimation of differentiation potency from a cell’s transcriptome.” Nat Commun 8 (June): 15599.

Tian, L., X. Dong, S. Freytag, K. A. Le Cao, S. Su, A. JalalAbadi, D. Amann-Zalcenstein, et al. 2019. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nat. Methods 16 (6): 479–87.

Von Luxburg, U. 2010. “Clustering Stability: An Overview.” Foundations and Trends in Machine Learning 2 (3): 235–74.

Wang, X. 2012. “A Fast Exact K-Nearest Neighbors Algorithm for High Dimensional Search Using K-Means Clustering and Triangle Inequality.” Proc Int Jt Conf Neural Netw 43 (6): 2351–8.

Wolf, F. Alexander, Fiona Hamey, Mireya Plass, Jordi Solana, Joakim S. Dahlin, Berthold Gottgens, Nikolaus Rajewsky, Lukas Simon, and Fabian J. Theis. 2017. “Graph Abstraction Reconciles Clustering with Trajectory Inference Through a Topology Preserving Map of Single Cells.” bioRxiv. https://doi.org/10.1101/208819.

Wu, H., Y. Kirita, E. L. Donnelly, and B. D. Humphreys. 2019. “Advantages of Single-Nucleus over Single-Cell RNA Sequencing of Adult Kidney: Rare Cell Types and Novel Cell States Revealed in Fibrosis.” J. Am. Soc. Nephrol. 30 (1): 23–32.

Young, M. D., and S. Behjati. 2018. “SoupX Removes Ambient RNA Contamination from Droplet Based Single Cell RNA Sequencing Data.” bioRxiv.

Zeisel, A., A. B. Munoz-Manchado, S. Codeluppi, P. Lonnerberg, G. La Manno, A. Jureus, S. Marques, et al. 2015. “Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq.” Science 347 (6226): 1138–42.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.

References

Lun, A. T. L., H. Pages, and M. L. Smith. 2018. “beachmat: A Bioconductor C++ API for accessing high-throughput biological data from a variety of R matrix types.” PLoS Comput. Biol. 14 (5): e1006135.

Wang, X. 2012. “A Fast Exact K-Nearest Neighbors Algorithm for High Dimensional Search Using K-Means Clustering and Triangle Inequality.” Proc Int Jt Conf Neural Netw 43 (6): 2351–8.

Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.