1 Introduction

The aim of this document is to measure the performance of the HDF5Array package for normalization and PCA (Principal Component Analysis) of on-disk single cell data, two computationally intensive operations at the core of single cell analysis.

The benchmarks presented in the document were specifically designed to observe the impact of two critical parameters on performance:

  1. data representation (i.e. sparse vs dense);
  2. size of the blocks used for block-processed operations.

Hopefully these benchmarks will also make it easier to compare the performance of HDF5Array-based single cell analysis workflows with that of workflows based on other tools like Seurat or Scanpy.

2 Install and load the required packages

Let’s install and load HDF5Array as well as the other packages used in this vignette:

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)

Load the packages:

library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)

3 The test datasets

3.1 Sparse vs dense representation

The datasets that we will use for our benchmarks are subsets of the 1.3 Million Brain Cell Dataset from 10x Genomics.

The 1.3 Million Brain Cell Dataset is a 27,998 x 1,306,127 matrix of counts with one gene per row and one cell per column. It’s available via the ExperimentHub package in two forms, one that uses a sparse representation and one that uses a dense representation:

hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. 'HDF5-based 10X Genomics' format originally provided by TENx Genomics"
hub["EH1040"]$description  # dense representation
## [1] "Single-cell RNA-seq data for 1.3 million brain cells from E18 mice. Full rectangular, block-compressed format, 1GB block size."

The two datasets are big HDF5 files stored at a remote location. Let’s download them to the local ExperimentHub cache if they are not there yet:

## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while!
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5  <- hub[["EH1040"]]

3.2 TENxMatrix vs HDF5Matrix objects

We use the TENxMatrix() and HDF5Array() constructors to bring the sparse and dense datasets into R as DelayedArray derivatives. Note that this does not load the matrix data into memory.

Bring the sparse dataset into R:

## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")

Note that full_sparse is a 27,998 x 1,306,127 TENxMatrix object:

class(full_sparse)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_sparse)
## [1]   27998 1306127

See ?TENxMatrix in the HDF5Array package for more information about TENxMatrix objects.

Bring the dense dataset into R:

## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")

Note that full_dense is a 27,998 x 1,306,127 HDF5Matrix object that contains the same data as full_sparse:

class(full_dense)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
dim(full_dense)
## [1]   27998 1306127

See ?HDF5Matrix in the HDF5Array package for more information about HDF5Matrix objects.

Finally, note that the dense HDF5 file does not contain the dimnames of the matrix, so we manually add them to full_dense:

dimnames(full_dense) <- dimnames(full_sparse)

3.3 Create the test datasets

For our benchmarks below, we create subsets of the 1.3 Million Brain Cell Dataset of increasing sizes: subsets with 12,500 cells, 25,000 cells, 50,000 cells, 100,000 cells, and 200,000 cells. Note that subsetting a TENxMatrix or HDF5Matrix object with [ is a delayed operation, so it has virtually no cost:

sparse1 <- full_sparse[ , 1:12500]
dense1  <- full_dense[ , 1:12500]

sparse2 <- full_sparse[ , 1:25000]
dense2  <- full_dense[ , 1:25000]

sparse3 <- full_sparse[ , 1:50000]
dense3  <- full_dense[ , 1:50000]

sparse4 <- full_sparse[ , 1:100000]
dense4  <- full_dense[ , 1:100000]

sparse5 <- full_sparse[ , 1:200000]
dense5  <- full_dense[ , 1:200000]
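
Note that sparse1, dense1, etc… are DelayedMatrix objects that carry delayed operations only; the subsetting itself does not read any data from the HDF5 files. The delayed operations can be displayed with showtree() (output not shown):

showtree(sparse1)
showtree(dense1)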

4 Block-processed normalization and PCA

4.1 Code used for normalization and PCA

We’ll use the following code for normalization:

## Also selects the most variable genes (1000 by default).
simple_normalize <- function(mat, num_var_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    mat <- mat[rowSums(mat) > 0, ]           # drop genes with zero counts
    mat <- t(t(mat) * 10000 / colSums(mat))  # scale each cell to 10,000 counts
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_var_genes)
    mat <- log1p(mat[variable_idx, ])        # keep most variable genes, log-transform
    mat / rowSds(mat)                        # scale each gene to unit standard deviation
}

and the following code for PCA:

simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    ## 'Ax' and 'Atx' multiply by the row-centered matrix (and its transpose)
    ## on the fly, so the centered matrix never needs to be materialized.
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}

4.2 Block processing and block size

Note that the implementations of simple_normalize() and simple_PCA() are expected to work on any matrix-like object, regardless of its exact type/representation: an ordinary matrix, a SparseMatrix object from the SparseArray package, a dgCMatrix object from the Matrix package, a DelayedMatrix object or derivative (TENxMatrix, HDF5Matrix, TileDBMatrix), etc.
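
For example, simple_normalize() can be applied, purely in memory, to a small ordinary matrix of made-up counts (a toy example, not used anywhere else in this document):

set.seed(123)
toy_counts <- matrix(rpois(3e5, lambda=0.3), nrow=3000, ncol=100,
                     dimnames=list(paste0("gene", 1:3000), paste0("cell", 1:100)))
toy_normalized <- simple_normalize(toy_counts, num_var_genes=50)
dim(toy_normalized)  # 50 x 100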

However, when the input is a DelayedMatrix object or derivative, it’s important to be aware that:

  • Summarization methods like sum(), colSums(), rowVars(), or rowSds(), and matrix multiplication (%*%), are block-processed operations.

  • The block size is 100 Mb by default. Increasing or decreasing the block size will typically increase or decrease the memory usage of block-processed operations. It will also impact performance, but sometimes in unexpected or counter-intuitive ways.

  • The block size can be queried with DelayedArray::getAutoBlockSize() and set with DelayedArray::setAutoBlockSize() (see the short example right after this list).
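
For example, to check the current block size, temporarily switch to 250 Mb blocks, and then restore the default:

DelayedArray::getAutoBlockSize()       # current block size in bytes (1e8 by default)
DelayedArray::setAutoBlockSize(2.5e8)  # use 250 Mb blocks
DelayedArray::setAutoBlockSize()       # restore the 100 Mb default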

For our benchmarks below, we’ll use the following block sizes:

  • normalization of the sparse datasets: 250 Mb
  • normalization of the dense datasets: 40 Mb
  • on-disk realization of the normalized sparse datasets: 100 Mb
  • on-disk realization of the normalized dense datasets: 250 Mb
  • PCA on the normalized sparse datasets: 40 Mb
  • PCA on the normalized dense datasets: 100 Mb (the default)

4.3 Monitoring memory usage

While manually running our benchmarks below on a Linux or macOS system, we will also monitor memory usage in a terminal with:

(while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where <PID> is the process id of our R session. This will allow us to measure the maximum amount of memory used by the calls to simple_normalize() or simple_PCA().
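
The process id of the current R session can be obtained from within R with:

Sys.getpid()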

5 Benchmarks

5.1 Normalization

In this section we run simple_normalize() on the two smaller test datasets only (27,998 x 12,500 and 27,998 x 25,000), in both their sparse and dense forms, and we report timings and memory usage.

See the Comprehensive timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on all our test datasets, using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

5.1.1 Normalizing the sparse datasets

Set block size to 250 Mb:

DelayedArray::setAutoBlockSize(2.5e8)
## automatic block size set to 2.5e+08 bytes (was 1e+08)

5.1.1.1 27,998 x 12,500 sparse dataset

dim(sparse1)
## [1] 27998 12500
system.time(sparse1n <- simple_normalize(sparse1))
##    user  system elapsed 
##  98.571   8.092 106.637
gc()
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  9919641 529.8   17466960 932.9  12903527 689.2
## Vcells 25723651 196.3   90452252 690.1 119073743 908.5
dim(sparse1n)
## [1]  1000 12500

5.1.1.2 27,998 x 25,000 sparse dataset

dim(sparse2)
## [1] 27998 25000
system.time(sparse2n <- simple_normalize(sparse2))
##    user  system elapsed 
## 170.148  10.935 182.908
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  9919769 529.8   17466960  932.9  12903527  689.2
## Vcells 25837321 197.2  147549432 1125.8 184436537 1407.2
dim(sparse2n)
## [1]  1000 25000

5.1.1.3 About memory usage

With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained <= 2.6 Gb at all times.

5.1.2 Normalizing the dense datasets

Set block size to 40 Mb:

DelayedArray::setAutoBlockSize(4e7)
## automatic block size set to 4e+07 bytes (was 2.5e+08)

5.1.2.1 27,998 x 12,500 dense dataset

dim(dense1)
## [1] 27998 12500
system.time(dense1n <- simple_normalize(dense1))
##    user  system elapsed 
##  92.132   5.936 100.818
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9920168 529.8   17466960 932.9  12903527  689.2
## Vcells 25911659 197.7   76827731 586.2 184436537 1407.2
dim(dense1n)
## [1]  1000 12500

5.1.2.2 27,998 x 25,000 dense dataset

dim(dense2)
## [1] 27998 25000
system.time(dense2n <- simple_normalize(dense2))
##    user  system elapsed 
## 198.184  14.445 215.174
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9920333 529.9   17466960 932.9  12903527  689.2
## Vcells 26000328 198.4   83905016 640.2 184436537 1407.2
dim(dense2n)
## [1]  1000 25000

5.1.2.3 About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained <= 1.8 Gb at all times.

5.2 On-disk realization of the normalized datasets

Note that the normalized datasets obtained in the previous section are DelayedMatrix objects that carry delayed operations. These can be displayed with showtree():

class(sparse1n)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "DelayedArray"
showtree(sparse1n)
## 1000x12500 double, sparse: DelayedMatrix object
## └─ 1000x12500 double, sparse: Unary iso op with args
##    └─ 1000x12500 double, sparse: Stack of 1 unary iso op(s)
##       └─ 1000x12500 double, sparse: Aperm (perm=c(2,1))
##          └─ 12500x1000 double, sparse: Unary iso op with args
##             └─ 12500x1000 double, sparse: Stack of 1 unary iso op(s)
##                └─ 12500x1000 integer, sparse: Aperm (perm=c(2,1))
##                   └─ 1000x12500 integer, sparse: Subset
##                      └─ 27998x1306127 integer, sparse: [seed] TENxMatrixSeed object
class(dense1n)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "DelayedArray"
showtree(dense1n)
## 1000x12500 double: DelayedMatrix object
## └─ 1000x12500 double: Set dimnames
##    └─ 1000x12500 double: Unary iso op with args
##       └─ 1000x12500 double: Stack of 1 unary iso op(s)
##          └─ 1000x12500 double: Aperm (perm=c(2,1))
##             └─ 12500x1000 double: Unary iso op with args
##                └─ 12500x1000 double: Stack of 1 unary iso op(s)
##                   └─ 12500x1000 integer: Aperm (perm=c(2,1))
##                      └─ 1000x12500 integer: Subset
##                         └─ 27998x1306127 integer: [seed] HDF5ArraySeed object

Before we proceed with PCA, we’re going to write the normalized datasets to new HDF5 files. This has a significant cost, but it has the benefit of triggering on-disk realization of the objects: all the delayed operations carried by an object get realized on-the-fly before the matrix data lands on disk, making the new object (TENxMatrix or HDF5Matrix) more efficient for PCA or whatever block-processed operations come next.

5.2.1 On-disk realization of the normalized sparse datasets

Set block size to 100 Mb:

DelayedArray::setAutoBlockSize(1e8)
## automatic block size set to 1e+08 bytes (was 4e+07)

5.2.1.1 1000 x 12,500 normalized sparse dataset

dim(sparse1n)
## [1]  1000 12500
sparse1n_path <- tempfile()
system.time(
  sparse1n <- writeTENxMatrix(sparse1n, sparse1n_path, group="matrix", level=0)
)
##    user  system elapsed 
##  13.093   0.808  17.212
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9933503 530.6   17466960 932.9  12903527  689.2
## Vcells 25977046 198.2  110853456 845.8 184436537 1407.2

The new sparse1n object is a pristine TENxMatrix object:

class(sparse1n)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
showtree(sparse1n)  # "pristine" object (i.e. no more delayed operations)
## 1000x12500 double, sparse: TENxMatrix object
## └─ 1000x12500 double, sparse: [seed] TENxMatrixSeed object

5.2.1.2 1000 x 25,000 normalized sparse dataset

dim(sparse2n)
## [1]  1000 25000
sparse2n_path <- tempfile()
system.time(
  sparse2n <- writeTENxMatrix(sparse2n, sparse2n_path, group="matrix", level=0)
)
##    user  system elapsed 
##  20.795   0.923  21.744
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9933496 530.6   17466960 932.9  12903527  689.2
## Vcells 25927085 197.9   98026769 747.9 184436537 1407.2

The new sparse2n object is a pristine TENxMatrix object:

class(sparse2n)
## [1] "TENxMatrix"
## attr(,"package")
## [1] "HDF5Array"
showtree(sparse2n)  # "pristine" object (i.e. no more delayed operations)
## 1000x25000 double, sparse: TENxMatrix object
## └─ 1000x25000 double, sparse: [seed] TENxMatrixSeed object

5.2.1.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained <= 2.5 Gb at all times.

5.2.2 On-disk realization of the normalized dense datasets

Set block size to 250 Mb:

DelayedArray::setAutoBlockSize(2.5e8)
## automatic block size set to 2.5e+08 bytes (was 1e+08)

5.2.2.1 1000 x 12,500 normalized dense dataset

dim(dense1n)
## [1]  1000 12500
dense1n_path <- tempfile()
system.time(
  dense1n <- writeHDF5Array(dense1n, dense1n_path, name="dense1n", level=0)
)
##    user  system elapsed 
##   4.240   0.365   4.611
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9936037 530.7   17466960 932.9  12903527  689.2
## Vcells 25860474 197.3   78421416 598.4 184436537 1407.2

The new dense1n object is a pristine HDF5Matrix object:

class(dense1n)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
showtree(dense1n)  # "pristine" object (i.e. no more delayed operations)
## 1000x12500 double: HDF5Matrix object
## └─ 1000x12500 double: [seed] HDF5ArraySeed object

5.2.2.2 1000 x 25,000 normalized dense dataset

dim(dense2n)
## [1]  1000 25000
dense2n_path <- tempfile()
system.time(
  dense2n <- writeHDF5Array(dense2n, dense2n_path, name="dense2n", level=0)
)
##    user  system elapsed 
##  11.258   1.284  12.544
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9936007 530.7   17466960 932.9  12903527  689.2
## Vcells 25772004 196.7  100701990 768.3 184436537 1407.2

The new dense2n object is a pristine HDF5Matrix object:

class(dense2n)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
showtree(dense2n)  # "pristine" object (i.e. no more delayed operations)
## 1000x25000 double: HDF5Matrix object
## └─ 1000x25000 double: [seed] HDF5ArraySeed object

5.2.2.3 About memory usage

With this block size (250 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained <= 1.6 Gb at all times.

5.3 PCA

In this section we run simple_PCA() on the normalized datasets obtained in the previous section (1000 x 12,500 and 1000 x 25,000), in both their sparse and dense forms, and we report timings and memory usage.

See the Comprehensive timings obtained on various systems section at the end of this document for simple_normalize() and simple_PCA() timings obtained on all our test datasets, using three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

5.3.1 PCA on the normalized sparse datasets

Set block size to 40 Mb:

DelayedArray::setAutoBlockSize(4e7)
## automatic block size set to 4e+07 bytes (was 2.5e+08)

5.3.1.1 1000 x 12,500 normalized sparse dataset

sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
## [1]  1000 12500
system.time(pca1s <- simple_PCA(sparse1n))
##    user  system elapsed 
##  86.736   3.678  82.916
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9942444 531.0   17466960 932.9  12903527  689.2
## Vcells 26122561 199.3   80561592 614.7 184436537 1407.2

5.3.1.2 1000 x 25,000 normalized sparse dataset

sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
## [1]  1000 25000
system.time(pca2s <- simple_PCA(sparse2n))
##    user  system elapsed 
## 179.510   7.053 172.868
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9942521 531.0   17466960 932.9  12903527  689.2
## Vcells 26773063 204.3   80563102 614.7 184436537 1407.2

5.3.1.3 About memory usage

With this block size (40 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained <= 1.8 Gb at all times.

5.3.2 PCA on the normalized dense datasets

Set block size to 100 Mb (the default):

DelayedArray::setAutoBlockSize()
## automatic block size set to 1e+08 bytes (was 4e+07)

5.3.2.1 1000 x 12,500 normalized dense dataset

dense1n <- HDF5Array(dense1n_path, name="dense1n")
dim(dense1n)
## [1]  1000 12500
system.time(pca1d <- simple_PCA(dense1n))
##    user  system elapsed 
##  77.304  15.596  92.906
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9943562 531.1   17466960 932.9  12903527  689.2
## Vcells 27111594 206.9   96063531 733.0 184436537 1407.2

5.3.2.2 1000 x 25,000 normalized dense dataset

dense2n <- HDF5Array(dense2n_path, name="dense2n")
dim(dense2n)
## [1]  1000 25000
system.time(pca2d <- simple_PCA(dense2n))
##    user  system elapsed 
## 131.230  28.598 159.839
gc()
##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  9943055 531.1   17466960 932.9  12903527  689.2
## Vcells 27761515 211.9   86654446 661.2 184436537 1407.2

5.3.2.3 About memory usage

With this block size (100 Mb), memory usage (as reported by the Unix command ps u -p <PID>, see Monitoring memory usage above in this document) remained <= 2.0 Gb at all times.

5.3.3 Sanity checks

stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))
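
Note that the object returned by RSpectra::svds() (and therefore by simple_PCA()) is a list whose d, u, and v components contain the truncated SVD. For example, a 12,500 x 25 matrix of cell embeddings on the first 25 PCs could be derived from pca1s with something like this (a sketch, not evaluated in this document):

pc_scores <- pca1s$v %*% diag(pca1s$d)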

5.4 Comprehensive timings obtained on various systems

Here we report timings obtained on various systems. For each system, the results are summarized in a table that shows the normalization, realization, and PCA timings obtained on all our test datasets, using three different block sizes: 40 Mb, 100 Mb, and 250 Mb. For each operation, the best time across the three different block sizes is displayed in bold.

5.4.1 DELL XPS 15 laptop (model 9520)

  • OS: Linux Ubuntu 24.04
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 32GB
  • Disk: 1TB SSD
  • Disk performance:
      # Output of 'sudo hdparm -Tt <device>':
       Timing cached reads:   35188 MB in  2.00 seconds = 17620.75 MB/sec
       Timing buffered disk reads: 7842 MB in  3.00 seconds = 2613.57 MB/sec
    
Each data row below reads as follows: test dataset dimensions (nrow x ncol, i.e. # genes x # cells), format (sparse or dense), then, for each of the three operations listed below, the time in sec. followed by the max. mem. used, for block sizes of 40 Mb, 100 Mb, and 250 Mb. The dimensions of the normalized dataset (# selected genes x # cells) are shown between the normalization timings and the on-disk realization timings. A dense row refers to the same test dataset as the sparse row right above it.

1. NORMALIZATION & selection of the 1000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 70 1.8Gb 53 2.0Gb  44  2.3Gb 1000 x 12500 5 1.8Gb 6 2.1Gb 6 2.2Gb  36  1.7Gb 47 1.8Gb 45 2.0Gb
dense 53 1.8Gb 55 1.9Gb 49 2.4Gb 2 1.6Gb 2 1.5Gb 2 1.4Gb 41 1.8Gb 42 2.0Gb 38 1.6Gb
27998 x 25000 sparse 144 1.9Gb 110 2.4Gb  85  2.6Gb 1000 x 25000 11 1.9Gb 10 2.5Gb 11 2.8Gb 75 1.8Gb  72  2.0Gb 133 2.3Gb
dense 103 1.8Gb 116 2.0Gb 103 2.5Gb 5 1.7Gb 5 1.6Gb 5 1.6Gb 85 1.8Gb 82 2.0Gb 128 1.6Gb
27998 x 50000 sparse 277 1.9Gb 215 2.6Gb  179  3.5Gb 1000 x 50000 22 1.8Gb 20 2.3Gb 19 3.3Gb  143  1.8Gb 188 2.1Gb 169 2.6Gb
dense 211 1.9Gb 230 2.1Gb 228 2.5Gb 13 1.7Gb 10 1.7Gb 10 1.9Gb 208 1.8Gb 188 1.9Gb 275 1.7Gb
27998 x 100000 sparse 543 2.0Gb 436 2.9Gb  360  3.7Gb 1000 x 100000 41 2.0Gb 38 2.6Gb 39 3.7Gb  289  1.9Gb 313 2.3Gb 323 2.8Gb
dense 419 1.9Gb 467 2.1Gb 456 2.6Gb 25 1.6Gb 20 1.7Gb 18 2.1Gb 417 1.9Gb 311 2.1Gb 309 2.5Gb
27998 x 200000 sparse 1130 2.2Gb 874 2.8Gb  744  4.4Gb 1000 x 200000 73 2.2Gb 78 2.7Gb 71 4.2Gb  526  2.0Gb 619 2.4Gb 599 3.4Gb
dense 865 2.0Gb 966 2.1Gb 920 3.2Gb 62 2.0Gb 51 1.8Gb 36 2.9Gb 1089 2.3Gb 892 2.1Gb 943 2.3Gb
1. NORMALIZATION & selection of the 2000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 74 1.8Gb 61 2.2Gb  45  2.5Gb 2000 x 12500 7 1.8Gb 8 2.1Gb 7 2.5Gb  55  1.7Gb 59 1.8Gb 61 2.2Gb
dense 54 1.8Gb 59 2.0Gb 50 2.4Gb 3 1.6Gb 3 1.6Gb 3 1.8Gb 82 1.9Gb 87 1.9Gb 129 1.6Gb
27998 x 25000 sparse 151 1.8Gb 118 2.3Gb  100  2.7Gb 2000 x 25000 14 1.9Gb 15 2.3Gb 14 2.9Gb  110  1.7Gb 167 2.0Gb 159 2.3Gb
dense 117 1.8Gb 121 2.0Gb 113 2.5Gb 9 1.6Gb 8 1.6Gb 7 1.9Gb 219 1.8Gb 167 1.9Gb 151 2.5Gb
27998 x 50000 sparse 288 1.9Gb 229 2.4Gb  193  3.8Gb 2000 x 50000 28 1.9Gb 30 2.4Gb 26 3.4Gb  195  1.8Gb 229 2.0Gb 223 2.9Gb
dense 219 1.9Gb 240 2.1Gb 237 2.5Gb 19 1.7Gb 15 1.7Gb 14 2.3Gb 342 1.8Gb 275 2.0Gb 381 1.8Gb
27998 x 100000 sparse 562 2.0Gb 450 2.7Gb  375  3.8Gb 2000 x 100000 52 2.0Gb 55 2.4Gb 52 3.5Gb  351  1.8Gb 453 2.1Gb 408 3.0Gb
dense 440 1.9Gb 484 2.1Gb 475 2.6Gb 47 1.7Gb 38 1.7Gb 27 2.3Gb 809 1.9Gb 688 2.1Gb 854 1.7Gb
27998 x 200000 sparse 1170 2.1Gb 922 2.8Gb  798  3.6Gb 2000 x 200000 107 2.0Gb 114 2.5Gb 106 3.7Gb  672  1.9Gb 1160 2.2Gb 932 2.9Gb
dense 893 2.1Gb 989 2.0Gb 971 3.2Gb 99 2.0Gb 77 1.9Gb 52 3.0Gb 1713 2.3Gb 1366 2.2Gb 2047 2.2Gb
Table 1: Timings for DELL XPS 15 laptop
For each operation, the best time across the three different block sizes is displayed in bold.
In addition, if it’s also the best time across the sparse and dense formats, then we  box  it (only for Normalization and PCA).
Memory usage > 4Gb is displayed in light red.

5.4.2 Supermicro SuperServer 1029GQ-TRT

  • OS: Linux Ubuntu 22.04
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 384GB
  • Disk: 1.3TB ATA Disk
  • Disk performance:
      # Output of 'sudo hdparm -Tt <device>':
       Timing cached reads:   20592 MB in  1.99 seconds = 10361.41 MB/sec
       Timing buffered disk reads: 1440 MB in  3.00 seconds = 479.66 MB/sec
    
Data rows read as in Table 1: test dataset dimensions, format, then time in sec. and max. mem. used for block sizes of 40 Mb, 100 Mb, and 250 Mb, for each of the three operations, with the dimensions of the normalized dataset shown between the normalization and on-disk realization timings. Max. memory used was not recorded on this machine (NA).

1. NORMALIZATION & selection of the 1000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 109 NA 85 NA  74  NA 1000 x 12500 9 NA 11 NA 9 NA 67 NA 89 NA 90 NA
dense 85 NA 90 NA 80 NA 3 NA 3 NA 3 NA  63  NA 71 NA 66 NA
27998 x 25000 sparse 235 NA 192 NA  144  NA 1000 x 25000 18 NA 18 NA 19 NA 150 NA 172 NA 142 NA
dense 178 NA 184 NA 170 NA 7 NA 8 NA 8 NA 142 NA 145 NA 215 NA
27998 x 50000 sparse 439 NA 370 NA  303  NA 1000 x 50000 35 NA 36 NA 34 NA  285  NA 308 NA 344 NA
dense 352 NA 379 NA 354 NA 19 NA 15 NA 15 NA 336 NA 287 NA 462 NA
27998 x 100000 sparse 881 NA 741 NA  598  NA 1000 x 100000 67 NA 69 NA 67 NA 603 NA 572 NA 590 NA
dense 706 NA 746 NA 723 NA 37 NA 30 NA 28 NA 627 NA 681 NA  546  NA
27998 x 200000 sparse 1851 NA 1487 NA  1283  NA 1000 x 200000 139 NA 132 NA 127 NA  1097  NA 1112 NA 1306 NA
dense 1457 NA 1569 NA 1512 NA 95 NA 79 NA 58 NA 1749 NA 1505 NA 1747 NA
1. NORMALIZATION & selection of the 2000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 114 NA 98 NA  78  NA 2000 x 12500 12 NA 13 NA 13 NA 129 NA 141 NA  122  NA
dense 88 NA 97 NA 83 NA 5 NA 6 NA 6 NA 128 NA 145 NA 215 NA
27998 x 25000 sparse 246 NA 201 NA  169  NA 2000 x 25000 25 NA 28 NA 26 NA  228  NA 276 NA 290 NA
dense 183 NA 192 NA 184 NA 14 NA 13 NA 11 NA 301 NA 289 NA 448 NA
27998 x 50000 sparse 471 NA 391 NA  326  NA 2000 x 50000 51 NA 48 NA 50 NA  375  NA 424 NA 464 NA
dense 368 NA 394 NA 369 NA 30 NA 24 NA 21 NA 570 NA 645 NA 650 NA
27998 x 100000 sparse 921 NA 781 NA  639  NA 2000 x 100000 92 NA 98 NA 94 NA  777  NA 1112 NA 988 NA
dense 736 NA 764 NA 752 NA 70 NA 57 NA 45 NA 1296 NA 1049 NA 1432 NA
27998 x 200000 sparse 1913 NA 1558 NA  1376  NA 2000 x 200000 193 NA 190 NA 198 NA  1517  NA 1778 NA 1786 NA
dense 1497 NA 1648 NA 1554 NA 156 NA 114 NA 88 NA 2813 NA 2326 NA 3058 NA
Table 2: Timings for Supermicro SuperServer 1029GQ-TRT
For each operation, the best time across the three different block sizes is displayed in bold.
In addition, if it’s also the best time across the sparse and dense formats, then we  box  it (only for Normalization and PCA).
Memory usage > 4Gb is displayed in light red.

5.4.3 Intel Mac Pro (24-Core Intel Xeon W)

  • OS: macOS 12.7.6
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 96GB
  • Disk: 1TB SSD
  • Disk performance: N/A

Data rows read as in Table 1: test dataset dimensions, format, then time in sec. and max. mem. used for block sizes of 40 Mb, 100 Mb, and 250 Mb, for each of the three operations, with the dimensions of the normalized dataset shown between the normalization and on-disk realization timings. Max. memory used was not recorded on this machine (NA).

1. NORMALIZATION & selection of the 1000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 100 NA 76 NA 63 NA 1000 x 12500 8 NA 10 NA 10 NA 56 NA 80 NA 79 NA
dense 68 NA 68 NA  60  NA 3 NA 3 NA 2 NA 68 NA 59 NA 56 NA
27998 x 25000 sparse 206 NA 156 NA  121  NA 1000 x 25000 17 NA 15 NA 17 NA 110 NA 141 NA 202 NA
dense 129 NA 139 NA 122 NA 5 NA 6 NA 5 NA  104  NA 112 NA 180 NA
27998 x 50000 sparse 404 NA 379 NA 294 NA 1000 x 50000 38 NA 38 NA 34 NA 286 NA 418 NA 329 NA
dense 286 NA 290 NA  278  NA 17 NA 12 NA 12 NA 327 NA 274 NA  229  NA
27998 x 100000 sparse 839 NA 727 NA 543 NA 1000 x 100000 68 NA 71 NA 68 NA  495  NA 688 NA 564 NA
dense  515  NA 553 NA 538 NA 27 NA 23 NA 20 NA 565 NA 649 NA 627 NA
27998 x 200000 sparse 1749 NA 1343 NA 1065 NA 1000 x 200000 135 NA 138 NA 135 NA  958  NA 1318 NA 1105 NA
dense  1058  NA 1153 NA 1132 NA 68 NA 56 NA 42 NA 1498 NA 1197 NA 1437 NA
1. NORMALIZATION & selection of the 2000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 106 NA 88 NA 66 NA 2000 x 12500 11 NA 12 NA 15 NA  88  NA 101 NA 111 NA
dense 70 NA 74 NA  62  NA 4 NA 4 NA 4 NA 110 NA 116 NA 184 NA
27998 x 25000 sparse 216 NA 169 NA 142 NA 2000 x 25000 25 NA 26 NA 27 NA 198 NA 283 NA 242 NA
dense  133  NA 148 NA 134 NA 10 NA 10 NA 9 NA 286 NA 251 NA  193  NA
27998 x 50000 sparse 467 NA 360 NA 290 NA 2000 x 50000 49 NA 47 NA 51 NA  373  NA 462 NA 487 NA
dense 293 NA 301 NA  289  NA 23 NA 18 NA 16 NA 505 NA 377 NA 557 NA
27998 x 100000 sparse 847 NA 728 NA 549 NA 2000 x 100000 91 NA 102 NA 88 NA  641  NA 789 NA 715 NA
dense  534  NA 581 NA 560 NA 52 NA 46 NA 31 NA 1141 NA 968 NA 1252 NA
27998 x 200000 sparse 1808 NA 1429 NA 1145 NA 2000 x 200000 192 NA 200 NA 179 NA  1308  NA 2030 NA 1713 NA
dense  1109  NA 1186 NA 1169 NA 119 NA 86 NA 72 NA 2326 NA 1766 NA 3058 NA
Table 3: Timings for Intel Mac Pro
For each operation, the best time across the three different block sizes is displayed in bold.
In addition, if it’s also the best time across the sparse and dense formats, then we  box  it (only for Normalization and PCA).
Memory usage > 4Gb is displayed in light red.

5.4.4 Apple Silicon Mac Pro (Apple M2 Ultra)

  • OS: macOS 13.7.1
  • Bioconductor/R versions: 3.21/4.5
  • RAM: 192GB
  • Disk: 2TB SSD
  • Disk performance: N/A

Data rows read as in Table 1: test dataset dimensions, format, then time in sec. and max. mem. used for block sizes of 40 Mb, 100 Mb, and 250 Mb, for each of the three operations, with the dimensions of the normalized dataset shown between the normalization and on-disk realization timings. Max. memory used was not recorded on this machine (NA).

1. NORMALIZATION & selection of the 1000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 58 NA 44 NA 36 NA 1000 x 12500 4 NA 5 NA 5 NA  28  NA 39 NA 39 NA
dense 38 NA 37 NA  34  NA 2 NA 2 NA 2 NA 33 NA 30 NA 29 NA
27998 x 25000 sparse 121 NA 92 NA  70  NA 1000 x 25000 9 NA 9 NA 9 NA 57 NA 72 NA 103 NA
dense 71 NA 77 NA 71 NA 4 NA 4 NA 4 NA  55  NA 57 NA 96 NA
27998 x 50000 sparse 242 NA 189 NA  146  NA 1000 x 50000 19 NA 17 NA 15 NA 115 NA 156 NA 140 NA
dense 149 NA 152 NA 151 NA 10 NA 7 NA 8 NA 149 NA 139 NA  114  NA
27998 x 100000 sparse 460 NA 377 NA 299 NA 1000 x 100000 34 NA 34 NA 30 NA  244  NA 326 NA 276 NA
dense  291  NA 309 NA 311 NA 20 NA 16 NA 14 NA 296 NA 373 NA 334 NA
27998 x 200000 sparse 989 NA 752 NA 621 NA 1000 x 200000 67 NA 66 NA 62 NA  500  NA 686 NA 557 NA
dense  596  NA 644 NA 651 NA 51 NA 40 NA 29 NA 846 NA 669 NA 790 NA
1. NORMALIZATION & selection of the 2000 most variable genes | 2. ON-DISK REALIZATION of the normalized dataset | 3. PCA of the normalized dataset
27998 x 12500 sparse 62 NA 50 NA 38 NA 2000 x 12500 6 NA 7 NA 7 NA  47  NA 50 NA 51 NA
dense 39 NA 40 NA  35  NA 2 NA 3 NA 3 NA 58 NA 59 NA 98 NA
27998 x 25000 sparse 127 NA 97 NA 81 NA 2000 x 25000 13 NA 12 NA 12 NA  90  NA 138 NA 116 NA
dense  74  NA 80 NA 76 NA 7 NA 6 NA 5 NA 145 NA 124 NA 100 NA
27998 x 50000 sparse 253 NA 199 NA 158 NA 2000 x 50000 25 NA 24 NA 24 NA  165  NA 227 NA 205 NA
dense  154  NA 157 NA 157 NA 14 NA 11 NA 10 NA 251 NA 197 NA 279 NA
27998 x 100000 sparse 493 NA 400 NA 322 NA 2000 x 100000 50 NA 52 NA 46 NA  293  NA 378 NA 358 NA
dense  301  NA 324 NA 321 NA 36 NA 28 NA 20 NA 631 NA 517 NA 646 NA
27998 x 200000 sparse 1048 NA 801 NA 671 NA 2000 x 200000 93 NA 98 NA 91 NA  598  NA 1010 NA 818 NA
dense  619  NA 666 NA 674 NA 82 NA 60 NA 44 NA 1256 NA 1010 NA 1634 NA
Table 4: Timings for Apple Silicon Mac Pro
For each operation, the best time across the three different block sizes is displayed in bold.
In addition, if it’s also the best time across the sparse and dense formats, then we  box  it (only for Normalization and PCA).
Memory usage > 4Gb is displayed in light red.

5.5 Final remarks

The Supermicro SuperServer 1029GQ-TRT machine is significantly slower than the other machines, most likely because of its slower disk.

For PCA, choosing the sparse representation (TENxMatrix) and using small block sizes (40 Mb) is a clear winner.

For normalization, there’s no clear best choice between the sparse and dense representations. The sparse representation tends to give the best times when using bigger blocks (250 Mb), whereas the dense representation tends to give the best times when using smaller blocks (40 Mb). However, based on the above benchmarks, neither “sparse with big blocks” nor “dense with small blocks” clearly wins in terms of speed. Maybe extending the benchmarks to more extreme block sizes (e.g. 20 Mb and 500 Mb) could help break the tie.

Normalization and PCA times grow roughly linearly with the number of cells, regardless of representation (sparse or dense) or block size.

For small block sizes (<= 100 Mb), memory usage doesn’t seem to be affected by the number of cells in the dataset, that is, all operations seem to run in almost constant memory, regardless of representation (sparse or dense). However, for bigger block sizes (e.g. 250 Mb), the memory used to process the sparse datasets tends to increase slightly with the number of cells in the dataset.

[TODO] Add some plots to help visualize the above assertions.
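
As a starting point, here is a minimal sketch of such a plot, using the sparse-format normalization times from Table 1 (block size 250 Mb, selection of 1000 genes), copied by hand into a vector:

ncells    <- c(12500, 25000, 50000, 100000, 200000)
norm_time <- c(44, 85, 179, 360, 744)   # in seconds, from Table 1
plot(ncells, norm_time, type="b",
     xlab="number of cells", ylab="normalization time (sec.)")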

6 Session information

sessionInfo()
## R Under development (unstable) (2025-01-20 r87609)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] TENxBrainData_1.27.0        SingleCellExperiment_1.29.1
##  [3] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [5] GenomicRanges_1.59.1        GenomeInfoDb_1.43.4        
##  [7] RSpectra_0.16-2             DelayedMatrixStats_1.29.1  
##  [9] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
## [11] BiocFileCache_2.15.1        dbplyr_2.5.0               
## [13] HDF5Array_1.35.11           h5mread_0.99.4             
## [15] rhdf5_2.51.2                DelayedArray_0.33.4        
## [17] SparseArray_1.7.4           S4Arrays_1.7.1             
## [19] IRanges_2.41.2              abind_1.4-8                
## [21] S4Vectors_0.45.2            MatrixGenerics_1.19.1      
## [23] matrixStats_1.5.0           BiocGenerics_0.53.6        
## [25] generics_0.1.3              Matrix_1.7-2               
## [27] BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] KEGGREST_1.47.0          xfun_0.50                bslib_0.8.0             
##  [4] lattice_0.22-6           rhdf5filters_1.19.0      vctrs_0.6.5             
##  [7] tools_4.5.0              curl_6.2.0               tibble_3.2.1            
## [10] AnnotationDbi_1.69.0     RSQLite_2.3.9            blob_1.2.4              
## [13] pkgconfig_2.0.3          sparseMatrixStats_1.19.0 lifecycle_1.0.4         
## [16] GenomeInfoDbData_1.2.13  compiler_4.5.0           Biostrings_2.75.3       
## [19] htmltools_0.5.8.1        sass_0.4.9               yaml_2.3.10             
## [22] pillar_1.10.1            crayon_1.5.3             jquerylib_0.1.4         
## [25] cachem_1.1.0             mime_0.12                tidyselect_1.2.1        
## [28] digest_0.6.37            purrr_1.0.2              dplyr_1.1.4             
## [31] bookdown_0.42            BiocVersion_3.21.1       fastmap_1.2.0           
## [34] grid_4.5.0               cli_3.6.3                magrittr_2.0.3          
## [37] withr_3.0.2              filelock_1.0.3           UCSC.utils_1.3.1        
## [40] rappdirs_0.3.3           bit64_4.6.0-1            rmarkdown_2.29          
## [43] XVector_0.47.2           httr_1.4.7               bit_4.5.0.1             
## [46] png_0.1-8                memoise_2.0.1            evaluate_1.0.3          
## [49] knitr_1.49               rlang_1.1.5              Rcpp_1.0.14             
## [52] glue_1.8.0               DBI_1.2.3                BiocManager_1.30.25     
## [55] jsonlite_1.8.9           R6_2.5.1                 Rhdf5lib_1.29.0