---
title: "_HDF5Array_ performance"
author:
- name: Hervé Pagès
  affiliation: Fred Hutch Cancer Center, Seattle, WA
date: "Compiled `r doc_date()`; Modified 28 January 2025"
package: HDF5Array
vignette: |
  %\VignetteIndexEntry{HDF5Array performance}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

# Introduction

The aim of this document is to measure the performance of the
`r Biocpkg("HDF5Array")` package for normalization and PCA (Principal
Component Analysis) of _on-disk_ single cell data, two computationally
intensive operations at the core of single cell analysis.

The benchmarks presented in the document were specifically designed to
observe the impact of two critical parameters on performance:

1. data representation (i.e. _sparse_ vs _dense_);
2. size of the blocks used for block-processed operations.

These benchmarks should also facilitate comparing the performance of single
cell analysis workflows based on `r Biocpkg("HDF5Array")` with workflows
based on other tools like Seurat or Scanpy.

# Install and load the required packages

Let's install and load `r Biocpkg("HDF5Array")` as well as the other
packages used in this vignette:

```{r install, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)
```

Load the packages:

```{r load, message=FALSE}
library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)
```

```{r source_make_timings_table_R, echo=FALSE, results='hide'}
## Needed for the make_timings_table() function.
path <- system.file(package="HDF5Array", "scripts",
                    "make_timings_table.R", mustWork=TRUE)
source(path, verbose=FALSE)
```

# The test datasets

## _Sparse_ vs _dense_ representation

The datasets that we will use for our benchmarks are subsets of the
_1.3 Million Brain Cell Dataset_ from 10x Genomics.

The _1.3 Million Brain Cell Dataset_ is a 27,998 x 1,306,127 matrix of
counts, with one gene per row and one cell per column. It's available via
the `r Biocpkg("ExperimentHub")` package in two forms, one that uses a
sparse representation and one that uses a dense representation:

```{r ExperimentHub}
hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
hub["EH1040"]$description  # dense representation
```

The two datasets are big HDF5 files stored at a remote location. Let's
download them to the local `r Biocpkg("ExperimentHub")` cache if they are
not there yet:

```{r get_EH1039_and_EH1040, message=FALSE}
## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while!
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5 <- hub[["EH1040"]]
```

## TENxMatrix vs HDF5Matrix objects

We use the `TENxMatrix()` and `HDF5Array()` constructors to bring the
sparse and dense datasets into R as DelayedArray derivatives. Note that
this does _not_ load the matrix data into memory.

Bring the sparse dataset into R:

```{r TENxMatrix_constructor}
## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")
```

Note that `full_sparse` is a 27,998 x 1,306,127 TENxMatrix object:

```{r full_sparse_class_and_dim}
class(full_sparse)
dim(full_sparse)
```

See `?TENxMatrix` in the `r Biocpkg("HDF5Array")` package for more
information about TENxMatrix objects.

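The object is only a thin wrapper around the data sitting in the HDF5 file.
As a quick illustration of this (a minimal sketch, not evaluated here; the
subset indices below are arbitrary), data only gets read from the file when
a block is explicitly realized in memory, for example by coercing a small
subset to an ordinary matrix:

```{r peek_at_full_sparse, eval=FALSE}
## 'full_sparse' uses sparse semantics:
is_sparse(full_sparse)
## Only the requested block gets read from the HDF5 file and loaded
## in memory:
as.matrix(full_sparse[1:10, 1:5])
```
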
Bring the dense dataset into R:

```{r HDF5Array_constructor}
## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")
```

Note that `full_dense` is a 27,998 x 1,306,127 HDF5Matrix object that
contains the same data as `full_sparse`:

```{r full_dense_class_and_dim}
class(full_dense)
dim(full_dense)
```

See `?HDF5Matrix` in the `r Biocpkg("HDF5Array")` package for more
information about HDF5Matrix objects.

Finally, note that the dense HDF5 file does not contain the dimnames of
the matrix, so we manually add them to `full_dense`:

```{r set_full_dense_dimnames}
dimnames(full_dense) <- dimnames(full_sparse)
```

## Create the test datasets

For our benchmarks below, we create subsets of the _1.3 Million Brain Cell
Dataset_ of increasing sizes: subsets with 12,500 cells, 25,000 cells,
50,000 cells, 100,000 cells, and 200,000 cells. Note that subsetting a
TENxMatrix or HDF5Matrix object with `[` is a delayed operation, so it has
virtually no cost:

```{r create_test_datasets}
sparse1 <- full_sparse[ , 1:12500]
dense1 <- full_dense[ , 1:12500]
sparse2 <- full_sparse[ , 1:25000]
dense2 <- full_dense[ , 1:25000]
sparse3 <- full_sparse[ , 1:50000]
dense3 <- full_dense[ , 1:50000]
sparse4 <- full_sparse[ , 1:100000]
dense4 <- full_dense[ , 1:100000]
sparse5 <- full_sparse[ , 1:200000]
dense5 <- full_dense[ , 1:200000]
```

# Block-processed normalization and PCA

## Code used for normalization and PCA

We'll use the following code for normalization:

```{r simple_normalize_function}
## Also selects the most variable genes (1000 by default).
simple_normalize <- function(mat, num_var_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    mat <- mat[rowSums(mat) > 0, ]
    mat <- t(t(mat) * 10000 / colSums(mat))
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_var_genes)
    mat <- log1p(mat[variable_idx, ])
    mat / rowSds(mat)
}
```

and the following code for PCA:

```{r simple_PCA_function}
simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}
```

## Block processing and block size

Note that the implementations of `simple_normalize()` and `simple_PCA()`
are expected to work on any matrix-like object, regardless of its exact
type/representation, e.g. an ordinary matrix, a SparseMatrix object from
the `r Biocpkg("SparseArray")` package, a dgCMatrix object from the
`r CRANpkg("Matrix")` package, a DelayedMatrix derivative (TENxMatrix,
HDF5Matrix, TileDBMatrix), etc. (see the sketch right after the list
below).

However, when the input is a DelayedMatrix object or derivative, it's
important to be aware that:

- Summarization methods like `sum()`, `colSums()`, `rowVars()`, or
  `rowSds()`, and matrix multiplication (`%*%`), are _block-processed_
  operations.

- The block size is 100 Mb by default. Increasing or decreasing the block
  size will typically increase or decrease the memory usage of
  _block-processed_ operations. It will also impact performance, but
  sometimes in unexpected or counter-intuitive ways.

- The block size can be controlled with `DelayedArray::getAutoBlockSize()`
  and `DelayedArray::setAutoBlockSize()`.

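As a quick illustration that `simple_normalize()` and `simple_PCA()` also
work on an ordinary matrix, here is a minimal sketch (not evaluated here;
the sizes, seed, and dimnames below are arbitrary):

```{r simple_workflow_on_ordinary_matrix, eval=FALSE}
## Small matrix of simulated counts, with made-up dimnames (rownames are
## required by simple_normalize()):
set.seed(123)
m0 <- matrix(rpois(3000 * 500, lambda=0.3), nrow=3000, ncol=500,
             dimnames=list(paste0("gene", 1:3000), paste0("cell", 1:500)))
m0n <- simple_normalize(m0, num_var_genes=250)
dim(m0n)   # 250 most variable genes x 500 cells
pca0 <- simple_PCA(m0n, k=10)
names(pca0)
```
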
For our benchmarks below, we'll use the following block sizes:

- normalization of the sparse datasets: 250 Mb
- normalization of the dense datasets: 40 Mb
- on-disk realization of the normalized sparse datasets: 100 Mb
- on-disk realization of the normalized dense datasets: 250 Mb
- PCA on the normalized sparse datasets: 40 Mb
- PCA on the normalized dense datasets: 100 Mb (the default)

## Monitoring memory usage

While manually running our benchmarks below on a Linux or macOS system, we
will also monitor memory usage at the command line, in a terminal, with:

    (while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where `<PID>` is the process id of our R session. This will allow us to
measure the maximum amount of memory used by the calls to
`simple_normalize()` or `simple_PCA()`.

# Benchmarks

## Normalization

In this section we run `simple_normalize()` on the two smaller test
datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and
we report timings and memory usage.

See the **Comprehensive timings obtained on various systems** section at
the end of this document for `simple_normalize()` and `simple_PCA()`
timings obtained on various systems, on all our test datasets, and with
three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

### Normalizing the sparse datasets

Set block size to 250 Mb:

```{r set_block_size_for_normalize_sparse}
DelayedArray::setAutoBlockSize(2.5e8)
```

#### 27,998 x 12,500 sparse dataset

```{r normalize_sparse1}
dim(sparse1)
system.time(sparse1n <- simple_normalize(sparse1))
gc()
dim(sparse1n)
```

#### 27,998 x 25,000 sparse dataset

```{r normalize_sparse2}
dim(sparse2)
system.time(sparse2n <- simple_normalize(sparse2))
gc()
dim(sparse2n)
```

#### About memory usage

With this block size (250 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 2.6 Gb at all times.

### Normalizing the dense datasets

Set block size to 40 Mb:

```{r set_block_size_for_normalize_dense}
DelayedArray::setAutoBlockSize(4e7)
```

#### 27,998 x 12,500 dense dataset

```{r normalize_dense1}
dim(dense1)
system.time(dense1n <- simple_normalize(dense1))
gc()
dim(dense1n)
```

#### 27,998 x 25,000 dense dataset

```{r normalize_dense2}
dim(dense2)
system.time(dense2n <- simple_normalize(dense2))
gc()
dim(dense2n)
```

#### About memory usage

With this block size (40 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 1.8 Gb at all times.

## On-disk realization of the normalized datasets

Note that the normalized datasets obtained in the previous section are
DelayedMatrix objects that carry delayed operations. These can be displayed
with `showtree()`:

```{r show_sparse1n_delayed_ops}
class(sparse1n)
showtree(sparse1n)
```

```{r show_dense1n_delayed_ops}
class(dense1n)
showtree(dense1n)
```

Before we proceed with PCA, we're going to write the normalized datasets
to new HDF5 files. This has a significant cost but, on the other hand, it
has the benefit of triggering _on-disk realization_ of the objects: all
the delayed operations carried by an object get realized on-the-fly before
the matrix data actually lands on the disk, making the new object
(TENxMatrix or HDF5Matrix) more efficient for PCA or whatever
block-processed operations come next.

### On-disk realization of the normalized sparse datasets

Set block size to 100 Mb:

```{r set_block_size_for_realize_sparse}
DelayedArray::setAutoBlockSize(1e8)
```

#### 1000 x 12,500 normalized sparse dataset

```{r writeTENxMatrix_sparse1n}
dim(sparse1n)
sparse1n_path <- tempfile()
system.time(
    sparse1n <- writeTENxMatrix(sparse1n, sparse1n_path,
                                group="matrix", level=0)
)
gc()
```

The new `sparse1n` object is a _pristine_ TENxMatrix object:

```{r show_pristine_sparse1n}
class(sparse1n)
showtree(sparse1n)  # "pristine" object (i.e. no more delayed operations)
```

#### 1000 x 25,000 normalized sparse dataset

```{r writeTENxMatrix_sparse2n}
dim(sparse2n)
sparse2n_path <- tempfile()
system.time(
    sparse2n <- writeTENxMatrix(sparse2n, sparse2n_path,
                                group="matrix", level=0)
)
gc()
```

The new `sparse2n` object is a _pristine_ TENxMatrix object:

```{r show_pristine_sparse2n}
class(sparse2n)
showtree(sparse2n)  # "pristine" object (i.e. no more delayed operations)
```

#### About memory usage

With this block size (100 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 2.5 Gb at all times.

### On-disk realization of the normalized dense datasets

Set block size to 250 Mb:

```{r set_block_size_for_realize_dense}
DelayedArray::setAutoBlockSize(2.5e8)
```

#### 1000 x 12,500 normalized dense dataset

```{r writeHDF5Array_dense1n}
dim(dense1n)
dense1n_path <- tempfile()
system.time(
    dense1n <- writeHDF5Array(dense1n, dense1n_path,
                              name="dense1n", level=0)
)
gc()
```

The new `dense1n` object is a _pristine_ HDF5Matrix object:

```{r show_pristine_dense1n}
class(dense1n)
showtree(dense1n)  # "pristine" object (i.e. no more delayed operations)
```

#### 1000 x 25,000 normalized dense dataset

```{r writeHDF5Array_dense2n}
dim(dense2n)
dense2n_path <- tempfile()
system.time(
    dense2n <- writeHDF5Array(dense2n, dense2n_path,
                              name="dense2n", level=0)
)
gc()
```

The new `dense2n` object is a _pristine_ HDF5Matrix object:

```{r show_pristine_dense2n}
class(dense2n)
showtree(dense2n)  # "pristine" object (i.e. no more delayed operations)
```

#### About memory usage

With this block size (250 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 1.6 Gb at all times.

## PCA

In this section we run `simple_PCA()` on the two normalized datasets
obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse
and dense), and we report timings and memory usage.

See the **Comprehensive timings obtained on various systems** section at
the end of this document for `simple_normalize()` and `simple_PCA()`
timings obtained on various systems, on all our test datasets, and with
three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

### PCA on the normalized sparse datasets

Set block size to 40 Mb:

```{r set_block_size_for_PCA_sparse}
DelayedArray::setAutoBlockSize(4e7)
```

#### 1000 x 12,500 normalized sparse dataset

```{r PCA_sparse1n}
sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
system.time(pca1s <- simple_PCA(sparse1n))
gc()
```

#### 1000 x 25,000 normalized sparse dataset

```{r PCA_sparse2n}
sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
system.time(pca2s <- simple_PCA(sparse2n))
gc()
```

#### About memory usage

With this block size (40 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 1.8 Gb at all times.

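Note that the `svds()` output stored in `pca1s` and `pca2s` is not yet in
the usual "cell embeddings" form. The sketch below (not evaluated here;
`cell_embeddings` is a made-up name) shows one common way to derive
per-cell coordinates from it, assuming the standard `u`/`d`/`v` components
returned by `RSpectra::svds()`:

```{r cell_embeddings_sketch, eval=FALSE}
## The columns of 'sparse1n' are the cells, so the right singular vectors
## ('v') have one row per cell. Scaling them by the singular values ('d')
## gives PCA-like per-cell coordinates:
cell_embeddings <- pca1s$v %*% diag(pca1s$d)  # 12,500 cells x 25 PCs
dim(cell_embeddings)
```
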
### PCA on the normalized dense datasets

Set block size to 100 Mb (the default):

```{r set_block_size_for_PCA_dense}
DelayedArray::setAutoBlockSize()
```

#### 1000 x 12,500 normalized dense dataset

```{r PCA_dense1n}
dense1n <- HDF5Array(dense1n_path, name="dense1n")
dim(dense1n)
system.time(pca1d <- simple_PCA(dense1n))
gc()
```

#### 1000 x 25,000 normalized dense dataset

```{r PCA_dense2n}
dense2n <- HDF5Array(dense2n_path, name="dense2n")
dim(dense2n)
system.time(pca2d <- simple_PCA(dense2n))
gc()
```

#### About memory usage

With this block size (100 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 2.0 Gb at all times.

### Sanity checks

```{r sanity_checks}
stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))
```

## Comprehensive timings obtained on various systems

Here we report timings obtained on various systems. For each system, the
results are summarized in a table that shows the normalization,
realization, and PCA timings obtained on all our test datasets and with
three different block sizes: 40 Mb, 100 Mb, and 250 Mb. For each operation,
the best time across the three block sizes is displayed in **bold**.

### DELL XPS 15 laptop (model 9520)

- **OS:** Linux Ubuntu 24.04
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 32GB
- **Disk:** 1TB SSD
- **Disk performance**:
    # Output of 'sudo hdparm -Tt <device>':
     Timing cached reads:   35188 MB in  2.00 seconds = 17620.75 MB/sec
     Timing buffered disk reads: 7842 MB in  3.00 seconds = 2613.57 MB/sec

```{r xps15_timings, echo=FALSE, results='asis'}
title <- "Table 1: Timings for DELL XPS 15 laptop"
make_timings_table("xps15", title=title)
```

### Supermicro SuperServer 1029GQ-TRT

- **OS:** Linux Ubuntu 22.04
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 384GB
- **Disk:** 1.3TB ATA Disk
- **Disk performance**:
    # Output of 'sudo hdparm -Tt <device>':
     Timing cached reads:   20592 MB in  1.99 seconds = 10361.41 MB/sec
     Timing buffered disk reads: 1440 MB in  3.00 seconds = 479.66 MB/sec

```{r rex3_timings, echo=FALSE, results='asis'}
title <- "Table 2: Timings for Supermicro SuperServer 1029GQ-TRT"
make_timings_table("rex3", title=title)
```

### Intel Mac Pro (24-Core Intel Xeon W)

- **OS:** macOS 12.7.6
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 96GB
- **Disk:** 1TB SSD
- **Disk performance**: N/A

```{r lconway_timings, echo=FALSE, results='asis'}
title <- "Table 3: Timings for Intel Mac Pro"
make_timings_table("lconway", title=title)
```

### Apple Silicon Mac Pro (Apple M2 Ultra)

- **OS:** macOS 13.7.1
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 192GB
- **Disk:** 2TB SSD
- **Disk performance**: N/A

```{r kjohnson3_timings, echo=FALSE, results='asis'}
title <- "Table 4: Timings for Apple Silicon Mac Pro"
make_timings_table("kjohnson3", title=title)
```

## Final remarks

The Supermicro SuperServer 1029GQ-TRT machine is significantly slower than
the other machines. This is most likely due to its slower disk.

For PCA, choosing the sparse representation (TENxMatrix) and using small
block sizes (40 Mb) is a clear winner.

For normalization, there's no clear best choice between the sparse and
dense representations. More precisely, for this operation, the sparse
representation tends to give the best times when using bigger blocks
(250 Mb), whereas the dense representation tends to give the best times
when using smaller blocks (40 Mb). However, based on the above benchmarks,
there's no clear best choice between "sparse with big blocks" and "dense
with small blocks" in terms of speed. Maybe extending the benchmarks to
include more extreme block sizes (e.g. 20 Mb and 500 Mb) could help break
the tie.

Normalization and PCA times grow roughly linearly with the number of cells
in the dataset, regardless of representation (sparse or dense) or block
size.

For small block sizes (<= 100 Mb), memory usage doesn't seem to be affected
by the number of cells in the dataset, that is, all operations seem to run
in nearly constant memory, regardless of representation (sparse or dense).
However, for bigger block sizes (e.g. 250 Mb), the memory used to process
the sparse datasets tends to increase slightly with the number of cells in
the dataset.

_[TODO] Add some plots to help visualize the above assertions._

# Session information

```{r sessionInfo}
sessionInfo()
```