---
title: "_HDF5Array_ performance"
author:
- name: Hervé Pagès
  affiliation: Fred Hutch Cancer Center, Seattle, WA
date: "Compiled `r doc_date()`; Modified 28 January 2025"
package: HDF5Array
vignette: |
  %\VignetteIndexEntry{HDF5Array performance}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---

# Introduction

The aim of this document is to measure the performance of the
`r Biocpkg("HDF5Array")` package for normalization and PCA (Principal
Component Analysis) of _on-disk_ single cell data, two computationally
intensive operations at the core of single cell analysis.

The benchmarks presented in the document were specifically designed to
observe the impact of two critical parameters on performance:

1. data representation (i.e. _sparse_ vs _dense_);
2. size of the blocks used for block-processed operations.

These benchmarks should also facilitate comparing the performance of single
cell analysis workflows based on `r Biocpkg("HDF5Array")` with workflows
based on other tools like Seurat or Scanpy.

# Install and load the required packages

Let's install and load `r Biocpkg("HDF5Array")` as well as the other
packages used in this vignette:

```{r install, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
pkgs <- c("HDF5Array", "ExperimentHub", "DelayedMatrixStats", "RSpectra")
BiocManager::install(pkgs)
```

Load the packages:

```{r load, message=FALSE}
library(HDF5Array)
library(ExperimentHub)
library(DelayedMatrixStats)
library(RSpectra)
```

```{r source_make_timings_table_R, echo=FALSE, results='hide'}
## Needed for the make_timings_table() function.
path <- system.file(package="HDF5Array", "scripts",
                    "make_timings_table.R", mustWork=TRUE)
source(path, verbose=FALSE)
```

# The test datasets

## _Sparse_ vs _dense_ representation

The datasets that we will use for our benchmarks are subsets of the
_1.3 Million Brain Cell Dataset_ from 10x Genomics.

The _1.3 Million Brain Cell Dataset_ is a 27,998 x 1,306,127 matrix of
counts, with one gene per row and one cell per column. It's available via
the `r Biocpkg("ExperimentHub")` package in two forms, one that uses a
sparse representation and one that uses a dense representation:

```{r ExperimentHub}
hub <- ExperimentHub()
hub["EH1039"]$description  # sparse representation
hub["EH1040"]$description  # dense representation
```

The two datasets are big HDF5 files stored at a remote location. Let's
download them to the local `r Biocpkg("ExperimentHub")` cache if they are
not there yet:

```{r get_EH1039_and_EH1040, message=FALSE}
## Note that this will be quick if the HDF5 files are already in the
## local ExperimentHub cache. Otherwise, it will take a while!
full_sparse_h5 <- hub[["EH1039"]]
full_dense_h5 <- hub[["EH1040"]]
```

## TENxMatrix vs HDF5Matrix objects

We use the `TENxMatrix()` and `HDF5Array()` constructors to bring the
sparse and dense datasets into R as DelayedArray derivatives. Note that
this does _not_ load the matrix data into memory.

Bring the sparse dataset into R:

```{r TENxMatrix_constructor}
## Use 'h5ls(full_sparse_h5)' to find out the group.
full_sparse <- TENxMatrix(full_sparse_h5, group="mm10")
```

Note that `full_sparse` is a 27,998 x 1,306,127 TENxMatrix object:

```{r full_sparse_class_and_dim}
class(full_sparse)
dim(full_sparse)
```

See `?TENxMatrix` in the `r Biocpkg("HDF5Array")` package for more
information about TENxMatrix objects.

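The object is only a thin wrapper around the data sitting in the HDF5 file.
As a quick illustration of this (a minimal sketch, not evaluated here; the
subset indices below are arbitrary), data only gets read from the file when
a block is explicitly realized in memory, for example by coercing a small
subset to an ordinary matrix:

```{r peek_at_full_sparse, eval=FALSE}
## 'full_sparse' uses sparse semantics:
is_sparse(full_sparse)
## Only the requested block gets read from the HDF5 file and loaded
## in memory:
as.matrix(full_sparse[1:10, 1:5])
```
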
Bring the dense dataset into R:

```{r HDF5Array_constructor}
## Use 'h5ls(full_dense_h5)' to find out the name of the dataset.
full_dense <- HDF5Array(full_dense_h5, name="counts")
```

Note that `full_dense` is a 27,998 x 1,306,127 HDF5Matrix object that
contains the same data as `full_sparse`:

```{r full_dense_class_and_dim}
class(full_dense)
dim(full_dense)
```

See `?HDF5Matrix` in the `r Biocpkg("HDF5Array")` package for more
information about HDF5Matrix objects.

Finally, note that the dense HDF5 file does not contain the dimnames of
the matrix, so we manually add them to `full_dense`:

```{r set_full_dense_dimnames}
dimnames(full_dense) <- dimnames(full_sparse)
```

## Create the test datasets

For our benchmarks below, we create subsets of the _1.3 Million Brain Cell
Dataset_ of increasing sizes: subsets with 12,500 cells, 25,000 cells,
50,000 cells, 100,000 cells, and 200,000 cells. Note that subsetting a
TENxMatrix or HDF5Matrix object with `[` is a delayed operation, so it has
virtually no cost:

```{r create_test_datasets}
sparse1 <- full_sparse[ , 1:12500]
dense1 <- full_dense[ , 1:12500]
sparse2 <- full_sparse[ , 1:25000]
dense2 <- full_dense[ , 1:25000]
sparse3 <- full_sparse[ , 1:50000]
dense3 <- full_dense[ , 1:50000]
sparse4 <- full_sparse[ , 1:100000]
dense4 <- full_dense[ , 1:100000]
sparse5 <- full_sparse[ , 1:200000]
dense5 <- full_dense[ , 1:200000]
```

# Block-processed normalization and PCA

## Code used for normalization and PCA

We'll use the following code for normalization:

```{r simple_normalize_function}
## Also selects the most variable genes (1000 by default).
simple_normalize <- function(mat, num_var_genes=1000)
{
    stopifnot(length(dim(mat)) == 2, !is.null(rownames(mat)))
    mat <- mat[rowSums(mat) > 0, ]
    mat <- t(t(mat) * 10000 / colSums(mat))
    row_vars <- rowVars(mat)
    rv_order <- order(row_vars, decreasing=TRUE)
    variable_idx <- head(rv_order, n=num_var_genes)
    mat <- log1p(mat[variable_idx, ])
    mat / rowSds(mat)
}
```

and the following code for PCA:

```{r simple_PCA_function}
simple_PCA <- function(mat, k=25)
{
    stopifnot(length(dim(mat)) == 2)
    row_means <- rowMeans(mat)
    Ax <- function(x, args)
        (as.numeric(mat %*% x) - row_means * sum(x))
    Atx <- function(x, args)
        (as.numeric(x %*% mat) - as.vector(row_means %*% x))
    RSpectra::svds(Ax, Atrans=Atx, k=k, dim=dim(mat))
}
```

## Block processing and block size

Note that the implementations of `simple_normalize()` and `simple_PCA()`
are expected to work on any matrix-like object, regardless of its exact
type/representation, e.g. an ordinary matrix, a SparseMatrix object from
the `r Biocpkg("SparseArray")` package, a dgCMatrix object from the
`r CRANpkg("Matrix")` package, a DelayedMatrix derivative (TENxMatrix,
HDF5Matrix, TileDBMatrix), etc. (see the sketch right after the list
below).

However, when the input is a DelayedMatrix object or derivative, it's
important to be aware that:

- Summarization methods like `sum()`, `colSums()`, `rowVars()`, or
  `rowSds()`, and matrix multiplication (`%*%`), are _block-processed_
  operations.

- The block size is 100 Mb by default. Increasing or decreasing the block
  size will typically increase or decrease the memory usage of
  _block-processed_ operations. It will also impact performance, but
  sometimes in unexpected or counter-intuitive ways.

- The block size can be controlled with `DelayedArray::getAutoBlockSize()`
  and `DelayedArray::setAutoBlockSize()`.

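As a quick illustration that `simple_normalize()` and `simple_PCA()` also
work on an ordinary matrix, here is a minimal sketch (not evaluated here;
the sizes, seed, and dimnames below are arbitrary):

```{r simple_workflow_on_ordinary_matrix, eval=FALSE}
## Small matrix of simulated counts, with made-up dimnames (rownames are
## required by simple_normalize()):
set.seed(123)
m0 <- matrix(rpois(3000 * 500, lambda=0.3), nrow=3000, ncol=500,
             dimnames=list(paste0("gene", 1:3000), paste0("cell", 1:500)))
m0n <- simple_normalize(m0, num_var_genes=250)
dim(m0n)   # 250 most variable genes x 500 cells
pca0 <- simple_PCA(m0n, k=10)
names(pca0)
```
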
For our benchmarks below, we'll use the following block sizes:

- normalization of the sparse datasets: 250 Mb
- normalization of the dense datasets: 40 Mb
- on-disk realization of the normalized sparse datasets: 100 Mb
- on-disk realization of the normalized dense datasets: 250 Mb
- PCA on the normalized sparse datasets: 40 Mb
- PCA on the normalized dense datasets: 100 Mb (the default)

## Monitoring memory usage

While manually running our benchmarks below on a Linux or macOS system, we
will also monitor memory usage at the command line, in a terminal, with:

    (while true; do ps u -p <PID>; sleep 1; done) >ps.log 2>&1 &

where `<PID>` is the process id of our R session. This will allow us to
measure the maximum amount of memory used by the calls to
`simple_normalize()` or `simple_PCA()`.

# Benchmarks

## Normalization

In this section we run `simple_normalize()` on the two smaller test
datasets only (27,998 x 12,500 and 27,998 x 25,000, sparse and dense), and
we report timings and memory usage.

See the **Comprehensive timings obtained on various systems** section at
the end of this document for `simple_normalize()` and `simple_PCA()`
timings obtained on various systems, on all our test datasets, and with
three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

### Normalizing the sparse datasets

Set block size to 250 Mb:

```{r set_block_size_for_normalize_sparse}
DelayedArray::setAutoBlockSize(2.5e8)
```

#### 27,998 x 12,500 sparse dataset

```{r normalize_sparse1}
dim(sparse1)
system.time(sparse1n <- simple_normalize(sparse1))
gc()
dim(sparse1n)
```

#### 27,998 x 25,000 sparse dataset

```{r normalize_sparse2}
dim(sparse2)
system.time(sparse2n <- simple_normalize(sparse2))
gc()
dim(sparse2n)
```

#### About memory usage

With this block size (250 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 2.6 Gb at all times.

### Normalizing the dense datasets

Set block size to 40 Mb:

```{r set_block_size_for_normalize_dense}
DelayedArray::setAutoBlockSize(4e7)
```

#### 27,998 x 12,500 dense dataset

```{r normalize_dense1}
dim(dense1)
system.time(dense1n <- simple_normalize(dense1))
gc()
dim(dense1n)
```

#### 27,998 x 25,000 dense dataset

```{r normalize_dense2}
dim(dense2)
system.time(dense2n <- simple_normalize(dense2))
gc()
dim(dense2n)
```

#### About memory usage

With this block size (40 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 1.8 Gb at all times.

## On-disk realization of the normalized datasets

Note that the normalized datasets obtained in the previous section are
DelayedMatrix objects that carry delayed operations. These can be displayed
with `showtree()`:

```{r show_sparse1n_delayed_ops}
class(sparse1n)
showtree(sparse1n)
```

```{r show_dense1n_delayed_ops}
class(dense1n)
showtree(dense1n)
```

Before we proceed with PCA, we're going to write the normalized datasets
to new HDF5 files. This has a significant cost but, on the other hand, it
has the benefit of triggering _on-disk realization_ of the objects: all
the delayed operations carried by an object get realized on-the-fly before
the matrix data actually lands on the disk, making the new object
(TENxMatrix or HDF5Matrix) more efficient for PCA or whatever
block-processed operations come next.

### On-disk realization of the normalized sparse datasets

Set block size to 100 Mb:

```{r set_block_size_for_realize_sparse}
DelayedArray::setAutoBlockSize(1e8)
```

#### 1000 x 12,500 normalized sparse dataset

```{r writeTENxMatrix_sparse1n}
dim(sparse1n)
sparse1n_path <- tempfile()
system.time(
    sparse1n <- writeTENxMatrix(sparse1n, sparse1n_path,
                                group="matrix", level=0)
)
gc()
```

The new `sparse1n` object is a _pristine_ TENxMatrix object:

```{r show_pristine_sparse1n}
class(sparse1n)
showtree(sparse1n)  # "pristine" object (i.e. no more delayed operations)
```

#### 1000 x 25,000 normalized sparse dataset

```{r writeTENxMatrix_sparse2n}
dim(sparse2n)
sparse2n_path <- tempfile()
system.time(
    sparse2n <- writeTENxMatrix(sparse2n, sparse2n_path,
                                group="matrix", level=0)
)
gc()
```

The new `sparse2n` object is a _pristine_ TENxMatrix object:

```{r show_pristine_sparse2n}
class(sparse2n)
showtree(sparse2n)  # "pristine" object (i.e. no more delayed operations)
```

#### About memory usage

With this block size (100 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 2.5 Gb at all times.

### On-disk realization of the normalized dense datasets

Set block size to 250 Mb:

```{r set_block_size_for_realize_dense}
DelayedArray::setAutoBlockSize(2.5e8)
```

#### 1000 x 12,500 normalized dense dataset

```{r writeHDF5Array_dense1n}
dim(dense1n)
dense1n_path <- tempfile()
system.time(
    dense1n <- writeHDF5Array(dense1n, dense1n_path,
                              name="dense1n", level=0)
)
gc()
```

The new `dense1n` object is a _pristine_ HDF5Matrix object:

```{r show_pristine_dense1n}
class(dense1n)
showtree(dense1n)  # "pristine" object (i.e. no more delayed operations)
```

#### 1000 x 25,000 normalized dense dataset

```{r writeHDF5Array_dense2n}
dim(dense2n)
dense2n_path <- tempfile()
system.time(
    dense2n <- writeHDF5Array(dense2n, dense2n_path,
                              name="dense2n", level=0)
)
gc()
```

The new `dense2n` object is a _pristine_ HDF5Matrix object:

```{r show_pristine_dense2n}
class(dense2n)
showtree(dense2n)  # "pristine" object (i.e. no more delayed operations)
```

#### About memory usage

With this block size (250 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 1.6 Gb at all times.

## PCA

In this section we run `simple_PCA()` on the two normalized datasets
obtained in the previous section (1000 x 12,500 and 1000 x 25,000, sparse
and dense), and we report timings and memory usage.

See the **Comprehensive timings obtained on various systems** section at
the end of this document for `simple_normalize()` and `simple_PCA()`
timings obtained on various systems, on all our test datasets, and with
three different block sizes: 40 Mb, 100 Mb, and 250 Mb.

### PCA on the normalized sparse datasets

Set block size to 40 Mb:

```{r set_block_size_for_PCA_sparse}
DelayedArray::setAutoBlockSize(4e7)
```

#### 1000 x 12,500 normalized sparse dataset

```{r PCA_sparse1n}
sparse1n <- TENxMatrix(sparse1n_path)
dim(sparse1n)
system.time(pca1s <- simple_PCA(sparse1n))
gc()
```

#### 1000 x 25,000 normalized sparse dataset

```{r PCA_sparse2n}
sparse2n <- TENxMatrix(sparse2n_path)
dim(sparse2n)
system.time(pca2s <- simple_PCA(sparse2n))
gc()
```

#### About memory usage

With this block size (40 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 1.8 Gb at all times.

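Note that the `svds()` output stored in `pca1s` and `pca2s` is not yet in
the usual "cell embeddings" form. The sketch below (not evaluated here;
`cell_embeddings` is a made-up name) shows one common way to derive
per-cell coordinates from it, assuming the standard `u`/`d`/`v` components
returned by `RSpectra::svds()`:

```{r cell_embeddings_sketch, eval=FALSE}
## The columns of 'sparse1n' are the cells, so the right singular vectors
## ('v') have one row per cell. Scaling them by the singular values ('d')
## gives PCA-like per-cell coordinates:
cell_embeddings <- pca1s$v %*% diag(pca1s$d)  # 12,500 cells x 25 PCs
dim(cell_embeddings)
```
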
### PCA on the normalized dense datasets

Set block size to 100 Mb (the default):

```{r set_block_size_for_PCA_dense}
DelayedArray::setAutoBlockSize()
```

#### 1000 x 12,500 normalized dense dataset

```{r PCA_dense1n}
dense1n <- HDF5Array(dense1n_path, name="dense1n")
dim(dense1n)
system.time(pca1d <- simple_PCA(dense1n))
gc()
```

#### 1000 x 25,000 normalized dense dataset

```{r PCA_dense2n}
dense2n <- HDF5Array(dense2n_path, name="dense2n")
dim(dense2n)
system.time(pca2d <- simple_PCA(dense2n))
gc()
```

#### About memory usage

With this block size (100 Mb), memory usage (as reported by Unix command
`ps u -p <PID>`, see **Monitoring memory usage** above in this document)
remained <= 2.0 Gb at all times.

### Sanity checks

```{r sanity_checks}
stopifnot(all.equal(pca1s, pca1d))
stopifnot(all.equal(pca2s, pca2d))
```

## Comprehensive timings obtained on various systems

Here we report timings obtained on various systems. For each system, the
results are summarized in a table that shows the normalization,
realization, and PCA timings obtained on all our test datasets and with
three different block sizes: 40 Mb, 100 Mb, and 250 Mb. For each operation,
the best time across the three block sizes is displayed in **bold**.

### DELL XPS 15 laptop (model 9520)

- **OS:** Linux Ubuntu 24.04
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 32GB
- **Disk:** 1TB SSD
- **Disk performance**:
    # Output of 'sudo hdparm -Tt <device>':
     Timing cached reads:   35188 MB in  2.00 seconds = 17620.75 MB/sec
     Timing buffered disk reads: 7842 MB in  3.00 seconds = 2613.57 MB/sec

```{r xps15_timings, echo=FALSE, results='asis'}
title <- "Table 1: Timings for DELL XPS 15 laptop"
make_timings_table("xps15", title=title)
```

### Supermicro SuperServer 1029GQ-TRT

- **OS:** Linux Ubuntu 22.04
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 384GB
- **Disk:** 1.3TB ATA Disk
- **Disk performance**:
    # Output of 'sudo hdparm -Tt <device>':
     Timing cached reads:   20592 MB in  1.99 seconds = 10361.41 MB/sec
     Timing buffered disk reads: 1440 MB in  3.00 seconds = 479.66 MB/sec

```{r rex3_timings, echo=FALSE, results='asis'}
title <- "Table 2: Timings for Supermicro SuperServer 1029GQ-TRT"
make_timings_table("rex3", title=title)
```

### Intel Mac Pro (24-Core Intel Xeon W)

- **OS:** macOS 12.7.6
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 96GB
- **Disk:** 1TB SSD
- **Disk performance**: N/A

```{r lconway_timings, echo=FALSE, results='asis'}
title <- "Table 3: Timings for Intel Mac Pro"
make_timings_table("lconway", title=title)
```

### Apple Silicon Mac Pro (Apple M2 Ultra)

- **OS:** macOS 13.7.1
- **Bioconductor/R versions:** 3.21/4.5
- **RAM:** 192GB
- **Disk:** 2TB SSD
- **Disk performance**: N/A

```{r kjohnson3_timings, echo=FALSE, results='asis'}
title <- "Table 4: Timings for Apple Silicon Mac Pro"
make_timings_table("kjohnson3", title=title)
```

## Final remarks

The Supermicro SuperServer 1029GQ-TRT machine is significantly slower than
the other machines. This is most likely due to its slower disk.

For PCA, choosing the sparse representation (TENxMatrix) and using small
block sizes (40 Mb) is a clear winner.

For normalization, there's no clear best choice between the sparse and
dense representations. More precisely, for this operation, the sparse
representation tends to give the best times when using bigger blocks
(250 Mb), whereas the dense representation tends to give the best times
when using smaller blocks (40 Mb). However, based on the above benchmarks,
there's no clear best choice between "sparse with big blocks" and "dense
with small blocks" in terms of speed. Maybe extending the benchmarks to
include more extreme block sizes (e.g. 20 Mb and 500 Mb) could help break
the tie.

Normalization and PCA times grow roughly linearly with the number of cells
in the dataset, regardless of representation (sparse or dense) or block
size.

For small block sizes (<= 100 Mb), memory usage doesn't seem to be affected
by the number of cells in the dataset, that is, all operations seem to run
in nearly constant memory, regardless of representation (sparse or dense).
However, for bigger block sizes (e.g. 250 Mb), the memory used to process
the sparse datasets tends to increase slightly with the number of cells in
the dataset.

_[TODO] Add some plots to help visualize the above assertions._

# Session information

```{r sessionInfo}
sessionInfo()
```