--- title: "Handling large H5AD datasets" subtitle: 'On-disk versus in-memory storage' author: "Developed by [Gabriel Hoffman](http://gabrielhoffman.github.io/)" documentclass: article vignette: > %\VignetteIndexEntry{Loading large-scale H5AD datasets} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc} output: BiocStyle::html_document: toc: true toc_float: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, warning = FALSE, message = FALSE, error = FALSE, eval = FALSE, tidy = FALSE, dev = c("png"), cache = TRUE ) ``` # Introduction As the scale of single cell RNA-seq datasets continues to increase, loading the entire dataset into memory is often impractical. Dreamlet can take advantage of efficient data storage through the [Bioconductor's](https://bioconductor.org) [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) interface to dramatically reduce memory usage. There are two general methods for storing large single-cell datasets. [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/), and therefore `dreamlet`, are compatible with both: 1) __On-disk storage__: The data remains on the physical disk and is only read into memory when requested. In R there is little data duplication in memory since operations are delayed until the result is requested using `DelayedMatrix`. Memory usage is minimized at the cost of performance since disk access can be slow. 2) __In-memory storage__: Single cell read counts are generally sprase with, say, 20% of read counts being non-zero. The counts can be stored as a `sparseMatrix` that stores only non-zero entries. Performance is higher since data is in memory, but excessive memory usage can be a problem. # On-disk storage: zellkonverter [zellkonverter](https://github.com/theislab/zellkonverter) takes advantage of the [H5AD](https://anndata.readthedocs.io/en/latest/fileformat-prose.html) file format built on the [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) format in order to dramatically reduce memory usage while still retaining performance. The [zellkonverter](https://github.com/theislab/zellkonverter) package uses a [DelayedArray](https://bioconductor.org/packages/DelayedArray/) backend to provide a seamless interface to an on-disk H5AD dataset through the interface of the [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/) class. `zellkonverter` is an interface for python code. Note that the python dependencies are installed the first time `readH5AD()` is run, rather than when the package is installed. So the first run of `readH5AD()` can take a few minutes, but subsequent runs will be fast. ## Standard usage ```{r example} library(zellkonverter) library(SingleCellExperiment) # Create SingleCellExperiment object that points to on-disk H5AD file sce <- readH5AD(h5ad_file, use_hdf5 = TRUE) ``` ## Merge multiple datasets ```{r example2} # Read a series of H5AD files into a list # then combine them into a single merged SingleCellExperiment sce.lst <- lapply(h5ad_files, function(file) { readH5AD(file, use_hdf5 = TRUE) }) sce <- do.call(cbind, sce.list) ``` ## Access alternative data Some software, including [pegasus](https://pegasus.readthedocs.io/en/stable/), store normalized data in the default portion of the file and save raw counts in the `raw/` field. Here, load a H5AD file generated by pegasus. ```{r example3} # read raw/ from H5AD file # raw = TRUE tells readH5AD() to read alternative data # Must use zellkonverter >=1.3.3 sce_in <- readH5AD(h5ad_file, use_hdf5 = TRUE, raw = TRUE) # use `raw` as counts sce <- swapAltExp(sce_in, "raw") # use raw as main counts(sce) <- assay(sce, "X") # set counts assay to data in X assay(sce, "X") <- NULL # free X ``` # In-memory storage: Seurat [Seurat](https://satijalab.org/seurat) can also convert and import H5AD files, and then convert to [SingleCellExperiment](https://bioconductor.org/packages/SingleCellExperiment/). The resulting `SingleCellExperiment` object stores data as a `sparseMatrix`. ```{r Seurat} library(SingleCellExperiment) library(SeuratDisk) # Convert h5ad file to h5seurat format Convert(h5ad_file, dest = "h5seurat") # load Seurat file obj <- LoadH5Seurat(h5seurat_file) # Convert Seurat object to SingleCellExperiment sce <- as.SingleCellExperiment(obj) ``` # Comparing interfaces The `SingleCellExperiment` interface to `zellkonverter` and `Seurat` hides the backend differences from the typical R user. The usage of `dreamlet` is the same in both cases. For small to medium datasets, the performance differences should be minimal. However, for large datasets there can be a substantial difference in performance. Use of `zellkonverter` and `DelayedMatrix` minimize memory usage at the cost of computationl performance. Use of `Seurat` and `sparseMatrix` can be faster if there is sufficent memory. `aggregateToPseudoBulk()` for computing pseudobulk data uses different backend implements for on-disk `DelayedMatrix` and in-memory `sparseMatrix`. Both versions are highly optimized compared to a naive implementation. # Session Info
```{r session, echo=FALSE} sessionInfo() ```