--- title: "SparseArray objects" author: - name: Hervé Pagès affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA date: "Compiled `r BiocStyle::doc_date()`; Modified 8 April 2024" package: SparseArray vignette: | %\VignetteIndexEntry{SparseArray objects} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document --- ```{r setup, include=FALSE} library(BiocStyle) ``` # Introduction `r Biocpkg("SparseArray")` is an infrastructure package that provides an array-like container for efficient in-memory representation of multidimensional sparse data in R. The package defines the SparseArray virtual class and two concrete subclasses: COO\_SparseArray and SVT\_SparseArray. Each subclass uses its own internal representation of the nonzero multidimensional data, the "COO layout" and the "SVT layout", respectively. Note that the SparseArray virtual class could easily be extended by other S4 classes that intent to implement alternative internal representations of the nonzero multidimensional data. This vignette focuses on the SVT\_SparseArray container. # Install and load the package ```{r, eval=FALSE} if (!require("BiocManager", quietly=TRUE)) install.packages("BiocManager") BiocManager::install("SparseArray") ``` ```{r, message=FALSE} library(SparseArray) ``` # SVT\_SparseArray objects The SVT\_SparseArray container provides an efficient representation of the nonzero multidimensional data via a novel layout called the "SVT layout". Note that SVT\_SparseArray objects mimic as much as possible the behavior of ordinary matrix and array objects in base R. In particular, they suppport most of the "standard matrix and array API" defined in base R and in the `r Biocpkg("matrixStats")` package from CRAN. ## Construction SVT\_SparseArray objects can be constructed in many ways. A common way is to coerce an ordinary matrix or array to SVT\_SparseArray: ```{r} m <- matrix(0L, nrow=6, ncol=4) m[c(1:2, 8, 10, 15:17, 24)] <- (1:8)*10L svt1 <- as(m, "SVT_SparseArray") svt1 a <- array(0L, 5:3) a[c(1:2, 8, 10, 15:17, 20, 24, 40, 56:60)] <- (1:15)*10L svt2 <- as(a, "SVT_SparseArray") svt2 ``` Alternatively, the ordinary matrix or array can be passed to the `SVT_SparseArray` constructor function: ```{r} svt1 <- SVT_SparseArray(m) svt2 <- SVT_SparseArray(a) ``` Note that coercing an ordinary matrix or array to SparseArray is also supported and will produce the same results: ```{r} svt1 <- as(m, "SparseArray") svt2 <- as(a, "SparseArray") ``` This is because, for most use cases, the SVT\_SparseArray representation is more efficient than the COO\_SparseArray representation, so the former is usually preferred over the latter. For the same reason, the `SparseArray` constructor function will also give the preference to the SVT\_SparseArray representation: ```{r} svt1 <- SparseArray(m) svt2 <- SparseArray(a) ``` This is actually the most convenient way to turn an ordinary matrix or array into an SVT\_SparseArray object. Coercion back to ordinary matrix or array is supported: ```{r} as.array(svt1) # same as as.matrix(svt1) as.array(svt2) ``` ## Accessors The standard array accessors are supported: ```{r} dim(svt2) length(svt2) dimnames(svt2) <- list(NULL, letters[1:4], LETTERS[1:3]) svt2 ``` Some additional accessors defined in the `r Biocpkg("S4Arrays")` / `r Biocpkg("SparseArray")` framework: ```{r} type(svt1) type(svt1) <- "double" svt1 is_sparse(svt1) ``` Other accessors/extractors specific to sparse arrays: ```{r} ## Get the number of nonzero array elements in 'svt1': nzcount(svt1) ## Extract the "linear indices" of the nonzero array elements in 'svt1': nzwhich(svt1) ## Extract the "array indices" (a.k.a. "array coordinates") of the ## nonzero array elements in 'svt1': nzwhich(svt1, arr.ind=TRUE) ## Extract the values of the nonzero array elements in 'svt1' and return ## them in a vector "parallel" to 'nzwhich(svt1)': #nzvals(svt1) # NOT READY YET! sparsity(svt1) ``` See `?SparseArray` for more information and additional examples. ## Subsetting and subassignment ```{r} svt2[5:3, , "C"] ``` Like with ordinary arrays in base R, assigning values of type `"double"` to an SVT\_SparseArray object of type `"integer"` will automatically change the type of the object to `"double"`: ```{r} type(svt2) svt2[5, 1, 3] <- NaN type(svt2) ``` See `?SparseArray_subsetting` for more information and additional examples. ## Summarization methods (whole array) The following summarization methods are provided at the moment: `anyNA()`, `any`, `all`, `min`, `max`, `range`, `sum`, `prod`, `mean`, `var`, `sd`. ```{r} anyNA(svt2) range(svt2, na.rm=TRUE) mean(svt2, na.rm=TRUE) var(svt2, na.rm=TRUE) ``` See `?SparseArray_summarization` for more information and additional examples. ## Operations from the 'Ops', 'Math', 'Math2', and 'Complex' groups SVT\_SparseArray objects support operations from the 'Ops', 'Math', `Math2`, and 'Complex' groups, with some restrictions. See `?S4groupGeneric` in the `r Biocpkg("methods")` package for more information about these group generics. ```{r} signif((svt1^1.5 + svt1) %% 100 - 0.6 * svt1, digits=2) ``` See `?SparseArray_Ops`, `?SparseArray_Math`, and `?SparseArray_Complex`, for more information and additional examples. ## Other operations on SVT\_SparseArray objects More operations will be added in the future e.g. `is.na()`, `is.infinite()`, `is.nan()`, etc... ## Generate a random SVT\_SparseArray object Two convenience functions are provided for this: ```{r} randomSparseArray(c(5, 6, 2), density=0.5) poissonSparseArray(c(5, 6, 2), density=0.5) ``` See `?randomSparseArray` for more information and additional examples. # SVT\_SparseMatrix objects ## Transposition ```{r} t(svt1) ``` Note that multidimensional transposition is supported via `aperm()`: ```{r} aperm(svt2) ``` See `?SparseArray_aperm` for more information and additional examples. ## Combine multidimensional objects along a given dimension Like ordinary matrices in base R, SVT\_SparseMatrix objects can be combined by rows or columns, with `rbind()` or `cbind()`: ```{r} svt3 <- poissonSparseMatrix(6, 2, density=0.5) cbind(svt1, svt3) ``` Note that multidimensional objects can be combined along any dimension with `abind()`: ```{r} svt4a <- poissonSparseArray(c(5, 6, 2), density=0.4) svt4b <- poissonSparseArray(c(5, 6, 5), density=0.2) svt4c <- poissonSparseArray(c(5, 6, 4), density=0.2) abind(svt4a, svt4b, svt4c) svt5a <- aperm(svt4a, c(1, 3:2)) svt5b <- aperm(svt4b, c(1, 3:2)) svt5c <- aperm(svt4c, c(1, 3:2)) abind(svt5a, svt5b, svt5c, along=2) ``` See `?SparseArray_abind` for more information and additional examples. ## Matrix multiplication and cross-product Like ordinary matrices in base R, SVT\_SparseMatrix objects can be multiplied with the `%*%` operator: ```{r} m6 <- matrix(0L, nrow=5, ncol=6, dimnames=list(letters[1:5], LETTERS[1:6])) m6[c(2, 6, 12:17, 22:30)] <- 101:117 svt6 <- SparseArray(m6) svt6 %*% svt3 ``` They also support `crossprod()` and `tcrossprod()`: ```{r} crossprod(svt3) ``` See `?SparseMatrix_mult` for more information and additional examples. ## matrixStats methods The `r Biocpkg("SparseArray")` package provides memory-efficient col/row summarization methods for SVT\_SparseMatrix objects: ```{r} colVars(svt6) ``` Note that multidimensional objects are supported: ```{r} colVars(svt2) colVars(svt2, dims=2) colAnyNAs(svt2) colAnyNAs(svt2, dims=2) ``` See `?matrixStats_methods` for more information and additional examples. ## `rowsum()` method A `rowsum()` method is provided: ```{r} rowsum(svt6, group=c(1:3, 2:1)) ``` See `?rowsum_methods` for more information and additional examples. ## Read/write a sparse matrix from/to a CSV file Use `writeSparseCSV()` to write a sparse matrix to a CSV file: ```{r} csv_file <- tempfile() writeSparseCSV(m6, csv_file) ``` Use `readSparseCSV()` to read the file. This will import the data as an SVT\_SparseMatrix object: ```{r} readSparseCSV(csv_file) ``` See `?readSparseCSV` for more information and additional examples. # Comparison with dgCMatrix objects The nonzero data of a SVT\_SparseArray object is stored in a _Sparse Vector Tree_. This internal data representation is referred to as the "SVT layout". It is similar to the "CSC layout" (compressed, sparse, column-oriented format) used by CsparseMatrix derivatives from the `r CRANpkg("Matrix")` package, like dgCMatrix or lgCMatrix objects, but with the following improvements: - The "SVT layout" supports sparse arrays of arbitrary dimensions. - With the "SVT layout", the sparse data can be of any type. Whereas CsparseMatrix derivatives only support sparse data of type `"double"` or `"logical"` at the moment. - The "SVT layout" imposes no limit on the number of nonzero array elements that can be stored. With dgCMatrix/lgCMatrix objects, this number must be < 2^31. - Overall, the "SVT layout" allows more efficient operations on SVT\_SparseArray objects. See `?SVT_SparseArray` for more information. # Learn more Please consult the individual man pages in the `r Biocpkg("SparseArray")` package to learn more about SVT\_SparseArray objects and about the package. A good starting point is the man page for SparseArray objects: `?SparseArray` # Session information ```{r} sessionInfo() ```