```{r, echo=FALSE, results="hide", message=FALSE}
require(knitr)
opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```
```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```
# A common C++ API for all types of R matrices
Package: `r Rpackage("beachmat")`
Author: Aaron Lun (alun@wehi.edu.au)
Compilation date: `r Sys.Date()`
# Introduction
The _beachmat_ package provides a C++ API for handling a variety of R matrix types.
The aim is to abstract away the specific type of matrix object when writing C++ extensions, thus simplifying the processing of data stored in those objects.
Currently, the API supports double-precision, integer, logical and character matrices.
Supported classes include base `matrix` objects, a number of classes from the `r CRANpkg("Matrix")` package, and disk-backed matrices from the `r Biocpkg("HDF5Array")` package.
# Linking to the package
## Prerequisites
The _beachmat_ package currently has several dependencies:
- The compiler should support the C++11 standard.
- `r CRANpkg("Rcpp")`, `r Biocpkg("Rhdf5lib")`, `r Biocpkg("HDF5Array")` and `r Biocpkg("DelayedArray")` should be installed.
Most of the following instructions are ripped from the `r Biocpkg("Rhtslib")` vignette.
## Link to the library
To link successfully to the _beachmat_ library, a package must include *both* a `src/Makevars.win` *and* `src/Makevars` file.
**Note**: the contents of `src/Makevars.win` and `src/Makevars` are almost identical, but not quite.
Be careful of the differences.
Create a `src/Makevars.win` file with the following lines:
```
BEACHMAT_LIBS=$(shell echo 'beachmat::pkgconfig("PKG_LIBS")'|\
"${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS=$(BEACHMAT_LIBS)
```
... and a `src/Makevars` file with the following lines:
```
BEACHMAT_LIBS=`echo 'beachmat::pkgconfig("PKG_LIBS")'|\
"${R_HOME}/bin/R" --vanilla --slave`
PKG_LIBS=$(BEACHMAT_LIBS)
```
The statement for each platfrom modifies the `$PKG_LIBS` variable.
If your package needs to add to the `$PKG_LIBS` variable, do so by adding to the `PKG_LIBS=$(BEACHMAT_LIBS)` line.
For example:
```
PKG_LIBS=$(BEACHMAT_LIBS) -L/path/to/foolib -lfoo
```
The Linux implementation embeds the location of the _beachmat_ library in the package-specific shared object via the compiler flag `-Wl,rpath,path`, where path is determined by `system.file("lib", package="beachmat")`.
The path determined by `system.file()` is from `.libPaths()` and will resolve all symbolic links.
This can cause problems, e.g., when the "head" node of a cluster mimicks the cluster node via a symbolic link to the directory in which _beachmat_ is installed.
Use the environment variable `BEACHMAT_RPATH` to resolve this by setting it to the cluster-node accessible path.
Similar arguments apply to `r Biocpkg("Rhdf5lib")` with the environment variable `RHDF5LIB_RPATH`.
## Find headers
In order for the C/C++ compiler to find the _beachmat_ package headers during installation, add the following to the `LinkingTo` field of the `DESCRIPTION` file:
```
LinkingTo: Rcpp, Rhdf5lib, beachmat
```
In C or C++ code files, use standard techniques, e.g., `#include "beachmat/numeric_matrix.h"` (see below for more details).
Header files are available for perusal at the following location (enter in an R session):
```{R headers}
system.file(package="beachmat", "include")
```
## Finishing up
You need to tell the build system to use C++11, by modifying the `SystemRequirements` field of the `DESCRIPTION` file:
```
SystemRequirements: C++11
```
You also need to ensure that `r CRANpkg("Rcpp")` is initialized when your package is loaded.
This requires addition of `Rcpp` to the `Imports` field of the `DESCRIPTION` file:
```
Imports: Rcpp
```
... and a corresponding `importFrom` specification in the `NAMESPACE` file:
```
importFrom(Rcpp, sourceCpp)
```
(The exact function to be imported doesn't matter, as long as the namespace is loaded.
Check out the `r CRANpkg("Rcpp")` documentation for more details.)
`r Biocpkg("HDF5Array")`, `r Biocpkg("DelayedArray")` and _beachmat_ itself should be added to the `Suggests` field, as the API will perform some calls to R functions in those packages to query certain parameters.
If you intend to accept instances of `r CRANpkg("Matrix")` classes, the package should also be listed in the `Suggests` field, if not already in `Imports` or `Depends`:
```
Suggests: beachmat, HDF5Array, DelayedArray, Matrix
```
# Overview of the API (input)
## Creating the matrix pointer
We demonstrate the use of the API for numeric matrices.
First, we include the relevant header file:
```
#include "beachmat/numeric_matrix.h"
```
A double-precision matrix object `dmat` is handled in C++ by passing the `SEXP` struct from `.Call` to `create_numeric_matrix`:
```
std::unique_ptr dptr = beachmat::create_numeric_matrix(dmat);
```
This creates a unique pointer that points to an object of the `numeric_matrix` virtual class.
The exact class depends on the type of matrix in `dmat`, though the behaviour of the user-level functions are not affected by this detail.
## Methods for input matrices
The available methods for this object are:
- `dptr->get_nrow()` returns the number of rows (`nrow`) as a `size_t`.
- `dptr->get_ncol()` returns the number of columns (`ncol`) as a `size_t`.
- `dptr->get_col(c, in)` takes a `Rcpp::Vector::iterator` object `in` and fills it with values at column `c`.
There should be at least `nrow` accessible elements, i.e., `*in` and `*(in+nrow-1)` should be valid entries.
No value is returned by this method.
- `dptr->get_col(c, in, first, last)` takes a `Rcpp::Vector::iterator` object `in` and fills it with values at column `c` from row `first` to `last-1`.
There should be at least `last-first` accessible elements, i.e., `*in` and `*(in+last-first-1)` should be valid entries.
No value is returned by this method.
- `dptr->get_row(r, in)` takes an `Rcpp::Vector::iterator` object `in` and fills it with values at row `r`.
There should be at least `ncol` accessible elements, i.e., `*in` and `*(in+ncol-1)` should be valid entries.
No value is returned by this method.
- `dptr->get_row(r, in, first, last)` takes an `Rcpp::Vector::iterator` object `in` and fills it with values at row `r` from column `first` to `last-1`.
There should be at least `last-first` accessible elements, i.e., `*in` and `*(in+last-first-1)` should be valid entries.
No value is returned by this method.
- `dptr->get(r, c)` returns a double at matrix entry `(r, c)`.
- `dptr->clone()` returns a unique pointer to a `numeric_matrix` instance of the same type as that pointed to by `dptr`.
- `dptr->get_matrix_type()` returns a `matrix_type` value specifying the specific matrix representation that is pointed to by `dptr`.
This is an enumeration type that can be tested against constants like `beachmat::SIMPLE` or `beachmat::SPARSE`.
- `dptr->yield()` returns a `Rcpp::RObject` containing the original R matrix that was used to create `dptr`.
In all cases, `r` and `c` should be non-negative integers (specificaly `size_t`) in `[0, nrow)` and `[0, ncol)` respectively.
Zero-based indexing is assumed for both `r` and `c`, as is standard for most C/C++ applications.
Similar rules apply to `first` and `last`, which should be in `[0, nrow]` for `get_col` and in `[0, ncol]` for `get_row`.
Furthermore, `last >= first` should be true.
If the object `X` is a `Rcpp::NumericVector::iterator` instance, matrix entries will be extracted as double-precision values.
If it is a `Rcpp::IntegerVector::iterator` instance, matrix entries will be extracted as integers with implicit conversion.
It is also _possible_ to use a `Rcpp::LogicalVector::iterator`, though this will not behave as expected - see notes below.
## Special methods for specific matrix types
There are additional methods that provide some advantages for specific matrix representations:
- `dptr->get_const_col(c, work, first, last)` returns a `Rcpp::Vector::const_iterator` pointing to `first` row of column `c`.
The arguments are the same as `dptr->get_col` with `work` being equivalent to `X`.
(The data type of `work` must correspond to that of the matrix - in this case, it should be a `Rcpp::NumericVector::iterator`.)
For simple matrices, this function is more efficient than `get_col` as it returns the iterator without needing to copy data into `work`.
For other matrices, this function simply calls `get_col` and returns an iterator to the start of `work`.
- `dptr->get_nonzero_col(c, index, values, first, last)` takes a `Rcpp::IntegerVector::iterator` object `index` and a `Rcpp::Vector::iterator` object `values`.
For each non-zero entry in column `c` from rows `[first, last)`, its row index is stored in the memory pointed to by `index` and its value is stored in `values`.
(Both iterators should point to memory with at least `last-first` addressable elements.)
The return value of the function is the number of non-zero entries stored in this manner.
This function is quite efficient for sparse matrices; for all other matrices, `get_col` is called and zeros are stripped out afterwards.
- `dptr->get_nonzero_col(r, index, values, first, last)` takes a `Rcpp::IntegerVector::iterator` object `index` and a `Rcpp::Vector::iterator` object `values`.
For each non-zero entry in row `r` from columns `[first, last)`, its column index is stored in the memory pointed to by `index` and its value is stored in `values`.
(Both iterators should point to memory with at least `last-first` addressable elements.)
The return value of the function is the number of non-zero entries stored in this manner.
This function is quite efficient for sparse matrices; for all other matrices, `get_row` is called and zeros are stripped out afterwards.
Obviously, the `get_nonzero_*` functions are not available for character matrices.
## Other matrix types
Logical, integer and character matrices can be handled by including the following header files:
```
#include "beachmat/logical_matrix.h"
#include "beachmat/integer_matrix.h"
#include "beachmat/character_matrix.h"
```
The dispatch function changes correspondingly, for logical matrix `lmat`, integer matrix `imat` and character matrix `cmat`:
```
std::unique_ptr lptr=beachmat::create_logical_matrix(lmat);
std::unique_ptr iptr=beachmat::create_integer_matrix(imat);
std::unique_ptr cptr=beachmat::create_character_matrix(cmat);
```
Equivalent methods are available for each matrix types with appropriate changes in type.
For integer and logical matrices, `get` will return an integer,
while `X` can be an `iterator` object of a `Rcpp::IntegerVector`, `Rcpp::LogicalVector` or `Rcpp::NumericVector` instance (type conversions are implicitly performed as necessary).
For character matrices, `X` should be of type `Rcpp::StringVector::iterator`, and `get` will return a `Rcpp::String`.
The following matrix classes are supported:
- numeric: `matrix`, `dgeMatrix`, `dgCMatrix`, `dspMatrix`, `RleMatrix`, `HDF5Matrix`, `DelayedMatrix`
- integer: `matrix`, `RleMatrix`, `HDF5Matrix`, `DelayedMatrix`
- logical: `matrix`, `lgeMatrix`, `lgCMatrix`, `lspMatrix`, `RleMatrix`, `HDF5Matrix`, `DelayedMatrix`
- character: `matrix`, `RleMatrix`, `HDF5Matrix`, `DelayedMatrix`
Additional classes can be added on a need-to-use basis.
As a general rule, if a matrix-like object can be stored in a `SummarizedExperiment` class (from the `r Biocpkg("SummarizedExperiment")` package), the API should be able to handle it.
Please contact the maintainers if you have a class that you would like to see supported.
## Important developer information
- For non-`logical` matrices, using a `Rcpp::LogicalVector::iterator` in the `get_*` methods is not recommended.
For `numeric_matrix` instances, double-to-integer conversion is performed such that values in `(-1, 1)` are converted to integer `0`.
This would be interpreted as a logical `FALSE`, which is incorrect for non-zero double-precision values.
For `integer_matrix` instances, integer values are not coerced to `{0, 1}` when they are assigned into the `Rcpp::LogicalVector`.
Thus, even though the interpretation is correct, the vector produced will not be equivalent to the result of an `as.logical` call.
- The API is not thread-safe for simultaneous calls to the `get` methods from different threads.
Some methods use cached class members for greater efficiency, and simultaneous calls will cause race conditions.
It is the responsibility of the calling function to lock (and unlock) access to a single `*_matrix` object across threads.
Alternatively, the `clone` method can be called to generate a unique pointer to a _new_ `*_matrix` instance, which can be used concurrently in another thread.
This is fairly cheap as the underlying matrix data are not copied.
- When accessing `character_matrix` data, we do not return raw `const char*` pointers to the C-style string.
Rather, the `Rcpp::String` class is used as it provides a convenient wrapper around the underlying `CHARSXP`.
This ensures that the string is stored in R's global cache and is suitably protected against garbage collection.
- `DelayedMatrix` objects are automatically realized via the `realize` method in the `r Biocpkg("DelayedArray")` package.
This uses the same realization backend that was specified in R -- call `getRealizationBackend()` to determine the current backend.
If the realized matrix is to be reused, it may be more efficient to perform the realization in R and pass the result to `.Call`.
- The API will happily throw exceptions of the `std::exception` class, containing an informative error message.
These should be caught and handled gracefully by the end-user code, otherwise a segmentation fault will probably occur.
See the error-handling mechanism in `r CRANpkg("Rcpp")` for how to deal with these exceptions.
- For numeric matrices, _beachmat_ does not support higher-level matrix operations such as addition, multiplication or various factorizations.
Rather, the `yield` method can be used to obtain the original `Rcpp::RObject` for input to `r CRANpkg("RcppArmadillo")` or `r CRANpkg("RcppEigen")`.
This functionality is generally limited to base matrices, though there is also limited support for sparse matrices in these libraries.
# Overview of the API (output)
## Specifying the output type
Three types of output matrices are supported - simple `matrix`, `*gCMatrix` and `HDF5Matrix` objects.
For example, a simple numeric output matrix with `nrow` rows and `ncol` columns is created by:
```
std::unique_ptr odmat=beachmat::create_numeric_output(nrow, ncol, beachmat::SIMPLE_PARAM);
```
A sparse matrix is similarly created by setting the last argument to `beachmat::SPARSE_PARAM`,
while a `HDF5Matrix` is constructed by setting `beachmat::HDF5_PARAM`.
These constants are instances of the `output_param` class that specify the type and parameters of the output matrix to be constructed.
Another option is to allow the function to dynamically choose the output type to match that of an existing matrix.
This is useful for automatically choosing an output format that reflects the choice of input format.
For example, if data are supplied to a function in a simple matrix, it would be reasonable to expect that the output is similarly small enough to be stored as a simple matrix.
On the other hand, if the input is a `HDF5Matrix`, it may make more sense to return a `HDF5Matrix` object.
Dynamic choice of output type is performed by using the `Rcpp::Robject` object containing the input matrix to initialize the `output_param` object.
If I have a matrix object `dmat`, the output type can be matched to the input type with:
```
beachmat::output_param oparam(dmat, /* simplify = */ true, /* preserve_zero = */ false);
std::unique_ptr odmat=beachmat::create_numeric_output(nrow, ncol, oparam);
```
A similar process can be used for a pointer `dptr` to an existing `*_matrix` instance:
```
beachmat::output_param oparam(dptr->get_matrix_type(), /* simplify = */ true, /* preserve_zero = */ false);
```
The `simplify` argument indicates whether non-`matrix` input objects should be "simplified" to a `matrix` output object.
If `false`, a `HDF5Matrix` output object will be returned instead.
The `preserve_zero` argument indicates whether a `*gCMatrix` input should result in a `*gCMatrix` output when `simplify=false` (for logical or double-precision data only).
Exact zeroes are detected and ignored when filling this matrix.
## Methods for output matrices
To put data into the output matrix pointed to by `dptr`, the following methods are available:
- `dptr->set_col(c, out)` fills column `c` with elements pointed to by a `Rcpp::NumericVector::iterator` object (`out`).
There should be at least `nrow` accessible elements, i.e., `*out` and `*(out+nrow-1)` should be valid entries.
No value is returned by this method.
- `dptr->set_col(c, out, first, last)` fills column `c` from row `first` to `last-1`, using the elements pointed to by a `Rcpp::NumericVector::iterator` object (`out`).
There should be at least `last-first` accessible elements, i.e., `*out` and `*(out+last-first-1)` should be valid entries.
No value is returned by this method.
- `dptr->set_row(r, out)` fills row `r` with elements pointed to by a `Rcpp::NumericVector::iterator` object (`out`).
There should be at least `ncol` accessible elements, i.e., `*out` and `*(out+ncol-1)` should be valid entries.
No value is returned by this method.
- `dptr->set_row(r, out, first, last)` fills row `r` from column `first` to `last-1`, with elements pointed to by a `Rcpp::NumericVector::iterator` object (`out`).
There should be at least `last-first` accessible elements, i.e., `*out` and `*(out+last-first-1)` should be valid entries.
No value is returned by this method.
- `dptr->set(r, c, Y)` fills the matrix entry `(r, c)` with the double `Y`.
No value is returned by this method.
- `dptr->yield()` returns a `Rcpp::RObject` object containing the matrix data to pass into R.
The allowable ranges of `r`, `c`, `first` and `last` are the same as previously described.
The `get_nrow`, `get_ncol`, `get_row`, `get_col`, `get`, `get_matrix_type` and `clone` methods are also available and behave as described for `numeric_matrix` objects.
## Other matrix types
Logical, integer and character output matrices are supported by changing the types in the creator function (and its variants):
```
std::unique_ptr oimat=beachmat::create_integer_output(nrow, ncol);
std::unique_ptr olmat=beachmat::create_logical_output(nrow, ncol);
std::unique_ptr ocmat=beachmat::create_character_output(nrow, ncol);
```
Equivalent methods are available for these matrix types.
For integer and logical matrices, `X` should be of type `Rcpp::IntegerVector::iterator` and `Rcpp::LogicalVector::iterator`, respectively, and `Y` should be an integer.
For character matrices, `X` should be of type `Rcpp::StringVector::iterator` and `Y` should be a `Rcpp::String` object.
## Important developer information
- Creation of a `HDF5Matrix` will perform a new `getHDF5DumpFile()` call with `for.use=TRUE` to obtain a new file name for storing the HDF5 output.
Similarly, the name of the data set is obtained via `getHDF5DumpName()`.
Both names are recorded in the dump log via `appendDatasetCreationToHDF5DumpLog()`.
This mimics the behaviour observed when `HDF5Matrix` instances are created at the R level.
Similarly, the compression level and chunk dimensions will be taken from the appropriate global variables.
- By default, the chunk dimensions and compression level for HDF5 output are retrieved using R functions from the `r Biocpkg("HDF5Array")` package.
An alternative is to directly specify the chunk dimensions using `oparam.set_chunk_dim(chunk_nr, chunk_nc)`,
where `chunk_nr` and `chunk_nc` are the chunk rows and columns respectively.
Similarly, the compression level can be set using `oparam.set_compression(compress)`, where `compress` can range from 0 (contiguous) to 9 (most compression).
If specified, these settings will override the default behaviour, but will have no effect for non-HDF5 output.
- For consecutive row and column access from a matrix with dimensions `nr`-by-`nc`, the optimal chunk dimensions can be specified with `oparam.optimize_chunk_dims(nr, nc)`.
_beachmat_ exploits the chunk cache to store all chunks along a row or column, thus avoiding the need to reload data for the next row or column.
These chunk settings are designed to minimize the chunk cache size while also reducing the number of disk reads.
- HDF5 character output is stored as fixed-width character arrays.
As such, the API must know the maximum string length during construction of a `character_output` instance.
This can be set using `oparam.set_strlen(strlen)` where `strlen` is the length of a C-style string, _not including the null-terminating character_.
Any attempts to fill the output matrix with strings larger than `strlen` will result in silent truncation.
- The API is not thread-safe, due to (i) the use of cached class members and (ii) the potential for race conditions when writing to the same location on disk/memory.
The first issue can be solved by using `clone` to create `*_output` copies for use in each thread.
However, each copy may still read from and write to the same disk/memory location.
It is the responsibility of the calling function to ensure that access is locked and unlocked appropriately across multiple threads.
(This may not be necessary if access does not overlap, e.g., writing to different rows.)
# Session information
```{r}
sessionInfo()
```