---
title: Extending alabaster to new data structures
author:
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com
package: alabaster.base
date: "Revised: September 29, 2022"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Extending to new data structures}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(BiocStyle)
fake <- "_alabaster.foo_"
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```

# Overview

Developers can easily extend this framework to support more R/Bioconductor classes by creating their own **alabaster** package.
This involves defining a schema, writing a staging method and writing a loading function.
Readers are assumed to know the basics of R package development.
To demonstrate, let's use the `dgTMatrix` class from the `r CRANpkg("Matrix")` package.
In this vignette, we'll be creating an `r fake` package to save and load these objects to/from file.

# Defining a schema

This involves some initial steps:

1. Clone the [`ArtifactDB/BiocObjectSchemas`](https://github.com/ArtifactDB/BiocObjectSchemas) repository.
2. Read [this document](https://github.com/ArtifactDB/BiocObjectSchemas/blob/master/docs/schema_conventions.md) for the schema conventions.
3. Create a new subdirectory inside `raw/` with the name of your schema.
4. Create a schema file with the specified version, usually `v1.json`.

Our new schema will be called `sparse_triplet_matrix`:

```{r, echo=FALSE, results="asis"}
stuff <- '{
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "sparse_triplet_matrix/v1.json",
    "title": "Sparse Triplet Matrix",
    "type": "object",
    "description": "Sparse matrix in triplet form, as a demonstration of how to use a schema for my sparse matrix type. Data is stored in a CSV file with three columns - `i`, an integer field containing the 0-based row indices of each non-zero element; `j`, an integer field containing the column indices; and `x`, a double-precision field containing the value of the non-zero element.",
    "allOf": [
        { "$ref": "../array/v1.json" },
        { "$ref": "../_md5sum/v1.json" }
    ],
    "properties": {
        "sparse_triplet_matrix": {
            "type": "object",
            "properties": {
                "num_elements": {
                    "type": "integer",
                    "description": "Number of non-zero elements in the sparse matrix.",
                    "minimum": 0
                }
            },
            "required": [ "num_elements" ],
            "allOf": [ { "$ref": "../_compression/v1.json" } ]
        }
    },
    "required": [ "sparse_triplet_matrix" ],
    "_attributes": {
        "format": "text/csv",
        "restore": {
            "R": "alabaster.foo::loadSparseTripletMatrix"
        }
    }
}'
cat("```json", stuff, "```", sep="\n");
```

This involves a few choices:

- Given that we're dealing with an array-like type, we'll "inherit" from the base [`array/v1.json`](https://github.com/ArtifactDB/BiocObjectSchemas/tree/master/raw/array/v1.json) schema.
  This allows the `sparse_triplet_matrix` to be used inside objects that expect arrays, e.g., `SummarizedExperiment`s.
  It also provides a standard place to store the array dimensions and type.
- We'll report the number of non-zero elements inside the `sparse_triplet_matrix` property, just in case anyone cares.
  We'll also report the compression method used for the CSV file.
- As we'll be saving the triplets in CSV format, we set the `text/csv` type in the `_attributes.format` property.
- The `_attributes.restore.R` property specifies the namespaced function to be used to restore the object in an R session.
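For concreteness, here is the sort of metadata document that this schema would validate.
The values are purely illustrative (they mirror the small example object staged later in this vignette), and the required `md5sum` string is omitted for brevity:

```json
{
    "$schema": "sparse_triplet_matrix/v1.json",
    "path": "test/foo.csv.gz",
    "is_child": false,
    "array": {
        "dimensions": [ 10, 10 ],
        "type": "number"
    },
    "sparse_triplet_matrix": {
        "num_elements": 5,
        "compression": "gzip"
    }
}
```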
Once we're happy, we can run the [`scripts/resolver.py`](https://github.com/ArtifactDB/BiocObjectSchemas/tree/master/scripts/resolver.py) script to resolve all the `$ref` references.

```sh
python3 scripts/resolver.py raw resolved
```

This creates a `resolved/sparse_triplet_matrix/v1.json` file that looks like:

```{r, results="asis", echo=FALSE}
resolved <- '
{
    "$id": "sparse_triplet_matrix/v1.json",
    "$schema": "http://json-schema.org/draft-07/schema",
    "_attributes": {
        "format": "text/csv",
        "restore": {
            "R": "alabaster.foo::loadSparseTripletMatrix"
        }
    },
    "description": "Sparse matrix in triplet form, as a demonstration of how to use a schema for my sparse matrix type. Data is stored in a CSV file with three columns - `i`, an integer field containing the 0-based row indices of each non-zero element; `j`, an integer field containing the column indices; and `x`, a double-precision field containing the value of the non-zero element.\\n\\nDerived from `array/v1.json`: some kind of multi-dimensional array, where we store metadata about the dimensions and type of data. The exact implementation of the array is left to concrete subclasses.",
    "properties": {
        "$schema": {
            "description": "The schema to use.",
            "type": "string"
        },
        "array": {
            "additionalProperties": false,
            "properties": {
                "dimensions": {
                    "description": "Dimensions of an n-dimensional array. Dimensions should be ordered from the fastest-changing to the slowest.",
                    "items": {
                        "type": "integer"
                    },
                    "minItems": 1,
                    "type": "array"
                },
                "type": {
                    "description": "Type of data stored in this array.",
                    "enum": [
                        "boolean",
                        "number",
                        "integer",
                        "string",
                        "other"
                    ],
                    "type": "string"
                }
            },
            "required": [
                "dimensions"
            ],
            "type": "object"
        },
        "is_child": {
            "default": false,
            "description": "Is this a child document, only to be interpreted in the context of the parent document from which it is linked? This may have implications for search and metadata requirements.",
            "type": "boolean"
        },
        "md5sum": {
            "description": "MD5 checksum for the file.",
            "type": "string"
        },
        "path": {
            "description": "Path to the file in the project directory.",
            "type": "string"
        },
        "sparse_triplet_matrix": {
            "properties": {
                "compression": {
                    "description": "Type of compression applied to the file.",
                    "enum": [
                        "none",
                        "gzip",
                        "bzip2"
                    ],
                    "type": "string"
                },
                "num_elements": {
                    "description": "Number of non-zero elements in the sparse matrix.",
                    "minimum": 0,
                    "type": "integer"
                }
            },
            "required": [
                "num_elements"
            ],
            "type": "object"
        }
    },
    "required": [
        "$schema",
        "array",
        "md5sum",
        "path",
        "sparse_triplet_matrix"
    ],
    "title": "Sparse Triplet Matrix",
    "type": "object"
}
'
cat("```json", resolved, "```", sep="\n");
```

Let's put this file in the `inst/schemas/sparse_triplet_matrix` subdirectory of `r fake`.

# Writing a staging method

We need to create a method for the `stageObject()` generic that operates on the class of interest.
This will be called whenever an instance of that class is encountered by `stageObject()`, either directly or as a child of another object (e.g., a `SummarizedExperiment`'s assay).

```{r}
library(Matrix)
library(alabaster.base)
library(S4Vectors)

setMethod("stageObject", "dgTMatrix", function(x, dir, path, child=FALSE) {
    # Create a subdirectory to stash our contents.
    dir.create(file.path(dir, path), showWarnings=FALSE)

    # Create a DataFrame with the triplet data.
    df <- DataFrame(i = x@i, j = x@j, x = x@x)

    # .quickWriteCsv will make sure it's written in an 'alabaster-standard' format.
    outpath <- file.path(path, "foo.csv.gz")
    .quickWriteCsv(df, file.path(dir, outpath), compression="gzip")

    # Specifying the package name in the package attribute of the schema,
    # to ensure that .writeMetadata() can find it for validation.
    schema <- "sparse_triplet_matrix/v1.json"
    attr(schema, "package") <- "alabaster.foo"

    # Formatting the metadata for return.
    list(
        `$schema`=schema,

        # Reported path must be relative to 'dir'.
        path=outpath,

        # Pass along the 'child' specification from the call.
        is_child=child,

        `array`=list(
            # Need I() to prevent unboxing of length-1 vectors.
            dimensions=I(dim(x)),

            # double-precision values => 'number' in JSON schema's language.
            type="number"
        ),

        sparse_triplet_matrix=list(
            num_elements=nrow(df),
            compression="gzip"
        )
    )
})
```

Alright, let's test this out with a mock `dgTMatrix`.

```{r}
x <- sparseMatrix(
    i=c(1,2,3,5,6),
    j=c(3,6,1,3,8),
    x=runif(5),
    dims=c(10, 10),
    repr="T"
)
x

tmp <- tempfile()
dir.create(tmp)
meta <- stageObject(x, tmp, "test")
str(meta)
```

Running `.writeMetadata()` will then validate `meta` against the schema requirements, using the schema file that we put inside `r fake`.

```{r, echo=FALSE, results="hide"}
meta$md5sum <- digest::digest(file.path(tmp, meta$path), file=TRUE)
meta.file <- jsonlite::toJSON(meta, auto_unbox=TRUE)
new.schema <- tempfile(fileext=".json")
write(file=new.schema, resolved)
jsonvalidate::json_validate(schema=new.schema, json=meta.file, engine="ajv", error=TRUE)
```

# Writing a loading method

Finally, for loading, we need to define a new `load*` function for the new class that recreates the R object from the staged files.
In this case, it's fairly simple:

```{r}
loadSparseTripletMatrix <- function(info, project) {
    # Need to get the file path.
    path <- acquireFile(project, info$path)

    # This utility will check that the CSV is correctly formatted,
    # which is more stringent than read.csv.
    df <- .quickReadCsv(path,
        expected.columns=c(i="integer", j="integer", x="double"),
        expected.nrows=info$sparse_triplet_matrix$num_elements,
        compression=info$sparse_triplet_matrix$compression,
        row.names=FALSE
    )

    # Constructor uses 1-based indices.
    sparseMatrix(
        i=df$i + 1L,
        j=df$j + 1L,
        x=df$x,
        dims=info$array$dimensions,
        repr="T"
    )
}
```

Let's try it out with our previously staged `dgTMatrix`:

```{r}
loadSparseTripletMatrix(meta, tmp)
```

To register this loading function with `r Githubpkg("alabaster.base")`, `r fake` should add itself to the known schema locations upon load:

```{r}
# Typically in zzz.R.
.onLoad <- function(libname, pkgname) {
    existing <- getOption("alabaster.schema.locations")
    options(alabaster.schema.locations=union("alabaster.foo", existing))
}

.onUnload <- function(libpath) {
    existing <- getOption("alabaster.schema.locations")
    options(alabaster.schema.locations=setdiff(existing, "alabaster.foo"))
}
```

This ensures that `loadObject()` can find the resolved schema file in `r fake`'s `inst/schemas` when it encounters an instance of the `sparse_triplet_matrix/v1.json` schema.
`loadObject()` then looks at the schema's `_attributes.restore.R` property to figure out how to load the object back into R, which eventually leads to a call to `loadSparseTripletMatrix()`.
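In practice, downstream code would not call `loadSparseTripletMatrix()` directly; it would go through `loadObject()`, which performs this lookup automatically.
A minimal sketch of the full round trip is shown below - it is not evaluated here because it assumes that `r fake` is actually installed with the resolved schema in its `inst/schemas`:

```{r, eval=FALSE}
# Not run: assumes the hypothetical alabaster.foo package is installed, so that
# loadObject() can locate the resolved schema and its restore function.
written <- .writeMetadata(meta, dir=tmp)
info <- acquireMetadata(tmp, written$path)
roundtrip <- loadObject(info, tmp)
identical(as.matrix(roundtrip), as.matrix(x))
```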
# Advanced concepts

Some staging methods may need to save child objects, represented by the resource pointers in the schema.
For example, a `SummarizedExperiment` would contain multiple children - each assay, the `DataFrame`s for the column and row annotations, the metadata list, and so on.
These are easily handled by just calling the staging methods of each child object inside the staging method for the parent object:

```{r, eval=FALSE}
# Abbreviated example from alabaster.se:
setMethod("stageObject", "SummarizedExperiment", function(x, dir, path, child=FALSE) {
    dir.create(file.path(dir, path), showWarnings=FALSE)

    # Saving the colData.
    info <- .stageObject(colData(x), dir, file.path(path, "coldata"), child=TRUE)
    cd.info <- list(resource=.writeMetadata(info, dir=dir))

    # Saving the rowData.
    info <- .stageObject(rowData(x), dir, file.path(path, "rowdata"), child=TRUE)
    rd.info <- list(resource=.writeMetadata(info, dir=dir))

    # Saving the other metadata.
    info <- .stageObject(metadata(x), dir, file.path(path, "metadata"), child=TRUE)
    other.info <- list(resource=.writeMetadata(info, dir=dir))

    # Saving the assays.
    assay.info <- list()
    ass.names <- assayNames(x)
    for (i in seq_along(ass.names)) {
        curmat <- assay(x, i)
        mat.path <- file.path(path, paste0("assay-", i))
        meta <- .stageObject(curmat, path=mat.path, dir=dir, child=TRUE)
        deets <- .writeMetadata(meta, dir=dir)
        assay.info <- c(assay.info, list(list(name=ass.names[i], resource=deets)))
    }

    list(
        `$schema`="summarized_experiment/v1.json",
        path=file.path(path, "experiment.json"),
        summarized_experiment=list(
            assays=assay.info,
            column_data=cd.info,
            row_data=rd.info,
            other_data=other.info,
            dimensions=dim(x)
        ),
        is_child=child
    )
})
```

When doing so, developers should call `.stageObject()` instead of `stageObject()` (note the period).
The former is a wrapper around the latter that ensures the new staging method respects any application-specific overrides.

Similarly, some loading methods should use `.loadObject()` when loading child objects:

```{r, eval=FALSE}
# Abbreviated example from alabaster.se:
loadSummarizedExperiment <- function(exp.info, project) {
    all.assays <- list()
    for (y in seq_along(exp.info$summarized_experiment$assays)) {
        cur.ass <- exp.info$summarized_experiment$assays[[y]]
        aname <- cur.ass$name
        apath <- cur.ass$resource$path
        ass.info <- acquireMetadata(project, apath)
        all.assays[[aname]] <- .loadObject(ass.info, project=project)
    }

    cd.info <- acquireMetadata(project, exp.info$summarized_experiment$column_data$resource$path)
    cd <- .loadObject(cd.info, project=project)

    rd.info <- acquireMetadata(project, exp.info$summarized_experiment$row_data$resource$path)
    rd <- .loadObject(rd.info, project=project)

    other.info <- acquireMetadata(project, exp.info$summarized_experiment$other_data$resource$path)
    other <- .loadObject(other.info, project=project)

    SummarizedExperiment(all.assays, colData=cd, rowData=rd, metadata=other, checkDimnames=FALSE)
}
```

Inside the loading functions, we always use `acquireMetadata()` and `acquireFile()` to obtain the metadata and file paths, respectively.
This ensures that we respect any application-specific overrides, especially for `project` values that are not file paths.

# Promoting the schema to _alabaster.base_

If the new schema refers to a standard Bioconductor object that is widely re-usable, you may consider submitting a PR with the (non-resolved) schema to [`ArtifactDB/BiocObjectSchemas`](https://github.com/ArtifactDB/BiocObjectSchemas).
If accepted, we will include your schema in the standard set of schemas known to `r Githubpkg("alabaster.base")`.
Once this is done, `r fake` can be simplified considerably:

- Special `.onLoad()` functions to register `r fake` are no longer required.
- The `$schema` string does not need a `package` attribute in the `stageObject()` method (see the sketch below).
- The resolved schema does not need to be stored in `r fake`'s `inst/schemas`.
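For instance, the metadata-formatting code in our earlier `stageObject()` method would shrink to a plain schema string.
A small before/after sketch, assuming the schema has been accepted upstream:

```{r, eval=FALSE}
# Before promotion: the resolved schema ships with alabaster.foo,
# so .writeMetadata() needs to know which package to search.
schema <- "sparse_triplet_matrix/v1.json"
attr(schema, "package") <- "alabaster.foo"

# After promotion: alabaster.base already knows about this schema,
# so the plain string suffices.
schema <- "sparse_triplet_matrix/v1.json"
```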
You may also wish to create a PR to `r Githubpkg("alabaster.base")` to add your class and package to the [staging search path](https://github.com/ArtifactDB/alabaster.base/tree/master/R/searchMethods.R).
This will prompt `stageObject()` to attempt to load your package upon encountering an instance of the class of interest, ensuring that the correct staging method is automatically called even if the user hasn't loaded your package.

# Session information {-}

```{r}
sessionInfo()
```