--- title: "rhdf5client -- HDF5 server access" author: "Vincent J. Carey, stvjc at channing.harvard.edu, Shweta Gopaulakrishnan, reshg at channing.harvard.edu, Samuela Pollack, spollack at jimmy.harvard.edu" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{rhdf5client -- experiments with interface to remote HDF5} %\VignetteEncoding{UTF-8} output: BiocStyle::pdf_document: toc: yes number_sections: yes BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes --- # HDF5 server [HDF Server](https://support.hdfgroup.org/projects/hdfserver/) "extends the HDF5 data model to efficiently store large data objects (e.g. up to multi-TB data arrays) and access them over the web using a RESTful API." In this package, several data structures are introduced - to model the server data architecture and - to perform targeted extraction of numerical data from HDF5 arrays stored on the server. We maintain, thanks to a grant from the National Cancer Institute, the server [http://h5s.channingremotedata.org:5000/](http://h5s.channingremotedata.org:5000/). Visit this URL to get a flavor of the server structure: datasets, groups, and datatypes are high-level elements to be manipulated to work with data values from the server. A key application of the rhdf5client package is support for the `r BiocStyle::Biocpkg("restfulSE")` package that defines an interface between the SummarizedExperiment class and the HDF5 Server. The server provides content for `assay()` requests to `RESTfulSummarizedExperiment` instances. # Some details ## Motivation Extensive human and computational effort is expended on downloading and managing large genomic data at site of analysis. Interoperable formats that are accessible via generic operations like those in RESTful APIs may help to improve cost-effectiveness of genome-scale analyses. In this report we examine the use of HDF5 server as a back end for assay data. A modest server configured to deliver HDF5 content via a RESTful API has been prepared and is used in this vignette. ## Executive summary We want to provide rapid read-only access to array-like data. To do this, the hierarchy and additional formalities of the HDF5 server data architecture are exposed through R functions and related classes. Full details on the HDF5 server are available at [the HDFgroup site](https://support.hdfgroup.org/projects/hdfserver/). ```{r setup, echo=FALSE} suppressPackageStartupMessages({ library(rhdf5client) }) ``` The dsmeta function returns top-level groups and datasets. ```{r dsmeta} library(rhdf5client) bigec2 = H5S_source("http://h5s.channingremotedata.org:5000") bigec2 dsmeta(bigec2)[1:2,] # two groups dsmeta(bigec2)[1,2][[1]] # all dataset candidates in group 1 ``` ## Hierarchy of server resources ### Server Given the URL of a server running HDF5 server, we create an instance of `H5S_source`: ```{r doso} mys = H5S_source(serverURL="http://h5s.channingremotedata.org:5000") mys ``` ### Groups The server identifies a collection of 'groups'. For the server we are working with, only one group, at the root, is of interest. ```{r groups} groups(mys) ``` ### Links for a group There is a class to hold the link set for any group: ```{r links} lks = links(mys,1) lks ``` ### Dataset access We use double-bracket subscripting to grab a reference to a dataset from an H5S source. A dataset must be two-dimensional, i.e., accessible with two subscripts. ```{r dataset} dta = bigec2[["tenx_100k_sorted"]] dta ``` ### Data Data are accessed by subscripting. The subscripts are arrays of increasing, positive integers. ```{r access} x = dta[ 15:20, 1905:1906 ] x ``` (Obsolete) The subscripts are colon-delimited character with initial, final and optional stride. ```{r access-obs} x = dta["15:20", "1904:1906"] x ```