1 Overall introduction
2 Introduction to the Bioconductor/R-centric interface work
3 Interfacing R to remote or local HDF5 via h5py/h5pyd

1 Overall introduction

This document summarizes work to March 2018 in demonstrating the concept of remote HDF5. There are two main components to this document

A section on Bioconductor-oriented object designs and methods for using HDF5 data in server or object store, culminating in DelayedArray interfaces to remote HDF5 datasets that are 2-d arrays of numbers
A section on a more general R interface to the h5py/h5pyd APIs for working with remote or local HDF5

N.B. All python modules that are imported in this document are imported with convert=FALSE, so that there are no unintended translations of python data into R data. You will see py_to_r used below to accomplish such transitions when desired.

2 Introduction to the Bioconductor/R-centric interface work

The rhdf5client package is a basis for using HDF Server and HDF Object store with R/Bioconductor.

2.1 HDF Server interface

As of March 2018, we can use HDF Server with R in several ways. With support from an NCI grant, we maintain a server in AWS EC2 that employs the RESTful API defined for the HDF Server.

library(rhdf5client)
bigec2 = H5S_source(URL_h5serv())

## analyzing groups for their links...

## done

class(bigec2)

## [1] "H5S_source"
## attr(,"package")
## [1] "rhdf5client"

bigec2

## H5serv server url :  http://h5s.channingremotedata.org:5000 
##  There are 10 groups. 
##  Use groups(), links(), ..., to probe and access metadata. 
##  Use dsmeta() to get information on datasets within groups. 
##  Use [[ [dsname] ]]  to get a reference suitable for [i, j] subsetting.

2.1.1 The internal structure of an HDF Server

The server defines a hierarchical structure for all server contents. There are groups, linksets, and datasets.

groups(bigec2)

## DataFrame with 10 rows and 2 columns
##                                  groups    nlinks
##                             <character> <integer>
## 1  035e3a28-85e5-11e8-88a6-1678ea0f979a        15
## 2  035ef422-85e5-11e8-88a6-1678ea0f979a        28
## 3  035f20aa-85e5-11e8-88a6-1678ea0f979a         4
## 4  035eded8-85e5-11e8-88a6-1678ea0f979a         1
## 5  035ea062-85e5-11e8-88a6-1678ea0f979a         1
## 6  035f0a2a-85e5-11e8-88a6-1678ea0f979a         1
## 7  035eb502-85e5-11e8-88a6-1678ea0f979a         1
## 8  035e670a-85e5-11e8-88a6-1678ea0f979a         1
## 9  035e8b2c-85e5-11e8-88a6-1678ea0f979a        87
## 10 035eca88-85e5-11e8-88a6-1678ea0f979a         3

links(bigec2,1)

## HDF5 server link set for group 035e3a28-85e5-11e8-88a6-1678ea0f979a 
##  There are 15 links.
##  Use targets([linkset]) to extract target URLs.

datatable(data.frame(targets(links(bigec2,1))))

2.1.2 Presenting a specific dataset to the R user

We use the double-bracket operator to derive a reference to an HDF5 dataset from an H5S_source instance. We installed an image of the 10x genomics 1.3 million neuron dataset, that we can refer to as:

tenx_remote = bigec2[["tenx_full"]]
tenx_remote

## H5S_dataset instance:
##      dsname intl.dim1 intl.dim2              created     type.base
## 1 tenx_full   1306127     27998 2017-06-27T18:55:50Z H5T_STD_I32LE

This is sufficient to do arithmetic using familiar R programming steps. Note that the data image here has ‘neurons’ as ‘rows’.

apply(tenx_remote[1:4,1:27998],1,sum)

## [1] 4046 2087 4654 3193

2.2 Using the DelayedArray infrastructure

(At the moment the DelayedArray code is in restfulSE.) The H5S_Array constructor takes care of source and dataset navigation given the URL of the server and the name of the ‘host’, in HDF server parlance.

del10x = H5S_Array(URL_h5serv(), "tenx_full")

## analyzing groups for their links...

## done

del10x

## <27998 x 1306127> H5S_Matrix object of type "double":
##                [,1]       [,2]       [,3] ... [,1306126] [,1306127]
##     [1,]          0          0          0   .          0          0
##     [2,]          0          0          0   .          0          0
##     [3,]          0          0          0   .          0          0
##     [4,]          0          0          0   .          0          0
##     [5,]          0          0          0   .          0          0
##      ...          .          .          .   .          .          .
## [27994,]          0          0          0   .          0          0
## [27995,]          1          0          0   .          0          0
## [27996,]          0          0          0   .          0          0
## [27997,]          0          0          0   .          0          0
## [27998,]          0          0          0   .          0          0

Here we have defined the R image of the data to be the transpose of the image in HDF5. So neurons are columns.

apply(del10x[,1:4],2,sum)

## [1] 4046 2087 4654 3193

2.3 Interface to HSDS (HDF Object Store)

The interface here is not mature. The URL used here is a server maintained by John Readey of the HDF Group. The string /home/stvjc used below reflects a specific setup of data on the server, it is not a “user folder” on any system.

con = H5S_source(URL_hsds())   
con = setPath(con, "/home/reshg/tenx_full.h5") # this is as defined in store
ds2 = H5S_dataset2(con)
ds2

## H5S_dataset instance:
##                     dsname intl.dim1 intl.dim2    created     type.base
## 1 /home/reshg/tenx_full.h5   1306127     27998 1533754414 H5T_STD_I32LE

Again we have DelayedArray capabilities.

library(DelayedArray)
del10x_hsds = DelayedArray(new("H5S_ArraySeed", filepath = "", domain = "", 
        host = "", H5S_dataset = ds2))
del10x_hsds

## <27998 x 1306127> H5S_Matrix object of type "double":
##                [,1]       [,2]       [,3] ... [,1306126] [,1306127]
##     [1,]          0          0          0   .          0          0
##     [2,]          0          0          0   .          0          0
##     [3,]          0          0          0   .          0          0
##     [4,]          0          0          0   .          0          0
##     [5,]          0          0          0   .          0          0
##      ...          .          .          .   .          .          .
## [27994,]          0          0          0   .          0          0
## [27995,]          1          0          0   .          0          0
## [27996,]          0          0          0   .          0          0
## [27997,]          0          0          0   .          0          0
## [27998,]          0          0          0   .          0          0

apply(del10x_hsds[,1:4], 2, sum)

## [1] 4046 2087 4654 3193

3 Interfacing R to remote or local HDF5 via h5py/h5pyd

The reticulate package makes it easy to convey python infrastructure directly to R users. However, we want to shape aspects of the interaction to simplify statistical computing. We’ll start by considering how to use local HDF5 with the h5py python modules, and then transition to remote HDF5.

Some of the basic strategies are adumbrated in the BiocSklearn, a proof of concept of use of scikit modules in R.

A note on documentation. For many python concepts imported into an R session via reticulate::import, py_help may be used to obtain documentation as recorded in python docstrings. Thus after the import defining np below, py_help(np) will return a paged high-level document on numpy to the session.

3.1 Some basic tools for accessing local HDF5

We’ll start with imports of key R and python packages.

library(reticulate)
np = import("numpy", convert=FALSE)
h5py = import("h5py", convert=FALSE)

The _hl modules are fundamental infrastructure.

Rh5py = h5py[["_hl"]]
names(Rh5py)

##  [1] "absolute_import" "attrs"           "base"            "dataset"        
##  [5] "datatype"        "files"           "filters"         "group"          
##  [9] "selections"      "selections2"     "attrs"           "base"           
## [13] "dataset"         "datatype"        "files"           "filters"        
## [17] "group"           "selections"      "selections2"

3.1.1 Handling numerical data via numpy

The following codes demonstrate ways of interfacing to HDF5 via python. h5file simply returns a python reference to a File instance.

h5dsref builds python commands to facilitate manipulation of an HDF5 dataset in R via numpy.

h5file = function( file )
  Rh5py$files$File( file )

fn = system.file("hdf5/numiris.h5", package="rhdf5client")
m1 = h5file(fn)
m1

## <HDF5 file "numiris.h5" (mode r+)>

class(m1)

##  [1] "h5py._hl.files.File"              "h5py._hl.group.Group"            
##  [3] "h5py._hl.base.HLObject"           "h5py._hl.base.CommonStateObject" 
##  [5] "h5py._hl.base.MutableMappingHDF5" "h5py._hl.base.MappingHDF5"       
##  [7] "_abcoll.MutableMapping"           "_abcoll.Mapping"                 
##  [9] "_abcoll.Sized"                    "_abcoll.Iterable"                
## [11] "_abcoll.Container"                "python.builtin.object"

The File instance can be regarded as a python dictionary. We can learn the names of the datasets in the file:

m1$keys()

## [u'numiris']

The h5dsref function was devised to give convenient access to a dataset representing a matrix.

h5dsref = function(filename, dsname="numiris") {
  py_run_string("import h5py", convert=FALSE)
  py_run_string(paste0("f = h5py.File('", filename, "')"))
  mref = py_run_string(paste0("g = f['", dsname, "']"))
  mref$g
}

We’ll focus on the h5dsref approach for now. We can get slices of the target array using numpy’s take.

numir = h5dsref(fn)
ta = np$take  # can't use `$` on the fly
numirsli = ta(numir, 0:2, 1L)
class(numirsli)

## [1] "numpy.ndarray"         "python.builtin.object"

numirsli

## [[ 5.1  4.9  4.7]
##  [ 3.5  3.   3.2]
##  [ 1.4  1.4  1.3]
##  [ 0.2  0.2  0.2]]

So numirsli is a submatrix of the iris data in /tmp/RtmpWISGIK/Rinst2dc930271571/rhdf5client/hdf5/numiris.h5, with class numpy.ndarray. We can learn about available methods using names, and try some out.

names(numirsli)

##  [1] "T"            "all"          "any"          "argmax"      
##  [5] "argmin"       "argpartition" "argsort"      "astype"      
##  [9] "base"         "byteswap"     "choose"       "clip"        
## [13] "compress"     "conj"         "conjugate"    "copy"        
## [17] "ctypes"       "cumprod"      "cumsum"       "data"        
## [21] "diagonal"     "dot"          "dtype"        "dump"        
## [25] "dumps"        "fill"         "flags"        "flat"        
## [29] "flatten"      "getfield"     "imag"         "item"        
## [33] "itemset"      "itemsize"     "max"          "mean"        
## [37] "min"          "nbytes"       "ndim"         "newbyteorder"
## [41] "nonzero"      "partition"    "prod"         "ptp"         
## [45] "put"          "ravel"        "real"         "repeat"      
## [49] "reshape"      "resize"       "round"        "searchsorted"
## [53] "setfield"     "setflags"     "shape"        "size"        
## [57] "sort"         "squeeze"      "std"          "strides"     
## [61] "sum"          "swapaxes"     "take"         "tobytes"     
## [65] "tofile"       "tolist"       "tostring"     "trace"       
## [69] "transpose"    "var"          "view"

numirsli$ndim

## 2

numirsli$shape

## (4, 3)

numirsli$T$shape

## (3, 4)

Furthermore, we can create an R matrix with the HDF5 numerical content as sliced via take using py_to_r from reticulate:

dim(py_to_r(numirsli)) # all in R

## [1] 4 3

Thus, given an HDF5 dataset that can be regarded as a numpy array, we can interrogate its attributes and retrieve slices from R using h5dsref.

3.1.2 Creating HDF5 datasets from R

if (.Platform$OS.type != "windows")  {
tf = tempfile()
nf = h5py$File(tf, "w")
irmat = data.matrix(iris[,1:4])
nf$create_dataset('irisH5', data=r_to_py(irmat))
chk = h5dsref(tf, "irisH5")
ta(chk, 0:4, 0L)
nf$file$close() # no more reading, but
try(ta(chk, 0:4, 0L)) # is the close operation working?
}

## [[ 5.1  3.5  1.4  0.2]
##  [ 4.9  3.   1.4  0.2]
##  [ 4.7  3.2  1.3  0.2]
##  [ 4.6  3.1  1.5  0.2]
##  [ 5.   3.6  1.4  0.2]]

Details on the File interface are provided in h5py docs.

3.1.3 Interim conclusions

The Rh5py interface defined here would appear to be an adequate approach to interfacing between R and HDF5, but we already have plenty of mileage in rhdf5. Our real interest is in providing a comprehensive interface to the HDF Server and Object Store APIs, and we turn to this now.

3.2 Working with HDF Object Store via h5pyd

The File API for the object store is a little different from the one for local HDF5. For the following to succeed you would need credentials for the HDF Object Store instance noted in the endpoint used below.

if (.Platform$OS.type != "windows")  {
Rh5pyd = import("h5pyd", as="h5py", convert=FALSE)
tenx_remote = Rh5pyd$File("/home/reshg/tenx_full.h5", "r",
     endpoint=URL_hsds())
tenx_remote
tenx_remote$keys() # only python
py_to_r(tenx_remote$keys()) # the strings of interest
}

## [1] "newassay001"

The following function obtains a slice from a dataset in the object store. The index expression must be appropriate for the dataset and follows the convention for h5pyd: start:end:stride for each dimension, with [:end] and [:stride] optional.

if (.Platform$OS.type != "windows")  {
getslice = function(endpoint, mode, domain, dsname, indexstring="[0,0]") {
   py_run_string("import h5pyd", convert=FALSE)
   py_run_string(paste0("f = h5pyd.File(domain=", sQuote(domain), 
         ", mode = ", sQuote(mode), ", endpoint=", sQuote(endpoint), ")"))
   py_run_string(paste0("g = f['", dsname, "']", indexstring))$g
}
mr = getslice(URL_hsds(), "r", 
   "/home/reshg/tenx_full.h5", "newassay001", "[0:4, 0:27998]")
apply(mr,1,sum)
}

## [1] 4046 2087 4654 3193

3.3 Working with HDF Server

The getslice function will work with references to an HDF Server. However, in the context of the vignette compilation, I see an authentication error triggered. It is not clear why; if the two getslice commands are isolated and run in a single R session, no problem arises.

if (.Platform$OS.type != "windows")  {
hh = import("h5pyd", as="h5py") # avoid auth problem?
mr = getslice(URL_h5serv(), "r",
   "tenx_full.h5s.channingremotedata.org", "newassay001", "[0:4, 0:27998]")
apply(mr,1,sum)
}

3.4 Towards a comprehensive interface

We’ll focus on the object store. After importing h5pyd using reticulate, we can learn about available infrastructure.

if (.Platform$OS.type != "windows")  {
names(Rh5pyd)
}

##  [1] "AttributeManager"         "Config"                  
##  [3] "Dataset"                  "Datatype"                
##  [5] "ExternalLink"             "File"                    
##  [7] "Folder"                   "Group"                   
##  [9] "HardLink"                 "Reference"               
## [11] "RegionReference"          "SoftLink"                
## [13] "UserDefinedLink"          "absolute_import"         
## [15] "check_dtype"              "config"                  
## [17] "enable_ipython_completer" "getServerInfo"           
## [19] "special_dtype"            "version"                 
## [21] "_hl"                      "config"                  
## [23] "version"

With py_help(Rh5pyd$Dataset), we obtain extensive documentation in our R session.

Help on class Dataset in module h5pyd._hl.dataset:

class Dataset(h5pyd._hl.base.HLObject)
 |  Represents an HDF5 dataset
 |  
 |  Method resolution order:
 |      Dataset
 |      h5pyd._hl.base.HLObject
 |      h5pyd._hl.base.CommonStateObject
 |      __builtin__.object
 |  
 |  Methods defined here:
 |  
 |  __array__(self, dtype=None)
 |      Create a Numpy array containing the whole dataset.  DON'T THINK
 |      THIS MEANS DATASETS ARE INTERCHANGABLE WITH ARRAYS.  For one thing,
 |      you have to read the whole dataset everytime this method is called.
 |  
 |  __getitem__(self, args)
 |      Read a slice from the HDF5 dataset.
 |      
 |      Takes slices and recarray-style field names (more than one is
 |      allowed!) in any order.  Obeys basic NumPy rules, including
 |      broadcasting.
...

In what follows, we show the code that creates a new dataset in the object store. With py_help(Rh5pyd$File), we find:

 |  create_dataset(self, name, shape=None, dtype=None, data=None, **kwds)
 |      Create a new HDF5 dataset
 |      
 |      name
 |          Name of the dataset (absolute or relative).  Provide None to make
 |          an anonymous dataset.
 |      shape
 |          Dataset shape.  Use "()" for scalar datasets.  Required if "data"
 |          isn't provided.
 |      dtype
 |          Numpy dtype or string.  If omitted, dtype('f') will be used.
 |          Required if "data" isn't provided; otherwise, overrides data
 |          array's dtype.
 |      data
 |          Provide data to initialize the dataset.  If used, you can omit
 |          shape and dtype arguments.
 |      
 |      Keyword-only arguments:
 |      
 |      chunks
 |          (Tuple) Chunk shape, or True to enable auto-chunking.
 |      maxshape
 |          (Tuple) Make the dataset resizable up to this shape.  Use None for
 |          axes you want to be unlimited.

and we make use of the create_dataset method. (Following code is unevaluated, just for illustration, as it was tested and created the persistent content.)

if (.Platform$OS.type != "windows")  {
nf = Rh5pyd$File(endpoint=URL_hsds(), mode="w", 
   domain="/home/stvjc/iris_demo.h5")
nf$create_dataset('irisH5', data=r_to_py(irmat))
}

We can read back with:

if (.Platform$OS.type != "windows")  {
getslice(URL_hsds(), mode="r",
   domain="/home/stvjc/iris_demo.h5", "irisH5", "[0:3, 0:3]")
}

##      [,1] [,2] [,3]
## [1,]  5.1  3.5  1.4
## [2,]  4.9  3.0  1.4
## [3,]  4.7  3.2  1.3

We can run create_group as well. See

if (.Platform$OS.type != "windows")  {
sort(names(Rh5pyd$File))
}

##  [1] "DELETE"          "GET"             "POST"            "PUT"            
##  [5] "allocated_bytes" "attrs"           "clear"           "close"          
##  [9] "copy"            "create_dataset"  "create_group"    "created"        
## [13] "driver"          "fid"             "file"            "filename"       
## [17] "flush"           "get"             "getACL"          "getACLs"        
## [21] "get_link_json"   "id"              "items"           "iteritems"      
## [25] "iterkeys"        "itervalues"      "keys"            "libver"         
## [29] "mode"            "modified"        "move"            "name"           
## [33] "num_chunks"      "num_datasets"    "num_datatypes"   "num_groups"     
## [37] "owner"           "parent"          "pop"             "popitem"        
## [41] "putACL"          "ref"             "regionref"       "require_dataset"
## [45] "require_group"   "setdefault"      "update"          "userblock_size" 
## [49] "values"          "verifyCert"      "visit"           "visititems"

h5client – Bioconductor and remote HDF5

August 18, 2018

Contents