---
title: "houba"
subtitle: "Yet another package for memory-mapped objects"
author: "Juliette Meyniel and Hervé Perdry"
version: 0.1
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{houba}
  %\VignettePackage{houba}
  %\VignetteDepends{houba}
  %\VignetteDepends{bigmemory}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE, results = "hide", message = FALSE}
oldoptions <- options()
oldpar <- par()
options(width = 85)
require(houba)
```
# Overview
**houba** provides manipulation of large data through memory-mapped files, supporting vectors,
matrices, and arrays. This makes it possible to work with large datasets while keeping them on disk.
**houba** defines three S4 classes:
- `mvector` for memory-mapped vectors
- `mmatrix` for memory-mapped matrices
- `marray` for memory-mapped arrays
Currently, it supports `float`, `double`, `integer` and `char` data types.
**houba** lets you extract sub-vectors or sub-matrices and assign values to them.
It also performs component-wise arithmetic operations (there is currently no matrix arithmetic).
In-place arithmetic operations are supported. `rowSums`, `colSums`, `rowMeans`, `colMeans`
methods are defined for memory-mapped matrices.
Minimal compatibility with the **bigmemory** package is provided through descriptor files.
**NOTE 1** A current limitation of **houba** is that it relies on R integers for indices, thus
vectors of length larger than 2,147,483,647 can't be manipulated. The same limitation applies to
the dimensions of matrices and arrays.
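The limit in NOTE 1 is the largest value an R integer can hold. As a rough sketch in base R, one can check ahead of time whether a set of dimensions stays within it. Here `fits_r_integer` is a hypothetical helper written for illustration, not a **houba** function, and it only tests each dimension against `.Machine$integer.max`.

```r
# Largest value representable as an R integer (2^31 - 1)
.Machine$integer.max

# Hypothetical helper, not part of houba: TRUE if every dimension
# can be represented as an R integer index.
fits_r_integer <- function(dims) {
  all(dims >= 0 & dims <= .Machine$integer.max)
}

fits_r_integer(c(1e5, 1e4))   # a 100000 x 10000 matrix: each dimension fits
fits_r_integer(3e9)           # a vector of length 3e9 does not
```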
**NOTE 2** **houba** relies on the C++ header-only library mio by vimpunk, which is released under
the MIT License.
# Creating memory-mapped objects
## Creating objects associated to new files
To create zero-filled objects, associated with new files, use
`mvector`, `mmatrix` and `marray`.
Here we create a memory-mapped vector of length 100, associated with a temporary file:
```{r create-file}
A <- mvector(datatype = "double", length = 100)
A
```
We can specify the filename for the backing file. Here we create a memory-mapped matrix:
```{r create-file2}
filename <- file.path(tempdir(), "integers120")
B <- mmatrix(datatype = "integer", nrow = 12, ncol = 10, filename = filename)
B
```
Similarly,
`marray("float", c(10, 20, 3))` creates a 10 by 20 by 3 array.
## Conversion from an R object
The methods `as.mvector`, `as.mmatrix` and `as.marray` create a memory-mapped object
whose file holds the content of an R object.
```{r from_R}
# Convert regular R objects to memory-mapped objects
a <- matrix(1:20, 4, 5)
A <- as.mmatrix(a, datatype = "float")
A
```
If `datatype` is not provided, the method will use `integer` or `double`,
depending on the type of the R object.
```{r from_R2}
v <- 1:10
V <- as.mvector(v)
V
```
These methods also have an argument `filename`.
## Conversion to an R object
You can recover an R object using `as.vector`, `as.matrix` and `as.array`:
```{r to_R}
as.vector(V)
```
## Mapping pre-existing files
An existing file can be mapped, as long as it has the right size.
Here we use the file mapped in `B` created above.
```{r}
C <- mvector("int", 120, filename)
C
```
Providing an incompatible size will raise an error.
```{r, error = TRUE, purl = FALSE}
D <- mvector("int", 100, filename)
```
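A quick way to anticipate such an error is to compare the size of the file with the number of bytes the requested object needs. The sketch below assumes that the backing file holds the raw values with no header, and that elements take 4 bytes for `float` and `integer`, 8 for `double` and 1 for `char` (the usual C++ sizes). Neither point is stated in this vignette, and `expected_bytes` is a hypothetical helper, not a **houba** function.

```r
# Assumed element sizes in bytes (usual C++ sizes, not taken from houba)
bytes_per_element <- c(char = 1, integer = 4, float = 4, double = 8)

# Hypothetical helper: bytes needed for an object with the given dimensions
expected_bytes <- function(datatype, dims) {
  unname(bytes_per_element[datatype]) * prod(dims)
}

expected_bytes("integer", 120)        # the vector C above
expected_bytes("integer", c(12, 10))  # the matrix B above: same number of bytes
```

Under these assumptions, `file.size(filename)` should match the value computed for a length of 120, and would not match the value for a length of 100.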
The mvector `C` is read-only, which is the default when mapping an existing file.
You can change this by providing the argument `readonly = FALSE` to `mvector`.
As `C` and `B` map the same file, modifying one object should modify the other:
```{r}
B[1:4] <- 1:4
C
```
However, this may not always work, depending on your system, or when
a file is mapped from several R sessions. The function `flush` makes sure
all changes are written to disk:
```{r flush}
B[1:4] <- 2:5
flush(B)
C
```
## Descriptor Files
Descriptor files aim to provide a minimal compatibility
with the **bigmemory** package.
### Basic usage
To create a descriptor file associated with a mapped file, use `descriptor.file`. We
illustrate it here on the matrix `B` created above.
```{r descriptor_create}
B
dsc <- descriptor.file(B)
```
Descriptor files can be read with `read.descriptor`:
```{r descriptor_read}
D <- read.descriptor(dsc)
D
```
### Compatibility with bigmemory
The descriptor files created by **houba** can be read with the package **bigmemory**.
We first load the package and read the descriptor file:
```{r bigmemory}
library(bigmemory)
desc <- dget(dsc)
```
We then attach the file:
```{r bigmemory2}
bm <- attach.big.matrix(desc)
```
The resulting object maps the same datafile:
```{r bigmemory3}
bm[,1]
```
Note that although **houba** can create descriptor files for marrays, these won't be
accepted by **bigmemory**, which doesn't handle arrays.
## Restoring Broken Pointers
When restoring data from a previous session, pointers to external objects are broken,
making objects unusable.
If the underlying data file still exists, you can use `restore` to overcome the problem.
Here we simulate this behaviour on the matrix `B`, using `save.image`.
```{r}
B
rdata_file <- tempfile(fileext = ".rda")
save.image(rdata_file)
```
Now we erase `B`:
```{r}
rm(B)
```
And we load the saved image:
```{r}
load(rdata_file)
B
```
The pointer in `B` is broken, but can be restored as follows:
```{r}
B <- restore(B)
B
```
## Copying objects
You can create a copy with `copy`. This will also create a new file.
```{r}
C <- copy(B)
C
```
This function has an argument `filename`. In particular, it can be used to save
data that are stored in a temporary file.
# Data manipulation
## Changing dimensions
The dimensions of an object can be accessed through `dim`.
```{r dim}
a <- matrix(1:12, 3, 4)
A <- as.mmatrix(a)
A
dim(A)
```
You can change the dimensions:
```{r dim2}
dim(A) <- c(4, 3)
A
```
Setting the dimensions to `NULL` turns the object into an mvector:
```{r dim3}
dim(A) <- NULL
A
```
Similarly, you can obtain an marray:
```{r dim4}
dim(A) <- c(2,2,3)
A
```
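For comparison, this is how the same sequence of reshapes behaves on a plain R matrix, where the data are stored column by column and only the `dim` attribute changes. That **houba** uses the same column-major layout is an assumption here; the outputs above are the place to check it.

```r
a <- matrix(1:12, 3, 4)
dim(a) <- c(4, 3)     # same 12 values, now read as 4 rows and 3 columns
a[, 1]                # first column: 1 2 3 4
dim(a) <- NULL        # back to a plain vector
dim(a) <- c(2, 2, 3)  # a 2 x 2 x 3 array
a[, , 3]              # last slice holds the values 9 to 12
```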
## Accessing values
You can access elements of a memory-mapped object just as you would with regular R objects.
Let us create a memory-mapped matrix:
```{r access}
a <- matrix( sample(0:99, 2500, TRUE), 50, 50)
A <- as.mmatrix(a)
```
Accessing a single element:
```{r access2}
A[1,1]
```
Accessing a row:
```{r access3}
A[1,]
```
The result here is an R object. This behaviour actually depends on its size!
The default is to return an R object if the result's size is less than
one million, and otherwise to return a memory-mapped object.
This can be changed through the option `max.size`, as follows:
```{r houba}
houba(max.size = 20)
```
Now, accessing the first row returns a new memory-mapped object:
```{r houba2}
A[1,]
```
## Assigning values
Again, you can use R syntax to assign values:
```{r assign}
A[1,1] <- 0
A[2,] <- 10
A
```
Assignment from another memory-mapped object is also possible:
```{r assign2}
V <- as.mvector(1:50, "int")
A[3,] <- V
A
```
There is no type promotion. Assigning a floating point value to an integer object
will cast it to integer:
```{r no-promo}
A[1,1] <- pi
A[1,1]
```
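This differs from base R, where the same assignment silently promotes the whole vector to `double`. A short base R illustration, for contrast only:

```r
x <- 1:5
typeof(x)     # "integer"
x[1] <- pi    # base R converts the whole vector to double
typeof(x)     # "double"
x[1]          # 3.141593
```

To keep fractional values in a memory-mapped object, create it with a floating point `datatype` in the first place, as in the `as.mmatrix(a, datatype = "float")` example earlier.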
## Arithmetic Operations
Arithmetic operations are available with the usual R syntax.
```{r arithmetic}
a <- matrix( sample.int(16), 4, 4)
A <- as.mmatrix(a, datatype = "float")
A <- 1 + 2*A
A
```
Memory-mapped objects can be used for both operands:
```{r arithmetic2}
B <- A + 2
C <- A / B
C
```
### There's no type promotion in houba
There is no type promotion. If the two operands have different types, the type of the result is
the type of the left operand.
Let's create two vectors with type `float` and `integer`:
```{r no-promo2}
A <- as.mvector( seq(0, 1, length = 11), datatype = "float" )
B <- as.mvector( 0:10, datatype = "integer" )
```
Now `A + B` has type `float`:
```{r no-promo3}
A + B
```
and `B + A` has type `integer`:
```{r nop4}
B + A
```
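In base R, by contrast, the order of the operands doesn't affect the type of the result. The lines below use plain R vectors with the same values to show this:

```r
x <- seq(0, 1, length.out = 11)   # double
y <- 0:10                         # integer
typeof(x + y)                     # "double"
typeof(y + x)                     # "double"
identical(x + y, y + x)           # TRUE
```

With memory-mapped objects, put the operand whose type you want to keep on the left.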
## In-Place Arithmetic Operations
We can modify the data without creating copies:
```{r inplace}
V <- as.mvector(1:20, "float")
W <- as.mvector(sample.int(20))
inplace.sum(V, 1) # Add 1 to all elements
inplace.prod(V, W) # Multiply elements of V by elements of W
inplace.minus(V, c(1,2)) # Subtract c(1,2) from all elements (recycling)
inplace.div(V, 4) # Divide all elements by 4
inplace.opposite(V) # Take opposite of all elements
inplace.inverse(V) # Take reciprocal of all elements
V
```
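To make the sequence easier to follow, here is the same chain of operations written with ordinary R vectors, going by the comments in the chunk above. It only illustrates the arithmetic: it copies at every step, which is what the in-place functions avoid, and it works in double precision, so values may differ slightly from the `float` result.

```r
set.seed(1)               # arbitrary seed so the example is reproducible
v <- as.numeric(1:20)
w <- sample.int(20)
v <- v + 1                # inplace.sum(V, 1)
v <- v * w                # inplace.prod(V, W)
v <- v - c(1, 2)          # inplace.minus(V, c(1,2)), recycled along v
v <- v / 4                # inplace.div(V, 4)
v <- -v                   # inplace.opposite(V)
v <- 1 / v                # inplace.inverse(V)
v
```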
# Row and column operations
**houba** provides analogs to `rowSums`, `rowMeans`, `colSums`, `colMeans`, and `apply`,
for memory-mapped matrices (but not for memory-mapped arrays).
## Sums and means
```{r cs}
a <- matrix( sample.int(100), 10, 10)
A <- as.mmatrix(a)
# Row sums and means
rowSums(A)
rowMeans(A)
```
Here the result is an R object, because its size does not exceed the value
of the option `max.size`. Otherwise, it will be a memory-mapped
object:
```{r cs2}
houba(max.size = 5)
rowSums(A)
```
## Applying Functions
The `apply` method extracts rows or columns as R objects and applies the function to each of them.
Again, the type of the result depends on the `max.size` option.
If the size of the result is larger than `max.size`, a memory-mapped object is returned:
```{r apply}
houba(max.size = 5)
apply(A, 1, sd)
```
The data type of this object will be `double` or `integer`, depending on the
values returned by the function. For example, the `sum` function will return
integers:
```{r apply2}
apply(A, 1, sum)
```
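The difference comes from the functions themselves rather than from **houba**: in base R, `sum` of integers is an integer, while `sd` always returns a double. This can be checked on a plain vector:

```r
x <- 1:10
typeof(sum(x))    # "integer"
typeof(sd(x))     # "double"
typeof(mean(x))   # "double"
```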
And if the size of the result is smaller than `max.size`, an R object is returned:
```{r apply3}
houba(max.size = 1e6)
apply(A, 1, sd)
```
# Contributing to houba
You may e-mail the authors with bug reports, feature requests,
or contributions. The source of the package is on [GitHub](https://github.com/HervePerdry/houba).
Houba, hop!
```{r, echo = FALSE, results = "hide", message = FALSE}
options(oldoptions)
unlink(filename)
unlink(dsc)
```