---
title: "houba"
subtitle: "Yet another package for memory-mapped objects"
author: "Juliette Meyniel and Hervé Perdry"
version: 0.1
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{houba}
  %\VignettePackage{houba}
  %\VignetteDepends{houba}
  %\VignetteDepends{bigmemory}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE, results = "hide", message = FALSE}
oldoptions <- options()
oldpar <- par()
options(width = 85)
require(houba)
```

# Overview

**houba** provides manipulation of large data through memory-mapped files, supporting vectors, matrices, and arrays. This makes it possible to work with large datasets while keeping them on disk.

**houba** defines three S4 classes:

- `mvector` for memory-mapped vectors
- `mmatrix` for memory-mapped matrices
- `marray` for memory-mapped arrays

Currently, it supports the `float`, `double`, `integer` and `char` data types.

**houba** supports extracting sub-vectors or sub-matrices, and making assignments. It also performs component-wise arithmetic operations (currently no matrix arithmetic). In-place arithmetic operations are supported.

The `rowSums`, `colSums`, `rowMeans` and `colMeans` methods are defined for memory-mapped matrices.

A minimal compatibility with the **bigmemory** package is provided through descriptor files.

**NOTE 1** A current limitation of **houba** is that it relies on R integers for indices, so vectors of length larger than 2,147,483,647 can't be manipulated. The same limitation applies to the dimensions of matrices and arrays.

**NOTE 2** **houba** relies on the C++ header-only library mio by vimpunk, which is released under the MIT Licence.

# Creating memory-mapped objects

## Creating objects associated with new files

To create zero-filled objects associated with new files, use `mvector`, `mmatrix` and `marray`.
Here we create a memory-mapped vector of length 100, associated with a temporary file:

```{r create-file}
A <- mvector(datatype = "double", length = 100)
A
```

We can specify the filename for the backing file. Here we create a memory-mapped matrix:

```{r create-file2}
filename <- file.path(tempdir(), "integers120")
B <- mmatrix(datatype = "integer", nrow = 12, ncol = 10, filename = filename)
B
```

Similarly, `marray("float", c(10, 20, 3))` creates a 10 by 20 by 3 array.

## Conversion from an R object

The methods `as.mvector`, `as.mmatrix` and `as.marray` create a file holding the content of an R object.

```{r from_R}
# Convert regular R objects to memory-mapped objects
a <- matrix(1:20, 4, 5)
A <- as.mmatrix(a, datatype = "float")
A
```

If `datatype` is not provided, the method will use `integer` or `double`, depending on the type of the R object.

```{r from_R2}
v <- 1:10
V <- as.mvector(v)
V
```

These methods also have an argument `filename`.

## Conversion to an R object

You can recover an R object using `as.vector`, `as.matrix` and `as.array`:

```{r to_R}
as.vector(V)
```

## Mapping pre-existing files

An existing file can be mapped, as long as it has the right size. Here we use the file mapped in `B` created above.

```{r}
C <- mvector("int", 120, filename)
C
```

Providing an incompatible size will raise an error.

```{r, error = TRUE, purl = FALSE}
D <- mvector("int", 100, filename)
```

The mvector `C` is read-only; this is the default when mapping an existing file. You can change this by providing the argument `readonly = FALSE` to `mvector`.

As `C` and `B` map the same file, modifying one object should modify the other:

```{r}
B[1:4] <- 1:4
C
```

However, this may not always work reliably, depending on your system, or when a file is mapped through several R sessions.
The function `flush` makes sure all changes are written to disk:

```{r flush}
B[1:4] <- 2:5
flush(B)
C
```

## Descriptor Files

Descriptor files aim to provide a minimal compatibility with the **bigmemory** package.

### Basic usage

To create a descriptor file associated with a mapped file, use `descriptor.file`. We illustrate it here on the matrix `B` created above.

```{r descriptor_create}
B
dsc <- descriptor.file(B)
```

Descriptor files can be read with `read.descriptor`:

```{r descriptor_read}
D <- read.descriptor(dsc)
D
```

### Compatibility with bigmemory

The descriptor files created by **houba** can be read with the package **bigmemory**.

We first load the package and read the descriptor file:

```{r bigmemory}
library(bigmemory)
desc <- dget(dsc)
```

We then attach the file:

```{r bigmemory2}
bm <- attach.big.matrix(desc)
```

The resulting object maps the same data file:

```{r bigmemory3}
bm[,1]
```

Note that although **houba** can create descriptor files for marrays, these won't be accepted by **bigmemory**, which doesn't handle arrays.

## Restoring Broken Pointers

When restoring data from a previous session, pointers to external objects are broken, making objects unusable. If the underlying data file still exists, you can use `restore` to overcome the problem.

Here we simulate this behaviour on the matrix `B`, using `save.image`.

```{r}
B
rdata_file <- tempfile(fileext = ".rda")
save.image(rdata_file)
```

Now we erase `B`:

```{r}
rm(B)
```

And we load the saved image:

```{r}
load(rdata_file)
B
```

The pointer in `B` is broken, but can be restored as follows:

```{r}
B <- restore(B)
B
```

## Copying objects

You can create a copy with `copy`. This will also create a new file.

```{r}
C <- copy(B)
C
```

This function has an argument `filename`. In particular, it can be used to save data that are stored in a temporary file.

# Data manipulation

## Changing dimensions

The dimensions of an object can be accessed through `dim`.

```{r dim}
a <- matrix(1:12, 3, 4)
A <- as.mmatrix(a)
A
dim(A)
```

You can change the dimensions:

```{r dim2}
dim(A) <- c(4, 3)
A
```

Setting the dimensions to `NULL` creates an mvector:

```{r dim3}
dim(A) <- NULL
A
```

Similarly, you can obtain an marray:

```{r dim4}
dim(A) <- c(2,2,3)
A
```

## Accessing values

You can access elements of a memory-mapped object just as you would for regular R objects. Let us create a memory-mapped matrix:

```{r access}
a <- matrix( sample(0:99, 2500, TRUE), 50, 50)
A <- as.mmatrix(a)
```

Accessing a single element:

```{r access2}
A[1,1]
```

Accessing a row:

```{r access3}
A[1,]
```

The result here is an R object. This behaviour actually depends on the size of the result! The default is to return an R object if the size of the result is less than one million, and a memory-mapped object otherwise. This can be changed through the option `max.size`, as follows:

```{r houba}
houba(max.size = 20)
```

And now, accessing the first row returns a new memory-mapped object:

```{r houba2}
A[1,]
```

## Assigning values

Again, you can use R syntax to assign values:

```{r assign}
A[1,1] <- 0
A[2,] <- 10
A
```

Assignment from another memory-mapped object is also possible:

```{r assign2}
V <- as.mvector(1:50, "int")
A[3,] <- V
A
```

There is no type promotion. Assigning a floating point value to an integer object will cast it to integer:

```{r no-promo}
A[1,1] <- pi
A[1,1]
```

## Arithmetic Operations

Arithmetic operations are available with the usual R syntax.

```{r arithmetic}
a <- matrix( sample.int(16), 4, 4)
A <- as.mmatrix(a, datatype = "float")
A <- 1 + 2*A
A
```

Memory-mapped objects can be used for both operands:

```{r arithmetic2}
B <- A + 2
C <- A / B
C
```

### There's no type promotion in houba

There is no type promotion. If the two operands have different types, the type of the result is the type of the left operand.
Let's create two vectors with types `float` and `integer`:

```{r no-promo2}
A <- as.mvector( seq(0, 1, length = 11), datatype = "float" )
B <- as.mvector( 0:10, datatype = "integer" )
```

Now `A + B` has type `float`:

```{r no-promo3}
A + B
```

and `B + A` has type `integer`:

```{r nop4}
B + A
```

## In-Place Arithmetic Operations

We can modify the data without creating copies:

```{r inplace}
V <- as.mvector(1:20, "float")
W <- as.mvector(sample.int(20))
inplace.sum(V, 1)          # Add 1 to all elements
inplace.prod(V, W)         # Multiply elements of V by elements of W
inplace.minus(V, c(1,2))   # Subtract c(1,2) from all elements (recycling)
inplace.div(V, 4)          # Divide all elements by 4
inplace.opposite(V)        # Take opposite of all elements
inplace.inverse(V)         # Take reciprocal of all elements
V
```

# Row and column operations

**houba** provides analogs to `rowSums`, `rowMeans`, `colSums`, `colMeans`, and `apply` for memory-mapped matrices (but not for memory-mapped arrays).

## Sums and means

```{r cs}
a <- matrix( sample.int(100), 10, 10)
A <- as.mmatrix(a)
# Row sums and means
rowSums(A)
rowMeans(A)
```

Here the result is an R object, because its size does not exceed the value of the option `max.size`. Otherwise, it will be a memory-mapped object:

```{r cs2}
houba(max.size = 5)
rowSums(A)
```

## Applying Functions

The `apply` method extracts rows or columns as R objects. Again, the type of the result depends on the `max.size` option. If the size of the result is larger than `max.size`, a memory-mapped object is returned:

```{r apply}
houba(max.size = 5)
apply(A, 1, sd)
```

The data type of this object will be `double` or `integer`, depending on the values returned by the function.
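
The same distinction can be seen in plain R, which may help anticipate the data type of the result. The sketch below uses only base R (no **houba** call), with a small illustrative matrix `m` that is not part of the examples above; it assumes, without confirmation from the package, that **houba** simply follows the type of the values the function returns.

```r
# Base R only: inspect the type of the values returned by apply()
m <- matrix(1:12, 3, 4)          # a small integer matrix (illustrative)

row_sums <- apply(m, 1, sum)     # sum of integers stays integer
row_sds  <- apply(m, 1, sd)      # sd always returns a double

type_sums <- typeof(row_sums)    # "integer"
type_sds  <- typeof(row_sds)     # "double"
```
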
For example, the `sum` function will return integers:

```{r apply2}
apply(A, 1, sum)
```

And if the size of the result is smaller than `max.size`, an R object is returned:

```{r apply3}
houba(max.size = 1e6)
apply(A, 1, sd)
```

# Contributing to houba

You may e-mail the author with bug reports, feature requests, or contributions. The source of the package is on [GitHub](https://github.com/HervePerdry/houba).

Houba, hop!

```{r, echo = FALSE, results = "hide", message = FALSE}
options(oldoptions)
unlink(filename)
unlink(dsc)
```