Extended case study: an S4 “Annotated Matrix” class

Why bother?

Interoperability – packages in the same workflow re-use the same data structures, e.g., GRanges for describing regions of interest in sequencing experiments
Programmer efficiency – classes enable re-use, avoiding implementing functionality that already exists.
Often, the correct way to implement a class is not to – re-use existing classes instead, even if they're not quite perfect

Motivation

Numerical data typically augmented by important information about rows and columns
- Rows ('regions of interest'): gene symbols, genome coordinates, significance values from other tests
- Columns ('samples'): study identifiers, treatment groups, covariates
Significant risks of mis-aligning row and column data, with catastrophic and real-world consequences

Declaration

'Is a' versus 'has a'

'Is a' carries a lot of baggage and introduces a lot of constraints
Particularly challenging when thinking about extending base R objects, where the 'API' is not well-defined

Starting our class definition

A matrix() with row- and column data.frame() annotations

Construct with setClass; returns a simple generating function.

.AnnMat <- setClass("AnnMat",
  representation(matData="matrix", rowData="data.frame",
      colData="data.frame"))

'Is a' (inheritance) relationship via the contains= argument; multiple inheritance is possible.
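
For contrast with the 'has a' design used here, a minimal sketch of an 'is a' relationship via contains=. The classes DemoBase and DemoDerived are hypothetical and are not part of the case study.

```r
library(methods)

## Hypothetical classes, for illustration only
.DemoBase <- setClass("DemoBase", representation(id = "character"))
.DemoDerived <- setClass("DemoDerived",
    representation(score = "numeric"),
    contains = "DemoBase")

d <- .DemoDerived(id = "a", score = 1.5)
is(d, "DemoBase")    # TRUE: a DemoDerived 'is a' DemoBase
d@id                 # slot inherited from DemoBase
```

Methods written for DemoBase apply to DemoDerived objects too, which is both the convenience and the 'baggage' of inheritance.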

In use

am0 <- .AnnMat()
am1 <- .AnnMat(
    matData=matrix(1:10, 2, dimnames=list(letters[1:2], LETTERS[1:5])),
    rowData=data.frame(roi_id=1:2),
    colData=data.frame(
        sample_id=1:5,
        treatment=c("A", "A", "B", "B", "B")))

A simple method: accessors (“getters”)

Definition of generics

setGeneric("rowData", function(x, ...) standardGeneric("rowData"))

## [1] "rowData"


setGeneric("colData", function(x, ...) standardGeneric("colData"))

## [1] "colData"


setGeneric("matData", function(x, ...) standardGeneric("matData"))

## [1] "matData"

Then implement class-specific methods on the generics

setMethod("rowData", "AnnMat", function(x, ...) x@rowData)

## [1] "rowData"


setMethod("colData", "AnnMat", function(x, ...) x@colData)

## [1] "colData"


setMethod("matData", "AnnMat", function(x, ...) x@matData)

## [1] "matData"
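
A small sketch of the generic-plus-method pattern in use. To stay self-contained it uses a hypothetical class DemoSample and a hypothetical accessor sampleInfo() rather than AnnMat; the pattern is otherwise the same as above.

```r
library(methods)

## Hypothetical class and accessor, for illustration only
.DemoSample <- setClass("DemoSample", representation(info = "data.frame"))
setGeneric("sampleInfo", function(x, ...) standardGeneric("sampleInfo"))
setMethod("sampleInfo", "DemoSample", function(x, ...) x@info)

s <- .DemoSample(info = data.frame(sample_id = 1:3))
sampleInfo(s)$sample_id                     # callers never touch the slot
existsMethod("sampleInfo", "DemoSample")    # TRUE: a method is registered
```

Because users call sampleInfo() rather than using @info, the slot could later be renamed or computed on the fly without breaking their code.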

A simple method: dim and dimnames

Discover existing generic, for signature

getGeneric("dim")

## standardGeneric for "dim" defined from package "base"
## 
## function (x) 
## standardGeneric("dim", .Primitive("dim"))
## <bytecode: 0x103cb4558>
## <environment: 0x103c9ced8>
## Methods may be defined for arguments: x
## Use  showMethods("dim")  for currently available ones.

Implement a method on our class

setMethod("dim", "AnnMat", function(x) dim(matData(x)))

## [1] "dim"


setMethod("dimnames", "AnnMat", function(x) dimnames(matData(x)))

## [1] "dimnames"

Question: Hey neat, nrow() and ncol() (and rownames() and colnames()) for free! Why is that?

dim(am1)

## [1] 2 5

nrow(am1)

## [1] 2
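
One way to explore the question: nrow() and ncol() are ordinary functions written in terms of dim(), and rownames() is written in terms of dimnames(), so any class with methods for those two generics gets them too. The sketch below uses a hypothetical class DemoDim so that it runs on its own.

```r
library(methods)

## Hypothetical class, for illustration only
.DemoDim <- setClass("DemoDim", representation(m = "matrix"))
setMethod("dim", "DemoDim", function(x) dim(x@m))
setMethod("dimnames", "DemoDim", function(x) dimnames(x@m))

obj <- .DemoDim(m = matrix(1:6, 2, dimnames = list(c("a", "b"), NULL)))

body(nrow)       # nrow() simply calls dim(x) and takes the first element
nrow(obj)        # 2, via our dim() method
ncol(obj)        # 3
rownames(obj)    # "a" "b", via our dimnames() method
```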

A simple method: show

Purpose: brief summary when displaying (printing) during interactive use
Existing generic: getGeneric("show")
Many existing methods: showMethods("show", where=search())

Our implementation: brief summary, with an eye toward re-use by derived classes. Avoid direct slot access

setMethod("show", "AnnMat", function(object)
{
    cat("class:", class(object), "\n")
    cat("dim:", dim(object), "\n")
    cat("rowData names():", names(rowData(object)), "\n")
    cat("colData names():", names(colData(object)), "\n")
})

## [1] "show"

am1

## class: AnnMat 
## dim: 2 5 
## rowData names(): roi_id 
## colData names(): sample_id treatment

Question: Is there a better overall philosophy for show?

A more complicated method: updating ('replacement', 'setter') methods

Provide the illusion of, and a simple syntax for, in-place modification

A familiar example: update the value of a column in a data.frame

df <- data.frame(x=1:5, y=5:1)
df[,"x"] <- log(df$x)

R dispatches df[,"x"] to [.data.frame, “the subset method for data.frame”, and df[,"x"] <- value to [<-.data.frame, “the subset-replace method for data.frame”.
There actually is a function [<-.data.frame

head(get("[<-.data.frame"))

##                                                        
## 1 function (x, i, j, value)                            
## 2 {                                                    
## 3     if (!all(names(sys.call()) %in% c("", "value"))) 
## 4         warning("named arguments are discouraged")   
## 5     nA <- nargs()                                    
## 6     if (nA == 4L) {

R evaluates df[,"x"] <- value as df <- `[<-`(df, , "x", value=value), which dispatches to [<-.data.frame; the function modifies its first argument (a copy, if necessary) and returns it.
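
A short sketch of that translation, writing the replacement call out by hand and checking that it gives the same result as the usual syntax. The variable names df1 and df2 are only for this illustration.

```r
df1 <- data.frame(x = 1:5, y = 5:1)
df2 <- df1

## Usual replacement syntax
df1[, "x"] <- log(df1$x)

## The same update as an explicit call to the replacement function;
## the result has to be assigned back to the variable by hand.
df2 <- `[<-`(df2, , "x", value = log(df2$x))

identical(df1, df2)
```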

The signature of a replacement method, e.g., matData<-, takes the object to be updated, optional additional arguments, and the value to update the object with; the last argument must be named value
```
setGeneric("matData<-", function(x, ..., value)
    standardGeneric("matData<-"))
```
```
## [1] "matData<-"
```
Dispatch on one or both of x, value
Implement as a method that dispatches on both the object and value, updates the slot, and returns the updated object.
```
setReplaceMethod("matData", c("AnnMat", "matrix"),
    function(x, ..., value)
{
    x@matData <- value
    x
})
```
```
## [1] "matData<-"
```
Exercise: walk through how that assignment in the body works.
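
A sketch of why the modification is only an illusion of being in place: the method updates its own copy and returns it, and R re-binds the caller's variable to the returned object. The class DemoBox and the generic payload<- are hypothetical, named so as not to clash with AnnMat.

```r
library(methods)

## Hypothetical class and replacement generic, for illustration only
.DemoBox <- setClass("DemoBox", representation(payload = "matrix"))
setGeneric("payload<-", function(x, ..., value) standardGeneric("payload<-"))
setReplaceMethod("payload", c("DemoBox", "matrix"),
    function(x, ..., value)
{
    x@payload <- value    # modifies the method's local 'x'
    x                     # the caller's variable is re-bound to this value
})

b1 <- .DemoBox(payload = matrix(1:4, 2))
b2 <- b1
payload(b2) <- matrix(0L, 2, 2)   # same as b2 <- `payload<-`(b2, value = ...)

b1@payload[1, 1]    # still 1: b1 is untouched
b2@payload[1, 1]    # 0
```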

Another replacement method, for dimnames<- (the generic already exists; what is it?)

setReplaceMethod("dimnames", c("AnnMat", "list"), 
    function(x, value)
{
    dimnames(matData(x)) <- value
    x
})

## [1] "dimnames<-"

Exercise: walk through how that assignment in the body of the method works
Hey neat, we get rownames<- and colnames<- for free!
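
One way to approach the exercise is to write a nested replacement out as explicit calls, from the inside out. This is a sketch with a hypothetical class DemoNested and hypothetical inner() / inner<- generics; R's actual evaluation also uses a temporary variable, which is omitted here.

```r
library(methods)

## Hypothetical class, getter and setter, for illustration only
.DemoNested <- setClass("DemoNested", representation(m = "matrix"))
setGeneric("inner", function(x, ...) standardGeneric("inner"))
setGeneric("inner<-", function(x, ..., value) standardGeneric("inner<-"))
setMethod("inner", "DemoNested", function(x, ...) x@m)
setReplaceMethod("inner", c("DemoNested", "matrix"),
    function(x, ..., value) { x@m <- value; x })

x1 <- .DemoNested(m = matrix(1:4, 2))
x2 <- .DemoNested(m = matrix(1:4, 2))
nms <- list(c("a", "b"), c("A", "B"))

## Nested replacement syntax
dimnames(inner(x1)) <- nms

## Roughly the same thing by hand: get the matrix, update its dimnames,
## then put the updated matrix back into the object.
x2 <- `inner<-`(x2, value = `dimnames<-`(inner(x2), nms))

identical(x1, x2)
```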

A more complicated operation: validity

Constraints on dimensions: rowData must have one row per matData row, and colData must have one row per matData column
validity argument to setClass, or setValidity() function call.
Validity function is weird
- Each class in hierarchy visited, so no need to test for super-class properties
- Returns TRUE if the object is valid, otherwise a character vector describing each problem.

Evaluated frequently, so needs to be efficient / light-weight

setValidity("AnnMat", function(object) {
     msg <- NULL
     if (nrow(rowData(object)) != nrow(matData(object)))
         msg <- c(msg, 
             "number of rowData rows and matData rows differ")
     if (nrow(colData(object)) != ncol(matData(object)))
         msg <- c(msg,
             "number of colData rows and matData columns differ")
     if (is.null(msg)) TRUE else msg
})

## Class "AnnMat" [in ".GlobalEnv"]
## 
## Slots:
##                                        
## Name:     matData    rowData    colData
## Class:     matrix data.frame data.frame

In action:

.AnnMat(matData=matrix(1:10, 2), 
     rowData=data.frame(roi_id=1:2),
     colData=data.frame(sample_id=1:5))

## class: AnnMat 
## dim: 2 5 
## rowData names(): roi_id 
## colData names(): sample_id

cat(try({
     .AnnMat(matData=matrix(1:10, 2), 
         rowData=data.frame(roi_id=1:5),
         colData=data.frame(sample_id=1:2))
}))

## Error in validObject(.Object) : 
##   invalid class "AnnMat" object: 1: number of rowData rows and matData rows differ
## invalid class "AnnMat" object: 2: number of colData rows and matData columns differ
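
A caveat worth sketching: validity is checked when an object is created with arguments and when validObject() is called explicitly, but not by direct slot assignment with @. A setter that assigns a slot therefore needs to call validObject() itself if it should enforce the constraints. The class DemoPair is hypothetical.

```r
library(methods)

## Hypothetical class, for illustration only
.DemoPair <- setClass("DemoPair",
    representation(a = "numeric", b = "numeric"))
setValidity("DemoPair", function(object) {
    if (length(object@a) != length(object@b))
        "lengths of 'a' and 'b' differ"
    else TRUE
})

p <- .DemoPair(a = c(1, 2, 3), b = c(4, 5, 6))
p@b <- 1    # accepted: slot assignment only checks the class of the value

## An explicit check catches the now-invalid object
res <- tryCatch(validObject(p), error = function(e) conditionMessage(e))
res
```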

A more complicated method: subsetting

Why do we need this? Part of the informal matrix 'API' expected by a user

Discovery

getGeneric("[")

## standardGeneric for "[" defined from package "base"
## 
## function (x, i, j, ..., drop = TRUE) 
## standardGeneric("[", .Primitive("["))
## <bytecode: 0x102bac148>
## <environment: 0x102c2b190>
## Methods may be defined for arguments: x, i, j, drop
## Use  showMethods("[")  for currently available ones.

Possible methods multiply – x times i times j; e.g., i could be integer, logical, character, …
One approach – facade of methods that do minimal work to translate into a common base function
Special variable classes: ANY, missing
Exploit default initialize function, which acts as a copy constructor that updates slots in its first argument with values provided by named arguments.

Pass '...' to allow derived classes to use this method
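
A minimal sketch of initialize() used as a copy constructor, with a hypothetical class DemoRec: the first argument is an existing object, and named arguments replace the corresponding slots in the copy.

```r
library(methods)

## Hypothetical class, for illustration only
.DemoRec <- setClass("DemoRec",
    representation(id = "character", n = "numeric"))

r1 <- .DemoRec(id = "a", n = 1)
r2 <- initialize(r1, n = 2)   # copy of r1 with slot 'n' replaced

r1@n     # 1: the original is unchanged
r2@n     # 2
r2@id    # "a": slots that were not named are carried over
```

Because the copy keeps the class of its first argument, a method written this way also returns the right type when called on an object of a derived class.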

setMethod("[", c("AnnMat", "ANY", "ANY"),
    function(x, i, j,  ..., drop=TRUE)
{
    ## FIXME: warn user about ignoring 'drop'?
    initialize(x, matData=matData(x)[i, j, drop=FALSE],
        rowData=rowData(x)[i,,drop=FALSE],
        colData=colData(x)[j,,drop=FALSE], ...)
})

## [1] "["


setMethod("[", c("AnnMat", "ANY", "missing"),
    function(x, i, j, ..., drop=TRUE)
{
    initialize(x, matData=matData(x)[i,,drop=FALSE],
        rowData=rowData(x)[i,,drop=FALSE])
})

## [1] "["


setMethod("[", c("AnnMat", "missing", "ANY"),
    function(x, i, j, ..., drop=TRUE)
{
    initialize(x, matData=matData(x)[,j,drop=FALSE],
        colData=colData(x)[j,,drop=FALSE])
})

## [1] "["


setMethod("[", c("AnnMat", "missing", "missing"),
    function(x, i, j, ..., drop=TRUE)
{
    initialize(x, ...)
})

## [1] "["
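
The way ANY and missing select between methods like these can be sketched in isolation. The generic describeIndex() is hypothetical and only reports which method was chosen.

```r
library(methods)

## Hypothetical generic, for illustration only
setGeneric("describeIndex", function(i, j) standardGeneric("describeIndex"))
setMethod("describeIndex", c("ANY", "ANY"),
    function(i, j) "rows and columns")
setMethod("describeIndex", c("ANY", "missing"),
    function(i, j) "rows only")
setMethod("describeIndex", c("missing", "ANY"),
    function(i, j) "columns only")
setMethod("describeIndex", c("missing", "missing"),
    function(i, j) "everything")

describeIndex(1, 2)      # "rows and columns"
describeIndex(1)         # "rows only"
describeIndex(j = 2)     # "columns only"
describeIndex()          # "everything"
```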

Exercise: create some simple unit tests for these methods.
Seems 'good enough' for numeric or logical indexes, what about character?
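
A possible starting point for the unit-test exercise, using stopifnot() and a hypothetical stand-in class DemoAnn so that the sketch runs on its own. It covers only numeric indexes on both dimensions; logical and character indexes, and the methods with a missing index, are left to the reader.

```r
library(methods)

## Hypothetical stand-in for AnnMat, for illustration only
.DemoAnn <- setClass("DemoAnn",
    representation(matData = "matrix", rowData = "data.frame",
        colData = "data.frame"))

setMethod("[", c("DemoAnn", "ANY", "ANY"),
    function(x, i, j, ..., drop = TRUE)
{
    ## colData has one row per matrix column, so it is indexed by row with 'j'
    initialize(x, matData = x@matData[i, j, drop = FALSE],
        rowData = x@rowData[i, , drop = FALSE],
        colData = x@colData[j, , drop = FALSE])
})

d <- .DemoAnn(matData = matrix(1:10, 2),
    rowData = data.frame(roi_id = 1:2),
    colData = data.frame(sample_id = 1:5))

sub <- d[2, c(1, 3)]

## The matrix and both annotations must stay aligned after subsetting
stopifnot(
    identical(dim(sub@matData), c(1L, 2L)),
    identical(sub@matData[1, ], c(2L, 6L)),
    identical(sub@rowData$roi_id, 2L),
    identical(sub@colData$sample_id, c(1L, 3L))
)
```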

What else have we agreed to in the matrix API?