methods
packageS4 class system implemented in the methods
package
The ingredients:
The core components: classes, generic functions and methods
The glue: method dispatch
S4 classes are sometimes called formal classes
The result: a rich and complex system (218 symbols exported from the methods
package). Can be overwhelming. The good news is that we only need a small subset of the system.
Unlike other OO programming languages, S4 offers no provision for encapsulation: the internals of an object (i.e. all its slots) are visible and directly accessible. However, the Bioconductor approach is to consider that the internals are the business of the developer and that the end user should never need to access them directly or even look at them. Access should be only via accessor functions (getters or setters).
This has the following advantages:
The developer has the freedom to modify the internals of an object without breaking user code.
Carefully crafted setters will make sure that the object remains valid when being modified by the user.
This separation between what an object looks from the outside and what’s really inside allows the developer to use tricks at the internals level to make objects more compact in memory and/or more efficient.
So from a user point of view what matters is the semantic of an object i.e. what it represents and how it can be manipulated (API).
Even developers who implement a new class by extending an existing class should do it in a way that is agnostic about the internals of the class they extend. S4 makes this possible (it’s a great feature). More on this later.
A virtual class is a class that cannot be instanciated. Typically used as the parent of one or more concrete classes:
setClass("A", representation("VIRTUAL")) # virtual class
setClass("A1", contains="A", slots=...some slots...)
setClass("A2", contains="A", slots=...some other slots...)
For example the GRanges and GPos classes are both subclasses of virtual class GenomicRanges:
suppressPackageStartupMessages(library(GenomicRanges))
showClass("GenomicRanges")
## Virtual Class "GenomicRanges" [package "GenomicRanges"]
##
## Slots:
##
## Name: elementMetadata metadata
## Class: DataTable_OR_NULL list
##
## Extends:
## Class "Vector", directly
## Class "GenomicRanges_OR_missing", directly
## Class "GenomicRanges_OR_GRangesList", directly
## Class "Annotated", by class "Vector", distance 2
##
## Known Subclasses: "GRanges", "GPos", "DelegatingGenomicRanges", "GNCList"
Many methods can be implemented in a way that works for the 2 kinds of objects.
"names"
method:
selectMethod("names", "GRanges")
## Method Definition:
##
## function (x)
## names(ranges(x))
## <environment: namespace:GenomicRanges>
##
## Signatures:
## x
## target "GRanges"
## defined "GenomicRanges"
Will also work on GPos objects because the ranges()
getter works on these objects (and like for GRanges objects, it also returns a names IRanges object). So instead of defining 2 "names"
methods (one for GRanges and one for GPos objects) we define one method for GenomicRanges objects. GRanges and GPos objects inherit it.
A virtual class can have slots (like GenomicRanges), or no slots (like class A in our first example). Typically the slots of a virtual class are not enough to represent an object (this is why the class is made virtual). The concrete subclasses add the missing slots.
Powerful but can lead to a class hierarchy that is very hard to maintain if not used carefully.
showClass("CompressedIRangesList")
## Class "CompressedIRangesList" [package "IRanges"]
##
## Slots:
##
## Name: elementType elementMetadata metadata
## Class: character DataTable_OR_NULL list
##
## Name: unlistData partitioning
## Class: ANY PartitioningByEnd
##
## Extends:
## Class "IRangesList", directly
## Class "CompressedList", directly
## Class "RangesList", by class "IRangesList", distance 2
## Class "List", by class "CompressedList", distance 2
## Class "Vector", by class "CompressedList", distance 3
## Class "Annotated", by class "CompressedList", distance 4
##
## Known Subclasses: "CompressedNormalIRangesList"
S4 imposes no restrictions but I think we never had the need to define a class with more than 2 direct parents.
Just a special kind of virtual class with no slots.
showClass("GenomicRanges_OR_GRangesList")
## Virtual Class "GenomicRanges_OR_GRangesList" [package "GenomicRanges"]
##
## No Slots, prototype of class "S4"
##
## Known Subclasses:
## Class "GenomicRanges", directly
## Class "GRangesList", directly
## Class "GRanges", by class "GenomicRanges", distance 2
## Class "GPos", by class "GenomicRanges", distance 2
## Class "DelegatingGenomicRanges", by class "GenomicRanges", distance 2
## Class "GNCList", by class "GenomicRanges", distance 2
(This class is used in the definition of RangedSummarizedExperiment objects.)
Rarely needed. As a general principle, R objects should have a pass-by-copy semantic so passing them to a function is guaranteed to leave the original object unmodified. Pass-by-reference semantic is almost never needed and should be avoided.
Open A quick overview of the S4 class system (PDF) on the S4Vectors landing page and go to the Implementing an S4 class (in 4 slides) section.
Exercise
setClass()
statement, validity method, and "length"
method only). Allow the genome slot to contain an NAsetClass("SNPLocations",
slots=c(
genome="character", # a single string or NA
snpid="character", # a character vector of length N
chrom="character", # a character vector of length N
pos="integer" # an integer vector of length N
)
)
setValidity("SNPLocations",
function(object)
{
if (length(object@genome) != 1)
return("'genome' slot must have length 1")
slot_lengths <- c(length(object@snpid),
length(object@chrom),
length(object@pos))
if (length(unique(slot_lengths)) != 1)
return("'snpid', 'chrom' and 'pos' slots must have the same length")
TRUE
}
)
setMethod("length", "SNPLocations", function(x) length(x@snpid))
Call new("SNPLocations")
. What happens? Does this look right?
How could we change the behavior of new("SNPLocations")
so it generates a valid object?
setClass("SNPLocations",
slots=c(
genome="character",
snpid="character",
chrom="character",
pos="integer"
),
prototype=list(genome=NA_character_)
)
From scratch (a rare situation)
Extend an existing class (less work)
Class def (setClass
statement)
Validity method
Constructor function (named as the class, should be documented and user friendly).
"show"
method
At a minimum some getters Try to reuse existing generics instead of introducing new ones e.g.:
organism()
, score()
, strand()
(defined in BiocGenerics
)
genome()
, seqinfo()
(defined in GenomeInfoDb
)
Depending on the shape of the objects, methods for some of the following core generic functions (from base R):
length()
, names()
, subsetting ([
, [[
), dim()
, dimnames()
, c()
Some setters (unless the objects are not intended to be modified)
Setters for the corresponding getters
Methods for some of names<-
, subassignment ([<-
, [[<-
), dimnames<-
Specialized operations As methods for existing generics (e.g. normalize
defined in BiocGenerics
), or as new generics.
setAs("SNPLocations", "data.frame",
function(from)
data.frame(snpid=from@snpid, chrom=from@chrom, pos=from@pos)
)
Exercise
Implement a coercion method from SNPLocations to GRanges.
Most objects fall in one of these categories
length()
Examples: atomic vectors and ordinary list in base R; Rle, Hits, IRanges, GRanges, GRangesList, DNAString, DNAStringSet objects
vector-like
+ [[
Examples: ordinary list and data frames in base R. CharacterList, IRanges, IRangesList, GRangesList, DNAStringSet in Bioconductor.
dim()
and length(x)
(= prod(dim(x))
)Examples: matrix and array objects in base R. Matrix objects in the Matrix
package. DelayedArray objects in the DelayedArray
package.
dim()
+ list-like along the columnsExamples: data.frame objects in base R. DataFrame objects in Bioconductor.
These categories are not mutually exclusive:
list-like objects are vector-like objects
array-like objects are also vector-like objects
data-frame-like objects are list-like objects: length(x)
is ncol(x))
, names(x)
is colnames(x)
, x[[i]]
is x[ , i]
They are NOT matrix-like objects.
These categories reflect the shape of the object
linear shape: vector-like, list-like
rectangular shape: matrix-like, data-frame-like, SummarizedExperiment
multi-dimensional shape: array-like
Rectangular and multi-dimensional objects have an “implicit linear shape”. Think of it as the “backbone” of the object.
For example, 3 different linear shapes can be associated with a rectangular shape. They are described by the following “backbones”:
“matrix-like backbone”:
run from the top-left element to the bottom-right element of the rectangle, going thru the entire rectangle by running down along all the columns (continuing from the top of next column when reaching the bottom of a column)
This is the backbone of matrix-like objects. Array-like objects generalize this to N dimensions: backbone runs along the first dimension first (fastest changing dimension in R is the first one), then along the 2nd dimension, and so on…)
“data-frame-like backbone”:
run horizontally from first to last column of the rectangle
This is the backbone of data-frame-like objects.
Bioconductor introduces a 3rd type of backbone for rectangular objects:
run vertically from first to last row of the rectangle
This is the “SummarizedExperiment-like backbone”.
When implementing a class that represents rectangular objects, the developper should decide what the backbone of the objects will be. Concretely that means deciding what the "length"
method will return: prod(dim(x))
, ncol(x)
, or nrow(x)
? Once this choice is made, things like names(x)
, x[i]
, and c()
should behave consistently (e.g. c(x, y)
should concatenate the 2 objects along their backbones).
Making it clear what the backbone is in the documentation will help the user understand (and predict) the behavior of length(x)
, names(x)
, x[i]
, c()
, etc…
Exercise
What’s the shape of SNPLocations objects?
Subsetting is a powerful and very flexible operation in R.
4 subsetting operators: [
, [[
, [<-
, [[<-
Different types of subsetting:
[
[[
(sometimes called list-style subsetting).x[i]
or x[[i]]
. Sometimes called 1D-style subsetting.x[i_1, i_2, ..., i_n]
or x[[i_1, i_2, ..., i_n]]
. Some (or all) subscripts can be missing e.g. x[ , 3, ]
.y <- x[...]
or y <- x[[...]]
x[...] <- value
or x[[...]] <- value
. Subsetting is on the left side of the assignment. value
is called the right value or replacement value. This form of subsetting calls the [<-
or [[<-
operators. Sometimes called subassignment.R syntax allows any combination of these types e.g.:
y <- x[i, j, k]
: Single-bracket multi-dimensional subsetting
x[i] <- value
: Single-bracket linear subassignment
x[[i, j]] <- value
: Double-bracket two-dimensional subassignment
etc…
==> 8 possible combinations.
Even though all these forms are syntactically valid, they don’t necessarily make sense for all objects (and they should fail in that case).
For multi-dimensional objects that support it, linear subsetting is expected to be along the backbone of the object (1 <= i <= length(x)
).
Some examples:
Ordinary matrix, array, data.frame, DataFrame objects support the 8 forms. Gotcha: Implicit linear shape of matrix-like and data-frame-like objects differ, and so do linear subsetting on them.
SummuarizedExperiment objects support rectangular (x[i, j]
) and linear (x[i]
) subsetting and their replacement versions. x[i]
and x[i] <- value
are equivalent to x[i, ]
and x[i, ] <- value
, respectively. Gotcha: [[
is also supported but behaves in a non-standard way!
GRanges objects support linear [
and [<-
. As a convenience, rectangular subsetting is also supported (even though GRanges objects are NOT rectangular).
Exercise
Find the general man page for subsetting in base R.
Find the specific man page for subsetting a data frame.
Look at the signatures of the [
, [[
, [<-
, and [[<-
generic functions.
Exercise
What form of subsetting SNPLocations objects would naturally support?
Implement it.
Avoid surprising the user.
Pay particular attention to how length()
, names()
, [
, [[
, c()
, dim()
, and dimnames()
behave.
Also pay attention to 2 properties that are considered good properties:
Should the transformation I’m implementing preserve positionality?
Should it behave like an endomorphism?
Positionality is preserved when the output is parallel to the input, that is, it has the same length as the input and its i-th element corresponds to the i-th element in the input.
Examples:
All the operations in base R that are said to be vectorized e.g.:
is.na()
, duplicated()
, nchar()
, chartr()
, tolower()
, grepl()
, regexpr()
, rank()
Functions from the Math
group: log()
, sqrt()
, cumsum()
, …
Binary operators: +
, -
, *
, /
, ^
, ==
, <
, &
, |
, … Vectorized with respect to the longest operand.
match()
, %in%
(with respect to the 1st argument)
seqnames()
, start()
, end()
, width()
, strand()
getters on a GRanges or GRangesList object.
Intra-range transformations in Bioconductor e.g. shift()
, flank()
, …
gr <- GRanges(c("chr1:51-70", "chr1:75-100", "chr1:6-55"))
gr
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [51, 70] *
## [2] chr1 [75, 100] *
## [3] chr1 [ 6, 55] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
shift(gr, 1000)
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [1051, 1070] *
## [2] chr1 [1075, 1100] *
## [3] chr1 [1006, 1055] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Biostrings::translate()
on a DNAStringSet object (not on a DNAString).These operations typically propagate the names.
Also expected to propagate the metadata columns when applied to Bioconductor objects.
Do NOT preserve positionality:
grep()
, sort()
, order()
Summarization functions on atomic vectors e.g. min()
, max()
, sum()
, mean()
, prod()
, all()
, any()
, … However they DO preserve positionality when applied to NumericList or LogicalList objects from Bioconductor!
Set operations e.g. union()
, intersect()
, …
Inter-range transformations in Bioconductor (e.g. reduce()
, disjoin()
, …) do not preserve positionality in general:
reduce(gr)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [ 6, 70] *
## [2] chr1 [75, 100] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
These operations do NOT propagate the names or metadata columns.
Exercise
Try unique()
, sort()
, order()
on NumericList, CharacterList, and RleList objects.
How do you extract the unique values object-wise?
Count the values in a NumericList object that are greater than a given value.
An endomorphism is a transformation that preserves the class of the object.
Examples:
[
(when drop=FALSE
), except linear subsetting on an array-like object
Replacement methods: [<-
and [[<-
and other setters (e.g. names<-
, mcols<-
, seqinfo<-
, …) With some notable exceptions:
[<-
on an atomic vector when the storage.mode of the vector cannot represent the right-value (e.g. x[2] <- 3.75
on an integer vector, or x[2] <- "a"
on a numeric vector)
rowRanges()
setter on a SummarizedExperiment instance.
c()
on linear objects (e.g. on Hits, GRanges, GRangesList, DNAStringSet objects, …) With a notable exception on factors (this is bad!)
rbind()
and cbind()
on rectangular objects
Intra-range transformations
NOT endomorphisms:
c()
on array-like objects
Inter-range transformations are not endomorphisms in general:
gpos <- GPos(gr)
gpos
## GPos object with 96 positions and 0 metadata columns:
## seqnames pos strand
## <Rle> <integer> <Rle>
## [1] chr1 51 *
## [2] chr1 52 *
## [3] chr1 53 *
## [4] chr1 54 *
## [5] chr1 55 *
## ... ... ... ...
## [92] chr1 51 *
## [93] chr1 52 *
## [94] chr1 53 *
## [95] chr1 54 *
## [96] chr1 55 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
reduce(gpos)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [ 6, 70] *
## [2] chr1 [75, 100] *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Biostrings::translate()
It’s very likely that Bioconductor already has a container suited for the kind of data that you want to handle in your package.
If your data is rectangular, the first choice would be SummarizedExperiment.
If not, then maybe extending an existing class by adding a few slots is all you need to do.
Many packages have already done this with SummarizedExperiment e.g.:
DESeq2
with the DESeqDataSet and DESeqTransform containers
minfi
with the MethylSet, RGChannelSet, RatioSet, GenomicMethylSet, and GenomicRatioSet containers
VariantAnnotation
with the VCF class.
InteractionSet
with the InteractionSet container.
GenomicFiles
with the GenomicFiles container.
SGSeq
with the SGVariantCounts and SGFeatureCounts containers.
AllelicImbalance
with the ASEset, RiskVariant, RegionSummary, DetectedAI, and LinkVariantAlmlof containers.
and many more…
Ask on the bioc-devel mailing list if you need help finding the right container for your data (to use as-is or to extend).
The SummarizedExperiment vignette (available here) has a section dedicated to how to extend the RangedSummarizedExperiment class.
Starting a new class from scratch should be the last resort. Even in that case, it’s very likely that you could extend one of the virtual classes that are at the bottom of our core class hierarchy.
For example:
By extending the Annotated class you get the metadata()
getter and setter out-of-the-box.
length()
out-of-the boxmcols()
getter and setter out-of-the boxparallel slots
almost out-of-the box (via definition of a "parallelSlotNames"
method)parallel slots
if you decide that your object should support subsettingelementType()
almost out-of-the box (via adding a prototype in your class definition)as.list()
out-of-the box[[
and [[<-
).Exercise
Modify the definition of the SNPLocations class to make it extend Vector.
What to we gain by doing this?
Sometimes we need to extend a class not to add slots to it, but to add constraints on the objects.
For example NormalIRanges objects have the same slots as IRanges objects but NormalIRanges objects have the obligation to be normal (see Normality section in ?isNormal
for what that means).
A simpler example is the following: let’s say we want to implement a class that represent IRanges objects of constant width. A simple option is to just extend the IRanges class:
setClass("ConstantWidthIRanges", contains="IRanges")
and add the “constant width” constraint via a validity method:
setValidity("ConstantWidthIRanges",
function(object)
{
w <- width(object)
if (length(w) != 0L && any(w != w[[1]]))
return("object width is not constant")
TRUE
}
)
(Should we discuss length(w) != 0L && any(w != w[[1]]
vs length(unique(w)) > 1
?)
The S4 system provides automatic coercion methods from parent to child (promotion) and from child to parent (downgrade).
There is a gotcha with these methods though.
Let’s look at promotion:
x <- as(IRanges(6:8, width=10), "ConstantWidthIRanges")
y <- as(IRanges(6:8, end=10), "ConstantWidthIRanges")
Surprisingly the coercion worked for y
but produced an invalid ConstantWidthIRanges object:
validObject(x) # TRUE
validObject(y) # error!
Why?
Let’s look at downgrade:
x2 <- as(x, "IRanges")
y2 <- as(y, "IRanges")
Works fine (both objects are valid IRanges instances).
So coercion from IRanges to ConstantWidthIRanges is broken. How can we fix it?
Note that there are other approaches for implementing constant-width IRanges objects.
When we extend a class, we get a lot of things that work out-of-the-box (thanks to inheritance).
A few things don’t work exactly as expected though, so we need to fix them. The standard mechanism for doing this is by overriding existing methods.
3 important things to keep in mind when overriding methods:
We only need to do this for methods that don’t behave as expected.
We should NEVER override a method to change its semantic! For example implementing a "width"
method for ConstantWidthIRanges objects that return the width as a single number would be a very bad idea. Why? What should we do if we want something like that?
Last but not least: sometimes we do it to improve performance!
Example: "sum"
method for CompressedIntegerList objects.
Let’s say we want to implement a class for representing a view on an array i.e. a multi-dimensional window on an array. The window can be represented by a set of ranges, one range per dimension e.g. [1-4, 1-5] for the topleft corner of a matrix. For an array with N dimensions, we need N ranges to represent the view. We’ll call this kind of object a “viewport” (by analogy with the terminology used in the computer graphics world).
In our viewport objects, we also want to store the dimensions of “the reference array” i.e. the array on top of which the viewport is defined (a.k.a. “the underlying array”). This will be useful for validation purposes.
Consider the 2 following implementations:
setClass("ArrayViewport1",
contains="IRanges",
representation(
refdim="integer"
)
)
setClass("ArrayViewport2",
representation(
refdim="integer",
ranges="IRanges"
)
)
One extends the IRanges class so inherits the IRanges slots and full IRanges API. is(x, "IRanges")
will be TRUE on these objects. This says: “an ArrayViewport1 object is an IRanges object so can be used anywhere an IRanges object is expected”.
The other one uses composition. Unlike ArrayViewport1 objects, ArrayViewport2 objects are not considered to be IRanges object. But they have one inside them (stored in the 'ranges'
slot).
In the end, the 2 containers store the same data but their semantics are different. Which one to choose?
Now I want my viewport objects to have the shape of an array i.e. dim()
should return the dimensions of the sub-array delimitted by the viewport. Can I make the right choice now?
Talk about accessors, not about slots.
Focus on semantics, not on implementation details.
If your class extends a class that is already well documented, refer the reader to the man page of that class and add a link to it (typically in the \seealso
section). Don’t repeat this information in your man page.
Add \alias{MyClass}
(in addition to \alias{class:MyClass}
and \alias{MyClass-class}
) to make it easier for the end-user to find your man page.
See ?GPos
in the GenomicRanges
package for an example.
library(DelayedArray)
## Loading required package: matrixStats
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
##
## apply
grid <- ArrayRegularGrid(c(10L, 10L), spacings=c(5L, 4L))
grid
## 2 x 3 ArrayRegularGrid object on a 10 x 10 array:
## [,1] [,2] [,3]
## [1,] [1-5, 1-4] [1-5, 5-8] [1-5, 9-10]
## [2,] [6-10, 1-4] [6-10, 5-8] [6-10, 9-10]
Man pages in the methods
package: ?setClass
, ?showMethods
, ?selectMethod
, ?getMethod
, ?is
, ?setValidity
, ?setMethod
, ?setReplaceMethod
, ?as
, ?setAs
Note: S4 is NOT covered in the An Introduction to R or The R language definition manuals (both available on CRAN here).
The Writing R Extensions manual for some details about integrating S4 classes to a package.
The Extending RangedSummarizedExperiment section of the SummarizedExperiment vignette. Available on the SummarizedExperiment
package landing page.
Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)