1 R as we know it
2 Finding the inner R: the C API and implementation
3 Acknowledgements

1 R as we know it

1.1 Memory management: copy-on-change; garbage collection

Copy on change

It seems like R variables do not share any memory
```
x <- y <- 1:5
x
```
```
## [1] 1 2 3 4 5
```
```
y
```
```
## [1] 1 2 3 4 5
```
```
x[1] <- 2L
x
```
```
## [1] 2 2 3 4 5
```
```
y
```
```
## [1] 1 2 3 4 5
```
```
x <- 1:5
fun = function(z) { z[1] = 2L; z }
fun(x)
```
```
## [1] 2 2 3 4 5
```
```
x
```
```
## [1] 1 2 3 4 5
```
- Different from many programming languages
- Very convenient for users – they don’t have to think about ‘change-at-a-distance’.
R uses a ‘named’ concept, rather than strict ‘reference counting’
- Counts the number of references to a location in memory…
- But can only count to 2
- So ‘2’ means that at some point in the memory location’s history, 2 or more symbols referenced that location.

In action

x = 1:5
.Internal(inspect(x))

## @480c138 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5

@41de7d8: memory address.
13 INTSXP: integer ‘S-expression’ (more later).
[NAM(1)]: One symbol (x) references the location in memory.

x = y = 1:5
.Internal(inspect(x))

## @48f2f18 13 INTSXP g0c3 [MARK,NAM(2)] (len=5, tl=0) 1,2,3,4,5

.Internal(inspect(y))

## @48f2f18 13 INTSXP g0c3 [MARK,NAM(2)] (len=5, tl=0) 1,2,3,4,5

x, y point to same location in memory
[NAM(2)]: (at least) two symbols reference the location

‘Copy-on-change’: when updating a vector, make a copy if NAM() == 2.

## copy on change
x = y = 1:5
.Internal(inspect(x))

## @3f94c98 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5

x[1] = 2L
.Internal(inspect(x))

## @3cf5db0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 2,2,3,4,5

.Internal(inspect(y))

## @3f94c98 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5

## no copy necessary
x = 1:5
.Internal(inspect(x))

## @3bf3838 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5

x[1] = 2L
.Internal(inspect(x))

## @3aef798 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 2,2,3,4,5

Garbage collection (approximate description)

S-expressions (SEXP) allocated by the system to a pool of available S-expressions
x <- 1 moves the SEXP from allocated to ‘in use’ pool, level 0 (the 0 in g0c4 of .Internal(inspect(x)))
After rm(x), a symbol no longer references the SEXP and it is a candidate for re-use.
gc() (manual, or more typically triggered automatically when R runs out of SEXP in the available pool)
- Visits all SEXP in g0 pool and marks as ‘unreferenced’
- Visits all symbols in use and marks their SEXP as ‘in use’
- Visits (second time) all SEXP in g0 pool. Moves unreferenced symbols to the ‘available’ pool, moves ‘in use’ to the g1 pool.

Note: SEXP belong to ‘generations’ and classes

y = 1
fun = function() { x <- 1; 2 }
fun(); gc()    # scans x, y; collects x, ages y

## [1] 2

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 503648 26.9     940480 50.3   940480 50.3
## Vcells 939506  7.2    1941543 14.9  1199408  9.2

fun(); gc()    # scans & collects x, does not visit y

## [1] 2

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 503747 27.0     940480 50.3   940480 50.3
## Vcells 939776  7.2    1941543 14.9  1199408  9.2

g0 generation are recently allocated, and checked for suitability for garbage collection most frequently – typically symbols allocated in functions and no longer in scope; ripe for collection.
g1 has survived increasingly many garbage collections – package or globals symbols that seldom become eligible for collection.
Net effect: g0 nodes enriched for SEXP that can collected.

Resources: R Internals manual e.g., the write barrier, Luke Tierney’s notes; memory.c.

Performance consequences

Allocation and copying are comparatively expensive
Some common use patterns involve a lot of copying!
- c(), rbind(), cbind() in a loop
- Modifying a data.frame(), list(), or S4 object.

1.1.1 Exercises

Use .Internal(inspect()) to follow memory allocated to x in the following loop.

x <- integer()
n <- 5
for (i in 1:n)
    x <- c(x, i)

Answer:

n = 5
x = integer(); .Internal(inspect(x))
for (i in 1:n) {
    x <- c(x, i)
    .Internal(inspect(x))
}

x = integer(); .Internal(inspect(x))
for (i in 1:n) {
    x[i] = i
    .Internal(inspect(x))
}
x

If an integer occupies about 8 bytes of memory, and each iteration in the loop copies the current allocation into the new allocation, how many bytes are copied in a loop with n iterations?

Interpret the output of tracemem() for memory allocations when updating a value in a data.frame(). How much of the data.frame is being copied?

df <- mtcars
tracemem(df)        # The overall data.frame

## [1] "<0x45779e0>"

tracemem(df[[1]])   # The specific column

## [1] "<0x1766760>"

df[1, 1]

## [1] 21

df[1, 1] = 22.1

## tracemem[0x45779e0 -> 0x44fcd68]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x44fcd68 -> 0x447ce60]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x447ce60 -> 0x447d058]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x1766760 -> 0x264a210]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

df[2, 2] = 22.2

## tracemem[0x447d058 -> 0x4464e78]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x4464e78 -> 0x4464f20]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x4464f20 -> 0x4465070]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

Compare this to memory allocations when modifying a similarly sized matrix

m <- m1 <- matrix(0, nrow(mtcars), ncol(mtcars))
tracemem(m)         # A matrix is a vector with dims attributes

## [1] "<0x477afd0>"

m[1, 1] = 22.1

## tracemem[0x477afd0 -> 0x45b3820]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

m[2, 1] = 22.2

And to a ‘vectorized’ update

df <- mtcars
tracemem(df); tracemem(df[[1]])

## [1] "<0x45779e0>"

## [1] "<0x1766760>"

df[1:2, 1] <- c(22.1, 22.2)

## tracemem[0x45779e0 -> 0x3f57e58]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x3f57e58 -> 0x3f57f00]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x3f57f00 -> 0x3f58050]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous> 
## tracemem[0x1766760 -> 0x3d04bd0]: [<-.data.frame [<- eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

m <- m1 <- matrix(0, nrow(mtcars), ncol(mtcars))
tracemem(m)         # A matrix is a vector with dims attributes

## [1] "<0x4615310>"

m[1:2, 1] <- c(22.1, 22.2)

## tracemem[0x4615310 -> 0x3cd6510]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous> vweave_rmarkdown <Anonymous> <Anonymous> do.call <Anonymous>

1.2 Correct, robust, simple, fast: practical techniques

Here’s a simple function, implemented poorly

f <- function(n) {
    x <- integer()
    for (i in 1:n)
        x <- c(x, i)
    x
}

Correct

Most important!
identical() – very strict
all.equal() – allows for numerical approximation

Unit tests!

library(testthat)
test_that("f() is correct", {
    expect_identical(f(5), 1:5)
})

Robust

Edge cases, e.g., length 0 iterations, NA values, … Here we add some additional tests, and put them in a function for easy reuse.

tests <- function(f) {
    test_that("f() is correct and robust", {
        expect_identical(f(5), 1:5)
        expect_identical(f(1), 1L)
        expect_identical(f(0), integer(0))
    })
}

Our code fails the unit tests, because 1:n fails with n == 0

tryCatch(tests(f), error = conditionMessage)

## [1] "Test failed: 'f() is correct and robust'\n* f(0) not identical to integer(0).\nLengths differ: 2 vs 0\n"

Revised code, robust to common use case

f1 = function(n) {
    x <- integer()
    for (i in seq_len(n))
        x = c(x, i)
    x
}
tests(f1)

Fast

system.time(), but can be considerable run-to-run variation

microbenchmark

library(microbenchmark)
res <- microbenchmark(f1(100), f1(200), f1(400), f1(800), f1(1600))
plot(res)

Scales quadratically, because of copying. Solutions?

f2 <- function(n) {
    sapply(seq_len(n), function(i) i)
}
tests(f2)  # fails! n == 0

f2a <- function(n) {
    if (n == 0) {
        integer()
    } else {
        sapply(seq_len(n), function(i) i)
    }
}
tests(f2a)

f3 <- function(n) {
    vapply(seq_len(n), function(i) i, integer(1))
}
test_that("f() is correct", {
    expect_identical(f3(5), 1:5)
    expect_identical(f3(1), 1L)
    expect_identical(f3(0), integer(0))
})
tests(f3)

f4 <- function(n) {
    x <- integer(n)
    for (i in seq_len(n))
        x[[i]] = i
    x
}
tests(f4)

f5 <- seq_len
tests(f5)

n <- 1000
microbenchmark(f1(n), f2a(n), f3(n), f4(n), f5(n))

1.3 Symbol resolution and evaluation

Environments

Current environment

environment()

## <environment: R_GlobalEnv>

surprisingly central to R

env <- new.env()
env[["foo"]] <- 123
env[["foo"]]

## [1] 123

get("foo", env)      # equivalent (not really!)

## [1] 123

ls(env)

## [1] "foo"

hash table of key-value pairs
Key is called a name or symbol in R
constant-time lookup
‘reference’ semantics (almost all else in R is ‘copy-on-change’ semantics)
```
another <- env
env[["bar"]] <- 456
another[["bar"]]
```
```
## [1] 456
```
All environments have a ‘parent’ environment
```
parent.env(env)
```
```
## <environment: R_GlobalEnv>
```
- special environment: emptyenv()
- parent environment is set during construction (argument to new.env())

Environment can be chained together as a linked list

top <- new.env(parent = emptyenv())
mid <- new.env(parent = top)
bot <- new.env(parent = mid)

Symbol (key) look-up

top[["top"]] <- 1
mid[["mid"]] <- 2
bot[["bot"]] <- 3

Symbols not found in the reference enviroment are looked for in the parent environment

get("bot", bot)

## [1] 3

get("mid", bot)

## [1] 2

tryCatch({
    get("mid", bot, inherits=FALSE) # restricts search to current env
}, error = conditionMessage)

## [1] "object 'mid' not found"

bot[["mid"]]                        # NULL; `[[` restricts to current env

## NULL

Assignment is to the current environment

bot[["mid"]] <- 5
bot[["mid"]]

## [1] 5

mid[["mid"]]

## [1] 2

Functions are intimately connected to environments

fun <- function() {
    environment()
}
fun()              # an environment defines the body of a function

## <environment: 0x593b1a8>

fun()              # new function, so new environment

## <environment: 0x59a57f0>

What happens to the environment in a function?
- garbage collection (later!) when no longer referenced

If a function body has an environment, there must be a parent environment

fun <- function() {
    env <- environment()
    list(env, parent.env(env))
}
fun()

## [[1]]
## <environment: 0x4bb9690>
## 
## [[2]]
## <environment: R_GlobalEnv>

Is the parent of the environment of a function the environment in which the function was called from, or the environment in which the function was defined?

fun <- function() {
    fun1 <- function() {
        env1 <- environment()    # envrionment created by fun1
        par1 <- parent.env(env1) # parent environment
        list(env1 = env1, par1 = par1)
    }
    env <- environment()        # environment created by fun
    par <- parent.env(env)
    list(env = env, par = par, fun1=fun1)
}
fun()

## $env
## <environment: 0x3d44050>
## 
## $par
## <environment: R_GlobalEnv>
## 
## $fun1
## function () 
## {
##     env1 <- environment()
##     par1 <- parent.env(env1)
##     list(env1 = env1, par1 = par1)
## }
## <environment: 0x3d44050>

The parent of a function’s environment is the environment in which it was defined (different from the environment from which it was called)
Lookup, assignment (<-) and assignment to the parent(s) environment (<<-)

fun = function() {
    x0 <- 0
    x1 <- 0
    fun1 <- function() {
        y <- 1     # create local variable y
        x1 <<- 2   # assign to first occurrence of 'x' in parent env(s)
        z <<- 3    # no z in parent env(s), so created in top env
        c(x0 = x0, x1 = x1, y = y, z = z)
    }

    result <- fun1()
    list(x0 = x0, x1 = x1, result = result)
}

This is called lexical scope

When is lexical scope used?

Package namespaces, imports, and search()

The `search() path

search()

##  [1] ".GlobalEnv"             "package:microbenchmark"
##  [3] "package:testthat"       "package:BiocStyle"     
##  [5] "package:stats"          "package:graphics"      
##  [7] "package:grDevices"      "package:utils"         
##  [9] "package:datasets"       "package:methods"       
## [11] "Autoloads"              "package:base"

library(BiocGenerics): loading and attaching a package
1. Package NAMESPACE is loaded into the current R session
2. An environment containing exported symbols from the package is attached to the R search path
‘Attaching’ a package means that the .GlobalEnv’s parent points to the new package exported symbol environment. The new package exported symbol environment points to the package previously pointed to by .GlobalEnv
A package consists of a NAMESPACE
- A NAMESPACE is an environment.
- The parent of the NAMESPACE is an environment containing imported symbols.
- The parent of the environment with imported symbols is the base environment.
- The parent of the base environment is the .GlobalEnv.
- …and so on down the search path.
Symbol resolution in packages follows lexical scope
- Resolved in the function, then package NAMESPACE, then imports environment, then base then .GlobalEnv, then …

Evaluation, lazy and non-standard evaluation

1.3.1 Exercises

Exercise: bank account

Congratulations, you own a bank! Here’s a function that creates an account, and provides access to it

account <- function() {
   balance <- 0

   available <- function() {}

   deposit <- function(amt) {}

   withdraw <- function(amt) {}

   list(available = available, deposit = deposit, withdraw = withdraw)
}

Answer:

account <- function() {
   balance <- 0

   available <- function() {
       balance
   }

   deposit <- function(amt) {
       balance <<- balance + amt
       balance
   }

   withdraw <- function(amt) {
       if (amt > balance)
           stop("insufficient funds")
       deposit(-amt)
   }

   list(available = available, deposit = deposit, withdraw = withdraw)
}

Implement for the following test case, using the concept of lexical scope:

library(testthat)
test_that("I understand lexical scope", {
    acct1 <- account()
    expect_equal(acct1$deposit(10), 10)
    expect_equal(acct1$deposit(12), 22)
    expect_equal(acct1$available(), 22)
    expect_equal(acct1$withdraw(20), 2)
    expect_error(acct1$withdraw(20), "insufficient funds")
})

What happens when you get a second customer?

test_that("functions create local scope", {
    acct1 <- account(); acct1$deposit(10)
    acct2 <- account(); acct2$deposit(20)
    expect_equal(acct1$available(), 10)
    expect_equal(acct2$available(), 20)
})

Answer:

account <- function() {
   balance <- 0

   available <- function() {
       balance
   }

   deposit <- function(amt) {
       balance <<- balance + amt
       funs
   }

   withdraw <- function(amt) {
       if (amt > balance)
           stop("insufficient funds")
       deposit(-amt)
   }

   funs = list(
       available = available, deposit = deposit, withdraw = withdraw
   )

   funs
}

Exercise: bank account 2

What changes to implementation would be require to allow

test_that("endomorphisms rock", {
    expect_equal(account()$deposit(10)$withdraw(5)$available(), 5)
})

Exercise: package-local options

Implement ‘package-local’ options that the user can set via functions. Do this by using local() to create and populate an environment that contains the package-local variables as well as accessor functions get() and set() to retrieve / assign variables, e.g., tolerance. Here’s a start, and some unit tests
```
.myoptions <- local({
    tolerance = 1e-4

    get = function() {
        tolerance
    }

    set = function(value) {
    }

    ## ...
})

## setTolerance = ...
```

Answer:

.myoptions <- local({
    tolerance = 1e-4
    get = function() tolerance
    list(
        get = get,
        set = function(value) {
            ovalue <- get()
            tolerance <<- value
            ovalue
        }
    )
})

getTolerance <- .myoptions$get
setTolerance <- .myoptions$set

Here are some unit tests

test_that("local options can be set", {
    expect_equal(getTolerance(), 1e-4)
    ## setTolerance() returns previous value, to ease resetting
    expect_equal(otol <- setTolerance(1e-5), 1e-4)
    expect_equal(setTolerance(otol), 1e-5)
})

1.4 Debugging

f <- function(x) {
    ## ...
    g()
}

g <- function(x) {
    ## ...
    if (log(x) < 0) {
        TRUE
    } else {
        FALSE
    }
}

Invoke traceback() after error.
```
f(-1)
traceback()
```
debugonce(g): enter the ‘browser’ when the function is invoked. See ?browser for what can be done. Variant: debug(g) … undebug(g).
```
debugonce(g)
f(-1)
## Browser[2]>
## ...
```

browser(). Edit the source code (not very practical for complex or package code)

g <- function(x) {
    ## ...
    browser()
    if (log(x) < 0) {
        TRUE
    } else {
        FALSE
    }
}

options(error=recover). When done, options(error=NULL). Actually, any function accepting 0 arguments is possible for recover, e.g., options(error=traceback).

f(-1)
## Error: all(x >= 0) is not TRUE
##
## Enter a frame number, or 0 to exit
##
## 1: f(-1)
## 2: #2: stopifnot(all(x >= 0))
##
## Selection: 2
## Called from: f(-1)
## Browse[1]> c
##
## Enter a frame number, or 0 to exit
##
## 1: f(-1)
## 2: #2: stopifnot(all(x >= 0))
##
## Selection:

trace() – especially useful for tracing execution (e.g., tracer = quote(print(argname))), and for S4 methods (argument signature=)

## like debug
trace(g, tracer=browser)
f(-1)

trace(g, quote({ print(x); print(log(x)) }))
for (i in runif(1000, -1, 100)) f(i)

trace(g, quote(if (x < 0) browser()))
for (i in runif(1000, -1, 100)) f(i)

library(GenomicRanges)
showMethods("findOverlaps")
selectMethod("findOverlaps", c("GRanges", "GRanges"))
trace(findOverlaps, browser, signature=c("GenomicRanges", "GenomicRanges"))

Found a bug, now what?

Recover more gracefully. (Somehow not a good solution – would rather avoid the problem than treat the symptom).

g <- function(x) {
    ## ...
    tryCatch({
        if (log(x) < 0)
            TRUE
        else
            FALSE
    }, error = function(e) {
       NA
    })

Program more cleverly / robustly – in our trivial example, the if() is irrelevant, and introduces a cryptic error message!
```
g <- function(x) {
    ## ...
    log(x) < 0  # NA, instead of error from if (NA) (!)
```

Best: assert pre-conditions (and program more cleverly).

f <- function(x) {
    stopifnot(is.numeric(x), length(x) == 1, !is.na(x), x >= 0)
    ## ...
    g(x)
}

1.5 Packages

Structure: A simple directory with specific files and structure

DESCRIPTION and NAMESPACE files (required)
- DESCRIPTION: key-value pairs describing the package, e.g., ‘Title:’, ‘Version:’, ‘License:’, etc.
R/ directory
- Files containing functions
- Following a logical organization with consistent formatting
man/ directory
- Files in ‘.Rd’ format
- Structured in parallel to R/ directory, R/foo.R –> R/foo.Rd
vignettes/ directory
- One or more vignettes, R markdown (.Rmd) or LaTeX / Sweave (.Rnw)
- Markdown much prefered for user-oriented documentation.
- My preference: no ‘derived’ files, e.g., HTML produced from Rmd.
inst/
- data/: R objects, e.g., ‘.rds’ file.
- extdata/: Non-R data resources, e.g., SQL data base.
- …/: any other files or folders that need to be _inst_alled into the R library.
src/: C, C++, Fortran, or other ‘foreign’ code.

Creation

devtools::create("MyPackage")

Development

devtools::load_all() to reduce the edit-install-run loop.
devtools::use_testthat() and devtools::test() for testthat unit tests.
devtools::document() (roxygen2) for documentation.
devtools::test() (run unit tests); devtools::check()
devtools::install()

N.B.

ALL of the steps above (package creation, installation, writing .Rd documentation files, build, check) can be done ‘easily’ from scratch.

Final arbiter:

R CMD build MyPackage   # source for version 0.99.1 of package
R CMD check MyPackage_0.99.1.tar.gz

Mangement

git and GitHub!

2 Finding the inner R: the C API and implementation

2.1 Behind the curtain

S-expressions

R ‘atomic’ vectors (numeric(), logical(), etc), lists, environments, functions, etc., are represented by C structs
The C struct is called an ‘S-expression’, represented by the symbol SEXP (which I pronounce ‘S-exp’)
SEXP is defined in Rinternals.h
An SEXP is polymorhpic – the interpretation of the struct depends on type information contained in the SEXP itself. Types are enumerated as INTSXP, REALSXP, …
- STRSXP: character(), each element of a character vector is a CHARSXP.
- VECSXP: list(); LISTSXP is a ‘pair-list’ primarily used to represent function arguments.
- EXTPTRSXP: an R reference to arbitrary data that is not represented in R, e.g., an C++ class.
All SEXP contain sexpinfo, with named, garbage collect, marked, and other status, as well as actual data (e.g., the integers of an integer() vector.
Macros (actually, functions in package code) control access to the SEXP data structure. Several styles
- e.g., LENGTH(), Rf_length()
- Generally, prefer the Rf_* style.

From R to C

Several routes – .Internal(), .External(), .C(), .Call(), …
.C(): R does some work to coerce R types, e.g., integer() to C types, int *; useful for many light-weight purposes.
- character() and more complicated objects difficult to deal with.
- Return value must be allocated and passed from R.
.Call(): R arguments are represented as SEXP
- Flexible; main focus of serious development.
- Rcpp uses this convention, but wraps the details of R’s SEXP in C++ idioms.

The public API

Defined by Rinternals.h and friends
Documented in Writing R Extensions ‘System and foreign language interface’.

2.2 Old-school approaches to foreign langauges

When to worry?

Access third-party libraries
Novel algorithms
Simpler reasoning
Copy-free operations

.C()

.Call()

2.2.1 Example: uuid

uuid: globally unique id.
Provided by Boost uuid ‘header only’ C++ library.

Create a package and ‘src’ directory. Add the LinkingTo: tag to indicate that the package is going to link to (C++) header files provided by the BH (’Boost Headers’) package.

devtools::create("/tmp/uuid", list(LinkingTo = "BH"))

## Creating package 'uuid' in '/tmp'

## No DESCRIPTION found. Creating with values:

## Package: uuid
## Title: What the Package Does (one line, title case)
## Version: 0.0.0.9000
## Authors@R: person("First", "Last", email = "first.last@example.com", role = c("aut", "cre"))
## Description: What the package does (one paragraph).
## Depends: R (>= 3.4.0)
## License: What license is it under?
## Encoding: UTF-8
## LazyData: true
## LinkingTo: BH

## * Creating `uuid.Rproj` from template.

## * Adding `.Rproj.user`, `.Rhistory`, `.RData` to ./.gitignore

dir.create("/tmp/uuid/src")

Write unit tests

devtools::use_testthat("/tmp/uuid")

## * Adding testthat to Suggests

## * Creating `tests/testthat`.

## * Creating `tests/testthat.R` from template.

cat('
context("uuid")

test_that("uuid returns different values", {
    expect_true(is.character(uuid()))
    expect_true(length(uuid()) == 1L)
    uu <- unique(unlist(replicate(100, uuid())))
    expect_identical(length(uu), 100L)
})',
    file = "/tmp/uuid/tests/testthat/test_uuid.R"
)

Implement the ‘C++’ layer

cat('
#include <boost/uuid/uuid_generators.hpp>
#include <boost/uuid/uuid_io.hpp>

static boost::uuids::random_generator uuid_generator =
    boost::uuids::random_generator();

std::string uuid_generate()
{
    return boost::uuids::to_string(uuid_generator());
}
',
    file = "/tmp/uuid/src/uuid.cpp"
)

Implement the interface from C++ to R’s C layer.

cat('
#include <Rinternals.h>

SEXP uuid()
{
    return mkString(uuid_generate().c_str());
}
',
    file = "/tmp/uuid/src/uuid.cpp",
    append = TRUE
)

Implement the interface from C to R.

cat('
#include <R_ext/Rdynload.h>

extern "C" {

    static const R_CallMethodDef callMethods[] = {
        {".uuid", (DL_FUNC) &uuid, 0},
        {NULL, NULL, 0}
    };

    void R_init_uuid(DllInfo *info)
    {
        R_registerRoutines(info, NULL, callMethods, NULL, NULL);
    }
}',
    file = "/tmp/uuid/src/uuid.cpp",
    append = TRUE
)

Implement the end-user API in R.

cat("
#' @useDynLib uuid, .registration = TRUE
#' @export
uuid <- function()
     .Call(.uuid)
",
    file = "/tmp/uuid/R/uuid.R"
)

Document, test, install, use!

devtools::document("/tmp/uuid")

## Updating uuid documentation

## Loading uuid

## Re-compiling uuid

## '/home/mtmorgan/bin/R-3-4-branch/bin/R' --no-site-file --no-environ  \
##   --no-save --no-restore --quiet CMD INSTALL '/tmp/uuid'  \
##   --library='/tmp/RtmpPZvmxO/devtools_install_20c366d77c59' --no-R  \
##   --no-data --no-help --no-demo --no-inst --no-docs --no-exec  \
##   --no-multiarch --no-test-load

##

## Updating roxygen version in /tmp/uuid/DESCRIPTION

## Writing NAMESPACE

devtools::test("/tmp/uuid")

## Loading uuid

## Testing uuid

## uuid: ...
## 
## DONE ========================================================================

devtools::install("/tmp/uuid")

## Installing uuid

## '/home/mtmorgan/bin/R-3-4-branch/bin/R' --no-site-file --no-environ  \
##   --no-save --no-restore --quiet CMD INSTALL '/tmp/uuid'  \
##   --library='/home/mtmorgan/R/x86_64-pc-linux-gnu-library/3.4-Bioc-3.6'  \
##   --install-tests

##

## Reloading installed uuid

2.2.2 Exercises

Add the following unit test and implement appropriate functionality

test_that("uuid returns multiple values", {
    n <- 5
    result <- uuid(5)
    expect_true(is.character(result))
    expect_identical(length(result), 5L)
    expect_identical(length(unique(result)), 5L)
})

Do this in several stages, from ‘easy’ to harder

Implement the R infrastructure by adding an argument n to the definition of the R-level uuid(). Provide n with a default value so that the original unit tests continue to work, i.e., that devtools::test("/tmp/uuid") still passes the original tests.
Pass the argument n to C. Do this by adding n to the R-level .Call(). Then modify the C-level uuid() to accept an SEXP as argument, so the signature is SEXP uuid(SEXP n_sexp). Update the callMethods[] array so that the uuid function takes 1 argument. Test that the source code compiles using, e.g., devtools::install("/tmp/uuid") or by running the unit tests again.
In C, modify uuid() to check that the argument is in fact a scalar integer. Do this by using the macro IS_SCALAR() and function Rf_asInteger() and the constant R_NaInt; see the header file ‘Rinternals.h’ in R.home("include"). Signal an error with Rf_error(). You’ll add code that looks like
```
bool test = IS_SCALAR(n_sexp, INTSXP) && (R_NaInt != Rf_asInteger(n_sexp));
if (!test)
    Rf_error("'n' cannot be coerced to integer(1) and not NA");
int n = Rf_asInteger(n_sexp);
```
Modify uuid() to allocate an SEXP to contain the result – a STRSXP of length n. This allocated SEXP needs to be protected from garbage collection, because it is an R memory allocation not associated with a symbol. Protection is provided by the PROTECT() macro, which is traditional wrapped around functions that return SEXP.
```
SEXP result = PROTECT(Rf_allocVector(n, STRSXP));
```
Fill the return vector using a C loop, constructing a CHARSXP for each element using mkChar() and setting the ith element of result using the function SET_STRING_ELT(). Although at the R level the first element of a vector is element number 1, in C the first element is 0.
```
for (int i = 0; i < n; ++i)
    SET_STRING_ELT(result, i, mkChar(uuid_generate().c_str()));
```
Our function is done; use UNPROTECT() to remove the protection on the allocated SEXP (the calling function, in this case R itself, is resposible for PROTECTing any SEXP it recieves) and return the result to R.
```
UNPROTECT(1);
return result;
```
(Optional) As a nicety for our users, coerce their R-level argument n to an integer (so, e.g., they can provide 5 rather than 5L as the argument). This can be done at the R level using as.integer(), or at the C level with the following. Consider the merits of each approach.
```
n_sexp = PROTECT(Rf_coerceVector(n_sexp, INTSXP));
```

2.3 Rcpp

Create a new package using Rcpp::Rcpp.package.skeleton(). Edit the DESCRIPTION file to include BH in the LinkingTo: field.

Add a file src/rcpp_uuid.cpp with the boost-related code

#include <boost/uuid/uuid_generators.hpp>
#include <boost/uuid/uuid_io.hpp>

static boost::uuids::random_generator uuid_generator =
    boost::uuids::random_generator();

std::string uuid_generate()
{
    return boost::uuids::to_string(uuid_generator());
}

Include the Rcpp.h header and use the Rcpp namespace
```
#include <Rcpp.h>
using namespace Rcpp;
```
Implement the function using Rcpp structures like CharacterVector; no need to worry about coercing the argument, PROTECTion, etc. Decorate the function with the // [[Rcpp::export]] attribute.
```
// [[Rcpp::export]]
CharacterVector rcpp_uuid(int n) {
    CharacterVector uuids(n);
    for (int i = 0; i < uuids.size(); ++i)
        uuids[i] = uuid_generate();
    return uuids;
}
```
In R, make Rcpp process the attributes; check out src/RcppExports.cpp and R/RcppExports.R

> Rcpp::compileAttributes()

Install and use!

Rcppuuid$ R CMD INSTALL .
[...]
Rcppuuid$ R --vanilla -e "Rcppuuid::rcpp_uuid(3)"
> Rcppuuid::rcpp_uuid(3)
[1] "cb9cc7e8-bfcb-4262-b05b-2f1c0a4bc196"
[2] "7b592e1f-8341-4da0-af04-b387726af8ea"
[3] "7b2ade09-6ff2-4e76-b564-31b1a626e864"

2.4 Fun (?!) with C-level debugging

Setup

gdb on linux, lldb on Mac; [gdb-to-lldb][gdb-lldb] command map.
Code compiled with debugging symbols and without optimizations

One approach:

$ cat ~/.R/Makevars
CFLAGS = -g -O0
CXXFLAGS = -g -O0
CXX11FLAGS = -g -O0

Alternative: configure R with appropriate CFLAGS, etc.

Overall

Get and install a copy of our uuid package from GitHub

$ git clone https://github.com/Bioconductor/BiocAdvanced
$ cd BiocAdvanced
$ git checkout Zurich-2017
$ cd inst/uuid
$ R CMD INSTALL .

Start R and launch debugger
```
$ R -d gdb
(gdb)
```

Start the R process, load packages, etc., break into the debugger

(gdb) run
> library(uuid)
> uuid()
Error in uuid() : invalid type/length (symbol/16) in vector allocation

Break into the debugger, set a breakpoint at the point where R throws an error, and continue
```
> ^C
(gdb) break Rf_error
(gdb) continue
>
```

Trigger the error

> uuid()
Breakpoint 1, Rf_error (
    format=0x643020 "invalid type/length (%s/%d) in vector allocation")
    at /home/mtmorgan/src/R-3-4-branch/src/main/errors.c:821
821 {
(gdb)

Discover where you are in the call stack. Mostly we’re in R’s source code where bugs are unlikely (though not impossible!) three up is our own code

(gdb) where
#0  Rf_error (
    format=0x643020 "invalid type/length (%s/%d) in vector allocation")
    at /home/mtmorgan/src/R-3-4-branch/src/main/errors.c:821
#1  0x0000000000525dbc in Rf_allocVector3 (type=1, length=16, allocator=0x0)
    at /home/mtmorgan/src/R-3-4-branch/src/main/memory.c:2612
#2  0x000000000050e5c0 in Rf_allocVector (type=1, length=16)
    at /home/mtmorgan/src/R-3-4-branch/src/include/Rinlinedfuns.h:196
#3  0x00007fffed786780 in uuid (n_sexp=0x21a16a8) at uuid.cpp:26
#4  0x000000000048b7e3 in R_doDotCall (ofun=0x7fffed7866ac <uuid(SEXPREC*)>,
    nargs=1, cargs=0x7fffffffbdd0, call=0x19f7d48)
    at /home/mtmorgan/src/R-3-4-branch/src/main/dotcode.c:570

Navigate and look around

3 Acknowledgements

Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)

Inside the mind of R

June 19 - 20, 2017

Contents

1 R as we know it

1.1 Memory management: copy-on-change; garbage collection

1.1.1 Exercises

1.2 Correct, robust, simple, fast: practical techniques

1.3 Symbol resolution and evaluation

1.3.1 Exercises

1.4 Debugging

1.5 Packages

2 Finding the inner R: the C API and implementation

2.1 Behind the curtain

2.2 Old-school approaches to foreign langauges

2.2.1 Example: uuid

2.2.2 Exercises

2.3 Rcpp

2.4 Fun (?!) with C-level debugging

3 Acknowledgements