The R language

Motivation and relevance

Large data requires computationally efficient approaches to analysis.
Understanding the way R works provides some insight into how we can write efficient code, expanding the range of problems that we can reasonably tackle in R. It is not unusual for efficient R code to be just as correct as naive code, but 100x faster.
Parallel evaluation allows us to scale efficient code to exploit computational resources both on our personal computers and in high performance environments, further expanding the scale of analysis reasonably undertaken in R.

'Atomic' vectors

logical(), integer(), numeric(), complex(), character(), raw(); list() is also a vector, but not an atomic one
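As a minimal sketch of the vector types listed above, the following builds one small example of each and asks R for its storage type and whether it is atomic. The names `examples`, `types` and `atomic` are made up for this illustration.

```r
# One small value of each basic vector type (names are illustrative only)
examples <- list(
    logical   = c(TRUE, FALSE, NA),
    integer   = 1:3,
    numeric   = c(1, 2.5),
    complex   = 1+2i,
    character = c("a", "b"),
    raw       = as.raw(c(1, 255)),
    list      = list(1L, "a")
)

# typeof() reports the internal storage type; note numeric() is stored as "double"
types <- vapply(examples, typeof, character(1))
types

# is.atomic() is TRUE for every type here except the list
atomic <- vapply(examples, is.atomic, logical(1))
atomic
```

Note that numeric() reports "double" as its storage type, and that the list is the only non-atomic entry.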

`data.frame()`, `matrix()`

Attributes

str() (including 2nd argument) and dput()
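A short sketch of how attributes, str() and dput() fit together. It assumes that the "2nd argument" of str() mentioned above is `max.level`, the second argument of the default str() method; the objects `x`, `m` and `nested` are invented for the example.

```r
x <- c(a = 1, b = 2, c = 3)    # names are stored as an attribute
attributes(x)

m <- 1:6
dim(m) <- c(2, 3)              # adding a 'dim' attribute turns the vector into a matrix
is.matrix(m)

str(m)                         # compact summary of type, dimensions and values
dput(x)                        # text representation that can recreate the object

nested <- list(a = 1, b = list(c = 2, d = 3))
str(nested, max.level = 1)     # assumed meaning of "2nd argument": limit nesting depth
```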

Exercise: what's a factor()?

`environment()`

Pass-by-value versus pass-by-reference
Utility? Why would pass-by-value be useful?
Confusion? How would pass-by-value confuse the user?
Each environment has a parent. By default, symbols not found by get() in the environment are searched for in the parent, then in the parent's parent, and so on.
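The points above can be sketched as follows. The helper functions `set_value()` and `set_list()` and the environments `outer_env` and `inner_env` are hypothetical names introduced only for this illustration.

```r
# Environments behave as if passed by reference: the callee's change is visible to the caller
set_value <- function(env) {
    env[["value"]] <- 2
    invisible(NULL)
}
e <- new.env()
e[["value"]] <- 1
set_value(e)
e[["value"]]                   # 2

# A list behaves as if passed by value: the caller's copy is untouched
set_list <- function(lst) {
    lst[["value"]] <- 2
    invisible(NULL)
}
l <- list(value = 1)
set_list(l)
l[["value"]]                   # still 1

# Symbol lookup falls through to the parent environment
outer_env <- new.env()
inner_env <- new.env(parent = outer_env)
assign("z", 10, envir = outer_env)
get("z", envir = inner_env)                       # found via the parent
exists("z", envir = inner_env, inherits = FALSE)  # not defined in inner_env itself
```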

Exercise: finding variables

How does R know the value of pi?
Why does this question appear under an exercise about environments?
If you are the Indiana state legislature in 1897 and define pi <- 3.2, where does your symbol pi reside?
Is there still a pi with a more precise value?

Exercise: NAMESPACE

An R package contains symbols defined in the package namespace. The namespace is an environment.
Use getNamespace("IRanges") to retrieve the IRanges package namespace
Use ls() to list the content of the namespace
Use parent.env() to recursively discover the search path for a symbol mentioned in the namespace
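One possible way to approach the steps above is sketched here. It substitutes the stats namespace for IRanges only because stats ships with every R installation; the variable names `ns`, `env` and `path` are made up for the example.

```r
ns <- getNamespace("stats")    # stand-in for getNamespace("IRanges")
is.environment(ns)             # a namespace is an environment
head(ls(ns))                   # a few of the symbols defined in the namespace

# Follow parent.env() until the empty environment, recording each name on the way
env <- ns
path <- character()
repeat {
    path <- c(path, environmentName(env))
    if (identical(env, emptyenv()))
        break
    env <- parent.env(env)
}
path
```

The resulting path starts at the namespace, passes through its imports and the base namespace, then continues through the global environment and the attached packages.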

`function`

Argument basics

Named, unnamed, unspecified arguments
Default values
Argument matching – by exact name, then partial name, then position
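A small sketch of the argument basics listed above, using two hypothetical functions `f()` and `g()` written only for this illustration.

```r
# 'first' has no default, 'second' has one, '...' collects anything left over
f <- function(first, second = 2, ...) {
    c(first = first, second = second, n_extra = length(list(...)))
}

f(1)                 # 'second' takes its default value
f(second = 10, 1)    # named arguments are matched first, the rest by position
f(sec = 10, 1)       # partial name matching (only for arguments before '...')
f(1, 2, 3, 4)        # surplus unnamed arguments end up in '...'

# missing() detects an argument the caller did not specify
g <- function(x, y) {
    if (missing(y)) "y not supplied" else y
}
g(1)
```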

Function environments

Scope and symbol resolution
What about <<- ?
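As a rough illustration of scope and of `<<-`, the sketch below uses invented functions `f()`, `g()` and `make_counter()`. It shows that a free variable is looked up where the function was defined rather than where it is called, and that `<<-` assigns to an existing variable in an enclosing environment.

```r
x <- "global"
f <- function() x              # x is a free variable in f
g <- function() {
    x <- "local to g"          # does not affect what f() sees
    f()
}
g()                            # "global": resolved in f's defining environment

make_counter <- function() {
    count <- 0
    function() {
        count <<- count + 1    # updates 'count' in the enclosing environment
        count
    }
}
counter <- make_counter()
counter()                      # 1
counter()                      # 2
```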

Exercise: bank account: explain…

account <- function(initial=0) {
    available <- initial
    list(deposit=function(amount) {
        available <<- available + amount
        available
    }, balance=function() {
         available
    })
}
my_acct <- account()
my_acct$deposit(100)

## [1] 100

your_acct <- account(20)
my_acct$deposit(200)

## [1] 300

my_acct$balance()

## [1] 300

your_acct$balance()

## [1] 20

Implement, if necessary, my_acct$withdraw.
Implement, if you're quick, my_bank to manage a number of accounts.

Primitives and essential 'classes' and 'methods'

Conceptual class hierarchy, with functions and the C SEXP type (discussed further below)

vector
   o length, [, [<-, [[, [[<-, names, names<-, class, class<-, ...
-- raw()                   RAWSXP
-- logical()               LGLSXP
-- numeric()               REALSXP
   -- integer()            INTSXP
-- complex()               CPLXSXP
-- character()             STRSXP
-- list()                  VECSXP
   -- data.frame()
   -- ... many S3 objects
-- structure()
   -- array()
      -- matrix()
-- expression()            EXPRSXP
environment (new.env())    ENVSXP
   o ls
   o [[, [[<-
closure (e.g., function)   CLOSXP
S4 class                   S4SXP
...
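To relate the hierarchy above to what R reports at the prompt, this sketch compares class() with typeof() for a few objects. The list name `objs` and its element names are illustrative assumptions.

```r
objs <- list(
    data.frame  = data.frame(x = 1:2),
    matrix      = matrix(1:4, 2),
    expression  = expression(1 + 2),
    environment = new.env(),
    closure     = function() NULL,
    builtin     = sum
)

# typeof() reflects the underlying storage type, e.g. a data.frame is a list
storage <- vapply(objs, typeof, character(1))
storage

# class() reflects the user-facing class, which may differ from the storage type
class(objs$data.frame)
```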

Deeper…

Copy-on-write

Some tools

.Internal(inspect())
tracemem()

Exercise: explain…

x <- 1:5; tracemem(x)
x[1] <- 2L
x[1] <- 2

x <- y <- seq(1, 5); tracemem(x)
x[1] <- 2L

df <- data.frame(x=1:5, y=5:1)
tracemem(df); tracemem(df$x)
df[1,1] <- 2

m <- matrix(1:10, 2); tracemem(m)
m[1, 1] <- 2L

f <- function(x) x[1]
g <- function(x) { x[1] <- 2L; x }
tracemem(x <- 1:5); f(x)
tracemem(x <- 1:5); g(x)

Data representation

Something like x <- 1:5 associates the symbol x with the value 1:5 in a particular (e.g., .GlobalEnv) environment.
x points to a location in memory, where there's a C struct.
The C struct is an S-expression (SEXP) atom that represents the data values (1:5) as well as information about the data (e.g., that they are integers, hence INTSXP). We can peek into the S-expression structure with

x <- 1:5
.Internal(inspect(x))

## @1050cc088 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5

Among other pieces of information, we see the memory address of the value bound to the symbol x (following the @), that this is an instance of type INTSXP (integer), and that it has length len=5.

Exercise: Use .Internal(inspect()) to discover other common S-expression types, in addition to INTSXP. Some examples:

.Internal(inspect(pi))
.Internal(inspect(data.frame()))
.Internal(inspect(function() {}))
.Internal(inspect(expression(1 + 2)))

Garbage collection

Memory is allocated and reclaimed by R itself; there is no need for the user to explicitly remove objects with rm() or trigger garbage collection with gc() (many experienced R programmers never use rm()).

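Uses NAMED rather than reference counts (in versions of R before 4.0.0; later versions use reference counting, and .Internal(inspect()) reports REF() rather than NAM())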

NAMED 0: memory that is not bound to a symbol

.Internal(inspect(1:5))

## @105199ff8 13 INTSXP g0c3 [] (len=5, tl=0) 1,2,3,4,5

NAMED 1: memory that is or has been bound to exactly one symbol

.Internal(inspect(x <- 1:5))

## @1050f6838 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5

NAMED 2: memory that is or has been bound to two or more symbols

.Internal(inspect(y <- x <- 1:5))

## @1051bc768 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5

Copy-on-write illusion

Only duplicate when updating memory that is NAMED(2)
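A minimal sketch of the copy-on-write behaviour described above. tracemem() only works when R was built with memory profiling, so the calls are guarded; the variable `can_trace` is a name invented for this example, and the exact tracemem() message differs between R versions.

```r
x <- c(1, 2, 3)
y <- x                          # x and y refer to the same memory; nothing is copied yet

can_trace <- capabilities("profmem")   # TRUE when R supports tracemem()
if (can_trace) tracemem(x)

x[1] <- 10                      # modifying shared memory forces a duplicate first
if (can_trace) untracemem(x)

x                               # the modified copy
y                               # the original values, untouched
```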

Styles of programming

Declarative vs. imperative

Example from Rowe:

'Clamp' data so that values are no more than 5 standard deviations from the mean.

Data:

set.seed(123)
x <- rnorm(10000000)

Find values: Declarative

x[abs(x) > 5 * sd(x)]

## [1] -5.051  5.213  5.348  5.227

Imperative

ans <- numeric()
for (xi in x)
    if (abs(xi) > 5 * sd(x))
        ans <- c(ans, xi)

Clamp: Declarative

x[abs(x) > 5 * sd(x)] <- 5 * sd(x)

Imperative

for (i in seq_along(x))
    if (abs(x[i]) > 5 * sd(x))
        x[i] <- 5 * sd(x)

(how could the imperative style be made more efficient?)
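One possible answer, offered as a sketch rather than the intended solution: compute the threshold once, outside the loop. This version makes two assumptions that differ from the code above. It fixes the threshold before any value is changed, and it keeps the sign of clamped values. It also uses a much smaller vector with two artificial outliers so that it runs quickly; `limit`, `clamped` and `vectorized` are names made up for the example.

```r
set.seed(123)
x <- rnorm(1000)                # far smaller than the data above, for speed
x[c(10, 20)] <- c(8, -8)        # artificial outliers so that something is clamped

limit <- 5 * sd(x)              # computed once, not on every iteration

clamped <- x
for (i in seq_along(clamped))
    if (abs(clamped[i]) > limit)
        clamped[i] <- sign(clamped[i]) * limit   # assumption: preserve the sign

# The same result in declarative style, for comparison
vectorized <- pmax(pmin(x, limit), -limit)
all.equal(clamped, vectorized)
```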

Question: What are the merits of declarative vs. imperative styles?

Functional

Ideal: return value entirely determined by arguments, no 'side effects'
Very easy to reason about
Easy to parallelize
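The contrast can be sketched with two hypothetical functions, `add_impure()` and `add_pure()`. The first depends on and modifies a variable outside itself, so the same call gives different answers; the second depends only on its arguments.

```r
total <- 0

# Side effect: reads and updates 'total' defined outside the function
add_impure <- function(x) {
    total <<- total + x
    total
}

# No side effects: the result is determined entirely by the arguments
add_pure <- function(total, x) total + x

add_impure(1)                   # 1
add_impure(1)                   # 2: same call, different result
add_pure(0, 1)                  # 1
add_pure(0, 1)                  # 1 again
```

Because add_pure() never touches shared state, independent calls could be evaluated in any order, which is what makes this style easy to parallelize.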

Exercise: Few R functions are truly functional, but it's possible to recognize 'more' versus 'less' functional ways of writing R code. For the following,

df <- data.frame(x=1:5, y=5:1)
x0 <- sapply(names(df), function(x) sqrt(df[[x]]))
x1 <- sapply(names(df), function(x, df) sqrt(df[[x]]), df)
x3 <- sapply(df, function(x, fun) fun(x), sqrt)
x2 <- sapply(df, sqrt)

Are these expressions equivalent?
Are some of these examples more functional than others?
Is updating an object, e.g., df$x <- 5:1, consistent with functional programming?

Object-oriented

Association of methods with object classes
In R: methods associated with generic functions
- Different from mainstream languages (Java, C++)
- But not different from many other languages
- Preserves illusion of copy-on-write
Much detail this afternoon
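As a small preview of methods being associated with generic functions, here is a sketch using S3. The generic `area()` and the classes "circle" and "square" are hypothetical and exist only for this illustration.

```r
# The generic dispatches on the class of its first argument
area <- function(shape, ...) UseMethod("area")

# Methods are ordinary functions named generic.class
area.circle <- function(shape, ...) pi * shape$r^2
area.square <- function(shape, ...) shape$side^2

circle <- structure(list(r = 1), class = "circle")
square <- structure(list(side = 2), class = "square")

area(circle)                    # uses area.circle
area(square)                    # uses area.square
```

The methods belong to the generic area(), not to the objects, in contrast to the class-owned methods of Java or C++.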