1 Inspiration: correlated coin tosses
2 First approach
- 2.1 Implementation
- 2.2 Assessment
3 Second approach
- 3.1 Implementation
- 3.2 Assessment
4 Third approach
- 4.1 Implementation
- 4.2 Assessment
5 Fourth approach
- 5.1 Implementation
- 5.2 Assessment
6 Fifth approach
- 6.1 Implementation
- 6.2 Assessment
7 XXX Approach
8 Session info

1 Inspiration: correlated coin tosses

Wolfgang mentioned on Monday that observing 50 heads and then 50 tails would not be evidence of a ‘fair’ coin, even though the observed number of head (50) was exactly that expected of a fair coin.

I wonder, how do you generate correlated coin tosses?

Criteria (most to least important)

Correct
Understandable
Robust
Fast

2 First approach

2.1 Implementation

Suppose we represent heads as ‘0’ and tails as ‘1’. Let p be the probability that a coin that is currently a head is tossed and remains a head, and similarly for tails. If p == 0.5, then the tosses would seem to be uncorrelated. We implement this idea as a simple function

f1 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- NULL
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome <- c(outcome, current)
    }
    outcome
}

2.2 Assessment

Is it correct?

f1(10, .5)

##  [1] 0 0 0 1 1 1 1 1 1 1

table(f1(1000, .5))

## 
##   0   1 
## 510 490

res <- f1(1000, .9)
mean(rle(res)$length) # expectation: 1 / (1 - p)

## [1] 10.10101

Is it understandable? Pretty much

Is it robust? We’ll come back to this.

Is it fast? Seems so initially, but it scales poorly!

system.time(f1(5000, .5))

##    user  system elapsed 
##   0.188   0.028   0.215

system.time(f1(10000, .5))

##    user  system elapsed 
##   0.587   0.001   0.587

system.time(f1(20000, .5))

##    user  system elapsed 
##   2.033   0.015   2.050

Doubling the problem size (n) causes a 4-fold increase in execution time.

What’s the problem? c(outcome, current) causes outcome to be copied each time through the loop!

3 Second approach

3.1 Implementation

Pre-allocate and fill the outcome vector; outome updated in place, without copying.

f2 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

3.2 Assessment

Is it correct?

set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)

## [1] TRUE

Is it understandable? As understandable as f1().

Is it robust? In a minute…

Is it fast?

system.time(f2(5000, .5))

##    user  system elapsed 
##    0.05    0.00    0.05

system.time(f2(10000, .5))

##    user  system elapsed 
##   0.076   0.000   0.076

system.time(f2(20000, .5))

##    user  system elapsed 
##   0.149   0.000   0.150

f2() seems to scale linearly, so performs increasingly well compared to f1().

4 Third approach

4.1 Implementation

Note that runif() is called n times, but the program could be modified, and still be understandable, if it were called just once – hoist runif(1) > p out of the loop.

f3 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in 1:n) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

4.2 Assessment

Correct?

set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)

## [1] TRUE

Understandable? Yes.
Robust? None of them have been, and f3() really isn’t!

set.seed(123)
f1(0, .5)

## [1] 1 1

set.seed(123)
try(f3(0, .5))

## Error in if (test[i]) current <- (current + 1)%%2 : 
##   missing value where TRUE/FALSE needed

Fast? Yes, about 10 times faster than f3()

n <- 100000
system.time(f2(n, .5))

##    user  system elapsed 
##   0.788   0.000   0.788

system.time(f3(n, .5))

##    user  system elapsed 
##   0.060   0.000   0.059

… with linear scaling.

system.time(f3(n * 10, .5))

##    user  system elapsed 
##   0.570   0.012   0.583

5 Fourth approach

5.1 Implementation

The problem is that 1:n is not robust, especially 1:0 generates the sequence c(1, 0), whereas we were expecting a zero-length sequence!

Solution: use seq_len(n)

lapply(3:1, seq_len)

## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 1

f4 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in seq_len(n)) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

5.2 Assessment

Correct? Yes
Understandable? Yes
Robust? Yes

set.seed(123)
f4(3, .5)

## [1] 1 1 0

f4(2, .5)

## [1] 1 0

f4(1, .5)

## [1] 0

f4(0, .5)

## numeric(0)

Fast? Yes

6 Fifth approach

6.1 Implementation

Use cumsum() (cummulative sum) to produce sequential groups that have the same head or tail status. Use %% on the cummulative sum to translate those groups into heads (cummsum() %% 2 == 0 or tails (cummsum() %% 2 == 1).

f5 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    test <- runif(n) > p
    cumsum(current + test) %% 2
}

6.2 Assessment

Correct?

set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)

## [1] TRUE

Understandable? Harder to understand…
Robust? Yes

f5(0, .8)

## numeric(0)

Fast?

n <- 1000000
system.time(f4(n, .5))

##    user  system elapsed 
##   0.562   0.000   0.562

system.time(f5(n, .5))

##    user  system elapsed 
##   0.102   0.000   0.102

system.time(f5(n * 10, .5))

##    user  system elapsed 
##   1.259   0.076   1.334

About 4x faster than f4(), scales linearly, fast even for very large n.

Could be used to generate a large data set for developing methods about correlated samples, along the lines of

correlated_tosses_expts <- function(m, n, p) {
    ## m tosses (rows) in each of n experiments
    start0 <- rep(rbinom(m, 1, .5), each = m)
    group0 <- cumsum(runif(m * n) > p)
    toss <- (start0 + group0) %% 2
    matrix(toss, m)
}
system.time({
    expt <- correlated_tosses_expts(1000, 10000, .8)
})

##    user  system elapsed 
##   1.158   0.092   1.250

hist(colSums(expt))

7 XXX Approach

Probably we have reached the point of diminishing gains, we’ve already spent far more time developing f5() than we’ll ever save by further investigation… However,

Avoid adding current to cumsum() vector.
Use rgeom() to generate change points.
‘Is there a package for that?’

Other tools

microbenchmark for comparing fine-scale performance differences (but do we really care about speed when, e.g., timing differences are less than a couple of seconds for large-scale data?)
testthat for writing ‘unit tests’ that allow easy implementation of tests for correct and robust code.

8 Session info

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
## 
## Matrix products: default
## BLAS:   /home/msmith/Applications/R/R-3.6.0/lib/libRblas.so
## LAPACK: /home/msmith/Applications/R/R-3.6.0/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.12.0
## 
## loaded via a namespace (and not attached):
##  [1] BiocManager_1.30.4 compiler_3.6.0     magrittr_1.5      
##  [4] bookdown_0.10      tools_3.6.0        htmltools_0.3.6   
##  [7] yaml_2.2.0         Rcpp_1.0.1         codetools_0.2-16  
## [10] stringi_1.4.3      rmarkdown_1.12     knitr_1.23        
## [13] stringr_1.4.0      xfun_0.7           digest_0.6.20     
## [16] evaluate_0.13

Evening 2.1: Efficient R

24 July 2019

Contents