Contents

1 Inspiration: correlated coin tosses

Wolfgang mentioned on Monday that observing 50 heads and then 50 tails would not be evidence of a ‘fair’ coin, even though the observed number of head (50) was exactly that expected of a fair coin.

I wonder, how do you generate correlated coin tosses?

Criteria (most to least important)

  1. Correct
  2. Understandable
  3. Robust
  4. Fast

2 First approach

2.1 Implementation

Suppose we represent heads as ‘0’ and tails as ‘1’. Let p be the probability that a coin that is currently a head is tossed and remains a head, and similarly for tails. If p == 0.5, then the tosses would seem to be uncorrelated. We implement this idea as a simple function

f1 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- NULL
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome <- c(outcome, current)
    }
    outcome
}

2.2 Assessment

Is it correct?

f1(10, .5)
##  [1] 0 0 0 1 1 1 1 1 1 1
table(f1(1000, .5))
## 
##   0   1 
## 510 490
res <- f1(1000, .9)
mean(rle(res)$length) # expectation: 1 / (1 - p)
## [1] 10.10101

Is it understandable? Pretty much

Is it robust? We’ll come back to this.

Is it fast? Seems so initially, but it scales poorly!

system.time(f1(5000, .5))
##    user  system elapsed 
##   0.188   0.028   0.215
system.time(f1(10000, .5))
##    user  system elapsed 
##   0.587   0.001   0.587
system.time(f1(20000, .5))
##    user  system elapsed 
##   2.033   0.015   2.050

Doubling the problem size (n) causes a 4-fold increase in execution time.

What’s the problem? c(outcome, current) causes outcome to be copied each time through the loop!

3 Second approach

3.1 Implementation

Pre-allocate and fill the outcome vector; outome updated in place, without copying.

f2 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    for (i in 1:n) {
        if (runif(1) > p)
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

3.2 Assessment

Is it correct?

set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
## [1] TRUE

Is it understandable? As understandable as f1().

Is it robust? In a minute…

Is it fast?

system.time(f2(5000, .5))
##    user  system elapsed 
##    0.05    0.00    0.05
system.time(f2(10000, .5))
##    user  system elapsed 
##   0.076   0.000   0.076
system.time(f2(20000, .5))
##    user  system elapsed 
##   0.149   0.000   0.150

f2() seems to scale linearly, so performs increasingly well compared to f1().

4 Third approach

4.1 Implementation

Note that runif() is called n times, but the program could be modified, and still be understandable, if it were called just once – hoist runif(1) > p out of the loop.

f3 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in 1:n) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

4.2 Assessment

  1. Correct?
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
## [1] TRUE
  1. Understandable? Yes.

  2. Robust? None of them have been, and f3() really isn’t!

set.seed(123)
f1(0, .5)
## [1] 1 1
set.seed(123)
try(f3(0, .5))
## Error in if (test[i]) current <- (current + 1)%%2 : 
##   missing value where TRUE/FALSE needed
  1. Fast? Yes, about 10 times faster than f3()
n <- 100000
system.time(f2(n, .5))
##    user  system elapsed 
##   0.788   0.000   0.788
system.time(f3(n, .5))
##    user  system elapsed 
##   0.060   0.000   0.059

… with linear scaling.

system.time(f3(n * 10, .5))
##    user  system elapsed 
##   0.570   0.012   0.583

5 Fourth approach

5.1 Implementation

The problem is that 1:n is not robust, especially 1:0 generates the sequence c(1, 0), whereas we were expecting a zero-length sequence!

Solution: use seq_len(n)

lapply(3:1, seq_len)
## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 1
f4 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    outcome <- numeric(n)
    test <- runif(n) > p
    for (i in seq_len(n)) {
        if (test[i])
            current <- (current + 1) %% 2
        outcome[i] <- current
    }
    outcome
}

5.2 Assessment

  1. Correct? Yes

  2. Understandable? Yes

  3. Robust? Yes

set.seed(123)
f4(3, .5)
## [1] 1 1 0
f4(2, .5)
## [1] 1 0
f4(1, .5)
## [1] 0
f4(0, .5)
## numeric(0)
  1. Fast? Yes

6 Fifth approach

6.1 Implementation

Use cumsum() (cummulative sum) to produce sequential groups that have the same head or tail status. Use %% on the cummulative sum to translate those groups into heads (cummsum() %% 2 == 0 or tails (cummsum() %% 2 == 1).

f5 <- function(n, p) {
    current <- rbinom(1, 1, 0.5)
    test <- runif(n) > p
    cumsum(current + test) %% 2
}

6.2 Assessment

  1. Correct?
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
## [1] TRUE
  1. Understandable? Harder to understand…

  2. Robust? Yes

f5(0, .8)
## numeric(0)
  1. Fast?
n <- 1000000
system.time(f4(n, .5))
##    user  system elapsed 
##   0.562   0.000   0.562
system.time(f5(n, .5))
##    user  system elapsed 
##   0.102   0.000   0.102
system.time(f5(n * 10, .5))
##    user  system elapsed 
##   1.259   0.076   1.334

About 4x faster than f4(), scales linearly, fast even for very large n.

Could be used to generate a large data set for developing methods about correlated samples, along the lines of

correlated_tosses_expts <- function(m, n, p) {
    ## m tosses (rows) in each of n experiments
    start0 <- rep(rbinom(m, 1, .5), each = m)
    group0 <- cumsum(runif(m * n) > p)
    toss <- (start0 + group0) %% 2
    matrix(toss, m)
}
system.time({
    expt <- correlated_tosses_expts(1000, 10000, .8)
})
##    user  system elapsed 
##   1.158   0.092   1.250
hist(colSums(expt))

7 XXX Approach

Probably we have reached the point of diminishing gains, we’ve already spent far more time developing f5() than we’ll ever save by further investigation… However,

Other tools

8 Session info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
## 
## Matrix products: default
## BLAS:   /home/msmith/Applications/R/R-3.6.0/lib/libRblas.so
## LAPACK: /home/msmith/Applications/R/R-3.6.0/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocStyle_2.12.0
## 
## loaded via a namespace (and not attached):
##  [1] BiocManager_1.30.4 compiler_3.6.0     magrittr_1.5      
##  [4] bookdown_0.10      tools_3.6.0        htmltools_0.3.6   
##  [7] yaml_2.2.0         Rcpp_1.0.1         codetools_0.2-16  
## [10] stringi_1.4.3      rmarkdown_1.12     knitr_1.23        
## [13] stringr_1.4.0      xfun_0.7           digest_0.6.20     
## [16] evaluate_0.13