Suppose we represent heads as ‘0’ and tails as ‘1’. Let p
be the
probability that a coin that is currently a head is tossed and remains
a head, and similarly for tails. If p == 0.5
, then the tosses would
seem to be uncorrelated. We implement this idea as a simple function
f1 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- NULL
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome <- c(outcome, current)
}
outcome
}
Is it correct?
f1(10, .5)
## [1] 0 0 0 1 1 1 1 1 1 1
table(f1(1000, .5))
##
## 0 1
## 510 490
res <- f1(1000, .9)
mean(rle(res)$length) # expectation: 1 / (1 - p)
## [1] 10.10101
Is it understandable? Pretty much
Is it robust? We’ll come back to this.
Is it fast? Seems so initially, but it scales poorly!
system.time(f1(5000, .5))
## user system elapsed
## 0.188 0.028 0.215
system.time(f1(10000, .5))
## user system elapsed
## 0.587 0.001 0.587
system.time(f1(20000, .5))
## user system elapsed
## 2.033 0.015 2.050
Doubling the problem size (n
) causes a 4-fold increase in execution time.
What’s the problem? c(outcome, current)
causes outcome
to be
copied each time through the loop!
Pre-allocate and fill the outcome vector; outome
updated in place,
without copying.
f2 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
for (i in 1:n) {
if (runif(1) > p)
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}
Is it correct?
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res2 <- f2(100, .9)
identical(res1, res2)
## [1] TRUE
Is it understandable? As understandable as f1()
.
Is it robust? In a minute…
Is it fast?
system.time(f2(5000, .5))
## user system elapsed
## 0.05 0.00 0.05
system.time(f2(10000, .5))
## user system elapsed
## 0.076 0.000 0.076
system.time(f2(20000, .5))
## user system elapsed
## 0.149 0.000 0.150
f2()
seems to scale linearly, so performs increasingly well compared
to f1()
.
Note that runif()
is called n
times, but the program could be
modified, and still be understandable, if it were called just once –
hoist runif(1) > p
out of the loop.
f3 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in 1:n) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}
set.seed(123)
res1 <- f1(100, .9)
set.seed(123)
res3 <- f3(100, .9)
identical(res1, res3)
## [1] TRUE
Understandable? Yes.
Robust? None of them have been, and f3()
really isn’t!
set.seed(123)
f1(0, .5)
## [1] 1 1
set.seed(123)
try(f3(0, .5))
## Error in if (test[i]) current <- (current + 1)%%2 :
## missing value where TRUE/FALSE needed
f3()
n <- 100000
system.time(f2(n, .5))
## user system elapsed
## 0.788 0.000 0.788
system.time(f3(n, .5))
## user system elapsed
## 0.060 0.000 0.059
… with linear scaling.
system.time(f3(n * 10, .5))
## user system elapsed
## 0.570 0.012 0.583
The problem is that 1:n
is not robust, especially 1:0
generates
the sequence c(1, 0)
, whereas we were expecting a zero-length
sequence!
Solution: use seq_len(n)
lapply(3:1, seq_len)
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1
f4 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
outcome <- numeric(n)
test <- runif(n) > p
for (i in seq_len(n)) {
if (test[i])
current <- (current + 1) %% 2
outcome[i] <- current
}
outcome
}
Correct? Yes
Understandable? Yes
Robust? Yes
set.seed(123)
f4(3, .5)
## [1] 1 1 0
f4(2, .5)
## [1] 1 0
f4(1, .5)
## [1] 0
f4(0, .5)
## numeric(0)
Use cumsum()
(cummulative sum) to produce sequential groups that
have the same head or tail status. Use %%
on the cummulative sum to
translate those groups into heads (cummsum() %% 2 == 0
or tails
(cummsum() %% 2 == 1
).
f5 <- function(n, p) {
current <- rbinom(1, 1, 0.5)
test <- runif(n) > p
cumsum(current + test) %% 2
}
set.seed(123); res1 <- f1(10, .8)
set.seed(123); res5 <- f5(10, .8)
identical(res1, res5)
## [1] TRUE
Understandable? Harder to understand…
Robust? Yes
f5(0, .8)
## numeric(0)
n <- 1000000
system.time(f4(n, .5))
## user system elapsed
## 0.562 0.000 0.562
system.time(f5(n, .5))
## user system elapsed
## 0.102 0.000 0.102
system.time(f5(n * 10, .5))
## user system elapsed
## 1.259 0.076 1.334
About 4x faster than f4()
, scales linearly, fast even for very large n
.
Could be used to generate a large data set for developing methods about correlated samples, along the lines of
correlated_tosses_expts <- function(m, n, p) {
## m tosses (rows) in each of n experiments
start0 <- rep(rbinom(m, 1, .5), each = m)
group0 <- cumsum(runif(m * n) > p)
toss <- (start0 + group0) %% 2
matrix(toss, m)
}
system.time({
expt <- correlated_tosses_expts(1000, 10000, .8)
})
## user system elapsed
## 1.158 0.092 1.250
hist(colSums(expt))
Probably we have reached the point of diminishing gains, we’ve already
spent far more time developing f5()
than we’ll ever save by further
investigation… However,
current
to cumsum()
vector.rgeom()
to generate change points.Other tools
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19
##
## Matrix products: default
## BLAS: /home/msmith/Applications/R/R-3.6.0/lib/libRblas.so
## LAPACK: /home/msmith/Applications/R/R-3.6.0/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] BiocManager_1.30.4 compiler_3.6.0 magrittr_1.5
## [4] bookdown_0.10 tools_3.6.0 htmltools_0.3.6
## [7] yaml_2.2.0 Rcpp_1.0.1 codetools_0.2-16
## [10] stringi_1.4.3 rmarkdown_1.12 knitr_1.23
## [13] stringr_1.4.0 xfun_0.7 digest_0.6.20
## [16] evaluate_0.13