For this package, we have written methods to estimate regression trees and random forests that minimize the spectral objective:
\[\hat{f} = \text{argmin}_{f' \in \mathcal{F}} \frac{||Q(\mathbf{Y} - f'(\mathbf{X}))||_2^2}{n}\]
The package is currently written entirely in R (R Core Team 2024) and gets quite slow for larger sample sizes. There might be a faster C++ implementation in the future, but for now, there are a few ways to speed up the computations when applying the methods to larger data sets.
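For intuition, the objective can be evaluated in a few lines of plain R. The sketch below assumes a precomputed spectral transformation matrix Q and a vector of predictions f_hat; it illustrates the formula above and is not the package's internal implementation.

# minimal sketch of the spectral objective (illustration only):
# squared l2 norm of the Q-transformed residuals, divided by n
spectral_loss <- function(Q, Y, f_hat) {
  res <- Q %*% (Y - f_hat)  # transformed residuals
  sum(res^2) / length(Y)
}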
Many of our functions support parallel processing using the parameter
mc.cores to control the number of cores used.
# fits the individual SDTrees in parallel on 22 cores
fit <- SDForest(x = X, y = Y, mc.cores = 22)
# predicts with the individual SDTrees in parallel
predict(fit, newdata = data.frame(X), mc.cores = 10)
# evaluates different strengths of regularization in parallel
paths <- regPath(fit, mc.cores = 10)
# predicts potential outcomes for different values of covariate one in parallel
pd <- partDependence(fit, 1, mc.cores = 10)
# performs cross validation in parallel
model <- SDAM(X, Y, cv_k = 5, mc.cores = 5)

To support parallelization, we use the R package future (Bengtsson 2021). If mc.cores is larger than one, multicore (forking of processes) is used if possible, and multisession otherwise. If mc.cores is smaller than two, we process sequentially or use a pre-specified plan. This way, a user can freely choose and set up any backend.
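For example, a backend can be pre-specified with future::plan() before the fitting call. This is a minimal sketch; multisession with four workers is just one possible choice.

# sets up a multisession backend with 4 workers manually
library(future)
plan(multisession, workers = 4)

# with mc.cores smaller than two, the pre-specified plan is used
fit <- SDForest(x = X, y = Y, mc.cores = 1)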
In a few places, approximations perform almost as well as running the whole procedure. In principle, reasonable split points to divide the space \(\mathbb{R}^p\) are all values between the observed ones. In practice, with many observations, the number of potential splits grows too large. We therefore evaluate at most max_candidates of the potential splits, chosen according to the quantiles of the potential split points, as sketched below.
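A minimal sketch of this quantile-based reduction for a single covariate follows; it illustrates the idea and is not the package's internal code.

# potential splits: all values between the observed ones
x_j <- sort(unique(X[, 1]))
potential <- (x_j[-1] + x_j[-length(x_j)]) / 2

# keep at most max_candidates splits, chosen via quantiles
max_candidates <- 100
if (length(potential) > max_candidates) {
  probs <- seq(0, 1, length.out = max_candidates)
  candidates <- quantile(potential, probs = probs, names = FALSE)
} else {
  candidates <- potential
}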
# approximation of candidate splits
fit <- SDForest(x = X, y = Y, max_candidates = 100)
tree <- SDTree(x = X, y = Y, max_candidates = 50)

If we have many observations, we can reduce computing time by sampling only max_size observations from the data for each tree instead of \(n\). This can dramatically reduce computing time compared to a full bootstrap sample but could also decrease performance.
# draws at most 500 samples from the data for each tree
fit <- SDForest(x = X, y = Y, max_size = 500)
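To illustrate the difference to a full bootstrap sample, consider the index draws below. This is a minimal sketch; whether the capped draws use replacement is an assumption here, not the package's documented behavior.

n <- nrow(X)
max_size <- 500

# full bootstrap sample: n draws with replacement
idx_bootstrap <- sample(n, n, replace = TRUE)

# capped subsample: at most max_size draws (replacement is an assumption)
idx_sub <- sample(n, min(n, max_size), replace = TRUE)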