For this package, we have written methods to estimate regression trees and random forests that minimize the spectral objective:
\[\hat{f} = \text{argmin}_{f' \in \mathcal{F}} \frac{||Q(\mathbf{Y} - f'(\mathbf{X}))||_2^2}{n}\]
The package is currently written entirely in R (R Core Team 2024) and gets quite slow for larger sample sizes. There might be a faster C++ implementation in the future, but for now, there are a few ways to speed up the computations when applying the methods to larger data sets.
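For intuition, the objective can be evaluated in a few lines of plain R. The sketch below assumes a precomputed spectral transformation matrix Q and a vector of predictions f_hat; it illustrates the formula above and is not the package's internal implementation.

# minimal sketch of the spectral objective (illustration only):
# squared l2 norm of the Q-transformed residuals, divided by n
spectral_loss <- function(Q, Y, f_hat) {
  res <- Q %*% (Y - f_hat)  # transformed residuals
  sum(res^2) / length(Y)
}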
Many of our functions support parallel processing using the parameter
mc.cores to control the number of cores used.
# fits the individual SDTrees in parallel on 22 cores
fit <- SDForest(x = X, y = Y, mc.cores = 22)
# predicts with the individual SDTrees in parallel
predict(fit, newdata = data.frame(X), mc.cores = 10)
# evaluates different strengths of regularization in parallel
paths <- regPath(fit, mc.cores = 10)
# predicts potential outcomes for different values of covariate one in parallel
pd <- partDependence(fit, 1, mc.cores = 10)
# performs cross validation in parallel
model <- SDAM(X, Y, cv_k = 5, mc.cores = 5)

To support parallelization, we use the R package future (Bengtsson 2021). If mc.cores is larger than one, multicore (forking of processes) is used if possible, and multisession otherwise. If mc.cores is smaller than two, we process sequentially or use a pre-specified plan. This way, a user can freely choose and set up any backend.
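For example, a backend can be pre-specified with future::plan() before the fitting call. This is a minimal sketch; multisession with four workers is just one possible choice.

# sets up a multisession backend with 4 workers manually
library(future)
plan(multisession, workers = 4)

# with mc.cores smaller than two, the pre-specified plan is used
fit <- SDForest(x = X, y = Y, mc.cores = 1)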
In a few places, approximations perform almost as well as running the whole procedure. In principle, reasonable split points to divide the space \(\mathbb{R}^p\) are all values between the observed ones. In practice, with many observations, the number of potential splits grows too large. We therefore evaluate at most max_candidates of the potential splits, chosen according to the quantiles of the potential split points, as sketched below.
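A minimal sketch of this quantile-based reduction for a single covariate follows; it illustrates the idea and is not the package's internal code.

# potential splits: all values between the observed ones
x_j <- sort(unique(X[, 1]))
potential <- (x_j[-1] + x_j[-length(x_j)]) / 2

# keep at most max_candidates splits, chosen via quantiles
max_candidates <- 100
if (length(potential) > max_candidates) {
  probs <- seq(0, 1, length.out = max_candidates)
  candidates <- quantile(potential, probs = probs, names = FALSE)
} else {
  candidates <- potential
}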
# approximation of candidate splits
fit <- SDForest(x = X, y = Y, max_candidates = 100)
tree <- SDTree(x = X, y = Y, max_candidates = 50)

If we have many observations, we can reduce computing time by sampling only max_size observations from the data for each tree instead of \(n\). This can dramatically reduce computing time compared to a full bootstrap sample but could also decrease performance.
# draws at most 500 samples from the data for each tree
fit <- SDForest(x = X, y = Y, max_size = 500)
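To illustrate the difference to a full bootstrap sample, consider the index draws below. This is a minimal sketch; whether the capped draws use replacement is an assumption here, not the package's documented behavior.

n <- nrow(X)
max_size <- 500

# full bootstrap sample: n draws with replacement
idx_bootstrap <- sample(n, n, replace = TRUE)

# capped subsample: at most max_size draws (replacement is an assumption)
idx_sub <- sample(n, min(n, max_size), replace = TRUE)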