Type: | Package |
Date: | 2025-10-07 |
Title: | Probabilistic Regression Trees |
Version: | 1.0.0 |
Depends: | R (≥ 4.3.0) |
Description: | Implementation of Probabilistic Regression Trees (PRTree), providing functions for model fitting and prediction, with specific adaptations to handle missing values. The main computations are implemented in 'Fortran' for high efficiency. The package is based on the PRTree methodology described in Alkhoury et al. (2020), "Smooth and Consistent Probabilistic Regression Trees" https://proceedings.neurips.cc/paper_files/paper/2020/file/8289889263db4a40463e3f358bb7c7a1-Paper.pdf. Details on the treatment of missing data and implementation aspects are presented in Prass, T.S.; Neimaier, A.S.; Pumi, G. (2025), "Handling Missing Data in Probabilistic Regression Trees: Methods and Implementation in R" <doi:10.48550/arXiv.2510.03634>. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
NeedsCompilation: | yes |
RoxygenNote: | 7.3.3 |
Packaged: | 2025-10-07 14:20:03 UTC; taiane |
Author: | Alisson Silva Neimaier
|
Maintainer: | Taiane Schaedler Prass <taianeprass@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-10-09 23:10:02 UTC |
PRTree: Probabilistic Regression Trees
Description
Probabilistic Regression Trees (PRTree). Functions for fitting and predicting PRTree models, with adaptations to handle missing values. The main calculations are performed in 'Fortran' for efficiency. The implementation is based on the PRTree methodology described in Alkhoury, S.; Devijver, E.; Clausel, M.; Tami, M.; Gaussier, E.; Oppenheim, G. (2020), "Smooth and Consistent Probabilistic Regression Trees" <https://proceedings.neurips.cc/paper_files/paper/2020/file/8289889263db4a40463e3f358bb7c7a1-Paper.pdf>.
Author(s)
Taiane Schaedler Prass <taianeprass@gmail.com> and Alisson Silva Neimaier <alissonneimaier@hotmail.com>
Probabilistic Regression Trees (PRTrees)
Description
Fits a Probabilistic Regression Tree (PRTree) model. This is the main user-facing function of the package.
Usage
pr_tree(y, X, control = list(), ...)
Arguments
y |
A numeric vector for the dependent variable. |
X |
A numeric matrix or data frame for the independent variables. |
control |
A list of control parameters, typically created by 'pr_tree_control()'. Alternatively, control parameters can be passed directly via the '...' argument. |
... |
Control parameters to be passed to 'pr_tree_control()'. These will override any parameters specified in the 'control' list. |
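The two ways of supplying control parameters can be sketched as follows. This is a minimal example that assumes the PRTree package is installed (it is skipped otherwise); the simulated data mirror the Examples section, and reading the number of terminal nodes from the columns of 'P' is an assumption about the returned object, not documented behaviour.

```r
# Assumes the PRTree package is installed; the block is skipped otherwise.
if (requireNamespace("PRTree", quietly = TRUE)) {
  library(PRTree)

  set.seed(1234)
  X <- matrix(runif(200, 0, 10), ncol = 1)
  y <- cos(X[, 1]) + rnorm(200, 0, 0.05)

  # Option 1: build the control list first, then pass it in
  ctrl <- pr_tree_control(max_terminal_nodes = 9, perc_test = 0)
  fit_list <- pr_tree(y, X, control = ctrl)

  # Option 2: same list, with one entry overridden through '...'
  fit_dots <- pr_tree(y, X, control = ctrl, max_terminal_nodes = 5)

  # Number of terminal nodes in each fit (assumed: one column of P per node)
  c(ncol(fit_list$P), ncol(fit_dots$P))
}
```

Keeping a reusable control list and overriding single entries through '...' is convenient when comparing a few settings against a common baseline.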
Value
An object of class 'prtree' containing the fitted model. This is a list with the following components:
yhat |
The estimated values for 'y'. |
XRegion |
A matrix with two columns indicating the terminal node (region) each observation belongs to. The first column ('TRUE') may have 'NA' for observations with missing values. The second column ('Internal') shows the region assigned by the algorithm. |
dist |
The Fortran code corresponding to the distribution used (stored for prediction purposes). |
par_dist |
Parameters related to the distribution (if any). |
fill_type |
Fortran code corresponding to the method used to fill the matrix P when missing values are present. |
P |
The matrix of probabilities for each terminal node. |
gamma |
The values of the |
MSE |
The mean squared error for the training, test/validation, and global datasets. |
sigma |
The optimal |
nodes_matrix_info |
A matrix with information for each node of the tree. |
regions |
A data frame with the bounds of each variable in each node of the returned tree. |
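As a rough illustration of how 'P', 'gamma', and 'yhat' may relate, the sketch below follows the formulation in the cited Alkhoury et al. (2020) paper, where a prediction is a probability-weighted sum of region weights. It uses base R only and does not call the package. The two regions, the value of 'sigma', and the helper 'region_prob()' are hypothetical, and the package's Fortran backend may differ in detail.

```r
# Hypothetical helper: probability that x plus Gaussian noise (sd = sigma)
# falls in the interval (lower, upper].
region_prob <- function(x, lower, upper, sigma) {
  pnorm((upper - x) / sigma) - pnorm((lower - x) / sigma)
}

set.seed(1)
x <- runif(100, 0, 10)
y <- ifelse(x < 5, 1, 3) + rnorm(100, 0, 0.1)

sigma <- 0.5                       # assumed noise scale
P <- cbind(
  region_prob(x, -Inf, 5, sigma),  # region 1: x <= 5
  region_prob(x, 5, Inf, sigma)    # region 2: x > 5
)

# Region weights: least-squares fit of y on the columns of P (no intercept)
gamma <- qr.solve(P, y)

# Fitted values: probability-weighted combination of the region weights
yhat <- as.vector(P %*% gamma)
```

Each row of 'P' sums to one here because the two intervals cover the whole real line, so 'yhat' moves smoothly between the two region weights near the split point instead of jumping.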
Examples
set.seed(1234)
X <- matrix(runif(200, 0, 10), ncol = 1)
eps <- matrix(rnorm(200, 0, 0.05), ncol = 1)
y <- cos(X) + eps
# Fit model with custom controls passed directly
reg <- pr_tree(y, X, max_terminal_nodes = 9, perc_test = 0)
plot(
X[order(X)], reg$yhat[order(X)],
xlab = "x", ylab = "cos(x)", col = "blue", type = "l"
)
# Overlay the observations; axis labels are already set by plot()
points(
X[order(X)], y[order(X)],
col = "red"
)
Set Control Parameters for PRTree
Description
This function creates a list of control parameters for the 'pr_tree' function, with validation for each parameter.
Usage
pr_tree_control(sigma_grid = NULL, grid_size = 8,
  max_terminal_nodes = 15L, cp = 0.01,
  max_depth = max_terminal_nodes - 1, n_min = 5L, perc_x = 0.1,
  p_min = 0.05, perc_test = 0.2, idx_train = NULL, fill_type = 2L,
  proxy_crit = "both", n_candidates = 3L, by_node = FALSE,
  dist = "norm", iprint = -1, ...)
Arguments
sigma_grid |
Optional. A numeric value, vector, or matrix with candidate values for the parameter |
grid_size |
Optional, the number of candidate values for 'sigma' to generate when 'sigma_grid' is 'NULL'. Default is 8. |
max_terminal_nodes |
A non-negative integer. The maximum number of regions in the output tree. The default is 15. |
cp |
A positive numeric value. The complexity parameter. Any split that does not decrease the MSE by a factor of 'cp' will be ignored. The default is 0.01. |
max_depth |
A non-negative integer. The maximum depth of the decision tree. The depth is defined as the length of the longest path from the root to a leaf. The default is 'max_terminal_nodes - 1', which is 14 when 'max_terminal_nodes' keeps its default of 15. |
n_min |
A positive integer. The minimum number of observations in a final node. The default is 5. |
perc_x |
A positive numeric value between 0 and 1. Given any column of |
p_min |
A positive numeric value. A threshold probability that controls the splitting process. A splitting attempt is made in a given region only when the proportion of rows with probability higher than 'p_min', in the corresponding column of the matrix |
perc_test |
A numeric value between 0 (inclusive) and 1 (exclusive) that specifies the proportion of the data to be held out for model validation or testing. Default is 0.2. The role of this hold-out set depends on the 'sigma_grid' argument.
The data split is performed using stratified sampling to ensure that the proportion of observations with missing values is similar across the training and validation/test sets. |
idx_train |
Indexes for the training sample. Default is 'NULL', in which case the indexes are computed based on the 'perc_test' argument. If 'idx_train' is provided, 'perc_test' is ignored. |
fill_type |
Integer indicating the method to be used to fill the probability matrix when 'X' has 'NA's. Default is 2.
|
proxy_crit |
Character. Default is '"both"'. Criterion used to associate an observation with missing values to a region:
|
n_candidates |
Integer. The number of competing candidates to consider when searching for the best split. To select the candidates, a proxy improvement measure is used. Then a full analysis is performed to choose the best among the 'n_candidates' candidates. Default is 3. |
by_node |
Logical. If 'TRUE', the algorithm selects 'n_candidates' for each node and then makes a full analysis to choose the best among all nodes. Otherwise the 'n_candidates' are selected globally. Default is 'FALSE'. |
dist |
Character. The distribution to be used in the model. One of '"norm"' (Gaussian), '"lnorm"' (log-normal), '"t"' (Student’s |
iprint |
Integer. Controls the verbosity of the Fortran backend. Default is -1 (silent).
|
... |
Extra parameters to be passed to the chosen distribution.
|
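The stratified hold-out split described for 'perc_test' can be pictured with a short base R sketch. This is not the package's internal code. The helper 'stratified_split()' is hypothetical and only illustrates drawing the hold-out fraction separately from the rows that contain missing values and from the rows that do not.

```r
# Hypothetical helper: returns training indexes, holding out roughly
# perc_test of the rows within each missingness stratum.
stratified_split <- function(X, perc_test = 0.2) {
  has_na <- apply(is.na(X), 1, any)
  idx <- seq_len(nrow(X))
  test <- unlist(lapply(split(idx, has_na), function(i) {
    n_test <- round(length(i) * perc_test)
    i[sample.int(length(i), n_test)]
  }))
  sort(setdiff(idx, test))
}

set.seed(42)
X <- matrix(rnorm(400), ncol = 2)
X[sample(200, 40), 1] <- NA        # 20% of the rows get a missing value
idx_train <- stratified_split(X, perc_test = 0.2)

# Share of rows with missing values, overall and in the training set
mean(is.na(X[, 1]))
mean(is.na(X[idx_train, 1]))
```

A vector of indexes built this way could be supplied through 'idx_train', in which case 'perc_test' is ignored, as noted above.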
Value
A list of class 'prtree.control' containing the validated control parameters.
Examples
# Get default control parameters
controls <- pr_tree_control()
# Customize some parameters
custom_controls <- pr_tree_control(max_depth = 5, n_candidates = 5)
Predict from a Probabilistic Regression Tree Model
Description
Obtains predictions from a fitted 'prtree' object.
Usage
## S3 method for class 'prtree'
predict(object, newdata, complete = FALSE, ...)
Arguments
object |
An object of class 'prtree', as returned by 'pr_tree()'. |
newdata |
A data frame or matrix containing new data for which to generate predictions. Must contain the same predictor variables as the data used to fit the model. |
complete |
Logical. If 'FALSE' (default), only the vector of predicted values is returned. If 'TRUE', a list containing both the predicted values and the probability matrix 'P' is returned. |
... |
Further arguments passed to or from other methods. |
Value
If 'complete = FALSE', a numeric vector of predicted values ('yhat'). If 'complete = TRUE', a list containing:
yhat |
The numeric vector of predicted values. |
P |
The probability matrix for the new data. |
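Examples
A minimal sketch of both return modes, assuming the PRTree package is installed (the block is skipped otherwise). The training data mirror the 'pr_tree()' example, and the evaluation grid 'X_new' is a made-up illustration.

```r
# Assumes the PRTree package is installed; the block is skipped otherwise.
if (requireNamespace("PRTree", quietly = TRUE)) {
  library(PRTree)

  set.seed(1234)
  X <- matrix(runif(200, 0, 10), ncol = 1)
  y <- cos(X[, 1]) + rnorm(200, 0, 0.05)
  fit <- pr_tree(y, X, max_terminal_nodes = 9, perc_test = 0)

  # New predictor values on a regular grid (illustrative only)
  X_new <- matrix(seq(0, 10, length.out = 50), ncol = 1)

  # Default: only the predicted values
  yhat_new <- predict(fit, newdata = X_new)

  # complete = TRUE: predicted values plus the probability matrix
  full <- predict(fit, newdata = X_new, complete = TRUE)
  str(full$P)
}
```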