---
title: "Scoring via random forests"
vignette: >
  %\VignetteIndexEntry{Scoring via random forests}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

```{r}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

> ⚠️ **work-in-progress**

```{r}
#| label: start
#| include: false
library(filtro)
library(dplyr)
library(modeldata)
```

We'll need to load a few packages:

```{r}
#| label: setup
library(filtro)
library(dplyr)
library(modeldata)
```

## Score class objects

Predictor importance can be assessed using three different random forest models. They can be accessed via the following score class objects:

```{r}
#| eval: false
score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique
```

These models are powered by the following packages:

```{r}
#| echo: false
score_imp_rf@engine
score_imp_rf_conditional@engine
score_imp_rf_oblique@engine
```

Regarding score types:

- The {ranger} random forest computes the importance scores.
- The {partykit} conditional random forest computes the conditional importance scores.
- The {aorsf} oblique random forest computes the permutation importance scores.

## A scoring example — random forest

The {modeldata} package contains a data set used to predict which cells in a high content screen were well segmented. It has 57 predictor columns and a factor outcome, `class`. Since `case` only indicates the train/test split and is not used in the analysis, it is set to `NULL`. Furthermore, for efficiency, we use a small sample of 50 of the original 2019 observations.

```{r}
cells_subset <- modeldata::cells |>
  # Use a small example for efficiency
  dplyr::slice(1:50)

cells_subset$case <- NULL

# cells_subset |> str() # Uncomment to see the structure of the data
```

First, we create a score class object to specify a {ranger} random forest, and then use the `fit()` method with the standard formula to compute the importance scores.
```{r}
# Specify random forest and fit score
cells_imp_rf_res <- score_imp_rf |>
  fit(class ~ ., data = cells_subset, seed = 42)
```

The data frame of results can be accessed via `object@results`.

```{r}
cells_imp_rf_res@results
```

A couple of notes here. The random forest filter, including all three types of random forests, supports both:

- regression tasks, and
- classification tasks.

In cases where `NA` is produced, a safe value can be used to retain the predictor; it can be accessed via `object@fallback_value`. Larger values indicate more important predictors. For this specific filter, i.e., `score_imp_rf_*`, case weights are supported.

## Hyperparameter tuning

As in {parsnip}, the argument names are harmonized. For example, the arguments that set the number of trees (`num.trees` in {ranger}, `ntree` in {partykit}, and `n_tree` in {aorsf}) are all standardized to a single name, `trees`, so users only need to remember one name. The same applies to the number of variables to split on at each node, `mtry`, and the minimum node size for splitting, `min_n`.

```{r}
#| eval: false
# Set hyperparameters
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    trees = 100,
    mtry = 2,
    min_n = 1
  )
```

However, there is one argument name specific to {ranger}: for reproducibility, instead of the standard `set.seed()` approach, we use the `seed` argument.

```{r}
#| eval: false
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    trees = 100,
    mtry = 2,
    min_n = 1,
    seed = 42 # Set seed for reproducibility
  )
```

## Seamless argument support

If users pass {ranger} argument names, intentionally or not, the fit still works; the necessary adjustments are handled internally. The following code chunk can be used to obtain a fitted score:

```{r}
#| eval: false
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    num.trees = 100,
    mtry = 2,
    min.node.size = 1,
    seed = 42
  )
```

The same applies to {partykit}- and {aorsf}-specific arguments.
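For instance, a sketch of the oblique case using {aorsf}'s native `n_tree` spelling (this assumes the pass-through behaves the same way as shown for {ranger} above; the harmonized name is `trees`):

```{r}
#| eval: false
# {aorsf}-native argument name; adjusted internally to `trees`
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, n_tree = 100, mtry = 2)
```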
## A scoring example — conditional random forest

For the {partykit} conditional random forest, we again create a score class object to specify the model, then use the `fit()` method to compute the importance scores. The data frame of results can be accessed via `object@results`.

```{r}
# Set seed for reproducibility
set.seed(42)

# Specify conditional random forest and fit score
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 100)

cells_imp_rf_conditional_res@results
```

Note that when a predictor's importance score is 0, `partykit::cforest()` may exclude its name from the output. In such cases, a score of 0 is assigned to the missing predictors.

## A scoring example — oblique random forest

For the {aorsf} oblique random forest, we again create a score class object to specify the model, then use the `fit()` method to compute the importance scores. The data frame of results can be accessed via `object@results`.

```{r}
# Set seed for reproducibility
set.seed(42)

# Specify oblique random forest and fit score
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, trees = 100, mtry = 2)

cells_imp_rf_oblique_res@results
```

## Available objects and engines

The score class objects for random forests, their corresponding engines, and supported tasks:

```{r}
#| echo: false
#| message: false
knitr::kable(
  data.frame(
    "object" = c("`score_imp_rf`", "`score_imp_rf_conditional`", "`score_imp_rf_oblique`"),
    "engine" = c("`ranger::ranger`", "`partykit::cforest`", "`aorsf::orsf`"),
    "task" = rep(c("regression, classification"), 3)
  )
)
```
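Since larger values indicate more important predictors, a fitted score's results can be ranked directly. This sketch assumes the results data frame stores the importance in a column named `score`; check `object@results` for the actual column names:

```{r}
#| eval: false
# Rank predictors from most to least important
cells_imp_rf_res@results |>
  dplyr::arrange(dplyr::desc(score))
```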