---
title: "Introduction to filtro"
vignette: >
  %\VignetteIndexEntry{Introduction to filtro}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

> ⚠️ **work-in-progress**

```{r}
#| label: start
#| include: false
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

This document demonstrates some basic uses of filtro. We'll need to load a few packages:

```{r}
#| label: setup
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

## A scoring example

The {modeldata} package contains a data set used to predict housing sale price. It has 73 predictor columns and a numeric variable `Sale_Price` (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.

```{r}
ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ames |> str() # uncomment to see the structure of the data
```

To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the `fit()` method with the standard formula to compute the scores.

```{r}
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)
```

The data frame of results can be accessed via `object@results`.

```{r}
ames_aov_pval_res@results
```

A couple of notes here:

Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both of the following cases:

- The predictors are numeric and the outcome is categorical, or
- The predictors are categorical and the outcome is numeric.

Because the outcome here is numeric, any predictor that is not a factor will result in an `NA`. When an `NA` is produced, a safe value can be used to retain the predictor; it can be accessed via `object@fallback_value`.

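
To make the two cases above concrete, here is a minimal base R sketch of the one-way ANOVA F-test that this kind of score is built on. It uses the built-in `iris` data and is an illustration only, not filtro's internal implementation; the helper name `aov_pvalue()` is hypothetical. The sketch assumes that the factor always acts as the grouping variable, so swapping the predictor and outcome roles leads to the same test, and that any other combination of types yields `NA`.

```r
# Hypothetical helper for illustration; not part of filtro.
# Returns the p-value of a one-way ANOVA F-test, or NA when the
# predictor/outcome types do not form a (factor, numeric) pair.
aov_pvalue <- function(predictor, outcome) {
  if (is.factor(predictor) && is.numeric(outcome)) {
    # Categorical predictor, numeric outcome
    fit <- stats::lm(outcome ~ predictor)
  } else if (is.numeric(predictor) && is.factor(outcome)) {
    # Numeric predictor, categorical outcome: swap the roles
    fit <- stats::lm(predictor ~ outcome)
  } else {
    return(NA_real_)
  }
  stats::anova(fit)[["Pr(>F)"]][1]
}

# Categorical predictor, numeric outcome
p_factor_predictor <- aov_pvalue(iris$Species, iris$Sepal.Length)

# Numeric predictor, categorical outcome (same underlying test)
p_numeric_predictor <- aov_pvalue(iris$Sepal.Length, iris$Species)

# Numeric predictor with a numeric outcome: no F-test, so NA
p_not_applicable <- aov_pvalue(iris$Sepal.Width, iris$Sepal.Length)
```

In this sketch the `NA` result plays the same role as the `NA` scores described above: the predictor cannot be scored by this method, so a fallback value is needed if it is to be retained.
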
By default, this filter computes `-log10(p_value)`, so that larger values indicate more important predictors. If users prefer raw p-values, a helper function, `dont_log_pvalues()`, is available.

For this specific filter, i.e., `score_aov_*`, case weights are supported. For other filters, you can check the property `object@case_weights` to see whether they can use case weights.

## Filtering and ranking

There are two main ways to rank and select a top proportion or number of features.

To filter or rank a single score, we can use built-in methods:

- `show_best_score_*()`
- `rank_best_score_*()`

For multi-parameter optimization, we can use API calls adapted from {desirability2}:

- `show_best_desirability_*()`

## A filtering example for score *singular*

The `show_best_score_prop()` function returns the predictors with the best scores for a single metric. The `prop_terms` argument lets us control the proportion of predictors to keep.

```{r}
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
```

## A filtering example for scores *plural*

To handle multiple scores, we first create multiple score class objects, and then use the `fit()` method with the standard formula to compute the scores.

```{r}
# ANOVA raw p-value
natrual_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natrual_res <-
  natrual_units |>
  fit(Sale_Price ~ ., data = ames)

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)

# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)
```

Next, we create a list to collect these score class objects, including their associated metadata and scores.

```{r}
# Create a list
class_score_list <- list(
  ames_aov_pval_natrual_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
```

Then, we fill in the safe value specific to each method and remove the `outcome` column.

```{r}
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
```

Analogous to [`show_best_desirability()`](https://desirability2.tidymodels.org/reference/show_best_desirability.html), the `show_best_desirability_prop()` function allows joint optimization of multiple metrics using desirability functions. A desirability function maps values of a metric to a $[0, 1]$ range where $1$ is most desirable and $0$ is unacceptable.

When the verb `maximize()` is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain. For example:

```{r}
# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |>
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
```

In `show_best_desirability_prop()`, there is an argument called `prop_terms` that lets us control the proportion of predictors to keep.

```{r}
# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
```

Besides `maximize()`, the verbs `minimize()`, `target()`, and `constrain()` are also available. Each is used in a different situation:

- `maximize()` when larger values are better.
- `minimize()` when smaller values are better.
- `target()` when a specific value of the metric is important.
- `constrain()` when a range of values is equally desirable.

For example:

```{r}
ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))

ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))

ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
```

## Available score objects and filter methods

The score class objects included in the package:

```{r}
#| echo: false
grep("^score_", ls("package:filtro"), value = TRUE)
```

The filter methods for score *singular*:

```{r}
#| echo: false
grep("^show_best_score_", ls("package:filtro"), value = TRUE)
```

The filter methods for scores *plural*:

```{r}
#| echo: false
grep("^show_best_desirability_", ls("package:filtro"), value = TRUE)
```
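
As a closing illustration, the desirability mappings behind `maximize()` and `minimize()` can be sketched in base R. This is a simplified, linear version under stated assumptions: the helper names `d_maximize()`, `d_minimize()`, and `d_overall_geomean()` are hypothetical and are not filtro or {desirability2} functions, the scores are made up, and combining individual desirabilities with a geometric mean is the conventional choice rather than something this document specifies.

```r
# Hypothetical helpers for illustration; not part of filtro or desirability2.

# Larger is better: 0 at or below `low`, 1 at or above `high`, linear between.
d_maximize <- function(x, low, high) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Smaller is better: the mirror image of d_maximize().
d_minimize <- function(x, low, high) {
  1 - d_maximize(x, low, high)
}

# Overall desirability: geometric mean across metrics, one value per predictor.
d_overall_geomean <- function(...) {
  d <- cbind(...)
  apply(d, 1, function(row) prod(row)^(1 / length(row)))
}

# Made-up scores for three predictors (not taken from the Ames results)
cor_score <- c(0.10, 0.50, 0.90)
pval_score <- c(0.80, 0.20, 0.01)

d_cor <- d_maximize(cor_score, low = 0, high = 1)
d_pval <- d_minimize(pval_score, low = 0, high = 1)
overall <- d_overall_geomean(d_cor, d_pval)

# Indices of the predictors, from most to least desirable
ranking <- order(overall, decreasing = TRUE)
```

With a geometric mean, a predictor that is unacceptable on any one metric (desirability of 0) has an overall desirability of 0, which is why the choice of `low` and `high` matters when several metrics are combined.
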