---
title: "Introduction to filtro"
vignette: >
  %\VignetteIndexEntry{Introduction to filtro}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
knitr:
  opts_chunk:
    collapse: true
    comment: '#>'
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

> ⚠️ **work-in-progress**

```{r}
#| label: start
#| include: false
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

This document demonstrates some basic uses of filtro. We'll need to load a few packages: 


```{r}
#| label: setup
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
```

## A scoring example

The {modeldata} package contains a data set used to predict housing sale price. It has 73 predictor columns and a numeric variable `Sale_Price` (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.

```{r}
ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ames |> str() # uncomment to see the structure of the data
```

To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the `fit()` method with the standard formula to compute the scores.

```{r}
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)
```

The data frame of results can be accessed via `object@results`. 

```{r}
ames_aov_pval_res@results
```

A few notes:

Because the focus is on feature relevance rather than hypothesis testing, the ANOVA F-test filter handles both of the following cases:

- The predictors are numeric and the outcome is categorical, or

- The predictors are categorical and the outcome is numeric.

Because the outcome is numeric here, any predictor that is not a factor results in an `NA`. When an `NA` is produced, a safe value can be substituted so that the predictor is retained; this value can be accessed via `object@fallback_value`.

By default, this filter computes `-log10(p_value)`, so that larger values indicate more important predictors. If users prefer raw p-values, a helper function `dont_log_pvalues()` is available. 
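To make the score scale concrete, here is a minimal base R sketch of a one-way ANOVA F-test followed by the `-log10()` transformation. It uses the built-in `iris` data, and the helper `aov_p_value()` is a hypothetical name introduced only for illustration; it is not filtro's internal implementation.

```{r}
# Illustration only: base R on the built-in iris data, not filtro's internals.
# One-way ANOVA F-test with a numeric variable on the left and a factor on the right.
aov_p_value <- function(numeric_var, factor_var) {
  fit <- lm(numeric_var ~ factor_var)
  # First row of the ANOVA table holds the p-value for the factor
  anova(fit)[["Pr(>F)"]][1]
}

p_val <- aov_p_value(iris$Sepal.Width, iris$Species)

# Transformed score: larger values mean stronger evidence of a relationship
-log10(p_val)

# The same transformation applied to a few reference p-values
-log10(c(0.5, 0.05, 0.001))
```

A p-value of 0.001 maps to a score of 3, so each additional unit corresponds to a tenfold smaller p-value.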

For this specific filter, i.e., `score_aov_*`, case weights are supported. For other filters, check the `object@case_weights` property to see whether they can use case weights.

## Filtering and ranking

There are two main ways to rank and select a top proportion or number of features.

To filter or rank a single score, we can use built-in methods:  

- `show_best_score_*()`

- `rank_best_score_*()`

For multi-parameter optimization, we can use API calls adapted from {desirability2}:

- `show_best_desirability_*()`

## A filtering example for score *singular*

The `show_best_score_prop()` function returns the predictors with the best scores for a single metric. The `prop_terms` argument lets us control the proportion of predictors to keep.

```{r}
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
```
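The idea behind keeping a proportion of predictors can be sketched in base R. The scores below are made-up values, and the rounding rule (round down, keep at least one predictor) is an assumption for illustration; filtro's own rule may differ.

```{r}
# Hypothetical scores for five predictors (made-up values for illustration)
toy_scores <- c(a = 3.2, b = 0.4, c = 7.1, d = 1.9, e = 5.0)
toy_prop <- 0.4

# Assumed rounding rule: round the count down, but keep at least one predictor
n_keep <- max(1, floor(toy_prop * length(toy_scores)))

# Larger is better here, so sort in decreasing order and keep the top n_keep
top_scores <- head(sort(toy_scores, decreasing = TRUE), n_keep)
top_scores
```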

## A filtering example for scores *plural*

To handle multiple scores, we first create multiple score class objects, and then use the `fit()` method with the standard formula to compute the scores. 

```{r}
# ANOVA raw p-value
natural_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natural_res <-
  natural_units |>
  fit(Sale_Price ~ ., data = ames)

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)

# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)
```

Next, we create a list to collect these score class objects, including their associated metadata and scores.

```{r}
# Create a list
class_score_list <- list(
  ames_aov_pval_natural_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
```

Then, we replace any missing scores with the safe value specific to each method and remove the `outcome` column.

```{r}
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
```
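Conceptually, filling safe values replaces each missing score with a per-method fallback so that the predictor is not dropped. The base R sketch below shows that idea on a toy data frame; the column names and fallback values are invented for illustration and do not reflect the values filtro actually uses.

```{r}
# Toy scores with missing entries (all names and values are hypothetical)
toy_results <- data.frame(
  predictor = c("x1", "x2", "x3"),
  score_a = c(0.8, NA, 0.1),
  score_b = c(NA, 2.5, 1.0)
)

# Assumed per-method fallback values, chosen only for this example
toy_fallbacks <- c(score_a = 1, score_b = Inf)

# Replace each NA with the fallback for its own column
for (nm in names(toy_fallbacks)) {
  missing_rows <- is.na(toy_results[[nm]])
  toy_results[[nm]][missing_rows] <- toy_fallbacks[[nm]]
}
toy_results
```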

Analogous to [`show_best_desirability()`](https://desirability2.tidymodels.org/reference/show_best_desirability.html), the `show_best_desirability_prop()` function allows joint optimization of multiple metrics using desirability functions. 

A desirability function maps values of a metric to a $[0, 1]$ range where $1$ is most desirable and $0$ is unacceptable. When the verb `maximize()` is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain.

For example:

```{r}
# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |> 
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
```
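To make the `maximize()` scale used above more concrete, here is a minimal base R sketch of a linear "larger is better" desirability. The helper `sketch_maximize()` is a hypothetical name, not part of {desirability2}, and combining scores with a geometric mean is an assumption made for this illustration.

```{r}
# Hypothetical helper: linear "larger is better" desirability, clipped to [0, 1]
sketch_maximize <- function(x, low, high) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Values at or below `low` map to 0, values at or above `high` map to 1
toy_cor <- c(-0.2, 0, 0.25, 0.5, 1)
d_cor <- sketch_maximize(toy_cor, low = 0, high = 1)
d_cor

# A second made-up metric, then an overall score via the geometric mean (assumed)
toy_imp <- c(10, 40, 20, 0, 30)
d_imp <- sketch_maximize(toy_imp, low = 0, high = 40)
sqrt(d_cor * d_imp)
```

With a geometric mean, a predictor that is unacceptable on any one metric (desirability 0) gets an overall desirability of 0.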

In `show_best_desirability_prop()`, there is an argument called `prop_terms` that lets us control the proportion of predictors to keep.

```{r}
# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
```

Besides `maximize()`, the verbs `minimize()`, `target()`, and `constrain()` are also available. Each suits a different situation:

- `maximize()` when larger values are better.

- `minimize()` when smaller values are better.

- `target()` when a specific value of the metric is important. 

- `constrain()` when a range of values is equally desirable.

For example:

```{r}
ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))

ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))

ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
```
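The shapes behind these three verbs can be sketched in base R. The `sketch_*()` helpers are hypothetical names that assume simple linear and box-shaped functions; they are meant to convey the idea rather than reproduce {desirability2} exactly.

```{r}
# Smaller is better: 1 at or below `low`, 0 at or above `high`
sketch_minimize <- function(x, low, high) {
  pmin(pmax((high - x) / (high - low), 0), 1)
}

# A specific value is best: rises from `low` to `target`, falls from `target` to `high`
sketch_target <- function(x, low, target, high) {
  rising <- (x - low) / (target - low)
  falling <- (high - x) / (high - target)
  pmin(pmax(pmin(rising, falling), 0), 1)
}

# A range is equally desirable: 1 inside [low, high], 0 outside
sketch_constrain <- function(x, low, high) {
  as.numeric(x >= low & x <= high)
}

toy_vals <- c(0.1, 0.2, 0.255, 0.9, 1)
sketch_minimize(c(0, 0.05, 1), low = 0, high = 1)
sketch_target(toy_vals, low = 0.2, target = 0.255, high = 0.9)
sketch_constrain(toy_vals, low = 0.2, high = 1)
```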

## Available score objects and filter methods 

The score class objects included in the package:

```{r}
#| echo: false
grep("^score_", ls("package:filtro"), value = TRUE)
```

The list of filter methods for score *singular*: 

```{r}
#| echo: false
grep("^show_best_score_", ls("package:filtro"), value = TRUE)
```

The list of filter methods for scores *plural*: 

```{r}
#| echo: false
grep("^show_best_desirability_", ls("package:filtro"), value = TRUE)
```