---
title: "Examples of FFTrees"
author: "Nathaniel Phillips and Hansjörg Neth"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: fft.bib
csl: apa.csl
vignette: >
  %\VignetteIndexEntry{Examples of FFTrees}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, echo = FALSE}
knitr::opts_chunk$set(collapse = FALSE,
                      comment = "#>",
                      prompt = FALSE,
                      tidy = FALSE,
                      echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      # Default figure options:
                      dpi = 100,
                      fig.align = 'center',
                      fig.height = 6.0,
                      fig.width = 6.5,
                      out.width = "580px")
```

```{r load-pkg-0, echo = FALSE, message = FALSE, results = 'hide'}
library(FFTrees)
```

## Examples of FFTs with **FFTrees**

This vignette illustrates how to construct fast-and-frugal trees (FFTs) for additional datasets included in the **FFTrees** package. 
[See @phillips2017FFTrees for a comparison across 10\ real-world datasets.]

### Mushrooms data

```{r image-mushrooms, fig.align = "center", out.width = "225px", echo = FALSE}
knitr::include_graphics("../inst/mushrooms.jpg")
```

The `mushrooms` dataset contains data about mushrooms (see `?mushrooms` for details). 
The goal of our model is to predict which mushrooms are `poisonous` based on `r ncol(mushrooms) - 1` cues, ranging from the mushroom's odor to its color.

Here are the first few rows and a subset of 10\ potential predictors of the `mushrooms` data:

```{r data-mushrooms, echo = FALSE}
# names(mushrooms)

# Select subset:
mushrooms_sub <- mushrooms[1:6, c(1:6, 18:23)]

knitr::kable(head(mushrooms_sub))
```

Table: **Table 1**: Binary criterion variable\ `poisonous` and 10\ potential predictors in the `mushrooms` data.

#### Creating FFTs

Let's create some trees using `FFTrees()`! 
We'll use the `train.p = .50` argument to split the original data into a $50$%\ training set and a $50$%\ testing set:

```{r fft-mushrooms-1, message = FALSE, results = 'hide', warning = FALSE}
# Create FFTs from the mushrooms data:
set.seed(1)  # for replicability of the training / test data split

mushrooms_fft <- FFTrees(formula = poisonous ~ .,
                         data = mushrooms,
                         train.p = .50,  # split data into 50:50 training/test subsets
                         main = "Mushrooms",
                         decision.labels = c("Safe", "Poison"),
                         do.comp = FALSE)
```

Setting the `do.comp = FALSE` argument prevents the evaluation of alternative algorithms (beyond those needed for FFTs). 
This saves time and also avoids error messages stemming from alternative prediction methods (e.g., rank-deficient fits of linear models).

Here is basic information about the best performing FFT (Tree\ #1):

```{r fft-mushrooms-1-print}
# Print information about the best tree (during training):
print(mushrooms_fft)
```

[Cool beans](https://goo.gl/B7YDuC).

#### Visualizing cue accuracies

Let's look at the individual cue training accuracies with `plot(fft, what = "cues")`:

```{r fft-mushrooms-1-plot-cues, fig.width = 6.0, fig.height = 6.0, out.width = "450px"}
# Plot the cue accuracies of an FFTrees object:
plot(mushrooms_fft, what = "cues")
```

It looks like the cues `odor` and `sporepc` are the best predictors. 
In fact, the single cue `odor` has a hit rate of\ $97$% and a false alarm rate of nearly\ $0$%! 
Based on this, we should expect the final trees to use just these cues.

#### Visualizing FFT performance

Now let's plot the performance of the best training tree when applied to the test data:

```{r fft-mushrooms-1-plot}
# Plot the best FFT (for test data):
plot(mushrooms_fft, data = "test")
```

Indeed, it looks like the best tree only uses the\ `odor` and\ `sporepc` cues. 
In our test dataset, the tree had a _false alarm rate_ of\ $0$% ($1 -$ specificity), and a _sensitivity_ (aka. hit rate) of\ $85$%.

#### Trading off prediction errors

When considering the implications of our predictions, the fact that our FFT incurs many misses, but no false alarms, is problematic: 
Given our current task, failing to detect poisonous mushrooms usually has more serious consequences than falsely classifying some as poisonous. 
To change the balance between both possible errors, we can select another tree from the set of FFTs. 
In this case, FFT\ #2 would use the same two cues, but alter their exit structure so that our prediction incurs false alarms, but no misses. 
Alternatively, we could re-generate a new set of FFTs with a higher sensitivity weight value (e.g., increase the default value of\ `sens.w = .50` to\ `sens.w = .67`) and optimize the FFTs' weighted accuracy\ `wacc`.
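Both routes can be tried directly in code. 
The following sketch assumes the `mushrooms_fft` object created above; the object name `mushrooms_sens_fft`, the chunk label, and the exact argument values are only illustrative:

```{r fft-mushrooms-tradeoff, eval = FALSE}
# Option 1: Inspect an alternative FFT with a different exit structure (FFT #2):
plot(mushrooms_fft, tree = 2, data = "test")

# Option 2: Re-create FFTs with a higher sensitivity weight (sens.w = .67)
# and optimize their weighted accuracy (goal = "wacc"):
mushrooms_sens_fft <- FFTrees(formula = poisonous ~ .,
                              data = mushrooms,
                              train.p = .50,
                              sens.w = .67,   # weight sensitivity more heavily
                              goal = "wacc",  # optimize weighted accuracy
                              main = "Mushrooms (sens.w = .67)",
                              decision.labels = c("Safe", "Poison"),
                              do.comp = FALSE)
```

Plotting the resulting best tree (e.g., via `plot(mushrooms_sens_fft, data = "test")`) would then show whether the higher sensitivity weight shifts the tree's exits towards avoiding misses.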
#### An alternative FFT

Let's assume that a famous mushroom expert insists that our FFT is using the wrong cues. 
According to her, the best predictors for poisonous mushrooms are\ `ringtype` and\ `ringnum`. 
To test this, we build a set of FFTs from only these cues and check how they perform relative to our initial tree:

```{r fft-mushrooms-2-seed, include = FALSE}
set.seed(200)
```

```{r fft-mushrooms-2, message = FALSE, results = 'hide', warning = FALSE}
# Create trees using only the ringtype and ringnum cues:
mushrooms_ring_fft <- FFTrees(formula = poisonous ~ ringtype + ringnum,
                              data = mushrooms,
                              train.p = .50,
                              main = "Mushrooms (ring cues)",
                              decision.labels = c("Safe", "Poison"),
                              do.comp = FALSE)
```

Again, we plot the best training tree when predicting the cases in the test dataset:

```{r fft-mushrooms-2-plot}
# Plotting the best training FFT (for test data):
plot(mushrooms_ring_fft, data = "test")
```

As we can see, this tree (in `mushrooms_ring_fft`) has both sensitivity and specificity values of around\ $80$%, but does not perform as well as our earlier one (in `mushrooms_fft`). 
This suggests that we should discard the expert's advice and primarily rely on the\ `odor` and\ `sporepc` cues.

### Iris.v data

```{r iris-image, fig.align = "center", out.width = "225px", echo = FALSE}
knitr::include_graphics("../inst/virginica.jpg")
```

The `iris.v` dataset contains data about 150\ flowers (see `?iris.v`). 
Our goal is to predict which flowers are of the class _Virginica_. 
In this example, we'll create trees using the entire dataset (without splitting the available data into explicit training vs. test subsets), so that we are really fitting the data, rather than engaging in genuine prediction:

```{r iris-fft, message = FALSE, results = 'hide'}
# Create FFTrees object for iris data:
iris_fft <- FFTrees(formula = virginica ~ .,
                    data = iris.v,
                    main = "Iris",
                    decision.labels = c("Not-Vir", "Vir"))
```

The **FFTrees** package provides various functions to inspect the `FFTrees` object `iris_fft`. 
For summary information on the best training tree, we can print the `FFTrees` object (by evaluating `iris_fft` or `print(iris_fft)`). 
Alternatively, we could visualize the tree (via `plot(iris_fft)`) or summarize the `FFTrees` object (via `summary(iris_fft)`):

```{r iris-fft-print, echo = TRUE, eval = FALSE, results = 'hide'}
# Inspect resulting FFTs:
print(iris_fft)    # summarize best training tree
plot(iris_fft)     # visualize best training tree
summary(iris_fft)  # summarize FFTrees object
```
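Beyond printing, plotting, and summarizing, we can also describe the best tree verbally or use the fitted object to classify cases. 
Here is a brief sketch (shown, but not evaluated); `newdata = iris.v` merely re-uses the fitting data for illustration:

```{r iris-fft-use, echo = TRUE, eval = FALSE}
# Other ways of using an FFTrees object:
inwords(iris_fft)                     # describe the best training FFT in words
predict(iris_fft, newdata = iris.v)   # predict the criterion for (new) data
```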
However, let's first take a look at the individual cue accuracies during training...

#### Visualizing cue accuracies

We can plot the cue accuracies during training by specifying `what = "cues"`:

```{r iris-plot-cues, fig.width = 6.0, fig.height = 6.0, out.width = "450px"}
# Plot cue values:
plot(iris_fft, what = "cues")
```

It looks like the two cues\ `pet.len` and\ `pet.wid` are the best predictors for this dataset. 
Based on this insight, we should expect that the final trees will use one or both of these cues.

#### Visualizing FFT performance

Now let's visualize the best tree:

```{r iris-plot-fft}
# Plot best FFT:
plot(iris_fft)
```

Indeed, it turns out that the best tree only uses the\ `pet.len` and\ `pet.wid` cues (in that order). 
For this data, the fitted tree exhibits a sensitivity of\ 100% and a specificity of\ 94%.

#### Viewing alternative FFTs

Now, this tree did quite well, but what if someone wanted a tree with the lowest possible false alarm rate? 
If we inspect the ROC\ plot in the bottom right corner of the figure, we see that Tree\ #2 has a specificity close to\ 100%. 
Let's plot this tree:

```{r iris-plot-fft-2}
# Plot FFT #2:
plot(iris_fft, tree = 2)
```

As we can see, this tree does indeed have a higher specificity (of\ 98%), but this increase comes at the cost of a lower sensitivity (of\ 90%). 
Such trade-offs between conflicting measures are inevitable when fitting and predicting real-world data. 
Importantly, using FFTs and the **FFTrees** package helps us render such trade-offs more transparent.

### Titanic data

For examples that predict people's survival of the _Titanic_ disaster (by growing FFTs for the `titanic` data), see the [Visualizing FFTs](FFTrees_plot.html) vignette.

## Vignettes

Here is a complete list of the vignettes available in the **FFTrees** package:

|   | Vignette | Description |
|--:|:------------------------------|:-------------------------------------------------|
|   | [Main guide: FFTrees overview](guide.html) | An overview of the **FFTrees** package |
| 1 | [Tutorial: FFTs for heart disease](FFTrees_heart.html) | An example of using `FFTrees()` to model heart disease diagnosis |
| 2 | [Accuracy statistics](FFTrees_accuracy_statistics.html) | Definitions of accuracy statistics used throughout the package |
| 3 | [Creating FFTs with FFTrees()](FFTrees_function.html) | Details on the main `FFTrees()` function |
| 4 | [Manually specifying FFTs](FFTrees_mytree.html) | How to directly create FFTs without using the built-in algorithms |
| 5 | [Visualizing FFTs](FFTrees_plot.html) | Plotting `FFTrees` objects, from full trees to icon arrays |
| 6 | [Examples of FFTs](FFTrees_examples.html) | Examples of FFTs from different datasets contained in the package |

## References