--- title: "Tidymodels Workflow with Sequential Keras Models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Tidymodels Workflow with Sequential Keras Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = reticulate::py_module_available("keras") ) # Suppress verbose Keras output for the vignette options(keras.fit_verbose = 0) set.seed(123) ``` ## Introduction This vignette demonstrates a complete `tidymodels` workflow for a classification task using a Keras sequential model defined with `kerasnip`. We will use the Palmer Penguins dataset to predict penguin species based on physical measurements. The `kerasnip` package allows you to define Keras models using a modular "layer block" approach, which then integrates seamlessly with the `parsnip` and `tune` packages for model specification and hyperparameter tuning. ## Setup First, we load the necessary packages. ```{r load-packages} library(kerasnip) library(tidymodels) library(keras3) library(dplyr) # For data manipulation library(ggplot2) # For plotting library(future) # For parallel processing library(finetune) # For racing ``` ## Data Preparation We'll use the `penguins` dataset from the `modeldata` package. We will clean it by removing rows with missing values and ensuring the `species` column is a factor. ```{r data-prep} # Remove rows with missing values penguins_df <- penguins |> na.omit() |> # Convert species to factor for classification mutate(species = factor(species)) # Split data into training and testing sets set.seed(123) penguin_split <- initial_split(penguins_df, prop = 0.8, strata = species) penguin_train <- training(penguin_split) penguin_test <- testing(penguin_split) # Create cross-validation folds for tuning penguin_folds <- vfold_cv(penguin_train, v = 5, strata = species) ``` ## Recipe for Preprocessing We will create a `recipes` object to preprocess our data. This recipe will: * Predict `species` using all other variables. * Normalize all numeric predictors. * Create dummy variables for all categorical predictors. ```{r create-recipe} penguin_recipe <- recipe(species ~ ., data = penguin_train) |> step_normalize(all_numeric_predictors()) |> step_dummy(all_nominal_predictors()) ``` ## Define Keras Sequential Model with `kerasnip` Now, we define our Keras sequential model using `kerasnip`'s layer blocks. We'll create a simple Multi-Layer Perceptron (MLP) with two hidden layers. For a sequential Keras model with tabular data, all preprocessed input features are typically combined into a single input layer. The `recipes` package handles this preprocessing, transforming predictors into a single matrix that serves as the input to the Keras model. 
```{r define-kerasnip-model}
# Define layer blocks
input_block <- function(model, input_shape) {
  keras_model_sequential(input_shape = input_shape)
}

hidden_block <- function(model, units = 32, activation = "relu", rate = 0.2) {
  model |>
    layer_dense(units = units, activation = activation) |>
    layer_dropout(rate = rate)
}

output_block <- function(model, num_classes, activation = "softmax") {
  model |>
    layer_dense(units = num_classes, activation = activation)
}

# Create the kerasnip model specification function
create_keras_sequential_spec(
  model_name = "penguin_mlp",
  layer_blocks = list(
    input = input_block,
    hidden_1 = hidden_block,
    hidden_2 = hidden_block,
    output = output_block
  ),
  mode = "classification"
)
```

## Model Specification

We define our `penguin_mlp` model specification, marking the hyperparameters to be optimized with `tune()` and setting fixed parameters for compilation and fitting.

```{r define-tune-spec}
# Define the tunable model specification
mlp_spec <- penguin_mlp(
  # Tunable parameters for hidden layers
  hidden_1_units = tune(),
  hidden_1_rate = tune(),
  hidden_2_units = tune(),
  hidden_2_rate = tune(),
  # Fixed compilation and fitting parameters
  compile_loss = "categorical_crossentropy",
  compile_optimizer = "adam",
  compile_metrics = c("accuracy"),
  fit_epochs = 20,
  fit_batch_size = 32,
  fit_validation_split = 0.2,
  fit_callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 5)
  )
) |>
  set_engine("keras")

print(mlp_spec)
```

## Create Workflow

A `workflow` combines the recipe and the model specification.

```{r create-workflow}
penguin_wf <- workflow() |>
  add_recipe(penguin_recipe) |>
  add_model(mlp_spec)

print(penguin_wf)
```

## Define Tuning Grid

We will create a regular grid for our hyperparameters, first updating each parameter's range with `dials` functions.

```{r create-tuning-grid}
# Define the tuning grid
params <- extract_parameter_set_dials(penguin_wf) |>
  update(
    hidden_1_units = hidden_units(range = c(32, 128)),
    hidden_1_rate = dropout(range = c(0.1, 0.4)),
    hidden_2_units = hidden_units(range = c(16, 64)),
    hidden_2_rate = dropout(range = c(0.1, 0.4))
  )

mlp_grid <- grid_regular(params, levels = 3)

print(mlp_grid)
```

## Tune Model

Now we'll use `tune_race_anova()` to perform cross-validation and find the best hyperparameters. Racing evaluates every candidate on an initial set of resamples and then drops candidates that are statistically unlikely to be the best, which can save considerable fitting time compared to a full grid search.

```{r tune-model, cache=TRUE}
# Note: parallel processing with `plan(multisession)` is currently not
# working with Keras models due to backend conflicts, so tuning runs
# sequentially.
set.seed(123)
penguin_tune_results <- tune_race_anova(
  penguin_wf,
  resamples = penguin_folds,
  grid = mlp_grid,
  metrics = metric_set(accuracy, roc_auc, f_meas),
  control = control_race(save_pred = TRUE, save_workflow = TRUE)
)
```

## Inspect Tuning Results

We can inspect the tuning results to see which hyperparameter combinations performed best.

```{r inspect-results}
# Show the best performing models based on accuracy
show_best(penguin_tune_results, metric = "accuracy", n = 5)

# Autoplotting the results currently does not work due to a label issue:
# autoplot(penguin_tune_results)

# Select the best hyperparameters
best_mlp_params <- select_best(penguin_tune_results, metric = "accuracy")
print(best_mlp_params)
```
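The racing object also carries the full resampling estimates. As an unevaluated sketch using the standard `tune` and `finetune` accessors, `collect_metrics()` returns one row per candidate and metric, and `plot_race()` shows which candidates were eliminated as resamples accumulated.

```{r explore-tune-results, eval=FALSE}
# Mean resampled estimate for every candidate that survived the race
collect_metrics(penguin_tune_results) |>
  arrange(.metric, desc(mean))

# Visualize the race: candidates whose lines stop early were eliminated
plot_race(penguin_tune_results)
```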
## Finalize Workflow and Fit Model

Once we have the best hyperparameters, we finalize the workflow and fit the model on the entire training dataset.

```{r finalize-fit}
# Finalize the workflow with the best hyperparameters
final_penguin_wf <- finalize_workflow(penguin_wf, best_mlp_params)

# Fit the final model on the full training data
final_penguin_fit <- fit(final_penguin_wf, data = penguin_train)

print(final_penguin_fit)
```

### Inspect Final Model

You can extract the underlying Keras model and its training history for further inspection.

```{r inspect-final-keras-model-summary}
# Extract the Keras model summary
final_penguin_fit |>
  extract_fit_parsnip() |>
  extract_keras_model() |>
  summary()
```

```{r inspect-final-keras-model-plot, eval=FALSE}
# Plot the Keras model
final_penguin_fit |>
  extract_fit_parsnip() |>
  extract_keras_model() |>
  plot(show_shapes = TRUE)
```

![Model](images/model_plot_shapes_ws.png){fig-alt="A picture showing the model shape"}

```{r inspect-final-keras-model-history}
# Plot the training history
final_penguin_fit |>
  extract_fit_parsnip() |>
  extract_keras_history() |>
  plot()
```

## Make Predictions and Evaluate

Finally, we make predictions on the test set and evaluate the model's performance.

```{r predict-evaluate}
# Make predictions on the test set
penguin_test_pred <- predict(final_penguin_fit, new_data = penguin_test)
penguin_test_prob <- predict(
  final_penguin_fit,
  new_data = penguin_test,
  type = "prob"
)

# Combine predictions with the true classes
penguin_results <- penguin_test |>
  select(species) |>
  bind_cols(penguin_test_pred, penguin_test_prob)

print(head(penguin_results))

# Evaluate performance using yardstick metrics
metrics_results <- metric_set(
  accuracy,
  roc_auc,
  f_meas
)(
  penguin_results,
  truth = species,
  estimate = .pred_class,
  .pred_Adelie,
  .pred_Chinstrap,
  .pred_Gentoo
)

print(metrics_results)

# Confusion matrix
conf_mat(penguin_results, truth = species, estimate = .pred_class) |>
  autoplot(type = "heatmap")
```
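For a per-class view of discrimination, the same probability columns also support one-vs-all ROC curves via `yardstick::roc_curve()`. A short, unevaluated sketch using the `penguin_results` tibble from above:

```{r roc-curves, eval=FALSE}
# One-vs-all ROC curve per species from the predicted probabilities
penguin_results |>
  roc_curve(truth = species, .pred_Adelie, .pred_Chinstrap, .pred_Gentoo) |>
  autoplot()
```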