--- title: "DImodelsVis with black-box models" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{DImodelsVis with black-box models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, fig.width = 5, fig.height = 5, fig.align = "center" ) ``` ```{css, echo = FALSE} /* Highlight box */ .highlightbox { background-color: #e6f0fa; /* light blue */ border: 2px solid #156082; /* darker blue */ border-radius: 6px; padding: 10px 14px; margin: 12px 0; box-shadow: 2px 2px 6px rgba(0,0,0,0.1); } /* Note box */ .notebox { background-color: #f7f7f7; /* light gray */ border: 1px solid #999999; /* medium gray */ border-radius: 4px; padding: 8px 12px; margin: 10px 0; box-shadow: 1px 1px 4px rgba(0,0,0,0.08); } ``` This vignette shows an example of how `DImodelsVis` can be used to visualise the response surface across the simplex space using black-box style models such as neural networks or random forests. Note that, while we show limited examples in this vignetted, it is worth noting that all visualisations in the package except for model diagnostics and prediction contributions plots can be created for black-box style models. ### Loading necessary packages ```{r, setup, message = FALSE} library(DImodelsVis) library(dplyr) library(ggplot2) library(nnet) library(randomForest) ``` ### Data For this example we simulate a dataset containing 200 soil samples with varying proportions of sand, silt and clay, constrained to sum to one. The response variable is soil organic carbon (SOC) and was simulated to reflect that soils richer in clay have higher SOC, those with more silt have moderate SOC, and those dominated by sand have lower SOC. ```{r, simulate_data} # For reproducability set.seed(123) # Number of samples n <- 200 # Simulate 3d compositional data # Assume equal density for each variable alpha <- c(1, 1, 1) X <- matrix(rgamma(n * length(alpha), shape = alpha), ncol = length(alpha), byrow = TRUE) X <- X / rowSums(X) # normalize to sum to 1 colnames(X) <- c("Sand", "Silt", "Clay") # Simulate Soil organic carbon (SOC) to be used as a response # We assume, SOC is higher for higher clay content, lower for sand, moderate for silt SOC <- 4 + 5*X[, "Clay"] - 2*X[, "Sand"] + 1*X[, "Silt"] + rnorm(n, 0, 0.5) # Combine the compositional predictors and continuous response into data frame soil_data <- data.frame(X, SOC) # Snippet of data head(soil_data) ``` ### Visualising raw data We create a ternary diagram using the `tenrary_plot()` function showing the spread of data across the 3d simplex space. The points are coloured by SOC values, with higher values colour green and lower values given shades of pink and white. ```{r, raw_data, fig.alt = "Plot of raw data in ternary"} ternary_plot( # Compositional variables prop = c("Sand", "Silt", "Clay"), # Labels for ternary axes tern_labels = c("Sand", "Silt", "Clay"), # Data with compositional variables data = soil_data, # Show raw data as points instead of a contour map show = "points", # Colour points by SOC variable col_var = "SOC") ``` ### Neural network We fit a basic neural network containing a single hidden layer with seven nodes to the soil data using the proportions of sand, silt, and clay as predictors. Additional parameters such as `decay`, `linout`, `maxit` are set (explained in code) at specific values to ensure predictions are consistent across successive runs of the neural network. ```{r, nn_model} # Seed to ensure weight initialisation is reproducible set.seed(737) nn_model <- nnet(SOC ~ Sand + Silt + Clay, data = soil_data, size = 7, # Seven nodes in hidden layer decay = 0.01, # Parameter for weight decay (helps stabilise predictions) linout = TRUE, # Boolean to indicate continuous predictions maxit = 1000) # Number of iterations nn_model ```
#### Visualise response surface across ternary The predicted SOC surface across the 3d simplex space is visualised here. We first generated the grid of points across the simplex space using the `ternary_data()` function and then use the neural network to predict the SOC. The data.frame containing the grid of points and corresponding predicted SOC is then passed to the `ternary_plot()` function for visualising the response surface. :::{.notebox} While we set the `prediction = FALSE` argument here and add the predictions at a later step using the `add_prediction()` function from `DImodelsVis`. Users can also use the base R `predict()` function or any custom function of their choice which returns the predicted response from a black-box model. ::: ```{r, data_nn_model} # Prepare data ternary_grid <- ternary_data(# Compositional predictors prop = c("Sand", "Silt", "Clay"), # Don't make predictions now and only return template prediction = FALSE) head(ternary_grid) ``` The predicted response from the neural network can now be added to the data template created by the `ternary_data()` function. ```{r, predict_nn_model} # Add the predicted response to data plot_data <- add_prediction(# Grid data from ternary_data() function data = ternary_grid, # Model to use for making predictions model = nn_model, # No uncertainty interval added to data interval = "none") # The predicted response is added as a column called .Pred head(plot_data) ``` The response surface can be visualised as follows ```{r, visualise_nn_model, fig.alt = "Ternary diagram showing SOC"} # Create ternary plot ternary_plot(data = plot_data, # Compositional data with predicted response lower_lim = 1.5, # Lower limit of fill scale upper_lim = 9.5, # Upper limit of fill scale nlevels = 8, # Number of colours for fill scale # Labels for ternary axes tern_labels = c("Sand", "Silt", "Clay")) + # ggplot function to add labels on plot labs(fill = "Predicted\nSOC") ``` The predicted SOC is higher in soil containing high proportions of clay and lower in soil containing high amounts of sand.
#### Effects plots for the compositional predictors We can also generate effects plots for the compositional predictors in the neural network. First, the dataset needed for generating the effects plot is created using the `visualise_effects_data()` function. The predicted SOC is then added to the data using the `add_prediction()` function and the prepared data is passed onto the `visualise_effects_plot()` function for visualising the effects. :::{.notebox} Users could also set `prediction = TRUE` and have the predictions added to the data in the data preparation step itself, as shown in the random-forest example later in the vignette. ::: ```{r, nn_model_effects, fig.alt = "Effects plot for compositional predictors", fig.height = 5, fig.width=9} # Compositions to be used for generating the effects plots. # Not all are needed but having more initial compositions gives higher accuracy. eff_data <- tribble(~"Sand", ~"Silt", ~"Clay", 1, 0, 0, 0, 1, 0, 0, 0, 1, 0.5, 0.5, 0, 0.5, 0, 0.5, 0, 0.5, 0.5, 0.8, 0.2, 0, 0.2, 0.8, 0, 0.8, 0, 0.2, 0.2, 0, 0.8, 0, 0.8, 0.2, 0, 0.2, 0.8, 0.6, 0.2, 0.2, 0.2, 0.6, 0.2, 0.2, 0.2, 0.6, 1/3, 1/3, 1/3) # The data-preparation, adding predictions and plotting are done in a single dplyr pipeline here # Prepare data visualise_effects_data(# Data with initial compostions data = eff_data, # Column names for compositional predictors in data prop = c("Sand", "Silt", "Clay"), # Show effects plot for all compostional variables var_interest = c("Sand", "Silt", "Clay"), # Don't make predictions now and only return template prediction = FALSE) %>% # Add predictions add_prediction(interval = "none", # No uncertainty intervals model = nn_model) %>% # Model to use for making predictions # Print first few rows of data as_tibble() %>% print(n = 6) %>% # Create plot visualise_effects_plot() + # ggplot function to add labels on plot labs(y = "Predicted SOC") ``` Increasing the proportion of sand lowers SOC, while increasing clay raises it. The effect of silt depends on the soil context: it tends to increase SOC in sand-rich soils but reduce it in clay-rich soils, leading to an overall flat effect of silt on SOC. ### Random forest model We also show an example of using `DImodelsVis` with random forest models. The same pipeline as depicted above can be used with random forest models to create visualisations from `DImodelsVis`. We first a random forest model using the proportions of sand, silt, and clay as well as their pairwise interactions as predictors. ```{r, forest_model} forest_model <- randomForest(# Predictor for random forest model SOC ~ Sand + Silt + Clay + Sand:Silt + Silt:Clay + Sand:Clay, # Data data = soil_data, # 5000 trees to be used random forest ntree = 5000) forest_model ```
#### Response surface across ternary using random forests Similar to above, we could visualise the predicted response surface for SOC across the three dimensional simplex space using a ternary diagram. For this example, we also showcase the use of setting `prediction = TRUE` in the `ternary_data()` function to automatically add predictions to created grid of points for the ternary diagram. This prepared data is then passed to the `ternary_plot()` function to visualise the response surface. :::{.notebox} The `prediction = TRUE` argument can be specified in any of the data preparation functions in the package to automatically add predictions to the prepared data. However, when working with complex models with multiple predictions, care should be taken as users may wish to supply values for these predictors other than the function's defaults. ::: ```{r, visualise_forest_model, fig.alt = "Response surface from random forest"} # Prepare data ternary_data(# Names of compositional predictors prop = c("Sand", "Silt", "Clay"), # Model to use for making predictions model = forest_model, # Add predictions directly prediction = TRUE) %>% ternary_plot(lower_lim = 1.5, # Lower limit of fill scale upper_lim = 9.5, # Upper limit of fill scale nlevels = 8, # Number of colours for fill scale # Labels for ternary axes tern_labels = c("Sand", "Silt", "Clay")) + # ggplot function to add labels on plot labs(fill = "Predicted\nSOC") ``` A similar inference can be drawn as the neural network: predicted SOC is higher in soils with greater clay content and lower in those with higher sand content. However, the predicted response surface is noticeably more jagged for a random forest compared to a neural network or standard linear regression model. This could be improved by either fine tuning the model parameters further or by applying a post-processing smoothing operation over the predicted values.