---
title: "XAItest: Enhancing Feature Discovery with eXplainable AI"
shorttitle: XAItest
author: Ghislain FIEVET <ghislain.fievet@gmail.com>
package: XAItest
abstract: >
    XAItest is an R Package designed for integrating eXplainable Artificial
    Intelligence (XAI) techniques with traditional statistical analysis to
    enhance the discovery and interpretation of significant features within
    datasets. It applies p-values and feature importance metrics, such as
    SHAP, LIME or custom functions. A major XAItest tool is the generation
    of simulated data to establish significance thresholds, aiding in the
    direct comparison of feature importance with statistical p-values. The
    package aims at helping researchers and bioinformaticians seeking to
    uncover key features inside a dataset.
output:
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
vignette: >
  %\VignetteIndexEntry{01_XAItest}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

# Add a custom feature importance function
"The *XAItest* package includes several classic feature importance algorithms
and supports the addition of new ones. To integrate an *XGBoost* model and
generate its feature importance metrics using the *SHAP* package *shapr*. 

### The following function structure is required

The function should accept:
- *df*: a DataFrame where rows represent samples and columns represent
    features.
- *y*: the name of the target feature.
- *...*: to authorize additional arguments.

The function should return:
- *featImps*: the list of feature importance for each feature.
- *modelPredictions* (optional): the list of predictions when the model is
        applied to *df*, utilized in the *plotModel* function.
- *model* (optional): the model used to compute feature importance, employed
        in the *plotModel* function when *modelPredictions* is unavailable.


## Load libraries and classification dataset
```{r load libs, message=F}
# Load the libraries
library(XAItest)
library(ggplot2)
library(ggforce)
library(SummarizedExperiment)
```

```{r load classif simu data}
se_path <- system.file("extdata", "seClassif.rds", package="XAItest")
dataset_classif <- readRDS(se_path)

data_matrix <- assay(dataset_classif, "counts")
data_matrix <- t(data_matrix)
metadata <- as.data.frame(colData(dataset_classif))
df_simu_classif <- as.data.frame(cbind(data_matrix, y = metadata[['y']]))
for (col in names(df_simu_classif)) {
    if (col != 'y') {
        df_simu_classif[[col]] <- as.numeric(df_simu_classif[[col]])
    }
}
```

## Build and use the custom feature importance function

```{r}
featureImportanceXGBoost <- function(df, y="y", ...){
    # Prepare data
    matX <- as.matrix(df[, colnames(df) != y])
    vecY <- df[[y]]
    vecY <- as.character(vecY)
    vecY[vecY == unique(vecY)[1]] <- 0
    vecY[vecY == unique(vecY)[2]] <- 1
    vecY <- as.numeric(vecY)
    
    # Train the XGBoost model
    model <- xgboost::xgboost(data = matX, label = vecY,
                                nrounds = 10, verbose = FALSE)
    modelPredictions <- predict(model, matX)
    modelPredictionsCat <- modelPredictions
    modelPredictionsCat[modelPredictions < 0.5] <-
                                unique(as.character(df[[y]]))[1]
    modelPredictionsCat[modelPredictions >= 0.5] <-
                                unique(as.character(df[[y]]))[2]

    # Specifying the phi_0, i.e. the expected prediction without any features
    p <- mean(vecY)
    # Computing the actual Shapley values with kernelSHAP accounting
    # for feature dependence using the empirical (conditional)
    # distribution approach with bandwidth parameter sigma = 0.1 (default)
    explanation <- shapr::explain(
        model,
        approach = "empirical",
        x_explain = matX,
        x_train = matX,
        phi0 = p,
        iterative = TRUE,
        iterative_args = list(max_iter = 3)
    )
    results <- colMeans(abs(explanation$shapley_values_est), na.rm = TRUE)
    results <- results[3:length(results)]
    list(featImps = results, model = model, modelPredictions=modelPredictionsCat)
}
```

```{r xai4, warning=FALSE}
set.seed(123)
results <- XAI.test(dataset_classif,"y", simData = TRUE,
                   simPvalTarget = 0.001,
                   customFeatImps=
                   list("XGB_SHAP_feat_imp"=featureImportanceXGBoost),
                   defaultMethods = c("ttest", "lm")
                  )
```

The *mapPvalImportance* function reveals that both the custom
*XGB_SHAP_feat_imp* and other feature importance metrics identify the
*biDistrib* feature as significant.

Display as a data.frame:
```{r mapPvalImportance4}
mpi <- mapPvalImportance(results, refPvalColumn = "ttest_adjPval", refPval = 0.001)
head(mpi$df)
```

Display as a datatable:
```{r mapPvalImportance5, eval=FALSE}
mpi$dt
```
![ ](im/dt4.png)

```{r}
# Plot of the XGboost generated model
plotModel(results, "XGB_SHAP_feat_imp", "diff_distrib01", "biDistrib")
```

```{r}
sessionInfo()
```