Introduction

In biological and biomedical research, the identification of key variables in omics data—such as transcriptomes, proteomes, and methylomes—is generally accomplished using statistical tests like the t-test or correlation analysis. While these methods perform well for detecting classical mean differences, they often fail to capture more complex scenarios, such as non-Gaussian data distributions, interactions among multiple variables, or multimodal distributions.

To address these challenges, the scientific literature proposes to use the potential of explainable AI (XAI) to assist in the discovery of key features. Examples of these techniques can be found in the analysis of methylation and transcriptome data @Rajpal_2023, @Kumar_2023.

The XAItest package aims at proposing a versatile tool to run these types of analysis easily. It calculates various p-value based statistics, such as the t-test or correlation, depending on whether the target variable is categorical or numerical. It also computes different XAI matrics such as feature importance metrics, by default a built-in method from random forest, an implementation of SHAP and of LIME, and the possibility to integrate custom functions.

Establishing a significance level for a p-value is straightforward, often set at 0.05 or 0.01. However, determining thresholds for the feature importance is more complicated, and has to be computed at each experiment. Our package offers an integrated solution to this challenge by generating simulated data designed to meet selected p-value thresholds, thus setting significance thresholds for the feature importance.

Lastly, our package offers an intuitive visualization of results through color-coded tables, highlighting statistical metrics and their significance for easy interpretation.

Installation

if (!require("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install('XAItest')

Generate benchmark datasets

The XAItest package provides several benchmark scenarios to evaluate the performance of statistical tests and machine learning methods. These scenarios include both classification and regression problems with different underlying patterns:

Classification scenarios:

Scenario 1 : Variance Difference
Scenario 2 : Bimodal Distribution
Scenario 3 : XOR Interaction
Scenario 4 : Concentric Circles

Regression scenarios:

Scenario 5 : Parabolic Relationship
Scenario 6 : Rising Sinusoid

These scenarios demonstrate cases where classical statistical tests (t-test, correlation) may fail to detect meaningful patterns, while machine learning approaches and XAI methods can successfully identify the underlying relationships.

library(XAItest)
library(gridExtra)
library(ggplot2)

set.seed(12)

# Scenario 1: Variance Difference
df1 <- genScenario(1)
# Scenario 2: Bimodal Distribution
df2 <- genScenario(2)
# Scenario 3: XOR Interaction
df3 <- genScenario(3)
# Scenario 4: Concentric Circles
df4 <- genScenario(4)
# Scenario 5: Parabolic Relationship
df5 <- genScenario(5)
# Scenario 6: Rising Sinusoid
df6 <- genScenario(6)

# Create plots for each scenario
p1 <- ggplot(df1, aes(x = norm_noise01, y = diffVar01, color = y)) +
  geom_point() + scale_color_manual(values = c("orange", "purple")) +
  ggtitle("Scenario 1:\nVariance Difference")
p2 <- ggplot(df2, aes(x = norm_noise01, y = biDistrib, color = y)) +
  geom_point() + scale_color_manual(values = c("orange", "purple")) +
  ggtitle("Scenario 2:\nBimodal Distribution")
p3 <- ggplot(df3, aes(x = relVar01, y = relVar02, color = y)) +
  geom_point() + scale_color_manual(values = c("orange", "purple")) +
  ggtitle("Scenario 3:\nXOR Interaction")
p4 <- ggplot(df4, aes(x = relVar01, y = relVar02, color = y)) +
  geom_point() + scale_color_manual(values = c("orange", "purple")) +
  ggtitle("Scenario 4:\nConcentric Circles")
p5 <- ggplot(df5, aes(x = norm_noise01, y = y)) +
  geom_point() + ggtitle("Scenario 5:\nParabolic Relationship")
p6 <- ggplot(df6, aes(x = unif_noise01, y = y)) +
  geom_point() + ggtitle("Scenario 6:\nRising Sinusoid")

# Load required library for grid arrangement


# Arrange the plots in a 2x3 grid with width twice the height
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2,
             widths = rep(2, 3), heights = rep(1, 2))

plot of chunk scenarios

Add the `simThresh` simulated variable to the dataset

Objective of this section:

Add a simulated variable simThresh with XAItest::addSimThresh()
Verify that the p-value (t-test) is well calibrated
Train a decision tree and retrieve Gini importances
Use FI(simThresh) as threshold and identify features that exceed it

Brief explanation: addSimThresh() adds a continuous variable “simThresh” constructed to achieve a target p-value (default ~0.05) vs the target ‘y’. It serves as a reference/benchmark for comparing feature importances and p-values.

set.seed(1)
# Add the simThresh variable to achieve a p-value = 0.001 between A and B
df_st <- addSimThresh(df1, pval_target = 0.001)
df_st[c(1:2,50:51),c(1:3,(ncol(df_st)-2):ncol(df_st))]

##    y norm_noise01 norm_noise02 diff_distrib02 diffVar01 simThresh
## 1  A    -1.876532     8.173621     -0.9168318  1.212579 -6.148631
## 2  A   -12.699134    -1.967586      4.3666478  1.756601 -5.273279
## 50 A    -8.795093     3.179686      2.5800761 -2.030026  5.802162
## 51 B    -9.094009    -1.085735      0.5476526  4.833794 -8.005196

We verify that the p-value is indeed close to 0.001

grp_A <- df_st$simThresh[df_st$y == "A"]
grp_B <- df_st$simThresh[df_st$y == "B"]
tt <- t.test(grp_A, grp_B)
 tt$p.value

## [1] 0.0009787382

Now we train a decision tree model and compute Gini feature importances to identify which variables are most informative for classification. We’ll use the simThresh variable’s importance as a threshold to determine which features exceed this benchmark.

library(rpart)
fit <- rpart(y ~ ., data = df_st, method = "class")

imp_raw <- fit$variable.importance

# Significance threshold
fi_sim <- imp_raw[["simThresh"]]
fi_sim

## [1] 6.461132

Now that we have the threshold, we can identify which features exceed it.

# Feature Importance
imp_sorted <- sort(imp_raw, decreasing = TRUE)
imp_sorted

## diff_distrib01 diff_distrib02      diffVar01      simThresh   norm_noise10 
##     33.7414693     22.0523447     12.6881971      6.4611324      5.7432288 
##   norm_noise01   norm_noise06 
##      5.0253252      0.9676717

# Features above the threshold
vars_above <- setdiff(names(imp_sorted[imp_sorted > fi_sim]), "simThresh")
vars_above

## [1] "diff_distrib01" "diff_distrib02" "diffVar01"

These are indeed the key features from scenario 1: two classic mean difference variables (diff_distrib01 and diff_distrib02), and the variance difference variable (diffVar01).

Now we will apply the same approach to a regression scenario using df5, which contains the parabolic relationship pattern.

set.seed(1)
# Add the simThresh variable to achieve a p-value = 0.001 for the correlation with `y`
df_st <- addSimThresh(df5, pval_target = 0.001)
df_st[1:4,c(1:3,ncol(df_st))]

##           y norm_noise01 norm_noise02  simThresh
## 1 0.3708380    -3.158688    2.3560620 -0.3118023
## 2 0.2003057    -4.956585   -0.3250087 -0.4025818
## 3 0.3124379    -1.196562    0.7858448 -0.2398601
## 4 0.3723267    -2.138664    2.5099355 -0.1419215

tt <- cor.test(df_st$y, df_st$simThresh )
tt$p.value

## [1] 0.0006136033

fit <- rpart(y ~ ., data = df_st)

imp_raw <- fit$variable.importance
imp_raw

## norm_noise01    simThresh norm_noise04 norm_noise02 norm_noise03 unif_noise05 
##  0.633840354  0.037217532  0.005982311  0.004325586  0.002991156  0.002991156 
## unif_noise03 unif_noise07 
##  0.002162793  0.002162793

# Significance threshold
fi_sim <- imp_raw[["simThresh"]]
fi_sim

## [1] 0.03721753

# Feature Importance
imp_sorted <- sort(imp_raw, decreasing = TRUE)
imp_sorted

## norm_noise01    simThresh norm_noise04 norm_noise02 norm_noise03 unif_noise05 
##  0.633840354  0.037217532  0.005982311  0.004325586  0.002991156  0.002991156 
## unif_noise03 unif_noise07 
##  0.002162793  0.002162793

# Features above the threshold
vars_above <- setdiff(names(imp_sorted[imp_sorted > fi_sim]), "simThresh")
vars_above

## [1] "norm_noise01"

The feature norm_noise01 is indeed the feature used to build the parabolic y feature.

Get feature importance of a SummarizedExperiment dataset

# Load the libraries
library(ggforce)
library(SummarizedExperiment)

We consider the RNA-seq data from the airway dataset, which contains read counts per gene for airway smooth muscles stored in a RangedSummarizedExperiment object. The metadata contains a column dex, which indicates whether the subjects are treated or untreated. The XAI.test function, by default, builds random forest models to predict a target, here the ‘dex’ treatment status, and calculates the associated gene feature importances.

data(airway, package="airway")
se <- airway

results <- XAI.test(se[1:100,], y = "dex", verbose = TRUE, simData = TRUE)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:generics':
## 
##     train

results

## Metrics Table (first 5 rows):
##                   ttest_pval ttest_adjPval  ebayes_pval ebayes_adjPval
## simThresh       0.0004442891    0.04487320 0.0002671305     0.02698018
## ENSG00000003096 0.0009195955    0.09287914 0.0001397020     0.01410990
## ENSG00000005189 0.0015221670    0.15373887 0.0009032689     0.09123016
## ENSG00000001626 0.0240899065    1.00000000 0.0954418912     1.00000000
## ENSG00000001036 0.0477152309    1.00000000 0.0363164951     1.00000000
##                      lm_coef lm_pval lm_adjPval RF_feat_imp SHAP_feat_imp
## simThresh                 NA      NA         NA       0.150       0.00500
## ENSG00000003096           NA      NA         NA       0.150       0.00500
## ENSG00000005189           NA      NA         NA       0.372       0.00500
## ENSG00000001626           NA      NA         NA       0.075       0.00425
## ENSG00000001036 0.0004872526     NaN        NaN       0.000       0.00400
##                 LIME_feat_imp
## simThresh         0.017688379
## ENSG00000003096   0.014713998
## ENSG00000005189   0.014982886
## ENSG00000001626   0.010418775
## ENSG00000001036   0.008513849

Detect new key features on a classification dataset

On a toy dataset we will investigate which features can be identified using a t-test or linear modeling, and which can be detected through feature importance metrics.

Load simulated classification dataset

The dataset contains 50 samples of A, and 50 samples of B, with 10 noise features et 3 key features:

norm_noise01, norm_noise02 … 10 features of normal distribution noise
diff_distrib01 different normal distributions between “A” and “B”
diff_distrib02 different normal distributions between “A” and “B”
bidistrib01 normal bidistribution for “A” and normal distribution for “B”

The t-test can detect diff_distrib01 and diff_distrib02 but not bidistrib01.

se_path <- system.file("extdata", "seClassif.rds", package="XAItest")
dataset_classif <- readRDS(se_path)

data_matrix <- assay(dataset_classif, "counts")
data_matrix <- t(data_matrix)
metadata <- as.data.frame(colData(dataset_classif))
df_simu_classif <- as.data.frame(cbind(data_matrix, y = metadata[['y']]))
for (col in names(df_simu_classif)) {
    if (col != 'y') {
        df_simu_classif[[col]] <- as.numeric(df_simu_classif[[col]])
    }
}

Example with a dataframe.

df_path <- system.file("extdata", "dfClassif.txt", package="XAItest")
dataset_classif_df <- read.table(df_path)
df_simu_classif_df <- dataset_classif

We can see below how class A and class B are distributed accross the features: norm_noise01, norm_noise02, norm_noise03, norm_noise04, diff_distrib01, diff_distrib02 and biDistrib.

p1 <- ggplot(df_simu_classif, aes(x=norm_noise01, y=norm_noise02,
            color=y)) +
    geom_point() +
    ggtitle("Noise features\nnorm_noise01 vs norm_noise02") +
    theme_bw()
p2 <- ggplot(df_simu_classif, aes(x=diff_distrib01, y=diff_distrib02,
            color=y)) +
    geom_point() +
    ggtitle("Normal distributions\ndiff_distrib01 vs diff_distrib02") +
    theme_bw()
p3 <- ggplot(df_simu_classif, aes(x=norm_noise03, y=biDistrib, color=y)) +
    geom_point() +
    ggtitle("Normal bidistribution\nbidistrib01 vs norm_noise03") +
    theme_bw()
grid.arrange(p1, p2, p3, ncol = 3, nrow = 1)

plot of chunk unnamed-chunk-3

We employ the XAI.test function to evaluate the p-values and feature importance of each feature. The parameter simData is set to TRUE, which instructs the function to generate a new feature named simThresh, designed to achieve a specified p-value (column ttest_adjPval), set at simPvalTarget = 0.01. The purpose of simThresh is to establish a significance threshold for assessing feature importance values.

set.seed(123)
objXAI <- XAI.test(dataset_classif, "y", simData = TRUE, simPvalTarget = 0.01)

head(getMetricsTable(objXAI))

##                  ttest_pval ttest_adjPval  ebayes_pval ebayes_adjPval
## diff_distrib02 1.017434e-43  1.424407e-42 8.037163e-48   1.125203e-46
## diff_distrib01 7.927938e-37  1.109911e-35 1.066684e-37   1.493358e-36
## simThresh      7.023057e-04  9.832280e-03 6.273684e-04   8.783158e-03
## norm_noise03   7.972660e-02  1.000000e+00 7.919121e-02   1.000000e+00
## norm_noise08   1.109098e-01  1.000000e+00 1.064060e-01   1.000000e+00
## norm_noise09   1.438542e-01  1.000000e+00 1.437854e-01   1.000000e+00
##                      lm_coef      lm_pval   lm_adjPval RF_feat_imp
## diff_distrib02 -0.0668305461 2.207235e-21 3.090129e-20  16.4114993
## diff_distrib01 -0.0312318926 1.235226e-11 1.729317e-10  19.4533855
## simThresh      -0.0076745689 1.251399e-01 1.000000e+00   1.1936746
## norm_noise03    0.0027506413 6.054747e-01 1.000000e+00   1.1560764
## norm_noise08    0.0004548054 8.992620e-01 1.000000e+00   0.3780559
## norm_noise09    0.0035048925 5.214001e-01 1.000000e+00   0.7460329
##                SHAP_feat_imp LIME_feat_imp
## diff_distrib02  0.2505186480   0.212450755
## diff_distrib01  0.2442663510   0.233883647
## simThresh       0.0003595620   0.014770319
## norm_noise03    0.0002384213   0.008099617
## norm_noise08    0.0001998247   0.008304666
## norm_noise09    0.0001731488   0.013512810

The mapPvalImportance function reveals the significance of the feature importance by comparison with the p-values.

Display as a data.frame:

mpi <- mapPvalImportance(objXAI)
mpi$df

##                ttest_pval isSign_ttest_pval ttest_adjPval isSign_ttest_adjPval
## diff_distrib02   1.02e-43                 1      1.42e-42                    1
## diff_distrib01   7.93e-37                 1      1.11e-35                    1
## simThresh        7.02e-04                 1      9.83e-03                    1
## norm_noise03     7.97e-02                 0      1.00e+00                    0
## norm_noise08     1.11e-01                 0      1.00e+00                    0
## norm_noise09     1.44e-01                 0      1.00e+00                    0
## norm_noise02     1.46e-01                 0      1.00e+00                    0
## biDistrib        2.08e-01                 0      1.00e+00                    0
## norm_noise06     2.20e-01                 0      1.00e+00                    0
## norm_noise10     5.21e-01                 0      1.00e+00                    0
## norm_noise04     5.25e-01                 0      1.00e+00                    0
## norm_noise05     5.87e-01                 0      1.00e+00                    0
## norm_noise01     6.78e-01                 0      1.00e+00                    0
## norm_noise07     8.92e-01                 0      1.00e+00                    0
##                ebayes_pval isSign_ebayes_pval ebayes_adjPval
## diff_distrib02    8.04e-48                  1       1.13e-46
## diff_distrib01    1.07e-37                  1       1.49e-36
## simThresh         6.27e-04                  1       8.78e-03
## norm_noise03      7.92e-02                  0       1.00e+00
## norm_noise08      1.06e-01                  0       1.00e+00
## norm_noise09      1.44e-01                  0       1.00e+00
## norm_noise02      1.43e-01                  0       1.00e+00
## biDistrib         1.99e-01                  0       1.00e+00
## norm_noise06      2.22e-01                  0       1.00e+00
## norm_noise10      5.54e-01                  0       1.00e+00
## norm_noise04      5.18e-01                  0       1.00e+00
## norm_noise05      5.83e-01                  0       1.00e+00
## norm_noise01      6.74e-01                  0       1.00e+00
## norm_noise07      8.91e-01                  0       1.00e+00
##                isSign_ebayes_adjPval  lm_pval isSign_lm_pval lm_adjPval
## diff_distrib02                     1 2.21e-21              1   3.09e-20
## diff_distrib01                     1 1.24e-11              1   1.73e-10
## simThresh                          1 1.25e-01              0   1.00e+00
## norm_noise03                       0 6.05e-01              0   1.00e+00
## norm_noise08                       0 8.99e-01              0   1.00e+00
## norm_noise09                       0 5.21e-01              0   1.00e+00
## norm_noise02                       0 3.05e-01              0   1.00e+00
## biDistrib                          0 4.11e-01              0   1.00e+00
## norm_noise06                       0 1.96e-01              0   1.00e+00
## norm_noise10                       0 1.43e-01              0   1.00e+00
## norm_noise04                       0 5.49e-03              1   7.69e-02
## norm_noise05                       0 6.68e-01              0   1.00e+00
## norm_noise01                       0 6.75e-01              0   1.00e+00
## norm_noise07                       0 1.49e-01              0   1.00e+00
##                isSign_lm_adjPval RF_feat_imp isSign_RF_feat_imp SHAP_feat_imp
## diff_distrib02                 1      16.400                1.0      0.251000
## diff_distrib01                 1      19.500                1.0      0.244000
## simThresh                      0       1.190                0.5      0.000360
## norm_noise03                   0       1.160                0.0      0.000238
## norm_noise08                   0       0.378                0.0      0.000200
## norm_noise09                   0       0.746                0.0      0.000173
## norm_noise02                   0       0.565                0.0      0.000133
## biDistrib                      0       8.250                1.0      0.000674
## norm_noise06                   0       0.243                0.0      0.000183
## norm_noise10                   0       0.245                0.0      0.000104
## norm_noise04                   0       0.252                0.0      0.000180
## norm_noise05                   0       0.113                0.0      0.000177
## norm_noise01                   0       0.377                0.0      0.000214
## norm_noise07                   0       0.174                0.0      0.000079
##                isSign_SHAP_feat_imp LIME_feat_imp isSign_LIME_feat_imp
## diff_distrib02                  1.0       0.21200                  1.0
## diff_distrib01                  1.0       0.23400                  1.0
## simThresh                       0.5       0.01480                  0.5
## norm_noise03                    0.0       0.00810                  0.0
## norm_noise08                    0.0       0.00830                  0.0
## norm_noise09                    0.0       0.01350                  0.0
## norm_noise02                    0.0       0.01700                  0.5
## biDistrib                       0.5       0.06450                  1.0
## norm_noise06                    0.0       0.00663                  0.0
## norm_noise10                    0.0       0.00676                  0.0
## norm_noise04                    0.0       0.00799                  0.0
## norm_noise05                    0.0       0.00743                  0.0
## norm_noise01                    0.0       0.00579                  0.0
## norm_noise07                    0.0       0.00572                  0.0

Display as a datatable:

mpi$dt

We can see that the biDistrib feature is not detected by the p-values, but is detected by the feature importance metrics.

Plot each model

The XAI.test function builds several predictive models. In order to provide a p-value, linear regression (lm) builds a model that we can access. Similarly, to give feature importances, methods like the random forest, SHAP, and LIME also build models that are accessible

The plotModel function is used to visualize the results of these predictive models. This visualization helps determine which classes are well-predicted by which model. This function aids in more effectively interpreting the model’s performance in each situation.

In this section, we aim to evaluate the performance of each model after removing the diff_distrib01 and diff_distrib02 features from the dataset, leaving only noise and the biDistrib features.

set.seed(123)
objXAI <- XAI.test(df_simu_classif[,setdiff(colnames(df_simu_classif),
                                    c("diff_distrib01", "diff_distrib02"))],
                  simData=TRUE)
head(getMetricsTable(objXAI))

##              ttest_pval ttest_adjPval ebayes_pval ebayes_adjPval      lm_coef
## simThresh    0.00373774    0.04485288 0.003572831     0.04287397 -0.048072126
## norm_noise03 0.07972660    0.95671925 0.079812024     0.95774429  0.026051579
## norm_noise08 0.11090980    1.00000000 0.106582213     1.00000000 -0.015024728
## norm_noise09 0.14385418    1.00000000 0.144698338     1.00000000  0.029296030
## norm_noise02 0.14626421    1.00000000 0.143893819     1.00000000  0.023104227
## biDistrib    0.20818373    1.00000000 0.199417829     1.00000000  0.004796583
##                 lm_pval lm_adjPval RF_feat_imp SHAP_feat_imp LIME_feat_imp
## simThresh    0.01789724  0.2147668    3.864181   0.011145448   0.009946593
## norm_noise03 0.18885965  1.0000000    4.310358   0.019542338   0.011494790
## norm_noise08 0.25933766  1.0000000    1.282664   0.004334013   0.010515326
## norm_noise09 0.15050988  1.0000000    3.879493   0.001412531   0.012126513
## norm_noise02 0.16579352  1.0000000    2.694503   0.006467815   0.010868020
## biDistrib    0.74233539  1.0000000   22.104365   0.415715984   0.596812724

We can see below that the linear model predictions contain many errors, while the random forest, along with the models used for SHAP and LIME, are much more precise.

p1 <- plotModel(objXAI, "lm_pval", "biDistrib", "simThresh")
p2 <- plotModel(objXAI, "RF_feat_imp", "biDistrib", "simThresh")
p3 <- plotModel(objXAI, "SHAP_feat_imp", "biDistrib", "simThresh")
p4 <- plotModel(objXAI, "LIME_feat_imp", "biDistrib", "simThresh")
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

plot of chunk unnamed-chunk-5

Display as a data.frame:

mpi <- mapPvalImportance(objXAI)
head(mpi$df)

##              ttest_pval isSign_ttest_pval ttest_adjPval isSign_ttest_adjPval
## simThresh       0.00374                 1        0.0449                    1
## norm_noise03    0.07970                 0        0.9570                    0
## norm_noise08    0.11100                 0        1.0000                    0
## norm_noise09    0.14400                 0        1.0000                    0
## norm_noise02    0.14600                 0        1.0000                    0
## biDistrib       0.20800                 0        1.0000                    0
##              ebayes_pval isSign_ebayes_pval ebayes_adjPval
## simThresh        0.00357                  1         0.0429
## norm_noise03     0.07980                  0         0.9580
## norm_noise08     0.10700                  0         1.0000
## norm_noise09     0.14500                  0         1.0000
## norm_noise02     0.14400                  0         1.0000
## biDistrib        0.19900                  0         1.0000
##              isSign_ebayes_adjPval lm_pval isSign_lm_pval lm_adjPval
## simThresh                        1  0.0179              1      0.215
## norm_noise03                     0  0.1890              0      1.000
## norm_noise08                     0  0.2590              0      1.000
## norm_noise09                     0  0.1510              0      1.000
## norm_noise02                     0  0.1660              0      1.000
## biDistrib                        0  0.7420              0      1.000
##              isSign_lm_adjPval RF_feat_imp isSign_RF_feat_imp SHAP_feat_imp
## simThresh                    0        3.86                0.5       0.01110
## norm_noise03                 0        4.31                0.5       0.01950
## norm_noise08                 0        1.28                0.0       0.00433
## norm_noise09                 0        3.88                0.5       0.00141
## norm_noise02                 0        2.69                0.0       0.00647
## biDistrib                    0       22.10                1.0       0.41600
##              isSign_SHAP_feat_imp LIME_feat_imp isSign_LIME_feat_imp
## simThresh                     0.5       0.00995                  0.5
## norm_noise03                  0.5       0.01150                  0.5
## norm_noise08                  0.0       0.01050                  0.5
## norm_noise09                  0.0       0.01210                  0.5
## norm_noise02                  0.0       0.01090                  0.5
## biDistrib                     1.0       0.59700                  1.0

Display as a datatable:

mpi$dt

Detect new key features on a regression dataset

First, we load a regression toy example of 100 samples, with 5 noise features, and a y feature computed from one of the noise feature. We will investigate if the feature used to compute y can be identified using correlation, linear modeling, or feature importance metrics.

Load regression simulated dataset

norm_noise1, norm_noise2: normal distribution noise features
unif_noise1, unif_noise2 and unif_noise3: uniform distribution noise features
y feature is aparabole computed from the norm_noise2 values.

Function of parabole

This function is invisible to the correlation test and is used to create the output df_simu_regr$y <- transfo_parab(df_simu_regr$norm_noise2).

transfo_parab <- function(xs){
    x1 <- min(xs)
    x2 <- max(xs)
    h <- (x1 + x2) / 2
    k <- max(xs)/2
    a <- k / ((h - x1) * (h - x2))
    
    y <- a * (xs - x1) * (xs - x2)
    
    return (y)
}

se_path <- system.file("extdata", "seRegress.rds", package="XAItest")
dataset_regress <- readRDS(se_path)

data_matrix <- assay(dataset_regress, "counts")
data_matrix <- t(data_matrix)
metadata <- as.data.frame(colData(dataset_regress))
df_simu_regr <- as.data.frame(cbind(data_matrix, y = metadata[['y']]))
for (col in names(df_simu_regr)) {
    if (col != 'y') {
        df_simu_regr[[col]] <- as.numeric(df_simu_regr[[col]])
    }
}

Example with a dataframe.

df_path <- system.file("extdata", "dfRegress.txt", package="XAItest")
dataset_regress_df <- read.table(df_path)
df_simu_regr_df <- dataset_regress

ggplot(df_simu_regr, aes(x=norm_noise2, y=y)) + geom_point() + theme_bw()

plot of chunk unnamed-chunk-7

We employ the XAI.test function to evaluate the p-values and feature importance of each feature. The parameter simData is set to TRUE, which instructs the function to generate a new feature named simThresh, which is designed to achieve a specified p-value (column ttest_adjPval), set at simPvalTarget = 0.01. The purpose of simThresh is to establish a significance threshold for assessing feature importance values.

set.seed(123)
regr_results <- XAI.test(dataset_regress, "y",
                            simData=TRUE, simPvalTarget = 0.01)
getMetricsTable(regr_results)

##                     cor    cor_pval cor_adjPval      lm_coef    lm_pval
## simThresh    0.31120805 0.001623763 0.009742579  0.092081676 0.00113489
## unif_noise1  0.20779578 0.038029687 0.228178121  0.349840416 0.03174897
## unif_noise2  0.14747642 0.143127825 0.858766953  0.127425571 0.12148391
## unif_noise3  0.12517231 0.214652757 1.000000000  0.016799254 0.53806117
## norm_noise1  0.06115865 0.545535487 1.000000000  0.002523175 0.91762120
## norm_noise2 -0.01119237 0.911996996 1.000000000 -0.008511977 0.60820434
##              lm_adjPval RF_feat_imp SHAP_feat_imp LIME_feat_imp
## simThresh   0.006809342    2.483707   0.002638811    0.01215317
## unif_noise1 0.190493842    1.954692   0.002872570    0.01452009
## unif_noise2 0.728903433    1.536527   0.003133188    0.01629232
## unif_noise3 1.000000000    1.415226   0.005056324    0.01241457
## norm_noise1 1.000000000    1.651757   0.001686529    0.01227430
## norm_noise2 1.000000000   12.932198   0.380751392    0.55671653

Display as a data.frame:

mpi <- mapPvalImportance(regr_results, refPvalColumn = "cor_adjPval", refPval = 0.01)
head(mpi$df)

##             cor_pval isSign_cor_pval cor_adjPval isSign_cor_adjPval lm_pval
## simThresh    0.00162               1     0.00974                  1 0.00113
## unif_noise1  0.03800               0     0.22800                  0 0.03170
## unif_noise2  0.14300               0     0.85900                  0 0.12100
## unif_noise3  0.21500               0     1.00000                  0 0.53800
## norm_noise1  0.54600               0     1.00000                  0 0.91800
## norm_noise2  0.91200               0     1.00000                  0 0.60800
##             isSign_lm_pval lm_adjPval isSign_lm_adjPval RF_feat_imp
## simThresh                1    0.00681                 1        2.48
## unif_noise1              0    0.19000                 0        1.95
## unif_noise2              0    0.72900                 0        1.54
## unif_noise3              0    1.00000                 0        1.42
## norm_noise1              0    1.00000                 0        1.65
## norm_noise2              0    1.00000                 0       12.90
##             isSign_RF_feat_imp SHAP_feat_imp isSign_SHAP_feat_imp LIME_feat_imp
## simThresh                  0.5       0.00264                  0.5        0.0122
## unif_noise1                0.0       0.00287                  0.5        0.0145
## unif_noise2                0.0       0.00313                  0.5        0.0163
## unif_noise3                0.0       0.00506                  0.5        0.0124
## norm_noise1                0.0       0.00169                  0.0        0.0123
## norm_noise2                1.0       0.38100                  1.0        0.5570
##             isSign_LIME_feat_imp
## simThresh                    0.5
## unif_noise1                  0.5
## unif_noise2                  0.5
## unif_noise3                  0.5
## norm_noise1                  0.5
## norm_noise2                  1.0

Display as a datatable:

mpi$dt

The relationship between the norm_noise2 and y features is not apparent through correlation and linear modeling statistical tests, but is revealed by the Random Forest built-in feature importance, as well as its SHAP and LIME values.

Plot each model

Similar to the classification case, the use of the XAI.test function creates predictive models that we can visualize with the plotModel function. This provides insights into why the norm_noise2 feature was detected by feature importance metrics and not by linear modeling.

regr_results@args$modelType

## [1] "regression"

p1 <- plotModel(regr_results, "lm_pval", "norm_noise2")
p2 <- plotModel(regr_results, "SHAP_feat_imp", "norm_noise2")
grid.arrange(p1, p2, ncol = 2, nrow = 1)

plot of chunk unnamed-chunk-9

Use the modelsOverview function to quickly review the performance of the models.

modelsOverview(regr_results)

##        lm_pval RF_feat_imp SHAP_feat_imp LIME_feat_imp
## mse  0.2000662  0.01346611    0.01391434   0.002154499
## rmse 0.4472877  0.11604358    0.11795904   0.046416576
## mae  0.3347928  0.07164813    0.05904951   0.019892683
## r2   0.1736793  0.94438178    0.94253052   0.991101414

Using custom functions

Here is fctParamFIPV(), an example of a custom function used in the preprint computations. This function still requires further improvements before being production-ready. Set simData=T to add the simThresh variable.

script_path <- system.file("fctParamFIPV.R", package = "XAItest")
source(script_path)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:BiocGenerics':
## 
##     type

## The following object is masked from 'package:ggplot2':
## 
##     alpha

## 
## Attaching package: 'lime'

## The following object is masked from 'package:generics':
## 
##     explain

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following object is masked from 'package:BiocGenerics':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:gridExtra':
## 
##     combine

xai_res <- XAI.test(df1, y="y", simData=F, customFIPV=list("pimp"=function(data, ...) {
    fctParamFIPV(data=data, fi_algo="gini", pv_algo="pimp", ml_algo="dt",
    fi_transfo="none", n_perm=20, shap_nsim = 10, verbose=0, ...)
}), defaultMethods=c())

xai_res

## Metrics Table (first 5 rows):
##                  ttest_pval ttest_adjPval pimp_feat_imp pimp_feat_imp_pval
## diff_distrib01 7.827846e-22  1.017620e-20     33.741469         0.04761905
## diff_distrib02 1.898481e-16  2.468026e-15     22.052345         0.04761905
## diffVar01      2.790718e-02  3.627934e-01     12.688197         0.04761905
## norm_noise10   1.540737e-01  1.000000e+00      5.743229         0.14285714
## norm_noise06   1.710054e-01  1.000000e+00      5.275093         0.14285714
##                pimp_feat_imp_adjPval
## diff_distrib01             0.6190476
## diff_distrib02             0.6190476
## diffVar01                  0.6190476
## norm_noise10               1.0000000
## norm_noise06               1.0000000

The number of permutations in PIMP and mProbes is controlled by n_perm, at least 100 recommended. The number of probes in mProbes, or the number of combinations for SHAP and LIME, is specified by shap_nsim.

You can see the compute times for each column:

xai_res@computeTimes

## $ttest_pval
## Time difference of 0.003425837 secs
## 
## $ttest_adjPval
## Time difference of 0.003425837 secs
## 
## $pimp_feat_imp
## Time difference of 0.005793095 secs
## 
## $pimp_feat_imp_pval
## Time difference of 0.1246238 secs
## 
## $pimp_feat_imp_adjPval
## Time difference of 0.1246238 secs

The available arguments combinations are:

ml_algo="dt", fi_algo="gini", pv_algo="pimp"
ml_algo="dt", fi_algo="gini", pv_algo="mprobes"
ml_algo="dt", fi_algo="shap", pv_algo="pimp"
ml_algo="dt", fi_algo="shap", pv_algo="mprobes"
ml_algo="dt", fi_algo="lime", pv_algo="pimp"
ml_algo="dt", fi_algo="lime", pv_algo="mprobes"
ml_algo="rf", fi_algo="gini", pv_algo="pimp"
ml_algo="rf", fi_algo="gini", pv_algo="mprobes"
ml_algo="rf", fi_algo="accuracy", pv_algo="pimp"
ml_algo="rf", fi_algo="accuracy", pv_algo="mprobes"
ml_algo="rf", fi_algo="shap", pv_algo="pimp"
ml_algo="rf", fi_algo="shap", pv_algo="mprobes"
ml_algo="rf", fi_algo="lime", pv_algo="pimp"
ml_algo="rf", fi_algo="lime", pv_algo="mprobes"
ml_algo="svm+vallinadot", fi_algo="shap", pv_algo="pimp"
ml_algo="svm+vallinadot", fi_algo="shap", pv_algo="mprobes"
ml_algo="svm+vallinadot", fi_algo="lime", pv_algo="pimp"
ml_algo="svm+vallinadot", fi_algo="lime", pv_algo="mprobes"
ml_algo="svm-rbfdot", fi_algo="shap", pv_algo="pimp"
ml_algo="svm-rbfdot", fi_algo="shap", pv_algo="mprobes"
ml_algo="svm-rbfdot", fi_algo="lime", pv_algo="pimp"
ml_algo="svm-rbfdot", fi_algo="lime", pv_algo="mprobes"
ml_algo="mlp", fi_algo="olden", pv_algo="pimp"
ml_algo="mlp", fi_algo="olden", pv_algo="mprobes"
ml_algo="mlp", fi_algo="shap", pv_algo="pimp"
ml_algo="mlp", fi_algo="shap", pv_algo="mprobes"
ml_algo="mlp", fi_algo="lime", pv_algo="pimp"
ml_algo="mlp", fi_algo="lime", pv_algo="mprobes"

sessionInfo()

## R Under development (unstable) (2025-10-20 r88955)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] randomForest_4.7-1.2        lime_0.5.3                 
##  [3] kernlab_0.9-33              caret_7.0-1                
##  [5] lattice_0.22-7              SummarizedExperiment_1.41.0
##  [7] Biobase_2.71.0              GenomicRanges_1.63.0       
##  [9] Seqinfo_1.1.0               IRanges_2.45.0             
## [11] S4Vectors_0.49.0            BiocGenerics_0.57.0        
## [13] generics_0.1.4              MatrixGenerics_1.23.0      
## [15] matrixStats_1.5.0           ggforce_0.5.0              
## [17] rpart_4.1.24                ggplot2_4.0.0              
## [19] gridExtra_2.3               XAItest_1.3.0              
## 
## loaded via a namespace (and not attached):
##  [1] pROC_1.19.0.1        rlang_1.1.6          magrittr_2.0.4      
##  [4] e1071_1.7-16         compiler_4.6.0       vctrs_0.6.5         
##  [7] reshape2_1.4.4       stringr_1.5.2        pkgconfig_2.0.3     
## [10] shape_1.4.6.1        fastmap_1.2.0        XVector_0.51.0      
## [13] labeling_0.4.3       prodlim_2025.04.28   purrr_1.1.0         
## [16] xfun_0.54            glmnet_4.1-10        cachem_1.1.0        
## [19] jsonlite_2.0.0       recipes_1.3.1        DelayedArray_0.37.0 
## [22] tweenr_2.0.3         parallel_4.6.0       R6_2.6.1            
## [25] bslib_0.9.0          stringi_1.8.7        RColorBrewer_1.1-3  
## [28] limma_3.67.0         parallelly_1.45.1    jquerylib_0.1.4     
## [31] lubridate_1.9.4      Rcpp_1.1.0           assertthat_0.2.1    
## [34] iterators_1.0.14     knitr_1.50           future.apply_1.20.0 
## [37] Matrix_1.7-4         splines_4.6.0        nnet_7.3-20         
## [40] timechange_0.3.0     tidyselect_1.2.1     dichromat_2.0-0.1   
## [43] abind_1.4-8          timeDate_4051.111    codetools_0.2-20    
## [46] listenv_0.9.1        tibble_3.3.0         plyr_1.8.9          
## [49] withr_3.0.2          S7_0.2.0             evaluate_1.0.5      
## [52] future_1.67.0        survival_3.8-3       proxy_0.4-27        
## [55] polyclip_1.10-7      pillar_1.11.1        kernelshap_0.9.1    
## [58] DT_0.34.0            foreach_1.5.2        scales_1.4.0        
## [61] globals_0.18.0       class_7.3-23         glue_1.8.0          
## [64] tools_4.6.0          data.table_1.17.8    ModelMetrics_1.2.2.2
## [67] gower_1.0.2          grid_4.6.0           crosstalk_1.2.2     
## [70] ipred_0.9-15         nlme_3.1-168         cli_3.6.5           
## [73] S4Arrays_1.11.0      lava_1.8.2           doFuture_1.1.2      
## [76] dplyr_1.1.4          gtable_0.3.6         sass_0.4.10         
## [79] digest_0.6.37        SparseArray_1.11.1   htmlwidgets_1.6.4   
## [82] farver_2.1.2         htmltools_0.5.8.1    lifecycle_1.0.4     
## [85] hardhat_1.4.2        statmod_1.5.1        MASS_7.3-65

XAItest: Enhancing Feature Discovery with eXplainable AI

Ghislain FIEVET ghislain.fievet@gmail.com

Introduction

Installation

Generate benchmark datasets

Add the `simThresh` simulated variable to the dataset

Get feature importance of a SummarizedExperiment dataset

Detect new key features on a classification dataset

Load simulated classification dataset

Plot each model

Detect new key features on a regression dataset

Load regression simulated dataset

Function of parabole

Plot each model

Using custom functions

References

XAItest: Enhancing Feature Discovery with eXplainable AI

Ghislain FIEVET ghislain.fievet@gmail.com

Introduction

Installation

Generate benchmark datasets

Add the simThresh simulated variable to the dataset

Get feature importance of a SummarizedExperiment dataset

Detect new key features on a classification dataset

Load simulated classification dataset

Plot each model

Detect new key features on a regression dataset

Load regression simulated dataset

Function of parabole

Plot each model

Using custom functions

References

Add the `simThresh` simulated variable to the dataset