In biological and biomedical research, the identification of key variables in omics data—such as transcriptomes, proteomes, and methylomes—is generally accomplished using statistical tests like the t-test or correlation analysis. While these methods perform well for detecting classical mean differences, they often fail to capture more complex scenarios, such as non-Gaussian data distributions, interactions among multiple variables, or multimodal distributions.
To address these challenges, the scientific literature proposes to use the potential of explainable AI (XAI) to assist in the discovery of key features. Examples of these techniques can be found in the analysis of methylation and transcriptome data @Rajpal_2023, @Kumar_2023.
The XAItest package aims at proposing a versatile tool to run these types of analysis easily. It calculates various p-value based statistics, such as the t-test or correlation, depending on whether the target variable is categorical or numerical. It also computes different XAI matrics such as feature importance metrics, by default a built-in method from random forest, an implementation of SHAP and of LIME, and the possibility to integrate custom functions.
Establishing a significance level for a p-value is straightforward, often set at 0.05 or 0.01. However, determining thresholds for the feature importance is more complicated, and has to be computed at each experiment. Our package offers an integrated solution to this challenge by generating simulated data designed to meet selected p-value thresholds, thus setting significance thresholds for the feature importance.
Lastly, our package offers an intuitive visualization of results through color-coded tables, highlighting statistical metrics and their significance for easy interpretation.
if (!require("BiocManager")) {
install.packages("BiocManager")
}
BiocManager::install('XAItest')
The XAItest package provides several benchmark scenarios to evaluate the performance of statistical tests and machine learning methods. These scenarios include both classification and regression problems with different underlying patterns:
Classification scenarios:
Regression scenarios:
These scenarios demonstrate cases where classical statistical tests (t-test, correlation) may fail to detect meaningful patterns, while machine learning approaches and XAI methods can successfully identify the underlying relationships.
library(XAItest)
library(gridExtra)
library(ggplot2)
set.seed(12)
# Scenario 1: Variance Difference
df1 <- genScenario(1)
# Scenario 2: Bimodal Distribution
df2 <- genScenario(2)
# Scenario 3: XOR Interaction
df3 <- genScenario(3)
# Scenario 4: Concentric Circles
df4 <- genScenario(4)
# Scenario 5: Parabolic Relationship
df5 <- genScenario(5)
# Scenario 6: Rising Sinusoid
df6 <- genScenario(6)
# Create plots for each scenario
p1 <- ggplot(df1, aes(x = norm_noise01, y = diffVar01, color = y)) +
geom_point() + scale_color_manual(values = c("orange", "purple")) +
ggtitle("Scenario 1:\nVariance Difference")
p2 <- ggplot(df2, aes(x = norm_noise01, y = biDistrib, color = y)) +
geom_point() + scale_color_manual(values = c("orange", "purple")) +
ggtitle("Scenario 2:\nBimodal Distribution")
p3 <- ggplot(df3, aes(x = relVar01, y = relVar02, color = y)) +
geom_point() + scale_color_manual(values = c("orange", "purple")) +
ggtitle("Scenario 3:\nXOR Interaction")
p4 <- ggplot(df4, aes(x = relVar01, y = relVar02, color = y)) +
geom_point() + scale_color_manual(values = c("orange", "purple")) +
ggtitle("Scenario 4:\nConcentric Circles")
p5 <- ggplot(df5, aes(x = norm_noise01, y = y)) +
geom_point() + ggtitle("Scenario 5:\nParabolic Relationship")
p6 <- ggplot(df6, aes(x = unif_noise01, y = y)) +
geom_point() + ggtitle("Scenario 6:\nRising Sinusoid")
# Load required library for grid arrangement
# Arrange the plots in a 2x3 grid with width twice the height
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 3, nrow = 2,
widths = rep(2, 3), heights = rep(1, 2))
simThresh
simulated variable to the datasetObjective of this section:
simThresh
with XAItest::addSimThresh()Brief explanation: addSimThresh() adds a continuous variable “simThresh” constructed to achieve a target p-value (default ~0.05) vs the target ‘y’. It serves as a reference/benchmark for comparing feature importances and p-values.
set.seed(1)
# Add the simThresh variable to achieve a p-value = 0.001 between A and B
df_st <- addSimThresh(df1, pval_target = 0.001)
df_st[c(1:2,50:51),c(1:3,(ncol(df_st)-2):ncol(df_st))]
## y norm_noise01 norm_noise02 diff_distrib02 diffVar01 simThresh
## 1 A -1.876532 8.173621 -0.9168318 1.212579 -6.148631
## 2 A -12.699134 -1.967586 4.3666478 1.756601 -5.273279
## 50 A -8.795093 3.179686 2.5800761 -2.030026 5.802162
## 51 B -9.094009 -1.085735 0.5476526 4.833794 -8.005196
We verify that the p-value is indeed close to 0.001
grp_A <- df_st$simThresh[df_st$y == "A"]
grp_B <- df_st$simThresh[df_st$y == "B"]
tt <- t.test(grp_A, grp_B)
tt$p.value
## [1] 0.0009787382
Now we train a decision tree model and compute Gini feature importances to identify which variables are most informative for classification. We’ll use the simThresh
variable’s importance as a threshold to determine which features exceed this benchmark.
library(rpart)
fit <- rpart(y ~ ., data = df_st, method = "class")
imp_raw <- fit$variable.importance
# Significance threshold
fi_sim <- imp_raw[["simThresh"]]
fi_sim
## [1] 6.461132
Now that we have the threshold, we can identify which features exceed it.
# Feature Importance
imp_sorted <- sort(imp_raw, decreasing = TRUE)
imp_sorted
## diff_distrib01 diff_distrib02 diffVar01 simThresh norm_noise10
## 33.7414693 22.0523447 12.6881971 6.4611324 5.7432288
## norm_noise01 norm_noise06
## 5.0253252 0.9676717
# Features above the threshold
vars_above <- setdiff(names(imp_sorted[imp_sorted > fi_sim]), "simThresh")
vars_above
## [1] "diff_distrib01" "diff_distrib02" "diffVar01"
These are indeed the key features from scenario 1: two classic mean difference variables (diff_distrib01
and diff_distrib02
), and the variance difference variable (diffVar01
).
Now we will apply the same approach to a regression scenario using df5
, which contains the parabolic relationship pattern.
set.seed(1)
# Add the simThresh variable to achieve a p-value = 0.001 for the correlation with `y`
df_st <- addSimThresh(df5, pval_target = 0.001)
df_st[1:4,c(1:3,ncol(df_st))]
## y norm_noise01 norm_noise02 simThresh
## 1 0.3708380 -3.158688 2.3560620 -0.3118023
## 2 0.2003057 -4.956585 -0.3250087 -0.4025818
## 3 0.3124379 -1.196562 0.7858448 -0.2398601
## 4 0.3723267 -2.138664 2.5099355 -0.1419215
tt <- cor.test(df_st$y, df_st$simThresh )
tt$p.value
## [1] 0.0006136033
fit <- rpart(y ~ ., data = df_st)
imp_raw <- fit$variable.importance
imp_raw
## norm_noise01 simThresh norm_noise04 norm_noise02 norm_noise03 unif_noise05
## 0.633840354 0.037217532 0.005982311 0.004325586 0.002991156 0.002991156
## unif_noise03 unif_noise07
## 0.002162793 0.002162793
# Significance threshold
fi_sim <- imp_raw[["simThresh"]]
fi_sim
## [1] 0.03721753
# Feature Importance
imp_sorted <- sort(imp_raw, decreasing = TRUE)
imp_sorted
## norm_noise01 simThresh norm_noise04 norm_noise02 norm_noise03 unif_noise05
## 0.633840354 0.037217532 0.005982311 0.004325586 0.002991156 0.002991156
## unif_noise03 unif_noise07
## 0.002162793 0.002162793
# Features above the threshold
vars_above <- setdiff(names(imp_sorted[imp_sorted > fi_sim]), "simThresh")
vars_above
## [1] "norm_noise01"
The feature norm_noise01
is indeed the feature used to build the parabolic y
feature.
# Load the libraries
library(ggforce)
library(SummarizedExperiment)
We consider the RNA-seq data from the airway dataset, which contains read counts per gene for airway smooth muscles stored in a RangedSummarizedExperiment object. The metadata contains a column dex, which indicates whether the subjects are treated or untreated. The XAI.test function, by default, builds random forest models to predict a target, here the ‘dex’ treatment status, and calculates the associated gene feature importances.
data(airway, package="airway")
se <- airway
results <- XAI.test(se[1:100,], y = "dex", verbose = TRUE, simData = TRUE)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:generics':
##
## train
results
## Metrics Table (first 5 rows):
## ttest_pval ttest_adjPval ebayes_pval ebayes_adjPval
## simThresh 0.0004442891 0.04487320 0.0002671305 0.02698018
## ENSG00000003096 0.0009195955 0.09287914 0.0001397020 0.01410990
## ENSG00000005189 0.0015221670 0.15373887 0.0009032689 0.09123016
## ENSG00000001626 0.0240899065 1.00000000 0.0954418912 1.00000000
## ENSG00000001036 0.0477152309 1.00000000 0.0363164951 1.00000000
## lm_coef lm_pval lm_adjPval RF_feat_imp SHAP_feat_imp
## simThresh NA NA NA 0.150 0.00500
## ENSG00000003096 NA NA NA 0.150 0.00500
## ENSG00000005189 NA NA NA 0.372 0.00500
## ENSG00000001626 NA NA NA 0.075 0.00425
## ENSG00000001036 0.0004872526 NaN NaN 0.000 0.00400
## LIME_feat_imp
## simThresh 0.017688379
## ENSG00000003096 0.014713998
## ENSG00000005189 0.014982886
## ENSG00000001626 0.010418775
## ENSG00000001036 0.008513849
On a toy dataset we will investigate which features can be identified using a t-test or linear modeling, and which can be detected through feature importance metrics.
The dataset contains 50 samples of A, and 50 samples of B, with 10 noise features et 3 key features:
The t-test can detect diff_distrib01 and diff_distrib02 but not bidistrib01.
se_path <- system.file("extdata", "seClassif.rds", package="XAItest")
dataset_classif <- readRDS(se_path)
data_matrix <- assay(dataset_classif, "counts")
data_matrix <- t(data_matrix)
metadata <- as.data.frame(colData(dataset_classif))
df_simu_classif <- as.data.frame(cbind(data_matrix, y = metadata[['y']]))
for (col in names(df_simu_classif)) {
if (col != 'y') {
df_simu_classif[[col]] <- as.numeric(df_simu_classif[[col]])
}
}
Example with a dataframe.
df_path <- system.file("extdata", "dfClassif.txt", package="XAItest")
dataset_classif_df <- read.table(df_path)
df_simu_classif_df <- dataset_classif
We can see below how class A and class B are distributed accross the features: norm_noise01, norm_noise02, norm_noise03, norm_noise04, diff_distrib01, diff_distrib02 and biDistrib.
p1 <- ggplot(df_simu_classif, aes(x=norm_noise01, y=norm_noise02,
color=y)) +
geom_point() +
ggtitle("Noise features\nnorm_noise01 vs norm_noise02") +
theme_bw()
p2 <- ggplot(df_simu_classif, aes(x=diff_distrib01, y=diff_distrib02,
color=y)) +
geom_point() +
ggtitle("Normal distributions\ndiff_distrib01 vs diff_distrib02") +
theme_bw()
p3 <- ggplot(df_simu_classif, aes(x=norm_noise03, y=biDistrib, color=y)) +
geom_point() +
ggtitle("Normal bidistribution\nbidistrib01 vs norm_noise03") +
theme_bw()
grid.arrange(p1, p2, p3, ncol = 3, nrow = 1)
We employ the XAI.test function to evaluate the p-values and feature importance of each feature. The parameter simData is set to TRUE, which instructs the function to generate a new feature named simThresh, designed to achieve a specified p-value (column ttest_adjPval), set at simPvalTarget = 0.01. The purpose of simThresh is to establish a significance threshold for assessing feature importance values.
set.seed(123)
objXAI <- XAI.test(dataset_classif, "y", simData = TRUE, simPvalTarget = 0.01)
head(getMetricsTable(objXAI))
## ttest_pval ttest_adjPval ebayes_pval ebayes_adjPval
## diff_distrib02 1.017434e-43 1.424407e-42 8.037163e-48 1.125203e-46
## diff_distrib01 7.927938e-37 1.109911e-35 1.066684e-37 1.493358e-36
## simThresh 7.023057e-04 9.832280e-03 6.273684e-04 8.783158e-03
## norm_noise03 7.972660e-02 1.000000e+00 7.919121e-02 1.000000e+00
## norm_noise08 1.109098e-01 1.000000e+00 1.064060e-01 1.000000e+00
## norm_noise09 1.438542e-01 1.000000e+00 1.437854e-01 1.000000e+00
## lm_coef lm_pval lm_adjPval RF_feat_imp
## diff_distrib02 -0.0668305461 2.207235e-21 3.090129e-20 16.4114993
## diff_distrib01 -0.0312318926 1.235226e-11 1.729317e-10 19.4533855
## simThresh -0.0076745689 1.251399e-01 1.000000e+00 1.1936746
## norm_noise03 0.0027506413 6.054747e-01 1.000000e+00 1.1560764
## norm_noise08 0.0004548054 8.992620e-01 1.000000e+00 0.3780559
## norm_noise09 0.0035048925 5.214001e-01 1.000000e+00 0.7460329
## SHAP_feat_imp LIME_feat_imp
## diff_distrib02 0.2505186480 0.212450755
## diff_distrib01 0.2442663510 0.233883647
## simThresh 0.0003595620 0.014770319
## norm_noise03 0.0002384213 0.008099617
## norm_noise08 0.0001998247 0.008304666
## norm_noise09 0.0001731488 0.013512810
The mapPvalImportance function reveals the significance of the feature importance by comparison with the p-values.
Display as a data.frame:
mpi <- mapPvalImportance(objXAI)
mpi$df
## ttest_pval isSign_ttest_pval ttest_adjPval isSign_ttest_adjPval
## diff_distrib02 1.02e-43 1 1.42e-42 1
## diff_distrib01 7.93e-37 1 1.11e-35 1
## simThresh 7.02e-04 1 9.83e-03 1
## norm_noise03 7.97e-02 0 1.00e+00 0
## norm_noise08 1.11e-01 0 1.00e+00 0
## norm_noise09 1.44e-01 0 1.00e+00 0
## norm_noise02 1.46e-01 0 1.00e+00 0
## biDistrib 2.08e-01 0 1.00e+00 0
## norm_noise06 2.20e-01 0 1.00e+00 0
## norm_noise10 5.21e-01 0 1.00e+00 0
## norm_noise04 5.25e-01 0 1.00e+00 0
## norm_noise05 5.87e-01 0 1.00e+00 0
## norm_noise01 6.78e-01 0 1.00e+00 0
## norm_noise07 8.92e-01 0 1.00e+00 0
## ebayes_pval isSign_ebayes_pval ebayes_adjPval
## diff_distrib02 8.04e-48 1 1.13e-46
## diff_distrib01 1.07e-37 1 1.49e-36
## simThresh 6.27e-04 1 8.78e-03
## norm_noise03 7.92e-02 0 1.00e+00
## norm_noise08 1.06e-01 0 1.00e+00
## norm_noise09 1.44e-01 0 1.00e+00
## norm_noise02 1.43e-01 0 1.00e+00
## biDistrib 1.99e-01 0 1.00e+00
## norm_noise06 2.22e-01 0 1.00e+00
## norm_noise10 5.54e-01 0 1.00e+00
## norm_noise04 5.18e-01 0 1.00e+00
## norm_noise05 5.83e-01 0 1.00e+00
## norm_noise01 6.74e-01 0 1.00e+00
## norm_noise07 8.91e-01 0 1.00e+00
## isSign_ebayes_adjPval lm_pval isSign_lm_pval lm_adjPval
## diff_distrib02 1 2.21e-21 1 3.09e-20
## diff_distrib01 1 1.24e-11 1 1.73e-10
## simThresh 1 1.25e-01 0 1.00e+00
## norm_noise03 0 6.05e-01 0 1.00e+00
## norm_noise08 0 8.99e-01 0 1.00e+00
## norm_noise09 0 5.21e-01 0 1.00e+00
## norm_noise02 0 3.05e-01 0 1.00e+00
## biDistrib 0 4.11e-01 0 1.00e+00
## norm_noise06 0 1.96e-01 0 1.00e+00
## norm_noise10 0 1.43e-01 0 1.00e+00
## norm_noise04 0 5.49e-03 1 7.69e-02
## norm_noise05 0 6.68e-01 0 1.00e+00
## norm_noise01 0 6.75e-01 0 1.00e+00
## norm_noise07 0 1.49e-01 0 1.00e+00
## isSign_lm_adjPval RF_feat_imp isSign_RF_feat_imp SHAP_feat_imp
## diff_distrib02 1 16.400 1.0 0.251000
## diff_distrib01 1 19.500 1.0 0.244000
## simThresh 0 1.190 0.5 0.000360
## norm_noise03 0 1.160 0.0 0.000238
## norm_noise08 0 0.378 0.0 0.000200
## norm_noise09 0 0.746 0.0 0.000173
## norm_noise02 0 0.565 0.0 0.000133
## biDistrib 0 8.250 1.0 0.000674
## norm_noise06 0 0.243 0.0 0.000183
## norm_noise10 0 0.245 0.0 0.000104
## norm_noise04 0 0.252 0.0 0.000180
## norm_noise05 0 0.113 0.0 0.000177
## norm_noise01 0 0.377 0.0 0.000214
## norm_noise07 0 0.174 0.0 0.000079
## isSign_SHAP_feat_imp LIME_feat_imp isSign_LIME_feat_imp
## diff_distrib02 1.0 0.21200 1.0
## diff_distrib01 1.0 0.23400 1.0
## simThresh 0.5 0.01480 0.5
## norm_noise03 0.0 0.00810 0.0
## norm_noise08 0.0 0.00830 0.0
## norm_noise09 0.0 0.01350 0.0
## norm_noise02 0.0 0.01700 0.5
## biDistrib 0.5 0.06450 1.0
## norm_noise06 0.0 0.00663 0.0
## norm_noise10 0.0 0.00676 0.0
## norm_noise04 0.0 0.00799 0.0
## norm_noise05 0.0 0.00743 0.0
## norm_noise01 0.0 0.00579 0.0
## norm_noise07 0.0 0.00572 0.0
Display as a datatable:
mpi$dt
We can see that the biDistrib feature is not detected by the p-values, but is detected by the feature importance metrics.
The XAI.test function builds several predictive models. In order to provide a p-value, linear regression (lm) builds a model that we can access. Similarly, to give feature importances, methods like the random forest, SHAP, and LIME also build models that are accessible
The plotModel function is used to visualize the results of these predictive models. This visualization helps determine which classes are well-predicted by which model. This function aids in more effectively interpreting the model’s performance in each situation.
In this section, we aim to evaluate the performance of each model after removing the diff_distrib01 and diff_distrib02 features from the dataset, leaving only noise and the biDistrib features.
set.seed(123)
objXAI <- XAI.test(df_simu_classif[,setdiff(colnames(df_simu_classif),
c("diff_distrib01", "diff_distrib02"))],
simData=TRUE)
head(getMetricsTable(objXAI))
## ttest_pval ttest_adjPval ebayes_pval ebayes_adjPval lm_coef
## simThresh 0.00373774 0.04485288 0.003572831 0.04287397 -0.048072126
## norm_noise03 0.07972660 0.95671925 0.079812024 0.95774429 0.026051579
## norm_noise08 0.11090980 1.00000000 0.106582213 1.00000000 -0.015024728
## norm_noise09 0.14385418 1.00000000 0.144698338 1.00000000 0.029296030
## norm_noise02 0.14626421 1.00000000 0.143893819 1.00000000 0.023104227
## biDistrib 0.20818373 1.00000000 0.199417829 1.00000000 0.004796583
## lm_pval lm_adjPval RF_feat_imp SHAP_feat_imp LIME_feat_imp
## simThresh 0.01789724 0.2147668 3.864181 0.011145448 0.009946593
## norm_noise03 0.18885965 1.0000000 4.310358 0.019542338 0.011494790
## norm_noise08 0.25933766 1.0000000 1.282664 0.004334013 0.010515326
## norm_noise09 0.15050988 1.0000000 3.879493 0.001412531 0.012126513
## norm_noise02 0.16579352 1.0000000 2.694503 0.006467815 0.010868020
## biDistrib 0.74233539 1.0000000 22.104365 0.415715984 0.596812724
We can see below that the linear model predictions contain many errors, while the random forest, along with the models used for SHAP and LIME, are much more precise.
p1 <- plotModel(objXAI, "lm_pval", "biDistrib", "simThresh")
p2 <- plotModel(objXAI, "RF_feat_imp", "biDistrib", "simThresh")
p3 <- plotModel(objXAI, "SHAP_feat_imp", "biDistrib", "simThresh")
p4 <- plotModel(objXAI, "LIME_feat_imp", "biDistrib", "simThresh")
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Display as a data.frame:
mpi <- mapPvalImportance(objXAI)
head(mpi$df)
## ttest_pval isSign_ttest_pval ttest_adjPval isSign_ttest_adjPval
## simThresh 0.00374 1 0.0449 1
## norm_noise03 0.07970 0 0.9570 0
## norm_noise08 0.11100 0 1.0000 0
## norm_noise09 0.14400 0 1.0000 0
## norm_noise02 0.14600 0 1.0000 0
## biDistrib 0.20800 0 1.0000 0
## ebayes_pval isSign_ebayes_pval ebayes_adjPval
## simThresh 0.00357 1 0.0429
## norm_noise03 0.07980 0 0.9580
## norm_noise08 0.10700 0 1.0000
## norm_noise09 0.14500 0 1.0000
## norm_noise02 0.14400 0 1.0000
## biDistrib 0.19900 0 1.0000
## isSign_ebayes_adjPval lm_pval isSign_lm_pval lm_adjPval
## simThresh 1 0.0179 1 0.215
## norm_noise03 0 0.1890 0 1.000
## norm_noise08 0 0.2590 0 1.000
## norm_noise09 0 0.1510 0 1.000
## norm_noise02 0 0.1660 0 1.000
## biDistrib 0 0.7420 0 1.000
## isSign_lm_adjPval RF_feat_imp isSign_RF_feat_imp SHAP_feat_imp
## simThresh 0 3.86 0.5 0.01110
## norm_noise03 0 4.31 0.5 0.01950
## norm_noise08 0 1.28 0.0 0.00433
## norm_noise09 0 3.88 0.5 0.00141
## norm_noise02 0 2.69 0.0 0.00647
## biDistrib 0 22.10 1.0 0.41600
## isSign_SHAP_feat_imp LIME_feat_imp isSign_LIME_feat_imp
## simThresh 0.5 0.00995 0.5
## norm_noise03 0.5 0.01150 0.5
## norm_noise08 0.0 0.01050 0.5
## norm_noise09 0.0 0.01210 0.5
## norm_noise02 0.0 0.01090 0.5
## biDistrib 1.0 0.59700 1.0
Display as a datatable:
mpi$dt
First, we load a regression toy example of 100 samples, with 5 noise features, and a y feature computed from one of the noise feature. We will investigate if the feature used to compute y can be identified using correlation, linear modeling, or feature importance metrics.
This function is invisible to the correlation test and is used to create the
output df_simu_regr$y <- transfo_parab(df_simu_regr$norm_noise2)
.
transfo_parab <- function(xs){
x1 <- min(xs)
x2 <- max(xs)
h <- (x1 + x2) / 2
k <- max(xs)/2
a <- k / ((h - x1) * (h - x2))
y <- a * (xs - x1) * (xs - x2)
return (y)
}
se_path <- system.file("extdata", "seRegress.rds", package="XAItest")
dataset_regress <- readRDS(se_path)
data_matrix <- assay(dataset_regress, "counts")
data_matrix <- t(data_matrix)
metadata <- as.data.frame(colData(dataset_regress))
df_simu_regr <- as.data.frame(cbind(data_matrix, y = metadata[['y']]))
for (col in names(df_simu_regr)) {
if (col != 'y') {
df_simu_regr[[col]] <- as.numeric(df_simu_regr[[col]])
}
}
Example with a dataframe.
df_path <- system.file("extdata", "dfRegress.txt", package="XAItest")
dataset_regress_df <- read.table(df_path)
df_simu_regr_df <- dataset_regress
ggplot(df_simu_regr, aes(x=norm_noise2, y=y)) + geom_point() + theme_bw()
We employ the XAI.test function to evaluate the p-values and feature importance of each feature. The parameter simData is set to TRUE, which instructs the function to generate a new feature named simThresh, which is designed to achieve a specified p-value (column ttest_adjPval), set at simPvalTarget = 0.01. The purpose of simThresh is to establish a significance threshold for assessing feature importance values.
set.seed(123)
regr_results <- XAI.test(dataset_regress, "y",
simData=TRUE, simPvalTarget = 0.01)
getMetricsTable(regr_results)
## cor cor_pval cor_adjPval lm_coef lm_pval
## simThresh 0.31120805 0.001623763 0.009742579 0.092081676 0.00113489
## unif_noise1 0.20779578 0.038029687 0.228178121 0.349840416 0.03174897
## unif_noise2 0.14747642 0.143127825 0.858766953 0.127425571 0.12148391
## unif_noise3 0.12517231 0.214652757 1.000000000 0.016799254 0.53806117
## norm_noise1 0.06115865 0.545535487 1.000000000 0.002523175 0.91762120
## norm_noise2 -0.01119237 0.911996996 1.000000000 -0.008511977 0.60820434
## lm_adjPval RF_feat_imp SHAP_feat_imp LIME_feat_imp
## simThresh 0.006809342 2.483707 0.002638811 0.01215317
## unif_noise1 0.190493842 1.954692 0.002872570 0.01452009
## unif_noise2 0.728903433 1.536527 0.003133188 0.01629232
## unif_noise3 1.000000000 1.415226 0.005056324 0.01241457
## norm_noise1 1.000000000 1.651757 0.001686529 0.01227430
## norm_noise2 1.000000000 12.932198 0.380751392 0.55671653
Display as a data.frame:
mpi <- mapPvalImportance(regr_results, refPvalColumn = "cor_adjPval", refPval = 0.01)
head(mpi$df)
## cor_pval isSign_cor_pval cor_adjPval isSign_cor_adjPval lm_pval
## simThresh 0.00162 1 0.00974 1 0.00113
## unif_noise1 0.03800 0 0.22800 0 0.03170
## unif_noise2 0.14300 0 0.85900 0 0.12100
## unif_noise3 0.21500 0 1.00000 0 0.53800
## norm_noise1 0.54600 0 1.00000 0 0.91800
## norm_noise2 0.91200 0 1.00000 0 0.60800
## isSign_lm_pval lm_adjPval isSign_lm_adjPval RF_feat_imp
## simThresh 1 0.00681 1 2.48
## unif_noise1 0 0.19000 0 1.95
## unif_noise2 0 0.72900 0 1.54
## unif_noise3 0 1.00000 0 1.42
## norm_noise1 0 1.00000 0 1.65
## norm_noise2 0 1.00000 0 12.90
## isSign_RF_feat_imp SHAP_feat_imp isSign_SHAP_feat_imp LIME_feat_imp
## simThresh 0.5 0.00264 0.5 0.0122
## unif_noise1 0.0 0.00287 0.5 0.0145
## unif_noise2 0.0 0.00313 0.5 0.0163
## unif_noise3 0.0 0.00506 0.5 0.0124
## norm_noise1 0.0 0.00169 0.0 0.0123
## norm_noise2 1.0 0.38100 1.0 0.5570
## isSign_LIME_feat_imp
## simThresh 0.5
## unif_noise1 0.5
## unif_noise2 0.5
## unif_noise3 0.5
## norm_noise1 0.5
## norm_noise2 1.0
Display as a datatable:
mpi$dt
The relationship between the norm_noise2 and y features is not apparent through correlation and linear modeling statistical tests, but is revealed by the Random Forest built-in feature importance, as well as its SHAP and LIME values.
Similar to the classification case, the use of the XAI.test function creates predictive models that we can visualize with the plotModel function. This provides insights into why the norm_noise2 feature was detected by feature importance metrics and not by linear modeling.
regr_results@args$modelType
## [1] "regression"
p1 <- plotModel(regr_results, "lm_pval", "norm_noise2")
p2 <- plotModel(regr_results, "SHAP_feat_imp", "norm_noise2")
grid.arrange(p1, p2, ncol = 2, nrow = 1)
Use the modelsOverview function to quickly review the performance of the models.
modelsOverview(regr_results)
## lm_pval RF_feat_imp SHAP_feat_imp LIME_feat_imp
## mse 0.2000662 0.01346611 0.01391434 0.002154499
## rmse 0.4472877 0.11604358 0.11795904 0.046416576
## mae 0.3347928 0.07164813 0.05904951 0.019892683
## r2 0.1736793 0.94438178 0.94253052 0.991101414
Here is fctParamFIPV()
, an example of a custom function used in the preprint computations.
This function still requires further improvements before being production-ready.
Set simData=T
to add the simThresh
variable.
script_path <- system.file("fctParamFIPV.R", package = "XAItest")
source(script_path)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:BiocGenerics':
##
## type
## The following object is masked from 'package:ggplot2':
##
## alpha
##
## Attaching package: 'lime'
## The following object is masked from 'package:generics':
##
## explain
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:Biobase':
##
## combine
## The following object is masked from 'package:BiocGenerics':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:gridExtra':
##
## combine
xai_res <- XAI.test(df1, y="y", simData=F, customFIPV=list("pimp"=function(data, ...) {
fctParamFIPV(data=data, fi_algo="gini", pv_algo="pimp", ml_algo="dt",
fi_transfo="none", n_perm=20, shap_nsim = 10, verbose=0, ...)
}), defaultMethods=c())
xai_res
## Metrics Table (first 5 rows):
## ttest_pval ttest_adjPval pimp_feat_imp pimp_feat_imp_pval
## diff_distrib01 7.827846e-22 1.017620e-20 33.741469 0.04761905
## diff_distrib02 1.898481e-16 2.468026e-15 22.052345 0.04761905
## diffVar01 2.790718e-02 3.627934e-01 12.688197 0.04761905
## norm_noise10 1.540737e-01 1.000000e+00 5.743229 0.14285714
## norm_noise06 1.710054e-01 1.000000e+00 5.275093 0.14285714
## pimp_feat_imp_adjPval
## diff_distrib01 0.6190476
## diff_distrib02 0.6190476
## diffVar01 0.6190476
## norm_noise10 1.0000000
## norm_noise06 1.0000000
The number of permutations in PIMP and mProbes is controlled by n_perm
, at least 100 recommended.
The number of probes in mProbes, or the number of combinations for SHAP and LIME, is specified by shap_nsim
.
You can see the compute times for each column:
xai_res@computeTimes
## $ttest_pval
## Time difference of 0.003570557 secs
##
## $ttest_adjPval
## Time difference of 0.003570557 secs
##
## $pimp_feat_imp
## Time difference of 0.006237268 secs
##
## $pimp_feat_imp_pval
## Time difference of 0.1246829 secs
##
## $pimp_feat_imp_adjPval
## Time difference of 0.1246829 secs
The available arguments combinations are:
ml_algo="dt", fi_algo="gini", pv_algo="pimp"
ml_algo="dt", fi_algo="gini", pv_algo="mprobes"
ml_algo="dt", fi_algo="shap", pv_algo="pimp"
ml_algo="dt", fi_algo="shap", pv_algo="mprobes"
ml_algo="dt", fi_algo="lime", pv_algo="pimp"
ml_algo="dt", fi_algo="lime", pv_algo="mprobes"
ml_algo="rf", fi_algo="gini", pv_algo="pimp"
ml_algo="rf", fi_algo="gini", pv_algo="mprobes"
ml_algo="rf", fi_algo="accuracy", pv_algo="pimp"
ml_algo="rf", fi_algo="accuracy", pv_algo="mprobes"
ml_algo="rf", fi_algo="shap", pv_algo="pimp"
ml_algo="rf", fi_algo="shap", pv_algo="mprobes"
ml_algo="rf", fi_algo="lime", pv_algo="pimp"
ml_algo="rf", fi_algo="lime", pv_algo="mprobes"
ml_algo="svm+vallinadot", fi_algo="shap", pv_algo="pimp"
ml_algo="svm+vallinadot", fi_algo="shap", pv_algo="mprobes"
ml_algo="svm+vallinadot", fi_algo="lime", pv_algo="pimp"
ml_algo="svm+vallinadot", fi_algo="lime", pv_algo="mprobes"
ml_algo="svm-rbfdot", fi_algo="shap", pv_algo="pimp"
ml_algo="svm-rbfdot", fi_algo="shap", pv_algo="mprobes"
ml_algo="svm-rbfdot", fi_algo="lime", pv_algo="pimp"
ml_algo="svm-rbfdot", fi_algo="lime", pv_algo="mprobes"
ml_algo="mlp", fi_algo="olden", pv_algo="pimp"
ml_algo="mlp", fi_algo="olden", pv_algo="mprobes"
ml_algo="mlp", fi_algo="shap", pv_algo="pimp"
ml_algo="mlp", fi_algo="shap", pv_algo="mprobes"
ml_algo="mlp", fi_algo="lime", pv_algo="pimp"
ml_algo="mlp", fi_algo="lime", pv_algo="mprobes"
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] randomForest_4.7-1.2 lime_0.5.3
## [3] kernlab_0.9-33 caret_7.0-1
## [5] lattice_0.22-7 SummarizedExperiment_1.39.1
## [7] Biobase_2.69.0 GenomicRanges_1.61.1
## [9] Seqinfo_0.99.2 IRanges_2.43.0
## [11] S4Vectors_0.47.0 BiocGenerics_0.55.1
## [13] generics_0.1.4 MatrixGenerics_1.21.0
## [15] matrixStats_1.5.0 ggforce_0.5.0
## [17] rpart_4.1.24 ggplot2_3.5.2
## [19] gridExtra_2.3 XAItest_1.1.3
##
## loaded via a namespace (and not attached):
## [1] pROC_1.19.0.1 rlang_1.1.6 magrittr_2.0.3
## [4] e1071_1.7-16 compiler_4.5.1 vctrs_0.6.5
## [7] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [10] shape_1.4.6.1 crayon_1.5.3 fastmap_1.2.0
## [13] XVector_0.49.0 labeling_0.4.3 prodlim_2025.04.28
## [16] purrr_1.1.0 xfun_0.53 glmnet_4.1-10
## [19] cachem_1.1.0 jsonlite_2.0.0 recipes_1.3.1
## [22] DelayedArray_0.35.2 tweenr_2.0.3 parallel_4.5.1
## [25] R6_2.6.1 bslib_0.9.0 stringi_1.8.7
## [28] RColorBrewer_1.1-3 limma_3.65.3 parallelly_1.45.1
## [31] jquerylib_0.1.4 lubridate_1.9.4 Rcpp_1.1.0
## [34] assertthat_0.2.1 iterators_1.0.14 knitr_1.50
## [37] future.apply_1.20.0 Matrix_1.7-3 splines_4.5.1
## [40] nnet_7.3-20 timechange_0.3.0 tidyselect_1.2.1
## [43] dichromat_2.0-0.1 abind_1.4-8 timeDate_4041.110
## [46] codetools_0.2-20 listenv_0.9.1 tibble_3.3.0
## [49] plyr_1.8.9 withr_3.0.2 evaluate_1.0.4
## [52] future_1.67.0 survival_3.8-3 proxy_0.4-27
## [55] polyclip_1.10-7 pillar_1.11.0 kernelshap_0.9.0
## [58] DT_0.33 foreach_1.5.2 scales_1.4.0
## [61] globals_0.18.0 class_7.3-23 glue_1.8.0
## [64] tools_4.5.1 data.table_1.17.8 ModelMetrics_1.2.2.2
## [67] gower_1.0.2 grid_4.5.1 crosstalk_1.2.1
## [70] ipred_0.9-15 nlme_3.1-168 cli_3.6.5
## [73] S4Arrays_1.9.1 lava_1.8.1 doFuture_1.1.2
## [76] dplyr_1.1.4 gtable_0.3.6 sass_0.4.10
## [79] digest_0.6.37 SparseArray_1.9.1 htmlwidgets_1.6.4
## [82] farver_2.1.2 htmltools_0.5.8.1 lifecycle_1.0.4
## [85] hardhat_1.4.2 statmod_1.5.0 MASS_7.3-65