Contents

The statistics functionality in notameStats aims to identify interesting features across study groups. See the project example vignette in the notame package and notame website reference index for listing of functions. Similar functionality is available in several packages.

Unless otherwise stated, all functions return separate data frames or other objects with the results. These can be then added to the object feature data using join_rowData(object, results). The reason for not adding these to the objects automatically is that most of the functions return excess information that is not always worth saving. We encourage you to choose which information is important to you.

1 Installation

To install notameStats, install BiocManager first, if it is not installed. Afterwards use the install function from BiocManager and load notameStats.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("notameStats")
library(notame)
library(notameViz)
library(notameStats)

ppath <- tempdir()
init_log(log_file = file.path(ppath, "log.txt"))
## INFO [2025-10-10 19:14:29] Starting logging
data(hilic_neg_sample, package = "notame")
data(toy_notame_set, package = "notame")

2 Univariate functions

2.1 Summary statistics and effect sizes

It is straightforward to provide summary statistics and effect sizes for all features:

toy_notame_set <- mark_nas(toy_notame_set, value = 0)

# Impute missing values, required especially for multivariate methods
toy_notame_set <- notame::impute_rf(toy_notame_set)
## INFO [2025-10-10 19:14:30] 
## Starting random forest imputation at 2025-10-10 19:14:30.14515
## INFO [2025-10-10 19:14:34] Out-of-bag error in random forest imputation: 0.466
## INFO [2025-10-10 19:14:34] Random forest imputation finished at 2025-10-10 19:14:34.609549
sum_stats <- summary_statistics(toy_notame_set, grouping_cols = "Group")
toy_notame_set <- notame::join_rowData(toy_notame_set, sum_stats)

d_results <- cohens_d(toy_notame_set, group = "Group")
## INFO [2025-10-10 19:14:37] Starting to compute Cohen's D between groups B & A
## INFO [2025-10-10 19:14:37] Cohen's D computed.
## INFO [2025-10-10 19:14:37] Starting to compute Cohen's D between groups QC & A
## INFO [2025-10-10 19:14:37] Cohen's D computed.
## INFO [2025-10-10 19:14:37] Starting to compute Cohen's D between groups QC & B
## INFO [2025-10-10 19:14:37] Cohen's D computed.
toy_notame_set <- notame::join_rowData(toy_notame_set, d_results)

fc <- fold_change(toy_notame_set, group = "Group")
## INFO [2025-10-10 19:14:37] Starting to compute fold changes.
## INFO [2025-10-10 19:14:38] Fold changes computed.
toy_notame_set <- notame::join_rowData(toy_notame_set, fc)

colnames(rowData(toy_notame_set))
##  [1] "Feature_ID"      "Split"           "Alignment"       "Average_Mz"     
##  [5] "Average_Rt_min"  "Column"          "Ion_mode"        "Flag"           
##  [9] "A_mean"          "A_sd"            "A_median"        "A_mad"          
## [13] "A_min"           "A_Q25"           "A_Q75"           "A_max"          
## [17] "B_mean"          "B_sd"            "B_median"        "B_mad"          
## [21] "B_min"           "B_Q25"           "B_Q75"           "B_max"          
## [25] "QC_mean"         "QC_sd"           "QC_median"       "QC_mad"         
## [29] "QC_min"          "QC_Q25"          "QC_Q75"          "QC_max"         
## [33] "B_vs_A_Cohen_d"  "QC_vs_A_Cohen_d" "QC_vs_B_Cohen_d" "B_vs_A_FC"      
## [37] "QC_vs_A_FC"      "QC_vs_B_FC"

2.2 Univariate tests

These functions perform univariate hypothesis tests for each feature, report relevant statistics and correct the p-values using FDR correction. For features, where the model fails for some reason, all statistics are recorded as NA. NOTE setting all_features = FALSE does not prevent the tests on the flagged compounds, but only affects p-value correction, where flagged features are not included in the correction and thus do not have an FDR- corrected p-value. To prevent the testing of flagged features altogether, use notame::drop_flagged before the tests.

Most of the univariate statistical test functions in this package use the formula interface, where the formula is provided as a character, with one special condition: the word “Feature” will get replaced at each iteration by the corresponding feature name. So for example, when testing if any of the features predict the difference between study groups, the formula would be: “Group ~ Feature”. Or, when testing if group and time point affect metabolite levels, the formula could be “Feature ~ Group + Time + Group:Time”, with the last term being an interaction term (“Feature ~ Group * Time” is equivalent).

toy_notame_set <- notame::flag_quality(toy_notame_set)
## INFO [2025-10-10 19:14:38] 
## 96% of features flagged for low quality
toy_notame_set <- notame::drop_qcs(toy_notame_set)

lm_results <- perform_lm(toy_notame_set, 
  formula_char = "Feature ~ Group + Time")
## INFO [2025-10-10 19:14:38] Starting linear regression.
## INFO [2025-10-10 19:14:39] Linear regression performed.

Most of the functions allow you to pass extra arguments to the underlying functions performing the actual tests, so you can set custom contrasts etc.

Functions not using the formula interface include correlation tests between molecular features and/or sample information variable (perform_correlation_tests()) and area under curve computation (perform_auc()).

3 Multivariate analysis

3.1 MUVR

notame provides a wrapper for the MUVR analysis (Multivariate methods with Unbiased Variable selection in R, [shi2019variable] using the MUVR2 package. MUVR2 allows fitting both RF and PLS models with clever variable selection for both finding a minimal subset of features that achieves a good performance AND for finding all relevant features. There is also a set of useful visualizations in MUVR2.

# nRep = 2 for quick example
pls_model <- muvr_analysis(toy_notame_set, 
  y = "Injection_order", nRep = 2, method = "PLS")
## Warning in (function (X, Y, ID, scale = TRUE, nRep = 5, nOuter = 6, nInner, : 
## Missing ID -> Assume all unique (i.e. sample independence)
## Warning: executing %dopar% sequentially: no parallel backend registered
class(pls_model)
## [1] "MUVR"       "Regression" "PLS"

3.2 Random forest

For random forest models, we also use the randomForest package. We also include a wrapper for getting feature importance.

rf <- fit_rf(toy_notame_set, y = "Group")

class(rf)
## [1] "randomForest"
head(importance_rf(rf))
##                                          Feature_ID           A            B
## HILIC_neg_118_9111a4_1865 HILIC_neg_118_9111a4_1865  0.01552720  0.009375397
## HILIC_pos_255_0094a7_9288 HILIC_pos_255_0094a7_9288 -0.03925466 -0.002480875
## RP_neg_139_4456a4_1251       RP_neg_139_4456a4_1251  0.02227511 -0.014353918
##                           MeanDecreaseAccuracy MeanDecreaseGini
## HILIC_neg_118_9111a4_1865          0.012278411         6.673016
## HILIC_pos_255_0094a7_9288         -0.020885661         6.102022
## RP_neg_139_4456a4_1251             0.003800981         6.743762

3.3 PLS(-DA)

There are also wrappers for PLS(-DA) functions from the mixOmics package.

pls_res <- mixomics_pls(toy_notame_set, y = "Injection_order", ncomp = 3)
## INFO [2025-10-10 19:14:41] Fitting PLS
class(pls_res)
## [1] "mixo_pls"

4 Session information

## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] pROC_1.19.0.1               notameStats_0.99.1         
##  [3] notameViz_0.99.5            notame_0.99.6              
##  [5] SummarizedExperiment_1.39.2 Biobase_2.69.1             
##  [7] GenomicRanges_1.61.5        Seqinfo_0.99.2             
##  [9] IRanges_2.43.5              S4Vectors_0.47.4           
## [11] BiocGenerics_0.55.3         generics_0.1.4             
## [13] MatrixGenerics_1.21.0       matrixStats_1.5.0          
## [15] ggplot2_4.0.0               BiocStyle_2.37.1           
## 
## loaded via a namespace (and not attached):
##   [1] mnormt_2.1.1         gridExtra_2.3        formatR_1.14        
##   [4] rlang_1.1.6          magrittr_2.0.4       compiler_4.5.1      
##   [7] mgcv_1.9-3           reshape2_1.4.4       vctrs_0.6.5         
##  [10] MUVR2_0.1.0          stringr_1.5.2        pkgconfig_2.0.3     
##  [13] shape_1.4.6.1        crayon_1.5.3         fastmap_1.2.0       
##  [16] XVector_0.49.1       rmarkdown_2.30       prodlim_2025.04.28  
##  [19] itertools_0.1-3      purrr_1.1.0          xfun_0.53           
##  [22] glmnet_4.1-10        randomForest_4.7-1.2 cachem_1.1.0        
##  [25] jsonlite_2.0.0       recipes_1.3.1        DelayedArray_0.35.3 
##  [28] BiocParallel_1.43.4  psych_2.5.6          parallel_4.5.1      
##  [31] R6_2.6.1             stringi_1.8.7        bslib_0.9.0         
##  [34] RColorBrewer_1.1-3   ranger_0.17.0        parallelly_1.45.1   
##  [37] rpart_4.1.24         lubridate_1.9.4      jquerylib_0.1.4     
##  [40] Rcpp_1.1.0           bookdown_0.45        iterators_1.0.14    
##  [43] knitr_1.50           future.apply_1.20.0  igraph_2.1.4        
##  [46] timechange_0.3.0     Matrix_1.7-4         splines_4.5.1       
##  [49] nnet_7.3-20          tidyselect_1.2.1     dichromat_2.0-0.1   
##  [52] abind_1.4-8          yaml_2.3.10          timeDate_4041.110   
##  [55] doParallel_1.0.17    codetools_0.2-20     listenv_0.9.1       
##  [58] doRNG_1.8.6.2        lattice_0.22-7       tibble_3.3.0        
##  [61] plyr_1.8.9           rARPACK_0.11-0       withr_3.0.2         
##  [64] S7_0.2.0             evaluate_1.0.5       future_1.67.0       
##  [67] lambda.r_1.2.4       survival_3.8-3       futile.logger_1.4.3 
##  [70] pillar_1.11.1        BiocManager_1.30.26  rngtools_1.5.2      
##  [73] foreach_1.5.2        ellipse_0.5.0        scales_1.4.0        
##  [76] globals_0.18.0       class_7.3-23         glue_1.8.0          
##  [79] tools_4.5.1          data.table_1.17.8    RSpectra_0.16-2     
##  [82] ModelMetrics_1.2.2.2 gower_1.0.2          grid_4.5.1          
##  [85] missForest_1.5       tidyr_1.3.1          ipred_0.9-15        
##  [88] nlme_3.1-168         cli_3.6.5            futile.options_1.0.1
##  [91] mixOmics_6.33.0      S4Arrays_1.9.1       viridisLite_0.4.2   
##  [94] lava_1.8.1           dplyr_1.1.4          corpcor_1.6.10      
##  [97] gtable_0.3.6         sass_0.4.10          digest_0.6.37       
## [100] caret_7.0-1          ggrepel_0.9.6        SparseArray_1.9.1   
## [103] farver_2.1.2         htmltools_0.5.8.1    lifecycle_1.0.4     
## [106] hardhat_1.4.2        MASS_7.3-65

5 References