1 Project setup
- 1.1 Set up path and logging
- 1.2 Read data
2 Tabular data preprocessing
- 2.1 Preprocessing by mode
- 2.2 Preprocessing for the complete dataset
3 Feature selection
- 3.1 Univariate analysis
4 Session information
References

In this project example, core functionality of notame, notameViz and notameStats is demonstrated in unison with the associated protocol article (Klåvus et al. 2020).

1 Project setup

1.1 Set up path and logging

Let’s set up a path and start logging our project. Many functions in the notame packages will log information when they have finished. This helps in monitoring the analysis process in case something goes wrong and in reviewing the results later. The log will also print to console, although we’ll exclude console output for the rest of the document for brevity.

library(notame)
library(notameViz)
library(notameStats)
library(dplyr)
ppath <- tempdir()
init_log(log_file = file.path(ppath, "log.txt"))
## INFO [2025-12-12 19:10:18] Starting logging

1.2 Read data

The data is mock data with two balanced groups and two balanced time points. The time points also correspond to two batches. There are 50 samples with ten regularly interspersed QC samples and 80 features divided into four analytical modes.

data <- import_from_excel(
  file = system.file("extdata", "toy_notame_set.xlsx", package = "notame"), 
  sheet = 1, corner_row = 11, corner_column = "H", 
  split_by = c("Column", "Ion_mode"))
  
names(assays(data)) <- "abundances"

The function import_from_excel() returns a SummarizedExperiment object with three main parts and corresponding accessors for streamlined data wrangling.

assay: feature abundances across the samples
colData: sample information
rowData: feature information

The example data contains four analytical modes, so we separate them using fix_object(), which also conveniently checks the object for compatibility with all of notame.

modes <- fix_object(data, split_data = TRUE, assay.type = "abundances")

Now we have a list of objects, one per mode (LC column x ionization mode) or however many unique values were specified in rowData(data)$Split.

names(modes)
## [1] "HILIC_neg" "HILIC_pos" "RP_neg"    "RP_pos"
sapply(modes, class)
##              HILIC_neg              HILIC_pos                 RP_neg 
## "SummarizedExperiment" "SummarizedExperiment" "SummarizedExperiment" 
##                 RP_pos 
## "SummarizedExperiment"

2 Tabular data preprocessing

2.1 Preprocessing by mode

Tabular data preprocessing is performed to complete the dataset by way of reducing unwanted variation and data preparation dependent on downstream methods. Steps performed separately for each mode include marking missing values as NA, flagging features with low detection rate (Broadhurst et al. 2018), drift correction (Kirwan et al. 2013), and flagging of low-quality features (Broadhurst et al. 2018). Visualizations are drawn to monitor the process and explore the data globally. The save_QC_plots() wrapper does not include feature-wise plots as it may not be feasible to inspect them at this stage of the analysis. For example, drift correction visualizations are best drawn for a subset of interesting features after feature selection.

# Initialize empty list for processed objects
processed <- list()
for (i in seq_along(modes)) {
  name <- names(modes)[i]
  mode <- modes[[i]]
  # Set all zero abundances to NA
  mode <- mark_nas(mode, value = 0)
  # Flag features with low detection rate
  mode <- flag_detection(mode, group = "Group")
  # Visualize data before drift correction
  save_QC_plots(mode, prefix = paste0(ppath, "figures/", name, "_ORIG"),
                 perplexity = 5, group = "Group", time = "Time", 
                 id = "Subject_ID", color = "Group")
  # Correct drift
  corrected <- correct_drift(mode)
  # Visualize data after drift correction
  save_QC_plots(corrected, prefix = paste0(ppath, "figures/", name, "_DRIFT"),
                perplexity = 5, group = "Group", time = "Time", 
                id = "Subject_ID", color = "Group")
  # Flag low-quality features
  corrected <- corrected %>% 
    assess_quality() %>% 
    flag_quality()
  # Visualize data after removal of low-quality features
  save_QC_plots(corrected, prefix = paste0(ppath, "figures/", name, "_CLEAN"),
                perplexity = 5, group = "Group", time = "Time",
                id = "Subject_ID", color = "Group")
  # Save result of iteration
  processed[[i]] <- corrected
}

Feature-wise flagging information, quality metrics, and brief drift correction notes are included in the feature information after the above steps.

rowData(processed[[1]])$DC_note
##  [1] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
##  [5] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
##  [9] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [13] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [17] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"

2.2 Preprocessing for the complete dataset

Next, it is time to merge the modes together and visualize the complete dataset.

merged <- merge_notame_sets(processed)
save_QC_plots(merged, prefix = paste0(ppath, "figures/_FULL"),
              group = "Group", time = "Time",
              id = "Subject_ID", color = "Group")

Then, QC samples can (and should) be removed, as they are no longer needed, and the dataset is visualized anew.

merged_no_qc <- drop_qcs(merged)
save_QC_plots(merged_no_qc, prefix = paste0(ppath, "figures/FULL_NO_QC"),
              group = "Group", time = "Time",
              id = "Subject_ID", color = "Group")

If there are any missing values in the data, they need to be imputed. Let’s use random forest imputation to impute these values and visualize the dataset one last time before statistical analysis. Seed number should be set before random forest imputation to guarantee reproducibility of results.

set.seed(2024)
imputed <- impute_rf(merged_no_qc)
save_QC_plots(imputed, prefix = paste0(ppath, "figures/FULL_IMPUTED"),
              group = "Group", time = "Time",
              id = "Subject_ID", color = "Group")

By default, the imputation procedure only operates on good quality features, i.e. those that have not been flagged. To use flagged features in statistical analysis, they should be imputed as well. This can be achieved through a second round of imputation, now with all features included. This two-step imputation makes sure that low-quality features don’t affect the imputation of quality features. Imputation could also be performed separately for each mode to reduce execution time, especially if the amounts of features still allows for good imputation results.

base <- impute_rf(imputed, all_features = TRUE)

It is a good idea to save the merged and processed data, so experimenting with different statistical analyses becomes easier.

save(base, file = paste0(ppath, "full_data.RData"))

3 Feature selection

3.1 Univariate analysis

First, we’ll use a linear model to evaluate the features in terms of the probability of obtaining the observed abundances given the null hypothesis, namely that there is no difference in feature abundance across study groups.

assay(base, "log") <- log(assay(base))
lm_results <- perform_lm(base, assay.type = "log", 
  formula_char = "Feature ~ Group")
base <- join_rowData(base, lm_results)

Next, univariate and supervised rankings of features could be combined in a final ranking. Such a final ranking would combine the qualities of univariate analysis and supervised learning. Next, carefully selected features would undergo labour-intensive scrutiny for biological meaning, for example by identification and pathway analysis.

That concludes our project example. The last thing to do is to write the results to an Excel file and finish the logging. Thanks!

write_to_excel(base, file = paste0(ppath, "results.xlsx"))
finish_log()

4 Session information

## R Under development (unstable) (2025-10-20 r88955)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.1.4                 notameStats_1.1.0          
##  [3] notameViz_1.1.0             notame_1.1.1               
##  [5] SummarizedExperiment_1.41.0 Biobase_2.71.0             
##  [7] GenomicRanges_1.63.1        Seqinfo_1.1.0              
##  [9] IRanges_2.45.0              S4Vectors_0.49.0           
## [11] BiocGenerics_0.57.0         generics_0.1.4             
## [13] MatrixGenerics_1.23.0       matrixStats_1.5.0          
## [15] ggplot2_4.0.1               BiocStyle_2.39.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     viridisLite_0.4.2    farver_2.1.2        
##  [4] S7_0.2.1             fastmap_1.2.0        digest_0.6.39       
##  [7] lifecycle_1.0.4      magrittr_2.0.4       compiler_4.6.0      
## [10] rlang_1.1.6          sass_0.4.10          rngtools_1.5.2      
## [13] tools_4.6.0          yaml_2.3.12          knitr_1.50          
## [16] lambda.r_1.2.4       S4Arrays_1.11.1      labeling_0.4.3      
## [19] doRNG_1.8.6.2        DelayedArray_0.37.0  RColorBrewer_1.1-3  
## [22] abind_1.4-8          BiocParallel_1.45.0  Rtsne_0.17          
## [25] withr_3.0.2          purrr_1.2.0          itertools_0.1-3     
## [28] grid_4.6.0           scales_1.4.0         iterators_1.0.14    
## [31] MASS_7.3-65          dichromat_2.0-0.1    cli_3.6.5           
## [34] rmarkdown_2.30       cachem_1.1.0         parallel_4.6.0      
## [37] formatR_1.14         BiocManager_1.30.27  XVector_0.51.0      
## [40] vctrs_0.6.5          Matrix_1.7-4         jsonlite_2.0.0      
## [43] bookdown_0.46        foreach_1.5.2        jquerylib_0.1.4     
## [46] tidyr_1.3.1          ggdendro_0.2.0       missForest_1.6.1    
## [49] glue_1.8.0           codetools_0.2-20     cowplot_1.2.0       
## [52] stringi_1.8.7        gtable_0.3.6         futile.logger_1.4.3 
## [55] tibble_3.3.0         pillar_1.11.1        pcaMethods_2.3.0    
## [58] htmltools_0.5.9      randomForest_4.7-1.2 R6_2.6.1            
## [61] Rdpack_2.6.4         evaluate_1.0.5       lattice_0.22-7      
## [64] futile.options_1.0.1 rbibutils_2.4        openxlsx_4.2.8.1    
## [67] bslib_0.9.0          Rcpp_1.1.0.8.1       zip_2.3.3           
## [70] SparseArray_1.11.9   ranger_0.17.0        xfun_0.54           
## [73] pkgconfig_2.0.3

References

Broadhurst, David, Royston Goodacre, Stacey N Reinke, Julia Kuligowski, Ian D Wilson, Matthew R Lewis, and Warwick B Dunn. 2018. “Guidelines and Considerations for the Use of System Suitability and Quality Control Samples in Mass Spectrometry Assays Applied in Untargeted Clinical Metabolomic Studies.” Metabolomics 14: 1–17.

Kirwan, JA, DI Broadhurst, RL Davidson, and MR Viant. 2013. “Characterising and Correcting Batch Variation in an Automated Direct Infusion Mass Spectrometry (Dims) Metabolomics Workflow.” Analytical and Bioanalytical Chemistry 405: 5147–57.

Klåvus, Anton, Marietta Kokla, Stefania Noerman, Ville M Koistinen, Marjo Tuomainen, Iman Zarei, Topi Meuronen, et al. 2020. “‘Notame’: Workflow for Non-Targeted Lc–Ms Metabolic Profiling.” Metabolites 10 (4): 135.

Project example

2025-12-12

Contents