This project example demonstrates the core functionality of notame, notameViz and notameStats, following the associated protocol article (Klåvus et al. 2020).
Let’s set up a path and start logging our project. Many functions in the notame packages log information when they finish. This helps in monitoring the analysis in case something goes wrong and in reviewing the results later. The log is also printed to the console, although console output is omitted for the rest of this document for brevity.
library(notame)
library(notameViz)
library(notameStats)
library(dplyr)
ppath <- tempdir()
init_log(log_file = file.path(ppath, "log.txt"))
## INFO [2025-10-10 19:14:34] Starting logging
The data is mock data with two balanced groups and two balanced time points. The time points also correspond to two batches. There are 50 samples with ten regularly interspersed QC samples and 80 features divided into four analytical modes.
data <- import_from_excel(
file = system.file("extdata", "toy_notame_set.xlsx", package = "notame"),
sheet = 1, corner_row = 11, corner_column = "H",
split_by = c("Column", "Ion_mode"))
names(assays(data)) <- "abundances"
The function import_from_excel() returns a SummarizedExperiment object with three main parts and corresponding accessors for streamlined data wrangling:
assay: feature abundances across the samples
colData: sample information
rowData: feature information
The example data contains four analytical modes, so we separate them using fix_object(), which also conveniently checks the object for compatibility with all of notame.
modes <- fix_object(data, split_data = TRUE, assay.type = "abundances")
Now we have a list of objects, one per mode (LC column x ionization mode), or however many unique values were specified in rowData(data)$Split.
names(modes)
## [1] "HILIC_neg" "HILIC_pos" "RP_neg" "RP_pos"
sapply(modes, class)
## HILIC_neg HILIC_pos RP_neg
## "SummarizedExperiment" "SummarizedExperiment" "SummarizedExperiment"
## RP_pos
## "SummarizedExperiment"
Tabular data preprocessing completes the dataset by reducing unwanted variation and by preparing the data for the chosen downstream methods. Steps performed separately for each mode include marking missing values as NA, flagging features with a low detection rate (Broadhurst et al. 2018), drift correction (Kirwan et al. 2013), and flagging of low-quality features (Broadhurst et al. 2018). Visualizations are drawn to monitor the process and to explore the data globally. The save_QC_plots() wrapper does not include feature-wise plots, as it may not be feasible to inspect them at this stage of the analysis. For example, drift correction visualizations are best drawn for a subset of interesting features after feature selection.
# Initialize empty list for processed objects
processed <- list()
for (i in seq_along(modes)) {
name <- names(modes)[i]
mode <- modes[[i]]
# Set all zero abundances to NA
mode <- mark_nas(mode, value = 0)
# Flag features with low detection rate
mode <- flag_detection(mode, group = "Group")
# Visualize data before drift correction
save_QC_plots(mode, prefix = paste0(ppath, "/figures/", name, "_ORIG"),
perplexity = 5, group = "Group", time = "Time",
id = "Subject_ID", color = "Group")
# Correct drift
corrected <- correct_drift(mode)
# Visualize data after drift correction
save_QC_plots(corrected, prefix = paste0(ppath, "/figures/", name, "_DRIFT"),
perplexity = 5, group = "Group", time = "Time",
id = "Subject_ID", color = "Group")
# Flag low-quality features
corrected <- corrected %>%
assess_quality() %>%
flag_quality()
# Visualize data after removal of low-quality features
save_QC_plots(corrected, prefix = paste0(ppath, "/figures/", name, "_CLEAN"),
perplexity = 5, group = "Group", time = "Time",
id = "Subject_ID", color = "Group")
# Save result of iteration
processed[[i]] <- corrected
}
Feature-wise flagging information, quality metrics, and brief drift correction notes are included in the feature information after the above steps.
rowData(processed[[1]])$DC_note
## [1] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [5] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [9] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [13] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
## [17] "Drift_corrected" "Drift_corrected" "Drift_corrected" "Drift_corrected"
Next, it is time to merge the modes together and visualize the complete dataset.
merged <- merge_notame_sets(processed)
save_QC_plots(merged, prefix = paste0(ppath, "/figures/_FULL"),
group = "Group", time = "Time",
id = "Subject_ID", color = "Group")
Then, QC samples can (and should) be removed, as they are no longer needed, and the dataset is visualized anew.
merged_no_qc <- drop_qcs(merged)
save_QC_plots(merged_no_qc, prefix = paste0(ppath, "/figures/FULL_NO_QC"),
group = "Group", time = "Time",
id = "Subject_ID", color = "Group")
If there are any missing values in the data, they need to be imputed. Let’s use random forest imputation for these values and then visualize the dataset one last time before statistical analysis. The seed should be set before random forest imputation to guarantee reproducible results.
set.seed(2024)
imputed <- impute_rf(merged_no_qc)
save_QC_plots(imputed, prefix = paste0(ppath, "/figures/FULL_IMPUTED"),
group = "Group", time = "Time",
id = "Subject_ID", color = "Group")
By default, the imputation procedure only operates on good-quality features, i.e. those that have not been flagged. To use flagged features in statistical analysis, they should be imputed as well. This can be achieved through a second round of imputation, now with all features included. This two-step imputation makes sure that low-quality features don’t affect the imputation of good-quality features. Imputation could also be performed separately for each mode to reduce execution time, especially if the number of features in each mode still allows for good imputation results (a sketch follows below).
base <- impute_rf(imputed, all_features = TRUE)
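As a minimal sketch of the per-mode alternative mentioned above (not run here; the placement of QC removal and flagging relative to imputation is left out for brevity):
# Hypothetical per-mode imputation before merging
set.seed(2024)
imputed_modes <- lapply(processed, impute_rf)
merged_alt <- merge_notame_sets(imputed_modes)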
It is a good idea to save the merged and processed data so that experimenting with different statistical analyses later becomes easier.
save(base, file = paste0(ppath, "/full_data.RData"))
First, we’ll use a linear model to evaluate each feature in terms of the probability of obtaining the observed abundances under the null hypothesis, namely that there is no difference in feature abundance across the study groups.
assay(base, "log") <- log(assay(base))
lm_results <- perform_lm(base, assay.type = "log",
formula_char = "Feature ~ Group")
base <- join_rowData(base, lm_results)
Next, univariate and supervised rankings of features could be combined into a final ranking that draws on the strengths of both univariate analysis and supervised learning (see the sketch below). Carefully selected features would then undergo labour-intensive scrutiny for biological meaning, for example through identification and pathway analysis.
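As a minimal sketch of such a combination, here using the randomForest package directly for the supervised ranking (an assumption, not part of the notame workflow) together with a simple rank sum; the p-value column name in lm_results depends on the model formula and is left as a placeholder:
library(randomForest)
set.seed(2024)
# Fit a random forest classifier on the log-transformed abundances (samples x features)
rf <- randomForest(x = t(assay(base, "log")), y = factor(colData(base)$Group))
# Rank features by variable importance (most important first)
rf_rank <- rank(-rf$importance[, "MeanDecreaseGini"])
# lm_rank <- rank(lm_results$...)         # replace ... with the p-value column of lm_results
# final_rank <- rank(rf_rank + lm_rank)   # simple rank-sum combination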
That concludes our project example. The last thing to do is to write the results to an Excel file and finish the logging. Thanks!
write_to_excel(base, file = paste0(ppath, "/results.xlsx"))
finish_log()
## R version 4.5.1 Patched (2025-08-23 r88802)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.1.4 notameStats_0.99.1
## [3] notameViz_0.99.5 notame_0.99.6
## [5] SummarizedExperiment_1.39.2 Biobase_2.69.1
## [7] GenomicRanges_1.61.5 Seqinfo_0.99.2
## [9] IRanges_2.43.5 S4Vectors_0.47.4
## [11] BiocGenerics_0.55.3 generics_0.1.4
## [13] MatrixGenerics_1.21.0 matrixStats_1.5.0
## [15] ggplot2_4.0.0 BiocStyle_2.37.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.53 bslib_0.9.0
## [4] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1
## [7] parallel_4.5.1 missForest_1.5 tibble_3.3.0
## [10] pkgconfig_2.0.3 Matrix_1.7-4 RColorBrewer_1.1-3
## [13] S7_0.2.0 rngtools_1.5.2 lifecycle_1.0.4
## [16] compiler_4.5.1 farver_2.1.2 codetools_0.2-20
## [19] htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [22] tidyr_1.3.1 pillar_1.11.1 crayon_1.5.3
## [25] jquerylib_0.1.4 MASS_7.3-65 BiocParallel_1.43.4
## [28] DelayedArray_0.35.3 cachem_1.1.0 doRNG_1.8.6.2
## [31] iterators_1.0.14 abind_1.4-8 foreach_1.5.2
## [34] pcaMethods_2.1.0 zip_2.3.3 tidyselect_1.2.1
## [37] digest_0.6.37 Rtsne_0.17 stringi_1.8.7
## [40] purrr_1.1.0 bookdown_0.45 labeling_0.4.3
## [43] cowplot_1.2.0 fastmap_1.2.0 grid_4.5.1
## [46] cli_3.6.5 SparseArray_1.9.1 magrittr_2.0.4
## [49] S4Arrays_1.9.1 dichromat_2.0-0.1 randomForest_4.7-1.2
## [52] withr_3.0.2 scales_1.4.0 rmarkdown_2.30
## [55] lambda.r_1.2.4 XVector_0.49.1 futile.logger_1.4.3
## [58] openxlsx_4.2.8 evaluate_1.0.5 knitr_1.50
## [61] viridisLite_0.4.2 ggdendro_0.2.0 rlang_1.1.6
## [64] itertools_0.1-3 futile.options_1.0.1 Rcpp_1.1.0
## [67] glue_1.8.0 BiocManager_1.30.26 formatR_1.14
## [70] jsonlite_2.0.0 R6_2.6.1
Broadhurst, David, Royston Goodacre, Stacey N Reinke, Julia Kuligowski, Ian D Wilson, Matthew R Lewis, and Warwick B Dunn. 2018. “Guidelines and Considerations for the Use of System Suitability and Quality Control Samples in Mass Spectrometry Assays Applied in Untargeted Clinical Metabolomic Studies.” Metabolomics 14: 1–17.
Kirwan, JA, DI Broadhurst, RL Davidson, and MR Viant. 2013. “Characterising and Correcting Batch Variation in an Automated Direct Infusion Mass Spectrometry (DIMS) Metabolomics Workflow.” Analytical and Bioanalytical Chemistry 405: 5147–57.
Klåvus, Anton, Marietta Kokla, Stefania Noerman, Ville M Koistinen, Marjo Tuomainen, Iman Zarei, Topi Meuronen, et al. 2020. “‘Notame’: Workflow for Non-Targeted LC–MS Metabolic Profiling.” Metabolites 10 (4): 135.