MetaboDynamics 0.99.23
Metabolomics data from longitudinal high-throughput experiments, such as untargeted Liquid Chromatography-Mass Spectrometry (LC-MS), allows for insights into changes of metabolite abundances over time. However, most existing tools are limited to comparing only two time points or experimental conditions, and they often rely on frequentist statistical methods. Additionally, most tools are only able to analyze groups of metabolites on a pathway level, potentially missing more general patterns of changes in the metabolism in sparse data.
Here, we introduce MetaboDynamics, a modular workflow of probabilistic models for the analysis of longitudinal metabolomics data. This package provides a flexible framework for analyzing longitudinal data, allowing users to model changes in metabolite abundances over time and identify patterns of interest.
Metabolomics data is often noisy and limited by few replicates due to high costs. To address these challenges, we employ a Bayesian hierarchical model that balances between under-fitting and over-fitting. This model provides a robust method for estimating mean abundances at every time point, even with limited replicates.
The MetaboDynamics workflow requires abundance tables as input, which can include metabolites and their observed abundances, LC-MS peak intensities or areas, normalized by biomass or cell number. The workflow can handle any number of time points, and the dynamics model requires at least three replicates. However, functions to compare clustered dynamics with each other do not require a certain number of replicates and can be applied with calculated mean abundances of e.g. two replicates.
In this vignette, we will demonstrate how to use the MetaboDynamics package to analyze longitudinal metabolomics data, including data preprocessing, model fitting, and interpretation of results. We will show how to apply the package to a simulated dataset and how to visualize and interpret the results.
library(MetaboDynamics)
library(SummarizedExperiment) # storing and manipulating simulated metabolomics data
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
##
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
##
## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
## colWeightedMeans, colWeightedMedians, colWeightedSds,
## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
## rowWeightedSds, rowWeightedVars
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
##
## rowMedians
## The following objects are masked from 'package:matrixStats':
##
## anyMissing, rowMedians
library(ggplot2) # visualization
library(dplyr) # data handling
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:GenomicRanges':
##
## intersect, setdiff, union
## The following object is masked from 'package:GenomeInfoDb':
##
## intersect
## The following objects are masked from 'package:IRanges':
##
## collapse, desc, intersect, setdiff, slice, union
## The following objects are masked from 'package:S4Vectors':
##
## first, intersect, rename, setdiff, setequal, union
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, setequal, union
## The following object is masked from 'package:generics':
##
## explain
## The following object is masked from 'package:matrixStats':
##
## count
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr) # data handling
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:S4Vectors':
##
## expand
The MetaboDynamics package includes a simulated data set called “longitudinalMetabolomics”. This data set consists of 98 metabolites with three replicates at four time points (1-4) across three experimental conditions (A-C). The data set is represented as a SummarizedExperiment object, where the abundance tables of each experimental condition are stored in assays (raw abundances in “concentrations”, log-transformed transformations in “log_con” and scaled (mean=0, sd=1) log-transformed abundances" in “scaled_log”), and the metabolite names, KEGG IDs, experimental conditions, and clustering solutions per experimental condition are stored in colData and rowData. To load the data, execute the following code:
data("longitudinalMetabolomics",package = "MetaboDynamics")
The next plot shows the raw metabolite abundances.
# convert to dataframe
abundances <- as.data.frame(cbind(
as.data.frame(t(assays(longitudinalMetabolomics)$concentrations)),
as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
cols = seq_len(4), names_to = "time",
values_to = "abundance"
)
ggplot(abundances, aes(x = abundance)) +
geom_freqpoly() +
theme_bw() +
facet_grid(cols = vars(time), rows = vars(condition)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
ggtitle("raw data", "raw measurements")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this plot I visualize the distribution of metabolite abundances. We can see many observations with low abundances and few observations for higher abundances. This raw data is not distributed normally. So let’s log-transform the values. In the integrated simulated data set this is already done in the column “log_m”.
ggplot(abundances, aes(x = abundance)) +
geom_density() +
scale_x_log10()+
theme_bw() +
facet_grid(cols = vars(time), rows = vars(condition)) +
ggtitle("data", "log-transformed values")
We can see that the metabolite abundances vary over magnitudes and are normally distributed after log-transformation.
The next plot shows the log-transformed dynamics of single metabolites.
ggplot(abundances) +
geom_line(aes(
x = time, y = abundance, col = metabolite,
group = interaction(metabolite, replicate)
)) +
scale_y_log10()+
theme_bw() +
xlab("timepoint") +
scale_color_viridis_d()+
theme(legend.position = "none") +
facet_grid(rows = vars(condition)) +
ggtitle("raw metabolite dynamics", "color=metabolite")
The dynamics profiles of single metabolites have intercepts (i.e. mean metabolite abundances) that differ by magnitudes from each other. To be able to compare metabolites with each other which abundances differ by magnitude we define dynamics as deviations at the observed time points from the metabolite’s mean abundance. For this we standardize each metabolite at each experimental condition to a mean of zero and a standard deviation of one.
In the simulated data set the scaled measurements are in column “m_scaled”.
data("longitudinalMetabolomics")
# convert to dataframe
abundances <- as.data.frame(cbind(
as.data.frame(t(assays(longitudinalMetabolomics)$scaled_log)),
as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
cols = seq_len(4), names_to = "time",
values_to = "abundance"
)
ggplot(abundances) +
geom_line(aes(
x = time,
y = abundance, col = metabolite,
group = interaction(metabolite, replicate)
)) +
theme_bw() +
scale_color_viridis_d()+
xlab("timepoint") +
theme(legend.position = "none") +
facet_grid(rows = vars(condition)) +
ggtitle("standardized dynamics", "color=metabolite")
To analyze the metabolomics data, we use a Bayesian hierarchical model. This model is used to account for the complexity and variability of the data and to provide a flexible and robust framework for analysis. The model is defined as follows with: con = metabolite abundances, m = metabolite, c = experimental condition and t = time point ID:
\[\begin{align*} \log(con_{m,c,t})&\sim {\sf normal}(\mu_{m,c,t},\sigma_{m,c,t}) \\ \mu_{m,c,t}&\sim {\sf normal}(0,2) \\ \sigma_{m,c,t}&\sim {\sf exponential}(\lambda_{m,c}) \\ \lambda_{m,c}&\sim {\sf exponential}(2) \end{align*}\]
This model assumes a normal distribution of log-transformed and scaled metabolite abundances and estimates a mean per metabolite and time point. The spread of the replicate measurements is also estimated per metabolite and time point, but the spread of different time points informs each other through the hierarchical structure of the hyper-parameter . This partial pooling allows for the balance between over- and underfitting.
The code below shows how to fit the Bayesian model with the function fit_dynamics_model(). This might take of the order of 10 minutes per experimental condition for 98 metabolites. Here we fit the dynamics model for a small subset of the simulated metabolites.
# we can hand a SummarizedExperiment object to the function
data("longitudinalMetabolomics")
# we only use a subsection of the simulated data (1 condition and subsample of
# the whole dataset) for demonstration purposes
samples <- c(
"UMP", "Taurine", "Succinate", "Phosphocreatine", "PEP",
"Malic acid", "L-Cystine", "CMP", "Citramalic acid", "2-Aminomuconate"
)
# only one condition
data <- longitudinalMetabolomics[, longitudinalMetabolomics$condition == "A" &
longitudinalMetabolomics$metabolite %in% samples]
# fit model
data <- fit_dynamics_model(
data = data, scaled_measurement = "m_scaled",
assay = "scaled_log", time = "time",
condition = "condition", max_treedepth = 10,
adapt_delta = 0.95, # default 0.95
iter = 5000,
cores = 1,
chains = 2 # only set to 2 for vignette, default = 4
)
## Is your data normalized and standardized?
## We recommend normalization by log-transformation.
## Scaling and centering (mean=0, sd=1) should be metabolite and condition specific.
##
## SAMPLING FOR MODEL 'm_ANOVA_partial_pooling' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 6.1e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.61 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 5000 [ 0%] (Warmup)
## Chain 1: Iteration: 500 / 5000 [ 10%] (Warmup)
## Chain 1: Iteration: 1000 / 5000 [ 20%] (Warmup)
## Chain 1: Iteration: 1251 / 5000 [ 25%] (Sampling)
## Chain 1: Iteration: 1750 / 5000 [ 35%] (Sampling)
## Chain 1: Iteration: 2250 / 5000 [ 45%] (Sampling)
## Chain 1: Iteration: 2750 / 5000 [ 55%] (Sampling)
## Chain 1: Iteration: 3250 / 5000 [ 65%] (Sampling)
## Chain 1: Iteration: 3750 / 5000 [ 75%] (Sampling)
## Chain 1: Iteration: 4250 / 5000 [ 85%] (Sampling)
## Chain 1: Iteration: 4750 / 5000 [ 95%] (Sampling)
## Chain 1: Iteration: 5000 / 5000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 1.274 seconds (Warm-up)
## Chain 1: 3.257 seconds (Sampling)
## Chain 1: 4.531 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'm_ANOVA_partial_pooling' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 4.8e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.48 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 5000 [ 0%] (Warmup)
## Chain 2: Iteration: 500 / 5000 [ 10%] (Warmup)
## Chain 2: Iteration: 1000 / 5000 [ 20%] (Warmup)
## Chain 2: Iteration: 1251 / 5000 [ 25%] (Sampling)
## Chain 2: Iteration: 1750 / 5000 [ 35%] (Sampling)
## Chain 2: Iteration: 2250 / 5000 [ 45%] (Sampling)
## Chain 2: Iteration: 2750 / 5000 [ 55%] (Sampling)
## Chain 2: Iteration: 3250 / 5000 [ 65%] (Sampling)
## Chain 2: Iteration: 3750 / 5000 [ 75%] (Sampling)
## Chain 2: Iteration: 4250 / 5000 [ 85%] (Sampling)
## Chain 2: Iteration: 4750 / 5000 [ 95%] (Sampling)
## Chain 2: Iteration: 5000 / 5000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 1.326 seconds (Warm-up)
## Chain 2: 3.24 seconds (Sampling)
## Chain 2: 4.566 seconds (Total)
## Chain 2:
This returns a list of model fits that are named by the experimental condition (“A”,“B”,“C”). If data is a SummarizedExperiment the fits are stored in metadata(data)[[“dynamic_fits”]].
MetaboDynamics fits Bayesian models with MCMC (Markov Chain Monte Carlo) which returns a list of diagnostic criteria: rhat = R-hat convergence diagnostic (should be 1), neff = number of effective samples (should be above 100), divergences = number of divergent transitions (should be 0), max_treedepth = number of iterations that exceed maximum treedepth (should be 0). Hints on how to adjust parameters for fit_dynamics_model() can be found in the function documentation (?fit_dynamics_model).
With diagnostics_dynamics() we can extract all the diagnostic criteria of MCMC and visualize them. If data is a SummarizedExperiment the diagnostics are stored in metadata(data)[[“diagnostics_dynamics”]]. Additionally data frames for visual Posterior predictive checks (PPC) are prepared and plots for the PPCs and diagnostic criteria can be generated with plot_PPC() and plot_diagnostics().
# extract diagnostics
data <- diagnostics_dynamics(
data = data,
iter = 5000, # number of iterations used for model fitting
# the dynamic model
chains = 2 # number of chains used for model fitting
)
plot_diagnostics(
data = data
)
## $divergences
##
## $max_treedepth
##
## $Rhat
##
## $n_eff
In this example we have zero divergent transitions, zero iterations exceeding maximum tree depth, Rhat values are consistently below 1.01 and the number of effective samples are comfortably above 100. All of this indicates that the fitting of the model went well.
The result of a Bayesian model (the posterior) should predict the experimental well. This can be checked with a posterior predictive check (PPC) in which the experimental data (points) should lie within the predictions (violins).
# PPCs can be accessed with
plot_PPC(
data = data
)
## $posterior_A
## Warning: Removed 18614 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
The PPC plot indicates that the model is well-behaved and that the experimental data points are consistent with the predicted distribution, suggesting that the model is a good fit to the data.
After checking the diagnostic criteria and the PPC we can extract the estimates from the model fit usin the estimates_dynamics() function: If data is a SummarizedExperiment the estimates are stored in metadata(data)[[“estimates_dynamics”]]
# #extract estimates
data <- estimates_dynamics(
data = data, iter = 5000,
chains = 2, condition = "condition"
)
This function returns two types of estimates: the estimated abundance differences between two subsequent time points for each metabolite at each experimental condition, and the dynamics profiles of each metabolite at each experimental condition.
In the following plot the column label “A” indicates the experimental condition. The row labels “1”, “2” and “3” indicate the differences between subsequent time points. “1” is the difference between time point 1 and 2. “2” between time point 2 and 3 and “3” between time point 3 and 4. The metabolites are on the x-axis and the estimated abundance differences (“delta”) are on the y-axis.
# 1) the differences between two timepoints
plot_estimates(
data = data,
dynamics = FALSE
)
## $plot_timepoint_differences
The results of bayesian models can be interpreted intuitively: the 95% credible interval (CrI) (95% highest density interval) indicates the 95% probability range of the parameter (i.e. differences in abundances between time points). That means that with 95% probability the desired values lies within the range of the error bars. If the 95% CrI lies below zero, we can conclude that there is a decrease in abundance between the two subsequent time points (e.g. PEP between time point 3 and 4, red error bar). If the 95% CrI lies above zero, we can conclude that there is an increase in abundance between the two subsequent time points (e.g. 2-Amminomuconate between time point 2 and 3). If the 95% CrI contains zero, we cannot conclude that there is a difference in abundance between the two time points.
The dynamic profiles of each metabolite can be visualized using the estimated mean abundances at each time point. The plot below shows the estimated mean abundances at each time point for each metabolite.
# 2) dynamic profiles
plot_estimates(
data = data,
delta_t = FALSE
)
## $plot_dynamics
The dynamic profiles can be used to identify patterns and trends in the data, such as shifts in abundance over time, and to visualize the changes in abundance for each metabolite over the observed time interval.
To identify groups of metabolites that have similar dynamics we could now cluster these metabolite specific dynamics vectors (estimates[,c(“mu1.mean”:"mut.mean)]) to determine if groups of metabolites have similar dynamics.
The estimates of the dynamics model are returned by the function estimates_dynamics() and can be accessed with:
# get estimates
estimates <- metadata(data)[["estimates_dynamics"]]
# only show first rows of condition A
estimates[["A"]][1:10,]
## metabolite.ID metabolite KEGG time.ID condition mu_mean
## 1 1 2-Aminomuconate C02220 1 A 0.31341250
## 2 1 2-Aminomuconate C02220 2 A 0.59754732
## 3 1 2-Aminomuconate C02220 3 A -0.24272401
## 4 1 2-Aminomuconate C02220 4 A -0.88907412
## 5 2 CMP C00055 1 A -0.15235362
## 6 2 CMP C00055 2 A 0.89553053
## 7 2 CMP C00055 3 A -0.19469034
## 8 2 CMP C00055 4 A -0.41835773
## 9 3 Citramalic acid C00815 1 A -0.84405344
## 10 3 Citramalic acid C00815 2 A 0.05100112
## mu_lower mu_higher sigma_mean sigma_lower sigma_higher lambda_mean
## 1 -1.34685195 1.91367055 1.4761636 0.62656856 3.6136576 0.8861121
## 2 -1.31289927 2.43135013 1.7741887 0.78778146 4.1690584 0.8861121
## 3 -0.57999546 0.09228308 0.2271917 0.05028264 0.9615982 0.8861121
## 4 -2.08703255 0.44916554 1.0364884 0.36170743 2.9076477 0.8861121
## 5 -0.75190255 0.40807748 0.4022107 0.09966508 1.5302345 0.8674417
## 6 0.05296186 1.66614565 0.5642657 0.17028885 1.8130315 0.8674417
## 7 -1.35921798 0.97543732 0.8969115 0.29125573 2.6872295 0.8674417
## 8 -0.98040699 0.16134430 0.3726340 0.08950307 1.4422598 0.8674417
## 9 -2.51026713 0.98361952 1.5545624 0.64682486 3.7426105 1.0617318
## 10 -1.48124741 1.60316539 1.3276728 0.51153691 3.5227417 1.0617318
## lambda_lower lambda_higher delta_mu_mean delta_mu_lower delta_mu_higher
## 1 0.2502203 1.964081 -1.12401410 -2.7642533 0.5617670
## 2 0.2502203 1.964081 1.02220837 0.3487209 1.6532585
## 3 0.2502203 1.964081 -0.04191123 -1.9941627 1.8786053
## 4 0.2502203 1.964081 NA NA NA
## 5 0.2588986 1.898380 0.83575684 -1.0115368 2.4480904
## 6 0.2588986 1.898380 -0.57290991 -2.4505708 1.4366243
## 7 0.2588986 1.898380 -0.79521767 -2.2454509 0.7957687
## 8 0.2588986 1.898380 NA NA NA
## 9 0.2960523 2.306760 -1.66309333 -3.5070599 0.3704973
## 10 0.2960523 2.306760 0.36399439 -0.7064280 1.3280420
## mu_sample_1
## 1 -0.2330314
## 2 2.5962015
## 3 -0.2677583
## 4 0.3151239
## 5 1.4876223
## 6 0.6931978
## 7 -0.7298516
## 8 -0.4029193
## 9 -1.2679270
## 10 2.8152722
For the sake of demonstration we use from here on a clustering result (data(“cluster”)) on the full simulated data set (data(“longitudinalMetabolomics”)). In a real life example the optimal number of clusters (“k”) should be determined by optimal clustering criteria such as Gap statistics and average silhouette. Some clustering methods also allow to use the overall variation (estimates[,“lamda_mean”]). Clustering precision could be calculated by using samples from the model posterior (i.e. specifying “samples” in estimates_dynamics function) and comparing clustering solutions of the mean abundances with clustering solutions using the posterior samples (estimates[,“mu_sample_n”]). The code below shows a very fundamental example of hierarchical clustering.
# get distances between vectors
clust_A <- metadata(data)[["estimates_dynamics"]][["A"]]
clust_A <- clust_A %>%
select(metabolite.ID, condition, time.ID, mu_mean) %>%
pivot_wider(id_cols = c(metabolite.ID, condition), names_from = time.ID, values_from = mu_mean)
dist <- clust_A %>% select(-c(metabolite.ID, condition))
dd_A <- dist(dist,
method = "euclidean"
)
# hierarchical clustering
clust <- hclust(dd_A, method = "ward.D2")
clust_cut <- cutree(clust, k = 8)
# adding cluster ID to estimates
clust_A$cluster <- clust_cut
clust_A
## # A tibble: 10 × 7
## metabolite.ID condition `1` `2` `3` `4` cluster
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 A 0.313 0.598 -0.243 -0.889 1
## 2 2 A -0.152 0.896 -0.195 -0.418 2
## 3 3 A -0.844 0.0510 -0.811 0.0246 3
## 4 4 A 1.01 0.403 0.423 -0.710 4
## 5 5 A -0.113 0.609 0.0750 -0.660 2
## 6 6 A 0.212 -0.771 -0.649 0.156 5
## 7 7 A -0.0684 -0.121 -0.570 -0.0427 6
## 8 8 A 0.0704 0.496 0.170 0.0652 7
## 9 9 A -0.285 0.280 -0.292 -0.0440 6
## 10 10 A 0.825 -0.215 0.756 0.212 8
The clustering result for the whole simulated data set is stored in the metadata of the SummarizedExperiment object and can be accessed using the following code:
data <- metadata(longitudinalMetabolomics)[["cluster"]]
temp <- data
temp <- temp %>% pivot_longer(
cols = 4:7,
names_to = "timepoint", values_to = "mu_mean"
)
ggplot(temp, aes(
x = as.factor(as.numeric(as.factor(timepoint))),
y = mu_mean, group = metabolite
)) +
geom_line() +
xlab("timepoint") +
ylab("estimated difference from mean abundance") +
theme_bw() +
ylim(-2,2)+
theme(legend.position = "none") +
facet_grid(rows = vars(condition), cols = vars(cluster)) +
ggtitle("clustered dynamics", "panels=cluster ID")
We can identify different patterns (i.e. clusters) of metabolite dynamics and these patterns differ between different experimental conditions. Can we quantify the biological function of these dynamics clusters?
To quantify the possible biological function of these dynamics clusters we retrieved from the KEGG-database the following information: 1. to which functional modules our experimental metabolites are annotated and 2. which metabolites are annotated to functional modules in general.
The functional modules of the KEGG-database are organised in three hierarchies: upper, middle and lower. The middle hierarchy holds modules such as “Amino acid metabolism”, the lower for example “branched chain amino acid metabolism”. To facilitate analysis the data frames “metabolite_modules”, which holds the information about experimental metabolites, and “modules_compounds”, which holds the information about which metabolites are in general annotated to functional modules, were prepared with MetaboDynamics function “get_ORA_annotations” (type “?get_ORA_annotations” in R to get to the help page).
We load both data sets and can inspect the documentation.
data("metabolite_modules")
help("metabolite_modules")
head(metabolite_modules)
## # A tibble: 6 × 8
## ...1 metabolite KEGG module_id module_name upper_hierarchy middle_hierarchy
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 1-Aminocyc… C012… M00368 Ethylene b… Pathway modules Amino acid meta…
## 2 2 2-Aminomuc… C022… M00038 Tryptophan… Pathway modules Amino acid meta…
## 3 3 2-Phosphog… C006… M00001 Glycolysis… Pathway modules Carbohydrate me…
## 4 4 2-Phosphog… C006… M00002 Glycolysis… Pathway modules Carbohydrate me…
## 5 5 2-Phosphog… C006… M00003 Gluconeoge… Pathway modules Carbohydrate me…
## 6 6 2-Phosphog… C006… M00346 Formaldehy… Pathway modules Energy metaboli…
## # ℹ 1 more variable: lower_hierarchy <chr>
data("modules_compounds")
help("modules_compounds")
head(modules_compounds)
## module_id module_name KEGG
## 2 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00267
## 3 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00668
## 4 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00085
## 5 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00354
## 6 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00111
## 7 M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00118
## upper_hierarchy middle_hierarchy lower_hierarchy
## 2 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 3 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 4 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 5 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 6 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 7 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
It is important to note that not all KEGG modules are suitable for testing on every observed organism and experimental condition. For example, the modules “Xenobiotics biodegradation”, “Biosynthesis of other secondary metabolites”, and “Biosynthesis of terpenoids and polyketides” should not be analyzed in a human lung cancer cell line.
For the functional analysis we employ a hypergeometric model. We consider a functional module as over-represented in a cluster if the 95% inter-quantile range (ICR) of the log-transformed probabilities of OvEs (observed vs. expected) lies above zero. OvE refers to the ratio of observed metabolites in a cluster being mapped to a functional module over the number of expected metabolites in a cluster being mapped to a module under the assumption of a hypergeometric distribution (=drawing without replacement). If data is a SummarizedExperiment object where the clustering solution is stored in metadata(data)[[“cluster”]] the ORA results are stored in metadata(data)[[“ORA_tested_column”]] We apply the functional analysis to the middle and lower hierarchy of functional modules.
data <- ORA_hypergeometric(
background = modules_compounds,
annotations = metabolite_modules,
data = longitudinalMetabolomics, tested_column = "middle_hierarchy"
)
plot_ORA(data, tested_column = "middle_hierarchy")
Great, we can now see which functional module is over- (green points and error-bars) or under-represented (none in this example) in which dynamics cluster! For instance in cluster 8 of condition B module “Nucleotide metabolism” over-represented.
We can not only conduct over-representation analysis of KEGG-functional modules but also compare dynamics clusters across different experimental conditions. For this we employ a Bayesian model that estimates the mean difference as well as the standard deviation of differences between dynamics clusters. If data is a SummarizedExperiment object the results of compare_metabolites are stored in metadata(data)[[“comparison_dynamics”]].
dist = vector of pairwise euclidean distances between each dynamic vector of cluster a and every dynamic vector of cluster b, ID = cluster pair ID \[\begin{align*} dist_{ID}&\sim {\sf normal}(\mu_{ID},\sigma_{ID}) \\ \mu_{ID}&\sim {\sf normal^+}(0,2) \\ \sigma_{ID}&\sim {\sf exponential}(1) \end{align*}\]
data("longitudinalMetabolomics")
longitudinalMetabolomics <- compare_dynamics(
data = longitudinalMetabolomics,
dynamics = c("mu1_mean", "mu2_mean", "mu3_mean", "mu4_mean"),
cores = 1 # just set to 1 for vignette, set to 4 if possible
)
##
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000596 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 5.96 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 501 / 2000 [ 25%] (Sampling)
## Chain 1: Iteration: 700 / 2000 [ 35%] (Sampling)
## Chain 1: Iteration: 900 / 2000 [ 45%] (Sampling)
## Chain 1: Iteration: 1100 / 2000 [ 55%] (Sampling)
## Chain 1: Iteration: 1300 / 2000 [ 65%] (Sampling)
## Chain 1: Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 1: Iteration: 1700 / 2000 [ 85%] (Sampling)
## Chain 1: Iteration: 1900 / 2000 [ 95%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 3.906 seconds (Warm-up)
## Chain 1: 8.367 seconds (Sampling)
## Chain 1: 12.273 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0.000451 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 4.51 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 501 / 2000 [ 25%] (Sampling)
## Chain 2: Iteration: 700 / 2000 [ 35%] (Sampling)
## Chain 2: Iteration: 900 / 2000 [ 45%] (Sampling)
## Chain 2: Iteration: 1100 / 2000 [ 55%] (Sampling)
## Chain 2: Iteration: 1300 / 2000 [ 65%] (Sampling)
## Chain 2: Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 2: Iteration: 1700 / 2000 [ 85%] (Sampling)
## Chain 2: Iteration: 1900 / 2000 [ 95%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 3.838 seconds (Warm-up)
## Chain 2: 8.169 seconds (Sampling)
## Chain 2: 12.007 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 0.000371 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 3.71 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 501 / 2000 [ 25%] (Sampling)
## Chain 3: Iteration: 700 / 2000 [ 35%] (Sampling)
## Chain 3: Iteration: 900 / 2000 [ 45%] (Sampling)
## Chain 3: Iteration: 1100 / 2000 [ 55%] (Sampling)
## Chain 3: Iteration: 1300 / 2000 [ 65%] (Sampling)
## Chain 3: Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 3: Iteration: 1700 / 2000 [ 85%] (Sampling)
## Chain 3: Iteration: 1900 / 2000 [ 95%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 3.803 seconds (Warm-up)
## Chain 3: 8.183 seconds (Sampling)
## Chain 3: 11.986 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 0.00042 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 4.2 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 501 / 2000 [ 25%] (Sampling)
## Chain 4: Iteration: 700 / 2000 [ 35%] (Sampling)
## Chain 4: Iteration: 900 / 2000 [ 45%] (Sampling)
## Chain 4: Iteration: 1100 / 2000 [ 55%] (Sampling)
## Chain 4: Iteration: 1300 / 2000 [ 65%] (Sampling)
## Chain 4: Iteration: 1500 / 2000 [ 75%] (Sampling)
## Chain 4: Iteration: 1700 / 2000 [ 85%] (Sampling)
## Chain 4: Iteration: 1900 / 2000 [ 95%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 3.928 seconds (Warm-up)
## Chain 4: 8.264 seconds (Sampling)
## Chain 4: 12.192 seconds (Total)
## Chain 4:
We can visualize the posterior estimates of the model with:
heatmap_dynamics(data = longitudinalMetabolomics)
This results in a bubble heat map where each cluster of all conditions (i.e.
A_2 = cluster 2 of condition A) is compared with all clusters of all conditions.
The comparison of clusters is reciprocal and therefore each comparison pair is
only visualized once.
The bigger and brighter a point, the smaller is the mean distance between
dynamics clusters and the smaller is the standard deviation. That means big
bright points indicate high dynamic similarity which small spread. Here B_8 and
A_4 have high similarity in dynamics.
We can also compare metabolite composition of clusters. For this we employ the Jaccard Index which is a metric for similarity of two vectors. Values near 0 indicate low similarity, values near 1 high similarity. If data is a SummarizedExperiment object the results of compare_metabolites are stored in metadata(data)[[“comparison_metabolites”]].
longitudinalMetabolomics <- compare_metabolites(
data = longitudinalMetabolomics
)
heatmap_metabolites(data = longitudinalMetabolomics)
The brighter a tile, the higher is the similarity of two clusters in regard
to their metabolite composition.
We have two clusters that are similar in their metabolite composition:
C_6 and A_5. If we compare that to the dynamics profiles and ORA analysis
we see that similar functional modules are over-expressed as expected BUT
the dynamics differ between the two radiation doses.
Can we facilitate visualization for the combination of both measures?
dynamics <- metadata(longitudinalMetabolomics)[["comparison_dynamics"]][["estimates"]]
metabolites <- metadata(longitudinalMetabolomics)[["comparison_metabolites"]]
temp <- left_join(dynamics, metabolites, join_by("cluster_a", "cluster_b"))
x <- unique(c(temp[, "cluster_a"], temp[, "cluster_b"]))
temp <- temp %>% mutate(scale_Jaccard = scale(Jaccard))
ggplot(temp, aes(x = cluster_b, y = cluster_a)) +
geom_point(aes(size = Jaccard, col = mu_mean)) +
theme_bw() +
scale_color_viridis_c(option = "magma") +
scale_x_discrete(limits = x) +
xlab("") +
ylab("") +
scale_y_discrete(limits = x) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
labs(col = "dynamics distance", size = "metabolite similarity") +
ggtitle("comparison of clusters","label = condition + cluster ID")
With this form of visualization we can easily spot clusters in which the experimental condition (e.g. kind of treatment/ dose of treatment) has the biggest effect. These clusters are similar in regards to their composing metabolites (i.e. big points) but dissimilar (i.e. bright color) in regards to their dynamics. In this example this is the cluster pair B_7 and A_4. Their ORA profiles are also quite similar as expected from the similar metabolite compositions. Clusters of metabolites that are not much affected by the experimental condition can be spotted by big (i.e. high metabolite similaritey) black (i.e. low dissimilarity in dynamics) points, for example cluster pair C_6 and A_5
sessionInfo()
## R Under development (unstable) (2025-03-13 r87965)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] tidyr_1.3.1 dplyr_1.1.4
## [3] ggplot2_3.5.1 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.4 IRanges_2.41.3
## [9] S4Vectors_0.45.4 BiocGenerics_0.53.6
## [11] generics_0.1.3 MatrixGenerics_1.19.1
## [13] matrixStats_1.5.0 MetaboDynamics_0.99.23
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.47.0 gtable_0.3.6 xfun_0.51
## [4] bslib_0.9.0 QuickJSR_1.6.0 inline_0.3.21
## [7] lattice_0.22-6 vctrs_0.6.5 tools_4.6.0
## [10] curl_6.2.2 parallel_4.6.0 tibble_3.2.1
## [13] pkgconfig_2.0.3 Matrix_1.7-3 RcppParallel_5.1.10
## [16] lifecycle_1.0.4 GenomeInfoDbData_1.2.14 farver_2.1.2
## [19] compiler_4.6.0 stringr_1.5.1 Biostrings_2.75.4
## [22] tinytex_0.56 munsell_0.5.1 codetools_0.2-20
## [25] htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.10
## [28] pillar_1.10.1 crayon_1.5.3 jquerylib_0.1.4
## [31] DelayedArray_0.33.6 cachem_1.1.0 magick_2.8.6
## [34] StanHeaders_2.32.10 abind_1.4-8 rstan_2.32.7
## [37] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7
## [40] purrr_1.0.4 bookdown_0.42 labeling_0.4.3
## [43] fastmap_1.2.0 grid_4.6.0 colorspace_2.1-1
## [46] cli_3.6.4 SparseArray_1.7.7 magrittr_2.0.3
## [49] loo_2.8.0 S4Arrays_1.7.3 utf8_1.2.4
## [52] pkgbuild_1.4.7 withr_3.0.2 UCSC.utils_1.3.1
## [55] scales_1.3.0 rmarkdown_2.29 XVector_0.47.2
## [58] httr_1.4.7 gridExtra_2.3 png_0.1-8
## [61] evaluate_1.0.3 knitr_1.50 V8_6.0.3
## [64] viridisLite_0.4.2 rstantools_2.4.0 rlang_1.1.5
## [67] Rcpp_1.0.14 glue_1.8.0 BiocManager_1.30.25
## [70] jsonlite_2.0.0 R6_2.6.1