1 Background
2 Setup: load required packages
3 Load data and plot data overview
4 Model dynamics
- 4.1 Differences between two time points
- 4.2 Dynamic profiles
5 Cluster dynamics
6 Over-representation analysis of functional modules in dynamics clusters
7 Comparison of clusters of different experimental conditions

1 Background

Metabolomics data from longitudinal or time-course high-throughput experiments, such as untargeted liquid chromatography-mass spectrometry (LC-MS), allows for insights into changes of metabolite abundances over time. However, most existing tools are limited to comparing only two time points or experimental conditions, and they often rely on frequentist statistical methods. Additionally, most tools are only able to analyze groups of metabolites on a pathway level, potentially missing more general patterns of changes in the metabolism in sparse data.

Here, we introduce MetaboDynamics, a modular workflow of probabilistic models for the analysis of longitudinal metabolomics data. This package provides a flexible framework for analyzing longitudinal data, allowing users to model changes in metabolite abundances over time and identify patterns of interest.

Metabolomics data is often noisy and limited by few replicates due to high costs. To address these challenges, we employ a Bayesian hierarchical model that balances between under-fitting and over-fitting. This model provides a robust method for estimating mean abundances at every time point as well as abundance differences between subsequent time points, even with limited replicates.

The MetaboDynamics workflow requires abundance tables as input, which can include metabolites and their observed abundances, LC-MS peak intensities or areas. The workflow can handle any number of time points, and the dynamics model requires at least three replicates. However, functions to compare cluster and compare dynamics with each other do not require a certain number of replicates and can be applied to calculated mean abundances of e.g. two replicates.

In this vignette, I will demonstrate how to use the MetaboDynamics package to analyze longitudinal metabolomics data, including data preprocessing, model fitting, and interpretation of results. We will show how to apply the package to a simulated data set and how to visualize and interpret the results.

2 Setup: load required packages

library(MetaboDynamics)
library(SummarizedExperiment) # storing and manipulating simulated metabolomics data

## Loading required package: MatrixGenerics

## Loading required package: matrixStats

## 
## Attaching package: 'MatrixGenerics'

## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars

## Loading required package: GenomicRanges

## Loading required package: stats4

## Loading required package: BiocGenerics

## Loading required package: generics

## 
## Attaching package: 'generics'

## The following objects are masked from 'package:base':
## 
##     as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
##     setequal, union

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
##     unsplit, which.max, which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'Biobase'

## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians

## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians

library(ggplot2) # visualization
library(dplyr) # data handling

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:Biobase':
## 
##     combine

## The following objects are masked from 'package:GenomicRanges':
## 
##     intersect, setdiff, union

## The following object is masked from 'package:GenomeInfoDb':
## 
##     intersect

## The following objects are masked from 'package:IRanges':
## 
##     collapse, desc, intersect, setdiff, slice, union

## The following objects are masked from 'package:S4Vectors':
## 
##     first, intersect, rename, setdiff, setequal, union

## The following objects are masked from 'package:BiocGenerics':
## 
##     combine, intersect, setdiff, setequal, union

## The following object is masked from 'package:generics':
## 
##     explain

## The following object is masked from 'package:matrixStats':
## 
##     count

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr) # data handling

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:S4Vectors':
## 
##     expand

3 Load data and plot data overview

The MetaboDynamics package includes a simulated data set called “longitudinalMetabolomics”. This data set consists of 98 metabolites with three replicates at four time points (1-4) across three experimental conditions (A-C). The data set is represented as a SummarizedExperiment object, where the abundance tables of each experimental condition are stored in assays (raw abundances in “concentrations”, log-transformed transformations in “log_con” and scaled (mean=0, sd=1) log-transformed abundances" in “scaled_log”), and the metabolite names, KEGG IDs, experimental conditions, and clustering solutions per experimental condition are stored in colData and rowData. To load the data, execute the following code:

data("longitudinalMetabolomics", package = "MetaboDynamics")

The next plot shows the raw metabolite abundances.

# convert to dataframe
abundances <- as.data.frame(cbind(
  as.data.frame(t(assays(longitudinalMetabolomics)$concentrations)),
  as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
  cols = seq_len(4), names_to = "time",
  values_to = "abundance"
)
ggplot(abundances, aes(x = abundance)) +
  geom_freqpoly() +
  theme_bw() +
  facet_grid(cols = vars(time), rows = vars(condition)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggtitle("raw data", "raw measurements")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this plot I visualize the distribution of metabolite abundances. We can see many observations with low abundances and few observations for higher abundances. This raw data is not distributed normally. So let’s log-transform the values. In the integrated simulated data set this is already done in the column “log_m”.

ggplot(abundances, aes(x = abundance)) +
  geom_density() +
  scale_x_log10() +
  theme_bw() +
  facet_grid(cols = vars(time), rows = vars(condition)) +
  ggtitle("data", "log-transformed values")

We can see that the metabolite abundances vary over magnitudes and are normally distributed after log-transformation.

The next plot shows the log-transformed dynamics of single metabolites.

ggplot(abundances) +
  geom_line(aes(
    x = time, y = abundance, col = metabolite,
    group = interaction(metabolite, replicate)
  )) +
  scale_y_log10() +
  theme_bw() +
  xlab("timepoint") +
  scale_color_viridis_d() +
  theme(legend.position = "none") +
  facet_grid(rows = vars(condition)) +
  ggtitle("raw metabolite dynamics", "color=metabolite")

The dynamics profiles of single metabolites have intercepts (i.e. mean metabolite abundances) that differ by magnitudes from each other. To be able to compare metabolites with each other we define dynamics as deviations at the observed time points from the metabolite’s mean abundance. For this we standardize each metabolite at each experimental condition to a mean of zero and a standard deviation of one.

In the simulated data set the scaled measurements are in column “m_scaled”.

data("longitudinalMetabolomics")
# convert to dataframe
abundances <- as.data.frame(cbind(
  as.data.frame(t(assays(longitudinalMetabolomics)$scaled_log)),
  as.data.frame(colData(longitudinalMetabolomics))
))
abundances <- abundances %>% pivot_longer(
  cols = seq_len(4), names_to = "time",
  values_to = "abundance"
)
ggplot(abundances) +
  geom_line(aes(
    x = time,
    y = abundance, col = metabolite,
    group = interaction(metabolite, replicate)
  )) +
  theme_bw() +
  scale_color_viridis_d() +
  xlab("timepoint") +
  ylab("deviation from mean abundance") +
  theme(legend.position = "none") +
  facet_grid(rows = vars(condition)) +
  ggtitle("standardized dynamics", "color=metabolite")

4 Model dynamics

To analyze the dynamics data, we use a Bayesian hierarchical model. This model is used to account for the complexity and variability of the data and to provide a flexible and robust framework for analysis. The model is defined as follows with: con = metabolite abundances, m = metabolite, c = experimental condition and t = time point ID:

\[\begin{align*} \log(con_{m,c,t})&\sim {\sf normal}(\mu_{m,c,t},\sigma_{m,c,t}) \\ \mu_{m,c,t}&\sim {\sf normal}(0,2) \\ \sigma_{m,c,t}&\sim {\sf exponential}(\lambda_{m,c}) \\ \lambda_{m,c}&\sim {\sf exponential}(2) \end{align*}\]

The model assumes a normal distribution of log-transformed and scaled metabolite abundances and estimates a mean \(\mu\) per metabolite and time point. The spread \(\sigma\) of the replicate measurements is also estimated per metabolite and time point, but the spread of different time points inform each other through the hierarchical structure of the hyper-parameter \(\lambda\). This partial pooling allows for the balance between over- and under fitting.

The code below shows how to fit the Bayesian model with the function fit_dynamics_model(). This might take of the order of 10 minutes per experimental condition for 98 metabolites. Here we fit the dynamics model for a small subset of the simulated metabolites.

# we can hand a SummarizedExperiment object to the function
data("longitudinalMetabolomics")
# we only use a subsection of the simulated data (1 condition and subsample of
# the whole dataset) for demonstration purposes
samples <- c(
  "UMP", "Taurine", "Succinate", "Phosphocreatine", "PEP",
  "Malic acid", "L-Cystine", "CMP", "Citramalic acid", "2-Aminomuconate"
)
# only one condition
data <- longitudinalMetabolomics[, longitudinalMetabolomics$condition == "A" &
  longitudinalMetabolomics$metabolite %in% samples]

# fit model
data <- fit_dynamics_model(
  data = data, scaled_measurement = "m_scaled",
  assay = "scaled_log", time = "time",
  condition = "condition", max_treedepth = 10,
  adapt_delta = 0.95, # default 0.95
  iter = 5000,
  cores = 1,
  chains = 2 # only set to 2 for vignette, default = 4
)

## Is your data normalized and standardized?
##           We recommend normalization by log-transformation.
##           Scaling and centering (mean=0, sd=1) should be metabolite and condition specific.

## 
## SAMPLING FOR MODEL 'm_ANOVA_partial_pooling' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 5.5e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.55 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 5000 [  0%]  (Warmup)
## Chain 1: Iteration:  500 / 5000 [ 10%]  (Warmup)
## Chain 1: Iteration: 1000 / 5000 [ 20%]  (Warmup)
## Chain 1: Iteration: 1251 / 5000 [ 25%]  (Sampling)
## Chain 1: Iteration: 1750 / 5000 [ 35%]  (Sampling)
## Chain 1: Iteration: 2250 / 5000 [ 45%]  (Sampling)
## Chain 1: Iteration: 2750 / 5000 [ 55%]  (Sampling)
## Chain 1: Iteration: 3250 / 5000 [ 65%]  (Sampling)
## Chain 1: Iteration: 3750 / 5000 [ 75%]  (Sampling)
## Chain 1: Iteration: 4250 / 5000 [ 85%]  (Sampling)
## Chain 1: Iteration: 4750 / 5000 [ 95%]  (Sampling)
## Chain 1: Iteration: 5000 / 5000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 1.427 seconds (Warm-up)
## Chain 1:                3.185 seconds (Sampling)
## Chain 1:                4.612 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'm_ANOVA_partial_pooling' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 3.7e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.37 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 5000 [  0%]  (Warmup)
## Chain 2: Iteration:  500 / 5000 [ 10%]  (Warmup)
## Chain 2: Iteration: 1000 / 5000 [ 20%]  (Warmup)
## Chain 2: Iteration: 1251 / 5000 [ 25%]  (Sampling)
## Chain 2: Iteration: 1750 / 5000 [ 35%]  (Sampling)
## Chain 2: Iteration: 2250 / 5000 [ 45%]  (Sampling)
## Chain 2: Iteration: 2750 / 5000 [ 55%]  (Sampling)
## Chain 2: Iteration: 3250 / 5000 [ 65%]  (Sampling)
## Chain 2: Iteration: 3750 / 5000 [ 75%]  (Sampling)
## Chain 2: Iteration: 4250 / 5000 [ 85%]  (Sampling)
## Chain 2: Iteration: 4750 / 5000 [ 95%]  (Sampling)
## Chain 2: Iteration: 5000 / 5000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 1.196 seconds (Warm-up)
## Chain 2:                3.176 seconds (Sampling)
## Chain 2:                4.372 seconds (Total)
## Chain 2:

This returns a list of model fits that are named by the experimental condition (“A”,“B”,“C”). If data is a SummarizedExperiment object the fits are stored in metadata(data)[[“dynamic_fits”]].

MetaboDynamics fits Bayesian models with MCMC (Markov Chain Monte Carlo) which returns a list of diagnostic criteria: rhat = R-hat convergence diagnostic (should be 1), neff = number of effective samples (should be above 100), divergences = number of divergent transitions (should be 0), max_treedepth = number of iterations that exceed maximum treedepth (should be 0). Hints on how to adjust parameters for fit_dynamics_model() can be found in the function documentation (type “?fit_dynamics_model” in R console if MetaboDynamics is loaded).

With diagnostics_dynamics() we can extract all the diagnostic criteria of the model fit and visualize them. If data is a SummarizedExperiment object the diagnostics are stored in metadata(data)[[“diagnostics_dynamics”]]. The diagnostic criteria can be visualized with the function plot_diagnostics(). Additionally data frames for visual posterior predictive checks (PPC) are prepared and the PPC can be visualized with the function plot_PPC().

# extract diagnostics
data <- diagnostics_dynamics(
  data = data,
  iter = 5000, # number of iterations used for model fitting
  # the dynamic model
  chains = 2 # number of chains used for model fitting
)

plot_diagnostics(
  data = data
)

## $divergences

## 
## $max_treedepth

## 
## $Rhat

## 
## $n_eff

In this example we have zero divergent transitions, zero iterations exceeding maximum tree depth, Rhat values are consistently below 1.01 and the number of effective samples are comfortably above 100. All of this indicates that the fitting of the model went well.

The result of a Bayesian model (the posterior) should predict the experimental data well. This can be checked with a posterior predictive check (PPC) in which the experimental data (points) should lie within the predictions (violins).

# PPCs can be accessed with
plot_PPC(
  data = data
)

## $posterior_A

## Warning: Removed 4449 rows containing non-finite outside the scale range
## (`stat_ydensity()`).

The PPC plot indicates that the model is well-behaved and that the experimental data points are consistent with the predicted distribution, suggesting that the model is a good fit to the data.

After checking the diagnostic criteria and the PPC we can extract the estimates from the model fit using the estimates_dynamics() function: If data is a SummarizedExperiment the estimates are stored in metadata(data)[[“estimates_dynamics”]].

# #extract estimates
data <- estimates_dynamics(
  data = data, iter = 5000,
  chains = 2, condition = "condition"
)

This function returns two types of estimates: 1) the estimated metabolite abundance differences between two subsequent time points for each metabolite at each experimental condition, and 2) the dynamics profiles of each metabolite at each experimental condition.

4.1 Differences between two time points

In the following plot the column label “A” indicates the experimental condition. The row labels “1”, “2” and “3” indicate the differences between subsequent time points. “1” is the difference between time point 1 and 2. “2” between time point 2 and 3 and “3” between time point 3 and 4. The metabolites are on the x-axis and the estimated abundance differences (“delta”) are on the y-axis.

# 1) the differences between two timepoints
plot_estimates(
  data = data,
  dynamics = FALSE
)

## $plot_timepoint_differences

The results of bayesian models can be interpreted intuitively: the 95% credible interval (CrI) (95% highest density interval) indicates the 95% probability range of the parameter (i.e. in this case the differences in abundances between time points). That means that with 95% probability the desired values lies within the range of the error bars. If the 95% CrI lies below zero, we can conclude that there is a decrease in abundance between the two subsequent time points (e.g. PEP between time point 3 and 4, red error bar). If the 95% CrI lies above zero, we can conclude that there is an increase in abundance between the two subsequent time points (e.g. 2-Amminomuconate between time point 2 and 3). If the 95% CrI contains zero, we cannot conclude that there is a difference in abundance between the two time points.

4.2 Dynamic profiles

The dynamic profiles of each metabolite can be visualized using the estimated mean abundances at each time point. The plot below shows the estimated mean abundances at each time point for each metabolite.

# 2) dynamic profiles
plot_estimates(
  data = data,
  delta_t = FALSE
)

## $plot_dynamics

The dynamic profiles can be used to identify patterns and trends in the data, such as shifts in abundance over time, and to visualize the changes in abundance for each metabolite over the observed time interval.

The estimates of the dynamics model are returned by the function estimates_dynamics() and can be accessed with:

# get estimates
estimates <- metadata(data)[["estimates_dynamics"]]
# only show first rows of condition A
estimates[["A"]][1:10, ]

##    metabolite.ID      metabolite   KEGG time.ID condition     mu_mean
## 1              1 2-Aminomuconate C02220       1         A  0.30399411
## 2              1 2-Aminomuconate C02220       2         A  0.60884951
## 3              1 2-Aminomuconate C02220       3         A -0.24240100
## 4              1 2-Aminomuconate C02220       4         A -0.88583878
## 5              2             CMP C00055       1         A -0.14746174
## 6              2             CMP C00055       2         A  0.88403996
## 7              2             CMP C00055       3         A -0.19558385
## 8              2             CMP C00055       4         A -0.42873801
## 9              3 Citramalic acid C00815       1         A -0.83712158
## 10             3 Citramalic acid C00815       2         A  0.06005493
##       mu_lower  mu_higher sigma_mean sigma_lower sigma_higher lambda_mean
## 1  -1.44986302 1.91411654  1.4769016  0.62110744     3.651725   0.8920227
## 2  -1.39216848 2.40794381  1.7963640  0.80065746     4.204213   0.8920227
## 3  -0.56768663 0.13234438  0.2223971  0.05038337     0.912860   0.8920227
## 4  -2.04835133 0.48095208  1.0511402  0.36575032     2.987225   0.8920227
## 5  -0.81655710 0.45838689  0.4076907  0.09885455     1.538581   0.8678008
## 6  -0.01992399 1.60924716  0.5677785  0.16686632     1.902934   0.8678008
## 7  -1.28625043 0.93139595  0.8556814  0.28949870     2.379625   0.8678008
## 8  -0.96831734 0.06867133  0.3572321  0.08839485     1.332216   0.8678008
## 9  -2.41358961 0.93496392  1.5488615  0.65372878     3.750508   1.0770766
## 10 -1.48942987 1.60151559  1.3390859  0.52644937     3.500062   1.0770766
##    lambda_lower lambda_higher delta_mu_mean delta_mu_lower delta_mu_higher
## 1     0.2525716      1.968635   -1.10777503     -2.8013310       0.7160458
## 2     0.2525716      1.968635    1.03728945      0.3113718       1.6988324
## 3     0.2525716      1.968635   -0.04072108     -1.9579107       1.8772741
## 4     0.2525716      1.968635            NA             NA              NA
## 5     0.2554353      1.884690   -0.59764702     -2.4721752       1.4496420
## 6     0.2554353      1.884690   -0.78658256     -2.1982939       0.8214685
## 7     0.2554353      1.884690    0.85418834     -0.9033955       2.5115817
## 8     0.2554353      1.884690            NA             NA              NA
## 9     0.3152266      2.309860    1.27924764     -0.5304833       2.9266786
## 10    0.3152266      2.309860   -1.69296041     -3.4853668       0.2841211
##    mu_sample_1
## 1  -0.36716651
## 2   0.81577518
## 3   0.36333588
## 4  -0.91667445
## 5  -0.06381848
## 6   1.23975199
## 7  -0.20674070
## 8  -0.27927451
## 9  -0.13732644
## 10  0.15587956

To identify groups of metabolites that have similar dynamics we could now cluster these metabolite specific dynamics vectors (estimates[,“mu_mean”]) to determine if groups of metabolites have similar dynamics.

5 Cluster dynamics

MetaboDynamics has a build-in wrapper function for clustering with the “hybrid” method of the package dynamicTreeCut. An alternative would for example be to use a soft-clustering algorithm such as implemented in the Bioconductor package Mfuzz. Some clustering methods also allow to use the overall variation (estimates[,“lamda_mean”]). Clustering precision could be calculated by using samples from the model posterior (i.e. specifying “samples” in estimates_dynamics function) and comparing clustering solutions of the mean abundances with clustering solutions using the posterior samples (estimates[,“mu_sample_n”]).

The code below shows how to use the cluster wrapper function cluster_dynamics on a SummarizedExperiment object where the dynamics estimates are stored in metadata(data)[[“estimates_dynamics”]]. The function cluster_dynamics() also accepts lists of dataframes with calculated mean abundances of metabolites if less than three replicates (minimal requirement for fit_dynamics) are available.

data <- cluster_dynamics(data,
                         deepSplit = 0) # low clustering sensitivity

## Is your data normalized and standardized?
##           We recommend normalization by log-transformation.
##           Scaling and centering (mean=0, sd=1) should be metabolite and condition specific.

##  ..cutHeight not given, setting it to 2.61  ===>  99% of the (truncated) height range in dendro.
##  ..done.

For the visualization of the clustering results we can use the function plot_cluster, which offers three different visualizations of the clustering results: 1) a dendrogram with color coding of the different clusters.

plot <- plot_cluster(data)

2) A Principal Component Analysis (PCA) of the clustering result

plot[["A"]][["PCA_plot"]]

## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure

## Warning: Computation failed in `stat_ellipse()`.
## Caused by error in `chol.default()`:
## ! the leading minor of order 2 is not positive

and 3) the visualization of the single metabolite dynamics within the different conditions. For the sake of demonstration we use from here on a clustering result (metaddata(longitudinalMetabolomics[[(“cluster”))]] on the full simulated data set (data(“longitudinalMetabolomics”)).

plot <- plot_cluster(longitudinalMetabolomics)

plot[["lineplot"]]

We can identify different patterns (i.e. clusters) of metabolite dynamics and these patterns differ between different experimental conditions. Can we quantify the biological function of these dynamics clusters?

6 Over-representation analysis of functional modules in dynamics clusters

To quantify the possible biological function of these dynamics clusters we retrieved from the KEGG-database the following information: 1. to which functional modules our experimental metabolites are annotated and 2. which metabolites are annotated to functional modules in general.

The functional modules of the KEGG-database are organised in three hierarchies: upper, middle and lower. The middle hierarchy holds modules such as “Amino acid metabolism”, the lower for example “branched chain amino acid metabolism”. To facilitate analysis the data frames “metabolite_modules”, which holds the information about the metabolites of the simulated data set, and “modules_compounds”, which holds the information about which metabolites are in general annotated to functional modules, were prepared.

To construct the needed annotation dataframe for your experimental dataset simply filter the “modules_compounds” dataset for your experimental metabolites’ KEGG IDs.

We load both data sets and can inspect the documentation.

data("metabolite_modules")
help("metabolite_modules")
head(metabolite_modules)

## # A tibble: 6 × 8
##    ...1 metabolite  KEGG  module_id module_name upper_hierarchy middle_hierarchy
##   <dbl> <chr>       <chr> <chr>     <chr>       <chr>           <chr>           
## 1     1 1-Aminocyc… C012… M00368    Ethylene b… Pathway modules Amino acid meta…
## 2     2 2-Aminomuc… C022… M00038    Tryptophan… Pathway modules Amino acid meta…
## 3     3 2-Phosphog… C006… M00001    Glycolysis… Pathway modules Carbohydrate me…
## 4     4 2-Phosphog… C006… M00002    Glycolysis… Pathway modules Carbohydrate me…
## 5     5 2-Phosphog… C006… M00003    Gluconeoge… Pathway modules Carbohydrate me…
## 6     6 2-Phosphog… C006… M00346    Formaldehy… Pathway modules Energy metaboli…
## # ℹ 1 more variable: lower_hierarchy <chr>

data("modules_compounds")
help("modules_compounds")
head(modules_compounds)

##   module_id                                               module_name   KEGG
## 2    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00267
## 3    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00668
## 4    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00085
## 5    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00354
## 6    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00111
## 7    M00001 Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate C00118
##   upper_hierarchy        middle_hierarchy                 lower_hierarchy
## 2 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 3 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 4 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 5 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 6 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism
## 7 Pathway modules Carbohydrate metabolism Central carbohydrate metabolism

It is important to note that not all KEGG modules are suitable for testing on every observed organism and experimental condition. For example, the modules “Xenobiotics biodegradation”, “Biosynthesis of other secondary metabolites”, and “Biosynthesis of terpenoids and polyketides” should not be analyzed in a human lung cancer cell line.

For the functional analysis we employ a hypergeometric model. We consider a functional module as over-represented in a cluster if the 95% inter-quantile range (ICR) of the log-transformed probabilities of OvEs (observed vs. expected) lies above zero. OvE refers to the ratio of observed metabolites in a cluster being mapped to a functional module over the number of expected metabolites in a cluster being mapped to a module under the assumption of a hypergeometric distribution (=drawing without replacement). If data is a SummarizedExperiment object where the clustering solution is stored in metadata(data)[[“cluster”]] the ORA results are stored in metadata(data)[[“ORA_tested_column”]] We apply the functional analysis to the middle hierarchy of functional modules.

data <- ORA_hypergeometric(
  background = modules_compounds,
  annotations = metabolite_modules,
  data = longitudinalMetabolomics, tested_column = "middle_hierarchy"
)

plot_ORA(data, tested_column = "middle_hierarchy")

Great, we can now see which functional module is over- (green points and error-bars) or under-represented (none in this example) in which dynamics cluster! For instance in cluster 2 of condition A module “Nucleotide metabolism” over-represented.

7 Comparison of clusters of different experimental conditions

7.1 Dynamics

We can not only conduct over-representation analysis of KEGG-functional modules but also compare dynamics clusters across different experimental conditions. For this we employ a Bayesian model that estimates the mean difference as well as the standard deviation of differences between dynamics clusters. If data is a SummarizedExperiment object the results of compare_metabolites are stored in metadata(data)[[“comparison_dynamics”]].

dist = vector of pairwise euclidean distances between every dynamic vector of cluster a and every dynamic vector of cluster b, ID = cluster pair ID \[\begin{align*} dist_{ID}&\sim {\sf normal}(\mu_{ID},\sigma_{ID}) \\ \mu_{ID}&\sim {\sf normal^+}(0,2) \\ \sigma_{ID}&\sim {\sf exponential}(1) \end{align*}\]

data("longitudinalMetabolomics")
longitudinalMetabolomics <- compare_dynamics(
  data = longitudinalMetabolomics,
  cores = 1 # just set to 1 for vignette, set to 4 if possible
)

## 
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 1).
## Chain 1: 
## Chain 1: Gradient evaluation took 0.000633 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 6.33 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 1: Iteration:  501 / 2000 [ 25%]  (Sampling)
## Chain 1: Iteration:  700 / 2000 [ 35%]  (Sampling)
## Chain 1: Iteration:  900 / 2000 [ 45%]  (Sampling)
## Chain 1: Iteration: 1100 / 2000 [ 55%]  (Sampling)
## Chain 1: Iteration: 1300 / 2000 [ 65%]  (Sampling)
## Chain 1: Iteration: 1500 / 2000 [ 75%]  (Sampling)
## Chain 1: Iteration: 1700 / 2000 [ 85%]  (Sampling)
## Chain 1: Iteration: 1900 / 2000 [ 95%]  (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 5.316 seconds (Warm-up)
## Chain 1:                9.937 seconds (Sampling)
## Chain 1:                15.253 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 2).
## Chain 2: 
## Chain 2: Gradient evaluation took 0.000528 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 5.28 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 2: Iteration:  501 / 2000 [ 25%]  (Sampling)
## Chain 2: Iteration:  700 / 2000 [ 35%]  (Sampling)
## Chain 2: Iteration:  900 / 2000 [ 45%]  (Sampling)
## Chain 2: Iteration: 1100 / 2000 [ 55%]  (Sampling)
## Chain 2: Iteration: 1300 / 2000 [ 65%]  (Sampling)
## Chain 2: Iteration: 1500 / 2000 [ 75%]  (Sampling)
## Chain 2: Iteration: 1700 / 2000 [ 85%]  (Sampling)
## Chain 2: Iteration: 1900 / 2000 [ 95%]  (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 5.334 seconds (Warm-up)
## Chain 2:                9.9 seconds (Sampling)
## Chain 2:                15.234 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 0.000489 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 4.89 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 3: Iteration:  501 / 2000 [ 25%]  (Sampling)
## Chain 3: Iteration:  700 / 2000 [ 35%]  (Sampling)
## Chain 3: Iteration:  900 / 2000 [ 45%]  (Sampling)
## Chain 3: Iteration: 1100 / 2000 [ 55%]  (Sampling)
## Chain 3: Iteration: 1300 / 2000 [ 65%]  (Sampling)
## Chain 3: Iteration: 1500 / 2000 [ 75%]  (Sampling)
## Chain 3: Iteration: 1700 / 2000 [ 85%]  (Sampling)
## Chain 3: Iteration: 1900 / 2000 [ 95%]  (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 8.27 seconds (Warm-up)
## Chain 3:                9.868 seconds (Sampling)
## Chain 3:                18.138 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL 'm_cluster_distances_padded' NOW (CHAIN 4).
## Chain 4: 
## Chain 4: Gradient evaluation took 0.00057 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 5.7 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
## Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
## Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
## Chain 4: Iteration:  501 / 2000 [ 25%]  (Sampling)
## Chain 4: Iteration:  700 / 2000 [ 35%]  (Sampling)
## Chain 4: Iteration:  900 / 2000 [ 45%]  (Sampling)
## Chain 4: Iteration: 1100 / 2000 [ 55%]  (Sampling)
## Chain 4: Iteration: 1300 / 2000 [ 65%]  (Sampling)
## Chain 4: Iteration: 1500 / 2000 [ 75%]  (Sampling)
## Chain 4: Iteration: 1700 / 2000 [ 85%]  (Sampling)
## Chain 4: Iteration: 1900 / 2000 [ 95%]  (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 5.159 seconds (Warm-up)
## Chain 4:                9.892 seconds (Sampling)
## Chain 4:                15.051 seconds (Total)
## Chain 4:

We can visualize the posterior estimates of the model with:

heatmap_dynamics(data = longitudinalMetabolomics)

This results in a bubble heat map where each cluster of every condition (i.e. A_2 = cluster 2 of condition A) is compared with all clusters of every other condition. The comparison of clusters is reciprocal and therefore each comparison pair is only visualized once. The bigger and brighter a point, the smaller is the mean distance between dynamics clusters and the smaller is the standard deviation. That means that big bright points indicate high dynamic similarity which small spread. Here C_10 and A_10 have high similarity in dynamics.

7.2 Metabolites

We can also compare metabolite composition of clusters. For this we employ the Jaccard Index which is a metric for similarity of two vectors. Values near 0 indicate low similarity, values near 1 high similarity. If data is a SummarizedExperiment object the results of compare_metabolites are stored in metadata(data)[[“comparison_metabolites”]].

longitudinalMetabolomics <- compare_metabolites(
  data = longitudinalMetabolomics
)
heatmap_metabolites(data = longitudinalMetabolomics)

The brighter a tile, the higher is the similarity of two clusters in regard to their metabolite composition. The comparison of clusters of one condition here always have a Jaccard Index of zero, as every metabolite is only assigned to one cluster per condition, hence there is no similarity between clusters of one experimental condition. We have two clusters that are similar in their metabolite composition: C_4 and A_3. If we compare that to the dynamics profiles and ORA analysis we see that similar functional modules are over-expressed as expected BUT the dynamics differ between the two radiation doses.

Can we facilitate visualization for the combination of both measures?

7.3 Combine both

dynamics <- metadata(longitudinalMetabolomics)[["comparison_dynamics"]][["estimates"]]
metabolites <- metadata(longitudinalMetabolomics)[["comparison_metabolites"]]
temp <- left_join(dynamics, metabolites, join_by("cluster_a", "cluster_b"))
x <- unique(c(temp[, "cluster_a"], temp[, "cluster_b"]))
temp <- temp %>% mutate(scale_Jaccard = scale(Jaccard))

ggplot(temp, aes(x = cluster_b, y = cluster_a)) +
  geom_point(aes(size = Jaccard, col = mu_mean)) +
  theme_bw() +
  scale_color_viridis_c(option = "magma") +
  scale_x_discrete(limits = x) +
  xlab("") +
  ylab("") +
  scale_y_discrete(limits = x) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(col = "dynamics distance", size = "metabolite similarity") +
  ggtitle("comparison of clusters", "label = condition + cluster ID")

With this form of visualization we can easily spot clusters in which the experimental condition (e.g. kind of treatment/ dose of treatment) has the biggest effect. These clusters are similar in regards to their composing metabolites (i.e. big points) but dissimilar (i.e. bright color) in regards to their dynamics. Here this is for example the cluster pair B_10 and A_5. Their ORA profiles are also quite similar as expected from the similar metabolite compositions. Clusters of metabolites that are not much affected by the experimental condition can be spotted by big (i.e. high metabolite similarity) black (i.e. low dissimilarity in dynamics) points, for example cluster pair C_4 and A_3.

sessionInfo()

## R version 4.5.0 RC (2025-04-04 r88126)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] tidyr_1.3.1                 dplyr_1.1.4                
##  [3] ggplot2_3.5.2               SummarizedExperiment_1.38.1
##  [5] Biobase_2.68.0              GenomicRanges_1.60.0       
##  [7] GenomeInfoDb_1.44.0         IRanges_2.42.0             
##  [9] S4Vectors_0.46.0            BiocGenerics_0.54.0        
## [11] generics_0.1.4              MatrixGenerics_1.20.0      
## [13] matrixStats_1.5.0           MetaboDynamics_1.0.2       
## [15] BiocStyle_2.36.0           
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6            QuickJSR_1.7.0          xfun_0.52              
##  [4] bslib_0.9.0             inline_0.3.21           lattice_0.22-7         
##  [7] vctrs_0.6.5             tools_4.5.0             curl_6.2.3             
## [10] parallel_4.5.0          tibble_3.2.1            pkgconfig_2.0.3        
## [13] Matrix_1.7-3            RColorBrewer_1.1-3      RcppParallel_5.1.10    
## [16] lifecycle_1.0.4         GenomeInfoDbData_1.2.14 compiler_4.5.0         
## [19] farver_2.1.2            tinytex_0.57            codetools_0.2-20       
## [22] htmltools_0.5.8.1       sass_0.4.10             yaml_2.3.10            
## [25] pillar_1.10.2           crayon_1.5.3            jquerylib_0.1.4        
## [28] MASS_7.3-65             DelayedArray_0.34.1     cachem_1.1.0           
## [31] magick_2.8.6            StanHeaders_2.32.10     viridis_0.6.5          
## [34] abind_1.4-8             rstan_2.32.7            tidyselect_1.2.1       
## [37] digest_0.6.37           purrr_1.0.4             bookdown_0.43          
## [40] labeling_0.4.3          fastmap_1.2.0           grid_4.5.0             
## [43] colorspace_2.1-1        cli_3.6.5               SparseArray_1.8.0      
## [46] magrittr_2.0.3          loo_2.8.0               S4Arrays_1.8.0         
## [49] utf8_1.2.5              dichromat_2.0-0.1       pkgbuild_1.4.8         
## [52] dynamicTreeCut_1.63-1   withr_3.0.2             UCSC.utils_1.4.0       
## [55] scales_1.4.0            rmarkdown_2.29          XVector_0.48.0         
## [58] httr_1.4.7              gridExtra_2.3           evaluate_1.0.3         
## [61] knitr_1.50              V8_6.0.3                viridisLite_0.4.2      
## [64] rstantools_2.4.0        rlang_1.1.6             Rcpp_1.0.14            
## [67] dendextend_1.19.0       glue_1.8.0              BiocManager_1.30.25    
## [70] jsonlite_2.0.0          R6_2.6.1

MetaboDynamics: analyzing longitudinal metabolomics data with probabilistic models

28 May 2025

Package

Contents