The DNEA R package is currently available for download from Bioconductor. The source code is open-source and available for review on the Karnovsky Lab github. Any troubleshooting or issues with the software can be reported there. The package can be installed using the BiocManager R package.
This package is the most customizable implementation of DNEA to date, allowing the user to create the best analysis for their data. This vignette aims to walk you through a typical DNEA workflow, and describe the parameters that may be modified in common use cases.
We will use the metabolomics expression data derived from The Environmental Determinants of Type I Diabetes in the Young (TEDDY) clinical trial (Lee HS (2014)) to demonstrate DNEA. TEDDY is a prospective case-control study that seeks to better understand environmental causes of Type I diabetes. High-risk infants, as defined by the presence of HLA-DR and HLA-DQ genotypes, were visited every 6 months and blood samples were collected until development of type I diabetes or their 15 birthday, whichever occurred first. The study consisted of two conditions, Islet auto-antibody (IA) case vs. control and Type I Diabetes (T1D) case vs. control, but we will focus on the latter here. The data was originally accessed from Metaboolomics Workbench under Project ID: PR000950. The T1D arm contains 3525 samples from 440 subjects, 50 of which developed Type I Diabetes. The example data contains the samples from those 50 subjects on the date of T1D diagnosis as well as samples from the 272 paired control subjects closest to that day. This data is stored in the package, and instructions on how to access the data are below. Please visit the Metabolomics Workbench for more information about the data set.
The following code chunk represents a typical DNEA workflow for the example data.
#Load packages
library(DNEA)
library(BiocParallel)
#load the example data
data("TEDDY")
data("T1Dmeta")
aminos <- c("alanine", "arginine", "asparagine", "aspartic_acid",
"cysteine", "glutamine", "glutamic_acid", "glycine",
"histidine", "isoleucine", "leucine", "lysine",
"methionine", "phenylalanine", "proline", "serine",
"threonine", "tryptophan", "tyrosine", "valine")
keep <- rownames(TEDDY) %in% aminos
TEDDYreduced <- TEDDY[keep,]
#initiate BiocParallel
BP_plan <- MulticoreParam(workers = 2, RNGseed = 417, progressbar = FALSE)
#re-order the metadata to match the sample order of expression_data
T1Dmeta <- T1Dmeta[colnames(TEDDYreduced),]
#save the group column to be used as group_labels
group_labels <- T1Dmeta$group
#name each element for its corresponding sample
names(group_labels) <- rownames(T1Dmeta)
#convert to factor
group_labels <- factor(group_labels, levels = c("DM:control", "DM:case"))
#Run DNEA
TEDDYdat <- createDNEAobject(project_name = "TEDDYmetabolomics",
expression_data = TEDDYreduced,
group_labels = group_labels)
#> Data has been normalized for further analysis. New data can be found in the log_scaled_data assay!
#> Data diagnostics was performed on log_scaled_data assay. To check a different assay, please specify the assay parameter.
#> Warning in dataDiagnostics(mat = ds_input, condition_values = network_groups, :
#> Diagnostic tests complete. You can proceed with the analysis!
#> Diagnostic criteria are as follows:
#> DNEAinputSummary
#> Number of Samples - 322
#> Number of Features - 19
#> min_eigen condition_num
#> all_data 0.10852848 43.82871
#> DM:control 0.09464841 50.29212
#> DM:case 0.04497635 119.74070
TEDDYdat <- BICtune(object = TEDDYdat, informed = TRUE,
interval = 1e-3, BPPARAM = BP_plan)
#> Optimizing the lambda hyperparameter using Bayesian-Information Criterion outlined in Guo et al. (2011)
#> A Link to this reference can be found in the function documentation by running ?BICtune() in the console.
#> The log_scaled_data expression data will be used for analysis.
#> Estimating optimal c constant range for asymptotic lambda...
#> Fine-tuning Lambda...
#> The optimal Lambda hyper-parameter has been set to: 0.0371436897625381!
TEDDYdat <- stabilitySelection(object = TEDDYdat, subSample = TRUE,
nreps = 1000, BPPARAM = BP_plan,
BPOPTIONS=bpoptions(tasks = 8))
#> The lambda value stored in the DNEA object will be used for analysis (this can be accessed via the optimizedLambda() function
#> Using Lambda hyper-parameter: 0.0371436897625381!
#> stabilitySelection will be performed with 1000 replicates!
#> The log_scaled_data expression data will be used for analysis.
#> Additional sub-sampling will be performed on uneven groups
#> Calculating selection probabilities WITH subsampling for...DM:control...
#> Calculating selection probabilities WITH subsampling for...DM:case...
TEDDYdat <- getNetworks(object = TEDDYdat, aprox = TRUE, informed = TRUE,
interval = 1e-3, BPPARAM = BP_plan)
#> The log_scaled_data expression data will be used for analysis.
#> Lambda will be aproximated by sqrt(log(# features)/# samples)
#> NOTE: If your dataset contains 500 or more samples per experimental condition, you should set "aprox=FALSE" and tune lambda for each network!
#> selection_probabilites from stability selection will be used in glasso model!
#> Estimating model for DM:control...using 0.104043948914672 for lambda...
#> model estimated!
#> Estimating model for DM:case...using 0.242670104428479 for lambda...
#> model estimated!
#> DM:control network specific edges: 12
#> DM:case network specific edges: 6
#> -----------------------------------
#> Number of edges shared by both networks: 21
#> Total number of edges in dataset: 39
TEDDYdat <- clusterNet(object = TEDDYdat, tau = 0.5, verbose = FALSE)
#> Initiating consensus cluster with a maximum of 5iterations!
#> Constructing initial consensus matrix...
#>
#> ...starting iteration 1...
#> Warning in cluster_edge_betweenness(adjacency_graph, weights = graph_weights):
#> At vendor/cigraph/src/community/edge_betweenness.c:503 : Membership vector will
#> be selected based on the highest modularity score.
#>
#> ...starting iteration 2...
#> Warning in cluster_edge_betweenness(adjacency_graph, weights = graph_weights):
#> At vendor/cigraph/src/community/edge_betweenness.c:503 : Membership vector will
#> be selected based on the highest modularity score.
#>
#> Consensus was reached in: 2 iterations
TEDDYdat <- runNetGSA(TEDDYdat)
#> The log_input_data expression data will be used for analysis.There are 3 inputs required to initiate the DNEA workflow: a
character string corresponding to the experiment name indicated by
project_name, the expression data indicated by
expression_data, and the experimental groups indicated by
group_labels.
The expression_data should be an m x n numeric
matrix wherein the metabolites, m, (or lipids, proteins, etc.)
each have a row, and the samples, n, each have a column. DNEA
jointly estimates the biological network for each experimental condition
using Guassian Graphical Model’s (GGM), so it is important that the data
for each group is approximately normal. To that end, DNEA log transforms
and auto-scales the expression data for each group independent of the
other when the workflow is initiated. The input should be raw peak
intensities/concentrations. Differential expression analysis is an
important part of pathway enrichment analysis downstream, and it cannot
be performed after the experimental groups have been normalized
separately, since the mean expression for each metabolite within the
experimental groups is then centered around zero. NOTE: It
is acceptable to adjust the expression data for batch effect or
confounding variables (i.e. age, sex, bmi) prior to
analysis.
Let’s load the example data stored inside the DNEA package. It is a numeric expression matrix where the metabolites are in rows and the samples are in columns. The peak intensity data has been adjusted for age and sex to remove those confounding variables, but it has not been log-transformed or scaled.
| sample_1280961 | sample_7525064 | sample_7623738 | sample_6267572 | |
|---|---|---|---|---|
| x1_5_anhydroglucitol | 79151.2878 | 81382.8463 | 77595.9854 | 69488.3473 |
| x1_monoolein | 288.3502 | 170.6537 | 1025.3033 | 680.9223 |
| x1_monopalmitin | 641.4075 | 627.7670 | 335.0213 | 447.8168 |
| x1_monostearin | 878.2598 | 1062.1696 | 329.4171 | 663.4438 |
| x2_deoxytetronic_acid | 412.5438 | 397.9415 | 355.6989 | 544.5197 |
| x2_hydroxybutanoic_acid | 5175.5341 | 15057.2652 | 32365.8109 | 6666.4498 |
| x2_hydroxyglutaric_acid | 129.0752 | 141.5003 | 126.4818 | 163.0608 |
| x2_hydroxyvaleric_acid | 1987.1576 | 1116.6175 | 2212.4761 | 948.5992 |
| x2_ketoisocaproic_acid | 8045.5097 | 3713.3164 | 4877.6225 | 1864.5956 |
| x2_ketoisovaleric_acid | 2770.2950 | 2200.1752 | 1300.8156 | 2040.2058 |
The group_labels should be a vector of factor elements
that correspond to the experimental condition of each sample. Each
element should be named for its corresponding sample, and the order
should match the order of the samples in expression_data. We
can create the group_labels object for the TEDDY
data using the metadata, T1Dmeta, stored in DNEA. Each row is a
different sample, and each column is a different variable. We need the
experimental conditions for each sample in the “group” column and the
names for each sample which are the row names of the data frame. More
information about the available metadata can be found in the
T1Dmeta documentation.
| subject | Sex | sample | group | |
|---|---|---|---|---|
| sample_1280961 | 730478 | Female | sample_1280961 | DM:control |
| sample_7525064 | 878078 | Female | sample_7525064 | DM:control |
| sample_7623738 | 449664 | Female | sample_7623738 | DM:control |
| sample_6267572 | 731841 | Female | sample_6267572 | DM:control |
| sample_1144415 | 790136 | Female | sample_1144415 | DM:control |
| sample_8145773 | 827715 | Female | sample_8145773 | DM:control |
| sample_1234699 | 319767 | Female | sample_1234699 | DM:control |
| sample_6397821 | 936220 | Female | sample_6397821 | DM:control |
| sample_4282520 | 475528 | Female | sample_4282520 | DM:case |
| sample_9302834 | 812362 | Female | sample_9302834 | DM:control |
DNEA is designed to jointly estimate biological networks from only TWO experimental conditions. If you have one experimental condition, you should consider using another tool we developed, CorrelationCalculator, for your analysis.
Our experimental condition has two possible values: “DM:control” and “DM:case”. Now we can re-order the metadata to match the sample order of the expression data, and save the group labels as a new vector element. Finally, we can convert this character vector into factors. We will specify “DM:control” as the reference.
#re-order the metadata to match the sample order of expression_data
T1Dmeta <- T1Dmeta[colnames(TEDDY),]
#save the group column to be used as group_labels
group_labels <- T1Dmeta$group
#name each element for its corresponding sample
names(group_labels) <- rownames(T1Dmeta)
#convert to factor
group_labels <- factor(group_labels, levels = c("DM:control", "DM:case"))DNEA is an object-oriented workflow built around a custom s4 object,
DNEA. We provide a helper function that encompasses several
necessary steps to begin DNEA. Additionally, the
sumExp2DNEA and massDataset2DNEA helper
functions are provided to enable creation of a DNEA from a
SummarizedExperiment and mass_dataset object,
respectively. First, the data is restructured for input into a
DNEA object, and differential expression (DE) analysis is
performed on the log-transformed input data. The data is then split by
experimental condition and auto-scaled. An eigen decomposition is
performed to check the minimum eigenvalue and the condition number of
the correlation matrix for the whole data set, as well as each
experimental group individually.
#initiate DNEA object
TEDDYdat <- createDNEAobject(project_name = "TEDDYmetabolomics",
expression_data = TEDDY,
group_labels = group_labels)
#> Data has been normalized for further analysis. New data can be found in the log_scaled_data assay!
#> Data diagnostics was performed on log_scaled_data assay. To check a different assay, please specify the assay parameter.
#> Warning in dataDiagnostics(mat = ds_input, condition_values = network_groups, :
#> One or more condition(s) look unstable. We recommend you aggregate features
#> features before continuing!
#> Diagnostic criteria are as follows:
#> DNEAinputSummary
#> Number of Samples - 322
#> Number of Features - 134
#> min_eigen condition_num
#> all_data 4.503367e-02 2.384726e+02
#> DM:control 3.234137e-02 3.362727e+02
#> DM:case -3.948702e-15 1.450916e+18Eigenvalues close to or below zero, or similarly extremely large condition numbers, represent instability in the data set and triggers a warning that recommends highly correlated metabolites be aggregated into single features prior to analysis. As you can see above, the TEDDY data set has a negative eigenvalue for the “DM:case” experimental group, triggering a warning.
NOTE: To initiate a DNEA-class object from a
summarizedExperiment-class or
mass_dataset-class object, the sumExp2DNEA()
and massDataset2DNEA() functions can be used,
respectively.
WHAT CAUSED THIS WARNING?
There are a number of reasons that instability may occur within a
data set; Highly correlated features is one of the most common causes in
expression data. We can view this by plotting a heat map of the pearson
correlation matrix for the “DM:case” data. The log-scaled data is
accessed using the expressionData() function.
#load the pheatmap and Hmisc packages
library(pheatmap)
library(Hmisc)
#grab the DM:case data from the DNEA object
expr_dat <- expressionData(TEDDYdat, assay = "log_scaled_data")[["DM:case"]]
#create a pearson correlation matrix - data should be transposed first so features are in columns
cor_dat <- Hmisc::rcorr(t(expr_dat), type = "pearson")$r
#cluster the correlations and reorder correlation matrix to better visualize
dd <- as.dist((1-cor_dat)/2)
hc <- hclust(dd)
cor_dat <-cor_dat[hc$order, hc$order]
#create pheatmap
pheatmap(cor_dat, cluster_rows = FALSE,cluster_cols = FALSE,
legend = TRUE,annotation_legend = FALSE,
labels_row = '',labels_col = '',
main = 'Feature correlations in DM:case group'
)There are pockets of red and blue indicating highly correlated metabolites in this data set that may be causing the instability. More information can be found in the following section, Feature Aggregation.
Finally, we can view information about our analysis using the built
in show function. We can access the differential expression results in
the node list using the nodeList() function.
#show summary of DNEA object
TEDDYdat
#> DNEA
#> Project Name - TEDDYmetabolomics
#> Samples - There are 322 samples.
#> Features - There are 134 Features.
#> Conditions - control: DM:control case: DM:case
#> Optimized Lambda - Parameter tuning not performed
#> Metabolic modules: Consensus Clustering not performed
#> netGSA results: Enrichment analysis not performed| clean_feature_names | comparison | foldchange | fcdirection | t_statistic | t_statistic_direction | pvalue | qvalue | DEstatus | |
|---|---|---|---|---|---|---|---|---|---|
| x1_5_anhydroglucitol | x1_5_anhydroglucitol | DM:case / DM:control | -0.8059415 | Down | -7.7674069 | Down | 0.0000000 | 0.0000000 | TRUE |
| x1_monoolein | x1_monoolein | DM:case / DM:control | -0.0041874 | Down | -0.0220156 | Down | 0.9824992 | 0.9939137 | FALSE |
| x1_monopalmitin | x1_monopalmitin | DM:case / DM:control | 0.0102540 | Up | 0.1466153 | Up | 0.8838534 | 0.9834991 | FALSE |
| x1_monostearin | x1_monostearin | DM:case / DM:control | -0.0827575 | Down | -0.8432378 | Down | 0.4019505 | 0.7081766 | FALSE |
| x2_deoxytetronic_acid | x2_deoxytetronic_acid | DM:case / DM:control | 0.1196823 | Up | 1.2895674 | Up | 0.2023103 | 0.5513077 | FALSE |
We have implemented an updated node aggregating algorithm similar to
the one described in Iyer GR (2020). The
aggregateFeatures() function makes available the
correlation-based, knowledge-based,
and hybrid collapsing methods from Filigree. The main
difference being that the user now specifies a pearson correlation
threshold by which to collapse features with an absolute correlation
above said value. The function then re-runs
createDNEAobject() to create a collapsed_DNEA
object with the new data.
WHY SHOULD WE AGGREGATE?
The regularization steps employed in DNEA will correct instability without additional user-intervention, however, it can be desirable to preemptively address this by aggregating highly-correlated metabolites - particularly if the data set contains many features of the same class of compounds (i.e. fatty acids, carnitines, etc.). This gives the user more control over network construction and simplifies the resulting networks.
GGM’s require considerable computational power. Adding additional samples to your data set will have marginal effects on the memory and processing time required for analysis, however, as the number of features grows in your data set the run time will increase dramatically. The ability to parallelize tasks on a high-performance machine (cluster or cloud resource) to perform stability selection helps tremendously with this issue. Even so, a user may still find themselves constrained by the resources available to them. Moreover, the data set may contain many compounds of the same class that are highly correlated, and/or network resolution at the individual molecule level is not needed (i.e. fatty acids that vary by only a few bonds in a lipidomics study). Aggregating highly correlated features retains signal from all of the metabolites, as opposed to regularization arbitrarily choosing one as representation and removing some or all of the correlates. This decreases the complexity of the analysis without losing critical network information. Less compute resources are required for the analysis and the resulting networks are more interpretable.
To perform feature aggregation, we need to provide
aggregateFeatures() the DNEA object and a
character string corresponding to the desired aggregation method. If the
“correlation” or “hybrid” method is chosen, we specify the
correlation_threshold parameter wherein metabolites with
correlations stronger than this threshold are aggregated into a single
node. If the “knowledge” or “hybrid” method is chosen, we need to
provide a data frame for feature_groups to specify the
class of each metabolite in our data set. Metabolite aggregation is then
performed within each class. More information about the three methods as
well as best use cases can be found in the function documentation and in
Iyer GR (2020).
For demonstration purposes, we are going to use the “hybrid” approach
for feature aggregation. The feature_groups input should be
a data frame with the original metabolite names as the first column and
as the row names. The class labels should be in the second column.
Independent metabolites can retain their name as its class label. Let’s
create the feature_groups data frame, and group the
branched chain amino acids into one class and all other acids into
another.
#save metabolite names
metab_names <- rownames(expressionData(TEDDYdat, assay = "input_data"))
#create feature_group data.frame
TEDDY_groups <- data.frame(features = metab_names,
groups = metab_names,
row.names = metab_names)
#create labels
TEDDY_groups$groups[TEDDY_groups$groups %in% c("isoleucine",
"leucine",
"valine")] <- "BCAAs"
TEDDY_groups$groups[grepl("acid",
TEDDY_groups$groups)] <- "fatty_acids"| features | groups | |
|---|---|---|
| x1_5_anhydroglucitol | x1_5_anhydroglucitol | x1_5_anhydroglucitol |
| x1_monoolein | x1_monoolein | x1_monoolein |
| x1_monopalmitin | x1_monopalmitin | x1_monopalmitin |
| x1_monostearin | x1_monostearin | x1_monostearin |
| x2_deoxytetronic_acid | x2_deoxytetronic_acid | fatty_acids |
| x2_hydroxybutanoic_acid | x2_hydroxybutanoic_acid | fatty_acids |
| x2_hydroxyglutaric_acid | x2_hydroxyglutaric_acid | fatty_acids |
| x2_hydroxyvaleric_acid | x2_hydroxyvaleric_acid | fatty_acids |
| x2_ketoisocaproic_acid | x2_ketoisocaproic_acid | fatty_acids |
| x2_ketoisovaleric_acid | x2_ketoisovaleric_acid | fatty_acids |
For the “hybrid” method we also have to provide a
correlation_threshold value. The algorithm will only
aggregate metabolites within a specified group that have a stronger
correlation than the correlation_threshold value
provided.
#perform feature collapsing
collapsed_TEDDY <- aggregateFeatures(object = TEDDYdat,
method = "hybrid",
correlation_threshold = 0.7,
feature_groups = TEDDY_groups)
#> The raw peak intensity data was used for aggregation
#> The aggregated log-scaled data is in the @assay slot
#> (The orginal DNEA object can be found in the @original_experiment slot)
#> Data has been normalized for further analysis. New data can be found in the log_scaled_data assay!
#> Data diagnostics was performed on log_scaled_data assay. To check a different assay, please specify the assay parameter.
#> Warning in dataDiagnostics(mat = ds_input, condition_values = network_groups, :
#> One or more condition(s) look unstable. We recommend you aggregate features
#> features before continuing!
#> Diagnostic criteria are as follows:
#> DNEAinputSummary
#> Number of Samples - 322
#> Number of Features - 130
#> min_eigen condition_num
#> all_data 5.707415e-02 1.312624e+02
#> DM:control 4.362061e-02 1.759915e+02
#> DM:case -3.179855e-15 1.611756e+18
collapsed_TEDDY
#> collapsed_DNEA
#> Project Name - TEDDYmetabolomics
#> Samples - There are 0 samples.
#> Features - There are 130 Features.
#> Conditions - control: DM:control case: DM:case
#> Optimized Lambda - Parameter tuning not performed
#> Metabolic modules: Consensus Clustering not performed
#> netGSA results: Enrichment analysis not performedWe have reduced our total number of features from 134 to 130. To use the aggregated data, we would continue our analysis with the collapsed_TEDDY object. However, this data set is a curated list of diverse metabolites, so high correlations are not likely a result of chemical similarity. We do not want to erroneously combine disparate features, so it is better to continue without aggregating. Therefore, we will use the TEDDYdat object for the rest of the analysis.
The expression data is log transformed and auto-scaled by DNEA upon
initiation of the workflow with createDNEAobject().
However, varying normalization methods are used for metabolomics and
lipidomics data (e.g. auto-scaling, median-scaling,
quantile-normalization, RUV2) that may affect correlation-based studies
differently. If you have a preferred normalization method, we provide an
additional helper function, addExpressionData(), to input
custom-normalized data into the object for analysis. The user must
provide a named list that includes:
1. [experimental group 1]: a matrix that corresponds to the data of the first experimental group that was scaled independent of the other experimental group
2. [experimental group 2]: a matrix similar to element two for the second experimental group.
REMEMBER: The intensities/concentrations of each metabolite must be approximately normally distributed. As an example, let’s median-scale the TEDDY data and insert it into the TEDDYdat object.
#log-transform and transpose the TEDDY data
TEDDY <- t(log(TEDDY))
#make sure metadata and expression data are in same order
T1Dmeta <- T1Dmeta[rownames(TEDDY),]
dat <- list('DM:control' = TEDDY[T1Dmeta$group == "DM:control",],
'DM:case' = TEDDY[T1Dmeta$group == "DM:case",])
#log-transform and median center the expression data without scaling
newdat <- list()
for(cond in seq(length(dat))){
group_dat <- dat[[cond]]
for(i in seq(1, ncol(group_dat))){
metab_median = median(group_dat[, i], na.rm = TRUE)
metab_range = range(group_dat[, i], na.rm = TRUE)
scale_factor = max(abs(metab_range - metab_median))
group_dat[, i] <- (group_dat[, i] - metab_median) / scale_factor
rm(metab_median, metab_range, scale_factor)
}
group_dat <- group_dat[rownames(dat[[cond]]),colnames(dat[[cond]])]
group_dat <- t(group_dat)
newdat <- append(newdat, list(group_dat))
rm(i, group_dat)
}
#add names
names(newdat) <- names(dat)
#add data
TEDDYdatCustomInput <- addExpressionData(object = TEDDYdat,
dat = newdat,
assay_name="median_scaled_data")This function inserts the provided data into the
scaled_expression_data list element in the assays slot and
moves the log-transformed and auto-scaled expression data created by
DNEA to the DNEA_scaled_data list element. We are going to
proceed with the TEDDYdat object and the log-scaled data
inherent to DNEA in the next step.
Model tuning utilizes two regularization methods: \(\lambda\) tuning via Bayesian-information
criterion (BIC), and stability selection. This step allows us to analyze
data sets with less samples than the number of metabolites and is
critical for DNEA. When the number of samples approaches or
exceeds the number of features, regularization in the model is
not strictly necessary and you may proceed without this step. This
allows users with very large data sets to create networks from a
simplified GGM model. If tuning is not performed and the
optimal_lambda parameter is not specified, the \(\lambda\) value defaults to
and all of the selection probabilities are set to 1. However, we highly recommend performing this step irrespective. Regularization adds sparsity to the resulting networks and improves model specificity by removing weak or potentially false positive edges.
The BICtune() function optimizes the \(\lambda\) parameter by calculating a BIC
score and likelihood value for every tested \(\lambda\), as described in Guo J (2011). We minimized the necessary
computation in this step by optimizing the c constant that describes the
asymptotically valid \(\lambda\) for
smaller data sets, following the equation:
The user can also opt to provide a range of \(\lambda\) values to test using the
lambda_values parameter or optimize lambda directly by
setting informed=FALSE. In our testing, we have been able
to reduce the values tested ten-fold by using the informed approach.
Each \(\lambda\) value tested can be
run embarrassingly parallel using the BiocParallel package. Both
BICtune() and stabilitySelection() have
support for parallel computation. We only need to create the workers
once and they can be called for each parallel process. Since
stabilitySelection() uses random number generation to
randomly sample the expression data in each replicate, we will have to
set the seed now so that the results are reproducible. We will also turn
on the progress bar for real time updates.
#load in BiocParallel
library(BiocParallel)
#create parallel sockets
BPPARAM <- BiocParallel::SnowParam(workers = 2, RNGseed = 417, progressbar = TRUE)Now that we have our workers set up, we can optimize \(\lambda\). If you are performing this step
on a high-performance computer, we can silence the progress bar using
BPOPTIONS.
#optimize lambda
TEDDYdat <- BICtune(TEDDYdat,
informed = TRUE,
interval = 1e-3,
BPPARAM = BPPARAM,
BPOPTIONS = bpoptions(progressbar = FALSE))#> Optimizing the lambda hyperparameter using Bayesian-Information Criterion outlined in Guo et al. (2011)
#> A Link to this reference can be found in the function documentation by running ?BICtune() in the console
#> The log_scaled_data expression data will be used for analysis.
#> Estimating optimal c constant range for asymptotic lambda...
#> Fine-tuning Lambda...
#> The optimal Lambda hyper-parameter has been set to: 0.05072355!
Our tuned lambda value is 0.05072355. This was selected by minimizing the BIC score. To illustrate BIC optimization, we can plot the BIC scores as a function of \(\lambda\)
#load ggplot2
library(ggplot2)
#create data frame of values
BICtuneData <- rbind(data.frame(lambda = unlist(lambdas2Test(TEDDYdat)),
value = vapply(BICscores(TEDDYdat), function(x) x$BIC, double(1)),
score = rep("BIC", length(lambdas2Test(TEDDYdat)))),
data.frame(lambda = unlist(lambdas2Test(TEDDYdat)),
value = vapply(BICscores(TEDDYdat), function(x) x$likelihood, double(1)),
score = rep("likelihood", length(lambdas2Test(TEDDYdat)))))
#create plot
ggplot(data = BICtuneData, aes(x = lambda, y = value)) +
geom_line(aes(linetype = score)) +
geom_point(aes(shape = score)) +
geom_vline(xintercept = optimizedLambda(TEDDYdat), color = "red") +
ggtitle("BIC and Likelihood Scores with Respect to Lambda")The stabilitySelection() function calculates selection
probabilities, or the estimated probability that an edge is identified
in a randomly sampled subset of the input data, using the method
outlined in Ma J (2019). The selection
probability for each metabolite-metabolite interaction is then used to
modify the lambda parameter following the equation:
To perform stability selection, we need to pass TEDDYdat to
the stabilitySelection() function. We also need to specify
the number of replicates to perform with the nreps
parameter, provide the BiocParallel object we created earlier to
BPPARAM, and specify whether or not additional sub sampling
of the data should be performed with subSample. We
recommend setting nreps = 500 when not using the
sub-sampling protocol, which is the default value. Stability selection
without additional sub-sampling randomly samples 50% of each
group (without replacement), creating two evenly sampled data sets and
fitting a GGM to both. This means that at the default
nreps = 500, 1000 replicates are actually performed.
When the sample groups are very unbalanced, the selection
probabilities from stability selection strongly favor the larger group
resulting in unstable edges. We combat this by employing a sub-sampling
protocol during stability selection that was first introduced in Iyer GR (2020) by setting
subSample = TRUE. This method ensures that each group is
equally represented in stability selection. Since nearly all of the data
for the smaller group is used with additional sub-sampling,
only one model is fit per replicate. When utilizing the sub-sampling
protocol, nreps should be set to 1000.
If you elect to optimize the \(\lambda\) parameter using a different
method than BICtune(), you can specify the value to use
with the optimal_lambda parameter. Only 50 of our 322
samples are “DM:case”, therefore, it is most apropriate to use the
subsampling protocol by setting subSample = TRUE. In this
scenario, we recommend running 1000 replicates by setting
nreps=1000. This is the most computationally expensive step
in the algorithm, and the workflow up to this point takes ~15 minutes
using 4 cores on a 2.5 GHz Quad-Core Intel Core i7 processor.
#perform stability selection
TEDDYdat <- stabilitySelection(TEDDYdat,
subSample = TRUE,
nreps = 1000,
BPPARAM = BPPARAM,
BPOPTIONS = bpoptions(progressbar=FALSE))#> The lambda value stored in the DNEA will be used for analysis (this can be accessed via the optimizedLambda() function
#> Using Lambda hyper-parameter: 0.0507235493942187!
#> stabilitySelection will be performed with 1000 replicates!
#> The log_scaled_data expression data will be used for analysis.
#> Additional sub-sampling will be performed on uneven groups
#> Calculating selection probabilities WITH subsampling for...DM:case...
#> Calculating selection probabilities WITH subsampling for...DM:control...
Now that we have optimized the input parameters, we can jointly estimate the biological networks. We will construct a glasso model to calculate the partial correlation value for each metabolite-metabolite interaction. We can then perform consensus clustering to identify metabolic modules, or sub networks, within the larger experimental group networks.
We provide a function, called getNetworks(), to perform
joint-estimation of the biological networks. The necessary inputs are
already stored in the TEDDYdat object. Since we have far less
than 500 samples per experimental group, we’re going to set
aprox=TRUE and approximate the optimal lambda for each
experimental group via the equation
This increases the specificity of the analysis for data sets with
relatively few samples and decreases false positives by increasing
regularization. If your data set contains ~500 or more samples per
experimental group, we suggest setting aprox=FALSE which
will optimize lambda for each experimental group using the
BICtune() function. The aforementioned parameters for
BICtune must also be specified to
getNetworks() in this scenario (ie.
informed = TRUE, interval = 1e-3, BPPARAM = BPPARAM).
#jointly estimate the biological networks
TEDDYdat <- getNetworks(TEDDYdat, aprox = TRUE)
#> The log_scaled_data expression data will be used for analysis.
#> Lambda will be aproximated by sqrt(log(# features)/# samples)
#> NOTE: If your dataset contains 500 or more samples per experimental condition, you should set "aprox=FALSE" and tune lambda for each network!
#> selection_probabilites from stability selection will be used in glasso model!
#> Estimating model for DM:control...using 0.13418928411169 for lambda...
#> model estimated!
#> Estimating model for DM:case...using 0.312980504183596 for lambda...
#> model estimated!
#> DM:control network specific edges: 70
#> DM:case network specific edges: 112
#> -----------------------------------
#> Number of edges shared by both networks: 108
#> Total number of edges in dataset: 290The TEDDY data set contains 134 metabolites, so a completely dense
network has 134 * 134 = 17956 possible edges. For obvious reasons, we do
not expect every feature to be connected with every other feature and
the regularization steps we performed in [Step 2: Model Optimization]
have removed the non-connections. There are 290 total edges in the
networks: 70 edges specific to the “DM:control” network, 112 edges
specific to the “DM:case” network, and 108 edges identified in both. We
can access the edge list using the edgeList() function.
#save edge list to new object
edge_list <- edgeList(TEDDYdat)
#access the edge list
edge_list[1:10,]| Metabolite.A | Metabolite.B | pcor.0 | pcor.1 | edge |
|---|---|---|---|---|
| x1_5_anhydroglucitol | deoxypentitol | 0.0000000 | -0.0932950 | DM:case |
| x1_5_anhydroglucitol | fucose | 0.1532387 | 0.0000000 | DM:control |
| x1_5_anhydroglucitol | hippuric_acid | 0.2415766 | 0.1806729 | Both |
| x1_5_anhydroglucitol | lactic_acid | 0.0000000 | -0.0584437 | DM:case |
| x1_5_anhydroglucitol | pseudo_uridine | 0.2893000 | 0.1605088 | Both |
| x1_monoolein | sucrose | 0.0000000 | 0.0151666 | DM:case |
| x1_monopalmitin | x1_monostearin | 0.3188505 | 0.3034550 | Both |
| x1_monopalmitin | pseudo_uridine | 0.0000000 | 0.0928751 | DM:case |
| x1_monopalmitin | sucrose | 0.0000000 | 0.0638883 | DM:case |
| x1_monostearin | methionine | 0.0000000 | -0.0938810 | DM:case |
NOTE: Setting aprox=FALSE will run the
BICtune() algorithm on the normalized data for each
experimental condition, respectively, to optimize the \(/lambda\) parameter. For large datasets,
this will increase the sensitivity of edge detection, but also result in
much denser networks. Weak edges can be filtered out using the
filterNetworks() function since they are not as interesting
from a biological perspective and they disrupt the performance of
consensus clustering, resulting in a small number of large networks.
They may also clutter the network, making it difficult to derive
meaningful biological insight. Filtering edges is accomplished by either
providing a partial correlation value using the pcor
parameter, or a percentage using the top_percent_edges
parameter. If pcor is provided, all edges with an absolute
partial correlation less than the specified value are removed. If
top_percent_edges is provided, only the strongest
X% of edges in the networks are retained.
#filter networks based on an absolute threshold of pcor = 0.166
TEDDYdat_filtered <- filterNetworks(TEDDYdat, pcor = 0.166)We aproximated lambda, which increases regularization in the model, so we already have relatively small networks. We will continue without filtering.
The consensus clustering algorithm described in Ma J (2019) and employed here utilizes 7 network
clustering algorithms implemented in the igraph R package (For more information,
please see the documentation for clusterNet()). It works by
clustering the data to identify sub networks of highly inter-connected
metabolites within the networks. Consensus clustering is performed
iteratively until agreement among the clustering algorithms on sub
network membership is reached. This enables the implementation of
data-driven pathway enrichment analysis downstream to identify sub
networks that are differentially enriched across the two experimental
conditions.
Consensus clustering is performed by passing the DNEA
object to clusterNet(). You may opt to specify the
tau parameter, which corresponds to the percent agreement
threshold (i.e. tau% of the clustering algorithms must agree on
the metabolite membership within a sub network). tau can
range from 0.5-1. The default value is 0.5, or 4 of the 7 algorithms
must be in agreement. Increasing the value of tau will increase the
specificity of the analysis, and therefore increase the number and
decrease the size of the resulting sub networks.
Several of the clustering methods utilize random number generation.
Since clusterNet() does not use the BiocParallel framework,
we need to set the seed in native R to ensure reproducibility of our
results.
#set the seed
set.seed(417)
#perform consensus clustering
TEDDYdat <- clusterNet(TEDDYdat, tau = 0.5,
max_iterations = 5,
verbose = FALSE)
#> Initiating consensus cluster with a maximum of 5iterations!
#> Constructing initial consensus matrix...
#>
#> ...starting iteration 1...
#>
#> ...starting iteration 2...
#>
#> Consensus was reached in: 2 iterationsCCsummary(). The summary shows you the number of nodes and
edges per network as well as how many were differentially expressed,
respectively. 132 of the 134 metabolites clustered into the 13 sub
networks, whil the remaining 2 were left as independent metabolites.
| subnetworks | number_of_nodes | number_of_edges | number_of_DE.nodes | number_of_DE.edges |
|---|---|---|---|---|
| subnetwork1 | 12 | 18 | 1 | 14 |
| subnetwork2 | 5 | 4 | 0 | 3 |
| subnetwork3 | 22 | 33 | 0 | 14 |
| subnetwork4 | 10 | 18 | 3 | 7 |
| subnetwork5 | 15 | 20 | 0 | 14 |
| subnetwork6 | 23 | 38 | 3 | 26 |
| subnetwork7 | 4 | 3 | 0 | 2 |
| subnetwork8 | 16 | 21 | 0 | 13 |
| subnetwork9 | 15 | 17 | 0 | 15 |
| subnetwork10 | 8 | 8 | 0 | 1 |
| subnetwork11 | 2 | 1 | 0 | 1 |
| independent | 2 | 0 | 0 | 0 |
Now that we have constructed our biological networks and identified metabolic modules within them, we can perform additional analyses to help us derive biological insight from our data. Two common analyses are pathway enrichment and visualization.
This data-driven approach to network construction overcomes
challenges faced in more traditional pathway analyses of metabolomics
and lipidomics data by using the correlation structure of the data to
define metabolic modules. We can then test them for enrichment across
the experimental condition using netGSA.
DNEA contains a wrapper function for the netgsa algorithm,
runNetGSA(). Everything we need for the analysis is passed
to the function with TEDDYdat. We can access the results using
the netGSAresults() function.
#perform pathway enrichment using netgsa
TEDDYdat <- runNetGSA(TEDDYdat, min_size = 5)
#> The log_input_data expression data will be used for analysis.| subnetworks | number_of_nodes | number_of_edges | number_of_DE.nodes | number_of_DE.edges | NetGSA_pval | NetGSA_pFDR | |
|---|---|---|---|---|---|---|---|
| 1 | subnetwork1 | 12 | 18 | 1 | 14 | 0.0000000 | 0.0000000 |
| 2 | subnetwork2 | 5 | 4 | 0 | 3 | 0.0000000 | 0.0000000 |
| 3 | subnetwork3 | 22 | 33 | 0 | 14 | 0.0000000 | 0.0000000 |
| 4 | subnetwork4 | 10 | 18 | 3 | 7 | 0.0000000 | 0.0000000 |
| 5 | subnetwork5 | 15 | 20 | 0 | 14 | 0.0000107 | 0.0000192 |
| 6 | subnetwork6 | 23 | 38 | 3 | 26 | 0.0025002 | 0.0037502 |
| 8 | subnetwork7 | 16 | 21 | 0 | 13 | 0.2919978 | 0.3754257 |
| 9 | subnetwork8 | 15 | 17 | 0 | 15 | 0.7689038 | 0.7689038 |
| 10 | subnetwork9 | 8 | 8 | 0 | 1 | 0.7602562 | 0.7689038 |
11 of the 13 subnetworks contained 5+ metabolites and were therefore tested for enrichment. 8 of the 11 sub networks tested were significantly enriched at \(/alpha\) = 0.05. NOTE: The sub networks have been reordered by their false-discovery rate calculated during enrichment analysis, so the sub network numbering may look different after pathway analysis as compared to prior.
There are several common tools for visualizing biological networks, three of the most common being:
DNEA provides functionality that makes using all three easy.
Visualizing Networks using DNEA and igraph
The igraph R package is commonly used to visualize networks due to
its customization. DNEA contains a function, plotNetworks()
that is built on igraph. This function provides the user an easy way to
visualize the constructed networks and utilize the features available in
the igraph package. Edges specific to group 1, in our case “DM:control”,
are colored green and edges
specific to group 2, or “DM:case”, are colored
red. Common edges are
black, and DE nodes are colored
purple.
There are two parameters that are required in addition to the
DNEA object: the type, and the
subtype. When the type parameter is set to
“group_networks” we can plot the networks for either of the experimental
groups. We do so by providing its label (i.e. “DM:control” or “DM:case”)
to the subtype parameter, or “All” to plot both
networks.
#names of our experimental conditions
networkGroups(TEDDYdat)
#> [1] "DM:control" "DM:case"
#create side by side plots
par(mfrow = c(1,3))
#plot networks
plotNetworks(TEDDYdat,
type = "group_networks",
subtype = "DM:control",
main = "DM:control Network")
plotNetworks(TEDDYdat,
type = "group_networks",
subtype = "All",
main = "Joint Network")
plotNetworks(TEDDYdat,
type = "group_networks",
subtype = "DM:case",
main = "DM:case Network")We can also plot the sub networks by setting type to
“sub_networks” and specifying which sub network to plot. Sub networks 1
and 2 are the most differentially enriched by fdr, so let’s plot them by
setting subtype = 1 and subtype = 2,
respectively. We can change the layout to a circle by providing the
igraph layout_in_circle() function to the
layout_func parameter. We will need to load the igraph
package into our environment first to do so. More information about
customizing network figures using plotNetworks() can be
found in the function documentation.
#load igraph
library(igraph)
#create side by side plots
par(mfrow = c(1,2))
#plot subnetworks
plotNetworks(TEDDYdat,
type = "sub_networks",
subtype = 1,
layout_func = layout_in_circle,
main = "Sub Network 1")
plotNetworks(TEDDYdat,
type = "sub_networks",
subtype = 2,
layout_func = layout_in_circle,
main = "Sub Network 2")Visualizing Networks using Metscape or Cytoscape
We made network visualization in third-party software easy by
formatting the node and edge lists for input into Cytoscape or Metscape.
You can save these tables as files using the
getNetworkFiles() function. If no filepath is
specified, the two files save to the working directory by default.
Once the files are saved, the data can be read into Cytyoscape for visualization:
Importing the edge list
1A: Open Cytoscape and, in the top left corner, select file —>Import —>Network From File.
1B: In the pop-up, click on the drop-down menu next to Metabolite A and select the green circle, making this the source node.
1C: Click on the drop-down menu next to Metabolite B and select the red target, making this the target node.
1D: Then, hit the OK button in the bottom right corner of the pop-up to finish importing the edge list.
Importing the node list
2A: To import the node list, select Import Table From File from the task bar just above the edge list in the bottom right panel and select the node list file from its directory.
2B: Make sure the Import Data as drop-down says “Node Table Columns”, and select OK in the bottom right corner of the pop-up.
Metscape is an app within Cytoscape that provides additional functionality for the visualization of metabolomics and lipidomics data. More information about these tools can be found on their respective websites linked at the top of Network Visualization.
The DNEA R package is accompanied by a peer-reviewed manuscript published in BMC Bioinformatics. If using this software in your work, please cite:
Patsalis C, Iyer G, Brandenburg M, Karnovsky A, Michailidis G. DNEA: an R package for fast and versatile data-driven network analysis of metabolomics data. BMC Bioinformatics 25, 383 (2024). DOI:10.1186/s12859-024-05994-1.
sessionInfo()
#> R version 4.5.1 (2025-06-13)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] igraph_2.2.1 ggplot2_4.0.0 Hmisc_5.2-4
#> [4] pheatmap_1.0.13 BiocParallel_1.45.0 DNEA_1.1.0
#> [7] kableExtra_1.4.0 knitr_1.50 dplyr_1.1.4
#> [10] BiocStyle_2.39.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 gridExtra_2.3
#> [3] rlang_1.1.6 magrittr_2.0.4
#> [5] snakecase_0.11.1 matrixStats_1.5.0
#> [7] compiler_4.5.1 RSQLite_2.4.3
#> [9] gdata_3.0.1 png_0.1-8
#> [11] systemfonts_1.3.1 vctrs_0.6.5
#> [13] reshape2_1.4.4 quadprog_1.5-8
#> [15] stringr_1.5.2 pkgconfig_2.0.3
#> [17] crayon_1.5.3 fastmap_1.2.0
#> [19] backports_1.5.0 XVector_0.51.0
#> [21] labeling_0.4.3 rmarkdown_2.30
#> [23] graph_1.89.0 bit_4.6.0
#> [25] xfun_0.54 cachem_1.1.0
#> [27] graphite_1.55.3 jsonlite_2.0.0
#> [29] blob_1.2.4 DelayedArray_0.37.0
#> [31] cluster_2.1.8.1 parallel_4.5.1
#> [33] R6_2.6.1 bslib_0.9.0
#> [35] stringi_1.8.7 RColorBrewer_1.1-3
#> [37] genefilter_1.91.0 rpart_4.1.24
#> [39] GenomicRanges_1.63.0 lubridate_1.9.4
#> [41] jquerylib_0.1.4 Rcpp_1.1.0
#> [43] Seqinfo_1.1.0 SummarizedExperiment_1.41.0
#> [45] base64enc_0.1-3 IRanges_2.45.0
#> [47] nnet_7.3-20 Matrix_1.7-4
#> [49] splines_4.5.1 timechange_0.3.0
#> [51] tidyselect_1.2.1 rstudioapi_0.17.1
#> [53] abind_1.4-8 yaml_2.3.10
#> [55] codetools_0.2-20 lattice_0.22-7
#> [57] tibble_3.3.0 plyr_1.8.9
#> [59] withr_3.0.2 Biobase_2.71.0
#> [61] KEGGREST_1.49.2 S7_0.2.0
#> [63] glassoFast_1.0.1 evaluate_1.0.5
#> [65] foreign_0.8-90 survival_3.8-3
#> [67] xml2_1.4.1 Biostrings_2.79.1
#> [69] pillar_1.11.1 BiocManager_1.30.26
#> [71] MatrixGenerics_1.23.0 checkmate_2.3.3
#> [73] stats4_4.5.1 generics_0.1.4
#> [75] S4Vectors_0.49.0 scales_1.4.0
#> [77] gtools_3.9.5 xtable_1.8-4
#> [79] glue_1.8.0 janitor_2.2.1
#> [81] maketools_1.3.2 tools_4.5.1
#> [83] sys_3.4.3 data.table_1.17.8
#> [85] annotate_1.87.0 netgsa_4.0.6
#> [87] buildtools_1.0.0 XML_3.99-0.19
#> [89] grid_4.5.1 colorspace_2.1-2
#> [91] AnnotationDbi_1.71.2 htmlTable_2.4.3
#> [93] Formula_1.2-5 cli_3.6.5
#> [95] rappdirs_0.3.3 textshaping_1.0.4
#> [97] S4Arrays_1.11.0 viridisLite_0.4.2
#> [99] svglite_2.2.2 corpcor_1.6.10
#> [101] glasso_1.11 gtable_0.3.6
#> [103] sass_0.4.10 digest_0.6.37
#> [105] BiocGenerics_0.57.0 SparseArray_1.11.1
#> [107] htmlwidgets_1.6.4 farver_2.1.2
#> [109] memoise_2.0.1 htmltools_0.5.8.1
#> [111] lifecycle_1.0.4 httr_1.4.7
#> [113] bit64_4.6.0-1