MMUPHin 1.14.0
MMUPHin is an R package implementing meta-analysis methods for microbial
community profiles. It has interfaces for: a) covariate-controlled batch and
study effect adjustment, b) meta-analytic differential abundance testing, and meta-analytic discovery of c) discrete (cluster-based) or d) continuous
unsupervised population structure.
Overall, MMUPHin enables the normalization and combination of multiple
microbial community studies. It can then help in identifying microbes, genes,
or pathways that are differential with respect to combined phenotypes. Finally,
it can find clusters or gradients of sample types that reproduce consistently
among studies.
MMUPHin is a Bioconductor package and can be installed via the following command.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MMUPHin")This vignette is intended to provide working examples for all four
functionalities of MMUPHin.
MMUPHin can be loaded into R with the first library() command shown below. We also load some utility packages for data manipulation and visualization.
library(MMUPHin)
library(magrittr)
library(dplyr)
library(ggplot2)As input, MMUPHin requires a properly
formatted collection of microbial community studies, with both feature
abundances and accompanying metadata. Here we use the five published
colorectal cancer (CRC)
stool metagenomic studies, incorporated in Thomas et al. (2019).
Data for the studies are already
conveniently packaged and accessible through the Biocondcutor package
curatedMetagenomicData,
though additional wranglings are needed to format input for MMUPHin.
Importantly, MMUPHin asks that feature abundances be provided as a
feature-by-sample matrix, and the metadata be provided as a data frame.
The two objects shoud agree on sample IDs, that is, colname of the feature
abundance matrix and rowname of the metadata data frame must agree.
Many popular ’omic data classes in R already enforce this standard, such as
ExpressionSet from Biobase, or phyloseq from
phyloseq.
To minimize users’ efforts in loading data to run the examples, we have properly formatted the five studies for easy access. The feature abundances and metadata can be loaded with the following code chunk. For the interested user, the commented out scripts were used for accessing data directly from curatedMetagenomicData and formatting. It might be worthwhile to read through these as they perform many of the common tasks for preprocessing microbial feature abundance data in R, including sample/feature subsetting, normalization, filtering, etc.
data("CRC_abd", "CRC_meta")
CRC_abd[1:5, 1, drop = FALSE]##                                 FengQ_2015.metaphlan_bugs_list.stool:SID31004
## s__Faecalibacterium_prausnitzii                                    0.11110668
## s__Streptococcus_salivarius                                        0.09660736
## s__Ruminococcus_sp_5_1_39BFAA                                      0.09115385
## s__Subdoligranulum_unclassified                                    0.05806767
## s__Bacteroides_stercoris                                           0.05685503CRC_meta[1, 1:5]##                                               subjectID body_site
## FengQ_2015.metaphlan_bugs_list.stool:SID31004  SID31004     stool
##                                               study_condition
## FengQ_2015.metaphlan_bugs_list.stool:SID31004             CRC
##                                                                    disease
## FengQ_2015.metaphlan_bugs_list.stool:SID31004 CRC;fatty_liver;hypertension
##                                               age_category
## FengQ_2015.metaphlan_bugs_list.stool:SID31004        adulttable(CRC_meta$studyID)## 
##      FengQ_2015.metaphlan_bugs_list.stool 
##                                       107 
## HanniganGD_2017.metaphlan_bugs_list.stool 
##                                        55 
##  VogtmannE_2016.metaphlan_bugs_list.stool 
##                                       104 
##        YuJ_2015.metaphlan_bugs_list.stool 
##                                       128 
##    ZellerG_2014.metaphlan_bugs_list.stool 
##                                       157adjust_batchadjust_batch aims to correct for technical study/batch effects in microbial
feature abundances. It takes as input the feature-by-sample abundance matrix,
and performs batch effect adjustment given provided batch and optional covariate variables. It returns the batch-adjusted abundance matrix. Check help(adjust_batch) for additional details and options. This figure from the paper (figure 2b) shows the desired effect of the batch adjustment: batch effects are removed, making the effect of the biological variable of interest easier to detect.
Here we use adjust_batch to correct for differences in the five studies,
while controlling for the effect of CRC versus control (variable
study_condition in CRC_meta).
adjust_batch returns a list of multiple outputs. The part we are most interested in now is the adjusted abundance table, which we assign to CRC_abd_adj.
fit_adjust_batch <- adjust_batch(feature_abd = CRC_abd,
                                 batch = "studyID",
                                 covariates = "study_condition",
                                 data = CRC_meta,
                                 control = list(verbose = FALSE))
CRC_abd_adj <- fit_adjust_batch$feature_abd_adjOne way to evaluate the effect of batch adjustment is to assess the total
variability in microbial profiles attributable to study differences, before
and after adjustment. This is commonly known as a PERMANOVA test
(Tang, Chen, and Alekseyenko 2016), and can be performed with the adonis function in
vegan. The inputs to adonis() are indices of microbial compositional dissimilarity.
library(vegan, quietly = TRUE)## This is vegan 2.6-4D_before <- vegdist(t(CRC_abd))
D_after <- vegdist(t(CRC_abd_adj))
set.seed(1)
fit_adonis_before <- adonis(D_before ~ studyID, data = CRC_meta)## 'adonis' will be deprecated: use 'adonis2' insteadfit_adonis_after <- adonis(D_after ~ studyID, data = CRC_meta)## 'adonis' will be deprecated: use 'adonis2' insteadprint(fit_adonis_before$aov.tab)## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## studyID     4     13.36  3.3400  11.855 0.07991  0.001 ***
## Residuals 546    153.82  0.2817         0.92009           
## Total     550    167.18                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1print(fit_adonis_after$aov.tab)## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## studyID     4     4.939 1.23463  4.2855 0.03044  0.001 ***
## Residuals 546   157.298 0.28809         0.96956           
## Total     550   162.237                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1We can see that, before study effect adjustment, study differences can expalin a total of 7.99% of the variability in microbial abundance profiles, whereas after adjustment this was reduced to 3.04%, though the effect of study is significant in both cases.
lm_metaOne of the most common meta-analysis goals is to combine association effects
across batches/studies to identify consistent overall effects. lm_meta
provides a straightforward interface to this task, by first performing
regression analysis in individual batches/studies using the well-validated
Maaslin2 package, and then aggregating results with
established fixed/mixed effect models, realized via the
vegan package. Here, we use lm_meta to test for
consistently differential abundant species between CRC and control samples
across the five studies, while controlling for demographic covariates (gender,
age, BMI). Again, lm_meta returns a list of more than one components.
meta_fits provides the final meta-analytical testing results. See
help(lm_meta) for the meaning of other components.
fit_lm_meta <- lm_meta(feature_abd = CRC_abd_adj,
                       batch = "studyID",
                       exposure = "study_condition",
                       covariates = c("gender", "age", "BMI"),
                       data = CRC_meta,
                       control = list(verbose = FALSE))## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate gender is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate age is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stool## Warning in lm_meta(feature_abd = CRC_abd_adj, batch = "studyID", exposure = "study_condition", : Covariate BMI is missing or has only one non-missing value in the following batches; will be excluded from model for these batches:
## HanniganGD_2017.metaphlan_bugs_list.stool, YuJ_2015.metaphlan_bugs_list.stoolmeta_fits <- fit_lm_meta$meta_fitsYou’ll note that this command produces a few error messages. This is because for one of the studies the specified covariates were not measured. As the message states, these covariates are not included when analyzing data from that study. We also note that batch correction is standard but not necessary for the analysis here.
We can visualize the significant (FDR q < 0.05) species associated with CRC in these studies/samples. Comparing them with Figure 1b of Thomas et al. (2019), we can see that many of the significant species do agree.
meta_fits %>% 
  filter(qval.fdr < 0.05) %>% 
  arrange(coef) %>% 
  mutate(feature = factor(feature, levels = feature)) %>% 
  ggplot(aes(y = coef, x = feature)) +
  geom_bar(stat = "identity") +
  coord_flip()discrete_discoverClustering analysis of microbial profiles can help identify meaningful
discrete population subgroups (Ravel et al. 2011), but must be carried out
carefully with validations to ensure that the identified structures are
consistent (Koren et al. 2013). discrete_discover provides the functionality to
use prediction strength (Tibshirani and Walther 2005) to evaluate discrete
clustering structures within individual microbial studies, as well as a
“generalized prediction strength” to evaluate their reproducibility in
other studies. These jointly provide meta-analytical evidence for
(or against) identifying discrete population structures in microbial profiles.
Check help(discrete_discover) to see more details on the method and additional
options. We can contrast discrete structure (clusters) with continuous structure (which will be overviewed in the next section):
The gut microbiome is known to form gradients rather than discrete clusters
(Koren et al. 2013). Here we use discrete_discover to evaluate clustering
structures among control samples in the five stool studies. It’s worth noting that this function takes a distance/dissimilarity matrix as an input, rather than the abundance table.
control_meta <- subset(CRC_meta, study_condition == "control")
control_abd_adj <- CRC_abd_adj[, rownames(control_meta)]
D_control <- vegdist(t(control_abd_adj))
fit_discrete <- discrete_discover(D = D_control,
                                  batch = "studyID",
                                  data = control_meta,
                                  control = list(k_max = 8,
                                                 verbose = FALSE))The internal_mean and internal_sd components of fit_discrete are matrices
that provide internal evaluation statistics (prediction strength) for each
batch (column) and evaluated number of clusters (row). Similarly,
external_mean and external_sd provide external evaluation statistics (
generalized prediction strenght). Evidence for existence of discrete structures
would be a “peaking” of the mean statistics at a particular cluster number.
Here, for easier examination of such a pattern, we visualize the results
for the largest study, ZellerG_2014. Note that visualization for all studies
are by default automatically generated and saved to the output file
“diagnostic_discrete.pdf”.
study_id = "ZellerG_2014.metaphlan_bugs_list.stool"
internal <- data.frame(K = 2:8,
                       statistic = fit_discrete$internal_mean[, study_id],
                       se = fit_discrete$internal_se[, study_id],
                       type = "internal")
external <- data.frame(K = 2:8,
                       statistic = fit_discrete$external_mean[, study_id],
                       se = fit_discrete$external_se[, study_id],
                       type = "external")
rbind(internal, external) %>% 
  ggplot(aes(x = K, y = statistic, color = type)) +
  geom_point(position = position_dodge(width = 0.5)) + 
  geom_line(position = position_dodge(width = 0.5)) +
  geom_errorbar(aes(ymin = statistic - se, ymax = statistic + se),
                position = position_dodge(width = 0.5), width = 0.5) +
  ggtitle("Evaluation of discrete structure in control stool microbiome (ZellerG_2014)")The decreasing trend for both the internal and external statistics along with number of clusters (K) suggests that discrete structures cannot be well-established. To provide a positive example, we examine the two vaginal microbiome studies provided by curatedMetagenomicData, as the vaginal microbiome is known to have distinct subtypes (Ravel et al. 2011). Again, we pre-formatted these datasets for easy access, but you can recreate them from curatedMetagenomicData with the commented out scripts.
data("vaginal_abd", "vaginal_meta")
D_vaginal <- vegdist(t(vaginal_abd))
fit_discrete_vag <- discrete_discover(D = D_vaginal,
                                      batch = "studyID",
                                      data = vaginal_meta,
                                      control = list(verbose = FALSE,
                                                     k_max = 8))
hmp_id = "HMP_2012.metaphlan_bugs_list.vagina"
data.frame(K = 2:8,
           statistic = fit_discrete_vag$internal_mean[, hmp_id],
           se = fit_discrete_vag$internal_se[, hmp_id]) %>% 
  ggplot(aes(x = K, y = statistic)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = statistic - se, 
                    ymax = statistic + se), 
                width = 0.5) +
  ggtitle("Evaluation of discrete structure in vaginal microbiome (HMP_2012)")We can see that for the vaginal microbiome, discrete_discover suggests the
existence of five clusters. Here we examine only the internal metrics of
HMP_2012 as the other study (FerrettiP_2018) has only
19
samples.
continuous_discoverPopulation structure in the microbiome can manifest as gradients rather than
discrete clusters, such as dominant phyla trade-off or disease-associated
dysbiosis. continuous_discover provide functionality to identify such
structures as well as to validate them with meta-analysis. We again evaluate
these continuous structures in control samples of the five studies.
Much like adjust_batch and lm_meta, continuous_discover also takes as input feature-by-sample abundances. The control argument offers many tuning parameters and here we set one of them, var_perc_cutoff, to 0.5, which asks the method to include top principal components within each batch that in total explain at least 50% of the total variability in the batch. See help(continuous_discover) for more details on the tuning parameters and their interpretations.
fit_continuous <- continuous_discover(feature_abd = control_abd_adj,
                                      batch = "studyID",
                                      data = control_meta,
                                      control = list(var_perc_cutoff = 0.5,
                                                     verbose = FALSE))We can visualize the identified continuous structure scores in at least two ways: first, to examine their top contributing microbial features, to get an idea of what the score is characterizing, and second, to overlay the continuous scores with an ordination visualization. Here we perform these visualizations on the first identified continuous score.
loading <- data.frame(feature = rownames(fit_continuous$consensus_loadings),
                      loading1 = fit_continuous$consensus_loadings[, 1])
loading %>%
    arrange(-abs(loading1)) %>%
    slice(1:20) %>%
    arrange(loading1) %>%
    mutate(feature = factor(feature, levels = feature)) %>%
    ggplot(aes(x = feature, y = loading1)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    ggtitle("Features with top loadings")mds <- cmdscale(d = D_control)
colnames(mds) <- c("Axis1", "Axis2")
as.data.frame(mds) %>% 
  mutate(score1 = fit_continuous$consensus_scores[, 1]) %>% 
  ggplot(aes(x = Axis1, y = Axis2, color = score1)) +
  geom_point() +
  coord_fixed()From ordination we see that the first continuos score indeed represent strong variation across these stool samples. From the top loading features we can see that this score strongly represents a Bacteroidetes (the Bacteroides species) versus Firmicutes (the Ruminococcus species) tradeoff.
sessionInfo()## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] vegan_2.6-4      lattice_0.21-8   permute_0.9-7    ggplot2_3.4.2   
## [5] dplyr_1.1.2      magrittr_2.0.3   MMUPHin_1.14.0   BiocStyle_2.28.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0    farver_2.1.1        fastmap_1.1.1      
##  [4] mathjaxr_1.6-0      digest_0.6.31       lifecycle_1.0.3    
##  [7] cluster_2.1.4       kernlab_0.9-32      compiler_4.3.0     
## [10] rlang_1.1.0         sass_0.4.5          tools_4.3.0        
## [13] igraph_1.4.2        utf8_1.2.3          yaml_2.3.7         
## [16] knitr_1.42          labeling_0.4.2      mclust_6.0.0       
## [19] withr_2.5.0         purrr_1.0.1         numDeriv_2016.8-1.1
## [22] nnet_7.3-18         grid_4.3.0          pcaPP_2.0-3        
## [25] hash_2.2.6.2        stats4_4.3.0        fansi_1.0.4        
## [28] Maaslin2_1.14.0     colorspace_2.1-0    scales_1.2.1       
## [31] fpc_2.2-10          MASS_7.3-59         prabclus_2.3-2     
## [34] logging_0.10-108    optparse_1.7.3      cli_3.6.1          
## [37] mvtnorm_1.1-3       rmarkdown_2.21      metafor_4.0-0      
## [40] crayon_1.5.2        ragg_1.2.5          generics_0.1.3     
## [43] robustbase_0.95-1   biglm_0.9-2.1       getopt_1.20.3      
## [46] DBI_1.1.3           pbapply_1.7-0       cachem_1.0.7       
## [49] stringr_1.5.0       modeltools_0.2-23   splines_4.3.0      
## [52] metadat_1.2-0       parallel_4.3.0      BiocManager_1.30.20
## [55] vctrs_0.6.2         Matrix_1.5-4        jsonlite_1.8.4     
## [58] bookdown_0.33       systemfonts_1.0.4   magick_2.7.4       
## [61] diptest_0.76-0      tidyr_1.3.0         jquerylib_0.1.4    
## [64] glue_1.6.2          DEoptimR_1.0-12     cowplot_1.1.1      
## [67] stringi_1.7.12      gtable_0.3.3        munsell_0.5.0      
## [70] lpsymphony_1.28.0   tibble_3.2.1        pillar_1.9.0       
## [73] htmltools_0.5.5     R6_2.5.1            textshaping_0.3.6  
## [76] evaluate_0.20       highr_0.10          bslib_0.4.2        
## [79] class_7.3-21        Rcpp_1.0.10         flexmix_2.3-19     
## [82] nlme_3.1-162        mgcv_1.8-42         xfun_0.39          
## [85] pkgconfig_2.0.3Koren, Omry, Dan Knights, Antonio Gonzalez, Levi Waldron, Nicola Segata, Rob Knight, Curtis Huttenhower, and Ruth E Ley. 2013. “A Guide to Enterotypes Across the Human Body: Meta-Analysis of Microbial Community Structures in Human Microbiome Datasets.” PLoS Computational Biology 9 (1): e1002863.
Ravel, Jacques, Pawel Gajer, Zaid Abdo, G Maria Schneider, Sara SK Koenig, Stacey L McCulle, Shara Karlebach, et al. 2011. “Vaginal Microbiome of Reproductive-Age Women.” Proceedings of the National Academy of Sciences 108 (Supplement 1): 4680–7.
Tang, Zheng-Zheng, Guanhua Chen, and Alexander V Alekseyenko. 2016. “PERMANOVA-S: Association Test for Microbial Community Composition That Accommodates Confounders and Multiple Distances.” Bioinformatics 32 (17): 2618–25.
Thomas, Andrew Maltez, Paolo Manghi, Francesco Asnicar, Edoardo Pasolli, Federica Armanini, Moreno Zolfo, Francesco Beghini, et al. 2019. “Metagenomic Analysis of Colorectal Cancer Datasets Identifies Cross-Cohort Microbial Diagnostic Signatures and a Link with Choline Degradation.” Nature Medicine 25 (4): 667.
Tibshirani, Robert, and Guenther Walther. 2005. “Cluster Validation by Prediction Strength.” Journal of Computational and Graphical Statistics 14 (3): 511–28.