Contents

1 Introduction

MetaDICT is a computational method designed for the integration of microbiome datasets, effectively addressing batch effects while preserving biological variation across heterogeneous datasets.

The method operates in two stages:

  1. Initial Batch Effect Estimation – Utilizes covariate balancing to estimate batch effects, which are defined as heterogeneous sequencing efficiency across datasets.

  2. Refinement via Shared Dictionary Learning – Further refines batch effect estimation by leveraging shared structures across datasets.

Compared with regression-based approaches, MetaDICT minimizes overcorrection when unobserved confounding variables are present, ensuring a more reliable integration of datasets. The resulting integrated data can be applied to downstream analyses, including Principal Coordinates Analysis (PCoA), taxa/sample community detection, and differential abundance test. For more details, please refer to the MetaDICT paper (Yuan and Wang 2024).

2 Installation

The package can be downloaded from github:

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("BoYuan07/MetaDICT", build_vignettes = TRUE)

Load the package:

# load the package
library(MetaDICT)

3 Run MetaDICT on a simulated dataset

3.1 Example dataset

Example dataset contains two simulated datasets, using gut microbiome data collected by He et al. (He et al. 2018). These two datasets share the same set of taxa, and each dataset contains 200 samples. Load the example sample:

# load data
data("exampleData")
O = exampleData$O
meta = exampleData$meta
dist_mat = exampleData$dist_mat
taxonomy = exampleData$taxonomy
tree = exampleData$tree

The object contains the following components:

  • Integrated Count Matrix (O)
    A merged abundance data matrix (taxa-by-sample) combining both datasets.

  • Meta Table (meta)
    The sample metadata, which includes details for both datasets:

    • batch – Indicates the dataset source (Dataset1 or Dataset2).
    • Y – A covariate associated with microbial compositions.
    • Y2 – An uninformative covariate with no biological relevance.
  • Sequence Distance Matrix (dist_mat)
    A matrix quantifying the relationships between sequences.

  • Phylogenetic Tree (tree)

  • Taxonomy Table (taxonomy)

We will use this example data to illustrate how to use MetaDICT.

Significant batch effects are observed between these two datasets on PCoA plots.

# batch label
batchid = meta$batch

# PCoA plot
pcoa_plot_discrete(O,batchid,"Batch")

The PCoA plots of target variable Y:

# sample covariate
Y = meta$Y

# PCoA plot
pcoa_plot_discrete(O,Y,"Sample", colorset = "Set2")

3.2 Check the singular values of each dataset

A crucial assumption in MetaDICT is that the microbial load matrix can be approximated by a product of two matrices, one of which is shared dictionary. A simple diagnostic tool for such an assumption is to evaluate the singular values of the sequencing count matrix in each study and see how fast the singular values decay:

plot_singular_values(O, meta)

3.3 Run MetaDICT function with taxa dissimilarity matrix

We apply MetaDICT to remove batch effects and integrate these two datasets. MetaDICT requires three inputs: integrated count table, integrated meta table, and taxa dissimilarity matrix used for measurement efficiency (batch effect) estimation. A taxa dissimilarity matrix can directly be used in MetaDICT:

# main function of MetaDICT
metadict_res = MetaDICT(O, meta, distance_matrix = dist_mat)

X = metadict_res$count
D = metadict_res$D
R_list = metadict_res$R
w = metadict_res$w
meta_output = metadict_res$meta

The results from MetaDICT include:

  1. Corrected Count Table (X) – The adjusted abundance matrix after integration.
  2. Estimated Shared Dictionary (D) – The learned dictionary representing shared features across datasets.
  3. Estimated Sample Representation (R) – The latent factor representation of samples.
  4. Estimated Measurement Efficiency (w) – The scaling factors capturing dataset-specific measurement variations.
  5. Integrated Meta Table – The combined metadata containing information from all integrated datasets.

Batch effect is significantly reduced using MetaDICT:

# PCoA plot of batch variable
pcoa_plot_discrete(X,batchid,"Batch")

# PCoA plot of sample covariate
pcoa_plot_discrete(X,Y,"Sample",colorset = "Set2")

3.4 Run MetaDICT with the phylogenetic tree

MetaDICT can also accept a phylogenetic tree as input and uses phylogenetic information to estimate taxa similarity:

metadict_res1 = MetaDICT(O, meta, tree = tree)
X1 = metadict_res1$count
# PCoA plot of batch variable
pcoa_plot_discrete(X1,batchid,"Batch")

# PCoA plot of sample covariate
pcoa_plot_discrete(X1,Y,"Sample",colorset = "Set2")

3.5 Run MetaDICT with the taxonomy information

If phylogenetic tree is not available, MetaDICT can use taxonomic information to estimate taxa similarity. In this case, the taxonomy level of the count table must be specified.

metadict_res2 = MetaDICT(O, meta, taxonomy = taxonomy, tax_level = "order")
X2 = metadict_res2$count
# PCoA plot of batch variable
pcoa_plot_discrete(X2,batchid,"Batch")

# PCoA plot of sample covariate
pcoa_plot_discrete(X2,Y,"Sample",colorset = "Set2")

3.6 Run MetaDICT with the specified covariate

Users can specify the covariates that should be included in MetaDICT using covariates. In this example, we only use Y2 in the covariate balancing step. MetaDICT is able to preserve the biological variation of Y even when Y is not observed.

metadict_res3 = MetaDICT(O,meta,covariates = c("Y2"), 
                         distance_matrix = dist_mat)
X3 = metadict_res3$count
# PCoA plot of batch variable
pcoa_plot_discrete(X3,batchid,"Batch")

# PCoA plot of sample covariate
pcoa_plot_discrete(X3,Y,"Sample",colorset = "Set2")

3.7 Run MetaDICT with customized parameters

MetaDICT includes two predefined parameters \(\alpha\) and \(\beta\). The parameter \(\alpha\) enforces the low-rank structure of shared dictionary. The parameter \(\beta\) enforces the smoothness of estimated measurement efficiency. If customize_parameter = FALSE, MetaDICT automatically selects parameters based on the provided inputs. When customize_parameter = TRUE, users can manually set parameter values to customize the analysis.

metadict_res4 = MetaDICT(O,meta,distance_matrix = dist_mat,
                         customize_parameter = TRUE, alpha = 0.01, beta = 0.01)

4 Add new datasets to an existing integrated study

In certain situations, new studies may become available after multiple datasets have already been integrated and a machine learning model has been trained on the combined data. To incorporate these new studies into the existing integrated dataset, use the following function:

# load the data
data("exampleData_transfer")
new_data = exampleData_transfer$new_data
new_meta = exampleData_transfer$new_meta

# add new dataset to previous result
new_data_res = metadict_add_new_data(new_data, new_meta, metadict_res)

# corrected count
new_count = new_data_res$count

# integrate count tables
all_count_raw = cbind(X,new_data)
all_count_corrected = cbind(X,new_count)
covariates <- intersect(colnames(meta), colnames(new_meta))
all_meta = rbind(meta[,covariates, drop = FALSE],new_meta[,covariates, drop = FALSE])

Before batch correction:

# PCoA plot of batch variable
pcoa_plot_discrete(all_count_raw,all_meta$batch,"Batch")

# PCoA plot of sample covariate
pcoa_plot_discrete(all_count_raw,all_meta$Y,"Sample",colorset = "Set2")

After batch correction:

# PCoA plot of batch variable
pcoa_plot_discrete(all_count_corrected,all_meta$batch,"Batch")

# PCoA plot of sample covariate
pcoa_plot_discrete(all_count_corrected,all_meta$Y,"Sample",colorset = "Set2")

The corrected data new_count can be directly applied to pre-trained machine learning model.

5 Community detection

We can detect taxa communities and sample subpopulation using the output of MetaDICT.

Load ggraph for visualization:

library(ggraph)

Shared dictionary D can be used in taxa community detection. The number of columns used in this process is determined using the elbow of column-wise squared norms.

D = metadict_res4$D
plot(diag(t(D)%*%(D)), ylab = "Column-wise Squared Norm")

The elbow value is around 20. We apply community detection method:

D_filter = D[,1:20]
taxa_c = community_detection(D_filter, max_k = 5)
taxa_cluster = taxa_c$cluster
taxa_graph = taxa_c$graph
ggraph(taxa_graph, layout = "stress") +  
    geom_node_point(aes(color = as.factor(taxa_cluster)), size = 2) +  
    scale_color_brewer(palette = "Set1", name = "Taxa Cluster") + 
    theme_bw() +
    xlab("") +
    ylab("") +
    theme(
        legend.position = "right",  
        legend.title = element_text(size = 12, face = "bold"),  
        legend.text = element_text(size = 10)  
    )

6 Session information

sessionInfo()
R version 4.5.1 Patched (2025-08-23 r88802)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggraph_2.2.2     MetaDICT_0.99.4  DT_0.34.0        lubridate_1.9.4 
 [5] forcats_1.0.1    stringr_1.5.2    dplyr_1.1.4      purrr_1.1.0     
 [9] readr_2.1.5      tidyr_1.3.1      tibble_3.3.0     ggplot2_4.0.0   
[13] tidyverse_2.0.0  BiocStyle_2.37.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    viridisLite_0.4.2   farver_2.1.2       
 [4] viridis_0.6.5       S7_0.2.0            fastmap_1.2.0      
 [7] tweenr_2.0.3        RANN_2.6.2          digest_0.6.37      
[10] timechange_0.3.0    lifecycle_1.0.4     cluster_2.1.8.1    
[13] statmod_1.5.0       magrittr_2.0.4      compiler_4.5.1     
[16] rlang_1.1.6         sass_0.4.10         tools_4.5.1        
[19] igraph_2.1.4        yaml_2.3.10         knitr_1.50         
[22] ggsignif_0.6.4      graphlayouts_1.2.2  labeling_0.4.3     
[25] htmlwidgets_1.6.4   RColorBrewer_1.1-3  abind_1.4-8        
[28] withr_3.0.2         polyclip_1.10-7     grid_4.5.1         
[31] ggpubr_0.6.1        edgeR_4.7.5         scales_1.4.0       
[34] MASS_7.3-65         tinytex_0.57        dichromat_2.0-0.1  
[37] cli_3.6.5           rmarkdown_2.30      vegan_2.7-1        
[40] generics_0.1.4      tzdb_0.5.0          ggforce_0.5.0      
[43] ape_5.8-1           cachem_1.1.0        splines_4.5.1      
[46] parallel_4.5.1      BiocManager_1.30.26 matrixStats_1.5.0  
[49] vctrs_0.6.5         Matrix_1.7-4        jsonlite_2.0.0     
[52] carData_3.0-5       bookdown_0.44       car_3.1-3          
[55] hms_1.1.3           ggrepel_0.9.6       rstatix_0.7.2      
[58] Formula_1.2-5       magick_2.9.0        locfit_1.5-9.12    
[61] limma_3.65.4        jquerylib_0.1.4     glue_1.8.0         
[64] cowplot_1.2.0       ecodist_2.1.3       stringi_1.8.7      
[67] gtable_0.3.6        pillar_1.11.1       htmltools_0.5.8.1  
[70] R6_2.6.1            tidygraph_1.3.1     evaluate_1.0.5     
[73] lattice_0.22-7      backports_1.5.0     memoise_2.0.1      
[76] broom_1.0.10        bslib_0.9.0         Rcpp_1.1.0         
[79] gridExtra_2.3       nlme_3.1-168        permute_0.9-8      
[82] mgcv_1.9-3          xfun_0.53           pkgconfig_2.0.3    

References

He, Y., W. Wu, H. Zheng, P. Li, D. McDonald, H. Sheng, M. Chen, et al. 2018. “Regional Variation Limits Applications of Healthy Gut Microbiome Reference Ranges and Disease Models.” Nature Medicine 24 (10): 1532–5.

Yuan, B., and S. Wang. 2024. “Microbiome Data Integration via Shared Dictionary Learning.” bioRxiv, 2024–10.