--- title: "MetaDICT Tutorial" author: - Bo Yuan date: '`r format(Sys.Date(), "%B %d, %Y")`' output: BiocStyle::html_document bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{MetaDICT Tutorial} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, message = FALSE, warning = FALSE, comment = NA, include=FALSE} knitr::opts_chunk$set(message = FALSE, warning = FALSE, comment = NA, fig.width = 6.25, fig.height = 5) library(tidyverse) library(DT) options(DT.options = list( initComplete = JS("function(settings, json) {", "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});","}"))) ``` # Introduction MetaDICT is a computational method designed for the **integration of microbiome datasets**, effectively addressing **batch effects** while preserving **biological variation** across heterogeneous datasets. The method operates in **two stages**: 1. **Initial Batch Effect Estimation** – Utilizes **covariate balancing** to estimate batch effects, which are defined as **heterogeneous sequencing efficiency** across datasets. 2. **Refinement via Shared Dictionary Learning** – Further refines batch effect estimation by leveraging shared structures across datasets. Compared with regression-based approaches, **MetaDICT minimizes overcorrection** when **unobserved confounding variables** are present, ensuring a more reliable integration of datasets. The resulting integrated data can be applied to **downstream analyses**, including **Principal Coordinates Analysis (PCoA)**, **taxa/sample community detection**, and **differential abundance test**. For more details, please refer to the MetaDICT paper [@yuan2024microbiome]. # Installation The package can be downloaded from github: ```{r,eval=FALSE} if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") devtools::install_github("BoYuan07/MetaDICT", build_vignettes = TRUE) ``` Load the package: ```{r} # load the package library(MetaDICT) ``` # Run MetaDICT on a simulated dataset ## Example dataset Example dataset contains two simulated datasets, using gut microbiome data collected by He et al. [@he2018regional]. These two datasets share the same set of taxa, and each dataset contains 200 samples. Load the example sample: ```{r} # load data data("exampleData") O = exampleData$O meta = exampleData$meta dist_mat = exampleData$dist_mat taxonomy = exampleData$taxonomy tree = exampleData$tree ``` The object contains the following components: - **Integrated Count Matrix (`O`)** A merged abundance data matrix (taxa-by-sample) combining both datasets. - **Meta Table (`meta`)** The sample metadata, which includes details for both datasets: - **`batch`** – Indicates the dataset source (`Dataset1` or `Dataset2`). - **`Y`** – A covariate associated with microbial compositions. - **`Y2`** – An uninformative covariate with no biological relevance. - **Sequence Distance Matrix (`dist_mat`)** A matrix quantifying the relationships between sequences. - **Phylogenetic Tree (`tree`)** - **Taxonomy Table (`taxonomy`)** We will use this example data to illustrate how to use MetaDICT. Significant batch effects are observed between these two datasets on PCoA plots. ```{r} # batch label batchid = meta$batch # PCoA plot pcoa_plot_discrete(O,batchid,"Batch") ``` The PCoA plots of target variable `Y`: ```{r} # sample covariate Y = meta$Y # PCoA plot pcoa_plot_discrete(O,Y,"Sample", colorset = "Set2") ``` ## Check the singular values of each dataset A crucial assumption in MetaDICT is that the microbial load matrix can be approximated by a product of two matrices, one of which is shared dictionary. A simple diagnostic tool for such an assumption is to evaluate the singular values of the sequencing count matrix in each study and see how fast the singular values decay: ```{r} plot_singular_values(O, meta) ``` ## Run MetaDICT function with taxa dissimilarity matrix We apply MetaDICT to remove batch effects and integrate these two datasets. MetaDICT requires three inputs: integrated count table, integrated meta table, and taxa dissimilarity matrix used for measurement efficiency (batch effect) estimation. A taxa dissimilarity matrix can directly be used in MetaDICT: ```{r} # main function of MetaDICT metadict_res = MetaDICT(O, meta, distance_matrix = dist_mat) X = metadict_res$count D = metadict_res$D R_list = metadict_res$R w = metadict_res$w meta_output = metadict_res$meta ``` The results from **MetaDICT** include: 1. **Corrected Count Table (`X`)** – The adjusted abundance matrix after integration. 2. **Estimated Shared Dictionary (`D`)** – The learned dictionary representing shared features across datasets. 3. **Estimated Sample Representation (`R`)** – The latent factor representation of samples. 4. **Estimated Measurement Efficiency (`w`)** – The scaling factors capturing dataset-specific measurement variations. 5. **Integrated Meta Table** – The combined metadata containing information from all integrated datasets. Batch effect is significantly reduced using MetaDICT: ```{r} # PCoA plot of batch variable pcoa_plot_discrete(X,batchid,"Batch") # PCoA plot of sample covariate pcoa_plot_discrete(X,Y,"Sample",colorset = "Set2") ``` ## Run MetaDICT with the phylogenetic tree MetaDICT can also accept a phylogenetic tree as input and uses phylogenetic information to estimate taxa similarity: ```{r} metadict_res1 = MetaDICT(O, meta, tree = tree) X1 = metadict_res1$count ``` ```{r} # PCoA plot of batch variable pcoa_plot_discrete(X1,batchid,"Batch") # PCoA plot of sample covariate pcoa_plot_discrete(X1,Y,"Sample",colorset = "Set2") ``` ## Run MetaDICT with the taxonomy information If phylogenetic tree is not available, MetaDICT can use taxonomic information to estimate taxa similarity. In this case, the taxonomy level of the count table must be specified. ```{r} metadict_res2 = MetaDICT(O, meta, taxonomy = taxonomy, tax_level = "order") X2 = metadict_res2$count ``` ```{r} # PCoA plot of batch variable pcoa_plot_discrete(X2,batchid,"Batch") # PCoA plot of sample covariate pcoa_plot_discrete(X2,Y,"Sample",colorset = "Set2") ``` ## Run MetaDICT with the specified covariate Users can specify the covariates that should be included in MetaDICT using `covariates`. In this example, we only use `Y2` in the covariate balancing step. MetaDICT is able to preserve the biological variation of `Y` even when `Y` is not observed. ```{r} metadict_res3 = MetaDICT(O,meta,covariates = c("Y2"), distance_matrix = dist_mat) X3 = metadict_res3$count ``` ```{r} # PCoA plot of batch variable pcoa_plot_discrete(X3,batchid,"Batch") # PCoA plot of sample covariate pcoa_plot_discrete(X3,Y,"Sample",colorset = "Set2") ``` ## Run MetaDICT with customized parameters MetaDICT includes two predefined parameters $\alpha$ and $\beta$. The parameter $\alpha$ enforces the low-rank structure of shared dictionary. The parameter $\beta$ enforces the smoothness of estimated measurement efficiency. If `customize_parameter = FALSE`, MetaDICT automatically selects parameters based on the provided inputs. When `customize_parameter = TRUE`, users can manually set parameter values to customize the analysis. ```{r} metadict_res4 = MetaDICT(O,meta,distance_matrix = dist_mat, customize_parameter = TRUE, alpha = 0.01, beta = 0.01) ``` # Add new datasets to an existing integrated study In certain situations, new studies may become available after multiple datasets have already been integrated and a machine learning model has been trained on the combined data. To incorporate these new studies into the existing integrated dataset, use the following function: ```{r} # load the data data("exampleData_transfer") new_data = exampleData_transfer$new_data new_meta = exampleData_transfer$new_meta # add new dataset to previous result new_data_res = metadict_add_new_data(new_data, new_meta, metadict_res) # corrected count new_count = new_data_res$count # integrate count tables all_count_raw = cbind(X,new_data) all_count_corrected = cbind(X,new_count) covariates <- intersect(colnames(meta), colnames(new_meta)) all_meta = rbind(meta[,covariates, drop = FALSE],new_meta[,covariates, drop = FALSE]) ``` Before batch correction: ```{r} # PCoA plot of batch variable pcoa_plot_discrete(all_count_raw,all_meta$batch,"Batch") # PCoA plot of sample covariate pcoa_plot_discrete(all_count_raw,all_meta$Y,"Sample",colorset = "Set2") ``` After batch correction: ```{r} # PCoA plot of batch variable pcoa_plot_discrete(all_count_corrected,all_meta$batch,"Batch") # PCoA plot of sample covariate pcoa_plot_discrete(all_count_corrected,all_meta$Y,"Sample",colorset = "Set2") ``` The corrected data `new_count` can be directly applied to pre-trained machine learning model. # Community detection We can detect taxa communities and sample subpopulation using the output of MetaDICT. Load ggraph for visualization: ```{r} library(ggraph) ``` Shared dictionary `D` can be used in taxa community detection. The number of columns used in this process is determined using the elbow of column-wise squared norms. ```{r} D = metadict_res4$D plot(diag(t(D)%*%(D)), ylab = "Column-wise Squared Norm") ``` The elbow value is around 20. We apply community detection method: ```{r} D_filter = D[,1:20] taxa_c = community_detection(D_filter, max_k = 5) taxa_cluster = taxa_c$cluster taxa_graph = taxa_c$graph ggraph(taxa_graph, layout = "stress") + geom_node_point(aes(color = as.factor(taxa_cluster)), size = 2) + scale_color_brewer(palette = "Set1", name = "Taxa Cluster") + theme_bw() + xlab("") + ylab("") + theme( legend.position = "right", legend.title = element_text(size = 12, face = "bold"), legend.text = element_text(size = 10) ) ``` # Session information ```{r} sessionInfo() ``` # References