--- title: "DspikeIn with TSE" author: "Mitra Ghotbi" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true toc_depth: 3 number_sections: true highlight: tango df_print: tibble css: vignette-style.css fontsize: 11pt mainfont: "Arial" geometry: margin=1in vignette: > %\VignetteIndexEntry{DspikeIn with TSE} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( fig.path = "figures/", dev = "png", dpi = 300, fig.width = 7, fig.height = 4, fig.align = "center", out.width = "70%", message = FALSE, warning = FALSE ) ``` # Acknowledgments The development of the DspikeIn package was made possible through the generous and pioneering efforts of the R and Bioconductor communities. We gratefully acknowledge the developers and maintainers of the following open-source packages, whose tools and infrastructure underpin our work: **Core infrastructure & data manipulation:** methods, stats, utils, graphics, grDevices, data.table, dplyr, tibble, tidyr, reshape2, matrixStats, rlang, S4Vectors, grid, officer, xml2 **Statistical analysis & modeling:** DESeq2, edgeR, limma, randomForest, microbiome **Phylogenetics & microbiome structure:** phyloseq, TreeSummarizedExperiment, SummarizedExperiment, phangorn, ape, DECIPHER, msa, Biostrings **Network and graph analysis:** igraph, ggraph **Visualization & layout design:** ggplot2, ggrepel, ggpubr, ggnewscale, ggalluvial, ggtree, ggtreeExtra, ggstar, ggridges, patchwork, scales, RColorBrewer, flextable These tools collectively empowered us to build a reproducible, modular, and extensible platform for robust absolute abundance quantification in microbial community analysis. We further acknowledge the broader scientific community working on absolute microbial quantification, spike-in calibration, and compositional data analysis, whose foundational insights directly informed the design and conceptual framework of DspikeIn. ```{r installation, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DspikeIn") library(DspikeIn) ``` # DspikeIn The DspikeIn package supports both phyloseq and TreeSummarizedExperiment formats to streamline microbial quantification across diverse experimental setups. It accommodates either a single spike-in taxon or synthetic community taxa with variable or equal spike-in volumes and copy numbers. The package offers a comprehensive suite of tools for absolute abundance (AA) quantification, addressing challenges through ten core functions: 1) validation of spiked species, 2) data preprocessing, 3) system-specific spiked species retrieval, 4) scaling factor calculation, 5) conversion to absolute abundance, 6) bias correction and normalization, 7) performance assessment, and 8) taxa exploration and filtering 9) network topology assessment 10) further analyses and visualization. For more information please check doi.org/10.1093/ismejo/wraf150 # DspikeIn requirements Important: DspikeIn operates on a taxonomy table with exactly seven ranks: Kingdom Phylum Class Order Family Genus Species. colnames(taxmat) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species") Please remove or rename extra ranks (e.g., Strain, Subfamily) before running DspikeIn. ```{r , eval=TRUE} library(SummarizedExperiment) library(DspikeIn) # DspikeIn requirements # -------------------- # DspikeIn operates on a taxonomy table with exactly seven ranks: # Kingdom Phylum Class Order Family Genus Species expected <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species") actual <- colnames(as.data.frame(SummarizedExperiment::rowData(tse))) if (!identical(expected, actual)) { stop( paste0( "DspikeIn requires exactly 7 taxonomic ranks in this order:\n ", paste(expected, collapse = " "), "\n\nYour taxonomy columns are:\n ", paste(actual, collapse = ", "), "\n\nPlease remove or rename extra ranks (e.g., 'Strain', 'Subfamily') ", "to match the required format." ) ) } ``` # To build TreeSummarizedExperiment file please refer to github.com/markrobinsonuzh/TreeSummarizedExperiment ```{r , eval=TRUE} library(phyloseq) # Get path to external data folder extdata_path <- system.file("extdata", package = "DspikeIn") list.files(extdata_path) # and data("physeq_16SOTU", package = "DspikeIn") length(taxa_names(physeq_16SOTU)) new_ids <- paste0("OTU", seq_along(taxa_names(physeq_16SOTU))) taxa_names(physeq_16SOTU) <- new_ids # Verify head(taxa_names(physeq_16SOTU)) # [1] "OTU1" "OTU2" "OTU3" "OTU4" "OTU5" "OTU6" tse_16SOTU <- convert_phyloseq_to_tse(physeq_16SOTU) tse_16SOTU <- tidy_phyloseq_tse(tse_16SOTU) ``` Do all detected sample spike-in sequences cluster with the reference, and are their branch lengths statistically similar, supporting a common ancestor? # spike-in validation All sample-derived sequences are forming a clade with the reference. We look for a monophyletic grouping of spike-in OTUs The clade is strongly supported (bootstrap around 100 percentage). The branch lengths and distances are in a biologically plausible range. ```{r} # Use the Neighbor-Joining method based on a Jukes-Cantor distance matrix and plot the tree with bootstrap values. # we compare the Sanger read of Tetragenococcus halophilus with the FASTA sequence of Tetragenococcus halophilus from our phyloseq object. library(Biostrings) library(TreeSummarizedExperiment) library(SummarizedExperiment) library(DspikeIn) # Filter TSE object to keep only Bacteria and Archaea tse_16SOTU <- tse_16SOTU[ rowData(tse_16SOTU)$Kingdom %in% c("Bacteria", "Archaea"),] library(Biostrings) # # # Subset the TSE object to include only Tetragenococcus # tetragenococcus_tse <- tse_16SOTU[ # rowData(tse_16SOTU)$Genus == "Tetragenococcus" & # !is.na(rownames(tse_16SOTU)) & # rownames(tse_16SOTU) != "", ] # # ref_sequences <- referenceSeq(tetragenococcus_tse) # # # Convert to DNAStringSet if needed # ref_sequences <- Biostrings::DNAStringSet(ref_sequences) # Biostrings::writeXStringSet(ref_sequences, filepath = "Sample.fasta") ref_fasta <- system.file("extdata", "Ref.fasta", package = "DspikeIn") sample_fasta <- system.file("extdata", "Sample.fasta", package = "DspikeIn") result <- validate_spikein_clade( reference_fasta = ref_fasta, sample_fasta = sample_fasta, bootstrap = 200, output_prefix = NULL) result$tree_plot ``` # Did spike-ins behave as expected across all samples? Tip labels= OTU/ASV names Branch length numbers= Actual evolutionary distances (small = very similar) Prevalence stars How frequently the OTU occurs across samples Blue bar ring= Log10 mean abundance Outer colored tiles= The metadata variable you choose (e.g., Animal.type) ```{r fig.align='center', fig.width=7, fig.height=5, out.width='50%', dpi=120} library(ggstar) library(ggplot2) # Filter your object to only include spike-in taxa beforehand: # We changed the OTU IDs for easy detection # Big stars = detected in many samples # Small stars = rarely detected # log10(Mean Abundance) Bars= Color intensity reflects mean abundance. # The log-transformed average abundance of each OTU across all samples where it appears. # Extreme blue may signal unintended over-representation. # Metadata Ring = factor of your interest e.g. Animal.type # Each OTU is colored by where it was observed. # Branch length numbers= Actual evolutionary distances (small = very similar) library(DspikeIn) library(TreeSummarizedExperiment) # ---- 1. Subset taxa where Genus is Tetragenococcus ---- spikein_tse <- tse_16SOTU[ rowData(tse_16SOTU)$Genus == "Tetragenococcus", ] # ---- 2. Diagnostic plot (tree-based) ---- ps <- plot_spikein_tree_diagnostic( obj = spikein_tse, metadata_var = "Animal.type", save_plot = FALSE ) print(ps) ``` # Pre_processing merges monophyletic ASVs/OTUs. Why We Merge Similar ASVs/OTUs? In amplicon sequencing, multiple Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs) often map to the same taxonomic species due to: sequencing noise, PCR amplification artifacts, or minor intra-strain variations that do not represent biologically distinct taxa. When performing absolute quantitation (such as spike-in recovery), these subtle sequence variants should not be treated as independent species, since they originate from the same biological organism. Retaining them as separate entries can artificially fragment read counts and distort abundance estimates. Therefore, this function pre-processes the dataset by merging all ASVs/OTUs that belong to the same declared taxon into a single representative unit, while keeping the phylogenetic tree, taxonomy, and reference sequences consistent. This ensures that the subsequent quantitation step reflects true organism-level abundance, not technical redundancy. When to use: As the first preprocessing step before calculating spike-in recovery percentages or absolute quantitation metrics. # Spiked species and related parameters should be defined ```{r} # PREREQUISITE FOR 16S & CALCULATE SPIKED % # Define spiked species and related parameters** library(flextable) library(DspikeIn) library(TreeSummarizedExperiment) library(SummarizedExperiment) library(dplyr) # Define spike-in parameters spiked_cells <- 1847 species_name <- spiked_species <- c("Tetragenococcus_halophilus", "Tetragenococcus_sp.") merged_spiked_species <- "Tetragenococcus" # If you prefer Genus-level matching # species_name <- spiked_species <- c("Tetragenococcus") # merged_spiked_species <- "Tetragenococcus" # -------------------------------------------- # Subset taxa for the spike-in species tse_16SOTU_spiked_taxa <- tse_16SOTU[rowData(tse_16SOTU)$Species %in% species_name, ] # Subset samples based on spike-in volume (e.g., "1" or "2") tse_16SOTU_spiked_samples <- tse_16SOTU[, colData(tse_16SOTU)$spiked.volume %in% c("2", "1") ] # -------------------------------------------- # Merge spiked OTUs using the DspikeIn function # merges monophyletic ASVs/OTUs # The function Pre_processing_species() merges ASVs/OTUs of an spiked species using "sum" or "max" methods, # preserving taxonomic, phylogenetic, and sequencing data. output_rds <- file.path(tempdir(), "merged_tse_sum.rds") Spiked_16S_sum_scaled <- Pre_processing_species( tse_16SOTU_spiked_samples, species_name, merge_method = "sum", output_file = output_rds) ``` # Calculate the percentage of spiked species retrieval Why and What It Measures? Spike-in standards provide an internal quantitative reference for microbial load normalization. After merging the spike-in species, we can evaluate how many sequencing reads in each sample correspond to these known spike-in taxa. This value serves as a recovery control, reflecting both: the accuracy of DNA extraction and sequencing, and the efficiency of downstream normalization. By expressing spike-in reads as a percentage of total reads per sample, we can detect samples with technical failure (too low recovery) or overrepresentation (possible contamination or pipetting bias etc.). This metric provides an essential quality-control checkpoint: samples with spike-in recovery outside the expected range (e.g., <0.1% or >20%) may indicate experimental anomalies or uneven DNA extraction efficiency. The resulting table allows quick identification of samples passing or failing QC thresholds, ensuring that downstream absolute quantitation is based on reliable input data. ```{r} # * Calculate the percentage of spiked species retrieval per sample* library(mia) library(dplyr) library(SummarizedExperiment) # Subset TSE to remove blanks Spiked_16S_sum_scaled_filtered <- Spiked_16S_sum_scaled[, colData(Spiked_16S_sum_scaled)$sample.or.blank != "blank"] # Calculate spike-in retrieval percentage Perc <- calculate_spike_percentage( Spiked_16S_sum_scaled_filtered, merged_spiked_species, #output_file = "output_tse.docx", passed_range = c(0.1, 20) ) head(Perc) ``` # spiked species retrieval is system-dependent Spike-in controls provide a way to assess the quantitative reliability of metagenomic or amplicon sequencing workflows. However, the expected recovery percentage of spike-in reads is not universal, it depends on the sequencing platform, extraction chemistry, and the complexity of the native microbial community. As discussed in our paper (doi.org/10.1093/ismejo/wraf150), the acceptable range of spike-in recovery is system-dependent and should be determined empirically from each dataset. This function (regression_plot()) supports that step by visualizing the relationship between observed spike-in abundance and measured total reads, stratified by recovery percentage ranges. The regression_plot() generates a scatter plot of log-transformed variables (typically Observed vs Total_Reads_spiked), overlays a linear regression fit, and facets the data according to custom recovery ranges (e.g., 010%, 1030%, 3050%, etc.). Interpretation: A strong linear trend (high R, low p-value) within a range indicates consistent quantitative behavior. Deviations or flattening slopes suggest technical bias or over-amplification. The empirically stable range can then be used to adjust the passed_range argument in calculate_spike_percentage() for future analyses. Why this matters: By empirically determining recovery behavior instead of assuming a fixed threshold, users ensure that spike-in QC is tailored to their experimental conditions, improving both accuracy and comparability of absolute quantitation results. ```{r fig.align='center', fig.width=7, fig.height=5, out.width='50%', dpi=120} # The acceptable range of spiked species retrieval is system-dependent # Spiked species become centroid of the community (Distance to Centroid) # Spiked species become dominant and imbalance the community (Evenness) # What range of spiked species retrieval is appropriate for your system? # Calculate Pielou's Evenness using Shannon index and species richness (Observed) # Hill number q = 1 = exp(Shannon index), representing the effective number of equally abundant species. Hill is weighted by relative abundance, so dominant species have more influence. # Unlike Pielou's evenness, this metric is not normalized by richness and it shows Effective number of common species library(TreeSummarizedExperiment) library(SummarizedExperiment) library(dplyr) library(tibble) library(microbiome) library(mia) library(vegan) library(S4Vectors) library(ggplot2) # --- 1. Extract current metadata from TSE --- metadata <- colData(Spiked_16S_sum_scaled) %>% as.data.frame() %>% rownames_to_column("Sample") # --- 2. Add spike-in reads (Perc) --- # Ensure Perc has 'Sample' column and matching format metadata <- dplyr::left_join(metadata, Perc, by = "Sample") # --- 3. Estimate alpha diversity indices and extract --- Spiked_16S_sum_scaled <- mia::addAlpha( Spiked_16S_sum_scaled, index = c("observed", "shannon", "pielou", "hill") ) alpha_df <- colData(Spiked_16S_sum_scaled) %>% as.data.frame() %>% rownames_to_column("Sample") %>% select(Sample, observed, shannon, pielou, hill) metadata <- dplyr::left_join(metadata, alpha_df, by = "Sample") # --- 4. Compute Bray-Curtis distance to centroid --- otu_mat <- assay(Spiked_16S_sum_scaled, "counts") otu_mat <- t(otu_mat) # samples as rows otu_mat_rel <- vegan::decostand(otu_mat, method = "total") centroid_profile <- colMeans(otu_mat_rel) dist_to_centroid <- apply(otu_mat_rel, 1, function(x) { vegan::vegdist(rbind(x, centroid_profile), method = "bray")[1] }) # Match and assign distances metadata$Dist_to_Centroid <- dist_to_centroid[metadata$Sample] # --- 5. Assign updated metadata to TSE --- metadata <- column_to_rownames(metadata, var = "Sample") metadata <- metadata[colnames(Spiked_16S_sum_scaled), , drop = FALSE] # ensure correct order and size colData(Spiked_16S_sum_scaled) <- S4Vectors::DataFrame(metadata) # 4. Regression Plots: Diversity vs. Spike-in Reads # ============================================================================ # Pielous Evenness plot_object_pielou <- regression_plot( data = metadata, x_var = "pielou", y_var = "Spiked_Reads", custom_range = c(0.1, 20, 30, 40, 50, 60, 100), plot_title = "Pielous Evenness vs. Spike-in Reads" ) # Hill Number (q = 1) plot_object_hill <- regression_plot( data = metadata, x_var = "hill", y_var = "Spiked_Reads", custom_range = c(0.1, 10, 20, 30, 100), plot_title = "Hill Number (q = 1) vs. Spike-in Reads" ) plot_object_hill # Distance to Centroid plot_object_centroid <- regression_plot( data = metadata, x_var = "Dist_to_Centroid", y_var = "Spiked_Reads", custom_range = c(0.1, 20, 30, 40, 50, 60, 100), plot_title = "Distance to Global Centroid vs. Spike-in Reads" ) # Interpretation # ============================================================================ # - Pielou's evenness is normalized by richness; useful for detecting imbalance. # - Hill number q = 1 gives effective number of common species; sensitive to dominance. # - Distance to centroid in full Bray-Curtis space shows deviation from the average community. ``` # Spiked species and related parameters for ITS ```{r, eval=FALSE} # PREREQUISITE FOR ITS & CALCULATE SPIKED % # Define spiked species and related parameters** # Define the spiked species # spiked_cells <- 733 # species_name <- spiked_species <- merged_spiked_species <- "Dekkera_bruxellensis" # Subset taxa for spiked species # Dekkera <- phyloseq::subset_taxa( # physeq_ITSOTU, # Species %in% species_name) # hashcodes <- row.names(phyloseq::tax_table(Dekkera)) # Subset samples based on spiked volume # physeq_ITSOTU_spiked <- phyloseq::subset_samples(physeq_ITSOTU, spiked.volume %in% c("2", "1")) # if TSE format # tse_ITSOTU <- convert_phyloseq_to_tse(physeq_ITSOTU) # physeq_ITSOTU_spiked <- tse_ITSOTU[, tse_ITSOTU$spiked.volume %in% c("2", "1")] ``` # Calculating Scaling Factors The calculate_spikeIn_factors() function estimates bias-corrected scaling factors for absolute microbiome quantitation using spike-in standards. The ratio between known and observed spike-in counts yields a sample-specific correction factor that normalizes total microbial abundances to absolute scale. This whole-cell spike-in approach corrects for variations in DNA extraction efficiency, amplification yield, and sequencing depth, yielding a unified, biologically interpretable scale for measuring absolute microbial load. The function automatically handles:Species or genus-level spike-in identification, Safe tree and taxonomy synchronization, Volume-based scaling (via the spiked.volume field in metadata), Optional export of intermediate results for traceability (Total_Reads.csv, Spiked_Reads.csv, Scaling_Factors.csv). ```{r} # CALCULATE SCALING FACTORS result <- calculate_spikeIn_factors( Spiked_16S_sum_scaled_filtered, spiked_cells = spiked_cells, merged_spiked_species = species_name) # View extracted outputs result$spiked_species_reads # Merged spiked species name result$total_reads # Total reads detected for the spike scaling_factors <- result$scaling_factors head(scaling_factors) # Vector of scaling factors per sample ``` # Convert relative counts to absolute counts This function translates relative abundance profiles into biologically meaningful absolute counts by applying empirically derived scaling factors. Absolute quantification enables direct comparison of microbial loads across samples and treatments, avoiding compositional artifacts inherent to relative data. This transformation is fundamental for integrative multi-omics analyses, network inference, and ecological interpretation of microbiome dynamics. ```{r} # Convert relative counts to absolute counts # absolute counts=relative countssample-specific scaling factor # Convert to absolute counts library(DspikeIn) absolute <- convert_to_absolute_counts(Spiked_16S_sum_scaled_filtered, scaling_factors) # Extract processed data absolute_counts <- absolute$absolute_counts tse_absolute <- absolute$obj_adj tse_absolute <- tidy_phyloseq_tse(tse_absolute) # View absolute count data head(tse_absolute) ``` # Summary Stat This function generates a concise statistical summary of numeric variables, including measures of central tendency (mean, median), dispersion (standard deviation, standard error), and distribution (quartiles). The results can be exported as both .docx and .csv files for reporting. ```{r} # CALCULATE SPIKE PERCENTAGE & summary stat # ** Calculate spike percentage & Generate summary statistics for absolute counts** # absolute_counts is a data.frame of OTU/ASV table # Generate summary statistics for absolute counts post_eval_summary <- calculate_summary_stats_table(absolute_counts) # You may want to Back normal to check calculation accuracy # the scaling factor was computed based on spiked species reads and fixed cell count. # Multiplying the spiked species read count by this scaling factor restores the exact spiked cell count. # lets check it # BackNormal <- calculate_spike_percentage( # tse_absolute, # merged_spiked_species, # passed_range = c(0.1, 20)) ``` # Conclusion conclusion() evaluates spike-in performance and recovery, i.e., how well the spike-in behaved after normalization and how consistent it is across samples. The function relies on outputs from calculate_spike_percentage() and indirectly reflects how scaling factors and normalization affected your dataset. It is adviced to be performed after absolute abundance conversion because -> If you summarize spike-in performance before normalization, youre essentially measuring how biased your workflow was, not how successfully it was corrected. ```{r} # * Calculate the percentage of spiked species retrieval per sample* library(mia) library(dplyr) library(SummarizedExperiment) # Subset TSE to remove blanks tse_absolute_filtered <- tse_absolute[, colData(tse_absolute)$sample.or.blank != "blank"] # Generate conclusion report conc <- conclusion( tse_absolute_filtered, merged_spiked_species, max_passed_range = 20, output_path = output_path) conc$full_report # Filter to keep only the samples that passed passed_samples <- Perc$Sample[Perc$Result == "passed"] # Subset the TSE object to include only passed samples tse_passed <- tse_absolute_filtered[, colnames(tse_absolute_filtered) %in% passed_samples] dim(tse_passed) ``` # Abundance-based Core microbiome Alluvial plot illustrating the distribution of prevalent and core microbiome taxa across animal types, ecoregions, diets, and ecological modes. Flows represent absolute abundance of bacterial classes (colored by taxonomic class) across categorical variables. Dominant lineages such as Bacteroidia, Clostridia, Fusobacteria and Gammaproteobacteria remain prevalent across diverse hosts and habitats. ```{r fig.align='center', fig.width=7, fig.height=5, out.width='70%', dpi=120} pps_Abs <- DspikeIn::get_long_format_data(tse_passed) # calculation for relative abundance needs sum of total reads # total_reads <- sum(pps_Abs$Abundance) # Generate an alluvial plot alluvial_plot_abs <- alluvial_plot( data = pps_Abs, axes = c("Animal.type", "Ecoregion.III","Diet", "Animal.ecomode"), abundance_threshold = 5000, fill_variable = "Class", silent = TRUE, abundance_type = "absolute", top_taxa = 10, text_size = 3, legend_ncol = 1, custom_colors = DspikeIn::color_palette$MG_Awesome # Use the color palette from DspikeIn ) ``` # Data transform & Normalization Normalization is a crucial preprocessing step in microbiome analysis. It corrects for technical artifacts such as unequal sequencing depth, amplification bias, and compositionality, ensuring that observed differences reflect biological variation rather than data imbalance. The choice of normalization depends on the goal of your analysis, differential abundance, compositional analysis, or quick exploratory visualization. Normalization is a crucial preprocessing step in microbiome analysis. It corrects for technical artifacts such as unequal sequencing depth, amplification bias, and compositionality, ensuring that observed differences reflect biological variation rather than data imbalance. The choice of normalization depends on the **goal of your analysis**, differential abundance, compositional analysis, or quick exploratory visualization. | **Normalization Method** | **Corrects For** | **Best For** | **Notes / Comments** | |----------------------------|------------------|---------------|----------------------| | **DESeq** | Library size and count variance | Differential abundance testing | Robust, model-based normalization from *DESeq2*. | | **TMM** (Trimmed Mean of M-values) | Library size and compositional bias | Group comparisons | Widely used in *edgeR*, handles unequal composition. | | **CSS** (Cumulative Sum Scaling) | Sparse, compositional count data | Microbiome datasets with many zeros | Balances low-abundance and high-abundance taxa. | | **CLR** (Centered Log-Ratio) | Compositional structure | Ordination, diversity, correlation | Produces log-ratio data suitable for distance-based methods. | | **TC** (Total Count) | Library size | Simple exploratory work | Quick but sensitive to outliers. | | **UQ** (Upper Quartile) | Library size and extreme counts | Moderate bias correction | Uses the upper quartile of counts as scaling reference. | | **Median** | Library size | Exploratory visualization | Stable for balanced datasets. | | **RLE** (Relative Log Expression) | Sample-specific bias | Large count matrices | Used in RNA-seq and applicable to microbial data. | | **TSS** (Total Sum Scaling) | Library size | Relative abundance conversion | Similar to TC, transforms counts to proportions. | | **QN** (Quantile Normalization) | Global distribution differences | Multi-omics integration | Forces identical distributions across samples. | | **Poisson** | Count noise model | Statistical modeling | Based on Poisson GLM; often paired with inference. | | **Rarefying** | Unequal sequencing depth | Richness/evenness estimation | Not recommended for statistical tests. | - Use **`DESeq`**, **`TMM`**, or **`CSS`** for **differential abundance** testing. - Use **`CLR`** for **compositional and ordination** analyses. - Use **`TC`**, **`UQ`**, or **`Median`** for **exploratory or rapid screening**. - Always check scaling factors and library size distributions before interpretation. ```{r} # you may need to normalize/transform your data to reduce biases ps <- tse_passed # TC Normalization result_TC <- normalization_set(ps, method = "TC", groups = "Host.species") normalized_ps_TC <- result_TC$dat.normed scaling_factors_TC <- result_TC$scaling.factor # UQ Normalization # result_UQ <- normalization_set(ps, method = "UQ", groups = "Host.species") # normalized_ps_UQ <- result_UQ$dat.normed # scaling_factors_UQ <- result_UQ$scaling.factor # # # Median Normalization # result_med <- normalization_set(ps, method = "med", groups = "Host.species") # normalized_ps_med <- result_med$dat.normed # scaling_factors_med <- result_med$scaling.factor # DESeq Normalization # ps_n <- remove_zero_negative_count_samples(ps) # result_DESeq <- normalization_set(ps_n, method = "DESeq", groups = "Animal.type") # normalized_ps_DESeq <- result_DESeq$dat.normed # scaling_factors_DESeq <- result_DESeq$scaling.factor # Poisson Normalization # result_Poisson <- normalization_set(ps, method = "Poisson", groups = "Host.genus") # normalized_ps_Poisson <- result_Poisson$dat.normed # scaling_factors_Poisson <- result_Poisson$scaling.factor # Quantile Normalization # result_QN <- normalization_set(ps, method = "QN") # normalized_ps_QN <- result_QN$dat.normed # scaling_factors_QN <- result_QN$scaling.factor # TMM Normalization # result_TMM <- normalization_set(ps, method = "TMM", groups = "Animal.type") # normalized_ps_TMM <- result_TMM$dat.normed # scaling_factors_TMM <- result_TMM$scaling.factor # CLR Normalization # result_clr <- normalization_set(ps, method = "clr") # normalized_ps_clr <- result_clr$dat.normed # scaling_factors_clr <- result_clr$scaling.factor # Rarefying # result_rar <- normalization_set(ps, method = "rar") # normalized_ps_rar <- result_rar$dat.normed # scaling_factors_rar <- result_rar$scaling.factor # CSS Normalization # result_css <- normalization_set(ps, method = "css") # normalized_ps_css <- result_css$dat.normed # scaling_factors_css <- result_css$scaling.factor # # TSS Normalization # result_tss <- normalization_set(ps, method = "tss") # normalized_ps_tss <- result_tss$dat.normed # scaling_factors_tss <- result_tss$scaling.factor # RLE Normalization # result_rle <- normalization_set(ps, method = "rle") # normalized_ps_rle <- result_rle$dat.normed # scaling_factors_rle <- result_rle$scaling.factor ``` Random Forestbased feature selection identifies the most predictive ASVs or OTUs that distinguish samples according to a chosen metadata variable (e.g., host, treatment, or environment). It builds an ensemble of decision trees to evaluate how strongly each taxon contributes to accurate classification, ranking taxa by Mean Decrease in Gini impurity (importance score). This approach helps users pinpoint biologically informative taxa while reducing data dimensionality, noise, and redundancy, enabling focused downstream analyses such as differential abundance, biomarker discovery, or visualization of key community drivers.For methodological details, see Liaw & Wiener (2002), Classification and Regression by randomForest, R News, 2(3):1822. Available at: journal.r-project.org/articles/RN-2002-022/RN-2002-022.pdf # Ridge plot The ridge plot illustrates abundance distributions of the top 10 families across all samples, showing which taxa exhibit narrow, high-density peaks (consistent abundance) versus broader or multimodal patterns (context-dependent abundance) ```{r fig.align='center', fig.width=7, fig.height=5, out.width='50%', dpi=120} normalized_tse_TC<-convert_phyloseq_to_tse(normalized_ps_TC) rf_tse <- RandomForest_selected( normalized_ps_TC, response_var = "Host.genus", na_vars = c("Habitat", "Ecoregion.III", "Host.genus", "Diet") ) ridge_tse<- ridge_plot_it(rf_tse, taxrank = "Family", top_n = 10) + scale_fill_manual(values = DspikeIn::color_palette$cool_MG) ridge_tse ``` # Differential abundance (Single Pairwise) remove the spike-in sp before further analysis Note: Before performing differential abundance analysis, it is recommended to remove the spike-in species to avoid artificial bias in normalization. If you select the DESeq2 option embedded in DspikeIn, note that it internally handles all normalization and variance stabilization steps required for modeling raw count data. DESeq2 estimates size factors (accounting for library depth differences) and dispersion parameters (accounting for biological variability). Therefore, users should always provide integer, raw counts, not normalized, transformed, or rarefied data, as input. ```{r fig.align='center', fig.width=11, fig.height=7, out.width='60%', dpi=120} library(mia) library(dplyr) library(SummarizedExperiment) # ---Remove unwanted taxa by Genus, Family, or Order --- tse_filtered <- tse_absolute[ rowData(tse_absolute)$Genus != "Tetragenococcus" & rowData(tse_absolute)$Family != "Chloroplast" & rowData(tse_absolute)$Order != "Chloroplast", ] tse_caudate <- tse_filtered[, colData(tse_filtered)$Clade.Order == "Caudate"] genera_keep <- c("Desmognathus", "Plethodon", "Eurycea") tse_three_genera <- tse_caudate[, colData(tse_caudate)$Host.genus %in% genera_keep] tse_blue_ridge <- tse_three_genera[, colData(tse_three_genera)$Ecoregion.III == "Blue Ridge"] tse_desmog <- tse_blue_ridge[, colData(tse_blue_ridge)$Host.genus == "Desmognathus"] # --- Differential Abundance with DESeq2 --- results_DESeq2 <- perform_and_visualize_DA( obj = tse_desmog, method = "DESeq2", group_var = "Host.taxon", contrast = c("Desmognathus monticola", "Desmognathus imitator"), output_csv_path = "DA_DESeq2.csv", target_glom = "Genus", significance_level = 0.01) results_DESeq2$results results_DESeq2$obj_significant # Optional: Visualization results_DESeq2$bar_plot ``` Differentially abundant bacterial taxa between Desmognathus imitator and D. monticola are shown (adjusted p < 0.05). Purple bars indicate taxa enriched in D. imitator (positive LFC) and teal bars those enriched in D. monticola (negative LFC). Bar length denotes log fold change; lines show 1 SE. Distinct enrichment patterns reveal host-specific bacterial community differentiation. # Differential abundance (Multilayer Pairwise) ```{r, eval=FALSE} # DspikeIn: Differential Abundance Examples # Using perform_and_visualize_DA() with multiple contrast scenarios # 1. Single Contrast res_single <- perform_and_visualize_DA( obj = tse_absolute, method = "DESeq2", group_var = "Diet", contrast = c("Insectivore", "Carnivore"), target_glom = "Genus") # 2. Single Factor All Pairwise Contrasts col_df <- as.data.frame(colData(tse_absolute)) col_df$Host.taxon <- factor(make.names(as.character(col_df$Host.taxon))) colData(tse_absolute)$Host.taxon <- col_df$Host.taxon # Get unique DESeq2-compatible factor levels host_levels <- levels(col_df$Host.taxon) print(host_levels) # contrast list contrast_named <- list( Host.taxon = combn(host_levels, 2, simplify = FALSE)) # multiple pairwise contrasts res_multi <- perform_and_visualize_DA( obj = tse_absolute, method = "DESeq2", group_var = "Host.taxon", contrast = contrast_named, target_glom = "Genus") # 3. Single Factor Selected Contrasts contrast_list <- list( c("Insectivore", "Carnivore"), c("Omnivore", "Herbivore")) res_selected <- perform_and_visualize_DA( obj = tse_absolute, method = "DESeq2", group_var = "Diet", contrast = contrast_list, global_fdr = TRUE) # 4. Multiple Factors Selected Contrasts contrast_named <- list( Diet = list( c("Insectivore", "Carnivore"), c("Omnivore", "Carnivore") ), Animal.type = list( c("Frog", "Salamander") )) res_multi_factor <- perform_and_visualize_DA( obj = tse_absolute, method = "DESeq2", contrast = contrast_named, target_glom = "Genus", significance_level = 0.01, global_fdr = TRUE) # 5. Multiple Factors all pairwise contrasts colData(tse_absolute)$Host.taxon <- droplevels(factor(colData(tse_absolute)$Host.taxon)) colData(tse_absolute)$Habitat <- droplevels(factor(colData(tse_absolute)$Habitat)) host_levels <- levels(colData(tse_absolute)$Host.taxon) habitat_levels <- levels(colData(tse_absolute)$Habitat) # Define pairwise contrasts as a named list contrast_named <- list( Host.taxon = combn(host_levels, 2, simplify = FALSE), Habitat = combn(habitat_levels, 2, simplify = FALSE)) # Define ComboGroup variable colData(tse_absolute)$ComboGroup <- factor(interaction( colData(tse_absolute)$Animal.type, colData(tse_absolute)$Diet, drop = TRUE)) # --- Run multi-factor DESeq2 comparisons --- results_multi_factor <- perform_and_visualize_DA( obj = tse_absolute, method = "DESeq2", contrast = contrast_named, target_glom = "Genus", significance_level = 0.01) # --- Combined Group Interactions --- colData(tse_absolute)$ComboGroup <- factor(interaction( colData(tse_absolute)$Animal.type, colData(tse_absolute)$Diet, drop = TRUE)) # --- Define custom interaction contrasts --- contrast_list <- list( c("Salamander.Insectivore", "Lizard.Insectivore"), c("Salamander.Carnivore", "Snake.Carnivore"), c("Salamander.Carnivore", "Frog.Carnivore")) # --- Run combined group comparisons --- res_combo <- perform_and_visualize_DA( obj = tse_absolute, method = "DESeq2", group_var = "ComboGroup", contrast = contrast_list, target_glom = "Genus", global_fdr = TRUE) ``` # Turnover (Presence/absence analysis) taxonomic detection consistency between relative abundance (RA) and AA tables Presence/absence analysis to detect concordance between AA and RA profiles Purpose: This analysis compares taxonomic detection consistency between RA and AA datasets. By converting both OTU tables to presence/absence matrices and intersecting shared taxa and samples, we assess how many taxa are consistently detected in both datasets. Interpretation: The resulting "percent_shared" value represents the proportion of taxa jointly detected in RA and AA profiles, providing an estimate of detection concordance and turnover. Presenceabsence comparison between absolute (AA) and relative (RA) abundance datasets revealed that ~100% of taxa were shared, while the discrepancies likely reflect compositional bias introduced by relative normalization rather than true biological differences. ```{r} # -------------------------------------------- # Extract OTU matrices from TSE objects # -------------------------------------------- # relative dataset -> tse_desmog_rel # tse_16SOTU_spiked_samples <- tse_16SOTU[, colData(tse_16SOTU)$spiked.volume %in% c("2", "1") ] tse_filtered_rel <- tse_16SOTU_spiked_samples[ rowData(tse_16SOTU_spiked_samples)$Genus != "Tetragenococcus" & rowData(tse_16SOTU_spiked_samples)$Family != "Chloroplast" & rowData(tse_16SOTU_spiked_samples)$Order != "Chloroplast", ] tse_caudate_rel <- tse_filtered_rel[, colData(tse_filtered_rel)$Clade.Order == "Caudate"] genera_keep <- c("Desmognathus", "Plethodon", "Eurycea") tse_three_genera_rel <- tse_caudate[, colData(tse_caudate_rel)$Host.genus %in% genera_keep] tse_blue_ridge_rel <- tse_three_genera_rel[, colData(tse_three_genera_rel)$Ecoregion.III == "Blue Ridge"] tse_desmog_rel <- tse_blue_ridge_rel[, colData(tse_blue_ridge_rel)$Host.genus == "Desmognathus"] # absolute dataset -> tse_desmog otu_rel <- assay(tse_desmog_rel, "counts") otu_abs <- assay(tse_desmog, "counts") # Make sure samples are rows and taxa are columns otu_rel <- t(otu_rel) otu_abs <- t(otu_abs) # Ensure they are matrices otu_rel <- as.matrix(otu_rel) otu_abs <- as.matrix(otu_abs) # -------------------------------------------- # 2. Convert to Presence/Absence # -------------------------------------------- otu_rel_pa <- (otu_rel > 0) * 1 otu_abs_pa <- (otu_abs > 0) * 1 # -------------------------------------------- # 3. Identify common samples & taxa # -------------------------------------------- shared_samples <- intersect(rownames(otu_rel_pa), rownames(otu_abs_pa)) shared_taxa <- intersect(colnames(otu_rel_pa), colnames(otu_abs_pa)) # Subset to common set otu_rel_pa <- otu_rel_pa[shared_samples, shared_taxa] otu_abs_pa <- otu_abs_pa[shared_samples, shared_taxa] # -------------------------------------------- # 4. Compare presence/absence profiles # -------------------------------------------- shared_pa <- (otu_rel_pa + otu_abs_pa) == 2 total_present <- (otu_rel_pa + otu_abs_pa) >= 1 # Calculate percent agreement (global) percent_shared <- sum(shared_pa) / sum(total_present) cat("Percent of shared taxa detections (AA vs RA):", round(100 * percent_shared, 1), "%\n") ``` # Visualization ```{r fig.align='center', fig.width=8, fig.height=5, out.width='50%', dpi=120} # Visualization of community composition # Subset rows where Genus is not Tetragenococcus and not NA keep_taxa <- !is.na(rowData(tse_absolute)$Genus) & rowData(tse_absolute)$Genus != "Tetragenococcus" # Subset the TSE by keeping only those taxa (rows) tse_filtered <- tse_absolute[keep_taxa, ] # Exclude blanks tse_filtered <- tse_filtered[, colData(tse_filtered)$sample.or.blank != "blank"] # subset Salamander samples tse_sal <- tse_filtered[, colData(tse_filtered)$Animal.type == "Salamander"] Prok_OTU_sal <- tidy_phyloseq_tse(tse_sal) TB<-taxa_barplot( Prok_OTU_sal, target_glom = "Family", custom_tax_names = NULL, normalize = TRUE, treatment_variable = "Habitat", abundance_type = "relative", x_angle = 15, fill_variable = "Family", facet_variable = "Diet", top_n_taxa = 10, palette = DspikeIn::color_palette$cool_MG, legend_size = 11, legend_columns = 1, x_scale = "free", xlab = NULL) ``` # Microbial dynamics & NetWork comparision To create a microbial co-occurrence network, you can refer to the SpiecEasi package available at: SpiecEasi GitHub Repository //github.com/zdk123/SpiecEasi The weight_Network() function provides an integrated framework for exploring and quantifying microbial interaction networks. It imports a pre-computed or user-supplied GraphML network, computes key topological descriptors (e.g., modularity, connectivity, transitivity, and path length), and visualizes the graph using a force-directed layout. Node colors are assigned according to modularity classes, while edge thickness and color reflect the magnitude and sign of associations, respectively, enabling intuitive interpretation of network structure. Together, these analyses highlight how community organization and interaction patterns change under different ecological contexts, for instance, after removing hubs or specific taxa (e.g., Basidiobolus) as we showed (doi.org/10.1093/ismejo/wraf150). The resulting visualization and metrics provide complementary insights into network complexity and modular structure, serving as a quantitative basis for assessing microbiome stability and cross-domain connectivity. ```{r fig.align='center', fig.width=11, fig.height=6, out.width='80%', dpi=120} # 1. Initialization and loading NetWorks for Comparision # library(SpiecEasi) # library(ggnet) # library(igraph) library(DspikeIn) library(tidyr) library(dplyr) library(ggpubr) library(igraph) # herp.Bas is a merged phyloseq object for both bacterial and fungal domains # spiec.easi(herp.Bas, method='mb', lambda.min.ratio=1e-3, nlambda=250,pulsar.select=TRUE ) Complete <- DspikeIn::load_graphml("Complete.graphml") NoBasid <- DspikeIn::load_graphml("NoBasid.graphml") NoHubs <- DspikeIn::load_graphml("NoHubs.graphml") # degree_network -> please note that needs full path of your graphml result_kk <- DspikeIn::degree_network(load_graphml("Complete.graphml"), save_metrics = TRUE, layout_type = "stress") result_kk$metrics result <- DspikeIn::weight_Network(graph_path = "Complete.graphml") result$plot ``` Burts Theory of Structural Holes (Ronald S. Burt, 1992, 2004) & Network Closure and Redundancy (1992, 2001) Cohesion and Trust & Brokerage and Network Advantage (Burt, 2005) The BrokerageClosure Tradeoff & Redundancy and Efficiency (20072010) Non-redundant Contacts and Efficiency & Structural Autonomy (Burt, 1997) Freedom through Brokerage Ronald Burt (1992) proposed that in a network, some nodes occupy structural holes, gaps between otherwise disconnected groups or clusters. A node that bridges these holes plays a unique role: it connects information or resources between otherwise separate modules. Bridging role of Basidiobolus consistent with Burts Structural Hole Theory. The network illustrates taxon-level associations among microbial domains. Distinct bacterial and fungal subnetworks form modular clusters (colored by domain or module). Basidiobolus acts as a bridging node connecting otherwise disconnected clusters, a position corresponding to a structural hole in Burts theory (Burt, 1992). Such brokerage positions facilitate information and metabolite exchange across domains, enhancing functional redundancy and community stability. Nodes are sized by betweenness centrality, emphasizing the high structural influence of Basidiobolus in maintaining network cohesion under environmental stress. Moreover, Basidioboluss links are non-redundant; it connects functionally distinct taxa that otherwise have no direct edge. This makes it ecologically efficient in facilitating novel functional exchanges (e.g., horizontal gene transfer or cross-domain metabolic complementation). # Visualizes pairwise relationships between node-level metrics using a quadrant-based plot node_level_metrics() & quadrant_plot() functions quantify and visualize node-level structural properties within microbial association networks. By integrating multiple centrality and connectivity metrics, it characterizes each nodes influence, redundancy (based on Bert's theory), and modular affiliation, allowing comparisons of network architecture and community organization. The resulting plots and table provide complementary views of node prominence and connectivity patterns, serving as a foundation for interpreting ecological stability, functional redundancy, and cross-domain interactions. | **Metric** | **Description** | |--------------------------------|-----------------------------------------------------------| | 'Node' | Node name (character format) | | 'Degree' | Number of edges connected to the node | | 'Strength' | Sum of edge weights connected to the node | | 'Closeness' | Closeness centrality (normalized, based on shortest paths) | | 'Betweenness' | Betweenness centrality (normalized, measures control over network flow) | | 'EigenvectorCentrality' | Eigenvector centrality (importance based on connections to influential nodes) | | 'PageRank' | PageRank score (importance based on incoming links) | | 'Transitivity' | Local clustering coefficient (tendency of a node to form triangles) | | 'Coreness' | Node's coreness (from k-core decomposition) | | 'Constraint' | Burt's constraint (measures structural holes in a node's ego network) | | 'EffectiveSize' | Inverse of constraint (larger values = more non-redundant connections) | | 'Redundancy' | Sum of constraint values of a node's alters | | 'Community' | Community assignment from Louvain clustering | | 'Efficiency' | Global efficiency (average inverse shortest path length) | | 'Local_Efficiency' | Local efficiency (subgraph efficiency for a node's neighbors) | | 'Within_Module_Connectivity' | Proportion of neighbors in the same community | | 'Among_Module_Connectivity' | Proportion of neighbors in different communities | ```{r fig.align='center', fig.width=8, fig.height=6, out.width='70%', dpi=120} # Load required libraries library(igraph) library(dplyr) library(tidyr) library(ggplot2) library(ggrepel) ggplot2::margin() # 2. Metrics Calculation completeMetrics <- DspikeIn::node_level_metrics(Complete) NoHubsMetrics <- DspikeIn::node_level_metrics(NoHubs) NoBasidMetrics <- DspikeIn::node_level_metrics(NoBasid) Complete_metrics <- completeMetrics$metrics Nohub_metrics <- NoHubsMetrics$metrics Nobasid_metrics <- NoBasidMetrics$metrics Complete_metrics <- data.frame(lapply(Complete_metrics, as.character), stringsAsFactors = FALSE) Nohub_metrics <- data.frame(lapply(Nohub_metrics, as.character), stringsAsFactors = FALSE) Nobasid_metrics <- data.frame(lapply(Nobasid_metrics, as.character), stringsAsFactors = FALSE) print(igraph::vcount(Complete)) # Number of nodes in Complete network print(igraph::ecount(Complete)) # Number of edges in Complete network print(igraph::vcount(NoBasid)) print(igraph::ecount(NoBasid)) print(igraph::vcount(NoHubs)) print(igraph::ecount(NoHubs)) metrics_scaled <- bind_rows( Complete_metrics %>% mutate(Network = "Complete"), Nohub_metrics %>% mutate(Network = "NoHubs"), Nobasid_metrics %>% mutate(Network = "NoBasid") ) %>% dplyr::mutate(dplyr::across(where(is.numeric), scale)) metrics_long_scaled <- metrics_scaled %>% tidyr::pivot_longer(cols = -c(Node, Network), names_to = "Metric", values_to = "Value") # Ensure each dataset has a "Network" column before combining completeMetrics$metrics <- completeMetrics$metrics %>% mutate(Network = "Complete Network") NoHubsMetrics$metrics <- NoHubsMetrics$metrics %>% mutate(Network = "Network & Module Hubs Removed") NoBasidMetrics$metrics <- NoBasidMetrics$metrics %>% mutate(Network = "Basidiobolus Subnetwork Removed") # Combine All Data combined_data <- bind_rows( completeMetrics$metrics, NoHubsMetrics$metrics, NoBasidMetrics$metrics ) # Add Node Identifier if missing if (!"Node" %in% colnames(combined_data)) { combined_data <- combined_data %>% mutate(Node = rownames(.)) } # Convert `Network` into Factor combined_data$Network <- factor(combined_data$Network, levels = c( "Complete Network", "Network & Module Hubs Removed", "Basidiobolus Subnetwork Removed")) # Convert Data to Long Format metrics_long <- combined_data %>% pivot_longer( cols = c("Redundancy", "Efficiency", "Betweenness"), names_to = "Metric", values_to = "Value") # Define Custom Colors and Shapes network_colors <- c( "Complete Network" = "#F1E0C5", "Network & Module Hubs Removed" = "#D2A5A1", "Basidiobolus Subnetwork Removed" = "#B2C3A8") network_shapes <- c( "Complete Network" = 21, "Network & Module Hubs Removed" = 22, "Basidiobolus Subnetwork Removed" = 23) # Determine Top 30% of Nodes to Label/Optional metrics_long <- metrics_long %>% group_by(Network, Metric) %>% mutate(Label = ifelse(rank(-Value, ties.method = "random") / n() <= 0.3, Node, NA)) # ?quadrant_plot() can creat plot for indivisual network plot4 <- quadrant_plot(completeMetrics$metrics, x_metric = "Among_Module_Connectivity", y_metric = "Degree") # Customized comparision Plots create_metric_plot <- function(metric_name, data, title) { data_filtered <- data %>% filter(Metric == metric_name) median_degree <- median(data_filtered$Degree, na.rm = TRUE) median_value <- median(data_filtered$Value, na.rm = TRUE) ggplot(data_filtered, aes(x = Degree, y = Value, fill = Network)) + geom_point(aes(shape = Network), size = 3, stroke = 1, color = "black") + geom_text_repel(aes(label = Label), size = 3, max.overlaps = 50) + scale_fill_manual(values = network_colors) + scale_shape_manual(values = network_shapes) + geom_vline(xintercept = median_degree, linetype = "dashed", color = "black", size = 1) + geom_hline(yintercept = median_value, linetype = "dashed", color = "black", size = 1) + labs( title = title, x = "Degree", y = metric_name, fill = "Network", shape = "Network" ) + theme_minimal() + theme( plot.title = element_text( hjust = 0.5, size = 16, face = "bold", margin = margin(t = 10, b = 20) # Moves the title downward ), axis.title = element_text(size = 14, face = "bold"), legend.position = "top", legend.title = element_text(size = 14, face = "bold"), legend.text = element_text(size = 12) ) } # Generate Plots plot_redundancy <- create_metric_plot("Redundancy", metrics_long, "Redundancy vs. Degree Across Networks") plot_efficiency <- create_metric_plot("Efficiency", metrics_long, "Efficiency vs. Degree Across Networks") plot_betweenness <- create_metric_plot("Betweenness", metrics_long, "Betweenness vs. Degree Across Networks") # Print Plots print(plot_betweenness) ``` Betweenness-degree relationships illustrate how node influence and connectivity shift after hub or Basidiobolus subnetwork removal, identifying taxa that maintain global connectivity and serve as potential ecological keystones. # Comparision of node-level metrics Compares standardized node-level metrics across network configurations (e.g., complete vs. hub-removed). Boxplots and Wilcoxon tests illustrate how structural perturbations alter node influence and connectivity, offering insights into microbial network robustness and modular reorganization. ```{r fig.align='center', fig.width=10, fig.height=8, out.width='70%', dpi=120} # 2. Metrics Calculation result_Complete <- DspikeIn::node_level_metrics(Complete) result_NoHubs <- DspikeIn::node_level_metrics(NoHubs) result_NoBasid <- DspikeIn::node_level_metrics(NoBasid) Complete_metrics <- result_Complete$metrics Nohub_metrics <- result_NoHubs$metrics Nobasid_metrics <- result_NoBasid$metrics Complete_metrics <- data.frame(lapply(Complete_metrics, as.character), stringsAsFactors = FALSE) Nohub_metrics <- data.frame(lapply(Nohub_metrics, as.character), stringsAsFactors = FALSE) Nobasid_metrics <- data.frame(lapply(Nobasid_metrics, as.character), stringsAsFactors = FALSE) print(igraph::vcount(Complete)) # Number of nodes in Complete network print(igraph::ecount(Complete)) # Number of edges in Complete network print(igraph::vcount(NoBasid)) print(igraph::ecount(NoBasid)) print(igraph::vcount(NoHubs)) print(igraph::ecount(NoHubs)) metrics_scaled <- bind_rows( Complete_metrics %>% mutate(Network = "Complete"), Nohub_metrics %>% mutate(Network = "NoHubs"), Nobasid_metrics %>% mutate(Network = "NoBasid") ) %>% dplyr::mutate(dplyr::across(where(is.numeric), scale)) metrics_long_scaled <- metrics_scaled %>% tidyr::pivot_longer(cols = -c(Node, Network), names_to = "Metric", values_to = "Value") # Remove missing values metrics_long_scaled <- na.omit(metrics_long_scaled) # We visualize only six metrics selected_metrics <- c( "Degree", "Closeness", "Betweenness", "EigenvectorCentrality", "PageRank", "Transitivity" ) metrics_long_filtered <- metrics_long_scaled %>% filter(Metric %in% selected_metrics) %>% mutate( Value = as.numeric(as.character(Value)), Network = recode(Network, "Complete" = "Complete Network", "NoHubs" = "Network & Module Hubs Removed", "NoBasid" = "Basidiobolus Subnetwork Removed" ) ) %>% na.omit() # Remove any NA if any metrics_long_filtered$Network <- factor(metrics_long_filtered$Network, levels = c( "Complete Network", "Network & Module Hubs Removed", "Basidiobolus Subnetwork Removed" )) # DspikeIn::color_palette$light_MG network_colors <- c( "Complete Network" = "#F1E0C5", "Network & Module Hubs Removed" = "#D2A5A1", "Basidiobolus Subnetwork Removed" = "#B2C3A8") # statistical comparisons a vs b comparisons <- list( c("Complete Network", "Network & Module Hubs Removed"), c("Complete Network", "Basidiobolus Subnetwork Removed"), c("Network & Module Hubs Removed", "Basidiobolus Subnetwork Removed")) networks_in_data <- unique(metrics_long_filtered$Network) comparisons <- comparisons[sapply(comparisons, function(pair) all(pair %in% networks_in_data))] met <- ggplot(metrics_long_filtered, aes(x = Network, y = Value, fill = Network)) + geom_boxplot(outlier.shape = NA) + geom_jitter(aes(color = Network), position = position_jitter(0.2), alpha = 0.2, size = 1.5 ) + scale_fill_manual(values = network_colors) + scale_color_manual(values = network_colors) + facet_wrap(~Metric, scales = "free_y", labeller = label_wrap_gen(width = 20)) + ggpubr::stat_compare_means(method = "wilcox.test", label = "p.signif", comparisons = comparisons) + theme_minimal() + theme( axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text.x = element_text(size = 10, angle = 10, hjust = 0.8), strip.text = element_text(size = 12, margin = margin(t = 15, b = 5)), legend.position = "top", legend.text = element_text(size = 13), legend.title = element_text(size = 13, face = "bold"), plot.title = element_text(size = 13, face = "bold") ) + labs(title = "Selected Node Metrics Across Networks", fill = "Network Type", color = "Network Type") met ``` Node-level topological metrics were compared across network configurations to assess structural robustness. Removal of the Basidiobolus subnetwork altered centrality measures, particularly closeness and PageRank, while maintaining overall network coherence. # To find neighbors of the target node This function identifies both first- and second-degree neighbors of a given target node within a microbial association network. It enables users to trace direct and indirect ecological connections, revealing potential mediator taxa that bridge interactions between microbial guilds. The resulting summary table facilitates the inspection of a nodes local connectivity and extended influence zone, supporting hypothesis-driven exploration of cross-domain interactions and community structuring. ```{r} Complete <- load_graphml("Complete.graphml") result2 <- DspikeIn::extract_neighbors( graph = Complete, target_node = "OTU69:Basidiobolus_sp", mode = "all") head(result2$summary) ``` # Compute degree metrics and visualize the network This analysis evaluates the global topology of microbial association networks under different structural configurationsnamely, the complete network, a hub-removed network, and a Basidiobolus-excluded subnetwork. Rather than inspecting individual nodes, this step focuses on overall network organization, including metrics that describe connectivity, efficiency, modularity, and cohesion. The degree_network() function computes degree-based summaries and layouts for visualization, while weight_Network() extracts quantitative descriptors such as total edge weight, mean path length, and network density. Aggregating these metrics across network types enables direct comparison of topological robustness and information flow after targeted node or subnetwork removal. The resulting bar plot displays log-scaled totals of each topological descriptor, allowing users to identify how network simplification (e.g., hub or guild removal) reshapes the overall architecture and resilience of the microbial network. ```{r fig.align='center', fig.width=8, fig.height=5, out.width='60%', dpi=120} # Options: `"stress"` (default), `"graphopt"`, `"fr"` # Compute degree metrics result <- degree_network(graph_path = Complete, save_metrics = TRUE) # Compute network weights for different graph structures NH <- weight_Network(graph_path = "NoHubs.graphml") NB <- weight_Network(graph_path = "NoBasid.graphml") C <- weight_Network(graph_path = "Complete.graphml") # Extract metrics from the computed network weights CompleteM <- C$metrics NoHubsM <- NH$metrics NoBasidM <- NB$metrics # Combine metrics into a single dataframe for comparison df <- bind_rows( CompleteM %>% mutate(Group = "Complete Network"), NoHubsM %>% mutate(Group = "Network & Module Hubs Removed"), NoBasidM %>% mutate(Group = "Basidiobolus Subnetwork Removed") ) %>% pivot_longer(cols = -Group, names_to = "Metric", values_to = "Value") # Aggregate total values by metric and group df_bar <- df %>% group_by(Metric, Group) %>% summarise(Total_Value = sum(Value, na.rm = TRUE), .groups = "drop") # Order metrics by mean value (optional) metric_order <- df_bar %>% group_by(Metric) %>% summarise(mean_value = mean(Total_Value)) %>% arrange(desc(mean_value)) %>% pull(Metric) df_bar$Metric <- factor(df_bar$Metric, levels = metric_order) # Define refined color palette network_colors <- c( "Complete Network" = "#F1E0C5", "Network & Module Hubs Removed" = "#D2A5A1", "Basidiobolus Subnetwork Removed" = "#B2C3A8" ) # Create elegant bar plot pg <- ggplot(df_bar, aes(x = Metric, y = log1p(Total_Value), fill = Group)) + geom_bar(stat = "identity", position = position_dodge(width = 0.8), width = 0.7, alpha = 0.9) + scale_fill_manual(values = network_colors) + theme_minimal(base_size = 14) + theme( panel.grid.major.x = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 11, angle = 15, hjust = 1, vjust = 1, color = "gray20"), axis.text.y = element_text(size = 11, color = "gray25"), axis.title = element_text(size = 13, face = "bold"), plot.title = element_text(size = 15, face = "bold", hjust = 0.5, margin = ggplot2::margin(b = 10)), legend.position = "top", legend.title = element_blank(), legend.text = element_text(size = 12), panel.background = element_rect(fill = "white", color = NA), plot.background = element_rect(fill = "white", color = NA) ) + labs( subtitle = "Comparison of total network-level metrics (log-scaled)", x = NULL, y = "Total Value (log1p)" ) + geom_text( aes(label = round(log1p(Total_Value), 1)), position = position_dodge(width = 0.8), vjust = -0.3, size = 3.5, color = "gray25" ) ``` # Small-World Index Analysis To determine whether the complete network exhibited small-world topology, we computed the Small-World Index (SWI; ) following the quantitative framework established by Humphries M, Gurney K, 2008. PLoS ONE 4 (Eq.1) (SWI; ) = (Global clustering coefficient real/Global clustering coefficient random)/(Avg Path real/Avg Path random)) ```{r, eval=FALSE} library(igraph) library(tidygraph) library(ggraph) library(DspikeIn) # AA abundance Complete<-load_graphml("Complete.graphml") deg <- degree(Complete) hist(deg, breaks = 30, main = "Degree Distribution", xlab = "Degree") fit <- fit_power_law(deg + 1) # Avoid zero degrees print(fit) # ---- empirical network ---- g_empirical = Complete #g_empirical = Complete_Rel # ---- Calculate real network metrics ---- creal <- transitivity(g_empirical, type = "global") E(g_empirical)$weight <- abs(E(g_empirical)$weight) lreal <- mean_distance(g_empirical, directed = FALSE, unconnected = TRUE, weights = E(g_empirical)$weight) # ---- Generate 1000 Erds-Rnyi random graphs with same size ---- set.seed(42) # for reproducibility n_nodes <- vcount(g_empirical) n_edges <- ecount(g_empirical) crand_vals <- numeric(1000) lrand_vals <- numeric(1000) for (i in 1:1000) { g_rand <- sample_gnm(n = n_nodes, m = n_edges, directed = FALSE) if (!is_connected(g_rand)) next crand_vals[i] <- transitivity(g_rand, type = "global") lrand_vals[i] <- mean_distance(g_rand, directed = FALSE, unconnected = TRUE) } # ---- Calculate mean values across random graphs ---- crand <- mean(crand_vals, na.rm = TRUE) lrand <- mean(lrand_vals, na.rm = TRUE) # ---- Compute Small-World Index () ---- sigma <- (creal / crand) / (lreal / lrand) cat("Global clustering coefficient (real):", creal, "\n") cat("Average path length (real):", lreal, "\n") cat("Mean clustering coefficient (random):", crand, "\n") cat("Mean path length (random):", lrand, "\n") cat("Small-World Index ():", round(sigma, 2), "\n") ``` = 231.83 is extremely high, showing very strong small-world properties. > 1 indicates a small-world network,combining high clustering with short paths. ```{r} sessionInfo() ``` ```{r cleanup, include = FALSE} # ===================================================================== # FINAL CLEANUP: Remove files created during vignette execution # ===================================================================== # Note: This cleanup ensures no large or temporary files remain after vignette execution. # All deletions are conditional (safe) and comply with CRAN/Bioconductor policies. files_to_delete <- c( "Global_Network_Metrics.csv", "absolute_counts.csv", "Summary.csv", "post eval summary.csv", "abundance_analysis_filtered_phyloseq.rds", "abundance_analysis_filtered_tse.rds", "abundance_analysis_high_abundance_phyloseq.rds", "abundance_analysis_high_abundance_tse.rds", "abundance_analysis_low_abundance_phyloseq.rds", "abundance_analysis_low_abundance_tse.rds", "adjusted_prevalence.rds", "core_microbiome.csv", "core_microbiome.rds", "comp_spikein_patristic_distances.csv", "comp_spikein_branch_lengths.pdf", "comp_spikein_patristic_distances_hist.pdf", "merged_data.csv", "merged_data.docx", "comp_spikein.nwk", "comp_spikein_tree.png", "comp_spikein_patristic_distances_hist.pdf", "post_eval_summary.csv", "post_eval_summary.docx", "merged_physeq_max_processed.rds", "merged_physeq_sum.rds", "merged_physeq_sum_processed.rds", "spikeIn_factors_output", "plot_Spike.percentage_Kruskal.pdf", "plot_Spike.percentage_Kruskal.png", "plot_Spike.reads_Kruskal.pdf", "plot_Spike.reads_Kruskal.png", "plot_Total.reads_Kruskal.pdf", "plot_Total.reads_Kruskal.png", "proportion_adjusted_physeq.rds", "proportion_adjusted_tse.rds", "randomsubsample_Trimmed_evenDepth.rds", "summary.csv", "summary.docx", "DspikeIn-manual.tex", "output.csv", "output.docx", "tree_with_bootstrap_and_cophenetic.png" ) # Safely remove existing files if (any(file.exists(files_to_delete))) { invisible(file.remove(files_to_delete[file.exists(files_to_delete)])) } # Safely remove known temporary folders folders_to_delete <- c("spikeIn_factors_output") for (folder in folders_to_delete) { if (dir.exists(folder)) unlink(folder, recursive = TRUE, force = TRUE) } ``` # DspikeIn volume protocol Spike-in volume Protocol; The species Tetragenococcus halophilus (bacterial spike; ATCC33315) and Dekkera bruxellensis (fungal spike; WLP4642-White Labs) were selected as taxa to spike into gut microbiome samples as they were not found in an extensive collection of wildlife skin (GenBank BioProjects: PRJNA1114724, PRJNA 1114659) or gut microbiome samples. Stock cell suspensions of both microbes were grown in either static tryptic soy broth (T. halophilus) or potato dextrose broth (D. bruxellensis) for 72 hours then serially diluted and optical density (OD600) determined on a ClarioStar plate reader. Cell suspensions with an optical density of 1.0, 0.1, 0.01, 0.001 were DNA extracted using the Qiagen DNeasy Powersoil Pro Kit. These DNA isolations were used as standards to determine the proper spike in volume of cells to represent 0.1-10% of a sample (Rao et al., 2021b) Fecal pellets (3.1 1.6 mg; range = 1 5.1 mg) from an ongoing live animal study using wood frogs (Lithobates sylvaticus) were used to standardize the input material for the development of this protocol. A total of (n=9) samples were used to validate the spike in protocol. Each fecal sample was homogenized in 1mL of sterile molecular grade water then 250uL of fecal slurry was DNA extracted as above with and without spiked cells. Two approaches were used to evaluate the target spike-in of 0.1-10%, the range of effective spike-in percentage described in (Rao et al., 2021b), including 1) an expected increase of qPCR cycle threshold (Ct) value that is proportional to the amount of spiked cells and 2) the expected increase in copy number of T. halophilus and D. bruxellensis in spiked vs. unspiked samples. A standard curve was generated using a synthetic fragment of DNA for the 16S-V4 rRNA and ITS1 rDNA regions of T. halophilus and D. bruxellensis, respectively. The standard curve was used to convert Ct values into log copy number for statistical analyses (detailed approach in[2, 3]) using the formula y = -0.2426x + 10.584 for T. halophilus and y = -0.3071x + 10.349 for D. bruxellensis, where x is the average Ct for each unknown sample. Quantitative PCR (qPCR) was used to compare known copy numbers from synthetic DNA sequences of T. halophilus and D. bruxellensis to DNA extractions of T. halophilus and D. bruxellensis independently, and wood frog fecal samples with and without spiked cells. SYBR qPCR assays were run at 20ul total volume including 10ul 2X Quantabio PerfeCTa SYBR Green Fastmix, 1ul of 10uM forward and reverse primers, 1ul of ArcticEnzymes dsDNAse master mix clean up kit, and either 1ul of DNA for D. bruxellensis or 3ul for T. halophilus. Different volumes of DNA were chosen for amplification of bacteria and fungi due to previous optimization of library preparation and sequencing steps [3]. The 515F [4] and 806R [5] primers were chosen to amplify bacteria and ITS1FI2 [6] and ITS2 for fungi, as these are the same primers used during amplicon library preparation and sequencing. Cycling conditions on an Agilent AriaMX consisted of 95 C for 3 mins followed by 40 cycles of 95 C for 15 sec, 60 C for 30 sec and 72 C for 30 sec. Following amplification, a melt curve was generated under the following conditions including 95 C for 30 sec, and a melt from 60 C to 90 C increasing in resolution of 0.5 C in increments of a 5 sec soak time. To validate the spike in protocol we selected two sets of fecal samples including 360 samples from a diverse species pool of frogs, lizards, salamanders and snakes and a more targeted approach of 122 fecal samples from three genera of salamanders from the Plethodontidae. (Supplemental Table #). FFecal samples were not weighed in the field, rather, a complete fecal pellet was diluted in an equal volume of sterile water and standardized volume of fecal slurry (250L) extracted for independent samples.. A volume of 1ul T. halophilus (1847 copies) and 1ul D. bruxellensis (733 copies) were spiked into each fecal sample then DNA was extracted as above, libraries constructed, and amplicon sequenced on an Illumina MiSeq as in [7] .