--- title: "scGraphVerse Case Study: Zero-Inflated Simulation and GRN Inference" author: "Francesco Cecere" output: BiocStyle::html_document: toc: true number_sections: true vignette: > %\VignetteIndexEntry{scGraphVerse Simulation Study: Sim & GRN Reconstruction} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction scGraphVerse is a modular and extensible R package for constructing, comparing, and visualizing gene regulatory networks (GRNs) from single-cell RNAseq data. It includes several inference algorithms, utilities, and visualization functions. Single-cell data presents unique challenges: high sparsity, batch variation, and the need to distinguish shared versus condition-specific regulation. scGraphVerse helps address these through multi-method inference consensus. It starts with SingleCellExperiment, Seurat or matrix-based objects, enabling flexible pipelines across diverse experimental setups. ## Why use scGraphVerse? While several GRN inference packages exist in Bioconductor (e.g. SCENIC), most are limited to single-method inference or lack support for multi-condition, multi-replicate, and multi-method comparisons. scGraphVerse adds value by: 1. Supporting multiple inference algorithms, including GENIE3, GRNBoost2, ZILGM, PCzinb, and Joint Random Forests (JRF). 2. Providing joint modeling of GRNs across multiple conditions via JRF 3. Enabling benchmarking simulations from known ground-truth GRNs with customizable ZINB-based count matrix generation. 4. Offering community analysis, including consensus GRNs and databases integration (e.g. PubMed and STRINGdb). ## How scGraphVerse works The typical scGraphVerse workflow begins with one or more count matrix, or objects like SingleCellExperiment or Seurat. These are passed to the core function infer_networks(), which runs the selected inference methods and returns an adjacency matrix or list of matrices. For benchmarking or teaching purposes, synthetic datasets can be generated using zinb_simdata() based on a known adjacency matrix. This supports method comparison across early, late, and joint integration strategies using earlyj() and compare_consensus(). An end-to-end workflow, from network inference to community detection and consensus visualization, is provided in the rest of vignettes. As intended for both high-throughput analysis and intuitive usage, scGraphVerse supports custom workflows using its modular functions. # Simulation study In this simulation study, we use **scGraphVerse** to: 1. Define a ground-truth regulatory network from high-confidence interactions. 2. Simulate zero-inflated scRNA-seq count data that respects the ground truth. 3. Infer gene regulatory networks using **GENIE3**. 4. Evaluate performance with ROC curves, precision–recall scores, and community similarity. 5. Build consensus networks and perform edge mining. ```{r setup, include=FALSE} knitr::opts_chunk$set( eval = TRUE, collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 5 ) library(scGraphVerse) ``` # 1. Load Ground-Truth Network We use a pre-computed toy adjacency matrix (`toy_adj_matrix`) as our ground truth network. This matrix represents known regulatory relationships between genes and serves as the reference for evaluating network inference performance. 
# 1. Load Ground-Truth Network

We use a pre-computed toy adjacency matrix (`toy_adj_matrix`) as our ground-truth network. This matrix represents known regulatory relationships between genes and serves as the reference for evaluating network inference performance.

```{r ground-truth-network, fig.alt="Network visualization gtruth"}
# Load the toy ground-truth adjacency matrix
data(toy_adj_matrix)
adj_truth <- toy_adj_matrix

# Visualize the network
if (requireNamespace("igraph", quietly = TRUE) &&
    requireNamespace("ggraph", quietly = TRUE)) {
    gtruth <- igraph::graph_from_adjacency_matrix(adj_truth, mode = "undirected")
    ggraph::ggraph(gtruth, layout = "fr") +
        ggraph::geom_edge_link(color = "gray") +
        ggraph::geom_node_point(color = "steelblue") +
        ggplot2::ggtitle(paste0(
            "Ground Truth: ", igraph::vcount(gtruth), " nodes, ",
            igraph::ecount(gtruth), " edges"
        )) +
        ggplot2::theme_minimal()
}
```

# 2. Simulating Zero-Inflated Count Data

Now we generate synthetic single-cell count data that respects the ground-truth network structure. We simulate two experimental conditions (batches) of 50 cells each, using different parameters to model realistic batch effects and biological variability.

The `zinb_simdata()` function creates zero-inflated negative binomial count data in which gene expression levels are influenced by the regulatory relationships defined in the ground-truth network.

**Key simulation parameters:**

- `mu_range`: Different mean expression levels across conditions
- `theta`: Overdispersion parameter (lower = more variable)
- `pi`: Zero-inflation probability (dropout rate)
- `depth_range`: Sequencing depth variation per cell

```{r simulate}
# Simulation parameters
nodes <- nrow(adj_truth)

sims <- zinb_simdata(
    n = 50,
    p = nodes,
    B = adj_truth,
    mu_range = list(c(1, 4), c(1, 7)),
    mu_noise = c(1, 3),
    theta = c(1, 0.7),
    pi = c(0.2, 0.2),
    kmat = 2,
    depth_range = c(0.8 * nodes * 3, 1.2 * nodes * 3)
)

# Transpose to cells × genes
count_matrices <- lapply(sims, t)
```

# 3. Inferring Networks with GENIE3

Next, we apply the **GENIE3** algorithm to infer gene regulatory networks from the simulated count data. GENIE3 uses random forests to predict each gene's expression from all other genes and ranks the importance of potential regulatory relationships. We process both batches to capture condition-specific and shared regulatory patterns.

**Note:** The package uses Bioconductor data structures:

- The list of count matrices is wrapped in a `MultiAssayExperiment` (MAE) object with `create_mae()`
- Network outputs are stored in `SummarizedExperiment` (SE) objects (inspected in the sketch after the code below)

**GENIE3 workflow:**

1. Create a MAE object from the list of count matrices
2. For each target gene, build a random forest model using all other genes as predictors
3. Extract feature importance scores as edge weights
4. Generate weighted adjacency matrices stored in a `SummarizedExperiment`
5. Symmetrize the matrices using mean weights to create undirected networks

```{r genie3}
# Create a MultiAssayExperiment from the count matrices
mae <- create_mae(count_matrices)

# Infer networks
networks_joint <- infer_networks(
    count_matrices_list = mae,
    method = "GENIE3",
    nCores = 1
)

# Generate weighted adjacency matrices (returns a SummarizedExperiment)
wadj_se <- generate_adjacency(networks_joint)

# Symmetrize weights (returns a SummarizedExperiment)
swadj_se <- symmetrize(wadj_se, weight_function = "mean")
```
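Because the results live in standard Bioconductor containers, they can be examined with the usual `SummarizedExperiment` accessors. The sketch below is not run and is illustrative only: it assumes the symmetrized object exposes one assay per simulated condition, which may not match the exact layout returned by `symmetrize()`.

```{r inspect-se, eval=FALSE}
# Not run: inspect the symmetrized weighted adjacency matrices.
# Assumes one assay per simulated condition; adapt to the actual layout.
library(SummarizedExperiment)

assayNames(swadj_se)     # names of the stored networks (assumed layout)
w1 <- assay(swadj_se, 1) # first weighted adjacency matrix
dim(w1)                  # genes x genes
summary(as.vector(w1))   # distribution of edge weights
```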
# 4. ROC Curve and AUC

Now we evaluate how well the inferred networks recover the ground-truth regulatory relationships using ROC curve analysis. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) across different edge-weight thresholds. A perfect classifier achieves AUC = 1.0, while random guessing gives AUC = 0.5.

**ROC Analysis Steps:**

1. Use the continuous edge weights from the symmetrized adjacency matrices
2. Compare them against the binary ground-truth network
3. Calculate the area under the curve (AUC) as an overall performance metric
4. Higher AUC indicates better recovery of true regulatory edges

```{r roc, fig.alt="ROC curve"}
if (requireNamespace("pROC", quietly = TRUE)) {
    roc_res <- plotROC(
        swadj_se,
        adj_truth,
        plot_title = "ROC Curve: GENIE3 Network Inference",
        is_binary = FALSE
    )
    roc_res$plot
    auc_joint <- roc_res$auc
}
```

## 4.1. Precision–Recall and Graph Visualization

To convert the continuous edge weights into binary networks, we need appropriate cutoff thresholds. The `cutoff_adjacency()` function uses a data-driven approach: it creates shuffled versions of the count data, infers networks from these null datasets, and sets the threshold at the 95th percentile of the shuffled edge weights. This keeps only edges that are much stronger than expected by chance.

**Cutoff Method:**

1. Generate shuffled count matrices (randomize gene expression profiles)
2. Run GENIE3 on the shuffled data to obtain a null distribution of edge weights
3. Set the threshold at the 95th percentile of the shuffled weights
4. Apply the threshold to the original networks to obtain binary adjacency matrices
5. Calculate precision metrics: TPR (sensitivity), FPR, precision, and F1-score

```{r cutoff, fig.alt="result Graphs"}
# Binary cutoff at the 95th percentile (returns a SummarizedExperiment)
binary_se <- cutoff_adjacency(
    count_matrices = mae,
    weighted_adjm_list = swadj_se,
    n = 1,
    method = "GENIE3",
    quantile_threshold = 0.95,
    nCores = 1,
    debug = TRUE
)

# Precision scores
pscores_joint <- pscores(adj_truth, binary_se)
head(pscores_joint)

# Network plot
if (requireNamespace("ggraph", quietly = TRUE)) {
    plotg(binary_se)
}
```

# 5. Consensus Networks and Community Similarity

Since we have one binary network per simulated condition, we can create a consensus network that captures edges consistently detected across conditions. The "vote" method includes an edge in the consensus only if it appears in a majority of the individual networks, making it more robust to condition-specific noise.

**Consensus Building:**

- **Vote method**: Edge included if present in a majority of the networks
- **Union method**: Edge included if present in any network
- **INet method**: Uses weighted evidence combination (more sophisticated)

```{r consensus, fig.alt="consensus Graph"}
# Consensus matrix (returns a SummarizedExperiment)
consensus <- create_consensus(binary_se, method = "vote")

# Plot the consensus network
if (requireNamespace("ggraph", quietly = TRUE)) {
    plotg(consensus)
}
```

```{r communities-adj-truth, fig.alt="Community detection adj_truth"}
# Compare the consensus to the truth using classify_edges and plot_network_comparison
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("ggraph", quietly = TRUE)) {
    # Wrap toy_adj_matrix in a SummarizedExperiment for classify_edges
    adj_truth_se <- build_network_se(list(ground_truth = adj_truth))

    # classify_edges expects SummarizedExperiment objects
    edge_classification <- classify_edges(
        consensus_matrix = consensus,
        reference_matrix = adj_truth_se
    )
    print(edge_classification)

    # Visualize the network comparison (show_fp = FALSE hides false positives)
    comparison_plot <- plot_network_comparison(
        edge_classification = edge_classification,
        show_fp = FALSE
    )
    print(comparison_plot)
}
```

Community detection identifies groups of highly interconnected genes that likely share biological functions or regulatory mechanisms. We'll detect communities in both the inferred consensus network and the ground-truth network, then compare their similarity using several metrics.

**Community Detection Steps:**

1. Apply the Louvain algorithm to identify network modules (illustrated with plain igraph in the sketch below)
2. Compare community assignments using multiple similarity measures:
   - **Variation of Information (VI)**: Lower values = more similar
   - **Normalized Mutual Information (NMI)**: Higher values = more similar
   - **Adjusted Rand Index (ARI)**: Higher values = more similar
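For intuition about the Louvain step, the chunk below runs igraph's own Louvain implementation directly on the ground-truth adjacency matrix. This is a plain-igraph illustration of the idea only; `community_path()`, used next, is the package's routine, and its internals may differ.

```{r louvain-intuition}
# Plain-igraph illustration of Louvain community detection on the ground truth.
# This is independent of community_path(), which is used below.
if (requireNamespace("igraph", quietly = TRUE)) {
    g_truth <- igraph::graph_from_adjacency_matrix(adj_truth, mode = "undirected")
    louvain <- igraph::cluster_louvain(g_truth)
    table(igraph::membership(louvain)) # community sizes
}
```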
Now we'll detect communities in our ground-truth network to establish the reference community structure:

```{r communities-gtruth, fig.alt="Community detection gtruth"}
if (requireNamespace("igraph", quietly = TRUE)) {
    comm_truth <- community_path(adj_truth)
}
```

```{r communities-consensus, fig.alt="Community detection consensus"}
# Detect communities in the consensus network
if (requireNamespace("igraph", quietly = TRUE)) {
    comm_cons <- community_path(consensus)
}
```

Finally, we quantify how similar the community structures of the inferred consensus network and the ground-truth network are, using the dedicated metric functions:

```{r community-similarity, fig.alt="Community similarity metrics"}
# Calculate community metrics (VI, NMI, ARI)
if (requireNamespace("igraph", quietly = TRUE)) {
    comm_metrics <- compute_community_metrics(comm_truth, list(comm_cons))
    print(comm_metrics)

    # Calculate topology metrics (modularity, density, etc.)
    # Note: compute_topology_metrics expects community_path outputs
    # and returns a list with $topology_measures and $control_topology
    topo_metrics <- compute_topology_metrics(comm_truth, list(comm_cons))
    print(topo_metrics)

    # Visualize the community comparison
    if (requireNamespace("fmsb", quietly = TRUE)) {
        plot_community_comparison(
            community_metrics = comm_metrics,
            topology_measures = topo_metrics$topology_measures,
            control_topology = topo_metrics$control_topology
        )
    }
}
```

## 5.1. Edge Mining

Edge mining provides a detailed analysis of network reconstruction performance by categorizing each possible edge as a true positive (TP), false positive (FP), true negative (TN), or false negative (FN). This shows which specific regulatory relationships were successfully recovered. The function returns a list of data frames.

**Edge Mining Analysis:**

- **True Positives (TP)**: Correctly predicted edges that exist in the ground truth
- **False Positives (FP)**: Predicted edges that are absent from the ground truth
- **False Negatives (FN)**: Missed edges that exist in the ground truth
- **True Negatives (TN)**: Correctly identified non-edges

```{r edge-mining}
em <- edge_mining(
    consensus,
    ground_truth = adj_truth,
    query_edge_types = "TP"
)
head(em[[1]])
```

```{r sessioninfo}
sessionInfo()
```