---
title: "iModMix -integrative Modules for Multi-omics data"
author: "Isis Narvaez-Bandera"
date: "`r Sys.Date()`"
package: "`r packageVersion('iModMix')`"
output:
  html_document:
    toc: true
    toc_depth: 3
    number_sections: true
    theme: united
    highlight: tango
fig_width: 7
vignette: >
    %\VignetteIndexEntry{Publication-ready integration omics data including unidentified metabolites}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
    %\usepackage[utf8]{inputenc}
---

# Introduction

`iModMix` is a novel network-based approach that integrates multi-omics data, including metabolomics, proteomics, and transcriptomics data, into a unified network to unveil associations across various layers of omics data and reveal significant associations with phenotypes of interest.

iModMix can incorporate both identified and unidentified metabolites, addressing a key limitation of existing methodologies.

```{r setup, echo=FALSE}
suppressWarnings(library(knitr))

opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE)
```


# Installation

## Download the package from Bioconductor

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

BiocManager::install("iModMix")
```

Note: to install development version:

```{r getPackageDevel, eval=FALSE}
devtools::install_github("biodatalab/iModMix")
```

## Load the package into R session

```{r Load, message=FALSE}
library(iModMix)
```


## Launching the iModMix Shiny App

`iModMix` includes an interactive Shiny application that allows users to explore multi-omics integration results in a user-friendly interface. The app can be launched locally from R or accessed online.

### Launch the app locally

To launch the app from your R session, simply run:

```{r launchApp, eval=FALSE}
iModMix::run_app()
```

This will open the app in your default web browser. Make sure the package is installed and loaded before running the command.

###  Access the app online

You can also explore the app without installing anything by visiting the hosted version:
https://imodmix.moffitt.org/
This online version provides access to the same interactive features, including module visualization, enrichment analysis, and metadata exploration.


# Example to integrate 2 omics datasets

## Generate modules

### Load expression Data 1 and metadata from clear cell renal cell carcinoma

Below is an example showing the step-by-step analysis of two datasets through iModMix package: We used 24 normal and 52 tumor clear cell renal cell carcinoma (ccRCC, RC20 dataset) samples (Golkaram et al. 2022; Tang et al. 2023; Benedetti et al. 2023) as the example case study.  (https://doi.org/10.5281/zenodo.13988161).

- Data 1: It contained 904 identified metabolites from untargeted metabolomics. 
- Data 2: It contained 23001 genes from RNA-seq. Since this is purely an example of how to run the entire analysis from start to finish, we are going to limit our analysis to the first 1000 features.

Input data: 

* The expression dataset inputs must be arranged as a matrix with genes/proteins/features as rows and samples as columns
* The Metadata should be a matrix with the first column named 'Sample' (with a capitalized 'S'). The second column on are user assigned based on comparisons desired, i.e. "disease" or "Genotype"
* The Metadata should include unique sample IDs that match the sample IDs in your abundance/expression matrix files. Ensure that sample IDs are consistent across all datasets with thw same number of samples. This column enables linking omics data through samples.

```{r upload1}
# Load the package
library(iModMix)

# Get the path to the expression data file
pathMetabExp <- system.file("Example_data/ccRCC4_Data/Metab_exp.rds", package = "iModMix")

# Load the expression data
dataExp1 <- readRDS(pathMetabExp)

# Check the expression data
dataExp1[1:5, 1:5]

# Get the path to the Metadata file
pathMetadata <- system.file("Example_data/ccRCC4_Data/Metadata.rds", package = "iModMix")

# Load the Metadata
metadata <- readRDS(pathMetadata)

# Check the Metadata
head(metadata)
```

### Perform preprocessing of data
The iModMix downstream pipeline requires an input matrix of RNA-seq counts. The load_data function performs the following steps:
* Transposes the input data so that rows represent samples and columns represent features
* Applies feature selection (SD below the 25th percentile) 
* Handles missing values (KNN)
This step is optional. If the input matrix already has samples as rows, features as columns, and contains no missing values, users may skip it.


```{r Preprocess}
# Preprocess of data for iModMix upload
loadData1 <- fctLoadData(dataExp1)
loadData1[1:5, 1:5]
```

### Perform Partial correlation analysis, network construction and module eigengene calculation

Calculations for partial correlation is usually the slowest step. Partial correlation is a method of analyzing the relationship between two variables when other variables are present. Graphical Lasso (Glasso) is used to estimate the partial correlation and captures only direct associations. Below is a preview of the sparse partial correlation of the first five features. 

```{r partial_cors1}
# Perform Partial correlation
parcorData1 <- fctPartialCors(loadData1, rho = 0.25)
parcorData1[1:5, 1:5]
```
Hierarchical clustering is used to identify common neighbors between the features. Calculations are determined using the topographical overlap matrix (TOM) and are based on the sparse partial correlations. Hierarchical clustering is visualized as a dendrogram. 

The dendrogram plot is publication-ready and displays a dendrogram of features and modules. 
Axes: The vertical axis (y-axis) represents the dissimilarity between features, while the horizontal axis (x-axis) shows the modules. 
Branches: Each line in the dendrogram represents a feature. Features that are closer in the hierarchy (i.e., joined at a lower height in the dendrogram) have more similar expression profiles.


```{r hierarchical_cluster1}
# Perform hierarchical clustering
hcData1 <- fctHierarchicalCluster(parcorMat = parcorData1, tom = TRUE, minModuleSize = 10)

hcClu <- hcData1$hclustTree
hcMod <- as.matrix(hcData1$dynamicMods_numeric)
WGCNA::plotDendroAndColors(
  dendro = hcClu,
  colors = hcMod,
  dendroLabels = FALSE,
  hang = 0.03,
  addGuide = TRUE,
  guideHang = 0.05,
  groupLabels = "Modules",
  main = "Feature dendrogram and module assignments"
)
```

Let's see the cluster assignments. Hierarchical clustering generates multiple modules (clusters) to which each feature is assigned. The table below details the following columns:

* **Feature**: Feature ID.
* **cluster**: The module where the feature is assigned 
* **col**: The color used on the hierarchical clustering dendrogram.

```{r cluster_assignments1}
# Perform cluster assignments
clusterAssigData1 <- fctClusterAssignments(as.data.frame(hcData1$cluster_assignments))
head(clusterAssigData1)
```

The 904 features (identified metabolites) were assigned into 34 modules. The first principal component (PC1) is calculated for each module, referred to as an eigenfeature. Eigenfeatures are useful for

* Relating the modules to the phenotypes. 
* Obtaining the correlation between omics datasets (integration). 

```{r Eigengenes11}
# Obtain Eigenfeatures
eigenData1 <- fctEigengenes(loadData1, hcData1$cluster_assignments[, 3])
eigengenesData1 <- eigenData1$module_eigenmetab_Me
eigengenesData1[1:5, 1:5]
```

### Load expression Data 2 and perform partial correlation analysis, network construction and module eigengene calculation

Since this is purely an example of how to run the entire analysis from start to finish, we are going to limit our analysis to the first 1000 features.

```{r upload2}
# Get the path to the expression data file
pathRNAexp <- system.file("Example_data/ccRCC4_Data/RNA_exp.rds", package = "iModMix")

# Load the expression data
dataExp2 <- readRDS(pathRNAexp)
dataExp2 <- dataExp2[1:1000, ]

# Check the expression data
dataExp2[1:5, 1:5]

# Preprocessing of data for iModMix upload
loadData2 <- fctLoadData(dataExp2)
```

Calculate correlated matrix  

```{r partial_cors2}
# Partial correlation
parcorData2 <- fctPartialCors(loadData2, rho = 0.25)
parcorData2[1:5, 1:5]
```

Calculate hierarchical clustering 

```{r hierarchical_cluster2}
# Perform hierarchical clustering
hcData2 <- fctHierarchicalCluster(parcorMat = parcorData2, tom = TRUE, minModuleSize = 10)

# Access the hierarchical clustering tree and module assignments
hcClu2 <- hcData2$hclustTree
hcMod2 <- as.matrix(hcData2$dynamicMods_numeric)

# Plot the dendrogram
WGCNA::plotDendroAndColors(
  dendro = hcClu2,
  colors = hcMod2,
  dendroLabels = FALSE,
  hang = 0.03,
  addGuide = TRUE,
  guideHang = 0.05,
  groupLabels = "Modules",
  main = "Feature dendrogram and module assignments"
)
```

Perform network construction and module eigengene calculation

```{r cluster_assignments2}
# Perform cluster assignment
clusterAssigData2 <- fctClusterAssignments(as.data.frame(hcData2$cluster_assignments))
eigenData2 <- fctEigengenes(loadData2, hcData2$cluster_assignments[, 3])
eigengenesData2 <- eigenData2$module_eigenmetab_Me
eigengenesData2[1:5, 1:5]
```
## Relate modules to phenotype

Statistical analysis by Students t-test compares phenotypes chosen as variable of interest and the user can specify a significance threshold for the p-value, the default p-value is set to 0.05. The eigenfeatures of each module, determined previously, are used as predictors. A data frame is generated with the following columns 

Here, TN variable (Tumor, Normal) is selected as the phenotype variable of interest 

Boxplots are automatically generated at the bottom for significant eigenfeatures, with outliers identified as individual dots and a legend is provided to describe the compared phenotypes.


```{r perform_classification2}
# Perform Classification
ClassificationData <- fctPerformClassification(
  eigengeneData = eigengenesData1,
  metadata = metadata,
  phenotypeVariable = "TN",
  significanceThreshold = 0.09
)

ClassificationData$result[1:10, ]

# Plot BoxPlot
selectedVariable <- "TN"
levels <- unique(metadata[[selectedVariable]])
classLabel <- paste(levels, collapse = " vs ")
plot <- ClassificationData$plots[[1]]
plot <- plot +
  ggplot2::labs(
    title = classLabel, fill = as.factor(levels),
    x = "Variables",
    y = "Class"
  ) +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(angle = 90, hjust = 1)
  )
plot
```

## Multi-omics analysis: Integration of Data 1 and Data 2

This step perform the integration of the datasets previously analyzed using the eigenfeatures obtained from the steps above. 

```{r Multiomics}
# Summarize into list the different eigenfeatures
eigengenesList <- list(eigengenesData1, eigengenesData2)
clusterList <- list(clusterAssigData1, clusterAssigData2)
IntegrationData1Data2 <- fctModulesCorrelation(eigengenesList, clusterList, threshold = 0.6)
TopCorrelations <- IntegrationData1Data2[["Top_cor_Prot_metab"]]
head(TopCorrelations)

# Plot correlation
CorrelationPlot <- IntegrationData1Data2$Correlation_Plot
hist(CorrelationPlot[[1]], main = "Correlation: Data 1 / Data 2 ")

# Plot Network
nodes <- as.data.frame(IntegrationData1Data2$nodes)
edges <- as.data.frame(IntegrationData1Data2$edges)
n <- IntegrationData1Data2$n

shapes <- c("diamond", "triangle", "dot")
colors <- c("orange", "darkgreen", "darkblue")

network <- visNetwork::visNetwork(nodes = nodes, edges = edges, width = "100%", height = "800px")
network <- visNetwork::visLegend(network,
  useGroups = FALSE,
  addNodes = data.frame(
    label = paste0("Data", 1:n, " Modules"),
    shape = shapes[1:n], color = colors[1:n]
  ),
  addEdges = data.frame(label = "Correlation", shape = "line", length = 200, color = "darkgreen")
)
network <- visNetwork::visInteraction(network, navigationButtons = TRUE)
network
```

## Enrichment analysis using gene Symbol 

Enrichment analysis can be performed if annotation data for proteomics or transcriptomics is uploaded with the column Symbol available.  The drop-down menu displays available libraries for pathway analysis. Choose a library to automatically amend the dataset cluster descriptions on table below. 
For example, selecting "GO_Biological_Process_2023" will annotate cluster descriptions with enriched biological processes.

To perform enrichment analysis programmatically, you can use the function `fctAssignmentGenesEnrichr()`. Under column enriched_Term the most highly correlated pathway is displayed and in the following columns, along with enriched_Genes, and p-values as determined by Enrichr. The following code shows how to run the enrichment analysis using transcriptomics data (`data2`). This chunk is not evaluated to avoid long loading times during vignette rendering.


```{r enrichmentAnalysisCode, eval = FALSE}
# Perform enrichment analysis per module
selectedDatabase <- "GO_Biological_Process_2023"
clusterAssignmentsData2Enrich <- fctAssignmentGenesEnrichr(clusterAssignmentsProtGenes = clusterAssigData2, database = selectedDatabase)
```
 we show a representative subset of the enrichment results:
 
 Enrichment Analysis Results[1:5,]

| feature | cluster      | col      | feature_name | enriched_Term                  | enriched_Overlap | enriched_Genes | enriched_P.value | enriched_Adjusted.P.value |
|---------|--------------|----------|--------------|--------------------------------|------------------|----------------|------------------|----------------------------|
| A1BG    | cluster_000005 | #66628D | A1BG         | Cilium Disassembly             | 1/5              | HDAC6          | 0.008968455      | 0.09813736                 |
| NAT2    | cluster_000041 | #DD87B4 | NAT2         | NA                             | NA               | NA             | NA               | NA                         |
| ADA     | cluster_000017 | #91569A | ADA          | Purine Ribonucleoside Metabolic Process | 1/6              | ADA            | 0.007477447      | 0.08081181                 |
| CDH2    | cluster_000019 | #B45B76 | CDH2         | Detection Of Muscle Stretch    | 1/6              | CDH2           | 0.007179245      | 0.11473611                 |
| AKT3    | cluster_000019 | #B45B76 | AKT3         | Detection Of Muscle Stretch    | 1/6              | CDH2           | 0.007179245      | 0.11473611                 |

This table summarizes enriched biological processes associated with selected modules. The full enrichment analysis can be reproduced by uncommenting the function call above.

## Working with Bioconductor's SummarizedExperiment class

To ensure compatibility with Bioconductor standards, `iModMix` supports input data structured as `SummarizedExperiment` objects. This allows users to integrate metabolomics or other omics data using standardized containers that facilitate downstream analysis and interoperability with other Bioconductor packages.

Below is an example of how to convert your expression matrix and metadata into a `SummarizedExperiment` object:

```{r summarizedExperimentExample}
# Load required library
library(SummarizedExperiment)

# Assuming:
# - 'dataExp1' has features in rows and samples in columns
# - 'Feature_ID' is the first column of dataExp1
# - 'metadata' has sample information with 'Sample' column matching column names of dataExp1 (except Feature_ID)

# Create SummarizedExperiment object
se <- fctiModMix2SE(dataExp1, metadata)

# Check structure
se
```

# Session Information

```{r sessionInfo, echo=FALSE}
sessionInfo()
```