--- title: "Analyze and visualize RNA-Seq data with pathlinkR" author: - Andy An - Travis Blimkie output: BiocStyle::html_document: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Analyze and visualize RNA-Seq data with pathlinkR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo=TRUE, comment="", dev="png") ``` This vignette contains some minimal examples for the main **pathlinkR** functions; for more complete documentation, please see our [Github pages](https://hancockinformatics.github.io/pathlinkR/). # Introduction Often times, gene expression studies such as microarrays and RNA-Seq result in hundreds to thousands of differentially expressed genes (DEGs). It becomes very difficult to understand the biological significance of such massive data sets, especially when there are multiple conditions and comparisons being analyzed. This package facilitates visualization and downstream analyses of differential gene expression results, using pathway enrichment and protein-protein interaction networks, to aid researchers in uncovering underlying biology and pathophysiology from their gene expression studies. We have included an example data set of gene expression results in this package as the object `exampleDESeqResults`. This is a list of 2 data frames, generated using the `results()` functions from the package `r BiocStyle::Biocpkg("DESeq2")` (Love et al. 2014). The data is from an RNA-Seq study investigating COVID-19 and non-COVID-19 sepsis patients at admission (T1) compared to approximately1 week later (T2) in the ICU, indexed over time (i.e., T2 vs T1) (An et al. 2023). # Installation To install and load the package: ```{r load_library, message=FALSE} # We'll also be using some functions from dplyr # BiocManager::install("pathlinkR", version="devel") library(dplyr) library(pathlinkR) ``` # Visualizing RNA-Seq data with volcano plots One of the first visualizations commonly performed with gene expression studies is to identify the number of DEGs. These are typically defined using specific cutoffs for both fold change and statistical significance. Thresholds of adjusted p-value <0.05 and absolute fold change >1.5 are used as the default, though any value can be specified. **pathlinkR** includes the function `eruption()` to create a volcano plot. ```{r datasets_and_preliminary_volcano_plot, message=FALSE} ## A quick look at the DESeq2 results table data("exampleDESeqResults") knitr::kable(head(exampleDESeqResults[[1]])) ## Generate a volcano plot from the first data frame, with default thresholds eruption( rnaseqResult=exampleDESeqResults[[1]], title=names(exampleDESeqResults[1]) ) ``` There are multiple options available for customizing this volcano plot, including: - Adjust the cutoffs - Switch the colours of significant and non-significant genes - Change the number of labelled genes - Adjust the x- and y-axis limits - Plot either log2 or non-log2 fold change (for a better sense of magnitude) - Highlight specific gene sets of interest # Visualizing fold changes across comparisons In addition to creating volcano plots, we can also visualize our DEGs using heatmaps of genes involved in a specific pathways (e.g. one identified as significant by `pathwayEnrichment()`). The function `plotFoldChange()` accomplishes this by taking in an input list of `DESeq2::results()` data frames, just like `pathwayEnrichment()`, and creating a heatmap of fold changes for the constituent genes. ```{r plot_fold_changes_I, fig.height=8} plotFoldChange( inputList=exampleDESeqResults, pathName="Interferon alpha/beta signaling" ) ``` A number of options are provided for customization, including: - Shorten the names of each comparison - Provide a custom list of genes to visualize, and add an informative title - Use log2 fold changes in the heatmap cells - Swap the rows and columns from the default, and add a split between the rows (comparisons) # Building and visualizing PPI networks **pathlinkR** includes tools for constructing and visualizing Protein-Protein Interaction (PPI) networks. Here we leverage PPI data gathered from [InnateDB](https://www.innatedb.com/) to generate a list of interactions among DE genes identified in gene expression analyses. These interactions can then be used to build PPI networks within R, with multiple options for controlling the type of network, such as support for first, minimum, or zero order networks. The two main functions used to accomplish this are `ppiBuildNetwork()` and `ppiPlotNetwork()`. Let's continue looking at the DEGs from the COVID positive patients over time, using the significant DEGs to build a PPI network. Since the data frame we're inputting includes all measured genes (not just the significant ones), we'll use the `filterInput=TRUE` option to ensure the network is made only with those genes which pass the standard thresholds (defined above). Since we're visualizing a network of DEGs, let's colour the nodes to indicate the direction of their dysregulation (i.e. up- or down-regulated) by specifying `fillType="foldChange"`. ```{r ppi_networks, fig.width=10, fig.height=8, warning=FALSE} exNetwork <- ppiBuildNetwork( rnaseqResult=exampleDESeqResults[[1]], filterInput=TRUE, order="zero" ) ppiPlotNetwork( network=exNetwork, title=names(exampleDESeqResults)[1], fillColumn=LogFoldChange, fillType="foldChange", label=TRUE, labelColumn=hgncSymbol, legend=TRUE ) ``` The nodes with blue labels (e.g. STAT1, FBXO6, CDH1, etc.) are hubs within the network; i.e. those genes which have a high betweenness score. The statistic used to determine hub nodes can be set in `ppiBuildNetwork()` with the "hubMeasure" option. ## Enriching networks and extracting subnetworks **pathlinkR** includes two functions for further analyzing PPI networks. First, `ppiEnrichNetwork()` will use the node table from a network to test for enriched Reactome pathways or Hallmark gene sets (see the next section for more detail on the pathway enrichment methods): ```{r ppiEnrichNetwork, message=FALSE} exNetworkPathways <- ppiEnrichNetwork( network=exNetwork, analysis="hallmark", filterResults="default", geneUniverse = rownames(exampleDESeqResults[[1]]) ) ``` Second, the function `ppiExtractSubnetwork()` can extract a minimally-connected subnetwork from a starting network, using the genes from an enriched pathway as the "starting" nodes for extraction. For example, below we use the results from the Hallmark enrichment above to pull out a subnetwork of genes from the "Interferon Gamma Response" term, then plot this reduced network while highlighting the genes from the pathway: ```{r ppiExtractSubnetwork, warning=FALSE, message=FALSE} exSubnetwork <- ppiExtractSubnetwork( network=exNetwork, pathwayEnrichmentResult=exNetworkPathways, pathwayToExtract="INTERFERON GAMMA RESPONSE" ) ppiPlotNetwork( network=exSubnetwork, fillType="oneSided", fillColumn=degree, label=TRUE, labelColumn=hgncSymbol, legendTitle="Degree" ) ``` Alternatively you can use the "genesToExtract" argument in `ppiExtractSubnetwork()` to supply your own set of genes (as Ensembl IDs) to extract as a subnetwork. # Performing pathway enrichment The essence of pathway enrichment is the concept of over-representation: that is, are there more genes belonging to a specific pathway present in our DEG list than we would expect to find by chance? To calculate this, the simplest method is to compare the ratio of DEGs in some pathway to all DEGs, and all genes in tat pathway to all genes in all pathways in a database. **pathlinkR** mainly uses the [Reactome](https://reactome.org/) database (Fabregat et al. 2017) for this purpose. One issue that can occur with over-representation analysis is the assumption that each gene in each pathway has "equal" value in belonging to that pathway. In reality, a single protein can have multiple (and sometimes very different) functions, and belong to multiple pathways, like protein kinases. There are also pathways that have substantial overlap with cellular machinery, like the TLR pathways. This can lead to enrichment of multiple similar pathways or even unrelated "false-positives" that make parsing through the results very difficult. One solution is to use unique gene pairs, as described by the creators of the package `r BiocStyle::CRANpkg("sigora")` (Foroushani et al. 2013). This methodology decreases the number of similar and unrelated pathways from promiscuous genes, focusing more on the pathways that are likely related to the underlying biology. This approach is the default used in the **pathlinkR** function `pathwayEnrichment()`. The `pathwayEnrichment()` function takes as input a **list of data frames** (each from `DESeq2::results()`), and by default will split the genes into up- and down-regulated before performing pathway enrichment one each set. The name of the data frames in the list should indicate the comparison that was made in the DESeq2 results, as it will be used to identify the results. For `analysis="sigora"` we also need to provide a Gene Pair Signature Repository (`gpsRepo`) which contains the pathways and gene pairs to be tested. Leaving this argument to "default" will use the `reaH` GPS repository from **sigora**, containing human Reactome pathways. Alternatively one can supply their own GPS repository; see `??sigora::makeGPS()` for details on how to make one. ```{r sigora_enrichment} ## Note the structure of `exampleDESeqResults`: a named list of results from ## DESeq2 exampleDESeqResults enrichedResultsSigora <- pathwayEnrichment( inputList=exampleDESeqResults, analysis="sigora", filterInput=TRUE, gpsRepo="default" ) head(enrichedResultsSigora) ``` For those who still prefer traditional over-representation analysis, we include the option of doing so by setting `analysis="reactome"`, which uses `r BiocStyle::Biocpkg("ReactomePA")` (Yu et al. 2016). When using this method, we recommend providing a gene universe to serve as a background for the enrichment test; here we'll use all the genes which were tested for significance by DESeq2 (i.e. all genes from the count matrix), converting them to Entrez gene IDs before running the test. See the full vignette at our [Github pages](https://hancockinformatics.github.io/pathlinkR/) for details. In addition to the Reactome database used when setting `analysis` to "sigora" or "reactome", we also provide over-representation analysis using the [Hallmark gene sets](https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp) from the Molecular Signatures Database (MSigDb). These are 50 gene sets that represent "specific, well-defined biological states or processes with coherent expression" (Liberzon et al. 2015). This database provides a more high-level summary of key biological processes compared to the more granular Reactome pathways. ```{r hallmark_enrichment} enrichedResultsHm <- pathwayEnrichment( inputList=exampleDESeqResults, analysis="hallmark", filterInput=TRUE, split=TRUE ) head(enrichedResultsHm) ``` Finally, users may also use data from [KEGG](https://www.kegg.jp/) when performing enrichment analyses. This database is supported with traditional over-representation analysis by setting `analysis="kegg"`, or with a gene pair-based approach by specifying `analysis="sigora"` and `gpsRepo="kegH"`. # Plotting pathway enrichment results Now that we have (a lot of) pathway enrichment results from multiple comparisons, its time to visualize them. The function `plotPathways()` does this by grouping Reactome pathways (or Hallmark gene sets) under parent groups, and indicates if each pathway is up- or down-regulated in each comparison, making it easy to identify which pathways are shared or unique to different DEG lists. Because there are often many pathways, you can split the plot into multiple columns (up to 3), and truncate the pathway names to make the results fit more easily. Sometimes a pathway may be enriched in both up- and down-regulated genes from the same DEG list (these usually occur with larger pathways). Such occurrences are indicated by a white asterisk where the more significant (lower adjusted p-value) direction is displayed. You can also change the angle/labels of the comparisons, or add the number of DEGs in each comparison below the labels. Lastly, you can specify which pathways or top pathway groups to include for visualization. ```{r plot_pathways_I, fig.width=10, fig.height=8} pathwayPlots( pathwayEnrichmentResults=enrichedResultsSigora, columns=2 ) ``` A variety of tweaks can be applied to these plot as well: - Only show immune-related pathways based on the top pathway grouping - Change the colour scaling used for the adjusted p-values - Condense the comparison names to better fit onto the plot, and make them display horizontally - Add the number of DEGs used for enrichment below the comparison name - Increase the truncation cutoff value of pathway names for more words From these results, you can see that while many of the immune system pathways change in the same direction over time in COVID-19 and non-COVID-19 sepsis patients, a few unique ones stand out, mostly related to interferon signaling ("Interferon Signaling", "Interferon gamma signaling", "Interferon alpha/beta signaling", "ISG15 antiviral mechanism"). This likely reflects an elevated early antiviral response in COVID-19 patients that decreased over time, compared to no change in non-COVID-19 sepsis patients. # Generating networks from enriched pathways **pathlinkR** includes functions for turning the pathway enrichment results from either Reactome-based method ("sigora" or "reactomepa") into networks, using the overlap of the genes assigned to each pathway to determine their similarity to one other. In these networks, each pathway is a node, with connections or edges between them determined via a distance measure. A threshold can be set, where two pathways with a minimum similarity measure are considered connected, and would have an edge drawn between their nodes. We provide a pre-computed distance matrix of Reactome pathways, generated using Jaccard distance, but there is support for multiple distance measures to be used. Once this "foundation" of pathway interactions is created, a pathway network can be built using the `createPathnet()` function: ```{r pathway_network_I, fig.width=12, fig.height=10, warning=FALSE, message=FALSE} data("sigoraDatabase") pathwayDistancesJaccard <- getPathwayDistances(pathwayData = sigoraDatabase) startingPathways <- pathnetFoundation( mat=pathwayDistancesJaccard, maxDistance=0.8 ) # Get the enriched pathways from the "COVID Pos Over Time" comparison exPathwayNetworkInput <- enrichedResultsSigora %>% filter(comparison == "COVID Pos Over Time") myPathwayNetwork <- pathnetCreate( pathwayEnrichmentResult=exPathwayNetworkInput, foundation=startingPathways ) ``` There are two options for visualization, the first being a static network: ```{r plot_pathnet_I, warning=FALSE, fig.height=7} pathnetGGraph( myPathwayNetwork, labelProp=0.1, nodeLabelSize=3, nodeLabelOverlaps=8, segColour="red", themeBaseSize = 12 ) ``` Nodes (pathways) which are filled in are enriched pathways (i.e. those output by `pathwayEnrichment()`). Size of nodes is correlated with statistical significance, while edge thickness relates to the similarity of two connected pathways. Though this type of visualization is useful, we can also display this network using an alternate method that creates an interactive display, with the function `pathnetVisNetwork()`; see our [Github pages](https://hancockinformatics.github.io/pathlinkR/) for these details. # Supplemental materials ## Gene-pair signatures **sigora** uses a Gene-Pair Signature (GPS) Repository that stores information on which gene pairs are unique for which pathways. We recommend using the one provided by sigora, which can be loaded via `data("reaH", package = "sigora")` (Reactome Human). You can also generate your own GPS repo using sigora's own function and a custom set of pathways (e.g. from another pathway database like GO or KEGG). Please consult the sigora documentation on how to generate your custom GPS repository. ## Why are there different p-value cut-offs for sigora vs. ReactomePA/Hallmark? Because there are now multiple gene pairs vs. single genes, the gene-pair "universe" is greatly increased and it is more likely for a result to be significant. Therefore, the cutoff threshold for significance is more stringent (adjusted p-value < 0.001) and a more conservative adjustment method (Bonferroni) is used. For regular over-representation analysis, a less conservative adjustment method (Benjamini-Hochberg) is used with adjusted p-value < 0.05. These are automatically set with `filterResults="default"`. You can adjust these cut-offs by setting `filterResults` to different values between 0 and 1, or 1 if you want all the pathways (this may be useful for comparing which enriched genes appear in which comparisons, even if the enrichment is not significant). # Citations An AY, Baghela AS, Falsafi R, Lee AH, Trahtemberg U, Baker AJ, dos Santos CC, Hancock REW. Severe COVID-19 and non-COVID-19 severe sepsis converge transcriptionally after a week in the intensive care unit, indicating common disease mechanisms. Front Immunol. 2023;6(14):1167917. Fabregat A, Sidiropoulos K, Viteri G, Forner O, Marin-Garcia P, Arnau V, D’Eustachio P, Stein L, Hermjakob H. Reactome pathway analysis: a high-performance in-memory approach. BMC Bioinform. 2017;18:142. Foroushani ABK, Brinkman FSL, Lynn DJ. Pathway-GPS and sigora: identifying relevant pathways based on the over-representation of their gene-pair signatures. PeerJ. 2013;1:e229. Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1(6):417–25. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. Yu G, He QY. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Mol Biosyst. 2016;12(2):477-9. # Session information ```{r session_information, echo=FALSE} sessionInfo() ```