---
title: "GRaNIE Workflow Example"
author: "Christian Arnold, Judith Zaugg, Rim Moussa"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('GRaNIE')`"
abstract: >
In this vignette, we present GRaNIE (**G**ene **R**egul**a**tory **N**etwork **I**nference including **E**nhancers), a framework to reconstruct predictive enhancer-mediated regulatory network models that are based on integrating of expression and chromatin accessibility/activity pattern across individuals, and provide a comprehensive resource of cell-type specific gene regulatory networks for particular cell types. For an extended biological motivation, see the first section below. In the following, we summarize how to use the `GRaNIE` package in a real-world example, illustrate most features and how to work with a `GRaNIE` object. Importantly, this vignette will be continuously updated whenever new functionality becomes available or when we receive user feedback.
vignette: >
%\VignetteIndexEntry{Workflow example}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
output:
BiocStyle::html_document:
code_folding: hide
---
```{css, echo=FALSE}
pre {
max-height: 300px;
overflow-y: auto;
}
pre[class] {
max-height: 100px;
}
```
```{css, echo=FALSE}
.scroll-100 {
max-height: 100px;
overflow-y: auto;
background-color: inherit;
}
```
```{css, echo=FALSE}
.scroll-200 {
max-height: 200px;
overflow-y: auto;
background-color: inherit;
}
```
```{css, echo=FALSE}
.scroll-300 {
max-height: 300px;
overflow-y: auto;
background-color: inherit;
}
```
# Motivation and Summary
Genetic variants associated with diseases often affect non-coding regions, thus likely having a regulatory role. To understand the effects of genetic variants in these regulatory regions, identifying genes that are modulated by specific regulatory elements (REs) is crucial. The effect of gene regulatory elements, such as enhancers, is often cell-type specific, likely because the combinations of transcription factors (TFs) that are regulating a given enhancer have cell-type specific activity. This TF activity can be quantified with existing tools such as `diffTF` and captures differences in binding of a TF in open chromatin regions. Collectively, this forms a enhancer-mediated gene regulatory network (*eGRN*) with cell-type and data-specific TF-RE and RE-gene links. Here, we reconstruct such a *eGRN* using bulk RNA-seq and open chromatin (e.g., using ATAC-seq or ChIP-seq for open chromatin marks) and optionally TF activity data. Our network contains different types of links, connecting TFs to regulatory elements, the latter of which is connected to genes in the vicinity or within the same chromatin domain (*TAD*). We use a statistical framework to assign empirical FDRs and weights to all links using a permutation-based approach.
# Example data
Before we start with the package, let's retrieve some example data! For the purpose of this vignette, the data we will use is taken from [here](https://zenodo.org/record/1188300#.X370PXUzaSN) [1], has been minimally processed to meet the requirements of the `GRaNIE` package and consists of the following files:
* ATAC-seq peaks, raw counts (originally around 75,000, genome-wide, here filtered to around 60,500)
* RNA-Seq data, raw counts (originally for around 35,000 genes, here filtered to around 19,000)
* sample metadata with additional sample-specific information
In general, the dataset is from human macrophages (both naive and IFNg primed) of healthy individuals and various stimulations / infections (naive vs primed and infected with SL1344 vs not), with 4 groups in total: control/infected(SL1344) and naive/primed(IFNg)
Furthermore, the example dataset is accompanied by the following files:
* genome-wide transcription factor binding site predictions for 75 selected TFs, along with a translation table to link TF names to their corresponding Ensembl IDs. For each TF, a gzipped BED file has been created with predicted TF binding sites. The files have been generated with `PWMScan` and the `HOCOMOCO` database, see [2] for details.
# Example Workflow
In the following example, you will use the example data to construct a *eGRN* from ATAC-seq, RNA-seq data as well transcription factor (TF) data.
```{r "The original publication explaining the method and motivation in detail"
[2]: "the *ReadTheDocs* help for *diffTF*"
[3]: "the following part*"
For more parameter details, see also the R help (`?AR_classification_wrapper`).
## Save `GRaNIE` object to disk (optional)
After steps that take up a bit of time, it may make sense to store the `GRaNIE` object to disk in order to be able to restore it at any time point. This can simply be done, for example, by saving it as an `rds` file using the built-in function `saveRDS` from R to save our `GRaNIE` object in a compressed rds format.
```{r saveObject2, echo=TRUE, include=TRUE, eval=FALSE, class.output="scroll-200"}
GRN_file_outputRDS = paste0(dir_output, "/GRN.rds")
saveRDS(GRN, GRN_file_outputRDS)
```
You can then, at any time point, read it back into R with the following line:
`GRN = readRDS(GRN_file_outputRDS)`
## Add enhancer-gene connections
Let's add now the second type of connections, enhancer-genes, to connect our enhancers to genes! This can be done via the function `addConnections_peak_gene()`. This function has a few parameters, and we only touch upon a few of them here. Most importantly, the `promoterRange` specifies the *neighborhood* size, which denotes the maximum neighborhood size in bp for enhancers (for both upstream and downstream ) to find genes in vicinity and associate/correlate genes with enhancers. While the default is 250,000 bp, we here set it to just 10,000 bp for computational reasons. Also, we support the incorporation of *TADs* if available to replace the default fixed neighborhood-based approach by a more flexible, biologically-driven chromatin domain based approach. Here, we do not have TADs available, so we set it to `NULL`. For more parameter details, see the R help (`?addConnections_peak_gene`).
```{r addPeakGeneConnections, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = addConnections_peak_gene(GRN,
corMethod = "pearson", promoterRange = 10000, TADs = NULL,
nCores = 1, plotDiagnosticPlots = FALSE, plotGeneTypes = list(c("all")), forceRerun = TRUE)
```
We see from the output that almost 38,000 enhancer-gene links have been identified that match our parameters. However, only around 16,351 actually had corresponding RNA-seq data available, while RNA-seq data was missing or has been filtered for the other. This is a rather typical case, as not all known and annotated genes are included in the RNA-seq data in the first place. Similar to before, the correlations have also been performed for the permuted data.
## Quality control 3: Diagnostic plots for enhancer-gene connections
Let's now check some diagnostic plots for the enhancer-gene connections. In analogy to the other diagnostic plots that we encountered already before, we describe their interpretation and meaning in more detail in the [Package Details Vignette](packageDetails.html)!
The following plot summarizes many key QC measures we implemented:
```{r, echo=TRUE, class.output="scroll-200"}
GRN = plotDiagnosticPlots_peakGene(GRN, gene.types = list(c("protein_coding", "lincRNA")), plotAsPDF = FALSE, pages = 1)
```
Without explaining the details here, we can see from mainly the upper left plot that the enhancer-gene connections show a good signal to noise ratio in the context of our framework and assumptions indeed!
## Combine TF-enhancer and enhancer-gene connections and filter
Now that we added both TF-enhancers and enhancer-gene links to our `GRaNIE` object, we are ready to filter and combine them. So far, they are stored separately in the object for various reasons (see the Introductory Vignette for details), but ultimately, we aim for combining them to derive TF-enhancer-gene connections. To do so, we can simply run the `filterGRNAndConnectGenes()` function and filter the individual TF-enhancer and enhancer-gene links to our liking. The function has many more arguments, and we only specify a few in the example below. As before, we get a `GRaNIE` object back that now contains the merged and filtered TF-enhancer-gene connections that we can later extract. Some of the filters apply to the TF-enhancer links, some of them to the enhancer-gene links, the parameter name is intended to indicate that.
```{r combineAndFilter, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = filterGRNAndConnectGenes(GRN, TF_peak.fdr.threshold = 0.2,
peak_gene.fdr.threshold = 0.2, peak_gene.fdr.method = "BH",
gene.types = c("protein_coding", "lincRNA"),
allowMissingTFs = FALSE, allowMissingGenes = FALSE
)
```
The output shows the number of links before and after applying a particular filter that has been set for both real and permuted data. As expected and reassuringly, almost no connections remain for the permuted data, while the real data keeps around 2500 connections.
Importantly, this filtered set of connections is now also saved in the `GRN` object and the basis for most if not all downstream functions that the package offers and that we explore now! It is important to keep that in mind: Re-running the `filterGRNAndConnectGenes()` method overwrites the `all.filtered` slot in the `GRN` object, and all downstream functions have to be re-run as well.
For more parameter details, see the R help (`?filterGRNAndConnectGenes`).
## Add TF-gene correlations (optional)
Optionally, we can also include extra columns about the correlation of TF and genes directly. So far, only TF-enhancers and enhancer-genes have been correlated, but not directly TFs and genes. Based on a filtered set of TF-enhancer-gene connections, the function `add_TF_gene_correlation()` calculates the TF-gene correlation for each connection from the filtered set for which the TF is not missing.
```{r, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = add_TF_gene_correlation(GRN, corMethod = "pearson", nCores = 1, forceRerun = TRUE)
```
As can be seen from the output, the Pearson correlation for 587 TF-gene pairs has been calculated. From the around 2500 connections we obtained above, since we set the parameter `allowMissingGenes = TRUE`, for the majority of the TF-enhancer-gene connections the gene is actually missing. That is, while a TF-enhancer connection below the specified significance threshold exists, no corresponding gene could be found that connects to the same enhancer, therefore setting the gene to *NA* rather than excluding the row altogether.
For more parameter details, see the R help (`?add_TF_gene_correlation`).
Time to save our object again!
```{r saveObject, echo=TRUE, include=TRUE , eval=FALSE, class.output="scroll-200"}
GRN = deleteIntermediateData(GRN)
saveRDS(GRN, GRN_file_outputRDS)
```
## Retrieve filtered connections
We are now ready to retrieve the connections and the additional data we added to them. This can be done with the helper function `getGRNConnections()` that retrieves a data frame from a `GRaNIE` object from a particular slot. Here, we specify `all.filtered`, as we want to retrieve all filtered connections. For more parameter details, see the R help (`getGRNConnections`). Note that the first time, we assign a different variable to the return of the function (i.e., `GRN_connections.all` and NOT `GRN` as before). Importantly, we have to select a new variable as we would otherwise overwrite our `GRN` object altogether! All `get` functions from the `GRaNIE` package return an element from within the object and NOT the object itself, so please keep that in mind and always check what the functions returns before running it. You can simply do so in the R help (`?getGRNConnections`).
```{r, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN_connections.all = getGRNConnections(GRN, type = "all.filtered", include_TF_gene_correlations = TRUE)
GRN_connections.all
```
The table contains many columns, and the prefix of each column name indicates the part of the *eGRN* network that the column refers to (e.g., TFs, TF-enhancers, enhancers, enhancer-genes or genes, or TF-gene if the function `add_TF_gene_correlation()` has been run before). Data are stored in a format that minimizes the memory footprint (e.g., each character column is stored as a factor). This table can now be used for any downstream analysis, as it is just a normal data frame.
## Construct the eGRN graph
For all network-related and visualization-related functions, we need to build a graph out of the filtered connections. For this, we provide a helper function that stores the graph-structure in the object, and it can be invoked as follows:
```{r buildGraph, echo=TRUE, include=TRUE, eval = FALSE, class.output="scroll-200"}
GRN = build_eGRN_graph(GRN, forceRerun = TRUE)
```
We don't include the output here, but this function runs actually very quickly.
## Visualize the filtered *eGRN*
The `GRaNIE` package also offers a function to visualize a filtered *eGRN* network! It is very easy to invoke, but provides many options to customize the output and the way the graph is drawn. We recommend to explore the options in the R help (`?getGRNConnections`), and here just run the default visualization.
```{r visualizeGRN, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = visualizeGRN(GRN, plotAsPDF = FALSE)
```
We can see some highly connected TFs and that they actually seem to co-regulate some shared genes. The selection of TFs here for this toy dataset was based on highly connected TFs across all TF, though, so for a larger list of TFs, expect to see some TFs being not connected much or at all, though.
## Generate a connection summary for filtered connections
It is often useful to get a grasp of the general connectivity of a network and the number of connections that survive the filtering. This makes it possible to make an informed decision about which FDR to choose for TF-enhancer and enhancer-gene links, depending on how many links are retained and how many connections are needed for downstream analysis. To facilitate this and automate it, we offer the convenience function `generateStatsSummary()` that in essence iterates over different combinations of filtering parameters and calls the function `filterGRNAndConnectGenes()` once for each of them, and then records various connectivity statistics, and finally plots it by calling the function `plot_stats_connectionSummary()`. Note that running this function may take a while. Afterwards, we can graphically summarize this result in either a heatmap or a boxplot. For more parameter details, see the R help (`?generateStatsSummary` and `plot_stats_connectionSummary`).
```{r connectionSummary, echo=TRUE, include=TRUE, eval = TRUE, class.output="scroll-200"}
GRN = generateStatsSummary(GRN,
TF_peak.fdr = c(0.05, 0.1, 0.2),
TF_peak.connectionTypes = "all",
peak_gene.p_raw = NULL,
peak_gene.fdr = c(0.1, 0.2),
peak_gene.r_range = c(0,1),
allowMissingGenes = c(FALSE, TRUE),
allowMissingTFs = c(FALSE),
gene.types = c("protein_coding", "lincRNA"))
GRN = plot_stats_connectionSummary(GRN, type = "heatmap", plotAsPDF = FALSE, pages = 1:6)
GRN = plot_stats_connectionSummary(GRN, type = "boxplot", plotAsPDF = FALSE, pages = 1)
```
Here, the output is less informative and just tells us about the current progress and parameter it iterates over. We can now check the two new PDF files that have been created!
Let's start with a connection summary in form of a heatmap! There are 3 heatmap classes, one for TFs, enhancers (labeled peaks) and genes, respectively. All of them compare the number of distinct TFs, enhancers, and genes that end up in the final `eGRN` in dependence of how stringently the connections are filtered (i.e., different FDR thresholds for both TF-enhancers and enhancer-genes). In addition, the same is repeated for the permuted data, which enables you to judge the connectivity of the real data as compared to what you can expect with random data!
For TFs, we see that the numbers are generally very small because we just run the analysis with few TFs. For the permuted data, depending on the random nature of it, none or almost none connections survive the filtering. You should see much bigger differences for full TF data and not just a few selected ones.
As the output plots show, alternatively, we can also represent the connectivity in form of a boxplot, which shows the connectivity for each node or connection type (i.e. TFs, enhancers, and genes, while enhancers are split between TF-enhancer and enhancer-gene depending on whether they connect with TFs or genes, respectively), and compares this again to the random version of the `eGRN`. The PDF contains many pages, and iterates over different FDR stringency thresholds. We here show two example pages:
Not all parameter combinations (such as FDR stringencies) result in connections! Sometimes, there is no `eGRN` as no connections survived the filtering.
## Network and enrichment analyses for filtered connections
Lastly, our framework also supports various types of network and enrichment analyses that are fully integrated into the package. We offer these for the full filtered `eGRN` network as a whole (as produced by running the function `filterGRNAndConnectGenes()` before) as well as an enrichment per community.
First, a proper graph (network) structure can be build with the function `build_eGRN_graph()`, which all network and enrichment functions use subsequently.
For both the general and the community statistics and enrichment, the package can:
- calculate and plot general structure and connectivity statistics for a filtered *eGRN* (function `plotGeneralGraphStats()`) and per community (functions `calculateCommunitiesStats()` and `plotCommunitiesStats()`) ,
- ontology enrichment and visualization for genes for the full network (functions `calculateGeneralEnrichment()` and `plotGeneralEnrichment()`) as well as per community (functions `calculateCommunitiesEnrichment()` and `plotCommunitiesEnrichment()`)
All functions can be called individually, adjusted flexibly and the data is stored in the `GRaNIE` object for ultimate flexibility. In the near future, we plan to expand this set of functionality to additional enrichment analyses such as other databases (specific diseases pathways etc), so stay tuned!
`calculateCommunitiesStats()`
**For the purpose of this vignette, let's run an enrichment analysis using `GO`. here, we run it with only `GO Biological Process` (`GO_BP`), while the other `GO` ontologies are also available (i.e., `GO Molecular Function`, abbreviated `GO_MF` in the plots). We also support other, more specialized enrichment analyses (`KEGG`, `Disease Ontology`, and `Reactome`). Lastly, users can select an arbitrary combination of all supported ontologies for ultimate flexibility! More are coming soon, stay tuned!**
For user convenience, all aforementioned functions can be called at once via a designated wrapper function `performAllNetworkAnalyses()`.
Many results are produces by this convenience function, and we here show only a few of them. The function is highly customizable, all or almost all of the available parameters from the individual functions (see above) are also available in this wrapper function, see the R help (`?performAllNetworkAnalyses`) for details. In order to invoke it and save all results to several PDF files using the default options, for example, you could simply type this:
```{r allNetworkAnalyses, echo=TRUE, include=TRUE, eval = FALSE, class.output="scroll-200"}
GRN = performAllNetworkAnalyses(GRN, ontology = c("GO_BP"), outputFolder = ".", forceRerun = TRUE)
```
As this functions needs a few minutes, for the purpose of the vignette, we do not include the output of this function here. Let's, however, go through all the functions that this wrapper executes so we have a better understanding of what is actually being done. We will also plot some of the results!
First, we have to create a network representation out of the filtered connections, and there are a few options for how the network structure should look like. We here keep the default options and refer to the R help for details (`?build_eGRN_graph`).
### General network statistics
Let's, however, check some of the results that are produced! Let's start with checking some general network statistics for the full network. From the various graphs that are produced, we here select only 2 of them for demonstration purposes. First we can check the vertex distribution and gene types for the overall network to get an idea of how the network looks like. Second, we can investigate the most important TFs and genes for the network for both the TF-enhancer-gene as well as TF-gene network. Here, we here show the results for the TF-gene network only:
```{r plotGraphStats, echo=FALSE, fig.cap="General network statistics for the filtered network", class.output="scroll-200"}
GRN = plotGeneralGraphStats(GRN, plotAsPDF = FALSE, pages = c(1,6))
```
First, we see the vertex degree of TF and genes, respectively:
We can also use algorithms for measuring the influence of a node in a network (*centrality*). Here, we show the results for both TFs and genes for two different measures of centrality, *eigenvector centrality* and centrality based on node degree:
### General network enrichment
Now that we have our eGRN network, we can do various enrichment analyses. Let's start with the most obvious one: Enrichment for the whole network. Again, we are not executing the function here for reasons of time, but you should do so of course when learning how to use the package!
```{r generalEnrichment, echo=FALSE, eval = FALSE, class.output="scroll-200"}
GRN = calculateGeneralEnrichment(GRN, ontology = "GO_BP")
```
We can now plot the enrichment for the full graph. In analogy to all the other `plot` functions, a PDF with all enrichment results is produced with the default setting, but by setting `plotAsPDF` to `FALSE`, we can also plot selected results / pages directly to the currently active graphics device. In this case here, as we select only one ontology, there is only one page:
```{r plotGeneralEnrichment, echo=TRUE, fig.cap="General network enrichment for the filtered network", class.output="scroll-200"}
GRN = plotGeneralEnrichment(GRN, plotAsPDF = FALSE, pages = 1)
```
We can see that overall, cell cycle is the term with the most number of genes, and while it does not have the highest significance among all terms, it is still significant. Most of the other terms are more specialized, and point towards altered regulation of various epigenetic signaling alterations. The biological plausibility of them and how to continue after is now your challenge!
### Community network statistics and enrichment
Now, let's check whether we can identify communities within the whole network, along with community-specific enrichments.
```{r communityEnrichment, echo=FALSE, class.output="scroll-200"}
GRN = calculateCommunitiesStats(GRN)
GRN = calculateCommunitiesEnrichment(GRN, ontology = "GO_BP")
```
These functions may take a while, as enrichment is calculated for each community. Once finished, we are ready to plot the results! First, let's start with some general community statistics:
```{r plotCommunityStats, echo=FALSE, fig.cap="General statistics for the communities from the filtered network", class.output="scroll-200"}
GRN = plotCommunitiesStats(GRN, plotAsPDF = FALSE, pages = c(1,3))
```
First, we see an overview across all communities and their network sizes, and whether the links belong to a TF or gene. Second, for a selected community, we summarize which genes and TFs are most relevant for this particular community. Because the example data is rather minimal, this looks very unspectacular here: Only one TF appears in the list, and all connected genes have the same node degree also. For real datasets, this will look more interesting and diverse.
Next, let's plot the community-specific enrichment:
```{r plotCommunityEnrichment, echo=FALSE, fig.cap="Community enrichment for 3 different communities", class.output="scroll-200"}
GRN = plotCommunitiesEnrichment(GRN, plotAsPDF = FALSE, pages = c(1,2,3))
```
We also provide an overview across the whole network and all communities that lists all the enriched terms that appear in at least one enrichment analysis, so a direct comparison of the specificity and commonalities across communities and between the general network and any community is facilitated. This also shows that some terms, here more than with a full dataset, are only identified as being enriched for the full network but not within any of the communities individually. We offer this function for all terms as well as only the top 10 enriched terms per community, and we here show only the filtered version due to reasons of brevity:
```{r plotCommunityEnrichment2, echo=TRUE, fig.cap="Summary of the community enrichment", class.output="scroll-200"}
GRN = plotCommunitiesEnrichment(GRN, plotAsPDF = FALSE, pages = c(5))
```
### TF enrichment analyses
In analogy to community enrichment, we can also calculate enrichment based on TFs via their target genes they are connected to. Running a TF enrichment analyses is straight forward, with the parameter `n` we can control the number of TFs to run the enrichment for - the function runs the enrichment for the top connected TFs. Thus, `n=3` equals running the enrichment for the top 3 connected TFs. Here, we show the results for one of the TFs, *EGR1.0.A*, as well as a summary across all top 3 connected TFs, in analogy the results for the community enrichment.
Let's first calculate the TF enrichment.This may take a while. We omit the output here.
```{r TFEnrichmentCal, echo=FALSE, eval = FALSE, class.output="scroll-200"}
GRN = calculateTFEnrichment(GRN, ontology = "GO_BP")
```
Now, we can plot it, similar to the enrichment results before:
```{r TFEnrichment, echo=TRUE, fig.cap="Enrichment summary for TFs", class.output="scroll-200"}
GRN = plotTFEnrichment(GRN, plotAsPDF = FALSE, n = 3, pages = c(1,5))
```
## Wrapping up
We are now finished with the main workflow, all that is left to do is to save our `GRaNIE` object to disk so we can load it at a later time point without having to repeat the analysis. We recommend to run the convenience function `deleteIntermediateData()` beforehand that aims to reduce its size by deleting some intermediate data that may still be stored within the object. For more parameter details, see the R help (`?deleteIntermediateData`). Finally, as we did already in the middle of the workflow, we save the object finally in rds format.
```{r saveObject3, echo=TRUE, include=TRUE, eval=FALSE, class.output="scroll-200"}
GRN = deleteIntermediateData(GRN)
saveRDS(GRN, GRN_file_outputRDS)
```
# How to continue?
From here on, possibilities are endless, and you can further investigate patterns and trends in the data! We hope that the `GRaNIE` package is useful for your research and encourage you to contact us if you have any question or feature request!
# Session Info
```{r, class.output="scroll-200"}
sessionInfo()
```