---
title: "sSNAPPY: Singel Sample directioNAl Pathway Perturbation analYsis"
author:
- name: Wenjun Nora Liu
  affiliation: Dame Roma Mitchell Cancer Research Laboratories, Adelaide Medical School, University of Adelaide
  email: wenjun.liu@adelaide.edu.au
pacakge: sSNAPPY
output: 
    BiocStyle::html_document:
    toc: yes
vignette: >
  %\VignetteIndexEntry{Single Sample Directional Pathway Perturbation Analysis}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: inline
---

```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, crop = NULL)
```

# Introduction

This vignette demonstrates how to use the package `sSNAPPY` to compute directional single sample pathway perturbation scores by incorporating pathway topologies, utilize sample permutation to test the significance of individual scores and compare average pathway activities across treatments.

The package also provides a function to visualise overlap between pathway genes contained in perturbed biological pathways as network plots.

# To get ready

## Installation

The package `sSNAPPY` can be installed using the package `BiocManager`

```{r install, eval = FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
  install.packages("BiocManager")
# Other packages required in this vignette
pkg <- c("tidyverse", "magrittr", "ggplot2", "cowplot", "DT")
BiocManager::install(pkg)
BiocManager::install("sSNAPPY")
install.packages("htmltools")
```

## Load packages

```{r setup,  results="hide", warning=FALSE}
library(sSNAPPY)
library(tidyverse)
library(magrittr)
library(ggplot2)
library(cowplot)
library(DT)
library(htmltools)
```

## Load data

The example dataset used for this tutorial can be loaded with `data()` as shown below. It's a subset of data retrieved from [Singhal H et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4928895/), where ER-positive primary breast cancer tumour tissues collected from 12 patients were split into tissue fragments of equal sizes for different treatments.

For this tutorial, we are only looking at the RNA-seq data from samples that were treated with vehicle, R5020(progesterone) OR E2(estrogen) + R5020 for 48 hrs. They were from 5 different patients, giving rise to 15 samples in total. A more detailed description of the dataset can be assessed through the help page (`?logCPM_example` and `?metadata_example`).

```{r data}
data(logCPM_example)
data(metadata_example)
# check if samples included in the logCPM matrix and metadata dataframe are identical
setequal(colnames(logCPM_example), metadata_example$sample)
# View sample metadata
datatable(metadata_example,  filter = "top")
```

# `sSNAPPY` workflow

## Compute weighted single-sample logFCs (ssLogFCs)

It is expected that the logCPM matrix will be filtered to remove undetectable genes and normalised to correct for library sizes or other systematic artefacts, such as gene length or GC contents, prior to applying the `sSNAPPY` workflow. Filtration and normalisation have been performed on the example dataset.

Before single-sample logFCs (ssLogFCs) can be computed, row names of the logCPM matrix need to be converted to `entrez ID`. This is because all the pathway topology information retrieved will be in `entrez ID`. The conversion can be achieved through bioconductor packages `AnnotationHub` and `ensembldb`.

```{r logCPM_example, eval=FALSE}
head(logCPM_example)
```

To compute the ssLogFCs, samples must be in matching pairs. In our example data, treated samples were matched to the corresponding control samples derived from the same patients. Therefore the `factor` parameter of the `weight_ss_fc()` functions will be set to be "patient". The function also requires the control treatment level to be specified, which was "Vehicle" in this case.

`weight_ss_fc()` requires both the logCPM matrix and sample metadata as input. The column names of the logCPM matrix should be sample names, matching a column called `sample` in the metadata. The metadata must also contain a treatment column, and a column corresponding to the `factor` parameter (i.e.patient in this case).

```{r ssFC}
#compute weighted single sample logFCs
weightedFC <- weight_ss_fc(logCPM_example, metadata = metadata_example,
factor = "patient", control = "Vehicle")
```

The `weight_ss_fc()` function firstly computes raw ssLogFCs for each gene by subtracting logCPM values of the control sample from the logCPM values of treated samples for each patient.

It has been demonstrated previously that in RNA-seq data, lowly expressed genes turn to have a larger variance, which is also demonstrated by the plots below. To reduce the impact of this artefact, `weight_ss_fc` also weights each ssLogFCs by estimating the relationship between the variance in ssLogFCs and mean logCPM, and defining the gene-wise weight to be the inverse of the predicted variance of that gene's mean logCPM value.

```{r lowess, fig.width=8,fig.height=4}
perSample_FC <- lapply(levels(metadata_example$patient), function(x){
    temp <- logCPM_example[seq_len(1000),str_detect(colnames(logCPM_example), x)] 
    ratio <- temp[, str_detect(colnames(temp), "Vehicle", negate = TRUE)] - temp[, str_detect(colnames(temp), "Vehicle")] 
}) %>%
    do.call(cbind,.)
aveCPM <- logCPM_example[seq_len(1000),] %>%
    rowMeans() %>%
    enframe(name = "gene_id", 
            value = "aveCPM")
p1 <- perSample_FC %>%
    as.data.frame() %>%
    rownames_to_column("gene_id") %>%
    pivot_longer(cols = -"gene_id",
                 names_to = "name",
                 values_to = "logFC") %>%
    left_join(aveCPM) %>%
    ggplot(aes(aveCPM, logFC)) +
    geom_point() +
    labs(y = "sslogFC", 
         x = "Average logCPM") +
    theme(
        panel.background = element_blank()
    )
p2 <- data.frame(
    gene_id = rownames(perSample_FC),
    variance = perSample_FC %>%
        apply(1,var)) %>%
    left_join(aveCPM) %>%
    ggplot(aes(aveCPM, variance)) +
    geom_point() +
    geom_smooth(method = "loess") +
    labs(y = "Variance in ssLogFCs", 
         x = "Average logCPM") +
    theme(
        panel.background = element_blank()
    )
plot_grid(p1, p2)

```

The output of the `weight_ss_fc()` function is a list with two element, where one is the weighted ssLogFCs matrix and the other is a vector of gene-wise weights.

## Retrieve pathway topologies in the required format

*sSNAPPY* adopts the pathway perturbation scoring algorithm proposed in `r Biocpkg("SPIA")`, which makes use of gene-set topologies and gene-gene interaction to propagate pathway genes' logFCs down the topologies to compute pathway perturbation scores, where signs of scores reflect the potential directions of changes.

Therefore, pathway topology information needs to be firstly retrieved from your chosen database and converted to weight adjacency matrices, the format required to apply the scoring algorithm.

This step is achieved through a chain of functions that are part of the `r Biocpkg("grapghite")`, which have been nested into one simple function in this package called `retrieve_topology()`. Databases that are currently supported are:

```{r pathwayDatabases}
library(graphite)
graphite::pathwayDatabases() %>%
  dplyr::filter(species ==  "hsapiens") %>%
  pander::pander()
```

The retrieved topology information will be a list where each element corresponds a pathway. It's recommended to save the list as a file so this step only needs to be performed once for each database.

This vignette chose *KEGG* pathways as an example.

```{r retrieve_topology}
gsTopology <- retrieve_topology(database = "kegg")
head(names(gsTopology))
```

If only selected pathways are of interest to your research, it's possible to only retrieve the topologies of those pathways by specifying the pathway names.

```{r gsTopology_sub}
gsTopology_sub <- retrieve_topology(
  database = "kegg",
  pathwayName = c(
    "Glycolysis / Gluconeogenesis", 
    "Citrate cycle (TCA cycle)",
    "Pentose phosphate pathway"
    ))
names(gsTopology_sub)
```

## Score single sample pathway perturbation

Once the expression matrix, sample metadata and pathway topologies are all ready, gene-wise single-sample perturbation scores can be computed:

```{r}
genePertScore <- raw_gene_pert(weightedFC$logFC, gsTopology)
```

and summed to derive pathway perturbation scores for each pathway in each treated samples.

```{r}
pathwayPertScore <- pathway_pert(genePertScore)
head(pathwayPertScore)
```

## Generate null distributions of perturbation scores

To derive the empirical p-values for each single-sample perturbation scores and normalize the raw scores for comparing overall treatment effects, null distributions of scores for each pathway are generated through a sample-label permutation approach.

For each round of permutation, sample labels are randomly shuffled to derive the permuted ssLogFCs, which are then used to score pathway perturbation. We recommend performing a minimum of 1000 rounds of permutation, which means at least 8 samples are required. 

The `generate_permuted_scores()` function does not require sample metadata but the number of treatments in the study design, including the control treatment, need to be specified. In this example data, the number of treatmentS was 3 (Vehicle, E2, and E2+R5020).

The output of the `generate_permuted_scores()` function is a list where each element is a vector of permuted perturbation scores for a pathway.

The permutation step relies on the parallel computing feature provided by `r Biocpkg("BiocParallel")`. You can choose to customize the parallel back-end or stick with the default one returned by `BiocParallel::bpparam()`. Depending on the size of the data, this step can take some time to complete. If the sample size is large, we recommend users consider performing this step on an HPC.

```{r permutedScore, eval=FALSE}
permutedScore <- generate_permuted_scores(
  expreMatrix  = logCPM_example, 
  numOfTreat = 3, NB = 1000, 
  gsTopology = gsTopology, 
  weight = weightedFC$weight
)
```

## Test significant perturbation on

### single-sample level

After the empirical null distributions are generated, the median and mad of each distribution will be calculated for each pathway to convert the test single-sample perturbation scores derived from the `compute_perturbation_score()` function to robust z-scores: $$(Score - Median)/MAD$$

Two-sided p-values associated with each robust z-scores are also computed and will be corrected for multiple-testing using a user-defined approach. The default is `fdr`.

The `normalise_by_permu()` function requires the test perturbation scores and permuted perturbation scores as input.

```{r normalisedScores, eval=FALSE}
normalisedScores <- normalise_by_permu(permutedScore, ssPertScore)
```

Since the permutation step takes a long time to run and the output is too large to be included as part of the package, the results of the `normalise_by_permu` step has been pre-computed and can be loaded with:

```{r}
load(system.file("extdata", "normalisedScores.rda", package = "sSNAPPY"))
```

Pathways that were significant perturbed within individual samples are:

```{r DT_indi}
normalisedScores %>%
    dplyr::filter(adjPvalue < 0.05) %>%
    left_join(metadata_example) %>%
    mutate_at(vars(c("sample", "gs_name")), as.factor) %>%
    mutate_if(is.numeric, sprintf, fmt = '%#.4f') %>%
    mutate(Direction = ifelse(robustZ < 0, "Inhibited", "Activation")) %>%
    dplyr::select(
        sample, patient, Treatment = treatment,
        `Perturbation Score` = robustZ, Direction,
        `Gene-set name` = gs_name, 
        `P-value` = pvalue, 
        FDR = adjPvalue
    ) %>%
    datatable(
        filter = "top", 
        options = list(
            columnDefs = list(list(targets = "Direction", visible = FALSE))
        ), 
        caption = htmltools::tags$caption(
                  htmltools::em(
                      "Pathways that were significant perturbed within individual samples.")
              )
    ) %>% 
    formatStyle(
        'Perturbation Score', 'Direction',
        color = styleEqual(c("Inhibited", "Activation"), c("blue", "red"))
    )
```

#### Visualise overlap between gene-sets as networks

 To visualise significantly perturbed biological pathways as a network, where edges between gene-sets reflect how much overlap those two gene-sets share, we can use the `plot_gs_network` function. The function can take `normalise_by_permu()`'s output, or a subset of it as its direct input.

Nodes in the network plots could be colored by the predicted direction of perturbation (i.e.. sign of robust z-score):

```{r sigGS_nt_zscore, fig.width= 10, fig.height=4, fig.cap="*Pathways significantly perturbed in individual samples, where gene-sets were colored by pathways' directions of changes.*"}
pl <- normalisedScores %>%
  dplyr::filter(adjPvalue < 0.05) %>%
  split(f = .$sample) %>%
  lapply(
    plot_gs_network, 
    # layout = "dh",
    gsTopology = gsTopology, 
    colorBy = "robustZ"
  ) %>%
    lapply(function(x){
        x + theme(
        panel.grid = element_blank(), 
        panel.background = element_blank()
        ) })
plot_grid(
    plotlist = pl, 
    labels = names(pl),
    label_size = 8,
    nrow = 1)
```

Or p-values:

```{r sigGS_nt_pvalue, fig.width= 10, fig.height=4,fig.cap="*Pathways significantly perturbed in individual samples, where gene-sets were colored by pathways' p-values*"}
pl <- normalisedScores %>%
    dplyr::filter(adjPvalue < 0.05) %>%
    split(f = .$sample) %>%
    lapply(
        plot_gs_network, 
        gsTopology = gsTopology, 
        colorBy = "pvalue", 
        color_lg_title = "P-value"
    ) %>%
    lapply(function(x){
        x + theme(
        panel.grid = element_blank(), 
        panel.background = element_blank()
        ) })
plot_grid(
    plotlist = pl, 
    labels = names(pl),
    label_size = 8,
    nrow = 1)
```

The function allows you to customize the layout, colour, edge transparency and other aesthetics of the graph. More information can be found on the help page (`?plot_gs_network`). The output of the graph is a `ggplot` object and the theme of it can be changed just as any other `ggplot` figures.

### treatment-level

In addition to testing pathway perturbations at single-sample level, normalised perturbation scores can also be used to model mean treatment effects within a group, such as within each treatment group. An advantage of this method is that it has a high level of flexibility that allows us to incorporate confounding factors as co-factors or co-variates to offset their effects.

For example, in the example data-set, samples were collected from patients with different progesterone receptor (PR) statuses. Knowing that PR status would affect how tumour tissues responded to estrogen and/or progesterone treatments, we can include it as a cofactor to offset its confounding effect.

```{r fit}
fit <- normalisedScores %>%
    left_join(metadata_example) %>%
    split(f = .$gs_name) %>%
    #.["Estrogen signaling pathway"] %>%
    lapply(function(x)lm(robustZ ~ 0 + treatment + PR, data = x)) %>%
    lapply(summary)
treat_sig <- lapply(
  names(fit), 
  function(x){
    fit[[x]]$coefficients %>%
      as.data.frame() %>%
      .[seq_len(2),] %>%
      dplyr::select(Estimate, pvalue = `Pr(>|t|)` ) %>%
      rownames_to_column("Treatment") %>%
      mutate(
        gs_name = x, 
        FDR = p.adjust(pvalue, "fdr"), 
        Treatment = str_remove_all(Treatment, "treatment")
      ) 
  }) %>%
  bind_rows() 
```

Results from the linear modelling revealed pathways that were on average perturbed due to each treatment:

```{r treat_sig_DT}
treat_sig %>% 
    dplyr::filter(FDR < 0.05) %>%
    mutate_at(vars(c("Treatment", "gs_name")), as.factor) %>%
    mutate_if(is.numeric, sprintf, fmt = '%#.4f') %>%
    mutate(Direction = ifelse(Estimate < 0, "Inhibited", "Activation")) %>%
    dplyr::select(
        Treatment, `Perturbation Score` = Estimate, Direction,
        `Gene-set name` = gs_name, 
        `P-value` = pvalue, 
        FDR
    ) %>%
    datatable(
        filter = "top", 
        options = list(
            columnDefs = list(list(targets = "Direction", visible = FALSE))
        ), 
        caption = htmltools::tags$caption(
                  htmltools::em(
                      "Pathways that were significant perturbed within each treatment group.")
              )
    ) %>% 
    formatStyle(
        'Perturbation Score', 'Direction',
        color = styleEqual(c("Inhibited", "Activation"), c("blue", "red"))
    )
```

It is not surprising to see that the estrogen signalling pathway was significantly activated under both R5020 and E2+R5020 treatments.

#### Visualise genes' contributions to pathway perturbation

If there's a specific pathway that we would like to dig deeper into and explore which pathway genes played a key role in its perturbation, for example, activation of the "Ovarian steroidogenesis", we can plot the gene-level perturbation scores of all genes contained in that pathway as a heatmap and annotate each column (ie. each sample) by the direction of pathway perturbation in that sample or any other sample metadata using the `plot_gene_contribution()` function.

From the heatmap below that we can see that the activation of this pathway was driven by a few dominating genes.

```{r fig.height= 5, fig.width=8, fig.cap="*Gene-level perturbation scores of all genes in the Ovarian steroidogenesis gene-set. The activation of pathway was driven by a few dominating genes.*" }
plot_gene_contribution(
    genePertScore = genePertScore, gsToPlot = "Ovarian steroidogenesis", metadata = metadata_example, 
    annotation_attribute = c("pathwayPertScore", "treatment"), pathwayPertScore = pathwayPertScore, 
    breaks = seq(-0.0001, 0.0001, length.out = 100),
    show_rownames = FALSE
)
```

By default, genes' entrez IDs are used and plotted as row names, which may not be very informative. So the row names could be overwritten by providing a `data.frame` mapping entrez IDs to other identifiers through the `mapRownameTo` parameter. 

Mapping between different gene identifiers could be achieved through the `mapIDs()` function from the Bioconductor package [`AnnotationDbi`](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html). But to reduce the compiling time of this vignette, mappings between entrez IDs and gene names have been provided as a `data.frame` called `entrez2name`.

Note that if the mapping information was provided and the mapping was successful for some genes but not the others, only genes that have been mapped successfully will be plotted.

```{r fig.height= 5, fig.width=8, fig.cap="*Gene-level perturbation scores of all genes in the Ovarian steroidogenesis gene-set.Genes playing the most important roles in pathway activation was INS, CGA, GNAS.*"}
load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY"))
plot_gene_contribution(
    genePertScore = genePertScore, gsToPlot = "Ovarian steroidogenesis", metadata = metadata_example, 
    annotation_attribute = c("pathwayPertScore", "treatment"), pathwayPertScore = pathwayPertScore,
    breaks = seq(-0.0001, 0.0001, length.out = 100), 
    fontsize_row = 6,
    mapEntrezID = entrez2name
)
```

#### Visualise overlap between gene-sets as networks

Results of group-level perturbation can also be visualised using the `plot_gs_network()` function. We just need to change the name of the `Estimate` column we had in the `treat_sig` object to `robustZ` to colour the networks by the predicted directions of changes.

```{r fig.width=10, fig.height=8, fig.cap="*Pathways significantly perturbed by the E2+R5020 combintation treatment, where colors of nodes reflect pathways' directions of changes.*"}
treat_sig %>% 
    dplyr::filter(FDR < 0.05, Treatment == "E2+R5020") %>%
    dplyr::rename(robustZ = Estimate) %>%
    plot_gs_network(
       
        gsTopology = gsTopology, 
        colorBy = "robustZ"
    ) +
    theme(
        panel.grid = element_blank(), 
        panel.background = element_blank()
    )
```

By default, `plot_gs_network()` function does not include nodes that are not connected to any other nodes, which could be turnt off by setting the `plotIsolated` parameter to TURE.

#### Visualise community structure in the gene-set network

When a large number of pathways were perturbed, it is hard to answer the question "What key biological processes were perturbed?" solely by looking at all the pathway names. To solve that, we can use the `plot_community()` function to apply a community detection algorithm to the network we had above, and annotate each community by the biological process that most pathways assigned to that community belong to.

```{r, fig.height=8, fig.width=12, fig.cap="*Pathways significantly perturbed by the E2+R5020 combintation treatment, annotated by the main biological processes each pathways belong to.*" }
treat_sig %>% 
    dplyr::filter(FDR < 0.05, Treatment == "E2+R5020") %>%
    dplyr::rename(robustZ = Estimate) %>%
    plot_community(
        gsTopology = gsTopology, 
        colorBy = "community", 
        color_lg_title = "Community"
    ) +
    theme(panel.background = element_blank())
```
`sSNAPPY` was built in with categorisations of *KEGG* pathways. But to annotate pathways retrieved from other databases, customised annotation `data.frame` could be provided through the `gsAnnotation` parameter. 

#### Visualise genes included in perturbed pathways networks

If we want to not only know if two pathways are connected but also the genes connecting those pathways, we can use the `plot_gs2gene()` function instead:

```{r, fig.height=7, fig.width=12, fig.cap="*Pathways significantly perturbed by the E2+R5020 combintation treatment and all genes contained in those pathways.*" }
treat_sig %>% 
    dplyr::filter(FDR < 0.05, Treatment == "E2+R5020") %>%
    dplyr::rename(robustZ = Estimate) %>%
    plot_gs2gene(
        gsTopology = gsTopology, 
        colorGS_By = "robustZ", 
        label_Gene = FALSE,
        GeneNode_size = 1, 
        edgeAlpha = 0.1
    ) +
    theme(panel.background = element_blank())
```

However, since there are a large number of genes in each pathway, the plot above was quite messy and it was unrealistic to plot all genes' names. So it is recommend to filter pathway genes by their perturbation scores or logFCs.

For example, we can rank genes by the absolute values of their mean single-sample logFCs and only focus on genes that were ranked in the top 200 of the list.

```{r top200_FC}
meanFC <- weightedFC$logFC %>%
    .[, str_detect(colnames(.), "E2", negate = TRUE)] %>%
    apply(1, mean )
top200_gene <- meanFC %>%
    abs() %>%
    sort(decreasing = TRUE, ) %>%
    .[1:200] %>%
    names()
top200_FC <- meanFC %>%
    .[names(.) %in% top200_gene]
```

When we provide genes' logFCs as a named vector through the `geneFC` parameter, only pathway genes with logFCs provided will be plotted and gene nodes will be colored by genes' directions of changes. The names of the logFC vector must be genes' entrez IDs in the format of "ENTREZID:XXXX", as pathway topology matrices retrieved through `retrieve_topology()` always use entrez IDs as identifiers.

However, it is not going to be informative to label genes with their entrez IDs. So the same as the `plot_gene_contribution()` function, we can provide a `data.frame` to match genes' entrez IDs to our chosen identifiers through the `mapEntrezID` parameter in the `plot_gs2gene()` function too.

```{r, fig.height=8, fig.width=10, fig.cap="*Pathways significantly perturbed by the E2+R5020 combintation treatment, and pathway genes with top 200 magnitudes of changes among all E2+R5020-treated samples. Both pathways and genes were colored by their directions of changes.*" }
treat_sig %>% 
    dplyr::filter(FDR < 0.05, Treatment == "E2+R5020") %>%
    dplyr::rename(robustZ = Estimate) %>%
    plot_gs2gene(
        gsTopology = gsTopology, 
        colorGS_By = "robustZ", 
        mapEntrezID = entrez2name, 
        geneFC = top200_FC, 
        edgeAlpha = 0.3, 
        GsName_size = 4
    ) +
    theme(panel.background = element_blank())
```

We can also filters genes' by their contributions to a pathway perturbations.To help with that, `sSNAPPYT` provides an option to rank genes' perturbation scores within each sample for each pathway. 

If in a given pathway, both positive and negative gene-wise perturbation scores exist, positive and negative scores are ranked separately, where the larger a positive rank, the more the gene contributed to the pathway's activation, and the smaller a negative rank, the more the gene contributed to the pathways' inhibition. 

```{r geneRank}
geneRank <- rank_gene_pert(genePertScore, gsTopology)
```

Depending on the biological question, it could be interesting to plot only the pathway genes with the same directions of changes as the pathway they belonged to, and ignoring those genes that were antagonizing the pathway perturbation. 

# References

-   Sales G, Calura E, Cavalieri D, Romualdi C (2012). "graphite - a Bioconductor package to convert pathway topology to gene network." BMC Bioinformatics. <https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-20>.
-   Tarca, Adi Laurentiu et al. (2009). "A novel signaling pathway impact analysis." Bioinformatics vol. 25,1 : 75-82. <doi:10.1093/bioinformatics/btn577>

# Session Info

```{r sessionInfo}
sessionInfo()
```