---
title: "Analysing single-cell RNA-sequencing Data"
author: "Johannes Griss"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analysing single-cell RNAseq data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

The ReactomeGSA package is a client to the web-based Reactome Analysis System. Essentially, it performs a gene set analysis using the latest version of the Reactome pathway database as a backend.

This vignette shows how the ReactomeGSA package can be used to perform a pathway analysis of cell clusters in single-cell RNA-sequencing data.

### Citation

To cite this package, use 

```
Griss J. ReactomeGSA, https://github.com/reactome/ReactomeGSA (2019)
```

## Installation

The `ReactomeGSA` package can be directly installed from Bioconductor:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!require(ReactomeGSA))
  BiocManager::install("ReactomeGSA")

# install the ReactomeGSA.data package for the example data
if (!require(ReactomeGSA.data))
  BiocManager::install("ReactomeGSA.data")
```

For more information, see https://bioconductor.org/install/.

## Example data

As an example we load single-cell RNA-sequencing data of B cells extracted from the dataset published by Jerby-Arnon *et al.* (Cell, 2018).

**Note**: This is not a complete Seurat object. To decrease the size, the object only contains gene expression values and cluster annotations.

```{r}
library(ReactomeGSA.data)
data(jerby_b_cells)

jerby_b_cells
```

## Types of analyses

There are two methods to analyse scRNA-seq data using ReactomeGSA:

ReactomeGSA can generate pseudo-bulk data from scRNA-seq
data and then analyse this data using the classical quantitative pathway
analysis algorithms. Thereby, it is possible to directly compare, f.e. 
two cell types with each other or two treatment approaches. 
The result is a classical pathway analysis result with
p-values and fold-changes attributed to each pathway.

The `analyse_sc_clusters` function offers a second approach using
the gene set variation algorithm `ssGSEA` to derive pathway-level
quantitative values for each cluster or cell type in the dataset.
This is helpful to visualize the "expression level" of pathways
accross the dataset. Statistical analyses have to be performed separately.

## Comparative pathway analysis (pseudo-bulk approach)

The pathway analysis is at the very end of a scRNA-seq workflow. This means, 
that any Q/C was already performed, the data was normalized and cells were already clustered.

In this example we are going to compare `Cluster 1` against `Cluster 2`.

```{r}
# store the Idents as a meta.data field - this was
# removed while compressing the object as an example
jerby_b_cells$clusters <- Idents(jerby_b_cells)

table(jerby_b_cells$clusters)
```

As a next step, we need to create the pseudo-bulk data for the analysis. This
is achieved through the `generate_pseudo_bulk_data` function.

```{R}
library(ReactomeGSA)

# This creates a pseudo-bulk object by splitting each cluster in 4
# random bulks of data. This approach can be changed through the
# split_by and k_variable parameter.
pseudo_bulk_data <- generate_pseudo_bulk_data(jerby_b_cells, group_by = "clusters")

# we can immediately create the metadata data.frame for this data
pseudo_metadata <- generate_metadata(pseudo_bulk_data)
```

This pseudo-bulk data is directly compatible with the existing algorithms
for quantitative pathway analysis and can be processed using the respective
ReactomeGSA methods.

```{r}
# Create a new request object using 'Camera' for the gene set analysis
my_request <- ReactomeAnalysisRequest(method = "Camera")

# set the maximum number of allowed missing values to 50%
my_request <- set_parameters(request = my_request, max_missing_values = 0.5)

# add the pseudo-bulk data as a dataset
my_request <- add_dataset(request = my_request,
                          expression_values = pseudo_bulk_data,
                          sample_data = pseudo_metadata,
                          name = "Pseudo-bulk",
                          type = "rnaseq_counts",
                          comparison_factor = "Group",
                          comparison_group_1 = "Cluster 1",
                          comparison_group_2 = "Cluster 2")

my_request
```

This request object can be directly submitted to the ReactomeGSA analysis. 

```{r}
quant_result <- perform_reactome_analysis(my_request, compress = FALSE)
quant_result
```

This object can be used in the same fashion as any ReactomeGSA result object.

```{r}
# get the pathway-level results
quant_pathways <- pathways(quant_result)
head(quant_pathways)
```

```{r}
# get the top pathways to label them
library(tidyr)
library(dplyr)

top_pathways <- quant_pathways %>% 
  tibble::rownames_to_column(var = "Id") %>%
  filter(`FDR.Pseudo-bulk` < 0.001) %>%
  arrange(desc(`av_foldchange.Pseudo-bulk`))

# limit to a few pathway for aesthetic reasons
top_pathways <- top_pathways[c(1,5,6), ]

# create a simple volcano plot of the pathway results
library(ggplot2)
ggplot(quant_pathways,
       aes(x = `av_foldchange.Pseudo-bulk`,
           y = -log10(`FDR.Pseudo-bulk`))) +
  geom_vline(xintercept = c(log2(0.5), log2(2)), colour="grey", linetype = "longdash") +
  geom_hline(yintercept = -log10(0.05), colour="grey", linetype = "longdash") +
  geom_point() +
  geom_label(data = top_pathways, aes(label = Name), nudge_y = 1, check_overlap = TRUE)
```

## Pathway analysis of cell clusters (analyse_sc_clusters)

The ReactomeGSA package can now be used to get pathway-level expression values for every cell cluster. This is achieved by calculating the mean gene expression for every cluster and then submitting this data to a gene set variation analysis. 

All of this is wrapped in the single `analyse_sc_clusters` function.

```{R}
gsva_result <- analyse_sc_clusters(jerby_b_cells, verbose = TRUE)
```

The resulting object is a standard `ReactomeAnalysisResult` object.

```{r}
gsva_result
```

`pathways` returns the pathway-level expression values per cell cluster:

```{r}
pathway_expression <- pathways(gsva_result)

# simplify the column names by removing the default dataset identifier
colnames(pathway_expression) <- gsub("\\.Seurat", "", colnames(pathway_expression))

pathway_expression[1:3,]
```

A simple approach to find the most relevant pathways is to assess the maximum difference in expression for every pathway:

```{r}
# find the maximum differently expressed pathway
max_difference <- do.call(rbind, apply(pathway_expression, 1, function(row) {
    values <- as.numeric(row[2:length(row)])
    return(data.frame(name = row[1], min = min(values), max = max(values)))
}))

max_difference$diff <- max_difference$max - max_difference$min

# sort based on the difference
max_difference <- max_difference[order(max_difference$diff, decreasing = T), ]

head(max_difference)
```

### Plotting the results

The ReactomeGSA package contains two functions to visualize these pathway results. The first simply plots the expression for a selected pathway:

```{r, fig.width=7, fig.height=4}
plot_gsva_pathway(gsva_result, pathway_id = rownames(max_difference)[1])
```

For a better overview, the expression of multiple pathways can be shown as a heatmap using `gplots` `heatmap.2` function:

```{r, fig.width=7, fig.height=8}
# Additional parameters are directly passed to gplots heatmap.2 function
plot_gsva_heatmap(gsva_result, max_pathways = 15, margins = c(6,20))
```

The `plot_gsva_heatmap` function can also be used to only display specific pahtways:
```{r, fig.width=7, fig.height=4}
# limit to selected B cell related pathways
relevant_pathways <- c("R-HSA-983170", "R-HSA-388841", "R-HSA-2132295", "R-HSA-983705", "R-HSA-5690714")
plot_gsva_heatmap(gsva_result, 
                  pathway_ids = relevant_pathways, # limit to these pathways
                  margins = c(6,30), # adapt the figure margins in heatmap.2
                  dendrogram = "col", # only plot column dendrogram
                  scale = "row", # scale for each pathway
                  key = FALSE, # don't display the color key
                  lwid=c(0.1,4)) # remove the white space on the left
```

This analysis shows us that cluster 8 has a marked up-regulation of B Cell receptor signalling, which is linked to a co-stimulation of the CD28 family. Additionally, there is a gradient among the cluster with respect to genes releated to antigen presentation. 

Therefore, we are able to further classify the observed B cell subtypes based on their pathway activity.

### Pathway-level PCA

The pathway-level expression analysis can also be used to run a Principal Component Analysis on the samples. This is simplified through the function `plot_gsva_pca`:

```{r, fig.width=6, fig.height=4}
plot_gsva_pca(gsva_result)
```

In this analysis, cluster 11 is a clear outlier from the other B cell subtypes and therefore might be prioritised for further evaluation.

## Session Info

```{r}
sessionInfo()
```