--- title: "Analysing single-cell RNA-sequencing Data" author: "Johannes Griss" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Analysing single-cell RNAseq data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction The ReactomeGSA package is a client to the web-based Reactome Analysis System. Essentially, it performs a gene set analysis using the latest version of the Reactome pathway database as a backend. This vignette shows how the ReactomeGSA package can be used to perform a pathway analysis of cell clusters in single-cell RNA-sequencing data. ### Citation To cite this package, use ``` Griss J. ReactomeGSA, https://github.com/reactome/ReactomeGSA (2019) ``` ## Installation The `ReactomeGSA` package can be directly installed from Bioconductor: ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") if (!require(ReactomeGSA)) BiocManager::install("ReactomeGSA") # install the ReactomeGSA.data package for the example data if (!require(ReactomeGSA.data)) BiocManager::install("ReactomeGSA.data") ``` For more information, see https://bioconductor.org/install/. ## Example data As an example we load single-cell RNA-sequencing data of B cells extracted from the dataset published by Jerby-Arnon *et al.* (Cell, 2018). **Note**: This is not a complete Seurat object. To decrease the size, the object only contains gene expression values and cluster annotations. ```{r} library(ReactomeGSA.data) data(jerby_b_cells) jerby_b_cells ``` ## Types of analyses There are two methods to analyse scRNA-seq data using ReactomeGSA: ReactomeGSA can generate pseudo-bulk data from scRNA-seq data and then analyse this data using the classical quantitative pathway analysis algorithms. Thereby, it is possible to directly compare, f.e. two cell types with each other or two treatment approaches. The result is a classical pathway analysis result with p-values and fold-changes attributed to each pathway. The `analyse_sc_clusters` function offers a second approach using the gene set variation algorithm `ssGSEA` to derive pathway-level quantitative values for each cluster or cell type in the dataset. This is helpful to visualize the "expression level" of pathways accross the dataset. Statistical analyses have to be performed separately. ## Comparative pathway analysis (pseudo-bulk approach) The pathway analysis is at the very end of a scRNA-seq workflow. This means, that any Q/C was already performed, the data was normalized and cells were already clustered. In this example we are going to compare `Cluster 1` against `Cluster 2`. ```{r} # store the Idents as a meta.data field - this was # removed while compressing the object as an example jerby_b_cells$clusters <- Idents(jerby_b_cells) table(jerby_b_cells$clusters) ``` As a next step, we need to create the pseudo-bulk data for the analysis. This is achieved through the `generate_pseudo_bulk_data` function. ```{R} library(ReactomeGSA) # This creates a pseudo-bulk object by splitting each cluster in 4 # random bulks of data. This approach can be changed through the # split_by and k_variable parameter. pseudo_bulk_data <- generate_pseudo_bulk_data(jerby_b_cells, group_by = "clusters") # we can immediately create the metadata data.frame for this data pseudo_metadata <- generate_metadata(pseudo_bulk_data) ``` This pseudo-bulk data is directly compatible with the existing algorithms for quantitative pathway analysis and can be processed using the respective ReactomeGSA methods. ```{r} # Create a new request object using 'Camera' for the gene set analysis my_request <- ReactomeAnalysisRequest(method = "Camera") # set the maximum number of allowed missing values to 50% my_request <- set_parameters(request = my_request, max_missing_values = 0.5) # add the pseudo-bulk data as a dataset my_request <- add_dataset(request = my_request, expression_values = pseudo_bulk_data, sample_data = pseudo_metadata, name = "Pseudo-bulk", type = "rnaseq_counts", comparison_factor = "Group", comparison_group_1 = "Cluster 1", comparison_group_2 = "Cluster 2") my_request ``` This request object can be directly submitted to the ReactomeGSA analysis. ```{r} quant_result <- perform_reactome_analysis(my_request, compress = FALSE) quant_result ``` This object can be used in the same fashion as any ReactomeGSA result object. ```{r} # get the pathway-level results quant_pathways <- pathways(quant_result) head(quant_pathways) ``` ```{r} # get the top pathways to label them library(tidyr) library(dplyr) top_pathways <- quant_pathways %>% tibble::rownames_to_column(var = "Id") %>% filter(`FDR.Pseudo-bulk` < 0.001) %>% arrange(desc(`av_foldchange.Pseudo-bulk`)) # limit to a few pathway for aesthetic reasons top_pathways <- top_pathways[c(1,5,6), ] # create a simple volcano plot of the pathway results library(ggplot2) ggplot(quant_pathways, aes(x = `av_foldchange.Pseudo-bulk`, y = -log10(`FDR.Pseudo-bulk`))) + geom_vline(xintercept = c(log2(0.5), log2(2)), colour="grey", linetype = "longdash") + geom_hline(yintercept = -log10(0.05), colour="grey", linetype = "longdash") + geom_point() + geom_label(data = top_pathways, aes(label = Name), nudge_y = 1, check_overlap = TRUE) ``` ## Pathway analysis of cell clusters (analyse_sc_clusters) The ReactomeGSA package can now be used to get pathway-level expression values for every cell cluster. This is achieved by calculating the mean gene expression for every cluster and then submitting this data to a gene set variation analysis. All of this is wrapped in the single `analyse_sc_clusters` function. ```{R} gsva_result <- analyse_sc_clusters(jerby_b_cells, verbose = TRUE) ``` The resulting object is a standard `ReactomeAnalysisResult` object. ```{r} gsva_result ``` `pathways` returns the pathway-level expression values per cell cluster: ```{r} pathway_expression <- pathways(gsva_result) # simplify the column names by removing the default dataset identifier colnames(pathway_expression) <- gsub("\\.Seurat", "", colnames(pathway_expression)) pathway_expression[1:3,] ``` A simple approach to find the most relevant pathways is to assess the maximum difference in expression for every pathway: ```{r} # find the maximum differently expressed pathway max_difference <- do.call(rbind, apply(pathway_expression, 1, function(row) { values <- as.numeric(row[2:length(row)]) return(data.frame(name = row[1], min = min(values), max = max(values))) })) max_difference$diff <- max_difference$max - max_difference$min # sort based on the difference max_difference <- max_difference[order(max_difference$diff, decreasing = T), ] head(max_difference) ``` ### Plotting the results The ReactomeGSA package contains two functions to visualize these pathway results. The first simply plots the expression for a selected pathway: ```{r, fig.width=7, fig.height=4} plot_gsva_pathway(gsva_result, pathway_id = rownames(max_difference)[1]) ``` For a better overview, the expression of multiple pathways can be shown as a heatmap using `gplots` `heatmap.2` function: ```{r, fig.width=7, fig.height=8} # Additional parameters are directly passed to gplots heatmap.2 function plot_gsva_heatmap(gsva_result, max_pathways = 15, margins = c(6,20)) ``` The `plot_gsva_heatmap` function can also be used to only display specific pahtways: ```{r, fig.width=7, fig.height=4} # limit to selected B cell related pathways relevant_pathways <- c("R-HSA-983170", "R-HSA-388841", "R-HSA-2132295", "R-HSA-983705", "R-HSA-5690714") plot_gsva_heatmap(gsva_result, pathway_ids = relevant_pathways, # limit to these pathways margins = c(6,30), # adapt the figure margins in heatmap.2 dendrogram = "col", # only plot column dendrogram scale = "row", # scale for each pathway key = FALSE, # don't display the color key lwid=c(0.1,4)) # remove the white space on the left ``` This analysis shows us that cluster 8 has a marked up-regulation of B Cell receptor signalling, which is linked to a co-stimulation of the CD28 family. Additionally, there is a gradient among the cluster with respect to genes releated to antigen presentation. Therefore, we are able to further classify the observed B cell subtypes based on their pathway activity. ### Pathway-level PCA The pathway-level expression analysis can also be used to run a Principal Component Analysis on the samples. This is simplified through the function `plot_gsva_pca`: ```{r, fig.width=6, fig.height=4} plot_gsva_pca(gsva_result) ``` In this analysis, cluster 11 is a clear outlier from the other B cell subtypes and therefore might be prioritised for further evaluation. ## Session Info ```{r} sessionInfo() ```