---
title: "User Guide"
output:
BiocStyle::html_document:
toc: true
toc_depth: 2
highlight: pygments
author: "Søren Helweg Dam"
date: "Last updated: `r format(Sys.Date(), '%Y.%m.%d')`"
bibliography: library.bib
vignette: >
%\VignetteIndexEntry{User Guide}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
`pariedGSEA` is a user-friendly framework for paired differential gene
expression and splicing analyses.
Providing bulk RNA-seq count data, `pariedGSEA` combines the results of
`r BiocStyle::Biocpkg("DESeq2")` [@deseq2] and
`r BiocStyle::Biocpkg("DEXSeq")` [@dexseq], aggregates the p-values to
gene level and allows
you to run a subsequent gene set over-representation analysis using
`r BiocStyle::Biocpkg("fgsea")`'s `fora` function [@fgsea].
Since version 0.99.2, you can also do the differential analyses using
`r BiocStyle::Biocpkg("limma")` .
`pairedGSEA` was developed to highlight the importance of differential
splicing analysis. It was build in a way that yields comparable results between
splicing and expression-related events. It, by default, accounts for
surrogate variables in the data, and facilitates exploratory data analysis
either through storing intermediate results or through plotting functions
for the over-representation analysis.
This vignette will guide you through how to use the main functions of
`pairedGSEA`.
`pariedGSEA` assumes you have already preprocessed and aligned your sequencing
reads to transcript level.
Before starting, you should therefore have a counts matrix and a metadata file.
This data may also be in the format of a
`r BiocStyle::Biocpkg("SummarizedExperiment")` [@se] or
`DESeqDataSet`.
*Importantly*, please ensure that the rownames have the format:
`gene:transcript`.
The metadata should, as a minimum, contain the `sample IDs` corresponding to
the column names of the count matrix, a `group` column containing information
on which samples corresponds to the baseline (controls) and
the case (condition). Bear in mind, the column names can be as you wish,
but the names must be provided in the `sample_col` and `group_col` parameters,
respectively.
To see an example of what such data could look like,
please see `?pairedGSEA::example_se`.
# Installation
To install this package, start R (version "4.2") and enter:
Bioconductor dependencies
``` {r install_dep, eval=FALSE}
# Install Bioconductor dependencies
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c(
"SummarizedExperiment", "S4Vectors", "DESeq2", "DEXSeq",
"fgsea", "sva", "BiocParallel"))
```
Install `pairedGSEA` from [GitHub](https://github.com/shdam/pairedGSEA)
``` {r install_github, eval=FALSE}
# Install pairedGSEA from github
devtools::install_github("shdam/pairedGSEA", build_vignettes = TRUE)
```
Install `pairedGSEA` from Bioconductor
```{r install_bioc, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("pairedGSEA", version = 'devel')
```
# Paired differential expression and splicing
## Preparing the experiment parameters
Running a paired Differential Gene Expression (DGE) and
Differential Gene Splicing (DGS) analysis is the first step in
the `pairedGSEA` workflow.
But before running the `paired_diff` function, we recommend storing
the experiment parameters in a set of variables at the top of your script
for future reference and easy access:
```{r setup}
library("pairedGSEA")
# Defining plotting theme
ggplot2::theme_set(ggplot2::theme_classic(base_size = 20))
## Load data
# In this vignette we will use the included example Summarized Experiment.
# See ?example_se for more information about the data.
data("example_se")
if(FALSE){ # Examples of other data imports
# Example of count matrix
tx_count <- read.csv("path/to/counts.csv") # Or other load function
md_file <- "path/to/metadata.xlsx" # Can also be a .csv file or a data.frame
# Example of locally stored DESeqDataSet
dds <- readRDS("path/to/dds.RDS")
# Example of locally stored SummarizedExperiment
se <- readRDS("path/to/se.RDS")
}
## Experiment parameters
group_col <- "group_nr" # Column with the groups you would like to compare
sample_col <- "id" # Name of column that specifies the sample id of each sample.
# This is used to ensure the metadata and count data contains the same samples
# and to arrange the data according to the metadata
# (important for underlying tools)
baseline <- 1 # Name of baseline group
case <- 2 # Name of case group
experiment_title <- "TGFb treatment for 12h" # Name of your experiment. This is
# used in the file names that are stored if store_results is TRUE (recommended)
```
## Check your metadata
```{r metadata_check}
# Check if parameters above fit with metadaata
SummarizedExperiment::colData(example_se)
```
```{r sample_check}
# Check that all data samples are in the metadata
all(colnames(example_se) %in%
SummarizedExperiment::colData(example_se)[[sample_col]])
```
## Running the analysis
The paired DGE/DGS analysis is run with `paired_diff()`.
`paired_diff()` is essentially a wrapper function around
`DESeq2::DESeq` [@deseq2] and `DEXSeq::DEXSeq` [@dexseq], the latter takes in
the ballpark of 20-30 minutes to run depending on the size of the data
and computational resources. Please visit their individual vignettes
for further information.
```{r paired_diff}
set.seed(500) # For reproducible results
diff_results <- paired_diff(
object = example_se,
metadata = NULL, # Use with count matrix or if you want to change it in
# the input object
group_col = group_col,
sample_col = sample_col,
baseline = baseline,
case = case,
experiment_title = experiment_title,
store_results = FALSE # Set to TRUE (recommended) if you want
# to store intermediate results, such as the results on transcript level
)
```
After running the analyses, `paired_diff` will aggregate the p-values to
gene level using lancaster `r BiocStyle::CRANpkg("aggregation")`
[@lancaster] and calculate
the FDR-adjusted p-values (see `?pairedGSEA:::aggregate_pvalue`
for more information).
For the DGE transcripts, the log2 fold changes will be aggregated
using a weighted mean, whereas the DGS log2 fold changes will be
assigned to the log2 fold change of the transcript with the lowest p-value.
Use the latter with a grain of salt.
From here, feel free to analyse the gene-level results using your
preferred method.
If you set `store_results = TRUE`, you could extract the transcript
level results found in the `results` folder under the names
"\*_expression_results.RDS"
and "\*_splicing_results.RDS" for the DGE and DGS analysis, respectively
(The \* corresponds to the provided experiment title).
## Additional parameters
There are a range of other parameters you can play with to tailor
the experience. Here, the default settings are showed.
See `?pairedGSEA::paired_diff` for further details.
```{r extra_settings, eval=FALSE}
covariates = NULL,
run_sva = TRUE,
use_limma = FALSE,
prefilter = 10,
fit_type = "local",
test = "LRT",
quiet = FALSE,
parallel = TRUE,
BPPARAM = BiocParallel::bpparam(),
...
```
To highlight some examples of use:
1. You can use `limma::eBayes` and `limma::diffSplice` for the analyses with
`use_limma = TRUE`.
2. You can use additional columns in your metadata as covariates in the
model matrix by setting `covariates` to a character vector of the
specific names. This will be used in SVA, DGE, and DGS.
3. The test should be kept at default settings, but advanced users may use
the "wald" test instead, if they wish.
4. The `...` parameters will be fed to `DESeq2::DESeq`,
see their manual for options.
5. If you want a stricter, more loose, or more advanced pre-filtering
(generally not necessary, as `r BiocStyle::Biocpkg("DESeq2")`
and `r BiocStyle::Biocpkg("DEXSeq")` performs their own
pre-filtering), you can set the parameter to a different value or NULL and
use your own pre-filtering method directly on the counts matrix.
# Over-Representation Analysis
`pairedGSEA` comes with a wrapper function for `fgsea::fora` [@fgsea].
If you wish,
feel free to use that directly or any other gene set analysis method
- investigate the `diff_results` object before use to see what
information it contains.
The inbuilt wrapper also computes the relative risk for each gene set and an
enrichment score (`log2(relative_risk + 0.06)`, the pseudo count is
for plotting purposes).
### Running the inbuilt ORA function
Before you get going, you will need a list of gene sets (aka. pathways)
according to the species you are working with and the category of
gene sets of interest.
For this purpose, feel free to use the `r BiocStyle::CRANpkg("msigdbr")`
[@msigdbr] wrapper function in
`pairedGSEA`: `pairedGSEA::prepare_msigdb`. If you do,
see `?pairedGSEA::prepare_msigdb` for further details.
The inbuilt ORA function is called `paired_ora` and is run as follows
```{r paied_ora}
# Define gene sets in your preferred way
gene_sets <- pairedGSEA::prepare_msigdb(
species = "Homo sapiens",
category = "C5",
gene_id_type = "ensembl_gene"
)
ora <- paired_ora(
paired_diff_result = diff_results,
gene_sets = gene_sets,
experiment_title = NULL # experiment_title - if not NULL,
# the results will be stored in an RDS object in the 'results' folder
)
```
## Additional parameters
```{r ora_settings, eval=FALSE}
cutoff = 0.05,
min_size = 25,
quiet = FALSE
```
Please investigate the returned object to see the column names and what
they contain.
# Analysing ORA results
There are many options for investigating your ORA results.
`pairedGSEA` comes with an inbuilt scatter plot function that plots the
enrichment score of DGE against those of DGS.
The function allows you to interactively look at the placement of the
significant pathways using `r BiocStyle::CRANpkg("plotly")` [@plotly].
You can color specific points
based on a regular expression for gene sets of interest.
## Scatter plot of enrichment scores
```{r scatter}
# Scatter plot function with default settings
plot_ora(
ora,
plotly = FALSE,
pattern = "Telomer", # Identify all gene sets about telomeres
cutoff = 0.05, # Only include significant gene sets
lines = TRUE # Guide lines
)
```
As mentioned, this function can be utilized in a few different ways.
The default settings will plot the enrichment scores of each analysis
and draw dashed lines for the `y = x` line, `y = 0`, and `x = 0`.
Remove those by setting `lines = FALSE`.
Make the plot interactive with `plotly = TRUE`.
Highlight gene sets containing a specific regex pattern
by setting `pattern` to the regex pattern of interest.
# Session Info
```{r session info}
sessionInfo()
```
# References