---
title: "Getting Started with PLAID"
author: "BigOmics Analytics"
package: plaid
output: 
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Getting Started with PLAID}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = "center"
)
```

# Introduction

PLAID (Pathway Level Average Intensity Detection) is a
novel, ultrafast and memory optimized gene set enrichment scoring
algorithm. PLAID demonstrates accurate gene set scoring and
outperforms all currently available gene set scoring methods in large
bulk and single-cell RNA-seq datasets.

# Motivation

In recent years, computational methods have emerged that calculate enrichment of gene signatures
within individual samples, rather than across pooled samples. These signatures offer critical insights into the
coordinated activity of functionally related genes, proteins or metabolites, enabling identification of unique
molecular profiles based on gene set and pathway activity in individual cells and patients. This strategy is pivotal
for patient stratification and advancement of personalized medicine. However, the rise of large-scale datasets,
including single-cell profiles and population biobanks, has exposed significant computational inefficiencies in
existing methods. Current methods often demand excessive runtime and memory resources, becoming
impractical for large datasets. Overcoming these limitations is a focus of current efforts by bioinformatics teams in
academia and the pharmaceutical industry, as essential to support basic and clinical biomedical research.


# Example: Single-cell RNA-seq hallmark scoring

## Preparing data

For this vignette, our package includes a small subset of the the
pmbc3k single-cell dataset of just 50 cells. Please install the Seurat
and SeuratData packages if you want to run this vignette against the
full dataset.

```{r}
library("plaid")
load(system.file("extdata", "pbmc3k-50cells.rda", package = "plaid"),verbose=TRUE)
dim(X)
```

Note that X is the normalized expression matrix from the Seurat
object, not the raw counts matrix. We recommend to run plaid on the
log transformed expression matrix, not on the counts, as the average
in the logarithmic space is more robust and is in concordance to
calculating the geometric mean.

It is not necessary to normalize your expression matrix before running
plaid because plaid normalizes the enrichment scores
afterwards. However, again, log transformation is recommended.

It is recommended to keep the expression matrix sparse as much as
possible because plaid extensively take advantage of sparse matrix
computations. But even for dense matrices plaid is fast.


## Preparing gene sets

For convenience we have included the 50 Hallmark genesets in our
package. But we encourage you to download larger geneset collections
as plaid's speed advantage will be more apparent for larger datasets
and large geneset collections.

Plaid needs the gene sets as sparse matrix. If you have your
collection of gene sets a a list, we need first to convert the gmt
list to matrix format.

```{r}
hallmarks <- system.file("extdata", "hallmarks.gmt", package = "plaid")
gmt <- read.gmt(hallmarks)
matG <- gmt2mat(gmt)
dim(matG)
```

If you have your own gene sets stored as gmt files, you can
conveniently use the included `read.gmt()` function to read the gmt
file.


## Calculating the score

The main function to run plaid is `plaid()`. We run plaid on our
expression matrix `X` and gene set matrix `matG`.

```{r}
gsetX <- plaid(X, matG, normalize=TRUE)
dim(gsetX)
```

The resulting matrix `gsetX` contains the single-sample enrichment
scores for the specified gene sets and samples. 

Notice that by default plaid performs median normalization of the
final results. That also means that it is not necessary to normalize
your expression matrix before running plaid. However, generally, log
transformation is recommended.

Plaid can also be run on the ranked matrix, we will see later that
this corresponds to the singscore (Fouratan et al., 2018). Or plaid
could be run on the (non-logarithmic) counts which can be used to
calculate the scSE score (Pont et al., 2019).

## Very large matrices

Plaid is fast and memory efficient because it uses very efficient
sparse matrix computation in the back. For very large `X`, plaid uses
chunked computation by splitting the matrix in chunks to avoid index
overflow. Should you encounter errors, please compute your dataset by
subsetting manually the expression matrix and/or gene sets.

Although `X` and `matG` are generally very sparse, be aware that the
result matrix `gsetX` generally is dense and therefore can become very
large. If you would want to compute the score of 10.000 gene sets on a
million of cells this would create a large 10.000 x 1.000.000 dense
matrix which requires about 75GB of memory.


## Differential expression testing using dualGSEA

Once we have the gene sets scores we can use these scores for
statistical analysis. We could compute the differential gene set
expression between two groups using a general t-test or limma directly
on the score matrix `gsetX`.

Another way to test whether a gene set is statistically significant
would be to test whether the fold-change of the genes in the gene sets
are statistically different than zero. That is, we can perform a one
sample z-test on the logFC of the genes of each gene sets and test
whether they are significantly different from zero. The logFC is
computed from the original (log) expression matrix `X` and group
vector `y`.


The function `dualGSEA()` does both tests: the one-sample z-test on
the logFC  and the two-group t-test on the gene set matrix `gsetX`. 
Dual testing has been suggested by Bull et al. (Sci Rep., 2024)

```{r}
y <- 1*(celltype == "B")
res <- dualGSEA(X, y, G=matG)
```

The top significant genesets can be shown with

```{r}
res <- res[order(res[,"p.dual"]),] 
head(res)
```

The column `gsetFC` corresponds to the difference in gene set score
and also corresponds to the average foldchange of the genes in the
gene set. The column 'p.fc corresponds to the test on the preranked
logFC, the column 'p.ss' corresponds to the two-group t-test on the
geneset scores `gsetX`. The two p-values are then combined using
Stouffer's method in the column 'p.dual' and adjusted for multiple
testing in column `q.dual`.

The left figure below plots the fold-change enrichment `p.fc` vs. 
single-sample enrichment `p.ss`. The right figure shows the volcano plot
`p.dual` vs. `gsetFC`:

```{r, fig.height=6, fig.width=12, fig.fullwidth=TRUE}
fc <- res[,"gsetFC"]
pv <- res[,"p.dual"]
p1 <- res[,"p.fc"]
p2 <- res[,"p.ss"]
ii <- head(order(pv))
par(mfrow=c(1,2))
plot( -log10(p1), -log10(p2),
  xlab="FC enrichment (-log10p)",
  ylab="single-sample enrichment (-log10p)", pch=19)
text( -log10(p1[ii]), -log10(p2[ii]), rownames(res)[ii],pos=2)
plot( fc, -log10(pv), xlab="gsetFC", ylab="-log10p", pch=19)
abline(h=0, v=0, lty=2)
text( fc[ii], -log10(pv[ii]), rownames(res)[ii],pos=2)
```

# Replicating ssGSEA, singscore and scSE

Plaid can be used to replicate other enrichment score such as
[ssGSEA](https://github.com/rcastelo/GSVA),
[GSVA](https://github.com/rcastelo/GSVA),
[AUCell](https://github.com/aertslab/AUCell),
[Singscore](https://github.com/DavisLaboratory/singscore),
[scSE](https://doi.org/10.1093/nar/gkz601) and
[UCell](https://github.com/carmonalab/UCell). But using plaid, the
computation is much faster than the original code.

## Replicating singscore

Computing the
[singscore](https://github.com/DavisLaboratory/singscore) requires to
compute the ranks of the expression matrix. We have wrapped this in a
single convenience function:

```{r}
sing <- replaid.sing(X, matG)
```

We have extensively compared the results of `replaid.sing` and from
the original `singscore` R package and we showed identical result in
the score, logFC and p-values.


## Replicating ssGSEA

Plaid can also be used to compute the ssGSEA score (Barbie et al.,
2009). Using plaid, we can calculate the score upto 100x faster. We
have wrapped this in a single convenience function:

```{r}
ssgsea <- replaid.ssgsea(X, matG, alpha=0)
```

We have extensively compared the results of `replaid.ssgsea()` and
from the original [GSVA](https://github.com/rcastelo/GSVA) R
package. Note the rank weight parameter alpha. For `alpha=0` we
obtained identical result for the score, logFC and p-values. For
non-zero values for alpha the results are close but not exactly the
same. The default value in the original publication and in GSVA is
`alpha=0.25`.


## Replicating the scSE score

[Single-cell Signature Explorer
(scSE)](https://doi.org/10.1093/nar/gkz601) is a fast enrichment
algorithm implemented in GO. Computing the scSE score requires running
plaid on the linear (not logarithmic) score and perform additional
normalization by the total UMI per sample. We have wrapped this in a
single convenience function:

```{r}
scse <- replaid.scse(X, matG, removeLog2=TRUE, scoreMean=FALSE)
```

To replicate the original "sum-of-UMI" scSE score, set
`removeLog2=TRUE` and `scoreMean=FALSE`. scSE and plaid scores become
more similar for `removeLog2=FALSE` and `scoreMean=TRUE`.

We have extensively compared the results from `replaid.scse` and from
the original scSE (implemented in GO lang) and we showed almost
identical results in the score, logFC and p-values.


## Compare scores

We can compare all scores in a pairs plot:

```{r}
S <- cbind(plaid=gsetX[,1], sing=sing[,1], ssgsea=ssgsea[,1], scSE=scse[,1])
pairs(S)
```

# Session info

```{r}
sessionInfo()
```