---
title: "An Introduction to TMSig"
author:
- name: "Tyler J. Sagendorf"
  affiliation: 
  - Pacific Northwest National Laboratory, Richland, WA
date: "`r BiocStyle::doc_date()`"
abstract: >
  The `r BiocStyle::Biocpkg('TMSig')` package contains tools to prepare,
  analyze, and visualize named lists of sets, with an emphasis on molecular
  signatures (such as gene or kinase sets). It includes fast, memory efficient
  functions to construct sparse incidence and similarity matrices; filter,
  cluster, invert, and decompose sets; perform pre-ranked Correlation Adjusted
  MEan RAnk (CAMERA-PR) gene set testing via a flexible matrix method (see 
  `r BiocStyle::Biocpkg('limma')` for the original implementation); and create
  bubble heatmaps to visualize the results of any differential or molecular
  signatures analysis.
package: |
    `r BiocStyle::pkg_ver('TMSig')`
    
    Report issues at https://github.com/EMSL-Computing/TMSig
vignette: >
  %\VignetteIndexEntry{An Introduction to TMSig}
  %\VignettePackage{TMSig}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteKeywords{GeneSetEnrichment, Clustering, Pathways, Visualization}
output: 
    BiocStyle::html_document:
        number_sections: true
        toc: true
        toc_float: true
        toc_depth: 2
bibliography: references.bib
link-citations: true
---

License: `r packageDescription("TMSig")[["License"]]`

```{r setup, include=FALSE}
options(width = 80)
knitr::opts_chunk$set(collapse = TRUE,
                      message = FALSE,
                      comment = "#>",
                      dpi = 250)

suppressPackageStartupMessages({
    library(TMSig)
})
```


# Read GMT Files

GMT files, such as those available through the Molecular Signatures Database
(MSigDB)[@liberzon_molecular_2011; @liberzon_molecular_2015], can be read as
named lists with `readGMT`:

```{r read-gmt}
# Path to GMT file - MSigDB Gene Ontology sets
pathToGMT <- system.file(
    "extdata", "c5.go.v2023.2.Hs.symbols.gmt.gz",
    package = "TMSig"
)

geneSets <- readGMT(path = pathToGMT)

length(geneSets) # 10461 gene sets

head(names(geneSets)) # first 6 gene set names

geneSets[[1]] # genes in first set
```

# Filter Sets

`filterSets` keeps sets that satisfy minimum and maximum size constraints.
Optionally, sets can be restricted to a smaller background of genes before
filtering by size.

```{r filter-sets}
# Filter by size - no background
filt <- filterSets(x = geneSets, 
                   min_size = 10L, 
                   max_size = 500L)
length(filt) # 7096 gene sets remain (down from 10461)
```

Normally, **the background is the set of genes quantified in a particular
experiment**. For the purposes of this example, the top 2000 most common genes
will be selected to serve as the background. This ensures the gene sets will
maintain some degree of overlap for later steps.

```{r create-background}
# Top 2000 most common genes
topGenes <- table(unlist(geneSets))
topGenes <- head(sort(topGenes, decreasing = TRUE), 2000L)
head(topGenes)

background <- names(topGenes)
```

The gene sets will be restricted to elements of the background before filtering
by size.

```{r filter-sets-bg}
# Restrict genes sets to background of genes
geneSetsFilt <- filterSets(
    x = geneSets, 
    background = background, 
    min_size = 10L # min. overlap of each set with background
)

length(geneSetsFilt) # 4629 gene sets pass
```

The proportion of genes in each set that overlap with the background will be
calculated and used to further select sets.

```{r overlap-prop-opts, include=FALSE}
cap <- "Scatterplot of the original set size vs. overlap proportion."
```

```{r overlap-prop, fig.cap=cap}
# Calculate proportion of overlap with background
setSizes <- lengths(geneSetsFilt)
setSizesOld <- lengths(geneSets)[names(geneSetsFilt)]
overlapProp <- setSizes / setSizesOld

plot(setSizesOld, overlapProp, main = "Set Size vs. Overlap")
```

Gene sets with an overlap of at least 70% will be kept, so we can be reasonably
confident that the gene sets are correctly described by their labels. However,
since the background is small, this will remove the majority of sets.

```{r filter-by-overlap}
table(overlapProp >= 0.7)

geneSetsFilt <- geneSetsFilt[overlapProp >= 0.7]
length(geneSetsFilt) # 1015 gene sets pass
```


# Incidence Matrix

An incidence matrix is a representation of a graph. For a named list of sets,
the set names form the rows of the matrix, and all unique elements are columns.
A value of 1 indicates that the element is a member of the set, while a value of
0 indicates otherwise. The incidence matrix forms the basis for many of the
functions in TMSig, including `similarity`, `clusterSets`, `decomposeSets`,
and `cameraPR.matrix`.

```{r incidence-matrix}
imat <- sparseIncidence(x = geneSetsFilt)
dim(imat) # 1015 sets, 1914 genes

imat[seq_len(8L), seq_len(5L)] # first 8 sets, first 5 genes
```

The incidence matrix can be used to calculate the sizes of all pairwise set
intersections or the number of sets or pairs of sets to which each gene belongs.

```{r incidence-products}
# Calculate sizes of all pairwise intersections
tcrossprod(imat)[1:3, 1:3] # first 3 gene sets

# Calculate number of sets and pairs of sets to which each gene belongs
crossprod(imat)[1:3, 1:3] # first 3 genes
```


# Set Similarity

The `similarity` function constructs a sparse symmetric matrix of pairwise
Jaccard, overlap/Simpson, or Ōtsuka similarity coefficients for all pairs of
sets $A$ and $B$, where

- Jaccard($A$, $B$) = $\frac{|A \cap B|}{|A \cup B|}$
- Overlap($A$, $B$) = $\frac{|A \cap B|}{\min(|A|, |B|)}$
- Ōtsuka($A$, $B$) = $\frac{|A \cap B|}{\sqrt{|A| \times |B|}}$

All 3 similarity measures can identify aliasing of sets: when two or more sets
contain the same elements, but have different descriptions. Only the
overlap/Simpson similarity can also identify when one set is a subset of
another.


## Jaccard

```{r sim-matrix}
# Jaccard similarity (default)
jaccard <- similarity(x = geneSetsFilt)
dim(jaccard) # 1015 1015`
class(jaccard)
```

The 6 sets having the highest Jaccard similarity with
GOBP_CARDIAC_ATRIUM_DEVELOPMENT are shown for each of the 3 measures of set
similarity.

```{r sim-jaccard}
# 6 sets with highest Jaccard for a specific term
idx <- order(jaccard[, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT"], 
             decreasing = TRUE)
idx <- head(idx)

jaccard[idx, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT", drop = FALSE]
```

GOBP_CARDIAC_ATRIUM_MORPHOGENESIS is highly similar to
GOBP_CARDIAC_ATRIUM_DEVELOPMENT ($J$ = 0.853).


## Overlap/Simpson

```{r sim-simpson}
overlap <- similarity(x = geneSetsFilt, type = "overlap")

overlap[idx, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT", drop = FALSE]
```

GOBP_CARDIAC_ATRIUM_MORPHOGENESIS, GOBP_ATRIAL_SEPTUM_DEVELOPMENT, and
GOBP_ATRIAL_SEPTUM_MORPHOGENESIS are subsets of GOBP_CARDIAC_ATRIUM_DEVELOPMENT,
since they are smaller (each containing 15 to 29 genes compared to 34 in
GOBP_CARDIAC_ATRIUM_DEVELOPMENT).


## Ōtsuka

```{r sim-otsuka}
otsuka <- similarity(x = geneSetsFilt, type = "otsuka")

otsuka[idx, "GOBP_CARDIAC_ATRIUM_DEVELOPMENT", drop = FALSE]
```


# Cluster Similar Sets

`clusterSets` employs hierarchical clustering to identify groups of highly
similar sets. This procedure was developed for the removal of redundant Gene
Ontology and Reactome gene sets for the MSigDB. See
`help(topic = "clusterSets", package = "TMSig")` for details.

```{r cluster-sets}
# clusterSets with default arguments
clusterDF <- clusterSets(x = geneSetsFilt, 
                         type = "jaccard", 
                         cutoff = 0.85, 
                         method = "complete",
                         h = 0.9)

# First 4 clusters with 2 or more sets
subset(clusterDF, subset = cluster <= 4L)
```

In each cluster, a single set can be retained for analysis using a combination
of criteria such as set size, overlap proportion, or length of the description
(shorter descriptions tend to be more general terms).

```{r add-clusterDF-columns}
## Use clusterSets output to select sets to retain for analysis
clusterDF$overlap_prop <- overlapProp[clusterDF$set] # overlap proportion
clusterDF$n_char <- nchar(clusterDF$set) # length of set description

# Reorder rows so that the first set in each cluster is the one to keep
o <- with(clusterDF, 
          order(cluster, set_size, overlap_prop, n_char, set,
                decreasing = c(FALSE, TRUE, TRUE, FALSE, TRUE),
                method = "radix"))
clusterDF <- clusterDF[o, ]

subset(clusterDF, cluster <= 4L) # show first 4 clusters
```

Now that the first gene set (first row) in each cluster is the one to keep, we
can remove rows where the cluster is duplicated (this only keeps the first row
of each cluster) and then extract the vector of sets to subset `geneSetsFilt`.
Note that `max(clusterDF$cluster)` will indicate how many gene sets will remain
after this redundancy filter.

```{r remove-similar-sets}
# Sets to keep for analysis
keepSets <- with(clusterDF, set[!duplicated(cluster)])
head(keepSets, 4L)

# Subset geneSetsFilt to those sets
geneSetsFilt <- geneSetsFilt[keepSets]
length(geneSetsFilt) # 986 (down from 1015)
```

# Decompose Sets

Decompose all pairs of sufficiently overlapping sets, $A$ and $B$, into 3
disjoint parts:

1. The elements unique to $A$: $A \setminus B$ ("A minus B")
2. The elements unique to $B$: $B \setminus A$ ("B minus A")
3. The elements common to both $A$ and $B$: $A \cap B$ ("A and B")

```{r venn-diagram-chunk-opts, include=FALSE}
h <- w <- "80%"
cap <- "Decomposition of sets."
```

```{r venn-diagram, echo=FALSE, out.height=h, out.width=w, fig.cap=cap}
knitr::include_graphics("images/Venn_Diagram.png")
```


Decomposition of sets is described in section 2.3.1 of "Extensions to Gene Set
Enrichment" [@jiang_extensions_2007] as a method to reduce the redundancy of
gene set testing results.

```{r decompose-sets}
# Limit maximum set size for this example
geneSetsFilt2 <- filterSets(geneSetsFilt, max_size = 50L)

# Decompose all pairs of sets with at least 10 genes in common
decompSets <- decomposeSets(x = geneSetsFilt2, overlap = 10L)

# Last 3 components
tail(decompSets, 3L)
```


# Invert Sets

A list of sets can be inverted so that elements become set names and set names
become elements. This is primarily used to identify all sets to which a
particular element belongs.

```{r invert-sets}
invertedSets <- invertSets(x = geneSetsFilt)

length(invertedSets) # 1914 genes

head(names(invertedSets)) # names are genes

invertedSets["FOXO1"] # all gene sets containing FOXO1
```

This can also be used to calculate the similarity of each pair of genes. That
is, do pairs of genes tend to appear in the same sets?

```{r}
similarity(x = invertedSets[1:5]) # first 5 genes
```


# Enrichment Analysis

We will simulate a matrix of gene expression data for those 2000 genes that were
selected earlier and perform differential analysis using `limma`. Then, the
differential analysis results, as well as the filtered gene sets, will be used
as input for the pre-ranked version of Correlation Adjusted MEan RAnk gene set
testing (CAMERA) [@wu_camera_2012].


## Simulate Gene Expression Data

Gene expression data is simulated for 3 biological replicates in each of 3
experimental groups: ctrl (control), treat1 (treatment group 1), and treat2
(treatment group 2.

```{r simulate-expression-data}
# Control and 2 treatment groups, 3 replicates each
group <- rep(c("ctrl", "treat1", "treat2"), 
             each = 3)
design <- model.matrix(~ 0 + group) # design matrix
contrasts <- makeContrasts(
    contrasts = c("grouptreat1 - groupctrl",
                  "grouptreat2 - groupctrl"),
    levels = colnames(design)
)

# Shorten contrast names
colnames(contrasts) <- gsub("group", "", colnames(contrasts))

ngenes <- length(background) # 2000 genes
nsamples <- length(group) # 9 samples

set.seed(0)
y <- matrix(data = rnorm(ngenes * nsamples),
            nrow = ngenes, ncol = nsamples,
            dimnames = list(background, make.unique(group)))
head(y)
```

Now, we introduce differential expression in two randomly selected gene sets. We
will make genes in the "GOBP_CARDIAC_ATRIUM_DEVELOPMENT" set higher in control
relative to treatment samples, and we will make genes in the
"GOBP_ACTIVATED_T_CELL_PROLIFERATION" set higher in treatment relative to
control samples. Since the contrasts are "treat1 - ctrl" and "treat2 - ctrl",
the direction of change will be "Down" for cardiac atrium development and "Up"
for activated T cell proliferation. The degree of change will be less for the
"treat2 - ctrl" comparison.

```{r perturb-gene-sets}
cardiacGenes <- geneSetsFilt[["GOBP_CARDIAC_ATRIUM_DEVELOPMENT"]]
tcellGenes <- geneSetsFilt[["GOBP_ACTIVATED_T_CELL_PROLIFERATION"]]

# Indices of treatment group samples
trt1 <- which(group == "treat1")
trt2 <- which(group == "treat2")

# Cardiac genes: higher in control relative to treatment
y[cardiacGenes, trt1] <- y[cardiacGenes, trt1] - 2
y[cardiacGenes, trt2] <- y[cardiacGenes, trt2] - 0.7

# T cell proliferation genes: higher in treatment relative to control
y[tcellGenes, trt1] <- y[tcellGenes, trt1] + 2
y[tcellGenes, trt2] <- y[tcellGenes, trt2] + 1
```


## Differential Gene Expression Analysis

```{r differential-analysis}
# Differential analysis with LIMMA
fit <- lmFit(y, design)
fit.contr <- contrasts.fit(fit, contrasts = contrasts)
fit.smooth <- eBayes(fit.contr)

# Matrix of z-score equivalents of moderated t-statistics
modz <- with(fit.smooth, zscoreT(x = t, df = df.total))
head(modz)
```

The results will be reformatted to make bubble heatmaps later.

```{r reformat-DA-results}
# Reformat differential analysis results for enrichmap
resDA <- lapply(colnames(contrasts), function(contrast_i) {
    res_i <- topTable(fit.smooth, 
                      coef = contrast_i, 
                      number = Inf, 
                      sort.by = "none")
    res_i$contrast <- contrast_i
    res_i$gene <- rownames(res_i)
    res_i$df.total <- fit.smooth$df.total
    
    return(res_i)
})

resDA <- data.table::rbindlist(resDA)

# Add z-statistic column
resDA$z <- with(resDA, zscoreT(x = t, df = df.total))

# Reorder rows
resDA <- resDA[with(resDA, order(contrast, P.Value, z)), ]

head(resDA)
```

Below is the number of significantly differentially expressed genes for each
comparison.

```{r count-signif-genes}
# Count number of significant (P adj. < 0.05) genes
table(resDA$contrast, resDA$adj.P.Val < 0.05)
```


## CAMERA-PR

LIMMA moderated t-statistics are converted to z-score equivalents and used as
input for `cameraPR.matrix`. CAMERA-PR is a modification of the two-sample
t-test that accounts for inter-gene correlation to correctly control the false
discovery rate (FDR) [@wu_camera_2012]. For the pre-ranked version, a default
inter-gene correlation of 0.01 is assumed for all gene sets. A non-parametric
version of CAMERA-PR is also available by specifying `use.ranks=TRUE`. See
`help(topic = "cameraPR.matrix", package = "TMSig")` for details. See 
`help(topic = "cameraPR", package = "limma")` for the original implementation.

The main benefits of using `cameraPR.matrix` over `cameraPR.default` are
significantly faster execution times and the ability to perform simultaneous
inference of multiple contrasts or coefficients.

```{r camera-PR}
# CAMERA-PR (matrix method)
res <- cameraPR(statistic = modz, 
                index = geneSetsFilt)
head(res)
```

Both cardiac atrium development and activated T cell proliferation gene sets
have adjusted p-values below 0.05 and are ranked at the top of "Down" and "Up"
sets. However, other gene sets are also statistically significant after
adjustment due to their genes overlapping with these two terms.

```{r count-signif-sets}
# Number of sets passing FDR threshold
table(res$Contrast, res$FDR < 0.05)
```

```{r echo=FALSE}
j1 <- round(jaccard["GOBP_CARDIAC_ATRIUM_DEVELOPMENT", 
                    "GOBP_ATRIAL_SEPTUM_DEVELOPMENT"], 
            digits = 3)
o1 <- overlap["GOBP_CARDIAC_ATRIUM_DEVELOPMENT", 
              "GOBP_ATRIAL_SEPTUM_DEVELOPMENT"]
```

To illustrate the above point, notice that GOBP_CARDIAC_ATRIUM_DEVELOPMENT and
GOBP_ATRIAL_SEPTUM_DEVELOPMENT are both significantly down. Their Jaccard and
Overlap coefficients are `r j1` and `r o1`, respectively. That is, atrial septum
development is a subset of cardiac atrium development (at least, when both sets
are restricted to the background we defined earlier), so it is being driven by
changes in the latter.


# Bubble Heatmaps

A bubble heatmap will be generated to visualize the top genes from the
differential expression analysis. Since `plot_sig_only=TRUE` only those 10 genes
that were significantly differentially expressed in the "treat1 - ctrl"
comparison will appear in the heatmap. The bubbles will be scaled such that the
most significant contrast/gene combination (smallest adjusted p-value) is of
maximum diameter, and all other bubbles will be scaled relative to it based on
their -log$_{10}$ adjusted p-values. See 
`help(topic = "enrichmap", package = "TMSig")` for more details.

```{r DA-heatmap-chunk-opts, include=FALSE}
h <- w <- "70%"
cap <- "Bubble heatmap of differential expression analysis results."
```

```{r DA-bubble-heatmap, out.height=h, out.width=w, fig.cap=cap}
# Differential analysis bubble heatmap
enrichmap(x = resDA, 
          scale_by = "max",
          n_top = 15L, # default
          plot_sig_only = TRUE, # default
          set_column = "gene", 
          statistic_column = "z", 
          contrast_column = "contrast", 
          padj_column = "adj.P.Val", 
          padj_legend_title = "BH Adjusted\nP-Value", 
          padj_cutoff = 0.05,
          cell_size = grid::unit(20, "pt"),
          # Additional arguments passed to ComplexHeatmap::Heatmap. Used to
          # modify default appearance.
          heatmap_args = list(
              name = "Z-Score",
              column_names_rot = 60,
              column_names_side = "top"
          ))
```

Now, a bubble heatmap will be generated to visualize the CAMERA-PR gene set
analysis results. The top 20 gene sets will be shown.

```{r camera-heatmap-chunk-opts, include=FALSE}
h <- w <- "100%"
cap <- "Bubble heatmap of CAMERA-PR results."
```

```{r CAMERA-bubble-heatmap, out.height=h, out.width=w, fig.cap=cap}
# CAMERA-PR bubble heatmap
enrichmap(x = res, 
          scale_by = "row", # default
          n_top = 20L,
          set_column = "GeneSet", 
          statistic_column = "ZScore", 
          contrast_column = "Contrast",
          padj_column = "FDR",
          padj_legend_title = "BH Adjusted\nP-Value",
          padj_cutoff = 0.05,
          cell_size = grid::unit(13, "pt"),
          # Additional arguments passed to ComplexHeatmap::Heatmap. Used to
          # modify default appearance.
          heatmap_args = list(
              name = "Z-Score", 
              column_names_rot = 60,
              column_names_side = "top"
          ))
```

Although no genes were differentially expressed in the "treat2 - ctrl"
comparison, summarizing results at the gene set level uncovered several
significant changes, though they are less pronounced than those in the
"treat1 - ctrl" comparison, as evidenced by the sizes and colors of the
bubbles.


# Session Information {.unnumbered}

```{r session-info}
print(sessionInfo(), locale = FALSE)
```


# References