---
title: "Simplify Functional Enrichment Results"
author: "Zuguang Gu (z.gu@dkfz.de)"
date: '`r Sys.Date()`'
output: 
  rmarkdown::html_vignette:
    fig_caption: true
    css: main.css
vignette: >
  %\VignetteIndexEntry{Simplify Functional Enrichment Results}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo = FALSE}
library(knitr)
knitr::opts_chunk$set(
    error = FALSE,
    tidy  = FALSE,
    message = FALSE,
    warning = FALSE,
    fig.align = "center",
    dev = "jpeg"
)
options(width = 80)
```

The **simplifyEnrichment** package clusters functional terms into groups by
clustering the similarity matrix of the terms with a new proposed method
"binary cut" which recursively applies partition around medoids (PAM) with two
groups on the similarity matrix and in each iteration step, a score is
assigned to decide whether the group of gene sets that corresponds to the
current sub-matrix should be split or not. For more details of the method, 
please refer to the simplifyEnrichment paper.

## Simplify GO enrichment results

```{r, echo = FALSE, message = FALSE}
library(simplifyEnrichment)
mat = readRDS(system.file("extdata", "random_GO_BP_sim_mat.rds", package = "simplifyEnrichment"))
go_id = rownames(mat)
```

The major use case for **simplifyEnrichment** is for simplying the GO
enrichment results by clustering the corresponding semantic similarity matrix
of the significant GO terms. To demonstrate the usage, we first generate a
list of random GO IDs from the Biological Process (BP) ontology category:

```{r, eval = FALSE}
library(simplifyEnrichment)
set.seed(888)
go_id = random_GO(500)
```

**simplifyEnrichment** starts with the GO similarity matrix. Users can use
their own similarity matrices or use the `GO_similarity()` function to
calculate the semantic similarity matrix. The `GO_similarity()` function is
simply a wrapper on `GOSemSim::termSim()`. The function accepts a vector of GO
IDs. Note the GO terms should only belong to one same ontology (_i.e._, `BP`,
`CC` or `MF`).


```{r, eval = FALSE}
mat = GO_similarity(go_id)
```

By default, `GO_similarity()` uses `Rel` method in `GOSemSim::termSim()`. Other
methods to calculate GO similarities can be set by `measure` argument, _e.g._:

```{r, eval = FALSE}
GO_similarity(go_id, measure = "Wang")
```

With the similarity matrix `mat`, users can directly apply `simplifyGO()`
function to perform the clustering as well as visualizing the results.

```{r, fig.width = 7*1.2, fig.height = 4*1.2}
df = simplifyGO(mat)
```

On the right side of the heatmap there are the word cloud annotations which
summarize the functions with keywords in every GO cluster. Additionally, enrichment
is done on keywords compared to GO background vocabulary and the significance corresponds
to the font size of the keywords.

Note there is no word cloud for the cluster that is merged from small clusters (size < 5).

The returned variable `df` is a data frame with GO IDs and the
cluster labels:

```{r}
head(df)
```

The size of GO clusters can be retrieved by:

```{r}
sort(table(df$cluster))
```

Or split the data frame by the cluster labels:

```{r, eval = FALSE}
split(df, df$cluster)
```

`plot` argument can be set to `FALSE` in `simplifyGO()`, so that no plot is
generated and only the data frame is returned.

If the aim is only to cluster GO terms, `binary_cut()` or `cluster_terms()` functions can be
directly applied:

```{r}
binary_cut(mat)
```

or

```{r, eval = FALSE}
cluster_terms(mat, method = "binary_cut")
```

`binary_cut()` and `cluster_terms()` basically generate the same clusterings, but the labels of clusters might differ.

## Simplify general functional enrichment results

Semantic measures can be used for the similarity of GO terms. However,
there are still a lot of ontologies (e.g. MsigDB gene sets) that are only
represented as a list of genes where the similarity between gene sets are
mainly measured by gene overlap. **simplifyEnrichment** provides the
`term_similarity()` and other related functions
(`term_similarity_from_enrichResult()`, `term_similarity_from_KEGG()`,
`term_similarity_from_Reactome()`, `term_similarity_from_MSigDB()` and
`term_similarity_from_gmt()`) which calculate the similarity of terms by the
gene overlapping, with methods of [Jaccard
coefficient](https://en.wikipedia.org/wiki/Jaccard_index), [Dice
coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient),
[overlap coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) and
[kappa coefficient](https://en.wikipedia.org/wiki/Cohen%27s_kappa).

The similarity can be calculated by providing:

1. A list of gene sets where each gene set contains a vector of genes.
2. A `enrichResult` object which is normally from the 'clusterProfiler', 'DOSE', 'meshes' or 'ReactomePA' package.
3. A list of KEGG/Reactome/MsigDB IDs. The gene set names can also be provided for MsigDB ontologies.
4. A gmt file and the corresponding gene set IDs.

Once you have the similarity matrix, you can send it to `simplifyEnrichment()` function.
But note, as we benchmarked in the manuscript, the clustering on the gene
overlap similarity performs much worse than on the semantic similarity.

## Comparing clustering methods

In the **simplifyEnrichment** package, there are also functions that compare
clustering results from different methods. Here we still use previously
generated variable `mat` which is the similarity matrix from the 500 random GO
terms. Simply running `compare_clustering_methods()` function performs all supported
methods (in `all_clustering_methods()`) excluding `mclust`, because
`mclust` usually takes very long time to run. The function generates a figure
with three panels:

1. A heatmap of the similarity matrix with different clusterings as row
   annotations.
2. A heatmap of the pair-wise concordance of the clustering from every two methods.
3. Barplots of the difference scores for each method, the number of clusters
   (total clusters and the clusters with size >= 5) and the mean similarity of
   the terms that are in the same clusters (block mean).

In the barplots, the three metrics are defined as follows:

1. **Different score**: This is the difference between the similarity values
   for the terms that belong to the same clusters and different clusters.
   For a similarity matrix $M$, for term $i$ and term $j$ where $i \ne j$, the
   similarity value $x_{i,j}$ is saved to the vector $\mathbf{x_1}$ only when
   term $i$ and $j$ are in a same cluster. $x_{i,j}$ is saved to the vector
   $\mathbf{x_2}$ when term $i$ and $j$ are not in the same cluster. The
   difference score measures the distribution difference between $\mathbf{x_1}$
   and $\mathbf{x_2}$, calculated as the Kolmogorov-Smirnov statistic between
   the two distributions.
2. **Number of clusters**: For each clustering, there are two numbers: the
   number of total clusters and the number of clusters with size >= 5 (only
   the big clusters).
3. **Block mean**: Mean similarity values of the diagonal blocks in the similarity 
   heatmap. Using the same convention as for the difference score, the block mean is the mean value of $\mathbf{x_1}$.

```{r, fig.width = 10, fig.height = 7}
compare_clustering_methods(mat)
```

If `plot_type` argument is set to `heatmap`. There are heatmaps for the
similarity matrix under different clusterings methods. The last panel is a
table with the number of clusters.

```{r, fig.width = 18, fig.height = 14, dev = "jpeg"}
compare_clustering_methods(mat, plot_type = "heatmap")
```

Please note, the clustering methods might have randomness, which means,
different runs of `compare_clustering_methods()` may generate different clusterings
(slightly different). Thus, if users want to compare the plots between
`compare_clustering_methods(mat)` and `compare_clustering_methods(mat, plot_type = "heatmap")`, they
should set the same random seed before executing the function.

```{r, eval = FALSE}
set.seed(123)
compare_clustering_methods(mat)
set.seed(123)
compare_clustering_methods(mat, plot_type = "heatmap")
```

`compare_clustering_methods()` is simply a wrapper on `cmp_make_clusters()`
and `cmp_make_plot()` functions where the former function performs
clustering with different methods and the latter visualizes the results. To
compare different plots, users can also use the following code without
specifying the random seed.

```{r, eval = FALSE}
clt = cmp_make_clusters(mat) # just a list of cluster labels
cmp_make_plot(mat, clt)
cmp_make_plot(mat, clt, plot_type = "heatmap")
```

### Register new clustering methods

New clustering methods can be added by `register_clustering_methods()`,
removed by `remove_clustering_methods()` and reset to the default methods by
`reset_clustering_methods()`. All the supported methods can be retrieved by
`all_clustering_methods()`. `compare_clustering_methods()` runs all the clustering methods
in `all_clustering_methods()`.

The new clustering methods should be as user-defined functions and sent to
`register_clustering_methods()` as named arguments, e.g.:

```{r, eval = FALSE}
register_clustering_methods(
    method1 = function(mat, ...) ...,
    method2 = function(mat, ...) ...,
    ...
)
```

The functions should accept at least one argument which is the input matrix
(`mat` in above example). The second optional argument should always be `...`
so that parameters for the clustering function can be passed by `control`
argument from `cluster_terms()` or `simplifyGO()`. If users forget to add
`...`, it is added internally.

Please note, the user-defined function should automatically identify the
optimized number of clusters. The function should return a vector of cluster
labels. Internally it is converted to numeric labels.

## Examples

There are following examples which we did for the benchmarking in the manuscript:

- [Examples of simplifyEnrichment](https://simplifyenrichment.github.io/examples/).
- [Compare different similarity measures for functional terms](https://simplifyenrichment.github.io/compare_similarity/).
- [Compare different partitioning methods in binary cut clustering](https://simplifyenrichment.github.io/test_partition_methods/).

## Apply to multiple lists of GO IDs

It is always very common that users have multiple lists of GO enrichment
results (e.g. from multiple groups of genes) and they want to compare the
significant terms between different lists, e.g. to see which biological
functions are more specific in a certain list. There is a function
`simplifyGOFromMultipleLists()` in the package which helps this type of analysis.

The input data for `simplifyGOFromMultipleLists()` (with the argument `lt`) can have three types of formats:

- A list of numeric vectors of adjusted p-values where each vector has the GO IDs as names.
- A data frame. The column of the GO IDs can be specified with ``go_id_column`` argument and the column of the adjusted p-values can be
      specified with ``padj_column`` argument. If the two columns are not specified, they are automatically identified. The GO ID column
      is found by checking whether a column contains all GO IDs. The adjusted p-value column is found by comparing the column names of the 
      data frame to see whether it might be a column for adjusted p-values. These two columns are used to construct a numeric vector
      with GO IDs as names.
 - A list of character vectors of GO IDs. In this case, each character vector is changed to a numeric vector where
   all values take 1 and the original GO IDs are used as names of the vector.

If the GO enrichment results is directly from upstream analysis, e.g. the package **clusterProfiler** or other similar packages, the results are
most probably represented as a list of data frames, thus, we first demonstrate the usage on a list of data frames.

The function `functional_enrichment()` in **cola** package applies functional
enrichment on different groups of signature genes from consensus clustering.
The function internally uses **clusterProfiler** and returns a list of data frames:

```{r, fig.width=10, fig.height = 7}
# perform functional enrichment on the signatures genes from cola anlaysis 
library(cola)
data(golub_cola) 
res = golub_cola["ATC:skmeans"]

library(hu6800.db)
x = hu6800ENTREZID
mapped_probes = mappedkeys(x)
id_mapping = unlist(as.list(x[mapped_probes]))

lt = functional_enrichment(res, k = 3, id_mapping = id_mapping)
names(lt)
head(lt[[1]][, 1:7])
```

By default, `simplifyGOFromMultipleLists()` automatically identifies the columns that contain GO IDs and adjusted p-values, so here we directly
send `lt` to `simplifyGOFromMultipleLists()`. We additionally set `padj_cutoff` to 0.001 because under the default cutoff 0.01, there are too many
GO IDs and to save the running time, we set a more strict cutoff.

```{r, fig.width=10, fig.height = 6, out.width = "100%"}
simplifyGOFromMultipleLists(lt, padj_cutoff = 0.001)
```

Next we demonstrate two other data types for `simplifyGOFromMultipleLists()`. Both usages are straightforward. The first is a list of numeric vectors:

```{r, eval = FALSE}
lt2 = lapply(lt, function(x) structure(x$p.adjust, names = x$ID))
simplifyGOFromMultipleLists(lt2, padj_cutoff = 0.001)
```

And the second is a list of character vectors of GO IDs:

```{r, eval = FALSE}
lt3 = lapply(lt, function(x) x$ID[x$p.adjust < 0.001])
simplifyGOFromMultipleLists(lt3)
```

The process of this analysis is as follows.
Let's assume there are $n$ GO lists, we first construct a global matrix where columns correspond to the $n$ GO lists and rows correspond
to the "union" of all GO IDs in the $n$ lists. The value for the i<sup>th</sup> GO ID and in the j<sup>th</sup> list are taken from the corresponding numeric vector
in `lt`. If the j<sup>th</sup> vector in `lt` does not contain the i<sup>th</sup> GO ID, the value defined by `default` argument is taken there (e.g. in most cases the numeric
values are adjusted p-values, thus `default` is set to 1). Let's call this matrix as $M_0$.

Next step is to filter $M_0$ so that we only take a subset of GO IDs of interest. We define a proper function via argument `filter` to remove
GO IDs that are not important for the analysis. Function for `filter` is applied to every row in $M_0$ and `filter` function needs
to return a logical value to decide whether to keep or remove the current GO ID. For example, if the values in `lt` are adjusted p-values, the `filter` function
can be set as `function(x) any(x < padj_cutoff)` so that the GO ID is kept as long as it is signfiicant in at least one list. After the filtering, let's call
the filtered matrix $M_1$.

GO IDs in $M_1$ (row names of $M_1$) are used for clustering. A heatmap of $M_1$ is attached to the left of the GO similarity heatmap so that
the group-specific (or list-specific) patterns can be easily observed and to corresponded to GO functions.

Argument `heatmap_param` controls several parameters for heatmap $M_1$:

- `transform`: A self-defined function to transform the data for heatmap visualization. The most typical case is to transform adjusted p-values by `-log10(x)`.
- `breaks`: Break values for color interpolation.
- `col`: The corresponding values for `breaks`.
- `labels`: The corresponding labels for legend.
- `name`: Legend title.


## Session Info

```{r}
sessionInfo()
```