---
title: "SUITOR: selecting the number of mutational signatures"
author: "DongHyuk Lee and Bin Zhu"
output:
    BiocStyle::pdf_document
vignette: >
    %\VignetteIndexEntry{SUITOR: selecting the number of mutational signatures}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

# Introduction
Mutational signatures are patterns of somatic mutations imprinted on the
cancer genome by operative mutational processes, and have been proposed 
to identify cancer predisposition genes and to stratify cancer patients 
for precision treatment. For the *de novo* mutational signature 
analysis, estimating the correct number of signatures is the crucial 
starting point, since it influences all the downstream steps, including 
extraction of signature profiles, estimation of signature activities and 
classification of tumors based on the estimated activities. 
Despite the many algorithms proposed to extract signature profiles and to 
estimate signature contributions, relatively little emphasis has been 
placed on selecting the correct number of de novo mutational signatures
in cancer genomics studies.
The SUITOR package uses unsupervised cross-validation to select the optimal
number of signatures.

# Installing the SUITOR package from Bioconductor
```{r, eval=FALSE}
    if (!requireNamespace("BiocManager", quietly = TRUE)) 
        install.packages("BiocManager") 
    BiocManager::install("SUITOR") 
```

# Loading the package
Before using the SUITOR package, it must be loaded into an R session.
```{r}
library(SUITOR)
```

# Example data
For illustrative purposes, we simulated a 96 by 300 mutational catalog
matrix which contains 300 tumors with respect to 96 single base substitution 
categories. Each element of the matrix is generated from the Poisson 
distribution with the mean corresponding to each element 
of `WH`, where `W` is the true signature matrix of size 96 by 8 for 8 
signatures, and `H` is the activity matrix of size 8 by 300.
Specifically for `W`, we used the profile of eight COSMIC signatures 
4, 6, 7a, 9, 17b, 22, 26, 39 from
[**COSMIC**](https://cancer.sanger.ac.uk/cosmic/signatures/SBS).
`H` is generated from a uniform distribution
between 0 and 100 with some randomly chosen elements of `H` set to 0 in order
to mimic real data.
```{r}
    data(SimData, package="SUITOR")
    dim(SimData)
    SimData[1:6, 1:6]
```

# Selecting the number of mutational signatures
The main function `suitor(data, op)` is to select the number of mutational
signatures based on cross-validation. It has two arguments described below.

## Input data
The first argument of the function `suitor()` is `data`. It could be an R 
data frame or matrix containing a mutational catalog whose elements are 
non-negative counts.
Each column of `data` corresponds to a tumor (or sample) while its rows 
represent a mutation type. Although selection of the number of signatures 
is independent to the order of mutation type, we specify the order of 
mutation type according to the 
[**COSMIC database**](https://cancer.sanger.ac.uk/signatures/sbs/) 
for extracting signature profiles using `suitorExtractWH()` after
estimating the optimal rank.

## Options
Since SUITOR is based on cross-validation and the Expectation Conditional
Maximization (ECM) algorithm, it is necessary to set a list of tuning 
parameters which control the fitting process. These parameters are defined
in the table below.

```{r table1, echo=FALSE}
v1 <- c("min.value", "min.rank", "max.rank", "k.fold", "em.eps", "max.iter",
        "n.starts", "get.summary", "plot", "print")       
v2 <- c("Minimum value of matrix before factorizing", "Minimum rank",
        "Maximum rank", "Number of folds", "EM algorithm stopping tolerance", 
        "Maximum number of iterations in EM algorithm", 
        "Number of starting points", 
        "0 or 1 to create summary results", "0 or 1 to produce an error plot",
        "0 or 1 to print info (0=no printing)")
v3 <- c("1e-4", "1", "10", "10", "1e-5", "2000", "30", "1", "1", "1")
tab1 <- data.frame(Name=v1, Description=v2, "Default Value"=v3,
                   stringsAsFactors=FALSE, check.names=FALSE)
knitr::kable(tab1)
```

The option `min.value` is a small number added to the data matrix for stable 
computation of non-negative matrix factorization. For a given number of 
signatures or ranks `rnk (min.rank <= rnk <= max.rank)`, the data matrix is 
divided into `k.fold` parts for the cross-validation. The default value of 
the maximal rank `max.rank` is 10 but it can be changed depending on the 
cancer type. The default value of the number of folds K (`k.fold`) is 10 and 
it can be modified depending on the computer resources.
Since the ECM algorithm may converge to a local saddle point, 
SUITOR tries multiple initial values for `W` and `H`. 
For this purpose, the number of starts (`n.starts`) is used. 
Although the default `n.starts` is set to 30, it can be increased
depending on the size of the data matrix and/or computational resources.
For the ECM algorithm, the default value of the maximal iteration `max.iter`
is set to 2000. It is possible for some cases to reach the maximal iteration,
for which the function would produce a warning message. Overall, we 
recommend a two-stage approach where the user would run `suitor()` with the 
default option first and then narrow down the set of plausible ranks 
(`min.rank <= rnk <= max.rank`) with more starts (`n.starts`) and a larger 
number of maximal iteration (`max.iter`) if necessary.


# Running suitor() 
By default, the `suitor()` function returns a
list containing the estimated optimal rank (`re$rank`), a summary matrix
(`re$summary`) where cross validation errors are tabulated, as well as 
the detailed results (`re$all.results`) which contain the training and 
testing errors, the total number of ECM updates, and options (`re$op`) 
used by the `suitor()` function. In addition to the estimated optimal rank 
provided by `re$rank`, a cross validation error plot is created by default.

```{r}
OP  <- list(max.rank=3, k.fold=5, n.starts=4)
set.seed(123)
re <- suitor(SimData, op=OP)
str(re)
```

# Extracting the signature profiles and activities
Once the optimal number of signatures or called rank is estimated by 
`suitor()`, we can extract the signature profiles `W` and activities
`H` with the function `suitorExtractWH(data, rank, op)`. 
As in the `suitor()` function, the input data is a data frame or matrix 
containing a mutational catalog whose elements are non-negative counts. 
A non-negative integer rank is the number of mutational signatures to be 
extracted. The possible option values are summarized in the
following table and they can be used in the same manner as `suitor()`.

```{r table2, echo=FALSE}
v1 <- c("min.value", "n.starts", "print")
v2 <- c("Minimum value of matrix before factorizing", 
        "Number of starting points",
        "0 or 1 to print info (0=no printing)")
v3 <- c("1e-4", "30", "1")
tab2 <- data.frame(Name=v1, Description=v2, "Default Value"=v3,
                   stringsAsFactors=FALSE, check.names=FALSE)
knitr::kable(tab2)
```


```{r}
re$rank
set.seed(123)
Extract <- suitorExtractWH(SimData, re$rank)
head(Extract$W)
Extract$H[,1:3]
```

`Extract$W` and `Extract$H` are estimated matrices for the profile `W`
and the activity `H`, respectively.

# Summarizing signature profiles with MutationalPatterns
The R package MutationalPatterns (Blokzijl et al., 2021) provides some utility
functions to summarize signature profiles and contains matrices of pre-defined
signatures like COSMIC. The function `plot_96_profile()` can draw 
the signature profile plot with respect to the 96 trinucleotide categories.
In addition, the function `cos_sim_matrix()` computes the cosine similarity 
between two profiles.
```{r}
suppressPackageStartupMessages(library(MutationalPatterns))
COSMIC <- get_known_signatures(source = "COSMIC_v3.2")
plot_96_profile(Extract$W, condensed=TRUE, ymax=0.3)
CS <- cos_sim_matrix(Extract$W, COSMIC)
CS[, 1:3]
```

# Session Information
```{r}
sessionInfo()
```