```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE)
options(width=110)
```
<!--
%\VignetteIndexEntry{MSstats+: Peak quality-weighted differential analysis}
%\VignetteEngine{knitr::knitr}
-->

# __MSstats+: Peak quality-weighted differential analysis__

Author: Devon Kohler

`MSstats+` is an extension to the `MSstats` R package in Bioconductor. It is 
designed to incorporate the underlying spectral peak quality into differential 
analysis, going beyond peak intensity alone. It integrates the underlying 
spectral peaks using an isolation forest (iForest) for anomaly detection, which
integrates all the peak quality metrics into a single score. This score is then 
used a weight in protein-level summarization. After 
protein-level summarization, the estimates of underlying protein abundance and 
their associated variance are used as weights in a second weighted 
linear model which is used to perform differential analysis. The models can be 
applied to experiments with simple group comparison designs, or more complex designs 
(e.g. comparing more than two experimental conditions, or a repeated measure 
design, such as a time course).

The algorithm is designed to be used with data acquired using data independent 
acquisition (DIA). Other acquisitions are possible but have yet to be tested. 
For guidance on how to use `MSstats+` with other acquisitions, please reach out 
on the `MSstats` Google Group: \url{https://groups.google.com/g/msstats}. 

Below we outline the workflow for using `MSstats+`, highlighting key options 
and decisions.

## Load packages
```{r}
library(MSstats)
library(MSstatsConvert)
library(data.table)
library(ggplot2)
```

## Converter

The main iForest algorithm for `MSstats+` is implemented in the `MSstatsConvert`
package (a dependency of `MSstats`). The `MSstatsConvert` package contains 
individual converters for each spectral processing tool. Currently, `MSstats+` 
only has a dedicated converter for `Spectronaut`, however more converters are 
in development and can be implemented manually (see below), For integration with
other tools, please reach out on the `MSstats` Google Group: \url{https://groups.google.com/g/msstats}. 

### Load data

First we will load the Spectronaut report of identified and quantified peaks. It
is necessary that users output the spectral peak quality data along with 
these peaks. These spectral peak quality metrics must be at least at the 
precursor-level (i.e., the column headers must not start with `F.`). There are 
multiple options for columns that can be extracted, some 
examples include: `FG.ShapeQualityScore (MS1)`, `FG.ShapeQualityScore (MS2)`,
and `EG.DeltaRT`. Users can include as many quality metrics as they want, 
however we recommend using only quality metrics that are key indicators of the
quality of a spectral peak. Including too many quality metrics may lead to
the iForest model fitting noise. We include additional functionality to assess the 
performance of the model (discussed below).

An example output from Spectronaut is included in the 
`MSstatsConvert` package to provide an idea of what the input should look like.

In addition to the Spectronaut output, `MSstats+` also requires an annotation 
file which maps the MS run names to the corresponding experimental design. For 
more details on creating an annotation file, please see the documentation in 
standard `MSstats` and the associated publications.

Finally, `MSstats+` also requires the temporal order of MS run acquisition to 
identify time-based drops in quality. This can be extracted from the Spectronaut
output as shown below.

```{r}
# Spectronaut quant file
spectronaut_raw = system.file(
    "tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv",
    package = "MSstatsConvert")
spectronaut_raw = data.table::fread(spectronaut_raw)

# Annotation file
annotation = system.file(
    "tinytest/raw_data/Spectronaut/annotation.csv",
    package = "MSstatsConvert")
annotation = data.table::fread(annotation)

# Create temporal run ordering data.table
spectronaut_raw[, `R.Run Date (Formatted)` := as.POSIXct(
    `R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")]
run_order = unique(spectronaut_raw[, c("R.Run Date (Formatted)", 
                                       "R.FileName")])
run_order = run_order[order(`R.Run Date (Formatted)`)]
run_order$Order = 1:nrow(run_order)
setnames(run_order, c("R.FileName"), c("Run"))
```

### Run converter

`MSstats+` uses the same converter as base `MSstats`. New parameters have been 
added to indicate if a user wants to leverage the functionalities in `MSstats+`.
Specifically, to use `MSstats+`, users must indicate that the anomaly detection 
algorithm should used by setting `calculateAnomalyScores=TRUE`.

An example specification of the converter can be seen below, with key parameter
options highlighted below the code.

```{r}
msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat(
    spectronaut_raw, annotation, intensity = 'PeakArea',
    excludedFromQuantificationFilter = TRUE, # Pre-filtering
    calculateAnomalyScores=TRUE, # Key parameter to use MSstats+
    anomalyModelFeatures=c("FGShapeQualityScore(MS2)",
                           "FGShapeQualityScore(MS1)",
                           "EGDeltaRT"), # Quality metrics
    anomalyModelFeatureTemporal=c(
        "mean_decrease", "mean_decrease",
        "dispersion_increase"), # Temporal direction of quality metrics
    runOrder=run_order, # Temporal order of MS runs
    n_trees=100,
    max_depth="auto", # iForest parameter (depth of trees)
    numberOfCores=1) # Cores to use for parallel processing (Mac or Linux only)

head(msstats_input)
```

Key parameters for `MSstats+`:
- `calculateAnomalyScores`: Set to `TRUE` to use the iForest algorithm to 
  calculate anomaly scores for each spectral peak.
- `anomalyModelFeatures`: A character vector of the column names of the quality 
   metrics to be used in the iForest model. These must be at the precursor-level 
   or higher (i.e., the column headers must not start with `F.`).
- `anomalyModelFeatureTemporal`: A character vector of the same length as
   `anomalyModelFeatures` indicating the temporal direction of each quality 
   metric.9 Possible values are `mean_decrease`, `mean_increase`, and 
   `dispersion_increase`. These indicate the direction a metric is changes which
   indicates a drop in quality (i.e., decrease, increase, or can go in either
   direction). For example, `EG.DeltaRT` uses `dispersion_increase` because a 
   deviation in expected retention can be either higher or lower.
- `runOrder`: A data.table with two columns: `R.FileName` and `Order`. The 
   `R.FileName` column should contain the MS run names as they appear in the 
   annotation file, and the `Order` column should contain the temporal order of
   acquisition (1 being the first run acquired, 2 being the second, etc.). This
   is used to identify time-based drops in quality.
- `n_trees`: The number of trees to use in the iForest. Default is 100. Too few 
   trees will make the algorithm unstable, while too many trees will increase
   computation time without performance improvement.
- `max_depth`: The maximum depth of each tree in the iForest. Default is 
   `"auto"`, which sets the depth to `ceiling(log2(number of samples))`.
- `numberOfCores`: The number of cores to use for parallel processing. This is 
   recommended since the algorithm is fit on the precursor-level and can take 
   significant time to process if using a single core. Only available on Mac and
   Linux.
   
### Check model fit

After running the converter, it is important to check the fit of the iForest. 
The iForest is an unsupervised algorithm, so there is no ground truth to compare
to. However, we can check the distribution of the anomaly scores to ensure that
the model is performing as expected. The anomaly scores should be right-skewed,
with most of the scores being low (indicating normal peaks) and a small number
of scores being high (indicating anomalous peaks). If the distribution is not
right-skewed, it may indicate that the quality metrics used are not appropriate,
or that there are not many poor quality peaks (generally found in smaller 
studies using less complex biological material). In these cases, it may make 
sense to use base `MSstats` instead of `MSstats+`.

The fit of the model can be evaluated using the `CheckDataHealth` function in `MSstatsConvert`. This function will calculate Pearson's moment coefficient of 
skewness for each iForest model fit (each precursor). We recommend plotting 
these coefficients using a histogram and confirming the distribution is positive
and shifted away from zero (with a median around 1).

An example of how to use the function is shown below.

```{r}
health_info = CheckDataHealth(msstats_input)
skew_score = health_info[[2]]

ggplot(skew_score, aes(x = skew)) + 
  geom_histogram(fill = "#E69F00", color = "black", bins=12) + 
  geom_vline(xintercept = 0, linetype = "dashed", 
             color = "black", linewidth = 1.5) + 
  theme_minimal(base_size = 16) + 
  labs(
    title="Distribution of skewness of anomaly scores",
    x = "Pearson's moment coefficient of skewness",
    y = "Count"
  ) +
  theme(
    axis.text.x = element_text(size = 16),
    axis.text.y = element_text(size = 16),
    axis.title.x = element_text(size = 18),
    axis.title.y = element_text(size = 18)
  )

```

We can see in the example data that the distribution is shifted right away from 
zero. This indicates that the anomaly model was able to detect poor quality 
measurements and the user can proceed with the next step.

## Manual formatting and data conversion

The `MSstatsConvert` package includes an automatic `MSstats+` converter for 
Spectronaut, however other tools require manual formatting. This can be done by
leveraging the existing `MSstats` converters, joining the quality metrics, and 
running the anomaly model separately. An example workflow is included below.

```{r}

# Extract quality metrics from PSM output
quality_join_cols = c("R.FileName", "PG.ProteinGroups", "EG.ModifiedSequence",
                      "FG.Charge", "FG.ShapeQualityScore (MS1)", 
                      "FG.ShapeQualityScore (MS2)", "EG.DeltaRT")
quality_join_df = spectronaut_raw[, ..quality_join_cols]
quality_join_df$EG.ModifiedSequence = stringr::str_replace_all(
    quality_join_df$EG.ModifiedSequence, "\\_", "")
quality_join_df = unique(quality_join_df)

# In case there are duplicates (not common)
quality_join_df = quality_join_df[
  , lapply(.SD, mean, na.rm = TRUE),
  by = .(R.FileName, PG.ProteinGroups, EG.ModifiedSequence, FG.Charge),
  .SDcols = c("FG.ShapeQualityScore (MS1)",
              "FG.ShapeQualityScore (MS2)",
              "EG.DeltaRT")
]

# Extract run order
spectronaut_raw[, `R.Run Date (Formatted)` := as.POSIXct(
    `R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")]
run_order = unique(spectronaut_raw[, c("R.Run Date (Formatted)", 
                                       "R.FileName")])
run_order = run_order[order(`R.Run Date (Formatted)`)]
run_order$Order = 1:nrow(run_order)
setnames(run_order, c("R.FileName"), c("Run"))

# Run standard MSstats converter without anomaly detection
msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat(
    spectronaut_raw, annotation, intensity = 'PeakArea',
    excludedFromQuantificationFilter = TRUE, # Pre-filtering
    calculateAnomalyScores=FALSE) # Turn anomaly detection off

# Join in quality metrics
msstats_input = merge(as.data.table(msstats_input), 
                      as.data.table(quality_join_df), 
                      all.x=TRUE, all.y=FALSE,
                      by.x=c("Run", "ProteinName", "PeptideSequence", 
                             "PrecursorCharge"),
                      by.y=c("R.FileName", "PG.ProteinGroups", 
                             "EG.ModifiedSequence", "FG.Charge"))

# Run Anomaly Model separately
input = MSstatsConvert::MSstatsAnomalyScores(
    msstats_input, 
    c("FG.ShapeQualityScore (MS1)", "FG.ShapeQualityScore (MS2)", "EG.DeltaRT"), 
    c("mean_decrease", "mean_decrease", "dispersion_increase"),
    .5, 100, run_order, 100, "auto", 1)
```

The resulting data can now be passed into the remaining `MSstats+` workflow as 
listed below.

## Data processing and summarization

After running the converter with `MSstats+` options enabled, the next step is to
run the standard `MSstats` `dataProcess` function. This function will 
automatically leverage the information from the `MSstats+` option, as long as 
the linear summarization option is turned on. The code to do this is highlighted
below.

```{r}
summarized = dataProcess(msstats_input,
                         normalization=FALSE, # no normalization
                         MBimpute=TRUE, # Use imputation
                         summaryMethod="linear" # Key MSstats+ parameter
                         )

head(summarized$ProteinLevelData)
```

Key parameters for `MSstats+`:
- `summaryMethod`: Set to `linear` to use the peak quality-weighted linear 
  regression summarization. This is a key parameter for `MSstats+`. Other 
  summarization methods (e.g., Tukey's median polish) will not leverage the 
  peak quality information.
  
When the `dataProcess` function is run with `summaryMethod=linear` a new column 
will be included called `Variance` indicating the variance of the underlying
protein abundance estimate. This variance is used in the next step of the
workflow.

## Differential analysis

The final step of the `MSstats+` workflow is to perform differential analysis. 
This is done using the standard `groupComparison` function in `MSstats`. Similar
to the `dataProcess` function, the `groupComparison` function will automatically
leverage the information from the `MSstats+` option. The code to do this is highlighted below.

```{r}
comparison = matrix(c(-1,1),nrow=1)
row.names(comparison) = "Condition2-Condition1"
colnames(comparison) = c("Condition1", "Condition2")

comparison_result = groupComparison(
    contrast.matrix = comparison,
    data=summarized)

head(comparison_result$ComparisonResult)
```

The `groupComparison` function is run the exact same in `MSstats+` as `MSstats`, 
but under the hood the variation from the previous step is used in a second 
weighted linear regression (in comparison to the unweighted version in 
`MSstats`).

## Compare results to base `MSstats`

```{r}
base_msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat(
    spectronaut_raw, annotation, intensity = 'PeakArea',
    excludedFromQuantificationFilter = TRUE, # Pre-filtering
    calculateAnomalyScores=FALSE) # No anomaly model

base_summarized = dataProcess(base_msstats_input,
                         normalization=FALSE, # no normalization
                         MBimpute=TRUE, # Use imputation
                         summaryMethod="TMP" # Tukey median polish summarization
                         )

base_comparison_result = groupComparison(
    contrast.matrix = comparison,
    data=base_summarized)

head(base_comparison_result$ComparisonResult)
```

We can see that `MSstats` does not identify all the proteins as differentially abundant, even though there was a true log$_2$ fold change of -1. We can compare the fold change and standard error estimates between `MSstats+` and base `MSstats` to see why the p-values are different.

```{r}
comparison_result$ComparisonResult$Model = "MSstats+"
base_comparison_result$ComparisonResult$Model = "MSstats"

merged_results = rbindlist(list(
    comparison_result$ComparisonResult,
    base_comparison_result$ComparisonResult
))

ggplot(data=merged_results) + 
    geom_col(aes(x=Protein, y=abs(log2FC), fill=Model), position="dodge") +
    geom_hline(yintercept=1, linetype="dashed", color="black", linewidth=1) +
    theme_bw(base_size =16) + 
    labs(title="Log fold change comparison",
         x="MSstats+",
         y="Base MSstats")
```

The fold changes are both near the ground truth and `MSstats+` does not clearly outperform base `MSstats`.

```{r}
ggplot(data=merged_results) + 
    geom_col(aes(x=Protein, y=SE, fill=Model), position="dodge") +
    theme_bw(base_size =16) + 
    labs(title="Standard error comparison",
         x="MSstats+",
         y="Base MSstats")
```

We can see the advantage of `MSstats+` in the standard error estimates. `MSstats+` downweights poorly quantified measurements and runs, which can lead to much better estimates of the true standard error. In this example `MSstats+` downweighted the runs which added a large amount of uncertainty, leading to it identifying all proteins as differentially abundant. 


## MSstatsBig

MSstats+ can also be used in conjunction with MSstatsBig to analyze large DIA 
datasets. The workflow leverages the MSstatsBig converter and an external 
execution of the MSstats+ isolation forest. An example of how to do this is 
shown below.

This section of the vignette does not run automatically and must be executed 
manually.

```{r eval=FALSE, include=TRUE}
library(MSstatsBig)

# Define file path to data
spectronaut_raw = system.file(
    "tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv",
    package = "MSstatsConvert")

# Run MSstatsBig converter
converted_data = bigSpectronauttoMSstatsFormat(
  input_file=spectronaut_raw,
  output_file_name="output_file.csv",
  intensity = "F.PeakHeight",
  filter_by_excluded = TRUE,
  filter_by_identified = FALSE,
  filter_by_qvalue = TRUE,
  qvalue_cutoff = 0.01,
  filter_unique_peptides = TRUE,
  aggregate_psms = TRUE,
  filter_few_obs = FALSE,
  remove_annotation = FALSE,
  calculateAnomalyScores=TRUE,
  anomalyModelFeatures=c("FG.ShapeQualityScore (MS1)",
                         "FG.ShapeQualityScore (MS2)",
                         "EG.ApexRT"),
  backend="arrow")
converted_data = dplyr::collect(converted_data)

# Prepare info for anomaly model
# Get temporal ordering from input file (can also be done externally)
temporal = fread(spectronaut_raw)[
    , .SD[1], by = .(`R.Run Date (Formatted)`, R.FileName)]

temporal = temporal[, Run.TimeParsed := as.POSIXct(
    `R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")][
  order(Run.TimeParsed)][
  , !"Run.TimeParsed"]

temporal$Order = 1:nrow(temporal)
temporal$Run = temporal$R.FileName
temporal = temporal[, c("Run", "Order")]

# Specify anomaly model parameters
anomalyModelFeatures=c("FG.ShapeQualityScore (MS2)",
                       "FG.ShapeQualityScore (MS1)",
                       "EG.ApexRT")
anomalyModelFeatures = MSstatsConvert:::.standardizeColnames(anomalyModelFeatures)

anomalyModelFeatureTemporal=c("mean_decrease",
                              "mean_decrease",
                              "dispersion_increase")

n_trees=100
max_depth="auto"
numberOfCores=1

# Run anomaly detection model
msstats_input = MSstatsConvert::MSstatsAnomalyScores(
  as.data.table(converted_data), anomalyModelFeatures,
  anomalyModelFeatureTemporal, .5, 100, temporal, n_trees,
  max_depth, numberOfCores)
head(msstats_input)

```

This code will run the MSstatsBig converter and then run the anomaly model. This
workflow leads to a much faster processing speed compared to using the standard 
converter, and allows users to work with Spectronaut files which may be too 
large to fit into memory. After running the anomaly model, the resulting data 
can be input into the MSstats+ workflow reviewed previously.

One note here is the temporal ordering. In this example, we extracted the 
temporal run order directly from the Spectronaut file. However, in practice this
may not be appropriate for big data files. In cases where the data is large, 
this ordering should be extracted externally and provided by the user, or the 
user should sample the Spectronaut file rather than load it all into memory.