```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` ```{r global_options, include=FALSE} knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE) options(width=110) ``` # __MSstats+: Peak quality-weighted differential analysis__ Author: Devon Kohler `MSstats+` is an extension to the `MSstats` R package in Bioconductor. It is designed to incorporate the underlying spectral peak quality into differential analysis, going beyond peak intensity alone. It integrates the underlying spectral peaks using an isolation forest (iForest) for anomaly detection, which integrates all the peak quality metrics into a single score. This score is then used a weight in protein-level summarization. After protein-level summarization, the estimates of underlying protein abundance and their associated variance are used as weights in a second weighted linear model which is used to perform differential analysis. The models can be applied to experiments with simple group comparison designs, or more complex designs (e.g. comparing more than two experimental conditions, or a repeated measure design, such as a time course). The algorithm is designed to be used with data acquired using data independent acquisition (DIA). Other acquisitions are possible but have yet to be tested. For guidance on how to use `MSstats+` with other acquisitions, please reach out on the `MSstats` Google Group: \url{https://groups.google.com/g/msstats}. Below we outline the workflow for using `MSstats+`, highlighting key options and decisions. ## Load packages ```{r} library(MSstats) library(MSstatsConvert) library(data.table) library(ggplot2) ``` ## Converter The main iForest algorithm for `MSstats+` is implemented in the `MSstatsConvert` package (a dependency of `MSstats`). The `MSstatsConvert` package contains individual converters for each spectral processing tool. Currently, `MSstats+` only has a dedicated converter for `Spectronaut`, however more converters are in development and can be implemented manually (see below), For integration with other tools, please reach out on the `MSstats` Google Group: \url{https://groups.google.com/g/msstats}. ### Load data First we will load the Spectronaut report of identified and quantified peaks. It is necessary that users output the spectral peak quality data along with these peaks. These spectral peak quality metrics must be at least at the precursor-level (i.e., the column headers must not start with `F.`). There are multiple options for columns that can be extracted, some examples include: `FG.ShapeQualityScore (MS1)`, `FG.ShapeQualityScore (MS2)`, and `EG.DeltaRT`. Users can include as many quality metrics as they want, however we recommend using only quality metrics that are key indicators of the quality of a spectral peak. Including too many quality metrics may lead to the iForest model fitting noise. We include additional functionality to assess the performance of the model (discussed below). An example output from Spectronaut is included in the `MSstatsConvert` package to provide an idea of what the input should look like. In addition to the Spectronaut output, `MSstats+` also requires an annotation file which maps the MS run names to the corresponding experimental design. For more details on creating an annotation file, please see the documentation in standard `MSstats` and the associated publications. Finally, `MSstats+` also requires the temporal order of MS run acquisition to identify time-based drops in quality. This can be extracted from the Spectronaut output as shown below. ```{r} # Spectronaut quant file spectronaut_raw = system.file( "tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv", package = "MSstatsConvert") spectronaut_raw = data.table::fread(spectronaut_raw) # Annotation file annotation = system.file( "tinytest/raw_data/Spectronaut/annotation.csv", package = "MSstatsConvert") annotation = data.table::fread(annotation) # Create temporal run ordering data.table spectronaut_raw[, `R.Run Date (Formatted)` := as.POSIXct( `R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")] run_order = unique(spectronaut_raw[, c("R.Run Date (Formatted)", "R.FileName")]) run_order = run_order[order(`R.Run Date (Formatted)`)] run_order$Order = 1:nrow(run_order) setnames(run_order, c("R.FileName"), c("Run")) ``` ### Run converter `MSstats+` uses the same converter as base `MSstats`. New parameters have been added to indicate if a user wants to leverage the functionalities in `MSstats+`. Specifically, to use `MSstats+`, users must indicate that the anomaly detection algorithm should used by setting `calculateAnomalyScores=TRUE`. An example specification of the converter can be seen below, with key parameter options highlighted below the code. ```{r} msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat( spectronaut_raw, annotation, intensity = 'PeakArea', excludedFromQuantificationFilter = TRUE, # Pre-filtering calculateAnomalyScores=TRUE, # Key parameter to use MSstats+ anomalyModelFeatures=c("FGShapeQualityScore(MS2)", "FGShapeQualityScore(MS1)", "EGDeltaRT"), # Quality metrics anomalyModelFeatureTemporal=c( "mean_decrease", "mean_decrease", "dispersion_increase"), # Temporal direction of quality metrics runOrder=run_order, # Temporal order of MS runs n_trees=100, max_depth="auto", # iForest parameter (depth of trees) numberOfCores=1) # Cores to use for parallel processing (Mac or Linux only) head(msstats_input) ``` Key parameters for `MSstats+`: - `calculateAnomalyScores`: Set to `TRUE` to use the iForest algorithm to calculate anomaly scores for each spectral peak. - `anomalyModelFeatures`: A character vector of the column names of the quality metrics to be used in the iForest model. These must be at the precursor-level or higher (i.e., the column headers must not start with `F.`). - `anomalyModelFeatureTemporal`: A character vector of the same length as `anomalyModelFeatures` indicating the temporal direction of each quality metric.9 Possible values are `mean_decrease`, `mean_increase`, and `dispersion_increase`. These indicate the direction a metric is changes which indicates a drop in quality (i.e., decrease, increase, or can go in either direction). For example, `EG.DeltaRT` uses `dispersion_increase` because a deviation in expected retention can be either higher or lower. - `runOrder`: A data.table with two columns: `R.FileName` and `Order`. The `R.FileName` column should contain the MS run names as they appear in the annotation file, and the `Order` column should contain the temporal order of acquisition (1 being the first run acquired, 2 being the second, etc.). This is used to identify time-based drops in quality. - `n_trees`: The number of trees to use in the iForest. Default is 100. Too few trees will make the algorithm unstable, while too many trees will increase computation time without performance improvement. - `max_depth`: The maximum depth of each tree in the iForest. Default is `"auto"`, which sets the depth to `ceiling(log2(number of samples))`. - `numberOfCores`: The number of cores to use for parallel processing. This is recommended since the algorithm is fit on the precursor-level and can take significant time to process if using a single core. Only available on Mac and Linux. ### Check model fit After running the converter, it is important to check the fit of the iForest. The iForest is an unsupervised algorithm, so there is no ground truth to compare to. However, we can check the distribution of the anomaly scores to ensure that the model is performing as expected. The anomaly scores should be right-skewed, with most of the scores being low (indicating normal peaks) and a small number of scores being high (indicating anomalous peaks). If the distribution is not right-skewed, it may indicate that the quality metrics used are not appropriate, or that there are not many poor quality peaks (generally found in smaller studies using less complex biological material). In these cases, it may make sense to use base `MSstats` instead of `MSstats+`. The fit of the model can be evaluated using the `CheckDataHealth` function in `MSstatsConvert`. This function will calculate Pearson's moment coefficient of skewness for each iForest model fit (each precursor). We recommend plotting these coefficients using a histogram and confirming the distribution is positive and shifted away from zero (with a median around 1). An example of how to use the function is shown below. ```{r} health_info = CheckDataHealth(msstats_input) skew_score = health_info[[2]] ggplot(skew_score, aes(x = skew)) + geom_histogram(fill = "#E69F00", color = "black", bins=12) + geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 1.5) + theme_minimal(base_size = 16) + labs( title="Distribution of skewness of anomaly scores", x = "Pearson's moment coefficient of skewness", y = "Count" ) + theme( axis.text.x = element_text(size = 16), axis.text.y = element_text(size = 16), axis.title.x = element_text(size = 18), axis.title.y = element_text(size = 18) ) ``` We can see in the example data that the distribution is shifted right away from zero. This indicates that the anomaly model was able to detect poor quality measurements and the user can proceed with the next step. ## Manual formatting and data conversion The `MSstatsConvert` package includes an automatic `MSstats+` converter for Spectronaut, however other tools require manual formatting. This can be done by leveraging the existing `MSstats` converters, joining the quality metrics, and running the anomaly model separately. An example workflow is included below. ```{r} # Extract quality metrics from PSM output quality_join_cols = c("R.FileName", "PG.ProteinGroups", "EG.ModifiedSequence", "FG.Charge", "FG.ShapeQualityScore (MS1)", "FG.ShapeQualityScore (MS2)", "EG.DeltaRT") quality_join_df = spectronaut_raw[, ..quality_join_cols] quality_join_df$EG.ModifiedSequence = stringr::str_replace_all( quality_join_df$EG.ModifiedSequence, "\\_", "") quality_join_df = unique(quality_join_df) # In case there are duplicates (not common) quality_join_df = quality_join_df[ , lapply(.SD, mean, na.rm = TRUE), by = .(R.FileName, PG.ProteinGroups, EG.ModifiedSequence, FG.Charge), .SDcols = c("FG.ShapeQualityScore (MS1)", "FG.ShapeQualityScore (MS2)", "EG.DeltaRT") ] # Extract run order spectronaut_raw[, `R.Run Date (Formatted)` := as.POSIXct( `R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")] run_order = unique(spectronaut_raw[, c("R.Run Date (Formatted)", "R.FileName")]) run_order = run_order[order(`R.Run Date (Formatted)`)] run_order$Order = 1:nrow(run_order) setnames(run_order, c("R.FileName"), c("Run")) # Run standard MSstats converter without anomaly detection msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat( spectronaut_raw, annotation, intensity = 'PeakArea', excludedFromQuantificationFilter = TRUE, # Pre-filtering calculateAnomalyScores=FALSE) # Turn anomaly detection off # Join in quality metrics msstats_input = merge(as.data.table(msstats_input), as.data.table(quality_join_df), all.x=TRUE, all.y=FALSE, by.x=c("Run", "ProteinName", "PeptideSequence", "PrecursorCharge"), by.y=c("R.FileName", "PG.ProteinGroups", "EG.ModifiedSequence", "FG.Charge")) # Run Anomaly Model separately input = MSstatsConvert::MSstatsAnomalyScores( msstats_input, c("FG.ShapeQualityScore (MS1)", "FG.ShapeQualityScore (MS2)", "EG.DeltaRT"), c("mean_decrease", "mean_decrease", "dispersion_increase"), .5, 100, run_order, 100, "auto", 1) ``` The resulting data can now be passed into the remaining `MSstats+` workflow as listed below. ## Data processing and summarization After running the converter with `MSstats+` options enabled, the next step is to run the standard `MSstats` `dataProcess` function. This function will automatically leverage the information from the `MSstats+` option, as long as the linear summarization option is turned on. The code to do this is highlighted below. ```{r} summarized = dataProcess(msstats_input, normalization=FALSE, # no normalization MBimpute=TRUE, # Use imputation summaryMethod="linear" # Key MSstats+ parameter ) head(summarized$ProteinLevelData) ``` Key parameters for `MSstats+`: - `summaryMethod`: Set to `linear` to use the peak quality-weighted linear regression summarization. This is a key parameter for `MSstats+`. Other summarization methods (e.g., Tukey's median polish) will not leverage the peak quality information. When the `dataProcess` function is run with `summaryMethod=linear` a new column will be included called `Variance` indicating the variance of the underlying protein abundance estimate. This variance is used in the next step of the workflow. ## Differential analysis The final step of the `MSstats+` workflow is to perform differential analysis. This is done using the standard `groupComparison` function in `MSstats`. Similar to the `dataProcess` function, the `groupComparison` function will automatically leverage the information from the `MSstats+` option. The code to do this is highlighted below. ```{r} comparison = matrix(c(-1,1),nrow=1) row.names(comparison) = "Condition2-Condition1" colnames(comparison) = c("Condition1", "Condition2") comparison_result = groupComparison( contrast.matrix = comparison, data=summarized) head(comparison_result$ComparisonResult) ``` The `groupComparison` function is run the exact same in `MSstats+` as `MSstats`, but under the hood the variation from the previous step is used in a second weighted linear regression (in comparison to the unweighted version in `MSstats`). ## Compare results to base `MSstats` ```{r} base_msstats_input = MSstatsConvert::SpectronauttoMSstatsFormat( spectronaut_raw, annotation, intensity = 'PeakArea', excludedFromQuantificationFilter = TRUE, # Pre-filtering calculateAnomalyScores=FALSE) # No anomaly model base_summarized = dataProcess(base_msstats_input, normalization=FALSE, # no normalization MBimpute=TRUE, # Use imputation summaryMethod="TMP" # Tukey median polish summarization ) base_comparison_result = groupComparison( contrast.matrix = comparison, data=base_summarized) head(base_comparison_result$ComparisonResult) ``` We can see that `MSstats` does not identify all the proteins as differentially abundant, even though there was a true log$_2$ fold change of -1. We can compare the fold change and standard error estimates between `MSstats+` and base `MSstats` to see why the p-values are different. ```{r} comparison_result$ComparisonResult$Model = "MSstats+" base_comparison_result$ComparisonResult$Model = "MSstats" merged_results = rbindlist(list( comparison_result$ComparisonResult, base_comparison_result$ComparisonResult )) ggplot(data=merged_results) + geom_col(aes(x=Protein, y=abs(log2FC), fill=Model), position="dodge") + geom_hline(yintercept=1, linetype="dashed", color="black", linewidth=1) + theme_bw(base_size =16) + labs(title="Log fold change comparison", x="MSstats+", y="Base MSstats") ``` The fold changes are both near the ground truth and `MSstats+` does not clearly outperform base `MSstats`. ```{r} ggplot(data=merged_results) + geom_col(aes(x=Protein, y=SE, fill=Model), position="dodge") + theme_bw(base_size =16) + labs(title="Standard error comparison", x="MSstats+", y="Base MSstats") ``` We can see the advantage of `MSstats+` in the standard error estimates. `MSstats+` downweights poorly quantified measurements and runs, which can lead to much better estimates of the true standard error. In this example `MSstats+` downweighted the runs which added a large amount of uncertainty, leading to it identifying all proteins as differentially abundant. ## MSstatsBig MSstats+ can also be used in conjunction with MSstatsBig to analyze large DIA datasets. The workflow leverages the MSstatsBig converter and an external execution of the MSstats+ isolation forest. An example of how to do this is shown below. This section of the vignette does not run automatically and must be executed manually. ```{r eval=FALSE, include=TRUE} library(MSstatsBig) # Define file path to data spectronaut_raw = system.file( "tinytest/raw_data/Spectronaut/spectronaut_quality_input.csv", package = "MSstatsConvert") # Run MSstatsBig converter converted_data = bigSpectronauttoMSstatsFormat( input_file=spectronaut_raw, output_file_name="output_file.csv", intensity = "F.PeakHeight", filter_by_excluded = TRUE, filter_by_identified = FALSE, filter_by_qvalue = TRUE, qvalue_cutoff = 0.01, filter_unique_peptides = TRUE, aggregate_psms = TRUE, filter_few_obs = FALSE, remove_annotation = FALSE, calculateAnomalyScores=TRUE, anomalyModelFeatures=c("FG.ShapeQualityScore (MS1)", "FG.ShapeQualityScore (MS2)", "EG.ApexRT"), backend="arrow") converted_data = dplyr::collect(converted_data) # Prepare info for anomaly model # Get temporal ordering from input file (can also be done externally) temporal = fread(spectronaut_raw)[ , .SD[1], by = .(`R.Run Date (Formatted)`, R.FileName)] temporal = temporal[, Run.TimeParsed := as.POSIXct( `R.Run Date (Formatted)`, format = "%m/%d/%Y %I:%M:%S %p")][ order(Run.TimeParsed)][ , !"Run.TimeParsed"] temporal$Order = 1:nrow(temporal) temporal$Run = temporal$R.FileName temporal = temporal[, c("Run", "Order")] # Specify anomaly model parameters anomalyModelFeatures=c("FG.ShapeQualityScore (MS2)", "FG.ShapeQualityScore (MS1)", "EG.ApexRT") anomalyModelFeatures = MSstatsConvert:::.standardizeColnames(anomalyModelFeatures) anomalyModelFeatureTemporal=c("mean_decrease", "mean_decrease", "dispersion_increase") n_trees=100 max_depth="auto" numberOfCores=1 # Run anomaly detection model msstats_input = MSstatsConvert::MSstatsAnomalyScores( as.data.table(converted_data), anomalyModelFeatures, anomalyModelFeatureTemporal, .5, 100, temporal, n_trees, max_depth, numberOfCores) head(msstats_input) ``` This code will run the MSstatsBig converter and then run the anomaly model. This workflow leads to a much faster processing speed compared to using the standard converter, and allows users to work with Spectronaut files which may be too large to fit into memory. After running the anomaly model, the resulting data can be input into the MSstats+ workflow reviewed previously. One note here is the temporal ordering. In this example, we extracted the temporal run order directly from the Spectronaut file. However, in practice this may not be appropriate for big data files. In cases where the data is large, this ordering should be extracted externally and provided by the user, or the user should sample the Spectronaut file rather than load it all into memory.