--- title: "MSstatsPTM LabelFree Workflow" author: "Devon Kohler ()" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MSstatsPTM LabelFree Workflow} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width=8, fig.height=8 ) ``` This vignette provides an example workflow for how to use the package `MSstatsPTM` for a mass spectrometry based proteomics experiment acquired with a labelfree acquisition methods, such as DDA/DIA/SRM/PRM. It also provides examples and an analysis of how adjusting for global protein levels allows for better interpretations of PTM modeling results. ## Installation To install this package, start R (version "4.0") and enter: ``` {r, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MSstatsPTM") ``` ```{r, message=FALSE, warning=FALSE} library(MSstatsPTM) library(MSstats) ``` ## 1. Workflow ### 1.1 Raw Data Format **Note: We are actively developing dedicated converters for `MSstatsPTM`. If you have data from a processing tool that does not have a dedicated converter in MSstatsPTM please add a github issue `https://github.com/Vitek-Lab/MSstatsPTM/issues` and we will add the converter.** The first step is to load in the raw dataset for both the PTM and Protein datasets. Each dataset can formatted using dedicated converters in `MSstatsPTM`, such as `FragePipetoMSstatsPTMFormat`. If there is not a dedicated converter for your tool in `MSstatsPTM`, you can alternatively leverage converters from base `MSstats`. If using converters from `MSstats` note they will need to be run both on the global protein and PTM datasets. You might notice a FASTA file is also needed for some converters. This FASTA file can be obtained by querying [Uniprot](https://www.uniprot.org/id-mapping) with all of the protein IDs present in your PTM dataset. The FASTA file is a necessary input because some tools (e.g. MaxQuant) do not report the specific amino acid that is modified relative to the whole *protein* sequence. Rather, they report the specific amino acid relative to the reported *peptide*. This distinction is important because modifications like phosphorylation, methylation, or acetylation often have specific roles depending on where they occur within the full-length protein. With the help of a FASTA file, MSstatsPTM can determine the specific amino acid that is modified in the context of the whole protein sequence. Please note for the PTM dataset, both the protein and modification site (or peptide), must be added into the `ProteinName` column. This allows for the package to summarize to the peptide level, and avoid the off chance there are matching peptides between proteins. For an example of how this can be done please see the code below. The output of the converter is a list with two formatted data.tables. One each for the PTM and Protein datasets. If a global profiling run was not performed, the Protein data.table will just be `NULL` #### 1.1.1 MaxQuant - `MaxQtoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for MaxQuant output. Experiments can be acquired with label-free or TMT labeling methods. No matter the experiment, we recommend using the `evidence.txt` file, and not a PTM specific file such as the `Phospho (STY).txt` file. The `MaxQtoMSstatsPTMFormat` converter includes a variety of parameters. Examples for processing a TMT and LF dataset can be seen below. ``` {r maxq, eval=TRUE} # TMT experiment head(maxq_tmt_evidence) head(maxq_tmt_annotation) msstats_format_tmt = MaxQtoMSstatsPTMFormat(evidence=maxq_tmt_evidence, annotation=maxq_tmt_annotation, fasta=system.file("extdata", "maxq_tmt_fasta.fasta", package="MSstatsPTM"), fasta_protein_name="uniprot_ac", mod_id="\\(Phospho \\(STY\\)\\)", use_unmod_peptides=TRUE, labeling_type = "TMT", which_proteinid_ptm = "Proteins") head(msstats_format_tmt$PTM) head(msstats_format_tmt$PROTEIN) # LF experiment head(maxq_lf_evidence) head(maxq_lf_annotation) msstats_format_lf = MaxQtoMSstatsPTMFormat(evidence=maxq_lf_evidence, annotation=maxq_lf_annotation, fasta=system.file("extdata", "maxq_lf_fasta.fasta", package="MSstatsPTM"), fasta_protein_name="uniprot_ac", mod_id="\\(Phospho \\(STY\\)\\)", use_unmod_peptides=TRUE, labeling_type = "LF", which_proteinid_ptm = "Proteins") head(msstats_format_lf$PTM) head(msstats_format_lf$PROTEIN) ``` #### 1.1.2 FragPipe - `FragPipetoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for FragPipe output. Experiments must be acquired with TMT labeling methods (a label-free converter is currently in development).The input is the `msstats.csv` and annotation files automatically generated by FragPipe. FragPipe provides additional info on the localization of modification sites, and the `FragPipetoMSstatsPTMFormat` converter includes localization options that are not present in other converters. ``` {r fragpipe, eval=TRUE} head(fragpipe_input) head(fragpipe_annotation) head(fragpipe_input_protein) head(fragpipe_annotation_protein) msstats_data = FragPipetoMSstatsPTMFormat(fragpipe_input, fragpipe_annotation, fragpipe_input_protein, fragpipe_annotation_protein, mod_id_col = "STY", localization_cutoff=.75, remove_unlocalized_peptides=TRUE) head(msstats_data$PTM) head(msstats_data$PROTEIN) ``` #### 1.1.3 Proteome Discoverer - `PDtoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for Proteome Discoverer output. Experiments can be acquired with label-free or TMT labeling methods. The input is the `psm` file and a custom built annotation file. Proteome Discoverer provides additional info on the localization of modification sites, and the `PDtoMSstatsPTMFormat` converter includes localization options that are not present in other converters. ``` {r pd, eval=TRUE} head(pd_psm_input) head(pd_annotation) msstats_format = PDtoMSstatsPTMFormat(pd_psm_input, pd_annotation, system.file("extdata", "pd_fasta.fasta", package="MSstatsPTM"), use_unmod_peptides=TRUE, which_proteinid = "Master.Protein.Accessions") head(msstats_format$PTM) head(msstats_format$PROTEIN) ``` #### 1.1.4 Spectronaut - `SpectronauttoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for Spectronaut output. Experiments can be acquired with label-free labeling methods. ``` {r Spectronaut, eval=TRUE} head(spectronaut_input) head(spectronaut_annotation) msstats_input = SpectronauttoMSstatsPTMFormat(spectronaut_input, annotation=spectronaut_annotation, fasta_path=system.file("extdata", "spectronaut_fasta.fasta", package="MSstatsPTM"), use_unmod_peptides=TRUE, mod_id = "\\[Phospho \\(STY\\)\\]", fasta_protein_name = "uniprot_iso" ) head(msstats_input$PTM) head(msstats_input$PROTEIN) ``` #### 1.1.5 Skyline - `SkylinetoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for Skyline output. Experiments can be acquired with label-free labeling methods. #### 1.1.6 Peak Studio - `PStoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for Peak Studio output. Experiments can be acquired with label-free labeling methods. #### 1.1.6 Progenesis - `ProgenesistoMSstatsPTMFormat` `MSstatsPTM` includes a dedicated converter for Progenesis output. Experiments can be acquired with label-free labeling methods. #### 1.1.7 Additional tools If there is not a dedicated `MSstatsPTM` converter for a processing tool, the existing converters in `MSstats` and `MSstatsTMT` converters can be used as described below. Note, in order to do this, it is critical that the ProteinName column be a combination of the Protein Name and modification site. As an example, if you would like to analyze an experiment processed with DIANN, you can leverage the `DIANNtoMSstatsFormat` in `MSstats`. Given two datasets, named `raw_ptm_df` and `raw_protein_df`, and an annotation file, we can process the data as follows. ```{r raw_data, eval = FALSE} # Add site into ProteinName column raw_ptm_df$ProteinName = paste(raw_ptm_df$ProteinName, raw_ptm_df$Site, sep = "_") # Run MSstats Converters PTM_data = MSstats::DIANNtoMSstatsFormat(raw_ptm_df, annotation) PROTEIN_data = MSstats::DIANNtoMSstatsFormat(raw_protein_df, annotation) # Combine into one list msstatsptm_input_data = list(PTM = PTM_data, PROTEIN = PROTEIN_data) ``` The variable `msstatsptm_input_data` can now be used as the input to the remainder of the `MSstatsPTM` processing pipeline. ### 1.2 Summarization - `dataSummarizationPTM` After loading in the input data, the next step is to use the `dataSummarizationPTM` function. This provides the summarized dataset needed to model the protein/PTM abundance. The function will summarize the Protein dataset up to the protein level and will summarize the PTM dataset up to the peptide level. There are multiple options for normalization and missing value imputation. These options should be reviewed in the package documentation. ```{r summarize, message=FALSE, warning=FALSE} MSstatsPTM.summary = dataSummarizationPTM(raw.input, verbose = FALSE, use_log_file = FALSE, append = FALSE) head(MSstatsPTM.summary$PTM$ProteinLevelData) head(MSstatsPTM.summary$PROTEIN$ProteinLevelData) ``` The summarize function returns a list with PTM and Protein summarization information. Each PTM and Protein include a list of data.tables: `FeatureLevelData` is a data.table of reformatted input of dataSummarizationPTM, `ProteinLevelData` is the run level summarization data. ### 1.2.1 QCPlot Once summarized, MSstatsPTM provides multiple plots to analyze the experiment. Here we show the quality control boxplot. The first plot shows the modified data and the second plot shows the global protein dataset. ```{r qcplot, message=FALSE, warning=FALSE} dataProcessPlotsPTM(MSstatsPTM.summary, type = 'QCPLOT', which.PTM = "allonly", address = FALSE) ``` ### 1.2.2 Profile Plot Here we show a profile plot. Again the top plot shows the modified peptide, and the bottom shows the overall protein. ```{r profileplot, message=FALSE, warning=FALSE} dataProcessPlotsPTM(MSstatsPTM.summary, type = 'ProfilePlot', which.Protein = "Q9Y6C9", address = FALSE) ``` ### 1.3 Modeling - `groupComparisonPTM` After summarization, the summarized datasets can be modeled using the `groupComparisonPTM` function. This function will model the PTM and Protein summarized datasets, and then adjust the PTM model for changes in overall protein abundance. The output of the function is a list containing these three models named: `PTM.Model`, `PROTEIN.Model`, `ADJUSTED.Model`. ```{r model, message=FALSE, warning=FALSE} # Specify contrast matrix comparison = matrix(c(-1,0,1,0),nrow=1) row.names(comparison) = "CCCP-Ctrl" colnames(comparison) = c("CCCP", "Combo", "Ctrl", "USP30_OE") MSstatsPTM.model = groupComparisonPTM(MSstatsPTM.summary, data.type = "LabelFree", contrast.matrix = comparison, use_log_file = FALSE, append = FALSE, verbose = FALSE) head(MSstatsPTM.model$PTM.Model) head(MSstatsPTM.model$PROTEIN.Model) head(MSstatsPTM.model$ADJUSTED.Model) ``` ### 1.3.1 Volcano Plot The models from the `groupComparisonPTM` function can be used in the model visualization function, `groupComparisonPlotsPTM`. Here we show Volcano Plots for the models. ``` {r volcano, message=FALSE, warning=FALSE} groupComparisonPlotsPTM(data = MSstatsPTM.model, type = "VolcanoPlot", FCcutoff= 2, logBase.pvalue = 2, address=FALSE) ``` ### 1.3.2 Heatmap Plot Here we show a Heatmap for the models. ``` {r heatmap, message=FALSE, warning=FALSE} groupComparisonPlotsPTM(data = MSstatsPTM.model, type = "Heatmap", which.PTM = 1:30, address=FALSE) ``` ### 1.4 Sample Size Calculation - `designSampleSizePTM` Finally, sample size calculation can be performed using the output of the model and the `designSampleSizePTM` ```{r sample_size, message=FALSE, warning=FALSE} # Specify contrast matrix sample_size = designSampleSizePTM(MSstatsPTM.model, c(2.0, 2.75), FDR = 0.05, numSample = TRUE, power = 0.8) head(sample_size) ``` ### 1.4.1 Sample Size Plot The output of the sample size function can be plotted using the `MSstats` `designSampleSizePlots` function. ```{r sample_size_plot, message=FALSE, warning=FALSE} MSstats::designSampleSizePlots(sample_size) ```