--- title: "MSstatsResponse: A package for detecting drug-protein interactions in dose-response mass spectrometry-based proteomics experiments" author: "Sarah Szvetecz (szvetecz.s@northeastern.edu), Devon Kohler (kohler.d@northeastern.edu), Olga Vitek (o.vitek@northeastern.edu)" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{MSstatsResponse User Guide} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: markdown: wrap: sentence --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ## Introduction MSstatsResponse provides statistical tools for detecting drug-protein interactions and estimating IC50 values from dose-response mass spectrometry-based proteomics experiments. The package implements isotonic regression models to detect monotonic dose-response relationships and provides robust statistical inference through F-tests and bootstrap confidence intervals. MSstatsResponse is designed to work seamlessly with summarized data from MSstats and MSstatsTMT workflows, but can also process other dose-response proteomics datasets that include protein-level quantifications across drug treatments. ### Statistical framework The package employs isotonic regression, a non-parametric method that fits monotonic functions without assuming a specific functional form. This approach is particularly suitable for dose-response data where monotonicity (increasing or decreasing trend) is expected. ### Analysis workflow MSstatsResponse implements a three-step statistical analysis pipeline: 0. **Data preparation**: `MSstatsPrepareDoseResponseFit()` - Convert input data to standardized format 1. **Interaction detection**: `doseResponseFit()` - Detect drug-protein interactions using isotonic regression and F-tests 2. **IC50 estimation**: `predictIC50()` - Estimate IC50 values with bootstrap confidence intervals 3. **Power analysis for experimental planning**: `futureExperimentSimulation` - Simulate dose-response experiments to determine optimal study designs and predict true-positive and false-positive rates ## Installation and setup ### Install MSstatsResponse ```{r install, eval=FALSE} # Install from Bioconductor if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MSstatsResponse") ``` ### Load required libraries ```{r load_libraries, message=FALSE} library(MSstatsResponse) library(dplyr) library(tidyverse) library(ggplot2) library(data.table) # Optional: for upstream data processing # library(MSstats) # library(MSstatsTMT) ``` ## Data preprocessing with MSstats If analyzing a mass spectrometry-based proteomics dataset, data should be preprocessed using the standard MSstats workflow for cleaning, normalization, and protein summarization prior to analysis with MSstatsResponse. When using MSstats, data can be converted with one of the available converter functions (e.g., SpectronauttoMSstatsFormat()) and subsequently processed with dataProcess(). For MSstatsTMT, converter functions (e.g., MaxQtoMSstatsTMTFormat()) may be used, followed by proteinSummarization() for data processing. Please refer to the MSstats and MSstatsTMT Bioconductor vignettes for a comprehensive list of supported converters and detailed preprocessing recommendations. ### Example MSstats workflow ```{r msstats_preprocessing, eval=FALSE} # Read raw data (example with Spectronaut output) raw_data <- read_tsv("path/to/spectronaut_report.tsv") # Convert to MSstats format msstats_data <- SpectronauttoMSstatsFormat(raw_data) # Process data: normalization and protein summarization processed_data <- dataProcess( msstats_data, normalization = "equalizeMedians", # or FALSE for no normalization summaryMethod = "TMP", # Tukey's median polish MBimpute = TRUE, # Impute missing values maxQuantileforCensored = 0.999 ) # Extract protein-level data for dose-response analysis protein_level_data <- processed_data$ProteinLevelData ``` ### Loading example data For this vignette, we'll use pre-processed data prepared following the steps described above. For more information on preprocessing mass spectrometry–based proteomics experiments, see the vignettes for MSstats and/or MSstatsTMT. ```{r load_example_data} # Load pre-processed DIA-MS data example from data/ (DIA_MSstats_Normalized.rda) data("DIA_MSstats_Normalized", package = "MSstatsResponse") dia_normalized <- DIA_MSstats_Normalized # Examine data structure str(dia_normalized$ProteinLevelData[1:5, ]) ``` ## Data preparation for dose-response analysis ### Converting GROUP labels to numeric doses The `convertGroupToNumericDose()` function parses MSstats-style GROUP labels to extract drug names and numeric dose values. **Format expectations:** - Control samples: `"DMSO"` (assigned dose = 0) - Treatment samples: `"DrugName_ValueUnit"` (e.g., `"Drug1_003uM"`, `"Drug2_300nM"`) ```{r convert_doses} # Convert GROUP labels to numeric doses dose_info <- convertGroupToNumericDose(dia_normalized$ProteinLevelData$GROUP) # Add dose and drug information to the dataset dia_normalized$ProteinLevelData <- dia_normalized$ProteinLevelData %>% mutate( dose_nM = dose_info$dose_nM, # Dose in nanomolar drug = dose_info$drug # Drug name ) # View converted data dia_normalized$ProteinLevelData %>% select(Protein, GROUP, drug, dose_nM) %>% head(10) ``` ### Formatting data with MSstatsPrepareDoseResponseFit() This function standardizes the data structure for downstream analysis. Users can either use pre-processed data as shown above (e.g., dia_normalized$ProteinLevelData) or begin at this step with their own datasets. It is important to ensure that column names are correctly matched to the expected ones when using MSstatsPrepareDoseResponseFit(). The dose_column corresponds to the values on the x-axis (e.g., dose, concentration, time, or temperature). The protein_column should contain unique identifiers (e.g., protein, gene, or peptide sequence), and the log_abundance_column represents the response values on the y-axis (e.g., log intensity, expression, or growth). ```{r prepare_data} # Prepare data for dose-response fitting dia_prepared <- MSstatsPrepareDoseResponseFit( data = dia_normalized$ProteinLevelData, dose_column = "dose_nM", drug_column = "drug", protein_column = "Protein", log_abundance_column = "LogIntensities", transform_nM_to_M = TRUE ) # Examine prepared data structure str(dia_prepared) # View sample of prepared data dia_prepared %>% filter(protein %in% c("PROTEIN_A", "PROTEIN_B")) %>% arrange(protein, drug, dose) %>% head(20) ``` ## Drug-protein interaction detection ### Understanding Isotonic Regression Isotonic regression fits a monotonic function to the data without assuming a parametric form. For this example, we have drug inhibitor dose-response data. When treated with inhibitors we typically expect: - **Non-increasing response**: Drug binding reduces protein abundance - **Monotonic constraint**: Response doesn't increase with dose ### Running doseResponseFit() The `doseResponseFit()` function: 1. Fits isotonic regression for each protein-drug pair 2. Performs F-test comparing the fitted model to a flat (null) model 3. Adjusts p-values for multiple testing using Benjamini-Hochberg procedure ```{r fit_dose_response, message=FALSE, warning=FALSE} # Detect drug-protein interactions response_results <- doseResponseFit( data = dia_prepared, increasing = FALSE, # Expect decreasing response transform_dose = TRUE, # Apply log10(dose + 1) transformation ratio_response = FALSE # Stay on log2 scale for testing ) # Examine results response_results %>% select(protein, drug, F_statistic, P_value, adjust_pval) %>% arrange(adjust_pval) %>% head(10) ``` ### Interpreting results - **F_statistic**: Ratio of between-dose to within-dose variance - **P_value**: Raw p-value from F-test - **adjust_pval**: FDR-adjusted p-value (Benjamini-Hochberg) Proteins with `adjust_pval < 0.05` show significant dose-dependent responses. ```{r summarize_results} # Count significant interactions n_total <- nrow(response_results) n_significant <- sum(response_results$adjust_pval < 0.05) cat("Total protein-drug pairs tested:", n_total, "\n") cat("Significant interactions (FDR < 0.05):", n_significant, "\n") cat("Percentage significant:", round(100 * n_significant/n_total, 1), "%\n") # Top hits top_hits <- response_results %>% filter(adjust_pval < 0.05) %>% arrange(adjust_pval) %>% head(5) print(top_hits) ``` ## IC50 estimation ### Understanding IC50 calculation IC50 represents the dose at which the response is reduced to 50% of the control (DMSO) level. MSstatsResponse estimates IC50 by: 1. Fitting isotonic regression on the ratio scale (response relative to DMSO) 2. Finding the dose where fitted response = 0.5 3. Computing confidence intervals via bootstrap resampling ### Running predictIC50() ```{r predict_ic50, message=FALSE, warning=FALSE} # Estimate IC50 with bootstrap confidence intervals ic50_predictions <- predictIC50( data = dia_prepared, ratio_response = TRUE, # Use ratio scale for IC50 transform_dose = TRUE, # Log-transform doses increasing = FALSE, # Decreasing response expected bootstrap = TRUE, # Compute confidence intervals n_samples = 1000, # Number of bootstrap samples alpha = 0.10 # 90% confidence intervals ) # View IC50 estimates ic50_predictions %>% arrange(IC50) %>% head(10) ``` ### Parallel processing for large datasets For datasets with many proteins, use parallel processing: ```{r parallel_ic50, eval=FALSE} # Register a BiocParallel backend library(BiocParallel) if (.Platform$OS.type == "windows") { register(SnowParam(workers = 4, type = "SOCK")) } else { register(MulticoreParam(workers = 4)) } # Parallel IC50 estimation (faster for large datasets) ic50_parallel <- predictIC50( data = dia_prepared, ratio_response = TRUE, transform_dose = TRUE, bootstrap = TRUE, n_samples = 1000, BPPARAM = bpparam() # use 4 CPU cores defined above ) ``` ## Visualization ### Individual protein dose-response curves The `visualizeResponseProtein()` function creates publication-quality dose-response plots: ```{r visualize_single, fig.height=5, fig.width=7} # Visualize strong responder visualizeResponseProtein( data = dia_prepared, protein_name = "PROTEIN_A", drug_name = "Drug1", ratio_response = TRUE, show_ic50 = TRUE, add_ci = TRUE, transform_dose = TRUE, n_samples = 1000 ) ``` ```{r visualize_another, fig.height=5, fig.width=7} # Visualize another target visualizeResponseProtein( data = dia_prepared, protein_name = "PROTEIN_B", drug_name = "Drug1", ratio_response = TRUE, show_ic50 = TRUE, add_ci = TRUE ) ``` ### Comparing log2 vs ratio scale visualization ```{r compare_scales, fig.height=4, fig.width=12} # Log2 scale (left panel) p1 <- visualizeResponseProtein( data = dia_prepared, protein_name = "PROTEIN_A", drug_name = "Drug1", ratio_response = FALSE, # Log2 scale show_ic50 = TRUE, add_ci = FALSE ) # Ratio scale (right panel) p2 <- visualizeResponseProtein( data = dia_prepared, protein_name = "PROTEIN_A", drug_name = "Drug1", ratio_response = TRUE, # Ratio scale show_ic50 = TRUE, add_ci = FALSE ) # Combine plots (requires gridExtra) gridExtra::grid.arrange(p1, p2, ncol = 2) ``` ## Experimental design planning ### Simulating future experiments The `futureExperimentSimulation()` function helps optimize experimental design by simulating performance under different conditions: ### Using your own data as templates You can use proteins from your own experimental data as templates for more realistic simulations. This approach leverages your validated hits to model expected performance: ```{r} # First, identify proteins from your results to use as templates # Run simulation using your data as templates custom_simulation <- futureExperimentSimulation( N_proteins = 3000, N_rep = 2, N_Control_Rep = 3, Concentrations = c(0, 1, 3, 10, 30, 100, 300, 1000, 3000), data = dia_prepared, # Your data strong_proteins = 'PROTEIN_B', # User defined strong responder weak_proteins = 'PROTEIN_A', # User defined weak responder no_interaction_proteins = 'PROTEIN_C', # User defined negative control drug_name = "Drug1", # Specify which drug to model IC50_Prediction = FALSE ) # Compare performance metrics print(custom_simulation$Hit_Rates_Plot) # Examine the templates that were extracted from your data print(custom_simulation$Template_Used) ``` ## Session information ```{r session_info} sessionInfo() ``` ## References 1. Szvetecz, S., et al. (2025).MSstatsResponse: Semi-parametric statistical model enhances drug-protein interaction detection and IC50 estimation in chemoproteomics experiments. *Manuscript in preparation.* 2. Choi, M., et al. (2014). MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. *Bioinformatics*, 30(17), 2524-2526.