---
title: "2. Scoring Functions in SMAD"
author:
- name: "Qingzhou (Johnson) Zhang"
  email: zqzneptune@hotmail.com
date: "`r Sys.Date()`"
package: SMAD
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{2. Scoring Functions in SMAD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The `SMAD` package provides a suite of scoring functions to evaluate protein-protein interactions (PPI) from Affinity Purification-Mass Spectrometry (AP-MS) data. These functions assign probability or confidence scores to interactions, helping to distinguish true biological interactions from non-specific background contaminants.

This vignette showcases the various scoring methods implemented in `SMAD`.

# Data Preparation

Most scoring functions in `SMAD` take a standardized input format. We will use the built-in `TestDatInput` dataset for demonstration.

```{r load_data}
library(SMAD)
data("TestDatInput")
head(TestDatInput)
```

The columns are:

- `idRun`: Unique identifier for the AP-MS run.
- `idBait`: Unique identifier for the bait protein.
- `idPrey`: Unique identifier for the prey protein.
- `countPrey`: Spectral counts (or peptide counts) for the prey.
- `lenPrey`: Length of the prey protein.

# Scoring Methods

## CompPASS

The Comparative Proteomic Analysis Software Suite (CompPASS) identifies high-confidence interactions by comparing protein occurrences across multiple AP-MS experiments. It produces four types of scores: Z-score, S-score, D-score, and WD-score (weighted D-score).

```{r compPASS}
scoreCompPASS <- CompPASS(TestDatInput)
head(scoreCompPASS)
```

## HGScore

`HGScore` is based on a hypergeometric distribution error model. It incorporates the Normalized Spectral Abundance Factor (NSAF) to account for protein length and abundance.

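As a rough illustration of the NSAF normalization mentioned above, the following sketch computes NSAF by hand on a small invented table that reuses the `TestDatInput` column names. It assumes the standard definition (spectral count divided by protein length, then divided by the sum of that ratio over all preys in the same run). The names `toy`, `saf`, and `nsaf` are hypothetical, and this is not necessarily how `HG()` derives the quantity internally.

```{r nsaf_sketch}
# Toy data in the TestDatInput column layout; all values are invented.
toy <- data.frame(
  idRun     = c("r1", "r1", "r1", "r2", "r2"),
  idPrey    = c("A", "B", "C", "A", "D"),
  countPrey = c(10, 20, 5, 8, 2),
  lenPrey   = c(100, 400, 50, 100, 200)
)

# Spectral abundance factor: spectral count normalized by protein length.
toy$saf <- toy$countPrey / toy$lenPrey

# NSAF: each SAF divided by the sum of SAF values within the same run.
toy$nsaf <- toy$saf / ave(toy$saf, toy$idRun, FUN = sum)

toy
```

Within each run the NSAF values sum to 1, so a long protein with many spectra is not automatically ranked above a short protein with fewer spectra.
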
```{r HG}
scoreHG <- HG(TestDatInput)
head(scoreHG)
```

## DICE

The Dice coefficient is used to score the interaction affinity between two proteins based on their co-occurrence across different runs. It focuses on prey-prey interactions.

```{r DICE}
scoreDICE <- DICE(TestDatInput)
head(scoreDICE)
```

## Hart

Based on Hart et al. (2007), this algorithm uses a hypergeometric distribution to compute the probability of two proteins interacting, based on their frequency of co-purification.

```{r Hart}
scoreHart <- Hart(TestDatInput)
head(scoreHart)
```

## PE (Purification Enrichment)

The PE score is based on a Bayesian classifier framework (Collins et al., 2007). It combines "spoke" (bait-prey) and "matrix" (prey-prey) models to compute a comprehensive enrichment score.

```{r PE}
# PE might require data.table and RcppAlgos
scorePE <- PE(TestDatInput)
head(scorePE)
```

## SAINTexpress

Significance Analysis of INTeractome (SAINT) is a widely used tool for AP-MS data. `SMAD` provides an integrated version with two modes: Spectral Count (`spc`) and Intensity (`int`).

### SAINTexpress-spc (Spectral Count)

This mode is used for data where protein abundance is measured by spectral counts.

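For orientation, the three input tables used by this mode can be pictured as small data frames. The sketch below is a hand-made toy example, not package data: the `toy_*` objects and all values are invented, the column names follow the ones assigned when the example files are read in this vignette, and the `"T"`/`"C"` labels for test and control purifications are assumed from the usual SAINT input convention.

```{r saint_input_sketch}
# Hypothetical toy tables; all identifiers and values are invented.

# One row per purification: run ID, bait ID, test ("T") or control ("C").
toy_bait <- data.frame(
  ip_id     = c("ip1", "ip2", "ctrl1"),
  bait_id   = c("BaitA", "BaitA", "Mock"),
  test_ctrl = c("T", "T", "C")
)

# One row per prey protein: prey ID and sequence length.
toy_prey <- data.frame(
  prey_id     = c("P1", "P2"),
  prey_length = c(350, 1200)
)

# One row per prey observed in a run, with its spectral count.
toy_inter <- data.frame(
  ip_id   = c("ip1", "ip1", "ip2", "ctrl1"),
  bait_id = c("BaitA", "BaitA", "BaitA", "Mock"),
  prey_id = c("P1", "P2", "P1", "P2"),
  quant   = c(12, 3, 9, 4)
)

# Consistency checks: every interaction row refers to a known run and prey.
stopifnot(
  all(toy_inter$ip_id %in% toy_bait$ip_id),
  all(toy_inter$prey_id %in% toy_prey$prey_id)
)
```

Keeping the run and prey identifiers consistent across the three tables is the main thing to verify before scoring, since the tables are linked through those keys.
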
```{r SAINT_spc}
# Using example data from the package
bait_path <- system.file("exdata", "TIP49", "bait.dat", package = "SMAD")
prey_path <- system.file("exdata", "TIP49", "prey.dat", package = "SMAD")
inter_path <- system.file("exdata", "TIP49", "inter.dat", package = "SMAD")

bait <- read.table(
  bait_path, sep = "\t", header = FALSE,
  col.names = c("ip_id", "bait_id", "test_ctrl")
)
prey <- read.table(
  prey_path, sep = "\t", header = FALSE,
  col.names = c("prey_id", "prey_length")
)
inter <- read.table(
  inter_path, sep = "\t", header = FALSE,
  col.names = c("ip_id", "bait_id", "prey_id", "quant")
)

result_spc <- SAINTexpress_spc(inter, prey, bait)
head(result_spc[, c("Bait", "Prey", "SaintScore", "BFDR")])
```

### SAINTexpress-int (Intensity)

This mode is designed for intensity-based data, such as those from label-free quantification (LFQ).

```{r SAINT_int}
# Re-using the same example data for demonstration purposes
result_int <- SAINTexpress_int(inter, prey, bait)
head(result_int[, c("Bait", "Prey", "SaintScore", "BFDR")])
```

# Visualization of Scores

Visualizing the distribution of scores can help in selecting appropriate thresholds for high-confidence interactions.

```{r visualization, fig.width=10, fig.height=6}
par(mfrow = c(2, 3))
hist(scoreCompPASS$scoreWD, main = "CompPASS WD-score", xlab = "WD-score", col = "skyblue")
hist(scoreHG$HG, main = "HGScore", xlab = "HGScore", col = "salmon")
hist(scoreDICE$DICE, main = "DICE Score", xlab = "DICE", col = "lightgreen")
hist(scoreHart$Hart, main = "Hart Score", xlab = "Hart", col = "plum")
hist(scorePE$PE, main = "PE Score", xlab = "PE", col = "orange")
hist(result_spc$SaintScore, main = "SAINT Score (spc)", xlab = "SAINT Score", col = "gold")
```

# Session Information

```{r sessionInfo}
sessionInfo()
```