---
title: "2. Scoring Functions in SMAD"
author: 
- name: "Qingzhou (Johnson) Zhang"
  email: zqzneptune@hotmail.com
date: "`r Sys.Date()`"
package: SMAD
output: 
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{2. Scoring Functions in SMAD}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Introduction

The `SMAD` package provides a suite of scoring functions for evaluating protein-protein interactions (PPIs) from affinity purification-mass spectrometry (AP-MS) data. Each function assigns a probability or confidence score to candidate interactions, which helps distinguish true biological interactions from non-specific background contaminants.

This vignette showcases the various scoring methods implemented in `SMAD`.

# Data Preparation

Most scoring functions in `SMAD` take a standardized input format. We will use the built-in `TestDatInput` dataset for demonstration.

```{r load_data}
library(SMAD)
data("TestDatInput")
head(TestDatInput)
```

The columns are:

- `idRun`: Unique identifier for the AP-MS run.
- `idBait`: Unique identifier for the bait protein.
- `idPrey`: Unique identifier for the prey protein.
- `countPrey`: Spectral counts (or peptide counts) for the prey.
- `lenPrey`: Length of the prey protein.
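
If your own data are not yet in this layout, a table with the same five columns can be assembled by hand. The following is a minimal sketch with invented values: the identifiers, counts, and lengths are hypothetical, and the column types (character identifiers, numeric counts and lengths) are an assumption rather than a documented requirement.

```{r sketch_toy_input}
# Hypothetical toy data: two runs, two baits, three preys (all values invented)
toy_input <- data.frame(
  idRun     = c("run1", "run1", "run2", "run2"),
  idBait    = c("baitA", "baitA", "baitB", "baitB"),
  idPrey    = c("prey1", "prey2", "prey1", "prey3"),
  countPrey = c(12, 3, 5, 8),
  lenPrey   = c(350, 1200, 350, 640),
  stringsAsFactors = FALSE
)
str(toy_input)
```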

# Scoring Methods

## CompPASS

The Comparative Proteomic Analysis Software Suite (CompPASS) identifies high-confidence interactions by comparing protein occurrences across multiple AP-MS experiments. It produces four types of scores: Z-score, S-score, D-score, and WD-score (weighted D-score).

```{r compPASS}
scoreCompPASS <- CompPASS(TestDatInput)
head(scoreCompPASS)
```
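
To give a feel for how such scores reward prey specificity, here is a small sketch of the S-score as it is usually stated in the CompPASS literature, `sqrt(k / f * X)`, where `k` is the total number of baits, `f` is the number of baits that recovered the prey, and `X` is the average spectral count for the bait-prey pair. The toy table and the column name `avgCount` are hypothetical, and this is an illustration of the formula only, not the code that `CompPASS()` runs.

```{r sketch_sscore}
# Hypothetical toy data: average spectral count per bait-prey pair (invented)
toy_cp <- data.frame(
  idBait   = c("baitA", "baitA", "baitB", "baitB", "baitC"),
  idPrey   = c("prey1", "prey2", "prey1", "prey3", "prey1"),
  avgCount = c(10, 4, 2, 9, 1),
  stringsAsFactors = FALSE
)

# k: total number of distinct baits in the dataset
k <- length(unique(toy_cp$idBait))

# f: number of distinct baits in which each prey was detected
f <- tapply(toy_cp$idBait, toy_cp$idPrey, function(b) length(unique(b)))

# S-score sketch: preys seen with few baits are up-weighted
toy_cp$scoreS_sketch <- sqrt(k / as.numeric(f[toy_cp$idPrey]) * toy_cp$avgCount)
toy_cp
```

In this toy table `prey1` appears with every bait, so its counts are not up-weighted, whereas `prey2` and `prey3` each appear with a single bait and receive a factor of `k`.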

## HGScore

`HGScore` is based on a hypergeometric distribution error model. It incorporates the Normalized Spectral Abundance Factor (NSAF) to account for protein length and abundance.

```{r HG}
scoreHG <- HG(TestDatInput)
head(scoreHG)
```
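
The NSAF normalization mentioned above can be sketched in a few lines of base R. Under the usual definition, each prey's spectral count is divided by its length, and the result is divided by the sum of those ratios within the same run. The toy values below are invented, and the sketch shows only the normalization step, not how `HG()` uses it internally.

```{r sketch_nsaf}
# Hypothetical toy data in the same column layout as the input table (invented)
toy_nsaf <- data.frame(
  idRun     = c("run1", "run1", "run2", "run2"),
  idPrey    = c("prey1", "prey2", "prey1", "prey3"),
  countPrey = c(12, 3, 5, 8),
  lenPrey   = c(350, 1200, 350, 640),
  stringsAsFactors = FALSE
)

# Spectral abundance factor: counts scaled by protein length
saf <- toy_nsaf$countPrey / toy_nsaf$lenPrey

# Normalize within each run so that NSAF values sum to 1 per run
toy_nsaf$NSAF <- saf / ave(saf, toy_nsaf$idRun, FUN = sum)
toy_nsaf
```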

## DICE

The Dice coefficient is used to score the interaction affinity between two proteins based on their co-occurrence across different runs. It focuses on prey-prey interactions.

```{r DICE}
scoreDICE <- DICE(TestDatInput)
head(scoreDICE)
```
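
As a reminder of what the coefficient measures, the sketch below computes the Dice coefficient, `2 * |A ∩ B| / (|A| + |B|)`, for two preys from the sets of runs in which each was detected. The run memberships and the helper `dice_sketch()` are hypothetical and are not part of `SMAD`.

```{r sketch_dice}
# Hypothetical run memberships for two preys (invented)
toy_runs <- list(
  prey1 = c("run1", "run2", "run3"),
  prey2 = c("run2", "run3", "run4", "run5")
)

# Dice coefficient of two sets of run identifiers
dice_sketch <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

dice_sketch(toy_runs$prey1, toy_runs$prey2)
```

The two preys share two runs out of three and four respectively, so the coefficient is 4/7.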

## Hart

Based on Hart et al. (2007), this algorithm uses a hypergeometric distribution to compute the probability of two proteins interacting, based on their frequency of co-purification.

```{r Hart}
scoreHart <- Hart(TestDatInput)
head(scoreHart)
```
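
The idea behind a hypergeometric co-purification test can be sketched with base R's `phyper()`. The counts below are invented, and the mapping of the four quantities onto AP-MS observations is an assumption made for illustration; it may differ in detail from what `Hart()` computes.

```{r sketch_hyper}
# Hypothetical counts (invented for illustration)
N_total <- 100  # total number of observations in the dataset
n_a     <- 10   # observations involving protein A
n_b     <- 8    # observations involving protein B
k_ab    <- 3    # observations in which A and B co-purify

# Upper-tail probability P(X >= k_ab) under the hypergeometric null
p_tail <- phyper(k_ab - 1, n_b, N_total - n_b, n_a, lower.tail = FALSE)

# A negative log transform turns small p-values into large scores
score_sketch <- -log10(p_tail)
c(p_tail = p_tail, score = score_sketch)
```

Subtracting 1 from `k_ab` matters here: with `lower.tail = FALSE`, `phyper(q, ...)` returns `P(X > q)`, so `q = k_ab - 1` gives the probability of at least `k_ab` co-purifications.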

## PE (Purification Enrichment)

The PE score is based on a Bayesian classifier framework (Collins et al., 2007). It combines "spoke" (bait-prey) and "matrix" (prey-prey) models to compute a comprehensive enrichment score.

```{r PE}
# PE() may need the data.table and RcppAlgos packages to be installed
scorePE <- PE(TestDatInput)
head(scorePE)
```

## SAINTexpress

Significance Analysis of INTeractome (SAINT) is a widely used tool for AP-MS data. `SMAD` provides an integrated version with two modes: Spectral Count (`spc`) and Intensity (`int`).

### SAINTexpress-spc (Spectral Count)

This mode is used for data where protein abundance is measured by spectral counts.

```{r SAINT_spc}
# Using example data from the package
bait_path <- system.file("exdata", "TIP49", "bait.dat", package = "SMAD")
prey_path <- system.file("exdata", "TIP49", "prey.dat", package = "SMAD")
inter_path <- system.file("exdata", "TIP49", "inter.dat", package = "SMAD")

bait <- read.table(bait_path, sep = "\t", header = FALSE, 
                   col.names = c("ip_id", "bait_id", "test_ctrl"))
prey <- read.table(prey_path, sep = "\t", header = FALSE, 
                   col.names = c("prey_id", "prey_length"))
inter <- read.table(inter_path, sep = "\t", header = FALSE, 
                    col.names = c("ip_id", "bait_id", "prey_id", "quant"))

result_spc <- SAINTexpress_spc(inter, prey, bait)
head(result_spc[, c("Bait", "Prey", "SaintScore", "BFDR")])
```

### SAINTexpress-int (Intensity)

This mode is designed for intensity-based data, such as those from label-free quantification (LFQ).

```{r SAINT_int}
# Re-using the example tables loaded above only to demonstrate the call;
# with real data, the `quant` column should hold intensity values
result_int <- SAINTexpress_int(inter, prey, bait)
head(result_int[, c("Bait", "Prey", "SaintScore", "BFDR")])
```
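
Once scores are available, a typical next step is to keep only interactions below a chosen BFDR. The sketch below does this on a hand-made table that mimics the four columns shown above; the rows are invented, and the 0.05 cutoff is a commonly used convention chosen here as an assumption, not a recommendation made by the package.

```{r sketch_bfdr}
# Hypothetical SAINT-style results (invented values, same column names as above)
toy_saint <- data.frame(
  Bait       = c("baitA", "baitA", "baitB", "baitB"),
  Prey       = c("prey1", "prey2", "prey1", "prey3"),
  SaintScore = c(0.99, 0.40, 0.95, 0.10),
  BFDR       = c(0.00, 0.30, 0.02, 0.60),
  stringsAsFactors = FALSE
)

# Assumed cutoff for illustration; adjust to the needs of the study
bfdr_cutoff <- 0.05
toy_hits <- toy_saint[toy_saint$BFDR <= bfdr_cutoff, ]
toy_hits
```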

# Visualization of Scores

Visualizing the distribution of scores can help in selecting appropriate thresholds for high-confidence interactions.

```{r visualization, fig.width=10, fig.height=6}
# Arrange the six histograms in a 2 x 3 grid and remember the old settings
op <- par(mfrow = c(2, 3))
hist(scoreCompPASS$scoreWD, main = "CompPASS WD-score", xlab = "WD-score", col = "skyblue")
hist(scoreHG$HG, main = "HGScore", xlab = "HGScore", col = "salmon")
hist(scoreDICE$DICE, main = "DICE Score", xlab = "DICE", col = "lightgreen")
hist(scoreHart$Hart, main = "Hart Score", xlab = "Hart", col = "plum")
hist(scorePE$PE, main = "PE Score", xlab = "PE", col = "orange")
hist(result_spc$SaintScore, main = "SAINT Score (spc)", xlab = "SAINT Score", col = "gold")
# Restore the previous graphical parameters
par(op)
```
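
One simple way to turn a score distribution into a threshold is to keep a fixed top fraction of the scores. The sketch below marks the 95th percentile on a histogram of simulated scores; the simulated values and the 5% fraction are assumptions chosen for illustration, and an appropriate cutoff in practice depends on the scoring method and on available reference interactions.

```{r sketch_threshold}
# Simulated scores standing in for any of the score columns above (invented)
set.seed(1)
toy_scores <- rexp(1000)

# Cutoff that keeps roughly the top 5% of scores
toy_cutoff <- unname(quantile(toy_scores, 0.95))

hist(toy_scores, main = "Simulated scores", xlab = "Score", col = "grey80")
abline(v = toy_cutoff, lty = 2)

# Number of scores at or above the cutoff
sum(toy_scores >= toy_cutoff)
```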

# Session Information

```{r sessionInfo}
sessionInfo()
```