---
title: 'KinSwingR: Predicting kinase activity from phosphoproteomics data'
author: "Ashley J. Waardenberg"
date: 'Last modified: 2019-04-25. Compiled: `r Sys.Date()`'
output:
  html_document:
    highlight: tango
    number_sections: yes
    toc: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: no
  pdf_document:
    toc: yes
    toc_depth: '3'
  word_document:
    toc: yes
    toc_depth: '3'
references:
- DOI: 10.1371/journal.pbio.3000170
  URL: 'https://doi.org/10.1371/journal.pbio.3000170'
  author:
  - family: Engholm-Keller
    given: K*
  - family: Waardenberg
    given: AJ*
  - family: Müller 
    given: JA
  - family: Wark 
    given: JR
  - family: Fernando  
    given: RN
  - family: Arthur  
    given: JA
  - family: Robinson 
    given: PJ
  - family: Dietrich  
    given: D
  - family: Schoch  
    given: S
  - family: Graham  
    given: ME
  id: Kasper2019
  issued:
    month: 3
    year: 2019
  publisher: PloS Biology
  title: The temporal profile of activity-dependent presynaptic phospho-signalling reveals long lasting patterns of post-stimulus regulation
  type: article-journal
vignette: >
  %\VignetteIndexEntry{KinSwingR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction to KinSwing

KinSwingR aims to predict kinase activity from phoshoproteomics data. It implements the alogorithm described in: @Kasper2019 (in greater detail below). KinSwingR predicts kinase activity by integrating kinase-substrate predictions and the fold change and signficance of change for peptide sequences obtained from phospho-proteomics studies. The score is based on the network connectivity of kinase-substrate networks and is weighted for the number of substrates as well as the size of the local network. P-values are provided to assess the significance of the KinSwing scores, which are determined through random permuations of the overall kinase-substrate network.

KinSwingR is implemented as 3 core functions:

+ **```buildPWM()```** builds position weight matrices (PWMs) from known kinase-substrate sequences
+ **```scoreSequences()```** score PWMs build using ```buildPWM()``` against input phosphoproteome data
+ **```swing()```** integrates PWM scores, direction of phosphopeptide change and significance of phosphopeptide change into a "swing" score.

The KinSwing score is a metric of kinase activity, ranging from positive to negative, and p-values are provided to determine significance.

Additional functions are also provided:

+ **```cleanAnnotation()```** function to tidy annotations and extract peptide sequences.
+ **```viewPWM()```** function to view PWM models


Detailed information for each of these functions can be accessed using the ```?``` command before the function of interest. E.g. ```?buildPWM```

# KinSwingR example workflow

We will now consider an example dataset to predict kinase activity. Kinase-substrate sequences and phosphoproteomics data are provided as example data in the KinSwingR package.

Begin by loading the KinSwingR library and the two data libraries included in the package.
```{r echo=TRUE}
library(KinSwingR)
data(example_phosphoproteome)
data(phosphositeplus_human)

# View the datasets:
head(example_phosphoproteome)

head(phosphositeplus_human)

## sample 100 data points for demonstration
sample_data <- head(example_phosphoproteome, 1000)

# Random sample for demosntration purposes
set.seed(1234)
sample_pwm <- phosphositeplus_human[sample(nrow(phosphositeplus_human), 1000),]

# for visualising a motif, sample only CAMK2A
CAMK2A_example <- phosphositeplus_human[phosphositeplus_human[,1]== "CAMK2A",] 

```

## Extracting peptides for analysis
Where the centered peptide sequences (on the phosphosite of interest) are not provided in the format required for ```scoreSequences()``` (see the argument "input_data", in ?scoreSequences), these can be required to be extracted from another column of annotated data. NB. "input_data" table format must contain columns for "annotation", "peptide", "fold-change" and "p-values".

In the example dataset provided, ```example_phosphoproteome```, peptides have not been extracted into a stand-a-lone peptide column. ```cleanAnnotation()``` is provided as a function to extract peptides from annotation columns and place into the peptide column. 

In the example dataset, ```example_phosphoproteome```, the peptide sequence is the 4th component of the annotation, which corresponds to using the argument ```seq_number = 4``` below, and is seperated by ```|```, which corresponds to the argument ```annotation_delimiter = "|"```. In this case, the annotated data also contains multi-mapped and multi-site information. For example the following annotation ```A1L1I3|Numbl|263;270|PAQPGHVSPTPATTS;SPTPATTSPGEKGEA``` contains two peptides ```PAQPGHVSPTPATTS``` and ```SPTPATTSPGEKGEA``` that map to different sites from the same reference gene ```Numbl```, where the peptides are seperated by ```;```. The annotated data also includes multi-protein mapped (where a peptide could map to more than one protein - not shown) and contains ```X``` instead of ```_``` to indicate sequences that were outside of the length of the coding sequences. KinSwingR requires that these sequences outside of the coding region are marked with ```_``` as deafult and therefore ```replace_search = "X"``` and ```replace_with = "_"``` can be used as arguments in ```cleanAnnotation()``` to replace these. This allows for full flexibility of the input data here, depending of the software used to generate determine the peptide sequences. NB: characters other than ```_``` can be used, but these need to be declared when calling buildPWM and scoreSequences functions later (see their help files).

Calling ```cleanAnnotation()``` will produce a new table with the unique combinations of peptide sequences extracted from the annotation column into the peptide column:

```{r echo=TRUE}
annotated_data <- cleanAnnotation(input_data = sample_data, 
                                  annotation_delimiter = "|",
                                  multi_protein_delimiter = ":", 
                                  multi_site_delimiter = ";",
                                  seq_number = 4, 
                                  replace = TRUE, 
                                  replace_search = "X",
                                  replace_with = "_")

head(annotated_data)

```

## Build Position Weight Matrices (PWMs)
The first step to inferring kinase activity, is to build Position Weight Matrices (PWMs) for kinases. This can be done using ```buildPWM()``` for any table containing centered substrate peptide sequences for a list of kinases. The example data ```data(phosphositeplus_human)``` indicates the required format for building PWM models. Below, for demosntration, we use a subset that was sampled above ```sample_pwm```

To generate the PWMs:
```{r echo=TRUE}
pwms <- buildPWM(sample_pwm)
```

This will build the PWM models, accessible as ```PWM$pwm``` and list the number of substrate sequences used to build each PWM, accesible as ```PWM$kinase```.

To view the list of kinases and the number of sequences used:
```{r echo=TRUE}
head(pwms$kinase)
```

### Visualising motifs

Motif amino acids are coloured according to their properties. ```color_scheme``` parameter allows options are "lesk" or "shapely" (default). Y-axis is the information content, measured as bits.

```{r echo=TRUE}
CAMK2A_pwm <- buildPWM(CAMK2A_example)
CAMK2A <- viewPWM(CAMK2A_pwm, 
                  which_pwm="CAMK2A",
                  view_pwm = TRUE,
                  color_scheme = "shapely")
```


## Score PWM matches against peptide sequences

Next, we will use the PWM models generated, ```pwms```, to identify matches in the ```annotated_data``` table that was cleaned using ```cleanAnnotation()``` above. ``scoreSequences``` supports multi-core processing - see the example below for setting the number of workers to 4. ```scoreSequences``` draws a random background by default of size ```n = 1000```. It is recommended to use ```set.seed()``` prior to calling ```scoreSequences``` if you wish to reproduce your results. To access the help file, which explains all the arguments, type ```?scoreSequences``` into the console.

```{r echo=TRUE}
# As an example of control over multi-core processing
# load BiocParallel library
library(BiocParallel)
# finally set/register the number of cores to use
register(SnowParam(workers = 4))

# set seed for reproducible results
set.seed(1234)
scores <- scoreSequences(input_data = annotated_data, 
                         pwm_in = pwms,
                         n = 100)
```

The outputs of ```scores``` are transparent and accessible. These are however primarily intermediate tables for obtaining swing scores. ```scores``` is a simple list object that contains peptide scores ```(scores$peptide_scores)```, p-values for the peptide scores ```(scores$peptide_p)``` and the background peptides used to score significance ```(scores$background)``` for reproducibility (i.e. the background can saved and reused for reproducibility).

In summary, ```scoreSequences()``` scores each input sequence for a match against all PWMs provided using ```buildPWM()`` and generates p-values for scores. This is effectively one large network of kinase-substrate edges of dimensions kinase, ***k***, by substrate, ***s***.

## Predict kinase activity using swing()

Having built a kinase-substrate network, ```swing()``` then integrates the kinase-substrate predictions, directionality and significance of phosphopeptide fold change to assess local connectivity (or swing) of kinase-substrate networks. The final score is a normalised score of predicted kinase activity that is weighted for the number of substrates used in the PWM model and number of peptides in the local kinase-substrate network. By default, this will permute the network 1000 times (here we use 10 for example purposes). It is recommended to use ```set.seed()``` prior to calling ```swing``` if you wish to reproduce your results. ``swing``` supports multi-core processing - see the example below for setting the number of workers to 4.

```{r echo=TRUE}
# after loading BiocParallel library, set/register the number of cores to use
register(SnowParam(workers = 4))

# set seed for reproducible results
set.seed(1234)
swing_out <- swing(input_data = annotated_data, 
                   pwm_in = pwms, 
                   pwm_scores = scores,
                   permutations = 10)

# This will produce two tables, one is a network for use with e.g. Cytoscape and the other is the scores. To access the scores:
head(swing_out$scores)
```

The outputs of this table indicate the following:

+ ```kinase```: The kinase
+ ```pos```: Number of ***positively*** regulated kinase substrates
+ ```neg```: Number of ***negatively*** regulated kinase substrates
+ ```all```: Total number of regulated kinase substrates
+ ```pk```: Proportion of ***positively*** regulated kinase substrates
+ ```nk```: Proportion of ***negatively*** regulated kinase substrates
+ ```swing_raw```: Raw - weighted score
+ ```n```:  Number of subtrate sequence in ```kinase``` PWM
+ ```swing```: Normalised (Z-score transformed) - weighted score
+ ```p_greater```: probability of observing a swing score greater than
+ ```p_less```:  probability of observing a swing score less than

Note that the ``pos``, ``neg`` and ``all`` include a pseudo-count, that is set in ``?swing``, note ``pseudo_count``.

*** See @Kasper2019 for methods description ***

# KinSwingR algorithm

*** For a full description of the KinSwing algorithm, see @Kasper2019 ***

**In brief:**

```buildPWM()``` generates Position Weight Matrices (PWMs) for kinases based on known substrate sequence (Equation 1), where each kinase, $K$, is considered as the log-likelihood ratio of the average frequency of amino acid, $a$, at each position, $p$, divided by background frequencies, $B$ ($C$ is a pseudo count to avoid log zero):

***Equation 1:*** $PWM_{(a,p)}=log((1/n∑^n_{i=1}K_i)+C)/B_a+C)$

```scoreSequences()``` scores each kinase, $K$, match to a substrate $S$, given as $S_{score}=∑^n_{(i=1)}f(a,p)$ , which corresponds to the sum of the corresponding amino acid, $a$, of peptide sequence length, $i$, from position, $p$, of $PWM_{(a,p)}$ and $f(a,p)=PWM_{ap}∈PWM_{(a,p)}$. The probability of observing $S_score$ for kinase, $K$, is determined as conditional on a randomly sampled reference distribution of size $N$ sequences $P(S_{score}|R,N)$, where $R$ sequences are determined to have a test statistic less than or equal to $S_{score}$:

***Equation 2:*** $R= ∑^N_{n=1}I((S_{score})n* ≥ (S_{score})i)$

```swing()``` integrates phosphosite data and kinase-substrate scores from ```scoreSequences()``` into a network for scoring kinase activity based on local connectivity, $swing_k$, (Equation 3). $swing_k$ is the weighted product of the proportion of positive, $Pos_k$, and negative, $Neg_k$, network edges, determined as the product of a logic function (described here: @Kasper2019) given a local network of size, $C_k$, with $n$ substrates for kinase, $K$:

***Equation 3:*** $swing_k=log_2((Pos_k+c)/(Neg_k+c))*log_2(C_k)*log_2(S_n)$

$swing_k$, is transformed into a z-score, $Z(swing_k)$, where, $μ$, is the mean and, $σ$, the standard deviation of swing scores, thus allowing for comparison of predicted kinase activity across multiple timepoints and/or conditions. 

KinSwingR addresses the question of “how likely is it is observe the predicted activity of kinase, $K$, by random chance?” by computing $swing_k$ given $N$ permutations of kinase node labels, $K$, to substrates, $S$, of the total network, $M_{ks}$. Thus, the probability of observing $swing_k$ is conditional on this permuted reference distribution, of size, $N$ (Equation 2). This is computed for each tail of the distribution, that is, positive and negative $swing_k$ scores.

# References