---
title: "Using JohnsonKinaseData to predict kinase-substrate relationships"
author: "Florian Geier"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('JohnsonKinaseData')`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{JohnsonKinaseData}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
bibliography: JohnsonKinaseData.bib
---

```{r include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

# Introduction

The `r Biocpkg("JohnsonKinaseData")` package provides substrate affinities in the form of position-specific weight matrices (PWMs) for 396 human kinases originally published in Johnson et al. [@Johnson2023] and Yaron-Barir et al. [@Yaron-Barir2024]. It includes basic functionality to pre-process user-provided phosphopeptides and match them against all kinase PWMs. The aim is to give the user a simple way of predicting kinase-substrate relationships based on PWM-phosphosite matching. These predictions can serve to infer kinase activity from differential phospho-proteomic data.

# Installation

The `r Biocpkg("JohnsonKinaseData")` package can be installed using the following code:

```{r installation, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ExperimentHub")
BiocManager::install("JohnsonKinaseData")
```

# Load PWM annotation

Annotation data for all provided kinase PWMs can be accessed with:

```{r load-anno}
library(JohnsonKinaseData)
anno <- getKinaseAnnotation()
head(anno)
```

It includes PWM names and associated gene information, such as gene symbol, description, Entrez and Uniprot IDs.

PWMs are classified by their specificity:

```{r anno-spec}
xtabs(~AcceptorSpecificity, anno)
```

Tyrosine-specific kinase PWMs are additionally classified by sub-type: receptor (RTK), non-receptor (nRTK) and non-canonical tyrosine kinases (ncTK).

```{r anno-spec-sub}
xtabs(~AcceptorSpecificity + KinaseSubType, anno)
```

PWMs for non-canonical tyrosine kinases, i.e. kinases which also phosphorylate serine/threonine residues, are indicated by the `_TYR` suffix in the matrix name.

All PWMs are grouped into kinase families:

```{r anno-spec-family}
xtabs(~AcceptorSpecificity + KinaseFamily, anno)
```

# Loading kinase PWMs

Kinase PWMs can be loaded with the `getKinasePWM()` function, which returns the full list of 396 kinase PWMs.

```{r load-pwm}
library(JohnsonKinaseData)
pwms <- getKinasePWM()
head(names(pwms))
```

Each PWM is a numeric matrix with amino acids as rows and positions as columns. Matrix elements are log2-odds scores measuring differential affinity relative to a random frequency of amino acids [@Johnson2023].

```{r pwm-example}
pwms[["PLK2"]]
```

Besides the 20 standard amino acids, phosphorylated serine, threonine and tyrosine residues are also included. These phosphorylated residues are distinct from the central phospho-acceptor (serine, threonine or tyrosine at position `0`) and can have a strong impact on the affinity of a given kinase-substrate pair (phospho-priming).

For serine/threonine-specific kinase PWMs, the central phospho-acceptor measures the favorability of serine over threonine. The user can exclude this favorability measure by setting the parameter `includeSTfavorability` to `FALSE`, in which case the central position does not contribute to the PWM score.

```{r pwm-st}
getKinasePWM(includeSTfavorability=FALSE)[["PLK2"]]
```

To disable scoring of phosphosites that do not contain a matching phospho-acceptor, i.e. S/T in the case of serine/threonine PWMs or Y in the case of tyrosine PWMs, the parameter `matchAcceptorSpecificity` can be set to `TRUE`.
In this case the log2-odds score of non-matching residues is set to `-Inf`:

```{r pwm-acc}
getKinasePWM(matchAcceptorSpecificity=TRUE)[["PLK2"]]
```

# Processing user-provided phosphosites

Phosphorylated peptides are often represented in one of two formats: (1) phosphorylated residues are indicated by an asterisk as in `SAGLLS*DEDC`, or (2) phosphorylated residues are given by lower case letters as in `SAGLLsDEDC`. In order to unify the phosphosite representation for PWM matching, `r Biocpkg("JohnsonKinaseData")` provides the function `processPhosphopeptides()`. It takes a character vector of phosphopeptides, aligns them to the central phospho-acceptor position and pads and/or truncates the surrounding residues. By default, this yields 5 upstream residues, the central acceptor and 5 downstream residues. The central phospho-acceptor position is defined as the left closest phosphorylated residue to the midpoint of the peptide given by `floor(nchar(sites)/2)+1`. This midpoint is also the default alignment position if no phosphorylated residue is recognized.

```{r peps-central}
ppeps <- c("SAGLLS*DEDC", "GDtND", "EKGDSN__", "HKRNyGsDER", "PEKS*GyNV")
sites <- processPhosphopeptides(ppeps)
sites
```

If a peptide contains several phosphorylated residues, the option `onlyCentralAcceptor` controls how the acceptor position is selected. Setting `onlyCentralAcceptor=FALSE` returns all possible aligned phosphosites for a given input peptide. Note that in this case the output is not parallel to the input.

```{r peps-non-central}
sites <- processPhosphopeptides(ppeps, onlyCentralAcceptor=FALSE)
sites
```

A warning is raised if the central acceptor is not serine, threonine or tyrosine.

# Scoring of user-provided phosphosites

Once peptides are processed to sites, the function `scorePhosphosites()` can be used to create a matrix of kinase-substrate match scores.

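To give an intuition for what a single match score represents, the following is a minimal sketch that assumes the common PWM convention of summing the log2-odds entries selected by the residue at each position. The names `toy_pwm` and `score_site` and all matrix values are hypothetical and serve illustration only; the sketch is not the package's implementation, and treating characters without a matching row (such as padding) as contributing nothing is likewise an assumption.

```{r score-sketch}
# Toy log2-odds PWM (hypothetical values, not taken from the package):
# rows are residues, columns are positions relative to the acceptor.
toy_pwm <- matrix(
    c( 1.0, 0.0, -0.5,   # position -1: A, D, S
       0.0, 0.0,  0.3,   # position  0: A, D, S
      -1.0, 2.0,  0.0),  # position +1: A, D, S
    nrow = 3,
    dimnames = list(c("A", "D", "S"), c("-1", "0", "1"))
)

# Sum the matrix entries selected by the residue at each position.
# Assumption: characters without a row in the PWM contribute nothing.
score_site <- function(site, pwm) {
    residues <- strsplit(site, "", fixed = TRUE)[[1]]
    stopifnot(length(residues) == ncol(pwm))
    rows <- match(residues, rownames(pwm))
    keep <- !is.na(rows)
    sum(pwm[cbind(rows[keep], which(keep))])
}

score_site("ASD", toy_pwm)
score_site("_SD", toy_pwm)
```

In this toy example the first site picks up one entry per position (1.0 + 0.3 + 2.0), while the second site loses the contribution of its padded upstream position. The package function applies such a scoring to every combination of PWM and site:
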
```{r score}
selected <- sites |>
    dplyr::pull(processed)
scores <- scorePhosphosites(pwms, selected)
dim(scores)
scores[,1:5]
```

The PWM scoring can be parallelized by supplying a `BiocParallelParam` object to `BPPARAM=`.

```{r score-parallel}
scores <- scorePhosphosites(pwms, selected, BPPARAM=BiocParallel::SerialParam())
```

By default, the resulting score is the log2-odds score of the PWM. Alternatively, by setting `scoreType="percentile"`, a percentile rank of the log2-odds score is calculated, using a separate background score distribution for each PWM.

```{r score-percentile}
scores <- scorePhosphosites(pwms, selected, scoreType="percentile")
scores[,1:5]
```

Quantifying PWM matches by percentile rank was first described in Yaffe et al. 2001 [@Yaffe2001]. The background score distributions used here are derived from matching each PWM to either the 85,603 unique phosphosites published in Johnson et al. 2023 (serine/threonine PWMs) or the 6,659 unique phosphosites published in Yaron-Barir et al. 2024 (tyrosine PWMs). They can be accessed with:

```{r background-tyr}
bg <- getBackgroundScores(phosphoAcceptor='Tyr')
```

where `phosphoAcceptor` can be either `Ser/Thr` or `Tyr`.

The corresponding mappings of log2-odds scores to percentile ranks can be accessed with the function `getScoreMaps()`, which returns a list of mapping functions, one for each kinase PWM.

Note that these percentile ranks cannot reflect the impact of phospho-priming, because non-central phosphorylated residues were missing in the background sites.

# Session info

```{r session-info}
sessionInfo()
```

# References