--- title: "SILAC complexomics" author: "Petra Palenikova & Rick Scavetta" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{SILAC complexomics} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( fig.width=7, fig.height=4.5, collapse = TRUE, eval = TRUE, comment = "#>" ) ``` # Introduction Complexome profiling or complexomics is a mass spectrometry-based method used in biology to study macromolecular complexes in their native form. Protein complexes are assessed by evaluating their migration profile across fractions. First, lysed protein sample is separated into fractions, typically by blue native electrophoresis (BN-PAGE) or density gradient centrifugation. Individual fractions are analysed by protein mass spectrometry. This allows to identify protein migration profiles across the fractions and to assess protein co-migration. It is often desirable to compare co-migration between multiple biological samples. Typically, this would require analysing each sample separately, using multiple lanes of blue native gel/multiple density gradients. However, this approach might introduce technical biases making the qualitative comparison of migration profiles as well as quantitative comparison of protein amount between different biological samples difficult. To mitigate these technical biases, biological samples can be labeled by means detectable by mass spectrometry (e.g. SILAC, TMT, iTRAQ) and analysed simultaneously. Stable Isotope Labelling with Amino acids in Cell culture (SILAC) is a method when cells are grown in the presence of amino acids with either low natural abundance “heavy” isotopes of carbon and nitrogen or the most frequent “light” isotopes. This labelling allows for reciprocally labelled samples to be mixed and multiplexed at the very early steps of experiment, making it a useful tool when experimental design requires comparison of 2 biological samples. Here we present a ComPrAn (Complexome Profiling Analysis) R package to analyse data produced by SILAC complexomics experiments. This package does not interpret raw mass spectrometry data. As an input it takes a table of peptides that were identified by search engines such as Mascot, and quantified, e.g. in Proteome Discoverer. This package allows to estimate relative protein abundance from peptide abundance, produces a normalized migration profiles of proteins through fractions, performs cluster analysis and contains tools for visualization of results. The analysis is indented for samples that were SILAC labelled, therefore the input file should contain both “heavy” and “light” peptides. ## Method description Below, we provide an overview of the ComPrAn use (Figure 1). There are **two possible uses of the package**. Here, we describe a **full workflow**, enabling the analysis starting from the peptide information. Second option is to use only **protein workflow** (described in a separate vignette). The second workflow would include only steps 3) and 4) from below. It provides greater flexibility, for example, when only normalised protein values are available or if different approach of estimation of protein intensities from peptide intensities and/or for normalisation are required. ![](PackageWorkflow_small.png){height=700px} **Fig 1** *Schematic overview of the data analysis using ComPrA* *There are two entry point into ComPrAn analysis pipeline* *Peptide data can be imported into Peptide-to-protein part of data analysis* *(shown in blue). When only normalised protein data from an experiment are* *available, it is possible to use Protein workflow functionality* *of the ComPrAn package.* ### 1) Estimation of protein intensities from peptide intensities In mass spectrometry analysis multiple peptides are detected for each protein. During initial data processing peptides are identified, assigned to proteins and quantified. In a single sample, quantities of different peptides belonging to the same protein might differ. Differences in detected peptide quantities might be caused by different efficiency of peptides ionization, peptide charge state, modifications etc. There are many possibilities how to estimate a protein quantity from peptide quantity, including minimum/average/maximum of the quantities of all peptides that belong to a given protein or minimum/average/maximum can be calculated for a subset of peptides. In complexomics, protein quantity values are compared among large number of fractions (typically 64 for BN-PAGE and ~20 for gradient centrifugation). Usually, peptides detected for any given protein can differ among fractions. To make sure that protein quantification values used by the ComPrAn package are comparable between fractions and between two label states, a single representative peptide is picked for each protein. Selected peptide must have the same sequence, modifications and charge state across fractions. The most abundant peptide for each protein is picked as the representative peptide. Based on the experimental question, two scenarios to select a representative peptide are used: * **Scenario A** - *Which proteins co-migrate together within one label state?* *Does the migration profile for a protein differ between label states?* In scenario A, we firs split peptides into 2 groups, unlabeled and labelled, and then pick a representative peptide as the one that is present in the highest number of fractions within each label state (maximum possible value is the total number of fractions). This means that for any single protein the **peptide that is picked for the labelled sample may be different** **from the peptide selected for the unlabeled sample**. This scenario picks representative peptide for all proteins, even if they are present only in one label state. ![](pickPeptideA.png){width=650px} **Fig 2** *Schematic of example case of representative peptide selection for* *Heavy and Light samples according to* *scenario A. Green ticked boxes mean that a given peptide was detected in a* *given fraction. Peptide present in highest number of fractions, separately* *for each label state is selected (Peptide 2 for Heavy, Peptide 1 for Light)* * **Scenario B** - *Does the quantity of a protein differ between label states?* In scenario B, we pick the representative peptide as the one that is present in the highest number of fractions, together for both label states (maximum = 2x number of fractions). This means that for any single protein the **representative peptide that is picked for** **labelled sample and unlabeled sample are the same**. This scenario picks representative peptide only for shared peptides. ![](pickPeptideB.png){width=650px} **Fig 3** *Schematic of example case of representative peptide selection for* *Heavy and Light samples according to* *scenario B. Green ticked boxes mean that a given peptide was detected in a* *given fraction. Peptide present in both samples in the highest number of* *fraction (column H+L) is selected as representative (Peptide 2)* After selecting a representative peptide for scenarios A and B we produce a table containing only representative peptide values for each protein. ### 2) Normalization of protein intensities For easier comparison of protein co-migrations and quantities we normalize all protein quantity values to be between 0 and 1. Method of normalization differs between scenarios: * **Scenario A** - For each protein, the quantity value for each fraction is divided by the maximum value across all fractions *within one label state*. This means that e.g. for *protein A* there will be a value of 1 present at least in one fraction for unlabeled sample and at lease in one fraction for labelled samples. Because of the way how we select representative peptide and this normalization *we do not advice to use scenario A* *for quantitative comparison of proteins between SILAC label states*. Intended use of scenario A is to compare co-migration of proteins for individual samples. ![](normalizeA.png){width=720px} **Fig 4** *Schematic of simplified case of normalisation according to* *scenario A. First, fraction with highest value is identified, separately for* *Heavy and Light samples. Next, value of each fraction is divided by this* *maximum value.* * **Scenario B** - For each protein, the quantity value for each fraction is divided by the maximum value across all fractions and *both label states*. This means that e.g. for *protein A* there will be a value of 1 present at least in one fraction in at least one of labelled or unlabeled samples. Because of the way how we select representative peptide and this normalization *we recommend using scenario B* *for quantitative comparison of proteins between label states*. ![](normalizeB.png){width=720px} **Fig 5** *Schematic of simplified case of normalisation according to* *scenario B. First, fraction with highest value across both samples is* *identified. Next, value of each fraction is divided by this* *maximum value.* ### 3) Hierarchical clustering Clustering allows to identify similarity between migration profiles of proteins in an unbiased way. We can examine co-migration of known protein complexes by simply filtering the data, however, clustering provides additional information by allowing to identify unknown proteins that show similar migration profile as our proteins of interest. This package contains functions to perform hierarchical clustering using Pearson correlation (centered or uncentered) as a distance measure and one of the three linkage methods (single, average or complete). ### 4) Export files and visualizations We provide several functions to export intermediate steps of the analysis. Plotting functionality includes: * allPeptidesPlot - scatter plot of all peptides that belong to a single protein * proteinPlot - line plot for a selected protein * groupHeatMap - heatmap for a selected group of proteins * oneGroupTwoLabelsCoMigration - scatter plot for a selected group of proteins * twoGroupsWithinLabelCoMigration - scatter plot for 2 selected groups of proteins * makeBarPlotClusterSummary - bar plot showing number of proteins per cluster # Example workflow **Read in and filter data** ```{r} library(ComPrAn) inputFile <- system.file("extData", "data.txt", package = "ComPrAn") #read in data peptides <- peptideImport(inputFile) #mandatory filtering peptides <- cleanData(peptides, fCol = "Search ID") #optional filtering peptides <- toFilter(peptides, rank = 1) # separate chemical modifications and labelling into separate columns peptides <- splitModLab(peptides) #remove unnecessary columns, simplify rows peptides <- simplifyProteins(peptides) ``` **Pick representative peptides for scenario A and scenario B** ```{r} peptide_index <- pickPeptide(peptides) ``` **Have a look at all peptides for a protein (scatter plot)** ```{r} protein <- "P52815" max_frac <- 23 #example all peptide plot (default settings) allPeptidesPlot(peptide_index,protein, max_frac = max_frac) ``` **Create a list of proteins present in both/only in one label state** ```{r} listOnlyOneLabState <- onlyInOneLabelState(peptide_index) ``` **Extract values for representative peptides** for each protein, for both scenario A and scenario B and perform normalization. Combine results into one data frame used in further analysis. ```{r} # produce a data frame in a format needed for downstream analysis forAnalysis <- getNormTable(peptide_index,purpose = "analysis") # produce a data frame in an easily human readable format that can be exported forExport <- getNormTable(peptide_index,purpose = "export") ``` Once data frame with normalized protein data is produced, it can be either exported (*forExport* in the code above), or used for further analysis and visualization (*forAnalysis* in the code above). Example procedure of analysing normalized protein data is available in **Protein workflow** vignette. To access relevant section of **Protein workflow** vignette [click here](./proteinWorkflow.html#visualization-of-normalised-protein-data) .