%\VignetteIndexEntry{Using R for proteomics data analysis} %\VignetteKeywords{Bioinformatics, proteomics, mass spectrometry, tutorial, data} %\VignettePackage{RforProteomics} \documentclass[12pt,a4paper,english]{scrartcl} \usepackage{amsmath,amsfonts,amssymb} \usepackage{tikz} \usepackage{hyperref} \usepackage{natbib} \usepackage[auth-sc]{authblk} \usepackage{setspace} \onehalfspacing % caption formatting \setcapindent{0em} \setkomafont{captionlabel}{\sffamily\bfseries} \setkomafont{caption}{\sffamily} \renewcommand\Authands{ and } \usepackage{color} \usepackage{graphicx} \usepackage{xstring} \usepackage{longtable} \usepackage{tabulary} %% colors \definecolor{Red}{rgb}{0.7,0,0} \definecolor{Blue}{rgb}{0,0,0.8} \newcommand{\R}{\textbf{\texttt{R} }} \newcommand{\Rfunction}[1]{{\textbf{\texttt{#1}}}} \newcommand{\Rclass}[1]{{\textbf{\texttt{#1}}}} \newcommand{\Robject}[1]{{\textbf{\texttt{#1}}}} \newcommand{\Rpackage}[1]{{\mbox{\normalfont\textbf{\textsf{#1}}}}} \newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}} \usepackage{hyperref} \usepackage{breakurl} \hypersetup{% pdfauthor={Laurent Gatto},% pdfusetitle, bookmarks = {true}, bookmarksnumbered = {true}, bookmarksopen = {true}, bookmarksopenlevel = 2, unicode = {true}, breaklinks = {false}, hyperindex = {true}, colorlinks = {true}, linktocpage = {true}, plainpages = {false}, linkcolor = {Blue}, citecolor = {Blue}, urlcolor = {Red}, pdfstartview = {Fit}, pdfpagemode = {UseOutlines}, pdfview = {XYZ null null null} } \author{ Laurent Gatto$^1$\thanks{\email{lg390@cam.ac.uk}} } \author{ Sebastian Gibb$^2$ } \affil{ \small $^1$Cambridge Center for Proteomics, University of Cambridge, UK \\ $^2$Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Germany } \title{Using \R and Bioconductor for Proteomics Data Analysis} \begin{document} \maketitle <<'setup', include = FALSE, cache = FALSE>>= library("knitr") ## opts_chunk$set(fig.align = 'center', ## fig.show = 'hold', ## par = TRUE, ## prompt = TRUE, ## comment = NA) ## }) opts_chunk$set(tidy.opts = list(width.cutoff = 50, tidy = FALSE), fig.align = 'center', stop_on_error = 1L, comment = NA, prompt = TRUE) options(width = 60) @ %% $ <<'env0', echo = FALSE>>= library(xtable) ## suppressPackageStartupMessages(library("parallel")) suppressPackageStartupMessages(library("MSnbase")) suppressPackageStartupMessages(library("isobar")) suppressPackageStartupMessages(library("MALDIquant")) suppressPackageStartupMessages(library("MALDIquantForeign")) suppressPackageStartupMessages(library("IPPD")) suppressPackageStartupMessages(library("rols")) suppressPackageStartupMessages(library("hpar")) suppressPackageStartupMessages(library("BRAIN")) suppressPackageStartupMessages(library("org.Hs.eg.db")) suppressPackageStartupMessages(library("GO.db")) suppressPackageStartupMessages(library("Rdisop")) suppressPackageStartupMessages(library("rTANDEM")) @ %% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \vskip 0.3in minus 0.1in \hrule \begin{abstract} This vignette shows and executes the code presented in the manuscript \textit{Using \R for proteomics data analysis}. It also aims at being a general overview useful for new users who wish to explore the \R environment and programming language for the analysis of proteomics data. \end{abstract} \textit{Keywords}: bioinformatics, proteomics, mass spectrometry, tutorial. \vskip 0.1in minus 0.05in \hrule \vskip 0.2in minus 0.1in \vspace{10mm} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction}\label{sec:intro} This document illustrates some existing \R infrastructure for the analysis of proteomics data. It presents the code for the use cases taken from \cite{R4prot2013}. A pre-print of the manuscript is avaiable on arXiv\footnote{\url{http://arxiv.org/abs/1305.6559}}. There are however numerous additional \R resources distributed by the Bioconductor\footnote{\url{http://www.bioconductor.org}} and CRAN\footnote{\url{http://cran.r-project.org/web/packages/}} repositories, as well as packages hosted on personal websites. Section \ref{sec:packages} on page \pageref{sec:packages} tries to provide a wider picture of available packages, without going into details. \subsection{General \R resources} The reader is expected to have basic \R knowledge to find the document helpful. There are numerous \R introductions freely available, some of which are listed below. From the \R project web-page: \begin{itemize} \item \textbf{An Introduction to R} is based on the former \textit{Notes on R}, gives an introduction to the language and how to use R for doing statistical analysis and graphics. [\href{http://cran.r-project.org/doc/manuals/R-intro.html}{browse HTML} | \href{http://cran.r-project.org/doc/manuals/R-intro.pdf}{download PDF}] \item Several introductory tutorials in the \href{http://cran.r-project.org/other-docs.html}{contributed documentation} section. \item The \texttt{TeachingMaterial} repository\footnote{\href{https://github.com/lgatto/TeachingMaterial}{https://github.com/lgatto/TeachingMaterial}} contains several sets of slides and vignettes about \R programming. \end{itemize} Relevant background on the \R software and its application to computational biology in general and proteomics in particular can also be found in \cite{R4prot2013}. For details about the Bioconductor project, the reader is referred to \cite{Gentleman2004}. \subsection{Getting help} All \R packages come with ample documentation. Every command (function, class or method) a user is susceptible to use is documented. The documentation can be accessed by preceding the command by a \texttt{?} in the \R console. For example, to obtain help about the \Rfunction{library} function, that will be used in the next section, one would type \Rfunction{?library}. In addition, all Bioconductor packages come with at least one vignette (this document is the vignette that comes with the \Rpackage{RforProteomics} package), a document that combines text and \R code that is executed before the pdf is assembled. To look up all vignettes that come with a package, say \Rpackage{RforProteomics} and then open the vignette of interest, one uses the \Rfunction{vignette} function as illustrated below. More details can be found in \Rfunction{?vignette}. <>= ## list all the vignettes in the RforProteomics package vignette(package = "RforProteomics") ## Open the vignette called RforProteomics vignette("RforProteomics", package = "RforProteomics") ## or just vignette("RforProteomics") @ \R has several mailing lists\footnote{\url{http://www.r-project.org/mail.html}}. The most relevant here being the main \texttt{R-help} list, \textit{for discussion about problem and solutions using \R}. This one is for general \R content and is not suitable for bioinformatics or proteomics questions. Bioconductor also offers several mailing lists\footnote{\url{http://bioconductor.org/help/mailing-list/}} dedicated to bioinformatics matters and Bioconductor packages. The main \texttt{Bioconductor} list is the most relevant one. It is possible to post\footnote{\url{http://bioconductor.org/help/mailing-list/mailform/}} questions without subscribing to the list. Finally, the dedicated \Rpackage{RforProteomics} google group\footnote{\url{https://groups.google.com/forum/\#!forum/rbioc-sig-proteomics}} welcomes questions/comments/annoucements related to \R and mass-spectrometry/proteomics. It is important to read and comply to the posting guides (\href{http://www.r-project.org/posting-guide.html}{here} and \href{http://bioconductor.org/help/mailing-list/posting-guide/}{here}) to maximise the chances to obtain good responses. It is important to specify the software versions using the \Rfunction{sessionInfo()} functions (see an example output at the end of this document, on page \pageref{sec:sessionInfo}). It the question involves some code, make sure to isolate the relevant portion and report it with your question, trying to make your code/example reproducible\footnote{\url{https://github.com/hadley/devtools/wiki/Reproducibility}}. All lists have browsable archives. \subsection{Installation} The package should be installed using as described below: <>= ## only first time you install Bioconductor packages source("http://www.bioconductor.org/biocLite.R") ## else library("BiocInstaller") biocLite("RforProteomics") @ To install all dependencies (78 packages) and reproduce the code in the vignette, replace the last line in the code chunk above with:) <>= biocLite("RforProteomics", dependencies = TRUE) @ Finally, the package can be loaded with <>= library("RforProteomics") @ See also the `RforProteomics` web page\footnote{\url{http://lgatto.github.io/RforProteomics/}} for more information on installation. \subsection{External dependencies} Some packages used in the document depend on external libraries that need to be installed prior to the \R packages: \begin{description} \item[\Rpackage{mzR}] depends on the Common Data Format\footnote{\url{http://cdf.gsfc.nasa.gov/}} (CDF) to CDF-based raw mass-spectrometry data. On linux, the \texttt{libcdf} library is required. On debian-based systems, for instance, one needs to install the \texttt{libnetcdf-dev} package. \item[\Rpackage{IPPD}] (and others) depend on the \Rpackage{XML} package which requires the \texttt{libxml2} infrastructure on linux. On debian-based systems, one needs to install \texttt{libxml2-dev}. \item [\Rpackage{biomaRt}] performs on-line requests using the \texttt{curl}\footnote{\url{http://curl.haxx.se/}} infrastructure. On debian-based systems, you one needs to install \texttt{libcurl-dev} or \texttt{libcurl4-openssl-dev}. \end{description} \subsection{Obtaining the code} The code in this document describes all the examples presented in \cite{R4prot2013} and can be copy, pasted and executed. It is however more convenient to have it in a separate text file for better interaction with \R (using ESS\footnote{\url{http://ess.r-project.org/}} for emacs or RStudio\footnote{\url{http://rstudio.org/}} for instance) to easily modify and explore it. This can be achieved with the \Rfunction{Stangle} function. One needs the Sweave source of this document (a document combining the narration and the \R code) and the \Rfunction{Stangle} then specifically extracts the code chunks and produces a clean \R source file. If the package is installed, the following code chunk will create a \texttt{RforProteomics.R} file in your working directory containing all the annotated source code contained in this document. <>= ## gets the vignette source rnwfile <- dir(system.file(package = "RforProteomics", dir = "doc/vigsrc"), full.name = TRUE, pattern = "RforProteomics.Rnw") ## produces the R file in the working directory Stangle(rnwfile) @ %% $ Alternatively, you can obtain the \texttt{Rnw} file on the github page \url{https://github.com/lgatto/RforProteomics/blob/master/inst/doc/vigsrc/RforProteomics.Rnw}. \subsection{Prepare the working environment} The packages that we will depend on to execute the examples will be loaded in the respective sections. Here, we pre-load packages that provide general functionality used throughout the document. <>= library("RColorBrewer") ## Color palettes library("ggplot2") ## Convenient and nice plotting library("reshape2") ## Flexibly reshape data @ \section{Data standards and input/output} \subsection{The \Rpackage{mzR} package} The \Rpackage{mzR} package \cite{Chambers2012} provides a unified interface to various mass spectrometry open formats. This code chunk, taken mainly from the \Rfunction{openMSfile} documentation illustrated how to open a connection to an raw data file. The example \texttt{mzML} data is taken from the \Rpackage{msdata} data package. The code below would also be applicable to an \texttt{mzXML}, \texttt{mzData} or \texttt{netCDF} file. <>= ## load the required packages library("mzR") ## the software package library("msdata") ## the data package ## below, we extract the releavant example file ## from the local 'msdata' installation filepath <- system.file("microtofq", package = "msdata") file <- list.files(filepath, pattern="MM14.mzML", full.names=TRUE, recursive = TRUE) ## creates a commection to the mzML file mz <- openMSfile(file) ## demonstraction of data access basename(fileName(mz)) isInitialized(mz) runInfo(mz) instrumentInfo(mz) ## once finished, it is good to explicitely ## close the connection close(mz) @ \Rpackage{mzR} is used by other packages, like \Rpackage{MSnbase} \cite{Gatto2012}, \Rpackage{TargetSearch} \cite{TargetSearch2009} and \Rpackage{xcms} \cite{Smith2006, Benton2008, Tautenhahn2008}, that provide a higher level abstraction to the data. \section{Raw data abstraction with \Robject{MSnExp} objects} \Rpackage{MSnbase} \cite{Gatto2012} provides base functions and classes for MS-based proteomics that allow facile data and meta-data processing, manipulation and plotting (see for instance figure \ref{fig:msnexp} on page \pageref{fig:msnexp}). <>= library("MSnbase") ## uses a simple dummy test included in the package mzXML <- dir(system.file(package="MSnbase",dir="extdata"), full.name=TRUE, pattern="mzXML$") basename(mzXML) ## reads the raw data into and MSnExp instance raw <- readMSData(mzXML, verbose = FALSE) raw ## Extract a single spectrum raw[[3]] @ %% $ \begin{figure}[!ht] <>= plot(raw, full=TRUE) plot(raw[[3]], full=TRUE, reporters=iTRAQ4) @ \caption{The \Rfunction{plot} method can be used on experiments, i.e. spectrum collections (left), or individual spectra (right). } \label{fig:msnexp} \end{figure} \clearpage \subsection{\texttt{mgf} read/write support} Read and write support for data in the \texttt{mgf}\footnote{\url{http://www.matrixscience.com/help/data_file_help.html\#GEN}} and \texttt{mzTab}\footnote{\url{https://code.google.com/p/mztab/}} formats are available via the \Rfunction{readMgfData}/\Rfunction{writeMgfData} and \Rfunction{readMzTabData}/\Rfunction{writeMzTabData} functions, respectively. An example for the latter is shown in the next section. \section{Quantitative proteomics} As an running example throughout this document, we will use a TMT 6-plex data set, \texttt{PXD000001} to illustrate quantitative data processing. The code chunk below first downloads this data file from the ProteomeXchange server using the \Rfunction{getPXD000001mzTab} function from the \Rpackage{RforProteomics} package. \subsection{The \texttt{mzTab} format} The first code chunk downloads the data, reads it into \R and generates an \Robject{MSnSet} instance and then calculates protein intensities by summing the peptide quantitation data. Figure \ref{fig:matplot} illustrates the intensities for 5 proteins. <>= ## Downloads the experiment mztab <- getPXD000001mzTab() mztab ## the mzTab file name ## Load mzTab peptide data qnt <- readMzTabData(mztab, what = "PEP") sampleNames(qnt) <- reporterNames(TMT6) head(exprs(qnt)) ## combine into proteins ## - using the 'accession' feature meta data ## - sum the peptide intensities protqnt <- combineFeatures(qnt, groupBy = fData(qnt)$accession, fun = sum) @ \begin{figure}[!ht] <>= cls <- brewer.pal(5, "Set1") matplot(t(tail(exprs(protqnt), n = 5)), type = "b", lty = 1, col = cls, ylab = "Protein intensity (summed peptides)", xlab = "TMT reporters") legend("topright", tail(featureNames(protqnt), n=5), lty = 1, bty = "n", cex = .8, col = cls) @ \caption{Protein quantitation data. } \label{fig:matplot} \end{figure} <>= qntS <- normalise(qnt, "sum") qntV <- normalise(qntS, "vsn") qntV2 <- normalise(qnt, "vsn") acc <- c("P00489", "P00924", "P02769", "P62894", "ECA") idx <- sapply(acc, grep, fData(qnt)$accession) idx2 <- sapply(idx, head, 3) small <- qntS[unlist(idx2), ] idx3 <- sapply(idx, head, 10) medium <- qntV[unlist(idx3), ] m <- exprs(medium) colnames(m) <- c("126", "127", "128", "129", "130", "131") rownames(m) <- fData(medium)$accession rownames(m)[grep("CYC", rownames(m))] <- "CYT" rownames(m)[grep("ENO", rownames(m))] <- "ENO" rownames(m)[grep("ALB", rownames(m))] <- "BSA" rownames(m)[grep("PYGM", rownames(m))] <- "PHO" rownames(m)[grep("ECA", rownames(m))] <- "Background" cls <- c(brewer.pal(length(unique(rownames(m)))-1, "Set1"), "grey") names(cls) <- unique(rownames(m)) wbcol <- colorRampPalette(c("white", "darkblue"))(256) @ %% $ \begin{figure}[!ht] <>= heatmap(m, col = wbcol, RowSideColors=cls[rownames(m)]) @ \caption{A heatmap. } \label{fig:heatmap} \end{figure} \begin{figure}[!ht] <>= dfr <- data.frame(exprs(small), Protein = as.character(fData(small)$accession), Feature = featureNames(small), stringsAsFactors = FALSE) colnames(dfr) <- c("126", "127", "128", "129", "130", "131", "Protein", "Feature") dfr$Protein[dfr$Protein == "sp|P00924|ENO1_YEAST"] <- "ENO" dfr$Protein[dfr$Protein == "sp|P62894|CYC_BOVIN"] <- "CYT" dfr$Protein[dfr$Protein == "sp|P02769|ALBU_BOVIN"] <- "BSA" dfr$Protein[dfr$Protein == "sp|P00489|PYGM_RABIT"] <- "PHO" dfr$Protein[grep("ECA", dfr$Protein)] <- "Background" dfr2 <- melt(dfr) ggplot(aes(x = variable, y = value, colour = Protein), data = dfr2) + geom_point() + geom_line(aes(group=as.factor(Feature)), alpha = 0.5) + facet_grid(. ~ Protein) + theme(legend.position="none") + labs(x = "Reporters", y = "Normalised intensity") @ %% $ \caption{Spikes plot using \Rpackage{ggplot2}.} \label{fig:spikes} \end{figure} \clearpage \subsection{Working with raw data} <>= mzxml <- getPXD000001mzXML() rawms <- readMSData(mzxml, centroided = TRUE, verbose = FALSE) qntms <- quantify(rawms, reporters = TMT7, method = "max", verbose = FALSE, parallel = FALSE) d <- data.frame(Signal = rowSums(exprs(qntms)[, 1:6]), Incomplete = exprs(qntms)[, 7]) d <- log(d) cls <- rep("#00000050", nrow(qnt)) pch <- rep(1, nrow(qnt)) cls[grep("P02769", fData(qnt)$accession)] <- "gold4" ## BSA cls[grep("P00924", fData(qnt)$accession)] <- "dodgerblue" ## ENO cls[grep("P62894", fData(qnt)$accession)] <- "springgreen4" ## CYT cls[grep("P00489", fData(qnt)$accession)] <- "darkorchid2" ## PHO pch[grep("P02769", fData(qnt)$accession)] <- 19 pch[grep("P00924", fData(qnt)$accession)] <- 19 pch[grep("P62894", fData(qnt)$accession)] <- 19 pch[grep("P00489", fData(qnt)$accession)] <- 19 @ <>= mzp <- plotMzDelta(rawms, reporters = TMT6, verbose = FALSE) + ggtitle("") @ \begin{figure}[!ht] <>= mzp @ \caption{A m/z delta plot.} \label{fig:plotmzdelta} \end{figure} \begin{figure}[!ht] <>= plot(Signal ~ Incomplete, data = d, xlab = expression(Incomplete~dissociation), ylab = expression(Sum~of~reporters~intensities), pch = 19, col = "#4582B380") grid() abline(0, 1, lty = "dotted") abline(lm(Signal ~ Incomplete, data = d), col = "darkblue") @ \caption{Incomplete dissociation.} \label{fig:incompl} \end{figure} \begin{figure}[!ht] <>= MAplot(qnt[, c(4, 2)], cex = .9, col = cls, pch = pch, show.statistics = FALSE) @ \caption{MAplot on an \Robject{MSnSet} instance.} \label{fig:maplot} \end{figure} \clearpage \subsection{The \Rpackage{MALDIquant} package} This section illustrates some of \Rpackage{MALDIquant}'s data processing capabilities \cite{Gibb2012}. The code is taken from the \texttt{processing-peaks.R} script downloaded from the package homepage\footnote{\url{http://strimmerlab.org/software/maldiquant/}}. \subsubsection*{Loading the data} <>= ## load packages library("MALDIquant") library("MALDIquantForeign") ## getting test data datapath <- file.path(system.file("Examples", package = "readBrukerFlexData"), "2010_05_19_Gibb_C8_A1") dir(datapath) sA1 <- importBrukerFlex(datapath, verbose=FALSE) # in the following we use only the first spectrum s <- sA1[[1]] summary(mass(s)) summary(intensity(s)) head(as.matrix(s)) @ \begin{figure}[!ht] <>= plot(s) @ \caption{Spectrum plotting in \Rpackage{MALDIquant}.} \label{fig:mqplot} \end{figure} \subsubsection*{Preprocessing} <>= ## sqrt transform (for variance stabilization) s2 <- transformIntensity(s, fun=sqrt) s2 ## smoothing - 5 point moving average s3 <- transformIntensity(s2, movingAverage, halfWindowSize=2) s3 ## baseline subtraction s4 <- removeBaseline(s3, method="SNIP") s4 @ \subsubsection*{Peak picking} <>= ## peak picking p <- detectPeaks(s4) length(p) # 181 peak.data <- as.matrix(p) # extract peak information @ \begin{figure}[!ht] <>= par(mfrow=c(2,3)) xl <- range(mass(s)) # use same xlim on all plots for better comparison plot(s, sub="", main="1: raw", xlim=xl) plot(s2, sub="", main="2: variance stabilisation", xlim=xl) plot(s3, sub="", main="3: smoothing", xlim=xl) plot(s4, sub="", main="4: base line correction", xlim=xl) plot(s4, sub="", main="5: peak detection", xlim=xl) points(p) top20 <- intensity(p) %in% sort(intensity(p), decreasing=TRUE)[1:20] labelPeaks(p, index=top20, underline=TRUE) plot(p, sub="", main="6: peak plot", xlim=xl) labelPeaks(p, index=top20, underline=TRUE) @ \caption{Spectrum plotting in \Rpackage{MALDIquant}.} \label{fig:mqplot} \end{figure} \clearpage \subsection{Working with peptide sequences} <>= library(IPPD) library(BRAIN) atoms <- getAtomsFromSeq("SIVPSGASTGVHEALEMR") unlist(atoms) library(Rdisop) pepmol <- getMolecule(paste0(names(atoms), unlist(atoms), collapse = "")) pepmol ## library(OrgMassSpecR) data(itraqdata) simplottest <- itraqdata[featureNames(itraqdata) %in% paste0("X", 46:47)] sim <- SpectrumSimilarity(as(simplottest[[1]], "data.frame"), as(simplottest[[2]], "data.frame"), top.lab = "itraqdata[['X46']]", bottom.lab = "itraqdata[['X47']]", b = 25) title(main = paste("Spectrum similarity", round(sim, 3))) MonoisotopicMass(formula = list(C = 2, O = 1, H=6)) molecule <- getMolecule("C2H5OH") molecule$exactmass ## x11() ## plot(t(.pepmol$isotopes[[1]]), type = "h") ## x <- IsotopicDistribution(formula = list(C = 2, O = 1, H=6)) ## t(molecule$isotopes[[1]]) ## par(mfrow = c(2,1)) ## plot(t(molecule$isotopes[[1]]), type = "h") ## plot(x[, c(1,3)], type = "h") ## data(myo500) ## masses <- c(147.053, 148.056) ## intensities <- c(93, 5.8) ## molecules <- decomposeIsotopes(masses, intensities) ## experimental eno peptides exppep <- as.character(fData(qnt[grep("ENO", fData(qnt)[, 2]), ])[, 1]) ## 13 minlength <- min(nchar(exppep)) eno <- download.file("http://www.uniprot.org/uniprot/P00924.fasta", destfile = "P00924.fasta") eno <- paste(readLines("P00924.fasta")[-1], collapse = "") enopep <- Digest(eno, missed = 1) nrow(enopep) ## 103 sum(nchar(enopep$peptide) >= minlength) ## 68 pepcnt <- enopep[enopep[, 1] %in% exppep, ] nrow(pepcnt) ## 13 @ The following code chunks demonstrate how to use the \Rpackage{cleaver} package for in-silico cleavage of polypeptides, e.g. cleaving of \emph{Gastric juice peptide 1 (P01358)} using \emph{Trypsin}: <>= library(cleaver) cleave("LAAGKVEDSD", enzym = "trypsin") @ Sometimes cleavage is not perfect and the enzym miss some cleavage positions: <>= ## miss one cleavage position cleave("LAAGKVEDSD", enzym = "trypsin", missedCleavages = 1) ## miss zero or one cleavage positions cleave("LAAGKVEDSD", enzym = "trypsin", missedCleavages = 0:1) @ <>= ## example code to generate an Texshade image to be ## included directly in a Latex document or R vignette ## seq1file <- "seq1.tex" ## cat("\\begin{texshade}{Figures/P00924.fasta} ## \\setsize{numbering}{footnotesize} ## \\setsize{residues}{footnotesize} ## \\residuesperline*{70} ## \\shadingmode{functional} ## \\hideconsensus ## \\vsepspace{1mm} ## \\hidenames ## \\noblockskip\n", file = seq1file) ## tmp <- sapply(1:nrow(pepcnt), function(i) { ## col <- ifelse((i %% 2) == 0, "Blue", "RoyalBlue") ## cat("\\shaderegion{1}{", pepcnt$start[i], "..", pepcnt$stop[i], "}{White}{", col, "}\n", ## file = seq1file, append = TRUE) ## }) ## cat("\\end{texshade} ## \\caption{Visualising observed peptides for the Yeast enolase protein. Peptides are shaded in blue and black. The last peptide is a mis-cleavage and overlaps with \\texttt{IEEELGDNAVFAGENFHHGDK}.} ## \\label{fig:seq} ## \\end{center} ## \\end{figure}\n\n", ## file = seq1file, append = TRUE) @ %% $ \subsubsection*{$^{15}N$ incorporation} \begin{figure}[!ht] <>= ## 15N example ## incorporation rates from 0, 0.1, ..., 0.9, 0.95, 1 incrate <- c(seq(0, 0.9, 0.1), 0.95, 1) inc <- lapply(incrate, function(inc) IsotopicDistributionN("YEVQGEVFTKPQLWP", inc)) par(mfrow = c(4,3)) for (i in 1:length(inc)) plot(inc[[i]][, c(1, 3)], xlim = c(1823, 1848), type = "h", main = paste0("15N incorporation at ", incrate[i]*100, "%")) @ \caption{Isotopic envelope for the \texttt{YEVQGEVFTKPQLWP} peptide at different $^{15}N$ incorporation rates. } \label{fig:n15} \end{figure} \clearpage \subsection{The \Rpackage{isobar} package} The \Rpackage{isobar} package \cite{Breitwieser2011} provides methods for the statistical analysis of isobarically tagged MS$^2$ experiments. %% <>= %% ## temporary local fix %% subsetIBSpectra <- %% function (x, protein = NULL, %% peptide = NULL, direction = "exclude", %% specificity = %% c(REPORTERSPECIFIC, GROUPSPECIFIC, UNSPECIFIC), %% ...) %% { %% if (is.null(protein)) %% sel.spectra <- spectrumSel(x, peptide = peptide, ...) %% else sel.spectra <- spectrumSel(x, %% protein = protein, %% specificity = specificity, %% ...) %% if (direction == "exclude") %% sel.spectra <- !sel.spectra %% for (aden in assayDataElementNames(x)) { %% assayDataElement(x, aden) <- %% assayDataElement(x, aden)[sel.spectra, ] %% } %% ## changed this from as.data.frame %% pg.df <- as(proteinGroup(x), "data.frame") %% proteinGroup(x) <- %% ProteinGroup(pg.df[pg.df[, "spectrum"] %in% %% fData(x)[sel.spectra, "spectrum"], ]) %% featureData(x) <- as(fData(x)[sel.spectra, ], %% "AnnotatedDataFrame") %% x %% } %% @ <>= library(isobar) ## Prepare the PXD000001 data for isobar analysis .ions <- exprs(qnt) .mass <- matrix(mz(TMT6), nrow(qnt), byrow=TRUE, ncol = 6) colnames(.ions) <- colnames(.mass) <- reporterTagNames(new("TMT6plexSpectra")) rownames(.ions) <- rownames(.mass) <- paste(fData(qnt)$accession, fData(qnt)$sequence, sep = ".") pgtbl <- data.frame(spectrum = rownames(.ions), peptide = fData(qnt)$sequence, modif = ":", start.pos = 1, protein = fData(qnt)$accession, accession = fData(qnt)$accession) x <- new("TMT6plexSpectra", pgtbl, .ions, .mass) featureData(x)$proteins <- as.character(fData(qnt)$accession) x <- correctIsotopeImpurities(x) ## using identity matrix here x <- normalize(x, per.file = FALSE) ## spikes spks <- c(protein.g(proteinGroup(x), "P00489"), protein.g(proteinGroup(x), "P00924"), protein.g(proteinGroup(x), "P02769"), protein.g(proteinGroup(x), "P62894")) cls2 <- rep("#00000040", nrow(x)) pch2 <- rep(1, nrow(x)) cls2[grep("P02769", featureNames(x))] <- "gold4" ## BSA cls2[grep("P00924", featureNames(x))] <- "dodgerblue" ## ENO cls2[grep("P62894", featureNames(x))] <- "springgreen4" ## CYT cls2[grep("P00489", featureNames(x))] <- "darkorchid2" ## PHO pch2[grep("P02769", featureNames(x))] <- 19 pch2[grep("P00924", featureNames(x))] <- 19 pch2[grep("P62894", featureNames(x))] <- 19 pch2[grep("P00489", featureNames(x))] <- 19 nm <- NoiseModel(x) ib.background <- subsetIBSpectra(x, protein=spks, direction = "exclude") nm.background <- NoiseModel(ib.background) ib.spks <- subsetIBSpectra(x, protein = spks, direction="include", specificity="reporter-specific") nm.spks <- NoiseModel(ib.spks, one.to.one=FALSE, pool=TRUE) ratios <- 10^estimateRatio(x, nm, channel1="127", channel2="129", protein = spks, combine = FALSE)[, "lratio"] res <- estimateRatio(x, nm, channel1="127", channel2="129", protein = unique(fData(x)$proteins), combine = FALSE, sign.level = 0.01)[, c(1, 2, 6, 8)] res <- as.data.frame(res) res$lratio <- -(res$lratio) cls3 <- rep("#00000050", nrow(res)) pch3 <- rep(1, nrow(res)) cls3[grep("P02769", rownames(res))] <- "gold4" ## BSA cls3[grep("P00924", rownames(res))] <- "dodgerblue" ## ENO cls3[grep("P62894", rownames(res))] <- "springgreen4" ## CYT cls3[grep("P00489", rownames(res))] <- "darkorchid2" ## PHO pch3[grep("P02769", rownames(res))] <- 19 pch3[grep("P00924", rownames(res))] <- 19 pch3[grep("P62894", rownames(res))] <- 19 pch3[grep("P00489", rownames(res))] <- 19 rat.exp <- c(PHO = 2/2, ENO = 5/1, BSA = 2.5/10, CYT = 1/1) @ %% $ \begin{figure}[!ht] <>= par(mfrow = c(1,2)) maplot(x, noise.model = c(nm.background, nm.spks, nm), channel1="127", channel2="129", pch = 19, col = cls2, main = "Spectra MA plot") abline(h = 1, lty = "dashed", col = "grey") legend("topright", c("BSA", "ENO", "CYT", "PHO"), pch = 19, col = c("gold4", "dodgerblue", "springgreen4", "darkorchid2"), bty = "n", cex = .7) plot(res$lratio, -log10(res$p.value.rat), col = cls3, pch = pch3, xlab = expression(log[10]~fold-change), ylab = expression(-log[10]~p-value), main = "Protein volcano plot", xlim = c(-0.7, 0.7)) grid() abline(h = -log10(0.01), lty = "dotted") abline(v = log10(c(2, 0.5)), lty = "dotted") abline(v = -0.003, col = "springgreen4", lty = "dashed", lwd = 2) abline(v = 0.003, col = "darkorchid2", lty = "dashed", lwd = 2) abline(v = log10(5), col = "dodgerblue", lty = "dashed", lwd = 2) abline(v = log10(0.25), col = "gold4", lty = "dashed", lwd = 2) points(res[spks, "lratio"], -log10(res[spks, "p.value.rat"]), col = c("darkorchid2", "dodgerblue", "gold4", "springgreen4"), pch = 19) @ \caption{Result from the \Rpackage{isobar} pipeline. } \label{fig:ibplot} \end{figure} \clearpage \subsection{The \Rpackage{synapter} package} The \Rpackage{synapter} \cite{synapter} package comes with a detailed vignette that describes how to prepare the MS$^E$ data and then process it in \R. Several interfaces are available provided the user with maximum control, easy batch processing capabilities or a graphical user interface. The conversion into \Robject{MSnSet} instances and filter and combination thereof as well as statistical analysis are also described. <>= ## open the synapter vignette library("synapter") synapterGuide() @ \section{MS$^2$ spectra identification} A recent addition to Bioconductor 2.12 is the \Rpackage{rTANDEM} package, that provides a direct interface to the X!Tandem software \cite{Craig2004}. A typical \Rpackage{rTANDEM} pipeline comprises \begin{enumerate} \item Prepare the input data. \item Run the search. \item Import the search results and extract the peptides and proteins \end{enumerate} Using example code/data from the \Rpackage{rTANDEM} vignette/package, these steps are executed as described below. \subsection{Preparation of the input data} <>= library(rTANDEM) taxonomy <- rTTaxo(taxon = "yeast", format = "peptide", URL = system.file( "extdata/fasta/scd.fasta.pro", package="rTANDEM")) param <- rTParam() param <- setParamValue(param, 'protein', 'taxon', value="yeast") param <- setParamValue(param, 'list path', 'taxonomy information', taxonomy) param <- setParamValue(param, 'list path', 'default parameters', value = system.file( "extdata/default_input.xml", package="rTANDEM")) param <- setParamValue(param, 'spectrum', 'path', value = system.file( "extdata/test_spectra.mgf", package="rTANDEM")) param <- setParamValue(param, 'output', 'xsl path', value = system.file( "extdata/tandem-input-style.xsl", package="rTANDEM")) param <- setParamValue(param, 'output', 'path', value = paste(getwd(), "output.xml", sep="/")) @ \subsection{Performing the search} The analysis is run using the \Rfunction{tandem} function (see also the \Rfunction{rtandem} function), which returns the results data file path (only the file name is displayed below). <>= resultPath <- tandem(param) basename(resultPath) @ \subsection{Import and analyse results} <>= res <- GetResultsFromXML(resultPath) ## the inferred proteins proteins <- GetProteins(res, log.expect = -1.3, min.peptides = 2) proteins[, -(4:5), with = FALSE] ## the identified peptides for YFR053C peptides <- GetPeptides(protein.uid = 1811, results = res, expect = 0.05) peptides[, c(1:4, 9, 10:16), with = FALSE] @ More details are provided in the vignette available with \Rfunction(vignette("rTANDEM")), for instance the extraction of degenerated peptides, i.e. peptides found in multiple proteins. \section{Annotation} In this section, we briefly present some Bioconductor annotation infrastructure. We start with the \Rpackage{hpar} package, an interface to the \textit{Human Protein Atlas} \cite{Uhlen2005, Uhlen2010}, to retrieve subcellular localisation information for the \texttt{ENSG00000002746} ensemble gene. <>= id <- "ENSG00000002746" library("hpar") getHpa(id, "SubcellularLoc") @ Below, we make use of the human annotation package \Rpackage{org.Hs.eg.db} and the Gene Ontology annotation package \Rpackage{GO.db} to retrieve the same information as above. <>= library(org.Hs.eg.db) library(GO.db) ans <- select(org.Hs.eg.db, keys = id, cols = c("ENSEMBL", "GO", "ONTOLOGY"), keytype = "ENSEMBL") ans <- ans[ans$ONTOLOGY == "CC", ] ans sapply(as.list(GOTERM[ans$GO]), slot, "Term") @ Finally, this information can also be retrieved from on-line databases using the \Rpackage{biomaRt} package \cite{Durinck2005}. <>= library("biomaRt") ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl") efilter <- "ensembl_gene_id" eattr <- c("go_id", "name_1006", "namespace_1003") bmres <- getBM(attributes=eattr, filters = efilter, values = id, mart = ensembl) bmres[bmres$namespace_1003 == "cellular_component", "name_1006"] @ %% $ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \newpage \section{Other packages}\label{sec:packages} \subsection{Bioconductor packages} <>= biocv <- 2.13 getPackagesInBiocView <- function(view, rep = c("BioCsoft", "BioCann", "BioCexp", "BioCextra"), biocVersion = biocv) { suppressPackageStartupMessages(require("biocViews")) data(biocViewsVocab) rep <- match.arg(rep) biocMirror <- getOption("BioC_mirror", "http://bioconductor.org") biocPaths <- switch(rep, BioCsoft = "bioc", BioCann = "data/annotation", BioCexp = "data/experiment", BioCextra = "extra") rep <- paste(biocMirror, "packages", biocVersion, biocPaths, sep = "/") bv <- getBiocViews(rep, biocViewsVocab, "NoViewProvided") if (!view %in% names(bv)) { warning("BiocView ", view, " not found.") return(NULL) } return(bv[[view]]) } pk <- list(Proteomics = getPackagesInBiocView("Proteomics"), MassSpectrometryData = getPackagesInBiocView("MassSpectrometryData", rep = "BioCexp"), MassSpectrometry = getPackagesInBiocView("MassSpectrometry")) .makePackageTable <- function(x, nsub = TRUE) { Package <- sapply(x@packageList, function(x) x@Package) Title <- sapply(x@packageList, function(x) x@Title) if (nsub) Title <- sub("\n", " ", Title) return(data.frame(Package, Title)) } pkTab <- lapply(pk, .makePackageTable) @ This section provides a complete list of packages available in the relevant Bioconductor version \Sexpr{biocv} (as of \today) \textit{biocView}\footnote{\url{http://www.bioconductor.org/packages/devel/BiocViews.html}} categories. Tables \ref{tab:prot}, \ref{tab:ms} and \ref{tab:msdata} represent the packages for the \texttt{Proteomics} (\Sexpr{nrow(pkTab[["Proteomics"]])} packages), \texttt{MassSpectrometry} (\Sexpr{nrow(pkTab[["MassSpectrometry"]])} packages) and \texttt{MassSpectrometryData} (\Sexpr{nrow(pkTab[["MassSpectrometryData"]])} experiment packages) categories. <>= print(xtable(pkTab[["Proteomics"]], caption = "Packages available under the \\texttt{Proteomics} \\textit{biocViews} category.", table.placement = "thb", align = c("llL"), label = "tab:prot"), include.rownames = FALSE, tabular.environment = "tabulary", width = "1\\textwidth", size = "scriptsize") @ <>= print(xtable(pkTab[["MassSpectrometry"]], caption = "Packages available under the \\texttt{MassSpectrometry} \\textit{biocViews} category.", table.placement = "thb", label = "tab:ms"), include.rownames = FALSE, size = "scriptsize") @ <>= print(xtable(pkTab[["MassSpectrometryData"]], caption = "Experimental Packages available under the \\texttt{MassSpectrometryData} \\textit{biocViews} category.", table.placement = "thb", label = "tab:msdata"), include.rownames = FALSE, size = "scriptsize") @ <>= X <- readLines("http://cran.r-project.org/web/views/ChemPhys.html") x2 <- grep("Related links:", X) x1 <- grep("CRAN packages:", X) np <- length(grep("../packages/", X[x1:x2])) @ \subsection{The Chemometrics and Computational Physics CRAN Task View} The CRAN task view on Chemometrics and Computational Physics\footnote{\url{http://cran.r-project.org/web/views/ChemPhys.html}} lists \Sexpr{np} packages, including a set of packages for mass spectrometry and proteomics, some of which are illustrated in this document. The most relevant (non Bioconductor) packages are summarised below. \begin{description} \item[MALDIquant] provides tools for quantitative analysis of MALDI-TOF mass spectrometry data, with support for baseline correction, peak detection and plotting of mass spectra \\ (\url{http://cran.r-project.org/web/packages/MALDIquant/index.html}). \item[OrgMassSpecR] is for organic/biological mass spectrometry, with a focus on graphical display, quantification using stable isotope dilution, and protein hydrogen/deuterium exchange experiments \\ (\url{http://cran.r-project.org/web/packages/OrgMassSpecR/index.html}). \item[FTICRMS] provides functions for Analyzing Fourier Transform-Ion Cyclotron Resonance Mass Spectrometry Data \\ (\url{http://cran.r-project.org/web/packages/FTICRMS/index.html}). \item[titan] provides a GUI to analyze mass spectrometric data on the relative abundance of two substances from a titration series \\ (\url{http://cran.r-project.org/web/packages/titan/index.html}). \end{description} \subsection{Other CRAN packages} Finally, \Rpackage{digeR}\footnote{\url{http://cran.r-project.org/web/packages/digeR/index.html}}, which is available on CRAN but not listed in the Chemometrics and Computational Physics Task View, provides a GUI interface for analysing 2D DIGE data. It allows to perform correlation analysis, score plot, classification, feature selection and power analysis for 2D DIGE experiment data. \bigskip Suggestions for additional \R packages are welcome and will be added to the vignette. Please send suggestions and possibly a short description and/or a example utilisation with code to \url{lg390@cam.ac.uk}. The only requirement is that the package must be available on an official package channel (CRAN, Bioconductor, R-forge, Omegahat), i.e. not only available through a personal web page. \clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% Section %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session information}\label{sec:sessionInfo} All software and respective versions used in this document, as returned by \Rfunction{sessionInfo()} are detailed below. <>= toLatex(sessionInfo(), locale = FALSE) @ \bibliographystyle{abbrv} \bibliography{RforProteomics} \end{document}