%\VignetteIndexEntry{Using R for proteomics data analysis}
%\VignetteKeywords{Bioinformatics, proteomics, mass spectrometry, tutorial, data}
%\VignettePackage{RforProteomics}

\documentclass[12pt,a4paper,english]{scrartcl}
\usepackage{amsmath,amsfonts,amssymb}
\usepackage{tikz}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage[auth-sc]{authblk}
\usepackage{setspace}
\onehalfspacing

% caption formatting
\setcapindent{0em}
\setkomafont{captionlabel}{\sffamily\bfseries}
\setkomafont{caption}{\sffamily}

\renewcommand\Authands{ and }

\usepackage{color}
\usepackage{graphicx}
\usepackage{xstring}

\usepackage{longtable}
\usepackage{tabulary}


%% colors
\definecolor{Red}{rgb}{0.7,0,0}
\definecolor{Blue}{rgb}{0,0,0.8}

\newcommand{\R}{\textbf{\texttt{R} }}
\newcommand{\Rfunction}[1]{{\textbf{\texttt{#1}}}}
\newcommand{\Rclass}[1]{{\textbf{\texttt{#1}}}}
\newcommand{\Robject}[1]{{\textbf{\texttt{#1}}}}
\newcommand{\Rpackage}[1]{{\mbox{\normalfont\textbf{\textsf{#1}}}}}
\newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}}

\usepackage{hyperref}
\usepackage{breakurl}
\hypersetup{%
  pdfauthor={Laurent Gatto},%
  pdfusetitle,
  bookmarks = {true},
  bookmarksnumbered = {true},
  bookmarksopen = {true},
  bookmarksopenlevel = 2,
  unicode = {true},
  breaklinks = {false},
  hyperindex = {true},
  colorlinks = {true},
  linktocpage = {true},
  plainpages = {false},
  linkcolor = {Blue},
  citecolor = {Blue},
  urlcolor = {Red},
  pdfstartview = {Fit},
  pdfpagemode = {UseOutlines},
  pdfview = {XYZ null null null}
}


\author{
  Laurent Gatto$^1$\thanks{\email{lg390@cam.ac.uk}}
}

\author{
  Sebastian Gibb$^2$
}

\affil{
  \small
  $^1$Cambridge Center for Proteomics, University of Cambridge, UK \\
  $^2$Institute for Medical Informatics, Statistics and Epidemiology, University of Leipzig, Germany 
}

\title{Using \R and Bioconductor for Proteomics Data Analysis}

\begin{document}

\maketitle

<<'setup', include = FALSE, cache = FALSE>>=
library("knitr")
## opts_chunk$set(fig.align = 'center', 
##                fig.show = 'hold', 
##                par = TRUE,
##                prompt = TRUE,
##                comment = NA)
## })
opts_chunk$set(tidy.opts = 
               list(width.cutoff = 50, 
                    tidy = FALSE),
               fig.align = 'center',
               stop_on_error = 1L,
               comment = NA,
               prompt = TRUE)
options(width = 60)
@
%% $

<<'env0', echo = FALSE>>=
library(xtable)
## suppressPackageStartupMessages(library("parallel"))
suppressPackageStartupMessages(library("MSnbase"))
suppressPackageStartupMessages(library("isobar"))
suppressPackageStartupMessages(library("MALDIquant"))   
suppressPackageStartupMessages(library("MALDIquantForeign"))  
suppressPackageStartupMessages(library("IPPD"))                           
suppressPackageStartupMessages(library("rols"))
suppressPackageStartupMessages(library("hpar"))
suppressPackageStartupMessages(library("BRAIN"))
suppressPackageStartupMessages(library("org.Hs.eg.db"))
suppressPackageStartupMessages(library("GO.db"))
suppressPackageStartupMessages(library("Rdisop"))
suppressPackageStartupMessages(library("rTANDEM")) 
@ 

%% Abstract and keywords %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\vskip 0.3in minus 0.1in
\hrule
\begin{abstract}
  This vignette shows and executes the code presented in the manuscript \textit{Using \R for proteomics data analysis}.  
  It also aims at being a general overview useful for new users who wish to explore the \R environment 
  and programming language for the analysis of proteomics data.
\end{abstract}

\textit{Keywords}: bioinformatics, proteomics, mass spectrometry, tutorial.
\vskip 0.1in minus 0.05in
\hrule
\vskip 0.2in minus 0.1in
\vspace{10mm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\tableofcontents

\newpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}\label{sec:intro} 

This document illustrates some existing \R infrastructure for the analysis of proteomics data. 
It presents the code for the use cases taken from \cite{R4prot2013}. A pre-print of the manuscript 
is avaiable on arXiv\footnote{\url{http://arxiv.org/abs/1305.6559}}.

There are however numerous additional \R resources
distributed by the Bioconductor\footnote{\url{http://www.bioconductor.org}} and 
CRAN\footnote{\url{http://cran.r-project.org/web/packages/}} repositories, as well as packages hosted on 
personal websites. Section \ref{sec:packages} on page \pageref{sec:packages} tries to provide a wider 
picture of available packages, without going into details. 

\subsection{General \R resources}

The reader is expected to have basic \R knowledge to find the document helpful. 
There are numerous \R introductions freely available, some of which are listed below. 

From the \R project web-page:
\begin{itemize}
\item \textbf{An Introduction to R} is based on the former \textit{Notes on R}, 
  gives an introduction to the language and how to use R for doing statistical 
  analysis and graphics. 
  [\href{http://cran.r-project.org/doc/manuals/R-intro.html}{browse HTML} | 
    \href{http://cran.r-project.org/doc/manuals/R-intro.pdf}{download PDF}]
\item Several introductory tutorials in the 
  \href{http://cran.r-project.org/other-docs.html}{contributed documentation} section.
\item The \texttt{TeachingMaterial} repository\footnote{\href{https://github.com/lgatto/TeachingMaterial}{https://github.com/lgatto/TeachingMaterial}} contains several sets of slides and vignettes about \R programming.
\end{itemize}

Relevant background on the \R software and its application to computational biology in general and proteomics in particular  
can also be found in \cite{R4prot2013}. For details about the Bioconductor project, the reader is referred to \cite{Gentleman2004}.

\subsection{Getting help}

All \R packages come with ample documentation. Every command (function, class or method) a user is susceptible to use is documented. The documentation can be accessed by preceding the command by a \texttt{?} in the \R console. For example, to obtain help about the \Rfunction{library} function, that will be used in the next section, one would type \Rfunction{?library}. In addition, all Bioconductor packages come with at least one vignette (this document is the vignette that comes with the \Rpackage{RforProteomics} package), a document that combines text and \R code that is executed before the pdf is assembled. To look up all vignettes that come with a package, say \Rpackage{RforProteomics} and then open the vignette of interest, one uses the \Rfunction{vignette} function as illustrated below. More details can be found in \Rfunction{?vignette}.

<<vignette1, echo = TRUE, eval = FALSE>>=
## list all the vignettes in the RforProteomics package
vignette(package = "RforProteomics")
## Open the vignette called RforProteomics
vignette("RforProteomics", package = "RforProteomics")
## or just
vignette("RforProteomics")
@ 


\R has several mailing lists\footnote{\url{http://www.r-project.org/mail.html}}. The most relevant here being the main \texttt{R-help} list, \textit{for discussion about problem and solutions using \R}. This one is for general \R content and is not suitable for bioinformatics or proteomics questions. Bioconductor also offers several mailing lists\footnote{\url{http://bioconductor.org/help/mailing-list/}} dedicated to bioinformatics matters and Bioconductor packages. The main \texttt{Bioconductor} list is the most relevant one. It is possible to post\footnote{\url{http://bioconductor.org/help/mailing-list/mailform/}} questions without subscribing to the list. Finally, the dedicated \Rpackage{RforProteomics} google group\footnote{\url{https://groups.google.com/forum/\#!forum/rbioc-sig-proteomics}} welcomes questions/comments/annoucements related to \R and mass-spectrometry/proteomics. 

It is important to read and comply to the posting guides (\href{http://www.r-project.org/posting-guide.html}{here} and \href{http://bioconductor.org/help/mailing-list/posting-guide/}{here}) to maximise the chances to obtain good responses. It is important to specify the software versions using the \Rfunction{sessionInfo()} functions (see an example output at the end of this document, on page \pageref{sec:sessionInfo}). It the question involves some code, make sure to isolate the relevant portion and report it with your question, trying to make your code/example reproducible\footnote{\url{https://github.com/hadley/devtools/wiki/Reproducibility}}.

All lists have browsable archives.

\subsection{Installation}

The package should be installed using as described below:

<<installation, eval = FALSE>>=
## only first time you install Bioconductor packages
source("http://www.bioconductor.org/biocLite.R")
## else
library("BiocInstaller")
biocLite("RforProteomics")
@ 

To install all dependencies (78 packages) and reproduce the code in the vignette, replace the last line in the code chunk above with:)

<<installation2, eval = FALSE>>=
biocLite("RforProteomics", dependencies = TRUE)
@ 

Finally, the package can be loaded with

<<loadR4Prot>>=
library("RforProteomics")
@ 

See also the `RforProteomics` web page\footnote{\url{http://lgatto.github.io/RforProteomics/}} 
for more information on installation. 

\subsection{External dependencies}

Some packages used in the document depend on external libraries that need to be installed prior to the \R packages:

\begin{description}
  \item[\Rpackage{mzR}] depends on the Common Data Format\footnote{\url{http://cdf.gsfc.nasa.gov/}} (CDF) 
    to CDF-based raw mass-spectrometry data. On linux, the \texttt{libcdf} library is required. 
    On debian-based systems, for instance, one needs to install the \texttt{libnetcdf-dev} package.
  \item[\Rpackage{IPPD}] (and others) depend on the \Rpackage{XML} package which requires the \texttt{libxml2} 
    infrastructure on linux. On debian-based systems, one needs to install \texttt{libxml2-dev}.
  \item [\Rpackage{biomaRt}] performs on-line requests using the 
    \texttt{curl}\footnote{\url{http://curl.haxx.se/}} 
    infrastructure. On debian-based systems, you one needs to install  
    \texttt{libcurl-dev} or \texttt{libcurl4-openssl-dev}.
\end{description}

\subsection{Obtaining the code}

The code in this document describes all the examples presented in \cite{R4prot2013} and can be copy, pasted and executed. It is however more convenient to have it in a separate text file for better interaction with \R (using ESS\footnote{\url{http://ess.r-project.org/}} for emacs or RStudio\footnote{\url{http://rstudio.org/}} for instance) to easily modify and explore it. This can be achieved with the \Rfunction{Stangle} function. One needs the Sweave source of this document (a document combining the narration and the \R code) and the \Rfunction{Stangle} then specifically extracts the code chunks and produces a clean \R source file. If the package is installed, the following code chunk will create a \texttt{RforProteomics.R} file in your working directory containing all the annotated source code contained in this document.

<<stangle, eval=TRUE, tidy=FALSE, error=FALSE>>= 
## gets the vignette source
rnwfile <- dir(system.file(package = "RforProteomics", 
                           dir = "doc/vigsrc"),
               full.name = TRUE,
               pattern = "RforProteomics.Rnw")
## produces the R file in the working directory
Stangle(rnwfile)
@ %% $

Alternatively, you can obtain the \texttt{Rnw} file on the github page \url{https://github.com/lgatto/RforProteomics/blob/master/inst/doc/vigsrc/RforProteomics.Rnw}.

\subsection{Prepare the working environment}

The packages that we will depend on to execute the examples will be loaded in the respective sections. Here, we pre-load packages that provide general functionality used throughout the document.

<<env>>=
library("RColorBrewer") ## Color palettes
library("ggplot2")  ## Convenient and nice plotting
library("reshape2") ## Flexibly reshape data 
@ 

\section{Data standards and input/output}

\subsection{The \Rpackage{mzR} package}

The \Rpackage{mzR} package \cite{Chambers2012} provides a unified interface to various mass spectrometry open formats. 
This code chunk, taken mainly from the \Rfunction{openMSfile} documentation illustrated how to open a connection to an 
raw data file. The example \texttt{mzML} data is taken from the \Rpackage{msdata} data package. 
The code below would also be applicable to an \texttt{mzXML}, \texttt{mzData} or \texttt{netCDF} file.

<<mzr>>=
## load the required packages
library("mzR") ## the software package
library("msdata") ## the data package
## below, we extract the releavant example file 
## from the local 'msdata' installation 
filepath <- system.file("microtofq", package = "msdata")
file <- list.files(filepath, pattern="MM14.mzML",
                   full.names=TRUE, recursive = TRUE)
## creates a commection to the mzML file
mz <- openMSfile(file)
## demonstraction of data access
basename(fileName(mz))
isInitialized(mz)
runInfo(mz)
instrumentInfo(mz)
## once finished, it is good to explicitely 
## close the connection
close(mz)
@ 

\Rpackage{mzR} is used by other packages, like \Rpackage{MSnbase} \cite{Gatto2012}, 
\Rpackage{TargetSearch} \cite{TargetSearch2009}
and \Rpackage{xcms} \cite{Smith2006, Benton2008, Tautenhahn2008}, 
that provide a higher level abstraction to the data.

\section{Raw data abstraction with \Robject{MSnExp} objects}

\Rpackage{MSnbase} \cite{Gatto2012} provides base functions and classes for MS-based proteomics 
that allow facile data and meta-data processing, manipulation and plotting 
(see for instance figure \ref{fig:msnexp} on page \pageref{fig:msnexp}).

<<msnexp>>= 
library("MSnbase")
## uses a simple dummy test included in the package
mzXML <- dir(system.file(package="MSnbase",dir="extdata"),
             full.name=TRUE,
             pattern="mzXML$")
basename(mzXML)
## reads the raw data into and MSnExp instance
raw <- readMSData(mzXML, verbose = FALSE)
raw
## Extract a single spectrum
raw[[3]]
@ 
%% $

\begin{figure}[!ht]
<<msnexpPlot, dev='pdf', echo=TRUE, fig.width=5, fig.height=5, fig.keep='all', out.width='.7\\linewidth'>>=
plot(raw, full=TRUE)
plot(raw[[3]], full=TRUE, reporters=iTRAQ4)
@ 
\caption{The \Rfunction{plot} method can be used on experiments, i.e. spectrum collections (left), or individual spectra (right). }
\label{fig:msnexp}
\end{figure}

\clearpage

\subsection{\texttt{mgf} read/write support}

Read and write support for data in the 
\texttt{mgf}\footnote{\url{http://www.matrixscience.com/help/data_file_help.html\#GEN}} 
and \texttt{mzTab}\footnote{\url{https://code.google.com/p/mztab/}} formats
are available via the \Rfunction{readMgfData}/\Rfunction{writeMgfData} 
and \Rfunction{readMzTabData}/\Rfunction{writeMzTabData} functions, respectively.
An example for the latter is shown in the next section.

\section{Quantitative proteomics}

As an running example throughout this document, we will use a TMT 6-plex data set, 
\texttt{PXD000001} to illustrate quantitative data processing. 
The code chunk below first downloads this data file from the ProteomeXchange server using 
the \Rfunction{getPXD000001mzTab} function from the \Rpackage{RforProteomics} package.

\subsection{The \texttt{mzTab} format}

The first code chunk downloads the data, reads it into \R and generates 
an \Robject{MSnSet} instance and then calculates protein intensities 
by summing the peptide quantitation data.
Figure \ref{fig:matplot} illustrates the intensities for 5 proteins.

<<mztab, tidy = FALSE>>=
## Downloads the experiment
mztab <- getPXD000001mzTab()
mztab ## the mzTab file name
## Load mzTab peptide data 
qnt <- readMzTabData(mztab, what = "PEP")
sampleNames(qnt) <- reporterNames(TMT6)
head(exprs(qnt))
## combine into proteins
## - using the 'accession' feature meta data
## - sum the peptide intensities
protqnt <- combineFeatures(qnt, 
                           groupBy = fData(qnt)$accession, 
                           fun = sum)
@ 


\begin{figure}[!ht]
<<matplot, dev='pdf', echo=TRUE, fig.width=5.5, fig.height=5.5, fig.keep='last', out.width='.6\\linewidth', tidy=FALSE>>=
cls <- brewer.pal(5, "Set1")
matplot(t(tail(exprs(protqnt), n = 5)), type = "b", 
        lty = 1, col = cls,
        ylab = "Protein intensity (summed peptides)",
        xlab = "TMT reporters")
legend("topright", tail(featureNames(protqnt), n=5), 
       lty = 1, bty = "n", cex = .8, col = cls)
@ 
\caption{Protein quantitation data. }
\label{fig:matplot}
\end{figure}

<<mztab2>>=
qntS <- normalise(qnt, "sum")
qntV <- normalise(qntS, "vsn")
qntV2 <- normalise(qnt, "vsn")

acc <- c("P00489", "P00924", 
         "P02769", "P62894",
         "ECA")

idx <- sapply(acc, grep, fData(qnt)$accession)
idx2 <- sapply(idx, head, 3)
small <- qntS[unlist(idx2), ]

idx3 <- sapply(idx, head, 10)
medium <- qntV[unlist(idx3), ]

m <- exprs(medium)
colnames(m) <- c("126", "127", "128",
                 "129", "130", "131")
rownames(m) <- fData(medium)$accession
rownames(m)[grep("CYC", rownames(m))] <- "CYT"
rownames(m)[grep("ENO", rownames(m))] <- "ENO"
rownames(m)[grep("ALB", rownames(m))] <- "BSA"
rownames(m)[grep("PYGM", rownames(m))] <- "PHO"
rownames(m)[grep("ECA", rownames(m))] <- "Background"

cls <- c(brewer.pal(length(unique(rownames(m)))-1, "Set1"),
         "grey")
names(cls) <- unique(rownames(m))
wbcol <- colorRampPalette(c("white", "darkblue"))(256)
@ %% $

\begin{figure}[!ht]
<<heatmap, dev='pdf', echo=TRUE, fig.width=5.5, fig.height=5.5, fig.keep='last', out.width='.6\\linewidth'>>=
heatmap(m, col = wbcol, RowSideColors=cls[rownames(m)])
@ 
\caption{A heatmap. }
\label{fig:heatmap}
\end{figure}


\begin{figure}[!ht]
<<spikes, dev='pdf', echo=TRUE, fig.width=9, fig.height=5, fig.keep='all', out.width='1\\linewidth', tidy=FALSE>>=
dfr <- data.frame(exprs(small),
                  Protein = as.character(fData(small)$accession),
                  Feature = featureNames(small),
                  stringsAsFactors = FALSE)
colnames(dfr) <- c("126", "127", "128", "129", "130", "131", 
                   "Protein", "Feature")
dfr$Protein[dfr$Protein == "sp|P00924|ENO1_YEAST"] <- "ENO"
dfr$Protein[dfr$Protein == "sp|P62894|CYC_BOVIN"]  <- "CYT"
dfr$Protein[dfr$Protein == "sp|P02769|ALBU_BOVIN"] <- "BSA"
dfr$Protein[dfr$Protein == "sp|P00489|PYGM_RABIT"] <- "PHO"
dfr$Protein[grep("ECA", dfr$Protein)] <- "Background"
dfr2 <- melt(dfr)
ggplot(aes(x = variable, y = value, colour = Protein), 
       data = dfr2) +
  geom_point() + 
  geom_line(aes(group=as.factor(Feature)), alpha = 0.5) +
  facet_grid(. ~ Protein) + theme(legend.position="none") +
  labs(x = "Reporters", y = "Normalised intensity")
@ %% $
\caption{Spikes plot using \Rpackage{ggplot2}.}
\label{fig:spikes}
\end{figure}

\clearpage

\subsection{Working with raw data}

<<mzxml, cache=TRUE>>=
mzxml <- getPXD000001mzXML()
rawms <- readMSData(mzxml, centroided = TRUE, verbose = FALSE)
qntms <- quantify(rawms, reporters = TMT7, method = "max", 
                  verbose = FALSE, parallel = FALSE)

d <- data.frame(Signal = rowSums(exprs(qntms)[, 1:6]),
                Incomplete = exprs(qntms)[, 7])
d <- log(d)
cls <- rep("#00000050", nrow(qnt))
pch <- rep(1, nrow(qnt))
cls[grep("P02769", fData(qnt)$accession)] <- "gold4" ## BSA
cls[grep("P00924", fData(qnt)$accession)] <- "dodgerblue" ## ENO
cls[grep("P62894", fData(qnt)$accession)] <- "springgreen4" ## CYT
cls[grep("P00489", fData(qnt)$accession)] <- "darkorchid2" ## PHO
pch[grep("P02769", fData(qnt)$accession)] <- 19
pch[grep("P00924", fData(qnt)$accession)] <- 19
pch[grep("P62894", fData(qnt)$accession)] <- 19
pch[grep("P00489", fData(qnt)$accession)] <- 19
@ 

<<mzp, cache = TRUE, fig.keep='none', warning = FALSE>>=
mzp <- plotMzDelta(rawms, reporters = TMT6, verbose = FALSE) + ggtitle("")
@ 

\begin{figure}[!ht]
<<plotmzdelta, dev='pdf', echo=TRUE, fig.width=8, fig.height=5, fig.keep='last', out.width='0.9\\linewidth', warning = FALSE>>=
mzp
@ 
\caption{A m/z delta plot.}
\label{fig:plotmzdelta}
\end{figure}


\begin{figure}[!ht]
<<incompl, dev='pdf', echo=TRUE, fig.width=5.5, fig.height=5.5, fig.keep='last', out.width='.6\\linewidth', tidy=FALSE>>=
plot(Signal ~ Incomplete, data = d, 
     xlab = expression(Incomplete~dissociation), 
     ylab = expression(Sum~of~reporters~intensities),
     pch = 19,
     col = "#4582B380")
grid()
abline(0, 1, lty = "dotted")
abline(lm(Signal ~ Incomplete, data = d), col = "darkblue")
@ 
\caption{Incomplete dissociation.}
\label{fig:incompl}
\end{figure}
 

\begin{figure}[!ht]
<<maplot, dev='pdf', echo=TRUE, fig.width=5.5, fig.height=5.5, fig.keep='last', out.width='.6\\linewidth'>>=
MAplot(qnt[, c(4, 2)], cex = .9, col = cls, pch = pch, show.statistics = FALSE)
@ 
\caption{MAplot on an \Robject{MSnSet} instance.}
\label{fig:maplot}
\end{figure}

\clearpage

\subsection{The \Rpackage{MALDIquant} package}

This section illustrates some of \Rpackage{MALDIquant}'s data processing capabilities \cite{Gibb2012}. 
The code is taken from the \texttt{processing-peaks.R} script downloaded from the package 
homepage\footnote{\url{http://strimmerlab.org/software/maldiquant/}}.

\subsubsection*{Loading the data}

<<mqload, tidy=FALSE>>=
## load packages
library("MALDIquant")
library("MALDIquantForeign")
## getting test data
datapath <- 
  file.path(system.file("Examples", 
                        package = "readBrukerFlexData"), 
            "2010_05_19_Gibb_C8_A1")
dir(datapath)
sA1 <- importBrukerFlex(datapath, verbose=FALSE)
# in the following we use only the first spectrum
s <- sA1[[1]] 

summary(mass(s))
summary(intensity(s))
head(as.matrix(s))
@ 

\begin{figure}[!ht]
<<mqplot, dev='pdf', echo=TRUE, fig.width=7, fig.height=5.5, fig.keep='last', out.width='.6\\linewidth'>>=
plot(s)
@ 
\caption{Spectrum plotting in \Rpackage{MALDIquant}.}
\label{fig:mqplot}
\end{figure}

\subsubsection*{Preprocessing}

<<mqpreproc>>=
## sqrt transform (for variance stabilization)
s2 <- transformIntensity(s, fun=sqrt)
s2

## smoothing - 5 point moving average 
s3 <- transformIntensity(s2, movingAverage, halfWindowSize=2)
s3

## baseline subtraction
s4 <- removeBaseline(s3, method="SNIP")
s4
@ 

\subsubsection*{Peak picking}

<<mqred>>=
## peak picking
p <- detectPeaks(s4)
length(p) # 181
peak.data <- as.matrix(p) # extract peak information
@ 


\begin{figure}[!ht]
<<mqplot2, dev='pdf', echo=TRUE, fig.width=10, fig.height=5, fig.keep='high', out.width='1\\linewidth'>>=
par(mfrow=c(2,3))
xl <- range(mass(s)) 
# use same xlim on all plots for better comparison
plot(s, sub="", main="1: raw", xlim=xl)
plot(s2, sub="", main="2: variance stabilisation", xlim=xl)
plot(s3, sub="", main="3: smoothing", xlim=xl)
plot(s4, sub="", main="4: base line correction", xlim=xl)
plot(s4, sub="", main="5: peak detection", xlim=xl)
points(p)
top20 <- intensity(p) %in% sort(intensity(p), decreasing=TRUE)[1:20]
labelPeaks(p, index=top20, underline=TRUE)
plot(p, sub="", main="6: peak plot", xlim=xl)
labelPeaks(p, index=top20, underline=TRUE)
@ 
\caption{Spectrum plotting in \Rpackage{MALDIquant}.}
\label{fig:mqplot}
\end{figure}

\clearpage

\subsection{Working with peptide sequences}

<<isotopes, tidy = FALSE, warning = FALSE>>=
library(IPPD)
library(BRAIN)
atoms <- getAtomsFromSeq("SIVPSGASTGVHEALEMR")
unlist(atoms)

library(Rdisop)
pepmol <- getMolecule(paste0(names(atoms), 
                             unlist(atoms), 
                             collapse = ""))
pepmol

##
library(OrgMassSpecR)
data(itraqdata)

simplottest <- 
  itraqdata[featureNames(itraqdata) %in% paste0("X", 46:47)]
sim <- SpectrumSimilarity(as(simplottest[[1]], "data.frame"),
                          as(simplottest[[2]], "data.frame"),
                          top.lab = "itraqdata[['X46']]",
                          bottom.lab = "itraqdata[['X47']]",
                          b = 25)
title(main = paste("Spectrum similarity", round(sim, 3)))

MonoisotopicMass(formula = list(C = 2, O = 1, H=6))
molecule <- getMolecule("C2H5OH")
molecule$exactmass
## x11()
## plot(t(.pepmol$isotopes[[1]]), type = "h")

## x <- IsotopicDistribution(formula = list(C = 2, O = 1, H=6))
## t(molecule$isotopes[[1]])
## par(mfrow = c(2,1))
## plot(t(molecule$isotopes[[1]]), type = "h")
## plot(x[, c(1,3)], type = "h")

## data(myo500)
## masses <- c(147.053, 148.056)
## intensities <- c(93, 5.8)
## molecules <- decomposeIsotopes(masses, intensities)

## experimental eno peptides
exppep <- 
  as.character(fData(qnt[grep("ENO", fData(qnt)[, 2]), ])[, 1]) ## 13
minlength <- min(nchar(exppep))

eno <- download.file("http://www.uniprot.org/uniprot/P00924.fasta", 
                     destfile = "P00924.fasta")
eno <- paste(readLines("P00924.fasta")[-1], collapse = "")
enopep <- Digest(eno, missed = 1)
nrow(enopep) ## 103
sum(nchar(enopep$peptide) >= minlength) ## 68 
pepcnt <- enopep[enopep[, 1] %in% exppep, ]
nrow(pepcnt) ## 13
@ 

The following code chunks demonstrate how to use the \Rpackage{cleaver} package
for in-silico cleavage of polypeptides, e.g. cleaving of
\emph{Gastric juice peptide 1 (P01358)} using \emph{Trypsin}:
<<cleaver, tidy = FALSE>>=
library(cleaver)
cleave("LAAGKVEDSD", enzym = "trypsin")
@

Sometimes cleavage is not perfect and the enzym miss some cleavage positions:

<<cleaver_missing, tidy = FALSE>>=
## miss one cleavage position
cleave("LAAGKVEDSD", enzym = "trypsin", missedCleavages = 1)

## miss zero or one cleavage positions
cleave("LAAGKVEDSD", enzym = "trypsin", missedCleavages = 0:1)
@

<<texshade, echo=TRUE, eval=FALSE>>=
## example code to generate an Texshade image to be 
## included directly in a Latex document or R vignette

## seq1file <- "seq1.tex"
## cat("\\begin{texshade}{Figures/P00924.fasta}
##      \\setsize{numbering}{footnotesize}     
##      \\setsize{residues}{footnotesize}     
##      \\residuesperline*{70}
##      \\shadingmode{functional}
##      \\hideconsensus      
##      \\vsepspace{1mm}
##      \\hidenames
##      \\noblockskip\n", file = seq1file)
## tmp <- sapply(1:nrow(pepcnt), function(i) {
##   col <- ifelse((i %% 2) == 0, "Blue", "RoyalBlue")
##   cat("\\shaderegion{1}{", pepcnt$start[i], "..", pepcnt$stop[i], "}{White}{", col, "}\n", 
##       file = seq1file, append = TRUE)
## })
## cat("\\end{texshade}
##     \\caption{Visualising observed peptides for the Yeast enolase protein. Peptides are shaded in blue and black. The last peptide is a mis-cleavage and overlaps with \\texttt{IEEELGDNAVFAGENFHHGDK}.}
##     \\label{fig:seq}
##   \\end{center}
## \\end{figure}\n\n", 
##     file = seq1file, append = TRUE)
@ %% $

\subsubsection*{$^{15}N$ incorporation}

\begin{figure}[!ht]
<<n15, dev='pdf', echo=TRUE, fig.width=8, fig.height=8, fig.keep='high', out.width='.9\\linewidth', tidy=FALSE>>=
## 15N example
## incorporation rates from 0, 0.1, ..., 0.9, 0.95, 1
incrate <- c(seq(0, 0.9, 0.1), 0.95, 1)
inc <- lapply(incrate, function(inc)               
              IsotopicDistributionN("YEVQGEVFTKPQLWP", inc))
par(mfrow = c(4,3))
for (i in 1:length(inc))
  plot(inc[[i]][, c(1, 3)], xlim = c(1823, 1848), 
       type = "h", 
       main = paste0("15N incorporation at ", incrate[i]*100, "%"))
@ 
\caption{Isotopic envelope for the \texttt{YEVQGEVFTKPQLWP} peptide at different $^{15}N$ incorporation rates. }
\label{fig:n15}
\end{figure}

\clearpage

\subsection{The \Rpackage{isobar} package}

The \Rpackage{isobar} package \cite{Breitwieser2011} provides methods for the statistical
analysis of isobarically tagged MS$^2$ experiments.


%% <<isobarfix, echo=TRUE>>=
%% ## temporary local fix 
%% subsetIBSpectra <- 
%%   function (x, protein = NULL, 
%%             peptide = NULL, direction = "exclude", 
%%             specificity = 
%%             c(REPORTERSPECIFIC, GROUPSPECIFIC, UNSPECIFIC), 
%%             ...) 
%% {
%%     if (is.null(protein)) 
%%         sel.spectra <- spectrumSel(x, peptide = peptide, ...)
%%     else sel.spectra <- spectrumSel(x, 
%%                                     protein = protein, 
%%                                     specificity = specificity, 
%%         ...)
%%     if (direction == "exclude") 
%%         sel.spectra <- !sel.spectra
%%     for (aden in assayDataElementNames(x)) {
%%         assayDataElement(x, aden) <- 
%%           assayDataElement(x, aden)[sel.spectra,             ]
%%     }
%%     ## changed this from as.data.frame
%%     pg.df <- as(proteinGroup(x), "data.frame")
%%     proteinGroup(x) <- 
%%       ProteinGroup(pg.df[pg.df[, "spectrum"] %in% 
%%         fData(x)[sel.spectra, "spectrum"], ])
%%     featureData(x) <- as(fData(x)[sel.spectra, ], 
%%                          "AnnotatedDataFrame")
%%     x
%% }
%% @ 

<<isobar, cache=TRUE, tidy=FALSE>>=
library(isobar)

## Prepare the PXD000001 data for isobar analysis
.ions <- exprs(qnt)
.mass <- matrix(mz(TMT6), nrow(qnt), byrow=TRUE, ncol = 6)
colnames(.ions) <- colnames(.mass) <- 
  reporterTagNames(new("TMT6plexSpectra"))
rownames(.ions) <- rownames(.mass) <- 
  paste(fData(qnt)$accession, fData(qnt)$sequence, sep = ".")
pgtbl <- data.frame(spectrum = rownames(.ions),
                    peptide = fData(qnt)$sequence,
                    modif = ":",
                    start.pos = 1,
                    protein = fData(qnt)$accession,
                    accession = fData(qnt)$accession)
x <- new("TMT6plexSpectra", pgtbl, .ions, .mass)
featureData(x)$proteins <- as.character(fData(qnt)$accession)

x <- correctIsotopeImpurities(x) ## using identity matrix here
x <- normalize(x, per.file = FALSE) 
## spikes
spks <- c(protein.g(proteinGroup(x), "P00489"),
          protein.g(proteinGroup(x), "P00924"),
          protein.g(proteinGroup(x), "P02769"),
          protein.g(proteinGroup(x), "P62894"))

cls2 <- rep("#00000040", nrow(x))
pch2 <- rep(1, nrow(x))
cls2[grep("P02769", featureNames(x))] <- "gold4" ## BSA
cls2[grep("P00924", featureNames(x))] <- "dodgerblue" ## ENO
cls2[grep("P62894", featureNames(x))] <- "springgreen4" ## CYT
cls2[grep("P00489", featureNames(x))] <- "darkorchid2" ## PHO
pch2[grep("P02769", featureNames(x))] <- 19
pch2[grep("P00924", featureNames(x))] <- 19
pch2[grep("P62894", featureNames(x))] <- 19
pch2[grep("P00489", featureNames(x))] <- 19

nm <- NoiseModel(x)
ib.background <- subsetIBSpectra(x, protein=spks, 
                                 direction = "exclude")
nm.background <- NoiseModel(ib.background)
ib.spks <- subsetIBSpectra(x, protein = spks,
                           direction="include",
                           specificity="reporter-specific")
nm.spks <- NoiseModel(ib.spks, one.to.one=FALSE, pool=TRUE)

ratios <- 10^estimateRatio(x, nm,
                           channel1="127", channel2="129",
                           protein = spks,
                           combine = FALSE)[, "lratio"]

res <- estimateRatio(x, nm,
                     channel1="127", channel2="129",
                     protein = unique(fData(x)$proteins), 
                     combine = FALSE,
                     sign.level = 0.01)[, c(1, 2, 6, 8)]
res <- as.data.frame(res)
res$lratio <- -(res$lratio)

cls3 <- rep("#00000050", nrow(res))
pch3 <- rep(1, nrow(res))
cls3[grep("P02769", rownames(res))] <- "gold4" ## BSA
cls3[grep("P00924", rownames(res))] <- "dodgerblue" ## ENO
cls3[grep("P62894", rownames(res))] <- "springgreen4" ## CYT
cls3[grep("P00489", rownames(res))] <- "darkorchid2" ## PHO
pch3[grep("P02769", rownames(res))] <- 19
pch3[grep("P00924", rownames(res))] <- 19
pch3[grep("P62894", rownames(res))] <- 19
pch3[grep("P00489", rownames(res))] <- 19

rat.exp <- c(PHO = 2/2, 
             ENO = 5/1, 
             BSA = 2.5/10, 
             CYT = 1/1)
@ %% $


\begin{figure}[!ht]
<<ibplot, dev='pdf', echo=TRUE, fig.width=10, fig.height=5, fig.keep='last', out.width='1\\linewidth'>>=
par(mfrow = c(1,2))
maplot(x,
       noise.model = c(nm.background, nm.spks, nm),
       channel1="127", channel2="129",
       pch = 19, col = cls2,
       main = "Spectra MA plot")
abline(h = 1, lty = "dashed", col = "grey")
legend("topright", 
       c("BSA", "ENO", "CYT", "PHO"),
       pch = 19, col = c("gold4", "dodgerblue", 
                   "springgreen4", "darkorchid2"),
       bty = "n", cex = .7) 
plot(res$lratio, -log10(res$p.value.rat), col = cls3, pch = pch3,
     xlab = expression(log[10]~fold-change),
     ylab = expression(-log[10]~p-value),
     main = "Protein volcano plot", 
     xlim = c(-0.7, 0.7))
grid()
abline(h = -log10(0.01), lty = "dotted")
abline(v = log10(c(2, 0.5)), lty = "dotted")
abline(v = -0.003, col = "springgreen4", lty = "dashed", lwd = 2)
abline(v = 0.003, col = "darkorchid2", lty = "dashed", lwd = 2)
abline(v = log10(5), col = "dodgerblue", lty = "dashed", lwd = 2)
abline(v = log10(0.25), col = "gold4", lty = "dashed", lwd = 2)
points(res[spks, "lratio"], 
       -log10(res[spks, "p.value.rat"]), 
       col = c("darkorchid2", "dodgerblue", 
         "gold4", "springgreen4"),
       pch = 19)
@ 
\caption{Result from the \Rpackage{isobar} pipeline. }
\label{fig:ibplot}
\end{figure}

\clearpage

\subsection{The \Rpackage{synapter} package}

The \Rpackage{synapter} \cite{synapter} package comes with a detailed vignette that describes how to prepare the MS$^E$ data and then process it in \R. Several interfaces are available provided the user with maximum control, easy batch processing capabilities or a graphical user interface. The conversion into \Robject{MSnSet} instances and filter and combination thereof as well as statistical analysis are also described. 

<<synapter, eval=FALSE>>=
## open the synapter vignette
library("synapter")
synapterGuide()
@ 

\section{MS$^2$ spectra identification}

A recent addition to Bioconductor 2.12 is the \Rpackage{rTANDEM} package, 
that provides a direct interface to the X!Tandem software \cite{Craig2004}. 
A typical \Rpackage{rTANDEM} pipeline comprises

\begin{enumerate}
\item Prepare the input data.
\item Run the search.
\item Import the search results and extract the peptides and proteins
\end{enumerate}

Using example code/data from the \Rpackage{rTANDEM} vignette/package, these steps are executed as described below.

\subsection{Preparation of the input data}

<<rtandem1, tidy = FALSE>>=
library(rTANDEM)
taxonomy <- rTTaxo(taxon = "yeast",
                   format = "peptide", 
                   URL = system.file(
                     "extdata/fasta/scd.fasta.pro", 
                     package="rTANDEM"))
param <- rTParam()
param <- setParamValue(param, 
                       'protein', 'taxon', 
                       value="yeast")
param <- setParamValue(param, 'list path', 
                       'taxonomy information', taxonomy)
param <- setParamValue(param, 
                       'list path', 'default parameters',
                       value = system.file(
                         "extdata/default_input.xml",
                         package="rTANDEM"))
param <- setParamValue(param, 'spectrum', 'path',
                       value = system.file(
                         "extdata/test_spectra.mgf", 
                         package="rTANDEM"))
param <- setParamValue(param, 'output', 'xsl path',
                       value = system.file(
                         "extdata/tandem-input-style.xsl", 
                         package="rTANDEM"))
param <- setParamValue(param, 'output', 'path',
                       value = paste(getwd(), 
                         "output.xml", sep="/"))
@ 


\subsection{Performing the search}

The analysis is run using the \Rfunction{tandem} function (see also the \Rfunction{rtandem} function), 
which returns the results data file path (only the file name is displayed below).

<<rtanem2>>=
resultPath <- tandem(param)
basename(resultPath)
@ 

\subsection{Import and analyse results}

<<rtandem3>>=
res <- GetResultsFromXML(resultPath)
## the inferred proteins
proteins <- GetProteins(res, 
                        log.expect = -1.3, 
                        min.peptides = 2)
proteins[, -(4:5), with = FALSE]
## the identified peptides for YFR053C
peptides <- GetPeptides(protein.uid = 1811,
                        results = res,
                        expect = 0.05)
peptides[, c(1:4, 9, 10:16), with = FALSE]
@ 

More details are provided in the vignette available with \Rfunction(vignette("rTANDEM")), 
for instance the extraction of degenerated peptides, i.e. peptides found in multiple proteins.

\section{Annotation}

In this section, we briefly present some Bioconductor annotation infrastructure. 

We start with the \Rpackage{hpar} package, an interface to the \textit{Human Protein Atlas} 
\cite{Uhlen2005, Uhlen2010}, to retrieve subcellular localisation information for the 
\texttt{ENSG00000002746} ensemble gene.

<<annot1, cache=FALSE>>=
id <- "ENSG00000002746"
library("hpar")
getHpa(id, "SubcellularLoc")
@

Below, we make use of the human annotation package \Rpackage{org.Hs.eg.db} and 
the Gene Ontology annotation package \Rpackage{GO.db} to retrieve the same 
information as above.

<<annot2, cache=FALSE, warning = FALSE>>=
library(org.Hs.eg.db)
library(GO.db)
ans <- select(org.Hs.eg.db, 
              keys = id, cols = c("ENSEMBL", "GO", "ONTOLOGY"), 
              keytype = "ENSEMBL")
ans <- ans[ans$ONTOLOGY == "CC", ]
ans
sapply(as.list(GOTERM[ans$GO]), slot, "Term")
@ 

Finally, this information can also be retrieved from on-line databases using the 
\Rpackage{biomaRt} package \cite{Durinck2005}.

<<annot3, cache=FALSE>>=
library("biomaRt")
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
efilter <- "ensembl_gene_id"
eattr <- c("go_id", "name_1006", "namespace_1003")
bmres <- getBM(attributes=eattr, filters = efilter, values = id, mart = ensembl)
bmres[bmres$namespace_1003 == "cellular_component", "name_1006"]
@ %% $

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newpage

\section{Other packages}\label{sec:packages}

\subsection{Bioconductor packages}

<<gist4493747, echo=FALSE, warning = FALSE, cache = TRUE>>=
biocv <- 2.13

getPackagesInBiocView <- function(view, 
                                  rep = c("BioCsoft", "BioCann",
                                      "BioCexp", "BioCextra"),
                                  biocVersion = biocv) {
    suppressPackageStartupMessages(require("biocViews"))
    data(biocViewsVocab)
    rep <- match.arg(rep)
    biocMirror <- getOption("BioC_mirror", "http://bioconductor.org")
    biocPaths <- switch(rep,         
                        BioCsoft = "bioc",
                        BioCann = "data/annotation", 
                        BioCexp = "data/experiment",
                        BioCextra = "extra")   
    rep <- paste(biocMirror,
                 "packages",
                 biocVersion,
                 biocPaths, 
                 sep = "/") 
    bv <- getBiocViews(rep, biocViewsVocab, "NoViewProvided")
    if (!view %in% names(bv)) {
        warning("BiocView ", view, " not found.")
        return(NULL)
    }  
    return(bv[[view]])
}

pk <- list(Proteomics = getPackagesInBiocView("Proteomics"),
           MassSpectrometryData = getPackagesInBiocView("MassSpectrometryData", rep = "BioCexp"),
           MassSpectrometry = getPackagesInBiocView("MassSpectrometry"))

.makePackageTable <- function(x, nsub = TRUE) {
  Package <- sapply(x@packageList, function(x) x@Package)
  Title <- sapply(x@packageList, function(x) x@Title)
  if (nsub)
    Title <- sub("\n", " ", Title)
  return(data.frame(Package, Title))
}

pkTab <- lapply(pk, .makePackageTable)
@ 

This section provides a complete list of packages available in the relevant Bioconductor version \Sexpr{biocv} 
(as of \today) \textit{biocView}\footnote{\url{http://www.bioconductor.org/packages/devel/BiocViews.html}} categories. 
Tables \ref{tab:prot}, \ref{tab:ms} and \ref{tab:msdata} represent the packages for the 
\texttt{Proteomics} (\Sexpr{nrow(pkTab[["Proteomics"]])} packages), 
\texttt{MassSpectrometry} (\Sexpr{nrow(pkTab[["MassSpectrometry"]])} packages) and
\texttt{MassSpectrometryData} (\Sexpr{nrow(pkTab[["MassSpectrometryData"]])} experiment packages)
categories.

<<protstab, echo=FALSE, results='asis'>>=
print(xtable(pkTab[["Proteomics"]],
             caption = "Packages available under the \\texttt{Proteomics} \\textit{biocViews} category.",
             table.placement = "thb",
             align = c("llL"),
             label = "tab:prot"),
      include.rownames = FALSE,
      tabular.environment = "tabulary",       
      width = "1\\textwidth",
      size = "scriptsize")
@ 

<<mstab, echo=FALSE, results='asis'>>=
print(xtable(pkTab[["MassSpectrometry"]],
             caption = "Packages available under the \\texttt{MassSpectrometry} \\textit{biocViews} category.",
             table.placement = "thb",
             label = "tab:ms"),
      include.rownames = FALSE,
      size = "scriptsize")
@ 

<<msdatatab, echo=FALSE, results='asis'>>=
print(xtable(pkTab[["MassSpectrometryData"]],
             caption = "Experimental Packages available under the \\texttt{MassSpectrometryData} \\textit{biocViews} category.",
             table.placement = "thb",
             label = "tab:msdata"),
      include.rownames = FALSE,
      size = "scriptsize")
@ 


<<cntCRAN, echo=FALSE>>=
X <- readLines("http://cran.r-project.org/web/views/ChemPhys.html")
x2 <- grep("Related links:", X)
x1 <- grep("CRAN packages:", X)
np <- length(grep("../packages/", X[x1:x2]))
@ 

\subsection{The Chemometrics and Computational Physics CRAN Task View}

The CRAN task view on Chemometrics and Computational 
Physics\footnote{\url{http://cran.r-project.org/web/views/ChemPhys.html}}
lists \Sexpr{np} packages, including a set of packages for 
mass spectrometry and proteomics, some of which are illustrated in this document. 
The most relevant (non Bioconductor) packages are summarised below.

\begin{description}
  \item[MALDIquant] 
    provides tools for quantitative analysis of MALDI-TOF mass spectrometry data, 
    with support for baseline correction, peak detection and plotting of 
    mass spectra \\
    (\url{http://cran.r-project.org/web/packages/MALDIquant/index.html}).
  \item[OrgMassSpecR] 
    is for organic/biological mass spectrometry, with a focus on graphical display, 
    quantification using stable isotope dilution, and protein hydrogen/deuterium exchange 
    experiments \\
    (\url{http://cran.r-project.org/web/packages/OrgMassSpecR/index.html}).
  \item[FTICRMS] 
    provides functions for Analyzing Fourier Transform-Ion Cyclotron 
    Resonance Mass Spectrometry Data \\
    (\url{http://cran.r-project.org/web/packages/FTICRMS/index.html}).
  \item[titan]
    provides a GUI to analyze mass spectrometric data on the relative abundance 
    of two substances from a titration series \\
    (\url{http://cran.r-project.org/web/packages/titan/index.html}).
\end{description}

\subsection{Other CRAN packages}

Finally, \Rpackage{digeR}\footnote{\url{http://cran.r-project.org/web/packages/digeR/index.html}}, 
which is available on CRAN but not listed in the Chemometrics and Computational Physics Task View, 
provides a GUI interface for analysing 2D DIGE data. It allows to 
perform correlation analysis, score plot, classification, 
feature selection and power analysis for 2D DIGE experiment data.

\bigskip

Suggestions for additional \R packages are welcome and will be added to the vignette. 
Please send suggestions and possibly a short description and/or a example utilisation with code 
to \url{lg390@cam.ac.uk}. The only requirement is that the package must be available 
on an official package channel (CRAN, Bioconductor, R-forge, Omegahat), 
i.e. not only available through a personal web page. 

\clearpage

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Section
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Session information}\label{sec:sessionInfo} 

All software and respective versions used in this document, as returned by \Rfunction{sessionInfo()} are detailed below.

<<sessioninfo, results='asis', echo=FALSE, echo=FALSE>>=
toLatex(sessionInfo(), locale = FALSE)
@

\bibliographystyle{abbrv}
\bibliography{RforProteomics}

\end{document}