%\VignetteIndexEntry{HOWTO PROcess}
%\VignettePackage{PROcess}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\bibliographystyle{plainnat}
%\bibliographystyle{apalike}

\begin{document}

@
\title{HowTo Use the Bioconductor PROcess package}
\author{}
\maketitle
\tableofcontents

% library(tools)
% Rnwfile<- file.path("~/PROcess/inst/doc","howtoprocess.Rnw")
% Sweave(Rnwfile,pdf=TRUE,eps=TRUE,stylepath=TRUE,driver=RweaveLatex())

\section{Introduction}
The {\tt PROcess} package contains a collection of functions for 
processing spectra to remove baseline drifts if any, detect
peaks and align them to a set of protobiomarkers. This document 
serves as a quick tutorial for using the {\tt PROcess} package.

\section{Baseline subtraction}
Our first observation of a raw spectrum is that it exhibits
elevated baseline, more so at smaller m/z values than at larger
values. This elevated baseline
is mostly caused by the chemical noises in the EAM and ion
overload. Ideally a spectrum should rest more
or less on the zero horizontal line.  This baseline needs to
be subtracted from each raw spectrum. The following example 
shows the result of a spectrum with its baseline removed. 

<<bslnoffSingle,fig=T>>=
library(PROcess)
fdat <- system.file("Test", package="PROcess")
fs <- list.files(fdat, pattern="\\.*csv\\.*", full.names=TRUE)
f1 <- read.files(fs[1])
fcut <- f1[f1[,1]>0,]
bseoff <-bslnoff(fcut,method="loess",plot=TRUE, bw=0.1)
title(basename(fs[1]))
@
\section{Peak detection}
After baseline is removed, peaks can be located by using
{\tt isPeak}. A spectrum is smoothed first using moving 
average of the $k$ nearest neighbours. Smoothing helps 
to enhance peaks and get rid of spurious peaks. However, we 
do not recommend large amount of smoothing (controlled by
parameter {\tt sm.span}) in this step because we do not 
wish to 
smooth away too many short and wide peaks 
and also we need the precision in peak locations. As a first step 
we do not mind getting more potential features. 

<<isPeakSingle,fig=T>>=
pkgobj <- isPeak(bseoff,span=81,sm.span=11,plot=TRUE)
@

We can also zoom in to inspect peaks in a particular range of 
m/z values.
<<specZoom,fig=T>>=
specZoom(pkgobj, xlim=c(5000,10000))
@

\section{Batch operation}
We demonstrate the batch functionality of this package using
a set of 2 spectra.
\subsection{Apply baseline subtraction to a set of spectra}
<<rmBaselineBatch>>=
testdir <- system.file("Test", package = "PROcess")
testM <- rmBaseline(testdir)
@

\subsection{Renormalize spectra}
Suppose we want to normalize a set of spectra to their median 
AUC (Area Under the Curve), where an AUC is calculated for
m/z values greater than a cutoff point, 1500.

<<renorm>>=
rtM <- renorm(testM, cutoff=1500)
@

\subsection{Identify peaks of spectra}
<<getPeaksBatch>>=
peakfile <- paste(tempdir(), "testpeakinfo.csv", sep = "/")
getPeaks(rtM, peakfile)
@

\subsection{Quality assessment}
Quality assessment is necessary because some spectra are very 
noisy and have hardly any peaks. Function {\tt quality} 
computes three parameters {\tt Quality}, {\tt Retain} and 
{\tt peak} for assessing a set of spectra.
<<QC>>=
qualRes <- quality(testM, peakfile, cutoff=1500)
print(qualRes)
@

A spectrum is deemed of poor quality and should be removed from
subsequent analyses if it meets the following 3
conditions simultaneously:
\begin{enumerate}
\item $\mbox{\tt Quality}\, <\, 0.4$;
\item $\mbox{\tt Retain}\, <\, 0.1 $;
\item $\mbox{\tt peak} \,< \,1/2 \mbox{ of the mean peak number in the
chip}$.
\end{enumerate}


\subsection{Get protobiomarkers}
One challenge in MS data is that not only they exhibit
variation vertically but also do they horizontally. This
horizontal variation is not simply a constant shift but
associated with value of m/z. Currently the 
accuracy in the m/z position is believed to be within $0.3\%$ o
f the m/z value. Once the peaks are detected, we align the
peaks by first generating an interval of size $0.3\%$ of 
the m/z value which centers at m/z for each m/z where a peak is 
detected. We treat those m/z
intervals as interval censored data.  We treat those intervals as a 
partially ordered set and
 use the locations of the maximal cliques to define the
 locations of the peaks \cite{Gent:Van:2001}.
 We call these aligned peaks (across spectra) proto-biomarkers
 and use the centers of the resulting intervals as the
 locations of the aligned peaks. For each spectrum we
 determine which actual peaks are represented by an aligned
peak (proto-biomarker) and use the maximum of those as the
 height of the proto-biomarker. If there were no peaks then
we use the maximum value within the resulting interval.
 

<<pk2bmkr,fig=T>>=
bmkfile <- paste(tempdir(), "testbiomarker.csv", sep = "/")
testBio <- pk2bmkr(peakfile, rtM, bmkfile)
mzs <- as.numeric(rownames(rtM))
matplot(mzs, rtM, type = "l", xlim = c(1000, 10000),
ylab="intensities", main="proto-biomarkers")
bks <- getMzs(testBio)
print(round(bks))
abline(v = bks, col = "green")
@
\section{An alternative way to obtain proto-biomarkers}
\cite{baggMean} propose an alternative way for peak
detection using the average spectrum of all spectra of a given
experiment. Their algorithm is comprised of the following steps,
(1) compute the mean of all raw spectra, (2) de-noise, baseline
correct and find peaks by locating all local maxima in the mean
spectrum, and (3) quantify the identified peaks in the
individual spectra. The major advantage of this approach is the 
simplicity and the speed to arrive at a set of proto-biomarkers in 
comparison to the approach described in the previous sections to 
select peaks from individual spectra.

In {\tt PROcess} we adopt the general idea of using mean
spectrum to locate peaks, but leave the decision to users whether
they should compute the mean spectrum of the raw spectra, or of
the smoothed (de-noised), baseline corrected and normalized spectra
for their experiment at hand. If the steps of baseline-subtraction
and normalization are done properly, there may be improvement
in peak detection using the mean spectrum computed from the
pre-processed spectra.

Our approach comprises the following steps.
\begin{itemize}
\item Compute the mean of all raw spectra using {\tt aveSpec},
or the mean of all pre-processed spectra using standard R function
{\tt rowMeans}.
\item Detect peaks of the mean spectrum by {\tt isPeak}.
\item Align peaks by {\tt align} if there seems to be peak
      clusters.
\item Quantify peaks in individual spectra that have been smoothed,
      baseline-removed and normalized, by {\tt getPeaks2} that
      locates for each peak the maximum intensity in a neighbourhood
      of the peak defined by a user-specified precision of peaks.
\end{itemize}

We now demonstrate this approach to the test data set using
{\tt PROcess}.  

We execute the following code to compute the mean spectrum.
<<overallMean>>=
grandAve <- aveSpec(fs)
mzs <- grandAve[,1]
@

Should you wish to compute the mean spectrum of the pre-processed
spectra, you can do {\tt rowMeans(rtM)} instead, then skip
the baseline correction step and proceed
to the peak detection step using {\tt isPeak} . 

We execute the following code to remove baseline, detect
peaks in the overall mean spectrum and quantify them in individual
spectra.

<<bslnoffRawMean,fig=T>>=
grandOff <- bslnoff(grandAve[mzs>0,], method="loess",
        plot=T, bw=0.1)
@
<<RawMeanPeak,fig=T>>=
grandPkg <- isPeak(grandOff[grandOff[,1]>1500,], zerothrsh=1,
plot=T, ratio=0.1)
grandpvec <- round(grandPkg[grandPkg$peak, "mz"])
print(as.vector(grandpvec))
@

We can then quantify the peaks by running the following code.
<<getPeaks2>>=
grandBmk <- getPeaks2(rtM, grandpvec)
@

\bibliography{howtoprocess}

\end{document}