%\VignetteIndexEntry{An introduction to PLGEM}
%\VignetteKeywords{Error Model, Microarray, Proteomics}
%\VignetteDepends{plgem}
%\VignettePackage{plgem}

\documentclass[a4paper]{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\title{An introduction to PLGEM}
\author{Mattia Pelizzola and Norman Pavelka}

\begin{document}

\maketitle

\section{Introduction}

This document serves as a brief tutorial to the \textbf{Power Law Global Error
Model} (\textbf{PLGEM}) analysis method \cite{bmc}. \textbf{PLGEM} has so far
been shown to faithfully model the variance-versus-mean dependence that exists
in a wide variety of genome-wide data sets, including microarray \cite{bmc}
and proteomics data \cite{mcp}. The use of \textbf{PLGEM} has furthermore been
shown to improve the detection of differentially expressed genes or proteins in
these datasets \cite{bmc, mcp}.

\section{Running \textbf{PLGEM} in \emph{wrapper} mode}

A wrapper function (called \Rfunction{run.plgem}) is provided in the package,
which performs all the necessary steps to obtain a list of differentially
expressed genes or proteins (DEG), starting from a dataset of class
\Rclass{ExpressionSet}.

This input dataset is expected to contain either normalized gene expression
values obtained from one-channel microarrays (such as Affymetrix GeneChip
\cite{bmc}), or normalized spectral counts obtained from mass spectrometry-based
proteomics methods (such as MudPIT \cite{mcp}).

The wrapper automatically attempts to find the best solution at each step, and
requires only modest or no input decisions by the user. For didactic purposes,
we will use here a subset of the microarray dataset used in the original
publication about \textbf{PLGEM}, containing two replicates of LPS-stimulated
dendritic cells (`LPS') and four replicates of untreated dendritic cells (`C'):

<<PLGEMwrapper, echo=TRUE, results=hide>>=
library(plgem)
data(LPSeset)
set.seed(123)
LPSdegList <- run.plgem(esdata=LPSeset)
@

The above obtained object \Robject{LPSdegList} will contain the list of genes or
proteins, selected as significantly changing between experimental condition
`LPS' and baseline condition `C', at the default significance level 0.001.

\section{Running \textbf{PLGEM} in \emph{step-by-step} mode}

To provide advanced users with a higher control on the inner workings of the
\textbf{PLGEM} pipeline, the individual functions called by
\Rfunction{run.plgem} are also described in this tutorial. We will use them now,
step by step.

\subsection{Fitting of the model to a data set}

The first step is to fit the model to the microarray or proteomics dataset. By
fitting the model we will obtain a mathematical relationship that will allow us
to determine the expected standard deviation associated to a given average gene
expression value or average protein abundance level.

\textbf{PLGEM} can only be fitted on a set of replicates of a same experimental
condition, therefore we first need to choose which condition to use for the
fitting step. In our dataset two conditions are provided: `C' and `LPS'. Usually
the most replicated one is chosen, in this example we will therefore choose the
first condition `C' (i.e. \Rfunarg{fitCondition="C"}), because it contains four
replicates. Technically, we may have decided to fit the model also on the two
`LPS' samples, as two replicates are required and sufficient to fit a PLGEM. But
having 3 or more replicates improves the fitting and is usually recommended
\cite{mcp}.

Moreover, we can can change the default values of parameters \Rfunarg{p} and
\Rfunarg{q}. Briefly, \Rfunarg{p} represents the number of intervals (or bins)
used to partition the expression value range of the dataset. We observed that
\Rfunarg{p} can be modified over a wide range of values, without any major
effects on the final results, except when it was chosen to close to the total
number of genes or proteins in the dataset \cite{bmc}. As a rule of thumb,
\Rfunarg{p} should be no more than one tenth of the number of genes or proteins.
The default of 10 should therefore be appropriate for most microarray
experiments, but could be set lower for proteomics data where less than 100
proteins were identified.

\Rfunarg{q} is the quantile of the location-dependent spread used to fit the
model. The default of \Rfunarg{q} is set to 0.5, because this represents the
median value, which is what you are looking for when modeling the variability.
We recommend to only modify this parameter for very special purposes, e.g. for
the determination of empirical confidence intervals of standard deviation.

Moreover it is possible to evaluate the fitting of the model setting the option
\Rfunarg{fittingEval}. If this argument is set to \Rfunarg{TRUE}, a multi-panel
plot is produced where the residuals of the model are evaluated. A good fit is
characterized by a near-normal distribution and an horizontal symmetric plot of
the ranked residuals.

Finally, setting \Rfunarg{plot.file=TRUE} saves above diagnostic plot to a png
file instead of the default device, therefore we won't change its default.

\begin{center}
<<modelFitting,fig=TRUE,echo=TRUE>>=
LPSfit <- plgem.fit(data=LPSeset, covariate=1, fitCondition='C', p=10, q=0.5,
  plot.file=FALSE, fittingEval=TRUE, verbose=TRUE)
@
\end{center}

\subsection{Computation of observed signal-to-noise ratios}

The next step is the computation of the signal-to-noise ratio (STN) statistics
for the detection of differential expression. The STN is determined using the
model-derived spread estimates instead of the data-derived ones.
Therefore it is necessary to give to the \Rfunction{plgem.obsStn} function the
model parameters \Rfunarg{slope} and \Rfunarg{intercept} determined during the
model fitting that are contained in the value returned by \Rfunction{plgem.fit}
function. By default, all experimental conditions (according to the values of
the \Rfunarg{covariate} defined in the \Rclass{phenoData} slot of
the \Rclass{ExpressionSet}) are compared to the first condition (baseline). If
the condition to be treated as the baseline is not the first one, we can change
this by modifying the argument \Rfunarg{baselineCondition}. A matrix of
observed STN is determined, where the number of rows are the number of genes or
proteins in the dataset and the number of columns are the number of comparisons
that can be performed in the dataset (i.e. the total number of experimental
conditions minus one). In this case, the dataset contains only one condition to
be compared to the baseline, therefore the matrix will be a one-dimensional
array of observed PLGEM-STN values.

<<observedSTN, echo=TRUE>>=
LPSobsStn <- plgem.obsStn(data=LPSeset, covariate=1, baselineCondition=1,
  plgemFit=LPSfit, verbose=TRUE)
@

\subsection{Computation of resampled signal-to-noise ratios}

In order to get an estimate of the distribution of the test statistic under the
null hypothesis of no differential expression, a resampled statistic is
determined using the method described in the paper \cite{bmc}. The number of
iterations of the resampling step should be correlated with the total number of
replicates that are present in the data set. If this argument is set to
\Rfunarg{"automatic"}, the number of iterations is automatically determined
based on the total number of possible combinations. In this case, an upper
threshold of 500 iterations is set to avoid excessive computation time. This
should be fine for most purposes.

<<resampledSTN, echo=TRUE>>=
set.seed(123)
LPSresampledStn <- plgem.resampledStn(data=LPSeset, plgemFit=LPSfit,
  iterations="automatic", verbose=TRUE)
@

\subsection{Computation of p-values}

Next, p-values are calculated for each observed STN value via a call to function
\Rfunction{plgem.pValue}. Resampled STN values are also required by this
function, because they will be used to build empirical cumulative distribution
functions of the STN values that can be observed under the null hypothesis:

<<Pvalues, echo=TRUE>>=
LPSpValues <- plgem.pValue(observedStn=LPSobsStn,
  plgemResampledStn=LPSresampledStn, verbose=TRUE)
head(LPSpValues)
@

\subsection{Detection of differentially expressed genes or proteins (DEG)}

Finally, DEG are selected at the given significance level \Rfunarg{delta} via a
call to function \Rfunction{plgem.deg}. The chosen value of \Rfunarg{delta} can
be seen as an estimate of the False Positive Rate (FPR). Therefore, in case of a
microarray dataset with 10,000 genes of which not a single one is truly
differentially expressed, choosing \Rfunarg{delta=0.001} will select
roughly 10 genes by chance:

<<DEGselection,echo=TRUE>>=
LPSdegList <- plgem.deg(observedStn=LPSobsStn, plgemPval=LPSpValues,
  delta=0.001, verbose=TRUE)
head(LPSdegList$significant[["0.001"]][["LPS"]])
@

Above function returns a list with a number of items that is equal to the number
of different significance levels \Rfunarg{delta} used as input. In this case
the default single value of 0.001 was used, so the list will contain only one
item at this level. This item is again a list, whose number of items correspond
to the number of performed comparisons, i.e. the number of conditions in the
starting \Rclass{ExpressionSet} minus the baseline, in this case again only one.
In each list-item the values are the observed STN and the names are the IDs of
the significantly changing genes or proteins, as defined in the
\Rclass{ExpressionSet}.

Finally, the obtained list of DEG can be written to the current working
directory using the \Rfunction{plgem.write.summary} function.

\begin{thebibliography}{}

\bibitem[1]{bmc}
Pavelka, N.,~Pelizzola, M.,~Vizzardelli, C.,~Capozzoli, M.,~Splendiani, A.,~
  Granucci, F. and P. Ricciardi-Castagnoli (2004).
\newblock A power law global error model for the identification of
  differentially expressed genes in microarray data.
\newblock \emph{BMC Bioinformatics}, 5:203.

\bibitem[2]{mcp}
Pavelka, N.,~Fournier, M., L.,~Swanson, S., K.,~Pelizzola, M.,~
  Ricciardi-Castagnoli, P.,~Florens, L. and M. P. Washburn (2008).
\newblock Statistical similarities between transcriptomics and quantitative
  shotgun proteomics data.
\newblock \emph{Molecular and Cellular Proteomics}, 7(4):631-44.

\end{thebibliography}

\end{document}