%\VignetteIndexEntry{Structured Analysis of Microarrays}
%\VignetteDepends{golubEsets, hu6800.db}
%\VignetteKeywords{classification microarrays complex phenotypes}
%\VignettePackage{stam}
\documentclass[11pt,a4paper]{report}
\usepackage{compdiag}
\usepackage{amsmath,a4}
\SweaveOpts{eps=false}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\title{Structured Analysis of Microarrays \bigskip \\
  User's Guide to the Bioconductor package \Rpackage{stam}}
\author{Claudio Lottaz\footnote{Corresponding author:
    \texttt{claudio.lottaz@molgen.mpg.de}} and Rainer Spang \bigskip \\
  \small Max Planck Institute for Molecular Genetics \\
  Computational Diagnostics, Department of Computational Molecular Biology \\
  Ihnestr. 63-73, D-14195 Berlin, Germany}
\reportnr{03}
\year{2004}

\abstract{This is the vignette of the Bioconductor compliant package
  \Rpackage{stam}. We describe our implementation of {\it structured
  analysis of microarray data} for decomposing complex clinical phenotypes
  in gene expression profiling studies.}
\date{}

\begin{document}
\maketitle

<<echo=FALSE, results=hide>>=
library(golubEsets)
oldopt <- options(digits=3)
on.exit( {options(oldopt)} )
options(width=70)
if (interactive()) {
  options(error=recover)
}
set.seed(123)
@

\chapter{Introduction}

In a clinical context, microarray data can be used in a straightforward
manner for diagnostic tasks. To this end, standard classification methods
developed in the machine learning community have been suggested and applied
in clinical research to distinguish between clinically relevant phenotypes.
However, it is commonly accepted that such phenotypes are often complex,
i.e., they can arise from different underlying causes. We therefore expect
to find heterogeneous gene expression profiles in patients with a
homogeneous clinical phenotype.

In order to decompose complex phenotypes, we suggest a procedure we call
{\it structured analysis of microarrays}. In this procedure we train many
classifiers to recognize the same complex phenotype. Each classifier
focuses on a single biological aspect by extracting only the expression
data of the corresponding genes. Many of these classifiers will be unable
to recognize the phenotype of interest and are thus discarded. Others will
recognize a subset of patients, and some of them will hopefully recognize
all patients. {\it Structured analysis of microarrays} uses classifiers
with high specificity, which we call molecular symptoms, to further
stratify patients on a molecular level. Patterns of absence and presence of
molecular symptoms identify particular groups of patients. Due to the
biological focus of our classifiers, molecular symptoms always have a
biological meaning.

The Bioconductor compliant R package \Rpackage{stam} described in this
document implements {\it structured analysis of microarrays} based on the
Gene Ontology \cite{GO00}. GO terms attributed to leaf nodes of the GO
graph are used as biological foci for potential molecular symptoms. The
hierarchy of the Gene Ontology is used to compute more general classifiers
by weighted averaging, whereby good classifiers obtain higher weights. A
shrinkage process on these weights, similar to the one used in PAM
\cite{Tibshirani02a} to select genes, restricts the graph of classifiers in
a cross-validation setting to the branches linked to the phenotype.
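To make the averaging and shrinkage idea more concrete, the following is a
rough sketch; the exact formulas used by \Rpackage{stam} may differ in
detail. An inner node $v$ combines the class probabilities of its child
classifiers by a weighted average, and the weights are soft-thresholded by
the graph shrinkage level $\Delta$, so that uninformative branches drop out
of the classifier graph:
\begin{align*}
  P_v(c \mid x) &= \sum_{u \in \mathrm{children}(v)} \tilde{w}_u \, P_u(c \mid x),
  &
  \tilde{w}_u &= \frac{\max(w_u - \Delta,\, 0)}{\sum_{u'} \max(w_{u'} - \Delta,\, 0)},
\end{align*}
where $w_u$ reflects how well the classifier in node $u$ recognizes the
phenotype of interest, so that good classifiers obtain higher weights.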
The remainder of this document describes the usage of the \Rpackage{stam}
package. \Rpackage{stam} relies on Bioconductor meta data packages to
implement structured training of classifiers (Section \ref{sec:training})
and structured prediction (Section \ref{sec:predict}). For further
convenience we have implemented a complete structured evaluation of a data
set (Section \ref{sec:evaluate}) and a web-based possibility to explore
some crucial parameters of such an analysis interactively
(Section \ref{sec:server}).

\chapter{Training Classifiers}
\label{sec:training}

The training of a structured classifier for a phenotype in microarray data
consists of the following steps:
\begin{enumerate}
\item Generate a classifier network according to the chip's annotations and
  the GO hierarchy.
\item Perform cross-validated PAM fits in each leaf node to determine
  adequate PAM shrinkage levels and compute performance for several graph
  shrinkage candidates.
\item Compute a single structured model using the best graph shrinkage
  level, either chosen by the user or determined automatically according to
  some performance criterion calculated in the cross-validation step.
\end{enumerate}
Before starting any analysis you have to load the \Rpackage{stam} package
in your R session as follows:
<<>>=
library(stam)
@
To illustrate the usage of \Rpackage{stam} we use the Golub data set on
acute leukemia \cite{Golub99} as it is stored in the \Rpackage{golubEsets}
data package. For preparing the data, issue the following commands:
<<>>=
library(golubEsets)
data(Golub_Merge)
golubTrain <- Golub_Merge[,1:38]
golubTest <- Golub_Merge[,39:72]
@

\section{Generate a raw classifier graph}

The raw classifier graph can be generated by the function
\Rfunction{stam.net}. This function needs the name of the chip used in the
study to be analyzed as well as the root node to be used for the classifier
graph. The function returns an object of class \Rclass{stamNet}.
<<>>=
net <- stam.net(chip="hu6800", root="GO:0005576",
                probes=rownames(exprs(Golub_Merge)))
@
The string given as the chip's name is used to load the annotation data.
Thus \Rpackage{stam} expects a package of the same name to be installed,
where it looks for the \Robject{GO} hash as it is provided by Bioconductor
meta data packages.

The root of the classifier graph must be identified by a string interpreted
as a GO identifier of the form {\tt GO:ddddddd}. The default is
{\tt GO:0008150}, which represents the 'biological process' branch of the
Gene Ontology. Other prominent candidates are {\tt GO:0003674} (molecular
function) and {\tt GO:0005575} (cellular component). However, any valid GO
identifier is allowed.

The classifier graph may be analyzed using the print function as shown in
the next chunk of R code. Printing the object returned by
\Rfunction{stam.net} directly shows some properties of the generated
classifier graph. One component of this object holds an entry for each
node. Summary information on each node can also be printed as shown below.
<<>>=
print(net)
print(net@nodes[[31]])
print(net@nodes[["GO:0005579"]])
@
A convenient means to explore the information on the raw classifier graph
is provided by the function \Rfunction{stam.writeHTML} which, applied to a
\Rclass{stamNet} object, writes a set of interlinked HTML pages.
\begin{Schunk}
\begin{Sinput}
> stam.writeHTML(net)
\end{Sinput}
\end{Schunk}
One referable section of an HTML page is written for each node. Each such
section contains links to the parent and child nodes. Sections for leaf
nodes contain links to the Affymetrix annotations of the genes they
contain.
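As a sketch of how other root terms are used (this code is not evaluated in
this vignette and the object name \Robject{net.mf} is only illustrative),
the classifier graph for the molecular function branch mentioned above
would be generated and explored in exactly the same way; note that larger
branches of the Gene Ontology may yield considerably larger classifier
graphs and longer run times.
\begin{Schunk}
\begin{Sinput}
> net.mf <- stam.net(chip="hu6800", root="GO:0003674",
+                    probes=rownames(exprs(Golub_Merge)))
> stam.writeHTML(net.mf)
\end{Sinput}
\end{Schunk}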
\section{Cross validate graph shrinkage candidates}

In order to choose the graph shrinkage level reasonably, we provide a
cross-validation method similar to the procedure suggested for choosing the
shrinkage level for gene selection in PAM. \Rfunction{stam.cv} applies this
method and returns an object of class \Rclass{stamCV}:
<<>>=
golubTrain.cv <- stam.cv(golubTrain, "ALL.AML", chip="hu6800",
                         root="GO:0005576", ndeltas=10)
@
This call also generates the above-mentioned raw classifier graph. The
considered shrinkage candidates can be provided to this function directly.
If only a number of candidates is specified using \Rfunarg{ndeltas},
\Rfunction{stam.cv} chooses these equidistantly within the range of
performance measures observed in the leaf nodes.

For further investigation of the returned \Rclass{stamCV} object, you can
use the \Rmethod{print} and \Rmethod{plot} methods.
<<>>=
print(golubTrain.cv)
plot(golubTrain.cv, delta=0.6)
@
The \Rmethod{plot} method can actually provide two types of plots as shown
below. One of them shows the error rate and performance measure in the root
node as well as the mean redundancy across the resulting classifier graph,
depending on the graph shrinkage level (Figure \ref{fig:erd}). Our
performance measure in the root node is similar to a deviance and is thus
to be minimized. The second plot shows the number of nodes and accessible
genes in the remaining graph, depending on the graph shrinkage level
(Figure \ref{fig:ng}).
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.cv, delta=0.6, which=1)
@
\caption{Results of cross-validation -- performance and redundancy}
\label{fig:erd}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.cv, delta=0.6, which=2)
@
\caption{Results of cross-validation -- number of nodes and genes}
\label{fig:ng}
\end{center}
\end{figure}
In order to investigate the cross-validation results, you can also apply
the function \Rfunction{stam.writeHTML} to the object returned by
\Rfunction{stam.cv}.

\section{Computing a single model}

After deciding on the best graph shrinkage level based on the
cross-validation results, we suggest determining a single model to
recognize the investigated phenotype. This task is performed using the
\Rfunction{stam.fit} function. The graph shrinkage level can be chosen
through the \Rfunarg{delta} argument. It can also be determined
automatically based on a weighting factor between performance and
redundancy specified in \Rfunarg{alpha}. If \Rfunarg{alpha} is set to
{\tt NULL}, the error rate in the root node is minimized. You can also
provide a sequence of weighting factors between 0 and 1 in \Rfunarg{alpha}.
In this case \Rfunction{stam.fit} computes evaluation criteria for each of
these factors and asks the user to provide the best graph shrinkage level
interactively.
<<>>=
golubTrain.fit <- stam.fit(golubTrain.cv, golubTrain, alpha=seq(0, 1, 0.1))
@
For further investigation we provide \Rmethod{print} and \Rmethod{plot}
methods on the \Rclass{stamFit} object returned by \Rfunction{stam.fit}.
The \Rmethod{plot} method on \Rclass{stamFit} objects generates two
different plots. The first plot, only generated when a set of weighting
factors is given to \Rfunction{stam.fit}, illustrates the dependency
between weighting factor and resulting graph shrinkage
(Figure \ref{fig:ad}).
The second plot shows the nodewise evaluation of classifiers according to
sensitivity, specificity, redundancy and performance
(Figure \ref{fig:ssrp}).
<<>>=
print(golubTrain.fit)
plot(golubTrain.fit)
@
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.fit, which=1)
@
\caption{Results of model fit -- weighting vs.\ shrinkage}
\label{fig:ad}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.fit, which=2)
@
\caption{Results of model fit -- nodewise evaluation}
\label{fig:ssrp}
\end{center}
\end{figure}
The function \Rfunction{stam.writeHTML} can also be applied to
\Rclass{stamFit} objects. If the {\tt graphviz} software is installed on
your system, an HTML page with a clickable graph plot is generated for
exploration of the fitted model. The function \Rfunction{stam.graph.plot}
generates this graph layout using the {\tt dot} utility from {\tt graphviz}
(Figure \ref{fig:graph}).
% store index_graph_plot.png into package to avoid dot (graphviz)
%<<>>=
%stam.writeHTML(golubTrain.fit)
%# pdflatex gets confused with this file, therefore:
%#file.remove("index_graph_plot.png")
%@
\begin{figure}
\begin{center}
\includegraphics{index_graph_plot}
\caption{Shrunken classifier graph}
\label{fig:graph}
\end{center}
\end{figure}

\chapter{Structured Prediction}
\label{sec:predict}

The model generated in the manner described above is meant to be used as a
classifier for new expression profiles. The \Rfunction{stam.predict}
function allows for this task and returns an object of class
\Rclass{stamPrediction}. When calling \Rfunction{stam.predict}, the user
provides a series of expression profiles and may declare a subset of them
to be the test set, while the others are assumed to stem from the training
set. The two variants of the call to the structured prediction function are
shown here:
<<>>=
golubTest.pred <- stam.predict(golubTrain.fit, golubTest,
                               pData(golubTest)[,"ALL.AML"])
golubMerge.pred <- stam.predict(golubTrain.fit, Golub_Merge,
                                pData(Golub_Merge)[,"ALL.AML"],
                                testset=39:72)
@
In addition to the usual call to \Rfunction{stam.writeHTML}, the
\Rmethod{print}, \Rmethod{plot}, and \Rmethod{image} methods may be used
for further exploration of the returned \Rclass{stamPrediction} object:
<<>>=
print(golubTest.pred)
plot(golubTest.pred, outfile="golubTest")
image(golubMerge.pred, outfile="golubMerge")
@
Structured prediction computes prediction probabilities for each considered
phenotype and each expression profile of interest. To this end, the
classifier in each node of the shrunken classifier graph is evaluated, and
the averaging with the node weights computed in the model fit step is
performed. Thus, an overall decision for new expression profiles can be
derived from the root node classifier. Additionally, we can consider the
pattern of absence and presence of molecular symptoms across the nodes to
define a similarity to the expression profiles used in training. This is
illustrated using the \Rmethod{image} method in Figure \ref{fig:predimg}.
\begin{figure}
\begin{center}
\includegraphics{golubMerge_pred_img}
\caption{Molecular symptoms image of a prediction, test samples are marked
  with capitals.}
\label{fig:predimg}
\end{center}
\end{figure}
The \Rmethod{plot} method illustrates nodewise sensitivity, specificity,
redundancy and performance. The \Rfunction{stam.writeHTML} function can
generate a clickable map for the molecular symptom image, where each of the
GO terms can be clicked in order to view the results of the corresponding
classifier.
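As a small illustration (not evaluated in this vignette), the interlinked
HTML report for the test set prediction, including the clickable molecular
symptom map, would be written as follows:
\begin{Schunk}
\begin{Sinput}
> stam.writeHTML(golubTest.pred)
\end{Sinput}
\end{Schunk}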
\chapter{All in One}
\label{sec:evaluate}

The function \Rfunction{stam.evaluate} calls the functions mentioned so far
in sequence in order to perform a complete structured analysis. We have two
application paradigms in mind: predictive usage and explorative usage.

\section{Explorative Use}

When using structured analysis of microarrays in an explorative manner, we
do not split the available expression profiles into a training and a test
set. Thus we are not able to evaluate the prediction accuracy of the
classifier. However, we are looking for particular structure in the
classifiers of the shrunken classifier graph. To call
\Rfunction{stam.evaluate} exploratively, set the argument \Rfunarg{testset}
to {\tt NULL}:
\begin{Schunk}
\begin{Sinput}
> golubNorm.eval.explore <-
+     stam.evaluate(Golub_Merge, "ALL.AML", testset = NULL,
+                   chip="hu6800", root="GO:0005576",
+                   alpha=seq(0, 1, 0.1), ndelta=10)
\end{Sinput}
\end{Schunk}
For further investigation of the returned \Rclass{stamEval} object, a print
method is provided, but we recommend the usage of
\Rfunction{stam.writeHTML}. The latter function generates a series of
interlinked HTML pages holding most plots and both clickable maps of the
generated objects. Furthermore, the \Rclass{stamEval} object holds the
original objects returned by \Rfunction{stam.cv}, \Rfunction{stam.fit}, and
\Rfunction{stam.predict} in its slots. The user can further investigate
using the corresponding \Rmethod{print}, \Rmethod{plot}, and
\Rmethod{image} methods.

\section{Predictive Use}

The predictive use of structured analysis of microarrays is the standard
behavior of \Rfunction{stam.evaluate}. By default, a third of the
expression profiles is chosen randomly as the test set while the rest is
used for training. The user can explicitly set the test set through the
\Rfunarg{testset} parameter of \Rfunction{stam.evaluate}:
\begin{Schunk}
\begin{Sinput}
> golubNorm.eval.predict <-
+     stam.evaluate(Golub_Merge, "ALL.AML", testset=39:72,
+                   chip = "hu6800", root = "GO:0005576",
+                   ndelta = 10)
\end{Sinput}
\end{Schunk}
The same remarks as above apply regarding the further investigation
possibilities.

\chapter{StAM WWW-Server}
\label{sec:server}

For the interactive exploration of the graph shrinkage level and a few
parameters influencing the model graph plot as well as the molecular
symptoms image, we have implemented a WWW-based solution. The function
\Rfunction{stam.writeHTML} can write HTML forms for these parameters
directly into the HTML pages representing the analysis results. We provide
CGI scripts with the \Rpackage{stam} package which collect the user entries
from these forms and store them as requests in a temporary directory
writable by the web server. The \Rfunction{stam.serve} function provides
the functionality to check this temporary directory and perform the stored
requests, as illustrated in Figure \ref{fig:servescheme}. The WWW client is
redirected to a progress page which reloads automatically every other
second. As soon as \Rfunction{stam.serve} has finished treating the
request, the progress page redirects the browser to the new result page.
\begin{figure}
\begin{center}
\includegraphics[width=6cm]{server_scheme}
\caption{Architecture of StAM's server feature.}
\label{fig:servescheme}
\end{center}
\end{figure}

When the \Rfunction{stam.serve} function is first called after installing
the \Rpackage{stam} package, the location of the temporary directory for
requests, the directory for CGI scripts and the URL where these scripts can
be found are obtained either from function arguments or interactively.
\begin{Schunk}
\begin{Sinput}
> stam.serve(tmp.path = "/home/upload/stam",
+            cgi.path = "/home/cgi-bin/stam",
+            cgi.url = "/cgi-bin/stam")
\end{Sinput}
\begin{Soutput}
The CGI URL has changed, make sure to regenerate the HTML pages
to be worked with through the server.
 - start listening '/home/upload/stam'...
   Stop the server by creating '/home/upload/quit'.
   Make sure you have written the HTML pages to be worked with
   through the StAM server with 'options(stam.write.forms=TRUE)'
\end{Soutput}
\end{Schunk}
\Rfunction{stam.serve} stores this information in the package installation
and copies the modified CGI scripts to the given location. Then
\Rfunction{stam.serve} starts ``listening'' in the temporary directory and
will skip the installation steps in subsequent calls.

In order to use \Rpackage{stam}'s server feature, you need to set the
option {\tt stam.write.forms} to {\tt TRUE} before rewriting the HTML code
for the model you want to work with:
\begin{Schunk}
\begin{Sinput}
> options(stam.write.forms=TRUE)
> stam.writeHTML(golubNorm.eval.explore)
\end{Sinput}
\end{Schunk}
Now, if you load the generated HTML page into your browser, you will see
the forms to enter modified parameters. It is enough to start
\Rfunction{stam.serve} in order to use these forms interactively.

\section*{Acknowledgements}

This research has been supported by BMBF grant 031U117/031U217 of the
German Federal Ministry of Education and the National Genome Research
Network.

\begin{thebibliography}{1}\setlength{\itemsep}{-1ex}\small

\bibitem{GO00} {The Gene Ontology Consortium}.
\newblock Gene ontology: Tool for the unification of biology.
\newblock {\em Nature Genetics}, 25:25--29, May 2000.

\bibitem{Huber02} W.~Huber, A.~von~Heydebreck, H.~S{\"u}ltmann, A.~Poustka,
  and M.~Vingron.
\newblock Variance stabilization applied to microarray data calibration and
  to the quantification of differential expression.
\newblock {\em Bioinformatics}, 18(Suppl.\ 1):S96--S104, 2002.

\bibitem{Golub99} T.R.~Golub, D.K.~Slonim, P.~Tamayo, C.~Huard,
  M.~Gaasenbeek, J.P.~Mesirov, H.~Coller, M.L.~Loh, J.R.~Downing,
  M.A.~Caligiuri, C.D.~Bloomfield, and E.S.~Lander.
\newblock Molecular classification of cancer: class discovery and class
  prediction by gene expression monitoring.
\newblock {\em Science}, 286:531--537, 1999.

\bibitem{Tibshirani02a} R.~Tibshirani, T.~Hastie, B.~Narasimhan, and
  G.~Chu.
\newblock Diagnosis of multiple cancer types by shrunken centroids of gene
  expression.
\newblock {\em PNAS}, 99:6567--6572, May 2002.

\end{thebibliography}
\end{document}