%\VignetteIndexEntry{Structured Analysis of Microarrays}
%\VignetteDepends{golubEsets, hu6800.db}
%\VignetteKeywords{classification microarrays complex phenotypes}
%\VignettePackage{stam}
\documentclass[11pt,a4paper]{report}
\usepackage{compdiag}
\usepackage{amsmath,a4}
\SweaveOpts{eps=false}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\title{Structured Analysis of Microarrays \bigskip \\
  User's Guide to the Bioconductor package \Rpackage{stam}}
\author{Claudio Lottaz\footnote{Corresponding author:
    \texttt{claudio.lottaz@molgen.mpg.de}} and Rainer Spang \bigskip \\
  \small Max Planck Institute for Molecular Genetics \\
  Computational Diagnostics, Department of Computational Molecular Biology \\
  Ihnestr. 63-73, D-14195 Berlin, Germany}
\reportnr{03}
\year{2004}

\abstract{This is the vignette of the Bioconductor compliant package
  \Rpackage{stam}. We describe our implementation of {\it structured
  analysis of microarray data} for decomposing complex clinical phenotypes
  in gene expression profiling studies.}
\date{}

\begin{document}
\maketitle

<<echo=FALSE, results=hide>>=
library(golubEsets)
oldopt <- options(digits=3)
on.exit( {options(oldopt)} )
options(width=70)
if (interactive()) {
  options(error=recover)
}
set.seed(123)
@

\chapter{Introduction}

In a clinical context, microarray data can be used in a straightforward
manner for diagnostic tasks. To this end, standard classification methods
developed in the machine learning community have been suggested and applied
in clinical research to distinguish between clinically relevant phenotypes.
However, it is commonly accepted that such phenotypes are often complex,
i.e., they can arise from different underlying causes. We therefore expect
to find heterogeneous gene expression profiles in patients with a
homogeneous clinical phenotype.

In order to decompose complex phenotypes, we suggest a procedure we call
{\it structured analysis of microarrays}. In this procedure we train many
classifiers to recognize the same complex phenotype. Each classifier
focuses on a single biological aspect by extracting only the expression
data of the corresponding genes. Many of these classifiers will be unable
to recognize the phenotype of interest and are thus discarded. Others will
recognize a subset of patients, and some of them will hopefully recognize
all patients. {\it Structured analysis of microarrays} uses classifiers
with high specificity, which we call molecular symptoms, to further
stratify patients on a molecular level. Patterns of absence and presence of
molecular symptoms identify particular groups of patients. Due to the
biological focus of our classifiers, molecular symptoms always have a
biological meaning.

The Bioconductor compliant R package \Rpackage{stam} described in this
document implements {\it structured analysis of microarrays} based on the
Gene Ontology \cite{GO00}. GO terms attributed to leaf nodes of the GO
graph are used as biological foci for potential molecular symptoms. The
hierarchy of the Gene Ontology is used to compute more general classifiers
by weighted averaging, whereby good classifiers obtain higher weights. A
shrinkage process on these weights, similar to the one used in PAM
\cite{Tibshirani02a} to select genes, restricts the graph of classifiers in
a cross-validation setting to the branches linked to the phenotype.
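To make the averaging and shrinkage idea more concrete, the following is a
rough sketch; the exact formulas used by \Rpackage{stam} may differ in
detail. An inner node $v$ combines the class probabilities of its child
classifiers by a weighted average, and the weights are soft-thresholded by
the graph shrinkage level $\Delta$, so that uninformative branches drop out
of the classifier graph:
\begin{align*}
  P_v(c \mid x) &= \sum_{u \in \mathrm{children}(v)} \tilde{w}_u \, P_u(c \mid x),
  &
  \tilde{w}_u &= \frac{\max(w_u - \Delta,\, 0)}{\sum_{u'} \max(w_{u'} - \Delta,\, 0)},
\end{align*}
where $w_u$ reflects how well the classifier in node $u$ recognizes the
phenotype of interest, so that good classifiers obtain higher weights.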
The remainder of this document describes the usage of the \Rpackage{stam}
package. \Rpackage{stam} relies on Bioconductor meta data packages to
implement structured training of classifiers (Section \ref{sec:training})
and structured prediction (Section \ref{sec:predict}). For further
convenience we have implemented a complete structured evaluation of a data
set (Section \ref{sec:evaluate}) and a web-based possibility to explore
some crucial parameters of such an analysis interactively
(Section \ref{sec:server}).

\chapter{Training Classifiers}
\label{sec:training}

The training of a structured classifier for a phenotype in microarray data
consists of the following steps:
\begin{enumerate}
\item Generate a classifier network according to the chip's annotations and
  the GO hierarchy.
\item Perform cross-validated PAM fits in each leaf node to determine
  adequate PAM shrinkage levels and compute performance for several graph
  shrinkage candidates.
\item Compute a single structured model using the best graph shrinkage
  level, either chosen by the user or determined automatically according to
  some performance criterion calculated in the cross-validation step.
\end{enumerate}
Before starting any analysis you have to load the \Rpackage{stam} package
in your R session as follows:
<<>>=
library(stam)
@
To illustrate the usage of \Rpackage{stam} we use the Golub data set on
acute leukemia \cite{Golub99} as it is stored in the \Rpackage{golubEsets}
data package. For preparing the data, issue the following commands:
<<>>=
library(golubEsets)
data(Golub_Merge)
golubTrain <- Golub_Merge[,1:38]
golubTest <- Golub_Merge[,39:72]
@

\section{Generate a raw classifier graph}

The raw classifier graph can be generated by the function
\Rfunction{stam.net}. This function needs the name of the chip used in the
study to be analyzed as well as the root node to be used for the classifier
graph. The function returns an object of class \Rclass{stamNet}.
<<>>=
net <- stam.net(chip="hu6800", root="GO:0005576",
                probes=rownames(exprs(Golub_Merge)))
@
The string given as the chip's name is used to load the annotation data.
Thus \Rpackage{stam} expects a package of the same name to be installed,
where it looks for the \Robject{GO} hash as it is provided by Bioconductor
meta data packages.

The root of the classifier graph must be identified by a string interpreted
as a GO identifier of the form {\tt GO:ddddddd}. The default is
{\tt GO:0008150}, which represents the 'biological process' branch of the
Gene Ontology. Other prominent candidates are {\tt GO:0003674} (molecular
function) and {\tt GO:0005575} (cellular component). However, any valid GO
identifier is allowed.

The classifier graph may be analyzed using the print function as shown in
the next chunk of R code. Printing the object returned by
\Rfunction{stam.net} directly shows some properties of the generated
classifier graph. One component of this object holds an entry for each
node. Summary information on each node can also be printed as shown below.
<<>>=
print(net)
print(net@nodes[[31]])
print(net@nodes[["GO:0005579"]])
@
A convenient means to explore the information on the raw classifier graph
is provided by the function \Rfunction{stam.writeHTML} which, applied to a
\Rclass{stamNet} object, writes a set of interlinked HTML pages.
\begin{Schunk}
\begin{Sinput}
> stam.writeHTML(net)
\end{Sinput}
\end{Schunk}
One referable section of an HTML page is written for each node. Each such
section contains links to the parent and child nodes. Sections for leaf
nodes contain links to the Affymetrix annotations of the genes they
contain.
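As a sketch of how other root terms are used (this code is not evaluated in
this vignette and the object name \Robject{net.mf} is only illustrative),
the classifier graph for the molecular function branch mentioned above
would be generated and explored in exactly the same way; note that larger
branches of the Gene Ontology may yield considerably larger classifier
graphs and longer run times.
\begin{Schunk}
\begin{Sinput}
> net.mf <- stam.net(chip="hu6800", root="GO:0003674",
+                    probes=rownames(exprs(Golub_Merge)))
> stam.writeHTML(net.mf)
\end{Sinput}
\end{Schunk}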
\section{Cross validate graph shrinkage candidates}

In order to choose the graph shrinkage level reasonably, we provide a
cross-validation method similar to the procedure suggested for choosing the
shrinkage level for gene selection in PAM. \Rfunction{stam.cv} applies this
method and returns an object of class \Rclass{stamCV}:
<<>>=
golubTrain.cv <- stam.cv(golubTrain, "ALL.AML", chip="hu6800",
                         root="GO:0005576", ndeltas=10)
@
This call also generates the above-mentioned raw classifier graph. The
considered shrinkage candidates can be provided to this function directly.
If only a number of candidates is specified using \Rfunarg{ndeltas},
\Rfunction{stam.cv} chooses these equidistantly within the range of
performance measures observed in the leaf nodes.

For further investigation of the returned \Rclass{stamCV} object, you can
use the \Rmethod{print} and \Rmethod{plot} methods.
<<>>=
print(golubTrain.cv)
plot(golubTrain.cv, delta=0.6)
@
The \Rmethod{plot} method can actually provide two types of plots as shown
below. One of them shows the error rate and performance measure in the root
node as well as the mean redundancy across the resulting classifier graph,
depending on the graph shrinkage level (Figure \ref{fig:erd}). Our
performance measure in the root node is similar to a deviance and is thus
to be minimized. The second plot shows the number of nodes and accessible
genes in the remaining graph, depending on the graph shrinkage level
(Figure \ref{fig:ng}).
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.cv, delta=0.6, which=1)
@
\caption{Results of cross-validation -- performance and redundancy}
\label{fig:erd}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.cv, delta=0.6, which=2)
@
\caption{Results of cross-validation -- number of nodes and genes}
\label{fig:ng}
\end{center}
\end{figure}
In order to investigate the cross-validation results, you can also apply
the function \Rfunction{stam.writeHTML} to the object returned by
\Rfunction{stam.cv}.

\section{Computing a single model}

After deciding on the best graph shrinkage level based on the
cross-validation results, we suggest determining a single model to
recognize the investigated phenotype. This task is performed using the
\Rfunction{stam.fit} function. The graph shrinkage level can be chosen
through the \Rfunarg{delta} argument. It can also be determined
automatically based on a weighting factor between performance and
redundancy specified in \Rfunarg{alpha}. If \Rfunarg{alpha} is set to
{\tt NULL}, the error rate in the root node is minimized. You can also
provide a sequence of weighting factors between 0 and 1 in \Rfunarg{alpha}.
In this case \Rfunction{stam.fit} computes evaluation criteria for each of
these factors and asks the user to provide the best graph shrinkage level
interactively.
<<>>=
golubTrain.fit <- stam.fit(golubTrain.cv, golubTrain, alpha=seq(0, 1, 0.1))
@
For further investigation we provide \Rmethod{print} and \Rmethod{plot}
methods on the \Rclass{stamFit} object returned by \Rfunction{stam.fit}.
The \Rmethod{plot} method on \Rclass{stamFit} objects generates two
different plots. The first plot, only generated when a set of weighting
factors is given to \Rfunction{stam.fit}, illustrates the dependency
between weighting factor and resulting graph shrinkage
(Figure \ref{fig:ad}).
The second plot shows the nodewise evaluation of classifiers according to
sensitivity, specificity, redundancy and performance
(Figure \ref{fig:ssrp}).
<<>>=
print(golubTrain.fit)
plot(golubTrain.fit)
@
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.fit, which=1)
@
\caption{Results of model fit -- weighting vs.\ shrinkage}
\label{fig:ad}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
<<fig=TRUE, echo=FALSE>>=
plot(golubTrain.fit, which=2)
@
\caption{Results of model fit -- nodewise evaluation}
\label{fig:ssrp}
\end{center}
\end{figure}
The function \Rfunction{stam.writeHTML} can also be applied to
\Rclass{stamFit} objects. If the {\tt graphviz} software is installed on
your system, an HTML page with a clickable graph plot is generated for
exploration of the fitted model. The function \Rfunction{stam.graph.plot}
generates this graph layout using the {\tt dot} utility from {\tt graphviz}
(Figure \ref{fig:graph}).
% store index_graph_plot.png into package to avoid dot (graphviz)
%<<>>=
%stam.writeHTML(golubTrain.fit)
%# pdflatex gets confused with this file, therefore:
%#file.remove("index_graph_plot.png")
%@
\begin{figure}
\begin{center}
\includegraphics{index_graph_plot}
\caption{Shrunken classifier graph}
\label{fig:graph}
\end{center}
\end{figure}

\chapter{Structured Prediction}
\label{sec:predict}

The model generated in the manner described above is meant to be used as a
classifier for new expression profiles. The \Rfunction{stam.predict}
function allows for this task and returns an object of class
\Rclass{stamPrediction}. When calling \Rfunction{stam.predict}, the user
provides a series of expression profiles and may declare a subset of them
to be the test set, while the others are assumed to stem from the training
set. The two variants of the call to the structured prediction function are
shown here:
<<>>=
golubTest.pred <- stam.predict(golubTrain.fit, golubTest,
                               pData(golubTest)[,"ALL.AML"])
golubMerge.pred <- stam.predict(golubTrain.fit, Golub_Merge,
                                pData(Golub_Merge)[,"ALL.AML"],
                                testset=39:72)
@
In addition to the usual call to \Rfunction{stam.writeHTML}, the
\Rmethod{print}, \Rmethod{plot}, and \Rmethod{image} methods may be used
for further exploration of the returned \Rclass{stamPrediction} object:
<<>>=
print(golubTest.pred)
plot(golubTest.pred, outfile="golubTest")
image(golubMerge.pred, outfile="golubMerge")
@
Structured prediction computes prediction probabilities for each considered
phenotype and each expression profile of interest. To this end, the
classifier in each node of the shrunken classifier graph is evaluated, and
the averaging with the node weights computed in the model fit step is
performed. Thus, an overall decision for new expression profiles can be
derived from the root node classifier. Additionally, we can consider the
pattern of absence and presence of molecular symptoms across the nodes to
define a similarity to the expression profiles used in training. This is
illustrated using the \Rmethod{image} method in Figure \ref{fig:predimg}.
\begin{figure}
\begin{center}
\includegraphics{golubMerge_pred_img}
\caption{Molecular symptoms image of a prediction, test samples are marked
  with capitals.}
\label{fig:predimg}
\end{center}
\end{figure}
The \Rmethod{plot} method illustrates nodewise sensitivity, specificity,
redundancy and performance. The \Rfunction{stam.writeHTML} function can
generate a clickable map for the molecular symptom image, where each of the
GO terms can be clicked in order to view the results of the corresponding
classifier.
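As a small illustration (not evaluated in this vignette), the interlinked
HTML report for the test set prediction, including the clickable molecular
symptom map, would be written as follows:
\begin{Schunk}
\begin{Sinput}
> stam.writeHTML(golubTest.pred)
\end{Sinput}
\end{Schunk}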
\chapter{All in One}
\label{sec:evaluate}

The function \Rfunction{stam.evaluate} calls the functions mentioned so far
in sequence in order to perform a complete structured analysis. We have two
application paradigms in mind: predictive usage and explorative usage.

\section{Explorative Use}

When using structured analysis of microarrays in an explorative manner, we
do not split the available expression profiles into a training and a test
set. Thus we are not able to evaluate the prediction accuracy of the
classifier. However, we are looking for particular structure in the
classifiers of the shrunken classifier graph. To call
\Rfunction{stam.evaluate} exploratively, set the argument \Rfunarg{testset}
to {\tt NULL}:
\begin{Schunk}
\begin{Sinput}
> golubNorm.eval.explore <-
+     stam.evaluate(Golub_Merge, "ALL.AML", testset = NULL,
+                   chip="hu6800", root="GO:0005576",
+                   alpha=seq(0, 1, 0.1), ndelta=10)
\end{Sinput}
\end{Schunk}
For further investigation of the returned \Rclass{stamEval} object, a print
method is provided, but we recommend the usage of
\Rfunction{stam.writeHTML}. The latter function generates a series of
interlinked HTML pages holding most plots and both clickable maps of the
generated objects. Furthermore, the \Rclass{stamEval} object holds the
original objects returned by \Rfunction{stam.cv}, \Rfunction{stam.fit}, and
\Rfunction{stam.predict} in its slots. The user can further investigate
using the corresponding \Rmethod{print}, \Rmethod{plot}, and
\Rmethod{image} methods.

\section{Predictive Use}

The predictive use of structured analysis of microarrays is the standard
behavior of \Rfunction{stam.evaluate}. By default, a third of the
expression profiles is chosen randomly as the test set while the rest is
used for training. The user can explicitly set the test set through the
\Rfunarg{testset} parameter of \Rfunction{stam.evaluate}:
\begin{Schunk}
\begin{Sinput}
> golubNorm.eval.predict <-
+     stam.evaluate(Golub_Merge, "ALL.AML", testset=39:72,
+                   chip = "hu6800", root = "GO:0005576",
+                   ndelta = 10)
\end{Sinput}
\end{Schunk}
The same remarks as above apply regarding the further investigation
possibilities.

\chapter{StAM WWW-Server}
\label{sec:server}

For the interactive exploration of the graph shrinkage level and a few
parameters influencing the model graph plot as well as the molecular
symptoms image, we have implemented a WWW-based solution. The function
\Rfunction{stam.writeHTML} can write HTML forms for these parameters
directly into the HTML pages representing the analysis results. We provide
CGI scripts with the \Rpackage{stam} package which collect the user entries
from these forms and store them as requests in a temporary directory
writable by the web server. The \Rfunction{stam.serve} function provides
the functionality to check this temporary directory and perform the stored
requests, as illustrated in Figure \ref{fig:servescheme}. The WWW client is
redirected to a progress page which reloads automatically every other
second. As soon as \Rfunction{stam.serve} has finished treating the
request, the progress page redirects the browser to the new result page.
\begin{figure}
\begin{center}
\includegraphics[width=6cm]{server_scheme}
\caption{Architecture of StAM's server feature.}
\label{fig:servescheme}
\end{center}
\end{figure}

When the \Rfunction{stam.serve} function is first called after installing
the \Rpackage{stam} package, the location of the temporary directory for
requests, the directory for CGI scripts and the URL where these scripts can
be found are obtained either from function arguments or interactively.
\begin{Schunk}
\begin{Sinput}
> stam.serve(tmp.path = "/home/upload/stam",
+            cgi.path = "/home/cgi-bin/stam",
+            cgi.url = "/cgi-bin/stam")
\end{Sinput}
\begin{Soutput}
The CGI URL has changed, make sure to regenerate the HTML pages
to be worked with through the server.
 - start listening '/home/upload/stam'...
   Stop the server by creating '/home/upload/quit'.
   Make sure you have written the HTML pages to be worked with
   through the StAM server with 'options(stam.write.forms=TRUE)'
\end{Soutput}
\end{Schunk}
\Rfunction{stam.serve} stores this information in the package installation
and copies the modified CGI scripts to the given location. Then
\Rfunction{stam.serve} starts ``listening'' in the temporary directory and
will skip the installation steps in subsequent calls.

In order to use \Rpackage{stam}'s server feature, you need to set the
option {\tt stam.write.forms} to {\tt TRUE} before rewriting the HTML code
for the model you want to work with:
\begin{Schunk}
\begin{Sinput}
> options(stam.write.forms=TRUE)
> stam.writeHTML(golubNorm.eval.explore)
\end{Sinput}
\end{Schunk}
Now, if you load the generated HTML page into your browser, you will see
the forms to enter modified parameters. It is enough to start
\Rfunction{stam.serve} in order to use these forms interactively.

\section*{Acknowledgements}

This research has been supported by BMBF grant 031U117/031U217 of the
German Federal Ministry of Education and the National Genome Research
Network.

\begin{thebibliography}{1}\setlength{\itemsep}{-1ex}\small

\bibitem{GO00} {The Gene Ontology Consortium}.
\newblock Gene ontology: Tool for the unification of biology.
\newblock {\em Nature Genetics}, 25:25--29, May 2000.

\bibitem{Huber02} W.~Huber, A.~von~Heydebreck, H.~S{\"u}ltmann, A.~Poustka,
  and M.~Vingron.
\newblock Variance stabilization applied to microarray data calibration and
  to the quantification of differential expression.
\newblock {\em Bioinformatics}, 18(Suppl.\ 1):S96--S104, 2002.

\bibitem{Golub99} T.R.~Golub, D.K.~Slonim, P.~Tamayo, C.~Huard,
  M.~Gaasenbeek, J.P.~Mesirov, H.~Coller, M.L.~Loh, J.R.~Downing,
  M.A.~Caligiuri, C.D.~Bloomfield, and E.S.~Lander.
\newblock Molecular classification of cancer: class discovery and class
  prediction by gene expression monitoring.
\newblock {\em Science}, 286:531--537, 1999.

\bibitem{Tibshirani02a} R.~Tibshirani, T.~Hastie, B.~Narasimhan, and
  G.~Chu.
\newblock Diagnosis of multiple cancer types by shrunken centroids of gene
  expression.
\newblock {\em PNAS}, 99:6567--6572, May 2002.

\end{thebibliography}
\end{document}