%\VignetteIndexEntry{topGO} %\VignetteDepends{ALL, hgu95av2.db, genefilter, xtable, multtest, Rgraphviz} %\VignetteKeywords{topGO, GO, graph} %\VignettePackage{topGO} \documentclass[a4paper, oneside, 10pt]{article} \usepackage[pdftex]{graphicx} \usepackage{calc} \usepackage{sectsty} \usepackage{caption} \renewcommand{\captionfont}{\it\sffamily} \renewcommand{\captionlabelfont}{\bf\sffamily} \allsectionsfont{\sffamily} % page style %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \usepackage[a4paper, left=25mm, right=20mm, top=20mm, bottom=25mm, nohead]{geometry} \setlength{\parskip}{1.5ex} \setlength{\parindent}{0cm} \pagestyle{empty} \usepackage{Sweave} \SweaveOpts{prefix.string = tGO} \title{\vspace*{-6ex} Gene set enrichment analysis with {\bf topGO}} \author{Adrian Alexa, J\"org Rahnenf\"uhrer} \date{\today \\% \texttt{http://www.mpi-sb.mpg.de/$\sim$alexa}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %% This document is the short version of the topGO manual %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} \maketitle %%\newpage \tableofcontents \newpage <>= options(width = 95) @ \section{Introduction} \label{sampleSession} This manuscript provides a quick tour into {\tt topGO}. There is an accompanying document which provides details on the functions used in this document as well as showing more advance functionality implemented in the {\tt topGO} package. The {\tt topGO} package is designed to work with different test statistics and with multiple GO graph algorithms, see~\cite{Alexa}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% This section describes a simple working-session using {\tt topGO}. There are only a handful of commands necessary to perform a gene set enrichment analysis and we will be briefly presented them bellow. A typical session can be divided into three steps: \begin{enumerate} \item Collection and preprocessing of the data. The list of genes and the correspondent gene' score, the criteria for selecting genes based on their scores and the gene-to-GO annotations are all collected to form a suitable R object. \item Using the object created in the first step the user can perform enrichment analysis using various statistical tests and methods that deal with the GO topology. \item Analysis of the results. \end{enumerate} {\it But before going through each of these steps we need to decide which biological question we want to answer. This will most likely dictate which test statistic we need to use. The choice of the test is also restricted by the data available at hand. The knowledge of the test will dictate the way the gene expression data needs to be process.} Here we will test for enrichment of GO terms with differentially expressed genes using the Kolmogorov-Smirnov test and Fisher's exact test as examples. \subsection{Step 1: Preparing the data} In this step a convenient R object of class {\tt topGOdata} is created containing all the information required for the remaining two steps. The user needs to provide a list of genes, GO annotations and a criteria for selecting interesting genes. In most cases we want to test enrichment of GO terms with differentially expressed genes. Thus, the starting point is a list of genes and the respective $p$-values for differential expression. A toy example of a list of gene identifiers and the respective $p$-values is provided by the {\tt geneList} object. <>= library(topGO) library(ALL) data(ALL) data(geneList) @ The {\tt geneList} data is based on an differential expression analysis of the ALL dataset. It contains just a small number, $\Sexpr{length(geneList)}$, of genes with the corespondent $p$-values. For these genes we need GO annotations to be able to form the gene groups. The information on where to find the GO annotations is stored in the ALL object and its easily accessible. <<>>= affyLib <- paste(annotation(ALL), "db", sep = ".") library(package = affyLib, character.only = TRUE) @ The chip used in the experiment is the {\tt \Sexpr{annotation(ALL)}} from Affymetrix, as we can see from the {\tt affyLib} object. When we loaded the {\tt geneList} object a function which will select the differentially expressed genes from the list was also loaded under the name of {\tt topDiffGenes}. For example using this function we can see that there are \Sexpr{sum(topDiffGenes(geneList))} genes with a row $p$-value less than $0.01$ out of a total of \Sexpr{length(geneList)} genes. <<>>= sum(topDiffGenes(geneList)) @ We now have all the data necessary to build on object of type {\tt topGOdata}. This object will contain all the gene identifiers and their score, the GO annotations, the GO hierarchical structure and all the other information needed to perform the enrichment analysis. <>= sampleGOdata <- new("topGOdata", description = "Simple session", ontology = "BP", allGenes = geneList, geneSel = topDiffGenes, nodeSize = 10, annot = annFUN.db, affyLib = affyLib) @ A summary of the {\tt sampleGOdata} object can be seen by typing the object name at the R prompt. Having all the data stored into this object facilitates the access to identifiers, annotations and to obtain basic data statics. <>= sampleGOdata @ \subsection{Step 2: Running the desired tests} Once we have an object of class {\tt topGOdata} we can start with the enrichment analysis. Since for each gene we have a score and there is also a procedure to select interesting genes based on the scores we will use two types of test statistics: Fisher's exact test which is based on gene counts, and a Kolmogorov-Smirnov like test which needs gene scores. The function {\tt runTest} is used to apply the specified enrichment test and method to the data. It has three basic arguments. The first argument needs to be on object of class {\tt topGOdata}. The second argument is of type character and specifies which method for dealing with the GO graph structure will be used (this argument is optional). And the third argument specifies the test statistic and is of type character. First we perform a classical enrichment analysis by testing the over-representation of GO terms with differentially expressed genes. In the classical case each GO category is tested independently. <>= resultFisher <- runTest(sampleGOdata, algorithm = "classic", statistic = "fisher") @ {\tt runTest} returns an object of class {\tt topGOresult}. A short summary of this object is shown bellow. <<>>= resultFisher @ Next we will test the enrichment using the Kolmogorov-Smirnov test. We will use the both the {\sf classic} and the {\sf elim} methods. <>= resultKS <- runTest(sampleGOdata, algorithm = "classic", statistic = "ks") resultKS.elim <- runTest(sampleGOdata, algorithm = "elim", statistic = "ks") @ Please note that some statistical tests will not work with any method. The compatibility matrix between the methods and statistical tests is shown in Table~\ref{tabletopGO}. The $p$-values computed by the {\tt runTest} function are unadjusted for multiple testing. We do note advocate against adjusting the $p$-values of the tested groups, but in many cases the an adjusted $p$-value can be misleading. \subsection{Step 3: Analysis of results} After the enrichment tests are performed the researcher needs tools for analysing and interpreting the results. {\tt GenTable} is an easy to use function for analysing the most significant GO terms and the corresponding $p$-values. For example, we want to see which are the top $10$ significant GO terms identified by the {\sf elim} method. We also want to compare the ranks and the $p$-values of these GO terms in the case where the {\sf classic} method was employe used. <<>>= allRes <- GenTable(sampleGOdata, classicFisher = resultFisher, classicKS = resultKS, elimKS = resultKS.elim, orderBy = "elimKS", ranksOf = "classicFisher", topNodes = 10) @ The {\tt GenTable} function returns a data.frame containing the top {\tt topNodes} GO terms identified by the {\sf elim} algorithm, see {\tt orderBy} argument. The data.frame includes some statistics on the GO terms and the $p$-values obtained from the methods passes as arguments. Table~\ref{tab:sampleGOresults} shows the results. \begin{table}[!t] \centering\resizebox{.99\linewidth}{!}{% <>= if(require(xtable)) print(xtable(apply(allRes, 2, as.character)), floating = FALSE) @ }\caption{Significance of GO terms according to {\sf classic} and {\sf elim} methods.} \label{tab:sampleGOresults} \end{table} For accessing the GO term's $p$-values from a {\tt topGOresult} object the user should use the {\tt score} functions. As a simple example, we look at the differences between the results of the {\sf classic} and the {\sf elim} methods in the case of the Kolmogorov-Smirnov test. Due to the fact that there are few significant GO terms, $\Sexpr{sum(score(resultKS) <= 0.01)}$ terms with a $p$-value less than $0.01$ in the {\sf classic} method, one would expect that only few GO terms will have different $p$-values between these two methods. This, of course, depends on the value of the cutoff parameter used by the {\sf elim} method. \setkeys{Gin}{width=.4\linewidth} \begin{figure}[!h] \centering <>= pValue.classic <- score(resultKS) pValue.elim <- score(resultKS.elim)[names(pValue.classic)] plot(pValue.classic, pValue.elim, xlab = "p-value classic", ylab = "p-value elim", cex = .5) @ \caption{$p$-values scatter plot for the {\sf classic} and {\sf elim} methods.} \label{scatterClassicElim} \end{figure} We can see in Figure~\ref{scatterClassicElim} that there are indeed few differences between the two methods. Some GO terms witch are found significant by the {\sf classic} method are less significant in the {\sf elim}, as expected. Another insightful way of looking at the results of the analysis is to investigate how the significant GO terms are distributed over the GO graph. For the {\tt elim} algorithm the subgraph induced by the $5$ most significant GO terms is plotted. In the plot, the significant nodes are represented as boxes. The plotted graph is the upper induced graph generated by these significant nodes. <>= showSigOfNodes(sampleGOdata, score(resultKS.elim), firstTerms = 5, useInfo = 'all') @ <>= printGraph(sampleGOdata, resultKS.elim, firstSigNodes = 5, fn.prefix = "tGO", useInfo = "all", pdfSW = TRUE) @ \begin{figure}[!ht] \centering \includegraphics[width=1.05\linewidth]{tGO_elim_5_all} \caption{The subgraph induced by the top $5$ GO terms identified by the {\sf elim} algorithm for scoring GO terms for enrichment. Rectangles indicate the $5$ most significant terms. Rectangle color represents the relative significance, ranging from dark red (most significant) to light yellow (least significant). Black arrows indicate {\it is-a} relationships and red arrows {\it part-of} relationships. For each node, some basic information is displayed. The first two lines show the GO identifier and a trimmed GO name. In the third line the raw p-value is shown. The forth line is showing the number of significant genes and the total number of genes annotated to the respective GO term. } \label{fig:sampleGOelim} \end{figure} \clearpage \clearpage \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \addcontentsline{toc}{section}{References} \label{references} \bibliography{ref_books} \bibliographystyle{apalike} \end{document}