% \VignetteIndexEntry{An introduction to DOSE}
% \VignetteDepends{AnnotationDbi, DO.db, methods, stats, plyr}
% \VignetteSuggests{GOSemSim, clusterProfiler}
% \VignetteKeywords{Disease Ontology Semantic and Enrichment analysis}
% \VignettePackage{DOSE}
%\SweaveOpts{prefix.string=images/fig}
\documentclass[]{article}
\usepackage{times}
\usepackage{natbib}
\usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\R}{\textsf{R}}
\newcommand{\DOSE}{\Rpackage{DOSE}}
\newcommand{\DOParams}{\Rclass{DOParams}}
\newcommand{\enrichResult}{\Rclass{enrichResult}}
\newcommand{\term}[1]{\emph{#1}}
\newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}}

\title{Disease Ontology Semantic and Enrichment analysis}
\author{Guangchuang Yu, Li-Gen Wang \\ \\ Jinan University, Guangzhou, China}

\begin{document}
\bibliographystyle{plainnat}

\maketitle

<<>>=
options(width=60)
require(DOSE)
@

\section{Introduction}

Disease Ontology (DO) provides an open source ontology for the integration of biomedical data that is associated with human disease. DO analysis can lead to interesting discoveries that deserve further clinical investigation.

\Rpackage{DOSE} was designed for semantic similarity measurement and enrichment analysis. Four information content (IC)-based methods, proposed by Resnik \citep{Resnik1999}, Jiang \citep{Jiang1997}, Lin \citep{Lin1998} and Schlicker \citep{Schlicker2006}, and one graph structure-based method, proposed by Wang \citep{Wang2007}, are implemented. These methods are also implemented in our \Rpackage{GOSemSim} \citep{GYu2010} package for measuring GO-term semantic similarities. A hypergeometric test \citep{boyle2004} is implemented for enrichment analysis.
\\
To get started with the \Rpackage{DOSE} package, type the following code:

<<>>=
library(DOSE)
help(DOSE)
@

\section{Semantic Similarity Measurement}

The \DOSE{} package contains functions to estimate the semantic similarity of DO terms based on Resnik's, Lin's, Jiang and Conrath's, Rel's and Wang's methods. Details about Resnik's, Lin's, and Jiang and Conrath's methods can be found in \citep{lord_semantic_2003}, details about Rel's method can be found in \citep{Schlicker2006}, and details about Wang's method can be found in \citep{Wang2007}.

IC-based methods depend on the frequencies of the two DO terms involved and that of their closest common ancestor term in a specific corpus of DO annotations. The probability $p(t)$ of a term is the frequency with which it occurs in the corpus, and its information content is $-\ln p(t)$.

As DO allows multiple parents for each concept, two terms can share parents via multiple paths. Where there is more than one shared parent, we take the minimum $p(t)$. The \textit{$p_{ms}$} is defined as:

\begin{center}
$p_{ms}(t1,t2) = \displaystyle\min_{t \in S(t1,t2)} \{p(t)\}$
\end{center}

where $S(t1,t2)$ is the set of parent terms shared by $t1$ and $t2$.

\begin{itemize}
\item Resnik's method is defined as:
\begin{center}
$sim(t1,t2) = -\ln p_{ms}(t1,t2)$
\end{center}
\item Lin's method is defined as:
\begin{center}
$sim(t1,t2)=\displaystyle\frac{2 \times \ln p_{ms}(t1,t2)}{\ln p(t1) + \ln p(t2)}$
\end{center}
\item Schlicker's method, which combines Resnik's and Lin's methods, is defined as:
\begin{center}
$sim(t1,t2)=\displaystyle\frac{2 \times \ln p_{ms}(t1,t2)}{\ln p(t1) + \ln p(t2)} \times (1-p_{ms}(t1,t2))$
\end{center}
\item Jiang and Conrath's method is defined as:
\begin{center}
$sim(t1,t2) = 1-\min(1, d(t1,t2))$
\end{center}
where the distance $d$ is non-negative because $p_{ms}(t1,t2)$ is at least as large as $p(t1)$ and $p(t2)$:
\begin{center}
$d(t1,t2)= 2 \times \ln p_{ms}(t1,t2) - \ln p(t1) - \ln p(t2)$
\end{center}
\end{itemize}

Graph-based methods use the topology of the DO graph structure to compute semantic similarity.
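
Before turning to the graph-based formulation, the IC-based measures above can be illustrated with a small worked example. The probabilities used here are hypothetical values chosen only for illustration and are not taken from any DO annotation corpus: suppose $p(t1)=0.05$, $p(t2)=0.08$, and the most informative shared parent has $p_{ms}(t1,t2)=0.1$. The subscripts on $sim$ are added here only to tell the methods apart. To three decimal places:

\begin{center}
$sim_{Resnik}(t1,t2) = -\ln 0.1 \approx 2.303$
\end{center}
\begin{center}
$sim_{Lin}(t1,t2) = \displaystyle\frac{2 \times \ln 0.1}{\ln 0.05 + \ln 0.08} \approx \frac{-4.605}{-5.521} \approx 0.834$
\end{center}
\begin{center}
$sim_{Rel}(t1,t2) \approx 0.834 \times (1-0.1) \approx 0.751$
\end{center}

In this sketch Resnik's score has no upper bound, whereas Lin's score stays between 0 and 1 because a shared parent is at least as frequent as either term. Schlicker's score additionally scales Lin's score by $1-p_{ms}(t1,t2)$, so a shared parent close to the root of the ontology (with $p_{ms}$ near 1) lowers the similarity.
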
Formally, a DO term A can be represented as $DAG_{A}=(A,T_{A},E_{A})$ where $T_{A}$ is the set of DO terms in $DAG_{A}$, including term A and all of its ancestor terms in the DO graph, and $E_{A}$ is the set of edges connecting the DO terms in $DAG_{A}$.

\begin{itemize}
\item Wang's method

To encode the semantics of a DO term in a measurable format that enables quantitative comparison, Wang first defined the semantic value of term A as the aggregate contribution of all terms in $DAG_{A}$ to the semantics of term A; terms closer to term A in $DAG_{A}$ contribute more to its semantics. The contribution of a DO term \textit{t} to the semantics of DO term A is defined as the S-value of DO term \textit{t} related to term A. For any term \textit{t} in $DAG_{A}$, its S-value related to term A, $S_{A}(\textit{t})$, is defined as:

\[\left\{ \begin{array}{l}
S_{A}(A)=1 \\
S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') \mid \textit{t}' \in \mbox{children of}(\textit{t}) \} \quad \mbox{if } \textit{t} \ne A
\end{array} \right.\]

where $w_{e}$ is the semantic contribution factor for edge $e \in E_{A}$ linking term \textit{t} with its child term \textit{t}'. The contribution of term A to its own semantics is defined as one. After obtaining the S-values for all terms in $DAG_{A}$, the semantic value of DO term A, $SV(A)$, is calculated as:

\begin{center}
$SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)$
\end{center}

Thus, given two DO terms A and B, the semantic similarity between these two terms, $sim_{Wang}(A, B)$, is defined as:

\begin{center}
$sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{\left(S_{A}(t) + S_{B}(t)\right)}}{SV(A) + SV(B)}$
\end{center}

where $S_{A}(\textit{t})$ is the S-value of DO term \textit{t} related to term A and $S_{B}(\textit{t})$ is the S-value of DO term \textit{t} related to term B.
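
As a worked illustration of these definitions, consider a hypothetical three-level fragment of an ontology: a root term R with a single child C, which in turn has two children A and B. Assume, purely for illustration, that every edge has the same semantic contribution factor $w_{e}=0.8$; this value is an assumption of the example and is not fixed by the definitions above. Then $T_{A}=\{A,C,R\}$ and $T_{B}=\{B,C,R\}$, and the S-values related to A are:

\begin{center}
$S_{A}(A)=1, \quad S_{A}(C)=0.8 \times 1=0.8, \quad S_{A}(R)=0.8 \times 0.8=0.64$
\end{center}

so that $SV(A)=1+0.8+0.64=2.44$. By symmetry, $S_{B}(C)=0.8$, $S_{B}(R)=0.64$ and $SV(B)=2.44$. The terms shared by the two graphs are C and R, which gives:

\begin{center}
$sim_{Wang}(A, B) = \displaystyle\frac{(0.8+0.8)+(0.64+0.64)}{2.44+2.44} = \frac{2.88}{4.88} \approx 0.590$
\end{center}
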
This method, proposed by Wang \citep{Wang2007}, determines the semantic similarity of two DO terms based on both the locations of these terms in the DO graph and their relations with their ancestor terms.
\end{itemize}

\section{Enrichment Analysis}

Enrichment analysis is a widely used approach to identify biological themes. Here we implement a hypergeometric model to assess whether the number of selected genes associated with a disease is larger than expected. We also implement a bar plot and a gene-category network plot for visualization.

\begin{itemize}
\item Calculation of Statistical Significance

To determine whether any DO terms annotate a specified list of genes at a frequency greater than would be expected by chance, \DOSE{} calculates a p-value using the hypergeometric distribution:

\begin{center}
$ p = 1 - \displaystyle\sum_{i = 0}^{k-1} \frac{ {M \choose i} {{N-M} \choose {n-i}} } { {N \choose n} } $
\end{center}

In this equation, \textit{N} is the total number of genes in the background distribution, \textit{M} is the number of genes within that distribution that are annotated (either directly or indirectly) to the node of interest, \textit{n} is the size of the list of genes of interest and \textit{k} is the number of genes within that list which are annotated to the node. By default, the background distribution consists of all genes that have a DO annotation.
\end{itemize}

\section{Example}

The following lines provide a quick and simple example of the use of \Rpackage{DOSE}.

\begin{itemize}
\item Calculate DO term similarity

<<>>=
data(DO2EG)
set.seed(123)
terms <- list(a=sample(names(DO2EG), 5), b=sample(names(DO2EG), 6))
terms
## Setting Parameters...
params <- new("DOParams", IDs=terms, type="DOID", method="Wang")
## Calculating Semantic Similarities...
sim(params)
@

Four combine methods, called \textit{max}, \textit{average}, \textit{rcmax} and \textit{BMA}, are implemented to combine the semantic similarity scores of multiple DO terms.
<<>>=
params <- new("DOParams", IDs=terms, type="DOID", method="Wang", combine="BMA")
sim(params)
@

\item Calculate gene product similarity

<<>>=
geneid <- list(a=c("5320", "338"), b=c("341", "581", "885"))
params <- new("DOParams", IDs=geneid, type="GeneID", method="Wang", combine="BMA")
x <- sim(params)
x
@

\Rpackage{DOSE} implements \Rfunction{simplot} to visualize the semantic similarity matrix.

\begin{figure}[h]
\begin{center}
<<>>=
simplot(x)
@
\caption{\label{simplot} Heatmap plot for semantic similarity matrix}
\end{center}
\end{figure}

\item Enrichment analysis of a list of genes can also be performed, as shown in the following examples.

<<>>=
data(AL1)
x <- enrichDO(AL1, pvalueCutoff=0.05)
head(summary(x))
@

Users can use the following command to map gene IDs to their corresponding gene symbols.

<<>>=
setReadable(x) <- TRUE
head(summary(x))
@

The \Rpackage{DOSE} package implements a bar plot and a gene-category network plot for visualization.

\begin{figure}[h]
\begin{center}
<<>>=
plot(x, type="bar")
@
\caption{\label{barplot} Bar Plot of Enrichment Result}
\end{center}
\end{figure}

\begin{figure}[h]
\begin{center}
<<>>=
plot(x, categorySize="geneNum", output="fixed")
@
\caption{\label{cnetplot} Category-Network Plot of Enrichment Result}
\end{center}
\end{figure}

In the category-network plot, if expression values are provided, the \Rfunction{plot} function will use them to label the gene nodes. Red indicates up-regulated genes and green indicates down-regulated genes.

<<>>=
AL1expr
@

The plot can be regenerated using these log fold change expression values as follows:

\begin{figure}[h]
\begin{center}
<<>>=
plot(x, showCategory=5, logFC=AL1expr, categorySize="geneNum", output="fixed")
@
\caption{\label{cnetplotFC} Category-Network Plot of Enrichment Result with Log Fold Change Values}
\end{center}
\end{figure}
\end{itemize}

\section{Session Information}

The versions of R and of the packages loaded when generating this vignette were:

\begin{verbatim}
<<>>=
sessionInfo()
@
\end{verbatim}

\bibliography{DOSE}
\end{document}