% \VignetteIndexEntry{An introduction to GOSemSim}
% \VignetteDepends{org.Hs.eg.db,GO.db}
% \VignetteSuggests{cluster}
% \VignetteKeywords{GO Semantic Similarity Measurement}
% \VignettePackage{GOSemSim}

\documentclass[a4paper]{article}

\usepackage{Sweave}
\usepackage{a4wide}
\usepackage{times}
\usepackage{hyperref}
\usepackage[T1]{fontenc}
\usepackage[english]{babel}
\usepackage{framed}
\usepackage{longtable}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage[authoryear,round]{natbib}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\R}{\textsf{R}}

\bibliographystyle{plainnat}

\title{GO-terms Semantic Similarity Measures}
\author{Guangchuang Yu \\
College of Life Science and Technology \\
Jinan University, Guangzhou, China \\
email: \texttt{guangchuangyu@gmail.com}}

\begin{document}


\maketitle


<<options,echo=FALSE>>=
options(width=60)
library(GOSemSim)
library(org.Hs.eg.db)
library(GO.db)
@

\section{Introduction}

Functional similarity of gene products can be estimated by controlled
biological vocabularies, such as Gene Ontology (GO). GO comprises of
three orthogonal ontologies, i.e. molecular function (MF), biological
process (BP), and cellular component (CC).
\\

Four methods have been presented to determine the semantic similarity
of two GO terms based on the annotation statistics of their common
ancestor terms (Resnik  \citep{philip_semantic_1999}, Jiang  \citep{jiang_semantic_1997},
Lin  \citep{lin_information-theoretic_1998} and Schlicker  \citep{schlicker_new_2006}).
Wang  \citep{wang_new_2007} proposed a
new method to measure the similarity based on the graph structure of
GO. Each of these methods has its own advantages and
weaknesses. \Rpackage{GOSemSim} package  \citep{yu2010} is developed to compute
semantic similarity among GO terms, sets of GO terms, gene products,
and gene clusters, providing both five methods mentioned above. I have
developed another package, \Rpackage{DOSE}, for measuring semantic
similarity among DO terms and gene products at disease perspective.
\\

To start with \Rpackage{GOSemSim} package, type following code below:

<<results=hide>>=
library(GOSemSim)
help(GOSemSim)
@


\section{Citation}

Please cite the following articles when using
\Rpackage{GOSemSim}. \\
\\
G Yu, F Li, Y Qin, X Bo, Y Wu, S Wang. GOSemSim: an R package for
measuring semantic similarity among GO terms and gene
products. \textit{Bioinformatics}. 2010,26(7):976-978.\\


\section{Semantic Similarity Measurement Based on GO}

Four methods proposed by Resnik  \citep{philip_semantic_1999},
Jiang  \citep{jiang_semantic_1997}, Lin  \citep{lin_information-theoretic_1998}
and Schlicker  \citep{schlicker_new_2006} are information content (IC) based, which
depend on the frequencies of two GO terms involved and that of their
closest common ancestor term in a specific corpus of GO
annotations. The information content of a GO term is computed by the
negative log probability of the term occurring in GO corpus, and is
defined as $IC(t) = -\log{p(t)}$. A rarely used term contains a
greater amount of information.


At present, \Rpackage{GOSemSim} supports analysis
on many species. We used the following Bioconductor packages to
calculate the information content.

\begin{itemize}
  \item org.At.tair.db    for \textit{Arabidopsis}
  \item org.Ag.eg.db      for \textit{Anopheles}
  \item org.Bt.eg.db      for \textit{Bovine}
  \item org.Cf.eg.db      for \textit{Canine}
  \item org.Gg.eg.db      for \textit{Chicken}
  \item org.Pt.eg.db      for \textit{Chimp}
  \item org.Sco.eg.db     for \textit{Coelicolor}
  \item org.EcK12.eg.db   for \textit{E coli strain K12}
  \item org.EcSakai.eg.db for \textit{E coli strain Sakai}
  \item org.Dm.eg.db      for \textit{Fly}
  \item org.Hs.eg.db      for \textit{Human}
  \item org.Pf.plasmo.db  for \textit{Malaria}
  \item org.Mm.eg.db      for \textit{Mouse}
  \item org.Ss.eg.db      for \textit{Pig}
  \item org.Rn.eg.db      for \textit{Rat}
  \item org.Mmu.eg.db     for \textit{Rhesus}
  \item org.Ce.eg.db      for \textit{Worm}
  \item org.Xl.eg.db      for \textit{Xenopus}
  \item org.Sc.sgd.db     for \textit{Yeast}
  \item org.Dr.eg.db      for \textit{Zebrafish}
\end{itemize}

The information content will update regularly.


As GO allow multiple parents for each concept, two terms can share
parents by multiple paths. We take the most informative common
ancestor (MICA), where there is more than one shared parents.


\indent
\begin{itemize}
\item Resnik's method is defined as:
  \begin{center}
    $sim_{Resnik}(t_1,t_2) = IC(MICA)$
  \end{center}
\item Lin's method is defined as:
  \begin{center}
    $sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}$
  \end{center}
\item The Relevance method, which was proposed by Schlicker's method,
  combine Resnik's and Lin's method, and is defined as:
  \begin{center}
    $sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}$
  \end{center}
\item Jiang and Conrath's method is defined as:
  \begin{center}
    $sim_{Jiang}(t_1,t_2) = 1-\min(1, d_{Jiang}(t_1, t_2))$
  \end{center}

  where
  \begin{center}
    $d_{Jiang}(t_1, t_2) = IC(t_1) + IC(t_2) - 2IC(MICA)$
  \end{center}

\end{itemize}


Graph-based methods using the topology of GO graph structure to
compute semantic similarity. Formally, a GO term A can be represented
as $DAG_{A}=(A,T_{A},E_{A})$ where $T_{A}$ is the set of GO terms in
$DAG_{A}$, including term A and all of its ancestor terms in the GO
graph, and $E_{A}$ is the set of edges connecting the GO terms in
$DAG_{A}$.

\begin{itemize}
  \item Wang's method

    To encode the semantics of a GO term in a measurable format to enable
    a quantitative comparison between two term's semantics, Wang firstly
    defined the semantic value of term A as the aggregate contribution of
    all terms in $DAG_{A}$ to the semantics of term A, terms closer to
    term A in $DAG_{A}$ contribute more to its semantics. Thus, defined the
    contribution of a GO term \textit{t} to the semantics of GO term A as
    the S-value of GO term \textit{t} related to term A. For any of term
    \textit{t} in $DAG_{A}=(A,T_{A},E_{A})$, its S-value related to term
    A. $S_{A}(\textit{t})$ is defined as:
    \begin{center}
      \[\left\{
        \begin{array}{l}
          S_{A}(A)=1 \\
          S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') |
          \textit{t}' \in childrenof(\textit{t}) \}$ if $\textit{t}
          \ne A
        \end{array}
      \right.\]
    \end{center}

    where $w_{e}$ is the semantic contribution factor for edge $e \in
    E_{A}$ linking term \textit{t} with its child term \textit{t}'. Wang
    defined term A contributes to its own as one. After obtaining the
    S-values for all terms in $DAG_{A}$, the semantic value of GO term A,
    SV(A), is calculated as:
    \begin{center}
      $SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)$
    \end{center}

    Thus, given two GO terms A and B, the semantic similarity between these two
    terms, $sim_{Wang}(A, B)$, is defined as:
    \begin{center}
      $sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap
          T_{B}}{S_{A}(t) + S_{B}(t)}}{SV(A) + SV(B)}$
    \end{center}

    where $S_{A}(\textit{t})$ is the S-value of GO term \textit{t} related
    to term A and $S_{B}(\textit{t})$ is the S-value of GO term \textit{t}
    related to term B.


This method proposed by Wang  \citep{wang_new_2007} determines the semantic
similarity of two GO terms based on both the locations of these terms
in the GO graph and their relations with their ancestor terms.

\end{itemize}

On the basis of semantic similarity between GO terms, \Rpackage{GOSemSim} can
also compute semantic similarity among sets of GO terms, gene
products, and gene clusters.


We implemented four methods which called \textit{max},
\textit{avg}, \textit{rcmax}, and \textit{BMA} to combine
semantic similarity scores of multiple GO terms. The similarities
among gene products and gene clusters which annotated by multiple GO
terms were also calculated by the same combine methods mentioned
above.

Given two GO terms sets $GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}$
and $GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}$ annotated for tow
genes $g_1$ and $g_2$, four combine methods for calculating gene
similarity are defined as follows:

\begin{itemize}

\item The \textit{max} method calculates the maximum semantic similarity score over all pairs of GO
terms between these two sets.

$sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})$

\item The \textit{avg} calcuate the average semantic similarity score over all pairs of GO terms.

  $sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}$

\item Similarities among GO terms form a matrix, and method \textit{rcmax}
use the maximum of RowScore and ColumnScore as the similarity, where
RowScore (or ColumnScore) is the average of maximum similarities on
each row (or column).

$sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1
    \le j \le n} sim(go_{1i}, go_{2j})}{m},
\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},
  go_{2j})}{n})$


\item The \textit{BMA} method, used the best-match average strategy,
  calculates the average of all maximum similarities on each row and
  column, and defined as:

$sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{1=i}^m \max_{1 \le j
    \le n}sim(go_{1i}), go_{2j}) + \displaystyle\sum_{1=j}^n \max_{1
    \le i \le m}sim(go_{1i}, go_{2j})} {m+n}$

\end{itemize}

\section{Examples}

\Rpackage{GOSemSim} implemented multiple functions for calculate
semantic similarities:
\begin{itemize}
\item goSim for calculate semantic similarity between two GO terms.
\item mgoSim for calculate semantic similarity among multiple GO terms.
\item geneSim for calculate semantic similarity between two gene products.
\item mgeneSim for calculate semantic similarity among multiple gene products.
\item clusterSim for calculate semantic similarity between two gene clusters.
\item mclusterSim for calculate semantic similarity among multiple
  gene clusters.
\end{itemize}

The following example demonstrated the function calls of these
function, details about the arguments can refer to the manuals (eg
\Rfunction{?geneSim}).
<<function call>>=
goSim("GO:0004022", "GO:0005515", ont="MF", measure="Wang")
go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, ont="MF", measure="Wang", combine="BMA")

geneSim("241", "251", ont="MF", organism="human", measure="Wang", combine="BMA")
mgeneSim(genes=c("835", "5261","241", "994"), ont="MF", organism="human", measure="Wang")
@


<<clusterSim, eval=F>>=
gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, ont="MF", organism="human", measure="Wang", combine="BMA")

x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed <- 123
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, ont="MF", organism="human", measure="Wang", combine="BMA")
@

\section{Case Study}

In  \cite{yu_new_2011}, we proposed a method for measuring functional
similarity of microRNAs. This method was based on semantic similarity
of microRNAs' target genes, and was calculated by \Rpackage{GOSemSim}. We
further analyzed viral microRNAs using this method  \citep{yu_new_2011}
and compared significant KEGG pathways regulated by different viruses'
microRNAs using \Rpackage{clusterProfiler}  \citep{yu2012}.


\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

\begin{verbatim}
<<echo=FALSE,results=tex>>=
sessionInfo()
@
\end{verbatim}

\bibliography{GOSemSim}
\end{document}