% \VignetteIndexEntry{An introduction to clusterProfiler}
% \VignettePackage{clusterProfiler}}
% \VignetteEngine{knitr::knitr}

% To compile this document, run the commands within R
% R CMD Sweave --pdf clusterProfiler.Rnw


\documentclass[12pt]{article}

<<knitr, echo=FALSE, results="hide">>=
knitr::opts_chunk$set(tidy         = FALSE,
                      out.truncate = 80,
                      out.lines    = 6,
                      dev          = 'pdf',
                      include      = TRUE,
                      fig.width    = 6,
                      fig.height   = 6,
                      resolution   = 100,
                      message      = FALSE,
                      warning      = FALSE)
@ 

<<style-Sweave, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@ 


<<loadPKG, echo=FALSE>>=
library(DOSE)
library(GO.db)
library(org.Hs.eg.db)
library(clusterProfiler)
@


\author{Guangchuang Yu \\[1em]
  \small{School of Public Health} \\
  \small{The University of Hong Kong} \\
  \small{\email{guangchuangyu@gmail.com}}
}


\title{Statistical analysis and visualization of functional profiles for genes and gene clusters}


\begin{document}
\maketitle

<<options,echo=FALSE>>=
options(digits=3, width=80, prompt=" ", continue=" ")
@

\begin{abstract}
  \Biocpkg{clusterProfiler} supports enrichment analysis of Gene Ontology (GO) and Kyoto Encyclopedia of genes and Genomes (KEGG) with either hypergeometric test or Gene Set Enrichment Analysis (GSEA). \Biocpkg{clusterProfiler} adjust the estimated significance level to account for multiple hypothesis testing and also \textit{q-values} were calculated for FDR control. It supports several visualization methods, including \Rfunction{barplot}, \Rfunction{cnetplot}, \Rfunction{enrichMap} and \Rfunction{gseaplot}. \Biocpkg{clusterProfiler} also supports comparing functional profiles among gene clusters. It supports comparing biological themes of GO, KEGG, Disease Ontology (via \Biocpkg{DOSE}) and Reactome pathways (via \Biocpkg{ReactomePA}).
  
  \begin{center}
    
    \vspace{1em}
    \textbf{\Biocpkg{clusterProfiler} version:} \Sexpr{packageVersion("clusterProfiler")}
    \vspace{1em}
    
    \begin{tabular}{ | l | }
      \hline
      If you use \Biocpkg{clusterProfiler} in published research, please cite: \\
      \\
      G Yu, LG Wang, Y Han, QY He. \textbf{clusterProfiler: an R package for} \\
      \textbf{comparing biological themes among gene clusters}. \\
      \emph{Journal of Integrative Biology} 2012, 16(5):284-287. \\
      \url{http://dx.doi.org/10.1089/omi.2011.0118} \\
      \hline
    \end{tabular}
    
  \end{center}
\end{abstract}

\newpage
\tableofcontents
\newpage

\section{Introduction}

In recently years, high-throughput experimental techniques such as
microarray, RNA-Seq and mass spectrometry can detect cellular
molecules at systems-level. These kinds of analyses generate huge
quantitaties of data, which need to be given a biological
interpretation. A commonly used approach is via clustering in the gene
dimension for grouping different genes based on their similarities \cite{yu2010}.

To search for shared functions among genes, a common way is to
incorporate the biological knowledge, such as Gene Ontology (GO) and
Kyoto Encyclopedia of Genes and Genomes (KEGG), for identifying
predominant biological themes of a collection of genes.

After clustering analysis, researchers not only want to determine
whether there is a common theme of a particular gene cluster, but also
to compare the biological themes among gene clusters. The manual step
to choose interesting clusters followed by enrichment analysis on each
selected cluster is slow and tedious. To bridge this gap, we designed
\Biocpkg{clusterProfiler} \cite{yu2012}, for comparing and visualizing functional
profiles among gene clusters.

\section{bitr: Biological Id TranslatoR}

Many new R user may find traslating ID is a tedious task and I have received many feedbacks from \Biocpkg{clusterProfiler} users that they don't know how to convert gene symbol, uniprot ID or other ID types to Entrez gene ID that used in \Biocpkg{clusterProfiler} for most of the species.

To remove this obstacle, We provide \Rfunction{bitr} function for translating among different gene ID types.


<<bitr>>=
x <- c("GPX3",  "GLRX",   "LBP",   "CRYAB", "DEFB1", "HCLS1",   "SOD2",   "HSPA2", 
       "ORM1",  "IGFBP1", "PTHLH", "GPC3",  "IGFBP3","TOB1",    "MITF",   "NDRG1", 
       "NR1H4", "FGFR3",  "PVR",   "IL6",   "PTPRM", "ERBB2",   "NID2",   "LAMB1", 
       "COMP",  "PLS3",   "MCAM",  "SPP1",  "LAMC1", "COL4A2",  "COL4A1", "MYOC",  
       "ANXA4", "TFPI2",  "CST6",  "SLPI",  "TIMP2", "CPM",     "GGT1",   "NNMT",
       "MAL",   "EEF1A2", "HGD",   "TCN2",  "CDA",   "PCCA",    "CRYM",   "PDXK",  
       "STC1",  "WARS",  "HMOX1", "FXYD2", "RBP4",   "SLC6A12", "KDELR3", "ITM2B")
eg = bitr(x, fromType="SYMBOL", toType="ENTREZID", annoDb="org.Hs.eg.db")
head(eg)
@ 

User should provides an annotation package, both \textit{fromType} and \textit{toType} can accept any types that supported.

User can use \Rfunction{idType} to list all supporting types.
<<idType>>=
idType("org.Hs.eg.db")
@

We can translate from one type to other types.
<<bitr2>>=
ids <- bitr(x, fromType="SYMBOL", toType=c("UNIPROT", "ENSEMBL"), annoDb="org.Hs.eg.db")
head(ids)
@ 

\section{Gene Ontology analysis}
\subsection{Supported organisms}
At present, GO analysis in \Biocpkg{clusterProfiler} supports about 20 species internally as shown below:
    
\begin{itemize}
  \item \textit{Arabidopsis}
  \item \textit{Anopheles}
  \item \textit{Bovine}
  \item \textit{Canine}
  \item \textit{Chicken}
  \item \textit{Chimp}
  \item \textit{Coelicolor}
  \item \textit{E coli strain K12}
  \item \textit{E coli strain Sakai}
  \item \textit{Fly}
  \item \textit{Gondii}
  \item \textit{Human}
  \item \textit{Malaria}
  \item \textit{Mouse}
  \item \textit{Pig}
  \item \textit{Rat}
  \item \textit{Rhesus}
  \item \textit{Worm}
  \item \textit{Xenopus}
  \item \textit{Yeast}
  \item \textit{Zebrafish}
\end{itemize}

For un-supported organisms, user can use their own GO annotation data (in data.frame format with one column of GO and another column of gene ID) and passed it to \Rfunction{buildGOmap} function, which will generate annotation file that suitable for GO analysis in \Biocpkg{clusterProfiler}. In future version, we may add functions to help user query annotation from public available database.


\subsection{Gene Ontology Classification}
In \Biocpkg{clusterProfiler}, \Rfunction{groupGO} is designed for gene classification based on GO distribution at a specific level.

<<groupGO>>=
library("DOSE")
data(geneList)
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
ggo <- groupGO(gene     = gene,
               organism = "human",
               ont      = "BP",
               level    = 3,
               readable = TRUE)
head(summary(ggo))
@

The input parameters of \textit{gene} is a vector of gene IDs. It expects entrezgene for most of the organisms. For yeast, it should be ORF IDs; \textit{organism} should be the common name of supported species.
If \textit{readable} is setting to \Robject{TRUE}, the input gene IDs will be converted to gene symbols.


\subsection{GO over-representation test}

Over-representation test \cite{boyle2004} is a widely used approach to identify biological themes. Here we implement hypergeometric model to assess whether the number of selected genes associated with disease is larger than expected. 

To determine whether any terms annotate a specified list of genes at frequency greater than that would be expected by chance, \Biocpkg{clusterProfiler} calculates a p-value using the hypergeometric distribution:

$
p = 1 - \displaystyle\sum_{i = 0}^{k-1}
  \frac{
      {M \choose i}
      {{N-M} \choose {n-i}}
    } {
      {N \choose n}
    }
$

In this equation, \textit{N} is the total number of genes in the
background distribution, \textit{M} is the number of genes within that
distribution that are annotated (either directly or indirectly) to the
node of interest, \textit{n} is the size of the list of genes of
interest and \textit{k} is the number of genes within that list which
are annotated to the node. The background distribution by default is
all the genes that have annotation. User can set the background via \textit{universe} parameter.

P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control.

<<enrichGO>>=
ego <- enrichGO(gene          = gene,
                universe      = names(geneList),
                organism      = "human",
                ont           = "CC",
                pAdjustMethod = "BH",
                pvalueCutoff  = 0.01,
                qvalueCutoff  = 0.05,
                readable      = TRUE)
head(summary(ego))
@

The input parameter \textit{universe} is the background gene list. If user not explicitly setting this parameter, it will use all the genes that have GO annotation. \textit{pAdjustMethod} specify the method for adjusting p-values. The \textit{pvalueCutoff} parameter is use to restrict the result based on the p-values and the adjusted p values while \textit{qvalueCutoff} is used to control q-values.

\subsection{GO Gene Set Enrichment Analysis}
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) \cite{subramanian_gene_2005} directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
\\
\\
Genes are ranked based on their phenotypes. Given a priori defined set of gens \textit{S} (e.g., genes shareing the same \textit{GO} or \textit{KEGG} category), the goal of GSEA is to determine whether the members of \textit{S} are randomly distributed throughout the ranked gene list (\textit{L}) or primarily found at the top or bottom.
\\
\\
There are three key elements of the GSEA method:
\begin{itemize}
	\item Calculation of an Enrichment Score.\\
The enrichment score (\textit{ES}) represent the degree to which a set \textit{S} is over-represented at the top or bottom of the ranked list \textit{L}. The score is calculated by walking down the list \textit{L}, increasing a running-sum statistic when we encounter a gene in \textit{S} and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The \textit{ES} is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic \cite{subramanian_gene_2005}.
	\item Esimation of Significance Level of \textit{ES}.\\
The \textit{p-value} of the \textit{ES} is calculated using permutation test. Specifically, we permute the gene labels of the gene list \textit{L} and recompute the \textit{ES} of the gene set for the permutated data, which generate a null distribution for the \textit{ES}. The \textit{p-value} of the observed ES is then calculated relative to this null distribution.
	\item Adjustment for Multiple Hypothesis Testing.\\
When the entire \textit{GO} or \textit{KEGG} gene sets is evaluated, \Biocpkg{clusterProfiler} adjust the estimated significance level to account for multiple hypothesis testing and also \textit{q-values} were calculated for FDR control.
\end{itemize}

<<gseGO>>=
ego2 <- gseGO(geneList     = geneList,
              organism     = "human",
              ont          = "CC",
              nPerm        = 100,
              minGSSize    = 120,
              pvalueCutoff = 0.01,
              verbose      = FALSE)
head(summary(ego2))
@

GSEA use permutation test, user can set \textit{nPerm} for number of permutations. Gene Set size below \textit{minGSSize} will be omitted.


\subsection{GO Semantic Similarity Analysis}
GO semantic similarity can be calculated by \Biocpkg{GOSemSim} \cite{yu2010}. We can use it to cluster genes/proteins into different clusters based on their functional similarity and can also use it to measure the similarities among GO terms to reduce the redundancy of GO enrichment results.

\section{KEGG analysis}
The annotation package, KEGG.db, is not updated since 2012. It's now pretty old and in \Biocpkg{clusterProfiler}, \Rfunction{enrichKEGG} supports downloading latest online version of KEGG data for enrichment analysis. Using KEGG.db is also supported by explicitly setting \textit{use\_internal\_data} parameter to \Robject{TRUE}, but it's not recommended. 

With this new feature, organism is not restricted to those supported in previous release, it can be any species that have KEGG annotation data available in KEGG database. User should pass abbreviation of academic name to the \Robject{organism} parameter. The full list of KEGG supported organisms can be accessed via \url{http://www.genome.jp/kegg/catalog/org_list.html}.

\subsection{KEGG over-representation test}

To speed up the compilation of this document, we set \textit{use\_internal\_data = TRUE}.

<<enrichKEGG>>=
kk <- enrichKEGG(gene         = gene,
                 organism     = "human",
                 pvalueCutoff = 0.05, 
                 readable     = TRUE,
                 use_internal_data = TRUE)
head(summary(kk))
@

\subsection{KEGG Gene Set Enrichment Analysis}
<<gseKEGG>>=
kk2 <- gseKEGG(geneList     = geneList,
               organism     = "human",
               nPerm        = 100,
               minGSSize    = 120,
               pvalueCutoff = 0.01,
               verbose      = FALSE,
               use_internal_data = TRUE)
head(summary(kk2))
@

\section{Disease Ontology analysis}

\Biocpkg{DOSE} \cite{yu_dose_2015} supports Disease Ontology (DO) Semantic and Enrichment analysis, please refer to the package vignettes. The \Rfunction{enrichDO} function is very useful for identifying disease association of interesting genes, and function \Rfunction{gseAnalyzer} function is designed for gene set enrichment analysis of \textit{DO}.

\section{Reactome pathway analysis}

With the demise of KEGG (at least without subscription), the KEGG pathway data in Bioconductor will not update and we encourage user to analyze pathway using \Biocpkg{ReactomePA} which use Reactome as a source of pathway data. The function call of \Rfunction{enrichPathway} and \Rfunction{gsePathway} in \Biocpkg{ReactomePA} is consistent with \Rfunction{enrichKEGG} and \Rfunction{gseKEGG}.

\section{DAVID functional analysis}

\Biocpkg{clusterProfiler} provides enrichment and GSEA analysis with GO, KEGG, DO and Reactome pathway supported internally, some user may prefer GO and KEGG analysis with DAVID \cite{huang_david_2007} and still attracted by the visualization methods provided by \Biocpkg{clusterProfiler} \cite{paranjpe_genome_wid_2013}. To bridge the gap between DAVID and clusterProfiler, we implemented \Rfunction{enrichDAVID}. This function query enrichment analysis result from DAVID webserver via RDAVIDWebService \cite{fresno_rdavidwebservice_2013} and stored the result as an \Robject{enrichResult} instance, so that we can use all the visualization functions in \Biocpkg{clusterProfiler} to visualize DAVID results. \Rfunction{enrichDAVID} is fully compatible with \Rfunction{compareCluster} function and comparing enrichment results from different gene clusters is now available with DAVID.

<<enrichDAVID, eval=FALSE>>=
david <- enrichDAVID(gene = gene,
                     idType = "ENTREZ_GENE_ID",
                     listType = "Gene",
                     annotation = "KEGG_PATHWAY")
@ 


\section{Visualization}
The function calls of \Rfunction{groupGO}, \Rfunction{enrichGO}, \Rfunction{enrichKEGG}, \Rfunction{enrichDO} and \Rfunction{enrichPathway} are consistent and all the output can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.

\subsection{barplot}
<<barplot, fig.height=5, fig.width=6>>=
barplot(ggo, drop=TRUE, showCategory=12)
@


<<barplot-enrich, fig.height=5, fig.width=8>>=
barplot(ego, showCategory=8)
@


\subsection{enrichMap}
Enrichment map can be viusalized by \Rfunction{enrichMap}, which also support results obtained from hypergeometric test and gene set enrichment analysis.
<<enrichMap, fig.cap="enrichment map of enrichment result", fig.align="center", fig.height=16, fig.width=16, out.width="0.9\\textwidth", fig.pos="h">>=  
enrichMap(ego)
@ 

%% <<enrichMap2, fig.cap="enrichment map of gsea result", fig.align="center", fig.height=16, fig.width=16, out.width="0.9\\textwidth", fig.pos="h">>=  
%% enrichMap(ego2)
%% @ 


\subsection{cnetplot}
In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed \Rfunction{cnetplot} function to extract the complex association.
<<cnetplot, fig.height=14, fig.width=14>>=
cnetplot(ego, categorySize="pvalue", foldChange=geneList)
@


<<cnetplot-KEGG, fig.height=14, fig.width=14>>=
cnetplot(kk, categorySize="geneNum", foldChange=geneList)
@

\subsection{gseaplot}
Running score of gene set enrichment analysis and its association of phenotype can be visualized by \Rfunction{gseaplot}.
<<gseaplot, fig.cap="plotting gsea result", fig.align="center", fig.height=6, fig.width=8, out.width="0.6\\textwidth", fig.pos="h">>=  
gseaplot(kk2, geneSetID = "hsa04145")
@


\subsection{pathview from pathview package}
\Biocpkg{clusterProfiler} users can also use \Rfunction{pathview} from the \Rpackage{pathview} \cite{luo_pathview} to visualize KEGG pathway.

The following example illustrate how to visualize "hsa04110" pathway, which was enriched in our previous analysis.

<<viewKEGG, eval=FALSE>>=
library("pathview")
hsa04110 <- pathview(gene.data  = geneList,
                     pathway.id = "hsa04110",
                     species    = "hsa",
                     limit      = list(gene=max(abs(geneList)), cpd=1))
@

\begin{figure}[h]
\centering
\includegraphics[width=.9\textwidth]{figures/hsa04110_pathview.png}
\caption{visualize KEGG pathway using pathview}
\label{viewKEGG}
\end{figure}

For further information, please refer to the vignette of \Biocpkg{pathview} \cite{luo_pathview}.


\section{Biological theme comparison}
\Biocpkg{clusterProfiler} was developed for biological theme comparison \cite{yu2012}, and it provides a function, \Rfunction{compareCluster}, to automatically calculate enriched functional categories of each gene clusters.

<<gcSample>>=
data(gcSample)
lapply(gcSample, head)
@

The input for \textit{geneCluster} parameter should be a named list of gene IDs. 

<<comparecluster>>=
ck <- compareCluster(geneCluster = gcSample, fun = "enrichKEGG", use_internal_data = TRUE)
head(summary(ck))
@


\subsection{Formula interface of compareCluster}
\Rfunction{compareCluster} also supports passing a formula\footnote{The code to support formula has been contributed by Giovanni Dall'Olio.} of type $Entrez \sim group$ or $Entrez \sim group + othergroup$. 

<<formula>>=
## formula interface
mydf <- data.frame(Entrez=c('1', '100', '1000', '100101467',
                       '100127206', '100128071'),
                   group = c('A', 'A', 'A', 'B', 'B', 'B'),
                   othergroup = c('good', 'good', 'bad', 'bad',
                       'good', 'bad'))
xx.formula <- compareCluster(Entrez~group, data=mydf, fun='groupGO')
head(summary(xx.formula))

## formula interface with more than one grouping variable
xx.formula.twogroups <- compareCluster(Entrez~group+othergroup,
                                       data=mydf, fun='groupGO')
head(summary(xx.formula.twogroups))
@

\subsection{Visualization of profile comparison}
We can visualize the result using \Rfunction{plot} method.
<<compareCluster, fig.height=8, fig.width=8>>=
plot(ck)
@

By default, only top 5 (most significant) categories of each cluster
was plotted. User can changes the parameter \textit{showCategory} to
specify how many categories of each cluster to be plotted, and if
\textit{showCategory} was set to \textit{NULL}, the whole result will
be plotted.

The \Rfunction{plot} function accepts a parameter \textit{by} for setting the scale of dot sizes. The default parameter \textit{by} is setting to "geneRatio", which corresponding to the "GeneRatio" column of the output. If it was setting to \textit{count}, the comparison will be based on gene counts, while if setting to \textit{rowPercentage}, the dot sizes will be normalized by \textit{count/(sum of each row)}

To provide the full information, we also provide number of identified genes in each category (numbers in parentheses) when \textit{by} is setting to \textit{rowPercentage} and number of gene clusters in each cluster label (numbers in parentheses) when \textit{by} is setting to \textit{geneRatio}, as shown in Figure 3. If the dot sizes were based on \textit{count}, the row numbers will not shown.

The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values and adjusted p-values were filtered out by the threshold giving by
parameter \textit{pvalueCutoff}, and FDR can be estimated by \textit{qvalue}.

User can refer to the example in \cite{yu2012}; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) \cite{schmidt2008}. We identified 8 gene clusters from differentially expressed genes, and using \Rfunction{compareCluster} to compare these gene clusters by their enriched biological process.


Another example was shown in \cite{yu2011}, we calculated functional similarities among viral miRNAs using method described in \cite{yu_new_2011}, and compared significant KEGG pathways regulated by different viruses using \Rfunction{compareCluster}.

The comparison function was designed as a framework for comparing gene clusters of any kind of ontology associations, not only \Rfunction{groupGO}, \Rfunction{enrichGO}, and \Rfunction{enrichKEGG} provided in this package, but also other biological and biomedical ontologies, for instance, \Rfunction{enrichDO} from \Biocpkg{DOSE} \cite{yu_dose_2015} and \Rfunction{enrichPathway} from \Biocpkg{ReactomePA} work fine with \Rfunction{compareCluster} for comparing biological themes in disease and reactome pathway perspective. More details can be found in the vignettes of \Biocpkg{DOSE} \cite{yu_dose_2015} and \Biocpkg{ReactomePA}.


\section{External documents}

\begin{itemize}
  \item \href{http://ygc.name/2014/08/07/why-clusterprofiler-fails/}{Why clusterProfiler fails}
  \item \href{http://ygc.name/2015/02/01/kegg-enrichment-analysis-with-latest-online-data-using-clusterprofiler/}{KEGG enrichment analysis with latest online data using clusterProfiler}
  \item \href{http://ygc.name/2015/02/10/ggtree-updating-a-tree-view/}{DAVID functional analysis with clusterProfiler}
  \item \href{http://ygc.name/2014/08/03/enrichment-map/}{Enrichment map}
  \item \href{http://bioinfoblog.it/2015/02/a-formula-interface-for-geneontology-analysis/}{a formula interface for GeneOntology analysis}
\end{itemize}

\section{Session Information}


Here is the output of \Rcode{sessionInfo()} on the system on which this document was compiled:

<<sessInfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@


\bibliography{clusterProfiler}
\end{document}