% \VignetteIndexEntry{An introduction to clusterProfiler} % \VignetteDepends{AnnotationDbi, GO.db, org.Hs.eg.db, ggplot2, plyr, % methods, qvalue, KEGG.db, igraph} % \VignetteSuggests{GOSemSim, DOSE} % \VignetteKeywords{Gene Ontology Analysis} % \VignettePackage{clusterProfiler} %\SweaveOpts{prefix.string=images/fig} \documentclass[a4paper]{article} \usepackage{Sweave} \usepackage{a4wide} \usepackage{times} \usepackage{hyperref} \usepackage[T1]{fontenc} \usepackage[english]{babel} \usepackage{framed} \usepackage{longtable} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{amssymb} \usepackage[authoryear,round]{natbib} \textwidth=6.2in \textheight=8.5in %\parskip=.3cm \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newtheorem{theorem}{Theorem}[section] \newcommand{\R}{\textsf{R}} \newcommand{\term}[1]{\emph{#1}} \newcommand{\mref}[2]{\htmladdnormallinkfoot{#2}{#1}} \bibliographystyle{plainnat} \title{Biological Theme Comparison} \author{Guangchuang Yu \\ College of Life Science and Technology \\ Jinan University, Guangzhou, China \\ email: \texttt{guangchuangyu@gmail.com}} \begin{document} \maketitle <>= options(width=60) require(clusterProfiler) @ \section{Introduction} In recently years, high-throughput experimental techniques such as microarray, RNA-Seq and mass spectrometry can detect cellular moleculars at systems-level. These kinds of analysis generate huge quantitaties of data, which need to be given a biological interpretation. A commonly used approach is via clustering in the gene dimension for grouping different genes based on their similarities\citep{yu2010}. \\ To search for shared functions among genes, a common way is to incorporate the biological knowledge, such as Gene Ontology (GO) and Kyoto Encyclopedia of genes and Genomes (KEGG), for identifying predominant biological themes of a collection of genes. After clustering analysis, researchers not only want to determine whether there is a common theme of a particular gene cluster, but also to compare the biological themes among gene clusters. The manual step to choose interesting clusters followed by enrichment analysis on each selected cluster is slow and tedious. To bridge this gap, we designed \Rpackage{clusterProfiler}, for comparing and visulizing functional profiles among gene clusters. \section{Citation} Please cite the following articles when using \Rpackage{clusterProfiler}. \\ \\ G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for comparing biological themes among gene clusters. \textit{OMICS: A Journal of Integrative Biology}. 2012, 16(5), in press. \\ \section{Functional Profiles} In \Rpackage{clusterProfiler}, we implemented three functions to explore the functional profiles of a collection of genes. \begin{itemize} \item \Rfunction{groupGO} for gene classification based on GO distribution at a specific level <>= data(gcSample) x <- groupGO(gene=gcSample[[1]], organism="human", ont="CC", level=2, readable=TRUE) head(summary(x)) @ \item \Rfunction{enrichGO} for GO enrichment analysis <>= y <- enrichGO(gene=gcSample[[2]], organism="human", ont="MF", pvalueCutoff=0.01, qvalueCutoff=0.05, readable=TRUE) head(summary(y)) @ \item \Rfunction{enrichKEGG} for KEGG pathway enrichment analysis. <>= z <- enrichKEGG(gene=gcSample[[3]], organism="human", pvalueCutoff=0.05, qvalueCutoff=0.05, readable=TRUE) head(summary(z)) @ With the demise of KEGG (at least without subscription), the pathway data used in \Rpackage{clusterProfiler} will not update, and we encourage user to use \Rfunction{enrichPathway} in Bioconductor package \Rpackage{ReactomePA}, which use Reactome as a source of pathway data. \end{itemize} The function calls of \Rfunction{groupGO}, \Rfunction{enrichGO} and \Rfunction{enrichKEGG} are similar. The input parameters of \textit{gene} is a vector of entrezgene (for human and mouse) or ORF (for yeast) IDs, and \textit{organism} must be one of "human", "mouse", and "yeast", according to the gene IDs. For GO analysis, \textit{ont} must be assigned to one of "BP", "MF", and "CC" for biological process, molecular function and cellular component, respectively. In \Rfunction{groupGO}, the \textit{level} specify the GO level for gene projection. In enrichment analysis, the \textit{pvalueCutoff} is to restrict the result based on their pvalues, and \textit{qvalueCutoff} is to control false discovery rate (FDR) to prevent high FDR in multiple testing. The \textit{readable} is a logical parameter to indicate the input gene IDs will map to gene symbols or not. \section{Biological theme comparison} \Rpackage{clusterProfiler} was developed for biological theme comparison, and it supplies a function, \Rfunction{compareCluster}, to automatically calculate enriched functional categories of each gene clusters. As we demonstrated in \cite{yu2012}, we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) \citep{schmidt2008}. We identified 8 gene clusters from differentially expressed genes, and using \Rfunction{compareCluster} to compare these gene clusters by their enriched biological process, with the strict cutoff of p-values < 0.01 and q-values < 0.05. The analysis result was illustrated in Figure 1. More details of this analysis are described in \cite{yu2012}. \begin{figure}[htb] \begin{center} \includegraphics[width=.7\linewidth]{omics2012.pdf} \caption{Comparison of GO enrichment of gene clusters} \label{Fig:omicsYu2012} \end{center} \end{figure} Another example was shown in \cite{yu2011}, we calculated functional similarities among viral miRNAs using method described in \cite{yu_new_2011}, and compared significant KEGG pathways regulated by different viruses using \Rpackage{clusterProfiler}. The comparison function was designed as a general-package for comparing gene clusters of any kind of gene-ontology associations, not only GO and KEGG this package provided, but also other biological and biomedical ontologies. For example, \Rfunction{compareCluster} can cooperate seamless with \Rpackage{DOSE} and \Rpackage{ReactomePA} and compare gene cluster in the context of disease and reactome pathway as demonstrated in the online vignette of \Rpackage{DOSE} and \Rpackage{ReactomePA} respectively. \section{Visualization} \Rpackage{clusterProfiler} implemented serveral methods for visualizing analyzed result. \begin{itemize} \item Bar Plot Bar plot was used to visualized functional profile of the given collection of genes. \begin{figure}[htb] \begin{center} <>= plot(y, type="bar", title="MF Enrichment analysis", showCategory=10) @ \end{center} \caption{\label{Fig:enrichGO} Example of plotting functional profiles} \end{figure} The plot function call was consistent for analysis results generated by \Rfunction{groupGO}, \Rfunction{enrichGO} and \Rfunction{enrichKEGG}. Users can try the following command: <>= plot(x, type="bar", order=FALSE, drop=TRUE) plot(z, type="bar", font.size=12) @ \item Category Net Plot Category-gene network model was also implmented to extract the complex relationships between genes and associated categories. It provides a high-level model to understand the functionalities of genes. \begin{figure}[htb] \begin{center} <>= plot(x, type="cnet", showCategory=5, output="fixed") @ \end{center} \caption{\label{Fig:groupGO-cnet} Example of plotting GO profiles using cnetplot} \end{figure} The plot function call was consistent for analysis results generated by \Rfunction{groupGO}, \Rfunction{enrichGO} and \Rfunction{enrichKEGG}. Users can try the following command: <>= plot(y, type="cnet", categorySize="geneNum") plot(z, type="cnet", categorySize="pvalue", output="interactive") @ \item Dot Plot Dot plot was implemented for cluster comparison as shown in Figure 1. Here, we demonstrated the functional call of \Rfunction{compareCluster}. <>= xx <- compareCluster(gcSample, fun="enrichGO", ont="CC", organism="human", pvalueCutoff=0.05, qvalueCutoff=0.05) plot(xx) @ \begin{figure}[htb] \begin{center} \includegraphics[width=.7\linewidth]{compareCluster.png} \caption{GO Enrichment Comparison} \label{Fig:compareCluster} \end{center} \end{figure} Bar plot was also supported to visualize cluster comparision. User can try the following command to explore the usage: <>= plot(xx, type="bar", by="percentage") plot(xx, type="bar", by="count") @ By default, only top 5 (most significant) categories of each cluster was plotted. User can changes the parameter \textit{showCategory} to specify how many categories of each cluster to be plotted, and if \textit{showCategory} was set to \textit{NULL}, the whole result will be plotted. The dot sizes were based on their corresponding row percentage by default, and user can set the parameter \textit{by} to "count" to make the comparison based on gene counts. We choose "percentage" as default parameter to represent the size of dots, since some categories may contain a large number of genes, and make the dot sizes of those small categories too small to compare. To provide the full information, we also provide number of identified genes in each category (numbers in parentheses), as shown in Figure 3. If the dot sizes were based on "count", the row numbers will not shown. The p-values indicate that which categories are more likely to have biological meanings. The dots in the plot are color-coded based on their corresponding p-values. Color gradient ranging from red to blue correspond to in order of increasing p-values. That is, red indicate low p-values (high enrichment), and blue indicate high p-values (low enrichment). P-values were filtered out by the threshold giving by parameter \textit{pvalueCutoff}, and FDR was control by parameter \textit{qvalueCutoff}. \end{itemize} \Rfunction{compareCluster} was designed as a general function for comparing gene clusters of any kind of gene-ontology associations, not only GO (\Rfunction{groupGO} and \Rfunction{enrichGO}) and KEGG (\Rfunction{enrichKEGG}) provided in this package, but also other biological or biomedical ontologies, including Disease Ontology (via \Rfunction{enrichDO} in \Rpackage{DOSE}) and Reactome Pathway (via \Rfunction{enrichPathway} in \Rpackage{ReactomePA}). More details can be found in the vignettes of \Rpackage{DOSE} and \Rpackage{ReactomePA}. \section{Session Information} The version number of R and packages loaded for generating the vignette were: \begin{verbatim} <>= sessionInfo() @ \end{verbatim} \bibliography{clusterProfiler} \end{document}