%\VignetteIndexEntry{GRAPH Interaction from pathway Topological Environment} %\VignetteDepends{graph} %\VignetteDepends{graphite} %\VignetteDepends{SPIA} %\VignetteDepends{hgu133plus2.db} %\VignetteDepends{topologyGSA} \documentclass{article} \usepackage[utf8]{inputenc} <>= BiocStyle::latex() @ \title{graphite} \author{ Gabriele Sales\thanks{\email{gabriele.sales@unipd.it}}, Enrica Calura\thanks{\email{enrica.calura@unipd.it}} and Chiara Romualdi\thanks{\email{chiara.romualdi@unipd.it}} \\ Department of Biology, University of Padua } \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \tableofcontents <>= library(graph) library(graphite) @ \section{Introduction} \Biocpkg{graphite} (GRAPH Interaction from pathway Topological Environment) was designed to: \begin{itemize} \item provide networks derived from eight public pathway databases, \item automate the conversion of node identifiers (e.g. from Entrez IDs to gene symbols), \item facilitate the execution of topological pathway analyses on the provided networks. \end{itemize} The pathway databases available in this version are: \begin{itemize} \item KEGG \cite{KEGG} \item Panther \cite{panther} \item PathBank \cite{pathbank} \item PharmGKB \cite{pharmgkb} \item Reactome \cite{Reactome} \item SMPDB \cite{smpdb} \item WikiPathways \cite{wikipathways} \end{itemize} All provided pathways are annotated in human and, since version 1.14, other 13 species are available. \Biocpkg{graphite} pathways are collected and pre-processed at every BioConductor release (roughly every 6 months). This guarantees the synchronization of the provided data with all other BioConductor annotation packages. The topological pathway analyses directly supported are: \begin{itemize} \item \Biocpkg{SPIA} \cite{Tarca2009,Adi2009,Draghici2007} \item \Biocpkg{topologyGSA} \cite{Massa2010} \item \Biocpkg{clipper} \cite{Martini2013} \end{itemize} Since version 1.24, every \Biocpkg{graphite} pathway can be accessed through three separate views: \begin{description} \item [gene-only network] where metabolite nodes have been removed and edges have been propagated through them; \item [metabolite-only network] where protein nodes have been removed and edges have been propagated through them; \item [mixed network] containing both proteins and metabolites, useful for metabolomic and transcriptomic data integration. \end{description} \section{Pathway topology conversion to gene network} In order to gather curated information about pathways, we have collected data from different public databases that have emerged as reference points for the systems biology community. The KEGG database has been in development by Kanehisa Laboratories since 1995, and is now a prominent knowledge base for integration and interpretation of large-scale molecular data sets generated by genome sequencing and other high-throughput experimental technologies. KEGG is the only pathway database not in {\tt BioPax} format, as it stores the information using the KGML format. KEGG pathways (KGML format) provides maps for both signaling and metabolic pathways \cite{KEGG}. Reactome, backed by the EBI, is one of the most complete repository; it is frequently updated and provides a semantically rich description of each pathway \cite{Reactome}. Finally, Panther \cite{panther} is data are a comprehensive, curated database of pathways, protein families, trees, subfamilies and functions available at http://pantherdb.org backed by the University of Southern California. Since version 1.24, \Biocpkg{graphite} contains also SMPDB \cite{smpdb} and PharmGKB \cite{pharmgkb}, important resources for metabolomic data analyses. \subsection{Pathway definition} \Biocpkg{graphite} pathways are derived by conversion of {\tt KGML} and {\tt BioPax} data formats. KEGG database provides a separate {\tt xml} file, one for each pathway. Thus, a pathway is defined by all the reactions defined within each file. For the other databases, we define a pathway for each element of type {\tt pathway} in the {\tt BioPax} document. \subsection{Nodes with multiple elements} Within a pathway, a node can correspond to multiple gene products. These nodes with multiple elements can be divided into protein complexes (proteins linked by protein-protein interactions) and the groups containing alternative members (genes generally with similar biochemical functions). When considering signal propagation these groups should be considered differently; the first (hereafter group AND) should be expanded into a clique (all proteins connected to the others), while the second (hereafter group OR) should be expanded without connection among the contained elements. In the KGML format there are two ways of defining nodes with multiple elements: protein complexes (group AND defined by entry {\tt type="group"}) and group with alternative members (group OR defined by entry {\tt type="gene"}). In the BioPax format only one type of group is allowed: protein complexes (group AND) with type {\tt complex}. However, it often happens that {\tt protein} tag contains multiple {\tt xref} pointing to alternative elements of the process (group OR). \subsection{Metabolite and protein-mediated interactions} Metabolites mediated interactions are interactions for which a metabolite acts as a bridge between two proteins. Since metabolites are not measured during gene expression analysis (using microarray or RNA-seq), their removal from the original network is useful. However, the trivial elimination of the metabolites, without signal propagation, will strongly bias the topology, interrupting the signals that pass through them. If element $A$ is linked to metabolite $c$ and metabolite $c$ is linked to element $B$, thus elements $A$ should be linked to elements $B$. Within the KGML format, there are two different ways of describing a metabolite mediated interaction: i) direct interaction {\tt type="PPrel"} ($A$ interacts whit $B$ through metabolite $c$) and ii) indirect one {\tt type="PCrel"} ($A$ interacts whit metabolite $c$ and $c$ interacts whit $B$). Since proper signal propagation is crucial for topological gene set analysis we decided to include additional rules for the propagation reconstructing a connection between two genes connected through a series of metabolites. Not all metabolites are considered for the propagation because some of them, such as Hydrogen, H2O, ATP, ADP etc., are highly frequent in map descriptions and the signal propagation through them would lead to degenerate too long chains of metabolites. The metabolites not considered for propagation are not characteristic of a specific reaction but secondary substrates/products widely shared among different processes. Since version 1.24, \Biocpkg{graphite} can be used also for metabolomics pathway analyses. We applied the same procedure descibed above to provide network with metabolites only propagating the signal through proteins. Finally, \Biocpkg{graphite} includes pathways containing both proteins and metabolites (without edges propagation) allowing also integrated pathway analyses of gene expression and metabolomic data. \subsection{Edge attributes} \Biocpkg{graphite} allows the user to see the single/multiple relation types that characterize an edge. The type of edges have been preserved to remain as close as possible to the original annotations. Some new types have been introduced due to the needs of the topological conversion. For instance a {\it Process(indirect)} is introduced when the edge is generated propagating the signal from a gene to another gene passing through metabolites, or from a metabolite to another metabolite passing through proteins. \subsection{Loading pathways} Human pathways are natively distribuited with the package. Since the version 1.14.0, non-human pathway data are also available and can be dowloaded automatically using the \Rfunction{pathways} function. \Rfunction{pathways} requires the name of the specie of interest and the name of the pathway database name as follows: <>= humanReactome <- pathways("hsapiens", "reactome") names(humanReactome)[1:10] p <- humanReactome[["ABC-family proteins mediated transport"]] p @ A pathway database is a list of pathways. We can access a pathway through its name, as above, or through its position in the list, as follow: <>= p <- humanReactome[[21]] pathwayTitle(p) @ In the pathway, nodes represent genes/proteins: <>= head(nodes(p)) @ Edges can be characterized by multiple functional relationships: <>= head(edges(p)) @ By default, the function \Rfunction{edges} and \Rfunction{nodes} provide the edges or the nodes of the pathway containing only proteins with signals propagated through metabolites. Since version 1.24, using the option {\it which} we can access the pathway with proteins and metabolites ({\it which = "mixed"}) or the pathway with only metabolites with edges propagated through proteins ({\it which} = "metabolites"). <>= head(nodes(p), which = "mixed") head(edges(p), which = "mixed") @ These same steps can be used to access the KEGG, Panther, PathBank, PathBank, PharmGKB, Reactome, SMPDB and WikiPathways databases for \textit{Homo sapiens}, but not all such databases are available for all other species. To know the pathway data available the user can use the \Rfunction{pathwayDatabases}. <>= pathwayDatabases() @ \section{Pathway graph} The function \Rfunction{pathwayGraph} builds a \Rclass{graphNEL} object from a pathway \Robject{p}: <>= g <- pathwayGraph(p) g @ <>= edgeData(g)[1] @ Similarly to the \Rfunction{edges} and \Rfunction{nodes}, by default, the function \Rfunction{pathwayGraph} provide the pathway with only proteins and edges propagated through metabolites. Since version 1.24, using the option {\it which} we can access the pathway with proteins and metabolites ({\it which = "mixed"}) or the pathway with only metabolites with edges propagated through proteins ({\it which} = "metabolites"). <>= g <- pathwayGraph(p, which = "mixed") g @ <>= edgeData(g)[1] @ \section{Identifiers} Gene annotations databases are widely used as public repositories of biological information. Our current knowledge on biological elements is spread out over a number of databases (such as: Entrez Gene , RefSeq, backed by the NCBI http://www.ncbi.nlm.nih.gov/, UniProt, ENSEMBL backed by the EBI http://www.ebi.ac.uk/ to name just a few), specialised on a subset of specific biological entities (for instance, UniProt focuses on proteins while Entrez Gene focuses on genes). Key identifiers (IDs) in the internal structure of each database uniquely represent biological entities, thus biological entities can be identified by homogeneous IDs according to the selected database they refer to. Due to their different origins and specificity, switching from an ID to another is possible but not trivial: there could be either no correspondence among them or many-to-many relations. For detailed information about IDs, their structures and differences please consult those resources. The function \Rfunction{converterIdentifiers} allows the user to convert the pathway IDs into different types, to fit the user needs. This mapping process, however, may lead to the loss of some nodes (not all identifiers may be recognized) and has an impact on the topology of the network (one ID may correspond to multiple IDs in another annotation or vice versa). We based the function of ID conversion through the species-specific Bioconductor \Biocpkg{AnnotationDbi}, such as \Biocpkg{org.Hs.eg.db}. Since the version 1.14, all the conversion supported in the annotation packages can be used. \Rfunction{converterIdentifiers} needs a list of pathways or a single pathway and a string describing the type of the identifier as provided by an Annotation package (for example,‘"ENTREZID"’), while the values ‘"entrez"’, ‘"symbol"’ remains for backward compatibility. <>= pSymbol <- convertIdentifiers(p, "SYMBOL") pSymbol head(nodes(pSymbol)) @ <>= reactomeSymbol <- convertIdentifiers(humanReactome[1:5], "SYMBOL") @ Since the introduction of metabolites in \Biocpkg{graphite} pathways (version 1.24), also the metabolite ID conversion is possible. \clearpage \section{Cytoscape Plot} Several pathways have a huge number of nodes and edges, thus there is the need of an efficient system of visualization. To this end \Biocpkg{graphite} uses the \Biocpkg{Rcy3} package to export the network to Cytoscape 3, through the function \Rfunction{cytoscapePlot}. Cytoscape is a Java based software specifically built to manage biological network complexity and for this reason it is widely used by the biological community. Since version 1.24, using the option {\it which} we can access the pathway with proteins and metabolites ({\it which = "mixed"}) or the pathway with only metabolites and edges propagated through proteins ({\it which} = "metabolites"). \begin{Schunk} \begin{Sinput} > cytoscapePlot(convertIdentifiers(reactome$`Unwinding of DNA`, "symbol"), which = "mixed") \end{Sinput} \end{Schunk} \begin{center} \includegraphics[width=10cm, height=9.12cm]{Cytoscape.png} \end{center} \section{Topological pathway analysis} \Biocpkg{graphite} gives access to three types of topological pathway analyses. For more details on the results obtained by these analyses see the corresponding R packages. Following developper instructions, all the methods with the exception of \Biocpkg{clipper} are available only for pathways with proteins and with edges propagated through metabolites. \Biocpkg{clipper} is the only analysis that can be used to study metabolomics data. Note that, since version 1.24, given the introduction of nodes with mixed types, all the ID in the pathway have been prefixed with the type of identifier (e.g. ENTREZID:7157 or SYMBOL:TP53). Thus, also the gene expression or metabolite data, in order to match with the pathway nodes, should be in the same format (e.g. ENTREZID:7157 or SYMBOL:TP53). \subsection{SPIA} The analysis with \Biocpkg{SPIA} requires the conversion of gene-gene networks in a suitable format. This conversion is performed by the function \Rfunction{prepareSPIA} that must be executed before the analysis command \Rfunction{runSPIA}. The \Biocpkg{SPIA} data will be saved in the current working directory; every time you change it, you should also re-run \Rfunction{prepareSPIA}. Edges not included in SPIA have been coerced into the admitted SPIA types. Metabolite mediated interactions (edges of propagation through metabolites) annotated in graphite with the "indirect" type are mapped into the SPIA edge type "indirect effect". For more details see the \Biocpkg{SPIA} package \cite{Tarca2009,Adi2009,Draghici2007}. <>= library(SPIA) data(colorectalcancer) library(hgu133plus2.db) top$ENTREZ <- mapIds(hgu133plus2.db, top$ID, "ENTREZID", "PROBEID", multiVals = "first") top <- top[!is.na(top$ENTREZ) & !duplicated(top$ENTREZ), ] top$ENTREZ <- paste("ENTREZID", top$ENTREZ, sep = ":") tg1 <- top[top$adj.P.Val < 0.05, ] DE_Colorectal = tg1$logFC names(DE_Colorectal) <- tg1$ENTREZ ALL_Colorectal <- top$ENTREZ kegg <- pathways("hsapiens", "kegg") kegg <- convertIdentifiers(kegg, "ENTREZID") kegg <- kegg[c("Bladder cancer", "Cytosolic DNA-sensing pathway")] prepareSPIA(kegg, "keggEx") res <- runSPIA(de = DE_Colorectal, all = ALL_Colorectal, "keggEx") res @ <>= unlink("keggExSPIA.RData") @ \subsection{topologyGSA} \Biocpkg{topologyGSA} uses graphical models to test the pathway components and to highlight those involved in its deregulation. In \Biocpkg{graphite}, \Biocpkg{topologyGSA} has a dedicated function, \Rfunction{runTopologyGSA}, which performs the analysis on a single pathway or on a pathway list. <>= library(topologyGSA) data(examples) colnames(y1) <- paste("SYMBOL", colnames(y1), sep = ":") colnames(y2) <- paste("SYMBOL", colnames(y2), sep = ":") kegg <- pathways("hsapiens", "kegg") p <- convertIdentifiers(kegg[["Fc epsilon RI signaling pathway"]], "SYMBOL") runTopologyGSA(p, "var", y1, y2, 0.05) @ The function \Rfunction{runTopologyGSA}, which easily performs the analysis on the entire pathway database, provides as result a list of two elements: a list with the results of the pathway analyses and the list of generated errors. For more details see the \Biocpkg{topologyGSA} package \cite{Massa2010}. \subsection{clipper} \Biocpkg{clipper} is a package for topological analysis. It implements a two-step empirical approach based on the exploitation of graph decomposition into a junction tree to reconstruct the most relevant signal path. In the first step clipper selects significant pathways according to statistical tests on the means and the concentration matrices of the graphs derived from pathway topologies. Then, it "clips" the whole pathway identifying the signal paths having the greatest association with a specific phenotype. In \Biocpkg{graphite}, \Biocpkg{clipper} has a dedicated function, \Rfunction{runClipper}, which performs the analysis on a single pathway or on a pathway list. <>= library(ALL) library(a4Preproc) library(clipper) data(ALL) pheno <- as(phenoData(ALL), "data.frame") samples <- unlist(lapply(c("NEG", "BCR/ABL"), function(t) { which(grepl("^B\\d*", pheno$BT) & (pheno$mol.biol == t))[1:10] })) classes <- c(rep(1,10), rep(2,10)) expr <- exprs(ALL)[,samples] rownames(expr) <- paste("ENTREZID", featureData(addGeneInfo(ALL))$ENTREZID, sep = ":") k <- as.list(pathways("hsapiens", "kegg")) selection <- c("Bladder cancer", "Hippo signaling pathway - multiple species") selected <- k[selection] names(selected) <- c("Bladder cancer", "Hippo signaling") clipped <- runClipper(selected, expr, classes, "mean", pathThr = 0.1) resClip <- do.call(rbind, clipped$results) resClip[, c("startIdx", "endIdx", "maxIdx", "lenght", "maxScore", "aScore")] @ The function \Rfunction{runClipper}, which easily performs the analysis on the entire pathway database, provides as result a list of two elements: the list with the results of the pathway analyses and the list of eventually generated errors. Since version 1.24, using the option {\it which} we can access the pathway with proteins and metabolites ({\it which = "mixed"}) or the pathway with only metabolites and edges propagated through proteins ({\it which} = "metabolites"). For more details see the \Biocpkg{clipper} package \cite{Martini2013}. \section{Build pathway} In \Biocpkg{graphite}, it is also possible build a pathway object using \Rfunction{buildPathway}. This function creates a new object of type \Rclass{Pathway} given a data frame describing its edges. The edges should be divided in three dataframes containing edges between proteins, edges between proteins and metabolites, and edges between metabolites, and the should be included in the new pathway using {\it proteinEdges}, {\it mixedEdges} and {\it metaboliteEdges} respectively. This practice will guarantee the compatibility with \Rfunction{convertIdentifiers}, this only works with the types of identifiers that are provided in the Annotation package (for example using the string,‘"ENTREZID"’). <>= edges <- data.frame(src_type = "ENTREZID", src="672", dest_type = "ENTREZID", dest="7157", direction="undirected", type="binding") pathway <- buildPathway("#1", "example", "hsapiens", "database", proteinEdges = edges) @ <>= edges <- data.frame(src_type = "ENTREZID", src="672", dest_type = "ENTREZID", dest="7157", direction="undirected", type="binding") edgemix <- data.frame(src_type = "CHEBI", src="77750", dest_type = "ENTREZID", dest="7157", direction="undirected", type="biochemicalReaction") edgemet <- data.frame(src_type = "CHEBI", src="15351", dest_type = "CHEBI", dest="77750", direction="undirected", type="biochemicalReaction") pathway <- buildPathway("#1", "example", "hsapiens", "database", proteinEdges = edges, mixedEdges = edgemix, metaboliteEdges = edgemet) @ \section{Parallelism} Some of \Biocpkg{graphite} operations can be made significantly faster by exploiting the parallelism offered by recent hardware. Here is a list of the functions providing this option: \begin{itemize} \item \Rfunction{convertIdentifiers} \item \Rfunction{runClipper} \item \Rfunction{runTopologyGSA} \end{itemize} To exploit parallel processing you are going to need two ingredients. First, you have to specify the maximum number of cores that the package can use. For instance, if you have 6 cores on your computer you can set: <>= options(Ncpus = 6) @ Your code, then, should try to call \Biocpkg{graphite} functions passing entire lists of pathways; you should not loop manually. <>= original <- pathways("hsapiens", "reactome") # Do (will exploit parallelism) converted <- convertIdentifiers(original, "SYMBOL") # Don't (no parallelism here) converted <- lapply(original, convertIdentifiers, "SYMBOL") @ \warning{Parallelism is not guaranteed to reduce run times.} Splitting a given operation over multiple cores requires some coordination, and that has its own costs. In certain situations (small number of pathways / small number of cores) the overhead may actually make run times worse. Our suggestion is thus to: \begin{itemize} \item always set \Rcode{Ncpus} to a value smaller or equal to the actual number of hardware cores (see \Rfunction{parallel::detectCores} for an indication); \item in any case, start from a small number like 2 or 4, and then work your way up measuring the actual speedups on your own hardware. \end{itemize} \section{Session Information} <>= sessionInfo() @ \begin{thebibliography}{9} \bibitem{KEGG} Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999 Jan 1;27(1):29-34. \bibitem{Reactome} Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, Garapati P, Hemish J, Hermjakob H, Jassal B, Kanapin A, Lewis S, Mahajan S, May B, Schmidt E, Vastrik I, Wu G, Birney E, Stein L, D'Eustachio P. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 2009 Jan;37(Database issue):D619-22. Epub 2008 Nov 3. \bibitem{NCI} Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, Buetow KH. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009 Jan;37(Database issue):D674-9. Epub 2008 Oct 2. \bibitem{Draghici2007} Draghici, S., Khatri, P., Tarca, A.L., Amin, K., Done, A., Voichita, C., Georgescu, C., Romero, R. A systems biology approach for pathway level analysis. Genome Research, 17, 2007. \bibitem{Tarca2009} Tarca AL, Draghici S, Khatri P, Hassan SS, Mittal P, Kim JS, Kim CJ, Kusanovic JP, Romero R. A novel signaling pathway impact analysis. Bioinformatics. 2009 Jan 1;25(1):75-82. \bibitem{Adi2009} Adi L. Tarca, Sorin Draghici, Purvesh Khatri, et. al. A Signaling Pathway Impact Analysis for Microarray Experiments. Bioinformatics, 2009, 25(1):75-82. \bibitem{Massa2010} Massa MS, Chiogna M, Romualdi C. Gene set analysis exploiting the topology of a pathway. BMC System Biol. 2010 Sep 1;4:121. \bibitem{Martini2013} Martini P, Sales G, Massa MS, Chiogna M, Romualdi C. Along signal paths: an empirical gene set approach exploiting pathway topology. Nucleic Acids Res. 2013 Jan 7;41(1):e19. doi: 10.1093/nar/gks866. Epub 2012 Sep 21. \bibitem{humancyc} Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F, Kaipa P, Karthikeyan AS, Kothari A, Krummenacker M, Latendresse M, Mueller LA, Paley S, Popescu L, Pujar A, Shearer AG, Zhang P, Karp PD. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research 38:D473-9 2010. \bibitem{panther} PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Huaiyu Mi, Anushya Muruganujan and Paul D. Thomas Nucl. Acids Res. (2012) doi: 10.1093/nar/gks1118 \bibitem{pharmgkb} Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, Altman RB, Klein TE. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther. 2012 Oct;92(4):414-7. doi: 10.1038/clpt.2012.96. Review. PubMed \bibitem{smpdb} Jewison T, Su Y, Disfany FM, Liang Y, Knox C, Maciejewski A, Poelzer J, Huynh J, Zhou Y, Arndt D, Djoumbou Y, Liu Y, Deng L, Guo AC, Han B, Pon A, Wilson M, Rafatnia S, Liu P, Wishart DS. SMPDB 2.0: big improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014 Jan;42(Database issue):D478-84. doi:10.1093/nar/gkt1067. Epub 2013 Nov 6. \bibitem{pathbank} Wishart DS, Li C, Marcu A, Badran H, Pon A, Budinski Z, Patron J, Lipton D, Cao X, Oler E, Li K, Paccoud M, Hong C, Guo AC, Chan C, Wei W, Ramirez-Gaona M. PathBank: a comprehensive pathway database for model organisms. Nucleic Acids Res. 2020 Jan 8;48(D1):D470-D478. doi: 10.1093/nar/gkz861. \bibitem{wikipathways} Martens M, Ammar A, Riutta A, Waagmeester A, Slenter D, Hanspers K, Miller R, Digles D, Lopes E, Ehrhart F, Dupuis L, Winckers L, Coort S, Willighagen E, Evelo C, Pico A, Kutmon M WikiPathways: connecting communities Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D613–D621, doi:10.1093/nar/gkaa1024 \end{thebibliography} \end{document}