%\VignetteIndexEntry{EnrichmentBrowser Manual} %\VignetteDepends{EnrichmentBrowser} %\VignettePackage{EnrichmentBrowser} %\VignetteEngine{utils::Sweave} \documentclass[a4paper]{article} <>= BiocStyle::latex() @ \usepackage{graphicx, subfig} \usepackage[utf8]{inputenc} \title{EnrichmentBrowser: Seamless navigation through combined \\ results of set-based and network-based enrichment analysis} \author{Ludwig Geistlinger} \begin{document} \maketitle \tableofcontents \section{Introduction} The \Biocpkg{EnrichmentBrowser} package implements essential functionality for the enrichment analysis of gene expression data. The analysis combines the advantages of set-based and network-based enrichment analysis in order to derive high-confidence gene sets and biological pathways that are differentially regulated in the expression data under investigation. Besides, the package facilitates the visualization and exploration of such sets and pathways.\\ The following instructions will guide you through an end-to-end expression data analysis workflow including: \begin{enumerate} \item Preparing the data \item Preprocess the data \item Differential expression (DE) analysis \item Defining gene sets of interest \item Executing individual enrichment methods \item Combining the results of different methods \item Visualize and explore the results \end{enumerate} All of these steps are modular, i.e.~each step can be executed individually and fine-tuned with several parameters. In case you are interested only in a particular step, you are advised to directly jump to the respective section (Let's say, for example, you are at the point where you have differential expression calculated for each gene. Now you are interested whether certain gene functions are enriched for differential regulation. Section \emph{Set-based enrichment analysis} would then be the one you should go for). The last section \emph{Putting it all together} also demonstrates how to wrap the whole workflow into a single function, making use of suitably chosen defaults. \section{Reading expression data from file} Typically, the expression data is not already available in \R{} but rather has to be read in from file. This can be done using the function \Rfunction{read.eset}, which reads the expression data (\Rfunction{exprs}) along with the phenotype data (\Rfunction{pData}) and feature data (\Rfunction{fData}) into an \Rclass{ExpressionSet}. <>= library(EnrichmentBrowser) data.dir <- system.file("extdata", package="EnrichmentBrowser") exprs.file <- file.path(data.dir, "exprs.tab") pdat.file <- file.path(data.dir, "pData.tab") fdat.file <- file.path(data.dir, "fData.tab") eset <- read.eset(exprs.file, pdat.file, fdat.file) @ The man pages provide details on file format and the \Rclass{ExpressionSet} data structure. <>= ?read.eset ?ExpressionSet @ \newpage \section{Types of expression data} The two major data types processed by the \Biocpkg{EnrichmentBrowser} are microarray (intensity measurements) and RNA-seq (read counts) data. \subsection{Microarray data} To demonstrate the functionality of the package for microarray data, we consider expression measurements of patients suffering from acute lymphoblastic leukemia \cite{Chiaretti04}. A frequent chromosomal defect found among these patients is a translocation, in which parts of chromosome 9 and 22 swap places. This results in the oncogenic fusion gene BCR/ABL created by positioning the ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.\\ We load the \Biocpkg{ALL} dataset <>= library(ALL) data(ALL) @ and select B-cell ALL patients with and without the BCR/ABL fusion as it has been described previously \cite{Scholtens05}. <>= ind.bs <- grep("^B", ALL$BT) ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG")) sset <- intersect(ind.bs, ind.mut) all.eset <- ALL[, sset] @ We can now access the expression values, which are intensity measurements on a log-scale for 12,625 probes (rows) across 79 patients (columns). <>= dim(all.eset) exprs(all.eset)[1:4,1:4] @ As we often have more than one probe per gene, we compute gene expression values as the average of the corresponding probe values. <>= all.eset <- probe.2.gene.eset(all.eset) head(featureNames(all.eset)) @ (Note, that the mapping from probe to gene is done automatically as long as as you have the corresponding annotation package, here the \Biocpkg{hgu95av2.db} package, installed. Otherwise, the mapping can be defined in the \Rfunction{fData} slot.) <>= head(fData(eset)) @ \subsection{RNA-seq data} To demonstrate the functionality of the package for RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone \cite{Himes14}. We load the \Biocpkg{airway} dataset <>= library(airway) data(airway) @ and create the ExpressionSet (for further analysis, we remove genes with very low read counts and measurements that are not mapped to an ENSEMBL gene ID). <>= expr <- assays(airway)[[1]] expr <- expr[grep("^ENSG", rownames(expr)),] expr <- expr[rowMeans(expr) > 10,] air.eset <- new("ExpressionSet", exprs=expr, annotation="hsa") dim(air.eset) exprs(air.eset)[1:4,1:4] @ %Core functions of the \texttt{EnrichmentBrowser} can be carried out independent %of organism and gene ID type. %However, to make full use of pre-defined gene sets and visualization capabilities %it is typically recommended to annotate the organism under study % %<>= %annotation(air.eset) <- "hsa" %@ % %and to use the most widely supported ENTREZ gene IDs. %<>= %eids <- mapIds(org.Hs.eg.db, keys=featureNames(air.eset), keytype="ENSEMBL", column="ENTREZID") %eids <- eids[!is.na(eids)] %eids <- eids[!duplicated(eids)] %air.eset <- air.eset[featureNames(air.eset) %in% names(eids), ] %names(eids) <- NULL %featureNames(air.eset) <- eids %@ \newpage \section{Normalization} Normalization of high-throughput expression data is essential to make results within and between experiments comparable. Microarray (intensity measurements) and RNA-seq (read counts) data exhibit typically distinct features that need to be normalized for. The function \Rfunction{normalize} wraps commonly used funtionality from \Biocpkg{limma} for microarray normalization and from \Biocpkg{EDASeq} for RNA-seq normalization. For specific needs that deviate from these standard normalizations, the user should always refer to more specific functions/packages.\\ Microarray data is expected to be single-channel. For two-color arrays, it is expected here that normalization within arrays has been already carried out, e.g.~using \Rfunction{normalizeWithinArrays} from \Biocpkg{limma}. A default quantile normalization based on \Rfunction{normalizeBetweenArrays} from \Biocpkg{limma} can be carried out via <>= before.norm <- exprs(all.eset) all.eset <- normalize(all.eset, norm.method="quantile") after.norm <- exprs(all.eset) @ \begin{center} <>= par(mfrow=c(1,2)) boxplot(before.norm) boxplot(after.norm) @ \end{center} Note that this is done here for demonstration purpose only, as the ALL data has been already rma-normalized from the authors of the ALL dataset.\\ RNA-seq data is expected to be raw read counts. Please note that normalization for downstream DE analyis, e.g.~with \Biocpkg{edgeR} and \Biocpkg{DESeq2}, is not ultimately necessary (and in some cases even discouraged) as many of these tools implement specific normalization approaches themselves. See the vignette of \Biocpkg{EDASeq}, \Biocpkg{edgeR}, and \Biocpkg{DESeq2} for details. In case normalization is desired, between-lane normalization to adjust for sequencing depth, can be carried out as demonstrated above for microarray data. <>= norm.air <- normalize(air.eset, norm.method="quantile") @ Within-lane normalization to adjust for gene specific effects such as gene length and GC content effect requires to retrieve this information first. %e.g.~from BioMart or specific \Bioconductor annotation packages. %Both modes are implemented in the \Biocpkg{EDASeq} function \Rfunction{getGeneLengthAndGCContent}. % %<>= %ids <- head(featureNames(air.eset)) %lgc <- EDASeq::getGeneLengthAndGCContent(ids, org="hsa", mode="biomart") %lgc %@ Using precomputed information for all genes, normalization within and between lanes can then be carried out via <>= lgc.file <- file.path(data.dir, "air_lgc.tab") fData(air.eset) <- read.delim(lgc.file) norm.air <- normalize(air.eset, within=TRUE) @ \newpage \section{Differential expression} Differential expression analysis between sample groups can be performed using the function \Rfunction{de.ana}. As a prerequisite, the phenotype data should contain for each patient a binary group assignment. For the ALL dataset this indicates whether the BCR-ABL gene fusion is present (1) or not (0). <>= pData(all.eset)$GROUP <- ifelse(all.eset$mol.biol == "BCR/ABL", 1, 0) table(pData(all.eset)$GROUP) @ For the airway dataset this indicates whether the cell lines have been treated with dexamethasone (1) or not (0). <>= pData(air.eset)$GROUP <- ifelse(colData(airway)$dex == "trt", 1, 0) table(pData(air.eset)$GROUP) @ Paired samples, or in general sample batches/blocks, can be defined via a \Robject{BLOCK} column in the \Rfunction{pData} slot. For the airway dataset the sample blocks correspond to the four different cell lines. <>= pData(air.eset)$BLOCK <- colData(airway)$cell table(pData(air.eset)$BLOCK) @ For microarray expression data, the \Rfunction{de.ana} function carries out a differential expression analysis between the two groups based on functionality from the \Biocpkg{limma} package. Resulting fold changes and $t$-test derived $p$-values for each gene are appended to the \Robject{fData} slot. <>= all.eset <- de.ana(all.eset) head(fData(all.eset), n=4) @ Raw $p$-values are already corrected for multiple testing (\Robject{ADJ.PVAL}) using the method from Benjamini and Hochberg implemented in the function \Rfunction{p.adjust} from the \CRANpkg{stats} package.\\ To get a first overview, we inspect the $p$-value distribution and the volcano plot (fold change against $p$-value). \begin{center} <>= par(mfrow=c(1,2)) pdistr(fData(all.eset)$ADJ.PVAL) volcano(fData(all.eset)$FC, fData(all.eset)$ADJ.PVAL) @ \end{center} The expression change of highest statistical significance is observed for the ENTREZ gene \verb=7525=. <>= fData(all.eset)[ which.min(fData(all.eset)$ADJ.PVAL), ] @ This turns out to be the YES proto-oncogene 1 (\href{http://www.genome.jp/dbget-bin/www_bget?hsa:7525}{hsa:7525@KEGG}). For RNA-seq data, the \Rfunction{de.ana} function carries out a differential expression analysis between the two groups either based on functionality from \Biocpkg{limma} (that includes the \Rfunction{voom} transformation), or alternatively, from the popular \Biocpkg{edgeR} or \Biocpkg{DESeq2} package. We use here the analysis based on \Biocpkg{edgeR} for demonstration. <>= air.eset <- de.ana(air.eset, de.method="edgeR") head(fData(air.eset), n=4) @ Now, we subject the ALL and the airway gene expression data to the enrichment analysis. \newpage \section{Set-based enrichment analysis} In the following, we introduce how the \Biocpkg{EnrichmentBrowser} package can be used to perform state-of-the-art enrichment analysis of gene sets. We consider the ALL and the airway gene expression data as processed in the previous sections. We are now interested whether there are not only single genes that are differentially expressed, but also sets of genes known to work together, e.g.~as defined in the Gene Ontology or the KEGG pathway annotation.\\ The function \Rfunction{get.kegg.genesets}, which is based on functionality from the \Biocpkg{KEGGREST} package, downloads all KEGG pathways for a chosen organism as gene sets. <>= kegg.gs <- get.kegg.genesets("hsa") @ Analogously, the function \Rfunction{get.go.genesets} defines GO terms of a selected ontology as gene sets. <>= go.gs <- get.go.genesets(org="hsa", onto="BP", mode="GO.db") @ User-defined gene sets can be parsed from the GMT file format <>= gmt.file <- file.path(data.dir, "hsa_kegg_gs.gmt") hsa.gs <- parse.genesets.from.GMT(gmt.file) length(hsa.gs) hsa.gs[1:2] @ Currently, the following set-based enrichment analysis methods are supported <>= sbea.methods() @ \begin{itemize} \item{ORA: Overrepresentation Analysis (simple and frequently used test based on the hypergeometric distribution \cite{Goeman07} for a critical review)} \item{SAFE: Significance Analysis of Function and Expression (generalization of ORA, includes other test statistics, e.g.~Wilcoxon's rank sum, and allows to estimate the significance of gene sets by sample permutation; implemented in the \Biocpkg{safe} package)} \item{GSEA: Gene Set Enrichment Analysis (frequently used and widely accepted, uses a Kolmogorov–Smirnov statistic to test whether the ranks of the $p$-values of genes in a gene set resemble a uniform distribution \cite{Subramanian05})} \item{SAMGS: Significance Analysis of Microarrays on Gene Sets (extending the SAM method for single genes to gene set analysis \cite{Dinu07})} \end{itemize} For demonstration we perform here a basic ORA choosing a significance level $\alpha$ of 0.05. <>= sbea.res <- sbea(method="ora", eset=all.eset, gs=hsa.gs, perm=0, alpha=0.05) gs.ranking(sbea.res) @ The result of every enrichment analysis is a ranking of gene sets by the corresponding $p$-value. The \Rfunction{gs.ranking} function displays only those gene sets satisfying the chosen significance level $\alpha$.\\ While such a ranked list is the standard output of existing enrichment tools, the functionality of the \Biocpkg{EnrichmentBrowser} package allows visualization and interactive exploration of resulting gene sets far beyond that point. Using the \Rfunction{ea.browse} function creates a HTML summary from which each gene set can be inspected in more detail (this builds on functionality from the \Biocpkg{ReportingTools} package). The various options are described in Figure \ref{fig:oraRes}. <>= ea.browse(sbea.res) @ \begin{figure} \centering \includegraphics[width=14cm, height=8.75cm]{ora_ebrowse.png} \caption{ORA result view. For each significant gene set in the ranking, the user can select to view (1) a gene report, that lists all genes of a set along with fold change and derived $p$-value, (2) interactive overview plots, such as heatmap, $p$-value distribution, and volcano plot, (3) the pathway in KEGG with differentially expressed genes highlighted in red.}\label{fig:oraRes} \end{figure} The goal of the \Biocpkg{EnrichmentBrowser} package is to provide the most frequently used enrichment methods. However, it is also possible to exploit its visualization capabilities while using one's own set-based enrichment method. This requires to implement a function that takes the characteristical arguments \Robject{eset} (expression data), \Robject{gs} (gene sets), \Robject{alpha} (significance level), and \Robject{perm} (number of permutations). In addition, it must return a numeric vector \Robject{ps} storing the resulting $p$-value for each gene set in \Robject{gs}. The $p$-value vector must be also named accordingly (i.e.~\Rcode{names(ps) == names(gs)}).\\ Let us consider the following dummy enrichment method, which randomly renders five gene sets significant and all others insignificant. <>= dummy.sbea <- function(eset, gs, alpha, perm) { sig.ps <- sample(seq(0,0.05, length=1000),5) insig.ps <- sample(seq(0.1,1, length=1000), length(gs)-5) ps <- sample(c(sig.ps, insig.ps), length(gs)) names(ps) <- names(gs) return(ps) } @ We can plug this method into \Rfunction{sbea} as before. <>= sbea.res2 <- sbea(method=dummy.sbea, eset=all.eset, gs=hsa.gs) gs.ranking(sbea.res2) @ \newpage \section{Network-based enrichment analysis} Having found sets of genes that are differentially regulated in the ALL data, we are now interested whether these findings can be supported by known regulatory interactions. For example, we want to know whether transcription factors and their target genes are expressed in accordance to the connecting regulations. Such information is usually given in a gene regulatory network derived from specific experiments, e.g.~using the \Biocpkg{GeneNetworkBuilder}, or compiled from the literature (\cite{Geistlinger13} for an example). There are well-studied processes and organisms for which comprehensive and well-annotated regulatory networks are available, e.g.~the \verb=RegulonDB= for \emph{E.~coli} and \verb=Yeastract= for \emph{S.~cerevisiae}. However, in many cases such a network is missing. A first simple workaround is to compile a network from regulations in the KEGG database.\\ We can download all KEGG pathways of a specified organism via the \Rfunction{download.kegg.pathways} function that exploits functionality from the \Biocpkg{KEGGREST} package. <>= pwys <- download.kegg.pathways("hsa") @ In this case, we have already downloaded all human KEGG pathways. We parse them making use of the \Biocpkg{KEGGgraph} package and compile the resulting gene regulatory network. <>= pwys <- file.path(data.dir, "hsa_kegg_pwys.zip") hsa.grn <- compile.grn.from.kegg(pwys) head(hsa.grn) @ Now we are able to perform enrichment analysis based on the compiled network. Currently the following network-based enrichment analysis methods are supported <>= nbea.methods() @ \begin{itemize} \item{GGEA: Gene Graph Enrichment Analysis (evaluates consistency of known regulatory interactions with the observed expression data \cite{Geistlinger11})} \item{SPIA: Signaling Pathway Impact Analysis (implemented in the \Biocpkg{SPIA} package)} \item{NEA: Network Enrichment Analysis (implemented in the \Biocpkg{neaGUI} package)} \item{PathNet: Pathway Analysis using Network Information (implemented in the \Biocpkg{PathNet} package)} \end{itemize} For demonstration we perform here GGEA using the gene regulatory network compiled above. <>= nbea.res <- nbea(method="ggea", eset=all.eset, gs=hsa.gs, grn=hsa.grn) gs.ranking(nbea.res) @ The resulting ranking lists for each statistically significant gene set the number of relations (NR.RELS) of the given gene regulatory network that involve a gene set member, the sum of consistencies over all relations (RAW.SCORE), the score normalized by induced network size (NORM.SCORE = RAW.SCORE / NR.RELS), and the statistical significance of each gene set based on a permutation approach.\\ A GGEA graph for a gene set of interest depicts the consistency of each interaction in the network that involves a gene set member. Nodes (genes) are colored according to expression (up-/down-regulated) and edges (interactions) are colored according to consistency, i.e.~how well the interaction type (activation/inhibition) is reflected in the correlation of the observed expression of both interaction partners. \begin{center} <>= par(mfrow=c(1,2)) ggea.graph( gs=hsa.gs[["hsa05217_Basal_cell_carcinoma"]], grn=hsa.grn, eset=all.eset) ggea.graph.legend() @ \end{center} As described in the previous section it is also possible to plug in one's own network-based enrichment method. \newpage \section{Combining results} Different enrichment analysis methods usually result in different gene set rankings for the same dataset. To compare results and detect gene sets that are supported by different methods, the \Biocpkg{EnrichmentBrowser} package allows to combine results from the different set-based and network-based enrichment analysis methods. The combination of results yields a new ranking of the gene sets under investigation by e.g. the average rank across methods.\\ We consider the ORA result and the GGEA result from the previous sections and use the function \Rfunction{comb.ea.results}. <>= res.list <- list(sbea.res, nbea.res) comb.res <- comb.ea.results(res.list) @ The combined result can be detailedly inspected as before and interactively ranked as depicted in Figure \ref{fig:combRes}. <>= ea.browse(comb.res, graph.view=hsa.grn, nr.show=5) @ \begin{figure}[!h] \centering \includegraphics[width=17cm, height=6.8cm]{comb_ebrowse.png} \caption{Combined result view. By clicking on one of the columns (ORA.RANK, ..., GGEA.PVAL) the result can be interactively ranked according to the selected criterion.}\label{fig:combRes} \end{figure} \newpage \section{Putting it all together} There are cases where it is necessary to perform some steps of the demonstrated enrichment analysis pipeline individually. However, often it is more convenient to run the complete standardized pipeline. This can be done using the all-in-one wrapper function \Rfunction{ebrowser}. Thus, in order to produce the result page displayed in Figure \ref{fig:combRes} from scratch, without going through the individual steps listed above, the following call would do the job. <>= ebrowser( meth=c("ora", "ggea"), exprs=exprs.file, pdat=pdat.file, fdat=fdat.file, org="hsa", gs=hsa.gs, grn=hsa.grn, comb=TRUE, nr.show=5) @ \newpage %\bibliographystyle{unsrturl} \bibliography{refs} \end{document}