\documentclass[a4paper]{article} \usepackage[utf8]{inputenc} <>= BiocStyle::latex() @ %\VignetteIndexEntry{UsersGuide} %\VignetteKeyword{GOexpress} \bioctitle[GOexpress: Visualise gene expression data using gene ontology annotations]{GOexpress: identify and visualise robust gene ontology signatures through supervised clustering of gene expression data} \author{Kévin Rue-Albrecht, Paul A. McGettigan, Belinda Hernández,\\ David A. Magee, Nicolas C. Nalpas, Andrew C. Parnell,\\ Stephen V. Gordon, and David E. MacHugh} \begin{document} \SweaveOpts{concordance=TRUE} \setkeys{Gin}{width=1\textwidth} <>= options(useFancyQuotes="UTF-8") @ \maketitle \tableofcontents \section[Introduction]{Introduction} \subsection[The origin and purpose of \Biocpkg{GOexpress}]{The origin and purpose of \Biocpkg{GOexpress}} The idea leading to the \Biocpkg{GOexpress} \R{} package emerged from a set of plotting functions I regularly copy-pasted across various complex multifactorial transcriptomics studies from both microarray and RNA-seq platforms. Those functions were repeatedly used to visualise the expression profile of genes across groups of samples, to annotate technical gene identifiers from both microarray and RNA-seq platforms (i.e. probesets, Ensembl gene identifiers) with their associated gene name, and to evaluate the clustering of samples based on genes participating in a common cellular function or location (i.e. gene ontology). While developing the \Biocpkg{GOexpress} package and discussing its features with colleagues and potential users, a few more features were added, to enhance and complement the initial functions, leading to the present version of the package. Complex multifactorial experiments have become the norm in many research fields, thanks to the decrease in cost of high-throughput transcriptomics platforms and the barcoding/multiplexing of samples on the RNA-seq platform. While much effort has been (correctly!) spent on the development of adequate statistical frameworks for the processing of raw expression data, much of the genewise exploration and visualisation is left to the end-user. However, data summarisation and visualisation can be a daunting task in multifactorial experiments, or require large amounts of copy-pasting to investigate the expression profile of a handful or genes and cellular pathways. Developed and tested on multiple RNA-seq and microarray datasets, \Biocpkg{GOexpress} offers an extendable set of data-driven plotting functions readily applicable to the output of widely used analytic packages estimating (differential) gene expression. Once the initial analysis and filtering of \Biocpkg{GOexpress} results is complete --- literaly two command lines ---, each gene and gene ontology is accessible by a single line of code to produce high-quality graphics and summary tables. In short, \Biocpkg{GOexpress} is a software package developed based on real experimental datasets to ease the visualisation and interpretation of multifactorial transcriptomics data by bioinformaticians and biologists, while striving to keep it a simple, fast, and intuitive toolkit. Notably, the use of the \Biocpkg{biomaRt} package enables \Biocpkg{GOexpress} to support and annotate gene expression identifiers from any species and any microarray platform present in the Ensembl BioMart server (\url{http://www.ensembl.org/biomart/martview}), while custom annotations may also be provided for the analysis of species or platforms not supported yet, the clustering of non-transcriptomics datasets (e.g. proteomics), or the comparison of panels of biomarkers independent from gene ontology annotations. \subsection[Purpose of this document]{Purpose of this document} This User's Guide was intended as a helpful description of the main features implemented in the \Biocpkg{GOexpress} package, as well as a tutorial taking the user through a typical analysis pipeline that \Biocpkg{GOexpress} was designed for. While an example usage will be provided for each function of the package, the many arguments of each function cannot realistically be demonstrated in this Guide, and we kindly ask users to also read the individual help pages accompanying the corresponding package functions for further details. \section[Before you start]{Before you start} \subsection[Installation]{Installation} Installing \Biocpkg{GOexpress} should be as easy as running the two lines below: <>= source("http://bioconductor.org/biocLite.R") biocLite("GOexpress") @ Installation issues should be reported to the \Bioconductor{} mailing list. \subsection[Getting help]{Getting help} The \Biocpkg{GOexpress} package is still relatively young and may require some fine-tuning or bug fixes. Please contact the maintainer with a copy of the error message and the command run. <>= maintainer("GOexpress") @ Despite our efforts to repeatedly test the package on in-house datasets of both microarray and RNA-seq platform, and of human and bovine origin, many of the model species and gene expression platforms have not been tested yet. We welcome feedback! Interesting suggestions for additional package functions, or improvement of existing ones are most welcome and may be implemented when time allows. Alternatively, we also encourage users to fork the GitHub repository of the project, develop and test their own feature(s), and finally generate a pull request to integrate it to the original repository (\url{https://github.com/kevinrue/GOexpress}). As for all \Bioconductor{} packages, the \Bioconductor{} support site is the best place to seek advice with a large and active community of \Bioconductor{} users. More detailed information is available at: \url{http://www.bioconductor.org/help/support}. \subsection[Citing \Biocpkg{GOexpress}]{Citing \Biocpkg{GOexpress}} The work underlying \Biocpkg{GOexpress} has not been formally published yet. A manuscript has been submitted for peer-review. In the meantime, users of the \Biocpkg{GOexpress} package are encouraged to cite it using the output of the \Rfunction{citation} function in the \Rpackage{utils} package, as follows: <>= citation(package="GOexpress") @ \section[Quick start]{Quick start} \subsection[Input data]{Input data} Despite their different underlying technologies, microarray and RNA-seq analytic pipelines typically yield a matrix measuring the expression level of many gene features in each sample. Commonly, this expression matrix is filtered to retain only genes qualified as ``informative'' (e.g. > 1 cpm in at least N replicates; N being the number of replicates for a given set of experimental conditions); and genes lowly expressed are removed to limit the False Discovery Rate (FDR) of differentially expressed genes induced by the larger variability of expression at the lower end of the dynamic range. \Biocpkg{GOexpress} requires this filtered normalised expression matrix to be accompanied by an \Rclass{AnnotatedDataFrame} object of the \Biocpkg{Biobase} package providing phenotypic information for each of those samples (e.g. unique identifier, treatment, time-point). \Biocpkg{GOexpress} expects those two variables in an \Rclass{ExpressionSet} container of the \Biocpkg{Biobase} package, both simplifying the manipulation of the data and, most importantly, ensuring interoperatibility with other packages that handle \Bioconductor{} \Rclass{ExpressionSet} objects. The other fields of the \Rclass{ExpressionSet} container may be left empty as \Biocpkg{GOexpress} does not currently access them. Instructions to create \Rclass{AnnotatedDataFrame} and \Rclass{ExpressionSet} objects are detailed in the vignettes of the \Biocpkg{Biobase} package. To use the analytical part of the \Biocpkg{GOexpress} package, the phenotypic data-frame --- \Robject{phenodata} slot of the \Rclass{ExpressionSet} --- must contain at least one column containing an experimental factor made of two or more levels in the strict meaning of ``factor'' and ``levels'' in the \R{} programming language. The above \Rclass{ExpressionSet} and the name of the column containing such a factor are the minimal two input variables required for the \Rfunction{GO\_analyse} function to work. Additional arguments may be required, in particular for microarray datasets, but those are discussed in section \ref{sec:MicroarrayArguments}. In the examples below, we will use the toy dataset \Robject{AlvMac} provided with the package and made of a subset of 100 bovine Ensembl gene identifiers (rows) measured in 117 samples (columns), extracted from a larger RNA-seq experiment (see help page for the \Robject{AlvMac} object). This toy \Rclass{ExpressionSet} also includes an \Rclass{AnnotationDataFrame} detailing a number of phenotypic information fields describing each sample. Let us load the \Biocpkg{GOexpress} package and import the toy dataset in the workspace: <<>>= library(GOexpress) # load the GOexpress package data(AlvMac) # import the training dataset @ Now, the expression matrix and phenotypic data of the \Rclass{ExpressionSet} container can be accessed using dedicated functions from the \Biocpkg{Biobase} package: <<>>= exprs(AlvMac)[1:5,1:5] # Subset of the expression data head(pData(AlvMac)) # Subset of the phenotypic information @ An advantage of the \Rclass{ExpressionSet} container is that it takes care of the compatibility between the expression matrix and the phenotypic information data-frame. For instance, it will check that samples names do not differ between expression matrix and phenotypic information. Users can visually inspect that adequate row names are used in the expression matrix: <<>>= head(rownames(exprs(AlvMac))) # Subset of gene identifiers @ \bioccomment{In the training dataset, the \Robject{Time} column of \Rcode{pData(targets)} is an \R{} factor while the \Robject{Timepoint} column is a numeric vector. The fomer is useful for grouping the samples for the analysis, while the latter is better suited to plot gene expression profiles respecting the relative distance between the time-points. See section \ref{sec:expressionPlots} for examples using of a numeric value or a factor as the variable of the X-axis.} \subsection[Main analysis]{Main analysis} \subsubsection[Preparing the grouping factor to analyse]{Preparing the grouping factor to analyse} In this example, we search for GO terms enriched in genes that cluster best the samples according their \Robject{Treatment} level. But first, let us make sure that the \Robject{Treatment} column of \Rcode{pData(targets)} is indeed an \R{} factor: <<>>= is.factor(AlvMac$Treatment) # assertion test AlvMac$Treatment # visual inspection @ In this case, it is already a properly formatted factor. If that was not the case, the following line of code would convert the column to an \R{} factor and allow to continue the analysis (note that in some cases, it may be preferrable to order the different levels of a factor, for an example see factor \Robject{Time}): <>= AlvMac$Treatment <- factor(AlvMac$Treatment) @ \subsubsection[Running the random forest algorithm using local annotations]{ Running the random forest algorithm using local annotations} Now, we use the random forest statistical framework to score each gene feature on its ability to cluster samples from different treatments separately, before summarising this information at the ontology level. The ensuing analysis therefore considers the \Robject{Treatment} factor, irrespective of the \Robject{Time} and \Robject{Animal} factors (we search for time- and animal-independent discriminants of infection). Alternatively, a subset of samples from the input \Rclass{ExpressionSet} may be specified to address more specific hypotheses. In this example, we use locally saved copies of gene and gene ontology annotations previously downloaded from Ensembl annotation release 75 using the \Biocpkg{biomaRt} package (See section \ref{sec:customAnnotations} to generate suitable local annotations, and the benefits of using them). \label{sec:GOanalyse} <<>>= set.seed(4543) # set random seed for reproducibility AlvMac_results <- GO_analyse( eSet = AlvMac, f = "Treatment", GO_genes=AlvMac_GOgenes, all_GO=AlvMac_allGO, all_genes=AlvMac_allgenes) @ At this stage, it is a good idea to save the result variable into an \R{} data-file using the \Rfunction{save} function. Mostly because the stochastic aspect of the sampling approach implemented by the \CRANpkg{randomForest} package may return slightly different scores in each run (as opposed to the use of ANOVA F-score). The output variable of the analysis summarises the parameters of the analysis and can easily be browsed with standard \R{} functions: <<>>= names(AlvMac_results) # Data slot names head(AlvMac_results$GO[, c(1:5, 7)], n=5) # Ranked table of GO terms (subset) head(AlvMac_results$genes[, c(1:3)], n=5) # Ranked table of genes (subset) head(AlvMac_results$mapping) # Gene to gene ontology mapping table (subset) @ \bioccomment{In the tables of GO terms and genes above, the column containing the name of the GO terms and the column containing the description of the genes are hidden, as their content is very long in some cases, affecting the readability of this document.} \subsubsection[Important notes in the absence of local annotations]{ Important notes in the absence of local annotations} If no annotations mapping gene features identifiers to gene ontology identifiers are provided, \Rfunction{GO\_analyse} will connect to the Ensembl server to fetch appropriate annotations in a semi-automated procedure (See arguments \Robject{dataset} and \Robject{microarray} of the \Rfunction{GO\_analyse} function). Typically, the first feature identifier in the \Robject{ExpressionSet} is used to determine the corresponding species and type of data. This is a fairly straightforward process for Ensembl gene identifiers (e.g. in the prefix `ENSBTAG', `BT' indicates \textit{Bos taurus}). Hence, the simplest use of the \Rfunction{GO\_analyse} is: <>= AlvMac_results <- GO_analyse(eSet = AlvMac, f = "Treatment") @ \warning{ Without local annotations, connecting to the Ensembl server and downloading the annotations significantly impacts the runtime of the function. } \label{sec:MicroarrayArguments} However, it can be more difficult to identify the microarray used to obtain a certain dataset, as many different Affymetrix chips contain probesets named with the pattern ``AFFX.*'' for instance. In cases where the microarray platform cannot be detected automatically, we recommend using the \Robject{microarray} argument of the \Rfunction{GO\_analyse} function. The list of valid values for the microarray argument is available in the \Robject{microarray2dataset} data frame which can be loaded in the workspace using: <>= data(microarray2dataset) @ \subsection[Permutation-based P-value for ontologies]{Permutation-based P-value for ontologies} To assess the significance of GO term ranking --- or scoring ---, we implemented a permutation-based function randomising the gene feature ranking, and counting how many times each GO term is ranked (scored) equal or higher than the real rank (score). Critically, the function should be applied directly to the output of the \Rfunction{GO\_analyse} function prior to filtering, in order to use the full list of gene features as a background for permutation: <>= AlvMac_results.pVal = pValue_GO(result=AlvMac_results, N=100) @ <>= data(AlvMac_results.pVal) @ \warning{ The \Rfunction{pValue\_GO} function is relatively lengthy. However, it is suggested to calculate P-values on the basis of at least 1,000 permutation (approximately 50 min on a standard Ubuntu server) to obtain reach minimal non-zero P-values as low as 0.001.} Given that many genes are associated to various gene ontologies due to the hierarchical relationships in the three Direct Acyclic Graphs (DAGs), a permutation-based approach appears the best suited strategy to assess the significance of ontology-related genes to display a specific average rank (or score). As further discussed in the next section, this approach also addresses the issue of the considerable range of gene counts associated with each known gene ontology. \subsection[Filtering of results]{Filtering of results} \label{sec:Filtering} \subsubsection[Filtering of the result object]{Filtering of the result object} The \Rfunction{subset\_scores} function allows users to filter for GO terms passing certain criteria (e.g. maximal P-value, minimal gene counts, type of ontology). A \Robject{filters.GO} slot will be created in the filtered result object, stating the filters and cutoff values applied. A filtered object may be further filtered, and the \Robject{filters.GO} slot will be updated accordingly. Importantly, an early-identified bias of the scoring function is that GO terms associated with fewer genes are favored at the top of the ranking table. This is due to the fact that it is much easier for a group of 5 genes (e.g. ``\texttt{B cell apoptotic process}'') to have an high average rank --- and average score --- than it is for a group of ~6,000 genes (e.g. ``\texttt{protein binding}''). Indeed, the highest possible average rank of 5 genes is 3 while it is 3,000 for a group of 6,000 genes. The calculation of P-values using the \Rfunction{pValue\_GO} partially controls that bias, as smaller groups of ontology-related genes may appear by chance at a higher average rank than observed using the ranking based on actual gene expression. Furthermore, in our experience, this bias presents some benefits. First, it implicitely favors specific and well-defined GO terms (e.g. ``\texttt{negative regulation of T cell apoptotic process}'') as opposed to vague and uninformative GO terms (e.g. ``\texttt{cytoplasm}''). Secondly, we observed many top-ranking GO terms associated with a single gene. Those GO terms are consequently susceptible to single-gene events and artefacts in the expression data, as opposed to GO terms with a reasonable number of associated genes. Using the above filtering function, it is straightforward to filter out those GO terms with only a handful of associated genes, in combination with a standard P-value filter: <<>>= BP.5 <- subset_scores( result = AlvMac_results.pVal, namespace = "biological_process", total = 5, # requires 5 or more associated genes p.val=0.05) MF.10 <- subset_scores( result = AlvMac_results.pVal, namespace = "molecular_function", total = 10, p.val=0.05) CC.15 <- subset_scores( result = AlvMac_results.pVal, namespace = "cellular_component", total = 15, p.val=0.05) @ Finally, the inherent hierarchical structure and ``granularity'' of gene ontology terms can be browsed conveniently by using increasingly large values of the \Robject{total} filter. Note that this filter retains only GO terms associated with a minimal given count of genes in the gene-GO mapping table. It is also possible to use the \Robject{data} argument to filter for GO terms associated with a certain count of genes in the given expression dataset, although this approach is obviously more data-dependent and less robust. \warning{ To optimise the use of memory space, after removing all ontologies not passing the criteria, the function will also discard from the filtered result object all gene features not associated with the remaining gene ontologies, and the mapping information related to gene ontologies absent from the filtered data. However, those data will still be left in the original object containing unfiltered raw results.} \subsubsection[Quick filtering of the ontology scoring table]{Quick filtering of the ontology scoring table} As an alternative to the above cascade-filtering of gene ontologies and gene features result tables, users can extract and filter information from either the gene or the gene ontology scoring tables using the \Rfunction{subset} function. An example of filtering the gene ontology result table for the top ontologies of the `Biological Process' type, associated with at least 5 genes, and a P-value lower than 0.05: <>= subset( AlvMac_results.pVal$GO, total_count >= 5 & p.val<0.05 & namespace_1003=='biological_process' ) @ \bioccomment{ Note that the above command will return a data-frame containing only the filtered GO terms, as opposed to the full result object returned by the \Rfunction{subset\_scores} function. } \subsection[Details of the top-ranking GO terms]{Details of the top-ranking GO terms} Once the GO terms are ranked (and filtered), the top-ranking GO terms in the filtered object are those most enriched in genes with expression levels that best cluster the predefined groups of samples, based on the levels of the factor considered (\Rcode{raw\_results\$factor}). In this example, we list the top filtered ``Biological Process'' GO terms extracted above and their statistics (currently ranked by increasing average rank of their associated genes): <<>>= head(BP.5$GO) @ \subsection[Hierarchical clustering of samples based on gene expression associated with a GO term]{Hierarchical clustering of samples based on gene expression associated with a GO term} In the previous section, we identified the GO terms most enriched for genes clustering samples according to their treatment. We will now generate for the top-ranked GO term (``\texttt{toll-like receptor 4 signaling pathway}'') a heatmap to visualise simulatenously the clustering of samples and the expression level of each gene in each sample: \begin{center} <>= heatmap_GO( go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.4, cexCol=1, cex.main=1, main.Lsplit=30) @ \end{center} In this example, we can observe a group of ``Control'' (i.e. untreated; green color) samples clustering together at the bottom of the heatmap. Re-labelling of samples by \Robject{Group} (i.e. combination of treatment and time-point) reveals that those samples are mainly 24 and 48 hours post-infection control samples: \begin{center} <>= heatmap_GO( go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.4, cexCol=1, cex.main=1, main.Lsplit=30, labRow=AlvMac$Group) @ \end{center} Subsequently encouraging the generation of a heatmap restricted to samples from those time-points (i.e. \Robject{24H} and \Robject{48H}): \begin{center} <>= heatmap_GO( go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cexRow=0.6, cexCol=1, cex.main=1, main.Lsplit=30, labRow='Group', subset=list(Time=c('24H','48H'))) @ \end{center} Alternatively, it is possible to focus only on the hirearchical clustering of samples. The following code will build a dendrogram clustering samples using the expression data of the subset of genes associated with the ``\texttt{toll-like receptor 4 signaling pathway}'' gene ontology, considering only samples obtained 24 and 48 hours post-infection, and labelling samples by \Robject{Group} rather than simply \Robject{Treatment}: \begin{center} <>= cluster_GO( go_id = "GO:0034142", result = BP.5, eSet=AlvMac, cex.main=1, cex=0.6, main.Lsplit=30, subset=list(Time=c("24H", "48H")), f="Group") @ \end{center} Note that labelling samples by another factor does not affect the clustering process itself, as the underlying expression data has not changed. \subsection[Details of genes associated with a GO term]{Details of genes associated with a GO term} Following the identification of relevant GO terms in the above sections, users may want to have a closer look at the individual genes associated with a given GO term. The default behaviour of the function is to order the gene features by increasing rank (equivalent to decreasing score): <<>>= table_genes(go_id = "GO:0034142", result = BP.5)[,c(1:3)] @ \bioccomment{In the table above, the column containing the description of the genes is hidden, as its content was very long in some cases, affecting the readability of this document.} Note that the default behaviour of the above function is to return a table of all the genes associated with the GO term based on the annotations collected. For obvious reasons, genes present in the annotations but absent from the expression dataset will be absent from the score table and consequently lack data in this result table. It is possible to restrict the above table to only genes present in the expression dataset using the \Robject{data.only} argument set to \Robject{TRUE}. If only the feature identifiers associated with a given GO identifier are needed, users may use the function below: <<>>= list_genes(go_id = "GO:0034142", result = BP.5) @ \subsection[Expression profile of a gene by sample group]{Expression profile of a gene by sample group} \subsubsection[Using the unique feature identifier]{Using the unique feature identifier} \label{sec:expressionPlots} In the above section, we listed the genes associated with a particular gene ontology. In our example, the respective score and rank of each gene estimates the capacity of the gene to cluster the samples according to the treatment factor. The genes that best cluster the samples will have the highest scores and the lowest ranks. Those genes will likely produce the expression profiles with the most consistent differential expression between the treatment groups over time. Here is one example: \setkeys{Gin}{width=0.75\textwidth} \begin{center} <>= expression_plot( gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac, x_var = "Timepoint", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} Note that \Robject{Timepoint} is another column of \Rcode{pData(targets)}. that encodes a numeric vector, as opposed to the column named \Robject{Time}, encoding a factor. This difference enables the plotting function to respect the relative distance between the time-points for an output more representative of the actual time-scale. To investigate the impact of other factors on the expression level of the same gene, users are encouraged to use the \Robject{f} and \Robject{x\_var} arguments to specify alternate grouping factor and X variable, respectively. Note that the \Rfunction{geom\_smooth} of the \Biocpkg{ggplot2} package may fail if a minimal number of replicates is not available to calculate proper confidence intervals. In such cases, it is recommended to use the function \Rfunction{expression\_profiles} described in section \ref{sec:expressionProfiles}. Here is another valid example separating samples by the factor \Robject{Animal} on the X axis and summarising all time-points in a confidence 95\% confidence interval on the Y-axis: \begin{center} <>= expression_plot( gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac, x_var = "Animal", title.size=1.5, axis.text.angle=90, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} \subsubsection[Using the associated gene name]{Using the associated gene name} It is also possible to visualise the expression profile of genes from their associated gene name if any. This is a more human-friendly version of the function presented in the previous subsection: \begin{center} <>= expression_plot_symbol( gene_symbol = "BIKBA", result = BP.5, eSet=AlvMac, x_var = "Timepoint", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} However, the benefits of this feature are balanced by the fact that genes lacking an associated gene name cannot be visualised in this manner, and that some gene symbols are associated with multiple Ensembl gene identifiers and probesets (e.g. `RPL36A'). In the latter case, we turned the ambiguity into an additional useful feature: a lattice is created, and each of the multiple features associated with the given gene symbol are plotted simultaneously in the lattice. Subsequently, each of the sub-figures plotted may be re-plotted by itself using the \Robject{index} argument as indicated in the accompanying message printed in the \R{} console: \begin{center} <>= expression_plot_symbol( gene_symbol = "RPL36A", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} \subsection[Expression profile of a gene by individual sample series]{ Expression profile of a gene by individual sample series} \label{sec:expressionProfiles} \subsubsection[Using the unique feature identifier]{Using the unique feature identifier} It may be useful to track and visualise the expression profile of genes in each individual sample series, rather than their average. This could help identify outliers within sample groups, or visually compare paired samples, for instance. In the \Robject{AlvMac} dataset, samples from each of the animals were subjected to all three treatments in parallel (i.e. paired samples). In the figure below, a sample series is defined by a given \Robject{Animal} and a given \Robject{Treatment}. Each sample series is then tracked over time, and coloured according to the \Robject{Treatment} factor (default, factor stored in \Rcode{raw\_results\$factor}): \begin{center} <>= AlvMac$Animal.Treatment <- paste(AlvMac$Animal, AlvMac$Treatment, sep="_") expression_profiles( gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", line.size=1, seriesF="Animal.Treatment", linetypeF="Animal", legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} In the figure above, the \Robject{linetypeF} helps to highlight samples from an animal which start at unusually high expression values, while those samples progressively return to expression values similar to other samples in their respective treatment groups. If omitted, the \Robject{linetypeF} argument will mirror the \Robject{colourF}, which can be useful for colour-blind people. Alternatively, a single line-type can be applied to all groups using the \Robject{lty} argument as follows: \begin{center} <>= expression_profiles( gene_id = "ENSBTAG00000047107", result = AlvMac_results, eSet=AlvMac, x_var = "Timepoint", lty=rep(1,10), # use line-type 1 for all 10 groups seriesF="Animal.Treatment", linetypeF="Animal", legend.title.size=10, legend.text.size=10, legend.key.size=15, line.size=1) @ \end{center} \subsubsection[Using the associated gene name]{Using the associated gene name} Similarly to the \Rfunction{expression\_plot} function, an alternative was implemented to use gene names instead of feature identifiers. An example: \begin{center} <>= expression_profiles_symbol( gene_symbol="TNIP3", result = AlvMac_results, x_var = "Timepoint", linetypeF="Animal", line.size=1, eSet=AlvMac, lty=rep(1,10), seriesF="Animal.Treatment", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15) @ \end{center} \subsection[Comparison of univariate effects on gene expression]{Comparison of univariate effects on gene expression} While the analysis is restricted to the evaluation of a single factor, it can be helpful to compare the relative impact of all known factors present in the the accompanying \Robject{phenoData} on the gene expression in the different groups of samples. In other words, given a GO term identifier this feature will generate a plot for each associated gene, where the mean (default; can be changed) expression level will be computed for each level of each factor and compared to one another: \begin{center} <>= plot_design( go_id = "GO:0034134", result = BP.5, eSet=AlvMac, ask = FALSE, factors = c("Animal", "Treatment", "Time", "Group"), main.Lsplit=30) @ \end{center} \section[Additional controls and advanced functions]{Additional controls and advanced functions} \subsection[Custom annotations]{Custom annotations} \label{sec:customAnnotations} To enable all downstream filtering and visualisation features, the \Rfunction{GO\_analyse} function uses three types of annotations: \begin{itemize} \item \Robject{GO\_genes}: Mapping of gene feature identifiers to gene ontology identifiers. \item \Robject{all\_genes}: Annotations of gene feature identifiers. \item \Robject{all\_GO}: Annotations of gene ontology identifiers. \end{itemize} The use of custom annotations has several advantages: \begin{itemize} \item \textbf{Traceability and reproducibility:} the Ensembl annotations are updated on a regular basis. A local copy of annotations allows use of archived annotation releases. In the absence of local annotations, \Rfunction{GO\_analyse()} systematically connects to the latest (i.e. current) Ensembl release. \item \textbf{Speed:} Providing custom annotations will skip calls to the Ensembl server, significantly reducing the runtime of the \Rfunction{GO\_analyse} function. \item \textbf{Autonomy from web-services:} occasionally the Ensembl server may be unavailable (e.g. maintenance, internet connection). A local copy of annotations allows to work independently from any web-service. \item \textbf{Alternative annotations:} Users working with gene feature identifiers not supported in the Ensembl annotations (e.g. species, or platforms), or comparing panels of biomarkers instead of GO terms for instance, may provide their own custom annotations to enable the ranking and visualisation of their data using \Biocpkg{GOexpress}. \end{itemize} \subsubsection[Generating custom annotations]{Generating custom annotations} \label{sec:generateCustomAnnotations} Using the Ensembl release 75, we show below the code used to retrieve annotations for \textit{Bos taurus} Ensembl gene identifiers: <>= # Load the interface to BioMart databases library(biomaRt) # See available resources in Ensembl release 75 listMarts(host='feb2014.archive.ensembl.org') # Connect to the Ensembl Genes annotation release 75 for Bos taurus ensembl75 = useMart( host='feb2014.archive.ensembl.org', biomart='ENSEMBL_MART_ENSEMBL', dataset='btaurus_gene_ensembl') ## Download all the Ensembl gene annotations (no filtering) allgenes.Ensembl = getBM( attributes=c('ensembl_gene_id', 'external_gene_id', 'description'), mart=ensembl75) # Rename the gene identifier column to 'gene_id' # This allows GOexpress to treat microarray and RNA-seq data identically colnames(allgenes.Ensembl)[1] = 'gene_id' ## Download all the gene ontology annotations (no filtering) allGO.Ensembl = getBM( attributes=c('go_id', 'name_1006', 'namespace_1003'), mart=ensembl75) ## Download all the mapping between gene and gene ontology identifiers GOgenes.Ensembl = getBM( attributes=c('ensembl_gene_id', 'go_id'), mart=ensembl75) # Rename the gene identifier column to 'gene_id' colnames(GOgenes.Ensembl)[1] = 'gene_id' # Cleanup: remove some blank fields often found in both columns GOgenes.Ensembl = GOgenes.Ensembl[GOgenes.Ensembl$go_id != '',] GOgenes.Ensembl = GOgenes.Ensembl[GOgenes.Ensembl$gene_id != '',] @ \bioccomment{ The automated retrieval procedure retrieves all gene ontology annotations from the Ensembl server, inclusive of annotations not Inferred from Experiment (EXP). Users may consider filtering local annotations for desired GO Evidence code(s). } \subsubsection[Using custom annotations]{Using custom annotations} \label{sec:useCustomAnnotations} The annotations download above can then be saved in local R data files, and subsequently used to run entirely offline analyses of \Rclass{ExpressionSet} objects with corresponding gene feature identifiers: <>= # save each custom annotation to a R data file save(GOgenes.Ensembl, file='GOgenes.Ensembl75.rda') save(allGO.Ensembl, file='allGO.Ensembl75.rda') save(allgenes.Ensembl, file='allgenes.Ensembl75.rda') # Run an analysis using those local annotations GO_analyse( eSet=AlvMac, f='Treatment', GO_genes=GOgenes.Ensembl, all_GO=allGO.Ensembl, all_genes=allgenes.Ensembl) @ Ideally, all three annotation objects should be provided, to enable all downstream features. A toy example for each type of custom annotations, ready for analysis, is provided with the package: <>= data(AlvMac_GOgenes) data(AlvMac_allGO) data(AlvMac_allgenes) @ \warning{ Critically, it is highly recommended to provide full genome annotations for the species of interest, including annotations for feature identifiers that are absent from the given \Rclass{ExpressionSet}. As described in section \ref{sec:StatsOverview}, all genes present in the annotations will affect the scoring of gene ontologies, even if they are absent from the \Rclass{ExpressionSet}.} \subsection[Using subsets of samples]{Using subsets of samples} \label{sec:subsetArgument} It may be desirable to rank genes and gene ontologies according to their capacity to cluster only specific subsets of samples, while visualising the expression data of all samples. Instead of creating two separate \Rclass{ExpressionSet} objects (one containing all the samples to visualise, another one containing only the samples to analyse), a \Robject{subset} argument was added to most functions in the \Biocpkg{GOexpress} package, allowing the use of a single \Rclass{ExpressionSet} object from which the desired subset of samples is extracted at run time. This argument takes a named list where names must be column names existing in \Rcode{colnames(pData(eSet))}, and values must be vectors of values existing in the corresponding column of \Rcode{pData(eSet)}. The original \Rclass{ExpressionSet} will be left unchanged. An example: \begin{center} <>= AlvMac_results <- GO_analyse( eSet = AlvMac, f = "Treatment", subset=list( Time=c("6H","24H", "48H"), Treatment=c("CN","MB")) ) expression_plot( gene_id = "ENSBTAG00000047107", result = BP.5, eSet=AlvMac, x_var = "Timepoint", title.size=1.5, legend.title.size=10, legend.text.size=10, legend.key.size=15, subset=list(Treatment=c("TB","MB")) ) @ \end{center} \subsection[Overlapping genes between GO terms]{Overlapping genes between GO terms} Often, the top-ranked cellular functions identified by pathway analysis and gene-set enrichment analysis tools may significantly overlap each other in terms of genes, which limits the number of genes identified by those approaches and blurrs the precise cellular function represented by the core gene-set among a list of closely related yet different functions. Typically, overlapping sets are representated as Venn diagrams. The example below will produce a Venn diagram of the gene-sets associated with the five top-ranked ``Biological Process'' ontologies, and display it on screen, as the filename argument is left to the default \Rcode{NULL} value (if a filename is given, the diagram will be printed to a TIFF file , no matter the extension specified): \begin{center} <>= overlap_GO(go_ids = head(BP.5$GO$go_id, n=5), result = BP.5, filename=NULL) @ \end{center} This function calls the \Rfunction{venn.diagram} function of the \CRANpkg{VennDiagram} package, offering a high level of control on the resulting Venn diagram. A detailed description of the available arguments are accessible in the corresponding \R{} documentation file. \subsection[Distribution of scores]{Distribution of scores} Users might be interested in the general distribution of score and rank statistics produced by \Biocpkg{GOexpress}. The distribution of scores may be represented as a histogram: \begin{center} <>= hist_scores(result = BP.5, labels = TRUE) @ \end{center} Alternatively, quantile values can be returned for default or customised percentiles: <<>>= quantiles_scores(result = BP.5) @ \subsection[Reordering scoring tables]{Reordering scoring tables} \subsubsection[Reordering by score]{Reordering by score} While scores are more prone to extreme outlier values and may slightly fluctuate between multiple runs of the random forest algorithm; ranks of genes and subsequently average ranks of GO terms tend to be more stable and may be more reliable estimators of the importance of genes and cellular functions. Therefore, the default behaviour of \Biocpkg{GOexpress} is to use the rank and average rank metrics to order genes and GO terms, respectively, in the returned score tables. It is however possible to re-order the tables in the output variable according to the score metric (or revert back to the original one) as in the example below: <>= BP.5.byScore <- rerank(result = BP.5, rank.by = "score") @ \subsubsection[Reordering by P-value]{Reordering by P-value} Additionally, it is possible to reorder GO terms by increasing P-value, provided those values were were computed using the \Rfunction{pValue\_GO} function. <>= BP.5.byPval <- rerank(result = BP.5, rank.by = "p.val") @ \subsubsection[Reordering and breaking ties]{Reordering and breaking ties} For instance, to rank GO terms by P-value, while breaking ties on their \Robject{ave\_rank} value, one needs to first rank the object by \Robject{rank}, and rank the resulting object by \Robject{p.val}: <>= BP.5.pVal_rank <- rerank(result = BP.5, rank.by = "rank") BP.5.pVal_rank <- rerank(result = BP.5.pVal_rank, rank.by = "p.val") @ \subsection[Subsetting an ExpressionSet to specific sample groups]{Subsetting an ExpressionSet to specific sample groups} While this feature exists, users may want to consider the newer section \ref{sec:subsetArgument} describing the definition of a subset of samples from the given \Rclass{ExpressionSet} on-the-fly without the need to create a new object containing the subsetted \Rclass{ExpressionSet}. It is straightforward to subset an \Rclass{ExpressionSet} by extracting given columns (i.e. samples) and rows (i.e. gene features). Nevertheless, the \CRANpkg{randomForest} package is quite sensitive to the definition of \R{} factors; for instance, the \Rfunction{randomForest} function will crash if a factor is declared to have 3 levels (e.g. "A", "B", and "C"), while the \Rclass{ExpressionSet} only contains samples for two of them (e.g. "A" and "B"). A simple fix is to update the known levels of the factor after having subsetted the \Rclass{ExpressionSet}: <<>>= pData(AlvMac) <- droplevels(pData(AlvMac)) @ \bioccomment{ Note that this operation preserves the order of ordered factors. } Typically, subsetting an \Rclass{ExpressionSet} by rows and columns does not automatically update the known levels of each factor to the remaining levels. The function \Rfunction{subEset} performs this additional task. The function takes a named list, where names must be column names from the \Robject{phenoData} slots and values must be present in the corresponding columns, and returns a subset of the original \Rclass{ExpressionSet} including only the samples which match those values: <>= subEset( eSet=AlvMac, subset=list( Time=c("2H","6H","24H"), Treatment=c("CN","MB"))) @ \section[Statistics]{Statistics} \subsection[Overview]{Overview} \label{sec:StatsOverview} \Biocpkg{GOexpress} was initially created from a set of gene-based, and later ontology-based, visualisation functions. Following the integration of those various plotting functions at the core of the \Biocpkg{GOexpress} package, the need for a ranking of genes and GO terms soon became apparent in order to rapidly identify those with gene expression data best clustering the samples according to the experimental factor studied. A two-fold procedure was implemented: \begin{enumerate} \item Using the available expression data, each gene present in the dataset is scored, evaluating its ability to cluster the predefined groups of samples. Genes are ranked according to their respective score; ties are resolved by assigning the lowest rank R to all G genes, giving the rank G+n to the next gene(s). The genewise scoring functions implemented are described in the following subsections, and all were validated on in-house datasets cross-checked with comparable analytic pipelines (e.g. Ingenuity\textregistered Pathway Analysis, SIGORA) \item Using the above ranks and scores, each GO term in the gene ontology annotations is assigned the mean score and the mean rank of all the genes associated with it in the gene-GO annotations. Genes present in the annotations but absent from the \Rclass{ExpressioSet} are assigned a score of 0 (minimum valid score; indicates no power to discriminates the predefined groups of sample) and a rank equal to the number of genes present in the \Rclass{ExpressioSet} plus one (worst rank, while preserving discrete continuity of the ranking). \end{enumerate} \subsection[Data-driven visualisation functions]{Data-driven visualisation functions} Importantly, the statistics performed to rank GO terms and genes do \emph{not} influence the behaviour of the subsequent plotting functions; heatmaps, dendrograms and gene expression profiles are purely driven by the expression data, sample phenotype annotations provided and GO terms annotations, without any transformation or normalisation applied to the data. Therefore, users are encouraged to use and suggest alternative relevant scoring and ranking strategies, which could prioritise GO terms and genes in different ways. A current acknowledged bias is the higher scoring of GO terms associated with fewer genes, which is discussed in section \ref{sec:Filtering}. \subsection[Random Forest]{Random Forest} We implemented the Random Forest framework to answer the question: ``How well does each gene in the dataset cluster predefined groups of samples?''. The random forest consists of multiple decision trees. Each tree is built based on a bootstrap sample (sample with replacement) of observations and a random sample of variables. The \CRANpkg{randomForest} package first calculates the Gini index (Breiman et al, 1984) for each node in each tree. The Gini index is a measure of homogeneity from 0 (homogeneous) to 1 (heterogeneous). The decrease in the Gini index resulting from a split on a variable is then calculated for each node and averaged for each variable over all the trees in the model. The variable with the biggest mean decrease in the Gini index is then considered the most important. Technically, \Biocpkg{GOexpress} extracts the \Robject{MeanDecreaseGini} value from the \Robject{importance} slot of the \Rfunction{randomForest} output and uses this value as the score for each gene. A key feature of the Random Forest framework is the implicit handling of interactions between genes, given a sufficient number of trees generated and genes sampled. Indeed, at each node in the decision tree, genes are sampled from the expression data and tested for their individual capacity to improve the partitioning reached in the previous node. The larger the number of trees and genes sampled, the more complete the coverage of interactions will be. \subsection[One-way Analysis of Variance (ANOVA)]{One-way Analysis of Variance (ANOVA)} We implemented the ANOVA statistical framework to answer the question: "How different is the expression level of each gene in the dataset in the different groups of samples?". Given the expression level of a gene in all the samples, the one-way ANOVA determines the ratio of the variance between the groups compared to the variance within the groups, summarised as an F statistic that \Biocpkg{GOexpress} uses as a score for each gene. Simply put, if samples from the same groups show gene expression values similar to each other while samples from different groups show different levels of expression, those genes will produce a higher score. This score cannot be less than 0 (variance between groups insignificant to variance within groups), while very large ratios can easily be reached for genes markedly different between groups. Contrary to the random forest framework, the Analysis of Variance makes important assumptions on the data: namely, the independence of observations, the normality of residuals, and the equality of variances in all groups. While the former two are the responsibility of the user to verify, the latter is taken care of by \Biocpkg{GOexpress}. Indeed, the \Rfunction{oneway.test} function of the package \Rpackage{stats} is used with parameter \Robject{var.equal} set to to \Rcode{FALSE}. While this reduces the sensitivity of the test, all genes are affected by this correction based on the relative amount of variance in the different predefined groups of samples. Finally, it is once again important to note that the one-way ANOVA only evaluates univariate changes, while the random forest framework implicitely allows for interactions between genes. \section[Integration with other packages]{Integration with other packages} \subsection[shiny]{shiny} Shiny is an \R{} package that makes it easy to build interactive web applications (apps) straight from \R{} \incfig{images/shiny_screenshot.png}{\textwidth}{Shiny app.} {This simple application allows visualisation of genes using the \Biocpkg{GOexpress} \Rfunction{expression\_profiles\_symbol} function. An online interactive version was made available at: \url{https://kevinrue.shinyapps.io/alvmac/}} as shown in Figure~\ref{images/shiny_screenshot.png}. More information is available at \url{http://shiny.rstudio.com/}. \section[Notes]{Notes} \subsection[Authors' contributions]{Authors' contributions} Conception and development of the \Biocpkg{GOexpress} package was carried out by KR-A with contributions by PAM, under the supervision of SVG and DEM. Experimental data used for testing was generated and analysed with the help of DAM and NC. Integration of the random forest statistical frameworks was advised by BH and ACP. This User's Guide was prepared by KR-A, and edited by PAM, BH, DAM, NCN, ACP, SVG, and DEM. \subsection[Acknowledgement]{Acknowledgments} Since the early beginning, \Biocpkg{GOexpress} has grown from constructive feedback, and I would like to thank a number of colleagues and scientists from all backgrounds who contributed each in their own way to the present version and features of the package. Special thanks to Dr. Paul Cormican, Simone Coughlan, Dr. Karsten Hokamp, and an unknown reviewer for feedback leading to some features. Sincere thanks to Dr. Kate Killick for testing on human data and feedback. My thanks to the \Bioconductor{} staff and in particular to Hervé Pagès for the helpful feedback which improved the standards of the code and documentation. Last but not least, thanks to the University College Dublin "OpenSequencing" group, the "Virtual Institute of Bioinformatics and Evolution" (VIBE), and the "UCD PhD Symposium in Computational Biology \& Innovation" where I first presented raw versions of \Biocpkg{GOexpress} and received valuable feedback and advice in return, underlying a significant number of updated and new features. \subsection[Session information]{Session information} <<>>= sessionInfo() @ \end{document}