%\VignetteIndexEntry{Gene Set Variation Analysis}
%\VignetteDepends(Biobase, methods, gplots, glmnet}
%\VignetteKeywords{GSVA, GSEA, Expression, Microarray, Pathway}
%\VignettePackage{GSVA}
\documentclass[a4paper]{article}
\usepackage{url}
\usepackage{graphicx}
\usepackage{longtable}
\usepackage{hyperref}
\usepackage{natbib}
\usepackage{fullpage}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\title{GSVA: The Gene Set Variation Analysis package \\ for microarray and RNA-seq data}
\author{Sonja H\"anzelmann$^1$, Robert Castelo$^1$ and Justin Guinney$^2$}

\begin{document}

\SweaveOpts{eps=FALSE}

\maketitle

\begin{quote}
{\scriptsize
1. Research Program on Biomedical Informatics (GRIB), Hospital del Mar Research Institute (IMIM) and Universitat Pompeu Fabra, Parc de Recerca Biom\`edica de Barcelona, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain

2. Sage Bionetworks, 1100 Fairview Ave N., Seattle, Washington, 98109 USA
}
\end{quote}

\begin{abstract}
The \Rpackage{GSVA} package implements a non-parametric unsupervised method,
called Gene Set Variation Analysis (GSVA), for assessing gene set enrichment
(GSE) in gene expression microarray and RNA-seq data. In contrast to most
GSE methods, GSVA performs a change in coordinate systems, transforming the
data from a gene by sample matrix to a gene set by sample matrix. Thereby
allowing for the evaluation of pathway enrichment for each sample. This
transformation is done without the use of a phenotype, thus facilitating very
powerful and open-ended analyses in a now pathway centric manner. In this
vignette we illustrate how to use the \Rpackage{GSVA} package to perform some
of these analyses using published microarray and RNA-seq data already
pre-processed and stored in the companion experimental data package
\Rpackage{GSVAdata}.
\end{abstract}

<<options, echo=FALSE>>=
options(width=60)
@

\section{Introduction}

Gene set enrichment analysis (GSEA)
\citep[see][]{mootha_pgc_1alpha_responsive_2003, subramanian_gene_2005} is a
method designed to assess the concerted behavior of functionally related genes
forming a set, between two well-defined groups of samples. Because it does not
rely on a ``gene list'' of interest but on the entire ranking of genes, GSEA
has been shown to provide greater sensitivity to find gene expression changes
of small magnitude that operate coordinately in specific sets of functionally
related genes. However, due to the reduced costs in genome-wide gene-expression
assays, data is being produced under more complex experimental designs that
involve multiple RNA sources enriched with a wide spectrum of phenotypic and/or
clinical information. The Cancer Genome Atlas (TCGA) project
(see \url{http://cancergenome.nih.gov}) and the data deposited on it constitute
a canonical example of this situation.

To facilitate the functional enrichment analysis of this kind of data, we
developed Gene Set Variation Analysis (GSVA) which allows the assessment of the
underlying pathway activity variation by transforming the gene by sample matrix
into a gene set by sample matrix without the \textit{a priori} knowledge of the
experimental design. The method is both non-parametric and unsupervised, and
bypasses the conventional approach of explicitly modeling phenotypes within
enrichment scoring algorithms. Focus is therefore placed on the
\textit{relative} enrichment of pathways across the sample space rather than
the \textit{absolute} enrichment with respect to a phenotype. The value
of this approach is that it permits the use of traditional analytical methods
such as classification, survival analysis, clustering, and correlation analysis
in a pathway focused manner. It also facilitates sample-wise comparisons between
pathways and other complex data types such as microRNA expression or binding
data, copy-number variation (CNV) data, or single nucleotide polymorphisms
(SNPs). However, for case-control experiments, or data with a moderate to small
sample size ($<30$), other GSE methods that explicitly include the phenotype in
their model are more likely to provide greater statistical power to detect
functional enrichment.

In the rest of this vignette we describe briefly the methodology behind GSVA,
give an overview of the functions implemented in the package and show a few
applications. The interested reader is referred to \citep{GSVA_submitted} for
more comprehensive explanations and more complete data analysis examples with
GSVA, as well as for citing GSVA if you use it in your own work.

\section{GSVA enrichment scores}

A schematic overview of the GSVA method is provided in Figure \ref{methods}, which shows the two main required inputs: a matrix $X=\{x_{ij}\}_{p\times n}$ of normalized expression values (see Methods for details on the pre-processing steps) for $p$ genes by $n$ samples, where typically $p\gg n$, and a collection of gene sets $\Gamma = \{\gamma_1, \dots, \gamma_m\}$. We shall denote by $x_i$ the expression profile of the $i$-th gene, by $x_{ij}$ the specific expression value of the $i$-th gene in the $j$-th sample, and by $\gamma_k$ the subset of row indices in $X$ such that $\gamma_k \subset \{1,\ldots\,p\}$ defines a set of genes forming a pathway or some other functional unit. Let $|\gamma_k |$ be the number of genes in $\gamma_k$.

\begin{figure}[ht]
\centerline{\includegraphics[width=\textwidth]{methods}}
\caption{{\bf GSVA methods outline.}
The input for the GSVA algorithm are a gene expression matrix in the form of log2
microarray expression values or RNA-seq counts and a database of gene sets.
1. Kernel estimation of the cumulative density function (kcdf). The two plots
show two simulated expression profiles mimicking 6 samples from microarray and
RNA-seq. The $x$-axis corresponds to expression values where each gene is lowly
expressed in the four samples with lower values and highly expressed in the other
two. The scale of the kcdf is on the left $y$-axis and the scale of the Gaussian
and Poisson kernels is on the right $y$-axis.
2. The expression-level statistic is rank ordered for each sample.
3. For every gene set, the Kolmogorov-Smirnov-like rank statistic is calculated.
The plot illustrates a gene set consisting of 3 genes out of a total number of 10
with the sample-wise calculation of genes inside and outside of the gene set.
4. The GSVA enrichment score is either the difference between the two sums or the
maximum deviation from zero. The two plots show two simulations of the resulting
scores under the null hypothesis of no gene expression change (see main text).
The output of the algorithm is matrix containing pathway enrichment profiles for
each gene set and sample. }
\label{methods}
\end{figure}

GSVA starts by evaluating whether a gene $i$ is highly or lowly expressed in sample $j$ in the context of the sample population distribution. Probe effects can alter hybridization intensities in microarray data such that expression values can greatly differ between two non-expressed genes\cite{zilliox_gene_2007}. Analogous gene-specific biases, such as GC content or gene length have been described in RNA-seq data\cite{hansen_removing_2012}. To bring distinct expression profiles to a common scale, an expression-level statistic is calculated as follows. For each gene expression profile $x_i=\{x_{i1},\dots,x_{in}\}$, a non-parametric kernel estimation of its cumulative density function is performed using a Gaussian kernel \cite[pg.~148]{silverman_density_1986} in the case of microarray data:

\begin{equation}
\label{density}
\hat{F}_{h_i}(x_{ij})=\frac{1}{n}\sum_{k=1}^n\int_{-\infty}^{\frac{x_{ij}-x_{ik}}{h_i}}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}}dt\,,
\end{equation}
where $h_i$ is the gene-specific bandwidth parameter that controls the resolution of the kernel estimation, which is set to $h_i=s_i/4$, where $s_i$ is the sample standard deviation of the $i$-th gene (Figure \ref{methods}, step 1). In the case of RNA-seq data, a discrete Poisson kernel \cite{canale_bayesian_2011} is employed:

\begin{equation}
\hat{F}_r(x_{ij}) = \frac{1}{n} \sum_{k=1}^n \sum_{y=0}^{x_{ij}} \frac{e^{-(x_{ik}+r)}(x_{ik}+r)^y}{y!}\,,
\end{equation}
where $r=0.5$ in order to set the mode of the Poisson kernel at each $x_{ik}$, because the mode of a Poisson distribution with an integer mean $\lambda$ occurs at $\lambda$ and $\lambda-1$ and at the largest integer smaller than $\lambda$ when $\lambda$ is continuous.

Let $z_{ij}$ denote the previous expression-level statistic $\hat{F}_{h_i}(x_{ij})$, or $\hat{F}_r(x_{ij})$, depending on whether $x_{ij}$ are continuous microarray, or discrete count RNA-seq values, respectively. The following step condenses expression-level statistics into gene sets by calculating sample-wise enrichment scores. To reduce the influence of potential outliers, we first convert $z_{ij}$ to ranks $z_{(i)j}$ for each sample $j$ and normalize further $r_{ij}=|p/2-z_{(i)j}|$ to make the ranks symmetric around zero (Figure~\ref{methods}, step 2).

We assess the enrichment score similar to the GSEA and ASSESS methods \cite{subramanian_gene_2005,edelman_analysis_2006} using the Kolmogorov-Smirnov (KS) like random walk statistic (Figure~\ref{methods}, step 3):
\begin{equation}
\label{walk}
\nu_{jk}(\ell) = \frac{\sum_{i=1}^\ell |r_{ij}|^{\tau} I(g_{(i)} \in
\gamma_k)}{\sum_{i=1}^p |r_{ij}|^{\tau}I(g_{(i)} \in \gamma_k)} -
\frac{\sum_{i=1}^\ell I(g_{(i)} \not\in \gamma_k)}{p-|\gamma_k|},
\end{equation}
where $\tau$ is a parameter describing the weight of the tail in the random walk (default $\tau = 1$), $\gamma_k$ is the $k$-th gene set, $I(g_{(i)} \in \gamma_k)$ is the indicator function on whether the $i$-th gene (the gene corresponding to the $i$-th ranked expression-level statistic) belongs to gene set $\gamma_k$, $|\gamma_k|$ is the number of genes in the $k$-th gene set, and $p$ is the number of genes in the data set. 

We offer two approaches for turning the KS like random walk statistic into an enrichment statistic (ES) (also called GSVA score), the classical maximum deviation method \cite{subramanian_gene_2005,edelman_analysis_2006,verhaak_integrated_2010} and a normalized ES. The first ES is the maximum deviation from zero of the random walk of the $j$-th sample with respect to the $k$-th gene set :

\begin{equation}
\label{escore}
ES^{\mbox{\tiny{max}}}_{jk} = \nu_{jk}[\arg \max_{\ell=1,\dots,p} \left|\nu_{jk}(\ell)\right|].
\end{equation}
For each gene set $k$, this approach produces a distribution of enrichment scores that is bimodal (Figure~\ref{methods}, step 4, top panel). This is an intrinsic property of the KS like random walk, which generates non-zero maximum deviations under the null distribution. In GSEA \cite{subramanian_gene_2005} it is also observed that the empirical null distribution obtained by permuting phenotypes is bimodal and, for this reason, significance is determined independently using the positive and negative sides of the null distribution. In our case, we would like to provide a standard Gaussian distribution of enrichment scores under the null hypothesis of no change in pathway activity throughout the sample population. For this purpose we propose a second, alternative score that produces an ES distribution approximating this requirement (Figure~\ref{methods}, step 4, bottom panel):

\begin{equation}
\label{escore_2}
ES^{\mbox{\tiny{diff}}}_{jk} = \left|ES^{+}_{jk}\right| - \left|ES^{-}_{jk}\right|=\max_{\ell=1,\dots,p}(0,\nu_{jk}(\ell)) - \min_{\ell=1,\dots,p}(0,\nu_{jk}(\ell))\,,
\end{equation}
where $ES_{jk}^{+}$ and $ES_{jk}^{-}$ are the largest positive and negative random walk deviations from zero, respectively, for sample $j$ and gene set $k$.  This statistic may be compared to the Kuiper test statistic \cite{pearson_comparison_1963}, which sums the maximum and minimum deviations to make the test statistic more sensitive in the tails.  In contrast, our test statistic penalizes deviations that are large in both tails, and provides a ``normalization'' of the enrichment score by subtracting potential noise. There is a clear biological interpretation of this statistic, it emphasizes genes in pathways that are concordantly activated in one direction only, either over-expressed or under-expressed relative to the overall population. For pathways containing genes strongly acting in both directions, the deviations will cancel each other out and show little or no enrichment. Because this statistic is unimodal and approximately normal, downstream analyses which may impose distributional assumptions on the data are possible.

Figure~\ref{methods}, step 4 shows a simple simulation where standard Gaussian deviates are independenty sampled from $p=20,000$ genes and $n=30$ samples, thus mimicking a null distribution of no change in gene expression. One hundred gene sets are uniformly sampled at random from the $p$ genes with sizes ranging from 10 to 100 genes. Using these two inputs, we calculate the maximum deviation ES and the normalized ES. The resulting distributions are depicted in Figure~\ref{methods}, step 4 and in the larger figure below, illustrating the previous description.

<<>>=
library(GSVA)

p <- 20000    ## number of genes
n <- 30       ## number of samples
nGS <- 100    ## number of gene sets
min.sz <- 10  ## minimum gene set size
max.sz <- 100 ## maximum gene set size
X <- matrix(rnorm(p*n), nrow=p, dimnames=list(1:p, 1:n))
dim(X)
gs <- as.list(sample(min.sz:max.sz, size=nGS, replace=TRUE)) ## sample gene set sizes
gs <- lapply(gs, function(n, p) sample(1:p, size=n, replace=FALSE), p) ## sample gene sets
es.max <- gsva(X, gs, mx.diff=FALSE, verbose=FALSE)$es.obs
es.dif <- gsva(X, gs, mx.diff=TRUE, verbose=FALSE)$es.obs
@

\begin{center}
<<maxvsdif, fig=TRUE, png=TRUE, echo=TRUE, height=5, width=8>>=
par(mfrow=c(1,2), mar=c(4, 4, 4, 1))
plot(density(as.vector(es.max)), main="Maximum deviation from zero",
     xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
plot(density(as.vector(es.dif)), main="Difference between largest\npositive and negative deviations",
     xlab="GSVA score", lwd=2, las=1, xaxt="n", xlim=c(-0.75, 0.75), cex.axis=0.8)
axis(1, at=seq(-0.75, 0.75, by=0.25), labels=seq(-0.75, 0.75, by=0.25), cex.axis=0.8)
@
\end{center}

\bigskip
Although the GSVA algorithm itself does not evaluate statistical significance for the
enrichment of gene sets, significance with respect to one or more phenotypes can be easily
evaluated using conventional statistical models. Likewise, false discovery rates can be
estimated by permuting the sample labels (Methods).  Examples of these techniques are
provided in the following section.

\section{Overview of the package}

The \Rpackage{GSVA} package implements the methodology described in the previous
section in the function \Rfunction{gsva()} which requires two main input
arguments: the gene expression data and a collection of gene sets. The
expression data can be provided either as a \Rclass{matrix} object of genes
(rows) by sample (columns) expression values, or as an \Rclass{ExpressionSet}
object. The collection of gene sets can be provided either as a \Rclass{list}
object with names identifying gene sets and each entry of the list containing
the gene identifiers of the genes forming the corresponding set, or as a
\Rclass{GeneSetCollection} object as defined in the \Rpackage{GSEABase} package.

When the two main arguments are an \Rclass{ExpressionSet} object and a
\Rclass{GeneSetCollection} object, the \Rfunction{gsva()} function will first
translate the gene identifiers used in the \Rclass{GeneSetCollection} object
into the corresponding feature identifiers of the \Rclass{ExpressionSet} object,
according to its corresponding annotation package. This translation is carried
out by an internal call to the function \Rfunction{mapIdentifiers()} from the
\Rpackage{GSEABase} package. This means that both input arguments may specify
features with different types of identifiers, such as Entrez IDs and probeset IDs,
and the \Rpackage{GSEABase} package will take care of mapping them to each other.

A second filtering step is applied that removes genes without matching features
in the \Rclass{ExpressionSet} object. If the expression data is given as a
\Rclass{matrix} object then only the latter filtering step will be applied by the
\Rfunction{gsva()} function and, therefore, it will be the responsibility of the
user to have the same type of identifiers in both the expression data and the
gene sets.

After these automatic filtering steps, we may additionally filter out gene sets
that do not meet a minimum and/or maximum size specified by the optional
arguments \Robject{min.sz} and \Robject{max.sz} which are set by default to 1
and \Robject{Inf}, respectively. Finally, the \Rfunction{gsva()} function will
carry out the calculations specified in the previous section and return a
gene set by sample matrix of GSVA enrichment scores in the form of a
\Rclass{matrix} object when this was the class of the input expression data
object. Otherwise, it will return an \Rclass{ExpressionSet} object inheriting
all the corresponding phenotypic information from the input data.

An important argument of the \Rfunction{gsva()} function is the flag
\Robject{mx.diff} which is set to \Robject{TRUE} by default. Under this default
setting, GSVA enrichment scores are calculated using Equation~\ref{escore_2}, and
therefore, are more amenable by analysis techniques that assume the data to be
normally distributed. When setting \Robject{mx.diff=FALSE}, then
Equation~\ref{escore} is employed, calculating enrichment in an analogous way to
classical GSEA which typically provides a bimodal distribution of GSVA enrichment
scores for each gene.

Since the calculations for each gene set are independent from each other, the
\Rfunction{gsva()} function offers two possibilities to perform them in
parallel. One consists of loading the library \Rpackage{snow}, which will enable
the parallelization of the calculations through a cluster of computers. In order
to activate this option we should specify in the argument \Robject{parallel.sz}
the number of processors we want to use (default is zero which means no
parallelization is going to be employed). The other is loading the library
\Rpackage{parallel} and then the \Rfunction{gsva()} function will use the core
processors of the computer where R is running. If we want to limit
\Rfunction{gsva()} in the number of core processors that should be used, we can
do it by specifying the number of cores in the \Robject{parallel.sz} argument.

The other two functions of the \Rpackage{GSVA} package are
\Rfunction{filterGeneSets()} and \Rfunction{computeGeneSetsOverlaps()} that
serve to explicitly filter gene sets by size and by pair-wise overlap,
respectively. Note that the size filter can also be applied within the
\Rfunction{gsva()} function call.

The \Rfunction{gsva()} function also offers the following three other unsupervised
GSE methods that calculate single sample pathway summaries of expression and which
can be selected through the \Robject{method} argument:

\begin{itemize}
  \item \Robject{method="plage"} \citep{tomfohr_pathway_2005}. Pathway level analysis
        of gene expression (PLAGE) standardizes first expression profiles into z-scores
        over the samples and then calculates the singular value decomposition
        $Z_\gamma=UDV'$ on the z-scores of the genes in the gene set. The coefficients
        of the first right-singular vector (first column of $V$) are taken as the gene
        set summaries of expression over the samples.
  \item \Robject{method="zscore"} \citep{lee_inferring_2008}. The combined z-score method
        also, as PLAGE, standardizes first expression profiles into z-scores over the samples,
        but combines them together for each gene set at each individual sample as follows.
        Given a gene set $\gamma=\{1,\dots,k\}$ with z-scores $Z_1,\dots,Z_k$ for each gene,
        the combined z-score $Z_\gamma$ for the gene set $\gamma$ is defined as:
        \begin{equation}
        Z_\gamma = \frac{\sum_{i=1}^k Z_i}{\sqrt{k}}\,.
        \end{equation}
  \item \Robject{method="ssgsea"} \citep{barbie_systematic_2009}. Single sample GSEA (ssGSEA)
        calculates a gene set enrichment score per sample as the normalized difference in
        empirical cumulative distribution functions of gene expression ranks inside and
        outside the gene set.
\end{itemize}

By default \Robject{method="gsva"} and the \Rfunction{gsva()} function uses the GSVA algorithm.

\section{Applications}

In this section we illustrate the following applications of \Rpackage{GSVA}:

\begin{itemize}
  \item Functional enrichment between two subtypes of leukemia.
  \item Identification of molecular signatures in distinct glioblastoma subtypes.
  \item Meta-pathway analysis in the leukemia data.
\end{itemize}

Throughout this vignette we will use the C2 collection of curated gene sets that
form part of the Molecular Signatures Database (MSigDB) version 3.0. This
particular collection of gene sets is provided as a \Rclass{GeneSetCollection}
object called \Robject{c2BroadSets} in the accompanying experimental data package
\Rpackage{GSVAdata}, which stores these and other data employed in this vignette.
These data can be loaded as follows:

<<results=hide>>=
library(GSEABase)
library(GSVAdata)

data(c2BroadSets)
c2BroadSets
@
where we observe that \Robject{c2BroadSets} contains \Sexpr{length(c2BroadSets)}
gene sets. We also need to load the following additional libraries:

<<results=hide>>=
library(Biobase)
library(genefilter)
library(limma)
library(RColorBrewer)
library(RBGL)
library(graph)
library(Rgraphviz)
library(GSVA)
@
As a final setup step for this vignette, we will employ the
\Rfunction{cache()} function from the \Rpackage{Biobase} package in order to
load some pre-computed results and speed up the building time of the vignette:

<<>>=
cacheDir <- system.file("extdata", package="GSVA")
cachePrefix <- "cache4vignette_"
@
In order to enforce re-calculating everything, either the call to the
\Rfunction{cache()} function should be replaced by its first argument, or the
following command should be written in the R console at this point:

<<eval=FALSE>>=
file.remove(paste(cacheDir, list.files(cacheDir, pattern=cachePrefix), sep="/"))
@

\subsection{Functional enrichment}

In this section we illustrate how to identify functionally enriched gene sets
between two phenotypes. As in most of the applications we start by calculating
GSVA enrichment scores and afterwards, we will employ the linear modeling
techniques implemented in the \Rpackage{limma} package to find the enriched gene
sets.

The data set we use in this section corresponds to the microarray data from
\citep{armstrong_mll_2002} which consists of 37 different individuals
with human acute leukemia, where 20 of them have conventional childhood acute
lymphoblastic leukemia (ALL) and the other 17 are affected with the MLL
(mixed-lineage leukemia gene) translocation. This leukemia data set is stored as
an \verb+ExpressionSet+ object called \Robject{leukemia} in the
\Rpackage{GSVAdata} package and details on how the data was pre-processed can be
found in the corresponding help page. Enclosed with the RMA expression values we
provide some metadata including the main phenotype corresponding to the leukemia
sample subtype.

<<>>=
data(leukemia)
leukemia_eset
head(pData(leukemia_eset))
table(leukemia_eset$subtype)
@
Let's examine the variability of the expression profiles across samples by
plotting the cumulative distribution of IQR values as shown in Figure~\ref{figIQR}.
About 50\% of the probesets show very limited variability across samples
and, therefore, in the following non-specific filtering step we remove this
fraction from further analysis.

<<figIQR, echo=FALSE, results=hide>>=
png(filename="GSVA-figIQR.png", width=500, height=500, res=150)
IQRs <- esApply(leukemia_eset, 1, IQR)
plot.ecdf(IQRs, pch=".", xlab="Interquartile range (IQR)", main="Leukemia data")
abline(v=quantile(IQRs, prob=0.5), lwd=2, col="red")
dev.off()
@
\begin{figure}[ht]
\centerline{\includegraphics[width=0.5\textwidth]{GSVA-figIQR}}
\caption{Empirical cumulative distribution of the interquartile range (IQR) of
expression values in the leukemia data. The vertical red bar is located at the
50\% quantile value of the cumulative distribution.}
\label{figIQR}
\end{figure}

We carry out a non-specific filtering step by discarding the 50\% of the
probesets with smaller variability, probesets without Entrez ID annotation,
probesets whose associated Entrez ID is duplicated in the annotation, and
Affymetrix quality control probes:

<<>>=
filtered_eset <- nsFilter(leukemia_eset, require.entrez=TRUE, remove.dupEntrez=TRUE,
                          var.func=IQR, var.filter=TRUE, var.cutoff=0.5, filterByQuantile=TRUE,
                          feature.exclude="^AFFX")
filtered_eset
leukemia_filtered_eset <- filtered_eset$eset
@

The calculation of GSVA enrichment scores is performed in one single call to the
\verb+gsva()+ function. However, one should take into account that this function
performs further non-specific filtering steps prior to the actual calculations.
On the one hand, it matches gene identifiers between gene sets and gene
expression values. On the other hand, it discards gene sets that do not meet
minimum and maximum gene set size requirements specified with the arguments
\verb+min.sz+ and \verb+max.sz+, respectively, which, in the call below, are
set to 10 and 500 genes. Because we want to use \Rpackage{limma} on the resulting
GSVA enrichment scores, we leave deliberately unchanged the default argument
\Robject{mx.diff=TRUE} to obtain approximately normally distributed ES.

<<>>=
cache(leukemia_es <- gsva(leukemia_filtered_eset, c2BroadSets,
                           min.sz=10, max.sz=500, verbose=TRUE)$es.obs,
                           dir=cacheDir, prefix=cachePrefix)
@
We test whether there is a difference between the GSVA enrichment scores from each
pair of phenotypes using a simple linear model and moderated t-statistics computed
by the \verb+limma+ package using an empirical Bayes shrinkage method
\citep[see][]{Smyth_2004}. We are going to examine both, changes at gene level
and changes at pathway level and since, as we shall see below, there are plenty
of them, we are going to employ the following stringent cut-offs to attain a high
level of statistical and biological significance:

<<>>=
adjPvalueCutoff <- 0.001
logFCcutoff <- log2(2)
@
where we will use the latter only for the gene-level differential expression
analysis.

<<>>=
design <- model.matrix(~ factor(leukemia_es$subtype))
colnames(design) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_es, design)
fit <- eBayes(fit)
allGeneSets <- topTable(fit, coef="MLLvsALL", number=Inf)
DEgeneSets <- topTable(fit, coef="MLLvsALL", number=Inf,
                       p.value=adjPvalueCutoff, adjust="BH")
res <- decideTests(fit, p.value=adjPvalueCutoff)
summary(res)
@
Thus, there are \Sexpr{sum(res[, "MLLvsALL"]!=0)} MSigDB C2 curated pathways that
are differentially activated between MLL and ALL at \Sexpr{adjPvalueCutoff*100}\%
FDR. When we carry out the corresponding differential expression analysis at gene level:

<<>>=
logFCcutoff <- log2(2)
design <- model.matrix(~ factor(leukemia_eset$subtype))
colnames(design) <- c("ALL", "MLLvsALL")
fit <- lmFit(leukemia_filtered_eset, design)
fit <- eBayes(fit)
allGenes <- topTable(fit, coef="MLLvsALL", number=Inf)
DEgenes <- topTable(fit, coef="MLLvsALL", number=Inf,
                    p.value=adjPvalueCutoff, adjust="BH", lfc=logFCcutoff)
res <- decideTests(fit, p.value=adjPvalueCutoff, lfc=logFCcutoff)
summary(res)
@
Here, \Sexpr{sum(res[, "MLLvsALL"]!=0)} genes show up as being differentially
expressed with a minimum fold-change of \Sexpr{2^logFCcutoff} at \Sexpr{adjPvalueCutoff*100}\%
FDR. We illustrate the genes and pathways that are changing by means of volcano
plots (Fig.~\ref{leukemiaVolcano}.

<<leukemiaVolcano, echo=FALSE, results=hide>>=
png(filename="GSVA-leukemiaVolcano.png", width=800, height=500)
par(mfrow=c(1,2))
plot(allGeneSets$logFC, -log10(allGeneSets$P.Value), pch=".", cex=4, col=grey(0.75),
     main="Gene sets", xlab="GSVA enrichment score difference", ylab=expression(-log[10]~~~Raw~P-value))
abline(h=-log10(max(allGeneSets$P.Value[allGeneSets$adj.P.Val <= adjPvalueCutoff])),
       col=grey(0.5), lwd=1, lty=2)
points(allGeneSets$logFC[match(DEgeneSets$ID, allGeneSets$ID)],
       -log10(allGeneSets$P.Value[match(DEgeneSets$ID, allGeneSets$ID)]), pch=".",
       cex=4, col="red")
text(max(allGeneSets$logFC)*0.85,
         -log10(max(allGeneSets$P.Value[allGeneSets$adj.P.Val <= adjPvalueCutoff])),
         sprintf("%.1f%% FDR", 100*adjPvalueCutoff), pos=1)

plot(allGenes$logFC, -log10(allGenes$P.Value), pch=".", cex=4, col=grey(0.75),
     main="Genes", xlab="Log fold-change", ylab=expression(-log[10]~~~Raw~P-value))
abline(h=-log10(max(allGenes$P.Value[allGenes$adj.P.Val <= adjPvalueCutoff])),
       col=grey(0.5), lwd=1, lty=2)
abline(v=c(-logFCcutoff, logFCcutoff), col=grey(0.5), lwd=1, lty=2)
points(allGenes$logFC[match(DEgenes$ID, allGenes$ID)],
       -log10(allGenes$P.Value[match(DEgenes$ID, allGenes$ID)]), pch=".",
       cex=4, col="red")
text(max(allGenes$logFC)*0.85,
         -log10(max(allGenes$P.Value[allGenes$adj.P.Val <= adjPvalueCutoff])),
         sprintf("%.1f%% FDR", 100*adjPvalueCutoff), pos=1)
dev.off()
@
\begin{figure}
\centerline{\includegraphics[width=0.8\textwidth]{GSVA-leukemiaVolcano}}
\caption{Volcano plots for differential pathway activation (left) and differential
         gene expression (right) in the leukemia data set.}
\label{leukemiaVolcano}
\end{figure}

The signatures of both, the differentially activated pathways reported by the
GSVA analysis and of the differentially expressed genes are shown in Figures
\ref{leukemiaHeatmapGeneSets} and \ref{leukemiaHeatmapGenes}, respectively.
Many of the gene sets and pathways reported in Figure~\ref{leukemiaHeatmapGeneSets}
are directly related to ALL and MLL.

<<leukemiaHeatmapGeneSets, echo=FALSE, results=hide>>=
png(filename="GSVA-leukemiaHeatmapGeneSets.png", width=500, height=500)
GSVAsco <- exprs(leukemia_es[DEgeneSets$ID, ])
colorLegend <- c("darkred", "darkblue")
names(colorLegend) <- c("ALL", "MLL")
sample.color.map <- colorLegend[pData(leukemia_es)[, "subtype"]]
names(sample.color.map) <- colnames(GSVAsco)
sampleClustering <- hclust(as.dist(1-cor(GSVAsco, method="spearman")), method="complete")
geneSetClustering <- hclust(as.dist(1-cor(t(GSVAsco), method="pearson")), method="complete")
heatmap(GSVAsco, ColSideColors=sample.color.map, xlab="samples",
        ylab="Gene sets and pathways", margins=c(2, 20),
        labRow=substr(gsub("_", " ", gsub("^KEGG_|^REACTOME_|^BIOCARTA_", "", rownames(GSVAsco))), 1, 35),
        labCol="", scale="row",
        Colv=as.dendrogram(sampleClustering), Rowv=as.dendrogram(geneSetClustering))
legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
dev.off()
@
\begin{figure}[ht]
\centerline{\includegraphics[width=0.7\textwidth]{GSVA-leukemiaHeatmapGeneSets}}
\caption{Heatmap of differentially activated pathways at \Sexpr{adjPvalueCutoff*100}\% FDR
in the Leukemia data set.}
\label{leukemiaHeatmapGeneSets}
\end{figure}


<<leukemiaHeatmapGenes, echo=FALSE, results=hide>>=
png(filename="GSVA-leukemiaHeatmapGenes.png", width=500, height=500)
exps <- exprs(leukemia_eset[DEgenes$ID, ])
colorLegend <- c("darkred", "darkblue")
names(colorLegend) <- c("ALL", "MLL")
sample.color.map <- colorLegend[pData(leukemia_eset)[, "subtype"]]
names(sample.color.map) <- colnames(exps)
sampleClustering <- hclust(as.dist(1-cor(exps, method="spearman")), method="complete")
geneClustering <- hclust(as.dist(1-cor(t(exps), method="pearson")), method="complete")
heatmap(exps, ColSideColors=sample.color.map, xlab="samples", ylab="Genes",
        labRow="", labCol="", scale="row", Colv=as.dendrogram(sampleClustering),
         Rowv=as.dendrogram(geneClustering), margins=c(2,2))
legend("topleft", names(colorLegend), fill=colorLegend, inset=0.01, bg="white")
dev.off()
@
\begin{figure}[ht]
\centerline{\includegraphics[width=0.5\textwidth]{GSVA-leukemiaHeatmapGenes}}
\caption{Heatmap of differentially expressed genes with a minimum fold-change of
\Sexpr{2^logFCcutoff} at \Sexpr{adjPvalueCutoff*100}\% FDR in the leukemia data set.}
\label{leukemiaHeatmapGenes}
\end{figure}

\subsection{Molecular signature identification}

In \citep{verhaak_integrated_2010} four subtypes of Glioblastoma multiforme
(GBM) - proneural, classical, neural and mesenchymal - were identified by
the characterization of distinct gene-level expression patterns. Using eight
gene set signatures specific to brain cell types - astrocytes, oligodendrocytes,
neurons and cultured astroglial cells - derived from murine models by
\citep{cahoy_transcriptome_2008}, we replicate the analysis of
\citep{verhaak_integrated_2010} by employing GSVA to transform the gene
expression measurements into enrichment scores for these eight gene sets, without
taking the sample subtype grouping into account. We start by loading and have
a quick glance to the data which forms part of the \verb+GSVAdata+ package:

<<>>=
data(gbm_VerhaakEtAl)
gbm_eset
head(featureNames(gbm_eset))
table(gbm_eset$subtype)
data(brainTxDbSets)
sapply(brainTxDbSets, length)
lapply(brainTxDbSets, head)
@
GSVA enrichment scores for the gene sets contained in \Robject{brainTxDbSets}
are calculated, in this case using \Robject{mx.diff=FALSE},  as follows:

<<>>=
gbm_es <- gsva(gbm_eset, brainTxDbSets, mx.diff=FALSE, verbose=FALSE)$es.obs
@
Figure \ref{gbmSignature} shows the GSVA enrichment scores obtained for the
up-regulated gene sets across the samples of the four GBM subtypes. As expected,
the \emph{neural} class is associated with the neural gene set and the astrocytic
gene sets. The \emph{mesenchymal} subtype is characterized by the expression of
mesenchymal and microglial markers, thus we expect it to correlate with the
astroglial gene set. The \emph{proneural} subtype shows high expression of
oligodendrocytic development genes, thus it is not surprising that the
oligodendrocytic gene set is highly enriched for ths group. Interestingly, the
\emph{classical} group correlates highly with the astrocytic gene set. In
summary, the resulting GSVA enrichment scores recapitulate accurately the
molecular signatures from \citet{verhaak_integrated_2010}.

<<gbmSignature, echo=FALSE, results=hide>>=
png(filename="GSVA-gbmSignature.png", width=700, height=500)
subtypeOrder <- c("Proneural", "Neural", "Classical", "Mesenchymal")
sampleOrderBySubtype <- sort(match(gbm_es$subtype, subtypeOrder), index.return=TRUE)$ix
subtypeXtable <- table(gbm_es$subtype)
subtypeColorLegend <- c(Proneural="red", Neural="green", Classical="blue", Mesenchymal="orange")
geneSetOrder <- c("astroglia_up", "astrocytic_up", "neuronal_up", "oligodendrocytic_up")
geneSetLabels <- gsub("_", " ", geneSetOrder)
hmcol <- colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol <- hmcol[length(hmcol):1]

heatmap(exprs(gbm_es)[geneSetOrder, sampleOrderBySubtype], Rowv=NA, Colv=NA,
        scale="row", margins=c(3,5), col=hmcol,
		    ColSideColors=rep(subtypeColorLegend[subtypeOrder], times=subtypeXtable[subtypeOrder]),
				labCol="", gbm_es$subtype[sampleOrderBySubtype],
        labRow=paste(toupper(substring(geneSetLabels, 1,1)), substring(geneSetLabels, 2), sep=""),
        cexRow=2, main=" \n ")
par(xpd=TRUE)
text(0.22,1.11, "Proneural", col="red", cex=1.2)
text(0.36,1.11, "Neural", col="green", cex=1.2)
text(0.48,1.11, "Classical", col="blue", cex=1.2)
text(0.66,1.11, "Mesenchymal", col="orange", cex=1.2)
mtext("Gene sets", side=4, line=0, cex=1.5)
mtext("Samples          ", side=1, line=4, cex=1.5)
dev.off()
@
\begin{figure}
\centerline{\includegraphics[width=0.6\textwidth]{GSVA-gbmSignature}}
\caption{Heatmap of GSVA scores for cell-type brain signatures from murine models (y-axis)
across GBM samples grouped by GBM subtype.}
\label{gbmSignature}
\end{figure}


%========================================================================================

\subsection{Meta-pathway analysis}

In biological systems, pathways do not operate independently and can have diverse
degrees of cross-talk between them. In this subsection we show how to identify
pathways that have a highly-coordinated activity but whose gene sets have little
or no overlap. We apply this type of analysis, which we call meta-pathway analysis,
to the leukemia data set we previously analyzed for differential pathway activation.

For this analysis we consider the subset of canonical pathways from the C2
collection of MSigDB Gene Sets. These correspond to the following pathways from
KEGG, REACTOME and BIOCARTA:

<<>>=
canonicalC2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
                                      grep("^REACTOME", names(c2BroadSets)),
                                      grep("^BIOCARTA", names(c2BroadSets)))]
canonicalC2BroadSets
@
We calculate GSVA enrichment scores discarding gene sets with less than 10
genes and more than 500. Note that we do not filter for variability as we are not
interested for differential pathway activation:

<<>>=
cache(leukemia_canonicalPwy_es <- gsva(leukemia_eset, canonicalC2BroadSets,
       min.sz=10, max.sz=500, mx.diff=TRUE, verbose=TRUE)$es.obs,
       dir=cacheDir, prefix=cachePrefix)
@
We are interested in those pathways that have little overlap between their sets
of genes but are highly correlated. For the purpose of applying such a filter
we need to calculate the fraction of genes that overlap between every pair of
gene sets which is possible to do with the function
\verb+computeGeneSetsOverlap()+:

<<>>=
overlapMatrix <- computeGeneSetsOverlap(canonicalC2BroadSets, leukemia_eset,
                                        min.sz=10, max.sz=500)
@
We could quickly obtain a network of cross-talk associations by calculating
marginal pair-wise correlations, like Pearson correlation coefficients (PCCs),
and selecting those pairs of pathways that are highly correlated. However,
potentially many of the marginal pair-wise associations could be spurious,
that is, indirectly mediated by other pathways.

In order to select direct (non-spurious) relationships we have carried out a
Gaussian graphical modeling (GGM) analysis of the cross-talk associations between
pathways that follow from the GSVA enrichment score data. A GGM analysis assumes
that the data forms a multivariate normal sample from a distribution
$\mathcal{N}(\mu, \Sigma)$ and that the underlying network can be represented by
an undirected graph $G$ whose missing edges match the pattern of zeroes in the
inverse covariance matrix $\Sigma^{-1}$ \citep[see][]{Lauritzen_1996}. Since the
dimension of the data with $p=$\Sexpr{dim(leukemia_canonicalPwy_es)[1]} and
$n=$\Sexpr{dim(leukemia_canonicalPwy_es)[2]} precludes the application of classical
GGM techniques \citep[pg.~126]{Lauritzen_1996} we will follow a limited-order partial
correlation based approach described in \citep{Castelo_qpgraph_2006,
Castelo_qpgraph_2009} and implemented in the Bioconductor package \Rpackage{qpgraph}:

<<>>=
library(qpgraph)
@
In this approach one calculates a measure of linear association over all marginal
distributions of size $(q+2) < n$, called the non-rejection rate \citep{Castelo_qpgraph_2006}.
However, the stratification of samples into the two leukemia subtypes ALL and MLL
introduces a covariate which could be confounding the estimated linear associations.
To take this structure of the data into account, we can employ the so-called
generalized non-rejection rate \citep{Roverato_gnrr_2010}, which aims at identifying
associations that are common through multiple data sets or experimental conditions.
We calculate generalized non-rejection rates using $q=15$ for ALL and $q=12$ for MLL,
i.e., $q=n-5$ in each condition:

<<>>=
cache(gennrr <- qpGenNrr(leukemia_canonicalPwy_es, datasetIdx="subtype",
                         qOrders=c(ALL=15, MLL=12),
                         identicalQs=FALSE, clusterSize=2)$genNrr,
      dir=cacheDir, prefix=cachePrefix)
@
We do not consider the associations involving pairs of pathways with an overlap of
genes larger than 5\%:

<<>>=
gennrr[overlapMatrix > 0.05] <- NA
@
By considering a cut-off on the generalized non-rejection rate we could directly
obtain an estimated $q$-order partial correlation graph, or qp-graph, denoted by
$\hat{G}^{(q)}$, which would constitute an approximation to the underlying
undirected graph $G$. However, following the model-based strategy proposed in
\citep{Castelo_qpgraph_2006}, we will first examine the maximum boundary size,
denoted by $bd(G)$, of the possible resulting graphs as function of different
cut-offs applied to the non-rejection rates. This can be done by using the
function \verb+qpBoundary()+ whose result is displayed in
Figure~\ref{fig:gennrrcliquenr}

<<gennrrmaxbd, echo=FALSE, results=hide>>=
png(filename="GSVA-gennrrmaxbd.png", width=800, height=800, res=150)
qpbd <- qpBoundary(gennrr, N=ncol(leukemia_canonicalPwy_es), breaks=20)
dev.off()
@
\begin{figure}[ht]
\centerline{\includegraphics[width=0.5\textwidth]{GSVA-gennrrmaxbd}}
\caption{Maximum boundary as function of non-rejection rate cut-offs. The red
line indicates the sample size of the leukemia data set
($n=$\Sexpr{dim(leukemia_canonicalPwy_es)[2]}) and next to each point the graph
density is indicated.}
\label{fig:gennrrcliquenr}
\end{figure}

This function also returns the largest cut-off $\beta^*$, among those considered
in the plot, whose resulting estimated graph $\hat{G}$ has $bd(\hat{G}) < n$. In
this case $\beta^*=$\Sexpr{round(qpbd$threshold, digits=2)} and using this
cut-off we obtain a resulting qp-graph $\hat{G}^{(q)}$ which has
$bd(\hat{G}^{(q)})=$\Sexpr{qpbd$bdsizeunderN} as shown here below:

<<>>=
g <- qpGraph(gennrr, threshold=qpbd$threshold, return.type="graphNEL")
g
max(degree(g))
@
Since $bd(\hat{G}^{(q)})=$\Sexpr{qpbd$bdsizeunderN} is smaller than
$n=$\Sexpr{dim(leukemia_eset)[2]} there is a chance that the maximum
likelihood estimate (MLE) of the sample covariance matrix $S$ exists
\citep[pg.~133]{Lauritzen_1996}, under the restrictions imposed by the
qp-graph $\hat{G}^{(q)}$. Once a MLE of $S$ is obtained, its inverse
$\Sigma^{-1}=K=\{\kappa_{ij}\}$ can be calculated and, therefore, the
corresponding partial correlation coefficients (PACs) as follows:

\begin{equation}
\rho_{ij.R}=\frac{-\kappa_{ij}}{\sqrt{\kappa_{ii}\,\kappa_{jj}}}\,\,\,
\mathrm{where}\,\, R=V\backslash\{i,j\}\,.
\end{equation}
Since these PACs come from a MLE of the sample covariance matrix $S$,
p-values for the null hypothesis of zero partial correlation can be
calculated following \citep{Roverato_1996}. All these computations can
be made in one single call to the function \verb+qpPAC()+:

<<>>=
pac <- qpPAC(leukemia_canonicalPwy_es, g, return.K=TRUE, tol=0.01,
             verbose=FALSE)
@
We employ the estimated PACs and their p-values to select a final estimate
$\hat{G}$ of the underlying undirected graph $G$ whose FDR of wrongly included
edges is below a desired network-wide significance level. This FDR control at
network-wide level helps in discarding spurious associations with a large marginal
strength (i.e., a large Pearson correlation coefficient) but which in fact are
indirectly occurring. Note below that PCCs are not directly estimated from the data
but by scaling the MLE of the sample covariance matrix, obtaining in this way more
precise estimates of the PCCs.

<<>>=
gAM <- qpGraph(gennrr, threshold=qpbd$threshold)
ridx <- row(pac$P)[as.matrix(upper.tri(pac$P) & gAM)]
cidx <- col(pac$P)[as.matrix(upper.tri(pac$P) & gAM)]

allEdges <- data.frame(PWYi=colnames(pac$P)[ridx],
                       PWYj=colnames(pac$P)[cidx],
                       PAC=pac$R[cbind(ridx, cidx)],
                       PAC.P.value=pac$P[cbind(ridx, cidx)],
                       PAC.adj.P.value=p.adjust(pac$P[cbind(ridx, cidx)], method="fdr"),
                       PCC=cov2cor(solve(pac$K))[cbind(ridx, cidx)])

allEdges <- allEdges[sort(allEdges$PAC.adj.P.value, index.return=TRUE)$ix, ]
dim(allEdges)
@
From the \Sexpr{dim(allEdges)[1]} associations we select those with an
absolute PCC larger than 0.7 and a PAC p-value leading to a network-wide FDR
below 10\%:

<<>>=
sigEdges <- allEdges[which(allEdges$PAC.adj.P.value < 0.1 & abs(allEdges$PCC) > 0.7) , ]
dim(sigEdges)
@
Using these significant edges we build a \Rclass{graphNEL} object representing this
network:

<<>>=
vtc <- unique(as.character(unlist(sigEdges[, c("PWYi", "PWYj")], use.names=FALSE)))
g <- new("graphNEL", nodes=vtc, edgemode="undirected")
g <- addEdge(from=as.character(sigEdges[, "PWYi"]),
             to=as.character(sigEdges[, "PWYj"]),
             graph=g)
g
@

<<leukemiaNet, echo=FALSE, results=hide>>=
png(filename="GSVA-leukemiaNet.png", width=800, height=800, res=150)
nodlab <- gsub("_", " ", gsub("KEGG_|REACTOME_|BIOCARTA_", "", vtc))
nodlab <- sapply(nodlab, function(x) { v <- unlist(strsplit(x, ' ')) ; t <- ""; l <- 0; for (w in v) { t <- paste(t,w,sep=" ") ; if (nchar(t)-l > 5) { t <- paste(t, "\ \n", sep="") ; l <- nchar(t) } } ; t })
names(nodlab) <- vtc
nodeRenderInfo(g) <- list(shape="ellipse", label=nodlab, fill="lightgrey", lwd=1)
edgeRenderInfo(g) <- list(lwd=1)
g <- layoutGraph(g, layoutType="fdp")
renderGraph(g)
dev.off()
@
\begin{figure}[ht]
\centerline{\includegraphics{GSVA-leukemiaNet}}
\caption{Network of cross-talk associations between Broad C2 canonical pathways
         of the leukemia data set obtained by a Gaussian graphical modeling approach
         by which edges are included in the graph with a minimum absolute PCC > 0.7
         at a FDR < 10\%.}
\label{fig:LeukemiaNet}
\end{figure}

The network is divided in connected components of the following sizes:

<<>>=
cct <- table(listLen(connComp(g)))
cct
@
Thus, there is one very large connected component of \Sexpr{max(as.integer(names(cct)))}
pathways in which we are going to focus our analysis. In order to dissect the larger
component of \Sexpr{max(as.integer(names(cct)))} pathways, we are going to look to its
degree distribution:

<<>>=
vtcCCmax <- connComp(g)[[which(sapply(connComp(g), length) == max(as.integer(names(cct))))]]
gCCmax <- subGraph(vtcCCmax, g)
gCCmax
table(degree(gCCmax))
@
We can observe that four pathways, representing less than 3\% of the total, have a
connectivity degree higher than \Sexpr{sort(degree(gCCmax), decreasing=TRUE)[5]}.
These highly connected pathways are the following:

<<>>=
sort(degree(gCCmax), decreasing=TRUE)[1:4]
@
We are going to display the subnetwork formed by these highly-connected pathways, the
paths that connect them and their interacting partners. We start by extracting the
corresponding subgraph:

<<>>=
vtcHubNet <- names(sort(degree(gCCmax), decreasing=TRUE)[1:4])
pairs <- combn(vtcHubNet, 2)
sp <- sp.between(gCCmax, pairs[1, ], pairs[2, ])
vtcInShortestPaths <- unlist(sapply(sp, function(x) x$path_detail), use.names=FALSE)
vtcHubNet <- unique(c(vtcHubNet,
                      unlist(boundary(vtcHubNet, gCCmax), use.names=FALSE),
                      vtcInShortestPaths))
gHubNet <- subGraph(vtcHubNet, gCCmax)
gHubNet
@

The resulting network is shown in Figure~\ref{fig:selectedGGMhighDeg}. The gene sets connected by
the meta-pathway analysis show a clear structure, forming four distinct groups with each having a
hub gene set. The gene sets surrounding each hub gene set are functionally related to each other
and associated with cancer. Most of the gene sets are related to genome stability, cell cycle and
cell proliferation, have an effect on the regulation of cell differentiation processes, and pertain
to the escape of apoptosis. Also, the members of the \textit{NUCLEAR RECEPTOR TRANSCRIPTION PATHWAY}
-estrogen receptors, retinoic acid receptor and nuclear receptors- have been indicated as potential
drug targets in disease \cite{gronemeyer_principles_2004}. Among them, Nr4a3 and Nr4a1 are proteins
whose function is not very well understood. However, they have been implicated as tumor suppressors
in myeloid leukemogenesis \cite{mullican_abrogation_2007}. Also, the decreased expression of retinoic
acid receptor $\alpha$ plays a role in leukemogenesis \cite{glasow_dna_2008}. In fact, a recent
review enumerating all genes known to be involved in MLL, includes nuclear-, retinoic acid- and
estrogen receptors \cite{ansari_mixed_2010}. The \textit{FGFR LIGAND BINDING AND ACTIVATION} pathway
contains members of the fibroblast growth factor family which are involved in oncogenic mechanisms
\cite{turner_fibroblast_2010} and have been investigated as therapeutic targets
\cite{greulich_targeting_2011}.

<<selectedGGMhighDeg, fig=TRUE, include=FALSE, height=4, width=4>>=
nodlab <- gsub("_", " ", gsub("KEGG_|REACTOME_|BIOCARTA_", "", vtcHubNet))
nodlab <- sapply(nodlab, function(x) { v <- unlist(strsplit(x, ' ')) ; t <- ""; l <- 0; for (w in v) { t <- paste(t,w,sep=" ") ; if (nchar(t)-l > 5) { t <- paste(t, "\ \n", sep="") ; l <- nchar(t) } } ; t })
names(nodlab) <- vtcHubNet
nodeRenderInfo(gHubNet) <- list(shape="ellipse", label=nodlab, fill="lightgrey", lwd=1)
edgeRenderInfo(gHubNet) <- list(lwd=1)
gHubNet <- layoutGraph(gHubNet, layoutType="fdp")
renderGraph(gHubNet)
@
\begin{figure}[p]
\centerline{\includegraphics[height=0.9\textheight]{GSVA-selectedGGMhighDeg}}
\caption{Network of cross-talk associations between Broad C2 canonical pathways
         of the leukemia data set obtained by a Gaussian graphical modeling approach
         by which edges are included in the graph with a minimum absolute PCC > 0.7
         at a FDR < 10\%. The displayed network only includes edges connecting the
         4 hub genes with highest degree and their interaction partners.}
\label{fig:selectedGGMhighDeg}
\end{figure}

Another gene set that comes to attention is \textit{ONE CARBON POOL BY FOLATE} because it is not
directly associated with proliferation or mitosis. Genetic variation in folate metabolism has been
reported to be associated with childhood leukemia \cite{thompson_maternal_2001}. The genes that
form this pathway, and form part of the analyzed data set, are:
<<>>=
ids <- geneIds(c2BroadSets["KEGG_ONE_CARBON_POOL_BY_FOLATE"])[[1]]
unlist(mget(ids[!is.na(match(ids, unlist(mget(featureNames(leukemia_eset),
                                              hgu95aENTREZID))))], org.Hs.egSYMBOL),
       use.names=FALSE)
@
Among the genes forming this KEGG pathway, the {\it MTR} gene encodes an enzyme that catalyzes
the final step in methionine biosynthesis and has been associated to an increased risk of ALL.
This risk was most pronounced for cases with the \textit{MLL} translocation \cite{lightfoot_genetic_2010}.

\section{Comparison with other methods}

In this section we compare with simulated data the performance of GSVA with other methods
producing pathway summaries of gene expression, concretely, PLAGE, the combined z-score and
ssGSEA which are available through the argument \Robject{method} of the function
\Rfunction{gsva()}. We employ the following simple linear additive model for simulating
normalized microarray data on $p$ genes and $n$ samples divided in two groups representing
a case-control scenario:

\begin{equation}
y_{ij} = \alpha_i + \beta_j + \epsilon_{ij}\,,
\end{equation}
where $\alpha_i\sim\mathcal{N}(0, 1)$ is a gene-specific effect, such as a probe-effect,
with $i=1,\dots,p$, $\beta_j\sim\mathcal{N}(\mu_j, 0.5)$ is a sample-effect with $j=1,2$ and
$e_{ij}\sim\mathcal{N}(0,0.1)$ corresponds to random noise.

We will assess the statistical power to detect one differentially expressed gene set formed by
30 genes, out of $p=1000$, as a function of the sample size and two varying conditions: the
fraction of differentially expressed genes in the gene set (50\% and 80\%) and the signal-to-noise
ratio expressed as the magnitude of the mean sample effect for one of the sample groups
($\mu_1=0$ and either $\mu_2=0.2$ or $\mu_2=1$).

The following function enables simulating such data, computes the corresponding GSE scores with
each method, performs a $t$-test on the tested gene set between the two groups of samples for each
method and returns the corresponding $p$-values:

<<>>=
runSim <- function(p, n, gs.sz, S2N, fracDEgs) {
  sizeDEgs <- round(fracDEgs * gs.sz)
  group.n <- round(n / 2)

  sampleEffect <- rnorm(n, mean=0, sd=1)
  sampleEffectDE <- rnorm(n, mean=S2N, sd=0.5)
  probeEffect <- rnorm(p, mean=0, sd=1)
  noise <- matrix(rnorm(p*n, mean=0, sd=0.1), nrow=p, ncol=n)
  noiseDE <- matrix(rnorm(p*n, mean=0, sd=0.1), nrow=p, ncol=n)
  M <- outer(probeEffect, sampleEffect, "+") + noise
  M2 <- outer(probeEffect, sampleEffectDE, "+") + noiseDE
  M[1:sizeDEgs, 1:group.n] <- M2[1:sizeDEgs, 1:group.n]

  rownames(M) <- paste("g", 1:nrow(M), sep="")
  geneSets <- list(testGeneSet=paste0("g", 1:(gs.sz)))

  es.gsva <- gsva(M, geneSets, verbose=FALSE)$es.obs
  es.plage <- gsva(M, geneSets, method="plage", verbose=FALSE)
  es.zscore <- gsva(M, geneSets, method="zscore", verbose=FALSE)
  es.ssgsea <- gsva(M, geneSets, method="ssgsea", verbose=FALSE)

  gsva.pval <- t.test(es.gsva[1:group.n],es.gsva[(group.n+1):length(es.gsva)])$p.value
  plage.pval <- t.test(es.plage[1:group.n],es.plage[(group.n+1):length(es.plage)])$p.value
  zscore.pval <- t.test(es.zscore[1:group.n],es.zscore[(group.n+1):length(es.zscore)])$p.value
  ssgsea.pval <- t.test(es.ssgsea[1:group.n],es.ssgsea[(group.n+1):length(es.ssgsea)])$p.value

  c(gsva.pval, ssgsea.pval, zscore.pval, plage.pval)
}
@
The next function takes the $p$-values of the output of the previous function and
estimates the statistical power as the fraction of non-rejections at a significant level
$\alpha=0.05$. It assumes the $p$-values were generated under the alternative hypothesis
of differential expression.

<<>>=
estimatePower <- function(pvals, alpha=0.05) {
  N <- ncol(pvals)
  c(1 - sum(pvals[1, ] > alpha)/N, 1 - sum(pvals[2, ] > alpha)/N,
    1 - sum(pvals[3, ] > alpha)/N, 1 - sum(pvals[4, ] > alpha)/N)
}
@
Finally, we perform the simulation on each of the four described scenarios 50 times using
the code below. The results in Fig.~\ref{simpower} show that GSVA attains higher statistical
power than the other three methods in each of the simulated scenarios.

<<>>=
set.seed(1234)

exp1 <- cbind(estimatePower(replicate(50, runSim(1000, 10, gs.sz=30, S2N=0.2, fracDEgs=0.5))),
              estimatePower(replicate(50, runSim(1000, 20, gs.sz=30, S2N=0.2, fracDEgs=0.5))),
              estimatePower(replicate(50, runSim(1000, 40, gs.sz=30, S2N=0.2, fracDEgs=0.5))),
              estimatePower(replicate(50, runSim(1000, 60, gs.sz=30, S2N=0.2, fracDEgs=0.5))))

exp2 <- cbind(estimatePower(replicate(50, runSim(1000, 10, gs.sz=30, S2N=1.0, fracDEgs=0.5))),
              estimatePower(replicate(50, runSim(1000, 20, gs.sz=30, S2N=1.0, fracDEgs=0.5))),
              estimatePower(replicate(50, runSim(1000, 40, gs.sz=30, S2N=1.0, fracDEgs=0.5))),
              estimatePower(replicate(50, runSim(1000, 60, gs.sz=30, S2N=1.0, fracDEgs=0.5))))

exp3 <- cbind(estimatePower(replicate(50, runSim(1000, 10, gs.sz=30, S2N=0.2, fracDEgs=0.8))),
              estimatePower(replicate(50, runSim(1000, 20, gs.sz=30, S2N=0.2, fracDEgs=0.8))),
              estimatePower(replicate(50, runSim(1000, 40, gs.sz=30, S2N=0.2, fracDEgs=0.8))),
              estimatePower(replicate(50, runSim(1000, 60, gs.sz=30, S2N=0.2, fracDEgs=0.8))))

exp4 <- cbind(estimatePower(replicate(50, runSim(1000, 10, gs.sz=30, S2N=1.0, fracDEgs=0.8))),
              estimatePower(replicate(50, runSim(1000, 20, gs.sz=30, S2N=1.0, fracDEgs=0.8))),
              estimatePower(replicate(50, runSim(1000, 40, gs.sz=30, S2N=1.0, fracDEgs=0.8))),
              estimatePower(replicate(50, runSim(1000, 60, gs.sz=30, S2N=1.0, fracDEgs=0.8))))
@

<<powersim, fig=TRUE, echo=FALSE, results=hide, include=FALSE, height=7, width=7>>=
plotPower <- function(statpower, main, legendposition) {
  plot(statpower[1,], ylim=c(0, 1.0), type="b", lwd=2, pch=1, main=main,
       col="blue", ylab="Statistcal Power", xlab="Sample Size", xaxt="n")
  lines(statpower[2,], col="red", type="b", lwd=2, pch=2)
  lines(statpower[3,], col="darkgreen", type="b", lwd=2, pch=3)
  lines(statpower[4,], col="lightgreen", type="b", lwd=2, pch=4)
  legend(legendposition, c("GSVA","ssGSEA","z-score","PLAGE"),
         col=c("blue","red","darkgreen","lightgreen"),pch=1:4,lty=1,lwd=2,inset=0.02)
  axis(1,at=1:4, labels=c("10","20","40","60"))
}

par(mfrow=c(2,2), mar=c(4, 4, 4, 1))
plotPower(exp1, main="Weak signal-to-noise ratio\n50% DE genes in gene set",
          legendposition="topleft")
plotPower(exp2, main="Strong signal-to-noise ratio\n50% DE genes in gene set",
          legendposition="bottomright")
plotPower(exp3, main="Weak signal-to-noise ratio\n80% DE genes in gene set",
          legendposition="topleft")
plotPower(exp4, main="Strong signal-to-noise ratio\n80% DE genes in gene set",
          legendposition="bottomright")
@
\begin{figure}[ht]
\centerline{\includegraphics[width=0.7\textwidth]{GSVA-powersim}}
\caption{{\bf Comparison of the statistical power of GSVA, PLAGE, single sample GSEA (ssGSEA)
and combined z-score (zscore).} Each panel shows the statistical power in the $y$-axis as
function of the sample size in the $x$-axis. GSE scores were calculated with each method with
respect to a single DE gene set. The two panels on top correspond to simulations where 50\% of
the genes in the gene set were DE while the two at the bottom contained 80\% of DE genes. The
two panels on the left correspond to a weak signal-to-noise ratio in the DE magnitude while the
two on the right correspond to a strong one. Data were simulated from a linear model with additive
sample and probe effects reflecting the previous four different scenarios. The statistical power
was estimated as 1 minus the fraction of non-rejections on the DE gene set at $\alpha=0.05$.
GSVA provides in general higher power than the other three methods, specially under a weak
signal-to-noise ratio.}
\label{simpower}
\end{figure}

\section{GSVA for RNA-Seq data}

In this section we illustrate how to use GSVA with RNA-seq data and, more importantly, how the
method provides pathway activity profiles analogous to the ones obtained from microarray data by
using samples of lymphoblastoid cell lines (LCL) from HapMap individuals which have been profiled
using both technologies \cite{huang_genome-wide_2007, pickrell_understanding_2010}. These data
form part of the experimental package \Rpackage{GSVAdata} and the corresponding help pages contain
details on how the data were processed. We start loading these data and verifying that
they indeed contain expression data for the same genes and samples, as follows:

<<>>=
data(commonPickrellHuang)

stopifnot(identical(featureNames(huangArrayRMAnoBatchCommon_eset),
                    featureNames(pickrellCountsArgonneCQNcommon_eset)))
stopifnot(identical(sampleNames(huangArrayRMAnoBatchCommon_eset),
                    sampleNames(pickrellCountsArgonneCQNcommon_eset)))
@
Next, for the current analysis we use the previously defined collection of canonical C2
Broad Sets extended with two gene sets formed by genes with sex-specific expression:

<<<>>=
data(genderGenesEntrez)

MSY <- GeneSet(msYgenesEntrez, geneIdType=EntrezIdentifier(),
               collectionType=BroadCollection(category="c2"), setName="MSY")
MSY
XiE <- GeneSet(XiEgenesEntrez, geneIdType=EntrezIdentifier(),
               collectionType=BroadCollection(category="c2"), setName="XiE")
XiE


canonicalC2BroadSets <- GeneSetCollection(c(canonicalC2BroadSets, MSY, XiE))
canonicalC2BroadSets
@
We calculate now GSVA enrichment scores for these gene sets using first the microarray
data and then the RNA-seq data. Note that the only requirement to do the latter is to
set the argument \Robject{rnaseq=TRUE} which is \Robject{FALSE} by default.

<<<>>=
esmicro <- gsva(huangArrayRMAnoBatchCommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
                mx.diff=TRUE, verbose=FALSE, rnaseq=FALSE)$es.obs
dim(esmicro)
esrnaseq <- gsva(pickrellCountsArgonneCQNcommon_eset, canonicalC2BroadSets, min.sz=5, max.sz=500,
                 mx.diff=TRUE, verbose=FALSE, rnaseq=TRUE)$es.obs
dim(esrnaseq)
@
To compare expression values from both technologies we are going to transform
the RNA-seq read counts into RPKM values. For this purpose we need gene length and G+C content
information also stored in the \Rpackage{GSVAdata} package and use the \Rfunction{cpm()}
function from the \Rpackage{edgeR} package. Note that RPKMs can only be calculated for those
genes for which the gene length and G+C content information is available:

<<>>=
data(annotEntrez220212)
head(annotEntrez220212)

cpm <- edgeR::cpm(exprs(pickrellCountsArgonneCQNcommon_eset))
dim(cpm)

common <- intersect(rownames(cpm), rownames(annotEntrez220212))
length(common)

rpkm <- sweep(cpm[common, ], 1, annotEntrez220212[common, "Length"] / 10^3, FUN="/")
dim(rpkm)

dim(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ])
@
We finally calculate Spearman correlations between gene and gene-level expression values
and gene set level GSVA enrichment scores produced from data obtained by microarray and
RNA-seq technologies:

<<>>=
corsrowsgene <- sapply(1:nrow(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]),
                       function(i, expmicro, exprnaseq) cor(expmicro[i, ], exprnaseq[i, ], method="pearson"),
                       exprs(huangArrayRMAnoBatchCommon_eset[rownames(rpkm), ]), log2(rpkm+0.1))
names(corsrowsgene) <- rownames(rpkm)

corsrowsgs <- sapply(1:nrow(esmicro),
                     function(i, esmicro, esrnaseq) cor(esmicro[i, ], esrnaseq[i, ], method="spearman"),
                     exprs(esmicro), exprs(esrnaseq))
names(corsrowsgs) <- rownames(esmicro)
@
In panels A and B of Figure~\ref{rnaseqcomp} we can see the distribution of these correlations at
gene and gene set level. They show that GSVA enrichment scores correlate similarly to gene
expression levels produced by both profiling technologies.

<<RNAseqComp, echo=FALSE, results=hide>>=
png(filename="GSVA-RNAseqComp.png", width=1100, height=1100, res=150)
par(mfrow=c(2,2), mar=c(4, 5, 3, 2))
hist(corsrowsgene, xlab="Spearman correlation", main="Gene level\n(RNA-seq RPKM vs Microarray RMA)",
     xlim=c(-1, 1), col="grey", las=1)
par(xpd=TRUE)
text(par("usr")[1]*1.5, par("usr")[4]*1.1, "A", font=2, cex=2)
par(xpd=FALSE)
hist(corsrowsgs, xlab="Spearman correlation", main="Gene set level\n(GSVA enrichment scores)",
     xlim=c(-1, 1), col="grey", las=1)
par(xpd=TRUE)
text(par("usr")[1]*1.5, par("usr")[4]*1.1, "B", font=2, cex=2)
par(xpd=FALSE)
plot(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
     main=sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ])), las=1, type="n")
     sprintf("MSY R=%.2f", cor(exprs(esrnaseq)["MSY", ], exprs(esmicro)["MSY", ]))
abline(lm(exprs(esmicro)["MSY", ] ~ exprs(esrnaseq)["MSY", ]), lwd=2, lty=2, col="grey")
points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"],
       exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["MSY", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
       exprs(esmicro)["MSY", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1)
par(xpd=TRUE)
text(par("usr")[1]*1.5, par("usr")[4]*1.1, "C", font=2, cex=2)
par(xpd=FALSE)
plot(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ], xlab="GSVA scores RNA-seq", ylab="GSVA scores microarray",
     main=sprintf("XiE R=%.2f", cor(exprs(esrnaseq)["XiE", ], exprs(esmicro)["XiE", ])), las=1, type="n")
abline(lm(exprs(esmicro)["XiE", ] ~ exprs(esrnaseq)["XiE", ]), lwd=2, lty=2, col="grey")
points(exprs(esrnaseq["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Female"]),
       exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Female"], col="red", pch=21, bg="red", cex=1)
points(exprs(esrnaseq)["XiE", pickrellCountsArgonneCQNcommon_eset$Gender == "Male"],
       exprs(esmicro)["XiE", huangArrayRMAnoBatchCommon_eset$Gender == "Male"], col="blue", pch=21, bg="blue", cex=1)
par(xpd=TRUE)
text(par("usr")[1]*1.5, par("usr")[4]*1.1, "D", font=2, cex=2)
par(xpd=FALSE)
dev.off()
@
\begin{figure}[p]
\centerline{\includegraphics[height=0.5\textheight]{GSVA-RNAseqComp}}
\caption{{\bf GSVA for RNA-seq (Argonne).} A. Distribution of Spearman correlation values between gene expression profiles of RNA-seq and microarray data. B. Distribution of Spearman correlation values between GSVA enrichment scores of gene sets calculated from RNA-seq and microarray data. C and D. Comparison of GSVA enrichment scores obtained from microarray and RNA-seq data for two gene sets containing genes with sex-specific expression: MSY formed by genes from the male-specific region of the Y chromosome, thus male-specific, and XiE formed by genes that escape X-inactivation in females, thus female-specific. Red and blue dots represent female and male samples, respectively. In both cases GSVA-scores show very high correlation between the two profiling technologies where female samples show higher enrichment scores in the female-specific gene set and male samples show higher enrichment scores in the male-specific gene set.}
\label{rnaseqcomp}
\end{figure}

We also examined the two gene sets containing gender specific genes in detail: those that escape
X-inactivation in female samples \citep{carrel_x-inactivation_2005} and those that are located on
the male-specific region of the Y chrosomome \citep{skaletsky_male-specific_2003}. In panels C and D
of Figure~\ref{rnaseqcomp} we can see how microarray and RNA-seq enrichment scores correlate very
well in these gene sets, with $\rho=0.82$ for the male-specific gene set and $\rho=0.78$ for the
female-specific gene set. Male and female samples show higher GSVA enrichment scores in their
corresponding gene set. This demonstrates the flexibility of GSVA to enable analogous unsupervised
and single sample GSE analyses in data coming from both, microarray and RNA-seq technologies.

\section{Session Information}

<<info, results=tex>>=
toLatex(sessionInfo())
@

\bibliography{GSVA}
\bibliographystyle{apalike}

\end{document}