%\VignetteIndexEntry{Analysing RNA-Seq data with the "DESeq" package}
%\VignettePackage{DESeq}

% To compile this document
% library('weaver'); rm(list=ls()); Sweave('DESeq.Rnw', driver=weaver()); system('pdflatex DESeq')

\documentclass{article}

\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}
\usepackage{color}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=FALSE,width=4,height=4.5}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}
\newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}}

\renewcommand{\floatpagefraction}{0.7}

\author{Simon Anders\\[1em]European Molecular Biology Laboratory (EMBL),\\ Heidelberg, Germany\\[1em]
\texttt{sanders@fs.tum.de}}

\title{\textsf{\textbf{Analysing RNA-Seq data with the \Rpackage{DESeq} package}}}

\begin{document}

\maketitle

\begin{abstract}
A basic task in the analysis of count data from RNA-Seq is the detection of differentially expressed genes.
The count data are presented as a table which reports, for each sample, the number of reads that have been
assigned to a gene. Analogous analyses also arise for other assay types, such as comparative ChIP-Seq.
The package \Rpackage{DESeq} provides a method to test for differential
expression by use of the negative binonial distribution and a shrinkage estimator for the distribution's
 variance\footnote{Other Bioconductor packages with similar aims are \Rpackage{edgeR} and \Rpackage{baySeq}.}.
This vignette explains the use of the package. For an exposition of
the statistical method, please see our paper~\cite{Anders:2010:GB}.
\end{abstract}

\tableofcontents

%--------------------------------------------------
\section{Input data and preparations} \label{sec:prep}
%--------------------------------------------------

The \Rpackage{DESeq} package expects count data, as obtained, e.g.,
from an RNA-Seq or other high-throughput sequencing (HTS) experiment,
in the form of a matrix of integer values. Each column corresponds to
a sample, e.g., one library preparation or one lane.
The rows correspond to the entities for which you want to compare coverage,
e.\,g.\ to a gene, to a binding region in a ChIP-Seq dataset,
a window in CNV-Seq or the like. So, for a typical RNA-Seq experiment, each
element in the table tells how many reads have been mapped in a given sample
to a given gene.

To obtain such a count table for your own data, you will need
to create it from your sequence alignments and suitable annotation. Within Bioconductor,
you can use the function \Rfunction{summerizeOverlaps} in the
\Rpackage{GenomicRanges} package. See the vignette, Ref.\ \cite{summerizeOverlaps},
for a worked example. Another possibility (outside of Bioconductor)
is the \emph{htseq-count} script distributed with the HTSeq Python framework
\cite{htseq}. (You do not need to know any Python to use
\texttt{htseq-count}.) A third possibility might be given by the Bioconductor
package \Rpackage{easyrnaseq} (by Nicolas Delhomme; in preparation, available
soon; package name may change).

Another easy way to produce such a table from the output
  of the aligner is to use the \texttt{htseq-count} script distributed
  with the \emph{HTSeq} package. Even though \emph{HTSeq} is a Python
  package, you do not need to know any Python to use
  \texttt{htseq-count}. See
  \url{http://www-huber.embl.de/users/anders/HTSeq/doc/count.html}. (If you use
htseq-count, be sure to remove the extra lines with general counters (``ambiguous'' etc.)
when importing the data.)

The count values must be raw counts of
sequencing reads. This is important for \Rpackage{DESeq}'s statistical model to hold,
as only raw reads allow to assess the measurement precision correctly. (Hence,
do not supply rounded values of normalized counts, or counts of covered base pairs.)

Furthermore, it is important that each column stems from an independent biological
replicate.  For purely technical replicates (e.\,g. when the same
library preparation was distributed over multiple lanes of the
sequencer in order to increase coverage), please sum up their counts
to get a single column, corresponding to a unique biological
replicate.  This is needed in order to allow \Rpackage{DESEq} to
estimate variability in the experiment correctly.

As an example dataset, we use the gene level read counts from the
\Rpackage{pasilla} data package. This dataset is from an experiment on
\emph{Drosophila melanogaster} cell cultures and
investigated the effect of RNAi knock-down of the splicing factor
\emph{pasilla}~\cite{Brooks2010}. The data are presented in the object
called \Robject{pasillaGenes}. For a description how this data object
was created from the raw data of Ref.~\cite{Brooks2010}, see the vingette
included with the \Rpackage{pasilla} package.

The \Robject{pasillaGenes} object is of class
\Rclass{CountDataSet}, which is the data container used by
\Rpackage{DESeq}. We load the needed packages and the data as
follows.

<<options,results=hide,echo=FALSE>>=
options(digits=3)
@

<<pasilla2,results=hide>>=
library( "DESeq" )
library( "pasilla" )
data( "pasillaGenes" )
@

\Robject{pasillaGenes} contains the counts and also metadata about the samples:

<<look_into_pasillaGenes>>=
head( counts(pasillaGenes) )
pData( pasillaGenes )
@

As you can see, the samples differ by experimental condition (untreated or treated, i.\,e., with
pasilla knocked down) and by library type. To keep things simple, we will only
look at the paired-end data for now. In Section \ref{sec:GLM}, we will see how to deal
with more than one factor.

For your own analysis, you will start form a count table, so we ``unpack'' the
\Rclass{countDataSet} object and build a new one ``from scratch'' to demonstrate how this is done.

<<pairedSamples>>=
pairedSamples <- pData(pasillaGenes)$type == "paired-end"
countsTable <- counts(pasillaGenes)[ , pairedSamples ]
conds <- pData(pasillaGenes)$condition[ pairedSamples ]
@

Now, we have a count table, as described above, of integer count data. For your own data,
use R's \Rfunction{read.table} or \Rfunction{read.csv} function to read your count data
from a text file.

We also need a description of the samples, which is here simply a factor:

<<conds>>=
conds
@

For your own data, create such a factor simply with
<<conds,eval=FALSE>>=
#not run
conds <- factor( c( "treated", "treated", "untreated", "untreated" ) )
@

We can now instantiate a \Rclass{CountDataSet}, which is the central
data structure in the \Rpackage{DESeq} package:
%
<<instantiate>>=
cds <- newCountDataSet( countsTable, conds )
@

The \Rclass{CountDataSet} class is derived from \Rpackage{Biobase}'s
\Rclass{eSet} class and so shares all features of this standard
Bioconductor class. Furthermore, accessors are provided for its data
slots\footnote{In fact, the objects \Robject{pasillaGenes} and \Robject{cds}
  from the \Rpackage{pasilla} are also of class \Rclass{CountDataSet};
  here we re-created \Robject{cds} from elementary data types, a matrix and a factor,
  for pedagogic effect.}.  For example, the counts can be accessed with the
\Rfunction{counts} function.
%
<<headcounts1>>=
head( counts(cds) )
@
%

As first processing step, we need to estimate the effective library
size. This information is called the ``size factors'' vector, as the
package only needs to know the relative library sizes. So, if the counts
of non-differentially expressed genes in one sample are, on average, twice as high
as in another, the size factor for the first sample should be
twice as large as the one for the other sample. The function
\Rfunction{estimateSizeFactors} estimates the size factors from the
count data. (See the man page of
\Rfunction{estimateSizeFactorsForMatrix} for technical details on the
calculation.)
%
<<estimateSizeFactors>>=
cds <- estimateSizeFactors( cds )
sizeFactors( cds )
@

If we divide each column of the count table by the size factor for this column, the
count values are brought to a common scale, making them comparable. When called
with \texttt{normalized=TRUE}, the \texttt{counts} accessor function performs this
calculation. This is useful, e.g., for visualization.
%
<<headcounts2>>=
head( counts( cds, normalized=TRUE ) )
@

%------------------------------------------------------------
\section{Variance estimation}
%------------------------------------------------------------

The inference in \Rpackage{DESeq} relies on an estimation of the typical
relationship between the data's variance and their mean, or, equivalently,
between the data's dispersion and their mean.

The \emph{dispersion} can be understood as the square of the coefficient
of biological variation. So, if a gene's expression typically differs
from replicate to replicate sample by 20\%, this gene's dispersion is $.20^2=.04$.
Note that the variance seen between counts is the sum of two components: the
sample-to-sample variation just mentioned, and the uncertainty in measuring
a concentration by counting reads. The latter, known as shot noise or Poisson
noise, is the dominating noise source for lowly expressed genes.
The sum of both, shot noise and dispersion, is considered in the differential expression inference.

Hence, the variance $v$ of count values is modelled as
\[ v = s \mu + \alpha s^2\mu^2, \]
where $\mu$ is the expected normalized count value (estimated by the average normalized count value),
$s$ is the size factor for the sample under consideration, and $\alpha$ is the dispersion
value for the gene under consideration.

To estimate the dispersions, use this command.
<<estimateDispersions,cache=TRUE>>=
cds <- estimateDispersions( cds )
@

We could now proceed straight to the testing for differetial
expression in Section~\ref{sec:DE}.  However, it is prudent to check
the dispersion estimates and to make sure that the data quality is as
expected.

The function \Rfunction{estimateDispersions} performs three steps. First, it estimates
a dispersion value for each gene, then, it fits,
for each condition, a curve through the estimates. Finally, it assigns to each gene
a dispersion value, using either the estimated or the fitted value.
To allow the user to inspect the
intermediate steps, a ``fit info'' object is stored, which contains
the empirical dispersion values for each gene, the curve fitted through the dispersions,
and the fitted values that will be used in the test.

<<str>>=
str( fitInfo(cds) )
@

To visualize these, we plot the per-gene estimates against the normalized mean
expressions per gene, and then overlay the fitted curve in red. As we will need
this again later, we define a function:

<<fitUntr, fig=TRUE>>=
plotDispEsts <- function( cds )
{
   plot(
      rowMeans( counts( cds, normalized=TRUE ) ),
      fitInfo(cds)$perGeneDispEsts,
      pch = '.', log="xy" )
   xg <- 10^seq( -.5, 5, length.out=300 )
   lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
}
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-figFit}
\caption{Empirical (black dots) and fitted (red lines) dispersion values
plotted against mean expression strength.}
\label{figFit}
\end{figure}

Calling the function produces the plot (Fig.\ \ref{figFit}).

<<figFit,fig=TRUE>>=
plotDispEsts( cds )
@

% check a claim made in the paragraph below.
<<echo=FALSE, results=hide>>=
all(table(conditions(cds))==2)
@

The plot in Figure~\ref{figFit} is doubly logarithmic; this may be
helpful or misleading, and it is worth experimenting with other
plotting styles.

As we estimated the
dispersion from only two samples, we should expect the estimates to
scatter with quite some sampling variance around their true values.
%However, can we really assume that the genes with extremely large empirical
%dispersion values in Figure~\ref{figFit} have the same true, underlying dispersion as the
%other genes with similar expression? To quantify this, we compare the
%scatter of the empirical dispersion values around the fitted value
%with a scaled $\chi^2$ distribution with the appropriate number of degrees
%of freedom.  The 99-percentile of this distribution is shown by the
%blue lines in Figure~\ref{figFit}).  There are a substantial number of
%genes above these lines, definitely more than 1\%. Hence, it may be
%imprudent to assume that for these genes, the fitted values (red
%lines) are good estimates of their true, underlying dispersion.
%However, we also do not want to ignore the fit lines -- as we the wide
%sampling distribution shows, for many genes the empirical dispersions
%will be much lower than the plausible true value, and naively using
%them would lead to avoidable false positives.
Hence, we \Rpackage{DESeq} should not use the per-gene estimates directly in the test,
because using too low dispersion values leads to false positives.
Many of the values below the red line are likely to be underestimates of the
true dispersions, and hence, it is prudent to instead rather use the fitted value.
On the othe hand, not all of he values above the red line are overestimations, and
hence, the conservative choice is to keep them instead of replacing them with
their fitted values. If you do not like this default behaviour, you can change it
with the option \texttt{sharingMode} of \Rfunction{estimateDispersions}. Note that
\Rpackage{DESeq} orginally (as described in~\cite{Anders:2010:GB}) only
used the fitted values (\texttt{sharingMode="fit-only"}). The
current default (\texttt{sharingMode="maximum"}) is more conservative.

Another difference of the current \Rpackage{DESeq} version to the original method described in the
paper is the way how the mean-dispersion relation is fitted. By default,
\Rfunction{estimateDispersion} now performs a parametric fit: Using a gamma-family
GLM, two coefficients $\alpha_0, \alpha_1$ are found to
parametrize the fit as $\alpha = \alpha_0 + \alpha_1/\mu$. (The values of the two
coefficients can be found in the \texttt{fitInfo} object, as attribute \texttt{coefficients}
to \texttt{dispFunc}.) For some data sets, the parametric fit may give bad results, in
which case one should try a local fit (the method described in the paper), which
is available via the option \texttt{fitType="local"} to \Rfunction{estimateDispersions}.

In any case, the dispersion values which finally should be used by the subsequent testing are stored in
the feature data slot of \texttt{cds}:

<<head>>=
head( fData(cds) )
@

You can verify that \texttt{disp\_pooled} indeed contains the maximum of the two value vectors
we looked at before, namely

<<str>>=
str( fitInfo(cds) )
@

Advanced users who want to fiddle with the dispersion estimation can change the values
in \texttt{fData(cds)} prior to calling the testing function.

%------------------------------------------------------------
\section{Inference: Calling differential expression} \label{sec:DE}
%------------------------------------------------------------

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-figDE}
\caption{Plot of normalised mean versus $log_2$ fold change
(this plot is sometimes also called the ``MA-plot'') for the contrast ``untreated''
versus ``treated''.}
\label{figDE}
\end{figure}

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-histp}
\caption{Histogram of $p$-values from the call to nbinomTest.}
\label{fighistp}
\end{figure}

\subsection{Standard comparison between two experimental conditions}

Having estimated the dispersion for each gene, it is now straight-forward to look for
differentially expressed genes. To contrast two conditions, e.g., to see whether there is differential
expression between conditions ``untreated'' and ``treated'', we simply call the function \Rfunction{nbinomTest}.
It performs the tests as described in the paper and returns a data frame with the $p$ values and other
useful information.
%
<<nbt1,cache=TRUE>>=
res <- nbinomTest( cds, "untreated", "treated" )
<<nbt2>>=
head(res)
@
%
For each gene, we get its mean expression level (at the base scale) as a joint estimate from both conditions, and
estimated separately for each  condition, the fold change from
the first to the second condition, the logarithm (to basis 2) of the fold change, and the
$p$ value for the statistical significance of this change. The \texttt{padj} column contains the
$p$ values, adjusted for multiple testing with the Benjamini-Hochberg procedure (see the R function
\Rfunction{p.adjust}), which controls false discovery
rate (FDR).

Let us first plot the $\log_2$ fold changes against the base means, colouring in red those genes
that are significant at 10\% FDR.
<<figDE,fig=TRUE>>=
plotDE <- function( res )
   plot(
      res$baseMean,
      res$log2FoldChange,
      log="x", pch=20, cex=.3,
      col = ifelse( res$padj < .1, "red", "black" ) )

plotDE( res )
@ %$
See Figure~\ref{figDE} for the plot. As we will use this plot more often, we have
stored its code in a function.

It is also instructive to look at the histogram of $p$ values (Figure~\ref{fighistp}).
The enrichment of low po values stems from the differentially expressed genes,
while those not differebtially expressed are spread uniformly over the range
from zero to one (except for the $p$ values from genes with very low counts, which
take discrete values and so give rise to higher bins to the right.)
%
<<histp,fig=TRUE>>=
hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")
@


We can filter for significant genes, according to some chosen threshold for the false dicovery rate (FDR),
<<ressig1>>=
resSig <- res[ res$padj < 0.1, ]
@
and list, e.\,g., the most significantly differentially expressed genes:
<<ressig2>>=
head( resSig[ order(resSig$pval), ] )
@

We may also want to look at the most strongly down-regulated of the significant
genes,
<<ressig3>>=
head( resSig[ order( resSig$foldChange, -resSig$baseMean ), ] )
@
or at the most strongly up-regulated ones:
<<ressig4>>=
head( resSig[ order( -resSig$foldChange, -resSig$baseMean ), ] )
@

To save the output to a file, use the R functions \Rfunction{write.table} and
\Rfunction{write.csv}. (The latter is useful if you want to load the table in a
spreadsheet program such as Excel.)

<<writetable,eval=FALSE>>=
#not run
write.table( res, file="results.txt" )
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-MArepl}
\caption{Plot of the log2 fold change between the untreated replicates
versus average expression strength.}
\label{figMArepl}
\end{figure}

Note in Fig.~\ref{figDE} how the power to detect significant differential expression depends on
the expression strength. For weakly expressed genes, stronger changes are
required for the gene to be called significantly expressed. To understand the reason for
this let, us compare the normalized counts between two replicate samples, here
taking the two untreated samples as an example:
%
<<ncu>>=
ncu <- counts( cds, normalized=TRUE )[ , conditions(cds)=="untreated" ]
@
\Robject{ncu} is now a matrix with two columns.
%
<<MArepl,fig=TRUE>>=
plot( rowMeans(ncu), log2( ncu[,2] / ncu[,1] ), pch=".", log="x" )
@
%
As one can see in Figure~\ref{figMArepl}, the log fold changes between replicates
are stronger for lowly expressed genes than for highly expressed ones. We
ought to conclude that a gene's expression is influenced by the treatment only
if the change between treated and untreated samples is stronger than
what we see between replicates, and hence, the dividing line between red and black in
Figure~\ref{figDE} mimics the shape seen in Figure~\ref{figMArepl}.

%------------------------------------------------------------
\subsection{Working partially without replicates}
%------------------------------------------------------------

If you have replicates for one condition but not for the other, you can still proceed as
before. In such cases only the conditions with replicates will be used to estimate the dispersion.
Of course, this is only valid if you have good reason to believe that the unreplicated
condition does not have larger variation than the replicated one.

To demonstrate, we subset our data object to only three samples:

<<subset>>=
cdsTTU <- cds[ , 1:3]
pData( cdsTTU )
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-figDE_Tb}
\caption{MvA plot for the contrast ``treated'' vs.\ ``untreated'', using two treated and
only one untreated sample.}
\label{figDE_Tb}
\end{figure}

Now, we do the analysis as before.

<<est123,cache=TRUE>>=
cdsTTU <- estimateSizeFactors( cdsTTU )
cdsTTU <- estimateDispersions( cdsTTU )
resTTU <- nbinomTest( cdsTTU, "untreated", "treated" )
@

We produce the analogous plot as before, again with
<<figDE_Tb,fig=TRUE>>=
plotDE( resTTU )
@ %$
Figure~\ref{figDE_Tb} shows the same symmetry in up- and down-regulation
as in Fig.~\ref{figDE} but a certain asymmetry in the boundary line for significance. This
has an easy explanation: low counts suffer from proportionally stronger shot noise than high
counts, and since there is only one ``untreated'' sample versus two ``treated'' ones,
a stronger downward fold-change is required to be called significant than for the upward direction.

%--------------------------------------------------
\subsection{Working without any replicates}
%--------------------------------------------------

Proper replicates are essential to interpret a biological experiment. After all, if one compares two
conditions and finds a difference, how else can one know that this difference is due to the different conditions
and would not have arisen between replicates, as well, just due to experimental or biological noise? Hence, any attempt to work without any
replicates will lead to conclusions of very limited reliability.

Nevertheless, such experiments are sometimes undertaken, and the \Rpackage{DESeq} package
can deal with them, even though the soundness of the results may depend much on the circumstances.

Our primary assumption is still that the mean is a good predictor for the dispersion.
Once we accept this assumption, we may argue as follows: Given two samples from
different conditions and a number of genes with comparable expression levels, of which we expect only a minority
to be influenced by the condition, we may take the dispersion estimated from comparing their counts \emph{across}
conditions as ersatz for a proper estimate of the variance across replicates. After all,
we assume most genes to behave the same within replicates as across conditions, and hence, the estimated variance
should not be affected too much by the influence of the hopefully few differentially expressed genes. Furthermore,
the differentially expressed genes will only cause the dispersion estimate to be too high, so that the test will err to the
side of being too conservative.

We shall now see how well this works for our example data.
We reduce our count data set to just two columns, one ``untreated'' and one ``treated'' sample:
<<subset2>>=
cds2 <- cds[ ,c(  "untreated3fb", "treated3fb"   ) ]
@

Now, without any replicates at all, the \Rfunction{estimateDispersions} function will refuse to proceed unless we
instruct it to ignore the condition labels and estimate the variance by treating all
samples as if they were replicates of the same condition:
<<cds2,cache=TRUE>>=
cds2 <- estimateDispersions( cds2, method="blind", sharingMode="fit-only" )
@
Note the option \texttt{sharingMode="fit-only"}. Remember that the default,
\texttt{sharingMode="maximum"}, takes care of outliers, i.e., genes with
dispersion much larer than the fitted values. Without replicates, we cannot
catch such outliers and so have to disable this function.

Now, we can attempt to find differential expression:
<<res2,cache=TRUE>>=
res2 <- nbinomTest( cds2, "untreated", "treated" )
@

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-figDE2}
\caption{MvA plot, from a test using no replicates.}
\label{figDE2}
\end{figure}

Unsurprisingly, we find much fewer hits, as can be seen from the plot (Fig.\ \ref{figDE2})
<<figDE2,fig=TRUE>>=
plotDE( res2 )
@
and from this table, tallying the number of significant hits in our previous and our new, restricted analysis:
<<addmarg>>=
addmargins( table( res_sig = res$padj < .1, res2_sig = res2$padj < .1 ) )
@


%-------------------------------------------------------------------------------------------------
\section{Multi-factor designs} \label{sec:GLM}
%-------------------------------------------------------------------------------------------------

Let us return to the full pasilla data set, which we got as \Robject{pasillaGenes}
from the \Rpackage{pasilla} package. Due to the usage of both single-end
and paired-end libraries, it has a design with two factors,  \emph{condition}
(or treatment) and library \emph{type}:

<<design>>=
design <- pData(pasillaGenes)[ , c("condition","type") ]
design
@

When creating a count data set with multiple factors, just pass a data frame
instead of the condition factor:

<<fct>>=
fullCountsTable <- counts( pasillaGenes )

cdsFull <- newCountDataSet( fullCountsTable, design )
@

\Robject{cdsFull} is now essentially the same object as \Robject{pasillaGenes},
we have only recreated it for demonstration.

As before, we estimate the size factors and then the dispersions. For the latter,
we specify the method ``pooled''. Then, only one dispersion is computed for each gene,
an average over all cells (weighted by the number of samples for each cells), where
the term \emph{cell} denotes any of the four combinations of factor levels of
the design.

<<estsfdisp,cache=TRUE>>=
cdsFull <- estimateSizeFactors( cdsFull )
cdsFull <- estimateDispersions( cdsFull )
@

\begin{figure}
\centering
\includegraphics[width=.7\textwidth]{DESeq-figFitPooled}
\caption{Estimated (black) pooled dispersion values for all seven samples, with
regression curve (red).}
\label{figFitPooled}
\end{figure}


We check the fit (Fig.~\ref{figFitPooled}):

<<figFitPooled,fig=TRUE>>=
plotDispEsts( cdsFull )
@ %$

For inference, we now specify two \emph{models} by formulas. The \emph{full model}
regresses the genes' expression on both the library type and
the treatment condition, the \emph{reduced model} regresses them only on the library
type. For each gene, we fit generalized linear models (GLMs)
according to the two models, and then compare them in order to infer whether
the additional specification of the treatment improves the fit and hence, whether
the treatment has significant effect.

<<fit1,cache=TRUE,results=hide>>=
fit1 <- fitNbinomGLMs( cdsFull, count ~ type + condition )
fit0 <- fitNbinomGLMs( cdsFull, count ~ type  )
@
These commands take a while to execute. Also, they may produce a few warnings,
informing you that the GLM fit failed to converge (and the results from these
genes should be interpreted with care). The ``fit'' objects are
data frames with three columns:

<<fitstr>>=
str(fit1)
@

To perform the test, we call
<<pvalsGLM,cache=TRUE>>=
pvalsGLM <- nbinomGLMTest( fit1, fit0 )
padjGLM <- p.adjust( pvalsGLM, method="BH" )
@

The function \Rfunction{nbinomTestGLM} returned simply a vector of $p$ values which
we handed to the standard R function \Rfunction{p.adjust} to adjust for multiple
testing using the Benjamini-Hochberg (BH) method.

\begin{figure}
\centering
\includegraphics[width=.7\textwidth]{DESeq-figPval}
\caption{Comparison of $p$ values (unadjusted) from the test using only the
four paired-end samples and the test using all seven samples. We can see that
the latter tend to be smaller, as expected from the higher power of the test with all samples.}
\label{figPval}
\end{figure}

Let's compare with the result from the four-samples test:
<<addmarg2>>=
tab = table( "paired end only" = res$padj < .1, "all samples" = padjGLM < .1 )
addmargins( tab )
@
We see that the analyses find \Sexpr{tab[2,2]} genes in common, while
\Sexpr{tab[1,2]} were only found in the analysis using all samples and
\Sexpr{tab[2,1]} were specific for the \emph{paired end only} analysis.
A more informative comparison might be a scatter plot of $p$ values:
%
<<figPval,fig=TRUE>>=
bottom = function(x, theta=1e-12) pmax(x, theta)
plot( bottom(res$pval), bottom(pvalsGLM), log="xy", pch=20, cex=.3 )
abline(a=0, b=1, col="blue")
@
The result is shown in Fig.~\ref{figPval}.

Now, we can extract the significant genes from the vector \Robject{padjGLM} as before.
To see the corresponding fold changes, we have a closer look at the object
\Robject{fit1}
<<lookatfit1>>=
head(fit1)
@

The first three columns show the fitted coefficients, converted to a logarithm base 2 scale.
The log2 fold change due to the condition is shown in the third column. As indicated
by the column name, it is the effect of ``untreated'', i.e., the log ratio of
``untreated'' versus ``treated''. (This is unfortunately the other way round as
before, due to the peculiarities of contrast coding.) Note that the library type also
had noticeable influence on the expression, and hence would have increased the dispersion
estimates (and so reduced power) if we had not fitted an effect for it.

The column \emph{deviance}
is the deviance of the fit. (Comparing the deviances with a $\chi^2$ likelihood ratio test is
how \Rfunction{nbinomGLMTest} calculates the $p$ values.) The last column, \emph{converged},
indicates whether the calculation of coefficients and deviance has fully converged.
(If it is false too often, you can try to change the GLM control parameters, as
explained in the help page to \Rfunction{fitNbinomGLMs}.)
% FIXME: not urgent - this is a bit vague, 'too often' should be specified and ideally
%   the remedy be automated.


\medskip

Finally, we show that taking the library type into account was important to have
good detection power by doing the analysis again using the standard workflow,
as outlined earlier, and without informing \Rpackage{DESeq} of the library types:

<<fullAnalysisSimple,cache=TRUE>>=
cdsFullB <- newCountDataSet( fullCountsTable, design$condition )
cdsFullB <- estimateSizeFactors( cdsFullB )
cdsFullB <- estimateDispersions( cdsFullB )
resFullB <- nbinomTest( cdsFullB, "untreated", "treated" )
<<table>>=
addmargins(table(
   `all samples simple` = resFullB$padj < 0.1,
   `all samples GLM`    = padjGLM < 0.1 ))
@ %$

%--------------------------------------------------
\section{Independent filtering} \label{sec:indepfilt}
%--------------------------------------------------
The analyses of the previous sections involve the application of
statistical tests, one by one, to each row of the dataset, in order to
identify those genes that have evidence for differential expression.
The idea of \emph{independent filtering} is to filter out those tests
from the procedure that have no, or little chance of showing significant
evidence, without even doing the testing. Typically, this results in
increased detection power --- at the same type I error
as measured, e.\,g., in terms of false discovery rate.
A good choice for a filtering criterion is one that
\begin{enumerate}
  \item\label{it:indp} is statistically independent from the test statistic
    under the null hypothesis,
  \item\label{it:corr} is correlated with the test statistic under the
    alternative, and
  \item\label{it:joint} does not notably change the dependence
    structure --if there is any-- between the
    tests that pass the filter, compared to the dependence structure
    between the tests before filtering.
\end{enumerate}
The benefit from filtering relies on property~\ref{it:corr}, and we
will explore it further in Section~\ref{sec:whyitworks}. Its
statistical validity relies on properties \ref{it:indp} and
\ref{it:joint}, and we refer to~\cite{Bourgon:2010:PNAS} for further
explanation of this topic.
Here, we consider filtering by the overall sum of counts (irrespective
of biological condition):
<<rs>>=
rs <- rowSums ( counts ( cdsFull ))
use <- (rs > quantile(rs, 0.4))
table(use)
cdsFilt <- cdsFull[ use, ]
@
<<check,echo=FALSE>>=
stopifnot(!any(is.na(use)))
@
We perform the testing as before in Section~\ref{sec:GLM}.
<<fitFilt,cache=TRUE,results=hide>>=
fitFilt1  <- fitNbinomGLMs( cdsFilt, count ~ type + condition )
fitFilt0  <- fitNbinomGLMs( cdsFilt, count ~ type  )
pvalsFilt <- nbinomGLMTest( fitFilt1, fitFilt0 )
padjFilt  <- p.adjust(pvalsFilt, method="BH" )
@
<<doublecheck,echo=FALSE>>=
stopifnot(all.equal(pvalsFilt, pvalsGLM[use]))
@
Let us compare the number of genes found at an FDR of 0.1 by this
analysis with that from the previous one (\Robject{padjGLM}).
<<tab>>=
padjFiltForComparison = rep(+Inf, length(padjGLM))
padjFiltForComparison[use] = padjFilt
tab = table( `no filtering`   = padjGLM < .1,
             `with filtering` = padjFiltForComparison < .1 )
addmargins(tab)
@
The analysis with filtering found an additional \Sexpr{tab[1,2]}
genes, an increase in the detection rate by about
\Sexpr{round(100*tab[1,2]/tab[2,2])}\%,
while \Sexpr{tab[2,1]} genes were only found by the previous analysis.

\subsection{Why does it work?}\label{sec:whyitworks}
First, consider Figure~\ref{figscatterindepfilt}, which shows that
among the 40--45\% of genes with lowest overall counts, \Robject{rs},
there are essentially none that achieve an (unadjusted) $p$ value less than
\Sexpr{signif(quantile(pvalsGLM[!use],0.0001, na.rm=TRUE),1)}.
%
<<figscatterindepfilt,fig=TRUE>>=
plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=".")
@
\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{DESeq-figscatterindepfilt}
\caption{Scatterplot of rank of filter criterion (overall sum of
  counts \Robject{rs}) versus the negative logarithm of the test statistic \Robject{pvalsGLM}.}
\label{figscatterindepfilt}
\end{figure}
This means that by dropping the 40\% genes with lowest \Robject{rs},
we do not loose anything substantial from our subsequent
results. Second, consider the $p$ value histogram Figure~\ref{fighistindepfilt}.
It shows how the filtering ameliorates the multiple testing problem
-- and thus the severity of a multiple testing adjustment -- by
removing a background set of hypotheses whose $p$ values are distributed
more or less uniformly in $[0,1]$.
<<histindepfilt,width=7,height=5>>=
h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE)
h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE)
colori = c(`do not pass`="khaki", `pass`="powderblue")
<<fighistindepfilt,fig=TRUE>>=
barplot(height = rbind(h1$counts, h2$counts),
        beside = FALSE, col = colori,
        space = 0, main = "", ylab="frequency")
text(x = c(0, length(h1$counts)), y = 0,
     label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)
legend("topright", fill=rev(colori), legend=rev(names(colori)))
@
\begin{figure}
\centering
\includegraphics[width=.5\textwidth]{DESeq-fighistindepfilt}
\caption{Histogram of $p$ values for all tests (\Robject{pvalsGLM}).
  The area shaded in blue indicates the subset of those that pass the filtering,
  the area in khaki those that do not pass.}
\label{fighistindepfilt}
\end{figure}

%-------------------------------------------------------------------------------------------------
\section{Moderated fold change estimates and applications to sample clustering and visualisation}
%-------------------------------------------------------------------------------------------------
In Section~\ref{sec:DE} we have seen how to use \Rpackage{DESeq} for
calling differentially expressed genes. For each gene,
\Rpackage{DESeq} reports a (logarithm base 2) fold change estimate and a
$p$ value, as shown for instance in the dataframe \Robject{res} in the
beginning of that section.  When the involved counts are small, the
(logarithmic) fold-change estimate can be highly variable, and can even be
infinite.

For some purposes, such as the clustering of samples (or genes)
according to their overall profiles, or for visualisation of the data,
the (logarithmic) fold changes may thus not be useful: the random variability
associated with fold changes computed from ratios between low counts
might drown informative, systematic signal in other parts of the data.
We would like to \emph{moderate} the fold change estimates in some
way, so that they are more amenable to plotting or clustering. One
approach to do so uses so-called pseudocounts: instead of the
log-ratio $\log_2(n_A/n_B)$ between the counts $n_A$, $n_B$ in two
conditions $A$ and $B$ consider $\log_2\left((n_A+c)/(n_B+c)\right)$,
where $c$ is a small positive number, e.\,g.\ $c=0.5$ or $c=1$.  For
small values of either $n_A$ or $n_B$, or both, the value of this term
is shifted towards 0 compared to the direct log-ratio
$\log_2(n_A/n_B)$. When $n_A$ and $n_B$ are both large, the direct
log-ratio and the log-ratio with pseudocounts (asymptotically) agree.
This approach is simple and intuitive, but it requires making a choice
for what value to use for $c$, and that may not be obvious.

A variant of this approach is to look for a mathematical function of
$n_A$ and $n_B$ that is like $\log_2(n_A/n_B)$ when $n_A$ and $n_B$
are large enough, but still behaves gracefully when they become
small. If we interpret \emph{graceful} as having the same variance
throughout, then we arrive at variance stabilising transformations
(VST)~\cite{Anders:2010:GB}. An advantage is that the parameters of
this function are chosen automatically based on the
variability of the data, and no \emph{ad hoc} choice of $c$, as above,
is necessary.
%
<<vsd>>=
cdsBlind <- estimateDispersions( cds, method="blind" )
vsd <- getVarianceStabilizedData( cdsBlind )
@


The data are now on a logarithm-like scale, and we can compute
\emph{moderated log fold changes}.

<<modlr>>=
mod_lfc <- (rowMeans( vsd[, conditions(cds)=="treated", drop=FALSE] ) -
            rowMeans( vsd[, conditions(cds)=="untreated", drop=FALSE] ))
@

Now let us compare these to the original ($\log_2$) fold changes. First we find that many of the latter
are infinite (resulting from division of a finite value by 0) or \emph{not a number} (NaN,
resulting from division of 0 by 0).

<<dah>>=
lfc <- res$log2FoldChange
finite <- is.finite(lfc)
table(as.character(lfc[!finite]), useNA="always")
@
% \fixme{Define 0/0 := 1 in the  appropriate place in the code, then the \texttt{NaN} can go away.}

For plotting (Figure~\ref{figmodlr}), we replace the infinite values by an arbitrary fixed large number:
<<repl>>=
largeNumber <- 10
lfc <- ifelse(finite, lfc, sign(lfc) * largeNumber)
@

<<figmodlr, fig=TRUE>>=
plot( lfc, mod_lfc, pch=20, cex=.3,
      col = ifelse( finite, "#80808040", "red" ) )
abline( a=0, b=1, col="#40404040" )
@


\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-figmodlr}
\caption{Scatterplot of direct (\Robject{lfc}) versus \emph{moderated} log-ratios (\Robject{mod\_lfc}).
  The moderation criterion used is variance stabilisation.
  The red points correspond to values that were infinite in \Robject{lfc}
  and were abitrarily set to $\pm\Sexpr{largeNumber}$ for the purpose of plotting.
  These values vary in a finite range (as shown in the plot) in \Robject{mod\_lfc}.}
\label{figmodlr}
\end{figure}

These data are now approximately homoscedastic and hence suitable as input to
a distance calculation. This can be useful, e.g., to visualise the expression
data of, say, the top 40 differentially expressed genes.
%
<<figHeatmap2a,fig=TRUE,width=7,height=10>>=
select <- order(res$pval)[1:40]
colors <- colorRampPalette(c("white","darkblue"))(100)
heatmap( vsd[select,],
         col = colors, scale = "none")
@

\begin{figure}
\centering
\includegraphics[width=.49\textwidth]{DESeq-figHeatmap2a}
\includegraphics[width=.49\textwidth]{DESeq-figHeatmap2b}
\caption{Heatmaps showing the expression data of the top \Sexpr{length(select)}
  differentially expressed genes. Left, the variance stabilisation
  transformed data (\Robject{vsd}),
  right, the original count data (\Robject{counts{cds}}).
  The right plot is dominated by a small number of data points with large values.}
\label{figHeatmap2}
\end{figure}


For comparison, let us also try the same with the untransformed counts.
%
<<figHeatmap2b,fig=TRUE,width=7,height=10>>=
heatmap( counts(cds)[select,],
         col = colors, scale = "none")
@
The result is shown in Figure~\ref{figHeatmap2}.

We note that the \Rfunction{heatmap} function that we have used here is rather basic, and that better
options exist. For instance, consider the \Rfunction{heatmap.2} function from the package \Rpackage{gplots}
or the manual page for \Rfunction{dendrogramGrob} in the package \Rpackage{latticeExtra}.

\begin{figure}
\centering
\includegraphics[width=.6\textwidth]{DESeq-figHeatmapSamples}
\caption{Heatmap showing the Euclidean distances between the samples
  as calculated from the variance-stabilising transformation of the
  count data.}
\label{figHeatmapSamples}
\end{figure}

Another use of variance stabilized data is sample clustering. Here, we apply the
\Rfunction{dist} function to the transpose of the \Robject{vsd} matrix to get
sample-to-sample distances. We demonstrate this with the full data set with all
seven samples.

<<sampleClust>>=
cdsFullBlind <- estimateDispersions( cdsFull, method = "blind" )
vsdFull <- getVarianceStabilizedData( cdsFullBlind )
dists <- dist( t( vsdFull ) )
@

A heatmap of this dstance matrix now shows us, which samples
are similar (Figure \ref{figHeatmapSamples}):

<<figHeatmapSamples,fig=TRUE,width=7,height=7>>=
heatmap( as.matrix( dists ),
   symm=TRUE, scale="none", margins=c(10,10),
   col = colorRampPalette(c("darkblue","white"))(100),
   labRow = paste( pData(cdsFullBlind)$condition, pData(cdsFullBlind)$type ) )
@

The clustering correctly reflects our experimental design, i.e., samples are
more similar when they have the same treatment or the same library type.
(To make this conclusion, it was important to re-estimate e dispersion with
mode ``blind'' (in the calculation for \Robject{cdsFullBlind} above, as only then,
the variance stabilizing transformation is not informed about the design and
hence not biased towards a result supporting it.) Such an analysis is useful
for quality control, and (even though we finish our vignette with it), it may
be useful to this first in any analysis.

%------------------------------------------------------------
\section{Further reading}
%------------------------------------------------------------

For more information on the statistical method, see our paper~\cite{Anders:2010:GB}. For
more information on how to customize the \Rpackage{DESeq} work flow, see the package
help pages, especially the help page for \Rfunction{estimateDispersions}.

%------------------------------------------------------------
\section{Changes since publication of the paper}
%------------------------------------------------------------
Since our paper on \Rpackage{DESeq} was published in Genome Biology in
Oct 2010, we have made a number of changes to algorithm and
implementation, which are listed here.

\begin{itemize}

\item \Rfunction{nbinomTest} calculates a $p$ value by summing up the probabilities
of all per-group count sums $a$ and $b$ that sum up to the observed count
sum $k_{iS}$ and are more extreme than the observed count sums $k_{iA}$ and $k_{iB}$.
Equation~(11) of the paper defined \emph{more extreme} as those pairs of values $(a,b)$
that had smaller probability than the observed pair. This caused problems in cases
where the dispersion exceeded 1. Hence, we now sum instead the probabilities of all
values pairs that are \emph{further out} in the sense that they cause a more extreme
fold change $(a/s_A)/(b/s_B)$, where $s_A$ and $s_B$ are the sums of the size factors of
the samples in conditions $A$ and $B$, respectively. We do this in a one-tailed manner and
double the result. Furthermore, we no longer approximate the sum, but always
calculate it exactly.

\item We added the possibility to fit GLMs of the negative binomial
family with log link.
This new functionality is described in this vignette. $p$ values are calculated by a
$\chi^2$ likelihood ratio test. The logarithms of the size factors are
are provided as offsets to the GLM fitting function.

\item The option \texttt{sharingMode='maximum'} was added to \Rfunction{estimateDispersion}
and made default. This change makes DESeq robust against variance outliers and was not yet
discussed in the paper.

\item By default, DESeq now estimates one pooled dispersion estimate across all
(replicated) conditions. In the original version, we estimated a separate
dispersion-mean relation for each condition. The ``maximum'' sharing mode achieves
its goal of making DESeq robust against outliers only with pooled dispersion
estimate, and hence, this is now the default. The option \texttt{method='per-condition'} to
\Rfunction{estimateDispersions} allows the user to go back to the old method.

\item In the paper, the mean-dispersion relation is fitted by local regression. Now,
DESeq also offers a parametric regression, as described in this vignette. The option
\texttt{fitType} to \Rfunction{estimateDispersions} allows the user to choose between these.

\item Finally, instead of the term \emph{raw squared coefficient of variance}
used in the paper we now prefer the more standard term \emph{dispersion}.

\end{itemize}

\section{Session Info}
<<sessi>>=
sessionInfo()
@

\bibliographystyle{unsrt}
\bibliography{DESeq}


\end{document}