%\VignetteIndexEntry{RNA-Seq}
%\VignetteKeywords{RNA-Seq}
%\VignettePackage{easyRNASeq}
\documentclass[11pt,a4paper]{article}
\usepackage{verbatim}
\usepackage{natbib}
\usepackage{color}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\pref}[1]{\ref{#1} (page \pageref{#1})}

\title{easyRNASeq, an overview}
\author{Nicolas Delhomme, Isma\"{e}l Padioleau}

\SweaveOpts{keep.source=TRUE}

\begin{document}

\maketitle

\tableofcontents

\begin{comment}
For a faster vignette generation, many R section that do not need to
be run are set to eval=FALSE. For testing purpose, change them all to
TRUE.
\end{comment}

\begin{comment}
  TODO explain how to get a unique exon set
  TODO think of using the ShortRead to read the BAM files, if we want
  strand specificity or read extension
\end{comment}

\newpage

\section{Before you start}

If you are completely new to the R/Bioconductor
\citep{Gentleman:2004p2013} packages dealing with Next-Generation
Sequencing, you might want to start by doing the tutorial available in
the \Robject{RnaSeqTutorial} vignette in the data package
\Rpackage{RnaSeqTutorial}. The same holds true if you're interested in
understanding the details behind the content of the present
vignette. If neither nor, just go ahead.

Moreover, if you find \Rpackage{easyRNASeq} useful and apply it in the
frame of your research for a publication, please cite it:

\subparagraph{}
easyRNASeq: a bioconductor package for processing RNA-Seq data\\
Nicolas Delhomme; Ismael Padioleau; Eileen E. Furlong; Lars M. Steinmetz\\
Bioinformatics 2012; doi: 10.1093/bioinformatics/bts477\\

\section{Changes to the vignette}

The last changes are reported in red.

\paragraph{1.3.14}
A new section: \ref{s:ytc} was added describing foreseen changes to
the package, see page \pageref{s:ytc}. Additional changes to the
vignette use-case where made thanks to Richard Friedman's suggestions,
see section \ref{p:pasohs}, page \pageref{p:pasohs}.

\paragraph{1.3.10 - 1.3.13}
Some vignette discrepancies have been corrected. Thanks to Richard
Friedman for spotting them, see paragraph \ref{p:W}, page \pageref{p:W}.
Some additional information about the 'gtf' and 'gff' format has been
added after fixing the bug reported by Tomasz Kulinski. See
section \ref{ss:usage}, page \pageref{ss:usage}.

\paragraph{1.3.9}
Not much change to the vignette, but for the fact that the
\Rpackage{easyRNASeq} application note got published in
\textbf{Bioinformatics} \citep{Delhomme:2012p4678}.

\paragraph{1.3.5 - 1.3.8}
The vignette has been updated to report function call changes to the
\Rfunction{easyRNASeq} function:
\begin{itemize}
  \item the format argument defaults to bam
  \item the chr.sizes can be derived from the bam file header
  \item the addition of the knownOrganisms function
  \item support for variable length reads
  \item read files can be processed in parallel
\end{itemize}
Sections affected are: 
\pref{ss:usage}, \pref{p:args}, \pref{p:input}, \pref{ss:diffInput},
\pref{perf},\pref{p:pasohs},\pref{p:dwai}

\paragraph{1.3.1 - 1.3.4}
The vignette has been updated to:
\begin{itemize}
  \item first: enhance its readability.
  \item second: introduce a new \textbf{Use Case} section
\end{itemize}
Sections affected are:
\pref{s:eRS}, \pref{ss:usage}, \pref{p:annot}, \pref{ss:diffAnn},
\pref{ss:demultiplex}, \pref{s:useCases}

\newpage

\section{Introduction}

This file describes the use of the \Rpackage{easyRNASeq} package to ease
the processing of RNA-seq data in R/Bioconductor. RNA-seq
\citep{Mortazavi:2008p740} was introduced as a new method to perform
Gene Expression Analysis, using the advantages of the high
throughput of \emph{Next-Generation Sequencing} (NGS)
machines. 
\newline

The goal of this vignette is, first, to show an example of the
\Rfunction{easyRNASeq} function that generates a count table for the
selected genic features of interest, \emph{i.e.} exons, transcripts,
gene models, \emph{etc.} using the read data and the genic and genomic
annotations. This overall process is described in figure
\ref{fig:fig1}, page \pageref{fig:fig1}. Shortly, the genomic and
genic annotation will be retrieved from the selected/preferred source
and converted into an appropriate \Robject{object}. In parallel, the
sequenced reads' information (\textit{e.g.} chromosome, position,
strand, \textit{etc.}) will be retrieved from the alignment file and,
similarly, converted to an appropriate \Robject{object}. Then, the
reads contained in the \Robject{reads} object are summarized per genic
annotation contained in the \Robject{annotation} object, which give
raise to a count table that, finally, can be normalized using
additional R packages.
\newline

Then, the \Rfunction{easyRNASeq} function arguments will be detailed
and additional information given on how to generate normalized data,
such as creating \emph{RPKM (Reads Per Kilobase of feature per Million
  reads in the library)} counts or by using more advanced statistical
models implemented in the \Rpackage{DESeq} or \Rpackage{edgeR}
packages. 
\newline

In the second part, we will see how more advanced pre-processing can
be performed, such as \emph{de-multiplexing}.
\begin{comment}
  TODO something Ismael should do:
  how to identify \emph{de-novo} expressed regions.
\end{comment}
\newline 

Additional post-processing such as exporting the count table as bed
and wig formatted file, to be visualized into the UCSC genome browser
or a stand alone genome browser are described in the vignette of the
companion data package \Rpackage{RnaSeqTutorial}. 

\begin{figure}[htpb]
\centerline{\includegraphics[width=0.80\textwidth]{easyRNASeq.png}}
\caption{easyRNAseq Overview. The R packages used for the
  different steps are emphasized in bold face.}
\label{fig:fig1}
\end{figure}

\newpage

\section{easyRNASeq}
\label{s:eRS}

We will use the data present in the \Rpackage{RnaSeqTutorial} data
package to perform the examples. Refer to its documentation for more
details. Shortly, it contains different \emph{Drosophila melanogaster}
RNA-Seq dataset saved in different format as well as the related
annotation retrieved from  FlyBase \citep{Tweedie:2009p2014}. 
\newline

The main function of the \Rpackage{easyRNASeq} package is
\Rfunction{easyRNASeq}. That should be the only processing method you
need to know about when using the package. It is essentially a wrapper
around other function performing the different tasks described in the
following, functions which are all exported too, if you feel you need
to have a look at them. The \Rpackage{easyRNASeq} package defines an
\Robject{RNAseq} class for holding the necessary information used
during the processing. The \Rfunction{easyRNASeq} function instantiates
an object of that class and although most arguments of the
\Rfunction{easyRNASeq} function are optional (as are the corresponding
\Rclass{RNAseq} object slots ),it is advised to set them, especially
the \emph{readLength} and the \emph{organismName}; to help having a
proper documentation of your analysis. The organismName slot is
actually mandatory if you want to get genomic annotation using
\Rpackage{biomaRt}. In that case, you need to provide the name as
specified in the corresponding \Rpackage{BSgenome} package,
\emph{i.e.} ``Dmelanogaster'' for the
\Rpackage{BSgenome.Dmelanogaster.UCSC.dm3} package. 

\textbf{Important Note:} As there are numerous RNA-Seq protocol,
each with its own specificities, the \Rfunction{easyRNASeq} offers
many arguments to adapt to these. This number of arguments, with
only a few defaults values, might be intimidating. We are in the
process of consolidating them to make it easier to handle them. In
the meantime, please refer to the documentation and the examples in
this vignette and the \Rpackage{RnaSeqTutorial} one to get familiar
with them. Do not hesitate to ask on the Bioconductor mailing list
if the purpose of some argument was still unclear.

\newpage

\subsection{usage}
\label{ss:usage}

\paragraph{Initial example}
In the following, the \Rfunction{easyRNASeq} function is called with
an almost minimal set of parameters (some optional ones too...). The
sequencing reads' excerpts were obtained from a fruit-fly sample that
was subjected to a 36 cycles sequencing on an Illumina GAIIX
machine. The annotation were retrieved from FlyBase and
converted into an ``Rdata'' object. We're interested to retrieve the
count table of 4 samples for every exon in the genome.
\textcolor{red}{As of version $1.3.5$, 'bam' is the default
  format. Hence, the argument 'format' has been removed from the
  following function call.}

<<Load the library and create the object>>=
library(easyRNASeq)
library(RnaSeqTutorial)
library(BSgenome.Dmelanogaster.UCSC.dm3)

count.table <- easyRNASeq(filesDirectory=system.file(
                            "extdata",
                            package="RnaSeqTutorial"),
                          pattern="[A,C,T,G]{6}\\.bam$",
                          readLength=30L,
                          organism="Dmelanogaster",
                          chr.sizes=seqlengths(Dmelanogaster),
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons"
                          )
@ 

<<Show the results>>=
head(count.table)
dim(count.table)
@ 

In this command, we're reading bam files (\Robject{format}), present
in the first argument to the function (\Robject{filesDirectory}) using
a regular expression to match the file names (\Robject{pattern}). The
\Rfunction{system.file} retrieve the actual path on your file system
where the ``RnaSeqTutorial'' package was installed and in that
particular case of its ``extdata'' sub-directory. The pattern used to
retrieve the files is similar to that of the \Rclass{base} commands
\Rfunction{dir} and \Rfunction{grep}. Here we look up files, which name
contains 6 times one of the ``A'', ``C'', ``G'' or ``T'' letter,
followed by the ``.bam'' extension. 
We additionally define the number of sequencing cycles
(\Robject{readLength}), and the \Robject{organism} that sample
originates from. We then provide genomic and genic information. The
size of the chromosomes (\Robject{chr.sizes}) and the exons' annotation
(\Robject{annotationFile}) that were already processed in an ``RData''
file (\Robject{annotationMethod}). Finally, we precise that we want
the sequenced reads to be summarized by ``exons''; \emph{i.e.} we want
to get a read count value per exon and per sample.
\newline

And that is all. In that one command, you got the exon count table for
your 4 samples!

\textbf{Important Note:} It is not required to use any of the
\Rpackage{BSgenome} packages to retrieve the chr sizes. These can be provided
as a simple \textcolor{red}{named vector}. We just use the
\Rpackage{BSgenome.Dmelanogaster.UCSC.dm3}  package for
commodity. 
\textcolor{red}{Moreover, since version $1.3.5$, the 'chr.sizes' can
  be extracted from the BAM file header; \emph{i.e.} setting
  'chr.sizes' to ``auto'' (the default) would have retrieved the
  chromosome sizes from the BAM file. Note that this obviously
  does not work if your format is not 'bam'.}
See the additional comments in the use case page \pageref{chrsize}.
\textcolor{red}{In addition, since version $1.3.10$, the
  'filesDirectory' argument defaults to the current directory. This
  one can be looked up using the \Rfunction{getwd()} function.}

\paragraph{Warnings}
\label{p:W}
\textcolor{red}{As you could see when running the previous example,  warnings were
emitted and quite rightly so. 
\begin{enumerate}
\item about the annotation:
The annotation we are using here is
redundant and this at two levels. First, some exons overlap. These are
alternative exons from different transcript isoforms. Second, the
annotation contains the information about all the possible different
transcript isoforms. This means that some exons are
duplicated. Therefore counting by exons or transcripts using these
annotation will result in counting some of the reads several
times. There might be reasons one might want to do that, but as it is
probably not what you want when performing an RNA-Seq analysis, the
warning is emitted. As this can be a very significant source of error,
all the examples here will emit this warning, at the exception of the
summarization by ``GeneModels'' that address that problem (\emph{c.f.}
\ref{subsec:diffNorm}, page \pageref{subsec:diffNorm}). The ideal
solution is to provide an annotation object that contains no
overlapping features. The \Rfunction{disjoin} function from the
\Rpackage{IRanges} package offers a way to achieve this.
\item about potential naming issue in the input file:
It is (sadly) very frequent that the sequencing facilities use
different naming conventions for the chromosomes they report in the
alignment files. It is therefore very frequent that the annotation
provided to easyRNASeq uses different chromosome names than the
alignment file. These warnings are there to inform you about this
issue.
\end{enumerate}
}

\paragraph{Arguments}
\label{p:args}
Going back to the \Rfunction{easyRNASeq} function, more arguments can
be set to fine tune it, depending on the input at hand, the desired
output, etc. Most will be detailed in the following, but not all, so check
``?easyRNASeq'' out for more. One such argument is: ``filenames'',
that offers the possibility to give actual filenames rather than using
a pattern matching approach to find them. For example, if you're only
interested in two samples out of the four available ones in the
provided directory, do: 

<< Create the object with only 2 files, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=30L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons",
                          filenames=c("ACACTG.bam", "ACTAGC.bam"))
@

\textcolor{red}{As of version $1.3.5$, 'bam' is the default
  format. Hence, the argument 'format' has been removed from the
  previous function call. The same is true for the 'chr.sizes'
  argument. Removing it results in setting it to its default value:
  'auto', which has the effect of determining the chromosome sizes
  using the BAM header.}

<<Show, eval=FALSE>>=
head(count.table)
dim(count.table)
@ 

\newpage

\subsection{description}

There are several arguments that influence how the
\Rfunction{easyRNASeq} is going to be executed. They concern:
\begin{itemize}
  \item the input at hand
  \item how to retrieve the annotations
  \item how to summarize the counts
  \item the desired output
  \item whether normalization should be performed
\end{itemize}

\paragraph{Input}
\label{p:input}
Using the \Robject{format} arguments, reads can be loaded from BAM
files or any format supported by the \Rpackage{ShortRead} 
\citep{Morgan:2009p739} package. The SAM/BAM format
\citep{Li:2009p1662} is becoming a \emph{de-facto} standard for
storing NGS aligned data. Gapped alignment can be read from BAM files,
set the \Robject{gapped} argument to ``TRUE'' for that. 
\textcolor{red}{
As of version $1.3.5$, the 'bam' format is the default. If you are
using bam files, you do not need to precise that argument anymore.
As of version $1.3.6$, it is possible to set additional options to
locate the read files. All the options from the list.files base
function are available, \emph{i.e.} one can search for file recursively.
As of version $1.3.7$, variable read lengths is supported. 
The current implementation results in a small, neglictible approximation
of the read counts, due to the likelyhood of an integer overflow. This is
currently investigated and appropriate warnings will be raised if it
were to affect the results.
}

\paragraph{Annotation}
\label{p:annot}

The \Rfunction{easyRNASeq} function currently accepts the following \Robject{annotationMethods}:
\begin{itemize}
  \item ``biomaRt'' use the \Rpackage{biomaRt}
    \citep{Durinck:2005p1990} package to retrieve the annotation 
  \item ``env'' use a \Rclass{RangedData} or \Rclass{GRangesList} class object present in the
    environment - see page \pageref{env} for a use case.
  \item ``gff'' reads in a gff version 3 file
  \item ``gtf'' reads in a gtf file as provided by Ensembl \citep{Flicek:2011p2042}
  \item ``rda'' load an RData object. The object needs to be named
    \Robject{gAnnot} and of class \Rclass{RangedData} or
    \Rclass{GRangesList}. See the use case page \pageref{grnglist} to prepare
    such an object using the \Rpackage{GenomicFeatures}. ``.rda'' is
    the filename extension of serialized R data object (\textit{i.e.} object
    written to disk using the \Rfunction{save} function). The filename
    extension for these files used to be and still is
    sometimes``.RData''. The actual filename extension used is of no
    importance to \Rfunction{easyRNASeq} - see page \pageref{rda} for a use case.
\end{itemize}
The structure of the \Rclass{RangedData} object that has to be
provided when using the ``rda'' or ``env'' \Robject{annotationMethods}
is described in the section \pref{ss:diffAnn}. For using a \Rclass{GRangesList}, see the
use case page \pageref{grnglist}.
\textcolor{red}{
More details on the gff3 and gtf format can be obtained from:
\begin{itemize}
\item gff3: http://www.sequenceontology.org/gff3.shtml
\item gtf: http://mblab.wustl.edu/GTF22.html
\end{itemize}
Unlike for the gtf, the gff3 ninth column holding the attributes has
no mandatory tags. Some of them are however ``predefined''. As
exemplified further down (see \ref{p:gI}, \pageref{p:gI}), we expect
these ``predefined'' attributes to be used and we rely on the ``ID'',
``Parent'' and ``Name'' to identify the relation between a gene and
its exons and transcripts.
}

\paragraph{Summarization}

The reads can be counted/summarized by:
\begin{itemize}
  \item exons 
  \item features (any features such as introns, enhancers, \emph{etc.})
  \item transcripts
  \item geneModels: they consists of the set of disjoint ``synthetic''
    exons that fully overlap every known exons and UTRs of a
    gene; \emph{I.e.} every alternate exon will be split into
    separate ``synthetic'' exons and these exons will be grouped into
    a set that correspond to a single gene.
\end{itemize}

When you use a the ``geneModels'' summarization, no reads will be
counted twice at the exception of those mapping overlapping exons of
different genes, whether on the same strand or not. The
\Rfunction{easyRNASeq} does not deal with these situation, but as
previously a warning will be emitted if this occurs.

\paragraph{Output}

The results can be exported in four different formats:
\begin{itemize}
\item count table (the default, a n (features) x  m (samples) \Robject{matrix}).
\item a \Rpackage{DESeq} \citep{Anders:2010p1659}
  \Rclass{CountDataSet} class object. Useful to perform further
  analyses using the \Rpackage{DESeq} package.
\item an \Rpackage{edgeR} \citep{Robinson:2010p775} \Rclass{DGEList}
  class object. Useful to perform further analyses using the 
  \Rpackage{edgeR} package.
\item an \Rclass{RNAseq} class object. Useful for performing
  additional pre-processing without re-loading the reads and
  annotations.
\end{itemize}

\paragraph{Normalization}

The results can optionally be ``normalized'' as \emph{Reads per Kilobase
  of feature per Million reads in the library} (RPKM,
\cite{Mortazavi:2008p740}). In addition, the normalization step is
available when the results are exported in the \Rpackage{DESeq} or
\Rpackage{edgeR} formats. This will as well generate plots to assess

the validity of the normalization and help decide how to proceed with
any downstream analysis.
\newline
\newline
In the following paragraphs, the usage of these different arguments
will be presented. 

\newpage

\subsection{different input}
\label{ss:diffInput}
The following show how to load an 'export' file. This was the default
Illumina output before Illumina released CASAVA v1.8 in spring
2011. The alignment are done with ELAND. The \Robject{format} argument
has to be set to the ``aln'' value and an additional arguments:
\Robject{type} has to be provided to precise the input file
format. See the \Rpackage{ShortRead} \Rfunction{readAligned} help page
for all the available formats. Note as well the \Robject{filter}
argument that is essential to filter the usable reads from the export
file. Indeed, the export file does contain every read, including those
not aligning and those not passing the original quality filter
performed during the Illumina Base Calling (Bustard) a.k.a. the
chastity filter. For more details, see the section ``Filtering the
Data'' in the \Rpackage{RnaSeqTutorial} vignette.

<<read an export, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          chr.sizes=seqlengths(Dmelanogaster),
                          readLength=36L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          format="aln",
                          type="SolexaExport",
                          count="exons",
                          pattern="subset_export",
                          filter=compose(
                            chromosomeFilter(regex="chr"),
                            chastityFilter()))
@ 

\textcolor{red}{Note that here the chr.sizes argument has to be
  specified. Indeed, we are using the 'aln' format and consequently
  cannot determine the chromosome sizes from the header of the BAM files.}

In the next example, we read a BAM file containing gapped
alignments. In that particular case, TopHat \citep{Trapnell:2009p156}
was used to look for splicing events. Note the use of the
\Robject{gapped} boolean argument. 
\textcolor{red}{As of version $1.3.5$, the format argument default to
  bam and has therefore been removed from the following function call.}

<<read a gapped  bam file, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          chr.sizes="auto",
                          readLength=36L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          gapped=TRUE,
                          count="exons",
                          filenames="gapped.bam")
@ 

\textcolor{red}{Note that setting the 'chr.sizes' argument to 'auto'
  is the same as removing it from the function call.}

\newpage

\subsection{different annotation}
\label{ss:diffAnn}

The following paragraphs present how to use different annotation
sources to retrieve the genic information. It describes as the well
the expected format of these annotations.

\textbf{Important Note:} There are many derivative of the GFF3 and GTF
file formats, some of which do not strictly follow the convention
these files should have. Using incorrectly formatted file will fail
and most of the time a relevant error message will be reported. GTF
files as produced by Ensembl and GFF3 files as produced by Flybase are
correct implementation of these formats and are processed smoothly by
the \Rfunction{easyRNASeq} function. If you encounter problem while
using such an annotation file, please try to correct them. For the GFF
format, the example file present in the \Rpackage{RnaSeqTutorial}
package can
be used as a guideline. If you have installed this package, suggested
by the \Rpackage{easyRNASeq} package, the following will give you the
path of that file on your machine:

<<GFF3, eval=FALSE>>=
system.file(
            "extdata",
            "annot.gff",
            package="RnaSeqTutorial")
@ 

If you do not manage to identify the issue, please contact the
Bioconductor mailing list for help.

\paragraph{biomaRt}

The following show how to use biomaRt to retrieve the annotations.
<<load annotation from biomaRt, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=36L,
                          annotationMethod="biomaRt",
                          gapped=TRUE,
                          count="exons",
                          filenames="gapped.bam")
@ 

Once the annotation has been fetched, one can retrieve them from an
\Robject{RNAseq} object. This implies that the ``outputFormat'' argument
will have been set to ``RNAseq'', so that the complete object is
returned and not only the count table (\textit{c.f.}
\ref{subsec:diffOut}, page \pageref{subsec:diffOut}). Using the
``genomicAnnotation'' accessor on that object, one can get the
annotation as an \Rclass{RangedData} object. This object can be saved
in an ``rda'' file under the name ``gAnnot'' for a faster processing
using the \Rfunction{easyRNASeq} function with the
arguments``annotationMethod'' and ``annotationFile'' set to ``rda''
and the path where you would have saved the rda file.

\paragraph{genomeIntervals}
\label{p:gI}
In the next example is shown how to use a gff file for the same
purpose. Clearly, every gff formatted file will have different gff
attributes (its last column). This column contains ``key=value'' pairs
separated by semi-colons. We depend on specific key to be present as
we have been using Flybase gff files to develop that functionality;
hence we expect attributes as defined by Flybase. First, we're only
considering feature of type 'exon'  in the third column. Then we
expect the following keys: \textit{ID}, \textit{Parent}, and
\textit{Name} in the attributes. The \textit{ID} key should have the
format: \textit{geneID:exonNumber}; both the exon and gene are derived
from it. The \textit{Parent} key should contain the transcript ID. If
one exon is part of several transcripts, these can be concatenated
using commas without space. Finally, the \textit{Name} key should
contain the gene name, but is actually optional. As we are using the
\Rfunction{readGff3} function from the \Rpackage{genomeIntervals}
package, it is as well essential that the pairs key=value are
separated by semi-colons \textbf{whitout} space. See the ``annot.gff''
file used in the next example.

<<load the annotation from a gff, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=36L,
                          annotationMethod="gff",
                          annotationFile=system.file(
                            "extdata",
                            "annot.gff",
                            package="RnaSeqTutorial"),
                          gapped=TRUE,
                          count="exons",
                          filenames="gapped.bam")
@ 

\paragraph{RangedData}

Internally, an \Rfunction{easyRNAseq} function call with the
\Robject{annotationMethod} ``gff'', ``gtf'', or ``biomaRt'' will
convert the annotation into an object of the \Rclass{RangedData}
class. Using the \Robject{annotationMethod} ``env'' or ``rda''
requires that you provide such an object. The following example shows
the expected structure of such an object.

<<RangedData structure>>=
gAnnot <- RangedData(
                     IRanges(
                             start=c(10,30,100),
                             end=c(21,53,123)),
                          space=c("chr01","chr01","chr02"),
                          strand=c("+","+","-"),
                          transcript=c("trA1","trA2","trB"),
                          gene=c("gA","gA","gB"),
                          exon=c("e1","e2","e3"),
                          universe = "Hs19"
                          )
gAnnot
@ 

Here we are describing the genomic location of three exons
(chromosome, position and strand) as well as their transcript and gene
membership. Nothing else is actually required. Note that the names'
spelling is \textbf{essential} for the \Rfunction{easyRNAseq} function
to work properly. Indeed, the \Robject{count} argument will be used to
lookup the proper values into the annotation, \emph{e.g.} for summarizing by
``exons'', an ``exon'' column in the \Rclass{RangedData} object is
mandatory;  a ``feature'' one for counting ``features'', a
``transcript'' one for ``transcripts'', \emph{etc.}
To have a look at the \Rclass{RangedData} class \Robject{gAnnot}
object used in this tutorial, you can do the following:

<<look at the gAnnot, eval=FALSE>>=
data(gAnnot)
gAnnot
@ 

The \Rpackage{easyRNASeq} package support as well annotation provided
as a \Rclass{GRangesList} class object (from the
\Rpackage{GenomicRanges} package). Converting a \Rclass{RangedData}
class object into a \Rclass{GRangesList} class object is pretty straightforward.

<<GRangesList coercion>>=
grngs <- as(gAnnot,"GRanges")
grngsList<-split(grngs,seqnames(grngs))
grngsList
@ 

The advantage of doing so is that the \Rclass{RangedData} class might
get deprecated in the future. The next advantage is that the
\Rclass{GRangesList}  class is strand-aware. Therefore, support for
strand-specific data is going to be supported as soon as I put my hand
on such a dataset. Any suggestion, or data excerpt is welcome.

\paragraph{Remember}

Generating the proper annotation is probably the most important step
in processing your RNA-Seq sample. Mind the warnings produced by the
\Rfunction{easyRNASeq} function, they might be annoying, but there are
there for a good reason: to help.

\newpage

\subsection{different output}
\label{subsec:diffOut}

For example, to generate from the same input files and annotation an object of
class \Rclass{RNAseq} , use the \Rfunction{outputFormat} argument as follow:

<<easyRNASeq output>>=
rnaSeq <- easyRNASeq(system.file(
                                 "extdata",
                                 package="RnaSeqTutorial"),
                     organism="Dmelanogaster",
                     readLength=30L,
                     annotationMethod="rda",
                     annotationFile=system.file(
                       "data",
                       "gAnnot.rda",
                       package="RnaSeqTutorial"),
                     count="exons",
                     pattern="[A,C,T,G]{6}\\.bam$",
                     outputFormat="RNAseq")
@

More details on how to generate \Rclass{CountDataSet}
(\Rpackage{DESeq}) or \Rclass{DGEList} (\Rpackage{edgeR}) will be
presented further down, see section \ref{subsec:optNorm},
page \pageref{subsec:optNorm}.

\newpage

\subsection{different summarization}
\label{subsec:diffNorm}

\Rpackage{easyRNASeq} offers two possibilities to apply different
reads' summarization. First, as we have seen previously, by using one
of the four possible \Robject{count} argument: ``exons'',
``features'', ``transcripts'' or ``genes''. However, if you want to
perform different summarization on the same data, \emph{e.g.} by
``exons'' and ``transcripts'', calling the \Rfunction{easyRNASeq}
twice is inefficient as you will be processing the read files and
fetching the annotation twice. To prevent this, one can provide the
``RNAseq'' value to the \Robject{outputFormat} argument, in which
case, the entire object is returned. Using specific functions on that
object is the second way to apply different summarization on the same
data. These functions are:
\begin{itemize}
  \item \Rfunction{exonCounts} function to calculate the exon counts
  \item \Rfunction{featureCounts} function to calculate the exon counts
  \item \Rfunction{transcriptCounts} function to calculate the
    transcript counts
  \item \Rfunction{geneCounts} function to calculate the gene
    counts. It takes an additional parameter defining the kind of gene
    summarization, either \Robject{bestExons} or
    \Robject{geneModels}. The \Robject{bestExons}
    summarization will return per gene, the highest exon count. The
    \Robject{geneModels} summarization first calculate a gene
    model and then return the read count for it.
  \item \Rfunction{readCounts} function to access the different count
    tables stored in the \Robject{RNAseq} object.
\end{itemize}
  
For example, to summarize the reads per transcripts, we will use the
\Rfunction{transcriptCounts} function on the previously generated
\Robject{rnaSeq} object and then use the \Rfunction{readCounts}
function to access the transcript count table.

<<Transcript counts>>=
rnaSeq <- transcriptCounts(rnaSeq)
head(readCounts(rnaSeq,'transcripts'))
@ 

Summarizing by transcript introduces some complexity in the data
analysis, \emph{i.e.} exons part of different isoforms introduce a
bias in the counts. For that reason, it might be better to have a
first look at the data, summarized by genes. This, however, requires to
combine all the alternative exons and UTRs present for every gene into
a ``gene model''; \emph{i.e.} overlapping exons are merged into
``synthetic'' ones. This is what is performed when the arguments
``count'' and ``summarization'' are set to ``genes'' and
``geneModels'', respectively. A caveat not addressed by this procedure
are genes overlapping on the same or opposite strands. If this occurs
a warning will be emitted. If the reads were summarized by
``geneModels'' and the ``outputFormat'' argument was set to
``RNAseq'', one can use the ``geneModel'' accessor on the obtained
object to access the computed gene models. They are stored in an
\Rclass{RangedData} object and can be modified to address the caveat
previously mentioned. To be strict, one would remove every overlapping
loci and conserve only the other ones. Such a modified annotation can
then be saved and used for the next \Rfunction{easyRNASeq} run.
\newline
It is not possible yet to summarize by ``geneModels'' using
the \Rfunction{geneCounts}  function. A meaningful error
message is thrown if the \Rfunction{geneCounts}  is used for that
purpose.

<<geneCounts fails, eval=FALSE>>=
try(geneCounts(rnaSeq,summarization='geneModels'))
@ 

This behavior will be corrected in the next release. At the moment,
this has to be done using the \Rfunction{easyRNASeq} function
directly. 

<<current access to geneCounts, eval=FALSE>>=
rnaSeq2 <- easyRNASeq(system.file(
                                  "extdata",
                                  package="RnaSeqTutorial"),
                      organism="Dmelanogaster",
                      readLength=30L,
                      annotationMethod="rda",
                      annotationFile=system.file(
                        "data",
                        "gAnnot.rda",
                        package="RnaSeqTutorial"),
                      count="genes",
                      summarization="geneModels",
                      pattern="[A,C,T,G]{6}\\.bam$",
                      outputFormat="RNAseq")
head(readCounts(rnaSeq2,'genes','geneModels'))
@

\newpage

\subsection{optional normalization}
\label{subsec:optNorm}

In this section, the different ``normalization'' available will be
described. First the RPKM, even though this cannot be considered as a
proper normalization based on sound statistical models, but is rather
a kind of common sense ``correction'' that one can apply to the
data. This implies normalizing the read counts depending on the genic
feature size (exon, transcript, gene model,...) and on the total
number of reads sequenced for that library. I would not recommend to
use it for doing any kind of differential expression analysis, but
it's definitely sufficient to create tracks to be displayed in a
genome browser in order to get a feel when comparing different
sample visually.

<<RPKM normalization, eval=FALSE>>=
count.table <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=30L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons",
                          filenames=c("ACACTG.bam", "ACTAGC.bam", 
                            "ATGGCT.bam", "TTGCGA.bam"),
                          normalize=TRUE
                          )
@ 

In addition, \Rfunction{easyRNASeq} count tables can be easily
transformed into RPKM, by using the \Rfunction{RPKM} method:

<<RPKM counts>>=
feature.size = width(genomicAnnotation(rnaSeq))
names(feature.size) = genomicAnnotation(rnaSeq)$exon
lib.size=c(
  "ACACTG.bam"=56643,
  "ACTAGC.bam"=42698,
  "ATGGCT.bam"=55414,
  "TTGCGA.bam"=60740)
head(RPKM(readCounts(rnaSeq,summarization="exons")$exons,
          NULL,
          lib.size=lib.size,
          feature.size=feature.size))
@ 

The same can be directly done on object of the \Rclass{RNAseq} class.

<<RPKM on RNAseq>>=
head(RPKM(rnaSeq,from="transcripts"))
@ 

<<RPKM on RNAseq,eval=FALSE>>=
head(RPKM(rnaSeq2,from="geneModels"))
@ 

A better way of normalizing the data is to use either the
\Rpackage{edgeR} or \Rpackage{DESeq} packages, provided you have got
enough replicates.

\paragraph{DESeq}
\label{par:DESeq}
To be able to normalize the data using \Rpackage{DESeq} (or
\Rpackage{edgeR} for that matter), one needs to define the
samples' ``conditions'', \textit{e.g.} ``disease'' vs. ``healthy''. To
ensure traceability, the \Rpackage{easyRNASeq} package require the
conditions to be a \emph{named vector} where the names are the raw
data filenames. 

<<define the conditions>>=
conditions=c("A","A","B","B")
names(conditions) <- c("ACACTG.bam", "ACTAGC.bam", 
                       "ATGGCT.bam", "TTGCGA.bam")
@ 

Once the conditions have been defined, one can  call the
\Rfunction{easyRNASeq} function with the \Robject{outputFormat}
argument set to DESeq. The \Robject{normalize} argument precise
whether or not to perform the normalization implemented in
\Rpackage{DESeq}. If either case, a \Robject{CountDataSet} object is
returned containing or not normalized values. Here, in addition, we
precise the kind of fit that needs to be performed by the
\Rfunction{estimateDispersion} function of the \Rpackage{DESeq}
package. Since we have so few data, the default fit would fail and we
would get an error telling us to change the \Robject{fitType}
argument. As a consequence, we set that argument to \textit{local}.

<<DESeq normalization, eval=FALSE>>=
countDataSet <- easyRNASeq(system.file(
                                      "extdata",
                                      package="RnaSeqTutorial"),
                          organism="Dmelanogaster",
                          readLength=30L,
                          annotationMethod="rda",
                          annotationFile=system.file(
                            "data",
                            "gAnnot.rda",
                            package="RnaSeqTutorial"),
                          count="exons",
                          filenames=c("ACACTG.bam", "ACTAGC.bam", 
                            "ATGGCT.bam", "TTGCGA.bam"),
                          normalize=TRUE,
                          outputFormat="DESeq",
                          conditions=conditions,
                          fitType="local"
                          )
@ 

Note that the plot produced here are irrelevant since the dataset is
too small. You can turn the plotting off by setting the \Robject{plot}
argument to \textit{FALSE}. Have a look at the \Rpackage{DESeq}
vignette (essential if you plan to use \Rpackage{DESeq} anyway!) for
proper explanation about these.

\paragraph{edgeR}

Here we perform the same procedure as that described for the
\Rpackage{DESeq} in the paragraph above, using the \Rpackage{edgeR}
instead. We simply use ``edgeR'' as the value to the ``outputFormat''
argument.

<<edgeR normalization, eval=FALSE>>=
dgeList <- easyRNASeq(system.file(
                                  "extdata",
                                  package="RnaSeqTutorial"),
                      organism="Dmelanogaster",
                      readLength=30L,
                      annotationMethod="rda",
                      annotationFile=system.file(
                        "data",
                        "gAnnot.rda",
                        package="RnaSeqTutorial"),
                      count="exons",
                      filenames=c("ACACTG.bam", "ACTAGC.bam",
                        "ATGGCT.bam", "TTGCGA.bam"),
                      normalize=TRUE,
                      outputFormat="edgeR",
                      conditions=conditions
                      )
@ 

As mentioned in the former paragraph about \Rpackage{DESeq}
(\ref{par:DESeq}, page \pageref{par:DESeq}, the plots generated here
are only semi-relevant. Check out the \Rpackage{edgeR} package
vignette for more details about these plots. See in particular chapter 10
(\Rpackage{edgeR} v2.3.24, \Robject{R} v2.14.0). Note that producing
the plots is rather slow.

\paragraph{Next steps}

At this stage you are done with the normalization and what's ahead of
you: calling differential expression, exporting track files for
visualization, etc. is not the scope of the \Rpackage{easyRNASeq}
package. This one has a few more functionalities, the most important
of which will be described in the next section. To proceed with your
data analysis, check the relevant package vignettes (\Rpackage{DESeq},
\Rpackage{edgeR}) for differential expression calling and the
\Rpackage{RnaSeqTutorial} for examples of track files generation using
the \Rpackage{rtracklayer} package.

\subsection{Performance}
\label{perf}
\textcolor{red}{
As of version $1.3.8$, it is possible to parallelize some additional
steps in the workflow. So far, only the generation of the geneModel
could be parallelized. As of now, the actual read counting and
summarization can be parallelized too. To achieve this, the 'nbCore'
argument simply has to be set to the number of CPU cores, one wants to
use. There are a number of word of caution though: 
\begin{enumerate}
\item CPU cores means CPU cores, \textit{i.e.} located on the same
  physical machine. Do not expect it to work across machines.
\item there's no load nor memory management, so watch out yourself for
  it. Memory will scale linearly with the number and size of read libraries
  (e.g. bam files) you process in parallel.
\item It's using the R 'parallel' package, so it's supported on all
  three main OS. However, some cosmetic reporting might get lost.
\end{enumerate}
That's it for the cautionary part. And despite these, try it, you'll
get a time boost and that's worth monitoring CPU and RAM for a while.
}

\newpage

\section{Advanced usage}

In this section we will discuss about more advanced RNA-Seq
pre-processing, such as de-multiplexing, or the \emph{de novo}
identification of expressed regions. 
The same can be directly done on object of the \Rclass{RNAseq} class.

\subsection{De-multiplexing samples}
\label{ss:demultiplex}
Nowadays, NGS machines produces such huge quantity of``raw'' reads
(40M, 100M and 160M per Illumina GAIIx, ABI SOLiD4 or Illumina HiSeq
lanes (or equivalent), respectively ), that the coverage obtained per
lane for the transcriptome of ``small'' genome-sized organisms, is for
a single sample essentially a waste of resource. \textit{E.g.} one
lane of HiSeq results in an approximate 2,800X coverage if the
\textit{Saccharomyces cerevisiae} genome which is 12Mb large.
Therefore, techniques to have several samples running as a
\emph{single} library have been
created \citep{Lefrancois:2009p778,Smith:2010p777}, using 4-6bp
barcodes to uniquely identify the sample. This is called
\textbf{multiplexing} and one can today with an average Illumina
GenomeAnalyzer GAIIx average run (105bp read, single end), multiplex
48 yeast samples in a single lane at an almost 30X coverage. This
approach is very advantageous for researchers, especially in term of
costs, but it adds an additional layer of pre-processing that is not
as trivial to process as one would think.  Extracting the barcodes
would be fairly straightforward, but for the average 0.1 percent
sequencing error rate that introduces a lot of multiplicity in the
actual barcodes present in the samples. A proper design of the
barcodes, maximizing the Hamming
distance\citep{Hamming:1950p825,Pilcher:2008p824} is an essential step
for a proper de-multiplexing.
\newline
There are two kinds of barcoding, the one described in
\cite{Lefrancois:2009p778} where the barcode is part of the read
sequence and the one developed by Illumina, where the barcode is read
in a separate sequencing reaction after the first mate sequencing.
\newline
The data used in the following example has been sequenced using the
Illumina protocol. We'll look at the specificities that this introduce.

\textbf{Important Note:} This pre-processing procedure has to be
applied on the raw reads before any alignment is performed. Most
often, one would use the ``fastq'' formatted file as input. In the
particular case of the Illumina protocol, the barcodes can be retrieved
from the fastq ID lines or from the Illumina export format. The export
format is normally the result of aligning the reads with ELAND, the
Illumina aligner, but it can be generated as well without aligning the
reads, \textit{i.e.} the export file may not contain any alignment
information. As the ShortRead package offers a quick functionality to
access the barcode field of an export file, we use this example to
introduce it.

\begin{comment}
Are you sure the Hamming distance is the best? There is a lot of
literature on error-correcting codes and I think there may be better
and less naive methods, \emph{e.g.} taking into account error rates and correlations.
\end{comment}

<<cleanup, echo=FALSE>>=
file2clean=c(
  "ACACTG.fastq",
  "ACTAGC.fastq",
  "ATGGCT.fastq",
  "TTGCGA.fastq")
silent <- sapply(
                 file2clean,
                 function(f2c){
                   if(file.exists(f2c)){file.remove(f2c)}
                 })
@

<<Demultiplexing sample>>=
alns <- readAligned(
                    system.file(
                                "extdata",
                                package="RnaSeqTutorial"),
                    pattern="multiplex_export",
                    filter=compose(
                      chastityFilter(),
                      nFilter(2),
                      chromosomeFilter(regex="chr")),
                    type="SolexaExport",
                    withAll=TRUE)
@ 

Note the use of the \Robject{withAll} argument. It is essential to get
the barcode, since this data was multiplexed using the
Illumina protocol. For the Illumina protocol, the barcode is read in a
separate sequencing reaction and its sequence is reported as a
field of the \emph{export} file. This field is not parsed by
default by the \Rfunction{readAligned} to save time and
memory. It becomes available when the data is loaded using
the\Robject{withAll} argument and is afterwards accessible in the
metadata of the returned object. It can be accessed using the
following command:

<<Illumina example, eval=FALSE>>=
alignData(alns)$multiplexIndex
@

where \Robject{alns} is the object of the \Rpackage{ShortRead} \Rclass{AlignedRead} class.
\newline

In the following, we will look at the other kind of barcoding, where
the barcode is part of the sequenced sequence. First, we will create
some plots to evaluate the barcoding efficiency. 

<<barcodes>>=
barcodes=c("ACACTG","ACTAGC","ATGGCT","TTGCGA")
@ 
<<barcodePlot, fig=TRUE, height=6, eps=FALSE>>=
barcodePlot(alns,
            barcodes=barcodes,
            type="within",
            barcode.length=6,
            show.barcode=20,
            main="All samples",
            xlim=c(0,0.5))
@ 

% \begin{figure}[htbp]
% \begin{center}
% \includegraphics[width=0.8\textwidth]{easyRNASeq-barcodePlot}
% \end{center}
% \caption{Barcode occurencies in the whole dataset}
% \label{fig:barcodePlot}
% \end{figure}

All the barcodes seem to be almost equally distributed. Every one has
a proportion close to 25\%. Just the ``ACTAGG'' seems to have been
either less amplified during the library preparation or simply had a
lower concentration than the other or generated less
clusters. Overall, only a low percentage of them cannot be surely
assigned. Once this has been verified, the sample can be
\emph{demultiplexed}.

<<demultiplex>>=
dem.alns <- demultiplex(alns,
                        barcodes=barcodes,
                        edition.dist=2,
                        barcodes.qty=4,
                        type="within")
@ 

<<check the objects, eval=FALSE>>=
dem.alns$reads[[1]]
dem.alns$barcodes[[1]]
@

Remains to write the extracted data to file, or proceed it within
R. Again performing some validation plot is important to ensure that
the \emph{de-multiplexing} process succeeded.

<<demPlot, fig=TRUE, height=8, width=8, eps=FALSE>>=
par(mfrow=c(2,2))
barcode.frequencies <- lapply(
                              names(dem.alns$barcodes),
                              function(barcode,alns){                                
                                barcodePlot(
                                            alns$barcodes[[barcode]],
                                            barcodes=barcode,
                                            type="within",barcode.length=6,
                                            show.barcode=20,
                                            main=paste(
                                              "Expected barcode:",
                                              barcode))
                              },dem.alns)
@ 

% \begin{figure}[htbp]
% \begin{center}
% \includegraphics[width=0.8\textwidth]{easyRNASeq-demPlot}
% \end{center}
% \caption{Barcode occurences independantly per sample}
% \label{fig:demPlot}
% \end{figure}

Here we see the demultiplexed barcodes. It's important to assess if
within a given sample, there was a barcode bias. Generally, visually
assessing the data is important to make sure that the raw data you are
looking at agrees with the experimental design that was
performed. This is discussed in the \Robject{RnaSeqTutorial} vignette
of the \Rpackage{RnaSeqTutorial}.
\newline
Finally, we can save the demultiplexed files to disk.
<<write the file>>=
status <- lapply(
                 names(dem.alns$barcodes),
                 function(barcode,alns){         
                   writeFastq(
                              alns$reads[[barcode]],
                              file=paste(
                                barcode,
                                "fastq",sep="."))
                 },dem.alns)
@ 

<<cleanup, echo=FALSE>>=
file2clean=c(
  "ACACTG.fastq",
  "ACTAGC.fastq",
  "ATGGCT.fastq",
  "TTGCGA.fastq",
  "Rplots.pdf")
silent <- sapply(
                 file2clean,
                 function(f2c){
                   if(file.exists(f2c)){file.remove(f2c)}
                 })
@ 

The fastq files are now ready to be aligned with your preferred aligner
and the resulting bam files processed with the \Rfunction{easyRNASeq}
function as described in the first part of the present vignette! Enjoy.

\newpage
\section{Yet to come}
\label{s:ytc}
\textcolor{red}{
The \Rfunction{easyRNASeq} can now return a
\Rclass{SummarizedExperiment} object. The aim is to have it be the
only output in the next development version of the
\Rpackage{easyRNASeq} package (version $1.5.x$). This in order to
consolidate the kind of objects used for Next-Generation Sequencing in
the Bioconductor repository (rather ambitious, yes). For this, as of
version $1.3.14$, a new function: \Rfunction{count} has been
introduced that supersedes the \Rfunction{easyRNASeq} function. An
example follows, however the parameters -currently alike those of the
\Rfunction{easyRNASeq} - will be refined in order to consolidate and
structure them, so do not rely on them for writing production code.
}

<<count function example, eval=FALSE>>=
## creating a SummarizedExperiment from 4 bam files
sumExp <- count(filesDirectory=system.file(
                  "extdata",
                  package="RnaSeqTutorial"),
                pattern="[A,C,T,G]{6}\\.bam$",
                readLength=30L,
                organism="Dmelanogaster",
                chr.sizes=seqlengths(Dmelanogaster),
                annotationMethod="rda",
                annotationFile=system.file(
                  "data",
                  "gAnnot.rda",
                  package="RnaSeqTutorial"),
                count="exons"
                )
## the counts
assays(sumExp)
## the sample info
colData(sumExp)
## the 'features' info
rowData(sumExp)

@

\textcolor{red}{
See the \Rpackage{GenomicRange} package \Rclass{SummarizedExperiment}
class for more details on last three accessors used in the example.
}

\newpage

\section{Use Cases}
\label{s:useCases}
The following use cases have been created to answer user request from
the Bioconductor mailing list. We want to thank Francesco Lescai,
Wade Davis and Richard Friedman for asking pertinent questions that
helped us make this vignette better.
\newline
The first use case examplify how to use the \Rfunction{easyRNASeq} to
get \Rpackage{DESeq} normalized data from two human samples. It
introduces as well how to use the \Rpackage{GenomicFeatures} to
retrieve annotations. \textcolor{red}{Note that the data for this example
is not readily available. If you want to reproduce this example, you
will need to get 2 (at least) fastq files from either the SRA or ENA websites and
align them against the human genome using your aligner of choice and
convert the alignment into the BAM format. An example of how to
achieve this in R is described in the beginning of the use case.}
\newline
The second use case describe how to combine different annotation
(chromosomic and genic), when for example the chromosome names in the
aligned file(s) are different from the annotation retrieved using
\Rpackage({biomaRt}. In both use case, we will assume the
\Rpackage{easyRNASeq} library as already been loaded as in:

<<load the lib, eval=FALSE>>=
library(easyRNASeq)
@

\paragraph{Processing a set of human samples}
\label{p:pasohs}

\subparagraph{Getting the data}
\textcolor{red}{If you already have a set of bam files resulting of short read
alignments against the human genome, you can just proceed to the next
paragraph. If not here is an example on how to retrieve and align data
in R. Note that the \Rpackage{Rsubread} is only supported on the linux
platform. You'll need to use your aligner of choice and generate BAM
files if you're using another platform. However you should still be
able to download the SRA/ENA files. The files listed in this example
are from the \citet{Jaager:2012p4708} study, but where simply selected for their
small size. Still downloading and creating all the necessary file
requires times and a computer with a sufficient amount of memory to
index the human genome.}
<<bam prep, eval=FALSE>>=
library(BSgenome.Hsapiens.UCSC.hg19)
library(GEOquery)
library(SRAdb)
library(Rsamtools)
library(Rsubread)

## create a temp dir
dir.create(file.path(getwd(),"tmp1234"))

## change the working directory
setwd(file.path(getwd(),"tmp1234"))

## init SRA
sqlfile <- getSRAdbFile()

## init a connection
sra_con <- dbConnect(SQLite(),sqlfile)

## list the files
acc <- c("SRR490224","SRR490225")
getFASTQinfo( in_acc = acc, srcType = 'ftp' )

## get the read files
getSRAfile( in_acc=acc, sra_con, destDir = getwd(),
           fileType = 'fastq', srcType = 'ftp' )

## close the connection
dbDisconnect( sra_con )

## write the human genome sequences
writeXStringSet(Reduce(append,
                       lapply(seqnames(Hsapiens),
                              function(nam){
                                dss <- DNAStringSet(unmasked(Hsapiens[[nam]]))
                                names(dss) <- nam
                                dss
                              })),
                file="hg19.fa")

## create the indexes
dir.create("indexes")
buildindex(basename=file.path("indexes","hg19"),
           reference="hg19.fa")

## align the reads
sapply(dir(pattern="*\\.gz$"),function(fil){
  ## decompress the files
  gunzip(fil)
  
  ## align
  align(index=file.path("indexes","hg19"),
        readfile1=sub("\\.gz$","",fil),
        nsubreads=2,TH1=1,
        output_file=sub("\\.fastq\\.gz$","\\.sam",fil)
        )
  
  ## create bam files
  asBam(file=sub("\\.fastq\\.gz$","\\.sam",fil),
        destination=sub("\\.fastq\\.gz$","",fil),
        indexDestination=TRUE)
})

@ 

\textcolor{red}{Note that this has generated a number of files that
  you should clean up afterwards, i.e. delete the ``tmp1234'' folder,
  once you are done with the use case.}

\subparagraph{Processing the data}
First, we start by retrieveing the size of the chromosomes. This is an
important information for calculating any feature count. Actually,
neither the BSgenome nor any related packages are required for
\Rpackage{easyRNASeq} to run. As they are the easiest way to access
genomic information such as chromosome lengths within the
R/Bioconductor framework, they are made available to the
\Rfunction{easyRNASeq} for that purpose. However, providing a simple
named \textcolor{red}{vector} is sufficient and therefore is easyRNASeq not limited to
existing \Rpackage{BSgenome} organisms. The chromosome size is
essential for one reason: to provide a complete representation of the
data. When counting reads per features, one get counts for these
features that have at least one read aligning to them, i.e. every
feature having no reads will be missed. One could then either return
only those features having counts or returning a value of 0 counts for
those that do not. We do not find these solutions satisfying and to
ensure that we provide coherent data, we return the counts for every
feature present on the chromosomes. For that purpose, the chromosome
size is essential as it allows us to define those features on a
chromosome that are located between the last feature having a number a
count bigger or equal to one and the end of the chromosome - features,
which would otherwise be ignored. It is as well a mean to monitor that
the provided annotation is pertinent.
\textcolor{red}{As of version $1.3.5$, when using the 'bam' format, it
  is even easier. Setting the 'chr.sizes' argument to \textbf{auto}
  will result in the chromosome sizes to be retrieved from the BAM
  header. However, the purpose of this use case is to, at least partly,
  demonstrate the importance of using appropriate annotations and
  therefore the use of the 'chr.sizes' argument is demonstrated.
}
\label{chrsize}

<<chromosome size, eval=FALSE>>=
library(BSgenome.Hsapiens.UCSC.hg19)
chr.sizes=seqlengths(Hsapiens)
@ 

Then, we list the BAM files. 

<<bam files name, eval=FALSE>>=
bamfiles=dir(getwd(),pattern="*\\.bam$")
@

We can now run the \Rfunction{easyRNASeq} function, fetching the
annotation using \Rpackage{biomaRt}. As this is time consuming (about
10 minutes from an average network) and
since we might want to work on these annotation to avoid
double-counting, we first ask the function to return an instance of the
\Rclass{RNAseq} class. Then, using the genomicAnnotation accessor, we 
extract the retrieved annotation.
\textcolor{red}{Note: As of version $1.3.5$, 'bam' is the default
  format. Hence, the argument 'format' has been removed from the
following function call.}

<<Fetching with biomaRt, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory=getwd(),
                    organism="Hsapiens",
                    chr.sizes=chr.sizes,
                    readLength=58L,
                    annotationMethod="biomaRt",
                    count="exons",
                    filenames=bamfiles[1],
                    outputFormat="RNAseq"
                    )
gAnnot <- genomicAnnotation(rnaSeq)
@

As one can see by looking at this object, it contains $434$
``chromosomes'', most of which are of no interest to us. For that
reason, we first filter it and then save it to the disk for later
re-use. The ``.rda'' extension is a synonym of the ``.RData'' one and
identifies a serialized R data file.
\label{rda}

<<filtering the annot, eval=FALSE>>=
gAnnot <- gAnnot[space(gAnnot) %in% paste("chr",c(1:22,"X","Y","M"),sep=""),]
save(gAnnot,file="gAnnot.rda")
@ 

Now using the modified, saved annotation, we can get a count table as follows:

<<get a count table, eval=FALSE>>=
countTable <- easyRNASeq(filesDirectory=getwd(),
                        organism="Hsapiens",
                        chr.sizes=chr.sizes,
                        readLength=58L,
                        annotationMethod="rda",
                        annotationFile="gAnnot.rda",
                        count="exons",
                        filenames=bamfiles[1]
                        )
@ 

Another way to retrieve the annotation is to use the
\Rpackage{GenomicFeatures} library. \Rpackage{easyRNASeq} does not yet supports
that library automatically, but it can take \Robject{GRangesList}
object as input; objects that are derivatives of the ones returned by the functions of the
\Rpackage{GenomicFeatures} package. A few changes needs to be done to
the obtained object so that it can be used by the
\Rfunction{easyRNASeq}: first, a metadata element needs to be updated
and then the object needs to be converted into a \Robject{GRangesList}.
\label{grnglist}

<<using different annotation, eval=FALSE>>=
library(GenomicFeatures)
hg19.tx <- makeTranscriptDbFromUCSC(
                                   genome="hg19",
                                   tablename="refGene")

gAnnot <- exons(hg19.tx)

colnames(elementMetadata(gAnnot)) <- "exon"

gAnnot <- split(gAnnot,seqnames(gAnnot))
@ 

As previously, this annotation could be saved to disk or can as well
be used directly, using the \Robject{annotationMethod} "env" and the
\Robject{annotationObject} arguments. Additionaly, one can select chromosome of
interest using the \Robject{chr.sel} argument. It accepts a vector of
chromosome names. In the following, we subset for the chromosome ``chr1'' only.
\label{env}
\label{chr.sel}

<<sub setting, eval=FALSE>>= 
countTable <- easyRNASeq(filesDirectory=getwd(),
                        organism="Hsapiens",
                        readLength=58L,
                        annotationMethod="env",
                        annotationObject=gAnnot,
                        count="exons",
                        filenames=bamfiles[1],
                        chr.sel="chr1"
                        )
@

\textcolor{red}{Note that in the previous call, we removed the
  'chr.sizes' argument, just to demonstrate that the chr.sizes can be
  retrieved from the bam files. Removing the argument is the same as
  setting it to its default value: 'auto'}

\textcolor{red}{Alternatively, one can use the \Rfunction{count}
function to get a \Rclass{SummarizedExperiment} object back. Mind the
comments in section \ref{s:ytc}, page \pageref{s:ytc} though.}

<<using the count function, eval=FALSE>>=
sumExp <- count(filesDirectory=getwd(),
                    organism="Hsapiens",
                    readLength=58L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    count="exons",
                    filenames=bamfiles[1],
                    chr.sel="chr1"
                    )
@ 

Finally, instead of returning a count table, we can get a
\Robject{CountDataSet} instance from the \Rpackage{DESeq}
package. This will perform the normalization of the data and generate
some Quality Assessment plots. In the present example, it would not yield
very sensitive results as we have no replicates (biological). These
are important to \Rpackage{DESeq} to accurately model the technical and biological variance.
With no replicates for every condition, the dispersion will be based
on a "pooled" estimate making the differential expression call lose
sensitivity. In addition, \Rpackage{DESeq} is in such cases using a
conservative approach (which is good) so you'd get even less
significant results. Here, as we have no replicates, we need to pass
additional arguments (the last two) that will be tunnelled to the \Rpackage{DESeq}
\Rfunction{estimateDispersions} function. See the \Rpackage{DESeq} vignette for
more details on these.

<<DESeq run, eval=FALSE>>=
conditions <- c("A","B")
names(conditions) <- bamfiles

countDataSet <- easyRNASeq(filesDirectory=getwd(),
                          organism="Hsapiens",
                          chr.sizes=chr.sizes,
                          annotationMethod="env",
                          annotationObject=gAnnot,
                          count="exons",
                          filenames=bamfiles,
                          chr.sel="chr1",
                          outputFormat="DESeq",
                          normalize=TRUE,
                          conditions=conditions,
                          fitType="local",
                          method="blind"
                        )
@ 

\textcolor{red}{Note that as the read length differs between the two
files: $58$ and $76$, the \Robject{readLength} argument was removed
from the previous function call. The read size is then identified
automatically.}

\paragraph{Dealing with annotation inconsistencies}
\label{p:dwai}
This use case shows how to deal with inconsistent annotations,
\textit{e.g.} when the chromosome names present in the aligned file
are different from those that can be retrieved from an annotation
source such as \Rpackage{biomaRt}. 
\newline
First, we have a look at the data, in this case some Illumina export
files. Reading in the data using the \Rpackage{ShortRead} package is
quite resource demanding as the whole sequences are loaded in memory.
Then we look at the chromosome names. These differs from what we
expect - UCSC standards - as they have an additional ``.fa'' extension.

<<read the data, eval=FALSE>>=
aln <- readAligned("data",type="SolexaExport",pattern="*.txt.gz")
gc()
levels(chromosome(aln))
@ 

<<fake the data, echo=FALSE>>=
c("chr1.fa","chr10.fa","...","chrY.fa")
@ 

They are different from what \Rpackage{biomaRt} will return as well:
\textit{i.e.} 1:19, X, Y and MT plus others. This triple inconsistency will
be a problem for \Rpackage{easyRNASeq}. If there were only two sets of
names, using the "custom" chromosome name map by-pass (see the Details
section of the ``?easyRNASeq'' help page) would solve the
issue. However, in the present particular case, as we are retrieving
the annotation from \Rpackage{biomaRt}, we need to precise the name of
the organism, which circumvent the chromosome name mapping
by-pass. The solution is to first fetch the annotation, modify it and
save it as an R data file. Some of the retrieved annotation are ``NT''
contigs. There are only a few of them, so instead of filtering them
out, we just ignore them.

<<fetch the annotation, eval=FALSE>>=
obj <- fetchAnnotation(new('RNAseq',
                          organismName="Mmusculus"
                          ),
                      method="biomaRt")

gAnnot <- genomicAnnotation(obj)

length(grep("NT_",space(gAnnot)))

@ 

<<fake number 2, echo=FALSE>>=
1181
@ 

<<modify the annotation, eval=FALSE>>=
names(gAnnot) <- paste("chr",names(gAnnot),".fa",sep="")
@

As described previously, see page \pageref{rda} we can save the
annotation to a file or use it directly, using the
\Robject{AnnotationMethod} ``env'' argument. In that later case,
since we did not process the annotation, numerous warnings concerning
possible multiple countings of reads will be raised. Double counting
reads is not what ones want, that is why the better way is to rework
the obtained annotation to remove/clarify these cases and save the
obtained annotation as an rda file on your file system. 
\newline
\textcolor{red}{Note that \Rpackage{easyRNASeq} does support some chromosome names
conversion by default. The list of organism for which this is possible
can be listed using the knownOrganisms function.}
\newline
Before going on, we do some cleanup as some of the objects we have
generated take large amounts of memory.

<<clean, eval=FALSE>>=
rm(aln,obj)
gc()
@ 

As on page \pageref{chrsize} we get the chromosome sizes.
Again, note that the use of the BSgenome is not mandatory. It's just
easy as they are available in Bioconductor. Typing in your own
chromosome sizes \textcolor{red}{named vector} is as valid.

<<chrsize, eval=FALSE>>=
library(BSgenome.Mmusculus.UCSC.mm9)
chr.sizes<- seqlengths(Mmusculus)
@ 

\textcolor{red}{Note that now, providing the chromosome sizes as a list is deprecated.
It has been changed to a named vector, which is more
intuitive. The list was an historical remnant of the RangedData API. 
Note as well that the chr.sizes will be necessary here
as we are using 'aln' files and not 'bam' files.}
\newline
We can now create the chromosome name mapping.

<<create the chr.map, eval=FALSE>>=
chr.map <- data.frame(
                      from=paste("chr",c(1:19,"X","Y"),".fa",sep=""),
                      to=paste("chr",c(1:19,"X","Y"),sep=""))
@ 

Using this chromosome map, we can now summarize the reads per feature
of interest. Here we want to look for gene models, so we set the
\Robject{count} and \Robject{summarization} argument. Note again that
the \Robject{summarization} argument will be deprecated in the near
future and its values merged with the \Robject{count} ones. 
\newline
Asking to get back an \Rclass{RNAseq} instance allows us to look at the gene
models defined by \Rpackage{easyRNASeq}. This offers the possibility
to clean them to avoid multiple counting. An additional use case
describing how to do this will be introduced soon. 
\newline
As we are provided with Illumina export data, we need to define a set
of Filter to ensure that the data is read properly. Indeed, the export
file contains all the reads, so the one that do not pass the chastity
filter have to be removed. In addition, some of the other reads are
for internal QC, and they have no position. For that reason, we need
to filter those too out. 
\newline
Reading in export data is more resource exhausting that reading in bam
files, as we are loading in the sequence and quality information as
well, whereas we do not need them. 
\newline
We will now look through three different set of parameters that will
stepwise reduce the number of warnings emitted. These warnings are
there to help you understand the different pitfalls that the
\Rfunction{easyRNASeq} helps you avoiding when analysing your RNA-Seq
data. The first approach generates a lot of warnings, because of the
differing annotations, since we are using the entire set of annotation
we got. As we do not want to generate all these warnings, these code
lines are not evaluated. To get a feel about these warnings, they will look as the follows:

\begin{verbatim}
Warning messages:
1: In .convertToUCSC(names(genomicAnnotation(obj)), organismName(obj),  :
Your custom map does not define a mapping for the following 
chromosome names: chrMT.fa
2: In easyRNASeq(filesDirectory = headDir, organism = "custom", chr.map = chr.map,  :
There are 6096 synthetic exons as determined from your annotation that overlap!
This implies that some reads will be counted more than once!
Is that really what you want?
\end{verbatim}

Let us start with the full set of annotation.

<<no filter, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory="data",
                    organism="custom",
                    chr.map=chr.map,
                    chr.sizes=chr.sizes,
                    filter=compose(
                      naPositionFilter(),
                      chastityFilter()),
                    readLength=50L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    format="aln",
                    count="genes",
                    summarization= "geneModels",
                    filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                    outputFormat="RNAseq",
                    nbCore=2
                    )
@ 

To reduce the number of warnings emitted, we can select for the
chromosome we are interested in as previously done on page \pageref{chr.sel}.

<<selected chr, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory="data",
                    organism="custom",
                    chr.map=chr.map,
                    chr.sizes=chr.sizes,
                    chr.sel=chr.map$from,
                    filter=compose(
                      naPositionFilter(),
                      chastityFilter()),
                    readLength=50L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    format="aln",
                    count="genes",
                    summarization= "geneModels",
                    filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                    outputFormat="RNAseq",
                    nbCore=2
                    )
@

Finally, to further reduce the warnings, we can manipulate the
\Robject{RangedData} object to remove the unnecessary annotation.

<<remove annotation, eval=FALSE>>=
sel <- grep("NT_",names(gAnnot))
gAnnot <- RangedData(ranges=ranges(gAnnot)[-sel,],values=values(gAnnot)[-sel,])
colnames(gAnnot) <- gsub("values\\.","",colnames(gAnnot))
@ 

This last call will only generate two warnings, one that could be
easily dealt with (a complaint about the chrMT). The other one is
about double counting and it requires to adapt the annotation.

<<final call, eval=FALSE>>=
rnaSeq <- easyRNASeq(filesDirectory="data",
                    organism="custom",
                    chr.map=chr.map,
                    chr.sizes=chr.sizes,
                    chr.sel=chr.map$from,
                    filter=compose(
                      naPositionFilter(),
                      chastityFilter()),
                    readLength=50L,
                    annotationMethod="env",
                    annotationObject=gAnnot,
                    format="aln",
                    count="genes",
                    summarization= "geneModels",
                    filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
                    outputFormat="RNAseq",
                    nbCore=2
                    )
@

\newpage

\section{Session Information}

The version number of R\citep{ref:R} and packages loaded for
generating the vignette were: 

<<SessionInfo, echo=FALSE>>=
library(easyRNASeq)
library(BSgenome.Dmelanogaster.UCSC.dm3)
library(RnaSeqTutorial)
sessionInfo()
@

\newpage

\section{Final remarks}
\label{sec:finalRem}
RNA-seq is still maturating and a lot of new developments are to be
expected. If you have any questions, comments, feel free to contact me:
delhomme \emph{at} embl \emph{dot} de.

\begin{comment}
  The following code is just to make sure that the functionality of
  strand and reduce are kept, despite the NAMESPACE clashes between
  imported packages.
\end{comment}

<<some test, echo=FALSE>>=
grngs = GRanges(seqnames=c("chr1", "chr2"),
  ranges=IRanges(start=1:2, end=2:3), 
  strand=c("+","-"))
silent <- strand(grngs)
silent <- reduce(grngs)
@ 

\newpage

\bibliographystyle{plainnat}

\bibliography{easyRNASeq}

\end{document}