%\VignetteIndexEntry{Description of ExiMiR} 

\documentclass[11pt,a4paper]{article}
\usepackage[OT1]{fontenc}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{Sweave}
\usepackage{subfigure}
\DeclareGraphicsExtensions{.png}
\graphicspath{{images}}


\textwidth=6.2in
\textheight=8.5in
%\textheight=9.0in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\renewcommand{\thefootnote}{\alph{footnote}}

\begin{document}
\title{Description of ExiMiR}
\author{Sylvain Gubian, Alain Sewer, PMP SA}

\maketitle
\tableofcontents

\section{Introduction}

The \emph{ExiMiR} package provides tools for normalizing miRNA expression data obtained
from Exiqon miRCURY LNA\texttrademark\ arrays. It gives the possibility of applying a
novel miRNA-specific normalization method using spike-in probes and is based on
controlled assumptions \cite{bibi}. These featurse allow to take into account the
differences between miRNA and gene (mRNA) expression data, as discussed in a recent
study \cite{sarkar}. \\

\emph{ExiMiR} is particularly suited for two-color microarray experiments using a
common reference. In such cases the spike-in probe-based normalization method allows to
treat the raw data as if they were coming from single-channel arrays, like
Affymetrix\textsuperscript{\textregistered} Genechip\textsuperscript{\textregistered}.
This is why the classes and functions in \emph{ExiMiR} have been designed to
closely resemble those of the "single-color" \emph{affy} package, while remaining
compatible with those of the "two-color" \emph{limma} package. \\

Further features of \emph{ExiMiR} include:
\begin{itemize}
\item
reading raw data in the ImaGene\textsuperscript{\textregistered} TXT format
provided by Exiqon;
\item
allowing to update the array probe annotations to the latest
\href{http://www.mirbase.org}{miRbase} releases incorporated in the Exiqon GAL files.
\end{itemize}

This vignette also shows how to process raw expression data obtained from the
Affymetrix\textsuperscript{\textregistered} Genechip\textsuperscript{\textregistered}
miRNA array (CEL and CDF files), in order to illustrate the similarities between
\emph{ExiMiR} and \emph{affy}.  


\section{Raw and annotation data}
\label{sec:data}

This section describes how to find raw and annotation data on which \emph{ExiMiR} can
be applied.\footnote{N.B.: R objects correponding to the raw and annotation data described in this section
are provided by the \emph{ExiMiR} package, see Section \ref{sec:example}} It covers both Affymetrix (CEL and CDF) and Exiqon/ImaGene (TXT and GAL)
cases. If you have your own expression data in CEL or TXT formats, then you just need
to complete them with the annotation files in CDF or GAL formats, respectively, as
described below. Do no forget the appropriate \verb+samplelist.txt+ file for the Exiqon
case.

\subsection{Affymetrix}
\label{subsec:affydata}
First create a directory \verb+Affymetrix+ in your file system. The
\href{http://www.ncbi.nlm.nih.gov/geo/}{GEO repository} contains several datasets using
the Affymetrix miRNA array. We choose the series
\href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19183}{GSE19183} for which
the raw data file \verb+GSE19183_RAW.tar+ can be downloaded from this
\href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?mode=raw&acc=GSE19183&db=GSE19183%5FRAW%2Etar&is_ftp=true}{URL}.
Extract the CEL files into the \verb+Affymetrix+ directory. Then get the annotation
library \verb+miRNA_libraryfile.zip+ from the Affymetrix website at this
\href{http://www.affymetrix.com/support/downloads/library_files/miRNA_libraryfile.zip}{
URL}. Extract the file \verb+miRNA-1_0.CDF+ from the directory
\verb+/CD_miRNA-1_0_v3/Full/miRNA-1_0/LibFiles/+ into the \verb+Affymetrix+ directory
as well.


\subsection{Exiqon} 
\label{subsec:exidata}


First create a directory \verb+Exiqon+ in your file system. The GEO series
\href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20122}{GSE20122} contains a
suitable raw data file \verb+GSE20122_RAW.tar+ at the following
\href{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?mode=raw&acc=GSE20122&db=GSE20122%5FRAW%2Etar&is_ftp=true}{URL}.
After downloading it, extract the enclosed raw data file in ImaGene TXT format into the
\verb+Exiqon+ directory. Then download the corresponding annotation GAL file from the
following
\href{http://shop.exiqon.com/annotations/download/gal_208200-A,208201-A,208202-A_lot31022-31022_hsa,mmu,rno-and-related-vira_from_mb160,miRPlus.gal}{URL}
into the \verb+Exiqon+ directory as well. Finally, copy and paste the content of
Appendix \ref{sec:app} from this vignette into a TAB-separated text file called
\verb+samplesinfo.txt+ and also located the \verb+Exiqon+ directory. This file is
required by \emph{ExiMiR} to match the raw data results from the Hy3 and Hy5
channels. It contains the names of all the TXT files in the experiment, organized in a
table with one row for each array and two columns corresponding to the two channels Hy3
and Hy5. \footnote{For practical purposes only 12 of the 54 Hy3-Hy5 raw data filenname
pairs from the GEO series GSE20122 are listed in Appendix \ref{sec:app}. Feel free to
complete \texttt{samplelist.txt} with the 42 remaining ones if you want to be
exhaustive!} It is very similar to the file \verb+target.txt+ required by the
\emph{limma} package and is usually provided by Exiqon together with the raw data TXT
files. 


\section{Raw data normalization}
\label{sec:norm}

This section describes how to apply \emph{ExiMiR} to normalize raw miRNA
expression data obtained from the Affymetrix\textsuperscript{\textregistered}
Genechip\textsuperscript{\textregistered} or from the Exiqon miRCURY LNA\texttrademark\
arrays. Notice that although these descriptions are generic, some of the
filenames given in the command-line examples might differ from case to case (e.g. GAL
filenames). Begin by launching an \verb+R+ session at the same level as the
\verb+Affymetrix+ and \verb+Exiqon+ directories created in Section \ref{sec:data}. 


\subsection{Affymetrix: from CEL files to ExpressionSet objects}
\label{subsec:affynorm}


First create the array annotation environment using the CDF file \verb+miRNA-1_0.CDF+
and the \emph{makecdfenv} package (set previously your working directory to the parent directory of the 'Affymetrix' folder):

\begin{Sinput}
R> library(makecdfenv)
R> cdfenv <- make.cdf.env(cdf.path="Affymetrix", filename='miRNA-1_0.CDF')
\end{Sinput}
Then load the CEL file raw data into an AffyBatch object using the \emph{affy}
package:

\begin{Sinput}
R> library(affy)
R> abatch <- ReadAffy(cdfname='cdfenv', celfile.path='Affymetrix')
\end{Sinput}
Raw data normalization is directly applied on \verb+abatch+ to create an
ExpressionSet object. For instance:

\begin{Sinput}
R> eset.rma <- rma(abatch)
\end{Sinput}
As an alternative to the \verb+rma+ quantile normalization validated for gene (mRNA)
expression, the spike-in probe-based approach in \emph{ExiMiR} might give better
results for miRNA expression data \cite{bibi,sarkar}:   

\begin{Sinput}
R> library(ExiMiR)
R> eset.spike <- NormiR(abatch)
\end{Sinput}	
For the GSE19183 dataset, the assumptions allowing the application of the \verb+NormiR+
spike-in probe-based normalization are not satisfied and a default median normalization
is performed instead. Section \ref{sec:trouble} describes this safeguarding strategy
and the options allowing to deal with problematic cases. If the \verb+NormiR+
assumptions are satisfied, a series of control figures are generated. Their
description is given in Section \ref{subsec:figs} below. 


\subsection{Exiqon: from TXT files to ExpressionSet objects}
\label{subsec:exinorm}

First load the \emph{ExiMiR} package:

\begin{Sinput}
R> library(ExiMiR)
\end{Sinput}
Then create the array annotation environment using the GAL file and the
\verb+make.gal.env+ function: 

%\small
\begin{Sinput}
R> galenv <- make.gal.env(gal.path='Exiqon')
\end{Sinput}
%\normalsize
Read the raw data TXT files into an AffyBatch object using the \verb+ReadExi+ function:

\begin{Sinput}
R> ebatch <- ReadExi(galname='galenv', txtfile.path='Exiqon')
\end{Sinput}
Similarly to the Affymetrix case in Subsection \ref{subsec:affynorm}, the raw data
normalization is applied on \verb+ebatch+ to create an ExpressionSet object. For
instance the \verb+rma+ quantile normalization from the \emph{affy} package, using the option \verb+background=FALSE+, as recommanded by a recent study\cite{lopez}:

\begin{Sinput}
R> library(affy)
R> eset.rma <- rma(ebatch, background=FALSE)
\end{Sinput}
However, the assumptions for applying \verb+rma+ are not guaranteed to be satisfied in
the case of miRNA expression data \cite{bibi, sarkar}. It might be better to use
the spike-in probe-based method from \emph{ExiMiR}:

\begin{Sinput} 
R> eset.spike <- NormiR(ebatch)
\end{Sinput}
If all the assumptions underlying \verb+NormiR+ are satisfied, a series of control
figures are generated, that will be explained in Subsection \ref{subsec:figs} below. If
one or more assumptions are not met, then the median normalization is applied instead
of the spike-in probe-based method. However, \emph{ExiMiR} offers several options
to deal with such situations, as explained in Section \ref{sec:trouble}
below. 


\subsection{Control figures for the spike-in probe-based normalization}
\label{subsec:figs}

In order to follow the execution of the spike-in probe-based normalization implemented
in \texttt{NormiR}, a series of three control figures are generated for each channel of
the input data. They allow to confirm the successful application of the normlization
method but also to detect possible anomalies, that can be then treated with the options
described in Section \ref{sec:trouble}. This feature runs by default and can be
deactivated by setting  \verb+figures.show = FALSE+ in \verb+NormiR+.\\

The three control figures generated for the Hy3 channel of the Exiqon example from
Subsections \ref{subsec:exidata} and \ref{subsec:exinorm} are briefly described
hereafter. For more details see \cite{bibi}.

\begin{figure}[t]
\begin{center}
\fbox{\includegraphics{fig1}}
\caption{Correction of the spike-in probeset intensities (Hy3 channel)}
\label{fig1}
\end{center}
\end{figure}


\begin{description}

\item[Correction of the spike-in probeset intensities] The four pannels in Figure
\ref{fig1} show the successive steps in removing the array-dependent biases from the
spike-in probeset intensities. A meaningful application of \verb+NormiR+ indeed
requires that the spike-in probeset intensities display coherent deviations across the
arrays of the experiment. Such a behavior manifests itself by roughly parallel curves
on the upper-left pannel and by collapsing ones on the upper-right pannel. The
normalization correction consists first in subtracting this common variance (lower-left
pannel) and second in transforming back to the original intensity range (lower-right
pannel). Correcting the curves is proven efficient when the final ones appear
'straighter' than the initial ones. 

\begin{figure}[t]
\begin{center}
\fbox{\includegraphics{fig2}}
\caption{Performance of the spike-in probeset intensity correction (Hy3 channel)}
\label{fig2}
\end{center}
\end{figure}

\item[Performance of the spike-in probeset intensity correction] Figure \ref{fig2}
contains two measures for quantitatively assessing the performance of the spike-in
probeset intensity correction used by \verb+NormiR+. The upper pannel shows a heatmap
of the Pearson correlations between the array-dependent raw intensities of the spike-in
probesets, i.e. between the curves displayed on the upper-left pannel of Figure
\ref{fig1}. If the values are globally larger than 0.5, then the array-dependent biases
are sufficiently coherent and applying \verb+NormiR+ is justified. The lower pannel
displays the variance ratio of the spike-in probesets intensities before and after the
correction. They correspond to the curves in the upper-left and lower-right pannels of
Figure \ref{fig1}. If these ratios are sufficiently low, then the \verb+NormiR+
approach was efficient. 

\begin{figure}[t]
\begin{center}
\fbox{\includegraphics{fig3}}
\caption{Intensity-dependent correction functions (Hy3 channel)}
\label{fig3}
\end{center}
\end{figure}

\item[Intensity-dependent correction functions for all probes] Figure \ref{fig3}
displays the intensity and array dependent correction functions that \verb+NormiR+
applies to all miRNA probes to perform the normalization. It is constucted based on the
spike-in probe corrections, already shown on Figures \ref{fig1} and \ref{fig2}. Several
requirements are necessary to ensure a stable coverage of the whole range of probe
intensities measured on the array. \emph{ExiMiR} automatically performs checks to
prevent critical situations where its meaningful application is not guaranteed.
Sometimes the constructed correction functions do not look good, even if \verb+NormiR+
ran smoothly. Dealing with such situations is also described in Section
\ref{sec:trouble} below.

\end{description}  


\section{Troubleshooting and fine-tuning options}
\label{sec:trouble}

This section describes possible problems you may encounter when applying
\emph{ExiMiR} to your own data, see for instance Subsection \ref{subsec:affynorm}. It
will help you understanding their origin and deciding whether to still use
\emph{ExiMiR} (with different parameters) or to choose another normalization
method like median normalization.  


\subsection{Possible problems}
\label{subsec:problems}

The application of \emph{ExiMiR} fundamentally assumes that the spike-in probes capture
the greatest part of the between-array technical variability in the miRNA expression data. This
is normally the case when the processing of the RNA samples prior to the addition of
the spike-in RNAs is suitably standardized and controlled. If this condition is
satisfied, then \emph{ExiMiR} requires three features from the spike-in control probes
to be meaningfully applied, see \cite{bibi}. These features are automatically tested by
the software. In case of failure, median normalization is used instead of spike-in
probe-based method. However the threshold values used in these tests can be changed to
force the application of the spike-in probe-based method. Its consequences can then be 
investigated on the control figures described in Subsection \ref{subsec:figs} to decide
whether the application of \emph{ExiMiR} was justified or not. Other problems like
annotation conflicts are also supported by \emph{ExiMiR}. \\

Here is the list of the problematic situations covered by \emph{ExiMiR}, arranged by
potential order of appearence. 

\begin{description} \item[Incompatibility between GAL and TXT files] If the array
annotation contained in the GAL file is not compatible with the one contained in the
TXT files, or if there is no GAL file available, then \verb+ReadExi+ directly generates
a default \verb+galenv+ environment from the annotation contained in the TXT files. 

\item[Insufficient coherence between spike-in probesets] If the raw intensities of the
spike-in probesets are not sufficiently coherent across the arrays of the experiment,
i.e. if the mean of the off-diagonal elements of the Pearson correlation matrix shown
on the upper pannel of Figure \ref{fig2} is smaller than 0.5, then a median
normalization is applied. The value can be changed by using the \verb+min.corr+ of
\verb+NormiR+. 

\item[Specificity of the spike-in probeset intensities] If the spike-in probeset
intensities are not specific, i.e. if the intensity ranges covered by the probes
mapping to the same probesets are too large, then computing the intensity-dependent
correction functions from Figure \ref{fig3} becomes problematic. The
intensity-independent median normalization is preferred in this case. The \verb+NormiR+
option \verb+max.log2span+ can be changed to allow for probeset intensity ranges larger
than the default value 1.

\item[Insufficient coverage of the probe intensity range] If the range
[$\sim$6,$\sim$16] of all array probe intensities is not appropriately covered by the
spike-in probe intensities, then computing the intensity dependence of the correction
functions from Figure \ref{fig3} becomes unstable. The \verb+NormiR+ option
\verb+cover.int+ tests the size of the largest intensity interval between two
consecutive spikes. Its default value is 1/3. The \verb+NormiR+ option
\verb+cover.ext+ tests the minimal ratio between the intensity range covered by the
spike-in probes and the one covered by all probes on the array. Its default value is
1/2. These two values can be changed but an eye must be kept on their consequences on
the correction functions from Figure \ref{fig3}, since the latter are not explicitly
tested by \emph{ExiMiR}. The \verb+NormiR+ options for computing these correction
functions are explained in Subsection \ref{subsec:opt} below.

\end{description}

\subsection{\emph{NormiR} options for computing the correction functions}
\label{subsec:opt}

The results for the spike-in probe-based correction functions displayed on Figure
\ref{fig3} are not tested automatically by \emph{ExiMiR} and might not be entirely
satisfactory. This might be due to mutliple reasons, ranging from inhomogeneous
affinities across the spike-in probesets to an inappropriate coverage of the probe
intensity range. \emph{ExiMiR} offers the possibility of fine-tuning the
parameters used by \verb+NormiR+ to improve the stability of the correction
functions.     

\begin{description} 

\item[Overall LOESS smoothing] If the correction functions 'wiggle' too much, the
\verb+NormiR+ option \verb+loess.span+ can be set to higher values to better smooth the
resulting curves. By default, it takes the value 5/(number of spike-in probesets), e.g.
5/10 in the Exiqon case. In the extreme cases of values close to 1, the intensity
dependence of the correction is lost and the results become very similar to a mean or a
median normalization.  

\item[Low-intensity stabilization] If one correction function change its sign in the
low intensity range, then an inclusion into the LOESS smoothing of a zero value at the
intensity minimum will prevent this feature. Set the \verb+NormiR+ option
\verb+force.zero+ to \verb+TRUE+  to activate this functionality. 

\item[High-intensity extrapolation] It often occurs that the largest spike-in probeset
intensities are lower than the largest probe intensities on the array. In this case
\verb+NormiR+ needs to include extrapolated values into the LOESS smoothing in order to
compute the correction functions in the high-intensity range. Fortunately this step is
quite stable thanks to the fact that high intensity values are less noisy. By default
\verb+NormiR+ uses the mean of the correction values of two spike-in probesets with the
largest intensities. The option \verb+extrap.points+ allows to change the number of
spike-in probesets used in the extrapolation and \verb+extrap.method+ determines the
extrapolation method.

\end{description}

\section{Concrete example with provided data}
\label{sec:example}
\emph{ExiMiR} provides datasets that allows one to test the functions described in this vignette. The test data are in the \verb+R+ objects obtained as 
described in Section \ref{sec:data}. They can be used as follows,  which reproduces the commands explained in Section \ref{sec:norm}.

Load the \emph{ExiMiR} package, the GAL environment and the AffyBatch objects corresponding to the data described in Section \ref{subsec:exidata}:
<<>>=
library(ExiMiR)
data(galenv)
data(GSE20122)
@
Apply the RMA quantile normalization on the AffyBatch object \verb+GSE20122+, using the \verb+rma+ option \verb+background=FALSE+ as recommanded by a recent study\cite{lopez}.
This creates the ExpressionSet object \verb+eset.rma+ containing the normalized data:
<<>>=
eset.rma <- rma(GSE20122, background=FALSE)
@
The spike-in probe-based normalization implemented in \emph{ExiMiR} is applied as follows:
<<>>=
eset.spike <- NormiR(GSE20122, figures.show=FALSE)
@
To obtain the same control figures as the ones displayed in Section \ref{subsec:figs}, use the \verb+NormiR+ option \verb+figures.show=TRUE+.

\newpage
\begin{thebibliography}{2}
\bibitem{bibi}
Sewer A et \textit{al}., to be published.

\bibitem{sarkar}
Sarkar D et \textit{al}., Quality assessment and data analysis for miRNA expression
arrays, Nucleic Acids Res. 2009 Feb;37(2):e17.

\bibitem{lopez}
L\'{o}pez-Romero P et \textit{al}., Procession of Agilent microRNA array data,
BMC Research Notes 2010, \textbf{3}:18.
\end{thebibliography}


\newpage
\appendix{}
\section{Content of the file ''sampleinfo.txt''}
\label{sec:app}
\scriptsize{
\begin{Sinput}
Names	Hy3	Hy5
1	GSM503402_Hy3_Exiqon_14114402_S01_Cropped.txt	GSM503402_Hy5_Exiqon_14114402_S01_Cropped.txt
2	GSM503403_Hy3_Exiqon_14114403_S01_Cropped.txt	GSM503403_Hy5_Exiqon_14114403_S01_Cropped.txt
3	GSM503404_Hy3_Exiqon_14114404_S01_Cropped.txt	GSM503404_Hy5_Exiqon_14114404_S01_Cropped.txt
4	GSM503405_Hy3_Exiqon_14114405_S01_Cropped.txt	GSM503405_Hy5_Exiqon_14114405_S01_Cropped.txt
5	GSM503406_Hy3_Exiqon_14114406_S01_Cropped.txt	GSM503406_Hy5_Exiqon_14114406_S01_Cropped.txt
6	GSM503407_Hy3_Exiqon_14114407_S01_Cropped.txt	GSM503407_Hy5_Exiqon_14114407_S01_Cropped.txt
7	GSM503408_Hy3_Exiqon_14114408_S01_Cropped.txt	GSM503408_Hy5_Exiqon_14114408_S01_Cropped.txt
8	GSM503409_Hy3_Exiqon_14114409_S01_Cropped.txt	GSM503409_Hy5_Exiqon_14114409_S01_Cropped.txt
9	GSM503410_Hy3_Exiqon_14114410_S01_Cropped.txt	GSM503410_Hy5_Exiqon_14114410_S01_Cropped.txt
10	GSM503411_Hy3_Exiqon_14114411_S01_Cropped.txt	GSM503411_Hy5_Exiqon_14114411_S01_Cropped.txt
11	GSM503412_Hy3_Exiqon_14114412_S01_Cropped.txt	GSM503412_Hy5_Exiqon_14114412_S01_Cropped.txt
12	GSM503413_Hy3_Exiqon_14114413_S01_Cropped.txt	GSM503413_Hy5_Exiqon_14114413_S01_Cropped.txt
\end{Sinput}
}

\end{document}