% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{affycomp primer}
%\VignetteKeywords{Preprocessing, Affymetrix}
%\VignetteDepends{Biobase}
%\VignetteDepends{affycompData}
%\VignettePackage{affycomp}
%documentclass[12pt, a4paper]{article}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\author{Rafael Irizarry and Leslie Cope}
\begin{document}
\title{Bioconductor Expression Assessment Tool for Affymetrix Oligonucleotide Arrays (affycomp)}

\maketitle
\tableofcontents
\section{Introduction}
<<echo=F,results=hide>>=
library(affycomp)
library(affycompData)
@

The defining feature of 
oligonucleotide expression arrays is the use of several probes to
assay each targeted transcript.  This is a bonanza for the statistical
geneticist, offering a great opportunity to create probeset summaries with
specific characteristics.  There are now several methods available
for summarizing probe level data from the popular Affymetrix
GeneChips.  It is harder to identify the method best suited to a given
inquiry.  This package provides a {\it graphical tool} for summaries of
Affymetrix probe level data.  It is the engine behind our webtool
\url{http://affycomp.jhsph.edu/}.  Plots and summary statistics offer a
picture of how an expression measure performs in several important
areas.  This picture facilitates the comparison of competing
expression measures and the selection of methods suitable for a
specific investigation.

The key is a benchmark dataset consisting of
a dilution study and a spike-in study.  Because the {\it truth} is 
known for this data, it is possible to identify statistical features
of the data for which the expected outcome is known in advance.
Those features highlighted in our suite of graphs are justified by
questions of biological interest, and motivated by the presence of
appropriate data. 

\subsection{Benchmark dataset}

The spike-in benchmark data was originally free and easily available
to the public, from Affymetrix.  Copies in compressed-tar format may
now be obtained here:
\url{http://www.biostat.jhsph.edu/~ririzarr/affycomp/spikein.tgz}
and
\url{http://www.biostat.jhsph.edu/~ririzarr/affycomp/hgu133spikein.tgz}.
The dilution benchmark data was originally also free, although not as
easily available, from Gene Logic.

For a full description of the benchmark data, see our {\it RMA} papers
"Exploration, normalization, and summaries of high density
oligonucleotide array probe level data", Biostatistics, 2003
Apr;4(2):249-64 (\url{http://www.ncbi.nlm.nih.gov/pubmed/12925520})
and
"Summaries of Affymetrix GeneChip probe level data", Nucleic Acids Res,
2003 February 15; 31(4): e15
(\url{http://www.ncbi.nlm.nih.gov/pmc/articles/PMC150247/}).

In the Gene Logic dilution study,
(\url{http://www.genelogic.com/support/scientific-studies/}),
two sources of cRNA, human liver tissue and central nervous system cell
line (CNS), were hybridized to human arrays (HG-U95Av2) in a range of
dilutions and proportions.  You must contact Gene Logic
({info@genelogic.com} or +1-800-GENELOGIC) to obtain the dilution data.

In the Affymetrix spike-in study, different cRNA fragments were added
to the hybridization mixture of the arrays at different picoMolar
concentrations. The cRNAs were spiked-in at a different concentration
on each array (apart from replicates) arranged in a cyclic Latin square
design with each concentration appearing once in each row and column.
All arrays had a common background cRNA.

\subsection{phenodata}

The package includes phenoData objects that give more details,
\verb+dilution.phenodata+ and \verb+spikein.phenodata+ (also,
\verb+hgu133a.spikein.phenodata+).

\section{What's new in version 1.2?}

A new set of assessments is provided by the function
\verb+assessSpikeIn2+, which is a wrapper for \verb+assessMA2+,
\verb+assessSpikeInSD+, and \verb+assessLS+.  It will accept a
full \verb+ExpressionSet+ of the spikein data (59 columns), as
with \verb+assessSpikeIn+, but in this case only uses columns
1:13, 17, 21:33, and 37.  Otherwise, it assumes that it has
expression measures only for those particular 28 arrays.
 
All spike-in related assessments now support the HGU133A chip,
in addition to the HGU95A chip handled by version 1.1.  The
function \verb+read.newspikein+, which is simply a call to
\verb+read.spikein+ with {\tt cdfName = "hgu133a"}, will read
HGU133A expression measures.

\section{The Image Report}

Given a file named \verb+dilfilename.csv+ containing your dilution
expression measures and a file named \verb+sifilename.csv+ containing your
spikein expression measures, you can easily obtain the graphs and summary
statistics in an image report (illustrated using RMA):

\begin{Sinput}
R> library(affycomp) ##load the package
R> d <- read.dilution("dilfilename.csv")
R> s <- read.spikein("sifilename.csv")
R> rma.assessment <- affycomp(d, s, method.name="RMA")
\end{Sinput}

\verb+affycomp+ is a wrapper for \verb+assessAll+ and \verb+affycompTable+.
The return value will contain all the information, from \verb+assessAll+,
necessary to recreate the graphs (below) without having to re-do the assessment.
For example, the following two objects were created by \verb+assessAll+ on
\verb+ExpressionSet+s created by {\it MAS 5.0} and RMA, respectively; they
are lists of lists.

<<>>=
data(mas5.assessment)
data(rma.assessment)
@ 

<<>>=
names(mas5.assessment)
@ 

Each component is the result of a specific assessment. The names tell
us what they are for. \verb+Dilution+ are the assessment based on the
dilution data and can be used to create Figures 2, 3, and
4b. \verb+MA+ has the necessary information for the MA plot or Figure
1. \verb+Signal+ has the necessary information to create Figure
4a. \verb+FC+ has assessments related to fold change and can be used
to create Figures 5a, 6a, and 6b. Finally, \verb+FC2+ has the necessary
information to create Figure 5b. The captions for these Figures will
give you an idea of what they are for.

There are two kinds of plots, basic and comparative.  The basic plots
depict a given expression measure.  In the comparative plots, the given
expression measure is compared to other measures, MAS 5.0 by default.
Tables are also automatically created with assessment statistics.  Finally,
a simple assessment of standard error estimates can be done.  These are all
described in the following subsections.

\subsection{Basic plots}

You can use \verb+affycompPlot+ which will automatically know what to
do, or you can use the auxiliary figure functions that will need to
have a specific assessment list.

% HJ - 1/13/2011
% To control their placement, which was wrong, all figures are now
% done in-line, each on its own forced newpage.  The original figure
% structure is left, commented out, in case someone can make it work.

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
affycompPlot(mas5.assessment$MA)
@
\end{center}
{Figure 1: The MA plot shows log fold change as a function of
    mean log expression level.  A set of 14 arrays representing a single
    experiment from the Affymetrix spike-in data are used for this plot.
    A total of 13 sets of fold changes are generated by comparing the
    first array in the set to each of the others. Genes are
    symbolized by numbers representing the nominal $\log_2$ fold change
    for the gene.  Non-differentially expressed genes with observed fold
    changes larger than 2 are plotted in red. All other probesets are
    represented with black dots.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
affycomp.figure2(mas5.assessment$Dilution)
@
\end{center}
  {Figure 2: For each gene, and each experimental condition, we
  calculate the mean log expression and the observed standard
  deviation across 5 replicates. The resulting scatterplot is smoothed
  to generate a single curve representing mean standard deviation as
  a function of mean log expression.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
affycomp.figure3(mas5.assessment$Dilution)
@
\end{center}
{Figure 3: This plot, using the Gene Logic dilution data,
  shows the sensitivity of fold change calculations to total RNA
  abundance.  Average log fold-changes between liver and CNS for the
  lowest concentration and the   highest in the dilution data set are
  computed.  Orange and red color is used to denote genes with
  $M_{6g}-M_{1g}$ bigger than $\log_2(2)$ and $\log_2(3)$
  respectively. The rest are denoted with black.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1))
affycomp.figure4a(mas5.assessment$Signal)
affycomp.figure4b(mas5.assessment$Dilution)
@
\end{center}
{Figure 4: a) Average observed $log_2$ intensity plotted
    against nominal $log_2$ concentration for each spiked-in gene for
    all arrays in  Affymetrix spike-In experiment. b) For the Gene Logic
    dilution data, log expression values are regressed against their
    log nominal concentration. The slope estimates are plotted against
    average log intensity across all concentrations.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1))
affycomp.figure5a(mas5.assessment$FC)
affycomp.figure5b(mas5.assessment$FC)
@
\end{center}
  {Figure 5: A typical identification rule for differential expression
      filters genes with fold change exceeding a given threshold.
      This figure shows average ROC curves  which offer a graphical
      representation of both specificity and sensitivity for such a
      detection rule. a) Average ROC curves based on comparisons with
      nominal fold changes ranging from 2 to 4096. b) As a) but with
      nominal fold changes equal to 2.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1))
affycomp.figure6a(mas5.assessment$FC)
affycomp.figure6b(mas5.assessment$FC)
@
\end{center}
  {Figure 6: a) Observed log fold changes plotted against
    nominal log fold changes. The dashed lines represent highest, 25th
    highest, 100th  highest, 25th percentile, 75th percentile,
    smallest 100th, smallest 25th, and smallest log fold change for
    the genes that were not differentially expressed. b) Like a) but
    the observed fold changes were calculated for spiked in genes with
    nominal concentrations no higher than 2pM.}
%\end{figure}

\subsection{Comparative plots}

You can use \verb+affycompPlot+ which will automatically know what to
do, or you can use the auxiliary figure functions that will need to
have a specific assessment list.

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1))
affycompPlot(mas5.assessment$MA, rma.assessment$MA)
@
\end{center}
{Figure 1: The MA plot shows log fold change as a function of
    mean log expression level.  A set of 14 arrays representing a single
    experiment from the Affymetrix spike-in data are used for this plot.
    A total of 13 sets of fold changes are generated by comparing the
    first array in the set to each of the others. Genes are
    symbolized by numbers representing the nominal $\log_2$ fold change
    for the gene.  Non-differentially expressed genes with observed fold
    changes larger than 2 are plotted in red. All other probesets are
    represented with black dots.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
affycomp.compfig2(list(mas5.assessment$Dilution, rma.assessment$Dilution),
                  method.names=c("MAS 5.0","RMA"))
@
\end{center}
  {Figure 2:   For each gene, and each experimental condition,
  we calculate the mean log expression and the observed standard
  deviation across 5 replicates. The resulting scatterplot is
  smoothed to generate a single curve representing mean standard
  deviation as a function of mean log expression.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
affycomp.compfig3(list(mas5.assessment$Dilution, rma.assessment$Dilution),
                  method.names=c("MAS 5.0","RMA"))
@
\end{center}
{Figure 3: This plot, using the Gene Logic dilution data,
  shows the sensitivity of fold change calculations to total RNA
  abundance.  Average log fold-changes between liver and CNS for the
  lowest concentration and the highest in the dilution data set are
  computed.  Orange and red color is used to denote genes with
  $M_{6g}-M_{1g}$ bigger than $\log_2(2)$ and $\log_2(3)$
  respectively. The rest are denoted with black.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1))
affycomp.compfig4a(list(mas5.assessment$Signal, rma.assessment$Signal),
                  method.names=c("MAS 5.0","RMA"))
affycomp.compfig4b(list(mas5.assessment$Dilution, rma.assessment$Dilution),
                  method.names=c("MAS 5.0","RMA"))
@
\end{center}
  {Figure 4: a) Average observed $log_2$ intensity plotted
    against nominal $log_2$ concentration for each spiked-in gene for
    all arrays in  Affymetrix spike-In experiment. b) For the Gene Logic
    dilution data, log expression values are regressed against their
    log nominal concentration. The slope estimates are plotted against
    average log intensity across all concentrations.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1))
affycomp.compfig5a(list(mas5.assessment$FC, rma.assessment$FC),
                  method.names=c("MAS 5.0","RMA"))
affycomp.compfig5b(list(mas5.assessment$FC2, rma.assessment$FC2),
                  method.names=c("MAS 5.0","RMA"))
@
\end{center}
  {Figure 5: A typical identification rule for differential expression
      filters genes with fold change exceeding a given threshold.
      This figure shows average ROC curves  which offer a graphical
      representation of both specificity and sensitivity for such a
      detection rule. a) Average ROC curves based on comparisons with
      nominal fold changes ranging from 2 to 4096. b) As a) but with
      nominal fold changes equal to 2.}
%\end{figure}

\newpage
%\begin{figure}[htbp]
\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,2))
affycomp.figure6a(mas5.assessment$FC, main="a) MAS 5.0", ylim=c(-12,12))
affycomp.figure6a(rma.assessment$FC, main="a) RMA", ylim=c(-12,12))
affycomp.figure6b(mas5.assessment$FC, main="b) MAS 5.0", ylim=c(-6,6))
affycomp.figure6b(rma.assessment$FC, main="b) RMA", ylim=c(-6,6))
@
\end{center}
  {Figure 6: a) Observed log fold changes plotted against
    nominal log fold changes. The dashed lines represent highest, 25th
    highest, 100th  highest, 25th percentile, 75th percentile,
    smallest 100th, smallest 25th, and smallest log fold change for
    the genes that were not differentially expressed. b) Like a) but
    the observed fold changes were calculated for spiked in genes with
    nominal concentrations no higher than 2pM.}
%\end{figure}

\subsection{Tables}

The function \verb+tableAll+ returns a matrix with assessment
statistics. Once the assessment function is run, all one needs to
type is 

<<>>=
data(rma.assessment)
data(mas5.assessment)
tableAll(rma.assessment, mas5.assessment)
@ 

\newpage

The function \verb+affycompTable+ makes a minimal table (that is also
informative).

<<>>=
affycompTable(rma.assessment, mas5.assessment)
@ 

\subsection{Standard deviation assessment}

The package also contains a simple tool to assess standard error
estimates. For this to work the \verb+ExpressionSet+ object used for the
assessment must have standard error estimates for the dilution data.
We include two examples.

<<>>=
data(rma.sd.assessment)
data(lw.sd.assessment)
tableAll(rma.sd.assessment, lw.sd.assessment)
@ 

For the SD assessment, there is also a comparison plot in addition to
the basic plot.  See \verb+affycompPlot+ (and \verb+affycomp.compfig7+
or \verb+affycomp.figure7+).

\newpage
%\begin{figure}[htbp!]
\begin{center}
<<fig=TRUE>>=
affycompPlot(lw.sd.assessment, rma.sd.assessment)
@
\end{center}
{Figure 7: Using the replicates from the dilution data, we calculate the
  mean predicted variance for each gene, tissue, and dilution by squaring
  the estimated standard error.  The usual sample variances  $s^2_{tdg}
  = \sum_r (y_{tdrg} - y_{td\cdot g})^2/4$ are calculated as well.
  These boxplots are of the log ratios of the predicted and observed variance.
}
%\end{figure}

\newpage
%\begin{figure}[htbp!]
\begin{center}
<<fig=TRUE>>=
affycompPlot(lw.sd.assessment)
@
\end{center}
{Figure 7: Using the replicates from the dilution data, we calculate the
  mean predicted variance for each gene, tissue, and dilution by squaring
  the estimated standard error.  The usual sample variances  $s^2_{tdg}
  = \sum_r (y_{tdrg} - y_{td\cdot g})^2/4$ are calculated as well.
  A scatterplot of the log ratios of the predicted and observed variance,
  against mean log expression.
}
%\end{figure}

\section{The end}
Enjoy!
\end{document}