%\VignetteIndexEntry{codelink}
%\VignetteKeywords{Preprocessing, Codelink}
%\VignetteDepends{codelink}
%\VignettePackage{codelink}
\documentclass{article}
\usepackage[latin1]{inputenc}
\usepackage[english]{babel}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\texttt{#1}}}
\begin{document}
\title{Codelink}
\author{Diego Diez}
\maketitle
\section{Introduction}
Codelink is a platform for the analysis of gene expression on biological
samples property of Applied Microarrays, Inc. (previously was GE Healthcare and
Amersham). The hybridization reagents are still supplied by GE Healthcare.
The system uses 30 base long oligonucleotide probes for expression testing.
There is a proprietary software for reading scanned images, doing spot
intensity quantization and some diagnostics. The software assigns quality flags
(see Table~\ref{tab:Flag}) to each spot on the basis of a signal to noise ratio
(SNR) computation (Eq: \ref{eq:SNR}) and other morphological characteristics as
irregular shape of the spots, saturation of the signal or manufacturer spots
removed. By default, the software performs background correction (subtract)
followed by median normalization. The results can be exported in several
formats as XML, Excel, plain text, etc. This library allows to read Codelink
plain text exported data into R \cite{r} for the analysis of gene expression
with any of the available tools in R+Bioconductor\cite{BIOC}.
For storing the available data, a new class \Robject{Codelink} was designed.
The now defunt \Robject{exprsSet} in the \Rpackage{Biobase} package was not a
convenient store class for Codelink data, because it didn't allow probes with
duplicated names, as those names are stored as rownames in the expression
matrix.
Currently there is experimental support for a \Robject{ExpressionSet} derived
class that can accomodate duplicated probe names in an
\Robject{AnnotatedDataFrame} whereas the feature ids would be used for row
names when available.
\begin{table}[ht]
\begin{center}
\begin{tabular}{|c|l|}
\hline \em Flag & \em Description \\
\hline
G & Good signal (SNR >= 1)\\
L & Limit signal (SNR < 1)\\
I & Irregular shape \\
S & Saturated signal \\
M & MSR spot \\
C & Background contaminated\\
X & User excluded spots\\
\hline
\end{tabular}
\caption{Quality Flag description. SNR: Signal to Noise Ratio.}
\label{tab:Flag}
\end{center}
\end{table}
\begin{equation}
SNR=\frac{Smean}{(Bmedian + 1.5 * Bstdev)}
\label{eq:SNR}
\end{equation}
\begin{table}[ht]
\begin{center}
\begin{tabular}{|r|l|}
\hline \em Probe type & \em Description \\
\hline
DISCOVERY & Gene expression testing probes \\
POSITIVE & Positive control probes \\
NEGATIVE & Negative control probes \\
FIDUCIAL & Grid alignment probes \\
OTHER & Other controls and housekeeping gene probes \\
\hline
\end{tabular}
\caption{Probe types for Codelink arrays.}
\label{tab:Type}
\end{center}
\end{table}
\section{Reading data}
Currently only data exported as plain text from Codelink software is supported.
Unfortunately the Codelink exported text format can have arbitrary columns and
header fields so depending of what you have exported you can read it or not.
The suggestion is that you put on the files everything you can, including
Spot\_mean and Bkgd\_median values so you can do background correction and
normalization in R. In addition, Bkgd\_stdev is needed to compute the SNR. If
you put Raw\_intensity or Normalized\_intensity columns then you can also read
it directly and avoid background correction and/or normalization but this is
not recommended. To read some Codelink files you do:
<>=
# NOT RUN #
library(codelink)
foo <- readCodelink()
summaryFlag(foo) # will show a summary of flag values.
# NOT RUN #
@
This suppose that your files have the extension ``TXT'' (uppercase)
and that they are in your working directory. If this is not the case you can
specify the files to be read with the 'file' argument. The function
\Rfunction{readCodelink} returns and object of \Robject{Codelink} similar to
that:
<>=
data(codelink.example)
codelink.example
@
\begin{table}[ht]
\begin{center}
\begin{tabular}{|r|l|}
\hline \em Slot & \em Description \\
\hline
product & Chip name description \\
sample & Sample names vector \\
file & File names vector \\
name & Probe names vector \\
type & Probe types vector \\
method & Methods applied to data \\
method\$background & Background correction method used \\
method\$normalization & Normalization method used \\
method\$merge & Merge method used \\
method\$log & Logical: If data is in log scale \\
flag & Quality flag matrix \\
Smean & Mean signal intensity matrix \\
Bmedian & Median background intensity matrix \\
Ri & Raw intensity matrix \\
Ni & Normalized intensity matrix \\
snr & Signal to Noise Ratio matrix \\
cv & Coefficient of Variation matrix \\
\hline
\end{tabular}
\caption{Description of Codelink object slots.}
\label{tab:Slots}
\end{center}
\end{table}
The \Robject{Codelink-class} is basically a list that stores information in
several slots (see Table~\ref{tab:Slots}). The chip type (product slot) is read
from the PRODUCT field if found in the header of Codelink files. If it is not
found then a warning message is shown and product slot is set to "Unknown". If
the product is not the same in all the files the reading is canceled with an
error message.
By default, all spots flagged with M, I, and S flags are set to NA. This can be
controlled with the flag argument of \Rfunction{readCodelink}. The flag
argument is a list that can contain a valid flag identifier and a value for
that flag. For example, if you want to set all M flagged spots to 0.01 and let
other spot untouched you do:
<>=
foo <- readCodelink(flag = list(M = 0.01) )
@
It is possible to find probes wit more that one flag assigned, i.e. CL for a
probe labeled as C and L, CLI for a probe labeled as C, L and I, and so on. A a
regular expression is used to find flag types in an attempt to to manage all
the possible situations. When two user modified flags fall in the same probe
the more restricting (NA being the most) is assigned.
\section{Background correction}
If you have Spot\_mean values Bkgd\_median values the you can apply one of the
several background correction methos interfaced. This is done by the function
\Rfunction{bkgdCorrect}. To see the different options look at ?bkgdCorrect. For
instance, if you want to apply \emph{half} method you do:
<>=
foo <- bkgdCorrect(foo, method = "half")
@
The default method used is \emph{half} and is based in the same method applied
in the \Rpackage{limma} \cite{limma} package to two channel microarrays. In
this method, the median background intensity (Bmedian) is subtracted from mean
spot intensity (Smean) and any value smaller that 0.5 is shift to 0.5 to ensure
no negative numbers are obtained that would prevent to transform the data into
log scale. Other available methods are \emph{none} that let the spot
intensities untouched, \emph{subtract} that is analog to the default method
used in the Codelink software and \emph{normexp} and interface to the method
available in the \Rpackage{limma} package.
\section{Normalization}
Normalization of the background corrected intensities is done by the wrapper
function \Rfunction{normalize}. The default method is \emph{quantile}
normalization that in fact call \Rfunction{normalizeQuantiles()} from
\Rpackage{limma} package (allowing for NAs). There is also the possibility to
use a modified version of CyclicLoess from \Rpackage{affy} \cite{affy} package
that allow using weights and missing values. Finally, the \emph{median}
normalization allows to normalize using a method analog to the default method
in the Codelink software. To normalize you usually do:
<>=
foo <- normalize(foo, method = "quantiles")
@
By default, \Rfunction{normalize} return log2 intensity values. This could be
controlled setting the parameter log.it to FALSE.
\section{Plotting}
There are some diagnostic plots available for the \Robject{Codelink} object.
These are functions for producing MA plots (\Rfunction{plotMA}), scatterplots
(\Rfunction{plotCorrelation}) and density plots (\Rfunction{plotDensities}).
All functions use the available intensity value (i.e. Smean, Ri or Ni) to make
the plot.
The functions \Rfunction{plotMA} and \Rfunction{plotCorrelation} can highlight
points based on the Spot Type, which is the default behavior or using the SNR
values. The mode is controlled with argument \emph{label}.
\Rfunction{plotCorrelation} requires arguments \emph{array1} and \emph{array2}
to be set in order to select which arrays are going to be plotted. For
\Rfunction{plotMA} if only \emph{array1} is specified, the values are plotted
againts a pseudoarray constructed with the mean of the probe intensities along
all available arrays. M and A values are computed following equations
\ref{eq:M} and \ref{eq:A}.
\begin{equation}
M=Array2-Array1
\label{eq:M}
\end{equation}
\begin{equation}
A=\frac{Array2+Array1}{2}
\label{eq:A}
\end{equation}
<>=
plotMA(codelink.example)
@
The function \Rfunction{plotDensities} plot the density of intensity values of
all arrays. It can plot only a subset of arrays if the \emph{subset} argument
is supplied.
<>=
plotDensities(codelink.example)
@
When Logical\_row and Logical\_col columns are exported into they are stored
into the \emph{logical} slot. This information stores the physical location of
each probe in the array, and can be used to plot pseudo images of the array
intensities. To plot a pseudo image you should use:
<>=
imageCodelink(foo)
imageCodelink(foo, what = "snr")
@
It is possible to plot the background intensities (default), the spot mean,
raw and normalized intensities and the SNR values. This images are useful to
identify spatial artifact that may be affecting the analysis.
\section{Miscellaneous}
There are also some miscellaneous functions used in some analysis that could be
useful for someone.
\subsection{Export to file}
The function \Rfunction{writeCodelink} exports a \Robject{Codelink} object to a file. The file will contain probe intensities and, if specified flag = TRUE, probe quality flags.
\subsection{Using weights}
The \Rfunction{createWeights} function creates a matrix of weights based on
probe type labels to be used, for example, when fitting a linear model with
\Rpackage{limma} \cite{limma}.
<>=
w <- createWeights(codelink.example, type = list(FIDUCIAL = 0.01, NEGATIVE =
0.1))
w[1:10,]
@
\subsection{Merging arrays}
In case you want to merge array intensities the \Rfunction{mergeArray} function
help on this task. It computes the mean of Ni values on arrays of the same
class. The grouping is done by means of the \emph{class} argument (numerical
vector of classes). New sample names should be assigned to the sample slot
using the \emph{names} argument. The function also returns the coefficient of
variation in the \emph{cv} slot. The distribution of coefficients of variations
can be checked with the function \Rfunction{plotCV}.
<>=
foo <- mergeArray(foo, class = c(1, 1, 2, 2), names = c("A", "B"))
plotCV(foo)
@
\section{Problems reading data}
Some updated version of the Codelink software changed the order in which probes
are printed in the exported text files. That makes files from these different
software version impossible to be analyzed together. The function
\Rfunction{readCodelink} have a new argument \emph{check} TRUE by default, that
check the Probe\_name columns to see if they have the same order in all the
arrays at the cost is a little extra loading time. This behaviour can be turn
off by setting \emph{check} to FALSE. If an ordering problem is found, a
warning message is print but the reading of data is not stopped, allowing the
visual examination of the data. In this case, the best option is to export the
old files again using the updated version of the software. If this is
impossible for whatever reason, the codelink package can try to fix the order
of the files.
To the aim, a new argument \emph{fix = TRUE} can be passed to
\Rfunction{readCodelink}. The method will try to order the probes using the
Feature\_id column, which is a combination of Logical\_row and Logical\_col and
for this, a unique identifier of each probe. This is the optimal fix, and the
new data should be perfectly ordered. If that column is missing, the it will
try to order the data using the Probe\_name column. This is a sub-optimal
solution because as the fiducial, control and some discovery probes have
duplicated Probe\_name, they may be end messed up. In this case, the best
solution again, is to try to export again the data with the updated version of
the software.
<>=
foo <- readCodelink(fix = TRUE)
@
\section{Future improvements}
As the new classes in the Biobase package (\Robject{eSet} and
\Robject{ExpressionSet}) allow more flexible data structures, a
reimplementation of the \Robject{Codelink-class} based on
\Robject{ExpressionSet-class} will be the next major feature added to this
package. This will allow a better integration with other tools available
through the Bioconductor project.
Right now, there is some experimental implementation of the new codebase. You
need to use the wrapper function \Rfunction{readCodelinkSet}.
<>=
foo <- readCodelinkSet()
@
\bibliographystyle{plain}
\bibliography{codelink}
\end{document}