% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{splicegear Introduction}
%\VignetteDepends{splicegear, Biobase}
%\VignetteKeywords{Expression Analysis}
%\VignettePackage{splicegear}
\documentclass[12pt]{article}

\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\bibliographystyle{plainnat}

\title{Splicegear package}

\begin{document}

\maketitle

\section*{Introducing splicegear}

Microarrays have become an established technique for the analysis of gene expression.
With the advances in numerical processing of the data, the lowering of the
costs and well defined experimental protocols, the reliability of data analysis
has increased. The possiblity to study alternative splicing using microarrays has
appeared very recently in experimental publications. The word is now 'this is possible',
and it has been reported for spotted oligonucleotides arrays, for
printed arrays and for arrays made by the {\it Affymetrix} company. 
However little has been done to quantify how well the
technique performs, how the existing oligonucleotide data could have been influenced
by alternative phenomenon and how it could contribute to discover novel splice variants.
This package defines classes to handle probe expression values in an alternative splicing
context.

The class \Robject{SpliceExprSet} combines information about putative
splice sites on a sequence with probes matching the sequence and
corresponding probe intensities. It it constituted of three
attributes, \Robject{eset}, \Robject{probes} and
\Robject{spliceSites}, the first being of class \Robject{ExpressionSet} 
the second of class \Robject{Probes} and the
third of class \Robject{SpliceSites}. The idea behind the
clear separation between splice site positions, probes and probe intensities
is the dynamic nature of their relationship. While the sequence for
the probe is fixed, the target sequences can vary slightly and the
position or presence of splice sites as well (especially when working with
putative splice sites).

The view chosen is appropriate for linking alternative splicing
information with microarray data, but is admitedly not very standard
on the alternative splicing side. 
Many representations of alternative splicing show exons as boxes and
draw broken lines between segments to show the possible splice variants.
The model chosen is not completely incompatible with this view. We
present how to use this representation within the package in one of
the last sections below.

The package is loaded by a simple call to
\begin{verbatim}
library(splicegear)
\end{verbatim}

<<R.hide, results=hide, echo=FALSE>>=
library(splicegear)
@

\section*{Plotting methods}

Plotting methods are defined for the classes \Robject{SpliceSites} and \Robject{SpliceExprSet}.

Exact matching of the reference sequences used by the database of
putative splice sites against the probes of {\it Affymetrix} {\tt U95A} chips were performed.
The expression values from the {\it GeneLogic} `Dilution' dataset
(RNA extracted from cells from the central nervous system and from the
liver) were observed.

<<<fig=TRUE>>=
data(spsites)

print(spsites)

plot(spsites)
@ 

\section*{Import data}
The package has facilities to parse XML structures in a defined XML
format 
(see \url{http://palsdb.ym.edu.tw/index2.html})
%(see Appendix~\ref{appendix:dtd}).
Motivations were discussed in length
in a once submitted manuscript.
To summarize, we hoped to initiate efficient data exchange for
alternative splicing, in the spirit of the WDDX and SOAP
formats. The DTD we introduce is more a call for discussion by interested
parties. It is not fully compliant with WDDX nor SOAP, althought it may
come in the future.

The XML can be stored in a file. In {\bf R}, one can make an object of
class \Rclass{xml} very easily:
<<>>=
library(XML)
filename <- system.file("extdata", "example.xml", package="splicegear")
xml <- xmlTreeParse(filename, asTree=TRUE)
@
Further details concerning XML handling can be found in the documentation
for the package XML.

The XML structure can be converted to a list of \Rclass{SpliceSites}:
<<>>=
spsites <- buildSpliceSites(xml, verbose=FALSE)
length(spsites)
show(spsites[1:2])
@ 
The package is currently able to connect to the `database of putative
alternative splicing' {\it PALSdb}.
The typical way to obtain data from the web is composed of two steps.
\begin{enumerate}
  \item query a web site and obtain XML in return
  \item build {\bf R} objects from the XML
\end{enumerate}

\begin{verbatim}
xml <- queryPALSdb("alcohol")
spsites <- buildSpliceSites(xml, verbose=FALSE)
\end{verbatim}

\section*{The class \Rclass{SpliceSites}}
The class \Rclass{SpliceSites} is a little complex. One should refer
to the relevant help file for details.
We only introduce here a detailed example of can be performed.

<<>>=
## build SpliceSites
library(XML)
filename <- system.file("extdata", "example.xml", package="splicegear")
xml <- xmlTreeParse(filename, asTree=TRUE)
spsites <- buildSpliceSites(xml, verbose=FALSE)

## subset the second object in the list
my.spsites <- spsites[[2]]
@ 

<<fig=TRUE>>=
plot(my.spsites)
@ 
As shown, for most of the putative splice site several ESTs are
supporting evidences. One might want to see the tissue distribution of
the matches


\section*{Data in a \Rclass{data.frame}}
The \Rclass{data.frame} has a very important role in the S
language. A large number of function are designed around this data
structure. We provide a way to link the class \Robject{SpliceExprSet}
with this data structure. The function casting to \Rclass{data.frame}
is named \Rfunction{as.data.frame.SpliceExprSet}. Using the S3
dispatch system, a call to \Rfunction{as.data.frame} with a first
argument of class \Rclass{SpliceExprSet} should be enough.

<<>>=
data(spliceset)

dataf <- as.data.frame(spliceset)

colnames(dataf)
@ 

%<<fig=TRUE>>=
<<>>=
lm.panel <- function(x, y, ...) {
                                  points(x,y,...)
                                  p.lm <- lm(y~x); abline(p.lm)
                                }

## to plot probe intensity values conditioned by the position of the probes on
## the mRNA:
## (commented out to avoid a warning)
##coplot(log(exprs) ~ Material | begin, data=dataf, panel=lm.panel)

@
Further explanations about formulas and models in {\bf S-plus} and {\bf R}
can be found easily elsewhere.


\section*{Genomic view}
A popular representation of splice variants shows exons as boxes,
linked by broken lines to show which exons are skipped and which ones
are not for the splice variants~\ref{fig:genomic.AS}. In this context,
type II and type III splice variants are not relevant: each exon
is only likely to be skipped. The package features an experimental
class that extends the class \Rclass{SpliceExprSet} and gives compatibility
with this representation.

%\begin{figure}[htbp]
%\begin{center}
%\includegraphics[width=\textwidth]{HASDB.spliceview}
%\includegraphics[width=\textwidth]{transcript.geneview.Hs.4291}
%\caption{\label{fig:genomic.AS}More common representation of alternative splicing.}
%\end{center}
%\end{figure}

<<>>=
## a 10 bp window
seq.length <- as.integer(10)
## positions of the exons
spsiteIpos <- matrix(c(1, 3.5, 5, 9, 3, 4, 8, 10), nc=2)
## known variants
variants <- list(a=c(1,2,3,4), b=c(1,2,3), c=c(1,3,4))
##
n.exons <- nrow(spsiteIpos)

spvar <- new("SpliceSitesGenomic", spsiteIpos=spsiteIpos,
         variants=variants, seq.length=seq.length)
@ 

A plotting method (not unlike the display that can be found on the
TrEmbl website) is provided.

<<fig=TRUE>>=
par(mfrow = c(3,1), mar = c(3.1, 2.1, 2.1, 1.1))

plot(spvar, split=TRUE, col.exon=rainbow(n.exons))
@

\section*{Combining alternative splicing information with probes intensities}

The class \Rclass{Probes} stores information relative to probes
matching a reference sequence, primarily their position on the
a reference sequence. Additional data concerning the probes
can be stored in the slot {\it info}.

The class \Rclass{SpliceExprSet} is mainly an aggregation of an instance
of class \Rclass{SpliceSites}, an instance of \Rclass{Probes} and an
instance of class \Rclass{ExpressionSet}.

The typical procedure is to build \Rclass{SpliceSites} and their
correspondings \Rclass{Probes} for a type of chip\footnote{See the functions
\Rfunction{queryPALSdb}, \Rfunction{buildSpliceSites} and the Bioconductor
package \Rpackage{matchprobes}.}.
Experimental hybridizations for the probes are stored in
\Rclass{ExpressionSet} objects. This design facilitates the use of the package
in different contexts. For example, once the mapping of the probes has
been performed and the data relative to alternative splicing have been
acquired, the analyst can re-use these and combine them with data from different
hybridization experiments. It also becomes possible to distribute
mappings and splice variants information for certain types of chips.
This was performed for {\it Affymetrix} chips and data packages providing
the information will be distributed shortly.

%\appendix

%\label{appendix:dtd}

\end{document}