%\VignetteIndexEntry{A short presentation of the basic classes defined in Biostrings 2}
%\VignetteKeywords{DNA, RNA, Sequence, Biostrings, Sequence alignment} 
%\VignettePackage{Biostrings}

%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
\documentclass[11pt]{article}

%\usepackage[authoryear,round]{natbib}
%\usepackage{hyperref}


\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\textwidth=6.2in

\bibliographystyle{plainnat} 
 
\begin{document}
%\setkeys{Gin}{width=0.55\textwidth}

\title{The \Rpackage{Biostrings}~2 classes (work in progress)}
\author{Herv\'e Pag\`es}
\maketitle

\tableofcontents


% ---------------------------------------------------------------------------

\section{Introduction}

This document briefly presents the new set of classes implemented in the
\Rpackage{Biostrings}~2 package.
Like the \Rpackage{Biostrings}~1 classes (found in \Rpackage{Biostrings}
v~1.4.x), they were designed to make manipulation of big strings (like DNA
or RNA sequences) easy and fast.
This is achieved by keeping the 3 following ideas from the
\Rpackage{Biostrings}~1 package:
(1) use R external pointers to store the string data,
(2) use bit patterns to encode the string data,
(3) provide the user with a convenient class of objects where each instance
    can store a set of views {\it on the same} big string (these views being
    typically the matches returned by a search algorithm).

However, there is a flaw in the \Rclass{BioString} class design
that prevents the search algorithms to return correct information
about the matches (i.e. the views) that they found.
The new classes address this issue by replacing the \Rclass{BioString}
class (implemented in \Rpackage{Biostrings}~1) by 2 new classes:
(1) the \Rclass{XString} class used to represent a {\it single} string, and
(2) the \Rclass{XStringViews} class used to represent a set of views
    {\it on the same} \Rclass{XString} object, and by introducing new
    implementations and new interfaces for these 2 classes.


% ---------------------------------------------------------------------------

\section{The \Rclass{XString} class and its subsetting operator~\Rmethod{[}}

The \Rclass{XString} is in fact a virtual class and therefore cannot be
instanciated. Only subclasses (or subtypes) \Rclass{BString},
\Rclass{DNAString}, \Rclass{RNAString} and \Rclass{AAString} can.
These classes are direct extensions of the \Rclass{XString} class (no
additional slot).

A first \Rclass{BString} object:
<<a1>>=
library(Biostrings)
b <- BString("I am a BString object")
b
length(b)
@

A \Rclass{DNAString} object:
<<a2>>=
d <- DNAString("TTGAAAA-CTC-N")
d
length(d)
@
The differences with a \Rclass{BString} object are: (1) only letters from the
{\it IUPAC extended genetic alphabet} + the gap letter ({\tt -}) are allowed
and (2) each letter in the argument passed to the \Rfunction{DNAString}
function is encoded in a special way before it's stored in the
\Rclass{DNAString} object.

Access to the individual letters:
<<a3>>=
d[3]
d[7:12]
d[]
b[length(b):1]
@
Only {\it in bounds} positive numeric subscripts are supported.

In fact the subsetting operator for \Rclass{XString} objects is not efficient
and one should always use the \Rmethod{subseq} method to extract a substring
from a big string:
<<a4>>=
bb <- subseq(b, 3, 6)
dd1 <- subseq(d, end=7)
dd2 <- subseq(d, start=8)
@

To {\it dump} an \Rclass{XString} object as a character vector (of length 1),
use the \Rmethod{toString} method:
<<a5>>=
toString(dd2)
@

Note that \Robject{length(dd2)} is equivalent to
\Robject{nchar(toString(dd2))} but the latter would be very inefficient
on a big \Rclass{DNAString} object.

{\it [TODO: Make a generic of the substr() function to work with
XString objects. It will be essentially doing toString(subseq()).]}


% ---------------------------------------------------------------------------

\section{The \Rmethod{==} binary operator for \Rclass{XString} objects}

The 2 following comparisons are \Robject{TRUE}:
<<b1,results=hide>>=
bb == "am a"
dd2 != DNAString("TG")
@

When the 2 sides of \Rmethod{==} don't belong to the same class
then the side belonging to the ``lowest'' class is first converted
to an object belonging to the class of the other side (the ``highest'' class).
The class (pseudo-)order is \Rclass{character} < \Rclass{BString} < \Rclass{DNAString}.
When both sides are \Rclass{XString} objects of the same subtype (e.g. both
are \Rclass{DNAString} objects) then the comparison is very fast because it
only has to call the C standard function {\tt memcmp()} and no memory allocation
or string encoding/decoding is required.

The 2 following expressions provoke an error because the right member can't
be ``upgraded'' (converted) to an object of the same class than the left member:
<<b2,echo=FALSE>>=
cat('> bb == ""')
cat('> d == bb')
@

When comparing an \Rclass{RNAString} object with a \Rclass{DNAString} object,
U and T are considered equals:
<<b3>>=
r <- RNAString(d)
r
r == d
@


% ---------------------------------------------------------------------------

\section{The \Rclass{XStringViews} class and its subsetting
operators~\Rmethod{[} and~\Rmethod{[[}}

An \Rclass{XStringViews} object contains a set of views {\it on the same}
\Rclass{XString} object called the {\it subject} string.
Here is an \Rclass{XStringViews} object with 4 views:
<<c1>>=
v4 <- Views(dd2, start=3:0, end=5:8)
v4
length(v4)
@

Note that the 2 last views are {\it out of limits}.

You can select a subset of views from an \Rclass{XStringViews} object:
<<c3>>=
v4[4:2]
@

The returned object is still an \Rclass{XStringViews} object,
even if we select only one element.
You need to use double-brackets to extract a given view
as an \Rclass{XString} object:
<<c4>>=
v4[[2]]
@

You can't extract a view that is {\it out of limits}:
<<c6,echo=FALSE>>=
cat('> v4[[3]]')
cat(try(v4[[3]], silent=TRUE))
@

Note that, when \Robject{start} and \Robject{end} are numeric
vectors and \Robject{i} is a {\it single} integer,
\Robject{Views(b, start, end)[[i]]}
is equivalent to \Robject{subseq(b, start[i], end[i])}.

Subsetting also works with negative or logical values with the expected
semantic (the same as for R built-in vectors):
<<c7>>=
v4[-3]
v4[c(TRUE, FALSE)]
@
Note that the logical vector is recycled to the length of \Robject{v4}.


% ---------------------------------------------------------------------------

\section{A few more \Rclass{XStringViews} objects}

12 views (all of the same width):
<<d1>>=
v12 <- Views(DNAString("TAATAATG"), start=-2:9, end=0:11)
@

This is the same as doing \Robject{Views(d, start=1, end=length(d))}:
<<d2,results=hide>>=
as(d, "Views")
@

Hence the following will always return the \Robject{d} object itself:
<<d3,results=hide>>=
as(d, "Views")[[1]]
@

3 \Rclass{XStringViews} objects with no view:
<<d4,results=hide>>=
v12[0]
v12[FALSE]
Views(d)
@


% ---------------------------------------------------------------------------

\section{The \Rmethod{==} binary operator for \Rclass{XStringViews} objects}

This operator is the vectorized version of the \Rmethod{==} operator
defined previously for \Rclass{XString} objects:
<<e1>>=
v12 == DNAString("TAA")
@

To display all the views in \Robject{v12} that are equals to a given view,
you can type R cuties like:
<<e2>>=
v12[v12 == v12[4]]
v12[v12 == v12[1]]
@

This is \Robject{TRUE}:
<<e2,results=hide>>=
v12[3] == Views(RNAString("AU"), start=0, end=2)
@


% ---------------------------------------------------------------------------

\section{The \Rmethod{start}, \Rmethod{end} and \Rmethod{width}
methods}

<<f1>>=
start(v4)
end(v4)
width(v4)
@

Note that \Robject{start(v4)[i]} is equivalent to
\Robject{start(v4[i])}, except that the former will not issue
an error if \Robject{i} is out of bounds
(same for \Rmethod{end} and \Rmethod{width} methods).

Also, when \Robject{i} is a {\it single} integer,
\Robject{width(v4)[i]} is equivalent to \Robject{length(v4[[i]])}
except that the former will not issue an error
if \Robject{i} is out of bounds or if view \Robject{v4[i]}
is {\it out of limits}.


\end{document}