%\VignetteIndexEntry{Streamer: A simple example}
%\VignetteDepends{}
%\VignetteKeywords{Stream}
%\VignettePackage{Streamer}
\documentclass{article}

\usepackage{times}
\usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\Bioconductor}{\software{Bioconductor}}

\SweaveOpts{keep.source=TRUE}

\title{Using the Streamer classes to count genomic overlaps 
       with \Rmethod{summarizeOverlaps}}
\author{Nishant Gopalakrishnan, Martin Morgan}
\date{\today}

\begin{document}

\maketitle

\section{Introduction}
This vignette illustrates how users can make use of the functionality
provided by the \Rclass{Producer}, \Rclass{Consumer} and
\Rclass{Stream} classes in the \Rpackage{Streamer} package to process
data in a streaming fashion. The users have the option of quickly
being able to create their own class to stream process data by
inheriting from the classes provided by the \Rpackage{Streamer}
package.

This example illustrates a simple \Rclass{BamInput} class that
inherits from the \Rclass{Producer} class and a \Rclass{CountGOverlap}
class that inherits from the \Rclass{Consumer} class. These classes
allows us to count the number of hits in a BAM file corresponding to
the ranges specified by the user and return the hits in a streaming
manner on a per sequence basis. Finally, the results for each sequence
is collated and reordered using a helper function so they appear in
the same order as the ranges provided by the user. The classes that we
are going to develop in this example make use of the reference class
system available in \R{}.

We first load the \Rpackage{GenomicAlignments} and \Rpackage{Streamer}
packages.

<<Load packages>>=
library(GenomicAlignments)
library(Streamer)
@

\section{\Rclass{BAMInput} class}

The \Rclass{BAMInput} class will be used to read gapped alignments
from a file specified by the user in a streaming manner. i.e reads
will be read one sequence at a time.

The two inputs specified by the user are
\begin{itemize}
    \item file: a character string specifying the file from which
      alignments are to be read.
    \item ranges: the ranges from which alingments are to be
�      retrieved.
\end{itemize}

Like the design of the other classes in the \Rpackage{Streamer}
package, the \Rclass{BamInput} class will have an \Rmethod{initialize}
and a \Rmethod{yield} method. The \Rmethod{initialize} method will be
used to initialize the fields of the \Rclass{BamInput} class and is
called automatically when objects are instantiated from this class.

The \Rmethod{yield} method does not take any inputs. Each call to the
\Rmethod{yield} method returns a \Rclass{GAlignments} object for
a single sequence within the ranges specified by the user until all
the sequences have been read from the BAM file at which point, an
empty \Rclass{GAlignments} object will be returned.

<<BamInput class>>=

.BamInput <-
    setRefClass("BamInput",
    contains="Producer",
    fields=list(
        file="character",
        ranges="GRanges",
        .seqNames="character"))

.BamInput$methods(
    yield=function()
    {
       "yield data from .bam file"
       if (verbose) msg("BamInput$.yield()")
       if(length(.self$.seqNames))
       {
            seq <- .self$.seqNames[1]
            .self$.seqNames <- .self$.seqNames[-1]
            idx <- as.character(seqnames(.self$ranges)) == seq
            param <- ScanBamParam(which=.self$ranges[idx],
                                  what=character())
            aln <- readGAlignments(.self$file, param=param)
            seqlevels(aln) <- seq
       } else {
           aln <- GAlignments()
       }
       list(aln)
    })

@ 
The constructor for the \Rclass{BamInput} class takes the file and
ranges as input and returns and instance of the \Rclass{BamInput}
class.
<<BamInput constructor>>=

BamInput <- function(file, ranges,...)
{
    .seqNames <- names(scanBamHeader(file)[[1]]$target)
    .BamInput$new(file=file, ranges=ranges, .seqNames=.seqNames, ...)
}

@
\section{\Rclass{CountGOverlap} class}

The second class we are going to develop is a \Rclass{Consumer} class
that processes the data obtained from the \Rclass{BamInput} class. The
class calls the \Rmethod{summarizeOverlaps} method with the
\Rclass{GAlignments} object, user supplied ranges and additional
arguments to control the behaviour of the \Rmethod{summarizeOverlaps} 
method.

The \Rclass{CountGOverlap} class has an \Rmethod{initialize} method
and a \Rmethod{yield} method. The \Rmethod{initialize} method
initializes the class with the options to be passed in to the
\Rmethod{countGenomicOverlaps} method as well as some variables for
keeping track of the order of the hits to be returned by the
\Rclass{CountGOverlap} class.

The \Rmethod{yield} method returns a \Rclass{DataFrame} with the
number of hits.  The rownames of the result returned correspond to the
order of the results in the original ranges supplied by the
user. (These are subsequently used to reorder the results for the hits
after collating results for all the sequences)

<<CountGOverlap class>>=

.CountGOverlap <-
    setRefClass("CountGOverlap",
    contains="Consumer",
    fields=list(ranges="GRanges",
                mode="character",
                ignore.strand="logical"))

.CountGOverlap$methods(
    yield=function()
    {
        "return number of hits"
        if (verbose) msg(".CountGOt$yield()")
        aln <- callSuper()[[1]]
        df <- DataFrame(hits=numeric(0))
        if(length(aln))
        {
            idx <- as.character(seqnames(.self$ranges)) == levels(rname(aln))
            which <- .self$ranges[idx]
            olap <- summarizeOverlaps(which, aln, mode=.self$mode,
                                      ignore.strand=.self$ignore.strand)
            df <- as(assays(olap)[[1]], "DataFrame")
            dimnames(df) <- list(rownames(olap), seqlevels(aln))
        }
        df
    })

CountGOverlap <-
    function(ranges,
             mode = c("Union", "IntersectionStrict",
               "IntersectionNotEmpty"),
             ignore.strand = FALSE, ...)
{
    values(ranges)$pos <- seq_len(length(ranges))
    .CountGOverlap$new(ranges=ranges, mode=mode,
                       ignore.strand=ignore.strand, ...)
}

@

\section{Stream with \Rclass{BamInput} and \Rclass{CountGOverlap}}

Instances of the \Rclass{BamInput} and \Rclass{CountGOverlap} classes
can be created using their respective constructors and can
subsequently be hooked up to form a stream using the \Rmethod{Stream}
function provided by the \Rpackage{Streamer} package. For our example
we shall make use of a BAM file available in the \Rpackage{Rsamtools}
package and create a \Rclass{GenomicRanges} object for the ranges that
we are interested. A \Rclass{Stream} can then be created by passing
these objects as the arguments to the \Rmethod{Stream} function.

A call to the \Rmethod{yield} function of the \Rclass{Stream} class
will yield the results obtained by calling yield first on the
\Rclass{BamInput} class and subsequently on the \Rclass{CountGOverlap}
class for the first sequence in the ranges provided.

<<BAM file and ranges>>=

galn_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
gr <-
    GRanges(seqnames =
            Rle(c("seq2", "seq2", "seq2", "seq1"), c(1, 3, 2, 4)),
            ranges = IRanges(rep(10,1), width = 1:10,
              names = head(letters,10)),
            strand = Rle(strand(rep("+", 5)), c(1, 2, 2, 3, 2)),
            score = 1:10,
            GC = seq(1, 0, length=10))

bam <- BamInput(file = galn_file, ranges = gr)
olap  <- CountGOverlap(ranges=gr, mode="IntersectionNotEmpty")
s <- Stream(bam, olap)
yield(s)

@

\section{Collate results}

Each call to the \Rmethod{yield} function of the stream process data
for one sequence. It would be convenient to have a function that
processed data for all the sequences in the ranges provided and
collated the results so that they are ordered correctly. (same order
as the ranges provided). We proceed to create this helper
\Rmethod{overlapCounter} function that takes a \Rclass{BAMInput} and
\Rclass{CountGOverlap} class objects as inputs.

<<overlap counter >>=

overlapCounter <- function(pr, cs) {
    s <- Stream(pr, cs)
    len <- length(levels(seqnames(pr$ranges)))
    lst <- vector("list", len)
    for(i in 1:len) {
        lst[[i]] <- yield(s)
        names(lst[[i]]) <- "Count"
    }
    do.call(rbind, lst)[names(cs$ranges), ,drop=FALSE ]
}

bam <- BamInput(file = galn_file, ranges = gr)
olap  <- CountGOverlap(ranges=gr, mode="IntersectionNotEmpty")
overlapCounter(bam, olap)
@

\end{document}