%\VignetteIndexEntry{Overlap encodings}
%\VignetteDepends{pasillaBamSubset, GenomicRanges, Rsamtools, GenomicFeatures}
%\VignetteKeywords{sequence, sequencing, alignments}
%\VignettePackage{GenomicRanges}
\documentclass[10pt]{article}

\usepackage{times}
\usepackage{hyperref}

\usepackage[margin=0.65in]{geometry}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}
\newcommand{\Bioconductor}{\software{Bioconductor}}

\SweaveOpts{keep.source=TRUE}

\title{Overlap encodings}
\author{Herv\'e Pag\`es}
\date{Last modified: May 2012; Compiled: \today}

\begin{document}

\maketitle

<<options,echo=FALSE>>=
options(width=100)
@

\begin{center}
\begin{minipage}{0.8\textwidth}
WARNING (May 8, 2012): This vignette is incomplete and was still a WORK IN
PROGRESS at the time \Rpackage{GenomicRanges} 1.8 was released (as part of
Bioconductor 2.10, released in April 2012). All the work on {\it overlap
encodings} is now happening in the {\bf devel} version of the
\Rpackage{GenomicRanges} package (\Rpackage{GenomicRanges} 1.9, part of
Bioconductor 2.11, at the time of this writing). To use {\it overlap encodings}
and related tools, and to get an updated version of this vignette, it is
{\bf strongly} recommended that you use Bioconductor 2.11. In particular please
make sure that you always use the {\bf latest} version of the
\Rpackage{GenomicRanges} package (version 1.9.14 at the time of this writing,
expect frequent updates).

In other words, this vignette is superseded by the vignette found in
\Rpackage{GenomicRanges} 1.9 (part of Bioconductor 2.11). It won't be updated.

See \Rcode{?useDevel} in the \Rpackage{BiocInstaller} package for how to use
the devel version of Bioconductor (2.11 at the time of this writing).

See \Rcode{?biocLite} in the \Rpackage{BiocInstaller} package for how to update
all the installed packages.
\end{minipage}
\end{center}

\tableofcontents


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


In the context of an RNA-seq experiment, encoding the overlaps between
the aligned reads and the transcripts can be used for detecting those
overlaps that are ``compatible'' with the splicing of the transcript.

Various tools are provided in the \Rpackage{IRanges} and
\Rpackage{GenomicRanges} packages for working with {\it overlap encodings}.
In this vignette, we illustrate the use of these tools on real RNA-seq
data.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Example 1: With single-end reads}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Load the reads and transcripts}

We start by loading some aligned reads into a \Rclass{GappedAlignments} object.
The reads are stored in a BAM file ({\tt untreated1\_chr4.bam}) located
in the \Rpackage{pasillaBamSubset} data package. This file contains single-end
reads from an RNA-seq experiment and aligned against the dm3 genome (see
\Rcode{?untreated1\_chr4} in the \Rpackage{pasillaBamSubset} package for more
information about those reads):

<<untreated1_chr4>>=
library(pasillaBamSubset)
untreated1_chr4()
@

We use the \Rfunction{readGappedAlignments} function defined in the
\Rpackage{GenomicRanges} package to read the BAM file into a
\Rclass{GappedAlignments} object.
It's probably a good idea to get rid of the PCR or optical duplicate
(flag bit 0x400 in the SAM format, see the SAM Spec
\footnote{\url{http://samtools.sourceforge.net/}} for the details), as well
as reads not passing quality controls (flag bit 0x200 in the SAM format).
We do this by creating a \Rclass{ScanBamParam} object that we pass to
\Rcode{readGappedAlignments} (see \Rcode{?ScanBamParam} in the
\Rpackage{Rsamtools} package for the details). Note that we also use
\Rcode{use.names=TRUE} in order to load the {\it query template names}
(QNAME field in the SAM format) from the BAM file (\Rcode{readGappedAlignments}
will use them to set the names of the returned object):

<<readGappedAlignments>>=
library(GenomicRanges)
library(Rsamtools)
flag0 <- scanBamFlag(isDuplicate=FALSE, isValidVendorRead=TRUE)
param0 <- ScanBamParam(flag=flag0)
gal14 <- readGappedAlignments(untreated1_chr4(), use.names=TRUE, param=param0)
@

Our reads can have up to 2 gaps (a gap corresponds to an N operation in the
CIGAR):

<<ngap>>=
head(gal14)
table(ngap(gal14))
@

We also need to retrieve the dm3 transcripts and their exons from UCSC,
and extract the exons grouped by transcript in a \Rclass{GRangesList}
object. IMPORTANT: The reference genome of the transcripts must be
{\bf exactly} the same as the reference genome used to align the reads
(note that this is a general rule, not only when working with overlap
encodings):

<<makeTranscriptDbFromUCSC>>=
library(GenomicFeatures)
dm3_refGene_txdb <- makeTranscriptDbFromUCSC(genome="dm3", tablename="refGene")
exbytx <- exonsBy(dm3_refGene_txdb, by="tx")
@

We check that all the exons in any given transcript belong to the same
chromosome and strand. Knowing that our set of transcripts is free of
this kind of trans-splicing events will allow us some significant
simplifications during the downstream analysis \footnote{Dealing with
trans-splicing events is not covered in this document}.
A quick and easy way to check this is to take advantage of the fact
that \Rcode{seqnames} and \Rcode{strand} return \Rclass{RleList} objects.
So we can extract the number of Rle runs for each transcript and make
sure it's always 1:

<<check-for-trans-splicing>>=
table(elementLengths(runLength(seqnames(exbytx))))
table(elementLengths(runLength(strand(exbytx))))
@

Therefore the strand of any given transcript is unambiguously defined
and can be extracted with:

<<exbytx_strand>>=
exbytx_strand <- unlist(runValue(strand(exbytx)), use.names=FALSE)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Find and encode the overlaps}

We are ready to compute the overlaps with the \Rfunction{findOverlaps}
function. Note that the strand of the queries produced by the RNA-seq
experiment is typically unknown so we use \Rcode{ignore.strand=TRUE}:

<<ov14>>=
ov14 <- findOverlaps(gal14, exbytx, ignore.strand=TRUE)
@

However, the {\it overlap encodings} are strand sensitive so we will
compute them twice, once for the original alignments (i.e. the alignments
of the original queries), and once again for the ``flipped alignments''
(i.e. the alignments of the flipped queries). We extract the ranges of
the original and flipped alignments in 2 \Rclass{GRangesList} objects
with:

<<grl14of>>=
grl14o <- grglist(gal14, order.as.in.query=TRUE)
grl14f <- flipQuery(grl14o)
@

and encode their overlaps with the transcripts:

<<ovenc14of>>=
ovenc14o <- encodeOverlaps(grl14o, exbytx, hits=ov14)
ovenc14f <- encodeOverlaps(grl14f, exbytx, hits=ov14)
@

\Rcode{ovenc14o} and \Rcode{ovenc14f} are 2 \Rclass{OverlapsEncodings}
objects of the same length as \Rclass{Hits} object \Rcode{ov14}.
For each hit in \Rcode{ov14}, we have 2 corresponding encodings, one
in \Rcode{ovenc14o} and one in \Rcode{ovenc14f}, but only one of them
encodes a hit between alignment ranges and exon ranges that are on the
same strand.
We use the \Rfunction{selectEncodingWithCompatibleStrand} function
to merge them into a single \Rclass{OverlapsEncodings} of the same
length. For each hit in \Rcode{ov14}, this selects the encoding
corresponding to alignment ranges and exon ranges with compatible
strand:

<<ovenc14>>=
grl14o_strand <- unlist(runValue(strand(grl14o)), use.names=FALSE)
ovenc14 <- selectEncodingWithCompatibleStrand(ovenc14o, ovenc14f,
                                              grl14o_strand, exbytx_strand,
                                              hits=ov14)
ovenc14
@

As a convenience, the 2 above calls to \Rfunction{encodeOverlaps} + merging
step can be replaced by a single call to \Rfunction{encodeOverlaps} on either
\Rcode{grl14f} or \Rcode{grl14o} with \Rcode{flip.query.if.wrong.strand=TRUE}:
<<ovenc14_again>>=
ovenc14_again <- encodeOverlaps(grl14o, exbytx, hits=ov14, flip.query.if.wrong.strand=TRUE)
stopifnot(identical(ovenc14_again, ovenc14))
@

Unique encodings in \Rcode{ovenc14}:

<<ovenc14_table>>=
unique_ovenc14 <- levels(encoding(ovenc14))
length(unique_ovenc14)
head(unique_ovenc14)
ovenc14_table <- table(encoding(ovenc14))
tail(sort(ovenc14_table))
@

Encodings are sort of cryptic but utilities are provided to extract
specific meaning from them. Use of these utilities is covered in the
next three subsections.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Detect overlaps showing ``compatibility'' with the transcript}

We are interested in a particular type of overlap where the read
overlaps the transcript in a ``compatible'' way, that is, in a way
compatible with the splicing of the transcript.
The \Rfunction{isCompatibleWithSplicing} function can be used on an
\Rclass{OverlapEncodings} object to detect this type of overlap.
Note that \Rfunction{isCompatibleWithSplicing} can also be used on a
character vector or factor.

\Rcode{ovenc14} contains 7 unique encodings ``compatible'' with the splicing
of the transcript:

<<compatible_unique_ovenc14>>=
sort(ovenc14_table[isCompatibleWithSplicing(unique_ovenc14)])
@

Encodings \Rcode{"1:i:"} (403826 occurences in \Rcode{ovenc14}),
\Rcode{"2:jm:af:"} (68914 occurences in \Rcode{ovenc14}),
and \Rcode{"3:jmm:agm:aaf:"} (438 occurences in \Rcode{ovenc14}),
correspond to the following overlaps:

\begin{itemize}
  \item \Rcode{"1:i:"}
\begin{verbatim}
  - read (no gap):            oooooooo
  - transcript:     ...   >>>>>>>>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"2:jm:af:"}
\begin{verbatim}
  - read (1 gap):             ooooo---ooo
  - transcript:     ...   >>>>>>>>>   >>>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"3:jmm:agm:aaf:"}
\begin{verbatim}
  - read (2 gaps):              oo---ooooo---o
  - transcript:     ...   >>>>>>>>   >>>>>   >>>>>>>   ...
\end{verbatim}
\end{itemize}

For clarity, only the exons involved in the overlap are represented. The
transcript can of course have more upstream and downstream exons, which
is denoted by the ... on the left side (5' end) and right side (3' end)
of each drawing. Note that the exons represented in the 2nd and 3rd drawings
are consecutive in the transcript.

Encodings \Rcode{"1:f:"} and \Rcode{"1:j:"} are variations of the situation
described by encoding \Rcode{"1:i:"}. For \Rcode{"1:f:"}, the first
aligned base of the read (or flipped read) is aligned with the first base
of the exon. For \Rcode{"1:j:"}, the last aligned base of the read (or
flipped read) is aligned with the last base of the exon:

\begin{itemize}
  \item \Rcode{"1:f:"}
\begin{verbatim}
  - read (no gap):        oooooooo
  - transcript:     ...   >>>>>>>>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"1:j:"}
\begin{verbatim}
  - read (no gap):              oooooooo
  - transcript:     ...   >>>>>>>>>>>>>>   ...
\end{verbatim}
\end{itemize}

<<ov14_is_compat>>=
ov14_is_compat <- isCompatibleWithSplicing(ovenc14)
table(ov14_is_compat)  # 476124 "compatible" overlaps
@

Number of ``compatible'' overlaps per alignment in \Rcode{gal14}:

<<gal14_ncompat>>=
gal14_ncompat <- tabulate(queryHits(ov14)[ov14_is_compat], nbins=length(gal14))
elementMetadata(gal14)$ncompat <- gal14_ncompat
head(gal14)
table(gal14_ncompat)
@

Number of alignments in \Rcode{gal14} with at least 1 ``compatible'' overlap:

<<nb_non_zero_gal14_ncompat>>=
sum(gal14_ncompat != 0)
@

Number of ``compatible'' overlaps per transcript in \Rcode{exbytx}:

<<exbytx_ncompat14>>=
exbytx_ncompat14 <- tabulate(subjectHits(ov14)[ov14_is_compat], nbins=length(exbytx))
names(exbytx_ncompat14) <- names(exbytx)
tail(table(exbytx_ncompat14))
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Detect overlaps showing ``almost compatibility'' with the
            transcript}

In many aspects, ``compatible'' overlaps correspond to an ideal situation
but in practise many overlaps don't fall into that category.
Here we are interested in a less perfect type of overlap where the
read overlaps the transcript in a way that {\it would} be ``compatible''
if 1 or more exons were removed from the transcript. In that case we say
that the overlap is ``almost compatible'' with the transcript.
The \Rfunction{isCompatibleWithSkippedExons} function can be used on an
\Rclass{OverlapEncodings} object to detect this type of overlap.
Note that \Rfunction{isCompatibleWithSkippedExons} can also be used on a
character vector of factor.

\Rcode{ovenc14} contains 7 unique encodings ``almost compatible'' with the
splicing of the transcript:

<<almost_compatible_unique_ovenc14>>=
sort(ovenc14_table[isCompatibleWithSkippedExons(unique_ovenc14)])
@

Encodings \Rcode{"2:jm:am:af:"} (696 occurences in \Rcode{ovenc14}),
\Rcode{"2:jm:am:am:af:"} (114 occurences in \Rcode{ovenc14}),
and \Rcode{"3:jmm:agm:aam:aaf:"} (18 occurences in \Rcode{ovenc14}),
correspond to the following overlaps:

\begin{itemize}
  \item \Rcode{"2:jm:am:af:"}
\begin{verbatim}
  - read (1 gap):           ooooo----------ooo
  - transcript:     ...   >>>>>>>   >>>>   >>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"2:jm:am:am:af:"}
\begin{verbatim}
  - read (1 gap):           ooooo------------------ooo
  - transcript:     ...   >>>>>>>   >>>>   >>>>>   >>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"3:jmm:agm:aam:aaf:"}
\begin{verbatim}
  - read (2 gaps):             oo---oooo-----------oo
  - transcript:     ...   >>>>>>>   >>>>   >>>>>   >>>>>>>>   ...
\end{verbatim}
\end{itemize}

<<ov14_is_almostcompat>>=
ov14_is_almostcompat <- isCompatibleWithSkippedExons(ovenc14)
table(ov14_is_almostcompat)  # 837 "almost compatible" overlaps
@

Number of ``almost compatible'' overlaps per alignment in \Rcode{gal14}:

<<gal14_nalmostcompat>>=
gal14_nalmostcompat <- tabulate(queryHits(ov14)[ov14_is_almostcompat], nbins=length(gal14))
elementMetadata(gal14)$nalmostcompat <- gal14_nalmostcompat
head(gal14)
table(gal14_nalmostcompat)
@

Number of alignments in \Rcode{gal14} with at least 1 ``almost compatible''
overlap:

<<nb_non_zero_gal14_nalmostcompat>>=
sum(gal14_nalmostcompat != 0) 
@

Number of ``almost compatible'' overlaps per transcript in \Rcode{exbytx}:

<<exbytx_nalmostcompat14>>=
exbytx_nalmostcompat14 <- tabulate(subjectHits(ov14)[ov14_is_almostcompat], nbins=length(exbytx))
names(exbytx_nalmostcompat14) <- names(exbytx)
table(exbytx_nalmostcompat14)
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Combining results of \Rcode{isCompatibleWithSplicing()} and
            \Rcode{isCompatibleWithSkippedExons()} to detect novel splice
            junctions}

An alignment in \Rcode{gal14} with ``almost compatible'' hits but
no ``compatible'' hits suggests the presence of one or more transcripts that
are not in our annotations.

First we extract the index of those alignments:

<<aln_shows_nov_splice_jct>>=
aln_shows_nov_splice_jct <- gal14_nalmostcompat != 0L &
                            gal14_ncompat == 0L
head(which(aln_shows_nov_splice_jct))
@

We make this an index into \Rcode{ov14} (Hits object):

<<is_nov_splice_jct>>=
is_nov_splice_jct <- queryHits(ov14) %in% which(aln_shows_nov_splice_jct)
@

We intersect with \Rcode{is\_almost\_compat} to keep only the overlaps
of interest:

<<narrow_is_nov_splice_jct>>=
is_nov_splice_jct <- is_nov_splice_jct & ov14_is_almostcompat
@

For each overlap of interest, we extract the ranks of the skipped
exons (we use a list for this as there might be more than 1 skipped
exon per overlap):

<<extractSkippedExonRanks>>=
skpexrk <- extractSkippedExonRanks(ovenc14)[is_nov_splice_jct]
table(elementLengths(skpexrk))
@

Finally, we split \Rcode{skpexrk} by transcript TxDb internal id:

<<tx2skpexrk>>=
names(skpexrk) <- queryHits(ov14)[is_nov_splice_jct]
f <- names(exbytx)[subjectHits(ov14)[is_nov_splice_jct]]
tx2skpexrk <- split(skpexrk, f)
@

\Rcode{tx2skpexrk} is a named list of named lists of integer vectors.
The first level of names (outer names) are transcript TxDb internal
ids and the second level of names (inner names) are alignment
indices into \Rcode{gal14}:

<<tx2skpexrk_names>>=
head(names(tx2skpexrk))  # transcript TxDb internal ids
@

Transcript \Rcode{"10"} has 7 hits. All of them skip exons 9 and 10:

<<tx10_details>>=
tx2skpexrk[["10"]]
@

Transcript \Rcode{"58"} has 4 hits. Two of them skip exon 2, one of them
skips exons 2 to 6, and one of them skips exon 10:

<<tx58_details>>=
tx2skpexrk[["58"]]
@

A few words about the interpretation of \Rcode{tx2skpexrk}:
Because of how we've conducted this analysis, the aligments reported in
\Rcode{tx2skpexrk} are guaranteed to NOT have any ``compatible'' overlaps
with other known transcripts. All we can say, for example in the case of
transcript \Rcode{"10"}, is that the 7 reported hits that skip exons 9 and 10
show evidence of one or more unknown transcripts with a splice junction that
corresponds to the gap between exons 8 and 11. But without further analysis,
we can't make any assumption about the exons structure of those unknown
transcripts. In particular, we cannot assume the existence of an unknown
transcript made of the same exons as transcript \Rcode{"10"} minus exons 9
and 10!


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Example 2: With paired-end reads}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Load the reads and transcripts}

We start by loading some aligned paired-end reads into a
\Rclass{GappedAlignmentPairs} object.
The reads are stored in a BAM file ({\tt untreated3\_chr4.bam}) located
in the \Rpackage{pasillaBamSubset} data package. This file contains paired-end
reads from an RNA-seq experiment and aligned against the dm3 genome (see
\Rcode{?untreated3\_chr4} in the \Rpackage{pasillaBamSubset} package for more
information about those reads):

<<untreated3_chr4>>=
untreated3_chr4()
@

We use the \Rfunction{readGappedAlignmentPairs} function to read the BAM file
into a \Rclass{GappedAlignmentPairs} object:

<<readGappedAlignmentPairs>>=
galp34 <- readGappedAlignmentPairs(untreated3_chr4(), use.names=TRUE, param=param0)
head(galp34)
@

The \Rcode{show} method for \Rclass{GappedAlignmentPairs} objects displays
two {\tt ranges} columns, one for the {\it first} alignment in the pair (the
left column), and one for the {\it last} alignment in the pair (the right
column). The {\tt strand} column corresponds to the strand of the {\it first}
alignment.

<<first_last>>=
head(first(galp34))
head(last(galp34))
@

According to the SAM format specifications, the aligner is expected to mark
each alignment pair as {\it proper} or not (flag bit 0x2 in the SAM format).
The SAM Spec only says that a pair is {\it proper} if the {\it first} and
{\it last} alignments in the pair are ``properly aligned according to the
aligner''. So the exact criteria used for setting this flag is left to the
aligner.

We use \Rcode{isProperPair} to extract this flag from the
\Rclass{GappedAlignmentPairs} object:

<<isProperPair>>=
table(isProperPair(galp34))
@

Even though we could do {\it overlap encodings} with the full object,
we keep only the {\it proper} pairs for our downstream analysis:

<<keep_only_proper_pairs>>=
galp34 <- galp34[isProperPair(galp34)]
@

For the transcript, we'll reuse the \Rcode{exbytx} object obtained
previously.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Find and encode the overlaps}

Let's compute the overlaps:

<<ov34>>=
ov34 <- findOverlaps(galp34, exbytx, ignore.strand=TRUE)
@

and encode them:

<<ovenc34>>=
grl34 <- grglist(galp34, order.as.in.query=TRUE)
ovenc34 <- encodeOverlaps(grl34, exbytx, hits=ov34, flip.query.if.wrong.strand=TRUE)
ovenc34
@

Unique encodings in \Rcode{ovenc34}:

<<ovenc34_table>>=
unique_ovenc34 <- levels(encoding(ovenc34))
length(unique_ovenc34)
head(unique_ovenc34)
ovenc34_table <- table(encoding(ovenc34))
tail(sort(ovenc34_table))
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Detect overlaps showing ``compatibility'' with the transcript}

\Rcode{ovenc34} contains 13 unique encodings ``compatible'' with the splicing
of the transcript:

<<compatible_unique_ovenc34>>=
sort(ovenc34_table[isCompatibleWithSplicing(unique_ovenc34)])
@

Encodings \Rcode{"1{-}{-}1:i{-}{-}i:"} (89039 occurences in \Rcode{ovenc34}),
\Rcode{"2{-}{-}1:jm{-}{-}m:af{-}{-}i:"} (2825 occurences in \Rcode{ovenc34}),
\Rcode{"1{-}{-}2:i{-}{-}jm:a{-}{-}af:"} (2339 occurences in \Rcode{ovenc34}),
and \Rcode{"1{-}{-}1:i{-}{-}m:a{-}{-}i:"} (342 occurences in \Rcode{ovenc34}),
correspond to the following overlaps:

\begin{itemize}
  \item \Rcode{"1{-}{-}1:i{-}{-}i:"}
\begin{verbatim}
  - paired-end read (no gap on the first end, no gap on the
    last end):               oooo   oooo
  - transcript:     ...   >>>>>>>>>>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"2{-}{-}1:jm{-}{-}m:af{-}{-}i:"}
\begin{verbatim}
  - paired-end read (1 gap on the first end, no gap on the
    last end):                 ooo---o    oooo
  - transcript:     ...   >>>>>>>>   >>>>>>>>>>>   ...
\end{verbatim}

  \item \Rcode{"1{-}{-}2:i{-}{-}jm:a{-}{-}af:"}
\begin{verbatim}
  - paired-end read (no gap on the first end, 1 gap on the
    last end):             oooo      oo---oo
  - transcript:     ...  >>>>>>>>>>>>>>   >>>>>>>>>  ...
\end{verbatim}

  \item \Rcode{"1{-}{-}1:i{-}{-}m:a{-}{-}i:"}
\begin{verbatim}
  - paired-end read (no gap on the first end, no gap on the
    last end):               oooo      oooo
  - transcript:     ...   >>>>>>>>>   >>>>>>>   ...
\end{verbatim}
\end{itemize}

Note: switch use of ``first'' and ``last'' above if the read was flipped.

<<ov34_is_compat>>=
ov34_is_compat <- isCompatibleWithSplicing(ovenc34)
table(ov34_is_compat)  # 95801 "compatible" overlaps
@

Number of ``compatible'' overlaps per alignment pair in \Rcode{galp34}:

<<galp34_ncompat>>=
galp34_ncompat <- tabulate(queryHits(ov34)[ov34_is_compat], nbins=length(galp34))
elementMetadata(galp34)$ncompat <- galp34_ncompat
head(galp34)
table(galp34_ncompat)
@

Number of alignment pairs in \Rcode{galp34} with at least 1 ``compatible''
overlap:

<<nb_non_zero_galp34_ncompat>>=
sum(galp34_ncompat != 0)
@

Number of ``compatible'' overlaps per transcript in \Rcode{exbytx}:

<<exbytx_ncompat34>>=
exbytx_ncompat34 <- tabulate(subjectHits(ov34)[ov34_is_compat], nbins=length(exbytx))
names(exbytx_ncompat34) <- names(exbytx)
tail(table(exbytx_ncompat34))
@


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{\Rcode{sessionInfo()}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<<sessionInfo>>=
sessionInfo()
@

\end{document}