%\VignetteIndexEntry{An Introduction to the BUMHMM pipeline}
%\VignetteKeywords{BUMHMM, structure probing, RNA}
%\VignettePackage{BUMHMM}
%\VignetteEngine{knitr::knitr}

\documentclass{article}

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\bioctitle[BUMHMM: Computational pipeline for modelling structure probing data]{BUMHMM: Probabilistic computational pipeline for modelling RNA structure probing data}
\author{Alina Selega\footnote{alina.selega@ed.ac.uk or alina.selega@gmail.com}}
\date{Modified: October 25, 2016. Compiled: \today}

\begin{document}

\maketitle

\tableofcontents

\section{Introduction}

RNA structure is known to be a key regulator of many important mechanisms, such
as RNA stability, transcription, and mRNA translation. RNA structural regulatory
elements are interrogated with chemical and enzymatic structure probing
\cite{kubota2015progress}. In these experiments, a chemical agent reacts with
the RNA molecule in a structure-dependent way, cleaving or otherwise modifying
its flexible parts. These modified positions can then be detected, providing
valuable structural information that can be used for structure prediction
\cite{wu2015improved}.

Specifically, chemical modification terminates the reverse transcription
reaction, resulting in the reverse transcriptase (RT) dropping off at the
modified positions. These positions of drop-off can then be mapped back to the
reference sequence. However, the challenge lies in the stochasticity of this
process as the RT can also drop off randomly. To address this, a complementary
control experiment is routinely performed to monitor random RT drop-offs when no
reagent is used.

Let us consider a toy example of data obtained in a paired-end sequencing
structure probing experiment (Fig. 1). We will focus on a particular nucleotide G
and analyse the data from a control experiment (with no reagent added) and a
treatment experiment (with RNAs modified by the reagent). In control conditions,
we mapped 5 fragments overlapping with the nucleotide G, one of which also
terminated at that position. Thus, this nucleotide had a coverage of 5 and a
drop-off count of 1 (the number of times the RT dropped off immediately after
this position), giving it a \textit{drop-off rate} of \( \frac{1}{5} \)
(formally defined below). In treatment conditions, more fragments terminated at
this position and we measured a drop-off rate of \( \frac{4}{5} \). This seems
to suggest that the next nucleotide T has been modified by the reagent and
perhaps corresponds to a flexible site within the molecule. However, would our
conclusion remain the same had we observed a higher drop-off rate in control
conditions to start with? In fact, how high would this control drop-off rate
have to be for us to dismiss the drop-off rate of \( \frac{4}{5} \) as a noisy
measurement of random drop-off rather than an indication of real modification?

\begin{figure}[h]
\caption{Toy example of structure probing data.}
\centering
\includegraphics[width=0.9\textwidth]{toyEx_structureProbing.png}
\end{figure}

This question reinforces the need for deciding statistically whether the
drop-off rate in treatment conditions is significantly higher than the drop-off
rate in control. To do this, we must understand how much noise can be expected
in control conditions. If the treatment drop-off rate falls outside this range
of variability, then we can deem it significantly higher.

We developed the Beta-Uniform Mixture hidden Markov model (\verb|BUM-HMM|)
\cite{selega2016robust}, a statistical framework for modelling reactivity scores
from an RNA structure probing experiment such as SHAPE
\cite{spitale2013rna} or ChemModSeq \cite{hector2014snapshots}. \verb|BUM-HMM|
implements the intuition outlined above by utilising data from multiple
experimental replicates and quantifying the variability of the RT drop-offs.
\verb|BUM-HMM| also provides empirical strategies to correct intrinsic biases in
the data. It generates a probabilistic output measuring the probability of
modification for each nucleotide transcriptome-wide.

The \verb|BUMHMM| package implements the functionality of the \verb|BUM-HMM|
model. This vignette provides an example workflow for using the \verb|BUMHMM|
package on the structure probing data set for the yeast ribosomal RNA 18S,
obtained in an experiment with random priming and paired-end sequencing
(available in the Gene Expression Omnibus under accession number GSE52878).

\section{Data format}

The \verb|BUMHMM| pipeline requires three data sets for all nucleotide
positions:

\begin{itemize}
\item the coverage (or the number of reads overlapping with this position),
\item the drop-off count (or the number of times the RT dropped off at the next
    nucleotide),
\item and the drop-off rate at this position.
\end{itemize}

The coverage and the drop-off counts are the data obtained in a structure
probing experiment with paired-end sequencing. The drop-off rate $r$ at each
nucleotide position is computed as the ratio between its drop-off count $k$ and
the coverage $n$: \(r = \frac{k}{n} \). Such data sets can be easily stored in a
\Biocpkg{SummarizedExperiment} object, commonly used to represent data from
sequencing-based experiments such as RNA-Seq.
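
As an aside, such a container can be assembled directly from per-position
matrices. Below is a minimal sketch with toy values (not real data), assuming
two positions and one control and one treatment replicate:

<<eval=FALSE>>=
## Minimal sketch (toy values): rows are positions, columns are replicates
library(SummarizedExperiment)
coverage <- matrix(c(5L, 10L, 5L, 8L), nrow = 2)
dropoffCount <- matrix(c(1L, 2L, 4L, 6L), nrow = 2)
toySe <- SummarizedExperiment(
    assays = list(coverage = coverage,
                  dropoff_count = dropoffCount,
                  dropoff_rate = dropoffCount / coverage),  ## r = k / n
    colData = DataFrame(replicate = c('control', 'treatment')))
@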

The key strength of the \verb|BUM-HMM| model is that it accounts for the
biological variability of the data; it therefore requires data sets available
in multiple replicates. The data set \texttt{se} provided with this package (accession
number GSE52878) is available in triplicates and was obtained in a structure
probing experiment on the 18S ribosomal RNA using the DMS chemical probing agent
\cite{wells200032}.

<<>>=
suppressPackageStartupMessages({
    library(BUMHMM)
    library(Biostrings)
    library(SummarizedExperiment)
})
se
@

We see that 18S has 1,800 nucleotides (represented as rows in \texttt{se}) and
that the data set has 6 replicates: 3 control experiments followed by 3
treatment experiments (represented as columns in \texttt{se}). The assays
correspond to different data sets, namely, the coverage, drop-off count, and
drop-off rate information for each nucleotide. One could quickly access the
coverage information for control experimental replicates (labelled 'C1', 'C2',
'C3') as follows:

<<>>=
controls <- se[, se$replicate == "control"]
head(assay(controls, 'coverage'))
@

We also provide the associated genomic sequence of 18S, accessible through the
\Rfunction{rowData} function which stores information about the rows, in this
case corresponding to the nucleobases:

<<>>=
rowData(controls)[1:4,]
@

Similarly, the function \Rfunction{colData} stores the description of the
columns, which correspond to the experimental replicates:

<<>>=
colData(controls)
@

For transcriptome-wide experiments, the data over different chromosomes should
be concatenated row-wise.
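
For example, per-chromosome objects with identical assay names and column
metadata can be combined row-wise (a hypothetical sketch; \texttt{se\_chr1} and
\texttt{se\_chr2} are placeholder names):

<<eval=FALSE>>=
## Hypothetical sketch: concatenate per-chromosome data row-wise
seAll <- rbind(se_chr1, se_chr2)
@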

To briefly illustrate the data set, let us examine the 300th nucleotide:

<<>>=
pos <- 300
assay(controls, 'coverage')[pos, 1]
assay(controls, 'dropoff_count')[pos, 1]
assay(controls, 'dropoff_rate')[pos, 1]
@

We see that it had coverage of \Sexpr{assay(controls, 'coverage')[pos, 1]}
in the first control experimental replicate, of which the reverse transcription
randomly terminated at that position
\Sexpr{assay(controls, 'dropoff_count')[pos, 1]} times, giving it a drop-off
rate of \Sexpr{signif(assay(controls, 'dropoff_rate')[pos, 1], 3)}.

<<>>=
treatments <- se[, se$replicate == "treatment"]
assay(treatments, 'coverage')[pos, 1]
assay(treatments, 'dropoff_count')[pos, 1]
assay(treatments, 'dropoff_rate')[pos, 1]
@

In the presence of a chemical probe (in the first treatment replicate), the
coverage and drop-off count at that position were higher but the drop-off rate
remained roughly similar,
\Sexpr{signif(assay(treatments, 'dropoff_rate')[pos, 1], 3)}.

\section{Overview of the pipeline}

The logic of structure probing experiments associates the binding accessibility
of a nucleotide with its structural flexibility, i.e. double-stranded
nucleotides or those otherwise protected (e.g. by a protein interaction) will
not be available for interaction with the chemical reagent. In contrast,
nucleotides located in flexible parts of the molecule can be chemically
modified by the reagent and will therefore correspond to the positions at which
the RT drops off. Thus, we expect the nucleotides immediately downstream from
the modification sites within the transcript to have a high drop-off rate in the
presence of a reagent; higher than what we observe in control conditions.

To quantify the variability in drop-off rate measured in control conditions,
the \verb|BUM-HMM| method compares the drop-off rates at each nucleotide
position between two \textit{control} experimental replicates, ${C_i}$ and
${C_j}$:

\[\log \Big( \frac{r_{C_i}}{r_{C_j}} \Big)\]

If the drop-off rates $r_{C_i}$ and $r_{C_j}$ are similar in a pair of control
replicates, the above log-ratio will be close to 0, indicating little
to no variability in drop-off rate. In contrast, different drop-off rates will
result in a large log-ratio (in absolute value). Computing these per-nucleotide
log-ratios for all pairs of control experimental replicates defines a
\textit{null distribution}, which quantifies how much variability in drop-off
rate we can observe between two experimental replicates simply by chance, in the
absence of any reagent. (Note that because of the log transform, positions
with drop-off rate $r = 0$ cannot be used.)
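
To make this concrete, the sketch below computes such log-ratios by hand for
the first pair of control replicates (the pipeline performs this internally;
columns 1 and 2 of \texttt{se} hold the controls C1 and C2):

<<eval=FALSE>>=
## Sketch: null log-ratios for one control-control pair
dor <- assay(se, 'dropoff_rate')
ok <- dor[, 1] > 0 & dor[, 2] > 0  ## r = 0 excluded by the log transform
nullLogRatios <- log(dor[ok, 1] / dor[ok, 2])
summary(nullLogRatios)
@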

We now compute this log-ratio between the drop-off rates in all pairs of
\textit{treatment} and \textit{control} experimental replicates, ${T_i}$ and
${C_j}$:

\[\log \Big( \frac{r_{T_i}}{r_{C_j}} \Big)\]

We expect the neighbour of a modified nucleotide to have a much larger drop-off
rate in a treatment experiment compared to control conditions, generating a
large log-ratio for this pair of experimental replicates. By comparing each
treatment-control log-ratio to the null distribution, we can find those
nucleotide positions that demonstrate differences in drop-off rate larger than
what can be expected by chance.
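
Continuing the sketch above, we can compare the log-ratios for the first
treatment-control pair against the right tail of the null (column 4 of
\texttt{se} holds the first treatment replicate):

<<eval=FALSE>>=
## Sketch: treatment-control log-ratios versus the null distribution
both <- dor[, 4] > 0 & dor[, 1] > 0
tcLogRatios <- log(dor[both, 4] / dor[both, 1])
## fraction of log-ratios exceeding the 95th percentile of the null
mean(tcLogRatios > quantile(nullLogRatios, 0.95))
@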

The next section goes through the steps of the \verb|BUMHMM| pipeline using the
provided data set \texttt{se} as an example.

\section{BUMHMM pipeline steps}

\subsection{Selecting pairs of nucleotides}

We first need to select nucleotide positions in each experimental replicate for
which we will compute the log-ratios. This is implemented with the function
\Rfunction{selectNuclPos}. This function requires the coverage and drop-off
count information stored in \texttt{se}, the numbers of control and treatment
experimental replicates (\texttt{Nc} and \texttt{Nt}, correspondingly), and a
user-specified coverage threshold \texttt{t}. Nucleotides with coverage $n < t$
will not be considered.

In our data set, we have 3 control and 3 treatment replicates, so if we set the
minimum allowed coverage as $t = 1$, we can make the following function call:

<<>>=
Nc <- Nt <- 3
t <- 1
nuclSelection <- selectNuclPos(se, Nc, Nt, t)
List(nuclSelection)
@

The function \Rfunction{selectNuclPos} returns a list with two elements:

\begin{itemize}

  \item \texttt{analysedC} is a list where each element corresponds to a
  control-control replicate comparison. Each element holds indices of
  nucleotides that have coverage $n \geq t$ and a drop-off count $k > 0$ in
  both replicates of that comparison. Thus, each element stores those
  nucleotide positions for which we can compute the log-ratio for the
  corresponding pair of control replicates.

  \item \texttt{analysedCT} is a list where each element corresponds to a
  treatment-control replicate comparison. Again, each element holds indices of
  nucleotides that have coverage $n \geq t$ and a drop-off count $k > 0$ in
  both replicates of that comparison (these criteria are re-derived by hand in
  the sketch after this list).

\end{itemize}
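
As a quick sanity check, the positions selected for the first control-control
comparison can be re-derived by hand from the criteria stated above (a sketch,
assuming that the first two columns of \texttt{se} are the control replicates):

<<eval=FALSE>>=
## Sketch: re-derive the selection for comparing control replicates 1 and 2
cov <- assay(se, 'coverage')
doc <- assay(se, 'dropoff_count')
manual <- which(cov[, 1] >= t & cov[, 2] >= t &
                doc[, 1] > 0 & doc[, 2] > 0)
all(manual %in% nuclSelection$analysedC[[1]])
@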

The pairwise control replicate comparisons are enumerated with the function
\Rfunction{combn} from the \verb|utils| package:

<<>>=
t(combn(Nc, 2))
@

Thus, the first element of \texttt{analysedC} corresponds to comparing the
control replicate 1 to control replicate 2. The comparisons between treatment
and control replicates are computed similarly.

<<>>=
length(nuclSelection$analysedC[[1]])
length(nuclSelection$analysedCT[[1]])
@

We select \Sexpr{length(nuclSelection$analysedC[[1]])} nucleotide positions for
computing the log-ratios for the first control-control comparison and
\Sexpr{length(nuclSelection$analysedCT[[1]])} positions for the first
treatment-control comparison.

\subsection{Scaling the drop-off rates across replicates}

Because \verb|BUMHMM| works with data collected in multiple experimental
replicates, it is important to ensure that the drop-off rates do not differ
dramatically between replicates. Thus, the second step of the pipeline scales
the drop-off rates of nucleotides selected for pairwise comparisons to have a
common median value. This is implemented with the function \Rfunction{scaleDOR},
which requires the data container \texttt{se}, the output of the function
\Rfunction{selectNuclPos} (described above), and the numbers of replicates. It
returns the updated drop-off rates such that the selected positions have the
same median drop-off rate in all replicates:

<<>>=
## Medians of original drop-off rates in each replicate
apply(assay(se, 'dropoff_rate'), 2, median)

## Scale drop-off rates
assay(se, "dropoff_rate") <- scaleDOR(se, nuclSelection, Nc, Nt)

## Medians of scaled drop-off rates in each replicate
apply(assay(se, 'dropoff_rate'), 2, median)
@

After scaling, the medians are much more similar across replicates. (They are
not exactly equal when computed over all positions, because most, but not all,
nucleotides were selected for the pairwise comparisons.)

\subsection{Computing stretches of nucleotide positions}

The next step in the \verb|BUM-HMM| modelling approach enforces a smoothness
assumption over the state of nucleotides: chemical modification does not
randomly switch along the chromosome, rather, continuous stretches of RNA are
either flexible or not. This is captured with a hidden Markov model (HMM) with
binary latent state corresponding to the true state of each nucleotide: modified
or unmodified.

The observations of the HMM are the empirical $p$-values associated with each
nucleotide. These $p$-values arise from comparing the treatment-control
log-ratios corresponding to each nucleotide position with the null distribution:

\[\textrm{$p$-value} = 1 - \textrm{closest percentile of null distribution}\]

If the difference between the drop-off rates in treatment and control replicates
(as measured by the treatment-control log-ratio) is well within the range of the
drop-off rate variability that we observed in control conditions (as summarised
by the null distribution), then this log-ratio will get assigned a fairly large
$p$-value. However, those log-ratios that are much larger than most values in
the null distribution will be close to its right side
(e.g. 90\textsuperscript{th} percentile). They will then receive a small
$p$-value ($1 - 0.9 = 0.1$ in this case). Thus, the $p$-value can be thought of
as a probability for the treatment-control log-ratio to belong to the null
distribution. We are interested in those log-ratios that are unlikely to belong
to it as they could indicate the real RT drop-off signal. Note that we expect
the drop-off rate in treatment conditions to be higher than in control, which is
why we restrict our attention to the right side of the null distribution.
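
The sketch below illustrates this computation with the empirical distribution
function (\Rfunction{computeProbs} performs the equivalent comparison
internally; \texttt{nullLogRatios} is the vector from the earlier sketch):

<<eval=FALSE>>=
## Sketch: empirical p-value of a log-ratio relative to the null
pvalFromNull <- function(logRatio, nullLogRatios)
    1 - ecdf(nullLogRatios)(logRatio)
pvalFromNull(2.5, nullLogRatios)  ## a large log-ratio gets a small p-value
@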

Modelling $p$-values directly enabled us to define the emission distribution of
the HMM as a Beta-Uniform mixture model. Briefly, the unmodified state of a
nucleotide corresponds to the null hypothesis and the associated $p$-values are
modelled with the Uniform distribution. In the modified state we expect to see
large log-ratios and small associated $p$-values, which are modelled with a Beta
distribution. Further details and theoretical justifications can be found in
\cite{selega2016robust}.
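
The sketch below plots the two emission components; the Beta shape parameters
here are hypothetical, chosen only so that the density concentrates near 0, as
expected for the modified state:

<<eval=FALSE>>=
## Sketch: emission components of the Beta-Uniform mixture
alpha <- 0.5  ## hypothetical shape parameters
beta <- 5
p <- seq(0.001, 0.999, length.out = 200)
plot(p, dbeta(p, alpha, beta), type = 'l', xlab = 'p-value',
     ylab = 'Density', main = 'Beta-Uniform mixture components')
abline(h = 1, lty = 2)  ## Uniform(0, 1) density: the unmodified state
@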

To run the HMM, we first identify uninterrupted stretches of nucleotides for
which posterior probabilities are to be computed. Posterior probabilities will be
computed for those nucleotides with at least the minimum allowed coverage in all
experimental replicates and a non-zero drop-off count in at least one treatment
replicate. This is achieved with the function \Rfunction{computeStretches},
which takes \texttt{se} and the threshold \texttt{t} as parameters.

<<>>=
stretches <- computeStretches(se, t)
@

The function returns an \Biocpkg{IRanges} object where each element corresponds
to a stretch of nucleotides and each stretch is at least 2 nucleotides long.
The HMM will be run separately on each stretch.

<<>>=
head(stretches)
assay(se, 'dropoff_count')[1748,]
@

On this data set, we will compute posterior probabilities for all nucleotides
but one, which is at the 1748th position. This is because at this position, all
treatment replicates (and in fact, all replicates) had a drop-off count of 0.

\subsection{Bias correction}

Using a transcriptome-wide data set, we identified sequence and coverage as
factors that influence log-ratios in control conditions \cite{selega2016robust}.
We would therefore like to transform the log-ratios such that these biases are
eliminated and the performed comparisons are not confounded.

\subsubsection{Coverage bias}

The coverage bias is addressed by a variance stabilisation strategy, implemented
by the \Rfunction{stabiliseVariance} function. This function aims to find a
functional relationship between the log-ratios in the null distribution and the
average coverage in the corresponding pair of control replicates. This
relationship is modelled with the assumption that the drop-off count is a
binomially distributed random variable (see \cite{selega2016robust} for details)
and is fitted to the data with a non-linear least squares technique. Then, all
log-ratios (both for control-control and treatment-control comparisons) are
transformed accordingly so that this dependency on coverage is eliminated or at
least reduced.

The function requires the data container \texttt{se}, the positions of
nucleotides selected for pairwise comparisons, and the numbers of replicates. It
returns a list with two elements (\texttt{LDR} stands for ``log drop-off rate
ratio''):

\begin{itemize}

\item \texttt{LDR\_C} is a matrix with transformed log-ratios for
control-control comparisons.
\item \texttt{LDR\_CT} is a matrix with transformed log-ratios for
treatment-control comparisons.

\end{itemize}

Both matrices have rows corresponding to nucleotide positions and columns
corresponding to pairwise comparisons. Thus, \texttt{LDR\_C} has as many columns as there are
control-control comparisons (3 comparisons for 3 control replicates) and
\texttt{LDR\_CT} has as many columns as treatment-control comparisons (9
comparisons for 3 control and 3 treatment replicates).

<<>>=
varStab <- stabiliseVariance(se, nuclSelection, Nc, Nt)
LDR_C <- varStab$LDR_C
LDR_CT <- varStab$LDR_CT

hist(LDR_C, breaks = 30, main = 'Null distribution of LDRs')
@

The histogram shows the null distribution after the transformation.
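
As a quick check of the layout described above, the number of comparison
columns can be confirmed directly:

<<eval=FALSE>>=
## Expected: choose(3, 2) = 3 control-control columns and
## 3 * 3 = 9 treatment-control columns
dim(LDR_C)
dim(LDR_CT)
@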

\subsubsection{Sequence bias}

The sequence-dependent bias is addressed by computing different null
distributions of log-ratios for different sequence patterns of nucleotides. One
could consider sequences of three nucleotides, reflecting the assumption that
the immediate neighbours of a nucleotide on both sides could affect its
accessibility; patterns of other lengths could also be considered. The function
\Rfunction{nuclPerm} returns a vector of all permutations with repetition of
the four nucleobases (A, T, G, and C) of length $n$ (i.e. $4^n$ patterns):

<<>>=
nuclNum <- 3
patterns <- nuclPerm(nuclNum)
patterns
@

Considering patterns of length $n = 3$ will result in computing
\Sexpr{length(patterns)} different null distributions of log-ratios, each
corresponding to one sequence pattern. To do this, we first need to find all
occurrences of each pattern within the sequence. This is implemented with the
function \Rfunction{findPatternPos}, which takes the list of patterns, a string
containing the sequence (e.g. a \verb|DNAString| object), and a parameter
indicating whether we are dealing with sense (+) or anti-sense (-) DNA strand.
For transcriptome-wide experiments, when searching for pattern occurrences on
the anti-sense strand, the patterns are first converted to their complementary
sequences (the function always expects the sense-strand sequence as its second
parameter).

<<>>=
## Extract the DNA sequence
sequence <- subject(rowData(se)$nucl)
sequence
nuclPosition <- findPatternPos(patterns, sequence, '+')
patterns[[1]]
head(nuclPosition[[1]])
@

The function returns a list with an element corresponding to each pattern
generated by \Rfunction{nuclPerm}. Each element holds the indices of the middle
nucleotide of this pattern's occurrence in the genomic sequence. Thus, we will
separately consider the drop-off rates of the nucleotides that occur in the
context of each pattern. For instance, the null distribution specific to the
pattern ``\Sexpr{patterns[[1]]}'' will be constructed from the log-ratios for
the nucleotide positions \Sexpr{head(nuclPosition[[1]])} etc.

\subsection{Computing posterior probabilities with HMM}

We are now ready for the final step of the pipeline, which computes posterior
probabilities of modification with the HMM. Due to the short length of the 18S
molecule (only 1,800 nucleotides), we omit the sequence bias-correction step,
which is primarily designed for transcriptome-wide studies.
Instead, we will use all nucleotide positions for constructing a single null
distribution quantifying the drop-off rate variability. The
\texttt{nuclPosition} list should therefore have one element corresponding to
the single stretch that we will run the HMM on. This stretch contains all
nucleotide positions:

<<>>=
nuclPosition <- list()
nuclPosition[[1]] <- 1:nchar(sequence)

## Start of the stretch
nuclPosition[[1]][1]
## End of the stretch
nuclPosition[[1]][length(nuclPosition[[1]])]
@

The HMM is run separately on all stretches of nucleotides (of which we have only
one in this particular case). However, in transcriptome studies it could be
useful to only select some stretches of interest, e.g. those overlapping with
particular genes.

The function \Rfunction{computeProbs} computes the posterior probabilities of
modification for all nucleotides in the specified stretches. The function
requires matrices with transformed log-ratios \texttt{LDR\_C} and
\texttt{LDR\_CT}, the numbers of replicates, the strand indicator, the lists
of positions used for computing the null distribution(s) (stored in
\texttt{nuclPosition}) and for the pairwise comparisons (stored in
\texttt{nuclSelection}), and the stretches on which to run the HMM:

<<>>=
posteriors <- computeProbs(LDR_C, LDR_CT, Nc, Nt, '+', nuclPosition,
                             nuclSelection$analysedC, nuclSelection$analysedCT,
                             stretches)
@

The function \Rfunction{computeProbs} compares the log-ratios to the null
distribution(s) and computes empirical $p$-values. These are then passed as
observations to the HMM, which computes posterior probabilities for each
selected nucleotide of being in the unmodified (first column in
\texttt{posteriors}) and modified state (second column in \texttt{posteriors}).

\section{BUMHMM output}

We see that the model assigns very high probabilities of being unmodified by
the chemical probe to the first few nucleotides.

<<>>=
head(posteriors)
@

As the modified positions within a transcript are the ones at which the reverse
transcriptase drops off, the last detected position of the corresponding cDNA
fragment, for which we consequently increment the drop-off count, is the
nucleotide immediately downstream of the modification site. Thus, we need to
shift the probabilities up by one position:

<<>>=
shifted_posteriors <- matrix(NA, nrow = nrow(posteriors), ncol = 1)
shifted_posteriors[1:(length(shifted_posteriors) - 1)] <-
  posteriors[2:nrow(posteriors), 2]
@

We can now plot the probabilities to see the \verb|BUMHMM| output for the DMS
structure probing data for the yeast rRNA 18S.

<<>>=
plot(shifted_posteriors, xlab = 'Nucleotide position',
     ylab = 'Probability of modification',
     main = 'BUMHMM output for 18S DMS data set')
@

We see that most nucleotides are predicted to be in the unmodified state, having
the probability of modification close to 0 (and thus could possibly be
double-stranded or protected by a protein interaction). However, the model also
identified some modified regions, which could correspond to accessible parts of
the molecule. It should be noted that the DMS probe preferentially reacts with
``A'' and ``C'' nucleotides, which effectively makes only a subset of the
structural state of the molecule accessible to probing.

The \verb|BUMHMM| package also provides an option to optimise the shape
parameters of the Beta distribution, which defines the HMM emission model for
the modified state. To optimise parameters with the EM algorithm, the
\Rfunction{computeProbs} function should be called with the last parameter
\texttt{optimise} set to the desired tolerance. Once the previous and current
estimates of the parameters are within this tolerance, the EM algorithm stops
(unless it has already reached the maximum number of iterations).
Further details can be found in \cite{selega2016robust}.

<<eval=FALSE>>=
## Call the function with the additional tolerance parameter
posteriors <- computeProbs(LDR_C, LDR_CT, Nc, Nt, '+', nuclPosition,
                           nuclSelection$analysedC, nuclSelection$analysedCT,
                           stretches, 0.001)
@

By default, the last parameter is set to \texttt{NULL}. During our experiments,
we discovered that this optimisation appeared vulnerable to local minima. Thus,
the current version of the \verb|BUMHMM| pipeline does not use this
optimisation.

%\clearpage
\section{Session Info}
This vignette was compiled using:
<<>>=
sessionInfo()
@

\section{Acknowledgements}
This package was developed at the School of Informatics, University of
Edinburgh with support from Sander Granneman and Guido Sanguinetti.

The manuscript describing the \verb|BUM-HMM| computational model can be found in
\cite{selega2016robust}.

This study was supported in part by grants from the UK Engineering and
Physical Sciences Research Council, Biotechnology and Biological Sciences
Research Council, and the UK Medical Research Council to the University of
Edinburgh Doctoral Training Centre in Neuroinformatics and Computational
Neuroscience (EP/F500385/1 and BB/F529254/1).

\bibliography{vignette}

\end{document}