---
title: "Quick analysis of nucleosome positioning experiments using the nucleR
package"
author:
  - name: Oscar Flores Guri
    affiliation:
    - Institute for Research in Biomedicine
    - Barcelona Supercomputing Center
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

```{r setup, include = FALSE}
options(tinytex.verbose=TRUE)
getCaps <- function ()
{
    label <- knitr::opts_current$get("label")
    txts <- list(
        figcover=c(
            "Variation in the sharpness of the peaks using \\texttt{trim}",
            "attribute."
        ),
        figmnase=c(
            "Toy example of MNase biase correction. Random nucleosomal and",
            "control reads have been generated using \\texttt{synteticNucMap}",
            "function and corrected using \\texttt{controlCorrect}."
        ),
        fignoise=c(
            "Original intensities from tiling array experiment. Smoothing",
            "using a sliding window of variable length (0, 20, 50 and 100 bp)",
            "is presented."
        ),
        figfft=c(
            "Power spectrum of the example Tiling Array data, percentile 1",
            "marked with a dashed line."
        ),
        figfft2=c(
            "Filtering in Tiling Array (up, blue) (1\\% comp.) and NGS (down",
            "red) (2\\% comp.)."
        ),
        figpeak=c(
            "Output of \\texttt{plotPeaks} function. Peaks are spotted in red",
            "and detection threshold marked with an horitzontal line."
        ),
        figpeak2="\\texttt{plotPeaks} function with \\texttt{score=TRUE}.",
        figpeak3=c(
            "\\texttt{plotPeaks} output with \\texttt{score=TRUE} and",
            "\\texttt{width=140}."
        ),
        figranges=c(
            "Simple example of ranges manipulation to plot fuzzy nucleosomes."
        ),
        figsyn=c(
            "Example synthetic coverage map of 90 well-positioned (100-10)",
            "and 20 fuzzy nucleosomes."
        )
    )
    caps <- lapply(txts, paste, collapse=" ")
    return (caps[[label]])
}

knitr::opts_chunk$set(
    collapse   = TRUE,
    comment    = "#>",
    message    = FALSE,
    warning    = FALSE,
    fig.width  = 6,
    fig.height = 2,
    fig.wide   = TRUE
)

```


# Introduction

The \Biocpkg{nucleR2} package provides a high-level processing of genomic
datasets focused in nucleosome positioning experiments, despite they should
be also applicable to chromatin inmunoprecipitation (ChIP) experiments in
general.

The aim of this package is not providing an all-in-one data analysis pipeline
but complement those existing specialized libraries for low-level data
importation and pre-processment into \R{}/\Bioconductor{} framework.

\Biocpkg{nucleR} works with data from the two main high-troughput
technologies available nowadays for ChIP: Next Generation Sequencing/NGS
(ChIP-seq) and Tiling Microarrays (ChIP-on-Chip).

This is a brief summary of the main functions:

* Data import: `readBAM`, `processReads`, `processTilingArray`
* Data transformation: `coverage.rpm`, `filterFFT`, `controlCorrection`
* Nucleosome calling: `peakDetection`, `peakScoring`
* Visualization: `plotPeaks`
* Data generation: `syntheticNucMap`

For more details about the functions and how to use them refer to the
\Biocpkg{nucleR} manual.

This software was published in Bioinformatics Journal. See the paper for
additional information [@Flores2011].


# Reading data

As mentioned previously, \Biocpkg{nucleR} uses the pre-processed data of other
lower level packages for data importation, supporting a few but common formats
that should fulfill the requirements of most users.

`ExpressionSet` from package \Biocpkg{Biobase} is used for Tiling Array
experiments as described in \Biocpkg{Starr} and other packages for the Tiling
Array manipulation. This kind of experiments can be readed with the
`processTilingArray` function.

`AlignedRead` from package \Biocpkg{ShortRead} is recommended for NGS, covering
most of the state of the art sequencing technologies. Additionally, support for
reads in `RangedData` format is also provided (a range per read with a `strand`
column).


## Reading Tiling Arrays

Tiling Arrays are a cheap and fast way to have low-resolution nucleosome
coverage maps. They have been widely used in literature
[@Yuan2005, @Lee2007, @Mavrich2008], but complex statistical methods were
needed for their processing [@Liu2007].

This kind of microarrays cover a part of the genome with certain spacing
between probes which causes a drop in the resolution and originates some
problems. The nucleosome calling from Tiling Array data required hard work on
bioinformatics side and use of heavy and artificious statistical machinery
such as Hidden Markov Models [@Yuan2005, @Lee2007] or higher order Bayesian
Networks [@Kuan2009].

\Biocpkg{nucleR} presents a new method based on a simple but effective peak
calling method which achieves a great performance at low computing cost that
will be presented in subsequent sections.

In order to standardize the data coming both from Tiling Arrays and NGS, the
array fluorescence intensities (usually the ratio of the hybridization of
nucleosomal and control sample) are converted to 1bp resolution by inferring
the missed values from the neighboring probes. This is done by the function
`processTilingArray`:

```{r processtilling, eval=FALSE}
processTilingArray(data, exprName, chrPattern, inferLen=50)
```

An example of a processed dataset is provided in this package. See the help
page of `tilingArray_preproc` for details on how it has been created.
This object is a numeric vector covering the 8000 first positions of chromosome
1 in yeast (*Saccharomices cerevisiae* genome `SacCer1`).

```{r loadTA}
library(nucleR)
library(ggplot2)
library(IRanges)
library(GenomicRanges)
data(nucleosome_tiling)
head(nucleosome_tiling, n=25)
```

This values represent the normalized fluorescence intensity from hybridized
sample of nucleosomal DNA versus naked DNA obtained from \Biocpkg{Starr}. The
values can be either direct observations (if a probe was starting at that
position) or a inferred value from neighboring probes. This data can be passed
directly to the filtering functions, as described later in the section
\@ref(peaks).


## Importing BAM files

Additionally, the function `importBAM`, allows to directly import into \R{} the
mapped reads of a NGS experiment contained in a *BAM* file. The user has to
specify whether the file contains *paired-end* or *single-end* read fragments.

```{r readBAM}
sample.file <- system.file("extdata", "cellCycleM_chrII_5000-25000.bam",
    package="nucleR")
reads <- readBAM(sample.file, type="paired")
head(reads)
```


## Next Generation Sequencing

NGS has become one of the most popular technique to map nucleosome in the
genome in the last years [@Kaplan2009, @Schones2008, @Xi2010]. The drop of
the costs of a genome wide sequencing together with the high resolution
coverage maps obtained, made it the election of many scientists.

The package \Biocpkg{ShortRead} allows reading of the data coming from many
sources (Bowtie, MAQ, Illumina pipeline...) and has become one of the most
popular packages in \R{}/\Bioconductor{} for NGS data manipulation.

A new \R{} package, called \Biocpkg{htSeqTools}, has been recently created to
perform preprocessing and quality assesment on NGS experiments.
\Biocpkg{nucleR} supports most of the output generated by the functions on that
package and recommends its use for quality control and correction of common
biases that affect NGS.

\Biocpkg{nucleR} handles `ShortRead` and `RangedData` data formats. The dataset
`nucleosome_htseq` includes some NGS reads obtained from a nucleosome
positioning experiment also from yeast genome, following a protocol similar to
the one described in [@Lee2007].

The paired-end reads coming from Illumina Genome Analyzer II sequencer were
mapped using Bowtie and imported into \R{} using \Biocpkg{ShortRead}.
Paired ends where merged and sorted according the start position. Those in the
first 8000bp of chromosome 1 where saved for this example. Further details are
in the reference [@Deniz2011]:


```{r import}
data(nucleosome_htseq)
class(nucleosome_htseq)
nucleosome_htseq
```

Now we will transform the reads to a normalized format. Moreover, as the data
is paired-ended and we are only interested in mononucleosomes (which are
typically 147bp), we will discard the reads with a length greater than 200bp,
allowing margin for some underdigestion but discarding extra long reads. Note
that the behaviour of `fragmentLen` is different for single-ended data, see the
manual page of this function for detailed information.

As our final objective is identifying the nucleosome positions, and
\Bioconductor{nucleR} does it from the dyad, we will increase the sharpness of
the dyads by removing some bases from the ends of each read. In the next
example, we will create two new objects, one with the original paired-end reads
and another one with the reads trimmed to the middle 40bp around the dyad
(using the `trim` argument).

```{r processReads}
# Process the paired end reads, but discard those with length > 200
reads_pair <- processReads(nucleosome_htseq, type="paired", fragmentLen=200)

# Process the reads, but now trim each read to 40bp around the dyad
reads_trim <- processReads(nucleosome_htseq, type="paired", fragmentLen=200,
    trim=40)
```

The next step is obtain the coverage (the count of how many reads are in each
position). The standard \Biocpkg{IRanges} package function `coverage` will work
well here, but it is a common practice to normalize the coverage values
according to the total number of short reads obtained in the NGS experiment.
The common used unit is *reads per milon* (r.p.m.) which is the coverage value
divided by the total number of reads and multiplied per one milion. A quick and
efficient way to do this with \Biocpkg{nucleR} is the `coverage.rpm` function.
[^1]

[^1]: Note that conversion in the example dataset gives huge values. This
is because r.p.m. expects a large number of reads, and this dataset is only a
fraction of a whole one. Also take into account that reads from single-ended
(or trimmed reads) and reads from paired-ended could have different mean value
of coverage

```{r coverage}
# Calculate the coverage, directly in reads per million (r.p.m)
cover_pair <- coverage.rpm(reads_pair)
cover_trim <- coverage.rpm(reads_trim)
```

In Figure \@ref(fig:figcover) we can observe the effect of `trim` attribute
plotting both coverages. Note that the coverages are normalized in the range
0--1:


## MNase bias correction

```{r figcover, echo=FALSE, fig.cap=getCaps(), fig.width=5}
# Compare both coverages
t1 <- as.vector(cover_pair[[1]])[1:2000]
t2 <- as.vector(cover_trim[[1]])[1:2000]
t1 <- (t1 - min(t1)) / max(t1 - min(t1)) # Normalization
t2 <- (t2 - min(t2)) / max(t2 - min(t2)) # Normalization
plot_data <- rbind(
    data.frame(
        x=seq_along(t1),
        y=t1,
        coverage="original"
    ),
    data.frame(
        x=seq_along(t1),
        y=t2,
        coverage="trimmed"
    )
)
ggplot(plot_data, aes(x=x, y=y)) +
    geom_line(aes(color=coverage)) +
    xlab("position") +
    ylab("norm coverage")
```

The Microccocal Nuclease is a widely used enzyme that has been proved to have
a biase for certain dinucleotide steps [@Deniz2011]. In this package we offer a
quick way to inspect the effect of such artifact by correcting the profiles of
nucleosomal DNA reads with a mock sample of naked DNA digested with MNase.

The use of this function requires a paired-end control sample and a paired end
or extended single-read nucleosomal DNA sample. A toy example generated using
synthetic data can be found in Figure \@ref(fig:figmnase).

```{r figmnase, echo=c(1:5), fig.cap=getCaps(), fig.width=5}
# Toy example
map <- syntheticNucMap(as.ratio=TRUE, wp.num=50, fuz.num=25)
exp <- coverage(map$syn.reads)
ctr <- coverage(map$ctr.reads)
corrected <- controlCorrection(exp, ctr)

plot_data <- rbind(
    data.frame(
        x=seq_along(exp),
        y=as.vector(exp),
        coverage="normal"
    ),
    data.frame(
        x=seq_along(corrected),
        y=as.vector(corrected),
        coverage="corrected"
    )
)
ggplot(plot_data, aes(x=x, y=y)) +
    geom_line(aes(color=coverage)) +
    xlab("position") +
    ylab("coverage")
```


# Signal Smoothing and Nucleosome Calling {#peaks}

In the previous sections we converted the experimental data from NGS or Tiling
Arrays to a continous, 1bp resolution signal. In this section we will remove
the noise present in the data and score the peaks identified, giving place to
the nucleosome calls.

Previously, in the literature, Hidden Markov Models, Support Vector Machines
or other complex intelligent agents where used for this task
[@Yuan2005, @Lee2007, @Kuan2009, @Chen2010, @Xi2010]. This was needed for
dealing with the noise and uncertain characterization of the fuzzy positioning
of the nucleosomes.

Despite this approach is a valid way to face the problem, the use of such
artificious constructs is difficult to implement and sometimes requires a
subjective modeling of the solution, constraining or at least conditioning the
results observed.

The method presented here proposes to *keep it simple*, allowing the
researcher to study the results he or she is interested *a posteriori*.

\Biocpkg{nucleR} aim is to evaluate where the nucleosomes are located and how
accurate that position is. We can find a nucleosome read in virtually any place
in the genome, but some positions will show a high concentration and will allow
us to mark this nucleosome as **well-positioned** whereas other will be less
phased giving place to **fuzzy** or **de-localized** nucleosomes [@Jiang2009].

We think it's better to provide a detailed but convenient identification of
the relevant nucleosome regions and score them according to its grade of
fuzziness. From our point of view, every researcher should make the final
decision regarding filtering, merging or classifying the nucleosomes according
its necessities, and \Biocpkg{nucleR} is only a tool to help in this *dirty*
part of the research.


## Noise removal

NGS and specially Tiling Array data show a very noisy profile which
complicates the process of the nucleosome detection from peaks in the signal.
A common approach used in the literature is smooth the signal with a sliding
window average and then use a Hidden Markov Model to calculate the
probabilities of having one or another state.

```{r fignoise, echo=FALSE, fig.cap=getCaps(), fig.height=1.5}
windowFilter <- function (x, w) {
    if (missing(w)) {
        return(x)
    } else {
        y <- filter(x, rep(1, w)/w)
        return(as.vector(y))
    }
}

mkEntry <- function (x, i, w, lab) {
    if (missing(lab)) {
        lab <- sprintf("slinding w. %i bp", w)
    }
    df <- data.frame(x=i, y=windowFilter(x[i], w), lab=lab)
    df[!is.na(df[, "y"]), ]
}

i <- 1:2000
plot_data <- rbind(
    mkEntry(nucleosome_tiling, i, 1, "original"),
    mkEntry(nucleosome_tiling, i, 20),
    mkEntry(nucleosome_tiling, i, 50),
    mkEntry(nucleosome_tiling, i, 100)
)

ggplot(plot_data, aes(x=x, y=y)) +
    geom_line() +
    facet_grid(.~lab) +
    xlab("position") +
    ylab("intensity")
```

As can be seen in Figure \@ref(fig:fignoise), data needs some smoothing to be
interpretable, but a simple sliding window average is not sufficient. Short
windows allow too much noise but larger ones change the position and the shape
of the peaks.

\Biocpkg{nucleR} proposes a method of filtering based on the Fourier Analysis
of the signal and the selection of its principal components.

Any signal can be described as a function of individual periodic waves with
different frequencies and the combination of them creates more complex
signals. The noise in a signal can be described as a small, non periodic
fluctuations, and can be easily identified and removed [@Smith1999].

\Biocpkg{nucleR} uses this theory to transform the input data into the Fourier
space using the Fast Fourier Transform (FFT). A FFT has a real and a imaginary
component. The representation of the real component it's called the power
spectrum of the signals and shows which are the frequencies that have more
weight (power) in the signal. The low frequency components (so, very periodic)
usually have a huge influence in the composite signal, but its rellevance
drops as the frequency increases.

We can look at the power spectrum of the example dataset with the following
command:

```{r figfft, fig.cap=getCaps(), fig.width=5}
fft_ta <- filterFFT(nucleosome_tiling, pcKeepComp=0.01, showPowerSpec=TRUE)
```

In the Figure \@ref(fig:figfft) only the half of the components are plotted, as
the spectrum is repeated symmetrically respect to its middle point. The first
component (not shown in the plot), has period 1, and, in practice, is a count
of the lenght of the signal, so it has a large value.

High frequency signals are usually echoes (repeating waves) of lower
frequencies, i.e. a peak at 10 will be the sum of the pure frequence 10 plus
the echo of the frequency 5 in its 2nd repetition. Echoes can be ignored
without losing relevant information.

The approach \Biocpkg{nucleR} follows is supposing that with just a small
percentage of the components of the signal, the input signal can be recreated
with a high precision, but without a significant amount of noise. We check
empirically that with 1% or 2% of the components (this means account 1 or 2
components for each 100 positions of the genomic data) it's enough to recreate
the signal with a very high correlation (>0.99). Tiling Array could require
more smoothing (about 1% should be fine) and NGS has less noise and more
components can be selected for fitting better the data (about 2%), See Figure
\@ref(fig:figfft) for the selected components in the example.

In order to easy the choice of the `pcKeepComp` parameter,
\Biocpkg{nucleR} includes a function for automatic detection of a fitted value
that provides a correlation between the original and the filtered profiles
close to the one specified. See the manual page of `pcKeepCompDetect` for
detailed information.

In short, the cleaning process consists on converting the coverage/intensity
values to the Fourier space, and knock-out (set to 0) the components greater
than the given percentile in order to remove the noise from the profile. Then
the inverse Fast Fourier Transform is applyied to recreate the filtered
signal. In Figure \@ref(fig:figfft2) the filtered signal is overlapped to the
raw signal.

The cleaning of the input has almost no effect on the position and shape of
the peaks, mantaining a high correlation with the original signal but allowing
achieve a great performance with a simple peak detection algorithm:

```{r figfft2, echo=FALSE, fig.cap=getCaps(), fig.width=5, fig.height=4}
i <- 1:3000
tiling_raw <- nucleosome_tiling[i]
tiling_fft <- filterFFT(tiling_raw, pcKeepComp=0.01)
htseq_raw <- as.vector(cover_trim[[1]])[i]
htseq_fft <- filterFFT(htseq_raw, pcKeepComp=0.02)

plot_data <- rbind(
    data.frame(x=i, y=tiling_raw, lab="intensity", treatment="raw"),
    data.frame(x=i, y=tiling_fft, lab="intensity", treatment="filtered"),
    data.frame(x=i, y=htseq_raw,  lab="coverage",  treatment="raw"),
    data.frame(x=i, y=htseq_fft,  lab="coverage",  treatment="filtered")
)

ggplot(plot_data, aes(x=x, y=y)) +
    geom_line(aes(color=treatment)) +
    facet_grid(lab~., scales="free_y") +
    ylab("") +
    xlab("position")
```

```{r corfft}
tiling_raw <- nucleosome_tiling
tiling_fft <- filterFFT(tiling_raw, pcKeepComp=0.01)
htseq_raw <- as.vector(cover_trim[[1]])
htseq_fft <- filterFFT(htseq_raw, pcKeepComp=0.02)

cor(tiling_raw, tiling_fft, use="complete.obs")
cor(htseq_raw, htseq_fft, use="complete.obs")
```


## Peak detection and Nucleosome Calling

After noise removal, the calling for nucleosomes is easy to perform. In
nucleosome positioning, in contrast with other similar experiments like ChIP,
the problem for the peaks detection algorithms is deal with the presence of an
irregular signal which causes lots of local maxima (i.e., peaks due to noise
inside a real peak). Here, we avoid this problem applying the FFT filter,
allowing the detection of peaks in a simple but efficient way just looking for
changes in the trend of the profile. This is implemented in the `peakDetection`
function and results can be represented with the function `plotPeaks`:

```{r figpeak, fig.cap=getCaps()}
peaks <- peakDetection(htseq_fft, threshold="25%", score=FALSE)
peaks

plotPeaks(peaks, htseq_fft, threshold="25%", ylab="coverage")
```

All the peaks above a threshold value are identified. Threshold can be set to
0 for detecting all the peaks, but this is not recommended as usually small
fluctuations can apear in bottom part of the profile. This package also
provides an automatic scoring of the peaks, which accounts for the two main
features we are interested in: the height and the sharpness of the peak.

The *height* of a peak is a direct measure of the reads coverage in the peak
position, but represented as a probability inside a Normal distribution.

The *sharpness* is a measure of how fuzzy is a nucleosome. If a peak is very
narrow and the surrounding regions are depleted, this is an indicator of a good
positioned nucleosome, while wide peaks or peaks very close to each other are
probably fuzzy nucleosomes (despite the coverage can be very high in this
region).

Scores can be calculated with the `peakScoring` function or directly with the
argument `score=TRUE` in `peakDetection`.

```{r figpeak2, fig.cap=getCaps()}
peaks <- peakDetection(htseq_fft, threshold="25%", score=TRUE)
head(peaks)

plotPeaks(peaks, htseq_fft, threshold="25%")
```

The scores in Figure \@ref(fig:figpeak2) only account for the punctual height
of the peak. As said previously, this measure can be improved by accounting
the fuzzyness of a nucleosome call (the sharpness of the peak). This requires
a way to account for longer range peaks, which can be obtained with the
`width` argument. In this way one can convert the identified nucleosome dyads
to whole nucleosome length ranges and account for its degree of fuzzyness:

```{r figpeak3, fig.cap=getCaps()}
peaks <- peakDetection(htseq_fft, threshold="25%", score=TRUE, width=140)
peaks

plotPeaks(peaks, htseq_fft, threshold="25%")
```

Note than in Figure \@ref(fig:figpeak3) overlapped peaks in a width and tall
region are penalized, meanwhile the peaks with surrounding depleted regions
have a higher relative score. This is the approach recommended for working
with nucleosome calls.

Nucleosome calls filtering, merging or classification can be performed with
standard Biocpkg{IRanges} functions, shuch as `reduce`, `findOverlaps` or
`disjoint`.

The next example shows a simple way to merge those nucleosomes which are
overlap accounting them as a fuzzy regions:

```{r figranges, echo=-6, fig.cap=getCaps()}
nuc_calls <- peaks[peaks$score > 0.1, ]
red_calls <- reduce(nuc_calls)
red_class <- GRanges(red_calls, isFuzzy=width(red_calls) > 140)
red_class

plotPeaks(red_calls, htseq_fft, threshold="25%")
```


# Exporting data

`export.wig` and `export.bed` allow exportation of coverage/intensity values
and nucleosome calls in a standard format which works on most of the genome
browsers available today (like UCSC Genome Browser or Integrated Genome
Browser).

`export.wig` creates WIG files wich are suitable for coverage/intensities,
meanwhile `export.bed` creates BED files which contain ranges and scores
information, suitable for calls.


# Generating synthetic maps

\Biocpkg{nucleR} includes a synthetic nucleosome map generator, which can be
helpful in benchmarking or comparing data against a random map.
`syntheticNucMap` function does that, allowing a full customization of the
generated maps.

When generating a map, the user can choose the number of the well-positioned
and fuzzy nucleosome, as their variance or maximum number of reads. It also
provides an option to calculate the ratio between the generated nucleosome map
and a mock control of random reads (like a naked DNA randomly fragmented
sample) to simulate hybridation data of Tiling Arrays.

The perfect information about the nucleosome dyads is returned by this
function, together with the coverage or ratio profiles.

See the man page of this function for detailed information about the different
parameters and options.

```{r figsyn, fig.cap=getCaps(), fig.height=3}
syn <- syntheticNucMap(wp.num=100, wp.del=10, wp.var=30, fuz.num=20,
    fuz.var=50, max.cover=20, nuc.len=147, lin.len=20, rnd.seed=1,
    as.ratio=TRUE, show.plot=TRUE)
```


# References