%\VignetteIndexEntry{An introduction to FunChIP}
%\VignetteDepends{}
%\VignetteKeywords{ChIP-Seq, GRanges, shape, functional data analysis}
%\VignettePackage{FunChIP}
%\VignetteEngine{utils::Sweave}

\documentclass{article}

<<style, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\usepackage[algoruled]{algorithm2e}
\usepackage{color}
\usepackage{amsfonts}
\usepackage{graphicx}

\definecolor{bronze}{rgb}{0.93, 0.53, 0.18}
\definecolor{darkblue}{rgb}{0, 0.3, 0.6}
\SetKw{align}{\color{darkblue}Alignment:}
\SetKw{template}{Template:}
\SetKw{assign}{Assignment:}
\SetKw{norm}{Normalization:}

\newcommand{\bam}{\texttt{BAM}}

\title{\Biocpkg{FunChIP}: A Functional Data Analysis approach to cluster
ChIP-Seq peaks according to their shapes}
\author{Alice Parodi \email{alicecarla.parodi@polimi.it}
\\ Laura M. Sangalli
\\ Piercesare Secchi
\\ Simone Vantini
\\ Marco J. Morelli }
%\date{Modified: February 24, 2016. Compiled: \today}
\date{\today}

\begin{document}
\SweaveOpts{concordance=FALSE}
\SweaveOpts{background = "#C0C0C0", size = tiny}

\maketitle

\tableofcontents

<<options,echo=FALSE>>=
options(width=90)
@

<<preliminaries>>=
library(FunChIP)
@


\section{Introduction}
The \Biocpkg{FunChIP} package provides a set of methods for the
\Rclass{GRanges} class of the package \Biocpkg{GenomicRanges} to cluster ChIP-Seq
peaks according to their shapes, starting from a \bam{}
file containing the aligned reads and a \Rclass{GRanges}
object with the corresponding enriched regions.


\section{Input and Preprocessing}\label{sect:prep}
ChIP-Seq enriched regions are provided by the user
in a \Rclass{GRanges} object \Rcode{GR}. The user must
provide the \bam{} file containing the reads aligned on the positive and negative strands of the DNA. From the \bam{} file we can compute, for each
region of the  \Rclass{GRanges} (let $N$ be the total number of regions),
the base-level coverage separately
for  positive and negative reads. These two count vectors are used to
estimate the distance $d_{pn}$ between positive and negative
reads and then the total length of the fragments of the ChIP-Seq experiment $d$.
In particular, we assume that the positive and negative counts
measure the same signal, shifted by $d_{pn}$, as they are computed from the
two ends of the sequencing fragments.
The global length of the fragment is the sum of the read length of the
\bam{} file, $r$\footnote{If
multiple read lengths are present in the \bam{} file, $r$ is estimated as the
average length.}, and the distance between the positive and negative coverage, $d_{pn}$:
$$
d = d_{pn} + r.
$$
 The function
\Rfunction{compute\_fragments\_length} computes, from the \Rclass{GRanges}
object and the \bam{} file, the estimated length of the fragments. Given a
range for $d_{pn}$: $[d_{\min}; d_{\max}]$,
the optimum distance $d_{pn}$ is
$$
d_{pn} = \textrm{argmin}_{\delta \in [d_{\min}; d_{\max}]} \sum_{n = 1}^N D(f_{n+}, f_{n-}^{\delta}),
$$
where $f_{n+}$ is the positive coverage function of the $n$-th region,
and $f_{n-}^{\delta}$ is the negative coverage of the $n$-th region, shifted
by $\delta$. The distance $D$ is the square of the $L^2$ distance
between the coverages, normalized by the width of the region. The definition of
the $L^2$ distance is detailed in Section \ref{sect:kmean}.
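
To make the estimate of $d_{pn}$ concrete, the following chunk (not evaluated) sketches the brute-force search over $\delta$ on a single simulated region; the vectors \Rcode{f.pos} and \Rcode{f.neg} and the read length \Rcode{r} are toy quantities introduced only for illustration, while \Rfunction{compute\_fragments\_length} performs the corresponding computation on the whole \Rclass{GRanges} object and the \bam{} file.

<<dpn_sketch, eval=FALSE>>=

# Minimal sketch (not run): f.pos and f.neg are toy positive- and
# negative-strand coverage vectors on the same grid of a single region,
# and r is a toy read length.

true.shift <- 120
signal <- dnorm(seq(-3, 3, length.out = 500)) * 100
f.pos  <- c(signal, rep(0, 200))
f.neg  <- c(rep(0, true.shift), signal, rep(0, 200 - true.shift))
r      <- 50

# squared L2 distance between f.pos and f.neg shifted by delta,
# normalized by the width of the region
shift_distance <- function(f.pos, f.neg, delta) {
  shifted <- c(f.neg[-seq_len(delta)], rep(0, delta))
  sum((f.pos - shifted)^2) / length(f.pos)
}

deltas <- 0:300
D      <- sapply(deltas, shift_distance, f.pos = f.pos, f.neg = f.neg)
d.pn   <- deltas[which.min(D)]   # recovers the simulated shift (120)
d      <- d.pn + r               # estimated fragment length

@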


<<fragment_length, fig=TRUE, include=FALSE, width=8, height=8>>=

# load the GRanges object associated
# to the ChIP-Seq experiment on the
# transcription factor c-Myc in murine cells

data(GR100)

# name of the .bam file (the
# .bam.bai index file must also be present)

bamf <- system.file("extdata", "test.bam",
                    package="FunChIP", mustWork=TRUE)

# compute d

d <- compute_fragments_length(GR, bamf, min.d = 0, max.d = 300)
d

@
%%

Figure \ref{FunChIP-fragment_length} shows the distance
function as the parameter $\delta$ varies, together with its minimizer $d_{pn}$.

\incfig{FunChIP-fragment_length}{0.5\textwidth}{Identification of d.}
      {The optimal value of $d_{pn}$ is presented: it is the minimum of the global distance function.}

Once we have correctly identified the fragment length we can compute
the final coverage function to obtain the shape of the
peaks.
The \Rfunction{pileup\_peak} method for the \Rclass{GRanges}
class uses the \bam{} file to compute the base-level coverage on these regions, once the reads are extended up to their
final length $d$. \Rfunction{pileup\_peak} adds to the
\Rclass{GRanges}  a \Rcode{counts} metadata column, containing
for each region a vector with length equal to the width of the region storing the coverage function.

<<pileup_peak>>=

# associate to each peak
# of the GRanges object the correspondent
# coverage function

peaks <- pileup_peak(GR, bamf, d = d)
peaks

@
%%


Additional information can be found in the help page of the \Rfunction{pileup\_peak} method.

\section{Smoothing}
The \Rcode{counts} metadata column is approximated by a combination of splines
to guarantee the smoothness and regularity needed for the subsequent analysis, as
described in the following Sections.\\
The preprocessing steps carried out in the \Rfunction{smooth\_peak} method are the following:
\begin{itemize}
\item \textit{Removal of the background and extension.} In ChIP-Seq experiments, peaks may have an
additive noisy background, and the removal of this background is mandatory to compare different peaks.
The background is estimated as a constant value ``raising'' the peak, equal to the
minimum value assumed by the coverage. Consequently, once the background has been removed, each peak
has zero as its minimum value, and can therefore be indefinitely extended with zeros, if necessary.
How this choice affects the algorithm is discussed in Section \ref{sect:kmean}.
\item \textit{Smoothing.} In order to be regular enough to compute derivatives, a peak has to be
transformed into a suitable functional object, as described in Section \ref{sect:kmean}.
The smoothing of the count vector $c$ is performed through the projection of $c$ on a cubic B-spline
basis $\Phi = \{\phi_1, \ldots \phi_K\}$ with a penalization on the second derivative \cite{ramsay}.
The result is a spline approximation of the data, which is continuous
on the whole domain, together with its first-order derivatives. Moreover, the penalization on
the second derivative controls the global regularity of the function,
avoiding over-fitting and a consequently noisy spline. The spline approximation
$s = \sum_{k = 1}^K \theta_k \phi_k$
of the count vector $ c = \{c_j\}$ is defined minimizing
$$
S(\lambda) = \sum_{j = 1}^n \left[ c_j - s(x_j) \right]^2 + \lambda \int \! \left[ s''(x)\right] ^2 dx,
$$
with $x_j$ being the relative genomic coordinate of the counts.
The multiplying coefficient $\lambda$ quantifies the penalization on the second derivative and is chosen
through the Generalized Cross Validation ($GCV$) criterion.
For each peak $i$ the $GCV_i$ index is computed with a leave-one-out cross validation
$$
GCV_i = \left(\frac{n}{n - df(\lambda)} \right) \left( \frac{SSE_i}{n-df(\lambda)}\right)
$$
and then it is summed on the whole data set to obtain the global $GCV$. The number of degrees of freedom
$df(\lambda)$ is automatically computed from the definition of the basis $\Phi$.\\
The error $SSE_i$ can be computed either on the data ($SSE_i^0$) or on the derivatives ($SSE_i^1$), to
control the regularity of the function or the regularity of the derivatives, respectively:
$$
SSE_i^0 = \sqrt{\sum_{j = 1}^n \left( c_j - s(x_j) \right)^2} \,\,\, \textrm{or} \,\,\,
SSE_i^1 = \sqrt{\sum_{j = 1}^{n-1} \left( \nabla c_j - s'(x_j) \right)^2} ,
$$
with $\nabla c_j$ being the finite-difference approximation of
the derivative of the \Rcode{counts} vector $c$ of peak $i$ ($c = c(i)$), while
$s'(x_j)$ is the evaluation of the first derivative $s' = s'(i)$ on the genomic coordinates.
For further details on
the spline definition see the \Rfunction{spline} function of the \CRANpkg{fda} package.
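A minimal sketch of this penalized smoothing, built with the \CRANpkg{fda} package, is shown right after this list.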
\item\textit{Scaling of the peaks.} This optional preprocessing step makes all the curves have the same width and area. In particular, all the abscissa grids are scaled to match the smallest grid in the data set, while the $y$-values are scaled so that the area of every curve equals 1.
\end{itemize}
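
To make the smoothing step concrete, the following chunk (not evaluated) is a minimal sketch of the penalized B-spline smoothing described above, applied with the \CRANpkg{fda} package to the coverage of a single peak. It assumes, for illustration only, that the \Rcode{counts} metadata column stores one numeric vector per peak, and it mimics in a simplified way what \Rfunction{smooth\_peak} does internally (background removal, penalization of the second derivative and choice of $\lambda$ through the $GCV$).

<<smoothing_sketch, eval=FALSE>>=

library(fda)

# coverage of one peak (assuming the counts column stores one vector
# per peak), with the constant background removed
counts <- peaks$counts[[1]]
counts <- counts - min(counts)
x      <- seq_along(counts)

# cubic B-spline basis with a knot at every base of the region
basis <- create.bspline.basis(rangeval = range(x), breaks = x, norder = 4)

# GCV over a grid of lambda values (uniform on the log10 scale)
lambda.grid <- 10^seq(-4, 12, by = 1)
gcv <- sapply(lambda.grid, function(lambda) {
  pen <- fdPar(basis, Lfdobj = 2, lambda = lambda)   # penalty on s''
  smooth.basis(x, counts, pen)$gcv
})

# refit with the optimal lambda and evaluate the spline and its derivative
lambda.opt <- lambda.grid[which.min(gcv)]
fit <- smooth.basis(x, counts,
                    fdPar(basis, Lfdobj = 2, lambda = lambda.opt))
spline.values <- eval.fd(x, fit$fd)
spline.deriv  <- eval.fd(x, fit$fd, 1)

@
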
The \Rfunction{smooth\_peak} method approximates the \Rcode{counts}
metadata by removing the background, computing the spline and, optionally, defining the scaled approximation.
Focusing on the spline approximation, \Rfunction{smooth\_peak} automatically chooses the optimal $\lambda$ parameter
according to the $GCV$ criterion; the user can
decide whether to consider the data or the derivatives to compute the $SSE$.


<<figureGCV, fig=TRUE, include=FALSE, width=11, height=5.6>>=

# the method smooth_peak
# removes the background and defines the spline
# approximation from the previously computed peaks
# with lambda estimated from the
# GCV on derivatives. The method spans a non-uniform
# grid for lambda from 10^-4 to 10^12.
# ( the grid is uniform for log10(lambda) )

peaks.smooth <- smooth_peak(peaks, lambda = 10^(-4:12),
                            subsample.data = 50, GCV.derivatives = TRUE,
                            plot.GCV = TRUE, rescale = FALSE )

@
\incfig{FunChIP-figureGCV}{0.8\textwidth}{Generalized Cross Validation index.}
      {$GCV$ computed on data (left), and on the derivatives (right), as a function of $\lambda$.}

In Figure \ref{FunChIP-figureGCV}, the plot of the $GCV$ for both data and
derivatives is shown. From this Figure we see that the optimal value of $\lambda$, which minimizes the $GCV$ for the derivatives, is also associated with a small value of the $GCV$ for the data, thus supporting the automatic choice.

<<smooth>>=

# the automatic choice is lambda = 10^6

peaks.smooth <- smooth_peak(peaks, lambda = 10^6,
                             plot.GCV = FALSE)
head(peaks.smooth)

# maintaining this choice of lambda, smooth_peak
# can also define the scaled approximation
# of the spline

peaks.smooth.scaled <- smooth_peak(peaks, lambda = 10^6,
                             plot.GCV = FALSE, rescale = TRUE)
head(peaks.smooth.scaled)

@

Now the \Rclass{GRanges} object contains, besides \Rcode{counts}, 5 new metadata
columns with the spline approximation evaluated on the base-level grid, its
derivatives, the width of the spline and the new starting and ending points
(see Figure \ref{FunChIP-plotBOTH}). For
a more detailed description of the metadata columns, see the help page of
the \Rfunction{smooth\_peak} method.

With the introduction of the smoothing, the counts at the edges of the peak are smoothly connected
to 0, and therefore new values different from zero may be introduced.
In order to maintain regularity, the grid is extended up to the new boundaries.

When the option \Rcode{rescale = TRUE} is added to \Rfunction{smooth\_peak}, the method returns, besides the 5 metadata columns previously introduced, 2 more metadata columns with the scaled approximation of the spline and its derivatives.

Once the spline approximation is defined, the summit of the smoothed peak (or of the scaled peak),
i.e. of its spline approximation,
can be detected. The summit will be used to initialize the peak alignment procedure, described in Section
\ref{sect:kmean}, and it can either be a
user-defined parameter, stored in a vector of the same length as \Rcode{GR}, or
automatically computed as the maximum height of the spline. The summit is stored in the
new metadata column \Rcode{summit\_spline}. If the \Rcode{rescale} option is set to \Rcode{TRUE}, the summit of the scaled approximation is also returned in the metadata column \Rcode{summit\_spline\_rescaled}.

<<summit>>=

# summit_peak identifies the maximum point
# of the smoothed peaks

peaks.summit <- summit_peak(peaks.smooth)
head(peaks.summit)

# summit_peak can also identify the maximum
# point of the scaled approximation

peaks.summit.scaled <- summit_peak(peaks.smooth.scaled, 
                            rescale = TRUE)
head(peaks.summit.scaled)
@

\section{The k-mean alignment algorithm and the \Rfunction{cluster\_peak} method }\label{sect:kmean}
The k-mean alignment algorithm is an efficient method to classify functional data allowing for general transformations of the abscissae \cite{kma}; this general method is implemented in the package  \CRANpkg{fdakma} and various applications to real datasets are presented in \cite{Sangalli:2014}, \cite{Bernardi:2014}, \cite{Patriarca:2014}.

In particular, given

\begin{itemize}
\item a set of curves $s_1, \ldots, s_n$,
\item the number of clusters $K$,
\item a distance function $d(s_i, s_j)$ between two curves $s_i$ and $s_j$, as for example the integral of the difference $s_i - s_j$,
\item a family of warping functions $\mathcal{W}$ to transform the abscissae of the curves and therefore align the peaks. Generally,
$\mathcal{W}$ is the set of shifts or dilations or affine transformations (shift + dilation),
\end{itemize}
the algorithm, presented in Algorithm \ref{kmean_box}, is an iterative procedure to split the
curves into $K$ clusters. The introduction
of the warping function $h \in \mathcal{W}$ allows each curve to be shifted, dilated, or both,
to define the minimum distance between curves.
The new curve $s \circ h$ has the same values of $s$, but its
abscissa grid is modified.
\incfig{FunChIP-alignment}{0.8\textwidth}{Alignment procedure.}
      {Representation of two smoothed peaks. In the left panel they are not
      aligned, while in the right panel they are aligned with an integer shift.}
For example, in Figure \ref{FunChIP-alignment} two peaks are presented: in the left panel,
they are not aligned, while the right panel shows the effects of alignment; the transformation of the
abscissae (shift transformation) makes the two peaks more similar,
and the distance $d$ is no longer affected
by an artificial phase difference. The code generating Figure \ref{FunChIP-alignment}
calls \Rfunction{cluster\_peak} and \Rfunction{plot\_peak}, which
are described in Section \ref{sect:cluster} and Section \ref{sect:plot}.

<<alignment, fig=TRUE, include=FALSE, width=12, height=5.6>>=

# representation of two peaks

par (mfrow = c(1,2))
plot_peak(peaks.summit, index = c(6,7), col=c('red',2), cex.main = 2, cex.lab = 2, cex.axis = 1.5, lwd = 2)
aligned.peaks <- cluster_peak(peaks.summit[c(6,7)], parallel = FALSE ,
                                    n.clust = 1, seeds = 1, shift.peak = TRUE,
                                    weight = 1, alpha = 1, p = 2, t.max = 2,
                                    plot.graph.k = FALSE, verbose = FALSE)
aligned.peaks

# shift coefficients
aligned.peaks$coef_shift
plot_peak(aligned.peaks, col = 'forestgreen',
          shift = TRUE, k = 1, cluster.peak = TRUE,
          line.plot = 'spline', 
          cex.main = 2,  cex.lab = 2, cex.axis = 1.5, lwd = 2)
@


For the specific case of ChIP-Seq data, the admitted warping functions for the k-mean alignment algorithm
(in the \Rfunction{cluster\_peak} method), are integer shifts:
\begin{equation}\label{eq:warping}
\mathcal{W} = \left\{ h : h(t) = t + q \textrm{ with } q \in \mathbb{Z} \right\}.
\end{equation}
In other words, with this choice, peaks can be shifted by integer values in the \emph{alignment} procedure of the algorithm.
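
As an illustration of this alignment step, the following chunk (not evaluated) sketches the search for the optimal integer shift of Equation (\ref{eq:warping}) between two curves evaluated on the same integer grid and already extended with zeros; \Rcode{s1} and \Rcode{s2} are toy vectors introduced only for this example, while \Rfunction{cluster\_peak} performs the analogous search of each peak against the cluster templates.

<<shift_sketch, eval=FALSE>>=

# Minimal sketch (not run): exhaustive search of the optimal integer
# shift q between two toy curves s1 and s2 on a common grid.
best_integer_shift <- function(s1, s2, q.max) {
  shifts <- -q.max:q.max
  dist.q <- sapply(shifts, function(q) {
    # s2 composed with h(t) = t + q
    if (q >= 0) s2.shift <- c(s2[-seq_len(q)], rep(0, q))
    else        s2.shift <- c(rep(0, -q), s2[seq_len(length(s2) + q)])
    sqrt(sum((s1 - s2.shift)^2))     # discrete L2-type distance
  })
  list(q = shifts[which.min(dist.q)], dist = min(dist.q))
}

s1 <- c(rep(0, 20), dnorm(seq(-3, 3, length.out = 60)), rep(0, 20))
s2 <- c(rep(0, 35), dnorm(seq(-3, 3, length.out = 60)), rep(0, 5))
best_integer_shift(s1, s2, q.max = 30)   # recovers a shift of 15

@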

\begin{algorithm}
Given a set of functions $s_1, \ldots, s_n$ and a number $K$ of clusters\\
\template random choice (if not provided) of the initial centers of the clusters $c_1 \ldots, c_k$\\
\While{ decrease of the distance higher than a fixed threshold }{
\ForEach {$i \in 1:n$} {
\align $s_i$ is aligned to each template $c_k$: the optimal warping function $h_{i,k}^{\star}$ in $\mathcal{W}$ is detected
$$
h^{\star}_{i,k} = \textrm{argmin}_{h \in \mathcal{W}} \,\, d(c_k, s_i \circ h)
$$
with the corresponding distance $d^{\star}_{i,k} = \min_{h \in \mathcal{W}}  d(c_k, s_i \circ h)$\\
\assign $s_i$ is assigned to the best cluster
$$
k_i^{\star} = \textrm{argmin}_{k \in 1 : K} \,\, d^{\star}_{i,k}
$$
}
\ForEach {$k \in 1 : K$}
{
\template identification of the new template of the cluster $c_k$\\
\norm the average warping function of the curves belonging to cluster $k$ is set to be the identity transformation
$$
h(t) = t
$$
}
}
\caption{k-mean alignment algorithm}
\label{kmean_box}
\end{algorithm}

In the \Rfunction{cluster\_peak} method the distance between two curves $s_1$ and $s_2$ is defined as
\begin{eqnarray}\label{eq:dist}
d (s_1, s_2) & = & (1 - \alpha) \, d_0(s_1, s_2) + \alpha \, w  \, d_1(s_1, s_2) = \nonumber\\
 & = & (1 - \alpha)\, \| s^e_1 - s^e_2 \|_p + \alpha \, w \, \| (s^e_1)' - (s^e_2)' \|_p,
\end{eqnarray}
where
\begin{itemize}
\item $\| f \|_p$ is the $p$ norm of $f$. In particular, for $p = 0$, $\| \cdot \|_p$ is the $L^{\infty}$ norm
$$
\| f \|_0 = \|f \|_{L^{\infty}} = \max_{x \in U} |f(x)|,
$$
with $U$ being the domain of $f$.\\
For $p=1$, $\| \cdot \|_p$ is the $L^1$ norm
$$
\| f \|_1 = \|f \|_{L^{1}} = \int_U |f(x)| dx.
$$
And for $p=2$, $\| \cdot \|_p$ is the $L^2$ norm
$$
\|f \|_2 = \|f \|_{L^{2}} = \left( \int_U \left(f(x)\right)^2 dx \right)^{1/2}.
$$
\item $s_1^e$ and $s_2^e$ are the functions $s_1$ and $s_2$ extended with zeros where not defined, after
their backgrounds have been removed (see Section \ref{sect:prep}). The distance function is computed on the
union of the domains of $s_1$ and $s_2$ ($U$);
$s_1$ and $s_2$ need to be extended to cover the whole $U$.
\item $\alpha \in [0,1]$ is a coefficient tuning the contributions of the norm of the data
and the norm of the derivatives. If $\alpha = 0$, the distance is computed on the data,
while if $\alpha = 1$ it is based on  the derivatives. Intermediate values balance these two
contributions: increasing the relevance given to the derivatives emphasizes the shapes
of the peaks, while data are more related to the height.
\item $w$ is a weight coefficient, essential to make the norm of
the data and of the derivatives comparable. It can be user defined or computed inside the
 \Rfunction{cluster\_peak} method. A suggestion for computing  the weight $w$ is
given in Section \ref{sect:weight}.
\end{itemize}
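
The following chunk (not evaluated) is a minimal numerical sketch of the distance in Equation (\ref{eq:dist}) for $p = 2$, computed on two toy splines evaluated on a common grid, with the derivatives approximated by finite differences; \Rfunction{distance\_peak} and \Rfunction{cluster\_peak} implement the corresponding computation on the \Rclass{GRanges} metadata columns.

<<distance_sketch, eval=FALSE>>=

# Minimal sketch (not run): s1 and s2 are toy splines evaluated on the
# union of their domains and already extended with zeros.
peak_distance <- function(s1, s2, alpha = 0.5, w = 1) {
  d0 <- sqrt(sum((s1 - s2)^2))                  # norm of the data
  d1 <- sqrt(sum((diff(s1) - diff(s2))^2))      # norm of the derivatives
  (1 - alpha) * d0 + alpha * w * d1
}

s1 <- c(rep(0, 10), dnorm(seq(-3, 3, length.out = 80)), rep(0, 10))
s2 <- c(rep(0, 10), 0.8 * dnorm(seq(-2, 4, length.out = 80)), rep(0, 10))

peak_distance(s1, s2, alpha = 0)           # based on the data only
peak_distance(s1, s2, alpha = 1)           # based on the derivatives only
peak_distance(s1, s2, alpha = 0.5, w = 1)  # balanced contribution

@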


\subsection{Definition of weight in the distance function} \label{sect:weight}
If not provided, the method \Rfunction{cluster\_peak} defines $w$ as
$$
w = \textrm{median} \left( \frac{d_0(s_i, s_j)}{d_1(s_i, s_j)} \right)
$$
where $d_0(i,j)=\| s^e_i - s^e_j \|_p$ and $d_1(i, j) = \| (s^e_i)' - (s^e_j)' \|_p$.
These two distance matrices can be automatically computed with the \Rfunction{distance\_peak} function.

<<weight>>=

# compute the weight from the peaks of the dataset

dist_matrix <- distance_peak(peaks.summit)
# dist matrix contains the two matrices d_0(i,j)
# and d_1(i,j), used to compute w
names(dist_matrix)

ratio_norm <- dist_matrix$dist_matrix_d0 / dist_matrix$dist_matrix_d1
ratio_norm_upper_tri <- ratio_norm[upper.tri(ratio_norm)]
summary(ratio_norm_upper_tri)
# suggestion: use the median as weight
w <- median(ratio_norm_upper_tri)
w

@

\subsection{The \Rfunction{cluster\_peak} method} \label{sect:cluster}
The two main characteristics of the k-mean alignment algorithm used in
\Biocpkg{FunChIP} are the distance function $d$ (defined in Equation (\ref{eq:dist})),
used to compute the  distance between
curves, and the set of warping functions $\mathcal{W}$
(defined in Equation (\ref{eq:warping})) considered for the  alignment.
The \Rfunction{cluster\_peak} method applies the k-mean alignment algorithm
with these specifications to the set of peaks stored in the \Rclass{GRanges} object.
In particular, the parameters \Rcode{weight}\footnote{\Rcode{weight} can also be set
to \Rcode{NULL} and it will be automatically computed as specified in Section \ref{sect:weight}.
To save computational time, it is generally computed
on a random sub-sample of data, whose size is set by the \Rcode{subsample.weight}
parameter.}, \Rcode{alpha} and \Rcode{p}
define the distance used in the algorithm, while \Rcode{t.max}
sets the maximum shift of each peak in each iteration (in this particular case, $q$
of Equation (\ref{eq:warping}) does not vary in
the whole $\mathbb{Z}$, but
$q \in \{ - \textrm{\Rcode{t.max}} \cdot |U|, \ldots, + \textrm{\Rcode{t.max}}\cdot |U| \}$,
with $|U|$ being the maximum width of the spline approximation of the peaks).

Given a \Rclass{GRanges} \Rcode{GR} containing the metadata columns computed from the
\Rfunction{smooth\_peak} method,  \Rfunction{cluster\_peak} applies the k-mean alignment algorithm
for all the values of $k$ between 1 and \Rcode{n.clust} (parameter of the function).

The algorithm can be run in parallel, setting to \Rcode{TRUE} the
\Rcode{parallel} argument of the method and providing the number
of cores \Rcode{num.cores}. With these settings, the different applications of the algorithm, corresponding to
different numbers of clusters, are executed in parallel.

As detailed in the help, the \Rfunction{cluster\_peak} method has 2 outputs:
\begin{itemize}
\item The \Rclass{GRanges} object, updated with new metadata columns
associated to the classification. In particular, in the general
case of classification with and without alignment, columns with information
on the clustering of the peaks (\Rcode{cluster\_shift} and
\Rcode{cluster\_NOshift}), the corresponding shifts (\Rcode{coef\_shift}) and
the distances from the template of the clusters (\Rcode{dist\_shift} and \Rcode{dist\_NOshift}) are added.
\item The graph of the global distance within clusters\footnote{sum over all
the peaks of the distance of each peak from the corresponding template.} as a function of the number of
clusters (if \Rcode{plot.graph.k = TRUE}). This plot can be used to identify the optimal number of clusters
of the partition of the data set and the effect of the alignment procedure.
In particular, if \Rcode{shift = NULL}, the algorithm is run both with and without
alignment and two trend lines are plotted: the black line corresponds to the
global distance without the shift, and the red line corresponds to the distance
obtained with alignment. If \Rcode{shift} is set to \Rcode{TRUE} or \Rcode{FALSE},
just one version of the algorithm is run and the corresponding curve is plotted.
For each trend line, this graph allows
the identification of the optimal value of the number of clusters: for this value, the distance
significantly decreases with respect to the lower values of $k$, and negligibly increases
with respect to higher values of $k$ (elbow in the line).
The gap between the red and the black line, instead, shows the decrease of the distance
when the shift is introduced.
\end{itemize}

It is relevant to point out that the algorithm can be run both on the original data and on the scaled peaks, depending on the focus of the analysis. The logical parameter \Rcode{rescale} allows the user to choose.

<<figurek, fig=TRUE, include=FALSE, width=8, height=8>>=

# classification of the smooth peaks in different
# numbers of clusters, from 1 ( no distinction, only shift )
# to 4. 

# here the analysis is run on the spline approximation
# without scaling
peaks.cluster <- cluster_peak(peaks.summit, parallel = FALSE ,  seeds=1:4,
                                   n.clust = 1:4, shift = NULL,
                                   weight = 1, alpha = 1, p = 2, t.max = 2,
                                   plot.graph.k = TRUE, verbose = FALSE)
head(peaks.cluster)

@


<<figurekScaled, fig=TRUE, include=FALSE, width=8, height=8>>=

# here the analysis is run on the spline approximation
# with scaling
peaks.cluster.scaled <- cluster_peak(peaks.summit.scaled, parallel = FALSE ,  seeds=1:4,
                                     n.clust = 1:4, shift = NULL,
                                     weight = 1, alpha = 1, p = 2, t.max = 2,
                                     plot.graph.k = TRUE, verbose = FALSE, 
                                     rescale = TRUE)
head(peaks.cluster.scaled)

@

\begin{figure} \label{graphK}
\centering
\includegraphics [width = 0.4\textwidth]{FunChIP-figurek}
\includegraphics [width = 0.4\textwidth]{FunChIP-figurekScaled}
\caption{\textbf{Global distance within clusters.} Global distance of the peaks from the corresponding templates, as a function of the number of clusters $k$. The left panel shows the results for the original spline approximation, the right panel those for the scaled approximation.}
\end{figure}

The particular case of k-mean alignment with $k = 1$ cluster can be used to highlight the effects of the
alignment of the peaks: no grouping is performed, just the shifts are computed. Therefore, the decrease
of the global distance is solely due to a change of the abscissae of the functions,
as Figure \ref{FunChIP-alignment} shows.
 Moreover, focusing for example on the first panel of Figure \ref{graphK}, we can deduce that, for this case
 \begin{itemize}
 \item the alignment can effectively decrease the distance: the gap between the red and the black line is significant;
 \item the alignment may change the optimal $k$: looking at the black line, one would have chosen $k=4$, while the
red line suggests $k=3$ is the best choice. With the introduction of the shifts, data which are originally different
become more similar and therefore
one less cluster is needed; it has to be noted that the distance
obtained with $k=3$ and alignment is very similar to the one obtained with $k=4$ and no alignment.
 \end{itemize}

Therefore, for this case, one possible classification is the one associated with $k=3$ with shift. For the scaled peaks, on the contrary, the crucial value is $k=2$, and the shift is relevant since it substantially reduces the global distance.
In Section \ref{sect:silhouette} we present two quantitative methods to identify the most suitable value of $k$: the first quantifies the elbow rule presented here, while the second computes the Silhouette index to take into account the global structure of the dataset.
The results for this specific number of clusters can then be selected with the \Rfunction{choose\_k}
method:

<<choose_k>>=

# select the results for k = 3 with alignment
peaks.classified.short <- choose_k(peaks.cluster, k = 3,
                                  shift = TRUE, cleaning = TRUE)
head(peaks.classified.short)

peaks.classified.extended <- choose_k(peaks.cluster, k = 3,
                                  shift = TRUE, cleaning = FALSE)

# and for the scaled version for k =2 and alignment

peaks.classified.scaled.short <- choose_k(peaks.cluster.scaled, k = 2,
                                  shift = TRUE, cleaning = TRUE)
head(peaks.classified.scaled.short)

peaks.classified.scaled.extended <- choose_k(peaks.cluster.scaled, k = 2,
                                  shift = TRUE, cleaning = FALSE)
@

The \Rfunction{choose\_k} method allows, respectively, removing all the metadata columns computed by
\Biocpkg{FunChIP} and obtaining a
\Rclass{GRanges} equivalent to the initial one, with an extra metadata column
\Rcode{cluster} containing the classification labels (\Rcode{cleaning = TRUE}), or a \Rclass{GRanges}
retaining all the details of the preprocessing and clustering (all the previously described
metadata columns), with the extra column \Rcode{cluster} (\Rcode{cleaning = FALSE}).

\subsection{The choice of the final classification with the bending index and the Silhouette plot} \label{sect:silhouette}

In this section we present two analyses which could be useful to identify the most suitable value of the number of clusters. 

The first analysis quantifies the elbow rule we proposed in the previous section. For each value of $k$ in the range $[2, K-1]$, with $K$ the maximum number of clusters allowed in \Rfunction{cluster\_peak}, the function \Rfunction{bending\_index} computes an index that quantifies the bending of the global distance curve: the higher this index, the more appropriate the corresponding value of $k$. The index is computed as the distance of the point $P = (k, \tilde{D}(k))$ of the global distance function, normalized to its maximum value, from the line $r$ passing through the points $(k-1, \tilde{D}(k-1))$ and $(k+1, \tilde{D}(k+1))$. The normalized distance function is
$$
\tilde{D}(k) = \frac{D(k)}{\max_{k = 1: K}{D(k)}}
$$
with $D(k)$ being the global distance of the set of classified peaks, when divided into $k$ clusters:
$$
D(k) = \sum_{i = 1}^n d_{i,k}^{\star}.
$$
In particular, $d_{i,k}^{\star}$ is the distance of 
each peak $i$ from the center of the corresponding cluster $c_{k^{\star}_i}$.
Then, the index can be written as:
$$
r(k) = \textrm{dist}((k, \tilde{D}(k)), r((k-1, \tilde{D}(k-1)), (k+1, \tilde{D}(k+1))))
$$
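
The following chunk (not evaluated) sketches this computation starting from a toy vector of global distances $D(1), \ldots, D(K)$; the \Rfunction{bending\_index} function extracts these values directly from the clustered \Rclass{GRanges} object.

<<bending_sketch, eval=FALSE>>=

# Minimal sketch (not run): bending index from a toy vector of global
# distances.
D       <- c(100, 62, 40, 36)          # e.g. global distances for k = 1:4
D.tilde <- D / max(D)                  # normalized distance function

# distance of the point P from the line through A and B (length-2 vectors)
point_line_distance <- function(P, A, B) {
  abs((B[1] - A[1]) * (A[2] - P[2]) - (A[1] - P[1]) * (B[2] - A[2])) /
    sqrt(sum((B - A)^2))
}

K <- length(D.tilde)
bending <- sapply(2:(K - 1), function(k)
  point_line_distance(P = c(k,     D.tilde[k]),
                      A = c(k - 1, D.tilde[k - 1]),
                      B = c(k + 1, D.tilde[k + 1])))
names(bending) <- 2:(K - 1)
bending                                # the larger the value, the sharper the elbow

@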

<<ratio>>=
# here we compute the bending index for the classification
# provided for the non scaled dataset

index <- bending_index(peaks.cluster, plot.graph.k = FALSE)
index

# and then for the scaled dataset
index_scaled <- bending_index(peaks.cluster.scaled, plot.graph.k = FALSE)
index_scaled

@

The function returns the value of the bending index as $k$ varies, for both the classification with
alignment (\Rcode{index\_shift}) and without alignment (\Rcode{index\_NOshift}), if both
classifications are stored in the input object. It can also show the plot of the global distance curves of Figure \ref{graphK}, if \Rcode{plot.graph.k = TRUE}.

Moreover, we present another method to evaluate the choice of the proper number of clusters, which takes into account the global structure of the data set. This method is based on the definition of the Silhouette index of \cite{Rousseeuw:1987}; for the peak $i$, the index is computed as
$$
s(i) = \frac{b(i)-a(i)}{\max(a(i), b(i))},
$$
with $a(i)$ being the average dissimilarity of peak $i$ with all the other peaks within the same cluster and $b(i)$ the lowest average dissimilarity of $i$ to any other cluster, of which $i$ is not a member. To compute the $b(i)$ value it is necessary to compare peaks belonging to different clusters. In the case of classification with alignment, defining the distance between peaks of different clusters requires the two clusters to be aligned first. To perform this global alignment in an efficient way, we align the centers of the clusters and then use the estimated shift coefficients to align all the peaks of the two clusters.
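
As a minimal illustration of this index, the following chunk (not evaluated) computes $s(i)$ from a toy distance matrix and a vector of cluster labels; the \Rfunction{silhouette\_plot} function additionally takes care of the alignment of the cluster centers and of the graphical output.

<<silhouette_sketch, eval=FALSE>>=

# Minimal sketch (not run): Silhouette index from a toy peaks-by-peaks
# distance matrix d.mat and a vector of cluster labels.
silhouette_index <- function(d.mat, labels) {
  sapply(seq_along(labels), function(i) {
    same <- labels == labels[i]
    a.i  <- mean(d.mat[i, same & seq_along(labels) != i])     # within cluster
    b.i  <- min(tapply(d.mat[i, !same], labels[!same], mean)) # nearest cluster
    (b.i - a.i) / max(a.i, b.i)
  })
}

# toy example: 6 one-dimensional "peaks" in 2 clusters
d.mat  <- as.matrix(dist(c(1, 1.2, 0.8, 5, 5.3, 4.9)))
labels <- c(1, 1, 1, 2, 2, 2)
s      <- silhouette_index(d.mat, labels)
mean(s)                                # average Silhouette index

@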

<<sil, fig=TRUE, include=FALSE, width=15, height=8>>=
sil <- silhouette_plot(peaks.cluster, p=2, weight = 1, alpha = 1,
                         rescale = FALSE, t.max = 2)
@

<<sil_scaled, fig=TRUE, include=FALSE, width=15, height=8>>=
sil.scaled <- silhouette_plot(peaks.cluster.scaled, p=2, weight = 1, alpha = 1,
                         rescale = TRUE, t.max = 2)
@


\begin{figure} \label{silhouette}
\centering
\includegraphics [width = 0.99\textwidth]{FunChIP-sil}
\caption{\textbf{Silhouette graph for the classification of  unscaled peaks.} In the top panel, Silhouette plots for the classification without alignment, as function of the number of clusters; in the bottom panel, Silhouette plots for the classification with alignment.}
\end{figure}

\begin{figure} \label{silhouette_scaled}
\centering
\includegraphics [width = 0.99\textwidth]{FunChIP-sil_scaled}
\caption{\textbf{Silhouette graph for the classification of  scaled peaks.} In the top panel, Silhouette plots for the classification without alignment, as a function of the number of clusters; in the bottom panel, Silhouette plots for the classification with alignment.}
\end{figure}

The plots shown in Figure \ref{silhouette} and Figure \ref{silhouette_scaled} represent the Silhouette index for all the peaks, divided into the clusters, together with the global average Silhouette index. The average Silhouette index is maximum when the clustering separates the data in the most suitable manner, and therefore this index has to be maximized to choose the best classification. In the case presented here, the best classification according to this criterion is the classification with alignment and $k=2$ or $k=3$. Considering both the Silhouette and the bending criteria, in the next section we present the results of the classification with alignment and $k=3$ clusters.

\section{Visualization of the peaks} \label{sect:plot}
The  \Rfunction{plot\_peak} method is a very flexible function
for displaying ChIP-Seq peaks. In particular, it allows plotting
the raw counts obtained by the \Rfunction{pileup\_peak} method, as
in Figure \ref{FunChIP-plot1}. It can also plot
smoothed peaks, possibly centered around their summits, as in
Figure \ref{twoplots}, or scaled (and possibly centered) peaks, as in Figure \ref{twoplotsbis}.

<<plot1, fig=TRUE, include=FALSE, width=8, height=8>>=

# plot of the first 10 peaks (raw data)
par(mar=c(4.5,5,4,4))
plot_peak(peaks, index = 1:10, line.plot = 'count',
          cex.main = 2, cex.lab = 2, cex.axis =2)
@
\incfig{FunChIP-plot1}{0.5\textwidth}{10 peaks: counts.}
     {Representation of the original peaks as raw counts  (no smoothing).}


<<plot2, fig=TRUE, include=FALSE, width=8, height=8>>=

# plot of the smoothed approximation of the first 10 peaks
par(mar=c(4.5,5,4,4))
plot_peak(peaks.smooth, index = 1:10, line.plot = 'spline',  cex.main = 2,cex.lab = 2, cex.axis =2)

@
% \incfig{FunChIP-plot2}{0.5\textwidth}{10 peaks: splines.}
%      {Representation of the original peaks (with spline smoothing) as counts
%      on the basis of the genome.}

<<plot3, fig=TRUE, include=FALSE, width=8, height=8>>=

# plot of the smoothed approximation of the first 10 peaks,
# centering peaks around their summits
par(mar=c(4.5,5,4,4))
plot_peak(peaks.summit, index = 1:10, line.plot = 'spline', cex.main = 2,cex.lab = 2, cex.axis =2)
@

<<plot2bis, fig=TRUE, include=FALSE, width=8, height=8>>=

# plot of the smoothed approximation of the first 10 peaks;
# the scaled functions are plotted.
# 
par(mar=c(4.5,5,4,4))
plot_peak(peaks.smooth.scaled, index = 1:10, 
          line.plot = 'spline', rescale = TRUE, 
           cex.main = 2,cex.lab = 2, cex.axis =2)

@

<<plot3bis, fig=TRUE, include=FALSE, width=8, height=8>>=

# plot of the scaled approximation of the first 10 peaks,
# centering peaks around their summits
par(mar=c(4.5,5,4,4))
plot_peak(peaks.summit.scaled, index = 1:10, 
          line.plot = 'spline', rescale = TRUE, 
           cex.main = 2,cex.lab = 2, lwd = 2,cex.axis =2)
@

\begin{figure} \label{twoplots}
\centering
\includegraphics [width = 0.4\textwidth]{FunChIP-plot2}
\includegraphics [width = 0.4\textwidth]{FunChIP-plot3}
\caption{\textbf{10 spline-smoothed peaks.} In the left panel, smoothed
peaks are shown, while in the right panel the same peaks are
centered around their summits.}
\end{figure}

\begin{figure} \label{twoplotsbis}
\centering
\includegraphics [width = 0.4\textwidth]{FunChIP-plot2bis}
\includegraphics [width = 0.4\textwidth]{FunChIP-plot3bis}
\caption{\textbf{10 spline-smoothed and scaled peaks.} In the left panel, scaled
peaks are shown, while in the right panel the same peaks are
centered around their summits.}
\end{figure}

From the comparison of Figure \ref{twoplots} and Figure \ref{twoplotsbis}
it is clear how the scaling affects the shape of the splines: the scaled peaks no longer
reflect their magnitude, but only their shapes.

Moreover, it is also possible to plot both the raw counts and the spline:
Figure \ref{FunChIP-plotBOTH} shows a single
peak in its raw and smoothed version.
This representation is useful to check the accuracy
of the smoothing and, if needed, manually set the $\lambda$
parameter of the spline approximation.

<<plotBOTH, fig=TRUE, include=FALSE, width=8, height=8>>=

# plot of a peak comparing its raw structure and
# its spline-smoothed version.
par(mar=c(4.5,5,4,4))
plot_peak(peaks.summit, index = 3, lwd =2, line.plot = 'both', col = 'darkblue', cex.main = 2,cex.lab = 2, cex.axis =2)
@

\incfig{FunChIP-plotBOTH}{0.4\textwidth}{Read coverage and spline approximation.}
     {Plot of the original read coverage of a peak and its smoothing (spline approximation), centered
     around the summit.}
     
\incfig{FunChIP-plot4}{0.99\textwidth}{Peaks divided into the three clusters.}
     {The same spline-smoothed peaks are plotted in
     grey, and for each panel the peaks in the corresponding cluster are colored to show their different shapes.
     Peaks are aligned with the shift coefficients obtained by the k-mean
     alignment algorithm.}
     
\incfig{FunChIP-plot4bis}{0.8\textwidth}{Scaled peaks divided into the two clusters.}
     {The same spline-smoothed scaled peaks are plotted in
     grey, and for each panel the peaks in the corresponding cluster are colored to show their different shapes.
     Peaks are aligned with the shift coefficients obtained by the k-mean
     alignment algorithm.}

<<plot4, fig=TRUE, include=FALSE, width=13, height=5.5>>=

# plot of the results of the kmean alignment.
# Peaks are plotted in three different panels
# according to the clustering results.

plot_peak(peaks.cluster, index = 1:100, line.plot = 'spline',
          shift = TRUE, k = 3, cluster.peak = TRUE, 
          col = topo.colors(3),  cex.main = 2,cex.lab = 2, cex.axis =2)
@

<<plot4bis, fig=TRUE, include=FALSE, width=12, height=5.8>>=

# plot of the results of the kmean alignment.
# Scaled peaks are plotted in two different panels
# according to the clustering results.

plot_peak(peaks.cluster.scaled, index = 1:100, line.plot = 'spline',
          shift = TRUE, k = 2, cluster.peak = TRUE, rescale = TRUE, 
          col = heat.colors(2), cex.main = 2,cex.lab = 2, cex.axis =2)

@

Finally, the  \Rfunction{plot\_peak} method allows plotting the results
of the clustering obtained via the k-mean alignment. In Figure \ref{FunChIP-plot4} and  Figure \ref{FunChIP-plot4bis},
smoothed peaks are divided into the three clusters and scaled peaks into the two clusters, plotted with
the optimal shifts obtained with the alignment.



\bibliography{mybib}

\end{document}