\documentclass[a4paper, 9pt]{article}
\usepackage{hyperref}
\usepackage{amsmath, amsthm, amssymb}
\usepackage{xfrac}

\usepackage{fullpage}

\usepackage{marginnote}
\usepackage{graphicx}

\usepackage[table]{xcolor} %http://ctan.org/pkg/xcolor

\usepackage[numbers]{natbib}
\usepackage{algorithmic}
\usepackage{algorithm}

\usepackage{url}


\usepackage{xspace}

\newcommand{\CAPRESE}{\textsc{caprese}}
\newcommand{\TRONCO}{\textsc{tronco}}

\usepackage{fullpage}
% \VignetteIndexEntry{TRONCO}
%\VignetteIndexEntry{TRONCO}
%\VignetteDepends{TRONCO}
%\VignetteKeywords{TRONCO}
%\VignettePackage{TRONCO}

\begin{document}


\title{Using the \TRONCO{} package}


\author{
Marco Antoniotti\footnote{Dipartimento di Informatica Sistemistica e Comunicazione, Universit\'a degli Studi Milano-Bicocca
Milano, Italy.} \and
Giulio Caravagna$^\ast$ \and
Alex Graudenzi$^\ast$ \and
Ilya Korsunsky\footnote{Courant Institute of Mathematical Sciences, New York University, New York, USA.} \and
Mattia Longoni$^\ast$ \and
Loes Olde Loohuis\footnote{Center for Neurobehavioral Genetics, University of California, Los Angeles, USA.} \and
Giancarlo Mauri$^\ast$ \and
Bud Mishra$^\dagger$ \and
Daniele Ramazzotti$^\ast$ 
}

\date{\today}
\maketitle


\begin{center}
\begin{minipage}[h]{0.75\textwidth}
\textbf{Abstract.} Genotype-level {\em cancer progression models} describe the ordering of accumulating mutations, e.g.,  somatic mutations / copy number variations, during  cancer development. These graphical models help understand the ``causal structure'' involving events promoting cancer progression, possibly predicting complex patterns characterising  genomic  progression of a cancer. Reconstructed models can be used to better characterise genotype-phenotype relation, and suggest novel targets for therapy design.

\TRONCO{} ({\sc tr}{\em anslational} {\sc onco}{\em logy}) is a \textsc{r} package  aimed at collecting  state-of-the-art algorithms to infer
\emph{progression models}  from \emph{cross-sectional} data, i.e., data collected from independent patients which does not necessarily incorporate any evident temporal information. These algorithms require a binary input matrix where: $(i)$ each row represents a patient genome, $(ii)$ each column an event relevant to the progression (a priori selected) and a $0/1$ value models the absence/presence of a certain mutation in a certain patient.

 
The current first version of \TRONCO{}
implements the \CAPRESE{} algorithm ({\sc ca}{\em ncer} {\sc pr}{\em ogression} {\sc e}{\em xtraction} {\em with} {\sc s}{\em ingle} {\sc e}{\em dges}) to infer possible progression models arranged as \emph{trees};
cfr.
\begin{itemize}
\item \emph{Inferring tree causal models of cancer progression with
    probability raising},  L. Olde Loohuis, G. Caravagna,
  A. Graudenzi, D. Ramazzotti, G. Mauri,  M. Antoniotti and
  B. Mishra. {PLoS One}, \emph{to appear}.
\end{itemize}
This vignette shows how to  use  \TRONCO{} to infer a tree model of
ovarian cancer progression from CGH data of copy number alterations (classified as gains or losses over chromosome's arms). The dataset used is
available in the  SKY/M-FISH database.
The reference manual for \TRONCO{} is available in the package.
\begin{center}
\includegraphics[width=0.9\textwidth]{workflow.png}
\end{center}
\flushright
\scriptsize \em The \TRONCO{} workflow.
\end{minipage}
\end{center}

\vspace{1.0cm}


\SweaveOpts{concordance=TRUE}

\paragraph{\large Requirements: } You must  have \texttt{rgraphviz}
installed to  use the package, see \texttt{Bioconductor.org}.


\paragraph{\large 1. Types/Events definition}{\ }\\

First, load \TRONCO{} in your \textsc{r} console. 
<<>>=
library(TRONCO)
@
Every node in the plotted topology can be colored according to the
color table defined in \textsc{r}. You can use the command
\texttt{colors} to see the available colors, e.g., \texttt{"red"}, \texttt{"blue"} or RGB 
\texttt{"\#FF9900FF"}.

You can start defining the \emph{event types} that you are
considering, and assign them  a color.

As an example, for CGH data we define two types of events, \emph{gain}
and \emph{loss}, which we color  \emph{red} and \emph{green} to represent
amplifications or deletion of a  chromosome arm.  For instance, we can
do this as follows:
<<>>=
types.add("gain", "cornflowerblue")
types.add("loss", "brown1")
@
If many types have to be defined it might be convenient to load all of
them at once.  This is possible by using a tabular input file
(in \texttt{csv} format):
\[
\texttt{type\_name, type\_color}  \qquad\qquad  \text e.g., \quad \texttt{red, gain} 
\]
and issuing the command \texttt{types.load("types.txt")} -- if types
are defined in file \texttt{types.txt}.  The output produced by
\TRONCO{} might show warnings due to, e.g., different types assigned
the same color.

Once  types are defined, you can define the set of \emph{events} in
the dataset (which will constitute the progression), give them a \emph{label}, a type and bind them to a
dataset column.  Since in general there are much more events than types, it might be convenient to prepare an external file to load via command {\tt events.load("events.txt")}. The format expected for events is similar to the one expected for types, namely as a tabular input file in \texttt{csv} format:
\[
\texttt{event\_name, event\_type, column\_number} \qquad\qquad \text e.g., \quad \texttt{8p+, gain, 1}\, .
\]
For the ovarian CGH dataset, such a file contains the following rows (we show the first 3 lines)
\begin{verbatim}
8p+, gain, 1
3p+, gain, 2
5q-, loss, 3
......
\end{verbatim}
which define, as events, gains in  arm $p$ of chromosomes $8$ and $3$, losses on arm $q$ of chromosomes $5$, etc. Given the file \emph{events.txt} where are defined the events with the above notation, the events can be loaded from a file as follows.
<<>>=
 events.load("events.txt")
@
Events will constitute the nodes in the progression model.  If one is willing to add events in a iterative fashion the command {\tt events.add(event\_name, event\_type, column\_number)} can be used. For instance {\tt events.add("8q+", "gain", 1)}.


At this point, \TRONCO{} executes some consistency checks to ensure that all the  added events are of a declared type, and report the user potential inconsistencies.


\paragraph{\large 2. Data loading \& Progression inference}{\ }\\

Once events are set, you can load the input dataset, which must be
stored in a text file  as a binary matrix (once loaded, you can use {\tt tronco.data.view(your\_data)} to visualise loaded data as a heatmap).
<<>>=
data(ov.cgh)
data.load(ov.cgh)
<<>>=
str(data.values)
@
In this case 87 samples are available and  7 events are considered (in general, the inference problem is well posed if there are more samples than events, which is the case here for ovarian).

Further  consistency checks are performed by \TRONCO{} at data-loading time; these include checking that:
\begin{itemize}
\item All the columns of the dataset are assigned a unique event;
\item There are no identical columns in the dataset. If this is the
  case, the columns get merged and the events associated get merged
  too (a default type is assigned in this case);
\item There are no columns in the dataset solely constituted by 0s
  or 1s. If this is the case, the columns and the events associated
  are deleted.
\end{itemize}
\TRONCO{} signals the user that the data presents some inconsistency, if that is the case. Once the input is loaded, \CAPRESE{} can
be executed.

\begin{figure}[t]\center
{\includegraphics[width=0.5\textwidth]{vignette-007}}
\caption{\textbf{Ovarian cancer CGH tree reconstructed with CAPRESE.}
  We show the result of reconstruction with \CAPRESE{}.
  These trees are plot as explained in \S $2$ and {$3$}. The
  tree is the reconstructed model without confidence information.}
\label{fig:tree}
\end{figure}

<<>>=
topology <- tronco.caprese(data.values, lambda=0.5)
@
In the above example, \CAPRESE{} is executed with a \emph{shrinkage
  coefficient} set to $0.5$ (the default value, if not specified), which
is the optimal value for  data containing \emph{false positives} and
\emph{false negatives}.  If these were absent, the optimal coefficient
should be set to an arbitrary small value, e.g. $10^{-3}$; in any
case the coefficient  must be  in $[0,1]$.  Notice that \TRONCO{}
provides an \emph{empirical estimation} of the  the rate of false
positives and negatives in the data, given the reconstructed model;
this is done  via $\ell_2$ distance.

The returned topology can be printed to screen by using the
\texttt{topology} object print method, or can be visualized by using
the \texttt{tronco.plot} function.
<<fig=TRUE,include=FALSE>>=
topology
tronco.plot(topology, title="Ovarian cancer progression with CAPRESE", legend.title="CGH events", 
	legend.coeff = 1.0, label.coeff = 1.2, legend = TRUE)
@

In this case we are assigning a title to the plot, we are requiring
to display a legend (\texttt{ legend = TRUE}), and we are setting custom
size for the text in the legend (\texttt{legend.coeff = 0.7}, $70\%$
of the default size) and in the model (\texttt{ label.coeff = 1.2});
see Figure \ref{fig:tree}.

\paragraph{\large 3. Confidence estimation}{\ }\\

\begin{figure}[t]\centerline{
\fbox{\includegraphics[width=0.33\textwidth]{vignette-008}}
\fbox{\includegraphics[width=0.33\textwidth]{vignette-009}}
\fbox{\includegraphics[width=0.33\textwidth]{vignette-010}} \\
}

\centerline{
\fbox{\includegraphics[width=0.33\textwidth]{vignette-011}}
\fbox{\includegraphics[width=0.33\textwidth]{vignette-012}}
\fbox{\includegraphics[width=0.33\textwidth]{vignette-013}}
}
\caption{\textbf{Probabilities (input data): visualisation and comparison with  model's predictions.} Top: observed
  \emph{frequencies} of \emph{observed},  \emph{joint} and
  \emph{conditional} distributions of events (conditionals are
  restricted according to the reconstructed progression
  model) as emerge from the data. Bottom: difference between observed and fitted
  probabilities, according to the reconstructed progression.}
\label{fig:distrib}
\end{figure}

\paragraph{Data and model probabilities.} Before estimating the
confidence of a reconstruction, one might print and visualise the
\emph{frequency of occurrence} for each event, the \emph{ joint
  distribution} and the  \emph{conditional distribution} according to
the input data (i.e., the \emph{observed} probabilities).   Notice
that for the conditional distribution we condition only on the parent
of a node, as reconstructed in the returned model.  Plots of these distributions are shown in Figure
\ref{fig:distrib}, and are evaluated as follows.
<<fig=TRUE,include=FALSE>>=
 confidence.data.single(topology)
@
<<fig=TRUE,include=FALSE>>=
 confidence.data.joint(topology)
@
<<fig=TRUE,include=FALSE>>=
 confidence.data.conditional(topology)
@

In a similar way, by using \texttt{ confidence.fit.single(topology)},
\texttt{ confidence.fit.joint(topology)} or
\texttt{confidence.fit.conditional(topology)}, the analogous
probabilities can be assessed according to the model. This are not
shown in this vignette.

The difference between  observed and  fit probabilities can be
visualised as follows.
<<fig=TRUE,include=FALSE>>=
confidence.single(topology)
@
<<fig=TRUE,include=FALSE>>=
confidence.joint(topology)
@
<<fig=TRUE,include=FALSE>>=
confidence.conditional(topology)
@


\paragraph{Bootstrap confidence.}{\ }\\

Confidence in a model can be estimated via \emph{parametric} and
\emph{non-parametric bootstrap}.  In the former case, the model is
assumed to be correct and data is sampled by the model, in the latter
case resamples are taken from the input data, with repetitions.  In any
case, the reconstruction confidence is the number of times that the
estimated tree or edge is inferred out of a number of
resamples.  The parameters of the bootstrap procedure can be custom
set.

<<>>=
set.seed(12345)
topology <- tronco.bootstrap(topology, type="non-parametric", nboot=1000)
@
<<fig=TRUE,include=FALSE>>=
tronco.bootstrap.show(topology)
@


In this case, for instance, we are performing non-parametric bootstrap
(the default one) with $1000$ repetitions and, since no shrinkage
coefficient is specified, we are still using $0.5$. Here the estimated
error rates are used to include noise levels estimated from the
data/model. To perform parametric bootstrap is enough to use the flag
\texttt{ type="parametric"}.
<<>>=
set.seed(12345)
topology <- tronco.bootstrap(topology, type="parametric", nboot=1000)
@
<<fig=TRUE,include=FALSE>>=
tronco.bootstrap.show(topology)
@


Results of bootstrapping are visualized as a table (useful for edge
confidence), and as a heatmap by using command
\texttt{tronco.bootstrap.show}. The overall model confidence is
reported, too. In Figure 3 results of bootstrap are
shown.  If one is willing to visualize this confidence in the plot of
the inferred tree  an input flag \texttt{confidence}  can be used with
function \texttt{tronco.plot}. For instance:
<<fig=TRUE,include=FALSE>>=
tronco.plot(topology, title="Ovarian cancer progression with CAPRESE", legend.title="CGH events", 
	legend.coeff = 1.0, label.coeff = 1.2, legend = TRUE, confidence = TRUE)
@

In this case, the thicker lines  reflect the most confident edges;
confidence is also reported as labels of edges, as shown in
Figure 4
%
%
% These are visualized in  Figure \ref{fig:bootstrap}.

\begin{figure}[t]\center
\fbox{\includegraphics[width=0.45\textwidth]{vignette-015}}
\fbox{\includegraphics[width=0.45\textwidth]{vignette-017}}
\caption{\textbf{Bootstrap for edge confidence.} Non-parametric and parametric confidence in each reconstructed edge as assessed via bootstrapping.}  
\label{fig:bootstrap}
\end{figure}

\begin{figure}[t]\center
\fbox{\includegraphics[width=0.45\textwidth]{vignette-018}}
\caption{\textbf{Bootstrap information included in the model.} You can include the result of edge confidence estimation via bootstrap by using flag {\tt confidence}. In this case the thickness of each edge is proportional to its estimated confidence.}  
\label{fig:bootstrap}
\end{figure}


\end{document}