%\VignetteEngine{knitr::knitr} \documentclass[a4paper,9pt]{article} <>= BiocStyle::latex() @ %\VignetteIndexEntry{LACE 2.0} \usepackage[utf8]{inputenc} \usepackage{graphicx} \usepackage{placeins} \usepackage{url} \usepackage{tcolorbox} \usepackage{authblk} \usepackage{longtable} \usepackage{booktabs} \usepackage{array} \usepackage{amsmath} \begin{document} \title{LACE 2.0: an interactive R tool for the inference and visualization of longitudinal cancer evolution models} \author[1]{Gianluca Ascolani} \author[1]{Fabrizio Angaroni} \author[1]{Davide Maspero} \author[2]{Narra Lakshmi Sai Bhavesh} \author[3,4]{Rocco Piazza} \author[5,6]{Chiara Damiani} \author[3]{Daniele Ramazzotti} \author[1,4]{Marco Antoniotti} \author[1,4,7]{Graudenzi Alex} \affil[1]{Dept. of Informatics, Systems and Communication, Universit\'{a} degli Studi di Milano-Bicocca, Milan, Italy.} \affil[2]{Department of Biological Sciences, Birla Institute of Technology and Science (BITS), Pilani, Rajasthan, India.} \affil[3]{Dept.of Medicine and Surgery, Universit\'{a} degli Studi di Milano-Bicocca, Milan, Italy.} \affil[4]{Bicocca Bioinformatics Biostatistics and Bioimaging Centre -- B4, Milan, Italy.} \affil[5]{Dept. of Biotechnology and Biosciences, Universit\'{a} degli Studi di Milano-Bicocca, Milan, Italy.} \affil[6]{ISBE/SYSBIO Centre of Systems Biology, Milan, Italy.} \affil[7]{Inst. of Molecular Bioimaging and Physiology, National Research Council (IBFM-CNR), Segrate, Italy.} \date{\today} \maketitle \begin{tcolorbox}{\bf Overview.} LACE 2.0 includes: (a) an easy-to-use module for data processing and organization, which allows one to test and adjust several quality metrics and relevance filters, and to annotate gene variants; (b) a module for the interactive visualization of the inferred cancer evolution models, which returns both the \emph{longitudinal clonal tree} and the \emph{fishplot}, and allows one to examine the different parts of the model (i.e., clones and temporal relations), also by querying external databases such as Ensembl.org. Importantly, LACE 2.0 requires a minimum set of parameters, which are sufficient to mitigate the aforementioned noise sources and infer reliable models of cancer evolution. \vspace{1.0cm} {\em In this vignette, we give an overview of the Shiny App by presenting its main functions.} \vspace{1.0cm} \renewcommand{\arraystretch}{1.5} \end{tcolorbox} <>= library(knitr) opts_chunk$set( concordance = TRUE, background = "#f3f3ff" ) @ \newpage \tableofcontents LACE 2.0 is a new release of the LACE R package. LACE 2.0 is capable of performing clonal evolution analyses for single-cell sequencing data including longitudinal experiments. LACE 2.0 allows to annotate variants and retrieve the relevant mutations interactively based on user-defined filtering criteria; it infers the maximum likelihood clonal tree, cell matrix attachments and false positive/negative rates using boolean matrix factorization. Furthermore, LACE 2.0 allows to investigate cancer clonal evolution under different experimental conditions and the occurrence of single mutations which can be queried via \emph{ensembl} database. \section{Installation of LACE 2.0 R package}\label{installation-of-lace-2.0-r-package} The package is available on GitHub and Bioconductor. LACE 2.0 requires R \textgreater{} 4.1.0 and Bioconductor. To install Bioconductor run: \begin{verb} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") \end{verb} To install LACE 2.0 run: \begin{verb} remotes::install_github("https://github.com/BIMIB-DISCo/LACE", dependencies = TRUE) \end{verb} LACE 2.0 uses \emph{Annovar} and \emph{Samtools suite} as back-ends for variant calling annotation and depth computation, respectively. Please refer to the next section to install them. \section{Installation of other required softwares}\label{installation-of-other-required-softwares} \emph{Annovar} is a widely used variant calling software freely available upon registration to their website at https://annovar.openbioinformatics.org/en/latest/. The package contains \emph{Perl} scripts and variant calling annotation reference databases for the human species. For other databases, please refer to their website. If the scripts are installed in binary search path, then LACE 2.0 will detect them automatically. \emph{Perl} (https://www.perl.org/) is required to run \emph{Annovar}. \emph{Samtools suite} is a standard set of tools and libraries to handle SAM/BAM/BED file format and perform a variety of common operations on sequencing data. It is freely available at http://www.htslib.org/ and https://github.com/samtools/htslib. To install \emph{Samtools} follow the instructions in their website. \section{Running LACE 2.0}\label{run-lace-2.0} To start LACE 2.0 user interface run: \begin{verb} library(LACE) LACE2() \end{verb} \section{Using LACE 2.0}\label{using-lace-2.0} LACE 2.0 has been thought to be used on single cell sequencing data for which it is available variant calling data in standard VCF format and binary aligned data in standard BAM format. The user is provided with an interface to initiate a project and to set filter thresholds after which annotation of variants, filtering of data and depth at variant sites are retrieved. Both annotation and depth derivation are computationally expensive steps. LACE 2.0 reduces possible re-computation by detecting parameter variations of the user interface and by comparison of the timestamps of interface state, inputs and outputs. Intermediary and final data are stored in the designated folders. The operation is followed by the possibility for the user to select variants which are drivers of the understudied biological problem. At this point, the inferential step is executed so that the most likelihood longitudinal clonal tree and clonal prevalences are retrieved together with the best set of false positive/negative rates among those provided by the user. The results are displayed via an interactive interface. \section{Interface}\label{interface} The interface is divided in two parts interleaved by variant selection and inferential computation parts: the processing interface and the results interface. \subsection{Project creation}\label{project-creation} To begin the clonal analyses the user needs to create a project by choosing a folder and a meaningful name for the project. %\includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Project_tab.png} If the project path does not contain former LACE projects, it asks the user to create a new one (``Create project'') and all intermediate and final results will be stored in structured sub-folders of the chosen project path. LACE 2.0 automatically recognizes if the selected path is a LACE project and proposes the user to load it (\includegraphics[width=0.1\textwidth,height=\textheight]{./resources/load_project.png}) \subsection{Sidebar and Demos}\label{sidebar-and-demos} The sidebar is divided in Projects and Current Project. Projects contain a set of recently opened projects which can be reloaded by selecting the respective project name. \includegraphics[width=0.3\textwidth,height=\textheight]{./resources/Projects_menu.png} Under Projects, there is the sub-menu Demos. LACE 2.0 provides the user with two demos: \begin{itemize} \item Melanoma dataset {[}Rambow, Florian, et al.~``Toward minimal residual disease-directed therapy in melanoma.'' Cell 174.4 (2018): 843-855.{]} containing single cell variant annotations and depths for a total of 674 cells sampled at 4 different time points: \begin{enumerate} \def\labelenumi{\arabic{enumi}.} \item sample before treatment \begin{itemize} \item BRAF inhibitor treatment \end{itemize} \item sample after 4 days \item sample after 28 days \item sample after 57 days \end{enumerate} \item Small dataset containing only 3 cells per 4 sampling time points. \end{itemize} Current Project allows the user to save and load the parameters of the whole project or of the active tab. \subsection{Processing interface}\label{processing-interface} \subsubsection{Single Cell Metadata}\label{single-cell-metadata} In order to perform clonal analyses it is required to provide some information about the experiment such as the cell IDs which are used to retrieve the VCF/BAM files and the sample name each cell belongs to. These metadata are generally included with the experimental sequencing data as part of the library preparation. LACE 2.0 accepts tsv/csv tabular formats with headers and rds standard data R format. Metadata file can include more information relative to the experimental setup and the platform used. \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/SCmeta_tab.png} The user should identify and select the ID and sampling columns. The user can also reorder the sampling names \includegraphics[width=0.6\textwidth,height=\textheight]{./resources/samples_order.png} so to recreate the chronological order of longitudinal events of the experiment, or to explore different orders. \subsubsection{Annotations}\label{annotations} Annotations of variant calling data is performed using \emph{Annovar} as back-end. Each mutation of each cell is annotated based on the annotation database used. \emph{Annovar} provides a database for the human species useful to tag cancer mutations and whenever possible their functional effect. %\includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Annotation_tab.png} If \emph{Annovar}'s \emph{Perl} scripts are available in the standard binary paths, they are detected and the \emph{Annovar} folder field is autocompleted; otherwise, the user should provide the folder containing the \emph{Perl} scripts. The user should also provide the database to use for the annotation and the folder containing the variant calling files (VCF). \subsubsection{Variant filtering}\label{variant-filtering} All types of single cell sequencing data is characterized by various sources of noise which depends on the technology used. Detected mutation might be characterized by low quality score or low statistical power. Some mutations can be neglected, while others might cause relevant effects, especially on exonic regions where variations can result in changes of the translational process and modify their functional form. In order to avoid small and possibly spurious fluctuations on sequencing data, variants can be filtered if they have low supporting evidences. The user can set the minimum values for: \begin{enumerate} \def\labelenumi{\arabic{enumi}.} \item the number of reads supporting the alternative alleles in a cell, \item the frequency of the minor allele for each referenced SNP included in a default global population, \item the cell frequency per sample showing the mutation at the same site. \end{enumerate} \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Filters_tab.png} Furthermore, the user can choose which functional exonic variation should be considered or neglected. For example, there are cases in which unknown and synonymous mutations in exonic regions are disregarded because their effect cannot be explicitly related to the experimental condition. All possible variant functions and their explanations are resumed in the following table: \begin{longtable}[]{@{}ll@{}} \toprule \begin{minipage}[b]{0.16\columnwidth}\raggedright Annotation (+)\strut \end{minipage} & \begin{minipage}[b]{0.78\columnwidth}\raggedright Explanation\strut \end{minipage}\tabularnewline \midrule \endhead \begin{minipage}[t]{0.16\columnwidth}\raggedright frameshift insertion\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright an insertion of one or more nucleotides that cause frameshift changes in protein coding sequence\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright frameshift deletion\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a deletion of one or more nucleotides that cause frameshift changes in protein coding sequence\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright frameshift block substitution\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a block substitution of one or more nucleotides that cause frameshift changes in protein coding sequence\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright stopgain\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate creation of stop codon at the variant site. For frameshift mutations, the creation of stop codon downstream of the variant will not be counted as ``stopgain''!\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright stoploss\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a nonsynonymous SNV, frameshift insertion/deletion, nonframeshift insertion/deletion or block substitution that lead to the immediate elimination of stop codon at the variant site\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright nonframeshift insertion\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright an insertion of 3 or multiples of 3 nucleotides that do not cause frameshift changes in protein coding sequence\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright nonframeshift deletion\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a deletion of 3 or mutliples of 3 nucleotides that do not cause frameshift changes in protein coding sequence\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright nonframeshift block substitution\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a block substitution of one or more nucleotides that do not cause frameshift changes in protein coding sequence\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright nonsynonymous SNV\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a single nucleotide change that cause an amino acid change\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright synonymous SNV\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright a single nucleotide change that does not cause an amino acid change\strut \end{minipage}\tabularnewline \begin{minipage}[t]{0.16\columnwidth}\raggedright unknown\strut \end{minipage} & \begin{minipage}[t]{0.78\columnwidth}\raggedright unknown function (due to various errors in the gene structure definition in the database file)\strut \end{minipage}\tabularnewline \bottomrule \end{longtable} (+) see \emph{Annovar} \subsubsection{Single cell sampling depth}\label{single-cell-sampling-depth} The number of reads per SNV site represents an optimal filter to retrieve relevant mutations. Depth at specific sites is usually not provided in standard alignment or variant calling pipelines and it is computationally expensive to retrieve. The user should provide the folder with aligned data and, if not found already, the \emph{samtools} executable folder location to compute depth only on sites passing the variant filters set in the previous tab. \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Depth_tab.png} \subsection{Selection of relevant variants}\label{selection-of-relevant-variants} Not all mutations are distinctive of the disease or experiment understudied. Identifying relevant and driver variants allows to reproduce a more significant longitudinal clonal tree. The user can select a set of filters based on gold standards and other analyses such as: \begin{enumerate} \def\labelenumi{\arabic{enumi}.} \item the minimum number of reads at given mutation site \item the maximum number of missing data per gene \item the minimum median depth per locus \item the minimum median depth supporting the mutation \item subset of known genes \end{enumerate} \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Variant_tab.png} Finding the parameters to select variants is not an easy task, and the user might not know in advance how to choose the best set of filters. Hence, the user can press ``Select variants'' (\includegraphics[width=0.1\textwidth,height=\textheight]{./resources/select_variant.png}) to perform the aforementioned computations on VCF and BAM files to derive all the necessary aggregated information on the sampled cells. Afterward, the user is presented with a interactive live preview of variants passing the filters (including relevant parameters which can help in the selection) while values are changed. \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Interactive_Variant_tab.png} \subsection{Inference}\label{inference} The inferential tab allows the user to set all the parameters to solve the boolean matrix factorization problem and estimate the model parameters of the Bayesian model using a MCMC to maximize the likelihood. Inferential step uses the following set of parameters: \begin{enumerate} \def\labelenumi{\arabic{enumi}.} \item Learning rate \item False positive rates for each sample \item False negative rates for each sample \item Number of iterations in each MCMC search \item Number of restart for the MCMC 5 Early stopping number of iterations with no growing likelihood\\ \item Number of parallel processes \item Random seed to recreate simulations \item Initialize the clonal tree randomly \item Marginalize the cell attachment matrix \item Keep equivalent solutions and return all of them \item Check indistinguishable event and remove them \item Estimate error rates of MCMC moves \item Show results \end{enumerate} \includegraphics[width=0.6\textwidth,height=\textheight]{./resources/Inference_tab2.png} The user must insert at least one false positive rate and one false negative rate value for each sample. By double clicking on any cell belonging to the row ``add row'', the user can insert a new set of false or positive rates for each sample. During the inferential step, the maximization of the likelihood for each set of rates is performed. The best results are returned. \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Inference_tab1.png} Press ``Run LACE'' to perform all computation and estimation steps and visualize the results. \subsection{Longitudinal display and outputs interface}\label{longitudinal-display-and-outputs-interface} The longitudinal display tab shows the longitudinal clonal tree, the fishplot and the clonal tree, together with the best false positive and negative rate parameters. The longitudinal clonal tree is placed on the top left side. Continuous edges are used to represent parental relations between clones, while dashed edges show persistence relations. The size of the nodes is proportional to the clonal prevalence. Moving the mouse over a node, it is possible to see the clonal prevalence value. While moving the mouse over the mutation labels, it shows the complete set of mutations for that clone under the longitudinal clonal tree. The fishplot is on the top right side of the tab. Passing the mouse over the ribbons, the clonal prevalences of the clones at different time points are displayed. Similarly, with the mouse over the time tags, the clonal prevalences of all clones at the specified time point are visualized. By clicking on a mutation label of the longitudinal clonal tree as well as on a fishplot ribbon or a clonal tree node, the details of the mutation characterizing the clone can be queried and retrieved from the \emph{Ensembl} database. The bottom part of the tab is used to display the tree legend and the best estimated set false positive/negative rates. \includegraphics[width=0.95\textwidth,height=\textheight]{./resources/Display_tab.png} %\include{3} \section{\Rcode{sessionInfo()}} <>= toLatex(sessionInfo()) @ \end{document}