%\VignetteIndexEntry{LowMACA}
%\VignettePackage{LowMACA}
%\VignetteDepends{LowMACA}
%\VignetteEngine{knitr::knitr}
\documentclass[11pt]{article}
\usepackage[margin=2cm,nohead]{geometry}

\usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}

\title{LowMACA: Low frequency Mutation Analysis via Consensus Alignment}
\author{Giorgio Melloni , Stefano de Pretis}

\begin{document}

\maketitle

\tableofcontents

%
\section{Introduction}
The \Rpackage{LowMACA} package is a simple suite of tools to investigate and analyze the profile of the somatic mutations provided by the cBioPortal (via the \Rpackage{cgdsr}). \Rpackage{LowMACA} 
evaluates the functional impact of somatic mutations by statistically assessing the number of alterations that accumulates on the same residue (or residue that are conserved in Pfam domains). 
For example, the known driver mutations G12,G13 and Q61 in KRAS can be found on the corresponding residues of other proteins in the RAS family (PF00071) like NRAS and HRAS, 
but also in less frequently mutated genes like RRAS and RRAS2. The corresponding residues are identified via multiple sequence alignment. Thanks to this approach the user can identify 
new driver mutations that occur at low frequency at single protein level but emerge at Pfam level. In addition, the impact of known driver mutations can be transferred to other proteins that share 
a high degree of sequence similarity (like in the RAS family example). 

You can conduct an hypothesis driven exploratory analysis using our package simply providing a set of genes and/or pfam domains of your interest. 
The user is able to choose the kind of tumor and the type of mutations (like missense, nonsense, frameshift etc.). 
The data are directly downloaded from the largest cancer sequencing projects and aggregated by LowMACA to evaluate the possible functional impact of somatic mutations by spotting 
the most conserved variations in the cohort of cancer samples. By connecting several proteins that share sequence similarity via consensus alignment, this package is able to statistically 
assessing the occurrence of mutations on the same residue and ultimately see:
\begin{itemize}
  \item where mutations fall and what are the involved domains
	\item what is the frequency of the aberrations and what is the more represented tumor type
	\item if and where the mutations tend to clusterize 
	\item what is the degree of conservation of the mutated residues
	\item if there are new driver genes and in particular, driver mutations
\end{itemize}

%
\section{System Requirements}
LowMACA relies on two external resources to work properly.
\begin{itemize}
	\item Clustal Omega, our trusted aligner (http://www.clustal.org/omega/)
	\item Ghostscript, a postscript interpreter needed to draw logo plots (http://www.ghostscript.com/)
\end{itemize}
Clustal Omega is a fast aligner that can be downloaded from the link above. 
For both Unix and Windows users, remember to have "clustalo" in your PATH variable. In case you cannot set "clustalo" in the PATH,
you can always set the clustalo command from inside R, after creating a LowMACA object:
<<pre-firstchunk1, echo=TRUE , eval=FALSE , results="hide" , message=FALSE , warning=FALSE>>=
#Given a LowMACA object 'lm'
lm <- newLowMACA(genes=c("TP53" , "TP63" , "TP73"))
lmParams(lm)$clustal_cmd <- "/your/path/to/clustalo"
@
If you cannot install clustalomega, we provide a wrapper around EBI web service (http://www.ebi.ac.uk/Tools/webservices/services/msa/clustalo\_soap). You just need to set your email as explained in section setup, but you have a limit of 2000 input sequences and perl must be installed.

Ghostscript is needed to draw logo plots and it is used by the library motifStack. Download it and install it from the link above. For Windows users only, don't forget to set this R environment command:
<<pre-firstchunk2, echo=TRUE , eval=FALSE , results="hide" , message=FALSE , warning=FALSE>>=
# Needed only on Windows - run once per R session
# Adjust the path to match your installation of Ghostscript
if(Sys.info()['sysname']=="Windows")
  Sys.setenv(R_GSCMD = '"C:/Program Files/gs/gs9.15/bin/gswin64c.exe"')
@
More details can be found here http://pgfe.umassmed.edu/BioconductorGallery/docs/motifStack/motifStack.html
\Rpackage{LowMACA} needs an internet connection to retrieve mutation data from cBioPortal, 
draw the Protter-style plot (http://wlab.ethz.ch/protter/start/) and use the web service of 
clustalomega (http://www.ebi.ac.uk/Tools/webservices/services/msa/clustalo\_soap)

%
\section{Create a LowMACA object}
First of all, we have to define our target genes or pfam domains that we wish to analyze. 

%
\subsection{Find your target family of proteins or pfam you want to know more about}
<<firstchunk, echo=TRUE , eval=TRUE,results="hide" , message=FALSE , warning=FALSE>>=
library(LowMACA)
#User Input
Genes <- c("ADNP","ALX1","ALX4","ARGFX","CDX4","CRX"
        ,"CUX1","CUX2","DBX2","DLX5","DMBX1","DRGX"
        ,"DUXA","ESX1","EVX2","HDX","HLX","HNF1A"
    	  ,"HOXA1","HOXA2","HOXA3","HOXA5","HOXB1","HOXB3"
			  ,"HOXD3","ISL1","ISX","LHX8")
Pfam <- "PF00046"
@
<<secondchunk, echo=TRUE>>=
#Construct the object
lm <- newLowMACA(genes=Genes, pfam=Pfam)
str(lm , max.level=3)
@
Now we have created a \Rpackage{LowMACA} object. In this case, we want to analyze the homeodomain fold pfam (PF00046), considering 28 genes that belong to this clan. If we don't specify the pfam parameter, \Rpackage{LowMACA} proceeds to analyze the entire proteins passed by the genes parameter (we map only canonical proteins, one per gene). Vice versa, if we don't specify the genes parameter, \Rpackage{LowMACA} looks for all the proteins that contain the specified pfam (or pfams if you wish) and analyze just the portion of the protein assigned to the domain.
%
\subsection{Change default parameters}
A LowMACA object is composed by four slots. The first slot is \Rfunction{arguments} and is filled at the very creation of the object. It contains information as Uniprot name for the proteins associated to the genes, the amino acid sequences, start and end of the selected domains and many default parameters that can be change to start the analysis.

<<thirdchunk, echo=TRUE>>=
#See default parameters
lmParams(lm)
#Change some parameters
#Accept sequences even with no mutations
lmParams(lm)$min_mutation_number <- 0
#Changing selected tumor types
#Check the available tumor types in cBioPortal
available_tumor_types <- showTumorType()
head(available_tumor_types)
#Select melanoma, stomach adenocarcinoma, uterine cancer, lung adenocarcinoma, 
#lung squamous cell carcinoma, colon rectum adenocarcinoma and breast cancer
lmParams(lm)$tumor_type <- c("skcm" , "stad" , "ucec" , "luad" , "lusc" , "coadread" , "brca")
@

%
\section{Setup}
%
\subsection{Align sequences}

<<fourthchunk, echo=TRUE , eval=TRUE>>=
lm <- alignSequences(lm)
@
This method is basically self explained. It aligns the sequences in the object. If you didn't install clustalomega yet, you can use the web service of clustalomega that we wrapped in our R package. The limit is set to 2000 sequences and it is obviously slower than a local installation. Rememeber to put your own email in the mail command to activate this option since is required by the EBI server. In case of errors, they will be sent there.
<<fourthchunkBis, echo=TRUE , eval=TRUE  , message=FALSE , warning=FALSE>>=
lm <- alignSequences(lm , mail="lowmaca@gmail.com")
@

<<fifthchunck, echo=TRUE, eval=TRUE>>=
str(lmAlignment(lm) , max.level=2 , vec.len=2)
@
\begin{itemize}
	\item ALIGNMENT: mapping from original position to the position in the consensus
	\item SCORE: some score of distance between the sequences
	\item CLUSTAL: an object of class \Rfunction{AAMultipleAlignment} as provided by Biostrings R package
	\item df: Consesus sequence and conservation Trident Score at every position
\end{itemize}
%
\subsection{Get Mutations and Map Mutations}
<<sixthchunk, echo=TRUE , eval=TRUE>>=
lm <- getMutations(lm)
lm <- mapMutations(lm)
@
These commands produce a change in the slot mutation and provide the results from R cgdsr package.
<<seventhchunk2, echo=TRUE,eval=TRUE>>=
str(lmMutations(lm) , vec.len=3 , max.level=1)
@
\begin{itemize}
	\item data: provide the mutations selected from the cBioPortal divided by gene and patient/tumor type
	\item freq: a table containing the absolute number of mutated patients by gene and tumor type (this is useful to explore the mutational landscape of your genes in the different tumor types)
	\item aligned: a matrix of m rows, proteins or pfam, and n columns, consensus positions derived from the mapping of the mutations from the original positions to the new consensus 
\end{itemize}
If we want to check what are the most represented genes in terms of number of mutations divided by tumor type, we can simply run:
<<seventhchunk, echo=TRUE,eval=TRUE>>=
myMutationFreqs <- lmMutations(lm)$freq
#Showing the first genes
myMutationFreqs[ , 1:10]
@
This can be useful for a stratified analysis in the future.
%
\subsection{Whole setup}
If you don't want to provide your own mutation data you can use directly the command setup to launch alignSequences, getMutations and mapMutations at once
<<eighthchunk, echo=TRUE , eval=FALSE , message=FALSE , warning=FALSE>>=
#Local Installation of clustalo
lm <- setup(lm)
#Web Service
lm <- setup(lm , mail="lowmaca@gmail.com")
@
If you have your own data, you can use the getMutations method as \Rfunction{lm <- getMutations(lm , repos=myOwnData)}. myOwnData is a data.frame with the following columns:
<<ninthchunk, echo=TRUE>>=
str(lmMutations(lm)$data , vec.len=1)
@
Alternatively, setup accepts the parameter repos too.

%
\section{Statistics}
In this step we calculate the general statistics for the entire consensus profile
<<tenthchunk, echo=TRUE , eval=TRUE>>=
lm <- entropy(lm)
#Global Score
str(lmEntropy(lm))
#Per position score
head(lmAlignment(lm)$df)
@
With the method entropy, we calculate the entropy score and a pvalue against the null hypothesis that the mutations are distributed uniformly.
In addition, a test is performed for every position of the consensus and the output is reported in the slot \Rfunction{alignment}. The position 4 has a conservation score of 0.88 (highly conserved) and the corrected pvalue is significant (qvalue below 0.01).
There are signs of positive selection for the position 4. To retrieve the original mutations that generated that cluster, we can use the function lfm
<<eleventhchunk, echo=TRUE>>=
significant_muts <- lfm(lm)
#Display original mutations that formed significant clusters (column Multiple_Aln_pos)
head(significant_muts)
#What are the genes mutated in position 4 in the consensus?
cluster_4_genes <- significant_muts[ significant_muts$Multiple_Aln_pos==4 , 'Gene_Symbol']
@
<<eleventh_2chunck , echo=TRUE>>=
sort(table(cluster_4_genes))
@
The position 4 accounts for mutations in 13 different genes. The most represented one is ISX (ISX\_HUMAN, Intestine-specific homeobox protein).

%
\newpage
\section{Plot}
\subsection{Consensus Bar Plot}
\begin{center}
<<echo=TRUE, eval=TRUE, results="hide">>=
bpAll(lm)
@
\end{center}

This barplot shows all the mutations reported on the consensus sequence divided by protein/pfam domain

\newpage
\subsection{LowMACA comprehensive Plot}
\begin{center}
<< echo=TRUE, eval=TRUE, results="hide">>=
lmPlot(lm)
@
\end{center}
This four layer plot encompasses:
\begin{itemize}
	\item The bar plot visualize before
	\item The distribution of mutations against the null hypothesis 
		(blue line) with orange bars representing a pvalue below 0.05 for
		 that position and a red star for qvalue below 0.05
	\item The Trident score distribution
	\item The logo plot representing the consensus sequence
\end{itemize}

\subsection{Protter plot}
<<protterChunk, echo=TRUE, eval=TRUE ,message=FALSE ,warning=FALSE>>=
#This plot is saved as a png image
protter(lm , filename="homeobox.png")
@
<<protterremover, echo=FALSE, eval=TRUE ,message=FALSE ,warning=FALSE>>=
#Remove the file from vignette folder
file.remove("homeobox.png")
@


A request to the Protter server is sent and a png file is downloaded with the possible sequence structure of the protein and the significant positions colored in orange and red

%
\section{Summary}
Copy and paste on your R console and perform the entire analysis by yourself. You need Ghostscript to see all the plots.

<<summary, eval=FALSE , echo=TRUE>>=
library(LowMACA)
Genes <- c("ADNP","ALX1","ALX4","ARGFX","CDX4","CRX"
			,"CUX1","CUX2","DBX2","DLX5","DMBX1","DRGX"
			,"DUXA","ESX1","EVX2","HDX","HLX","HNF1A"
			,"HOXA1","HOXA2","HOXA3","HOXA5","HOXB1","HOXB3"
			,"HOXD3","ISL1","ISX","LHX8")
Pfam <- "PF00046"
lm <- newLowMACA(genes=Genes , pfam=Pfam)
lmParams(lm)$tumor_type <- c("skcm" , "stad" , "ucec" , "luad" , "lusc" , "coadread" , "brca")
lmParams(lm)$min_mutation_number <- 0
lm <- setup(lm , mail="lowmaca@gmail.com")
lm <- entropy(lm)
lfm(lm)
bpAll(lm)
lmPlot(lm)
protter(lm)
@

\newpage
\section{Session Information}
<<info,echo=TRUE>>=
sessionInfo()
@

\end{document}