%\VignetteIndexEntry{LowMACAAnnotation}
%\VignettePackage{LowMACAAnnotation}
%\VignetteDepends{LowMACAAnnotation}
%\VignetteEngine{knitr::knitr}
\documentclass[11pt]{article}
\usepackage[margin=2cm,nohead]{geometry}

\usepackage{hyperref}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\software}[1]{\textsf{#1}}
\newcommand{\R}{\software{R}}

\title{LowMACAAnnotation: a series of annotation tables for the LowMACA package}
\author{Stefano de Pretis, Giorgio Melloni}

\begin{document}

\maketitle

\tableofcontents

%
\section{Introduction}
The \Rpackage{LowMACAAnnotation} package consists of three interconnected datasets and the functions to retrieve them. The datasets are used by internal functions of the \Rpackage{LowMACA} package to map mutations from genes to protein sequences, and finally to Pfam sequences, in an unambiguous manner.

%
\section{Functionalities}
This package contains three simple functions to retrieve three manually curated datasets. The available data.frames are:
\begin{itemize}
	\item \Rfunction{myUni}: a dataset of proteins with their corresponding gene names and Uniprot amino acid sequences
	\item \Rfunction{myPfam}: a dataset of Pfam domains and their boundaries within the protein sequences of the myUni data.frame
	\item \Rfunction{myAlias}: a dataset of official gene symbols and their aliases, used by \Rpackage{LowMACA} internal functions to check that user input is correct
\end{itemize}

Below we show how to retrieve each dataset and describe its content.

<<firstchunk, echo=TRUE , eval=TRUE , message=FALSE , warning=FALSE>>=
library(LowMACAAnnotation)
myUni <- getMyUni()
str(myUni , nchar.max=10 , vec.len=2)
@

In detail, myUni is a data.frame with 9 columns:
\begin{itemize}
	\item Gene\_Symbol: a character vector of official Gene Symbols
	\item Entrez: a numeric vector of Entrez IDs
	\item UNIPROT: a character vector of Uniprot entries in "name\_HUMAN" format
	\item Entry: a character vector of Uniprot entries
	\item HGNC: a character vector of gene names as HGNC numbers
	\item Approved\_Name: a character vector of approved extended gene names
	\item Protein.name: a character vector of approved extended protein names
	\item Chromosome: a character vector of chromosomal cytoband positions
	\item AMINO\_SEQ: a character vector of amino acid sequences for Uniprot entries
\end{itemize}

<<secondchunk, echo=TRUE , eval=TRUE , message=FALSE , warning=FALSE>>=
myPfam <- getMyPfam()
str(myPfam , nchar.max=10 , vec.len=2)
@

In detail, myPfam is a data.frame with 11 columns, linked to the myUni dataset via the UNIPROT and Entry columns:
\begin{itemize}
	\item Entry: a character vector of Uniprot entries
	\item Envelope\_Start: a numeric vector of start positions of the Pfam domain relative to the reference protein
	\item Envelope\_End: a numeric vector of end positions of the Pfam domain relative to the reference protein
	\item Pfam\_ID: a character vector of the Pfam IDs supported by LowMACA, in the form PF\#\#\#\#\#
	\item Pfam\_Name: a character vector of full Pfam domain names
	\item Type: a character vector. One of the following: "Domain" "Family" "Repeat" or "Motif"
	\item Clan\_ID: a numeric vector of Clan IDs (a clan is a group of related Pfam families)
	\item Entrez: a numeric vector of Entrez IDs
	\item UNIPROT: a character vector of Uniprot entries in format "name\_HUMAN"
	\item Gene\_Symbol: a character vector of official Gene Symbols
	\item Pfam\_Fasta: a character vector of amino acid sequences of the corresponding Pfam domain
\end{itemize}
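
As a minimal sketch of how the two tables relate, the chunk below joins a Pfam table to a protein table on Entry and cuts each domain out of the full protein sequence with \Rfunction{substr}. The data.frames toyUni and toyPfam, and every value in them, are invented for illustration; only the column names come from the datasets described above. The sketch assumes that Envelope\_Start and Envelope\_End are 1-based, inclusive positions on AMINO\_SEQ, an assumption that should be checked against the real data before relying on it.

<<domainSeqSketch, echo=TRUE , eval=TRUE , message=FALSE , warning=FALSE>>=
# Toy stand-ins for myUni and myPfam: hypothetical values, real column names
toyUni <- data.frame(
  Entry       = c("P00001", "P00002"),
  Gene_Symbol = c("GENEA", "GENEB"),
  AMINO_SEQ   = c("MKTAYIAKQR", "MSDNELQRST"),
  stringsAsFactors = FALSE
)
toyPfam <- data.frame(
  Entry          = c("P00001", "P00002", "P00002"),
  Envelope_Start = c(2, 1, 6),
  Envelope_End   = c(5, 4, 10),
  Pfam_ID        = c("PF00001", "PF00002", "PF00001"),
  stringsAsFactors = FALSE
)

# Attach the full protein sequence to every domain row via the Entry column
merged <- merge(toyPfam, toyUni, by = "Entry")

# Cut the domain out of the protein
# (assumes 1-based, inclusive envelope coordinates)
merged$Domain_Seq <- substr(merged$AMINO_SEQ,
                            merged$Envelope_Start,
                            merged$Envelope_End)
merged[, c("Gene_Symbol", "Pfam_ID", "Envelope_Start",
           "Envelope_End", "Domain_Seq")]
@

The same two lines (\Rfunction{merge} followed by \Rfunction{substr}) could be tried on the data.frames returned by \Rfunction{getMyPfam} and \Rfunction{getMyUni}, dropping the columns they share beforehand to avoid duplicated names in the merged result.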

<<thirdchunk, echo=TRUE , eval=TRUE , message=FALSE , warning=FALSE>>=
myAlias <- getMyAlias()
str(myAlias , nchar.max=10 , vec.len=2)
@

In detail, myAlias is a data.frame with 5 columns:
\begin{itemize}
	\item Alias: a character vector of all the possible aliases and previous symbols for official Gene Symbols
	\item Official\_Gene\_Symbol: a character vector of the approved, official Gene Symbols from the HGNC database
	\item Locus\_Group: a character vector of the locus groups in the HGNC database, such as protein coding, RNA, pseudogene, etc.
	\item Locus\_Type: a character vector of the locus types in the HGNC database; a locus type is a finer specification of the locus group
	\item MappedByLowMACA: a character vector of "yes" or "no" values indicating whether the gene is included in myUni.RData
\end{itemize}
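
To illustrate the kind of input check an alias table makes possible, the chunk below resolves user-supplied names to official symbols. This is only a sketch and not the code used inside \Rpackage{LowMACA}: the function resolveSymbols and the data.frame toyAlias are hypothetical, and toyAlias holds just a handful of rows with the Alias and Official\_Gene\_Symbol columns described above.

<<aliasSketch, echo=TRUE , eval=TRUE , message=FALSE , warning=FALSE>>=
# Toy stand-in for myAlias: a few illustrative rows, real column names
toyAlias <- data.frame(
  Alias                = c("P53", "LFS1", "HER2", "NEU"),
  Official_Gene_Symbol = c("TP53", "TP53", "ERBB2", "ERBB2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep official symbols as they are, translate known
# aliases to their official symbol, and return NA for unknown names
resolveSymbols <- function(genes, aliasTable) {
  isOfficial <- genes %in% aliasTable$Official_Gene_Symbol
  hit <- match(genes, aliasTable$Alias)  # first matching alias row, or NA
  ifelse(isOfficial, genes, aliasTable$Official_Gene_Symbol[hit])
}

resolved <- resolveSymbols(c("TP53", "HER2", "NOTAGENE"), toyAlias)
resolved
@

Because \Rfunction{match} returns only the first hit, this sketch silently picks one official symbol when an alias is shared by several genes; a more careful check would report such ambiguous aliases to the user.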

%
\section{Datasets Curation}

The three datasets presented above are the result of a manual curation of the Uniprot database (\url{http://www.uniprot.org/}), the Pfam-A database (\url{http://pfam.xfam.org/}) and the HGNC database (\url{http://www.genenames.org/}). The entire script used to create the RData files can be found in the inst directory of this package.

The \Rpackage{LowMACA} package maps mutations on residues, rather than on genomic coordinates. This mapping raises a problem, since a mutation is in fact a change in DNA that can alter every protein produced from that stretch of DNA. Our software package needs a one-to-one match between gene and protein, as most variant annotation tools do. The translation of a DNA change into a unique amino acid change on a single transcript is sometimes referred to as "best effect" searching.

We apply the following rules, in order of priority, to create our one-gene, one-protein dataset. For every gene, we take the corresponding protein if:
\begin{enumerate}
	\item Only one protein is known in Uniprot database
	\item A "canonical" protein exists
	\item There is a unique match with the protein sequence chosen by cBioPortal annotation
	\item There is a unique match between gene symbol and Uniprot protein symbol (like LCE6A and LCE6A\_HUMAN)
	\item All the Uniprot entries are classified as "Fragment", except one
	\item Only one protein sequence is classified as "reviewed" by Uniprot
	\item Only one protein is chosen by HGNC database
	\item There is a partial match between gene symbol and Uniprot protein symbol, with a Levenshtein distance that does not exceed 3 (e.g. TP53 and P53\_HUMAN, distance=1). In case of ties, we take the "isoform 1" among them
	\item The protein has the longest sequence among all the possible transcripts
	\item The Uniprot name is the first in alphabetical order (in practice, this rule was never needed for any gene)
\end{enumerate}
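
The partial-match rule in the list above can be sketched with \Rfunction{adist} from the \Rpackage{utils} package, which computes the Levenshtein distance between strings. The chunk below is an illustration under stated assumptions and not the curation script shipped in the inst directory: the candidate list uniprotNames is made up for the example, and stripping the "\_HUMAN" suffix before comparing is our reading of the TP53 and P53\_HUMAN example.

<<levenshteinSketch, echo=TRUE , eval=TRUE , message=FALSE , warning=FALSE>>=
geneSymbol   <- "TP53"
# Illustrative candidate names, not taken from the curated data
uniprotNames <- c("P53_HUMAN", "P63_HUMAN", "P73_HUMAN")

# Compare the gene symbol with the name before the "_HUMAN" suffix
# (assumption: the suffix is ignored when measuring the distance)
prefix <- sub("_HUMAN$", "", uniprotNames)
d <- as.vector(utils::adist(geneSymbol, prefix))
names(d) <- uniprotNames
d

# Keep candidates within the threshold of 3, then take the closest one(s)
candidates <- d[d <= 3]
best <- names(candidates)[candidates == min(candidates)]
best
@

If best contains more than one name, the distances are tied and the rule falls back to "isoform 1", a step this sketch does not implement.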

Non-protein-coding genes are not included in the dataset, and all the Pfam domains considered belong to the proteins selected for the myUni dataset.

%
\section{Session Information}
<<info,echo=TRUE>>=
sessionInfo()
@

\end{document}