%\VignetteIndexEntry{AnnBuilder ABPrimer}
%\VignetteKeyword{annotation}
%\VignettePackage{AnnBuilder}
\documentclass[12pt]{article}
\usepackage{hyperref}
\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\begin{document}
\author{Jianhua Zhang}

\title{How to use AnnBuilder}

\maketitle

\section{Overview}

\Rpackage{AnnBuilder} constructs annotation data packages for given
sets of genes with known mappings to GenBank accession numbers,
UniGene identifiers, Image clone identifiers, or Entrez Gene identifiers. This
vignette describes the process of building an annotation data package
based on a sample file that maps probes to GenBank accession
numbers. The process involves:

\begin{enumerate}
\item Map a given set of probes to Entrez Gene identifiers.  There are two
  main components:
  \begin{itemize}
  \item Obtain mappings from different sources.  For a given
    set of genes, especially for genes on an Affymetrix chip, there are
    quite a few sources of existing mappings available from the web.
  \item Unify mappings from different sources.  Mapping information
    from different sources may agree or
    disagree. \Rpackage{AnnBuilder} resolves conflicts using a voting
    mechanism to obtain unified mappings between probes and Entrez Gene ids.
  \end{itemize}
\item Based on the unified mappings, extract data from Entrez Gene and
  other sources such as Golden Path, GO, and KEGG.
\item Combine data into an R data package.
\end{enumerate}
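
The voting idea in step 1 can be illustrated with a small base R
sketch. This is not the code that \Rpackage{AnnBuilder} runs
internally: the helper \Rfunction{unifyByVote}, the probe ids, and the
identifiers are all made up for illustration, and the tie-breaking
rule (the first identifier in sorted order wins) is an assumption.

<<eval=FALSE>>=
# Hypothetical helper: for each probe, keep the identifier that most
# sources agree on. Each source is a named character vector
# (names = probe ids, values = gene identifiers or NA).
unifyByVote <- function(sources) {
    probes <- unique(unlist(lapply(sources, names)))
    sapply(probes, function(p) {
        # Collect the vote of every source; unknown probes yield NA
        votes <- unlist(lapply(sources, function(s) unname(s[p])))
        votes <- votes[!is.na(votes)]
        if (length(votes) == 0) return(NA_character_)
        counts <- table(votes)
        # which.max returns the first maximum, so ties go to the
        # first identifier in sorted order
        names(counts)[which.max(counts)]
    })
}

# Three made-up sources that partly disagree
srcA <- c(p1 = "100", p2 = "200", p3 = "300")
srcB <- c(p1 = "100", p2 = "201", p3 = NA)
srcC <- c(p1 = "101", p2 = "200")
unified <- unifyByVote(list(srcA, srcB, srcC))
unified
@

In this toy input, two of three sources agree for probes p1 and p2, and
only one source has a value for p3, so that single value is kept.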

The capability of \Rpackage{AnnBuilder} is well beyond what has been described
above. In theory, any set of genes can be annotated using
\Rpackage{AnnBuilder} as long as they can be mapped to an id used by a public
data repository. However, that requires some programming
with the existing functions and is covered in another
vignette (Advanced). This vignette discusses how to annotate a set
of genes that are mapped to GenBank accession numbers using a single
function.

\section{Getting Started}
\subsection{Requirements}

\Rpackage{AnnBuilder} depends on the following items. A build
will fail if any of the requirements is missing.

\begin{itemize}
%\item {R package XML is required to support the functions dealing with
%  XML files. The package is available through
%  \url{http://cran.r-project.org}.}
\item Perl is required to process the potentially rather large
  annotation source data files.
\end{itemize}

\subsection{Function description}

For a set of genes that are mapped to GenBank accession numbers (or
UniGene identifiers, Image clone identifiers, or Entrez Gene identifiers), a
function named \Rfunction{ABPkgBuilder} can be used for building an
annotation package. \Rfunction{ABPkgBuilder} takes the following arguments:

\begin{description}
\item[baseName] A character string for the name of a file to be used
    as a base file to parse source data. The file should contain two
    columns with the first one being the target genes to be annotated
    and the other being the corresponding mappings to GenBank
    accession numbers, UniGene identifiers, Image clone identifiers,
    or Entrez Gene identifiers. The second column should have either a
    value or "NA".
\item[srcUrls] A named vector of character strings for the urls where
    source data files will be obtained. Valid sources are Entrez Gene,
    UniGene, Golden Path, Gene Ontology, and KEGG. The names for the
    character strings should be EG, UG, GP, GO, and KEGG,
    respectively. EG and UG are required. For Windows users, the
    values should be the names of unzipped files downloaded from the
    sources. The function call \verb+getSrcUrl("all", "Homo sapiens")+
    returns the urls needed for building a package for human. Other
    valid organism names include the scientific names for mouse and rat.
\item[baseMapType] A character string to indicate whether target genes
    in {\Robject{baseName}} are mapped to GenBank accession numbers (gb),
    UniGene identifiers (ug), Image clone identifiers (image), or
    Entrez Gene identifiers (ll).
\item[otherSrc] A named vector of character strings for the names of
    files that contain mappings between target genes in
    {\Robject{baseName}} and Entrez Gene identifiers that will be unified
    to get more reliable mappings.
\item[pkgName] A character string for the name of the data package to
    be built (e.g. hgu95av2, rgu34a).
\item[pkgPath] A character string for the full path of an existing
    directory where the package to be built will be stored.
\item[organism] A character string for the name of the organism of
    concern (currently one of \texttt{"Homo sapiens"},
    \texttt{"Mus musculus"}, or \texttt{"Rattus norvegicus"}). See the
    section \texttt{Extend AnnBuilder} if your organism is not one of
    these three species.
\item[version] A character string for the version number of the data
    package to be built.
\item[author] A list of character strings with an \texttt{authors}
    element for the name of the author of the data package and a
    \texttt{maintainer} element for the email address of the maintainer.
\end{description}

What we need to do is to assign correct values to the above arguments and
then call \Rfunction{ABPkgBuilder} with these arguments.
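
As a minimal sketch of the two-column layout expected for
\Robject{baseName}, the base R code below writes a small tab-delimited
file and reads it back. The probe ids and accession numbers are
placeholders invented for illustration, not real annotation data, and
the tab delimiter is assumed from the way the sample files are read
later in this vignette.

<<eval=FALSE>>=
# Placeholder probe ids and accession numbers (not real data)
toyBase <- data.frame(probe = c("probe_1", "probe_2", "probe_3"),
                      acc = c("X00001", NA, "M00002"),
                      stringsAsFactors = FALSE)
toyFile <- tempfile("toyBase")
# No header and no quotes; a missing mapping is written as NA
write.table(toyBase, file = toyFile, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE, na = "NA")
# Read the file back the same way the sample files are read
toyRead <- read.table(toyFile, sep = "\t", header = FALSE, as.is = TRUE)
# A usable base file has exactly two columns
stopifnot(ncol(toyRead) == 2)
toyRead
@

A file like this could then be supplied as \Robject{baseName}, with
\Robject{baseMapType} set to \texttt{"gb"} if the second column holds
GenBank accession numbers.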

\subsection{Datasets}

We have placed two data sets in the {\texttt{data}} directory of
\Rpackage{AnnBuilder} to demonstrate how to use
\Rfunction{ABPkgBuilder}. One of them is {\verb+hgu95av2_ID+}, which
contains Affymetrix probe ids and their mappings to GenBank accession
numbers for the HGU95Av2 array. The first rows of the file look like:

<<>>=
library(AnnBuilder)
read.table(system.file("data", "hgu95av2_ID", package = "AnnBuilder"),
           sep = "\t", header = FALSE, as.is = TRUE)[1:5,]
@

Now we set the file as the base file (\Robject{baseName}) and indicate
that the base file maps probes to GenBank accession numbers.

<<>>=
myBase <- system.file("data", "hgu95av2_ID", package = "AnnBuilder")
myBaseType <- "gb"
@

The data sources that can be used for annotation are abundant. We
focus on the following public data sources:

\begin{description}
\item[Entrez Gene] The data
    {\url{ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz}} will be
    used to map genes to Entrez Gene identifiers and also to annotate
    genes after the unified mappings have been obtained.
\item[UniGene] The data contained at
    {\url{ftp://ftp.ncbi.nih.gov/repository/UniGene/}} will be used to
    obtain mappings between genes and Entrez Gene identifiers. The exact
    data that will be used depend on the organism.
\item[Golden Path] The data (refLink.txt and refGene.txt) at
    {\url{http://www.genome.ucsc.edu/goldenPath/14nov2002/database}}
    will be used to obtain the chromosomal location and orientation
    data for genes. The \texttt{14nov2002} part of the url differs for
    other organisms and changes when there is a new build of the data sets.
\item[Gene Ontology] The data
    {\url{http://www.godatabase.org/dev/database/archive/2003-03-01/go_200303-termdb.xml.gz}}
    will be used to obtain gene ontology information. The last part
    of the url changes with builds.
\item[KEGG] Some data at
    {\url{ftp://ftp.genome.ad.jp/pub/kegg/pathway/organisms}} will be used to
    extract the pathway and enzyme information. Quite a few
    individual files will be used, and the system locates them
    using information available at the site given by the url.
\item[HomoloGene] A data file provided by
  {\url{ftp://ftp.ncbi.nih.gov/pub/HomoloGene/}} will be used to
  extract mappings between Entrez Gene ids and HomoloGene ids.
\end{description}

%% removed by Ting:  these info changes frequently.  It is not a good idea to specify these in case users get confused
%We may assign the urls to \Robject{srcUrls}.
%
%<<>>=
%mySrcUrls <- c(EG = "ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz", 
%               UG = "ftp://ftp.ncbi.nih.gov/repository/UniGene/Hs.data.gz",
%               GP = "http://www.genome.ucsc.edu/goldenPath/14nov2002/database/" , 
%               GO = "http://www.godatabase.org/dev/database/archive/2003-03-01/go_200303-termdb.xml.gz", 
%               KEGG = "ftp://ftp.genome.ad.jp/pub/kegg/pathway/organisms")
%
%
%However, 
\Rpackage{AnnBuilder} comes with URL information which can be
accessed via the global option \verb+AnnBuilderSourceUrls+. The
function \Rfunction{getSrcUrl} returns the urls for a given organism:

<<>>=
mySrcUrls <- getSrcUrl("all", "Homo sapiens")
mySrcUrls
@

%As \Rpackage{AnnBuilder} does not know how to handle
%{\texttt{.gz}} files under windows. Therefore, any of the source files
%that are of type {\texttt{.gz}} (namely LL, UG, GP, and GO) will have to be
%downloaded/unzipped by windows users and then stored locally before
%hand. The file names of the downloaded/unzipped files will be used to replace
%the urls for the corresponding source data files.

%In the vignette, we will use truncated versions of some of the files
%to reduce the length of time required to process the source data. The
%truncated files are stored at the Bioconductor web site. Note that EG is replaced by LL just to use the example data. 
%For windows
%users, we have downloaded/unzipped the source files and stored them in
%the {\texttt{data}} directory of \Rpackage{AnnBuilder}.

%% FIXME: the LL and UG urls here are not correct.  

If no other sources of mappings between the target genes and
Entrez Gene identifiers are available, the mappings provided by Entrez Gene
and UniGene will be unified. However, as an example, let us assume
that we have mappings from Affymetrix and from another unidentified
source that we would like to use as additional sources for the unified
mappings. The two source files are also stored in the
\Robject{data} directory of \Rpackage{AnnBuilder}.

<<>>=
read.table(system.file("data", "hgu95av2_AFFY", package = "AnnBuilder"),
           sep = "\t", header = FALSE, as.is = TRUE)[1:5, ]
read.table(system.file("data", "srcb", package = "AnnBuilder"),
           sep = "\t", header = FALSE, as.is = TRUE)
@

We assign the two files to \Robject{otherSrc}:

<<>>=
myOtherSrc <- c(srcone = system.file("data", "hgu95av2_AFFY",
                                     package = "AnnBuilder"),
                srctwo = system.file("data", "srcb",
                                     package = "AnnBuilder"))
@

The other arguments needed are straightforward and are not
elaborated here.

\subsection{Build annotation}

To build an annotation data package, we only have to call
\Rfunction{ABPkgBuilder} with correct argument values. However, the
code below (and hereafter) is not evaluated under Windows, where manual
intervention is required. Copying the code chunk and
pasting it into an R session under Windows should work.

<<>>=
    myDir <- tempdir()
@ 
   
\begin{Sinput}
> ABPkgBuilder(baseName = myBase, srcUrls = mySrcUrls, baseMapType =
             myBaseType, otherSrc = myOtherSrc, pkgName = "hgu95av2",
             pkgPath = myDir, organism = "Homo sapiens", version = "1.1.0",
             author = list(authors = "myname", maintainer = 
                             "myname@myemail.com"), fromWeb = TRUE)
\end{Sinput}

Please note that the build takes quite a while to finish. Once it
completes, a data package named \texttt{hgu95av2} will be in the
directory defined by \Robject{myDir}. The data package has
\texttt{data}, \texttt{man}, and \texttt{R} subdirectories, each
containing some files. The created data package can be installed the
same way as a regular R package.
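
As a quick sanity check after a build, one might confirm that the
expected subdirectories exist before installing. The helper
\Rfunction{hasPkgLayout} below is hypothetical (it is not part of
\Rpackage{AnnBuilder}) and tests only for the \texttt{data},
\texttt{man}, and \texttt{R} subdirectories. It is demonstrated on a
mock directory tree rather than on a real build.

<<eval=FALSE>>=
# Hypothetical helper: TRUE if pkgDir has data, man, and R subdirectories
hasPkgLayout <- function(pkgDir) {
    all(dir.exists(file.path(pkgDir, c("data", "man", "R"))))
}

# Create a mock package tree in a temporary location
mockPkg <- file.path(tempfile("mockBuild"), "hgu95av2")
for (d in c("data", "man", "R")) {
    dir.create(file.path(mockPkg, d), recursive = TRUE)
}
layoutOk <- hasPkgLayout(mockPkg)
layoutOk
@

With a real build, the same check would be applied to the package
directory under \Robject{myDir}. The package could then be installed
from the command line with \texttt{R CMD INSTALL}, or from within R by
calling \Rfunction{install.packages} on that directory with
\verb+repos = NULL+ and \verb+type = "source"+.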

\section{Extend AnnBuilder}

\Rpackage{AnnBuilder}

\section{Further note}

Function \Rfunction{ABPkgBuilder} works only if the data files are of
the correct format (e.g. delimiter-separated two-column text files)
and the urls for the source data and the information on their builds
remain unchanged. When the urls change, the function will fail, and
users have little control over the outcome because
\Rfunction{ABPkgBuilder} makes assumptions and then calls different
functions based on those assumptions. Another vignette,
{\texttt{AnnBuilder}}, shows in detail how to use the functions that
\Rfunction{ABPkgBuilder} is based on, which are also available in
\Rpackage{AnnBuilder}, to build data packages. More coding is involved
there, but users gain much greater control over the building process
and can avoid the failures that may occur when using
{\texttt{ABPkgBuilder}}. Users are encouraged to read that vignette
once they are comfortable with \Rfunction{ABPkgBuilder}.

\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<echo=FALSE>>=
sessionInfo()
@

\end{document}