%\VignetteIndexEntry{Using the PAnnBuilder Package}
%\VignetteKeywords{Annotation}
%\VignetteDepends{PAnnBuilder}
%\VignettePackage{PAnnBuilder}

\documentclass[12pt,fullpage]{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\usepackage{amsmath,pstricks,fullpage}
% With MikTeX 2.9, using pstricks also requires using auto-pst-pdf or running
% pdflatex will fail. Note that using auto-pst-pdf requires to set environment
% variable MIKTEX_ENABLEWRITE18=t on Windows, and to set shell_escape = t in
% texmf.cnf for Tex Live on Unix/Linux/Mac.
\usepackage{auto-pst-pdf}
\usepackage{hyperref}
\usepackage{url}
\usepackage[authoryear,round]{natbib}
\usepackage{multirow}

\author{Hong Li$^\ddagger$\footnote{sysptm@gmail.com}}
\begin{document}
\title{Using the PAnnBuilder Package}
\maketitle
\begin{center}$^\ddagger$Key Lab of Systems Biology\\ 
Shanghai Institutes for Biological Sciences\\ 
Chinese Academy of Sciences, P. R. China
\end{center}

\tableofcontents

\section{Overview} 

In the genomic era, genome-scale experiments and data analyses require genes to be 
annotated from different sources; the R \cite{1} package AnnBuilder was developed
for this purpose \cite{2}. In the post-genomic era, advances in proteomics highlight 
the urgency of understanding the language of proteins \cite{3}. However, AnnBuilder 
is of limited use for protein annotation, because proteins are more complex than 
genes owing to 3-D structure, alternative splicing, modification, dynamic location 
and other features. The package PAnnBuilder focuses on assembling proteomic 
annotation data to support the interpretation of proteomic data. 
It not only inherits the strengths of AnnBuilder, such as automatically parsing 
annotation data and building R packages from selected sources, but also 
emphasizes features specific to proteins. PAnnBuilder currently supports 16 
databases covering diverse aspects of proteomics, such as protein primary data, 
protein domain/family, subcellular location, protein interaction, post-translational 
modifications, body fluid proteomics, homolog/ortholog groups and so on. 
Additionally, PAnnBuilder can annotate unknown proteins based on sequence 
similarity to other well-annotated proteins. To extend the use of PAnnBuilder, 
54 standard R annotation packages have been produced from the main protein databases 
and are freely available for download via biocLite.

\section{Getting Started}

\subsection{Requirements}

PAnnBuilder requires the following:
\begin{enumerate}
\item For PAnnBuilder $\geq$ 1.3.0, R $\geq$ 2.7.0 is needed to build SQLite-based 
packages. The following R packages must be installed: \Rpackage{methods}, 
\Rpackage{utils}, \Rpackage{RSQLite}, 
\Rpackage{Biobase} ($\geq$ 1.17.0), \Rpackage{AnnotationDbi} ($\geq$ 1.3.12). 
If you use the installation script "PAnnBuilder.R" to install PAnnBuilder ($\geq$ 1.3.0), 
it will automatically check for these packages and install any missing 
ones from CRAN or Bioconductor.
\item Rtools is needed on Windows. The version of Rtools matching your version of R can 
be downloaded from {\url{http://www.murdoch-sutherland.com/Rtools/}}.
\item Perl is required to parse the rather large annotation source data files. 
It can be downloaded from {\url{http://www.perl.com/download.csp}}.
\item The programs formatdb and BLASTP are needed by the functions \Rfunction{crossBuilder}, 
\Rfunction{crossBuilder\_DB}, \Rfunction{pSeqBuilder} and \Rfunction{pSeqBuilder\_DB}.

BLAST can be downloaded from {\url{http://www.ncbi.nlm.nih.gov/BLAST/download.shtml}}.
\end{enumerate}
Note: Make sure the temporary directory has enough free space. The path of
the per-session temporary directory can be obtained with:
<<tempdir>>=
tempdir() 
@
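
The external programs listed above can also be checked from within R. The 
following is a minimal sketch using the base R function \Rfunction{Sys.which}; 
the executable names (\texttt{perl}, \texttt{formatdb}, \texttt{blastall}) are 
assumptions and may differ depending on how Perl and BLAST were installed.
<<checkPrograms, eval=FALSE>>=
## Executable names are assumptions; adjust them to your installation.
programs <- c("perl", "formatdb", "blastall")
found <- Sys.which(programs)   # "" for programs that are not on the PATH
missingPrograms <- names(found)[!nzchar(found)]
if (length(missingPrograms) > 0) {
  message("Not found on PATH: ", paste(missingPrograms, collapse = ", "))
}
@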

\subsection{Installation}
The biocLite script is used to install PAnnBuilder from within R:
<<install, eval=FALSE>>=
source("http://bioconductor.org/biocLite.R")
biocLite("PAnnBuilder")
@
<<load>>=
library(PAnnBuilder)
@
Note: An internet connection is needed to install PAnnBuilder and the packages it depends on.

\subsection{Public Data Sources}

PAnnBuilder parses proteomic annotation data from public sources and builds R 
annotation packages. It also provides convenient functions to access these sources. 
For example, you can list all supported databases for "Homo sapiens" with:
<<urls, eval=FALSE>>=
library(PAnnBuilder)         ## Load package 
getALLUrl("Homo sapiens")    ## Get urls 
getALLBuilt("Homo sapiens")  ## Get version/release
@
The public data sources supported by PAnnBuilder are described below:
\begin{description}
\item[UniProt] The data
    {\url{ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase}} will be
    used to map protein UniProt identifiers to the diverse annotation available in 
    the UniProt database.
\item[IPI] The data
    {\url{ftp://ftp.ebi.ac.uk/pub/databases/IPI/current}} will be
    used to map protein IPI identifiers to the diverse annotation available in 
    the International Protein Index database.
\item[RefSeq] The data
    {\url{ftp://ftp.ncbi.nih.gov/refseq}} will be
    used to map protein RefSeq identifiers to the diverse annotation available in 
    the NCBI RefSeq database.
\item[Entrez Gene] The data
    {\url{ftp://ftp.ncbi.nih.gov/gene/DATA}} will be
    used to annotate genes after the Entrez Gene identifiers have been obtained.
\item[Gene Ontology] The data
    {\url{ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/proteomes}} and 
    {\url{http://archive.geneontology.org/latest-termdb}}
    will be used to obtain gene ontology information. 
\item[KEGG] Some data at
    {\url{ftp://ftp.genome.jp/pub/kegg}} will be used to obtain pathway 
    information.
\item[HomoloGene] A data file provided by
  {\url{ftp://ftp.ncbi.nlm.nih.gov/pub/HomoloGene/current/}} will be used to
  extract mappings between GeneID/ProteinGI and HomoloGene ids.
\item[InParanoid] The data 
  {\url{http://inparanoid.sbc.su.se}} will be used to obtain ortholog protein 
  groups between two organisms.
\item[Gene Interaction] The data 
  {\url{ftp://ftp.ncbi.nih.gov/gene/GeneRIF}} will be used to extract 
  protein-protein interactions between Entrez GeneID or Protein RefSeq ids.
\item[IntAct] The data 
  {\url{ftp://ftp.ebi.ac.uk/pub/databases/intact/current/psimitab}} will be used
  to extract protein-protein interactions between UniProt protein accession numbers.
\item[MPPI] The data 
  {\url{http://mips.gsf.de/proj/ppi/data/mppi.gz}} will be used to extract 
  protein-protein interactions between UniProt protein accession numbers.
\item[3DID] The data 
  {\url{http://gatealoy.pcb.ub.es/3did/download}} will be used to extract 
  domain-domain interactions between Pfam domain identifiers.
\item[DOMINE] The data 
  {\url{http://domine.utdallas.edu}} will be used to extract 
  domain-domain interactions between Pfam domain identifiers.
\item[DBSubLoc] The data 
 {\url{http://www.bioinfo.tsinghua.edu.cn/~guotao}} will be used to obtain
 subcellular localization for proteins from the SWISS-PROT and PIR databases.
\item[BaCelLo] The data 
 {\url{http://gpcr2.biocomp.unibo.it/bacello}} will be used to map SwissProt
 eukaryotic protein identifiers to subcellular localization.
\item[SCOP] The data 
 {\url{http://scop.mrc-lmb.cam.ac.uk/scop}} will be used to map PDB structure 
 identifiers to SCOP domain identifiers. 
\item[PeptideAtlas] The data 
 {\url{http://www.peptideatlas.org/builds}} will be used to obtain experimentally 
 identified peptides and their coordinates on chromosomes.
\item[SysPTM] The data 
 {\url{http://www.biosino.org/papers/SysPTM}} will be used to obtain information
 on protein post-translational modifications.
\item[Sys-BodyFluid] The data 
 {\url{http://www.biosino.org/papers/Sys-BodyFluid}} will be used to map IPI
 protein identifiers to body fluids.  
\end{description}

\subsection{Annotation Packages Produced by PAnnBuilder}

\subsubsection{Packages produced by PAnnBuilder}

PAnnBuilder can build R packages that assemble proteome 
annotation. However, building a new package may be time-consuming 
because large data files must be downloaded and parsed. To make PAnnBuilder useful 
to all users, we have built many frequently used annotation packages in advance.
These pre-built packages can be downloaded via biocLite.

These packages fall into two classes: environment-based 
packages built by "*Builder" functions (see Table~\ref{table:packages1}) and SQLite-based packages built by "*Builder\_DB" 
functions (see Table~\ref{table:packages2}). Both are widely used methods for building Bioconductor 
meta-data packages. Each SQLite-based annotation package (identified by a ".db"
suffix in the package name) contains a number of AnnDbBimap objects in
place of the environment objects found in the old-style environment-based
annotation packages. In the future, SQLite-based annotation packages will
replace environment-based packages. 

The pre-built packages provide a quick start for R beginners. For example, to analyze 
a set of proteins from the Human IPI database, the quickest way is to download and use 
the SQLite-based package "org.Hs.ipi.db". 
However, if the package you need has not been built, or a new version of the database 
has been released, a new package should be built using the functions in PAnnBuilder (see 
Section 4 for methods of building annotation packages).

\begin{table}
\footnotesize
\caption{\label{table:packages2}SQLite-based Annotation Packages Produced by PAnnBuilder.}
\begin{center}
\begin{tabular}{|p{0.23\textwidth}|p{0.18\textwidth}|p{0.12\textwidth}|p{0.17\textwidth}|p{0.15\textwidth}|}
\hline
\textbf{Description} & \textbf{R Function} & \textbf{Source} & \textbf{Organism} 
 & \textbf{Package} \\ \hline
\multirow{9}{0.25\textwidth}{complete and canonical annotation for all proteins of
 a specific organism, including protein description, Entrez gene 
 identifier, KEGG pathway, gene ontology, domain, and so on.} & 
\multirow{9}{0.2\textwidth}{pBaseBuilder\_DB} &
 \multirow{3}{0.15\textwidth}{IPI} & 
  Homo sapiens & org.Hs.ipi.db \\
  & & & Mus musculus & org.Mm.ipi.db \\ 
  & & & Rattus norvegicus & org.Rn.ipi.db \\ 
 & & \multirow{3}{0.15\textwidth}{Swiss-Prot} & 
  Homo sapiens & org.Hs.sp.db \\
  & & & Mus musculus & org.Mm.sp.db \\
  & & & Rattus norvegicus & org.Rn.sp.db \\ 
 & & \multirow{3}{0.15\textwidth}{RefSeq} & 
  Homo sapiens & org.Hs.ref.db \\
  & & & Mus musculus & org.Mm.ref.db \\
  & & & Rattus norvegicus & org.Rn.ref.db \\ \hline 
\multirow{3}{0.25\textwidth}{protein identifier mapping} & 
\multirow{3}{0.2\textwidth}{crossBuilder\_DB} & 
\multirow{3}{0.15\textwidth}{Swiss-Prot, IPI, RefSeq} & 
  Homo sapiens & org.Hs.cross.db \\
  & & &  Mus musculus & org.Mm.cross.db \\
  & & & Rattus norvegicus & org.Rn.cross.db \\ \hline 
\multirow{5}{0.25\textwidth}{protein-protein or domain-domain interaction data} & 
\multirow{5}{0.2\textwidth}{intBuilder\_DB}  
 & IntAct & & int.intact.db \\
 & & NCBI Gene & & int.geneint.db \\
 & & MPPI & & int.mppi.db \\
 & & 3did & & int.did.db \\
 & & Domine & & int.domine.db \\ \hline
\multirow{2}{0.25\textwidth}{protein subcellular location} &
\multirow{2}{0.2\textwidth}{subcellBuilder\_DB} 
 & BaCelLo & & sc.bacello.db \\
 & & DBSubLoc & & sc.dbsubloc.db \\ \hline
protein structure classification & scopBuilder\_DB & SCOP & 
 & scop.db \\ \hline
protein post-translational modification & ptmBuilder\_DB & SysPTM &
 & sysptm.db \\ \hline
body fluid protein & bfBuilder\_DB & Sys-BodyFluid & Homo sapiens
 & org.Hs.bf.db \\ \hline
homolog protein group & HomoloGeneBuilder\_DB & HomoloGene
 & & homolog.db \\ \hline
ortholog protein group & InParanoidBuilder\_DB & InParanoid
 & Homo sapiens, Mus musculus & org.HsMm.ortholog.db \\ \hline
peptides identified by mass spectrometry & PeptideAtlasBuilder\_DB
 & PeptideAtlas & Homo sapiens & org.Hs.pep.db \\ \hline
gene ontology & GOABuilder\_DB & GOA & Homo sapiens & org.Hs.goa.db \\ \hline
identifier and name & dNameBuilder\_DB & & & dName.db \\ \hline
\end{tabular}
\end{center}
\end{table}
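
Before building or downloading anything, it can help to see which of the 
pre-built packages in Table~\ref{table:packages2} are already installed. The 
following is a minimal sketch using only base R; the selection of package 
names is an arbitrary example taken from the table.
<<prebuiltInstalled, eval=FALSE>>=
## Package names are taken from the table of SQLite-based packages.
wanted <- c("org.Hs.ipi.db", "org.Hs.sp.db", "org.Hs.ref.db", "org.Hs.cross.db")
isInstalled <- wanted %in% rownames(installed.packages())
names(isInstalled) <- wanted
isInstalled
## Packages that would still have to be installed or built.
toGet <- wanted[!isInstalled]
@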

\subsubsection{Using annotation data package}
SQLite-based *.db packages support flexible data queries, reverse mapping,
and data filtering. The vignette of "AnnotationDbi" describes in detail how to use
SQLite-based packages
({\url{http://www.bioconductor.org/packages/release/bioc/vignettes/AnnotationDbi/inst/doc/AnnotationDbi.pdf}}). 
Here the package "org.Hs.ipi.db" produced by PAnnBuilder is used as an example
in the following code chunks.
\begin{enumerate}
\item Install and load the annotation package. Package "org.Hs.ipi.db" is used as an example;
 other packages can be used in the same way.
<<exampleInstall1, eval=FALSE>>=
biocLite("org.Hs.ipi.db")
@
<<exampleLoad1>>=
library(org.Hs.ipi.db) # Load annotation package
@
\item Browse data in the package.
<<help>>=
?org.Hs.ipiGENEID
@
\item Convert the Bimap object into a "list" object, and get values  
by index or name.
<<data1>>=
xx <- as.list(org.Hs.ipiGENEID)
xx[!is.na(xx)][1:3]
xx[["IPI00792103.1"]]
@
\item Specific utilities for SQLite-based *.db packages.
<<data2>>=
## Access the data in table (data.frame) format via function "toTable".
toTable(org.Hs.ipiPATH[1:3])
## Reverse the roles of protein and pathway via 
## function "revmap" or "reverseSplit".
tmp1 <- revmap(org.Hs.ipiPATH)  ## returns an AnnDbBimap object
class(tmp1)
as.list(tmp1)[1]
tmp2 <- reverseSplit(as.list(org.Hs.ipiPATH)) ## returns a list object
class(tmp2)
tmp2[1]

## The left and right keys of the Bimap can be extracted 
## using "Lkeys" and "Rkeys".
Lkeys(org.Hs.ipiPATH)[1:3]
Rkeys(org.Hs.ipiPATH)[1:3]

## Get the create table statements.
org.Hs.ipi_dbschema()

## Use "SELECT" SQL query.
selectSQL<-paste("SELECT ipi_id, de",
           "FROM basic",
           "WHERE de like '%histone%'")
tmp3 <- dbGetQuery(org.Hs.ipi_dbconn(), selectSQL)
tmp3[1:3,]
@
\end{enumerate}
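
Individual identifiers can also be looked up without converting a whole map 
into a list. The following is a minimal sketch using \Rfunction{mappedkeys}, 
\Rfunction{mget} and Bimap subsetting from \Rpackage{AnnotationDbi}; it 
assumes that "org.Hs.ipi.db" is installed and is skipped otherwise.
<<lookupKeys, eval=FALSE>>=
if (require("org.Hs.ipi.db")) {
  ## Identifiers that have at least one Entrez Gene ID mapped to them.
  mapped <- mappedkeys(org.Hs.ipiGENEID)
  length(mapped)
  ## Look up a few identifiers at once; mget() returns a named list.
  mget(mapped[1:3], org.Hs.ipiGENEID)
  ## Restrict the map to selected keys, then export it as a data.frame.
  toTable(org.Hs.ipiGENEID[mapped[1:3]])
}
@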

\section{Function Description}

\subsection{Getting URL and Version}

To download a data file from a public database, the first step is to get its URL 
and release/version information. URLs of supported databases are stored in 
"data/sourceURLs.txt". The following functions are used to get the URL and version:
\begin{itemize}
\item \Rfunction{getSrcUrl} - return the URL for a given database.
\item \Rfunction{getALLUrl} - return URLs for all databases used in PAnnBuilder 
 packages.
\item \Rfunction{getSrcBuilt} - return the release/version for a given database.
\item \Rfunction{getALLBuilt} - return release/version information for all 
 databases used in PAnnBuilder packages.
\end{itemize}
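
As a usage sketch, the URLs for several organisms can be collected in one 
call. Only the call \texttt{getALLUrl("Homo sapiens")} appears earlier in this 
vignette; that the other two organism names are accepted in the same form is 
an assumption based on the organisms listed for the pre-built packages.
<<urlsByOrganism, eval=FALSE>>=
library(PAnnBuilder)
## Organism names other than "Homo sapiens" are assumed to be accepted.
organisms <- c("Homo sapiens", "Mus musculus", "Rattus norvegicus")
allUrls <- lapply(organisms, getALLUrl)
names(allUrls) <- organisms
## Inspect the structure of the result without printing every URL.
str(allUrls, max.level = 1)
@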

\subsection{Parsing and Writing Data}

Parsing is a key step in converting an original data file into R objects. Sometimes R is 
used directly to parse and write data. For large data files or complicated 
data formats, however, Perl is first employed to process the data quickly, and an R 
function then reads the result file into R objects. 

\subsubsection{Employing Perl programs to parse data}

Segments of Perl programs are stored as files in "inst/scripts". The names and 
functions of these parser files are as follows:
\begin{itemize}
\item spParser - parse protein data from SwissProt or TrEMBL
\item ipiParser - parse protein data from IPI
\item refseqParser - parse protein data from NCBI RefSeq
\item equalParser - find protein ID mapping with equal sequences
\item mergeParser - merge different ID mapping files
\item mppiParser - parse protein-protein interaction data from MIPS
\item paParser - parse data from PeptideAtlas
\item dbsublocParser - parse data from DBSubLoc
\item pfamNameParser - parse domain id and name from Pfam
\item blastParser - filter the results of blast
\end{itemize}
The functions \Rfunction{fileMuncher} and \Rfunction{fileMuncher\_DB} assemble a Perl 
script from the given parser file and additional input data files, then run this 
Perl script from within R.
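
The general pattern, in which Perl does the heavy parsing and R reads the 
result, can be sketched with base R alone. The code below is an illustration 
and not the implementation used by \Rfunction{fileMuncher}; the input format, 
the parser script and all variable names are hypothetical.
<<perlFromR, eval=FALSE>>=
## Hypothetical input: two records with "ID" and "DE" lines, ended by "//".
infile <- tempfile()
writeLines(c("ID   P1", "DE   histone H4", "//",
             "ID   P2", "DE   kinase", "//"), infile)
## Hypothetical parser: print one "id<TAB>description" line per record.
script <- tempfile()
writeLines(c(
  "my ($id, $de) = ('', '');",
  "while (<>) {",
  "  chomp;",
  "  if (/^ID\\s+(\\S+)/) { $id = $1; }",
  "  elsif (/^DE\\s+(.*)/) { $de = $1; }",
  "  elsif (m{^//}) { print \"$id\\t$de\\n\"; ($id, $de) = ('', ''); }",
  "}"), script)
## Run Perl only if it is available, then read its output into R.
if (nzchar(Sys.which("perl"))) {
  out <- system2("perl", c(shQuote(script), shQuote(infile)), stdout = TRUE)
  parsed <- read.delim(textConnection(out), header = FALSE,
                       col.names = c("id", "de"), stringsAsFactors = FALSE)
  parsed
}
@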

\subsubsection{Writing data using R}

Besides Perl programs, R functions also parse data from simple data files and 
store them as R environment objects or Bimap objects.
\begin{itemize}
\item \Rfunction{createEmptyDPkg} - create an empty R package in a given directory.
\item \Rfunction{writeSQ} - write sequence data into R package.
\item \Rfunction{writeName} - parse multiple data files, and write the mapping of 
 id and name into R package. It employs \Rfunction{writeGOName}, 
 \Rfunction{writeKEGGName}, \Rfunction{writePFAMName},  
 \Rfunction{writeINTERPROName} and \Rfunction{writeTAXName} to 
 respectively write data from GO, KEGG, Pfam, InterPro and TAX.
\item \Rfunction{writeSCOPData} - parse structural classification of proteins from 
  SCOP database.
\item \Rfunction{writeSubCellData} - parse data from protein subcellular 
  location databases, and write into R package. It employs 
  \Rfunction{writeBACELLOData} and \Rfunction{writeDBSUBLOCData} to 
  respectively write data from BaCelLo and DBSubLoc.
\item \Rfunction{writeIntData} - parse data from protein-protein/domain-domain
  interaction databases, and write into R package. It employs 
  \Rfunction{writeGENEINTData}, \Rfunction{writeINTACTData}, 
  \Rfunction{writeMPPIData}, \Rfunction{write3DIDData} and 
  \Rfunction{writeDOMINEData} to respectively write data from NCBI gene 
  interaction data file, EBI intact, MIPS interaction data, 3DID database 
  and DOMINE database.
\item \Rfunction{writePtmData} - parse database involving protein 
  post-translational modifications, and write into R package. It employs 
  \Rfunction{writeSYSPTMData} to write data from SysPTM database.
\item \Rfunction{writeBfData} - write data involving body fluid proteomics 
  into R package. It employs \Rfunction{writeSYSBODYFLUIDData} to write data
  from Sys-BodyFluid database.
\item \Rfunction{writeGOAData} - write gene ontology terms from GOA database 
  into R package.
\item \Rfunction{writeHomoloGeneData} - write homolog groups from NCBI 
  HomoloGene into R package.
\item \Rfunction{writeInParanoidData} - write ortholog groups from InParanoid 
  into R package.
\item \Rfunction{writePeptideAtlasData} - write peptides identified by Mass 
  Spectrometry from PeptideAtlas database into R package.
\item \Rfunction{writeMeta\_DB} - write meta information about the annotation 
  package into SQLite-based package.
\item \Rfunction{writeData\_DB} - parse data from databases, and write data as 
  tables into SQLite-based R package. It employs \Rfunction{writeSPData\_DB}, 
  \Rfunction{writeIPIData\_DB}, \Rfunction{writeREFSEQData\_DB}, 
  \Rfunction{writeGENEINTData\_DB}, \Rfunction{writeINTACTData\_DB}, 
  \Rfunction{writeMPPIData\_DB}, \Rfunction{write3DIDData\_DB},
  \Rfunction{writeDOMINEData\_DB}, \Rfunction{writeSYSBODYFLUIDData\_DB}, 
  \Rfunction{writeSYSPTMData\_DB}, \Rfunction{writeSCOPData\_DB}, 
  \Rfunction{writeBACELLOData\_DB}, \Rfunction{writeDBSUBLOCData\_DB},
  \Rfunction{writeGOAData\_DB}, \Rfunction{writeHomoloGeneData\_DB},
  \Rfunction{writeInParanoidData\_DB}, and \Rfunction{writePeptideAtlasData\_DB}
  to respectively write data from Swiss-Prot, IPI, NCBI RefSeq database, and so on.
\item \Rfunction{writeName\_DB} - parse multiple data files, and write the mapping
  of id and name into SQLite-based R package. It employs \Rfunction{writeGOName\_DB}, 
  \Rfunction{writeKEGGName\_DB}, \Rfunction{writePFAMName\_DB},  
  \Rfunction{writeINTERPROName\_DB} and \Rfunction{writeTAXName\_DB} to 
  respectively write data from GO, KEGG, Pfam, InterPro and TAX.
\item \Rfunction{createSeeds} - define a list of AnnDbBimap objects which indicates
  key and value of .
\item \Rfunction{createAnnObjs} - produce AnnDbBimap objects based on the 
  definition in \Rfunction{createSeeds}.
\end{itemize}
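
To make the idea of environment-based storage concrete, the following base R 
sketch stores a key-to-value mapping in an environment and queries it. The 
identifiers and locations are invented for illustration, and the code does not 
reproduce what the \Rfunction{write*} functions actually store.
<<envMapping, eval=FALSE>>=
## Hypothetical identifiers and locations, for illustration only.
locEnv <- new.env(hash = TRUE)
assign("P1", "nucleus", envir = locEnv)
assign("P2", c("cytoplasm", "membrane"), envir = locEnv)
## Look up by key; unknown keys yield NA instead of an error.
hits <- mget(c("P1", "P3"), envir = locEnv, ifnotfound = NA)
## Convert the whole environment to a named list.
allLoc <- as.list(locEnv)
@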

\subsection{Writing Help Documents}

Help documents are an important part of a new package. Templates for help 
documents are stored in the "inst/templates" directory. When building a new package, 
R functions use these templates to create "*.Rd" help files in the "man" directory:
\begin{itemize}
\item \Rfunction{getRepList} - return a list of values which will replace the symbols in 
 a template file.
\item \Rfunction{copyTemplates\_DB} - works like 
 \Rfunction{copyTemplates}, but is designed for SQLite-based 
 annotation packages. 
\item \Rfunction{writeDescription\_DB} - works like 
 \Rfunction{writeDescription}, but is designed for SQLite-based 
 annotation packages. 
\end{itemize}
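
The template mechanism amounts to replacing symbols in a text file with 
package-specific values. The following base R sketch shows the idea; the 
placeholder syntax (\texttt{\#PKGNAME\#}, \texttt{\#SOURCE\#}), the template 
lines and the replacement list are hypothetical and may not match the 
templates shipped in "inst/templates".
<<templateFill, eval=FALSE>>=
## Hypothetical template lines and placeholder symbols.
template <- c("\\name{#PKGNAME#}",
              "\\title{Annotation data from #SOURCE#}")
repList <- list(PKGNAME = "org.Mm.ipi.db", SOURCE = "IPI")
filled <- template
## Replace each "#SYMBOL#" literally with its value from the list.
for (sym in names(repList)) {
  filled <- gsub(paste("#", sym, "#", sep = ""), repList[[sym]],
                 filled, fixed = TRUE)
}
filled
@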

\subsection{Building Data Packages}

The basic functions described above make it possible to build proteomic annotation 
data packages. Based on these, PAnnBuilder provides several higher-level functions 
to assemble proteomic annotation data, each implemented as a 
"*Builder\_DB" R function.
\begin{itemize}
\item \Rfunction{pBaseBuilder\_DB} - build annotation 
 data packages for primary protein databases such as SwissProt, TrEMBL, IPI or 
 NCBI RefSeq protein data. 
\item \Rfunction{pSeqBuilder\_DB} - build annotation
 data packages for query protein sequences based on sequence similarity.
\item \Rfunction{crossBuilder\_DB} - build annotation
 data packages for protein id mapping in SwissProt, TrEMBL, IPI and NCBI RefSeq
 databases. 
\item \Rfunction{subcellBuilder\_DB} - build 
 annotation data packages for protein subcellular location from BaCelLo or 
 DBSubLoc database.
\item \Rfunction{HomoloGeneBuilder\_DB} - build 
 annotation data packages for homolog protein group from NCBI HomoloGene database.
\item \Rfunction{InParanoidBuilder\_DB} - build
 annotation data packages for ortholog protein group between two given organisms
 from InParanoid database.
\item \Rfunction{GOABuilder\_DB} - build annotation data
 packages for mapping proteins of UniProt to Gene Ontology from GOA database.
\item \Rfunction{scopBuilder\_DB} - build annotation
 data packages for Structural Classification of Proteins.
\item \Rfunction{intBuilder\_DB} - build annotation data
 packages for protein-protein or domain-domain interaction from IntAct, MPPI, 
 3DID, DOMINE or NCBI Gene interaction database.
\item \Rfunction{PeptideAtlasBuilder\_DB} - 
 build annotation data packages for experimentally identified peptides from 
 PeptideAtlas database.
\item \Rfunction{ptmBuilder\_DB} - build annotation data
 packages for post-translational modifications from SysPTM database.
\item \Rfunction{bfBuilder\_DB} - build annotation data
 packages for proteins in body fluids from Sys-BodyFluid database.
\item \Rfunction{dNameBuilder\_DB} - build annotation
 data packages for mapping between entry ID and name from GO, KEGG, Pfam, 
 InterPro and NCBI Taxonomy databases.
\end{itemize}

\section{Building Annotation Data Packages}

\begin{enumerate}
\item The first step is to set basic parameters such as "pkgPath", 
"version", and "author".
<<parameter>>=
# Set path, version and author for the package.
library(PAnnBuilder)
pkgPath <- tempdir()
version <- "1.0.0"
author <- list()
author[["authors"]] <- "Hong Li"
author[["maintainer"]] <- "Hong Li <sysptm@gmail.com>"
@
\item Then you can run the various "*Builder\_DB" functions to build packages
 yourself. \Rfunction{pBaseBuilder\_DB}, 
\Rfunction{subcellBuilder\_DB}, and
\Rfunction{pSeqBuilder\_DB} are used here as examples.
\begin{description}
\item[pBaseBuilder\_DB] builds annotation data packages for proteins in three primary
 protein databases (SwissProt, IPI, RefSeq). It is a convenient way to obtain 
 complete and canonical annotation, including protein description, Entrez gene 
 identifier, KEGG pathway, gene ontology, domain, coordinates on chromosomes 
 and so on. For example, to build an annotation package for the Mouse IPI 
 database, you can use the following code:
<<pBaseBuilder, eval=FALSE>>=
 ## Build SQLite-based annotation package "org.Mm.ipi.db" 
 ## for Mouse IPI database.
 ## Note: Perl is needed for parsing data files. 
 ##       Rtools is needed for Windows users.
 pBaseBuilder_DB(baseMapType = "ipi", organism = "Mus musculus",  
                 prefix = "org.Mm.ipi", pkgPath = pkgPath, version = version, 
                 author = author)              
@
After running, a subdirectory called "org.Mm.ipi.db" will be produced in the path 
given by "pkgPath". This directory contains all data and files needed to 
build the R package with the "R CMD build" command. 

\item[subcellBuilder\_DB] builds an annotation data package which
 provides protein subcellular location information.
<<subcellBaseBuilder>>=
 ## Build subcellular location annotation package "sc.bacello.db" 
 ## from BaCelLo database.
 subcellBuilder_DB(src="BaCelLo", prefix="sc.bacello", 
                pkgPath, version, author)
  ## List all files in created directory "sc.bacello.db".
  dir(file.path(pkgPath,"sc.bacello.db"))
@

\item[pSeqBuilder\_DB] uses BLAST to calculate sequence similarity between query 
proteins and subject proteins, then assigns annotation to each query protein
according to the existing annotation of its similar proteins. pSeqBuilder\_DB is useful for 
proteins that are not yet well annotated. The following code chunk gives an example 
of annotating query proteins with pSeqBuilder\_DB. The required R packages \Rpackage{org.Hs.sp.db} 
and \Rpackage{org.Hs.ipi.db} can be downloaded via biocLite.
<<pSeqBuilder, eval=FALSE>>=
## Read query sequences from a FASTA file.
tmp <- system.file("extdata", "query.example", package="PAnnBuilder")
tmp <- readLines(tmp)
tag <- grep("^>", tmp)
## Each record runs from its header line to the line before the next header,
## which also works when the file holds a single sequence.
end <- c(tag[-1] - 1, length(tmp))
query <- sapply(seq_along(tag), function(x){ 
     paste(tmp[seq(tag[x] + 1, length.out = end[x] - tag[x])], collapse="") })
names(query) <- sub(">", "", tmp[tag])
## Set parameters for sequence similarity.
blast <- c("blastp", "10.0", "BLOSUM62", "0", "-1", "-1", "T", "F")
names(blast) <- c("p","e","M","W","G","E","U","F")
match <- c(0.00001, 0.9, 0.9)
names(match) <- c("e","c","i")

## Install packages "org.Hs.sp.db", "org.Hs.ipi.db".
if( !require("org.Hs.sp.db") ){
  biocLite("org.Hs.sp.db")
}
if( !require("org.Hs.ipi.db") ){
  biocLite("org.Hs.ipi.db")
}

## Use packages "org.Hs.sp.db", "org.Hs.ipi.db" to produce annotation  
## R package for query sequence.
  annPkgs = c("org.Hs.sp.db","org.Hs.ipi.db")  
  seqName = c("org.Hs.spSEQ","org.Hs.ipiSEQ")  
  pSeqBuilder_DB(query, annPkgs, seqName, blast, match, 
  prefix="test1", pkgPath, version, author)  
@
\end{description}

\item After the "*Builder\_DB" function has finished, a subdirectory named 
"pkgName" will have been created in the given "pkgPath". 
The command \texttt{R CMD build} can then be used to build a source R package, and 
\texttt{R CMD build --binary} can be used to build a binary R package for Windows 
(recent versions of R use \texttt{R CMD INSTALL --build} instead). 

\item Note: 
\begin{itemize}
\item An internet connection is needed to download files from public databases, and Perl 
is needed to parse data files. Additionally, Rtools is needed on Windows.
\item Users should be aware that downloading, parsing, and saving data may take
a long time, and that enough disk space is required to store temporary 
data files. 
\item \texttt{R CMD build} and \texttt{R CMD build --binary} are shell commands to be run 
from the command line, not R functions to be typed at the R prompt. Detailed documentation 
on how to create your own packages can be found 
in the manual "Writing R Extensions" ({\url{http://cran.r-project.org/doc/manuals/R-exts.pdf}}). 
For Windows users, \texttt{R CMD build} requires that the files for building 
source packages are installed (which is the default), as well as the Windows toolset (see the 
"R Installation and Administration" manual at {\url{http://cran.r-project.org/doc/manuals/R-admin.pdf}}).
\end{itemize}

\end{enumerate}
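
Although \texttt{R CMD build} is a command-line tool, it can be launched from 
an R script through the base R function \Rfunction{system2} when a scripted 
workflow is preferred. The following is a minimal sketch; the helper 
\Rfunction{buildSourcePkg} is hypothetical and not part of PAnnBuilder, and 
the resulting tarball is written to the current working directory.
<<scriptedBuild, eval=FALSE>>=
## Hypothetical helper (not part of PAnnBuilder): run "R CMD build" on a
## package source directory and return the exit status of the command.
buildSourcePkg <- function(pkgDir) {
  if (!file.exists(file.path(pkgDir, "DESCRIPTION"))) {
    stop("No DESCRIPTION file found in ", pkgDir)
  }
  rCmd <- file.path(R.home("bin"), "R")
  system2(rCmd, c("CMD", "build", shQuote(pkgDir)))
}
## Example: build the directory created by subcellBuilder_DB above.
## buildSourcePkg(file.path(pkgPath, "sc.bacello.db"))
@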

\section{Session Information}

This vignette was generated using the following package versions:
<<echo=FALSE>>=
sessionInfo()
@

\bibliographystyle{plainnat}
\bibliography{PAnnBuilder}

\end{document}