%\VignetteIndexEntry{Introduction To Bioconductor Annotation Packages}
%\VignetteDepends{org.Hs.eg.db,TxDb.Hsapiens.UCSC.hg19.knownGene,hom.Hs.inp.db}

\documentclass[11pt]{article}


\usepackage{Sweave}
\usepackage[usenames,dvipsnames]{color}
\usepackage{graphics}
\usepackage{latexsym, amsmath, amssymb}
\usepackage{authblk}
\usepackage[colorlinks=true, linkcolor=Blue, urlcolor=Blue,
  citecolor=Blue]{hyperref}

%% Simple macros

\newcommand{\code}[1]{{\texttt{#1}}}
\newcommand{\file}[1]{{\texttt{#1}}}

\newcommand{\software}[1]{\textsl{#1}}
\newcommand\R{\textsl{R}}
\newcommand\Bioconductor{\textsl{Bioconductor}}
\newcommand\Rpackage[1]{{\textsl{#1}\index{#1 (package)}}}
\newcommand\Biocpkg[1]{%
  {\href{http://bioconductor.org/packages/devel/bioc/html/#1.html}%
    {\textsl{#1}}}%
  \index{#1 (package)}}
\newcommand\Rpkg[1]{%
  {\href{http://cran.fhcrc.org/web/devel/#1/index.html}%
    {\textsl{#1}}}%
  \index{#1 (package)}}
\newcommand\Biocdatapkg[1]{%
  {\href{http://bioconductor.org/packages/devel/data/experiment/html/#1.html}%
    {\textsl{#1}}}%
  \index{#1 (package)}}
\newcommand\Robject[1]{{\small\texttt{#1}}}
\newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}}
\newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}}
\newcommand\Rmethod[1]{{\texttt{#1}}}
\newcommand\Rfunarg[1]{{\small\texttt{#1}}}
\newcommand\Rcode[1]{{\small\texttt{#1}}}

%% Question, Exercise, Solution
\usepackage{theorem}
\theoremstyle{break}
\newtheorem{Ext}{Exercise}
\newtheorem{Question}{Question}


\newenvironment{Exercise}{
  \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}%
}{\end{Ext}}
\newenvironment{Solution}{%
  \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}%
}{\bigskip}


\title{Introduction To Bioconductor Annotation Packages}
\author{Marc Carlson}

\SweaveOpts{keep.source=TRUE}
\begin{document}

\maketitle


\begin{figure}[ht]
\centering
\includegraphics[width=.6\textwidth]{databaseTypes.pdf}
\caption{Annotation Packages: the big picture}
\label{fig:dbtypes}
\end{figure}

\Bioconductor{} provides extensive annotation resources. These can be
\emph{gene centric}, or \emph{genome centric}. Annotations can be
provided in packages curated by \Bioconductor, or obtained from
web-based resources.  This vignette is primarily concerned with
describing the annotation resources that are available as packages.
This includes both how to extract data from them and also what steps
are required to expose other databases in a similar fashion.

Gene centric \Rpackage{AnnotationDbi} packages
include:
\begin{itemize}
  \item Organism level: e.g. \Rpackage{org.Mm.eg.db}.
  \item Platform level: e.g. \Rpackage{hgu133plus2.db},
        \Rpackage{hgu133plus2.probes}, \Rpackage{hgu133plus2.cdf}.
  \item Homology level: e.g. \Rpackage{hom.Dm.inp.db}.
  \item System-biology level: \Rpackage{GO.db} or \Rpackage{KEGG.db}.
\end{itemize}
Genome centric \Rpackage{GenomicFeatures} packages include
\begin{itemize}
  \item Transcriptome level: e.g. \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene}
  \item Generic genome features: Can generate via \Rpackage{GenomicFeatures}
\end{itemize}
One web-based resource accesses
\href{http://www.biomart.org/}{biomart}, via the \Rpackage{biomaRt}
package:
\begin{itemize}
  \item Query web-based `biomart' resource for genes, sequence,
        SNPs, and etc.
\end{itemize}


The most popular annotation packages have been modified so that they
can make use of a new set of methods to more easily access their
contents.  These four methods are named: \Rmethod{cols},
\Rmethod{keytypes}, \Rmethod{keys} and \Rmethod{select}.  And they are
described in this vignette.  They can currently be used with all chip,
organism, and \Rclass{TranscriptDb} packages along with the popular
GO.db package.

For the older less popular packages, there are still conventient ways
to retrieve the data.  The \emph{How to use bimaps from the ".db"
annotation packages} vignette in the \Rpackage{AnnotationDbi} package
is a key reference for learnign about how to use bimap objects.

Finally, all of the `.db' (and most other \Bioconductor{} annotation
packages) are updated every 6 months corresponding to each release of
\Bioconductor{}.  Exceptions are made for packages where the actual
resources that the packages are based on have not themselves been
updated.


\subsection{Organism level packages}

An organism level package (an `org' package) uses a central gene
identifier (e.g. Entrez Gene id) and contains mappings between this
identifier and other kinds of identifiers (e.g. GenBank or Uniprot
accession number, RefSeq id, etc.).  The name of an org package is
always of the form \emph{org.<Ab>.<id>.db}
(e.g. \Rpackage{org.Sc.sgd.db}) where \emph{<Ab>} is a 2-letter
abbreviation of the organism (e.g. \Rpackage{Sc} for
\emph{Saccharomyces cerevisiae}) and \emph{<id>} is an abbreviation
(in lower-case) describing the type of central identifier
(e.g. \Rpackage{sgd} for gene identifiers assigned by the
Saccharomyces Genome Database, or \Rpackage{eg} for Entrez Gene ids).


\subsection{AnnotationDb objects and the select method}

As previously mentioned, a new set of methods have been added that
allow a simpler way of extracting identifier based annotations.  All
the annotation packages that support these new methods expose an
object named exactly the same way as the package itself.  These
objects are collectively called \Rclass{AnntoationDb} objects, with
more specific classes with names such as \Rclass{OrgDb},
\Rclass{ChipDb} or \Rclass{TranscriptDb} objects.  The methods that
can be applied to these objects are \Rmethod{cols}, \Rmethod{keys},
\Rmethod{keytypes} and \Rmethod{select}.

%% select exercise
\begin{Exercise}
  Display the \Rclass{OrgDb} object for the \Biocpkg{org.Hs.eg.db} package.
  
  Use the \Rmethod{cols} method to discover which sorts of
  annotations can be extracted from it. Is this the same as the result
  from the \Rfunction{keytypes} method?  Use the \Rfunction{keytypes}
  method to find out.
  
  Use the \Rfunction{keys} method to extract UNIPROT identifiers and
  then pass those keys in to the \Rfunction{select} method in such a
  way that you extract the gene symbol and KEGG pathway information
  for each.
\end{Exercise}
\begin{Solution}
<<select>>=
library(org.Hs.eg.db)
cols(org.Hs.eg.db)
keytypes(org.Hs.eg.db)
uniKeys <- head(keys(org.Hs.eg.db, keytype="UNIPROT"))
cols <- c("SYMBOL", "PATH")
select(org.Hs.eg.db, keys=uniKeys, cols=cols, keytype="UNIPROT")

@
\end{Solution}


\subsection{TranscriptDb packages}

A \Rclass{TranscriptDb} package (a 'TxDb' package) connects a set of genomic
coordinates to various transcript oriented features.  The package can
also contain Identifiers to features such as genes and transcripts,
and the internal schema describes the relationships between these
different elements.  All TranscriptDb containing packages follow a
specific naming scheme that tells where the data came from as well as
which build of the genome it comes from.


%% select exercise TranscriptDb
\begin{Exercise}
  Display the \Rclass{TranscriptDb} object for the 
  \Biocpkg{TxDb.Hsapiens.UCSC.hg19.knownGene} package.
  
  As before, use the \Rmethod{cols} and \Rmethod{keytypes} methods to
  discover which sorts of annotations can be extracted from it.
  
  Use the \Rmethod{keys} method to extract just a few gene identifiers and
  then pass those keys in to the \Rmethod{select} method in such a
  way that you extract the transcript ids and transcript starts for each.
\end{Exercise}
\begin{Solution}
<<select>>=
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
cols(txdb)
keytypes(txdb)
keys <- head(keys(txdb, keytype="GENEID"))
cols <- c("TXID", "TXSTART")
select(txdb, keys=keys, cols=cols, keytype="GENEID")

@
\end{Solution}

As is widely known, in addition to providing access via the
\Rmethod{select} method, \Rclass{TranscriptDb} objects also provide
access via the more familiar \Rfunction{transcripts},
\Rfunction{exons}, \Rfunction{cds}, \Rfunction{transcriptsBy},
\Rfunction{exonsBy} and \Rfunction{cdsBy} methods.  For those who do
not yet know about these other methods, more can be learned by seeing
the vignette called: \emph{Making and Utilizing TranscriptDb Objects}
in the \Rpackage{GenomicFeatures} package.


\subsection{Advanced topic: Creating other kinds of Annotation packages}
%% It would not be a terrible idea to make a whole vignette from this
%% that starts with code from last years advanced course on how to make a
%% database in R.


A few options already exist for generating various kinds of annotation
packages.For users who seek to make custom chip packages, users should
see the \emph{SQLForge: An easy way to create a new annotation package
with a standard database schema.} in the \Rpackage{AnnotationDbi}
package.  And, for users who seek to make a probe package, there is
another vignette called \emph{Creating probe packages} that is also in
the AnnotationDbi package. And finally, for custom organism packages
users should look at the manual page for
\Rfunction{makeOrgPackageFromNCBI}.  This function will attempt to
make you an simplified organism package from NCBI resources.  However,
this function is not meant as a way to refresh annotation packages
between releases.  It is only meant for people who are working on less
popular model organisms (so that annotations can be made available in
this format).


But what if you had another kind of database resource and you wanted
to expose it to the world using something like this new
\Rmethod{select} method interface?  How could you go about this?


The 1st step would be to make a package that contains a SQLite
database.  For the sake of expediency, lets look at an existing example
of this in the \Rpackage{hom.Hs.inp.db} package.  If you download this
tarball from the website you can see that it contains a .sqlite
database inside of the inst/extdata directory.  There are a couple of
important details though about this database.  The 1st is that we
recommend that the database have the same name as the package, but end
with the extension .sqlite.  The second detail is that we recommend
that the metadata table contain some important fields.  This is the
metadata from the current \Rpackage{hom.Hs.inp.db} package.

<<getMetadata, echo=FALSE>>=
library(hom.Hs.inp.db)
hom.Hs.inp_dbInfo()
@ 

As you can see there are a number of very useful fields stored in the
metadata table and if you list the equivalent table for other packages
you will find even more useful information than you find here.  But
the most important fields here are actually the ones called "package"
and "Db type".  Those fields specify both the name of the package with
the expected class definition, and also the name of the object that
this database is expected to be represented by in the R session
respectively. If you fail to include this information in your metadata
table, then \Rmethod{loadDb} will not know what to do with the
database when it is called.  In this case, the class definition has
been stored in the \Rpackage{AnnotationDbi} package, but it could live
anywhere you need it too.  By specifying the metadata field, you
enable \Rmethod{loadDb} to find it.


Once you have set up the metadata you will need to create a class for
your package that extends the \Rclass{AnnotationDb} class.  In the
case of the hom.Hs.inp.db package, the class is defined to be a
\Rclass{InparanoidDb} class. This code is inside of
\Rpackage{AnnotationDbi}.

<<referenceClass,eval=FALSE>>=
.InparanoidDb <-
    setRefClass("InparanoidDb", contains="AnnotationDb")
@ 

Finally the \Rfunction{.onLoad} call for your package will have to
contain code that will call the \Rmethod{loadDb} method.  This is what
it currently looks like in the Rpackage{hom.Hs.inp.db} package.

<<onLoad,eval=FALSE>>=
sPkgname <- sub(".db$","",pkgname)
txdb <- loadDb(system.file("extdata", paste(sPkgname,
               ".sqlite",sep=""), package=pkgname, lib.loc=libname),
               packageName=pkgname)
dbNewname <- AnnotationDbi:::dbObjectName(pkgname,"InparanoidDb")
ns <- asNamespace(pkgname)
assign(dbNewname, txdb, envir=ns)
namespaceExport(ns, dbNewname)
@ 


When this code is run, the name of the package is used to derive the
name for the object.  Then that name, is used by \Rmethod{onload} to
create an \Rclass{InparanoidDb} object.  This object is then assigned to
the namespace for this package so that at load time it will be loaded
for the user.


\subsection{Creating package accessors}

At this point, all that remains is to create the means for accessing
the data in the database.  This should prove a lot less difficult than
it may initially sound.  For the new interface, only the four methods
that were described earlier are really required:
\Rmethod{cols},\Rmethod{keytypes},\Rmethod{keys} and \Rmethod{select}.

In order to do this you need to know a small amount of SQL and a few
tricks for accessing the database from R.  The point of providing
these 4 accessors is to give users of these packages a more unified
experience when retrieving data from the database.  But other kinds of
accessors (such as those provided for the \Rclass{TranscriptDb}
objects) may also be warranted.


\subsubsection{Getting a connection}

If all you know is the name of the SQLite database, then to get a DB
connection you need to do something like this:

<<classicConn,results=hide>>=
drv <- SQLite()
library("org.Hs.eg.db")
con <- dbConnect(drv, dbname=system.file("extdata", "org.Hs.eg.sqlite",
                        package = "org.Hs.eg.db"))
con
@

But in our case the connection is already here as part of the object:

<<ourConn>>=
str(hom.Hs.inp.db)
@

So we can do something like below:

<<ourConn2>>=
hom.Hs.inp.db$conn
## or better we can use a helper function to wrap this
AnnotationDbi:::dbConn(hom.Hs.inp.db)
@


\subsubsection{Getting data out}

Now we just need to get our data out of the DB.  There are several
useful functions for doing this.  Most of these come from the RSQLite
or DBI packages.  For the sake of simplicity, I will only discuss
those that are immediately useful for exploring and extracting data
from a database in this vignette.  One pair of useful methods are the
\Rmethod{dbListTables} and \Rmethod{dbListFields} which are useful for
exploring the schema of a database.

<<dbListTables>>=
con <- AnnotationDbi:::dbConn(hom.Hs.inp.db)
head(dbListTables(con))
dbListFields(con, "Mus_musculus")
@ 

And for actually executing SQL to retrieve data, you probably want to
use something like \Rmethod{dbGetQuery}.  The only caveat is that this
will actually require you to know a little SQL.

<<dbGetQuery>>=
dbGetQuery(con, "SELECT * FROM metadata")
@ 


\subsubsection{Some basic SQL}

The good news is that SQL is pretty easy to learn.  Especially if you
are primarily interested in just retrieving data from an existing
database.  Here is a quick run-down to get you started on writing
simple SELECT statements.  Consider a table that looks like this:\\

Table sna:
\begin{tabular}{cc}
foo & bar \\\hline
1   & baz \\
2   & boo \\
\end{tabular}\\

\noindent This statement:\\

SELECT bar FROM sna;\\

\noindent Tells SQL to get the "bar" field from the "foo" table.  If
we wanted the other field called "sna" in addition to "bar", we could
have written it like this:\\

SELECT foo, bar FROM sna;\\

\noindent Or even this (* is a wildcard character here)\\

SELECT * FROM sna;\\

\noindent Now lets suppose that we wanted to filter the results.  We could also
have said something like this:\\

SELECT * FROM sna where bar='boo';\\

\noindent That query will only retrieve records from foo that match
the criteria for bar.  But there are two other things to notice.
First notice that a single = was used for testing equality.  Second
notice that I used single quotes to demarcate the string.  I could
have also used double quotes, but when working in R this will prove to
be less convenient as the whole SQL statement itself will frequently
have to be wrapped as a string. \\

\noindent What if we wanted to be more general?  Then you can use LIKE.  Like
this:\\

SELECT * FROM sna where bar LIKE 'boo\%';\\

\noindent That query will only return records where bar starts with "boo", (the
\% character is acting as another kind of wildcard in this context)\\

\noindent You will often find that you need to get things from two or
more different tables at once.  Or, you may even find that you need to
combine the results from two different queries.  Sometimes these two
queries may even come from the same table.  In any of these cases, you
want to do a join.  The simplest and most common kind of join is an
inner join.  Lets suppose that we have two tables:\\

Table sna:
\begin{tabular}{cc}
foo & bar \\\hline
1   & baz \\
2   & boo \\
\end{tabular}\\ \\

Table fu:
\begin{tabular}{cc}
foo & bo \\\hline
1   & hi \\
2   & ca \\
\end{tabular}\\

\noindent And we want to join them where the records match in their
corresponding "foo" columns.  We can do this query to join them:\\

SELECT * FROM sna,fu WHERE sna.foo=fu.foo;\\

\noindent Something else we can do is tidy this up by using aliases like so:\\

SELECT * FROM sna AS s,fu AS f WHERE s.foo=f.foo;\\

\noindent This last trick is not very useful in this particular
example since the query ended up being longer than we started with,
but is still great for other cases where queries can become really
long.


\subsubsection{Exploring the SQLite database from R}

Now that we know both some SQL and also about some of the methods in
\Rpackage{DBI} and \Rpackage{RSQLite} we can begin to explore the
underlying database from R.  How should we go about this?  Well the
1st thing we always want to know are what tables are present.  We
already know how to learn this:

<<dbListTables2>>=
con <- AnnotationDbi:::dbConn(hom.Hs.inp.db)
head(dbListTables(con))
@ 

And we also know that once we have a table we are curious about, we
can then look up it's fields using \Rmethod{dbListFields}

<<dbListFields2>>=
dbListFields(con, "Apis_mellifera")
@ 

And once we know something about which fields are present in a table,
we can compose a SQL query.  perhaps the most straightforward query is
just to get all the results from a given table.  We know that the SQL
for that should look like: \\

SELECT * FROM Apis\_mellifera; \\

\noindent So we can now call a query like that from R by using \Rmethod{dbGetQuery}: 

<<dbGetQuery2>>=
head(dbGetQuery(con, "SELECT * FROM Apis_mellifera"))
@ 


\begin{Exercise}
Now use what you have learned to explore the \Rpackage{hom.Hs.inp.db}
database.  The formal scientific name for one of the mosquitos that
carry the malaria parasite is Anopheles gambiae.  Now find the table
for that organism in the \Rpackage{hom.Hs.inp.db} database and extract
it into R.  How many species are present in this table?  Inparanoid
uses a five letter designation for each species that is composed of
the 1st 2 letters of the genus followed by the 1st 3 letters of the
species.  Using this fact, write a SQL query that will retrieve only
records from this table that are from humans (Homo sapiens).
\end{Exercise}
\begin{Solution}
<<Anopheles,eval=FALSE>>=
head(dbGetQuery(con, "SELECT * FROM Anopheles_gambiae"))
## Then only retrieve human records
## Query: SELECT * FROM Anopheles_gambiae WHERE species='HOMSA'
head(dbGetQuery(con, "SELECT * FROM Anopheles_gambiae WHERE species='HOMSA'"))
@
\end{Solution}


\subsubsection{Example: creating a cols method}

Now lets suppose that we want to define a \Rmethod{cols} method for
our \Rclass{hom.Hs.inp.db} object. And lets also suppose that we want
is for it to tell us about the actual organisms for which we can
extract identifiers.  How could we do that?

<<cols,eval=FALSE>>=
.cols <- function(x){
  con <- AnnotationDbi:::dbConn(x)
  list <- dbListTables(con)
  ## drop unwanted tables
  unwanted <- c("map_counts","map_metadata","metadata")
  list <- list[!list %in% unwanted]
  ## Then just to format things in the usual way
  toupper(list)
}

## Then make this into a method
setMethod("cols", "InparanoidDb", .cols(x))
## Then we can call it
cols(hom.Hs.inp.db)
@

Notice how I formatted the output to all uppercase characters?  This
is just done to make the interface look consistent with what has been
done before for the other \Rmethod{select} interfaces.  But doing this
means that we will have to do a tiny bit of extra work when we
implement out other methods.


\begin{Exercise}
Now use what you have learned to try and define a method for
\Rmethod{keytypes} on \Rclass{hom.Hs.inp.db}.  The keytypes method
should return the same results as cols (in this case).  What if you
needed to translate back to the lowercase table names?  Also write an
quick helper function to do that.
\end{Exercise}
\begin{Solution}
<<keytypes,eval=FALSE>>=
setMethod("keytypes", "InparanoidDb", .cols(x))
## Then we can call it
keytypes(hom.Hs.inp.db)

## refactor of .cols
.getLCcolnames <- function(x){
  con <- AnnotationDbi:::dbConn(x)
  list <- dbListTables(con)
  ## drop unwanted tables
  unwanted <- c("map_counts","map_metadata","metadata")
  list <- list[!list %in% unwanted]
}
.cols <- function(x){
  list <- .getLCcolnames(x)
  ## Then just to format things in the usual way
  toupper(list)
}
## Test:
cols(hom.Hs.inp.db)

## new helper function:
.getTableNames <- function(x){
  LC <- .getLCcolnames(x)
  UC <- .cols(x)
  names(UC) <- LC
  UC
}
.getTableNames(hom.Hs.inp.db)
@
\end{Solution}


\begin{Exercise}
Now define a method for \Rmethod{keys} on \Rclass{hom.Hs.inp.db}.  The
keys method should return the keys from a given organism based on the
appropriate keytype.  Since each table has rows that correspond to
both human and non-human IDs, it will be necessary to filter out the
human rows from the result
\end{Exercise}
\begin{Solution}
<<keys,eval=FALSE>>=
.keys <- function(x, keytype){
  ## translate keytype back to table name
  tabNames <- .getTableNames(x)
  lckeytype <- names(tabNames[tabNames %in% keytype])
  ## get a connection
  con <- AnnotationDbi:::dbConn(x)
  sql <- paste("SELECT inp_id FROM",lckeytype, "WHERE species!='HOMSA'")
  res <- dbGetQuery(con, sql)
  as.vector(t(res))
}

setMethod("keys", "InparanoidDb", .keys(x, keytype))
## Then we can call it
keys(hom.Hs.inp.db, "TRICHOPLAX_ADHAERENS")
@
\end{Solution}


%% I will only have 45 minutes.


%% I need to now intro some basic SQL and some helpful functions like getDBQuery() and dbListTables, dbListFields etc.
%% Then I need to do an example by doing cols() for people


\end{document}