%\VignetteIndexEntry{Introduction To Bioconductor Annotation Packages} %\VignetteDepends{org.Hs.eg.db,TxDb.Hsapiens.UCSC.hg19.knownGene,hom.Hs.inp.db} \documentclass[11pt]{article} \usepackage{Sweave} \usepackage[usenames,dvipsnames]{color} \usepackage{graphics} \usepackage{latexsym, amsmath, amssymb} \usepackage{authblk} \usepackage[colorlinks=true, linkcolor=Blue, urlcolor=Blue, citecolor=Blue]{hyperref} %% Simple macros \newcommand{\code}[1]{{\texttt{#1}}} \newcommand{\file}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsl{#1}} \newcommand\R{\textsl{R}} \newcommand\Bioconductor{\textsl{Bioconductor}} \newcommand\Rpackage[1]{{\textsl{#1}\index{#1 (package)}}} \newcommand\Biocpkg[1]{% {\href{http://bioconductor.org/packages/devel/bioc/html/#1.html}% {\textsl{#1}}}% \index{#1 (package)}} \newcommand\Rpkg[1]{% {\href{http://cran.fhcrc.org/web/devel/#1/index.html}% {\textsl{#1}}}% \index{#1 (package)}} \newcommand\Biocdatapkg[1]{% {\href{http://bioconductor.org/packages/devel/data/experiment/html/#1.html}% {\textsl{#1}}}% \index{#1 (package)}} \newcommand\Robject[1]{{\small\texttt{#1}}} \newcommand\Rclass[1]{{\textit{#1}\index{#1 (class)}}} \newcommand\Rfunction[1]{{{\small\texttt{#1}}\index{#1 (function)}}} \newcommand\Rmethod[1]{{\texttt{#1}}} \newcommand\Rfunarg[1]{{\small\texttt{#1}}} \newcommand\Rcode[1]{{\small\texttt{#1}}} %% Question, Exercise, Solution \usepackage{theorem} \theoremstyle{break} \newtheorem{Ext}{Exercise} \newtheorem{Question}{Question} \newenvironment{Exercise}{ \renewcommand{\labelenumi}{\alph{enumi}.}\begin{Ext}% }{\end{Ext}} \newenvironment{Solution}{% \noindent\textbf{Solution:}\renewcommand{\labelenumi}{\alph{enumi}.}% }{\bigskip} \title{Introduction To Bioconductor Annotation Packages} \author{Marc Carlson} \SweaveOpts{keep.source=TRUE} \begin{document} \maketitle \begin{figure}[ht] \centering \includegraphics[width=.6\textwidth]{databaseTypes.pdf} \caption{Annotation Packages: the big picture} \label{fig:dbtypes} \end{figure} \Bioconductor{} provides extensive annotation resources. These can be \emph{gene centric}, or \emph{genome centric}. Annotations can be provided in packages curated by \Bioconductor, or obtained from web-based resources. This vignette is primarily concerned with describing the annotation resources that are available as packages. This includes both how to extract data from them and also what steps are required to expose other databases in a similar fashion. Gene centric \Rpackage{AnnotationDbi} packages include: \begin{itemize} \item Organism level: e.g. \Rpackage{org.Mm.eg.db}. \item Platform level: e.g. \Rpackage{hgu133plus2.db}, \Rpackage{hgu133plus2.probes}, \Rpackage{hgu133plus2.cdf}. \item Homology level: e.g. \Rpackage{hom.Dm.inp.db}. \item System-biology level: \Rpackage{GO.db} or \Rpackage{KEGG.db}. \end{itemize} Genome centric \Rpackage{GenomicFeatures} packages include \begin{itemize} \item Transcriptome level: e.g. \Rpackage{TxDb.Hsapiens.UCSC.hg19.knownGene} \item Generic genome features: Can generate via \Rpackage{GenomicFeatures} \end{itemize} One web-based resource accesses \href{http://www.biomart.org/}{biomart}, via the \Rpackage{biomaRt} package: \begin{itemize} \item Query web-based `biomart' resource for genes, sequence, SNPs, and etc. \end{itemize} The most popular annotation packages have been modified so that they can make use of a new set of methods to more easily access their contents. These four methods are named: \Rmethod{cols}, \Rmethod{keytypes}, \Rmethod{keys} and \Rmethod{select}. And they are described in this vignette. They can currently be used with all chip, organism, and \Rclass{TranscriptDb} packages along with the popular GO.db package. For the older less popular packages, there are still conventient ways to retrieve the data. The \emph{How to use bimaps from the ".db" annotation packages} vignette in the \Rpackage{AnnotationDbi} package is a key reference for learnign about how to use bimap objects. Finally, all of the `.db' (and most other \Bioconductor{} annotation packages) are updated every 6 months corresponding to each release of \Bioconductor{}. Exceptions are made for packages where the actual resources that the packages are based on have not themselves been updated. \subsection{Organism level packages} An organism level package (an `org' package) uses a central gene identifier (e.g. Entrez Gene id) and contains mappings between this identifier and other kinds of identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form \emph{org...db} (e.g. \Rpackage{org.Sc.sgd.db}) where \emph{} is a 2-letter abbreviation of the organism (e.g. \Rpackage{Sc} for \emph{Saccharomyces cerevisiae}) and \emph{} is an abbreviation (in lower-case) describing the type of central identifier (e.g. \Rpackage{sgd} for gene identifiers assigned by the Saccharomyces Genome Database, or \Rpackage{eg} for Entrez Gene ids). \subsection{AnnotationDb objects and the select method} As previously mentioned, a new set of methods have been added that allow a simpler way of extracting identifier based annotations. All the annotation packages that support these new methods expose an object named exactly the same way as the package itself. These objects are collectively called \Rclass{AnntoationDb} objects, with more specific classes with names such as \Rclass{OrgDb}, \Rclass{ChipDb} or \Rclass{TranscriptDb} objects. The methods that can be applied to these objects are \Rmethod{cols}, \Rmethod{keys}, \Rmethod{keytypes} and \Rmethod{select}. %% select exercise \begin{Exercise} Display the \Rclass{OrgDb} object for the \Biocpkg{org.Hs.eg.db} package. Use the \Rmethod{cols} method to discover which sorts of annotations can be extracted from it. Is this the same as the result from the \Rfunction{keytypes} method? Use the \Rfunction{keytypes} method to find out. Use the \Rfunction{keys} method to extract UNIPROT identifiers and then pass those keys in to the \Rfunction{select} method in such a way that you extract the gene symbol and KEGG pathway information for each. \end{Exercise} \begin{Solution} <>= library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene txdb cols(txdb) keytypes(txdb) keys <- head(keys(txdb, keytype="GENEID")) cols <- c("TXID", "TXSTART") select(txdb, keys=keys, cols=cols, keytype="GENEID") @ \end{Solution} As is widely known, in addition to providing access via the \Rmethod{select} method, \Rclass{TranscriptDb} objects also provide access via the more familiar \Rfunction{transcripts}, \Rfunction{exons}, \Rfunction{cds}, \Rfunction{transcriptsBy}, \Rfunction{exonsBy} and \Rfunction{cdsBy} methods. For those who do not yet know about these other methods, more can be learned by seeing the vignette called: \emph{Making and Utilizing TranscriptDb Objects} in the \Rpackage{GenomicFeatures} package. \subsection{Advanced topic: Creating other kinds of Annotation packages} %% It would not be a terrible idea to make a whole vignette from this %% that starts with code from last years advanced course on how to make a %% database in R. A few options already exist for generating various kinds of annotation packages.For users who seek to make custom chip packages, users should see the \emph{SQLForge: An easy way to create a new annotation package with a standard database schema.} in the \Rpackage{AnnotationDbi} package. And, for users who seek to make a probe package, there is another vignette called \emph{Creating probe packages} that is also in the AnnotationDbi package. And finally, for custom organism packages users should look at the manual page for \Rfunction{makeOrgPackageFromNCBI}. This function will attempt to make you an simplified organism package from NCBI resources. However, this function is not meant as a way to refresh annotation packages between releases. It is only meant for people who are working on less popular model organisms (so that annotations can be made available in this format). But what if you had another kind of database resource and you wanted to expose it to the world using something like this new \Rmethod{select} method interface? How could you go about this? The 1st step would be to make a package that contains a SQLite database. For the sake of expediency, lets look at an existing example of this in the \Rpackage{hom.Hs.inp.db} package. If you download this tarball from the website you can see that it contains a .sqlite database inside of the inst/extdata directory. There are a couple of important details though about this database. The 1st is that we recommend that the database have the same name as the package, but end with the extension .sqlite. The second detail is that we recommend that the metadata table contain some important fields. This is the metadata from the current \Rpackage{hom.Hs.inp.db} package. <>= library(hom.Hs.inp.db) hom.Hs.inp_dbInfo() @ As you can see there are a number of very useful fields stored in the metadata table and if you list the equivalent table for other packages you will find even more useful information than you find here. But the most important fields here are actually the ones called "package" and "Db type". Those fields specify both the name of the package with the expected class definition, and also the name of the object that this database is expected to be represented by in the R session respectively. If you fail to include this information in your metadata table, then \Rmethod{loadDb} will not know what to do with the database when it is called. In this case, the class definition has been stored in the \Rpackage{AnnotationDbi} package, but it could live anywhere you need it too. By specifying the metadata field, you enable \Rmethod{loadDb} to find it. Once you have set up the metadata you will need to create a class for your package that extends the \Rclass{AnnotationDb} class. In the case of the hom.Hs.inp.db package, the class is defined to be a \Rclass{InparanoidDb} class. This code is inside of \Rpackage{AnnotationDbi}. <>= .InparanoidDb <- setRefClass("InparanoidDb", contains="AnnotationDb") @ Finally the \Rfunction{.onLoad} call for your package will have to contain code that will call the \Rmethod{loadDb} method. This is what it currently looks like in the Rpackage{hom.Hs.inp.db} package. <>= sPkgname <- sub(".db$","",pkgname) txdb <- loadDb(system.file("extdata", paste(sPkgname, ".sqlite",sep=""), package=pkgname, lib.loc=libname), packageName=pkgname) dbNewname <- AnnotationDbi:::dbObjectName(pkgname,"InparanoidDb") ns <- asNamespace(pkgname) assign(dbNewname, txdb, envir=ns) namespaceExport(ns, dbNewname) @ When this code is run, the name of the package is used to derive the name for the object. Then that name, is used by \Rmethod{onload} to create an \Rclass{InparanoidDb} object. This object is then assigned to the namespace for this package so that at load time it will be loaded for the user. \subsection{Creating package accessors} At this point, all that remains is to create the means for accessing the data in the database. This should prove a lot less difficult than it may initially sound. For the new interface, only the four methods that were described earlier are really required: \Rmethod{cols},\Rmethod{keytypes},\Rmethod{keys} and \Rmethod{select}. In order to do this you need to know a small amount of SQL and a few tricks for accessing the database from R. The point of providing these 4 accessors is to give users of these packages a more unified experience when retrieving data from the database. But other kinds of accessors (such as those provided for the \Rclass{TranscriptDb} objects) may also be warranted. \subsubsection{Getting a connection} If all you know is the name of the SQLite database, then to get a DB connection you need to do something like this: <>= drv <- SQLite() library("org.Hs.eg.db") con <- dbConnect(drv, dbname=system.file("extdata", "org.Hs.eg.sqlite", package = "org.Hs.eg.db")) con @ But in our case the connection is already here as part of the object: <>= str(hom.Hs.inp.db) @ So we can do something like below: <>= hom.Hs.inp.db$conn ## or better we can use a helper function to wrap this AnnotationDbi:::dbConn(hom.Hs.inp.db) @ \subsubsection{Getting data out} Now we just need to get our data out of the DB. There are several useful functions for doing this. Most of these come from the RSQLite or DBI packages. For the sake of simplicity, I will only discuss those that are immediately useful for exploring and extracting data from a database in this vignette. One pair of useful methods are the \Rmethod{dbListTables} and \Rmethod{dbListFields} which are useful for exploring the schema of a database. <>= con <- AnnotationDbi:::dbConn(hom.Hs.inp.db) head(dbListTables(con)) dbListFields(con, "Mus_musculus") @ And for actually executing SQL to retrieve data, you probably want to use something like \Rmethod{dbGetQuery}. The only caveat is that this will actually require you to know a little SQL. <>= dbGetQuery(con, "SELECT * FROM metadata") @ \subsubsection{Some basic SQL} The good news is that SQL is pretty easy to learn. Especially if you are primarily interested in just retrieving data from an existing database. Here is a quick run-down to get you started on writing simple SELECT statements. Consider a table that looks like this:\\ Table sna: \begin{tabular}{cc} foo & bar \\\hline 1 & baz \\ 2 & boo \\ \end{tabular}\\ \noindent This statement:\\ SELECT bar FROM sna;\\ \noindent Tells SQL to get the "bar" field from the "foo" table. If we wanted the other field called "sna" in addition to "bar", we could have written it like this:\\ SELECT foo, bar FROM sna;\\ \noindent Or even this (* is a wildcard character here)\\ SELECT * FROM sna;\\ \noindent Now lets suppose that we wanted to filter the results. We could also have said something like this:\\ SELECT * FROM sna where bar='boo';\\ \noindent That query will only retrieve records from foo that match the criteria for bar. But there are two other things to notice. First notice that a single = was used for testing equality. Second notice that I used single quotes to demarcate the string. I could have also used double quotes, but when working in R this will prove to be less convenient as the whole SQL statement itself will frequently have to be wrapped as a string. \\ \noindent What if we wanted to be more general? Then you can use LIKE. Like this:\\ SELECT * FROM sna where bar LIKE 'boo\%';\\ \noindent That query will only return records where bar starts with "boo", (the \% character is acting as another kind of wildcard in this context)\\ \noindent You will often find that you need to get things from two or more different tables at once. Or, you may even find that you need to combine the results from two different queries. Sometimes these two queries may even come from the same table. In any of these cases, you want to do a join. The simplest and most common kind of join is an inner join. Lets suppose that we have two tables:\\ Table sna: \begin{tabular}{cc} foo & bar \\\hline 1 & baz \\ 2 & boo \\ \end{tabular}\\ \\ Table fu: \begin{tabular}{cc} foo & bo \\\hline 1 & hi \\ 2 & ca \\ \end{tabular}\\ \noindent And we want to join them where the records match in their corresponding "foo" columns. We can do this query to join them:\\ SELECT * FROM sna,fu WHERE sna.foo=fu.foo;\\ \noindent Something else we can do is tidy this up by using aliases like so:\\ SELECT * FROM sna AS s,fu AS f WHERE s.foo=f.foo;\\ \noindent This last trick is not very useful in this particular example since the query ended up being longer than we started with, but is still great for other cases where queries can become really long. \subsubsection{Exploring the SQLite database from R} Now that we know both some SQL and also about some of the methods in \Rpackage{DBI} and \Rpackage{RSQLite} we can begin to explore the underlying database from R. How should we go about this? Well the 1st thing we always want to know are what tables are present. We already know how to learn this: <>= con <- AnnotationDbi:::dbConn(hom.Hs.inp.db) head(dbListTables(con)) @ And we also know that once we have a table we are curious about, we can then look up it's fields using \Rmethod{dbListFields} <>= dbListFields(con, "Apis_mellifera") @ And once we know something about which fields are present in a table, we can compose a SQL query. perhaps the most straightforward query is just to get all the results from a given table. We know that the SQL for that should look like: \\ SELECT * FROM Apis\_mellifera; \\ \noindent So we can now call a query like that from R by using \Rmethod{dbGetQuery}: <>= head(dbGetQuery(con, "SELECT * FROM Apis_mellifera")) @ \begin{Exercise} Now use what you have learned to explore the \Rpackage{hom.Hs.inp.db} database. The formal scientific name for one of the mosquitos that carry the malaria parasite is Anopheles gambiae. Now find the table for that organism in the \Rpackage{hom.Hs.inp.db} database and extract it into R. How many species are present in this table? Inparanoid uses a five letter designation for each species that is composed of the 1st 2 letters of the genus followed by the 1st 3 letters of the species. Using this fact, write a SQL query that will retrieve only records from this table that are from humans (Homo sapiens). \end{Exercise} \begin{Solution} <>= head(dbGetQuery(con, "SELECT * FROM Anopheles_gambiae")) ## Then only retrieve human records ## Query: SELECT * FROM Anopheles_gambiae WHERE species='HOMSA' head(dbGetQuery(con, "SELECT * FROM Anopheles_gambiae WHERE species='HOMSA'")) @ \end{Solution} \subsubsection{Example: creating a cols method} Now lets suppose that we want to define a \Rmethod{cols} method for our \Rclass{hom.Hs.inp.db} object. And lets also suppose that we want is for it to tell us about the actual organisms for which we can extract identifiers. How could we do that? <>= .cols <- function(x){ con <- AnnotationDbi:::dbConn(x) list <- dbListTables(con) ## drop unwanted tables unwanted <- c("map_counts","map_metadata","metadata") list <- list[!list %in% unwanted] ## Then just to format things in the usual way toupper(list) } ## Then make this into a method setMethod("cols", "InparanoidDb", .cols(x)) ## Then we can call it cols(hom.Hs.inp.db) @ Notice how I formatted the output to all uppercase characters? This is just done to make the interface look consistent with what has been done before for the other \Rmethod{select} interfaces. But doing this means that we will have to do a tiny bit of extra work when we implement out other methods. \begin{Exercise} Now use what you have learned to try and define a method for \Rmethod{keytypes} on \Rclass{hom.Hs.inp.db}. The keytypes method should return the same results as cols (in this case). What if you needed to translate back to the lowercase table names? Also write an quick helper function to do that. \end{Exercise} \begin{Solution} <>= setMethod("keytypes", "InparanoidDb", .cols(x)) ## Then we can call it keytypes(hom.Hs.inp.db) ## refactor of .cols .getLCcolnames <- function(x){ con <- AnnotationDbi:::dbConn(x) list <- dbListTables(con) ## drop unwanted tables unwanted <- c("map_counts","map_metadata","metadata") list <- list[!list %in% unwanted] } .cols <- function(x){ list <- .getLCcolnames(x) ## Then just to format things in the usual way toupper(list) } ## Test: cols(hom.Hs.inp.db) ## new helper function: .getTableNames <- function(x){ LC <- .getLCcolnames(x) UC <- .cols(x) names(UC) <- LC UC } .getTableNames(hom.Hs.inp.db) @ \end{Solution} \begin{Exercise} Now define a method for \Rmethod{keys} on \Rclass{hom.Hs.inp.db}. The keys method should return the keys from a given organism based on the appropriate keytype. Since each table has rows that correspond to both human and non-human IDs, it will be necessary to filter out the human rows from the result \end{Exercise} \begin{Solution} <>= .keys <- function(x, keytype){ ## translate keytype back to table name tabNames <- .getTableNames(x) lckeytype <- names(tabNames[tabNames %in% keytype]) ## get a connection con <- AnnotationDbi:::dbConn(x) sql <- paste("SELECT inp_id FROM",lckeytype, "WHERE species!='HOMSA'") res <- dbGetQuery(con, sql) as.vector(t(res)) } setMethod("keys", "InparanoidDb", .keys(x, keytype)) ## Then we can call it keys(hom.Hs.inp.db, "TRICHOPLAX_ADHAERENS") @ \end{Solution} %% I will only have 45 minutes. %% I need to now intro some basic SQL and some helpful functions like getDBQuery() and dbListTables, dbListFields etc. %% Then I need to do an example by doing cols() for people \end{document}