%\VignetteIndexEntry{Introduction to genome projects}
\documentclass[12pt]{article}
\usepackage{Sweave}
\usepackage{fullpage}
\usepackage{hyperref}

\newcommand{\R}{\textsf{R}}
\newcommand{\Rcmd}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{\texttt{#1}}

\title{Genome project tables in the genomes package}
\author{Chris Stubben}

\begin{document}

\maketitle

%% for cutting and pasting use continue =""
%% change margins on every chunk
<<>>=
library(genomes)
options(warn=-1, width=75, digits=2, scipen=3, "prompt" = "R> ", "continue" = " ")
options(SweaveHooks=list(fig=function() par(mar=c(5,4.2,1,1))))
@

The \pkg{genomes} package collects genome project metadata from NCBI
(\url{http://www.ncbi.nlm.nih.gov}) and the ENA
(\url{http://www.ebi.ac.uk/ena}) and provides tools to summarize, compare
and plot the data in the \R~programming environment. Genome tables are a
defined class (\emph{genomes}), and each table is a data frame in which rows
are genome projects and columns are the fields describing the associated
metadata. At a minimum, a table should have columns listing the project
name, status, and release date. A number of methods operate on genome
tables, including \Rcmd{print}, \Rcmd{summary}, \Rcmd{plot} and
\Rcmd{update}.

There are a number of ways to install this package. If you are running the
most recent \R~version, you can use the \Rcmd{biocLite} command.

<<>>=
source("http://bioconductor.org/biocLite.R")
biocLite("genomes")
@

Since the format of online genome tables may change (and \Rcmd{update}
commands may then fail), I recommend installing the development version to
get fixes between the six-month releases.

<<>>=
install.packages("genomes",
   repos="http://www.bioconductor.org/packages/devel/bioC", type="source")
@

Genome tables from the Genome Project database at NCBI include prokaryotic
projects (\Rcmd{lproks}), eukaryotic projects (\Rcmd{leuks}), metagenomes
(\Rcmd{lenvs}) and viruses (\Rcmd{virus}).
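
As a minimal sketch of the table structure described above, the following
base \R~code builds a toy data frame with the three minimum columns and tags
it with a \Rcmd{genomes} class. The rows are invented placeholders rather
than NCBI records, the object name \Rcmd{toy} is hypothetical, and the class
assignment is an assumption about how such a table could be represented, not
the package's actual constructor.

```r
# Toy genome table: every value below is an invented placeholder, not NCBI data.
toy <- data.frame(
  name     = c("Organism A", "Organism B", "Organism C"),
  status   = c("Complete", "Draft", "Complete"),
  released = as.Date(c("2001-05-01", "2003-11-15", "2003-02-20")),
  stringsAsFactors = FALSE
)

# Assumption: a genome table is a data frame carrying an extra "genomes" class,
# so data frame operations remain available through inheritance.
class(toy) <- c("genomes", "data.frame")

# Count projects by status, similar in spirit to what a summary would report.
status_counts <- table(toy$status)

# Running total of projects ordered by release date, which is the quantity
# a cumulative release plot would draw.
ord <- order(toy$released)
cumulative <- data.frame(released = toy$released[ord],
                         total    = seq_along(ord))
```

Because the toy object still inherits from \Rcmd{data.frame}, ordinary data
frame tools such as \Rcmd{subset} and \Rcmd{table} should continue to apply
to it.
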
The \Rcmd{print} method displays the first few rows and columns of the
table (either select fewer than seven rows or convert the object to a
\Rcmd{data.frame} to print all columns). The \Rcmd{summary} function
displays the download date, a count of projects by status, and a list of
recent submissions. The \Rcmd{plot} method displays a cumulative plot of
genomes by release date (Figure \ref{lproks}; use \Rcmd{lines} to add
additional tables).

<<>>=
data(lproks)
lproks
summary(lproks)
plot(lproks, log='y', las=1)
data(leuks)
data(lenvs)
lines(leuks, col="red")
lines(lenvs, col="green3")
legend("topleft", c("Microbes", "Eukaryotes", "Metagenomes"),
   lty=1, bty='n', col=c("blue", "red", "green3"))
@

\begin{figure}[t]
\centering
\includegraphics[height=5in,width=5in]{genome-tables-lproks.pdf}
\caption{Cumulative plot of genome projects by release date at NCBI.}
\label{lproks}
\end{figure}

Most importantly, the \Rcmd{update} method downloads the latest version of
the table from NCBI and displays a message listing the number of project
IDs added and removed (not run).

<<>>=
update(lproks)
@

A number of additional functions assist in selecting, sorting and grouping
genomes. The \Rcmd{species} and \Rcmd{genus} functions extract the species
or genus from a scientific name. The \Rcmd{table2} function formats and
sorts a contingency table by counts.

<<>>=
spp <- species(lproks$name)
table2(spp)
@

The \Rcmd{month} and \Rcmd{year} functions extract the month or year from
the release date (Figure \ref{complete}).
<<>>=
complete <- subset(lproks, status == "Complete")
x <- table(year(complete$released))
barplot(x, col="blue", ylim=c(0,max(x)*1.04), space=0.5, las=1,
   axis.lty=1, xlab="Year", ylab="Genomes per year")
box()
@

\begin{figure}[t]
\centering
\includegraphics[height=3in,width=5in]{genome-tables-complete.pdf}
\caption{Number of complete microbial genomes released each year at NCBI.}
\label{complete}
\end{figure}

Because subsets of tables are often needed, the binary operator
\Rcmd{like} allows pattern matching using wildcards. The \Rcmd{plotby}
function can then be used to plot the release dates by status using labeled
points, in this case to identify complete and draft sequences of
\emph{Yersinia pestis} (Figure \ref{yersinia}).

<<>>=
## Yersinia pestis
yp <- subset(lproks, name %like% 'Yersinia pestis*')
plotby(yp, labels=TRUE, cex=.5, lbty='n')
@

\begin{figure}[t]
\centering
\includegraphics[height=5in,width=5in]{genome-tables-yersinia.pdf}
\caption{Cumulative plot of \emph{Yersinia pestis} genomes by release date.}
\label{yersinia}
\end{figure}

Several functions have recently been added that allow \R~users to query
NCBI databases or the European Nucleotide Archive. These functions will be
described in a separate vignette.

\end{document}