%\VignetteIndexEntry{Introduction to R and Bioconductor}
%\VignettePackage{EMBL2011}
\documentclass{article}
\usepackage{BioconductorVignette}

\title{An introduction to \R{} and \Bioconductor{}}
\author{Martin Morgan\footnote{\href{mailto:mtmorgan@fhcrc.org}{mtmorgan@fhcrc.org}}}
\date{June 20--23, 2011}

\SweaveOpts{keep.source=TRUE}

\begin{document}

\maketitle

\section{Introduction}

\R{} is an open-source statistical programming language. It is used to
manipulate data, to perform statistical analyses, and to present graphical
and other results. \R{} consists of a core language, additional `packages'
distributed with the \R{} language, and a very large number of packages
contributed by the broader community. Packages add specific functionality
to an \R{} installation. \R{} has become the primary language of academic
statistical analyses, and is widely used in diverse areas of research,
government, and industry.

\R{} has several unique features. It has a surprisingly `old school'
interface: users type commands into a console; scripts in plain text
represent work flows; tools other than \R{} are used for editing and other
tasks. \R{} is a flexible programming language, so while one person might
use functions provided by \R{} to accomplish advanced analytic tasks,
another might implement their own functions for novel data. As a
programming language, \R{} adopts syntax and grammar that differ from many
other languages: objects in \R{} are `vectors', and functions are
`vectorized' to operate on all elements of the object; \R{} objects have
`copy on change' and `pass by value' semantics, reducing unexpected
consequences for users at the expense of less efficient memory use; common
paradigms in other languages, such as the `for' loop, are encountered much
less commonly in \R{}. Many authors contribute to \R{}, so there can be a
frustrating inconsistency of documentation and interface. \R{} grew up in
the academic community, so authors have not shied away from trying new
approaches. Statistical analyses, especially exploratory analyses, are of
course very well developed.

\Bioconductor{} is a collection of \R{} packages for the analysis and
comprehension of high-throughput genomic data. \Bioconductor{} started more
than 10 years ago. It gained credibility for its statistically rigorous
approach to microarray pre-processing and designed experiments, and for
integrative and reproducible approaches to bioinformatic tasks. There are
now more than 460 \Bioconductor{} packages for expression and other
microarrays, sequence analysis, flow cytometry, imaging, and other domains.

\section{Statistical programming}

Many academic and commercial software products are available; why would one
use \R{} and \Bioconductor{}? One answer is to ask what demands
high-throughput genomic data place on the effectiveness of computational
biology software.

\subsection{Effective computational biology software}

High-throughput research questions make use of large data sets. This
applies both to the primary data (microarray expression values, sequenced
reads, etc.) and also to the annotations on those data (coordinates of
genes and features such as exons or regulatory regions; participation in
biological pathways, etc.). Large data sets place demands on our tools that
preclude some standard approaches, such as spreadsheets. Likewise,
intricate relationships between data and annotation, and the diversity of
research questions, require the flexibility typical of a programming
language rather than the narrow capabilities of a graphical user interface.
Analysis of high-throughput data is necessarily statistical. The volume of
data requires that it be appropriately summarized before any sort of
comprehension is possible. The data are produced by advanced technologies,
and these introduce artifacts (e.g., probe-specific bias in microarrays;
sequence or base calling bias in RNA-seq experiments) that need to be
accommodated to avoid incorrect or inefficient inference. Data sets
typically derive from designed experiments, requiring a statistical
approach both to account for the design, and to correctly address the large
number of observed values (e.g., gene expression or sequence tag counts)
and small number of samples accessible in typical experiments.

Research needs to be reproducible. Reproducibility is both an ideal of the
scientific method, and a pragmatic requirement. The latter comes from the
long-term and multi-participant nature of contemporary science. An analysis
will be performed for the initial experiment, revisited during manuscript
preparation, and revisited during reviews or in determining next steps.
Likewise, analyses typically involve a team of individuals with diverse
domains of expertise. Effective collaborations result when it is easy to
reproduce, perhaps with minor modifications, an existing result, and when
sophisticated statistical or bioinformatic analyses can be effectively
conveyed to other group members.

Science moves very quickly. This is driven by the novel questions that are
the hallmark of discovery, and by technological innovation and
accessibility. This places significant burdens on software, which must also
move quickly. Effective software cannot be too polished, because that
requires that the correct analyses are `known' and that significant
resources of time and money have been invested in developing the software;
this implies software that is tracking the trailing edge of innovation. On
the other hand, leading-edge software cannot be too idiosyncratic; it must
be usable by a wider audience than the creator of the software, and fit in
with other software relevant to the analysis.

Effective software must be accessible. Affordability is one aspect of
accessibility. Another is transparent implementation, where the novel
software is sufficiently documented and the source code accessible enough
for the assumptions, approaches, practical implementation decisions, and
inevitable coding errors to be assessed by other skilled practitioners. A
final aspect of accessibility is that the software is actually usable. This
is achieved through adequate documentation, support forums, and training
opportunities.

\subsection{\Bioconductor{} as effective computational biology software}

What features of \R{} and \Bioconductor{} contribute to their effectiveness
as software tools?

\Bioconductor{} is well suited to handling extensive data and annotation.
\Bioconductor{} `classes' represent high-throughput data and their
annotation in an integrated way. \Bioconductor{} methods use advanced
programming techniques or \R{} resources (such as transparent database or
network access) to minimize memory requirements and integrate with diverse
resources. Classes and methods coordinate complicated data sets with
extensive annotation. Nonetheless, the basic model for object manipulation
in \R{} involves vectorized in-memory representations.
For this reason, particular programming paradigms (e.g., block processing
of data streams; explicit parallelism) or hardware resources (e.g.,
large-memory computers) are sometimes required when dealing with extensive
data.

%% FIXME: refs to RMA, limma, DESeq / edgeR
\R{} is ideally suited to addressing the statistical challenges of
high-throughput data. Three examples include the development of the `RMA'
and other normalization algorithms for microarray pre-processing, the use
of moderated $t$-statistics for assessing microarray differential
expression, and the development of approaches to estimating the dispersion
of read counts necessary for appropriate analysis of designed RNA-seq
experiments.

Many of the `old school' aspects of \R{} and \Bioconductor{} facilitate
reproducible research. An analysis is often represented as a text-based
script. Reproducing the analysis involves re-running the script; adjusting
how the analysis is performed involves simple text-editing tasks. Beyond
this, \R{} has the notion of a `vignette', which represents an analysis as
a \LaTeX{} document with embedded \R{} commands. The \R{} commands are
evaluated when the document is built, thus reproducing the analysis. The
use of \LaTeX{} means that the symbolic manipulations in the script are
augmented with textual explanations and justifications for the approach
taken; these include graphical and tabular summaries at appropriate places
in the analysis. \R{} includes facilities for reporting the exact version
of \R{} and associated packages used in an analysis so that, if needed,
discrepancies between software versions can be tracked down and their
importance evaluated. While users often think of \R{} packages as providing
new functionality, they are also used to encapsulate a single analysis. The
package can contain data sets, vignette(s) describing the analysis, \R{}
functions that might have been written, scripts for key data processing
stages, and documentation (via standard \R{} help mechanisms) of what the
functions, data, and packages are about.

The \Bioconductor{} project adopts practices that facilitate
reproducibility. Versions of \R{} and \Bioconductor{} are released twice
each year. Each \Bioconductor{} release is the result of development, in a
separate branch, during the previous six months. The release is built daily
against the corresponding version of \R{} on Linux, Mac, and Windows
platforms, with an extensive suite of tests performed. The \code{biocLite}
function ensures that each release of \R{} uses the corresponding
\Bioconductor{} packages. The user thus has access to stable and tested
package versions; a brief illustration of these facilities appears below.
\R{} and \Bioconductor{} are effective tools for reproducible research.

%% FIXME: Biomart, UCSC href
\R{} and \Bioconductor{} exist on the leading portion of the software life
cycle. Contributors are primarily from academic institutions, and are
directly involved in leading-edge research activities. New developments are
made available in a familiar format, i.e., the \R{} language, packaging,
and build systems. The rich set of facilities in \R{} (e.g., for advanced
statistical analysis or visualization) and the extensive resources in
\Bioconductor{} (e.g., for annotation using third-party data such as
Biomart or the UCSC genome browser tracks) mean that innovations can be
directly incorporated into existing work flows. The `development' branches
of \R{} and \Bioconductor{} provide an environment where contributors can
explore new approaches without alienating their user base.
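As a brief, hedged illustration of the facilities mentioned above, the
following sketch (not evaluated when this document is built) reports the
versions of \R{} and of loaded packages with \Rfunction{sessionInfo}, and
installs a \Bioconductor{} package with \code{biocLite}; the package name
\Biocpkg{limma} is used purely as an example.
<<reproducibility-sketch, eval=FALSE>>=
## report the exact versions of R and of loaded packages
sessionInfo()
## install a Bioconductor package matched to the current release of R;
## 'limma' is used here only as an example package name
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
@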
\R{} and \Bioconductor{} also fare well in terms of accessibility. The
software is freely available. The source code is easily and fully
accessible for critical evaluation. The \R{} packaging and check system
requires that all functions are documented. \Bioconductor{} requires that
each package contain vignettes to illustrate the use of the software. There
are very active \R{} and \Bioconductor{} mailing lists for immediate
support, and regular training and conference activities for professional
development.

\section{\R{} data types and functions}

Opening an \R{} session results in a prompt. The user types instructions at
the prompt. Here's an example:
<<>>=
## assign values 5, 4, 3, 2, 1 to variable 'x'
x <- c(5, 4, 3, 2, 1)
x
@
%%
The first line starts with a \code{\#} to represent a comment; the line is
ignored by \R{}. The next line creates a variable \code{x}. The variable is
assigned (using \code{<-}; we could have used \code{=} almost
interchangeably) a value. The value assigned is the result of a call to the
\Rfunction{c} function. That it is a function call is indicated by the
symbol name followed by parentheses, \code{c()}. The \Rfunction{c} function
takes zero or more arguments, and returns a vector. The vector is the value
assigned to \code{x}. \R{} responds to this line with a new prompt, ready
for the next input. The next line asks \R{} to display the value of the
variable \code{x}. \R{} responds by printing \code{[1]} to indicate that
the subsequent number is the first element of the vector. It then prints
the value of \code{x}.

\R{} has many features to aid common operations. Entering sequences is a
very common operation, and expressions of the form \code{2:4} create a
sequence from \code{2} to \code{4}. Subsetting one vector by another is
enabled with \code{[}. Here we create a sequence from 2 to 4, and use the
sequence as an index to select the second, third, and fourth elements of
\code{x}.
<<>>=
x[2:4]
@

\R{} functions operate on variables. Functions are usually vectorized,
acting on all elements of their argument and obviating the need for
explicit iteration. Functions can generate warnings when performing suspect
operations, or errors if evaluation cannot proceed; try \code{log(0)} or
\code{log(-1)}.
<<>>=
log(x)
@

\subsection{Essential data types}

\R{} has a number of standard data types, to represent \code{integer},
\code{numeric} (floating point), \code{complex}, \code{character},
\code{logical} (Boolean), and \code{raw} (byte) data. It is possible to
convert between data types, and to discover the type or mode of a variable.
<<>>=
c(1.1, 1.2, 1.3)      # numeric
c(FALSE, TRUE, FALSE) # logical
c("foo", "bar", "baz") # character, single or double quote ok
as.character(x)       # convert 'x' to character
typeof(x)             # the number 5 is numeric, not integer
typeof(2L)            # append 'L' to force integer
typeof(2:4)           # ':' produces a sequence of integers
@
%%
\R{} includes data types particularly useful for statistical analysis,
including \code{factor} to represent categories and \code{NA} (used in any
vector) to represent missing values.
<<>>=
sex <- factor(c("Male", "Female", NA), levels=c("Female", "Male"))
sex
@

\subsection{Lists, data frames, and matrices}

All of the vectors mentioned so far are homogeneous, consisting of a single
type of element. A \code{list} can contain a collection of different types
of elements and, like all vectors, these elements can be named to create a
key-value association.
<<>>=
lst <- list(a=1:3, b=c("foo", "bar"), c=sex)
lst
@

Lists can be subset like other vectors to get another list, or subset with
\code{[[} to retrieve the actual list element; as with other vectors,
subsetting can use names.
<<>>=
lst[c(3, 1)] # another list
lst[["a"]]   # the element itself, by name
@

A \code{data.frame} is a list of equal-length vectors, representing a
rectangular data structure not unlike a spreadsheet. Each column of the
data frame is a vector, so data types must be homogeneous within a column.
A \code{data.frame} can be subset by row or column, and columns can be
accessed with \code{\$} or \code{[[}.
<<>>=
df <- data.frame(age=c(27L, 32L, 19L),
                 sex=factor(c("Male", "Female", "Male")))
df
df[c(1, 3),]
df[df$age > 20,]
@

A \code{matrix} is also a rectangular data structure, but subject to the
constraint that all elements are the same type. A matrix is created by
taking a vector, and specifying the number of rows or columns the vector is
to represent. On subsetting, \R{} coerces a single-column \code{data.frame}
or a single row or column of a \code{matrix} to a vector if possible; use
\code{drop=FALSE} to stop this behavior.
<<>>=
m <- matrix(1:12, nrow=3)
m
m[c(1, 3), c(2, 4)]
m[, 3]
m[, 3, drop=FALSE]
@
%%
An \code{array} is a data structure for representing homogeneous,
rectangular data in higher dimensions.

\subsection{S3 and S4 classes}

More complicated data structures are represented using the `S3' or `S4'
object system. Objects are often created by functions (\Rfunction{lm},
below), with parts of the object extracted or assigned using
\emph{accessor} functions. The following generates 1000 random normal
deviates as \code{x}, and uses these to create another 1000 deviates
\code{y} that are linearly related to \code{x} but with some error. We fit
a linear regression using a `formula' to describe the relationship between
variables, summarize the results in a familiar ANOVA table, and access
\code{fit} (an S3 object) for the residuals of the regression, using these
as input first to the \Rfunction{var} (variance) and then \Rfunction{sqrt}
(square-root) functions. Objects can be interrogated for their class.
<<>>=
x <- rnorm(1000, sd=1)
y <- x + rnorm(1000, sd=.5)
fit <- lm(y ~ x)      # formula describes linear regression
fit                   # an 'S3' object
anova(fit)
sqrt(var(resid(fit))) # residuals accessor and subsequent transforms
class(fit)
@
%%
Find help on functions, e.g., \Rfunction{lm} and its return value, using
the \R{} help system.
<<eval=FALSE>>=
?data.frame
?rnorm
?lm
?anova    # a generic function
?anova.lm # an S3 method, specialized for 'lm' objects
@

Many \Bioconductor{} packages implement S4 objects to represent data. S3
and S4 systems are quite different from a programmer's perspective, but
fairly similar from a user's perspective: both systems encapsulate
complicated data structures, and allow for methods specialized to different
data types.

\subsection{Functions}

\R{} functions accept arguments, and return values. Arguments can be
required or optional. Some functions may take a variable number of
arguments, e.g., the columns used to construct a \code{data.frame}.
<<>>=
y <- 5:1
log(y)
args(log)        # arguments 'x' and 'base'; see ?log
log(y, base=2)   # 'base' is optional, with default value
try(log())       # 'x' required; 'try' continues even on error
args(data.frame) # ... represents a variable number of arguments
@

Arguments can be matched by position or by name. If an argument appears
after \code{...}, it must be named.
<<>>=
log(base=2, y) # match argument 'base' by name, 'x' by position
@
%%
A generic may have fewer arguments than a method, as with the S3 function
\Rfunction{anova} and its method \Rfunction{anova.glm}.
<<>>=
args(anova)
args(anova.glm)
@

The source code of a function is printed if the function is invoked without
parentheses. Here we discover that the function \Rfunction{head} (which
returns the first 6 elements of anything) is an S3 generic (indicated by
\code{UseMethod}) and has several methods; use \Rfunction{getAnywhere} to
retrieve non-visible function definitions. We use \Rfunction{head} to look
at the first six lines of the \Rfunction{head} method specialized for
\code{matrix} objects.
<<>>=
head
methods(head)
head(head.matrix)
@

\section{Packages}

Packages provide functionality beyond that available in base \R{}. There
are more than 3000 packages on CRAN (the Comprehensive \R{} Archive
Network) and more than 460 \Bioconductor{} packages. Packages are
contributed by diverse members of the community; they vary in quality (many
are excellent) and sometimes contain idiosyncratic aspects in their
implementation.

The \Rpackage{lattice} package is distributed with \R{} but not loaded by
default. It provides a very expressive way to visualize data. The following
example plots yield for a number of barley varieties, conditioned on site
and grouped by year. Figure~\ref{fig:lattice} is read from the lower left
corner. Note the common scales, efficient use of space, and not-too-pleasing
default color palette. The Morris sample appears to be mis-labelled for
`year', an apparent error in the original data. Find out about the built-in
data set used in this example with \code{?barley}.
<<lattice-example, eval=FALSE>>=
library(lattice)
dotplot(variety ~ yield | site, data = barley, groups = year,
        key = simpleKey(levels(barley$year), space = "right"),
        xlab = "Barley Yield (bushels/acre) ",
        aspect=0.5, layout = c(2,3), ylab=NULL)
@
<<echo=FALSE, results=hide>>=
pdf("lattice-example.pdf")
<<lattice-example>>
invisible(dev.off())
@
\begin{figure}
  \centering\includegraphics{lattice-example}
  \caption{Variety yield conditional on site and grouped by year, for the
    \code{barley} data set.}
  \label{fig:lattice}
\end{figure}

New packages can be added to an \R{} installation using
\code{install.packages}. A package is installed only once per \R{}
installation, but needs to be loaded (with \Rfunction{library}) in each
session in which it is used. Loaded packages are displayed with
\Rfunction{search}. The path returned by \Rfunction{search} represents the
order in which the global environment (where commands entered at the prompt
are evaluated) and attached packages are searched for symbols; it is
possible for a package earlier in the search path to mask symbols later in
the search path; these can be disambiguated using \code{::}.
<<>>=
search()
base::log(1:3)
@

\subsection{\Bioconductor{}}

\Bioconductor{} enhances \R{} through packages for microarray, sequence,
annotation, flow cytometry, imaging, and other high-throughput genomic
analyses. Packages are contributed by academic and other groups from around
the world; a core group at the Fred Hutchinson Cancer Research Center is
funded by the US National Human Genome Research Institute to provide
project infrastructure.

Classical \Bioconductor{} work flows involve data input, pre-processing,
quality assessment (before or after pre-processing), and exploratory and
downstream analysis.
Downstream steps depend on the research question, but might involve
analysis of designed experiments (e.g., differential expression between
groups), machine learning (clustering and classification), or per-sample
processing (e.g., SNP calling, copy number estimation). Results are then
subject to gene or pathway annotation and perhaps higher-level analysis
(e.g., gene set enrichment).

A more concrete work flow example illustrates the use of \Bioconductor{}
packages for expression microarrays; it serves as an established model for
other, nascent, technologies. Input and pre-processing might begin with a
vendor-specific package such as \Biocpkg{affy}, \Biocpkg{lumi}, or
\Biocpkg{limma} (for two-color arrays). The \Biocpkg{oligo} package is
useful for high-density arrays; \Biocpkg{GEOquery} and
\Biocpkg{ArrayExpress} provide facilities for retrieving data sets from the
\href{http://www.ncbi.nlm.nih.gov/geo}{GEO} and
\href{http://www.ebi.ac.uk/microarray-as}{ArrayExpress} repositories.
Typical pre-processing steps are background correction to address imaging
and other technology-specific artifacts, normalization to standardize
expression signals from different samples, and summarization of signals
from multiple probes of a single feature (e.g., `gene', used as a synonym
for summarized expression values below). Packages used for input provide
mechanisms for technology-specific pre-processing; \Biocpkg{vsn} (variance
stabilizing normalization, using a generalized log transform) is an example
of a package that implements normalization methods meant for use on several
different platforms. A common output from pre-processing is an S4
\Rclass{ExpressionSet} object. This object coordinates expression values,
sample metadata (e.g., sample X belongs to treatment group Y, and was
processed on day Z), and gene annotation in a way that reduces book-keeping
and other common errors associated with managing large data with
complicated relationships.

The \Biocpkg{arrayQualityMetrics} package produces a comprehensive HTML
summary of expression microarray quality, based on either raw or
pre-processed data from a variety of platforms.

Exploratory analysis of pre-processed data sets often involves a
combination of \Bioconductor{} packages and standard \R{} tools. One might
use the \Biocpkg{genefilter} package to remove unannotated genes and to
identify genes that are highly variable across all samples; genes passing
the filter form a reduced data set for exploratory or perhaps downstream
analysis. Exploratory analysis often involves standard \R{} functions,
e.g., \Rfunction{heatmap} to cluster samples and look for association with
unintended covariates such as batch processing date.

The next work flow step depends on the research question, experimental
design, and project-specific objectives. The \Biocpkg{limma} package
provides an excellent illustration of the opportunities provided by
\Bioconductor{} for analysis of designed experiments. A typical starting
point is the \Rclass{ExpressionSet} of pre-processed data, and a
(statistical) model matrix describing the experimental design. The
functions \Rfunction{lmFit}, \Rfunction{eBayes}, and \Rfunction{topTable}
are then used to fit the linear model implied by the design matrix to each
gene, to assess differential expression, and to summarize results in a
convenient \code{data.frame}.
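A minimal sketch of this step follows (not evaluated when this document is
built); it assumes a hypothetical pre-processed \Rclass{ExpressionSet}
\code{eset} and a factor \code{group} of sample treatment assignments,
neither of which is created elsewhere in this document.
<<limma-sketch, eval=FALSE>>=
library(limma)
## 'eset': a pre-processed ExpressionSet; 'group': a factor of treatment
## assignments -- both are hypothetical place-holders for illustration
design <- model.matrix(~ group) # model matrix describing the design
fit <- lmFit(eset, design)      # fit a linear model to each gene
fit <- eBayes(fit)              # moderated t-statistics
topTable(fit, coef=2)           # summarize top differentially expressed genes
@
In this two-group sketch, \code{coef=2} refers to the treatment effect in
the model matrix.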
This step requires that the user is comfortable with more-than-elementary
statistical concepts (e.g., formulating a model matrix and contrasts
appropriate for their experiment) and is willing to accept the nuanced
statistical reasoning (e.g., about pooled variance estimates and priors)
that \Biocpkg{limma} implements. In exchange, one gains great flexibility
in the type of experiment that can be analyzed, and access to an approach
that considerably improves on simple per-gene analyses. The \Biocpkg{limma}
vignette is an excellent resource.

\Bioconductor{} annotation facilities help place results in context.
\Biocpkg{AnnotationDbi} is at the base of a large suite of packages that
provide gene-level mappings between different identifiers. For instance,
\Biocpkg{org.Hs.eg.db} is an organism-specific annotation package for
\emph{H.~sapiens}. It uses Entrez Gene identifiers (the \code{eg} in the
package name) as a key, from which chromosomal location, Gene Ontology or
KEGG pathway membership, and other information can be obtained. The
\Biocpkg{BSgenome} packages provide whole-genome sequences (see the
\Rfunction{available.genomes} function in \Biocpkg{BSgenome}). The
\Biocpkg{biomaRt} package allows users to query the rich data resources of
\href{http://biomart.org}{Biomart}; \Biocpkg{rtracklayer} (for the
\href{http://genome.ucsc.edu}{UCSC genome browser}) and
\Biocpkg{GenomicFeatures} (for UCSC and Biomart) provide access to track
and other annotation resources. Several packages (e.g., \Biocpkg{Category},
\Biocpkg{GOstats}, \Biocpkg{topGO}, \Biocpkg{GSEAlm}, \Biocpkg{limma},
\Biocpkg{GSEABase}) provide gene set enrichment style analyses.

\section{Summary}

High-throughput genomic data analysis requires software tools that are
capable of dealing with extensive and inter-related data sets. The software
must necessarily be statistical, to accommodate large data volumes,
technological artifacts, and experimental design. Scientific researchers
work on long-term projects in collaborative groups, and require
reproducible results and easily modified analyses. Software on the leading
edge will not be blemish-free, but can fit into existing work flows, be
well documented, and be easily distributed to diverse users.

\R{} and \Bioconductor{} have many attributes that make them very suitable
for genomic data analysis. \R{} is a statistical programming language. Its
vectorized data types can be incorporated into data structures and objects
to represent complicated, inter-related data sets or results. The \R{}
packaging system means that the language is extensible by a wide community;
it is the tool of choice for academic and other leading-edge research
groups. The \Bioconductor{} project builds on \R{}'s foundations to provide
a large suite of packages for high-throughput analysis. These packages are
combined into work flows. The microarray differential expression work flow
summarized above provides a template relevant to other domains, highlights
many key packages for microarray analysis, and illustrates the
sophisticated statistical approaches and rewards made available to
\Bioconductor{} users.

\end{document}