% % To compile this: % 1. use R-2.3.1-patched % 2. Sweave("AffyLab.Rnw") % 3. source("../parseExercise.R"); parseExercise("AffyLab.tex") % 4. pdflatex AffyLab; pdflatex AffyLab-lab % %\VignetteDepends{affy, SpikeInSubset, gcrma, affyPLM, genefilter, limma, marray, annotate, KEGG, GO,annaffy} %\VignetteIndexEntry{Bioconductor Affymetrix GeneChips Lab} %\VignetteKeywords{tutorial, Affymetrix arrays,probe data} % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \documentclass[11pt,a4paper]{article} \usepackage{amsmath,fullpage} \usepackage{theorem} \usepackage{color} \usepackage{../Common/bioc} \parindent 0in % Left justify \newcommand{\inclfig}[3]{% \begin{figure}[tb] \begin{center} \includegraphics[width=#2]{#1} \caption{\label{#1}#3} \end{center} \end{figure} } \author{Jean Wu and Robert Gentleman} \title{Lab: From Affymetrix array CEL files to annotated list of interesting genes} \begin{document} \maketitle \section{Introduction} Identifying a list of differentially expressed genes is one of the major applications of microarray experiments. In this lab we will start with Affymetrix CEL files, which contain the raw probe level (feature level) data and create an annotated report of interesting genes. This lab is based on Chapter 25, \textit{From CEL files to annotated list of interesting genes} by R. A. Irizarry. \section{Reading CEL files} \label{import} Load the \Rpackage{affy} package, which is needed to import the data into Bioconductor. <>= library("affy") @ The probe level data provided by Affymetrix is contained in what we call the CEL files, because these usually have extension {\tt .CEL}. The function \Robject{ReadAffy} can be used to import these data into \Robject{AffyBatch} objects. One easy way of reading in the CEL files is to have all CEL files in one directory and set the R working directory to that directory. To set the working directory you can use the function \Rfunction{setwd}, or use the menu commands on either Windows or OS X. Then you can simply use the \Rfunction{ReadAffy} function, as shown in the code chunk below. <>= MyData <- ReadAffy() @ If instead, you have many CEL files and only want to select some of them, you can supply the names of the ones you would like to use as the \Rfunarg{filenames} argument to the function. An example is given in the code chunk below. <>= ReadAffy(filenames=filenames.of.your.data) @ The help file for \Rfunction{ReadAffy} gives more details. For illustration of the rest of this lab we will use an \Robject{AffyBatch} named \Robject{spikein95} that is already read-in, available through the package \Rpackage{SpikeInSubset}. You are welcome to try out your own data. <>= library("SpikeInSubset") data(spikein95) @ \subsection{The phenotypic data} These data are a six array subset (2 sets of triplicate arrays) from a calibration experiment performed by Affymetrix. For the purpose of this lab we generate a simple object, \Robject{ad} of class \Rclass{AnnotatedDataFrame} describing the arrays. The first three arrays are from one population and the next three are from the other population. We will arbitrarily label the first population . <>= pd <- data.frame(population=factor(c("Pop1", "Pop1", "Pop1", "Pop2", "Pop2", "Pop2")), replicate=c(1, 2, 3, 1, 2, 3)) rownames(pd) <- sampleNames(spikein95) vl <- data.frame(labelDescription= c("Pop1 is control, Pop2 is treatment", "1, 2, 3, arbitrary numbering"), row.names = c("population", "replicate")) phenoData(spikein95) <- new("AnnotatedDataFrame", data=pd, varMetadata=vl) @ % If you have your own data, generate an \Robject{AnnotatedDataFrame} object according to your experimental design. The \Rclass{AnnotatedDataFrame} object can also be created using the function \Robject{read.AnnotatedDataFrame}. \section{Quality Assessment} The package, \Rpackage{affyQCReport} has functions for carrying out an assessment of the quality of Affymetrix microarrays. Unfortunately many of the tools rely on system functions, and for the present it is only likely to work well on a Unix-based system. The code below is given for those on Unix-systems (or on Windows with Latex installed). <>= library("affyQCReport") ##we need to fix up the sample names ##until the affyQAReport function is fixed s2 = spikein95 sampleNames(s2) = sub("_", ".", sampleNames(s2)) qarep = affyQAReport(s2) @ You may then view the report by using the \Rfunction{openQAReport} function. The report can be removed by using the \Rfunction{rmQAReport} function. \section{Preprocessing} \label{rma} To convert probe level data to expression measures (preprocessing), we use RMA as example. <>= eset <- rma(spikein95) @ If your computer has limited memory this command might fail. In that case you can try to use \Rfunction{just.rma}, but to use it you must have CEL files. So, if you brought your own CEL files, you can also use something similar to the code below. \Rfunction{just.rma}. <>= eset<- just.rma(filenames=list.celfiles()) @ After this function has run \Robject{eset} is an instance of the class \Robject{exprSet} which contains the expression values and other important experimental information. The expression values calculated by \Rfunction{rma} are in log base 2 scale. A matrix with the expression information is readily available. The following code extracts the expression data and obtains the dimensions of the matrix containing the data. <>= e <- exprs(eset) dim(e) @ \begin{Ex} How many probe sets are there in this dataset? \end{Ex} \begin{solution} <>= dim(e) dim(exprs(spikein95)) @ \end{solution} The \Robject{phenoData} information that was stored with the raw data has been copied to the output of the preprocessing function. <>= pData(eset) pData(spikein95) @ You can conveniently use \verb+$+ to access each column and create indexes denoting which columns represent each population: <>= Index1 <- which(eset$population == "Pop1") Index2 <- which(eset$population == "Pop2") @ % We will use this information in the next section. \subsection*{Optional exercise: other preprocessing methods} Here we are only demonstrating the use of RMA. Other options are available through the functions \Robject{mas5} and \Robject{expresso}. If you find \Robject{expresso} too slow, the package \Rpackage{affyPLM} provides an alternative, \Robject{threestep}, that is faster due to the use of C code. In the next code chunk we create another expression set object, this time using \Rfunction{gcrma} to normalize the data. % <>= library("gcrma") eset2 <- gcrma(spikein95) @ % Then, in this next code chunk we make use of \Rfunction{threestep} from the \Rpackage{affyPLM} package. There are a number of different options that can be set. In the code chunk below we do not do any background correction, normalization is done using quantile normalization (the default) and a one step Tukey biweight function is used to summarize the data, per probeset. <>= library("affyPLM") eset3 <- threestep(spikein95, background=FALSE, summary.method="tukey.biweight") @ % \section{Ranking and filtering probesets} \label{tests} Now we have, in \Robject{e}, a measurement $x_{ijk}$ of log (base 2) expression from each probeset $j$ on each array $i$ for both populations $k=1,2$. You can rank the probesets as soon as you select a statistic. %------------------------------------------------------------ \subsection{Summary statistics and tests for ranking} %------------------------------------------------------------ \paragraph{Log fold-change} A naive first choice is simply the average log fold-change: the \Rfunction{rowMeans} function works on matrix and provides a much faster alternative to the commonly used function \Rfunction{apply}: <>= d <- rowMeans(e[, Index2]) - rowMeans(e[, Index1]) @ The variability of fold-change measurements often depends on over-all intensity of the probeset in question. An example is average log expression, which we compute in the code chunk below. <>= a <- rowMeans(e) @ % \begin{Ex} Plot \Robject{d} against \Robject{a} (see Figure~\ref{figdvsa}) to see if this applies to your data. Also plot \Robject{d} versus \Robject{rank(a)}. Does the variability of the \Robject{d}-values depend on \Robject{a}? \end{Ex} \begin{solution} <>= par(mfrow=c(1,2)) plot(a, d, pch=".", ylim=c(-2,2)) abline(h=0,col="blue") plot(rank(a), d, pch=".", ylim=c(-2,2)) abline(h=0,col="pink") @ \end{solution} \inclfig{figdvsa}{0.75\textwidth}{Scatterplot of \Robject{d} against \Robject{a}.} Another simple statistic that takes into account the probeset-specific variation is the $t$-statistic. You can use the function \Rfunction{rowttests} from the package \Rpackage{genefilter}, which applies the $t$--test to each row: % <>= library("genefilter") tt <- rowttests(eset, "population") @ % The first argument is the \Rclass{exprSet}, the second indicates the covariate which defines the two groups to be compared. When there are few replicates the variance is not well estimated and the $t$-statistic can perform poorly. And alternative statistics that borrow strength across all genes often provide better results. An example of such a modified $t$-statistic is based on an empirical Bayes approach, implemented in the \Rfunction{eBayes} function in the \Rpackage{limma} package. <>= library("limma") design <- model.matrix(~eset$population) fit <- lmFit(eset, design) ebayes <- eBayes(fit) @ Because the example data set is from a calibration experiment, the truly differentially expressed genes are known. Interested users can refer to the optional exercise to carry out the comparison and confirm the better performance of the moderated versus classical $t$-statistics in this case. When sample sizes are moderate, say ten in each group there is generally no advantage to using the Bayesian approach. \subsection{Visualization of Differential Expression} The volcano plot is a useful way to see the estimate of the log fold-change and statistic you choose to rank the genes simultaneously. Figure~\ref{volcano} plots $p$-values (more specifically $-\log_{10}$ versus the $p$-value) versus effect size. For simplicity we assume the $t$-statistic follows a $t$-distribution to obtain the $p$-values. To create a volcano plot you can use the following simple code: % <>= lod <- -log10(tt$p.value) plot(d, lod, cex=.25, main="Volcano plot for $t$-test") abline(h=2) @ \inclfig{volcano}{0.5\textwidth}{Volcano plot.} % \begin{Ex} Generate the volcano plot using a moderated $t$-statistic (Figure~\ref{volcanoeb}). Do they look the same? \end{Ex} \begin{solution} <>= plot(d, -log10(ebayes$p.value[,2]), xlim=c(-1,1), cex=.25, main="Volcano plot for t-test") abline(h=2) @ \end{solution} \inclfig{volcanoeb}{0.5\textwidth}{Volcano plot when using moderated $t$-statistic.} You can highlight a subset of genes with the \Rfunction{points} function. For example, you could set the graphical parameters to use blue, (\verb+col="blue"+) diamonds, (\verb+pch=18+), to denote the top 25 genes ranked by average fold change. \subsection{Highlighting interesting genes} \begin{Ex} Can you highlight the top 20 genes ranked by smallest $p$-value or the largest statistic of your choice, with a different color and symbol (as in Figure~\ref{volcano2})? (Hint: look at the function \Rfunction{points} with options \Robject{col="blue"} and \Robject{pch=18}.) \end{Ex} \begin{solution} <>= plot(d,lod,cex=.25,xlim=c(-1,1),ylim=range(lod),main="Volcano plot") o1 <- order(abs(d),decreasing=TRUE)[1:25] points(d[o1],lod[o1],pch=18,col="blue") @ \end{solution} \inclfig{volcano2}{0.5\textwidth}{Volcano plot with highlighting of the top \Sexpr{length(o1)} probesets.} \subsection{Selecting cut-offs} When you get to this section, you have generated three statistics that can be used to rank genes. Now we turn our attention to deciding on a cutoff. A naive approach is to consider genes attaining $p$-values less than 0.01. However, because we are testing many hypotheses, the $p$-values no longer have the typical meaning. There are \Sexpr{nrow(e)} null hypotheses (this is the number of probesets on the array) in the example data set. If they are all true (no genes are differentially expressed) we expect $0.01 \times \Sexpr{nrow(e)} = \Sexpr{nrow(e)*0.01}$ false positives. %%FIXME: but we should do some non-specific filtering first... How many probesets with $p<0.01$ do you actually see using the $t$-test? Or with the moderated $t$-test? <<>>= table(tt$p.value<=0.01) @ Various approaches have been suggested for dealing with the multiple testing problem. Chapter 15 gives a detailed discussion and the \Rpackage{multtest} package provides extensive implementations of the different options. Alternatively, the \Rfunction{topTable} in \Rpackage{limma} package provides some adjustment to the raw p-values, including Benjamini \& Hochberg's False Discovery Rate (FDR), simple Bonferroni correction and several others. For details look at help files for \Rfunction{topTable} and \Rfunction{p.adjust}. The following example lists the top 10 genes and creates a report. <>= tab <- topTable(ebayes,coef=2,adjust.method="BH",n=10) genenames <- as.character(tab$ID) @ Here, \Robject{coef=2} specifies that we care for the second coefficient in the linear mode fit, that is the line slope, as the parameter of interest. The first coefficient is the intercept. The parameter \Rfunarg{n} indicates how many genes should be selected. \subsection{Annotation} The following is an optional exercise that you can do when you have internet access. First we load the \Rpackage{annotate} package % <>= library("annotate") @ and find out which metadata package we need. % <>= annotation(eset) @ % For most platforms the character returned by \Rfunction{annotate} function is the name of the annotation package you will need. However, Bioconductor maintains annotation information for both HG-U95A and HG-U95Av2 platforms in the same package \Rpackage{hgu95av2}. Now load the package \Rpackage{hgu95a}. % <>= library("hgu95a") @ Now you can add more annotation information about our top 10 interesting genes. For example, the EntrezGene ID and gene symbol, <>= ll <- getLL(genenames, "hgu95a") sym <- getSYMBOL(genenames, "hgu95a") @ % With these values available, we can use the following code to create an HTML page, useful for instance to share results with collaborators. % <>= tab <- data.frame(sym, signif(tab[,-1], 3)) htmlpage(ll, filename="GeneList1.html", title="HTML report", othernames=tab, table.head=c("Locus ID",colnames(tab)), table.center=TRUE) @ Look for the resulting file \textit{GeneList1.html} in your R working directory. <<>>= browseURL("GeneList1.html") @ Look for a file named \verb+GeneList1.html+ in your R working directory. The above HTML report only connects with EntrezGene. To create a report with more annotation information, we can use the \Rpackage{annaffy} package. Below are a few lines of code that create a useful HTML report HTML report with links to various annotation sites. % <>= library("KEGG") library("GO") library("annaffy") atab <- aafTableAnn(genenames, "hgu95av2", aaf.handler()) saveHTML(atab, file="GeneList2.html") @ \Rfunction{aaf.handler()} returns 13 annotation types. You can also select a subset instead of using all of these. <>= aaf.handler() atab <- aafTableAnn(genenames, "hgu95av2", aaf.handler()[c(2,5,8,12)]) saveHTML(atab, file="GeneList3.html") @ \end{document}