\documentclass{article} \usepackage{amsmath} \usepackage{amscd} \usepackage[tableposition=top]{caption} \usepackage{ifthen} \usepackage[utf8]{inputenc} \topmargin 0in \headheight 0in \headsep 0in \oddsidemargin 0in \evensidemargin 0in \textwidth 176mm \textheight 215mm \begin{document} %\VignetteIndexEntry{Using edgeR -- differential expression for digital gene expression datasets} \title{\texttt{edgeR}: Methods for differential expression \\ in digital gene expression datasets} \author{Mark Robinson \\ \texttt{mrobinson@wehi.edu.au}} \maketitle \section{Introduction} \noindent This document gives a brief introduction and overview of the R Bioconductor package \texttt{edgeR}, which provides statistical routines for determining differential expression in digital gene expression data. The routines can be applied equally to SAGE, CAGE, Illumina/Solexa, 454 or ABI SOLiD experiments. In fact, the methods may be useful in other experiments where counts are observed. R packages for the processing of raw data files for digital gene expression (DGE) datasets are still in development stages (e.g. \texttt{ShortRead}) at time of writing. The methods presented here require a simple \texttt{DGEList} object that contains three pieces of information: \begin{enumerate} \item \texttt{data}: a table of counts where each row represents a gene/exon (or whatever genomic feature is being tracked) and each column is a different sample. \item \texttt{group}: a vector (with length equal to the number of columns of \texttt{data}) denoting the experimental group. \item \texttt{lib.size} (same length as \texttt{group}): the total number of reads sequenced for each sample. \end{enumerate} We now discuss a couple examples. \section{Moderated negative binomial dispersions} The basic model we use for DGE data is based on the negative binomial distribution. The model is very flexible. For example, if $Y$ is distributed as $NB(\mu,\phi)$, then the expected value of $Y$ is $\mu$ and the variance is $\mu + \mu^2 \cdot \phi$, thus giving sufficient flexibility for many scenarios in observing count data. The observed data can be denoted as $Y_{gij}$ where $g$ is the gene (tag, exon, etc.), $i$ is the experimental group and $j$ is the index of the replicate. We can model the counts as \[ Y_{gij} \sim NB(M_j \cdot p_{gi}, \phi_g) \] where $p_{gi}$ represents the proportion of the sequenced sample for group $i$ that is tag $g$ and $M_j$ represents the library size. It is of interest to find genes where, for example, $p_{g1}$ is significantly different from $p_{g2}$. The parameter $\phi_g$ is the overdispersion (relative to the Poisson) and represents the biological, or sample-to-sample variability. The methods we developed moderate the dispersion estimates towards a common dispersion, much like how the \texttt{limma} package moderates the variances in the analysis of microarray data. To illustrate the methods, we use the Zhang et al. (1997) SAGE dataset. First, we read in the raw data: <>= library(edgeR) WD <- getwd() dataPath <- paste(.find.package("edgeR"),"data",sep="/") fn <- dir(dataPath, "txt") fn setwd(dataPath) head(read.table(fn[1],header=TRUE)) dataList <- readDGE(fn,header=TRUE) setwd(WD) @ \texttt{dataList} is a \texttt{list} object with elements for the table of counts and the library sizes. We need to add to this a `group' variable which specifies the experimental condition and then we create a \texttt{DGEList} object that is the container for digital gene expression data. Note that your raw data may come from a different source (e.g. another R object), but the \texttt{DGEList} needs the 3 pieces of information mentioned above. <>= cls <- gsub("[0-9]","",colnames(dataList$data)) cls keep <- rowSums(dataList$data) > 8 # to keep the compute time lower sum(keep) d<-DGEList(data=dataList$data[keep,],group=cls,lib.size=dataList$lib.size) # call constructor d @ To run the moderated analysis, we first need to determine how much moderation is necessary. For this, we use an empirical Bayes rule and involves calculating a weight parameter $\alpha$. Following this, the main function to do the statistical testing is called \texttt{deDGE}. <>= alpha <- alpha.approxeb(d) alpha ms <- deDGE(d,alpha=alpha$alpha) ms @ It may be informative to look at the data via logRatio vs. Abundance (or M vs. A) plots, as is commonly done in the analysis of microarray experiments for 2-group comparisons. In addition, we can highlight those genes that are determined to be differentially expressed (here, using a 5\% FDR multiple testing correction). To do this, you can call the commands: <>= exactP <- exactTestNB(ms$pseudo, ms$group, pair=c("Tu","NC"), ms$M * ms$ps$p.common, ms$r, verbose=TRUE) exactPadj <- p.adjust(exactP,"fdr") k <- (exactPadj<.05) plotMA( ms, xlim=c(-16,-5), ylim=c(-6,6), xlab="Relative Abundance", ylab="log Ratio", col=c("black","blue")[k+1]) @ %Also, we may want to look at distributions of the counts to see the effect of different total numbers of reads. An easy way to do this is: %<>= %boxplot(as.data.frame(d$data^.2)) %boxplot(as.data.frame(ms$pseudo^.2)) %@ Or, you can have a look at the sorted list of differential expression calls, according to: <>= topTags(ms) @ The above procedure will work for experiments with more than 2 groups as well. At present, the testing can only be done in a pairwise fashion. Therefore, with multiple groups you will need to specify the pair of groups to do the testing on. \section{Poisson example} It has been observed that in some deep sequencing approaches that not a great deal of overdispersion is observed. Specifically, the means and variances appear to be very close to each other, suggesting the Poisson distribution is a good fit. Methods within the \texttt{edgeR} package may still be useful, including the quantile adjustment (effectively a normalization) and the exact testing routines. To illustrate this, we sample Poisson data and run \texttt{deDGE} with the \texttt{doPoisson} argument set to \texttt{TRUE}. The data is quantile-adjusted and when the exact test is invoked, the dispersion parameter to 0. For example: <>= set.seed(101) n <- 1000 lib.sizes<-c(40000,50000,38000,40000) p<-runif(n,min=.0001,.001) mu<-outer(p,lib.sizes) mu[1:5,3:4]<-mu[1:5,3:4]*8 y<-matrix(rpois(4*n,lambda=mu),nrow=n) dP<-DGEList(data=y,group=rep(1:2,each=2),lib.size=lib.sizes) msP<-deDGE(dP,doPoisson=TRUE) msP @ And you can proceed as before: <>= topTags(msP) plotMA(msP,col=c( rep("blue",5), rep("black",n-5) )) @ \section{Future improvements and extension} Here, we list some improvements/extensions that are planned for the \texttt{edgeR} package: \begin{enumerate} \item As the packages for the processing of raw high-throughput sequencing data become more mature, \texttt{edgeR} may need to adapt and operate on different objects. As shown above, \texttt{edgeR} operates on a simple object containing simple data summaries which will presumably be readily available from pre-processing steps. \item Some speed improvements have been made but as the datasets become larger, some further optimizations may be necessary. We are exploring various ways to do this. \item The current quantile-based normalization assumes the library sizes as fixed. Depending on the circumstances of the samples in question, it may be necessary to explore something like an {\em effective} library size. \item We are exploring more general testing procedures. \end{enumerate} %<>= %getwd() %dir("") %@ \section{Setup} This vignette was built on: <>= sessionInfo() @ \section{References} See the following manuscripts for further details: \begin{enumerate} \item Robinson MD, Smyth GK. {\em Moderated statistical tests for assessing differences in tag abundance.}{\bf Bioinformatics}. 2007 Nov 1;23(21):2881-7. \item Robinson MD, Smyth GK. {\em Small-sample estimation of negative binomial dispersion, with applications to SAGE data.} {\bf Biostatistics}. 2008 Apr;9(2):321-32. \end{enumerate} \end{document}