%\VignetteIndexEntry{Differential expression for RNA-seq data with dispersion shrinkage}
%\VignettePackage{DSS}

\documentclass{article}

\usepackage{float}
\usepackage{Sweave}
\usepackage[a4paper]{geometry}
\usepackage{hyperref,graphicx}

\textwidth=6.5in
\textheight=9in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.5in
\footskip=0.6in

\renewcommand{\baselinestretch}{1.3}

\SweaveOpts{keep.source=TRUE,eps=FALSE,include=TRUE,width=4,height=4}

%\newcommand{\Robject}[1]{\texttt{#1}}
%\newcommand{\Rpackage}[1]{\textit{#1}}
%\newcommand{\Rclass}[1]{\textit{#1}}
%\newcommand{\Rfunction}[1]{{\small\texttt{#1}}}

\author{Hao Wu \\[1em]Department of Biostatistics and Bioinformatics\\
Emory University\\
Atlanta, GA 30322 \\[1em] \texttt{hao.wu@emory.edu}}

\title{\textsf{\textbf{Differential expression with DSS \\
(Dispersion Shrinkage for Sequencing data)}}}

\begin{document}
\maketitle
\tableofcontents

%% abstract
\begin{abstract}
This vignette introduces the Bioconductor package DSS
(\underline{D}ispersion \underline{S}hrinkage for \underline{S}equencing data),
which is designed primarily for detecting differential expression
in count data from RNA-seq.
DSS uses new procedures to estimate and shrink gene-specific dispersions,
and then conducts a Wald test for hypothesis testing.
Compared with existing methods such as DESeq and edgeR,
DSS provides excellent statistical and computational performance,
especially when the overall dispersion level in the data is high.
\end{abstract}

\section{Introduction}
RNA-seq is a new technology for measuring the abundance of RNA products
in a biological sample. Compared to gene expression microarrays, it provides
a wider dynamic range and a higher signal-to-noise ratio, and it is
quickly becoming the technology of choice for quantifying gene expression.
One of the fundamental questions in RNA-seq data analysis is how
gene expression is regulated under different biological contexts.
Therefore, identifying differential expression (DE) remains a key task
in studying gene expression.

The major distinction between RNA-seq data and microarray data
is that the expression measurements are counts.
Most existing statistical methods model the count data
as over-dispersed Poisson or negative binomial.
The over-dispersion parameters, which represent the biological variation
among replicates within a treatment group, play a central role in
DE detection algorithms.

Several statistical methods and software tools are available
to perform DE detection from RNA-seq data, each with
different procedures for dispersion estimation and hypothesis testing.

Here we present a new DE detection algorithm.
First, the gene-specific dispersions are estimated with a method-of-moments estimator.
Then, data from all genes are combined to shrink the dispersions
through a penalized likelihood approach.
Finally, hypothesis testing is conducted using a Wald test.
Results show that the new method provides excellent performance
compared to existing methods, especially when the overall dispersion level is high.

The method is implemented in the Bioconductor package DSS, which stands for
\underline{D}ispersion \underline{S}hrinkage for \underline{S}equencing data.
Currently DSS only supports the comparison of expression between two treatment groups.
Methods for more advanced designs are under development and will be implemented soon.

\section{Getting started with {\tt DSS}}
Required inputs for DSS are (1) gene expression values as a matrix of integers,
with rows for genes and columns for samples; and
(2) a vector representing the experimental design. The length of the
design vector must match the number of columns of the count matrix.
Optionally, normalization factors or additional annotation for genes
can be supplied.

The basic data container in the package is the {\tt SeqCountSet} class,
which inherits directly from the {\tt ExpressionSet} class
defined in {\tt Biobase}.
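
As a minimal sketch of the input requirements described above, the following
chunk builds a small count matrix and a design vector in base R and checks
that they agree. The names {\tt toyCounts} and {\tt toyDesign} are hypothetical
and chosen only for illustration, and the checks are not part of the DSS API.
<<eval=FALSE>>=
## Hypothetical toy inputs (illustrative names, not part of DSS)
toyCounts <- matrix(rpois(24, lambda = 20), nrow = 4, ncol = 6)
toyDesign <- c(0, 0, 0, 1, 1, 1)
## counts must be whole numbers: rows are genes, columns are samples
stopifnot(is.matrix(toyCounts), all(toyCounts == round(toyCounts)))
## the design vector needs one entry per sample (column)
stopifnot(length(toyDesign) == ncol(toyCounts))
@
If either check fails, {\tt stopifnot} raises an error that names the failing condition.
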
A {\tt SeqCountSet} object contains all the information necessary for a DE analysis:
gene expression counts, the experimental design, and additional annotations.

A typical DE analysis consists of the following simple steps.
\begin{enumerate}
\item Create a {\tt SeqCountSet} object using {\tt newSeqCountSet}.
\item Estimate normalization factors using {\tt estNormFactors}.
\item Estimate and shrink gene-wise dispersions using {\tt estDispersion}.
\item Perform a two-group comparison using {\tt waldTest}.
\end{enumerate}

The usage of DSS is demonstrated by the simple simulation below.
\begin{enumerate}
\item First load the library, and make a {\tt SeqCountSet}
object from simulated counts for 2000 genes and 6 samples.
<<>>=
library(DSS)
## first 100 genes: mean 10 in group 0, mean 50 in group 1
counts1=matrix(rnbinom(300, mu=10, size=10), ncol=3)
counts2=matrix(rnbinom(300, mu=50, size=10), ncol=3)
X1=cbind(counts1, counts2) ## these are 100 DE genes
## remaining 1900 genes: same mean in all 6 samples
X2=matrix(rnbinom(11400, mu=10, size=10), ncol=6)
X=rbind(X1,X2)
designs=c(0,0,0,1,1,1)
seqData=newSeqCountSet(X, designs)
seqData
@

\item Estimate normalization factors.
<<>>=
seqData=estNormFactors(seqData)
@

\item Estimate and shrink gene-wise dispersions.
<<>>=
seqData=estDispersion(seqData)
@

\item With normalization factors and dispersions ready, the two-group comparison can be
conducted via a Wald test:
<<>>=
result=waldTest(seqData, 0, 1)
head(result,5)
@
\end{enumerate}

\section{Session Info}
<<>>=
sessionInfo()
@

\end{document}