% To compile this document % library('cacheSweave'); rm(list=ls()); Sweave('DESeq2_parathyroid.Rnw',driver=cacheSweaveDriver()); system('pdflatex DESeq2_parathyroid'); relocateAnswers(); for(i in 1:2) system('pdflatex DESeq2_parathyroid') \documentclass[12pt]{article} \usepackage[a4paper,left=2cm,top=2cm,bottom=3cm,right=2cm,ignoreheadfoot]{geometry} \pagestyle{empty} \usepackage[usenames,dvipsnames]{xcolor} \usepackage{helvet} \usepackage[titletoc]{appendix} \usepackage{tocloft} \setlength{\parindent}{0em} \setlength{\parskip}{.5em} \renewcommand{\familydefault}{\sfdefault} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\small\texttt{#1}}} \newcommand{\Rfunction}[1]{\Robject{#1}} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rclass}[1]{\textit{#1}} \newcommand{\lowtilde}{\raise.17ex\hbox{$\scriptstyle\mathtt{\sim}$}} \newcommand{\thetitle}{Differential analysis of RNA-Seq data at the gene level using the DESeq2 package} \usepackage[pdftitle={\thetitle},% pdfauthor={Michael Love, Simon Anders, Wolfgang Huber},% bookmarks,% colorlinks,% linktoc=section,% linkcolor=RedViolet,% citecolor=RedViolet,% urlcolor=RedViolet,% linkbordercolor={1 1 1},% citebordercolor={1 1 1},% urlbordercolor={1 1 1},% raiselinks,% plainpages,% pdftex]{hyperref} \usepackage{cite} \renewcommand{\floatpagefraction}{0.9} \usepackage{sectsty} \sectionfont{\sffamily\bfseries\color{RoyalBlue}\sectionrule{0pt}{0pt}{-1ex}{1pt}} \subsectionfont{\sffamily\bfseries\color{RoyalBlue}} \subsubsectionfont{\sffamily\bfseries\color{RoyalBlue}} \usepackage{fancyhdr} \pagestyle{fancy} \fancyhead{} \fancyfoot{} \lfoot{}\cfoot{}\rfoot{} \renewcommand{\headrule}{} \usepackage{graphicx} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=FALSE,png=TRUE,include=FALSE,width=4,height=4.5,resolution=150} \setkeys{Gin}{width=0.5\textwidth} \definecolor{darkgray}{gray}{0.2} \DefineVerbatimEnvironment{Sinput}{Verbatim}{fontsize=\small,xleftmargin=1em,formatcom={\color{darkgray}}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\small,xleftmargin=1em,frame=leftline,framerule=.6pt,rulecolor=\color{darkgray},framesep=1em,formatcom={\color{darkgray}}} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \newcommand{\fixme}[1]{{\textbf{Fixme}: \textit{\textcolor{blue}{#1}}}} %%--Questions and Answers macro; this could be but into a re-usable style file \newcounter{questionCounter} \newcounter{answerCounter} \newenvironment{question}{% \refstepcounter{questionCounter} \par \textbf{Question \arabic{questionCounter}}:}{} \newenvironment{answer}{% \refstepcounter{answerCounter} \par \textbf{Answer \arabic{answerCounter}}:}{} %------- Unsatisfactory attempt at the above: %\newcommand{\question}[2]{% %\refstepcounter{QuestionsAndAnswersCounter} %\paragraph{Question \arabic{QuestionsAndAnswersCounter}}: \textit{#1} %\paragraph{Answer to Question \arabic{QuestionsAndAnswersCounter}: \unexpanded{#2}}} %\usepackage{newfile} %\openoutputfile{\jobname.qna}{qna} %\newoutputstream{qna} %\newcommand{\question}[2]{% %\refstepcounter{QuestionsAndAnswersCounter} %\textbf{Question \arabic{QuestionsAndAnswersCounter}}: \textit{#1} %\addtostream{qna}{\unexpanded{\paragraph{Answer to Question \arabic{QuestionsAndAnswersCounter}: #2}}}} % ------------------------------------------------------------ \author{Michael Love, Simon Anders, Wolfgang Huber} \title{\textsf{\textbf{\thetitle}}} %------------------------------------------------------------ \begin{document} \maketitle \tableofcontents <>= options(digits=3, width=80) library("cacheSweave") setCacheDir("cachedir") @ \section{Introduction} In this lab, you will learn how to analyse a count table, such as arising from a summarised RNA-Seq experiment, for differentially expressed genes. \section{Input data} \subsection{Experiment data} We read in a prepared \Rclass{SummarizedExperiment}, which was generated from publicly available data from the article by Felix Haglund et al., ``Evidence of a Functional Estrogen Receptor in Parathyroid Adenomas'', J Clin Endocrin Metab, Sep 2012, \url{http://www.ncbi.nlm.nih.gov/pubmed/23024189}. Details on the generation of this object can be found in the vignette for the \Rpackage{parathyroidSE} package, \url{http://bioconductor.org/packages/release/data/experiment/html/parathyroidSE.html}. The purpose of the experiment was to investigate the role of the estrogen receptor in parathyroid tumors. The investigators derived primary cultures of parathyroid adenoma cells from 4 patients. These primary cultures were treated with diarylpropionitrile (DPN), an estrogen receptor $\beta$ agonist, or with 4-hydroxytamoxifen (OHT). RNA was extracted at 24 hours and 48 hours from cultures under treatment and control. The blocked design of the experiment allows for statistical analysis of the treatment effects while controlling for patient-to-patient variation. We first load the \Rpackage{DESeq2} package and the data package \Rpackage{parathyroidSE}, which contains the example data set. <>= library( "DESeq2" ) library( "parathyroidSE" ) @ The \Rfunction{data} command loads a data object. <>= data("parathyroidGenesSE") @ The information in a \Rclass{SummarizedExperiment} object can be accessed with accessor functions. For example, to see the actual data, i.e., here, the read counts, we use the \Rfunction{assay} function. (The \Rfunction{head} function restricts the output to the first few lines.) <>= head( assay( parathyroidGenesSE ) ) @ In this count table, each row represents an Ensembl gene, each column a sequenced RNA library, and the values give the raw numbers of sequencing reads that were mapped to the respective gene in each library. \begin{question} For how many genes are there counts in this table? \end{question} \begin{answer} <>= nrow(parathyroidGenesSE) @ \end{answer} We also have metadata on each of the samples (the ``columns'' of the count table): % <>= colData( parathyroidGenesSE ) @ \begin{question} What are the metadata for the genes (the ``rows'' of the count table)? \end{question} \begin{answer} <>= rowData( parathyroidGenesSE ) @ \end{answer} \subsection{Collapsing technical replicates} There are a number of samples which were sequenced in multiple runs. For example, sample \Robject{SRS308873} was sequenced twice. To see, we list the respective columns of the \Rfunction{colData}. (The use of \Rfunction{as.data.frame} forces R to show us the full list, not just the beginning and the end as before.) <>= as.data.frame( colData(parathyroidGenesSE)[,c("sample","patient","treatment","time")] ) @ We recommend to first add together technical replicates (i.e., libraries derived from the same samples), such that we have one column per sample. \fixme{Present the 'pedestrian' approach of chunk \texttt{poormanssum} first, only then (and as 'optional') the \Rfunction{sapply}-magic.} As is often the case, this preparatory step looks more complicated than the subsequent actual analysis. In fact, the following operations are not specific to \Rpackage{DESeq2}, but are specific preparations needed for this data set. To understand the general ideas of \Rpackage{DESeq2}, you could now skip to Section~\ref{sec:runDESeq}. What you will learn in the rest of this section is an example of a typical preparatory data manipulation task done with elementary R functions. Details on these can be found in general textbooks on R; also consider reading the help pages of the functions used. We first use the function \Rfunction{split} to see which columns need to be collapsed. <>= allColSamples <- colData(parathyroidGenesSE)$sample sp <- split( seq(along=allColSamples), allColSamples ) @ Using \Rfunction{sapply}, we loop over the elements of \Robject{sp}, which correspond to the distinct samples, construct subtables of the count table (i.e., \Rfunction{assay(parathyroidGenesSE)}) corresponding only to the current sample considered, and add up across rows if there is more than one column. The result of the \Rfunction{sapply} call is a new table, in which each column now corresponds to a different sample. % <>= countdata <- sapply(sp, function(columns) rowSums( assay(parathyroidGenesSE)[,columns,drop=FALSE] ) ) head(countdata) @ Novice users might find the preceding two code chunks difficult. Of course, there is a much easier way to add up the columns, namely by explicitly specifying the indices of the columns we want to use as is and the columns we want to add up, and using \Rfunction{cbind} to bind all the columns to a matrix: <>= a <- assay(parathyroidGenesSE) countdata2 <- cbind( a[,1:8], a[,9]+a[,10], a[,11], a[,12]+a[,13], a[,14:22], a[,23]+a[,24], a[,25], a[,26]+a[,27] ) all( countdata == countdata2 ) @ While this is simpler to understand, it is more error-prone. Mistakes can easily happen when determining the column indices, and it is tedious to update the code if the input data changes, for instance, if at a later time you would like to add more replicates to your data set. Hence, if you are a beginner in R and want to improve your R skills, try to understand how the \Rfunction{split} and the \Rfunction{sapply} calls above work, because only learning to master such expressions will give you the skills to make full use of R. Having reduced our count data table to only one column per sample, we next need to subset the column metadata accordingly, as we now have less columns. We also now use the sample names as names for the column data rows: % <>= coldata <- colData(parathyroidGenesSE)[sapply(sp, `[`, 1),] rownames(coldata) <- coldata$sample coldata @ % \begin{question} What do the quotation marks in the expression \Robject{`[`} do? What happens if you omit them? \end{question} \begin{answer} The function \Rfunction{sapply} expects an R function as its second argument. Here, we want to provide it with the function for vector subsetting (as in \Robject{a[1]}), and the name of this function is \Robject{[}. However, if we provide that name without the quotation marks, the R interpreter gets confused and complains about the unexpected symbol (try this out). Hence we need to quote the function name in our call to \Rfunction{sapply}. \end{answer} To unclutter the output in the subsequent steps, we only keep the column data columns that we actually need for our analysis. % <>= coldata <- coldata[ , c( "patient", "treatment", "time" ) ] head( coldata ) @ Our \Rclass{SummarizedExperiment} object also contains metadata on the rows, which we can simply keep unchanged: <>= rowdata <- rowData(parathyroidGenesSE) rowdata @ We now have all the ingredients to prepare our data object in a form that is suitable for analysis, namely: \begin{itemize} \item \Robject{countdata}: a table with the read counts, with technical replicates summed up, \item \Robject{coldata}: a table with metadata on the count table's columns, i.e., on the samples, \item \Robject{rowdata}: a table with metadata on the count table's rows, i.e., on the genes, and \item a design formula, which tells which factors in the column metadata table specify the experimental design and how these factors should be used in the analysis. We specify \Rfunction{\lowtilde{} patient + treatment}, which means that we want to test for the effect of treatment (the last factor), controlling for the effect of patient (the first factor). You can use R's formula notation to express any experimental design that can be described within an ANOVA-like framework. \end{itemize} To now construct the data object from the matrix of counts and the metadata table, we use: <>= ddsFull <- DESeqDataSetFromMatrix( countData = countdata, colData = coldata, design = ~ patient + treatment, rowData = rowdata) ddsFull @ <>= stopifnot(sum(assay(parathyroidGenesSE)) == sum(counts(ddsFull))) @ \section{Running the DESeq2 pipeline}\label{sec:runDESeq} Here we will analyze a subset of the samples, namely those taken after 48 hours, with either control or DPN treatment, taking into account the multifactor design. \subsection{Preparing the data object for the analysis of interest} First we subset the relevant columns from the full dataset: <>= dds <- ddsFull[ , colData(ddsFull)$treatment %in% c("Control","DPN") & colData(ddsFull)$time == "48h" ] @ Sometimes it is necessary to ``refactor'' the factors, in case that levels have been dropped. (Here, for example, the treatment factor still contains the level ``OHT'', but no sample to this level.) <>= dds$patient <- factor(dds$patient) dds$treatment <- factor(dds$treatment) @ It will be convenient to make sure that \texttt{Control} is the \emph{first} level in the treatment factor, so that the $\log_2$ fold changes are calculated as treatment over control. The function \Rfunction{relevel} achieves this: <>= dds$treatment <- relevel( dds$treatment, "Control" ) @ % We specify the design (although % this was already added when we constructed \Robject{dds}), and put % the main variable of interest at the end of the formula (because the functions % for extracting results and plotting fold changes take the last variable % in the design formula by default). % <>= % design(dds) <- ~ patient + treatment % @ A quick check whether we now have the right samples: <>= colData(dds) @ \subsection{Running the pipeline} With the data object prepared, the \Rpackage{DESeq2} analysis can now be run with a single call to the function \Rfunction{DESeq}: <>= dds <- DESeq(dds) @ \subsection{Inspecting the results table} The results for the last variable in the design formula, in our case the \Rcode{treatment} variable, can be extracted using the \Rfunction{results} function. <>= res <- results(dds) res @ As \Robject{res} is a \Robject{DataFrame} object, it carries metadata with information on the meaning of the columns: <>= mcols(res) @ The first column, \Robject{baseMean}, is a just the average of the normalized count values, taken over all samples. The remaining four columns refer to a specific \emph{contrast}, namely the comparison of the levels \emph{DPN} versus \emph{Control} of the factor variable \emph{treatment}. See the help page for \Rfunction{results} (by typing \Rfunction{?results}) for information on how to obtain other contrasts. The column \Robject{log2FoldChange} is the effect size estimate. It tells us how much the gene's expression seems to have changed due to treatment with DPN in comparison to control. This value is reported on a logarithmic scale to base 2: for example, a $\log_2$ fold change of 1.5 means that the gene's expression is increased by a factor of $2^{1.5}\approx 2.82$. Of course, this estimate has an uncertainty associated with it, which is available in the column \Robject{lfcSE}, the standard error estimate for the log2 fold change estimate. We can also express the uncertainty of a particular effect size estimate as the result of a statistical test. The purpose of a test for differential expression is to test whether the data provides sufficient evidence to conclude that this value is really different from zero (and that the sign is correct). DESeq2 performs for each gene a \emph{hypothesis test} to see whether evidence is sufficient to decide against the \emph{null hypothesis} that there is no effect of the treatment on the gene and that the observed difference between treatment and control was merely caused by experimental variability (i.\,e., the type of variability that you can just as well expect between different samples in the same treatment group). As usual in statistics, the result of this test is reported as a \emph{p value}, and it is found in the column \Robject{pvalue}. (Remember that a p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis.) Finally, we note that a subset of the p values in \Robject{res} are \Robject{NA} (``not available''). This is \Rfunction{DESeq}'s way of reporting that all counts for this gene were zero, and hence not test was applied. \begin{question} How could you check to see if the \Robject{baseMean} is the mean of raw counts or the mean of normalized counts? \end{question} \begin{answer} The raw counts and normalized counts of a \Rclass{DESeqDataSet} object are available via the accessor function \Rfunction{counts}, which has an argument \Robject{normalized}, which defaults to \Robject{FALSE}. <>= all.equal(res$baseMean, rowMeans(counts(dds))) all.equal(res$baseMean, rowMeans(counts(dds,normalized=TRUE))) @ \end{answer} \subsection{Multiple testing} Novices in high-throughput biology often assume that thresholding these p values at 0.05, as is often done in other settings, would be appropriate -- but it is not. We briefly explain why: There are \Sexpr{sum( res$pvalue < 0.05, na.rm=TRUE )} genes with a p value below 0.05 among the \Sexpr{sum( !is.na(res$pvalue) )} genes, for which the test succeeded in reporting a p value: <>= sum( res$pvalue < 0.05, na.rm=TRUE ) table( is.na(res$pvalue) ) @ Now, assume for a moment that the null hypothesis is true for all genes, i.e., no gene is affected by the treatment with DPN. Then, by the definition of \emph{p value}, we expect up to 5\% of the genes to have a p value below 0.05. This amounts to \Sexpr{round( sum( !is.na(res$pvalue) ) * 0.05 )} genes. If we just considered the list of genes with a p value below 0.05 as differentially expressed, this list should therefore be expected to contain up to $\Sexpr{round( sum( !is.na(res$pvalue) ) * 0.05)} / \Sexpr{sum( res$pvalue < 0.05, na.rm=TRUE ) } = \Sexpr{round( sum( !is.na(res$pvalue) ) * 0.05 / sum( res$pvalue < 0.05, na.rm=TRUE ) * 100, 0 )}$\% false positives! \Rpackage{DESeq2} uses the so-called Benjamini-Hochberg (BH) adjustment; in brief, this method calculates for each gene an \emph{adjusted p value} which answers the following question: if one called significant all genes with a p value less than or equal to this gene's p value threshold, what would be the fraction of false positives (the \emph{false discovery rate}, FDR) among them (in the sense of the calculation outlined above)? These values, called the BH-adjusted p values, are given in the column \Robject{padj} of the \Robject{results} object. Hence, if we consider a fraction of 10\% false positives acceptable, we can consider all genes with an \emph{adjusted} p value below 10\%=0.1 as significant. How many such genes are there? <>= sum( res$padj < 0.1, na.rm=TRUE ) @ We subset the results table to these genes and then sort it by the log2-fold-change estimate to get the significant genes with the strongest down-regulation <>= resSig <- res[ which(res$padj < 0.1 ), ] head( resSig[ order( resSig$log2FoldChange ), ] ) @ and with the strongest upregulation <>= tail( resSig[ order( resSig$log2FoldChange ), ] ) @ \begin{question} What is the proportion of down- and up-regulation among the genes with adjusted p value less than 0.1? \end{question} \begin{answer} <>= table(sign(resSig$log2FoldChange)) @ \end{answer} \subsection{Diagnostic plots} A so-called MA plot provides a useful overview for an experiment with a two-group comparison: <>= plotMA(dds, ylim = c( -1.5, 1.5 ) ) @ The plot (Fig.\ \ref{MA1}) represents each gene with a dot. The $x$ axis is the average expression over all samples, the $y$ axis the $\log_2$ fold change between treatment and control. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in red. \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-plotMA1.png} \caption{The MA-plot shows the $\log_2$ fold changes from the treatment over the mean of normalized counts, i.e. the average of counts normalized by size factor. The \Rpackage{DESeq2} package incorporates a prior on $\log_2$ fold changes, resulting in moderated estimates from genes with low counts and highly variable counts, as can be seen by the narrowing of spread of points on the left side of the plot.} \label{MA1} \end{figure} This plot demonstrates that only genes with an average normalized count above 10 contain sufficient information to yield a significant call, and only above about 300 counts can smaller fold-changes become significant. Also note \Rpackage{DESeq2}'s shrinkage estimation of log fold changes (LFCs): When count values are too low to allow an accurate estimate of the LFC, the value is ``shrunken'' towards zero to avoid that these values, which otherwise would frequently be unrealistically large, dominate the top-ranked log fold changes. Whether a gene is called significant depends not only on its LFC but also on its within-group variability, which \Rpackage{DESeq2} quantifies as the \emph{dispersion}. For strongly expressed genes, the dispersion can be understood as a squared coefficient of variation: a dispersion value of 0.01 means that the gene's expression tends to differ by typically $\sqrt{0.01}=10\%$ between samples of the same treatment group. For weak genes, the Poisson noise is an additional source of noise, which is added to the dispersion. The function \Rfunction{plotDispEsts} visualizes \Rpackage{DESeq2}'s dispersion estimates: <>= plotDispEsts( dds ) @ \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-dispPlot.png} \caption{Plot of dispersion estimates. See text for details} \label{dispPlot} \end{figure} The black dots are the dispersion estimates for each gene as obtained by considering the information from each gene separately. Unless one has many samples, these values fluctuate strongly around their true values. Therefore, we fit the red trend line, which shows the dispersions' dependence on the mean, and then shrink each gene's estimate towards the red line to obtain the final estimates (blue circles) that are then used in the hypothesis test. \begin{question} How could you change the MA-plot so as to color those genes with adjusted p-value less than 0.5 instead of 0.1? \end{question} \begin{answer} <>= plotMA(dds, pvalCutoff = 0.5, ylim = c( -1.5, 1.5) ) @ See Figure~\ref{plotMApadjchange}. \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-plotMApadjchange.png} \caption{The MA-plot with red points indicating adjusted p value less than 0.5.} \label{plotMApadjchange} \end{figure} \end{answer} \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-pvalHist.png} \caption{Histogram of the p values returned by the test for differential expression.} \label{pvalHist} \end{figure} Another useful diagnostic plot is the histogram of the p values (Fig.~\ref{pvalHist}). <>= hist( res$pvalue, breaks=100 ) @ \begin{question} Revisit the discussion about p values and multiple testing in the previous section. Which part of the histogram is caused by genes that are called significant? And which part is caused by those that are truly significant? Why are there ``spikes'' at intermediate values? \end{question} \begin{answer} Genes that are not differentially expressed have p values that are approximately uniformly distributed between 0 and 1. This gives rise to the floor of bars of equal heights. The truly differentially expressed genes give rise to the tall bar(s) at the very left -- but only to that part of the bars that raises above the uniform floor. Of course, we cannot know which of the genes in these tall bars are true ones and which are not. When only looking at the bars to the left of our chosen p value cut-off, the ratio of ``floor'' area to total area provides an estimate of the false discovery rate. This is a graphical way of understanding FDR. The rule that p values from null cases are uniform is true only for continuous test statistics. However, for genes with low counts, the fact that we are working with integer counts becomes noticeable, and gives rise to the spikes at intermediate p values. \end{answer} \section{Independent filtering} The MA plot (Figure~\ref{MA1}) highlights an important property of RNA-Seq data. For weakly expressed genes, we have no chance of seeing differential expression, because the low read counts suffer from so high Poisson noise that any biological effect is drowned in the uncertainties from the read counting. The MA plot suggests that for genes with less than one or two counts per sample, averaged over all samples, there is no real inferential power. We loose little if we filter out these genes: <>= filterThreshold <- 2.0 keep <- rowMeans( counts( dds, normalized=TRUE ) ) > filterThreshold table( keep ) @ Note that none of the genes below the threshold had a significant adjusted p value <>= min( res$padj[!keep], na.rm=TRUE ) @ At first sight, there may seem to be little benefit in filtering out these genes. After all, the test found them to be non-significant anyway. However, these genes have an influence on the multiple testing adjustment, whose performance improves if such genes are removed. Compare: <>= table( p.adjust( res$pvalue, method="BH" ) < .1 ) table( p.adjust( res$pvalue[keep], method="BH" ) < .1 ) @ By removing the weakly-expressed genes from the input to the FDR procedure, we have found more genes to be significant among those which we kept, and so improved the power of our test. This approach is known as \emph{independent filtering}. The term \emph{independent} highlights an important caveat. Such filtering is permissible only if the filter criterion is independent of the actual test statistic~\cite{Bourgon:2010:PNAS}. Otherwise, the filtering would invalidate the test and consequently the assumptions of the BH procedure. This is why we filtered on the average over \emph{all} samples: this filter is blind to the assignment of samples to the treatment and control group and hence independent. \begin{question} Redo the histogram as in Figure~\ref{pvalHist}, now only using the genes that passed the filtering. What happened to the spikes at intermediate values? \end{question} \begin{answer} Run <>= hist( res$pvalue[keep], breaks=100 ) @ See Figure~\ref{pvalHistFilt}. As explained before, the spikes were caused by genes with low counts. Having removed these, our p value histogram now looks smoother. \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-pvalHistFilt.png} \caption{Histogram of the p values returned by the test for differential expression.} \label{pvalHistFilt} \end{figure} In this vignette, we have determined the value for \Robject{filterThreshold}, \Sexpr{signif(filterThreshold, 3)}, by looking at Figure~\ref{MA1}. More formal, automatable ways exist; if you are interested, please have a look at the vignette \emph{Diagnostics for independent filtering} in the \Rpackage{genefilter} package. \end{answer} \subsection{Adding gene names} Our result table only uses Ensembl gene IDs, but gene names may be more informative. Bioconductor's annotation packages help with mapping various ID schemes to each other. We load the annotation package \Rpackage{org.Hs.eg.db}: <>= library( "org.Hs.eg.db" ) @ This is the organism annotation package (``org'') for \emph{Homo sapiens} (``Hs''), organized as an \Rpackage{AnnotationDbi} package (``db''), using Entrez Gene IDs (``eg'') as primary key. To get a list of all available key types, use <>= cols(org.Hs.eg.db) @ Converting IDs with the native functions from the \Rpackage{AnnotationDbi} package is currently a bit cumbersome, so we provide the following convenience function (without explaining how exactly it works): <>= convertIDs <- function( ids, fromKey, toKey, db, ifMultiple=c( "putNA", "useFirst" ) ) { stopifnot( inherits( db, "AnnotationDb" ) ) ifMultiple <- match.arg( ifMultiple ) suppressWarnings( selRes <- AnnotationDbi::select( db, keys=ids, keytype=fromKey, cols=c(fromKey,toKey) ) ) if( ifMultiple == "putNA" ) { duplicatedIds <- selRes[ duplicated( selRes[,1] ), 1 ] selRes <- selRes[ ! selRes[,1] %in% duplicatedIds, ] } return( selRes[ match( ids, selRes[,1] ), 2 ] ) } @ This function takes a list of IDs as first argument and their key type as the second argument. The third argument is the key type we want to convert to, the fourth is the \Rclass{AnnotationDb} object to use. Finally, the last argument specifies what to do if one source ID maps to several target IDs: should the function return an NA or simply the first of the multiple IDs? To convert the Ensembl IDs in the rownames of \Robject{res} to gene symbols and add them as a new column, we use: <>= res$symbol <- convertIDs( row.names(res), "ENSEMBL", "SYMBOL", org.Hs.eg.db ) res @ % \subsection{Gene annotations} % % <>= % library("biomaRt") % ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl") % res$ensembl <- sapply(strsplit(rownames(res),split="\\+"),"[",1) % entrezmap <- getBM(attributes = c("ensembl_gene_id", "entrezgene"), % filters = "ensembl_gene_id", % values = res$ensembl, % mart = ensembl) % res$entrez <- entrezmap[match(res$ensembl, entrezmap[,1]),2] % hgncmap <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), % filters = "ensembl_gene_id", % values = res$ensembl, % mart = ensembl) % res$hgnc_symbol <- hgncmap[match(res$ensembl, hgncmap[,1]),2] % @ Finally, we note that you can easily save the results table in a CSV file, which you can then load with a spreadsheet program such as Excel: <>= write.csv( as.data.frame(res), file="results.csv" ) @ \section{Downstream analyses} A list of gene names is no final result. We demonstrate two possible further analysis steps. \subsection{Gene set enrichment analysis} Do the genes with a strong up- or down-regulation have something in common? We perform next a gene-set enrichment analysis (GSEA) to examine this question. We use the gene sets in the Reactome database <>= library( "reactome.db" ) @ This database works with Entrez IDs, so we add a column with such IDs, using our \Rfunction{convertIDs} function: <>= res$entrez <- convertIDs( row.names(res), "ENSEMBL", "ENTREZID", org.Hs.eg.db ) @ Next, we subset the results table, \Robject{res}, to only those genes for which the Reactome database has data (i.e, whose Entrez ID we find in the respective key column of \Robject{reactome.db}) and for which the test gave a p value that was not NA. <>= res2 <- res[ res$entrez %in% keys( reactome.db, "ENTREZID" ) & !is.na( res$pvalue) , ] head(res2) @ Using \Rfunction{select}, a function from \Rpackage{AnnotationDbi} for querying database objects, we get a table with the mapping from Entrez IDs to Reactome Path IDs <>= reactomeTable <- AnnotationDbi::select( reactome.db, keys=res2$entrez, keytype="ENTREZID", cols=c("ENTREZID","REACTOMEID") ) head(reactomeTable) @ % The next code chunk transforms this table into an \emph{incidence matrix}. This is a boolean matrix with one row for each Reactome Path and one column for each gene in \Robject{res2}, which tells us which genes are members of which Reactome Paths. (If you want to understand how this chunk exactly works, read up about the \Rfunction{tapply} function.) <>= incm <- do.call( rbind, with(reactomeTable, tapply( ENTREZID, factor(REACTOMEID), function(x) res2$entrez %in% x ) )) colnames(incm) <- res2$entrez str(incm) @ % % WH suggests that the following code is perhaps easier to understand (and % faster) for the construction of 'incm' % ft2mat = function(x, y) { % ux = unique(x) % uy = unique(y) % im = matrix(FALSE, nrow=length(ux), ncol=length(uy), dimnames=list(ux, uy)) % im[ cbind(x, y) ] = TRUE % return(im) % } % incm = with(reactomeTable, ft2mat(REACTOMEID, ENTREZID)) We remove all rows corresponding to Reactome Paths with less than 5 assigned genes. <>= incm <- incm[ rowSums(incm) >= 5, ] @ To test whether the genes in a Reactome Path behave in a special way in our experiment, we perform $t$-tests to see whether the average of the genes' $\log_2$ fold change values are different from zero. If so, we can say that our treatment tends to upregulate (or downregulate) the genes in the category. To facilitate the computations, we define a little helper function: <>= testCategory <- function( reactomeID ) { isMember <- incm[ reactomeID, ] data.frame( reactomeID = reactomeID, numGenes = sum( isMember ), avgLFC = mean( res2$log2FoldChange[isMember] ), strength = sum( res2$log2FoldChange[isMember] ) / sqrt(sum(isMember)), pvalue = t.test( res2$log2FoldChange[ isMember ] )$p.value, reactomeName = reactomePATHID2NAME[[reactomeID]] ) } @ The function can be called with a Reactome Path ID: <>= testCategory("109581") @ % As you can see the function not only performs the $t$ test and returns the p value but also lists other useful information such as the number of genes in the category, the average log fold change, a ``strength'' measure (see below) and the name with which Reactome describes the Path. We call the function for all Paths in our incidence matrix and collect the results in a data frame: % <>= reactomeResult <- do.call( rbind, lapply( rownames(incm), testCategory ) ) @ % As we performed many tests, we should again use a multiple testing adjustment. % <>= reactomeResult$padjust <- p.adjust( reactomeResult$pvalue, "BH" ) @ This is a list of Reactome Paths which are significantly differentially expressed in our comparison of DPN treatment with control, sorted according to sign and strength of the signal: {\small <>= reactomeResultSignif <- reactomeResult[ reactomeResult$padjust < 0.05, ] reactomeResultSignif[ order(reactomeResultSignif$strength), ] @ } Note that such lists need to be interpreted with care, and a grain of salt. Which of these categories make sense, given the biology of the experiment? \subsection{Nearest peak to a differentially expressed gene} The RNA-Seq experiment analyzed above provides a list of genes which have responded to a selective estrogen-receptor-beta agonist. We can investigate whether we find estrogen receptor binding sites in the vicinity of the gene with the highest fold induction. In order to match differentially expressed genes to other experiment data, we will use annotated binding sites of estrogen receptor alpha from the ENCODE project. It is not necessarily the case that these annotated binding sites are actually functional in the cell lines of the RNA-Seq experiment or biologically relevant as the alpha and beta subtypes are distinct proteins transcribed from different genes; here we only use these binding site data for demonstration purposes. Let us consider a particular gene with a low p value. The \Rfunction{rowData} function provides us with all the information about the gene model; each of the exons is represented as a \Rclass{GRanges}, and these are tied together as a \Rclass{GRangesList}. We use the function \Rfunction{range} to extract the entire range of the gene, from the start of the left-most exon to the end of the right-most exon. This is all the information we need in order to find the nearest binding site. <>= deGeneID <- "ENSG00000099194" res[deGeneID,] deGene <- range(rowData(dds[deGeneID,])[[1]]) names(deGene) <- deGeneID deGene @ We would like to compare the location of this gene with the location of annotated estrogen receptor binding sites, provided by the UCSC Genome Browser. We must first alter the sequence name (the chromosome name) of the differentially expressed gene, as the Ensembl gene annotation does not use the ``chr'' prefix, which the UCSC chromosomes are annotated with. (Note that we ignore here another complication, which is that the Ensembl sequence ``MT'' corresponds to the UCSC's sequence ``chrM''.) We use the \Rfunction{paste0} function, which concatenates the character vectors provided without using any separating characters. We then create a range which is 10 Mb to the left and right of the start of the \Robject{deGene} object. <>= as.character(seqnames(deGene)) ucscChrom <- paste0("chr",as.character(seqnames(deGene))) ucscRanges <- ranges(flank(deGene,width=10e6,both=TRUE)) subsetRange <- GRanges(ucscChrom, ucscRanges) subsetRange @ We now provide code which would download a track from the UCSC Genome Browser, in our case a track containing transcription factor binding sites obtained from ChIP-Seq experiments across various cell lines, generated by the ENCODE project. The track names and table names must match a track name provided by the UCSC Genome Browser. For more information on these steps, see the detailed instructions in the vignette of the useful Bioconductor package \Rpackage{rtracklayer}. <>= ## ## Please do not run this code if you do not have an internet connection, ## alternatively use the local file import in the next code chunk. ## library( "rtracklayer" ) trackName <- "wgEncodeRegTfbsClusteredV2" tableName <- "wgEncodeRegTfbsClusteredV2" trFactor <- "ERalpha_a" mySession <- browserSession() ucscTable <- getTable(ucscTableQuery(mySession, track=trackName, range=subsetRange, table=tableName, name=trFactor)) @ Here we use a locally cached copy of \Robject{ucscTable}: <>= ucscTableFile <- "localUcscTable.csv" ucscTable <- read.csv(ucscTableFile, stringsAsFactors=FALSE) @ We now can use the downloaded table of annotated estrogen receptor peaks. Whether to use a cutoff on the provided peak scores at this step, or what scores cutoff to use, depends on your experience with the specific transcription factor and the ChIP-Seq experiments used to define these peaks. It often makes sense to visualize tracks in a genome browser in order to get a sense of the qualitative difference between peaks of different scores. We create a \Rclass{GRanges} object, \Robject{peaks}, from the table obtained from UCSC, and then we convert the chromosome names back to the Ensembl style using the global substitute function, \Rfunction{gsub}. Finally, we enforce that the sequence levels of the peaks match the sequence levels of the differential expressed gene, which is necessary for performing the nearest matching in the following code chunk. <>= peaks <- with(ucscTable, GRanges(chrom, IRanges(chromStart, chromEnd), score=score)) seqlevels(peaks) <- gsub("chr(.+)","\\1",seqlevels(peaks)) seqlevels(peaks) <- seqlevels(deGene) @ Now we have two \Rclass{GRanges} objects, defined over the same chromosomes, so we can use the \Rfunction{distanceToNearest} function from the package \Rpackage{GRanges}. This provides a \Rclass{Hits} object, which contains the matches between the ``query'' and the ``subject'', the first and second arguments to the function, as well as the distance from the query to the subject. As we only have a single query, there should only be one nearest range in the subject. See the documentation via \Rfunction{?distanceToNearest} and \Rfunction{?Hits} for more information on the options for the this matching step. <>= # suppress warning which exists for BioC 2.12 of changes to distance() suppressWarnings({d2nearest <- distanceToNearest(deGene, peaks)}) @ <>= d2nearest <- distanceToNearest(deGene, peaks) @ \begin{question} What is the distance from the differentially expressed gene to all the peaks? \end{question} \begin{answer} <>= distance(deGene, peaks) @ <>= suppressWarnings({distance(deGene, peaks)}) @ \end{answer} We can now examine the object \Robject{d2nearest}. This tells us that the nearest peak is \Sexpr{as.data.frame(d2nearest)$distance[1]} base pairs from the differential expressed gene. <>= d2nearest @ The function \Rfunction{subjectHits} is used to extract the index of the closest hit in the \Robject{peaks} object. <>= deGene peaks[subjectHits(d2nearest)] @ Is \Sexpr{as.data.frame(d2nearest)$distance[1]} base pairs unexpectedly close? Here we make a simple plot of the starting points of the peaks and gene along the chromosome, to get a sense of the distribution of peaks and how surprised we should be with the distance of the nearest. To identify the nearest peak, we construct a logical vector \Robject{peakNearest}, which can be used to change the y value and the color of the point corresponding to the nearest peak. <>= plotRange <- start(deGene) + 1e6 * c(-1,1) peakNearest <- ( seq_along(peaks) == subjectHits(d2nearest) ) plot(x=start(peaks), y=ifelse(peakNearest,.3,.2), ylim=c(0,1), xlim=plotRange, pch='p', col=ifelse(peakNearest,"red","grey60"), yaxt="n", ylab="", xlab=paste("2 Mb on chromosome",as.character(seqnames(deGene)))) points(x=start(deGene),y=.8,pch='g') @ <>= plotGRange <- GRanges(seqnames(peaks[1]),IRanges(plotRange[1],plotRange[2])) numPeaksInPlotRange <- sum(peaks %over% plotGRange) @ \begin{figure}[!ht] \centering \includegraphics[width=.6\textwidth]{DESeq2_parathyroid-plotPeaksAndGene.png} \caption{A 2 Mb genomic range showing the location of the differentially expressed gene (labelled 'g'), and the peaks (labelled 'p'). As there are only \Sexpr{numPeaksInPlotRange} peaks spread over 2 Mb, it is surprising to find a peak \Sexpr{as.data.frame(d2nearest)$distance[1]} base pairs away from the differentially expressed gene.} \label{plotPeaksAndGene} \end{figure} Again, the biological relevance of the distances between peaks and genes is another matter, especially considering the data are from different sources. An important consideration when investigating the distribution of distances between two sets of genomic features, is how the individual sets cluster along the genome. \begin{question} Are the peaks relatively uniformly distributed? \end{question} \begin{answer} We can answer this question by investigating the inter-peak distances. As all of our peaks are on the same chromosome, we just sort the peak starts and subtract the 2nd from the 1st, the 3rd from the 2nd, etc. Then we call the \Rfunction{summary} function which provides the mean and median. Note that the mean is constricted: it must be equal to the total span divided by the number of inter-peak distances. The median distance is about one quarter of the mean, so the peaks tend to cluster. You can also verify this by plotting the histogram of \Robject{peakDists}. <>= peakDists <- diff(sort(start(peaks))) summary(peakDists) mean(peakDists) @ \end{answer} \section{Working with rlog-transformed data} \subsection{The rlog transform} Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.\,g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observable (i.e., here, the expression strength of a gene) does not depend on the mean. In RNA-Seq data, however, variance grows with the mean. For example, if one performs PCA directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values; however, now the genes with low counts tend to dominate the results because, due to the strong Poisson noise inherent to small count values, they show the strongest relative differences between samples. As a solution, \Rpackage{DESeq2} offers the \emph{regularized-logarithm transformation}, or \emph{rlog} for short. For genes with high counts, the rlog transformation differs not much from an ordinary $\log_2$ transformation. For genes with lower counts, however, the values are shrunken towards the genes' averages across all samples. Using an empirical Bayesian prior in the form of a \emph{ridge penality}, this is done such that the rlog-transformed data are approximately homoskedastic. The function \Rfunction{rlogTransform} returns a \Rclass{SummarizedExperiment} object which contains the rlog-transformed values in its \Rclass{assay} slot: <>= rld <- rlogTransformation(dds) head( assay(rld) ) @ To show the effect of the transformation, we plot the first sample against the second, first simply using the \Rfunction{log2} function (after adding 1, to avoid taking the log of zero), and then using the rlog-transformed values. \begin{figure}[ht] \centering \includegraphics[width=.9\textwidth]{DESeq2_parathyroid-rldPlot.png} \caption{Scatter plot of sample 2 versus sample 1. Left: using an ordinary $log_2$ transformation. Right: Using the rlog transformation.} \label{rldPlot} \end{figure} <>= par( mfrow = c( 1, 2 ) ) plot( log2( 1+counts(dds, normalized=TRUE)[, 1:2] ), col="#00000020", pch=20, cex=0.3 ) plot( assay(rld)[, 1:2], col="#00000020", pch=20, cex=0.3 ) @ Note that, in order to make it easier to see where several points are plotted on top of each other, we set the plotting color to a semi-transparent black (encoded as \Rfunction{\#00000020}) and changed the points to solid disks (\Rfunction{pch=20}) with reduced size (\Rfunction{cex=0.3})\footnote{The function \Rfunction{heatscatter} from the package \Rpackage{LSD} offers a colourful alternative.}. In Figure~\ref{rldPlot}, we can see how genes with low counts seem to be excessively variable on the ordinary logarithmic scale, while the rlog transform compresses differences for genes for which the data cannot provide good information anyway. \subsection{Sample distances} A useful first step in an RNA-Seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit to the expectation from the experiment's design? \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-sampleDistHeatmap.png} \caption{Heatmap of Euclidean sample distances after rlog transformation.} \label{sampleDistHeatmap} \end{figure} We use the R function \Rfunction{dist} to calculate the Euclidean distance between samples. To avoid that the distance measure is dominated by a few highly variable genes, and have a roughly equal contribution from all genes, we use it on the rlog-transformed data: <>= sampleDists <- dist( t( assay(rld) ) ) sampleDists @ Note the use of the function \Rfunction{t} to transpose the data matrix. We need this because \Rfunction{dist} calculates distances between data \emph{rows} and our samples constitute the columns. We visualize the distances in a heatmap, using the function \Rfunction{heatmap.2} from the \Rpackage{gplots} package. <>= sampleDistMatrix <- as.matrix( sampleDists ) rownames(sampleDistMatrix) <- paste( colData(rld)$treatment, colData(rld)$patient, sep="-" ) library( "gplots" ) heatmap.2( sampleDistMatrix, trace="none" ) @ Note that we have changed the row names of the distance matrix to contain treatment type and patient number instead of sample ID, so that we have all this information in view when looking at the heatmap (Fig.~\ref{sampleDistHeatmap}). \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-samplePCA.png} \caption{Principal components analysis (PCA) of samples after rlog transformation.} \label{samplePCA} \end{figure} \begin{question} Some people find the colour scheme used in Figure~\ref{sampleDistHeatmap} ugly. Make a better version. \emph{Hint:} Look at the sequential colour schemes in the \Rpackage{RColorBrewer} package and at the \Rfunction{colorRampPalette} function. \end{question} \begin{answer} <>= library("RColorBrewer") colours = colorRampPalette( rev(brewer.pal(9, "Blues")) )(255) heatmap.2( sampleDistMatrix, trace="none", col=colours) @ See Figure~\ref{betterheatmap}. \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-betterheatmap.png} \caption{The same heatmap as in Figure~\ref{sampleDistHeatmap} but with better colours.} \label{betterheatmap} \end{figure} \end{answer} Another way to visualize sample-to-sample distances is a principal-components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out optimally (Fig.\ \ref{samplePCA}). <>= print( plotPCA( rld, intgroup = c( "patient", "treatment") ) ) @ Here, we have used the function \Rfunction{plotPCA} which comes with \Rpackage{DESeq2}. The two terms specified as \Rfunction{intgroup} are column names from our sample data; they tell the function to use them to choose colours. From both visualizations, we see that the differences between patients is much larger than the difference between treatment and control samples of the same patient. This shows why it was important to account for this paired design (``paired'', because each treated sample is paired with one control sample from the \emph{same} patient). We did so by using the design formula \verb+!~ patient + treatment! when setting up the data object in the beginning. Had we used an un-paired analysis, by specifying only \verb!~ treatment!, we would not have found many hits, because then, the patient-to-patient differences would have drowned out any treatment effects. Here, we have performed this sample distance analysis towards the end of our analysis. In practice, however, this is a step suitable to give a first overview on the data. Hence, one will typically carry out this analysis as one of the first steps in an analysis. To this end, you may also find the function \Rfunction{arrayQualityMetrics}, from the equinymous package, useful. % \fixme{@Mike: not for this lab, but perhaps we should add a % method for \Rclass{SummarizedExperiment} to \Rpackage{arrayQualityMetrics}, would you % like to send me a patch?} \subsection{Gene clustering} \begin{figure}[ht] \centering \includegraphics[width=.5\textwidth]{DESeq2_parathyroid-geneHeatmap.png} \caption{Heatmap with gene clustering.} \label{geneHeatmap} \end{figure} In the heatmap of Fig.~\ref{sampleDistHeatmap}, the dendrogram at the side shows us a hierarchical clustering of the samples. Such a clustering can also be performed for the genes. Since the clustering is only relevant for genes that actually carry signal, one usually carries it out only for a subset of most highly variable genes. Here, for demonstration, let us select the 35 genes with the highest variance across samples: <>= library( "genefilter" ) topVarGenes <- head( order( rowVars( assay(rld) ), decreasing=TRUE ), 35 ) @ The heatmap becomes more interesting if we do not look at absolute expression strength but rather at the amount by which each gene deviates in a specific sample from the gene's average across all samples. Hence, we center and scale each genes' values across samples, and plot a heatmap. <>= heatmap.2( assay(rld)[ topVarGenes, ], scale="row", trace="none", dendrogram="column", col = colorRampPalette( rev(brewer.pal(9, "RdBu")) )(255)) @ We can now see (Fig. \ref{geneHeatmap}) blocks of genes which covary across patients. Often, such a heatmap is insightful, even though here, seeing these variations across patients is of limited value because we are rather interested in the effects between the two samples from each patient. \section{Advanced Questions} For these questions, we provide (and probably have) no solutions, advanced readers are encouraged to explore them. \begin{enumerate} \item \Rpackage{DESeq2} performs the shrinkage of the dispersion estimates by fitting a parametric curve on the mean of normalized counts (cf.~Figure~\ref{dispPlot}). However, one could argue that the biological variability of genes should not be a function of counts, but of counts per gene length (i.\,e., expression level), and that regression on that covariate should lead to a better fit. Write your own version of the \Rfunction{estimateDispersions} function to explore this question. \item What is the contribution of UTR length variations to the between-replicates variability modelled by \Rpackage{DESeq2}? The read counting script (available in the vignette of \Rpackage{parathyroidSE}) uses all exons of the genes, which includes UTRs. Would detection power be increased --or would we preferentially detect different phenomena-- if we left out UTRs from the counting (i.\,e. count reads that fall on coding exons only); or indeed, if we looked only at UTRs? \end{enumerate} \bibliographystyle{unsrt} \bibliography{library} \newpage \section{Solutions} % \closeoutputstream{qna} % \input\jobname.qna \section{Session Info} As last part of this document, we call the function \Rfunction{sessionInfo}, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record as it will help to trace down what has happened in case that an R script ceases to work because a package has been changed in a newer version. <>= sessionInfo() @ <>= relocateAnswers = function(con="DESeq2_parathyroid.tex"){ txt = readLines(con=con) i1 = grep("\\begin{answer}", txt, fixed=TRUE) i2 = grep("\\end{answer}", txt, fixed=TRUE) if(!((length(i1)==length(i2))&&all(i2>i1))) stop("\\begin{answer}/\\end{answer} macros are unbalanced.") i = unlist(mapply(`:`, i1, i2)) res = txt[-i] sol = txt[i] dest = grep("\\section{Solutions}", res, fixed=TRUE) if(length(dest)!=1) stop("\\section{Solutions} not found exactly once.") res = c(res[1:dest], sol, res[(dest+1):length(res)]) writeLines(res, con=con) } @ \end{document}