Assume we want to find genes that differ between two sub-types of a cancer entity, say without and with a certain somatice mutation. We envision to get tumour samples from N patients of one sub-type and from N patients of the other subtype. While many genes may be weakly affected by the presence of the somatic variant, we hope to find most of those genes, which change at least by a factor of 2.
How many samples do we need?
Let's use DESeq2 to perform a mock analysis.
library( Biobase )
library( DESeq2 )
How strong is the inter-subject variability in typical tumour samples?
Find a suitable data set.
We use here the Pickrell et al. (Nature, 2010) dataset with RNA-Seq data from lymphoblastoid cell lines (LCLs) derived from 69 Yoruban HapMap subjects. A true cancer data set might be more suitable, though.
We download the count table for this data set from ReCount (Frazee et al., BMC Bioinformatics, 2011).
#montpick <- load( url( "http://bowtie-bio.sourceforge.net/recount/ExpressionSets/montpick_eset.RData" ) )
load( "montpick_eset.RData" )
countTable <- exprs(montpick.eset)[ , pData( montpick.eset )$study == "Pickrell" ]
Let's pick 30 samples at random, assign half to sub-type “A” and half to sub-type “B”, then make a DESeq2 object.
n <- 30
randomPick <- sample( colnames(countTable), n )
dds <- DESeqDataSetFromMatrix(
countTable[ , randomPick ],
data.frame( subtype = rep( c( "A", "B" ), each=n/2 ) ),
~ subtype )
Pick 100 genes at random, make half of them 2-fold higher in subtype “A” and half two-fold higher in subtype “B”
nGenes <- 100
affectedGenes <- sample( rownames(dds), nGenes )
upGenes <- affectedGenes[ 1:(nGenes/2) ]
dnGenes <- affectedGenes[ (nGenes/2+1):nGenes ]
To make things symmetric, we either multiply or divide by sqrt(2):
fc <- 2
counts(dds)[ upGenes, colData(dds)$subtype == "A" ] <- as.integer(round(
counts(dds)[ upGenes, colData(dds)$subtype == "A" ] / sqrt(fc) ))
counts(dds)[ upGenes, colData(dds)$subtype == "B" ] <- as.integer(round(
counts(dds)[ upGenes, colData(dds)$subtype == "B" ] * sqrt(fc) ))
counts(dds)[ dnGenes, colData(dds)$subtype == "A" ] <- as.integer(round(
counts(dds)[ dnGenes, colData(dds)$subtype == "A" ] * sqrt(fc) ))
counts(dds)[ dnGenes, colData(dds)$subtype == "B" ] <- as.integer(round(
counts(dds)[ dnGenes, colData(dds)$subtype == "B" ] / sqrt(fc) ))
Now, do an analysis:
dds <- DESeq( dds )
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
## -- replacing outliers and refitting for 237 genes
## -- DESeq argument 'minReplicatesForReplace' = 7
## -- original counts are preserved in counts(dds))
## estimating dispersions
## fitting model and testing
res <- results( dds )
signGenes <- rownames(res)[ which( res$padj < .1 ) ]
table(
found = rownames(dds) %in% signGenes,
truth = rownames(dds) %in% affectedGenes )
## truth
## found FALSE TRUE
## FALSE 52462 89
## TRUE 18 11
Highlight the affected genes in an MA plot
plotMA( res, ylim=c( -3, 3 ) )
points( res[upGenes,"baseMean"], res[upGenes,"log2FoldChange"], col="red", cex=1.5 )
points( res[dnGenes,"baseMean"], res[dnGenes,"log2FoldChange"], col="blue", cex=1.5 )
Session info
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DESeq2_1.4.5 RcppArmadillo_0.4.300.8.0
## [3] Rcpp_0.11.2 GenomicRanges_1.16.3
## [5] GenomeInfoDb_1.0.2 IRanges_1.22.9
## [7] Biobase_2.24.0 BiocGenerics_0.10.0
## [9] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] annotate_1.42.0 AnnotationDbi_1.26.0 DBI_0.2-7
## [4] evaluate_0.5.5 formatR_0.10 genefilter_1.46.1
## [7] geneplotter_1.42.0 grid_3.1.0 lattice_0.20-29
## [10] locfit_1.5-9.1 markdown_0.7 mime_0.1.1
## [13] RColorBrewer_1.0-5 RSQLite_0.11.4 splines_3.1.0
## [16] stats4_3.1.0 stringr_0.6.2 survival_2.37-7
## [19] tcltk_3.1.0 tools_3.1.0 XML_3.98-1.1
## [22] xtable_1.7-3 XVector_0.4.0