Keep it simple
Replicate
Avoid confounding experimental factors with other factors
Record co-variates
Be aware of batch effects
Known
Unknown
Surrogate variable analysis
HapMap samples from one facility, ordered by date of processing.
Confounding factors
Artifacts of your particular protocols
Axes of variation
Application-specific, e.g.,
Alignment strategies
Splice-aware aligners (and Bioconductor wrappers)
GenomicAlignments::summarizeOverlaps()
Rsubread::featureCount()
Unique statistical aspects
Summarization
Normalization
Appropriate error model
Pre-filtering
Borrowing information
Placing differentially expressed regions in context
Correlation between genomic copy number and mRNA expression identified 38 mis-labeled samples in the TCGA ovarian cancer Affymetrix microarray dataset.
DESeq2 estimateSizeFactors()
, Anders and Huber, 2010
edgeR calcNormFactors()
TMM method of Robinson and Oshlack, 2010
DESeq2 estimateDispersions()
edgeR estimateDisp()
t.test()
x
: vector of univariate measurementsy
: factor
describing experimental designvar.equal=TRUE
: appropriate for relatively small experiments where no additional information available?formula
: alternative representation, y ~ x
.head(sleep)
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
plot(extra ~ group, data = sleep)
## Traditional interface
with(sleep, t.test(extra[group == 1], extra[group == 2]))
##
## Welch Two Sample t-test
##
## data: extra[group == 1] and extra[group == 2]
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean of x mean of y
## 0.75 2.33
## Formula interface
t.test(extra ~ group, sleep)
##
## Welch Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.3654832 0.2054832
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
## equal variance between groups
t.test(extra ~ group, sleep, var.equal=TRUE)
##
## Two Sample t-test
##
## data: extra by group
## t = -1.8608, df = 18, p-value = 0.07919
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.363874 0.203874
## sample estimates:
## mean in group 1 mean in group 2
## 0.75 2.33
lm()
and anova()
lm()
: fit linear model.anova()
: statisitcal evaluation.## linear model; compare to t.test(var.equal=TRUE)
fit <- lm(extra ~ group, sleep)
anova(fit)
## Analysis of Variance Table
##
## Response: extra
## Df Sum Sq Mean Sq F value Pr(>F)
## group 1 12.482 12.4820 3.4626 0.07919 .
## Residuals 18 64.886 3.6048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
formula
: translated into model matrix, used in lm.fit()
.## underlying model, used in `lm.fit()`
model.matrix(extra ~ group, sleep) # last column indicates group effect
## (Intercept) group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 1 1
## 12 1 1
## 13 1 1
## 14 1 1
## 15 1 1
## 16 1 1
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
model.matrix(extra ~ 0 + group, sleep) # contrast between columns
## group1 group2
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 0 1
## 12 0 1
## 13 0 1
## 14 0 1
## 15 0 1
## 16 0 1
## 17 0 1
## 18 0 1
## 19 0 1
## 20 0 1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
fit0 <- lm(extra ~ ID, sleep)
fit1 <- lm(extra ~ ID + group, sleep)
anova(fit0, fit1)
## Analysis of Variance Table
##
## Model 1: extra ~ ID
## Model 2: extra ~ ID + group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 10 19.290
## 2 9 6.808 1 12.482 16.501 0.002833 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t.test(extra ~ group, sleep, var.equal=TRUE, paired=TRUE)
##
## Paired t-test
##
## data: extra by group
## t = -4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.4598858 -0.7001142
## sample estimates:
## mean of the differences
## -1.58
genefilter::rowttests()
x
: matrix of expression valuesfeatures x samples (reverse of how a ‘statistician’ would represent the data – samples x features)
fac
: factor of one or two levels describing experimental design
Limitations
Consequences
count ~ factor
. Alternative: count ~ 0 + factor
and contrastscount ~ covariate + factor
count ~ factor
or count ~ 0 + factor
count ~ factor1 + factor2
; main effects and interactions, count ~ factor1 * factor2
. Contrasts to ask specific questionsduplicateCorrelation()
Adapted from Love, Anders, and Huber’s Bioconductor work flow
Michael Love [1], Simon Anders [2], Wolfgang Huber [2]
[1] Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, US;
[2] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.
The data used in this workflow is an RNA-Seq experiment of airway smooth muscle cells treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. Glucocorticoids are used, for example, in asthma patients to prevent or reduce inflammation of the airways. In the experiment, four primary human airway smooth muscle cell lines were treated with 1 micromolar dexamethasone for 18 hours. For each of the four cell lines, we have a treated and an untreated sample. The reference for the experiment is:
Himes BE, Jiang X, Wagner P, Hu R, Wang Q, Klanderman B, Whitaker RM, Duan Q, Lasky-Su J, Nikolos C, Jester W, Johnson M, Panettieri R Jr, Tantisira KG, Weiss ST, Lu Q. “RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene that Modulates Cytokine Function in Airway Smooth Muscle Cells.” PLoS One. 2014 Jun 13;9(6):e99625. PMID: 24926665. GEO: GSE52778.
We use the airway package to illustrate reduction. The package provides sample information, a subset of eight BAM files, and the known gene models required to count the reads.
library(airway)
path <- system.file(package="airway", "extdata")
dir(path)
## [1] "GSE52778_series_matrix.txt"
## [2] "Homo_sapiens.GRCh37.75_subset.gtf"
## [3] "SRR1039508_subset.bam"
## [4] "SRR1039509_subset.bam"
## [5] "SRR1039512_subset.bam"
## [6] "SRR1039513_subset.bam"
## [7] "SRR1039516_subset.bam"
## [8] "SRR1039517_subset.bam"
## [9] "SRR1039520_subset.bam"
## [10] "SRR1039521_subset.bam"
## [11] "SraRunInfo_SRP033351.csv"
## [12] "sample_table.csv"
The ingredients for counting include are:
read.csv()
.csvfile <- dir(path, "sample_table.csv", full=TRUE)
sampleTable <- read.csv(csvfile, row.names=1)
head(sampleTable)
## SampleName cell dex albut Run avgLength Experiment
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354
## Sample BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
yieldSize
argument mean?library(Rsamtools)
filenames <- dir(path, ".bam$", full=TRUE)
bamfiles <- BamFileList(filenames, yieldSize=1000000)
names(bamfiles) <- sub("_subset.bam", "", basename(filenames))
TxDb
package, or created from biomart or UCSC, or from a GTF file. We’ll take the hard road, making a TxDb object from the GTF file used to align reads and using the TxDb to get all exons, grouped by gene.library(GenomicFeatures)
gtffile <- file.path(path, "Homo_sapiens.GRCh37.75_subset.gtf")
txdb <- makeTxDbFromGFF(gtffile, format="gtf", circ_seqs=character())
## Prepare the 'metadata' data frame ... metadata: OK
genes <- exonsBy(txdb, by="gene")
After these preparations, the actual counting is easy. The function summarizeOverlaps()
from the GenomicAlignments package will do this. This produces a SummarizedExperiment
object, which contains a variety of information about an experiment
library(GenomicAlignments)
se <- summarizeOverlaps(features=genes, reads=bamfiles,
mode="Union",
singleEnd=FALSE,
ignore.strand=TRUE,
fragments=TRUE)
colData(se) <- as(sampleTable, "DataFrame")
se
## class: SummarizedExperiment
## dim: 20 8
## exptData(0):
## assays(1): counts
## rownames(20): ENSG00000009724 ENSG00000116649 ... ENSG00000271794
## ENSG00000271895
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(se)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
rowData(se)
## Warning: 'rowData' is deprecated.
## Use 'rowRanges' instead.
## See help("Deprecated")
## GRangesList object of length 20:
## $ENSG00000009724
## GRanges object with 18 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] 1 [11086580, 11087705] - | 98 ENSE00000818830
## [2] 1 [11090233, 11090307] - | 99 ENSE00000472123
## [3] 1 [11090805, 11090939] - | 100 ENSE00000743084
## [4] 1 [11094885, 11094963] - | 101 ENSE00000743085
## [5] 1 [11097750, 11097868] - | 103 ENSE00003520086
## ... ... ... ... ... ... ...
## [14] 1 [11106948, 11107176] - | 111 ENSE00003467404
## [15] 1 [11106948, 11107176] - | 112 ENSE00003489217
## [16] 1 [11107260, 11107280] - | 113 ENSE00001833377
## [17] 1 [11107260, 11107284] - | 114 ENSE00001472289
## [18] 1 [11107260, 11107290] - | 115 ENSE00001881401
##
## ...
## <19 more elements>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(assay(se))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
## ENSG00000009724 38 28 66 24 42
## ENSG00000116649 1004 1255 1122 1313 1100
## ENSG00000120942 218 256 233 252 269
## ENSG00000120948 2751 2080 3353 1614 3519
## ENSG00000171819 4 50 19 543 1
## ENSG00000171824 869 1075 1115 1051 944
## SRR1039517 SRR1039520 SRR1039521
## ENSG00000009724 41 47 36
## ENSG00000116649 1879 745 1536
## ENSG00000120942 465 207 400
## ENSG00000120948 3716 2220 1990
## ENSG00000171819 10 14 1067
## ENSG00000171824 1405 748 1590
The previous section illustrates the reduction step on a subset of the data; here’s the full data set
data(airway)
se <- airway
This object contains an informative colData
slot – prepared as described in the airway vignette. In particular, the colData()
include columns describing the cell line cell
and treatment dex
for each sample
colData(se)
## DataFrame with 8 rows and 9 columns
## SampleName cell dex albut Run avgLength
## <factor> <factor> <factor> <factor> <factor> <integer>
## SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126
## SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126
## SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120
## SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101
## SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98
## Experiment Sample BioSample
## <factor> <factor> <factor>
## SRR1039508 SRX384345 SRS508568 SAMN02422669
## SRR1039509 SRX384346 SRS508567 SAMN02422675
## SRR1039512 SRX384349 SRS508571 SAMN02422678
## SRR1039513 SRX384350 SRS508572 SAMN02422670
## SRR1039516 SRX384353 SRS508575 SAMN02422682
## SRR1039517 SRX384354 SRS508576 SAMN02422673
## SRR1039520 SRX384357 SRS508579 SAMN02422683
## SRR1039521 SRX384358 SRS508580 SAMN02422677
DESeq2 makes the analysis particularly easy, simply add the experimental design, run the pipeline, and extract the results
library(DESeq2)
dds <- DESeqDataSet(se, design = ~ cell + dex)
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
Simple visualizations / sanity checks include
Look at counts of strongly differentiated genes, to get a sense of how counts translate to the summary statistics reported in the result table
topGene <- rownames(res)[which.min(res$padj)]
res[topGene,]
## log2 fold change (MAP): dex untrt vs trt
## Wald test p-value: dex untrt vs trt
## DataFrame with 1 row and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000152583 997.4398 -4.3161 0.1724125 -25.03357 2.636198e-138
## padj
## <numeric>
## ENSG00000152583 4.624155e-134
plotCounts(dds, gene=topGene, intgroup=c("dex"))
An ‘MA’ plot shows for each gene the between-group log-fold-change versus average log count; it should be funnel-shaped and approximately symmetric around y=0
, with lots of between-treatment variation for genes with low counts.
plotMA(res, ylim=c(-5,5))
Plot the distribution of (unadjusted) P values, which should be uniform (under the null) but with a peak at small P value (true positives, hopefully!)
hist(res$pvalue, breaks=50)
Look at a ‘volcano plot’ of adjusted P-value versus log fold change, to get a sense of the fraction of up- versus down-regulated genes
plot(-log10(padj) ~ log2FoldChange, as.data.frame(res), pch=20)
Many additional diagnostic approaches are described in the DESeq2 (and edgeR) vignettes, and in the RNA-seq gene differential expression work flow.
see Part E, Gene Set Enrichment