Is expression of genes in a gene set associated with experimental condition?
Many methods, a recent review is Kharti et al., 2012.
Any a priori classification of `genes’ into biologically relevant groups
Sets do not need to be…
Gene Ontology (GO) Annotation (GOA)
Pathways
E.g., MSigDb
Initially based on a presentation by Simon Anders, CSAMA 2010
Steps
In gene set? | |||
Yes | No | ||
---|---|---|---|
Differentially | Yes | k | K |
expressed? | No | n - k | N - K |
fiser.test()
or GOstatsNotes
Steps
E.g., Jiang & Gentleman, 2007;
Expression in NEG vs BCR/ABL samples for genes in the ‘ribosome’ KEGG pathway; Category vignette.
Goemann & Bühlmann, 2007
E.g., Hummel et al., 2008,
romer()
and Wu & Smythe 2012 camera()
for enrichment (competitive null) linear modelsroast()
, mroast()
for self-contained null linear modelsE.g., Tarca et al., 2009,
Incorporate pathway topology (e.g., interactions between gene products) into signficance testing
Signaling Pathway Impact Analysis
Combined evidence: pathway over-representation \(P_{NDE}\); unusual signaling \(P_{PERT}\) (equation 1 of Tarca et al.)
Evidence plot, colorectal cancer. Points: pathway gene sets. Significant after Bonferroni (red) or FDR (blue) correction.
E.g., Young et al., 2010, goseq
DE genes vs. transcript length. Points: bins of 300 genes. Line: fitted probability weighting function.
Example: Langfelder & Hovarth, WGCNA
list()
, where names of the list are sets, and each element of the list is a vector of genes in the set.data.frame()
of set name / gene name pairsGene set enrichment classifications
Selected Packages
Approach | Packages |
---|---|
Hypergeometric | GOstats, topGO |
Enrichment | limma::romer() |
Category \(t\)-test | Category |
Linear model | GlobalAncova, GSEAlm, limma::roast() |
Pathway topology | SPIA |
Sequence-specific | goseq |
de novo | WGCNA |
This practical is based on section 6 of the goseq vignette.
This (relatively old) experiment examined the effects of androgen stimulation on a human prostate cancer cell line, LNCaP (Li et al., 2008). The experiment used short (35bp) single-end reads from 4 control and 3 untreated lines. Reads were aligned to hg19 using Bowtie, and counted using ENSEMBL 54 gene models.
Input the data to edgeR’s DGEList
data structure.
library(edgeR)
path <- system.file(package="goseq", "extdata", "Li_sum.txt")
table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE)
counts <- table.summary[,-1]
rownames(counts) <- table.summary[,1]
grp <- factor(rep(c("Control","Treated"), times=c(4,3)))
summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)
Use a ‘common’ dispersion estimate, and compare the two groups using an exact test
disp <- estimateCommonDisp(summarized)
tested <- exactTest(disp)
topTags(tested)
## Comparison of groups: Treated-Control
## logFC logCPM PValue FDR
## ENSG00000127954 11.557868 6.680748 2.574972e-80 1.274766e-75
## ENSG00000151503 5.398963 8.499530 1.781732e-65 4.410322e-61
## ENSG00000096060 4.897600 9.446705 7.983756e-60 1.317479e-55
## ENSG00000091879 5.737627 6.282646 1.207655e-54 1.494654e-50
## ENSG00000132437 -5.880436 7.951910 2.950042e-52 2.920896e-48
## ENSG00000166451 4.564246 8.458467 7.126763e-52 5.880292e-48
## ENSG00000131016 5.254737 6.607957 1.066807e-51 7.544766e-48
## ENSG00000163492 7.085400 5.128514 2.716461e-45 1.681014e-41
## ENSG00000113594 4.051053 8.603264 9.272066e-44 5.100255e-40
## ENSG00000116285 4.108522 7.864773 6.422468e-43 3.179507e-39
Start by extracting all P values, then correcting for multiple comparison using p.adjust()
. Classify the genes as differentially expressed or not.
padj <- with(tested$table, {
keep <- logFC != 0
value <- p.adjust(PValue[keep], method="BH")
setNames(value, rownames(tested)[keep])
})
genes <- padj < 0.05
table(genes)
## genes
## FALSE TRUE
## 19535 3208
Under the hood, goseq uses Bioconductor annotation packages (in this case org.Hs.eg.db and r Biocannopkg("GO.db")
to map from gene symbols to GO pathways.
Expore these packages through the columns()
and select()
functions. Can you map between ENSEMBL gene identifiers (the row names of topTable()
) to GO pathway? What about ‘drilling down’ on particular GO identifiers to discover the term’s definition?
Calculate the weighting for each gene. This looks up the gene lengths in a pre-defined table (how could these be calculated using TxDb packages? What challenges are associated with calculating these ‘weights’, based on the knowledge that genes typically consist of several transcripts, each expressed differently?)
pwf <- nullp(genes,"hg19","ensGene")
## Loading hg19 length data...
## Warning in pcls(G): initial point very close to some inequality
## constraints
head(pwf)
## DEgenes bias.data pwf
## ENSG00000230758 FALSE 247 0.03757470
## ENSG00000182463 FALSE 3133 0.20436865
## ENSG00000124208 FALSE 1978 0.16881769
## ENSG00000230753 FALSE 466 0.06927243
## ENSG00000224628 FALSE 1510 0.15903532
## ENSG00000125835 FALSE 954 0.12711992
Perform the main analysis. This includes association of genes to GO pathway
GO.wall <- goseq(pwf, "hg19", "ensGene")
## Fetching GO annotations...
## For 9751 genes, we could not find any categories. These genes will be excluded.
## To force their use, please run with use_genes_without_cat=TRUE (see documentation).
## This was the default behavior for version 1.15.1 and earlier.
## Calculating the p-values...
head(GO.wall)
## category over_represented_pvalue under_represented_pvalue
## 10729 GO:0044763 8.237627e-15 1
## 10708 GO:0044699 2.079753e-14 1
## 2453 GO:0005737 2.956026e-10 1
## 3004 GO:0006614 6.131543e-09 1
## 7499 GO:0031982 1.101818e-08 1
## 2372 GO:0005576 1.339207e-08 1
## numDEInCat numInCat
## 10729 1893 8355
## 10708 2054 9201
## 2453 1790 8080
## 3004 39 107
## 7499 638 2675
## 2372 669 2836
## term ontology
## 10729 single-organism cellular process BP
## 10708 single-organism process BP
## 2453 cytoplasm CC
## 3004 SRP-dependent cotranslational protein targeting to membrane BP
## 7499 vesicle CC
## 2372 extracellular region CC
Here we do the same operation, but ignore gene lengths
GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")
## Fetching GO annotations...
## For 9751 genes, we could not find any categories. These genes will be excluded.
## To force their use, please run with use_genes_without_cat=TRUE (see documentation).
## This was the default behavior for version 1.15.1 and earlier.
## Calculating the p-values...
Compare the over-represented P-values for each set, under the different methods
idx <- match(GO.nobias$category, GO.wall$category)
plot(log10(GO.nobias[, "over_represented_pvalue"]) ~
log10(GO.wall[idx, "over_represented_pvalue"]),
xlab="Wallenius", ylab="Hypergeometric",
xlim=c(-5, 0), ylim=c(-5, 0))
abline(0, 1, col="red", lwd=2)