- 1 The Molecular Signatures Database (MSigDB)
- 2 Download data from the msigdb R package
- 3 Downloading and integrating KEGG gene sets
- 4 Accessing the GeneSet and GeneSetCollection objects
- 5 Subset collections from the MSigDB
- 6 Preparing collections for limma::fry
- 7 Accessing the mouse MSigDB
- 8 Session information
1 The Molecular Signatures Database (MSigDB)
The molecular signatures database (MSigDB) is one of the largest collections of molecular signatures or gene expression signatures. A variety of gene expression signatures are hosted on this database including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection of gene expression signatures (>25,000) can facilitate a wide variety of signature-based analyses, the most popular being gene set enrichment analyses. These signatures can be used to perform enrichment analysis in a DE experiment using tools such as GSEA, fry (from limma) and camera (from limma). Alternatively, they can be used to perform single-sample gene-set analysis of individual transcriptomic profiles using approaches such as singscore, ssGSEA and GSVA.
This package provides the gene sets in the MSigDB in the form of GeneSet
objects. This data structure is specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as msigdbr
and EGSEAdata
provide these gene sets too, however, they do so by storing them as lists or tibbles. These structures are not specific to gene sets therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows creation of invalid gene sets. Accessory functions implemented in the GSEABase
package provide a neat interface to interact with GeneSet
objects.
This package can be installed using the code below:
2 Download data from the msigdb R package
This ExperimentHub package processes the latest version of the MSigDB database into R objects that can be queried using the GSEABase R/Bioconductor package. The entire database is stored in a GeneSetCollection
object which in turn stores each signature as a GeneSet
object. All empty gene expression signatures (i.e. no genes formed the signature) have been dropped. Data in this package can be downloaded using the ExperimentHub
interface as shown below.
To download the data, we first need to get a list of the data available in the msigdb
package and determine the unique identifiers for each data. The query()
function assists in getting this list.
eh = ExperimentHub()
query(eh , 'msigdb')
#> ExperimentHub with 4 records
#> # snapshotDate(): 2021-05-05
#> # $dataprovider: Broad Institute
#> # $species: Mus musculus, Homo sapiens
#> # $rdataclass: GSEABase::GeneSetCollection
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["EH5421"]]'
#>
#> title
#> EH5421 | msigdb.v7.2.hs.SYM
#> EH5422 | msigdb.v7.2.hs.EZID
#> EH5423 | msigdb.v7.2.mm.SYM
#> EH5424 | msigdb.v7.2.mm.EZID
Data can then be downloaded using the unique identifier.
eh[['EH5421']]
#> GeneSetCollection
#> names: chr11q, chr6q, ..., WP_HOSTPATHOGEN_INTERACTION_OF_HUMAN_CORONA_VIRUSES_INTERFERON_INDUCTION (31322 total)
#> unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
Alternatively, data can be downloaded using object name accessors in the msigdb
package as below:
#metadata are displayed
msigdb.v7.2.hs.SYM(metadata = TRUE)
#> ExperimentHub with 1 record
#> # snapshotDate(): 2021-05-05
#> # names(): EH5421
#> # package(): msigdb
#> # $dataprovider: Broad Institute
#> # $species: Homo sapiens
#> # $rdataclass: GSEABase::GeneSetCollection
#> # $rdatadateadded: 2021-03-18
#> # $title: msigdb.v7.2.hs.SYM
#> # $description: Gene expression signatures (human) from the Molecular Signat...
#> # $taxonomyid: 9606
#> # $genome: NA
#> # $sourcetype: XML
#> # $sourceurl: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2...
#> # $sourcesize: NA
#> # $tags: c("Homo_sapiens_Data", "Mus_musculus_Data")
#> # retrieve record with 'object[["EH5421"]]'
#data are loaded
msigdb.v7.2.hs.SYM()
#> GeneSetCollection
#> names: chr11q, chr6q, ..., WP_HOSTPATHOGEN_INTERACTION_OF_HUMAN_CORONA_VIRUSES_INTERFERON_INDUCTION (31322 total)
#> unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
Data can also be downloaded using the custom accessor `msigdb::getMsigdb()`:
#use the custom accessor to select a specific version of MSigDB
msigdb.v7.2.hs.SYM = getMsigdb('hs', 'SYM')
msigdb.v7.2.hs.SYM
#> GeneSetCollection
#> names: chr11q, chr6q, ..., WP_HOSTPATHOGEN_INTERACTION_OF_HUMAN_CORONA_VIRUSES_INTERFERON_INDUCTION (31322 total)
#> unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
3 Downloading and integrating KEGG gene sets
KEGG gene sets cannot be integrated within this ExperimentHub package due to licensing limitations. However, users can download, process and integrate the data directly from the MSigDB when needed. This can be done using the code that follows.
msigdb.v7.2.hs.SYM = appendKEGG(msigdb.v7.2.hs.SYM)
msigdb.v7.2.hs.SYM
#> GeneSetCollection
#> names: chr11q, chr6q, ..., KEGG_VIRAL_MYOCARDITIS (31508 total)
#> unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
4 Accessing the GeneSet and GeneSetCollection objects
A GeneSetCollection object is effectively a list therefore all list processing functions such as length
and lapply
can be used to process its constituents
Each signature is stored in a GeneSet
object and can be processed using functions in the GSEABase
R/Bioconductor package.
gs = msigdb.v7.2.hs.SYM[[1000]]
gs
#> setName: TONKS_TARGETS_OF_RUNX1_RUNX1T1_FUSION_ERYTHROCYTE_DN
#> geneIds: LTBP1, MYL4, ..., PF4 (total: 18)
#> geneIdType: Symbol
#> collectionType: Broad
#> bcCategory: c2 (Curated)
#> bcSubCategory: CGP
#> details: use 'details(object)'
#get genes in the signature
geneIds(gs)
#> [1] "LTBP1" "MYL4" "GP1BB" "HBE1" "SLC27A2" "COL18A1"
#> [7] "HBA1" "PDLIM1" "LTC4S" "ASAP2" "ITM2A" "ARHGAP22"
#> [13] "CLC" "MYLK" "LDLRAD4" "LRRC61" "AHSP" "PF4"
#get collection type
collectionType(gs)
#> collectionType: Broad
#> bcCategory: c2 (Curated)
#> bcSubCategory: CGP
#get MSigDB category
bcCategory(collectionType(gs))
#> [1] "c2"
#get MSigDB subcategory
bcSubCategory(collectionType(gs))
#> [1] "CGP"
#get description
description(gs)
#> [1] "Genes down-regulated in erythroid lineage cells by RUNX1-RUNX1T1 [GeneID=861;862] fusion ."
#get details
details(gs)
#> setName: TONKS_TARGETS_OF_RUNX1_RUNX1T1_FUSION_ERYTHROCYTE_DN
#> geneIds: LTBP1, MYL4, ..., PF4 (total: 18)
#> geneIdType: Symbol
#> collectionType: Broad
#> bcCategory: c2 (Curated)
#> bcSubCategory: CGP
#> setIdentifier: PC1500:21096:Wed Mar 17 21:20:55 2021:194843
#> description: Genes down-regulated in erythroid lineage cells by RUNX1-RUNX1T1 [GeneID=861;862] fusion .
#> (longDescription available)
#> organism: Homo sapiens
#> pubMedIds: 17898786
#> urls: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/msigdb_v7.2.xml
#> contributor: Arthur Liberzon
#> setVersion: 0.0.1
#> creationDate:
We can also summarise some of these values across the entire database. Description of these codes can be found at the MSigDB website (https://www.gsea-msigdb.org/gsea/msigdb).
#calculate the number of signatures in each category
table(sapply(lapply(msigdb.v7.2.hs.SYM, collectionType), bcCategory))
#>
#> archived c1 c2 c3 c4 c5 c6 c7
#> 391 299 6226 3556 858 14765 189 4872
#> c8 h
#> 302 50
#calculate the number of signatures in each subcategory
table(sapply(lapply(msigdb.v7.2.hs.SYM, collectionType), bcSubCategory))
#>
#> C1_NONE C2_CGP C2_CP:BIOCARTA C2_CP:REACTOME C5_GO:BP
#> 6 6 4 100 157
#> C5_GO:CC C5_GO:MF CGN CGP CM
#> 75 43 427 3358 431
#> CP CP:BIOCARTA CP:KEGG CP:PID CP:REACTOME
#> 56 289 186 196 1554
#> CP:WIKIPATHWAYS GO:BP GO:CC GO:MF HPO
#> 587 7573 1001 1697 4494
#> MIR:MIRDB MIR:MIR_Legacy TFT:GTRD TFT:TFT_Legacy
#> 2377 221 348 610
#plot the distribution of sizes
hist(sapply(lapply(msigdb.v7.2.hs.SYM, geneIds), length),
main = 'MSigDB signature size distribution',
xlab = 'Signature size')
5 Subset collections from the MSigDB
Most gene set analysis is performed within specific collections rather than across the entire database. This package comes with functions to subset specific collections. The list of all collections and sub-collections present within a GeneSetCollection object can be listed using the functions below:
listCollections(msigdb.v7.2.hs.SYM)
#> [1] "archived" "c2" "c1" "c3" "c4" "c6"
#> [7] "c7" "c5" "h" "c8"
listSubCollections(msigdb.v7.2.hs.SYM)
#> [1] "C1_NONE" "C2_CP:BIOCARTA" "C2_CP:REACTOME" "CP"
#> [5] "C5_GO:BP" "C5_GO:CC" "C5_GO:MF" "C2_CGP"
#> [9] "CP:REACTOME" "CGP" "CP:BIOCARTA" "CP:PID"
#> [13] "MIR:MIRDB" "MIR:MIR_Legacy" "TFT:TFT_Legacy" "CGN"
#> [17] "CM" "GO:BP" "GO:CC" "GO:MF"
#> [21] "TFT:GTRD" "HPO" "CP:WIKIPATHWAYS" "CP:KEGG"
Specific collections can be retrieved using the code below:
#retrieeve the hallmarks gene sets
subsetCollection(msigdb.v7.2.hs.SYM, 'h')
#> GeneSetCollection
#> names: HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_HYPOXIA, ..., HALLMARK_PANCREAS_BETA_CELLS (50 total)
#> unique identifiers: JUNB, CXCL2, ..., SRP14 (4383 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
#retrieve the biological processes category of gene ontology
subsetCollection(msigdb.v7.2.hs.SYM, 'c5', 'GO:BP')
#> GeneSetCollection
#> names: GO_MITOCHONDRIAL_GENOME_MAINTENANCE, GO_REPRODUCTION, ..., GO_LIPOXIN_METABOLIC_PROCESS (7573 total)
#> unique identifiers: AKT3, PPARGC1A, ..., ANTXRL (17901 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
6 Preparing collections for limma::fry
Any gene-set collection can be easily transformed for usage with limma::fry
by first transforming it into a list of gene IDs and following that with a transformation to indices as shown below.
library(limma)
#create expression data
allg = unique(unlist(geneIds(msigdb.v7.2.hs.SYM)))
emat = matrix(0, nrow = length(allg), ncol = 6)
rownames(emat) = allg
colnames(emat) = paste0('sample', 1:6)
head(emat)
#> sample1 sample2 sample3 sample4 sample5 sample6
#> AP001767.2 0 0 0 0 0 0
#> SLC22A9 0 0 0 0 0 0
#> OR5J7P 0 0 0 0 0 0
#> MAML2 0 0 0 0 0 0
#> GRIK4 0 0 0 0 0 0
#> HPRT1P3 0 0 0 0 0 0
#retrieve collections
hallmarks = subsetCollection(msigdb.v7.2.hs.SYM, 'h')
msigdb_ids = geneIds(hallmarks)
#convert gene sets into a list of gene indices
fry_indices = ids2indices(msigdb_ids, rownames(emat))
fry_indices[1:2]
#> $HALLMARK_TNFA_SIGNALING_VIA_NFKB
#> [1] 102 112 195 200 210 214 215 231 234 250 293 381
#> [13] 388 393 395 417 453 515 586 615 856 888 910 934
#> [25] 1346 1373 1444 1720 1727 1882 1883 1884 1934 2086 2113 2120
#> [37] 2149 2152 2155 2156 2277 2335 2500 2545 2547 2600 2616 2650
#> [49] 2667 2669 2671 2727 2728 2735 2755 2756 2757 2769 2771 2773
#> [61] 2774 2776 2899 2920 2951 2959 2993 3001 3021 3032 3062 3068
#> [73] 3080 3088 3108 3110 3122 3216 3221 3251 3275 3317 3389 3409
#> [85] 3538 3562 3604 3707 3713 3740 3868 3910 3967 3993 3999 4075
#> [97] 4098 4140 4154 4295 4339 4369 4508 4554 4568 4670 4710 5025
#> [109] 5058 5069 5289 5319 5326 5327 5366 5471 5483 5490 5503 5546
#> [121] 5550 5574 5575 5578 5586 5595 5628 5689 5694 5902 5921 5947
#> [133] 5979 6003 6004 6009 6017 6049 6124 6454 6492 6539 6629 6643
#> [145] 6681 7136 7162 7198 7429 7552 7595 7623 7763 7810 7836 7935
#> [157] 7963 7994 8072 8085 8087 8161 8173 8174 8182 8188 8198 8279
#> [169] 8323 8363 8368 8369 8390 8417 8421 8447 8576 8644 8657 8709
#> [181] 8776 8780 8864 8879 9076 9796 11033 11183 12590 14010 17471 20117
#> [193] 22021 24625 24706 27392 27481 31681 37905 38553
#>
#> $HALLMARK_HYPOXIA
#> [1] 42 44 47 48 200 210 373 417 418 446 482 515
#> [13] 532 624 803 814 830 833 836 838 856 966 1208 1258
#> [25] 1309 1318 1320 1371 1383 1389 1398 1444 1447 1475 1668 1676
#> [37] 1721 2113 2171 2335 2456 2509 2517 2526 2699 2769 2781 2790
#> [49] 2823 2920 3088 3110 3131 3183 3221 3289 3291 3324 3366 3373
#> [61] 3392 3467 3472 3505 3518 3631 3694 3713 3715 3851 3935 3956
#> [73] 3968 4096 4098 4395 4409 4445 4575 4606 4621 4624 4627 4717
#> [85] 4751 4943 5025 5069 5070 5071 5179 5202 5211 5213 5232 5236
#> [97] 5238 5258 5262 5267 5347 5407 5437 5487 5575 5586 5618 5637
#> [109] 5674 5689 5788 5885 5888 5920 6053 6083 6337 6415 6467 6539
#> [121] 6594 6614 6826 6913 6962 7057 7064 7072 7100 7302 7309 7312
#> [133] 7326 7341 7404 7416 7440 7479 7481 7702 7799 7836 7935 8056
#> [145] 8079 8115 8173 8182 8185 8216 8279 8328 8472 8657 8714 8835
#> [157] 8862 8939 9000 9319 9339 9424 9429 9433 9434 9796 10518 11143
#> [169] 12165 12568 12590 13175 16436 16547 17620 18297 18439 18868 19221 20855
#> [181] 22203 23067 23516 26015 26383 27595 28041 28341 29101 29262 29621 29809
#> [193] 30071 31186 33627 34209 35211 37527 38563 39416
7 Accessing the mouse MSigDB
The mouse MSigDB has been created in collaboration with Gordon K. Smyth and Alex Garnham from the Walter and Eliza Hall Institute of Medical Research (WEHI). The code they use to generate the mouse MSigDB has been used in this package. Detailed description of the steps conducted to convert human gene expression signatures to mouse can be found at http://bioinf.wehi.edu.au/MSigDB/index.html. Mouse homologs for human genes were obtained using the HCOP database (as of 18/03/2021).
All the above functions apply to the mouse MSigDB and can be used to interact with the collection.
msigdb.v7.2.mm.SYM = msigdb.v7.2.mm.SYM()
msigdb.v7.2.mm.SYM
#> GeneSetCollection
#> names: 10qA1, 10qA2, ..., ZWANG_TRANSIENTLY_UP_BY_2ND_EGF_PULSE_ONLY (43766 total)
#> unique identifiers: Epm2a, Esr1, ..., Gm52481 (52381 total)
#> types in collection:
#> geneIdType: SymbolIdentifier (1 total)
#> collectionType: BroadCollection (1 total)
8 Session information
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] limma_3.48.0 GSEABase_1.54.0 graph_1.70.0
#> [4] annotate_1.70.0 XML_3.99-0.6 AnnotationDbi_1.54.0
#> [7] IRanges_2.26.0 S4Vectors_0.30.0 Biobase_2.52.0
#> [10] ExperimentHub_2.0.0 AnnotationHub_3.0.0 BiocFileCache_2.0.0
#> [13] dbplyr_2.1.1 BiocGenerics_0.38.0 msigdb_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr_1.4.2 sass_0.4.0
#> [3] bit64_4.0.5 jsonlite_1.7.2
#> [5] prettydoc_0.4.1 bslib_0.2.5.1
#> [7] shiny_1.6.0 assertthat_0.2.1
#> [9] interactiveDisplayBase_1.30.0 highr_0.9
#> [11] BiocManager_1.30.15 blob_1.2.1
#> [13] GenomeInfoDbData_1.2.6 yaml_2.2.1
#> [15] BiocVersion_3.13.1 pillar_1.6.1
#> [17] RSQLite_2.2.7 glue_1.4.2
#> [19] digest_0.6.27 promises_1.2.0.1
#> [21] XVector_0.32.0 htmltools_0.5.1.1
#> [23] httpuv_1.6.1 pkgconfig_2.0.3
#> [25] zlibbioc_1.38.0 purrr_0.3.4
#> [27] xtable_1.8-4 later_1.2.0
#> [29] tibble_3.1.2 KEGGREST_1.32.0
#> [31] generics_0.1.0 ellipsis_0.3.2
#> [33] cachem_1.0.5 withr_2.4.2
#> [35] magrittr_2.0.1 crayon_1.4.1
#> [37] mime_0.10 memoise_2.0.0
#> [39] evaluate_0.14 fansi_0.4.2
#> [41] tools_4.1.0 org.Hs.eg.db_3.13.0
#> [43] BiocStyle_2.20.0 lifecycle_1.0.0
#> [45] stringr_1.4.0 Biostrings_2.60.0
#> [47] compiler_4.1.0 jquerylib_0.1.4
#> [49] GenomeInfoDb_1.28.0 rlang_0.4.11
#> [51] RCurl_1.98-1.3 rappdirs_0.3.3
#> [53] bitops_1.0-7 rmarkdown_2.8
#> [55] DBI_1.1.1 curl_4.3.1
#> [57] R6_2.5.0 knitr_1.33
#> [59] dplyr_1.0.6 fastmap_1.1.0
#> [61] bit_4.0.4 utf8_1.2.1
#> [63] filelock_1.0.2 stringi_1.6.2
#> [65] Rcpp_1.0.6 vctrs_0.3.8
#> [67] png_0.1-7 tidyselect_1.1.1
#> [69] xfun_0.23