1 The Molecular Signatures Database (MSigDB)
2 Download data from the msigdb R package
3 Downloading and integrating KEGG gene sets
4 Accessing the GeneSet and GeneSetCollection objects
5 Subset collections from the MSigDB
6 Preparing collections for limma::fry
7 Accessing the mouse MSigDB
8 Session information

1 The Molecular Signatures Database (MSigDB)

The Molecular Signatures Database (MSigDB) is one of the largest collections of molecular signatures, also referred to as gene expression signatures. The database hosts a variety of signatures, including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection (>25,000 signatures) facilitates a wide variety of signature-based analyses, the most popular being gene set enrichment analysis. The signatures can be used to perform enrichment analysis in a differential expression (DE) experiment using tools such as GSEA, fry (from limma) and camera (from limma). Alternatively, they can be used to perform single-sample gene-set analysis of individual transcriptomic profiles using approaches such as singscore, ssGSEA and GSVA.

This package provides the gene sets in the MSigDB in the form of GeneSet objects, a data structure specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as msigdbr and EGSEAdata, provide these gene sets too; however, they store them as lists or tibbles. These structures are not specific to gene sets and therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows the creation of invalid gene sets. Accessory functions implemented in the GSEABase package provide a neat interface to interact with GeneSet objects.
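To make the point about unstructured storage concrete, the sketch below uses only base R and does not involve the GSEABase classes. It shows that a plain named list accepts entries that are not meaningful gene sets. The set names and the helper `check_set()` are hypothetical and exist only for this illustration.

```r
#a plain named list accepts entries that are not meaningful gene sets
#(set names are made up for illustration)
sig_list = list(
  SET_OK         = c('TP53', 'MYC', 'EGFR'),
  SET_DUPLICATED = c('TP53', 'TP53', 'MYC'),
  SET_MISSING    = c('MYC', NA),
  SET_EMPTY      = character(0)
)

#hypothetical helper: report the problems found in one set
check_set = function(ids) {
  c(
    empty      = length(ids) == 0,
    duplicated = anyDuplicated(ids) > 0,
    missing    = anyNA(ids)
  )
}

#one row per set, one column per check
problems = t(sapply(sig_list, check_set))
problems

#names of the sets failing at least one check
invalid = rownames(problems)[rowSums(problems) > 0]
invalid
```

Nothing in the list itself prevents such entries, so checks of this kind have to be written and run by the user.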

This package can be installed using the code below:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("msigdb")

2 Download data from the msigdb R package

This ExperimentHub package processes the latest version of the MSigDB into R objects that can be queried using the GSEABase R/Bioconductor package. The entire database is stored in a GeneSetCollection object, which in turn stores each signature as a GeneSet object. All empty gene expression signatures (i.e. signatures containing no genes) have been dropped. Data in this package can be downloaded using the ExperimentHub interface as shown below.

To download the data, we first need to get a list of the datasets available in the msigdb package and determine the unique identifier of each dataset. The query() function assists in getting this list.

library(msigdb)
library(ExperimentHub)
library(GSEABase)

eh = ExperimentHub()
query(eh, 'msigdb')
#> ExperimentHub with 4 records
#> # snapshotDate(): 2021-05-05
#> # $dataprovider: Broad Institute
#> # $species: Mus musculus, Homo sapiens
#> # $rdataclass: GSEABase::GeneSetCollection
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH5421"]]' 
#> 
#>            title              
#>   EH5421 | msigdb.v7.2.hs.SYM 
#>   EH5422 | msigdb.v7.2.hs.EZID
#>   EH5423 | msigdb.v7.2.mm.SYM 
#>   EH5424 | msigdb.v7.2.mm.EZID

Data can then be downloaded using the unique identifier.

eh[['EH5421']]
#> GeneSetCollection
#>   names: chr11q, chr6q, ..., WP_HOSTPATHOGEN_INTERACTION_OF_HUMAN_CORONA_VIRUSES_INTERFERON_INDUCTION (31322 total)
#>   unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

Alternatively, data can be downloaded using object name accessors in the msigdb package as below:

#metadata are displayed
msigdb.v7.2.hs.SYM(metadata = TRUE)
#> ExperimentHub with 1 record
#> # snapshotDate(): 2021-05-05
#> # names(): EH5421
#> # package(): msigdb
#> # $dataprovider: Broad Institute
#> # $species: Homo sapiens
#> # $rdataclass: GSEABase::GeneSetCollection
#> # $rdatadateadded: 2021-03-18
#> # $title: msigdb.v7.2.hs.SYM
#> # $description: Gene expression signatures (human) from the Molecular Signat...
#> # $taxonomyid: 9606
#> # $genome: NA
#> # $sourcetype: XML
#> # $sourceurl: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2...
#> # $sourcesize: NA
#> # $tags: c("Homo_sapiens_Data", "Mus_musculus_Data") 
#> # retrieve record with 'object[["EH5421"]]'
#data are loaded
msigdb.v7.2.hs.SYM()
#> GeneSetCollection
#>   names: chr11q, chr6q, ..., WP_HOSTPATHOGEN_INTERACTION_OF_HUMAN_CORONA_VIRUSES_INTERFERON_INDUCTION (31322 total)
#>   unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

Data can also be downloaded using the custom accessor `msigdb::getMsigdb()`:

#use the custom accessor to select a specific MSigDB by organism ('hs') and identifier type ('SYM')
msigdb.v7.2.hs.SYM = getMsigdb('hs', 'SYM')
msigdb.v7.2.hs.SYM
#> GeneSetCollection
#>   names: chr11q, chr6q, ..., WP_HOSTPATHOGEN_INTERACTION_OF_HUMAN_CORONA_VIRUSES_INTERFERON_INDUCTION (31322 total)
#>   unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

3 Downloading and integrating KEGG gene sets

KEGG gene sets cannot be integrated within this ExperimentHub package due to licensing limitations. However, users can download, process and integrate the data directly from the MSigDB when needed. This can be done using the code that follows.

msigdb.v7.2.hs.SYM = appendKEGG(msigdb.v7.2.hs.SYM)
msigdb.v7.2.hs.SYM
#> GeneSetCollection
#>   names: chr11q, chr6q, ..., KEGG_VIRAL_MYOCARDITIS (31508 total)
#>   unique identifiers: AP001767.2, SLC22A9, ..., AC023491.2 (40044 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

4 Accessing the GeneSet and GeneSetCollection objects

A GeneSetCollection object is effectively a list; therefore, list processing functions such as length and lapply can be used to process its constituents.

length(msigdb.v7.2.hs.SYM)
#> [1] 31508

Each signature is stored in a GeneSet object and can be processed using functions in the GSEABase R/Bioconductor package.

gs = msigdb.v7.2.hs.SYM[[1000]]
gs
#> setName: TONKS_TARGETS_OF_RUNX1_RUNX1T1_FUSION_ERYTHROCYTE_DN 
#> geneIds: LTBP1, MYL4, ..., PF4 (total: 18)
#> geneIdType: Symbol
#> collectionType: Broad
#>   bcCategory: c2 (Curated)
#>   bcSubCategory: CGP
#> details: use 'details(object)'
#get genes in the signature
geneIds(gs)
#>  [1] "LTBP1"    "MYL4"     "GP1BB"    "HBE1"     "SLC27A2"  "COL18A1" 
#>  [7] "HBA1"     "PDLIM1"   "LTC4S"    "ASAP2"    "ITM2A"    "ARHGAP22"
#> [13] "CLC"      "MYLK"     "LDLRAD4"  "LRRC61"   "AHSP"     "PF4"
#get collection type
collectionType(gs)
#> collectionType: Broad
#>   bcCategory: c2 (Curated)
#>   bcSubCategory: CGP
#get MSigDB category
bcCategory(collectionType(gs))
#> [1] "c2"
#get MSigDB subcategory
bcSubCategory(collectionType(gs))
#> [1] "CGP"
#get description
description(gs)
#> [1] "Genes down-regulated in erythroid lineage cells by RUNX1-RUNX1T1 [GeneID=861;862] fusion ."
#get details
details(gs)
#> setName: TONKS_TARGETS_OF_RUNX1_RUNX1T1_FUSION_ERYTHROCYTE_DN 
#> geneIds: LTBP1, MYL4, ..., PF4 (total: 18)
#> geneIdType: Symbol
#> collectionType: Broad
#>   bcCategory: c2 (Curated)
#>   bcSubCategory: CGP
#> setIdentifier: PC1500:21096:Wed Mar 17 21:20:55 2021:194843
#> description: Genes down-regulated in erythroid lineage cells by RUNX1-RUNX1T1 [GeneID=861;862] fusion .
#>   (longDescription available)
#> organism: Homo sapiens
#> pubMedIds: 17898786
#> urls: https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.2/msigdb_v7.2.xml
#> contributor: Arthur Liberzon
#> setVersion: 0.0.1
#> creationDate:

We can also summarise some of these values across the entire database. Descriptions of the category and subcategory codes can be found on the MSigDB website (https://www.gsea-msigdb.org/gsea/msigdb).

#calculate the number of signatures in each category
table(sapply(lapply(msigdb.v7.2.hs.SYM, collectionType), bcCategory))
#> 
#> archived       c1       c2       c3       c4       c5       c6       c7 
#>      391      299     6226     3556      858    14765      189     4872 
#>       c8        h 
#>      302       50
#calculate the number of signatures in each subcategory
table(sapply(lapply(msigdb.v7.2.hs.SYM, collectionType), bcSubCategory))
#> 
#>         C1_NONE          C2_CGP  C2_CP:BIOCARTA  C2_CP:REACTOME        C5_GO:BP 
#>               6               6               4             100             157 
#>        C5_GO:CC        C5_GO:MF             CGN             CGP              CM 
#>              75              43             427            3358             431 
#>              CP     CP:BIOCARTA         CP:KEGG          CP:PID     CP:REACTOME 
#>              56             289             186             196            1554 
#> CP:WIKIPATHWAYS           GO:BP           GO:CC           GO:MF             HPO 
#>             587            7573            1001            1697            4494 
#>       MIR:MIRDB  MIR:MIR_Legacy        TFT:GTRD  TFT:TFT_Legacy 
#>            2377             221             348             610
#plot the distribution of sizes
hist(sapply(lapply(msigdb.v7.2.hs.SYM, geneIds), length),
     main = 'MSigDB signature size distribution',
     xlab = 'Signature size')
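Signature sizes are often used to filter gene sets before an analysis. The sketch below shows one way this could be done in base R on a named list of gene identifiers, which is the form returned by geneIds() for a collection. The toy sets and the size bounds `min_size` and `max_size` are hypothetical values chosen for illustration.

```r
#toy stand-in for geneIds(collection): a named list of character vectors
#(set names and gene identifiers are made up for illustration)
sig_ids = list(
  SET_SMALL  = c('A', 'B'),
  SET_MEDIUM = c('A', 'C', 'D', 'E', 'F'),
  SET_LARGE  = paste0('G', 1:40)
)

#number of genes in each signature
sig_sizes = lengths(sig_ids)

#numeric summary of the size distribution
summary(sig_sizes)

#keep signatures whose size falls within hypothetical bounds
min_size = 5
max_size = 30
keep = sig_sizes >= min_size & sig_sizes <= max_size
filtered_ids = sig_ids[keep]
names(filtered_ids)
```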

5 Subset collections from the MSigDB

Most gene set analyses are performed within specific collections rather than across the entire database. This package comes with functions to subset specific collections. All collections and sub-collections present within a GeneSetCollection object can be listed using the functions below:

listCollections(msigdb.v7.2.hs.SYM)
#>  [1] "archived" "c2"       "c1"       "c3"       "c4"       "c6"      
#>  [7] "c7"       "c5"       "h"        "c8"
listSubCollections(msigdb.v7.2.hs.SYM)
#>  [1] "C1_NONE"         "C2_CP:BIOCARTA"  "C2_CP:REACTOME"  "CP"             
#>  [5] "C5_GO:BP"        "C5_GO:CC"        "C5_GO:MF"        "C2_CGP"         
#>  [9] "CP:REACTOME"     "CGP"             "CP:BIOCARTA"     "CP:PID"         
#> [13] "MIR:MIRDB"       "MIR:MIR_Legacy"  "TFT:TFT_Legacy"  "CGN"            
#> [17] "CM"              "GO:BP"           "GO:CC"           "GO:MF"          
#> [21] "TFT:GTRD"        "HPO"             "CP:WIKIPATHWAYS" "CP:KEGG"

Specific collections can be retrieved using the code below:

#retrieve the hallmarks gene sets
subsetCollection(msigdb.v7.2.hs.SYM, 'h')
#> GeneSetCollection
#>   names: HALLMARK_TNFA_SIGNALING_VIA_NFKB, HALLMARK_HYPOXIA, ..., HALLMARK_PANCREAS_BETA_CELLS (50 total)
#>   unique identifiers: JUNB, CXCL2, ..., SRP14 (4383 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)
#retrieve the biological processes category of gene ontology
subsetCollection(msigdb.v7.2.hs.SYM, 'c5', 'GO:BP')
#> GeneSetCollection
#>   names: GO_MITOCHONDRIAL_GENOME_MAINTENANCE, GO_REPRODUCTION, ..., GO_LIPOXIN_METABOLIC_PROCESS (7573 total)
#>   unique identifiers: AKT3, PPARGC1A, ..., ANTXRL (17901 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

6 Preparing collections for limma::fry

Any gene-set collection can be prepared for use with limma::fry by first extracting it as a list of gene IDs and then converting these IDs into row indices of the expression matrix, as shown below.

library(limma)

#create a placeholder expression matrix (all zeros) covering every gene in the database
allg = unique(unlist(geneIds(msigdb.v7.2.hs.SYM)))
emat = matrix(0, nrow = length(allg), ncol = 6)
rownames(emat) = allg
colnames(emat) = paste0('sample', 1:6)
head(emat)
#>            sample1 sample2 sample3 sample4 sample5 sample6
#> AP001767.2       0       0       0       0       0       0
#> SLC22A9          0       0       0       0       0       0
#> OR5J7P           0       0       0       0       0       0
#> MAML2            0       0       0       0       0       0
#> GRIK4            0       0       0       0       0       0
#> HPRT1P3          0       0       0       0       0       0

#retrieve collections
hallmarks = subsetCollection(msigdb.v7.2.hs.SYM, 'h')
msigdb_ids = geneIds(hallmarks)

#convert gene sets into a list of gene indices
fry_indices = ids2indices(msigdb_ids, rownames(emat))
fry_indices[1:2]
#> $HALLMARK_TNFA_SIGNALING_VIA_NFKB
#>   [1]   102   112   195   200   210   214   215   231   234   250   293   381
#>  [13]   388   393   395   417   453   515   586   615   856   888   910   934
#>  [25]  1346  1373  1444  1720  1727  1882  1883  1884  1934  2086  2113  2120
#>  [37]  2149  2152  2155  2156  2277  2335  2500  2545  2547  2600  2616  2650
#>  [49]  2667  2669  2671  2727  2728  2735  2755  2756  2757  2769  2771  2773
#>  [61]  2774  2776  2899  2920  2951  2959  2993  3001  3021  3032  3062  3068
#>  [73]  3080  3088  3108  3110  3122  3216  3221  3251  3275  3317  3389  3409
#>  [85]  3538  3562  3604  3707  3713  3740  3868  3910  3967  3993  3999  4075
#>  [97]  4098  4140  4154  4295  4339  4369  4508  4554  4568  4670  4710  5025
#> [109]  5058  5069  5289  5319  5326  5327  5366  5471  5483  5490  5503  5546
#> [121]  5550  5574  5575  5578  5586  5595  5628  5689  5694  5902  5921  5947
#> [133]  5979  6003  6004  6009  6017  6049  6124  6454  6492  6539  6629  6643
#> [145]  6681  7136  7162  7198  7429  7552  7595  7623  7763  7810  7836  7935
#> [157]  7963  7994  8072  8085  8087  8161  8173  8174  8182  8188  8198  8279
#> [169]  8323  8363  8368  8369  8390  8417  8421  8447  8576  8644  8657  8709
#> [181]  8776  8780  8864  8879  9076  9796 11033 11183 12590 14010 17471 20117
#> [193] 22021 24625 24706 27392 27481 31681 37905 38553
#> 
#> $HALLMARK_HYPOXIA
#>   [1]    42    44    47    48   200   210   373   417   418   446   482   515
#>  [13]   532   624   803   814   830   833   836   838   856   966  1208  1258
#>  [25]  1309  1318  1320  1371  1383  1389  1398  1444  1447  1475  1668  1676
#>  [37]  1721  2113  2171  2335  2456  2509  2517  2526  2699  2769  2781  2790
#>  [49]  2823  2920  3088  3110  3131  3183  3221  3289  3291  3324  3366  3373
#>  [61]  3392  3467  3472  3505  3518  3631  3694  3713  3715  3851  3935  3956
#>  [73]  3968  4096  4098  4395  4409  4445  4575  4606  4621  4624  4627  4717
#>  [85]  4751  4943  5025  5069  5070  5071  5179  5202  5211  5213  5232  5236
#>  [97]  5238  5258  5262  5267  5347  5407  5437  5487  5575  5586  5618  5637
#> [109]  5674  5689  5788  5885  5888  5920  6053  6083  6337  6415  6467  6539
#> [121]  6594  6614  6826  6913  6962  7057  7064  7072  7100  7302  7309  7312
#> [133]  7326  7341  7404  7416  7440  7479  7481  7702  7799  7836  7935  8056
#> [145]  8079  8115  8173  8182  8185  8216  8279  8328  8472  8657  8714  8835
#> [157]  8862  8939  9000  9319  9339  9424  9429  9433  9434  9796 10518 11143
#> [169] 12165 12568 12590 13175 16436 16547 17620 18297 18439 18868 19221 20855
#> [181] 22203 23067 23516 26015 26383 27595 28041 28341 29101 29262 29621 29809
#> [193] 30071 31186 33627 34209 35211 37527 38563 39416
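Conceptually, the conversion above replaces each gene identifier with the row position of that gene in the expression matrix. The base R sketch below approximates this idea on toy data. It is not the limma implementation, and the helper `to_indices()`, the gene universe and the set names are hypothetical.

```r
#toy gene universe and gene sets (names are made up for illustration)
gene_universe = c('G1', 'G2', 'G3', 'G4', 'G5', 'G6')
toy_sets = list(
  SET_A = c('G2', 'G5', 'G9'),  #G9 is absent from the universe
  SET_B = c('G1', 'G3'),
  SET_C = c('G7', 'G8')         #no member is in the universe
)

#hypothetical helper approximating the conversion to indices
to_indices = function(sets, identifiers) {
  #positions of the universe genes that belong to each set
  idx = lapply(sets, function(s) which(identifiers %in% s))
  #drop sets with no genes left
  idx[lengths(idx) > 0]
}

toy_indices = to_indices(toy_sets, gene_universe)
toy_indices
```

Genes missing from the expression matrix are silently ignored, so it is worth checking set sizes after the conversion. The list returned by ids2indices can then be supplied as the index argument of limma::fry, together with the expression matrix and a design matrix.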

7 Accessing the mouse MSigDB

The mouse MSigDB has been created in collaboration with Gordon K. Smyth and Alex Garnham from the Walter and Eliza Hall Institute of Medical Research (WEHI). The code they use to generate the mouse MSigDB has been used in this package. A detailed description of the steps used to convert human gene expression signatures to mouse can be found at http://bioinf.wehi.edu.au/MSigDB/index.html. Mouse homologs for human genes were obtained using the HCOP database (as of 18 March 2021).

All the above functions apply to the mouse MSigDB and can be used to interact with the collection.

msigdb.v7.2.mm.SYM = msigdb.v7.2.mm.SYM()
msigdb.v7.2.mm.SYM
#> GeneSetCollection
#>   names: 10qA1, 10qA2, ..., ZWANG_TRANSIENTLY_UP_BY_2ND_EGF_PULSE_ONLY (43766 total)
#>   unique identifiers: Epm2a, Esr1, ..., Gm52481 (52381 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)
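The conversion of human signatures to mouse mentioned at the start of this section relies on a table of homologs. The base R sketch below illustrates the general idea on toy data only. It is not the WEHI procedure, the small homolog table is written by hand rather than taken from HCOP, and the helper `to_mouse()` and the placeholder gene NO_HOMOLOG are hypothetical. The sketch assumes one-to-one homologs.

```r
#hand-written toy homolog table (not taken from HCOP)
homologs = data.frame(
  human = c('TP53', 'MYC', 'EGFR'),
  mouse = c('Trp53', 'Myc', 'Egfr'),
  stringsAsFactors = FALSE
)

#toy human signatures; NO_HOMOLOG is a placeholder without a mouse homolog
human_sets = list(
  SET_A = c('TP53', 'MYC', 'NO_HOMOLOG'),
  SET_B = c('EGFR', 'TP53'),
  SET_C = c('NO_HOMOLOG')
)

#hypothetical helper: translate one set, dropping genes without a homolog
to_mouse = function(ids, table) {
  mapped = table$mouse[match(ids, table$human)]
  unique(mapped[!is.na(mapped)])
}

mouse_sets = lapply(human_sets, to_mouse, table = homologs)

#drop signatures left empty after the conversion
mouse_sets = mouse_sets[lengths(mouse_sets) > 0]
mouse_sets
```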

8 Session information

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] limma_3.48.0         GSEABase_1.54.0      graph_1.70.0        
#>  [4] annotate_1.70.0      XML_3.99-0.6         AnnotationDbi_1.54.0
#>  [7] IRanges_2.26.0       S4Vectors_0.30.0     Biobase_2.52.0      
#> [10] ExperimentHub_2.0.0  AnnotationHub_3.0.0  BiocFileCache_2.0.0 
#> [13] dbplyr_2.1.1         BiocGenerics_0.38.0  msigdb_1.0.0        
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.2                    sass_0.4.0                   
#>  [3] bit64_4.0.5                   jsonlite_1.7.2               
#>  [5] prettydoc_0.4.1               bslib_0.2.5.1                
#>  [7] shiny_1.6.0                   assertthat_0.2.1             
#>  [9] interactiveDisplayBase_1.30.0 highr_0.9                    
#> [11] BiocManager_1.30.15           blob_1.2.1                   
#> [13] GenomeInfoDbData_1.2.6        yaml_2.2.1                   
#> [15] BiocVersion_3.13.1            pillar_1.6.1                 
#> [17] RSQLite_2.2.7                 glue_1.4.2                   
#> [19] digest_0.6.27                 promises_1.2.0.1             
#> [21] XVector_0.32.0                htmltools_0.5.1.1            
#> [23] httpuv_1.6.1                  pkgconfig_2.0.3              
#> [25] zlibbioc_1.38.0               purrr_0.3.4                  
#> [27] xtable_1.8-4                  later_1.2.0                  
#> [29] tibble_3.1.2                  KEGGREST_1.32.0              
#> [31] generics_0.1.0                ellipsis_0.3.2               
#> [33] cachem_1.0.5                  withr_2.4.2                  
#> [35] magrittr_2.0.1                crayon_1.4.1                 
#> [37] mime_0.10                     memoise_2.0.0                
#> [39] evaluate_0.14                 fansi_0.4.2                  
#> [41] tools_4.1.0                   org.Hs.eg.db_3.13.0          
#> [43] BiocStyle_2.20.0              lifecycle_1.0.0              
#> [45] stringr_1.4.0                 Biostrings_2.60.0            
#> [47] compiler_4.1.0                jquerylib_0.1.4              
#> [49] GenomeInfoDb_1.28.0           rlang_0.4.11                 
#> [51] RCurl_1.98-1.3                rappdirs_0.3.3               
#> [53] bitops_1.0-7                  rmarkdown_2.8                
#> [55] DBI_1.1.1                     curl_4.3.1                   
#> [57] R6_2.5.0                      knitr_1.33                   
#> [59] dplyr_1.0.6                   fastmap_1.1.0                
#> [61] bit_4.0.4                     utf8_1.2.1                   
#> [63] filelock_1.0.2                stringi_1.6.2                
#> [65] Rcpp_1.0.6                    vctrs_0.3.8                  
#> [67] png_0.1-7                     tidyselect_1.1.1             
#> [69] xfun_0.23

msigdb: The molecular signatures database (MSigDB) in R

Dharmesh D. Bhuva

20 May 2021