Note: On Aug 19 2019 GREAT released version 4 which supports hg38 genome and removes some ontologies such pathways. submitGreatJob() still takes hg19 as default. hg38 can be specified by argument species = "hg38". To use the older versions such as 3.0.0, specify as submitGreatJob(..., version = "3").
GREAT (Genomic Regions Enrichment of Annotations Tool) is a popular web-based tool to associate biological functions to genomic regions. The rGREAT package makes GREAT anlaysis automatic by first constructing a HTTP POST request according to user’s input and retrieving results from GREAT web server afterwards.
Load the package:
The input data is either a GRanges object or a BED-format data frame, no matter it is sorted or not. In following example, we use a data frame which is randomly generated.
##    chr   start      end
## 1 chr1 7634457  9204434
## 2 chr1 9853594 10435028Submit genomic regions by submitGreatJob(). Before submitting, genomic regions will be sorted and overlapping regions will be merged.
The returned variable job is a GreatJob class instance which can be used to retrieve results from GREAT server and stored results which are already downloaded.
You can get the summary of your job by directly calling job variable.
## Submit time: 2021-05-19 18:36:43 
## Version: 4.0.4 
## Species: hg19 
## Inputs: 1005 regions
## Background: wholeGenome 
## Model: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   NoneMore parameters can be set for the job:
job = submitGreatJob(bed, species = "mm9")
job = submitGreatJob(bed, bg, species = "mm9")
job = submitGreatJob(bed, adv_upstream = 10, adv_downstream = 2, adv_span = 2000)
job = submitGreatJob(bed, rule = "twoClosest", adv_twoDistance = 2000)
job = submitGreatJob(bed, rule = "oneClosest", adv_oneDistance = 2000)Also you can choose different versions of GREAT for the analysis.
Available parameters are (following content is copied from GREAT website):
species: “hg38”, “hg19”, “mm10”, “mm9” are supported in GREAT version 4.x.x, “hg19”, “mm10”, “mm9”, “danRer7” are supported in GREAT version 3.x.x and “hg19”, “hg18”, “mm9”, “danRer7” are supported in GREAT version 2.x.x.bgChoise: Background regions. wholeGenome and data. If this value is set to data, bg argument should be specifiedincludeCuratedRegDoms: Whether to include curated regulatory domains.rule: How to associate genomic regions to genes.
basalPlusExt: mode ‘Basal plus extension’. Gene regulatory domain definition: Each gene is assigned a basal regulatory domain of a minimum distance upstream and downstream of the TSS (regardless of other nearby genes). The gene regulatory domain is extended in both directions to the nearest gene’s basal domain but no more than the maximum extension in one direction.
adv_upstream: proximal extension to upstream (unit: kb)adv_downstream: proximal extension to downstream (unit: kb)adv_span: maximum extension (unit: kb)twoClosest: mode ‘Two nearest genes’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the nearest gene’s TSS but no more than the maximum extension in one direction.
adv_twoDistance: maximum extension (unit: kb)oneClosest: mode ‘Single nearest gene’. Gene regulatory domain definition: Each gene is assigned a regulatory domain that extends in both directions to the midpoint between the gene’s TSS and the nearest gene’s TSS but no more than the maximum extension in one direction.
adv_oneDistance: maximum extension (unit: kb)With job, we can now retrieve results from GREAT. The first and the primary results are the tables which contain enrichment statistics for the analysis. By default it will retrieve results from three GO Ontologies and all pathway ontologies. All tables contains statistics for all terms no matter they are significant or not. Users can then make filtering yb self-defined cutoff.
There is a column for adjusted p-values by “BH” method. Other p-value adjustment methods can be applied by p.adjust().
The returned value of getEnrichmentTables() is a list of data frames in which each one corresponds to tables for single ontology. The structure of data frames are same as the tables on GREAT website.
## The default enrichment tables contain no associated genes for the input regions. You can
## set `download_by = 'tsv'` to download the complete table, but note only the top 500
## regions can be retreived. See the following link:
## 
## https://great-help.atlassian.net/wiki/spaces/GREAT/pages/655401/Export#Export-GlobalExport## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"##           ID                                         name Binom_Genome_Fraction Binom_Expected
## 1 GO:0016798 hydrolase activity, acting on glycosyl bonds          0.0100418100     10.0920200
## 2 GO:0051787                    misfolded protein binding          0.0007646973      0.7685208
##   Binom_Observed_Region_Hits Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
## 1                         20              1.981763                 0.0199005      0.003633144
## 2                          4              5.204804                 0.0039801      0.007894598
##   Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits Hyper_Fold_Enrichment
## 1             1               127      10.756210                       17              1.580482
## 2             1                12       1.016335                        4              3.935710
##   Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage Hyper_Raw_PValue Hyper_Adjp_BH
## 1             0.010821130                0.1338583       0.03955741     1.0000000
## 2             0.002546149                0.3333333       0.01458949     0.6280924Information stored in job will be updated after retrieving enrichment tables.
## Submit time: 2021-05-19 18:36:43 
## Version: 4.0.4 
## Species: hg19 
## Inputs: 1005 regions
## Background: wholeGenome 
## Model: Basal plus extension 
##   Proximal: 5 kb upstream, 1 kb downstream,
##   plus Distal: up to 1000 kb
## Include curated regulatory domains
## 
## Enrichment tables for following ontologies have been downloaded:
##   GO Biological Process
##   GO Cellular Component
##   GO Molecular FunctionYou can get results by either specifying the ontologies or by the pre-defined categories (categories already contains pre-defined sets of ontologies):
tb = getEnrichmentTables(job, ontology = c("GO Molecular Function", "BioCyc Pathway"))
tb = getEnrichmentTables(job, category = c("GO"))As you have seen in the previous messages and results, The enrichment tables contain no associated genes. However, you can set download_by = 'tsv' in getEnrichmentTables() to download the complete tables, but due to the restriction from GREAT web server, only the top 500 regions can be retreived.
## [1] 500##                Ontology         ID                                                 Desc BinomRank
## 1 GO Molecular Function GO:0016798         hydrolase activity, acting on glycosyl bonds         1
## 2 GO Molecular Function GO:0051787                            misfolded protein binding         2
## 3 GO Molecular Function GO:0016799 hydrolase activity, hydrolyzing N-glycosyl compounds         3
## 4 GO Molecular Function GO:0000900 translation repressor activity, nucleic acid binding         4
## 5 GO Molecular Function GO:0004844                    uracil DNA N-glycosylase activity         5
## 6 GO Molecular Function GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds         6
##        BinomP BinomBonfP BinomFdrQ RegionFoldEnrich ExpRegions ObsRegions   GenomeFrac      SetCov
## 1 0.003633144          1         1         1.981763 10.0920200         20 0.0100418100 0.019900500
## 2 0.007894598          1         1         5.204804  0.7685208          4 0.0007646973 0.003980100
## 3 0.008692832          1         1         4.043178  1.2366510          5 0.0012304990 0.004975124
## 4 0.008774120          1         1         3.453267  1.7374850          6 0.0017288410 0.005970149
## 5 0.009371463          1         1        13.924090  0.1436360          2 0.0001429214 0.001990050
## 6 0.012096630          1         1         1.864224  9.1190760         17 0.0090737070 0.016915420
##   HyperRank       HyperP HyperBonfP  HyperFdrQ GeneFoldEnrich  ExpGenes ObsGenes TotalGenes
## 1       161 0.0395574100          1 1.00000000       1.580482 10.756210       17        127
## 2        98 0.0145894900          1 0.62809240       3.935710  1.016335        4         12
## 3       344 0.1102850000          1 1.00000000       2.146751  1.863281        4         22
## 4        26 0.0003721502          1 0.06038853       5.449444  1.101030        6         13
## 5       198 0.0603123900          1 1.00000000       4.722852  0.423473        2          5
## 6       180 0.0485899900          1 1.00000000       1.620586  8.638848       14        102
##    GeneSetCov   TermCov
## 1 0.010821130 0.1338583
## 2 0.002546149 0.3333333
## 3 0.002546149 0.1818182
## 4 0.003819223 0.4615385
## 5 0.001273074 0.4000000
## 6 0.008911521 0.1372549
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Regions
## 1 chr10:28982747-29009664,chr10:73401324-73765625,chr11:76186434-77258959,chr12:53623282-55658052,chr12:78776693-82267513,chr13:31546338-35719523,chr14:88100082-88802558,chr15:66717196-67068786,chr1:103764557-105740881,chr3:5263280-5764512,chr3:80776737-81360027,chr3:81815619-81922882,chr4:177298284-178454455,chr5:108206447-109354245,chr5:109583981-109647304,chr5:54399178-54715484,chr6:119435242-119806141,chr6:95026573-95490377,chr6:96124590-96566913,chr7:141672214-142165254
## 2                                                                                                                                                                                                                                                                                                                                                                                                chr22:21607529-22377646,chr2:182956275-184302935,chr3:5263280-5764512,chr5:174911166-178784645
## 3                                                                                                                                                                                                                                                                                                                                                                     chr12:53623282-55658052,chr4:177298284-178454455,chr5:108206447-109354245,chr5:109583981-109647304,chr5:54399178-54715484
## 4                                                                                                                                                                                                                                                                                                                                                chr10:93760800-93794630,chr17:38360990-38658964,chr18:34406674-37112725,chr4:14211768-14961015,chr5:172455557-174030576,chr5:79990154-80307883
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                chr12:53623282-55658052,chr5:54399178-54715484
## 6                                                                         chr10:28982747-29009664,chr10:73401324-73765625,chr11:76186434-77258959,chr12:78776693-82267513,chr13:31546338-35719523,chr14:88100082-88802558,chr15:66717196-67068786,chr1:103764557-105740881,chr3:5263280-5764512,chr3:80776737-81360027,chr3:81815619-81922882,chr5:108206447-109354245,chr5:109583981-109647304,chr6:119435242-119806141,chr6:95026573-95490377,chr6:96124590-96566913,chr7:141672214-142165254
##                                                                                                       Genes
## 1 ACER3,AMY1C,CCNO,EDEM1,ENSG00000257743,GALC,GBE1,KL,LCTL,LYZL1,MAN1A1,MAN2A1,MANEA,NEIL3,OTOGL,PSAP,SMUG1
## 2                                                                                  DNAJC10,EDEM1,F12,SDF2L1
## 3                                                                                   CCNO,MAN2A1,NEIL3,SMUG1
## 4                                                                         CELF4,CPEB2,CPEB3,CPEB4,DHFR,RARA
## 5                                                                                                CCNO,SMUG1
## 6                  ACER3,AMY1C,EDEM1,ENSG00000257743,GALC,GBE1,KL,LCTL,LYZL1,MAN1A1,MAN2A1,MANEA,OTOGL,PSAPAll available ontology names for given species can be get by availableOntologies() and all available ontology categories can be get by availableCategories(). Here you do not need to provide species information because job already contains it.
## [1] "GO Molecular Function"     "GO Biological Process"     "GO Cellular Component"    
## [4] "Mouse Phenotype"           "Mouse Phenotype Single KO" "Human Phenotype"          
## [7] "Ensembl Genes"## [1] "GO"        "Phenotype" "Genes"## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"Association between genomic regions and genes can be get by plotRegionGeneAssociationGraphs(). The function will make the three plots which are same as on GREAT website and returns a GRanges object which contains the gene-region associations.
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames          ranges strand |        gene   distTSS
##          <Rle>       <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 7634457-9204434      * |     SLC45A1     41560
##   [2]     chr1 7634457-9204434      * |        RERE    458078
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengthsFor those regions that are not associated with any genes under current settings, the corresponding gene and distTSS columns will be NA.
You can also choose only plotting one of the three figures.
By specifying ontology and term ID, you can get the association in a certain term. Here the term ID is from the first column of the data frame which is returned by getEnrichmentTables().
res = plotRegionGeneAssociationGraphs(job, ontology = "GO Molecular Function",
    termID = "GO:0004984")## The webpage for 'GOMolecularFunction:GO:0004984' is available at:
##   http://great.stanford.edu/public-4.0.4/cgi-bin/showTermDetails.php?termId=GO:0004984&ontoName=GOMolecularFunction&ontoUiName=GO Molecular Function&sessionName=20210519-public-4.0.4-2a1jv1&species=hg19&foreName=filede10b7a815217.gz&backName=&table=region## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames              ranges strand |        gene   distTSS
##          <Rle>           <IRanges>  <Rle> | <character> <numeric>
##   [1]     chr1 247871555-248931674      * |       OR2M4      -616
##   [2]     chr3   96368632-99050767      * |      OR5AC2    -96317
##   -------
##   seqinfo: 7 sequences from an unspecified genome; no seqlengths## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB             
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rGREAT_1.24.0        GenomicRanges_1.44.0 GenomeInfoDb_1.28.0  IRanges_2.26.0      
## [5] S4Vectors_0.30.0     BiocGenerics_0.38.0  knitr_1.33          
## 
## loaded via a namespace (and not attached):
##  [1] XVector_0.32.0         magrittr_2.0.1         zlibbioc_1.38.0        colorspace_2.0-1      
##  [5] rjson_0.2.20           R6_2.5.0               rlang_0.4.11           highr_0.9             
##  [9] stringr_1.4.0          tools_4.1.0            grid_4.1.0             circlize_0.4.12       
## [13] xfun_0.23              jquerylib_0.1.4        htmltools_0.5.1.1      yaml_2.2.1            
## [17] digest_0.6.27          crayon_1.4.1           GenomeInfoDbData_1.2.6 GlobalOptions_0.1.2   
## [21] sass_0.4.0             bitops_1.0-7           RCurl_1.98-1.3         shape_1.4.6           
## [25] evaluate_0.14          rmarkdown_2.8          stringi_1.6.2          compiler_4.1.0        
## [29] bslib_0.2.5.1          jsonlite_1.7.2         GetoptLong_1.0.5