Introduction

DNA methylation can be used to identify functional changes at transcriptional enhancers and other cis-regulatory modules (CRMs) in tumors and other primary disease tissues. Our R/Bioconductor package ELMER (Enhancer Linking by Methylation/Expression Relationships) provides a systematic approach that reconstructs gene regulatory networks (GRNs) by combining methylation and gene expression data derived from the same set of samples. ELMER uses methylation changes at CRMs as the central hub of these networks, using correlation analysis to associate them with both upstream master regulator (MR) transcription factors and downstream target genes.

This package can be easily applied to TCGA public available cancer data sets and custom DNA methylation and gene expression data sets.

ELMER analyses have 5 main steps:

  1. Identify distal probes on HM450K or EPIC arrays.
  2. Identify distal probes with significantly different DNA methylation level between two groups
  3. Identify putative target genes for differentially methylated distal probes.
  4. Identify enriched motifs for the distalprobes which are significantly differentially methylated and linked to putative target gene.
  5. Identify regulatory TFs whose expression associate with DNA methylation at enriched motifs.

Package workflow

The package workflow is showed in the figure below:

ELMER workflow: ELMER receives as input a DNA methylation object, a gene expression object (both can be either a matrix or a SummarizedExperiment object) and a Genomic Ranges (GRanges) object with distal probes to be used as a filter which can be retrieved using the get.feature.probe function. The function createMAE will create a Multi Assay Experiment object keeping only samples that have both DNA methylation and gene expression data. Genes will be mapped to genomic position and annotated using ENSEMBL database, while for probes it will add annotation from (http://zwdzwd.github.io/InfiniumAnnotation). This MAE object will be used as input to the next analysis functions. First, it identifies differentially methylated probes followed by the identification of their nearest genes (10 upstream and 10 downstream) through the get.diff.meth and GetNearGenes functions respectively. For each probe, it will verify if any of the nearby genes were affected by its change in the DNA methylation level and a list of gene and probes pairs will be outputted from get.pair function. For the probes in those pairs, it will search for enriched regulatory Transcription Factors motifs with the get.enriched.motif function. Finally, the enriched motifs will be correlated with the level of the transcription factor through the get.TFs function. In the figure green Boxes represent user input data, blue boxes represent output object, orange boxes represent auxiliary pre-computed data and gray boxes are functions.

ELMER workflow: ELMER receives as input a DNA methylation object, a gene expression object (both can be either a matrix or a SummarizedExperiment object) and a Genomic Ranges (GRanges) object with distal probes to be used as a filter which can be retrieved using the get.feature.probe function. The function createMAE will create a Multi Assay Experiment object keeping only samples that have both DNA methylation and gene expression data. Genes will be mapped to genomic position and annotated using ENSEMBL database, while for probes it will add annotation from (http://zwdzwd.github.io/InfiniumAnnotation). This MAE object will be used as input to the next analysis functions. First, it identifies differentially methylated probes followed by the identification of their nearest genes (10 upstream and 10 downstream) through the get.diff.meth and GetNearGenes functions respectively. For each probe, it will verify if any of the nearby genes were affected by its change in the DNA methylation level and a list of gene and probes pairs will be outputted from get.pair function. For the probes in those pairs, it will search for enriched regulatory Transcription Factors motifs with the get.enriched.motif function. Finally, the enriched motifs will be correlated with the level of the transcription factor through the get.TFs function. In the figure green Boxes represent user input data, blue boxes represent output object, orange boxes represent auxiliary pre-computed data and gray boxes are functions.

Main differences between ELMER v2 vs ELMER v1

Summary table

ELMER Version 1 ELMER Version 2
Primary data structure mee object (custom data structure) MAE object (Bioconductor data structure)
Auxiliary data Manually created Programmatically created
Number of human TFs 1,982 1,639 (curated list from Lambert, Samuel A., et al.)
Number of TF motifs 91 771 (HOCOMOCO v11 database)
TF classification 78 families 82 families and 331 subfamilies (TFClass database, HOCOMOCO)
Analysis performed Normal vs tumor samples Group 1 vs group 2
Statistical grouping Unsupervised only Unsupervised or supervised using labeled groups
TCGA data source The Cancer Genome Atlas (TCGA) (not available) The NCI’s Genomic Data Commons (GDC)
Genome of reference GRCh37 (hg19) GRCh37 (hg19)/GRCh38 (hg38)
DNA methylation platforms HM450 EPIC and HM450
Graphical User Interface (GUI) None TCGAbiolinksGUI
Automatic report None HTML summarizing results
Annotations None StateHub

Supervised vs Unsupervised mode

In ELMER v2 we introduce a new concept, the algorithm mode that can be either supervised or unsupervised. In the unsupervised mode (described in ELMER v1), it is assumed that one of the two groups is a heterogeneous mix of different (sometimes unknown) molecular phenotypes. For instance, in the example of Breast Cancer, normal breast tissues (Group A) are relatively homogenous, whereas Breast tumors fall into multiple molecular subtypes.

The assumption of the Unsupervised mode is that methylation changes may be restricted to a subset of one or more molecular subtypes, and thus only be present in a fraction of the samples in the test group. For instance, methylation changes related to estrogen signaling may only be present in LuminalA or LuminalB subtypes.

When this structure is unknown, the Unsupervised mode is the appropriate model, since it only requires changes in a subset of samples (by default, 20%). In contrast, in the Supervised mode, it is assumed that each group represents a more homogenous molecular phenotype, and thus we compare all samples in Group A vs. all samples in Group B. This can be used in the case of direct comparison of tumor subtypes (i.e. Luminal vs. Basal-like tumors), but can also be used in numerous other situations, including sorted cells of different types, or treated vs. untreated samples in perturbation experiments.

Installing and loading ELMER

To install this package from github (development version), start R and enter:

To install this package from Bioconductor start R and enter:

Then, to load ELMER enter:

Citing this work

If you used ELMER package or its results, please cite:

  • Yao, L., Shen, H., Laird, P. W., Farnham, P. J., & Berman, B. P. “Inferring regulatory element landscapes and transcription factor networks from cancer methylomes.” Genome Biol 16 (2015): 105.
  • Yao, Lijing, Benjamin P. Berman, and Peggy J. Farnham. “Demystifying the secret mission of enhancers: linking distal regulatory elements to target genes.” Critical reviews in biochemistry and molecular biology 50.6 (2015): 550-573.
  • Tiago C Silva, Simon G Coetzee, Nicole Gull, Lijing Yao, Dennis J Hazelett, Houtan Noushmehr, De-Chen Lin, Benjamin P Berman; ELMER v.2: An R/Bioconductor package to reconstruct gene regulatory networks from DNA methylation and transcriptome profiles, Bioinformatics, , bty902, https://doi.org/10.1093/bioinformatics/bty902

If you get TCGA data using getTCGA function, please cite TCGAbiolinks package:

  • Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, Sabedot T, Malta TM, Pagnotta SM, Castiglioni I, Ceccarelli M, Bontempi G and Noushmehr H. “TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data.” Nucleic acids research (2015): gkv1507.
  • Silva, TC, A Colaprico, C Olsen, F D’Angelo, G Bontempi, M Ceccarelli, and H Noushmehr. 2016. “TCGA Workflow: Analyze Cancer Genomics and Epigenomics Data Using Bioconductor Packages [Version 2; Referees: 1 Approved, 1 Approved with Reservations].” F1000Research 5 (1542). doi:10.12688/f1000research.8923.2.

  • Grossman, Robert L., et al. “Toward a shared vision for cancer genomic data.” New England Journal of Medicine 375.12 (2016): 1109-1112.

If you get use the Graphical user interface, please cite TCGAbiolinksGUI package:

  • Silva, Tiago C. and Colaprico, Antonio and Olsen, Catharina and Bontempi, Gianluca and Ceccarelli, Michele and Berman, Benjamin P. and Noushmehr, Houtan. “TCGAbiolinksGUI: A graphical user interface to analyze cancer molecular and clinical data” (bioRxiv 147496; doi: https://doi.org/10.1101/147496)

Bugs and questions

If you have questions, wants to report a bug, please use our github repository: http://www.github.com/tiagochst/ELMER

Paper supplemental material

TCGA-BRCA reports (paper supplemental material) can be found at https://tiagochst.github.io/ELMER_supplemental/

Session Info

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.13-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.13-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] MultiAssayExperiment_1.18.0 SummarizedExperiment_1.22.0
##  [3] Biobase_2.52.0              GenomicRanges_1.44.0       
##  [5] GenomeInfoDb_1.28.0         IRanges_2.26.0             
##  [7] S4Vectors_0.30.0            BiocGenerics_0.38.0        
##  [9] MatrixGenerics_1.4.0        matrixStats_0.58.0         
## [11] BiocStyle_2.20.0            dplyr_1.0.6                
## [13] DT_0.18                     ELMER_2.16.0               
## [15] ELMER.data_2.15.0          
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1                backports_1.2.1            
##   [3] circlize_0.4.12             Hmisc_4.5-0                
##   [5] BiocFileCache_2.0.0         plyr_1.8.6                 
##   [7] lazyeval_0.2.2              splines_4.1.0              
##   [9] crosstalk_1.1.1             BiocParallel_1.26.0        
##  [11] ggplot2_3.3.3               digest_0.6.27              
##  [13] foreach_1.5.1               ensembldb_2.16.0           
##  [15] htmltools_0.5.1.1           fansi_0.4.2                
##  [17] magrittr_2.0.1              checkmate_2.0.0            
##  [19] memoise_2.0.0               BSgenome_1.60.0            
##  [21] cluster_2.1.2               doParallel_1.0.16          
##  [23] openxlsx_4.2.3              ComplexHeatmap_2.8.0       
##  [25] Biostrings_2.60.0           readr_1.4.0                
##  [27] R.utils_2.10.1              prettyunits_1.1.1          
##  [29] jpeg_0.1-8.1                colorspace_2.0-1           
##  [31] ggrepel_0.9.1               blob_1.2.1                 
##  [33] rvest_1.0.0                 rappdirs_0.3.3             
##  [35] haven_2.4.1                 xfun_0.23                  
##  [37] crayon_1.4.1                RCurl_1.98-1.3             
##  [39] jsonlite_1.7.2              survival_3.2-11            
##  [41] VariantAnnotation_1.38.0    iterators_1.0.13           
##  [43] glue_1.4.2                  gtable_0.3.0               
##  [45] zlibbioc_1.38.0             XVector_0.32.0             
##  [47] GetoptLong_1.0.5            DelayedArray_0.18.0        
##  [49] car_3.0-10                  shape_1.4.6                
##  [51] abind_1.4-5                 scales_1.1.1               
##  [53] DBI_1.1.1                   rstatix_0.7.0              
##  [55] Rcpp_1.0.6                  viridisLite_0.4.0          
##  [57] progress_1.2.2              htmlTable_2.2.1            
##  [59] clue_0.3-59                 foreign_0.8-81             
##  [61] bit_4.0.4                   Formula_1.2-4              
##  [63] htmlwidgets_1.5.3           httr_1.4.2                 
##  [65] RColorBrewer_1.1-2          ellipsis_0.3.2             
##  [67] farver_2.1.0                reshape_0.8.8              
##  [69] pkgconfig_2.0.3             XML_3.99-0.6               
##  [71] R.methodsS3_1.8.1           Gviz_1.36.0                
##  [73] nnet_7.3-16                 sass_0.4.0                 
##  [75] dbplyr_2.1.1                utf8_1.2.1                 
##  [77] labeling_0.4.2              reshape2_1.4.4             
##  [79] tidyselect_1.1.1            rlang_0.4.11               
##  [81] AnnotationDbi_1.54.0        cellranger_1.1.0           
##  [83] munsell_0.5.0               tools_4.1.0                
##  [85] cachem_1.0.5                cli_2.5.0                  
##  [87] downloader_0.4              generics_0.1.0             
##  [89] RSQLite_2.2.7               broom_0.7.6                
##  [91] evaluate_0.14               stringr_1.4.0              
##  [93] fastmap_1.1.0               yaml_2.2.1                 
##  [95] knitr_1.33                  bit64_4.0.5                
##  [97] zip_2.1.1                   purrr_0.3.4                
##  [99] KEGGREST_1.32.0             AnnotationFilter_1.16.0    
## [101] TCGAbiolinks_2.20.0         R.oo_1.24.0                
## [103] xml2_1.3.2                  biomaRt_2.48.0             
## [105] compiler_4.1.0              rstudioapi_0.13            
## [107] plotly_4.9.3                filelock_1.0.2             
## [109] curl_4.3.1                  png_0.1-7                  
## [111] ggsignif_0.6.1              tibble_3.1.2               
## [113] bslib_0.2.5.1               stringi_1.6.2              
## [115] highr_0.9                   ps_1.6.0                   
## [117] TCGAbiolinksGUI.data_1.11.0 GenomicFeatures_1.44.0     
## [119] forcats_0.5.1               lattice_0.20-44            
## [121] ProtGenerics_1.24.0         Matrix_1.3-3               
## [123] vctrs_0.3.8                 pillar_1.6.1               
## [125] lifecycle_1.0.0             BiocManager_1.30.15        
## [127] jquerylib_0.1.4             GlobalOptions_0.1.2        
## [129] data.table_1.14.0           bitops_1.0-7               
## [131] rtracklayer_1.52.0          R6_2.5.0                   
## [133] BiocIO_1.2.0                latticeExtra_0.6-29        
## [135] rio_0.5.26                  gridExtra_2.3              
## [137] codetools_0.2-18            dichromat_2.0-0            
## [139] assertthat_0.2.1            rjson_0.2.20               
## [141] GenomicAlignments_1.28.0    Rsamtools_2.8.0            
## [143] GenomeInfoDbData_1.2.6      hms_1.1.0                  
## [145] grid_4.1.0                  rpart_4.1-15               
## [147] tidyr_1.1.3                 rmarkdown_2.8              
## [149] carData_3.0-4               Cairo_1.5-12.2             
## [151] ggpubr_0.4.0                biovizBase_1.40.0          
## [153] base64enc_0.1-3             restfulr_0.0.13