--- title: "MethReg: estimating regulatory potential of DNA methylation in gene transcription" author: - name: Tiago Chedraoui Silva affiliation: University of Miami Miller School of Medicine email: txs902 at miami.edu - name: Lily Wang affiliation: University of Miami Miller School of Medicine email: lily.wangg at gmail.com package: MethReg output: BiocStyle::html_document: toc_float: true toc: true df_print: paged code_download: false toc_depth: 3 link-citations: true bibliography: bibliography.bib editor_options: chunk_output_type: inline vignette: > %\VignetteIndexEntry{MethReg: estimating regulatory potential of DNA methylation in gene transcription} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} ---
```{css, echo = FALSE, eval = T} .whiteCode { background-color: black; border-color: #337ab7 !important; border: 1px solid; } ``` ```{r settings, include = FALSE} options(width = 100) knitr::opts_chunk$set(collapse = TRUE, comment = "#>",class.source = "whiteCode") library(dplyr) ``` ```{r sesame, include = FALSE, eval = TRUE} library(sesameData) ``` # Introduction Transcription factors (TFs) are proteins that facilitate the transcription of DNA into RNA. A number of recent studies have observed that the binding of TFs onto DNA can be affected by DNA methylation, and in turn, DNA methylation can also be added or removed by proteins associated with transcription factors [@bonder2017disease; @banovich2014methylation; @zhu2016transcription]. To provide functional annotations for differentially methylated regions (DMRs) and differentially methylated CpG sites (DMS), `MethReg` performs integrative analyses using matched DNA methylation and gene expression along with Transcription Factor Binding Sites (TFBS) data. MethReg evaluates, prioritizes and annotates DNA methylation regions (or sites) with high regulatory potential that works synergistically with TFs to regulate target gene expressions, without any additional ChIP-seq data. The results from `MethReg` can be used to generate testable hypothesis on the synergistic collaboration of DNA methylation changes and TFs in gene regulation. `MethReg` can be used either to evaluate regulatory potentials of candidate regions or to search for methylation coupled TF regulatory processes in the entire genome. # Installation `MethReg` is a Bioconductor package and can be installed through `BiocManager::install()`. ```{r, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") BiocManager::install("MethReg", dependencies = TRUE) ``` After the package is installed, it can be loaded into R workspace by ```{r setup, eval = TRUE} library(MethReg) ``` # MethReg workflow The figure below illustrates the workflow for MethReg. Given matched array DNA methylation data and RNA-seq gene expression data, MethReg additionally incorporates TF binding information from ReMap2020 [@remap2020] or the JASPAR2020 [@JASPAR2020; @fornes2020jaspar] database, and optionally additional TF-target gene interaction databases, to perform both promoter and distal (enhancer) analysis. In the unsupervised mode, MethReg analyzes all CpGs on the Illumina arrays. In the supervised mode, MethReg analyzes and prioritizes differentially methylated CpGs identified in EWAS. There are three main steps: (1) create a dataset with triplets of CpGs, TFs that bind near the CpGs, and putative target genes, (2) for each triplet (CpG, TF, target gene), apply integrative statistical models to DNA methylation, target gene expression, and TF expression values, and (3) visualize and interpret results from statistical models to estimate individual and joint impacts of DNA methylation and TF on target gene expression, as well as annotate the roles of TF and CpG methylation in each triplet. The results from the statistical models will also allow us to identify a list of CpGs that work synergistically with TFs to influence target gene expression. ```{r workflow, fig.cap = "Figure 2 Workflow of MethReg. Data - MethReg required inputs are: (1) array-based DNA methylation data (HM450/EPIC) with beta-values, (2) RNA-seq data with normalized counts, (3) estimated TF activities from the RNA-seq data using GSVA or VIPER software. Creating triplets – there are multiple ways to create a CpG-TF-target gene triplet: (1) CpGs can be mapped to human TFs by using TF motifs from databases such as JASPAR or ReMap and scanning the CpGs region to identify if there is a binding site (2) CpG can be mapped to target genes using a distance-based approach, CpGs will be linked to a gene if it is on promoter region (+-1K bp away from the TSS), for a distal CpG it can be linked to either all genes within a fixed width (i.e. 500k bp), or a fixed number of genes upstream and downstream of the CpG location (3) TF-target genes can be retrieved from external databases (i.e. Cistrome Cancer; Dorothea). Using two different pairs (i.e. CpG-TF, TF-target gene), triplets can then be created. Analysis – each triplet will be evaluated using a robust linear model in which TF activity is estimated from GSVA/Viper software and DNAm.group is a binary variable used to model if the sample CpG belongs to the top 25% methylation levels or the lowest 25% methylation levels. Results – MethReg will output the prioritized triplets and classify both the TF role on the gene expression (repressor or activator) and the DNAm effect on the TF activity (Enhancing or attenuating).", echo = FALSE, fig.width = 13} png::readPNG("workflow_methReg.png") %>% grid::grid.raster() ``` # Analysis illustration ## Input data For illustration, we will use chromosome 21 data from 38 TCGA-COAD (colon cancer) samples. ### Input DNA methylation dataset The DNA methylation dataset is a matrix or SummarizedExperiment object with methylation beta or M-values (The samples are in the columns and methylation regions or probes are in the rows). If there are potential confounding factors (e.g. batch effect, age, sex) in the dataset, this matrix would contain residuals from fitting linear regression instead (see details **Section 5** "Controlling effects from confounding variables" below). #### Analysis for individual CpGs data We will analyze all CpGs on chromosome 21 in this vignette. However, oftentimes, the methylation data can also be, for example, **differentially methylated sites** (DMS) or **differentially methylated regions** (DMRs) obtained in an epigenome-wide association study (EWAS) study. ```{R warning=FALSE} data("dna.met.chr21") ``` ```{R} dna.met.chr21[1:5,1:5] ``` We will first create a SummarizedExperiment object with the function `make_dnam_se`. This function will use the Sesame R/Bioconductor package to map the array probes into genomic regions. You cen set human genome version (hg38 or hg19) and the array type ("450k" or "EPIC") ```{R} dna.met.chr21.se <- make_dnam_se( dnam = dna.met.chr21, genome = "hg38", arrayType = "450k", betaToM = FALSE, # transform beta to m-values verbose = FALSE # hide informative messages ) ``` ```{R} dna.met.chr21.se SummarizedExperiment::rowRanges(dna.met.chr21.se)[1:4,1:4] ``` #### Analysis of DMRs Differentially Methylated Regions (DMRs) associated with phenotypes such as tumor stage can be obtained from R packages such as `coMethDMR`, `comb-p`, `DMRcate` and many others. The methylation levels in multiple CpGs within the DMRs need to be summarized (e.g. using medians), then the analysis for DMR will proceed in the same way as those for CpGs. ### Input gene expression dataset The gene expression dataset is a matrix with log2 transformed and normalized gene expression values. If there are potential confounding factors (e.g. batch effect, age, sex) in the dataset, this matrix can also contain residuals from linear regression instead (see **Section 6** "Controlling effects from confounding variables" below). The samples are in the columns and the genes are in the rows. ```{R} data("gene.exp.chr21.log2") gene.exp.chr21.log2[1:5,1:5] ``` We will also create a SummarizedExperiment object for the gene expression data. This object will contain the genomic information for each gene. ```{R} gene.exp.chr21.se <- make_exp_se( exp = gene.exp.chr21.log2, genome = "hg38", verbose = FALSE ) gene.exp.chr21.se SummarizedExperiment::rowRanges(gene.exp.chr21.se)[1:5,] ``` ### Creating triplet dataset #### Creating triplet dataset using distance based approaches and JASPAR2022 In this section, **regions** refer to the regions where CpGs are located. ##### Linking a region to a target gene To evaluate the DNA methylation effect on the expression of a gene, first we need to define which are the possible affected genes. For this we initially define if the DNA methylation occurred withing a promoter regions, defined as 2 kbp upstream and 2 kbp downstream of the transcription start site (TSS), or in a non-promoter region, also known as distal regions, that could behave like enhancer of the gene expression. Enhancers can increase the transcription of genes and are found in different locations (upstream or downstream of genes, within introns). Their functional complexity lies in the possibility genes located more distantly than the neighboring genes and being able to regulate multiple genes [@pennacchio2013enhancers]. Also, enhancer–promoter looping could happen at two sequences within approximately 1 Mb of each other [@pennacchio2013enhancers]. @williamson2011enhancers also highlighted not only that a proportion of enhancers are situated hundreds to thousands of kilobases from their target genes, often in large gene-poor regions, but also the promiscuous activity when placed within gene-rich domains. These promoters and enhancers interactions could be further identified using Chromosome conformation capture techniques such as 3C, 4C, Hi-C. However, in the lack of this information one could use the position information in the genome to link an enhancer to a candidate target gene. Such problem is also identified in the GWAS studies, for example, @brodie2016far found that affected genes are often up to $2 Mbps$ away from the associated SNP and highlighted that some studies suggested to use a cutoff of $500 Kbps$ since enhancers and repressors may be as distant as $500 Kbps$ from their genes. The issue of this method is that with a big window in gene-rich regions would map to several genes, and a small window might not map the gene-poor region, making the decision on the window size very difficult. Another method was presented by @yao2015inferring which provided a linkage method based on a fixed quantity of genes upstream and downstream of the enhancers. MethReg offer two methods for enhancer linking 1) a window-based method similar to the ones in the GWAS studies, 2) a fixed number of genes upstream and downstream of the DNA methylation loci similar to the one suggested by @yao2015inferring, and one method for promoter linking, which maps to the gene of the promoter region. The function `create_triplet_distance_based` provides those three different methods to link a region to a target gene: 1. Mapping the region to the closest gene (`target.method = "genes.promoter.overlap"`) 2. Mapping the region to a specific number of genes upstream down/upstream of the region (`target.method = "nearby.genes"`) [@silva2019elmer]. 3. Mapping the region to all the genes within a window (default size = 500 kbp around the region, i.e. +- 250 kbp from start or end of the region) (`target.method = "window"`) [@reese2019epigenome]. ```{r plot, fig.cap = "Different target linking strategies", echo = FALSE} png::readPNG("mapping_target_strategies.png") %>% grid::grid.raster() ``` For the analysis of probes in gene promoter region, we recommend setting `method = "genes.promoter.overlap"`, or `method = "closest.gene"`. For the analysis of probes in distal regions, we recommend setting either `method = "window"` or `method = "nearby.genes"`. Note that the distal analysis will be more time and resource consuming. To link regions to TF using JASPAR2022, MethReg uses `motifmatchr` [@motifmatchr] to scan these regions for occurrences of motifs in the database. JASPAR2020 is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles [@JASPAR2022; @fornes2022jaspar], which contains more the 500 human TF motifs. The motif search width of the scanned region is one important parameter. Although TF recognizes short specific DNA sequence motifs ($6–12 bp$) [@leporcq2020tfmotifview], the output of a ChIP-seq experiment can include peaks longer than $1000 bp$ [@boeva2016analysis], but most of the motifs are found $\pm$ $50-75 bp$ from the TF peak center [@heinz2010simple]. Also, recently, it has been shown by @grossman2018positional that TFs have different positional bindings around nucleosome-depleted regions of DNA, which could range from $\pm200bp$ around the center of the DNaseI-hypersensitive (DHS) sites defined by the Roadmap Epigenomics project and @wang2019identification showed that the methylation levels at UM (unmethylated motifs) and MM (methylated Motifs) were also altered within that range. Since a single CpG is only 1bp, to predict if the methylation at the loci would affect the TF binding site, we suggest using a motif search window no bigger than $400bp$. The argument `motif.search.window.size` will be used to extend the region when scanning for the motifs, for example, a `motif.search.window.size` of `50` will add `25` bp upstream and `25` bp downstream of the original region. As an example, the following scripts link CpGs with the probes in gene promoter region (method 1. above) ```{R, message = FALSE, results = "hide"} triplet.promoter <- create_triplet_distance_based( region = dna.met.chr21.se, target.method = "genes.promoter.overlap", genome = "hg38", target.promoter.upstream.dist.tss = 2000, target.promoter.downstream.dist.tss = 2000, motif.search.window.size = 400, motif.search.p.cutoff = 1e-08, cores = 1 ) ``` Alternatively, we can also link each probe with genes within $500 kb$ window (method 2). ```{R, message = FALSE, results = "hide"} # Map probes to genes within 500kb window triplet.distal.window <- create_triplet_distance_based( region = dna.met.chr21.se, genome = "hg38", target.method = "window", target.window.size = 500 * 10^3, target.rm.promoter.regions.from.distal.linking = TRUE, motif.search.window.size = 500, motif.search.p.cutoff = 1e-08, cores = 1 ) ``` For method 3, to map probes to 5 nearest upstream and downstream genes: ```{R, message = FALSE, results = "hide"} # Map probes to 5 genes upstream and 5 downstream triplet.distal.nearby.genes <- create_triplet_distance_based( region = dna.met.chr21.se, genome = "hg38", target.method = "nearby.genes", target.num.flanking.genes = 5, target.window.size = 500 * 10^3, target.rm.promoter.regions.from.distal.linking = TRUE, motif.search.window.size = 400, motif.search.p.cutoff = 1e-08, cores = 1 ) ``` #### Creating triplet dataset using distance based approaches and REMAP2020 Instead of using JASPAR2020 motifs, we will be using REMAP2018 catalogue of TF peaks which can be access either using the package `ReMapEnrich` or a most updated version (RemMap2022) is available online at https://remap.univ-amu.fr/download_page ```{r, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") BiocManager::install("remap-cisreg/ReMapEnrich", dependencies = TRUE) ``` To download REMAP2018 catalogue (~1Gb) the following functions are used: ```{R, eval = FALSE} library(ReMapEnrich) remapCatalog2018hg38 <- downloadRemapCatalog("/tmp/", assembly = "hg38") remapCatalog <- bedToGranges(remapCatalog2018hg38) ``` The function `create_triplet_distance_based` will accept any Granges with TF information in the same format as the `remapCatalog` one. ```{R, eval = FALSE} #------------------------------------------------------------------------------- # Triplets promoter using remap #------------------------------------------------------------------------------- triplet.promoter.remap <- create_triplet_distance_based( region = dna.met.chr21.se, genome = "hg19", target.method = "genes.promoter.overlap", TF.peaks.gr = remapCatalog, motif.search.window.size = 400, max.distance.region.target = 10^6, ) ``` #### Creating triplet dataset using regulon-based approaches The human regulons from the dorothea database will be used as an example: ```{r, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") BiocManager::install("dorothea", dependencies = TRUE) ``` ```{R} regulons.dorothea <- dorothea::dorothea_hs regulons.dorothea %>% head ``` Using the regulons, you can calculate enrichment scores for each TF across all samples using dorothea and viper. ```{R} rnaseq.tf.es <- get_tf_ES( exp = gene.exp.chr21.se %>% SummarizedExperiment::assay(), regulons = regulons.dorothea ) ``` Finally, triplets can be identified using TF-target from regulon databases with the function `create_triplet_regulon_based`. ```{R, message = FALSE, results = "hide"} triplet.regulon <- create_triplet_regulon_based( region = dna.met.chr21.se, genome = "hg38", motif.search.window.size = 400, tf.target = regulons.dorothea, max.distance.region.target = 10^6 # 1Mbp ) ``` ```{R} triplet.regulon %>% head ``` #### Example of triplet data frame The triplet is a data frame with the following columns: * `target`: gene identifier (obtained from row names of the gene expression matrix), * `regionID`: region/CpG identifier (obtained from row names of the DNA methylation matrix) * `TF`: gene identifier (obtained from the row names of the gene expression matrix) ```{R} str(triplet.promoter) triplet.promoter$distance_region_target_tss %>% range triplet.promoter %>% head ``` Note that there may be multiple rows for a CpG region, when multiple target gene and/or TFs are found close to it. ## Evaluating the regulatory potential of CpGs (or DMRs) Because TF binding to DNA can be influenced by (or influences) DNA methylation levels nearby [@yin2017impact], target gene expression levels are often resulted from the synergistic effects of both TF and DNA methylation. In other words, TF activities in gene regulation is often affected by DNA methylation. Our goal then is to highlight DNA methylation regions (or CpGs) where these synergistic DNAm and TF collaborations occur. We will perform analyses using the 3 datasets described above in Section 3: * An input DNA methylation matrix * An input Gene expression matrix * The created triplet data frame ### Analysis using model with methylation by TF interaction The function `interaction_model` assess the regulatory impact of DNA methylation on TF regulation of target genes via the following approach: **considering DNAm values as a binary variable** - we define a binary variable `DNAm Group` for DNA methylation values (high = 1, low = 0). That is, samples with the highest DNAm levels (top 25 percent) has high = 1, samples with lowest DNAm levels (bottom 25 pecent) has high = 0. Note that in this implementation, only samples with DNAm values in the first and last quartiles are considered. $$log_2(RNA target) \sim log_2(TF) + \text{DNAm Group} + log_2(TF) * \text{DNAm Group}$$ ```{R interaction_model, message = FALSE, results = "hide", eval = TRUE} results.interaction.model <- interaction_model( triplet = triplet.promoter, dnam = dna.met.chr21.se, exp = gene.exp.chr21.se, dnam.group.threshold = 0.1, sig.threshold = 0.05, fdr = T, stage.wise.analysis = FALSE, filter.correlated.tf.exp.dnam = F, filter.triplet.by.sig.term = T ) ``` The output of `interaction_model` function will be a data frame with the following variables: * `