--- title: "Overview of the DMRcaller package" author: "Nicolae Radu Zabet\\thanks{e-mail: \\texttt{r.zabet@qmul.ac.uk}, Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, UK}" date: "`r Sys.Date()`" output: bookdown::html_document2: toc: true toc_depth: 3 number_sections: true toc_float: true fig_caption: true bookdown::pdf_document2: toc: true toc_depth: 3 number_sections: true fig_caption: true BiocStyle::html_document: bibliography: DMRcaller.bib csl: apa.csl link-citations: true vignette: > %\VignetteIndexEntry{Overview of the DMRcaller package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} local_caption <- paste( "Local methylation profile.", "The points on the graph represent methylation proportion of individual cytosines,", "their colour (red or blue) which sample they belong to and the intensity of the colour how many reads that particular cytosine had.", "This means that darker colours indicate stronger evidence that the corresponding cytosine has the corresponding methylation proportion,", "while lighter colours indicate a weaker evidence.", "The solid lines represent the smoothed profiles and the intensity of the colour the coverage at the corresponding position (darker colours indicate more reads while lighter ones less reads).", "The boxes on top represent the DMRs, where a filled box will represent a DMR which gained methylation while a box with a pattern represent a DMR that lost methylation.", "The DMRs need to have a metadata column `regionType` which can be either `gain` (where there is more methylation in condition 2 compared to condition 1) or `loss` (where there is less methylation in condition 2 compared to condition 1).", "In case this metadata column is missing all DMRs are drawn using filled boxes.", "Finally we also allow annotation of the DNA sequence.", "We represent by black boxes all the exons, which are joined by a horizontal black line, thus, marking the full body of the gene.", "With grey boxes we mark the transposable elements.", "Both for genes and transposable elements we plot them over a mid line if they are on the positive strand and under the mid line if they are on the negative strand." ) knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, fig.width = 7, fig.height = 5) ``` \setlength\parindent{12pt} # Introduction DNA methylation is an epigenetic modification of the DNA where a methyl group is added to the cytosine nucleotides. This modification is heritable, able to control the gene regulation and, in general, is associated with transcriptional gene silencing. While in mammals the DNA is predominantly methylated in CG context, in plants non-CG methylation (CHG and CHH, where H can be any of the A, C or T nucleotides) is also present and is important for the epigenetic regulation of transcription. Sequencing of bisulfite converted DNA has become the method of choice to determine genome wide methylation distribution. The *DMRcaller* package computes the set of Differentially Methylated Regions (DMRs) between two samples. *DMRcaller* will compute the differentially methylated regions from Whole Genome Bisulfite Sequencing (WGBS) or Reduced Representation Bisulfite Sequencing (RRBS) data. There are several tools able to call DMRs, but most work has been done in mammalian systems and, thus, they were designed to primarily call CG methylation. # Methods The package computes the DMRs using the CX report files generated by Bismark [@krueger_2011], which contain the number of methylated and unmmethylated reads for each cytosine in the genome. The coverage at each position on the genome is not homogeneous and this makes it difficult to compute the differentially methylated cytosines. Here, we implemented three methods: - **noise_filter** where we use a kernel [@hebestreit_2013] to smooth the number of methylated reads and the total number of reads (the *DMRcaller* package provides four kernels: `"uniform"`, `"triangular"`, `"gaussian"` and `"epanechnicov"`) - **neighbourhood** where individual cytosines in a specific context are considered in the analysis without any smoothing - **bins** where the genome is split into equal bins where all the reads are pooled together The DMRs are then computed by performing a statistical test between the number of methylated reads and the total number of reads in the two conditions for each position, cytosine or bin. In particular, we implemented two statistical tests: $(i)$ Fisher's exact test and $(ii)$ the Score test. The former (Fisher's exact test) uses the `fisher.test` in the *stats* package. The Score test is a statistical test of a simple null hypothesis that a parameter of interest is equal to some particular value. In our case, we are interested if the methylation levels in the two samples are equal or different. Given that $m_1$ is the number of methylated reads in condition 1, $m_2$ is the number of methylated reads in condition 2, $n_1$ is the total number of reads in condition 1 and $n_2$ is the total number of reads in condition 2, the Z-score of the Score test is $$ Z = \frac{(p_1 - p_2)\,\nu}{\sqrt{p(1-p)}} $$ where $p_1=m_1/n_1$, $p_2 = m_2/n_2$, $$ p = \frac{m_1 + m_2}{n_1 + n_2}\quad \textrm{and}\quad \nu = \sqrt{\frac{n_1n_2}{n_1 + n_2}} $$ We then convert the Z-score to the p-value assuming a normal distribution and a two sided test. ```{r pvalue, echo=TRUE, message=FALSE, eval=FALSE} pValue <- 2*pnorm(-abs(zScore)) ``` Finally, for both statistical tests (Fisher's exact test and Score test), we adjust the p-values for multiple testing using Benjamini and Hochberg's method [@benjamini_1995] to control the false discovery ```{r pvalue_adjust, echo=TRUE, message=FALSE, eval=FALSE} pValue <- p.adjust(pValue, method="fdr") ``` The algorithm performs the statistical test for each position, cytosine or bin and then marks as DMRs all positions/cytosines/bins that satisfy the following three conditions: - the difference in methylation levels between the two conditions is statistically significant according to the statistical test; - the difference in methylation proportion between the two conditions is higher than a threshold value; - the mean number of reads per cytosine is higher than a threshold. To group adjacent DMRs, we run an iterative process, where neighbouring DMRs (within a certain distance of each other) are joined only if these three conditions are still met after joining the DMRs. Finally, we filter the DMRs as follow - Remove DMRs whose lengths are less than a minimum size. - Remove DMRs with fewer cytosines than a threshold value. For a set of potential DMRs (e.g. genes, transposable elements or CpG islands) the user can call the function `filterDMRs` where all reads in a set of provided regions are pooled together and then the algorithm performs the statistical test for each region. # Description ## Data ### Bisulfite sequencing data Bismark [@krueger_2011] is a popular tool for methylation call on WGBS or RRBS data. *DMRcaller* takes as inputs the CX report files generated by Bismark and stores this data in a `GRanges` object. In the package, we included two CX report files that contain the methylation calls of WT and `met1-3` `Arabidopsis thaliana` [@stroud_2013]. `MET1` gene encodes for the main DNA methyltransferase in `Arabidopsis thaliana` and the `met1-3` mutation results in a genome-wide loss of DNA methylation (mainly in CG context). Due to running time, we restricted the data and analysis to the first $1\ Mb$ of the third chromosome of `A. thaliana`. ```{r load_presaved, message=FALSE,cache=FALSE} library(DMRcaller) #load presaved data data(methylationDataList) ``` To load a different dataset, one can use `readBismark` function, which takes as input the filename of the CX report file to be loaded. ```{r load, echo=TRUE, message=FALSE, cache=FALSE, eval=FALSE} # specify the Bismark CX report files saveBismark(methylationDataList[["WT"]], "chr3test_a_thaliana_wt.CX_report") saveBismark(methylationDataList[["met1-3"]], "chr3test_a_thaliana_met13.CX_report") # load the data methylationDataWT <- readBismark("chr3test_a_thaliana_wt.CX_report") methylationDataMet13 <- readBismark("chr3test_a_thaliana_met13.CX_report") methylationDataList <- GRangesList("WT" = methylationDataWT, "met1-3" = methylationDataMet13) ``` `methylationDataList` is a `GRangesList` object , where the `GRanges` elements contain four metadata columns - **context** - the context of the Cytosine (CG, CHG or CHH) - **readsM** - the number of methylated reads - **readsN** - the total number of reads - **trinucleotide_context** - the specific context of the cytosine (H is replaced by the actual nucleotide) If the data consists of two or more replicates, these can be pooled together using the function `poolMethylationDatasets` or `poolTwoMethylationDatasets` (in the case of pooling only two datasets). The latter function (`poolTwoMethylationDatasets`) is useful when the datasets are large and creating a `GRangesList` object is not possible (e.g. the `GRanges` objects are two large). ```{r pool_data, echo=TRUE, message=FALSE, cache=FALSE, eval=FALSE} # load the data methylationDataAll <- poolMethylationDatasets(methylationDataList) # In the case of 2 elements, this is equivalent to methylationDataAll <- poolTwoMethylationDatasets(methylationDataList[[1]], methylationDataList[[2]]) ``` Alternatively, one can use `ReadBismarkPool` to directly read a list of CX report files and pool them together. ```{r read_pool_data, echo=TRUE, message=FALSE, cache=FALSE, eval=FALSE} # load the data methylationDataAll <- readBismarkPool(c(file_wt, file_met13)) ``` ### Oxford Nanopore methylation data Oxford Nanopore Technologies (ONT) provides direct detection of modified bases (e.g., `5mC`) during sequencing by modelling electrical current disruptions at DNA pores. The methylation information is embedded as `MM` (modified base) and `ML` (modification probability) tags in aligned `BAM` files, which are typically produced after basecalling and alignment from raw signal files (`.pod5`). To analyse ONT methylation data with *DMRcaller*, users should first: 1. Get raw signal data (`.pod5`) from ONT sequencing device or public resources (e.g., [the 1000 Genomes ONT dataset](https://s3.amazonaws.com/1000g-ont/index.html?prefix=)). 2. Basecall the signals using [*dorado*](https://github.com/nanoporetech/dorado), a GPU-accelerated basecaller by ONT. 3. Align the reads and embed `MM`/`ML` tags using [*dorado*](https://github.com/nanoporetech/dorado) with methylation-aware models. 4. Decode the tags and summarise methylation calls with `readONTbam()`. #### Oxford Nanopore cytosine reference selection To analyse Oxford Nanopore (ONT) methylation data, one must first define the genomic positions of interest, i.e., the candidate cytosines in a specific methylation context (`CG`, `CHG`, or `CHH`). The `selectCytosine()` function provides this functionality by scanning a reference genome and returning a `GRanges` object of relevant cytosine positions. ```{r selectCytosine_example, echo=TRUE, message=FALSE, cache=FALSE,eval=FALSE} library(BSgenome.Hsapiens.UCSC.hg38) # Select cytosines in CG context on chromosome 1 gr_cpg <- selectCytosine( genome = BSgenome.Hsapiens.UCSC.hg38, context = "CG", chr = "chr1" ) gr_cpg ``` The resulting `GRanges` object contains one genomic range per cytosine that matches the specified context. For `CG` context, each range spans two bases (e.g., `CpG`). The following metadata columns are attached: - **context** - A factor indicating the context (`"CG"`, `"CHG"`, or `"CHH"`). - **trinucleotide_context** - Character or factor giving the surrounding trinucleotide sequence (or `NA` for `"CG"`). This function is particularly useful when preparing to quantify methylation signals from ONT `BAM` files using the `readONTbam()` function, as it provides the input cytosine sites to be queried in the aligned sequencing data. #### Reading Oxford Nanopore BAM files After alignment and basecalling, you can extract methylation calls using the `readONTbam()` function. This function decodes `MM`/`ML` tags and calculates per-site read-level methylation counts across the genome ```{r read_ONT, echo=TRUE,message=FALSE,cache=FALSE,eval=FALSE} # Path to ONT BAM (with MM/ML tags) bamfile <- system.file("extdata", "scanBamChr1Random5.bam", package="DMRcaller") # Decode methylation and compute read-level counts OntChr1CG <- readONTbam( bamfile = bamfile, ref_gr = gr_cpg, genome = BSgenome.Hsapiens.UCSC.hg38, modif = "C+m?", chr = "chr1", region = FALSE, prob_thresh= 0.5, parallel = FALSE, ) # View results OntChr1CG ``` **Notes on important parameters** - `genome`: The genome argument is required if `region = FALSE`, and must be a `BSgenome` object from the Bioconductor package *BSgenome*. This allows the function to extract cytosine coordinates and sequence contexts from the full reference genome. For example: ```{r BSgenome_available, echo=TRUE, cache=FALSE, message=FALSE, eval=FALSE} head(BSgenome::available.genomes()) ``` - `modif`: The modif parameter specifies which base modifications to extract based on the `BAM` file’s `MM` tag. The syntax follows the [SAMtags specification](chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://samtools.github.io/hts-specs/SAMtags.pdf) and is usually one of "`C+m?`", "`C+m.`", "`C+h?`" or "`C+h.`" For example, in this package we provide methylation data derived from the `GM18501` and `GM18876` B-Lymphocyte cell lines from the 1000 Genomes ONT dataset [@gustafson_2024]: - **Raw data**: [GM18501](https://s3.amazonaws.com/1000g-ont/index.html?prefix=pod5_data/GM18501_R9/), [GM18876](https://s3.amazonaws.com/1000g-ont/index.html?prefix=pod5_data/GM18876_R9/) - **Basecalling**: performed using `dorado v0.9.6` with the model `dna_r10.4.1_e8.2_400bps_hac@v5.2.0` The resulting BAM files contain `MM` and `ML` tags for `5mC` detection. These are processed with `readONTbam()` to compute per-cytosine methylation statistics. The example data included in the package is accessible as: ```{r load_ontSampleGRangesList, message=FALSE, cache=FALSE} # load presaved ont GRanges data data(ontSampleGRangesList) ``` This is a `GRangesList` object with one `GRanges` per sample (`GM18501` and `GM18876`), covering $1.5–2 Mbp$ of human chromosome 1. Each `GRanges` object contains the following metadata columns: - **context** - the context of the Cytosine (`CG`, `CHG` or `CHH`) - **readsM** - the number of methylated reads - **readsN** - the total number of reads - **trinucleotide_context** - the specific context of the cytosine (`H` is replaced by the actual nucleotide) - **ONT_Cm** - comma-delimited read-indices called modified - **ONT_C** - comma-delimited read-indices covering but unmodified This function allows for fine-grained downstream analysis, such as identifying differentially methylated regions (DMRs), variably methylated domains (VMDs), or co-methylation structures across samples. ## Low resolution profiles The *DMRcaller* package also offers the possibility to visualise context specific global changes in the methylation profile. To achieve this, the user can call `plotMethylationProfileFromData` function, which computes the mean methylation proportion in tiling bins of fixed size; see Figure \@ref(fig:lowResProfile). ```{r lowResProfile, fig.height=4, fig.width=12, fig.cap="Low resolution profile in CG context for WT and met1-3", out.width='95%',echo=TRUE, fig.align='center', message=FALSE, cache=FALSE} plotMethylationProfileFromData(methylationDataList[["WT"]], methylationDataList[["met1-3"]], conditionsNames = c("WT","met1-3"), windowSize = 10000, autoscale = FALSE, context = c("CG")) ``` Alternatively, for a finer control, the user can use `computeMethylationProfile` function to compute the methylation profile at certain locations on the genome. This function returns a `Ranges` object with four metadata columns - **sumReadsM** - the number of methylated reads - **sumReadsN** - the total number of reads - **Proportion** - the proportion of methylated reads - **context** - the context One or more of these `GRanges` objects can be put in a `GRangesList` object which is then passed as a parameter to the `plotMethylationProfile` function. ```{r profile_fine, echo=TRUE, message=FALSE, cache=FALSE, fig.width=5,fig.height=5, fig.align='center', eval=FALSE} regions <- GRanges(seqnames = Rle("Chr3"), ranges = IRanges(1,1E6)) # compute low resolution profile in 10 Kb windows profileCGWT <- computeMethylationProfile(methylationDataList[["WT"]], regions, windowSize = 10000, context = "CG") profileCGMet13 <- computeMethylationProfile(methylationDataList[["met1-3"]], regions, windowSize = 10000, context = "CG") profilesCG <- GRangesList("WT" = profileCGWT, "met1-3" = profileCGMet13) #plot the low resolution profile par(mar=c(4, 4, 3, 1)+0.1) par(mfrow=c(1,1)) plotMethylationProfile(profilesCG, autoscale = FALSE, labels = NULL, title="CG methylation on Chromosome 3", col=c("#D55E00","#E69F00"), pch = c(1,0), lty = c(4,1)) ``` ## Coverage of the bisulfite sequencing data The number of reads from the bisulfite sequencing can differ significantly between different locations on the genome in the sense that cytosines in the same context (including neighbouring cytosines) can display large variability in the coverage. To plot the coverage of the bisulfite sequencing datasets, one can use `plotMethylationDataCoverage` function which takes as input one or two datasets and the vector with the thresholds used to compute the proportion of cytosines with at least that many reads; see Figure \@ref(fig:coverage). ```{r coverage, fig.width=6,fig.height=5,out.width='60%', fig.cap="Coverage. For example, this figure shows that in WT only $30%$ of the cytosines in CHH context have at least 10 reads", fig.align='center', message=FALSE,cache=FALSE} # plot the coverage in the two contexts par(mar=c(4, 4, 3, 1)+0.1) plotMethylationDataCoverage(methylationDataList[["WT"]], methylationDataList[["met1-3"]], breaks = c(1,5,10,15), regions = NULL, conditionsNames=c("WT","met1-3"), context = c("CHH"), proportion = TRUE, labels=LETTERS, contextPerRow = FALSE) ``` Alternatively, the *DMRcaller* also provides the `computeMethylationDataCoverage` function which returns a numeric vector with the number or proportion of cytosines in a specific context that have at least a certain number of reads specified by the input vector `breaks`. ```{r coverage_compute, message=FALSE,cache=FALSE, eval=FALSE} coverageCGWT <- computeMethylationDataCoverage(methylationDataList[["WT"]], context="CG", breaks = c(1,5,10,15)) ``` ## Spatial correlation of methylation levels Methylation levels are often spatially correlated and some methods to detect DMRs assume this spatial correlation. Nevertheless, different tissues, samples and even methylation context will display different levels of correlation. *DMRcaller* implements `plotMethylationDataSpatialCorrelation` function that plots the correlation of methylation levels as function of distance between cytosines. This function takes as input one or two datasets and the vector with the distances between cytosines; see Figure \@ref(fig:correlationPlot). ```{r correlationPlot, fig.width=6,fig.height=5, fig.cap="Spatial correlation of methylation levels", fig.align='center', out.width='60%', message=FALSE,cache=FALSE} # compute the spatial correlation of methylation levels plotMethylationDataSpatialCorrelation(methylationDataList[["WT"]], distances = c(1,100,10000), regions = NULL, conditionsNames = c("WT"), context = c("CG"), labels = LETTERS, col = NULL, pch = c(1,0,16,2,15,17), lty = c(4,1,3,2,6,5), contextPerRow = FALSE, log = "x") ``` Alternatively, the *DMRcaller* also provides the `computeMethylationDataSpatialCorrelation` function which returns a numeric vector with the correlation of methylation levels of cytosines separated by certain distances in a specific context. ```{r correlation_compute, message=FALSE,cache=FALSE, eval=FALSE} correlation_CG_wt <- computeMethylationDataSpatialCorrelation(methylationDataList[["WT"]], context="CG", distances=c(1,10,100,1000,10000)) ``` ## Calling DMRs *DMRcaller* package provides `computeDMRs` function to call DMRs. The output of this function is a `GRanges` with 11 metadata columns. - **direction** - a numeric value indicating whether the methylation was lost in the second condition compared to the first one ($-1$) or gained ($+1$) - **context** - the context of the cytosine (CG, CHG or CHH) - **sumReadsM1** - the number of methylated reads in the DMR in condition 1 - **sumReadsN1** - the total number of reads in the DMR in condition 1 - **proportion1** - the proportion of methylated reads in the DMR in condition 1 - **sumReadsM2** - the number of methylated reads in the DMR in condition 2 - **sumReadsN2** - the total number of reads in the DMR in condition 2 - **proportion2** - the proportion of methylated reads in the DMR in condition 2 - **cytosinesCount** - the number of cytosines in the DMR - **pValue** - the adjusted p-value of the statistical test - **regionType** - a character string indicating whether the methylation was lost in the second condition compared to the first one (*"loss"*) or gained (*"gain"*) For predifined regions (e.g. genes, transposons or CpG islands) the user can call `filterDMRs` function to extract the list of regions that are differentially methylated. The output of this function is again a GRanges with the same 11 metadata columns. Below we present examples of calling both functions. ```{r compute_DMRs_CG_noise_filter, message=FALSE,cache=FALSE} chr_local <- GRanges(seqnames = Rle("Chr3"), ranges = IRanges(5E5,6E5)) # compute the DMRs in CG context with noise_filter method DMRsNoiseFilterCG <- computeDMRs(methylationDataList[["WT"]], methylationDataList[["met1-3"]], regions = chr_local, context = "CG", method = "noise_filter", windowSize = 100, kernelFunction = "triangular", test = "score", pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.4, minGap = 0, minSize = 50, minReadsPerCytosine = 4, cores = 1) print(DMRsNoiseFilterCG) ``` ```{r compute_DMRs_CG_neighbourhood, message=FALSE,cache=FALSE} # compute the DMRs in CG context with neighbourhood method DMRsNeighbourhoodCG <- computeDMRs(methylationDataList[["WT"]], methylationDataList[["met1-3"]], regions = chr_local, context = "CG", method = "neighbourhood", test = "score", pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.4, minGap = 200, minSize = 1, minReadsPerCytosine = 4, cores = 1) print(DMRsNeighbourhoodCG) ``` ```{r compute_DMRs_CG_bins, message=FALSE,cache=FALSE} # compute the DMRs in CG context with bins method DMRsBinsCG <- computeDMRs(methylationDataList[["WT"]], methylationDataList[["met1-3"]], regions = chr_local, context = "CG", method = "bins", binSize = 100, test = "score", pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.4, minGap = 200, minSize = 50, minReadsPerCytosine = 4, cores = 1) print(DMRsBinsCG) ``` ```{r compute_DMRs_CG_GE, message=FALSE,cache=FALSE} # load the gene annotation data data(GEs) #select the genes genes <- GEs[which(GEs$type == "gene")] # compute the DMRs in CG context over genes DMRsGenesCG <- filterDMRs(methylationDataList[["WT"]], methylationDataList[["met1-3"]], potentialDMRs = genes[overlapsAny(genes, chr_local)], context = "CG", test = "score", pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.4, minReadsPerCytosine = 3, cores = 1) print(DMRsGenesCG) ``` ## Merge DMRs Finally, for merging adjacent DMRs, *DMRcaller* provides the function `mergeDMRsIteratively` which can be used as follows: ```{r merge_DMRs_CG_noise_filter, message=FALSE,cache=FALSE} DMRsNoiseFilterCGMerged <- mergeDMRsIteratively(DMRsNoiseFilterCG, minGap = 200, respectSigns = TRUE, methylationDataList[["WT"]], methylationDataList[["met1-3"]], context = "CG", minProportionDifference = 0.4, minReadsPerCytosine = 4, pValueThreshold = 0.01, test="score") print(DMRsNoiseFilterCGMerged) ``` Note that two neighbouring DMRs will be merged if all the conditions below are met - they are within a distance from each other smaller than minGap - the difference in methylation levels between the two conditions is statistically significant according to the statistical test when the two DMRs are joined - the difference in methylation proportion between the two conditions is higher than a threshold value when the two DMRs are joined - the number of reads per cytosine is higher than a threshold when the two DMRs are joined ## Calling DMRs using biological replicates The package also contains a set of functions for the analysis of multiple biological replicates. The synthetic dataset is made by 300 different cytosines, extracted from those present in the *A. thaliana* dataset. The value for **readsN** are created using the function `rnorm`, while the values for `readsM` are generated using the function `rbinom`. The probabilities used are 0.1 in the external region and 0.8 in the central region. In this way a DMR should be detected in the central region of the synthetic dataset. The difference in proportion is plotted in figure \@ref(fig:differenceMethylationReplicates). ```{r differenceMethylationReplicates, fig.width=12,fig.height=5,out.width='95%', fig.cap="Methylation proportions in the synthetic dataset.", fig.align='center',message=FALSE,cache=FALSE} # loading synthetic data data("syntheticDataReplicates") # create vector with colours for plotting cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") # plotting the difference in proportions plot(start(syntheticDataReplicates), syntheticDataReplicates$readsM1/syntheticDataReplicates$readsN1, ylim=c(0,1), col=cbbPalette[2], xlab="Position in Chr3 (bp)", ylab="Methylation proportion") points(start(syntheticDataReplicates), syntheticDataReplicates$readsM2/syntheticDataReplicates$readsN2, col=cbbPalette[7], pch=4) points(start(syntheticDataReplicates), syntheticDataReplicates$readsM3/syntheticDataReplicates$readsN3, col=cbbPalette[3], pch=2) points(start(syntheticDataReplicates), syntheticDataReplicates$readsM4/syntheticDataReplicates$readsN4, col=cbbPalette[6], pch=3) legend(x = "topleft", legend=c("Treatment 1", "Treatment 2", "Control 1", "Control 2"), pch=c(1,4,2,3), col=cbbPalette[c(2,7,3,6)], bty="n", cex=1.0) ``` The DMRs are computed using the function `computeDMRsReplicates`, which uses beta regression [@ferrari_2004] to detect differential methylation. ```{r computingBiologicalReplicates, message=FALSE,cache=FALSE, eval=TRUE} if (requireNamespace("GenomeInfoDb", quietly = TRUE)) { # starting with data joined using joinReplicates # creating condition vector condition <- c("a", "a", "b", "b") # computing DMRs using the neighbourhood method DMRsReplicatesBins <- computeDMRsReplicates(methylationData = syntheticDataReplicates, condition = condition, regions = NULL, context = "CG", method = "bins", binSize = 100, test = "betareg", pseudocountM = 1, pseudocountN = 2, pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.4, minGap = 0, minSize = 50, minReadsPerCytosine = 4, cores = 1) print(DMRsReplicatesBins) } ``` ## Extract methylation data in regions `analyseReadsInsideRegionsForCondition` function can extract additional information in a set of genomic regions (including DMRs) from any `methylationData` object. For example, to establish a link between the CG and CHH methylation, one might want to extract the number of methylated reads and the total number of reads in CHH context inside DMRs called in CG context. ```{r analyse_reads_inside_regions_for_condition, message=FALSE,cache=FALSE} #retrive the number of reads in CHH context in WT in CG DMRs DMRsNoiseFilterCGreadsCHH <- analyseReadsInsideRegionsForCondition( DMRsNoiseFilterCGMerged, methylationDataList[["WT"]], context = "CHH", label = "WT") print(DMRsNoiseFilterCGreadsCHH) ``` ## Calling partially methylated domains (PMDs) The *DMRcaller* package provides the `computePMDs` function to identify **partially methylated domains (PMDs)** from both WGBS and Oxford Nanopore methylation data. PMDs are large genomic regions with intermediate methylation levels (e.g., 40–60%) and can be identified using three available methods: "`noise_filter`", "`neighbourhood`", and "`bins`" The output of this function is a GRanges object with five metadata columns: - context – the context of the cytosine (CG, CHG or CHH) - sumReadsM – the number of methylated reads in the PMD - sumReadsN – the total number of reads in the PMD - proportion – the average methylation proportion in the PMD (between minMethylation and maxMethylation) - cytosinesCount – the number of cytosines considered in the PMD To compute PMDs in a specific region of interest, define a GRanges object. If not specified, PMDs are computed genome-wide. ```{r compute_PMDs_CG_noise_filter, message=FALSE, cache=FALSE,eval=FALSE} # load the ONT methylation data data(ontSampleGRangesList) # define the region of interest on chr1 chr1_ranges <- GRanges(seqnames = Rle("chr1"), ranges = IRanges(1E6+5E5,2E6)) # compute the PMDs in CG context using the noise_filter method PMDsNoiseFilterCG <- computePMDs(ontSampleGRangesList[["GM18501"]], regions = chr1_ranges, context = "CG", method = "noise_filter", windowSize = 100, kernelFunction = "triangular", lambda = 0.5, minCytosinesCount = 4, minMethylation = 0.4, maxMethylation = 0.6, minGap = 200, minSize = 50, minReadsPerCytosine = 4, cores = 1, parallel = FALSE) PMDsNoiseFilterCG ``` ```{r compute_PMDs_CG_neighbourhood, message=FALSE, cache=FALSE, eval=FALSE} # compute the PMDs using the neighbourhood method PMDsNeighbourhoodCG <- computePMDs(ontSampleGRangesList[["GM18501"]], regions = chr1_ranges, context = "CG", method = "neighbourhood", minCytosinesCount = 4, minMethylation = 0.4, maxMethylation = 0.6, minGap = 200, minSize = 50, minReadsPerCytosine = 4, cores = 1, parallel = FALSE) PMDsNeighbourhoodCG ``` ```{r compute_PMDs_CG_bins, message=FALSE, cache=FALSE, eval=FALSE} # compute the PMDs using the bins method PMDsBinsCG <- computePMDs(ontSampleGRangesList[["GM18501"]], regions = chr1_ranges, context = "CG", method = "bins", binSize = 100, minCytosinesCount = 4, minMethylation = 0.4, maxMethylation = 0.6, minGap = 200, minSize = 50, minReadsPerCytosine = 4, cores = 1, parallel = FALSE) PMDsBinsCG ``` ## Calling variably methylated domains (VMDs) using ONT data The *DMRcaller* package provides the `computeVMDs` function to identify **variably methylated domains (VMDs)** — genomic regions showing high or low variability in methylation levels across reads at each cytosine. This is particularly relevant for identifying regulatory instability or stochastic epigenetic variation using Oxford Nanopore sequencing data. The function segments the genome into *fixed-size bins*, evaluates the *per-read methylation variability* within each bin, and filters bins based on their **weighted standard deviation (SD)** using user-defined or data-driven thresholds. The output is a GRanges object with the following metadata columns: - **context** – the context in which the VMDs were computed ("`CG`", "`CHG`", or "`CHH`") - **sumReadsM** – total number of methylated reads across the region - **sumReadsN** – total number of reads across the region - **proportion** – average methylation level across cytosines - **cytosinesCount** – number of cytosines in the region - **mean** – mean of per-read methylation proportions - **sd** – standard deviation of per-read methylation proportions - **w_mean** – weighted mean (based on read count per cytosine) - **w_sd** – weighted standard deviation To compute VMDs in a specific region or genome-wide, you must specify bin size and a method for selecting variance thresholds. ```{r compute_VMDs_CG, message=FALSE, cache=FALSE, eval=FALSE} # load the ONT methylation data data(ontSampleGRangesList) # define the region of interest on chr1 chr1_ranges <- GRanges(seqnames = Rle("chr1"), ranges = IRanges(1E6+5E5, 1E6+6E5)) # compute the VMDs in CG context using the "EDE.high" variance threshold VMDsBinsCG <- computeVMDs(ontSampleGRangesList[["GM18501"]], regions = chr1_ranges, context = "CG", binSize = 100, minCytosinesCount = 4, sdCutoffMethod = "per.high", # elbow-detected high variance percentage = 0.05, # only used if sdCutoffMethod = "per.high" or "per.low" minGap = 200, minSize = 50, minReadsPerCytosine = 4, parallel = FALSE, cores = 1) VMDsBinsCG ``` The `sdCutoffMethod` determines how to select high or low variance regions: - "`per.high`" or "`per.low`" filter to top/bottom bins based on a quantile cutoff - "`EDE.high`" or "`EDE.low`" apply an elbow detection algorithm by the `inflection::ede` from package *inflection* to automatically determine the cutoff point in the SD distribution. ## Detecting Variably Methylated Regions (VMRs) using ONT data The function `filterVMRsONT()` is designed to identify **Variably Methylated Regions (VMRs)** between two ONT methylation datasets. These are genomic regions (e.g., genes, transposable elements, or CpG islands) that exhibit statistically significant differences in **methylation variability** or **methylation levels** between two conditions. This function applies the following ONT-specific statistical filters: - **Wilcoxon rank-sum test** on per-read methylation proportions - **F-test** on variance of per-read methylation proportions - Optional: fold-change cutoffs on variance ratio, confidence interval filtering, and coverage/methylation thresholds We use the example ONT methylation data and transcript annotation to identify VMRs across genes within a region of chromosome 1. ```{r filter_VMRs, message=FALSE, cache=FALSE, eval=FALSE} # load the ONT methylation data data(ontSampleGRangesList) # load the gene annotation data data(GEs_hg38) # select the transcript transcript <- GEs_hg38[which(GEs_hg38$type == "transcript")] # the regions where to compute the PMDs regions <- GRanges(seqnames = Rle("chr1"), ranges = IRanges(1E6+5E5,2E6)) transcript <- transcript[overlapsAny(transcript, regions)] # filter genes that are differntially methylated in the two conditions VMRsGenesCG <- filterVMRsONT(ontSampleGRangesList[["GM18501"]], ontSampleGRangesList[["GM18876"]], potentialVMRs = transcript, context = "CG", pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.01, minReadsPerCytosine = 3, ciExcludesOne = TRUE, varRatioFc = NULL, parallel = TRUE) # parallel recommended ``` The function returns a `GRanges` object containing the same genomic intervals as the input `potentialVMRs`, enriched with the following metadata columns: - **sumReadsM1** / **sumReadsN1**: Total methylated / total reads in condition 1 - **sumReadsM2** / **sumReadsN2**: Total methylated / total reads in condition 2 - **proportion1** / **proportion2**: Methylation proportions - **variance1** / **variance2**: Variance of per-read methylation levels - **wilcox_pvalue**: FDR-adjusted p-value from *Wilcoxon test* - **f_pvalue**: FDR-adjusted p-value from *F-test* - **var_ratio**: Ratio of variances (condition1 / condition2) - **is_DMR**: TRUE if *Wilcoxon test* passed significance threshold - **is_VMR**: TRUE if *F-test* passed significance threshold - **regionType**: "`gain`" or "`loss`" of methylation - **direction**: `+1` (gain) or `-1` (loss) Only regions satisfying all filtering criteria (e.g., coverage, cytosine count, variance CI exclusion, proportion difference) are retained in the output. For strict variance detection, set `ciExcludesOne = TRUE` and optionally apply `varRatioFc = 2` for 2-fold variance filtering. ## Calling Co-Methylation using ONT data The *DMRcaller* package provides the `computeCoMethylation` function to detect **pairwise co-methylation** patterns between CpG, CpHpG or CpHpH sites within user-defined genomic regions, such as **PMDs**, **genes**, or **repeat elements**. This is especially useful when exploring coordinated methylation regulation or identifying methylation-linked chromatin interactions using *long-read Nanopore sequencing data*. For each region, the function computes **strand-specific pairwise associations** between methylation positions that fall within a specified genomic distance range. It constructs a 2×2 contingency table summarising methylation co-occurrence across reads and performs a statistical test to evaluate co-dependency. The output is a list of `GInteractions` objects (one per input region) containing significant pairs and their co-methylation statistics. Each significant interaction includes: - **C1_C2** – number of reads methylated at both cytosines - **C1_only** – number of reads methylated only at the first cytosine - **C2_only** – number of reads methylated only at the second cytosine - **neither** – number of reads methylated at neither cytosines - **strand** – DNA strand of the pair ("`+`" or "`-`") - **genomic_position** – region ID in UCSC/IGV format (e.g. "`chr1:1522971-1523970`") - **p.value** – FDR-adjusted p-value from Fisher’s exact or permutation test ```{r compute_comethylation_fisher, message=FALSE, cache=FALSE, eval=FALSE} # load ONT methylation and PMD data data("ont_gr_GM18870_chr1_PMD_bins_1k") data("ont_gr_GM18870_chr1_sorted_bins_1k") # compute co-methylation using Fisher's exact test coMethylationFisher <- computeCoMethylation( ont_gr_GM18870_chr1_sorted_bins_1k, regions = ont_gr_GM18870_chr1_PMD_bins_1k[1:4], minDistance = 150, maxDistance = 1000, minCoverage = 1, pValueThreshold = 0.01, test = "fisher", parallel = FALSE) coMethylationFisher ``` **Note on region size and multiple testing correction** The `computeCoMethylation` function applies `FDR` correction globally across **all cytosine pairs** tested within the input regions. As such, the number and range of regions provided can influence the adjusted p-values. For example, providing smaller or fewer regions results in fewer total comparisons, which can lead to more conservative (higher) adjusted p-values for the same cytosine pairs. Therefore, when designing region inputs for co-methylation analysis, it’s important to note that adjusted significance depends not only on the pairwise test but also on the size and number of regions analysed. ## Plotting the distribution of DMRs, PMDs, VMDs or VMRs Sometimes, it is useful to obtain the distribution of the DMRs over the chromosomes. The *DMRcaller* provides the `computeOverlapProfile` function, which computes this distribution. The `GRanges` object generated by this function can then be added to a `GRangesList` object, which can be plotted using `plotOverlapProfile` function; see Figure \@ref(fig:DMRdensity). Additionally, the `plotOverlapProfile` function allows the user to specify two `GRangesList`, thus, allowing the plotting of distributions of hypo or hyper methylated DMRs separately. ```{r DMRdensity, fig.width=15,fig.height=2.5,out.width='95%', fig.cap= "Distribution of DMRs. Darker colour indicates higher density, while lighter colour lower density.", fig.align='center', message=FALSE,cache=FALSE} # compute the distribution of DMRs hotspots <- computeOverlapProfile(DMRsNoiseFilterCGMerged, chr_local, windowSize=5000, binary=TRUE) # plot the distribution of DMRs plotOverlapProfile(GRangesList("Chr3"=hotspots)) ``` ## Plotting profiles with DMRs, PMDs, VMDs or VMRs Finally, *DMRcaller* package also provides a function to plot methylation profiles at a specific location on the genome. To plot the methylation profile the user needs to call the `plotLocalMethylationProfile` function; see Figure \@ref(fig:localMethylationProfile). ```{r localMethylationProfile, fig.width=15,fig.height=5,out.width='95%', fig.cap=local_caption, fig.align='center',message=FALSE,cache=FALSE} # select a 20 Kb location on the Chr3 chr3Reg <- GRanges(seqnames = Rle("Chr3"), ranges = IRanges(510000,530000)) # create a list with all DMRs DMRsCGList <- list("noise filter" = DMRsNoiseFilterCGMerged, "neighbourhood" = DMRsNeighbourhoodCG, "bins" = DMRsBinsCG, "genes" = DMRsGenesCG) # plot the local profile par(cex=0.9) par(mar=c(4, 4, 3, 1)+0.1) plotLocalMethylationProfile(methylationDataList[["WT"]], methylationDataList[["met1-3"]], chr3Reg, DMRsCGList, conditionsNames = c("WT", "met1-3"), GEs, windowSize = 300, main="CG methylation") ``` # Parallel computation Computing the DMRs can be computationally intensive. For example, in the case of *A. thaliana* (with a genome of $\approx 130\ Mb$), it can take several hours to compute the DMRs depending on the method used and on the number of DMRs. To speed up computations, *DMRcaller* supports parallel computing of DMRs using the package *BiocParallel*, which able to run parallel works in any types of processor (including Windows, Mac, Linux and HPC conditions). ```{r DMRinParallel, message=FALSE,cache=FALSE} # Parallel; compute the DMRs in CG context with noise_filter method DMRsNoiseFilterCG_3cores <- computeDMRs(methylationDataList[["WT"]], methylationDataList[["met1-3"]], regions = chr_local, context = "CG", method = "noise_filter", windowSize = 100, kernelFunction = "triangular", test = "score", pValueThreshold = 0.01, minCytosinesCount = 4, minProportionDifference = 0.4, minGap = 0, minSize = 50, minReadsPerCytosine = 4, parallel = TRUE, # default: FALSE BPPARAM = NULL, # default: NULL cores = 3 # default: 1 ) print(DMRsNoiseFilterCG_3cores) ``` The five functions used for computing and filtering the DMRs, PMDs and VMRs (`computeDMRs`, `filterDMRs`, `mergeDMRsIteratively`, `analyseReadsInsideRegionsForCondition`, `computeDMRsReplicates`,`computePMDs`, `filterPMDs`, `mergePMDsIteratively`, `analyseReadsInsideRegionsForConditionPMD`, and `filterVMRsONT`) accept three parameters - `parallel` - run in parallel if `TRUE`.; *default setting*: `FALSE` - `BPPARAM` - A `BiocParallelParam` object controlling parallel execution. This value will automatically set when parallel is `TRUE`, also able to set as manually.; *default setting*: `NULL` - `cores` - which specifies the number of cores that can be used when performing the corresponding computations (must not exceed `BPPARAM$workers`). This value will automatically set as the maximum number of system workers, also able to set as manually.; *default setting*: `1` These two function used for reading ONT bam file and computing co-methylation (`readONTbam`,`computeCoMethylation`) accept `parallel` and `BPPARAM` only. When using 10 cores, it can take between 10 and 30 minutes to compute the DMRs in *A. thaliana* depending on the selected parameters. # Session information ```{r session_info, echo=TRUE} sessionInfo() ``` # References