--- title: "3.2 - Identifying differentially methylated probes" output: html_document: self_contained: true number_sections: no theme: flatly highlight: tango mathjax: null toc: true toc_float: true toc_depth: 2 css: style.css bibliography: bibliography.bib vignette: > %\VignetteIndexEntry{"3.2 - Identifying differentially methylated probes"} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE,hide=TRUE, message=FALSE, warning=FALSE} library(ELMER.data) library(ELMER) library(DT) library(dplyr) library(BiocStyle) ```
# Identifying differentially methylated probes The first step is the identification of differentially methylated CpGs (DMCs) carried out by function `get.diff.meth`. In the `Supervised` mode, we compare the DNA methylation level of each distal CpG for all samples in Group 1 compared to all samples Group 2, using an unpaired one-tailed t-test. In the `Unsupervised` mode, the samples of each group (Group 1 and Group 2) are ranked by their DNA methylation beta values for the given probe, and those samples in the lower quintile (20\% samples with the lowest methylation levels) of each group are used to identify if the probe is hypomethylated in Group 1 compared to Group 2. The reverse applies for the identification of hypermethylated probes. It is important to highlight that in the `Unsupervised` mode, each probe selected may be based on a different subset the samples, and thus probe sets from multiple molecular subtypes may be represented. In the `Supervised` mode, all tests are based on the same set of samples. The 20\% is a parameter to the `diff.meth` function called `minSubgroupFrac`. For the unsupervised analysis, this is set to 20\% as in Yao et al. [@yao2015inferring], because we wanted to be able to detect a specific molecular subtype among samples; these subtypes often make up only a minority of samples, and 20\% was chosen as a lower bound for the purposes of statistical power (high enough sample numbers to yield t-test p-values that could overcome multiple hypotheses corrections, yet low enough to be able to capture changes in individual molecular subtypes occurring in 20\% or more of the cases.) This number can be set as an input to the `diff.meth` function and should be tuned based on sample sizes in individual studies. In the `Supervised` mode, where the comparison groups are implicit in the sample set and labeled, the `minSubgroupFrac` parameter is set to 100\%. An example would be a cell culture experiment with 5 replicates of the untreated cell line, and another 5 replicates that include an experimental treatment. To identify hypomethylated DMCs, a one-tailed t-test is used to rule out the null hypothesis: $\mu_{group1} \geq \mu_{group2}$, where $\mu_{group1}$ is the mean methylation within the lowest group 1 quintile (or another percentile as specified by the `minSubgroupFrac` parameter) and $\mu_{group2}$ is the mean within the lowest group 2 quintile. Raw p-values are adjusted for multiple hypothesis testing using the Benjamini-Hochberg method, and probes are selected when they had adjusted p-value less than $0.01$ (which can be configured using the `pvalue` parameter). For additional stringency, probes are only selected if the methylation difference: $\Delta = \mu_{group1} - \mu_{group2}$ was greater than $0.3$. The same method is used to identify hypermethylated DMCs, except we use the *upper* quintile, and the opposite tail in the t-test is chosen. ![Source: Yao, Lijing, et al. "Inferring regulatory element landscapes and transcription factor networks from cancer methylomes." Genome biology 16.1 (2015): 105.](figures/paper_diff_meth.jpg) [@yao2015inferring,@yao2015demystifying] # Function arguments

Main get.diff.meth arguments

| Argument | Description | |------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | data | A `multiAssayExperiment` with DNA methylation and Gene Expression data. See `createMAE` function. | | diff.dir | A character can be "hypo", "hyper" or "both", showing differential methylation dirction. It can be "hypo" which is only selecting hypomethylated probes (one tailed test); "hyper" which is only selecting hypermethylated probes (one tailed test); or "both" which are probes differenly methylated (two tailed test). | | minSubgroupFrac | A number ranging from 0 to 1,specifying the fraction of extreme samples from group 1 and group 2 that are used to identify the differential DNA methylation. The default is 0.2 because we typically want to be able to detect a specific (possibly unknown) molecular subtype among tumor; these subtypes often make up only a minority of samples, and 20\% was chosen as a lower bound for the purposes of statistical power. If you are using pre-defined group labels, such as treated replicates vs. untreated replicated, use a value of 1.0 (***Supervised*** mode) | | pvalue | A number specifies the significant P value (adjusted P value by BH) cutoff for selecting significant hypo/hyper-methylated probes. Default is 0.01 | | group.col | A column defining the groups of the sample. You can view the available columns using: `colnames(MultiAssayExperiment::colData(data))`. | | group1 | A group from group.col. ELMER will run group1 vs group2. That means, if direction is hyper, get probes hypermethylated in group 1 compared to group 2. | | group2 | A group from group.col. ELMER will run group1 vs group2. That means, if direction is hyper, get probes hypermethylated in group 1 compared to group 2. | | sig.dif | A number specifies the smallest DNA methylation difference as a cutoff for selecting significant hypo/hyper-methylated probes. Default is 0.3. |

# Example of use ```{r,eval=TRUE, message=FALSE, warning = FALSE, results = "hide"} mae <- get(load("mae.rda")) sig.diff <- get.diff.meth(data = mae, group.col = "definition", group1 = "Primary solid Tumor", group2 = "Solid Tissue Normal", minSubgroupFrac = 0.2, # if supervised mode set to 1 sig.dif = 0.3, diff.dir = "hypo", # Search for hypomethylated probes in group 1 cores = 1, dir.out ="result", pvalue = 0.01) ``` ```{r,eval=TRUE, message=FALSE, warning = FALSE} head(sig.diff) %>% datatable(options = list(scrollX = TRUE)) # get.diff.meth automatically save output files. # - getMethdiff.hypo.probes.csv contains statistics for all the probes. # - getMethdiff.hypo.probes.significant.csv contains only the significant probes which # is the same with sig.diff # - a volcano plot with the diff mean and significance levels dir(path = "result", pattern = "getMethdiff") ``` ```{r,eval=TRUE, message=FALSE, warning = FALSE,echo=FALSE} group1 <- "Primary solid Tumor" group2 <- "Solid Tissue Normal" out <- readr::read_csv(dir(path = "result", pattern = "getMethdiff.hypo.probes.csv",full.names = TRUE)) TCGAbiolinks:::TCGAVisualize_volcano(x = as.data.frame(out)[,grep("Minus",colnames(out),value = T)], y = out$adjust.p, title = paste0("Volcano plot - Probes ", "hypomethylated in ", group1, " vs ", group2,"\n"), filename = NULL, label = c("Not Significant", paste0("Hypermethylated in ",group1), paste0("Hypomethylated in ",group1)), ylab = expression(paste(-Log[10], " (FDR corrected P-values) [one tailed test]")), xlab = expression(paste( "DNA Methylation difference (",beta,"-values)") ), x.cut = 0.3, y.cut = 0.01) ``` # Bibliography