---
title: "3.2 - Identifying differentially methylated probes"
output:
html_document:
self_contained: true
number_sections: no
theme: flatly
highlight: tango
mathjax: null
toc: true
toc_float: true
toc_depth: 2
css: style.css
bibliography: bibliography.bib
vignette: >
%\VignetteIndexEntry{"3.2 - Identifying differentially methylated probes"}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r, echo = FALSE,hide=TRUE, message=FALSE, warning=FALSE}
library(ELMER.data)
library(ELMER)
library(DT)
library(dplyr)
library(BiocStyle)
```
# Identifying differentially methylated probes
The first step is the identification of differentially methylated CpGs (DMCs) carried out by function `get.diff.meth`.
In the `Supervised` mode, we compare the DNA methylation level of each distal CpG for all samples in Group 1 compared to all samples Group 2, using an unpaired one-tailed t-test. In the `Unsupervised` mode, the samples of each group (Group 1 and Group 2) are ranked by their DNA methylation beta values for the given probe, and those samples in the lower quintile (20\% samples with the lowest methylation levels) of each group are used to identify if the probe is hypomethylated in Group 1 compared to Group 2. The reverse applies for the identification of hypermethylated probes. It is important to highlight that in the `Unsupervised` mode, each probe selected may be based on a different subset the samples, and thus probe sets from multiple molecular subtypes may be represented. In the `Supervised` mode, all tests are based on the same set of samples.
The 20\% is a parameter to the `diff.meth` function called `minSubgroupFrac`. For the unsupervised analysis, this is set to 20\% as in Yao et al. [@yao2015inferring], because we wanted to be able to detect a specific molecular subtype among samples; these subtypes often make up only a minority of samples, and 20\% was chosen as a lower bound for the purposes of statistical power (high enough sample numbers to yield t-test p-values that could overcome multiple hypotheses corrections, yet low enough to be able to capture changes in individual molecular subtypes occurring in 20\% or more of the cases.) This number can be set as an input to the `diff.meth` function and should be tuned based on sample sizes in individual studies.
In the `Supervised` mode, where the comparison groups are implicit in the sample set and labeled, the `minSubgroupFrac` parameter is set to 100\%. An example would be a cell culture experiment with 5 replicates of the untreated cell line, and another 5 replicates that include an experimental treatment.
To identify hypomethylated DMCs, a one-tailed t-test is used to rule out the null hypothesis: $\mu_{group1} \geq \mu_{group2}$, where $\mu_{group1}$ is the mean methylation within the lowest group 1 quintile (or another percentile as specified by the `minSubgroupFrac` parameter) and $\mu_{group2}$ is the mean within the lowest group 2 quintile. Raw p-values are adjusted for multiple hypothesis testing using the Benjamini-Hochberg method, and probes are selected when they had adjusted p-value less than $0.01$ (which can be configured using the `pvalue` parameter). For additional stringency, probes are only selected if the methylation difference: $\Delta = \mu_{group1} - \mu_{group2}$ was greater than $0.3$. The same method is used to identify hypermethylated DMCs, except we use the *upper* quintile, and the opposite tail in the t-test is chosen.
![Source: Yao, Lijing, et al. "Inferring regulatory element landscapes and transcription factor networks from cancer methylomes." Genome biology 16.1 (2015): 105.](figures/paper_diff_meth.jpg) [@yao2015inferring,@yao2015demystifying]
# Function arguments