Here we demonstrate the analysis of DNA methylation dependent on one
or more predictors. The predictors might be tissue, cell type, sex, age,
tumor/normal, case/control or a combination of these factors. The
DML
(Differential Methylation Locus) function models β values (DNA methylation levels)
using mixed linear models. This general supervised learning framework
identifies CpG loci whose differential methylation is associated with
known co-variates and can be used to perform epigenome-wide association
studies (EWAS). Let’s first load the needed packages:
library(sesame)
library(SummarizedExperiment)
sesameDataCache() # required at new sesame installation
In the following, we will use an MM285 dataset of 10 mouse samples.
This dataset contains mouse samples from different tissues and mice of
different ages and sexes. The dataset is stored in a
SummarizedExperiment
object, which contains a data matrix
combined with column-wise metadata accessible with
colData
:
se = sesameDataGet("MM285.10.SE.tissue")[1:1000,] # an arbitrary 1000 CpGs
cd = as.data.frame(colData(se)); rownames(cd) = NULL
cd
CRITICAL: If your data contains NA
, it
is required that you exclude CpGs with missing levels. For example, one
cannot assess sex-specific DNA methylation for a CpG that only has
non-NA measurement on one sex. Exclusion of such CpGs for differential
methylation modeling can be done using the checkLevels
function. Here, we will check this for both sex and tissue:
se_ok = (checkLevels(assay(se), colData(se)$sex) &
checkLevels(assay(se), colData(se)$tissue))
sum(se_ok) # the number of CpGs that passes
## [1] 995
NOTE: If your model include discrete contrast
variables like tissue and sex as in the current example, you should
consider explicitly turning it into a factor variable with a reference
level (we use treatment coding
, see different
coding systems).
For example, to use Colon
as the reference tissue and
Female
as the reference sex, one can do the following
colData(se)$tissue <- relevel(factor(colData(se)$tissue), "Colon")
colData(se)$sex <- relevel(factor(colData(se)$sex), "Female")
Then we will model DNA methylation variation treating tissue and sex
as covariates. To do that we will call the DML
function and
specify the R formula ~tissue + sex
. This function fits DNA
methylation reading to a linear model and perform the corresponding
slope test and goodness-of-fit test (F-test holding out each contrast
variable). See also formula
for how to specify lm/glm-like symbolic form for regression. All these
results are returned in an object of class DMLSummary
:
## DMLSummary Object with 995 Loci, 10 samples.
## Contrasts: tissue, sex
You can use
DML(..., BPPARAM=BiocParallel::MulticoreParam(8))
argument
to parallelize the computing.
The DMLSummary
object is a list of slightly expanded
summary.lm
objects as is typically returned by
summary(lm())
. The summaryExtractTest
function
extracts some key test statistics from the DMLSummary
object and store them in a data frame. Rows of the data frame correspond
to CpGs/loci and columns contain the slopes and p-values of each
variable.
test_result = summaryExtractTest(smry)
colnames(test_result) # the column names, show four groups of statistics
## [1] "Probe_ID" "Est_X.Intercept." "Est_tissueCecum"
## [4] "Est_tissueEsophagus" "Est_tissueFat" "Est_tissueHeart"
## [7] "Est_sexMale" "Pval_X.Intercept." "Pval_tissueCecum"
## [10] "Pval_tissueEsophagus" "Pval_tissueFat" "Pval_tissueHeart"
## [13] "Pval_sexMale" "FPval_tissue" "FPval_sex"
## [16] "Eff_tissue" "Eff_sex"
With the exception of the Intercept
, there are four
groups of columns, each starting with “Est_”, “Pval_”, “FPval_”, and
“Eff_” as prefix. Here are what they represent:
Est_* : The slope estimate (aka the β coefficient, not to be confused
with the DNA methylation β-value though) for continuous
variable. DNA methylation difference of the current level with respect
to the reference level for nominal contrast variables. Each suffix is
concatenated from the contrast variable name (e.g., tissue, sex) and the
level name if the contrast variable is discrete (e.g, Cecum, Esophagus,
Fat). For example, Est_tissueFat
should be interpreted as
the estimated methylation level difference of Fat compared to the
reference tissue (which is Colon
, as set above). If
reference is not set, the first level in the alphabetic order is used as
the reference level. There is a special column named
Est_`(Intercept)`
. It corresponds to the base-level
methylation of the reference (in this case a Female Colon
sample).
Pval_* : The unadjusted p-values of t-testing
the slope. This represents the statistical significance of the
methylation difference. For example, Pval_tissueFat
tests
whether Fat
is significantly different from
Colon
(the reference level) in DNA methylation. The
Pval_`(Intercept)`
tests whether the reference level is
significantly different from zero.
FPval_* : The unadjusted p-value of the F-test contrasting the full model against a reduced model with the labeled contrast variable held out. Note that “Pval_” and “FPval_” are equivalent when the contrast variable is a 2-level factor, i.e., in the case of a pairwise comparison.
Eff_* : The effect size of each normial contrast variable. This is equivalent to the maximum slope subtracted by the minimum level including the reference level (0).
Multiple-testing adjustment can be done afterwards using R’s
p.adjust
function. It is integrated to the DMR
function by default (see below).
One may want to ask a question like
Is the CpG methylation tissue-specific?
rather than
Is the CpG more methylated in fat compared to liver?
The first question ask about the contrast variable as a whole while
the second question concerns only a specific level in the contrast
variable. To answer this question, we can use an F-test contasting the
full model with a reduced model with the target contrast held out. By
default, this statistics is already computed in the DML
function. The test result is recorded in the FPval_
columns. For example, to get all CpGs that are methylated specific to
sex,
library(dplyr)
library(tidyr)
test_result %>% dplyr::filter(FPval_sex < 0.05, Eff_sex > 0.1) %>%
select(FPval_sex, Eff_sex)
Here we used 0.1 as the effect size threshold. This means DNA methylation difference under 0.1 (10%) is considered not biologically meaningful. This can be a valid assumption for homogenous cell composition as most cells would be either biallelically methylated, unmethylated or monoallelically methylated. But different threshold can be used in different analysis scenarios.
We can define CpG methylation as sex-specific, tissue-specific or both, by:
test_result %>%
mutate(sex_specific =
ifelse(FPval_sex < 0.05 & Eff_sex > 0.1, TRUE, FALSE)) %>%
mutate(tissue_specific =
ifelse(FPval_tissue < 0.05 & Eff_tissue > 0.1, TRUE, FALSE)) %>%
select(sex_specific, tissue_specific) %>% table
## tissue_specific
## sex_specific FALSE TRUE
## FALSE 891 99
## TRUE 5 0
As you can see from the result, some probes are sex-specific and others are tissue-specific. There is no overlap between probes whose methylation reading is differential along both contrasts.
From the test result, we can also ask whether the DNA methylation is
different between two sexes or between two specific tissues. For
example, Est_sexMale
compares male from females. The
following code creates a volcano plot.
Likewise, we can ask whether DNA methylation might be different between fat and colon. We can do
The variable tested in the DML
function can be
continuous. Suppose we are interested in age
besides
sex
. We will call the same function but with the following
formula:
Let’s verify the CpGs positively associated with age.
df = data.frame(Age = colData(se)$age,
BetaValue = assay(se)[test_result2$Probe_ID[nrow(test_result2)],])
ggplot(df, aes(Age, BetaValue)) + geom_smooth(method="lm") + geom_point()
## `geom_smooth()` using formula = 'y ~ x'
For a given contrast, one can merge neighboring CpGs that show consistent methylation variation into differentially methylated regions (DMRs). For example, we can merge sex-specific differential methylation identified above to chromosome X regions that show X-inactivation-related methylation difference. To do this, we need to pick a contrast:
## [1] "X.Intercept." "tissueCecum" "tissueEsophagus" "tissueFat"
## [5] "tissueHeart" "sexMale"
## Merging correlated CpGs ... Done.
## Generated 509 segments.
## Combine p-values ...
## - 25 significant segments.
## - 1 significant segments (after BH).
## Done.
See Supplemental Vignette for track-view visualization of the data.
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggplot2_3.5.1 tidyr_1.3.1
## [3] dplyr_1.1.4 SummarizedExperiment_1.37.0
## [5] Biobase_2.67.0 GenomicRanges_1.59.1
## [7] GenomeInfoDb_1.43.2 IRanges_2.41.2
## [9] S4Vectors_0.45.2 MatrixGenerics_1.19.0
## [11] matrixStats_1.5.0 knitr_1.49
## [13] sesame_1.25.3 sesameData_1.25.0
## [15] ExperimentHub_2.15.0 AnnotationHub_3.15.0
## [17] BiocFileCache_2.15.0 dbplyr_2.5.0
## [19] BiocGenerics_0.53.3 generics_0.1.3
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 blob_1.2.4
## [4] filelock_1.0.3 Biostrings_2.75.3 fastmap_1.2.0
## [7] digest_0.6.37 lifecycle_1.0.4 KEGGREST_1.47.0
## [10] RSQLite_2.3.9 magrittr_2.0.3 compiler_4.5.0
## [13] rlang_1.1.4 sass_0.4.9 tools_4.5.0
## [16] yaml_2.3.10 labeling_0.4.3 S4Arrays_1.7.1
## [19] bit_4.5.0.1 curl_6.1.0 DelayedArray_0.33.3
## [22] plyr_1.8.9 RColorBrewer_1.1-3 abind_1.4-8
## [25] BiocParallel_1.41.0 withr_3.0.2 purrr_1.0.2
## [28] grid_4.5.0 preprocessCore_1.69.0 wheatmap_0.2.0
## [31] colorspace_2.1-1 MASS_7.3-64 scales_1.3.0
## [34] cli_3.6.3 rmarkdown_2.29 crayon_1.5.3
## [37] reshape2_1.4.4 httr_1.4.7 tzdb_0.4.0
## [40] DBI_1.2.3 cachem_1.1.0 stringr_1.5.1
## [43] splines_4.5.0 parallel_4.5.0 AnnotationDbi_1.69.0
## [46] BiocManager_1.30.25 XVector_0.47.2 vctrs_0.6.5
## [49] Matrix_1.7-1 jsonlite_1.8.9 hms_1.1.3
## [52] bit64_4.5.2 fontawesome_0.5.3 jquerylib_0.1.4
## [55] glue_1.8.0 codetools_0.2-20 stringi_1.8.4
## [58] gtable_0.3.6 BiocVersion_3.21.1 UCSC.utils_1.3.0
## [61] munsell_0.5.1 tibble_3.2.1 pillar_1.10.1
## [64] rappdirs_0.3.3 htmltools_0.5.8.1 GenomeInfoDbData_1.2.13
## [67] R6_2.5.1 evaluate_1.0.1 lattice_0.22-6
## [70] readr_2.1.5 png_0.1-8 memoise_2.0.1
## [73] BiocStyle_2.35.0 bslib_0.8.0 Rcpp_1.0.13-1
## [76] nlme_3.1-166 SparseArray_1.7.2 mgcv_1.9-1
## [79] xfun_0.50 pkgconfig_2.0.3