---
title: "The epialleleR User's Guide "
date: "`r format(Sys.time(), '%d %B, %Y')`"
abstract: |
A comprehensive guide to using the epialleleR package for analysis of
next-generation sequencing data
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{epialleleR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
resource_files:
- epialleleR_logo.svg
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
width = 100
)
options(width=100)
```
*****
# Introduction
Cytosine DNA methylation is an important epigenetic mechanism for regulation
of gene expression. Abnormal methylation is linked to several diseases, being
for example the most common molecular lesion in cancer cell.[^1] Multiple
studies suggest that alterations in DNA methylation, despite occurring at a low
mosaic level, may confer increased risk of cancer later in life.[^2]
The cytosine methylation levels within relatively small regions of the human
genome are thought to be often concordant, resulting in a limited number of
distinct methylation patterns of short sequencing reads.[^3] Due to the
cell-to-cell variations in methylation, DNA purified from tissue samples
contains a mix of hyper- and hypomethylated alleles with varying ratios that
depend on the genomic region and tissue type.
Unsurprisingly, when the frequencies of hypermethylated epialleles are low
(e.g. 1e-02 and lower) and cytosine methylation levels are averaged and reported
using conventional algorithms, the identification of such hypermethylated
epialleles becomes nearly impossible. In order to increase the sensitivity of
DNA methylation analysis we have developed *`epialleleR`* — an R package
for calling hypermethylated variant epiallele frequencies (VEF).
Two edge cases epialleleR was designed to distinguish are presented below
(more examples can be found
[here](https://bbcg.github.io/epialleleR/articles/values.html)).
While these are simplified and entirely artificial, they still give an idea of
two different methylation patterns that may exist in real life, be characterised
by very similar quantitative metrics (beta value, the ratio of methylated
cytosines, $C$, to total number of cytosines, $C+T$, per genomic position, i.e.
$\beta = \frac{C}{C+T}$), but have entirely different
biological properties: non-functional scattered methylation / technical
artefacts on the left, and epigenetic gene inactivation on the right. VEF
values, that are essentially the ratio of methylated cytosines in
hypermethylated (**a**bove threshold) reads, $C^a$, to total number of
cytosines, $C+T$, per genomic position, i.e.
$VEF = \frac{C^a}{C+T}$, clearly separate these cases and are thought to be
more useful in detection and quantification of concordant methylation events.
```{r, echo=FALSE, fig.width=10, fig.height=6, out.width="100%", out.height="100%"}
library(epialleleR)
data.beta <- data.table::data.table(
pattern=rep(letters[1:10], each=10),
pos=rep(10*c(1:10), 10),
base=rep(c('meth',rep('unmeth',10)), length.out=100)
)
val.beta <- data.table::data.table(
pos=rep(1:10, 2),
var=c(rep("VEF", 10), rep("beta\nvalue", 10)),
val=c(rep(0, 10), rep(0.1, 10))
)
data.vef <- data.table::data.table(
pattern=rep(letters[1:10], each=10),
pos=rep(10*c(1:10), 10),
base=c(rep('unmeth',20), rep('meth',10), rep('unmeth',70))
)
val.vef <- data.table::data.table(
pos=rep(1:10, 2),
var=c(rep("VEF", 10), rep("beta\nvalue", 10)),
val=c(rep(0.1, 10), rep(0.1, 10))
)
if (require("ggplot2", quietly=TRUE) & require("gridExtra", quietly=TRUE)) {
plot.data.beta <-
ggplot(data.beta,
aes(x=pos, y=pattern,
group=pattern, color=factor(base))) +
geom_line(color="grey") +
geom_point() +
scale_colour_grey(start=0, end=0.8) +
theme_light() +
theme(axis.text.x=element_blank(), axis.text.y=element_blank(), legend.position="top") +
labs(x="genomic position", y="epiallele", title="scattered CpG methylation", color="base")
plot.val.beta <-
ggplot(val.beta, aes(x=pos, y=var, label=val, fill=factor(val))) +
geom_text(size=3) +
theme_light() +
theme(axis.text.x=element_blank()) +
labs(x="", y="", title=NULL)
plot.data.vef <-
ggplot(data.vef,
aes(x=pos, y=pattern,
group=pattern, color=factor(base))) +
geom_line(color="grey") +
geom_point() +
scale_colour_grey(start=0, end=0.8) +
theme_light() +
theme(axis.text.x=element_blank(), axis.text.y=element_blank(), legend.position="top") +
labs(x="genomic position", y="epiallele", title="epimutation", color="base")
plot.val.vef <-
ggplot(val.vef, aes(x=pos, y=var, label=val, fill=factor(val))) +
geom_text(size=3) +
theme_light() +
theme(axis.text.x=element_blank()) +
labs(x="", y="", title=NULL)
grobs <- lapply(list(plot.data.beta, plot.data.vef, plot.val.beta, plot.val.vef), ggplotGrob)
widths <- do.call(grid::unit.pmax, lapply(grobs, `[[`, "widths"))
for (i in 1:length(grobs)) grobs[[i]]$widths[2:5] <- widths[2:5]
do.call("grid.arrange", c(grobs, list(ncol=2, heights=c(3, 1))))
}
```
*`epialleleR`* is a very fast and scalable solution for analysis of data
obtained by next-generation methylation/native sequencing of DNA samples. The
minimum requirement for the input is a Binary Alignment Map (BAM) file
containing sequencing reads. These reads can be obtained from either deep or
ultra-deep sequencing, using either narrowly targeted gene panels (amplicon
sequencing), larger methylation capture panels, or even whole-genome approaches.
## Current Features
* If methylation calls are not present in BAM file, *`epialleleR`* can
make and store such calls in output BAM (similar to Bismark or Illumina's
mapping/alignment solutions; short-read sequencing alignments only)
* *`epialleleR`* can create and store sample BAM files for testing purposes by
means of *`simulateBam`* method
* In addition to conventional reporting of cytosine DNA methylation levels
(beta values), *`epialleleR`* can call variant epiallele frequencies (VEF) of
hypermethylated alleles at the level of individual cytosines
(*`generateCytosineReport`*) or supplied genomic regions
(*`generateBedReport`*)
* Linearised Methylated Haplotype Load (lMHL,
*`generateMhlReport`*) can be used instead of VEF when thresholding is not
recommended (long-read sequencing)
* DNA methylation patterns of genomic region of interest can be explored
using *`extractPatterns`*
* The association of methylation with single-nucleotide variations within
epialleles can be tested using *`generateVcfReport`*
* Potential bimodality of methylation for genomic regions of interest can be
assessed using *`generateBedEcdf`* method
## Processing speed
Currently *`epialleleR`* runs in a single-thread mode only. Reading/writing
of BAM and FASTA
data is now done by means of *`HTSlib`*, therefore it is possible to speed
it up significantly using additional decompression threads (*`nthreads`* option
in *`epialleleR`* methods). All operations are performed using optimised
C++ functions, and usually take reasonable time.
During methylation calling, human genome (hg38) loading usually takes 10-15
seconds. Using preloaded reference genome, the calling itself is performed
at the speed of 200-300 thousand short reads (150 or 225 bases long) per second
(25-40 MB/s of BAM data).
During methylation reporting,
running time for complete task "BAM on disk -> CX report on disk" depends on the
size of the BAM file, and the speed is usually within the range of 30-50 MB/s
(or 250-400 thousand short reads per second) for a single core of a relatively
modern CPU (Intel(R) Core(TM) i7-7700).
Major bottlenecks (in BAM loading and preprocessing) were removed in the release
v1.2, full multithreading and minor improvements are expected in the future.
*****
# Sample data
The *`epialleleR`* package includes sample data, which was obtained using
targeted sequencing. The description of assays and files is given below. All the
genomic coordinates for external data files are according to GRCh38 reference
assembly.
### Amplicon-based methylation NGS data
The samples of Human HCT116 DKO Non-Methylated (Zymo Research, cat # D5014-1),
or Human HCT116 DKO Methylated (Zymo Research, cat # D5014-2) DNA,[^4] or their
mix were bisulfite-converted, and the BRCA1 gene promoter region was amplified
using four pairs of primers. Amplicons were mixed, indexed and sequenced at
Illumina MiSeq system. The related files are:
| Name | Type | Description |
| --- | --- | --- |
| amplicon000meth.bam | BAM | a subset of reads for non-methylated DNA sample |
| amplicon010meth.bam | BAM | a subset of reads for a 1:9 mix of methylated and non-methylated DNA samples |
| amplicon100meth.bam | BAM | a subset of reads for fully methylated DNA sample |
| amplicon.bed | BED | genomic coordinates of four amplicons covering promoter area of BRCA1 gene |
| amplicon.vcf.gz | VCF | a relevant subset of sequence variations |
| amplicon.vcf.gz.tbi | tabix | tabix file for the amplicon.vcf.gz |
### Capture-based methylation NGS data
The tumour DNA was bisulfite-converted, fragmented and hybridized with
custom-made probes covering promoter regions of 283 tumour suppressor genes (as
described in [^5]). Libraries were sequenced using Illumina MiSeq system. The
related files are:
| Name | Type | Description |
| --- | --- | --- |
| capture.bam | BAM | a subset of reads |
| capture.bed | BED | genomic coordinates of capture target regions |
| capture.vcf.gz | VCF | a relevant subset of sequence variations |
| capture.vcf.gz.tbi | tabix | tabix file for the capture.vcf.gz |
### Manually creating sample BAM files
For the purposes of testing this package's methods or other tools for
methylation calling and/or reporting, *`epialleleR`* provides a convenient way
to manually create sample BAM files by specifying mandatory and optional BAM
file tags. The following code will create a small BAM file that contains
methylation calls and can be used for methylation reporting as described later:
```{r}
bam.file <- tempfile(pattern="simulated", fileext=".bam")
simulateBam(output.bam.file=bam.file, XM=c("ZZzZZ", "zzZzz"), XG="CT")
# one can view the resulting file using `samtools view -h `
# or, if desired, file can be converted to SAM using `samtools view`,
# manually corrected and converted back to BAM
```
Check *`simulateBam`* method help page for more information on parameters and
their default values.
*****
# Typical workflow
## Requirements
As mentioned earlier, *`epialleleR`* uses data stored in Binary Alignment Map
(BAM) files as its input and currently allows to load both short-read
(e.g., bisulfite) and long-read (native) sequencing alignments. Specific
requirements for these types of data are given below. Additionally,
please check the *`preprocessBam`* function help file for a full description
of available parameters, as well as explanation of the function's logic.
### Short-read sequencing
It is a prerequisite that records in the BAM file
contain an XG tag with a genomic strand they map to ("CT" or "GA"), and an XM
tag with the methylation call string — such files are
produced by mapping and alignment tools such as Bismark Bisulfite Read Mapper
and Methylation Caller or state-of-the-art Illumina solutions: Illumina DRAGEN
Bio-IT Platform, Illumina Cloud analysis solutions, as well as contemporary
Illumina sequencing instruments
with on-board read mapping/alignment (NextSeq 1000/2000, NovaSeq X). These BAM
files will contain all the necessary information and can be analysed without
additional steps.
If BAM files were produced by other mapping/alignment tools (e.g., bwa-meth or
BSMAP) and lack XG/XM data, it is possible to call methylation using
*`callMethylation`* method. This method will add absent XG/XM tags
and save all data in the output BAM file that can be further analysed by
*`epialleleR`*.
### Long-read sequencing
For preprocessing of long reads, *`epialleleR`* requires presence of MM (Mm)
and ML (Ml) tags that hold information on base modifications and related
probabilities, respectively. These are standard tags described in SAM/BAM
format specification, therefore relevant tools for analysis and alignment
of long sequencing reads should be able to produce them.
## Reading the data
All *`epialleleR`* methods can load BAM data using the file path.
However, if a file is very large and several reports need to be prepared, it is
advised to use the *`preprocessBam`* convenience function as shown below. This
function is also used internally when a BAM file location string is supplied as
an input for other *`epialleleR`* methods.
*`preprocessBam`* automatically determines if BAM file contains paired- or
single-end alignments and has all necessary tags (XM/XG) available. It is
recommended to use *`verbose`* processing and check messages for correct
identification of alignment endness. Otherwise, if the *`paired`* parameter is
set explicitly, exception is thrown when expected endness differs from the
auto detected one.
During preprocessing of paired-end alignments,
paired reads are merged according to their base
quality: nucleotide base with the highest value in the QUAL string is taken,
unless its quality is less than *`min.baseq`*, which results in no information
for that particular position ("-"/"N"). These **merged reads** are then
processed as a **single entity** in all *`epialleleR`* methods. Due to merging,
overlapping bases in read pairs are counted only once, and the base with the
highest quality is taken. It is a requirement currently that paired-end BAM
file must be sorted by QNAME instead
of genomic location (i.e., "unsorted") to perform merging of paired-end
reads. Error message is shown if it is sorted by genomic location, in this
case please sort it by QNAME using 'samtools sort -n -o out.bam in.bam'.
During preprocessing of single-end alignments, no read merging is
performed. Only bases with quality of at least *`min.baseq`* are considered.
Lower base quality results in no information for that particular position
("-"/"N").
For RRBS-like protocols, it is possible to trim alignments from one or both
ends. Trimming is performed during BAM loading and will therefore influence
results of all downstream *`epialleleR`* methods. Internally, trimming is
performed at the level of a template (i.e., read pair for paired-end BAM or
individual read for single-end BAM). This ensures that only necessary parts
(real ends of sequenced fragment) are removed for paired-end sequencing
reads.
### Specific considerations for long-read sequencing data:
Any location not reported is implicitly assumed to contain no modification.
According to SAM format specification,
MM base modification tags are allowed to list modifications observed not
only on the original sequenced strand (e.g., `C+m`) but also on the
opposite strand (e.g., `G-m`). The logic of their processing is as follows
(with the examples given below):
* if an alignment record has no methylation modifications (neither
`C+m`, nor `G-m` are present), this record is, naturally, considered to
be a single read with no cytosines methylated
* if an alignment record has `C+m` modification (base modifications
on the original sequenced strand), then this record is, naturally,
considered to be a single read with cytosine modifications on the
sequenced strand
* if an alignment record has `G-m` modification (base modifications
on the strand opposite to sequenced), then this record is treated as two
reads, with the original sequenced strand having no modifications,
while the opposite strand having cytosine modifications
* if both `C+m` and `G-m` are present, then this record is treated
as two reads, with both strands having cytosine modifications
```{r}
library(epialleleR)
capture.bam <- system.file("extdata", "capture.bam", package="epialleleR")
bam.data <- preprocessBam(capture.bam)
# Specifics of long-read alignment processing
out.bam <- tempfile(pattern="out-", fileext=".bam")
simulateBam(
seq=c("ACGCCATYCGCGCCA"),
Mm=c("C+m,0,2,0;G-m,0,0,0;"),
Ml=list(as.integer(c(102,128,153,138,101,96))),
output.bam.file=out.bam
)
generateCytosineReport(out.bam, threshold.reads=FALSE, report.context="CX")
```
## Optional calling of cytosine methylation
If short-read BAM file lacks XG/XM tags (e.g., is an output of bwa-meth or
BSMAP),
preprocessing will fail with the message that cytosine methylation calling
must be performed. This can be done as follows:
```{r}
# bwa-meth sample output
input.bam <- system.file("extdata", "test", "bwameth-se-unsort-yd.bam", package="epialleleR")
# resulting BAM with XG/XM tags
output.bam <- tempfile(pattern="output-", fileext=".bam")
# sample reference genome
genome <- preprocessGenome(system.file("extdata", "test", "reference.fasta.gz", package="epialleleR"))
# calls cytosine methylation and stores it in the output BAM
# Input BAM has 100 records of which 73 are mapped to the genome
callMethylation(input.bam, output.bam, genome)
# process this data further
# bam.data <- preprocessBam(output.bam)
```
## Making cytosine reports
*`epialleleR`* can generate conventional cytosine reports in a format, which is
similar to the genome-wide cytosine report produced by the *`coverage2cytosine`*
Bismark module.[^6]
Please note that *`generateCytosineReport`* produces thresholded (VEF) report
by default: **methylated** cytosines from reads that do **not** pass the
threshold (**hypo**methylated reads) are counted as being **un**methylated. In
order to make a conventional cytosine report, use *`threshold.reads=FALSE`*.
To produce conventional cytosine reports without thresholding by
within-context methylation level though
minimally affected by incomplete cytosine conversion, run this method with
the following parameters: *`threshold.reads=TRUE`*, *`threshold.context="CG"`*,
*`min.context.sites=0`*, *`min.context.beta=0`*, *`max.outofcontext.beta=0.1`*.
All cytosines within reads (read pairs) having more than 10% out-of-context
cytosines methylated, will be effectively treated as unmethylated ones.
```{r}
# data.table::data.table object for
# CpG VEF report
cg.vef.report <- generateCytosineReport(bam.data)
head(cg.vef.report[order(meth+unmeth, decreasing=TRUE)])
# CpG cytosine report
cg.report <- generateCytosineReport(bam.data, threshold.reads=FALSE)
head(cg.report[order(meth+unmeth, decreasing=TRUE)])
# CX cytosine report
cx.report <- generateCytosineReport(bam.data, threshold.reads=FALSE,
report.context="CX")
head(cx.report[order(meth+unmeth, decreasing=TRUE)])
```
## Making VEF reports for a set of genomic regions
*`epialleleR`* allows to make reports not only for individual cytosine bases,
but also for a set of genomic regions. It is especially useful when the targeted
methylation sequencing was used to produce reads (such as amplicon sequencing or
hybridization capture using, e.g., Agilent SureSelect Target Enrichment Probes).
The amplicon sequencing principally differs from capture-based assays in that
the coordinates of reads are known. Therefore, reads can be assigned to
amplicons by their exact positions, while to the capture targets — by the
overlap. For this, *`epialleleR`* provides generic *`generateBedReport`*
function as well as two of its aliases, *`generateAmpliconReport`* (for
amplicon-based NGS) and *`generateCaptureReport`* (for capture-based NGS).
```{r}
# report for amplicon-based data
# matching is done by exact start or end positions plus/minus tolerance
amplicon.report <- generateAmpliconReport(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR")
)
amplicon.report
# report for capture-based data
# matching is done by overlap
capture.report <- generateCaptureReport(
bam=system.file("extdata", "capture.bam", package="epialleleR"),
bed=system.file("extdata", "capture.bed", package="epialleleR")
)
head(capture.report)
# generateBedReport is a generic function for BED-guided reports
bed.report <- generateBedReport(
bam=system.file("extdata", "capture.bam", package="epialleleR"),
bed=system.file("extdata", "capture.bed", package="epialleleR"),
bed.type="capture"
)
identical(capture.report, bed.report)
```
## Linearized MHL reports
VEF values are extremely useful for detection of mosaic epimutations. However,
default thresholding parameters might not fit with the nature of regions of
interest. In this case, it is advised to learn the characteristics of these
regions with *`extractPatterns`* and *`generateBedEcdf`* methods as described
below. Alternatively, *`epialleleR`* provides a method
to calculate a
metric that is similar to VEF in its ability to highlight hypermethylated
regions but does not require thresholding —
linearised Methylated Haplotype Load (lMHL).
lMHL is a modified version of MHL (MHL was first described by
[Guo et al., 2017](https://doi.org/10.1038/ng.3805}{10.1038/ng.3805)), sought to
be faster and applicable for a wider range of sequencing data. More information
on this is given in the help page for the *`generateMhlReport`* as well as in
the `values` vignette.
```{r}
# lMHL report can be generated using
mhl.report <- generateMhlReport(
bam=system.file("extdata", "capture.bam", package="epialleleR")
)
```
## Exploring DNA methylation patterns
Individual epialleles can be extracted and plotted in order to visualize
methylation patters within a genomic region of interest. For this,
*`epialleleR`* provides method *`extractPatterns`* which can be used as follows:
```{r, fig.width=10, fig.height=6, out.width="100%", out.height="100%"}
# First, let's extract base methylation information for sequencing reads
# of 1:9 mix of methylated and non-methylated control DNA
patterns <- extractPatterns(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=as("chr17:43125200-43125600","GRanges")
)
# that many read pairs overlap genomic region of interest
nrow(patterns)
# these are positions of bases within relevant read pairs
base.positions <- grep("^[0-9]+$", colnames(patterns), value=TRUE)
# let's make a summary table with counts for every pattern
patterns.summary <- patterns[, c(lapply(.SD, unique), .N),
by=.(pattern, beta), .SDcols=base.positions]
# that many unique methylation patterns were found
nrow(patterns.summary)
# let's melt and plot patterns
plot.data <- data.table::melt.data.table(patterns.summary,
measure.vars=base.positions, variable.name="pos", value.name="base")
# categorical positions, all patterns sorted by beta, with counts on the right
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
ggplot(na.omit(plot.data),
aes(x=pos, y=reorder(pattern,beta),
group=pattern, color=factor(base))) +
geom_line(color="grey", position=position_dodgev(height=0.5)) +
geom_point(position=position_dodgev(height=0.5)) +
scale_colour_grey(start=0.8, end=0) +
theme_light() +
theme(axis.text.x=element_text(angle=60, hjust=1, vjust=1),
axis.text.y=element_blank()) +
labs(x="position", y="pattern", title="epialleles", color="base") +
scale_x_discrete(expand=c(0.05,0)) +
geom_text(inherit.aes=FALSE, data=patterns.summary,
mapping=aes(x="count", y=pattern, label=N), size=3)
}
# continuous positions, nonunique patterns according to their counts
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
ggplot(na.omit(plot.data)[N>1],
aes(x=as.numeric(as.character(pos)), y=factor(N),
group=pattern, color=factor(base))) +
geom_line(color="grey", position=position_dodgev(height=0.5)) +
geom_point(position=position_dodgev(height=0.5)) +
scale_colour_grey(start=0.8, end=0) +
theme_light() +
labs(x="position", y="count", title="epialleles", color="base")
}
# upset-like plot of all patterns, continuous positions, sorted by counts
if (require("ggplot2", quietly=TRUE) & require("gridExtra", quietly=TRUE)) {
grid.arrange(
ggplot(na.omit(plot.data),
aes(x=as.numeric(as.character(pos)), y=reorder(pattern,N),
color=factor(base))) +
geom_line(color="grey") +
geom_point() +
scale_colour_grey(start=0.8, end=0) +
theme_light() +
theme(axis.text.y=element_blank(), legend.position="none") +
labs(x="position", y=NULL, title="epialleles", color="base"),
ggplot(unique(na.omit(plot.data)[, .(pattern, N, beta)]),
aes(x=N+0.5, y=reorder(pattern,N), alpha=beta, label=N)) +
geom_col() +
geom_text(alpha=0.5, nudge_x=0.2, size=3) +
scale_x_log10() +
theme_minimal() +
theme(axis.text.y=element_blank(), legend.position="none") +
labs(x="count", y=NULL, title=""),
ncol=2, widths=c(0.75, 0.25)
)
}
# now let's explore methylation patterns in RAD51C gene promoter using
# methylation capture data
capture.patterns <- extractPatterns(
bam=system.file("extdata", "capture.bam", package="epialleleR"),
bed=as("chr17:58691673-58693108", "GRanges"),
verbose=FALSE
)
capture.positions <- grep("^[0-9]+$", colnames(capture.patterns), value=TRUE)
capture.summary <-
capture.patterns[, c(lapply(.SD, unique), .N),
by=.(pattern, beta), .SDcols=capture.positions]
capture.data <- data.table::melt.data.table(capture.summary,
measure.vars=capture.positions, variable.name="pos", value.name="base")
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
ggplot(na.omit(capture.data),
aes(x=as.numeric(as.character(pos)), y=factor(N),
group=pattern, color=factor(base))) +
geom_line(color="grey", position=position_dodgev(height=0.9)) +
geom_point(position=position_dodgev(height=0.9)) +
scale_colour_grey(start=0.8, end=0) +
theme_light() +
labs(x="position", y="count", title="RAD51C", color="base")
}
```
## Exploring sequence variants in epialleles
It is known that sequence variants can affect the methylation status of a
DNA.[^7] The *`generateVcfReport`* function calculates frequencies of single
nucleotide variants (SNVs) within epialleles and tests for the association
between SNV and epiallelic status using Fisher Exact test. Base counts and the
test's p-values are included in the returned value.
In addition to BAM file location string or preprocessed BAM object, the function
requires a location string for the Variant Call Format (VCF) file or the VCF
object that was obtained using *`VariantAnnotation::readVcf`* function. As VCF
files can be extremely large, it is strongly advised to prefilter the VCF object
by the relevant set of genomic regions, or specify such relevant set of regions
as a *`bed`* parameter when *`vcf`* points to a VCF file location.
Please note, that the output report is currently limited to SNVs only. Also,
the default (`min.baseq=0`) output of `generateVcfReport` is equivalent to the
one of `samtools mplieup -Q 0 ...`, and therefore may result in false SNVs
caused by misalignments. Remember to increase `min.baseq`
(`samtools mplieup -Q` default value is 13) to obtain results of a higher
quality.
```{r, fig.width=10, fig.height=6, out.width="100%", out.height="100%"}
# VCF report
vcf.report <- generateVcfReport(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
vcf=system.file("extdata", "amplicon.vcf.gz", package="epialleleR"),
# thresholds on alignment and base quality
min.mapq=30, min.baseq=13,
# when VCF seqlevels are different from BED and BAM it is possible
# to convert them internally
vcf.style="NCBI"
)
# NA values are shown for the C->T variants on the "+" and G->A on the "-"
# strands, because bisulfite conversion makes their counting impossible
head(vcf.report)
# let's sort the report by increasing Fisher's exact test's p-values.
# the p-values are given separately for reads that map to the "+"
head(vcf.report[order(`FEp-`, na.last=TRUE)])
# and to the "-" strand
head(vcf.report[order(`FEp+`, na.last=TRUE)])
# and finally, let's plot methylation patterns overlapping one of the most
# covered SNPs in the methylation capture test data set - rs573296191
# (chr17:61864584) in BRIP1 gene
brip1.patterns <- extractPatterns(
bam=system.file("extdata", "capture.bam", package="epialleleR"),
bed=as("chr17:61864583-61864585", "GRanges"),
highlight.positions=61864584,
verbose=FALSE
)
brip1.positions <- grep("^[0-9]+$", colnames(brip1.patterns), value=TRUE)
brip1.summary <-
brip1.patterns[, c(lapply(.SD, unique), .N),
by=.(pattern, beta), .SDcols=brip1.positions]
brip1.data <- data.table::melt.data.table(brip1.summary,
measure.vars=brip1.positions, variable.name="pos", value.name="base")
if (require("ggplot2", quietly=TRUE) & require("ggstance", quietly=TRUE)) {
ggplot(na.omit(brip1.data),
aes(x=as.numeric(as.character(pos)), y=factor(N),
group=pattern, color=factor(base))) +
geom_line(color="grey", position=position_dodgev(height=0.9)) +
geom_point(position=position_dodgev(height=0.9)) +
scale_colour_manual(values=c("blue",NA,"red","grey80","black")) +
theme_light() +
labs(x="position", y="count", title="BRIP1", color="base")
}
```
## Plotting the distribution of per-read beta values
As stated in the introduction, human genomic DNA regions often show bimodal
methylation patterns. *`epialleleR`* allows to visualize this information by
plotting empirical cumulative distribution functions (eCDFs) for within- and
out-of-context beta values.
The code below produces plots for the sequencing reads of control DNA
samples. Note that within-the-context eCDF(0.5) values are very close to the
expected 1-VEF values for the corresponding control DNA samples:
* non-methylated DNA — expected VEF = 0, observed 1-eCDF(0.5) ≈ 0
* 1:9 mix of methylated and non-methylated DNA — expected VEF = 0.1,
observed 1-eCDF(0.5) ≈ 0.1
* and fully methylated DNA — expected VEF = 1,
observed 1-eCDF(0.5) ≈ 1
```{r, fig.width=10, fig.height=6, out.width="100%", out.height="100%"}
# First, let's visualise eCDFs for within- and out-of-context beta values
# for all four amplicons and unmatched reads. Note that within-the-context eCDF
# of 0.5 is very close to the expected 1-VEF value (0.1) for all amplicons
# produced from this 1:9 mix of methylated and non-methylated control DNA
# let's compute eCDF
amplicon.ecdfs <- generateBedEcdf(
bam=system.file("extdata", "amplicon010meth.bam", package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
bed.rows=NULL
)
# there are 5 items in amplicon.ecdfs, let's plot all of them
par(mfrow=c(1,length(amplicon.ecdfs)))
# cycle through items
for (x in 1:length(amplicon.ecdfs)) {
# four of them have names corresponding to genomic regions of amplicon.bed
# fifth - NA for all the reads that don't match to any of those regions
main <- if (is.na(names(amplicon.ecdfs[x]))) "unmatched"
else names(amplicon.ecdfs[x])
# plotting eCDF for within-the-context per-read beta values (in red)
plot(amplicon.ecdfs[[x]]$context, col="red", verticals=TRUE, do.points=FALSE,
xlim=c(0,1), xlab="per-read beta value", ylab="cumulative density",
main=main)
# adding eCDF for out-of-context per-read beta values (in blue)
plot(amplicon.ecdfs[[x]]$out.of.context, add=TRUE, col="blue",
verticals=TRUE, do.points=FALSE)
}
# Second, let's compare eCDFs for within-the-context beta values for only one
# amplicon but all three sequenced samples: pure non-methylated DNA, 1:9 mix of
# methylated and non-methylated DNA, and fully methylated DNA
# our files
bam.files <- c("amplicon000meth.bam", "amplicon010meth.bam",
"amplicon100meth.bam")
# let's plot all of them
par(mfrow=c(1,length(bam.files)))
# cycle through items
for (f in bam.files) {
# let's compute eCDF
amplicon.ecdfs <- generateBedEcdf(
bam=system.file("extdata", f, package="epialleleR"),
bed=system.file("extdata", "amplicon.bed", package="epialleleR"),
# only the second amplicon
bed.rows=2, verbose=FALSE
)
# plotting eCDF for within-the-context per-read beta values (in red)
plot(amplicon.ecdfs[[1]]$context, col="red", verticals=TRUE, do.points=FALSE,
xlim=c(0,1), xlab="per-read beta value", ylab="cumulative density",
main=f)
# adding eCDF for out-of-context per-read beta values (in blue)
plot(amplicon.ecdfs[[1]]$out.of.context, add=TRUE, col="blue",
verticals=TRUE, do.points=FALSE)
}
```
*****
# Other information
## Citing the *`epialleleR`* package
Oleksii Nikolaienko, Per Eystein Lønning, Stian Knappskog, *epialleleR*: an R/Bioconductor package for sensitive allele-specific methylation analysis in NGS data. *GigaScience*, Volume 12, 2023, giad087, [https://doi.org/10.1093/gigascience/giad087](https://doi.org/10.1093/gigascience/giad087)
## The data underlying *`epialleleR`* manuscript
Replication Data for: "epialleleR: an R/BioC package for quantifying and analysing low-frequency DNA methylation", [https://doi.org/10.18710/2BQTJP](https://doi.org/10.18710/2BQTJP)
NCBI GEO dataset GSE201690: "Methylation analysis of promoter regions for selected tumour suppressor genes in DNA from white blood cells", [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE201690](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE201690)
## Our experimental studies that use the package
Per Eystein Lonning, Oleksii Nikolaienko, Kathy Pan, Allison W. Kurian, Hans Petter Petter Eikesdal, Mary Pettinger, Garnet L Anderson, Ross L Prentice, Rowan T. Chlebowski, and Stian Knappskog. Constitutional *BRCA1* methylation and risk of incident triple-negative breast cancer and high-grade serous ovarian cancer. *JAMA Oncology* 2022. [https://doi.org/10.1001/jamaoncol.2022.3846](https://doi.org/10.1001/jamaoncol.2022.3846)
Oleksii Nikolaienko, Hans P. Eikesdal, Elisabet Ognedal, Bjørnar Gilje, Steinar Lundgren, Egil S. Blix, Helge Espelid, Jürgen Geisler, Stephanie Geisler, Emiel A.M. Janssen, Synnøve Yndestad, Laura Minsaas, Beryl Leirvaag, Reidun Lillestøl, Stian Knappskog, Per E. Lønning. Prenatal *BRCA1* epimutations contribute significantly to triple-negative breast cancer development. *Genome Medicine* 2023. [https://doi.org/10.1186/s13073-023-01262-8](https://doi.org/10.1186/s13073-023-01262-8).
Data: [GSE243966](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE243966)
## Session Info
```{r session}
sessionInfo()
```
*****
# References
[^1]: https://doi.org/10.1146/annurev.pharmtox.45.120403.095832
[^2]: https://doi.org/10.1101/2020.12.01.403501
[^3]: https://doi.org/10.1093/bib/bbx077
[^4]: https://www.zymoresearch.com/products/human-methylated-non-methylated-dna-set-dna-w-primers
[^5]: https://dx.doi.org/10.1186%2Fs13148-020-00920-7
[^6]: https://www.bioinformatics.babraham.ac.uk/projects/bismark/
[^7]: https://doi.org/10.1038/modpathol.2009.130