ONT-scale workflows with RBedMethyl

Overview

RBedMethyl provides disk-backed access to nanoporetech modkit bedMethyl files generated from ONT data, enabling ONT-scale workflows without loading full tables into memory. This vignette demonstrates a typical workflow: ingest, subset by region, compute per-region summaries, and coerce to downstream Bioconductor classes.

The bedMethyl format

modkit, the bioinformatics tool produced by Oxford Nanopore Technologies, is one of the easiest and most direct ways to extract site-specific methylation information from a long-read BAM file produced by Dorado. As explained in UCSC, the bedMethyl format is an extension of the standard BED 9 format, with the following columns:

column name description type
1 chrom name of reference sequence from BAM header str
2 start position 0-based start position int
3 end position 0-based exclusive end position int
4 modified base code single letter code for modified base str
5 score Equal to Nvalid_cov. int
6 strand ‘+’ for positive strand ‘-’ for negative strand, ‘.’ when strands are combined str
7 start position included for compatibility int
8 end position included for compatibility int
9 color included for compatibility, always 255,0,0 str
10 Nvalid_cov Number of reads with valid modification call int
11 fraction modified Percent of valid calls that are modified float
12 Nmod Number of calls with a modified base int
13 Ncanonical Number of calls with a canonical base int
14 Nother_mod Number of calls with a modified base, other modifications int
15 Ndelete Number of reads with a deletion at this reference position int
16 Nfail Number of calls where the probability of the call was below the threshold int
17 Ndiff Number of reads with a base other than the canonical base for this modification int
18 Nnocall Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call int

RBedMethyl as a plug-and-play connector

RBedMethyl package provides a minimalistic interface, which allows the users to process bedMethyl files. They can:

  1. load them into R, with a minimal memory footprint with readBedMethyl
  2. subset them:
    • by chromosome(s) (subsetByChromosomes)
    • by region(s) (subsetByRegion or using the [] bracketing index system)
    • by coverage (filterByCoverage)
    • by a specific filter on an assay (subsetBy)
  3. create region summary statistics (summarizeByRegion)
  4. convert them to other Bioconductor-powered data structures using the as(x, name) coercion: RangedSummarizedExperiment and BSseq

Ingest a bedMethyl file

The RBedMethyl object is an HDF5-backed structure. It is designed to keep a single modification (i.e, “h” or “m”). The user can select which fields to keep from the bedMethyl file during the reading process. The correspondence is:

bedMethyl name RBedMethyl assay
Nvalid_cov coverage
fraction modified pct
Nmod mod_reads
Ncanonical unmod_reads
Nother_mod other_reads
Ndelete del_reads
Nfail fail_reads
Ndiff diff_reads
Nnocall nocall_reads

By default the readBedMethyl only loads “coverage”, “pct” and “mod_reads” assays.

# Example bedMethyl-like content, header is dropped
df <- data.frame(
  r1 = c("chr1", 0, 1, "m", 0, "+", 0, 1, 0, 10, 0.5, 5, 5, 0, 0, 0, 0, 0),
  r2 = c("chr1", 10, 11, "m", 0, "+", 10, 11, 0, 20, 0.25, 5, 15, 0, 0, 0, 0, 0),
  r3 = c("chr2", 0, 1, "m", 0, "-", 0, 1, 0, 8, 0.375, 3, 5, 0, 0, 0, 0, 0),
  r4 = c("chr2", 0, 1, "h", 0, "-", 0, 1, 0, 8, 0.375, 3, 5, 0, 0, 0, 0, 0)
)
df <- t(df)
colnames(df) <- c(
  "chrom", "chromStart", "chromEnd", "mod", "score", "strand",
  "thickStart", "thickEnd", "itemRgb", "coverage", "pct",
  "mod_reads", "unmod_reads", "other_reads",
  "del_reads", "fail_reads", "diff_reads", "nocall_reads"
)
df
#>    chrom  chromStart chromEnd mod score strand thickStart thickEnd itemRgb
#> r1 "chr1" "0"        "1"      "m" "0"   "+"    "0"        "1"      "0"    
#> r2 "chr1" "10"       "11"     "m" "0"   "+"    "10"       "11"     "0"    
#> r3 "chr2" "0"        "1"      "m" "0"   "-"    "0"        "1"      "0"    
#> r4 "chr2" "0"        "1"      "h" "0"   "-"    "0"        "1"      "0"    
#>    coverage pct     mod_reads unmod_reads other_reads del_reads fail_reads
#> r1 "10"     "0.5"   "5"       "5"         "0"         "0"       "0"       
#> r2 "20"     "0.25"  "5"       "15"        "0"         "0"       "0"       
#> r3 "8"      "0.375" "3"       "5"         "0"         "0"       "0"       
#> r4 "8"      "0.375" "3"       "5"         "0"         "0"       "0"       
#>    diff_reads nocall_reads
#> r1 "0"        "0"         
#> r2 "0"        "0"         
#> r3 "0"        "0"         
#> r4 "0"        "0"

bed <- tempfile(fileext = ".bed")
write.table(df, bed, quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)

# Read only "m" modification rows, and only coverage,pct and mod_reads assays
bm <- readBedMethyl(bed, mod = "m", fields = c("coverage", "pct", "mod_reads"))

bm
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 3 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads

Subsetting example

# By one or more chromosomes
subsetByChromosomes(bm, "chr1")
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 2 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
subsetByChromosomes(bm, c("chr1", "chr2"))
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 3 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads

# By region
## Can supply the chromosome, start, end
subsetByRegion(bm, "chr1", 1, 12)
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 1 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads
## Or a GRanges object
regions <- GenomicRanges::GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges::IRanges(start = c(1, 1), end = c(12, 2))
)
subsetByRegion(bm, regions)
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 3 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads

## The [] bracketing index system also works
bm[regions]
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 3 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads

# By minimum coverage
filterByCoverage(bm, 15)
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 1 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads

# By column using a function
## The following keeps only lines where the methylation percentage is higher than 0.3
subsetBy(bm, "pct", function(x) x > 0.3)
#> RBedMethyl object
#>   Modification: m 
#>   Rows (active/total): 2 / 3 
#>   Chromosomes: 2 ( chr1, chr2 )
#>   Assays: chrom, chromStart, chromEnd, strand, coverage, pct, mod_reads

Region-level summaries

regions <- GenomicRanges::GRanges(
  seqnames = c("chr1", "chr2"),
  ranges = IRanges::IRanges(start = c(1, 1), end = c(12, 2))
)
# For each region, generate summary statistics
## `mod_reads` is the total number of reads, `beta` stands for average methylation, `n_sites` is the number of sites
summarizeByRegion(bm, regions)
#> DataFrame with 2 rows and 4 columns
#>    coverage mod_reads      beta   n_sites
#>   <numeric> <numeric> <numeric> <integer>
#> 1        30        10  0.333333         2
#> 2         8         3  0.375000         1

Coercion to downstream classes

# RangedSummarizedExperiment
rse <- as(bm, "RangedSummarizedExperiment")
rse
#> class: RangedSummarizedExperiment 
#> dim: 3 1 
#> metadata(0):
#> assays(3): coverage mod_reads pct
#> rownames: NULL
#> rowData names(0):
#> colnames: NULL
#> colData names(0):

# bsseq (if installed)
if (requireNamespace("bsseq", quietly = TRUE) &&
    methods::hasMethod("coerce", c("RBedMethyl", "BSseq"))) {
  bs <- as(bm, "BSseq")
  bs
} else {
  message("bsseq is not installed or BSseq coercion is unavailable; skipping BSseq coercion.")
}
#> An object of type 'BSseq' with
#>   3 methylation loci
#>   1 samples
#> has not been smoothed
#> All assays are in-memory

Session info

sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] RBedMethyl_1.0.0 BiocStyle_2.40.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.42.0 rjson_0.2.23               
#>  [3] xfun_0.57                   bslib_0.10.0               
#>  [5] rhdf5_2.56.0                Biobase_2.72.0             
#>  [7] lattice_0.22-9              bitops_1.0-9               
#>  [9] rhdf5filters_1.22.0         tools_4.6.0                
#> [11] generics_0.1.4              curl_7.1.0                 
#> [13] stats4_4.6.0                parallel_4.6.0             
#> [15] R.oo_1.27.1                 Matrix_1.7-5               
#> [17] data.table_1.18.2.1         BSgenome_1.80.0            
#> [19] RColorBrewer_1.1-3          cigarillo_1.2.0            
#> [21] S4Vectors_0.50.0            sparseMatrixStats_1.24.0   
#> [23] lifecycle_1.0.5             compiler_4.6.0             
#> [25] farver_2.1.2                Rsamtools_2.28.0           
#> [27] Biostrings_2.80.0           statmod_1.5.1              
#> [29] Seqinfo_1.2.0               codetools_0.2-20           
#> [31] permute_0.9-10              htmltools_0.5.9            
#> [33] sys_3.4.3                   buildtools_1.0.0           
#> [35] sass_0.4.10                 RCurl_1.98-1.18            
#> [37] yaml_2.3.12                 crayon_1.5.3               
#> [39] jquerylib_0.1.4             R.utils_2.13.0             
#> [41] BiocParallel_1.46.0         DelayedArray_0.38.1        
#> [43] cachem_1.1.0                limma_3.68.0               
#> [45] abind_1.4-8                 gtools_3.9.5               
#> [47] locfit_1.5-9.12             digest_0.6.39              
#> [49] restfulr_0.0.16             maketools_1.3.2            
#> [51] fastmap_1.2.0               grid_4.6.0                 
#> [53] cli_3.6.6                   SparseArray_1.12.0         
#> [55] S4Arrays_1.12.0             XML_3.99-0.23              
#> [57] h5mread_1.4.0               DelayedMatrixStats_1.34.0  
#> [59] scales_1.4.0                httr_1.4.8                 
#> [61] rmarkdown_2.31              XVector_0.52.0             
#> [63] matrixStats_1.5.0           R.methodsS3_1.8.2          
#> [65] beachmat_2.28.0             HDF5Array_1.40.0           
#> [67] evaluate_1.0.5              knitr_1.51                 
#> [69] GenomicRanges_1.64.0        IRanges_2.46.0             
#> [71] BiocIO_1.22.0               rtracklayer_1.72.0         
#> [73] rlang_1.2.0                 Rcpp_1.1.1-1.1             
#> [75] glue_1.8.1                  BiocManager_1.30.27        
#> [77] BiocGenerics_0.58.0         bsseq_1.48.0               
#> [79] jsonlite_2.0.0              R6_2.6.1                   
#> [81] Rhdf5lib_2.0.0              GenomicAlignments_1.48.0   
#> [83] MatrixGenerics_1.24.0