--- title: "Getting Started with epiRomics" author: "Alex M. Mawla & Mark O. Huising" date: "`r Sys.Date()`" bibliography: references.bib nocite: | @mawla2023, @mawla2021 output: html_document: toc: yes toc_float: yes df_print: default theme: cosmo highlight: tango pdf_document: toc: yes df_print: default latex_engine: pdflatex keep_tex: no vignette: | %\VignetteIndexEntry{Getting Started with epiRomics} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r knitr-options, include=FALSE} # Knitr chunk defaults used throughout this vignette. knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, cache = FALSE, fig.width = 10, fig.height = 6, out.width = "100%", tidy = FALSE ) options(width = 70) ``` # Introduction `epiRomics` integrates multi-omics data to identify and visualise enhancer regions from bulk ChIP-seq, histone modification, and ATAC-seq chromatin accessibility signal, alongside companion RNA-seq. Its core contribution is a reproducible pipeline that combines co-occurring histone marks with transcription factor (TF) co-binding to surface putative enhancers and enhanceosomes, then renders signal, annotation, and peak tracks together in a single, publication-ready frame so the biology is immediately inspectable. The workflow decomposes into four reusable stages that mirror how epigenomic data is analysed in practice: (1) catalogue every input track in a single manifest and build an `epiRomicsS4` database, (2) call putative enhancers from any pair of co-occurring histone/ChIP marks [@creyghton2010], (3) intersect those calls against curated enhancer catalogues (FANTOM5 [@andersson2014], the Islet Regulome [@miguel_escalada2019], user-supplied BEDs, etc.) to filter for regions supported by an external reference, and (4) identify enhanceosome regions with dense TF co-binding and visualise them against the underlying signal. Each stage is usable on its own: the database alone supports gene-centred track plots; the filter step accepts any BED-derived reference; the enhanceosome finder works on either raw or filtered calls. epiRomics is designed to extend the existing Bioconductor epigenomics stack rather than replace it. Peak-to-feature annotation reuses `ChIPseeker`, genomic-feature overlays compose with `annotatr`, BigWig and BED I/O go through `rtracklayer`, coordinate algebra and seqinfo harmonisation use `GenomicRanges` and `GenomeInfoDb`, gene models come from any `TxDb.*` package, ID mapping from any `org.*.db`, and example-data fetching is cached through `BiocFileCache`. Static and interactive track viewers such as `Gviz`, `trackViewer`, and `Signac` remain the right choice for the scenarios they were built for (publication-grade track aesthetics, lolliplots, single-cell assays); `epiRomics` focuses on the narrower problem of combining multi-mark bulk data into enhancer and enhanceosome calls and surfacing them against their own signal for biological interpretation. Because all gene models come from a user-supplied `TxDb.*` and all ID mapping from a user-supplied `org.*.db`, any organism available in Bioconductor's annotation catalogue can be analysed. This vignette uses human hg38 as a concrete example, but the same code runs against `TxDb.Mmusculus.UCSC.mm10.knownGene` + `org.Mm.eg.db`, `TxDb.Rnorvegicus.UCSC.rn6.refGene` + `org.Rn.eg.db`, or any other valid pairing. The `genome` string is cross-checked against `GenomeInfoDb::genome(txdb)` at build time, so mismatches fail loudly rather than silently. This *getting-started* vignette uses a small toy dataset (under 1 MB) bundled with the package in `inst/extdata/toy/`. The toy subset covers a single 400 kb window on hg38 chr11:1,900,000-2,300,000 centred on the human insulin gene (`INS`). A companion vignette, *"Full Analysis with epiRomics"*, walks through the complete alpha vs. beta pancreatic islet analysis using the full 1.3 GB Zenodo archive, available on demand via `epiRomics::cache_data()`. ## Locating the toy data All toy files live under the package's installed `inst/extdata/toy/` directory. We resolve the path through `system.file()` so the vignette works identically whether it is rendered from the installed package, during `R CMD check`, or from the source tree. The absolute path is machine-specific, so we only show that the directory resolves and list its basenames. ```{r locate-data} toy_dir <- system.file("extdata", "toy", package = "epiRomics") dir.exists(toy_dir) list.files(toy_dir) ``` The window includes four BigWig signal tracks (alpha/beta ATAC-seq and RNA-seq), seven histone modification peak sets, five TF ChIP-seq peak sets, and three enhancer annotation resources. Everything needed to exercise the full `epiRomics` pipeline end-to-end is contained here; nothing is downloaded. **Dataset provenance.** The toy bundle is a narrow subset of a human pancreatic alpha-versus-beta islet dataset that was curated, reprocessed, and archived on Zenodo [@zenodo_epiRomics_data]. The underlying sequencing data were not generated by the package authors; they were assembled from published public resources and re-analysed through a uniform pipeline. The full curation methodology (source accessions, alignment, peak calling, normalisation) is documented in the package README. The BigWig tracks here are centred on the hg38 *INS* locus (`chr11:1,900,000-2,300,000`), while the BED peak catalogues (histone marks, TF ChIP-seq, and curated enhancer references) are drawn from additional islet loci so the transcription-factor co-binding analysis below has enough observations for the Fisher statistic to be informative. The unabridged curated archive (~1.3 GB) lives on Zenodo and is used by the companion vignette [*Full Analysis with epiRomics*](full-analysis-with-epiRomics.html); it is fetched on demand (and cached locally) via `epiRomics::cache_data()`. The reproducibility script that converts the full archive into this toy subset lives at `inst/scripts/make-toy-data.R`. # Installation `epiRomics` is installed from Bioconductor: ```{r install, eval = FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("epiRomics") ``` The organism-specific `TxDb.*` and `org.*.db` packages this vignette uses are listed in `Suggests` and can be installed the same way when needed. # Loading the package This example analyses human hg38 data, so we load `epiRomics` together with the hg38 `TxDb` and the human `org.Hs.eg.db`. These annotation packages are declared in `Suggests` because `epiRomics` itself does not bind to any particular genome: any `TxDb.*` + `org.*.db` pair that Bioconductor ships for an organism works. ```{r load-pkg, message = FALSE, warning = FALSE} library(epiRomics) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(org.Hs.eg.db) ``` > **Analysing a different organism.** Swap the two annotation > packages for the equivalent pair — e.g. `TxDb.Mmusculus.UCSC.mm10.knownGene` > + `org.Mm.eg.db` for mouse mm10, or `TxDb.Rnorvegicus.UCSC.rn6.refGene` > + `org.Rn.eg.db` for rat rn6. A catalogue of `TxDb.*` packages is > maintained on the Bioconductor website > (), > and a catalogue of `org.*.db` annotations is at > (). > Update the `genome`, `organism`, and `txdb_organism` arguments to > `build_database()` to match; nothing else in the pipeline changes. # Building the database The database manifest is a CSV listing every annotation track with its `name`, `path`, `genome` build, `format`, and `type`. `type` is one of `histone`, `methyl`, `SNP`, `chip`, or `functional`. `genome` is a free-form string (`hg38`, `mm10`, `rn6`, …) and is validated against the supplied `TxDb` — a mismatch raises an error rather than proceeding silently. ```{r show-manifest} db_sheet_path <- file.path(toy_dir, "example_epiRomics_Db_sheet.csv") db_sheet <- read.csv(db_sheet_path) db_sheet ``` Each row points to a BED file under `toy_dir`. To extend the database with your own annotations — for example, an in-house ChIP peak call set or a curated enhancer list — append additional rows with the appropriate `type` and a path relative to the data directory. `build_database()` imports every row, attaches annotation via the supplied `TxDb` / `org.*.db`, and returns an S4 `epiRomicsS4` object that all downstream functions consume. The `txdb_organism` argument takes a `"package::object"` string so the `TxDb` can be resolved lazily without forcing an upfront `library()` call; live `TxDb` objects are also accepted. ```{r build-db, eval = FALSE} database <- build_database( db_file = db_sheet_path, txdb_organism = paste0( "TxDb.Hsapiens.UCSC.hg38.knownGene::", "TxDb.Hsapiens.UCSC.hg38.knownGene"), genome = "hg38", organism = "org.Hs.eg.db", data_dir = toy_dir) ``` `build_database()` calls `annotatr::build_annotations()` which fetches several UCSC annotation tables (CpG islands, lncRNA transcripts, etc.) the first time it runs and caches them locally. To keep this vignette offline-reproducible during `R CMD build` on the Bioconductor build farm, we ship a pre-built `epiRomicsS4` for the toy window with the package and load it here instead of rebuilding. The code above is the canonical invocation every downstream chunk assumes was run. ```{r load-prebuilt-db, echo = FALSE} database <- readRDS(file.path(toy_dir, "toy_database.rds")) database ``` The returned object holds each imported track together with its genome build and file paths, ready for enhancer discovery. # Quick gene-locus inspection Before invoking the full pipeline, `plot_quick_view()` gives a zero-configuration view of BigWig signal at any gene locus. No `epiRomicsS4` database is required — just a gene symbol, a named vector of BigWig paths, and a genome string. This is useful for fast exploration of signal shape at known loci before deciding which regions to probe with the enhancer pipeline. We first stage the BigWig track manifest. The paths in the CSV are relative, so we resolve them against `toy_dir`; we also reorder the rows so that beta-cell tracks plot above alpha-cell tracks in the mirrored view: ```{r bigwig-sheet} track_connection <- read.csv( file.path(toy_dir, "example_epiRomics_BW_sheet.csv")) track_connection$path <- file.path(toy_dir, track_connection$path) ## Reorder: beta tracks first, alpha second beta_idx <- grep("Beta", track_connection$name) alpha_idx <- grep("Alpha", track_connection$name) track_connection <- track_connection[c(beta_idx, alpha_idx), ] ## Named vector of BigWig paths and matching colours bw_paths <- setNames(track_connection$path, track_connection$name) bw_colors <- track_connection$color ## Display-friendly view (basenames only); the real data frame ## still holds fully resolved paths for plot_tracks() below. display_tc <- track_connection display_tc$path <- basename(display_tc$path) display_tc ``` The toy BigWigs cover only the `INS` window on `chr11` (`1,900,000-2,300,000`), so `plot_quick_view()` renders signal for genes inside that window. The companion *Full Analysis* vignette demonstrates the same call against the full-genome BigWigs for multiple beta-cell and alpha-cell loci (`INS`, `PDX1`, `MAFA`, `GCG`, `MAFB`, `UCN3`). ```{r plot-quick-view-ins, fig.align = "center", message = FALSE, warning = FALSE} plot_quick_view("INS", bw_paths = bw_paths, colors = bw_colors, mirror = TRUE, genome = "hg38") ``` # Identifying putative enhancers Enhancers are canonically identified by the simultaneous presence of H3K4me1 and H3K27ac [@creyghton2010], but `epiRomics` is deliberately general: `find_enhancers_by_comarks()` accepts any pair of marks loaded in the database (columns with `type` `histone` or `chip`) and returns the regions where both overlap. That flexibility means the same function can be used to ask different biological questions — H3K4me1 + H3K27ac for *active* enhancers [@creyghton2010], H3K4me3 + H3K27ac for active promoter-proximal regions [@heintzman2009], H3K4me1 + H3K27me3 for *poised* enhancers [@rada2011], H2A.Z + H3K27ac for primed regions [@ernst2012; @kundaje2015], etc. The choice of marks encodes the hypothesis; the function signature does not change. ChromHMM-style combinatorial states summarising these are revisited in the *Chromatin state classification* section below. `epiRomicsS4` objects expose their five slots through public getter functions: `annotations()`, `meta()`, `txdb()`, `organism()`, and `genome()`. Users should prefer these accessors over reaching into slots directly with `@` or `slot()` — they are the stable public API for reading and (via `<-` assignment) writing slot contents. ```{r putative-enhancers} putative_enhancers <- find_enhancers_by_comarks( database, histone_mark_1 = "h3k4me1", histone_mark_2 = "h3k27ac") head(as.data.frame(annotations(putative_enhancers)), 5) ``` The resulting `epiRomicsS4` object carries the co-marked regions in its annotations slot as a `GRanges`-like data frame — the genomic coordinates and feature metadata for each putative enhancer. We inspect the slot via the `annotations()` getter throughout this vignette; the key columns are `seqnames`/`start`/`end` (genomic coordinates), `annotation` (genic context from `ChIPseeker`), and `distanceToTSS` (signed distance to the nearest transcription start site). # Cross-referencing against curated enhancer databases Putative enhancers gain biological weight when they overlap regions from a trusted reference. `filter_enhancers()` intersects the putative calls against any row whose `type` is `functional` in the database manifest. The reference to use is selected with the `type` argument, which follows the convention `"{genome}_custom_{name}"` where `{name}` matches the `name` column of the functional row. The toy dataset ships with three functional references already loaded: | `type` argument | Reference | |----------------------------------|----------------------------------------------| | `hg38_custom_fantom` | FANTOM5 permissive enhancer atlas | | `hg38_custom_regulome_active` | Human Islet Regulome — active enhancers | | `hg38_custom_regulome_super` | Human Islet Regulome — super-enhancers | To add additional references — an in-house enhancer catalogue, a differentially-accessible chromatin region list, ChromHMM state calls, disease-variant BEDs, or any other BED-valued resource — append a new row to the manifest CSV with `type = "functional"` and the corresponding `name`. After rebuilding the database, the new row is accessible via `filter_enhancers(..., type = "hg38_custom_")`. No code changes are required. ```{r filter-fantom} fantom_enhancers <- filter_enhancers( putative_enhancers, database, type = "hg38_custom_fantom") head(as.data.frame(annotations(fantom_enhancers)), 5) ``` The Human Islet Regulome [@miguel_escalada2019] supplies an orthogonal, tissue-specific catalogue. We cross-reference the same putative calls against its active enhancer set, and separately against its super-enhancer set: ```{r filter-regulome-active} regulome_enhancers <- filter_enhancers( putative_enhancers, database, type = "hg38_custom_regulome_active") head(as.data.frame(annotations(regulome_enhancers)), 5) ``` ```{r filter-regulome-super} regulome_super_enhancers <- filter_enhancers( putative_enhancers, database, type = "hg38_custom_regulome_super") head(as.data.frame(annotations(regulome_super_enhancers)), 5) ``` Each filter narrows the putative set to the subset supported by the chosen reference. Downstream steps can operate on either filtered object (or all three, to compare reference-agreement). # Identifying enhanceosome regions Enhanceosomes are enhancer regions with high TF co-binding. `find_enhanceosomes()` intersects putative (or filtered) enhancers with every ChIP-seq track in the database and sorts the result by co-binding count, so the densest regulatory regions sort to the top. Here we feed the broader `putative_enhancers` set (rather than a reference-filtered subset) into `find_enhanceosomes()` so the TF co-binding demo has statistical power across all four islet loci represented in the toy BEDs (INS, GCG, PDX1, MAFB). Filtering against FANTOM5 or the Islet Regulome first would collapse the call set to the narrow BigWig window and leave the Fisher tests underpowered; in practice users chain either the filtered or the unfiltered set into `find_enhanceosomes()` depending on their hypothesis. ```{r enhanceosomes} enhanceosomes <- find_enhanceosomes( putative_enhancers, database) head(as.data.frame(annotations(enhanceosomes)), 5) length(annotations(enhanceosomes)) ``` The first rows list the most densely co-bound regions, ranked by the number of TFs binding each; the final value reports the total number of enhanceosome calls recovered across the toy BED window. # TF co-binding statistics `analyze_tf_cobinding()` runs a Fisher's exact test on each pair of TFs represented in the enhanceosome set and reports odds ratios and point-wise mutual information (PMI), so statistically enriched partnerships can be flagged directly. ```{r cobinding} tf_stats <- analyze_tf_cobinding(enhanceosomes, database) tf_stats$pairwise ``` Each row reports one TF pair with: `tf1`/`tf2` (factor names), `n_both`/`n_tf1_only`/`n_tf2_only`/`n_neither` (2x2 contingency counts of enhanceosomes), `odds_ratio` (Fisher's test effect size; >1 indicates enrichment), `pvalue` and `fdr` (Benjamini-Hochberg adjusted), `pmi` (pointwise mutual information), and `significant` (TRUE when FDR below threshold and `n_both` above the `min_regions` floor). Statistical power here is bounded by the size of the toy dataset; on the full dataset the same call surfaces canonical islet TF partnerships such as FOXA2 + PDX1 with substantially stronger significance. The *"Full Analysis with epiRomics"* vignette works through the full-scale result. # Visualising an enhanceosome with signal tracks `plot_tracks()` renders the selected enhanceosome region with its supporting ChIP-seq, histone, annotation, and BigWig signal tracks stacked for direct visual comparison. By default, paired ATAC and RNA signals for matched cell types are plotted in a **mirrored** layout (`mirror = TRUE`), with one cell type above the axis and the other below — this makes cell-type-specific differences read at a glance. Setting `mirror = FALSE` falls back to a single-direction layout with each track stacked in its own panel. The toy BigWigs cover only the `INS` window on `chr11`, while the enhanceosome set spans all four toy loci. To get a track plot with live BigWig signal we pick the first enhanceosome that falls inside the BigWig window; outside that window the signal panels would be empty. ```{r pick-enhanceosome-index} enh_df <- as.data.frame(annotations(enhanceosomes)) in_window <- which( enh_df$seqnames == "chr11" & enh_df$start >= 1900000 & enh_df$end <= 2300000) length(in_window) selected_index <- in_window[1] selected_index enh_df[selected_index, c("seqnames", "start", "end")] ``` ```{r plot-enhanceosome, fig.align = "center", message = FALSE, warning = FALSE} plot_tracks( enhanceosomes, index = selected_index, database = database, track_connection = track_connection) ``` Reading the figure (top → bottom): the gene model panel anchors the locus and shows nearby transcripts, with the focal gene highlighted; the coordinate bar beneath it marks genomic position; the middle block holds the mirrored signal tracks (alpha-cell tracks above the axis, beta-cell tracks below); the highlighted vertical span marks the enhanceosome interval; and the histone-mark and TF peak annotation panels at the bottom indicate which of the database's tracks contributed to the call. # Gene-centred visualisation `plot_gene_tracks()` takes a gene symbol and renders every database track alongside the BigWig signal in a gene-centred frame. Because the toy window was built around `INS`, this call reproduces the canonical islet insulin locus view at toy-level bundle size. ```{r plot-gene-ins, fig.align = "center", message = FALSE, warning = FALSE} plot_gene_tracks("INS", database, track_connection) ``` # Chromatin state classification `classify_chromatin_states()` labels genomic regions using their combinations of histone marks, producing six ChromHMM-style states [@ernst2012; @kundaje2015]. TSS proximity is incorporated to refine the classification. The six labels are: | State | Defining marks | Interpretation | |-------------|---------------------------------------------|----------------| | `active` | H3K4me1 + H3K27ac | Active enhancer or active regulatory element [@creyghton2010] | | `bivalent` | H3K4me3 + H3K27me3 | Poised developmental promoter held in a balance of activating and repressive marks | | `poised` | H3K4me1 + H3K27me3 | Poised enhancer — primed but held repressed | | `primed` | H3K4me1 only | Primed enhancer — accessible but not yet active | | `repressed` | H3K27me3 (or H3K9me3) without activating marks | Polycomb-repressed or heterochromatic region | | `unmarked` | None of the above | No informative histone signal in the input set | ```{r chromatin-states} chromatin_states <- classify_chromatin_states(database) table(chromatin_states$chromatin_state) table(chromatin_states$genomic_context) ``` The tables above summarise how many regions fall into each state and each genomic context (promoter, intron, distal) within the toy window. Chromatin-state calls can be overlaid directly on a gene-centred track plot by passing the `chromatin_states` data frame together with `show_chromatin = TRUE`: ```{r fig-chromatin-ins, fig.align = "center", message = FALSE, warning = FALSE} plot_gene_tracks("INS", database, track_connection, chromatin_states = chromatin_states, show_chromatin = TRUE, show_enhancer_highlight = TRUE) ``` The overlay highlights where active, bivalent, poised, primed, repressed, or unmarked states land relative to gene bodies and signal — a useful sanity check before moving from enhanceosome calls to downstream biological interpretation. # Moving on to the full dataset This vignette exercises the core enhancer-discovery and visualisation pipeline on a 400 kb toy window. For the complete alpha-vs-beta pancreatic islet analysis — including differential chromatin accessibility integration and the full set of enhanceosome calls — see the companion vignette *"Full Analysis with epiRomics"*. That walkthrough uses the full 1.3 GB Zenodo archive, which is fetched once with: ```r epiRomics::cache_data() ``` The archive is stored in a `BiocFileCache`, so subsequent runs reuse the download. ## Interactive showcases Two companion Shiny applications render published enhanceosome browsers built with `epiRomics`: - **[Shiny epiRomics Mouse Islet Enhanceosome Browser](https://www.huisinglab.com/epiRomics_2021/index.html)** — mouse pancreatic alpha, beta, and delta cell results from [@mawla2023]. - **[Shiny epiRomics Human Islet Enhanceosome Browser](https://huisinglab.shinyapps.io/Server_epiRomics_Shiny_Human/)** — human alpha and beta cell results used as the example dataset for this package [@mawla2021]. # Citation If you use `epiRomics` in published research, please cite both [@mawla2023] and [@mawla2021] (full entries appear in the References section below). A ready-to-paste BibTeX block is also available via: ```r citation("epiRomics") ``` # Session information ```{r sessioninfo} sessionInfo() ``` # References