--- title: "DOTSeq: Detecting Differential ORF Usage in Ribosome Profiling Data" author: - name: "Chun Shen Lim" affiliation: | Department of Biochemistry, Faculty of Biomedical Sciences, University of Otago, Dunedin, New Zealand. - name: "Gabrielle S.W. Chieng" affiliation: | Department of Biochemistry, Faculty of Biomedical Sciences, University of Otago, Dunedin, New Zealand. output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default vignette: > %\VignetteIndexEntry{DOTSeq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} package: DOTSeq bibliography: DOTSeq.bib csl: nature.csl --- # Introduction `DOTSeq` is an R package for detecting differential translation of open reading frames (ORFs) from ribosome profiling (Ribo-seq) [@ingolia2009genome; @ingolia2019ribosome] and matched RNA-seq datasets [@lim2025differential]. Translation is a tightly regulated process, where multiple ORFs within the same gene can modulate protein synthesis. While useful, traditional gene-level approaches (e.g., [annota2seq](https://bioconductor.org/packages/release/bioc/html/anota2seq.html) and [terapadog](https://bioconductor.org/packages/release/bioc/html/terapadog.html) available at Bioconductor [@oertlin2019generally; @carancini2025terapadog]) may overlook these regulatory layers. `DOTSeq` addresses this limitation by modelling translation at the ORF level. `DOTSeq` provides two complementary modules: - **Differential ORF Usage (DOU)**: Detects *cis*-translational regulation by testing whether the relative usage of ORFs within a gene changes across biological conditions. Implemented via a beta-binomial generalised linear model (GLM) ([glmmTMB](https://cran.r-project.org/web/packages/glmmTMB/index.html)) [@RJ-2017-066]. - **Differential Translation Efficiency (DTE)**: Tests for changes in ribosome occupancy relative to RNA abundance across conditions using a negative-binomial GLM ([DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)) [@love2014moderated]. Together, **DOU** and **DTE** modules allow global identification of translation-specific effects, revealing regulatory events such as uORF-mediated repression. # Load required library We start by loading the `DOTSeq` package and its dependencies. `DOTSeq` builds on the Bioconductor ecosystem, e.g., [rtracklayer](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) and [txdbmaker](https://www.bioconductor.org/packages/release/bioc/html/txdbmaker.html) for parsing GTF files, and the [SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) container for storing count matrices and metadata [@lawrence2009rtracklayer; @pages2024txdbmaker; @lawrence2013software; @morgan2025summarized]. ```{r setup, message=FALSE} library(DOTSeq) suppressPackageStartupMessages(library(SummarizedExperiment)) suppressPackageStartupMessages(library(curl)) # Internet availability has_net <- isTRUE(curl::has_internet()) # Ping Ensembl REST; on any error, return FALSE ensembl_ok <- FALSE if (has_net) { ensembl_ok <- tryCatch({ res <- curl::curl_fetch_memory( "https://rest.ensembl.org/info/ping", handle = curl::new_handle(timeout = 3) ) isTRUE(res$status_code >= 200 && res$status_code < 500) }, error = function(e) FALSE) } if (!ensembl_ok) { message("Ensembl not reachable; vignette will fall back to org.Hs.eg.db for gene symbols.") suppressPackageStartupMessages(library(org.Hs.eg.db)) } ``` # Example dataset We demonstrate `DOTSeq` functionalities using a HeLa cell cycle dataset [@ly2024nuclear]. This dataset includes the snapshots of translation landscapes in HeLa cells synchronised at Mitotic Cycling, Mitotic Arrest, and Interphase. Each condition contains two biological replicates. Here, we use only a small portion of the dataset to speed up the vignette execution while still demonstrating all key functionalities. ```{r dir} dir <- system.file("extdata", package = "DOTSeq") list.files(dir) ``` # Input data Ribo-seq and RNA-seq reads are preprocessed using standard pipelines. Reads were adapter-trimmed using Cutadapt [@martin2011cutadapt], aligned to the reference genome using STAR [@dobin2013star], and summarised at the ORF level using featureCounts [@liao2014featurecounts]. Each row of the input count matrix corresponds to an ORF, while columns contain read counts for individual samples. The first six columns represent ORF-level annotation fields such as gene ID, transcript ID, chromosome, and coordinates. This ensures accurate mapping between read counts and annotated ORFs, which is critical for our analysis. ```{r read-in-count-file} cnt <- read.table( file.path(dir, "featureCounts.cell_cycle_subset.txt.gz"), header = TRUE, comment.char = "#" ) names(cnt) <- gsub(".*(SRR[0-9]+).*", "\\1", names(cnt)) head(cnt) ``` # ORF annotation Flattened ORF-level annotations were generated using the `orf_to_gtf.py` wrapper for RIBOSS [@lim2025riboss]. Alternatively, annotations can be prepared within R using `getORFs()` (see [Prepare annotations](#annotation)). This pre-processing step ensures that each ORF occupies a distinct, non-overlapping genomic interval, avoiding ambiguity when assigning reads. This design follows the same logic as `DEXSeq`'s exon flattening, but at the level of ORFs rather than exons [@anders2012detecting]. ```{r ref} gtf <- file.path(dir, "gencode.v47.orf_flattened_subset.gtf.gz") bed <- file.path(dir, "gencode.v47.orf_flattened_subset.bed.gz") ``` # Prepare condition table The condition table defines the experimental design for both Ribo-seq and RNA-seq samples. It allows `DOTSeq` to match Ribo/RNA-seq pairs and to model translation-specific interactions. Each row corresponds to one sequencing run. The table **must** include four essential columns: - run: unique identifier for each sequencing run - strategy: either `"ribo"` or `"rna"` (or `"0"` or `"1"`), indicating the sequencing strategy - replicate: biological replicate identifier - condition: experimental condition (e.g., control, treatment) Additional columns (e.g., batch) can also be included. In the example below, we load the metadata and restrict the analysis to samples pre-treated with cycloheximide (chx) prior to library preparation: ```{r condition-table} cond <- read.table(file.path(dir, "metadata.txt.gz")) # Ensure required column names for DOTSeq names(cond) <- c("run", "strategy", "replicate", "treatment", "condition") # Filter to include only chx-treated samples cond <- cond[cond$treatment == "chx", ] head(cond) ``` # Run DOTSeq The main function `DOTSeq()` performs the differential analysis workflow, starting from parsing input files to post hoc inference. It sequentially runs three key steps: - Construction of two `SummarizedExperiment` objects (`DOTSeqDataSet()`). - model fitting for DOU via beta-binomial GLM (`fitDOU()` using `glmmTMB`) and DTE via negative-binomial GLM (`DESeq2`). - Post hoc inference using `emmeans` [@emmeans] for contrasts and `ashr` [@stephens2017false] for adaptive shrinkage of effect sizes (`testDOU()`). While each step can be run individually, we construct `DOTSeqDataSets` with `DOTSeqDataSetsFromFeatureCounts()`, followed by the `DOTSeq()` main function here for simplicity. To capture the translation-specific interaction effect ($\beta_{\text{int}}$), `DOTSeq()` uses `formula = ~ condition * strategy` with `dispersion_modeling = "auto"` by default. The design and dispersion formulas can be customised to include batch and random effects. ```{r dotseq, warning=FALSE} # Create a DOTSeqDataSets object datasets <- DOTSeqDataSetsFromFeatureCounts( count_table = cnt, condition_table = cond, flattened_gtf = gtf, flattened_bed = bed, verbose = FALSE ) # Run the DOTSeq workflow d <- DOTSeq(datasets = datasets) ``` In this case Interphase was automatically assigned as the baseline. The target and baseline conditions can explicitly defined as `target = "Mitotic_Cycling"` and `baseline = "Interphase"`. # Inspect results The output is a `DOTSeqDataSets` object that encapsulates the post hoc contrast results for DOU and DTE analyses. ```{r dotseq-objects} show(d) ``` As indicated, we will use `getContrasts()` to extract the results. ```{r posthoc} results <- getContrasts(d, type = "interaction") results ``` These data frames summarise differential translation results for each ORF. Key statistics include: - `posterior`: The shrunken log-odds fold change in ORF usage (DOU), reflecting the estimated effect size after empirical Bayes shrinkage. - `lfsr` (Local False Sign Rate): The probability that the estimated effect has the wrong sign, providing a measure of directional uncertainty. - `log2FoldChange`: The shrunken log2 fold change in ORF translation efficiency (DTE), indicating relative changes in ribosome loading. - `padj`: The adjusted p-value for DTE, accounting for multiple testing. The DTE effect size (`log2FoldChange`) was derived from `DESeq2::lfcShrink()` within the `DOTSeq()` main function. We specify `type = ashr` to apply adaptive shrinkage and compute all pairwise contrasts between conditions. Additional statistics for DOU results include: - `betahat`: The raw (unshrunken) effect size estimate. - `sebetahat`: The standard error of the raw estimate. - `qvalue` and `lfdr` (Local False Discovery Rate): Alternative measures of significance and false discovery, also computed by `ashr`. - `waldpval` and `waldpadj`: Wald test results are included for completeness, providing frequentist estimates of significance. We can also inspect the `rowData` and `rowRanges` of the `SummarizedExperiment` objects to explore ORF level annotation. This includes mappings from gene to transcript to ORF IDs, as well as `orf_type`, which currently supports uORF, mORF, and dORF. Filtering criteria `count_filter` and `singlet_filter` determine whether an ORF is retained (`is_kept`) for model fitting. `DOUResults` stores the model fit results and the `emmeans` objects for post hoc contrasts. ```{r rowdata} rowData(getDOU(d)) ``` The `rowRanges` slot retains the genomic coordinates corresponding to ORFs, extending from the translational start (AUG) to stop codons. These ranges may include intronic regions, depending on the gene structures. ```{r rowranges} rowRanges(getDOU(d)) ``` # Visualisation `DOTSeq` provides the `plotDOT()` function for visualising DOU and DTE results. The following plots can be generated: ```{r merge} ou <- results$DOU[results$DOU$contrast == "Mitotic_Cycling - Interphase", ] te <- results$DTE[results$DTE$contrast == "Mitotic_Cycling - Interphase", ] results <- merge(ou, te, by = c("orf_id", "contrast"), all = TRUE) ``` ## Venn diagram: To explore the relationship between DOU and DTE, we can visualise the overlap between significant ORFs identified by each module. ORFs that are significant in **DOU** and **DTE** analyses suggest changes in both usage and translation efficiency, potentially indicating strong translational control. ```{r plot-venn, eval = requireNamespace("eulerr", quietly = TRUE), fig.small = TRUE} plotDOT( plot_type = "venn", results = results, id_mapping = FALSE, force_new_device = FALSE ) ``` ## Composite scatter plot: To compare DOU and DTE, we generate a composite scatter plot of their respective effect sizes. Marginal density plots along each axis illustrate the distribution of effect sizes for each metric. Spearman correlation is calculated and reported. For this scatter plot, points are colour-coded by `significance`, allowing us to identify ORFs with strong effects in one analysis but weak in the other. This visualisation helps highlight patterns of differential translation across ORFs. ```{r plot-composite-by-significance, fig.small = TRUE} plotDOT( plot_type = "composite", results = results, id_mapping = FALSE, plot_params = list(color_by = "significance", legend_position = "bottomright"), force_new_device = FALSE ) ``` If `DOUData` is given as an input, points can also be colour-coded by `orf_type`. This enables quick identification of patterns in differential translation across different ORF types. ```{r plot-composite-by-orfs, fig.small = TRUE} plotDOT( plot_type = "composite", results = results, id_mapping = FALSE, data = getDOU(d), plot_params = list(color_by = "orf_type", legend_position = "bottomright"), force_new_device = FALSE ) ``` ## Volcano plot: To highlight ORFs with strong differential usage, we construct a volcano plot comparing significance versus DOU effect size. By default, gene identifiers are mapped from Ensembl gene IDs to HGNC gene symbols using `biomaRt` This mapping improves plot readability by displaying gene symbols. The resulting ID mapping dataframe can be stored as a dataframe and reused across plots to avoid repeated downloads. ```{r id-mapping-online-or-offline, fig.small = TRUE} # Try accessing Ensembl REST API mapping <- NULL if (isTRUE(ensembl_ok)) { mapping <- try( plotDOT( plot_type = "volcano", results = results, id_mapping = TRUE, plot_params = list(color_by = "significance", top_hits = 3, legend_position = "topright"), force_new_device = FALSE ), silent = TRUE ) } # Fallback to org.Hs.eg.db if (is.null(mapping)) { suppressPackageStartupMessages(library(org.Hs.eg.db)) rd <- rowData(getDOU(d)) rd$gene_id <- sub("\\.\\d+$", "", rd$gene_id) mapping <- AnnotationDbi::select( org.Hs.eg.db, keys = unique(rd$gene_id), keytype = "ENSEMBL", columns = "SYMBOL" ) names(mapping) <- c("ensembl_gene_id", "hgnc_symbol") } ``` Similarly, volcano plots can be colour-coded by significance or ORF types, depending on whether rowdata is given as an input. ```{r plot-volcano-by-orfs, fig.small = TRUE} plotDOT( plot_type = "volcano", results = results, data = getDOU(d), id_mapping = mapping, plot_params = list(color_by = "orf_type", top_hits = 3, legend_position = "topright"), force_new_device = FALSE ) ``` ## Heatmap: To visualise the top-ranked genes, we generate a heatmap based on DOU metrics for the top 20 uORF-regulated genes. ```{r plot-heatmap, warning=FALSE, fig.width=4, fig.height=6, fig.align="center", out.width="325px"} plotDOT( plot_type = "heatmap", results = results, data = getDOU(d), id_mapping = mapping, plot_params = list(rank_by = "score", top_hits = 20), force_new_device = FALSE ) ``` ## ORF usage plot: We can visualise the estimated ORF usage within a gene as inferred by the **DOU** module. This plot combines the probabilistic estimates of ORF usage with post hoc statistical significance annotations, allowing interpretation of DOU across conditions. If the ID mapping dataframe provided, the gene symbol is displayed as the plot title. ```{r plot-usage-gene-symbol, eval = requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggsignif", quietly = TRUE)} orderby <- c("Mitotic_Cycling", "Mitotic_Arrest", "Interphase") id <- "TAF2" plotDOT( plot_type = "usage", data = getDOU(d), gene_id = id, id_mapping = mapping, plot_params = list(order_by = orderby), force_new_device = FALSE ) ``` The `plotDOT()` function supports querying according to the gene IDs or gene symbols, as listed in the rowData, rowRanges and/or the ID mapping dataframe. Stable gene IDs (e.g., ENSG00000060339) can be used without their version suffix (e.g., .14). ```{r plot-usage-gene-id, eval = requireNamespace("ggsignif", quietly = TRUE)} id <- "ENSG00000060339" plotDOT( plot_type = "usage", data = getDOU(d), gene_id = id, id_mapping = mapping, plot_params = list(order_by = orderby), force_new_device = FALSE ) ``` # Summary `DOTSeq` offers a unified framework for modelling translational control at the ORF level. `DOTSeq` has two modules capturing distinct biological signals: **DOU** tests for shifts in relative usage within genes (*cis*-regulation), whereas **DTE** tests for changes in per-ORF translation efficiency relative to transcript abundance. They can be used together to provide a fuller picture of translational regulation. # References # Prepare annotations {#annotation} Besides using the `orf_to_gtf.py` wrapper [@lim2025riboss], we can generate ORF-level annotation natively in R using the `getORFs()` function, which is based on the [ORFik](https://www.bioconductor.org/packages/release/bioc/html/ORFik.html)'s engine [@tjeldnes2021orfik]. This approach creates a `GRanges` object. ```{r get-orfs, eval = FALSE} library(withr) falink <- "https://ftp.ensembl.org/pub/release-78/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.cdna.all.fa.gz" gtflink <- "https://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz" annotation <- basename(gtflink) sequences <- basename(falink) download.file(gtflink, destfile=annotation) download.file(falink, destfile=sequences) # This will generate flattened ORF-level annotation gr <- getORFs( sequences = sequences, annotation = annotation, organism = "Drosophila melanogaster" ) withr::defer(unlink(c(annotation, sequences))) ``` We perform read counting using the RNA-seq BAM files from the [pasillaBamSubset](https://bioconductor.org/packages/release/data/experiment/html/pasillaBamSubset.html) package. We can optionally clean up the BAM files before read counting by discarding reads mapped to introns, noncoding genes, and intergenic regions. Note that this dataset is not derived from a Ribo-seq experiment and is used solely for illustrative purposes. It should not be used for biological interpretation. ```{r pasilla, eval = FALSE} library(pasillaBamSubset) library(GenomeInfoDb) bam_list <- c(untreated1_chr4(), untreated3_chr4()) # Keep only reads mapped to the exons of coding genes temp_dir <- tempdir() getExonicReads(gr = gr, seqlevels_style = "UCSC", bam_files = bam_list, bam_output_dir = temp_dir, coding_genes_only = TRUE) # Get the list of filtered BAM files bam_list <- list.files( path = temp_dir, pattern = "*exonic.*", full.names = TRUE ) cnt <- countReads(gr = gr, bam_files = bam_list) withr::defer(unlink(bam_list)) ``` Here we make use the ORF-level annotation and create count and condition tables as input for `DOTSeqDataSetsFromSummarizeOverlaps()`. This is alternative to using `DOTSeqDataSetsFromFeatureCounts()` as demonstrated at the beginning. ```{r expand, eval = FALSE} set.seed(42) # Create count_table # Create two replicates for each condition with random scaling rna_treated_reps <- do.call( cbind, replicate( 2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif( nrow(cnt), min = 0.5, max = 2 ), simplify = FALSE ) ) rna_control_reps <- do.call( cbind, replicate( 2, cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif( nrow(cnt), min = 0.1, max = 0.5 ), simplify = FALSE ) ) ribo_treated_reps <- do.call( cbind, replicate( 2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif( nrow(cnt), min = 0.5, max = 2 ), simplify = FALSE ) ) ribo_control_reps <- do.call( cbind, replicate( 2, cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif( nrow(cnt), min = 0.1, max = 0.5 ), simplify = FALSE ) ) # Combine and name columns colnames(rna_treated_reps) <- paste0("rna_treated", 1:2) colnames(rna_control_reps) <- paste0("rna_control", 1:2) colnames(ribo_treated_reps) <- paste0("ribo_treated", 1:2) colnames(ribo_control_reps) <- paste0("ribo_control", 1:2) cnt_expanded <- cbind( rna_treated_reps, rna_control_reps, ribo_treated_reps, ribo_control_reps ) # Convert numbers to integer cnt_expanded <- round(cnt_expanded) storage.mode(cnt_expanded) <- "integer" rownames(cnt_expanded) <- rownames(cnt) cnt_expanded <- as.data.frame(cnt_expanded) # Create condition_table # Sample names from cnt_expanded sample_names <- colnames(cnt_expanded) # Define condition and strategy for each sample condition <- c( rep("treated", 2), rep("control", 2), rep("treated", 2), rep("control", 2) ) strategy <- c(rep("RNA", 4), rep("Ribo", 4)) cond <- data.frame( run = sample_names, replicate = c(1,2), condition = factor(condition, levels = c("control", "treated")), strategy = factor(strategy, levels = c("RNA", "Ribo")) ) # Create a DOTSeqDataSets object d <- DOTSeqDataSetsFromSummarizeOverlaps( count_table = cnt_expanded, condition_table = cond, annotation = gr ) invisible(file.remove(bam_list)) invisible(file.remove(sub("\\.gz$", ".sqlite", annotation))) ``` # Simulation and benchmarking `DOTSeq` includes the function `simDOT()` for generating simulated Ribo-seq and RNA-seq count matrices. It enables benchmarking under controlled scenarios with configurable effect sizes, batch effects, and ORF types. Simulations in our manuscript [@lim2025differential] show that: - The **DOU** module performs well at low-to-moderate effect sizes on DOU-type scenarios. This module is specifically designed to detect condition-dependent shifts in the expected proportion of Ribo-seq to RNA-seq counts for individual ORFs relative to other ORFs within the same gene (i.e., changes in ORF usage). This module targets *cis*-regulatory usage changes that alter relative ORF proportions within genes. - The **DTE** module performs better at large-magnitude changes in translation efficiency. This module captures ORF-level changes in ribosome occupancy relative to RNA abundance. # Session info ```{r sessionInfo} sessionInfo() ```