---
title: "DOTSeq: Detecting Differential ORF Usage in Ribosome Profiling Data"
author: 
- name: "Chun Shen Lim"
  affiliation: |
    Department of Biochemistry, Faculty of Biomedical Sciences, 
    University of Otago, Dunedin, New Zealand.
- name: "Gabrielle S.W. Chieng"
  affiliation: |
    Department of Biochemistry, Faculty of Biomedical Sciences, 
    University of Otago, Dunedin, New Zealand.
output:
  BiocStyle::html_document:
    toc_float: true
  BiocStyle::pdf_document: default
vignette: >
  %\VignetteIndexEntry{DOTSeq}  
  %\VignetteEngine{knitr::rmarkdown}  
  %\VignetteEncoding{UTF-8}
package: DOTSeq
bibliography: DOTSeq.bib
csl: nature.csl
---

# Introduction
`DOTSeq` is an R package for detecting differential translation of open 
reading frames (ORFs) from ribosome profiling (Ribo-seq) 
[@ingolia2009genome; @ingolia2019ribosome] and matched RNA-seq datasets 
[@lim2025differential]. Translation is a tightly regulated process, 
where multiple ORFs within the same gene can modulate protein synthesis.

While useful, traditional gene-level approaches (e.g., 
[annota2seq](https://bioconductor.org/packages/release/bioc/html/anota2seq.html) 
and 
[terapadog](https://bioconductor.org/packages/release/bioc/html/terapadog.html) 
available at Bioconductor [@oertlin2019generally; @carancini2025terapadog]) 
may overlook these regulatory layers. `DOTSeq` addresses this limitation by 
modelling translation at the ORF level.

`DOTSeq` provides two complementary modules:

- **Differential ORF Usage (DOU)**: Detects *cis*-translational regulation 
by testing whether the relative usage of ORFs within a gene changes across 
biological conditions. Implemented via a beta-binomial generalised linear 
model (GLM) ([glmmTMB](https://cran.r-project.org/web/packages/glmmTMB/index.html)) 
[@RJ-2017-066].

- **Differential Translation Efficiency (DTE)**: Tests for changes in 
ribosome occupancy relative to RNA abundance across conditions using a 
negative-binomial GLM ([DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)) 
[@love2014moderated].

Together, **DOU** and **DTE** modules allow global identification of 
translation-specific effects, revealing regulatory events such as 
uORF-mediated repression.

# Load required library
We start by loading the `DOTSeq` package and its dependencies. 
`DOTSeq` builds on the Bioconductor ecosystem, e.g., 
[rtracklayer](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) 
and 
[txdbmaker](https://www.bioconductor.org/packages/release/bioc/html/txdbmaker.html) 
for parsing GTF files, and the 
[SummarizedExperiment](https://bioconductor.org/packages/release/bioc/html/SummarizedExperiment.html) 
container for storing count matrices and metadata 
[@lawrence2009rtracklayer; @pages2024txdbmaker; @lawrence2013software; @morgan2025summarized].

```{r setup, message=FALSE}
library(DOTSeq)
suppressPackageStartupMessages(library(SummarizedExperiment))
suppressPackageStartupMessages(library(curl))

# Internet availability
has_net <- isTRUE(curl::has_internet())

# Ping Ensembl REST; on any error, return FALSE
ensembl_ok <- FALSE
if (has_net) {
  ensembl_ok <- tryCatch({
    res <- curl::curl_fetch_memory(
      "https://rest.ensembl.org/info/ping",
      handle = curl::new_handle(timeout = 3)
    )
    isTRUE(res$status_code >= 200 && res$status_code < 500)
  }, error = function(e) FALSE)
}

if (!ensembl_ok) {
  message("Ensembl not reachable; vignette will fall back to org.Hs.eg.db for gene symbols.")
  suppressPackageStartupMessages(library(org.Hs.eg.db))
}
```

# Example dataset
We demonstrate `DOTSeq` functionalities using a HeLa cell cycle dataset 
[@ly2024nuclear]. This dataset includes the snapshots of translation 
landscapes in HeLa cells synchronised at Mitotic Cycling, Mitotic Arrest, 
and Interphase. Each condition contains two biological replicates. 

Here, we use only a small portion of the dataset to speed up the 
vignette execution while still demonstrating all key functionalities.

```{r dir}
dir <- system.file("extdata", package = "DOTSeq")
list.files(dir)
```

# Input data
Ribo-seq and RNA-seq reads are preprocessed using standard pipelines. 
Reads were adapter-trimmed using Cutadapt [@martin2011cutadapt], aligned 
to the reference genome using STAR [@dobin2013star], and summarised at the 
ORF level using featureCounts [@liao2014featurecounts]. 

Each row of the input count matrix corresponds to an ORF, while columns 
contain read counts for individual samples. The first six columns 
represent ORF-level annotation fields such as gene ID, transcript ID, 
chromosome, and coordinates. This ensures accurate mapping between 
read counts and annotated ORFs, which is critical for our analysis.

```{r read-in-count-file}
cnt <- read.table(
    file.path(dir, "featureCounts.cell_cycle_subset.txt.gz"), 
    header = TRUE, 
    comment.char = "#"
)
names(cnt) <- gsub(".*(SRR[0-9]+).*", "\\1", names(cnt))
head(cnt)
```

# ORF annotation 
Flattened ORF-level annotations were generated using the `orf_to_gtf.py` 
wrapper for RIBOSS [@lim2025riboss]. Alternatively, annotations can be 
prepared within R using `getORFs()` 
(see [Prepare annotations](#annotation)). This pre-processing step 
ensures that each ORF occupies a distinct, non-overlapping genomic interval, 
avoiding ambiguity when assigning reads. This design follows the same 
logic as `DEXSeq`'s exon flattening, but at the level of ORFs rather 
than exons [@anders2012detecting].

```{r ref}
gtf <- file.path(dir, "gencode.v47.orf_flattened_subset.gtf.gz")
bed <- file.path(dir, "gencode.v47.orf_flattened_subset.bed.gz")
```

# Prepare condition table
The condition table defines the experimental design for both Ribo-seq and 
RNA-seq samples. It allows `DOTSeq` to match Ribo/RNA-seq pairs and to 
model translation-specific interactions. 

Each row corresponds to one sequencing run. The table **must** include four 
essential columns: 
- run: unique identifier for each sequencing run

- strategy: either `"ribo"` or `"rna"` (or `"0"` or `"1"`), indicating the 
sequencing strategy

- replicate: biological replicate identifier

- condition: experimental condition (e.g., control, treatment)

Additional columns (e.g., batch) can also be included. In the example 
below, we load the metadata and restrict the analysis to samples 
pre-treated with cycloheximide (chx) prior to library preparation:

```{r condition-table}
cond <- read.table(file.path(dir, "metadata.txt.gz"))

# Ensure required column names for DOTSeq
names(cond) <- c("run", "strategy", "replicate", "treatment", "condition")

# Filter to include only chx-treated samples
cond <- cond[cond$treatment == "chx", ]
head(cond)
```

# Run DOTSeq
The main function `DOTSeq()` performs the differential analysis workflow, 
starting from parsing input files to post hoc inference. It sequentially 
runs three key steps:

- Construction of two `SummarizedExperiment` objects (`DOTSeqDataSet()`).

- model fitting for DOU via beta-binomial GLM (`fitDOU()` using `glmmTMB`) 
and DTE via negative-binomial GLM (`DESeq2`).

- Post hoc inference using `emmeans` [@emmeans] for contrasts and `ashr` 
[@stephens2017false] for adaptive shrinkage of effect sizes (`testDOU()`). 

While each step can be run individually, we construct `DOTSeqDataSets` with 
`DOTSeqDataSetsFromFeatureCounts()`, followed by the `DOTSeq()` main 
function here for simplicity. To capture the translation-specific 
interaction effect ($\beta_{\text{int}}$), `DOTSeq()` uses 
`formula = ~ condition * strategy` with `dispersion_modeling = "auto"` by 
default. The design and dispersion formulas can be customised to include 
batch and random effects. 

```{r dotseq, warning=FALSE}
# Create a DOTSeqDataSets object
datasets <- DOTSeqDataSetsFromFeatureCounts(
    count_table = cnt,
    condition_table = cond,
    flattened_gtf = gtf,
    flattened_bed = bed,
    verbose = FALSE
)

# Run the DOTSeq workflow
d <- DOTSeq(datasets = datasets)
```

In this case Interphase was automatically assigned as the baseline. The 
target and baseline conditions can explicitly defined as 
`target = "Mitotic_Cycling"` and `baseline = "Interphase"`.

# Inspect results
The output is a `DOTSeqDataSets` object that encapsulates the post hoc 
contrast results for DOU and DTE analyses.

```{r dotseq-objects}
show(d)
```

As indicated, we will use `getContrasts()` to extract the results.

```{r posthoc}
results <- getContrasts(d, type = "interaction")
results
```

These data frames summarise differential translation results for each ORF. 
Key statistics include:

- `posterior`: The shrunken log-odds fold change in ORF usage (DOU), 
reflecting the estimated effect size after empirical Bayes shrinkage.

- `lfsr` (Local False Sign Rate): The probability that the estimated effect 
has the wrong sign, providing a measure of directional uncertainty.

- `log2FoldChange`: The shrunken log2 fold change in ORF translation 
efficiency (DTE), indicating relative changes in ribosome loading.

- `padj`: The adjusted p-value for DTE, accounting for multiple testing.

The DTE effect size (`log2FoldChange`) was derived from 
`DESeq2::lfcShrink()` within the `DOTSeq()` main function. We specify 
`type = ashr` to apply adaptive shrinkage and compute all pairwise 
contrasts between conditions. 

Additional statistics for DOU results include:

- `betahat`: The raw (unshrunken) effect size estimate.

- `sebetahat`: The standard error of the raw estimate.

- `qvalue` and `lfdr` (Local False Discovery Rate): Alternative measures 
of significance and false discovery, also computed by `ashr`.

- `waldpval` and `waldpadj`: Wald test results are included for completeness, 
providing frequentist estimates of significance.

We can also inspect the `rowData` and `rowRanges` of the 
`SummarizedExperiment` objects to explore ORF level annotation. This 
includes mappings from gene to transcript to ORF IDs, as well as `orf_type`, 
which currently supports uORF, mORF, and dORF. Filtering criteria 
`count_filter` and `singlet_filter` determine whether an ORF is retained 
(`is_kept`) for model fitting. `DOUResults` stores the model fit 
results and the `emmeans` objects for post hoc contrasts.

```{r rowdata}
rowData(getDOU(d))
```

The `rowRanges` slot retains the genomic coordinates corresponding to ORFs, 
extending from the translational start (AUG) to stop codons. These ranges 
may include intronic regions, depending on the gene structures.

```{r rowranges}
rowRanges(getDOU(d))
```

# Visualisation
`DOTSeq` provides the `plotDOT()` function for visualising DOU and DTE 
results. The following plots can be generated:

```{r merge}
ou <- results$DOU[results$DOU$contrast == "Mitotic_Cycling - Interphase", ]
te <- results$DTE[results$DTE$contrast == "Mitotic_Cycling - Interphase", ]

results <- merge(ou, te, by = c("orf_id", "contrast"), all = TRUE)
```

## Venn diagram: 
To explore the relationship between DOU and DTE, we can visualise the 
overlap between significant ORFs identified by each module. ORFs that 
are significant in **DOU** and **DTE** analyses suggest changes in both 
usage and translation efficiency, potentially indicating strong 
translational control.

```{r plot-venn, eval = requireNamespace("eulerr", quietly = TRUE), fig.small = TRUE}
plotDOT(
    plot_type = "venn", 
    results = results, 
    id_mapping = FALSE, 
    force_new_device = FALSE
)
```

## Composite scatter plot: 
To compare DOU and DTE, we generate a composite scatter plot of their 
respective effect sizes. Marginal density plots along each axis 
illustrate the distribution of effect sizes for each metric. Spearman 
correlation is calculated and reported. 

For this scatter plot, points are colour-coded by `significance`, 
allowing us to identify ORFs with strong effects in one analysis but 
weak in the other. This visualisation helps highlight patterns of 
differential translation across ORFs. 

```{r plot-composite-by-significance, fig.small = TRUE}
plotDOT(
    plot_type = "composite", 
    results = results, 
    id_mapping = FALSE, 
    plot_params = list(color_by = "significance", legend_position = "bottomright"),
    force_new_device = FALSE
)
```

If `DOUData` is given as an input, points can also be colour-coded by 
`orf_type`. This enables quick identification of patterns in differential 
translation across different ORF types.

```{r plot-composite-by-orfs, fig.small = TRUE}
plotDOT(
    plot_type = "composite", 
    results = results, 
    id_mapping = FALSE, 
    data = getDOU(d), 
    plot_params = list(color_by = "orf_type", legend_position = "bottomright"),
    force_new_device = FALSE
)
```

## Volcano plot: 
To highlight ORFs with strong differential usage, we construct a volcano 
plot comparing significance versus DOU effect size.

By default, gene identifiers are mapped from Ensembl gene IDs to HGNC 
gene symbols using `biomaRt` This mapping improves plot readability by 
displaying gene symbols. The resulting ID mapping dataframe can be 
stored as a dataframe and reused across plots to avoid repeated downloads.

```{r id-mapping-online-or-offline, fig.small = TRUE}
# Try accessing Ensembl REST API 
mapping <- NULL
if (isTRUE(ensembl_ok)) {
    mapping <- try(
        plotDOT(
            plot_type   = "volcano",
            results     = results,
            id_mapping  = TRUE, 
            plot_params = list(color_by = "significance", top_hits = 3, legend_position = "topright"),
            force_new_device = FALSE
        ),
    silent = TRUE
    )
}

# Fallback to org.Hs.eg.db
if (is.null(mapping)) {
    suppressPackageStartupMessages(library(org.Hs.eg.db))

    rd <- rowData(getDOU(d))
    rd$gene_id <- sub("\\.\\d+$", "", rd$gene_id)

    mapping <- AnnotationDbi::select(
        org.Hs.eg.db,
        keys    = unique(rd$gene_id),
        keytype = "ENSEMBL",
        columns = "SYMBOL"
    )
    names(mapping) <- c("ensembl_gene_id", "hgnc_symbol")
}
```

Similarly, volcano plots can be colour-coded by significance or ORF types, 
depending on whether rowdata is given as an input. 

```{r plot-volcano-by-orfs, fig.small = TRUE}
plotDOT(
    plot_type = "volcano", 
    results = results,
    data = getDOU(d),
    id_mapping = mapping,
    plot_params = list(color_by = "orf_type", top_hits = 3, legend_position = "topright"),
    force_new_device = FALSE
)
```

## Heatmap: 
To visualise the top-ranked genes, we generate a heatmap based on DOU 
metrics for the top 20 uORF-regulated genes. 

```{r plot-heatmap, warning=FALSE, fig.width=4, fig.height=6, fig.align="center", out.width="325px"}
plotDOT(
    plot_type = "heatmap", 
    results = results, 
    data = getDOU(d), 
    id_mapping = mapping, 
    plot_params = list(rank_by = "score", top_hits = 20),
    force_new_device = FALSE
)
```

## ORF usage plot:
We can visualise the estimated ORF usage within a gene as inferred by 
the **DOU** module. This plot combines the probabilistic estimates of 
ORF usage with post hoc statistical significance annotations, allowing 
interpretation of DOU across conditions. If the ID mapping dataframe 
provided, the gene symbol is displayed as the plot title.

```{r plot-usage-gene-symbol, eval = requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggsignif", quietly = TRUE)}
orderby <- c("Mitotic_Cycling", "Mitotic_Arrest", "Interphase")

id <- "TAF2"

plotDOT(
    plot_type = "usage",
    data = getDOU(d), 
    gene_id = id, 
    id_mapping = mapping, 
    plot_params = list(order_by = orderby),
    force_new_device = FALSE
)
```

The `plotDOT()` function supports querying according to the gene 
IDs or gene symbols, as listed in the rowData, rowRanges and/or the 
ID mapping dataframe. Stable gene IDs (e.g., ENSG00000060339) can be 
used without their version suffix (e.g., .14).

```{r plot-usage-gene-id, eval = requireNamespace("ggsignif", quietly = TRUE)}
id <- "ENSG00000060339"

plotDOT(
    plot_type = "usage",
    data = getDOU(d),
    gene_id = id, 
    id_mapping = mapping, 
    plot_params = list(order_by = orderby),
    force_new_device = FALSE
)
```

# Summary

`DOTSeq` offers a unified framework for modelling translational control at 
the ORF level. `DOTSeq` has two modules capturing distinct biological 
signals: **DOU** tests for shifts in relative usage within genes 
(*cis*-regulation), whereas **DTE** tests for changes in per-ORF translation 
efficiency relative to transcript abundance. They can be used together to 
provide a fuller picture of translational regulation.

# References

# Prepare annotations {#annotation}

Besides using the `orf_to_gtf.py` wrapper [@lim2025riboss], 
we can generate ORF-level annotation natively in R using the `getORFs()` 
function, which is based on the 
[ORFik](https://www.bioconductor.org/packages/release/bioc/html/ORFik.html)'s 
engine [@tjeldnes2021orfik]. This approach creates a `GRanges` object.

```{r get-orfs, eval = FALSE}
library(withr)

falink <- "https://ftp.ensembl.org/pub/release-78/fasta/drosophila_melanogaster/cdna/Drosophila_melanogaster.BDGP5.cdna.all.fa.gz"
gtflink <- "https://ftp.ensembl.org/pub/release-78/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP5.78.gtf.gz"

annotation <- basename(gtflink)
sequences <- basename(falink)

download.file(gtflink, destfile=annotation)
download.file(falink, destfile=sequences)

# This will generate flattened ORF-level annotation
gr <- getORFs(
    sequences = sequences,
    annotation = annotation,
    organism = "Drosophila melanogaster"
)

withr::defer(unlink(c(annotation, sequences)))
```

We perform read counting using the RNA-seq BAM files from the 
[pasillaBamSubset](https://bioconductor.org/packages/release/data/experiment/html/pasillaBamSubset.html) 
package. We can optionally clean up the BAM files before read counting 
by discarding reads mapped to introns, noncoding genes, and intergenic 
regions. Note that this dataset is not derived from a Ribo-seq experiment 
and is used solely for illustrative purposes. It should not be used for 
biological interpretation.

```{r pasilla, eval = FALSE}
library(pasillaBamSubset)
library(GenomeInfoDb)

bam_list <- c(untreated1_chr4(), untreated3_chr4())

# Keep only reads mapped to the exons of coding genes
temp_dir <- tempdir()
getExonicReads(gr = gr, seqlevels_style = "UCSC", bam_files = bam_list, bam_output_dir = temp_dir, coding_genes_only = TRUE)

# Get the list of filtered BAM files
bam_list <- list.files(
    path = temp_dir,
    pattern = "*exonic.*",
    full.names = TRUE
)

cnt <- countReads(gr = gr, bam_files = bam_list)
withr::defer(unlink(bam_list))
```

Here we make use the ORF-level annotation and create count and condition 
tables as input for `DOTSeqDataSetsFromSummarizeOverlaps()`. This is alternative to 
using `DOTSeqDataSetsFromFeatureCounts()` as demonstrated at the beginning.

```{r expand, eval = FALSE}
set.seed(42)

# Create count_table
# Create two replicates for each condition with random scaling
rna_treated_reps <- do.call(
    cbind, 
    replicate(
        2, 
        cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(
            nrow(cnt), 
            min = 0.5, 
            max = 2
        ), 
        simplify = FALSE
    )
)
rna_control_reps <- do.call(
    cbind, 
    replicate(
        2, 
        cnt[["untreated3_chr4.exonic.sorted.bam"]] * runif(
            nrow(cnt), 
            min = 0.1, 
            max = 0.5
        ), 
        simplify = FALSE
    )
)
ribo_treated_reps <- do.call(
    cbind, replicate(
        2, 
        cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(
            nrow(cnt), 
            min = 0.5, 
            max = 2
        ), 
        simplify = FALSE
    )
)
ribo_control_reps <- do.call(
    cbind, 
    replicate(
        2, 
        cnt[["untreated1_chr4.exonic.sorted.bam"]] * runif(
            nrow(cnt), 
            min = 0.1, 
            max = 0.5
        ), 
        simplify = FALSE
    )
)

# Combine and name columns
colnames(rna_treated_reps) <- paste0("rna_treated", 1:2)
colnames(rna_control_reps) <- paste0("rna_control", 1:2)
colnames(ribo_treated_reps) <- paste0("ribo_treated", 1:2)
colnames(ribo_control_reps) <- paste0("ribo_control", 1:2)

cnt_expanded <- cbind(
    rna_treated_reps, 
    rna_control_reps, 
    ribo_treated_reps, 
    ribo_control_reps
)

# Convert numbers to integer
cnt_expanded <- round(cnt_expanded)
storage.mode(cnt_expanded) <- "integer"

rownames(cnt_expanded) <- rownames(cnt)
cnt_expanded <- as.data.frame(cnt_expanded)


# Create condition_table
# Sample names from cnt_expanded
sample_names <- colnames(cnt_expanded)

# Define condition and strategy for each sample
condition <- c(
    rep("treated", 2), 
    rep("control", 2), 
    rep("treated", 2), 
    rep("control", 2)
)
strategy <- c(rep("RNA", 4), rep("Ribo", 4))

cond <- data.frame(
    run = sample_names,
    replicate = c(1,2),
    condition = factor(condition, levels = c("control", "treated")),
    strategy = factor(strategy, levels = c("RNA", "Ribo"))
)

# Create a DOTSeqDataSets object
d <- DOTSeqDataSetsFromSummarizeOverlaps(
    count_table = cnt_expanded, 
    condition_table = cond, 
    annotation = gr
)

invisible(file.remove(bam_list))
invisible(file.remove(sub("\\.gz$", ".sqlite", annotation)))
```

# Simulation and benchmarking

`DOTSeq` includes the function `simDOT()` for generating simulated Ribo-seq 
and RNA-seq count matrices. It enables benchmarking under controlled 
scenarios with configurable effect sizes, batch effects, and ORF types. 

Simulations in our manuscript [@lim2025differential] show that:

- The **DOU** module performs well at low-to-moderate effect sizes on 
DOU-type scenarios. This module is specifically designed to detect 
condition-dependent shifts in the expected proportion of Ribo-seq to 
RNA-seq counts for individual ORFs relative to other ORFs within the 
same gene (i.e., changes in ORF usage). This module targets 
*cis*-regulatory usage changes that alter relative ORF proportions 
within genes.

- The **DTE** module performs better at large-magnitude changes in 
translation efficiency. This module captures ORF-level changes in ribosome 
occupancy relative to RNA abundance.

# Session info
```{r sessionInfo}
sessionInfo()
```