---
title: "splicelogic: differential transcripts to splice events"
author:
    - name: Beatriz Campillo
      orcid: 0000-0001-7323-9125
    - name: Michael I. Love
      orcid: 0000-0001-8401-0545
date: "`r format(Sys.time(), '%m/%d/%Y')`"
output: rmarkdown::html_document
vignette: |
  %\VignetteIndexEntry{Introduction to `splicelogic`}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction
*splicelogic* allows users to find alternative splicing events after 
performing differential transcript usage (DTU) analysis.
Unlike event-based tools that work at the junction level, 
*splicelogic* operates on whole transcript structures: each transcript 
and all its exons are annotated with a DTU effect estimate, allowing 
splicing events to be derived directly from transcript quantification 
with full isoform context. By comparing up- and down-regulated 
transcripts, *splicelogic* can detect skipped exons, included exons,
mutually exclusive exons, retained introns, and alternative 5' and 3'
splice sites. Because it takes transcript-level effect estimates as input, 
it is compatible with any upstream DTU method (including *DRIMSeq*, *DEXSeq*, 
*satuRn*, and *edgeR*), supporting flexible experimental designs.

*splicelogic* operates on exon-level data stored as *GRanges* objects.
Given a set of exons annotated with a coefficient column indicating
differential transcript usage (DTU), *splicelogic* identifies the
following types of splicing events:

-   **Skipped exons (SE)** -- exons skipped in up-regulated transcripts
-   **Included exons (IE)** -- exons included in up-regulated transcripts
-   **Mutually exclusive exons (MXE)** -- pairs of exons where one is
    included and another is excluded in up-regulated transcripts
    compared to down-regulated transcripts
-   **Retained introns (RI)** -- intronic regions that are retained
    as part of an exon in up-regulated transcripts
-   **Alternative 5' (A5SS)** -- exons in up-regulated transcripts
    that share the same 3' splice site but differ at the 5' splice
    site from exons in down-regulated transcripts
-   **Alternative 3' (A3SS)** -- as above, but differing in the 3'
    splice site

These are detected using a series of functions, e.g. `find_se()`,
`find_ie()`, etc. described below.

## Related methods

Several tools exist for detecting alternative splicing from
RNA-seq data and analyzing its consequences.

-   [IsoformSwitchAnalyzeR](https://www.bioconductor.org/packages/IsoformSwitchAnalyzeR)
    — a comprehensive workflow that performs DTU testing, annotates
    the resulting isoform switches with splicing event types, and
    predicts functional consequences such as domain loss and NMD.

-   [GeneStructureTools](https://www.bioconductor.org/packages/GeneStructureTools)
    — takes differential splicing results from read-based tools such
    as Whippet or leafcutter (which detect splicing events from
    junction counts), classifies the event types, and assesses their
    structural and functional consequences such as ORF changes, NMD
    potential, and UTR structure.

-   [isoformic](https://github.com/luciorq/isoformic)
    — a visualization and functional interpretation pipeline for
    pre-computed differential transcript expression results. It
    organizes transcripts by biotype (protein-coding, lncRNA, NMD,
    etc.) and provides expression profile plots and functional
    enrichment analysis.

*splicelogic* differs in its main input and focus: it does not perform 
DTU testing, and also does not take pre-computed splicing events as input.
Instead, it takes transcript-level DTU results (effect estimates
and adjusted p-values from any upstream method) already mapped
onto exon structures, and directly classifies the type of splicing
event from the comparison of up- and down-regulated isoforms.

By operating on exon-level *GRanges* with attached DTU statistics, 
*splicelogic* provides a flexible framework for users to identify 
and interpret splicing events in the context of their specific 
experimental design and choice of DTU method. 
*GRanges* are a common data structure in Bioconductor for representing 
genomic features such as exons and introns, making *splicelogic*
compatible with a wide range of already available tools and 
existing workflows. For example, it can be combined with *plyranges* 
for dplyr-like operations on exons/introns following identification 
of regulated exon and introns, and *Biostrings* for extraction of 
sequence surrounding splice events.

# Quick start

With DTU results attached to a *GRanges* of the exons from
significant transcripts, one can use the following code to identify
splice events:

```{r quick-start}
#| eval: false
exons <- prepare_exons(
  txdb = <A TxDb OBJECT>,
  dtu_table = <DTU_TABLE>,
  coef_col = "estimate"
)

exons <- preprocess(exons, coef_col = "estimate")

# find skipped exons:
skipped <- exons |> find_se()
```

# Input data

*splicelogic* assumes the user has run a differential transcript
usage (DTU) or differential splicing analysis providing an error
bound (e.g. adjusted p-value / FDR) and an effect estimate with
direction (e.g. GLM estimated coefficient or deltaPSI)
(see [upstream methods](#upstream-methods)).

It also assumes the user has obtained genomic ranges representing
the exon structure of each transcript being analyzed
(see [obtaining exon ranges](#obtaining-exon-ranges)).

The `exons` should be provided in a flat *GRanges* object (one
range per exon), containing exon-level metadata in `mcols(exons)`
such as the gene ID, transcript ID, rank in the transcript, and
the estimated change or at least direction of change from the
differential analysis.

Required columns for exons:

| Column                         | Description                                     |
|------------------------------------|------------------------------------|
| `gene_id`                      | Gene identifier                                 |
| `tx_id`                        | Transcript identifier                           |
| `exon_rank`                    | Exon rank within the transcript                 |
| `coef_col = "<user supplied>"` | differential effect estimate for the transcript |

The differential effect estimate column is specified by the user in
`preprocess()`. This should be the differential effect estimate
associated with the specific transcript containing the exon, and
minimally could indicate the direction of effect with `+1/-1`.
Positive values indicate up-regulated exons and negative values
indicate down-regulated exons. All exons from the same transcript
will share the same value for this column.

## Jones et al mouse long read dataset

For demonstration, we will use a published mouse long read dataset
and its reported splicing changes.

The citation is:

> Emma F. Jones, Timothy C. Howton, Victoria L. Flanary, Amanda D. Clark, Brittany N. Lasseigne Long-read RNA sequencing identifies region- and sex-specific C57BL/6J mouse brain mRNA isoform expression and usage **Mol Brain** 17, 40 (2024). [doi: 10.1186/s13041-024-01112-7](https://doi.org/10.1186/s13041-024-01112-7)

Information about the paper, including code and publicly available
data can be found at this URL:

<https://github.com/lasseignelab/230227_EJ_MouseBrainIsoDiv>

In the abstract, Jones *et al.* describe the experiment:

> To assess differences in AS across the cerebellum, cortex, hippocampus, and striatum by sex, we generated and analyzed Oxford Nanopore Technologies (ONT) long-read RNA sequencing (lrRNA-Seq) C57BL/6J mouse brain cDNA libraries. From \> 85 million reads that passed quality control metrics, we calculated differential gene expression (DGE), differential transcript expression (DTE), and differential transcript usage (DTU) across brain regions and by sex.

Here we will focus on the cortex-specific sex comparison, comparing
female to male mice.
For demonstration in the vignette, we have saved a small subset of
the results from this
[Zenodo entry](https://zenodo.org/records/10381745). The dataset
was made available under an MIT license. For information on how the
DTU table was saved, this is noted in `inst/scripts/make-data.R`.
Generating the exons BED file is also described there, which was
downloaded and parsed from the GENCODE M31 GTF file
(comprehensive gene annotation).
See below for more details on how to prepare exons for use with
_splicelogic_.

## Loading example data

In this section we load the small example dataset that has been
prepared for this vignette. Note that in a typical *splicelogic*
workflow, the exons *GRanges* would be loaded from a *TxDb*.
See the [obtaining exon ranges](#obtaining-exon-ranges) section
below for further details.

We load the differential transcript usage analysis from the Jones
*et al.* paper. Specifically, we load the transcripts found to
exhibit DTU in the
[F vs M comparison in cortex](https://github.com/lasseignelab/230227_EJ_MouseBrainIsoDiv/blob/main/results/dtu_transcripts/cortex_sex_sig_features.csv).

The DTU table for this example, and the exons for the differential
transcripts have been saved in the `extdata` directory of
*splicelogic*.

The DTU table was generated by the authors of Jones *et al.*,
which used _IsoformSwitchAnalyzeR_ to process DTU analysis results
from _DEXSeq_, running on transcript counts instead of exon counts.
For _splicelogic_, we only require that some statistical method was
used to test DTU, that we have adjusted p-values allowing us to
subset to an FDR bounded set, and that we have a numeric value
indicating the direction of change associated with each transcript.

```{r dtu-table}
#| message: false
library(readr)
# load DTU results
dir <- system.file("extdata", package="splicelogic")
dtu_table <- readr::read_delim(file.path(dir, "dtu_table.tsv"))
```

We load from BED file the 601 exons from GENCODE M31 annotation
for the 49 differential transcripts, and attach the metadata
columns.

```{r load-exons}
#| message: false
library(plyranges)
exons_file <- "exons_M31.bed.gz"
exons_mcols_file <- "exons_mcols.tsv.gz"
exons <- plyranges::read_bed(file.path(dir, exons_file))
mcols(exons) <- DataFrame(
  readr::read_delim(file.path(dir, exons_mcols_file))
)
```

Finally, we add `Seqinfo` for mm49, the reference mouse genome for
GENCODE M31.

```{r add-seqinfo}
si <- Seqinfo::Seqinfo(genome="mm39")
seqlevels(exons) <- seqlevels(si)
seqinfo(exons) <- si
```

Next, we join the DTU results onto the `exons` GRanges. This is
necessary for the *splicelogic* functions to know which transcripts
are up- or down-regulated, and to which gene the exons and
transcripts belong.

The code below matches the transcript IDs in the `exons` metadata
with the DTU results table, and then binds the two tables before
assigning back to `exons`.

```{r merge-dtu}
txp_idx <- match(exons$tx_id, dtu_table$tx_id)
cols_to_add <- dtu_table[txp_idx, ] |> dplyr::select(-tx_id)
merged_DF <- S4Vectors::cbind(mcols(exons), cols_to_add)
mcols(exons) <- merged_DF
```

As of Bioc 3.23 and *plyranges* version 1.32, the above code to
add metadata columns from an additional table can be replaced with
the following:

```{r join-mcols}
#| eval: false
exons <- exons |>
  plyranges::join_mcols_left(dtu_table, by = "tx_id")
```

*splicelogic* is designed to take as input the exons from
significantly changed transcripts, so we first filter out
transcripts that were not signficant at FDR 10%.

```{r filter-sig}
sig_exons <- exons |> filter(padj < .1)
```

# Finding splicing events

## Preprocessing input data

The first step is to run `preprocess()`, which prepares the `exons`
for event detection. This function checks the input data, ensures
that the necessary columns are present:

```{r preprocess}
#| message: false
library(splicelogic)
sig_exons <- sig_exons |>
  preprocess(coef_col = "estimate")
```

## Finding individual events

Next we can run the various functions for calculating different
types of splicing events.

**Skipped exons (SE)**

For example, we can calculate exons that are skipped in
up-regulated transcripts relative to down-regulated transcripts,
across all genes. As these are skipped in the up-regulated
transcripts, it is expected that the exons returned belong to
down-regulated transcripts:

```{r calc-se}
skipped <- sig_exons |> find_se()
skipped
```

**Included exons (IE)**

```{r calc-re}
included <- sig_exons |> find_ie()
included
```

**Mutually exclusive exons (MXE)**

```{r calc-mx}
mx <- sig_exons |> find_mxe()
mx
```

**Retained introns (RI)**

Here we do not find retained introns, and the function returns an
empty vector.

```{r calc-ri}
ri <- sig_exons |> find_ri()
ri
```

**Alternative 5' and 3' splice sites (A5/3SS)**

```{r calc-a5-a3-ss}
a5ss <- sig_exons |> find_a5ss()
a5ss
a3ss <- sig_exons |> find_a3ss()
a3ss
```

## Finding all events

The following function wraps the above ones to find all events.

```{r calc-all}
all_events <- sig_exons |> find_all_events()
all_events
```

```{r events-barplot}
#| fig.alt: 'Barplot of event types'
barplot(
  table(all_events$event), horiz=TRUE, las=1,
  xlab="exons participating in an event"
)
```

# Upstream methods {#upstream-methods}

*splicelogic* is designed to operate downstream of differential
transcript usage (DTU). DTU methods test whether the relative
proportions of transcripts within a gene differ between
experimental conditions.

In general, any upstream method that produces transcript-resolved
differential usage statistics can be used with *splicelogic*,
provided that results include:

1.  a per-transcript directional effect estimate (e.g. a model
    coefficient, change in isoform fraction, deltaPSI, etc.), and\
2.  an adjusted p-value (or equivalent significance metric by
    thresholding).

Common upstream methods include:

-   [satuRn](https://bioconductor.org/packages/satuRn) — fits
    quasi-binomial generalized linear models to transcript usage
    proportions and performs scalable transcript-level DTU testing.
    Particularly well suited to larger datasets. Gene-level testing
    recommended via *stageR*.

-   [DRIMSeq](https://bioconductor.org/packages/DRIMSeq) — models
    transcript proportions within genes using a
    Dirichlet-Multinomial framework, with both gene-level and
    transcript-level testing.

-   [BANDITS](https://bioconductor.org/packages/BANDITS) — a
    Bayesian hierarchical DTU method that models transcript usage
    using a Dirichlet-Multinomial and explicitly accounts for
    mapping uncertainty using equivalence classes, with both
    gene-level and transcript-level testing.

-   [DEXSeq](https://bioconductor.org/packages/DEXSeq) — primarily
    a differential exon usage (DEU) method based on negative
    binomial GLMs, but commonly used in transcript-level DTU
    workflows (e.g. the *rnaseqDTU* Bioconductor 2018 workflow).
    The log2 fold change coefficients from *DEXSeq* can be used
    directly with *splicelogic*.

-   [limma](https://bioconductor.org/packages/limma) and
    [edgeR](https://bioconductor.org/packages/edgeR)
    (`diffSplice` / `diffSpliceDGE`) — can be used for DTU
    analyses (as well as exon-level), with both gene-level and
    transcript-level testing.

Regardless of which method is used, the per-transcript DTU
statistics (effect estimates and adjusted p-values) have to be
mapped onto the individual exons of each transcript to produce an
exon-level *GRanges*
(see [Obtaining exon ranges](#obtaining-exon-ranges)). This
annotated *GRanges* is the starting point for *splicelogic*,
beginning with `preprocess()` and then followed by event-specific
functions (`find_se()`, `find_ie()`, etc.).

# Obtaining exon ranges {#obtaining-exon-ranges}

## Using `prepare_exons()`

`prepare_exons()` extracts exon ranges from a *TxDb* object and
merges them with your DTU results table. It returns a flat
*GRanges* ready for `preprocess()` and the `find_*` functions.
We demonstrate using GENCODE v32 (human genes).

The first step is to load a *TxDb* object. Typically, a user would
supply their own GTF to `txdbmaker::makeTxDbFromGFF()` to generate
this. For this demonstration we will load a pre-constructed *TxDb*
from Bioconductor's *AnnotationHub*.

```{r ahub}
suppressPackageStartupMessages({
  library(AnnotationHub)
  library(AnnotationDbi)
  library(GenomicFeatures)
})
ah <- AnnotationHub()
txdb <- ah[["AH75191"]] # GENCODE v32 (human)
```

```{r get-txps}
suppressPackageStartupMessages({
  library(tibble)
})
txps <- txdb |>
  AnnotationDbi::select(
    keys(txdb, "TXID"), c("TXNAME","GENEID"), "TXID"
  ) |>
  tibble::as_tibble() |>
  dplyr::select(tx_num = TXID, tx_id = TXNAME, gene_id = GENEID) |>
  dplyr::filter(!is.na(gene_id))
txps
```

```{r sim-dtu}
# simulate DTU results
sim_dtu_table <- txps |>
  dplyr::mutate(
    padj = runif(dplyr::n()),
    effect_est = rnorm(dplyr::n())
  )
sim_dtu_table
```

Here we name this output `human_exons` to not collide with the
example above for the mouse dataset.

```{r prepare-exons}
human_exons <- prepare_exons(
  txdb, sim_dtu_table, coef_col = "effect_est", verbose = TRUE
)
human_exons <- human_exons |>
  filter(padj < .01) |>
  preprocess(coef_col = "effect_est")
```

## Manual construction

This section walks through the steps that `prepare_exons()`
performs internally. This is useful if you need more control over
the process or want to understand how exon ranges are built from
a *TxDb*.

The following extracts the exons grouped by transcript from the
*TxDb*:

```{r extract-exons}
# extract exons as a GRangesList
exons_list <- GenomicFeatures::exonsBy(
  txdb,
  by="tx"
  )
# Our DTU table aligns with txps, which aligns with the names
# of the GRangesList. prepare_exons() handles alignment checks.
names(exons_list) <- sim_dtu_table$tx_id
```

Next flattening the exons (here using a new name `flat_exons` to
not collide with the `exons` mouse dataset above):

```{r flatten-exons}
flat_exons <- unlist(exons_list)
# swap tx_id with exon_name as the names of the GRanges
flat_exons$tx_id <- names(flat_exons) # store transcript ids
names(flat_exons) <- flat_exons$exon_name
```

Adding DTU results and gene ID:

```{r add-dtu}
txp_idx <- match(flat_exons$tx_id, sim_dtu_table$tx_id)
cols_to_add <- sim_dtu_table[txp_idx,] |>
  dplyr::select(-c(tx_id, tx_num))
merged_DF <- cbind(mcols(flat_exons), cols_to_add)
mcols(flat_exons) <- merged_DF
```

# Session info

```{r session-info}
sessionInfo()
```