--- title: "splicelogic: differential transcripts to splice events" author: - name: Beatriz Campillo orcid: 0000-0001-7323-9125 - name: Michael I. Love orcid: 0000-0001-8401-0545 date: "`r format(Sys.time(), '%m/%d/%Y')`" output: rmarkdown::html_document vignette: | %\VignetteIndexEntry{Introduction to `splicelogic`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction *splicelogic* allows users to find alternative splicing events after performing differential transcript usage (DTU) analysis. Unlike event-based tools that work at the junction level, *splicelogic* operates on whole transcript structures: each transcript and all its exons are annotated with a DTU effect estimate, allowing splicing events to be derived directly from transcript quantification with full isoform context. By comparing up- and down-regulated transcripts, *splicelogic* can detect skipped exons, included exons, mutually exclusive exons, retained introns, and alternative 5' and 3' splice sites. Because it takes transcript-level effect estimates as input, it is compatible with any upstream DTU method (including *DRIMSeq*, *DEXSeq*, *satuRn*, and *edgeR*), supporting flexible experimental designs. *splicelogic* operates on exon-level data stored as *GRanges* objects. Given a set of exons annotated with a coefficient column indicating differential transcript usage (DTU), *splicelogic* identifies the following types of splicing events: - **Skipped exons (SE)** -- exons skipped in up-regulated transcripts - **Included exons (IE)** -- exons included in up-regulated transcripts - **Mutually exclusive exons (MXE)** -- pairs of exons where one is included and another is excluded in up-regulated transcripts compared to down-regulated transcripts - **Retained introns (RI)** -- intronic regions that are retained as part of an exon in up-regulated transcripts - **Alternative 5' (A5SS)** -- exons in up-regulated transcripts that share the same 3' splice site but differ at the 5' splice site from exons in down-regulated transcripts - **Alternative 3' (A3SS)** -- as above, but differing in the 3' splice site These are detected using a series of functions, e.g. `find_se()`, `find_ie()`, etc. described below. ## Related methods Several tools exist for detecting alternative splicing from RNA-seq data and analyzing its consequences. - [IsoformSwitchAnalyzeR](https://www.bioconductor.org/packages/IsoformSwitchAnalyzeR) — a comprehensive workflow that performs DTU testing, annotates the resulting isoform switches with splicing event types, and predicts functional consequences such as domain loss and NMD. - [GeneStructureTools](https://www.bioconductor.org/packages/GeneStructureTools) — takes differential splicing results from read-based tools such as Whippet or leafcutter (which detect splicing events from junction counts), classifies the event types, and assesses their structural and functional consequences such as ORF changes, NMD potential, and UTR structure. - [isoformic](https://github.com/luciorq/isoformic) — a visualization and functional interpretation pipeline for pre-computed differential transcript expression results. It organizes transcripts by biotype (protein-coding, lncRNA, NMD, etc.) and provides expression profile plots and functional enrichment analysis. *splicelogic* differs in its main input and focus: it does not perform DTU testing, and also does not take pre-computed splicing events as input. Instead, it takes transcript-level DTU results (effect estimates and adjusted p-values from any upstream method) already mapped onto exon structures, and directly classifies the type of splicing event from the comparison of up- and down-regulated isoforms. By operating on exon-level *GRanges* with attached DTU statistics, *splicelogic* provides a flexible framework for users to identify and interpret splicing events in the context of their specific experimental design and choice of DTU method. *GRanges* are a common data structure in Bioconductor for representing genomic features such as exons and introns, making *splicelogic* compatible with a wide range of already available tools and existing workflows. For example, it can be combined with *plyranges* for dplyr-like operations on exons/introns following identification of regulated exon and introns, and *Biostrings* for extraction of sequence surrounding splice events. # Quick start With DTU results attached to a *GRanges* of the exons from significant transcripts, one can use the following code to identify splice events: ```{r quick-start} #| eval: false exons <- prepare_exons( txdb = , dtu_table = , coef_col = "estimate" ) exons <- preprocess(exons, coef_col = "estimate") # find skipped exons: skipped <- exons |> find_se() ``` # Input data *splicelogic* assumes the user has run a differential transcript usage (DTU) or differential splicing analysis providing an error bound (e.g. adjusted p-value / FDR) and an effect estimate with direction (e.g. GLM estimated coefficient or deltaPSI) (see [upstream methods](#upstream-methods)). It also assumes the user has obtained genomic ranges representing the exon structure of each transcript being analyzed (see [obtaining exon ranges](#obtaining-exon-ranges)). The `exons` should be provided in a flat *GRanges* object (one range per exon), containing exon-level metadata in `mcols(exons)` such as the gene ID, transcript ID, rank in the transcript, and the estimated change or at least direction of change from the differential analysis. Required columns for exons: | Column | Description | |------------------------------------|------------------------------------| | `gene_id` | Gene identifier | | `tx_id` | Transcript identifier | | `exon_rank` | Exon rank within the transcript | | `coef_col = ""` | differential effect estimate for the transcript | The differential effect estimate column is specified by the user in `preprocess()`. This should be the differential effect estimate associated with the specific transcript containing the exon, and minimally could indicate the direction of effect with `+1/-1`. Positive values indicate up-regulated exons and negative values indicate down-regulated exons. All exons from the same transcript will share the same value for this column. ## Jones et al mouse long read dataset For demonstration, we will use a published mouse long read dataset and its reported splicing changes. The citation is: > Emma F. Jones, Timothy C. Howton, Victoria L. Flanary, Amanda D. Clark, Brittany N. Lasseigne Long-read RNA sequencing identifies region- and sex-specific C57BL/6J mouse brain mRNA isoform expression and usage **Mol Brain** 17, 40 (2024). [doi: 10.1186/s13041-024-01112-7](https://doi.org/10.1186/s13041-024-01112-7) Information about the paper, including code and publicly available data can be found at this URL: In the abstract, Jones *et al.* describe the experiment: > To assess differences in AS across the cerebellum, cortex, hippocampus, and striatum by sex, we generated and analyzed Oxford Nanopore Technologies (ONT) long-read RNA sequencing (lrRNA-Seq) C57BL/6J mouse brain cDNA libraries. From \> 85 million reads that passed quality control metrics, we calculated differential gene expression (DGE), differential transcript expression (DTE), and differential transcript usage (DTU) across brain regions and by sex. Here we will focus on the cortex-specific sex comparison, comparing female to male mice. For demonstration in the vignette, we have saved a small subset of the results from this [Zenodo entry](https://zenodo.org/records/10381745). The dataset was made available under an MIT license. For information on how the DTU table was saved, this is noted in `inst/scripts/make-data.R`. Generating the exons BED file is also described there, which was downloaded and parsed from the GENCODE M31 GTF file (comprehensive gene annotation). See below for more details on how to prepare exons for use with _splicelogic_. ## Loading example data In this section we load the small example dataset that has been prepared for this vignette. Note that in a typical *splicelogic* workflow, the exons *GRanges* would be loaded from a *TxDb*. See the [obtaining exon ranges](#obtaining-exon-ranges) section below for further details. We load the differential transcript usage analysis from the Jones *et al.* paper. Specifically, we load the transcripts found to exhibit DTU in the [F vs M comparison in cortex](https://github.com/lasseignelab/230227_EJ_MouseBrainIsoDiv/blob/main/results/dtu_transcripts/cortex_sex_sig_features.csv). The DTU table for this example, and the exons for the differential transcripts have been saved in the `extdata` directory of *splicelogic*. The DTU table was generated by the authors of Jones *et al.*, which used _IsoformSwitchAnalyzeR_ to process DTU analysis results from _DEXSeq_, running on transcript counts instead of exon counts. For _splicelogic_, we only require that some statistical method was used to test DTU, that we have adjusted p-values allowing us to subset to an FDR bounded set, and that we have a numeric value indicating the direction of change associated with each transcript. ```{r dtu-table} #| message: false library(readr) # load DTU results dir <- system.file("extdata", package="splicelogic") dtu_table <- readr::read_delim(file.path(dir, "dtu_table.tsv")) ``` We load from BED file the 601 exons from GENCODE M31 annotation for the 49 differential transcripts, and attach the metadata columns. ```{r load-exons} #| message: false library(plyranges) exons_file <- "exons_M31.bed.gz" exons_mcols_file <- "exons_mcols.tsv.gz" exons <- plyranges::read_bed(file.path(dir, exons_file)) mcols(exons) <- DataFrame( readr::read_delim(file.path(dir, exons_mcols_file)) ) ``` Finally, we add `Seqinfo` for mm49, the reference mouse genome for GENCODE M31. ```{r add-seqinfo} si <- Seqinfo::Seqinfo(genome="mm39") seqlevels(exons) <- seqlevels(si) seqinfo(exons) <- si ``` Next, we join the DTU results onto the `exons` GRanges. This is necessary for the *splicelogic* functions to know which transcripts are up- or down-regulated, and to which gene the exons and transcripts belong. The code below matches the transcript IDs in the `exons` metadata with the DTU results table, and then binds the two tables before assigning back to `exons`. ```{r merge-dtu} txp_idx <- match(exons$tx_id, dtu_table$tx_id) cols_to_add <- dtu_table[txp_idx, ] |> dplyr::select(-tx_id) merged_DF <- S4Vectors::cbind(mcols(exons), cols_to_add) mcols(exons) <- merged_DF ``` As of Bioc 3.23 and *plyranges* version 1.32, the above code to add metadata columns from an additional table can be replaced with the following: ```{r join-mcols} #| eval: false exons <- exons |> plyranges::join_mcols_left(dtu_table, by = "tx_id") ``` *splicelogic* is designed to take as input the exons from significantly changed transcripts, so we first filter out transcripts that were not signficant at FDR 10%. ```{r filter-sig} sig_exons <- exons |> filter(padj < .1) ``` # Finding splicing events ## Preprocessing input data The first step is to run `preprocess()`, which prepares the `exons` for event detection. This function checks the input data, ensures that the necessary columns are present: ```{r preprocess} #| message: false library(splicelogic) sig_exons <- sig_exons |> preprocess(coef_col = "estimate") ``` ## Finding individual events Next we can run the various functions for calculating different types of splicing events. **Skipped exons (SE)** For example, we can calculate exons that are skipped in up-regulated transcripts relative to down-regulated transcripts, across all genes. As these are skipped in the up-regulated transcripts, it is expected that the exons returned belong to down-regulated transcripts: ```{r calc-se} skipped <- sig_exons |> find_se() skipped ``` **Included exons (IE)** ```{r calc-re} included <- sig_exons |> find_ie() included ``` **Mutually exclusive exons (MXE)** ```{r calc-mx} mx <- sig_exons |> find_mxe() mx ``` **Retained introns (RI)** Here we do not find retained introns, and the function returns an empty vector. ```{r calc-ri} ri <- sig_exons |> find_ri() ri ``` **Alternative 5' and 3' splice sites (A5/3SS)** ```{r calc-a5-a3-ss} a5ss <- sig_exons |> find_a5ss() a5ss a3ss <- sig_exons |> find_a3ss() a3ss ``` ## Finding all events The following function wraps the above ones to find all events. ```{r calc-all} all_events <- sig_exons |> find_all_events() all_events ``` ```{r events-barplot} #| fig.alt: 'Barplot of event types' barplot( table(all_events$event), horiz=TRUE, las=1, xlab="exons participating in an event" ) ``` # Upstream methods {#upstream-methods} *splicelogic* is designed to operate downstream of differential transcript usage (DTU). DTU methods test whether the relative proportions of transcripts within a gene differ between experimental conditions. In general, any upstream method that produces transcript-resolved differential usage statistics can be used with *splicelogic*, provided that results include: 1. a per-transcript directional effect estimate (e.g. a model coefficient, change in isoform fraction, deltaPSI, etc.), and\ 2. an adjusted p-value (or equivalent significance metric by thresholding). Common upstream methods include: - [satuRn](https://bioconductor.org/packages/satuRn) — fits quasi-binomial generalized linear models to transcript usage proportions and performs scalable transcript-level DTU testing. Particularly well suited to larger datasets. Gene-level testing recommended via *stageR*. - [DRIMSeq](https://bioconductor.org/packages/DRIMSeq) — models transcript proportions within genes using a Dirichlet-Multinomial framework, with both gene-level and transcript-level testing. - [BANDITS](https://bioconductor.org/packages/BANDITS) — a Bayesian hierarchical DTU method that models transcript usage using a Dirichlet-Multinomial and explicitly accounts for mapping uncertainty using equivalence classes, with both gene-level and transcript-level testing. - [DEXSeq](https://bioconductor.org/packages/DEXSeq) — primarily a differential exon usage (DEU) method based on negative binomial GLMs, but commonly used in transcript-level DTU workflows (e.g. the *rnaseqDTU* Bioconductor 2018 workflow). The log2 fold change coefficients from *DEXSeq* can be used directly with *splicelogic*. - [limma](https://bioconductor.org/packages/limma) and [edgeR](https://bioconductor.org/packages/edgeR) (`diffSplice` / `diffSpliceDGE`) — can be used for DTU analyses (as well as exon-level), with both gene-level and transcript-level testing. Regardless of which method is used, the per-transcript DTU statistics (effect estimates and adjusted p-values) have to be mapped onto the individual exons of each transcript to produce an exon-level *GRanges* (see [Obtaining exon ranges](#obtaining-exon-ranges)). This annotated *GRanges* is the starting point for *splicelogic*, beginning with `preprocess()` and then followed by event-specific functions (`find_se()`, `find_ie()`, etc.). # Obtaining exon ranges {#obtaining-exon-ranges} ## Using `prepare_exons()` `prepare_exons()` extracts exon ranges from a *TxDb* object and merges them with your DTU results table. It returns a flat *GRanges* ready for `preprocess()` and the `find_*` functions. We demonstrate using GENCODE v32 (human genes). The first step is to load a *TxDb* object. Typically, a user would supply their own GTF to `txdbmaker::makeTxDbFromGFF()` to generate this. For this demonstration we will load a pre-constructed *TxDb* from Bioconductor's *AnnotationHub*. ```{r ahub} suppressPackageStartupMessages({ library(AnnotationHub) library(AnnotationDbi) library(GenomicFeatures) }) ah <- AnnotationHub() txdb <- ah[["AH75191"]] # GENCODE v32 (human) ``` ```{r get-txps} suppressPackageStartupMessages({ library(tibble) }) txps <- txdb |> AnnotationDbi::select( keys(txdb, "TXID"), c("TXNAME","GENEID"), "TXID" ) |> tibble::as_tibble() |> dplyr::select(tx_num = TXID, tx_id = TXNAME, gene_id = GENEID) |> dplyr::filter(!is.na(gene_id)) txps ``` ```{r sim-dtu} # simulate DTU results sim_dtu_table <- txps |> dplyr::mutate( padj = runif(dplyr::n()), effect_est = rnorm(dplyr::n()) ) sim_dtu_table ``` Here we name this output `human_exons` to not collide with the example above for the mouse dataset. ```{r prepare-exons} human_exons <- prepare_exons( txdb, sim_dtu_table, coef_col = "effect_est", verbose = TRUE ) human_exons <- human_exons |> filter(padj < .01) |> preprocess(coef_col = "effect_est") ``` ## Manual construction This section walks through the steps that `prepare_exons()` performs internally. This is useful if you need more control over the process or want to understand how exon ranges are built from a *TxDb*. The following extracts the exons grouped by transcript from the *TxDb*: ```{r extract-exons} # extract exons as a GRangesList exons_list <- GenomicFeatures::exonsBy( txdb, by="tx" ) # Our DTU table aligns with txps, which aligns with the names # of the GRangesList. prepare_exons() handles alignment checks. names(exons_list) <- sim_dtu_table$tx_id ``` Next flattening the exons (here using a new name `flat_exons` to not collide with the `exons` mouse dataset above): ```{r flatten-exons} flat_exons <- unlist(exons_list) # swap tx_id with exon_name as the names of the GRanges flat_exons$tx_id <- names(flat_exons) # store transcript ids names(flat_exons) <- flat_exons$exon_name ``` Adding DTU results and gene ID: ```{r add-dtu} txp_idx <- match(flat_exons$tx_id, sim_dtu_table$tx_id) cols_to_add <- sim_dtu_table[txp_idx,] |> dplyr::select(-c(tx_id, tx_num)) merged_DF <- cbind(mcols(flat_exons), cols_to_add) mcols(flat_exons) <- merged_DF ``` # Session info ```{r session-info} sessionInfo() ```