IsoformSwitchAnalyzeR

Enabling Analysis of Isoform Switches with Functional Consequences and the Underlying Alternative Splicing

Kristoffer Vitting-Seerup

2023-06-27

Abstract

Recent breakthroughs in bioinformatics now allow us to accurately reconstruct and quantify full-length gene isoforms from RNA-sequencing data via tools such as StringTie, Kallisto and Salmon. Alternatively long-read RNA-seq (third generation sequencing) now routinely provide us with full length transcripts. These full length transcripts makes it possible to analyze changes in isoform usage, but unfortunately this is rarely done meaning RNA-seq data is typically not used to its full potential.

To solve this problem, we developed IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR is an easy-to use-R package that enables statistical identification of isoform switches from RNA-seq derived quantification of novel and/or known full-length isoforms. IsoformSwitchAnalyzeR facilitates integration of many sources of annotation such as protein domains, signal peptides, Intrinsically Disordered Regions (IDR), subcellular localization, sensitivity to Non-sense Mediated Decay (NMD) and more. The combination of identified isoform switches and their annotation enables IsoformSwitchAnalyzeR to predict potential functional consequences of the identified isoform switches — such as loss of protein domains — thereby identifying isoform switches of particular interest. Lastly, IsoformSwitchAnalyzeR provides article-ready visualization of isoform switches of individual genes, the genome-wide consequences of isoform switches, and the associated changes in alternative splicing.

In summary, IsoformSwitchAnalyzeR enables analysis of RNA-seq data with isoform resolution with a focus on isoform switching (with predicted consequences) and its associated alternative splicing, thereby expanding the usability of RNA-seq data.

Table of Content

Preliminaries

Background and Package Description
Installation
How To Get Help

What To Cite (please remember)

Quick Start

Overview of Isoform Switch Workflow
Example Workflow (a.k.a. the “Too long; didn’t read” section)
Examples Visualizations

Detailed Workflow

Overview of Detailed Workflow
- IsoformSwitchAnalyzeR Background Information
Importing Data Into R
Filtering
Identifying Isoform Switches
Analyzing Open Reading Frames
Extracting Nucleotide and Amino Acid Sequences
Advice for Running External Sequence Analysis Tools and Downloading Results
Importing External Sequence Analysis
Predicting Alternative Splicing
Predicting Switch Consequences
Post Analysis of Isoform Switches with Consequences
- Analysis of Individual Isoform Switching
- Genome-Wide Analysis of Isoform Switching

Analyzing Alternative Splicing

Overview of Alternative Splicing Workflow
Genome-Wide Analysis of Alternative Splicing

Other workflows

Augmenting ORF Predictions with Pfam Results
Analyze Small Upstream ORFs
Remove Sequences Stored in SwitchAnalyzeRlist
Adding Uncertain Category to Coding Potential Predictions
Quality control of ORF of known annotation
Analyzing the Biological Mechanisms Behind Isoform Switching
Analyzing experiments without replicates
Rescue StringTie Annoation and Extract Gene Count Matrix

Frequently Asked Questions, Problems and Errors

What Quantification Tool(s) Should I Use?
What Transcript Database Should I Use?
How to handle confounding effects (including batches)
What constitute an independent biological replicate?
And more…

Final Remarks

Sessioninfo

Preliminaries

Background and Package Description

The usage of Alternative Transcription Start sites (aTSS), Alternative Splicing (AS) and alternative Transcription Termination Sites (aTTS) are collectively collectively results in the production of different isoforms. Alternative isoforms are widely used as recently demonstrated by The ENCODE Consortium, which found that on average, 6.3 different transcripts are generated per gene; a number which may vary considerably per gene.

The importance of analyzing isoforms instead of genes has been highlighted by many examples showing functionally important changes. One of these examples is the pyruvate kinase. In normal adult homeostasis, cells use the adult isoform (M1), which supports oxidative phosphorylation. However, almost all cancer cells use the embryonic isoform (M2), which promotes aerobic glycolysis, one of the hallmarks of cancer. Such shifts in isoform usage are termed ‘isoform switching’ and cannot be detected at when only analyzing data on gene level. For an introduction to isoform switches please refere to Vitting-Seerup et al 2017.

On a more systematic level several recent studies suggest that isoform switches are quite common since they often identify hundreds of switch events in both cancer and non-cancer studies.

Tools such as StringTie, Salmon and Kallisto allows for (reconstruction and) quantification of full-length transcripts from RNA-seq data. Such data has the potential to facilitate genome-wide analysis of alternative isoform usage and identification of isoform switching — but unfortunately these types of analyses are still only rarely done; most analyses are on gene level only.

We hypothesize that there are multiple reasons why RNA-seq data is not used to its full potential:

There is still a lack of tools that can identify isoform switches with isoform resolution.
Although there are many excellent tools to perform sequence analysis, there is no common framework which allows for integration of the analysis provided by these tools.
There is a lack of tools facilitating easy and article-ready visualization of isoform switches.

To solve all these problems, we developed IsoformSwitchAnalyzeR.

IsoformSwitchAnalyzeR is an easy-to-use R package that enables the user to import the result of the (reconstructed) full-length isoforms quantification an RNA-seq experiment into R. If annotated transcripts are analyzed, IsoformSwitchAnalyzeR offers integration with the multi-layer information stored in a GTF/GFF file including the annotated coding sequences (CDS). If transcript structures are predicted (either de-novo or guided) IsoformSwitchAnalyzeR offers an accurate tool for identifying the dominant ORF of the isoforms. The knowledge of isoform positions for the CDS/ORF allows for prediction of sensitivity to Nonsense Mediated Decay (NMD) — the mRNA quality control machinery that degrades isoforms with pre-mature termination codons (PTC).

IsoformSwitchAnalyzeR facilitates identification of isoform switches via newly developed statistical methods that tests each individual isoform for differential usage and thereby identifies the exact isoforms involved in an isoform switch.

Since we know the exon structure of the full-length isoform, IsoformSwitchAnalyzeR can extract the underlying nucleotide sequence from a reference genome. This enables integration with the tools for assessing coding potential of an isoform (both the CPAT and CPC2 tools are supported) which can be used to increase accuracy of ORF predictions. By combining the CDS/ORF isoform positions with the nucleotide sequence, we can extract the most likely amine acid sequence of the CDS/ORF. The amine acid sequence enables integration of analysis of protein domains (via Pfam) and signal peptides (via SignalP) — both supported by IsoformSwitchAnalyzeR. Additionally, since the structures of all expressed isoforms from a given gene are known, one can also annotate alternative splicing - including aTSS and aTTS sites - a functionality also implemented in IsoformSwitchAnalyzeR.

In summary, IsoformSwitchAnalyzeR enables annotation of isoforms with intron retention, ORF, NMD sensitivity, coding potential, protein domains and signal peptides (and many more), resulting in the ability to predict important functional consequences of isoform switches in both individual genes and on a genome wide level.

IsoformSwitchAnalyzeR contains tools that allow the user to create article-ready visualizations of:

Individual isoform switches
Genome-wide analysis of isoform switches and their predicted consequences
Genome-wide analysis of alternative splicing and changes thereof.

These visualizations are easy to understand and integrate all information gathered throughout the workflow. Example of visualizations can be found in the Examples Visualizations section.

Lastly IsoformSwitchAnalyzeR, although originally developed for model organisms now also directly support analysis of all organisms.

Back to Table of Content.

Installation

IsoformSwitchAnalyzeR is part of the Bioconductor repository and community which means it is distributed with, and dependent on, Bioconductor. Installation of IsoformSwitchAnalyzeR is easy and can be done from within the R terminal. If it is the first time you use Bioconductor (or don’t know if you have used it), simply copy-paste the following into an R session to install the basic Bioconductor packages (will only done if you don’t already have them):

if (!requireNamespace("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")
    BiocManager::install()
}

If you already have installed Bioconductor, running these two commands will check whether updates for installed packages are available.

After you have installed the basic Bioconductor packages you can install IsoformSwitchAnalyzeR by copy-pasting the following into an R session:

BiocManager::install("IsoformSwitchAnalyzeR")

This will install the IsoformSwitchAnalyzeR package as well as other R packages that are needed for IsoformSwitchAnalyzeR to work.

If you need to install the developmental version of IsoformSwitchAnalyzeR this can be done from the GitHub. Please note that this is for advanced uses and should not be done unless you have good reason to. Installation from the GitHub can be done by copy/pasting the following into R:

if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}
devtools::install_github("kvittingseerup/IsoformSwitchAnalyzeR", build_vignettes = TRUE)

How To Get Help

This R package comes with plenty of documentation. Much information can be found in the R help files (which can easily be accessed by running the following command in R “?functionName”, for example “?importRdata”) - here a lot of information can be found in the individual argument description as well as in the “details” section of the individual function documentation. This vignette also contains a lot of information and will be continuously updated, so make sure to read both sources carefully as it contains the answers to the most questions (including Frequently Asked Questions, Problems and Errors).

If you want to report a bug/error (found in the newest version of the R package!) please make an issue with a reproducible example at GitHub and ember to post i) the code that gave the error, ii) the error message as well as iii) the output of running sessionInfo().

If you have suggestions for improvements, please put them on GitHub. This will allow other people to up vote your idea, thereby showing us there is wide support of implementing your idea.

Back to Table of Content.

What To Cite

The analysis performed by IsoformSwitchAnalyzeR is only possible due to a string of other tools and scientific discoveries — please read this section thoroughly and cite the appropriate articles to give credit where credit is due (also citations are what allow people to continue both maintaining and developing bioinformatic tools).

If you are using the:

Import of data from Salmon/Kallisto/RSEM/StringTie (importRdata() function): Please cite reference 10.
Import of data from Salmon via Tximeta (importSalmonData() function): Please cite reference 10 and 17.
Inter-library normalization of abundance values: Please cite reference 10 and 11.
Isoform switch test implemented utilizing DEXSeq via IsoformSwitchAnalyzeR (Default) : Please cite reference 1, 12 and 13.
Isoform switch test implemented in the satuRn package: Please cite reference 3.
Prediction of open reading frames (ORF) analysis: Please cite reference 1 and 4.
Prediction of pre-mature termination codons (PTC) and thereby NMD-sensitivity: Please cite reference 1, 4, 5 and 6.
CPAT: Please cite reference 7.
CPC2: Please cite reference 14.
Pfam: Please cite reference 8.
Pfam isotypes: Please cite reference 20.
SignalP: Please cite reference 9.
NetSurf2-P: Please cite reference 15.
IUPred2A: Please cite reference 16.
Deeploc2: Please cite reference 18.
DeepTMHMM: Please cite reference 19.
Prediction of consequences: Please cite reference 1.
Visualizations (plots) implemented in the IsoformSwitchAnalyzeR package: Please cite reference 1.
Alternative splicing analysis: Please cite both reference 1 and 4.
Genome-wide enrichment analysis: Please cite both reference 1 and 2.

Refrences:

Vitting-Seerup et al. The Landscape of Isoform Switches in Human Cancers. Cancer Res. (2017) link.
Vitting-Seerup et al. IsoformSwitchAnalyzeR: Analysis of changes in genome-wide patterns of alternative splicing and its functional consequences. Bioinformatics (2019) link.
Gilis et al. satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications (version 2). F1000Research, 10:374. link
Vitting-Seerup et al. spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data. BMC Bioinformatics 2014, 15:81. link.
Weischenfeldt et al. Mammalian tissues defective in nonsense-mediated mRNA decay display highly aberrant splicing patterns. Genome Biol 2012, 13:R35 link.
Huber et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods, 2015, 12:115-121. link.
Wang et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013, 41:e74. link.
Finn et al. The Pfam protein families database. Nucleic Acids Research (2012) link.
_Teufel et al. SignalP 6.0 predicts all five types of signal peptides using protein language models.**. Nat. Biotechnol (2022)_ link
Soneson et al. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4, 1521 (2015). link.
Robinson et al. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology (2010) link.
Ritchie et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research (2015) link.
Anders et al. Detecting differential usage of exons from RNA-seq data. Genome Research (2012) link.
Kang et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res (2017) link.
Klausen et al. NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins (2019) link
Meszaros et al. IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res (2018) link
Love et al. Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLoS Comput. Biol (2020) link
Thumuluri et al. DeepLoc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic Acids Res (2022) link
Hallgren et al. DeepTMHMM predicts alpha and beta transmembrane proteins using deep neural networks. bioRxiv (2022) link
Vitting-Seerup. Most protein domains exist as multiple variants with distinct structure and function. bioRxiv (2022) link

Quick Start

Overview of Isoform Switch Workflow

In this section, we will outline the IsoformSwitchAnalyzeR workflow - in the following sections we will show you how to do it including example data and code which can be copy/pasted to do your own analysis.

The idea behind IsoformSwitchAnalyzeR is to make it easy to do advanced post-analysis of full-length RNA-seq derived isoform/transcript level quantification of RNASeq data with a focus on finding, annotating and visualizing isoform switches with functional consequences as well as its associated alternative splicing. If you want to know more about considerations and recommendations for bioinformatic tools which can perform the initial isoform quantification, please refer to What Quantification Tool(s) Should I Use? and note that IsoformSwitchAnalyzeR both supports analysis of short- and long-read sequencing data.

From the isoform quantification IsoformSwitchAnalyzeR performs five high-level tasks:

Statistical identification of isoform switches.
Integration of a wide range of (predicted) annotations for the isoforms involved in the identified switches (e.g. protein domains).
Identification of which isoforms have a predicited functional consequnce (e.g. loss/gain of protein domain).
Visualization of predicted consequences of the isoform switches for individual genes
Analysis of genome wide patterns in both switch consequences and alternative splicing.

Please note that just like any other statistical tool IsoformSwitchAnalyzeR requires count data from independent biological replicates (see What constitute an independent biological replicate?) and that recent benchmarks highlights that at least three (if not more) independent replicates are needed for good statistical performance - most tools have a hard time controlling the False Discovery Rate (FDR) with only two replicates so extra caution is needed for interpreting results based on three or fewer replicates.

The first step of a IsoformSwitchAnalyzeR workflow is to import and integrate the isoform quantification with its basic annotation. Once you have all the relevant data imported into R (IsoformSwitchAnalyzeR will also help you with that), the workflow for identification and analysis of isoform switches with functional consequences can be divided into two parts (also illustrated below in Figure 1).

Part 1) Extract Isoform Switches and Their Sequences. This part includes filtering out lowly expressed genes/isoforms, statistical analysis to identify isoform switches, annotating those switches with open reading frames (ORF) and writing the nucleotide and amino acid (peptide) sequences to fasta files. All of these analysis are performed by the high-level function:

isoformSwitchAnalysisPart1()

See below for example of usage, and Detailed Workflow for details on the individual analysis step (and what function actually perform them).

The resulting fasta files enables the usage of external sequence analysis tools. Currently the external sequence tools directly supported by IsoformSwitchAnalyzeR are:

Pfam: Prediction of protein domains, which can be run either locally or via their webserver.
Tools for Coding Potential Analysis:
- CPC2: The Coding Calculator 2, which can be run either locally or via their webserver.
- CPAT: The Coding-Potential Assessment Tool, which can be run either locally or via their webserver.
SignalP: Prediction of Signal Peptides, which can be run either locally or via their webserver (V5 is supported).
Tools Prediction of Intrinsically Disordered Regions (IDR):
- IUPred2A: Which both predict IDRs and Intrinsically Disordered Binding Regions (IDBR) It can be run either locally or via their webserver
- NetSurfP-2: Which as parts of its predicitons also predict IDR. It can be run either locally or via their webserver.
Tools for localization prediction
- DeepTMHMM: Predicts protein toplology (location compared to cell membrane) and can be run from from this webserver
- DeepLoc2: Predicts sub-cellular location and can be run from this webserver.

Part 2) Plot All Isoform Switches and Their annotation. This part involves importing and incorporating the results of all the external sequence analysis, analyzing alternative splicing, predicting functional consequences of isoform switches and plotting i) individual genes with isoform switches and ii) genome wide patterns in the consequences of isoform switching.

All of this can be done using the function:

isoformSwitchAnalysisPart2()

See below for example of usage, and Detailed Workflow for details on the individual analysis step (and what function actually perform them).

Alternatively, if one does not plan to incorporate external sequence analysis (or is only interested in splicing), it is possible to run the full workflow using:

isoformSwitchAnalysisCombined()

This corresponds to running isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2() without adding the external results.

Figure 1: Workflow overview. Grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows. Speech bubbles summarize how this full analysis can be done in a two-step process using the high-level functions ( isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2()). Please note the boxes marked with “external analysis” means the user have to run tools outside of R either locally or on a web-server - more info can be found below.

Back to Table of Content.

Example Workflow

As indicted above, a comprehensive, but less customizable, analysis of isoform switches can be done using the two high-level functions isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2(). This section aims to show how these functions are used as well as illustrate what IsoformSwitchAnalyzeR can be used for. For the detailed, fully customizable workflow, please refer to the Detailed Workflow (though we still recommend reading this section as it will provide a nice overview of the workflow).

Let us start by loading IsoformSwitchAnalyzeR into the R environment.

library(IsoformSwitchAnalyzeR)

Note that newer versions of RStudio (a wrapper for R which makes it much easier to work with R) support auto-completion of package names so you just have to type “Iso” and press then RStudio will suggest IsoformSwitchAnalyzeR.

Also note this workflow have been written to be used with:

packageVersion('IsoformSwitchAnalyzeR')
#> [1] '2.0.1'

Since IsoformSwitchAnalyzeR is (significantly) updated quite frequently please make sure you use the up to date version.

Importing the Data

The first step is to import all data needed for the analysis into R and then concatenate them as a switchAnalyzeRlist object. Although IsoformSwitchAnalyzeR has different functions for importing data from tools such as Salmon/Kallisto/RSEM/StringTie/Cufflinks/IsoQuant (for long-read data), IsoformSwitchAnalyzeR can be used with quantifications from any tool which produce isoform-level quantification.

For the purpose of illustrating importing data into R let us use some cufflinks data re-quantified quantified via Salmon which is included in the distribution of IsoformSwitchAnalyzeR.

Please note:

The approach illustrated below for Salmon data is identical if the data was quantified with Kallisto, RSEM or StringTie.
If you have used Salmon to quantify newer versions of the Human/Mouse/Dropsophila Gencode/Ensembl transcriptomes there is an even easier way to import the data (using tximeta) as described in Importing Data from Salmon via Tximeta.
For other sources of quantification (including StringTie and IsoQuant) see Importing Data Into R.

The import of the example quantification can be done as follows:

### Please note
# The way of importing files in the following example with
# "system.file('pathToFile', package="IsoformSwitchAnalyzeR") is
# specialized way of accessing the example data in the IsoformSwitchAnalyzeR package
# and not something  you need to do - just supply the string e.g.
# parentDir = "/path/to/mySalmonQuantifications/" pointing to the parent directory (where
# each sample is a separate sub-directory) to the function.

### Import quantifications
salmonQuant <- importIsoformExpression(
    parentDir = system.file("extdata/",package="IsoformSwitchAnalyzeR")
)
#> Step 1 of 3: Identifying which algorithm was used...
#>     The quantification algorithm used was: Salmon
#>     Found 6 quantification file(s) of interest
#> Step 2 of 3: Reading data...
#> reading in files with read_tsv
#> 1 2 3 4 5 6 
#> Step 3 of 3: Normalizing abundance values (not counts) via edgeR...
#> Done

Note that:

When using the “parentDir” argument importIsoformExpression() assumes each quantification files are stored in seperate sub-directory. To accomodate cases where that is not the case the “sampleVector” argument should be used instead. Type “?importIsoformExpression” for more information.
When using the “parentDir” argument importIsoformExpression() can can import only a subset of the subfolders via the “pattern” argument (and its friends). Type “?importIsoformExpression” for details.

The result of using importIsoformExpression() is a list containing both count and abundance estimates for each isoform in each sample (count and abundance matrix):

head(salmonQuant$abundance, 2)
#>       isoform_id Fibroblasts_0 Fibroblasts_1 hESC_0   hESC_1    iPS_0    iPS_1
#> 1 TCONS_00000001      7.481356      7.969927      0 0.000000 0.000000 5.320791
#> 2 TCONS_00000002      0.000000      0.000000      0 2.015807 6.627746 3.218813

head(salmonQuant$counts, 2)
#>       isoform_id Fibroblasts_0 Fibroblasts_1 hESC_0    hESC_1    iPS_0    iPS_1
#> 1 TCONS_00000001      12.30707      14.02487      0 0.0000000  0.00000 18.13313
#> 2 TCONS_00000002       0.00000       0.00000      0 0.1116201 21.10248 10.96964

Please note that (estimated) isoform level counts are nessesary for the statistical analysis.

Apart from the isoform quantification we need 2-3 additional sets of annotation in the IsoformSwitchAnalyzeR workflow:

A design matrix indicating which of the independent biological replicates belong to which condition (and if there are other covariates (incl batch effects) that should be taking into account - more info about that below).
The transcript structure of the isoforms (in genomic coordinates) as well as information about which isoforms originate from the same gene. This information is typically stored in a GTF file - a file format which IsoformSwitchAnalyzeR can directly import and integrate (as described below) you just have to tell IsoformSwitchAnalyzeR where the file is located on your computer/server. If you are using a refrence-only workflow (tools such as Kallisto/Salmon/RSEM etc) this argument should point to the refrence database GTF corresponding to the fasta file that you used to build the refrence index. If you use a guided/de-novo transcriptome assembly approach (tools like StringTie and Cufflinks) this argument should point to the GTF file created at the “merge” stage of the workflow. Please refere to [What Quantification Tool(s) Should I Use] for a more detailed description of the two different workflows. Please note that IsoformSwitchAnalyzeR also support RefSeq GFF files - for all other database you have to use a GTF file - see What Transcript Database Should I Use? for recomendations on databases and more info on how to obtain these.
If you have quantified the transcriptome via tools such as Salmon/Kallisto/RSEM (aka you did NOT do any guided/de-novo transcriptome assembly (using tools such as Cufflinks/StringTie)) we highly recommended that the transcript nucleotide sequences (the fasta file used to make the index for the quantification tool) is also supplied to IsoformSwitchAnalyzeR. If a fasta file is provided IsoformSwitchAnalyzeR which will import and integrate these sequences into the workflow. Although options are available to extracting transcript sequences later via BSgenomes (more info in the Detailed Workflow) importing the transcript fasta file at this stage will provide a much faster (and potentially more accurate) analysis. Providing the sequences here will be (a lot!) easier for people analyzing non-model organisms. You can also find suggestions about how to obtain fasta files which are paired with the GTF/GFF files from some of the major databases in What Transcript Database Should I Use?.

Since the information in point 2-3 from above are just files stored on the computer which IsoformSwitchAnalyzeR itself imports into R, the only thing you need to make yourself is the design matrix. The design matrix will have to be generated manually to ensure correct grouping of samples. For the example data this can be done from the file names as follows:

myDesign <- data.frame(
    sampleID = colnames(salmonQuant$abundance)[-1],
    condition = gsub('_.*', '', colnames(salmonQuant$abundance)[-1])
)
myDesign
#>        sampleID   condition
#> 1 Fibroblasts_0 Fibroblasts
#> 2 Fibroblasts_1 Fibroblasts
#> 3        hESC_0        hESC
#> 4        hESC_1        hESC
#> 5         iPS_0         iPS
#> 6         iPS_1         iPS

Please also note that if you have additional covariates in the experimental setup these can also be added to the design matrix to ensure they are taking into account during the statistical analysis. Covariates are (unwanted) sources of variation not due to experimental conditions you are interested in. This can be anything from a batch effects (e.g. data produced at different points in time) to an effect you are not interested in but which might influence the result (e.g. sex or age). See How to handle confounding effects (including batches) for more info.

Now that we have the isoform quantifications, the design matrix, the isoform annotation and the isoform nucleotide sequence (respectively in a GTF file and a fasta file stored on the computer, both of which IsoformSwitchAnalyzeR will import itself) we can now integrate it all into a switchAnalyzeRlist.

Please note that:

It is highly recommended to both supply count and abundance expression matrices to the importRdata() as count are better for the statistical analysis but the abundance estimates are better to measure effect sizes. If necessary IsoformSwitchAnalyzeR can calculate FPKM values from the counts but the values estimated by the quantification tools are typically much better!
It is essential that the quantification, the GTF file and the fasta file all contain information on the same isoforms. Note the What Transcript Database Should I Use? section contains suggestion for how to obtain such matchin pairs.
It is highly recommended to also supply the nucleotide fasta file as IsoformSwitchAnalyzeR will import and propagate the sequences in the entire workflow making it more accurate and much faster.

Since we now have all the data we can use the importRdata() function to import and integrate it all into a switchAnalyzeRlist:

### Please note:
# The way of importing files in the following example with
# "system.file("extdata/example.gtf.gz", package="IsoformSwitchAnalyzeR")"" is
# specialiced way of accessing the example data in the IsoformSwitchAnalyzeR package
# and not somhting you need to do - just supply the string e.g.
# isoformExonAnnoation="/myAnnotation/myQuantified.gtf" to the isoformExonAnnoation argument

### Create switchAnalyzeRlist
aSwitchList <- importRdata(
    isoformCountMatrix   = salmonQuant$counts,
    isoformRepExpression = salmonQuant$abundance,
    designMatrix         = myDesign,
    isoformExonAnnoation = system.file("extdata/example.gtf.gz"             , package="IsoformSwitchAnalyzeR"),
    isoformNtFasta       = system.file("extdata/example_isoform_nt.fasta.gz", package="IsoformSwitchAnalyzeR"),
    fixStringTieAnnotationProblem = TRUE,
    showProgress = FALSE
)
#> Step 1 of 10: Checking data...
#> Step 2 of 10: Obtaining annotation...
#>     importing GTF (this may take a while)...
#> Step 3 of 10: Fixing StringTie gene annoation problems...
#>     There were no need to rescue any annotation
#>     302 genes_id were assigned their original gene_id instead of the StringTie gene_id.
#>         This was only done when it could be done unambiguous.
#> Step 4 of 10: Calculating expression estimates from count data...
#>     Skipped as user supplied expression via the "isoformRepExpression" argument...
#> Step 5 of 10: Testing for unwanted effects...
#>     Added 1 batch/covariates to the design matrix
#> Step 6 of 10: Batch correcting expression estimates...
#> Step 7 of 10: Extracting data from each condition...
#> Step 8 of 10: Making comparisons...
#> Step 9 of 10: Making switchAnalyzeRlist object...
#> Step 10 of 10: Guestimating differential usage...
#>     The GUESSTIMATED number of genes with differential isoform usage are:
#>            comparison estimated_genes_with_dtu
#> 1 Fibroblasts vs hESC                  26 - 44
#> 2  Fibroblasts vs iPS                  50 - 83
#> 3         hESC vs iPS                  37 - 62
#> Done

summary(aSwitchList)
#> This switchAnalyzeRlist list contains:
#>  1092 isoforms from 368 genes
#>  3 comparison from 3 conditions (in total 6 samples)
#> 
#> Feature analyzed:
#> [1] "ORFs, ntSequence"

Please note that:

By supplying the GTF file to the “isoformExonAnnoation” argument IsoformSwitchAnalyzeR will automatically import and integrate CoDing Sequence (CDS, aka the part that is predicted to give rise to a protein) regions from in the GTF file as the ORF regions used by IsoformSwitchAnalyzeR.
By setting fixStringTieAnnotationProblem = TRUE (default) importRdata will automatically try and correct some of the annoation problems created when doing transcript assembly (unassigned transcripts and merged genes). For more information see documentation of importRdata (by typing ?importRdata in R).
The guesstimated provided is just that - a guess informed by an estimate. This is provided for getting a quick idea of what to expect. The final number will naturally not be known untill further down the workflow.

For more information about the switchAnalyzeRlist see IsoformSwitchAnalyzeR Background Information.

Part 1

Before we can run the analysis, it is necessary to know that IsoformSwitchAnalyzeR measures isoform usage vian isoform fraction (IF) values which quantifies the fraction of the parent gene expression originating from a specific isoform (calculated as isoform_exp / gene_exp). Consequently, the difference in isoform usage is quantified as the difference in isoform fraction (dIF) calculated as IF2 - IF1, and these dIF are used to measure the effect size (like fold changes are in gene/isoform expression analysis).

To illustrate the IsoformSwitchAnalyzeR workflow we will use a smaller example dataset equal to what you would get from following the import steps described above (used for convenience). The example switchAnalyzeRlist is included in IsoformSwitchAnalyzeR and can loaded as follows:

### load example data
data("exampleSwitchList")

### Subset for quick runtime
exampleSwitchList <- subsetSwitchAnalyzeRlist(
    switchAnalyzeRlist = exampleSwitchList,
    subset = abs(exampleSwitchList$isoformFeatures$dIF) > 0.4
)

### Print summary of switchAnalyzeRlist
summary(exampleSwitchList)
#> This switchAnalyzeRlist list contains:
#>  20 isoforms from 12 genes
#>  1 comparison from 2 conditions (in total 6 samples)
#> 
#> Feature analyzed:
#> [1] "ORFs, ntSequence"

We can now run the first part of the isoform switch analysis workflow which filters for non-expressed genes/isoforms, identifies isoform switches, annotates open reading frames (ORF), switches with and extracts both the nucleotide and peptide (amino acid) sequences and output them as two separate fasta files (here turned off to make the example run - so a user should use outputSequences=TRUE).

exampleSwitchList <- isoformSwitchAnalysisPart1(
    switchAnalyzeRlist   = exampleSwitchList,
    # pathToOutput = 'path/to/where/output/should/be/'
    outputSequences      = FALSE, # change to TRUE whan analyzing your own data 
    prepareForWebServers = FALSE  # change to TRUE if you will use webservers for external sequence analysis
)

Note that:

In the example above, we set outputSequences=FALSE only to make the example data run nicely (e.g. to prevent the function from out the two fasta files of the example data to your computer). When you analyze your own data, you want to set outputSequences=TRUE and use the pathToOutput to specify where the files should be saved. The two files outputted are needed to do the external analysis as described below.
The isoformSwitchAnalysisPart1() function has an argument, overwritePvalues, which overwrites the result of an already existing switch test (such as those imported by cufflinks) with the result of running isoformSwitchTestDEXSeq().
The switchAnalyzeRlist returned by isoformSwitchAnalysisPart1() has been reduced to only contain genes where an isoform switch (as defined by the alpha and dIFcutoff arguments) was identified. This enables much faster runtimes for the rest of the pipeline, as only isoforms from a gene with a switch are analyzed.
If you are planning on using the webserver version of the external sequence analysis tools you should set prepareForWebServers=TRUE. This argument exists because some webservers (Pfam and SignalP) have (quite) strict limits on the number (and length) of amino acid sequences one can submit at the time. When prepareForWebServers=TRUE (default) IsoformSwitchAnalyzeR will generate multiple filtered amino acid fasta files which can then be supplied as multiple individual analysis at the websites as well as trim the sequences which are too long (will not cause problems downstream since IsoformSwitchAnalyzeR have been optimized to take this into account).

The number of switching features is easily summarized as follows:

extractSwitchSummary( exampleSwitchList )
#>            Comparison nrIsoforms nrSwitches nrGenes
#> 1 hESC vs Fibroblasts         10          5       5

In a typical workflow, the user would here have to use produced fasta files to perform the external analysis tools (Pfam (for prediction of protein domains), SignalP (for prediction of signal peptides), CPAT or CPC2 (for prediction of coding potential, since they perform similar analysis so it is only necessary to run one of them)), Prediction of sub-cellular location (DeepLoc2) and isoform toplogy (DeepTmHmm), IUPred2A or NetSurfP-2 (for prediction of Intrinsically Disordered Regions (IDR), since they perform similar analysis so it is only necessary to run one of them. The main difference is that NetSurfP-2 is newer but IUPred2A also allows for prediction of IDBS (which are also integrated into the IsoformSwitchAnalyzeR workflow)). For more information on how to run those tools refer to Advice for Running External Sequence Analysis Tools and Downloading Results. To illustrate the workflow, we will here use the result of running some of these tools on the example data - these results are also included in the IsoformSwitchAnalyzeR package.

Part 2

The second part of the isoform switch analysis workflow, which includes importing and incorporating external sequence annotation, analysis of alternative splicing, predicting functional consequences and visualizing both the general effects of isoform switches and the individual isoform switches. The combined analysis can be done by:

# Please note that in the following the part of the examples using
# the "system.file()" command not nesseary when using your own
# data - just supply the path as a string
# (e.g. pathToCPC2resultFile = "/myFiles/myCPC2results.txt" )

exampleSwitchList <- isoformSwitchAnalysisPart2(
  switchAnalyzeRlist        = exampleSwitchList, 
  n                         = 10,    # if plotting was enabled, it would only output the top 10 switches
  removeNoncodinORFs        = TRUE,
  pathToCPC2resultFile      = system.file("extdata/cpc2_result.txt"         , package = "IsoformSwitchAnalyzeR"),
  pathToPFAMresultFile      = system.file("extdata/pfam_results.txt"        , package = "IsoformSwitchAnalyzeR"),
  pathToIUPred2AresultFile  = system.file("extdata/iupred2a_result.txt.gz"  , package = "IsoformSwitchAnalyzeR"),
  pathToSignalPresultFile   = system.file("extdata/signalP_results.txt"     , package = "IsoformSwitchAnalyzeR"),
  outputPlots               = FALSE  # keeps the function from outputting the plots from this example
)
#> Step 1 of 3 : Importing external sequence analysis...
#> Step 2 of 3 : Analyzing alternative splicing...
#> Step 3 of 3 : Predicting functional consequences...
#> 
#> The number of isoform switches with functional consequences identified were:
#>            Comparison nrIsoforms nrSwitches nrGenes
#> 1 hESC vs Fibroblasts         10          5       5

Please note that if you had to submit your data as multiple runs at the Pfam or SignalP website you can just supply a vector of strings indicating the path to each of the resulting files and IsoformSwitchAnalyzeR will read them all and integrate them.

The exampleSwitchList now contains all the information needed to analyze isoform switches and alternative splicing both for individual genes as well as genome-wide summary statistics or analysis.

The top switching genes can be extracted as follows:

extractTopSwitches(exampleSwitchList, filterForConsequences = TRUE, n=3)
#>            gene_ref           gene_id gene_name condition_1 condition_2
#> 1 geneComp_00000098 XLOC_000088:KIF1B     KIF1B        hESC Fibroblasts
#> 2 geneComp_00000205       XLOC_000177   LDLRAD2        hESC Fibroblasts
#> 3 geneComp_00000389       XLOC_001345  ARHGEF19        hESC Fibroblasts
#>   gene_switch_q_value switchConsequencesGene Rank
#> 1        8.467463e-85                   TRUE    1
#> 2        1.582351e-41                   TRUE    2
#> 3        4.584799e-16                   TRUE    3

Now the data gathering and integration is complete (and some analysis have already been saved to your computer if you used outputPlots=TRUE. This means we are now ready to look at the results via the buld in visualizations as described in the next sections.

Examples Visualizations

To illustrate visualizations, both for individual genes and for the genome-wide analysis, we will use a larger dataset which is a subset of two of the TCGA Cancer types analyzed in Vitting-Seerup et al 2017. The fully annotated switchAnalyzeRlist (corresponding to what was created with isoformSwitchAnalysisPart1() and isoformSwitchAnalysisPart2() above) with the TCGA example data can be loaded as follows:

data("exampleSwitchListAnalyzed")

The number of isoform switches with predicted functional consequences are extracted by setting “filterForConsequences = TRUE” in the extractSwitchSummary() function:

extractSwitchSummary(
    exampleSwitchListAnalyzed,
    filterForConsequences = TRUE
) 
#>                 Comparison nrIsoforms nrSwitches nrGenes
#> 1 COAD_ctrl vs COAD_cancer        653        372     349
#> 2 LUAD_ctrl vs LUAD_cancer        384        205     198
#> 3                 Combined        946        531     497

The switchPlot

One isoform switch of particular interest is found in the ZAK gene (as highlighted in Vitting-Seerup et al 2017). ZAK is a signaling gene involved in the response to radiation (one of the most common treatments for cancers). It is ranked as the 7th most significant isoform switch in the COAD as can easily be seen by subsetting on the top isoform switches (extracted with extractTopSwitches()):

subset(
    extractTopSwitches(
        exampleSwitchListAnalyzed,
        filterForConsequences = TRUE,
        n=10,
        inEachComparison = TRUE
    )[,c('gene_name','condition_1','condition_2','gene_switch_q_value','Rank')],
    gene_name == 'ZAK'
)
#>   gene_name condition_1 condition_2 gene_switch_q_value Rank
#> 7       ZAK   COAD_ctrl COAD_cancer        5.344466e-06    7

Let us take a look at the isoform switch in the ZAK gene (7th most significant gene) using the switchPlot functionality. Note that since the switchAnalyzeRlist contains two different comparison (LUAD and COAD) we need to specify the conditions to plot (this is not necessary if there is only one condition in your data).

switchPlot(
    exampleSwitchListAnalyzed,
    gene='ZAK',
    condition1 = 'COAD_ctrl',
    condition2 = 'COAD_cancer',
    localTheme = theme_bw(base_size = 13) # making text sightly larger for vignette
)

From this plot, we first note that no difference in the gene expression of ZAK between normal and cancer samples (bottom left) meaning that a standard gene level analysis would not pick up any changes to ZAK. Next, we see the large highly significant switch in isoform usage across conditions (bottom right). By comparing which isoforms are changing (bottom right) to the isoform structure (top) it can be inferred that in the healthy samples it is primary the two short isoforms that are used. In the COAD patients, there is however a large increase in the use of the long isoform. Interestingly the long isoform contains a SAM_1 protein domain, a domain which facilitates interaction with both other protein and RNA molecules. This leads to the hypothesis that the long isoform, use more in the tumor samples, results in a protein which could interact with a new/different set of interaction partners in the cancer setting. Since only the long isoform have a IDR with a predicted binding site this IDR strengthens the new-interaction-partners hypothesis. Lastly we see that both the topological analysis (the small boxes above the transcripts) and the subcellular location analysis (indicated below the isoform name to the left) suggest all transcripts are intracellular.

If you want to save this plot as a pdf file via the pdf command, you need to specify “onefile = FALSE”. The following code chunk will produce a nicely-sized figure:

pdf(file = '<outoutDirAndFileName>.pdf', onefile = FALSE, height=6, width = 9)
switchPlot(switchAnalyzeRlist, gene='<gene_name>')
dev.off()

Furthermore, note that:

The differential gene/isoform expression analysis shown here is not a part of the standard IsoformSwitchAnalyzeR workflow but can easily be added as described in Adding differential gene expression.
the switchPlot() function has an argument called IFcutoff which requires the Isoform Fraction of an isoform to be larger than IFcutoff (default 0.05) to be included in the plot. Increasing this cutoff can result in “simpler” plots as less used isoforms will be removed.
Some regions of transcripts might be marked as “Not Analyzed” if you have been using webservers - for more info on that see The switchPlot contains regions “Not Annotated”
The dIFcutoff argument in the switchPlot() function is what determines the “increased/decreased/unchanged usage” underneed the isoform names in the transcript plots. Please refer to the isoform usage (bottom right) for indication of the statistical analysis.

Genome-wide Summaries

To analyze the global usage of isoform switches the next step would typically be to look at the number, and overlap, between conditions - which can be visualized with the extractSwitchOverlap() function:

extractSwitchOverlap(
    exampleSwitchListAnalyzed,
    filterForConsequences=TRUE,
    plotIsoforms = FALSE
)

From which it can be seen that the overlap here is fairly small. By comparing the overlaps of switches (left) and genes (right) we can even see that the two cancer types utilize different switches in some of the same genes.

We can investigate the consequences of isoform switches by counting each type consequence separately as follows:

extractConsequenceSummary(
    exampleSwitchListAnalyzed,
    consequencesToAnalyze='all',
    plotGenes = FALSE,           # enables analysis of genes (instead of isoforms)
    asFractionTotal = FALSE      # enables analysis of fraction of significant features
)

Note the “consequencesToAnalyze” argument enables analysis of only a subset of features.

From this summary plots above many things can be seen. First of all, the most frequent changes affect protein domains, Intrinsically Disordered Regions (IDR) and length of extracellular domains. The most frequent “switch” (both gain and loss of fatures) is in the sub-cellular locations of isoforms. We also notice some consequences are un-evenly distributed (e.g. many more instances of “domain loss” compared to “domain gains”). Such differences can be systematically analyzed as shown below.

Consequence Enrichment Analysis

IsoformSwitchAnalyzeR enables systematic analysis of the predicted consequences of isoform switches. When considering opposite consequence (e.g. the gain vs loss of protein domains) in the plot above its quite easy to see they are unevenly distributed (e.g. more protein domain loss than protein domain gain). Such uneven distribution can be systematically analyzed for all consequences using the build in enrichment analysis:

extractConsequenceEnrichment(
    exampleSwitchListAnalyzed,
    consequencesToAnalyze='all',
    analysisOppositeConsequence = TRUE,
    localTheme = theme_bw(base_size = 14), # Increase font size in vignette
    returnResult = FALSE # if TRUE returns a data.frame with the summary statistics
)

For each pair of opposing consequences (e.g. protein domain loss vs gain) this plot shows the fraction of switches (x-axis) that results in the consequence indicated (e.g. protein domain loss) (y-axis). The fraction is calculated only considering switches affected by either of the opposing consequence indicated on the y-axis. If this fraction is significantly different from 0.5 it indicates there is a systematic bias in the functional consequences of isoform switches.

From the analysis above, it is therefore quite clear, that many of the opposing consequences are significantly unevenly distributed. In other words, many types of consequences seem to be used in a cancer-specific manner. It is specifically noteworthy there is a significant loss of protein domains, IDRs. Interestingly there is also a significant gain of non-reference protein domain isotypes (as the y-axis indicate loss and the fraction is < 0.5) suggesting domain isotopes are also specifically regulated. Lastly this analysis indicate that in COAD there are significantly more switches resulting in isoforms located in additional sub-cellular places (as y-axis indicate loss but the fraction is < 0.5). More on these location gains later.

Please note this enrichment analysis does not consider mixed consequences such as sub-cellular location switch, domain switch and IDR switch.

When comparing the two cancer types (right vs left sub-plot) the overall pattern seems similar - but there might be differences. Such differences can formally be analyzed with the extractConsequenceEnrichmentComparison() function. This function will for each opposing set of consequence (e.g. protein domain loss vs gain), in a pairwise manner contrasts the individual comparisons to assess whether the ratio of loss/gains are indeed significantly different:

extractConsequenceEnrichmentComparison(
    exampleSwitchListAnalyzed,
    consequencesToAnalyze=c('domains_identified','coding_potential'),
    analysisOppositeConsequence = TRUE,
    returnResult = FALSE # if TRUE returns a data.frame with the summary statistics
)

The analysis shows that compared to LUAD, COAD have a significantly higher fraction of switches resulting in protein domain loss, but there is no difference in switches from noncoding to coding transcripts.

As indicated by the analysis done above with extractConsequenceEnrichment() and extractConsequenceSummary() there seems to be a lot of location changes - especially in COAD. We can further investigate these as follows:

extractSubCellShifts(
    exampleSwitchListAnalyzed,
    returnResult = FALSE # returns plot instead of data
)

Note that here we use “none” to indicate if there is no loss or gain of location in a switch. From this plot it is clear that the major number of location changes in COAD are switches the isoforms used more are now also found in Nucleus (as they have gain of “nucleus” (y-axis) and loss of “none” (x-axis)). This indicates that in COAD the cancer-specific isoforms are more frequently also found in the nucleus where they can affect transcription, DNA repair etc. Similarly there is a number of genes where the isoform switch causes a change away from the nucleus meaning these genes can no longer affect transcription, DNA repair etc.

Also note that like many of the other functions you can obtain the results as a data.frame that can easily be analyzed. E.g. the most frequent location changes can be found as follows.

### Extract shifts
locChange <- extractSubCellShifts(
    exampleSwitchListAnalyzed,
    returnResult = TRUE # returns data insted of plot
)

### Sort
locChange <- locChange[sort.list(
    locChange$n_genes,
    decreasing = TRUE
),]

### View
head(locChange, 3)
#> # A tibble: 3 × 6
#>   condition_1 condition_2 location_gain    location_loss n_switch n_genes
#>   <chr>       <chr>       <chr>            <chr>            <int>   <int>
#> 1 COAD_ctrl   COAD_cancer Nucleus          None                22      22
#> 2 COAD_ctrl   COAD_cancer Lysosome_Vacuole None                11      11
#> 3 COAD_ctrl   COAD_cancer Cytoplasm        None                 9       9

Splicing Enrichment Analysis

The abundance of isoform switches analyzed above naturally progresses to the question of what cellular mechanism underlies these changes. Due to the detailed alternative splicing analysis performed by IsoformSwitchAnalyzeR we can analyze this question using the built-in summary function:

extractSplicingEnrichment(
    exampleSwitchListAnalyzed,
    returnResult = FALSE # if TRUE returns a data.frame with the summary statistics
)

This analysis is equivalent to the consequence enrichment analysis described above - just for splicing. By looking for the estimates most different from 0.5 (which correspond to no systematic change) and paying attention to the dot-size (indicating the number of genes with this type of change) we can see that the majority of changes are in regards to use of alternative 3’ acceptor sites (A3) and alternative transcription start sites (ATTS). Futhermore we observe that although the patterns of consequences looked mostly similar (as seen above), some of the underlying changes in alternative splicing are quite different. To futher analyze these differences we can contrast the two cancer types as follows

extractSplicingEnrichmentComparison(
    exampleSwitchListAnalyzed,
    splicingToAnalyze = c('A3','A5','ATSS','ATTS'), # the splicing highlighted above
    returnResult = FALSE # if TRUE returns a data.frame with the results
)

From which we can see the only significantly difference between the two cancer types are the use of alternative transcription start sites (ATSS).

For more details regarding splicing analysis see the Genome-Wide Analysis of Alternative Splicing section.

Overview Plots

Since all the data about isoform and gene expression is saved in the switchAnalyzeRlist, we can also start building custom plots:

### Volcano like plot:
ggplot(data=exampleSwitchListAnalyzed$isoformFeatures, aes(x=dIF, y=-log10(isoform_switch_q_value))) +
     geom_point(
        aes( color=abs(dIF) > 0.1 & isoform_switch_q_value < 0.05 ), # default cutoff
        size=1
    ) +
    geom_hline(yintercept = -log10(0.05), linetype='dashed') + # default cutoff
    geom_vline(xintercept = c(-0.1, 0.1), linetype='dashed') + # default cutoff
    facet_wrap( ~ condition_2) +
    #facet_grid(condition_1 ~ condition_2) + # alternative to facet_wrap if you have overlapping conditions
    scale_color_manual('Signficant\nIsoform Switch', values = c('black','red')) +
    labs(x='dIF', y='-Log10 ( Isoform Switch Q Value )') +
    theme_bw()

As there are many dIF values (effect size) very close to zero, which have a significant isoform switch (black dots above dashed horizontal line) this nicely illustrates why a cutoff on both the dIF and the q-value are necessary.

Another interesting overview plot can be made as follows:

### Switch vs Gene changes:
ggplot(data=exampleSwitchListAnalyzed$isoformFeatures, aes(x=gene_log2_fold_change, y=dIF)) +
    geom_point(
        aes( color=abs(dIF) > 0.1 & isoform_switch_q_value < 0.05 ), # default cutoff
        size=1
    ) + 
    facet_wrap(~ condition_2) +
    #facet_grid(condition_1 ~ condition_2) + # alternative to facet_wrap if you have overlapping conditions
    geom_hline(yintercept = 0, linetype='dashed') +
    geom_vline(xintercept = 0, linetype='dashed') +
    scale_color_manual('Signficant\nIsoform Switch', values = c('black','red')) +
    labs(x='Gene log2 fold change', y='dIF') +
    theme_bw()

Here, it is clear that changes in gene expression and isoform switches are not in any way mutually exclusive, as there are many genes which are both differentially expressed (large gene log2FC) and contain isoform switches (color). This further highlights the importance of also analyzing RNA-seq data at isoform-level resolution.

Back to Table of Content.

Detailed Workflow

Overview of Detailed Workflow

Before you start on this section, we recommend you to read though the Quick Start section, as it gives the large overview, introduces some basic concepts and gives a couple of tips not repeated here.

Compared to the workflow presented in Quick Start, a full workflow for analyzing isoform switches has many sub-steps which each can be customized/optimized. In this section, we will go more into depth with each of the steps as well as provide tips and shortcuts for working with IsoformSwitchAnalyzeR. Specifically, each of the main functions behind these steps will be described and illustrated. For a more comprehensive and detailed description of each individual function please refer to the individual function documentation build into the R package (easily accessed via ?functionName).

First we will start by getting an overview of the detailed workflow before diving into the individual parts. Then we will introduce the IsoformSwitchAnalyzeR Background Information - all you need to know before doing the detailed analysis. This is followed by a detailed step-by-step isoform switch analysis workflow and lastly an overview of useful Other Tools in IsoformSwitchAnalyzeR is provided.

The detailed workflow consists of the following steps (illustrated in Figure 2) which, just like before, can be divided into two parts:

Part 1) Extract Isoform Switches and Their Sequences.

Importing Data Into R
Filtering
Identifying Isoform Switches
Analyzing Open Reading Frames
Extracting Nucleotide and Amino Acid Sequences
Advice for Running External Sequence Analysis Tools and Downloading Results

This corresponds to running the following functions in sequential order (which incidentally is just what the isoformSwitchAnalysisPart1() function does):

### Import data into R via one of:
myQantifications <- importIsoformExpression() # RSEM/Kallisto/Salmon/StringTie

### Create SwitchAnalyzeRlist
mySwitchList <- importRdata()         # OR
mySwitchList <- importCufflinksFiles()

### Filter
mySwitchList <- preFilter( mySwitchList )

### Test for isoform switches
mySwitchList <- isoformSwitchTestDEXSeq( mySwitchList )   # OR
mySwitchList <- isoformSwitchTestSatuRn( mySwitchList ) 

### If analysing (some) novel isoforms (else use CDS from ORF as explained in importRdata() )
mySwitchList <- addORFfromGTF( mySwitchList )
mySwitchList <- analyzeNovelIsoformORF( mySwitchList )

### Extract Sequences
mySwitchList <- extractSequence( mySwitchList )

### Summary
extractSwitchSummary(mySwitchList)

Part 2) Plot All Isoform Switches and Their annotation.

Importing External Sequence Analysis
Predicting Alternative Splicing
Predicting Switch Consequences
Post Analysis of Isoform Switches with Consequences
- Analysis of Individual Isoform Switching
- Genome-Wide Analysis of Isoform Switching

This corresponds to running the following functions in sequential order (just like the isoformSwitchAnalysisPart2() function does):

### Add annotation
mySwitchList <- analyzeCPAT( mySwitchList ) # OR
mySwitchList <- analyzeCPC2( mySwitchList )

mySwitchList <- analyzePFAM( mySwitchList )
mySwitchList <- analyzeSignalP( mySwitchList )

analyzeIUPred2A() # OR
analyzeNetSurfP2()

mySwitchList <- analyzeAlternativeSplicing( mySwitchList )

### Analyse consequences
mySwitchList <- analyzeSwitchConsequences( mySwitchList )

### Visual analysis
# Indiviudal switches
switchPlotTopSwitches( mySwitchList )

### Summary
extractSwitchSummary(mySwitchList, filterForConsequences = TRUE)

For a normal workflow, this would then typically be followed by the global analysis of alternative splicing and consequences:

# global consequence analysis
extractConsequenceSummary( mySwitchList )
extractConsequenceEnrichment( mySwitchList )
extractConsequenceGenomeWide( mySwitchList )

# global splicing analysis
extractSplicingSummary( mySwitchList )
extractSplicingEnrichment( mySwitchList )
extractSplicingGenomeWide( mySwitchList )

The combined workflow is visualized in the figure below.

Figure 2: Detailed workflow overview. The grey transparent boxes indicate the two parts of a normal workflow for analyzing isoform switches. The individual steps in the two sub-workflows are indicated by arrows along with a description and the main R functions for performing the steps.

IsoformSwitchAnalyzeR Background Information

The switchAnalyzeRlist

The switchAnalyzeRlist object is created to specifically contain and summarize all relevant information about the isoforms involved in isoform switches. The switchAnalyzeRlist object is a named list, meaning each entry in the list can be accessed by its name via the ‘$’ symbol or by using “[[‘entryName’]]”. A newly created switchAnalyzeRlist object contains 10 entries, and as the isoforms are gradually annotated and analyzed more entries are added.

### A newly created switchAnalyzeRlist + switch analysis
data("exampleSwitchList")         
names(exampleSwitchList)
#>  [1] "isoformFeatures"      "exons"                "conditions"          
#>  [4] "designMatrix"         "isoformCountMatrix"   "sourceId"            
#>  [7] "orfAnalysis"          "ntSequence"           "isoformRepExpression"
#> [10] "isoformRepIF"

The first entry ‘isoformFeatures’ is a data.frame where all relevant data about each comparison of an isoform (between conditions), as well as the analysis performed and annotation incorporate via IsoformSwitchAnalyzeR, is stored. Among the default information is isoform and gene id, gene and isoform expression as well as the isoform_switch_q_value and isoform_switch_q_value where the result of the differential isoform analysis is stored. The comparisons made can be identified as “from ‘condition_1’ to ‘condition_2’”, meaning ‘condition_1’ is considered the ground state and ‘condition_2’ the changed state. This also means that a positive dIF value indicates that the isoform usage is increased in ‘condition_2’ compared to ‘condition_1’. Since the ‘isoformFeatures’ entry is the most relevant part of the switchAnalyzeRlist object, the most-used standard methods have also been implemented to work directly on isoformFeatures.

### Preview
head(exampleSwitchList, 2)
#>             iso_ref          gene_ref     isoform_id     gene_id condition_1
#> 19 isoComp_00000007 geneComp_00000005 TCONS_00000007 XLOC_000005        hESC
#> 22 isoComp_00000008 geneComp_00000005 TCONS_00000008 XLOC_000005        hESC
#>    condition_2 gene_name nearest_ref_id class_code TSS_group_id length
#> 19 Fibroblasts      <NA>     uc009vjk.2          =         TSS2   2750
#> 22 Fibroblasts      <NA>     uc001aau.2          =         TSS3   4369
#>                 locus gene_status gene_overall_mean gene_value_1 gene_value_2
#> 19 chr1:322036-328580          OK          193.8339      696.704      48.0566
#> 22 chr1:322036-328580          OK          171.6605      696.704      48.0566
#>    gene_stderr_1 gene_stderr_2 gene_log2_fold_change gene_p_value gene_q_value
#> 19      3.592857      2.307488              -3.85774  2.66665e-09  3.20379e-08
#> 22      3.592857      2.307488              -3.85774  2.66665e-09  3.20379e-08
#>    gene_significant iso_status iso_overall_mean iso_value_1 iso_value_2
#> 19              yes         OK         372.3803     358.383    29.28480
#> 22              yes         OK         372.3803     338.308     5.01291
#>    iso_stderr_1 iso_stderr_2 iso_log2_fold_change iso_p_value iso_q_value
#> 19     2.091049    17.489950             -3.61328 4.83698e-05 0.000548629
#> 22     1.322809     6.999295             -6.07655 2.64331e-03 0.015898000
#>    iso_significant IF_overall   IF1   IF2    dIF isoform_switch_q_value   PTC
#> 19             yes     0.5630 0.509 0.617  0.108                     NA FALSE
#> 22             yes     0.2925 0.491 0.094 -0.397                     NA FALSE
#>    gene_switch_q_value
#> 19                  NA
#> 22                  NA
# tail(exampleSwitchList, 2)

### Dimentions
dim(exampleSwitchList$isoformFeatures)
#> [1] 259  39

nrow(exampleSwitchList)
#> [1] 259
ncol(exampleSwitchList)
#> [1] 39
dim(exampleSwitchList)
#> [1] 259  39

A very useful functionality implemented in IsoformSwitchAnalyzeR is the subsetSwitchAnalyzeRlist() function, which allows for removal of isoforms and all their associated information across all entries in a switchAnalyzeRlist. The function subsets the switchAnalyzeRlist based on a vector of logicals matching the isoformFeatures entry of the list.

exampleSwitchList
#> This switchAnalyzeRlist list contains:
#>  259 isoforms from 84 genes
#>  1 comparison from 2 conditions (in total 6 samples)
#> 
#> Feature analyzed:
#> [1] "ORFs, ntSequence"

### Subset
subsetSwitchAnalyzeRlist(
    exampleSwitchList,
    exampleSwitchList$isoformFeatures$gene_name == 'ARHGEF19'
)
#> This switchAnalyzeRlist list contains:
#>  3 isoforms from 1 genes
#>  1 comparison from 2 conditions (in total 6 samples)
#> 
#> Feature analyzed:
#> [1] "ORFs, ntSequence"

Transcript structure information is stored in the exon entry of the switchAnalyzeRlist and contains the genomic coordinates for each exon in each isoform, and a column indicating which isoform it originates from. This information is stored as GenomicRanges (GRanges), which is very useful for overlapping genomic features and interacting with other Bioconductor packages.

head(exampleSwitchList$exons,2)
#> GRanges object with 2 ranges and 2 metadata columns:
#>       seqnames        ranges strand |     isoform_id     gene_id
#>          <Rle>     <IRanges>  <Rle> |    <character> <character>
#>   [1]     chr1 322037-322228      + | TCONS_00000007 XLOC_000005
#>   [2]     chr1 324288-324345      + | TCONS_00000007 XLOC_000005
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

A full description of the initial switchAnalyzeRlist can be found via ?switchAnalyzeRlist and each additional entry added is described in the “value” part of the documentation for the specific function adding the entry.

IsoformSwitchAnalyzeR

Enabling Analysis of Isoform Switches with Functional Consequences and the Underlying Alternative Splicing

Kristoffer Vitting-Seerup

2023-06-27

Abstract

Table of Content

Preliminaries

Background and Package Description

Installation

How To Get Help

What To Cite

Quick Start

Overview of Isoform Switch Workflow

Example Workflow

Importing the Data

Part 1

Part 2

Examples Visualizations

The switchPlot

Genome-wide Summaries

Consequence Enrichment Analysis

Splicing Enrichment Analysis

Overview Plots

Detailed Workflow

Overview of Detailed Workflow

IsoformSwitchAnalyzeR Background Information

The switchAnalyzeRlist

Function overview

Importing Data Into R

Importing Data from Kallisto, Salmon, RSEM or StringTie

Importing Data from Salmon via Tximeta

Importing Data from Cufflinks/Cuffdiff

Importing Data from Full-length Transcript Assemblers

Filtering

Identifying Isoform Switches

Testing for Isoform Switches via DEXSeq

Testing for Isoform Switches via satuRn

Testing for Isoform Switches with other Tools

Analyzing Open Reading Frames

Analyzing Only Known Isoforms

Analyzing Known and Novel Isoforms

Analyzing Only Novel Isoforms

Extracting Nucleotide and Amino Acid Sequences

Advice for Running External Sequence Analysis Tools and Downloading Results

Importing External Sequence Analysis

Predicting Alternative Splicing

Predicting Switch Consequences

Post Analysis of Isoform Switches with Consequences

Analysis of Individual Isoform Switching

Genome-Wide Analysis of Isoform Switching

Other Tools in IsoformSwitchAnalyzeR

Analyzing Alternative Splicing

Overview of Alternative Splicing Workflow

Genome-Wide Analysis of Alternative Splicing

Other workflows

Augmenting ORF Predictions with Pfam Results

Analyze Small Upstream ORFs

Remove Sequences Stored in SwitchAnalyzeRlist

Adding Uncertain Category to Coding Potential Predictions

Quality control of ORF of known annotation

Analyzing the Biological Mechanisms Behind Isoform Switching

Analyzing experiments without replicates

Rescue StringTie Annoation and Extract Gene Count Matrix

Frequently Asked Questions, Problems and Errors

FAQ Table of Content

What Quantification Tool(s) Should I Use?

What Transcript Database Should I Use?

Are there Different Approaches to Analysis of Alternative Splicing?

Approach 1: Exon based analysis

Approach 2: Splicing based analysis

Approach 3: Transcript based analysis

How to handle confounding effects (including batches)

What constitute an independent biological replicate?

Adding differential gene expression

The error “The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925)”

The warning “The annotation and quantification (count/abundance matrix and isoform annotation) seem to be slightly different”

The error “The supplied design matrix will result in a model matrix that is not full rank”

How to prevent visualization of NMD in the switchPlot

BSGenomes for non-model organisms

The switchPlot contains regions “Not Annotated”