%\VignetteIndexEntry{FRASER: Find RAre Splicing Events in RNA-seq Data} %\VignettePackage{FRASER} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} \documentclass[11pt]{article} <>= BiocStyle::latex() @ <>= library(BiocParallel) # for speed we run everything in serial mode register(SerialParam()) @ <>= library(knitr) opts_chunk$set( tidy=FALSE, dev="png", fig.width=7, fig.height=7, dpi=300, message=FALSE, warning=FALSE, cache=FALSE ) @ \usepackage{amsmath} \usepackage{verbatim} \usepackage[nottoc]{tocbibind} \newcommand{\fraser}{\Biocpkg{FRASER}} \newcommand{\fds}{\Rclass{FraserDataSet}} \title{FRASER: Find RAre Splicing Events in RNA-seq Data} \author{ Christian Mertes$^{1}$, Ines Scheller$^{1}$, Julien Gagneur$^{1}$ \\ \small{$^{1}$ Technical University of Munich, Department of Informatics, Garching, Germany} } \begin{document} <>= opts_chunk$set(concordance=TRUE) @ \maketitle \begin{abstract} Genetic variants affecting splicing are a major cause of rare diseases yet their identification remains challenging. Recently, detecting splicing defects by RNA sequencing (RNA-seq) has proven to be an effective complementary avenue to genomic variant interpretation. However, no specialized method exists for the detection of aberrant splicing events in RNA-seq data. Here, we addressed this issue by developing the statistical method \fraser{} (Find RAre Splicing Events in RNA-seq). \fraser{} detects splice sites de novo, assesses both alternative splicing and intron retention, automatically controls for latent confounders using a denoising autoencoder, and provides significance estimates using an over-dispersed count fraction distribution. \fraser{} outperforms state-of-the-art approaches on simulated data and on enrichments for rare near-splice site variants in 48 tissues of the GTEx dataset. Application to a previously analysed rare disease dataset led to a new diagnostic by reprioritizing an aberrant exon truncation in TAZ. Altogether, we foresee \fraser{} as an important tool for RNA-seq based diagnostics of rare diseases. \vspace{1em} \begin{center} \begin{tabular}{ | l | } \hline If you use \fraser{} version >= 1.99.0 in published research, please cite: \\ \\ Scheller I, Lutz K, Mertes C, \emph{et al.} \textbf{Improved detection of aberrant splicing with FRASER 2.0} \\ \textbf{using the Intron Jaccard Index}, medrXiv, 2023, \\ \emph{\url{https://doi.org/10.1101/2023.03.31.23287997}} \\ \hline For previous versions of \fraser{}, please cite: \\ \\ Mertes C, Scheller I, Yepez V, \emph{et al.} \textbf{Detection of aberrant splicing events} \\ \textbf{in RNA-seq data with FRASER}, biorXiv, 2019, \\ \emph{\url{https://doi.org/10.1101/2019.12.18.866830}} \\ \hline \end{tabular} \end{center} \end{abstract} \packageVersion{\Sexpr{BiocStyle::pkg_ver("FRASER")}} \newpage \tableofcontents \newpage <>= suppressPackageStartupMessages({ library(FRASER) }) @ \section{Introduction} \label{sec:Introduction} \fraser{} (Find RAre Splicing Evens in RNA-seq) is a tool for finding aberrant splicing events in RNA-seq samples. It works on the splice metrics $\psi_5$, $\psi_3$ and $\theta$ to be able to detect any type of aberrant splicing event from exon skipping over alternative donor usage to intron retention. To detect these aberrant events, \fraser{} uses a similar approach as the \Biocpkg{OUTRIDER} package that aims to find aberrantly expressed genes and makes use of an autoencoder to automatically control for confounders within the data. \fraser{} also uses this autoencoder approach and models the read count ratios in the $\psi$ values by fitting a beta binomial model to the $\psi$ values obtained from RNA-seq read counts and correcting for apparent co-variations across samples. Similarly as in \Biocpkg{OUTRIDER}, read counts that significantly deviate from the distribution are detected as outliers. A scheme of this approach is given in Figure \ref{FRASER_sketch}. \incfig{FRASER_sketch}{1\textwidth}{The \fraser{} splicing outlier detection workflow.}{ The workflow starts with RNA-seq aligned reads and performs splicing outlier detection in three steps. First (left column), a splice site map is generated in an annotation-free fashion based on RNA-seq split reads. Split reads supporting exon-exon junctions as well as non-split reads overlapping splice sites are counted. Splicing metrics quantifying alternative acceptors ( $\psi_5$), alternative donors ($\psi_3$) and splicing efficiencies at donors ($\theta_5$) and acceptors ($\theta_3$) are computed. Second (middle column), a statistical model is fitted for each splicing metric that controls for sample covariations (latent space fitting using a denoising autoencoder) and overdispersed count ratios (beta-binomial distribution). Third (right column), outliers are detected as data points significantly deviating from the fitted models. Candidates are then visualized with a genome browser. } \fraser{} uses the following splicing metrics as described by Pervouchine et al\cite{Pervouchine2012}: we compute for each sample, for donor D (5' splice site) and acceptor A (3' splice site) the $\psi_5$ and $\psi_3$ values, respectively, as: \begin{equation} \psi_5(D,A) = \frac{n(D,A)}{\sum_{A'} n(D,A')}\label{eq:psi5} \end{equation} and \begin{equation} \psi_3(D,A) = \frac{n(D,A)}{\sum_{D'} n(D',A)}\label{eq:psi3}, \end{equation} where $n(D,A)$ denotes the number of split reads spanning the intron between donor D and acceptor A and the summands in the denominators are computed over all acceptors found to splice with the donor of interest (Equation \eqref{eq:psi5}), and all donors found to splice with the acceptor of interest (Equation \eqref{eq:psi3}). To not only detect alternative splicing but also partial or full intron retention, we also consider $\theta$ as a splicing efficiency metric. \begin{equation} \theta_5(D) = \frac{\sum_{A'}n(D,A')}{n(D) + \sum_{A'}n(D,A')} \label{eq:theta5} \end{equation} and \begin{equation} \theta_3(A) = \frac{\sum_{D'}n(D',A)}{n(A) + \sum_{D'}n(D',A)} \label{eq:theta3}, \end{equation} where $n(D)$ is the number of non-split reads spanning exon-intron boundary of donor D, and $n(A)$ is defined as the number of non-split reads spanning the intron-exon boundary of acceptor A. While we calculate $\theta$ for the 5' and 3' splice site separately, we do not distinguish later in the modeling step between $\theta_5$ and $\theta_3$ and hence call it jointly $\theta$ in the following. From \fraser{} 2.0 on, only a single metric - the Intron Jaccard Index (Figure \ref{IntronJaccardIndex_sketch}) - is used by default. The Intron Jaccard Index is more robust and allows to focus more on functionally relevant aberrant splicing events. It allows to detect all types of aberrant splicing previously detected using the three metrics ($\psi_5$, $\psi_3$, $\theta$) within a single metric. \incfig{IntronJaccardIndex_sketch}{1\textwidth}{Overview over the Intron Jaccard Index, the splice metric used in \fraser{}2.}{ The Intron Jaccard Index considers both split and nonsplit reads within a single metric and allows to detect all different types of aberrant splicing previously captured with either of the metrics $\psi_5$, $\psi_3$, $\theta$. } The Intron Jaccard Index considers both split and nonsplit reads and is defined as the Jaccard index of the set of donor reads (reads sharing a donor site with the intron of interest and nonsplit reads at that donor site) and acceptor reads (reads sharing an acceptor site with the intron of interest and nonsplit reads at that acceptor site): \begin{equation} J(D,A) = \frac{n(D,A)}{\sum_{A'} n(D,A') + \sum_{D'} n(D',A) + n(D) + n(A) - n(D,A)} \label{eq:jaccard} \end{equation} \section{Quick guide to \fraser{}} Here we show how to do an analysis with \fraser{}, starting from a sample annotation table and raw data (RNA-seq BAM files). First, we create a \fds{} object from the sample annotation and count the relevant reads in the BAM files. Then, we compute the $\psi/\theta$ values and filter out introns that are lowly expressed. Secondly, we run the full pipeline using the command \Rfunction{FRASER}. In the last step, we extract the results table from the \fds{} using the \Rfunction{results} function. Additionally, the user can create several analysis plots directly from the fitted \fds{} object. These plotting functions are described in section \ref{sec:result-vis}. <>= # load FRASER library library(FRASER) # count raw data fds <- createTestFraserSettings() fds <- countRNAData(fds) fds # compute stats fds <- calculatePSIValues(fds) # filter junctions with low expression fds <- filterExpressionAndVariability(fds, minExpressionInOneSample=20, minDeltaPsi=0.0, filter=TRUE) # we provide two ways to annotate introns with the corresponding gene symbols: # the first way uses TxDb-objects provided by the user as shown here library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene orgDb <- org.Hs.eg.db fds <- annotateRangesWithTxDb(fds, txdb=txdb, orgDb=orgDb) # fit the splicing model for each metric # with a specific latentspace dimension fds <- FRASER(fds, q=c(jaccard=2)) # Alternatively, we also provide a way to use BioMart for the annotation: # fds <- annotateRanges(fds) # get results: we recommend to use an FDR cutoff of 0.05, but due to the small # dataset size, we extract all events and their associated values # eg: res <- results(fds, padjCutoff=0.05, deltaPsiCutoff=0.1) res <- results(fds, all=TRUE) res # result visualization, aggregate=TRUE means that results are aggregated at the gene level plotVolcano(fds, sampleID="sample1", type="jaccard", aggregate=TRUE) @ \section{A detailed \fraser{} analysis} The analysis workflow of \fraser{} for detecting rare aberrant splicing events in RNA-seq data can be divided into the following steps: \begin{enumerate} \item Data import or counting reads \ref{sec:dataPreparation} \item Data preprocessing and QC \ref{sec:DataPreprocessing} \item Correcting for confounders \ref{sec:correction} \item Calculating P-values \ref{sec:P-value-calculation} \item Visualizing the results \ref{sec:result-vis} \end{enumerate} Steps 3 and 4 are wrapped up in one function \Rfunction{FRASER}, but each step can be called individually and parametrized. Either way, data preprocessing should be done before starting the analysis, so that samples failing quality measurements or introns stemming from background noise are discarded. Detailed explanations of each step are given in the following subsections. For this tutorial, we will use the a small example dataset that is contained in the package. \subsection{Data preparation} \label{sec:dataPreparation} \subsubsection{Creating a \fds{} and Counting reads} \label{sec:CountingReads} To start an RNA-seq data analysis with \fraser{} some preparation steps are needed. The first step is the creation of a \fds{} which derives from a RangedSummarizedExperiment object. To create the \fds, sample annotation and two count matrices are needed: one containing counts for the splice junctions, i.e. the split read counts, and one containing the splice site counts, i.e. the counts of non split reads overlapping with the splice sites present in the splice junctions. You can first create the \fds{} with only the sample annotation and subsequently count the reads as described in \ref{sec:CountingReads}. For this, we need a table with basic informations which then can be transformed into a \Rclass{FraserSettings} object. The minimum of information per sample is a unique sample name and the path to the BAM file. Additionally groups can be specified for the P-value calculations. If a \textbf{NA} is assigned, no P-values will be calculated. An example sample table is given within the package: <>= sampleTable <- fread(system.file( "extdata", "sampleTable.tsv", package="FRASER", mustWork=TRUE)) head(sampleTable) @ To create a settings object for \fraser{}, the constructor \Rfunction{FraserSettings} should be called with at least a sampleData table. For an example have a look into the \Rfunction{createTestFraserSettings}. In addition to the sampleData you can specify further parameters. \begin{enumerate} \item The parallel backend (a \Rclass{BiocParallelParam} object) \item The read filtering (a \Rclass{ScanBamParam} object) \item An output folder for the resulting figures and the cache \item If the data is strand specific or not \end{enumerate} The following shows how to create a example \fds{} with only the settings options from the sample annotation above: <>= # convert it to a bamFile list bamFiles <- system.file(sampleTable[,bamFile], package="FRASER", mustWork=TRUE) sampleTable[, bamFile := bamFiles] # create FRASER object settings <- FraserDataSet(colData=sampleTable, workingDir="FRASER_output") # show the FraserSettings object settings @ The \fds{} for this example data can also be generated through the function \Rfunction{createTestFraserSettings}: <>= settings <- createTestFraserSettings() settings @ Counting the reads is straightforward and is done through the \Rfunction{countRNAData} function. The only required parameter is the FraserSettings object. First, all split reads are extracted from each individual sample and cached if enabled. Then a dataset-wide junction map is created (all visible junctions over all samples). After that for each sample the non-spliced reads at each given donor and acceptor site are counted. The resulting \Rclass{FraserDataSet} object contains two \Rclass{SummarizedExperiment} objects, one for the junctions and one for the splice sites. <>= # example of how to use parallelization: use 10 cores or the maximal number of # available cores if fewer than 10 are available and use Snow if on Windows if(.Platform$OS.type == "unix") { register(MulticoreParam(workers=min(10, multicoreWorkers()))) } else { register(SnowParam(workers=min(10, multicoreWorkers()))) } @ <>= # count reads fds <- countRNAData(settings) fds @ \subsubsection{Creating a \fds{} from existing count matrices} If the count matrices already exist, you can use these matrices directly together with the sample annotation from above to create the \fds: <>= # example sample annotation for precalculated count matrices sampleTable <- fread(system.file("extdata", "sampleTable_countTable.tsv", package="FRASER", mustWork=TRUE)) head(sampleTable) # get raw counts junctionCts <- fread(system.file("extdata", "raw_junction_counts.tsv.gz", package="FRASER", mustWork=TRUE)) head(junctionCts) spliceSiteCts <- fread(system.file("extdata", "raw_site_counts.tsv.gz", package="FRASER", mustWork=TRUE)) head(spliceSiteCts) # create FRASER object fds <- FraserDataSet(colData=sampleTable, junctions=junctionCts, spliceSites=spliceSiteCts, workingDir="FRASER_output") fds @ \subsection{Data preprocessing and QC} \label{sec:DataPreprocessing} As with gene expression analysis, a good quality control of the raw data is crucial. For some hints please refere to our workshop slides\footnote{\url{http://tinyurl.com/RNA-ASHG-presentation}}. At the time of writing this vignette, we recommend that the RNA-seq data should be aligned with a splice-aware aligner like STAR\cite{Dobin2013} or GEM\cite{MarcoSola2012}. To obtain better results, at least 50 samples should be sequenced and they should be processed with the same protocol and originated from the same tissue. \subsubsection{Filtering} \label{sec:filtering} Before filtering the data, we have to compute the main splicing metrics: the $\psi$-value (Percent Spliced In) and the Intron Jaccard Index. <>= fds <- calculatePSIValues(fds) fds @ Now we can filter down the number of junctions we want to test later on. Currently, we suggest keeping only junctions which support the following: \begin{itemize} \item At least one sample has 20 (or more) reads \item 25\% (or more) of the samples have at least 10 reads \end{itemize} Furthemore one could filter for: \begin{itemize} \item At least one sample has a $|\Delta\psi|$ of 0.1 \end{itemize} <>= fds <- filterExpressionAndVariability(fds, minDeltaPsi=0, filter=FALSE) plotFilterExpression(fds, bins=100) @ After looking at the expression distribution between filtered and unfiltered junctions, we can now subset the dataset: <>= fds_filtered <- fds[mcols(fds, type="j")[,"passed"]] fds_filtered # filtered_fds not further used for this tutorial because the example dataset # is otherwise too small @ \subsubsection{Sample co-variation} Since $\psi$ values are ratios within a sample, one might think that there should not be as much correlation structure as observed in gene expression data within the splicing data. However, we do see strong sample co-variation across different tissues and cohorts. Let's have a look into our demo data to see if we it has correlation structure or not. To have a better estimate, we use the logit transformed $\psi$ values to compute the correlation. <>= # Heatmap of the sample correlation plotCountCorHeatmap(fds, type="jaccard", logit=TRUE, normalized=FALSE) @ It is also possible to visualize the correlation structure of the logit transformed $\psi$ values of the $topJ$ most variable introns for all samples: <>= # Heatmap of the intron/sample expression plotCountCorHeatmap(fds, type="jaccard", logit=TRUE, normalized=FALSE, plotType="junctionSample", topJ=100, minDeltaPsi = 0.01) @ \subsection{Detection of aberrant splicing events} After preprocessing the raw data and visualizing it, we can start with our analysis. Let's start with the first step in the aberrant splicing detection: the model fitting. \subsubsection{Fitting the splicing model} During the fitting procedure, we will normalize the data and correct for confounding effects by using a denoising autoencoder. Here we use a predefined latent space with a dimension $q=10$ . Using the correct dimension is crucial to have the best performance (see \ref{sec:encDim}). Alternatively, one can also use a PCA to correct the data. The wrapper function \Rfunction{FRASER} both fits the model and calculates the p-values for all $\psi$ types. For more details see section \ref{sec:details}. <>= # This is computational heavy on real datasets and can take some hours fds <- FRASER(fds, q=c(jaccard=3)) @ To check whether the correction worked, we can have a look at the correlation heatmap using the normalized $\psi$ values from the fit. <>= plotCountCorHeatmap(fds, type="jaccard", normalized=TRUE, logit=TRUE) @ \subsubsection{Calling splicing outliers} Before we extract the results, we should add HGNC symbols to the junctions. \fraser{} comes already with an annotation function. The function uses \Biocpkg{biomaRt} in the background to overlap the genomic ranges with the known HGNC symbols. To have more flexibilty on the annotation, one can also provide a custom `txdb` object to annotate the HGNC symbols. Here we assume a beta binomial distribution and call outliers based on the significance level. The user can choose between a p value cutoff, a cutoff on the $\Delta\psi$ values between the observed and expected $\psi$ values or both. <>= # annotate introns with the HGNC symbols of the corresponding gene library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene orgDb <- org.Hs.eg.db fds <- annotateRangesWithTxDb(fds, txdb=txdb, orgDb=orgDb) # fds <- annotateRanges(fds) # alternative way using biomaRt # retrieve results with default and recommended cutoffs (padj <= 0.05 and # |deltaPsi| >= 0.3) res <- results(fds) @ \subsubsection{Interpreting the results table} The function \Rfunction{results} retrieves significant events based on the specified cutoffs as a \Rclass{GRanges} object which contains the genomic location of the splice junction or splice site that was found as aberrant and the following additional information: \begin{itemize} \item sampleID: the sampleID in which this aberrant event occurred \item hgncSymbol: the gene symbol of the gene that contains the splice junction or site, if available \item type: the metric for which the aberrant event was detected (either jaccard for Intron Jaccard Index or psi5 for $\psi_5$, psi3 for $\psi_3$ or theta for $\theta$) \item pValue, padjust: the p-value and adjusted p-value (FDR) of this event (at intron or splice site level depending on metric) \item pValueGene, padjustGene: only present in the gene-level results table, gives the p-value and FDR adjusted p-value at gene-level \item psiValue: the value of the splice metric (see 'type' column for the name of the metric) of this junction or splice site for the sample in which it is detected as aberrant \item deltaPsi: the $\Delta\psi$-value of the event in this sample, which is the difference between the actual observed $\psi$ and the expected $\psi$ \item counts, totalCounts: the count (k) and total count (n) of the splice junction or site for the sample where it is detected as aberrant \item meanCounts: the mean count (k) of reads mapping to this splice junction or site over all samples \item meanTotalCounts: the mean total count (n) of reads mapping to the same donor or acceptor site as this junction or site over all samples \item nonsplitCounts, nonsplitProportion: only present for the Intron Jaccard Index. States the sum of nonsplit counts overlapping either the donor or acceptor site of the outlier intron for the sample where it is detected as aberrant; and their proportion out of the total counts (N). A high nonsplitProportion indicates possible (partial) intron retention. \item FDR\_set The set of genes on which FDR correction is applied. If not otherwise specified, FDR correction is transcriptome-wide. \end{itemize} Please refer to section \ref{sec:Introduction} for more information about the Intron Jaccard Index metric (or the previous metrics $\psi_5$, $\psi_3$ and $\theta$) and their definition. In general, an aberrant $\psi_5$ value might indicate aberrant acceptor site usage of the junction where the event is detected; an aberrant $\psi_3$ value might indicate aberrant donor site usage of the junction where the event is detected; and an aberrant $\theta$ value might indicate partial or full intron retention, or exon truncation or elongation. As the Intron Jaccard Index combines the three metrics, an aberrant Intron Jaccard value can indicate any of the above described cases. We recommend inspecting the outliers using IGV. \fraser{}2 also provides the function \Rfunction{plotBamCoverageFromResultTable} to create a sashimi plot for an outlier in the results table directly in R (if paths to bam files are available in the \fds{} object). <>= # for visualization purposes for this tutorial, no cutoffs were used res <- results(fds, all=TRUE) res # for the gene level pvalues, gene symbols need to be added to the fds object # before calling the calculatePadjValues function (part of FRASER() function) # as we previously called FRASER() before annotating genes, we run it again here fds <- calculatePadjValues(fds, type="jaccard", geneLevel=TRUE) # generate gene-level results table (if gene symbols have been annotated) res_gene <- results(fds, aggregate=TRUE, all=TRUE) res_gene @ \subsection{Finding splicing candidates in patients} Let's have a look at sample 10 and check if we got some splicing candidates for this sample. <>= plotVolcano(fds, type="jaccard", "sample10") @ Which are the splicing events in detail? <>= sampleRes <- res[res$sampleID == "sample10"] sampleRes @ To have a closer look at the junction level, use the following functions: <>= plotExpression(fds, type="jaccard", result=sampleRes[9]) # plots the 9th row plotSpliceMetricRank(fds, type="jaccard", result=sampleRes[9]) plotExpectedVsObservedPsi(fds, result=sampleRes[9]) @ \subsection{Saving and loading a \fds{}} A \fds{} object can be easily saved and reloaded as follows: <>= # saving a fds workingDir(fds) <- "FRASER_output" name(fds) <- "ExampleAnalysis" saveFraserDataSet(fds, dir=workingDir(fds), name=name(fds)) # two ways of loading a fds by either specifying the directory and anaysis name # or directly giving the path the to fds-object.RDS file fds <- loadFraserDataSet(dir=workingDir(fds), name=name(fds)) fds <- loadFraserDataSet(file=file.path(workingDir(fds), "savedObjects", "ExampleAnalysis", "fds-object.RDS")) @ \section{More details on \fraser{}} \label{sec:details} The function \Rfunction{FRASER} is a convenient wrapper function that takes care of correcting for confounders, fitting the beta binomial distribution and calculating p-values for all $\psi$ types. To have more control over the individual steps, the different functions can also be called separately. The following sections give a short explanation of these steps. \subsection{Correction for confounders} \label{sec:correction} The wrapper function \Rfunction{FRASER} and the underlying function \Rfunction{fit} method offer different methods to automatically control for confounders in the data. Currently the following methods are implemented: \begin{itemize} \item AE: uses a beta-binomial AE \item PCA-BB-Decoder: uses a beta-binomial AE where PCA is used to find the latent space (encoder) due to speed reasons \item PCA: uses PCA for both the encoder and the decoder \item BB: no correction for confounders, fits a beta binomial distribution directly on the raw counts \end{itemize} <>= # Using an alternative way to correct splicing ratios # here: only 2 iterations to speed the calculation up for the vignette # the default is 15 iterations fds <- fit(fds, q=3, type="jaccard", implementation="PCA-BB-Decoder", iterations=2) @ \subsubsection{Finding the dimension of the latent space} \label{sec:encDim} For the previous call, the dimension $q$ of the latent space has been fixed. Since working with the correct $q$ is very important, the \fraser{} package also provides the function \Rfunction{optimHyperParams} that can be used to estimate the dimension $q$ of the latent space of the data. It works by artificially injecting outliers into the data and then comparing the AUC of recalling these outliers for different values of $q$. Since this hyperparameter optimization step can take some time for the full dataset, we only show it here for a subset of the dataset: <>= set.seed(42) # hyperparameter opimization fds <- optimHyperParams(fds, type="jaccard", plot=FALSE) # retrieve the estimated optimal dimension of the latent space bestQ(fds, type="jaccard") @ The results from this hyper parameter optimization can be visualized with the function \Rfunction{plotEncDimSearch}. <>= plotEncDimSearch(fds, type="jaccard") @ \subsection{P-value calculation} \label{sec:P-value-calculation} After determining the fit parameters, two-sided beta binomial P-values are computed using the following equation: \begin{equation} p_{ij} = 2 \cdot min \left\{\frac{1}{2}, \sum_{0}^{k_{ij}} BB(k_{ij}, n_{ij}, \mu_{ij} ,\rho_i), 1 - \sum_{0}^{k_{ij-1}} BB(k_{ij}, n_{ij}, \mu_{ij} ,\rho_i) \right\}, \end{equation} where the $\frac{1}{2}$ term handles the case of both terms exceeding 0.5, which can happen due to the discrete nature of counts. Here $\mu_{ij}$ are computed as the product of the fitted correction values from the autoencoder and the fitted mean adjustements. <>= fds <- calculatePvalues(fds, type="jaccard") head(pVals(fds, type="jaccard")) @ Afterwards, adjusted p-values can be calculated. Multiple testing correction is done across all junctions in a per-sample fashion using Benjamini-Yekutieli's false discovery rate method\cite{Benjamini2001}. Alternatively, all adjustment methods supported by \Rfunction{p.adjust} can be used via the \Robject{method} argument. <>= fds <- calculatePadjValues(fds, type="jaccard", method="BY") head(padjVals(fds,type="jaccard")) @ With FRASER 2.0 we introduce the option to limit FDR correction to a subset of genes based on prior knowledge, e.g. genes that contain a rare variant per sample. To use this option, provide a list of genes per sample during FDR computation: <>= genesOfInterest <- list("sample1"=c("XAB2", "PNPLA6", "STXBP2", "ARHGEF18"), "sample2"=c("ARHGEF18", "TRAPPC5")) fds <- calculatePadjValues(fds, type="jaccard", subsets=list("exampleSubset"=genesOfInterest)) head(padjVals(fds, type="jaccard", subsetName="exampleSubset")) @ \subsection{Result visualization} \label{sec:result-vis} Besides the plotting methods \Rfunction{plotVolcano}, \Rfunction{plotExpression}, \Rfunction{plotExpectedVsObservedPsi}, \Rfunction{plotSpliceMetricRank}, \Rfunction{plotFilterExpression} and \Rfunction{plotEncDimSearch} used above, the \fraser{} package provides additional functions to visualize the results: \Rfunction{plotAberrantPerSample} displays the number of aberrant events per sample of the whole cohort based on the given cutoff values and \Rfunction{plotQQ} gives a quantile-quantile plot either for a single junction/splice site or globally. <>= plotAberrantPerSample(fds) # qq-plot for single junction plotQQ(fds, result=res[1]) # global qq-plot (on gene level since aggregate=TRUE) plotQQ(fds, aggregate=TRUE, global=TRUE) @ The \Rfunction{plotManhattan} function can be used to visualize the p-values along with the genomic coordinates of the introns: <>= plotManhattan(fds, sampleID="sample10") plotManhattan(fds, sampleID="sample10", chr="chr19") @ Finally, when one has access to the bam files from which the split and unsplit counts of FRASER were created, the \Rfunction{plotBamCoverage} and \Rfunction{plotBamCoverageFromResultTable} functions use the \Rpackage{SGSeq} package to allow visualizing the read coverage in the bam file a certain intron from the results table or within a given genomic region as a sashimi plot: <>= ### plot coverage from bam file for a certain genomic region fds <- createTestFraserSettings() vizRange <- GRanges(seqnames="chr19", IRanges(start=7587496, end=7598895), strand="+") plotBamCoverage(fds, gr=vizRange, sampleID="sample3", control_samples="sample2", min_junction_count=5, curvature_splicegraph=1, curvature_coverage=1, mar=c(1, 7, 0.1, 3)) ### plot coverage from bam file for a row in the result table fds <- createTestFraserDataSet() # load gene annotation require(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene require(org.Hs.eg.db) orgDb <- org.Hs.eg.db # get results table res <- results(fds, padjCutoff=NA, deltaPsiCutoff=NA) res_dt <- as.data.table(res) res_dt <- res_dt[sampleID == "sample2",] # plot full range of gene highlighting the outlier intron plotBamCoverageFromResultTable(fds, result=res_dt[1,], show_full_gene=TRUE, txdb=txdb, orgDb=orgDb, control_samples="sample3") # plot only certain range around the outlier intron plotBamCoverageFromResultTable(fds, result=res_dt[1,], show_full_gene=FALSE, control_samples="sample3", curvature_splicegraph=0.5, txdb=txdb, curvature_coverage=0.5, right_extension=5000, left_extension=5000, splicegraph_labels="id") @ \bibliography{bibliography} \section{Session Info} Here is the output of \Rfunction{sessionInfo()} on the system on which this document was compiled: <>= sessionInfo() @ \end{document}