%\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{SGSeq} %\VignettePackage{SGSeq} \documentclass[10pt]{article} <>= library(knitr) opts_chunk$set(fig.align='center', fig.show='hold') @ <<>= BiocStyle::latex() @ \bioctitle{Splice event detection and quantification from RNA-seq data} \author{Leonard D Goldstein \\ [1em] \small{Department of Bioinformatics and Computational Biology, Genentech Inc.}} \begin{document} \maketitle \section{Background} RNA-seq data can be used for transcript isoform discovery and transcript-level expression studies. Most computational tools for genome-guided transcript isoform analysis (methods that work with RNA-seq data aligned against a reference genome) fall into one of two categories: Methods for the analysis of splice events, which often rely on transcript annotation, and methods for reconstructing and quantifying full-length transcripts. \Rpackage{SGSeq} implements a method for the annotation-free detection and quantification of splice events from RNA-seq data. \section{Overview} \Rpackage{SGSeq} predicts splice junctions and exons from RNA-seq reads aligned against a reference genome. Splice junctions and disjoint exon bins are quantified using counts or FPKMs based on structurally compatible reads. Input data must be in BAM format. It is essential that BAM files are obtained using a splice-aware alignment program that generates the custom tag 'XS' indicating the direction of transcription for spliced reads. Splice junctions and disjoint exon bins are assembled into a genome-wide splice graph \cite{Heber:2002aa}. In the \Rpackage{SGSeq} framework, the splice graph is a directed acyclic graph with nodes corresponding to transcript starts, ends and splice sites, and edges corresponding to disjoint exon bins and splice junctions, directed from $5^\prime$ to the $3^\prime$ end. A splice event is defined by a start node and an end node connected by two or more paths and no intervening nodes with all paths intersecting. \Rpackage{SGSeq} identifies splice events recursively from the graph, and estimates relative usage of splice variants based on compatible reads spanning the event boundaries. \section{Preliminaries} This vignette illustrates an analysis of paired-end RNA-seq data obtained from four colorectal tumors and four normal colorectal samples, which are part of a data set published in \cite{Seshagiri:2012gr}. For the purpose of this vignette, we created BAM files including a single gene of interest (\emph{FBXO31}). <>= library(SGSeq) library(TxDb.Hsapiens.UCSC.hg19.knownGene) @ In the following, we use a \Rclass{data.frame} \Robject{si} with sample information, and a \Rclass{GRanges} object \Robject{gr} with genomic coordinates of the \emph{FBXO31} gene. The \Rclass{data.frame} \Robject{si} contains library information, including paired-end status, median read length, median insert size and the total number of alignments. These were obtained from the original BAM files using function \Rfunction{getBamInfo()}. When analyzing a new data set, library information must be obtained once initially and can then be used for all subsequent analyses. The following code block sets the correct BAM file paths in the sample information for this vignette. <<>>= path <- system.file("extdata", package = "SGSeq") si$file_bam <- file.path(path, "bams", si$file_bam) @ We obtain transcript annotation from the UCSC knownGene table, available as a \Bioconductor{} annotation package \Biocannopkg{TxDb.Hsapiens.UCSC.hg19.knownGene}. We retain transcripts on chromosome 16, where the \emph{FBXO31} gene is located, and change chromosome names in the annotation to match chromosome names in the BAM files. <<>>= txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene txdb <- keepSeqlevels(txdb, "chr16") seqlevelsStyle(txdb) <- "NCBI" @ \Rpackage{SGSeq} makes extensive use of the \Bioconductor{} infrastructure for genomic ranges \cite{Lawrence:2013hi}. To store genomic coordinates for both exons and splice junctions, \Rpackage{SGSeq} implements the \Rclass{TxFeatures} class, which extends the \Rclass{GRanges} class with additional columns. Column \Rcode{type} takes values \Rcode{J} (splice junction), \Rcode{I} (internal exon), \Rcode{F} (first/$5^\prime$ terminal exon), \Rcode{L} (last/$3^\prime$ terminal exon) or \Rcode{U} (unspliced). In addition to \Rclass{TxFeatures}, \Rpackage{SGSeq} implements the \Rclass{SGFeatures} class to store splice graph features. Similar to \Rclass{TxFeatures}, \Rclass{SGFeatures} extends the \Rclass{GRanges} class with additional columns. Column \Rcode{type} for an \Rclass{SGFeatures} object takes values \Rcode{J} (splice junction), \Rcode{E} (disjoint exon bin), \Rcode{D} (splice donor) or \Rcode{A} (splice acceptor). For both \Rclass{TxFeatures} and \Rclass{SGFeatures}, class-specific columns can be accessed using functions named after the columns they access (e.g. use function \Rfunction{type()} to obtain feature type). Transcript features or splice graph features can be exported to BED files using function \Rfunction{exportFeatures()}. To work with annotated transcripts in the \Rpackage{SGSeq} framework, we extract transcript features from the \Rclass{TxDb} object and store them as \Rclass{TxFeatures}. We only retain features overlapping the \emph{FBXO31} gene locus. <<>>= txf_annotated <- convertToTxFeatures(txdb) txf_annotated <- txf_annotated[txf_annotated %over% gr] @ If transcript annotation is not available as a \Rclass{TxDb} object, \Rfunction{convertToTxFeatures()} can construct \Rclass{TxFeatures} from a \Rclass{GRangesList} of exons grouped by transcript (which can be obtained from other formats such as GFF/GTF). \section{Analysis based on annotated transcript features} We first perform an analysis based on annotated transcript features. The following example converts the transcript features into splice graph features and obtains compatible counts for each feature and each sample. <<>>= sgfc <- analyzeFeatures(si, features = txf_annotated) @ \sloppy \Rfunction{analyzeFeatures()} returns an object of class \Rclass{SGFeatureCounts}, which extends the \Rclass{SummarizedExperiment} class from the \Biocpkg{GenomicRanges} package. \Rclass{SGFeatureCounts} contains sample information as \Rcode{colData}, splice graph features as \Rcode{rowRanges} and assays \Rcode{counts} and \Rcode{FPKM}, which store compatible counts and FPKMs for each splice graph feature and sample, respectively. Accessor functions \Rfunction{colData()}, \Rfunction{rowRanges()}, \Rfunction{counts()} and \Rfunction{FPKM()} can be used to access the data. Compatible FPKMs for splice graph features can be visualized with \Rfunction{plotFeatures()}. The plotting function invisibly returns a \Rcode{data.frame} with information on splice graph features, including genomic coordinates. <>= df <- plotFeatures(sgfc, geneID = 1) df @ \section{Analysis based on predicted transcript features} Instead of relying on existing annotation, we can predict transcript features from BAM files directly. <<>>= sgfc <- analyzeFeatures(si, which = gr) @ For interpretability, we annotate predicted features with respect to known transcript features. <<>>= sgfc <- annotate(sgfc, txf_annotated) @ Predicted splice graph features and compatible FPKMs can be visualized as previously. Splice graph features with missing annotation can be highlighted using argument \Rcode{color\_novel}. <>= df <- plotFeatures(sgfc, geneID = 1, color_novel = "red") df @ Note that, in contrast to the previous figure, the predicted gene model does not include parts of the splice graph that are not expressed. Also, an unannotated exon was discovered from the RNA-seq data, which is expressed in three of the four normal colorectal samples. \section{Analysis of splice variants} Instead of considering the complete splice graph of a gene, we can focus our analysis on individual splice events. The following example identifies splice events from the splice graph and obtains representative counts for each splice variant. <<>>= sgvc <- analyzeVariants(sgfc) @ \Rfunction{analyzeVariants()} returns an \Rclass{SGVariantCounts} object. Similar to \Rclass{SGFeatureCounts}, \Rclass{SGVariantCounts} extends the \Rclass{SummarizedExperiment} class. \Rclass{SGVariantCounts} contains sample information as \Rcode{colData} and splice variants as \Rcode{rowRanges}. Assay \Rcode{variantFreq} stores estimates of relative usage for each splice variant and sample. Accessor functions \Rfunction{colData()}, \Rfunction{rowRanges()} and \Rfunction{variantFreq()} can be used to access the data. Each splice variant consists of one or more splice graph features. Information on splice variants is stored as \Rcode{elementMetadata} (or \Rcode{mcols}) in the \Rclass{SGVariants} object and can be accessed as follows. <<>>= mcols(sgvc) @ Splice variants and estimates of relative usage can be visualized with function \Rfunction{plotVariants()}. <>= plotVariants(sgvc, eventID = 1, color_novel = "red") @ \section{Advanced use} Functions \Rfunction{analyzeFeatures()} and \Rfunction{analyzeVariants()} wrap multiple analysis steps for convenience. Alternatively, the functions performing individual steps can be called directly. For example, the previous analysis based on predicted transcript features can be performed as follows. <<>>= txf <- predictTxFeatures(si, gr) sgf <- convertToSGFeatures(txf) sgf <- annotate(sgf, txf_annotated) sgfc <- getSGFeatureCounts(si, sgf) sgv <- findSGVariants(sgf) sgvc <- getSGVariantCounts(sgv, sgfc) @ \sloppy \Rfunction{predictTxFeatures()} and \Rfunction{getSGFeatureCounts()} can be run on individual samples (e.g. for distribution across a high-performance computing cluster). \Rfunction{predictTxFeatures()} predicts features for each sample, merges features across samples and finally performs filtering and processing of predicted terminal exons. When using \Rfunction{predictTxFeatures()} for individual samples, with predictions intended to be merged at a later point in time, run \Rfunction{predictTxFeatures()} with argument \Rcode{min\_overhang = NULL} to suppress processing of terminal exons. Then predictions can subsequently be merged and processed with functions \Rfunction{mergeTxFeatures()} and \Rfunction{processTerminalExons()}, respectively. \section{Performance} \sloppy When performing genome-wide analyses or working with large data sets, parallelization is highly recommended. Functions \Rfunction{getBamInfo()}, \Rfunction{predictTxFeatures()} and \Rfunction{getSGFeatureCounts()} support parallel processing of multiple samples. \Rfunction{predictTxFeatures()} and \Rfunction{getSGFeatureCounts()} also support parallel processing of multiple chromosomes/strands for a given sample. Parallization is controlled with argument \Robject{cores}. Running a single BAM file with $\sim50$ million paired-end reads using 4 cores typically takes $\sim2-3$ hours for prediction and $\sim1-2$ hours for counting. Processing times can be affected by individual genes or genomic regions with many read alignments (e.g. the mitochondrial chromosome). Thus it can be benefical to exclude high coverage regions by filtering BAM files prior to analysis with \Rpackage{SGSeq}. For prediction with \Rfunction{predictTxFeatures()}, genomic regions with high splice complexity likely due to spurious alignments can be skipped automatically to speed up processing. Skipping is controlled with argument \Rcode{max\_complexity}. Identification of splice events is performed on a per-gene basis and can be parallelized using argument \Robject{cores} for \Rfunction{analyzeVariants()} and \Rfunction{findSGVariants()}. \section{Session information} <>= toLatex(sessionInfo()) @ \bibliography{SGSeq} \end{document}