---
title: "EventPointer: An effective identification of alternative splicing events using junction arrays and RNA-Seq data"
author: "Juan P. Romero, Juan A. Ferrer-Bonsoms, Cesar Lobato & Angel Rubio"
date: "`r BiocStyle::doc_date()`"
abstract: >
  *EventPointer* is an R package to identify alternative splicing events that involve either simple (case-control experiment) or complex experimental designs such as time course experiments and studies including paired-samples. The algorithm can be used to analyze data from either junction arrays (Affymetrix Arrays) or sequencing data (RNA-Seq). 
  
  
  The algorithm can generate a series of files to visualize the detected alternative splicing events in IGV. This eases the interpretation of results and the design of primers for standard PCR validation.
  The software returns a data.frame with the detected alternative splicing events: gene name, type of event (cassette, alternative 3',...,etc), genomic position, statistical significance and increment of the percent spliced in ($\Delta \Psi$) for all the events.
vignette: >
  %\VignetteEncoding{UTF-8}
  %\VignetteIndexEntry{EventPointer}
  %\VignettePackage{EventPointer}
  %\VignetteEngine{knitr::rmarkdown}
output:
  BiocStyle::html_document:
    toc: true
    fig_caption: true
editor_options: 
  markdown: 
    wrap: 72
---

```{r LoadFunctions, echo=FALSE, message=FALSE, warning=FALSE, results='hide'}
library(knitr)
opts_chunk$set(error = FALSE)
library(EventPointer)
library(dplyr)
library(kableExtra)
```

```{r style, echo = FALSE, results = 'asis'}
##BiocStyle::markdown()
```

# Installation

EventPointer can be installed from Bioconductor using the BiocManager
package:

```{r, eval=FALSE}

library(BiocManager)

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install("EventPointer")
```

# Introduction

*EventPointer* R package provides users a simplified way to identify,
classify and visualize alternative splicing events. EventPointer can work
with arrays and bulk-RNA-Seq data. For arrays there is a unique pipeline.
For RNA-Seq there are two main pipelines: one able to detect novel events
and the other one relying on annotated events.
The steps required by the algorithm are almost the same for all approaches:

-   *Splicing Graph Creation* The Splicing Graph (SG) is a directed
    graph used to represent the structure of a gene. EventPointer relies
    on SGSeq package to obtain the corresponding SGs for every gene.
    For arrays, the SG is built according to
    the reference transcriptome selected by the user and for RNA-Seq, it
    is created by predicting the features in the .bam files provided by
    the user or by the reference transcriptome.

-   *Event Detection* Once the graphs are created, EventPointer
    analyzes each SG in order to find the alternative splicing events.
    The definition of splicing events by EventPointer consists in a
    triplet of subgraphs (P1,P2 and PR) i.e. for a cassette exon, PR
    correspond to the flanking exons, P1 the junctions and exon, and P2
    the junction that skips the exon. This is depicted in Figure 1.

![**Figure 1.** Definition of alternative splicing events for EventPointer](Figure2.png)

-   *Statistical Analysis* With all the detected events, EventPointer
    performs a statistical analysis to obtain the statistical
    significance of every splicing event:
      - for RNA-Seq EventPointer use bootstrap approeach to perform
      the statistical test.
      - for arrays, EventPointer considers there is a differential splicing
      event if the isoforms in the associated paths change
      their expression in opposite directions.
      Different and specific statistical tests can be applied.

-   *Visualization* To ease the interpretation of the splicing events,
    EventPointer generates .gtf files that can be loaded into
    Integrative Genomcis Viewer (IGV). The visualization allows
    researchers to design primers to validated the detected events using
    standard PCR.

Figure 2 shows each of the main steps in a graphical layout.

EventPointer was originally developed for array-based analysis and has
since been adapted for bulk RNA-Seq. This vignette is structured into
two main sections: RNA-Seq analysis and array analysis. The RNA-Seq
section is divided into two parts:

-   **EventPointerBAM**: Alignment-based analysis, which detects novel events specific to the
    samples (novel version).

-   **EventPonterST**: Analysis of events annotated in a reference transcriptome.

To cite *EventPointer*:

For EventPonterST:

-   Ferrer-Bonsoms JA, Gimeno M, Olaverri D, Sacristan P, Lobato C,
    Castilla C, Carazo F, Rubio A (2022). "EventPointer 3.0: flexible
    and accurate splicing analysis that includes studying the
    differential usage of protein-domains." *NAR Genomics &
    Bioinformatics*, **4**(3), lqac067.
    [doi:10.1093/nargab/lqac067](https://doi.org/10.1093/nargab/lqac067).


For EventPointer for arryas:

-   Romero, Juan P., et al. "EventPointer: an effective identification
    of alternative splicing events using junction arrays." BMC genomics
    17.1 (2016): 467.
    [doi:10.1186/s12864-016-2816-x](http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2816-x)

-   Romero, Juan P., et al. "Comparison of RNA-seq and microarray
    platforms for splice event detection using a cross-platform
    algorithm." BMC genomics 19.1 (2018): 703.
    [doi:10.1186/s12864-018-5082-2](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-5082-2)

![**Figure 2.** Overview of *EventPointer*](Romero_Fig1.png)

# RNA-Seq analysis

EventPointer has two pipelines for RNA-Seq analysis:

-   **EventPointerBAM (EP_BAM)**: The detection, classification and
    quantification of splicing events is carried out using the **BAM**
    **files generated from RNA-seq sequence alignment**. It should be
    noted that the **alignment must be performed using an aligner that**
    **takes splicing** **into account (recommended tool: STAR)**. The
    main advantage of this methodology is its ability to **detect de**
    **novo** **events** based on reads alignment.

-   **EventPointerST (EP_ST)**: The detection and classification of
    events are determined by the **corresponding transcriptome annotation** 
    in GTF format. In this case, the quantification files
    of transcripts come from pseudoaligners (examples: salmon or
    kallisto).

As mentioned above, EventPointer consists of several core steps:
(i) building the splicing graph, (ii) detecting and classifying events,
(iii) quantification, and (iv) performing statistical analysis. Finally, users
can visualize the results with IGV by files generated by EventPointer.

The Figure 3 shows the pipeline of EventPointerBAM and EventPointerST:

![Figure 3 (A) Worflow and advantages and disadvantages of
EventPointerBAM. (B) Worflow and advantages and disadvantages of
EventPointerST.](EPST_EPBAM.png)


## EventPointerBAM

The aim of `EventPointerBAM` is to perform detection, classification,
quantification and differential analysis between conditions, using as
input data sequence alignments of RNA-seq data performed using aligners
sensitive to junction reads.

There are two requirements for the BAM files in order
to get `EventPointerBAM` working correctly:

1.  The BAM files should include the **XS-flag**, the flag can be
    included in the files when running the alignment. Most of the
    available software has parameters to incude the flag. For example,
    in the case of [STAR](https://github.com/alexdobin/STAR) the flag
    *--outSAMattributes XS* must be included when mapping.

2.  All files to be analyzed should have the corresponding index files
    **(.bai)** in the same directory as the BAM files. Create the index
    before running EventPointer.
    

To demonstrate the use of EventPointer, we utilise synthetically generated
FASTQ files containing genes with predefined alternative splicing events,
including alternative 3' sites, alternative 5' sites, and cassette exons.
These FASTQ files are first processed with the STAR aligner to generate
the BAM files required as input for EventPointer.
In order to reduce computation time in this example, the data have
been limited to genes from chromosome 3 with positive strand (strand +).


### Events Detection

The function **EventsDetection_BAM()** is responsible for the detection
of events from the input alignment files, their classification and the
calculation of the associated PSI. The inputs for the function are
described below:

*Bams and GTF file parameters*

-   **PathSamplesAbundance**: Path to bam and bai files or path to
    folder with bam and bai files.

-   **PathTranscriptomeGTF**: Path to file containing the regions to be
    analysed from the bam files in GTF format.

*Detection parameters*

-   **region**: Numerical vector indicating the index of positions (at
    chromosomal level) to be analysed from the GTF. Default value is
    NULL so that all regions are analysed. Note that the index position
    depends on the bam file. In order to see which is the specific index
    of the region of interest we can use the `Rsamtools` library:
    
```{r}
path_bam_files = system.file("extdata/bams", package = "EventPointer")
path_file_1 = dir(path_bam_files,pattern = "*.bam$",full.names = TRUE)[1]

file_tmp <- BamFile(path_file_1)
si <- seqinfo(file_tmp)
sl <- rep(seqlevels(si),2)
st <- c(rep(c("+"), length(si)),rep(c("-"), length(si)))
ranges_info <- GRanges(sl, IRanges(1, seqlengths(si)[sl]), st)
```

```{r,echo=FALSE}
ranges_info = as.data.frame(ranges_info)

ranges_info$index = 1:nrow(ranges_info)
ranges_info = ranges_info[,c(6,1:5)]
levels(ranges_info$strand) = c("p","n","o")
# ranges_info
kable(ranges_info) %>%
  kable_paper() %>%
  scroll_box(width = "500px", height = "200px")
```


In this case, as we are interested in the chr3 with + starnd, we want the
region 16. Therefore, we will set the parameter `region` to 16.

-   **min_junction_count**: Minimum number of junctions detected in the
    alignment to be considered in the splicing graph. Default value is 2.  

-   **max_complexity**: Maximum allowed complexity. If a locus exceeds
    this threshold, it is skipped, resulting in a warning. Complexity is
    defined as the maximum number of unique predicted splice junctions
    overlapping a given position. High complexity regions are often due
    to spurious read alignments and can slow down processing. Default
    value is 30.

-   **min_n_sample**: Minimum number of samples that a junction must
    have to be considered. Default value is 3.

-   **min_anchor**: Minimum number of aligned bases at one end of an
    exon to consider a junction. Default value is 6.

*PSI calculation parameters*

-   **nboot**: Number of resamples of the quantification of the samples
    to perform bootstrap. Default value is 20.

-   **lambda**:

*Performance parameters*

-   **cores**: Nº of cores used for the function. Default value is 1.

-   **verbose**: Logical indicating whether to show warnings and
    messages. If FALSE, warnings from internal functions (e.g.,
    makeTxDbFromGRanges) will be suppressed. Default is FALSE.

*Results parameters*

-   **PathSGResult**: Path where to save the following 4 files:

    -   **SgFC.RData**: Contains the splicing graph in RData format.
    -   **TotalEventsFound.csv**: General data for total events detected
        in csv format.
    -   **EventsDetection_EPBAM.RData**: Raw data per event, paths of
        splicing graph and counts in RData format.
    -   **PSI_boots.RData**: $\Psi$ per event and sample in RData
        format.

The following code chunk describe how to use this function:

```{r, eval=FALSE, warning=FALSE, collapse=TRUE}

PathSamplesAbundance <- system.file("extdata/bams", package = "EventPointer")
PathTranscriptomeGTF <- list.files(PathSamplesAbundance,"*.gtf",full.names = T)

PathSGResult <- tempdir()

EventsDetection_BAM(PathSamplesAbundance, 
                    PathTranscriptomeGTF, 
                    region = 16,
                    min_junction_count = 2, 
                    max_complexity = 30,
                    min_n_sample = NULL,
                    min_anchor = 6,
                    nboot = 20,
                    lambda = NULL,
                    cores = 1,
                    PathSGResult = PathSGResult)

```

The 4 files created by this function are:

-   **SgFC.RData**: Contains the needed information to create the splicing graph.
-   **TotalEventsFound.csv**: General data for total events detected in
    csv format.
-   **EventsDetection_EPBAM.RData**: Raw data per event, paths of
    splicing graph and counts in RData format.
-   **PSI_boots.RData**: $\Psi$ per event and sample in RData format.

A further explanation of these files are depicted in the following
subsecction "step-by-step"

#### step-by-step

As mentioned above, the function `EventsDetection_BAM` unifies  three main steps.
As each task has a long duration and it may be of interest
to a user to perform any of these
tasks separately, the following section explains how to execute this
function step by step.

**get Information for the Splicing graph**

The first step of `EventsDetection_BAM` is to get information to build the splicing graph.
For this, the function `PrepareBam_EP` is called. The parameters of this
functions are depicted above.

```{r,eval=FALSE}
PathSamplesAbundance <- system.file("extdata/bams", package = "EventPointer")
PathTranscriptomeGTF <- list.files(PathSamplesAbundance,"*.gtf",full.names = T)
PathSGResult <- tempdir()

SgFC <- PrepareBam_EP(PathSamplesAbundance = PathSamplesAbundance,
                        PathTranscriptomeGTF = PathTranscriptomeGTF,
                        region = 16,
                        min_junction_count = 2,
                        max_complexity = 30,
                        min_n_sample = NULL,
                        min_anchor = 6,
                        cores = 1,
                        PathSGResult = PathSGResult,
                        verbose = FALSE)
```

This functions is the most demading regarding time computing. The SgFC object can be laoded from the EventPointer package:
<br><br>
```{r,eval=TRUE}
load(system.file("extdata/SgFC.RData", package = "EventPointer"))
SgFC
```
<br>
The SgFC object is a SGFeatureCounts object. for more information check
[SGSeq
Vignette](https://bioconductor.org/packages/release/bioc/vignettes/SGSeq/inst/doc/SGSeq.html).
Briefly the SGFeaturesCounts contains a GRanges object with all the
required elements to create the different splicing graphs found in the
given samples. It also contains the number of counts associated with
each element of the splicing graph.

**Event Detection**

The second step of `EventsDetection_BAM` takes the SgFC object to create
the splicing graph and then detect and classify splicing events.
This is done by calling the function `EventDetection`.
This function only requieres two parameters: the SgFC object, and the
number of cores:

```{r,eval=FALSE}
EventsDetection_EPBAM<-EventDetection(SgFC, cores=1)
```

```{r,echo=FALSE}
load(system.file("extdata/EventsDetection_EPBAM.RData", package = "EventPointer"))
```


The function retunes a list called EventsDetection_EPBAM that
corresponds to the `EventsDetection_EPBAM.RData` generated by
`EventsDetection_BAM` function. 
<br>
The output is a list vector with
one element for each gene. Each gene (element of the list) is itself a
list with a length equal to the number of events in that gene. Finally,
each event is also a list containing the event information. In order to
see the values of this object, the follownig code chunk show an example
of how to see the information of the first event of the first gene.

```{r,eval=FALSE}
index_gene <- 1
index_event <- 1
length(EventsDetection_EPBAM) #number of genes
length(EventsDetection_EPBAM[[index_gene]]) #number of events in the corresponding gene
names(EventsDetection_EPBAM[[index_gene]][[index_event]]) #the different information stored in the object.
```
<br><br>
In order to replicate the `.csv` file stored by the
`EventsDetection_BAM` function, you can use the following code chunk:

```{r,eval=FALSE}
event_list <- list()
index <- 1
for (geneEvent in EventsDetection_EPBAM) {
  for (event in geneEvent) {
    event_list[[index]] <- event$Info
    index <- index + 1
  }
}
totalEventTable <- do.call(rbind, event_list)
write.csv(totalEventTable, file = paste0(PathSGResult, "/TotalEventsFound.csv"), row.names = FALSE)

```
<br><br>
**Get PSI**

Finally, the `EventsDetection_BAM` function compute the PSI values by
calling the function `getPSI_RNASeq_boot` which needs the following
parameters: lambda, and nboot (depicted above).

```{r,eval=FALSE}
cores <- 1
lambda <- NULL
nboot <- 20
PSI_boots <- getPSI_RNASeq_boot(EventsDetection_EPBAM, lambda = lambda,
                                cores = cores, nboot = nboot)
```


PSI_boots is an array of number_of_events x (number_of_boots+1) x number_of_samples.
In the second dimension, we have the PSI values corresponding to the maximum likelihood estimation of the expression and the PSI corresponding to the nboot bootstraps.
<br>

### Statistical Analysis

The main function to perform the statistical analysis is `EventPointerStats_BAM`.

```{r}
data(PSI_boots)
Design <- cbind(rep(1, 9), rep(c(1, 0, 0), 3), rep(c(0, 1, 0), 3))
Contrast <- cbind(c(0, 1, 0), c(0, 0, 1))

PathSGResult <- tempdir()
result = EventPointerStats_BAM(PSI_boots,
                               Design,
                               Contrast,
                               Threshold = 0,
                               nbootstraps = 1000,
                               cores = 1,
                               ram = 0.1,
                               pathResult = PathSGResult)
```

`EventPointerStats_BAM` returns a list with the reulsts of the statistical analysis.
The list contains the summary of the Bootstrap analysis: deltaPSI, Pvalues, iqr_events,
median_events, and LocalFDR. The first for are data.frame with the corresponding information
for each event (row) and contrast. LocalFDR is a list of length equal to the number 
of contrast. Each elemnt has information of the local flase dicovery rate.
Brielfy:

- **deltaPSI**: the increment of PSI of each event in each contrast.
- **Pvalues**: p.values obteined from the bootstrap test.
- **iqr_events**: 
- **median_events**: 
- **LocalFDR**: multiple hypothesis correction information.

`EventPointerStats_BAM` also return files with the results of the statistical analysis
for each contrast in a .csv file.


### IGV visualization

Alternative splicing events detected by EventPointerBAM can be visualized using the `EventPointerBAM_IGV()` function as shown in Figure 4. This function displays the paths that make up the splicing graph of each event in a gtf format that can be viewed using the IGV program. Each event is composed of 3 paths: path1, path2, and Reference, which will be indicated along with the event id for identification as "path1\_", "path2\_", "Ref_1", and "Ref_2".

To create the GTF files, the algorithm uses the `EventPointerBAM_IGV()` function that
has the following parameters:

*Events and reference transcriptome parameters*

-  **SG_RNASeq**: Data.frame generated by `EventsDetection_BAM` with the events to be included in the GTF file.
-   **EventsTxt**: Reference transcriptome. Must be one of: "Ensembl", "UCSC", "AffyGTF" or       "CustomGTF".

*Results parameters*

-  **PathGTF**: Folder where to save the generated GTF file. 


```{r, eval=FALSE, collapse=TRUE}

EventsTxt<-paste(system.file("extdata",package="EventPointer"),"/TotalEventsFound.csv",sep="")
load(paste(system.file("extdata",package="EventPointer"),"/SgFC.RData",sep=""))
SG_RNASeq <- SgFC
PathGTF <- tempdir()

EventPointerBAM_IGV(SG_RNASeq = SgFC,
                    EventsCSV =  EventsTxt,
                    PathGTF =  "./")
```

The GTF file can be loaded in IGV and will display as following:


![**Figure 4.** EventPointer visualization using IGV.](4_IGV.png)


### summary

This approach of EventPointer can be run with tow main functions: `EventsDetection_BAM()` and `EventPointerStats_BAM()`. The former functin can be run step by step by using the function: 
`PrepareBam_EP`, `EventDetection`, and `getPSI_RNASeq_boot`.
<br>
When running both `EventsDetection_BAM()` and `EventPointerStats_BAM()`, the resulting files at the defined output folder will be those shown in Figure 5. 

![Figure 5. Output EventPointerBAM functions.](3_ficheros_EPBAM.drawio.png)

Finally, results can be visualized using the function `EventPointerBAM_IGV`.


## EventPointerST

In this pipeline alternative splicng events are detected from a
**reference transcriptome without finding novel events**. 

The events quantification relies on
isoform expression estimate from pseudo-alignment process such as
*Kallisto* or *Salmon*. Besides, we provide a function to leverage the
bootstrap data from *kallisto* or *salmon*. Preveous statistical
analysis have also been adapted to this data. Further, Primers design
for PCR validation and protein domain enrichment analysis can be
performed. Figure 6 shows an overview of this branch of EventPointer.

![**Figure 6.** Analysis of RNA-Seq data with kallisto output](Figure7_2.png)

### Event Detection

We use `EventDetection_transcriptome` to identify and classify
alternative splicing events of a given reference transcriptome. The
required parameters are:

-   **inputFile** = inputFile should point to the GTF file to be used.
-   **Pathtxt** = Directory to save the .txt file of the events found.
-   **cores** = Number of cores using in the parallel processing (by
    default = 1).

Following code shows how to apply `EventDetection_transcriptome`
function using the Gencode 24 transcriptome (GRCH 38) as reference
annotation. In this example we do not use the full reference but only
two genes (ENSG00000185252.17 and ENSG00000254709.7).

```{r GTFfGTF, eval=TRUE, warning=FALSE, collapse=TRUE}


# Set input variables
PathFiles<-system.file("extdata",package="EventPointer")
inputFile <- paste(PathFiles,"/gencode.v24.ann_2genes.gtf",sep="")
Pathtxt <- tempdir()
```

```{r GTFfGTF_2, eval=FALSE, warning=FALSE, collapse=TRUE}
# Run the function
EventXtrans <- EventDetection_transcriptome(inputFile = inputFile,
                                             Pathtxt=Pathtxt,
                                             cores=1)

```

This function takes a couple of hours to complete (depending on the
computer), and creates a file called
**EventsFound.txt**. This file contains the information
of the events. This .txt file will be used as an input in further
functions.

The object returned by the function is a list containing five elements: three sparce
matrices that relate which isoforms build up the paths (path1,path2 and
pathRef) of each event. The fourth element contains the name of the
reference annotation: only appear the name of the transcript. The final
element is SG_List: a list with the information of the graph of each
gene, this variable is necesary for Primers design step.

```{r GTFfGTFnames, eval=FALSE, warning=FALSE, collapse=TRUE}

names(EventXtrans)

```

**NOTE**: as the events are detected based on the reference transcriptome, the function
`EventDetection_transcriptome` can be applied only once for each reference
transcriptome.

### Get expression and PSI ($\Psi$)

Given (I) the output of `EventDetection_transcriptome`, which contains
events information and (II) the result of pseudo-alignment, EventPointer
is able to get the expression of the path of each event and also to
compute the PSI value.

To first step is to load the data from pseudo-alignment. EventPointer
works with both *Kallisto* and *Salmon* methods.

In this example we have applied *kallisto* to the samples ERR315326,
ERR315400, ERR315407 and ERR315494 using the transcriptome reference used before.
These samples have been download from *EMBL-EBI* dataset.

EventPointer offers the option to take advantage of bootstraps that
return both Kallisto and Salmon. In case we want to leverage the
Bootstrap data we have to use the function `getbootstrapdata`:

```{r getbootstrpdata, eval=TRUE, warning=FALSE, collapse=TRUE}

PathSamples<-system.file("extdata",package="EventPointer")
PathSamples <- paste0(PathSamples,"/output")
PathSamples <- dir(PathSamples,full.names = TRUE)

data_exp <- getbootstrapdata(PathSamples = PathSamples,type = "kallisto")

```

`getbootstrapdata` returns a list containing the quantification data
with the bootstrap information. The list is of length equal to the
number of samples. Each element of the list storage a matrix where the
number of rows is equal to the number of transcripts and the number of
columns is equal to the number of bootstrats + 1. (the first column
corresponds to the maximum likelihood expression and the rest to the
bootstrap data).

In case **you don't want to use the bootstrap returned by the**
**pseudo-alignmnet step (or is no available)**, you should load a matrix
with the expression of the transcripts. The following code chunk shown
an example of how to do this:

```{r getdata_nb, eval=TRUE, warning=FALSE, collapse=TRUE}

# this code chunk is an example of how to load the data from kallisto output.
# the expression of the isoforms are counts

PathFiles <- system.file("extdata",package="EventPointer")
filesnames <- dir(paste0(PathFiles,"/output"))
PathFiles <- dir(paste0(PathFiles,"/output"),full.names = TRUE)
dirtoload <- paste0(PathFiles,"/","abundance.tsv")
RNASeq <- read.delim(dirtoload[1],sep = "\t", colClasses = c(NA,"NULL","NULL",NA,"NULL"))
for (n in 2:length(dirtoload)){
  RNASeq[,n+1] <- read.delim(dirtoload[n],sep = '\t', colClasses = c('NULL','NULL','NULL',NA,'NULL'))
}
rownames(RNASeq)<-RNASeq[,1]
RNASeq<-RNASeq[,-1]
colnames(RNASeq) <- filesnames


```

Once we have the expression of the isoforms loaded, the next step is to
compute the PSI value. For both cases (with and withuot bootstrap data
from psudoaligment) EventPointer provides the function
`GetPSI_FromTranRef` for this purpose.

The function `GetPSI_FromTranRef` returns the values of $\Psi$ and the
expression of the paths of each events. This functions requires the
following inputs:

-   **PathsxTranscript** The output of EventDetection_transcriptome.
-   **Samples** A matrix or list containing the expression of the
    samples.
-   **Bootstrap** Boolean variable to indicate if bootstrap data from
    pseudo-alignment is used.
-   **Filter** Boolean variable to indicate if an expression filter is
    applied. Defaul TRUE.
-   **Qn** Quantile used to filter the events (Bounded between 0-1, Q1
    would be 0.25).

This function also requires that the same reference transcriptome have
been used in both the pseudo-alignment and event detections steps.
Besides, in the variable *Samples* the type of annotation used must be
equal to the one used in the variable *EventXtrans*.

**Get psi if NO bootstrap from pseudo-alignmnet is used**

First step is to verify if same annotation is used in *RNASeq* and
*EventXtrans*:

```{r check_ann_nb, eval=TRUE, warning=FALSE, collapse=TRUE}
rownames(RNASeq)[1:5]
EventXtrans$transcritnames[1:5]
```

We need to change the rownames of RNASeq variable before applying
`GetPSI_FromTranRef` function:

```{r PSI_Statistic, eval=TRUE, warning=FALSE, collapse=TRUE}

#change rownames of RNASeq variable
rownames(RNASeq) <- sapply(strsplit(rownames(RNASeq),"\\|"),function(X) return(X[1]))
RNASeq<-as.matrix(RNASeq) #must be a matrix variable


PSIss_nb <- GetPSI_FromTranRef(PathsxTranscript = EventXtrans,
                            Samples = RNASeq,
                            Bootstrap = FALSE,
                            Filter = FALSE)

PSI <- PSIss_nb$PSI
Expression <- PSIss_nb$ExpEvs

```

The output of the function is a list containing two elements: a matrix
with the $\Psi$ values, and a list containing as many matrices as number
of events. In each matrix is stored the expression of the different
paths of the correspondig event along the samples.

**If bootstrap data from pseudo-alignment IS required**

As done above, we need to check if same annotation is used in *data_exp*
and *EventXtrans*. The difference from above is that *data_exp* is a
list and not a matrix as *RNASeq*. Therefore, we are going to used the
rownames of the first list of *data_exp* to check if same annotations is
used:

```{r check_ann, eval=TRUE, warning=FALSE, collapse=TRUE}
rownames(data_exp[[1]])[1:5]
EventXtrans$transcritnames[1:5]

```

We need to change the rownames of the first element of the list
*data_exp* before applying `GetPSI_FromTranRef` function:

```{r psi_with_bootstrap, eval=TRUE, warning=FALSE, collapse=TRUE}

#change rownames of the first element of teh list data_exp
rownames(data_exp[[1]]) <- sapply(strsplit(rownames(data_exp[[1]]),"\\|"),function(X) return(X[1]))

PSIss <- GetPSI_FromTranRef(PathsxTranscript = EventXtrans,
                               Samples = data_exp,
                               Bootstrap = TRUE,
                               Filter = FALSE)

PSI <- PSIss$PSI
Expression <- PSIss$ExpEvs
```

The output of the function is a list containing two elements: an array
with the $\Psi$ values with dimension = c(number of events, number of
bootstrap+1,number of samples), and a list containing as many matrices
as number of events. In each matrix is stored the expression of the
different paths of the corresponding event along the samples.


### Statistical Analysis (Bootstrap Test)

We have implemented a bootstrap test to evaluate the difference in
$\Psi$ among the conditions under study. This test can be done both for
PSI values that contain information from the pseudo-alignment bootstrap
and for cases in which we do not have this information.

The function `EventPointer_Bootstraps` proceed to calculate the increase
in PSI and its p.values by performing bootstrap statistic test. This
function can be parallelized. Users may set the number of Cores (by
default 1) and the ram available (by default 1 Gb).

```{r ep_bootstrp_statistic, eval=TRUE, warning=FALSE, collapse=TRUE}

Dmatrix <- cbind(1,rep(c(0,1),each=2))
Cmatrix <- matrix(c(0,1),nrow=2)

Fit_bots <- EventPointer_Bootstraps(PSI = PSI,
                               Design = Dmatrix,
                               Contrast = Cmatrix,
                               cores = 1,
                               ram = 1,
                               nbootstraps = 10,
                               UsePseudoAligBootstrap = TRUE)

```

You can extract a table of the top-ranked events from the results with
the function *ResulTable*

```{r ResultTable, eval=TRUE, warning=FALSE, collapse=TRUE}

ResulTable(EP_Result = Fit_bots,coef = 1,number = 5)

```


### Statistical Analysis (NO bootstrap)

The statistical analysis of the alternative splicing events is done in
exactly the same way as for junction arrays (section 4).

The function `EventPointer_RNASeq_TranRef` perform this statistical
analysis and requires the following parameters:

-   **Count_Matrix** The list containing the expression data taken from
    the ouput of *GetPSI_FromTranRef*
-   **Statistic** The type of statistic to apply. Default = "LogFC" (can
    be "logFC,"Dif_LogFC","DRS")
-   **Design** The design matrix of the experiment.
-   **Contrast** The Contrast matrix of the experiment.

```{r PSI_Statistic2, eval=TRUE, warning=FALSE, collapse=TRUE}

# Design and contrast matrix:

Design <- matrix(c(1,1,1,1,0,0,1,1),nrow=4)
Contrast <- matrix(c(0,1),nrow=2)

# Statistical analysis:

Fit <- EventPointer_RNASeq_TranRef(Count_Matrix = Expression,Statistic = "LogFC",Design = Design, Contrast = Contrast)

```

The output of this function is a data.frame with the information of the
names of the event, its p.values and the corresponding z.value. If there
is more than one contrast, the function returns as many data.frames as
number of contrast and all these data.frame are sotred in an unique
list.

```{r EP_TranRef_Res_Table, echo=FALSE}
kable(Fit,digits=5,row.names=FALSE,align="c",caption = "Table 3: PSI_Statistic results")
```


### IGV visualization

EventPointer creates one GTF file that can be loaded into IGV to
visualize the alternative splicing events. Figure XX displays an example
result showed in IGV (5th ranked event in Table 2). Also, in the figure
a reference transcriptome is displayed (blue track).

To create the GTF files, the algorithm uses the
`EventPointer_RNASeq_TranRef_IGV` functions with the following
parameters:

-   **SG_List**: List with the Splicing Graph information of the events.
    This list is created by `EventDetection_transcriptome` function.
-   **pathtoeventstable**: Complete path to the table returned by
    `EventDetection_transcriptome` that contains the information of each
    event, or table with specific events that the user want to load into
    IGV to visualize.
-   **PathGTF**: Directory where to write the GTF files.

As `EventDetection_transcriptome` function returns the splicing graph
information, this function does not need to create the splicing graph.

```{r ep_tranref_igv, eval=TRUE, warning=FALSE, collapse=TRUE}

SG_List <- EventXtrans$SG_List
PathEventsTxt<-system.file('extdata',package='EventPointer')
PathEventsTxt <- paste0(PathEventsTxt,"/EventsFound_Gencode24_2genes.txt")
PathGTF <- tempdir()

EventPointer_RNASeq_TranRef_IGV(SG_List = SG_List,pathtoeventstable = PathEventsTxt,PathGTF = PathGTF)

```

### Domain Enrichment

The main idea of protein domain analysis is to analyze whether the
presence of a protein domain increases or decreases in the condition
under study. To do this, we have implemented the function
`Protein_Domain_Enrichment`. This function requires the following
parameters:

-   **PathsxTranscript**: the output of EventDetection_transcriptome.
-   **TxD**: matrix that relates transcripts with Protein domain. Users
    can get it from BioMart.
-   **Diff_PSI**: matrix with the difference of psi of the condition
    under study.
-   **method**: a character string indicating which correlation
    coefficient is to be calculated. "spearman" (default) or "pearson"
    can be selected.

As mentioned, it is first necessary to know the structure of protein
domains within the transcriptome or in other words: to relate the
isoforms to protein domains. This can be done using `bioMart` R
package. The input needed by EP to perform this analysis is a matrix
that relates transcripts to protein domains (TxD matrix). Therefore, you
can work either with Pfam, Interpro or superfamily annotation.

The following conde chunck shows an example of how to get this TxD
matrix:

```{r biomart, eval=FALSE, warning=FALSE, collapse=TRUE}

library(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "mar2016.archive.ensembl.org")
mart<-useDataset("hsapiens_gene_ensembl",mart)

mistranscritos <- EventXtrans$transcritnames
head(mistranscritos)

## we need to remove the ".x":
mistranscritos <- gsub("\\..*","",mistranscritos)

Dominios <- getBM(attributes = c("ensembl_transcript_id","interpro","interpro_description"),
                  filters = "ensembl_transcript_id",
                  values = mistranscritos,
                  mart=mart)

#we build the isoform x protein domain matrix
library(Matrix)
ii <- match(Dominios$ensembl_transcript_id,mistranscritos)
misDominios <- unique(Dominios[,2])
jj <- match(Dominios[,2],misDominios)

TxD <- sparseMatrix(i=ii,j = jj,dims = c(length(mistranscritos),length(misDominios)),x = 1)
rownames(TxD) <- mistranscritos
colnames(TxD) <- misDominios

```

The above code chunk build the required Transcript X Domain (TxD) sparse
matrix where $TxD_{ij}$ = 1 means that Domain $j$ match to transcript
$i$.

The next step is to perform the enrichment analysis with the function
*Protein_Domain_Enrichment* Internally, given the TxD matrix
*Protein_Domain_Enrichment* relates protein domains to events and
performs the enrichment analysis:

**NOTE**: Check that EventXtrans\$transcritnames annotation type match
to the rownames of the TxD matrix.

```{r pd_enrichmetn, eval=FALSE, warning=FALSE, collapse=TRUE}

data("TxD")

#check same annotation for transcripts:
EventXtrans$transcritnames[1]
# [1] "ENST00000611540.4"
rownames(TxD)[1]
# [1] "ENST00000611540"


## as is not the same, we change EventXtrans$transcritnames annotation
transcriptnames <- EventXtrans$transcritnames
transcriptnames <- gsub("\\..*","",transcriptnames)
EventXtrans$transcritnames <- transcriptnames

Result_PDEA <- Protein_Domain_Enrichment(PathsxTranscript = EventXtrans,
                                         TxD = TxD,
                                         Diff_PSI = Fit_bots$deltaPSI)


```

The function returns a list containing the results of the protein domain
enrichment analysis. This list contains 3 matrices in which the rows
indicate the protein domains and the columns the number of contrasts.
The 3 matrices are the following:

-   **mycor**: correlation value between the deltaPSI and the
    DifProtDomain matrix (see more details in vignette)
-   **STATISTIC**: the values of the test statistic
-   **PVAL**: the pvalues of the test statistic

### Primers Design

EventPointer can also be used to design primers for PCR validation. The
aim of *FindPrimers* function is the design of PCR primers and TaqMan
probes for detection and quantification of alternative splicing.
Depending on the assay we want to carry out the the algorithm will
design the primers for a conventional PCR or the primers and TaqMan
probes if we are performing a TaqMan assay. In the case of a
conventional PCR we will be able to detect the alternative splicing
event. Besides, the algorithm gives as an output the length of the PCR
bands that are going to appear. In the case of a TaqMan assay, we will
not only detect but also quantify alternative splicing. <br><br> The
Primers Design step has been developed with *Primer3* software and works
with versions \>= 2.3.6. In order to use this option you need to install
this sofware. You can download from
<a href="https://sourceforge.net/projects/primer3/" target="_blank">sourceforge</a>
(for Windows, Mac OSX or Unix/Linux) or from
<a href="https://github.com/primer3-org/primer3/" target="_blank">github</a>
(for Mac OSX or Unix/Linux). <br><br> To work with Primer Design option
you also need to add *Primer3* to your environment PATH. For Windows you
can add environment variables in: <br><br> Control Panel -\> System and
Security -\> System -\> Advanced system settings -\> Environment
Variables... <br><br> For Unix/Linux or Max you need to add the path to
*Primer3* to the variable PATH using terminal. <br><br> Then, as is
explained below, one of the input variables for the main function is
**Primer3Path** which is a string variable with the complete path where
primer3_core.exe is placed. As we have added this path to the
environment variable we can set down this variable with the following
command as is shown later in the example:

```{r pathprimer3, eval=FALSE, warning=FALSE, collapse=TRUE}

Primer3Path <- Sys.which("primer3_core")

```

The main function of this step is called `Find_Primers` and requires the
following main parameters:

-   **SG**: Information of the graph of the gene where the selected
    event belongs.This information is available in the output of
    EventsGTFfromTranscriptomeGTF function.
-   **EventNum**: The "EventNum" variable can be found in the returned
    .txt file from the EventsGTFfromTranscriptomeGTF function in the
    column "EventNumber" or in the output of
    EventPointer_RNASeq_TranRef, the number after the "\_" character of
    the 'Event_ID'.
-   **Primer3Path**: Complete path where primer3_core.exe is placed.
-   **Dir**: Complete path where primer3web_v4_0_0_default_settings.txt
    file and primer3_config directory are placed.
-   **taqman**: 1 if you want to get probes and primers for TaqMan. 0 if
    you want to get primers for conventional PCR.

Other required parameters that have a default value but can be modified
by user:

-   **nProbes**: Number of probes for TaqMan experiments. By default 1.
-   **nPrimerstwo**: Number of potential exon locations for primers
    using two primers (one forward and one reverse). By default 3.
-   **ncommonForward**: Number of potential exon locations for primers
    using one primer in forward and two in reverse. By default 3.
-   **ncommonReverse**: Number of potential exon locations for primers
    using two primer in forward and one in reverse. By default 3.
-   **nExons**: Number of combinations of ways to place primers in exons
    to interrogate an event after sorting. By default 5.
-   **nPrimers**: Once the exons are selected, number of primers
    combination sequences to search within the whole set of potential
    sequences.By default 5.
-   **shortdistpenalty**: Penalty for short exons following an
    exponential function. By default 2000.
-   **maxLength**: Max length of exons that are between primers and for
    paths once we have calculated the sequence.By default 1000.
-   **minsep**: Distance from which it is penalized primers for being
    too close. By default 100.
-   **wminsep**: Weigh of the penalization to primers for being too
    close. By default 200.
-   **valuethreePenalty**: penalization for cases that need three
    primers instead of 2. By default 1000.
-   **minexonlength**: Minimum length that a exon has to have to be able
    to contain a primer. By default 25.
-   **wnpaths**: Penalty for each existing path. By default 200.
-   **qualityfilter**: Results will show as maximum 3 combinations with
    a punctuation higher than qualityfilter.By default 5000.

The following example shows how to design primers and TaqMan probes for
a specific alternative splicing event:

```{r FindPrimers, eval=FALSE, warning=FALSE, collapse=TRUE}

data("EventXtrans")
#From the output of EventsGTFfromTranscriptomeGTF we take the splicing graph information
SG_list <- EventXtrans$SG_List
#SG_list contains the information of the splicing graphs for each gene
#Let's supone we want to design primers for the event 1 of the gene ENSG00000254709.7 

#We take the splicing graph information of the required gene
SG <- SG_list$ENSG00000254709.7 

#We point the event number
EventNum <- 1

#Define rest of variables:
Primer3Path <- Sys.which("primer3_core")
Dir <- "C:\\PROGRA~2\\primer3\\"


MyPrimers_taqman <- FindPrimers(SG = SG,
                                EventNum = EventNum,
                                Primer3Path = Primer3Path,
                                Dir = Dir,
                                mygenomesequence = BSgenome.Hsapiens.UCSC.hg38::Hsapiens,
                                taqman = 1,
                                nProbes=1,
                                nPrimerstwo=4,
                                ncommonForward=4,
                                ncommonReverse=4,
                                nExons=10, 
                                nPrimers =5,
                                maxLength = 1200)

```

*FindPrimers* return a data.frame. Either for conventional PCR and
taqman option the data.frame has the following columns:

-   **For1Seq**: Sequence of the first forward primer.
-   **For2Seq**: Sequence of the second forward primer in case it is
    needed.
-   **Rev1Seq**: Sequence of the first reverse primer.
-   **Rev2Seq**: Sequence of the second reverse primer in case it is
    needed.
-   **For1Exon**: Name of the exon of the first forward primer.
-   **For2Exon**: Name of the exon of the second forward primer in case
    it is needed.
-   **Rev1Exon**: Name of the exon of the first reverse primer.
-   **Rev2Exon**: Name of the exon of the second reverse primer in case
    it is needed.
-   **FINALvalue**: Final punctuation for that combination of exons and
    sequences. The lower it is this score, the better it is the
    combination.
-   **DistPath1**: Distances of the bands, in base pairs, that
    interrogate Path1 when we perform the conventional PCR experiment.
-   **DistPath2**: Distances of the bands, in base pairs, that
    interrogate Path2 \`when we perform the conventional PCR experiment.
-   **DistNoPath**: Distances of the bands, in base pairs, that they do
    not interrogate any of the two paths when we perform the
    conventional PCR experiment.

```{r EP_DesiCP_Res_Table, eval=TRUE, warning=FALSE, collapse=TRUE,echo=FALSE}
data("MyPrimers")
kable(MyPrimers[1:5,],digits=5,row.names=FALSE,align="c",caption = "Table 4: Data.frame output of FindPrimers for conventional PCR") %>%
  kable_styling() %>%
  scroll_box(width ="660px")

```

For the tacman option, the data.frame contains three additional columns:

-   **SeqProbeRef**: Sequence of the TaqMan probe placed in the
    Reference.
-   **SeqProbeP1**: Sequence of the TaqMan probe placed in the Path1.
-   **SeqProbeP2**: Sequence of the TaqMan probe placed in the Path2.

```{r EP_DesiTP_Res_Table, eval=TRUE, warning=FALSE, collapse=TRUE,echo=FALSE}
data("MyPrimers_taqman")
kable(MyPrimers_taqman[1:5,],digits=5,row.names=FALSE,align="c",caption = "Table 5: Data.frame output of FindPrimers for conventional PCR") %>%
  kable_styling() %>%
  scroll_box(width ="660px")

```

#### IGV primers and probes visualization

We can visualize the result in IGV using the *Find Motif* tool given the
output of *FindPrimers* function.

Forward primers are represented in blue whilst reverse primers are
represented in red as long as the gene is transcribed from the forward
strand. Otherwise the colors we will be swapped. TaqMan probes will
always have the color of the forward primers.

![**Figure 7.** Transcriptome, paths and primers visualization using
IGV.](Figure11.png)


# Analysis of junction arrays

## Overview of junction arrays

*EventPointer* is prepared to work with different Affymetrix arrays,
such as: HTA 2.0, Clariom-D, RTA and MTA.

To build the CDF file (to work under the aroma.affymetrix framework),
EventPointer requires a GTF file with the reference transcriptome
information. In case not provided, the algorithm automatically downloads
the required information from different databases such as ENSEMBL or
UCSC. As the probes for the HTA 2.0 array are mapped to the HG19 genome,
the latests versions of the ensembl and ucsc genome, mapped to the hg19
version, will be downloaded. For the other arrays, the following genomes
are used: ClariomD = GRCh38, MTA = mm10 and RTA = rn6.

The required files are:

1.  Exon probes genomic position (Tab separated .txt file)
2.  Junction probes genomic position (Tab separated .txt file)
3.  Reference transcriptome (GTF file)

Files 1 and 2 contain probe information for the array. These files and
the corresponding CDF files can be downloaded from the following URL:
[EventPointer
Dropbox](https://www.dropbox.com/sh/wpwz1jx0l112icw/AAAOU0M2QS0TjqHAhrrdEhp6a?dl=0)

The format of these files is briefly explained in the following
paragraphs:

For the Exon Probes, the file corresponds to a tab separated .txt file
composed of 11 columns that include: Probe Id, X coordinate in the
array, Y coordinate in the array, Transcript Cluster Id, Probeset Id,
Probeset name, Probe sequence, chromosome, start, end and strand.

The Junction probes file is also a tab separated .txt composed of 10
columns: Probe Id, X coordinate in the array, Y coordinate in the array,
Transcript Cluster Id, Probeset Id, Probeset name, Probe sequence,
chromosome and probe alignments.

For the GTF the standard format is used. (For more information see
[GTF](http://mblab.wustl.edu/GTF22.html))

This vignette uses the probes and annotation file for the DONSON gene in
order to run the examples in a short amount of time. To generate the
corresponding CDF file for the whole genome, the files from the Dropbox
link must be used.

[**Note:**]{style="color:red"} It is advisable to work with reference
GTF files, as the probes are annotated to them. However, if other
database is used, the algorithm will only include probes that are mapped
to such database.

## CDF file creation

This step can be skipped if the corresponding CDF file is doownloaded
from the Dropbox link. The available CDF files were generated using the
GTF files for each of the arrays, if users want to generate a CDF for
other databases (Ensembl or UCSC), this step should be used.

The function **CDFfromGTF** generates the CDF file used afterwards in
the aroma.affymetrix pre-processing pipeline. Internally, it calls
flat2cdf function written by Elizabeth Purdom. More information about it
can be seen in the following webpage: [Create CDF from
scratch](http://www.aroma-project.org/howtos/create_CDF_from_scratch/)

The required input for the function is described below:

-   **input** : Reference transcriptome. Available options are :
    "Ensembl","UCSC" , "AffyGTF" and "CustomGTF".
-   **inputFile**: If input is "AffyGTF" or "CustomGTF", inputFile
    should point to the GTF file to be used.
-   **PSR**: Path to the Exon probes txt file
-   **Junc**: Path to the Junction probes txt file
-   **PathCDF**: Directory where the output will be saved
-   **microarray**: Microarray used to create the CDF file. Must be one
    of: "HTA-2_0", "ClariomD", "RTA" or "MTA".

This function takes a couple of hours to complete (depending on the
computer), and creates the following files:

1.  EventsFound.txt : Tab separated file with all the information of all
    the alternative splcing events found.
2.  .flat file : Used to build the corresponding CDF file.
3.  .CDF file: Output required for the aroma.affymetrix preprocessing
    pipeline.

The following code chunks show examples on how to run the function using
part of the Affymetrix GTF and the example data included in the package.
This data corresponds to the Affymetrix HTA 2.0 GTF representing only
the DONSON gene and the probes that are mapped to it.

Using Affymetrix GTF as reference transcriptome

```{r CDFGTF, eval=FALSE, warning=FALSE, collapse=TRUE}

# Set input variables
   PathFiles<-system.file("extdata",package="EventPointer")
   DONSON_GTF<-paste(PathFiles,"/DONSON.gtf",sep="")
   PSRProbes<-paste(PathFiles,"/PSR_Probes.txt",sep="")
   JunctionProbes<-paste(PathFiles,"/Junction_Probes.txt",sep="")
   Directory<-tempdir()
   array<-"HTA-2_0"
   
# Run the function

   CDFfromGTF(input="AffyGTF",inputFile=DONSON_GTF,
              PSR=PSRProbes,Junc=JunctionProbes,
              PathCDF=Directory,microarray=array)

```

[**Note:**]{style="color:red"} Both the .flat and .CDF file take large
amounts of space in the hard drive, it is recommended that the directory
has at least 1.5 GB of free space.

Figure 3 shows a screen shot with the output of the CDFfromGTF function

![**Figure 9.** Output of CDFfromGTF](Figure3.png)

Once the file is created, the name of the cdf file can be changed.
Internally, the algorithm gives the name as HTA-2_0, but the actual name
of the file can be different. In the rest of the vignette, we have
renamed our CDF file as EP_HTA-2_0.

Once the CDF file has been created, it can be used for the analysis of
different experiments.

## Statistical Analysis

### aroma.affymetrix pre-processing pipeline

For microarray data, a pre-processing pipeline must be applied to the
.cel files of the experiment. Usually this pre-processing is done using
the aroma.affymetrix R package. This procedure normalizes and summarizes
the expression of the different probesets into single values.

The aroma.affymetrix R package provides users different functions to
work with affymetrix arrays. The functions are used to extract all the
information contained in the .cel files and to do all the required
pre-processing steps such as background correction, normalization and
summarization. The package requires a certain folder structure in order
to work correctly. For more information about aroma.affymetrix visit the
webpage:[aroma project](http://www.aroma-project.org/)

The following code chunk displays the pipeline used to get the results
required by EventPointer after the pre-processing using
aroma.affymetrix. The code is not intended to be a runable example, but
just to show users the settings and functions that can be used. In order
for users to have a runable example, the corrrespoding folder structure
and files are required.

```{r aroma, eval=FALSE}

# Simple example of Aroma.Affymetrix Preprocessing Pipeline

verbose <- Arguments$getVerbose(-8);
timestampOn(verbose);
projectName <- "Experiment"
cdfGFile <- "EP_HTA-2_0,r"
cdfG <- AffymetrixCdfFile$byChipType(cdfGFile)
cs <- AffymetrixCelSet$byName(projectName, cdf=cdfG)
bc <- NormExpBackgroundCorrection(cs, method="mle", tag=c("*","r11"));
csBC <- process(bc,verbose=verbose,ram=20);
qn <- QuantileNormalization(csBC, typesToUpdate="pm");
csN <- process(qn,verbose=verbose,ram=20);
plmEx <- ExonRmaPlm(csN, mergeGroups=FALSE)
fit(plmEx, verbose=verbose)
cesEx <- getChipEffectSet(plmEx)
ExFit <- extractDataFrame(cesEx, addNames = TRUE)
```

### EventPointer function

EventPointer is the main function of the algorithm

The function requires the following parameters:

-   **Design** : The design matrix for the experiment.
-   **Contrast**: The contrast matrix for the experiment.
-   **ExFit**: [aroma.affymetrix] pre-processed variable after using
    extractDataFrame(affy, addNames=TRUE)
-   **Eventstxt**: Path to the EventsFound.txt file generated by
    CDFfromGTF function.
-   **Filter**: Boolean variable to indicate if an expression filter is
    applied.
-   **Qn**: Quantile used to filter the events (Bounded between 0-1, Q1
    would be 0.25).
-   **Statistic**: Statistical test to identify differential splicing
    events, must be one of : "LogFC","Dif_LogFC" and "DRS".
-   **PSI**: Boolean variable to indicate if $\Delta \Psi$ should be
    calculated for every splicing event.

If the `Filter` variable is `TRUE`, the following is performed:

The summarized expression value of all the reference paths is obtained
and the threshold is set depending on the `Qn` value used.

An event is considered if at least one sample in all paths is expressed
above the threshold.

The rest of the events are not shown unless the `Filter` variable is set
to `FALSE`

```{r EP_arrays, eval=TRUE}

   data(ArraysData)

   Dmatrix<-matrix(c(1,1,1,1,0,0,1,1),nrow=4,ncol=2,byrow=FALSE)
   Cmatrix<-t(t(c(0,1)))
   EventsFound<-paste(system.file("extdata",package="EventPointer"),"/EventsFound.txt",sep="")
   
   Events<-EventPointer(Design=Dmatrix,
                      Contrast=Cmatrix,
                      ExFit=ArraysData,
                      Eventstxt=EventsFound,
                      Filter=FALSE,
                      Qn=0.25,
                      Statistic="LogFC",
                      PSI=TRUE)
```

Table 1 displays the output of EventPointer function

```{r EP_Arrays_Res_Table, echo=FALSE}
kable(Events[1:5,],digits=5,row.names=TRUE,align="c",caption = "Table 1: EventPointer Arrays results")
```

The columns of the `data.frame` are:

-   **Gene name** : Gene identifier
-   **Event Type**: Type of alternative splicing event (Cassette,
    Alternative 3',Alternative 5', Alternative Last Exon, Alternative
    First Exon, Mutually Exclusive Exons or Complex Event)
-   **Genomic Position**: Genomic Coordinates for the event
-   **Splicing Z Value**: Corresponding Z value for the statistical test
    performed
-   **Splicing P Value**: Corresponding P-value for the statistical test
    performed
-   **Delta PSI**: $\Delta \Psi$ parameter for each event

## IGV visualization

EventPointer creates two different GTF files to visualize the
alternative splicing events. Figure 4 displays the cassette exon for the
COPS7A gene (4th ranked event in Table 1). In the IGV visualization, the
reference path is shown in gray color, the path 1 in red and path 2 in
green. Below the transcripts, the different probes that are measuring
each of the paths are represented in the same colors.

![**Figure 10.** EventPointer visualization using IGV](Figure5.png)

To create the GTF files, the algorithm uses the EventPointer_IGV
functions with the following parameters:

-   **Events** : Data.frame generated by EventPointer with the events to
    be included in the GTF file.
-   **input**: Reference transcriprome. Must be one of: "Ensembl",
    "UCSC" , "AffyGTF" or "CustomGTF".
-   **inputFile**: If input is "AffyGTF" or "CustomGTF", inputFile
    should point to the GTF file to be used.
-   **PSR**: Path to the Exon probes txt file.
-   **Junc**: Path to the Junction probes txt file.
-   **PathGTF**: Directory where to write the GTF files.
-   **EventsFile**: Path to EventsFound.txt file generated with
    CDFfromGTF function.
-   **microarray**: Microarray used to create the CDF file. Must be one
    of: HTA-2_0, ClariomD, RTA or MTA

The inital process of the function is slow, as the splicing graphs must
be created every time the function is executed. A progress bar is
displayed to the user to inform about the progress of the function.

Once the process is completed two GTF files are generated in the
specified directory:

1.  paths.gtf : GTF file representing the alternative splicing events.
2.  probes.gtf : GTF file representing the probes that measure each
    event and each path.

![**Figure 11.** GTF files generated with EventPointer_IGV](Figure6.png)

```{r Arrays_IGV, eval=FALSE, collapse=TRUE}

# Set Input Variables
  
   DONSON_GTF<-paste(PathFiles,"/DONSON.gtf",sep="")
   PSRProbes<-paste(PathFiles,"/PSR_Probes.txt",sep="")
   JunctionProbes<-paste(PathFiles,"/Junction_Probes.txt",sep="")
   Directory<-tempdir()
   EventsFound<-paste(system.file("extdata",package="EventPointer"),"/EventsFound.txt",sep="")
   array<-"HTA-2_0"

                      
# Generate Visualization files  

 EventPointer_IGV(Events[1,,drop=FALSE],"AffyGTF",DONSON_GTF,PSRProbes,JunctionProbes,Directory,EventsFound,array)


```

<!-- ## Domain Enrichment -->

<!-- The [EventPointer paper](http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-2816-x) states that the algorithm provides a function to identify the protein domains that are affected by alternative splicing using an enrichment analysis. -->

<!-- As we have updated the algorithm, we are currently working on the implementation of this feature in the newer version. As soon as it is implemented, it will be uploaded along with the vignette to show its use. -->

# Multi-Path Events

EventPointer can also identify Multi-Path events. The multi-path events
are composed of more elements than a triplet where the concentration of
the reference path should be equal to the sum of the concentration of
the rest of paths. This is depicted in Figure 8.

![**Figure 12.** Multi-Path alternative splicing events for
EventPointer](Figure10.png)

EventPointer identify Multi-Path events and estimate the percent spliced
in value $\Psi$.

## junctions arrays (CDF file for Multi-Path)

The function *CDFfromGTF_Multipath* generates the CDF file used
afterwards in the aroma.affymetrix preprocesing pipeline. This function
is equivalent to the function *CDFfromGTF*.

The function requires the following parameters:

-   **input** : Reference transcriptome. Available options are :
    "Ensembl","UCSC" , "AffyGTF" and "CustomGTF".
-   **inputFile**: If input is "AffyGTF" or "CustomGTF", inputFile
    should point to the GTF file to be used.
-   **PSR**: Path to the Exon probes txt file
-   **Junc**: Path to the Junction probes txt file
-   **PathCDF**: Directory where the output will be saved
-   **microarray**: Microarray used to create the CDF file. Must be one
    of: "HTA-2_0", "ClariomD", "RTA" or "MTA".
-   **paths**: Maximum number of paths of the events to find.

The function *CDFfromGTF_Multipath* detects events with number of paths
from 2 to variable *paths*. This function classifies the events with two
paths as the *CDFfromGTF* function does and the events with more than
two paths as *"Multi-Path"*.

This function takes a couple of hours to complete (depending on the
computer), and creates the same files as the function *CDFfromGTF*.

The following code chunks show examples on how to run the function using
part of the Affymetrix GTF and the example data included in the package.
his data corresponds to the Affymetrix HTA 2.0 GTF representing only the
DONSON gene and the probes that are mapped to it.

Using Affymetrix GTF as reference transcriptome

```{r CDFGTF_MP, eval=FALSE, warning=FALSE, collapse=TRUE}

# Set input variables
   PathFiles<-system.file("extdata",package="EventPointer")
   DONSON_GTF<-paste(PathFiles,"/DONSON.gtf",sep="")
   PSRProbes<-paste(PathFiles,"/PSR_Probes.txt",sep="")
   JunctionProbes<-paste(PathFiles,"/Junction_Probes.txt",sep="")
   Directory<-tempdir()
   array<-"HTA-2_0"

# Run the function

   CDFfromGTF_Multipath(input="AffyGTF",inputFile=DONSON_GTF,
              PSR=PSRProbes,Junc=JunctionProbes,
              PathCDF=Directory,microarray=array,paths=3)

```

##RNA-Seq (Event detection for Multi-Path)

As for two-paths events, the only required files are the corresponding
.BAM files from the experiment. After the *BAM Preparation* step
explained in *section 4.2* we have all the elements needed to analyze
each of the splicing graphs. To detect Multi-Path events, the function
*EventDetectionMultipath* is used.

This function requires the following parameters:

-   **Input** : Output of the PrepareBam_EP function
-   **cores**: Number of cores used for parallel processing
-   **Path**: Directory where to write the EventsFound.txt file
-   **paths**: Maximum number of paths of the events to find.

The function *EventDetectionMultipath* detects events with number of
paths from 2 to variable *paths*. This function classifies the events
with two paths as the *EventDetection* function does and the events with
more than two paths as *"Multi-Path"*.

```{r EventDetection_MP, eval=FALSE}
  # Run EventDetection function
   data(SG_RNASeq)
   TxtPath<-tempdir()
   AllEvents_RNASeq_MP<-EventDetectionMultipath(SG_RNASeq,cores=1,Path=TxtPath,paths=3)
```

This function retireves a list with all the events found for all the
genes present in the experiment. It also generates a file called
EventsFound_RNASeq.txt with the information for every detected event.

The list is organized in the following way:

```{r ListofLists_MP, eval=FALSE}
Events[[i]][[j]]
```

The list will have as many $i$ values as genes and $j$ values as many
events detected for the $i_{th}$ gene. In other words, the command above
will display the $j_{th}$ event detected for the $i_{th}$ gene.

# Advanced Use

## Events Reclassification

As described above, EventPointer is able to detect splicing events that
cannot be classified as canonical splicing events. EventPointer
classifies these events as "Complex Event". Within this category one can
observe simple events such as 'Multiple Skipping Exons' and events whose
structure is very complex. Thus, EP is classifying in the same category
events that have nothing to do with each other.<br> EventPointer
includes a function called `Events_ReClassification` to reclassify
complex events. The input to this function is:

-   **EventsTable**: a data.frame with the event information. This
    data.frame can be generated by loading the .txt obtained using the
    EventDetection_transcriptome function (see example).
-   **SplicingGraph**: A list with the splicing graph of all the genes
    of a reference transcriptome. a reference transcriptome. This data
    is returned by the function EventDetection_transcriptome.
-   **cores**: Number of cores using in the parallel processing (by
    default = 1)

Thus, EventPointer reclassifies the complex events according to how
similar the event is to the struture canonical events. The same complex
event can have several types (see). Further, EP adds a new type of
event: "multiple skipping exon". These events are characterized by
presenting several exons in a row as alternative exons. I.e., If there
is only one alternative exon we would be talking about a "Casstte
Exon".<br>

The following code shows how this function works for a subset of 5
splicing events:

```{r reclassification, eval=TRUE}
#load splicing graph
data("SG_reclassify")

#load table with info of the events
PathFiles<-system.file("extdata",package="EventPointer")
inputFile <- paste(PathFiles,"/Events_found_class.txt",sep="")
EventTable <- read.delim(file=inputFile)
#this table has the information of 5 complex events.

EventTable_new <- Events_ReClassification(EventTable = EventTable,
                        SplicingGraph = SG_reclassify)
```

The following figures show the structure of these events and their
corresponding gene:

**Example 1**
![**Figure 13.** Structure of example 1](example_1_alt5_ce.png)


## Statistical Tests

EventPointer provides three different statistical tests that can be used
to determine the statistical significance to the alternative splicing
events.

In Table 5, the most relevatn coefficients from the statistical and
events information point of view are shown

![**Table 6.** Coefficients of the linear model used to detect
alternative splicing](Figure9.png)

There are a number of alternatives to detect AS using these
coefficients. Either of $\beta_4$ , $\beta_5$ , $\beta_4$ + $\beta_5$ is
theoretically able to detect AS events. Some of them are more sensitive
than others depending on the relative concentrations of the isoforms.
For example, if isoform 2 is much more highly expressed than isoform 1
in both conditions, $\beta_4$ will be more sensitive than $\beta_4$ +
$\beta_5$ since in the latter case, the numerator and denominator of the
logarithms of both terms are similar, and hence their logs are close to
zero.

A contrast on $\beta_5$ would seem to be more sensitive than a contrast
on $\beta_4$ or $\beta_4$ + $\beta_5$ ; however, in practice, this
contrast proved to be "too sensitive" and led to many false positives
especially in weakly expressed isoforms. If one of the paths is not
expressed in any condition, its signal will be similar in either
condition (the background level) and a change in the expression of the
other isoform will drive to a false positive detection. This contrast
can be used only if the signals are filtered to guarantee that they are
above the background.

In the PCR validation, the contrast that provided the best performance
was the combination of the fold changes of both isoforms (i.e.
$\beta_3$ + $\beta_4$ and $\beta_3$+$\beta_4$ + $\beta_5$ in Table 4)
plus the requirement that the fold-changes have opposite directions,
i.e. if isoform 1 significantly increases its expression, isoform 2 must
significantly decrease its expression and vice versa. Therefore, if this
test requires that the detected AS events show a significant change of
the expression both paths and this change must be in opposite direction.

In order to compute this contrast, we summed up the p-values
(one-tailed) for both contrasts. If the null hypothesis holds, the
expected null distribution is triangular from 0 to 2 with the peak at 1,
and the summation of the p-values must be close to 0 or close to 2 for
genes with differential AS. Using this triangular distribution, it is
possible to assign an overall p-value to their sum. We preferred this
combination rather than the classical Fisher method since in the latter
a single good p-value yields a good summary p-value for the event. Using
this approach, both p-values must be close to zero or one in order to
generate a significant overall p-value.

The different options availabe in EventPointer are:

**1. LogFC** : Compute the contrast using $\beta_3$ + $\beta_4$ and
$\beta_3$+$\beta_4$ + $\beta_5$

**2. Dif_LogFC** : Compute the contrast using $\beta_4$ and $\beta_4$ +
$\beta_5$

**3. DRS**: Compute the constast using $\beta_5$

## Alpha parameter for PrepareBam_EP function

Alpha is a parameter used by SGSeq R package to predict the features
that are along the different bam files that are being analyzed. As
stated in the help menu for the predictTxtFeatures function:

**Alpha** Minimum FPKM required for a splice junction to be included.

The user can change the value to be more or less restrictive when
deciding if a feature is included or not. As the alpha value increases,
the algorithm will slow down as the splicing graphs would became more
complex.

## Percent Spliced In

EventPointer estimates the abundance of the isoforms mapped to each of
the paths, in an splicing event, to obtain the PSI values. With this
values, a simple linear model, using the provided design and contrast
matrices, is solved and this increment is returned to the user in the
data.frame (if PSI argument is TRUE).

It is possible to obtain the estimated PSI values using the internal
functions *getPSI*,*getPSImultipath* for junction arrays,
*getPSI_RNASeq* or *getPSI_RNASeq_MultiPath* for RNA-Seq data (data from
the pipeline described in section *4.2: statistical analysis from de BAM
files*).

These functions not only calculate the value of $\Psi$ but also the
residuals of the simple linear model used to estimate the values of
$\Psi$.

```{r PSI_ADV, eval=TRUE, collapse=TRUE}

# Microarrays (two paths)
data(ArraysData)
PSI_Arrays_list<-EventPointer:::getPSI(ArraysData)
PSI_Arrays <- PSI_Arrays_list$PSI
Residuals_Arrays <- PSI_Arrays_list$Residuals

# Microarrays (Multi-Path)
data(ArrayDatamultipath)
PSI_MP_Arrays_list <- EventPointer:::getPSImultipath(ArrayDatamultipath)
PSI_multipath_Arrays <- PSI_MP_Arrays_list$PSI
Residuals_multipath_Arrays <- PSI_MP_Arrays_list$Residuals

# RNASeq (two paths)
data(AllEvents_RNASeq)
PSI_RNASeq_list<-EventPointer:::getPSI_RNASeq(AllEvents_RNASeq)
PSI_RNASeq <- PSI_RNASeq_list$PSI
Residuals_RNASeq <- PSI_RNASeq_list$Residuals

# RNASeq (Multi-Path)
data(AllEvents_RNASeq_MP)
PSI_MP_RNASeq_list <- EventPointer:::getPSI_RNASeq_MultiPath(AllEvents_RNASeq_MP)
PSI_multipath_RNASeq <- PSI_MP_RNASeq_list$PSI
Residuals_multipath_RNASeq <- PSI_MP_RNASeq_list$Residuals

```

We can apply the function *PSI_Statistic* to the values of $\Psi$ (only
for two-paths events). This function takes as input the values of PSI
and performs a statistical analysis based on permutation test.

```{r PSI_ADV2, eval=TRUE, collapse=TRUE}

Dmatrix<-matrix(c(1,1,1,1,0,0,1,1),nrow=4,ncol=2,byrow=FALSE)
Cmatrix<-t(c(0,1))

table <- PSI_Statistic(PSI = PSI_Arrays,Design = Dmatrix,Contrast = Cmatrix,nboot = 20)

```

The residual obtained in the linear model used to estimate the values of
$\Psi$ must be independent from the Design matrix of the experiment. The
data of the residuals that returns the internal functions *getPSI* or
*getPSI_RNASeq* are useful to validate if this is true. The next code
shows an example of how to perform a statistical analysis of the
residuals.

```{r PSI_ADV3, eval=TRUE, collapse=TRUE}

Dmatrix<-matrix(c(1,1,1,1,0,0,1,1),nrow=4,ncol=2,byrow=FALSE)
Cmatrix<-t(t(c(0,1)))

Ress <- vector("list", length = ncol(Cmatrix))

fitresiduals <- limma::lmFit(Residuals_Arrays,design = Dmatrix)
fitresiduals2 <- limma::contrasts.fit(fitresiduals, Cmatrix)
fitresiduals2 <- limma::eBayes(fitresiduals2)

# repeated if there is more than one contrast
for (jj in 1:ncol(Cmatrix)) {
  TopPSI <- limma::topTable(fitresiduals2, coef = jj, number = Inf)[, 1, drop = FALSE]
  colnames(TopPSI)<-"Residuals"
  Ress[[jj]] <- TopPSI
}


```

# References

-   Romero, Juan P., et al. "EventPointer: an effective identification
    of alternative splicing events using junction arrays." BMC genomics
    17.1 (2016): 467.
-   Romero, Juan P., et al. "Comparison of RNA-seq and microarray
    platforms for splice event detection using a cross-platform
    algorithm." BMC genomics 19.1 (2018): 703.
-   Goldstein, Leonard D., et al. "Prediction and Quantification of
    Splice Events from RNA-Seq Data." PloS one 11.5 (2016): e0156132.
-   Seshagiri, Somasekar, Eric W Stawiski, Steffen Durinck, Zora
    Modrusan, Elaine E Storm, Caitlin B Conboy, Subhra Chaudhuri, et
    al. 2012. "Recurrent R-spondin fusions in colon cancer." Nature 488
    (7413): 660--64. \*Untergasser, A., Cutcutache, I., Koressaar, T.,
    Ye, J., Faircloth, B. C., Remm, M., & Rozen, S. G. (n.d.).
    Primer3-new capabilities and interfaces.
    <https://doi.org/10.1093/nar/gks596>

# Session Information

```{r}
sessionInfo()
```