---
title: 'psichomics tutorial: command-line interface (CLI)'
author: "Nuno Saraiva-Agostinho"
date: "22 April 2016"
bibliography: refs.bib
output: 
    rmarkdown::html_vignette:
        toc: true
vignette: >
  %\VignetteIndexEntry{Command-line interface tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

---

psichomics is an interactive R package for integrative analyses of alternative 
splicing using data from [The Cancer Genome Atlas (TCGA)][1] (containing 
molecular data associated with 34 tumour types) and from the
[Genotype-Tissue Expression (GTEx)][7] project (containing data for multiple
normal human tissues). The data leveraged from these projects includes clinical
information and transcriptomic data, such as the quantification of RNA-Seq reads
aligning to splice junctions (henceforth called junction quantification) and
exons.

# Installing and starting the program
Install psichomics by typing the following in an R console (the 
[R environment](https://www.r-project.org/) is required):

```{r install, eval=FALSE}
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("psichomics")
```

After the installation, load psichomics by typing:

```{r load, message=FALSE}
library(psichomics)
```

# Retrieving data
The quantification of each alternative splicing event is based on the proportion
of junction reads that support the inclusion isoform, known as percent 
spliced-in or PSI [@wang2008].

To estimate this value for each splicing event, both alternative splicing
annotation and junction quantification are required. While alternative splicing
annotation is provided by the package, junction quantification may be retrieved
from [TCGA][1].

## Download and load data from TCGA/Firebrowse
Data is downloaded from [Firebrowse][2], a service that hosts proccessed data
from [TCGA][1] as required to run the downstream analyses. Before downloading 
data, check the following available options:

```{r TCGA options}
# Available tumour types
cohorts <- getFirebrowseCohorts()

# Available sample dates
date <- getFirebrowseDates()

# Available data types
dataTypes <- getFirebrowseDataTypes()
```

After deciding on the options to use, download and load data of interest using:
```{r download, eval=FALSE}
# Set download folder
folder <- getDownloadsFolder()

# Download and load most recent junction quantification and clinical data from
# TCGA/Firebrowse for Adrenocortical Carcinoma
data <- loadFirebrowseData(folder=folder,
                         cohort="BRCA",
                         data=c("Clinical", "junction_quantification"),
                         date="2016-01-28")

# Select clinical and junction quantification dataset
clinical <- data[[1]]$`Clinical data`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`
```

Data is only downloaded if the files are not present in the given folder. In 
other words, if the files were already downloaded, the function will just load
the files, so it is possible to reuse the code above just to load the requested
files.

```{r prepare examples, include=FALSE}
clinical <- readRDS("BRCA_clinical.RDS")
```

## Load local files

To load local files instead, indicate the folder of interest. Any files located
in this folder and sub-folders will be attempted to load. To avoid errors during
this process, files of interest should be put in a dedicated folder.

For instance, to load GTEx files, create a directory called **GTEx**, put all
files of interest inside that folder and follow these commands:

```{r load local, eval=FALSE}
folder <- "~/Downloads/GTEx/"
ignore <- c(".aux.", ".mage-tab.")
data <- loadLocalFiles(folder, ignore=ignore)

# Select clinical and junction quantification dataset
clinical <- data[[1]]$`Clinical data`
junctionQuant <- data[[1]]$`Junction quantification (Illumina HiSeq)`
```

# Quantifying alternative splicing
As previously mentioned, alternative splicing is quantified from the previously
loaded junction quantification and an alternative splicing annotation file. To 
check current annotation files available:

```{r quantify options}
# Available alternative splicing annotation
annotList <- listSplicingAnnotations()
annotList
```

Custom alternative splicing annotation obtained from SUPPA, rMATS, VAST-TOOLS or 
MISO can also be provided. Note that SUPPA and rMATS are able to create their
splicing annotation based on transcript annotation. [Read more here][6].

To quantify alternative splicing, first select the junction quantification,
alternative splicing annotation and alternative splicing event type(s) of 
interest:

```{r prepare to quantify splicing, eval=FALSE}
# Load Human (hg19/GRCh37 assembly) annotation
human <- listSplicingAnnotations()[[1]]
annotation <- loadAnnotation(human)
```

```{r event types}
# Available alternative splicing event types (skipped exon, alternative 
# first/last exon, mutually exclusive exons, etc.)
eventType <- getSplicingEventTypes()
eventType
eventType <- "SE"
```

Afterwards, quantify alternative splicing using the previously defined
parameters:

```{r quantify splicing, eval=FALSE}
# Min. reads threshold: number of reads required to quantify a splicing event
minReads <- 10

psi <- quantifySplicing(annotation, junctionQuant, eventType=eventType, 
                        minReads=minReads)
```

```{r load splicing, echo=FALSE}
psi <- readRDS("BRCA_psi.RDS")
```

```{r check splicing events}
# Check the identifier of the splicing events in the resulting table
events <- rownames(psi)
head(events)
```

Note that the event identifier (for instance,
`SE_1_-_2125078_2124414_2124284_2121220_C1orf86`) is composed of:

- Event type (`SE` stands for skipped exon)
- Chromosome (`1`)
- Strand (`-`)
- Relevant coordinates depending on event type (in this case, the first
constitutive exon's end, the alternative exon' start and end and the second
constitutive exon's start)
- Associated gene (`C1orf86`)

> WARNING: all the examples shown below are performed using a small, but
representative subset of the data available.

# Survival analysis

Survival data can be analysed based on clinical attributes, for instance, by 
tumour stage and patient gender, using time to death as the follow-up time and
death as the event of interest.

Around 80% of breast cancers have cells that express estrogen receptors (ER) and 
require estrogen binding to grow. Estrogen receptors may be blocked by tamoxifen
and other drugs. These drugs are thus used as treatment or prevention for
ER-positive breast cancers.

To compare the overall survival of ER-positive patients treated with and without
tamoxifen, the following groups will be created:

* ER-positive tamoxifen-treated patients,
* ER-positive non-tamoxifen-treated patients and
* non-ER-positive tamoxifen-treated patients.

```{r survival groups}
# Get available groups in clinical data
cols <- colnames(clinical)

# Check the name from which to retrieve groups of interest
er_expr <- grep("estrogen_receptor_status", cols, value=TRUE)[1]
er_expr <- createGroupByAttribute(col=er_expr, dataset=clinical)

# Discard values other than "positive" and "negative" for ER expression
er_expr <- er_expr[c("positive", "negative")]

############################################################################
# Now the same for patients treated with tamoxifen. However, TCGA presents #
# data for the administred drugs spread through multiple columns, so...    #
############################################################################

# Look for the appropriate columns with the drugs administred
drug_name <- grep("drug_name", cols, value=TRUE)[1:23]

# Using the previous columns, look for either "tamoxifen" or "tamoxiphen"
tamoxifen <- lapply(drug_name, function(i) grep("tamoxi.*", clinical[ , i]))

# Collect all previous results and create a single list named tamoxifen
tamoxifen <- sort(unique(unlist(tamoxifen)))
tamoxifen <- list("tamoxifen"=tamoxifen)

# Combine all previously created groups
groups <- c(er_expr, tamoxifen)

# Assign each patient to a group
patients <- nrow(clinical)
g <- groupPerPatient(groups[c("tamoxifen", "positive")], patients)

# Append the created groups to the clinical dataset
cl <- cbind(clinical, groups=g)
```

Before continuing, be sure you undestand how to use formulas in R.

```{r help formula, eval=FALSE}
help(formula)
```

To plot the survival curves:

```{r survival by er expression and tamoxifen}
# Create the right-hand side of a formula
formulaStr <- "groups"

# Events are retrieved based on the information available for time to event: if 
# there is info for a patient, the event is considered as occurred
daysToDeath <- "days_to_death"
survTerms  <- processSurvTerms(cl, censoring="right", event=daysToDeath,
                               timeStart=daysToDeath,
                               formulaStr=formulaStr, scale="years")

require("survival")
surv <- survfit(survTerms)
pvalue <- testSurvival(survTerms)

plotSurvivalCurves(surv, pvalue=pvalue)
```

Information regarding number of individuals and events is returned when hovering
over each survival curve in the plot. The plot also allows zooming in by
clicking and dragging and to omit data series by clicking on their name in the 
legend.

## Cox Proportional Hazards model
Calculate Cox proportional hazards model using the previously used parameters by
adding the argument `coxph=TRUE` when processing survival terms.

```{r cox model}
survTermsCox <- processSurvTerms(cl, censoring="right", event=daysToDeath,
                                 timeStart=daysToDeath,
                                 formulaStr=formulaStr, coxph=TRUE)
summary(survTermsCox)
```

# Exploring principal component analysis

Principal component analysis (PCA) is a technique to reduce data dimensionality
by identifying variable combinations (called principal components) that explain
the variance in the data [@Ringner2008gb].

To explore alternative splicing quantification groups by estrogen receptor (ER)
expression:

```{r plot PCA variance}
# Percentage of missing values tolerated by splicing event: tolerated missing
# values are replaced with the median value of that splicing event
naTolerance <- 0

# Center bu do not scale values (they are already scaled between 0 and 1)
center <- TRUE
scale  <- FALSE

# Match patients with samples
samples <- colnames(psi)
match <- getPatientFromSample(samples, clinical)

# Filter alternative splicing quantification to colour values by positive or
# negative ER expression
erGroups <- getMatchingSamples(groups[c("positive", "negative")], samples, 
                               clinical, match=match)
filtered_psi <- psi[ , unlist(erGroups)]

# Perform principal component analysis (transpose alternative splicing
# quantification to have samples as rows)
pca <- performPCA(t(filtered_psi), center=center, scale.=scale,
                  naTolerance=naTolerance)

# Plot the variance explained by each principal component
plotVariance(pca)
```

Now plot the score plot of the PCA coloured according to the ER expression:

```{r plot PCA}
plotPCA(pca, pcX="PC1", pcY="PC2", erGroups)
```

Moreover, let's plot the corresponding loadings plot that displays the variables
(in this case, alternative splicing events):

```{r plot loadings}
plotPCA(pca, pcX="PC1", pcY="PC2", individuals = FALSE, loadings = TRUE)
```

The bubble size of the loadings plot represent the total contribution of each
alternative splicing event to the selected principal components.

Note that the clinical samples from ER-positive individuals (i.e. patients whose
cancer cells express estrogen receptors) seem to separate from samples from
ER-negative individuals along the principal component 1. There is one 
alternative splicing event that may contribute to this separation:
*SE 10 + 79797062 79799962 79799983 79800373 RPS24*.

## Differential splicing analysis

By default, differential splicing analysis is performed by sample types (i.e.
tumour, normal, metastasis, etc) for multiple parametric and non-parametric
statistical tests. Let's analyse the previous splicing event found:

```{r diff splicing 1}
# Choose a method to use when adjusting p-values ("none" is valid)
# help(p.adjust.methods)
pvalueAdjust <- "BH"

# Check sample types available (normal, tumour, etc.)
types <- parseSampleGroups(colnames(psi))
unique(types)

# Analyse by sample types (by default) and perform all statistical analyses (by 
# default)
event <- "SE_10_+_79797062_79799962_79799983_79800373_RPS24"
eventPSI <- psi[event, ]
stats <- diffAnalyses(eventPSI, pvalueAdjust=pvalueAdjust,
                      analyses = c("wilcoxRankSum", "ttest", "kruskal", 
                                   "levene", "fligner"))
# View(stats)
plotDistribution(as.numeric(eventPSI), groups=types)
```

Differential splicing analysis returns a data frame with the results of the
statistical tests, as well as other info, such as associated gene and number of
samples per group. The following statistical tests are available:

* Unpaired t-test (`ttest`)
* Wilcoxon rank sum test (`wilcoxRankSum`)
* Kruskal test (`kruskal`)
* Levene's test (`levene`)
* Fligner-Killeen test (`fligner`)

Not all statistical tests are performed depending on the number of groups
available. Check if any tests show statistical significance using only tumour
against normal samples (for instance, Log-rank p-values below 0.05):

```{r diff splicing 2}
filter <- grep("Tumor|Normal", types)
gr_event <- types[filter]
eventPSI <- eventPSI[filter]

stats <- diffAnalyses(eventPSI, groups=gr_event, pvalueAdjust=pvalueAdjust,
                      analyses = c("wilcoxRankSum", "ttest", "kruskal", 
                                   "levene", "fligner"))
# View(stats)
plotDistribution(as.numeric(eventPSI), groups=gr_event)
```

* Hover each group in the plot to compare the respective number of samples,
median and variance.
* To zoom in a specific region, click-and-drag in the region of interest.
* To hide or show groups, click on their name in the legend.

Now, change the groups to *positive* and *negative* for ER expression and check
if any of the tests show statistical significance:

```{r diff splicing 3}
# Filter alternative splicing quantification by positive or negative ER
# expression; match between patient information and samples as such:
erGroups <- getMatchingSamples(groups[c("negative", "positive")], samples, 
                               clinical, match=match)
eventPSI <- psi[event, unlist(erGroups)]
erGroups <- rep(names(erGroups), sapply(erGroups, length))

stats <- diffAnalyses(eventPSI, groups=erGroups, pvalueAdjust=pvalueAdjust,
                      analyses = c("wilcoxRankSum", "ttest", "kruskal", 
                                   "levene", "fligner"))
# View(stats)
plotDistribution(as.numeric(eventPSI), groups=erGroups)
```

## Survival analysis

To study the impact of an alternative splicing event on prognosis, Kaplan-Meier
curves may be plotted for groups of patients separated by a given PSI cut-off 
for the selected alternative splicing event. Let's carry on with the splicing
event from before:

```{r splicing event data}
event <- "SE_10_+_79797062_79799962_79799983_79800373_RPS24"

# Assign alternative splicing quantification to patients based on their samples
clinicalPSI <- getPSIperPatient(psi, match, clinical)
eventPSI <- as.numeric(clinicalPSI[event, ])

# Statistics of the alternative splicing event's quantification
summary(eventPSI)
```

The optimal PSI cut-off that maximises the significance of their difference in
survival (i.e. minimises the p-value of the Wald/Log/Logrank tests of difference
in survival between individuals with PSI below and above that threshold) can be
calculated as per:

```{r optimal cut-off}
opt <- optimalPSIcutoff(clinical, eventPSI, censoring="right", 
                        event="days_to_death", timeStart="days_to_death")
optimalCutoff <- opt$par # Optimal exon inclusion level
optimalCutoff
optimalPvalue <- opt$value # Respective p-value
optimalPvalue
```

Finally, plot survival curves separated by the optimal quantification cut-off or
any other cut-off of interest:

```{r}
cutoff <- optimalCutoff
group <- labelBasedOnCutoff(eventPSI, cutoff, label="PSI values")
survTerms <- processSurvTerms(clinical, censoring="right",
                              event="days_to_death", timeStart="days_to_death", 
                              group=group)

require("survival")
surv <- survfit(survTerms)
pvalue <- testSurvival(survTerms)
plotSurvivalCurves(surv, pvalue=pvalue)
```

## Literature support and external database information

If an event is differentially spliced and has an impact on patient survival, 
its association with the studied disease might be already described in the
literature. Check for relevant research articles on [PubMed](http://pubmed.gov).

It is also possible to plot transcripts and proteins related to the gene
associated to a given splicing event. To retrieve and plot transcript
information from [ENSEMBL][4]:

```{r plot transcripts}
parsed <- parseSplicingEvent(event)
info <- queryEnsemblByEvent(event, species="human", assembly="hg19")
plotTranscripts(info, parsed$pos[[1]])
```

To retrieve and plot protein information from [UniProt][5]:

```{r plot proteins}
parsed <- parseSplicingEvent(event)
info <- queryEnsemblByEvent(event, species="human", assembly="hg19")

# Some of these transcripts have no corresponding protein
proteins <- info$Transcript$Translation$id
protein <- proteins[[6]]

# Convert protein identifier from ENSEMBL to UniProt
uniprot <- ensemblToUniprot(protein)
uniprot
uniprot <- uniprot[[1]]

plotProtein(uniprot)
```

The protein plot shows the [UniProt][5] matches for the selected transcript.
Hover the protein's rendered domains to obtain more information on them.

Other database of interest include:
    - [Human Protein Atlas (Cancer Atlas)](http://www.proteinatlas.org/cancer)
    allows to check the evidence of a gene at protein level for multiple cancer 
    tissues.
    - [VastDB](http://vastdb.crg.eu) shows multi-species alternative splicing
    profiles for diverse tissues and cell types.
    - [UCSC Genome Browser](https://genome.ucsc.edu) may reveal protein domain 
    disruptions caused by the alternative splicing event. To check so, activate 
    the **Pfam in UCSC Gene** and **UniProt** tracks (in *Genes and Gene 
    Predictions*) and check if any domains are annotated in the alternative 
    and/or constitutive exons of the splicing event.

# Exploring differential splicing analysis

To analyse differencial splicing, choose which clinical groups on which to 
perform the analyses. For instance, splicing events can be analysed based on 
sample types (e.g., tumour versus normal tissue, if available) or clinical 
groups of the patients (e.g. stage of the disease).

Let's perform differential splicing analysis based on ER expression:

```{r exploring diff splicing}
# Filter alternative splicing quantification by positive or negative ER
# expression; match between patient information and samples as such:
erGroups <- getMatchingSamples(groups[c("negative", "positive")], samples, 
                               clinical, match=match)
eventsPSI <- psi[ , unlist(erGroups)]
erGroups  <- rep(names(erGroups), sapply(erGroups, length))

diff <- diffAnalyses(eventsPSI, erGroups)
# View(diff)
```

## Survival analysis

To study the impact of the alternative splicing events on prognosis, significant
differences in survival may be identified between patients separated by a given
splicing quantification cut-off. Given the time-consuming process of identifying
a cut-off that minimises the survival difference, it is recommended to filter
differentially spliced events supported by statistical significance before
calculating the optimal cut-off and associated p-value.

```{r survival analysis on filtered events, results="hide"}
# Filter statistically significant events as you see fit
filter <- diff$`Wilcoxon p-value (BH adjusted)` <= 0.05
filter <- filter[!is.na(filter)]
events <- rownames(diff[filter, ])

# Assign alternative splicing quantification to patients based on their samples
clinicalPSI <- getPSIperPatient(psi, match, clinical)

# Get time for all survival analyses of interest (the same for all, so this is
# faster)
daysToDeath <- "days_to_death"
survTime <- getColumnsTime(clinical, event=daysToDeath, timeStart=daysToDeath)

# Prepare progress bar
min <- 1
max <- nrow(clinicalPSI)
pb <- txtProgressBar(min, max, "Calculating optimal PSI cut-off", style=3)

# Prepare empty vectors where information will be stored (faster)
optimalCutoff <- rep(NA, max)
optimalPvalue <- rep(NA, max)

# Retrieve the optimal PSI cut-off and respective p-value for multiple events
for (row in min:max) {
    singlePSI <- as.numeric(clinicalPSI[row, ])
    opt <- optimalPSIcutoff(clinical, singlePSI, censoring="right", 
                            event=daysToDeath, timeStart=daysToDeath,
                            survTime=survTime)
    optimalCutoff[row] <- opt$par # Optimal exon inclusion level
    optimalPvalue[row] <- opt$value # Respective p-value
    setTxtProgressBar(pb, row)
}

# Bind columns to differential splicing analysis if desired
diff <- cbind(diff, optimalCutoff, optimalPvalue)
# View(diff)
```

Afterwards, check literature information and search external databases for more 
information on the events of interest, including on **UCSC Genome Browser** for 
putative protein domain disruptions resulting from the splicing event.

# Exploring an alternative splicing event of interest

The event *SE 6 - 46823711 46822518 46822452 46821808 GPR116* has been 
previously reported to have potential prognostic value in breast cancer 
patients, where patients with higher PSI values for this event have a lower 
5-year survival rate than patients with lower PSI values [@Tsai2015jf].

Confirm so by performing differential splicing (for example, normal vs tumour
samples) on this event and survival analysis by PSI cut-off. Also, check
literature information and search external databases for more information,
including on **UCSC Genome Browser** for putative protein domain disruptions
resulting from the alternative splicing event.

# Feedback

All feedback on the program, documentation and associated material (including
this tutorial) is welcome. Please send any suggestions and comments to:

| Nuno Saraiva-Agostinho (nunodanielagostinho@gmail.com)
| [Computation Biology Lab, Instituto de Medicina Molecular (Portugal)][3]

# References

[1]: https://tcga-data.nci.nih.gov/docs/publications/tcga
[2]: http://firebrowse.org
[3]: http://imm.medicina.ulisboa.pt/group/compbio
[4]: http://www.ensembl.org
[5]: http://www.uniprot.org
[6]: http://rpubs.com/nuno-agostinho/alt-splicing-annotation
[7]: http://gtexportal.org