---
title: 'RNAmodR: RiboMethSeq'
author: "Felix G.M. Ernst and Denis L.J. Lafontaine"
date: "`r Sys.Date()`"
abstract: >
  Detection of 2'-O methylations by RiboMethSeq
output:
  BiocStyle::html_document:
    toc: true
    toc_float: true
    df_print: paged
vignette: >
  %\VignetteIndexEntry{RNAmodR.RiboMethSeq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown(css.files = c('custom.css'))
```

# Introduction

Among the various post-transcriptional RNA modifications, 2'-O methylations are
commonly found in rRNA and tRNA. They promote the endo conformation of the
ribose and confere resistance to alkaline degradation by preventing a
nucleophilic attack on the 3'-phosphate especially in flexible RNA, which is
fascilitated by high pH conditions. This property can be queried using a method
called RiboMethSeq [[@Birkedal.2015]](#References) for which RNA is treated in
alkaline conditions and RNA fragments are used to prepare a sequencing library
[[@Marchand.2017]](#References).

At position containing a 2'-O methylations, read ends are less frequent, which
is used to detect and score the 2'-O methylations.

The `ModRiboMethSeq` class uses the the `ProtectedEndSequenceData` class to
store and aggregate data along the transcripts. The calculated scores follow the
nomenclature of [[@Birkedal.2015;@Galvanin.2019]](#References) with the names
`scoreRMS` (default), `scoreA`, `scoreB` and `scoreMean`.

```{r, echo = FALSE}
suppressPackageStartupMessages({
  library(rtracklayer)
  library(GenomicRanges)
  library(RNAmodR.RiboMethSeq)
  library(RNAmodR.Data)
})
```
```{r, eval = FALSE}
library(rtracklayer)
library(GenomicRanges)
library(RNAmodR.RiboMethSeq)
library(RNAmodR.Data)
```

# Example workflow

The example workflow is limited to two 2'-O methylated position on 5.8S rRNA,
since the size of the raw data is limited. For annotation data either a gff file
or a `TxDb` object and for sequence data a fasta file or a `BSgenome` object can
be used. The data is provided as bam files.

```{r, message=FALSE, results='hide'}
annotation <- GFF3File(RNAmodR.Data.example.RMS.gff3())
sequences <- RNAmodR.Data.example.RMS.fasta()
files <- list("Sample1" = c(treated = RNAmodR.Data.example.RMS.1()),
              "Sample2" = c(treated = RNAmodR.Data.example.RMS.2()))
```

# Analysis of data

The analysis is triggered by the construction of a `ModSetRiboMethSeq` object.
Internally parallelization is used via the `BiocParallel` package, which would
allow optimization depending on number/size of input files (number of samples,
number of replicates, number of transcripts, etc).

```{r}
msrms <- ModSetRiboMethSeq(files, annotation = annotation, sequences = sequences)
msrms
```

# Visualizing the results

To compare samples, we need to know, which positions should be part of the
comparison. This can either be done by aggregating the detect over all samples
and use the union or intersect or by using publish data. We want to assemble
a GRanges object from the latter by utilising the infomation from the snoRNAdb
[[@Lestrade.2006]](#References).

In this specific example only information for the 5.8S RNA is used, since the
example data would be to big otherwise. The information regarding the parent and
seqname must match the information used as the annotation data. Check that it
matches the output of `ranges()` on a `SequenceData`, `Modifier` oder
`ModifierSet` object.

```{r}
table <- read.csv2(RNAmodR.Data.snoRNAdb(), stringsAsFactors = FALSE)
table <- table[table$hgnc_id == "53533",] # Subset to RNA5.8S
# keep only the current coordinates
table <- table[,1L:7L]
snoRNAdb <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = table$position,
                               width = 1),
              strand = "+",
              type = "RNAMOD",
              mod = table$modification,
              Parent = "1", #this is the transcript id
              Activity = IRanges::CharacterList(strsplit(table$guide,",")))
coord <- split(snoRNAdb,snoRNAdb$Parent)
```

In addition to the coordinates of published, we also want to include more 
meaningful names for the transcripts. For this we provide a `data.frame` with
two columns, `tx_id` and `name`. All values in the first column have to match
transcript IDs.

```{r}
ranges(msrms)
alias <- data.frame(tx_id = "1", name = "5.8S rRNA", stringsAsFactors = FALSE)
```

```{r plot1, fig.cap="Heatmap showing RiboMethSeq scores for 2'-O methylated positions on the 5.8S rRNA."}
plotCompareByCoord(msrms[c(2L,1L)], coord, alias = alias)
```

Results can also be compared on a sequence level, by selecting specific 
coordinates to compare.

```{r, plot2, fig.cap="RiboMethSeq scores around Um(14) on 5.8S rRNA.", fig.asp=1}
singleCoord <- coord[[1L]][1L,]
plotDataByCoord(msrms, singleCoord)
```

By default only the RiboMethSeq score and the ScoreMean are shown. The raw 
sequence data can be inspected as well

```{r, plot3, fig.cap="RiboMethSeq scores around Um(14) on 5.8S rRNA. Sequence data is shown by setting `showSequenceData = TRUE`.", fig.asp=1}
singleCoord <- coord[[1L]][1L,]
plotDataByCoord(msrms, singleCoord, showSequenceData = TRUE)
```

# Performance

To access the performance of the method in combination with samples used, use
the `plotROC` function.

```{r plot4, fig.cap="TPR versus FPR plot.", fig.asp=1}
plotROC(msrms,coord)
```

The example given here should be regarded as a proof of concept. Based on the
results, minimal scores for calling modified positions can be adjusted to the
individual requirements.

```{r settings}
settings(msrms) <- list(minScoreMean = 0.7)
msrms
```

As the warning suggested, after modifying the settings the results should be 
updated by running `modify(x,force = TRUE)`.

```{r udpate}
msrms2 <- modify(msrms,force = TRUE)
```

# Session info

```{r, sessioninfo}
sessionInfo()
```

<a name="References"></a>

# References