---
title: "An Introduction To ngsReports"
author:
- name: Christopher Ward
  affiliation: School of Biological Sciences, University of Adelaide
- name: Hien To
  affiliation: Bioinformatics Hub, University of Adelaide
- name: Steve Pederson
  affiliation: Dame Roma Mitchell Cancer Research Laboratories, Adelaide Medical School, University of Adelaide
  email: stephen.pederson@adelaide.edu.au
package: ngsReports
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{An Introduction To ngsReports}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```


# Introduction

The package `ngsReports` is designed to bolt into data processing pipelines and 
produce combined plots for multiple FastQC reports generated across an entire 
set of libraries or samples.
The primary functionality of the package is parsing FastQC reports, with import 
methods also implemented for log files produced by tools as as `STAR`, `hisat2` 
and others.
In addition to parsing files, default plotting methods are implemented.
Plots applied to a single file will replicate the default plots from 
`FastQC`^[https://www.bioinformatics.babraham.ac.uk/projects/fastqc/], 
whilst methods applied to multiple FastQC reports summarise these and produce 
a series of custom plots.

Plots are produced as standard `r CRANpkg("ggplot2")` objects, with an 
interactive option available using `r CRANpkg("plotly")`.
As well as custom summary plots, tables of read counts and the like can also 
be easily generated.

# Basic Usage

## Using the Shiny App

In addition to the usage demonstrated below, a `shiny` app has been developed 
for interactive viewing of FastQC reports.
This can be installed using:

```
remotes::install_github("UofABioinformaticsHub/shinyNgsreports")
```

A vignette for this app will be installed with the `shinyNgsreports` package.

## The default report

In it's simplest form, a default summary report can be generated simply by 
specifying  a directory containing the output from FastQC and calling the 
function `writeHtmlReport()`.

```{r}
library(ngsReports)
```

```
fileDir <- file.path("path", "to", "your", "FastQC", "Reports")
writeHtmlReport(fileDir)
```

This function will transfer the default template to the provided directory and 
produce a single `.html` file containing interactive summary plots of any 
FastQC output found in the directory.
FastQC output can be `*fastqc.zip` files or the same files extracted as 
individual directories.

The default template is provided as `ngsReports_Fastqc.Rmd` in the package 
directory .
This template can be easily modified and supplied as an alternate template to 
the above function using your modified file as the template RMarkdown file.

```
altTemplate <- file.path("path", "to", "your", "new", "template.Rmd")
writeHtmlReport(fileDir, template = altTemplate)
```

# Advanced Usage

## Classes Defined in the Package

The package `ngsReports` introduces two main `S4` classes: 

* `FastqcData` & `FastqcDataList`

`FastqcData` objects hold the *parsed data from a single report* as 
generated by the stand-alone tool `FastQC`.
These are then extended into lists for *more than one file* as a 
`FastqcDataList`.
For most users, the primary class of interest will be the `FastqcDataList`.

## Loading FastQC Data Into `R`

To load a set of `FastQC` reports into `R` as a `FastqcDataList`, specify the 
vector of file paths, then call the function `FastqcDataList()`.
In the rare case you'd like an individual file, this can be performed by 
calling `FastqcData()` on an individual file, or subsetting the output from 
`FastqcDataList()` using the `[[]]` operator as with any list object.

```{r}
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)
```

From here, all FastQC modules can be obtained as a `tibble` (i.e. `data.frame`) 
using the function `getModule()` and choosing one of the following modules:

* `Summary` (The PASS/WARN/FAIL status for each module)
* `Basic_Statistics`
* `Per_base_sequence_quality`
* `Per_sequence_quality_scores`
* `Per_base_sequence_content`
* `Per_sequence_GC_content`
* `Per_base_N_content`
* `Sequence_Length_Distribution`
* `Sequence_Duplication_Levels`
* `Overrepresented_sequences`
* `Adapter_Content`
* `Kmer_Content`
* `Per_tile_sequence_quality`

```{r}
getModule(fdl[[1]], "Summary")
```

Capitalisation and spelling of these module names follows the default patterns 
from FastQC reports with spaces replaced by underscores.
One additional module is available and taken directly from the text within the 
supplied reports

* `Total_Duplicated_Percentage`

In addition, the read totals for each file in the library can be obtained 
using `readTotals()`, which can be easily used to make a table of read totals.
This essentially just returns the first two columns from 
`getModule(x, "Basic_Statistics")`.

```{r, results='hide'}
reads <- readTotals(fdl)
```

The packages `dplyr` and `pander` can also be extremely useful for manipulating 
and displaying imported data.
To show only the R1 read totals, you could do the following

```{r}
library(dplyr)
library(pander)
reads %>%
    dplyr::filter(grepl("R1", Filename)) %>% 
    pander(
        big.mark = ",",
        caption = "Read totals from R1 libraries", 
        justify = "lr"
    )
```


## Generating Plots For One or More Fastqc Files

Plots created from a single `FastqcData` object will resemble those generated 
by the `FastQC` tool, whilst those created from a `FastqcDataList` will be 
combined summaries across a library of files.
In addition, all plots are able to be generated as interactive plots using the 
argument `usePlotly = TRUE`.

All FastQC modules have been enabled for plotting using default `S4` dispatch, 
with the exception of `Per_tile_sequence_quality`.

### Inspecting the PASS/WARN/FAIL Status of each module

The simplest of the plots is to summarise the `PASS/WARN/FAIL` flags as 
produced by `FastQC` for each module. 
This plot can be simply generated using `plotSummary()`

```{r plotSummary, fig.cap="Default summary of FastQC flags.", fig.wide = TRUE}
plotSummary(fdl)
```

### Visualising Read Totals

The next most informative plot may be to summarise the total numbers of reads 
in each associated Fastq file.
By default, the number of duplicated sequences from the 
`Total_Duplicated_Percentage` module are shown, but this can be disabled by 
setting `duplicated = FALSE`.

```{r}
plotReadTotals(fdl)
```

As these are `ggplot2` objects, the output can be modified easily using 
conventional `ggplot2` syntax.
Here we'll move the legend to the top right as an example.

```{r}
plotReadTotals(fdl) +
    theme(
        legend.position = c(1, 1), 
        legend.justification = c(1, 1),
        legend.background = element_rect(colour = "black")
    )
```

### Per Base Sequence Qualities

Turning to the `Per base sequence quality` scores is the next most common step 
for most researchers, and these can be obtained for an individual file by 
selecting this as an element (i.e. `FastqcData` object ) of the main 
`FastqcDataList` object.
This plot replicates the default plots from a FastQC report.

```{r, fig.cap = "Example showing the Per_base_sequence_quality plot for a single FastqcData object."}
plotBaseQuals(fdl[[1]])
```

When working with multiple FastQC reports, these are summarised as a heatmap 
using the mean quality score at each position.

```{r, fig.cap="Example showing the Mean Per Base Squence Qualities for a set of FastQC reports."}
plotBaseQuals(fdl)
```

Boxplots of any combinations can also be drawn from a `FastqcDataList` by 
setting the argument `plotType = "boxplot"`.
However, this may be not suitable for datasets with a large number of libraries.

```{r}
plotBaseQuals(fdl[1:4], plotType = "boxplot")
```

### Mean Sequence Quality Per Read

Similarly, the Mean Sequence Quality Per Read plot can be generated to 
replicate plots from a FastQC report by selecting the individual file from the 
`FastqcDataList` object.

```{r, fig.cap = "Example plot showing Per_sequence_quality_scores for an individual file."}
plotSeqQuals(fdl[[1]])
```

A heatmap of mean sequence qualities can be generated when inspecting multiple 
reports.

```{r, fig.cap = "Example heatmaps showing Per_sequence_quality_scores for a set of files."}
plotSeqQuals(fdl)
```

An alternative view may be to plot these as overlaid lines, which can be simply 
done by setting `plotType = "line"`.
Again, discretion should be shown when choosing this option for many samples.

```{r}
r2 <- grepl("R2", names(fdl))
plotSeqQuals(fdl[r2], plotType = "line")
```

### Per Base Sequence Content

The `Per_base_sequence_content` module can also be plotted for an individual 
report with the layout being identical to that from FastQC.

```{r, fig.cap="Individual Per_base_sequence_content plot"}
plotSeqContent(fdl[[1]])
```

These are then combined across multiple files as a heatmap showing a composite 
colour for each position. 
Colours are combined using `rgb()` with `A`,`C`, `G` and `T` being represented 
by green, blue, black and red respectively.

```{r, fig.cap="Combined Per_base_sequence_content plot"}
plotSeqContent(fdl)
```

**NB** These plots can be very informative setting the argument 
`usePlotly = TRUE`, however they can be slow to render given the nature of how 
the package `plotly` renders graphics.

Again, supplying multiple files and setting `plotType = "line"` will replicate 
multiple individual plots from the original FastQC reports.
The argument `nc` is passed to `facet_wrap()` from the package `ggplot2` to 
determine the number of columns in the final plot.

```{r}
plotSeqContent(fdl[1:2], plotType = "line", nc = 1)
```

### Adapter Content

Adapter content as identified by FastQC is also able to be plotted for an 
individual file.

```{r, fig.cap = "Adapter Content plot for a single FastQC report"}
plotAdapterContent(fdl[[1]]) 
```

When producing a heatmap across a set of FastQC reports, this will default to 
Total Adapter Content, instead of showing the individual adapter types.

```{r, fig.cap = "Heatmap showing Total Adapter Content by position across a set of FastQC reports"}
plotAdapterContent(fdl)
```


### Sequence Duplication Levels

As with all modules, the Sequence Duplication Levels plot is able to be 
replicated for an individual file.

```{r, fig.cap = "Example Sequence Duplication Levels plot for an individual file."}
plotDupLevels(fdl[[1]])
```

When plotting across multiple FastQC reports, duplication levels are shown as a 
heatmap based on each default bin included in the initial FastQC reports.
By default, the plotted values are the "Pre" de-duplication values.
Note that values are converted to percentages instead of read numbers to ensure 
comparability across files.
In the plot below, `CCGC_R2` shows very low duplication levels, whilst `ATTG_R1` 
shows high levels of duplication.
The commonly observed 'spikes' around `>10` are also evident as the larger red 
blocks.

```{r, fig.cap = "Sequence Duplication Levels for multiple files"}
plotDupLevels(fdl)
```

## GC content

### The class TheoreticalGC

A selection of Theoretical GC content is supplied with the package in the 
object `gcTheoretical`, which has been defined with the additional `S4` class 
`TheoreticalGC`.
GC content was calculated using scripts obtained from  
https://github.com/mikelove/fastqcTheoreticalGC.
Available genomes and transcriptomes can be obtained using the function 
`gcAvail()` on the object `gcTheoretical` and specifying the type.

```{r}
gcAvail(gcTheoretical, "Genome")
```

### Inspecting GC Content

As with all modules, data for an individual file replicates the default plot 
from a FastQC report, but with one key difference.
This is that the Theoretical GC content has been provided in the object 
`gcTheoretical` based on 100bp reads.
This empirically determined content is shown as the Theoretical GC content 
line.

```{r, fig.cap = "Example GC Content plot using the Hsapiens Transcriptome for the theoretical distribution."}
plotGcContent(fdl[[1]], species = "Hsapiens", gcType = "Transcriptome")
```

Again, data is summarised as a heatmap when plotting across multiple reports, 
with the default value being the difference between the observed and the 
theoretical GC content.

```{r, fig.cap = "Example GC content showing the difference between observed and theoretical GC content across multiple files."}
plotGcContent(fdl)
```

Line plots can also be produce an alternative viewpoint, with read totals 
displayed as percentages instead of raw counts.

```{r, fig.cap = "Example GC content plot represented as a line plot instead of a heatmap."}
plotGcContent(fdl, plotType = "line",  gcType = "Transcriptome")
```

Customized theoretical GC content can be generated using input DNA sequences 
from a supplied fasta file.

```{r, message=FALSE, warning=FALSE, eval=FALSE}
faFile <- system.file(
    "extdata", "Athaliana.TAIR10.tRNA.fasta", 
    package = "ngsReports")
plotGcContent(fdl, Fastafile = faFile, n = 1000)
```


## Overrepresented Sequences

When inspecting the Overrepresented Sequence module, the top `n`can be plotted 
for an individual file, again broken down by their possible source, and 
coloured based on their `WARN/FAIL` status.

```{r, fig.wide = TRUE}
plotOverrep(fdl[[1]])
```

When applying this across multiple files, instead of identifying common 
sequences across a set of libraries, overrepresented sequences are summarised 
by their possible source as defined by FastQC.

```{r}
plotOverrep(fdl)
```

In addition to the above, the most abundant `n` overrepresented sequences can 
be exported as a FASTA file for easy submission to `blast`.

```{r, eval = FALSE}
overRep2Fasta(fdl, n = 10)
```

# Importing Log Files

A selection of log files as produced by tools such as `STAR`, `hisat2`, 
`bowtie`, `bowtie2` and `picard duplicationMetrics`, can be easily imported 
using the `importNgsLogs()` function. 

Tool can be specified by the user using the argument `type`, however if no
`type` is provided we will attempt to auto-detect from the file's structure. 
Note: only a single log file type can be imported at any time.

The `importNgsLogs()` function currently supports log files from 
the following tools. 

Adapter removal and trimming

- AdapterRemoval
- cutadapt
- trimmomatic

Mapping and alignment 

- bowtie
- bowtie2
- hisat2
- samtools flagstat
- STAR
- picard MarkDuplicates

Transcript/gene quantification

- featureCounts

Genome assembly

- BUSCO
- Quast


## Adapter removal and trimming 


```{r}
fl <- c("Sample1.trimmomaticPE.txt")
trimmomaticLogs <- system.file("extdata", fl, package = "ngsReports")
df <- importNgsLogs(trimmomaticLogs)
```

```{r}
df %>%
    dplyr::select("Filename", contains("Surviving"), "Dropped") %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown", 
        big.mark = ",",
        caption = "Select columns as an example of output from trimmomatic."
    )
```



## Mapping and alignment 

### Log file import 

Bowtie log files can be parsed and imported  

```{r}
fls <- c("bowtiePE.txt", "bowtieSE.txt")
bowtieLogs <- system.file("extdata", fls, package = "ngsReports")
df <- importNgsLogs(bowtieLogs, type = "bowtie")
```

```{r}
df %>%
    dplyr::select("Filename", starts_with("Reads")) %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown", 
        big.mark = ",",
        caption = "Select columns as an example of output from bowtie."
    )
```

STAR log files can be parsed and imported  

```{r}
starLog <- system.file("extdata", "log.final.out", package = "ngsReports")
df <- importNgsLogs(starLog, type = "star")
```

```{r, echo=FALSE}
df %>% 
    dplyr::select("Filename", contains("Unique")) %>%
    pander(
        split.tables = Inf,
        style = "rmarkdown", 
        big.mark = ",",
        caption = "Select columns as output from STAR"
    )
```

The output of the `samtools flagstat` module can be parsed and imported

```{r}
flagstatLog <- system.file("extdata", "flagstat.txt", package = "ngsReports")
df <- importNgsLogs(flagstatLog, type = "flagstat")
```

```{r, echo=FALSE}
df %>% 
    pander(
        split.tables = Inf,
        style = "rmarkdown", 
        big.mark = ",",
        caption = "Flagstat output for a single file"
    )
```

In addition to the files produced by the above alignment tools, the output 
from Duplication Metrics (`picard`) can also be imported.
This is imported as a list with a `tibble` containing the detailed output in 
the list element `$metrics` and the histogram data included as the second 
elements.

```{r}
sysDir <- system.file("extdata", package = "ngsReports")
fl <- list.files(sysDir, "Dedup_metrics.txt", full.names = TRUE)
dupMetrics <- importNgsLogs(fl, type = "duplicationMetrics", which = "metrics")
str(dupMetrics)
```


### Plot log file imports 

Summaries of log files from select mapping and alignment tools can be plot 
using the function `plotAlignemntSummary()`.

```{r, fig.cap = "Example Bowtie logs for PE and SE sequencing"}
plotAlignmentSummary(bowtieLogs, type = "bowtie")
```

```{r, fig.cap = "Example STAR aligner logs"}
plotAlignmentSummary(starLog, type = "star")
```


## Genome assembly 

Assembly 'completeness' and summary statistic information from the tools BUSCO 
and quast can also be plot using the function `plotAssemblyStats()`

```{r, fig.cap = "Example plot after running BUSCO v3 on the Drosophila melanogaster reference genome"}
buscoLog <- system.file("extdata", "short_summary_Dmelanogaster_Busco.txt", package = "ngsReports")
plotAssemblyStats(buscoLog, type = "busco")
```

```{r, fig.cap = "Example plot after running quast on two shortread assemblies"}
fls <- c("quast1.tsv", "quast2.tsv")
quastLog <- system.file("extdata", fls, package = "ngsReports")
plotAssemblyStats(quastLog, type = "quast")
```

```{r, fig.cap = "Example parallel coordinate plot after running quast on two shortread assemblies"}
plotAssemblyStats(quastLog, type = "quast", plotType = "paracoord")
```

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```