---
title: "Rqc - Quality Control Tool for High-Throughput Sequencing Data"
author:
- name: Welliton Souza
  affiliation: University of Campinas, Campinas, Brazil
  email: well309@gmail.com
- name: Benilton Carvalho
  affiliation: University of Campinas, Campinas, Brazil
package: Rqc
output: 
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Using Rqc}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}  
---

```{r include=FALSE}
library(BiocStyle)
```

# Introduction

Rqc is an optimized tool designed for quality control and assessment
of high-throughput sequencing data. It performs parallel processing of
entire files and produces a report which contains a set of
high-resolution graphics that can be used for quality assessment.

This version of Rqc produces high-quality images for the following statistics:

- *Average Quality*: this plot describes the average quality pattern
   by showing on the X-axis quality thresholds and on the Y-axis the
   percentage of reads that exceed that quality level.
- *Cycle-specific Average Quality*: this describes the average quality
   scores for each cycle of sequencing.
- *Read Length Distribution*: this is a barplot that presents the
   distribution of the lengths of the reads available in the FASTQ
   file.
- *Cycle-specific GC Content*: a line plot showing the average GC
   content for every cycle of sequencing.
- *Cycle-specific Quality Distribution*: a bar plot showing the
   proportion of quality calls per cycle. Colors are presented in a
   gradient Red-Blue, where red identifies calls of lower
   quality. This visualization is preferred as it is cleaner than the
   boxplots described below.
- *Cycle-specific Quality Distribution - Boxplots*: boxplots
   describing empirical patterns of quality distribution on each cycle
   of sequencing.
- *Cycle-specific Base Call Proportion*: this bar plot describes the
   proportion of each nucleotide called for every cycle of sequencing.

## Basic Workflow

The main goal of Rqc is to provide graphical tools for quality
assessment of reads contained in FASTQ files. The package is designed
for simplicity of use: a single function, `rqc`, processes a set of
input files and generates an HTML report containing several plots
that can be used for quality assessment.

To access this functionality, the user needs to load the Rqc package.
```{r load, message=FALSE}
library(Rqc)
```

The next step is to determine the location of the FASTQ files that
should be analyzed. The example below uses sample files provided by
the ShortRead package, but the user must modify this location
to reflect the actual location of the files that need QA.

```{r file_loc}
folder <- system.file(package="ShortRead", "extdata/E-MTAB-1147")
```

The basic usage of the `rqc` function requires two arguments. The
first, `path`, is the directory where the files of interest are stored
(defined in the step above). The second, `pattern`, is a regular
expression that identifies the files of interest. Below, we use
`.fastq.gz`, so every file whose name matches that expression is
processed. Because `pattern` is a regular expression, the unescaped
dots match any character.
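
Since `pattern` is treated as a regular expression, it can be worth checking what it matches before processing any files. The sketch below uses only base R and made-up file names (they are illustrative and not part of the example data) to compare the pattern used in this vignette with a stricter, anchored alternative.

```{r pattern-check, eval=FALSE}
# Hypothetical file names, used only to illustrate the matching rules
candidates <- c("a.fastq.gz", "b_fastq.gz.md5", "c.fastq.gz.bak", "dfastq-gz")

# Unescaped dots match any character, so every name above matches
loose <- grepl(".fastq.gz", candidates)

# Escaped dots and a trailing `$` keep only names ending in ".fastq.gz"
strict <- grepl("\\.fastq\\.gz$", candidates)

data.frame(candidates, loose, strict)
```

The call below keeps the simpler pattern used throughout this vignette.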

```{r, rqc, eval=FALSE}
rqc(path = folder, pattern = ".fastq.gz")
```

At this point, the user's default Internet browser will open an HTML
file. This file is the report generated by Rqc, which, by default, is
stored in a temporary directory. A sample report is shown below:

----------

```{r example, echo=FALSE, message=FALSE}
fastqDir <- system.file(package="ShortRead", "extdata/E-MTAB-1147")
files <- list.files(fastqDir, "fastq.gz", full.names=TRUE)
qa <- rqcQA(files, workers=1)
```

# Quality control report

## File Information
This table describes the input files. The `reads` column holds either the total number of reads (`sample=FALSE`) or the sample size.
```{r}
knitr::kable(perFileInformation(qa))
```

## Per Read Mean Quality Distribution of Files
This plot gives an overview of the per-read mean quality distribution of all files.
```{r read-mean-dist}
rqcReadQualityBoxPlot(qa)
```

## Average Quality
This plot describes the average quality pattern by showing on the
X-axis quality thresholds and on the Y-axis the percentage of reads
that exceed that quality level.
```{r average-quality-plot}
rqcReadQualityPlot(qa)
```

## Cycle-specific Average Quality
This describes the average quality scores for each cycle of sequencing.
```{r cycle-average-quality-plot}
rqcCycleAverageQualityPlot(qa)
```

## Read Frequency
This plot shows the proportion of reads that appeared many times.
```{r readfrequency}
rqcReadFrequencyPlot(qa)
```

## Heatmap of top represented reads
This heatmap shows the distance matrix between the top represented reads. This function only works with a single result object (not a list).
```{r heatmap-reads}
rqcFileHeatmap(qa[[1]])
```

## Read Length Distribution
Barplot that presents the distribution of the lengths of the reads available in the FASTQ file.
```{r read-width-plot}
rqcReadWidthPlot(qa)
```

## Cycle-specific GC Content
Line plot showing the average GC content for every cycle of sequencing.
```{r cycle-gc-plot}
rqcCycleGCPlot(qa)
```

## Cycle-specific Quality Distribution
Bar plot showing the proportion of quality calls per cycle. Colors are
presented in a gradient Red-Blue, where red identifies calls of lower
quality. This visualization is preferred as it is cleaner than the
boxplots described below.
```{r cycle-quality-plots}
rqcCycleQualityPlot(qa)
```

## PCA Biplot (cycle-specific read average quality)
Biplot from Principal Component Analysis (PCA) of cycle-specific read average quality.
```{r biplot}
rqcCycleAverageQualityPcaPlot(qa)
```

## Cycle-specific Quality Distribution - Boxplot
Boxplots describing empirical patterns of quality distribution on each cycle of sequencing.
```{r cycle-quality-boxplots}
rqcCycleQualityBoxPlot(qa)
```

## Cycle-specific Base Call Proportion
This bar plot describes the proportion of each nucleotide called for every cycle of sequencing.
```{r cycle-basecall-plots}
rqcCycleBaseCallsPlot(qa)
```

The line plot shows a more detailed view.
```{r cycle-basecall-lineplots}
rqcCycleBaseCallsLinePlot(qa)
```

----------

It is important to note that, by default, the `rqc` function samples
1 million records from the FASTQ files. The sample size can be changed
through the `n` argument. To process the files as a whole (rather than
sampling records from them), set the `sample` argument to `FALSE`.
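
As a minimal sketch of these two options, assuming the `folder` object defined in the Basic Workflow section, the calls below change the sample size and then disable sampling. The object names `qaSmall` and `qaFull` are illustrative only.

```{r sampling-options, eval=FALSE}
# `folder` is the directory defined in the Basic Workflow section

# Sample 100,000 records instead of the default 1 million
qaSmall <- rqc(path = folder, pattern = ".fastq.gz", n = 1e5)

# Disable sampling and process the files as a whole
qaFull <- rqc(path = folder, pattern = ".fastq.gz", sample = FALSE)
```

Processing whole files can take considerably longer on large data sets, so sampling is usually a reasonable first pass.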

# Advanced Workflow

The `rqc` function wraps a set of functions to generate a quick report
that can be used for quality assessment. However, users can perform a
step-by-step analysis by using the information described below.

## Defining input files

To process a set of files that are not located in the same directory,
the user needs to create a vector containing the absolute paths of the
files. The `list.files` function can be useful for this.

```{r input}
fastqDir <- system.file(package="ShortRead", "extdata/E-MTAB-1147")
files <- list.files(fastqDir, "fastq.gz", full.names=TRUE)
```
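
When the files are spread over several directories, `list.files` can be given all of them at once. The sketch below uses only base R; the two directories and the empty placeholder files are created under `tempdir()` purely for illustration and are not valid FASTQ input.

```{r input-multi-dir, eval=FALSE}
# Two hypothetical directories holding placeholder files
dirA <- file.path(tempdir(), "runA")
dirB <- file.path(tempdir(), "runB")
dir.create(dirA, showWarnings = FALSE)
dir.create(dirB, showWarnings = FALSE)
invisible(file.create(file.path(dirA, "sample1.fastq.gz"),
                      file.path(dirB, "sample2.fastq.gz")))

# list.files() accepts a vector of directories;
# full.names = TRUE keeps the directory part of each path
multiFiles <- list.files(c(dirA, dirB), pattern = "\\.fastq\\.gz$",
                         full.names = TRUE)
basename(multiFiles)
```

With real data, a vector built this way can be used in place of `files` in the steps that follow.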

The example input files are samples from a public data set. These
samples are available through the `r Biocpkg('ShortRead')` package.
More information regarding these data can be found in the vignette of
that package.

## Processing files

To process the files without generating an HTML report, the user
should use the `rqcQA` function instead of `rqc`. This function
receives a vector containing the paths of the input files.

```{r rqcQA}
qa <- rqcQA(files, workers=1)
```

The `rqcQA` function returns a list that contains the information
required to create the plots present in the standard HTML
report. Both `rqc` and `rqcQA` return a named list of
`RqcResultSet` objects, which can be used directly as input to
other Rqc functions, such as `rqcReport` and `plot`. `RqcResultSet`
is a class that extends the `.QA` class of the ShortRead package. An
`RqcResultSet` object contains information from two perspectives:
*read-specific data* and *cycle-specific data*. They can be accessed
using the `[[` operator.
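
A minimal sketch of inspecting the returned list, assuming the `qa` object created by the chunk above. It only extracts whole result sets from the list; the names of the components stored inside each `RqcResultSet` are not listed in this vignette, so none are assumed here.

```{r result-access, eval=FALSE}
# `qa` is the list returned by rqcQA() above
names(qa)                  # names of the list elements

rs <- qa[[1]]              # `[[` extracts a single RqcResultSet object
is(rs, "RqcResultSet")

# `[` keeps the list structure, as expected by functions applied to `qa`
perFileInformation(qa[1])
```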

## Generating report

To create a final HTML report, the user can apply the `rqcReport`
function to the `qa` object.

```{r report, eval=FALSE}
reportFile <- rqcReport(qa)
browseURL(reportFile)
```

The report generated by `rqcReport` contains all the plots described
at the beginning of this document. By default, it is written to a
temporary directory. This behavior can be modified by setting the
`outdir` argument.
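
For example, the report could be written to a chosen directory as sketched below. The directory name is hypothetical and `qa` is assumed to be the object created in the previous section.

```{r report-outdir, eval=FALSE}
# Hypothetical output directory, created here if it does not exist
outDir <- file.path(tempdir(), "rqc_reports")
dir.create(outDir, showWarnings = FALSE)

# `qa` is the list returned by rqcQA() above
reportFile <- rqcReport(qa, outdir = outDir)
reportFile   # path of the generated report
```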

# Parallel processing

The Rqc package performs parallel processing of the samples by
interfacing with the `r Biocpkg('BiocParallel')` package. By default,
Rqc calls the `multicoreWorkers` function, which returns the maximum
number of workers available. The number of workers can be changed
through the `workers` parameter of the `rqc` and `rqcQA` functions. To
process the input files serially, set `workers=1`.
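
The sketch below shows one way to inspect the default and to cap the number of workers, assuming the `files` vector defined in the Advanced Workflow section. The cap of two workers and the name `qaTwo` are arbitrary choices for illustration.

```{r workers-option, eval=FALSE}
library(BiocParallel)

# Number of workers that Rqc would use by default
multicoreWorkers()

# Use at most two workers; `files` is the vector of input paths defined earlier
qaTwo <- rqcQA(files, workers = min(2L, multicoreWorkers()))
```

Capping the number of workers can be useful on shared machines, where using every available core is not desirable.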

# Graphics

For each plot generated by Rqc, there is a function that *shapes* the
data appropriately. The shaped information is then used to produce the
final plot. The example below shows how the user can access these data
to generate plots using other tools.

```{r calc}
df <- rqcCycleAverageQualityCalc(qa)
cycle <- as.numeric(levels(df$cycle))[df$cycle]
plot(cycle, df$quality, col = df$filename, xlab='Cycle', ylab='Quality Score')
```
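
The same shaped data can be passed to other plotting tools. As a sketch, assuming the `qa` object from above and the `cycle`, `quality` and `filename` columns used in the previous chunk, a line plot could be built with ggplot2 as follows. The names `dfPlot` and `p` are illustrative.

```{r calc-ggplot, eval=FALSE}
library(ggplot2)

# Shape the data and convert the `cycle` factor to numeric, as above
dfPlot <- rqcCycleAverageQualityCalc(qa)
dfPlot$cycle <- as.numeric(levels(dfPlot$cycle))[dfPlot$cycle]

# One line per input file
p <- ggplot(dfPlot, aes(x = cycle, y = quality, colour = filename)) +
  geom_line() +
  labs(x = "Cycle", y = "Quality Score", colour = "File")
p
```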

A subset of the results can also be plotted by subsetting the list.

```{r subset}
sublist <- qa[1]
rqcCycleQualityPlot(sublist)
```

# Writing personalized quality control reports 

Rqc accepts R Markdown files as templates for generating custom reports. 
Markdown is a lightweight markup language commonly used for web content. 
R Markdown files are regular Markdown files with embedded R code. 
Every code chunk is executed when the file is compiled by the knitr package. 
knitr takes an R Markdown file and generates a Markdown file that merges the R code with its output, such as text, tables and figures. 
Rqc then uses the resulting Markdown file to generate the final report in HTML or PDF format. 
The source file of Rqc's default report is a good reference for writing new templates. 
The code below returns the path to this source file. 

```{r default-report-path, eval=FALSE}
system.file(package = "Rqc", "templates", "rqc_report.Rmd") 
```

Basic concepts for writing Markdown documents with R code are available at http://rmarkdown.rstudio.com/. 
Advanced concepts about the compilation process of R Markdown files are available at http://yihui.name/knitr/. 
The Bioconductor project provides style guidelines for HTML documents at http://www.bioconductor.org/packages/release/bioc/vignettes/BiocStyle/inst/doc/HtmlStyle.html. 

Inside template files, the Rqc result data is available through the `rqcResultSet` object. 
This object is a list of summarized statistics about the input files, and it is used by all accessor and plotting functions provided by the Rqc package. 
The `rqcReport` function takes the path to a template file as an argument and generates a personalized report. 

```{r rqc-report-custom-template, eval=FALSE}
rqcReport(qa, templateFile = "custom_report.Rmd") 
```
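
As a rough sketch of what a small template might look like, the base R code below writes a hypothetical two-chunk template to a temporary file. The template content is an assumption made for illustration: it uses only the `rqcResultSet` object and functions shown in this vignette, and the default template may contain elements that this sketch omits.

```{r custom-template-sketch, eval=FALSE}
# Hypothetical minimal template: a table of file information and one plot
templateLines <- c(
  "# Custom quality control report",
  "",
  "```{r, echo=FALSE}",
  "knitr::kable(perFileInformation(rqcResultSet))",
  "```",
  "",
  "```{r, echo=FALSE}",
  "rqcCycleQualityPlot(rqcResultSet)",
  "```"
)

# Write the template to a temporary location
templateFile <- file.path(tempdir(), "custom_report.Rmd")
writeLines(templateLines, templateFile)
```

The resulting path can then be supplied as the `templateFile` argument of `rqcReport`, in the same way as in the chunk above.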

# Final Considerations

The Rqc package provides a simple interface to generate plots often
used for quality assessment of high-throughput sequencing data. It
uses the standard Bioconductor parallelization framework to make
data processing more efficient. The images produced by the package are
high-quality figures that can be used directly in publications.

# Session Information

```{r sessionInfo, echo=FALSE}
sessionInfo()
```