ngsReports 1.8.1
The package ngsReports
is designed to bolt into data processing pipelines and
produce combined plots for multiple FastQC reports generated across an entire
set of libraries or samples.
The primary functionality of the package is parsing FastQC reports, with import
methods also implemented for log files produced by tools such as STAR, hisat2
and others.
In addition to parsing files, default plotting methods are implemented.
Plots applied to a single file will replicate the default plots from
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
whilst methods applied to multiple FastQC reports summarise these and produce
a series of custom plots.
Plots are produced as standard ggplot2 objects, with an interactive option available using plotly. As well as custom summary plots, tables of read counts and the like can also be easily generated.
In addition to the usage demonstrated below, a shiny
app has been developed
for interactive viewing of FastQC reports.
This can be installed using:
remotes::install_github("UofABioinformaticsHub/shinyNgsreports")
A vignette for this app will be installed with the shinyNgsreports
package.
In its simplest form, a default summary report can be generated by
specifying a directory containing the output from FastQC and calling the
function writeHtmlReport().
library(ngsReports)
fileDir <- file.path("path", "to", "your", "FastQC", "Reports")
writeHtmlReport(fileDir)
This function will transfer the default template to the provided directory and
produce a single .html
file containing interactive summary plots of any
FastQC output found in the directory.
FastQC output can be *fastqc.zip
files or the same files extracted as
individual directories.
The default template is provided as ngsReports_Fastqc.Rmd in the package
directory.
To customise the report, edit a copy of this template and pass its path to the
above function using the template argument.
altTemplate <- file.path("path", "to", "your", "new", "template.Rmd")
writeHtmlReport(fileDir, template = altTemplate)
The package ngsReports
introduces two main S4
classes:
FastqcData
& FastqcDataList
FastqcData objects hold the parsed data from a single report as
generated by the stand-alone tool FastQC.
These are then extended into lists for more than one file as a
FastqcDataList.
For most users, the primary class of interest will be the FastqcDataList.
To load a set of FastQC reports into R as a FastqcDataList, specify the
vector of file paths, then call the function FastqcDataList().
In the rare case you’d like an individual file, this can be performed by
calling FastqcData() on an individual file, or by subsetting the output from
FastqcDataList() using the [[ ]] operator, as with any list object.
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)
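As a brief sketch of the subsetting described above, the following inspects the loaded object. It assumes the example reports shipped with the package load as in the previous chunk; the object names fd, firstTwo and fdSingle are illustrative only and are not used elsewhere in this vignette.

```r
library(ngsReports)

# Example reports shipped with the package (same files as above)
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# One element per report
length(fdl)

# Double brackets extract a single FastqcData object
fd <- fdl[[1]]
methods::is(fd, "FastqcData")

# Single brackets keep the FastqcDataList class
firstTwo <- fdl[1:2]

# Alternatively, parse one report directly (illustrative name)
fdSingle <- FastqcData(files[1])
```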
From here, all FastQC modules can be obtained as a tibble (i.e. data.frame)
using the function getModule() and choosing one of the following modules:
Summary (The PASS/WARN/FAIL status for each module)
Basic_Statistics
Per_base_sequence_quality
Per_sequence_quality_scores
Per_base_sequence_content
Per_sequence_GC_content
Per_base_N_content
Sequence_Length_Distribution
Sequence_Duplication_Levels
Overrepresented_sequences
Adapter_Content
Kmer_Content
Per_tile_sequence_quality
getModule(fdl[[1]], "Summary")
## # A tibble: 12 x 3
## Filename Status Category
## <chr> <chr> <chr>
## 1 ATTG_R1.fastq PASS Basic Statistics
## 2 ATTG_R1.fastq FAIL Per base sequence quality
## 3 ATTG_R1.fastq WARN Per tile sequence quality
## 4 ATTG_R1.fastq PASS Per sequence quality scores
## 5 ATTG_R1.fastq FAIL Per base sequence content
## 6 ATTG_R1.fastq FAIL Per sequence GC content
## 7 ATTG_R1.fastq PASS Per base N content
## 8 ATTG_R1.fastq PASS Sequence Length Distribution
## 9 ATTG_R1.fastq FAIL Sequence Duplication Levels
## 10 ATTG_R1.fastq FAIL Overrepresented sequences
## 11 ATTG_R1.fastq FAIL Adapter Content
## 12 ATTG_R1.fastq FAIL Kmer Content
Capitalisation and spelling of these module names follow the default patterns from FastQC reports, with spaces replaced by underscores. One additional module is available and is taken directly from the text within the supplied reports:
Total_Duplicated_Percentage
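The same call can be applied to the whole list rather than a single element. The sketch below assumes getModule() accepts a FastqcDataList in the same way as a single FastqcData object, and uses base R's table() to count flags; the names summ and statusCounts are illustrative.

```r
library(ngsReports)

# Example reports shipped with the package
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# PASS/WARN/FAIL flags for every module in every report
summ <- getModule(fdl, "Summary")

# Cross-tabulate the flags by module across all files
statusCounts <- table(summ$Category, summ$Status)
statusCounts
```

A table like this gives a quick numeric view of which modules fail most often across a library.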
In addition, the read totals for each file in the library can be obtained
using readTotals(), which can be easily used to make a table of read totals.
This essentially just returns the first two columns from
getModule(x, "Basic_Statistics").
reads <- readTotals(fdl)
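To make the relationship with the Basic_Statistics module concrete, the following sketch places the two results side by side. It assumes the first two columns of that module are Filename and Total_Sequences, as the table of read totals further down suggests; bs is an illustrative name.

```r
library(ngsReports)

# Example reports shipped with the package
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

reads <- readTotals(fdl)

# The full module holds further columns beyond the read totals
bs <- getModule(fdl, "Basic_Statistics")
colnames(bs)

# Keeping only the first two columns should resemble readTotals()
bs[, 1:2]
```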
The packages dplyr
and pander
can also be extremely useful for manipulating
and displaying imported data.
To show only the R1 read totals, you could do the following:
library(dplyr)
library(pander)
reads %>%
    dplyr::filter(grepl("R1", Filename)) %>%
    pander(
        big.mark = ",",
        caption = "Read totals from R1 libraries",
        justify = "lr"
    )
Filename | Total_Sequences |
---|---|
ATTG_R1.fastq | 24,978 |
CCGC_R1.fastq | 22,269 |
GACC_R1.fastq | 10,287 |
Plots created from a single FastqcData object will resemble those generated
by the FastQC tool, whilst those created from a FastqcDataList will be
combined summaries across a library of files.
In addition, all plots can be generated as interactive plots using the
argument usePlotly = TRUE.
All FastQC modules have been enabled for plotting using default S4 dispatch,
with the exception of Per_tile_sequence_quality.
The simplest of the plots summarises the PASS/WARN/FAIL flags as
produced by FastQC for each module.
This plot can be generated using plotSummary():
plotSummary(fdl)
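The interactive option mentioned earlier uses the same call with usePlotly = TRUE. A minimal sketch, assuming the plotly package is installed alongside ngsReports; the object names pStatic and pInteractive are illustrative.

```r
library(ngsReports)

# Example reports shipped with the package
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Static version, returned as a ggplot2 object
pStatic <- plotSummary(fdl)

# Interactive version of the same summary
pInteractive <- plotSummary(fdl, usePlotly = TRUE)
pInteractive
```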
The next most informative plot may be a summary of the total number of reads
in each associated FASTQ file.
By default, the number of duplicated sequences from the
Total_Duplicated_Percentage module is shown, but this can be disabled by
setting duplicated = FALSE.
plotReadTotals(fdl)
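For comparison, the variant without the duplication estimate can be sketched as follows, using the duplicated argument described above; pTotals is an illustrative name.

```r
library(ngsReports)

# Example reports shipped with the package
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Read totals only, without the duplicated sequences
pTotals <- plotReadTotals(fdl, duplicated = FALSE)
pTotals
```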
As these are ggplot2
objects, the output can be modified easily using
conventional ggplot2
syntax.
Here we’ll move the legend to the top right as an example.
plotReadTotals(fdl) +
    theme(
        legend.position = c(1, 1),
        legend.justification = c(1, 1),
        legend.background = element_rect(colour = "black")
    )
Turning to the Per base sequence quality scores is the next most common step
for most researchers, and these can be obtained for an individual file by
selecting this as an element (i.e. a FastqcData object) of the main
FastqcDataList object.
This plot replicates the default plots from a FastQC report.
plotBaseQuals(fdl[[1]])
When working with multiple FastQC reports, these are summarised as a heatmap using the mean quality score at each position.
plotBaseQuals(fdl)
Boxplots of any combination of files can also be drawn from a FastqcDataList by
setting the argument plotType = "boxplot".
However, this may not be suitable for datasets with a large number of libraries.
plotBaseQuals(fdl[1:4], plotType = "boxplot")
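One way to keep such a plot manageable is to subset the list before plotting. The sketch below keeps only the R1 reports and assumes the elements of the FastqcDataList are in the same order as the vector of file paths used to create it; isR1 and fdlR1 are illustrative names.

```r
library(ngsReports)

# Example reports shipped with the package
fileDir <- system.file("extdata", package = "ngsReports")
files <- list.files(fileDir, pattern = "fastqc.zip$", full.names = TRUE)
fdl <- FastqcDataList(files)

# Assumption: fdl follows the order of `files`
isR1 <- grepl("R1", basename(files))
fdlR1 <- fdl[isR1]

# Boxplots for the reduced set of reports
plotBaseQuals(fdlR1, plotType = "boxplot")
```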