--- title: "An introduction to QuasR" date: "`r format(Sys.time(), '%d %B, %Y')`" bibliography: QuasR-refs.bib author: - Michael Stadler - Dimos Gaidatzis - Charlotte Soneson - Anita Lerch package: QuasR output: BiocStyle::html_document: toc_float: true vignette: > %\VignetteIndexEntry{An introduction to QuasR} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- # Introduction The `r Biocpkg("QuasR")` package (short for Quantify and annotate short reads in R) integrates the functionality of several **R** packages (such as `r Biocpkg("IRanges")` [@IRanges] and `r Biocpkg("Rsamtools")`) and external software (e.g. `bowtie`, through the `r Biocpkg("Rbowtie")` package, and `HISAT2`, through the `r Biocpkg("Rhisat2")` package). The package aims to cover the whole analysis workflow of typical high throughput sequencing experiments, starting from the raw sequence reads, over pre-processing and alignment, up to quantification. A single **R** script can contain all steps of a complete analysis, making it simple to document, reproduce or share the workflow containing all relevant details. The current `r Biocpkg("QuasR")` release supports the analysis of single read and paired-end ChIP-seq (chromatin immuno-precipitation combined with sequencing), RNA-seq (gene expression profiling by sequencing of RNA) and Bis-seq (measurement of DNA methylation by sequencing of bisulfite-converted genomic DNA) experiments. It has been successfully used with data from Illumina, 454 Life Technologies and SOLiD sequencers, the latter by using bam files created externally of `r Biocpkg("QuasR")`. # Preliminaries ## Citing `r Biocpkg("QuasR")` If you use `r Biocpkg("QuasR")` [@QuasR] in your work, you can cite it as follows: ```{r cite, eval=TRUE} citation("QuasR") ``` ## Installation{#installation} `r Biocpkg("QuasR")` is a package for the **R** computing environment and it is assumed that you have already installed **R**. See the **R** project at (http://www.r-project.org). To install the latest version of `r Biocpkg("QuasR")`, you will need to be using the latest version of **R**. `r Biocpkg("QuasR")` is part of the Bioconductor project at (http://www.bioconductor.org). To get `r Biocpkg("QuasR")` together with its dependencies you can use ```{r install, eval=FALSE} if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("QuasR") ``` Bioconductor works on a 6-monthly official release cycle. As with other Bioconductor packages, there are always two versions of `r Biocpkg("QuasR")`. Most users will use the current official release version, which will be installed by `BiocManager::install` if you are using the current release version of **R**. There is also a development version of `r Biocpkg("QuasR")` that includes new features due for the next official release. The development version will be installed if you are using the development version of **Bioconductor** (see `version = "devel"` in `r Biocpkg("BiocManager")`). The official release version always has an even second number (for example 1.20.1), whereas the developmental version has an odd second number (for example 1.21.4). ## Loading of *QuasR* and other required packages In order to run the code examples in this vignette, the `r Biocpkg("QuasR")` package and a few additional packages need to be loaded: ```{r loadLibraries, eval=TRUE} suppressPackageStartupMessages({ library(QuasR) library(BSgenome) library(Rsamtools) library(rtracklayer) library(GenomicFeatures) library(Gviz) }) ``` ## How to get help Most questions about `r Biocpkg("QuasR")` will hopefully be answered by the documentation or references. If you've run into a question which isn't addressed by the documentation, or you've found a conflict between the documentation and software, then there is an active support community which can offer help. The authors of the package (maintainer: `r maintainer("QuasR")`) always appreciate receiving reports of bugs in the package functions or in the documentation. The same goes for well-considered suggestions for improvements. Any other questions or problems concerning `r Biocpkg("QuasR")` should be posted to the Bioconductor support site (https://support.bioconductor.org). Users posting to the support site for the first time should read the helpful posting guide at (https://support.bioconductor.org/info/faq/). Note that each function in `r Biocpkg("QuasR")` has it's own help page, as described in the section \@ref(introToR). Posting etiquette requires that you read the relevant help page carefully before posting a problem to the site. # Quick Start ## A brief introduction to **R**{#introToR} If you already use **R** and know about its interface, just skip this section and continue with section \@ref(sampleQuasRsession). The structure of this vignette and in particular this section is based on the excellent user guide of the `r Biocpkg("limma")` package, which we would like to hereby acknowledge. **R** is a program for statistical computing. It is a command-driven language meaning that you have to type commands into it rather than pointing and clicking using a mouse. In this guide it will be assumed that you have successfully downloaded and installed **R** from (http://www.r-project.org) as well as `r Biocpkg("QuasR")` (see section \@ref(installation)). A good way to get started is to type ```{r help1, eval=FALSE} help.start() ``` at the **R** prompt or, if you're using **R** for Windows, to follow the drop-down menu items *Help* $\succ$ *Html help*. Following the links *Packages* $\succ$ `r Biocpkg("QuasR")` from the html help page will lead you to the contents page of help topics for functions in `r Biocpkg("QuasR")`. Before you can use any `r Biocpkg("QuasR")` commands you have to load the package by typing ```{r loadQuasRLibrary, eval=FALSE} library(QuasR) ``` at the **R** prompt. You can get help on any function in any loaded package by typing `?` and the function name at the **R** prompt, for example ```{r help2, eval=FALSE} ?preprocessReads ``` or equivalently ```{r help3, eval=FALSE} help("preprocessReads") ``` for detailed help on the `preprocessReads` function. The function help page is especially useful to learn about its arguments and its return value. Working with **R** usually means creating and transforming *objects*. Objects might include data sets, variables, functions, anything at all. For example ```{r assign, eval=FALSE} x <- 2 ``` will create a variable `x` and will assign it the value 2. At any stage of your **R** session you can type ```{r ls, eval=FALSE} ls() ``` to get a list of all the objects currently existing in your session. You can display an object by typing its name on the prompt. The following displays the object `x`: ```{r printObject, eval=FALSE} x ``` We hope that you can use `r Biocpkg("QuasR")` without having to spend a lot of time learning about the **R** language itself but a little knowledge in this direction will be very helpful, especially when you want to do something not explicitly provided for in `r Biocpkg("QuasR")` or in the other Bioconductor packages. For more details about the **R** language see *An Introduction to R* which is available from the online help, or one of the many great online resources, like the documentation at [r-project.org](https://www.r-project.org/), the growing list of free books at [bioconductor.org](https://bioconductor.org/books/release/), or the books from [rstudio.com](https://rstudio.com/resources/books/) (many of which are also available for free). For more background on using **R** for statistical analysis see [@Dalgaard]. ## Sample *QuasR* session{#sampleQuasRsession} This is a quick overview of what an analysis could look like for users preferring to jump right into an analysis. The example uses data that is provided with the `r Biocpkg("QuasR")` package, which is first copied to the current working directory, into a subfolder called `"extdata"`: ```{r SampleSession1, eval=TRUE} file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE) ``` The sequence files to be analyzed are listed in `sampleFile` (see section \@ref(sampleFile) for details). The sequence reads will be aligned using `bowtie` [@bowtie] (from the `r Biocpkg("Rbowtie")` package [@Rbowtie]) to a small reference genome (consisting of three short segments from the hg19 human genome assembly, available in full for example in the `r Biocpkg("BSgenome.Hsapiens.UCSC.hg19")` package). Information on selecting an appropriate reference genome is summarized in section \@ref(genome). Make sure that you have sufficient disk space, both in your **R** temporary directory (`tempdir()`) as well as to store the resulting alignments (see section \@ref(qAlign)). ```{r SampleSession2, eval=TRUE} sampleFile <- "extdata/samples_chip_single.txt" genomeFile <- "extdata/hg19sub.fa" proj <- qAlign(sampleFile, genomeFile) proj ``` The `proj` object keeps track of all the information of a sequencing experiment, for example where sequence and alignment files are stored, and what aligner and reference genome was used to generate the alignments. Now that the alignments have been generated, further analyses can be performed. A quality control report is saved to the `"extdata/qc_report.pdf"` file using the `qQCReport` function. ```{r SampleSession3, eval=TRUE} qQCReport(proj, "extdata/qc_report.pdf") ``` The number of alignments per promoter region is quantified using `qCount`. Genomic coordinates for promoter regions are imported from a gtf file (`annotFile`) into the `GRanges`-object with the name `promReg`: ```{r SampleSession4, eval=TRUE} library(rtracklayer) library(GenomicFeatures) annotFile <- "extdata/hg19sub_annotation.gtf" txStart <- import.gff(annotFile, format = "gtf", feature.type = "start_codon") promReg <- promoters(txStart, upstream = 500, downstream = 500) names(promReg) <- mcols(promReg)$transcript_name promCounts <- qCount(proj, query = promReg) promCounts ``` # *QuasR* Overview The following scheme shows the major components of `r Biocpkg("QuasR")` and their relationships: ![](QuasR-scheme.png) `r Biocpkg("QuasR")` works with data (sequences and alignments, reference genome, etc.) that are stored as files on your storage (the gray cylinder on the lower left of Figure above, see section \@ref(fileStorageLocations) for details on storage locations). `r Biocpkg("QuasR")` does not need a database management system, or these files to be named and organized according to a specific scheme. In order to keep track of directory paths during an analysis, `r Biocpkg("QuasR")` makes use of a `qProject` object that is returned by the `qAlign` function, which at the minimum requires two inputs: the name of a samples text file (see section \@ref(sampleFile) for details), and the reference genome for the alignments (see section \@ref(genome)). The `qProject` object is the main argument passed to subsequent functions such as `qQCReport` and `qCount`. The `qProject` object contains all necessary information on the current project and eliminates the need to repeatedly enter the same information. All functions that work on `qProject` objects can be recognized by their names starting with the letter *q*. Read quantification (apart from quantification of methylation which has its own function `qMeth`) is done using the `qCount` function: It counts the alignments in regions of interest (e.g. promoters, genes, exons, etc.) and produces a count table (regions in rows, samples in columns) for further visualization and analysis. The count table can also be used as input to a statistical analysis using packages such as `r Biocpkg("edgeR")` [@edgeR], `r Biocpkg("DESeq")` [@DESeq], `r Biocpkg("DESeq2")` [@DESeq2], `r Biocpkg("TCC")` [@TCC], `r Biocpkg("DEXSeq")` [@DEXSeq] or `r Biocpkg("baySeq")` [@baySeq]. In summary, a typical `r Biocpkg("QuasR")` analysis consists of the following steps (some of them are optional): - `preprocessReads` (optional): Remove adapters from start or end of reads, filter out reads of low quality, short length or low complexity (section \@ref(preProcessing)). - Prepare *samples file*: List sequence files or alignments, provide sample names (section \@ref(sampleFile)). - Prepare *auxiliary file* (optional): List additional reference sequences for alignment of reads not matching the reference genome (section \@ref(auxiliaryFile)). - `qAlign`: Create `qProject` object and specify project parameters. Also download BSgenome package, create aligner indices and align reads if not already existing (section \@ref(qAlign)). - `qQCReport` (optional): Create quality control report with plots on sequence qualities and alignment statistics (section \@ref(qQCReport)). - `qExportWig` (optional): Export genomic alignments as wiggle tracks for genome browser visualization (section \@ref(qExportWig)). - `qCount`: Quantify alignments in regions of interest (section \@ref(qCount)). Recurrent example tasks that may be part of any typical analysis are described in section \@ref(exampleTasks). Example workflows for specific experiment types (ChIP-seq, RNA-seq and Bis-seq) are described in section \@ref(exampleWorkflows). ## File storage locations{#fileStorageLocations} Apart from `qExportWig` and `qQCReport`, which generate wig files and pdf reports, `qAlign` is the only function in `r Biocpkg("QuasR")` that stores files on the disk (see section \@ref(qAlign) for details). All files generated by `qAlign` are listed here by type, together with their default location and how locations can be changed. - *Temporary files* (default: `tempdir()`): Temporary files include reference genomes in `fasta` format, decompressed input sequence files, and temporary alignments in text format, and can require a large amount of disk space. By default, these files will be written to the temporary directory of the **R** process (as reported by `tempdir()`). If using `clObj` for parallel processing, this may be the `tempdir()` from the cluster node(s). An alternative location can be set using the `TMPDIR` environment variable (see `?tempdir`). - *Alignment files (`bam` format)* (default: same directory as the input sequence files): Alignments against reference genome and auxiliary targets are stored in `bam` format in the same directory that also contains the input sequence file (listed in `sampleFile`). Please note that if the input sequence file corresponds to a symbolic link, `r Biocpkg("QuasR")` will follow the link and use the directory of the original file instead. An alternative directory can be specified with the `alignmentsDir` argument from `qAlign`, which will store all `bam` files in that directory even if the input sequence files are located in different directories. - *Alignment index files* (default: depends on `genome` and `snpFile` arguments): Many alignment tools including `bowtie` require an index of the reference sequence to perform alignments. If necessary, `qAlign` will build this index automatically and store it in a default location that depends on the `genome` argument: + `BSgenome`: If `genome` is the name of a `r Biocpkg("BSgenome")` package (such as `"BSgenome.Hsapiens.UCSC.hg19"`), the index will be stored as a new **R** package in the default library path (as reported by `.libPaths()[1]`, see `?install.packages` for details), or alternatively in the library specified by the `lib.loc` argument of `qAlign`. The name of this index package will be the name of the original `r Biocpkg("BSgenome")` package with a suffix for the index type, for example `"BSgenome.Hsapiens.UCSC.hg19.Rbowtie"`. + `fasta`: If `genome` refers to a reference genome file in `fasta` format, the index will be stored in a subdirectory at the same location. Similarly, the indices for files listed in `auxiliaryFile` are store at the location of these files. For example, the `Rbowtie` index for the genome at `"./genome/mm9.fa"` is stored in `"./genome/mm9.fa.Rbowtie"`. + *Allele-specific analysis*: A special case is the allele-specific analysis, where reference and alternative alleles listed in `snpFile` (e.g. `"./mySNPs.tab"`) are injected into the `genome` (e.g. `"BSgenome.Mmusculus.UCSC.mm9"`) to create two variant genomes to be indexed. These indices are saved at the location of the `snpFile` in a directory named after `snpFile`, `genome` and the index type (e.g. `"./mySNPs.tab.BSgenome.Mmusculus.UCSC.mm9.A.fa.Rbowtie"`). # Example tasks{#exampleTasks} ## Create a sample file{#sampleFile} The sample file is a tab-delimited text file with two or three columns. The first row contains the column names: For a single read experiment, these are 'FileName' and 'SampleName'; for a paired-end experiment, these are 'FileName1', 'FileName2' and 'SampleName'. If the first row does not contain the correctly spelled column names, `r Biocpkg("QuasR")` will not accept the samples file. Subsequent rows contain the input sequence files. Here are examples of such sample files for a single read experiment:
```{r sampleFileSingle, echo=FALSE, results="asis"} cat(paste(readLines(system.file(package = "QuasR", "extdata", "samples_chip_single.txt")), collapse = "\n")) ```and for a paired-end experiment:
```{r sampleFilePaired, echo=FALSE, results="asis"} cat(paste(readLines(system.file(package = "QuasR", "extdata", "samples_rna_paired.txt")), collapse = "\n")) ```These example files are also contained in the `r Biocpkg("QuasR")` package and may be used as templates. The path of the files can be determined using: ```{r sampleFile, eval=FALSE} sampleFile1 <- system.file(package="QuasR", "extdata", "samples_chip_single.txt") sampleFile2 <- system.file(package="QuasR", "extdata", "samples_rna_paired.txt") ``` The columns *FileName* for single-read, or *FileName1* and *FileName2* for paired-end experiments contain paths and names to files containing the sequence data. The paths can be absolute or relative to the location of the sample file. This allows combining files from different directories in a single analysis. For each input sequence file, `qAlign` will create one alignment file and by default store it in the same directory as the sequence file. Already existing alignment files with identical parameters will not be re-created, so that it is easy to reuse the same sequence files in multiple projects without unnecessarily copying sequence files or recreating alignments. The *SampleName* column contains sample names for each sequence file. The same name can be used on several lines to indicate multiple sequence files that belong to the same sample (`qCount` can use this information to automatically combine counts for one sample from multiple files). Three file formats are supported for input files (but cannot be mixed within a single sample file): - **fasta** files have names that end with '.fa', '.fna' or '.fasta'. They contain only sequences (and no base qualities) and will thus by default be aligned on the basis of mismatches (the best alignment is the one with fewest mismatches). - **fastq** files have names that end with '.fq' or '.fastq'. They contain sequences and corresponding base qualities and will be aligned by default using these qualities. The encoding scheme of base qualities is automatically detected for each individual fastq file. - **bam** files have names that end with '.bam'. They can be used if the sequence reads have already been aligned outside of `r Biocpkg("QuasR")`, and `r Biocpkg("QuasR")` will only be used for downstream analysis based on the alignments contained in the `bam` files. This makes it possible to use alignment tools that are not available within `r Biocpkg("QuasR")`, but making use of this option comes with a risk and should only be used by experienced users. For example, it cannot be guaranteed any more that certain assumptions made by `qCount` are fulfilled by the external aligner (see below). When using external `bam` files, we recommend to use files which contain only one alignment per read. This may also include multi-hit reads, for which one of the alignments is randomly selected. This allows `r Biocpkg("QuasR")` to count the total number of reads by counting the total number of alignments. Furthermore, if the `bam` files also contain the unmapped reads, `r Biocpkg("QuasR")` will be able to calculate the fraction of mapped reads. For bisulfite samples we require ungapped alignments stored in unpaired or paired *ff* orientation (even if the input reads are *fr*). For allele-specific `bam` files, `r Biocpkg("QuasR")` requires an additional tag for each alignment called `XV` of type `A` (printable character) with the possible values `R` (Reference), `U` (Unknown) and `A` (Alternative). **fasta** and **fastq** files can be compressed with gzip, bzip2 or xz (file extensions '.gz', '.bz2' or 'xz', respectively) and will be automatically decompressed when necessary. ### Working only with `bam` files after performing alignments Once alignments have been created, most analyses will only require the `bam` files and will not access the original raw sequence files anymore. However, re-creating a `qProject` object by a later identical call to `qAlign` will still need access to the raw sequences to verify consistency between raw data and alignments. It may be desirable to remove this dependency, for example to archive or move away the raw sequence files and to reclaim used disk space. This can be achieved using the following procedure involving two sequential calls to `qAlign`. First, `qAlign` is called with the orignial sample file (`sampleFile1`) that lists the raw sequence files, and subsequently with a second sample file (`sampleFile2`) that lists the `bam` files generated in the first call. Such a second sample file can be easily generated given the `qProject` object (`proj1`) returned by the first call: ```{r
```{r snpFile, echo=FALSE, results="asis"} cat(paste(c(readLines(system.file(package = "QuasR", "extdata", "hg19sub_snp.txt"))[1:4], "..."), collapse = "\n")) ```For a given locus, either reference or alternative allele may but does not have to be identical to the sequence of the reference genome (`genomeFile` in this case). To avoid an alignment bias, all reads are aligned separately to each of the two new genomes, which `r Biocpkg("QuasR")` generates by *injecting* the SNPs listed in `snpFile` into the reference genome. Finally, the two alignment files are combined, only retaining the best alignment for each read. While this procedure takes more than twice as long as aligning against a single genome, it has the advantage to correctly align reads even in regions of high SNP density and has been shown to produce more accurate results. While combining alignments, each read is classified into one of three groups (stored in the `bam` files under the `XV` tag): - **R**: the read aligned better to the **reference** genome - **U**: the read aligned equally well to both genomes (**unknown** origin) - **A**: the read aligned better to the **alternative** genome Using these alignment classifications, the `qCount` and `qMeth` functions will produce three counts instead of a single count; one for each class. The column names will be suffixed by `_R`, `_U` and `_A`. The examples below use data provided with the `r Biocpkg("QuasR")` package, which is first copied to the current working directory, into a subfolder called `"extdata"`: ```{r Alelle_Extdata, eval=TRUE} file.copy(system.file(package = "QuasR", "extdata"), ".", recursive = TRUE) ``` The example below aligns the same reads that were also used in the ChIP-seq workflow (section \@ref(ChIPseq)), but this time using a `snpFile`: ```{r Allele_qAlign, eval=TRUE} sampleFile <- "extdata/samples_chip_single.txt" genomeFile <- "extdata/hg19sub.fa" snpFile <- "extdata/hg19sub_snp.txt" proj1SNP <- qAlign(sampleFile, genome = genomeFile, snpFile = snpFile) proj1SNP ``` Instead of one count per promoter region and sample, `qCount` now returns three (`promRegSel` was generated in the ChIP-seq example workflow): ```{r Allele_qCount, eval=TRUE} head(qCount(proj1, promRegSel)) head(qCount(proj1SNP, promRegSel)) ``` The example below illustrates use of a `snpFile` for Bis-seq experiments. Similarly as for `qCount`, the count types are labeled by `R`, `U` and `A`. These labels are added to the total and methylated column suffixes `_T` and `_M`, resulting in a total of six instead of two counts per feature and sample: ```{r Allele_Bis, eval=TRUE} sampleFile <- "extdata/samples_bis_single.txt" genomeFile <- "extdata/hg19sub.fa" proj4SNP <- qAlign(sampleFile, genomeFile, snpFile = snpFile, bisulfite = "dir") head(qMeth(proj4SNP, mode = "CpGcomb", collapseBySample = TRUE)) ``` # Description of Individual *QuasR* Functions Please refer to the `r Biocpkg("QuasR")` reference manual or the function documentation (e.g. using `?qAlign`) for a complete description of `r Biocpkg("QuasR")` functions. The descriptions provided below are meant to give an overview over all functions and summarize the purpose of each one. ## `preprocessReads` {#preprocessReads} The `preprocessReads` function can be used to prepare the input sequences before alignment to the reference genome, for example to filter out low quality reads unlikely to produce informative alignments. When working with paired-end experiments, the paired reads are expected to be contained in identical order in two separate files. For this reason, both reads of a pair are filtered out if any of the two reads fulfills the filtering criteria. The following types of filtering tasks can be performed (in the order as listed): 1. **Truncate reads**: remove nucleotides from the start and/or end of each read. 2. **Trim adapters**: remove nucleotides at the beginning and/or end of each read that match to a defined (adapter) sequence. The adapter trimming is done by calling `trimLRPatterns` from the `r Biocpkg("Biostrings")` package [@Biostrings]. 3. **Filter out low quality reads**: Filter out reads that fulfill any of the filtering criteria (contain more than `nBases` `N` bases, are shorter than `minLength` or have a dinucleotide complexity of less than `complexity`-times the average complexity of the human genome sequence). The dinucleotide complexity is calculated in bits as Shannon entropy using the following formula $-\sum_i f_i \cdot \log_2 f_i$, where $f_i$ is the frequency of dinucleotide $i$ ($i=1, 2, ..., 16$). ## `qAlign` {#qAlign} `qAlign` is the function that generates alignment files in `bam` format, for all input sequence files listed in `sampleFile` (see section \@ref(sampleFile)), against the reference genome (`genome` argument), and for reads that do not match to the reference genome, against one or several auxiliary target sequences (`auxiliaryFile`, see section \@ref(auxiliaryFile)). The reference genome can be provided either as a `fasta` sequence file or as a `BSgenome` package name (see section \@ref(genome)). If a `BSgenome` package is not found in the installed packages, `qAlign` will abort with a description how the missing package can be installed from Bioconductor. The alignment program is set by `aligner`, and parameters by `maxHits`, `paired`, `splicedAlignment` and `alignmentParameter`. Currently, `aligner` can only be set to `"Rbowtie"`, which is a wrapper for *bowtie* [@bowtie] and *SpliceMap* [@SpliceMap], or `"Rhisat2"`, which is a wrapper for *HISAT2* [@hisat2]. When `aligner="Rbowtie"`, *SpliceMap* will be used if `splicedAlignment=TRUE` (not recommended anymore except for reproducing older analyses). However, it is recommended to create spliced alignment using `splicedAlignment=TRUE, aligner="Rhisat2"`, which will use the *HISAT2* aligner and typically leads to more sensistive alignments and shorter alignment times compared to *SpliceMap*. The alignment strategy is furthermore affected by the parameters `snpFile` (alignments to variant genomes containing sequence polymorphisms) and `bisulfite` (alignment of bisulfite-converted reads). Finally, `clObj` can be used to enable parallelized alignment, sorting and conversion to `bam` format. For each input sequence file listed in `sampleFile`, one `bam` file with alignments to the reference genome will be generated, and an additional one for each auxiliary sequence file listed in `auxiliaryFile`. By default, these `bam` files are stored at the same location as the sequence files, unless a different location is specified under `alignmentsDir`. If compatible alignment files are found at this location, they will not be regenerated, which allows re-use of the same sequencing samples in multiple analysis projects by listing them in several project-specific `sampleFile`s. The alignment process produces temporary files, such as decompressed input sequence files or raw alignment files before conversion to `bam` format, which can be several times the size of the input sequence files. These temporary files are stored in the directory specified by `cacheDir`, which defaults to the **R** process temporary directory returned by `tempdir()`. The location of `tempdir()` can be set using environment variables (see `?tempdir`). `qAlign` returns a `qProject` object that contains all file names and paths, as well as all alignment parameters necessary for further analysis (see section \@ref(qProject) for methods to access the information contained in a `qProject` object). ## `qProject` class {#qProject} The `qProject` objects are returned by `qAlign` and contain all information about a sequencing experiment needed for further analysis. It is the main argument passed to the functions that start with a *q* letter, such as `qCount`, `qQCReport` and `qExportWig`. Some information inside of a `qProject` object can be accessed by specific methods (in the examples below, `x` is a `qProject` object): - `length(x)` gets the number of input files. - `genome(x)` gets the reference genome as a `character(1)`. The type of genome is stored as an attribute in `attr(genome(x),"genomeFormat")`: `"BSgenome"` indicates that `genome(x)` refers to the name of a *BSgenome* package, `"file"` indicates that it contains the path and file name of a genome in `fasta` format. - `auxiliaries(x)` gets a `data.frame` with auxiliary target sequences. The `data.frame` has one row per auxiliary target file, and two columns "FileName" and "AuxName". - `alignments(x)` gets a list with two elements `"genome"` and `"aux"`. `"genome"` contains a `data.frame` with `length(x)` rows and two columns `"FileName"` (containing the path to bam files with genomic alignments) and `"SampleName"`. `"aux"` contains a `data.frame` with one row per auxiliary target file (with auxiliary names as row names), and `length(x)` columns (one per input sequence file). - `x[i]` returns a `qProject` object instance with `i` input files, where `i` can be an `NA`-free logical, numeric, or character vector. ## `qQCReport` {#qQCReport} The `qQCReport` function samples a random subset of sequences and alignments from each sample or input file and generates a series of diagnostic plots for estimating data quality. The available plots vary depending on the types of available input (fasta, fastq, bam files or `qProject` object; paired-end or single-end). The plots below show the currently available plots as produced by the ChIP-seq example in section \@ref(ChIPseq) (except for the fragment size distributions which are based on an unspliced alignment of paired-end RNA seq reads): - **Quality score boxplot** shows the distribution of base quality values as a box plot for each position in the input sequence. The background color (green, orange or red) indicates ranges of high, intermediate and low qualities. ```{r qcplotsFig1, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotQualByCycle(qcdat1$raw$qa, lmat = rbind(1:2)) ``` - **Nucleotide frequency** plot shows the frequency of A, C, G, T and N bases by position in the read. ```{r qcplotsFig2, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotNuclByCycle(qcdat1$raw$qa, lmat = rbind(1:2)) ``` - **Duplication level** plot shows for each sample the fraction of reads observed at different duplication levels (e.g. once, two-times, three-times, etc.). In addition, the most frequent sequences are listed. ```{r qcplotsFig3, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotDuplicated(qcdat1$raw$qa, lmat = rbind(1:2)) ``` - **Mapping statistics** shows fractions of reads that were (un)mappable to the reference genome. ```{r qcplotsFig4, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotMappings(qcdat1$raw$mapdata, a4layout = FALSE) ``` - **Library complexity** shows fractions of unique read(-pair) alignment positions. Please note that this measure is not independent from the total number of reads in a library, and is best compared between libraries of similar sizes. ```{r qcplotsFig5, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotUniqueness(qcdat1$raw$unique, a4layout = FALSE) ``` - **Mismatch frequency** shows the frequency and position (relative to the read sequence) of mismatches in the alignments against the reference genome. ```{r qcplotsFig6, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotErrorsByCycle(qcdat1$raw$mm, lmat = rbind(1:2)) ``` - **Mismatch types** shows the frequency of read bases that caused mismatches in the alignments to the reference genome, separately for each genome base. ```{r qcplotsFig7, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotMismatchTypes(qcdat1$raw$mm, lmat = rbind(1:2)) ``` - **Fragment size** shows the distribution of fragment sizes inferred from aligned read pairs. ```{r qcplotsFig8, eval=TRUE, echo=FALSE, fig.height=4, fig.width=8} QuasR:::plotFragmentDistribution(qcdat2$raw$frag, lmat = rbind(1:2)) ``` ## `alignmentStats` {#alignmentStats} `alignmentStats` is comparable to the `idxstats` function from Samtools; it returns the size of the target sequence, as well as the number of mapped and unmapped reads that are contained in an indexed `bam` file. The function works for arguments of type `qProject`, as well as on a string with one or several `bam` file names. There is however a small difference in the two that is illustrated in the following example, which uses the `qProject` object from the ChIP-seq workflow: ```{r alignmentStats, eval=TRUE} # using bam files alignmentStats(alignments(proj1)$genome$FileName) alignmentStats(unlist(alignments(proj1)$aux)) # using a qProject object alignmentStats(proj1) ``` If calling `alignmentStats` on the bam files directly as in the first two expressions of the above example, the returned numbers correspond exactly to what you would obtain by the `idxstats` function from Samtools, only that the latter would report them separately for each target sequence, while `alignmentStats` sums them for each `bam` file. These numbers correctly state that there are zero unmapped reads in the auxiliary `bam` files. However, if calling `alignmentStats` on a `qProject` object, it will report 7 and 12 unmapped reads in the auxiliary `bam` files. This is because `alignmentStats` is aware that unmapped reads are removed from auxiliary `bam` files by `r Biocpkg("QuasR")`, but can be calculated from the total number of reads to be aligned to the auxiliary target, which equals the number of unmapped reads in the corresponding genomic `bam` file. ## `qExportWig` {#qExportWig} `qExportWig` creates fixed-step wig files (see (http://genome.ucsc.edu/goldenPath/help/wiggle.html) for format definition) from the genomic alignments contained in a `qProject` object. The `combine` argument controls if several input files are combined into a single multi-track wig file, or if they are exported as individual wig files. Alignments of single read experiments can be shifted towards there 3'-end using `shift`; paired-end alignments are automatically shifted by half the insert size. The resolution of the created wig file is defines by the `binsize` argument, and if `scaling=TRUE`, multiple alignment files in the `qProject` object are scaled by their total number of reads. ## `qCount` {#qCount} `qCount` is the workhorse for counting alignments that overlap query regions. Usage and details on parameters can be obtained from the `qCount` function documentation. Two aspects that are of special importance are also discussed here: ### Determination of overlap How an alignment overlap with a query region is defined can be controlled by the following arguments of `qCount`: - `selectReadPosition` specifies the read base that serves as a reference for overlaps with query regions. The alignment position of that base, eventually after shifting (see below), needs to be contained in the query region for an overlap. `selectReadPosition` can be set to `"start"` (the default) or `"end"` , which refer to the biological start (5'-end) and end (3'-end) of the read. For example, the `"start"` of a read aligned to the plus strand is the leftmost base in the alignment (the one with the lowest coordinate), and the `"end"` of a read aligned to the minus strand is also its leftmost base in the alignment. - `shift` allows shifting of alignments towards their 3'-end prior to overlap determination and counting. This can be helpful to increase resolution of ChIP-seq experiments by moving alignments by half the immuno-precipitated fragment size towards the middle of fragments. `shift` can either contain `"integer"` values that specify the shift size, or for paired-end experiments, it can be set to the keyword `"halfInsert"`, which will estimate the true fragment size from the distance between aligned read pairs and shift the alignments accordingly. - `orientation` controls the interpretation of alignment strand relative to the strand of the query region. The default value `"any"` will count all overlapping alignments, irrespective of the strand. This setting is for example used in an unstranded RNA-seq experiment where both sense and antisense reads are generated from an mRNA. A value of `"same"` will only count the alignments on the same strand as the query region (e.g. in a stranded RNA-seq experiment), and `"opposite"` will only count the alignments on the opposite strand from the query region (e.g. to quantify anti-sense transcription in a stranded RNA-seq experiment). - `useRead` only applies to paired-end experiments and allows to quantify either all alignments (`useRead="any"`), or only the first (`useRead="first"`) or last (`useRead="last"`) read from each read pair or read group. Note that for `useRead="any"` (the default), an alignment pair that is fully contained within a query region will contribute two counts to the value of that region. - `includeSpliced`: When set to `FALSE`, spliced alignments will be excluded from the quantification. This could be useful for example to avoid redundant counting of reads when the spliced alignments are quantified separately using `reportLevel="junction"`. ### Running modes of `qCount` The features to be quantified are specified by the `query` argument. At the same time, the type of `query` selects the mode of quantification. `qCount` supports three different types of `query` arguments and implements three corresponding quantification types, which primarily differ in the way they deal with redundancy, such as query bases that are contained in more than one query region. A fourth quantification mode allows counting of alignments supporting exon-exon juctions: - `GRanges` query: Overlapping alignments are counted separately for each coordinate region in the query object. If multiple regions have identical names, their counts will be summed, counting each alignment only once even if it overlaps more than one of these regions. Alignments may however be counted more than once if they overlap multiple regions with different names. This mode is for example used to quantify ChIP-seq alignments in promoter regions (see section \@ref(ChIPseq)), or gene expression levels in an RNA-seq experiment (using a 'query' with exon regions named by gene). - `GRangesList` query: Alignments are counted and summed for each list element in the query object if they overlap with any of the regions contained in the list element. The order of the list elements defines a hierarchy for quantification: Alignment will only be counted for the first element (the one with the lowest index in the query) that they overlap, but not for any potential further list elements containing overlapping regions. This mode can be used to hierarchically and uniquely count (assign) each alignment to a one of several groups of regions (the elements in the query), for example to estimate the fractions of different classes of RNA in an RNA-seq experiment (rRNA, tRNA, snRNA, snoRNA, mRNA, etc.) - `TxDb` query: Used to extract regions from annotation and report alignment counts depending on the value of the `reportLevel` argument. If `reportLevel="exon"`, alignments overlapping each exon in the query are counted. If `reportLevel="gene"`, alignment counts for all exons of a gene will be summed, counting each alignment only once even if it overlaps multiple annotated exons of a gene. These are useful to calculate exon or gene expression levels in RNA-seq experiments based on the annotation in a `TxDb` object. If `reportLevel="promoter"`, the `promoters` function from package `r Biocpkg("GenomicFeatures")` is used with default arguments to extract promoter regions around transcript start sites, e.g. to quantify alignments inf a ChIP-seq experiment. - any of the above or `NULL` for `reportLevel="junction"`: The `query` argument is ignored if `reportLevel` is set to `"junction"`, and `qCount` will count the number of alignments supporting each exon-exon junction detected in any of the samples in `proj`. The arguments `selectReadPosition`, `shift`, `orientation`, `useRead` and `mask` will have no effect in this quantification mode. ## `qProfile` {#qProfile} The `qProfile` function differs from `qCount` in that it returns alignments counts relative to their position in the query region. Except for `upstream` and `downstream`, the arguments of `qProfile` and `qCount` are the same. This section will describe these two additional arguments; more details on the other arguments are available in section \@ref(qCount) and from the `qProfile` function documentation. The regions to be profiled are anchored by the biological start position, which are aligned at position zero in the return value. The biological start position is defined as `start(query)` for regions on the plus strand and `end(query)` for regions on the minus strand. The anchor positions are extended to the left and right sides by the number of bases indicated in the `upstream` and `downstream` arguments. - `upstream` indicates the number of bases upstream of the anchor position, which is on the left side of the anchor point for regions on the plus strand and on the right side for regions on the minus strand. - `downstream` indicates the number of bases downstream of the anchor position, which is on the left side of the anchor point for regions on the plus strand and on the left side for regions on the minus strand. Be aware that query regions with a `*` strand are handled the same way as regions on the plus strand. ## `qMeth` {#qMeth} `qMeth` is used exclusively for Bis-seq experiments. In contrast to `qCount`, which counts the number of read alignments per query region, `qMeth` quantifies the number of C and T bases per cytosine in query regions, in order to determine methylation status. `qMeth` can be run in one of four modes, controlled by the `mode` argument: - `CpGcomb`: Only C's in CpG context are considered. It is assumed that methylation status of the CpG base-pair on both strands is identical. Therefore, the total and methylated counts obtained for the C at position $i$ and the C on the opposite strand at position $i+1$ are summed. - `CpG`: As with `CpGcomb`, only C's in CpG context are quantified. However, counts from opposite strand are not summed, resulting in separate output values for C's on both strands. - `allC`: All C's contained in query regions are quantified, keeping C's from either strand separate. While this mode allows quantification of non-CpG methylation, it should be used with care, as the large result could use up available memory. In that case, a possible work-around is to divide the region of interest (e.g. the genome) into several regions (e.g. chromosomes) and call `qMeth` separately for each region. - `var`: In this mode, only alignments on the opposite strand from C's are analysed in order to collect evidence for sequence polymorphisms. Methylated C's are hot-spots for C-to-T transitions, which in a Bis-seq experiment cannot be discriminated from completely unmethylated C's. The information is however contained in alignments to the G from the opposite strand: Reads containing a G are consistent with a non-mutated C, and reads with an A support the presence of a sequence polymorphism. `qMeth(..., mode="var")` returns counts for total and matching bases for all C's on both strands. A low fraction of matching bases is an indication of a mutation and can be used as a basis to identify mutated positions in the studied genome relative to the reference genome. Such positions should not be included in the quantification of methylation. When using `qMeth` in a allele-specific quantification (see also section \@ref(alleleSpecificAnalysis), cytosines (or CpGs) that overlap a sequence polymorphism will not be quantified. # Session information The output in this vignette was produced under: ```{r sessionInfo, echo=FALSE} sessionInfo() ``` ```{r cleanUp, eval=TRUE, echo=FALSE} unlink("extdata", recursive = TRUE, force = TRUE) ``` # References