--- title: "A guide to small RNA analysis using Seqpac" shorttitle: "sRNA analysis using Seqpac" author: "Daniel Nätt, Signe Skog, Lovisa Örkenby" date: "`r format(Sys.Date(), '%m/%d/%Y')`" abstract: > Seqpac provides functions and workflows for analysis of short sequenced reads. It was originally developed for small RNA (sRNA) analysis, but can be implemented on any sequencing raw data (provided as a fastq-file), where the unit of measurement is counts of unique sequences. The core of the Seqpac strategy is the generation and subsequence annotation/analysis/ visualization of a standardized object called PAC. Using an innovative target system, Seqpac process, analyze and visualize group differences using the PAC object. Seqpac is particularly useful in generating sequence coverage profiles across reference sequences (such as tRNA or gene coverage) without loosing the integrity of the original fastq sequences. Thus, candidate sequences are easily validated by for example a BLAST at a genome browser. This guide will introduce you to the basics of our recommended workflow for sRNA analysis, both fastq-processing (e.g. adapter trimming and generating sequence counts) and post-fastq analysis (e.g filtering, genome and sRNA alignment, differential expression and visualization). To assure package quality, Seqpac has been optimized for Bioconducter. Seqpac package version: `r packageVersion("seqpac")` output: BiocStyle::html_document: number_sections: yes toc: true vignette: > %\VignetteIndexEntry{A guide to small RNA analysis using Seqpac} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ```

Stable versions of Seqpac is available on Bioconductor, and development branches can be obtained at github: ```{r BiocManager, eval=FALSE} ## Installation Bioconductor: if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("seqpac") ## Installation github: devtools::install_github("Danis102/seqpac", dependencies=TRUE, ref="development") ```

# Introduction
## Overview Seqpac contains an extensive toolbox for the analysis of short sequenced reads. The package was originally developed for small RNA (sRNA) analysis, but can be applied on any type of data generated by large scale nucleotide sequencing, where the user wish to maintain sequence integrity throughout the whole analysis. This involves, for example, regular RNA-seq with long RNA. But note, however, that it currently does not automatically support paired-end sequencing.
To preserve sequence integrity, and thereby guarantee a clean data linage, Seqpac applies sequence-based counting. This can be contrasted against feature-based counting, where reads are counted over known genomic features, like protein coding genes or sRNA. Such strategy will result in a count table where each count is made per feature (e.g. number of read counts mapping to a gene). In Seqpac, such annotations are initially disregarded, as a count table is generated solely by counting unique sequences in the raw sequence files (fastq). Annotations against known reference(s) (such as databases containing sequences of full genomes, miRNA, tRNA, protein coding genes, or genomic coordinates of such features) are done after the count table has been produced.
The advantage of using sequence-based counting, compared to feature-based counting, is that for example mismatches in the classification of reads, can be applied any time during the analysis. In feature-based counting, allowing a mismatch in the mapping will result in a loss of information (loss of sequence integrity).
For example, say that you have 506 read counts for miR-185 (a miRNA) when allowing 3 mismatches against the reference. In feature-based analysis this will result in 1 entry in the count table. Unless you have saved the exact alignments elsewhere, you have lost the information about what sequences that are included in that count. The only thing you know is that the sequence may differ from the reference sequence by any 3 mismatches. In sequence-based counting, reads are counted as unique sequences and each unique sequence is then mapped against the target reference databases. Information about whether a sequence was aligning against a reference is maintained in a linked annotation table that can contain both perfect alignments and alignments allowing mismatches. Thus, in any stage of the analysis mismatches and classifications into features can be dynamically controlled, meaning that you can easily observe the effect of for example allowing more mismatches or changes in hierarchical classifications (often applied in sRNA analysis when a sequence align with multiple classes of sRNA).
**But the best with sequence-based counting where you preserve sequence integrity, is probably that you will be able blast your candidate sequences at your favorite database, to verify and find out more about them.**
## The Seqpac workflow The input file format for Seqpac is fastq, which can be generated from most sequencing platforms. Two types of objects sustain the core of the Seqpac workflow: - The PAC object - The Targeting objects
The PAC object stores the Phenotypic information (P), Annotations (A) and Counts (C) needed for all downstream analysis, while the targeting object is used for accessing specific subsets of data within the PAC object.
Seqpac provides two versions of the PAC object. In its simplest form it is a S3 list containing 3 data.frames (Pheno, Anno and Counts). As default, however, an S4 PAC object is generated. By using the coercion method `as(pac-object, "list")`, you can turn an S4 PAC into an S3 list. Similarly, by using `as.PAC(pac-object)` you can turn an S3 PAC into an S4. Plaese see `?as.PAC` an `?PAC` for more details and examples.
The targeting object is always a list of 2 character objects.
The workflow is roughly divided into four steps: - Generate a PAC object from fastq-files - Preprocess the PAC object - Annotate the PAC object - Analyze the PAC object.
## Quick start Here is a quick simple workflow using Seqpac. Remember, however, that in Seqpac you can create much more advanced workflows. Especially when applying the reanno workflow, as will be described below. ```{r, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"} # Setup library(seqpac) sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE) fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE, full.names = TRUE) ref_tRNA <- system.file("extdata/trna", "tRNA.fa", package = "seqpac", mustWork = TRUE) # Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts) count_list <- make_counts(fastq, plot = FALSE, parse="default_neb", trimming="seqpac") pac <- make_PAC(count_list$counts) pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table # Preprocess PAC-object and creat means pac <- PAC_norm(pac) # Default normalization is cpm pac <- PAC_filter(pac, size=c(15,60)) # Here, filter on sequence length pac <- PAC_summary(pac, norm = "cpm", type = "means", pheno_target=list("groups", unique(pheno(pac)$groups))) # Quickly align (annotate) and plot PAC-object map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE) plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups")) plts[[13]] ```

## Example data Contained within this package are 6 sRNA fastq file. These files were originally generated by extracting sRNA from a single fruit fly embryo. The original fastq file was acquired by Illumina NextSeq 500 sequencer using a High Output 75 cycles flow cell (product no: 20024906) and NEBNext® Small RNA Library Prep Set for Illumina (E7330). Due to package size restriction, this fastq was randomly down-sampled to a fraction of its original size into 6 much much smaller files.
There is also a prepared example PAC-object that were similarly generated from randomly down-sampled fastq files (but here we used 9 unique embryos). In addition, there are a few fasta reference file. These files are used in examples of this vignette and in the function manuals.

# Making a PAC object
## Before making a PAC object (merge_lanes) Seqpac works best with merged fastq, meaning that if sequencing was done in separate lanes on the flow cell, where the same sample was present in all lanes, lane files must be merged for each unique sample.
Conveniently, Seqpac contains a function that can be used to merge lane files: `merge_lanes`. Please see, `?merge_lanes` for examples on how to merge fastq-files. Specifically, this function searches the input folder for similar file names and append files with similar names. As long as the user uses unique sample names (e.g. sample1_lane1.fastq.gz, sample1_lane2.fastq.gz, sample2_lane1.fastq.gz, sample2_lane2.fastq.gz etc) then `merge_lanes` will merge lane files into the same sample (e.g. sample1.fastq.gz, sample2.fastq.gz).
## Generating a count table (make_counts) Count tables are the core components in most sequence analysis. Seqpac uses sequence-based counts (see 1.1 Overview). Thus, we need to generate a count table with the number of unique sequences in each sample.
There are three ways of generating a count table in Seqpac. - Trim raw fastq files using internal functions in Seqpac/R. - Trim raw fastq files using external functions outside R. - Import reads from already trimmed fastq files.
The two first involves adapter trimming, while the last involves already trimmed reads. All input options are contained within the `make_counts`. function.
### Internal trimming Generating a count table from untrimmed fastq files using nothing but R packages (internal) depends on the `make_trim` function. This function goes through a series of trimming cycles using the `trimLRPatterns` function in the `r Biocpkg("Biostrings")` package to generate highly comparable adapter trimming as generated by popular fastq trimming software, such as cutadapt.
Seqpac trimming is praticularly efficient when trimming many fastq files at the same time. It parallelizes trimming on multiple processor cores/threads by applying functions in the `r CRANpkg("foreach")` package. Thus, if `threads=12` and you input 12 fastq files, all files will be trimmed simultaneously in parallel. Note, however, that each thread will make a substantial impact on your computers memory performance. Thus, if you know that you are low in ram memory use fewer number of threads.
When `make_trim` is used on its own it results in trimmed fastq files stored at a user defined location. When `trimming="Seqpac"` is set in the `make_counts` function, this function parses settings to `make_trim` to generate a count table from temporary stored trimmed fastq files. ```{r, eval=FALSE, echo=TRUE} ## Using default settings for NEB type adapter # NEB= NEBNext® Small RNA Library Prep Set for Illumina (E7300/E7330) # For illumina type adapters, 'parse="default_illumina"' may be used # To speed things up we use 4 threads. count_list <- make_counts(input=fastq, trimming="seqpac", threads=4, parse="default_neb") ```
### External trimming As an alternative to internal trimming, two external software: [cutadapt](https://cutadapt.readthedocs.io/en/stable/installation.html) and [fastq_quality_filter](http://hannonlab.cshl.edu/fastx_toolkit/commandline.html) can be used to trim the adapter sequence and remove low quality reads prior of making a count table. By setting `trimming="cutadapt" make_counts will check if these software is installed on your system.
### Already trimmed fastq Setting `trimming=NULL` in the `make_counts` function will pass the adapter trimming and a count table will be generated directly from the fastq files. Thus, this option can be used if you already have trimmed your fastq files.
### Optimizing for low-end systems and challenging data In the `make_counts`function users may choose to process data on-disk and in chunks instead of processing data in-memory. Thus, users on low-end computers with little RAM may choose to process smaller chunks and save data to disk from time to time on the expense of performance. This is controlled by the `optimize`option. Here, two panic filters are also available. These filters will only be applied if the unfiltered data fails to complete due to memory shortage. If such filter is applied to a fastq file this is clearly stated in the progress report, which give you the choice of what to do with such a sample.
See the manuals for `make_counts` for more details.
### Outputs from make_counts Besides the counts table, `make_counts` automatically generates a progress report for the trimming and evidence filter, as well as bar graphs showing the impact of the evidence filter (if plots=TRUE).
### Evidence filtering Users may already when making the counts table with `make_counts` markedly reduce the noise in the data by specifying the number of independent evidence that a specific sequence must reach to be included. This is controlled by the `evidence` argument. As default, `evidence=c(experiment=2, sample=1)` will include all sequences that have >=1 counts in at least 2 independent fastq files.
Importantly, we routinely observe that 95-99% of all reads are maintained when applying the default evidence filter, while >50% of the unique sequences will be dropped because they can not be replicated across two independent samples. For an example of how a saturated experiment may look like, see the original paper, Skog et al. from 2021.
As you may have guessed, the 'experiment' argument controls the number of independent fastq evidence needed across the whole experiment. Note, however, that 'sample' does not control the number of counts needed in each sample. The evidence filter will always use >=1 counts in X number of fastq files. Instead 'sample' controls when a sequence should be included despite not reaching the 'experiment' threshold. Thus, if `evidence=c(experiment=2, sample=10)`, sequences that reach 10 counts in a single sample will also be included. Here is an example on how to use the 'sample' argument: ```{r, eval=FALSE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"} ###--------------------------------------------------------------------- ## Evidence over two independent samples, saving single sample sequences ## reaching 3 counts test <- make_counts(input=fastq, trimming="seqpac", parse="default_neb", threads=3, evidence=c(experiment=2, sample=3)) extras <- apply(test$counts, 1, function(x){sum(!x==0)}) test$counts[extras==1,] # A few still have 3 alone ```
Lastly, if `evidence=NULL` all unique sequences will be saved. Remember, however, that the count table will become larger in well saturated experiments that may lead to performance issues on some systems, since each unique sequence occupies a row in `counts(PAC)` and `anno(PAC)`. Apply other filters prior to annotating your sequences may solve such problems.
## Generate the phenotype table (make_pheno) Next, we generate a Pheno table (P) and merge it with the progress report that was stored from the `make_counts` output. The Pheno table contains sample specific information such as sample name and possible interventions, with each sample as a row.
The `make_pheno` function can import a Pheno table both externally by providing a path to a .cvs file, as well as internally by just passing a data.frame. In common for both options, is that a unique column named "Sample_ID" needs to be present, and must contain the sample names (column names) present in the counts table generated by `make_counts`. Now, `make_pheno` will attempt to match similar names, but to avoid problems use identical names.
In addition, `make_pheno` may add the progress report generated by `make_counts`. Lets have a look at how this works: ```{r, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"} ###--------------------------------------------------------------------- ## Generate a Pheno table using the file names of test fastq Sample_ID <- colnames(count_list$counts) pheno_fl <- data.frame(Sample_ID=Sample_ID, Treatment=c(rep("heat", times=1), rep("control", times=2)), Batch=rep(c("1", "2", "3"), times=1)) pheno <- make_pheno(pheno=pheno_fl, progress_report=count_list$progress_report, counts=count_list$counts) ``` ## make_PAC
Finally, we put everything together as a master PAC object using the `make_PAC` function. Conveniently, skipping the `anno=` argument will automatically add a very simple annotation table to the PAC-object that can be filled later. ```{r, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"} ###--------------------------------------------------------------------- ## Generate PAC object (default S4) pac_master <- make_PAC(pheno=pheno, counts=count_list$counts) pac_master ## If TRUE the PAC object is ok PAC_check(pac_master) ```

# Preprocessing the PAC object ## Targeting objects Seqpac uses a simple targeting system to divide and focus the analysis of a PAC object to certain sample groups (Pheno) or specific sequences (Anno). If not specified otherwise in the function manual, a targeting object is always a two item list, where the first object is naming a column in a specific table in the PAC and the second object tells the function which categories in that column that should be targeted. - `pheno_target`: extracts samples from a column in `pheno(PAC)` - `anno_target`: extracts sequences from a column in `anno(PAC)` While `pheno_target` and `anno_target` are the most common targeting objects, there are other targeting object. Common for all is that they target a specific sub-table in the PAC.
Now, lets take a look at the provided example PAC-object in Seqpac. As noted by the name, this PAC has already been filtered and annotated, but it will still show you the principle of preprocessing using targeting objects. ```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'} ###--------------------------------------------------------------------- ## Load the PAC-object and inspect the columns in Pheno and Anno library(seqpac) load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) pheno(pac)[,1:5] anno(pac)[1:5,1:5] counts(pac)[1:5,1:5] ## This PAC is an S4 object, but you can easily turn it to an S3 list with as(). ## using as.PAC() will turn it back into S4: isS4(pac) pac_s3 <- as(pac, "list") lapply(pac_s3[1:3], head) isS4(pac_s3) pac <- as.PAC(pac_s3) methods(class="PAC") #all S4 methods ```
A pheno_target object that will exclusively target stage 1 and 5 samples would look like this: `pheno_target=list("stage", c("Stage1", "Stage5"))`
While an anno_target that will target sequences of 22 nt in length would look like this: `anno_target=list("Size", "22")`
Notice that both targeting objects are lists holding two secondary objects: one character string targeting a specific column, and one character vector targeting specific categories within the target column. The only difference being that each targeting object target different tables in the PAC object.
Important, if the second object is set to NULL or is left out this will automatically target all samples in the specified column. If the anno_target is a character vector (not a list) some function will attempt to extract matches in the row names of either Pheno or Anno (see example with `PAC_filtsep` below).For more, please refer to the original Seqpac publication: Skog et al. 2021.
## Filtering A master PAC object will contain sequences in your original fastq files that passed the trimming and the initial evidence filter. Now, in many experiments this still involves millions of unique sequences, making the master PAC difficult to work with. Thus, it is often wise to further reduce noise prior to normalization and annotation.
Seqpac contains two functions for this purpose: - `PAC_filter` - `PAC_filtsep`
While `PAC_filter` gives a variety of choices to subset the PAC object according to for example sequence size or minimum counts in a certain proportion of the samples,`PAC_filtsep` extracts the sequence names that pass a filter within sample groups.
Here are some examples: ```{r, eval=TRUE, echo=TRUE} ###--------------------------------------------------------------------- ## Extracts all sequences between 20-30 nt in length with at least 5 counts ## in 20% of all samples. pac_filt <- PAC_filter(pac, size=c(20,30), threshold=5, coverage=20, norm = "counts", pheno_target=NULL, anno_target=NULL) ``` ```{r, eval=FALSE, echo=TRUE} ###--------------------------------------------------------------------- ## Optionally, a simple coverage/threshold graph at different filter depths ## can be plotted with stat=TRUE (but it will promt you for a question). pac_filt <- PAC_filter(pac, threshold=5, coverage=20, norm = "counts", stat=TRUE, pheno_target=NULL, anno_target=NULL) ``` ```{r, eval=TRUE, echo=TRUE} ###--------------------------------------------------------------------- ## Extracts all sequences with 22 nt size and the samples in Batch1 and Batch2. pac_filt <- PAC_filter(pac, subset_only = TRUE, pheno_target=list("batch", c("Batch1", "Batch2")), anno_target=list("Size", "22")) ###--------------------------------------------------------------------- ## Extracts sequences with >=5 counts in 100% of samples within each stage filtsep <- PAC_filtsep(pac, norm="counts", threshold=5, coverage=100, pheno_target= list("stage")) pac_filt <- PAC_filter(pac, subset_only = TRUE, anno_target= unique(do.call("c", as.list(filtsep)))) ``` Notice that `PAC_filtsep` only extracts the sequence names and stores them in a data.frame. Converted into a character vector with unique names (using `unique` + `do.call`) they can be applied as an anno_target that targets sequence names in Anno. Hint, the data.frame output of `PAC_filtsep` has been designed for the purpose of generating Venn and Upset diagrams that visualize overlaps across groups: ```{r, eval=FALSE, echo=TRUE} ###--------------------------------------------------------------------- ## Venn diagram using the venneuler package olap <- reshape2::melt(filtsep, measure.vars = c("Stage1", "Stage3", "Stage5"), na.rm=TRUE) plot(venneuler::venneuler(olap[,c(2,1)])) ###--------------------------------------------------------------------- ## Setting output="binary", prepares data for upset plots (UpSetR package): filtsep_bin <- PAC_filtsep(pac, norm="counts", threshold=5, coverage=100, pheno_target= list("stage"), output="binary") UpSetR::upset(filtsep_bin, sets = colnames(filtsep_bin), mb.ratio = c(0.55, 0.45), order.by = "freq", keep.order=TRUE) ```
## Storage of processed data in PAC While the simplest PAC only contains three data.frame objects (P, A and C), additional 'folders' will be stored in the PAC object as the analysis progresses. In fact, if using a S3 PAC (list), you may choose to store any number of objects in your PAC as long as they do not conflict with the names of the standard folders, which are: Pheno (P): data.frame Anno (A): data.frame Counts (C): data.frame norm: list of data.frames summary: list of data.frames
The *norm* folder will be used to store counts that have been normalized in different ways, while the *summary* folder will contain the data that has been summarized across groups of samples. In both cases, applying a filter using `PAC_filter` will automatically subset not only P, A, and C, but also all tables stored in `norm(PAC)` and `summary(PAC)`.
## Normalization The built in normalization methods in Seqpac are handled by the `PAC_norm` function. In the current version three methods are available: - Counts per million reads (cpm) - Variance-Stabalizing Transformation (vst) - Regularized log transformation (rlog)
More specifically, cpm normalize the dataset in relation to the total number of reads, while vst and rlog uses both mean dispersion and size factor to produce homoskedastic values with functions available in the `r Biocpkg("DESeq2")` package.
It is easy to generate your own normalized count table using your method of choice, and import it as a data.frame to the *norm* folder in the PAC object. Just remember that row and column names must be identical to the original Counts table. Use `PAC_check` to verify. Lets generate two normalized count tables with cpm and vst: ```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE, error=FALSE} ###--------------------------------------------------------------------- ## Example normalization in Seqpac load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) # PAC_norm works on both pac <- PAC_norm(pac, norm="cpm") pac <- PAC_norm(pac, norm="vst") pac ``` From this, it may sometimes be wise to run further filtering steps to focus on reads with high cpm values. ```{r, results=FALSE, eval=TRUE} ###--------------------------------------------------------------------- ## Filtering using cpm instead of raw counts ## filter >=100 cpm in 100% of samples in >= 1 full group filtsep <- PAC_filtsep(pac, norm="cpm", threshold=100, coverage=100, pheno_target= list("stage")) pac_cpm_filt <- PAC_filter(pac, subset_only = TRUE, anno_target= unique(do.call("c", as.list(filtsep)))) ```

# Generate annotations From the start a PAC object contains a very limited Anno table such as this: ```{r, results=FALSE, eval=TRUE} load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) anno(pac) <- anno(pac)[,1, drop = FALSE] head(anno(pac)) ``` To facilitate easy adoption of Seqpac to any species, we provide a fast and flexible workflow for mapping PAC sequences against user defined reference databases. We call this reannotation, since we map the sequences in Anno object sequentially over one or multiple references. Important, Seqpac relies on the popular bowtie aligner `r Biocpkg("Rbowtie")` for mapping: http://bowtie-bio.sourceforge.net/manual.shtml. This means that references must be indexed using the `bowtie_build` function before they are compatible with the reannotation workflow (but see the `PAC_mapper` function for an index free alternative).
## Fasta references Small RNA annotation/mapping depends heavily on choosing your references wisely and not to be too greedy if possible. If your hypothesis target a specific type of class (like tRNA), it might for instance be wiser to target such references only. Many databases provide their references in the fasta file format. Seqpac can use any correctly formated fasta file as input in the reannotation workflow. However, we distinguish two kinds of fasta references: - Reference genomes. - Other specialized references. You can download the files manually from the source database. Below are some helpful example links where to retrieve useful fasta references. Note that all compressed files must be uncompressed to work with the bowtie aligner.
**Reference genomes:**
[Ensembl Animals](https://www.ensembl.org/info/data/ftp/index.html)
[Ensembl Plants](https://plants.ensembl.org/info/data/ftp/index.html)
[UCSC](http://hgdownload.cse.ucsc.edu/downloads.html)

Hint: Unmasked top level fasta files are best for most purposes, e.g. Ensembl file name *.dna.toplevel.fa.gz. **Other specialized references:**
[miRNA](http://www.mirbase.org/ftp.shtml)
[miRNA](https://mirgenedb.org/download)
[piRNA](http://www.regulatoryrna.org/database/piRNA/download.html)
[many ncRNA Animals](https://www.ensembl.org/info/data/ftp/index.html)
[many ncRNA Plants](https://plants.ensembl.org/info/data/ftp/index.html)
[tRNA](http://gtrnadb.ucsc.edu/)
[repeats](http://repeatmasker.org/genomicDatasets/RMGenomicDatasets.html)

There are also several packages in Bioconductor/R such as `r Biocpkg("BSgenome")` and `r Biocpkg("biomaRt")` that allow you to download reference genomes and other specialized references directly into r. Actually, using the `download.file` function makes most urls (https/http/ftp etc) available for download. Custom fasta files can also be used, and you can easily merge, add and remove features to the references by functions in the `r Biocpkg("Biostrings")` package, such as `readDNAStringSet` (read fasta) and `writeXStringSet` (save fasta).
As default, bowtie only reports the names up to the first white space. Sometimes the sequence names need to be modified or rearranged for the bowtie aligner to best report the names in the mapping output. This can be done using the `r Biocpkg("Biostrings")` package to read, rearrange and save the fasta. A useful expression to verify that you have caught the relevant info before the the first white space is: `do.call("rbind", strsplit(names(), " "))[,1]`
You may also want to add tRNAs from the mitochrondrial genome to your tRNA reference, which is not always provided in the GtRNAdb fasta. This can be done by moving them from Ensembl ncRNA fasta or scan the mitochondrial genome manually at [tRNAscan-SE](http://lowelab.ucsc.edu/tRNAscan-SE/). The mitochonridal genome is found in the reference genome fasta.
## Indexing the fasta references Before we can use the fasta references we need to index them for the bowtie aligner. You can do this by using the `bowtie_build` function in the Rbowtie package or by running bowtie externally, outside R. For more information see ?Rbowtie::bowtie, Rbowtie::bowtie_build_usage() and http://bowtie-bio.sourceforge.net/manual.shtml.
Important, the bowtie prefix must have the same basename (`prefix=`) and the indexes must be saved in the same directory (`outdir=`) as the original fasta file. Since the `bowtie_build` naming is often confusing for newbies, here are some examples: ```{r, eval = TRUE} ###--------------------------------------------------------------------- ## Examples of generating bowtie indexes path <- "/some/path/to/your/folder" #Genome mycoplasma_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa", package = "seqpac", mustWork = TRUE) Rbowtie::bowtie_build(mycoplasma_path, outdir=gsub("mycoplasma.fa", "", mycoplasma_path), prefix="mycoplasma", force=TRUE) #tRNA trna_path <- system.file("extdata/trna", "tRNA.fa", package = "seqpac", mustWork = TRUE) Rbowtie::bowtie_build(trna_path, outdir=gsub("tRNA.fa", "", trna_path), prefix="tRNA", force=TRUE) #rRNA rrna_path <- system.file("extdata/rrna", "rRNA.fa", package = "seqpac", mustWork = TRUE) Rbowtie::bowtie_build(rrna_path, outdir=gsub("rRNA.fa", "", rrna_path), prefix="rRNA", force=TRUE) ```
Building indexes may take some time for large fasta, such as a reference genome. You only need to generate the index once for every fasta reference, however.
## GTF feature coordinates After the sequences have been mapped against a reference genome they can also acquire annotations from genomic feature files, such as the gtf and gff formats. These files contains the coordinates for different genomic features associated with a specific reference genome, such as repetitive regions, exons and CpG islands etc.
GTF files can often be acquired at the same places as the fasta reference files (see above). The `r Biocpkg("biomartr")` package also have functions for downloading some gtf files (e.g. `getGTF`).In addition, you may create your own by downloading/importing different tables, for example using `r Biocpkg("rtracklayer")` (see `readGFF`, `getTable`, `ucscTableQuery`) and then create a GRanges object (`r Biocpkg("GenomicRanges")` package) where you add your preferred table info. Finally you may save it using rtracklayer again (`export(, , format="gtf"`)
## The reannotation workflow The procedure to annotate sequences in Seqpac is called reannotation.
The PAC reannotation family of functions carries out the following tasks: - Maps sequences against fasta references
- Stores the output on hard drive
- Reads selected parts into R
- Creates a reanno object
- Generates an overview from the reanno object
- Classifies and simplifies annotations
- Adds classifications to a PAC object

### The functions in the reanno workflow
**map_reanno**
To accommodate most users on most platforms, Seqpac provides two alternatives for calling bowtie using the `map_reanno` function: internally from within R (`type="internal"`) or externally from outside R (`type="external"`). In the internal mode, Seqpac uses the `bowtie` function in the `r Biocpkg("Rbowtie")` package, while in the external mode bowtie is called by a system command to a locally installed version of bowtie. Both options gives identical results, but since the internal Rbowtie package is rarely updated, we provide the external option.
Since the input commands for external bowtie and internal Rbowtie differs slightly, `map_reanno` has two parse options: `parse_internal` and `parse_external`. These can be used to control the two versions of bowtie as if they were run from the console or command line, respectively.
While bowtie has its own way to handle mismatches in the alignments, Seqpac disregards this and instead runs bowtie sequentially, increasing the number of mismatches for each cycle. After each cycle all sequences that received a match in any of the provided references will be subtracted. Therefore, in the next cycle only sequences without a match will be 'reannotated' but allowing one additional mismatch. The `map_reanno` function controls this by sequentially increasing the number of mismatches after subtracting the matches.
**import_reanno**
This function is called by `map_reanno` to import the bowtie results after each cycle. If you would run `import_reanno` separately it simply provides you with options on how to import bowtie results into R. For example, when aligning against a reference genome the mapping coordinates is important, while mapping against more specialized references used for creating sRNA classifications (rRNA, tRNA, miRNA etc), 'hit' or 'no hit' might be enough, and will be a much faster option. The `map_reanno` function, therefore, has two default import modes, import="genome" and import="biotype", which automatically configure `import_reanno` for genome or class annotation.
**make_reanno**
After each annotation cycle, `map_reanno` will store an .Rdata file (Full_reanno_mis0/1/2/3.Rdata) containing a list with all alignment hits for each PAC-sequence-to-fasta-reference mismatch cycle. Calling `make_reanno` will search for these .Rdata files in a given folder and create a reanno object in R. This involves extracting, ordering and preparing annotations in a format that can be merged with the existing Anno table in a PAC object. It will also generate an overview table, that summarizes the annotations if multiple fasta references were used by `map_reanno`.
**add_reanno**
Before merging the new annotations with a PAC object, unless we have provided a fasta reference with only one type (e.g. miRNA for mirBase), classification not only *between* fasta references but also *within* references might be necessary (e.g. snoRNA, tRNA, miRNA, rRNA in the Ensembl_ncRNA fasta). Classification of the fasta reference names, which was first saved by bowtie and then caught in our reanno object as an annotation, is done with the `add_reanno` function. Again, there are two primary modes, either "genome" or "biotype". When type="genome" a standardized workflow anticipating genomic coordinates are applied. When type="biotype" a list of search term vectors directed against each fasta reference needs to be provided. Matches between search term (written as regular expressions) and text strings in the fasta reference names will result in a classification of the PAC sequence. Finally, `add_reanno` generates an annotation table that can either be saved separately or merged with the original PAC object.
**simplify_reanno**
While `add_reanno` starts the process of simplifying the reference name annotations into useful classifications, it will generate multiple classifications for each counted sequence if it matches multiple search terms. The `simplify_reanno` function boils down multi-classifications into one classification for each sequence. By doing so, an hierarchy must be set, where priority of some classifications over others is done (e.g. miRNA over piRNA). While this is a controversial topic, using such hierarchies is still the only way to provide overview statistics in experiments targeting highly multimapping sequences, such as small RNA. Nonetheless, the Seqpac workflow with its sequence based counts, where hierarchical classification is done in the very end of the annotation process, makes it easy to generate classifications using alternative hierarchies and observe the consequences of your choices in for example a pie chart.
In summary: - **map_reanno** Bowtie reference mapping over mismatch cycles - **import_reanno** Used by `map_reanno` to import bowtie output - **make_reanno** Organizes `map_reanno` output into a reanno object - **add_reanno** Classifies annotations according to reference name - **simplify_reanno** Hierarchical classification
Already at this point it is good to know that Seqpac provides a 'backdoor' function, `PAC_mapper`, that carries out many of the steps in the reanno workflow using smaller references and pac objects. Please see '6.1 Advanced alignment analysis' or ?PAC_mapper for more details.
### Reannotation in action Lets observe the functions in action using the drosophila test dataset.
First we map against a reference genome, using `map_reanno`. Note that import="genome" must be set to catch the genomic coordinates. Since `map_reanno` can handle multiple fasta references. Remember, each fasta must have bowtie indexes (see 4.2).
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE, error=FALSE} # Lets reload our pac library(seqpac) load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) anno(pac) <- anno(pac)[,1, drop = FALSE] # Remove previous Anno ###--------------------------------------------------------------------- ## Genome mapping # Path to your output folder outpath_genome <- "/some/path/to/reanno_folder" # or a temp folder outpath_genome <- paste0(tempdir(), "/seqpac_reanno/reanno_genome") # Look up the your own Bowtie indexed reference genomes (fasta) ref_paths <- list(myco="") # For testing, you can use a small mycoplasma genome provided with Seqpac # But remember to provide single genomes as a named list anyway. # The name will be used to keep track of your genome. myco_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa", package = "seqpac", mustWork = TRUE) ref_paths <- list(myco=myco_path) # Run map_reanno map_reanno(PAC=pac, ref_paths=ref_paths, output_path=outpath_genome, type="internal", mismatches=1, import="genome", threads=8, keep_temp=TRUE) ```
Similarly, we can map against other specialized references, that can be used for classifying each sequence into biotypes. Note, unlike genome mapping, at this stage we are not interested in where a sequence matches a reference for the specialized references, only if it matches or not. Thus, we use `import="biotype"`. ```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE, error=FALSE} ###--------------------------------------------------------------------- ## sRNA mapping using internal bowtie # Path to output folder outpath_biotype <- "/some/path/to/reanno_biotype" # or a temp folder outpath_biotype <- paste0(tempdir(), "/seqpac_reanno/reanno_biotype") # Seqpac contains two fasta for tRNA and rRNA we can use for testing trna_path <- system.file("extdata/trna", "tRNA.fa", package = "seqpac", mustWork = TRUE) rrna_path <- system.file("extdata/rrna", "rRNA.fa", package = "seqpac", mustWork = TRUE) # Provide paths to bowtie indexed fasta references as a list ref_paths <- list(tRNA=trna_path, rRNA=rrna_path) map_reanno(pac, ref_paths=ref_paths, output_path=outpath_biotype, type="internal", mismatches=1, import="biotype", threads=6, keep_temp=TRUE) ```
Now, the output .Rdata files for all mismatch cycles should have been stored in the output folders. Lets import everything into R, generate a reanno objects and plot some pie charts from the Overview table. ```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'} ###--------------------------------------------------------------------- ## Generate a reanno object with make_reanno reanno_genome <- make_reanno(outpath_genome, PAC=pac, mis_fasta_check = TRUE) reanno_biotype <- make_reanno(outpath_biotype, PAC=pac, mis_fasta_check = TRUE) # Note, setting mis_fasta_check=TRUE will double check that the number of # sequences that failed to receive an alignment in the last mismatch cycle # agrees with the number sequences in the reanno object without an annotation. # (these sequences are stored in mis_fasta_x.txt where x is max mismatches+1) # List structure str(reanno_genome, max.level = 3, give.attr = FALSE) str(reanno_biotype, max.level = 3, give.attr = FALSE) # Simple pie charts using the Overview table pie(table(overview(reanno_genome)$myco)) # Very few mycoplasma with 0 mismatch pie(table(overview(reanno_biotype)$Any)) # Many hits for either rRNA or tRNA ```
Next step is to reorganize and classify using `add_reanno`. For mapping against a reference genome this is easy. Just set type="genome" and set the maximum number of alignments to be reported by `genome_max` (Warning: `genome_max="all"` will give you all alignments). ```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE} ###--------------------------------------------------------------------- ### Genomic coordinates using add_reanno # Output as separate table anno_genome <- add_reanno(reanno_genome, type="genome", genome_max=10, mismatches=1) # Output merged with provided PAC object pac <- add_reanno(reanno_genome, type="genome", genome_max=10, mismatches=1, merge_pac=pac) # Example of original reference name annotations head(full(reanno_genome)$mis1$myco) # Finished genome annotation head(anno_genome) head(anno(pac)) ```
For specialized references, `type="biotype"` should be used. This allows for classification based on match or no match between the search terms provided in the `bio_search` input and the reference names annotated with each counted sequence. Correct classification is all about finding the best search terms that discriminates between the different names in the original fasta reference files (up to the first white space; see 4.1).
With the `bio_perfect` option you may control how conservative the matching between search term and annotation should be. With `bio_perfect=FALSE`, every reference hit that fail to match your search terms, will be classified as 'other'. Using `bio_perfect=TRUE` will instead guarantee that your search terms will cover all reference hits.
Hint: See `?add_reanno` for a trick on how to succeed with `bio_perfect=TRUE`. ```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE} ###--------------------------------------------------------------------- ## Classify sequences using add_reanno # Lets start by exploring the names in the original fasta reference: ref_path <- "/some/path/to/fasta_reference" # For testing use the the fasta reference for tRNA provided with Seqpac: trna_path <- system.file("extdata/trna", "tRNA.fa", package = "seqpac", mustWork = TRUE) # Read fasta names fasta <- names(Biostrings::readDNAStringSet(trna_path)) # Different naming standards table(substr(fasta, 1, 10)) # Starting with FBgn discriminate mitochondrial tRNA fasta[grepl("FBgn", fasta)] # The other are nuclear head(fasta[!grepl("FBgn", fasta)]) # We can use as 'mt:tRNA' as search term to catch mitochondrial tRNA # and '_tRNA' to catch nuclear. # A useful expression for extracting strings up to 1st white space is # fasta <- do.call("rbind", strsplit(fasta, " "))[,1] # Similarly load the rrna reference provided with seqpaq rrna_path <- system.file("extdata/rrna", "rRNA.fa", package = "seqpac", mustWork = TRUE) # Read fasta names fasta <- names(Biostrings::readDNAStringSet(rrna_path)) # Many rRNA subtypes table(substr(fasta, 1, 10)) # Lets try two search term lists directed against each reference and written as # regular expressions: # Names in the search list needs to be the same as in the reanno object head(overview(reanno_biotype)) # 'rRNA' and 'tRNA' with some capital letters # Generate a search list with search terms bio_search <- list( rRNA=c("5S", "5.8S", "12S", "16S", "18S", "28S", "pre_S45"), tRNA =c("_tRNA", "mt:tRNA") ) test <- add_reanno(reanno_biotype, bio_search=bio_search, type="biotype", bio_perfect=FALSE, mismatches = 1, merge_pac=pac) ``` ```{r, eval=FALSE, results=FALSE, message=FALSE, warning=FALSE} # Throws an error because perfect matching was required: anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search, type="biotype", bio_perfect=TRUE, mismatches = 1) ``` ```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE} # References with no search term hits are classified as "Other": anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search, type="biotype", bio_perfect=FALSE, mismatches = 1) # Increasing number of hits allowing mismatches table(anno_temp$mis0) table(anno_temp$mis1) # You may also add your new annotaion with a PAC object using the 'merge_pac' # option. For even more options see ?add_reanno. pac <- add_reanno(reanno_biotype, bio_search=bio_search, type="biotype", bio_perfect=FALSE, mismatches = 1, merge_pac=pac) head(anno(pac)) ```
To make overview statistics and graphs we need to boil down the classifications to a factor column with only a few unique biotypes per factor. This is what `simplify_reanno` does. Just like `add_reanno` in the `type="biotype"` mode, `simplify_reanno` needs a list of search terms, an `hierarchy`. This time, however, the list is order sensitive and targets the output table from `add_reanno` that contains the columns with new classifications (e.g. "mis0_bio", "mis1_bio", "mis2_bio" etc). If a match occurs between the first search term and a classification, the counted sequence will be annotated to this classification, no matter if other search terms matches further down the list. This is done sequentially, allowing one additional mismatch until a maximum (specified in `mismatches`) has been reached. Thus, if a match occurs in tRNA with 0 mismatch, and rRNA with 1 mismatch, it will be reported as tRNA even though rRNA where higher in the `hierarchy`. A better match will always be prioritized over the hierarchy.
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE} ###--------------------------------------------------------------------- ## Hierarchical classification with simplify_reanno # Currently classification does not discriminate table(anno(pac)$mis3_bio) # Set the hierarchy and remember that it is order sensitive. # Here: S5_rRNA >> Other_rRNA >> mt:tRNA >> tRNA # Remember to use 'regular expressions' if you wish to catch all: hierarchy_1 <- list(rRNA_5S="rRNA_5S", Other_rRNA="5.8S|12S|16S|18S|28S|pre_S45|rRNA_other", Mt_tRNA="tRNA_mt:tRNA", tRNA="tRNA__tRNA" ) # What happens if you don't catch all: hierarchy_2 <- list(rRNA_5S="rRNA_5S", Mt_tRNA="tRNA_mt:tRNA", tRNA="tRNA__tRNA") # No mismatch allowed using hierarchy_1 test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=0, bio_name="Biotypes_mis0", merge_pac=FALSE) table(test) # All remaining rRNA are classified as 'Other_rRNA' # Instead using hierarchy_2 test <- simplify_reanno(input=pac, hierarchy=hierarchy_2, mismatches=0, bio_name="Biotypes_mis0", merge_pac=FALSE) table(test)# Non-hits are classified as 'other' # Note, setting 'merge_pac=FALSE' returns a one-column data.frame only # containing the new hierarchical classifications. # Lets increase number of mismatches an merge it with the PAC # (How many mismatches depends on what you allowed previously in the workflow) colnames(anno(pac)) # mis0-1_bio = Upto 1 mismatches are available pac_test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=1, bio_name="Biotypes_mis1", merge_pac=TRUE) # Now we also have some mitochondrial tRNA table(anno(pac_test)$Biotypes_mis1) ```

# Analysis and visualization There are several functions in Seqpac that handle statistical analysis and visualization. We will only present a few of them here, so please refer to Seqpac's reference manual for a complete list. Just like many of the preprocessing functions, these functions are named PAC_xxx to clearly show that they are applied on a PAC object. Hint: If you are unsure if your manually constructed/manipulated PAC object are compatible, you can always run `PAC_check`. ## Summarize data over groups To make simple summaries over column categories/groups in the `pheno(PAC)` table such as means, standard deviation (SD), standard error (SE), and log2 fold changes (log2FC) etc, Seqpac has a convenient function called `PAC_summary`. This function uses a target_pheno object to generate a summary over raw counts or normalized counts across sample groups. Processed data will be stored in the `summary(PAC)` folder. Important, summarized data in the `summary(PAC)` folder will always have the same rownames (unique sequences) as `counts(PAC)` and `anno(PAC)`, while the column names will be derived from the `pheno(PAC)` table. Summarize using an anno_target object is not allowed, because that would disrupt sequence names (loss of sequence integrity). Summaries over sequences (such as generating means over sRNA classes) is instead handled by each plot/statistical function, and are often stored in the output from these functions. Nonetheless, user may always generate their own derived tables, using functions like `aggregate` and `reshape::melt`. Lets have some examples on how PAC_summary works: ```{r, message=FALSE, eval=TRUE} load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) ###--------------------------------------------------------------------- ## PAC_summary in Seqpac # Make means of counts over stages and return a data.frame tab <- PAC_summary(pac, norm = "counts", type = "means", pheno_target=list("stage"), merge_pac=FALSE) # When merge_pac=TRUE the table is added to the summary(PAC) 'folder' pac_test <- PAC_summary(pac, norm = "counts", type = "means", pheno_target=list("stage"), merge_pac=TRUE) pac # Structure of PAC before PAC_summary (S4) pac_test # Structure of PAC after PAC_summary (S4) names(summary(pac_test)) head(summary(pac_test)$countsMeans_stage) ``` ```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE} # You may want to use normalized counts pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means", pheno_target=list("stage"), merge_pac=TRUE) # Maybe only include a subset of the samples pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means", pheno_target=list("batch", c("Batch1", "Batch2")), merge_pac=TRUE) # Generate standard errors pac_test <- PAC_summary(pac_test, norm = "cpm", type = "se", pheno_target=list("stage"), merge_pac=TRUE) # log2FC pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FC", pheno_target=list("stage"), merge_pac=TRUE) # log2FC generated from a grand mean over all samples pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FCgrand", pheno_target=list("stage"), merge_pac=TRUE) # All summarized tables have identical row names that can be merged names(summary(pac_test)) lapply(summary(pac_test), function(x){ identical(rownames(x), rownames(summary(pac_test)[[1]])) }) head(do.call("cbind", summary(pac_test))) ``` ## Differential Expression using PAC object Seqpac has a useful function, `PAC_deseq`, that lets you make omic scale expressional analysis on PAC objects using the popular `r Biocpkg("DESeq2")` package. This involves fitting generalized linear models with negative binomial distributions and subsequent correction for multiple testing using the PAC tables. For more information (such as how to correctly build models), please see the DESeq2 vingette `r Biocpkg("DESeq2")`. In addition to preparing a PAC object for DESeq2, `PAC_deseq` will organize and visualize the output. Lets have an example using the test dataset: ```{r, message=FALSE, eval=TRUE, warning=FALSE} load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) ###--------------------------------------------------------------------- ## Differential expression in Seqpac # Simple model testing stages against using Wald test with local fit (default) table(pheno(pac)$stage) output_deseq <- PAC_deseq(pac, model= ~stage, threads=6) ``` ```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'} # More complicated, but still graphs will be generated from 'stage' since it is # first in model output_deseq <- PAC_deseq(pac, model= ~stage + batch, threads=6) # Using pheno_target we can change the focus to batch # (no batch effect) output_deseq <- PAC_deseq(pac, model= ~stage + batch, pheno_target=list("batch")) # With pheno_target we can also change the direction of the comparison output_deseq <- PAC_deseq(pac, model= ~stage, pheno_target=list("stage", c("Stage5", "Stage3"))) ## In the output you find PAC merged results, target plots and output_deseq names(output_deseq) head(output_deseq$result) ``` ## Principal component analysis (PAC_pca) The `PAC_pca` function uses the `r CRANpkg("FactoMineR")` package to make a principle component analysis, from which scatter plots are plotted using the `r CRANpkg("ggplot2")` package. Here are some examples on how to use PAC_pca: ```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'} ###--------------------------------------------------------------------- ## PCA analysis in Seqpac # As simple as possible output_pca <- PAC_pca(pac) # Two 'folders' in the output names(output_pca) # Using pheno_target output_pca <- PAC_pca(pac, pheno_target =list("stage")) ``` ```{r, message=FALSE, eval=TRUE, warning=FALSE} # Using pheno_target with sample labels output_pca <- PAC_pca(pac, pheno_target =list("stage"), label=pheno(pac)$sample) ``` ```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'} # Plotting sequences instead output_pca <- PAC_pca(pac, type = "anno", anno_target =list("Biotypes_mis0")) ``` ## Size distribution and nucleotide bias There are two function that can plot size distributions using the `r CRANpkg("ggplot2")` package: -*PAC_nbias:* First nucleotide bias -*PAC_sizedist:* Class size distribution **PAC_nbias**
Note, first nucleotide bias can be visualized without any advanced annotations added to the `anno(pac)` table ```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'} ###--------------------------------------------------------------------- ## Nucleotide bias in Seqpac load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) # Test without annotations # (note, is equivalent to an unfiltered master PAC) pac_test <- pac anno(pac_test) <- anno(pac_test)[,1, drop = FALSE] output_nbias <- PAC_nbias(pac_test) cowplot::plot_grid(plotlist=output_nbias$Histograms) # Now lets use an anno_target in an annotated PAC targetting miRNA # (Oops, heavy T-bias on 1st nt; are they piRNA?) table(anno(pac)$Biotypes_mis0) output_nbias <- PAC_nbias(pac, anno_target = list("Biotypes_mis0", "miRNA") ) cowplot::plot_grid(plotlist=output_nbias$Histograms) # Switch to 10:th nt bias output_nbias <- PAC_nbias(pac, position=10, anno_target = list("Biotypes_mis0", "miRNA")) cowplot::plot_grid(plotlist=output_nbias$Histograms) ``` ```{r, eval=TRUE, echo=TRUE} # Summarized over group cpm means pac <- PAC_summary(pac, norm = "cpm", type = "means", pheno_target=list("stage"), merge_pac=TRUE) output_nbias <- PAC_nbias(pac, summary_target = list("cpmMeans_stage") ) cowplot::plot_grid(plotlist=output_nbias$Histograms) ```
**PAC_sizedist**
The size distribution also works in a similar fashion, but instead divides the stacked columns using an anno_target: ```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'} ###--------------------------------------------------------------------- ## Biotype size distribution # Divide stacked bars by biotype with no mismatch allowed output_sizedist_1 <- PAC_sizedist(pac, anno_target = list("Biotypes_mis0")) cowplot::plot_grid(plotlist=c(output_sizedist_1$Histograms), ncol=3, nrow=3) # Divide stacked bars by biotype with allowing up to 3 mismaches output_sizedist_2 <- PAC_sizedist(pac, anno_target = list("Biotypes_mis3")) cowplot::plot_grid(plotlist=c(output_sizedist_2$Histograms), ncol=3, nrow=3) # anno_target is order sensitive, thus can take care of color order issues: ord_bio <- as.character(unique(anno(pac)$Biotypes_mis3)) ord_bio <- ord_bio[c(1,5,2,4,3,6,7)] output_sizedist_2 <- PAC_sizedist(pac, anno_target = list("Biotypes_mis3", ord_bio)) # And again, we can use a summary_target instead: output_sizedist_sum <- PAC_sizedist(pac, summary_target = list("cpmMeans_stage"), anno_target = list("Biotypes_mis3", ord_bio)) cowplot::plot_grid(plotlist=output_sizedist_sum$Histograms) ## Note: ####################################################################### # 1. miRNA is clearly associated with the correct size (21-23) nt, which are # # dramatically increased in Stage 5 when zygotic transcription has started. # # 2. piRNA was deliberately left out from the fasta references. Note, however, # # that there is a broad peak with no annotations between 20-30 nt in Stage 1, # # which also showed a T-bias at the first nt. Possibly piRNA? # ################################################################################ ``` ## Simple stacked bars and pie charts The `PAC_stackbar` and `PAC_pie` have the same arguments as input but with slightly different outputs. Summaries over groups in `pheno(PAC)` are controlled på the 'summary' argument. summary= "all" | % are generate over a mean of all samples summary= "sample" | % are generate for each sample summary= "pheno" | Group % are derived from the pheno_target object
**PAC_stackbar**
```{r, message=FALSE, include = TRUE, eval=TRUE} ###--------------------------------------------------------------------- ## Stacked bars in Seqpac # Choose an anno_target and plot samples (summary="samples") PAC_stackbar(pac, anno_target=list("Biotypes_mis0")) ``` ```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'} # 'no_anno' and 'other' will always end on top not matter the order ord_bio <- as.character(sort(unique(anno(pac)$Biotypes_mis3))) p1 <- PAC_stackbar(pac, anno_target=list("Biotypes_mis0", ord_bio)) p2 <- PAC_stackbar(pac, anno_target=list("Biotypes_mis0", rev(ord_bio))) cowplot::plot_grid(plotlist=list(p1, p2)) # (Hint: if you want them to appear not on top, rename them) # Reorder samples by pheno_targets PAC_stackbar(pac, pheno_target=list("batch"), summary="samples", anno_target=list("Biotypes_mis0")) # Instead, summarized over pheno_target (summary="pheno") PAC_stackbar(pac, anno_target=list("Biotypes_mis0"), summary="pheno", pheno_target=list("stage")) ```
**PAC_pie**
```{r, eval=TRUE,fig.show=FALSE,results="hide",warning=FALSE,fig.keep='none'} ###--------------------------------------------------------------------- ## Pie charts in Seqpac # Choose an anno_target and plot samples (summary="samples"; default) PAC_pie(pac, anno_target=list("Biotypes_mis0")) # Ordered pie charts of grand mean percent of all samples labeled with percent ord_bio <- as.character(sort(unique(anno(pac)$Biotypes_mis3)), unique(anno(pac)$Biotypes_mis0)) output_pie_1 <- PAC_pie(pac, anno_target=list("Biotypes_mis0", ord_bio), summary="all", labels="percent") output_pie_2 <- PAC_pie(pac, anno_target=list("Biotypes_mis3", ord_bio), summary="all", labels="percent") cowplot::plot_grid(plotlist=c(output_pie_1, output_pie_2), labels = c("mis 0","legend", "mis 3"), ncol=2, nrow=2, greedy=TRUE, scale=1.15) # Rotate PAC_pie(pac, anno_target=list("Biotypes_mis0"), summary="all", angle=180) PAC_pie(pac, anno_target=list("Biotypes_mis0"), summary="all", angle=40) # Compare biotype mapping with or without mismatches and group by pheno(PAC) ord_bio <- as.character(sort(unique(anno(pac)$Biotypes_mis0))) output_mis0 <- PAC_pie(pac, pheno_target=list("stage"), summary="pheno", anno_target=list("Biotypes_mis0", ord_bio)) output_mis3 <- PAC_pie(pac, pheno_target=list("stage"), summary="pheno", anno_target=list("Biotypes_mis3", ord_bio)) cowplot::plot_grid(plotlist=c(output_mis0, output_mis3), labels = names(c(output_mis0, output_mis3)), nrow=2, greedy = TRUE, scale=1.5) ```

# Specilized analysis in Seqpac Seqpac functions may be used for more advanced analysis than simple classification. For example, we can dwell deeper into the details between the alignments of reads and references, by classifying reads depending on where it align. The best example of such advanced alignment analysis is tRNA loop and range classification. We may also want to make further annotations of specific classes of reads, such as classifying piRNA depending on overlaps with transposable elements and genes. In the current version of this guide we have only included a section on how to do advanced alignment analysis for the purpose of tRNA classification, but please see the `PAC_gtf` function to find out how to use coordinate overlap annotations in Seqpac.
## Advanced alignment analysis (tRNA class analysis) So far, classification of sRNA has only involved whether reads align or not to for example a tRNA reference sequence. Type classification of tRNA, such as 5', 3', i' and half, tRF (as nicely described in the MINTmap tool \https://pubmed.ncbi.nlm.nih.gov/28220888/), resembles mapping against a reference genome. Obtaining the exact coordinates of where a read align to the reference allows us to plot coverage plots showing the coverage of sRNA across the reference.
Conveniently, Seqpac contains a 'backdoor' to the reanno workflow, which allows the user to quickly obtain mapping coordinates of sequences in a PAC object that align to a reference fasta file. This is done by the `PAC_mapper` function. This function uses temporary output files from bowtie to generate a map object, which simply lists the reads that mapped to each sequence in the reference.
Importantly, if the Bowtie index is missing for a fasta reference, `PAC_mapper` will automatically generate such an index. Thus, even though larger references may be possible to use, they are discouraged with this function.
After a map object has been generated, the `PAC_covplot` function can make coverage plots of reads mapping to each reference. Lets have a look on how it works in Seqpac for advanced tRNA analysis using the test dataset:
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'} load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata", package = "seqpac", mustWork = TRUE)) ###--------------------------------------------------------------------- ## tRNA analysis in Seqpac # First create an annotation blank PAC with group means anno(pac) <- anno(pac)[,1, drop=FALSE] pac_trna <- PAC_summary(pac, norm = "cpm", type = "means", pheno_target=list("stage"), merge_pac = TRUE) # Then reannotate only tRNA using the PAC_mapper function ref <- "/some/path/to/trna/fasta/recerence.fa" # or use the provided fasta ref <- system.file("extdata/trna2", "tRNA2.fa", package = "seqpac", mustWork = TRUE) map_object <- PAC_mapper(pac_trna, ref=ref, N_up = "NNN", N_down = "NNN", mismatches=0, threads=6, report_string=TRUE, override=TRUE) # Hint: By adding N_up ad N_down you can make sure that modified fragments (like # 3' -CAA in mature tRNA are included). ## Plot tRNA using xseq=TRUE gives you the reference sequence as X-axis: # (OBS! Long reference will not show well.) cov_tRNA <- PAC_covplot(pac_trna, map_object, summary_target = list("cpmMeans_stage"), xseq=TRUE, style="line", color=c("red", "black", "blue")) cov_tRNA[[1]] # Targeting a single tRNA using a summary data.frame PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"), map_target="tRNA-Lys-CTT-1-9") # Find tRNAs with many fragments # 1st extract number of rows from each alignment n_tRFs <- unlist(lapply(map_object, function(x){nrow(x[[2]])})) # The test data is highly down an filterd sampled, but still some with tRNA have # a few alignment table(n_tRFs) names(map_object)[n_tRFs>2] # Lets select a few of them and plot them selct <- names(map_object)[n_tRFs>2][c(1, 16, 27, 37)] cov_plt <- PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"), map_target=selct) ``` ```{r, eval=TRUE, warning=FALSE} cowplot::plot_grid(plotlist=cov_plt, nrow=2, ncol=2) ```
Now, tRNAs can be further classified into isodecoder and isoacceptors. This information is usually provided within the reference names in a tRNA fasta reference file. Thus, it can be extracted from these names. Please see section, `4.1 Fasta references` for some examples on how to work with fasta references in R.
Classification of cleavage products, like 5'and 3' halves etc, are more complicated. Such classification involves knowledge of where in the tRNA loops (A-loop, anticodon loop and T-loop) cleavage may occur. There are external software that predict loops from models of secondary structures. One of the most popular tools for predicting tRNA secondary structures are tRNAscan-SE (http://lowelab.ucsc.edu/tRNAscan-SE/). The output from this software is an ss file, which stores loop information in the following format: Full length tRNA: tRNA-Pro-CGG-2-1_chrX:18565764-18565835_(+) GGCTCGTTGGTCTAGAGGTATGATTCTCGCTTCGGGTGCGAGAGGTCCCGGGTTCAATTCCCGGACGAGCCC >>>>>>>..>>>.........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<. Loop 1 Loop 2 Loop 3
You may notice that each loop is contain between ">>>...<<<" and that each character (".><") corresponds to one nucleotide in the original full length tRNA. Ss-files for most common genomes are readily available for download at http://gtrnadb.ucsc.edu/.
In Seqpac, the `map_rangetype` function is used to further annotate the resulting map object obtained using the `PAC_mapper` function. First, if tRNA reference names were provided in the correct format (eg. "tRNA-Pro-CGG-2-1_chrX:18565764-18565835_(+): format: name -> genome coordinates -> strand), it will automatically extract the isodecoder (e.g. "Pro") and isoacceptor (e.g. "CGG") and make them seperate factors. It may also classify each mapped PAC sequence in relation to the start and end terminals of the full length tRNA. Lastly, given an ss file, it can annotate PAC sequences in relation to tRNA loops. The `tRNA_class` function can then combine the information in the mapping object and to classify each read sequence in the PAC object. Lets have a look on how this is done:
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'} ###--------------------------------------------------------------------- ## Generate range types using a ss-file # Download ss object from GtRNAdb # ("http://gtrnadb.ucsc.edu/") ss_file <- "/some/path/to/GtRNAdb/trna.ss" # or for testing use the provided ss-file in secpac ss_file <- system.file("extdata/trna2", "tRNA2.ss", package = "seqpac", mustWork = TRUE) # Classify fragments according to loop cleavage (small loops are omitted) map_object_ss <- map_rangetype(map_object, type="ss", ss=ss_file, min_loop_width=4) # Remove reference tRNAs with no hits map_object_ss <- map_object_ss[ !unlist(lapply(map_object_ss, function(x){x[[2]][1,1] == "no_hits"}))] # Now we have quite comprehensive tRNA loop annotations names(map_object_ss) map_object_ss[[1]][[2]] # Don't forget to check ?map_rangetype to obtain more options ###--------------------------------------------------------------------- ## Function classifying 5'-tRF, 5'halves, i-tRF, 3'-tRF, 3'halves # Does all tRNAs have 3 loops? table(unlist(lapply(map_object_ss, function(x){unique(x$Alignments$n_ssloop)}))) # Set tolerance for classification as a terminal tRF tolerance <- 5 # 2 nucleotides from start or end of full-length tRNA) ### Important: # We set N_up and N_down to "NNN" in the PAC_mapper step. To make sure # that we have a tolerance that include the original tRNA sequence # we set terminal= 2+3 (5). ``` ```{r, eval=TRUE, warning=FALSE} ## tRNA classifying function # Apply the tRNA_class function and make a tRNA type column pac_trna <- tRNA_class(pac_trna, map=map_object_ss, terminal=tolerance) anno(pac_trna)$type <- paste0(anno(pac_trna)$decoder, anno(pac_trna)$acceptor) anno(pac_trna)[1:5, 1:4] ```
When classification by isoacceptor/decoder and cleavage type has been done, the `PAC_trna` function can be applied to generate pheno group differences in relation to tRNA fragmentation:
```{r, message=FALSE, echo=TRUE, eval=TRUE} ###--------------------------------------------------------------------- ## Plot tRNA types # Now use PAC_trna to generate some graphs based on grand means trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL, join = TRUE, top = 15, log2fc = TRUE, pheno_target = list("stage", c("Stage1", "Stage3")), anno_target_1 = list("type"), anno_target_2 = list("class")) names(trna_result) names(trna_result$plots) names(trna_result$plots$Expression_Anno_1) cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Grand_means, trna_result$plots$Log2FC_Anno_1, trna_result$plots$Percent_bars$Grand_means, nrow=1, ncol=3) ``` ```{r,eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'} # By setting join = FALSE you will get group means trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL, join = FALSE, top = 15, log2fc = TRUE, pheno_target = list("stage", c("Stage1", "Stage3")), anno_target_1 = list("type"), anno_target_2 = list("class")) names(trna_result$plots$Expression_Anno_1) cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Stage1, trna_result$plots$Expression_Anno_1$Stage3, trna_result$plots$Log2FC_Anno_1, trna_result$plots$Percent_bars$Stage1, trna_result$plots$Percent_bars$Stage3, nrow=1, ncol=5) ```
That completes this vignette. Good luck!
```{r, eval=TRUE} sessionInfo() ```