---
title: "A guide to small RNA analysis using Seqpac"
shorttitle: "sRNA analysis using Seqpac"
author: "Daniel Nätt, Signe Skog, Lovisa Örkenby"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
abstract: >
Seqpac provides functions and workflows for analysis of short sequenced
reads. It was originally developed for small RNA (sRNA) analysis, but can
be implemented on any sequencing raw data (provided as a fastq-file), where
the unit of measurement is counts of unique sequences. The core of the
Seqpac strategy is the generation and subsequence annotation/analysis/
visualization of a standardized object called PAC. Using an innovative
target system, Seqpac process, analyze and visualize group differences
using the PAC object. Seqpac is particularly useful in generating sequence
coverage profiles across reference sequences (such as tRNA or gene coverage)
without loosing the integrity of the original fastq sequences. Thus,
candidate sequences are easily validated by for example a BLAST at a
genome browser. This guide will introduce you to the basics of our
recommended workflow for sRNA analysis, both fastq-processing (e.g.
adapter trimming and generating sequence counts) and post-fastq analysis
(e.g filtering, genome and sRNA alignment, differential expression
and visualization). To assure package quality, Seqpac has been optimized for
Bioconducter.
Seqpac package version: `r packageVersion("seqpac")`
output:
BiocStyle::html_document:
number_sections: yes
toc: true
vignette: >
%\VignetteIndexEntry{A guide to small RNA analysis using Seqpac}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```
Stable versions of Seqpac is available on Bioconductor, and development branches
can be obtained at github:
```{r BiocManager, eval=FALSE}
## Installation Bioconductor:
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("seqpac")
## Installation github:
devtools::install_github("Danis102/seqpac",
dependencies=TRUE, ref="development")
```
# Introduction
## Overview
Seqpac contains an extensive toolbox for the analysis of short sequenced reads.
The package was originally developed for small RNA (sRNA) analysis, but can be
applied on any type of data generated by large scale nucleotide sequencing,
where the user wish to maintain sequence integrity throughout the whole
analysis. This involves, for example, regular RNA-seq with long RNA. But note,
however, that it currently does not automatically support paired-end sequencing.
To preserve sequence integrity, and thereby guarantee a clean data linage,
Seqpac applies sequence-based counting. This can be contrasted against
feature-based counting, where reads are counted over known genomic features,
like protein coding genes or sRNA. Such strategy will result in a count table
where each count is made per feature (e.g. number of read counts mapping to a
gene). In Seqpac, such annotations are initially disregarded, as a count table
is generated solely by counting unique sequences in the raw sequence files
(fastq). Annotations against known reference(s) (such as databases containing
sequences of full genomes, miRNA, tRNA, protein coding genes, or genomic
coordinates of such features) are done after the count table has been produced.
The advantage of using sequence-based counting, compared to feature-based
counting, is that for example mismatches in the classification of reads, can be
applied any time during the analysis. In feature-based counting, allowing a
mismatch in the mapping will result in a loss of information (loss of sequence
integrity).
For example, say that you have 506 read counts for miR-185 (a miRNA) when
allowing 3 mismatches against the reference. In feature-based analysis this will
result in 1 entry in the count table. Unless you have saved the exact alignments
elsewhere, you have lost the information about what sequences that are included
in that count. The only thing you know is that the sequence may differ from the
reference sequence by any 3 mismatches. In sequence-based counting, reads are
counted as unique sequences and each unique sequence is then mapped against the
target reference databases. Information about whether a sequence was aligning
against a reference is maintained in a linked annotation table that can contain
both perfect alignments and alignments allowing mismatches. Thus, in any stage
of the analysis mismatches and classifications into features can be dynamically
controlled, meaning that you can easily observe the effect of for example
allowing more mismatches or changes in hierarchical classifications (often
applied in sRNA analysis when a sequence align with multiple classes of sRNA).
**But the best with sequence-based counting where you preserve sequence
integrity, is probably that you will be able blast your candidate sequences at
your favorite database, to verify and find out more about them.**
## The Seqpac workflow
The input file format for Seqpac is fastq, which can be generated from most
sequencing platforms. Two types of objects sustain the core of the Seqpac
workflow:
- The PAC object
- The Targeting objects
The PAC object stores the Phenotypic information (P), Annotations (A) and Counts
(C) needed for all downstream analysis, while the targeting object is used for
accessing specific subsets of data within the PAC object.
Seqpac provides two versions of the PAC object. In its simplest form it is a S3
list containing 3 data.frames (Pheno, Anno and Counts). As default, however, an
S4 PAC object is generated. By using the coercion method `as(pac-object,
"list")`, you can turn an S4 PAC into an S3 list. Similarly, by using
`as.PAC(pac-object)` you can turn an S3 PAC into an S4. Plaese see `?as.PAC` an
`?PAC` for more details and examples.
The targeting object is always a list of 2 character objects.
The workflow is roughly divided into four steps:
- Generate a PAC object from fastq-files
- Preprocess the PAC object
- Annotate the PAC object
- Analyze the PAC object.
## Quick start
Here is a quick simple workflow using Seqpac. Remember, however, that in Seqpac
you can create much more advanced workflows. Especially when applying the reanno
workflow, as will be described below.
```{r, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"}
# Setup
library(seqpac)
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
fastq <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
full.names = TRUE)
ref_tRNA <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
# Trim NEB adapter, generate counts and create PAC-object (pheno, anno, counts)
count_list <- make_counts(fastq, plot = FALSE,
parse="default_neb", trimming="seqpac")
pac <- make_PAC(count_list$counts)
pheno(pac)$groups <- c(rep("gr1",3), rep("gr2",3)) #Add groups to pheno table
# Preprocess PAC-object and creat means
pac <- PAC_norm(pac) # Default normalization is cpm
pac <- PAC_filter(pac, size=c(15,60)) # Here, filter on sequence length
pac <- PAC_summary(pac, norm = "cpm", type = "means",
pheno_target=list("groups", unique(pheno(pac)$groups)))
# Quickly align (annotate) and plot PAC-object
map_tRNA <- PAC_mapper(pac, ref=ref_tRNA, override=TRUE)
plts <- PAC_covplot(pac, map_tRNA, summary_target=list("cpmMeans_groups"))
plts[[13]]
```
## Example data
Contained within this package are 6 sRNA fastq file. These files were originally
generated by extracting sRNA from a single fruit fly embryo. The original fastq
file was acquired by Illumina NextSeq 500 sequencer using a High Output 75
cycles flow cell (product no: 20024906) and NEBNext® Small RNA Library Prep Set
for Illumina (E7330). Due to package size restriction, this fastq was randomly
down-sampled to a fraction of its original size into 6 much much smaller files.
There is also a prepared example PAC-object that were similarly generated
from randomly down-sampled fastq files (but here we used 9 unique embryos). In
addition, there are a few fasta reference file. These files are used in
examples of this vignette and in the function manuals.
# Making a PAC object
## Before making a PAC object (merge_lanes)
Seqpac works best with merged fastq, meaning that if sequencing was done in
separate lanes on the flow cell, where the same sample was present in all
lanes, lane files must be merged for each unique sample.
Conveniently, Seqpac contains a function that can be used to merge lane files:
`merge_lanes`. Please see, `?merge_lanes` for examples on how to merge
fastq-files. Specifically, this function searches the input folder for similar
file names and append files with similar names. As long as the user uses unique
sample names (e.g. sample1_lane1.fastq.gz, sample1_lane2.fastq.gz,
sample2_lane1.fastq.gz, sample2_lane2.fastq.gz etc) then `merge_lanes` will
merge lane files into the same sample (e.g. sample1.fastq.gz, sample2.fastq.gz).
## Generating a count table (make_counts)
Count tables are the core components in most sequence analysis. Seqpac uses
sequence-based counts (see 1.1 Overview). Thus, we need to generate a count
table with the number of unique sequences in each sample.
There are three ways of generating a count table in Seqpac.
- Trim raw fastq files using internal functions in Seqpac/R.
- Trim raw fastq files using external functions outside R.
- Import reads from already trimmed fastq files.
The two first involves adapter trimming, while the last involves already
trimmed reads. All input options are contained within the `make_counts`.
function.
### Internal trimming
Generating a count table from untrimmed fastq files using nothing but R packages
(internal) depends on the `make_trim` function. This function goes through a
series of trimming cycles using the `trimLRPatterns` function in the
`r Biocpkg("Biostrings")` package to generate highly comparable adapter trimming
as generated by popular fastq trimming software, such as cutadapt.
Seqpac trimming is praticularly efficient when trimming many fastq files at the
same time. It parallelizes trimming on multiple processor cores/threads by
applying functions in the `r CRANpkg("foreach")` package. Thus, if `threads=12`
and you input 12 fastq files, all files will be trimmed simultaneously in
parallel. Note, however, that each thread will make a substantial impact on your
computers memory performance. Thus, if you know that you are low in ram memory
use fewer number of threads.
When `make_trim` is used on its own it results in trimmed fastq files stored at
a user defined location. When `trimming="Seqpac"` is set in the `make_counts`
function, this function parses settings to `make_trim` to generate a count table
from temporary stored trimmed fastq files.
```{r, eval=FALSE, echo=TRUE}
## Using default settings for NEB type adapter
# NEB= NEBNext® Small RNA Library Prep Set for Illumina (E7300/E7330)
# For illumina type adapters, 'parse="default_illumina"' may be used
# To speed things up we use 4 threads.
count_list <- make_counts(input=fastq, trimming="seqpac", threads=4,
parse="default_neb")
```
### External trimming
As an alternative to internal trimming, two external software:
[cutadapt](https://cutadapt.readthedocs.io/en/stable/installation.html) and
[fastq_quality_filter](http://hannonlab.cshl.edu/fastx_toolkit/commandline.html)
can be used to trim the adapter sequence and remove low quality reads prior of
making a count table. By setting `trimming="cutadapt" make_counts will check if
these software is installed on your system.
### Already trimmed fastq
Setting `trimming=NULL` in the `make_counts` function will pass
the adapter trimming and a count table will be generated directly from the fastq
files. Thus, this option can be used if you already have trimmed your fastq
files.
### Optimizing for low-end systems and challenging data
In the `make_counts`function users may choose to process data on-disk and in
chunks instead of processing data in-memory. Thus, users on low-end computers
with little RAM may choose to process smaller chunks and save data to disk
from time to time on the expense of performance. This is controlled by the
`optimize`option. Here, two panic filters are also available. These filters will
only be applied if the unfiltered data fails to complete due to memory
shortage. If such filter is applied to a fastq file this is clearly stated in
the progress report, which give you the choice of what to do with such a sample.
See the manuals for `make_counts` for more details.
### Outputs from make_counts
Besides the counts table, `make_counts` automatically generates a
progress report for the trimming and evidence filter, as well as bar graphs
showing the impact of the evidence filter (if plots=TRUE).
### Evidence filtering
Users may already when making the counts table with `make_counts`
markedly reduce the noise in the data by specifying the number of independent
evidence that a specific sequence must reach to be included. This is controlled
by the `evidence` argument. As default, `evidence=c(experiment=2,
sample=1)` will include all sequences that have >=1 counts in at least 2
independent fastq files.
Importantly, we routinely observe that 95-99% of all reads are maintained when
applying the default evidence filter, while >50% of the unique sequences will be
dropped because they can not be replicated across two independent samples. For
an example of how a saturated experiment may look like, see the original paper,
Skog et al. from 2021.
As you may have guessed, the 'experiment' argument controls the number of
independent fastq evidence needed across the whole experiment. Note, however,
that 'sample' does not control the number of counts needed in each sample. The
evidence filter will always use >=1 counts in X number of fastq files. Instead
'sample' controls when a sequence should be included despite not reaching the
'experiment' threshold. Thus, if `evidence=c(experiment=2, sample=10)`,
sequences that reach 10 counts in a single sample will also be included.
Here is an example on how to use the 'sample' argument:
```{r, eval=FALSE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"}
###---------------------------------------------------------------------
## Evidence over two independent samples, saving single sample sequences
## reaching 3 counts
test <- make_counts(input=fastq, trimming="seqpac",
parse="default_neb", threads=3,
evidence=c(experiment=2, sample=3))
extras <- apply(test$counts, 1, function(x){sum(!x==0)})
test$counts[extras==1,] # A few still have 3 alone
```
Lastly, if `evidence=NULL` all unique sequences will be saved. Remember,
however, that the count table will become larger in well saturated experiments
that may lead to performance issues on some systems, since each unique sequence
occupies a row in `counts(PAC)` and `anno(PAC)`. Apply other filters prior to
annotating your sequences may solve such problems.
## Generate the phenotype table (make_pheno)
Next, we generate a Pheno table (P) and merge it with the progress report that
was stored from the `make_counts` output. The Pheno table contains
sample specific information such as sample name and possible interventions, with
each sample as a row.
The `make_pheno` function can import a Pheno table both externally by providing
a path to a .cvs file, as well as internally by just passing a data.frame. In
common for both options, is that a unique column named "Sample_ID" needs to be
present, and must contain the sample names (column names) present in the counts
table generated by `make_counts`. Now, `make_pheno` will attempt to match
similar names, but to avoid problems use identical names.
In addition, `make_pheno` may add the progress report generated by
`make_counts`. Lets have a look at how this works:
```{r, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"}
###---------------------------------------------------------------------
## Generate a Pheno table using the file names of test fastq
Sample_ID <- colnames(count_list$counts)
pheno_fl <- data.frame(Sample_ID=Sample_ID,
Treatment=c(rep("heat", times=1),
rep("control", times=2)),
Batch=rep(c("1", "2", "3"), times=1))
pheno <- make_pheno(pheno=pheno_fl,
progress_report=count_list$progress_report,
counts=count_list$counts)
```
## make_PAC
Finally, we put everything together as a master PAC object using the `make_PAC`
function. Conveniently, skipping the `anno=` argument will automatically add a
very simple annotation table to the PAC-object that can be filled later.
```{r, eval=TRUE, echo=TRUE, warning=FALSE, message=FALSE, results = "hide"}
###---------------------------------------------------------------------
## Generate PAC object (default S4)
pac_master <- make_PAC(pheno=pheno, counts=count_list$counts)
pac_master
## If TRUE the PAC object is ok
PAC_check(pac_master)
```
# Preprocessing the PAC object
## Targeting objects
Seqpac uses a simple targeting system to divide and focus the analysis of a PAC
object to certain sample groups (Pheno) or specific sequences (Anno). If not
specified otherwise in the function manual, a targeting object is always a two
item list, where the first object is naming a column in a specific table in the
PAC and the second object tells the function which categories in that column
that should be targeted.
- `pheno_target`: extracts samples from a column in `pheno(PAC)`
- `anno_target`: extracts sequences from a column in `anno(PAC)`
While `pheno_target` and `anno_target` are the most common targeting objects,
there are other targeting object. Common for all is that they target a specific
sub-table in the PAC.
Now, lets take a look at the provided example PAC-object in Seqpac. As noted by
the name, this PAC has already been filtered and annotated, but it will still
show you the principle of preprocessing using targeting objects.
```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'}
###---------------------------------------------------------------------
## Load the PAC-object and inspect the columns in Pheno and Anno
library(seqpac)
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
pheno(pac)[,1:5]
anno(pac)[1:5,1:5]
counts(pac)[1:5,1:5]
## This PAC is an S4 object, but you can easily turn it to an S3 list with as().
## using as.PAC() will turn it back into S4:
isS4(pac)
pac_s3 <- as(pac, "list")
lapply(pac_s3[1:3], head)
isS4(pac_s3)
pac <- as.PAC(pac_s3)
methods(class="PAC") #all S4 methods
```
A pheno_target object that will exclusively target stage 1 and 5 samples would
look like this: `pheno_target=list("stage", c("Stage1", "Stage5"))`
While an anno_target that will target sequences of 22 nt in length would look
like this: `anno_target=list("Size", "22")`
Notice that both targeting objects are lists holding two secondary objects: one
character string targeting a specific column, and one character vector targeting
specific categories within the target column. The only difference being that
each targeting object target different tables in the PAC object.
Important, if the second object is set to NULL or is left out this will
automatically target all samples in the specified column. If the anno_target is
a character vector (not a list) some function will attempt to extract matches in
the row names of either Pheno or Anno (see example with `PAC_filtsep` below).For
more, please refer to the original Seqpac publication: Skog et al. 2021.
## Filtering
A master PAC object will contain sequences in your original fastq files that
passed the trimming and the initial evidence filter. Now, in many experiments
this still involves millions of unique sequences, making the master PAC
difficult to work with. Thus, it is often wise to further reduce noise prior to
normalization and annotation.
Seqpac contains two functions for this purpose:
- `PAC_filter`
- `PAC_filtsep`
While `PAC_filter` gives a variety of choices to subset the PAC object according
to for example sequence size or minimum counts in a certain proportion of the
samples,`PAC_filtsep` extracts the sequence names that pass a filter within
sample groups.
Here are some examples:
```{r, eval=TRUE, echo=TRUE}
###---------------------------------------------------------------------
## Extracts all sequences between 20-30 nt in length with at least 5 counts
## in 20% of all samples.
pac_filt <- PAC_filter(pac, size=c(20,30), threshold=5,
coverage=20, norm = "counts",
pheno_target=NULL, anno_target=NULL)
```
```{r, eval=FALSE, echo=TRUE}
###---------------------------------------------------------------------
## Optionally, a simple coverage/threshold graph at different filter depths
## can be plotted with stat=TRUE (but it will promt you for a question).
pac_filt <- PAC_filter(pac, threshold=5, coverage=20,
norm = "counts", stat=TRUE,
pheno_target=NULL, anno_target=NULL)
```
```{r, eval=TRUE, echo=TRUE}
###---------------------------------------------------------------------
## Extracts all sequences with 22 nt size and the samples in Batch1 and Batch2.
pac_filt <- PAC_filter(pac, subset_only = TRUE,
pheno_target=list("batch", c("Batch1", "Batch2")),
anno_target=list("Size", "22"))
###---------------------------------------------------------------------
## Extracts sequences with >=5 counts in 100% of samples within each stage
filtsep <- PAC_filtsep(pac, norm="counts", threshold=5,
coverage=100, pheno_target= list("stage"))
pac_filt <- PAC_filter(pac, subset_only = TRUE,
anno_target= unique(do.call("c", as.list(filtsep))))
```
Notice that `PAC_filtsep` only extracts the sequence names and stores them in a
data.frame. Converted into a character vector with unique names (using `unique`
+ `do.call`) they can be applied as an anno_target that targets sequence names
in Anno.
Hint, the data.frame output of `PAC_filtsep` has been designed for the purpose
of generating Venn and Upset diagrams that visualize overlaps across groups:
```{r, eval=FALSE, echo=TRUE}
###---------------------------------------------------------------------
## Venn diagram using the venneuler package
olap <- reshape2::melt(filtsep,
measure.vars = c("Stage1", "Stage3", "Stage5"),
na.rm=TRUE)
plot(venneuler::venneuler(olap[,c(2,1)]))
###---------------------------------------------------------------------
## Setting output="binary", prepares data for upset plots (UpSetR package):
filtsep_bin <- PAC_filtsep(pac, norm="counts", threshold=5,
coverage=100, pheno_target= list("stage"),
output="binary")
UpSetR::upset(filtsep_bin, sets = colnames(filtsep_bin),
mb.ratio = c(0.55, 0.45), order.by = "freq",
keep.order=TRUE)
```
## Storage of processed data in PAC
While the simplest PAC only contains three data.frame objects (P, A and C),
additional 'folders' will be stored in the PAC object as the analysis
progresses. In fact, if using a S3 PAC (list), you may choose to store any
number of objects in your PAC as long as they do not conflict with the names of
the standard folders, which are:
Pheno (P): data.frame
Anno (A): data.frame
Counts (C): data.frame
norm: list of data.frames
summary: list of data.frames
The *norm* folder will be used to store counts that have been normalized in
different ways, while the *summary* folder will contain the data that has been
summarized across groups of samples. In both cases, applying a filter using
`PAC_filter` will automatically subset not only P, A, and C, but also all tables
stored in `norm(PAC)` and `summary(PAC)`.
## Normalization
The built in normalization methods in Seqpac are handled by the `PAC_norm`
function. In the current version three methods are available:
- Counts per million reads (cpm)
- Variance-Stabalizing Transformation (vst)
- Regularized log transformation (rlog)
More specifically, cpm normalize the dataset in relation to the total number of
reads, while vst and rlog uses both mean dispersion and size factor to produce
homoskedastic values with functions available in the `r Biocpkg("DESeq2")`
package.
It is easy to generate your own normalized count table using your
method of choice, and import it as a data.frame to the *norm* folder in the PAC
object. Just remember that row and column names must be identical to the
original Counts table. Use `PAC_check` to verify.
Lets generate two normalized count tables with cpm and vst:
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE, error=FALSE}
###---------------------------------------------------------------------
## Example normalization in Seqpac
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
# PAC_norm works on both
pac <- PAC_norm(pac, norm="cpm")
pac <- PAC_norm(pac, norm="vst")
pac
```
From this, it may sometimes be wise to run further filtering steps to focus on
reads with high cpm values.
```{r, results=FALSE, eval=TRUE}
###---------------------------------------------------------------------
## Filtering using cpm instead of raw counts
## filter >=100 cpm in 100% of samples in >= 1 full group
filtsep <- PAC_filtsep(pac, norm="cpm", threshold=100, coverage=100,
pheno_target= list("stage"))
pac_cpm_filt <- PAC_filter(pac, subset_only = TRUE,
anno_target= unique(do.call("c", as.list(filtsep))))
```
# Generate annotations
From the start a PAC object contains a very limited Anno table such as this:
```{r, results=FALSE, eval=TRUE}
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
anno(pac) <- anno(pac)[,1, drop = FALSE]
head(anno(pac))
```
To facilitate easy adoption of Seqpac to any species, we provide a fast and
flexible workflow for mapping PAC sequences against user defined reference
databases. We call this reannotation, since we map the sequences in Anno object
sequentially over one or multiple references.
Important, Seqpac relies on the popular bowtie aligner `r Biocpkg("Rbowtie")`
for mapping: http://bowtie-bio.sourceforge.net/manual.shtml. This means that
references must be indexed using the `bowtie_build` function before they are
compatible with the reannotation workflow (but see the `PAC_mapper` function for
an index free alternative).
## Fasta references
Small RNA annotation/mapping depends heavily on choosing your references wisely
and not to be too greedy if possible. If your hypothesis target a specific type
of class (like tRNA), it might for instance be wiser to target such references
only.
Many databases provide their references in the fasta file format. Seqpac can use
any correctly formated fasta file as input in the reannotation workflow.
However, we distinguish two kinds of fasta references:
- Reference genomes.
- Other specialized references.
You can download the files manually from the source database. Below are some
helpful example links where to retrieve useful fasta references. Note that all
compressed files must be uncompressed to work with the bowtie aligner.
**Reference genomes:**
[Ensembl Animals](https://www.ensembl.org/info/data/ftp/index.html)
[Ensembl Plants](https://plants.ensembl.org/info/data/ftp/index.html)
[UCSC](http://hgdownload.cse.ucsc.edu/downloads.html)
Hint: Unmasked top level fasta files are best for most purposes, e.g. Ensembl
file name *.dna.toplevel.fa.gz.
**Other specialized references:**
[miRNA](http://www.mirbase.org/ftp.shtml)
[miRNA](https://mirgenedb.org/download)
[piRNA](http://www.regulatoryrna.org/database/piRNA/download.html)
[many ncRNA Animals](https://www.ensembl.org/info/data/ftp/index.html)
[many ncRNA Plants](https://plants.ensembl.org/info/data/ftp/index.html)
[tRNA](http://gtrnadb.ucsc.edu/)
[repeats](http://repeatmasker.org/genomicDatasets/RMGenomicDatasets.html)
There are also several packages in Bioconductor/R such as `r
Biocpkg("BSgenome")` and `r Biocpkg("biomaRt")` that allow you to download
reference genomes and other specialized references directly into r. Actually,
using the `download.file` function makes most urls (https/http/ftp etc)
available for download. Custom fasta files can also be used, and you can easily
merge, add and remove features to the references by functions in the `r
Biocpkg("Biostrings")` package, such as `readDNAStringSet` (read fasta) and
`writeXStringSet` (save fasta).
As default, bowtie only reports the names up to the first white space. Sometimes
the sequence names need to be modified or rearranged for the bowtie aligner to
best report the names in the mapping output. This can be done using the
`r Biocpkg("Biostrings")` package to read, rearrange and save the fasta. A
useful expression to verify that you have caught the relevant info before the
the first white space is: `do.call("rbind", strsplit(names(),
" "))[,1]`
You may also want to add tRNAs from the mitochrondrial genome to your tRNA
reference, which is not always provided in the GtRNAdb fasta. This can be done
by moving them from Ensembl ncRNA fasta or scan the mitochondrial genome
manually at [tRNAscan-SE](http://lowelab.ucsc.edu/tRNAscan-SE/). The
mitochonridal genome is found in the reference genome fasta.
## Indexing the fasta references
Before we can use the fasta references we need to index them for the bowtie
aligner. You can do this by using the `bowtie_build` function in the Rbowtie
package or by running bowtie externally, outside R. For more information see
?Rbowtie::bowtie, Rbowtie::bowtie_build_usage() and
http://bowtie-bio.sourceforge.net/manual.shtml.
Important, the bowtie prefix must have the same basename (`prefix=`) and the
indexes must be saved in the same directory (`outdir=`) as the original fasta
file. Since the `bowtie_build` naming is often confusing for newbies, here
are some examples:
```{r, eval = TRUE}
###---------------------------------------------------------------------
## Examples of generating bowtie indexes
path <- "/some/path/to/your/folder"
#Genome
mycoplasma_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa",
package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(mycoplasma_path,
outdir=gsub("mycoplasma.fa", "", mycoplasma_path),
prefix="mycoplasma", force=TRUE)
#tRNA
trna_path <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(trna_path,
outdir=gsub("tRNA.fa", "", trna_path),
prefix="tRNA", force=TRUE)
#rRNA
rrna_path <- system.file("extdata/rrna", "rRNA.fa",
package = "seqpac", mustWork = TRUE)
Rbowtie::bowtie_build(rrna_path,
outdir=gsub("rRNA.fa", "", rrna_path),
prefix="rRNA", force=TRUE)
```
Building indexes may take some time for large fasta, such as a reference genome.
You only need to generate the index once for every fasta reference, however.
## GTF feature coordinates
After the sequences have been mapped against a reference genome they can also
acquire annotations from genomic feature files, such as the gtf and gff
formats. These files contains the coordinates for different genomic features
associated with a specific reference genome, such as repetitive regions, exons
and CpG islands etc.
GTF files can often be acquired at the same places as the fasta reference files
(see above). The `r Biocpkg("biomartr")` package also have functions for
downloading some gtf files (e.g. `getGTF`).In addition, you may create your own
by downloading/importing different tables, for example using `r
Biocpkg("rtracklayer")` (see `readGFF`, `getTable`, `ucscTableQuery`) and then
create a GRanges object (`r Biocpkg("GenomicRanges")` package) where you add
your preferred table info. Finally you may save it using rtracklayer again
(`export(, , format="gtf"`)
## The reannotation workflow
The procedure to annotate sequences in Seqpac is called reannotation.
The PAC reannotation family of functions carries out the following tasks:
- Maps sequences against fasta references
- Stores the output on hard drive
- Reads selected parts into R
- Creates a reanno object
- Generates an overview from the reanno object
- Classifies and simplifies annotations
- Adds classifications to a PAC object
### The functions in the reanno workflow
**map_reanno**
To accommodate most users on most platforms, Seqpac provides two alternatives
for calling bowtie using the `map_reanno` function: internally from within R
(`type="internal"`) or externally from outside R (`type="external"`). In the
internal mode, Seqpac uses the `bowtie` function in the `r Biocpkg("Rbowtie")`
package, while in the external mode bowtie is called by a system command to a
locally installed version of bowtie. Both options gives identical results, but
since the internal Rbowtie package is rarely updated, we provide the external
option.
Since the input commands for external bowtie and internal Rbowtie differs
slightly, `map_reanno` has two parse options: `parse_internal` and
`parse_external`. These can be used to control the two versions of bowtie
as if they were run from the console or command line, respectively.
While bowtie has its own way to handle mismatches in the alignments, Seqpac
disregards this and instead runs bowtie sequentially, increasing the number of
mismatches for each cycle. After each cycle all sequences that received a match
in any of the provided references will be subtracted. Therefore, in the next
cycle only sequences without a match will be 'reannotated' but allowing one
additional mismatch. The `map_reanno` function controls this by sequentially
increasing the number of mismatches after subtracting the matches.
**import_reanno**
This function is called by `map_reanno` to import the bowtie results after
each cycle. If you would run `import_reanno` separately it simply
provides you with options on how to import bowtie results into R. For example,
when aligning against a reference genome the mapping coordinates is important,
while mapping against more specialized references used for creating sRNA
classifications (rRNA, tRNA, miRNA etc), 'hit' or 'no hit' might be enough, and
will be a much faster option. The `map_reanno` function, therefore, has two
default import modes, import="genome" and import="biotype", which automatically
configure `import_reanno` for genome or class annotation.
**make_reanno**
After each annotation cycle, `map_reanno` will store an .Rdata file
(Full_reanno_mis0/1/2/3.Rdata) containing a list with all alignment hits for
each PAC-sequence-to-fasta-reference mismatch cycle. Calling `make_reanno` will
search for these .Rdata files in a given folder and create a reanno object in R.
This involves extracting, ordering and preparing annotations in a format that
can be merged with the existing Anno table in a PAC object. It will also
generate an overview table, that summarizes the annotations if multiple fasta
references were used by `map_reanno`.
**add_reanno**
Before merging the new annotations with a PAC object, unless we have provided a
fasta reference with only one type (e.g. miRNA for mirBase), classification not
only *between* fasta references but also *within* references might be necessary
(e.g. snoRNA, tRNA, miRNA, rRNA in the Ensembl_ncRNA fasta). Classification of
the fasta reference names, which was first saved by bowtie and then caught in
our reanno object as an annotation, is done with the `add_reanno` function.
Again, there are two primary modes, either "genome" or "biotype". When
type="genome" a standardized workflow anticipating genomic coordinates are
applied. When type="biotype" a list of search term vectors directed against each
fasta reference needs to be provided. Matches between search term (written as
regular expressions) and text strings in the fasta reference names will result
in a classification of the PAC sequence. Finally, `add_reanno` generates an
annotation table that can either be saved separately or merged with the original
PAC object.
**simplify_reanno**
While `add_reanno` starts the process of simplifying the reference name
annotations into useful classifications, it will generate multiple
classifications for each counted sequence if it matches multiple search terms.
The `simplify_reanno` function boils down multi-classifications into one
classification for each sequence. By doing so, an hierarchy must be set, where
priority of some classifications over others is done (e.g. miRNA over piRNA).
While this is a controversial topic, using such hierarchies is still the only
way to provide overview statistics in experiments targeting highly multimapping
sequences, such as small RNA. Nonetheless, the Seqpac workflow with its sequence
based counts, where hierarchical classification is done in the very end of the
annotation process, makes it easy to generate classifications using alternative
hierarchies and observe the consequences of your choices in for example a pie
chart.
In summary:
- **map_reanno** Bowtie reference mapping over mismatch cycles
- **import_reanno** Used by `map_reanno` to import bowtie output
- **make_reanno** Organizes `map_reanno` output into a reanno object
- **add_reanno** Classifies annotations according to reference name
- **simplify_reanno** Hierarchical classification
Already at this point it is good to know that Seqpac provides a 'backdoor'
function, `PAC_mapper`, that carries out many of the steps in the reanno
workflow using smaller references and pac objects. Please see '6.1 Advanced
alignment analysis' or ?PAC_mapper for more details.
### Reannotation in action
Lets observe the functions in action using the drosophila test dataset.
First we map against a reference genome, using `map_reanno`. Note that
import="genome" must be set to catch the genomic coordinates. Since
`map_reanno` can handle multiple fasta references. Remember, each
fasta must have bowtie indexes (see 4.2).
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE, error=FALSE}
# Lets reload our pac
library(seqpac)
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
anno(pac) <- anno(pac)[,1, drop = FALSE] # Remove previous Anno
###---------------------------------------------------------------------
## Genome mapping
# Path to your output folder
outpath_genome <- "/some/path/to/reanno_folder"
# or a temp folder
outpath_genome <- paste0(tempdir(), "/seqpac_reanno/reanno_genome")
# Look up the your own Bowtie indexed reference genomes (fasta)
ref_paths <- list(myco="")
# For testing, you can use a small mycoplasma genome provided with Seqpac
# But remember to provide single genomes as a named list anyway.
# The name will be used to keep track of your genome.
myco_path <- system.file("extdata/mycoplasma_genome", "mycoplasma.fa",
package = "seqpac", mustWork = TRUE)
ref_paths <- list(myco=myco_path)
# Run map_reanno
map_reanno(PAC=pac, ref_paths=ref_paths, output_path=outpath_genome,
type="internal", mismatches=1, import="genome",
threads=8, keep_temp=TRUE)
```
Similarly, we can map against other specialized references, that can be used for
classifying each sequence into biotypes. Note, unlike genome mapping, at this
stage we are not interested in where a sequence matches a reference for the
specialized references, only if it matches or not. Thus, we use
`import="biotype"`.
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE, error=FALSE}
###---------------------------------------------------------------------
## sRNA mapping using internal bowtie
# Path to output folder
outpath_biotype <- "/some/path/to/reanno_biotype"
# or a temp folder
outpath_biotype <- paste0(tempdir(), "/seqpac_reanno/reanno_biotype")
# Seqpac contains two fasta for tRNA and rRNA we can use for testing
trna_path <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
rrna_path <- system.file("extdata/rrna", "rRNA.fa",
package = "seqpac", mustWork = TRUE)
# Provide paths to bowtie indexed fasta references as a list
ref_paths <- list(tRNA=trna_path,
rRNA=rrna_path)
map_reanno(pac, ref_paths=ref_paths, output_path=outpath_biotype,
type="internal", mismatches=1, import="biotype",
threads=6, keep_temp=TRUE)
```
Now, the output .Rdata files for all mismatch cycles should have been stored in
the output folders. Lets import everything into R, generate a reanno objects and
plot some pie charts from the Overview table.
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'}
###---------------------------------------------------------------------
## Generate a reanno object with make_reanno
reanno_genome <- make_reanno(outpath_genome, PAC=pac, mis_fasta_check = TRUE)
reanno_biotype <- make_reanno(outpath_biotype, PAC=pac, mis_fasta_check = TRUE)
# Note, setting mis_fasta_check=TRUE will double check that the number of
# sequences that failed to receive an alignment in the last mismatch cycle
# agrees with the number sequences in the reanno object without an annotation.
# (these sequences are stored in mis_fasta_x.txt where x is max mismatches+1)
# List structure
str(reanno_genome, max.level = 3, give.attr = FALSE)
str(reanno_biotype, max.level = 3, give.attr = FALSE)
# Simple pie charts using the Overview table
pie(table(overview(reanno_genome)$myco)) # Very few mycoplasma with 0 mismatch
pie(table(overview(reanno_biotype)$Any)) # Many hits for either rRNA or tRNA
```
Next step is to reorganize and classify using `add_reanno`. For mapping against
a reference genome this is easy. Just set type="genome" and set the maximum
number of alignments to be reported by `genome_max` (Warning: `genome_max="all"`
will give you all alignments).
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE}
###---------------------------------------------------------------------
### Genomic coordinates using add_reanno
# Output as separate table
anno_genome <- add_reanno(reanno_genome, type="genome",
genome_max=10, mismatches=1)
# Output merged with provided PAC object
pac <- add_reanno(reanno_genome, type="genome",
genome_max=10, mismatches=1, merge_pac=pac)
# Example of original reference name annotations
head(full(reanno_genome)$mis1$myco)
# Finished genome annotation
head(anno_genome)
head(anno(pac))
```
For specialized references, `type="biotype"` should be used. This allows for
classification based on match or no match between the search terms provided in
the `bio_search` input and the reference names annotated with each counted
sequence. Correct classification is all about finding the best search terms that
discriminates between the different names in the original fasta reference files
(up to the first white space; see 4.1).
With the `bio_perfect` option you may control how conservative the matching
between search term and annotation should be. With `bio_perfect=FALSE`,
every reference hit that fail to match your search terms, will be classified as
'other'. Using `bio_perfect=TRUE` will instead guarantee that your search
terms will cover all reference hits.
Hint: See `?add_reanno` for a trick on how to succeed with `bio_perfect=TRUE`.
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE}
###---------------------------------------------------------------------
## Classify sequences using add_reanno
# Lets start by exploring the names in the original fasta reference:
ref_path <- "/some/path/to/fasta_reference"
# For testing use the the fasta reference for tRNA provided with Seqpac:
trna_path <- system.file("extdata/trna", "tRNA.fa",
package = "seqpac", mustWork = TRUE)
# Read fasta names
fasta <- names(Biostrings::readDNAStringSet(trna_path))
# Different naming standards
table(substr(fasta, 1, 10))
# Starting with FBgn discriminate mitochondrial tRNA
fasta[grepl("FBgn", fasta)]
# The other are nuclear
head(fasta[!grepl("FBgn", fasta)])
# We can use as 'mt:tRNA' as search term to catch mitochondrial tRNA
# and '_tRNA' to catch nuclear.
# A useful expression for extracting strings up to 1st white space is
# fasta <- do.call("rbind", strsplit(fasta, " "))[,1]
# Similarly load the rrna reference provided with seqpaq
rrna_path <- system.file("extdata/rrna", "rRNA.fa",
package = "seqpac", mustWork = TRUE)
# Read fasta names
fasta <- names(Biostrings::readDNAStringSet(rrna_path))
# Many rRNA subtypes
table(substr(fasta, 1, 10))
# Lets try two search term lists directed against each reference and written as
# regular expressions:
# Names in the search list needs to be the same as in the reanno object
head(overview(reanno_biotype)) # 'rRNA' and 'tRNA' with some capital letters
# Generate a search list with search terms
bio_search <- list(
rRNA=c("5S", "5.8S", "12S", "16S", "18S", "28S", "pre_S45"),
tRNA =c("_tRNA", "mt:tRNA")
)
test <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=FALSE,
mismatches = 1, merge_pac=pac)
```
```{r, eval=FALSE, results=FALSE, message=FALSE, warning=FALSE}
# Throws an error because perfect matching was required:
anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=TRUE, mismatches = 1)
```
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE}
# References with no search term hits are classified as "Other":
anno_temp <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=FALSE, mismatches = 1)
# Increasing number of hits allowing mismatches
table(anno_temp$mis0)
table(anno_temp$mis1)
# You may also add your new annotaion with a PAC object using the 'merge_pac'
# option. For even more options see ?add_reanno.
pac <- add_reanno(reanno_biotype, bio_search=bio_search,
type="biotype", bio_perfect=FALSE, mismatches = 1,
merge_pac=pac)
head(anno(pac))
```
To make overview statistics and graphs we need to boil down the classifications
to a factor column with only a few unique biotypes per factor. This is what
`simplify_reanno` does. Just like `add_reanno` in the `type="biotype"` mode,
`simplify_reanno` needs a list of search terms, an `hierarchy`. This time,
however, the list is order sensitive and targets the output table from
`add_reanno` that contains the columns with new classifications (e.g.
"mis0_bio", "mis1_bio", "mis2_bio" etc). If a match occurs between the first
search term and a classification, the counted sequence will be annotated to this
classification, no matter if other search terms matches further down the list.
This is done sequentially, allowing one additional mismatch until a maximum
(specified in `mismatches`) has been reached. Thus, if a match occurs in tRNA
with 0 mismatch, and rRNA with 1 mismatch, it will be reported as tRNA even
though rRNA where higher in the `hierarchy`. A better match will always be
prioritized over the hierarchy.
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE}
###---------------------------------------------------------------------
## Hierarchical classification with simplify_reanno
# Currently classification does not discriminate
table(anno(pac)$mis3_bio)
# Set the hierarchy and remember that it is order sensitive.
# Here: S5_rRNA >> Other_rRNA >> mt:tRNA >> tRNA
# Remember to use 'regular expressions' if you wish to catch all:
hierarchy_1 <- list(rRNA_5S="rRNA_5S",
Other_rRNA="5.8S|12S|16S|18S|28S|pre_S45|rRNA_other",
Mt_tRNA="tRNA_mt:tRNA",
tRNA="tRNA__tRNA"
)
# What happens if you don't catch all:
hierarchy_2 <- list(rRNA_5S="rRNA_5S",
Mt_tRNA="tRNA_mt:tRNA",
tRNA="tRNA__tRNA")
# No mismatch allowed using hierarchy_1
test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=0,
bio_name="Biotypes_mis0", merge_pac=FALSE)
table(test) # All remaining rRNA are classified as 'Other_rRNA'
# Instead using hierarchy_2
test <- simplify_reanno(input=pac, hierarchy=hierarchy_2, mismatches=0,
bio_name="Biotypes_mis0", merge_pac=FALSE)
table(test)# Non-hits are classified as 'other'
# Note, setting 'merge_pac=FALSE' returns a one-column data.frame only
# containing the new hierarchical classifications.
# Lets increase number of mismatches an merge it with the PAC
# (How many mismatches depends on what you allowed previously in the workflow)
colnames(anno(pac)) # mis0-1_bio = Upto 1 mismatches are available
pac_test <- simplify_reanno(input=pac, hierarchy=hierarchy_1, mismatches=1,
bio_name="Biotypes_mis1", merge_pac=TRUE)
# Now we also have some mitochondrial tRNA
table(anno(pac_test)$Biotypes_mis1)
```
# Analysis and visualization
There are several functions in Seqpac that handle statistical analysis and
visualization. We will only present a few of them here, so please refer to
Seqpac's reference manual for a complete list. Just like many of the
preprocessing functions, these functions are named PAC_xxx to clearly show that
they are applied on a PAC object.
Hint: If you are unsure if your manually constructed/manipulated PAC object are
compatible, you can always run `PAC_check`.
## Summarize data over groups
To make simple summaries over column categories/groups in the `pheno(PAC)` table
such as means, standard deviation (SD), standard error (SE), and log2 fold
changes (log2FC) etc, Seqpac has a convenient function called `PAC_summary`.
This function uses a target_pheno object to generate a summary over raw
counts or normalized counts across sample groups. Processed data will be stored
in the `summary(PAC)` folder.
Important, summarized data in the `summary(PAC)` folder will always have the
same rownames (unique sequences) as `counts(PAC)` and `anno(PAC)`, while the
column names will be derived from the `pheno(PAC)` table. Summarize using an
anno_target object is not allowed, because that would disrupt sequence names
(loss of sequence integrity). Summaries over sequences (such as generating means
over sRNA classes) is instead handled by each plot/statistical function, and
are often stored in the output from these functions. Nonetheless, user may
always generate their own derived tables, using functions like `aggregate` and
`reshape::melt`.
Lets have some examples on how PAC_summary works:
```{r, message=FALSE, eval=TRUE}
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
###---------------------------------------------------------------------
## PAC_summary in Seqpac
# Make means of counts over stages and return a data.frame
tab <- PAC_summary(pac, norm = "counts", type = "means",
pheno_target=list("stage"), merge_pac=FALSE)
# When merge_pac=TRUE the table is added to the summary(PAC) 'folder'
pac_test <- PAC_summary(pac, norm = "counts", type = "means",
pheno_target=list("stage"), merge_pac=TRUE)
pac # Structure of PAC before PAC_summary (S4)
pac_test # Structure of PAC after PAC_summary (S4)
names(summary(pac_test))
head(summary(pac_test)$countsMeans_stage)
```
```{r, eval=TRUE, results=FALSE, message=FALSE, warning=FALSE}
# You may want to use normalized counts
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means",
pheno_target=list("stage"), merge_pac=TRUE)
# Maybe only include a subset of the samples
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "means",
pheno_target=list("batch", c("Batch1", "Batch2")),
merge_pac=TRUE)
# Generate standard errors
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "se",
pheno_target=list("stage"), merge_pac=TRUE)
# log2FC
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FC",
pheno_target=list("stage"), merge_pac=TRUE)
# log2FC generated from a grand mean over all samples
pac_test <- PAC_summary(pac_test, norm = "cpm", type = "log2FCgrand",
pheno_target=list("stage"), merge_pac=TRUE)
# All summarized tables have identical row names that can be merged
names(summary(pac_test))
lapply(summary(pac_test), function(x){
identical(rownames(x), rownames(summary(pac_test)[[1]]))
})
head(do.call("cbind", summary(pac_test)))
```
## Differential Expression using PAC object
Seqpac has a useful function, `PAC_deseq`, that lets you make omic scale
expressional analysis on PAC objects using the popular `r Biocpkg("DESeq2")`
package. This involves fitting generalized linear models with negative binomial
distributions and subsequent correction for multiple testing using the PAC
tables. For more information (such as how to correctly build models), please see
the DESeq2 vingette `r Biocpkg("DESeq2")`. In addition to preparing a PAC object
for DESeq2, `PAC_deseq` will organize and visualize the output. Lets have an
example using the test dataset:
```{r, message=FALSE, eval=TRUE, warning=FALSE}
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
###---------------------------------------------------------------------
## Differential expression in Seqpac
# Simple model testing stages against using Wald test with local fit (default)
table(pheno(pac)$stage)
output_deseq <- PAC_deseq(pac, model= ~stage, threads=6)
```
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'}
# More complicated, but still graphs will be generated from 'stage' since it is
# first in model
output_deseq <- PAC_deseq(pac, model= ~stage + batch, threads=6)
# Using pheno_target we can change the focus to batch
# (no batch effect)
output_deseq <- PAC_deseq(pac, model= ~stage + batch,
pheno_target=list("batch"))
# With pheno_target we can also change the direction of the comparison
output_deseq <- PAC_deseq(pac, model= ~stage,
pheno_target=list("stage", c("Stage5", "Stage3")))
## In the output you find PAC merged results, target plots and output_deseq
names(output_deseq)
head(output_deseq$result)
```
## Principal component analysis (PAC_pca)
The `PAC_pca` function uses the `r CRANpkg("FactoMineR")` package to make a
principle component analysis, from which scatter plots are plotted using the
`r CRANpkg("ggplot2")` package. Here are some examples on how to use PAC_pca:
```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'}
###---------------------------------------------------------------------
## PCA analysis in Seqpac
# As simple as possible
output_pca <- PAC_pca(pac)
# Two 'folders' in the output
names(output_pca)
# Using pheno_target
output_pca <- PAC_pca(pac, pheno_target =list("stage"))
```
```{r, message=FALSE, eval=TRUE, warning=FALSE}
# Using pheno_target with sample labels
output_pca <- PAC_pca(pac, pheno_target =list("stage"),
label=pheno(pac)$sample)
```
```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'}
# Plotting sequences instead
output_pca <- PAC_pca(pac, type = "anno",
anno_target =list("Biotypes_mis0"))
```
## Size distribution and nucleotide bias
There are two function that can plot size distributions using the `r
CRANpkg("ggplot2")` package:
-*PAC_nbias:* First nucleotide bias
-*PAC_sizedist:* Class size distribution
**PAC_nbias**
Note, first nucleotide bias can be visualized without any advanced annotations
added to the `anno(pac)` table
```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'}
###---------------------------------------------------------------------
## Nucleotide bias in Seqpac
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
# Test without annotations
# (note, is equivalent to an unfiltered master PAC)
pac_test <- pac
anno(pac_test) <- anno(pac_test)[,1, drop = FALSE]
output_nbias <- PAC_nbias(pac_test)
cowplot::plot_grid(plotlist=output_nbias$Histograms)
# Now lets use an anno_target in an annotated PAC targetting miRNA
# (Oops, heavy T-bias on 1st nt; are they piRNA?)
table(anno(pac)$Biotypes_mis0)
output_nbias <- PAC_nbias(pac, anno_target = list("Biotypes_mis0", "miRNA") )
cowplot::plot_grid(plotlist=output_nbias$Histograms)
# Switch to 10:th nt bias
output_nbias <- PAC_nbias(pac, position=10,
anno_target = list("Biotypes_mis0", "miRNA"))
cowplot::plot_grid(plotlist=output_nbias$Histograms)
```
```{r, eval=TRUE, echo=TRUE}
# Summarized over group cpm means
pac <- PAC_summary(pac, norm = "cpm", type = "means",
pheno_target=list("stage"), merge_pac=TRUE)
output_nbias <- PAC_nbias(pac, summary_target = list("cpmMeans_stage") )
cowplot::plot_grid(plotlist=output_nbias$Histograms)
```
**PAC_sizedist**
The size distribution also works in a similar fashion, but instead divides the
stacked columns using an anno_target:
```{r, eval=TRUE, fig.show=FALSE, results="hide", fig.keep='none'}
###---------------------------------------------------------------------
## Biotype size distribution
# Divide stacked bars by biotype with no mismatch allowed
output_sizedist_1 <- PAC_sizedist(pac, anno_target = list("Biotypes_mis0"))
cowplot::plot_grid(plotlist=c(output_sizedist_1$Histograms), ncol=3, nrow=3)
# Divide stacked bars by biotype with allowing up to 3 mismaches
output_sizedist_2 <- PAC_sizedist(pac, anno_target = list("Biotypes_mis3"))
cowplot::plot_grid(plotlist=c(output_sizedist_2$Histograms), ncol=3, nrow=3)
# anno_target is order sensitive, thus can take care of color order issues:
ord_bio <- as.character(unique(anno(pac)$Biotypes_mis3))
ord_bio <- ord_bio[c(1,5,2,4,3,6,7)]
output_sizedist_2 <- PAC_sizedist(pac,
anno_target = list("Biotypes_mis3", ord_bio))
# And again, we can use a summary_target instead:
output_sizedist_sum <- PAC_sizedist(pac,
summary_target = list("cpmMeans_stage"),
anno_target = list("Biotypes_mis3", ord_bio))
cowplot::plot_grid(plotlist=output_sizedist_sum$Histograms)
## Note: #######################################################################
# 1. miRNA is clearly associated with the correct size (21-23) nt, which are #
# dramatically increased in Stage 5 when zygotic transcription has started. #
# 2. piRNA was deliberately left out from the fasta references. Note, however, #
# that there is a broad peak with no annotations between 20-30 nt in Stage 1, #
# which also showed a T-bias at the first nt. Possibly piRNA? #
################################################################################
```
## Simple stacked bars and pie charts
The `PAC_stackbar` and `PAC_pie` have the same arguments as input but with
slightly different outputs. Summaries over groups in `pheno(PAC)` are controlled
på the 'summary' argument.
summary= "all" | % are generate over a mean of all samples
summary= "sample" | % are generate for each sample
summary= "pheno" | Group % are derived from the pheno_target object
**PAC_stackbar**
```{r, message=FALSE, include = TRUE, eval=TRUE}
###---------------------------------------------------------------------
## Stacked bars in Seqpac
# Choose an anno_target and plot samples (summary="samples")
PAC_stackbar(pac, anno_target=list("Biotypes_mis0"))
```
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'}
# 'no_anno' and 'other' will always end on top not matter the order
ord_bio <- as.character(sort(unique(anno(pac)$Biotypes_mis3)))
p1 <- PAC_stackbar(pac, anno_target=list("Biotypes_mis0", ord_bio))
p2 <- PAC_stackbar(pac, anno_target=list("Biotypes_mis0", rev(ord_bio)))
cowplot::plot_grid(plotlist=list(p1, p2))
# (Hint: if you want them to appear not on top, rename them)
# Reorder samples by pheno_targets
PAC_stackbar(pac, pheno_target=list("batch"), summary="samples",
anno_target=list("Biotypes_mis0"))
# Instead, summarized over pheno_target (summary="pheno")
PAC_stackbar(pac, anno_target=list("Biotypes_mis0"), summary="pheno",
pheno_target=list("stage"))
```
**PAC_pie**
```{r, eval=TRUE,fig.show=FALSE,results="hide",warning=FALSE,fig.keep='none'}
###---------------------------------------------------------------------
## Pie charts in Seqpac
# Choose an anno_target and plot samples (summary="samples"; default)
PAC_pie(pac, anno_target=list("Biotypes_mis0"))
# Ordered pie charts of grand mean percent of all samples labeled with percent
ord_bio <- as.character(sort(unique(anno(pac)$Biotypes_mis3)),
unique(anno(pac)$Biotypes_mis0))
output_pie_1 <- PAC_pie(pac, anno_target=list("Biotypes_mis0", ord_bio),
summary="all", labels="percent")
output_pie_2 <- PAC_pie(pac, anno_target=list("Biotypes_mis3", ord_bio),
summary="all", labels="percent")
cowplot::plot_grid(plotlist=c(output_pie_1, output_pie_2),
labels = c("mis 0","legend", "mis 3"),
ncol=2, nrow=2, greedy=TRUE, scale=1.15)
# Rotate
PAC_pie(pac, anno_target=list("Biotypes_mis0"), summary="all", angle=180)
PAC_pie(pac, anno_target=list("Biotypes_mis0"), summary="all", angle=40)
# Compare biotype mapping with or without mismatches and group by pheno(PAC)
ord_bio <- as.character(sort(unique(anno(pac)$Biotypes_mis0)))
output_mis0 <- PAC_pie(pac, pheno_target=list("stage"), summary="pheno",
anno_target=list("Biotypes_mis0", ord_bio))
output_mis3 <- PAC_pie(pac, pheno_target=list("stage"), summary="pheno",
anno_target=list("Biotypes_mis3", ord_bio))
cowplot::plot_grid(plotlist=c(output_mis0, output_mis3),
labels = names(c(output_mis0, output_mis3)), nrow=2,
greedy = TRUE, scale=1.5)
```
# Specilized analysis in Seqpac
Seqpac functions may be used for more advanced analysis than simple
classification. For example, we can dwell deeper into the details between the
alignments of reads and references, by classifying reads depending on where it
align. The best example of such advanced alignment analysis is tRNA loop and
range classification. We may also want to make further annotations of specific
classes of reads, such as classifying piRNA depending on overlaps with
transposable elements and genes.
In the current version of this guide we have only included a section on how to
do advanced alignment analysis for the purpose of tRNA classification, but
please see the `PAC_gtf` function to find out how to use coordinate overlap
annotations in Seqpac.
## Advanced alignment analysis (tRNA class analysis)
So far, classification of sRNA has only involved whether reads align or not
to for example a tRNA reference sequence. Type classification of tRNA,
such as 5', 3', i' and half, tRF (as nicely described in the MINTmap tool
\https://pubmed.ncbi.nlm.nih.gov/28220888/), resembles mapping against a
reference genome. Obtaining the exact coordinates of where a read align to the
reference allows us to plot coverage plots showing the coverage of sRNA across
the reference.
Conveniently, Seqpac contains a 'backdoor' to the reanno workflow, which allows
the user to quickly obtain mapping coordinates of sequences in a PAC object that
align to a reference fasta file. This is done by the `PAC_mapper` function.
This function uses temporary output files from bowtie to generate a map object,
which simply lists the reads that mapped to each sequence in the reference.
Importantly, if the Bowtie index is missing for a fasta reference, `PAC_mapper`
will automatically generate such an index. Thus, even though larger references
may be possible to use, they are discouraged with this function.
After a map object has been generated, the `PAC_covplot` function can make
coverage plots of reads mapping to each reference. Lets have a look on how it
works in Seqpac for advanced tRNA analysis using the test dataset:
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'}
load(system.file("extdata", "drosophila_sRNA_pac_filt_anno.Rdata",
package = "seqpac", mustWork = TRUE))
###---------------------------------------------------------------------
## tRNA analysis in Seqpac
# First create an annotation blank PAC with group means
anno(pac) <- anno(pac)[,1, drop=FALSE]
pac_trna <- PAC_summary(pac, norm = "cpm", type = "means",
pheno_target=list("stage"), merge_pac = TRUE)
# Then reannotate only tRNA using the PAC_mapper function
ref <- "/some/path/to/trna/fasta/recerence.fa"
# or use the provided fasta
ref <- system.file("extdata/trna2", "tRNA2.fa",
package = "seqpac", mustWork = TRUE)
map_object <- PAC_mapper(pac_trna, ref=ref, N_up = "NNN", N_down = "NNN",
mismatches=0, threads=6, report_string=TRUE,
override=TRUE)
# Hint: By adding N_up ad N_down you can make sure that modified fragments (like
# 3' -CAA in mature tRNA are included).
## Plot tRNA using xseq=TRUE gives you the reference sequence as X-axis:
# (OBS! Long reference will not show well.)
cov_tRNA <- PAC_covplot(pac_trna, map_object,
summary_target = list("cpmMeans_stage"),
xseq=TRUE, style="line",
color=c("red", "black", "blue"))
cov_tRNA[[1]]
# Targeting a single tRNA using a summary data.frame
PAC_covplot(pac_trna, map=map_object, summary_target= list("cpmMeans_stage"),
map_target="tRNA-Lys-CTT-1-9")
# Find tRNAs with many fragments
# 1st extract number of rows from each alignment
n_tRFs <- unlist(lapply(map_object, function(x){nrow(x[[2]])}))
# The test data is highly down an filterd sampled, but still some with tRNA have
# a few alignment
table(n_tRFs)
names(map_object)[n_tRFs>2]
# Lets select a few of them and plot them
selct <- names(map_object)[n_tRFs>2][c(1, 16, 27, 37)]
cov_plt <- PAC_covplot(pac_trna, map=map_object,
summary_target= list("cpmMeans_stage"),
map_target=selct)
```
```{r, eval=TRUE, warning=FALSE}
cowplot::plot_grid(plotlist=cov_plt, nrow=2, ncol=2)
```
Now, tRNAs can be further classified into isodecoder and isoacceptors. This
information is usually provided within the reference names in a tRNA fasta
reference file. Thus, it can be extracted from these names. Please see section,
`4.1 Fasta references` for some examples on how to work with fasta references in
R.
Classification of cleavage products, like 5'and 3' halves etc, are more
complicated. Such classification involves knowledge of where in the tRNA loops
(A-loop, anticodon loop and T-loop) cleavage may occur. There are external
software that predict loops from models of secondary structures. One of the most
popular tools for predicting tRNA secondary structures are tRNAscan-SE
(http://lowelab.ucsc.edu/tRNAscan-SE/). The output from this software is an ss
file, which stores loop information in the following format:
Full length tRNA: tRNA-Pro-CGG-2-1_chrX:18565764-18565835_(+)
GGCTCGTTGGTCTAGAGGTATGATTCTCGCTTCGGGTGCGAGAGGTCCCGGGTTCAATTCCCGGACGAGCCC
>>>>>>>..>>>.........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
Loop 1 Loop 2 Loop 3
You may notice that each loop is contain between ">>>...<<<" and that each
character (".><") corresponds to one nucleotide in the original full length
tRNA. Ss-files for most common genomes are readily available for download at
http://gtrnadb.ucsc.edu/.
In Seqpac, the `map_rangetype` function is used to further annotate the
resulting map object obtained using the `PAC_mapper` function. First, if tRNA
reference names were provided in the correct format (eg.
"tRNA-Pro-CGG-2-1_chrX:18565764-18565835_(+): format: name -> genome coordinates
-> strand), it will automatically extract the isodecoder (e.g. "Pro") and
isoacceptor (e.g. "CGG") and make them seperate factors. It may also classify
each mapped PAC sequence in relation to the start and end terminals of the
full length tRNA. Lastly, given an ss file, it can annotate PAC sequences in
relation to tRNA loops. The `tRNA_class` function can then combine the
information in the mapping object and to classify each read sequence in the PAC
object. Lets have a look on how this is done:
```{r, eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'}
###---------------------------------------------------------------------
## Generate range types using a ss-file
# Download ss object from GtRNAdb
# ("http://gtrnadb.ucsc.edu/")
ss_file <- "/some/path/to/GtRNAdb/trna.ss"
# or for testing use the provided ss-file in secpac
ss_file <- system.file("extdata/trna2", "tRNA2.ss",
package = "seqpac", mustWork = TRUE)
# Classify fragments according to loop cleavage (small loops are omitted)
map_object_ss <- map_rangetype(map_object, type="ss",
ss=ss_file, min_loop_width=4)
# Remove reference tRNAs with no hits
map_object_ss <- map_object_ss[
!unlist(lapply(map_object_ss, function(x){x[[2]][1,1] == "no_hits"}))]
# Now we have quite comprehensive tRNA loop annotations
names(map_object_ss)
map_object_ss[[1]][[2]]
# Don't forget to check ?map_rangetype to obtain more options
###---------------------------------------------------------------------
## Function classifying 5'-tRF, 5'halves, i-tRF, 3'-tRF, 3'halves
# Does all tRNAs have 3 loops?
table(unlist(lapply(map_object_ss, function(x){unique(x$Alignments$n_ssloop)})))
# Set tolerance for classification as a terminal tRF
tolerance <- 5 # 2 nucleotides from start or end of full-length tRNA)
### Important:
# We set N_up and N_down to "NNN" in the PAC_mapper step. To make sure
# that we have a tolerance that include the original tRNA sequence
# we set terminal= 2+3 (5).
```
```{r, eval=TRUE, warning=FALSE}
## tRNA classifying function
# Apply the tRNA_class function and make a tRNA type column
pac_trna <- tRNA_class(pac_trna, map=map_object_ss, terminal=tolerance)
anno(pac_trna)$type <- paste0(anno(pac_trna)$decoder, anno(pac_trna)$acceptor)
anno(pac_trna)[1:5, 1:4]
```
When classification by isoacceptor/decoder and cleavage type has been done, the
`PAC_trna` function can be applied to generate pheno group
differences in relation to tRNA fragmentation:
```{r, message=FALSE, echo=TRUE, eval=TRUE}
###---------------------------------------------------------------------
## Plot tRNA types
# Now use PAC_trna to generate some graphs based on grand means
trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL,
join = TRUE, top = 15, log2fc = TRUE,
pheno_target = list("stage", c("Stage1", "Stage3")),
anno_target_1 = list("type"),
anno_target_2 = list("class"))
names(trna_result)
names(trna_result$plots)
names(trna_result$plots$Expression_Anno_1)
cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Grand_means,
trna_result$plots$Log2FC_Anno_1,
trna_result$plots$Percent_bars$Grand_means,
nrow=1, ncol=3)
```
```{r,eval=TRUE,fig.show=FALSE,results="hide",message=FALSE,fig.keep='none'}
# By setting join = FALSE you will get group means
trna_result <- PAC_trna(pac_trna, norm="cpm", filter = NULL,
join = FALSE, top = 15, log2fc = TRUE,
pheno_target = list("stage", c("Stage1", "Stage3")),
anno_target_1 = list("type"),
anno_target_2 = list("class"))
names(trna_result$plots$Expression_Anno_1)
cowplot::plot_grid(trna_result$plots$Expression_Anno_1$Stage1,
trna_result$plots$Expression_Anno_1$Stage3,
trna_result$plots$Log2FC_Anno_1,
trna_result$plots$Percent_bars$Stage1,
trna_result$plots$Percent_bars$Stage3,
nrow=1, ncol=5)
```
That completes this vignette. Good luck!
```{r, eval=TRUE}
sessionInfo()
```