%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Overview of the 'PWMEnrich' package}
%\VignetteKeywords{Motif enrichment, PWM}
%\VignettePackage{PWMEnrich}
\documentclass{article}

\usepackage{float}
\usepackage{a4wide}

<<include=FALSE>>=
library(knitr)
opts_chunk$set(
echo=TRUE,eval=TRUE,cache=FALSE,tidy=FALSE
)
@

<<include=FALSE>>=
library(knitr)
opts_chunk$set(
tidy=FALSE,dev='pdf',
message=FALSE, warning=FALSE
)
@

<<style-knitr, eval=TRUE, echo=FALSE, results="asis">>=
BiocStyle::latex()
@

%% colors
\usepackage{color}
\definecolor{Red}{rgb}{0.7,0,0}
\definecolor{Blue}{rgb}{0,0,0.8}


\hypersetup{%
  hyperindex = {true},
  colorlinks = {true},
  linktocpage = {true},
  plainpages = {false},
  linkcolor = {Blue},
  citecolor = {Blue},
  urlcolor = {Blue},
  pdfstartview = {Fit},
  pdfpagemode = {UseOutlines},
  pdfview = {XYZ null null null}
}

\author{Robert Stojni\'{c}\footnote{
  e-mail: \email{robert.stojnic@gmail.com},
  Cambridge Systems Biology Institute,
  University of Cambridge, UK}
}


\begin{document}

\title{Overview of the \Biocpkg{PWMEnrich} package}

\maketitle

\tableofcontents

\section{Introduction}\label{sec:intro} 

The main functionality of the package is Position Weight Matrix (PWM)\footnote{In this vignette we use "PWM", "DNA motif" and "motif" interchangeably.} enrichment analysis in a single sequence (e.g. enhancer of interest) or a set of sequences (e.g. set of ChIP-chip/seq peaks). Note that this is not the same as \textit{de-novo} motif finding which discovers novel motifs, nor motif comparison which aligns motifs.  

The package is built upon \Robject{Biostrings} and offers high-level functions to scan for DNA motif occurrences and compare them against a genomic background. There are multiple packages with pre-compiled genomic backgrounds such as \Robject{PWMEnrich.Dmelanogaster.background}, \Robject{PWMEnrich.Hsapiens.background} and \Robject{PWMEnrich.Mmusculus.background}. In these packages the genomic distribution is calculated for motifs from the \Robject{MotifDb} database. The \Robject{PWMEnrich} package contains all the functions used to create these packages, so you can calculate your own background distributions for your own set of motifs. In this vignette we will use the \textit{Drosophila} package, but the other background packages are used in the same way (see Section \ref{sec:human} for minor human-specific differences). 

\subsection{Implemented algorithms}

\Robject{PWMEnrich} uses the PWM scanning algorithm implemented by the package \Robject{Biostrings}. This package returns PWM scores at each position on one strand of a sequence. \Robject{PWMEnrich} extends this with a higher-level functions which automatically scans both strands for multiple motifs and sequences.

The main goal of the package is to assess the enrichment of motif hits in a sequence (or group of sequences) compared to a genomic background. The traditional way of doing this is to use a threshold for the PWM score and count the number of motif hits in the sequence(s) of interest. Since this converts the sequence into a binary bound/not-bound string, the enrichment of binding events can be assessed using a binomial formula. The \Robject{PWMEnrich} package implements this algorithm, but by default uses a lognormal threshold-free approach \cite{stojnic_2012} which is related to the score used in Clover \cite{frith_detection_2004}. 

In the lognormal threshold-free approach average affinity is calculated over the whole sequence (or set of sequences) and compared to the average affinity of length-matched sequences from the genomic background. This approach performs better or same as the best threshold approach \cite{stojnic_2012}, with the added benefit of not having to choose a threshold or compare the results for multiple thresholds. We will use this threshold-free approach in all of our examples. Please consult the reference manual on how to use the fixed-threshold algorithms. 

\subsection{S4 class structure and accessors}

As the \Biocpkg{PWMEnrich} package builds upon the \Biocpkg{Biostrings} package it uses the classes from this package to represent DNA sequences (\Robject{DNAString} and \Robject{DNAStringSet}). FASTA files can be loaded using functions from \Biocpkg{Biostrings} such as \Robject{readDNAStringSet}. The package introduces a new class \Robject{PWM} to represent a PWM together with the frequency matrix and other parameters (background nucleotide frequencies and pseudo-counts). All motif scoring is performed by the \Biocpkg{Biostrings} package which is why the \Biocpkg{PWMEnrich} package also returns log2 scores instead of more common log base \textit{e} scores. 

The results of motif scanning are stored in objects of class \Robject{MotifEnrichmentResults} and \Robject{MotifEnrichmentReport}. The package also introduces a number of classes that represent different background distributions: \Robject{PWMLognBackground}, \Robject{PWMCutoffBackground}, \Robject{PWMEmpiricalBackground}, \Robject{PWMGEVBackground}. In all cases, the classes are implemented with a list-like interface, that is, individual pieces of information within the objects are accessibly using \Rfunction{names(obj)} and \Rfunction{obj\$prop}. 

\section{Use case 1: Finding enrichment motifs in a single sequence}

One of the most well-known example of combinatorial control by transcription factors in \textit{Drosophila} is the \textit{even skipped (eve)} stripe 2 enhancer. This well-studied enhancer has a number of annotated binding sites for TFs \textit{Kr}, \textit{vfl}, \textit{bcd}, \textit{hb} and \textit{gt}. We will use this enhancer as an example as we already know its functional structure.

In order to predict which TFs are likely to functionally bind to the stripe 2 enhancer, we will calculate motif enrichment for a set of experimentally derived motifs from the \Biocpkg{MotifDb} database. We will do this by comparing the average affinity of each motif in the stripe 2 enhancers to the affinity over all \textit{D. melanogaster} promoters\footnote{For more information see \cite{stojnic_2012}}. These background distributions are already pre-calculated in the \Robject{PWMEnrich.Dmelanogaster.background} package which we will simply load and use. See the last section of this vignette for using your own motifs and background sequences. 

<<simple,echo=TRUE>>=
library(PWMEnrich)
library(PWMEnrich.Dmelanogaster.background)

# load the pre-compiled lognormal background
data(PWMLogn.dm3.MotifDb.Dmel)

# load the stripe2 sequences from a FASTA file for motif enrichment
sequence = readDNAStringSet(system.file(package="PWMEnrich", 
  dir="extdata", file="stripe2.fa"))
sequence  

# perform motif enrichment!
res = motifEnrichment(sequence, PWMLogn.dm3.MotifDb.Dmel)

report = sequenceReport(res, 1)
report

# plot the motif with P-value < 0.05
plot(report[report$p.value < 0.05], fontsize=7, id.fontsize=6)
@

The main function we used is \Robject{motifEnrichment} which took our sequence and calculated motif enrichment using the lognormal affinity background distribution (fitted on a set of 10031 \textit{D. melanogaster} 2kb promoters). This function returns a set of scores and P-values for our sequence. We then used the \Robject{sequenceReport} function that create a ranked list of motifs, which we then plot using \Robject{plot}. The first column is the rank, the second shows the target name, which is either a gene name, an isoform name (such as ttk-PF), or a dimer name (such as tgo\_sim not present in this list). The next column in the plot is the PWM logo, and after that the motif ID. This ID comes from the \Robject{MotifDb} package and can be used to look up further information about the motif (such as the motif source). The next-to-last column is the raw affinity score, and the last column is the P-value of motif enrichment. 

As we can see, the top of the list is dominated by motifs similar to bcd. By further examining the list, we find we recovered the Kr, bcd and gt motifs, but not the vfl and hb motifs. These two TFs (vfl and hb) have the smallest number of annotated binding sites out of the five TFs in the stripe 2 enhancer. As a result, this affinity is not large enough to be picked up by motif enrichment. However, the other three motifs were picked up. We find this to be the typical case for many enhancers.  

\section{Use case 2: Examining the binding sites}

We continue with our example of the eve stripe 2 enhancer from the previous section. We now want to visualise the binding sites for Kr, bcd and gt. 

<<stripe2visual,echo=TRUE,fig.width=4,fig.height=4,fig.align='center'>>=

# extract the 3 PWMs for the TFs we are interested in
ids = c("bcd_FlyReg_FBgn0000166", 
        "gt_FlyReg_FBgn0001150", 
        "Kr")
sel.pwms = PWMLogn.dm3.MotifDb.Dmel$pwms[ids]
names(sel.pwms) = c("bcd", "gt", "Kr")

# scan and get the raw scores
scores = motifScores(sequence, sel.pwms, raw.scores=TRUE)

# raw scores for the first (and only) input sequence
dim(scores[[1]])
head(scores[[1]])

# score starting at position 1 of forward strand
scores[[1]][1, "bcd"]
# score for the reverse complement of the motif, starting at the same position
scores[[1]][485, "bcd"]

# plot
plotMotifScores(scores, cols=c("green", "red", "blue"))
@

Here we used the \Robject{motifScores} function to obtain the raw scores at each position in the sequence. The result of this function is a list of matrices, each element of the list corresponding to an input sequence. In this case we had only one input sequence, and as a result we get a list of length 1. The matrix of scores is a 968 x 3 matrix, where the rows correspond to the two strands (2 x 484) and the columns correspond to motifs. It is important to remember that the scores are in real and not log space. In other words, a conventional PWM log2 score of 3 is represented as number 8 ($2^3$). 

The scores for the two strands are concatenated one after the other. Therefore, row 1 has the scores for the motif starting at position 1, and row 485 has the score at the same position, but with the reverse complement of the motif (i.e. motif score on the reverse strand). Note that there will be some NA values at the end of the sequence (e.g. position 484) because we do not support partial motif matches. 

Finally we use the \Robject{plotMotifScores} function to plot the log2 scores over the sequence. We colour-code the motifs with green, red and blue. The motif hits are shown as rectangles with the base being the length of the motif, and the hight being the log2 score of the motif hit. By default we show all motif hits with log2 scores larger then 0. The forward strand hits are shown on the top, and the reverse strand hits are shown on the bottom. 

We next might be interested in finding the P-value for individual motif hits so we can get an idea which sites are the most important. To do this we need to calculate the empirical PWM score distribution for single sites. We did not provide these values precalculated because they take up a very large amount of memory. To calculate it based on a set of promoter, we will need the \textit{D. melanogaster} genome sequence. Because the objects are so large, in this example we will determine the P-value only for the hits of the bcd motif, using only a small subset of promoters (controlled by the parameter \Robject{quick=TRUE}).

<<motifEcdf,echo=TRUE,fig.width=4,fig.height=4,fig.align='center'>>=
library(BSgenome.Dmelanogaster.UCSC.dm3)

# empirical distribution for the bcd motif
bcd.ecdf = motifEcdf(sel.pwms$bcd, Dmelanogaster, quick=TRUE)[[1]]

# find the score that is equivalent to the P-value of 1e-3
threshold.1e3 = log2(quantile(bcd.ecdf, 1 - 1e-3))
threshold.1e3 

# replot only the bcd motif hits with the P-value cutoff of 1e-3 (0.001)
plotMotifScores(scores, cols="green", sel.motifs="bcd", cutoff=threshold.1e3)

# Convert scores into P-values
pvals = 1 - bcd.ecdf(scores[[1]][,"bcd"])
head(pvals)
@

Here we have used the \Robject{motifEcdf} function to create an empirical cumulative distribution function (ECDF) for the bcd motif score on Drosophila promoters. This function returns an \Robject{ecdf} object which is part of base R. We can then use the quantile function to find which scores correspond to a P-value of 0.001, or we can use it to convert all the scores into P-values (not shown above). To plot the individual motif hits with P-values smaller than 0.001 we again use the \Robject{plotMotifScores} function, but now we apply the threshold so that only those motif hits above the threshold are drawn.  

In the last line we find out the positions of those motif hits where the P-value is smaller then 1e-3. Note that the values larger than the sequence length (484) indicate the reverse strand. Therefore, we find the four strong motif hits at positions 90 on the forward strand and 110, 354 and 475 on the reverse strand. 

Note that \Robject{plotMotifScores} can also plot multiple sequences on a single plot, and that the \Robject{cutoff} parameter can contain a vector of values if we wish to apply different cutoff to different motifs. 

\section{Use case 3: Finding enriched motifs in multiple sequences}

So far we have only looked at motif enrichment in a single sequence, which was able to recover some but not all of the truly functional motifs. The power of the motif enrichment approach can be significantly boosted by performing it jointly on multiple sequences. 

For this example we are going to use the top 20 ChIP-chip peaks for transcription factor Tinman in \textit{Drosophila} \cite{jin_genome-wide_2013}. We are going to scan these 20 ChIP-chip peaks with all the motifs and then compare their enrichment to genomic background. Running on the whole set of peaks (i.e. thousands) is also possible but can take a long time (i.e. tens of minutes). The speed can be improved by using multiple CPU cores (see next section). 

<<tinman,echo=TRUE>>=
library(PWMEnrich.Dmelanogaster.background)

# load the pre-compiled lognormal background
data(PWMLogn.dm3.MotifDb.Dmel)

sequences = readDNAStringSet(system.file(package="PWMEnrich", 
  dir="extdata", file="tinman-early-top20.fa"))
 
res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel)

report = groupReport(res)
report

plot(report[1:10], fontsize=7, id.fontsize=5)
@

As in Use case 1, the main function is \Robject{motifEnrichment} which took our sequences and calculated motif enrichment using the lognormal affinity background distribution (fitted on a set of 10031 \textit{D. melanogaster} 2kb promoters). We then applied the \Robject{groupReport} function to calculate the enrichment over the whole group of sequences. This produced a ranked list of motifs according to the estimated P-values. Then we used \Robject{plot} to plot the top 10 enriched motifs.  

The first three motifs are very similar and correspond to the tinman, which is the transcription factor for which the ChIP-chip experiment was performed. The first five columns are the same as before (see Use case 1). The sixth column gives the estimate P-value. The last column indicates the breadth of enrichment using a 5\% ranking threshold. This column helps to differentiate cases where the motif enrichment is strongly focused to a small subset of sequences (in which case breadth is small), versus being more widespread but weaker (in which case breadth is bigger). We can also sort by this column:

<<tinman-alt,echo=TRUE>>=
report.top = groupReport(res, by.top.motifs=TRUE)

report.top
@

This ranks motifs by breadth of enrichment, which is calculated by comparing enrichment \textit{between motifs} in individual sequences. This measure only makes sense when applied to a large number of sequence and when scanning with a large number of motifs (>20). 

The object returned by \Robject{motifEnrichment} has more information in it, as can be seen below:

<<tinman2,echo=TRUE>>=
res

# raw scores
res$sequence.nobg[1:5, 1:2]

# P-values
res$sequence.bg[1:5, 1:2]
@

In these two matrices the rows correspond to the different input sequences and the columns correspond to motifs. The first matrix (sequence.nobg) contains the raw affinity scores, while the second (sequence.bg) contains the corresponding P-values. If you are using a fixed threshold background (e.g. scanning with \Robject{PWMPvalueCutoff1e3.dm3.MotifDb.Dmel}) the first matrix will contain the number of motif hits, and the second the corresponding Z-scores. 

\section{Using PWMEnrich on human sequences}\label{sec:human}

Starting from PWMEnrich version 4.0 (October 2014) a new algorithm is used to better fit the background distributions in human sequences. The only difference from the usage perspective is in creating new background files - please make sure to set the parameter \Robject{algorithm="human"} in \Robject{makeBackground()} (and other related functions for creating backgrounds). This will instruct the function to fit separate parameters for different sequence lengths. The sequence lengths are obtained by multiplying the parameters \Robject{bg.len} and \Robject{bg.len.sizes}. The defaults are \Robject{bg.len=250bp} and \Robject{bg.len.size=c(1, 2, 4, 8, 16)}. This means that the P-values are the most accurate for sequences that are in the range of 250bp - 4000bp and the closest in size to 250bp, 500bp, 1000bp, 2000bp and 4000bp. 

\section{Speeding up execution}

\subsection{Parallel execution}

Motif scanning is the most time consuming operation. Because of this, the package has a support for parallel motif scanning using the \R{parallel} core package. Note that parallel execution is currently not supported on Windows. To turn on parallel scanning, simply register a number of cores available to the package:

<<parallel,echo=TRUE>>=
registerCoresPWMEnrich(4)
@

After this command is executed, all further calls to \Biocpkg{PWMEnrich} functions are going to be run in parallel using 4 cores (if possible). To turn off parallel execution call the function with parameter NULL:

<<parallel-stop,echo=TRUE>>=
registerCoresPWMEnrich(NULL)
@

\subsection{Large memory backend}

Motif scanning can be further speeded up by using large amount of memory. If you have an access to a machine with a lot of RAM, you can switch to the "big memory" backend:

<<bigmem,echo=TRUE>>=
useBigMemoryPWMEnrich(TRUE)
@

From this point on, all motif scanning will be done using the optimised big memory backend. The memory requirement depends on the number of sequences scanned, and might require tens of GB of RAM. To turn it off:

<<bigmemoff,echo=TRUE>>=
useBigMemoryPWMEnrich(FALSE)
@

\section{Customisation}

\subsection{Using a custom set of PWMs}\label{sec:custom-pwm}

Background motif distributions for a custom set of PWMs can be easily calculated for all model organisms. We will illustrate this by creating a new lognormal background for two \textit{de-novo} motifs in Drosophila. To load in the motifs the package provides functions to read standard JASPAR and TRANSFAC formats. 

<<readmotifs,echo=TRUE>>=
library(PWMEnrich.Dmelanogaster.background)

motifs.denovo = readMotifs(system.file(package="PWMEnrich", 
  dir="extdata", file="example.transfac"), remove.acc=TRUE)
  
motifs.denovo  

# convert count matrices into PWMs
genomic.acgt = getBackgroundFrequencies("dm3")
pwms.denovo = toPWM(motifs.denovo, prior=genomic.acgt)

bg.denovo = makeBackground(pwms.denovo, organism="dm3", type="logn", quick=TRUE)

# use new motifs for motif enrichment
res.denovo = motifEnrichment(sequences[1:5], bg.denovo)
groupReport(res.denovo)
@

We load in the count matrices and then convert them into PWMs using the genomic distributions of the A, C, G, T nucleotides. Next we use these PWMs to calculate the properties of the affinity distribution on the set of \textit{D. melanogaster} promoters. In this example we used \Robject{quick=TRUE} for illustrative purposes. This fits the parameters quickly on a reduced set of 100 promoters. We strongly discourage the users to use this parameter in their research, and instead only use it to obtain rough estimates and for testing. The resulting object \Robject{bg.denovo} can be used same as before to perform motif enrichment.

The background object \Robject{bg.denovo} contains the two PWMs and their background distribution parameters. All of these can be accessed with the \$ operator. 

<<bg-investigate,echo=TRUE>>=
bg.denovo
bg.denovo$bg.mean
@
%$

\subsection{Using a custom set of background sequences}\label{sec:custom-bg}

Low-level functions are available for constructing custom backgrounds. We start with the two de-novo motifs from previous section and fit the background to first 20 \textit{D. melanogaster} promoters. 

<<custombg,echo=TRUE>>=
library(PWMEnrich.Dmelanogaster.background)
data(dm3.upstream2000)

# make a lognormal background for the two motifs using only first 20 promoters
bg.seq = dm3.upstream2000[1:20]

# the sequences are split into 100bp chunks and fitted
bg.custom = makeBackground(pwms.denovo, bg.seq=bg.seq, type="logn", bg.len=100,
	bg.source="20 promoters split into 100bp chunks")

bg.custom
@

The resulting \Robject{bg.custom} object can be used as before for motif enrichment with the \Robject{motifEnrichment} function (as described before).

\section{Session information}

<<sessionInfo,echo=FALSE,results='asis'>>=
toLatex(sessionInfo())
@

%\bibliographystyle{apalike}
%\bibliography{references}
\bibliography{references}

\end{document}