%\VignetteIndexEntry{Overview of the 'PWMEnrich' package} %\VignetteKeywords{Motif enrichment, PWM} %\VignettePackage{PWMEnrich} \documentclass{article} \usepackage[nogin]{Sweave} \usepackage{hyperref} \usepackage{cite} \usepackage[authoryear,round]{natbib} \usepackage{float} \usepackage{a4wide} \SweaveOpts{echo=TRUE,eval=TRUE,cache=FALSE,keep.source=TRUE} \newcommand{\R}{\texttt{R} } \newcommand{\Rfun}[1]{{\texttt{#1}}} \newcommand{\Robj}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}} %% colors \usepackage{color} \definecolor{Red}{rgb}{0.7,0,0} \definecolor{Blue}{rgb}{0,0,0.8} \hypersetup{% hyperindex = {true}, colorlinks = {true}, linktocpage = {true}, plainpages = {false}, linkcolor = {Blue}, citecolor = {Blue}, urlcolor = {Blue}, pdfstartview = {Fit}, pdfpagemode = {UseOutlines}, pdfview = {XYZ null null null} } \author{Robert Stojni\'{c}\footnote{ e-mail: \email{robert.stojnic@gmail.com}, Cambridge Systems Biology Institute, University of Cambridge, UK} } \begin{document} \title{Overview of the \Rpackage{PWMEnrich} package} \maketitle \tableofcontents \section{Introduction}\label{sec:intro} The purpose of this package is to asses the enrichment of already known Position Weight Matrices (PWMs) in a set of DNA sequences. Note that this is not the same as \textit{de-novo} motif finding which discovers novel motifs, and motif comparison which aligns motifs. The main difference to traditional PWM-based sequence scanning is that PWM hits are considered \textit{together} in a sequence or DNA region of certain length, instead of being given individual P-values for individual motif hits. One example of such an approach which we re-implement in the package is Clover \citep{frith_detection_2004}. The package is useful both for hypothesis verification, e.g. testing if a ChIP-chip/seq dataset is enriched for a target TF, and for hypothesis generation, e.g. finding novel co-factors (with already existing PWMs) in an already defined regulatory region(s). A variety of algorithms has been implemented to asses enrichment using both motif hit-count based metrics and threshold-free metrics. Enrichments can be normalized to a background (e.g. genomic) distribution and P-values derived. \subsection{Implemented algorithms} Various algorithms are implemented for PWM enrichment analysis: with/without fixed threshold and with/without background correction. The DNA sequence is scanned with a set of Position Weight Matrices (PWMs) corresponding to transcription factor (TF) binding affinities. The goal is to find a set of PWMs that have the largest affinity for the sequence and thus are most likely to functionally bind to the sequence(s) of interest. At present, the package has been developed and tested in \textit{Drosophila melanogaster}, however, it is more broadly applicable as it contains tools to build backgrounds for custom organisms and scan with a set of custom PWMs. The algorithms are applicable to single sequences, groups of sequences, and comparisons of single and groups of sequences (i.e. differential enrichment). Every algorithm has two steps. In the first, sequences are scanned to assess the affinity of a PWM to them. In the second, background correction is performed. For the first step the following metrics are available: \begin{itemize} \item \textbf{Fixed-cutoff motif hit} - for each sequence number of significant motif hits over a score cutoff is recorded. This cutoff can be fixed to a log2-odd value, or can be determined by specifying a P-value. The P-value is converted into a fixed threshold by examining its empirical distribution on a large set of background sequences. \item \textbf{Threshold-free affinity} - instead of introducing a fixed threshold, this approach takes the average odds (not log-odds!) score over the whole sequence to produce a MeanAffinity score. Theoretical work suggests that this quantity is related to the average time TF spends bound to a DNA sequence \citep{stormo2000dna}. Recent benchmarks suggest this as a method of choice \citep{stojnic_2012}. \end{itemize} PWMs are sorted according to number of significant motif hits or MeanAffinity. However, these numbers might not be an optimal indication of motif enrichment in the sequences. For example, a motif with low information content will have, on average, higher number of hits than a motif with high information content. Thus, it is important to perform background correction, especially when analysing more than one sequence \citep{stojnic_2012}. The background is typically fitted to a set of all promoters in an organism. The following background correction algorithms are available in the package: \begin{itemize} \item \textbf{Z-score} - applicable to fixed-threshold motif hits. The density of binding sites above the threshold is calculated in the background set of sequences and Z-score calculated by using a normal approximation to the sum of binomial random variables representing motif hits along the sequence. This approach is described in more detail in \citep{sui2005opossum}. \item \textbf{Lognormal} - applicable to the threshold-free MeanAffinity score. The background distribution of MeanAffinity scores is assumed to be distributed according to the lognormal distribution with standard deviation depending on sequence length. The approach compared favourable in benchmark on gold-standard datasets and is recommended in all contexts and is described in \citep{stojnic_2012}. \item \textbf{Generalized extreme value (GEV)} - applicable to the threshold-free MeanAffinity score. The background distribution of MeanAffinity scores is assumed to be distributed according to the Generalized extreme value (GEV) distribution. The three parameters of this distribution are fitted with linear regression on a range of typical regulatory region sizes. This approach is described in \citep{manke2008statistical}. \item \textbf{Matrix shuffle} - applicable for both fixed-threshold and threshold-free score. No background set of sequences is used in this approach, instead, PWMs are shuffled by reshuffling PWM columns many times. This creates a null-distribution of motif scores for each of the motifs separately. This approach is very slow since the sequence needs to be rescanned many times. We assume that the distribution of shuffled scores is roughly normal and we calculate the Z-score of observed score. This approach is described in \citep{boden2008associating}. \item \textbf{Empirical P-value} - applicable for both fixed-threshold an threshold-free score. The P-value of a score is directly estimated by sampling datasets with same sequence lengths from a set of background sequences. As the P-values typically get smaller with higher number of sequences, the approach is practically applicable only to single sequences and requires the background set that is at least 1/minimal.P.value times larger than the sequence to ensure that the tails of the empirical distribution can be sampled properly. For instance, if one wants to find an empirical P-value for a 500bp sequence with a P-value resolution of 0.001, one would need to have a background set of at least 500bp * 1000 = 500kb and do at least 1000 randomizations (recommended at least 10000 to gain more confidence). \end{itemize} All of these background corrections are applicable to single sequences and groups of sequences when the group is treated as one long sequence. To treat the group of sequences as a group of equal "regulatory blocks" of which an unknown subset has the specified motif we implement the Clover \citep{frith_detection_2004} algorithm. \subsection{S4 class structure and accessors} The \Rpackage{PWMEnrich} package builds upon the \Rpackage{Biostrings} package and uses the classes from this package to represent DNA sequences (\Robj{DNAString} and \Robj{DNAStringSet}). It introduces a new class \Robj{PWM} to represent a PWM together with the frequency matrix and other parameters (background nucleotide frequencies and pseudo-counts). All motif scoring is performed by the \Rpackage{Biostrings} package which is why the \Rpackage{PWMEnrich} package also returns log2 scores instead of more common log base \textit{e} scores. The package also introduces a number of classes that represent different background distributions: \Robj{PWMLognBackground}, \Robj{PWMCutoffBackground}, \Robj{PWMEmpiricalBackground}, \Robj{PWMGEVBackground}. In all cases, the classes are implemented with a list-like (and RefClass) interface, that is, individual pieces of information within the objects are accessibly using \Rfun{names(obj)} and \Rfun{obj\$prop}. \section{Motif enrichment} For \textit{D. melanogaster} and the Bioconductor MotifDb motif database the package provides a set of already compiled background correction objects. See the documentation for MotifDb for more information on sources for these motifs. Sections \ref{sec:custom-pwm} and \ref{sec:custom-bg} describe how to create custom background correction objects. We will load two DNA sequences from a FASTA file and then perform motif enrichment of the 740 \textit{D. melanogaster} PWMs using lognormal background correction. The first sequence is known to bind Tin, and the second is known to bind Twi. <>= library(PWMEnrich) library(PWMEnrich.Dmelanogaster.background) # load the pre-compiled lognormal background data(PWMLogn.dm3.MotifDb.Dmel) # scan two sequences from a FASTA file for motif enrichment sequences = readDNAStringSet(system.file(package="PWMEnrich", dir="extdata", file="example.fa")) sequences res = motifEnrichment(sequences, PWMLogn.dm3.MotifDb.Dmel) @ We read FASTA sequences using \Rpackage{Biostrings} function \Robj{readDNAStringSet}. Alternatively, sequence can be specified as a list of \Robj{DNAString} objects. See the package \Rpackage{Biostrings} for more information on handling FASTA sequences in R. The function \Rfun{motifEnrichment} is the top-level function for motif enrichment in single sequences and groups of sequences. It takes at least two input arguments: a set of sequences, and a set of PWMs. In this example, we used a lognormal background correction object \Robj{PWMLogn.dm3.MotifDb.Dmel} in place of PWMs so that lognormal correction is performed. The background correction object contains the background lognormal distribution parameters for 740 \textit{D. melanogaster} MotifDb PWMs fitted on a set of 10031 \textit{D. melanogaster} 2kb promoters. The output of \Rfun{motifEnrichment} is an object of class \Robj{MotifEnrichmentResults} that contains the scores for both sequences together (the "group") and individual sequences. <>= res # PWMs enriched in both sequences (Lognormal P-value) head(motifRankingForGroup(res)) # PWMs enriched in the first sequence (Lognormal P-values) head(motifRankingForSequence(res, 1)) # PWMs enriched in the second sequence (Lognormal P-values) head(motifRankingForSequence(res, 2)) @ Some transcription factors have PWMs from multiple sources (which are sometimes identical, sometimes not) and from different multimers. We can refer back to the \Robj{MotifDb} package to find out more about the enriched motifs. We do this by looking at the original IDs of top ranked motifs: <>= head(motifRankingForGroup(res, id=TRUE)) library(MotifDb) db = values(MotifDb) db[db$providerName == "dimm_da_SANGER_5_FBgn0000413", "dataSource"] db[db$providerName == "dimm_da_SANGER_5_FBgn0023091", "dataSource"] @ %$ The first two motifs comes from FlyFactorSurvey and correspond to a heterodimer of Da and Dimm. We can further visualise the most enriched motifs through the package \Rpackage{seqLogo}. We plot the same 6 motifs: \begin{center} \setkeys{Gin}{width=16cm, height=7cm} <>= plotTopMotifsGroup(res, 6, rows=2, cols=3, xmargin.scale=0.7) @ % ymargin.scale=0.8, xfontsize=11, yfontsize=11, titlefontsize=14) \end{center} In the top 6 we recover the know motifs: twi and tin together with da and dimm that have similar motifs to twi. Since background correction is performed after the raw individual scores are calculated, the resulting object always contains the raw values as well. We will illustrate this using a fixed-threshold background. Using a P-value threshold (e.g. $10^{-4}$) we recover both the number of motif hits (with P-value $<10^{-4}$) and the associated Z-score. <>= data(PWMPvalueCutoff1e4.dm3.MotifDb.Dmel) res.count = motifEnrichment(sequences, PWMPvalueCutoff1e4.dm3.MotifDb.Dmel) # PWMs sorted by number of motifs hits (P-value < 0.0001) in both sequences head(motifRankingForGroup(res.count, bg=FALSE)) # PWMs sorted by the Z-score of motif enrichment head(motifRankingForGroup(res.count)) # First sequence, PWMs sorted by number of motif hits (P-value < 0.0001) head(motifRankingForSequence(res.count, 1, bg=FALSE)) # Second sequence head(motifRankingForSequence(res.count, 2, bg=FALSE)) @ %$ Using the fixed-threshold approach also recovers the tin and twi motifs. \section{Parallel execution} Motif scanning is the most time consuming operation in all algorithms. Because of this, the package has a support for parallel motif scanning using the \Rpackage{parallel} core package. Note that parallel execution is currently not supported on Windows. To turn on parallel scanning, simply register a number of cores available to the package: <>= registerCoresPWMEnrich(4) @ After this command is executed, all further calls to \Rpackage{PWMEnrich} functions are going to be run in parallel using 4 cores (if possible). To turn off parallel execution call the function with parameter NULL: <>= registerCoresPWMEnrich(NULL) @ \section{Using a custom set of PWMs}\label{sec:custom-pwm} The package provides functions to read motifs in standard JASPAR and TRANSFAC formats. We will compile a lognormal background for a set of two de-novo motifs. <>= motifs.denovo = readMotifs(system.file(package="PWMEnrich", dir="extdata", file="example.transfac"), remove.acc=TRUE) motifs.denovo bg.denovo = makeBackground(motifs.denovo, organism="dm3", type="logn", quick=TRUE) res.denovo = motifEnrichment(sequences, bg.denovo) head(motifRankingForGroup(res.denovo)) @ %$ The function \Robj{makeBackground} does a couple of things. First, it converts the motifs into PWMs by taking the nucleotide background frequencies from the 10031 \textit{D. melanogaster} promoters. Next it fits the lognormal parameters using the same set of sequence. In this example we used \Robj{quick=TRUE} for illustrative purposes. This fits the parameters quickly on a reduced set of 100 promoters. We strongly discourage the users to use this parameter in their research, and instead only use it to obtain rough estimates and for testing. The resulting object \Robj{bg.denovo} can be used same as before to perform motif enrichment (as demonstrated in the last two lines of the example). The background object \Robj{bg.denovo} contains the two PWMs and their background distribution parameters. All of these can be accessed with the \$ operator. <>= bg.denovo bg.denovo$bg.mean @ %$ \section{Using a custom set of background sequences}\label{sec:custom-bg} Low-level functions are available for constructing custom backgrounds. We start with the two de-novo motifs from previous section and fit the background to first 20 \textit{D. melanogaster} promoters. <>= library("BSgenome.Dmelanogaster.UCSC.dm3") # make a lognormal background for the two motifs using only first 20 promoters bg.seq = Dmelanogaster$upstream2000[1:20] # the sequences are split into 100bp chunks and fitted bg.custom = makePWMLognBackground(bg.seq, motifs.denovo, bg.len=100, bg.source="20 promoters split into 100bp chunks") bg.custom @ %$ The \Rfun{makePWMLognBackground} function will convert frequency matrices (\Robj{motifs.denovo}) into PWMs using the same set of sequences on which the distribution is fitted. The frequency matrices can also be manually converted into PWMs using function \Robj{makePriors}, \Robj{getBackgroundFrequencies} and \Robj{PFMtoPWM}. The resulting \Robj{bg.custom} object can be used as before for motif enrichment with \Robj{motifEnrichment} function (as described before). \section{Session information} <>= toLatex(sessionInfo()) @ \bibliographystyle{apalike} \bibliography{references} \end{document}