%\VignetteIndexEntry{motifRG: regression-based discriminative motif discovery}
%\VignetteDepends{Biostrings,IRanges,seqLogo,parallel}
%\VignetteKeywords{Motif analysis}
%\VignettePackage{motifRG}
\documentclass[12pt]{article}

\usepackage{subfigure}
\usepackage{hyperref}
%%\usepackage{eqnarray}
%%\usepackage{amsmath, amsthm, amssymb} 
\usepackage{Sweave}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}


\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}


\usepackage{cite}

\usepackage{graphicx}
\usepackage{epsfig}
\usepackage{url}
\usepackage{rotating}
\begin{document}


\section{Introduction}
The emergence of ChIP-seq technology for genome-wide profiling of transcription factor binding sites (TFBS) has made it possible to characterize TFBS motifs very precisely. Harnessing the huge volume of data generated by this technology presents many computational challenges. 
We propose a novel motif discovery algorithm that is scalable to large datasets and performs discriminative motif 
discovery by searching for the most differential motifs between a foreground and a background sequence dataset. 
This tool can be used in a traditional setting in which the foreground sequence dataset is derived from a ChIP-seq binding profile and the background sequence dataset is either sampled from the genome or generated from a null model.
It can also be used for comparative studies involving two TFBS binding profiles. 

In a nutshell, the method works as follows:
we enumerate all fixed-length n-mers exhaustively and measure their
discriminative power with a logistic regression model. 
The top-ranking seed motif is then iteratively refined by allowing IUPAC degenerate letters and is automatically extended to a longer motif. 
We introduce a bootstrapping robustness test to avoid over-fitting in the optimization process. 
The logistic regression framework offers a direct measurement of statistical significance, and we 
demonstrate by permutation tests that the z-value statistics do reflect the probability of occurrence by chance.
Compared to traditional motif finding tools, the use of proper control sequences for comparison avoids the difficulty of 
modeling the true genomic background, which usually presents complicated high-order structure such as dinucleotide 
sequence preference, repeats, nucleosome positioning signals, etc. 
When used to compare two similar ChIP-seq samples, the discriminative motifs
usually lead to insights into sample specificity. 
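
The seed-scoring step described above can be sketched in base R as follows.
This is a simplified illustration under stated assumptions, not the
implementation in \Rpackage{motifRG}: the toy sequences, the planted motif
\texttt{CCCTC}, and the helpers \Rfunction{randSeq} and \Rfunction{scoreKmer}
are hypothetical, and each n-mer is scored only by whether it occurs in a
sequence. Refinement with degenerate letters, extension, and the bootstrap
test are omitted.

<<eval=FALSE>>=
set.seed(1)
bases <- c("A", "C", "G", "T")

## Hypothetical helper: n random DNA strings of a given length.
randSeq <- function(n, len) {
    replicate(n, paste(sample(bases, len, replace=TRUE), collapse=""))
}

## Toy data: plant the illustrative motif "CCCTC" in 150 of the 200
## foreground sequences; the background is left purely random.
fg <- randSeq(200, 50)
bg <- randSeq(200, 50)
planted <- sample(200, 150)
substr(fg[planted], 20, 24) <- "CCCTC"

toy.seq <- c(fg, bg)
toy.category <- c(rep(1, length(fg)), rep(0, length(bg)))

## Enumerate all 4^k n-mers of fixed length k.
k <- 5
kmers <- apply(expand.grid(rep(list(bases), k), stringsAsFactors=FALSE),
               1, paste, collapse="")

## Hypothetical helper: z-value of the presence indicator of one n-mer
## in a univariate logistic regression of category on presence.
scoreKmer <- function(kmer, seqs, category) {
    present <- as.integer(grepl(kmer, seqs, fixed=TRUE))
    if (length(unique(present)) < 2) return(NA_real_)
    fit <- glm(category ~ present, family=binomial)
    coef(summary(fit))["present", "z value"]
}

z <- vapply(kmers, scoreKmer, numeric(1),
            seqs=toy.seq, category=toy.category)

## Top-ranking candidate seeds.
head(sort(z, decreasing=TRUE), 3)
@

In this toy setting the planted n-mer should receive by far the largest
z-value, while n-mers that merely overlap it score lower because they also
depend on the random flanking bases.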

\section{Example}
We have applied this technique to a CTCF ChIP-seq experiment. The positive
dataset contains 10,000 CTCF ChIP-seq binding sites, each 200 bases long.
The negative dataset contains the same number of sequences, of the same
length as those in the positive set. They are chosen from ChIP-seq mapped 
regions with low coverage, and they share the same
distribution of distance to the transcription start site as the positives,
to adjust for any promoter bias. 
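
One simple way to obtain a background with a matched distance distribution is
to bin the distances and sample the same number of candidates per bin. The
sketch below illustrates this idea on simulated distances; it is not
necessarily how \Robject{control.seq} was constructed, and the inputs
\Robject{fg.dist} and \Robject{pool.dist}, as well as the choice of bins, are
hypothetical.

<<eval=FALSE>>=
set.seed(2)
## Hypothetical inputs: distance to the nearest transcription start site
## for foreground peaks and for a larger pool of candidate regions.
fg.dist   <- round(rexp(1000,  rate=1/5000))
pool.dist <- round(rexp(20000, rate=1/20000))

## Bin distances on a log scale so that promoter-proximal bins are narrow.
breaks   <- c(0, 10^seq(2, 6, by=0.5), Inf)
fg.bin   <- cut(fg.dist,   breaks, include.lowest=TRUE)
pool.bin <- cut(pool.dist, breaks, include.lowest=TRUE)

## In each bin, draw (without replacement) as many candidates as there
## are foreground peaks, or all candidates if the pool is too small.
picked <- unlist(lapply(levels(fg.bin), function(b) {
    idx <- which(pool.bin == b)
    n   <- min(sum(fg.bin == b), length(idx))
    if (n == 0) return(integer(0))
    idx[sample.int(length(idx), n)]
}))
bg.dist <- pool.dist[picked]

## Per-bin counts of the foreground and of the matched background.
rbind(foreground=table(fg.bin),
      background=table(cut(bg.dist, breaks, include.lowest=TRUE)))
@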


<<results=hide>>=
library(motifRG)
data(ctcf.seq)
data(control.seq)
### concatenate the foreground, background sequences
all.seq <- append(ctcf.seq, control.seq)
### specify which sequences are foreground, background. 
category <- c(rep(1, length(ctcf.seq)), rep(0, length(control.seq))) 
### find motifs
ctcf.motifs <- findMotif(all.seq=all.seq, category=category, max.motif=3)     
@

<<results=tex, include=FALSE>>=
motifLatexTable(main="CTCF motifs", ctcf.motifs)
@


<< results=hide>>=
###Find a refined PWM model given the motif matches as seed
pwm.match <- refinePWMMotif(ctcf.motifs$motifs[[1]]@match$pattern, ctcf.seq)
library(seqLogo)
@

\begin{figure}[h!t]
\centering  
<<fig=TRUE,  height=8, width=12>>=
seqLogo(pwm.match$model$prob)
@ 
\caption{PWM logo of CTCF PWM matches}
\end{figure}
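
\Rfunction{seqLogo} expects a matrix with four rows (A, C, G, T) whose columns
each sum to 1. As a minimal illustration of how such a matrix can be estimated
from aligned matches of equal length, the following base-R sketch counts bases
per position and adds a pseudocount. It is not the refinement procedure used
by \Rfunction{refinePWMMotif}; the toy matches, the helper
\Rfunction{toyPWM}, and the pseudocount value are hypothetical.

<<eval=FALSE>>=
## Hypothetical aligned matches of equal length (illustrative only).
toy.matches <- c("CCCTC", "CCCTC", "CCATC", "CCCTG", "TCCTC")

## Hypothetical helper: per-position base probabilities with a pseudocount.
toyPWM <- function(seqs, pseudo=0.5) {
    bases  <- c("A", "C", "G", "T")
    ## One row per match, one column per motif position.
    chars  <- do.call(rbind, strsplit(seqs, ""))
    ## 4 x L matrix of base counts at each position.
    counts <- apply(chars, 2,
                    function(col) table(factor(col, levels=bases)))
    rownames(counts) <- bases
    (counts + pseudo) / (nrow(chars) + 4 * pseudo)
}

toy.pwm <- toyPWM(toy.matches)
## Each column is a probability distribution over the four bases.
colSums(toy.pwm)
@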


<<results=hide>>=
## Motifs found by findMotif tend to be relatively short, as longer and
## more specific motif models do not necessarily provide better
## discrimination of foreground vs background if they are
## already well separated. However, one can refine and extend a PWM model,
## using the motif matches from findMotif as seed, for a more specific model.
pwm.match.extend <- 
    refinePWMMotifExtend(ctcf.motifs$motifs[[1]]@match$pattern, ctcf.seq)
@ 

\begin{figure}[h!t]
\centering  
<<fig=TRUE, height=8, width=12>>=
seqLogo(pwm.match.extend$model$prob)
@ 
\caption{PWM logo of the extended CTCF PWM matches}
\end{figure}


\begin{figure}[h!t]
\centering  
<<fig=TRUE, height=8, width=12>>=
plotMotif(pwm.match.extend$match$pattern)
@ 
\end{figure}

%%\bibliographystyle{plain}
%%\bibliography{motif}

\end{document}