%\VignetteIndexEntry{motifRG: regression-based discriminative motif discovery}
%\VignetteDepends{Biostrings,IRanges,seqLogo,parallel}
%\VignetteKeywords{Motif analysis}
%\VignettePackage{motifRG}
\documentclass[12pt]{article}

\usepackage{subfigure}
\usepackage{hyperref}
%%\usepackage{eqnarray}
%%\usepackage{amsmath, amsthm, amssymb}
\usepackage{Sweave}

\textwidth=6.2in
\textheight=8.5in
%\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}

\usepackage{cite}
\usepackage{graphicx}
\usepackage{epsfig}
\usepackage{url}
\usepackage{rotating}

\begin{document}

\section{Introduction}

The emergence of ChIP-seq technology for genome-wide profiling of
transcription factor binding sites (TFBS) has made it possible to
categorize TFBS motifs very precisely. Harnessing the huge volume of
data generated by this technology presents many computational
challenges. We propose a novel motif discovery algorithm that scales to
large databases and performs discriminative motif discovery by searching
for the motifs that differ most between a foreground and a background
sequence dataset. This tool can be used in a traditional setting in
which the foreground sequence dataset is derived from a ChIP-seq binding
profile, and the background sequence dataset is either sampled from the
genome or generated from a null model. It can also be used for
comparative studies involving two TFBS binding profiles.

In a nutshell, the method works as follows: we enumerate all
fixed-length n-mers exhaustively, and measure their discriminative power
with a logistic regression model.
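
The seed-scoring step can be illustrated with a minimal base-R sketch.
It is not the \Rpackage{motifRG} implementation: the toy sequences, the
planted word GATTACA, the helper functions \Rfunction{random.seq} and
\Rfunction{kmer.z}, and the use of presence/absence of a word rather
than any other count statistic are all assumptions made for
illustration only.

<<eval=FALSE>>=
## Illustrative sketch only (base R); NOT the motifRG implementation.
## The toy data, the planted word and the helper names are hypothetical.
set.seed(1)
bases <- c("A", "C", "G", "T")
random.seq <- function(n, len) {
  replicate(n, paste(sample(bases, len, replace=TRUE), collapse=""))
}

## Toy data: 200 foreground and 200 background sequences of 50 bases.
## The word GATTACA is planted in the first 100 foreground sequences.
fg <- random.seq(200, 50)
planted <- 1:100
fg[planted] <- paste0(substr(fg[planted], 1, 19), "GATTACA",
                      substr(fg[planted], 27, 50))
bg <- random.seq(200, 50)
all.seq <- c(fg, bg)
category <- c(rep(1, length(fg)), rep(0, length(bg)))

## Enumerate all 4^k words of fixed length k.
k <- 4
kmers <- do.call(paste0, expand.grid(rep(list(bases), k),
                                     stringsAsFactors=FALSE))

## Score one word: logistic regression of the class label on the
## presence/absence of the word (a simplification); return the z-value.
kmer.z <- function(kmer) {
  x <- as.numeric(grepl(kmer, all.seq, fixed=TRUE))
  if (length(unique(x)) < 2) return(NA_real_)  # no variation, no fit
  fit <- glm(category ~ x, family=binomial)
  coef(summary(fit))["x", "z value"]
}
z.scores <- sapply(kmers, kmer.z)

## Words ranked by discriminative power; the top ones are seed candidates.
head(sort(z.scores, decreasing=TRUE))
@

In this toy setting the highest-ranking words are expected to be
sub-words of the planted motif, which is what makes them useful seeds
for the refinement step.
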
The top-ranking seed motif is then iteratively refined by allowing IUPAC
degenerate letters and is automatically extended to a longer motif. We
introduce a bootstrapping robustness test to avoid over-fitting in the
optimization process. The logistic regression framework offers a direct
measurement of statistical significance, and we demonstrate by
permutation tests that the z-value statistics do reflect the probability
of occurrence by chance. Compared to traditional motif-finding tools,
the use of proper control sequences for comparison avoids the difficulty
of modeling the true genomic background, which usually presents
complicated high-order structure such as dinucleotide sequence
preferences, repeats, nucleosome positioning signals, etc. When used to
compare two similar ChIP-seq samples, the discriminative motifs usually
lead to insights into sample specificity.

\section{Example}

We have applied this technique to a CTCF ChIP-seq experiment. The
positive dataset contains 10,000 CTCF ChIP-seq binding sites, each with
200 bases. The negative dataset contains the same number of sequences,
of the same length, as the positive set. The negative sequences are
chosen from ChIP-seq mapped regions with low coverage, and they share
the same distribution of distance to the transcription start site as the
positives, to adjust for any promoter bias.

<<>>=
library(motifRG)
data(ctcf.seq)
data(control.seq)
### concatenate the foreground and background sequences
all.seq <- append(ctcf.seq, control.seq)
### specify which sequences are foreground (1) and which are background (0)
category <- c(rep(1, length(ctcf.seq)), rep(0, length(control.seq)))
### find motifs
ctcf.motifs <- findMotif(all.seq=all.seq, category=category, max.motif=3)
@

<<results=tex>>=
motifLatexTable(main="CTCF motifs", ctcf.motifs)
@

<<results=hide>>=
### Find a refined PWM model, using the motif matches as seed
pwm.match <- refinePWMMotif(ctcf.motifs$motifs[[1]]@match$pattern, ctcf.seq)
library(seqLogo)
@

\begin{figure}[h!t]
\centering
<<fig=TRUE>>=
seqLogo(pwm.match$model$prob)
@
\caption{PWM logo of CTCF PWM matches}
\end{figure}

<<>>=
## Motifs found by findMotif tend to be relatively short, as longer and
## more specific motif models do not necessarily provide better
## discrimination of foreground vs background if they are
## already well separated. However, one can refine and extend a PWM model,
## using the motif matches from findMotif as seed, to obtain a more
## specific model.
pwm.match.extend <-
  refinePWMMotifExtend(ctcf.motifs$motifs[[1]]@match$pattern, ctcf.seq)
@

\begin{figure}[h!t]
\centering
<<fig=TRUE>>=
seqLogo(pwm.match.extend$model$prob)
@
\caption{PWM logo of the extended CTCF PWM matches}
\end{figure}

\begin{figure}[h!t]
\centering
<<fig=TRUE>>=
plotMotif(pwm.match.extend$match$pattern)
@
\end{figure}

%%\bibliographystyle{plain}
%%\bibliography{motif}

\end{document}