%\VignetteIndexEntry{iSeq}
%\VignetteDepends{}
%\VignetteKeywords{ChIP-seq,next generation sequencing, massively parallel sequencing}
%\VignettePackage{iSeq}

\documentclass[11pt]{article}
\usepackage{amsmath}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\SweaveOpts{echo=FALSE}

\setlength{\textheight}{8.5in}
\setlength{\textwidth}{6in}
\setlength{\topmargin}{-0.25in}
\setlength{\oddsidemargin}{0.25in}
\setlength{\evensidemargin}{0.25in}

\begin{document}
\setkeys{Gin}{width=0.99\textwidth}

\title{\bf iSeq: Bayesian Hierarchical Modeling of ChIP-seq Data Through Hidden Ising Models}

\author{Qianxing Mo}

\maketitle

\begin{center}
Department of Epidemiology and Biostatistics\\
Memorial Sloan-Kettering Cancer Center\\
{\tt moq@mskcc.org}
\end{center}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}

This package implements the models proposed by Mo (2010) for ChIP-seq data analysis, 
which are the extensions of the models proposed for ChIP-chip data (Mo and Liang, 2010a,b).  
The package can be used to analyze the ChIP-seq data with or without controls. \\

\noindent
This package processes ChIP-seq data in three steps. \\
\begin{itemize}
\item {Build a signal profile for each chromosome.  To create the signal profiles, sequence tags are aggregated 
into non-overlapping bins whose length can be decided by the user (e.g. 50 base pairs (bp)).  
The number of tags falling each bin is counted.  The signal profiles are made of the bin-based tag counts and 
the corresponding genomic positions on the chromosomes.  The bins and their tag counts are ordered along the chromosomes
according to their genomic positions. }
\item {Model the signal profiles.  It is to model the tag counts on the chromosomes.  The models assume that
each bin is associated with a binary latent variable $X_i \in (1, -1)$, where $i$ denotes the ID for the bin, 
and $X_i=1$ denotes that the bin is an enriched bin, and -1 for a non-enriched bin.  
In the first stage, conditioning on the latent variable $X_i$, the bin-based tag counts are modeled by
Poisson-Gamma distributions.  If $X_i=-1$, the tag count for bin $i$ is assumed to follow a Poisson distribution 
with parameter $\lambda_0$, where $\lambda_0 \sim \textrm{Gamma}(a_0,b_0)$. If $X_i= 1$,  the tag count for 
bin $i$ is assumed to follow a Poisson distribution with parameter $\lambda_1$, 
where $\lambda_1 \sim \textrm{Gamma}(a_1,b_1)$. $\textrm{Gamma}(a,b)$ denotes a gamma distribution with mean 
$a/b$ and variance $a/b^2$.  In the second stage, the latent variable is modeled by ferromagnetic Ising models.
The Gibbs sampler and Metropolis algorithm are used to simulate from the posterior distributions of the model parameters.}
\item {Call enriched regions.  The posterior probabilities for the bins in the enriched state ($X_i = 1$) are used 
for statistical inference.  A bin with a high posterior probability in the enriched state will provide strong evidence 
that the bin is enriched. Enriched bins are then merged into enriched regions.}  
\end{itemize}

\noindent
For more information, we refer the user to Mo and Liang (2010 a, b) because the manuscript (Mo, 2010c) is still under 
review.  The major difference for modeling ChIP-chip and ChIP-seq data is at the first stage, where normal distributions
are used for ChIP-chip data, and Poisson distributions are used for ChIP-seq data. 

\section{The NRSF ChIP-seq Data}

Johnson et al.(2007) carried out genome-wide identification of the binding sites of 
human neuron-restrictive silencer factor (NRSF) in Jurkat T cells.  We use the data of chromosomes 22 and Y
to illustrate the proposed method.  The NRSF sequence tags (25 bp) were generated by 
the Illumina/Solexa sequencing platform, and mapped to The human genome May 2004 (hg17). 
We only use the uniquely mapped tags (up to two mismatches) for the analysis. 
Note that iSeq package doesn't provide functions to read the raw data generated by 
the sequencers.  However, it should not be difficult to prepare the data in the format as shown in 
the following.

\section{Example --- Analyze the NRSF Data}
\noindent
Firstly, let's load the library and check the data.  There are three ChIP and control samples, respectively. 
<<echo=TRUE,print=FALSE>>=
library(iSeq)
data(nrsf)
names(nrsf)
@ 
\subsection{Build the signal profiles using the mergetag function}
\noindent
The following two steps just merge the ChIP and control samples, respectively.
<<echo=TRUE,print=FALSE>>=
chip = rbind(nrsf$chipFC1592,nrsf$chipFC1862,nrsf$chipFC2002)
mock = rbind(nrsf$mockFC1592,nrsf$mockFC1862,nrsf$mockFC2002)
print(chip[1:3,])
print(mock[1:3,])
@ 
\noindent
We suggest building the signal profiles using a 50 bp window.  Use a small window size (e.g. 25 - 75 bp) in order to 
precisely infer the true binding sites. If the sequenced DNA fragments are around 200 bp, it is expected that 
an enriched region is made of 4 bins.  However, if the window size is too small, 
it may lose power to detect enriched regions.
<<echo=TRUE,print=FALSE>>=
tagct = mergetag(chip=chip,control=mock,winsize=50)
print(tagct[1:3,])
@ 
\noindent
See the help file for function 'mergetag' for the meanings of the tagct columns. 
 
\noindent
The user can quickly get the 'enriched' regions without calculating
statistical confidence using function 'peakreg'.  For example, if we claim that a bin with adjusted tag count $>$ 10 is 
an enriched bin,  the 'enriched' regions can be obtained by using the following code.  Let's use the chromosome 22 data as
an example.  
<<echo=TRUE,print=FALSE>>=
tagct22 = tagct[tagct[,1]=="chr22",]  #chr22 data 
reg0 = peakreg(chrpos = tagct22[, 1:3], count = (tagct22[, 5:6] -tagct22[, 7:8]), pp = (tagct22[,4]>10),cutoff = 0, method = "ppcut",maxgap=300)
print(dim(reg0)) 
print(reg0[1:3,]) # each row contain the information for an enriched region
@ 

\subsection{Model the signal profiles using the iSeq1 function}
\noindent 
Function iSeq1 implements a fully Bayesian hidden Ising model in which the latent variable $X$ is modeled by 
the standard 1D Ising model.  The columns 1-4 of tagct are the signal profiles for modeling.  
Note that the column 4 is the adjusted tag counts of the ChIP samples. For one sample analysis, 
it is the total tag counts for the forward and reverse chains. 
<<echo=TRUE,print=FALSE>>=
set.seed(777)
res1 = iSeq1(Y=tagct22[,1:4],gap=300,burnin=200,sampling=1000,ctcut=3,a0=1,b0=1,a1=5,b1=1,
  k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE)
@ 

\noindent 
Plot the model parameters to see whether they converge.  In general, the MCMC chains have converged when the parameters 
fluctuate around the modes of their distributions.  If there is an obvious trend(e.g. continuous increase or decrease), 
the user should increase the number of iterations in the burn-in and/or sampling phases. 
If the chains do not mix well, the user can adjust the argument normsd to see how it affects the results. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
hist(res1$pp)
plot(res1$kappa, pch=".", xlab="Iterations", ylab="kappa")
plot(res1$lambda0, pch=".", xlab="Iterations", ylab="lambda0")
plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
@
\end{center}
\noindent
From the trace plots, we see the chains converge quite fast.  

\subsection{Call the enriched regions detected by iSeq1 using the peakreg function}
\noindent
Call the enriched regions detected by iSeq1 using 0.5 posterior probability (pp) cutoff.  
Note the argument count is the net tag counts for the two sample analysis.   
The net tag counts are not truncated at zero, but this 
doesn't matter because it is just used for inferring the center of the enriched region, 
which is usually the true binding site.  The user can also use the ChIP tag counts only.  
The results are little different. See the help file of peakreg for details. 
<<echo=TRUE,print=FALSE>>=
reg1 = peakreg(chrpos=tagct22[,1:3],count=(tagct22[,5:6]-tagct22[,7:8]),pp=res1$pp,cutoff=0.5,method="ppcut",maxgap=300)
print(dim(reg1))
print(reg1[1:3,])
@ 
\noindent
Note, the 5' positions of the sequence tags are used as the genomic positions for the NRSF data.  To infer the actual 
binding sites, one may add 13 bp to the peak position (reg1\$peakpos + 13) because the tags' length is 25 bp.  
If the middle positions of the sequence tags are used as the genomic positions, the user doesn't need to do the adjustment.\\


\noindent
Plot some enriched regions. The dash lines indicate the region center, usually the true binding sites. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
for(i in 1:4){
  ID = (reg1[i,4]):(reg1[i,5])
  plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg1[i,6])
}
@ 
\end{center}

\noindent
Call the enriched regions using 0.05 FDR cutoff.  The FDR is calculated
using a direct posterior probability approach (Newton et al., 2004). 
<<echo=TRUE,print=FALSE>>=
reg2 = peakreg(tagct22[,1:3],tagct22[,5:6]-tagct22[,7:8],res1$pp,0.05,method="fdrcut",maxgap=300)
print(dim(reg2))
print(reg2[1:3,])

@ 

\noindent
The columns 1-3 of the enriched regions (e.g. reg2[,1:3]) can be used to extract the sequences 
from the UCSC genome browser. Alternatively, one may create a BED file using the peak position of 
the enriched regions.  For example, 
<<echo=TRUE,print=FALSE>>=
bed = data.frame(chr=reg2[,1],gstart=reg2[,6]-100,gend=reg2[,6]+100)
@ 

\subsection{Model the signal profiles using the iSeq2 function}
\noindent
The latent variable $X$ can be modeled through a high-order Ising model using function iSeq2. 
The interaction parameter $k$ for the high-order (or the standard/first-order) Ising model is fixed 
and set by the user in iSeq2.  To apply the second-order Ising model to ChIP-seq data, the user can 
let winsize = 2.  If set winsize = 1, it will be the standard Ising model.  To use a high-order Ising model,
according to our experience, a balance between high sensitivity and low FDR can be
achieved when winsize = 2.  The critical value for the second-order Ising model is about 1.0. 
In general, increasing the value of $k$ will lead to less enriched regions, which amounts to 
setting a stringent criterion for detecting enriched regions.  

\noindent
Model the NRSF data using the second-order Ising model. 
<<echo=TRUE,print=FALSE>>=
res2 = iSeq2(Y=tagct22[,1:4],gap=300, burnin=100,sampling=500,winsize=2,ctcut=5,
  a0=1,b0=1,a1=5,b1=1,k=1.0,verbose=FALSE)
@ 

\noindent 
Plot the model parameters to see whether they converge.  If the chains do not mix well, one can adjust 
the parameter {\bf k} and/or normsd to see how it affects the results. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
hist(res2$pp)
plot(res2$lambda0, pch=".", xlab="Iterations", ylab="lambda0")
plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
@
\end{center}

\subsection{Call the enriched regions detected by iSeq2 using the peakreg function}
<<echo=TRUE,print=FALSE>>=
reg2 = peakreg(tagct22[,1:3],tagct22[,5:6],res2$pp,0.5,method="ppcut",maxgap=300)
print(dim(reg2))
print(reg2[1:3,])
@ 

\noindent
Plot some enriched regions detected by iSeq2. 
\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
for(i in 1:4){
  ID = (reg2[i,4]):(reg2[i,5])
  plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg2[i,6])
}
@ 
\end{center}

\section{Tips}
\noindent
Finally, let's analyze the chromosome Y data. Intuitively, it seems that there is no binding region on
chromosome Y.  
<<echo=TRUE,print=FALSE>>=
tagctY = tagct[tagct[,1]=="chrY",]
print(table(tagctY[,4]))
res1 = iSeq1(Y=tagctY[,1:4],gap=300,burnin=1000,sampling=5000,ctcut=3,a0=1,b0=1,a1=5,b1=1,
  k0=3,mink=0,maxk=10,normsd=0.5,verbose=FALSE)
res2 = iSeq2(Y=tagctY[,1:4],gap=300, burnin=1000,sampling=5000,winsize=2,ctcut=3,
  a0=1,b0=1,a1=5,b1=1,k=3.0,verbose=FALSE)
@ 

\begin{center}
<<fig=TRUE,echo=TRUE>>=
par(mfcol=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0))
hist(res1$pp)
plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
hist(res2$pp)
plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1")
@ 
\end{center}

\noindent
In this case, all the bins are only in the non-enriched state and the posterior probabilities are zero or close to zero.
In some cases, if the $\lambda_1$ is small (e.g. $\approx a_1/b_1$, the mean value of the gamma prior), it suggests that
there is no enriched region on that chromosome.  The user may not claim any enriched regions found on that chromosome, 
even if some posterior probabilities are high.  
For the studies of transcription factor binding sites, if the posterior probabilities are not dichotomized 
 or not dominated by 0, it suggests that the Ising model is not in the super-paramagnetic phase.  
Only the super-paramagnetic phase reflects the binding events on the chromosomes.  
Therefore, if the user use iSeq2, the user should increase the value of $k$ to let the phase transition occur so that
the Ising model reaches the super-paramagnetic phase. \\

\noindent
The iSeq methods take into account the spatial dependency, the global and local distributions of the sequence tags.   
If the signal profiles have very sparse signals and some bins have very large (adjusted) tag counts (e.g., the NRSF data), 
the estimated $\lambda_1$ will be relatively large, which makes the bins with relatively small counts (e.g., tens) have
low posterior probabilities.  In this case, the user can truncate
the (adjusted) tag counts at certain value (e.g., if count $>$ 100, set count = 100) to increase the power to detect 
the regions with small tag counts.  For the NRSF data, if the adjusted tag counts are truncated at 100, more enriched
regions can be detected.  

\section{Parallel Computation}
\noindent
To speed up the analysis, the user can do parallel computation easily. The user needs to install the snow and snowfall 
packages.  The following is an example. \\

\noindent
library(snow) \\
library(snowfall)  \\ 
dataList = list(chr22=tagct22,chrY=tagctY) \\
sfInit(parallel=TRUE,cpus=2,type="SOCK") \\
res=sfLapply(dataList,iSeq1,gap=300,burnin=100,sampling=200,ctcut=3,a0=1,b0=1,a1=1,b1=1,
  k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE) \\

\section{Citing iSeq}
\noindent
The iSeq1 method is described in ``Mo, Q. (2010). A fully Bayesian hidden Ising model for ChIP-seq data analysis (Submitted).''.  The other functions are described in ``Mo, Q. (2010).  iSeq - A flexible and powerful R/Bioconductor package for analyzing ChIP-seq data (Submitted).``

<<echo=TRUE,print=TRUE>>=
sessionInfo()
@ 

\begin{thebibliography}{}
\bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010a}
Mo, Q., Liang, F. (2010a).
Bayesian Modeling of ChIP-chip data through a high-order Ising Model.  
{\it Biometrics}, 2010 Jan 29 [Epub ahead of print]. DOI: 10.1111/j.1541-0420.2009.01379.x

\bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010b}
Mo, Q., Liang, F. (2010b).
A hidden Ising model for ChIP-chip data analysis.  
{\it Bioinformatics} {\bf 26(6)}, 777-783. doi:10.1093/bioinformatics/btq032

\bibitem[\protect\citeauthoryear{Mo}{2010}]{mo2010c}
Mo, Q. (2010). A fully Bayesian hidden Ising model for ChIP-seq data analysis. (Submitted). 

\bibitem[\protect\citeauthoryear{Mo}{2010}]{mo2010d}
Mo, Q. (2010). iSeq - A flexible and powerful R/Bioconductor package for analyzing ChIP-seq data.(Submitted).

\bibitem[\protect\citeauthoryear{Newton et al.}{2004}]{new04}
Newton, M., Noueiry, A., Sarkar, D., Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric
hierarchical mixture method.
{\it Biostatistics} {\bf 5} , 155-176.

\bibitem[\protect\citeauthoryear{Johnson et al.}{2007}]{johnson07}
Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B. (2007).  
 Genome-wide mapping of in vivo protein-DNA interactions. 
{\it Science} {\bf 316}, 1497-1502. 

\end{thebibliography}
\end{document}