%\VignetteIndexEntry{iSeq} %\VignetteDepends{} %\VignetteKeywords{ChIP-seq,next generation sequencing, massively parallel sequencing} %\VignettePackage{iSeq} \documentclass[11pt]{article} \usepackage{amsmath} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \SweaveOpts{echo=FALSE} \setlength{\textheight}{8.5in} \setlength{\textwidth}{6in} \setlength{\topmargin}{-0.25in} \setlength{\oddsidemargin}{0.25in} \setlength{\evensidemargin}{0.25in} \begin{document} \setkeys{Gin}{width=0.99\textwidth} \title{\bf iSeq: Bayesian Hierarchical Modeling of ChIP-seq Data Through Hidden Ising Models} \author{Qianxing Mo} \maketitle \begin{center} Department of Epidemiology and Biostatistics\\ Memorial Sloan-Kettering Cancer Center\\ {\tt moq@mskcc.org} \end{center} \tableofcontents %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} This package implements the models proposed by Mo (2010) for ChIP-seq data analysis, which are the extensions of the models proposed for ChIP-chip data (Mo and Liang, 2010a,b). The package can be used to analyze the ChIP-seq data with or without controls. \\ \noindent This package processes ChIP-seq data in three steps. \\ \begin{itemize} \item {Build a signal profile for each chromosome. To create the signal profiles, sequence tags are aggregated into non-overlapping bins whose length can be decided by the user (e.g. 50 base pairs (bp)). The number of tags falling each bin is counted. The signal profiles are made of the bin-based tag counts and the corresponding genomic positions on the chromosomes. The bins and their tag counts are ordered along the chromosomes according to their genomic positions. } \item {Model the signal profiles. It is to model the tag counts on the chromosomes. The models assume that each bin is associated with a binary latent variable $X_i \in (1, -1)$, where $i$ denotes the ID for the bin, and $X_i=1$ denotes that the bin is an enriched bin, and -1 for a non-enriched bin. In the first stage, conditioning on the latent variable $X_i$, the bin-based tag counts are modeled by Poisson-Gamma distributions. If $X_i=-1$, the tag count for bin $i$ is assumed to follow a Poisson distribution with parameter $\lambda_0$, where $\lambda_0 \sim \textrm{Gamma}(a_0,b_0)$. If $X_i= 1$, the tag count for bin $i$ is assumed to follow a Poisson distribution with parameter $\lambda_1$, where $\lambda_1 \sim \textrm{Gamma}(a_1,b_1)$. $\textrm{Gamma}(a,b)$ denotes a gamma distribution with mean $a/b$ and variance $a/b^2$. In the second stage, the latent variable is modeled by ferromagnetic Ising models. The Gibbs sampler and Metropolis algorithm are used to simulate from the posterior distributions of the model parameters.} \item {Call enriched regions. The posterior probabilities for the bins in the enriched state ($X_i = 1$) are used for statistical inference. A bin with a high posterior probability in the enriched state will provide strong evidence that the bin is enriched. Enriched bins are then merged into enriched regions.} \end{itemize} \noindent For more information, we refer the user to Mo and Liang (2010 a, b) because the manuscript (Mo, 2010c) is still under review. The major difference for modeling ChIP-chip and ChIP-seq data is at the first stage, where normal distributions are used for ChIP-chip data, and Poisson distributions are used for ChIP-seq data. \section{The NRSF ChIP-seq Data} Johnson et al.(2007) carried out genome-wide identification of the binding sites of human neuron-restrictive silencer factor (NRSF) in Jurkat T cells. We use the data of chromosomes 22 and Y to illustrate the proposed method. The NRSF sequence tags (25 bp) were generated by the Illumina/Solexa sequencing platform, and mapped to The human genome May 2004 (hg17). We only use the uniquely mapped tags (up to two mismatches) for the analysis. Note that iSeq package doesn't provide functions to read the raw data generated by the sequencers. However, it should not be difficult to prepare the data in the format as shown in the following. \section{Example --- Analyze the NRSF Data} \noindent Firstly, let's load the library and check the data. There are three ChIP and control samples, respectively. <>= library(iSeq) data(nrsf) names(nrsf) @ \subsection{Build the signal profiles using the mergetag function} \noindent The following two steps just merge the ChIP and control samples, respectively. <>= chip = rbind(nrsf$chipFC1592,nrsf$chipFC1862,nrsf$chipFC2002) mock = rbind(nrsf$mockFC1592,nrsf$mockFC1862,nrsf$mockFC2002) print(chip[1:3,]) print(mock[1:3,]) @ \noindent We suggest building the signal profiles using a 50 bp window. Use a small window size (e.g. 25 - 75 bp) in order to precisely infer the true binding sites. If the sequenced DNA fragments are around 200 bp, it is expected that an enriched region is made of 4 bins. However, if the window size is too small, it may lose power to detect enriched regions. <>= tagct = mergetag(chip=chip,control=mock,winsize=50) print(tagct[1:3,]) @ \noindent See the help file for function 'mergetag' for the meanings of the tagct columns. \noindent The user can quickly get the 'enriched' regions without calculating statistical confidence using function 'peakreg'. For example, if we claim that a bin with adjusted tag count $>$ 10 is an enriched bin, the 'enriched' regions can be obtained by using the following code. Let's use the chromosome 22 data as an example. <>= tagct22 = tagct[tagct[,1]=="chr22",] #chr22 data reg0 = peakreg(chrpos = tagct22[, 1:3], count = (tagct22[, 5:6] -tagct22[, 7:8]), pp = (tagct22[,4]>10),cutoff = 0, method = "ppcut",maxgap=300) print(dim(reg0)) print(reg0[1:3,]) # each row contain the information for an enriched region @ \subsection{Model the signal profiles using the iSeq1 function} \noindent Function iSeq1 implements a fully Bayesian hidden Ising model in which the latent variable $X$ is modeled by the standard 1D Ising model. The columns 1-4 of tagct are the signal profiles for modeling. Note that the column 4 is the adjusted tag counts of the ChIP samples. For one sample analysis, it is the total tag counts for the forward and reverse chains. <>= set.seed(777) res1 = iSeq1(Y=tagct22[,1:4],gap=300,burnin=200,sampling=1000,ctcut=3,a0=1,b0=1,a1=5,b1=1, k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE) @ \noindent Plot the model parameters to see whether they converge. In general, the MCMC chains have converged when the parameters fluctuate around the modes of their distributions. If there is an obvious trend(e.g. continuous increase or decrease), the user should increase the number of iterations in the burn-in and/or sampling phases. If the chains do not mix well, the user can adjust the argument normsd to see how it affects the results. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) hist(res1$pp) plot(res1$kappa, pch=".", xlab="Iterations", ylab="kappa") plot(res1$lambda0, pch=".", xlab="Iterations", ylab="lambda0") plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1") @ \end{center} \noindent From the trace plots, we see the chains converge quite fast. \subsection{Call the enriched regions detected by iSeq1 using the peakreg function} \noindent Call the enriched regions detected by iSeq1 using 0.5 posterior probability (pp) cutoff. Note the argument count is the net tag counts for the two sample analysis. The net tag counts are not truncated at zero, but this doesn't matter because it is just used for inferring the center of the enriched region, which is usually the true binding site. The user can also use the ChIP tag counts only. The results are little different. See the help file of peakreg for details. <>= reg1 = peakreg(chrpos=tagct22[,1:3],count=(tagct22[,5:6]-tagct22[,7:8]),pp=res1$pp,cutoff=0.5,method="ppcut",maxgap=300) print(dim(reg1)) print(reg1[1:3,]) @ \noindent Note, the 5' positions of the sequence tags are used as the genomic positions for the NRSF data. To infer the actual binding sites, one may add 13 bp to the peak position (reg1\$peakpos + 13) because the tags' length is 25 bp. If the middle positions of the sequence tags are used as the genomic positions, the user doesn't need to do the adjustment.\\ \noindent Plot some enriched regions. The dash lines indicate the region center, usually the true binding sites. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) for(i in 1:4){ ID = (reg1[i,4]):(reg1[i,5]) plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg1[i,6]) } @ \end{center} \noindent Call the enriched regions using 0.05 FDR cutoff. The FDR is calculated using a direct posterior probability approach (Newton et al., 2004). <>= reg2 = peakreg(tagct22[,1:3],tagct22[,5:6]-tagct22[,7:8],res1$pp,0.05,method="fdrcut",maxgap=300) print(dim(reg2)) print(reg2[1:3,]) @ \noindent The columns 1-3 of the enriched regions (e.g. reg2[,1:3]) can be used to extract the sequences from the UCSC genome browser. Alternatively, one may create a BED file using the peak position of the enriched regions. For example, <>= bed = data.frame(chr=reg2[,1],gstart=reg2[,6]-100,gend=reg2[,6]+100) @ \subsection{Model the signal profiles using the iSeq2 function} \noindent The latent variable $X$ can be modeled through a high-order Ising model using function iSeq2. The interaction parameter $k$ for the high-order (or the standard/first-order) Ising model is fixed and set by the user in iSeq2. To apply the second-order Ising model to ChIP-seq data, the user can let winsize = 2. If set winsize = 1, it will be the standard Ising model. To use a high-order Ising model, according to our experience, a balance between high sensitivity and low FDR can be achieved when winsize = 2. The critical value for the second-order Ising model is about 1.0. In general, increasing the value of $k$ will lead to less enriched regions, which amounts to setting a stringent criterion for detecting enriched regions. \noindent Model the NRSF data using the second-order Ising model. <>= res2 = iSeq2(Y=tagct22[,1:4],gap=300, burnin=100,sampling=500,winsize=2,ctcut=5, a0=1,b0=1,a1=5,b1=1,k=1.0,verbose=FALSE) @ \noindent Plot the model parameters to see whether they converge. If the chains do not mix well, one can adjust the parameter {\bf k} and/or normsd to see how it affects the results. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) hist(res2$pp) plot(res2$lambda0, pch=".", xlab="Iterations", ylab="lambda0") plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1") @ \end{center} \subsection{Call the enriched regions detected by iSeq2 using the peakreg function} <>= reg2 = peakreg(tagct22[,1:3],tagct22[,5:6],res2$pp,0.5,method="ppcut",maxgap=300) print(dim(reg2)) print(reg2[1:3,]) @ \noindent Plot some enriched regions detected by iSeq2. \begin{center} <>= par(mfrow=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) for(i in 1:4){ ID = (reg2[i,4]):(reg2[i,5]) plotreg(tagct22[ID,2:3],tagct22[ID,5:6],tagct22[ID,7:8],peak=reg2[i,6]) } @ \end{center} \section{Tips} \noindent Finally, let's analyze the chromosome Y data. Intuitively, it seems that there is no binding region on chromosome Y. <>= tagctY = tagct[tagct[,1]=="chrY",] print(table(tagctY[,4])) res1 = iSeq1(Y=tagctY[,1:4],gap=300,burnin=1000,sampling=5000,ctcut=3,a0=1,b0=1,a1=5,b1=1, k0=3,mink=0,maxk=10,normsd=0.5,verbose=FALSE) res2 = iSeq2(Y=tagctY[,1:4],gap=300, burnin=1000,sampling=5000,winsize=2,ctcut=3, a0=1,b0=1,a1=5,b1=1,k=3.0,verbose=FALSE) @ \begin{center} <>= par(mfcol=c(2,2), mar=c(4.1, 4.1, 2.0, 1.0)) hist(res1$pp) plot(res1$lambda1, pch=".", xlab="Iterations", ylab="lambda1") hist(res2$pp) plot(res2$lambda1, pch=".", xlab="Iterations", ylab="lambda1") @ \end{center} \noindent In this case, all the bins are only in the non-enriched state and the posterior probabilities are zero or close to zero. In some cases, if the $\lambda_1$ is small (e.g. $\approx a_1/b_1$, the mean value of the gamma prior), it suggests that there is no enriched region on that chromosome. The user may not claim any enriched regions found on that chromosome, even if some posterior probabilities are high. For the studies of transcription factor binding sites, if the posterior probabilities are not dichotomized or not dominated by 0, it suggests that the Ising model is not in the super-paramagnetic phase. Only the super-paramagnetic phase reflects the binding events on the chromosomes. Therefore, if the user use iSeq2, the user should increase the value of $k$ to let the phase transition occur so that the Ising model reaches the super-paramagnetic phase. \\ \noindent The iSeq methods take into account the spatial dependency, the global and local distributions of the sequence tags. If the signal profiles have very sparse signals and some bins have very large (adjusted) tag counts (e.g., the NRSF data), the estimated $\lambda_1$ will be relatively large, which makes the bins with relatively small counts (e.g., tens) have low posterior probabilities. In this case, the user can truncate the (adjusted) tag counts at certain value (e.g., if count $>$ 100, set count = 100) to increase the power to detect the regions with small tag counts. For the NRSF data, if the adjusted tag counts are truncated at 100, more enriched regions can be detected. \section{Parallel Computation} \noindent To speed up the analysis, the user can do parallel computation easily. The user needs to install the snow and snowfall packages. The following is an example. \\ \noindent library(snow) \\ library(snowfall) \\ dataList = list(chr22=tagct22,chrY=tagctY) \\ sfInit(parallel=TRUE,cpus=2,type="SOCK") \\ res=sfLapply(dataList,iSeq1,gap=300,burnin=100,sampling=200,ctcut=3,a0=1,b0=1,a1=1,b1=1, k0=3,mink=0,maxk=10,normsd=0.1,verbose=FALSE) \\ \section{Citing iSeq} \noindent The iSeq1 method is described in ``Mo, Q. (2010). A fully Bayesian hidden Ising model for ChIP-seq data analysis (Submitted).''. The other functions are described in ``Mo, Q. (2010). iSeq - A flexible and powerful R/Bioconductor package for analyzing ChIP-seq data (Submitted).`` <>= sessionInfo() @ \begin{thebibliography}{} \bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010a} Mo, Q., Liang, F. (2010a). Bayesian Modeling of ChIP-chip data through a high-order Ising Model. {\it Biometrics}, 2010 Jan 29 [Epub ahead of print]. DOI: 10.1111/j.1541-0420.2009.01379.x \bibitem[\protect\citeauthoryear{Mo and Liang.}{2010}]{mo2010b} Mo, Q., Liang, F. (2010b). A hidden Ising model for ChIP-chip data analysis. {\it Bioinformatics} {\bf 26(6)}, 777-783. doi:10.1093/bioinformatics/btq032 \bibitem[\protect\citeauthoryear{Mo}{2010}]{mo2010c} Mo, Q. (2010). A fully Bayesian hidden Ising model for ChIP-seq data analysis. (Submitted). \bibitem[\protect\citeauthoryear{Mo}{2010}]{mo2010d} Mo, Q. (2010). iSeq - A flexible and powerful R/Bioconductor package for analyzing ChIP-seq data.(Submitted). \bibitem[\protect\citeauthoryear{Newton et al.}{2004}]{new04} Newton, M., Noueiry, A., Sarkar, D., Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. {\it Biostatistics} {\bf 5} , 155-176. \bibitem[\protect\citeauthoryear{Johnson et al.}{2007}]{johnson07} Johnson, D.S., Mortazavi, A., Myers, R.M., Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. {\it Science} {\bf 316}, 1497-1502. \end{thebibliography} \end{document}