% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- \documentclass[11pt]{article} %% Set my margins \setlength{\oddsidemargin}{0.0truein} \setlength{\evensidemargin}{0.0truein} \setlength{\textwidth}{6.5truein} \setlength{\topmargin}{0.0truein} \setlength{\textheight}{9.0truein} \setlength{\headsep}{0.0truein} \setlength{\headheight}{0.0truein} \setlength{\topskip}{0pt} %% End of margins %%\pagestyle{myheadings} %%\markboth{$Date$\hfil$Revision$}{\thepage} \usepackage[pdftex, bookmarks, bookmarksopen, pdfauthor={Dongjun Chung, Pei Fen Kuan and Sunduz Keles}, pdftitle={mosaics Vignette}] {hyperref} \title{Analysis of ChIP-seq Data with `\texttt{mosaics}' Package} \author{Dongjun Chung$^1$, Pei Fen Kuan$^2$ and S\"und\"uz Kele\c{s}$^{1,3}$\\ $^1$Department of Statistics, University of Wisconsin\\ Madison, WI 53706.\\ $^2$Department of Biostatistics, University of North Carolina at Chapel Hill\\ Chapel Hill, NC 27599.\\ $^3$Department of Biostatistics and Medical Informatics, University of Wisconsin\\Madison, WI 53706.} \date{\today} \SweaveOpts{engine=R, echo=TRUE, pdf=TRUE} \begin{document} %\VignetteIndexEntry{MOSAiCS} %\VignetteKeywords{MOSAiCS} %\VignettePackage{mosaics} \maketitle \tableofcontents \section{Overview} This vignette provides an introduction to the analysis of ChIP-seq data with the `\texttt{mosaics}' package. R package \texttt{mosaics} implements MOSAiCS, a statistical framework for the analysis of ChIP-seq data, proposed in \cite{mosaics}. MOSAiCS stands for ``\textbf{MO}del-based one and two \textbf{S}ample \textbf{A}nalysis and \textbf{I}nference for \textbf{C}hIP-\textbf{S}eq Data''. It implements a flexible parametric mixture modeling approach for detecting peaks, i.e., enriched regions, in one-sample (ChIP sample) or two-sample (ChIP and control samples) ChIP-seq data. It can account for mappability and GC content biases that arise in ChIP-seq data. The package can be loaded with the command: <>= options(prompt = "R> ") @ <>= library("mosaics") @ \section{Getting started} `\texttt{mosaics}' package provides flexible framework for the ChIP-seq analysis. If you have the data for matched control sample, two-sample analysis is recommended. If the ChIP-seq data is deeply sequenced, the two-sample analysis without mappability and GC content (Section \ref{two_sample_no_MGC}) is usually appropriate. For the ChIP-seq data with low sequencing depth, the two-sample analysis with mappability and GC content (Section \ref{two_sample_MGC}) can be useful. When control sample is not available, `\texttt{mosaics}' package accommodates one-sample analysis of ChIP-seq data. In this case, you should have files for mappability and GC content, in addition to the files for ChIP and matched control samples. We recommend users start from Section \ref{mosaicsRunAll} and it discusses the most convenient way to do the two-sample analysis (without using mappability and GC content). Section \ref{two_sample_no_MGC} discusses each step of the two-sample workflow in detail and provides command lines for each step. Sections \ref{two_sample_MGC} and \ref{one_sample} briefly explain the workflow and command lines for the two-sample analysis and the one-sample analysis with mappability and GC content, respectively. We encourage questions or requests regarding `\texttt{mosaics}' package to be posted on our Google group \url{http://groups.google.com/group/mosaics_user_group}. \section{Two-Sample Analysis using \texttt{'mosaicsRunAll'}}\label{mosaicsRunAll} Two-sample analysis without mappability and GC content can be done in a more convenient way, with the command: <>= mosaicsRunAll( chipFile="/scratch/eland/STAT1_ChIP_eland_results.txt", chipFileFormat="eland_result", controlFile="/scratch/eland/STAT1_control_eland_results.txt", controlFileFormat="eland_result", binfileDir="/scratch/bin/", peakFile=c("/scratch/peak/STAT1_peak_list.bed", "/scratch/peak/STAT1_peak_list.gff"), peakFileFormat=c("bed", "gff"), reportSummary=TRUE, summaryFile="/scratch/reports/mosaics_summary.txt", reportExploratory=TRUE, exploratoryFile="/scratch/reports/mosaics_exploratory.pdf", reportGOF=TRUE, gofFile="/scratch/reports/mosaics_GOF.pdf", byChr=FALSE, excludeChr="chrM", FDR=0.05, fragLen=200, binSize=fragLen, capping=0, bgEst="automatic", signalModel="BIC", parallel=TRUE, nCore=8 ) @ `\texttt{mosaicsRunAll}' method imports aligned read files, converts them to bin-level files (generated bin-level files will be saved in the directory specified in `\texttt{binfileDir}' argument for future use), fits the MOSAiCS model, identifies peaks, and exports the peak list. In addition, users can also make `\texttt{mosaicsRunAll}' method generate diverse analysis reports, such as summary report of parameters and analysis results, exploratory plots, and goodness of fit (GOF) plots. Arguments of `\texttt{mosaicsRunAll}' method are summarized in Table \ref{tab:mosaicsRunAll}. See Section \ref{constructBins} for details of the arguments `\texttt{chipFileFormat}', `\texttt{controlFileFormat}', `\texttt{byChr}', `\texttt{excludeChr}', `\texttt{fragLen}', `\texttt{binSize}', and `\texttt{capping}'. See Section \ref{mosaicsFit}' for details of the argument `\texttt{bgEst}'. See Section \ref{mosaicsPeak}' for details of the arguments `\texttt{FDR}', `\texttt{signalModel}', `\texttt{peakFileFormat}', `\texttt{maxgap}', `\texttt{minsize}', and `\texttt{thres}'. \begin{table} \caption{ \label{tab:mosaicsRunAll} \textbf{Summary of the arguments of `\texttt{mosaicsRunAll}' method.} } \label{t:case} \begin{center} \begin{tabular}{ll} \multicolumn{2}{c}{(a) Input and output files} \\ \hline Argument & Explanation \\ \hline \texttt{chipFile} & Name of aligned read file of ChIP sample.\\ \texttt{chipFileFormat} & File format of aligned read file of ChIP sample.\\ \texttt{controlFile} & Name of aligned read file of matched control sample.\\ \texttt{controlFileFormat} & File format of aligned read file of matched control sample.\\ \texttt{binfileDir} & Directory that bin-level files are exported to.\\ \texttt{peakFile} & Vector of file names of peak list.\\ \texttt{peakFileFormat} & Vector of file formats of peak list.\\ \hline \multicolumn{2}{c}{} \\ \multicolumn{2}{c}{(b) Reports} \\ \hline Argument & Explanation \\ \hline \texttt{reportSummary} * & Generate analysis summary? \\ \texttt{summaryFileName} & File name of analysis summary. \\ \texttt{reportExploratory} * & Generate exploratory plots? \\ \texttt{exploratoryFileName} & File name of exploratory plots. \\ \texttt{reportGOF} * & Generate GOF plots? \\ \texttt{gofFileName} & File name of GOF plots. \\ \hline \multicolumn{2}{l}{* Reports will be generated only when these arguments are \texttt{TRUE}. Default is \texttt{FALSE}.} \\ \multicolumn{2}{c}{} \\ \multicolumn{2}{c}{(c) Tuning parameters} \\ \hline Argument & Explanation \\ \hline \texttt{byChr} & Genome-wide analysis (\texttt{FALSE}) or chromosome-wise analysis (\texttt{TRUE})? \\ \texttt{excludeChr} & Vector of chromosomes to be excluded from the analysis. \\ \texttt{fragLen} & Average fragment length. \\ \texttt{binSize} & Bin size. \\ \texttt{capping} & Cap read counts in aligned read files? \\ \texttt{bgEst} & Background estimation approach. \\ \texttt{signalModel} & Signal model. \\ \texttt{FDR} & False discovery rate (FDR). \\ \texttt{maxgap} & Distance between initial peaks for merging. \\ \texttt{minsize} & Minimum width to be called as a peak. \\ \texttt{thres} & Minimum ChIP tag counts to be called as a peak. \\ \texttt{parallel} & Use \texttt{parallel} package for parallel computing? \\ \texttt{nCore} & Number of CPUs used for parallel computing. ** \\ \hline \multicolumn{2}{l}{** Relevant only when \texttt{parallel=TRUE} and \texttt{parallel} package is installed.} \\ \end{tabular} \end{center} \end{table} \section{Workflow: Two-Sample Analysis}\label{two_sample_no_MGC} \subsection{Constructing Bin-Level Files from the Aligned Read File}\label{constructBins} R package `\texttt{mosaics}' analyzes the data after converting aligned read files into bin-level files for modeling and visualization purposes. These bin-level data can easily be generated from the aligned read files with the command: <>= constructBins( infile="/scratch/eland/STAT1_eland_results.txt", fileFormat="eland_result", outfileLoc="/scratch/eland/", byChr=FALSE, excludeChr="chrM", fragLen=200, binSize=200, capping=0 ) @ You can specify the name and file format of the aligned read file in `\texttt{infile}' and `\texttt{fileFormat}' arguments, respectively. `\texttt{constructBins}' method currently allows the following aligned read file formats: Eland result (`\texttt{"eland}$\_$\texttt{result"}'), Eland extended (`\texttt{"eland}$\_$\texttt{extended"}'), Eland export (`\texttt{"eland}$\_$\texttt{export"}'), default Bowtie (`\texttt{"bowtie"}'), SAM (`\texttt{"sam"}'), BED (`\texttt{"bed"}'), and CSEM BED (`\texttt{"csem"}'). This method assumes that these aligned read files are obtained from single-end tag (SET) experiments. If input file format is neither BED nor CSEM BED, it retains only reads mapping uniquely to the reference genome (uni-reads). Even though `\texttt{constructBins}' retains only uni-reads for most aligned read file formats, reads mapping to multiple locations on the reference genome (multi-reads) can be easily incorporated into bin-level files by utilizing our multi-read allocator, \texttt{CSEM} (ChIP-Seq multi-read allocator using Expectation-Maximization algorithm). Galaxy tool for CSEM is available in Galaxy Tool Shed (\url{http://toolshed.g2.bx.psu.edu/}; ``csem'' under ``Next Gen Mappers''). Stand-alone version of CSEM is also available at \url{http://www.stat.wisc.edu/~keles/Software/multi-reads/}. CSEM exports uni-reads and allocated multi-reads into standard BED file and the corresponding bin-level files can be constructed by applying `\texttt{constructBins}' method to this BED file with the argument `\texttt{fileFormat="csem"}'. `\texttt{constructBins}' can generate a single bin-level file containing all chromosomes (for a genome-wide analysis) or multiple bin-level files for each chromosome (for a chromosome-wise analysis). If `\texttt{byChr=FALSE}', bin-level data for all chromosomes are exported to one file named as `\texttt{[infileName]}$\_$\texttt{fragL[fragLen]}$\_$\texttt{bin[binSize].txt}', where \texttt{[infileName]}, \texttt{[fragLen]}, and \texttt{[binSize]} are name of aligned read file, average fragment length, and bin size, respectively. If `\texttt{byChr=TRUE}', bin-level data for each chromosome is exported to a separate file named as `\texttt{[infileName]}$\_$\texttt{fragL[fragLen]}$\_$\texttt{bin[binSize]}$\_$\texttt{[chrID].txt}', where \texttt{[chrID]} is chromosome ID that reads align to. These chromosome IDs (\texttt{[chrID]}) are extracted from the aligned read file. The constructed bin-level files are exported to the directory specified in `\texttt{outfileLoc}' argument. If you want to exclude some chromosomes in the processed bin-level files, you can specify these chromosomes in `\texttt{excludeChr}' argument. You can specify average fragment length and bin size in `\texttt{fragLen}' and `\texttt{binSize}' arguments, respectively, and these arguments control the resolution of bin-level ChIP-seq data. By default, average fragment length is set to 200 bp, which is the common fragment length for Illumina sequences, and bin size equals to average fragment length. `\texttt{capping}' argument indicates maximum number of reads allowed to start at each nucleotide position. Using some small value for capping (e.g., `\texttt{capping=3}') will exclude extremely large read counts that might correspond to PCR amplification artifacts, which is especially useful for the ChIP-seq data with low sequencing depth. Capping is not applied (default) if `\texttt{capping}' is set to some non-positive value, e.g., `\texttt{capping=0}'. \subsection{Reading Bin-Level Data into the R Environment}\label{readBins} You now have bin-level ChIP data and matched control sample data from '\texttt{constructBins}'. In this vignette, we use chromosome 21 data from a ChIP-seq experiment of STAT1 binding in interferon-$\gamma$-stimulated HeLa S3 cells \cite{stat1}. `\texttt{mosaicsExample}' package provides this example dataset. <>= library(mosaicsExample) @ Bin-level data can be imported to the R environment with the command: <>= exampleBinData <- readBins( type=c("chip","input"), fileName=c( system.file( file.path("extdata","chip_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","input_chr21.txt"), package="mosaicsExample") ) ) @ For the `\texttt{type}' argument, \texttt{"chip"} and \texttt{"input"} indicate bin-level ChIP data control sample data, respectively. You need to specify the corresponding file names in `\texttt{fileName}'. `\texttt{mosaics}' package assumes that each file name in `\texttt{fileName}' is provided in the same order as in `\texttt{type}'. In \texttt{mosaics} package, you can do either genome-wide analysis or chromosome-wise analysis and this analysis type will be determined automatically based on the contents of bin-level files imported using `\texttt{readBins}'. If the bin-level files contain more than one chromosome (i.e., bin-level files are obtained using `\texttt{byChr=FALSE}' in `\texttt{constructBins}'), `\texttt{mosaicsFit}' will analyze all the chromosomes simultaneously (genome-wide analysis). Note that if these bin-level files contain different sets of chromosomes, then `\texttt{readBins}' method will utilize only the intersection of them. If bin-level files are obtained using `\texttt{byChr=FALSE}' in `\texttt{constructBins}', each bin-level file contains data for only one chromosome and each of these bin-level files need to be analyzed separately (chromosome-wise analysis). The genome-wide analysis usually provide more stable model fitting and peak identification results so it is recommended for most cases. R package \texttt{mosaics} provides functions for generating simple summaries of the data. The following command prints out basic information about the bin-level data, such as number of bins and total ``effective tag counts''. ``Total effective tag counts'' is defined as the sum of the ChIP tag counts of all bins. This value is usually larger than the sequencing depth since tags are counted after extension to average fragment length and an extended fragment can contribute to multiple bins. <>= exampleBinData @ `\texttt{print}' method returns the bin-level data in data frame format. <>= print(exampleBinData)[51680:51690,] @ `\texttt{plot}' method provides exploratory plots for the ChIP data. Different type of plots can be obtained by varying the `\texttt{plotType}' argument. `\texttt{plotType="input"}' generates a plot of mean ChIP tag counts versus control tag counts. If `\texttt{plotType}' is not specified, this method plots the histogram of ChIP tag counts. <>= plot( exampleBinData ) plot( exampleBinData, plotType="input" ) @ Figures \ref{fig:bindata-plot-hist} and \ref{fig:bindata-plot-input} display examples of different types of plots. The relationship between mean ChIP tag counts and control tag counts seems to be linear, especially for small control tag counts (Figure \ref{fig:bindata-plot-input}). \begin{figure}[tbh] \begin{center} <>= plot(exampleBinData) @ \caption{\label{fig:bindata-plot-hist} Histograms of the count data from ChIP and control samples.} \end{center} \end{figure} \begin{figure}[tbh] \begin{center} <>= plot( exampleBinData, plotType="input" ) @ \caption{\label{fig:bindata-plot-input} Mean ChIP tag count versus Control tag count.} \end{center} \end{figure} \subsection{Fitting the MOSAiCS Model}\label{mosaicsFit} We are now ready to fit a MOSAiCS model using the bin-level data above (\texttt{exampleBinData}) with the command: <>= exampleFit <- mosaicsFit( exampleBinData, analysisType="IO", bgEst="automatic" ) @ `\texttt{analysisType="IO"}' indicates implementation of the two-sample analysis. `\texttt{bgEst}' argument determines background estimation approach. `\texttt{bgEst="matchLow"}' estimates background distribution using only bins with low tag counts and it is appropriate for the data with relatively low sequencing depth. `\texttt{bgEst="rMOM"}' estimates background distribution using robust method of moment (MOM) and it is appropriate for the data with relatively high sequencing depth. If `\texttt{bgEst="automatic"}' (default), `\texttt{mosaicsFit}' tries its best guess for the background estimation approach, based on the data provided. If the goodness of fit obtained using `\texttt{bgEst="automatic"}' is not satisfactory, we recommend to try `\texttt{bgEst="matchLow"}' and `\texttt{bgEst="rMOM"}' and it might improve the model fit. `\texttt{mosaicsFit}' fits both one-signal-component and two-signal-component models. When identifying peaks, you can choose the number of signal components to be used for the final model. The optimal choice of the number of signal components depends on the characteristics of data. In order to support users in the choice of optimal signal model, \texttt{mosaics} package provides Bayesian Information Criterion (BIC) values and Goodness of Fit (GOF) plots of these signal models. The following command prints out BIC values of one-signal-component and two-signal-component models, with additional information about the parameters used in fitting the background (non-enriched) distribution. A lower BIC value indicates a better model fit. For this dataset, we conclude that the two-signal-component model has a lower BIC and hence it provides a better fit. <>= exampleFit @ `\texttt{plot}' method provides the GOF plot. This plots allows visual comparisons of the fits of the background, one-signal-component, and two-signal-component models with the actual data. Figure \ref{fig:mosaicsfit-plot} displays the GOF plot for our dataset and we conclude that the two-signal-component model provides a better fit as is also supported by its lower BIC value compared to the one-signal component model. <>= plot(exampleFit) @ \begin{figure}[tbh] \begin{center} <>= plot(exampleFit) @ \caption{\label{fig:mosaicsfit-plot} Goodness of Fit (GOF) plot. Depicted are actual data for ChIP and control samples with simulated data from the following fitted models: (Sim:N): Background model; (Sim:N+S1): one-signal-component model; (Sim:N+S1+S2): two-signal-component model. } \end{center} \end{figure} \subsection{Identifying Peaks Based on the Fitted Model}\label{mosaicsPeak} Using BIC values and GOF plots in the previous section, we concluded that two-signal-component model fits our data better. Next, we will identify peaks with the two-signal-component model at a false discovery rate (FDR) of 0.05 using the command: <>= examplePeak <- mosaicsPeak( exampleFit, signalModel="2S", FDR=0.05, maxgap=200, minsize=50, thres=10 ) @ `\texttt{signalModel="2S"}' indicates two-signal-component model. Similarly, one-signal-component model can be specified by `\texttt{signalModel="1S"}'. FDR can be controlled at the desired level by specifying `\texttt{FDR}' argument. In addition to these two essential parameters, you can also control three more parameters, `\texttt{maxgap}', `\texttt{minsize}', and `\texttt{thres}'. These parameters are for refining initial peaks called using specified signal model and FDR. Initial nearby peaks are merged if the distance (in bp) between them is less than `\texttt{maxgap}'. Some initial peaks are removed if their lengths are shorter than `\texttt{minsize}' or their ChIP tag counts are less than `\texttt{thres}'. If you use a bin size shorter than the average fragment length in the experiment, we recommend to set `\texttt{maxgap}' to the average fragment length and `\texttt{minsize}' to the bin size. This setting removes peaks that are too narrow (e.g., singletons). If you set the bin size to the average fragment length (or maybe bin size is larger than the average fragment length), we recommend setting `\texttt{minsize}' to a value smaller than the average fragment length while leaving `\texttt{maxgap}' the same as the average fragment length. This is to prevent filtering using `\texttt{minsize}' because initial peaks would already be at a reasonable width. `\texttt{thres}' is employed to filter out initial peaks with very small ChIP tag counts because such peaks might be false discoveries. Optimal choice of `\texttt{thres}' depends on the sequencing depth of the ChIP-seq data to be analyzed. If you don't wish to filter out initial peaks using ChIP tag counts, you can set `\texttt{thres}' to an arbitrary negative value. The following command prints out a summary of identified peaks including the number of peaks identified, median peak width, and the empirical false discovery rate (FDR). <>= examplePeak @ `\texttt{print}' method returns the peak calling results in data frame format. This data frame can be used as an input for downstream analysis such as motif finding. This output might have different number of columns, depending on `\texttt{analysisType}' of `\texttt{mosaicsFit}'. For example, if `\texttt{analysisType="TS"}', columns are peak start position, peak end position, peak width, averaged posterior probability, minimum posterior probability, averaged ChIP tag count, maximum ChIP tag count, averaged control tag count, averaged control tag count scaled by sequencing depth, averaged log base 2 ratio of ChIP over input tag counts, averaged mappability score, and averaged GC content score for each peak. Here, the posterior probability of a bin refers to the probability that the bin is not a peak conditional on data. Hence, smaller posterior probabilities provide more evidence that the bin is actually a peak. <>= print(examplePeak)[1:15,] @ You can export peak calling results to text files in diverse file formats. Currently, `\texttt{mosaics}' package supports TXT, BED, and GFF file formats. In the exported file, TXT file format (`\texttt{type="txt"}') includes all the columns that `\texttt{print}' method returns. `\texttt{type="bed"}' and `\texttt{type="gff"}' export peak calling results in standard BED and GFF file formats, respectively, where score is the averaged ChIP tag counts in each peak. Peak calling results can be exported in TXT, BED, and GFF file formats, respectively, by the commands: <>= export( examplePeak, type="txt", filename="TSpeakList.txt" ) export( examplePeak, type="bed", filename="TSpeakList.bed" ) export( examplePeak, type="gff", filename="TSpeakList.gff" ) @ `\texttt{fileLoc}' and `\texttt{fileName}' indicate the directory and the name of the exported file. \section{Two-Sample Analysis with Mappability and GC Content}\label{two_sample_MGC} For the two-sample analysis with mappability and GC content and the one-sample analysis, you also need bin-level mappability, GC content, and sequence ambiguity score files for the reference genome you are working with. If you are working with organisms such as human (HG18 and HG19), mouse (MM9), rat (RN4), and Arabidopsis (TAIR9), you can download their corresponding preprocessed mappability, GC content, and sequence ambiguity score files at \url{http://www.stat.wisc.edu/~keles/Software/mosaics/}. If your reference genome of interest is not listed on our website, you can inquire about it at our Google group, \url{http://groups.google.com/group/mosaics_user_group}, and we would be happy to add your genome of interest to the list. The companion website also provides all the related scripts and easy-to-follow instructions to prepare these files. Please check \url{http://www.stat.wisc.edu/~keles/Software/mosaics/} for more details. You can import bin-level data and fit MOSAiCS model for the two-sample analysis using mappability and GC content with the commands: <>= exampleBinData <- readBins( type=c("chip","input","M","GC","N"), fileName=c( system.file( file.path("extdata","chip_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","input_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","M_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","GC_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","N_chr21.txt"), package="mosaicsExample") ) ) @ For the `\texttt{type}' argument, \texttt{"chip"}, \texttt{"input"}, \texttt{"M"}, \texttt{"GC"}, and \texttt{"N"} indicate bin-level ChIP data, control sample data, mappability score, GC content score, and sequence ambiguity score, respectively. When you have mappability and GC contents, `\texttt{plot}' method provides additional plot types. `\texttt{plotType="M"}' and `\texttt{plotType="GC"}' generate plots of mean ChIP tag counts versus mappability and GC content scores, respectively. Moreover, `\texttt{plotType="M|input"}' and `\texttt{plotType="GC|input"}' generate plots of mean ChIP tag counts versus mappability and GC content scores, respectively, conditional on control tag counts. <>= plot( exampleBinData, plotType="M" ) plot( exampleBinData, plotType="GC" ) plot( exampleBinData, plotType="M|input" ) plot( exampleBinData, plotType="GC|input" ) @ As discussed in \cite{mosaics}, we observe that mean ChIP tag count increases as mappability score increases (Figure \ref{fig:bindata-plot-M}). Mean ChIP tag count depends on GC score in a non-linear fashion (Figure \ref{fig:bindata-plot-GC}). When we condition on control tag counts (Figures \ref{fig:bindata-plot-M-input} and \ref{fig:bindata-plot-GC-input} ), mean ChIP tag count versus mappability and GC content relations exhibit similar patterns to that of marginal plots given in Figures \ref{fig:bindata-plot-M} and \ref{fig:bindata-plot-GC}. MOSAiCS incorporates this observation by modeling ChIP tag counts from non-peak regions with a small number of control tag counts as a function of mappability, GC content, and control tag counts. \begin{figure}[tbh] \begin{center} <>= plot( exampleBinData, plotType="M" ) @ \caption{\label{fig:bindata-plot-M} Mean ChIP tag count versus Mappability.} \end{center} \end{figure} \begin{figure}[tbh] \begin{center} <>= plot( exampleBinData, plotType="GC" ) @ \caption{\label{fig:bindata-plot-GC} Mean ChIP tag count versus GC content.} \end{center} \end{figure} \begin{figure}[tbh] \begin{center} <>= plot( exampleBinData, plotType="M|input" ) @ \caption{\label{fig:bindata-plot-M-input} Mean ChIP tag count versus Mappability, conditional on control tag counts.} \end{center} \end{figure} \begin{figure}[tbh] \begin{center} <>= plot( exampleBinData, plotType="GC|input" ) @ \caption{\label{fig:bindata-plot-GC-input} Mean ChIP tag count versus GC content, conditional on control tag counts.} \end{center} \end{figure} Application of MOSAiCS to multiple case studies of ChIP-seq data with low sequencing depth showed that consideration of mappability and GC content in the model improves sensitivity and specificity of peak identification even in the presence of a control sample \cite{mosaics}. \texttt{mosaics} package accommodates a two-sample analysis with mappability and GC content by specification of `\texttt{analysisType="TS"}' when calling the `\texttt{mosaicsFit}' method. <>= exampleFit <- mosaicsFit( exampleBinData, analysisType="TS", bgEst="automatic" ) @ Peak identification can be done exactly in the same way as in the case of the two-sample analysis without mappability and GC content. <>= OneSamplePeak <- mosaicsPeak( OneSampleFit, signalModel="2S", FDR=0.05, maxgap=200, minsize=50, thres=10 ) @ \section{One-Sample Analysis}\label{one_sample} When control sample is not available, `\texttt{mosaics}' package accommodates one-sample analysis of ChIP-seq data. Implementation of the MOSAiCS one-sample model is very similar to that of the two-sample analysis. Bin-level data for the one-sample analysis can be imported to the R environment with the command: <>= exampleBinData <- readBins( type=c("chip","M","GC","N"), fileName=c( system.file( file.path("extdata","chip_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","M_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","GC_chr21.txt"), package="mosaicsExample"), system.file( file.path("extdata","N_chr21.txt"), package="mosaicsExample") ) ) @ %Note that you don't need to provide `\texttt{"input"}' in `\texttt{type}' and %the file name of a control dataset in `\texttt{fileName}' here. In order to fit a MOSAiCS model for the one-sample analysis, you need to specify `\texttt{analysisType="OS"}' %instead of `\texttt{analysisType="TS"}' when calling the `\texttt{mosaicsFit}' method. <>= exampleFit <- mosaicsFit( exampleBinData, analysisType="OS", bgEst="automatic" ) @ Peak identification can be done exactly in the same way as in the case of the two-sample analysis. <>= exampleFit <- mosaicsPeak( exampleFit, signalModel="2S", FDR=0.05, maxgap=200, minsize=50, thres=10 ) @ %\section{Tuning of MOSAiCS Parameters}\label{tuning} %In the two-sample analysis, users can control three tuning parameters: `\texttt{s}', `\texttt{d}', and `\texttt{meanThres}'. `\texttt{s}' and `\texttt{d}' are parameters of the background distribution and control the functional form used for the control data. Please see \cite{mosaics} for further details on these two model parameters. `\texttt{meanThres}' controls the number of strata used at the robust linear regression modelling step of the background distribution fitting. `\texttt{mosaics}' package uses the following parameter setting as default: %<>= %exampleFit <- mosaicsFit( exampleBinData, analysisType="TS", %meanThres=1, s=2, d=0.25 ) %@ %Users might need to consider parameter tuning especially when the fitted background model is too similar to the actual data, resulting in too few peaks. If such cases are detected or predicted, `\texttt{mosaicsFit}' prints out warning or error messages. You may also be able to detect this case using the GOF plot. Using a higher `\texttt{s}' value and lower `\texttt{meanThres}' often solves the problem, e.g., `\texttt{s = 6}' and `\texttt{meanThres = 0}'. %In addition to `\texttt{analysisType}', `\texttt{mosaicsFit}' method provides parameters to tune the background distribution of the MOSAiCS model. We specified appropriate default values for these parameters based on computational experiments and analysis of diverse ChIP-seq datasets. Default values work well in general but some tuning might be required for some cases. You may need to consider parameter tuning if the fitted background model is too similar to the actual data in the GOF plot or you encounter some warning or error messages while running `\texttt{mosaicsFit}' method. Section \ref{tuning} provides basic guidelines on parameter tuning. If you encounter a fitting problem you need help with, feel free to contact us at our Google group, \url{http://groups.google.com/group/mosaics_user_group}. \section{Conclusion and Ongoing Work}\label{conclusion} R package \texttt{mosaics} provides effective tools to read and investigate ChIP-seq data, fit MOSAiCS model, and identify peaks. We are continuously working on improving \texttt{mosaics} package further, especially in supporting more diverse genomes, automating fitting procedures, developing more friendly and easy-to-use user interface, and providing more effective data investigation tools. Please post any questions or requests regarding `\texttt{mosaics}' package at \url{http://groups.google.com/group/mosaics_user_group}. Updates and changes of `\texttt{mosaics}' package will be announced at our Google group and the companion website (\url{http://www.stat.wisc.edu/~keles/Software/mosaics/}). \begin{thebibliography}{99} \bibitem{mosaics} Kuan, PF, D Chung, G Pan, JA Thomson, R Stewart, and S Kele\c{s} (2010), ``A Statistical Framework for the Analysis of ChIP-Seq Data'', \textit{Journal of the American Statistical Association}, 106, 891-903. \bibitem{stat1} Rozowsky, J, G Euskirchen, R Auerbach, D Zhang, T Gibson, R Bjornson, N Carriero, M Snyder, and M Gerstein (2009), ``PeakSeq enables systematic scoring of ChIP-Seq experiments relative to controls'', \textit{Nature Biotechnology}, 27, 66-75. \end{thebibliography} \end{document}