% \VignetteIndexEntry{Copy number variant detection in exome sequencing data}
% \VignetteDepends{methods, graphics, IRanges, GenomicRanges, Rsamtools, Biostrings}
% \VignetteKeyword{CNV}

\documentclass{article}
\usepackage{natbib}
\usepackage{Sweave}
\usepackage[a4paper,margin=1.25in]{geometry}
\usepackage{url}
\usepackage{hyperref}
\newcommand{\exomeCopy}{\textsf{exomeCopy} }
\SweaveOpts{keep.source=FALSE} 

\begin{document}

\setkeys{Gin}{width=0.6\textwidth}

<<echo=FALSE>>=
options(width = 70)
@ 

\title{Copy number variant detection in exome sequencing data using \exomeCopy}
\author{Michael Love\\
\texttt{michaelisaiahlove@gmail.com}}
\date{\today}

\maketitle

\begin{abstract}
  \exomeCopy is an R package implementing a hidden Markov model for predicting copy number variants (CNVs) from exome sequencing experiments without paired control experiments as in tumor/normal sequencing.  It models read counts in genomic ranges using negative binomial emission distributions depending on a hidden state of the copy number and on positional covariates such as GC-content and background read depth.  Normalization and segmentation are performed simultaneously, eliminating the need for preprocessing of the raw read counts.
\end{abstract}

\tableofcontents

\pagebreak

\section{Introduction}

The \exomeCopy package was designed to address the following situation:

\begin{itemize}
  \item Target enrichment, such as exome enrichment, leads to non-uniform read depth, which is often correlated across samples.
  \item CNVs overlapping enriched regions can be detected as increases or decreases in read counts relative to ``background'' read depth, generated by averaging over a control set.
  \item Individual samples can be more or less correlated with background read depth and have different dependencies on GC-content.
\end{itemize}

While exome sequencing is not designed for CNV genotyping, it can nevertheless be used for finding CNVs which overlap exons and are not common in the control set.  It can provide an independent data source to be used in combination with higher resolution array-based methods.  In this vignette we show how to import experiment data, generate background read depth, and recover CNVs using \exomeCopy.  The model implemented in this package is described in \citep{Love2011Modeling}.  On CRAN, the R package \textsf{ReadDepth} can be used for CNV detection in whole genome.  The methods in this package are not appropriate for cancer exome sequencing, where amplifications can be high and not necessarily integer values.  The \textsf{ExomeCNV} package on CRAN is designed for tumor/normal paired exome sequencing data.

\texttt{exomeCopy} is designed to run on read counts from consecutive genomic ranges on a single chromosome, as it tries to identify higher or lower read depth relative to a baseline.  We will use a wrapper function to loop the main function over multiple chromosomes and samples.  Then these results can be compiled across chromosomes and samples.

\section{Quick start}

Here is an abbreviated example of the code required for importing data, running the model (note: on autosomal regions) and compiling results.  For more details, please read the rest of the vignette and try the help for various functions, e.g. \texttt{?compileCopyCountSegments}.

<<eval=FALSE>>=
library(exomeCopy)
target.file <- "targets.bed"
bam.files <- c("/path/to/file1.bam", "/path/to/file2.bam", "/path/to/file3.bam")
sample.names <- c("sample1","sample2","sample3")
reference.file <- "/path/to/reference_genome.fa"
target.df <- read.delim(target.file, header = FALSE)
target <- GRanges(seqname = target.df[,1], IRanges(start = target.df[,2] + 1, end = target.df[,3]))
counts <- target
for (i in 1:length(bam.files)) {
  mcols(counts)[[sample.names[i]]] <- countBamInGRanges(bam.files[i], target)
}
counts$GC <- getGCcontent(target, reference.file)
counts$GC.sq <- counts$GC^2
counts$bg <- generateBackground(sample.names, counts, median)
counts$log.bg <- log(counts$bg + .1)
counts$width <- width(counts)
fit.list <- lapply(sample.names, function(sample.name) {
  lapply(seqlevels(target), function(seq.name) { 
    exomeCopy(counts[seqnames(counts) == seq.name], sample.name, X.names = c("log.bg", "GC", "GC.sq","width"), S = 0:4, d = 2)
  })
})
compiled.segments <- compileCopyCountSegments(fit.list)
@ 

\section{Importing experiment data}

The necessary genomic range information, sample read counts and positional covariates (background read depth and GC-content) should be stored in a \textit{GRanges} object.  The user must provide genomic ranges of targeted enrichment.  For exome sequencing, one can use exon annotations.  The \exomeCopy package provides the following convenience functions: \texttt{subdivideGRanges} for subdividing the target ranges into nearly-equal sized ranges, \texttt{countBamInGRanges} for counting reads from a BAM file in the target ranges, \texttt{getGCcontent} for calculating GC content in the target ranges, and \texttt{generateBackground} for calculating the median read depth over a set of background samples.

To minimize memory size, we first demonstrate with toy data how to construct a \textit{GRanges} object \texttt{toy.counts} from a BED file describing the targeted region, BAM files for the read mapping and a FASTA file for the reference sequence.  For demonstration of running the main functions, we will use a prepackaged \textit{GRanges} object \texttt{exomecounts} containing real exome sequencing read counts.  We will then simulate CNVs in a \textit{GRanges} object, \texttt{example.counts} and run \texttt{exomeCopy}.

\subsection{Subdividing targeted regions}

It is possible, but not necessary to subdivide large input ranges into multiple ranges of comparable width to the average input range.  It is not necessary, because \exomeCopy can use range width as a covariate for modeling read counts.  \texttt{subdivideGRanges} divides the targeted genomic ranges into a set of ranges of nearly equal width, which exactly cover the original ranges.  The ranges will be sorted and reduced if they are not already so.    \texttt{subdivideGRanges} requires an input \textit{GRanges} object and returns a \textit{GRanges} object.  

<<>>=
library(exomeCopy)
gr <- GRanges(seqname="seq1",IRanges(start=1,end=345))
subdivideGRanges(gr)
@ 

The default setting of \texttt{subdivideGRanges} is to divide an input range into ranges around $s$ = 100bp, which is slightly less than the average exon width.  Specifically, an input range of width $w$ will be divided evenly into $\max(1,\lfloor w/s \rfloor)$ regions.  We can visualize the effect of \texttt{subdivideGRanges} on ranges of increasing width. 

\begin{center}
<<fig=TRUE>>=
  plot(0,0,xlim=c(0,500),ylim=c(0,25),type="n",yaxt="n",ylab="",xlab="width of input GRanges object",main="Effect of subdivideGRanges")
  abline(v=1:5*100,col="grey")
  for (i in 1:24) {
    gr <- GRanges(seqname="chr1",IRanges(start=1,width=(i*20)))
    sbd.gr <- subdivideGRanges(gr)
    arrows(start(sbd.gr),rep(i,length(sbd.gr)),end(sbd.gr),rep(i,length(sbd.gr)),length=.04,angle=90,code=3)
  }
@
\end{center}

Here we demonstrate reading in a targeted region BED file, converting to a \textit{GRanges} object and the result from calling \texttt{subdivideGRanges}.  Note that if the targeted region is read in from a BED file, one should add 1 to the starting position for representation as a \textit{GRanges} object.  It is important that the input ranges are sorted and non-overlapping, so we call \texttt{reduce} on the target \textit{GRanges} object. 

<<>>=
target.file <- system.file("extdata","targets.bed",package="exomeCopy")
target.df <- read.delim(target.file,header=FALSE,col.names=c("seqname","start","end")) 
target <- GRanges(seqname=target.df$seqname,IRanges(start=target.df$start+1,end=target.df$end))
target
target <- reduce(target, min.gapwidth=0)
target.sub <- subdivideGRanges(target)
target.sub
@ 

\subsection{Counting reads in genomic ranges}

\texttt{countBamInGRanges} allows the user to count reads from a BAM read mapping file in genomic ranges covering the targeted region.  The function takes as input the BAM filename and a \textit{GRanges} object.  It returns a vector of counts, representing the number of sequenced read starts (leftmost position regardless of strand) with mapping quality above a minimum threshold (default of 1) for each genomic range.  It is also possible to count stranded read starts, all overlapping reads and only reads mapping to unique positions (see \texttt{?countBamInGRanges}).

Users should make sure the sequence names in the \textit{GRanges} object are the same as the sequence names in the BAM file (which can be listed using \texttt{scanBamHeader} in the \textsf{Rsamtools} package).  The BAM file requires a associated index file (see the man page for \texttt{indexBam} in the \textsf{Rsamtools} package).  We will count reads using the subdivided genomic ranges in \texttt{target.sub} and store the counts as a new value column, \texttt{sample1}.

<<>>=
bam.file <- system.file("extdata","mapping.bam",package="exomeCopy")
scanBamHeader(bam.file)[[1]]$targets
seqlevels(target.sub)
toy.counts <- target.sub
sample.df <- data.frame(samples="sample1",bam.files=bam.file,stringsAsFactors=FALSE)
for (i in 1:nrow(sample.df)) {
  mcols(toy.counts)[[sample.df$samples[i]]] <- countBamInGRanges(sample.df$bam.files[i],target.sub)
}
toy.counts
@

\subsection{Calculating GC-content}

\exomeCopy can model read counts from samples which are not perfectly correlated with background read depth using GC-content (ratio of G and C bases to total number of bases).  The GC-content of DNA fragments is known to be a factor in the efficiency of high-throughput sequencing.  To obtain the GC-content information, we only need the targeted ranges and a reference FASTA file.
<<>>=
reference.file <- system.file("extdata","reference.fa",package="exomeCopy")
toy.counts$GC <- getGCcontent(target.sub, reference.file)
toy.counts
@ 

\subsection{Exome sequencing data from 1000 Genomes Project}

For demonstrating the use of \exomeCopy, we have provided a sample dataset, \texttt{exomecounts}, of exome sequencing read counts.  The genomic ranges used in this dataset are generated from a small subset of the CCDS regions on chromosome 1 \citep{Pruitt2009Consensus}.  The regions are downloaded from the hg19 tables of the UCSC Genome Browser (\url{http://genome.ucsc.edu/cgi-bin/hgGateway}).  Alternatively, one could use the \textsf{GenomicFeatures} package to download the CCDS regions.  The original CCDS regions are subdivided using \texttt{subdivideGRanges} with default settings as in the example code above.  The CCDS regions are convenient to use as genomic ranges for CNV detection in exome sequencing data, as they are often in the center of the targeted region in exome enrichment protocols and tend to have less variable coverage than the flanking regions.

The read counts are taken from exome enriched, paired-end sequencing data of the 1000 Genomes Project for 16 samples of the PUR population \citep{2010Map}.  The BAM read mapping files and descriptions of the experiments are available at the 1000 Genomes Project website (\url{http://www.1000genomes.org/data}).   The ftp addresses used are listed in the file \texttt{1000Genomes\_files.txt} in the \texttt{extdata} directory.  The sample names are included as column names in the provided datatset.

<<>>=
data(exomecounts)
length(exomecounts)
head(exomecounts)
@

The genomic ranges in \texttt{exomecounts} have been filtered such that only ranges with nonzero read count over the 16 samples are retained.  The range of the 1000 genomic ranges is from 0.8 to 7.8 Mb on chromosome 1.  Plotting the counts for one sample in a region of 1 Mb, one can observe both the irregular spacing of the ranges as well as the non-uniformities in read counts per range.

\begin{center}
<<fig=TRUE>>=
plot(start(exomecounts),exomecounts$HG00551,xlim=c(0.8e6,1.8e6),xlab="genomic position",ylab="counts",main="HG00551 read counts in exonic ranges")
@ 
\end{center}

\subsection{Generating background read depth}

In order to run \exomeCopy, we first generate background read depth.  For demonstration of working with multiple chromosomes, we will copy the chr1 data as a new chromosome ``chr1a''.

<<>>=
chr1a <- exomecounts
seqlevels(chr1a) <- "chr1a"
example.counts <- c(exomecounts, chr1a)
@ 

The background is generated by a simple function which performs 3 simple steps:  1) given a vector of names of samples to be used as background, extract the read counts data frame from the \textit{GRanges} object, 2) divide each sample by its mean read count (column means), 3) calculate the median of these normalized read counts (row medians).  The function can also be used to calculate any function across the rows of normalized read counts.  For example, we also show how to calculate the variance of the normalized read counts, though we will not use covariate in this vignette.  

Note that if the control set is a mix of males and females, the allosome portion of chrX will be a mixture of copy counts 1 and 2.  In this case, one should created separate \textit{GRanges} objects for chrX and chrY and run males and females separately with the appropriate expected copy count.

<<>>=
exome.samples <- grep("HG.+",colnames(mcols(example.counts)),value=TRUE)
example.counts$bg <- generateBackground(exome.samples, example.counts, median)
example.counts$log.bg <- log(example.counts$bg + .1)
example.counts$bg.var <- generateBackground(exome.samples, example.counts, var)
@ 

We have already removed ranges have zero background coverage, but if we had not we could use the following lines.  If there is zero coverage in the control set, this is most likely a region which is difficult to enrich, and we should not try to infer the copy number here.

<<>>=
summary(example.counts$bg)
example.counts <- example.counts[example.counts$bg > 0,]
@ 

The relationship between read counts and GC-content over the ranges varies across protocols and samples.  It can be roughly approximated per sample using second-order polynomial terms of GC-content.  We store the square of GC-content as a new value column.  Other functions of GC-content could be used as well.  We also store the width of the ranges as a value column.

<<>>=
example.counts$GC.sq <- example.counts$GC^2
example.counts$width <- width(example.counts)
@

\section{Introduction of the model}
 
\exomeCopy models the sample read counts on one chromosome, $O$, as emitted observations of a hidden Markov model (HMM), where the hidden state is the copy number of the sample.  The emission distributions are modeled with negative binomial distributions, as the read counts from high-throughput sequencing are often overdispersed for the Poisson distribution.  

$$ O_t \sim  \textrm{NB}(\mu_{ti}, \phi) $$

$$ \mu_{ti} = \frac{S_i}{d} e^{( x_{t*} \beta )} $$

The mean parameter, $\mu_{ti}$, for genomic range $t$ and hidden state $i$ is a product of the possible copy number state $S_i$ over the expected copy number $d$ and an estimate of the positional effects.  The positional effect modeling comes from an exponentiated $(x_{t*} \beta)$, where $x_{t*}$ is the $t$-th row of $X$ and $\beta$ is a column vector of coefficients.  The estimated positional effect is a  combination of log background read depth, GC-content, range width, and any other useful covariates which are stored in the matrix $X$, with a row for each range and a column for each covariate.  We use the log of background read depth so that the counts and read depth come out on the same scale.

The coefficients $\beta$ are fit by the model, using the forward equations to assess the likelihood of the HMM over all hidden state paths.  In this way, the normalization and segmentation steps are combined into one step of maximizing the likelihood of the parameters given the data.  The Viterbi algorithm is then applied to provide the most likely path.

\begin{table}[ht]
\begin{center}
\caption{Summary of notation}
\begin{tabular}{ll}
\hline
$O_t$ & observed count of reads in the $t$-th genomic range \\
$\mu_{ti}$ & the mean parameter for $f$ at range $t$ in copy state $i$  \\
$\phi$ & the dispersion parameter for $f$ \\
$S_i$ & the copy number for state $i$ ($S_i \in \{ 0,1,2,\dots \}$) \\
$d$ & the expected background copy number (2 for diploid, 1 for haploid) \\
$X$ & the matrix of covariates for estimating $\mu$ \\
$Y$ & the matrix of covariates for estimating $\phi$  \\
$\beta$ & the fitted coefficients for estimating $\mu$ \\
$\gamma$ & the fitted coefficients for estimating $\phi$  \\
\hline
\end{tabular}
\end{center}
\end{table}

The base model uses a scalar estimate for the dispersion parameter $\phi$ (equivalent to 1/\texttt{size} in the function \texttt{dnbinom}).  An extension of this model also tries to fit the variance using positional information such as the log variance of the background read depth stored in a matrix $Y$.

$$ O_t \sim \textrm{NB}(\mu_{ti}, \phi_{t}) $$

$$ \log(\phi_{t}) = y_{t*} \gamma $$
  
\section{Simulating CNVs}

Next we simulate CNVs in 8 samples on both chromosomes representing regions with a copy number of 0,1,3 and 4, relative to a background copy number of 2.  This is accomplished by selecting a fraction of the reads contained in the CNV bounds and removing or doubling them.  

<<>>=
simulateCNV <- function(x,indices,multiply,prob) {
  x[indices] <- x[indices] + multiply * rbinom(length(indices),prob=prob,size=x[indices])
  return(x)
}
set.seed(2)
cnv.probs <- rep(c(.99,.5,.5,.95),each=2)
cnv.mult <- rep(c(-1,1),each=4)
cnv.starts <- rep(c(1,301,601,901),each=2)
for (i in 1:8) {
  mcols(example.counts)[[exome.samples[i]]] <- simulateCNV(mcols(example.counts)[[exome.samples[i]]],cnv.starts[i]:(cnv.starts[i] + 99),multiply=cnv.mult[i],prob=cnv.probs[i])
  mcols(example.counts)[[exome.samples[i]]] <- simulateCNV(mcols(example.counts)[[exome.samples[i]]],1000 + cnv.starts[i]:(cnv.starts[i] + 99),multiply=cnv.mult[i],prob=cnv.probs[i])
}
@ 

\section{Running \exomeCopy}

\subsection{Running \exomeCopy on one sample, one chromosome}

We can now run \exomeCopy using the read counts for one of the simulated CNV samples.  Later we will show how to write a simple wrapper function to loop the \texttt{exomeCopy} function over multiple chromosomes and samples.  We specify the possible copy number values with \texttt{S} from 0 to 6 and the expected copy number state with \texttt{d} of 2.

<<>>=
  fit <- exomeCopy(example.counts[seqnames(example.counts) == "chr1"],sample.name=exome.samples[3],X.names=c("log.bg","GC","GC.sq","width"),S=0:6,d=2)
  show(fit)
@

After fitting, we call the function \texttt{copyCountSegments} on the \textit{ExomeCopy} object, which provides the segmentation with the predicted copy number, the log odds of read counts being emitted from predicted copy count over normal copy count, the number of input genomic ranges contained within each segment, the number of targeted basepairs contained in the segments, and the name of the sample to help compile the segments across samples.  The columns \texttt{log.odds} and \texttt{nranges} can be useful for filtering, see \texttt{?copyCountSegments} for details.

<<>>=
  copyCountSegments(fit)
@

Calling \texttt{plot} on the \textit{ExomeCopy} object draws segments of constant predicted copy number as colored horizontal lines and normalized read counts as points.  If no colors are provided, the function will use red for copy counts less than \texttt{d} and blue for copy counts higher than \texttt{d}.

\begin{center}
<<fig=TRUE>>=
  cnv.cols <- c("red","orange","black","deepskyblue","blue","blue2","blue4")
  plot(fit,col=cnv.cols)
@
\end{center}

\subsection{Looping \exomeCopy over multiple samples and chromosomes}

In order to apply \exomeCopy to a full dataset of multiple chromosomes and samples, we write a wrapper function \texttt{runExomeCopy}.  We will only model duplications up to 4 copies.  The wrapper construction allows for distribution of fitting each samples across workstations.  Keep in mind that the non-PAR region of chrX, in male samples should be fit separately, with \texttt{d = 1}.

<<>>=
runExomeCopy <- function(sample.name,seqs) {
  lapply(seqs,function(seq.name) exomeCopy(example.counts[seqnames(example.counts) == seq.name],sample.name,X.names=c("log.bg","GC","GC.sq","width"),S=0:4,d=2))
}
@ 

We can now run \exomeCopy using either \texttt{lapply} or using a function like \texttt{clusterApplyLB} from the \textsf{snow} package.  We apply \texttt{runExomeCopy} for 8 samples over 2 chromosomes and the result is stored as a named list of named lists.

<<>>=
seqs <- c("chr1","chr1a")
names(seqs) <- seqs
samples <- exome.samples[1:8]
names(samples) <- samples
t0 <- as.numeric(proc.time()[3])
fit.list <- lapply(samples,runExomeCopy,seqs)
t1 <- as.numeric(proc.time()[3])
time.elapsed <- as.numeric(t1 - t0)
paste(round(time.elapsed),"seconds for",length(samples),"samples,",round(sum(width(example.counts))/1e3),"kb of target")
paste("~",round(time.elapsed / 60 / (8 * sum(width(example.counts))) * 32e6,1)," minutes for 1 sample, 32 Mb of target",sep="")
@ 

\subsection{Compiling results, filtering and plotting CNVs}

Using \texttt{compileCopyCountSegments} on the list of lists, we can compile the segments across samples into a \textit{GRanges} object.  We remove the segments with copy count 2, leaving the predicted CNVs for autosomal ranges.  

<<>>=
compiled.segments <- compileCopyCountSegments(fit.list)
CNV.segments <- compiled.segments[compiled.segments$copy.count != 2]
@ 

Some non-CNV ranges with too high or low read counts might be predicted to be CNV, despite the use of positional covariates and overdispersion in the emission distributions of the HMM.  Often the majority of these cases can be removed by requiring a minimum number of ranges (for example requiring the \texttt{nranges} column returned by \texttt{copyCountSegments} to be greater than 5).  The filtered CNV segments can be exported to tab-delimited files using \texttt{as.data.frame} and then \texttt{write.table}.

\texttt{copyCountSegments} also provides a column of log odds for each segment.  This is the log ratio of the probability density function for the fitted predicted distribution over the fitted normal state distribution, then summed over all ranges in a segment (details of these are given in the help for \texttt{copyCountSegments}.)  By definition these will be zero for the segments predicted normal and positive for the CNV segments.

\begin{small}
<<>>=
CNV.segments[1:6]
table(CNV.segments$nranges)
CNV.segments <- CNV.segments[CNV.segments$nranges > 5]
@ 
\end{small}

To find overlaps within our calls across patients, we can use \texttt{findOverlaps} to create a matrix of overlaps between our predicted CNVs in different samples.

<<>>=
CNV.overlaps.matrix <- as.matrix(findOverlaps(CNV.segments,drop.self=TRUE))
head(CNV.overlaps.matrix)
@ 

\texttt{plotCompiledCNV} will plot the CNVs for a given sequence across samples if given an object returned by compileCopyCountSegments and the sequence name.

\begin{center}
<<fig=TRUE>>=
par(mfrow=c(2,1),mar=c(4,3,2,1))
cnv.cols <- c("red","orange","black","deepskyblue","blue")
plotCompiledCNV(CNV.segments=CNV.segments, seq.name="chr1", col=cnv.cols)
plotCompiledCNV(CNV.segments=CNV.segments, seq.name="chr1a", col=cnv.cols)
@ 
\end{center}

\subsection{Inspecting model parameters}

Finally, we can inspect a number of fitted parameters in the \textit{ExomeCopy} object, such as the $\beta$ vector which is adjusted in optimizing the likelihood of the HMM.  The $\beta$ vector is initialized using a linear regression of counts on the covariates, and this and other initial parameter values can be accessed in the \texttt{init.par} slot of the \textit{ExomeCopy} object.  These coefficients are adjusted while optimizing the likelihood of the HMM.  The final $\beta$ vector and other final parameter values are accessible in the \texttt{final.par} slot.

<<>>=
fit.list[[1]][[1]]@init.par$beta.hat
fit.list[[1]][[1]]@final.par$beta
@ 

\section{Session info}

<<>>=
sessionInfo()
@

\bibliographystyle{plainnat}
\bibliography{exomeCopy}

\end{document}