% \VignetteIndexEntry{CNV detection in exome sequencing data} % \VignetteDepends{methods, graphics, IRanges, GenomicRanges, Rsamtools, Biostrings} % \VignetteKeyword{CNV} \documentclass{article} \usepackage{natbib} \usepackage{Sweave} \usepackage[a4paper]{geometry} \usepackage{url} \usepackage{hyperref} \newcommand{\exomeCopy}{\textsf{exomeCopy} } \SweaveOpts{keep.source=TRUE} \begin{document} \setkeys{Gin}{width=0.6\textwidth} \title{CNV detection in exome sequencing data using \exomeCopy} \author{Michael Love\\ \texttt{love@molgen.mpg.de}} \date{\today} \maketitle \begin{abstract} \exomeCopy is an R package implementing a hidden Markov model\footnote{The manuscript describing the model \citep{Love2011Modeling} is listed in the references of this vignette.} for predicting copy number variants (CNVs) from exome/targeted sequencing experiments without paired control experiments as in tumor/normal sequencing.\footnote{R packages which can be used for CNV detection in whole genome or tumor/normal paired exome sequencing data are \textsf{ReadDepth} and \textsf{ExomeCNV} respectively, available on CRAN.} It models read counts in genomic ranges using negative binomial emission distributions depending on a hidden state of the copy number and on positional covariates such as GC-content and background read depth. Normalization and segmentation are performed simultaneously, eliminating the need for preprocessing of the raw read counts. \end{abstract} \tableofcontents \section{Introduction} The \exomeCopy package was designed to address the following situation: \begin{itemize} \item Target enrichment, such as exome enrichment, leads to non-uniform read depth, which is often correlated across samples. \item CNVs overlapping enriched regions can be detected as increases or decreases in read counts relative to ``background'' read depth, generated by averaging over a control set. \item Individual samples can be more or less correlated with background read depth and have different dependencies on GC-content. \end{itemize} While exome sequencing is not designed for CNV genotyping, it can nevertheless be used for finding CNVs which overlap exons and are not common in the control set. It can provide an independent data source to be used in combination with higher resolution array-based methods. In this vignette we show how to prepare data, generate background read depth, simulate CNVs in read count data and recover CNVs using \exomeCopy. We build a wrapper function for calling the \texttt{exomeCopy} function over multiple chromosomes and samples such that the operations can be assigned across workstations. \section{Importing experiment data} The necessary genomic range information, read counts and positional covariates (background read depth and GC-content) should be stored in a \textit{RangedData} object. The user must provide genomic ranges of targeted enrichment. For exome sequencing, one can use exon annotations, which will be discussed in a later section. The \exomeCopy package provides two convenience functions for preparing the genomic ranges and sample read counts: \texttt{subdivideGRanges} and \texttt{countBamInGRanges}. In this vignette we will use a prepackaged \textit{RangedData} object containing real exome sequencing read counts. Due to memory constraints, we cannot construct this object from scratch, but below we demonstrate with a simple example how to construct a \textit{RangedData} object from a BED file describing the targeted region, BAM files for the read mapping and a FASTA file for the reference sequence. \subsection{Subdividing targeted regions} \texttt{subdivideGRanges} divides the targeted genomic ranges into a set of ranges of nearly equal width, which exactly cover the original ranges. While \exomeCopy can use range width as a covariate for modeling read counts, we find it useful to break apart the largest input ranges into multiple ranges of comparable width to the average input range. \texttt{subdivideGRanges} requires an input \textit{GRanges} object and returns a \textit{GRanges} object. <<>>= library(exomeCopy) gr <- GRanges(seqname="seq1",IRanges(start=1,end=345)) subdivideGRanges(gr) @ The default setting of \texttt{subdivideGRanges} is to divide an input range into ranges around $s$ = 100bp, which is slightly less than the average exon width. Specifically, an input range of width $w$ will be divided evenly into $\max(1,\lfloor w/s \rfloor)$ regions. We can visualize the effect of \texttt{subdivideGRanges} on ranges of increasing width. \begin{center} <>= plot(0,0,xlim=c(0,500),ylim=c(0,25),type="n",yaxt="n",ylab="", xlab="width of input GRanges object", main="Effect of subdivideGRanges") abline(v=1:5*100,col="grey") for (i in 1:24) { gr <- GRanges(seqname="chr1",IRanges(start=1,width=(i*20))) sbd.gr <- subdivideGRanges(gr) arrows(start(sbd.gr),rep(i,length(sbd.gr)),end(sbd.gr), rep(i,length(sbd.gr)),length=.04,angle=90,code=3) } @ \end{center} Here we demonstrate reading in a targeted region BED file, converting to a \textit{GRanges} object and the result from calling \texttt{subdivideGRanges}. Note that if the targeted region is read in from a BED file, one should add 1 to the starting position for representation as a \textit{GRanges} object. <<>>= target.file <- system.file("extdata","targets.bed",package="exomeCopy") target.df <- read.delim(target.file,header=FALSE, col.names=c("seqname","start","end")) target <- GRanges(seqname=target.df$seqname, IRanges(start=target.df$start+1,end=target.df$end)) target target.sub <- subdivideGRanges(target) target.sub @ \subsection{Counting reads in genomic ranges} \texttt{countBamInGRanges} allows the user to count reads from a BAM read mapping file in genomic ranges covering the targeted region. The function takes as input the BAM filename and a \textit{GRanges} object. It returns a vector of counts, representing the number of sequenced read starts (leftmost position regardless of strand) with mapping quality above a minimum threshold (default of 1) for each genomic range. Users should make sure the sequence names in the \textit{GRanges} object are the same as the sequence names in the BAM file (which can be listed using \texttt{scanBamHeader} in the \textsf{Rsamtools} package). The BAM file requires a associated index file (see the man page for \texttt{indexBam} in the \textsf{Rsamtools} package). We will count reads using the subdivided genomic ranges in \texttt{target.sub} and store the counts as a new value column, \texttt{sample1}. <<>>= bam.file <- system.file("extdata","mapping.bam",package="exomeCopy") scanBamHeader(bam.file)[[1]]$targets levels(seqnames(target.sub)) rdata <- RangedData(space=seqnames(target.sub),ranges=ranges(target.sub)) rdata[["sample1"]] <- countBamInGRanges(bam.file,target.sub) rdata @ \subsection{Calculating GC-content} \exomeCopy can model read counts from samples which are not perfectly correlated with background read depth using GC-content (ratio of G and C bases to total number of bases). The GC-content of DNA fragments is known to be a factor in the efficiency of high-throughput sequencing. Using \texttt{scanFa} in the \textsf{Rsamtools} package and a FASTA file of the reference genome, we can obtain a \textit{DNAStringSet} object for the DNA sequence of the genomic ranges. Then \texttt{letterFrequency} in the \textsf{Biostrings} package can be used to tally the GC-content. <<>>= reference.file <- system.file("extdata","reference.fa",package="exomeCopy") target.dnastringset <- scanFa(reference.file,target.sub) getGCcontent <- function(x) { GC.count <- letterFrequency(x,"GC") all.count <- letterFrequency(x,"ATGC") as.vector(ifelse(all.count==0,NA,GC.count/all.count)) } rdata[["GC"]] <- getGCcontent(target.dnastringset) rdata @ In the following sections, we will continue with a dataset constructed using real targeted regions and exome sequencing data. \section{Exome sequencing data from 1000 Genomes Project} For demonstrating the use of \exomeCopy, we have provided a sample dataset, \texttt{exomecounts}, of exome sequencing read counts. The genomic ranges used in this dataset are generated from a small subset of the CCDS regions on chromosome 1 \citep{Pruitt2009Consensus}. The regions are downloaded from the hg19 tables of the UCSC Genome Browser (\url{http://genome.ucsc.edu/cgi-bin/hgGateway}). Alternatively, one could use the \textsf{GenomicFeatures} package to download the CCDS regions. The original CCDS regions are subdivided using \texttt{subdivideGRanges} with default settings as in the example code above. The CCDS regions are convenient to use as genomic ranges for CNV detection in exome sequencing data, as they are often in the center of the targeted region in exome enrichment protocols and tend to have less variable coverage than the flanking regions. The read counts are taken from exome enriched, paired-end sequencing data of the 1000 Genomes Project for 16 samples of the PUR population \citep{2010Map}. The BAM read mapping files and descriptions of the experiments are available at the 1000 Genomes Project website (\url{http://www.1000genomes.org/data}). The ftp addresses used are listed in the file \texttt{1000Genomes\_files.txt} in the \texttt{extdata} directory. The sample names are included as column names in the provided datatset. <<>>= data(exomecounts) dim(exomecounts) exomecounts[1:5,1:3] @ The genomic ranges in \texttt{exomecounts} have been filtered such that only ranges with nonzero read count over the 16 samples are retained. The range of the 1000 genomic ranges is from 0.8 to 7.8 Mb on chromosome 1. Plotting the counts for one sample in a region of 1 Mb, one can observe both the irregular spacing of the ranges as well as the non-uniformities in read counts per range. \begin{center} <>= plot(start(exomecounts),exomecounts$HG00551,xlim=c(0.8e6,1.8e6), xlab="genomic position",ylab="counts", main="HG00551 read counts in exonic ranges") @ \end{center} \subsection{Generating background read depth} In order to run \exomeCopy, we first generate background read depth. We extract a read counts data frame from the \textit{RangedData} object and divide each sample by its mean read count (column mean). Our read depth background is the median of these normalized read counts over the 16 samples (row median). We could at this point also store the standard deviation of the normalized read counts for a measure of read depth deviation at each range, but we will not use this information in this vignette. <<>>= exome.samples <- grep("HG.+",colnames(exomecounts),value=TRUE) sample.columns <- colnames(exomecounts) %in% exome.samples C <- as.data.frame(unlist(values(exomecounts)[,sample.columns])) C.norm <- sweep(C,2,colMeans(C),"/") exomecounts[["bg"]] <- apply(C.norm,1,median) @ The relationship between read counts and GC-content over the ranges varies across protocols and samples. It can be roughly approximated per sample using second-order polynomial terms of GC-content. We store the square of GC-content as a new value column. Other functions of GC-content could be used as well. We also store the width of the ranges as a value column. <<>>= exomecounts[["GC.sq"]] <- exomecounts$GC^2 exomecounts[["width"]] <- width(exomecounts) @ \section{Brief introduction of the model} \exomeCopy models the sample read counts on one chromosome, $O$, as emitted observations of a hidden Markov model (HMM), where the hidden state is the copy number of the sample. The emission distributions $f$ are modeled with negative binomial distributions, as the read counts from high-throughput sequencing are often overdispersed for the Poisson distribution. $$ f \sim \textrm{NB}(O_t, \mu_{ti}, \phi) $$ $$ \mu_{ti} = \frac{S_i}{d} ( x_{t*} \beta ) $$ The mean parameter, $\mu_{ti}$, for genomic range $t$ and hidden state $i$ is a product of the possible copy number state $S_i$ over the expected copy number $d$ and an estimate of the positional effects on read depth $(x_{t*} \beta)$, where $x_{t*}$ is the $t$-th row of $X$ and $\beta$ is a column vector of coefficients. The estimated positional effect is a linear combination of background read depth, GC-content, range width, and any other useful covariates which are stored in the matrix $X$, with a row for each range and a column for each covariate. The coefficients $\beta$ are fit by the model, using the forward equations to assess the likelihood of the HMM over all hidden state paths. In this way, the normalization and segmentation steps are combined into one step of maximizing the likelihood of the parameters given the data. The Viterbi algorithm is then applied to provide the most likely path. \begin{table}[ht] \begin{center} \caption{Summary of notation} \begin{tabular}{ll} \hline $O_t$ & observed count of reads in the $t$-th genomic range \\ $f$ & the emission distribution for read counts \\ $\mu_{ti}$ & the mean parameter for $f$ at range $t$ in copy state $i$ \\ $\phi$ & the dispersion parameter for $f$ \\ $S_i$ & the copy number for state $i$ ($S_i \in \{ 0,1,2,\dots \}$) \\ $d$ & the expected background copy number (2 for diploid, 1 for haploid) \\ $X$ & the matrix of covariates for estimating $\mu$ \\ $Y$ & the matrix of covariates for estimating $\phi$ \\ $\beta$ & the fitted coefficients for estimating $\mu$ \\ $\gamma$ & the fitted coefficients for estimating $\phi$ \\ \hline \end{tabular} \end{center} \end{table} The base model uses a scalar estimate for the dispersion parameter $\phi$ (equivalent to 1/\texttt{size} in the function \texttt{dnbinom}). An extension of this model also tries to fit the variance using positional information such as the standard deviation of the background read depth stored in a matrix $Y$. $$ f \sim \textrm{NB}(O_t, \mu_{ti}, \phi_{t}) $$ $$ \phi_{t} = y_{t*} \gamma $$ Both $\mu_{ti}$ and $\phi_{t}$ must be positive, so negative estimates are replaced with a small positive number (\texttt{1e-8}). \section{Running \exomeCopy on simulated CNVs} Next we simulate CNVs in 4 samples representing regions with a copy number of 0,1,3 and 4, relative to a background copy number of 2. This is accomplished by selecting a fraction of the reads contained in the CNV bounds and removing or doubling them. The new counts with simulated CNVs are added as new value columns to the \textit{RangedData} object. We then plot the original counts and the counts within a simulated heterozygous duplication. <<>>= set.seed(2) cnv.type <- c("hom.del","het.del","het.dup","hom.dup") cnv.probs <- c(.99,.5,.5,.95) cnv.mult <- c(-1,-1,1,1) bounds <- IRanges(start=3e6,end=4e6) for (i in 1:4) { samplename <- exome.samples[i] contained <- unlist(ranges(exomecounts)) %in% bounds O <- exomecounts[[samplename]] O[contained] <- O[contained] + (cnv.mult[i] * rbinom(sum(contained),prob=cnv.probs[i],size=O[contained])) exomecounts[[paste(samplename,cnv.type[i],sep=".")]] <- O } @ \begin{center} <>= par(mfrow=c(2,1),mar=c(5,4,3,2)) plot(start(exomecounts),exomecounts[["HG00731"]], xlab="genomic position",ylab="counts",main="Original counts") abline(v=c(start(bounds),end(bounds))) plot(start(exomecounts),exomecounts[["HG00731.het.dup"]], xlab="genomic position",ylab="counts", main="Simulated heterozygous duplication") abline(v=c(start(bounds),end(bounds))) @ \end{center} \subsection{Running \exomeCopy and inspecting the segmentation} We can now run \exomeCopy using the read counts for one of the simulated CNV samples. Later we will show how to write a simple wrapper function to loop the \texttt{exomeCopy} function over multiple chromosomes and samples. We specify the possible copy number values with \texttt{S} and the expected copy number state with \texttt{d}. <<>>= fit <- exomeCopy(exomecounts["chr1"],sample.name="HG00731.het.dup", X.names=c("bg","GC","GC.sq","width"),S=0:6,d=2) show(fit) @ After fitting, we call the function \texttt{copyCountSegments} on the \textit{ExomeCopy} object, which provides the segmentation with the predicted copy number and the number of input genomic ranges contained within each segment. \exomeCopy correctly identifies the segment between 3-4 Mb as having copy number 3 against the background copy number 2. <<>>= copyCountSegments(fit) @ Calling \texttt{plot} on the \textit{ExomeCopy} object draws segments of constant predicted copy number as colored horizontal lines and normalized read counts as points. The vertical lines indicate the start and end of the simulated CNV of copy number 3. \begin{center} <>= cols <- c("red","orange","black","deepskyblue","blue","blue2","blue4") plot(fit,col=cols) abline(v=c(start(bounds),end(bounds))) @ \end{center} \subsection{Looping \exomeCopy over multiple chromosomes/samples} In order to apply \exomeCopy to a full dataset of multiple chromosomes and samples, we write a wrapper function \texttt{runExomeCopy}. This allows for distribution of jobs across workstations, for example. <<>>= runExomeCopy <- function(idx,rdata,seq.loop,sample.loop) { seq.name <- seq.loop[idx] sample.name <- sample.loop[idx] exomeCopy(rdata[seq.name],sample.name, X.names=c("bg","GC","GC.sq","width"),S=0:6,d=2) } @ We can now run \exomeCopy using either \texttt{lapply} or using a function like \texttt{clusterApplyLB} from the \textsf{snow} package. We define an order of chromosome-sample pairs by constructing variables \texttt{seq.loop} and \texttt{sample.loop} with the \texttt{rep} function. By progressing through these two vectors, we predict CNVs in all chromosomes for all samples. We apply \texttt{runExomeCopy} over the 4 samples with simulated CNVs and across 1 chromosome, though this code could be used across multiple chromosomes as well. <<>>= seqs <- c("chr1") samples <- paste(exome.samples[1:4],cnv.type,sep=".") nseqs <- length(seqs) nsamples <- length(samples) seq.loop <- rep(seqs,times=nsamples) sample.loop <- rep(samples,each=nseqs) fit.list <- lapply(seq_len(nseqs*nsamples), runExomeCopy,exomecounts,seq.loop,sample.loop) @ We can visualize the segments of constant predicted copy number for all simulated CNV samples. The vertical lines indicate the start and end of the simulated CNVs. \begin{center} <>= par(mfrow=c(4,1),mar=c(3,3,1,1)) for (i in 1:4) { plot(fit.list[[i]],main="",xlab="",ylab="",show.legend=FALSE,col=cols) abline(v=c(start(bounds),end(bounds))) } @ \end{center} \subsection{Inspecting model parameters} Finally, we can inspect a number of fitted parameters in the \textit{ExomeCopy} object, such as the $\beta$ vector which is adjusted in optimizing the likelihood of the HMM. The $\beta$ vector is initialized using a linear regression of counts on the covariates, and this and other initial parameter values can be accessed in the \texttt{init.par} slot of the \textit{ExomeCopy} object. These coefficients are adjusted while optimizing the likelihood of the HMM. The final $\beta$ vector and other final parameter values are accessible in the \texttt{final.par} slot. <<>>= fit.list[[1]]@init.par$beta.hat fit.list[[1]]@final.par$beta @ If we plot the initial and final $\beta$ vectors from the 4 samples with simulated CNVs, we can observe that the $\beta$ vector varies across patients and that the coefficients which maximize the full HMM are not the same as the initial estimates from regression. \begin{center} <>= par(mfrow=c(2,2),mar=c(3,5,1,1)) for (i in 1:4) { barplot(rbind(fit.list[[i]]@final.par$beta,fit.list[[i]]@init.par$beta.hat), horiz=TRUE,las=1,beside=TRUE,col=c("white","darkgrey"), xlim=c(-150,350)) legend("topright",legend=c("initial","final"),fill=c("darkgrey","white")) } @ \end{center} \section{Session info} <<>>= sessionInfo() @ \bibliographystyle{plainnat} \bibliography{exomeCopy} \end{document}