%\VignetteIndexEntry{VanillaICE Vignette} %\VignetteKeywords{copy number, genotype, SNP} %\VignettePackage{VanillaICE} \documentclass{article} \usepackage{amsmath} \usepackage{graphicx} \usepackage[numbers]{natbib} \usepackage{color} \newcommand{\scscst}{\scriptscriptstyle} \newcommand{\scst}{\scriptstyle} \newcommand{\Rpackage}[1]{\textit{#1}} \newcommand{\Rfunction}[1]{\texttt{#1}} \newcommand{\Robject}[1]{\texttt{#1}} \newcommand{\Rclass}[1]{\texttt{#1}} \newcommand{\R}{\textsf{R}} \newcommand{\hmmoptions}{\Robject{HmmOptions}} \newcommand{\hmmparam}{\Robject{HmmParameter}} \newcommand{\crlmm}{\Rpackage{crlmm}} \newcommand{\oligo}{\Rpackage{oligo}} \newcommand{\cne}{\widehat{\text{CN}}} \newcommand{\gte}{\widehat{\text{GT}}} \newcommand{\gtehom}{\widehat{\text{HOM}}} \newcommand{\gtehet}{\widehat{\text{HET}}} \newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}} \newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}} \newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}} \newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}} \newcommand{\thom}{\text{HOM}} \newcommand{\thet}{\text{HET}} \newcommand{\bDelta}{\mbox{\boldmath $\Delta$}} \newcommand{\real}{\mbox{$\mathbb R$}} % real numbers \newcommand{\bnu}{\mbox{\boldmath $\nu$}} \newcommand{\ice}{\Rpackage{VanillaICE}} \textwidth=6.2in \textheight=8.5in \oddsidemargin=.1in \evensidemargin=.1in \headheight=-.3in \begin{document} \title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal Alterations using High-throughput SNP Arrays} \author{Robert Scharpf} \maketitle <>= options(width=70) @ \begin{abstract} This package provides an implementation of a hidden Markov Model for high throughput SNP arrays. Users of this package should already have available locus-level estimates of copy number and B allele frequencies or genotype calls. Copy number estimates can be relative or absolute. \end{abstract} \section{Overview} This vignette requires that you have \begin{itemize} \item an absolute estimate of the \emph{total} copy number organized such that rows correspond to loci and columns correspond to samples and / or \item a matrix of B allele frequencies or genotype calls (1=AA, 2 = AB, 3= BB): rows correspond to loci and columns correspond to samples \end{itemize} If Affymetrix CEL files or Illumina Idat files were processed with the \R{} package \Rpackage{crlmm}, see the vignette \texttt{crlmmDownstream} included with this package. \noindent Other HMM implementations for the joint analysis of copy number and genotype include QuantiSNP \citep{Colella2007} and PennCNV \citep{Wang2007a}. \paragraph{Data considerations.} The HMM implemented in this package is most relevant for heritable diseases for which integer copy numbers are expected. For somatic cell diseases such as cancer, we suggest circular binary segmentation, as implemented in the \R{} package DNAcopy \citep{Olshen2004}. \paragraph{Citing this software.} % \bibitem{Scharpf2008} Robert~B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski. \newblock Hidden {M}arkov models for the assessment of chromosomal alterations using high-throughput {SNP} arrays. \newblock {\em Annals of Applied Statistics}, 2(2):687--713, 2008. \section{Organizing the locus-level data} \label{sec:simpleUsage} This package includes simulated genotype and copy number data for 9165 SNPs. <>= library(oligoClasses) library(VanillaICE) library2(SNPchip) library2(Biobase) library2(IRanges) data(locusLevelData) @ \noindent The copy number estimates in the \Robject{locusLevelData} object were multiplied by 100 and saved as an integer. An object of class \Robject{oligoSnpSet} is instantiated from the simulated data in the next code chunk. Note that the assay data elements for the \Robject{oligoSnpSet} object should be integers. The \Rfunction{integerMatrix} scales the copy number matrix by a factor of 100 (by default) and returns a matrix of integers. When available, B allele frequencies should be scaled by a factor of 1000. For example, \verb+ integerMatrix(bAlleleFreq, scale=1000) \verb+ %get rid of this <>= cn <- log2(locusLevelData[["copynumber"]]/100) oligoSet <- new("oligoSnpSet", copyNumber=integerMatrix(cn), call=locusLevelData[["genotypes"]], callProbability=locusLevelData[["crlmmConfidence"]], annotation=locusLevelData[["platform"]]) oligoSet <- oligoSet[!is.na(chromosome(oligoSet)), ] oligoSet <- oligoSet[chromosome(oligoSet) < 3, ] @ %If confidence scores or inverse standard errors for the copy number %estimates are available, these can be supplied to the %\Robject{cnConfidence} slot in the \Robject{assayData}. For %illustration, in the following code chunk we transform the copy number %estimates to the log scale and calculate a robust estimate of the %standard deviation across autosomes. If uncertainty estimates are not %available for copy number, the HMM will calculate the median absolute %deviation (MAD). See the the function \Robject{robustSds}. % %<>= %sds <- sd(oligoSet) %@ %\noindent The inverse of the \Robject{sds} object can be assigned to the %\Robject{cnConfidence} slot. % %<>= %cnConfidence(oligoSet) <- 1/sds %@ \section{Fitting the HMM} When jointly modeling the copy number and genotypes/allele frequencies, we assume that the genotype and copy number estimates are independent conditional on the underlying hidden state. The method \Rfunction{hmm} is defined for several classes, including objects of class \Rclass{oligoSnpSet} and \Rclass{BeadStudioSet}. % The emission probabilities for the genotypes are calculated using % either (i) assumptions of the probability of observing a homozygous % genotype call given the underlying state or (ii) incorporating crlmm % confidence estimates of the genotype calls (\texttt{ICE} option). \noindent The viterbi algorithm is used to obtain the most likely sequence of hidden states given the observed data. The value returned is an object of class \Robject{RangedDataHMM} with genomic coordinates of the normal and altered regions. We also return the log-likelihood ratio (LLR) of the predicted sequence in an interval versus the likelihood of diploid copy number. For intervals inferred to have diploid copy number, the LLR is zero. <>= oligoSet <- oligoSet[chromosome(oligoSet) <= 2, ] fit.van <- hmm(oligoSet) @ Figure \ref{fig:chr1} plots the data for chromosome 1 and the predictions from the hidden markov model: <>= if(require(RColorBrewer)){ cols <- brewer.pal(6, "YlOrBr") } else cols <- rainbow(n=6) chr1 <- oligoSet[chromosome(oligoSet)==1,] fit.chr1 <- fit.van[fit.van$chrom == 1, ] ##fit.chr1 <- fit.van[fit.van$chrom==1, ] isHet <- snpCall(chr1)==2 par(las=1) plot(position(chr1), copyNumber(chr1)/100, pch=".", cex=2, col="royalblue", ylab="log2 copy number") points(position(chr1)[isHet], copyNumber(chr1)[isHet]/100, col="red", pch=".", cex=2) abline(h=log2(1:3), col="grey70") sts <- start(fit.chr1); ends <- end(fit.chr1) xx <- range(c(sts,ends)) y <- c(-1,-1,-0.9,-0.9) polygon(x=c(xx, rev(xx)), y=y, col="white") for(i in 1:nrow(fit.chr1)){ polygon(x=c(sts[i], ends[i], ends[i], sts[i]), y=y, col=cols[fit.chr1$state[i]], border=cols[fit.chr1$state[i]]) } legend("topleft", fill=cols, legend=c("hom-del", "hem-del", "N", "N-noHets", "sDup", "dDup"), bty="n") @ \begin{figure}[t] \includegraphics[width=0.9\textwidth]{VanillaICE-fig2} \caption{\label{fig:chr1} Plot of artificial data for one chromosome.} \end{figure} To find which markers are included in each genomic interval returned by the \Rfunction{hmm} method, one can use the \Rfunction{findOverlaps} method in the \Rpackage{oligoClasses}. For example, the second interval in the \Robject{RangedDataHMM} object \Robject{fit.van} contains 102 markers. <>= fit.van[2, ] @ To find the names of the 102 markers that are included in this genomic interval, one could do the following <>= mm <- as.matrix(findOverlaps(fit.van, oligoSet)) markersInRange <- featureNames(oligoSet)[mm[,1]==2] @ Multipanel displays can be useful for visualizing the low-level data for copy number alterations. We extend the \Rfunction{xyplot} method in the \R{} package lattice for two common use-cases: by-locus and by-sample. \paragraph{By locus.} To plot the genomic data for a set of ranges at a given locus, we provide an unevaluated code chunk below (our example contains only a single sample): <>= xyplot(cn ~ x | id, oligoSet, range=RangedDataObject, ylim=c(-0.5,4), panel=xypanel) @ \noindent Note that the default in the above command is to use a common x- and y-scale for each panel. To allow the x-axes to change for each panel, one could set the x-scales to 'free': <>= xyplot(cn ~ x | id, oligoSet, range=RangedDataObject, ylim=c(-0.5,4), scales=list(x="free"), panel=xypanel) @ \noindent The function \Rfunction{xypanel} provides default colors for annotating the plotting symbols by genotype and by whether the markers are polymorphic. The \Robject{RangedDataHMM} must be passed by the name \emph{range} to the \Rfunction{xyplot} method. \paragraph{By sample.} To plot the low-level data for multiple alterations occurring in a single sample, one can again pass a \Rclass{RangedDataHMM} object with name \emph{range} to the \Rfunction{xyplot}. For example, see Figure \ref{fig:xy}. The code for producing Figure \ref{fig:xy} is in the following code chunk. Note that the formula in the example below conditions on \emph{range} instead of \emph{id}. The conditioning variable for displaying multiple panels of a single sample must be called 'range'. We plot a 2 Mb window surrounding each of the alterations in the simulated data by specifying \texttt{frame=2e6}. <>= ranges.altered <- fit.van[!state(fit.van) %in% c(3,4) & coverage2(fit.van) > 5, ] xy.example <- xyplot(cn ~ x | range, oligoSet, range=ranges.altered, frame=2e6, scales=list(x="free"), ylim=c(-1,3), panel=xypanel, cex.pch=0.4, pch=21, border="grey", ylab=expression(log[2]("copy number"))) @ \noindent See \texttt{?xyplot} for additional details. <>= pdf("VanillaICE-xyplot.pdf", width=8, height=7) print(xy.example) dev.off() @ \begin{figure}[t] \includegraphics[width=\textwidth]{VanillaICE-xyplot} \caption{\label{fig:xy} The method \Rfunction{xyplot} is used to create a multi-panel display of alterations in a single sample. Each panel displays a single copy number alteration detected by the HMM that is boxed by a rectangle. The alteration is framed by specifying the number of basepairs to plot upstream and downstream of the alteration. Here, we used a frame of 2 Mb. Homozygous SNPs with diallelic geotypes `AA' and `BB' are shaded blue; SNPs with diallelic genotype call `AB' are shaded in red.} \end{figure} Note also that \texttt{scales} must be set to \emph{free} in the above call to \Rfunction{xyplot}. \subsection{ICE HMM} This will likely be phased out in favor of using B allele frequencies instead of genotype calls / call probabilities. To compute emission probabilities that incorporate the \Rpackage{crlmm} genotype confidence scores, we set \Robject{ICE} to \texttt{TRUE}. This option is limited too a few platforms and the Affy 100k platforms is not one of them. <>= VanillaICE:::icePlatforms() @ For illustration, we assign 'genomewidesnp6' to the annotation slot since this platform is supported. <>= ann <- annotation(oligoSet) annotation(oligoSet) <- "genomewidesnp6" fit.ice <- hmm(oligoSet, ICE=TRUE) fit.ice annotation(oligoSet) <- ann @ <>= fit.chr1 <- fit.ice[chromosome(fit.ice)==1, ] widths <- width(fit.chr1) fit.chr1 <- fit.chr1[order(widths,decreasing=TRUE),] par(las=1) plot(position(chr1), copyNumber(chr1)/100, pch=".", ylab="log2 copy number", xlab="physical position", cex=2, col="royalblue") points(position(chr1)[isHet], copyNumber(chr1)[isHet]/100, col="red", pch=".", cex=2) abline(h=log2(1:3), col="grey70") sts <- start(fit.chr1); ends <- end(fit.chr1) xx <- range(c(sts,ends)) y <- c(-1,-1,-0.9,-0.9) polygon(x=c(xx, rev(xx)), y=y, col="white") for(i in 1:nrow(fit.chr1)){ polygon(x=c(sts[i], ends[i], ends[i], sts[i]), y=y, col=cols[state(fit.chr1)[i]], border=cols[state(fit.chr1)[i]]) } legend("topleft", fill=cols, legend=c("hom-del","hem-del", "N", "N/no hets", "s-dup", "d-dup"), bty="n") @ \subsection{Other options} % % really no reason for this with genotyping arrays % %\paragraph{Copy number.} A HMM for copy number only (e.g., if %genotypes are ignored or are unavailable) can be fit as follows. % %<>= %cnSet <- new("CopyNumberSet", % copyNumber=log2(locusLevelData[["copynumber"]]/100), % annotation=locusLevelData[["platform"]]) %cnSet <- cnSet[chromosome(cnSet) <= 2 & !is.na(chromosome(cnSet)), ] %fit.cn <- hmm(cnSet, hmmOpts) %@ % \paragraph{Regions of homozygosity.} %\subsection{Region of homozygosity (ROH) HMM} A HMM for genotype-only data can be used to find long stretches of homozygosity. Note that hemizygous deletions are also identified as 'ROH' when copy number is ignored (as the biallelic genotypte call in a hemizygous deletions tends to be all homozygous calls). <>= snpSet <- as(oligoSet, "SnpSet") fit.gt <- hmm(snpSet, prGtHom=c(0.7, 0.99), normalIndex=1L, S=2L) @ \noindent A suggested visualization: <>= fit.chr1 <- fit.gt[chromosome(fit.gt)==1, ] widths <- width(fit.chr1) fit.chr1 <- fit.chr1[order(widths,decreasing=TRUE),] gt <- ifelse(snpCall(chr1) == 1 | snpCall(chr1) == 3, 1, 0) par(las=1) plot(position(chr1), jitter(gt, amount=0.05), pch=".", ylab="", xlab="physical position", ylim=c(-3, 1.2), yaxt="n") ##points(position(chr1)[isHet], copyNumber(chr1)[isHet,], pch=".", ylab="log2 copy number", xlab="physical position", cex=2, col="red") axis(side=2, at=c(0,1), labels=c("AB", "AA or BB"), cex.axis=0.7) sts <- start(fit.chr1); ends <- end(fit.chr1) xx <- range(c(sts,ends)) y <- c(-1,-1,-0.5,-0.5) polygon(x=c(xx, rev(xx)), y=y, col="white") for(i in 1:nrow(fit.chr1)){ polygon(x=c(sts[i], ends[i], ends[i], sts[i]), y=y, col=cols[fit.chr1$state[i]], border=cols[fit.chr1$state[i]]) } legend("bottomleft", fill=cols, legend=c("region of homozygosity", "Normal"), bty="n") @ %\section{Quality control} % %\subsection{Noisey data} % %\subsection{Outliers} % %%% could remove this section and discuss Uniform-Gaussian mixture %Copy number outliers can cause the HMM to become too jumpy. One %approach to reduce the influence of outliers is to some %\textit{light}-smoothing prior to fitting the HMM, as suggested in the %\R{} package \texttt{DNAcopy}. For instance, one could identify %outliers by some criteria and then average the outliers using the %estimates from neigboring probes. Here, we use the defaults in %\Robject{smooth.CNA}. % %<>= %if(require("DNAcopy")){ % ##create an outlier % copyNumber(cnSet)[50] <- 10 % copyNumber(cnSet)[45:55] % cnaObj <- CNA(genomdat=copyNumber(cnSet), % chrom=chromosome(cnSet), % maploc=position(cnSet), % data.type="logratio", % sampleid=sampleNames(cnSet)) % smoothed.cnaObj <- smooth.CNA(cnaObj) % copyNumber(cnSet) <- matrix(smoothed.cnaObj[, "NA06993"], nrow(cnSet), 1) % copyNumber(cnSet)[50] %} %@ % %\noindent One could also increase the value of \Robject{TAUP} in the %viterbi algorithm to encourage a fit with fewer jumps. Note that with %improved estimates of copy number uncertainty, many of these %\textit{post-hoc} approaches for addressing outliers would be less %critical. % %\subsection{Batch effects} \Rpackage{VanillaICE} can be used in %conjunction with the \Rpackage{crlmm} package to reduce batch %effects. See \citep{Scharpf2010} for details regarding the %\Rpackage{crlmm} package. % %\section{Alternatives} % %See the \texttt{crlmmDownstream} vignette located in %\texttt{inst/test} for an alternative approach that computes emission %probabilities directly on the bivariate normal log(A) versus log(B) %space \citep{Korn2008}. The estimation procedure for copy number in %\Rpackage{crlmm} is described elsewhere \citep{Scharpf2010}. \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{ice}{} \bibliographystyle{plain} \end{document}