%\VignetteIndexEntry{VanillaICE Vignette}
%\VignetteKeywords{copy number, genotype, SNP}
%\VignettePackage{VanillaICE}
\documentclass{article}
\usepackage{amsmath}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{color}
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\R}{\textsf{R}}
\newcommand{\hmmoptions}{\Robject{HmmOptions}}
\newcommand{\hmmparam}{\Robject{HmmParameter}}
\newcommand{\cne}{\widehat{\text{CN}}}
\newcommand{\gte}{\widehat{\text{GT}}}
\newcommand{\gtehom}{\widehat{\text{HOM}}}
\newcommand{\gtehet}{\widehat{\text{HET}}}
\newcommand{\pgte}{\text{S}_{\widehat{\text{\tiny GT}}}}
\newcommand{\pcne}{\text{S}_{\widehat{\text{\tiny CN}}}}
\newcommand{\pgtehom}{\text{S}_{\widehat{\text{\tiny HOM}}}}
\newcommand{\pgtehet}{\text{S}_{\widehat{\text{\tiny HET}}}}
\newcommand{\thom}{\text{HOM}}
\newcommand{\thet}{\text{HET}}
\newcommand{\bDelta}{\mbox{\boldmath $\Delta$}}
\newcommand{\real}{\mbox{$\mathbb R$}} % real numbers
\newcommand{\bnu}{\mbox{\boldmath $\nu$}}
\newcommand{\ice}{\Rpackage{VanillaICE}}
\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\begin{document}

\title{\ice{}: Hidden Markov Models for the Assessment of Chromosomal
  Alterations using High-throughput SNP Arrays}
\author{Robert Scharpf}
\maketitle

<<>>=
options(width=80)
@

\begin{abstract}
  Chromosomal DNA is characterized by variation between individuals at
  the level of entire chromosomes (e.g. aneuploidy in which the
  chromosome copy number is altered), segmental changes (including
  insertions, deletions, inversions, and translocations), and changes
  to small genomic regions (including single nucleotide
  polymorphisms). A variety of alterations that occur in chromosomal
  DNA, many of which can be detected using high density single
  nucleotide polymorphism (SNP) microarrays, are linked to normal
  variation as well as disease and therefore of particular interest.
  These include changes in copy number (deletions and duplications)
  and genotype (e.g. the occurrence of regions of homozygosity).
  Hidden Markov models (HMM) are particularly useful for detecting
  such abnormalities, modeling the spatial dependence between
  neighboring SNPs. Here, we extend previous approaches that utilize
  HMM frameworks for inference in high throughput SNP arrays by
  integrating copy number, genotype calls, and the corresponding
  measures of uncertainty when available. Using simulated and
  experimental data, we demonstrate how confidence scores control
  smoothing in a probabilistic framework.
  % The goal of this vignette
  % is to provide a simple interface for fitting HMMs and plotting
  % functions to help visualize the predicted states alongside the
  % experimental data.
\end{abstract}

\section{Overview}

This vignette describes how to fit a hidden Markov model to
locus-level estimates of genotype or copy number. This vignette does
not describe how to preprocess, genotype, or estimate copy number. We
assume that locus-level estimates of genotype and/or copy number have
been obtained by software such as the \R{} package \Rpackage{crlmm}
(preferred). Several HMM implementations are now available for the
joint analysis of copy number and genotype, including QuantiSNP
\citep{Colella2007} and PennCNV \citep{Wang2007a}.

\paragraph{Citing this software.}
%\bibitem{Scharpf2008}
Robert~B Scharpf, Giovanni Parmigiani, Jonathan Pevsner, and Ingo Ruczinski.
\newblock Hidden {M}arkov models for the assessment of chromosomal
alterations using high-throughput {SNP} arrays.
\newblock {\em Annals of Applied Statistics}, 2(2):687--713, 2008.

\section{Data organization}
\label{sec:data}

\subsection{Basic data elements}
\label{data:matrices}

The basic elements required for fitting the vanilla hidden Markov
model in this vignette are
\begin{itemize}
\item a matrix of copy number estimates
\item a matrix of genotype calls (represented as integers: 1=AA, 2=AB, 3=BB)
\item a mapping of platform identifiers to physical position and chromosome
\end{itemize}
When available, uncertainty estimates of copy number and genotypes
(CRLMM uncertainties) should also be represented as matrices. As
several platforms now have polymorphic and nonpolymorphic probes, the
dimensions of the genotype and copy number matrices do not necessarily
coincide. However, this release of \Rpackage{VanillaICE} requires that
the two matrices have the same dimensions; the genotypes for the
nonpolymorphic probes can be represented as NAs.

The assay data provided with this package contains the basic elements
for the HMM from an older Affymetrix platform (100k):

<<>>=
library(VanillaICE)
data(locusLevelData)
copynumber <- locusLevelData[["copynumber"]]/100
genotypes <- locusLevelData[["genotypes"]]
@
\noindent The copy number estimates provided in the example data were
saved as [copy number estimate]*100 and stored as an integer -- we
divided the estimates by 100 above to revert to the original scale.
Genotypes should be represented as an integer: 1=AA, 2=AB, 3=BB. NA or
the integer 4 can be used to represent missing values. Confidence
scores for the assay data are included in the
\Robject{locusLevelData}, but for now we ignore these.

Meta-data on the locus-level estimates is also required. In
particular, we require a mapping of the platform identifiers to the
chromosome and physical position. The preferred representation for the
chromosome is an integer. Currently, the only supported chromosomes
are 1-22, X, and Y.

<<>>=
chromosome <- locusLevelData[["annotation"]][, "chromosome"]
position <- locusLevelData[["annotation"]][, "position"]
names(position) <- names(chromosome) <- rownames(locusLevelData[["annotation"]])
@
Other chromosomes could potentially be supported, but the user will be
expected to obtain centromere start and stop sites for alternative
chromosomes.
For details on how chromosome annotation should be organized, see
<<>>=
require(SNPchip)
data(chromosomeAnnotation, package="SNPchip")
@

The following check verifies that the annotation and the assay data
elements refer to the same loci and samples:
<<>>=
all(c(identical(names(position), rownames(copynumber)),
      identical(names(position), rownames(genotypes)),
      identical(colnames(genotypes), colnames(copynumber))))
@

The assay data elements should be ordered by chromosome and physical position:
<<>>=
ordering <- order(chromosome, position)
chromosome <- chromosome[ordering]
position <- position[ordering]
genotypes <- genotypes[ordering, , drop=FALSE]
copynumber <- copynumber[ordering, , drop=FALSE]
@

<<>>=
##For testing that the arm variable forces the HMM to run independently on each chromosomal arm
data(locusLevelData, package="VanillaICE")
cnConf <- cnConfidence(chromosome1)
callsConf <- callsConfidence(chromosome1)
callsConf <- matrix(as.integer(-1000*log(1-callsConf)), nrow(callsConf), ncol(callsConf))
callsConf <- rbind(callsConf, callsConf[1:100, , drop=FALSE])
dimnames(callsConf) <- dimnames(copynumber)
locusLevelData[["crlmmConfidence"]] <- callsConf
annotation <- locusLevelData[["annotation"]]
copynumber <- locusLevelData[["copynumber"]]/100
genotypes <- locusLevelData[["genotypes"]]
i <- order(annotation[, "chromosome"], annotation[, "position"])
genotypes <- genotypes[i, , drop=FALSE]
copynumber <- copynumber[i, , drop=FALSE]
annotation <- annotation[i, , drop=FALSE]
copynumber <- copynumber[1:9165, , drop=FALSE]
genotypes <- genotypes[1:9165, , drop=FALSE]
annotation <- annotation[1:9165, , drop=FALSE]
copynumber <- rbind(copynumber, copynumber)
genotypes <- rbind(genotypes, genotypes)
annotation2 <- annotation
annotation2[, 1] <- 2
rownames(annotation2) <- paste("SNP_A-", 1:nrow(annotation2), sep="")
annotation <- rbind(annotation, annotation2)
rownames(genotypes) <- rownames(copynumber) <- rownames(annotation)
@

\subsection{Using \Rpackage{Biobase}-derived classes}
\label{data:classes}

A principal advantage of using classes instead of matrices is that the
assay data elements are bound to the meta-data on the samples and
loci. In this section, we show how the above assay data and meta-data
can be encapsulated in a single object. The classes required for this
operation are defined in the \Rpackage{oligoClasses} package available
from Bioconductor. These class definitions are all extensions of the
\Robject{eSet} class, so the methods for manipulating \Robject{eSet}
objects in the \R{} package \Rpackage{Biobase} are available to us. We
will instantiate these objects in two steps: (i) instantiate an object
that contains the locus-level meta-data and (ii) instantiate an object
that contains the assay-data and the meta-data.

For (i), the \Robject{AnnotatedDataFrame} class defined in the
\Rpackage{Biobase} package is used as a container for the meta-data on
the locus-level summaries:

<<>>=
require(oligoClasses)
locusAnnotation <- data.frame(list(chromosome=chromosome, position=position),
                              row.names=names(chromosome))
featuredata <- new("AnnotatedDataFrame", data=locusAnnotation,
                   varMetadata=data.frame(labelDescription=colnames(locusAnnotation)))
stopifnot(all(c("chromosome", "position") %in% varLabels(featuredata)))
@
\noindent The labels 'chromosome' and 'position' are controlled
vocabularies for the locus-level metadata that are later required when
fitting the HMM.
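As an optional sanity check (this uses only the standard
\Rpackage{Biobase} accessors, nothing specific to \ice{}), the
locus-level meta-data stored in \Robject{featuredata} can be inspected
directly:

<<>>=
## the controlled vocabularies and the first few rows of the locus-level meta-data
varLabels(featuredata)
head(pData(featuredata))
@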
For (ii), three options are possible and depend on the assay data available:
\begin{enumerate}
\item {\color{blue}{SnpCallSet}} (genotypes-only)
<<>>=
new("SnpCallSet",
    calls=genotypes,
    phenoData=annotatedDataFrameFrom(genotypes, byrow=FALSE),
    featureData=featuredata)
@
\item {\color{blue}{SnpCopyNumberSet}} (copy number-only)
<<>>=
new("SnpCopyNumberSet",
    copyNumber=copynumber,
    phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
    featureData=featuredata)
@
\item {\color{blue}{oligoSnpSet}} (genotypes and copy number)
<<>>=
locusset <- new("oligoSnpSet",
                copyNumber=copynumber,
                calls=genotypes,
                phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
                featureData=featuredata)
@
\end{enumerate}
Successfully instantiating an object of the class ensures that the
assay data elements and the meta-data are consistent, greatly
simplifying operations such as subsetting and reordering. For
instance, the assay data and meta-data can be ordered by chromosome
and physical position with one line of \R{} code:

<<>>=
locusset <- locusset[order(chromosome(locusset), position(locusset)), ]
@

\section{Fitting the Vanilla HMM}
\label{hmm:vanilla}

We use the Vanilla HMM to smooth the locus-level summaries as a
function of physical position when confidence estimates of the assay
data are unavailable. In Section \ref{fit:classes}, we use the
\Rpackage{Biobase}-derived class representations from the previous
section together with pre-defined wrappers for fitting a basic HMM. In
Section \ref{sec:matrices}, we reproduce the results from Section
\ref{fit:classes} but in a step-by-step fashion using only the
matrices and vectors that were created in Section \ref{data:matrices}
(no classes). The step-by-step code in Section \ref{sec:matrices} is a
useful guide for those wishing to alter the number of states in the
HMM or develop other adaptations.

\subsection{Using \Rpackage{Biobase}-derived classes}
\label{fit:classes}

The objective of developing classes is to facilitate the process of
fitting an HMM and to more effectively keep the assay- and meta-data
bound in one object. The following code uses classes to arrive at the
same HMM predictions as the step-by-step approach of Section
\ref{sec:matrices}. The starting point is the \R{} object
\Robject{locusset} created as indicated in Section \ref{data:classes}.
This object may be any one of the following classes:
\Robject{SnpCallSet}, \Robject{SnpCopyNumberSet}, or
\Robject{oligoSnpSet}. Note that we could repeat the steps of
parameterizing the HMM with transition and emission probabilities as
described in Section \ref{sec:matrices}, using the accessor methods
\Robject{chromosome}, \Robject{position}, \Robject{copyNumber}, and
\Robject{calls} to access the meta- or assay-data. Instead, we provide
a convenience wrapper for fitting the Vanilla HMM, \Robject{hmm}, that
computes the transition and emission probabilities. \emph{For
  \Robject{oligoSnpSet} objects, the hidden states are assumed to be
  hemizygous deletion, normal, ROH, and amplification (in this
  order).}
<<>>=
##featuredata <- new("AnnotatedDataFrame", data=locusAnnotation, varMetadata=data.frame(labelDescription=colnames(locusAnnotation)))
##locusset <- new("oligoSnpSet",
##                copyNumber=copynumber,
##                calls=genotypes,
##                phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
##                featureData=featuredata)
(brks <- hmm(object=locusset,
             states=c("hemDel", "normal", "ROH", "amplification"),
             mu=c(1, 2, 2, 3),
             probs=c(0.999, 0.7, 0.999, 0.7),
             takeLog=TRUE))
@
\noindent To obtain the matrix of predicted states for each locus, set
\Robject{returnSegments} to \Robject{FALSE}:

<<>>=
fit2 <- hmm(object=locusset,
            states=c("hemDel", "normal", "ROH", "amplification"),
            mu=c(1, 2, 2, 3),
            probs=c(0.999, 0.7, 0.9999, 0.7),
            takeLog=TRUE,
            returnSegments=FALSE)
@

An example of how one may visualize the segmentation:

<<>>=
show(plot(locusset[chromosome(locusset) == 1, ]))
fit2[fit2 == 3] <- 2 ##So LOH line will not plot at 3
fit2[fit2 == 4] <- 3
lines(position(locusset)[chromosome(locusset) == 1],
      fit2[chromosome(locusset) == 1, 1])
rohRegion <- as.integer(brks[brks$state == "ROH", ][c("start", "end")])
abline(v=rohRegion, col="royalblue", lty=2)
legend("topright", lty=c(1, 2), col=c("black", "royalblue"),
       legend=c("CN state", "ROH boundary"), bty="n")
@

\subsubsection{Extracting intermediate objects}

To extract intermediate objects that were created by the \Robject{hmm}
wrapper, one may pass an environment in which the intermediate objects
created prior to fitting the HMM are stored.

<<>>=
env <- new.env()
fit2 <- hmm(object=locusset,
            states=c("hemDel", "normal", "ROH", "amplification"),
            mu=c(1, 2, 2, 3),
            probs=c(0.999, 0.7, 0.9999, 0.7),
            takeLog=TRUE,
            returnSegments=FALSE,
            envir=env)
ls(env)
@

\subsection{Step by step}
\label{sec:matrices}

We now go through the previous example step by step without using
Biobase-derived classes or wrappers. While we reproduce the results of
the previous section, one may modify any of these steps (for instance,
to expand the number of hidden states). The basic elements of the HMM
are
\begin{itemize}
\item[1] hidden states
\item[2] initial state probabilities
\item[3] transition probabilities
\item[4] emission probabilities
\end{itemize}
Items (1) and (4) depend critically on the software that was used for
obtaining the locus-level estimates and the scientific goals of the
analysis.

\subsubsection{Hidden states}
\label{sec:hiddenStates}

In this section we discuss specification of the parameters for the
hidden states. We assume that, conditional on the underlying state,
the copy number estimates and genotypes are independent
\citep{Scharpf2008}.

\paragraph{Copy number.}

There are two important considerations when specifying the hidden copy
number states. First, the scale of the locus-level estimates depends
on the preprocessing software. For instance, the BeadStudio software
for the Illumina platform provides logR ratios, whereas
\Rpackage{crlmm} provides an absolute estimate of copy number.
Secondly, the data source is important. Integer copy number states are
not appropriate when mixtures of different cell populations are
present. Mixtures of cell populations can be diagnosed and the
admixture estimated by visual inspection. For instance, a genomic
region containing copy number estimates between 1 and 2 on the
absolute scale with heterozygous genotype calls interspersed suggests
a mixture of cell populations. Therefore, plot your data.
<<>>=
chr1 <- annotation[, "chromosome"] == 1
plot(annotation[chr1, "position"], copynumber[chr1, 1], pch=".", cex=2, col="grey60")
@

If genotypes are available, color-code by the diallelic genotype call.

<<>>=
plot(annotation[chr1, "position"], copynumber[chr1, 1], pch=".", cex=2, col="grey60")
het <- genotypes[, 1] == 2
points(annotation[chr1 & !het, "position"], copynumber[chr1 & !het, 1],
       pch=".", cex=2, col="royalblue")
points(annotation[chr1 & het, "position"], copynumber[chr1 & het, 1],
       pch=".", cex=2, col="red")
abline(h=1:3)
@

While the \Robject{locusLevelData} is based on HapMap samples,
alterations were simulated to have integer copy number states.
Therefore, in the code chunk below we specify an integer copy number
for the hidden states hemizygous deletion (hemDel: copy number 1),
normal (copy number 2 and typical heterozygosity), region of
homozygosity (ROH: copy number 2 and unusually low heterozygosity),
and amplification (copy number $\geq$ 3).

<<>>=
states <- c("hemDel", "normal", "ROH", "amplification")
copynumberStates <- c(1, 2, 2, 3)
@

\paragraph{Genotypes}

In the absence of confidence scores for the genotype calls, the
parameter needed for the hidden states for the genotypes is the
probability that the true genotype is homozygous, conditional on the
underlying state.

<<>>=
probs <- c(0.999, 0.7, 0.999, 0.7)
names(probs) <- states
@
\noindent One may expand the number of hidden states to include $>$ 3
copies and homozygous deletions, perform simple gain/loss analyses
using copy number only, or assess regions of homozygosity with
genotypes only.

\subsubsection{Initial state probabilities.}
\label{sec:initial}

The only requirement for specifying the initial state probabilities is
that the probabilities sum to 1. Here, we assume \emph{a priori} that
the states are equally likely:

<<>>=
initialP <- (rep(1, length(states)))/length(states)
@

\subsubsection{Transition probabilities.}
\label{sec:transition}

Our HMM uses locus-specific transition probabilities that are
calculated as a function of the physical distance between loci.
Specifically, the probability that the locus at position $t-1$ is
\emph{not} informative for the locus at position $t$ is calculated as
$1-e^{-d/C}$, where $C$ is a constant specified by the user and $d$ is
the physical distance between locus $t$ and locus $t-1$. The default
for $C$ is $1 \times 10^8$ and can be specified by the \Robject{TAUP}
argument to the function \Robject{transitionProbability} to achieve a
desired amount of sensitivity and specificity. Larger values of
\Robject{TAUP} decrease the probability of transitioning to other
states, and therefore provide a smoother fit. In particular,
\Robject{transitionProbability} (i) transforms the physical distance
between adjacent loci to an estimate of the genomic distance and (ii)
adds an 'arm' variable to the annotation matrix.

<<>>=
tau <- transitionProbability(chromosome=chromosome, position=position, TAUP=1e8)
table(tau[, "arm"])
str(tau)
@
\noindent The values for \Robject{arm} indicate that the HMM will be
fit independently to three separate segments of the supplied data,
corresponding to arms 1p, 1q, and 2p. By default this function uses
the chromosome annotation provided by the \R{} package
\Rpackage{SNPchip} for the centromere coordinates and will work only
for chromosomes 1-22, X, and Y.

%TODO: Discuss how one may scale the transition probability matrix.
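To get a feel for how \Robject{TAUP} controls smoothing, the quantity
$1-e^{-d/C}$ defined above can be evaluated directly for a few
inter-locus distances; the distances below are arbitrary values chosen
only for illustration and do not come from the example data.

<<>>=
## Probability that the locus at t-1 is NOT informative for the locus at t,
## as a function of the inter-locus distance d and the constant C (TAUP).
d <- c(1e3, 1e5, 1e7)   # hypothetical distances (basepairs) between adjacent loci
1 - exp(-d/1e8)         # default TAUP of 1e8
1 - exp(-d/1e10)        # a larger TAUP gives smaller probabilities and a smoother fit
@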
% \section{Joint analysis of copy number and genotype}
\subsubsection{Emission probabilities}

\paragraph{Copy number}

Skip to \emph{Genotypes} if copy number estimates are not available.
One attractive feature of HMMs for smoothing copy number estimates as
a function of the physical position is that uncertainty estimates can
be incorporated into the emission probabilities. While uncertainty
estimates for copy number and genotype calls can improve both the
sensitivity and specificity for detecting chromosomal alterations
\citep{Scharpf2008}, these estimates are not always available.
However, we can develop ad-hoc estimates of the uncertainty provided a
reasonable number of samples are available.

To obtain emission probabilities, we provide the matrix of copy number
estimates and the location parameter for the hidden states (mu). The
object \Robject{mu} should have length equal to the number of hidden
states. If the hidden states are locus-specific, one may specify
\Robject{mu} as a matrix with rows corresponding to loci and columns
corresponding to states.
%TODO: test SNP-specific

<<>>=
copynumberStates <- c(1, 2, 2, 3)
emission.cn <- copynumberEmission(copynumber=copynumber,
                                  states=states,
                                  mu=copynumberStates,
                                  takeLog=TRUE, verbose=TRUE)
##or
copynumberStates <- matrix(c(1, 2, 2, 3), nrow(copynumber), length(states), byrow=TRUE)
emission.cn2 <- copynumberEmission(copynumber=copynumber,
                                   states=states,
                                   mu=copynumberStates,
                                   takeLog=TRUE)
stopifnot(identical(emission.cn, emission.cn2))
dim(emission.cn)
@

The third dimension of the array returned by
\Robject{copynumberEmission} corresponds to the number of hidden
states. The copy number and the location parameter (mu) must be
supplied on the same scale. For instance, in the above code chunk the
copy number estimates and mu are on the scale of copy number. These
estimates/parameters are both log-transformed within the body of the
function by setting the argument \Robject{takeLog} to \Robject{TRUE}.
Again, the emission probabilities are calculated under the assumption
that the estimates, or a suitable transformation thereof, are
approximately Normal. Hence, in the simulated data we check that the
middle 90\% is approximately Gaussian.

<<>>=
par(pty="s", las=1)
qqnorm(log2(copynumber), cex=0.6)
qqline(log2(copynumber))
abline(v=c(-1.645, 1.645))
@

When true uncertainty estimates for the copy number are not available,
copy number outliers are more likely to result in extreme emission
probabilities that can influence the HMM segmentation. There are
several approaches to reducing the influence of outliers on the HMM
segmentation: (i) improve the uncertainty estimates (preferred), (ii)
increase \Robject{TAUP} for the transition probabilities, (iii)
threshold the emission probabilities, and (iv) threshold extreme
values of the copy number estimates. For example,

<<>>=
emission.cn[emission.cn[, , 2] < -10, , 2] <- -10
@

\Robject{copynumberEmission} estimates the scale parameter for the
Normal distribution from the supplied data using the median absolute
deviation (MAD). However, different standard deviations can be
supplied by the user with the argument \Robject{sds}. The supplied
\Robject{sds} must be of the same dimension as the copy number matrix.
The following discussion elaborates briefly on the default procedure
used to estimate the standard deviations. In the example dataset, we
have only one sample and no estimates of the copy number uncertainty.
Therefore, we obtain a robust estimate of the copy number standard
deviation across SNPs and use this as a rough estimate of the
uncertainty. As the log-transformed copy number estimates are more
nearly Gaussian, we calculate a robust estimate of the standard
deviation using the median absolute deviation (MAD) on the log scale:

<<>>=
cn.sds <- VanillaICE:::robustSds(copynumber, takeLog=TRUE)
dim(cn.sds)
@

When multiple samples are available (e.g., 10 or more), SNP-specific
estimates of the copy number uncertainty can be obtained by scaling an
estimate of the variability across samples by a sample-specific
estimate of noise. In the following code chunk, we simulate a matrix
of copy number for 200 samples and then compute a robust SNP-specific
estimate of the standard deviation.

<<>>=
CT <- matrix(sample(copynumber, 100*200, replace=TRUE), 100, 200)
sds <- VanillaICE:::robustSds(CT, takeLog=TRUE)
@

The \Robject{robustSds} function returns a SNP-specific matrix of
standard deviations provided that the copy number matrix has at least
3 samples. The {\it preferred} approach when the sample size is small
(say, less than 10) is to develop SNP-specific estimates of the
variance on a larger reference set, such as HapMap, using the same
software, and then scale these estimates by a measure of the sample
variance.

\paragraph{Genotypes.}

Proceed to \emph{Fitting the hidden Markov model} if genotype
estimates are not available. If genotypes were estimated using the
\Rpackage{crlmm} or \Rpackage{oligo} packages and confidence estimates
are available, see Section \ref{sec:ice}.

As copy number estimates are typically very noisy, genotype calls can
potentially provide increased power to identify small regions of
hemizygous deletion. In addition, the genotypes can be used to
identify unusually long regions of homozygosity (ROH). In our
experience, sequences of 70--100 homozygous genotypes are not uncommon
and likely represent normal regions of the genome with fairly uniform
haplotype structure. Emission probabilities for the Vanilla HMM are
estimated from a Bernoulli distribution with state-specific
probabilities of a homozygous genotype call, as specified in Section
\ref{sec:hiddenStates}.

<<>>=
emission.gt <- genotypeEmission(genotypes=genotypes, states=states,
                                probHomCall=probs)
@

If any of the genotype calls are missing and the missingness is not
independent of the underlying hidden state, one may specify the
probability of a missing genotype call for each hidden state
(\texttt{probMissing}). By default, the HMM will assume that missing
genotype calls are independent of the underlying hidden state. In
particular, this assumption may not be reasonable for homozygous
deletions.

Conditional on the hidden state, we assume that the copy number and
genotype are independent. Therefore, the emission probabilities for an
HMM that models the copy number and genotypes jointly are computed by
adding the emission probabilities (log-scale) for copy number and
genotype:

<<>>=
emission <- emission.gt + emission.cn
@

If assay data for only genotypes or only copy number are available,
the emission probabilities are simply the genotype or copy number
emission probabilities, respectively.

\subsubsection{Viterbi algorithm.}

The sequence of states that maximizes the likelihood is obtained by
the Viterbi algorithm. Note that the argument \Robject{arm} to this
function is a factor indicating the chromosomal arms -- the Viterbi
algorithm is computed independently for each chromosomal arm.
<<>>=
fit <- viterbi(initialStateProbs=log(initialP),
               emission=emission,
               tau=tau[, "transitionPr"],
               arm=tau[, "arm"])
table(fit)
breaks(x=fit, states=states,
       position=tau[, "position"],
       chromosome=tau[, "chromosome"],
       sampleNames=colnames(copynumber))
@

%\paragraph{Updating elements of the HMM.}
%
%To update the elements used, for instance, a different. For instance,
%
%<<>>=
%##source("~/madman/Rpacks/VanillaICE/R/functions.R")
%e1 <- new.env()
%hmm(object=locusset,
%    states=c("hemDel", "normal", "ROH", "amp"),
%    mu=c(1, 2, 2, 3),
%    probs=c(0.9999, 0.99, 0.9999, 0.99),
%    takeLog=TRUE, envir=e1)
%ls(e1)
%e1[["TAUP"]] <- 1e12
%##untrace(update, signature="environment", where=getNamespace("VanillaICE"))
%update(object=e1)
%@

\section{Integrating Confidence Estimates (ICE): Extending \Rpackage{CRLMM}}
\label{sec:ice}

In this section, we illustrate how one may fit an HMM that
incorporates confidence estimates of the SNP-level summaries for the
genotype calls. To reduce the amount of repeated code, we will use the
\Rpackage{Biobase}-derived classes to illustrate this approach. We
will instantiate an object of class \Robject{oligoSnpSet} as described
in Section \ref{data:classes}; however, we will also store the CRLMM
confidence scores (represented as integers).

<<>>=
##load("../../../tmp/VanillaICE/data/chromosome1.RData")
##cnConf <- cnConfidence(chromosome1)
copynumber <- locusLevelData[["copynumber"]]/100
crlmmConfidence <- locusLevelData[["crlmmConfidence"]]
genotypes <- locusLevelData[["genotypes"]]
fd <- locusLevelData[["annotation"]]
featuredata <- new("AnnotatedDataFrame", data=data.frame(fd),
                   varMetadata=data.frame(labelDescription=colnames(fd)))
(locusset2 <- new("oligoSnpSet",
                  copyNumber=copynumber,
                  calls=genotypes,
                  callsConfidence=crlmmConfidence,
                  phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
                  featureData=featuredata,
                  annotation="genomewidesnp6"))
locusset2 <- locusset2[order(chromosome(locusset2), position(locusset2)), ]
@

The \Robject{annotation} slot must be specified so that the
appropriate reference distribution will be loaded. Supported
Affymetrix platforms include genomewidesnp6, pd.mapping250k.nsp, and
pd.mapping250k.sty. When fitting the ICE HMM using the \Robject{hmm}
wrapper, we again assume that the hidden states are hemizygous
deletion, normal, copy-neutral region of homozygosity, and
amplification (in this order). Fitting the ICE HMM using other states
is possible, and one could follow the step-by-step approach outlined
in Section \ref{sec:matrices}.

<<>>=
fit3 <- hmm(object=locusset2,
            states=c("hemDel", "normal", "ROH", "amplification"),
            mu=c(1, 2, 2, 3),
            takeLog=TRUE,
            returnSegments=FALSE, ice=TRUE)
brks <- hmm(object=locusset2,
            states=c("hemDel", "normal", "ROH", "amplification"),
            probs=c(0.05, 0.99, 0.7, 0.999),
            mu=c(1, 2, 2, 3),
            takeLog=TRUE,
            ice=TRUE)
@
\noindent Here we plot one of the regions that has different
breakpoints in the ICE and Vanilla HMMs.
<<>>=
i <- which(position(locusset2) >= 45*1e6 & position(locusset2) <= 55*1e6 &
           chromosome(locusset2) == 1)
gp <- plot(locusset2[i, ])
gp$pch <- 21; gp$col <- "black"; gp$cex <- 0.9; gp$ylim <- c(0.9, 6)
gp$add.cytoband <- FALSE
show(gp)
fit3[fit3 == 3] <- 2 ##So LOH line will not plot at 3
fit3[fit3 == 4] <- 3
lines(position(locusset2)[i], fit3[i, 1])
rohRegion <- as.numeric(as.matrix(brks[brks$state == "ROH", ][c("start", "end")]))
abline(v=rohRegion, col="royalblue", lty=2)
legend("topright", lty=c(1, 2), col=c("black", "royalblue"),
       legend=c("CN state", "ROH boundary"), bty="o", bg="white")
@

%TODO: Vignette for generating chromosome 21 data.
%TODO: Reference distribution for pd.mapping50k.nsp
%TODO: Reference distribution for pd.mapping50k.sty
%TODO: merging pd.mapping50k.sty and pd.mapping50k.sty

%\section{Affymetrix 6.0 example}
%
%Placeholder for an example using Affy 6.0 data (INCOMPLETE).
%
%<<>>=
%require(SNPchip)
%data(crlmmResults, package="VanillaICE")
%crlmmResults[["crlmmVersion"]]
%@
%
%<<>>=
%genotypes <- crlmmResults[["genotypes"]]
%confidence <- crlmmResults[["confidence"]]
%logsds <- crlmmResults[["logsds"]]/100
%copynumber <- crlmmResults[["copynumber"]]/100
%position <- crlmmResults[["position"]]
%dns <- list(crlmmResults[["nms"]], crlmmResults[["sns"]])
%copynumber[copynumber < 0.05] <- 0.05
%copynumber[copynumber > 6] <- 6
%copynumber[is.na(copynumber)] <- 2
%dimnames(confidence) <- dimnames(logsds) <- dimnames(copynumber) <- dimnames(genotypes) <- dimnames(confidence) <- dns
%@
%
%<<>>=
%fD <- new("AnnotatedDataFrame",
%          data=data.frame(cbind(chromosome=21, position=position), row.names=rownames(copynumber)),
%          varMetadata=data.frame(labelDescription=c("chromosome", "position")))
%snpset <- new("oligoSnpSet",
%              copyNumber=log2(copynumber),
%              cnConfidence=1/logsds,
%              calls=genotypes,
%              callsConfidence=confidence, featureData=fD,
%              phenoData=annotatedDataFrameFrom(copynumber, byrow=FALSE),
%              annotation="genomewidesnp6")
%snpset <- snpset[order(chromosome(snpset), position(snpset)), ]
%fit4 <- hmm(object=snpset,
%            states=c("hemDel", "normal", "ROH", "amplification"),
%            probs=c(0.01, 0.99, 0.9, 0.999),
%            mu=c(0.05, 1, 1, 1.58),
%            takeLog=FALSE,
%            returnSegments=FALSE, ice=TRUE)
%env <- new.env()
%brks <- hmm(object=snpset,
%            probs=c(0.01, 0.99, 0.9, 0.999),
%            states=c("hemDel", "normal", "ROH", "amplification"),
%            mu=c(0.05, 1, 1, 1.58),
%            takeLog=FALSE,
%            ice=TRUE,
%            envir=env)
%@
%<<>>=
%par(las=1)
%CT <- 2^copyNumber(snpset)
%HET <- calls(snpset)[, 1] == 2
%plot(position(snpset), CT[, 1], log="y", pch=".", type="n", ylab="copy number", xaxt="n", ylim=c(0.5, 5))
%points(position(snpset)[HET], CT[HET, 1], col="royalblue", pch=".")
%points(position(snpset)[!HET], CT[!HET, 1], col="red", pch=".")
%fit4[fit4 == 3] <- 2 ##So LOH line will not plot at 3
%fit4[fit4 == 4] <- 3
%lines(position(snpset), fit4[, 1])
%rohRegion <- as.integer(as.matrix(brks[brks$state == "ROH", c("start", "end")]))
%abline(v=rohRegion, col="royalblue", lty=2)
%legend("topright", lty=c(1, 2), col=c("black", "royalblue"), legend=c("CN state", "ROH boundary"), bty="n")
%@

\section{Appendix}

\subsection{Confidence scores for genotype calls.}

We suggest using the \Rfunction{CRLMM} algorithm \cite{Carvalho2007}
for genotype calls. \Rfunction{CRLMM} (in the R package
\Rpackage{oligo}) provides confidence scores ($\pgte$) of the genotype
estimates ($\gte$).
From 269 HapMap samples assayed on the Affymetrix 50k Xba and Hind
chips, we have a gold standard of the true genotype defined by the
consensus of the HapMap centers. We use kernel-based density estimates
to obtain
{\scriptsize
\begin{eqnarray}
\label{ingo2}
f\left\{\ \pgtehom\ |\ \gtehom,\thom\ \right\},\
f\left\{\ \pgtehom\ |\ \gtehom,\thet\ \right\},\
f\left\{\ \pgtehet\ |\ \gtehet,\thom\ \right\},\ \mbox{~ and~}
f\left\{\ \pgtehet\ |\ \gtehet,\thet\ \right\}
\end{eqnarray}
}
\noindent separately for the Xba and Hind 50k chips. The first term in
(\ref{ingo2}), for example, denotes the density of the scores when the
genotype is correctly called homozygous ($\gtehom$) and the true
genotype is homozygous ($\thom$). See \cite{Scharpf2008} for a more
complete description of the methods.

%\paragraph{Confidence scores for copy number estimates}
%
%To illustrate how standard errors of the copy number estimate could be
%integrated in the HMM, the R object \Robject{chromosome1} contains
%standard errors simulated from a shifted Gamma: $\mbox{Gamma}(1, 2) +
%0.3$, where 1 is the shape parameter and 2 is the rate parameter. To
%ascertain the effect of qualitatively high confidence scores on the
%ICE HMM, we scaled a robust estimate of the copy number standard
%deviation by $\frac{1}{2}$. Similarly, to simulate less precise $\cne$
%we scaled $\epsilon$ by 2. For more detailed information about how
%the data in the \Robject{chromosome1} was generated, see the
%documentation for this object in the R package \ice.

<<eval=FALSE>>=
gp <- plot(snpset[chromosome(snpset) == 1, ])
lines(position(snpset)[chromosome(snpset)==1],
      viterbiResults[chromosome(snpset) == 1, ])
##show copy-neutral LOH by vertical lines
gp$abline.v <- TRUE ##plots vertical dashed lines
allParameters <- unlist(snpPar(gp))
gp$col.predict[3] <- "white"
gp$hmm.ycoords <- c(0.7,0.9)
show(gp)
@

\subsection{More examples}
%
\subsubsection{Copy number only: assessing gain/loss}

The method \Robject{hmm} has a different set of underlying hidden
states depending on whether copy number estimates, genotype calls, or
both are available. When only copy number estimates are available, the
hidden states (for autosomes) are hemizygous or homozygous deletion
(one or fewer copies), normal (two copies), and amplification (three
or more copies). The corresponding Biobase-derived class is
\Robject{SnpCopyNumberSet}.

\subsubsection{Genotypes only: assessing ROH}

When only genotype calls are available, the hidden states are loss and
retention (ret) of heterozygosity. We define \textit{loss} to be a
sequence of homozygous SNPs longer than what we would expect to
observe by chance. Note that many long stretches of homozygosity may
occur as a result of a population sharing a common underlying
haplotype structure; loss predictions from an HMM fit to an individual
do not necessarily reflect the 'loss' of an allele in that individual.
The corresponding Biobase-derived class is \Robject{SnpCallSet}.

\subsection{Expanding the number of hidden states}

This section is incomplete. For homozygous deletion, we specify a
small value (less than 1) but greater than zero to account for
background fluorescence.
<>= states <- c("homozygousDeletion", "hemizygousDeletion", "normal", "LOH", "3copyAmp", "4copyAmp") mu <- c(0.05, 1, 2, 2, 3, 4) probs <- c(0.99, 0.999, 0.99, 0.999, 0.99, 0.99) probMissing <- c(0.95, rep(0.5, 5)) @ \subsection{Classes for segmented data} TODO \subsection{Plotting methods for segmented data classes} TODO \section{Session Information} The version number of R and packages loaded for generating the vignette were: <>= toLatex(sessionInfo()) @ \bibliography{ice}{} \bibliographystyle{plain} \end{document}