% \VignetteIndexEntry{TitanCNA} % \VignetteDepends{TitanCNA, foreach} % \VignetteKeywords{Subclonal copy number CNA LOH clonal genome sequencing tumours cancer} % \VignettePackage{HMMcopy} \documentclass[letterpaper]{article} \usepackage{natbib} \usepackage{fullpage} \usepackage{amsmath, amsthm, latexsym} \usepackage{url} \usepackage[utf8]{inputenc} \title{TITAN: Subclonal copy number and LOH prediction from whole genome sequencing of tumours} \author{Gavin Ha} \date{\today} \begin{document} \setkeys{Gin}{width=\textwidth} \maketitle \tableofcontents \section{Introduction} TITAN is a probabilistic framework for predicting regions of copy number alterations (LOH) and loss of heterozygosity (LOH) events in tumour whole genome sequencing (WGS) data. The model simultaneously estimates the cellular prevalence, proportion of tumour sample containing the event, and clonal cluster memberships. The statistical framework was designed based on the principles that the observed sequencing signal is an aggregated measure of multiple heterogeneous cellular populations including normal and tumour subpopulations and sets of genetic aberrations are observed at similar cellular prevalence if these events co-occurred in the same clone due to punctuated clonal expansions \cite{Greaves2012}. The TITAN software is part of an accompanying manuscript where mathematical details are presented \cite{Ha2013}. The observed measurements are modeled as generated from a composite of three cell populations \cite{Yau2010}, consisting normal cells at global proportion $n$; tumour cells with normal genotype at proportion $(1-n)*s_{z}$; and tumour cells with aberrant genotype at proportion $(1-n)*(1-s_{z})$. $s_{z}$ is the proportion of tumour cells that is diploid heterozygous (and therefore normal) at the locus. Thus, $(1-s_{z})$ is the proportion of the tumour population containing the event, or we define as the cellular prevalence. We assume multiple somatic events share similar cellular prevalence and thus can be assigned to one of a finite number of clonal clusters, $z\in Z$. The input data to TITAN are tumour read depth and allele counts. The read depth is corrected for GC content and mappability biases using the Bioconductor R package HMMcopy. Then, the log ratio between tumour and normal corrected depths is computed. The log ratio is model using a mixture of Gaussians were the mean parameter is a funtion of the cellular prevalence, normal proportion, and tumour ploidy; the variance is also unknown and estimated. The allele counts are transformed to allelic ratios, which is defined as the reference read count divided by depth or proportion of reference reads at a locus. Allelic ratios are modeled using a binomial distribution \cite{Ha2012} with the success parameter being a function of cellular prevalence and normal contamination. TITAN is implemented as a two-factor hidden Markov model (HMM) where the hidden genotypes and the hidden clonal cluster memberships are the two Markov chains. The state space is expanded as the joint genotype and clonal cluster states. The output of TITAN is a list of CNA and LOH events, with accompanying parameter estimates of normal proportion, the cellular prevalence and clonal population cluster membership for each event. TITAN estimates the cellular prevalence (proportion of tumour cells with the tumour genotype) based on a fixed number of clonal clusters. Users are required to run TITAN on a tumour sample for a range of clonal clusters in parallel. We advise users to use a range of 1 to 5. When results for the 5 jobs are generated, one for each fixed number of clusters, each parameter output file will contain an S\_Dbw Validity Index score. The run for a tumour sample that has the lowest S\_Dbw score is the optimal result and the corresponding number of clonal clusters is the optimal one. This vignette will detail an example run of TITAN for only 2 clonal clusters of chromosome 2 of a triple negative breast cancer sample \cite{Ha2012,Ha2013}. Once again, for real data, users should repeat the \textit{TITAN} for additional clonal clusters and then select the optimal run using the minimum S\_Dbw Validity index. \section{Analysis Workflow Overview} \begin{enumerate} \item The data points of interest are the germline heterozygous SNP loci identified in the matched normal WGS sample. This is a pre-processing step that is completed by the user, and is outside of the scope of this R package. \item At these loci, the read depth, reference read count and non-reference read count are extracted from the tumour WGS sample. This step is completed outside of the R package \item The read depth are normalized for GC content and mappability biases using the \textit{HMMcopy} R package. \textit{TITAN} uses a wrapper for the function in \textit{HMMcopy} to accomplish this. \item \textit{TITAN} requires as input, the read counts and normalized read depth to estimate model parameters and infer subclonal CNA/LOH. \item For real data, repeat \textit{TITAN} analysis for a range of clonal clusters. \end{enumerate} \section{Model training and parameter estimation} TITAN uses the Expectiation-Maximization algorithm to train the model and estimate parameters for cellular prevalence, normal proportion and tumour ploidy. In the E-step, the forwards-backwards algorithm is employed, making a call to a C implementation for faster computation. In the M-step, parameters are estimated using maximum a posteriori (MAP). The various parameters TITAN uses the Viterbi algorithm to find the optimal state sequence path of genotype and clonal cluster membership for each data point. \section{Extracting tumour and normal read depth} TITAN requires the read depth from the tumour and normal WGS samples. This is done using a tool, outside of R, distributed as part of the \textit{HMMcopy Suite} available at \url{http://compbio.bccrc.ca/software/hmmcopy/}. In short, the suite has tools to: \begin{itemize} \item Obtain high resolution bin counts for large ($\approx$ 250GB) BAM files within a few hours. \item Obtain GC content for bins from standard FASTA files within minutes for a human genome. \item Obtain average mappability for bins from BigWig within minutes, or FASTA files within a day for a human genome. \end{itemize} For TITAN, the user will need to generate the read depth files for the tumour and normal samples in the form of WIG files. Also required, and is provided in the \textit{HMMcopy suite} for the GRCh37-lite reference genome, are the GC content and mappability scores WIG files. Please refer to the instructions on the \textit{HMMcopy suite} website for preparing these files. \section{TITAN analysis} Users should run TITAN once for each setting of the number of clonal clusters, ranging from 1 to 5. Although higher number of clonal clusters can be used, in practice, the sequencing coverage of 30-50X, analyzing more than 5 clonal clusters may not produce accurate results. For each run of TITAN on real data with approximately 2 million loci across all chromosomes and for 5 clonal clusters, the maximum memory requirement is approximately 24Gb. \subsection{Loading the data} Load the provided text file for the allele counts for chromosome 2 of a breast cancer sample \cite{Ha2012}. The format of the file must be 6 columns: chromosome, position, reference base, reference read counts, non-reference base, non-reference read counts. A list object with 6 components of equal size/lengths is returned. <<>>= library(TitanCNA) infile <- system.file("extdata", "test_alleleCounts_chr2.txt", package = "TitanCNA") data <- loadAlleleCounts(infile) names(data) @ \subsection{Loading the initial parameters} Load the default parameters using a specified maximum copy number \textit{TITAN} will consider. Here, 5 is used but up to 8 copies can be specified, however, running time and memory performance hits will be incurred. Here, you will also specify the number of clonal clusters to use for the TITAN run. This is where the number of clusters will be specified when running for 1,2,3,4, or 5 clonal cluster for real data. In general, the default parameters will work well with real, full datasets of whole genome sequencing of tumour samples. Including the loaded \verb|data| from the previous step will help inform the baseline allelic ratio parameters; this is recommended when using \verb|symmetric| genotypes. <<>>= numClusters <- 2 params <- loadDefaultParameters(copyNumber = 5, numberClonalClusters = numClusters, symmetric = TRUE, data = data) params @ \verb|params| is a list object containing 4 sets of parameters, each as a component: \begin{itemize} \item genotypeParams: Parameters for copy number and allelic ratios genotype states \item normalParams: Parameters for normal contamination \item ploidyParams: Parameters for average tumour ploidy \item sParams: Parameters for modeling subclonality: clonal clusters and cellular prevalence \end{itemize} Analysis of copy number by \textit{TITAN} is subject to tumour ploidy. TITAN attempts to estimate this value globally and is accurate in most cases. However, the issue is that, from the signals observed in the data, \textit{TITAN} cannot distinguish between diploid (2 global copies) and tetraploidy (4 global copies) because the data can explain both situations. Therefore, in general, if the user is aware that a sample is ployploid, then TITAN should be run with the ploidy initialization of 2 and also for initialization of 4. If the user is unaware of the ploidy status of the sample, then inspection of a TITAN run at ploidy initialization of 2 can help. If this run has many large, prominent regions of inferred homozygous deletions, then it is likely that this sample is triploid or tetraploid. <>= params$ploidyParams$phi_0 <- 2 # for diploid or params$ploidyParams$phi_0 <- 4 # for tetraploid/ployploid @ Because this vignette is an example involving only one chromosome, the fewer loci in the analysis will give different results than for a sample with all chromosomes. For this example, we will modify the Gaussian variance hyperparameter (prior used in modeling the log ratios) to a less influential setting, making the analysis more suitable for one chromosome. Also, because we know the average tumour ploidy of the genome-wide sample is slightly less than 2, we will initialize the \verb|phi_0| to 1.5 so that it does not overfit to this one chromosome. <<>>= K <- length(params$genotypeParams$alphaKHyper) params$genotypeParams$alphaKHyper <- rep(500, K) params$ploidyParams$phi_0 <- 1.5 @ \subsection{Correct GC content and mappability biases} Correct GC content and mappability biases using a wrapper function that calls uses a function similar to the \textit{HMMcopy} package R function, \textit{correctReadcount}. Here, 4 wig files are required: tumour, normal, GC content, and mappability score. The wrapper will apply the correction sub-routines and compute the log ratio as log2(tumour/normal). <<>>= tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA") normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA") gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA") map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA") cnData <- correctReadDepth(tumWig, normWig, gc, map) head(cnData) @ For real data, users can use the genome-wide GC content and mappability score for 1kb windows. The two files can be found compressed in data/GRCh37-lite.tar.gz of the git repository. These WIG files were generated from the reference GRCh37-lite.fasta; the tumour and normal BAM files used for generating the tumour and normal WIG files in the preprocessing steps MUST also use this same reference. Otherwise, new GC and map wig files will need to be created as per instructions on \url{http://compbio.bccrc.ca/software/hmmcopy/} Then, find the log ratio at each position of interest (germline heterozygous SNPs) given in the object "data". Then transform the log ratios to natural logs instead of log base 2. Remove "logR" and "cnData" objects to save memory. <<>>= logR <- getPositionOverlap(data$chr, data$posn, cnData) data$logR <- log(2^logR) #transform the log ratio to natural logs @ \subsection{Filter the data} Filter the data for low and high depth positions and positions with NA's. For ease of analysis, TITAN converts chromosome X->23 and Y->24. Do not worry, in the final output, "X" and "Y" will be in the output. Optionally, a data.frame containing a list of positions can be passed as an argument, specifying which positions to use. <<>>= data <- filterData(data, c(1:22, "X", "Y"), minDepth = 10, maxDepth = 200, positionList = NULL) @ \subsection{Model training and parameter estimation} Parameter estimation in \textit{TITAN} uses the Expectation Maximization algorithm (EM) and forwards-backwards algorithm. This step generally requires parallelization to increase time performance. The \textit{TITAN} package uses the \textit{foreach} package which conveniently enables the user to use libraries, such as \textit{doMC} and \textit{doMPI}, to parallelize the training by chromosome. It is up to the user to decide which is the best library to use depending on the current system. For example, if one wishes to use multiple cores via forking on a single machine, <>= library(doMC) registerDoMC(cores = 10) #use 10 cores on a single machine @ Here is a little more detail regarding some important arguments: \begin{itemize} \item \verb|maxiter|: maximum number of EM iterations allowed. In practice, users should set it at 20 since it does not exceed 20. For this vignette to finish in a shorter time, we use 3. \item \verb|maxiterUpdate|: maximum number of coordinate descent iterations during the M-step when parameters are estimated. In practice, 1500 iterations should be used. \item \verb|txnExpLen|: influences prior probability of genotype transitions in the HMM. The higher, the lower tendency to change state. \item \verb|txnZstrength|: influences prior probability of clonal cluster transitions in the HMM. Larger values means lower tendency to change clonal cluster state. \item \verb|useOutlierState|: logical indicating whether an additional outlier state should be used. In practice, this usually is not necessary. \item \verb|normalEstimateMethod|: specifies how to handle normal proportion estimation. To estimate, use "map" which is maximum a posteriori. If you wish to not estimate this parameter, then use "fixed". This will default the normal proportion to whatever is specified in \verb|params$normalParams$n_0|. For example, if you know that this sample has absolutely no normal contamination, then use \verb|params$normalParams$n_0 <- 0| and set \verb|normalEstimateMethod="fixed"|. \item \verb|estimateS|: logical indicating whether to account for clonality and estimate subclonal events \item \verb|estimatePloidy|: logical indicating whether to estimate and account for tumour ploidy \end{itemize} <<>>= convergeParams <- runEMclonalCN(data, gParams = params$genotypeParams, nParams = params$normalParams, pParams = params$ploidyParams, sParams = params$cellPrevParams, maxiter = 3, maxiterUpdate = 50, useOutlierState = FALSE, txnExpLen = 1e15, txnZstrength = 5e5, normalEstimateMethod = "map", estimateS = TRUE, estimatePloidy = TRUE) names(convergeParams) @ \verb|convergeParams| is a list object with components containing the converged parameters from the EM training, including posterior marginal responsibilities, log likelihood, and original parameter settings. \begin{itemize} \item \verb|n|: Converged estimate for normal contamination parameter. \verb|numeric array| containing estimates at each EM iteration. \item \verb|s|: Converged estimate(s) for cellular prevalence parameter(s). This value is defined as the proportion of tumour sample that does \emph{not} contain the aberrant genotype. This will contrast what is output in \verb|outputTitanResults|. \verb|numeric array| containing estimates at each EM iteration. If more than one cluster is specified, then \verb|s| is a \verb|numeric matrix|. \item \verb|var|: Converged estimates for variance parameter of the Gaussian mixtures used to model the log ratio data. \verb|numeric matrix| containing estimates at each EM iteration. \item \verb|phi|: Converged estimate for tumour ploidy parameter. \verb|numeric array| containing estimates at each EM iteration. \item \verb|piG|: Converged estimate for initial genotype state distribution. \verb|numeric matrix| containing estimates at each EM iteration. \item \verb|piZ|: Converged estimate for initial clonal cluster state distribution. \verb|numeric matrix| containing estimates at each EM iteration. \item \verb|muR|: Mean of binomial mixtures computed as a function of \verb|s| and \verb|n|. \verb|numeric matrix| containing estimates at each EM iteration. See References for mathematical details. \item \verb|muC|: Mean of Gaussian mixtures computed as a function of \verb|s|, \verb|n|, and \verb|phi|. \verb|numeric matrix| containing estimates at each EM iteration. See References for mathematical details. \item \verb|loglik|: Posterior Log-likelihood that includes data likelihood and the priors. \verb|numeric array| containing estimates at each EM iteration. \end{itemize} \subsection{Find the genotype and clonal cluster state paths} Viterbi algorithm is used to return the optimal genotype/clonal cluster state path. After running EM, use the converge parameters and the input data to compute the optimal segmentation results. <<>>= optimalPath <- viterbiClonalCN(data, convergeParams) head(optimalPath) @ It is difficult to interpret the output of this function directly. The user should use the function \verb|outputTitanResults| to format the results. \subsection{Format and print results to file} Two files are generated: Position-specific results that includes genotype, clonal cluster, and cellular prevalence estimates. The posterior probabilities at each position can optionally be returned. The results can be output to a file if a file name is specified for the \verb|filename| argument. <<>>= results <- outputTitanResults(data, convergeParams, optimalPath, filename = NULL, posteriorProbs = FALSE, subcloneProfiles = TRUE) head(results) @ The format of this output file and data.frame is the following: \begin{enumerate} \item \verb|Chr| \item \verb|Position| \item \verb|RefCount|: number of reads matching the reference base \item \verb|NRefCount|: number of reads matching the non-reference base \item \verb|Depth|: total read depth at the position \item \verb|AllelicRatio|: RefCount/Depth \item \verb|LogRatio|: log2 ratio between normalized tumour and normal read depths \item \verb|CopyNumber|: predicted TITAN copy number \item \verb|TITANstate|: internal state number used by TITAN; see supplementary table 2 in manuscript \item \verb|TITANcall|: interpretable TITAN state; string \{HOMD,DLOH,HET,NLOH,ALOH,ASCNA,BCNA,UBCNA\}, see supplementary table 2 in manuscript \item \verb|ClonalCluster|: predicted TITAN clonal cluster; lower cluster numbers represent clusters with higher cellular prevalence \item \verb|CellularPrevalence|: proportion of tumour cells containing event; not to be mistaken as proportion of sample (including normal) \item \verb|Subclone1.CopyNumber|: Copy number profile for Subclone 1 \item \verb|Subclone1.TITANcall|: TITAN state for Subclone 1 \item \verb|Subclone1.Prevalence|: Subclonal prevalence for Subclone 1 \item \verb|Subclone2.CopyNumber|: Copy number profile for Subclone 2 \item \verb|Subclone2.TITANcall|: TITAN state for Subclone 2 \item \verb|Subclone2.Prevalence|: Subclonal prevalence for Subclone 2 \end{enumerate} Model parameters and summary values such as the number of clonal clusters and their corresponding cellular prevalence, normal contamination, ploidy, and the S\_Dbw validity index for model selection later. <>= outparam <- paste("test_cluster02_params.txt", sep = "") outputModelParameters(convergeParams, results, outparam, S_Dbw.scale = 1) @ The format of the parameter output file is the following: \begin{enumerate} \item Normal contamination estimate: proportion of normal content in the sample; tumour content is 1 minus this number \item Average tumour ploidy estimate: average number of estimated copies in the genome; 2 represents diploid \item Clonal cluster cellular prevalence: Z denotes the number of clonal clusters; each value (space-delimited) following are the cellular prevalence estimates for each cluster \item Genotype binomial means for clonal cluster Z: set of 21 binomial estimated parameters for each specified cluster \item Genotype Gaussian means for clonal cluster Z: set of 21 Gaussian estimated means for each specified cluster \item Genotype Gaussian variance: set of 21 Gaussian estimated variances; variances are shared for across all clusters \item Number of iterations: number of EM iterations needed for convergence \item Log likelihood: complete data log-likelihood for current cluster run \item S\_Dbw dens.bw: density component of S\_Dbw index \item S\_Dbw scat: scatter component of S\_Dbw index \item S\_Dbw validity index: used for model selection; choose run with optimal number of clusters based on lowest S\_Dbw index \end{enumerate} Users may alter the \verb|S_Dbw.scale| argument to penalize higher number of clonal clusters. <>= outputModelParameters(convergeParams, results, outparam, S_Dbw.scale = 10) @ \subsection{Plotting the results} Plot the results using 3 built in functions. For each chromosome, there is a separate plot function for the log ratios (CNA), allelic ratios (LOH), and cellular prevalence. Each function adds to an existing plot. \subsubsection{Copy number alterations (log ratio)} Generate the figure of copy number alteration results by plotting the log ratios. Optionally, the data can be corrected by using the estimated \verb|ploidy|. Otherwise, use \verb|ploidy=NULL|. <>= ploidy <- tail(convergeParams$phi, 1) ploidy plotCNlogRByChr(results, chr = 2, ploidy = ploidy, ylim = c(-2, 2), cex = 0.25, xlab = "", main = "Chr 2") @ The Y-axis is based on log ratios. Log ratios are computed ratios between normalized tumour and normal read depths. Data points close to 0 represent diploid, above 0 are copy gains, below 0 are deletions. Bright Green - HOMD Green - DLOH Blue - HET, NLOH Dark Red - GAIN Red - ASCNA, UBCNA, BCNA \subsubsection{Loss of heterozygosity (allelic ratio)} The loss of heterozygosity (LOH) results is shown by plotting the allelic ratio. <>= plotAllelicRatio(results, chr = 2, ylim = c(0, 1), cex = 0.25, xlab = "", main = "Chr 2") @ The Y-axis is based on allelic ratios. Allelic ratios are computed as RefCount/Depth. Data points close to 1 represent homozygous reference base, close to 0 represent homozygous non-reference base, and close to 0.5 represent heterozygous. Normal contamination influences the divergence away from 0.5 for LOH events. Grey - HET, BCNA Bright Green - HOMD Green - DLOH, ALOH Blue - NLOH Dark Red - GAIN Red - ASCNA, UBCNA \subsubsection{Cellular prevalence and clonal clusters} One of the key features of \textit{TITAN} is the estimation of cellular prevalence and the inference of clonal cluster membership. The estimated normal proportion is required for this plot. This can be obtained from \verb|convergeParams|. <>= norm <- tail(convergeParams$n, 1) norm # estimated normal contamination 1 - convergeParams$s[, ncol(convergeParams$s)] # estimated cellular prevalence plotClonalFrequency(results, chr = 2, normal = norm, ylim = c(0, 1), cex = 0.25, xlab = "", main = "Chr 2") @ The Y-axis is the cellular prevalence that includes the normal proportion. Therefore, the cellular prevalence here refers to the proportion in the sample (including normal). Lines are drawn for each data point indicating the cellular prevalence. Heterozygous diploid are not shown because it is a normal genotype and is not categorized as being subclonal (this means 100\% of cells are normal). The black horizontal line represents the tumour content labeled as "T". Each horizontal grey line represents the cellular prevalence of the clonal clusters labeled as Z1, Z2, etc. Colours are the sames for allelic ratio plots. \subsubsection{Subclone profiles} New since TitanCNA v1.2.0, users can plot the copy number profiles for the predicted subclones. This function only works for solutions containing 1 or 2 clonal clusters. <>= plotSubcloneProfiles(results, chr = 2, cex = 2, spacing = 6, main = "Chr 2") @ Colours have the same definition as for the allelic ratio plots. \section{Session Information} The version number of R and packages loaded for generating the vignette <>= toLatex(sessionInfo()) @ \bibliographystyle{plain} \bibliography{TitanCNA} \end{document}