%\VignetteKeywords{runAbsoluteCN}
%\VignetteEngine{knitr::knitr}
%\VignetteDepends{PureCN}
%\VignettePackage{PureCN}
%\VignetteIndexEntry{PureCN}
\documentclass{article}

<<echo=FALSE, results="asis">>=
BiocStyle::latex()
@

\title{PureCN: Estimating tumor purity, ploidy, LOH and SNV status using
hybrid capture NGS data}
\author{Markus Riester}

\begin{document}

\maketitle
\tableofcontents

\section{Background}

\Biocpkg{PureCN} is a purity and ploidy aware copy number caller for cancer
samples, inspired by the \software{ABSOLUTE} algorithm \cite{Carter2012}. It
was designed with hybrid capture sequencing data in mind, especially
medium-sized targeted gene panels without matching normal samples. It can be
used to enhance existing segmentation and normalization algorithms. If the
correct purity and ploidy solution was identified, \Biocpkg{PureCN} can also
help in classifying variants as germline vs. somatic or clonal vs. sub-clonal.

\Biocpkg{PureCN} was further designed to integrate well with industry standard
pipelines \cite{McKenna2010}, but it is straightforward to generate input data
from other pipelines.

<<>>=
library(PureCN)
@

\section{Limitations}

\Biocpkg{PureCN} currently assumes a completely diploid normal genome. It
tries to detect the sex of samples by calculating the coverage ratio of
chromosomes X and Y and will then remove any non-diploid chromosomes (i.e.\ X
and Y for males and Y for females)\footnote{Loss of Y chromosome (LOY) can
result in wrong female calls, especially in high purity samples or if LOY is
present in both tumor and contaminating normal cells. Since homozygous
germline SNPs on the X chromosome are removed, this will not affect
purity/ploidy selection. A future version might warn users in this case when
germline SNPs are provided.}. Non-diploid chromosomes provide only limited
information for purity and ploidy selection, but are of interest for copy
number calling. Future versions might support full copy number calling on the
X and Y chromosomes.

\Biocpkg{PureCN} currently also assumes a mostly clonal genome. For most
clinical samples, this is reasonable, but very heterogeneous samples likely
cannot be called without manual curation. Due to the lack of signal, manual
curation is also often necessary in low purity samples or very quiet genomes.

In the absence of matched normals, the software currently requires some normal
contamination to infer germline genotypes. Since cell lines are rarely matched
and typically lack normal contamination, \Biocpkg{PureCN} will likely not work
well with cell line data.

\section{Basic input files}

The algorithm first needs to calculate copy number log-ratios of tumor vs.
normal control. The coverage data required for this calculation needs to be
provided in \software{GATK DepthOfCoverage} format:

\begin{footnotesize}
\begin{verbatim}
Target total_coverage average_coverage Sample1_total_cvg Sample1_mean_cvg
  Sample1_granular_Q1 Sample1_granular_median Sample1_granular_Q3
  Sample1_._above_15
chr1:69091-70009 0 0 0 0 1 1 1 0
chr1:367659-368598 6358 9.25121153819759 6358 9.25121153819759 1 7 13 11.6
chr1:621096-622035 6294 9.16910019318401 6294 9.16910019318401 1 7 12 9.5
\end{verbatim}
\end{footnotesize}

\Biocpkg{PureCN} will only use data from the columns ``Target'',
``total\_coverage'', and ``average\_coverage''; all other columns are
optional. If \software{GATK} is not available, we refer to the
\CRANpkg{ExomeCNV} package \cite{Sathirapongsasuti2011} and its documentation,
which provide scripts for generating coverage files from BAM files.
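Coverage files in this format can be loaded with the
\Rfunction{readCoverageGatk} function used later in this vignette, which is a
convenient way to verify that a file is parsed correctly. A minimal sketch,
not evaluated here (the file path is a placeholder for one of your own
coverage files):

<<eval=FALSE>>=
# Load a GATK DepthOfCoverage file; only the Target, total_coverage
# and average_coverage columns are used by PureCN.
tumor.coverage <- readCoverageGatk("Sample1_coverage.txt")
head(tumor.coverage)
@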
Germline SNPs and somatic mutations are expected in a single VCF file. This
VCF should contain a DB info flag for dbSNP membership. VCF files generated by
\software{MuTect} \cite{Cibulskis2013} should work well and in general require
no post-processing. \Biocpkg{PureCN} can handle \software{MuTect} VCF files
generated in both single and matched normal mode. If a matched normal is
available, somatic status information is currently expected in a SOMATIC info
flag in the VCF.

\section{Example data}

We now load a few example files:

<<>>=
gatk.normal.file <- system.file("extdata", "example_normal.txt",
    package="PureCN")
gatk.normal2.file <- system.file("extdata", "example_normal2.txt",
    package="PureCN")
gatk.normal.files <- c(gatk.normal.file, gatk.normal2.file)
gatk.tumor.file <- system.file("extdata", "example_tumor.txt",
    package="PureCN")
vcf.file <- system.file("extdata", "example_vcf.vcf", package="PureCN")
@

To obtain gene level copy number calls, we need an exon-to-gene mapping file.
This is provided together with GC content via the \Rcode{gc.gene.file}
argument. The expected format is:

\begin{verbatim}
Target              gc_bias            Gene
chr1:69091-70009    0.427638737758433  OR4F5
chr1:367659-368598  0.459574468085106  OR4F29
chr1:621096-622035  0.459574468085106  OR4F3
\end{verbatim}

If \software{GATK} is available, the \software{GCContentByInterval} tool can
be used to generate the GC bias column. Otherwise, several
\software{Bioconductor} annotation packages provide GC content information.

<<>>=
gc.gene.file <- system.file("extdata", "example_gc.gene.file.txt",
    package="PureCN")
@

\section{GC bias}

The algorithm works best when the coverage data is GC-normalized. The steps
below do not GC-normalize, since the example data is already normalized. The
function \Rfunction{correctCoverageBias}, which borrows normalization code
from the \Biocpkg{TitanCNA} package \cite{Ha2014}, can be used to correct for
GC bias. We recommend correcting for GC bias, storing the corrected coverage
files, and then performing the following steps on the GC-corrected data. All
other assay-specific biases, such as mappability or exon length, are mitigated
by using a control sample (GC bias is commonly library-specific).

<<>>=
gatk.normal.file <- system.file("extdata", "example_normal.txt",
    package = "PureCN")
gc.gene.file <- system.file("extdata", "example_gc.gene.file.txt",
    package = "PureCN")
coverage <- correctCoverageBias(gatk.normal.file, gc.gene.file,
    "example_normal_loess.txt")
@

\section{Pool of Normals}
\subsection{Selection of normals for log-ratio calculation}

For calculating copy number log-ratios of tumor vs. normal, \Biocpkg{PureCN}
requires coverage from a process-matched normal sample. Using a normal that
was sequenced with a similar, but not identical, assay rarely works, since
differently covered genomic regions result in too many log-ratio outliers.
This section describes how to identify a good process-matched normal when no
matched normal is available or when the matched normal has low or uneven
coverage.

The \Rfunction{createNormalDatabase} function builds a database of coverage
files:

<<>>=
normalDB <- createNormalDatabase(gatk.normal.files)
@

Internally, this function determines the sex of the samples and trains a PCA
that is later used for clustering a tumor file with all normal samples in the
database.
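The database only needs to be built once per assay; it can be serialized with
base R functions and reused for additional tumor samples. A minimal sketch
(the file name is illustrative):

<<eval=FALSE>>=
# Store the normal database for later runs of the same assay
saveRDS(normalDB, file="normalDB.rds")
# ...and load it again in a new R session
normalDB <- readRDS("normalDB.rds")
@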
The clustering of a tumor sample against this database is performed by the
\Rfunction{findBestNormal} function:

<<>>=
# get the best normal
gatk.best.normal.file <- findBestNormal(gatk.tumor.file, normalDB)
@

This function can also return multiple normal files that can be averaged into
a single pool:

<<>>=
# get the best 2 normals and average them
gatk.best.normal.files <- findBestNormal(gatk.tumor.file, normalDB,
    num.normals=2)
pool <- poolCoverage(lapply(gatk.best.normal.files, readCoverageGatk),
    remove.chrs=c('chrX', 'chrY'))
@

Pooling is only recommended when the coverage in the normals is significantly
lower than in the tumor. Otherwise, the PCA typically does a good job of
selecting a normal with decent coverage and biases similar to the tumor, but
it is worth experimenting with different strategies. When pooling, it might
also be necessary to remove certain principal components, most likely the
first one, so that coverage is not the only selection criterion. The
\Rfunction{plotBestNormal} function might be helpful in finding a good
strategy. Note that this example removes sex chromosomes; if the normal
database contains a sufficient number of samples with matching sex,
\Rfunction{findBestNormal} will return only normal samples with matching sex.

\subsection{Artifact filtering}

It is important to remove as many artifacts as possible, since low ploidy
solutions are typically penalized more by artifacts than high ploidy
solutions. High ploidy solutions are complex and usually find ways of
explaining artifacts reasonably well. A large selection of normal files is
helpful for the identification and removal of artifacts.

We can, for example, use the normal coverage data to estimate the expected
variance in coverage per exon. This will help the segmentation algorithm to
filter noise. This step is optional, but recommended with the default
segmentation function and a large pool of normals.

<<>>=
exon.weight.file <- "exon_weights.txt"
createExonWeightFile(gatk.tumor.file, gatk.normal.files, exon.weight.file)
@

This function calculates exon-level copy number log-ratios for all normal
samples provided in the \Rcode{gatk.normal.files} argument. Assuming that all
normal samples are in general diploid, an unusually high variance in log-ratio
is indicative of an exon with either common germline alterations or frequent
artifacts; high or low copy number log-ratios in those exons are unlikely to
reflect somatic copy number events. We also need to provide a coverage file
that is used as the tumor in the log-ratio calculation. The corresponding
\Rcode{gatk.tumor.file} argument can also be a vector of coverage files, in
which case the exon coverage variance is averaged over all provided tumor
files. As long as the coverage of the tumor files is even, however, only one
or two tumor files are necessary. It is also safe to use a high coverage
normal file instead of a tumor file in this step:

<<>>=
#createExonWeightFile(gatk.normal.files[1], gatk.normal.files[-1],
#    "exon_weights.txt")
@

We can also use the pool of normals to find SNPs with biased allelic fractions
in low quality regions (i.e.\ allelic fractions of heterozygous SNPs deviating
significantly from 0.5). We do not do this here for simplicity. This step is
recommended for unpaired samples and optional for paired samples. If a
matching normal is provided, most variants in low quality regions with biased
allelic fractions are automatically ignored. Note that this is not meant to
model the non-reference bias that results in expected allelic ratios only
slightly below 0.5 (typically around 0.48).
This bias is typically not strong enough to influence the selection of SNV
states.

<<>>=
#mutect.normal.files <- dir("poolofnormals", pattern="vcf$", full.names=TRUE)
#snp.blacklist <- createSNPBlacklist(mutect.normal.files)
#write.csv(snp.blacklist[[1]], file="SNP_blacklist.csv")
#write.csv(snp.blacklist[[2]], file="SNP_blacklist_segmented.csv",
#    row.names=FALSE, quote=FALSE)
@

\subsection{Artifact filtering without a pool of normals}

By default, \Biocpkg{PureCN} will exclude exons with coverage below 15X from
segmentation. For SNVs, the same 15X cutoff is applied. \software{MuTect}
applies more sophisticated artifact tests and flags suspicious variants. If
\software{MuTect} was run in matched normal mode, then both potential
artifacts and germline variants are rejected, which means we cannot simply
filter by the PASS/REJECT \software{MuTect} flags. The
\Rfunction{filterVcfMuTect} function optionally reads the \software{MuTect}
stats file and will keep germline variants while removing potential artifacts.
Without the stats file, \Biocpkg{PureCN} will use only the filters based on
read depths as defined in \Rfunction{filterVcfBasic}. Both functions are
automatically called by \Biocpkg{PureCN}, but can be easily modified and
replaced if necessary.

\section{Recommended run}

Finally, we can run \Biocpkg{PureCN} with all that information:

<<>>=
ret <- runAbsoluteCN(gatk.normal.file=pool, gatk.tumor.file=gatk.tumor.file,
    vcf.file=vcf.file, sampleid='Sample1', gc.gene.file=gc.gene.file,
#    args.filterVcf=list(snp.blacklist=snp.blacklist,
#        stats.file=mutect.stats.file),
    args.segmentation=list(exon.weight.file=exon.weight.file),
    post.optimize=FALSE, plot.cnv=FALSE)
@

The \Rcode{post.optimize} flag will significantly increase the runtime, but
might be worth it if the copy number profile is not very clean. This is
recommended for lower coverage datasets or if there are significant capture
biases. For high quality whole exome data, this is typically not necessary.
Note the commented-out \Rcode{args.filterVcf} line, which is recommended, but
not used here for simplicity. The \Rfunction{segmentationPSCBS} function
requires the \CRANpkg{PSCBS} package \cite{Olshen2011} and sometimes gives
better results than the default when the coverage data is relatively noisy,
especially for whole exome data with a large number of heterozygous SNPs. The
\Rcode{plot.cnv} argument is set to \Rcode{FALSE} for this vignette, but when
enabled it generates often helpful additional plots provided by the
segmentation function.

We also need to create a few output files:

<<>>=
file.rds <- 'Sample1_PureCN.rds'
saveRDS(ret, file=file.rds)
pdf('Sample1_PureCN.pdf', width=10, height=12)
plotAbs(ret, type='all')
dev.off()
@

The first plot is an overview that shows the purity and ploidy local optima,
sorted by final likelihood score after fitting both copy number and allelic
fractions:

<<figureexample1, fig.show="hide">>=
plotAbs(ret, type="overview")
@
\incfig{figure/figureexample1-1}{0.75\textwidth}{Overview.}
{The colors visualize the copy number fitting score from low (blue) to high
(red). The numbers indicate the ranks of the local optima.}

We now look at the main plots of the maximum likelihood solution in more
detail.

<<figureexample2, fig.show="hide">>=
plotAbs(ret, 1, type="hist")
@
\incfig{figure/figureexample2-1}{0.75\textwidth}{Log-ratio histogram.}{}

Figure~\ref{figure/figureexample2-1} displays a histogram of tumor vs. normal
copy number log-ratios for the maximum likelihood solution (number 1 in
Figure~\ref{figure/figureexample1-1}). The height of a bar in this plot is
proportional to the fraction of the genome falling into the particular
log-ratio copy number range. The vertical dotted lines and numbers visualize
the expected log-ratios for all integer copy numbers from 0 to 7 for the given
purity/ploidy combination. It can be seen that most of the log-ratios of the
maximum likelihood solution align well with the expected values for copy
numbers of 0, 1 and 2.
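The purity and ploidy estimates behind the ranked solutions can also be
examined directly in R. A short sketch, not evaluated here and assuming each
element of \Rcode{ret\$results} stores its \Rcode{purity} and \Rcode{ploidy}
values (the result object is described in the Output section below):

<<eval=FALSE>>=
# Purity and ploidy of the maximum likelihood solution
ret$results[[1]]$purity
ret$results[[1]]$ploidy
# The same for all local optima, in ranked order
sapply(ret$results, function(x) c(purity=x$purity, ploidy=x$ploidy))
@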
<<figureexample3, fig.show="hide">>=
plotAbs(ret, 1, type="BAF")
@
\incfig{figure/figureexample3-1}{0.95\textwidth}{B-allele frequency plot.}
{Each dot is a (predicted) germline SNP.}

Germline variant data are informative for calculating integer copy number,
because unbalanced maternal and paternal chromosome numbers in the tumor
portion of the sample lead to unbalanced germline allelic fractions. Equations
for calculating expected allelic fractions are given in the Supplemental
Information of the \Biocpkg{PureCN} manuscript.
Figure~\ref{figure/figureexample3-1} shows the allelic fractions of predicted
germline SNPs. The middle panel shows the corresponding copy number
log-ratios. The lower panel displays the calculated integer copy numbers,
corrected for purity and ploidy.

<<figureexample4, fig.show="hide">>=
plotAbs(ret, 1, type="AF")
@
\incfig{figure/figureexample4-1}{0.95\textwidth}{Allele fraction plots.}
{Each dot is again a (predicted) germline SNP. This plot normally also shows
somatic mutations in two additional panels. This toy example contains only
germline SNPs, however.}

Finally, Figure~\ref{figure/figureexample4-1} provides more insight into how
well the variants fit the expected values. The left panel shows the
correlation of expected and observed allelic fractions. The expected value is
determined by the most likely state. High ploidy solutions have many states,
so this correlation is expected to be good for high ploidy solutions. The
right panel plots the observed allelic fractions against copy number. The
labels show the expected values for all called states; \Rcode{2m1} would be
diploid and heterozygous, \Rcode{2m2} diploid and homozygous.

\clearpage

\section{Manual curation}

For the prediction of SNV status (germline vs. somatic, sub-clonal vs. clonal,
homozygous vs. heterozygous), it is important that both purity and ploidy are
correct. We provide functionality for curating results:

<<>>=
createCurationFile(file.rds)
@

This will generate a CSV file, in which the correct purity and ploidy values
can be manually entered. It also contains a column ``Curated'', which should
be set to \Rcode{TRUE}; otherwise the file will be overwritten when this step
is re-run.

Then in R, the correct solution (closest to the combination in the CSV file)
can be loaded with the \Rfunction{readCurationFile} function:

<<>>=
ret <- readCurationFile(file.rds)
@

This function has various handy features, but most importantly it will
re-order the local optima so that the curated purity and ploidy combination is
ranked first. This means \Rcode{plotAbs(ret, 1, type="hist")} would show the
plot for the curated purity/ploidy combination, for example.

The default curation file will list the maximum likelihood solution:

<<>>=
read.csv('Sample1_PureCN.csv')
@

\Biocpkg{PureCN} currently only flags samples with warnings; it does not mark
any samples as failed. The \Rcode{Failed} column in the curation file can be
used to manually flag samples for exclusion in downstream analyses.

\section{Custom segmentation}

By default, we will use \CRANpkg{DNAcopy} \cite{Venkatraman2007} to segment
the log-ratio. It is straightforward to replace the default with other
methods, and the \Rfunction{segmentationCBS} function can serve as an example.
The \Rfunction{segmentationPSCBS} function is another example, which uses the
\CRANpkg{PSCBS} package \cite{Olshen2011}.
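Switching to such an alternative is a one-argument change. The sketch below is
not evaluated here and assumes the \Rcode{fun.segmentation} argument of
\Rfunction{runAbsoluteCN} is used to pass the segmentation function:

<<eval=FALSE>>=
# Sketch: use the PSCBS-based segmentation instead of the default
# (requires the PSCBS package to be installed)
retPSCBS <- runAbsoluteCN(gatk.normal.file=pool,
    gatk.tumor.file=gatk.tumor.file, vcf.file=vcf.file,
    sampleid='Sample1', gc.gene.file=gc.gene.file,
    fun.segmentation=segmentationPSCBS,
    post.optimize=FALSE, plot.cnv=FALSE)
@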
It is also possible to provide already segmented data, which, however, we only
recommend when matched SNP6 data is available. Otherwise it is usually better
to customize the segmentation function as described above, since the algorithm
then has access to the raw log-ratio distribution. The expected file format
for already segmented copy number data is:

\begin{verbatim}
ID       chrom  loc.start  loc.end  num.mark  seg.mean
Sample1  1      61723      5773942  2681      0.125406444072723
Sample1  1      5774674    5785170  10        -0.756511807441712
\end{verbatim}

\section{Output}

The \Rcode{plotAbs()} call above will generate the main plots shown in the
manuscript. The R data file (\Rcode{file.rds}) contains gene level copy number
calls, SNV status and LOH calls. The purity/ploidy combinations are sorted by
likelihood and stored in \Rcode{ret\$results}.

<<>>=
names(ret)
head(ret$results[[1]]$gene.calls, 3)
@

This \Rcode{data.frame} also contains gene level LOH information. The SNV
posteriors:

<<>>=
head(ret$results[[1]]$SNV.posterior$beta.model$posteriors, 3)
@

This lists the posterior probabilities for all possible SNV states. M0 to M7
are multiplicity values, i.e.\ the number of chromosomes harboring the
mutation (e.g.\ 1 for heterozygous and 2 for homozygous if the copy number C
is 2). Columns with the ML prefix indicate maximum likelihood estimates; for
example, \Rcode{ML.AR} is the expected allelic ratio of the most likely state,
whereas \Rcode{AR} is the observed allelic ratio. \Rcode{GERMLINE.CONTHIGH}
and \Rcode{GERMLINE.CONTLOW} are the two contamination states. The former are
homozygous germline SNPs that were not filtered out because reference alleles
from another individual were sequenced, resulting in allelic fractions smaller
than 1. The latter are non-reference alleles only present in the
contamination.

\section{Prediction of somatic status and cellular fraction}

The SNV posteriors above provide posterior probabilities for all possible
states. The \Rfunction{predictSomatic} function adjusts these posterior
probabilities by excluding germline states that do not correspond to the
maternal and paternal chromosome numbers (since \Biocpkg{PureCN} only
considers heterozygous variants, SNPs are either from the maternal or the
paternal chromosome). For predicted somatic mutations, this function also
provides cellular fraction estimates, i.e.\ the fraction of cells carrying the
mutation. Fractions significantly below 1 indicate sub-clonality\footnote{This
number can be above 1 when the observed allelic fraction is higher than
expected for a clonal mutation. This may be due to random sampling, wrong copy
number, sub-clonal copy number events, or wrong purity/ploidy estimates.}.

<<>>=
head(predictSomatic(ret), 3)
@

Segment 1 in this toy example clearly has copy number 2 with LOH, so
heterozygous mutations are impossible and the posterior probabilities for
\Rcode{GERMLINE.M1} are set to 0.

\appendix
\bibliography{PureCN}

\section*{Session Info}
<<echo=FALSE, results="asis">>=
toLatex(sessionInfo())
@

\end{document}