%\VignetteKeywords{runAbsoluteCN} %\VignetteEngine{knitr::knitr} %\VignetteDepends{PureCN} %\VignettePackage{PureCN} %\VignetteIndexEntry{Quick start and command line usage} \documentclass{article} <>= BiocStyle::latex() @ \usepackage{booktabs} % book-quality tables \begin{document} <>= library(PureCN) set.seed(1234) @ \section*{PureCN - Quick Start} This tutorial provides a quick overview of the command line tools shipping with \Biocpkg{PureCN}. For the R package and more detailed information, see the main vignette. \subsection*{Update from previous stable versions} \Biocpkg{PureCN} is backward compatible with input generated by the previous stable version. For optimal results, we recommend starting from scratch following the tutorial below. \Biocpkg{PureCN} 1.8 introduced off-target read support and minor improvements that are not available with previously generated data. \subsection*{Prepare environment and files} \begin{itemize} \item Start R and enter the following to get the path to the command line scripts: <>= system.file("extdata", package="PureCN") @ \item Exit R and store this path in an environment variable, for example in BASH: \begin{verbatim} $ export PURECN="/path/to/PureCN/extdata" $ Rscript $PURECN/PureCN.R --help Usage: /path/to/PureCN/inst/extdata/PureCN.R [options] ... \end{verbatim} \item Generate an interval file from a BED file containing target coordinates: \begin{verbatim} $ Rscript $PURECN/IntervalFile.R --infile baits_hg19.bed \ --fasta hg19.fa --outfile baits_hg19_gcgene.txt \ --offtarget --genome hg19 \ --export baits_optimized_hg19.bed \ --mappability wgEncodeCrgMapabilityAlign100mer.bigWig \end{verbatim} Internally, this script uses \Biocpkg{rtracklayer} to parse the \Rcode{infile}. Make sure that the file format matches the file extension. See the \Biocpkg{rtracklayer} documentation for problems loading the file. The \Rcode{--offtarget} flag will include off-target reads. Including them is recommended except for Amplicon data. The \Rcode{--genome} version is needed to annotate exons with gene symbols. The \Rcode{--export} argument is optional. If provided, this script will store the modified intervals as BED file for example (again every \Biocpkg{rtracklayer} format is supported). This is useful when the coverages are calculated with third-party tools like GATK. The \Rcode{--mappability} argument should provide a \Biocpkg{rtracklayer} parsable file with a mappability score in the first meta data column. If provided, off-target regions will be restricted to regions specified in this file. On-target regions with low mappability will be excluded. For a test run, you can skip this argument or simply download the file from the UCSC website (see the FAQ section of the main vignette for instruction how to generate such a file). \end{itemize} \subsection*{Create VCF files} \Biocpkg{PureCN} does not ship with a variant caller. Use a third-party tool to generate a VCF for each sample. A few recommendations: \begin{itemize} \item Use \software{MuTect} 1.1.7 if possible \item Support for \software{MuTect 2} and \software{FreeBayes} is available for tumor-only VCFs, but currently poorly tested and only very limited artifact filtering will be performed for these callers. \item Since germline SNPs are needed to infer allele-specific copy numbers, the provided VCF needs to contain both somatic and germline variants. \item Run the variant caller with a 50 base pair interval padding to increase the number of heterozygous SNPs \end{itemize} \subsection*{Run PureCN with internal segmentation} The following describes \Biocpkg{PureCN} runs with internal copy number normalization and segmentation. \subsubsection*{Coverage} For each sample, tumor and normal: \begin{verbatim} # Calculate and GC-normalize coverage from a BAM file $ Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \ --bam ${SAMPLEID}.bam \ --gcgene baits_hg19_gcgene.txt # GC-normalize coverage from a GATK DepthOfCoverage file Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \ --coverage ${SAMPLEID}.coverage.sample_interval_summary \ --gcgene baits_hg19_gcgene.txt \end{verbatim} Similar to GATK, this script also takes a text file containing a list of BAM or coverage file names (one per line). The file extension must be .list: \begin{verbatim} # Calculate and GC-normalize coverage from a list of BAM files $ Rscript $PURECN/Coverage.R --outdir $OUT/$SAMPLEID \ --bam normals.list \ --gcgene baits_hg19_gcgene.txt \ --cpu 4 \end{verbatim} \subsubsection*{NormalDB} To build a normal database, copy the paths to all GC-normalized normal coverage files in a single text file, line-by-line: \begin{verbatim} ls -a normal*loess.txt | cat > example_normal.list # From already GC-normalized files $ Rscript $PURECN/NormalDB.R --outdir $OUT \ --coveragefiles example_normal.list \ --genome hg19 \end{verbatim} \begin{itemize} \item The resulting \Rcode{normalDB\_hg19.rds} file contains absolute paths to the coverage files. It is thus necessary to re-run this command when the coverage files are moved. \item Consider generating different databases when differences are significant, e.g. for samples with different read lengths or insert size distributions \item In particular, do not mix normal data obtained with different capture kits (e.g. Agilent SureSelect v4 and v6) \end{itemize} \subsubsection*{PureCN} Now that the assay-specific files are created and all coverages calculated, we run PureCN.R to normalize, segment and determine purity and ploidy: \begin{verbatim} cd $OUT/$SAMPLEID # Without a matched normal (minimal test run) $ Rscript $PURECN/PureCN.R --out . --tumor ${SAMPLEID}_coverage_loess.txt \ --normaldb ../normalDB_hg19.rds \ --sampleid $SAMPLEID \ --vcf ${SAMPLEID}_mutect.vcf \ --pool 5 \ --genome hg19 --gcgene baits_hg19_gcgene.txt # Production pipeline run $ Rscript $PURECN/PureCN.R --out . --tumor ${SAMPLEID}_coverage_loess.txt \ --normaldb ../normalDB_hg19.rds \ --normal_panel $NORMAL_PANEL \ --sampleid $SAMPLEID \ --vcf ${SAMPLEID}_mutect.vcf \ --statsfile ${SAMPLEID}_mutect_stats.txt \ --pool 5 \ --genome hg19 --gcgene baits_hg19_gcgene.txt --snpblacklist hg19_simpleRepeats.bed \ --targetweightfile ../target_weights_hg19.txt --force --postoptimize # With a matched normal (test run) $ Rscript $PURECN/PureCN.R --out . --tumor ${SAMPLEID}_coverage_loess.txt \ --normal ${SAMPLEID_NORMAL}_coverage_loess.txt \ --sampleid $SAMPLEID \ --vcf ${SAMPLEID}_mutect.vcf \ --genome hg19 \ --normaldb ../normalDB_hg19.rds \ --gcgene baits_hg19_gcgene.txt # Recreate output after manual curation of ${SAMPLEID}.csv $ Rscript $PURECN/PureCN.R --rds ${SAMPLEID}.rds \end{verbatim} A few important recommendations: \begin{itemize} \item Even if matched normals are available, it is often safer to use the normal database for coverage normalization \item Providing the normal database in addition to a matched normal is optional, but helps ignoring low quality regions in the segmentation \item The \Rcode{--pool} flag specifies how many normal samples should be used to calculate the tumor vs. normal coverage. In a large, homogeneous database, this should be set to 10. In a heterogeneous database (for example different insert sizes, coverages, library preparation protocols, etc.), this value is set to an appropriate group size between 2 and 10. \item The normal panel VCF file is useful for mapping bias correction and especially recommended without matched normals. See the FAQ of the main vignette how to generate this file. It is not essential for test runs. \item The \software{MuTect} 1.1.7 stats file (the main output file besides the VCF) should be provided for better artifact filtering. If the VCF was generated by a pipeline that performs good artifact filtering, this file is not needed. \item The \Rcode{--postoptimize} flag defines that purity should be optimized using both variant allelic fractions and copy number instead of copy number only. This results in a significant runtime increase for whole-exome data. \item If \Rcode{--out} is a directory, it will use the sample id as file prefix for all output files. Otherwise \Biocpkg{PureCN} will use \Rcode{--out} as prefix. \end{itemize} \subsection*{Run PureCN with third-party segmentation} If you already have a segmentation from third-party tools (for example CNVkit, EXCAVATOR2). For a test run: \begin{verbatim} Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --sampleid $SAMPLEID \ --segfile $OUT/$SAMPLEID/${SAMPLEID}.cnvkit.seg \ --vcf ${SAMPLEID}_mutect.vcf \ --genome hg19 --gcgene baits_hg19_gcgene.txt \end{verbatim} See the main vignette for more details and file formats. For a production pipeline run we provide a bit more information about the assay and genome: \begin{verbatim} # Recommended CNVkit example Rscript $PURECN/PureCN.R --out $OUT/$SAMPLEID \ --sampleid $SAMPLEID \ --tumor $OUT/$SAMPLEID/${SAMPLEID}.cnr \ --segfile $OUT/$SAMPLEID/${SAMPLEID}_cnvkit.seg \ --normal_panel $NORMAL_PANEL \ --vcf ${SAMPLEID}_mutect.vcf \ --statsfile ${SAMPLEID}_mutect_stats.txt \ --snpblacklist hg19_simpleRepeats.bed \ --genome hg19 \ --funsegmentation none \ --force --postoptimize \end{verbatim} \begin{itemize} \item The \Rcode{--funsegmentation} argument controls if the data should to be re-segmented using germline BAFs (default). Set this value to \Rcode{none} if the provided segmentation should be used as is. \end{itemize} \subsection*{Dx} Dx.R extracts copy number and mutation metrics from PureCN.R output. \begin{verbatim} # Provide a BED file with callable regions, for examples obtained by # GATK CallableLoci. Useful to calculate mutations per megabase and # to exclude low quality regions. grep CALLABLE ${SAMPLEID}_callable_status.bed > \ ${SAMPLEID}_callable_status_filtered.bed Rscript $PureCN/Dx.R --out . --rds ${SAMPLEID}.rds \ --callable ${SAMPLEID}_callable_status_filtered.bed \end{verbatim} \clearpage \section*{Reference} \begin{table*} \caption{IntervalFile} \begin{tabular}{lll} \toprule Argument name & Corresponding PureCN argument & PureCN function \\ \midrule --fasta & reference.file & \Rfunction{calculateGCContentByInterval} \\ --infile & interval.file & \Rfunction{calculateGCContentByInterval} \\ --offtarget & off.target & \Rfunction{calculateGCContentByInterval} \\ --targetwidth & average.target.width & \Rfunction{calculateGCContentByInterval} \\ --offtargetwidth & average.off.target.width & \Rfunction{calculateGCContentByInterval} \\ --offtargetseqlevels & off.target.seqlevels & \Rfunction{calculateGCContentByInterval} \\ --mappability & mappability & \Rfunction{calculateGCContentByInterval} \\ --genome & txdb, org & \Rfunction{annotateTargets} \\ --outfile & & \\ --export & & \Rfunction{rtracklayer::export} \\ --version -v & & \\ --force -f & & \\ --help -h & & \\ \bottomrule \end{tabular} \end{table*} \begin{table*} \caption{Coverage} \begin{tabular}{lll} \toprule Argument name & Corresponding PureCN argument & PureCN function \\ \midrule --bam & bam.file & \Rfunction{calculateBamCoverageByInterval} \\ --bai & index.file & \Rfunction{calculateBamCoverageByInterval} \\ --coverage & coverage.file & \Rfunction{correctCoverageBias} \\ --gcgene & gc.gene.file & \Rfunction{correctCoverageBias} \\ --method & method & \Rfunction{correctCoverageBias} \\ --keepduplicates & keep.duplicates & \Rfunction{calculateBamCoverageByInterval} \\ --outdir & & \\ --cpu & & Number of CPUs to use\\ --seed & & \\ --version -v & & \\ --force -f & & \\ --help -h & & \\ \bottomrule \end{tabular} \end{table*} \begin{table*} \caption{NormalDB} \begin{tabular}{lll} \toprule Argument name & Corresponding PureCN argument & PureCN function \\ \midrule --coveragefiles & normal.coverage.files & \Rfunction{createNormalDatabase} \\ --maxmeancoverage & max.mean.coverage & \Rfunction{createNormalDatabase} \\ --assay -a & Optional assay name & Used in output file names. \\ --genome -g & Optional genome version & Used in output file names. \\ --outdir -o & & \\ --version -v & & \\ --force -f & & \\ --help -h & & \\ \bottomrule \end{tabular} \end{table*} \begin{table*} \caption{PureCN} \begin{tabular}{lll} \toprule Argument name & Corresponding PureCN argument & PureCN function \\ \midrule --sampleid -i & sampleid & \Rfunction{runAbsoluteCN} \\ --normal & normal.coverage.file & \Rfunction{runAbsoluteCN} \\ --tumor & tumor.coverage.file & \Rfunction{runAbsoluteCN} \\ --vcf & vcf.file & \Rfunction{runAbsoluteCN} \\ --rds & file.rds & \Rfunction{readCurationFile} \\ --normal\_panel & normal.panel.vcf.file & \Rfunction{setMappingBiasVcf} \\ --normaldb & normalDB (serialized with \Rfunction{saveRDS}) & \Rfunction{findBestNormal}, \Rfunction{filterTargets} \\ --segfile & seg.file & \Rfunction{runAbsoluteCN} \\ --sex & sex & \Rfunction{runAbsoluteCN} \\ --pool & pool, num.normals & \Rfunction{findBestNormal} \\ --genome & genome & \Rfunction{runAbsoluteCN} \\ --gcgene & gc.gene.file & \Rfunction{runAbsoluteCN} \\ --statsfile & stats.file & \Rfunction{filterVcfMuTect} \\ --minaf & af.range & \Rfunction{filterVcfBasic} \\ --snpblacklist & snp.blacklist & \Rfunction{filterVcfBasic} \\ --error & error & \Rfunction{runAbsoluteCN} \\ --funsegmentation & fun.segmentation & \Rfunction{runAbsoluteCN} \\ --alpha & alpha & \Rfunction{segmentationCBS} \\ --maxsegments & max.segments & \Rfunction{runAbsoluteCN} \\ --targetweightfile & target.weight.file & \Rfunction{segmentationCBS} \\ --minpurity & test.purity & \Rfunction{runAbsoluteCN} \\ --maxpurity & test.purity & \Rfunction{runAbsoluteCN} \\ --minploidy & min.ploidy & \Rfunction{runAbsoluteCN} \\ --maxploidy & max.ploidy & \Rfunction{runAbsoluteCN} \\ --postoptimize & post.optimize & \Rfunction{runAbsoluteCN} \\ --modelhomozygous & model.homozygous & \Rfunction{runAbsoluteCN} \\ --model & model & \Rfunction{runAbsoluteCN} \\ --logratiocalibration & log.ratio.calibration & \Rfunction{runAbsoluteCN} \\ --outvcf & return.vcf & \Rfunction{predictSomatic} \\ --out -o & & \\ --seed & & \\ --version -v & & \\ --force -f & & \\ --help -h & & \\ \bottomrule \end{tabular} \end{table*} \begin{table*} \caption{Dx} \begin{tabular}{lll} \toprule Argument name & Corresponding PureCN argument & PureCN function \\ \midrule --rds & file.rds & \Rfunction{readCurationFile} \\ --callable & callable & \Rfunction{callMutationBurden} \\ --exclude & exclude & \Rfunction{callMutationBurden} \\ --out & & \\ --version -v & & \\ --force -f & & \\ --help -h & & \\ \bottomrule \end{tabular} \end{table*} \clearpage \subsection*{Session Info} <>= toLatex(sessionInfo()) @ \end{document}