--- title: "BubbleTree Tutorial" author: "Wei Zhu, Michael Kuziora, Todd Creasy, Brandon Higgs" date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{BubbleTree Tutorial} \usepackage[utf8]{inputenc} --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() ``` # Introduction To evaluate the aneuploidy and prevalence of clonal or quasiclonal tumors, we developed a novel tool to characterize the mosaic tumor genome on the basis of one major assumption: the genome of a heterogeneous multi-cell tumor biopsy can be sliced into a chain of segments that are characterized by homogeneous somatic copy number alternations (SCNAs) and B allele frequencies (BAFs). The model, termed BubbleTree, utilizes both SCNAs and the interconnected BAFs as markers of tumor clones to extract tumor clonality estimates. BubbleTree is an intuitive and powerful approach to jointly identify ASCN, tumor purity and (sub)clonality, which aims to improve upon current methods to characterize the tumor karyotypes and ultimately better inform cancer diagnosis, prognosis and treatment decisions. # Quickstart to Using BubbleTree To perform a BubbleTree analysis, data pertaining to the position and B allele frequency of heterozygous snps in the tumor sample and segmented copy number information including the position, number of markers/segment and log2 copy number ratio between tumor and normal samples must first be obtained. ## Preparing Data for BubbleTree BubbleTree was developed using both whole exome sequencing (WES) and whole genome sequening (WGS) NGS data from paired tumor/normal biopsies, but this model should also be applicable to array comparative genomic hybridization (aCGH) and single nucleotide polymorphism (SNP) array data. There are many methods for generating and processing sequencing data in preparation for BubbleTree analysis. In the following section we provide example workflows starting from WES NGS which can be adapted as needed to alternate inputs. ## Preparing Sequence Variation Data The primary BubbleTree requirement for sequence variant information is a `r Biocpkg("GRanges")` object containing the B alelle frequencies and genomic positions of variants known to be heterozygous in the paired normal sample. Mapped reads from tumor and normal tissue can be processed with mutation caller software such as VarScan or MUTECT. In this example, we use a hypothetical vcf file from VarScan output which contains mutation calls from both normal and tumor samples. ### Preparing BAF Data From VarScan Assume that you have loaded the data snp.dat like this: ```{r eval=FALSE} head(snp.dat) CHROM POS ID REF ALT QUAL FILTER LT.rna.dp LN.rna.dp ON.rna.dp OT.rna.dp BT.wes.dp LT.wes.dp LN.wes.dp ON.wes.dp 1 chr1 54757 rs202000650 T G . PASS NA NA NA NA NA NA NA NA 2 chr1 564636 . C T . PASS NA NA NA NA NA NA NA NA 3 chr1 564862 rs1988726 T C . PASS NA NA NA NA NA NA NA NA 4 chr1 564868 rs1832728 T C . PASS NA NA NA NA NA NA NA NA 5 chr1 565454 rs7349151 T C . PASS NA NA NA NA NA NA NA NA 6 chr1 565464 rs6594030 T C . PASS NA NA NA NA NA NA NA NA OT.wes.dp LT.wgs.dp LN.wgs.dp ON.wgs.dp OT.wgs.dp LT.rna.freq LN.rna.freq ON.rna.freq OT.rna.freq BT.wes.freq LT.wes.freq 1 NA 25 24 27 19 NA NA NA NA NA NA 2 NA 21 NA NA 14 NA NA NA NA NA NA 3 NA 10 15 55 13 NA NA NA NA NA NA 4 NA 10 12 60 14 NA NA NA NA NA NA 5 NA 21 14 26 24 NA NA NA NA NA NA 6 NA 25 16 33 29 NA NA NA NA NA NA LN.wes.freq ON.wes.freq OT.wes.freq LT.wgs.freq LN.wgs.freq ON.wgs.freq OT.wgs.freq 1 NA NA NA 0.2400 0.1667 0.2222 0.3684 2 NA NA NA 0.0000 NA NA 0.1429 3 NA NA NA 0.4000 0.5333 0.9091 0.7692 4 NA NA NA 0.5000 0.6667 0.9333 0.7857 5 NA NA NA 0.1429 0.3571 0.6538 0.6250 6 NA NA NA 0.2000 0.3750 0.7273 0.5862 ``` Identify the germline heterozygous loci: ```{rr} is.hetero <- function(x, a=0.3, b=0.7) { (x - a) * (b - x) >= 0 } wgs.snp.ss <- subset(snp.dat, ! CHROM %in% c("chrX", "chrY") & LN.wgs.dp >= 15 & ON.wgs.dp >=15 & is.hetero(LN.wgs.freq, 0.4, 0.6) & is.hetero(ON.wgs.freq, 0.4, 0.6)) ``` Then convert to the GRanges object: ```{rr} library(GenomicRanges) wgs.hetero.grs <- list() wgs.hetero.grs$lung <- with(wgs.snp.ss, GRanges(CHROM, IRanges(POS, POS), mcols=cbind(LT.wgs.dp, LT.wgs.freq))) wgs.hetero.grs$ovary <- with(wgs.snp.ss, GRanges(CHROM, IRanges(POS, POS), mcols=cbind(OT.wgs.dp, OT.wgs.freq))) names(elementMetadata(wgs.hetero.grs$lung)) <- names(elementMetadata(wgs.hetero.grs$ovary)) <- c("dp", "freq") ``` The B-allele frequency data is extracted using the Bioconductor package `r Biocpkg("VariantAnnotation")` and converted from string to numeric format. ### Preparing CNV Data from DNAcopy The object seg is the segment call output from DNAcopy and min.num here specifies the minimum segment size to keep ```{rr} library(GenomicRanges) min.num <- 10 cnv.gr <- with(subset(seg$output, num.mark >= min.num & ! chrom %in% c("chrX", "chrY")) , GRanges(chrom, IRanges(loc.start, loc.end), mcols=cbind(num.mark, seg.mean))) ``` Then merge the SNP and CNV GRanges objects. Example data in the desired format is provided as part of this package as GRanges objects and can be loaded as shown below. To utilize this vignette, you must first load BubbleTree below. You don't need to use "suppressMessages". ```{r} suppressMessages( library(BubbleTree) ) ``` allCall.lst is pre-calculated CNV data. allRBD.lst is simply the RBD data from below. ```{r} load(system.file("data", "allCall.lst.RData", package="BubbleTree")) head(allCall.lst[[1]]@rbd) ``` However, if you wish to create your own RBD object from your input, you would use the code below. There is test data available in this package that is used for demonstration purposes. ```{r} # load sample files load(system.file("data", "cnv.gr.rda", package="BubbleTree")) load(system.file("data", "snp.gr.rda", package="BubbleTree")) # load annotations load(system.file("data", "centromere.dat.rda", package="BubbleTree")) load(system.file("data", "cyto.gr.rda", package="BubbleTree")) load(system.file("data", "cancer.genes.minus2.rda", package="BubbleTree")) load(system.file("data", "vol.genes.rda", package="BubbleTree")) load(system.file("data", "gene.uni.clean.gr.rda", package="BubbleTree")) # initialize RBD object r <- new("RBD", unimodal.kurtosis=-0.1) # create new RBD object with GenomicRanges objects for SNPs and CNVs rbd <- makeRBD(r, snp.gr, cnv.gr) head(rbd) # create a new prediction btreepredictor <- new("BTreePredictor", rbd=rbd, max.ploidy=6, prev.grid=seq(0.2,1, by=0.01)) pred <- btpredict(btreepredictor) # create rbd plot btreeplotter <- new("BTreePlotter", max.ploidy=5, max.size=10) btree <- drawBTree(btreeplotter, pred@rbd) print(btree) # create rbd.adj plot btreeplotter <- new("BTreePlotter", branch.col="gray50") btree <- drawBTree(btreeplotter, pred@rbd.adj) print(btree) # create a combined plot with rbd and rbd.adj that shows the arrows indicating change btreeplotter <- new("BTreePlotter", max.ploidy=5, max.size=10) arrows <- trackBTree(btreeplotter, pred@rbd, pred@rbd.adj, min.srcSize=0.01, min.trtSize=0.01) btree <- drawBTree(btreeplotter, pred@rbd) + arrows print(btree) # create a plot with overlays of significant genes btreeplotter <- new("BTreePlotter", branch.col="gray50") annotator <- new("Annotate") comm <- btcompare(vol.genes, cancer.genes.minus2) sample.name <- "22_cnv_snv" btree <- drawBTree(btreeplotter, pred@rbd.adj) + ggplot2::labs(title=sprintf("%s (%s)", sample.name, info(pred))) out <- pred@result$dist %>% filter(seg.size >= 0.1 ) %>% arrange(gtools::mixedorder(as.character(seqnames)), start) ann <- with(out, { annoByGenesAndCyto(annotator, as.character(out$seqnames), as.numeric(out$start), as.numeric(out$end), comm$comm, gene.uni.clean.gr=gene.uni.clean.gr, cyto.gr=cyto.gr) }) out$cyto <- ann$cyto out$genes <- ann$ann btree <- btree + drawFeatures(btreeplotter, out) print(btree) # print out purity and ploidy values info <- info(pred) cat("\nPurity/Ploidy: ", info, "\n") ``` The remaining datasets used to support the CNV data display on the BubbleTree plots. ```{r} load(system.file("data", "cancer.genes.minus2.rda", package="BubbleTree")) load(system.file("data", "vol.genes.rda", package="BubbleTree")) load(system.file("data", "gene.uni.clean.gr.rda", package="BubbleTree")) load(system.file("data", "cyto.gr.rda", package="BubbleTree")) load(system.file("data", "centromere.dat.rda", package="BubbleTree")) load(system.file("data", "all.somatic.lst.RData", package="BubbleTree")) load(system.file("data", "allHetero.lst.RData", package="BubbleTree")) load(system.file("data", "allCNV.lst.RData", package="BubbleTree")) load(system.file("data", "hg19.seqinfo.rda", package="BubbleTree")) ``` ```{r} # lists of 379 known cancer genes head(cancer.genes.minus2) # another list of 105 known cancer genes head(vol.genes) # full gene annotation Grange object head(gene.uni.clean.gr) # cytoband coordinate data head(cyto.gr) # centromere coordinate data head(centromere.dat) # SNV location data head(all.somatic.lst, n=1L) # sequence variants head(allHetero.lst, n=1L) # copy number variation data head(allCNV.lst, n=1L) # hg19 sequence data hg19.seqinfo ``` # Main Bubbletree Functions ## BubbleTree model and diagram BubbleTree is a model based on three valid assumptions: 1) the paired normal specimen expresses the common diploid state, 2) variant sites are bi-allelic, and 3) genome segments (rather than the whole genome) with homogeneous copy number ratio and BAFs, exist in the profiled tumor genome. The first two assumptions generally hold, whereas the last homogeneity assumption can also be satisfied even in the case of a complex tumor clonal structure. As the three assumptions are all generally plausible, we therefore developed a model for the BubbleTree diagram. For one homogenous genomic segment (x:y;p), we have, Expected copy number, (CN)=2×(1-p)+(x+y)×p Copy Ratio, R=(CN)/2=(1-p)+(x+y)/2×p (1) B allele frequency, BAF=(y×p+1×(1-p))/((x+y)×p+2×(1-p)) and the homozygous-deviation score (HDS), HDS= ∣BAF-0.5∣=(p×∣y-x∣)/(2×[(x+y)×p+2×(1-p)]) (2) Based on equations (1) and (2), we are able to calculate an R score (copy ratio) and HDS for a segment (x:y; p). For example, (0:1; 0.75) will provide 0.625 and 0.3 for the R scores and HDS, respectively. ## Description of the BubblePlot Graph ### The Branches ```{r} btreeplotter <- new("BTreePlotter", max.ploidy=5, max.size=10) print(btreeplotter@branches) ``` The below plot introduces the relationship between HDS and R score (copy number ratio), both used to elucidate the tumor cell prevalence, ploidy state, and clonality for a tumor sample. Generally, the R score indicates the copy number change, ranging from 0 (homozygous deletion) to 3 (hexaploidy) or higher, while the HDS represents LOH, ranging from values of 0 to 0.5 (i.e., LOH with 100% prevalence). Each branch in the diagram starts at the root (1,0), a value of HDS=0 and R score=1. Namely, a diploid heterozygous genotype segment has a copy number ratio, or R score of 1 (tumor DNA copies=2; normal DNA copies=2, so 2/2=1) with no LOH (HDS=0) and is indicated with a genotype of AB, where the A allele is from one parent and the B allele is from the other parent presumably. Then from the root (1,0), the segment prevalence values are provided in increasing increments of 10%, with each branch representing a different ploidy state. As the values increase along the y-axis, the occurrence of LOH increases, such that on the AA/BB branch at HDS=0.5 and R score=1, this indicates a disomy state with LOH and 100% prevalence for the segment.