--- title: "CBN2Path Vignette" authors: "William Choi-Kim and Sayed-Rzgar Hosseini" package: CBN2Path output: BiocStyle::html_document vignette: > %\VignetteIndexEntry{B-CBN Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Introduction Tumorigenesis is a stepwise process driven by a sequence of molecular changes that are described as pathways of cancer progression. Conjunctive Bayesian Networks (CBN) are probabilistic graphical models designed for the analysis and modeling of these pathways [1]. CBN models have evolved into different varieties such as CT-CBN [2], H-CBN [3], B-CBN [4], and R-CBN [5], each addressing different aspects of this task. However, the software corresponding to these methods is not well integrated because they are implemented in different languages with heterogeneous input and output formats. This necessitates a unifying platform that integrates these models and enables the standardization of input and output formats. Evam-tools [6] is an R package that takes the initial steps towards this end. However, it does not include the B-CBN model or the recently developed R-CBN algorithm, which focuses on robust inference of cancer progression pathways [5]. Importantly, the B-CBN and R-CBN algorithms for pathway quantification necessitate exhaustive consideration and weighting of all potential dependency structures (posets) within the mutational quartets. This requires reimplementation of the CBN models and adjustment of downstream pathway analysis and modeling functions. Therefore, here we introduce the **CBN2Path** R package that not only includes the original implementation of the CBN models (e.g., CT-CBN and H-CBN) in a unifying interface but also accommodates the necessary modifications to support robust CBN algorithms (e.g., B-CBN and R-CBN). Importantly, CBN2Path includes a collection of functions required to quantify predictability [7], analyze robustness [5], and visualize mutational pathways from pre-processed cross-sectional genomic data. It is important to note that the R-CBN method has great potential for wide application in future predictive models because of its unique ability to offer an optimal balance between robustness and predictability [5]. Thus, we anticipate that CBN2Path will be a commonly used package in the field, particularly by providing a platform to facilitate future applications of the R-CBN model. ## Installation The package can be installed as follows: ```{r,eval=FALSE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install("CBN2Path") ``` ```{r} library(CBN2Path) ``` ```{r include=FALSE} library(TCGAbiolinks) ``` ## Cite our work If you use the CBN2Path package, please cite the paper formally as follows: Choi-Kim W and Hosseini SR. CBN2Path: an R/Bioconductor package for the analysis of cancer progression pathways using Conjunctive Bayesian Networks. F1000Research 2025, 14:834 (https://doi.org/10.12688/f1000research.168810.1) ## Preparing the genotype matrix The genotype matrix is a binary matrix in which each row represents a sample and each column represents a gene. The element (i,j) of the matrix is 1 if the i-th sample contains a nonsense or missense mutation in the j-th gene, and zero otherwise. In the example below, a genotype matrix is generated for 414 Bladder Urothelial Carcinoma samples in the "TCGA-BLCA" data. Note that we in this example we have used "TP53", "ARID1A", "KDM6A" and "PIK3CA" as our genes of interest to genotype the samples. ```{r} # Obtaining the "TCGA-BLCA" data rawData <- getRawTCGAData("TCGA-BLCA") # Specifying the genes genes <- c("TP53","ARID1A","KDM6A","PIK3CA") # Generating the genotype matrix gMat <- generateTCGAMatrix(rawData, genes) ``` ## The CT-CBN model CBN2Path provides the R interface for the continuous-time CBN model (CT-CBN), which was originally implemented in C programming language [2]. CT-CBN needs the following three inputs: 1. **`numMutations`:** The number of mutations to be considered. 2. **`poset`:** The partially ordered set (poset), which is represented as a two-column matrix each row of which indicates that mutation in the first column must occur before the mutation in the second column. 3. **`genotypeMatrix`:** a binary matrix with **n** rows and **m** columns. Each row corresponds to a given genotype in the sample. Below, you can see an example on how these inputs are prepared to be used in the original implementation of the ctcbn model (**`ctcbnSingle`**): ```{r} # The poset dag <- matrix(c(1, 1, 1, 4, 2, 3, 4, 3), 4, 2) # The genotype matrix: # gMat, which was generated using the generateTCGAMatrix function (see above). # Preparing input of the ct-cbn/h-cbn methods bc <- Spock$new( poset = dag, numMutations = 4, genotypeMatrix = gMat ) # Running the ct-cbn model resultsC <- ctcbnSingle(bc) ``` The directed acyclic graph representation of the input poset can be visualized using the **`visualizeCBNModel`** function as follows: ```{r fig.width=4.25, fig.height=4.25} visualizeCBNModel(dag) ``` Furthermore, the Lambda parameters and the corresponding log likelihood can be obtained as: ```{r} mlLambda <- resultsC$lambda logLikelihood <- resultsC$summary["Loglike"] ``` The **`genotypeMatrixMutator`** function generates a modified genotype matrix (`gMatMut`) by subjecting the original genotype matrix (`gMat`) to false-positive and false-negative error rates of **0.3** and **0.2**, respectively: ```{r} gMatMut <- genotypeMatrixMutator(gMat, 0.3, 0.2) ``` and then you can rerun the ct-cbn model using the mutated genotype matrix as follows: ```{r} # The poset dag <- matrix(c(1, 1, 1, 4, 2, 3, 4, 3), 4, 2) # Preparing the inputs of the ct-cbn method bc <- Spock$new( poset = dag, numMutations = 4, genotypeMatrix = gMatMut ) # Running the ct-cbn model resultsCMut <- ctcbnSingle(bc) ``` Note that if the posets and genotype data are stored in the original format needed in the C implementation of the CBN models, you can preprocess those files using **`readPoset`** and **`readPattern`** functions in the CBN2Path package. You can see an example below, where the number of mutations is **10**: ```{r} examplePath <- getExamples()[1] bc <- Spock$new( poset = readPoset(examplePath)$sets, numMutations = readPoset(examplePath)$mutations, genotypeMatrix = readPattern(examplePath) ) resultsC2 <- ctcbnSingle(bc) ``` Finally, you can obtain and store the results using a list of posets (instead of a single poset) as the input of the model using the **`ctcbn`** function. In the example below, we will consider the set of all potential (transitively-closed) DAGs of size four (219 unique posets) as our list of posets that we use as the input of the `ctcbn` function: ```{r} posets <- readRDS(system.file("extdata", "Posets.rds", package = "CBN2Path")) bc <- Spock$new( poset = posets, numMutations = 4, genotypeMatrix = gMat ) resultsC3 <- ctcbn(bc, nCores = 3) ``` You can see that the result is a list of 219 components each including the estimated parameters corresponding to one of the 219 posets in the input list. This strategy is utilized in the R-CBN and B-CBN models for quantifying the pathway probabilities. You can obtain the log likelihood corresponding to each poset as follows: ```{r} logLik <- sapply(resultsC3, \(x) x$summary["Loglike"]) ``` Below, you can see that the maximum-likelihood poset according to the CT-CBN model in this example is the empty poset. ```{r} indx <- which.max(logLik) mlPosetC <- posets[[indx]] ``` ```{r fig.width=4.25, fig.height=4.25} visualizeCBNModel(mlPosetC) ``` ## The H-CBN model The input/output structure of the H-CBN model (**`hcbnSingle`** and **`hcbn`**) [3] is exactly the same as in the CT-CBN model (**`ctcbnSingle`** and **`ctcbn`**) described above. ```{r} # The poset dag <- matrix(c(1, 1, 1, 4, 2, 3, 4, 3), 4, 2) # Preparing the inputs of the h-cbn method bc <- Spock$new( poset = dag, numMutations = 4, genotypeMatrix = gMat ) # Running the h-cbn model resultsH <- hcbnSingle(bc) ``` The Lambda parameters and the corresponding log likelihood can be obtained as: ```{r} mlLambdaH <- resultsH$lambda logLikelihoodH <- resultsH$summary["Loglike"] ``` You can also rerun the h-cbn model using the mutated genotype matrix as follows: ```{r} # The poset dag <- matrix(c(3, 3, 4, 4, 1, 2, 1, 2), 4, 2) # Preparing the inputs of the h-cbn method bc <- Spock$new( poset = dag, numMutations = 4, genotypeMatrix = gMatMut ) # Running the h-cbn model resultsHMut <- hcbnSingle(bc) ``` Again the poset and genotype files stored in the original formats can be processed using **`readPoset`** and **`readPattern`** functions in the CBN2Path package. ```{r} examplePath <- getExamples()[1] bc <- Spock$new( poset = readPoset(examplePath)$sets, numMutations = readPoset(examplePath)$mutations, genotypeMatrix = readPattern(examplePath) ) resultsH2 <- hcbnSingle(bc) ``` Finally, you can obtain and store the results using a list of posets (instead of a single poset) as the input of the model using the **`hcbn`** function. ```{r} posets <- readRDS(system.file("extdata", "Posets.rds", package = "CBN2Path")) bc <- Spock$new( poset = posets, numMutations = 4, genotypeMatrix = gMat ) resultsH3 <- hcbn(bc, nCores = 3) ``` You can obtain the log likelihood corresponding to each poset as follows: ```{r} logLikH <- sapply(resultsH3, \(x) x$summary["Loglike"]) ``` Below, you can see that in this example, the maximum-likelihood poset based on H-CBN contains an edge in contrast to the CT-CBN model, which outputs an empty poset as the MLE poset: ```{r} indx <- which.max(logLikH) mlPosetH <- posets[[indx]] ``` ```{r fig.width=4.25, fig.height=4.25} visualizeCBNModel(mlPosetH) ``` ## Analysis of cancer progression pathways One of the important feature of the CBN2Path package is its emphasis on mutational pathway analyses and modeling. In this section, we will work with a set of functions that enable quantification, analysis and visualization of the mutational pathways. There are two approaches to quantify the pathway probabilities: i) The first approach is to use the output of the ct-cbn or h-cbn methods (namely the estimated lambda parameters and the inferred ML poset) as input of the **`pathProbCBN`** function. This method is generic as it can be used for every number of mutations. ii) The second approach works only for mutational quartets (**n=4**) and uses the genotypic data directly as the input. In this setting, each CBN model has its own pathway quantification functions: **`pathProbQuartetCTCBN`**, **`pathProbQuartetHCBN`**, **`pathProbQuartetRCBN`**, and **`pathProbQuartetBCBN`**. As examples for the first approach, let's use the `resultsC2` and `resultsH2` that we obtained in the previous section by learning respectively, ct-cbn and h-cbn models on genotypic data with **n=10** mutations. First, we need to obtain the estimated lambda parameters and the inferred DAG and then use them as the input of the **`pathProbCBN`** function as follows: ```{r,eval=FALSE} lambdaC <- as.numeric(resultsC2$lambda) lambdaH <- as.numeric(resultsH2$lambda) dagC <- resultsC2$poset$sets dagH <- resultsH2$poset$sets probC <- pathProbCBN(dagC, lambdaC, 10) probH <- pathProbCBN(dagH, lambdaH, 10) ``` Note that in the above code, probabilities of 10!=3,628,800 pathways need to be calculated, which is extremely time-consuming, so we have not executed this chunk of the code. Now, let's try the second approach using both the `gMat` and `gMatMut` genotype matrices. Note that in these functions, the number of columns in the input genotype matrix must always be four. ```{r} probC1 <- pathProbQuartetCTCBN(gMat) probC2 <- pathProbQuartetCTCBN(gMatMut) probH1 <- pathProbQuartetHCBN(gMat) probH2 <- pathProbQuartetHCBN(gMatMut) ``` You can visualize the 24 pathways and their associated probabilities using **`visualizeProbabilities`** function. In the figures below, you can compare the pathway probability distributions (quantified using CT-CBN model) before and after errors: ```{r} visualizeProbabilities(probC1) visualizeProbabilities(probC2) ``` You can see before adding errors (`probC1`), only four pathways are feasible with non-zero probabilities, but in the presence of error (`probC2`) all pathways are feasible and so the probability is more uniformly distributed among the pathways. We can now visualize the pathways with gene names and their probability distributions (`probC2`) as follows: ```{r} visualizeProbabilities(probC2, geneNames = genes) ``` Similarly, in the figures below, we can compare the pathway probability distributions (quantified using H-CBN model) before (`probH1`) and after errors (`probH2`): ```{r} visualizeProbabilities(probH1) visualizeProbabilities(probH2) ``` We can see that under the H-CBN model, the pathway probabilities before and after errors stay more similar than what we observed under the CT-CBN model. We can formally quantify to what extent the two probability distribution differ using the Jensen-Shannon Divergence (JSD; implemented as **`jensenShannonDivergence`** function). ```{r} jsdC <- jensenShannonDivergence(probC1, probC2) jsdH <- jensenShannonDivergence(probH1, probH2) jsdC jsdH ``` We can also quantify the predictability for a given pathway probability distribution as described in [7] using the **`predictability`** function. We can see that the predictability after errors drops substantially under CT-CBN: ```{r} predC1 <- predictability(probC1, 4) predC2 <- predictability(probC2, 4) predC1 predC2 predC1 - predC2 ``` In contrast, under H-CBN, the predictability after errors, even gets slightly higher than the one obtained in the error-free setting: ```{r} predH1 <- predictability(probH1, 4) predH2 <- predictability(probH2, 4) predH1 predH2 predH1 - predH2 ``` Finally, we can compute the pathway compatibility scores both for `gMat` and `gMatMut`genotype matrices using the`pathwayCompatibilityQuartet` function as follows: ```{r} pathwayC1 <- pathwayCompatibilityQuartet(gMat) pathwayC2 <- pathwayCompatibilityQuartet(gMatMut) ``` The Spearman's correlation coefficient between the pathway compatibility and the pathway probabilities quantified under CT-CBN or H-CBN can be quantified as follows: ```{r} rhoC1 <- cor(pathwayC1, probC1, method = "spearman") rhoC2 <- cor(pathwayC2, probC2, method = "spearman") rhoC1 rhoC2 rhoH1 <- cor(pathwayC1, probH1, method = "spearman") rhoH2 <- cor(pathwayC2, probH2, method = "spearman") rhoH1 rhoH2 ``` ## The R-CBN algorithm The R-CBN algorithm [5] for quantifying pathway probability distributions is implemented in the **`pathProbQuartetRCBN`** function, whose input/output structure is similar to what we described above for the CT-CBN and H-CBN. However, there are multiple functions, which are called inside the **`pathProbQuartetRCBN`** function, and so the user do not need to directly work with (e.g. **`pathNormalization`**, **`pathwayWeightingRCBN`**, **`edgeMarginalized`**, **`pathEdgeMapper`**, and **`posetWeightingRCBN`**). The **`pathProbQuartetRCBN`** function also receives a four-column binary genotype matrix as the only input, and outputs the corresponding pathway probability distribution. Let's quantify the pathway probability distributions before and after adding errors: ```{r} probR1 <- pathProbQuartetRCBN(gMat) probR2 <- pathProbQuartetRCBN(gMatMut) ``` In the figures below, you can compare the pathway probability distributions (quantified using R-CBN model) before and after errors: ```{r} visualizeProbabilities(probR1) visualizeProbabilities(probR2) ``` Similar to what we described before, the Jensen-Shannon Divergence between the two distributions can be quantified as: ```{r} jsdR <- jensenShannonDivergence(probR1, probR2) jsdR ``` You can see that the JDS value under R-CBN in this example (**0.05**), is considerably smaller than those of CT-CBN (**0.41**) and H-CBN (**0.31**). The predictability values can also be compared as follows: ```{r} predR1 <- predictability(probR1, 4) predR2 <- predictability(probR2, 4) predR1 predR2 predR1 - predR2 ``` Finally, the Spearman's correlation coefficient between the pathway compatibility and the pathway probabilities quantified under R-CBN can be quantified as follows: ```{r} rhoR1 <- cor(pathwayC1, probR1, method = "spearman") rhoR2 <- cor(pathwayC2, probR2, method = "spearman") rhoR1 rhoR2 ``` ## The B-CBN method The workflow for the B-CBN algorithm [4], which is implemented in the **`pathProbQuartetBCBN`** function, is also similar to that of R-CBN. The **`pathProbQuartetBCBN`** function also receives a four-column binary genotype matrix as the only input, and outputs the corresponding pathway probability distribution. Again, let's quantify the pathway probability distributions before and after the errors: ```{r, warning=FALSE, results='hide',eval=FALSE} probB1 <- pathProbQuartetBCBN(gMat) probB2 <- pathProbQuartetBCBN(gMatMut) ``` In the figures below, you can compare the pathway probability distributions (quantified using B-CBN model) before and after errors: ```{r,eval=FALSE} visualizeProbabilities(probB1) visualizeProbabilities(probB2) ``` Similar to what we described before, the Jensen-Shannon Divergence between the two distributions can be quantified as: ```{r,eval=FALSE} jsdB <- jensenShannonDivergence(probB1, probB2) jsdB ``` The predictability values can also be compared as follows: ```{r,eval=FALSE} predB1 <- predictability(probB1, 4) predB2 <- predictability(probB2, 4) predB1 predB2 predB1 - predB2 ``` Finally, the Spearman's correlation coefficient between the pathway compatibility and the pathway probabilities quantified under B-CBN can be quantified as follows: ```{r,eval=FALSE} rhoB1 <- cor(pathwayC1, probB1, method = "spearman") rhoB2 <- cor(pathwayC2, probB2, method = "spearman") rhoB1 rhoB2 ``` ## Analysis of fitness landscapes If we can establish a fitness landscape by directly assigning fitness to all potential genotypes, we can employ evolutionary models to compute the mutational pathway probabilities under the Strong-Selection Weak-Mutation (SSWM) assumption [7], which is implemented in the **`pathProbSSWM`** function. In case of 4 mutations, we will have 16 genotypes, so a fitness vector length of 16 is needed each element of which corresponds to a given genotype that can be determined by the **`generateMatrixGenotypes`** function. For example: ```{r} fitnessVector <- c(0, 0.1, 0.2, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0, 0.6, 0.4, 0.3, 0.2, 1) g <- generateMatrixGenotypes(4) ``` The 7-th genotype in the `g` vector is "1 0 1 0" and its corresponding fitness is `fitnessVector[7] = 0.3`. The fitness landscape can be visualized as follows: ```{r} visualizeFitnessLandscape(fitnessVector) ``` The pathway probability distribution under the SSWM-based model can be quantified as: ```{r} probW <- pathProbSSWM(fitnessVector,4) ``` Moreover, the pathway probabilities can be visualized as: ```{r} visualizeProbabilities(probW) ``` and finally the associated predictability can be quantified as: ```{r fig.width=4.25, fig.height=4.25} predW <- predictability(probW, 4) ``` ## Session Info ```{r} sessionInfo() ``` ## References [1] Beerenwinkel, et al. Conjunctive Bayesian Networks. Bernoulli, 13(4):893–909, November 2007. ISSN 1350-7265. doi: . [2] Beerenwinkel and Sullivant. Markov models for accumulating mutations. Biometrika, 96 (3):645–661, September 2009. ISSN 0006-3444, 1464-3510. doi: . [3] Gerstung, et al. Quantifying cancer progression with conjunctive Bayesian networks. Bioinformatics (Oxford, England), 25(21):2809–2815, November 2009. doi: . [4] Sakoparnig and Beerenwinkel. Efficient sampling for Bayesian inference of conjunctive Bayesian networks. Bioinformatics, 28(18):2318–2324, September 2012. ISSN 1367-4811, 1367-4803. doi: . [5] Hosseini. Robust inference of cancer progression pathways using Conjunctive Bayesian Networks. BioRxiv, July 2025. doi: . [6] Diaz-Uriarte and Herrera-Nieto. EvAM-Tools: tools for evolutionary accumulation and cancer progression models. Bioinformatics, 38(24): 5457–5459, December 2022. ISSN 1367-4803, 1367-4811. doi: . [7] Hosseini, et al. Estimating the predictability of cancer evolution. Bioinformatics, 35 (14):i389–i397, July 2019. ISSN 1367-4803, 1367-4811. doi: .