ClonalSim is an R/Bioconductor package for simulating tumor clonal evolution with realistic sequencing noise. It generates mutational profiles of heterogeneous tumor samples with hierarchical clonal structure, making it ideal for:
ClonalSim implements a two-stage realistic noise model:
The main function is simulateTumor(), which generates a complete simulation:
# Set seed for reproducibility
set.seed(123)
sim <- simulateTumor(
subclone_freqs = c(0.15, 0.25, 0.30, 0.30),
n_mut_per_clone = c(20, 25, 30, 15),
n_mut_founder = 10
)
# Display summary
sim
#> ClonalSimData object
#> ==========================================
#> Number of mutations: 135
#> Number of clones: 4
#> Clone frequencies: 0.15, 0.25, 0.3, 0.3
#> Tumor purity: 1
#>
#> Mutation types:
#>
#> founder private shared
#> 10 90 35
#>
#> Sequencing depth:
#> Mean: 98.29
#> Range: 50 - 173
#>
#> VAF summary (observed):
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.04545 0.21603 0.33871 0.42137 0.58705 1.00000
#>
#> Metadata:
#> Created: 2026-04-26 16:57:11.78265
#> Package version: 0.99.8# Get mutation data
mutations <- getMutations(sim)
head(mutations)
#> Mutation Chromosome Position Ref Alt True_VAF VAF Depth Alt_reads
#> 1 Founder_1 chr2 3901294 T G 0.99 0.9591837 98 94
#> 2 Founder_2 chr12 156958367 A A 0.99 1.0000000 70 70
#> 3 Founder_3 chr15 113008658 T T 0.99 1.0000000 105 105
#> 4 Founder_4 chr17 55250578 G T 0.99 1.0000000 77 77
#> 5 Founder_5 chr3 191364048 G A 0.99 0.9871795 78 77
#> 6 Founder_6 chr7 110583980 G T 0.99 0.9901961 102 101
#> Clone Type Clone_IDs
#> 1 Founder founder 1,2,3,4
#> 2 Founder founder 1,2,3,4
#> 3 Founder founder 1,2,3,4
#> 4 Founder founder 1,2,3,4
#> 5 Founder founder 1,2,3,4
#> 6 Founder founder 1,2,3,4
# Get simulation parameters
params <- getSimParams(sim)
params$subclone_freqs
#> Clone1 Clone2 Clone3 Clone4
#> 0.15 0.25 0.30 0.30
# Get true vs observed VAF
true_vaf <- getTrueVAF(sim)
obs_vaf <- getObservedVAF(sim)
# Compare
summary(true_vaf)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.08452 0.24178 0.32869 0.42351 0.59163 0.99000
summary(obs_vaf)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.04545 0.21603 0.33871 0.42137 0.58705 1.00000ClonalSim provides built-in plotting functions. Note: Make sure ggplot2 is loaded:
VAF density plot showing mixed tumor sample
The density plot shows the distribution of variant allele frequencies as they would appear in real sequencing data. Expected peaks correspond to clone frequencies.
VAF scatter plot colored by mutation type
Sequencing depth distribution
Clonal matrix showing mutation presence
Intra-tumor heterogeneity is modeled using a Beta distribution, which is more appropriate than Gaussian for frequency data (bounded 0-1).
# High heterogeneity (low concentration)
set.seed(123)
sim_high_het <- simulateTumor(
subclone_freqs = c(0.3, 0.4, 0.3),
n_mut_per_clone = c(30, 40, 30),
biological_noise = list(enabled = TRUE, concentration = 20)
)
# Low heterogeneity (high concentration)
set.seed(123)
sim_low_het <- simulateTumor(
subclone_freqs = c(0.3, 0.4, 0.3),
n_mut_per_clone = c(30, 40, 30),
biological_noise = list(enabled = TRUE, concentration = 100)
)
# Compare VAF spread
par(mfrow = c(1, 2))
hist(getTrueVAF(sim_high_het), main = "High Heterogeneity\n(concentration = 20)",
xlab = "True VAF", breaks = 30, col = "skyblue")
hist(getTrueVAF(sim_low_het), main = "Low Heterogeneity\n(concentration = 100)",
xlab = "True VAF", breaks = 30, col = "lightcoral")Effect of biological noise concentration parameter
Real NGS data shows overdispersion (variance > mean) due to GC bias, mappability, and PCR artifacts. ClonalSim uses negative binomial distribution:
# Realistic (negative binomial)
set.seed(123)
sim_nb <- simulateTumor(
sequencing_noise = list(
mean_depth = 100,
depth_variation = "negative_binomial",
depth_dispersion = 20
)
)
# Unrealistic (Poisson)
set.seed(124)
sim_poisson <- simulateTumor(
sequencing_noise = list(
mean_depth = 100,
depth_variation = "poisson"
)
)
par(mfrow = c(1, 2))
hist(getMutations(sim_nb)$Depth, main = "Negative Binomial\n(realistic)",
xlab = "Depth", breaks = 30, col = "skyblue")
hist(getMutations(sim_poisson)$Depth, main = "Poisson\n(unrealistic)",
xlab = "Depth", breaks = 30, col = "lightcoral")Depth distributions: negative binomial vs Poisson
Each sequencing read is independently sampled, introducing stochastic variation. This is especially important at low depth or low VAF:
# With binomial sampling (realistic)
set.seed(123)
sim_binom <- simulateTumor(
sequencing_noise = list(binomial_sampling = TRUE)
)
# Without binomial sampling (deterministic)
set.seed(123)
sim_det <- simulateTumor(
sequencing_noise = list(binomial_sampling = FALSE)
)
# Compare True_VAF vs observed VAF
par(mfrow = c(1, 2))
with(getMutations(sim_binom), {
plot(True_VAF, VAF, main = "With Binomial Sampling",
xlab = "True VAF", ylab = "Observed VAF", pch = 16, col = rgb(0,0,1,0.3))
abline(0, 1, col = "red", lty = 2)
})
with(getMutations(sim_det), {
plot(True_VAF, VAF, main = "Without Binomial Sampling",
xlab = "True VAF", ylab = "Observed VAF", pch = 16, col = rgb(0,0,1,0.3))
abline(0, 1, col = "red", lty = 2)
})Stochastic variation from binomial sampling
Many clinical samples have significant normal cell contamination:
set.seed(123)
sim_low_purity <- simulateTumor(
subclone_freqs = c(0.05, 0.10, 0.15), # Sum = 0.30 (30% tumor)
n_mut_per_clone = c(20, 25, 30)
)
cat("Tumor purity:", sum(getSimParams(sim_low_purity)$subclone_freqs), "\n")
#> Tumor purity: 0.3
plot(sim_low_purity, type = "vaf_density")Deep targeted sequencing with uniform coverage:
set.seed(123)
sim_deep <- simulateTumor(
sequencing_noise = list(
enabled = TRUE,
mean_depth = 500,
depth_dispersion = 100, # More uniform
error_rate = 0.0005
)
)
summary(getMutations(sim_deep)$Depth)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 393.0 469.5 502.0 501.2 532.0 656.0
plot(sim_deep, type = "depth_histogram")Exome sequencing with variable coverage:
For testing algorithms without noise:
set.seed(123)
sim_ideal <- simulateTumor(
biological_noise = list(enabled = FALSE),
sequencing_noise = list(enabled = FALSE)
)
# VAF exactly matches expected frequencies
head(getMutations(sim_ideal)[, c("Clone", "True_VAF", "VAF")])
#> Clone True_VAF VAF
#> 1 Founder 1 1
#> 2 Founder 1 1
#> 3 Founder 1 1
#> 4 Founder 1 1
#> 5 Founder 1 1
#> 6 Founder 1 1Real tumor sequencing data contains both somatic and germline variants. Germline variants (heterozygous diploid) appear at VAF ~0.5 regardless of tumor purity because they are heterozygous in both tumor and normal cells:
# 70% purity tumor with germline variants
set.seed(789)
sim_germline <- simulateTumor(
subclone_freqs = c(0.3, 0.4), # 70% tumor purity
n_mut_per_clone = c(40, 50),
germline_variants = list(
enabled = TRUE,
n_variants = 150, # Number of germline SNPs
vaf_expected = 0.5 # Heterozygous diploid
)
)
# Check mutation types
table(getMutations(sim_germline)$Type)
#>
#> founder germline private shared
#> 10 150 90 8
# Compare VAFs
germline_vafs <- getMutations(sim_germline)[getMutations(sim_germline)$Type == "germline", "VAF"]
somatic_vafs <- getMutations(sim_germline)[getMutations(sim_germline)$Type != "germline", "VAF"]
cat("Germline VAF (expected ~0.5):", round(mean(germline_vafs), 3), "\n")
#> Germline VAF (expected ~0.5): 0.51
cat("Somatic VAF mean:", round(mean(somatic_vafs), 3), "\n")
#> Somatic VAF mean: 0.419
# Plot showing germline peak
plot(sim_germline, type = "vaf_density")VAF distribution showing germline peak at 0.5
Key biological insight: Unlike somatic mutations (whose VAF depends on tumor purity), germline variants maintain VAF ~0.5 because:
This creates a characteristic peak at VAF = 0.5 in real sequencing data that is independent of tumor cellularity.
library(GenomicRanges)
gr <- toGRanges(sim)
gr
#> GRanges object with 135 ranges and 10 metadata columns:
#> seqnames ranges strand | Mutation Ref Alt
#> <Rle> <IRanges> <Rle> | <character> <character> <character>
#> [1] chr2 3901294 * | Founder_1 T G
#> [2] chr12 156958367 * | Founder_2 A A
#> [3] chr15 113008658 * | Founder_3 T T
#> [4] chr17 55250578 * | Founder_4 G T
#> [5] chr3 191364048 * | Founder_5 G A
#> ... ... ... ... . ... ... ...
#> [131] chr14 151818959 * | Clone4_mut11 A A
#> [132] chr21 30935466 * | Clone4_mut12 T T
#> [133] chr17 116637069 * | Clone4_mut13 C C
#> [134] chr7 177080573 * | Clone4_mut14 G C
#> [135] chr7 75923037 * | Clone4_mut15 A C
#> True_VAF VAF Depth Alt_reads Clone Type
#> <numeric> <numeric> <integer> <integer> <character> <character>
#> [1] 0.99 0.959184 98 94 Founder founder
#> [2] 0.99 1.000000 70 70 Founder founder
#> [3] 0.99 1.000000 105 105 Founder founder
#> [4] 0.99 1.000000 77 77 Founder founder
#> [5] 0.99 0.987179 78 77 Founder founder
#> ... ... ... ... ... ... ...
#> [131] 0.278236 0.295082 122 36 Clone4 private
#> [132] 0.305871 0.302326 86 26 Clone4 private
#> [133] 0.378219 0.451613 93 42 Clone4 private
#> [134] 0.244493 0.236220 127 30 Clone4 private
#> [135] 0.324152 0.308824 136 42 Clone4 private
#> Clone_IDs
#> <character>
#> [1] 1,2,3,4
#> [2] 1,2,3,4
#> [3] 1,2,3,4
#> [4] 1,2,3,4
#> [5] 1,2,3,4
#> ... ...
#> [131] 4
#> [132] 4
#> [133] 4
#> [134] 4
#> [135] 4
#> -------
#> seqinfo: 22 sequences from an unspecified genome; no seqlengths
# Access metadata
mcols(gr)[1:5, ]
#> DataFrame with 5 rows and 10 columns
#> Mutation Ref Alt True_VAF VAF Depth Alt_reads
#> <character> <character> <character> <numeric> <numeric> <integer> <integer>
#> 1 Founder_1 T G 0.99 0.959184 98 94
#> 2 Founder_2 A A 0.99 1.000000 70 70
#> 3 Founder_3 T T 0.99 1.000000 105 105
#> 4 Founder_4 G T 0.99 1.000000 77 77
#> 5 Founder_5 G A 0.99 0.987179 78 77
#> Clone Type Clone_IDs
#> <character> <character> <character>
#> 1 Founder founder 1,2,3,4
#> 2 Founder founder 1,2,3,4
#> 3 Founder founder 1,2,3,4
#> 4 Founder founder 1,2,3,4
#> 5 Founder founder 1,2,3,4
# Subset by chromosome
gr_chr1 <- gr[seqnames(gr) == "chr1"]
length(gr_chr1)
#> [1] 4# Get VRanges object
vr <- suppressWarnings(toVCF(sim, sample_name = "TumorSample"))
vr
#> VRanges object with 99 ranges and 5 metadata columns:
#> seqnames ranges strand ref alt totalDepth
#> <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
#> [1] chr2 3901294 * T G 98
#> [2] chr17 55250578 * G T 77
#> [3] chr3 191364048 * G A 78
#> [4] chr7 110583980 * G T 102
#> [5] chr13 93127085 * C A 99
#> ... ... ... ... ... ... ...
#> [95] chr15 173679600 * C G 52
#> [96] chr14 162276325 * C G 79
#> [97] chr1 117412631 * C T 96
#> [98] chr7 177080573 * G C 127
#> [99] chr7 75923037 * A C 136
#> refDepth altDepth sampleNames softFilterMatrix | TRUE_VAF
#> <integerOrRle> <integerOrRle> <factorOrRle> <matrix> | <numeric>
#> [1] 4 94 TumorSample | 0.99
#> [2] 0 77 TumorSample | 0.99
#> [3] 1 77 TumorSample | 0.99
#> [4] 1 101 TumorSample | 0.99
#> [5] 2 97 TumorSample | 0.99
#> ... ... ... ... ... . ...
#> [95] 43 9 TumorSample | 0.216110
#> [96] 62 17 TumorSample | 0.307477
#> [97] 58 38 TumorSample | 0.345484
#> [98] 97 30 TumorSample | 0.244493
#> [99] 94 42 TumorSample | 0.324152
#> VAF CLONE TYPE CLONE_IDS
#> <numeric> <character> <character> <character>
#> [1] 0.959184 Founder founder 1,2,3,4
#> [2] 1.000000 Founder founder 1,2,3,4
#> [3] 0.987179 Founder founder 1,2,3,4
#> [4] 0.990196 Founder founder 1,2,3,4
#> [5] 0.979798 Founder founder 1,2,3,4
#> ... ... ... ... ...
#> [95] 0.173077 Clone4 private 4
#> [96] 0.215190 Clone4 private 4
#> [97] 0.395833 Clone4 private 4
#> [98] 0.236220 Clone4 private 4
#> [99] 0.308824 Clone4 private 4
#> -------
#> seqinfo: 22 sequences from an unspecified genome; no seqlengths
#> hardFilters: NULL
# Write to VCF file (using tempdir to avoid polluting user's filesystem)
vcf_file <- tempfile(fileext = ".vcf")
suppressWarnings(toVCF(sim, output_file = vcf_file))
file.exists(vcf_file)
#> [1] TRUEpyclone_file <- tempfile(fileext = ".tsv")
toPyClone(sim, file = pyclone_file, sample_id = "sample1")
head(read.table(pyclone_file, header = TRUE, sep = "\t"))
#> mutation_id ref_counts var_counts normal_cn minor_cn major_cn sample_id
#> 1 Founder_1 4 94 2 0 2 sample1
#> 2 Founder_2 0 70 2 0 2 sample1
#> 3 Founder_3 0 105 2 0 2 sample1
#> 4 Founder_4 0 77 2 0 2 sample1
#> 5 Founder_5 1 77 2 0 2 sample1
#> 6 Founder_6 1 101 2 0 2 sample1sciclone_file <- tempfile(fileext = ".tsv")
toSciClone(sim, file = sciclone_file)
head(read.table(sciclone_file, header = TRUE, sep = "\t"))
#> chr pos ref_reads var_reads vaf
#> 1 chr2 3901294 4 94 95.91837
#> 2 chr12 156958367 0 70 100.00000
#> 3 chr15 113008658 0 105 100.00000
#> 4 chr17 55250578 0 77 100.00000
#> 5 chr3 191364048 1 77 98.71795
#> 6 chr7 110583980 1 101 99.01961df <- toDataFrame(sim)
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
head(read.csv(csv_file))
#> Mutation Chromosome Position Ref Alt True_VAF VAF Depth Alt_reads
#> 1 Founder_1 chr2 3901294 T G 0.99 0.9591837 98 94
#> 2 Founder_2 chr12 156958367 A A 0.99 1.0000000 70 70
#> 3 Founder_3 chr15 113008658 T T 0.99 1.0000000 105 105
#> 4 Founder_4 chr17 55250578 G T 0.99 1.0000000 77 77
#> 5 Founder_5 chr3 191364048 G A 0.99 0.9871795 78 77
#> 6 Founder_6 chr7 110583980 G T 0.99 0.9901961 102 101
#> Clone Type Clone_IDs
#> 1 Founder founder 1,2,3,4
#> 2 Founder founder 1,2,3,4
#> 3 Founder founder 1,2,3,4
#> 4 Founder founder 1,2,3,4
#> 5 Founder founder 1,2,3,4
#> 6 Founder founder 1,2,3,4Here’s a typical workflow for benchmarking a variant caller:
# 1. Generate ground truth
set.seed(42)
sim <- simulateTumor(
subclone_freqs = c(0.2, 0.3, 0.5),
n_mut_per_clone = c(50, 75, 50)
)
# 2. Export to VCF (ground truth)
toVCF(sim, output_file = "ground_truth.vcf")
# 3. Get true positive set
true_mutations <- getMutations(sim)
# 4. Run your variant caller on simulated data
# (your variant caller code here)
# 5. Compare results
# - Sensitivity: TP / (TP + FN)
# - Precision: TP / (TP + FP)
# - F1 score: 2 * (Precision * Sensitivity) / (Precision + Sensitivity)
# 6. Evaluate VAF estimation accuracy
# cor(true_mutations$VAF, called_vaf)set.seed(123)
sim_complex <- simulateTumor(
subclone_freqs = c(0.1, 0.15, 0.2, 0.25, 0.3),
n_mut_per_clone = c(30, 40, 50, 40, 30),
n_mut_shared = list(
"2 3 4 5" = 20, # Shared by clones 2,3,4,5
"3 4 5" = 15, # Shared by clones 3,4,5
"4 5" = 10, # Shared by clones 4,5
"1 2" = 8 # Shared by clones 1,2
)
)
plot(sim_complex, type = "vaf_density")sessionInfo()
#> R version 4.6.0 RC (2026-04-17 r89917)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.23-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicRanges_1.63.2 Seqinfo_1.1.0 IRanges_2.45.0
#> [4] S4Vectors_0.49.3 BiocGenerics_0.57.1 generics_0.1.4
#> [7] ClonalSim_0.99.8
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.51.1 SummarizedExperiment_1.41.1
#> [3] gtable_0.3.6 rjson_0.2.23
#> [5] xfun_0.57 bslib_0.10.0
#> [7] ggplot2_4.0.3 Biobase_2.71.0
#> [9] lattice_0.22-9 vctrs_0.7.3
#> [11] tools_4.6.0 bitops_1.0-9
#> [13] curl_7.1.0 parallel_4.6.0
#> [15] tibble_3.3.1 AnnotationDbi_1.73.1
#> [17] RSQLite_2.4.6 blob_1.3.0
#> [19] pkgconfig_2.0.3 Matrix_1.7-5
#> [21] BSgenome_1.79.1 RColorBrewer_1.1-3
#> [23] cigarillo_1.1.0 S7_0.2.2
#> [25] lifecycle_1.0.5 compiler_4.6.0
#> [27] farver_2.1.2 Rsamtools_2.27.2
#> [29] Biostrings_2.79.5 codetools_0.2-20
#> [31] htmltools_0.5.9 sass_0.4.10
#> [33] RCurl_1.98-1.18 yaml_2.3.12
#> [35] tidyr_1.3.2 pillar_1.11.1
#> [37] crayon_1.5.3 jquerylib_0.1.4
#> [39] BiocParallel_1.45.0 DelayedArray_0.37.1
#> [41] cachem_1.1.0 abind_1.4-8
#> [43] tidyselect_1.2.1 digest_0.6.39
#> [45] purrr_1.2.2 restfulr_0.0.16
#> [47] dplyr_1.2.1 VariantAnnotation_1.57.1
#> [49] labeling_0.4.3 fastmap_1.2.0
#> [51] grid_4.6.0 cli_3.6.6
#> [53] SparseArray_1.11.13 magrittr_2.0.5
#> [55] S4Arrays_1.11.1 GenomicFeatures_1.63.2
#> [57] XML_3.99-0.23 dichromat_2.0-0.1
#> [59] withr_3.0.2 scales_1.4.0
#> [61] bit64_4.8.0 rmarkdown_2.31
#> [63] XVector_0.51.0 httr_1.4.8
#> [65] matrixStats_1.5.0 bit_4.6.0
#> [67] otel_0.2.0 png_0.1-9
#> [69] memoise_2.0.1 evaluate_1.0.5
#> [71] knitr_1.51 BiocIO_1.21.0
#> [73] rtracklayer_1.71.3 rlang_1.2.0
#> [75] glue_1.8.1 DBI_1.3.0
#> [77] jsonlite_2.0.0 R6_2.6.1
#> [79] GenomicAlignments_1.47.0 MatrixGenerics_1.23.0