library(cellmig)
library(ggplot2)
library(ggforce)
library(patchwork)
library(rstan)
# Set a clean theme for all plots
ggplot2::theme_set(new = theme_bw(base_size = 10))
Why simulate data? Simulation is a powerful tool for:
Validating the model: Ensuring cellmig can recover known truth
f rom synthetic data.
Experimental Design: Performing power analysis to determine how many replicates or cells you need.
Understanding Variability: Exploring how biological and technical noise affect your results.
This vignette demonstrates two simulation modes in cellmig:
Partially Generative: You define the true effect sizes (e.g., drug effects). Useful for validation and power analysis.
Fully Generative: All parameters are drawn from prior distributions. Useful for checking if your priors produce realistic data.
In this mode, you specify the “ground truth” (e.g., exactly how much a drug slows down cells). The model then adds realistic noise (biological and technical variability) to generate synthetic cell velocities.
We will simulate an experiment with the following structure:
| Parameter | Value | Description |
|---|---|---|
| Biological Replicates | 8 Plates | Independent experimental runs |
| Technical Replicates | 5 Wells | Wells per treatment per plate |
| Cells per Well | 50 | Number of tracked cells |
| Treatment Groups | 7 | Different compounds/doses |
| True Effects (\(\delta\)) | -0.4 to 0.4 | Log-fold changes relative to control |
| Bio Variability (\(\sigma_{bio}\)) | 0.1 | Plate-to-plate variation |
| Tech Variability (\(\sigma_{tech}\)) | 0.05 | Well-to-well variation |
Plates often have batch effects (e.g., one day cells move faster than another). We simulate this by drawing plate-specific baselines from a normal distribution.
Mean: 0.0 (log-scale velocity)
SD: 0.1 (magnitude of batch effect)
To correct for batch effects, one treatment group must serve as a reference (control) across all plates. We designate Treatment 4 as the control (true effect = 0).
Cell velocities are modeled using a Gamma distribution. The shape parameter \(\kappa\) controls the skewness:
Higher \(\kappa\): Less variable, more symmetric data.
Lower \(\kappa\): More skewed data.
Typical value: \(\log(\kappa) \approx 1.5\) (\(\kappa \approx 5\)).
We use the gen_partial() function. The control list specifies all
design parameters.
# Set seed for reproducibility
set.seed(1253)
# Generate synthetic data
y_p <- gen_partial(control = list(
N_biorep = 8, # Number of plates
N_techrep = 5, # Wells per treatment per plate
N_cell = 50, # Cells per well
delta = c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4), # True treatment effects
sigma_bio = 0.1, # Biological variability
sigma_tech = 0.05, # Technical variability
offset = 4, # Index of the control treatment
prior_alpha_p_M = 0, # Mean plate baseline
prior_alpha_p_SD = 0.1, # SD of plate baseline
prior_kappa_mu_M = 1.5, # Mean log(shape parameter)
prior_kappa_mu_SD = 0.1, # SD of log(shape parameter)
prior_kappa_sigma_M = 0, # Fixed sigma for shape
prior_kappa_sigma_SD = 0.1 # SD of sigma for shape
))
The output y_p contains the data frame y and the true parameters
par.
str(y_p$y)
FALSE 'data.frame': 14000 obs. of 6 variables:
FALSE $ v : num 0.566 1.129 0.798 2.149 0.79 ...
FALSE $ well : int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ plate : int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ group : int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ compound: int 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ dose : chr "X" "X" "X" "X" ...
The data frame includes:
v: Simulated cell velocity.
well, plate: Hierarchical identifiers.
group: Treatment group ID.
compound, dose: Metadata (set to generic values in simulation).
Let’s plot the raw cell velocities. Colors represent plates. If batch effects are present, you might see plate colors clustering at different heights.
ggplot(data = y_p$y) +
geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate),
y = v, group = well), size = 0.5) +
xlab("Treatment Group") +
ylab("Migration Speed (µm/min)") +
scale_color_grey(name = "Plate") +
scale_y_log10() +
theme(legend.position = "top")
We can also look at mean well velocities to see batch effects more clearly:
well_means <- aggregate(v ~ well + group + plate, data = y_p$y, FUN = mean)
ggplot(data = well_means) +
geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate),
y = v, group = well), size = 1) +
xlab("Treatment Group") +
ylab("Mean Well Speed") +
scale_color_grey(name = "Plate") +
scale_y_log10() +
theme(legend.position = "top")
To verify the model works, we analyze the synthetic data just like real experimental data.
Important: We must format the data correctly:
offset column (1 for control, 0 for others).sim_data <- y_p$y
# Format columns
sim_data$well <- as.character(sim_data$well)
sim_data$compound <- as.character(sim_data$compound)
sim_data$plate <- as.character(sim_data$plate)
# Define offset (Control = Group 4)
sim_data$offset <- 0
sim_data$offset[sim_data$group == 4] <- 1
Now we fit the model:
osd <- cellmig(x = sim_data,
control = list(mcmc_warmup = 250,
mcmc_steps = 800,
mcmc_chains = 2,
mcmc_cores = 2,
mcmc_algorithm = "NUTS",
adapt_delta = 0.8,
max_treedepth = 10))
FALSE Warning: There were 4 divergent transitions after warmup. See
FALSE https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
FALSE to find out why this is a problem and how to eliminate them.
FALSE Warning: Examine the pairs() plot to diagnose sampling problems
Does cellmig recover the true parameters we set? We compare the
posterior estimates (dots + error bars) against the true values
(red dots).
The model should recover the plate-specific baselines.
ggplot(data = osd$posteriors$alpha_p) +
geom_point(aes(y = plate_id, x = exp(mean))) +
geom_errorbarh(aes(y = plate_id, x = exp(mean), xmin = exp(X2.5.),
xmax = exp(X97.5.)), height = 0.1) +
# Overlay true values in red
geom_point(data = data.frame(v = exp(y_p$par$alpha_p),
plate = osd$posteriors$alpha_p$plate_id),
aes(y = plate, x = v), col = "red", size = 2) +
xlab(label = expression("Plate Baseline ("*alpha[p]*"')")) +
ylab("Plate ID") +
scale_y_continuous(breaks = osd$posteriors$alpha_p$plate_id)
The model should recover the true log-fold changes we defined in
delta.
# Note: Group 4 is the control (offset), so it is not estimated
ggplot(data = osd$posteriors$delta_t) +
geom_point(aes(y = group_id, x = mean)) +
geom_errorbarh(aes(y = group_id, x = mean, xmin = X2.5., xmax = X97.5.),
height = 0.1) +
# Overlay true values (excluding control)
geom_point(data = data.frame(delta = c(-0.4, -0.2, -0.1, 0.1, 0.2, 0.4),
group_id = 1:6),
aes(y = group_id, x = delta), col = "red", size = 2) +
xlab(label = expression("Treatment Effect ("*delta[t]*")")) +
ylab("Treatment Group") +
scale_y_continuous(breaks = 1:8)
Check if the estimated biological and technical noise matches the input.
plot(osd$fit, par = c("sigma_bio", "sigma_tech", "sigma_delta"))
In this mode, no parameters are fixed. Everything (including treatment effects) is drawn from the prior distributions. This generates data from the prior predictive distribution.
Use Case: Check if your priors produce realistic data before collecting any real samples. If the simulated velocities are biologically impossible (e.g., 1000 µm/min), your priors may need adjustment.
We define priors for all parameters instead of fixed values.
control_full <- list(
N_biorep = 4,
N_techrep = 3,
N_cell = 30,
N_group = 8,
# Priors for plate baselines
prior_alpha_p_M = -0.5,
prior_alpha_p_SD = 1.0,
# Priors for Gamma shape
prior_kappa_mu_M = 1.5,
prior_kappa_mu_SD = 1.0,
prior_kappa_sigma_M = 0,
prior_kappa_sigma_SD = 1.0,
# Priors for variability
prior_sigma_bio_M = 0.0,
prior_sigma_bio_SD = 1.0,
prior_sigma_tech_M = 0.0,
prior_sigma_tech_SD = 1.0,
prior_sigma_delta_M = 0.0,
prior_sigma_delta_SD = 1.0
)
y_f <- gen_full(control = control_full)
FALSE Chain 1:
FALSE Chain 1: Elapsed Time: 0 seconds (Warm-up)
FALSE Chain 1: 0.002 seconds (Sampling)
FALSE Chain 1: 0.002 seconds (Total)
FALSE Chain 1:
str(y_f)
FALSE List of 2
FALSE $ y :'data.frame': 2880 obs. of 8 variables:
FALSE ..$ i : int [1:2880] 1 2 3 4 5 6 7 8 9 10 ...
FALSE ..$ v : num [1:2880] 0.267 1.136 0.347 0.98 3.315 ...
FALSE ..$ well : chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ plate : chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ group : chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ compound: chr [1:2880] "1" "1" "1" "1" ...
FALSE ..$ dose : chr [1:2880] "X" "X" "X" "X" ...
FALSE ..$ trep_id : int [1:2880] 1 1 1 1 1 1 1 1 1 1 ...
FALSE $ control:List of 19
FALSE ..$ N_biorep : num 4
FALSE ..$ N_techrep : num 3
FALSE ..$ N_cell : num 30
FALSE ..$ N_group : num 8
FALSE ..$ prior_alpha_p_M : num -0.5
FALSE ..$ prior_alpha_p_SD : num 1
FALSE ..$ prior_kappa_mu_M : num 1.5
FALSE ..$ prior_kappa_mu_SD : num 1
FALSE ..$ prior_kappa_sigma_M : num 0
FALSE ..$ prior_kappa_sigma_SD: num 1
FALSE ..$ prior_sigma_bio_M : num 0
FALSE ..$ prior_sigma_bio_SD : num 1
FALSE ..$ prior_sigma_tech_M : num 0
FALSE ..$ prior_sigma_tech_SD : num 1
FALSE ..$ prior_sigma_delta_M : num 0
FALSE ..$ prior_sigma_delta_SD: num 1
FALSE ..$ offset : num 1
FALSE ..$ N_plate : num 4
FALSE ..$ N_well_reps : num 3
Do the two simulation modes produce similar ranges of cell velocities?
# Compare a subset of velocities
w <- data.frame(v_f = y_f$y$v[1:2000], v_p = y_p$y$v[1:2000])
ggplot(data = w) +
geom_point(aes(x = v_f, y = v_p), size = 0.5, alpha = 0.5) +
geom_density_2d(aes(x = v_f, y = v_p), col = "orange") +
scale_x_log10(name = "Fully Generative Speed", limits = c(0.01, 10^4)) +
scale_y_log10(name = "Partially Generative Speed", limits = c(0.01, 10^4)) +
annotation_logticks(base = 10, sides = "lb") +
theme_bw()
FALSE Warning: Removed 101 rows containing non-finite outside the scale range
FALSE (`stat_density2d()`).
FALSE Warning: Removed 101 rows containing missing values or values outside the scale range
FALSE (`geom_point()`).
One of the most practical uses of simulation is power analysis. Question: How many biological replicates (plates) do I need to detect a specific effect size?
We can answer this by:
Simulating many datasets with varying numbers of replicates.
Fitting the model to each.
Checking how often the model correctly identifies the true effect (True Positive Rate).
Note: The code below is computationally intensive (it fits many
models). It is set to eval = FALSE for this vignette. You can run it
locally to perform your own power analysis. Tip: Use multiple cores
(mcmc_cores) to speed this up!
# --- Configuration ---
N_bioreps <- c(3, 6, 9) # Replicate scenarios to test
N_sim <- 10 # Number of simulations per scenario
true_deltas <- c(-0.3, -0.15, 0, 0.2, 0.4) # Effects to test
offset <- 3 # Control group index
# Store results
deltas <- vector(mode = "list", length = length(N_bioreps) * N_sim)
i <- 1
for(N_biorep in N_bioreps) {
for(b in 1:N_sim) {
# 1. Simulate data
y_p <- gen_partial(control = list(
N_biorep = N_biorep,
N_techrep = 3,
N_cell = 40,
delta = true_deltas,
sigma_bio = 0.1,
sigma_tech = 0.05,
offset = offset,
prior_alpha_p_M = -0.5,
prior_alpha_p_SD = 0.1,
prior_kappa_mu_M = 1.5,
prior_kappa_mu_SD = 0.1,
prior_kappa_sigma_M = 0,
prior_kappa_sigma_SD = 0.1
))
# 2. Format data
sim_data <- y_p$y
sim_data$well <- as.character(sim_data$well)
sim_data$compound <- as.character(sim_data$compound)
sim_data$plate <- as.character(sim_data$plate)
sim_data$offset <- 0
sim_data$offset[sim_data$group == offset] <- 1
# 3. Fit model
o <- cellmig(x = sim_data,
control = list(mcmc_warmup = 300,
mcmc_steps = 1000,
mcmc_chains = 1,
mcmc_cores = 1, # Increase for speed
mcmc_algorithm = "NUTS",
adapt_delta = 0.8,
max_treedepth = 10))
# 4. Evaluate Performance
delta <- o$posteriors$delta_t
delta$b <- b
delta$N_biorep <- N_biorep
delta$true_deltas <- true_deltas[-offset]
# True Positive: HDI excludes 0 AND includes true value
delta$TP <- (delta$X2.5. <= delta$true_deltas &
delta$X97.5. >= delta$true_deltas) &
!(delta$X2.5. <= 0 & delta$X97.5. >= 0)
deltas[[i]] <- delta
i <- i + 1
}
}
# Combine results
deltas <- do.call(rbind, deltas)
When you run this analysis, you will typically observe:
Large Effects (e.g., \(|\delta| > 0.3\)): Detected reliably even with few replicates (High True Positive rate).
Small Effects (e.g., \(|\delta| < 0.2\)): Require many biological replicates to distinguish from noise.
ggplot(data = aggregate(TP ~ N_biorep + true_deltas, data = deltas, FUN = sum)) +
geom_point(aes(x = N_biorep, y = TP, col = abs(true_deltas),
group = as.factor(true_deltas)), size = 2, alpha = 0.5) +
geom_line(aes(x = N_biorep, y = TP, col = abs(true_deltas),
group = as.factor(true_deltas)), alpha = 0.5) +
ylab("Number of True Positives (TPs)") +
xlab("Number of Biological Replicates") +
scale_color_distiller(name = expression("|"*delta[t]*"|"), palette = "Spectral") +
theme_bw()
sessionInfo()
FALSE R version 4.6.0 RC (2026-04-17 r89917)
FALSE Platform: x86_64-pc-linux-gnu
FALSE Running under: Ubuntu 24.04.4 LTS
FALSE
FALSE Matrix products: default
FALSE BLAS: /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so
FALSE LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
FALSE
FALSE locale:
FALSE [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
FALSE [3] LC_TIME=en_GB LC_COLLATE=C
FALSE [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
FALSE [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
FALSE [9] LC_ADDRESS=C LC_TELEPHONE=C
FALSE [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
FALSE
FALSE time zone: America/New_York
FALSE tzcode source: system (glibc)
FALSE
FALSE attached base packages:
FALSE [1] stats graphics grDevices utils datasets methods base
FALSE
FALSE other attached packages:
FALSE [1] rstan_2.32.7 StanHeaders_2.32.10 patchwork_1.3.2
FALSE [4] ggforce_0.5.0 ggplot2_4.0.3 cellmig_1.3.3
FALSE [7] BiocStyle_2.41.0
FALSE
FALSE loaded via a namespace (and not attached):
FALSE [1] ggiraph_0.9.6 tidyselect_1.2.1 dplyr_1.2.1
FALSE [4] farver_2.1.2 loo_2.9.0 S7_0.2.2
FALSE [7] fastmap_1.2.0 lazyeval_0.2.3 tweenr_2.0.3
FALSE [10] fontquiver_0.2.1 digest_0.6.39 lifecycle_1.0.5
FALSE [13] tidytree_0.4.7 magrittr_2.0.5 compiler_4.6.0
FALSE [16] rlang_1.2.0 sass_0.4.10 tools_4.6.0
FALSE [19] yaml_2.3.12 knitr_1.51 labeling_0.4.3
FALSE [22] htmlwidgets_1.6.4 pkgbuild_1.4.8 curl_7.1.0
FALSE [25] plyr_1.8.9 RColorBrewer_1.1-3 aplot_0.2.9
FALSE [28] withr_3.0.2 purrr_1.2.2 grid_4.6.0
FALSE [31] polyclip_1.10-7 stats4_4.6.0 gdtools_0.5.0
FALSE [34] inline_0.3.21 scales_1.4.0 MASS_7.3-65
FALSE [37] isoband_0.3.0 tinytex_0.59 dichromat_2.0-0.1
FALSE [40] cli_3.6.6 rmarkdown_2.31 treeio_1.37.0
FALSE [43] generics_0.1.4 otel_0.2.0 RcppParallel_5.1.11-2
FALSE [46] ggtree_4.3.0 reshape2_1.4.5 ape_5.8-1
FALSE [49] cachem_1.1.0 stringr_1.6.0 parallel_4.6.0
FALSE [52] ggplotify_0.1.3 BiocManager_1.30.27 matrixStats_1.5.0
FALSE [55] vctrs_0.7.3 yulab.utils_0.2.4 V8_8.2.0
FALSE [58] jsonlite_2.0.0 fontBitstreamVera_0.1.1 bookdown_0.46
FALSE [61] gridGraphics_0.5-1 magick_2.9.1 systemfonts_1.3.2
FALSE [64] tidyr_1.3.2 jquerylib_0.1.4 glue_1.8.1
FALSE [67] codetools_0.2-20 stringi_1.8.7 gtable_0.3.6
FALSE [70] QuickJSR_1.9.2 tibble_3.3.1 pillar_1.11.1
FALSE [73] rappdirs_0.3.4 htmltools_0.5.9 R6_2.6.1
FALSE [76] evaluate_1.0.5 lattice_0.22-9 ggfun_0.2.0
FALSE [79] fontLiberation_0.1.0 bslib_0.10.0 rstantools_2.6.0
FALSE [82] Rcpp_1.1.1-1.1 gridExtra_2.3 nlme_3.1-169
FALSE [85] xfun_0.57 fs_2.1.0 pkgconfig_2.0.3