Contents

library(cellmig)
library(ggplot2)
library(ggforce)
library(patchwork)
library(rstan)

# Set a clean theme for all plots
ggplot2::theme_set(new = theme_bw(base_size = 10))

1 Introduction

Why simulate data? Simulation is a powerful tool for:

  1. Validating the model: Ensuring cellmig can recover known truth f rom synthetic data.

  2. Experimental Design: Performing power analysis to determine how many replicates or cells you need.

  3. Understanding Variability: Exploring how biological and technical noise affect your results.

This vignette demonstrates two simulation modes in cellmig:

2 Partially Generative Simulation

In this mode, you specify the “ground truth” (e.g., exactly how much a drug slows down cells). The model then adds realistic noise (biological and technical variability) to generate synthetic cell velocities.

2.1 Experimental Design

We will simulate an experiment with the following structure:

Parameter Value Description
Biological Replicates 8 Plates Independent experimental runs
Technical Replicates 5 Wells Wells per treatment per plate
Cells per Well 50 Number of tracked cells
Treatment Groups 7 Different compounds/doses
True Effects (\(\delta\)) -0.4 to 0.4 Log-fold changes relative to control
Bio Variability (\(\sigma_{bio}\)) 0.1 Plate-to-plate variation
Tech Variability (\(\sigma_{tech}\)) 0.05 Well-to-well variation

2.2 Model Parameters

2.2.1 Plate Effects (\(\alpha_p\))

Plates often have batch effects (e.g., one day cells move faster than another). We simulate this by drawing plate-specific baselines from a normal distribution.

  • Mean: 0.0 (log-scale velocity)

  • SD: 0.1 (magnitude of batch effect)

2.2.2 Control Treatment (Offset)

To correct for batch effects, one treatment group must serve as a reference (control) across all plates. We designate Treatment 4 as the control (true effect = 0).

2.2.3 Cell Velocity Distribution (\(\kappa_w\))

Cell velocities are modeled using a Gamma distribution. The shape parameter \(\kappa\) controls the skewness:

  • Higher \(\kappa\): Less variable, more symmetric data.

  • Lower \(\kappa\): More skewed data.

  • Typical value: \(\log(\kappa) \approx 1.5\) (\(\kappa \approx 5\)).

2.3 Generating the Data

We use the gen_partial() function. The control list specifies all design parameters.

# Set seed for reproducibility
set.seed(1253)

# Generate synthetic data
y_p <- gen_partial(control = list(
  N_biorep = 8,                  # Number of plates
  N_techrep = 5,                 # Wells per treatment per plate
  N_cell = 50,                   # Cells per well
  delta = c(-0.4, -0.2, -0.1, 0, 0.1, 0.2, 0.4), # True treatment effects
  sigma_bio = 0.1,               # Biological variability
  sigma_tech = 0.05,             # Technical variability
  offset = 4,                    # Index of the control treatment
  prior_alpha_p_M = 0,           # Mean plate baseline
  prior_alpha_p_SD = 0.1,        # SD of plate baseline
  prior_kappa_mu_M = 1.5,        # Mean log(shape parameter)
  prior_kappa_mu_SD = 0.1,       # SD of log(shape parameter)
  prior_kappa_sigma_M = 0,       # Fixed sigma for shape
  prior_kappa_sigma_SD = 0.1     # SD of sigma for shape
))

2.4 Inspecting the Simulated Data

The output y_p contains the data frame y and the true parameters par.

str(y_p$y)
FALSE 'data.frame': 14000 obs. of  6 variables:
FALSE  $ v       : num  0.566 1.129 0.798 2.149 0.79 ...
FALSE  $ well    : int  1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ plate   : int  1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ group   : int  1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ compound: int  1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ dose    : chr  "X" "X" "X" "X" ...

The data frame includes:

  • v: Simulated cell velocity.

  • well, plate: Hierarchical identifiers.

  • group: Treatment group ID.

  • compound, dose: Metadata (set to generic values in simulation).

2.5 Visualizing Simulated Data

Let’s plot the raw cell velocities. Colors represent plates. If batch effects are present, you might see plate colors clustering at different heights.

ggplot(data = y_p$y) +
  geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate), 
                y = v, group = well), size = 0.5) +
  xlab("Treatment Group") +
  ylab("Migration Speed (µm/min)") +
  scale_color_grey(name = "Plate") +
  scale_y_log10() +
  theme(legend.position = "top")

We can also look at mean well velocities to see batch effects more clearly:

well_means <- aggregate(v ~ well + group + plate, data = y_p$y, FUN = mean)

ggplot(data = well_means) +
  geom_sina(aes(x = paste0("t=", group), col = paste0("p=", plate), 
                y = v, group = well), size = 1) +
  xlab("Treatment Group") +
  ylab("Mean Well Speed") +
  scale_color_grey(name = "Plate") +
  scale_y_log10() +
  theme(legend.position = "top")

2.6 Analyzing Simulated Data

To verify the model works, we analyze the synthetic data just like real experimental data.

Important: We must format the data correctly:

  1. Convert IDs to characters.
  2. Create an offset column (1 for control, 0 for others).
sim_data <- y_p$y

# Format columns
sim_data$well <- as.character(sim_data$well)
sim_data$compound <- as.character(sim_data$compound)
sim_data$plate <- as.character(sim_data$plate)

# Define offset (Control = Group 4)
sim_data$offset <- 0
sim_data$offset[sim_data$group == 4] <- 1

Now we fit the model:

osd <- cellmig(x = sim_data,
               control = list(mcmc_warmup = 250,
                              mcmc_steps = 800,
                              mcmc_chains = 2,
                              mcmc_cores = 2,
                              mcmc_algorithm = "NUTS",
                              adapt_delta = 0.8,
                              max_treedepth = 10))
FALSE Warning: There were 4 divergent transitions after warmup. See
FALSE https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
FALSE to find out why this is a problem and how to eliminate them.
FALSE Warning: Examine the pairs() plot to diagnose sampling problems

2.7 Validation: Truth vs. Inference

Does cellmig recover the true parameters we set? We compare the posterior estimates (dots + error bars) against the true values (red dots).

2.7.1 Plate Effects (\(\alpha_p\))

The model should recover the plate-specific baselines.

ggplot(data = osd$posteriors$alpha_p) +
  geom_point(aes(y = plate_id, x = exp(mean))) +
  geom_errorbarh(aes(y = plate_id, x = exp(mean), xmin = exp(X2.5.), 
                     xmax = exp(X97.5.)), height = 0.1) +
  # Overlay true values in red
  geom_point(data = data.frame(v = exp(y_p$par$alpha_p),
                               plate = osd$posteriors$alpha_p$plate_id),
             aes(y = plate, x = v), col = "red", size = 2) +
  xlab(label = expression("Plate Baseline ("*alpha[p]*"')")) +
  ylab("Plate ID") +
  scale_y_continuous(breaks = osd$posteriors$alpha_p$plate_id)

2.7.2 Treatment Effects (\(\delta_t\))

The model should recover the true log-fold changes we defined in delta.

# Note: Group 4 is the control (offset), so it is not estimated
ggplot(data = osd$posteriors$delta_t) +
  geom_point(aes(y = group_id, x = mean)) +
  geom_errorbarh(aes(y = group_id, x = mean, xmin = X2.5., xmax = X97.5.), 
                 height = 0.1) +
  # Overlay true values (excluding control)
  geom_point(data = data.frame(delta = c(-0.4, -0.2, -0.1, 0.1, 0.2, 0.4), 
                               group_id = 1:6),
             aes(y = group_id, x = delta), col = "red", size = 2) +
  xlab(label = expression("Treatment Effect ("*delta[t]*")")) +
  ylab("Treatment Group") +
  scale_y_continuous(breaks = 1:8)

2.7.3 Variability Parameters

Check if the estimated biological and technical noise matches the input.

plot(osd$fit, par = c("sigma_bio", "sigma_tech", "sigma_delta"))

3 Fully Generative Simulation

In this mode, no parameters are fixed. Everything (including treatment effects) is drawn from the prior distributions. This generates data from the prior predictive distribution.

Use Case: Check if your priors produce realistic data before collecting any real samples. If the simulated velocities are biologically impossible (e.g., 1000 µm/min), your priors may need adjustment.

3.1 Model Specification

We define priors for all parameters instead of fixed values.

control_full <- list(
  N_biorep = 4, 
  N_techrep = 3, 
  N_cell = 30, 
  N_group = 8,
  # Priors for plate baselines
  prior_alpha_p_M = -0.5,
  prior_alpha_p_SD = 1.0,
  # Priors for Gamma shape
  prior_kappa_mu_M = 1.5,
  prior_kappa_mu_SD = 1.0,
  prior_kappa_sigma_M = 0,
  prior_kappa_sigma_SD = 1.0,
  # Priors for variability
  prior_sigma_bio_M = 0.0,
  prior_sigma_bio_SD = 1.0,
  prior_sigma_tech_M = 0.0,
  prior_sigma_tech_SD = 1.0,
  prior_sigma_delta_M = 0.0,
  prior_sigma_delta_SD = 1.0
)

3.2 Generate Data

y_f <- gen_full(control = control_full)
FALSE Chain 1: 
FALSE Chain 1:  Elapsed Time: 0 seconds (Warm-up)
FALSE Chain 1:                0.002 seconds (Sampling)
FALSE Chain 1:                0.002 seconds (Total)
FALSE Chain 1:
str(y_f)
FALSE List of 2
FALSE  $ y      :'data.frame':  2880 obs. of  8 variables:
FALSE   ..$ i       : int [1:2880] 1 2 3 4 5 6 7 8 9 10 ...
FALSE   ..$ v       : num [1:2880] 0.267 1.136 0.347 0.98 3.315 ...
FALSE   ..$ well    : chr [1:2880] "1" "1" "1" "1" ...
FALSE   ..$ plate   : chr [1:2880] "1" "1" "1" "1" ...
FALSE   ..$ group   : chr [1:2880] "1" "1" "1" "1" ...
FALSE   ..$ compound: chr [1:2880] "1" "1" "1" "1" ...
FALSE   ..$ dose    : chr [1:2880] "X" "X" "X" "X" ...
FALSE   ..$ trep_id : int [1:2880] 1 1 1 1 1 1 1 1 1 1 ...
FALSE  $ control:List of 19
FALSE   ..$ N_biorep            : num 4
FALSE   ..$ N_techrep           : num 3
FALSE   ..$ N_cell              : num 30
FALSE   ..$ N_group             : num 8
FALSE   ..$ prior_alpha_p_M     : num -0.5
FALSE   ..$ prior_alpha_p_SD    : num 1
FALSE   ..$ prior_kappa_mu_M    : num 1.5
FALSE   ..$ prior_kappa_mu_SD   : num 1
FALSE   ..$ prior_kappa_sigma_M : num 0
FALSE   ..$ prior_kappa_sigma_SD: num 1
FALSE   ..$ prior_sigma_bio_M   : num 0
FALSE   ..$ prior_sigma_bio_SD  : num 1
FALSE   ..$ prior_sigma_tech_M  : num 0
FALSE   ..$ prior_sigma_tech_SD : num 1
FALSE   ..$ prior_sigma_delta_M : num 0
FALSE   ..$ prior_sigma_delta_SD: num 1
FALSE   ..$ offset              : num 1
FALSE   ..$ N_plate             : num 4
FALSE   ..$ N_well_reps         : num 3

3.3 Comparing Simulation Modes

Do the two simulation modes produce similar ranges of cell velocities?

# Compare a subset of velocities
w <- data.frame(v_f = y_f$y$v[1:2000], v_p = y_p$y$v[1:2000])

ggplot(data = w) +
  geom_point(aes(x = v_f, y = v_p), size = 0.5, alpha = 0.5) +
  geom_density_2d(aes(x = v_f, y = v_p), col = "orange") +
  scale_x_log10(name = "Fully Generative Speed", limits = c(0.01, 10^4)) +
  scale_y_log10(name = "Partially Generative Speed", limits = c(0.01, 10^4)) +
  annotation_logticks(base = 10, sides = "lb") +
  theme_bw()
FALSE Warning: Removed 101 rows containing non-finite outside the scale range
FALSE (`stat_density2d()`).
FALSE Warning: Removed 101 rows containing missing values or values outside the scale range
FALSE (`geom_point()`).

4 Power Analysis

One of the most practical uses of simulation is power analysis. Question: How many biological replicates (plates) do I need to detect a specific effect size?

We can answer this by:

  1. Simulating many datasets with varying numbers of replicates.

  2. Fitting the model to each.

  3. Checking how often the model correctly identifies the true effect (True Positive Rate).

Note: The code below is computationally intensive (it fits many models). It is set to eval = FALSE for this vignette. You can run it locally to perform your own power analysis. Tip: Use multiple cores (mcmc_cores) to speed this up!

4.1 Simulation Loop

# --- Configuration ---
N_bioreps <- c(3, 6, 9)      # Replicate scenarios to test
N_sim <- 10                  # Number of simulations per scenario
true_deltas <- c(-0.3, -0.15, 0, 0.2, 0.4) # Effects to test
offset <- 3                  # Control group index

# Store results
deltas <- vector(mode = "list", length = length(N_bioreps) * N_sim)
i <- 1

for(N_biorep in N_bioreps) {
  for(b in 1:N_sim) {
    
    # 1. Simulate data
    y_p <- gen_partial(control = list(
      N_biorep = N_biorep, 
      N_techrep = 3, 
      N_cell = 40, 
      delta = true_deltas,
      sigma_bio = 0.1, 
      sigma_tech = 0.05, 
      offset = offset,
      prior_alpha_p_M = -0.5,
      prior_alpha_p_SD = 0.1,
      prior_kappa_mu_M = 1.5,
      prior_kappa_mu_SD = 0.1,
      prior_kappa_sigma_M = 0,
      prior_kappa_sigma_SD = 0.1
    ))
    
    # 2. Format data
    sim_data <- y_p$y
    sim_data$well <- as.character(sim_data$well)
    sim_data$compound <- as.character(sim_data$compound)
    sim_data$plate <- as.character(sim_data$plate)
    sim_data$offset <- 0
    sim_data$offset[sim_data$group == offset] <- 1
    
    # 3. Fit model
    o <- cellmig(x = sim_data,
                 control = list(mcmc_warmup = 300,
                                mcmc_steps = 1000,
                                mcmc_chains = 1,
                                mcmc_cores = 1, # Increase for speed
                                mcmc_algorithm = "NUTS",
                                adapt_delta = 0.8,
                                max_treedepth = 10))
    
    # 4. Evaluate Performance
    delta <- o$posteriors$delta_t
    delta$b <- b
    delta$N_biorep <- N_biorep
    delta$true_deltas <- true_deltas[-offset]
    
    # True Positive: HDI excludes 0 AND includes true value
    delta$TP <- (delta$X2.5. <= delta$true_deltas & 
                   delta$X97.5. >= delta$true_deltas) &
      !(delta$X2.5. <= 0 & delta$X97.5. >= 0)
    
    deltas[[i]] <- delta
    i <- i + 1
  }
}

# Combine results
deltas <- do.call(rbind, deltas)

4.2 Interpreting Results

When you run this analysis, you will typically observe:

  • Large Effects (e.g., \(|\delta| > 0.3\)): Detected reliably even with few replicates (High True Positive rate).

  • Small Effects (e.g., \(|\delta| < 0.2\)): Require many biological replicates to distinguish from noise.

ggplot(data = aggregate(TP ~ N_biorep + true_deltas, data = deltas, FUN = sum)) +
  geom_point(aes(x = N_biorep, y = TP, col = abs(true_deltas), 
                 group = as.factor(true_deltas)), size = 2, alpha = 0.5) +
  geom_line(aes(x = N_biorep, y = TP, col = abs(true_deltas), 
                group = as.factor(true_deltas)), alpha = 0.5) +
  ylab("Number of True Positives (TPs)") +
  xlab("Number of Biological Replicates") +
  scale_color_distiller(name = expression("|"*delta[t]*"|"), palette = "Spectral") +
  theme_bw()

5 Session Info

sessionInfo()
FALSE R version 4.6.0 RC (2026-04-17 r89917)
FALSE Platform: x86_64-pc-linux-gnu
FALSE Running under: Ubuntu 24.04.4 LTS
FALSE 
FALSE Matrix products: default
FALSE BLAS:   /home/biocbuild/bbs-3.24-bioc/R/lib/libRblas.so 
FALSE LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
FALSE 
FALSE locale:
FALSE  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
FALSE  [3] LC_TIME=en_GB              LC_COLLATE=C              
FALSE  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
FALSE  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
FALSE  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
FALSE [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
FALSE 
FALSE time zone: America/New_York
FALSE tzcode source: system (glibc)
FALSE 
FALSE attached base packages:
FALSE [1] stats     graphics  grDevices utils     datasets  methods   base     
FALSE 
FALSE other attached packages:
FALSE [1] rstan_2.32.7        StanHeaders_2.32.10 patchwork_1.3.2    
FALSE [4] ggforce_0.5.0       ggplot2_4.0.3       cellmig_1.3.3      
FALSE [7] BiocStyle_2.41.0   
FALSE 
FALSE loaded via a namespace (and not attached):
FALSE  [1] ggiraph_0.9.6           tidyselect_1.2.1        dplyr_1.2.1            
FALSE  [4] farver_2.1.2            loo_2.9.0               S7_0.2.2               
FALSE  [7] fastmap_1.2.0           lazyeval_0.2.3          tweenr_2.0.3           
FALSE [10] fontquiver_0.2.1        digest_0.6.39           lifecycle_1.0.5        
FALSE [13] tidytree_0.4.7          magrittr_2.0.5          compiler_4.6.0         
FALSE [16] rlang_1.2.0             sass_0.4.10             tools_4.6.0            
FALSE [19] yaml_2.3.12             knitr_1.51              labeling_0.4.3         
FALSE [22] htmlwidgets_1.6.4       pkgbuild_1.4.8          curl_7.1.0             
FALSE [25] plyr_1.8.9              RColorBrewer_1.1-3      aplot_0.2.9            
FALSE [28] withr_3.0.2             purrr_1.2.2             grid_4.6.0             
FALSE [31] polyclip_1.10-7         stats4_4.6.0            gdtools_0.5.0          
FALSE [34] inline_0.3.21           scales_1.4.0            MASS_7.3-65            
FALSE [37] isoband_0.3.0           tinytex_0.59            dichromat_2.0-0.1      
FALSE [40] cli_3.6.6               rmarkdown_2.31          treeio_1.37.0          
FALSE [43] generics_0.1.4          otel_0.2.0              RcppParallel_5.1.11-2  
FALSE [46] ggtree_4.3.0            reshape2_1.4.5          ape_5.8-1              
FALSE [49] cachem_1.1.0            stringr_1.6.0           parallel_4.6.0         
FALSE [52] ggplotify_0.1.3         BiocManager_1.30.27     matrixStats_1.5.0      
FALSE [55] vctrs_0.7.3             yulab.utils_0.2.4       V8_8.2.0               
FALSE [58] jsonlite_2.0.0          fontBitstreamVera_0.1.1 bookdown_0.46          
FALSE [61] gridGraphics_0.5-1      magick_2.9.1            systemfonts_1.3.2      
FALSE [64] tidyr_1.3.2             jquerylib_0.1.4         glue_1.8.1             
FALSE [67] codetools_0.2-20        stringi_1.8.7           gtable_0.3.6           
FALSE [70] QuickJSR_1.9.2          tibble_3.3.1            pillar_1.11.1          
FALSE [73] rappdirs_0.3.4          htmltools_0.5.9         R6_2.6.1               
FALSE [76] evaluate_1.0.5          lattice_0.22-9          ggfun_0.2.0            
FALSE [79] fontLiberation_0.1.0    bslib_0.10.0            rstantools_2.6.0       
FALSE [82] Rcpp_1.1.1-1.1          gridExtra_2.3           nlme_3.1-169           
FALSE [85] xfun_0.57               fs_2.1.0                pkgconfig_2.0.3