--- title: "Introduction to DEmixR" author: "Farrokh Habibzadeh" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: default pdf_document: toc: true vignette: > %\VignetteIndexEntry{Introduction to DEmixR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE) ``` # Introduction The **DEmixR** package provides tools for fitting and evaluating **two-component mixture models** (normal and lognormal) using robust global optimization via Differential Evolution (`DEoptim`), with local refinement by quasi-Newton optimization. It is designed to be more **robust** than standard EM-based methods, particularly in challenging situations where the mixture components overlap strongly or have complex likelihood surfaces. # Installation You can install the released version of `DEmixR` from CRAN with: ```{r, eval = FALSE} install.packages("DEmixR") ``` # Overview of Functions `prelim_plots()` – Diagnostic plots (histogram, QQ, PP, log-QQ) `select_best_mixture()` – Compare lognormal vs normal mixtures using BIC `fit_norm2()` – Fit a two-component **normal mixture** `fit_lognorm2()` – Fit a two-component **lognormal mixture** `bootstrap_mix2()` – Bootstrap mixture parameters (parametric or nonparametric) `evaluate_init()` – Refine initial parameter values with local optimization # When to Use Which Function Use `prelim_plots()` for initial data exploration and distribution assessment Use `select_best_mixture()` when unsure whether **normal** or **lognormal** distribution is appropriate Use `fit_norm2()`/`fit_lognorm2()` when you know the appropriate distribution family Use `bootstrap_mix2()` for uncertainty quantification and confidence intervals Use `evaluate_init()` for testing specific starting values or refining solutions # Two-Component Normal Mixture We first generate synthetic data from a two-component **normal** mixture and draw the basic plots with `DEmixR`. # Diagnostic Plots ```{r} # Load the package library(DEmixR) # Simulation data set.seed(123) x <- c(rnorm(3000, mean = 0, sd = 1), rnorm(2000, mean = 3, sd = 1.2)) # Diagnostic plots prelim_plots(x, which = c("hist", "qq")) ``` # Model Selection ```{r} # Automatically choose the best mixture family best_model <- select_best_mixture(x, n_runs = 3, NP = 50, itermax = 2000) best_model$best$family best_model$BICs ``` # Normal Mixture ```{r} # Fit a two-component normal mixture fit <- fit_norm2(x, n_runs = 5, quiet = 2, parallelType = 0) fit$par # estimated parameters fit$logLik # log-likelihood fit$AIC # Akaike Information Criterion fit$BIC # Bayesian Information Criterion ``` # Lognormal Mixture ```{r} # Simulation data set.seed(123) y <- c(rlnorm(3000, meanlog = 0, sdlog = 0.5), rlnorm(2000, meanlog = 1, sdlog = 0.7)) # Fit a two-component lognormal mixture fit_ln <- fit_lognorm2(y, n_runs = 5, parallelType = 0) fit_ln$par ``` # Bootstrap ```{r} # Bootstrap confidence intervals boot_res <- bootstrap_mix2(fit_ln, B = 100, parametric = TRUE, parallelType = 0) boot_res$central boot_res$ci ``` # Evaluate Initialization Values ```{r} evaluate_init(par_init = c(0.5, 0, 0.6, 1.1, 0.8), y, family = "lognormal") ``` # Practical Example: Biomarker Data Analysis ```{r} # Simulated biomarker data with two populations set.seed(456) biomarker <- c(rlnorm(4000, meanlog = 2.5, sdlog = 0.4), # Healthy group rlnorm(1000, meanlog = 3.8, sdlog = 0.6)) # Disease group # Initial exploration prelim_plots(biomarker, which = c("hist", "logqq")) # Fit lognormal mixture fit_bio <- try(fit_lognorm2(biomarker, n_runs = 5, parallelType = 0, quiet = 2)) if (!inherits(fit_bio, "try-error")) { cat("Biomarker analysis results:\n") print(fit_bio$par) # Estimate proportion in each group cat("\nEstimated proportion in group 1 (healthy):", round(fit_bio$par[1], 3), "\n") cat("Estimated proportion in group 2 (disease):", round(1 - fit_bio$par[1], 3), "\n") # Bootstrap for confidence intervals boot_bio <- try(bootstrap_mix2(fit_bio, B = 100, quiet = 2)) if (!inherits(boot_bio, "try-error")) { cat("\n95% CI for proportion in group 1:\n") print(boot_bio$ci[, "p"]) } } ``` # Advanced Usage This section illustrates some of the optional parameters for fine-tuning performance and diagnostics. ## Fitting with `fit_*` ```r # Fit lognormal mixture with custom optimization settings fit_ln <- fit_lognorm2( x, NP = 150, # population size for DEoptim n_runs = 20, # independent runs to avoid local optima itermax = 200, # maximum iterations per run parallelType = 1, # use parallelization across platforms quiet = 1, # print minimal progress info par_init = NULL # optionally supply initial values ) ``` **Key options:** - `NP`: larger = better exploration, but slower. - `n_runs`: increase for more robustness; with 1000, it is very slow, but have ultra robustness - `parallelType`: `0 = serial`, `1 = socket clusters (Windows/Mac/Linux)`, `2 = forking (Mac/Linux only)`. - `quiet`: `0 = verbose`, `1 = minimal`, `2 = silent`. ## Bootstrapping with `bootstrap_mix2` ```r boot_res <- bootstrap_mix2( fit_ln, B = 500, # number of bootstrap replications parallelType = 1, # parallel computation parametric = TRUE, # parametric bootstrap ci_level = 0.90, # confidence interval level quiet = 1 # show progress bar ) boot_res$central # means for normal and medians for lognormal boot_res$ci # bootstrap confidence intervals ``` **Notes:** - Supports both parametric and nonparametric bootstrap. - Uses an internal function to avoid label switching. ## Preliminary Plots with `prelim_plots` ```r prelim_plots( x, which = c("hist", "qq", "pp", "logqq"), hist_bins = 40, col_hist = "grey85", col_density = "darkorange", col_qq = "grey60", col_line = "darkorange" ) ``` **Options:** - `which`: choose one or more of `"hist", "qq", "pp", "logqq"`. - `hist_bins`: number of bins for histogram. - `col_*`: customize plot colors. ## Model Selection with `select_best_mixture` ```r sel <- select_best_mixture( x, n_runs = 5, NP = 100, itermax = 150, quiet = 2 ) sel$best$family sel$best$par ``` **What it does:** - Fits both Normal and Lognormal mixtures. - Compares them via BIC. - Returns the best-performing family. ## Evaluating Initial Values with `evaluate_init` ```r init_eval <- evaluate_init( x, family = "normal", n_starts = 10, # number of random initializations NP = 50, # DEoptim population size itermax = 100, # maximum iterations quiet = 2 # suppress output ) print(init_eval) ``` **Usage:** - Helps check the stability of optimization. - Useful to diagnose poor convergence. # Summary The **DEmixR** package provides robust and flexible tools for mixture model analysis, making it a strong alternative to existing EM-based approaches. Its use of **global optimization** helps avoid local minima and improve reliability, especially in complex or overlapping mixtures. **Note:** Harmless socket connection warnings may appear when using parallelization. # Key Features 1. **Robust estimation:** Differential Evolution optimization avoids local minima 2. **Comprehensive diagnostics:** Visualization, model selection, and bootstrap uncertainty 3. **Flexible:** Supports both normal and lognormal distributions 4. **User-friendly:** Sensible defaults with options for customization 5. **Efficient:** Optional parallel computation for faster results # Best Practices 1. Always start with `prelim_plots()` to understand your data distribution 2. Use `select_best_mixture()` when uncertain about distribution family 3. Increase `n_runs` for challenging optimization problems 4. Use bootstrap methods for uncertainty quantification 5. Test different initial values with `evaluate_init()` if convergence is problematic For more information, see the help files for each function: `?fit_norm2`, `?bootstrap_mix2`, etc. # References 1. Mullen, K.M., et al. (2011). DEoptim: An R Package for Global Optimization by Differential Evolution. Journal of Statistical Software, 40(6), 1-26. 2. R Core Team (2024). R: A Language and Environment for Statistical Computing.