--- title: "pulsar: parallel utilities for model selection" author: - name: Zachary Kurtz email: zdkurtz@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r format(Sys.time(), '%B %d, %Y')`" package: "SpiecEasi" version: "1.99.5" vignette: > %\VignetteIndexEntry{pulsar: parallel utilities for model selection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo = FALSE, eval=TRUE, message=FALSE} library(BiocStyle) knitr::opts_knit$set( upload.fun = NULL, base.url = NULL) # use local files for images knitr::opts_chunk$set( collapse = TRUE, comment = "#" ) # Override BiocStyle plot hook to avoid css_align issues knitr::knit_hooks$set(plot = function(x, options) { paste0('![', basename(x), '](', x, ')') }) runchunks = if (Sys.getenv("FORCE_VIGNETTE_REBUILD", "FALSE") == "TRUE") TRUE else FALSE cache_file <- '../vignette_cache/pulsar-parallel.RData' if (!runchunks && file.exists(cache_file)) { load(cache_file) # If we loaded trimmed objects, reassign them to original names if (exists("se1.trimmed")) { se1 <- se1.trimmed rm(se1.trimmed) } if (exists("se2.trimmed")) { se2 <- se2.trimmed rm(se2.trimmed) } if (exists("se3.trimmed")) { se3 <- se3.trimmed rm(se3.trimmed) } if (exists("se4.trimmed")) { se4 <- se4.trimmed rm(se4.trimmed) } if (exists("se.serial.trimmed")) { se.serial <- se.serial.trimmed rm(se.serial.trimmed) } } else { if (!runchunks) { message("Cache file pulsar-parallel.RData not found - building from scratch") } runchunks <- TRUE } saveout = runchunks ``` # pulsar: parallel utilities for model selection SpiecEasi is now using the [pulsar package](https://cran.r-project.org/package=pulsar) as the backend for performing model selection. In the default parameter setting, this uses the same [StARS](https://arxiv.org/abs/1006.3316) procedure as previous versions. 
As in previous versions of SpiecEasi, we can supply the `ncores` argument in the `pulsar.params` list to split the subsampled computations into parallel tasks.

**Note**: On Windows systems, `mc.cores > 1` is not supported by default. Windows users should run in serial mode or use batch mode with `conffile='snow'`.

In this example, we set the random seed to make comparisons across experiments consistent:

```{r, eval=TRUE}
library(SpiecEasi)
data(amgut1.filt)
```

```{r, eval=runchunks, message=FALSE, warning=FALSE}
## Default settings ##
pargs1 <- list(rep.num=50, seed=10010)
t1 <- system.time(
  se1 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='stars', pulsar.select=TRUE,
                    pulsar.params=pargs1)
)

## Platform- or environment-specific parallel processing ##
if (.Platform$OS.type == "windows" || Sys.getenv("CI") == "true" ||
    nzchar(Sys.getenv("GITHUB_ACTIONS"))) {
  # On Windows (or in CI environments), run serially
  pargs2 <- list(rep.num=50, seed=10010, ncores=1)
  cat("Windows or CI detected - using serial processing\n")
} else {
  # On Unix-like systems, use multicore
  n_cores <- min(2, parallel::detectCores())
  pargs2 <- list(rep.num=50, seed=10010, ncores=n_cores)
  cat("Running on Unix-like system - using parallel processing\n")
}

t2 <- system.time(
  se2 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='stars', pulsar.select=TRUE,
                    pulsar.params=pargs2)
)
```

We can further speed up StARS with the [bounded-StARS](https://arxiv.org/abs/1605.07072) ('bstars') method. B-StARS computes network stability across the whole lambda path, but only for the first two subsamples. This yields an initial estimate of the summary statistic, which in turn gives a lower/upper bound on the optimal lambda. The remaining subsamples are then used to compute stability over the restricted path.
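When calling pulsar directly, bounded StARS corresponds to the `lb.stars`/`ub.stars` arguments. A self-contained sketch on synthetic data (all parameter choices here are illustrative):

```{r, eval=FALSE}
library(pulsar)
library(huge)

set.seed(10010)
dat  <- huge.generator(n=100, d=40, graph="hub", verbose=FALSE)
lams <- getLamPath(getMaxCov(dat$data), 0.01, len=30)

# Bounded StARS: the first two subsamples bound the optimal lambda,
# and the remaining subsamples evaluate stability on the restricted path
out.b <- pulsar(dat$data, fun=huge::huge,
                fargs=list(lambda=lams, verbose=FALSE),
                rep.num=20, criterion="stars",
                lb.stars=TRUE, ub.stars=TRUE)
```

In `spiec.easi()`, selecting `sel.criterion='bstars'` enables these bounds for you.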
Since denser networks are more computationally expensive to fit, this can significantly reduce run time for datasets with many variables.

```{r, message=FALSE, warning=FALSE, eval=runchunks}
t3 <- system.time(
  se3 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='bstars', pulsar.select=TRUE,
                    pulsar.params=pargs1)
)

t4 <- system.time(
  se4 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='bstars', pulsar.select=TRUE,
                    pulsar.params=pargs2)
)
```

We can see that, in addition to the computational savings, the refit networks are identical:

```{r, eval=TRUE}
## serial vs parallel
identical(getRefit(se1), getRefit(se2))
t1[3] > t2[3]

## stars vs bstars
identical(getRefit(se1), getRefit(se3))
t1[3] > t3[3]
identical(getRefit(se2), getRefit(se4))
t2[3] > t4[3]
```

## Windows-specific considerations

Windows users have several options for parallel processing:

### Option 1: Use batch mode with snow

For parallel processing on Windows, use batch mode with the `snow` configuration. See the [Batch Mode](#batch-mode) section below for a full example, and use `conffile='snow'` instead of `conffile='parallel'`.

### Option 2: Use serial processing

```{r, eval=runchunks}
# Simple serial processing for Windows
pargs.serial <- list(rep.num=50, seed=10010, ncores=1)
se.serial <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                        nlambda=30, sel.criterion='stars', pulsar.select=TRUE,
                        pulsar.params=pargs.serial)
```

### Option 3: Use batch mode

```{r, eval=FALSE}
# Batch mode works on all platforms
bargs <- list(rep.num=50, seed=10010, conffile="parallel")
se.batch <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                       nlambda=30, sel.criterion='stars', pulsar.select='batch',
                       pulsar.params=bargs)
```

## Batch Mode

Pulsar gives us the option of running stability selection in batch mode, using the [batchtools](https://mllg.github.io/batchtools/) package.
This is useful for anyone with access to an HPC/distributed-computing system: each subsample is executed independently via a system-specific cluster function. This requires an external config file that instructs the batchtools registry how to construct the cluster function which will execute the individual jobs. `batch.pulsar` has some built-in config files that are useful for testing purposes (serial mode, "parallel", "snow", etc.), but it is advisable to create your own config file and pass in the absolute path. See the [batchtools docs](https://mllg.github.io/batchtools/articles/batchtools.html#configuration-file) for instructions on constructing config and template files (e.g., to interact with a queueing system such as TORQUE or SGE).

```{r, eval=FALSE}
## bargs <- list(rep.num=50, seed=10010, conffile="path/to/conf.txt")
bargs <- list(rep.num=50, seed=10010, conffile="parallel")

## See where the config file is stored:
pulsar::findConfFile('parallel')

## uncomment line below to turn off batchtools reporting
# options(batchtools.verbose=FALSE)
se5 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                  nlambda=30, sel.criterion='stars', pulsar.select='batch',
                  pulsar.params=bargs)
```

## Performance comparison

Let's compare the timing of the different approaches:

```{r, eval=TRUE}
# Compare timing
cat("Serial StARS:", t1[3], "seconds\n")
cat("Platform-specific StARS:", t2[3], "seconds\n")
cat("Serial B-StARS:", t3[3], "seconds\n")
cat("Platform-specific B-StARS:", t4[3], "seconds\n")

# Speedup factors (only meaningful on Unix-like systems)
if (.Platform$OS.type != "windows") {
  cat("Parallel speedup (StARS):", t1[3]/t2[3], "\n")
  cat("B-StARS speedup (serial):", t1[3]/t3[3], "\n")
  cat("B-StARS speedup (parallel):", t2[3]/t4[3], "\n")
}
```

## Key parameters

- **`rep.num`**: Number of subsamples for stability selection (default: 50)
- **`ncores`**: Number of cores for parallel processing (use 1 on Windows)
- **`sel.criterion`**: Selection criterion ('stars' or 'bstars')
- **`pulsar.select`**: Whether to use pulsar for model selection
- **`pulsar.params`**: List of parameters passed to pulsar

## Platform-specific recommendations

### Unix-like systems (Linux, macOS)

1. **For small datasets**: Use default StARS with serial processing
2. **For medium datasets**: Use parallel StARS with 4-8 cores
3. **For large datasets**: Use B-StARS with parallel processing
4. **For HPC clusters**: Use batch mode with appropriate config files

### Windows systems

1. **For small datasets**: Use default StARS with serial processing
2. **For medium datasets**: Use B-StARS with serial processing
3. **For large datasets**: Use batch mode with `conffile='snow'`
4. **For HPC clusters**: Use batch mode with appropriate config files

Session info:

```{r}
sessionInfo()
```

```{r, echo = FALSE, eval=TRUE, message=FALSE}
# Save objects if they exist
if (exists("se1") && exists("se2")) {
  cache_file <- '../vignette_cache/pulsar-parallel.RData'
  tryCatch({
    # Load trimming function and trim objects to reduce size
    source('../vignette_cache/trim_spiec_easi.R')
    se1.trimmed <- trim_spiec_easi(se1)
    se2.trimmed <- trim_spiec_easi(se2)
    se3.trimmed <- trim_spiec_easi(se3)
    se4.trimmed <- trim_spiec_easi(se4)
    se.serial.trimmed <- trim_spiec_easi(se.serial)
    # Save trimmed objects and timing data
    save(se1.trimmed, se2.trimmed, se3.trimmed, se4.trimmed,
         se.serial.trimmed, t1, t2, t3, t4, pargs1, pargs2, file=cache_file)
    message("Saved trimmed cache file: ", cache_file, " in directory: ", getwd())
  }, error = function(e) {
    message("Failed to save cache file: ", e$message)
  })
}
```