--- title: "pulsar: parallel utilities for model selection" author: - name: Zachary Kurtz email: zdkurtz@gmail.com output: BiocStyle::html_document: self_contained: yes toc: true toc_float: true toc_depth: 2 code_folding: show date: "`r format(Sys.time(), '%B %d, %Y')`" package: "SpiecEasi" version: "1.99.5" vignette: > %\VignetteIndexEntry{pulsar: parallel utilities for model selection} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, echo = FALSE, eval=TRUE, message=FALSE} library(BiocStyle) knitr::opts_knit$set( upload.fun = NULL, base.url = NULL) # use local files for images knitr::opts_chunk$set( collapse = TRUE, comment = "#" ) # Override BiocStyle plot hook to avoid css_align issues knitr::knit_hooks$set(plot = function(x, options) { paste0('![', basename(x), '](', x, ')') }) runchunks = if (Sys.getenv("FORCE_VIGNETTE_REBUILD", "FALSE") == "TRUE") TRUE else FALSE cache_file <- '../vignette_cache/pulsar-parallel.RData' if (!runchunks && file.exists(cache_file)) { load(cache_file) # If we loaded trimmed objects, reassign them to original names if (exists("se1.trimmed")) { se1 <- se1.trimmed rm(se1.trimmed) } if (exists("se2.trimmed")) { se2 <- se2.trimmed rm(se2.trimmed) } if (exists("se3.trimmed")) { se3 <- se3.trimmed rm(se3.trimmed) } if (exists("se4.trimmed")) { se4 <- se4.trimmed rm(se4.trimmed) } if (exists("se.serial.trimmed")) { se.serial <- se.serial.trimmed rm(se.serial.trimmed) } } else { if (!runchunks) { message("Cache file pulsar-parallel.RData not found - building from scratch") } runchunks <- TRUE } saveout = runchunks ``` # pulsar: parallel utilities for model selection SpiecEasi is now using the [pulsar package](https://cran.r-project.org/package=pulsar) as the backend for performing model selection. In the default parameter setting, this uses the same [StARS](https://arxiv.org/abs/1006.3316) procedure as previous versions. 
As in previous versions of SpiecEasi, we can supply the `ncores` argument in the `pulsar.params` list to split the subsampled computations into parallel tasks.

**Note**: On Windows systems, `mc.cores > 1` is not supported by default. Windows users should run in serial mode or use batch mode with `conffile='snow'`.

In this example, we set the random seed to make comparisons across experiments consistent:

```{r, eval=TRUE}
library(SpiecEasi)
data(amgut1.filt)
```

```{r, eval=runchunks, message=FALSE, warning=FALSE}
## Default settings ##
pargs1 <- list(rep.num=50, seed=10010)
t1 <- system.time(
  se1 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='stars', pulsar.select=TRUE,
                    pulsar.params=pargs1)
)

## Platform- or environment-specific parallel processing ##
if (.Platform$OS.type == "windows" || Sys.getenv("CI") == "true" ||
    nzchar(Sys.getenv("GITHUB_ACTIONS"))) {
  # On Windows (or in CI environments), run serially
  pargs2 <- list(rep.num=50, seed=10010, ncores=1)
  cat("Windows or CI detected - using serial processing\n")
} else {
  # On Unix-like systems, use multicore
  n_cores <- min(2, parallel::detectCores())
  pargs2 <- list(rep.num=50, seed=10010, ncores=n_cores)
  cat("Running on Unix-like system - using parallel processing\n")
}

t2 <- system.time(
  se2 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='stars', pulsar.select=TRUE,
                    pulsar.params=pargs2)
)
```

We can further speed up StARS with the [bounded-StARS](https://arxiv.org/abs/1605.07072) ('bstars') method. B-StARS computes network stability across the whole lambda path, but only for the first two subsamples. This yields an initial estimate of the summary statistic, which in turn gives a lower/upper bound on the optimal lambda. The remaining subsamples are then used to compute stability over the restricted path.
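When calling pulsar directly, bounded StARS corresponds to the `lb.stars`/`ub.stars` arguments. A self-contained sketch on synthetic data (all parameter choices here are illustrative):

```{r, eval=FALSE}
library(pulsar)
library(huge)

set.seed(10010)
dat  <- huge.generator(n=100, d=40, graph="hub", verbose=FALSE)
lams <- getLamPath(getMaxCov(dat$data), 0.01, len=30)

# Bounded StARS: the first two subsamples bound the optimal lambda,
# and the remaining subsamples evaluate stability on the restricted path
out.b <- pulsar(dat$data, fun=huge::huge,
                fargs=list(lambda=lams, verbose=FALSE),
                rep.num=20, criterion="stars",
                lb.stars=TRUE, ub.stars=TRUE)
```

In `spiec.easi()`, selecting `sel.criterion='bstars'` enables these bounds for you.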
Since denser networks are more computationally expensive to fit, this can significantly reduce run time for datasets with many variables.

```{r, message=FALSE, warning=FALSE, eval=runchunks}
t3 <- system.time(
  se3 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='bstars', pulsar.select=TRUE,
                    pulsar.params=pargs1)
)

t4 <- system.time(
  se4 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                    nlambda=30, sel.criterion='bstars', pulsar.select=TRUE,
                    pulsar.params=pargs2)
)
```

We can see that, in addition to the computational savings, the refit networks are identical:

```{r, eval=TRUE}
## serial vs parallel
identical(getRefit(se1), getRefit(se2))
t1[3] > t2[3]

## stars vs bstars
identical(getRefit(se1), getRefit(se3))
t1[3] > t3[3]
identical(getRefit(se2), getRefit(se4))
t2[3] > t4[3]
```

## Windows-specific considerations

Windows users have several options for parallel processing:

### Option 1: Use batch mode with snow

For parallel processing on Windows, use batch mode with the `snow` configuration. See the [Batch Mode](#batch-mode) section below for a full example, and use `conffile='snow'` instead of `conffile='parallel'`.

### Option 2: Use serial processing

```{r, eval=runchunks}
# Simple serial processing for Windows
pargs.serial <- list(rep.num=50, seed=10010, ncores=1)
se.serial <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                        nlambda=30, sel.criterion='stars', pulsar.select=TRUE,
                        pulsar.params=pargs.serial)
```

### Option 3: Use batch mode

```{r, eval=FALSE}
# Batch mode works on all platforms
bargs <- list(rep.num=50, seed=10010, conffile="parallel")
se.batch <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                       nlambda=30, sel.criterion='stars', pulsar.select='batch',
                       pulsar.params=bargs)
```

## Batch Mode

Pulsar gives us the option of running stability selection in batch mode, using the [batchtools](https://mllg.github.io/batchtools/) package.
This is useful for anyone with access to an HPC/distributed-computing system: each subsample is executed independently via a system-specific cluster function. This requires an external config file that instructs the batchtools registry how to construct the cluster function which will execute the individual jobs. `batch.pulsar` has some built-in config files that are useful for testing purposes (serial mode, "parallel", "snow", etc.), but it is advisable to create your own config file and pass in the absolute path. See the [batchtools docs](https://mllg.github.io/batchtools/articles/batchtools.html#configuration-file) for instructions on constructing config and template files (e.g., to interact with a queueing system such as TORQUE or SGE).

```{r, eval=FALSE}
## bargs <- list(rep.num=50, seed=10010, conffile="path/to/conf.txt")
bargs <- list(rep.num=50, seed=10010, conffile="parallel")

## See where the config file is stored:
pulsar::findConfFile('parallel')

## uncomment line below to turn off batchtools reporting
# options(batchtools.verbose=FALSE)
se5 <- spiec.easi(amgut1.filt, method='mb', lambda.min.ratio=1e-3,
                  nlambda=30, sel.criterion='stars', pulsar.select='batch',
                  pulsar.params=bargs)
```

## Performance comparison

Let's compare the timing of the different approaches:

```{r, eval=TRUE}
# Compare timing
cat("Serial StARS:", t1[3], "seconds\n")
cat("Platform-specific StARS:", t2[3], "seconds\n")
cat("Serial B-StARS:", t3[3], "seconds\n")
cat("Platform-specific B-StARS:", t4[3], "seconds\n")

# Speedup factors (only meaningful on Unix-like systems)
if (.Platform$OS.type != "windows") {
  cat("Parallel speedup (StARS):", t1[3]/t2[3], "\n")
  cat("B-StARS speedup (serial):", t1[3]/t3[3], "\n")
  cat("B-StARS speedup (parallel):", t2[3]/t4[3], "\n")
}
```

## Key parameters

- **`rep.num`**: Number of subsamples for stability selection (default: 50)
- **`ncores`**: Number of cores for parallel processing (use 1 on Windows)
- **`sel.criterion`**: Selection criterion ('stars' or 'bstars')
- **`pulsar.select`**: Whether to use pulsar for model selection
- **`pulsar.params`**: List of parameters passed to pulsar

## Platform-specific recommendations

### Unix-like systems (Linux, macOS)

1. **For small datasets**: Use default StARS with serial processing
2. **For medium datasets**: Use parallel StARS with 4-8 cores
3. **For large datasets**: Use B-StARS with parallel processing
4. **For HPC clusters**: Use batch mode with appropriate config files

### Windows systems

1. **For small datasets**: Use default StARS with serial processing
2. **For medium datasets**: Use B-StARS with serial processing
3. **For large datasets**: Use batch mode with `conffile='snow'`
4. **For HPC clusters**: Use batch mode with appropriate config files

Session info:

```{r}
sessionInfo()
```

```{r, echo = FALSE, eval=TRUE, message=FALSE}
# Save objects if they exist
if (exists("se1") && exists("se2")) {
  cache_file <- '../vignette_cache/pulsar-parallel.RData'
  tryCatch({
    # Load trimming function and trim objects to reduce size
    source('../vignette_cache/trim_spiec_easi.R')
    se1.trimmed <- trim_spiec_easi(se1)
    se2.trimmed <- trim_spiec_easi(se2)
    se3.trimmed <- trim_spiec_easi(se3)
    se4.trimmed <- trim_spiec_easi(se4)
    se.serial.trimmed <- trim_spiec_easi(se.serial)
    # Save trimmed objects and timing data
    save(se1.trimmed, se2.trimmed, se3.trimmed, se4.trimmed,
         se.serial.trimmed, t1, t2, t3, t4, pargs1, pargs2, file=cache_file)
    message("Saved trimmed cache file: ", cache_file, " in directory: ", getwd())
  }, error = function(e) {
    message("Failed to save cache file: ", e$message)
  })
}
```