--- title: "scMultiSim Basics" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{scMultiSim Basics} %\usepackage[UTF-8]{inputenc} --- ```{r "setup", include=FALSE} require("knitr") opts_chunk$set(fig.width=4, fig.height=3) ``` We define a utility function to modify a list. ```{r} list_modify <- function (curr_list, ...) { args <- list(...) for (i in names(args)) { curr_list[[i]] <- args[[i]] } curr_list } ``` ```{r load-package, quietly=TRUE, message=FALSE, warning=FALSE} # library("devtools") # devtools::load_all(".") library("scMultiSim") library(dplyr) ``` # Simulation Basics Simulate true counts by calling `sim_true_counts(options)` where `options` is a list. Use `scmultisim_help()` to get help. ```{r scmultisim-help, echo = TRUE, results = "hide"} scmultisim_help("options") ``` ## GRN and Differentiation Tree The minimal input to scMultiSim is a **differentiation tree**, and you can optionally provide ground truth for GRN and cell-cell interactions. The differentiation tree is an R phylo object, which can be created using e.g. `ape::read.tree()` or `ape::rtree()`. It controls the cell population structure: each node of the tree should represent a cell type, and connected nodes indicate the differentiation relationship between cell types. _scMultiSim provides this explicit control on the cell population structure while preserving all other effects (such as GRN and Cell-Cell Interactions)_, so you can generate any cell trajectory or clustering structure you want, which is especially useful for benchmarking trajectory inference and clustering methods. If generating a continuous population, this tree specifies the cell differentiation trajectory; if generating a discrete population, the tips of this tree will be the clusters (cell types). scMultiSim also provides three differentiation trees. `Phyla5()` and `Phyla3()` return bifurcating trees with 5 and 3 leaves respectively. `Phyla1()` returns only a single branch, which can be useful when we don't want any specific trajectory. ```{r plot-tree, fig.width = 8, fig.height = 4} par(mfrow=c(1,2)) Phyla5(plotting = TRUE) Phyla3(plotting = TRUE) # It's not possible to plot Phyla1() because it only contains 1 branch connecting two nodes. Phyla1() ``` The GRN should be a data frame with 3 columns, each representing the `target`, `regulator`, and `effect`. The target and regulator should be gene names, which can be integers or strings. The effect should be a numeric value, indicating the effect of the regulator on the target. scMultiSim provides two sample GRNs, `GRN_params_100` and `GRN_params_1139`, which contain 100 and 1139 genes respectively. Let's load them first. ```{r load-grn} data(GRN_params_100) GRN_params <- GRN_params_100 head(GRN_params) ``` ## Simulating True Counts Now, we create the options list for the simulation session. In the following example, we simulate 500 cells with 50 CIFs. The number of genes is determined by the option `num.genes` or the number of genes in the GRN. If `num.genes` is not specified, the number of genes will be the number of unique genes in the GRN, plus a fraction of genes that are not regulated by any other genes. this is controlled by the option `unregulated.gene.ratio` (default is 0.1). Since our `GRN_params` contains 100 gene names, 10% more genes will be added to the simulation, and the number of genes in the simulated data will be 110. If you don't need to simulate GRN effects, simply set `GRN = NA`. The `cif.sigma` controls the variance of the CIFs. Usually, with `cif.sigma` = 0.1, the trajectory will be very clear, while with `cif.sigma` = 1, the trajectory will be more noisy. We use `cif.sigma` = 0.5 in this example. We also have `do.velocity` option to use the Kinetic model to simulate RNA velocity data. ```{r define-options} set.seed(42) options <- list( GRN = GRN_params, num.cells = 300, num.cifs = 20, cif.sigma = 1, tree = Phyla5(), diff.cif.fraction = 0.8, do.velocity = TRUE ) ``` Now we run the simulation and check what kind of data is in the returned result: ```{r run-simulation} results <- sim_true_counts(options) names(results) ``` The `results` contains the following important data: - `counts`: the true RNA counts (genes x cells) - `atac_counts`: the true ATAC-seq data (regions x cells) - `cell_meta`: the cell meta data, including the cell population labels ## Visualize the Results We can visualize the true counts and ATAC-seq data using `ploTSNE(data, labels)`: ```{r plot-counts, fig.width = 4, fig.height = 3.5, out.width = "60%"} plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'True RNA Counts Tsne') plot_tsne(log2(results$atacseq_data + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'True ATAC-seq Tsne') ``` Since we also have RNA velocity enabled, the `results` also contains the following data: - `velocity`: the true RNA velocity (genes x cells) - `unspliced_counts`: the true unspliced RNA counts (genes x cells) ```{r plot-velocity, fig.width = 4, fig.height = 3.5, out.width = "60%"} plot_rna_velocity(results, arrow.length = 2) ``` ## Adding Technical Variation We can also add the technical variation and batch effect to the true counts: ```{r add-expr-noise, fig.width = 4, fig.height = 3.5, out.width = "60%"} # adds `counts_obs` to `results` add_expr_noise(results) # adds `counts_with_batches` to `results` divide_batches(results, nbatch = 2) plot_tsne(log2(results$counts_with_batches + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts Tsne with Batches') ``` # Adjusting Parameters scMultiSim provides various parameters to control each type of biological effect. ## Discrete Cell Population We can also simulate discrete cell population by setting `discrete.cif = TRUE`. In this case, each tip of the tree will be one cell type, therefore there will be 5 clusters in the following result. ```{r simulate-discrete, fig.width = 4, fig.height = 3.5, out.width = "60%"} set.seed(42) options <- list( GRN = GRN_params, num.cells = 400, num.cifs = 20, tree = Phyla5(), diff.cif.fraction = 0.8, discrete.cif = TRUE ) results <- sim_true_counts(options) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'True RNA Counts Tsne') ``` ## Adjusting the Effect of Cell Population In scMultiSim, the differentiation tree provides explicit control of the cell population. The effect of the tree can be adjusted by the option `diff.cif.fraction`, which controls how many CIFs are affected by the cell population. With a larger `diff.cif.fraction`, the effect of cell population will be larger and you may see a clearer trajectory or well separated clusters. With a smaller `diff.cif.fraction`, the resulting RNA counts will be more affected by other factors, such as the GRN. Now let's visualize the trajectory with different `diff.cif.fraction` values: ```{r adjust-diff-cif-fraction, fig.width = 4, fig.height = 3.5, out.width = "60%"} set.seed(42) options <- list( GRN = GRN_params, num.cells = 300, num.cifs = 20, tree = Phyla5(), diff.cif.fraction = 0.8 ) results <- sim_true_counts( options %>% list_modify(diff.cif.fraction = 0.4)) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts (diff.cif.fraction = 0.2)') results <- sim_true_counts( options %>% list_modify(diff.cif.fraction = 0.9)) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts (diff.cif.fraction = 0.8)') ``` ## Adjusting the Inherent Cell Heterogeneity The inherent cell heterogeneity is controlled by the non-diff-CIF, which is sampled from a normal distribution with mean `cif.mean` and standard deviation `cif.sigma`. Therefore, the larger `cif.sigma` is, the larger the inherent cell heterogeneity is. Now, let's visualize the effect of `cif.sigma`: ```{r adjust-cif-sigma, fig.width = 4, fig.height = 3.5, out.width = "60%"} set.seed(42) options <- list( GRN = GRN_params, num.cells = 300, num.cifs = 20, tree = Phyla5(), diff.cif.fraction = 0.8, cif.sigma = 0.5 ) results <- sim_true_counts( options %>% list_modify(cif.sigma = 0.1)) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts (cif.sigma = 0.1)') results <- sim_true_counts( options %>% list_modify(cif.sigma = 1.0)) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts (cif.sigma = 1.0)') ``` ## Adjusting the Intrinsic Noise If we set `do.velocity = FALSE`, scMultiSim will simulate the RNA counts using the Beta-Poisson model, which is faster but doesn't output RNA velocity. When using the Beta-Possion model, scMultiSim provides a `intrinsic.noise` parameter to control the intrinsic noise during the transcription process. By default, `intrinsic.noise` is set to 1, which means the true counts will be sampled from the Beta-Poisson model. If we set `intrinsic.noise` to a smaller value like 0.5, the true counts will be 0.5 * (theoretical mean) + 0.5 * (sampled from the Beta-Poisson model). ```{r adjust-intrinsic-noise, fig.width = 4, fig.height = 3.5, out.width = "60%"} set.seed(42) options <- list( GRN = GRN_params, num.cells = 300, num.cifs = 20, tree = Phyla5(), diff.cif.fraction = 0.8, intrinsic.noise = 1 ) results <- sim_true_counts( options %>% list_modify(intrinsic.noise = 0.5)) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts (intrinsic.noise = 0.5)') results <- sim_true_counts( options %>% list_modify(intrinsic.noise = 1)) plot_tsne(log2(results$counts + 1), results$cell_meta$pop, legend = 'pop', plot.name = 'RNA Counts (intrinsic.noise = 1)') ``` # Simulating Dynamic GRN First, call the following function to check the usage of dynamic GRN. ```{r help-dynamic-grn} scmultisim_help("dynamic.GRN") ``` Here we use `Phyla1()` as the differentiation tree to remove the effect of the trajectory. Additionally, we can use `discrete.cif = TRUE` to simulate discrete cell population. ```{r define-options-dynamic-grn} set.seed(42) options_ <- list( GRN = GRN_params, num.cells = 300, num.cifs = 20, tree = Phyla1(), diff.cif.fraction = 0.8, do.velocity = FALSE, dynamic.GRN = list( cell.per.step = 3, num.changing.edges = 5, weight.mean = 0, weight.sd = 4 ) ) results <- sim_true_counts(options_) ``` `results$cell_specific_grn` is a list containing the gene effects matrix for each cell. Each row is a target and each column is a regulator. The corresponding gene names are displayed as column and row names. ```{r show-cell-specific-grn} # GRN for cell 1 (first 10 rows) results$cell_specific_grn[[1]][1:10,] ``` Since we set `cell.per.step = 3`, we expect each adjacent 3 cells share the same GRN: ```{r check-cell-specific-grn} print(all(results$cell_specific_grn[[1]] == results$cell_specific_grn[[2]])) print(all(results$cell_specific_grn[[2]] == results$cell_specific_grn[[3]])) print(all(results$cell_specific_grn[[3]] == results$cell_specific_grn[[4]])) ``` # Session Information ```{r session-info} sessionInfo() ```