splatter 1.24.0
This vignette describes the Splat simulation model and the parameters it uses in more detail.
library("splatter")
library("scater")
library("ggplot2")
This figure, taken from the Splatter publication, describes the core of the Splat simulation model.
The Splat simulation uses a hierarchical probabilistic where different aspects of a dataset are generated from appropriate statistical distributions. The first stage generates a mean expression level for each gene. These are originally chosen from a Gamma distribution. For some genes that are selected to be outliers with high expression a factor is generated from a log-normal distribution. These factors are then multiplied by the median gene mean to create new means for those genes.
The next stage incorporates variation in the counts per cell. An expected library size (total counts) is chosen for each cell from a log-normal distribution. The library sizes are then used to scale the gene means for each cell, resulting in a range a counts per cell in the simulated dataset. The gene means are then further adjusted to enforce a relationship between the mean expression level and the variability.
The final cell by gene matrix of gene means is then used to generate a count matrix using a Poisson distribution. The result is a synthetic dataset consisting of counts from a Gamma-Poisson (or negative-binomial) distribution. An additional optional step can be used to replicate a “dropout” effect. A probability of dropout is generated using a logistic function based on the underlying mean expression level. A Bernoulli distribution is then used to create a dropout matrix which sets some of the generated counts to zero.
The model described here will generate a single population of cells but the Splat simulation has been designed to be as flexible as possible and can create scenarios including multiple groups of cells (cell types), continuous paths between cell types and multiple experimental batches. The parameters used to create these types of simulations and how they interact with the model are described below.
Within Splatter the parameters for the Splat simulation model are held in the
SplatParams
object. Let’s create one of these objects and see what it looks
like.
params <- newSplatParams()
params
#> A Params object of class SplatParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (Genes) (Cells) [SEED]
#> 10000 100 974408
#>
#> 29 additional parameters
#>
#> Batches:
#> [Batches] [Batch Cells] [Location] [Scale] [Remove]
#> 1 100 0.1 0.1 FALSE
#>
#> Mean:
#> (Rate) (Shape)
#> 0.3 0.6
#>
#> Library size:
#> (Location) (Scale) (Norm)
#> 11 0.2 FALSE
#>
#> Exprs outliers:
#> (Probability) (Location) (Scale)
#> 0.05 4 0.5
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [Probability] [Down Prob] [Location] [Scale]
#> 0.1 0.5 0.1 0.4
#>
#> BCV:
#> (Common Disp) (DoF)
#> 0.1 60
#>
#> Dropout:
#> [Type] (Midpoint) (Shape)
#> none 0 -1
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
Like all the parameter objects in Splatter printing this object displays all the
parameters required for this simulation. As we haven’t set any of the parameters
the default values are shown but if we were to change any of them they would be
highlighted. We can also see which parameters can be estimated by the Splat
estimation procedure and which can’t. The default values have been chosen to be
fairly realistic but it is recommended that estimation is used to get a
simulation that is more like the data you are interested in. Parameters can be
modified by setting them in the SplatParams
object or by providing them
directly to the simulation function.
The rest of this section provides details of all this parameters and explains how they can be used with examples.
These parameters are used in every simulation model and control global features of the dataset produced.
nGenes
- Number of genesThe number of genes to simulate.
# Set the number of genes to 1000
params <- setParam(params, "nGenes", 1000)
sim <- splatSimulate(params, verbose = FALSE)
dim(sim)
#> [1] 1000 100
nCells
- Number of cellsThe number of genes to simulate. In the Splat simulation this cannot be set
directly but must be controlled using the batchCells
parameter.
seed
- Random seedSeed to use for generating random numbers including selecting values from distributions. By changing this value multiple simulated datasets with the same parameters can be produced. Simulations produced using the same set of parameters and random seed should be identical but there may be differences between operating systems, software versions etc.
These parameters control experimental batches in the simulated dataset. The overall effect of how batch effects are included in the model is similar to technical replicates (i.e. the same biological sample sequenced multiple times). This means that the underlying structure is consistent between batches but a global technical signature is added may separate them.
nBatches
- Number of batchesThe number of batches in the simulation. This cannot be set directly but is
controlled by setting batchCells
.
batchCells
- Cells per batchA vector specifying the number of cells in each batch. The number of batches
(nBatches
) is equal to the length of the vector and the number of cells
(nCells
) is equal to the sum.
# Simulation with two batches of 100 cells
sim <- splatSimulate(params, batchCells = c(100, 100), verbose = FALSE)
# PCA plot using scater
sim <- logNormCounts(sim)
sim <- runPCA(sim)
plotPCA(sim, colour_by = "Batch")
batch.facLoc
- Batch factor location and batch.facScale
- Batch factor scaleBatches are specified by generating a small scaling factor for each gene in each batch from a log-normal distribution. These factors are then applied to the underlying gene means in each batch. Modifying these parameters affects how different the batches are from each other by generating bigger or smaller factors.
# Simulation with small batch effects
sim1 <- splatSimulate(params, batchCells = c(100, 100),
batch.facLoc = 0.001, batch.facScale = 0.001,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Batch") + ggtitle("Small batch effects")
# Simulation with big batch effects
sim2 <- splatSimulate(params, batchCells = c(100, 100),
batch.facLoc = 0.5, batch.facScale = 0.5,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Batch") + ggtitle("Big batch effects")
batch.rmEffect
- Remove batch effectSetting batch.rmEffect
to TRUE
produces a dataset that is (approximately)
identical to when batch.rmEffect
is FALSE
but with the batch effect removed.
This can be used as the ground truth for evaluating batch correction.
sim1 <- splatSimulate(params, batchCells = c(100, 100), batch.rmEffect = FALSE,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Batch") + ggtitle("With batch effects")
sim2 <- splatSimulate(params, batchCells = c(100, 100), batch.rmEffect = TRUE,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Batch") + ggtitle("Batch effects removed")
These parameters control the distribution that is used to generate the underlying original gene means.
mean.shape
- Mean shape and mean.rate
- Mean rateThese parameters control the Gamma distribution that gene means are drawn from. The relationship between shape and rate can be complex and it is often better to use values estimated from a real dataset than to try and manually set them. Although these parameters control the base gene means the means in the final simulation will depend upon other parts of the model, particularly the simulated total counts per cell (library sizes).
These parameters control the expected number of counts for each cell. Note that because of sampling the actual counts per cell in the final simulation may be different. Turning on the dropout effect will also effect this. We use the term “library size” here for consistency but expected total counts would be more appropriate.
lib.loc
- Library size location and lib.scale
- Library size scaleThese parameters control the shape of the distribution that is used to generate
library sizes for each cell. Increasing lib.loc
will lead to more counts per
cell and increasing lib.scale
will result in more variability in the counts
per cell.
lib.norm
- Library size distributionThe default (and recommended) distribution used for library sizes in the Splat
simulation is a log-normal. However, in rare cases a normal distribution might
be more appropriate. Setting lib.norm
to TRUE
will use a normal distribution
instead of a log-normal.
When developing the Splat simulation we found that while the Gamma distribution was generally a good match for gene means for some datasets it did not properly capture highly expressed genes. For this reason we added expression outliers to the Splat model.
out.prob
- Expression outlier probabilityThis parameter controls the probability that genes will be selected to be expression outliers. Higher values will results in more outlier genes.
# Few outliers
sim1 <- splatSimulate(out.prob = 0.001, verbose = FALSE)
ggplot(as.data.frame(rowData(sim1)),
aes(x = log10(GeneMean), fill = OutlierFactor != 1)) +
geom_histogram(bins = 100) +
ggtitle("Few outliers")
# Lots of outliers
sim2 <- splatSimulate(out.prob = 0.2, verbose = FALSE)
ggplot(as.data.frame(rowData(sim2)),
aes(x = log10(GeneMean), fill = OutlierFactor != 1)) +
geom_histogram(bins = 100) +
ggtitle("Lots of outliers")
out.facLoc
- Expression outlier factor location and out.facScale
- Expression outlier factor scaleThe expression outlier factors are drawn from a log-normal distribution controlled by these parameters. The generated factors are applied to the median mean expression rather than the existing mean for the selected genes. This is to be consistent with the estimation procedure for these factors and to make sure that the final means are outliers. For example to avoid the situation where a factor is applied to a lowly expressed gene, making it just moderately expressed rather than an expression outlier.
Up until this stage of the simulation only a single population of cells is considered but we often want to simulate datasets with multiple kinds of cells. We do this by assigning cells to groups.
nGroups
- Number of groupsThe number of groups to simulate. This parameter cannot be set directly and is
controlled using group.prob
.
group.prob
- Group probabilitiesA vector giving the probability that cells will be assigned to groups. The
length of the vector gives the number of groups (nGroups
) and the
probabilities must sum to 1. Adjusting the number and relative values of the
probabilities changes the number and relative sizes of the groups. To simulate
groups we also need to use the splatSimulateGroups
function or set
method = "groups"
.
params.groups <- newSplatParams(batchCells = 500, nGenes = 1000)
# One small group, one big group
sim1 <- splatSimulateGroups(params.groups, group.prob = c(0.9, 0.1),
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Group") + ggtitle("One small group, one big group")
# Five groups
sim2 <- splatSimulateGroups(params.groups,
group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2),
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Group") + ggtitle("Five groups")
Note: Once there are more than three or four groups it becomes difficult to properly view them in PCA space. We use PCA here for simplicity but generally a non-linear dimensionality reduction such as t-SNE or UMAP is a more useful way to visualise the groups.
Different groups are created by modifying the base expression levels of selected genes. The process for doing this is to simulate differential expression (DE) between each group and a fictional base cell. Altering the differential expression parameters controls how similar groups are to each other.
de.prob
- DE probabilityThis parameter controls the probability that a gene will be selected to be differentially expressed.
# Few DE genes
sim1 <- splatSimulateGroups(params.groups, group.prob = c(0.5, 0.5),
de.prob = 0.01, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Group") + ggtitle("Few DE genes")
# Lots of DE genes
sim2 <- splatSimulateGroups(params.groups, group.prob = c(0.5, 0.5),
de.prob = 0.3, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Group") + ggtitle("Lots of DE genes")
de.downProb
- Down-regulation probabilityA selected DE gene can be either down-regulated (has a factor less than one) or up-regulated (has a factor greater than one). This parameter controls the probability that a selected gene will be down-regulated.
de.facLoc
- DE factor location and de.facScale
- DE factor scaleDifferential expression factors are produced from a log-normal distribution in a similar way to batch effect factors and expression outlier factors. Changing these parameters can result in more or less extreme differences between groups.
# Small DE factors
sim1 <- splatSimulateGroups(params.groups, group.prob = c(0.5, 0.5),
de.facLoc = 0.01, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Group") + ggtitle("Small DE factors")
# Big DE factors
sim2 <- splatSimulateGroups(params.groups, group.prob = c(0.5, 0.5),
de.facLoc = 0.3, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Group") + ggtitle("Big DE factors")
Just looking at the PCA plots this effect seems similar to adjusting de.prob
but the effect is achieved in a different way. A higher de.prob
means that
more genes are differentially expressed but changing the DE factors changes the
level of DE for the same number of genes.
Each of the differential expression parameters can be specified for each group
by providing a vector of values. These vectors must be the same length as
group.prob
. Specifying parameters as vectors allows more complex simulations
where groups are more or less different to each other rather than being equally
distinct. Here are some examples of different DE scenarios.
# Different DE probs
sim1 <- splatSimulateGroups(params.groups,
group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2),
de.prob = c(0.01, 0.01, 0.1, 0.1, 0.3),
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Group") +
labs(title = "Different DE probabilities",
caption = paste("Groups 1 and 2 have very few DE genes,",
"Groups 3 and 4 have the default number,",
"Group 5 has many DE genes"))
# Different DE factors
sim2 <- splatSimulateGroups(params.groups,
group.prob = c(0.2, 0.2, 0.2, 0.2, 0.2),
de.facLoc = c(0.01, 0.01, 0.1, 0.1, 0.2),
de.facScale = c(0.2, 0.5, 0.2, 0.5, 0.4),
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Group") +
labs(title = "Different DE factors",
caption = paste("Group 1 has factors with small location (value),",
"and scale (variability),",
"Group 2 has small location and greater scale.\n",
"Groups 3 and 4 have greater location with small,",
"and large scales",
"Group 5 has bigger factors with moderate",
"variability"))
# Combination of everything
sim3 <- splatSimulateGroups(params.groups,
group.prob = c(0.05, 0.2, 0.2, 0.2, 0.35),
de.prob = c(0.3, 0.1, 0.2, 0.01, 0.1),
de.downProb = c(0.1, 0.4, 0.9, 0.6, 0.5),
de.facLoc = c(0.6, 0.1, 0.1, 0.01, 0.2),
de.facScale = c(0.1, 0.4, 0.2, 0.5, 0.4),
verbose = FALSE)
sim3 <- logNormCounts(sim3)
sim3 <- runPCA(sim3)
plotPCA(sim3, colour_by = "Group") +
labs(title = "Different DE factors",
caption = paste(
"Group 1 is small with many very up-regulated DE genes,",
"Group 2 has the default DE parameters,\n",
"Group 3 has many down-regulated DE genes,",
"Group 4 has very few DE genes,",
"Group 5 is large with moderate DE factors")
)
The BCV parameters control the variability of the genes in the simulated dataset.
bcv.common
- Common BCVThe bcv.common
parameter controls the underlying common variability across
all genes in the dataset.
bcv.df
- BCV Degrees of FreedomThis parameter sets the degrees of freedom used in the BCV inverse chi-squared distribution. Changing this changes the effect of mean expression on the variability of a gene.
These parameters control whether additional dropout is added to increase the number of zeros in the simulated dataset and if it is how that is applied.
dropout.type
- Dropout typeThis parameter determines the kind of dropout effect to simulate. Setting it to
"none"
means no dropout, “experiment
” is global dropout using the same set
of parameters for every cell, "batch"
uses the same parameters for every cell
in the same batch, "group"
uses the same parameters for every cell in the
same group and "cell"
uses a different set of parameters for every cell.
dropout.mid
- Dropout mid point and dropout.shape
- Dropout shapeThe probability that a particular count in a particular cell is set to zero is
related to the mean expression of that gene in that cell. This relationship is
represented using a logistic function with these parameters. The dropout.mid
parameter control the point at which the probability is equal to 0.5 and the
dropout.shape
controls how it changes with increasing expression. These
parameters must be vectors of the appropriate length depending on the selected
dropout type.
For many uses simulating groups is sufficient but in some cases it is more
appropriate to simulate continuous changes between cell types. The number of
paths and the probability of assigning cells to them is still controlled by the
group.prob
parameter and the amount of change along a path is controlled by
the DE parameters but other aspects of the splatSimulatePaths
model are
controlled by these parameters.
path.from
- Path originThis parameter controls the order of differentiation paths. It is a vector the
same length as group.prob
giving the starting position of each path. For
example a path.from
of c(0, 1, 1, 3)
would indicate the Path 1 starts at
the origin (0), Path 2 starts at the end of Path 1, Path 3 also starts at the
end of Path 1 (a branch point) and Path 4 starts at the end of Path 3.
# Linear paths
sim1 <- splatSimulatePaths(params.groups,
group.prob = c(0.25, 0.25, 0.25, 0.25),
de.prob = 0.5, de.facLoc = 0.2,
path.from = c(0, 1, 2, 3),
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Group") + ggtitle("Linear paths")
# Branching path
sim2 <- splatSimulatePaths(params.groups,
group.prob = c(0.25, 0.25, 0.25, 0.25),
de.prob = 0.5, de.facLoc = 0.2,
path.from = c(0, 1, 1, 3),
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Group") + ggtitle("Branching path")
path.nSteps
- Number of stepsA path is created by using the same differential expression procedure as used for groups to generate an end point. Interpolation is then used to create a series of steps between the start and end points. This parameter controls the number of steps along a path and therefore how discrete or smooth it is.
# Few steps
sim1 <- splatSimulatePaths(params.groups, path.nSteps = 3,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Step") + ggtitle("Few steps")
# Lots of steps
sim2 <- splatSimulatePaths(params.groups, path.nSteps = 1000,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Step") + ggtitle("Lots of steps")
path.skew
- Path skewBy default cells are evenly distributed along a path but it sometimes be useful
to introduce a skew in the distribution, For example you may want to simulate
a scenario with few stem-like cells and many differentiated cells. Setting
path.skew
to 0 will mean that all cells come from the end point while
higher values up to 1 will skew them towards the start point.
# Skew towards the end
sim1 <- splatSimulatePaths(params.groups, path.skew = 0.1,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Step") + ggtitle("Skewed towards the end")
path.nonlinearProb
- Non-linear probabilityMost genes are interpolated in a linear way along a path but in reality this
may not always be the case. For example it is easy to imagine a gene that
is lowly-expressed at the start of a process, highly-expressed in the middle
and lowly-expressed again at the end. The path.nonlinearProb
parameter
controls the probability that a gene will change in a non-linear way along a
path.
path.sigmaFac
- Path skewNon-linear changes along a path are achieved by building a Brownian bridge
between the two end points. A Brownian bridge is Brownian motion controlled in
such a way that the end points are fixed. The path.sigmaFac
parameter controls
how extreme each step in the Brownian motion is and therefore how much the
interpolation differs from a linear path.