Contents

1 Introduction

Single-cell mRNA sequencing can uncover novel cell-to-cell heterogeneity in gene expression levels within seemingly homogeneous populations of cells. However, these experiments are prone to high levels of technical noise, creating new challenges for identifying genes that show genuine heterogeneous expression within the group of cells under study.

BASiCS (Bayesian Analysis of Single-Cell Sequencing data) is an integrated Bayesian hierarchical model that propagates statistical uncertainty by simultaneously performing data normalisation (global scaling), technical noise quantification and two types of supervised downstream analyses:

  1. For a single group of cells (Vallejos, Marioni, and Richardson 2015): BASiCS provides a criterion to identify highly (and lowly) variable genes within the group.
  2. For two (or more) groups of cells: BASiCS allows the identification of changes in gene expression between the groups. As in traditional differential expression tools, BASiCS can uncover changes in mean expression between the groups. Besides this, BASiCS can also uncover changes in gene expression variability in terms of:
  1. Over-dispersion (Vallejos, Richardson, and Marioni 2016) — a measure for the excess of cell-to-cell variability that is observed with respect to Poisson sampling, after accounting for technical noise. This feature has led, for example, to novel insights in the context of immune cells across aging (Martinez-Jimenez et al. 2017). However, due to the strong mean/over-dispersion confounding that is typically observed for scRNA-seq datasets, the assessment of changes in over-dispersion is restricted to genes for which mean expression does not change between the groups.
  2. Residual over-dispersion (Eling et al. 2018) — a residual measure of variability given by departures with respect to a global mean/over-dispersion trend. Positive residual over-dispersion indicates that a gene exhibits more variation than expected relative to genes with similar expression levels; negative residual over-dispersion suggests less variation than expected. To use this feature, please set Regression = TRUE as a function parameter in BASiCS_MCMC.

In all cases, a probabilistic output is provided and a decision rule is calibrated using the expected false discovery rate (EFDR) (Newton et al. 2004).

A brief description for the statistical model implemented in BASiCS is provided in Section 6 of this document. The original implementation of BASiCS (Vallejos, Marioni, and Richardson 2015) requires the use of spike-in molecules — that are artificially introduced to each cell’s lysate — to perform these analyses. More recently, Eling et al. (2018) extendeded BASiCS to also address datasets for which spikes-ins are not available (see Section 4). To use this feature, please set WithSpikes = FALSE as a function parameter in BASiCS_MCMC.

Important: BASiCS has been designed in the context of supervised experiments where the groups of cells (e.g. experimental conditions, cell types) under study are known a priori (e.g. case-control studies). Therefore, we DO NOT advise the use of BASiCS in unsupervised settings where the aim is to uncover sub-populations of cells through clustering.


2 Quick start

Parameter estimation is performed using the BASiCS_MCMC function. Downstream analyses implemented in BASiCS rely on appropriate post-processing of the output returned by BASiCS_MCMC. Essential parameters for running BASiCS_MCMC are:

If the optional parameter PrintProgress is set to TRUE, the R console will display the progress of the MCMC algorithm. For other optional parameters refer to help(BASiCS_MCMC).

Here, we illustrate the usage of BASiCS_MCMC using a built-in synthetic dataset.

NOTE: WE USE A SMALL NUMBER OF ITERATIONS FOR ILLUSTRATION PURPOSES ONLY. LARGER NUMBER OF ITERATIONS ARE USUALLY REQUIRED TO ACHIEVE CONVERGENCE. OUR RECOMMENDED SETTING IS N=20000, Thin=20 and Burn=10000.

Data <- makeExampleBASiCS_Data()
Chain <- BASiCS_MCMC(Data = Data, N = 1000, Thin = 10, Burn = 500, 
                     PrintProgress = FALSE, Regression = TRUE)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##  
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##  
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##  
## Minimum acceptance rate among mu[i]'s: 0.422
## Average acceptance rate among mu[i]'s: 0.59648
## Maximum acceptance rate among mu[i]'s: 0.766
##  
##  
## Minimum acceptance rate among delta[i]'s: 0.45
## Average acceptance rate among delta[i]'s: 0.53324
## Maximum acceptance rate among delta[i]'s: 0.634
##  
##  
## Acceptance rate for phi (joint): 0.88
##  
##  
## Minimum acceptance rate among nu[j]'s: 0.452
## Average acceptance rate among nu[j]'s: 0.524533
## Maximum acceptance rate among nu[j]'s: 0.682
##  
##  
## Minimum acceptance rate among theta[k]'s: 0.712
## Average acceptance rate among theta[k]'s: 0.712
## Maximum acceptance rate among theta[k]'s: 0.712
##  
## -----------------------------------------------------
## 

As a default, the code above runs the original implementation mode of BASiCS (spikes without regression; see Section 4). To use the regression BASiCS model (Eling et al. 2018), please set Regression = TRUE. To use the no-spikes implementation of BASiCS, please add WithSpikes = FALSE as an additional parameter.

The Data and Chain (a BASiCS_Chain object) objects created by the code above can be use for subsequent downstream analyses. See Section 3.2 for highly/lowly variable gene detection (single group of cells, see also functions BASiCS_DetectHVG and BASiCS_DetectLVG) and Section 3.3 for differential expression analyses (two groups of cells, see also function BASiCS_TestDE).

Important remarks:

Typically, setting N=20000, Thin=20 and Burn=10000 leads to stable results.

3 Complete workflow

3.1 The input dataset

The input dataset for BASiCS must be stored as an SingleCellExperiment object (see SingleCellExperiment package).

The generation of the input SingleCellExperiment object requires a matrix of raw counts Counts (columns: cells, rows: genes) after quality control (e.g. removing low quality cells) and filtering of lowly expressed genes. If spike-in molecules are contained in Counts, a logical vector Tech is required to indicate which rows contain technical spike-in molecules and a data.frame object SpikeInfo containing the names of the spike-in molecules in the first column and the absolute number of molecules per well in the second column. More details are provided in section 3.1. If spike-ins are not available, a vector BatchInfo containing batch information is required.

3.1.1 With spike-in genes

The newBASiCS_Data function can be used to create the input data object based on the following information:

  • Counts: a matrix of raw expression counts with dimensions \(q\) times \(n\). Within this matrix, \(q_0\) rows must correspond to biological genes and \(q-q_0\) rows must correspond to technical spike-in genes. Gene names must be stored as rownames(Counts).

  • Tech: a logical vector (TRUE/FALSE) with \(q\) elements. If Tech[i] = FALSE the gene i is biological; otherwise the gene is spike-in. This vector must be specified in the same order of genes as in the Counts matrix.

  • SpikeInfo (optional): a data.frame with \(q-q_0\) rows. First column must contain the names associated to the spike-in genes (as in rownames(Counts)). Second column must contain the input number of molecules for the spike-in genes (amount per cell). If a value for this parameter is not provided when calling newBASiCS_Data, SpikeInfo is set as NULL as a default value. In those cases, the BatchInfo argument has to be provided to allow using the no-spikes implementation of BASiCS.

  • BatchInfo (optional): vector of length \(n\) to indicate batch structure (whenever cells have been processed using multiple batches). If a value for this parameter is not provided when calling newBASiCS_Data, BASiCS will assume the data contains a single batch of samples.

For example, the following code generates a synthetic dataset with 50 genes (40 biological and 10 spike-in) and 40 cells.

set.seed(1)
Counts <- matrix(rpois(50*40, 2), ncol = 40)
rownames(Counts) <- c(paste0("Gene", 1:40), paste0("Spike", 1:10))
Tech <- c(rep(FALSE,40),rep(TRUE,10))
set.seed(2)
SpikeInput <- rgamma(10,1,1)
SpikeInfo <- data.frame("SpikeID" = paste0("Spike", 1:10), 
                        "SpikeInput" = SpikeInput)

# No batch structure
DataExample <- newBASiCS_Data(Counts, Tech, SpikeInfo)

# With batch structure
DataExample <- newBASiCS_Data(Counts, Tech, SpikeInfo, 
                              BatchInfo = rep(c(1,2), each = 20)) 

To convert an existing SingleCellExperiment object (Data) into one that can be used within BASiCS, meta-information must be stored in the object.

  • isSpike(Data, "ERCC") <- Tech: the logical vector indicating biological/technical genes (see above) must be stored in the int_metadata slot via the isSpike function.

  • metadata(Data): the SpikeInfo object is stored in the metadata slot of the SingleCellExperiment object: metadata(Data) <- list(SpikeInput = SpikeInfo[,2], BatchInfo = BatchInfo).

  • colData(Data)$BatchInfo: the BatchInfo object is stored in the colData slot of the SingleCellExperiment object.

Once the additional information is included, the object can be used within BASiCS.

NOTE: Input number of molecules for spike-in should be calculated using experimental information. For each spike-in gene \(i\), we use

\[ \mu_{i} = C_i \times 10^{-18} \times (6.022 \times 10^{23}) \times V \times D \hspace{0.5cm} \mbox{where,} \]

  • \(C_i\) is the concentration of the spike \(i\) in the ERCC mix (see here)
  • \(10^{-18}\) is to convert att to mol
  • \(6.022 \times 10^{23}\) is the Avogadro number (mol \(\rightarrow\) molecule)
  • \(V\) is the volume added into each chamber (in microlitres)
  • \(D\) is a dilution factor

3.1.2 Without spike-in genes

To run BASiCS without incorporating reads from technical spike-in genes, and existing SingleCellExperiment object can be used. The only modification to the existing object is to assign the colData(Data)$BatchInfo slot.

set.seed(1)
CountsNoSpikes <- matrix(rpois(50*40, 2), ncol = 40)
rownames(CountsNoSpikes) <- paste0("Gene", 1:50)

# With batch structure
DataExampleNoSpikes <- SingleCellExperiment(assays = list(counts = CountsNoSpikes), 
                      colData = data.frame(BatchInfo = rep(c(1,2), each = 20))) 

Note: BASiCS assumes that a pre-processing quality control step has been applied to remove cells with poor quality data and/or lowly expressed genes that were undetected through sequencing. When analysing multiple groups of cells, the gene filtering step must be jointly applied across all groups to ensure the same genes are retained.

The function BASiCS_Filter can be used to perform this task. For examples, refer to help(BASiCS_Filter). Moreover, the scater package provides enhanced functionality for the pre-processing of scRNA-seq datasets.

3.2 Analysis for a single group of cells

We illustrate this analysis using a small extract from the MCMC chain obtained in (Vallejos, Richardson, and Marioni 2016) when analysing the single cell samples provided in (Grün, Kester, and Oudenaarden 2014). This is included within BASiCS as the ChainSC dataset.

data(ChainSC)

The following code is used to identify highly variable genes (HVG) and lowly variable genes (LVG) among these cells. The VarThreshold parameter sets a lower threshold for the proportion of variability that is assigned to the biological component (Sigma). In the examples below:

  • HVG are defined as those genes for which at least 60% of their total variability is attributed to the biological variability component.
  • LVG are defined as those genes for which at most 40% of their total variability is attributed to the biological variability component.

For each gene, these functions return posterior probabilities as a measure of HVG/LVG evidence. A cut-off value for these posterior probabilities is set by controlling the EFDR (as a default option, EFDR is set as 0.10).

par(mfrow = c(2,2))
HVG <- BASiCS_DetectHVG(ChainSC, VarThreshold = 0.6, Plot = TRUE)
LVG <- BASiCS_DetectLVG(ChainSC, VarThreshold = 0.2, Plot = TRUE)

To access the results of these tests, please use.

head(HVG$Table)
##     GeneIndex GeneName       Mu    Delta     Sigma      Prob   HVG
## 286       286   Rhox13 9.138169 2.417712 0.7706052 0.9733333  TRUE
## 250       250   Phlda2 9.372955 2.171966 0.7462231 0.9600000  TRUE
## 21         21    Amacr 7.164534 1.598386 0.6652893 0.6933333  TRUE
## 320       320    Smoc1 5.623408 1.805558 0.6700561 0.6800000 FALSE
## 122       122   Engase 7.082273 1.579305 0.6445464 0.6533333 FALSE
## 66         66   Cep170 3.712116 1.580848 0.6261821 0.5600000 FALSE
head(LVG$Table)
##     GeneIndex GeneName        Mu      Delta      Sigma Prob  LVG
## 20         20    Ahsa1  167.4669 0.06774259 0.09381064    1 TRUE
## 37         37   Atp5g2  344.9518 0.06381609 0.09037440    1 TRUE
## 55         55     Btf3  263.2987 0.04880375 0.06861493    1 TRUE
## 69         69   Chchd2  576.1569 0.03078477 0.04703723    1 TRUE
## 141       141    Fkbp4  334.4190 0.07084192 0.09782010    1 TRUE
## 147       147     Ftl1 2296.2110 0.04504962 0.06656222    1 TRUE

Note: this decision rule implemented in this function has changed with respect to the original release of BASiCS (where EviThreshold was defined such that EFDR = EFNR). However, the new choice is more stable (sometimes, it was not posible to find a threshold such that EFDR = EFNR).

3.3 Analysis for two groups of cells

To illustrate the use of the differential mean expression and differential over-dispersion tests between two cell populations, we use extracts from the MCMC chains obtained in (Vallejos, Richardson, and Marioni 2016) when analysing the (Grün, Kester, and Oudenaarden 2014) dataset (single cells vs pool-and-split samples). These were obtained by independently running the BASiCS_MCMC function for each group of cells.

data(ChainSC)
data(ChainRNA)
Test <- BASiCS_TestDE(Chain1 = ChainSC, Chain2 = ChainRNA,
                      GroupLabel1 = "SC", GroupLabel2 = "PaS",
                      EpsilonM = log2(1.5), EpsilonD = log2(1.5),
                      EFDR_M = 0.10, EFDR_D = 0.10,
                      Offset = TRUE, PlotOffset = FALSE, Plot = TRUE)

In BASiCS_TestDE, EpsilonM sets the log2 fold change (log2FC) in expression (\(\mu\)) and EpsilonD the log2FC in over-dispersion (\(\delta\)). As a default option: EpsilonM = EpsilonD = log2(1.5) (i.e. the test is set to detect absolute increases of 50% or above). To adjust for differences in overall mRNA content, an internal offset correction is performed when OffSet=TRUE. This is the recommended default setting.

The resulting output list can be displayed using

head(Test$TableMean)
##        GeneName MeanOverall   Mean1   Mean2 MeanFC MeanLog2FC ProbDiffMean
## 1 1700094D03Rik      58.539  54.352  62.726  0.854     -0.228        0.000
## 2 1700097N02Rik      37.421  36.919  37.922  0.946     -0.081        0.013
## 3 1810026B05Rik      14.738  10.459  19.018  0.558     -0.842        0.867
## 4 2310008H04Rik      18.844  15.267  22.422  0.672     -0.573        0.493
## 5 2410137M14Rik      19.050  16.674  21.426  0.768     -0.380        0.280
## 6 4930402H24Rik     962.255 967.739 956.771  1.004      0.006        0.000
##   ResultDiffMean
## 1         NoDiff
## 2         NoDiff
## 3           PaS+
## 4         NoDiff
## 5         NoDiff
## 6         NoDiff
head(Test$TableDisp)
##        GeneName MeanOverall DispOverall Disp1 Disp2 DispFC DispLog2FC
## 1 1700094D03Rik      58.539       0.225 0.321 0.128  2.307      1.206
## 2 1700097N02Rik      37.421       0.338 0.476 0.200  2.342      1.228
## 3 1810026B05Rik      14.738       0.359 0.434 0.284  1.378      0.463
## 4 2310008H04Rik      18.844       0.392 0.441 0.343  1.204      0.268
## 5 2410137M14Rik      19.050       0.396 0.491 0.302  1.663      0.734
## 6 4930402H24Rik     962.255       0.103 0.186 0.020  9.209      3.203
##   ProbDiffDisp      ResultDiffDisp
## 1        0.880                 SC+
## 2        0.813                 SC+
## 3        0.520 ExcludedFromTesting
## 4        0.453              NoDiff
## 5        0.640              NoDiff
## 6        1.000                 SC+

Due to the confounding between mean and over-dispersion that is typically observed in scRNA-seq datasets, the non-regression BASiCS model (run using Regression = FALSE as a function parameter in BASiCS_MCMC) can only be used to assess changes in over-dispersion for those genes in which the mean expression does not change between the groups. In this case, we recommend users to use EpsilonM = 0 as a conservative option to avoid changes in over-dispersion to be confounded by mean expression (the genes for which mean expression changes are marked as ExcludedFromTesting in the Test$TableDisp$ResultDiffDisp slot).

Test <- BASiCS_TestDE(Chain1 = ChainSC, Chain2 = ChainRNA,
                      GroupLabel1 = "SC", GroupLabel2 = "PaS",
                      EpsilonM = 0, EpsilonD = log2(1.5),
                      EFDR_M = 0.10, EFDR_D = 0.10,
                      Offset = TRUE, PlotOffset = FALSE, Plot = FALSE)

Note: If the regression BASiCS model has been used (Regression = TRUE as a function parameter in BASiCS_MCMC), BASiCS_TestDE will also report changes in residual over-dispersion (not confounded by mean expression) between the groups (see Section 4 in this vignette).

4 Alternative implementation modes

Beyond its original implementation, BASiCS has been extended to address experimental designs in which spike-in molecules are not available as well as to address the confounding that is typically observed between mean and over-dispersion for scRNA-seq datasets (Eling et al. 2018). Alternative implementation modes are summarised below:

As a default, the BASiCS_MCMC function uses WithSpikes = TRUE.

4.1 If WithSpikes = FALSE

When technical spike-in genes are not available, BASiCS uses a horizontal integration strategy which borrows information across multiple technical replicates (Eling et al. 2018). Therefore, BASiCS_MCMC will fail to run if a single batch of samples is provided. Note: batch information must be provided via the BatchInfo argument when using the newBASiCS_Data function or BatchInfo must be stored as a slot in colData(Data) when using an existing SingleCellExperiment object.

DataNoSpikes <- newBASiCS_Data(Counts, Tech, SpikeInfo = NULL, 
                              BatchInfo = rep(c(1,2), each = 20))

# Alternatively
DataNoSpikes <- SingleCellExperiment(assays = list(counts = Counts), 
                      colData = data.frame(BatchInfo = rep(c(1,2), each = 20))) 

ChainNoSpikes <- BASiCS_MCMC(Data = DataNoSpikes, N = 1000, 
                             Thin = 10, Burn = 500, 
                             WithSpikes = FALSE,  Regression = TRUE,
                             PrintProgress = FALSE)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##  
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##  
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##  
## Minimum acceptance rate among mu[i]'s: 0.428
## Average acceptance rate among mu[i]'s: 0.5142
## Maximum acceptance rate among mu[i]'s: 0.62
##  
##  
## Minimum acceptance rate among delta[i]'s: 0.716
## Average acceptance rate among delta[i]'s: 0.78424
## Maximum acceptance rate among delta[i]'s: 0.836
##  
##  
## Minimum acceptance rate among nu[jk]'s: 0.884
## Average acceptance rate among nu[jk]'s: 0.9579
## Maximum acceptance rate among nu[jk]'s: 0.982
##  
##  
## Minimum acceptance rate among theta[k]'s: 0.754
## Average acceptance rate among theta[k]'s: 0.77
## Maximum acceptance rate among theta[k]'s: 0.786
##  
##  
## -----------------------------------------------------
## 

4.2 If Regression = TRUE

The BASiCS model uses a joint informative prior formulation to account for the relationship between mean and over-dispersion gene-specific parameters. The latter is used to infer a global regression trend between these parameters and, subsequently, to derive a residual over-dispersion measure that is defined as departures with respect to this trend.

DataRegression <- newBASiCS_Data(Counts, Tech, SpikeInfo, 
                              BatchInfo = rep(c(1,2), each = 20))
ChainRegression <- BASiCS_MCMC(Data = DataRegression, N = 1000, 
                               Thin = 10, Burn = 500, 
                               Regression = TRUE, 
                               PrintProgress = FALSE)
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##  
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##  
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##  
## Minimum acceptance rate among mu[i]'s: 0.368
## Average acceptance rate among mu[i]'s: 0.46495
## Maximum acceptance rate among mu[i]'s: 0.572
##  
##  
## Minimum acceptance rate among delta[i]'s: 0.754
## Average acceptance rate among delta[i]'s: 0.79395
## Maximum acceptance rate among delta[i]'s: 0.836
##  
##  
## Acceptance rate for phi (joint): 0.854
##  
##  
## Minimum acceptance rate among nu[j]'s: 0.788
## Average acceptance rate among nu[j]'s: 0.8494
## Maximum acceptance rate among nu[j]'s: 0.988
##  
##  
## Minimum acceptance rate among theta[k]'s: 0.766
## Average acceptance rate among theta[k]'s: 0.77
## Maximum acceptance rate among theta[k]'s: 0.774
##  
## -----------------------------------------------------
## 

This implementation provides additional functionality when performing downstream analyses. These are illustrated below using a small extract from the MCMC chain obtained when analysing the dataset provided in (Grün, Kester, and Oudenaarden 2014) (single cell versus pool-and-split samples). These are included within BASiCS as the ChainSCReg and ChainRNAReg datasets.

To visualize the fit between over-dispersion \(\delta_i\) and mean expression $ _i$ the following function can be used.

data("ChainRNAReg")
BASiCS_showFit(ChainRNAReg)

The BASiCS_TestDE test function will automatically perform differential variability testing based on the residual over-dispersion parameters \(\epsilon_i\) when its output includes two Chain objects that were generated by the regression BASiCS model.

data("ChainSCReg")
Test <- BASiCS_TestDE(Chain1 = ChainSCReg, Chain2 = ChainRNAReg,
                      GroupLabel1 = "SC", GroupLabel2 = "PaS",
                      EpsilonM = log2(1.5), EpsilonD = log2(1.5),
                      EpsilonR = log2(1.5)/log2(exp(1)),
                      EFDR_M = 0.10, EFDR_D = 0.10,
                      Offset = TRUE, PlotOffset = FALSE, Plot = FALSE)

This test function outputs an extra slot containing the results of the differential testing residual over-dispersion test. Only genes that are expressed in at least 2 cells (in both groups) are included in the test. Genes that do not satisfy this condition are marked as ExcludedFromRegression in the Test$TableResDisp$ResultDiffResDisp slot. By performing the regression, all genes can be tested for changes in expression variability independent of changes in mean expression.

head(Test$TableResDisp, n = 2)
##        GeneName MeanOverall ResDispOverall ResDisp1 ResDisp2 ResDispDistance
## 1 1700094D03Rik      57.353          0.100    0.420   -0.221           0.632
## 2 1700097N02Rik      36.971          0.152    0.438   -0.133           0.432
##   ProbDiffResDisp ResultDiffResDisp
## 1           0.707            NoDiff
## 2           0.613            NoDiff

Note: Additional parameters for this sampler include: k number of regression components (k-2 radial basis functions, one intercept and one linear component), Var the scale parameter influencing the width of the basis functions and eta the degrees of freedom. For additional details about these choices, please refer to Eling et al. (2018).

5 Additional details

5.1 Storing and loading MCMC chains

To externally store the output of BASiCS_MCMC (recommended), additional parameters StoreChains, StoreDir and RunName are required. For example:

Data <- makeExampleBASiCS_Data()
Chain <- BASiCS_MCMC(Data, N = 1000, Thin = 10, Burn = 500, Regression = TRUE,
                     PrintProgress = FALSE, StoreChains = TRUE, 
                     StoreDir = tempdir(), RunName = "Example")
## -----------------------------------------------------
## MCMC sampler has been started: 1000 iterations to go.
## -----------------------------------------------------
## -----------------------------------------------------
## End of Burn-in period.
## -----------------------------------------------------
##  
## -----------------------------------------------------
## -----------------------------------------------------
## All 1000 MCMC iterations have been completed.
## -----------------------------------------------------
## -----------------------------------------------------
##  
## -----------------------------------------------------
## Please see below a summary of the overall acceptance rates.
## -----------------------------------------------------
##  
## Minimum acceptance rate among mu[i]'s: 0.438
## Average acceptance rate among mu[i]'s: 0.6254
## Maximum acceptance rate among mu[i]'s: 0.792
##  
##  
## Minimum acceptance rate among delta[i]'s: 0.4
## Average acceptance rate among delta[i]'s: 0.51528
## Maximum acceptance rate among delta[i]'s: 0.588
##  
##  
## Acceptance rate for phi (joint): 0.826
##  
##  
## Minimum acceptance rate among nu[j]'s: 0.414
## Average acceptance rate among nu[j]'s: 0.536133
## Maximum acceptance rate among nu[j]'s: 0.724
##  
##  
## Minimum acceptance rate among theta[k]'s: 0.706
## Average acceptance rate among theta[k]'s: 0.706
## Maximum acceptance rate among theta[k]'s: 0.706
##  
## -----------------------------------------------------
## 

In this example, the output of BASiCS_MCMC will be stored as a BASiCS_Chain object in the file “chain_Example.Rds”, within the tempdir() directory.

To load pre-computed MCMC chains,

Chain <- BASiCS_LoadChain("Example", StoreDir = tempdir()) 

5.2 Convergence assessment

To assess convergence of the chain, the convergence diagnostics provided by the package coda can be used. Additionally, the chains can be visually inspected. For example:

plot(Chain, Param = "mu", Gene = 1, log = "y")

See help(BASiCS_MCMC) for example plots associated to other model parameters. In the figure above:

  • Left panels show traceplots for the chains
  • Right panels show the autocorrelation function (see ?acf)

5.3 Summarising the posterior distribution

To access the MCMC chains associated to individual parameter use the function displayChainBASiCS. For example,

displayChainBASiCS(Chain, Param = "mu")[1:2,1:2]
##         Gene1    Gene2
## [1,] 8.209807 6.754111
## [2,] 5.086757 7.571519

As a summary of the posterior distribution, the function Summary calculates posterior medians and the High Posterior Density (HPD) intervals for each model parameter. As a default option, HPD intervals contain 0.95 probability.

ChainSummary <- Summary(Chain)

The function displaySummaryBASiCS extract posterior summaries for individual parameters. For example

head(displaySummaryBASiCS(ChainSummary, Param = "mu"), n = 2)
##         median    lower     upper
## Gene1 6.471372 4.354758  8.551392
## Gene2 7.746306 5.790329 10.616959

The following figures display posterior medians and the corresponding HPD 95% intervals for gene-specific parameters \(\mu_i\) (mean) and \(\delta_i\) (over-dispersion)

par(mfrow = c(1,2))
plot(ChainSummary, Param = "mu", main = "All genes", log = "y")
plot(ChainSummary, Param = "delta", 
     Genes = c(2,5,10,50), main = "5 customized genes")

See help(BASiCS_MCMC) for example plots associated to other model parameters.

5.4 Normalisation and removal of technical variation

It is also possible to produce a matrix of normalised and denoised expression counts for which the effect of technical variation is removed. For this purpose, we implemented the function BASiCS_DenoisedCounts. For each gene \(i\) and cell \(j\) this function returns

\[ x^*_{ij} = \frac{ x_{ij} } {\hat{\phi}_j \hat{\nu}_j}, \]

where \(x_{ij}\) is the observed expression count of gene \(i\) in cell \(j\), \(\hat{\phi}_j\) denotes the posterior median of \(\phi_j\) and \(\hat{\nu}_j\) is the posterior median of \(\nu_j\).

DenoisedCounts <- BASiCS_DenoisedCounts(Data = Data, Chain = Chain)
DenoisedCounts[1:2, 1:2]
##          Cell1    Cell2
## Gene1  0.00000 0.000000
## Gene2 34.81459 4.336525

Note: the output of BASiCS_DenoisedCounts requires no further data normalisation.

Alternativelly, the user can compute the normalised and denoised expression rates underlying the expression of all genes across cells using BASiCS_DenoisedRates. The output of this function is given by

\[ \Lambda_{ij} = \hat{\mu_i} \hat{\rho}_{ij}, \]

where \(\hat{\mu_i}\) represents the posterior median of \(\mu_j\) and \(\hat{\rho}_{ij}\) is given by its posterior mean (Monte Carlo estimate based on the MCMC sample of all model parameters).

DenoisedRates <- BASiCS_DenoisedRates(Data = Data, Chain = Chain, 
                                      Propensities = FALSE)
DenoisedRates[1:2, 1:2]
##           Cell1    Cell2
## Gene1  4.899157 2.892763
## Gene2 14.569993 5.900438

Alternative, denoised expression propensities \(\hat{\rho}_{ij}\) can also be extracted

DenoisedProp <- BASiCS_DenoisedRates(Data = Data, Chain = Chain, 
                                    Propensities = TRUE)
DenoisedProp[1:2, 1:2]
##           Cell1     Cell2
## Gene1 0.7570508 0.4470093
## Gene2 1.8808956 0.7617099

6 Methodology

We first describe the model introduced in [1], which relates to a single group of cells.

Throughout, we consider the expression counts of \(q\) genes, where \(q_0\) are expressed in the population of cells under study (biological genes) and the remaining \(q-q_0\) are extrinsic spike-in (technical) genes. Let \(X_{ij}\) be a random variable representing the expression count of a gene \(i\) in cell \(j\)
(\(i=1,\ldots,q\); \(j=1,\ldots,n\)). BASiCS is based on the following hierarchical model: \[X_{ij} \big| \mu_i, \phi_j, \nu_j, \rho_{ij} \sim \left\{ \begin{array}{ll} \mbox{Poisson}(\phi_j \nu_j \mu_i \rho_{ij}), \mbox{ for }i=1,\ldots,q_0, j=1,\ldots,n \\ \mbox{Poisson}(\nu_j \mu_i), \mbox{ for }i=q_0+1,\ldots,q, j=1,\ldots,n, \end{array} \right.\]

where \(\nu_j\) and \(\rho_{ij}\) are mutually independent random effects such that \(\nu_j|s_j,\theta \sim \mbox{Gamma}(1/\theta,1/ (s_j \theta))\) and \(\rho_{ij} | \delta_i \sim \mbox{Gamma} (1/\delta_i,1/\delta_i)\)1 We parametrize the Gamma distribution such that if \(X \sim \mbox{Gamma}(a,b)\), then \(\mbox{E}(X)=a/b\) and \(\mbox{var}(X)=a/b^2\)..

A graphical representation of this model is displayed below. This is based on the expression counts of 2 genes (\(i\): biological and \(i'\): technical) at 2 cells (\(j\) and \(j'\)). Squared and circular nodes denote known observed quantities (observed expression counts and added number of spike-in mRNA molecules) and unknown elements, respectively. Whereas black circular nodes represent the random effects that play an intermediate role in our hierarchical structure, red circular nodes relate to unknown model parameters in the top layer of hierarchy in our model. Blue, green and grey areas highlight elements that are shared within a biological gene, technical gene or cell, respectively.

In this setting, the key parameters to be used for downstream analyses are:

Additional (nuisance) parameters are interpreted as follows:

When cells from the same group are processed in multiple sequencing batches, this model is extended so that the technical over-dispersion parameter \(\theta\) is batch-specific. This extension allows a different strenght of technical noise to be inferred for each batch of cells.

In Vallejos, Richardson, and Marioni (2016), this model has been extended to cases where multiple groups of cells are under study. This is achieved by assuming gene-specific parameters to be also group-specific. Based on this setup, evidence of differential expression is quantified through log2-fold changes of gene-specific parameters (mean and over-dispersion) between the groups. Moreover, Eling et al. (2018) further extended this model by addressing the mean/over-dispersion confounding that is characteristic of scRNA-seq datasets as well as experimental designs where spike-ins are not available.

More details regarding the model setup, prior specification and implementation are described in Vallejos, Marioni, and Richardson (2015), Vallejos, Richardson, and Marioni (2016) and Eling et al. (2018).


7 Acknowledgements

This work has been funded by the MRC Biostatistics Unit (MRC grant no. MRC_MC_UP_0801/1; Catalina Vallejos and Sylvia Richardson), EMBL European Bioinformatics Institute (core European Molecular Biology Laboratory funding; Catalina Vallejos, Nils Eling and John Marioni), CRUK Cambridge Institute (core CRUK funding; John Marioni) and The Alan Turing Institute (EPSRC grant no. EP/N510129/1; Catalina Vallejos).

8 BASiCS hall of fame

We thank several members of the Marioni laboratory (EMBL-EBI; CRUK-CI) for support and discussions throughout the development of this R library. In particular, we are grateful to Aaron Lun (LTLA) for advise and support during the preparation the Bioconductor submission.

We also acknowledge feedback, bug reports and contributions from (Github aliases provided within parenthesis): Ben Dulken (bdulken), Chang Xu (xuchang116), Danilo Horta (Horta), Dmitriy Zhukov (dvzhukov), Jens Preußner (jenzopr), Joanna Dreux (Joannacodes), Kevin Rue-Albrecht (kevinrue), Luke Zappia (lazappi), Nitesh Turaga (nturaga), Mike Morgan (MikeDMorgan), Muad Abd El Hay (Cumol), Simon Anders (s-andrews), Shian Su (Shians), Yongchao Ge and Yuan Cao (yuancao90), among others.


9 Session information

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] hexbin_1.27.2               BASiCS_1.6.0               
##  [3] SingleCellExperiment_1.6.0  SummarizedExperiment_1.14.0
##  [5] DelayedArray_0.10.0         BiocParallel_1.18.0        
##  [7] matrixStats_0.54.0          Biobase_2.44.0             
##  [9] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
## [11] IRanges_2.18.0              S4Vectors_0.22.0           
## [13] BiocGenerics_0.30.0         knitr_1.22                 
## [15] BiocStyle_2.12.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6             dynamicTreeCut_1.63-1   
##  [3] tools_3.6.0              R6_2.4.0                
##  [5] irlba_2.3.3              KernSmooth_2.23-15      
##  [7] vipor_0.4.5              lazyeval_0.2.2          
##  [9] colorspace_1.4-1         tidyselect_0.2.5        
## [11] gridExtra_2.3            compiler_3.6.0          
## [13] BiocNeighbors_1.2.0      labeling_0.3            
## [15] bookdown_0.9             scales_1.0.0            
## [17] stringr_1.4.0            digest_0.6.18           
## [19] rmarkdown_1.12           XVector_0.24.0          
## [21] scater_1.12.0            pkgconfig_2.0.2         
## [23] htmltools_0.3.6          limma_3.40.0            
## [25] rlang_0.3.4              shiny_1.3.2             
## [27] DelayedMatrixStats_1.6.0 dplyr_0.8.0.1           
## [29] RCurl_1.95-4.12          magrittr_1.5            
## [31] BiocSingular_1.0.0       GenomeInfoDbData_1.2.1  
## [33] Matrix_1.2-17            Rcpp_1.0.1              
## [35] ggbeeswarm_0.6.0         munsell_0.5.0           
## [37] viridis_0.5.1            stringi_1.4.3           
## [39] yaml_2.2.0               edgeR_3.26.0            
## [41] MASS_7.3-51.4            zlibbioc_1.30.0         
## [43] plyr_1.8.4               grid_3.6.0              
## [45] promises_1.0.1           dqrng_0.2.0             
## [47] crayon_1.3.4             miniUI_0.1.1.1          
## [49] lattice_0.20-38          cowplot_0.9.4           
## [51] locfit_1.5-9.1           pillar_1.3.1            
## [53] igraph_1.2.4.1           glue_1.3.1              
## [55] evaluate_0.13            scran_1.12.0            
## [57] data.table_1.12.2        BiocManager_1.30.4      
## [59] httpuv_1.5.1             gtable_0.3.0            
## [61] purrr_0.3.2              assertthat_0.2.1        
## [63] ggplot2_3.1.1            xfun_0.6                
## [65] ggExtra_0.8              rsvd_1.0.0              
## [67] mime_0.6                 xtable_1.8-4            
## [69] coda_0.19-2              later_0.8.0             
## [71] viridisLite_0.3.0        tibble_2.1.1            
## [73] beeswarm_0.2.3           statmod_1.4.30

References

Eling, Nils, Arianne Richard, Sylvia Richardson, John C Marioni, and Catalina A Vallejos. 2018. “Correcting the Mean-Variance Dependency for Differential Variability Testing Using Single-Cell Rna Sequencing Data.” Cell Systems.

Grün, Dominic, Lennart Kester, and Alexander van Oudenaarden. 2014. “Validation of Noise Models for Single-Cell Transcriptomics.” Nature Methods 11 (6):637–40.

Martinez-Jimenez, Celia Pilar, Nils Eling, Hung-Chang Chen, Catalina A Vallejos, Aleksandra A Kolodziejczyk, Frances Connor, Lovorka Stojic, et al. 2017. “Aging Increases Cell-to-Cell Transcriptional Variability Upon Immune Stimulation.” Science 355 (6332). American Association for the Advancement of Science:1433–6.

Newton, Michael A, Amine Noueiry, Deepayan Sarkar, and Paul Ahlquist. 2004. “Detecting Differential Gene Expression with a Semiparametric Hierarchical Mixture Method.” Biostatistics 5 (2):155–76.

Vallejos, Catalina A, John C Marioni, and Sylvia Richardson. 2015. “BASiCS: Bayesian Analysis of Single-Cell Sequencing Data.” PLoS Computational Biology 11 (6). Public Library of Science:e1004333.

Vallejos, Catalina A, Sylvia Richardson, and John C Marioni. 2016. “Beyond Comparisons of Means: Understanding Changes in Gene Expression at the Single-Cell Level.” Genome Biology 17 (1). Cold Spring Harbor Labs Journals.

Vallejos, Catalina A, Davide Risso, Antonio Scialdone, Sandrine Dudoit, and John C Marioni. 2017. “Normalizing Single-Cell RNA Sequencing Data: Challenges and Opportunities.” Nature Methods 14 (6).