1 Introduction

The scMerge algorithm allows batch effect removal and normalisation for single cell RNA-Seq data. It comprises of three key components including:

  1. The identification of stably expressed genes (SEGs) as “negative controls” for estimating unwanted factors;
  2. The construction of pseudo-replicates to estimate the effects of unwanted factors; and
  3. The adjustment of the datasets with unwanted variation using a fastRUVIII model.

The purpose of this vignette is to illustrate some uses of scMerge and explain its key components.

2 Loading Packages and Data

We will load the scMerge package. We designed our package to be consistent with the popular BioConductor’s single cell analysis framework, namely the SingleCellExperiment and scater package.

suppressPackageStartupMessages({
  library(SingleCellExperiment)
  library(scMerge)
  library(scater)
  })

We provided an illustrative mouse embryonic stem cell (mESC) data in our package, as well as a set of pre-computed stably expressed gene (SEG) list to be used as negative control genes.

The full curated, unnormalised mESC data can be found here. The scMerge package comes with a sub-sampled, two-batches version of this data (named “batch2” and “batch3” to be consistent with the full data) .

## Subsetted mouse ESC data
data("example_sce", package = "scMerge")

In this mESC data, we pooled data from 2 different batches from three different cell types. Using a PCA plot, we can see that despite strong separation of cell types, there is also a strong separation due to batch effects. This information is stored in the colData of example_sce.

scater::plotPCA(
  example_sce, 
  colour_by = "cellTypes", 
  shape_by = "batch")

3 Illustrating pseudo-replicates constructions

The first major component of scMerge is to obtain negative controls for our normalisation. In this vignette, we will be using a set of pre-computed SEGs from a single cell mouse data made available through the segList_ensemblGeneID data in our package. For more information about the selection of negative controls and SEGs, please see Section select SEGs.

## single-cell stably expressed gene list
data("segList_ensemblGeneID", package = "scMerge")
head(segList_ensemblGeneID$mouse$mouse_scSEG)
#> [1] "ENSMUSG00000058835" "ENSMUSG00000026842" "ENSMUSG00000027671"
#> [4] "ENSMUSG00000020152" "ENSMUSG00000054693" "ENSMUSG00000049470"

The second major component of scMerge is to compute pseudo-replicates for cells so we can perform normalisation. We offer three major ways of computing this pseudo-replicate information:

  1. Unsupervised clustering, using k-means clustering;
  2. Supervised clustering, using known cell type information; and
  3. Semi-supervised clustering, using partially known cell type information.

4 Unsupervised scMerge

In unsupervised scMerge, we will perform a k-means clustering to obtain pseudo-replicates. This requires the users to supply a kmeansK vector with each element indicating number of clusters in each of the batches. For example, we know “batch2” and “batch3” both contain three cell types. Hence, kmeansK = c(3, 3) in this case.

scMerge_unsupervised <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_unsupervised")
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!

We now colour construct the PCA plot again on our normalised data. We can observe a much better separation by cell type and less separation by batches.

scater::plotPCA(
  scMerge_unsupervised, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_unsupervised"))

5 Selecting all cells

By default, scMerge only uses 50% of the cells to perform kmeans clustering. While this is sufficient to perform a satisfactory normalisation in most cases, users can control if they wish all cells be used in the kmeans clustering.

scMerge_unsupervised_all <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_unsupervised_all",
  replicate_prop = 1)
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
scater::plotPCA(
  scMerge_unsupervised_all, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_unsupervised_all"))

6 Supervised scMerge

If all cell type information is available to the user, then it is possible to use this information to create pseudo-replicates. This can be done through the cell_type argument in the scMerge function.

scMerge_supervised <- scMerge(
  sce_combine = example_sce,
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_supervised",
  cell_type = example_sce$cellTypes)
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
scater::plotPCA(
  scMerge_supervised,
  colour_by = "cellTypes",
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_supervised"))

7 Semi-supervised scMerge I

If the user is only able to access partial cell type information, then it is still possible to use this information to create pseudo-replicates. This can be done through the cell_type and cell_type_inc arguments in the scMerge function. cell_type_inc should contain a vector of indices indicating which elements in the cell_type vector should be used to perform semi-supervised scMerge.

scMerge_semisupervised1 <- scMerge(
  sce_combine = example_sce,
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3,3),
  assay_name = "scMerge_semisupervised1",
  cell_type = example_sce$cellTypes,
  cell_type_inc = which(example_sce$cellTypes == "2i"), 
  cell_type_match = FALSE)
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
scater::plotPCA(
  scMerge_semisupervised1, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_semisupervised1"))

8 Semi-supervised scMerge II

There is alternative semi-supervised method to create pseudo-replicates for scMerge. This uses known cell type information to identify mutual nearest clusters and it is achieved via the cell_type and cell_type_match = TRUE options in the scMerge function.

scMerge_semisupervised2 <- scMerge(
  sce_combine = example_sce,
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_semisupervised2",
  cell_type = example_sce$cellTypes,
  cell_type_inc = NULL,
  cell_type_match = TRUE)
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
scater::plotPCA(
  scMerge_semisupervised2, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_semisupervised2"))

9 Selecting negative controls

In simple terms, a negative control is a gene that has expression values relatively constant across these datasets. The concept of using these negative control genes for normalisation was most widely used in the RUV method family (e.g. Gagnon-Bartsch & Speed (2012) and Risso et. al. (2014)) and there exist multiple methods to find these negative controls. In our paper, we recommened the SEGs as negative controls for scRNA-Seq data and SEGs can be found using either a data-adaptive computational method or external knowledge.

  • Computation method: We provide the function scSEGIndex to calculate the SEG from a data matrix. The output of this function is a data.frame with a SEG index calculated for each gene. See Lin et. al. (2018) for more details.
exprsMat = SummarizedExperiment::assay(example_sce, 'counts')
result = scSEGIndex(exprsMat = exprsMat)
  • External knowledge: We have applied the SEG computational methodology on multiple human and mouse scRNA-Seq data and made these available as data objects in our package. The end-users can simply use these pre-computed results. There are also additional negative controls from bulk microarray and bulkd RNA-Seq data.
## SEG list in ensemblGene ID
data("segList_ensemblGeneID", package = "scMerge") 
## SEG list in official gene symbols
data("segList", package = "scMerge")

## SEG list for human scRNA-Seq data
head(segList$human$human_scSEG)
#> [1] "AAR2"  "AATF"  "ABCF3" "ABHD2" "ABT1"  "ACAP2"

## SEG list for human bulk microarray data
head(segList$human$bulkMicroarrayHK)
#> [1] "AATF"  "ABL1"  "ACAT2" "ACTB"  "ACTG1" "ACTN4"

## SEG list for human bulk RNASeq data
head(segList$human$bulkRNAseqHK)
#> [1] "AAGAB"  "AAMP"   "AAR2"   "AARS"   "AARS2"  "AARSD1"

10 Achieving fast computation

10.1 Using approximated SVD

Under most circumstances, scMerge is fast enough to be used on a personal laptop for a moderately large data. However, we do recognise the difficulties associated with computation when dealing with larger data. To this end, we devised a fast version of scMerge. The major difference between the two versions lies on the noise estimation component, which utilised singular value decomposition (SVD). In order to speed up scMerge, we used the randomised SVD algorithm made available through the rsvd package. This computational method is able to speed up scMerge by obtain a very accurate approximation of the noise structure in the data. This option is achieved via the option fast_svd = TRUE and rsvd_prop. rsvd_prop is a parameter between 0 and 1, controlling the degree of approximations.

We recommend using this option in the case where the number of cells is large in your single cell data. The speed advantage we obtain for large single cell data is much more dramatic than on a smaller dataset like the example mESC data. For example, a single run of normal scMerge on a human pancreas data (23699 features and 4566 cells) takes about 10 minutes whereas the speed up version takes just under 4 minutes.

scMerge_fast <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_fast", 
  fast_svd = TRUE, 
  rsvd_prop = 0.1)
#> Step 1: Computation will run in serial
#> Step 2: Performing RUV normalisation. This will take minutes to hours.
#> scMerge complete!
paste("Normally, scMerge takes ", round(t2 - t1, 2), " seconds")
#> [1] "Normally, scMerge takes  1.54  seconds"
paste("Fast version of scMerge takes ", round(t4 - t3, 2), " seconds")
#> [1] "Fast version of scMerge takes  1.06  seconds"

scater::plotPCA(
  scMerge_fast, 
  colour_by = "cellTypes", 
  shape_by = "batch",
  run_args = list(exprs_values = "scMerge_fast")) +
  labs(title = "fast_svd yields similar results to the default version")

10.2 Parallelised computing

In the unsupervised or semi-supervised versions of scMerge, a replication matrix is constructed from clustering. In the case of large number of cells and batches, this computation could take a long time. By default, scMerge runs this computation in serial, but we have also implemented a parallelised computational option via the BiocParallel package.

There are two ways you could invoke the parallel option. First, you can invoke the parallel option by setting parallel = TRUE and supply a parallelParam option.

scMerge_parallel1 <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_parallel1",
  parallel = TRUE,
  parallelParam = MulticoreParam(workers = 2)
)

Alternatively, you can use BiocParallel::register to set your parallel parameters. In this case, scMerge will check the default BiocParallelParam instance from the registry using the BiocParallel::bpparam() function.

BiocParallel::register(MulticoreParam(workers = 2))

scMerge_parallel2 <- scMerge(
  sce_combine = example_sce, 
  ctl = segList_ensemblGeneID$mouse$mouse_scSEG,
  kmeansK = c(3, 3),
  assay_name = "scMerge_parallel2",
  parallel = TRUE
)

It should be noted that you may not see the computational advantages of parallel computing if your data is not large enough. This is due to the overhead cost in parallel computing.

11 Reference

Please check out our paper for detailed analysis and results on multiple scRNA-Seq data. https://www.biorxiv.org/content/10.1101/393280v2

citation("scMerge")
#> 
#> To cite scMerge in publications please use:
#> 
#> Lin Y, Ghazanfar S, Wang K, Gagnon-Bartsch J, Lo K, Su X, Han Z,
#> Ormerod J, Speed T, Yang P, Yang J (2019). "scMerge leverages factor
#> analysis, stable expression, and pseudoreplication to merge multiple
#> single-cell RNA-seq datasets." _Proceedings of the National Academy
#> of Sciences_. doi: 10.1073/pnas.1820006116 (URL:
#> https://doi.org/10.1073/pnas.1820006116), <URL:
#> http://www.pnas.org/lookup/doi/10.1073/pnas.1820006116>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {{scMerge leverages factor analysis, stable expression, and pseudoreplication to merge multiple single-cell RNA-seq datasets}},
#>     author = {Yingxin Lin and Shila Ghazanfar and Kevin Wang and Johann Gagnon-Bartsch and Kitty Lo and Xianbin Su and Ze-Guang Han and John Ormerod and Terence Speed and Pengyi Yang and Jean Yang},
#>     year = {2019},
#>     journal = {Proceedings of the National Academy of Sciences},
#>     doi = {https://doi.org/10.1073/pnas.1820006116},
#>     url = {http://www.pnas.org/lookup/doi/10.1073/pnas.1820006116},
#>   }

12 Session Info

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] scater_1.12.0               ggplot2_3.1.1              
#>  [3] scMerge_1.0.0               SingleCellExperiment_1.6.0 
#>  [5] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
#>  [7] BiocParallel_1.18.0         matrixStats_0.54.0         
#>  [9] Biobase_2.44.0              GenomicRanges_1.36.0       
#> [11] GenomeInfoDb_1.20.0         IRanges_2.18.0             
#> [13] S4Vectors_0.22.0            BiocGenerics_0.30.0        
#> [15] BiocStyle_2.12.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-139             bitops_1.0-6            
#>  [3] RColorBrewer_1.1-2       numDeriv_2016.8-1       
#>  [5] tools_3.6.0              backports_1.1.4         
#>  [7] R6_2.4.0                 irlba_2.3.3             
#>  [9] vipor_0.4.5              rpart_4.1-15            
#> [11] KernSmooth_2.23-15       Hmisc_4.2-0             
#> [13] lazyeval_0.2.2           mgcv_1.8-28             
#> [15] colorspace_1.4-1         nnet_7.3-12             
#> [17] withr_2.1.2              tidyselect_0.2.5        
#> [19] gridExtra_2.3            compiler_3.6.0          
#> [21] BiocNeighbors_1.2.0      htmlTable_1.13.1        
#> [23] labeling_0.3             bookdown_0.9            
#> [25] caTools_1.17.1.2         scales_1.0.0            
#> [27] checkmate_1.9.1          proxy_0.4-23            
#> [29] stringr_1.4.0            digest_0.6.18           
#> [31] foreign_0.8-71           rmarkdown_1.12          
#> [33] XVector_0.24.0           base64enc_0.1-3         
#> [35] pkgconfig_2.0.2          htmltools_0.3.6         
#> [37] ruv_0.9.7                bbmle_1.0.20            
#> [39] htmlwidgets_1.3          rlang_0.3.4             
#> [41] rstudioapi_0.10          DelayedMatrixStats_1.6.0
#> [43] gtools_3.8.1             acepack_1.4.1           
#> [45] dplyr_0.8.0.1            BiocSingular_1.0.0      
#> [47] RCurl_1.95-4.12          magrittr_1.5            
#> [49] GenomeInfoDbData_1.2.1   Formula_1.2-3           
#> [51] Matrix_1.2-17            ggbeeswarm_0.6.0        
#> [53] Rcpp_1.0.1               munsell_0.5.0           
#> [55] viridis_0.5.1            stringi_1.4.3           
#> [57] yaml_2.2.0               zlibbioc_1.30.0         
#> [59] gplots_3.0.1.1           plyr_1.8.4              
#> [61] grid_3.6.0               gdata_2.18.0            
#> [63] M3Drop_1.10.0            reldist_1.6-6           
#> [65] crayon_1.3.4             lattice_0.20-38         
#> [67] cowplot_0.9.4            splines_3.6.0           
#> [69] knitr_1.22               pillar_1.3.1            
#> [71] igraph_1.2.4.1           codetools_0.2-16        
#> [73] glue_1.3.1               evaluate_0.13           
#> [75] latticeExtra_0.6-28      data.table_1.12.2       
#> [77] BiocManager_1.30.4       foreach_1.4.4           
#> [79] gtable_0.3.0             purrr_0.3.2             
#> [81] assertthat_0.2.1         xfun_0.6                
#> [83] rsvd_1.0.0               RcppEigen_0.3.3.5.0     
#> [85] viridisLite_0.3.0        survival_2.44-1.1       
#> [87] tibble_2.1.1             iterators_1.0.10        
#> [89] beeswarm_0.2.3           cluster_2.0.9           
#> [91] statmod_1.4.30