Chapter 5 Analysis overview

5.1 Outline

This chapter provides an overview of the framework of a typical scRNA-seq analysis workflow (Figure 5.1).

Figure 5.1: Schematic of a typical scRNA-seq analysis workflow. Each stage (separated by dashed lines) consists of a number of specific steps, many of which operate on and modify a SingleCellExperiment instance.

In the simplest case, the workflow has the following form:

  1. We compute quality control metrics to remove low-quality cells that would interfere with downstream analyses. These cells may have been damaged during processing or may not have been fully captured by the sequencing protocol. Common metrics include the total counts per cell, the proportion of spike-in or mitochondrial reads, and the number of detected features.
  2. We convert the counts into normalized expression values to eliminate cell-specific biases (e.g., in capture efficiency). This allows us to perform explicit comparisons across cells in downstream steps like clustering. We also apply a transformation, typically log, to adjust for the mean-variance relationship.
  3. We perform feature selection to pick a subset of interesting features for downstream analysis. This is done by modelling the variance across cells for each gene and retaining genes that are highly variable. The aim is to reduce computational overhead and noise from uninteresting genes.
  4. We apply dimensionality reduction to compact the data and further reduce noise. Principal components analysis is typically used to obtain an initial low-rank representation for further computational work, followed by more aggressive methods like \(t\)-distributed stochastic neighbor embedding for visualization purposes.
  5. We cluster cells into groups according to similarities in their (normalized) expression profiles. This aims to obtain groupings that serve as empirical proxies for distinct biological states. We typically interpret these groupings by identifying differentially expressed marker genes between clusters.

Subsequent chapters will describe each analysis step in more detail.
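
To make the five steps above concrete, the following is a minimal sketch on simulated data that uses only base R. It is not the Bioconductor implementation shown in the quick starts: the simulated counts, all object names, the 3 MAD threshold on the log-library size, the use of raw per-gene variance (rather than a fitted mean-variance trend) and the use of k-means (rather than graph-based clustering) are assumptions chosen purely for illustration.

```r
# Base R stand-ins for each step, applied to simulated data.
# All object names and simulation settings are hypothetical.
set.seed(100)
ngenes <- 200
ncells <- 60
group <- rep(c("A", "B"), each = ncells / 2)

# Simulate counts in which genes 1-10 are upregulated in group B
# and genes 11-20 are upregulated in group A.
mu <- matrix(5, nrow = ngenes, ncol = ncells)
mu[1:10, group == "B"] <- 50
mu[11:20, group == "A"] <- 50
counts <- matrix(rpois(ngenes * ncells, lambda = mu), nrow = ngenes,
    dimnames = list(paste0("gene", seq_len(ngenes)),
                    paste0("cell", seq_len(ncells))))

# Mimic a low-quality cell by keeping about 2% of the counts in the first cell.
counts[, 1] <- rbinom(ngenes, size = counts[, 1], prob = 0.02)

# Step 1: quality control. Discard cells whose log-library size lies
# more than 3 MADs below the median.
log.lib <- log(colSums(counts))
discard <- log.lib < median(log.lib) - 3 * mad(log.lib)
filtered <- counts[, !discard, drop = FALSE]
kept.group <- group[!discard]

# Step 2: normalization. Scale by library size factors (centered at 1)
# and log-transform with a pseudo-count of 1.
size.factors <- colSums(filtered) / mean(colSums(filtered))
logcounts <- log2(t(t(filtered) / size.factors) + 1)

# Step 3: feature selection. Keep the 20 genes (10%) with the largest variance.
gene.vars <- apply(logcounts, 1, var)
hvgs <- names(sort(gene.vars, decreasing = TRUE))[1:20]

# Step 4: dimensionality reduction. PCA on the selected genes (cells as rows).
pcs <- prcomp(t(logcounts[hvgs, ]), rank. = 5)$x

# Step 5: clustering. k-means on the PCs, compared to the simulated groups.
clusters <- kmeans(pcs, centers = 2, nstart = 10)$cluster
table(clusters, kept.group)
```

In this toy setting, the deliberately degraded first cell is removed by the quality control step, and the clusters are expected to line up with the two simulated groups because the selected genes carry most of the group difference.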

5.2 Quick start (simple)

Here, we use a droplet-based retina dataset from Macosko et al. (2015), provided in the scRNAseq package. The workflow starts from a count matrix and finishes with clusters (Figure 5.2) in preparation for biological interpretation. Similar workflows are available in abbreviated form in later parts of the book.

library(scRNAseq)
sce <- MacoskoRetinaData()

# Quality control (using mitochondrial genes).
library(scater)
is.mito <- grepl("^MT-", rownames(sce))
qcstats <- perCellQCMetrics(sce, subsets=list(Mito=is.mito))
filtered <- quickPerCellQC(qcstats, percent_subsets="subsets_Mito_percent")
sce <- sce[, !filtered$discard]

# Normalization.
sce <- logNormCounts(sce)

# Feature selection.
library(scran)
dec <- modelGeneVar(sce)
hvg <- getTopHVGs(dec, prop=0.1)

# PCA.
library(scater)
set.seed(1234)
sce <- runPCA(sce, ncomponents=25, subset_row=hvg)

# Clustering.
library(bluster)
colLabels(sce) <- clusterCells(sce, use.dimred='PCA',
    BLUSPARAM=NNGraphParam(cluster.fun="louvain"))    

# Visualization.
sce <- runUMAP(sce, dimred = 'PCA')
plotUMAP(sce, colour_by="label")

Figure 5.2: UMAP plot of the retina dataset, where each point is a cell and is colored by the assigned cluster identity.

# Marker detection.
markers <- findMarkers(sce, test.type="wilcox", direction="up", lfc=1)
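
The marker detection above uses Wilcoxon rank sum tests, whose effect size is commonly summarized as an area under the curve (AUC): the probability that a randomly chosen cell from one cluster has higher expression than a randomly chosen cell from another. The sketch below illustrates this quantity in base R for a single gene. The expression values and object names are hypothetical, and this is not meant to reproduce how scran computes its statistics internally (in particular, it ignores the `lfc` threshold).

```r
# Hypothetical log-expression values for one gene in two clusters.
cluster1 <- c(3.1, 2.8, 4.0, 3.5, 2.9)
cluster2 <- c(0.0, 1.2, 0.0, 2.9, 0.5, 1.0)
npairs <- length(cluster1) * length(cluster2)

# Count the pairs of cells in which cluster1 exceeds cluster2;
# tied pairs contribute one half.
greater <- outer(cluster1, cluster2, ">")
tied <- outer(cluster1, cluster2, "==")
auc <- (sum(greater) + 0.5 * sum(tied)) / npairs

# The same value can be recovered from the Wilcoxon W statistic.
w <- wilcox.test(cluster1, cluster2, exact = FALSE)$statistic
auc.from.w <- unname(w) / npairs

auc
auc.from.w
```

An AUC close to 1 indicates that the gene is almost always more highly expressed in the first cluster, while a value of 0.5 indicates no separation between the two clusters.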

5.3 Quick start (multiple batches)

Here, we use the pancreas Smart-seq2 dataset from Segerstolpe et al. (2016), again provided in the scRNAseq package. The workflow starts from a count matrix and finishes with clusters (Figure 5.3), with some additional steps to eliminate uninteresting batch effects between individuals. Note that a more elaborate analysis of the same dataset with justifications for each step is available in Workflow Chapter 8.

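# Packages are reloaded here so that this section can be run on its own.
library(scRNAseq)
library(scater)
library(scran)
sce <- SegerstolpePancreasData()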

# Quality control (using ERCCs).
qcstats <- perCellQCMetrics(sce)
filtered <- quickPerCellQC(qcstats, percent_subsets="altexps_ERCC_percent")
sce <- sce[, !filtered$discard]

# Normalization.
sce <- logNormCounts(sce)

# Feature selection, blocking on the individual of origin.
dec <- modelGeneVar(sce, block=sce$individual)
hvg <- getTopHVGs(dec, prop=0.1)

# Batch correction.
library(batchelor)
set.seed(1234)
sce <- correctExperiments(sce, batch=sce$individual, 
    subset.row=hvg, correct.all=TRUE)

# Clustering.
colLabels(sce) <- clusterCells(sce, use.dimred='corrected')

# Visualization.
sce <- runUMAP(sce, dimred = 'corrected')
gridExtra::grid.arrange(
    plotUMAP(sce, colour_by="label"),
    plotUMAP(sce, colour_by="individual"),
    ncol=2
)

Figure 5.3: UMAP plot of the pancreas dataset, where each point is a cell and is colored by the assigned cluster identity (left) or the individual of origin (right).

# Marker detection, blocking on the individual of origin.
markers <- findMarkers(sce, test.type="wilcox", direction="up", lfc=1,
    block=sce$individual)
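
To see why this workflow blocks on the individual of origin, consider a gene whose expression differs between individuals but barely varies within each one. The base R sketch below uses simulated values (all names and numbers are hypothetical) to compare a naive variance across all cells with the average of the within-individual variances. The latter is only a rough analogue of what blocking achieves, not the exact computation performed by `modelGeneVar()`.

```r
# Simulate one gene in 100 cells from two hypothetical donors, with a
# shift in mean expression between donors but little variation within each.
set.seed(42)
batch <- rep(c("donor1", "donor2"), each = 50)
expr <- rnorm(100, mean = ifelse(batch == "donor1", 2, 5), sd = 0.3)

# Ignoring the donor inflates the variance with the between-donor shift.
naive.var <- var(expr)

# Computing the variance within each donor and averaging removes the shift.
blocked.var <- mean(tapply(expr, batch, var))

c(naive = naive.var, blocked = blocked.var)
```

Without blocking, such a gene would look highly variable and could be selected as a feature or reported as a marker, even though the variation it captures is a batch effect rather than a biological difference between cell populations.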

Session Info

R version 4.4.0 beta (2024-04-15 r86425)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] batchelor_1.20.0            bluster_1.14.0             
 [3] scran_1.32.0                scater_1.32.0              
 [5] ggplot2_3.5.1               scuttle_1.14.0             
 [7] scRNAseq_2.17.10            SingleCellExperiment_1.26.0
 [9] SummarizedExperiment_1.34.0 Biobase_2.64.0             
[11] GenomicRanges_1.56.0        GenomeInfoDb_1.40.0        
[13] IRanges_2.38.0              S4Vectors_0.42.0           
[15] BiocGenerics_0.50.0         MatrixGenerics_1.16.0      
[17] matrixStats_1.3.0           BiocStyle_2.32.0           
[19] rebook_1.14.0              

loaded via a namespace (and not attached):
  [1] jsonlite_1.8.8            CodeDepends_0.6.6        
  [3] magrittr_2.0.3            ggbeeswarm_0.7.2         
  [5] GenomicFeatures_1.56.0    gypsum_1.0.0             
  [7] farver_2.1.1              rmarkdown_2.26           
  [9] BiocIO_1.14.0             zlibbioc_1.50.0          
 [11] vctrs_0.6.5               memoise_2.0.1            
 [13] Rsamtools_2.20.0          DelayedMatrixStats_1.26.0
 [15] RCurl_1.98-1.14           htmltools_0.5.8.1        
 [17] S4Arrays_1.4.0            AnnotationHub_3.12.0     
 [19] curl_5.2.1                BiocNeighbors_1.22.0     
 [21] Rhdf5lib_1.26.0           SparseArray_1.4.0        
 [23] rhdf5_2.48.0              sass_0.4.9               
 [25] alabaster.base_1.4.0      bslib_0.7.0              
 [27] alabaster.sce_1.4.0       httr2_1.0.1              
 [29] cachem_1.0.8              ResidualMatrix_1.14.0    
 [31] GenomicAlignments_1.40.0  igraph_2.0.3             
 [33] lifecycle_1.0.4           pkgconfig_2.0.3          
 [35] rsvd_1.0.5                Matrix_1.7-0             
 [37] R6_2.5.1                  fastmap_1.1.1            
 [39] GenomeInfoDbData_1.2.12   digest_0.6.35            
 [41] colorspace_2.1-0          AnnotationDbi_1.66.0     
 [43] paws.storage_0.5.0        dqrng_0.3.2              
 [45] irlba_2.3.5.1             ExperimentHub_2.12.0     
 [47] RSQLite_2.3.6             beachmat_2.20.0          
 [49] labeling_0.4.3            filelock_1.0.3           
 [51] fansi_1.0.6               httr_1.4.7               
 [53] abind_1.4-5               compiler_4.4.0           
 [55] bit64_4.0.5               withr_3.0.0              
 [57] BiocParallel_1.38.0       viridis_0.6.5            
 [59] DBI_1.2.2                 highr_0.10               
 [61] HDF5Array_1.32.0          alabaster.ranges_1.4.0   
 [63] alabaster.schemas_1.4.0   rappdirs_0.3.3           
 [65] DelayedArray_0.30.0       rjson_0.2.21             
 [67] tools_4.4.0               vipor_0.4.7              
 [69] beeswarm_0.4.0            glue_1.7.0               
 [71] restfulr_0.0.15           rhdf5filters_1.16.0      
 [73] grid_4.4.0                cluster_2.1.6            
 [75] generics_0.1.3            gtable_0.3.5             
 [77] ensembldb_2.28.0          metapod_1.12.0           
 [79] BiocSingular_1.20.0       ScaledMatrix_1.12.0      
 [81] utf8_1.2.4                XVector_0.44.0           
 [83] RcppAnnoy_0.0.22          ggrepel_0.9.5            
 [85] BiocVersion_3.19.1        pillar_1.9.0             
 [87] limma_3.60.0              dplyr_1.1.4              
 [89] BiocFileCache_2.12.0      lattice_0.22-6           
 [91] FNN_1.1.4                 rtracklayer_1.64.0       
 [93] bit_4.0.5                 tidyselect_1.2.1         
 [95] paws.common_0.7.2         locfit_1.5-9.9           
 [97] Biostrings_2.72.0         knitr_1.46               
 [99] gridExtra_2.3             bookdown_0.39            
[101] ProtGenerics_1.36.0       edgeR_4.2.0              
[103] xfun_0.43                 statmod_1.5.0            
[105] UCSC.utils_1.0.0          lazyeval_0.2.2           
[107] yaml_2.3.8                evaluate_0.23            
[109] codetools_0.2-20          tibble_3.2.1             
[111] alabaster.matrix_1.4.0    BiocManager_1.30.22      
[113] graph_1.82.0              cli_3.6.2                
[115] uwot_0.2.2                munsell_0.5.1            
[117] jquerylib_0.1.4           Rcpp_1.0.12              
[119] dir.expiry_1.12.0         dbplyr_2.5.0             
[121] png_0.1-8                 XML_3.99-0.16.1          
[123] parallel_4.4.0            blob_1.2.4               
[125] AnnotationFilter_1.28.0   sparseMatrixStats_1.16.0 
[127] bitops_1.0-7              viridisLite_0.4.2        
[129] alabaster.se_1.4.0        scales_1.3.0             
[131] crayon_1.5.2              rlang_1.1.3              
[133] cowplot_1.1.3             KEGGREST_1.44.0          

References

Macosko, E. Z., A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh, et al. 2015. “Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets.” Cell 161 (5): 1202–14.

Segerstolpe, A., A. Palasantza, P. Eliasson, E. M. Andersson, A. C. Andreasson, X. Sun, S. Picelli, et al. 2016. “Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes.” Cell Metab. 24 (4): 593–607.