Chapter 3 Getting scRNA-seq datasets

3.1 Overview

Sequencing data from single-cell RNA-seq experiments must be converted into a matrix of expression values. This is usually a count matrix containing the number of reads mapped to each gene (row) in each cell (column). Alternatively, the counts may be that of the number of unique molecular identifiers (UMIs); these are interpreted in the same manner as read counts but are less affected by PCR artifacts during library preparation (Islam et al. 2014). Once this quantification is complete, we can proceed with our downstream statistical analyses in R.

Constructing a count matrix from raw scRNA-seq data requires some thought as the term “single-cell RNA-seq” encompasses a variety of different experimental protocols. This includes droplet-based protocols like 10X Genomics, inDrop and Drop-seq; plate-based protocols with UMIs like CEL-seq(2) and MARS-seq; plate-based protocols with reads (mostly Smart-seq2); and others like sciRNA-seq, to name a few. Each approach requires a different processing pipeline to deal with cell demultiplexing and UMI deduplication (if applicable). This chapter will briefly describe some of the methods used to generate a count matrix and read it into R.

3.2 Some comments on experimental design

Each scRNA-seq protocol has its own advantages and weaknesses that are discussed extensively elsewhere (Mereu et al. 2019; Ziegenhain et al. 2017). In practical terms, droplet-based technologies are the current de facto standard due to their throughput and low cost per cell. Plate-based methods can capture other phenotypic information (e.g., morphology) and are more amenable to customization. Read-based methods provide whole-transcript coverage, which is useful in some applications (e.g., splicing, exome mutations); otherwise, UMI-based methods are more popular as they mitigate the effects of PCR amplification noise. The choice of method is left to the reader’s circumstances - we will simply note that most of the downstream analysis is agnostic to the exact technology being used.

Another question is how many cells should be captured, and to what depth they should be sequenced. The best trade-off between these two factors is an active topic of research (Zhang, Ntranos, and Tse 2020; Svensson, Veiga Beltrame, and Pachter 2019), though ultimately, much depends on the scientific aims of the experiment. If we are aiming to discover rare cell subpopulations, we would need more cells, whereas if we are aiming to quantify subtle differences, we would need more sequencing depth. As of time of writing, an informal survey of the literature suggests that typical droplet-based experiments would capture anywhere from 10,000 to 100,000 cells, sequenced at anywhere from 1,000 to 10,000 UMIs per cell (usually in inverse proportion to the number of cells). Droplet-based methods also have a trade-off between throughput and doublet rate that affects the true efficiency of sequencing.

For studies involving multiple samples or conditions, the design considerations are the same as those for bulk RNA-seq experiments. There should be multiple biological replicates for each condition and conditions should not be confounded with batch. Note that individual cells are not replicates; rather, we are referring to samples derived from replicate donors or cultures. In fact, this adds another dimension into the resourcing equation - should we obtain more cells per sample at the cost of being able to sequence fewer samples? The best answer depends on the sizes of the subpopulations involved, the ease with which they are distinguished from others, and their variability across different samples and conditions. Such factors are rarely known ahead of time, so an informed decision on the design will often benefit from pilot experiments.

3.3 Creating a count matrix

As mentioned above, the exact procedure for quantifying expression depends on the technology involved:

  • For 10X Genomics data, the Cellranger software suite (Zheng et al. 2017) provides a custom pipeline to obtain a count matrix. This uses STAR to align reads to the reference genome and then counts the number of unique UMIs mapped to each gene.
  • Alternatively, pseudo-alignment methods such as alevin (Srivastava et al. 2019) can be used to obtain a count matrix from the same data. This avoids the need for explicit alignment, which reduces the compute time and memory usage.
  • For other highly multiplexed protocols, the scPipe package provides a more general pipeline for processing scRNA-seq data. This uses the Rsubread aligner to align reads and then counts reads or UMIs per gene.
  • For CEL-seq or CEL-seq2 data, the scruff package provides a dedicated pipeline for quantification.
  • For read-based protocols, we can generally re-use the same pipelines for processing bulk RNA-seq data.
  • For any data involving spike-in transcripts, the spike-in sequences should be included as part of the reference genome during alignment and quantification.

In all cases, the identity of the genes in the count matrix should be defined with standard identifiers from Ensembl or Entrez. These provide an unambiguous mapping between each row of the matrix and the corresponding gene. In contrast, a single gene symbol may be used by multiple loci, or the mapping between symbols and genes may change over time, e.g., if the gene is renamed. This makes it difficult to re-use the count matrix as we cannot be confident in the meaning of the symbols. (Of course, identifiers can be easily converted to gene symbols later on in the analysis. This is the recommended approach as it allows us to document how the conversion was performed and to backtrack to the stable identifiers if the symbols are ambiguous.)

Depending on the process involved, there may be additional points of concern:

  • Some feature-counting tools (e.g., HTSeq) will report mapping statistics in the count matrix, such as the number of unaligned or unassigned reads. While these values can be useful for quality control, they would be misleading if treated as gene expression values. Thus, they should be removed (or at least moved somewhere else) prior to further analyses.
  • The most common spike-ins are those developed by the External RNA Controls Consortium (ERCC), which have names along the lines of ERCC-00002. For human data, one should be careful to distinguish these rows from an actual ERCC gene family that has gene symbols like ERCC1. This issue can be avoided altogether by using standard identifiers that are not susceptible to these naming conflicts.

3.4 Reading counts into R

3.4.1 From tabular formats

The next step is to import the count matrix into R. Again, this depends on the output format of the aforementioned processing pipeline. In the simplest case, the pipeline will produce a matrix in tabular format, which can be read in with standard methods like read.delim(). We demonstrate below using a pancreas scRNA-seq dataset from Muraro et al. (2016) (GSE85241):

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
url <- file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
    "GSE85nnn/GSE85241/suppl",
    "GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz")

# Making a symbolic link so that the later code can pretend
# that we downloaded the file into the local directory.
muraro.fname <- bfcrpath(bfc, url)
local.name <- URLdecode(basename(url))
unlink(local.name)
if (.Platform$OS.type=="windows") {
    file.copy(muraro.fname, local.name)
} else {
    file.symlink(muraro.fname, local.name)
}
mat <- as.matrix(read.delim("GSE85241_cellsystems_dataset_4donors_updated.csv.gz"))
dim(mat) # number of rows, number of columns
## [1] 19140  3072

In practice, a more efficient approach is to read in the table in sparse format using the readSparseCounts() function from the scuttle package. This only stores the non-zero values and avoids spending memory on the majority of zeros in lowly-sequenced scRNA-seq experiments.

library(scuttle)
sparse.mat <- readSparseCounts("GSE85241_cellsystems_dataset_4donors_updated.csv.gz")
dim(sparse.mat)
## [1] 19140  3072
# We can see that it uses less memory compared to 'mat'.
object.size(sparse.mat)
## 150978872 bytes
object.size(mat)
## 471999152 bytes

On occasion, we may encounter count data stored in Excel files. These can be extracted into a matrix using functions from the readxl package, as demonstrated for a dataset from Wilson et al. (2015) (GSE61533):

bfc <- BiocFileCache("raw_data", ask=FALSE)
wilson.fname <- bfcrpath(bfc, file.path("ftp://ftp.ncbi.nlm.nih.gov/geo/series",
    "GSE61nnn/GSE61533/suppl/GSE61533_HTSEQ_count_results.xls.gz"))

library(R.utils)
wilson.name2 <- "GSE61533_HTSEQ_count_results.xls"
gunzip(wilson.fname, destname=wilson.name2, remove=FALSE, overwrite=TRUE)
library(readxl)
all.counts <- read_excel("GSE61533_HTSEQ_count_results.xls")
gene.names <- all.counts$ID
all.counts <- as.matrix(all.counts[,-1])
rownames(all.counts) <- gene.names
dim(all.counts)
## [1] 38498    96

3.4.2 From Cellranger output

For 10X Genomics data, the Cellranger software suite will produce an output directory containing counts and feature/barcode annotations. We can read this into R by supplying the directory path to read10xCounts() from the DropletUtils package, as demonstrated below using a 4000 peripheral blood mononuclear cell dataset. Note that the function produces a SingleCellExperiment object containing the matrix, which we will discuss in more detail in the next chapter.

library(DropletTestFiles)
cached <- getTestFile("tenx-2.1.0-pbmc4k/1.0.0/filtered.tar.gz")
fpath <- "tenx-2.1.0-pbmc4k"
untar(cached, exdir=fpath)
library(DropletUtils)
sce <- read10xCounts("tenx-2.1.0-pbmc4k/filtered_gene_bc_matrices/GRCh38")
sce
## class: SingleCellExperiment 
## dim: 33694 4340 
## metadata(1): Samples
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

We can also read in multiple count matrices by passing multiple directory paths to read10xCounts(). Provided that all datasets have the same gene annotation, the function will be able to combine them into a single object.

# Making a copy and pretending it's a different sample,
# for demonstration purposes.
# TODO: actually get a different sample.
target <- paste0(fpath, '-2')
unlink(target)

if (.Platform$OS.type=="windows") {
    file.copy(fpath, target)
} else {
    file.symlink(fpath, target)
}
dirA <- "tenx-2.1.0-pbmc4k/filtered_gene_bc_matrices/GRCh38"
dirB <- "tenx-2.1.0-pbmc4k-2/filtered_gene_bc_matrices/GRCh38"
sce <- read10xCounts(c(dirA, dirB))
sce
## class: SingleCellExperiment 
## dim: 33694 8680 
## metadata(1): Samples
## assays(1): counts
## rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
##   ENSG00000268674
## rowData names(2): ID Symbol
## colnames: NULL
## colData names(2): Sample Barcode
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

It is worth noting that the Cellranger software suite is not the only approach to processing 10X Genomics data. For example, alevin output can be read into R using the tximeta package, while kallisto-bustools output can be read using the BUSpaRse package.

3.4.3 From HDF5-based formats

A family of scRNA-seq storage formats is based around Hierarchical Data Format version 5 (HDF5). These formats offer the ability to store, in the same file, both the expression values and associated gene and cell annotations. One flavor of this approach is the H5AD format, which can be read into R as a SingleCellExperiment using the zellkonverter package. We demonstrate below with an example dataset that is built into the package:

library(zellkonverter)
demo <- system.file("extdata", "krumsiek11.h5ad", package = "zellkonverter")
sce <- readH5AD(demo)
sce
## class: SingleCellExperiment 
## dim: 11 640 
## metadata(2): highlights iroot
## assays(1): X
## rownames(11): Gata2 Gata1 ... EgrNab Gfi1
## rowData names(0):
## colnames(640): 0 1 ... 158-3 159-3
## colData names(1): cell_type
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Another flavor is the Loom file format, which we can read into R with the LoomExperiment package. In this case, the procedure creates a SingleCellLoomExperiment, which is effectively a plug-and-play equivalent to the SingleCellExperiment.

library(LoomExperiment)
demo <- system.file("extdata", "L1_DRG_20_example.loom", package = "LoomExperiment")
scle <- import(demo, type="SingleCellLoomExperiment")
scle
## class: SingleCellLoomExperiment 
## dim: 20 20 
## metadata(4): CreatedWith LOOM_SPEC_VERSION LoomExperiment-class
##   MatrixName
## assays(1): matrix
## rownames: NULL
## rowData names(7): Accession Gene ... X_Total X_Valid
## colnames: NULL
## colData names(103): Age AnalysisPool ... cDNA_Lib_Ok ngperul_cDNA
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(2): KNN MKNN

The HDF5-based formats have an additional advantage in that Bioconductor-based analyses can be performed without reading all of the data into R. This allows us to analyze very large datasets in the presence of limited computer memory, a functionality that we will discuss in more detail in Advanced Chapter 14.

Session Info

R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] LoomExperiment_1.25.0       BiocIO_1.17.0              
 [3] rhdf5_2.51.0                zellkonverter_1.17.0       
 [5] DropletUtils_1.27.0         DropletTestFiles_1.17.0    
 [7] readxl_1.4.3                R.utils_2.12.3             
 [9] R.oo_1.27.0                 R.methodsS3_1.8.2          
[11] scuttle_1.17.0              SingleCellExperiment_1.29.1
[13] SummarizedExperiment_1.37.0 Biobase_2.67.0             
[15] GenomicRanges_1.59.0        GenomeInfoDb_1.43.0        
[17] IRanges_2.41.0              S4Vectors_0.45.1           
[19] BiocGenerics_0.53.1         generics_0.1.3             
[21] MatrixGenerics_1.19.0       matrixStats_1.4.1          
[23] BiocFileCache_2.15.0        dbplyr_2.5.0               
[25] BiocStyle_2.35.0            rebook_1.17.0              

loaded via a namespace (and not attached):
 [1] DBI_1.2.3                 CodeDepends_0.6.6        
 [3] rlang_1.1.4               magrittr_2.0.3           
 [5] compiler_4.5.0            RSQLite_2.3.7            
 [7] dir.expiry_1.15.0         DelayedMatrixStats_1.29.0
 [9] png_0.1-8                 vctrs_0.6.5              
[11] stringr_1.5.1             pkgconfig_2.0.3          
[13] crayon_1.5.3              fastmap_1.2.0            
[15] XVector_0.47.0            utf8_1.2.4               
[17] rmarkdown_2.29            graph_1.85.0             
[19] UCSC.utils_1.3.0          purrr_1.0.2              
[21] bit_4.5.0                 xfun_0.49                
[23] zlibbioc_1.53.0           cachem_1.1.0             
[25] beachmat_2.23.0           jsonlite_1.8.9           
[27] blob_1.2.4                rhdf5filters_1.19.0      
[29] DelayedArray_0.33.1       Rhdf5lib_1.29.0          
[31] BiocParallel_1.41.0       parallel_4.5.0           
[33] R6_2.5.1                  stringi_1.8.4            
[35] bslib_0.8.0               reticulate_1.39.0        
[37] limma_3.63.2              jquerylib_0.1.4          
[39] cellranger_1.1.0          Rcpp_1.0.13-1            
[41] bookdown_0.41             knitr_1.49               
[43] Matrix_1.7-1              tidyselect_1.2.1         
[45] abind_1.4-8               yaml_2.3.10              
[47] codetools_0.2-20          curl_6.0.0               
[49] lattice_0.22-6            tibble_3.2.1             
[51] basilisk.utils_1.19.0     withr_3.0.2              
[53] KEGGREST_1.47.0           evaluate_1.0.1           
[55] ExperimentHub_2.15.0      Biostrings_2.75.1        
[57] pillar_1.9.0              BiocManager_1.30.25      
[59] filelock_1.0.3            BiocVersion_3.21.1       
[61] sparseMatrixStats_1.19.0  glue_1.8.0               
[63] tools_4.5.0               AnnotationHub_3.15.0     
[65] locfit_1.5-9.10           XML_3.99-0.17            
[67] grid_4.5.0                AnnotationDbi_1.69.0     
[69] edgeR_4.5.0               GenomeInfoDbData_1.2.13  
[71] basilisk_1.19.0           HDF5Array_1.35.1         
[73] cli_3.6.3                 rappdirs_0.3.3           
[75] fansi_1.0.6               S4Arrays_1.7.1           
[77] dplyr_1.1.4               sass_0.4.9               
[79] digest_0.6.37             SparseArray_1.7.1        
[81] dqrng_0.4.1               memoise_2.0.1            
[83] htmltools_0.5.8.1         lifecycle_1.0.4          
[85] httr_1.4.7                statmod_1.5.0            
[87] mime_0.12                 bit64_4.5.2              

References

Islam, S., A. Zeisel, S. Joost, G. La Manno, P. Zajac, M. Kasper, P. Lonnerberg, and S. Linnarsson. 2014. Quantitative single-cell RNA-seq with unique molecular identifiers.” Nat. Methods 11 (2): 163–66.
Mereu, Elisabetta, Atefeh Lafzi, Catia Moutinho, Christoph Ziegenhain, Davis J. MacCarthy, Adrian Alvarez, Eduard Batlle, et al. 2019. “Benchmarking Single-Cell RNA Sequencing Protocols for Cell Atlas Projects.” bioRxiv. https://doi.org/10.1101/630087.
Muraro, M. J., G. Dharmadhikari, D. Grun, N. Groen, T. Dielen, E. Jansen, L. van Gurp, et al. 2016. A Single-Cell Transcriptome Atlas of the Human Pancreas.” Cell Syst 3 (4): 385–94.
Srivastava, A., L. Malik, T. Smith, I. Sudbery, and R. Patro. 2019. Alevin efficiently estimates accurate gene abundances from dscRNA-seq data.” Genome Biol 20 (1): 65.
Svensson, V., E. da Veiga Beltrame, and L. Pachter. 2019. “Quantifying the Tradeoff Between Sequencing Depth and Cell Number in Single-Cell RNA-Seq.” bioRxiv, 762773.
Wilson, N. K., D. G. Kent, F. Buettner, M. Shehata, I. C. Macaulay, F. J. Calero-Nieto, M. Sanchez Castillo, et al. 2015. Combined single-cell functional and gene expression analysis resolves heterogeneity within stem cell populations.” Cell Stem Cell 16 (6): 712–24.
Zhang, M. J., V. Ntranos, and D. Tse. 2020. Determining sequencing depth in a single-cell RNA-seq experiment.” Nat Commun 11 (1): 774.
Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.
Ziegenhain, C., B. Vieth, S. Parekh, B. Reinius, A. Guillaumet-Adkins, M. Smets, H. Leonhardt, H. Heyn, I. Hellmann, and W. Enard. 2017. Comparative Analysis of Single-Cell RNA Sequencing Methods.” Mol. Cell 65 (4): 631–43.