Chapter 19 Single-nuclei RNA-seq processing
19.1 Introduction
Single-nuclei RNA-seq (snRNA-seq) provides another strategy for performing single-cell transcriptomics where individual nuclei instead of cells are captured and sequenced. The major advantage of snRNA-seq over scRNA-seq is that the former does not require the preservation of cellular integrity during sample preparation, especially dissociation. We only need to extract nuclei in an intact state, meaning that snRNA-seq can be applied to cell types, tissues and samples that are not amenable to dissociation and later processing. The cost of this flexibility is the loss of transcripts that are primarily located in the cytoplasm, potentially limiting the availability of biological signal for genes with little nuclear localization.
The computational analysis of snRNA-seq data is very much like that of scRNA-seq data. We have a matrix of (UMI) counts for genes by cells that requires quality control, normalization and so on. (Technically, the columns correspond to nuclei but we will use these two terms interchangeably in this chapter.) In fact, the biggest difference in processing occurs in the construction of the count matrix itself, where intronic regions must be included in the annotation for each gene to account for the increased abundance of unspliced transcripts. The rest of the analysis only requires a few minor adjustments to account for the loss of cytoplasmic transcripts. We demonstrate using a dataset from Wu et al. (2019) involving snRNA-seq on healthy and fibrotic mouse kidneys.
## class: SingleCellExperiment
## dim: 18249 8231
## metadata(0):
## assays(1): counts
## rownames(18249): mt-Cytb mt-Nd6 ... Gm44613 Gm38304
## rowData names(0):
## colnames(8231): sNuc-10x_AAACCTGAGTCCGGTC sNuc-10x_AAACCTGCACAGACAG ...
## UUO_TTGCCGTCACAAGACG UUO_TTTGTCATCTGCTGTC
## colData names(2): Technology Status
## reducedDimNames(0):
## altExpNames(0):
19.2 Quality control for stripped nuclei
The loss of the cytoplasm means that stripped nuclei should not contain any mitochondrial transcripts, which makes the mitochondrial proportion an excellent QC metric for the efficacy of the stripping process. Unlike in scRNA-seq, there is no need to worry about variation in mitochondrial content due to genuine biology. The presence of mitochondrial counts in a library suggests that the removal of the cytoplasm was not complete, possibly introducing irrelevant heterogeneity in downstream analyses.
library(scuttle)
sce <- addPerCellQC(sce, subsets=list(Mt=grep("^mt-", rownames(sce))))
summary(sce$subsets_Mt_percent == 0)
## Mode FALSE TRUE
## logical 2264 5967
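As a rough illustration of what the mitochondrial percentage measures, the following base R sketch computes it by hand on a made-up count matrix. The gene names, nucleus names and counts are all invented for illustration; it only mirrors the idea behind the metric and is not a substitute for addPerCellQC().

```r
# Toy count matrix (genes x nuclei); all values are made up for illustration.
toy <- matrix(
    c( 0,  2,  0,
       0,  1,  0,
      50, 40, 30,
      50, 57, 70),
    nrow=4, byrow=TRUE,
    dimnames=list(
        c("mt-Cytb", "mt-Nd6", "Actb", "Malat1"),
        paste0("nucleus", 1:3)
    )
)

# Identify mitochondrial genes by their "mt-" prefix, as in the text.
is.mt <- grepl("^mt-", rownames(toy))

# Percentage of each library's counts assigned to mitochondrial genes.
toy.mt.percent <- colSums(toy[is.mt,,drop=FALSE]) / colSums(toy) * 100
toy.mt.percent

# Which nuclei have no mitochondrial counts at all?
toy.mt.percent == 0
```

In this toy example, only the second nucleus carries any mitochondrial counts, so it is the only candidate for incomplete stripping.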
We apply a simple filter to remove libraries corresponding to incompletely stripped nuclei. The outlier-based approach described in Section 6 can be used here, but some caution is required in low-coverage experiments where the majority of nuclei have zero mitochondrial counts. In such cases, the MAD may also be zero, so that libraries with very low but non-zero mitochondrial counts are also removed. This is typically too stringent, as such transcripts may be present due to sporadic ambient contamination rather than incomplete stripping.
stats <- quickPerCellQC(colData(sce), sub.fields="subsets_Mt_percent")
colSums(as.matrix(stats))
## low_lib_size low_n_features high_subsets_Mt_percent
## 0 0 2264
## discard
## 2264
Instead, we enforce a minimum difference between the threshold and the median in isOutlier() (Figure 19.1). We arbitrarily choose +0.5% here, which takes precedence over the outlier-based threshold if the latter is too low. In this manner, we avoid discarding libraries with a very modest amount of contamination; the same code will automatically fall back to the outlier-based threshold in datasets where the stripping was systematically less effective.
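# Flag high mitochondrial percentages, requiring at least 0.5% above the median.
stats$high_subsets_Mt_percent <- isOutlier(sce$subsets_Mt_percent,
    type="higher", min.diff=0.5)
# Recompute the overall discard flag from the individual filters.
stats$discard <- Reduce("|", stats[,colnames(stats)!="discard"])
colSums(as.matrix(stats))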
## low_lib_size low_n_features high_subsets_Mt_percent
## 0 0 42
## discard
## 42
library(scater)
plotColData(sce, x="Status", y="subsets_Mt_percent",
    colour_by=I(stats$high_subsets_Mt_percent))
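The interaction between a zero MAD and the minimum difference can be sketched in base R. This assumes that the upper threshold is the median plus the larger of three MADs and the minimum difference, which is our reading of the behaviour described above rather than a copy of the isOutlier() implementation. The percentages are made up.

```r
# Made-up mitochondrial percentages where most nuclei have zero counts.
toy.percent <- c(rep(0, 8), 0.1, 0.3, 5)

med <- median(toy.percent)
dev <- mad(toy.percent)  # scaled MAD; zero here as most values equal the median

# Outlier-only threshold: median + 3 MADs collapses to the median itself.
plain.threshold <- med + 3 * dev

# Assumed behaviour when a minimum difference of 0.5 from the median is enforced.
min.diff <- 0.5
guarded.threshold <- med + max(3 * dev, min.diff)

# Number of libraries flagged under each threshold.
n.plain <- sum(toy.percent > plain.threshold)
n.guarded <- sum(toy.percent > guarded.threshold)
c(plain=n.plain, guarded=n.guarded)
```

With the outlier-only threshold, every library with a non-zero percentage is flagged; with the minimum difference, only the library at 5% is flagged. If the MAD were large enough for three MADs to exceed 0.5, the two thresholds would coincide.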
19.4 Tricks with ambient contamination
The expected absence of genuine mitochondrial expression can also be exploited to estimate the level of ambient contamination (Section 14.4). We demonstrate on mouse brain snRNA-seq data from 10X Genomics (Zheng et al. 2017), using the raw count matrix prior to any filtering for nuclei-containing barcodes.
library(DropletTestFiles)
raw.path <- getTestFile("tenx-2.0.1-nuclei_900/1.0.0/raw.tar.gz")
out.path <- file.path(tempdir(), "nuclei")
untar(raw.path, exdir=out.path)
library(DropletUtils)
fname <- file.path(out.path, "raw_gene_bc_matrices/mm10")
sce.brain <- read10xCounts(fname, col.names=TRUE)
sce.brain
## class: SingleCellExperiment
## dim: 27998 737280
## metadata(1): Samples
## assays(1): counts
## rownames(27998): ENSMUSG00000051951 ENSMUSG00000089699 ...
## ENSMUSG00000096730 ENSMUSG00000095742
## rowData names(2): ID Symbol
## colnames(737280): AAACCTGAGAAACCAT-1 AAACCTGAGAAACCGC-1 ...
## TTTGTCATCTTTAGTC-1 TTTGTCATCTTTCCTC-1
## colData names(2): Sample Barcode
## reducedDimNames(0):
## altExpNames(0):
We call non-empty droplets using emptyDrops() as previously described (Section 15.2).
e.out <- emptyDrops(counts(sce.brain))
summary(e.out$FDR <= 0.001)
## Mode FALSE TRUE NA's
## logical 2324 1712 733244
If our libraries are of high quality, we can assume that any mitochondrial “expression” is due to contamination from the ambient solution. We then use the controlAmbience() function to estimate the proportion of ambient contamination for each gene, allowing us to mark potentially problematic genes in the DE results (Figure 19.4). In fact, we can use this information even earlier to remove these genes during dimensionality reduction and clustering. This is not generally possible for scRNA-seq as any notable contaminating transcripts may originate from a subpopulation that actually expresses that gene and thus cannot be blindly removed.
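# Ambient profile from the low-count barcodes, skipping the Good-Turing estimation step.
ambient <- estimateAmbience(counts(sce.brain), round=FALSE, good.turing=FALSE)
# Sum counts across all barcodes called as nuclei.
nuclei <- rowSums(counts(sce.brain)[,which(e.out$FDR <= 0.001)])
# Mitochondrial genes are the controls assumed to be absent from stripped nuclei.
is.mito <- grepl("^mt-", rowData(sce.brain)$Symbol)
contam <- controlAmbience(nuclei, ambient, features=is.mito, mode="proportion")
plot(log10(nuclei+1), contam*100, col=ifelse(is.mito, "red", "grey"), pch=16,
    xlab="Log-nuclei expression", ylab="Contamination (%)")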
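The logic behind using control genes can be sketched in base R on made-up numbers. This assumes that the ambient profile is rescaled so that its total over the control genes matches the observed total in the nuclei, and that the contamination proportion is the rescaled ambient count divided by the observed count, capped at 1. This is our reading of the approach, not the controlAmbience() code itself, and all gene names and counts below are hypothetical.

```r
# Made-up summed counts for five genes in the nuclei and in the ambient solution.
toy.nuclei  <- c(mtA=10,  mtB=30,  geneC=400, geneD=50,  geneE=1000)
toy.ambient <- c(mtA=100, mtB=300, geneC=200, geneD=400, geneE=100)

# Control genes assumed to have no genuine expression in the nuclei.
is.control <- c(TRUE, TRUE, FALSE, FALSE, FALSE)

# All control counts in the nuclei are attributed to the ambient solution,
# which determines how much ambient material is present overall.
scaling <- sum(toy.nuclei[is.control]) / sum(toy.ambient[is.control])
scaled.ambient <- toy.ambient * scaling

# Proportion of each gene's count explained by ambient contamination.
toy.contam <- pmin(1, scaled.ambient / toy.nuclei)
round(toy.contam, 3)
```

Here the control genes are fully explained by contamination by construction, while geneD is mostly ambient (80%) and geneE is barely affected (1%). Genes like geneD are the ones that would be marked as problematic in DE results.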
Session Info
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] DropletUtils_1.10.3 DropletTestFiles_1.0.0
[3] batchelor_1.6.2 bluster_1.0.0
[5] scran_1.18.5 scater_1.18.6
[7] ggplot2_3.3.3 scuttle_1.0.4
[9] scRNAseq_2.4.0 SingleCellExperiment_1.12.0
[11] SummarizedExperiment_1.20.0 Biobase_2.50.0
[13] GenomicRanges_1.42.0 GenomeInfoDb_1.26.4
[15] IRanges_2.24.1 S4Vectors_0.28.1
[17] BiocGenerics_0.36.0 MatrixGenerics_1.2.1
[19] matrixStats_0.58.0 BiocStyle_2.18.1
[21] rebook_1.0.0
loaded via a namespace (and not attached):
[1] AnnotationHub_2.22.0 BiocFileCache_1.14.0
[3] igraph_1.2.6 lazyeval_0.2.2
[5] BiocParallel_1.24.1 digest_0.6.27
[7] ensembldb_2.14.0 htmltools_0.5.1.1
[9] viridis_0.5.1 fansi_0.4.2
[11] magrittr_2.0.1 memoise_2.0.0
[13] limma_3.46.0 Biostrings_2.58.0
[15] R.utils_2.10.1 askpass_1.1
[17] prettyunits_1.1.1 colorspace_2.0-0
[19] blob_1.2.1 rappdirs_0.3.3
[21] xfun_0.22 dplyr_1.0.5
[23] callr_3.5.1 crayon_1.4.1
[25] RCurl_1.98-1.3 jsonlite_1.7.2
[27] graph_1.68.0 glue_1.4.2
[29] gtable_0.3.0 zlibbioc_1.36.0
[31] XVector_0.30.0 DelayedArray_0.16.2
[33] BiocSingular_1.6.0 Rhdf5lib_1.12.1
[35] HDF5Array_1.18.1 scales_1.1.1
[37] edgeR_3.32.1 DBI_1.1.1
[39] Rcpp_1.0.6 viridisLite_0.3.0
[41] xtable_1.8-4 progress_1.2.2
[43] dqrng_0.2.1 bit_4.0.4
[45] rsvd_1.0.3 ResidualMatrix_1.0.0
[47] httr_1.4.2 ellipsis_0.3.1
[49] R.methodsS3_1.8.1 pkgconfig_2.0.3
[51] XML_3.99-0.6 farver_2.1.0
[53] CodeDepends_0.6.5 sass_0.3.1
[55] dbplyr_2.1.0 locfit_1.5-9.4
[57] utf8_1.2.1 tidyselect_1.1.0
[59] labeling_0.4.2 rlang_0.4.10
[61] later_1.1.0.1 AnnotationDbi_1.52.0
[63] munsell_0.5.0 BiocVersion_3.12.0
[65] tools_4.0.4 cachem_1.0.4
[67] generics_0.1.0 RSQLite_2.2.4
[69] ExperimentHub_1.16.0 evaluate_0.14
[71] stringr_1.4.0 fastmap_1.1.0
[73] yaml_2.2.1 processx_3.4.5
[75] knitr_1.31 bit64_4.0.5
[77] purrr_0.3.4 AnnotationFilter_1.14.0
[79] sparseMatrixStats_1.2.1 mime_0.10
[81] R.oo_1.24.0 xml2_1.3.2
[83] biomaRt_2.46.3 compiler_4.0.4
[85] beeswarm_0.3.1 curl_4.3
[87] interactiveDisplayBase_1.28.0 statmod_1.4.35
[89] tibble_3.1.0 bslib_0.2.4
[91] stringi_1.5.3 highr_0.8
[93] ps_1.6.0 GenomicFeatures_1.42.2
[95] lattice_0.20-41 ProtGenerics_1.22.0
[97] Matrix_1.3-2 vctrs_0.3.6
[99] rhdf5filters_1.2.0 pillar_1.5.1
[101] lifecycle_1.0.0 BiocManager_1.30.10
[103] jquerylib_0.1.3 BiocNeighbors_1.8.2
[105] cowplot_1.1.1 bitops_1.0-6
[107] irlba_2.3.3 httpuv_1.5.5
[109] rtracklayer_1.50.0 R6_2.5.0
[111] bookdown_0.21 promises_1.2.0.1
[113] gridExtra_2.3 vipor_0.4.5
[115] codetools_0.2-18 assertthat_0.2.1
[117] rhdf5_2.34.0 openssl_1.4.3
[119] withr_2.4.1 GenomicAlignments_1.26.0
[121] Rsamtools_2.6.0 GenomeInfoDbData_1.2.4
[123] hms_1.0.0 grid_4.0.4
[125] beachmat_2.6.4 rmarkdown_2.7
[127] DelayedMatrixStats_1.12.3 Rtsne_0.15
[129] shiny_1.6.0 ggbeeswarm_0.6.0
Bibliography
Bakken, T. E., R. D. Hodge, J. A. Miller, Z. Yao, T. N. Nguyen, B. Aevermann, E. Barkan, et al. 2018. “Single-nucleus and single-cell transcriptomes compared in matched cortical cell types.” PLoS ONE 13 (12): e0209648.
Wu, H., Y. Kirita, E. L. Donnelly, and B. D. Humphreys. 2019. “Advantages of Single-Nucleus over Single-Cell RNA Sequencing of Adult Kidney: Rare Cell Types and Novel Cell States Revealed in Fibrosis.” J. Am. Soc. Nephrol. 30 (1): 23–32.
Zheng, G. X., J. M. Terry, P. Belgrader, P. Ryvkin, Z. W. Bent, R. Wilson, S. B. Ziraldo, et al. 2017. “Massively parallel digital transcriptional profiling of single cells.” Nat Commun 8 (January): 14049.
19.3 Comments on downstream analyses
The rest of the analysis can then be performed using the same strategies discussed for scRNA-seq (Figure 19.2). Despite the loss of cytoplasmic transcripts, there is usually still enough biological signal to characterize population heterogeneity (Bakken et al. 2018; Wu et al. 2019). In fact, one could even say that snRNA-seq has a higher signal-to-noise ratio, as sequencing coverage is not spent on highly abundant but typically uninteresting transcripts for mitochondrial and ribosomal protein genes. It also has the considerable advantage of being able to recover subpopulations that are not amenable to dissociation and would be lost by scRNA-seq protocols.
Figure 19.2: \(t\)-SNE plots of the Wu kidney dataset. Each point is a cell and is colored by its cluster assignment (left) or its disease status (right).
We can also apply more complex procedures such as batch correction (Section 13). Here, we eliminate the disease effect to identify shared clusters (Figure 19.3).
Figure 19.3: More \(t\)-SNE plots of the Wu kidney dataset after applying MNN correction across diseases.
Similarly, we can perform marker detection on the snRNA-seq expression values as discussed in Section 11. For the most part, interpretation of these DE results makes the simplifying assumption that nuclear abundances are a good proxy for the overall expression profile. This is generally reasonable but may not always be true, resulting in some discrepancies in the marker sets between snRNA-seq and scRNA-seq datasets. For example, transcripts for strongly expressed genes might localize to the cytoplasm for efficient translation and subsequently be lost upon stripping, while genes with the same overall expression but differences in the rate of nuclear export may appear to be differentially expressed between clusters. In the most pathological case, higher snRNA-seq abundances may indicate nuclear sequestration of transcripts for protein-coding genes and reduced activity of the relevant biological process, contrary to the usual interpretation of the effect of upregulation.
Other analyses described for scRNA-seq require more care when they are applied to snRNA-seq data. Most obviously, cell type annotation based on reference profiles (Section 12) should be treated with some caution as the majority of existing references are constructed from bulk or single-cell datasets with cytoplasmic transcripts. Interpretation of RNA velocity results may also be complicated by variation in the rate of nuclear export of spliced transcripts.