# Nestorowa mouse HSC (Smart-seq2) ## Introduction This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with Smart-seq2 [@nestorowa2016singlecell]. ## Data loading ``` r library(scRNAseq) sce.nest <- NestorowaHSCData() ``` ``` r library(AnnotationHub) ens.mm.v97 <- AnnotationHub()[["AH73905"]] anno <- select(ens.mm.v97, keys=rownames(sce.nest), keytype="GENEID", columns=c("SYMBOL", "SEQNAME")) rowData(sce.nest) <- anno[match(rownames(sce.nest), anno$GENEID),] ``` After loading and annotation, we inspect the resulting `SingleCellExperiment` object: ``` r sce.nest ``` ``` ## class: SingleCellExperiment ## dim: 46078 1920 ## metadata(0): ## assays(1): counts ## rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... ## ENSMUSG00000107391 ENSMUSG00000107392 ## rowData names(3): GENEID SYMBOL SEQNAME ## colnames(1920): HSPC_007 HSPC_013 ... Prog_852 Prog_810 ## colData names(9): gate broad ... projected metrics ## reducedDimNames(1): diffusion ## mainExpName: endogenous ## altExpNames(2): ERCC FACS ``` ## Quality control ``` r unfiltered <- sce.nest ``` For some reason, no mitochondrial transcripts are available, so we will perform quality control using the spike-in proportions only. ``` r library(scater) stats <- perCellQCMetrics(sce.nest) qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent") sce.nest <- sce.nest[,!qc$discard] ``` We examine the number of cells discarded for each reason. ``` r colSums(as.matrix(qc)) ``` ``` ## low_lib_size low_n_features high_altexps_ERCC_percent ## 146 28 241 ## discard ## 264 ``` We create some diagnostic plots for each metric (Figure \@ref(fig:unref-nest-qc-dist)). ``` r colData(unfiltered) <- cbind(colData(unfiltered), stats) unfiltered$discard <- qc$discard gridExtra::grid.arrange( plotColData(unfiltered, y="sum", colour_by="discard") + scale_y_log10() + ggtitle("Total count"), plotColData(unfiltered, y="detected", colour_by="discard") + scale_y_log10() + ggtitle("Detected features"), plotColData(unfiltered, y="altexps_ERCC_percent", colour_by="discard") + ggtitle("ERCC percent"), ncol=2 ) ```
Distribution of each QC metric across cells in the Nestorowa HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded.

(\#fig:unref-nest-qc-dist)Distribution of each QC metric across cells in the Nestorowa HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded.

## Normalization ``` r library(scran) set.seed(101000110) clusters <- quickCluster(sce.nest) sce.nest <- computeSumFactors(sce.nest, clusters=clusters) sce.nest <- logNormCounts(sce.nest) ``` We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure \@ref(fig:unref-nest-norm)). ``` r summary(sizeFactors(sce.nest)) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.039 0.420 0.743 1.000 1.249 16.789 ``` ``` r plot(librarySizeFactors(sce.nest), sizeFactors(sce.nest), pch=16, xlab="Library size factors", ylab="Deconvolution factors", log="xy") ```
Relationship between the library size factors and the deconvolution size factors in the Nestorowa HSC dataset.

(\#fig:unref-nest-norm)Relationship between the library size factors and the deconvolution size factors in the Nestorowa HSC dataset.

## Variance modelling We use the spike-in transcripts to model the technical noise as a function of the mean (Figure \@ref(fig:unref-nest-var)). ``` r set.seed(00010101) dec.nest <- modelGeneVarWithSpikes(sce.nest, "ERCC") top.nest <- getTopHVGs(dec.nest, prop=0.1) ``` ``` r plot(dec.nest$mean, dec.nest$total, pch=16, cex=0.5, xlab="Mean of log-expression", ylab="Variance of log-expression") curfit <- metadata(dec.nest) curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2) points(curfit$mean, curfit$var, col="red") ```
Per-gene variance as a function of the mean for the log-expression values in the Nestorowa HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-ins (red).

(\#fig:unref-nest-var)Per-gene variance as a function of the mean for the log-expression values in the Nestorowa HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-ins (red).

## Dimensionality reduction ``` r set.seed(101010011) sce.nest <- denoisePCA(sce.nest, technical=dec.nest, subset.row=top.nest) sce.nest <- runTSNE(sce.nest, dimred="PCA") ``` We check that the number of retained PCs is sensible. ``` r ncol(reducedDim(sce.nest, "PCA")) ``` ``` ## [1] 9 ``` ## Clustering ``` r snn.gr <- buildSNNGraph(sce.nest, use.dimred="PCA") colLabels(sce.nest) <- factor(igraph::cluster_walktrap(snn.gr)$membership) ``` ``` r table(colLabels(sce.nest)) ``` ``` ## ## 1 2 3 4 5 6 7 8 9 10 ## 198 319 208 147 221 182 21 209 74 77 ``` ``` r plotTSNE(sce.nest, colour_by="label") ```
Obligatory $t$-SNE plot of the Nestorowa HSC dataset, where each point represents a cell and is colored according to the assigned cluster.

(\#fig:unref-nest-tsne)Obligatory $t$-SNE plot of the Nestorowa HSC dataset, where each point represents a cell and is colored according to the assigned cluster.

## Marker gene detection ``` r markers <- findMarkers(sce.nest, colLabels(sce.nest), test.type="wilcox", direction="up", lfc=0.5, row.data=rowData(sce.nest)[,"SYMBOL",drop=FALSE]) ``` To illustrate the manual annotation process, we examine the marker genes for one of the clusters. Upregulation of _Car2_, _Hebp1_ amd hemoglobins indicates that cluster 10 contains erythroid precursors. ``` r chosen <- markers[['10']] best <- chosen[chosen$Top <= 10,] aucs <- getMarkerEffects(best, prefix="AUC") rownames(aucs) <- best$SYMBOL library(pheatmap) pheatmap(aucs, color=viridis::plasma(100)) ```
Heatmap of the AUCs for the top marker genes in cluster 10 compared to all other clusters.

(\#fig:unref-heat-nest-markers)Heatmap of the AUCs for the top marker genes in cluster 10 compared to all other clusters.

## Cell type annotation ``` r library(SingleR) mm.ref <- MouseRNAseqData() # Renaming to symbols to match with reference row names. renamed <- sce.nest rownames(renamed) <- uniquifyFeatureNames(rownames(renamed), rowData(sce.nest)$SYMBOL) labels <- SingleR(renamed, mm.ref, labels=mm.ref$label.fine) ``` Most clusters are not assigned to any single lineage (Figure \@ref(fig:unref-assignments-nest)), which is perhaps unsurprising given that HSCs are quite different from their terminal fates. Cluster 10 is considered to contain erythrocytes, which is roughly consistent with our conclusions from the marker gene analysis above. ``` r tab <- table(labels$labels, colLabels(sce.nest)) pheatmap(log10(tab+10), color=viridis::viridis(100)) ```
Heatmap of the distribution of cells for each cluster in the Nestorowa HSC dataset, based on their assignment to each label in the mouse RNA-seq references from the _SingleR_ package.

(\#fig:unref-assignments-nest)Heatmap of the distribution of cells for each cluster in the Nestorowa HSC dataset, based on their assignment to each label in the mouse RNA-seq references from the _SingleR_ package.

## Miscellaneous analyses This dataset also contains information about the protein abundances in each cell from FACS. There is barely any heterogeneity in the chosen markers across the clusters (Figure \@ref(fig:unref-nest-facs)); this is perhaps unsurprising given that all cells should be HSCs of some sort. ``` r Y <- assay(altExp(sce.nest, "FACS")) keep <- colSums(is.na(Y))==0 # Removing NA intensities. se.averaged <- sumCountsAcrossCells(Y[,keep], colLabels(sce.nest)[keep], average=TRUE) averaged <- assay(se.averaged) log.intensities <- log2(averaged+1) centered <- log.intensities - rowMeans(log.intensities) pheatmap(centered, breaks=seq(-1, 1, length.out=101)) ```
Heatmap of the centered log-average intensity for each target protein quantified by FACS in the Nestorowa HSC dataset.

(\#fig:unref-nest-facs)Heatmap of the centered log-average intensity for each target protein quantified by FACS in the Nestorowa HSC dataset.

## Session Info {-}
``` R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] SingleR_2.8.0 pheatmap_1.0.12 [3] scran_1.34.0 scater_1.34.0 [5] ggplot2_3.5.1 scuttle_1.16.0 [7] AnnotationHub_3.14.0 BiocFileCache_2.14.0 [9] dbplyr_2.5.0 ensembldb_2.30.0 [11] AnnotationFilter_1.30.0 GenomicFeatures_1.58.0 [13] AnnotationDbi_1.68.0 scRNAseq_2.19.1 [15] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0 [17] Biobase_2.66.0 GenomicRanges_1.58.0 [19] GenomeInfoDb_1.42.0 IRanges_2.40.0 [21] S4Vectors_0.44.0 BiocGenerics_0.52.0 [23] MatrixGenerics_1.18.0 matrixStats_1.4.1 [25] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.9 [3] CodeDepends_0.6.6 magrittr_2.0.3 [5] ggbeeswarm_0.7.2 gypsum_1.2.0 [7] farver_2.1.2 rmarkdown_2.28 [9] BiocIO_1.16.0 zlibbioc_1.52.0 [11] vctrs_0.6.5 DelayedMatrixStats_1.28.0 [13] memoise_2.0.1 Rsamtools_2.22.0 [15] RCurl_1.98-1.16 htmltools_0.5.8.1 [17] S4Arrays_1.6.0 curl_5.2.3 [19] BiocNeighbors_2.0.0 Rhdf5lib_1.28.0 [21] SparseArray_1.6.0 rhdf5_2.50.0 [23] sass_0.4.9 alabaster.base_1.6.0 [25] bslib_0.8.0 alabaster.sce_1.6.0 [27] httr2_1.0.5 cachem_1.1.0 [29] GenomicAlignments_1.42.0 igraph_2.1.1 [31] mime_0.12 lifecycle_1.0.4 [33] pkgconfig_2.0.3 rsvd_1.0.5 [35] Matrix_1.7-1 R6_2.5.1 [37] fastmap_1.2.0 GenomeInfoDbData_1.2.13 [39] digest_0.6.37 colorspace_2.1-1 [41] dqrng_0.4.1 irlba_2.3.5.1 [43] ExperimentHub_2.14.0 RSQLite_2.3.7 [45] beachmat_2.22.0 labeling_0.4.3 [47] filelock_1.0.3 fansi_1.0.6 [49] httr_1.4.7 abind_1.4-8 [51] compiler_4.4.1 bit64_4.5.2 [53] withr_3.0.2 BiocParallel_1.40.0 [55] viridis_0.6.5 DBI_1.2.3 [57] highr_0.11 HDF5Array_1.34.0 [59] alabaster.ranges_1.6.0 alabaster.schemas_1.6.0 [61] rappdirs_0.3.3 DelayedArray_0.32.0 [63] bluster_1.16.0 rjson_0.2.23 [65] tools_4.4.1 vipor_0.4.7 [67] beeswarm_0.4.0 glue_1.8.0 [69] restfulr_0.0.15 rhdf5filters_1.18.0 [71] grid_4.4.1 Rtsne_0.17 [73] cluster_2.1.6 generics_0.1.3 [75] gtable_0.3.6 metapod_1.14.0 [77] BiocSingular_1.22.0 ScaledMatrix_1.14.0 [79] utf8_1.2.4 XVector_0.46.0 [81] ggrepel_0.9.6 BiocVersion_3.20.0 [83] pillar_1.9.0 limma_3.62.0 [85] dplyr_1.1.4 lattice_0.22-6 [87] rtracklayer_1.66.0 bit_4.5.0 [89] tidyselect_1.2.1 locfit_1.5-9.10 [91] Biostrings_2.74.0 knitr_1.48 [93] gridExtra_2.3 bookdown_0.41 [95] ProtGenerics_1.38.0 edgeR_4.4.0 [97] xfun_0.48 statmod_1.5.0 [99] UCSC.utils_1.2.0 lazyeval_0.2.2 [101] yaml_2.3.10 evaluate_1.0.1 [103] codetools_0.2-20 tibble_3.2.1 [105] alabaster.matrix_1.6.0 BiocManager_1.30.25 [107] graph_1.84.0 cli_3.6.3 [109] munsell_0.5.1 jquerylib_0.1.4 [111] Rcpp_1.0.13 dir.expiry_1.14.0 [113] png_0.1-8 XML_3.99-0.17 [115] parallel_4.4.1 blob_1.2.4 [117] sparseMatrixStats_1.18.0 bitops_1.0-9 [119] viridisLite_0.4.2 alabaster.se_1.6.0 [121] scales_1.3.0 purrr_1.0.2 [123] crayon_1.5.3 rlang_1.1.4 [125] celldex_1.15.0 cowplot_1.1.3 [127] KEGGREST_1.46.0 ```