Chapter 11 Paul mouse HSC (MARS-seq)
11.1 Introduction
This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with MARS-seq (Paul et al. 2015). Cells were extracted from multiple mice under different experimental conditions (i.e., sorting protocols) and libraries were prepared using a series of 384-well plates.
11.2 Data loading
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.paul),
keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
rowData(sce.paul) <- anno[match(rownames(sce.paul), anno$GENEID),]
After loading and annotation, we inspect the resulting SingleCellExperiment
object:
## class: SingleCellExperiment
## dim: 17483 10368
## metadata(0):
## assays(1): counts
## rownames(17483): ENSMUSG00000007777 ENSMUSG00000107002 ...
## ENSMUSG00000039068 ENSMUSG00000064363
## rowData names(3): GENEID SYMBOL SEQNAME
## colnames(10368): W29953 W29954 ... W76335 W76336
## colData names(12): Seq_batch_ID Amp_batch_ID ... CD34_measurement
## FcgR3_measurement
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
11.3 Quality control
For some reason, only one mitochondrial transcripts are available, so we will perform quality control using only the library size and number of detected features. Ideally, we would simply block on the plate of origin to account for differences in processing, but unfortunately, it seems that many plates have a large proportion (if not outright majority) of cells with poor values for both metrics. We identify such plates based on the presence of very low outlier thresholds, for some arbitrary definition of “low”; we then redefine thresholds using information from the other (presumably high-quality) plates.
library(scater)
stats <- perCellQCMetrics(sce.paul)
qc <- quickPerCellQC(stats, batch=sce.paul$Plate_ID)
# Detecting batches with unusually low threshold values.
lib.thresholds <- attr(qc$low_lib_size, "thresholds")["lower",]
nfeat.thresholds <- attr(qc$low_n_features, "thresholds")["lower",]
ignore <- union(names(lib.thresholds)[lib.thresholds < 100],
names(nfeat.thresholds)[nfeat.thresholds < 100])
# Repeating the QC using only the "high-quality" batches.
qc2 <- quickPerCellQC(stats, batch=sce.paul$Plate_ID,
subset=!sce.paul$Plate_ID %in% ignore)
sce.paul <- sce.paul[,!qc2$discard]
We examine the number of cells discarded for each reason.
## low_lib_size low_n_features discard
## 1695 1781 1783
We create some diagnostic plots for each metric (Figure 11.1).
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc2$discard
unfiltered$Plate_ID <- factor(unfiltered$Plate_ID)
gridExtra::grid.arrange(
plotColData(unfiltered, y="sum", x="Plate_ID", colour_by="discard") +
scale_y_log10() + ggtitle("Total count"),
plotColData(unfiltered, y="detected", x="Plate_ID", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features"),
ncol=1
)
11.4 Normalization
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.paul)
sce.paul <- computeSumFactors(sce.paul, clusters=clusters)
sce.paul <- logNormCounts(sce.paul)
We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check (Figure 11.2).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.057 0.422 0.775 1.000 1.335 9.654
plot(librarySizeFactors(sce.paul), sizeFactors(sce.paul), pch=16,
xlab="Library size factors", ylab="Deconvolution factors", log="xy")
11.5 Variance modelling
We fit a mean-variance trend to the endogenous genes to detect highly variable genes.
Unfortunately, the plates are confounded with an experimental treatment (Batch_desc
) so we cannot block on the plate of origin.
set.seed(00010101)
dec.paul <- modelGeneVarByPoisson(sce.paul)
top.paul <- getTopHVGs(dec.paul, prop=0.1)
plot(dec.paul$mean, dec.paul$total, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curve(metadata(dec.paul)$trend(x), col="blue", add=TRUE)
11.6 Dimensionality reduction
set.seed(101010011)
sce.paul <- denoisePCA(sce.paul, technical=dec.paul, subset.row=top.paul)
sce.paul <- runTSNE(sce.paul, dimred="PCA")
We check that the number of retained PCs is sensible.
## [1] 13
11.7 Clustering
snn.gr <- buildSNNGraph(sce.paul, use.dimred="PCA", type="jaccard")
colLabels(sce.paul) <- factor(igraph::cluster_louvain(snn.gr)$membership)
These is a strong relationship between the cluster and the experimental treatment (Figure 11.4), which is to be expected. Of course, this may also be attributable to some batch effect; the confounded nature of the experimental design makes it difficult to make any confident statements either way.
tab <- table(colLabels(sce.paul), sce.paul$Batch_desc)
rownames(tab) <- paste("Cluster", rownames(tab))
pheatmap::pheatmap(log10(tab+10), color=viridis::viridis(100))
Session Info
R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS
Matrix products: default
BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] scran_1.35.0 scater_1.35.0
[3] ggplot2_3.5.1 scuttle_1.17.0
[5] AnnotationHub_3.15.0 BiocFileCache_2.15.0
[7] dbplyr_2.5.0 ensembldb_2.31.0
[9] AnnotationFilter_1.31.0 GenomicFeatures_1.59.1
[11] AnnotationDbi_1.69.0 scRNAseq_2.21.0
[13] SingleCellExperiment_1.29.1 SummarizedExperiment_1.37.0
[15] Biobase_2.67.0 GenomicRanges_1.59.0
[17] GenomeInfoDb_1.43.0 IRanges_2.41.0
[19] S4Vectors_0.45.1 BiocGenerics_0.53.1
[21] generics_0.1.3 MatrixGenerics_1.19.0
[23] matrixStats_1.4.1 BiocStyle_2.35.0
[25] rebook_1.17.0
loaded via a namespace (and not attached):
[1] RColorBrewer_1.1-3 jsonlite_1.8.9 CodeDepends_0.6.6
[4] magrittr_2.0.3 ggbeeswarm_0.7.2 gypsum_1.3.0
[7] farver_2.1.2 rmarkdown_2.29 BiocIO_1.17.0
[10] zlibbioc_1.53.0 vctrs_0.6.5 memoise_2.0.1
[13] Rsamtools_2.23.0 RCurl_1.98-1.16 htmltools_0.5.8.1
[16] S4Arrays_1.7.1 curl_6.0.0 BiocNeighbors_2.1.0
[19] Rhdf5lib_1.29.0 SparseArray_1.7.1 rhdf5_2.51.0
[22] sass_0.4.9 alabaster.base_1.7.1 bslib_0.8.0
[25] alabaster.sce_1.7.0 httr2_1.0.6 cachem_1.1.0
[28] GenomicAlignments_1.43.0 igraph_2.1.1 mime_0.12
[31] lifecycle_1.0.4 pkgconfig_2.0.3 rsvd_1.0.5
[34] Matrix_1.7-1 R6_2.5.1 fastmap_1.2.0
[37] GenomeInfoDbData_1.2.13 digest_0.6.37 colorspace_2.1-1
[40] dqrng_0.4.1 irlba_2.3.5.1 ExperimentHub_2.15.0
[43] RSQLite_2.3.7 beachmat_2.23.0 labeling_0.4.3
[46] filelock_1.0.3 fansi_1.0.6 httr_1.4.7
[49] abind_1.4-8 compiler_4.5.0 bit64_4.5.2
[52] withr_3.0.2 BiocParallel_1.41.0 viridis_0.6.5
[55] DBI_1.2.3 HDF5Array_1.35.1 alabaster.ranges_1.7.0
[58] alabaster.schemas_1.7.0 rappdirs_0.3.3 DelayedArray_0.33.1
[61] bluster_1.17.0 rjson_0.2.23 tools_4.5.0
[64] vipor_0.4.7 beeswarm_0.4.0 glue_1.8.0
[67] restfulr_0.0.15 rhdf5filters_1.19.0 grid_4.5.0
[70] Rtsne_0.17 cluster_2.1.6 gtable_0.3.6
[73] metapod_1.15.0 BiocSingular_1.23.0 ScaledMatrix_1.15.0
[76] utf8_1.2.4 XVector_0.47.0 ggrepel_0.9.6
[79] BiocVersion_3.21.1 pillar_1.9.0 limma_3.63.2
[82] dplyr_1.1.4 lattice_0.22-6 rtracklayer_1.67.0
[85] bit_4.5.0 tidyselect_1.2.1 locfit_1.5-9.10
[88] Biostrings_2.75.1 knitr_1.49 gridExtra_2.3
[91] bookdown_0.41 ProtGenerics_1.39.0 edgeR_4.5.0
[94] xfun_0.49 statmod_1.5.0 pheatmap_1.0.12
[97] UCSC.utils_1.3.0 lazyeval_0.2.2 yaml_2.3.10
[100] evaluate_1.0.1 codetools_0.2-20 tibble_3.2.1
[103] alabaster.matrix_1.7.0 BiocManager_1.30.25 graph_1.85.0
[106] cli_3.6.3 munsell_0.5.1 jquerylib_0.1.4
[109] Rcpp_1.0.13-1 dir.expiry_1.15.0 png_0.1-8
[112] XML_3.99-0.17 parallel_4.5.0 blob_1.2.4
[115] bitops_1.0-9 viridisLite_0.4.2 alabaster.se_1.7.0
[118] scales_1.3.0 purrr_1.0.2 crayon_1.5.3
[121] rlang_1.1.4 cowplot_1.1.3 KEGGREST_1.47.0