---
bibliography: ref.bib
---

# Cross-annotating mouse brains

## Loading the data

We load the classic @zeisel2015brain dataset as our reference. Here, we'll rely on the fact that the authors have already performed quality control.

``` r
library(scRNAseq)
sceZ <- ZeiselBrainData()
```

We compute log-expression values for use in marker detection inside `SingleR()`.

``` r
library(scater)
sceZ <- logNormCounts(sceZ)
```

We examine the distribution of labels in this reference.

``` r
table(sceZ$level2class)
```

```
## 
##     (none)    Astro1    Astro2   CA1Pyr1   CA1Pyr2 CA1PyrInt   CA2Pyr2   Choroid
##        189        68        61       380       447        49        41        10
##    ClauPyr     Epend      Int1     Int10     Int11     Int12     Int13     Int14
##          5        20        12        21        10        21        15        22
##      Int15     Int16      Int2      Int3      Int4      Int5      Int6      Int7
##         18        20        24        10        15        20        22        23
##       Int8      Int9      Mgl1      Mgl2    Oligo1    Oligo2    Oligo3    Oligo4
##         26        11        17        16        45        98        87       106
##     Oligo5    Oligo6     Peric      Pvm1      Pvm2   S1PyrDL  S1PyrL23   S1PyrL4
##        125       359        21        32        33        81        74        26
##    S1PyrL5  S1PyrL5a   S1PyrL6  S1PyrL6b    SubPyr     Vend1     Vend2      Vsmc
##         16        28        39        21        22        32       105        62
```

We load the @tasic2016adult dataset as our test. While not strictly necessary, we remove putative low-quality cells to simplify later interpretation.

``` r
sceT <- TasicBrainData()
sceT <- addPerCellQC(sceT, subsets=list(mito=grep("^mt_", rownames(sceT))))
qc <- quickPerCellQC(colData(sceT),
    percent_subsets=c("subsets_mito_percent", "altexps_ERCC_percent"))
sceT <- sceT[,which(!qc$discard)]
```

The Tasic dataset was generated using read-based technologies so we need to adjust for the transcript length.
``` r
library(AnnotationHub)
mm.db <- AnnotationHub()[["AH73905"]]

# Total exonic length per gene, after merging overlapping exons.
mm.exons <- exonsBy(mm.db, by="gene")
mm.exons <- reduce(mm.exons)
mm.len <- sum(width(mm.exons))

# Rename the lengths by gene symbol to match the row names of the test dataset.
mm.symb <- mapIds(mm.db, keys=names(mm.len), keytype="GENEID", column="SYMBOL")
names(mm.len) <- mm.symb

# Keep only the genes with a known length and compute TPMs.
library(scater)
keep <- intersect(names(mm.len), rownames(sceT))
sceT <- sceT[keep,]
assay(sceT, "TPM") <- calculateTPM(sceT, lengths=mm.len[keep])
```

## Applying the annotation

We apply `SingleR()` with Wilcoxon rank sum test-based marker detection to annotate the Tasic dataset with the Zeisel labels.

``` r
library(SingleR)
pred.tasic <- SingleR(test=sceT, ref=sceZ, labels=sceZ$level2class,
    assay.type.test="TPM", de.method="wilcox")
```

We examine the distribution of predicted labels:

``` r
table(pred.tasic$labels)
```

```
## 
##     Astro1    Astro2   CA1Pyr2   CA2Pyr2      Int1     Int10     Int11     Int12
##          1         5         1         3       154       102         2        10
##      Int13     Int14     Int15     Int16      Int2      Int3      Int4      Int6
##         18        24        15         9       140        99        27        14
##       Int7      Int8      Int9    Oligo1    Oligo2    Oligo3    Oligo4    Oligo6
##          2        38        26         8         1         7         1         1
##      Peric   S1PyrDL  S1PyrL23   S1PyrL4   S1PyrL5  S1PyrL5a   S1PyrL6  S1PyrL6b
##          1       331        74        14         1       202        46        62
##     SubPyr
##          8
```

We can also examine the number of discarded cells for each label:

``` r
table(Label=pred.tasic$labels, Lost=is.na(pred.tasic$pruned.labels))
```

```
##           Lost
## Label      FALSE TRUE
##   Astro1       1    0
##   Astro2       5    0
##   CA1Pyr2      1    0
##   CA2Pyr2      3    0
##   Int1       153    1
##   Int10      102    0
##   Int11        2    0
##   Int12       10    0
##   Int13       18    0
##   Int14       23    1
##   Int15       15    0
##   Int16        9    0
##   Int2       138    2
##   Int3        99    0
##   Int4        27    0
##   Int6        14    0
##   Int7         2    0
##   Int8        38    0
##   Int9        26    0
##   Oligo1       8    0
##   Oligo2       1    0
##   Oligo3       7    0
##   Oligo4       1    0
##   Oligo6       1    0
##   Peric        1    0
##   S1PyrDL    318   13
##   S1PyrL23    74    0
##   S1PyrL4     14    0
##   S1PyrL5      1    0
##   S1PyrL5a   201    1
##   S1PyrL6     45    1
##   S1PyrL6b    62    0
##   SubPyr       8    0
```

## Diagnostics

We visualize the assignment scores for each label in Figure \@ref(fig:unref-brain-score-heatmap).

``` r
plotScoreHeatmap(pred.tasic)
```

(\#fig:unref-brain-score-heatmap)Heatmap of the (normalized) assignment scores for each cell (column) in the Tasic test dataset with respect to each label (row) in the Zeisel reference dataset. The final assignment for each cell is shown in the annotation bar at the top.

The delta for each cell is visualized in Figure \@ref(fig:unref-brain-delta-dist).

``` r
plotDeltaDistribution(pred.tasic)
```

(\#fig:unref-brain-delta-dist)Distributions of the deltas for each cell in the Tasic dataset assigned to each label in the Zeisel dataset. Each cell is represented by a point; low-quality assignments that were pruned out are colored in orange.
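
To make the quantities in this plot concrete, here is a minimal base R sketch of how a delta and the pruning rule could be computed by hand. It assumes that the delta is the assigned label's score minus the median score across all labels for that cell, and that a cell is pruned when its delta falls more than three MADs below the median delta of its label. We believe this matches the default behavior of `SingleR`, but the package documentation is the authority. The `scores` matrix, the toy deltas and the `prune_by_label()` helper are hypothetical and exist only for illustration.

```r
# Hypothetical score matrix: one row per cell, one column per label.
scores <- rbind(
    cell1=c(A=0.90, B=0.50, C=0.40),
    cell2=c(A=0.30, B=0.80, C=0.60),
    cell3=c(A=0.55, B=0.50, C=0.60)
)

# Assign each cell to its highest-scoring label.
best <- max.col(scores, ties.method="first")
assigned <- colnames(scores)[best]

# Delta (assumed definition): best score minus the median score per cell.
delta <- scores[cbind(seq_len(nrow(scores)), best)] - apply(scores, 1, median)

# Hypothetical helper: flag cells whose delta is a low outlier within
# their assigned label (more than 'nmads' MADs below the label's median).
prune_by_label <- function(delta, labels, nmads=3) {
    lost <- logical(length(delta))
    for (lab in unique(labels)) {
        idx <- labels == lab
        d <- delta[idx]
        lost[idx] <- d < median(d) - nmads * mad(d)
    }
    lost
}

# Toy deltas for six cells sharing one label; the last one is unusually low.
toy.delta <- c(0.50, 0.52, 0.48, 0.51, 0.49, 0.05)
toy.label <- rep("A", 6)
lost <- prune_by_label(toy.delta, toy.label)
```

In this toy example only the last cell is flagged, which corresponds to the orange points in the figure above.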

Finally, we visualize the heatmaps of the marker genes for the most frequent label in Figure \@ref(fig:unref-brain-marker-heat). We could show these for all labels but I wouldn't want to bore you with a parade of large heatmaps.

``` r
library(scater)
collected <- list()
all.markers <- metadata(pred.tasic)$de.genes

sceT <- logNormCounts(sceT)
top.label <- names(sort(table(pred.tasic$labels), decreasing=TRUE))[1]

per.label <- sumCountsAcrossCells(logcounts(sceT),
    ids=pred.tasic$labels, average=TRUE)
per.label <- assay(per.label)[unique(unlist(all.markers[[top.label]])),]
pheatmap::pheatmap(per.label, main=top.label)
```

(\#fig:unref-brain-marker-heat)Heatmap of log-expression values in the Tasic dataset for all marker genes upregulated in the most frequent label from the Zeisel reference dataset.
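
The marker selection used for this heatmap relies on the nested structure of the `de.genes` list, where each label holds one vector of upregulated genes per comparison against another label. As a rough sketch of that flattening step, the toy list below mimics the structure; the label names `A`, `B`, `C` and the gene names are made up for illustration and do not come from the real results.

```r
# Hypothetical marker list: toy.markers[[x]][[y]] holds the genes that are
# upregulated in label x relative to label y.
toy.markers <- list(
    A=list(A=character(0), B=c("gene1", "gene2"), C=c("gene2", "gene3")),
    B=list(A=c("gene4"), B=character(0), C=c("gene4", "gene5")),
    C=list(A=c("gene6"), B=c("gene6"), C=character(0))
)

# Hypothetical predicted labels for seven cells.
toy.labels <- c("A", "B", "A", "C", "A", "B", "A")

# Most frequent label, as in the real analysis.
toy.top <- names(sort(table(toy.labels), decreasing=TRUE))[1]

# Union of its markers across all pairwise comparisons.
toy.genes <- unique(unlist(toy.markers[[toy.top]]))
```

Genes that are markers in several comparisons (here, `gene2`) appear only once in the final set, so each row of the heatmap corresponds to a distinct gene.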

## Comparison to clusters

For comparison, we will perform a quick unsupervised analysis of the Tasic dataset. We model the variances using the spike-in data and we perform graph-based clustering.

``` r
library(scran)
decT <- modelGeneVarWithSpikes(sceT, "ERCC")

set.seed(1000100)
sceT <- denoisePCA(sceT, decT, subset.row=getTopHVGs(decT, n=2500))

library(bluster)
sceT$cluster <- clusterRows(reducedDim(sceT, "PCA"), NNGraphParam())
```

We do not observe a clean 1:1 mapping between clusters and labels in Figure \@ref(fig:unref-brain-label-clusters), probably because many of the labels represent closely related cell types that are difficult to distinguish.

``` r
tab <- table(cluster=sceT$cluster, label=pred.tasic$labels)
pheatmap::pheatmap(log10(tab+10))
```

(\#fig:unref-brain-label-clusters)Heatmap of the log-transformed number of cells in each combination of label (column) and cluster (row) in the Tasic dataset.
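
One simple way to put a number on how far a cluster-by-label table is from a 1:1 mapping is to compute, for each cluster, the fraction of its cells that carry the most common label. The sketch below does this in base R on made-up cluster and label vectors; `toy.cluster`, `toy.label` and the "purity" name are our own illustrative choices rather than anything produced by the analysis above.

```r
# Hypothetical cluster assignments and predicted labels for nine cells.
toy.cluster <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)
toy.label <- c("Int1", "Int1", "Int2",
    "S1PyrDL", "S1PyrDL", "S1PyrDL", "S1PyrDL",
    "Int1", "Oligo1")

# Contingency table of clusters (rows) against labels (columns).
toy.tab <- table(cluster=toy.cluster, label=toy.label)

# Per-cluster purity: share of cells carrying the cluster's dominant label.
# A value of 1 means that the cluster contains a single label.
purity <- apply(toy.tab, 1, max) / rowSums(toy.tab)
```

Applied to a real table, low values would point to clusters that mix several labels. Note that this does not capture the opposite situation, where one label is split across several clusters.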

We proceed to the most important part of the analysis. Yes, that's right, the $t$-SNE plot (Figure \@ref(fig:unref-brain-label-tsne)).

``` r
set.seed(101010100)
sceT <- runTSNE(sceT, dimred="PCA")
plotTSNE(sceT, colour_by="cluster", text_colour="red",
    text_by=I(pred.tasic$labels))
```

(\#fig:unref-brain-label-tsne)$t$-SNE plot of the Tasic dataset, where each point is a cell and is colored by the assigned cluster. Reference labels from the Zeisel dataset are also placed on the median coordinate across all cells assigned with that label.

## Session information {-}
``` R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 24.04.1 LTS Matrix products: default BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/New_York tzcode source: system (glibc) attached base packages: [1] stats4 stats graphics grDevices utils datasets methods [8] base other attached packages: [1] bluster_1.16.0 scran_1.34.0 [3] SingleR_2.8.0 ensembldb_2.30.0 [5] AnnotationFilter_1.30.0 GenomicFeatures_1.58.0 [7] AnnotationDbi_1.68.0 AnnotationHub_3.14.0 [9] BiocFileCache_2.14.0 dbplyr_2.5.0 [11] scater_1.34.0 ggplot2_3.5.1 [13] scuttle_1.16.0 scRNAseq_2.19.1 [15] SingleCellExperiment_1.28.0 SummarizedExperiment_1.36.0 [17] Biobase_2.66.0 GenomicRanges_1.58.0 [19] GenomeInfoDb_1.42.0 IRanges_2.40.0 [21] S4Vectors_0.44.0 BiocGenerics_0.52.0 [23] MatrixGenerics_1.18.0 matrixStats_1.4.1 [25] BiocStyle_2.34.0 rebook_1.16.0 loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 jsonlite_1.8.9 [3] CodeDepends_0.6.6 magrittr_2.0.3 [5] ggbeeswarm_0.7.2 gypsum_1.2.0 [7] farver_2.1.2 rmarkdown_2.28 [9] BiocIO_1.16.0 zlibbioc_1.52.0 [11] vctrs_0.6.5 DelayedMatrixStats_1.28.0 [13] memoise_2.0.1 Rsamtools_2.22.0 [15] RCurl_1.98-1.16 htmltools_0.5.8.1 [17] S4Arrays_1.6.0 curl_5.2.3 [19] BiocNeighbors_2.0.0 Rhdf5lib_1.28.0 [21] SparseArray_1.6.0 rhdf5_2.50.0 [23] sass_0.4.9 alabaster.base_1.6.0 [25] bslib_0.8.0 alabaster.sce_1.6.0 [27] httr2_1.0.5 cachem_1.1.0 [29] GenomicAlignments_1.42.0 igraph_2.1.1 [31] mime_0.12 lifecycle_1.0.4 [33] pkgconfig_2.0.3 rsvd_1.0.5 [35] Matrix_1.7-1 R6_2.5.1 [37] fastmap_1.2.0 GenomeInfoDbData_1.2.13 [39] digest_0.6.37 colorspace_2.1-1 [41] dqrng_0.4.1 irlba_2.3.5.1 [43] 
ExperimentHub_2.14.0 RSQLite_2.3.7 [45] beachmat_2.22.0 labeling_0.4.3 [47] filelock_1.0.3 fansi_1.0.6 [49] httr_1.4.7 abind_1.4-8 [51] compiler_4.4.1 bit64_4.5.2 [53] withr_3.0.2 BiocParallel_1.40.0 [55] viridis_0.6.5 DBI_1.2.3 [57] highr_0.11 HDF5Array_1.34.0 [59] alabaster.ranges_1.6.0 alabaster.schemas_1.6.0 [61] rappdirs_0.3.3 DelayedArray_0.32.0 [63] rjson_0.2.23 tools_4.4.1 [65] vipor_0.4.7 beeswarm_0.4.0 [67] glue_1.8.0 restfulr_0.0.15 [69] rhdf5filters_1.18.0 grid_4.4.1 [71] Rtsne_0.17 cluster_2.1.6 [73] generics_0.1.3 gtable_0.3.6 [75] metapod_1.14.0 BiocSingular_1.22.0 [77] ScaledMatrix_1.14.0 utf8_1.2.4 [79] XVector_0.46.0 ggrepel_0.9.6 [81] BiocVersion_3.20.0 pillar_1.9.0 [83] limma_3.62.0 dplyr_1.1.4 [85] lattice_0.22-6 rtracklayer_1.66.0 [87] bit_4.5.0 tidyselect_1.2.1 [89] locfit_1.5-9.10 Biostrings_2.74.0 [91] knitr_1.48 gridExtra_2.3 [93] bookdown_0.41 ProtGenerics_1.38.0 [95] edgeR_4.4.0 xfun_0.48 [97] statmod_1.5.0 pheatmap_1.0.12 [99] UCSC.utils_1.2.0 lazyeval_0.2.2 [101] yaml_2.3.10 evaluate_1.0.1 [103] codetools_0.2-20 tibble_3.2.1 [105] alabaster.matrix_1.6.0 BiocManager_1.30.25 [107] graph_1.84.0 cli_3.6.3 [109] munsell_0.5.1 jquerylib_0.1.4 [111] Rcpp_1.0.13 dir.expiry_1.14.0 [113] png_0.1-8 XML_3.99-0.17 [115] parallel_4.4.1 blob_1.2.4 [117] sparseMatrixStats_1.18.0 bitops_1.0-9 [119] viridisLite_0.4.2 alabaster.se_1.6.0 [121] scales_1.3.0 purrr_1.0.2 [123] crayon_1.5.3 rlang_1.1.4 [125] cowplot_1.1.3 KEGGREST_1.46.0 ```