---
bibliography: ref.bib
---

# (PART) Advanced usage {-}

# Using multiple references

## Overview

In some cases, we may wish to use multiple references for annotation of a test dataset.
This yields a more comprehensive set of cell types that are not covered by any individual reference, especially when differences in the resolution are considered.
However, it is not trivial due to the presence of batch effects across references (from differences in technology, experimental protocol or the biological system) as well as differences in the annotation vocabulary between investigators.

Several strategies are available to combine inferences from multiple references:

- using reference-specific labels in a combined reference
- using harmonized labels in a combined reference
- combining scores across multiple references

This chapter discusses the various strengths and weaknesses of each strategy and provides some practical demonstrations of each.
Here, we will use the HPCA and Blueprint/ENCODE datasets as our references and (yet another) PBMC dataset as the test.

``` r
library(TENxPBMCData)
pbmc <- TENxPBMCData("pbmc8k")

library(celldex)
hpca <- HumanPrimaryCellAtlasData(ensembl=TRUE)
bpe <- BlueprintEncodeData(ensembl=TRUE)
```

## Using reference-specific labels

In this strategy, each label is defined in the context of its reference dataset.
This means that a label - say, "B cell" - in reference dataset X is considered to be different from a "B cell" label in reference dataset Y.
Use of reference-specific labels is most appropriate if there are relevant biological differences between the references; for example, if one reference is concerned with healthy tissue while the other reference considers diseased tissue, it can be helpful to distinguish between the same cell type in different biological contexts.

We can easily implement this approach by combining the expression matrices together and pasting the reference name onto the corresponding character vector of labels.
This modification ensures that the downstream `SingleR()` call will treat each label-reference combination as a distinct entity.

``` r
hpca2 <- hpca
hpca2$label.main <- paste0("HPCA.", hpca2$label.main)

bpe2 <- bpe
bpe2$label.main <- paste0("BPE.", bpe2$label.main)

shared <- intersect(rownames(hpca2), rownames(bpe2))
combined <- cbind(hpca2[shared,], bpe2[shared,])
```

It is then straightforward to perform annotation with the usual methods.

``` r
library(SingleR)
com.res1 <- SingleR(pbmc, ref=combined, labels=combined$label.main, assay.type.test=1)
table(com.res1$labels)
```

```
## 
##      BPE.B-cells BPE.CD4+ T-cells BPE.CD8+ T-cells          BPE.HSC 
##             1178             1708             2656               20 
##    BPE.Monocytes     BPE.NK cells  HPCA.HSC_-G-CSF   HPCA.Platelets 
##             2349              460                1                7 
##     HPCA.T_cells 
##                2 
```

However, this strategy identifies markers by directly comparing expression values across references, meaning that the marker set is likely to contain genes responsible for uninteresting batch effects.
This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results.
The use of reference-specific labels also complicates interpretation of the results as the cell type is always qualified by its reference of origin.

## Comparing scores across references

### Combining inferences from individual references

Another strategy - and the default approach implemented in `SingleR()` - involves performing classification separately within each reference, and then collating the results to choose the label with the highest score across references.
This is a relatively expedient approach that avoids the need for explicit harmonization while also reducing exposure to reference-specific batch effects.

To use this method, we simply pass multiple objects to the `ref=` and `labels=` arguments in `SingleR()`.
The combining strategy is as follows:

1. The function first annotates the test dataset with each reference individually in the same manner as described in Section \@ref(method-description).
   This step is almost equivalent to simply looping over all individual references and running `SingleR()` on each.
2. For each cell, the function collects its predicted labels across all references.
   In doing so, it also identifies the union of markers that are upregulated in the predicted label in each reference.
3. The function identifies the overall best-scoring label as the final prediction for that cell.
   This step involves a recomputation of the scores across the identified marker subset to ensure that these scores are derived from the same set of genes (and are thus comparable across references).

The function will then return a `DataFrame` of combined results for each cell in the test dataset, including the overall label and the reference from which it was assigned.

``` r
com.res2 <- SingleR(test = pbmc, assay.type.test=1,
    ref = list(BPE=bpe, HPCA=hpca),
    labels = list(bpe$label.main, hpca$label.main))

# Check the final label from the combined assignment.
table(com.res2$labels)
```

```
## 
##          B-cells           B_cell     CD4+ T-cells     CD8+ T-cells 
##             1170               14             1450             2936 
##              GMP              HSC         Monocyte        Monocytes 
##                1               22              753             1560 
##         NK cells          NK_cell        Platelets Pre-B_cell_CD34- 
##              372               10                9               16 
##          T_cells 
##               68 
```

``` r
# Check the 'winning' reference for each cell.
table(com.res2$reference)
```

```
## 
##    1    2 
## 7510  871 
```

The main appeal of this approach lies in the fact that it is based on the results of annotation with individual references.
This avoids batch effects from comparing expression values across references; it reduces the need for any coordination in the label scheme between references; and simultaneously provides the per-reference annotations in the results.
The last feature is particularly useful as it allows for more detailed diagnostics, troubleshooting and further analysis.

``` r
head(com.res2$orig.results$BPE$labels)
```

```
## [1] "B-cells"      "Monocytes"    "CD8+ T-cells" "CD8+ T-cells" "Monocytes"   
## [6] "Monocytes"
```

``` r
head(com.res2$orig.results$HPCA$labels)
```

```
## [1] "B_cell"   "Monocyte" "T_cells"  "T_cells"  "Monocyte" "Monocyte"
```

The main downside is that it is somewhat suboptimal if there are many labels that are unique to one reference, as markers are not identified with the aim of distinguishing a label in one reference from another label in another reference.
The continued lack of consistency in the labels across references also complicates interpretation of the results, though we can overcome this by using harmonized labels as described below.

### Combined diagnostics

All of the diagnostic plots in *[SingleR](https://bioconductor.org/packages/3.21/SingleR)* will naturally operate on these combined results.
For example, we can create a heatmap of the scores in all of the individual references as well as for the recomputed scores in the combined results (Figure \@ref(fig:heat-combined)).
Note that scores are only recomputed for the labels predicted in the individual references, so all labels outside of those are simply set to `NA` - hence the swathes of grey.

``` r
plotScoreHeatmap(com.res2)
```

(\#fig:heat-combined)Heatmaps of assignment scores for each cell in the PBMC test dataset after being assigned to the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. One heatmap is shown for the recomputed scores and the scores from each individual reference. The annotation at the top of each heatmap represents the final combined prediction for each cell.
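
The pattern of `NA` values in the recomputed scores can be illustrated with a toy example. The sketch below uses invented labels and scores for three cells; it does not call *SingleR* and is not its internal code, and the names `pred`, `rescore`, `mat` and `final` are hypothetical. It only mimics the idea that each cell has a recomputed score for the one label predicted by each reference, and that the best of those scores decides the final label.

``` r
# Toy illustration only: invented values, not SingleR output.
# Label predicted for each of three cells by two references.
pred <- data.frame(
    BPE  = c("B-cells", "Monocytes", "CD8+ T-cells"),
    HPCA = c("B_cell", "Monocyte", "T_cells"),
    stringsAsFactors = FALSE
)

# Hypothetical recomputed scores for those predicted labels.
rescore <- data.frame(
    BPE  = c(0.71, 0.55, 0.62),
    HPCA = c(0.64, 0.58, 0.49)
)

# Score matrix over all labels; anything that was not predicted stays NA.
all.labels <- unique(unlist(pred))
cells <- seq_len(nrow(pred))
mat <- matrix(NA_real_, nrow(pred), length(all.labels),
    dimnames = list(NULL, all.labels))
for (r in names(pred)) {
    mat[cbind(cells, match(pred[[r]], all.labels))] <- rescore[[r]]
}

# The final label is taken from the reference with the best recomputed score.
best <- max.col(as.matrix(rescore), ties.method = "first")
final <- data.frame(
    labels = as.matrix(pred)[cbind(cells, best)],
    reference = best
)
final
```

Most entries of `mat` remain `NA`, which corresponds to the grey regions of the heatmap for the recomputed scores.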

The deltas for each individual reference can also be plotted with `plotDeltaDistribution()` (Figure \@ref(fig:delta-combined)).
No deltas are shown for the recomputed scores as the assumption described in Section \@ref(based-on-the-deltas-across-cells) may not be applicable across the predicted labels from the individual references.
For example, if all individual references suggest the same cell type with similar recomputed scores, any delta would be low even though the assignment is highly confident.

``` r
plotDeltaDistribution(com.res2)
```

(\#fig:delta-combined)Distribution of the deltas across cells in the PBMC test dataset for each label in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Each point represents a cell that was assigned to that label in the combined results, colored by whether it was pruned or not in the corresponding individual reference.
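
A small worked example may help to show why a delta on the recomputed scores would be misleading. The scores below are invented for a single cell, and the delta is taken here as the gap between the best and second-best score, which is an assumption made for illustration rather than the definition used by any particular function.

``` r
# Invented recomputed scores for one cell (not real SingleR output).
# Both references nominate what is conceptually the same cell type.
recomputed <- c("BPE:B-cells" = 0.71, "HPCA:B_cell" = 0.70)

# Assumption: delta = best score minus the runner-up.
ranked <- sort(recomputed, decreasing = TRUE)
delta <- unname(ranked[1] - ranked[2])
delta
```

The delta is only about 0.01, yet the two references agree on the identity of the cell, so a small value here says nothing about the confidence of the assignment.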

We can similarly extract marker genes to use in heatmaps as described in Section \@ref(based-on-marker-gene-expression).
As annotation was performed against each individual reference, we can simply extract the marker genes from the nested `DataFrame`s as shown in Figure \@ref(fig:pbmc-mono-heat).

``` r
hpca.markers <- metadata(com.res2$orig.results$HPCA)$de.genes
bpe.markers <- metadata(com.res2$orig.results$BPE)$de.genes

# Pool the monocyte markers from both references.
# Each list is flattened separately, as the second argument of unlist()
# is 'recursive' and would not be treated as another list of markers.
mono.markers <- unique(c(
    unlist(hpca.markers$Monocyte),
    unlist(bpe.markers$Monocytes)
))

library(scater)
plotHeatmap(logNormCounts(pbmc),
    order_columns_by=list(I(com.res2$labels)),
    features=mono.markers)
```

(\#fig:pbmc-mono-heat)Heatmap of log-expression values in the PBMC dataset for all marker genes upregulated in monocytes in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Combined labels for each cell are shown at the top.
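
When pooling markers from several references, it can be useful to know which genes are supported by more than one of them. The sketch below uses small mock lists with made-up contents that imitate the nested layout implied by `hpca.markers$Monocyte` (one character vector per comparison); the gene symbols are placeholders and are not taken from the real references.

``` r
# Mock marker lists (placeholder genes; not taken from the real references).
hpca.mock <- list(Monocyte = list(
    B_cell  = c("CD14", "LYZ"),
    T_cells = c("LYZ", "S100A9")
))
bpe.mock <- list(Monocytes = list(
    `B-cells`  = c("CD14", "FCN1"),
    `NK cells` = "LYZ"
))

hpca.mono <- unlist(hpca.mock$Monocyte)
bpe.mono <- unlist(bpe.mock$Monocytes)

# Union of markers, e.g., for plotting.
pooled <- unique(c(hpca.mono, bpe.mono))

# Markers reported by both references.
shared.mono <- unname(intersect(hpca.mono, bpe.mono))

pooled
shared.mono
```

Markers in `shared.mono` are reported independently by both references, which may make them more convincing candidates for visual checks.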

## Using harmonized labels

### Sharing information during marker detection

One of the major problems with using multiple references is the presence of study-specific nomenclature.
For example, the concept of a B cell may be annotated as `B cells` in one reference, `B_cells` in another reference, and then `B` and `B-cell` and so on in other references.
We can overcome this by using harmonized labels where the same cell type is assigned the same label across references, simplifying interpretation and ensuring that irrelevant discrepancies in labelling do not interfere with downstream analysis.

Many of the *[SingleR](https://bioconductor.org/packages/3.21/SingleR)* reference datasets already have their labels mapped to the [Cell Ontology](https://www.ebi.ac.uk/ols/ontologies/cl), which provides a standard vocabulary to refer to the same cell type across diverse datasets.
We will describe the utility of Cell Ontology terms in more detail in Chapter \@ref(exploiting-the-cell-ontology); at this point, the key idea is that the same term is used for the same conceptual cell type in each reference.
To simplify interpretation, we set `cell.ont="nonna"` to remove all samples that could not be mapped to the ontology.

``` r
hpca.ont <- HumanPrimaryCellAtlasData(ensembl=TRUE, cell.ont="nonna")
bpe.ont <- BlueprintEncodeData(ensembl=TRUE, cell.ont="nonna")

# Using the same sets of genes:
shared <- intersect(rownames(hpca.ont), rownames(bpe.ont))
hpca.ont <- hpca.ont[shared,]
bpe.ont <- bpe.ont[shared,]

# Showing the 10 least frequent terms (sort() is ascending by default):
head(sort(table(hpca.ont$label.ont)), 10)
```

```
## 
## CL:0002259 CL:0000017 CL:0000049 CL:0000050 CL:0000084 CL:0000127 CL:0000557 
##          1          2          2          2          2          2          2 
## CL:0000798 CL:0000816 CL:0000836 
##          2          2          2 
```

``` r
head(sort(table(bpe.ont$label.ont)), 10)
```

```
## 
## CL:0000451 CL:0000771 CL:0000787 CL:0000815 CL:0000904 CL:0000905 CL:0000907 
##          1          1          1          1          1          1          1 
## CL:0000913 CL:0000972 CL:0000127 
##          1          1          2 
```

The simplest way to take advantage of the standardization in terminology is to use `label.ont` in place of `label.main` in the previous section's `SingleR()` call.
This yields annotations that follow the same vocabulary regardless of the reference used for assignment.

``` r
com.res3a <- SingleR(test = pbmc, assay.type.test=1,
    ref = list(BPE=bpe.ont, HPCA=hpca.ont),
    labels = list(bpe.ont$label.ont, hpca.ont$label.ont))

table(Label=com.res3a$labels, Reference=com.res3a$reference)
```

```
##             Reference
## Label           1    2
##   CL:0000037    2    0
##   CL:0000050    6    0
##   CL:0000051    6    0
##   CL:0000233    0    3
##   CL:0000236    0   70
##   CL:0000556    7    0
##   CL:0000557    4    1
##   CL:0000576 1520  548
##   CL:0000623  304   10
##   CL:0000624  731   33
##   CL:0000625  591   89
##   CL:0000786    2    0
##   CL:0000787  270    3
##   CL:0000788  728   12
##   CL:0000798    0    2
##   CL:0000815   79    0
##   CL:0000816    0   21
##   CL:0000837    5    0
##   CL:0000895    0  400
##   CL:0000904  129  867
##   CL:0000905  136  231
##   CL:0000907  730    0
##   CL:0000913  478    5
##   CL:0000955    0   13
##   CL:0000972  101    0
##   CL:0001054    0  244
```

A more advanced approach is to share information across references during the marker detection stage.
This is done by favoring genes that exhibit upregulation consistently in multiple references, which increases the likelihood that those markers will generalize to other datasets.
For classic marker detection, we achieve this by calling `getClassicMarkers()` to obtain markers for use in `SingleR()`; the same effect can be achieved for test-based methods in *[scran](https://bioconductor.org/packages/3.21/scran)* functions by setting `block=`.
We then use these improved markers by passing them to `genes=` as described in Section \@ref(defining-custom-markers).
In this case, we specify `com.markers` twice in a list to indicate that we are using them for both of our references.

``` r
com.markers <- getClassicMarkers(
    ref = list(BPE=bpe.ont, HPCA=hpca.ont),
    labels = list(bpe.ont$label.ont, hpca.ont$label.ont))

com.res3b <- SingleR(test = pbmc, assay.type.test=1,
    ref = list(BPE=bpe.ont, HPCA=hpca.ont),
    labels = list(bpe.ont$label.ont, hpca.ont$label.ont),
    genes = list(com.markers, com.markers))

table(Label=com.res3b$labels, Reference=com.res3b$reference)
```

```
##             Reference
## Label           1    2
##   CL:0000037    4    0
##   CL:0000050    5    0
##   CL:0000051    8    0
##   CL:0000233    0    2
##   CL:0000236    0  106
##   CL:0000556    8    0
##   CL:0000557    1    2
##   CL:0000576 1435  669
##   CL:0000623  306   20
##   CL:0000624  605   90
##   CL:0000625  421  171
##   CL:0000786    3    0
##   CL:0000787  225    2
##   CL:0000788  719   30
##   CL:0000798    0    2
##   CL:0000815  111    0
##   CL:0000816    0   36
##   CL:0000837    5    0
##   CL:0000895    0  424
##   CL:0000904   56  974
##   CL:0000905  108  272
##   CL:0000907  728    0
##   CL:0000913  503   24
##   CL:0000955    0   10
##   CL:0000972   91    0
##   CL:0001054    0  205
```

It is worth noting that, in the above code, the DE genes are still identified _within_ each reference and then the statistics are merged _across_ references to identify the top markers.
This ensures that we do not directly compare expression values across references, which reduces the susceptibility of marker detection to batch effects.

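
The "within, then across" idea can be sketched with a toy example. The log-fold changes below are invented, and averaging them across references is an assumption chosen for simplicity; it is not meant to reproduce the statistics that `getClassicMarkers()` actually computes.

``` r
# Invented log-fold changes for one pair of labels, each computed
# *within* a single reference (no expression values are compared
# between references).
lfc <- cbind(
    BPE  = c(gene1 = 2.0, gene2 = 0.1, gene3 = 1.5, gene4 = 3.0),
    HPCA = c(gene1 = 1.8, gene2 = 2.5, gene3 = 1.2, gene4 = -0.5)
)

# Merge the statistics *across* references (assumption: simple average).
merged <- rowMeans(lfc)

# Genes that are consistently upregulated rise to the top.
top.markers <- names(sort(merged, decreasing = TRUE))[1:2]
top.markers

# In contrast, the strongest gene in a single reference is not
# necessarily supported by the other reference.
names(which.max(lfc[, "BPE"]))
```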

The most obvious problem with this approach is that it assumes that harmonized labels are available.
This is usually not true and requires some manual mapping of the author-provided labels to a common vocabulary.
The mapping process also runs the risk of discarding relevant information about the biological status (e.g., activation status, disease condition) if there is no obvious counterpart for that state in the ontology.

### Manual label harmonization

The `matchReferences()` function provides a simple approach for label harmonization between two references.
Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed, i.e., for each pair of labels, what is the probability that a cell with one label is assigned the other _and_ vice versa?
Probabilities close to 1 in Figure \@ref(fig:heat-harmony) indicate there is a 1:1 relation between that pair of labels; on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.

``` r
library(SingleR)
bp.se <- BlueprintEncodeData()
hpca.se <- HumanPrimaryCellAtlasData()

matched <- matchReferences(bp.se, hpca.se,
    bp.se$label.main, hpca.se$label.main)
pheatmap::pheatmap(matched, col=viridis::plasma(100))
```

(\#fig:heat-harmony)Heatmap of mutual assignment probabilities between the Blueprint/ENCODE reference dataset (labels in rows) and the Human Primary Cell Atlas reference dataset (labels in columns).
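
To make the notion of a mutual assignment probability more concrete, the sketch below works through a toy case with invented counts. It assumes that the mutual probability for a pair of labels is the product of the two directional assignment proportions; this is an illustrative assumption and may differ in detail from what `matchReferences()` computes. The objects `fwd`, `rev.counts`, `mutual` and `best` are hypothetical names.

``` r
# Invented counts: samples of reference 1 (rows) annotated with the
# labels of reference 2 (columns).
fwd <- matrix(c(10, 0,
                 1, 9), nrow = 2, byrow = TRUE,
    dimnames = list(c("B-cells", "T-cells"), c("B_cell", "T_cells")))

# Invented counts for the opposite direction: samples of reference 2
# (rows) annotated with the labels of reference 1 (columns).
rev.counts <- matrix(c(8,  2,
                       0, 10), nrow = 2, byrow = TRUE,
    dimnames = list(c("B_cell", "T_cells"), c("B-cells", "T-cells")))

# Proportion of each label's samples assigned to each label of the other reference.
p.fwd <- prop.table(fwd, margin = 1)
p.rev <- prop.table(rev.counts, margin = 1)

# Assumption: mutual probability = product of both directions.
mutual <- p.fwd * t(p.rev)
mutual

# Best counterpart in reference 2 for each label in reference 1.
best <- colnames(mutual)[apply(mutual, 1, which.max)]
best
```

A matrix like `mutual` has the same orientation as the heatmap above, with one reference in the rows and the other in the columns, so the best-matching column for each row gives a first draft of a label mapping that can then be curated by hand.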

This function can be used to guide harmonization to enforce a consistent vocabulary between two sets of labels.
However, some manual intervention is still required in this process given the ambiguities posed by differences in biological systems and technologies.
In the example above, neurons are considered to be unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data.
CD4^+^ and CD8^+^ T cells are also both assigned to "T cells", so some decision about the acceptable resolution of the harmonized labels is required here.

As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses.
This involves substituting the cluster assignments as proxies for the labels, allowing us to match up clusters and integrate conclusions from multiple datasets without the difficulties of batch correction and reclustering.

## Session info {-}
```
R Under development (unstable) (2024-10-21 r87258)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.1 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] scater_1.35.0               ggplot2_3.5.1              
 [3] scuttle_1.17.0              SingleR_2.9.1              
 [5] ensembldb_2.31.0            AnnotationFilter_1.31.0    
 [7] GenomicFeatures_1.59.1      AnnotationDbi_1.69.0       
 [9] celldex_1.17.0              TENxPBMCData_1.25.0        
[11] HDF5Array_1.35.1            rhdf5_2.51.0               
[13] DelayedArray_0.33.2         SparseArray_1.7.2          
[15] S4Arrays_1.7.1              abind_1.4-8                
[17] Matrix_1.7-1                SingleCellExperiment_1.29.1
[19] SummarizedExperiment_1.37.0 Biobase_2.67.0             
[21] GenomicRanges_1.59.1        GenomeInfoDb_1.43.1        
[23] IRanges_2.41.1              S4Vectors_0.45.2           
[25] BiocGenerics_0.53.3         generics_0.1.3             
[27] MatrixGenerics_1.19.0       matrixStats_1.4.1          
[29] BiocStyle_2.35.0            rebook_1.17.0              

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3        jsonlite_1.8.9           
  [3] CodeDepends_0.6.6         magrittr_2.0.3           
  [5] ggbeeswarm_0.7.2          gypsum_1.3.0             
  [7] farver_2.1.2              rmarkdown_2.29           
  [9] BiocIO_1.17.0             zlibbioc_1.53.0          
 [11] vctrs_0.6.5               memoise_2.0.1            
 [13] Rsamtools_2.23.0          DelayedMatrixStats_1.29.0
 [15] RCurl_1.98-1.16           htmltools_0.5.8.1        
 [17] AnnotationHub_3.15.0      curl_6.0.1               
 [19] BiocNeighbors_2.1.0       Rhdf5lib_1.29.0          
 [21] sass_0.4.9                alabaster.base_1.7.2     
 [23] bslib_0.8.0               httr2_1.0.6              
 [25] cachem_1.1.0              GenomicAlignments_1.43.0 
 [27] mime_0.12                 lifecycle_1.0.4          
 [29] pkgconfig_2.0.3           rsvd_1.0.5               
 [31] R6_2.5.1                  fastmap_1.2.0            
 [33] GenomeInfoDbData_1.2.13   digest_0.6.37            
 [35] colorspace_2.1-1          irlba_2.3.5.1            
 [37] ExperimentHub_2.15.0      RSQLite_2.3.8            
 [39] beachmat_2.23.1           labeling_0.4.3           
 [41] filelock_1.0.3            fansi_1.0.6              
 [43] httr_1.4.7                compiler_4.5.0           
 [45] bit64_4.5.2               withr_3.0.2              
 [47] BiocParallel_1.41.0       viridis_0.6.5            
 [49] DBI_1.2.3                 alabaster.ranges_1.7.0   
 [51] alabaster.schemas_1.7.0   rappdirs_0.3.3           
 [53] rjson_0.2.23              tools_4.5.0              
 [55] vipor_0.4.7               beeswarm_0.4.0           
 [57] glue_1.8.0                restfulr_0.0.15          
 [59] rhdf5filters_1.19.0       grid_4.5.0               
 [61] gtable_0.3.6              ScaledMatrix_1.15.0      
 [63] BiocSingular_1.23.0       utf8_1.2.4               
 [65] XVector_0.47.0            ggrepel_0.9.6            
 [67] BiocVersion_3.21.1        pillar_1.9.0             
 [69] dplyr_1.1.4               BiocFileCache_2.15.0     
 [71] lattice_0.22-6            rtracklayer_1.67.0       
 [73] bit_4.5.0                 tidyselect_1.2.1         
 [75] Biostrings_2.75.1         knitr_1.49               
 [77] gridExtra_2.3             bookdown_0.41            
 [79] ProtGenerics_1.39.0       xfun_0.49                
 [81] pheatmap_1.0.12           UCSC.utils_1.3.0         
 [83] lazyeval_0.2.2            yaml_2.3.10              
 [85] evaluate_1.0.1            codetools_0.2-20         
 [87] tibble_3.2.1              alabaster.matrix_1.7.0   
 [89] BiocManager_1.30.25       graph_1.85.0             
 [91] cli_3.6.3                 munsell_0.5.1            
 [93] jquerylib_0.1.4           Rcpp_1.0.13-1            
 [95] dir.expiry_1.15.0         dbplyr_2.5.0             
 [97] png_0.1-8                 XML_3.99-0.17            
 [99] parallel_4.5.0            blob_1.2.4               
[101] beachmat.hdf5_1.5.1       sparseMatrixStats_1.19.0 
[103] bitops_1.0-9              alabaster.se_1.7.0       
[105] viridisLite_0.4.2         scales_1.3.0             
[107] purrr_1.0.2               crayon_1.5.3             
[109] rlang_1.1.4               KEGGREST_1.47.0          
```