Chapter 5 Using multiple references
5.1 Overview
In some cases, we may wish to use multiple references for annotation of a test dataset. This yields a more comprehensive set of cell types that are not covered by any individual reference, especially when differences in the resolution are considered. However, it is not trivial due to the presence of batch effects across references (from differences in technology, experimental protocol or the biological system) as well as differences in the annotation vocabulary between investigators.
Several strategies are available to combine inferences from multiple references:
- using reference-specific labels in a combined reference
- using harmonized labels in a combined reference
- combining scores across multiple references
This chapter discusses the various strengths and weaknesses of each strategy and provides some practical demonstrations of each. Here, we will use the HPCA and Blueprint/ENCODE datasets as our references and (yet another) PBMC dataset as the test.
5.2 Using reference-specific labels
In this strategy, each label is defined in the context of its reference dataset. This means that a label - say, “B cell” - in reference dataset X is considered to be different from a “B cell” label in reference dataset Y. Use of reference-specific labels is most appropriate if there are relevant biological differences between the references; for example, if one reference is concerned with healthy tissue while the other reference considers diseased tissue, it can be helpful to distinguish between the same cell type in different biological contexts.
We can easily implement this approach by combining the expression matrices together
and pasting the reference name onto the corresponding character vector of labels.
This modification ensures that the downstream SingleR() call
will treat each label-reference combination as a distinct entity.
hpca2 <- hpca
hpca2$label.main <- paste0("HPCA.", hpca2$label.main)
bpe2 <- bpe
bpe2$label.main <- paste0("BPE.", bpe2$label.main)
shared <- intersect(rownames(hpca2), rownames(bpe2))
combined <- cbind(hpca2[shared,], bpe2[shared,])It is then straightforward to perform annotation with the usual methods.
library(SingleR)
com.res1 <- SingleR(pbmc, ref=combined, labels=combined$label.main, assay.type.test=1)
table(com.res1$labels)## 
##      BPE.B-cells BPE.CD4+ T-cells BPE.CD8+ T-cells          BPE.HSC 
##             1178             1708             2656               20 
##    BPE.Monocytes     BPE.NK cells  HPCA.HSC_-G-CSF   HPCA.Platelets 
##             2349              460                1                7 
##     HPCA.T_cells 
##                2However, this strategy identifies markers by directly comparing expression values across references, meaning that the marker set is likely to contain genes responsible for uninteresting batch effects. This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results. The use of reference-specific labels also complicates interpretation of the results as the cell type is always qualified by its reference of origin.
5.3 Comparing scores across references
5.3.1 Combining inferences from individual references
Another strategy - and the default approach implemented in SingleR() -
involves performing classification separately within each reference,
and then collating the results to choose the label with the highest score across references.
This is a relatively expedient approach that avoids the need for explicit harmonization
while also reducing exposure to reference-specific batch effects.
To use this method, we simply pass multiple objects to the ref= and label= argument in SingleR().
The combining strategy is as follows:
- The function first annotates the test dataset with each reference individually
in the same manner as described in Section 1.2.
This step is almost equivalent to simply looping over all individual references and running SingleR()on each.
- For each cell, the function collects its predicted labels across all references. In doing so, it also identifies the union of markers that are upregulated in the predicted label in each reference.
- The function identifies the overall best-scoring label as the final prediction for that cell. This step involves a recomputation of the scores across the identified marker subset to ensure that these scores are derived from the same set of genes (and are thus comparable across references).
The function will then return a DataFrame of combined results for each cell in the test dataset,
including the overall label and the reference from which it was assigned.
com.res2 <- SingleR(test = pbmc, assay.type.test=1,
    ref = list(BPE=bpe, HPCA=hpca), 
    labels = list(bpe$label.main, hpca$label.main))
# Check the final label from the combined assignment.
table(com.res2$labels) ## 
##          B-cells           B_cell     CD4+ T-cells     CD8+ T-cells 
##             1170               14             1450             2936 
##              GMP              HSC         Monocyte        Monocytes 
##                1               22              753             1560 
##         NK cells          NK_cell        Platelets Pre-B_cell_CD34- 
##              372               10                9               16 
##          T_cells 
##               68## 
##    1    2 
## 7510  871The main appeal of this approach lies in the fact that it is based on the results of annotation with individual references. This avoids batch effects from comparing expression values across references; it reduces the need for any coordination in the label scheme between references; and simultaneously provides the per-reference annotations in the results. The last feature is particularly useful as it allows for more detailed diagnostics, troubleshooting and further analysis.
## [1] "B-cells"      "Monocytes"    "CD8+ T-cells" "CD8+ T-cells" "Monocytes"   
## [6] "Monocytes"## [1] "B_cell"   "Monocyte" "T_cells"  "T_cells"  "Monocyte" "Monocyte"The main downside is that it is somewhat suboptimal if there are many labels that are unique to one reference, as markers are not identified with the aim of distinguishing a label in one reference from another label in another reference. The continued lack of consistency in the labels across references also complicates interpretation of the results, though we can overcome this by using harmonized labels as described below.
5.3.2 Combined diagnostics
All of the diagnostic plots in SingleR will naturally operate on these combined results.
For example, we can create a heatmap of the scores in all of the individual references
as well as for the recomputed scores in the combined results (Figure 5.1).
Note that scores are only recomputed for the labels predicted in the individual references,
so all labels outside of those are simply set to NA - hence the swathes of grey.
 
Figure 5.1: Heatmaps of assignment scores for each cell in the PBMC test dataset after being assigned to the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. One heatmap is shown for the recomputed scores and the scores from each individual reference. The annotation at the top of each heatmap represents the final combined prediction for each cell.
The deltas for each individual reference can also be plotted with plotDeltaDistribution() (Figure 5.2).
No deltas are shown for the recomputed scores as the assumption described in Section 4.3
may not be applicable across the predicted labels from the individual references.
For example, if all individual references suggest the same cell type with similar recomputed scores,
any delta would be low even though the assignment is highly confident.
 
Figure 5.2: Distribution of the deltas across cells in the PBMC test dataset for each label in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Each point represents a cell that was assigned to that label in the combined results, colored by whether it was pruned or not in the corresponding individual reference.
We can similarly extract marker genes to use in heatmaps as described in Section 4.4.
As annotation was performed to each individual reference,
we can simply extract the marker genes from the nested DataFrames as shown in Figure 5.3.
hpca.markers <- metadata(com.res2$orig.results$HPCA)$de.genes
bpe.markers <- metadata(com.res2$orig.results$BPE)$de.genes
mono.markers <- unique(unlist(hpca.markers$Monocyte, bpe.markers$Monocytes))
library(scater)
plotHeatmap(logNormCounts(pbmc), 
    order_columns_by=list(I(com.res2$labels)),
    features=mono.markers) 
Figure 5.3: Heatmap of log-expression values in the PBMC dataset for all marker genes upregulated in monocytes in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Combined labels for each cell are shown at the top.
5.4 Using harmonized labels
5.4.2 Manual label harmonization
The matchReferences() function provides a simple approach for label harmonization between two references.
Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed,
i.e., for each pair of labels, what is the probability that a cell with one label is assigned the other and vice versa?
Probabilities close to 1 in Figure 5.4 indicate there is a 1:1 relation between that pair of labels;
on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.
library(SingleR)
bp.se <- BlueprintEncodeData()
hpca.se <- HumanPrimaryCellAtlasData()
matched <- matchReferences(bp.se, hpca.se,
    bp.se$label.main, hpca.se$label.main)
pheatmap::pheatmap(matched, col=viridis::plasma(100)) 
Figure 5.4: Heatmap of mutual assignment probabilities between the Blueprint/ENCODE reference dataset (labels in rows) and the Human primary cell atlas reference (labels in columns).
This function can be used to guide harmonization to enforce a consistent vocabulary between two sets of labels. However, some manual intervention is still required in this process given the ambiguities posed by differences in biological systems and technologies. In the example above, neurons are considered to be unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data. CD4+ and CD8+ T cells are also both assigned to “T cells”, so some decision about the acceptable resolution of the harmonized labels is required here.
As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This involves substituting the cluster assignments as proxies for the labels, allowing us to match up clusters and integrate conclusions from multiple datasets without the difficulties of batch correction and reclustering.
Session info
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS
Matrix products: default
BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     
other attached packages:
 [1] scater_1.26.0               ggplot2_3.3.6              
 [3] scuttle_1.8.0               SingleR_2.0.0              
 [5] ensembldb_2.22.0            AnnotationFilter_1.22.0    
 [7] GenomicFeatures_1.50.0      AnnotationDbi_1.60.0       
 [9] celldex_1.7.0               TENxPBMCData_1.15.0        
[11] HDF5Array_1.26.0            rhdf5_2.42.0               
[13] DelayedArray_0.24.0         Matrix_1.5-1               
[15] SingleCellExperiment_1.20.0 SummarizedExperiment_1.28.0
[17] Biobase_2.58.0              GenomicRanges_1.50.0       
[19] GenomeInfoDb_1.34.0         IRanges_2.32.0             
[21] S4Vectors_0.36.0            BiocGenerics_0.44.0        
[23] MatrixGenerics_1.10.0       matrixStats_0.62.0         
[25] BiocStyle_2.26.0            rebook_1.8.0               
loaded via a namespace (and not attached):
  [1] AnnotationHub_3.6.0           BiocFileCache_2.6.0          
  [3] lazyeval_0.2.2                BiocParallel_1.32.0          
  [5] digest_0.6.30                 htmltools_0.5.3              
  [7] viridis_0.6.2                 fansi_1.0.3                  
  [9] magrittr_2.0.3                memoise_2.0.1                
 [11] ScaledMatrix_1.6.0            Biostrings_2.66.0            
 [13] prettyunits_1.1.1             colorspace_2.0-3             
 [15] ggrepel_0.9.1                 blob_1.2.3                   
 [17] rappdirs_0.3.3                xfun_0.34                    
 [19] dplyr_1.0.10                  crayon_1.5.2                 
 [21] RCurl_1.98-1.9                jsonlite_1.8.3               
 [23] graph_1.76.0                  glue_1.6.2                   
 [25] gtable_0.3.1                  zlibbioc_1.44.0              
 [27] XVector_0.38.0                BiocSingular_1.14.0          
 [29] Rhdf5lib_1.20.0               scales_1.2.1                 
 [31] pheatmap_1.0.12               DBI_1.1.3                    
 [33] Rcpp_1.0.9                    viridisLite_0.4.1            
 [35] xtable_1.8-4                  progress_1.2.2               
 [37] bit_4.0.4                     rsvd_1.0.5                   
 [39] httr_1.4.4                    dir.expiry_1.6.0             
 [41] RColorBrewer_1.1-3            ellipsis_0.3.2               
 [43] pkgconfig_2.0.3               XML_3.99-0.12                
 [45] farver_2.1.1                  CodeDepends_0.6.5            
 [47] sass_0.4.2                    dbplyr_2.2.1                 
 [49] utf8_1.2.2                    labeling_0.4.2               
 [51] tidyselect_1.2.0              rlang_1.0.6                  
 [53] later_1.3.0                   munsell_0.5.0                
 [55] BiocVersion_3.16.0            tools_4.2.1                  
 [57] cachem_1.0.6                  cli_3.4.1                    
 [59] generics_0.1.3                RSQLite_2.2.18               
 [61] ExperimentHub_2.6.0           evaluate_0.17                
 [63] stringr_1.4.1                 fastmap_1.1.0                
 [65] yaml_2.3.6                    knitr_1.40                   
 [67] bit64_4.0.5                   purrr_0.3.5                  
 [69] KEGGREST_1.38.0               sparseMatrixStats_1.10.0     
 [71] mime_0.12                     xml2_1.3.3                   
 [73] biomaRt_2.54.0                compiler_4.2.1               
 [75] beeswarm_0.4.0                filelock_1.0.2               
 [77] curl_4.3.3                    png_0.1-7                    
 [79] interactiveDisplayBase_1.36.0 tibble_3.1.8                 
 [81] bslib_0.4.0                   stringi_1.7.8                
 [83] highr_0.9                     lattice_0.20-45              
 [85] ProtGenerics_1.30.0           vctrs_0.5.0                  
 [87] pillar_1.8.1                  lifecycle_1.0.3              
 [89] rhdf5filters_1.10.0           BiocManager_1.30.19          
 [91] jquerylib_0.1.4               BiocNeighbors_1.16.0         
 [93] bitops_1.0-7                  irlba_2.3.5.1                
 [95] httpuv_1.6.6                  rtracklayer_1.58.0           
 [97] R6_2.5.1                      BiocIO_1.8.0                 
 [99] bookdown_0.29                 promises_1.2.0.1             
[101] gridExtra_2.3                 vipor_0.4.5                  
[103] codetools_0.2-18              assertthat_0.2.1             
[105] rjson_0.2.21                  withr_2.5.0                  
[107] GenomicAlignments_1.34.0      Rsamtools_2.14.0             
[109] GenomeInfoDbData_1.2.9        parallel_4.2.1               
[111] hms_1.1.2                     grid_4.2.1                   
[113] beachmat_2.14.0               rmarkdown_2.17               
[115] DelayedMatrixStats_1.20.0     shiny_1.7.3                  
[117] ggbeeswarm_0.6.0              restfulr_0.0.15