Chapter 5 Using multiple references

5.1 Overview

In some cases, we may wish to use multiple references for annotation of a test dataset. This yields a more comprehensive set of cell types that are not covered by any individual reference, especially when differences in the resolution are considered. However, it is not trivial due to the presence of batch effects across references (from differences in technology, experimental protocol or the biological system) as well as differences in the annotation vocabulary between investigators.

Several strategies are available to combine inferences from multiple references:

  • using reference-specific labels in a combined reference
  • using harmonized labels in a combined reference
  • combining scores across multiple references

This chapter discusses the various strengths and weaknesses of each strategy and provides some practical demonstrations of each. Here, we will use the HPCA and Blueprint/ENCODE datasets as our references and (yet another) PBMC dataset as the test.

5.2 Using reference-specific labels

In this strategy, each label is defined in the context of its reference dataset. This means that a label - say, “B cell” - in reference dataset X is considered to be different from a “B cell” label in reference dataset Y. Use of reference-specific labels is most appropriate if there are relevant biological differences between the references; for example, if one reference is concerned with healthy tissue while the other reference considers diseased tissue, it can be helpful to distinguish between the same cell type in different biological contexts.

We can easily implement this approach by combining the expression matrices together and pasting the reference name onto the corresponding character vector of labels. This modification ensures that the downstream SingleR() call will treat each label-reference combination as a distinct entity.

It is then straightforward to perform annotation with the usual methods.

## 
##      BPE.B-cells BPE.CD4+ T-cells BPE.CD8+ T-cells          BPE.HSC 
##             1179             1708             2656               20 
##    BPE.Monocytes     BPE.NK cells  HPCA.HSC_-G-CSF   HPCA.Platelets 
##             2348              460                1                7 
##     HPCA.T_cells 
##                2

However, this strategy identifies markers by directly comparing expression values across references, meaning that the marker set is likely to contain genes responsible for uninteresting batch effects. This will increase noise during the calculation of the score in each reference, possibly leading to a loss of precision and a greater risk of technical variation dominating the classification results. The use of reference-specific labels also complicates interpretation of the results as the cell type is always qualified by its reference of origin.

5.3 Comparing scores across references

5.3.1 Combining inferences from individual references

Another strategy - and the default approach implemented in SingleR() - involves performing classification separately within each reference, and then collating the results to choose the label with the highest score across references. This is a relatively expedient approach that avoids the need for explicit harmonization while also reducing exposure to reference-specific batch effects.

To use this method, we simply pass multiple objects to the ref= and label= argument in SingleR(). The combining strategy is as follows:

  1. The function first annotates the test dataset with each reference individually in the same manner as described in Section 1.2. This step is almost equivalent to simply looping over all individual references and running SingleR() on each.
  2. For each cell, the function collects its predicted labels across all references. In doing so, it also identifies the union of markers that are upregulated in the predicted label in each reference.
  3. The function identifies the overall best-scoring label as the final prediction for that cell. This step involves a recomputation of the scores across the identified marker subset to ensure that these scores are derived from the same set of genes (and are thus comparable across references).

The function will then return a DataFrame of combined results for each cell in the test dataset, including the overall label and the reference from which it was assigned.

## 
##          B-cells           B_cell     CD4+ T-cells     CD8+ T-cells 
##             1170               14             1450             2936 
##              GMP              HSC         Monocyte        Monocytes 
##                1               22              753             1560 
##         NK cells          NK_cell        Platelets Pre-B_cell_CD34- 
##              372               10                9               16 
##          T_cells 
##               68
## 
##    1    2 
## 7510  871

The main appeal of this approach lies in the fact that it is based on the results of annotation with individual references. This avoids batch effects from comparing expression values across references; it reduces the need for any coordination in the label scheme between references; and simultaneously provides the per-reference annotations in the results. The last feature is particularly useful as it allows for more detailed diagnostics, troubleshooting and further analysis.

## [1] "B-cells"      "Monocytes"    "CD8+ T-cells" "CD8+ T-cells" "Monocytes"   
## [6] "Monocytes"
## [1] "B_cell"   "Monocyte" "T_cells"  "T_cells"  "Monocyte" "Monocyte"

The main downside is that it is somewhat suboptimal if there are many labels that are unique to one reference, as markers are not identified with the aim of distinguishing a label in one reference from another label in another reference. The continued lack of consistency in the labels across references also complicates interpretation of the results, though we can overcome this by using harmonized labels as described below.

5.3.2 Combined diagnostics

All of the diagnostic plots in SingleR will naturally operate on these combined results. For example, we can create a heatmap of the scores in all of the individual references as well as for the recomputed scores in the combined results (Figure 5.1). Note that scores are only recomputed for the labels predicted in the individual references, so all labels outside of those are simply set to NA - hence the swathes of grey.

Heatmaps of assignment scores for each cell in the PBMC test dataset after being assigned to the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. One heatmap is shown for the recomputed scores and the scores from each individual reference. The annotation at the top of each heatmap represents the final combined prediction for each cell.

Figure 5.1: Heatmaps of assignment scores for each cell in the PBMC test dataset after being assigned to the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. One heatmap is shown for the recomputed scores and the scores from each individual reference. The annotation at the top of each heatmap represents the final combined prediction for each cell.

The deltas for each individual reference can also be plotted with plotDeltaDistribution() (Figure 5.2). No deltas are shown for the recomputed scores as the assumption described in Section 4.3 may not be applicable across the predicted labels from the individual references. For example, if all individual references suggest the same cell type with similar recomputed scores, any delta would be low even though the assignment is highly confident.

Distribution of the deltas across cells in the PBMC test dataset for each label in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Each point represents a cell that was assigned to that label in the combined results, colored by whether it was pruned or not in the corresponding individual reference.

Figure 5.2: Distribution of the deltas across cells in the PBMC test dataset for each label in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Each point represents a cell that was assigned to that label in the combined results, colored by whether it was pruned or not in the corresponding individual reference.

We can similarly extract marker genes to use in heatmaps as described in Section 4.4. As annotation was performed to each individual reference, we can simply extract the marker genes from the nested DataFrames as shown in Figure 5.3.

Heatmap of log-expression values in the PBMC dataset for all marker genes upregulated in monocytes in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Combined labels for each cell are shown at the top.

Figure 5.3: Heatmap of log-expression values in the PBMC dataset for all marker genes upregulated in monocytes in the Blueprint/ENCODE and Human Primary Cell Atlas reference datasets. Combined labels for each cell are shown at the top.

5.4 Using harmonized labels

5.4.1 Sharing information during marker detection

One of the major problems with using multiple references is the presence of study-specific nomenclature. For example, the concept of a B cell may be annotated as B cells in one reference, B_cells in another reference, and then B and B-cell and so on in other references. We can overcome this by using harmonized labels where the same cell type is assigned as the same label across references, simplifying interpretation and ensuring that irrelevant discrepancies in labelling do not intefere with downstream analysis.

Many of the SingleR reference datasets already have their labels mapped to the Cell Ontology, which provides a standard vocabulary to refer to the same cell type across diverse datasets. We will describe the utility of Cell Ontology terms in more detail in Chapter 6; at this point, the key idea is that the same term is used for the same conceptual cell type in each reference. To simplify interpretation, we set cell.ont="nonna" to remove all samples that could not be mapped to the ontology.

## 
## CL:0002259 CL:0000017 CL:0000049 CL:0000050 CL:0000084 CL:0000127 CL:0000557 
##          1          2          2          2          2          2          2 
## CL:0000798 CL:0000816 CL:0000836 
##          2          2          2
## 
## CL:0000451 CL:0000771 CL:0000787 CL:0000815 CL:0000904 CL:0000905 CL:0000907 
##          1          1          1          1          1          1          1 
## CL:0000913 CL:0000972 CL:0000127 
##          1          1          2

The simplest way to take advantage of the standardization in terminology is to use label.ont in place of label.main in the previous section’s SingleR() call. This yields annotations that have follow the same vocabulary regardless of the reference used for assignment.

##             Reference
## Label           1    2
##   CL:0000037    2    0
##   CL:0000050    6    0
##   CL:0000051    6    0
##   CL:0000233    0    3
##   CL:0000236    0   70
##   CL:0000556    7    0
##   CL:0000557    4    1
##   CL:0000576 1520  548
##   CL:0000623  304   10
##   CL:0000624  734   34
##   CL:0000625  591   89
##   CL:0000786    2    0
##   CL:0000787  270    3
##   CL:0000788  728   12
##   CL:0000798    0    2
##   CL:0000815   78    0
##   CL:0000816    0   21
##   CL:0000837    5    0
##   CL:0000895    0  399
##   CL:0000904  126  867
##   CL:0000905  137  231
##   CL:0000907  729    0
##   CL:0000913  479    5
##   CL:0000955    0   13
##   CL:0000972  101    0
##   CL:0001054    0  244

A more advanced approach is to share information across references during the marker detection stage. This is done by favoring genes the exhibit upregulation consistently in multiple references, which increases the likelihood that those markers will generalize to other datasets. For classic marker detection, we achieve this by calling getClassicMarkers() to obtain markers for use in SingleR(); the same effect can be achieved for test-based methods in scran functions by setting block=. We then use these improved markers by passing them to genes= as described in Section 3.3. In this case, we specify com.markers twice in a list to indicate that we are using them for both of our references.

##             Reference
## Label           1    2
##   CL:0000037    4    0
##   CL:0000050    5    0
##   CL:0000051    8    0
##   CL:0000233    0    2
##   CL:0000236    0  106
##   CL:0000556    8    0
##   CL:0000557    1    2
##   CL:0000576 1435  669
##   CL:0000623  306   20
##   CL:0000624  605   90
##   CL:0000625  424  171
##   CL:0000786    3    0
##   CL:0000787  225    2
##   CL:0000788  719   30
##   CL:0000798    0    2
##   CL:0000815  111    0
##   CL:0000816    0   36
##   CL:0000837    5    0
##   CL:0000895    0  410
##   CL:0000904   55  986
##   CL:0000905  108  272
##   CL:0000907  728    0
##   CL:0000913  503   24
##   CL:0000955    0   10
##   CL:0000972   91    0
##   CL:0001054    0  205

It is worth noting that, in the above code, the DE genes are still identified within each reference and then the statistics are merged across references to identify the top markers. This ensures that we do not directly compare expression values across references, which reduces the susceptibility of marker detection to batch effects.

The most obvious problem with this approach is that it assumes that harmonized labels are available. This is usually not true and requires some manual mapping of the author-provided labels to a common vocabulary. The mapping process also runs the risk of discarding relevant information about the biological status (e.g., activation status, disease condition) if there is no obvious counterpart for that state in the ontology.

5.4.2 Manual label harmonization

The matchReferences() function provides a simple approach for label harmonization between two references. Each reference is used to annotate the other and the probability of mutual assignment between each pair of labels is computed, i.e., for each pair of labels, what is the probability that a cell with one label is assigned the other and vice versa? Probabilities close to 1 in Figure 5.4 indicate there is a 1:1 relation between that pair of labels; on the other hand, an all-zero probability vector indicates that a label is unique to a particular reference.

Heatmap of mutual assignment probabilities between the Blueprint/ENCODE reference dataset (labels in rows) and the Human primary cell atlas reference (labels in columns).

Figure 5.4: Heatmap of mutual assignment probabilities between the Blueprint/ENCODE reference dataset (labels in rows) and the Human primary cell atlas reference (labels in columns).

This function can be used to guide harmonization to enforce a consistent vocabulary between two sets of labels. However, some manual intervention is still required in this process given the ambiguities posed by differences in biological systems and technologies. In the example above, neurons are considered to be unique to each reference while smooth muscle cells in the HPCA data are incorrectly matched to fibroblasts in the Blueprint/ENCODE data. CD4+ and CD8+ T cells are also both assigned to “T cells”, so some decision about the acceptable resolution of the harmonized labels is required here.

As an aside, we can also use this function to identify the matching clusters between two independent scRNA-seq analyses. This involves substituting the cluster assignments as proxies for the labels, allowing us to match up clusters and integrate conclusions from multiple datasets without the difficulties of batch correction and reclustering.

Session info

R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-books/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-books/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] scater_1.18.6               ggplot2_3.3.3              
 [3] SingleR_1.4.1               ensembldb_2.14.0           
 [5] AnnotationFilter_1.14.0     GenomicFeatures_1.42.2     
 [7] AnnotationDbi_1.52.0        celldex_1.0.0              
 [9] TENxPBMCData_1.8.0          HDF5Array_1.18.1           
[11] rhdf5_2.34.0                DelayedArray_0.16.2        
[13] Matrix_1.3-2                SingleCellExperiment_1.12.0
[15] SummarizedExperiment_1.20.0 Biobase_2.50.0             
[17] GenomicRanges_1.42.0        GenomeInfoDb_1.26.4        
[19] IRanges_2.24.1              S4Vectors_0.28.1           
[21] BiocGenerics_0.36.0         MatrixGenerics_1.2.1       
[23] matrixStats_0.58.0          BiocStyle_2.18.1           
[25] rebook_1.0.0               

loaded via a namespace (and not attached):
  [1] AnnotationHub_2.22.0          BiocFileCache_1.14.0         
  [3] lazyeval_0.2.2                BiocParallel_1.24.1          
  [5] digest_0.6.27                 htmltools_0.5.1.1            
  [7] viridis_0.5.1                 fansi_0.4.2                  
  [9] magrittr_2.0.1                memoise_2.0.0                
 [11] Biostrings_2.58.0             askpass_1.1                  
 [13] prettyunits_1.1.1             colorspace_2.0-0             
 [15] blob_1.2.1                    rappdirs_0.3.3               
 [17] xfun_0.22                     dplyr_1.0.5                  
 [19] callr_3.5.1                   crayon_1.4.1                 
 [21] RCurl_1.98-1.3                jsonlite_1.7.2               
 [23] graph_1.68.0                  glue_1.4.2                   
 [25] gtable_0.3.0                  zlibbioc_1.36.0              
 [27] XVector_0.30.0                BiocSingular_1.6.0           
 [29] Rhdf5lib_1.12.1               scales_1.1.1                 
 [31] pheatmap_1.0.12               DBI_1.1.1                    
 [33] Rcpp_1.0.6                    viridisLite_0.3.0            
 [35] xtable_1.8-4                  progress_1.2.2               
 [37] bit_4.0.4                     rsvd_1.0.3                   
 [39] httr_1.4.2                    RColorBrewer_1.1-2           
 [41] ellipsis_0.3.1                scuttle_1.0.4                
 [43] pkgconfig_2.0.3               XML_3.99-0.6                 
 [45] farver_2.1.0                  CodeDepends_0.6.5            
 [47] sass_0.3.1                    dbplyr_2.1.0                 
 [49] utf8_1.2.1                    tidyselect_1.1.0             
 [51] labeling_0.4.2                rlang_0.4.10                 
 [53] later_1.1.0.1                 munsell_0.5.0                
 [55] BiocVersion_3.12.0            tools_4.0.4                  
 [57] cachem_1.0.4                  generics_0.1.0               
 [59] RSQLite_2.2.4                 ExperimentHub_1.16.0         
 [61] evaluate_0.14                 stringr_1.4.0                
 [63] fastmap_1.1.0                 yaml_2.2.1                   
 [65] processx_3.4.5                knitr_1.31                   
 [67] bit64_4.0.5                   purrr_0.3.4                  
 [69] sparseMatrixStats_1.2.1       mime_0.10                    
 [71] xml2_1.3.2                    biomaRt_2.46.3               
 [73] compiler_4.0.4                beeswarm_0.3.1               
 [75] curl_4.3                      interactiveDisplayBase_1.28.0
 [77] tibble_3.1.0                  bslib_0.2.4                  
 [79] stringi_1.5.3                 highr_0.8                    
 [81] ps_1.6.0                      lattice_0.20-41              
 [83] ProtGenerics_1.22.0           vctrs_0.3.6                  
 [85] pillar_1.5.1                  lifecycle_1.0.0              
 [87] rhdf5filters_1.2.0            BiocManager_1.30.10          
 [89] jquerylib_0.1.3               BiocNeighbors_1.8.2          
 [91] bitops_1.0-6                  irlba_2.3.3                  
 [93] httpuv_1.5.5                  rtracklayer_1.50.0           
 [95] R6_2.5.0                      bookdown_0.21                
 [97] promises_1.2.0.1              gridExtra_2.3                
 [99] vipor_0.4.5                   codetools_0.2-18             
[101] assertthat_0.2.1              openssl_1.4.3                
[103] withr_2.4.1                   GenomicAlignments_1.26.0     
[105] Rsamtools_2.6.0               GenomeInfoDbData_1.2.4       
[107] hms_1.0.0                     grid_4.0.4                   
[109] beachmat_2.6.4                rmarkdown_2.7                
[111] DelayedMatrixStats_1.12.3     shiny_1.6.0                  
[113] ggbeeswarm_0.6.0