Chapter 6 Exploiting the cell ontology

6.1 Motivation

As previously discussed in Section 5.4, SingleR maps the labels in its references to the Cell Ontology. The most obvious advantage of doing this is to provide a standardized vocabulary with which to describe cell types, thus facilitating integrated analyses with multiple references. However, another useful feature of the Cell Ontology is its hierarchical organization of terms, allowing us to adjust cell type annotations to the desired resolution. This represents a more dynamic alternative to the static label.main and label.fine options in each reference.

6.2 Basic manipulation

We use the ontoProc package to load in the Cell Ontology. This produces an ontology_index object (from the ontologyIndex package) that we can query for various pieces of information.

## Ontology with 9941 terms
## 
## format-version: 1.2
## data-version: releases/2021-03-05
## ontology: cl
## 
## Properties:
##  id: character
##  name: character
##  parents: list
##  children: list
##  ancestors: list
##  obsolete: logical
##  adjacent_to: list
##  alt_id: list
##  anterior_to: list
##  anteriorly_connected_to: list
##  attaches_to: list
##  attaches_to_part_of: list
##  bearer_of: list
##  bounding_layer_of: list
##  branching_part_of: list
##  capable_of: list
##  capable_of_part_of: list
##  channel_for: list
##  channels_from: list
##  channels_into: list
##  comment: character
##  composed_primarily_of: list
##  conduit_for: list
##  connected_to: list
##  connects: list
##  consider: list
##  contains: list
##  continuous_with: list
##  contributes_to_morphology_of: list
##  created_by: character
##  creation_date: character
##  data-version: list
##  dc-contributor: list
##  dc-creator: list
##  decreased_in_magnitude_relative_to: list
##  deep_to: list
##  def: character
##  derives_from: list
##  developmentally_induced_by: list
##  developmentally_replaces: list
##  develops_from: list
##  develops_from_part_of: list
##  develops_in: list
##  develops_into: list
##  directly_develops_from: list
##  disjoint_from: list
##  distal_to: list
##  distally_connected_to: list
##  distalmost_part_of: list
##  domain: list
##  dorsal_to: list
##  drains: list
##  ends: list
##  ends_with: list
##  equivalent_to_chain: list
##  existence_ends_during: list
##  existence_ends_during_or_before: list
##  existence_ends_with: list
##  existence_starts_and_ends_during: list
##  existence_starts_during: list
##  existence_starts_during_or_after: list
##  existence_starts_with: list
##  expand_assertion_to: list
##  expand_expression_to: list
##  extends_fibers_into: list
##  filtered_through: list
##  format-version: list
##  has_boundary: list
##  has_completed: list
##  has_component: list
##  has_cross_section: list
##  has_developmental_contribution_from: list
##  has_gene_template: list
##  has_high_plasma_membrane_amount: list
##  has_low_plasma_membrane_amount: list
##  has_member: list
##  has_muscle_antagonist: list
##  has_muscle_insertion: list
##  has_muscle_origin: list
##  has_not_completed: list
##  has_part: list
##  has_potential_to_develop_into: list
##  has_potential_to_developmentally_contribute_to: list
##  has_quality: list
##  has_relative_magnitude: list
##  has_role: list
##  has_skeleton: list
##  holds_over_chain: list
##  immediate_transformation_of: list
##  immediately_deep_to: list
##  immediately_preceded_by: list
##  immediately_superficial_to: list
##  in_anterior_side_of: list
##  in_deep_part_of: list
##  in_distal_side_of: list
##  in_dorsal_side_of: list
##  in_lateral_side_of: list
##  in_left_side_of: list
##  in_posterior_side_of: list
##  in_proximal_side_of: list
##  in_right_side_of: list
##  in_taxon: list
##  increased_in_magnitude_relative_to: list
##  indirectly_supplies: list
##  innervated_by: list
##  innervates: list
##  intersection_of: list
##  intersects_midsagittal_plane_of: list
##  inverse_of: list
##  is_a: list
##  is_class_level: list
##  is_cyclic: list
##  is_functional: list
##  is_metadata_tag: list
##  is_symmetric: list
##  is_transitive: list
##  lacks_part: list
##  lacks_plasma_membrane_part: list
##  located_in: list
##  location_of: list
##  lumen_of: list
##  luminal_space_of: list
##  namespace: list
##  negatively_regulates: list
##  never_in_taxon: list
##  occurs_in: list
##  only_in_taxon: list
##  ontology: list
##  output_of: list
##  part_of: list
##  participates_in: list
##  positively_regulates: list
##  posterior_to: list
##  posteriorly_connected_to: list
##  preaxialmost_part_of: list
##  preceded_by: list
##  precedes: list
##  produced_by: list
##  produces: list
##  property_value: list
##  protects: list
##  proximal_to: list
##  proximally_connected_to: list
##  proximalmost_part_of: list
##  range: list
##  regulates: list
##  remark: list
##  replaced_by: list
##  RO:0002100: list
##  RO:0002102: list
##  RO:0002103: list
##  RO:0002104: list
##  RO:0002120: list
##  RO:0002130: list
##  RO:0002292: list
##  RO:0013001: list
##  seeAlso: list
##  sexually_homologous_to: list
##  skeleton_of: list
##  starts: list
##  starts_with: list
##  subdivision_of: list
##  subset: list
##  subsetdef: list
##  superficial_to: list
##  supplies: list
##  surrounded_by: list
##  surrounds: list
##  synonym: list
##  synonymtypedef: list
##  transformation_of: list
##  transitive_over: list
##  tributary_of: list
##  trunk_part_of: list
##  union_of: list
##  ventral_to: list
##  xref: list
## Roots:
##  BFO:0000003 - occurrent
##  functionally_related_to - functionally related to
##  CHEBI:36342 - subatomic particle
##  RO:0002410 - causally related to
##  BFO:0000002 - continuant
##  UBERON:0001062 - anatomical entity
##  RO:0002323 - mereotopologically related to
##  GO:0005575 - cellular_component
##  UBERON:0000000 - processual entity
##  RO:0000057 - has participant
##  ... 82 more

The most immediate use of this object lies in mapping ontology terms to their plain-English descriptions. We can use this to translate annotations produced by SingleR() from the label.ont labels into a more interpretable form. We demonstrate this approach using celldex’s collection of mouse RNA-seq references (Aran et al. 2019).

##              BFO:0000002              BFO:0000003              BFO:0000004 
##             "continuant"              "occurrent" "independent continuant" 
##              BFO:0000006              BFO:0000015              BFO:0000019 
##         "spatial region"                "process"                "quality"
##                                                                                                                                                   BFO:0000002 
## "\"An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.\" []" 
##                                                                                                                                                   BFO:0000003 
##                                                                "\"An entity that has temporal parts and that happens, unfolds or develops through time.\" []" 
##                                                                                                                                                   BFO:0000004 
##  "\"A continuant that is a bearer of quality and realizable entity entities, in which other entities inhere and which itself cannot inhere in anything.\" []" 
##                                                                                                                                                   BFO:0000006 
##                                                                                                                                                            NA 
##                                                                                                                                                   BFO:0000015 
##                                           "\"An occurrent that has temporal proper parts and for some time t, p s-depends_on some material entity at t.\" []" 
##                                                                                                                                                   BFO:0000019 
##                                                                                                                                                            NA
## CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 CL:0000136 
## "fat cell" "fat cell" "fat cell" "fat cell" "fat cell" "fat cell"

Another interesting application involves examining the relationship between different terms. The ontology itself is a directed acyclic graph, so we can can convert it into graph object for advanced queries using the igraph package. Each edge represents an “is a” relationship where each vertex represents a specialized case of the concept of the parent node.

## IGRAPH 0b758ff DN-- 9886 15674 -- 
## + attr: name (v/c)
## + edges from 0b758ff (vertex names):
##  [1] BFO:0000002 ->BFO:0000004  BFO:0000141 ->BFO:0000006 
##  [3] BFO:0000003 ->BFO:0000015  BFO:0000020 ->BFO:0000019 
##  [5] BFO:0000002 ->BFO:0000020  BFO:0000040 ->BFO:0000024 
##  [7] BFO:0000040 ->BFO:0000030  BFO:0000002 ->BFO:0000031 
##  [9] BFO:0000004 ->BFO:0000040  BFO:0000004 ->BFO:0000141 
## [11] CARO:0030000->CARO:0000000 CARO:0000006->CARO:0000003
## [13] BFO:0000040 ->CARO:0000006 CARO:0000000->CARO:0000006
## [15] CARO:0000003->CARO:0001000 CARO:0001000->CARO:0001001
## + ... omitted several edges

One query involves identifying all descendents of a particular term of interest. This can be useful when searching for a cell type in the presence of variable annotation resolution; for example, a search for “epithelial cell” can be configured to pick up all child terms such as “endothelial cell” and “ependymal cell”.

##                        CL:0000624 
## "CD4-positive, alpha-beta T cell"
##                                                       CL:0000624 
##                                "CD4-positive, alpha-beta T cell" 
##                                                       CL:0000492 
##                                     "CD4-positive helper T cell" 
##                                                       CL:0001051 
## "CD4-positive, CXCR3-negative, CCR6-negative, alpha-beta T cell" 
##                                                       CL:0000791 
##                                       "mature alpha-beta T cell" 
##                                                       CL:0000792 
##      "CD4-positive, CD25-positive, alpha-beta regulatory T cell" 
##                                                       CL:0000793 
##                "CD4-positive, alpha-beta intraepithelial T cell"

Alternatively, we might be interested in the last common ancestor (LCA) for a set of terms. This is the furthest term - or, in some cases, multiple terms - from the root of the ontology that is also an ancestor of all of the terms of interest. We will use this LCA concept in the next section to adjust resolution across multiple references.

##                        CL:0000624                        CL:0000785 
## "CD4-positive, alpha-beta T cell"                   "mature B cell" 
##                        CL:0000623 
##             "natural killer cell"
##   CL:0000542 
## "lymphocyte"

6.3 Adjusting resolution

We can use the ontology graph to adjust the resolution of the reference labels by rolling up overly-specific terms to their LCA. The findCommonAncestors() utility takes a set of terms and returns a list of potential LCAs for various subsets of those terms. Users can inspect this list to identify LCAs at the desired resolution and then map their descendent terms to those LCAs.

## $`CL:0000081`
## $`CL:0000081`$name
## [1] "blood cell"
## 
## $`CL:0000081`$descendents
## DataFrame with 2 rows and 2 columns
##                   name      set1
##            <character> <logical>
## CL:0000232 erythrocyte      TRUE
## CL:0000094 granulocyte      TRUE
## 
## 
## $`CL:0000126`
## $`CL:0000126`$name
## [1] "macroglial cell"
## 
## $`CL:0000126`$descendents
## DataFrame with 2 rows and 2 columns
##                       name      set1
##                <character> <logical>
## CL:0000127       astrocyte      TRUE
## CL:0000128 oligodendrocyte      TRUE
## 
## 
## $`CL:0000393`
## $`CL:0000393`$name
## [1] "electrically responsive cell"
## 
## $`CL:0000393`$descendents
## DataFrame with 2 rows and 2 columns
##                           name      set1
##                    <character> <logical>
## CL:0000540              neuron      TRUE
## CL:0000746 cardiac muscle cell      TRUE
## 
## 
## $`CL:0002320`
## $`CL:0002320`$name
## [1] "connective tissue cell"
## 
## $`CL:0002320`$descendents
## DataFrame with 2 rows and 2 columns
##                   name      set1
##            <character> <logical>
## CL:0000136    fat cell      TRUE
## CL:0000057  fibroblast      TRUE
## 
## 
## $`CL:0011115`
## $`CL:0011115`$name
## [1] "precursor cell"
## 
## $`CL:0011115`$descendents
## DataFrame with 2 rows and 2 columns
##                          name      set1
##                   <character> <logical>
## CL:0000047 neuronal stem cell      TRUE
## CL:0000576           monocyte      TRUE
## 
## 
## $`CL:0000066`
## $`CL:0000066`$name
## [1] "epithelial cell"
## 
## $`CL:0000066`$descendents
## DataFrame with 3 rows and 2 columns
##                        name      set1
##                 <character> <logical>
## CL:0000115 endothelial cell      TRUE
## CL:0000182       hepatocyte      TRUE
## CL:0000065   ependymal cell      TRUE

We can also use this function to synchronize multiple sets of terms to the same resolution. Here, we consider the ImmGen dataset (Heng et al. 2008), which provides highly resolved annotation of immune cell types. The findCommonAncestors() function specifies the origins of the descendents for each LCA, allowing us to focus on LCAs that have representatives in both sets of terms.

## $`CL:0000126`
## $`CL:0000126`$name
## [1] "macroglial cell"
## 
## $`CL:0000126`$descendents
## DataFrame with 2 rows and 3 columns
##                       name  MouseRNA    ImmGen
##                <character> <logical> <logical>
## CL:0000127       astrocyte      TRUE     FALSE
## CL:0000128 oligodendrocyte      TRUE     FALSE
## 
## 
## $`CL:0000393`
## $`CL:0000393`$name
## [1] "electrically responsive cell"
## 
## $`CL:0000393`$descendents
## DataFrame with 2 rows and 3 columns
##                           name  MouseRNA    ImmGen
##                    <character> <logical> <logical>
## CL:0000540              neuron      TRUE     FALSE
## CL:0000746 cardiac muscle cell      TRUE     FALSE
## 
## 
## $`CL:0000623`
## $`CL:0000623`$name
## [1] "natural killer cell"
## 
## $`CL:0000623`$descendents
## DataFrame with 2 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000623    natural killer cell      TRUE      TRUE
## CL:0002438 NK1.1-positive natur..     FALSE      TRUE
## 
## 
## $`CL:0000813`
## $`CL:0000813`$name
## [1] "memory T cell"
## 
## $`CL:0000813`$descendents
## DataFrame with 2 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000897 CD4-positive, alpha-..     FALSE      TRUE
## CL:0000909 CD8-positive, alpha-..     FALSE      TRUE
## 
## 
## $`CL:0000815`
## $`CL:0000815`$name
## [1] "regulatory T cell"
## 
## $`CL:0000815`$descendents
## DataFrame with 2 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000792 CD4-positive, CD25-p..     FALSE      TRUE
## CL:0000815      regulatory T cell     FALSE      TRUE
## 
## 
## $`CL:0000819`
## $`CL:0000819`$name
## [1] "B-1 B cell"
## 
## $`CL:0000819`$descendents
## DataFrame with 2 rows and 3 columns
##                   name  MouseRNA    ImmGen
##            <character> <logical> <logical>
## CL:0000820 B-1a B cell     FALSE      TRUE
## CL:0000821 B-1b B cell     FALSE      TRUE

For example, we might notice that the mouse RNA-seq reference only has a single “T cell” term. To synchronize resolution across references, we would need to roll up all of the ImmGen’s finely resolved subsets into that LCA as shown below. Of course, this results in some loss of precision and information; whether this is an acceptable price for simpler interpretation is a decision that is left to the user.

## DataFrame with 35 rows and 3 columns
##                              name  MouseRNA    ImmGen
##                       <character> <logical> <logical>
## CL:0000084                 T cell      TRUE      TRUE
## CL:0002427 resting double-posit..     FALSE      TRUE
## CL:0000809 double-positive, alp..     FALSE      TRUE
## CL:0002429 CD69-positive double..     FALSE      TRUE
## CL:0000624 CD4-positive, alpha-..     FALSE      TRUE
## ...                           ...       ...       ...
## CL:0002415 immature Vgamma1.1-p..     FALSE      TRUE
## CL:0002411 Vgamma1.1-positive, ..     FALSE      TRUE
## CL:0002416 mature Vgamma1.1-pos..     FALSE      TRUE
## CL:0002407 mature Vgamma2-posit..     FALSE      TRUE
## CL:0000815      regulatory T cell     FALSE      TRUE

Session information

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=C              
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] igraph_1.3.5                celldex_1.7.0              
 [3] SummarizedExperiment_1.28.0 Biobase_2.58.0             
 [5] GenomicRanges_1.50.0        GenomeInfoDb_1.34.0        
 [7] IRanges_2.32.0              S4Vectors_0.36.0           
 [9] BiocGenerics_0.44.0         MatrixGenerics_1.10.0      
[11] matrixStats_0.62.0          ontoProc_1.20.0            
[13] ontologyIndex_2.10          BiocStyle_2.26.0           
[15] rebook_1.8.0               

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                  bit64_4.0.5                  
 [3] filelock_1.0.2                httr_1.4.4                   
 [5] Rgraphviz_2.42.0              tools_4.2.1                  
 [7] bslib_0.4.0                   utf8_1.2.2                   
 [9] R6_2.5.1                      DT_0.26                      
[11] DBI_1.1.3                     withr_2.5.0                  
[13] tidyselect_1.2.0              bit_4.0.4                    
[15] curl_4.3.3                    compiler_4.2.1               
[17] graph_1.76.0                  cli_3.4.1                    
[19] DelayedArray_0.24.0           bookdown_0.29                
[21] sass_0.4.2                    rappdirs_0.3.3               
[23] stringr_1.4.1                 digest_0.6.30                
[25] rmarkdown_2.17                XVector_0.38.0               
[27] pkgconfig_2.0.3               htmltools_0.5.3              
[29] sparseMatrixStats_1.10.0      dbplyr_2.2.1                 
[31] fastmap_1.1.0                 htmlwidgets_1.5.4            
[33] rlang_1.0.6                   RSQLite_2.2.18               
[35] DelayedMatrixStats_1.20.0     shiny_1.7.3                  
[37] paintmap_1.0                  jquerylib_0.1.4              
[39] generics_0.1.3                jsonlite_1.8.3               
[41] dplyr_1.0.10                  RCurl_1.98-1.9               
[43] magrittr_2.0.3                GenomeInfoDbData_1.2.9       
[45] Matrix_1.5-1                  Rcpp_1.0.9                   
[47] fansi_1.0.3                   lifecycle_1.0.3              
[49] stringi_1.7.8                 yaml_2.3.6                   
[51] zlibbioc_1.44.0               BiocFileCache_2.6.0          
[53] AnnotationHub_3.6.0           grid_4.2.1                   
[55] blob_1.2.3                    promises_1.2.0.1             
[57] ExperimentHub_2.6.0           crayon_1.5.2                 
[59] dir.expiry_1.6.0              lattice_0.20-45              
[61] Biostrings_2.66.0             ontologyPlot_1.6             
[63] KEGGREST_1.38.0               CodeDepends_0.6.5            
[65] knitr_1.40                    pillar_1.8.1                 
[67] codetools_0.2-18              XML_3.99-0.12                
[69] glue_1.6.2                    BiocVersion_3.16.0           
[71] evaluate_0.17                 BiocManager_1.30.19          
[73] png_0.1-7                     vctrs_0.5.0                  
[75] httpuv_1.6.6                  purrr_0.3.5                  
[77] assertthat_0.2.1              cachem_1.0.6                 
[79] xfun_0.34                     mime_0.12                    
[81] xtable_1.8-4                  later_1.3.0                  
[83] tibble_3.1.8                  AnnotationDbi_1.60.0         
[85] memoise_2.0.1                 ellipsis_0.3.2               
[87] interactiveDisplayBase_1.36.0

Bibliography

Aran, D., A. P. Looney, L. Liu, E. Wu, V. Fong, A. Hsu, S. Chak, et al. 2019. “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol. 20 (2): 163–72.

Heng, Tracy S.P., Michio W. Painter, Kutlu Elpek, Veronika Lukacs-Kornek, Nora Mauermann, Shannon J. Turley, Daphne Koller, et al. 2008. “The immunological genome project: Networks of gene expression in immune cells.” Nature Immunology 9 (10): 1091–4. https://doi.org/10.1038/ni1008-1091.