The ambitions of collaborative single cell biology will only be achieved through the coordinated efforts of many groups, to help clarify cell types and dynamics in an array of functional and environmental contexts. The use of formal ontology in this pursuit is well-motivated and research progress has already been substantial.
Bakken et al. (2017) discuss “strategies for standardized cell type representations based on the data outputs from [high-content flow cytometry and single cell RNA sequencing], including ‘context annotations’ in the form of standardized experiment metadata about the specimen source analyzed and marker genes that serve as the most useful features in machine learning-based cell type classification models.” Aevermann et al. (2018) describe how the FAIR principles can be implemented using statistical identification of necessary and sufficient conditions for determining cell class membership. They propose that Cell Ontology can be transformed to a broadly usable knowledgebase through the incorporation of accurate marker gene signatures for cell classes.
In this vignette, we review key concepts and tasks required to make progress in the adoption and application of ontological discipline in Bioconductor-oriented data analysis.
The following table describes the resources available with the get* commands defined in ontoProc.
kable(packDesc2019)
X | func | purpose | nclass | nprop | nroots | datav | fmtv |
---|---|---|---|---|---|---|---|
1 | getCellLineOnto | Cell line catalog | 41780 | 6 | 18 | NA | NA |
2 | getCellOnto | Cell biology concepts | 6708 | 59 | 38 | releases/2018-07-07 | 1.2 |
3 | getCellosaurusOnto | Cell line concepts | 87311 | 6 | 87311 | 23 | 1.2 |
4 | getChebiLite | Chemicals of biological interest | 108496 | 6 | 12 | 155 | 1.2 |
5 | getChebiOnto | | 108496 | 33 | 12 | 155 | 1.2 |
6 | getDiseaseOnto | Human disease | 11283 | 24 | 13 | releases/2018-06-29 | 1.2 |
7 | getEFOOnto | Experimental factors | 20115 | 6 | 36 | 2.87 | 1.2 |
8 | getGeneOnto | Gene ontology | 47123 | 43 | 10 | releases/2018-03-27 | 1.2 |
9 | getHCAOnto | Human cell atlas | 11047 | 6 | 76 | NA | NA |
10 | getOncotreeOnto | Tumor relations | 1298 | 15 | 3 | ncit/releases/2017-12-15/ncit-oncotree.ttl | 1.2 |
11 | getPATOnto | Phenotypes and traits | 2670 | 43 | 21 | releases/2018-11-12 | 1.2 |
12 | getPROnto | Protein ontology | 315957 | 6 | 53 | 57 | 1.2 |
13 | getUBERON_NE | Anatomy | 14937 | 6 | 135 | releases/2017-09-09 | 1.2 |
Definitions, semantics. For concreteness, we provide some definitions and examples. We use ontology to denote the systematic organization of terminology used in a conceptual domain. The Cell Ontology is a graphical data structure with carefully annotated terms as nodes and conventionally defined semantic relationships among terms serving as edges. As an example, lung ciliated cell has a URI that includes a fixed-length identifier, CL_1000271, with unambiguous interpretation wherever it is encountered. There is a chain of relationships from lung ciliated cell up through ciliated cell, then native cell, then cell, each possessing its own URI and related interpretive metadata. The relationship connecting the more precise to the less precise term in this chain is denoted SubclassOf. Ciliated cell is equivalent to a native cell that has plasma membrane part cilium. Semantic characteristics of terms and relationships are used to infer relationships among terms that may not have relations directly specified in available ontologies.
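The SubclassOf chain described above can be sketched in base R. This is only an illustration, not the data structure used by ontoProc or ontologyIndex: the `parent` vector and the `ancestors_of` helper are hypothetical names, and terms are keyed by label rather than by URI to keep the example short.

```r
# Toy SubclassOf edges: names are child terms, values are their parents.
# (Hypothetical structure for illustration; real ontologies key terms by identifier.)
parent <- c(
  "lung ciliated cell" = "ciliated cell",
  "ciliated cell"      = "native cell",
  "native cell"        = "cell"
)

# Follow SubclassOf edges from a term up to the root.
ancestors_of <- function(term, parent) {
  chain <- character(0)
  while (term %in% names(parent)) {
    term <- parent[[term]]
    chain <- c(chain, term)
  }
  chain
}

lineage <- ancestors_of("lung ciliated cell", parent)
print(lineage)
```

Any annotation attached to a term in `lineage` applies to lung ciliated cell as well, which is the simplest form of the inference mentioned above.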
Barriers to broad adoption. Given the wealth of material available in biological ontologies, it is somewhat surprising that formal annotation is so seldom used in practice. Barriers to more common use of ontology in data annotation include: (i) The lack of an exact match between an intended term and the terms available in ontologies of interest. (ii) The practical problem of decoding ontology identifiers. A GO tag or CL tag is excellent for programming, but it is clumsy to co-locate with the tag the associated natural language term or phrase. (iii) The likelihood of disagreement about the suitability of terms for conditions observed at the boundaries of knowledge. To help cope with the first of these problems, Bioconductor’s ontoProc package includes a function liberalMap which will search an ontology for terms lexically close to some target term or phrase. The second problem can be addressed with more elaborate data structures for variable annotation and programming in R, and the third problem will diminish in importance as the value of ontology adoption becomes manifest in more applications.
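To make the idea of lexical closeness concrete, the following base R sketch ranks candidate labels by edit distance to a free-text phrase. It is not the implementation of liberalMap, whose matching rules may differ; `closest_terms` is a hypothetical helper, and the candidate labels are a handful of Cell Ontology names that appear elsewhere in this vignette.

```r
# Candidate ontology labels (a small hand-picked subset for illustration).
labels <- c("B cell", "monocyte", "CD14-positive monocyte",
            "natural killer cell", "dendritic cell", "megakaryocyte")

# Rank labels by case-insensitive edit distance to a free-text target.
closest_terms <- function(target, labels, n = 3) {
  d <- as.vector(utils::adist(target, labels, ignore.case = TRUE))
  ord <- order(d)[seq_len(min(n, length(labels)))]
  data.frame(label = labels[ord], distance = d[ord], stringsAsFactors = FALSE)
}

hits <- closest_terms("Dendritic Cells", labels)
print(hits)
```

Plain edit distance is a crude measure: it handles plurals and capitalization well but can rank a short generic label ahead of a longer, more specific one, so the candidates still need human review.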
Class vs. instance. It is important to distinguish the practice of designing and maintaining ontologies from the use of ontological class terms to annotate instances of the concepts. The combination of an ontology and a set of annotated instances is called a knowledge base. To illustrate some of the salient distinctions here, consider the cell line called A549, which was established from a human lung adenocarcinoma sample. There is no mention of A549 in the Cell Ontology. However, A549 is present in the EBI Experimental Factor Ontology as a subclass of the “Homo sapiens cell line” class. Presumably this is because A549 is a class of cells that are widely used experimentally, and this cell line constitutes a concept deserving of mapping in the universe of experimental factors. In the universe of concepts related to cell structure and function per se, A549 is an individual that can be characterized through possession of or lack of properties enumerated in Cell Ontology, but it is not deserving of inclusion in that ontology.
The 10X Genomics corporation has distributed a dataset on results of sequencing 10000 PBMC from a healthy donor. Subsets of the data are used in tutorials for the Seurat analytical suite (Butler et al. 2018).
One result of the tutorial analysis of the 3000 cell subset is a table of cell types and expression-based markers of cell identity. The first three columns of the table below are from concluding material in the Seurat tutorial; the remaining columns are created by “manual” matching between the Seurat terms and terms found in Cell Ontology.
kable(stab <- seur3kTab())
grp | markers | seurTutType | formal | tag |
---|---|---|---|---|
0 | IL7R | CD4 T cells | CD4-positive helper T cell | CL:0000492 |
1 | CD14, LYZ | CD14+ Monocytes | CD14-positive monocyte | CL:0001054 |
2 | MS4A1 | B cells | B cell | CL:0000236 |
3 | CD8A | CD8 T cells | CD8-positive, alpha-beta T cell | CL:0000625 |
4 | FCGR3A, MS4A7 | FCGR3A+ Monocytes | monocyte | CL:0000576 |
5 | GNLY, NKG7 | NK cells | natural killer cell | CL:0000623 |
6 | FCER1A, CST3 | Dendritic Cells | dendritic cell | CL:0000451 |
7 | PPBP | Megakaryocytes | megakaryocyte | CL:0000556 |
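One way to keep a tag and its natural-language label together is a small lookup structure. The sketch below uses base R only, with tags and formal names copied from the table above; `label_of`, `clusters`, and `annotated` are hypothetical names, and the cluster assignments are made up for illustration.

```r
# Tags and formal names copied from the table above.
pbmc_types <- data.frame(
  grp    = 0:7,
  tag    = c("CL:0000492", "CL:0001054", "CL:0000236", "CL:0000625",
             "CL:0000576", "CL:0000623", "CL:0000451", "CL:0000556"),
  formal = c("CD4-positive helper T cell", "CD14-positive monocyte", "B cell",
             "CD8-positive, alpha-beta T cell", "monocyte", "natural killer cell",
             "dendritic cell", "megakaryocyte"),
  stringsAsFactors = FALSE
)

# Named vector: look up the natural-language label by ontology tag.
label_of <- setNames(pbmc_types$formal, pbmc_types$tag)

# Hypothetical per-cell cluster assignments, annotated with tag and label together.
clusters <- c(2, 0, 5, 2)
idx <- match(clusters, pbmc_types$grp)
annotated <- data.frame(cluster = clusters,
                        tag     = pbmc_types$tag[idx],
                        label   = pbmc_types$formal[idx],
                        stringsAsFactors = FALSE)
print(annotated)
```

Carrying both columns means downstream code can join on the stable tag while reports show the readable label.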
Given the informally selected tags in the table above, we can sketch the Cell Ontology graph connecting the associated cell types. The ontoProc package adds functionality to ontologyPlot with make_graphNEL_from_ontology_plot. This allows use of all Rgraphviz and igraph visualization facilities for graphs derived from ontology structures.
Here we display the PBMC cell sets reported in the Seurat tutorial.
library(ontoProc)
cl = getCellOnto()
onto_plot2(cl, stab$tag)
The CLfeats function traces relationships and properties from a given Cell Ontology class. Briefly, each class can assert that it is the intersection_of other classes, and has_part, lacks_part, has_plasma_membrane_part, lacks_plasma_membrane_part can be asserted as relationships holding between cell type instances and cell components. The components are often cross-referenced to Protein Ontology or Gene Ontology. When the Protein Ontology component has a synonym for which an HGNC symbol is provided, that symbol is retrieved by CLfeats. Here we obtain the listing for a mature CD1a-positive dermal dendritic cell.
kable(CLfeats(cl, "CL:0002531"))
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## no recognized predicate references for CL:0000738
## no recognized predicate references for CL:0000988
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## no recognized predicate references for CL:0000766
## no recognized predicate references for CL:0000219
## no recognized predicate references for CL:0000003
## no recognized predicate references for CL:0000000
 | tag | prtag | cond | entity | SYMBOL | name |
---|---|---|---|---|---|---|
1 | CL:0002531 | PR:000001310 | hasPMP | CD83 molecule | CD83 | mature CD1a-positive dermal dendritic cell |
2 | CL:0002531 | GO:0042613 | hiPMAmt | MHC class II protein complex | NA | mature CD1a-positive dermal dendritic cell |
3 | CL:0002531 | PR:000001412 | hiPMAmt | CD86 molecule | CD86 | mature CD1a-positive dermal dendritic cell |
4 | CL:0002531 | PR:000001438 | hiPMAmt | CD80 molecule | CD80 | mature CD1a-positive dermal dendritic cell |
CL:0002529 | CL:0002529 | PR:000002025 | hasPMP | T-cell surface glycoprotein CD1a | CD1A | CD1a-positive dermal dendritic cell |
11 | CL:0001006 | PR:000001012 | hasPMP | integrin alpha-M | ITGAM | dermal dendritic cell |
21 | CL:0001006 | PR:000001026 | hasPMP | lymphocyte antigen 75 | LY75 | dermal dendritic cell |
31 | CL:0001006 | PR:000001084 | hasPMP | T-cell surface glycoprotein CD8 alpha chain | CD8A | dermal dendritic cell |
CL:0000990 | CL:0000990 | PR:000001013 | hiPMAmt | integrin alpha-X | ITGAX | conventional dendritic cell |
12 | CL:0000451 | PR:000001002 | lacksPMP | CD19 molecule | CD19 | dendritic cell |
22 | CL:0000451 | PR:000001003 | lacksPMP | CD34 molecule | CD34 | dendritic cell |
32 | CL:0000451 | PR:000001020 | lacksPMP | CD3 epsilon | CD3E | dendritic cell |
41 | CL:0000451 | PR:000001024 | lacksPMP | neural cell adhesion molecule 1 | NCAM1 | dendritic cell |
5 | CL:0000451 | PR:000001289 | lacksPMP | membrane-spanning 4-domains subfamily A member 1 | MS4A1 | dendritic cell |
6 | CL:0000451 | GO:0042613 | hasPart | MHC class II protein complex | NA | dendritic cell |
13 | CL:0001010 | PR:000001310 | hasPMP | CD83 molecule | CD83 | mature dermal dendritic cell |
23 | CL:0001010 | GO:0042613 | hiPMAmt | MHC class II protein complex | NA | mature dermal dendritic cell |
33 | CL:0001010 | PR:000001412 | hiPMAmt | CD86 molecule | CD86 | mature dermal dendritic cell |
42 | CL:0001010 | PR:000001438 | hiPMAmt | CD80 molecule | CD80 | mature dermal dendritic cell |
14 | CL:0000841 | PR:000001310 | hasPMP | CD83 molecule | CD83 | mature conventional dendritic cell |
24 | CL:0000841 | GO:0042613 | hiPMAmt | MHC class II protein complex | NA | mature conventional dendritic cell |
34 | CL:0000841 | PR:000001412 | hiPMAmt | CD86 molecule | CD86 | mature conventional dendritic cell |
43 | CL:0000841 | PR:000001438 | hiPMAmt | CD80 molecule | CD80 | mature conventional dendritic cell |
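The table above mixes properties asserted on the requested class with properties asserted on its ancestors. A rough base R sketch of that kind of accumulation follows. It is not the CLfeats implementation: `collect_feats` is a hypothetical helper, only four rows of the table are reproduced, and the single `is_a` chain is a simplifying assumption (in the real ontology a class may have several parents).

```r
# Asserted properties per class, copied from a few rows of the table above.
asserted <- data.frame(
  tag    = c("CL:0002531", "CL:0002529", "CL:0001006", "CL:0000451"),
  cond   = c("hasPMP", "hasPMP", "hasPMP", "lacksPMP"),
  SYMBOL = c("CD83", "CD1A", "ITGAM", "CD19"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: a single simplified is_a chain; the real ontology is a graph
# in which a class can have several parents.
is_a <- c("CL:0002531" = "CL:0002529",
          "CL:0002529" = "CL:0001006",
          "CL:0001006" = "CL:0000451")

# Gather the properties asserted on a class and on each of its ancestors.
collect_feats <- function(tag, is_a, asserted) {
  lineage <- tag
  while (tag %in% names(is_a)) {
    tag <- is_a[[tag]]
    lineage <- c(lineage, tag)
  }
  out <- asserted[asserted$tag %in% lineage, ]
  out[order(match(out$tag, lineage)), ]
}

feats <- collect_feats("CL:0002529", is_a, asserted)
print(feats)
```

Starting from CL:0002529, the result omits the CD83 row because that property is asserted only on the more specific class CL:0002531.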
The ctmarks function starts a shiny app that generates tables of this sort for selected cell types.
The sym2CellOnto function helps find mentions of given gene symbols in properties or parts of cell types.
pr = getPROnto()  # Protein Ontology, used to resolve gene symbols
kable(sdf <- as.data.frame(sym2CellOnto("ITGAM", cl, pr)))
sym | cond | cl | type |
---|---|---|---|
ITGAM | hasPMP | CL:0000040 | monoblast |
ITGAM | hasPMP | CL:0000094 | granulocyte |
ITGAM | hasPMP | CL:0000129 | microglial cell |
ITGAM | hasPMP | CL:0000559 | promonocyte |
ITGAM | hasPMP | CL:0000560 | band form neutrophil |
ITGAM | hasPMP | CL:0000580 | neutrophilic myelocyte |
ITGAM | hasPMP | CL:0000582 | neutrophilic metamyelocyte |
ITGAM | hasPMP | CL:0000612 | eosinophilic myelocyte |
ITGAM | hasPMP | CL:0000614 | basophilic myelocyte |
ITGAM | hasPMP | CL:0000769 | basophilic metamyelocyte |
ITGAM | hasPMP | CL:0000773 | eosinophilic metamyelocyte |
ITGAM | hasPMP | CL:0000861 | elicited macrophage |
ITGAM | hasPMP | CL:0000889 | myeloid suppressor cell |
ITGAM | hasPMP | CL:0001006 | dermal dendritic cell |
ITGAM | hasPMP | CL:0001007 | interstitial dendritic cell |
ITGAM | hasPMP | CL:0001022 | CD115-positive monocyte |
ITGAM | hasPMP | CL:0002058 | Gr1-low non-classical monocyte |
ITGAM | hasPMP | CL:0002395 | Gr1-high classical monocyte |
ITGAM | hasPMP | CL:0002398 | Gr1-positive, CD43-positive monocyte |
ITGAM | hasPMP | CL:0002426 | CD11b-positive, CD27-positive natural killer cell |
ITGAM | hasPMP | CL:0002457 | epidermal Langerhans cell |
ITGAM | hasPMP | CL:0002459 | langerin-negative dermal dendritic cell |
ITGAM | hasPMP | CL:0002465 | CD11b-positive dendritic cell |
ITGAM | hasPMP | CL:0011114 | segmented neutrophil of bone marrow |
ITGAM | lacksPMP | CL:0000037 | hematopoietic stem cell |
ITGAM | lacksPMP | CL:0000547 | proerythroblast |
ITGAM | lacksPMP | CL:0000553 | megakaryocyte progenitor cell |
ITGAM | lacksPMP | CL:0000558 | reticulocyte |
ITGAM | lacksPMP | CL:0000611 | eosinophil progenitor cell |
ITGAM | lacksPMP | CL:0000613 | basophil progenitor cell |
ITGAM | lacksPMP | CL:0000765 | erythroblast |
ITGAM | lacksPMP | CL:0000831 | mast cell progenitor |
ITGAM | lacksPMP | CL:0000836 | promyelocyte |
ITGAM | lacksPMP | CL:0000837 | hematopoietic multipotent progenitor cell |
ITGAM | lacksPMP | CL:0000872 | splenic marginal zone macrophage |
ITGAM | lacksPMP | CL:0000941 | thymic conventional dendritic cell |
ITGAM | lacksPMP | CL:0000942 | thymic plasmacytoid dendritic cell |
ITGAM | lacksPMP | CL:0000989 | CD11c-low plasmacytoid dendritic cell |
ITGAM | lacksPMP | CL:0000998 | CD8_alpha-negative CD11b-negative dendritic cell |
ITGAM | lacksPMP | CL:0001000 | CD8_alpha-positive CD11b-negative dendritic cell |
ITGAM | lacksPMP | CL:0001029 | common dendritic progenitor |
ITGAM | lacksPMP | CL:0001060 | hematopoietic oligopotent progenitor cell, lineage-negative |
ITGAM | lacksPMP | CL:0001066 | erythroid progenitor cell, mammalian |
ITGAM | lacksPMP | CL:0002010 | pre-conventional dendritic cell |
ITGAM | lacksPMP | CL:0002089 | group 2 innate lymphoid cell, mouse |
ITGAM | lacksPMP | CL:0002458 | langerin-positive dermal dendritic cell |
ITGAM | lacksPMP | CL:0002679 | natural helper lymphocyte |
ITGAM | hiPMAmt | CL:0000581 | peritoneal macrophage |
ITGAM | hiPMAmt | CL:0002347 | CD27-high, CD11b-high natural killer cell |
ITGAM | hiPMAmt | CL:0002348 | CD27-low, CD11b-high natural killer cell |
ITGAM | hiPMAmt | CL:0002474 | lymphoid MHC-II-negative classical monocyte |
ITGAM | hiPMAmt | CL:0002475 | lymphoid MHC-II-negative non-classical monocyte |
ITGAM | hiPMAmt | CL:0002505 | liver CD103-negative dendritic cell |
ITGAM | hiPMAmt | CL:0002510 | CD103-negative, langerin-positive lymph node dendritic cell |
ITGAM | hiPMAmt | CL:0002512 | CD11b-high, CD103-negative, langerin-negative lymph node dendritic cell |
ITGAM | loPMAmt | CL:0000091 | Kupffer cell |
ITGAM | loPMAmt | CL:0000583 | alveolar macrophage |
ITGAM | loPMAmt | CL:0000868 | lymph node macrophage |
ITGAM | loPMAmt | CL:0000874 | splenic red pulp macrophage |
ITGAM | loPMAmt | CL:0002345 | CD27-low, CD11b-low immature natural killer cell |
ITGAM | loPMAmt | CL:0002349 | CD27-high, CD11b-low natural killer cell |
ITGAM | loPMAmt | CL:0002506 | liver CD103-positive dendritic cell |
ITGAM | loPMAmt | CL:0002509 | CD103-positive, langerin-positive lymph node dendritic cell |
ITGAM | loPMAmt | CL:0002511 | CD11b-low, CD103-negative, langerin-negative lymph node dendritic cell |
table(sdf$cond)
##
## hasPMP lacksPMP hiPMAmt loPMAmt
## 24 23 8 9
kable(as.data.frame(sym2CellOnto("FOXP3", cl, pr)))
sym | cond | cl | type |
---|---|---|---|
FOXP3 | hasPart | CL:0000902 | induced T-regulatory cell |
FOXP3 | hasPart | CL:0000903 | natural T-regulatory cell |
FOXP3 | hasPart | CL:0000919 | CD8-positive, CD25-positive, alpha-beta regulatory T cell |
FOXP3 | hasPart | CL:0000920 | CD8-positive, CD28-negative, alpha-beta regulatory T cell |
The task of extending an ontology is partly bureaucratic in nature and depends on a collection of endorsements and updates to centralized information structures. In order to permit experimentation with interfaces and new content that may be quite speculative, we include an approach to combining new ontology ‘terms’ of structure similar to those endorsed in Cell Ontology, to ontologyIndex-based ontology_index instances.
For a demonstration, we consider the discussion in Bakken et al. (2017) of a ‘diagonal’ expression pattern defining a group of novel cell types. A set of genes is identified and cells are distinguished by expressing exactly one gene from the set.
The necessary information is collected in a vector. The vector holds the set of genes; the name of element i is the tag to be associated with the type of cell that expresses gene i and does not express any other gene in the set.
sigels = c("CL:X01"="GRIK3", "CL:X02"="NTNG1", "CL:X03"="BAGE2",
"CL:X04"="MC4R", "CL:X05"="PAX6", "CL:X06"="TSPAN12",
"CL:X07"="hSHISA8", "CL:X08"="SNCG", "CL:X09"="ARHGEF28",
"CL:X10"="EGF")
The cyclicSigset function produces a data.frame instance connecting cell types with the genes expressed or unexpressed.
cs = cyclicSigset(sigels)
dim(cs)
## [1] 100 3
cs[c(1:5,9:13),]
## gene type cond
## 1 ARHGEF28 CL:X09 hasExp
## 2 GRIK3 CL:X09 lacksExp
## 3 NTNG1 CL:X09 lacksExp
## 4 BAGE2 CL:X09 lacksExp
## 5 MC4R CL:X09 lacksExp
## 9 SNCG CL:X09 lacksExp
## 10 EGF CL:X09 lacksExp
## 11 BAGE2 CL:X03 hasExp
## 12 GRIK3 CL:X03 lacksExp
## 13 NTNG1 CL:X03 lacksExp
table(cs$cond)
##
## hasExp lacksExp
## 10 90
It is expected that a tabular layout like this will suffice to handle general situations of cell type definition.
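The layout can be reproduced in base R to show what the table encodes. This sketch is not the cyclicSigset implementation, and its row order differs from the output shown above; `diagonal_table` is a hypothetical helper that emits, for each cell type, one hasExp row and a lacksExp row for every other gene in the set.

```r
# Signature vector as defined in the text: names are cell type tags, values are genes.
sigels <- c("CL:X01"="GRIK3", "CL:X02"="NTNG1", "CL:X03"="BAGE2",
            "CL:X04"="MC4R", "CL:X05"="PAX6", "CL:X06"="TSPAN12",
            "CL:X07"="hSHISA8", "CL:X08"="SNCG", "CL:X09"="ARHGEF28",
            "CL:X10"="EGF")

# For each type, mark its own gene as expressed and all others as unexpressed.
diagonal_table <- function(sig) {
  genes <- unname(sig)
  rows <- lapply(names(sig), function(tag) {
    data.frame(gene = genes,
               type = tag,
               cond = ifelse(genes == sig[[tag]], "hasExp", "lacksExp"),
               stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

dt <- diagonal_table(sigels)
print(dim(dt))
print(table(dt$cond))
```

With 10 genes the table has 10 x 10 = 100 rows, of which 10 are hasExp and 90 are lacksExp, matching the counts reported above.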
The most complicated aspect of novel OBO term construction is the proper specification of relationships with existing ontology components. A prolog that is mostly shared by all terms is generated programmatically for the diagonal pattern task.
makeIntnProlog = function(id, ...) {
  # make type-specific prologs as key-value pairs
  c(
    sprintf("id: %s", id),
    sprintf("name: %s-expressing cortical layer 1 interneuron, human", ...),
    sprintf("def: '%s-expressing cortical layer 1 interneuron, human described via RNA-seq observations' [PMID 29322913]", ...),
    "is_a: CL:0000099 ! interneuron",
    "intersection_of: CL:0000099 ! interneuron")
}
The ldfToTerms API uses this to create a set of strings that can be parsed as a term.
pmap = c("hasExp"="has_expression_of", lacksExp="lacks_expression_of")
head(unlist(tms <- ldfToTerms(cs, pmap, sigels, makeIntnProlog)), 20)
## [1] "[Term]"
## [2] "id: CL:X01"
## [3] "name: GRIK3-expressing cortical layer 1 interneuron, human"
## [4] "def: 'GRIK3-expressing cortical layer 1 interneuron, human described via RNA-seq observations' [PMID 29322913]"
## [5] "is_a: CL:0000099 ! interneuron"
## [6] "intersection_of: CL:0000099 ! interneuron"
## [7] "has_expression_of: PR:000008242 ! GRIK3"
## [8] "lacks_expression_of: PR:000011467 ! NTNG1"
## [9] "lacks_expression_of: PR:000004625 ! BAGE2"
## [10] "lacks_expression_of: PR:000001237 ! MC4R"
## [11] "lacks_expression_of: PR:000012318 ! PAX6"
## [12] "lacks_expression_of: PR:000016738 ! TSPAN12"
## [13] "lacks_expression_of: PR:B8ZZ34 ! hSHISA8"
## [14] "lacks_expression_of: PR:000015325 ! SNCG"
## [15] "lacks_expression_of: PR:000013942 ! ARHGEF28"
## [16] "lacks_expression_of: PR:000006928 ! EGF"
## [17] "[Term]"
## [18] "id: CL:X02"
## [19] "name: NTNG1-expressing cortical layer 1 interneuron, human"
## [20] "def: 'NTNG1-expressing cortical layer 1 interneuron, human described via RNA-seq observations' [PMID 29322913]"
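The structure of one stanza in the output above can be imitated in a few lines of base R. This is a simplified sketch rather than the ldfToTerms implementation: `make_stanza` is a hypothetical helper, the prolog is reduced to id and name, and only three genes are included, with Protein Ontology identifiers copied from the output above.

```r
# Gene-to-Protein-Ontology identifiers copied from the output above (subset).
pr_id <- c(GRIK3 = "PR:000008242", NTNG1 = "PR:000011467", BAGE2 = "PR:000004625")

# One cell type: expresses GRIK3 and lacks the other two genes.
conds <- data.frame(gene = c("GRIK3", "NTNG1", "BAGE2"),
                    cond = c("hasExp", "lacksExp", "lacksExp"),
                    stringsAsFactors = FALSE)
pmap <- c(hasExp = "has_expression_of", lacksExp = "lacks_expression_of")

# Build the lines of a single [Term] stanza: header, id, name, then one
# "predicate: identifier ! symbol" line per condition row.
make_stanza <- function(id, name, conds, pmap, pr_id) {
  c("[Term]",
    sprintf("id: %s", id),
    sprintf("name: %s", name),
    sprintf("%s: %s ! %s",
            unname(pmap[conds$cond]), unname(pr_id[conds$gene]), conds$gene))
}

stanza <- make_stanza("CL:X01",
                      "GRIK3-expressing cortical layer 1 interneuron, human",
                      conds, pmap, pr_id)
writeLines(stanza)
```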
The content in tms can then be appended to the content of the Cell Ontology cl.obo as text for import with ontologyIndex::get_OBO.
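The appending step amounts to concatenating text before parsing. The sketch below shows one way to do it with base R, assuming short stand-ins for both inputs: `base_obo` is a made-up fragment in place of the real cl.obo, `new_term` is a hand-written stanza in place of tms, and `combined_path` is a hypothetical name.

```r
# Stand-in for the contents of cl.obo (a tiny fragment, for illustration only).
base_obo <- c("format-version: 1.2", "",
              "[Term]", "id: CL:0000099", "name: interneuron")

# Stand-in for the strings produced for one new term.
new_term <- c("[Term]", "id: CL:X01",
              "name: GRIK3-expressing cortical layer 1 interneuron, human",
              "is_a: CL:0000099 ! interneuron")

# Write the combined text to a temporary file, separating stanzas with a blank line.
combined_path <- tempfile(fileext = ".obo")
writeLines(c(base_obo, "", new_term), combined_path)

# Read the file back to confirm both stanzas are present.
combined <- readLines(combined_path)
print(sum(combined == "[Term]"))
```

The path in `combined_path` is what would then be handed to ontologyIndex::get_OBO. Writing to a temporary file leaves the original ontology file untouched, which suits speculative terms.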
Aevermann, Brian D., Mark Novotny, Trygve Bakken, Jeremy A. Miller, Alexander D. Diehl, David Osumi-Sutherland, Roger S. Lasken, Ed S. Lein, and Richard H. Scheuermann. 2018. “Cell type discovery using single-cell transcriptomics: Implications for ontological representation.” Human Molecular Genetics 27 (R1):R40–R47. https://doi.org/10.1093/hmg/ddy100.
Bakken, Trygve, Lindsay Cowell, Brian D. Aevermann, Mark Novotny, Rebecca Hodge, Jeremy A. Miller, Alexandra Lee, et al. 2017. “Cell type discovery and representation in the era of high-content single cell phenotyping.” BMC Bioinformatics 18 (Suppl 17). https://doi.org/10.1186/s12859-017-1977-1.
Butler, Andrew, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. 2018. “Integrating Single-Cell Transcriptomic Data Across Different Conditions, Technologies, and Species.” Nature Biotechnology. https://doi.org/10.1038/nbt.4096.
Westbury, Sarah K., Ernest Turro, Daniel Greene, Claire Lentaigne, Anne M. Kelly, Tadbir K. Bariana, Ilenia Simeoni, et al. 2015. “Human Phenotype Ontology Annotation and Cluster Analysis to Unravel Genetic Defects in 707 Cases with Unexplained Bleeding and Platelet Disorders.” Genome Medicine 7 (1):36. https://doi.org/10.1186/s13073-015-0151-5.