--- title: "ontoProc: RDF ontology processing for Bioconductor" author: "Vincent J. Carey, stvjc at channing.harvard.edu" date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{ontoProc: RDF ontology processing} %\VignetteEncoding{UTF-8} output: BiocStyle::pdf_document: toc: yes number_sections: yes BiocStyle::html_document: highlight: pygments number_sections: yes theme: united toc: yes --- ```{r setupp,echo=FALSE,results="hide"} suppressWarnings({ suppressPackageStartupMessages({ library(ontoProc) library(org.Mm.eg.db) library(org.Hs.eg.db) }) }) ``` # Executive summary The ontoProc package was developed to facilitate the coding of an ontology-driven visualizer of transcriptomic patterns in single-cell RNA-seq studies ([tenXplore](http://github.com/vjcitn/tenXplore)). ![dashsnap](dashboard.png) # Introduction Our primary objective is facilitating use of ontological metadata to simplify construction of formally annotated hierarchies of samples or features that should be traversed in analysis of complex genomic experiments. ## An enumeration of cell types We used the [Experimental Factor Ontology](https://www.ebi.ac.uk/efo/) 'cell type' class ([EFO_0000324](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000324)) to obtain an enumeration of cell types. As of August 22 2017 it is an open question whether [Cell Ontology](http://obofoundry.org/ontology/cl.html) should be used for this purpose. The author's subjective impression is that EFO has a simpler collection of terms for cell types, while Cell Ontology has a better collection of terms for types of neurons. ## Basic operations using ontologyIndex facilities This package ships with an R serialization of an OBO representation of the [Cell Ontology](http://obofoundry.org/ontology/cl.html). This is created using `get_OBO` in `r CRANpkg("ontologyIndex")`. (For ontologies only available in OWL format, the python pronto package was used to convert to OBO.) ```{r useCO} library(ontoProc) cellOnto = getCellOnto() cellOnto ``` At this time, elementary manipulations of the ontology involve collecting the children, siblings, or labels for given URIs. ```{r useCO2} cochil = children_TAG("CL:0000540", cellOnto) cochil label_TAG("CL:0000540", cellOnto) siblings_TAG("CL:0000540", cellOnto) ``` # Application: finding genes annotated to neuron subtypes We focus on mouse. The neuron subtypes identified as OWL subclasses of "neuron" have names ```{r getcl} cleanNames = function(tset) { slot(tset,"cleanFrame")$clean } cleanNames(cochil) ``` We would like to see if the expression data would allow us to discriminate neurons of these different types. ## Bridging from Cell Ontology to mouse genes There is no formal linkage at present between terms of Cell Ontology and those of Gene Ontology. Research on inference of tissue of origin from expression signatures has led to accurate classifiers (Lee, Krishnan, Troyanskaya) and applications in cell mixture deconvolution (Houseman). Formal work in ontology bridging has been described but the specific task of mapping from Cell Ontology terms to Gene Ontology terms has not culminated in any programmatically available resource. We apply approximate pattern matching (agrep in R) to find gene ontology terms that are apparently relevant to cell type vocabulary terms of interest. These are then mapped to gene annotation. Simple (non-vectorized) functions that accomplish this in an organism-specific are straightforward using the OrgDb packages. We serialized all GO terms for convenience with this package, in the data object `allGOterms`. ```{r lkfuns} data(allGOterms) cellTypeToGO("serotonergic neuron", gotab=allGOterms) cellTypeToGenes("serotonergic neuron", orgDb=org.Mm.eg.db, gotab=allGOterms) cellTypeToGenes("serotonergic neuron", orgDb=org.Hs.eg.db, gotab=allGOterms) ``` ## Discrimination of neuron types: exploratory multivariate analysis At this point the API for selecting cell types, bridging to gene sets, and acquiring expression data, is not well-modularized. Thus the best ways to get a feel for it are to use tenXplore() function, and to read the source code. In brief, we often fail to find GO terms that approximately match, as strings, Cell Ontology terms corresponding to cell subtypes. On the other hand, if we match on cell types, we get very large numbers of matches, which, at this time, will need to be filtered to get manageable feature sets. We will introduce tools for generating additional RDF to improve gene harvesting in real time. But the associated statements will need to be curated. The EBI Webulous system should be useful for introducing new terms that facilitate better connections between anatomic structures and sets of genes or other genomic features.