---
title: "An introduction to biodb"
author: "Pierrick Roger"
date: "`r BiocStyle::doc_date()`"
package: "`r BiocStyle::pkg_ver('biodb')`"
vignette: |
  %\VignetteIndexEntry{Introduction to the biodb package.}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
abstract: |
  A quick tour of the various features of *biodb*: creating connectors,
  accessing entries, converting entries into a data frame, searching for
  entries, dealing with mass spectra, existing connectors.
output:
  BiocStyle::html_document:
    toc: yes
    toc_depth: 4
    toc_float:
      collapsed: false
  BiocStyle::pdf_document: default
bibliography: references.bib
---

```{r, echo=FALSE}
source(system.file('vignettes_inc.R', package='biodb'))
```

# Introduction

<!-- What offers biodb. -->
*biodb* provides access to **chemical**, **biological** and **mass spectra**
databases, and offers a **development framework** that facilitates the writing
of new connectors.

<!-- Many databases. -->
Numerous public databases are available for scientific research, but few are
easily accessible from a programming environment, making it hard for most of
researchers to use their content.
Developing a code to access databases and keep it up to date with the
evolutions of these databases are two time consuming tasks. It is thus greatly
preferable to use an already developed package.

<!-- Package incomplete, no capitalization. -->
In R, packages with public database connectors, most often propose to connect
to one single database with a specific API, and do not offer a **development
framework** [@szocs2020_webchem; @guha2016_rpubchem; @tenenbaum2020_keggrest;
@carey2020_hmbdQuery; @soudy2020_uniprotr; @carlson2020_uniprotws;
@wolf2019_chemspiderapi; @stravs2013_rmassbank; @drost2017_biomartr;
@winter2020_rentrez].
When a package does not offer the services the scientific programmer
requests, or when no package exists for the targeted database, a homemade
solution is implemented.
In such a case, the effort spent is often lost and never capitalized for
sharing with the community.

<!-- Development framework. -->
*biodb* has been designed and implemented as a unified API to databases and a
**development framework**.
The unified API allows to access the databases in a standardized way, while
allowing original database services to be accessed directly.
The **development framework** has for goal to help scientific programmers to
capitalize on their effort to improve connection to databases and share it with
the community.
The framework lowers the effort needed by the developer to improve an existing
connector or implement a new one.
Most *biodb* connectors are distributed inside separated packages, that are
automatically recognized by the main package.
This system of extensions gives more independence for developing new connectors
and distributing them, since developers do not need to request any modification
inside the main package code.

<!-- Database services. -->
The database services provided by the unified API of *biodb*: retrieval of
entries, chemical and biological compound search by mass and name, mass spectra
annotation, MSMS matching, read and write of in-house local databases.
Alongside the unified API, connectors to public databases furnishes also access
to specific web services through dedicated methods.
See table \@ref(tab:features) for a list of available features.

```{r, echo=FALSE, results='asis'}
insert_features_table()
```

<!-- What we will see in this vignette -->
In this vignette we will introduce you to the basic features of *biodb*,
allowing you to be quickly productive.
Pointers toward other documents are included along the way, for going into
details or learning advanced features.

For a complete list of features, see vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('details')
```
for a more more information of *biodb* with other packages.

# Installation

Install using Bioconductor:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install('biodb')
```

# Initialization

The first step in using *biodb*, is to create an instance of the
main class `BiodbMain`. This is done by calling the constructor of
the class:
```{r, results='hide'}
mybiodb <- biodb::newInst()
```
During this step the configuration is set up, the cache system is initialized
and extension packages are loaded.

We will see at the end of this vignette that the *biodb* instance needs to be
terminated with a call to the `terminate()` method.

# Connecting to a database

In *biodb* the connection to a database is handled by a connector instance that
you can get from the factory.
Here we create a connector to a CSV file database (see \@ref(tab:compTable) for
content) of chemical compounds:
```{r}
compUrl <- system.file("extdata", "chebi_extract.tsv", package='biodb')
compdb <- mybiodb$getFactory()$createConn('comp.csv.file', url=compUrl)
```
The two parameters passed to the `createConn()` are the identifier of the
Compound CSV File connector class and the URL (i.e.: the path) of the TSV file.
With this connector instance you are now able to get entries and search for
them by either name or mass.
By default *biodb* will use the TAB character as separator for the CSV file,
and the standard *biodb* entry field names for the column names of the file.
To load a CSV file with a different separator and custom column names, you have
to define them inside the connector instance.
Please see vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('details')
```
for learning how to define the character separator and the column names of your
file inside the CSV database connector.

To get a list of all connector classes available with their names, call an
instance of `BiodbDbsInfo`:
```{r}
mybiodb$getDbsInfo()
```
To get available informations on these database connectors, use the `get()`
method:
```{r}
mybiodb$getDbsInfo()$get(c('comp.csv.file', 'mass.csv.file'))
```

Here we must stop a moment to explain the use of the `$` operator.
This operator is the call operator for the object oriented programming (OOP)
model *R5*.
This OOP model is different from *S4*.
While in *S4* the generic methods and their specialization are defined apart
from the classes, in *R5* the two are defined together and a method definition
is necessarily part of a class. Each method being part of a class, it is also
part of each object of the class, hence the use of a call operator (`$`) on a
object.
In the code line above, the call `mybiodb$getFactory()` means to call
`getFactory()` method onto `biodb` instance.
This call will return another object (of class `BiodbFactory`) on which we call
the method `createConn()`.
Note that while in *R Studio*, you will benefit from the autosuggestion system
to find all methods available for an instance.
See vignette
```{r, echo=FALSE, results='asis'}
make_vignette_ref('details')
```
for explanations about the OOP model chosen for *biodb*.

```{r compTable, echo=FALSE, results='asis'}
compDf <- read.table(compUrl, sep="\t", header=TRUE, quote="")
# Prevent RMarkdown from interpreting @ character as a reference:
compDf$smiles <- vapply(compDf$smiles, function(s) paste0('`', s, '`'), FUN.VALUE='')
knitr::kable(head(compDf), "pipe", caption="Excerpt from compound database TSV file.")
```

# Accessing entries

The main goal of the connector is to retrieve entries. The two main generic
ways to retrieve entries with a connector are: getting entries using their
identifiers (accession numbers), and searching for entries by name.
For compound databases, there is also the possibility to search for entries by
mass.

In this section we will show how to get entries, convert them into a data frame
and search for entries by name.
For advanced features about entries, please see the vignette `entries`.

## Getting entries

Getting entries is done with the `getEntry()` methods to which you pass a
character vector of one or more identifiers:
```{r}
entries <- compdb$getEntry(c('1018', '1549', '64679'))
entries
```

The returned objects are instances of `BiodbEntry`, which means you can call on
them all functions available in this class. Here is an example of calling the
method `getFieldsJson()` on the first entry in order to get a JSON
representation of the entry values:
```{r}
entries[[1]]$getFieldsAsJson()
```

## Getting all fields defined inside an entry

To get the list of fields defined (i.e.: with an associated value) in an entry,
call the method `getFieldNames()` on the entry instance:
```{r}
fields <- entries[[1]]$getFieldNames()
fields
```

The names returned correspond to all the fields for which a value has been
parsed from the content returned by the database.
To know the significance of each field you have to call the method `get()` on
the `BiodbEntryFields` class:
```{r}
mybiodb$getEntryFields()$get(fields)
```
The `BiodbEntryFields` gathers all information about entry fields, the same way
the `BiodbDbsInfo` class gather information about all database connectors.

## Getting field values from an entry

In *biodb* the definition of fields are global. Thus they are shared between
databases, and the same field will have the same name in two entries of two
different databases.

`getFieldValue()` is used to get the value of a field:
```{r}
entries[[1]]$getFieldValue('formula')
```

## Exporting entries into a data frame

Another way to access field values of entries, is to export them as a data
frame.

You can export the values of one single entry:
```{r}
entryDf <- entries[[1]]$getFieldsAsDataframe()
```
See table \@ref(tab:entryTable) for the exported data frame.

```{r entryTable, echo=FALSE, results='asis'}
# Prevent RMarkdown from interpreting @ character as a reference:
entryDf$smiles <- vapply(entryDf$smiles, function(s) paste0('`', s, '`'), FUN.VALUE='')
knitr::kable(entryDf, "pipe", caption="Values of one entry of the compound database.")
```

Or export the values of a set of entries:
```{r}
entriesDf <- mybiodb$entriesToDataframe(entries)
```
See table \@ref(tab:entriesTable) for the exported data frame. 

```{r entriesTable, echo=FALSE, results='asis'}
# Prevent RMarkdown from interpreting @ character as a reference:
entriesDf$smiles <- vapply(entriesDf$smiles, function(s) paste0('`', s, '`'), FUN.VALUE='')
knitr::kable(entriesDf, "pipe", caption="Values of a set of entries from the compound database.")
```

## Searching for entries

In *biodb* each database connector offers the possibility to search entries by
their name, although some database servers do not propose this feature in which
case an explicit error message will be returned.

The generic method to search for entries is `searchForEntries()`, it
returns a character vector containing identifiers of matchings entries.
Here is a search on the *name* field:
```{r}
compdb$searchForEntries(list(name='deoxyguanosine'))
```

If you want to search into a compound database, the connector has certainly
implemented the search on mass.
With our example database, we can search on the *monoisotopic.mass* field:
```{r}
compdb$searchForEntries(list(name='guanosine', monoisotopic.mass=list(value=283.0917, delta=0.1)))
```
When searching by mass, the *biodb* mass field to use must be selected. To get
a list of all *biodb* mass fields, run:
```{r}
mybiodb$getEntryFields()$getFieldNames(type='mass')
```
To get information of any of these fields run:
```{r}
mybiodb$getEntryFields()$get('nominal.mass')
```
Then to know if you can run a search on a connector on a particular mass field
run:
```{r}
compdb$isSearchableByField('average.mass')
```

To get a list of all searchable field for a connector, run:
```{r}
compdb$getSearchableFields()
```

# Mass spectra

Another feature of *biodb* is the ability to annotate an LCMS spectra or to
search for an MSMS spectra matching.
In this section we will see the annotation of LCMS spectra and matching of MSMS
spectra.

## Mass spectra annotation with a compound database

Using a compound database it is possible to annotate a mass spectra.
You will get a data frame containing your data frame input (with your M/Z
values) completed by annotations from the compound database.

Here is an input data frame containing M/Z values in negative mode:
```{r}
ms.tsv <- system.file("extdata", "ms.tsv", package='biodb')
mzdf <- read.table(ms.tsv, header=TRUE, sep="\t")
```
See table \@ref(tab:mzdfTable) for the content of the input.

```{r mzdfTable, echo=FALSE, results='asis'}
knitr::kable(mzdf, "pipe", caption="Input M/Z values.")
```

We know call the `annotateMzValues()` method to run the annotation:
```{r}
annotMz <- compdb$annotateMzValues(mzdf, mz.tol=1e-3, ms.mode='neg')
```
The `mz.tol` option sets the M/Z tolerance (by default in plain value, thus
`±0.1` in our case).
The `ms.mode` option defines the MS mode of your input spectrum, either
positive (`'pos'`) or negative (`'neg'`).
See table \@ref(tab:annotMzTable) for the content of the input.
Note that in the output, columns coming from the database have their name
prefixed with the database name.

```{r annotMzTable, echo=FALSE, results='asis'}
knitr::kable(annotMz, "pipe", caption="Annotation output.")
```

## Mass spectra annotation with a mass spectra database

Using a mass spectra database it is as well possible to annotate a simple mass
spectrum, but also LCMS data (i.e. including retention times).

First we have to open a connection to the LCMS database (see table
\@ref(tab:lcms3Table) for content):
```{r}
massUrl <- system.file("extdata", "massbank_extract_lcms_3.tsv", package='biodb')
massDb <- mybiodb$getFactory()$createConn('mass.csv.file', url=massUrl)
```

```{r lcms3Table, echo=FALSE, results='asis'}
massDf <- read.table(massUrl, sep="\t", header=TRUE, quote="")
knitr::kable(head(massDf), "pipe", caption="Excerpt from LCMS database TSV file.")
```

Then we create an input data frame containing M/Z and RT (retention time) values:
```{r}
input <- data.frame(mz=c(73.01, 116.04, 174.2), rt=c(79, 173, 79))
```
Unit of the retention times will be set when running the annotation.

And finally we call the annotation function `searchMsPeaks()`:
```{r}
annotMzRt <- massDb$searchMsPeaks(input, mz.tol=0.1, rt.unit='s', rt.tol=10, match.rt=TRUE, prefix='match.')
```
The `mz.tol` option sets the M/Z tolerance (by default in plain value, thus
`±0.1` in our case).
The `match.rt` option enables matching on retention time values, `rt.unit` sets
the unit (`"s"` for second and `"min"` for minute) and `rt.tol` the tolerance.
The `prefix` option specifies a custom prefix to use for naming the database
columns inside the output.
See table \@ref(tab:annotMzRtTable) for the results.

```{r annotMzRtTable, echo=FALSE, results='asis'}
knitr::kable(head(annotMzRt), "pipe", caption="Results of annotation of an M/Z and RT input file with an LCMS database.")
```

## MS/MS matching

*biodb* also offers an MS/MS matching service, allowing you to compare your
experimental spectrum with an MS/MS database.

First we open a connection to a MS/MS TSV file database:
```{r}
msmsUrl <- system.file("extdata", "massbank_extract_msms.tsv", package='biodb')
msmsdb <- mybiodb$getFactory()$createConn('mass.csv.file', url=msmsUrl)
```
See table \@ref(tab:msmsTable) for content.

```{r msmsTable, echo=FALSE, results='asis'}
msmsDf <- read.table(msmsUrl, sep="\t", header=TRUE, quote="")
knitr::kable(head(msmsDf), "pipe", caption="Excerpt from MS/MS database TSV file.")
```

Then we define an input spectrum:
```{r}
input <- data.frame(mz=c(286.1456, 287.1488, 288.1514), rel.int=c(100, 45, 18))
```
The `rel.int` column contains relative intensity in percentage.

Finally we run the matching service by calling the `msmsSearch()` method:
```{r}
matchDf <- msmsdb$msmsSearch(input, precursor.mz=286.1438, mz.tol=0.1, mz.tol.unit='plain', ms.mode='pos')
```
The `precursor.mz` option sets the M/Z value for the precursor of your input
spectrum.
The `mz.tol` option defines the M/Z tolerance (by default in plain value, thus
`±0.1` in our case).
The `mz.tol.unit` option defines the mode use for the tolerance: either
`'plain'` or `'ppm'`.
The `ms.mode` option defines the MS mode of your input spectrum, either
positive (`'pos'`) or negative (`'neg'`).

The results are displayed in table \@ref(tab:msmsMatchingTable).
Each matching spectrum found in database is listed in the output data frame,
along with a score and the number of the matched peak inside the database
spectrum (the column names are the peak numbers of the input spectrum).

```{r msmsMatchingTable, echo=FALSE, results='asis'}
knitr::kable(head(matchDf), "pipe", caption="Results of running spectrum matching service on an MS/MS database.")
```

# Creating and improving connectors

A powerful feature of *biodb* is its architecture as a development framework.
Connectors can be extended dynamically by created new rules to parse field
values, or creating new fields.
New connectors can also be defined. This feature has been used to create
connectors to public databases like: KEGG, ChEBI, HMDB or UniProt.

See the vignettes
```{r, echo=FALSE, results='asis'}
make_vignette_ref('new_entry_field')
```
and
```{r, echo=FALSE, results='asis'}
make_vignette_ref('new_connector')
```
for details about connector creation and defining new entry fields.

# Existing biodb extension packages

Several extension packages for *biodb* exist today on GitHub. See table
\@ref(tab:extensions) for a list of those extension and a short description.

For installing them, please first make sure that you have the package
`devtools` installed and run:
```{r, eval=FALSE}
devtools::install_github('pkrog/biodbChebi', dependencies=TRUE, build_vignettes=TRUE)
```
Replace `'pkrog/biodbChebi'` by the appropriate repository.

The extensions whose status is marked as `Functional` are in working order and
can be installed and used safely with *biodb*. They may still need some updates
in the documentation or the tests, thus do not hesitate to contact us if you
have doubts on the API, the behaviour or if you would like to improve the
extension.
The extensions whose status is marked as `In maintenance` are currently non
functional due to the refactoring of *biodb* into a development framework, but
will be upgraded as soon as possible.
If have the need to re-enable a currently `in maintenance` extension, do not
hesitate to contact us, we may be able to accelerate the upgrade or propose you
with our support to upgrade it yourself.
If you have the desire to develop a new extension, please contact us, as we
will be able accompany you in the process.

Extension                                                            | Database                                  | Status          | Description
-------------------------------------------------------------------- | ----------------------------------------- | --------------- | ----------------------------------------------------------------
[biodbChebi](https://bioconductor.org/packages/biodbChebi)           | [ChEBI](https://www.ebi.ac.uk/chebi/)     | On Bioconductor | Connector to ChEBI.
[biodbExpasy](https://bioconductor.org/packages/biodbExpasy)         | [ExPASy](https://www.expasy.org/)         | On Bioconductor | Connector to ExPASy Enzyme.
[biodbKegg](https://bioconductor.org/packages/biodbKegg)             | [KEGG](https://www.kegg.jp/)              | On Bioconductor | Connectors to KEGG Compound, Enzyme, Genes, Module, Orthology, Pathway and Reaction.
[biodbHmdb](https://bioconductor.org/packages/biodbHmdb)             | [HMDB](https://hmdb.ca/)                  | On Bioconductor | Connector to HMDB Metabolites.
[biodbLipidmaps](https://bioconductor.org/packages/biodbLipidmaps)   | [LIPID MAPS](https://www.lipidmaps.org/)  | On Bioconductor | Connector to Lipid Maps Structure.
[biodbMirbase](https://bioconductor.org/packages/biodbMirbase)       | [miRBase](http://mirbase.org)             | On Bioconductor | Connector miRBase Mature.
[biodbNci](https://bioconductor.org/packages/biodbNci)               | [NCI](https://www.cancer.gov/)            | On Bioconductor | Connector to NCI CACTUS.
[biodbUniprot](https://bioconductor.org/packages/biodbUniprot)       | [UniProt](https://www.uniprot.org/)       | On Bioconductor | Connector to UniProt KB.
[biodbNcbi](https://bioconductor.org/packages/biodbNcbi)             | [NCBI](https://www.ncbi.nlm.nih.gov/)     | On Bioconductor | Connectors to NCBI CCDS, Gene, PubChem Compound and PubChem Substance.
[biodbMassbank](https://github.com/pkrog/biodbMassbank)              | [MassBank](https://massbank.eu/MassBank/) | In maintenance  | Connector to MassBank.
[biodbChemspider](https://github.com/pkrog/biodbChemspider)          | [ChemSpider](http://www.chemspider.com)   | In maintenance  | Connector to ChemSpider.
[biodbPeakforest](https://github.com/pkrog/biodbPeakforest)          | [PeakForest](https://peakforest.org/)     | In maintenance  | Connectors to PeakForest Compound and PeakForest Mass.
: (\#tab:extensions) Available *biodb* extensions.
A list of currently available extensions with their description and their status.

# Sources of documentation

Several vignettes are available. Among them you will find help for creating
a new connector, adding an entry field to an existing connector, searching
for compounds by mass or name, merging entries from different databases into
a local database, annotation of a mass spectrum, etc.
See table \@ref(tab:vignettes) for a full list of available vignettes.

```{r vignettes, echo=FALSE, results='asis'}
x <- biodbVignettes[, c('link', 'desc')]
names(x) <- c('Vignette', 'Description')
knitr::kable(x, "pipe", caption="List of *biodb* available vignettes with their short description.")
```

You will also find documentation inside the R manual of the package. All
*biodb* public classes have a help page. On each help page you will find a
description of the class as well as a list of all its public methods with a
description of their parameters. For instance you can get help on `BiodbEntry`
class with `?BiodbEntry`.

# Closing biodb instance

When done with your *biodb* instance you have to terminate it, in order to
ensure release of resources (file handles, database connection, etc):
```{r}
mybiodb$terminate()
```

# Session information

```{r}
sessionInfo()
```

# References