---
title: "_eiR_: Accelerated Similarity Searching of Small Molecules"
author: "Authors: Kevin Horan, Yiqun Cao, Tyler Backman & [Thomas Girke](mailto:thomas.girke@ucr.edu)"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
package: "`r pkg_ver('eiR')`"
output:
BiocStyle::html_document:
toc: true
toc_depth: 3
fig_caption: yes
fontsize: 14pt
bibliography: references.bib
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
```{r setup, echo=FALSE, messages=FALSE, warnings=FALSE}
suppressPackageStartupMessages({
library(eiR)
library(ChemmineR)
})
```
Note: the most recent version of this tutorial can be found here and a short overview slide show [here](http://faculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_5_8_2014/Rcheminfo/Cheminfo.pdf).
# Introduction
The R package `eiR` provides an index for chemical compound databases allowing one to
quickly find similar compounds in a very large database. To create this
index, *r* reference compounds are selected to represent the database.
Then each compound in the database is embedded into *d*-dimensional
space based on their distance to each reference compound. This requires
time linear in the size of the database, but only needs to be done once
for a database. Within this space, Locality Sensitive Hashing (LSH) is
employed to allow sub-linear time nearest neighbor lookups [@Dong2008-ib; @Dong2008-kn]
that nearest neighbors can be found without doing a linear scan through
the entire compound database. Additional compounds can be added to the
database in time linear to the number of new compounds. No time is spend
processing existing compounds, as long as the set of reference compounds
remains the same. Given the ability to quickly find nearest neighbors,
this method enables fast clustering with the Jarvis-Pattrick algorithm
as well [@Jarvis1973-lp]. For details on the whole process see [@Cao2010-ir].
This library uses an SQL back-end (SQLite by default) to store chemical
compound definitions, either in SDF or SMILE format, as well as
descriptors. Several different kinds of descriptors can be stored for
each compound, for example, one could compute and store atom-pair and
fingerprint descriptors for each compound. The SQLite database, if used,
is stored in a directory called `data`. The `eiInit` function is used to create a new
database, it can import data from SDF or SMILE formated files, or an
`SDFset` object.
Once a database has been created, an embedding must also be created
[@Dimitris_K_Agrafiotis_Dmitrii_N_Rassokhin_Victor_S_Lobanov2001-eq].
In this step the reference compounds are chosen and each compound is
embedded into the new space. This step creates a new directory called
`run-r-d`, where `r` and `d` are the corresponding values. This is the
most costly step of the process and is handled by the `eiMakeDb` function.
This step can be parallelized by providing a SNOW cluster to the
eiMakeDb function.
Given an embedded database, queries can be run against it with the
`eiQuery` function. Additional compounds can also be added to an existing
database and embedding using `eiAdd`. Performance tests can be run using
the eiPerformanceTest function, and Jarvis-Patrick clustering can be
done with the `eiCluster` function.
`eiR` also provides some mechanisms to allow the user to extend the set of
descriptor formats used and to define new distance functions. See
Section [Customization](#customization) for details.
# Initialization
An initial compound database must be created with the following command:
```{r }
library(eiR)
data(sdfsample)
eiInit(sdfsample[1:99])
```
EiInit can take either an SDFset, or a filename. If a filename is given
it must be in either SDF or SMILE format and the format must be
specified with the `format` paramter. It might complain if your SDF file
does not follow the SDF specification. If this happens, you can create
an SDFset with the `read.SDFset` command and then use that instead of
the filename.
Descriptors will also be computed at this time. The default descriptor
type is atompair. Other types can be used by setting the
`descriptorType` parameter. Currently available types are "ap" for
atompair, and "fp" for fingerprint. The set of available descriptors can
be extended, see Section [Customization](#customization). EiInit will create a folder
called 'data'. Commands should always be executed in the folder
containing this directory (ie, the parent directory of "data"), or else
specify the location of that directory with the `dir` option.
`EiInit` also has some options to support loading data in parallel. This is only
possible if the SQL database being used supports parallel writes. For now, this
means only PostgreSQL. To use this feature, the `inputs` parameter must be set to
an array of filenames, `cl` set to a SNOW cluster, and `connSource` set to a
function which will return a new database connection when called with no parameters.
Then each file will be loaded on a different node of the given cluster, in parallel.
The `connSource` function must actually create a new connection on each call, simply
returning the same reference to an existing connection will not work. This is because
that function will be called on potentially different machines which will have to
establish their own connection to the database.
Compounds which already exist in the database will be skipped over, so it is safe to
re-run `eiInit` on an input file which has already been partially loaded. By default
it discovers duplicates by comparing the entire compound definition. However, if you
want two compounds with the same name to be considered equal even if the definition
is different, you can set `updateByName` to be true. In this mode, if a compound being
loaded is found to exist in the database already, by name, but has a different
definition, the compound will be updated with the new definition and any associated
descriptors and/or features will be re-computed.
eiInit will return a list of compound id numbers which represent the
compounds just inserted. These numbers can be used to issue queries
later.
# Creating a Searchable Database
In this step the compounds in the data directory will be embedded in
another space which allows for more efficient searching. The main two
parameters are `refs` and `d`. `refs` can be either a list of compound ids
to use as references, or else an integer value indicating the number
of references to use, which will then be randomly chosen. We will use
*r* to represent the number of reference compounds.
*d* is the dimension of the embedding space. We have found in
practice that setting *d* to around 100 works well. *r* should be large
enough to "represent" the full compound database.
Since creating a database is the longest running step, a SNOW cluster can be
provided to parallelize the task. A `conSource` function is also required
in this case, as described in Section [Initialization](#initialization).
To help tune these values, `eiMakeDb` will pick `numSamples`
non-reference samples which can later be used by the [`eiPerformanceTest`](#performance-tests)
function.
`eiMakdDb` does its job in a job folder, named after the number of
reference compounds and the number of embedding dimensions. For example,
using 300 reference compounds to generate a 100-dimensional embedding
(`r=300, `d`=100`) will result in a job folder called `run-300-100`.
The embedding result is the file matrix.r.d. In the above
example, the output would be `run-300-100/matrix.300.100`.
Since more than one type of descriptor can be stored for each compound,
the desired descriptor type must be given to this function with the
`descriptorType` parameter. The default value is "ap", for atompair. You
can also specify a custom distance function that must be able to take
two descriptors in the format specified and return a distance value. The
default distance method used is 1-Tanimoto(descriptor1,descriptor2).
The return value is called the Run Id. This value is needed by other
functions to identify the data set and embedding parameters to use.
```{r }
r<- 60
d<- 40
runId <- eiMakeDb(r,d)
```
# Queries
Queries can be given in several formats, defined by the `format`
parameter. The default format is "sdf". The `queries` parameter can be
either an SDF file or an SDFset under this format. Other valid values for
`format` are "name" and "compound\_id". Under these two formats the
`queries` parameter is expected to be a list of compound names (as
returned by sdfid on an SDFset), or a list of compound id numbers from the
database, such as what is returned by the eiInit function.
The `runId` parameter is required to determine which embedded database
to use. As with `eiMakeDb`, the `distance` parameter may be given if
desired, it will default to the Tanimoto Coefficient otherwise. Finally,
the parameter `K` is the number of results that will be returned. In
some cases, particularly if `K` is small, you may need to set it to a
larger value and then trim down the result set yourself. This is because
LSH is not an exact algorithm. Internally, it actually searches for
`xK` neighbors, where `x` is referred to as the expansion ratio,
generally set to 2. This allows it to pick the best `K` matches,
according to the true distance function, out of a larger set of
candidates. When `K` is small though, sometimes that expansion ratio is
not quite enough.
Note also that this function returns distance values and not
similarities. Similarities can be computed by setting `asSimilarity` to
TRUE. This assumes that whatever distance function is currently in use
returns values between 0 and 1. This is true for the default distance
functions for "ap" and "fp" descriptors.
Then you can perform a query as follows:
```{r term=FALSE,fig=TRUE}
#find compounds similar to each query
result=eiQuery(runId,sdfsample[45],K=10,asSimilarity=TRUE)
print(result)
#Compare to traditional similarity search:
data(apset)
print(cmp.search(apset,apset[45],type=3,cutoff=4,quiet=TRUE))
cid(sdfsample)=sdfid(sdfsample)
plot(sdfsample[result$target[1:4]],regenCoords=TRUE,print=FALSE)
```
The result will be a data frame with four columns. The first is a query
id, the second is a target, or hit, id, the third is the distance (or
similarity if `asSimilary` was true) between the query and the
target, and the fourth column is the compound id number of
the target.
Lsh parameters can be passed in as well, see Section
[Performance Tests](#performance-tests) for more details.
# Adding New Compounds
New Compounds can be added to an existing database, however, the
reference compounds cannot be changed. To add new compounds, use the
eiAdd function. This function is very similar to the eiQuery function,
except instead of a `queries` parameter, there is an `additions`
parameter, defining the compounds to be added. The format of the value
of this parameter works the same as in the eiQuery function. For
example, to add one compound from an SDFset you would do:
```{r term=FALSE}
eiAdd(runId,sdfsample[100])
```
The returned value is a list of the compounds ids that were just
added.
This function will also write out a new matrix file in the run directory.
For large databases, this can take a significant amount of time.
# Performance Tests
The eiPerforamceTest function will run several tests using some sample
data to evaluate the performance of the current embedding. It takes the
usual `runId` parameter, as well as on optional distance function. It
also takes several LSH parameters, though the defaults are usually fine.
To evaluate the performance you can run:
```{r term=FALSE}
eiPerformanceTest(runId,K=22)
```
This will perform two different tests. The first tests the embedding
results in similarity search. The way this works is by approximating
1,000 random similarity searches (determined by data/test\_queries.iddb)
by nearest neighbor search using the coordinates from the embedding
results. The search results are then compared to the reference search
results (chemical-search.results.gz).
The comparison results are summarized in two types of files. The first
type lists the recall for different k values, k being the number of
numbers to retrieve. These files are named as "recall-ratio-k". For
example, if the recall is 70% for top-100 compound search (70 of the 100
results are among the real top-100 compounds) then the value at line 100
is 0.7. Several relaxation ratios are used, each generating a file in
this form. For instance, recall.ratio-10 is the file listing the recalls
when relaxation ratio is 10. The other file, recall.csv, lists recalls
of different relaxation ratios in one file by limiting to selected k
value. In this CSV file, the rows correspond to different relaxation
ratios, and the columns are different k values. You will be able to pick
an appropriate relaxation ratio for the k values you are interested in.
The second test measures the performance of the Locality Sensitive Hash
(LSH). The results for lsh-assisted search will be in
run-r-d/indexed.performance. It's a 1,000-line file of recall values.
Each line corresponds to one test query. LSH search performance is
highly sensitive to your LSH parameters (K, W, M, L, T). The default
parameters are listed in the man page for `eiPerformanceTest`. When you
have your embedding result in a matrix file, you should follow
instruction on
to find the
best values for these parameters.
# Clustering
Compounds can be clustered in near linear time using the Jarvis-Patrick
clustering algorithm by taking advantage of the near constant time nearest
neighbor lookup provided by the LSH index. Clustering is done with the
`eiCluster` function. It takes a `runId` to identify the data set and
embedding, and two parameters for the Jarvis-Patrick algorithm:
`K` is the number of neighbors to fetch for each compound, and `minNbrs`
is the minimum number of neighbors two compounds must have in
common in order to be joined into the same cluster. A `cutoff` value
can also be given to set a maximum distance between neighbors. Any two
compounds farther apart than this cutoff will never be considered
neighbors. This parameter is helpful in preventing compounds which are
very different from almost every other compound from being considered
similar to other distant compounds simply because they happened to be
closest.
By default `eiCluster` will cluster the entire dataset specified by
`runId`. If you want to only cluster a subset of those compounds, you can
provide their compound id values to the `compoundIds` parameter.
```{r }
clustering <- eiCluster(runId,K=5,minNbrs=2,cutoff=0.5)
byCluster(clustering)
```
# Customization
`eiR` can be extended to understand new descriptor types and new distance
functions. New distance functions can be set in two different ways. Any
function that takes a distance parameter can be given a new distance
function that will be used for just that call. If no distance function
is given, it will fetch a default distance function that has been
defined for the given descriptor type. This default value can be changed
using the `setDefaultDistance` function, which takes the descriptor type
and a distance function. Once this function has been called, the new
distance function will be used for that descriptor type by all functions
using a distance function. The built-in defaults are defined as follows:
```{r }
setDefaultDistance("ap", function(d1,d2) 1-cmp.similarity(d1,d2) )
setDefaultDistance("fp", function(d1,d2) 1-fpSim(d1,d2) )
```
New descriptor types can also be added using the `addTransform`
function. These transforms are basically just ways to read descriptors
from compound definitions, and to convert descriptors between string and
object form. This conversion is required because descriptors are stored
as strings in the SQL database, but are used by the rest of the program
as objects.
There are two main components that need to be added. The `addTransform`
function takes the name of the transform and two functions, `toString`,
and `toObject`. These have slightly different meanings depending on the
component you are adding. The first component to add is a transform from
a chemical compound format, such as SDF, to a descriptor format, such as
atom pair (AP), in either string or object form. The toString function
should take any kind of chemical compound source, such an SDF file, an
SDF object or an SDFset, and output a string representation of the
descriptors. Since this function can be written in terms of other
functions that will be defined, you can usually accept the default value
of this function. The toObject function should take the same kind of
input, but output the descriptors as an object. The actual return value
is a list containing the names of the compounds (in the names field),
and the actual descriptor objects ( in the descriptors field).
The second component to add is a transform that converts between string
and object representations of descriptors. In this case the toString
function takes descriptors in object form and returns a string
representation for each. The toObject function performs the inverse
operation. It takes descriptors in string form and returns them as
objects. The objects returned by this function will be exactly what is
handed to the distance function, so you need to make sure that the two
match each other.
For example, to allow atom pair descriptors to be extracted from and SDF
source we would make the following call:
```{r }
addTransform("ap","sdf",
toObject = function(input,conn=NULL,dir="."){
sdfset=if(is.character(input) && file.exists(input)){
read.SDFset(input)
}else if(inherits(input,"SDFset")){
input
}else{
stop(paste("unknown type for 'input',
or filename does not exist. type found:",class(input)))
}
list(names=sdfid(sdfset),descriptors=sdf2ap(sdfset))
}
)
addTransform("ap",
toString = function(apset,conn=NULL,dir="."){
unlist(lapply(ap(apset), function(x) paste(x,collapse=", ")))
},
toObject= function(v,conn=NULL,dir="."){
if(inherits(v,"list") || length(v)==0)
return(v)
as( if(!inherits(v,"APset")){
names(v)=as.character(1:length(v));
read.AP(v,type="ap",isFile=FALSE)
} else v,
"list")
}
)
```
```{r echo=FALSE,term=FALSE}
unlink("data",recursive=TRUE)
unlink(paste("run",r,d,sep="-"),recursive=TRUE)
```
# Version Information
```{r sessionInfo, results='asis'}
sessionInfo()
```
# Funding
This software was developed with funding from the National Science
Foundation:
[ABI-0957099](http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0957099),
2010-0520325 and IGERT-0504249.
# References