'`.
If you know SQL you can go beyond this basic syntax. These tests will
simply be concatenated together with "AND" in-between them and tacked on
the end of a WHERE clause of an SQL statement. So any SQL that will work
in that context is fine. The function returns a list of compound
ids; the actual compounds can then be fetched with
`getCompounds`. If just the names are needed, the
`getCompoundNames` function can be used. Compounds can
also be fetched by name using the `findCompoundsByName`
function.
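To make the concatenation rule concrete, the following base R sketch joins a vector of tests the way the paragraph above describes. It is only an illustration of the rule, not the package's internal code. Because AND binds more tightly than OR in SQL, putting parentheses around a test that contains OR is a safe habit.

```{r eval=FALSE, tidy=FALSE}
# Illustration only: join individual tests with AND, as described above
tests <- c("rings <= 2", "aromatic >= 2")
where_clause <- paste(tests, collapse = " AND ")
where_clause

# A test containing OR is safer when parenthesized before it is combined
tests_or <- c("rings = 0 OR rings = 3", "aromatic >= 0")
where_clause_or <- paste0("(", tests_or, ")", collapse = " AND ")
where_clause_or
```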
In this example we search for compounds with 0 or 1 rings:
```{r }
results = findCompounds(conn,"rings",c("rings <= 1"))
message("found ",length(results))
```
If more than one test is given, only compounds which satisfy all tests are found. So to find
compounds with at most 2 rings and at least 2 aromatic rings we could do:
```{r }
results = findCompounds(conn,c("rings","aromatic"),c("rings<=2","aromatic >= 2"))
message("found ",length(results))
```
Remember that any feature used in some test must be listed in the second argument.
String patterns can also be used. So if we wanted to match a substring of the molecular formula, say
to find compounds with 21 carbon atoms, we could do:
```{r eval=FALSE}
results = findCompounds(conn,"formula",c("formula like '%C21%'"))
message("found ",length(results))
```
The "like" operator does a pattern match. Two wildcard
characters can be used with this operator: "%" matches any stretch of characters, while "_"
matches any single character. So the above expression would match a formula like "C21H28N4O6".
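To experiment with wildcard patterns without a database, SQL `LIKE` can be emulated in base R. In standard SQL, `%` matches any run of characters and `_` matches exactly one. The helper `like_match` below is hypothetical and only a rough stand-in: it ignores case, as SQLite's `LIKE` does for ASCII by default, and it does not handle patterns that contain literal `*` or `?`.

```{r eval=FALSE, tidy=FALSE}
# Hypothetical helper: translate LIKE wildcards to glob wildcards,
# then let glob2rx() build the regular expression
like_match <- function(x, pattern) {
  glob <- chartr("%_", "*?", pattern)
  grepl(utils::glob2rx(glob), x, ignore.case = TRUE)
}
formulas <- c("C21H28N4O6", "C2H6O")
like_match(formulas, "%C21%")   # TRUE FALSE
like_match(formulas, "C2_H%")   # TRUE FALSE
```

Note that a pattern such as `'%C21%'` would also match a formula containing "C210", so a stricter pattern may be needed for exact atom counts.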
Valid comparison operators are:
- <, <=, > , >=
- =, ==, !=, <>, IS, IS NOT, IN, LIKE
The boolean operators "AND" and "OR" can also be used to create more complex expressions within a single test.
If you just want to fetch every compound in the database you can use the `getAllCompoundIds` function:
```{r }
allIds = getAllCompoundIds(conn)
message("found ",length(allIds))
```
[Back to Table of Contents]()
Using Search Results
-----------------------
Once you have a list of compound ids from the `findCompounds` function, you can either
fetch the compound names, or the whole set of compounds as an SDFset.
```{r }
#get the names of the compounds:
names = getCompoundNames(conn,results)
#if the name order is important set keepOrder=TRUE
#It will take a little longer though
names = getCompoundNames(conn,results,keepOrder=TRUE)
# get the whole set of compounds
compounds = getCompounds(conn,results)
#in order:
compounds = getCompounds(conn,results,keepOrder=TRUE)
#write results directly to a file:
compounds = getCompounds(conn,results,filename=file.path(tempdir(),"results.sdf"))
```
Using the `getCompoundFeatures` function, you can get a set of feature values
as a data frame:
```{r}
getCompoundFeatures(conn,results[1:5],c("rings","aromatic"))
#write results directly to a CSV file (reduces memory usage):
getCompoundFeatures(conn,results[1:5],c("rings","aromatic"),filename="features.csv")
#maintain input order in output:
print(results[1:5])
getCompoundFeatures(conn,results[1:5],c("rings","aromatic"),keepOrder=TRUE)
```
[Back to Table of Contents]()
Pre-Built Databases
--------------------
We have pre-built SQLite databases for the DrugBank and DUD datasets. They can be found in
the ChemmineDrugs annotation package. The functions `DrugBank` and `DUD` each return a
connection to the corresponding database. Any of the above functions can
then be used to query the database.
The DUD dataset was downloaded from [here](http://dude.docking.org/db/subsets/all/all.tar.gz). A description
can be found [here](http://dude.docking.org/).
The DrugBank data set is version 4.1. It can be downloaded [here](http://www.drugbank.ca/system/downloads/current/structures/all.sdf.zip).
The following features are included:
- **aromatic**: Number of aromatic rings
- **cansmi**: Canonical SMILES string
- **cansmins**:
- **formula**: Molecular formula
- **hba1**:
- **hba2**:
- **hbd**:
- **inchi**: InChI string
- **logp**:
- **mr**:
- **mw**: Molecular weight
- **ncharges**:
- **nf**:
- **r2nh**:
- **r3n**:
- **rcch**:
- **rcho**:
- **rcn**:
- **rcooh**:
- **rcoor**:
- **rcor**:
- **rings**:
- **rnh2**:
- **roh**:
- **ropo3**:
- **ror**:
- **title**:
- **tpsa**:
The DUD database additionally includes:
- **target_name**: Name of the target
- **type**: either "active" or "decoy"
[Back to Table of Contents]()
Working with SDF/SDFset Classes
===============================
Several methods are available to return the different data components of
`SDF/SDFset` containers in batches. The following
examples list the most important ones. To save space their content is
not printed in the manual.
```{r eval=FALSE, tidy=FALSE}
view(sdfset[1:4]) # Summary view of several molecules
length(sdfset) # Returns number of molecules
sdfset[[1]] # Returns single molecule from SDFset as SDF object
sdfset[[1]][[2]] # Returns atom block from first compound as matrix
sdfset[[1]][[2]][1:4,]
c(sdfset[1:4], sdfset[5:8]) # Concatenation of several SDFsets
```
The `grepSDFset` function supports string
matching/searching on the different data components in an
`SDFset`. By default the function returns an SDF summary
of the matching entries. Alternatively, an index of the matches can be
returned with the setting `mode="index"`.
```{r eval=FALSE, tidy=FALSE}
grepSDFset("650001", sdfset, field="datablock", mode="subset") # To return an index, set mode="index"
```
Utilities to maintain unique compound IDs:
```{r eval=FALSE, tidy=FALSE}
sdfid(sdfset[1:4]) # Retrieves CMP IDs from Molecule Name field in header block.
cid(sdfset[1:4]) # Retrieves CMP IDs from ID slot in SDFset.
unique_ids <- makeUnique(sdfid(sdfset)) # Creates unique IDs by appending a counter to duplicates.
cid(sdfset) <- unique_ids # Assigns uniquified IDs to ID slot
```
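The idea of appending a counter to duplicated IDs can be tried out with base R's `make.unique`. This is only an analogy for what `makeUnique` is described as doing above; the exact suffix format that `makeUnique` produces may differ.

```{r eval=FALSE, tidy=FALSE}
# Base R analogy: duplicated IDs receive a numeric suffix
ids <- c("CMP1", "CMP2", "CMP1", "CMP1")
unique_toy <- make.unique(ids, sep = "_")
unique_toy                 # "CMP1" "CMP2" "CMP1_1" "CMP1_2"
anyDuplicated(unique_toy)  # 0 means no duplicates remain
```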
Subsetting by character, index and logical vectors:
```{r eval=FALSE, tidy=FALSE}
view(sdfset[c("650001", "650012")])
view(sdfset[4:1])
mylog <- cid(sdfset) %in% c("650001", "650012") # Logical vector marking the IDs to keep
view(sdfset[mylog])
```
Accessing `SDF/SDFset` components: header, atom, bond and
data blocks:
```{r eval=FALSE, tidy=FALSE}
atomblock(sdf); sdf[[2]];
sdf[["atomblock"]] # All three methods return the same component
header(sdfset[1:4])
atomblock(sdfset[1:4])
bondblock(sdfset[1:4])
datablock(sdfset[1:4])
header(sdfset[[1]])
atomblock(sdfset[[1]])
bondblock(sdfset[[1]])
datablock(sdfset[[1]])
```
Replacement Methods:
```{r eval=FALSE, tidy=FALSE}
sdfset[[1]][[2]][1,1] <- 999
atomblock(sdfset)[1] <- atomblock(sdfset)[2]
datablock(sdfset)[1] <- datablock(sdfset)[2]
```
Assign matrix data to data block:
```{r eval=FALSE, tidy=FALSE}
datablock(sdfset) <- as.matrix(iris[1:100,])
view(sdfset[1:4])
```
Class coercions from `SDFstr` to `list`,
`SDF` and `SDFset`:
```{r eval=FALSE, tidy=FALSE}
as(sdfstr[1:2], "list")
as(sdfstr[[1]], "SDF")
as(sdfstr[1:2], "SDFset")
```
Class coercions from `SDF` to `SDFstr`,
`SDFset`, list with SDF sub-components:
```{r eval=FALSE, tidy=FALSE}
sdfcomplist <- as(sdf, "list")
sdfcomplist <- as(sdfset[1:4], "list"); as(sdfcomplist[[1]], "SDF")
sdflist <- as(sdfset[1:4], "SDF"); as(sdflist, "SDFset")
as(sdfset[[1]], "SDFstr")
as(sdfset[[1]], "SDFset")
```
Class coercions from `SDFset` to lists with components
consisting of SDF or sub-components:
```{r eval=FALSE, tidy=FALSE}
as(sdfset[1:4], "SDF")
as(sdfset[1:4], "list")
as(sdfset[1:4], "SDFstr")
```
[Back to Table of Contents]()
Molecular Property Functions (Physicochemical Descriptors)
==========================================================
Several methods and functions are available to compute basic compound
descriptors, such as molecular formula (MF), molecular weight (MW), and
frequencies of atoms and functional groups. In many of these functions,
it is important to set `addH=TRUE` in order to
include/add hydrogens that are often not specified in an SD file.
```{r boxplot, eval=TRUE, tidy=FALSE}
propma <- atomcountMA(sdfset, addH=FALSE)
boxplot(propma, col="blue", main="Atom Frequency")
```
```{r eval=FALSE, tidy=FALSE}
boxplot(rowSums(propma), main="All Atom Frequency")
```
Data frame provided by library containing atom names, atom symbols,
standard atomic weights, group and period numbers:
```{r eval=TRUE, tidy=FALSE}
data(atomprop)
atomprop[1:4,]
```
Compute MW and formula:
```{r eval=TRUE, tidy=FALSE}
MW(sdfset[1:4], addH=FALSE)
MF(sdfset[1:4], addH=FALSE)
```
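The effect of `addH` on a molecular weight can be seen with a small hand calculation. The sketch below uses ethanol (C2H6O) with rounded standard atomic weights typed in by hand. It does not call `MW` and is only meant to show why omitting hydrogens lowers the result.

```{r eval=FALSE, tidy=FALSE}
# Rounded standard atomic weights for the elements used below
atomic_weight <- c(C = 12.011, H = 1.008, O = 15.999)

# Ethanol, C2H6O: atom counts with and without hydrogens
with_h    <- c(C = 2, H = 6, O = 1)
without_h <- c(C = 2, H = 0, O = 1)

# Molecular weight = sum over elements of (count * atomic weight)
mw_with_h    <- sum(with_h * atomic_weight[names(with_h)])
mw_without_h <- sum(without_h * atomic_weight[names(without_h)])
c(with_H = mw_with_h, without_H = mw_without_h)   # 46.069 and 40.021
```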
Enumerate functional groups:
```{r eval=TRUE, tidy=FALSE}
groups(sdfset[1:4], groups="fctgroup", type="countMA")
```
Combine MW, MF, charges, atom counts, functional group counts and ring
counts in one data frame:
```{r eval=TRUE, tidy=FALSE}
propma <- data.frame(MF=MF(sdfset, addH=FALSE), MW=MW(sdfset, addH=FALSE),
Ncharges=sapply(bonds(sdfset, type="charge"), length),
atomcountMA(sdfset, addH=FALSE),
groups(sdfset, type="countMA"),
rings(sdfset, upper=6, type="count", arom=TRUE))
propma[1:4,]
```
The following shows an example for assigning the values stored in a
matrix (*e.g.* property descriptors) to the data block components in an
`SDFset`. Each matrix row will be assigned to the
corresponding slot position in the `SDFset`.
```{r eval=FALSE, tidy=FALSE}
datablock(sdfset) <- propma # Works with all SDF components
datablock(sdfset)[1:4]
test <- apply(propma[1:4,], 1, function(x)
data.frame(col=colnames(propma), value=x))
```
The data blocks in SDFs often contain important annotation information
about compounds. The `datablock2ma` function returns this
information as a matrix for all compounds stored in an
`SDFset` container. The `splitNumChar`
function can then be used to organize all numeric columns in a
`numeric matrix` and the character columns in a
`character matrix` as components of a
`list` object.
```{r eval=FALSE, tidy=FALSE}
datablocktag(sdfset, tag="PUBCHEM_NIST_INCHI")
datablocktag(sdfset,
tag="PUBCHEM_OPENEYE_CAN_SMILES")
```
Convert entire data block to matrix:
```{r eval=FALSE, tidy=FALSE}
blockmatrix <- datablock2ma(datablocklist=datablock(sdfset)) # Converts data block to matrix
numchar <- splitNumChar(blockmatrix=blockmatrix) # Splits matrix to numeric matrix and character matrix
numchar[[1]][1:4,]; numchar[[2]][1:4,] # Shows the first rows of the numeric and character matrices
```
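The splitting step can be pictured with a toy character matrix in base R. This sketch only mimics the behavior described above for `splitNumChar`. The list component names `numeric` and `character` are chosen for illustration and are not necessarily those used by the package.

```{r eval=FALSE, tidy=FALSE}
# Toy data block matrix: every value starts out as a character string
toy_block <- cbind(MW    = c("180.16", "46.07"),
                   NAME  = c("aspirin", "ethanol"),
                   RINGS = c("1", "0"))

# A column counts as numeric if every entry converts without producing NA
is_num <- apply(toy_block, 2, function(col)
  !anyNA(suppressWarnings(as.numeric(col))))

num_part <- toy_block[, is_num, drop = FALSE]
storage.mode(num_part) <- "double"   # character -> numeric, dimnames are kept

toy_split <- list(numeric   = num_part,
                  character = toy_block[, !is_num, drop = FALSE])
str(toy_split)
```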
[Back to Table of Contents]()
Bond Matrices
=============
Bond matrices provide an efficient data structure for many basic
computations on small molecules. The function `conMA`
creates this data structure from `SDF` and
`SDFset` objects. The resulting bond matrix contains the
atom labels in the row/column titles and the bond types in the data
part. The bond types are encoded as follows: 0 is no connection, 1 is a
single bond, 2 is a double bond and 3 is a triple bond.
```{r contable, eval=FALSE, fig.keep='none', tidy=FALSE}
conMA(sdfset[1:2],
exclude=c("H")) # Create bond matrix for first two molecules in sdfset
conMA(sdfset[[1]], exclude=c("H")) # Return bond matrix for first molecule
plot(sdfset[1], atomnum = TRUE, noHbonds=FALSE , no_print_atoms = "", atomcex=0.8) # Plot its structure with atom numbering
rowSums(conMA(sdfset[[1]], exclude=c("H"))) # Return number of non-H bonds for each atom
```
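To show what such a matrix looks like, the following base R sketch builds one by hand for the heavy atoms of acetic acid. The `symbol_position` label format is assumed from the ring examples in this manual, and the matrix is constructed manually rather than with `conMA`. Note that row sums add up bond orders, so a double bond contributes 2. Counting non-zero entries gives the number of bonded neighbors instead.

```{r eval=FALSE, tidy=FALSE}
# Heavy atoms of acetic acid, CH3-C(=O)-OH, labelled symbol_position
atoms <- c("C_1", "C_2", "O_3", "O_4")
toy_bm <- matrix(0, nrow = 4, ncol = 4, dimnames = list(atoms, atoms))
toy_bm["C_1", "C_2"] <- toy_bm["C_2", "C_1"] <- 1   # single bond
toy_bm["C_2", "O_3"] <- toy_bm["O_3", "C_2"] <- 2   # double bond
toy_bm["C_2", "O_4"] <- toy_bm["O_4", "C_2"] <- 1   # single bond
toy_bm

bond_order_sum <- rowSums(toy_bm)       # 1 4 2 1
neighbor_count <- rowSums(toy_bm > 0)   # 1 3 1 1
```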
[Back to Table of Contents]()
Charges and Missing Hydrogens
=============================
The function `bonds` returns information about the number
of bonds, charges and missing hydrogens in `SDF` and
`SDFset` objects. It is used by many other functions
(*e.g.* `MW`, `MF`,
`atomcount`, `atomcountMA` and
`plot`) to correct for missing hydrogens that are often
not specified in SD files.
```{r eval=TRUE, tidy=FALSE}
bonds(sdfset[[1]], type="bonds")[1:4,]
bonds(sdfset[1:2], type="charge")
bonds(sdfset[1:2], type="addNH")
```
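The idea behind counting missing hydrogens can be sketched with a simplified valence rule: for a neutral atom, the number of implicit hydrogens is its usual valence minus the sum of its bond orders. This rule is an assumption made for illustration. It ignores charges and unusual valences, and it is not taken from the implementation of `bonds`.

```{r eval=FALSE, tidy=FALSE}
# Assumed rule for neutral atoms: missing H = usual valence - sum of bond orders
usual_valence <- c(C = 4, N = 3, O = 2)

# Heavy atoms of acetic acid, CH3-C(=O)-OH, with their summed bond orders
toy_atoms <- data.frame(symbol = c("C", "C", "O", "O"),
                        bond_orders = c(1, 4, 2, 1),
                        stringsAsFactors = FALSE)
toy_atoms$missing_h <- unname(usual_valence[toy_atoms$symbol]) -
  toy_atoms$bond_orders
toy_atoms
sum(toy_atoms$missing_h)   # 4 hydrogens, consistent with the formula C2H4O2
```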
[Back to Table of Contents]()
Ring Perception and Aromaticity Assignment
==========================================
The function `rings` identifies all possible rings in one
or many molecules (here `sdfset[1]`) using the exhaustive
ring perception algorithm from Hanser et al. [-@Hanser_1996]. In addition, the function can
return all smallest possible rings as well as aromaticity information.
The following example returns all possible rings in a
`list`. The argument `upper`
specifies an upper length limit for rings. Choosing smaller length limits
will reduce the search space, resulting in shorter compute times. Note:
each ring is represented by a character vector of atom symbols that are
numbered by their position in the atom block of the corresponding
`SDF/SDFset` object.
```{r eval=TRUE, tidy=FALSE}
ringatoms <- rings(sdfset[1], upper=Inf, type="all", arom=FALSE, inner=FALSE)
```
For visual inspection, the corresponding compound structure can be
plotted with the ring bonds highlighted in color:
```{r eval=TRUE, tidy=FALSE}
atomindex <- as.numeric(gsub(".*_", "", unique(unlist(ringatoms))))
plot(sdfset[1], print=FALSE, colbonds=atomindex)
```
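The `gsub` call used above turns atom labels into plain position numbers. Its behavior can be checked on a small hand-made list that stands in for the output of `rings`. The ring contents here are invented for illustration.

```{r eval=FALSE, tidy=FALSE}
# Toy stand-in for a ring list: labels are symbol_position
toy_rings <- list(ring1 = c("N_1", "C_2", "C_3"),
                  ring2 = c("C_3", "C_4", "O_12"))

# Strip everything up to the underscore and keep each position once
toy_index <- as.numeric(gsub(".*_", "", unique(unlist(toy_rings))))
toy_index   # 1 2 3 4 12
```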
Alternatively, one can include the atom numbers in the plot:
```{r eval=TRUE, tidy=FALSE}
plot(sdfset[1], print=FALSE, atomnum=TRUE, no_print_atoms="H")
```
Aromaticity information of the rings can be returned in a logical vector
by setting `arom=TRUE`:
```{r eval=TRUE, tidy=FALSE}
rings(sdfset[1], upper=Inf, type="all", arom=TRUE, inner=FALSE)
```
Return rings with no more than 6 atoms that are also aromatic:
```{r eval=TRUE, tidy=FALSE}
rings(sdfset[1], upper=6, type="arom", arom=TRUE, inner=FALSE)
```
Count shortest possible rings and their aromaticity assignments by
setting `type="count"` and `inner=TRUE`. The
inner (smallest possible) rings are identified by first computing all
possible rings and then selecting only the inner rings. For more
details, consult the help documentation with `?rings`.
```{r eval=TRUE, tidy=FALSE}
rings(sdfset[1:4], upper=Inf, type="count", arom=TRUE, inner=TRUE)
```
[Back to Table of Contents]()
Rendering Chemical Structure Images
===================================
R Graphics Device
-----------------
A new plotting function for compound structures has been added to the
package recently. This function uses the native R graphics device for
generating compound depictions. At this point this function is still in
an experimental developmental stage but should become stable soon.
If you have `ChemmineOB` available you can use the `regenCoords`
option to have OpenBabel regenerate the coordinates for the compound.
This can sometimes produce better looking plots.
Plot compound structures with R's graphics device:
```{r plotstruct2, eval=TRUE, tidy=FALSE}
data(sdfsample)
sdfset <- sdfsample
plot(sdfset[1:4], regenCoords=TRUE,print=FALSE) # 'print=TRUE' returns SDF summaries
```
Customized plots:
```{r eval=FALSE, tidy=FALSE}
plot(sdfset[1:4], griddim=c(2,2), print_cid=letters[1:4], print=FALSE,
noHbonds=FALSE)
```
In the following plot, the atom block position numbers in the SDF are
printed next to the atom symbols (`atomnum = TRUE`). For
more details, consult help documentation with
`?plotStruc` or `?plot`.
```{r plotstruct3, eval=TRUE, tidy=FALSE}
plot(sdfset["CMP1"], atomnum = TRUE, noHbonds=FALSE , no_print_atoms = "",
atomcex=0.8, sub=paste("MW:", MW(sdfsample["CMP1"])), print=FALSE)
```
Substructure highlighting by atom numbers:
```{r plotstruct4, eval=TRUE, tidy=FALSE}
plot(sdfset[1], print=FALSE, colbonds=c(22,26,25,3,28,27,2,23,21,18,8,19,20,24))
```
[Back to Table of Contents]()
Data Tables
------------
Compound images and data can also be viewed in a web browser.
This allows you to page through the table, as well as filter
the results using the search box. Results can be sorted on
any column by clicking on the column title. Compound images are
rendered as SVGs, so you can zoom in on them to see more details.
```{r datatable, eval=FALSE, tidy=FALSE}
data(sdfsample)
SDFDataTable(sdfsample[1:5])
```
Online with ChemMine Tools
--------------------------
Alternatively, one can visualize compound structures with a standard web
browser using the online ChemMine Tools service.
Plot structures using web service ChemMine Tools:
```{r eval=FALSE, tidy=FALSE}
sdf.visualize(sdfset[1:4])
```
![Figure: Visualization webpage created by calling
`sdf.visualize`.](visualizescreenshot-small.png )
[Back to Table of Contents]()
Similarity Comparisons and Searching
====================================
Maximum Common Substructure (MCS) Searching
-------------------------------------------
The `ChemmineR` add-on package
[`fmcsR`](http://www.bioconductor.org/packages/devel/bioc/html/fmcsR.html)
provides support for identifying maximum common substructures (MCSs) and
flexible MCSs among compounds. The algorithm can be used for pairwise
compound comparisons, structure similarity searching and clustering. The
manual describing this functionality is available
[here](http://www.bioconductor.org/packages/devel/bioc/vignettes/fmcsR/inst/doc/fmcsR.html)
and the associated publication is Wang et al. [-@Wang_2013]. The following gives a
short preview of some functionalities provided by the
`fmcsR` package.
```{r plotmcs, eval=TRUE, tidy=FALSE}
library(fmcsR)
data(fmcstest) # Loads test sdfset object
test <- fmcs(fmcstest[1], fmcstest[2], au=2, bu=1) # Searches for MCS with mismatches
plotMCS(test) # Plots both query compounds with MCS in color
```
[Back to Table of Contents]()
AP/APset Classes for Storing Atom Pair Descriptors
--------------------------------------------------
The function `sdf2ap` computes atom pair descriptors for
one or many compounds [@Carhart_1985; @Chen_2002]. It returns a searchable atom pair database
stored in a container of class `APset`, which can be used
for structural similarity searching and clustering. As similarity
measure, the Tanimoto coefficient or related coefficients can be used.
An `APset` object consists of one or many
`AP` entries each storing the atom pairs of a single
compound. Note: the deprecated `cmp.parse` function, which also
generates atom pair descriptor databases but directly from an SD file,
is still available. Since it is less flexible, it
may be discontinued in the future.
Generate atom pair descriptor database for searching:
```{r eval=TRUE, tidy=FALSE}
ap <- sdf2ap(sdfset[[1]]) # For single compound
ap
```
```{r eval=FALSE, tidy=FALSE}
apset <- sdf2ap(sdfset) # For many compounds
```
```{r eval=TRUE, tidy=FALSE}
view(apset[1:4])
```
Return main components of APset objects:
```{r eval=FALSE, tidy=FALSE}
cid(apset[1:4]) # Compound IDs
ap(apset[1:4]) # Atom pair descriptors
db.explain(apset[1]) # Return atom pairs in human readable format
```
Coerce APset to other objects:
```{r eval=FALSE, tidy=FALSE}
apset2descdb(apset) # Returns old list-style AP database
tmp <- as(apset, "list") # Returns list
as(tmp, "APset") # Converts list back to APset
```
[Back to Table of Contents]()
Large SDF and Atom Pair Databases
---------------------------------
When working with large data sets it is often desirable to save the
`SDFset` and `APset` containers as binary
R objects to files for later use. This way they can be loaded very
quickly into a new R session without recreating them every time from
scratch.
Save and load of `SDFset` and `APset`
containers:
```{r eval=FALSE, tidy=FALSE}
save(sdfset, file = "sdfset.rda", compress = TRUE)
load("sdfset.rda")
save(apset, file = "apset.rda", compress = TRUE)
load("apset.rda")
```
[Back to Table of Contents]()
Pairwise Compound Comparisons with Atom Pairs
---------------------------------------------
The `cmp.similarity` function computes the atom pair
similarity between two compounds using the Tanimoto coefficient as
similarity measure. The coefficient is defined as *c/(a+b+c)*, which
is the number of atom pairs shared by the two compounds divided
by the size of their union. The variable *c* is the number of atom pairs common to
both compounds, while *a* and *b* are the numbers of their unique
atom pairs.
```{r eval=TRUE, tidy=FALSE}
cmp.similarity(apset[1],
apset[2])
cmp.similarity(apset[1], apset[1])
```
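The formula *c/(a+b+c)* can be reproduced on toy data with base R set operations. The function `tanimoto_sets` below is a hypothetical helper that treats each compound as a set of unique descriptor strings. It is meant to illustrate the definition, not to replace `cmp.similarity`.

```{r eval=FALSE, tidy=FALSE}
# Hypothetical helper implementing c / (a + b + c) on descriptor sets
tanimoto_sets <- function(x, y) {
  c_shared <- length(intersect(x, y))   # descriptors common to both
  a_only   <- length(setdiff(x, y))     # descriptors unique to x
  b_only   <- length(setdiff(y, x))     # descriptors unique to y
  c_shared / (a_only + b_only + c_shared)
}

cmp_x <- c("ap1", "ap2", "ap3", "ap4")
cmp_y <- c("ap3", "ap4", "ap5")
tanimoto_sets(cmp_x, cmp_y)   # 2 / (2 + 1 + 2) = 0.4
tanimoto_sets(cmp_x, cmp_x)   # identical sets give 1
```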
[Back to Table of Contents]()
Similarity Searching with Atom Pairs
------------------------------------
The `cmp.search` function searches an atom pair database
for compounds that are similar to a query compound. The following
example returns a data frame where the rows are sorted by the Tanimoto
similarity score (best to worst). The first column contains the indices
of the matching compounds in the database. The argument `cutoff` can be a
similarity cutoff, meaning only compounds with a similarity value larger
than this cutoff will be returned; or it can be an integer value
restricting how many compounds will be returned. When supplying a cutoff
of 0, the function will return the similarity values for every compound
in the database.
```{r eval=TRUE, tidy=FALSE}
cmp.search(apset,
apset["650065"], type=3, cutoff = 0.3, quiet=TRUE)
```
Alternatively, the
function can return the matches in the form of an index or a named vector if
the `type` argument is set to `1` or
`2`, respectively.
```{r eval=TRUE, tidy=FALSE}
cmp.search(apset, apset["650065"], type=1, cutoff = 0.3, quiet=TRUE)
cmp.search(apset, apset["650065"], type=2, cutoff = 0.3, quiet=TRUE)
```
[Back to Table of Contents]()
FP/FPset Classes for Storing Fingerprints
-----------------------------------------
The `FPset` class stores fingerprints of small molecules
in a matrix-like representation where every molecule is encoded as a
fingerprint of the same type and length. The `FPset`
container acts as a searchable database that contains the fingerprints
of many molecules. The `FP` container holds only one
fingerprint. Several constructor and coerce methods are provided to
populate `FP/FPset` containers with fingerprints, while
supporting any type and length of fingerprints. For instance, the
function `desc2fp` generates fingerprints from an atom
pair database stored in an `APset`, and
`as(matrix, "FPset")` and `as(character, "FPset")` construct an `FPset` database from
objects where the fingerprints are represented as
`matrix` or `character` objects,
respectively.
Show slots of `FPset` class:
```{r eval=TRUE, tidy=FALSE}
showClass("FPset")
```
Instance of `FPset` class:
```{r eval=TRUE, tidy=FALSE}
data(apset)
fpset <- desc2fp(apset)
view(fpset[1:2])
```
`FPset` class usage:
```{r eval=TRUE, tidy=FALSE}
fpset[1:4] # behaves like a list
fpset[[1]] # returns FP object
length(fpset) # number of compounds
cid(fpset) # returns compound ids
fpset[10] <- 0 # replacement of 10th fingerprint to all zeros
cid(fpset) <- 1:length(fpset) # replaces compound ids
c(fpset[1:4], fpset[11:14]) # concatenation of several FPset objects
```
Construct `FPset` class from `matrix`:
```{r eval=TRUE, tidy=FALSE}
fpma <- as.matrix(fpset) # coerces FPset to matrix
as(fpma, "FPset")
```
Construct `FPset` class from `character vector`:
```{r eval=TRUE, tidy=FALSE}
fpchar <- as.character(fpset) # coerces FPset to character strings
as(fpchar, "FPset") # construction of FPset class from character vector
```
Compound similarity searching with `FPset`:
```{r eval=TRUE, tidy=FALSE}
fpSim(fpset[1], fpset, method="Tanimoto", cutoff=0.4, top=4)
```
Folding fingerprints:
```{r eval=TRUE,tidy=FALSE}
fold(fpset) # fold each FP once
fold(fpset, count=2) #fold each FP twice
fold(fpset, bits=128) #fold each FP down to 128 bits
fold(fpset[[1]]) # fold an individual FP
fptype(fpset) # get type of FPs
numBits(fpset) # get the number of bits of each FP
foldCount(fold(fpset)) # the number of times an FP or FPset has been folded
```
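Folding halves the length of a fingerprint while trying to preserve which bits are set. A common scheme, assumed here for illustration and not confirmed against the implementation of `fold`, is to combine the first and second half of the bit vector with a logical OR. The function `fold_once` is a hypothetical base R sketch of that scheme.

```{r eval=FALSE, tidy=FALSE}
# Assumed folding scheme: OR the first half of the bits with the second half
fold_once <- function(bits) {
  stopifnot(length(bits) %% 2 == 0)
  half <- length(bits) / 2
  as.integer(bits[1:half] | bits[(half + 1):length(bits)])
}

toy_fp <- c(1, 0, 0, 1, 0, 0, 1, 0)
fold_once(toy_fp)              # 8 bits -> 4 bits: 1 0 1 1
fold_once(fold_once(toy_fp))   # folded twice -> 2 bits: 1 1
```

Each fold trades resolution for size: distinct bits can collide after folding, so similarity values computed on folded fingerprints are approximations.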
[Back to Table of Contents]()
Atom Pair Fingerprints
----------------------
Atom pairs can be converted into binary atom pair fingerprints of fixed
length. Computations on this compact data structure are more time and
memory efficient than on their relatively complex atom pair
counterparts. The function `desc2fp` generates
fingerprints from descriptor vectors of variable length such as atom
pairs stored in `APset` or `list`
containers. The obtained fingerprints can be used for structure
similarity comparisons, searching and clustering.
Create atom pair sample data set:
```{r eval=FALSE, tidy=FALSE}
data(sdfsample)
sdfset <- sdfsample[1:10]
apset <- sdf2ap(sdfset)
```
Compute atom pair fingerprint database using internal atom pair
selection containing the 4096 most common atom pairs identified in
DrugBank's compound collection. For details see `?apfp`.
The following example uses from this set the 1024 most frequent atom
pairs:
```{r eval=FALSE, tidy=FALSE}
fpset <- desc2fp(apset, descnames=1024, type="FPset")
```
Alternatively, one can provide any custom atom pair selection. Here, the
1024 most common ones in `apset`:
```{r eval=FALSE, tidy=FALSE}
fpset1024 <- names(rev(sort(table(unlist(as(apset, "list")))))[1:1024])
fpset <- desc2fp(apset, descnames=fpset1024, type="FPset")
```
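The conversion from variable-length descriptor lists to fixed-length binary fingerprints can be pictured in base R. In this sketch each entry of a chosen descriptor selection defines one bit position, and a bit is set when the compound contains that descriptor. The toy descriptor names are invented, and the code only illustrates the concept behind `desc2fp`.

```{r eval=FALSE, tidy=FALSE}
# Toy descriptor lists of different length (stand-ins for atom pairs)
toy_desc <- list(cmp1 = c("ap_a", "ap_b", "ap_c"),
                 cmp2 = c("ap_b", "ap_d"))

# Fixed selection of descriptors; each one defines a bit position
selection <- c("ap_a", "ap_b", "ap_c", "ap_d", "ap_e")

# One row per compound, one column per selected descriptor
toy_fpma <- t(sapply(toy_desc, function(d) as.integer(selection %in% d)))
colnames(toy_fpma) <- selection
toy_fpma
```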
A more compact way of storing fingerprints is as character values:
```{r eval=FALSE, tidy=FALSE}
fpchar <- desc2fp(x=apset, descnames=1024, type="character")
fpchar <- as.character(fpset)
```
Converting a fingerprint database to a matrix and vice versa:
```{r eval=FALSE, tidy=FALSE}
fpma <- as.matrix(fpset)
fpset <- as(fpma, "FPset")
```
Similarity searching and returning Tanimoto similarity coefficients:
```{r eval=FALSE, tidy=FALSE}
fpSim(fpset[1], fpset, method="Tanimoto")
```
Under `method` one can choose from several predefined
similarity measures including *Tanimoto* (default),
*Euclidean*, *Tversky* or
*Dice*. Alternatively, one can pass on custom similarity
functions.
```{r eval=FALSE, tidy=FALSE}
fpSim(fpset[1], fpset, method="Tversky", cutoff=0.4, top=4, alpha=0.5, beta=1)
```
Example for using a custom similarity function:
```{r eval=FALSE, tidy=FALSE}
myfct <- function(a, b, c, d) c/(a+b+c+d)
fpSim(fpset[1], fpset, method=myfct)
```
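The four counts that a custom function receives can be computed by hand for two short bit vectors. The sketch below assumes that *c* is the number of on-bits common to both fingerprints, *a* and *b* are the on-bits unique to each, and *d* is the number of off-bits common to both. The Dice and Tversky formulas are the standard textbook definitions and are assumed, not verified, to match those used by `fpSim`.

```{r eval=FALSE, tidy=FALSE}
fp_x <- c(1, 1, 0, 1, 0, 0, 1, 0)
fp_y <- c(1, 0, 0, 1, 1, 0, 1, 0)

n_c <- sum(fp_x == 1 & fp_y == 1)   # on-bits common to both: 3
n_a <- sum(fp_x == 1 & fp_y == 0)   # on-bits unique to fp_x: 1
n_b <- sum(fp_x == 0 & fp_y == 1)   # on-bits unique to fp_y: 1
n_d <- sum(fp_x == 0 & fp_y == 0)   # off-bits common to both: 3

toy_coef <- c(
  Tanimoto = n_c / (n_a + n_b + n_c),             # 0.6
  Dice     = 2 * n_c / (n_a + n_b + 2 * n_c),     # 0.75
  Tversky  = n_c / (0.5 * n_a + 1 * n_b + n_c),   # alpha=0.5, beta=1
  custom   = n_c / (n_a + n_b + n_c + n_d)        # 0.375, same form as myfct
)
toy_coef
```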
Clustering example:
```{r eval=FALSE, tidy=FALSE}
simMAap <- sapply(cid(fpset), function(x) fpSim(x=fpset[x], fpset, sorted=FALSE))
hc <- hclust(as.dist(1-simMAap), method="single")
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)
```
[Back to Table of Contents]()
Fingerprint E-values
---------------------
The `fpSim` function can also return Z-scores, E-values, and p-values
if given a set of score distribution parameters. These parameters can
be computed over an `FPset` with the `genParameters` function.
```{r eval=TRUE, tidy=FALSE}
params <- genParameters(fpset)
```
This function will compute all pairwise distances between the given
fingerprints and then fit a Beta distribution to the resulting
Tanimoto scores, conditioned on the number of set bits in each
fingerprint. For large data sets where you would not want to compute
all pairwise distances, you can set what fraction to sample with the
`sampleFraction` argument. This step only needs to be done once for
each database of `FPset` objects. Alternatively, if you have a large
database of fingerprints, or you believe that the parameters computed
on a single database are more generally applicable, you can use the
resulting parameters for other databases as well.
Once you have a set of parameters, you can pass them to `fpSim` with
the `parameters` argument.
```{r eval=TRUE, tidy=FALSE}
fpSim(fpset[[1]], fpset, top=10, parameters=params)
```
This will then return a data frame with the similarity, Z-score,
E-value, and p-value. You can change which value is used for the
cutoff and for sorting by setting the argument `scoreType` to one of
these scores. In this way you could set an E-value cutoff of 0.04, for
example.
```{r eval=TRUE, tidy=FALSE}
fpSim(fpset[[1]], fpset, cutoff=0.04, scoreType="evalue", parameters=params)
```
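How a similarity score can be turned into a p-value and an E-value is sketched below in base R. Several simplifications are assumed: the Beta parameters are estimated by the method of moments rather than by whatever fitting procedure `genParameters` uses, the fit is not conditioned on the number of set bits, and the E-value is taken to be the p-value multiplied by a hypothetical database size, following the usual convention in database searching.

```{r eval=FALSE, tidy=FALSE}
# Toy background scores standing in for pairwise Tanimoto values
bg <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.3, 0.4, 0.5)
m <- mean(bg)
v <- var(bg)

# Method-of-moments estimates of the Beta shape parameters
k <- m * (1 - m) / v - 1
shape1 <- m * k
shape2 <- (1 - m) * k

# Upper-tail probability of a score at least this high under the fitted Beta
p_value <- function(score) pbeta(score, shape1, shape2, lower.tail = FALSE)

db_size <- 1000   # hypothetical database size
score <- 0.8
c(zscore = (score - m) / sqrt(v),
  pvalue = p_value(score),
  evalue = db_size * p_value(score))   # assumed: E-value = N * p-value
```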
[Back to Table of Contents]()
Pairwise Compound Comparisons with PubChem Fingerprints
-------------------------------------------------------
The `fpSim` function computes the similarity coefficients
(*e.g.* Tanimoto) for pairwise comparisons of binary fingerprints. For
this data type, *c* is the number of "on-bits" common to both
compounds, and *a* and *b* are the numbers of their unique
"on-bits". Currently, the PubChem fingerprints need to be provided (here
from PubChem's SD files) and cannot be computed from scratch in
`ChemmineR`. The PubChem fingerprint specifications can
be loaded with `data(pubchemFPencoding)`.
Convert base 64 encoded PubChem fingerprints to
`character` vector, `matrix` or
`FPset` object:
```{r eval=TRUE, tidy=FALSE}
cid(sdfset) <- sdfid(sdfset)
fpset <- fp2bit(sdfset, type=1)
fpset <- fp2bit(sdfset, type=2)
fpset <- fp2bit(sdfset, type=3)
fpset
```
Pairwise compound structure comparisons:
```{r eval=TRUE, tidy=FALSE}
fpSim(fpset[1], fpset[2])
```
[Back to Table of Contents]()
Similarity Searching with PubChem Fingerprints
----------------------------------------------
Similarly, the `fpSim` function provides search
functionality for PubChem fingerprints:
```{r eval=TRUE, tidy=FALSE}
fpSim(fpset["650065"], fpset, method="Tanimoto", cutoff=0.6, top=6)
```
[Back to Table of Contents]()
Visualize Similarity Search Results
-----------------------------------
The `cmp.search` function can also visualize the
chemical structures of the search results. Similar but more flexible
chemical structure rendering functions are `plot` and
`sdf.visualize`, described above. By setting the `visualize`
argument in `cmp.search` to `TRUE`, the
matching compounds and their scores can be visualized with a standard
web browser. Depending on the `visualize.browse`
argument, a URL will be printed or a webpage will be opened showing the
structures of the matching compounds.
View similarity search results in R's graphics device:
```{r search_result, eval=TRUE, tidy=FALSE}
cid(sdfset) <- cid(apset) # Ensure compound name consistency among objects.
plot(sdfset[names(cmp.search(apset, apset["650065"], type=2, cutoff=4, quiet=TRUE))], print=FALSE)
```
View results online with ChemMine Tools:
```{r eval=FALSE, tidy=FALSE}
similarities <- cmp.search(apset, apset[1], type=3, cutoff = 10)
sdf.visualize(sdfset[similarities[,1]])
```
[Back to Table of Contents]()
Clustering
==========
Clustering Identical or Very Similar Compounds
----------------------------------------------
Often it is of interest to identify very similar or identical compounds
in a compound set. The `cmp.duplicated` function can be
used to quickly identify very similar compounds in atom pair sets, which
will be frequently, but not necessarily, identical compounds.
Identify compounds with identical AP sets:
```{r eval=TRUE, tidy=FALSE}
cmp.duplicated(apset, type=1)[1:4] # Returns AP duplicates as logical vector
cmp.duplicated(apset, type=2)[1:4,] # Returns AP duplicates as data frame
```
Plot the structure of two pairs of duplicates:
```{r duplicates, eval=TRUE, tidy=FALSE}
plot(sdfset[c("650059","650060", "650065", "650066")], print=FALSE)
```
Remove AP duplicates from SDFset and APset objects:
```{r eval=TRUE, tidy=FALSE}
apdups <- cmp.duplicated(apset, type=1)
sdfset[which(!apdups)]; apset[which(!apdups)]
```
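The notion of duplicates by identical descriptor sets can be illustrated in base R. The sketch builds an order-independent key for each compound and flags repeated keys, which yields a logical vector comparable in spirit to the `type=1` result above. The toy data are invented, and the approach is not taken from the implementation of `cmp.duplicated`.

```{r eval=FALSE, tidy=FALSE}
toy_sets <- list(cmpA = c("ap1", "ap2", "ap3"),
                 cmpB = c("ap1", "ap4"),
                 cmpC = c("ap3", "ap1", "ap2"))   # same set as cmpA

# Build an order-independent key per compound, then flag repeated keys
keys <- vapply(toy_sets, function(d) paste(sort(unique(d)), collapse = " "),
               character(1))
toy_dups <- duplicated(keys)
toy_dups                     # FALSE FALSE TRUE
names(toy_sets)[!toy_dups]   # "cmpA" "cmpB"
```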
Alternatively, one can identify duplicates via other descriptor types if
they are provided in the data block of an imported SD file. For
instance, one can use fingerprints, InChI, SMILES or other
molecular representations. The following examples show how to count
compounds with identical InChI strings, SMILES strings and molecular formulas,
respectively.
```{r eval=TRUE, tidy=FALSE}
count <- table(datablocktag(sdfset,
tag="PUBCHEM_NIST_INCHI"))
count <- table(datablocktag(sdfset, tag="PUBCHEM_OPENEYE_CAN_SMILES"))
count <- table(datablocktag(sdfset, tag="PUBCHEM_MOLECULAR_FORMULA"))
count[1:4]
```
[Back to Table of Contents]()
Binning Clustering
------------------
Compound libraries can be clustered into discrete similarity groups with
the binning clustering function `cmp.cluster`. The
function accepts as input an atom pair (`APset`) or a
fingerprint (`FPset`) descriptor database as well as a
similarity threshold. The binning clustering result is returned in form
of a data frame. Single linkage is used for cluster joining. The
function calculates the required compound-to-compound distance
information on the fly, while a memory-intensive distance matrix is only
created upon user request via the `save.distances`
argument (see below).
Because an optimum similarity threshold is often not known, the
`cmp.cluster` function can calculate cluster results for
multiple cutoffs in one step with almost the same speed as for a single
cutoff. This can be achieved by providing several cutoffs under the
cutoff argument. The clustering results for the different cutoffs will
be stored in one data frame.
One may force the `cmp.cluster` function to calculate and
store the distance matrix by supplying a file name to the
`save.distances` argument. The generated distance matrix
can be loaded and passed on to many other clustering methods available
in R, such as the hierarchical clustering function
`hclust` (see below).
If a distance matrix is available, it may also be supplied to
`cmp.cluster` via the `use.distances`
argument. This is useful when one has a pre-computed distance matrix
either from a previous call to `cmp.cluster` or from
other distance calculation subroutines.
Single-linkage binning clustering with one or multiple cutoffs:
```{r eval=TRUE, tidy=FALSE}
clusters <- cmp.cluster(db=apset, cutoff = c(0.7, 0.8, 0.9), quiet = TRUE)
clusters[1:12,]
```
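The idea behind single-linkage binning can be illustrated in base R without `ChemmineR`: items end up in the same bin when they are connected by a chain of pairs whose similarity reaches the cutoff. The sketch below uses `hclust` and `cutree` on a made-up similarity matrix. It is not the implementation used by `cmp.cluster`, and treating a pair exactly at the cutoff as joined is an assumption of this sketch. The column labels are hypothetical.

```r
# Toy similarity matrix for five items (made-up values)
sim <- matrix(c(1.0, 0.9, 0.2, 0.1,  0.1,
                0.9, 1.0, 0.3, 0.1,  0.2,
                0.2, 0.3, 1.0, 0.75, 0.1,
                0.1, 0.1, 0.75, 1.0, 0.2,
                0.1, 0.2, 0.1, 0.2,  1.0),
              nrow = 5, byrow = TRUE,
              dimnames = list(letters[1:5], letters[1:5]))

# Single-linkage tree on the distances (1 - similarity)
hc <- hclust(as.dist(1 - sim), method = "single")

# Cutting the tree at height 1 - cutoff yields the bins for that cutoff
cutoffs <- c(0.7, 0.8)
bins <- sapply(cutoffs, function(cutoff) cutree(hc, h = 1 - cutoff))
colnames(bins) <- paste0("CLID_", cutoffs)
bins
```

Because the tree is built only once, bins for several cutoffs come almost for free, which mirrors the multi-cutoff behavior described above.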
Clustering of `FPset` objects with multiple cutoffs. This
method allows one to call the various similarity methods provided by the
`fpSim` function. For details consult
`?fpSim`.
```{r eval=TRUE, tidy=FALSE}
fpset <- desc2fp(apset)
clusters2 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7, 0.9), method="Tanimoto", quiet=TRUE)
clusters2[1:12,]
```
Same as above, but using the Tversky similarity measure:
```{r eval=TRUE, tidy=FALSE}
clusters3 <- cmp.cluster(fpset, cutoff=c(0.5, 0.7, 0.9),
method="Tversky", alpha=0.3, beta=0.7, quiet=TRUE)
```
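For readers unfamiliar with the Tversky index, the following base R sketch computes it for two binary fingerprints. The function `tversky` and the example vectors are hypothetical illustrations, not part of `ChemmineR`. The sketch uses the standard definition, in which the number of shared bits is divided by the shared bits plus the weighted counts of bits unique to each fingerprint.

```r
# Tversky similarity between two binary fingerprints x and y
tversky <- function(x, y, alpha = 1, beta = 1) {
  stopifnot(length(x) == length(y))
  x <- as.logical(x)
  y <- as.logical(y)
  common <- sum(x & y)     # bits set in both fingerprints
  only_x <- sum(x & !y)    # bits set only in x
  only_y <- sum(!x & y)    # bits set only in y
  common / (common + alpha * only_x + beta * only_y)
}

fp1 <- c(1, 1, 0, 1, 0, 1, 0, 0)
fp2 <- c(1, 0, 0, 1, 1, 1, 0, 0)

tanimoto <- tversky(fp1, fp2)                     # alpha = beta = 1 gives Tanimoto
weighted <- tversky(fp1, fp2, alpha = 0.3, beta = 0.7)
c(tanimoto = tanimoto, weighted = weighted)
```

With unequal `alpha` and `beta` the measure becomes asymmetric, so swapping the two fingerprints can change the result. Two fingerprints without any set bits yield `NaN` in this sketch.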
Return cluster size distributions for each cutoff:
```{r eval=TRUE, tidy=FALSE}
cluster.sizestat(clusters, cluster.result=1)
cluster.sizestat(clusters, cluster.result=2)
cluster.sizestat(clusters, cluster.result=3)
```
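A cluster size distribution is simply a count of how many clusters have each size. The base R sketch below shows the idea on made-up cluster assignments. It is an illustration only and makes no claim about how `cluster.sizestat` is implemented.

```r
# Made-up cluster assignments for ten items
cluster_id <- c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6)

# Number of members in each cluster
sizes <- table(cluster_id)

# Number of clusters for each cluster size
size_dist <- table(cluster_size = as.vector(sizes))
size_dist
```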
Enforce calculation of distance matrix:
```{r eval=FALSE, tidy=FALSE}
clusters <- cmp.cluster(db=apset, cutoff = c(0.65, 0.5, 0.3),
save.distances="distmat.rda") # Saves distance matrix to file "distmat.rda" in current working directory.
load("distmat.rda") # Loads distance matrix.
```
[Back to Table of Contents]()
Jarvis-Patrick Clustering
-------------------------
The Jarvis-Patrick clustering algorithm is widely used in
cheminformatics [@greycite13371]. It requires a nearest neighbor table, which consists
of *j* nearest neighbors for each item (*e.g.* compound).
The nearest neighbor table is then used to join items into clusters when
they meet the following requirements: (a) they are contained in each
other's neighbor list and (b) they share at least *k*
nearest neighbors. The values for *j* and
*k* are user-defined parameters. The
`jarvisPatrick` function implemented in
`ChemmineR` takes a nearest neighbor table generated by
`nearestNeighbors`, which works for
`APset` and `FPset` objects. This function
takes either the standard Jarvis-Patrick *j* parameter
(as the `numNbrs` parameter), or else a
`cutoff` value, which is an extension to the basic
algorithm that we have added. Given a cutoff value, the nearest neighbor
table returned contains every neighbor with a similarity greater than
the cutoff value, for each item. This allows one to generate tighter
clusters and to minimize certain limitations of this method, such as
false joins of completely unrelated items when operating on small data
sets. The `trimNeighbors` function can also be used to
take an existing nearest neighbor table and remove all neighbors whose
similarity value is below a given cutoff value. This allows one to
compute a very relaxed nearest neighbor table initially, and then
quickly try different refinements later.
In case an existing nearest neighbor matrix needs to be used, the
`fromNNMatrix` function can be used to transform it into
the list structure that `jarvisPatrick` requires. The
input matrix must have a row for each compound, and each row should hold
the index values of the neighbors of the compound represented by that row.
The names of each compound can also be given through the
`names` argument. If not given, it will attempt to use
the `rownames` of the given matrix.
The `jarvisPatrick` function also allows one to relax
some of the requirements of the algorithm through the
`mode` parameter. When set to "a1a2b", then all
requirements are used. If set to "a1b", then (a) is relaxed to a
unidirectional requirement. Lastly, if `mode` is set to
"b", then only requirement (b) is used, which means that all pairs of
items will be checked to see if (b) is satisfied between them. The size
of the clusters generated by the different methods increases in this
order: "a1a2b" < "a1b" < "b". The run time of method "a1a2b" follows a
close to linear relationship, while it is nearly quadratic for the much
more exhaustive method "b". Only methods "a1a2b" and "a1b" are suitable
for clustering very large data sets (e.g. \>50,000 items) in a
reasonable amount of time.
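The two requirements can be made concrete with a small base R sketch. The function `jp_sketch` below is hypothetical and is not the `jarvisPatrick` implementation. It assumes that neighbor lists are given as integer index vectors that exclude the item itself, applies both requirements as in mode "a1a2b", and merges two clusters as soon as one pair of items qualifies. It loops over all pairs, so it is only meant for tiny examples.

```r
# nn: list of integer vectors; nn[[i]] holds the indices of item i's neighbors
# k:  minimum number of shared neighbors, requirement (b)
jp_sketch <- function(nn, k) {
  n <- length(nn)
  cluster <- seq_len(n)                # each item starts in its own cluster
  if (n < 2) return(cluster)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      # Requirement (a): i and j are in each other's neighbor list
      mutual <- (j %in% nn[[i]]) && (i %in% nn[[j]])
      # Requirement (b): i and j share at least k neighbors
      shared <- length(intersect(nn[[i]], nn[[j]])) >= k
      if (mutual && shared) {
        # Merge the cluster of j into the cluster of i
        cluster[cluster == cluster[j]] <- cluster[i]
      }
    }
  }
  cluster
}

# Made-up neighbor lists for six items, three neighbors each
nn <- list(c(2, 3, 4), c(1, 3, 4), c(1, 2, 4),
           c(5, 3, 6), c(4, 6, 3), c(5, 4, 1))
jp_clusters <- jp_sketch(nn, k = 2)
jp_clusters
```

In this example items 3 and 4 list each other as neighbors but share no further neighbors, so requirement (b) keeps them in separate clusters.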
An additional extension to the algorithm is the ability to set the
linkage mode. The `linkage` parameter can be one of
"single", "average", or "complete", for single linkage, average linkage
and complete linkage merge requirements, respectively. In the context of
Jarvis-Patrick, average linkage means that at least half of the pairs
between the clusters under consideration must meet requirement (b).
Similarly, for complete linkage, all pairs must meet requirement (b). Single
linkage is the normal case for Jarvis-Patrick and just means that at
least one pair must meet requirement (b).
The output is a cluster `vector` with the item labels in
the name slot and the cluster IDs in the data slot. There is a utility
function called `byCluster`, which takes the cluster
vector output by `jarvisPatrick` and transforms it into a
list of vectors. Each slot of the list is named with a cluster id and
the vector contains the cluster members. By default the function
excludes singletons from the output, but they can be included by setting
`excludeSingletons=FALSE`.
Load/create sample `APset` and `FPset`:
```{r eval=TRUE, tidy=FALSE}
data(apset)
fpset <- desc2fp(apset)
```
Standard Jarvis-Patrick clustering on `APset` and
`FPset` objects:
```{r eval=TRUE, tidy=FALSE}
jarvisPatrick(nearestNeighbors(apset,numNbrs=6), k=5, mode="a1a2b")
#Using "APset"
jarvisPatrick(nearestNeighbors(fpset,numNbrs=6), k=5, mode="a1a2b")
#Using "FPset"
```
The following example runs Jarvis-Patrick clustering with a minimum
similarity `cutoff` value (here Tanimoto coefficient). In
addition, it uses the much more exhaustive `"b"` method
that generates larger cluster sizes, but significantly increases the run
time. For more details, consult the corresponding help file with
`?jarvisPatrick`.
```{r eval=TRUE, tidy=FALSE}
cl<-jarvisPatrick(nearestNeighbors(fpset,cutoff=0.6,
method="Tanimoto"), k=2 ,mode="b")
byCluster(cl)
```
Output nearest neighbor table (`matrix`):
```{r eval=TRUE, tidy=FALSE}
nnm <- nearestNeighbors(fpset,numNbrs=6)
nnm$names[1:4]
nnm$ids[1:4,]
nnm$similarities[1:4,]
```
Trim nearest neighbor table:
```{r eval=TRUE, tidy=FALSE}
nnm <- trimNeighbors(nnm,cutoff=0.4)
nnm$similarities[1:4,]
```
Perform clustering on precomputed nearest neighbor table:
```{r eval=TRUE, tidy=FALSE}
jarvisPatrick(nnm, k=5,mode="b")
```
Using a user defined nearest neighbor matrix:
```{r eval=TRUE, tidy=FALSE}
nn <- matrix(c(1,2,2,1),2,2,dimnames=list(c('one','two')))
nn
byCluster(jarvisPatrick(fromNNMatrix(nn),k=1))
```
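If only a similarity matrix is at hand, a neighbor index matrix of the shape described above (one row per compound, each row holding neighbor indices) can be derived in base R. The helper `nn_from_sim` and the similarity values below are hypothetical. The sketch assumes that an item should not appear in its own neighbor list.

```r
# Build an n x j matrix of neighbor indices from a similarity matrix
nn_from_sim <- function(sim, j) {
  diag(sim) <- -Inf                    # exclude each item from its own list
  rows <- lapply(seq_len(nrow(sim)), function(i) {
    order(sim[i, ], decreasing = TRUE)[seq_len(j)]
  })
  nn <- do.call(rbind, rows)           # one row per item, also for j = 1
  rownames(nn) <- rownames(sim)
  nn
}

# Made-up similarity matrix for four items
sim <- matrix(c(1.0, 0.9, 0.4, 0.1,
                0.9, 1.0, 0.5, 0.2,
                0.4, 0.5, 1.0, 0.8,
                0.1, 0.2, 0.8, 1.0),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d"), c("a", "b", "c", "d")))

nn_matrix <- nn_from_sim(sim, j = 2)
nn_matrix
```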
[Back to Table of Contents]()
Multi-Dimensional Scaling (MDS)
-------------------------------
To visualize and compare clustering results, the
`cluster.visualize` function can be used. The function
performs Multi-Dimensional Scaling (MDS) and visualizes the results in
the form of a scatter plot. It requires as input an `APset`,
a clustering result from `cmp.cluster`, and a cutoff for
the minimum cluster size to consider in the plot. To help determine a
proper cutoff size, the `cluster.sizestat` function is
provided to generate cluster size statistics.
MDS clustering and scatter plot:
```{r eval=FALSE, tidy=FALSE}
cluster.visualize(apset, clusters, size.cutoff=2, quiet = TRUE) # Color codes clusters with at least two members.
cluster.visualize(apset, clusters, quiet = TRUE) # Plots all items.
```
Create a 3D scatter plot of MDS result:
```{r mds_scatter, eval=TRUE, tidy=FALSE}
library(scatterplot3d)
coord <- cluster.visualize(apset, clusters, size.cutoff=1, dimensions=3, quiet=TRUE)
scatterplot3d(coord)
```
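What MDS does can be seen on a small example with the base R function `cmdscale`, which performs classical MDS on a distance matrix. The points below are made up, and whether `cluster.visualize` uses `cmdscale` internally is an assumption that this sketch does not depend on.

```r
# Four made-up points in two dimensions and their pairwise distances
pts <- rbind(a = c(0, 0), b = c(3, 0), c = c(0, 4), d = c(3, 4))
d <- dist(pts)

# Classical MDS: find coordinates whose distances approximate d
coord2d <- cmdscale(d, k = 2)

# Largest deviation between the original and the embedded distances
max_dev <- max(abs(as.vector(dist(coord2d)) - as.vector(d)))
max_dev
```

The recovered coordinates may be rotated or mirrored relative to `pts`, since only the pairwise distances are preserved.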
Interactive 3D scatter plot with Open GL (graphics not evaluated here):
```{r eval=FALSE, tidy=FALSE}
library(rgl)
rgl.open()
offset <- 50
par3d(windowRect=c(offset, offset, 640+offset, 640+offset))
rm(offset)
rgl.clear()
rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1)
spheres3d(coord[,1], coord[,2], coord[,3], radius=0.03, color=coord[,4], alpha=1, shininess=20)
aspect3d(1, 1, 1)
axes3d(col='black')
title3d("", "", "", "", "", col='black')
bg3d("white") # To save a snapshot of the graph, one can use the command rgl.snapshot("test.png").
```
[Back to Table of Contents]()
Clustering with Other Algorithms
--------------------------------
`ChemmineR` allows the user to take advantage of the wide
spectrum of clustering utilities available in R. An example of how to
perform hierarchical clustering with the `hclust` function is given
below.
Create atom pair distance matrix:
```{r ap_dist_matrix, eval=TRUE, tidy=FALSE}
dummy <- cmp.cluster(db=apset, cutoff=0, save.distances="distmat.rda", quiet=TRUE)
load("distmat.rda")
```
Hierarchical clustering with `hclust`:
```{r hclust, eval=TRUE, tidy=FALSE}
hc <- hclust(as.dist(distmat), method="single")
hc[["labels"]] <- cid(apset) # Assign correct item labels
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)
```
Instead of atom pairs one can use PubChem's fingerprints for clustering:
```{r fp_hclust, eval=FALSE, tidy=FALSE}
simMA <- sapply(cid(fpset), function(x) fpSim(fpset[x], fpset, sorted=FALSE))
hc <- hclust(as.dist(1-simMA), method="single")
plot(as.dendrogram(hc), edgePar=list(col=4, lwd=2), horiz=TRUE)
```
Plot dendrogram with heatmap (here similarity matrix):
```{r heatmap, eval=TRUE, tidy=FALSE}
library(gplots)
heatmap.2(1-distmat, Rowv=as.dendrogram(hc), Colv=as.dendrogram(hc),
col=colorpanel(40, "darkblue", "yellow", "white"),
density.info="none", trace="none")
```
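A dendrogram can also be converted into discrete cluster assignments with the base R function `cutree`. The sketch below uses made-up one-dimensional values and complete linkage as an example of an alternative linkage method. The names and values are hypothetical.

```r
# Made-up one-dimensional values for five items
x <- c(a = 0, b = 1, c = 2, d = 10, e = 11)

# Complete-linkage tree on the pairwise distances
hc_complete <- hclust(dist(x), method = "complete")

# Cut the tree into exactly two groups
groups <- cutree(hc_complete, k = 2)
groups
```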
[Back to Table of Contents]()
Searching PubChem
=================
Get Compound SDF from PubChem by Id
-----------------------------------
The function `pubchemCidToSDF` (alias `getIds`) accepts one or more numeric PubChem
compound ids and downloads the corresponding compounds from the PubChem
Power User Gateway (PUG), returning the results in an `SDFset`
container.
Fetch 2 compounds from PubChem:
```{r pubchemCidToSDF, eval=FALSE, tidy=FALSE}
compounds <- pubchemCidToSDF(c(111,123))
compounds
```
[Back to Table of Contents]()
Get Compound SDF from PubChem by InChIkey
-----------------------------------------
The function `pubchemInchikey2sdf` accepts one or more PubChem compound
InChIkeys as character strings and downloads the corresponding compounds from PubChem's
Power User Gateway (PUG). The result is a list of two items. The first item is
the `SDFset` container of all successful queries. The second item is a named numeric
vector that records whether each InChIkey query succeeded. For a successful query,
the vector holds the non-zero index of the matching compound in the `SDFset` object.
For a failed query, it holds `0`.
```{r pubchemInchikey2sdf, eval=FALSE, tidy=FALSE}
inchikeys <- c(
"ZFUYDSOHVJVQNB-FZERPYLPSA-N",
"KONGRWVLXLWGDV-BYGOPZEFSA-N",
"AANKDJLVHZQCFG-WLIQWNBFSA-N",
"SNFRINMTRPQQLE-JQWAAABSSA-N"
)
# Only 2 of the 4 queries are expected to return an SDF; the other 2 are not found
inchikey_query <- pubchemInchikey2sdf(inchikeys)
inchikey_query$sdf_set
# successful queries
inchikey_query_index <- inchikey_query$sdf_index[inchikey_query$sdf_index != 0]
# get CID of these queries
inchikey_query_cid <- cid(inchikey_query$sdf_set[inchikey_query_index])
names(inchikey_query_cid) <- names(inchikey_query_index)
inchikey_query_cid
```
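The index vector can also be used to list the queries that failed. The sketch below works on a mock vector with the shape described above, so it runs without network access. The key names and index values are made up.

```r
# Mock index vector: non-zero = position in the SDFset, 0 = query failed
sdf_index <- c(KEY_A = 1, KEY_B = 0, KEY_C = 2, KEY_D = 0)

# Split the query names into successful and failed ones
found_keys  <- names(sdf_index)[sdf_index != 0]
failed_keys <- names(sdf_index)[sdf_index == 0]

found_keys
failed_keys
```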
[Back to Table of Contents]()
Get Compound CID from PubChem by InChI
--------------------------------------
The function `pubchemInchi2cid` accepts one or more PubChem compound InChI strings
as a character vector and downloads the corresponding compound CIDs from the PubChem
Power User Gateway (PUG), returning the results in a named numeric vector. Successful
requests have empty names, requests with invalid InChI strings have the
name "invalid", and requests with a valid InChI that is not found in PubChem have the
name "not_found". Both "invalid" and "not_found" queries return the CID `0`.
The PubChem API only allows one InChI to be queried at a time, so this
function sends one PubChem API request per InChI. As a courtesy to the service, the rate
is limited to 1 query per second. It is not recommended to parallelize this function.
```{r pubchemInchi2cid, eval=FALSE, tidy=FALSE}
# first two are valid, third has no result, last is invalid
inchis <- c(
"InChI=1S/C15H26O/c1-9(2)11-6-5-10(3)15-8-7-14(4,16)13(15)12(11)15/h9-13,16H,5-8H2,1-4H3/t10-,11+,12-,13+,14+,15-/m1/s1",
"InChI=1S/C3H8/c1-3-2/h3H2,1-2H3",
"InChI=1S/C15H20Br2O2/c1-2-12(17)13-7-3-4-8-14-15(19-13)10-11(18-14)6-5-9-16/h3-4,6,9,11-15H,2,7-8,10H2,1H3/t5-,11-,12+,13+,14-,15-/m1/s1",
"InChI=abc"
)
pubchemInchi2cid(inchis)
```
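The same courtesy applies to any custom loop over a web service. The following base R sketch shows one way to space out calls. The helper `throttled_lapply` is hypothetical and is not how `pubchemInchi2cid` is implemented. The example uses `toupper` as a stand-in for a real request and a short interval so that it finishes quickly.

```r
# Apply `fun` to each element of `x`, waiting at least `min_interval`
# seconds between the start of consecutive calls
throttled_lapply <- function(x, fun, min_interval = 1) {
  out <- vector("list", length(x))
  last_call <- -Inf
  for (i in seq_along(x)) {
    wait <- min_interval - (as.numeric(Sys.time()) - last_call)
    if (wait > 0) Sys.sleep(wait)
    last_call <- as.numeric(Sys.time())
    out[i] <- list(fun(x[[i]]))        # keeps NULL results in place
  }
  out
}

# Stand-in for a web request; a real query function would go here
res <- throttled_lapply(c("a", "b", "c"), toupper, min_interval = 0.05)
res
```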
[Back to Table of Contents]()
Search a SMILES Query in PubChem
--------------------------------
The function `searchString` accepts one SMILES string
(Simplified Molecular Input Line Entry Specification) and performs a
\>0.95 similarity PubChem fingerprint search, returning the hits in an
`SDFset` container. The ChemMine Tools web service is
used as an intermediate, to translate queries from plain HTTP POST to a
PubChem Power User Gateway (PUG) query.
Search a SMILES string on PubChem:
```{r searchString, eval=FALSE, tidy=FALSE}
compounds <- searchString("CC(=O)OC1=CC=CC=C1C(=O)O")
compounds
```
[Back to Table of Contents]()
Search an SDF Query in PubChem
------------------------------
The function `searchSim` performs a PubChem similarity
search just like `searchString`, but accepts a query in
an `SDFset` container. If the query contains more than
one compound, only the first is searched.
Search an `SDFset` container on PubChem:
```{r searchSim, eval=FALSE, tidy=FALSE}
data(sdfsample);
sdfset <- sdfsample[1]
compounds <- searchSim(sdfset)
compounds
```
[Back to Table of Contents]()
ChemMine Tools R Interface
==========================
ChemMine Web Tools is an online service for analyzing and clustering small molecules. It provides numerous cheminformatics tools which can be used directly on the website, or called remotely from within R. When called from within R, jobs are sent remotely to a queue on a compute cluster at UC Riverside, which is a free service offered to `ChemmineR` users.
The website is free and open to all users and is available at . When new tools are added to the service, they automatically become available within `ChemmineR` without updating your local R package.
List all available tools:
```{r listCMTools, eval=FALSE, tidy=FALSE}
listCMTools()
```
```{r eval=TRUE, echo=FALSE}
# cache results from previous code chunk
# NOTE: this must match the code in the previous code chunk but will be
# hidden. Delete cacheFileName to rebuild the cache from web data.
cacheFileName <- "listCMTools.RData"
if(! file.exists(cacheFileName)){
toolList <- listCMTools()
save(list=c("toolList"), file=cacheFileName)
}
load(cacheFileName)
toolList
```
Show options and description for a tool. This also provides an example function call which can be copied
verbatim, and changed as necessary:
```{r toolDetailsCMT, eval=FALSE, tidy=FALSE}
toolDetails("Fingerprint Search")
```
```{r eval=TRUE, echo=FALSE}
# cache results from previous code chunk
# NOTE: this must match the code in the previous code chunk but will be
# hidden. Delete cacheFileName to rebuild the cache from web data.
cacheFileName <- "toolDetails.RData"
if(! file.exists(cacheFileName)){
.serverURL <- "http://chemmine.ucr.edu/ChemmineR/"
library(RCurl)
response <- postForm(paste(.serverURL, "toolDetails", sep = ""), tool_name = "Fingerprint Search")[[1]]
save(list=c("response"), file=cacheFileName)
}
load(cacheFileName)
cat(response)
```
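The caching pattern used in the hidden chunks above (compute once, store the result in a file, and load it from there afterwards) can be wrapped in a small helper. The function `cached` below is a hypothetical sketch. It uses `saveRDS` and `readRDS` instead of the `save` and `load` calls of this vignette, and a mock function in place of a real web query.

```r
# Return the cached value if the file exists, otherwise compute and store it
cached <- function(file, compute) {
  if (!file.exists(file)) {
    saveRDS(compute(), file)
  }
  readRDS(file)
}

# Mock "web query" that counts how often it is actually executed
n_calls <- 0
slow_query <- function() {
  n_calls <<- n_calls + 1
  c("tool A", "tool B")                # made-up result
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached(cache_file, slow_query)   # runs the query, writes the file
second <- cached(cache_file, slow_query)   # served from the file
n_calls
```

Deleting the cache file forces the next call to run the query again.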
[Back to Table of Contents]()
Launch a Job
------------
When a job is launched it returns a job token which refers to the running job on the UC Riverside cluster. You can check the status of a job or obtain the results as follows. If `result` is called on a job that is still running, it will loop internally until the job is completed, and then return the result.
Launch the tool `pubchemID2SDF` to obtain the structure for PubChem cid 2244:
```{r launchCMTool, eval=FALSE, tidy=FALSE}
job1 <- launchCMTool("pubchemID2SDF", 2244)
status(job1)
result1 <- result(job1)
```
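Waiting for a remote job amounts to a polling loop. The base R sketch below illustrates the idea with a mock status function, so it runs without a server. The helper `wait_for`, the status strings "RUNNING" and "FINISHED", and the retry limit are all assumptions for illustration and do not describe the internals of `result` or `status`.

```r
# Poll `get_status` until it reports "FINISHED"; return the number of checks
wait_for <- function(get_status, interval = 0.01, max_tries = 100) {
  for (i in seq_len(max_tries)) {
    if (identical(get_status(), "FINISHED")) return(invisible(i))
    Sys.sleep(interval)
  }
  stop("job did not finish within ", max_tries, " status checks")
}

# Mock job that reports "RUNNING" for the first `n_running` checks
make_mock_status <- function(n_running) {
  calls <- 0
  function() {
    calls <<- calls + 1
    if (calls <= n_running) "RUNNING" else "FINISHED"
  }
}

checks <- wait_for(make_mock_status(2))
checks
```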
Use the previous result to search PubChem for similar compounds:
```{r fingerprintSearchCMT, eval=FALSE, tidy=FALSE}
job2 <- launchCMTool('Fingerprint Search', result1, 'Similarity Cutoff'=0.95, 'Max Compounds Returned'=200)
result2 <- result(job2)
job3 <- launchCMTool("pubchemID2SDF", result2)
result3 <- result(job3)
```
Compute OpenBabel descriptors for these search results:
```{r obDescriptorsCMT, eval=FALSE, tidy=FALSE}
job4 <- launchCMTool("OpenBabel Descriptors", result3)
result4 <- result(job4)
result4[1:10,] # show first 10 lines of result
```
```{r eval=TRUE, echo=FALSE}
# cache results from previous code chunk
# NOTE: this must match the code in the previous code chunk but will be
# hidden. Delete cacheFileName to rebuild the cache from web data.
cacheFileName <- "launchCMTool.RData"
if(! file.exists(cacheFileName)){
job1 <- launchCMTool("pubchemID2SDF", 2244)
status(job1)
result1 <- result(job1)
job2 <- launchCMTool('Fingerprint Search', result1, 'Similarity Cutoff'=0.95, 'Max Compounds Returned'=200)
result2 <- result(job2)
job3 <- launchCMTool("pubchemID2SDF", result2)
result3 <- result(job3)
job4 <- launchCMTool("OpenBabel Descriptors", result3)
result4 <- result(job4)
save(list=c("result4"), file=cacheFileName)
}
load(cacheFileName)
result4[1:10,]
```
[Back to Table of Contents]()
View Job Result in Browser
--------------------------
The function `browseJob` launches a web browser to view the results of a job online, just as if the job
had been run from the ChemMine Tools website itself. If you also want the result data within R, you must first call
the `result` function on the job from within R before calling `browseJob`. Once `browseJob` has been called on a job token,
the results are no longer accessible within R.
If you have an account on ChemMine Tools and would like to save the web results from your job, you must first log in to your account within the default web browser on your system before you launch `browseJob`. The job will then be assigned automatically to the currently logged-in account.
View OpenBabel descriptors online:
```{r obDescriptorsWWW, eval=FALSE, tidy=FALSE}
browseJob(job4)
```
Perform binning clustering and visualize result online:
```{r binningClusterWWW, eval=FALSE, tidy=FALSE}
job5 <- launchCMTool("Binning Clustering", result3, 'Similarity Cutoff'=0.9)
browseJob(job5)
```
[Back to Table of Contents]()
Version Information
===================
```{r sessionInfo, results='asis'}
sessionInfo()
```
[Back to Table of Contents]()
Funding
=======
This software was developed with funding from the National Science
Foundation: [ABI-0957099](http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0957099),
2010-0520325 and IGERT-0504249.
References
==========