KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
KEGGREST allows access to the KEGG REST API. Since KEGG disabled the KEGG SOAP server on December 31, 2012 (which means the KEGGSOAP package will no longer work), KEGGREST serves as a replacement.
The interface to KEGGREST is simpler and in some ways more powerful than KEGGSOAP; however, not all the functionality that was available through the SOAP API has been exposed in the REST API. If and when more functionality is exposed on the server side, this package will be updated to take advantage of it.
The KEGG REST API is built on some simple operations: info, list, find, get, conv, and link.
The corresponding R functions in KEGGREST are: keggInfo(), keggList(), keggFind(), keggGet(), keggConv(), and keggLink().
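To make the link between operations and functions concrete, the sketch below builds the kind of URL each operation resolves to. It is illustrative only: kegg_url() is a hypothetical helper, not part of KEGGREST; the base URL is taken from the output shown later in this document; and joining several identifiers with "+" is an assumption about the KEGG REST URL syntax.

```r
# Hypothetical helper (not a KEGGREST function): assemble a KEGG REST URL
# from an operation name and its arguments.
kegg_url <- function(operation, ..., base = "http://rest.kegg.jp") {
  operations <- c("info", "list", "find", "get", "conv", "link")
  operation <- match.arg(operation, operations)  # reject unknown operations
  paste(c(base, operation, c(...)), collapse = "/")
}

kegg_url("list", "organism")
kegg_url("link", "pathway", "hsa")

# Assumption: multiple identifiers are joined with "+" in a single request.
kegg_url("get", paste(c("hsa:10458", "ece:Z5100"), collapse = "+"))
```

In practice you would call keggList(), keggLink() or keggGet() and let KEGGREST build and parse the request for you.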
keggList()
KEGG exposes a number of databases. To get an idea of what is available, run listDatabases():
library(KEGGREST)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
listDatabases()
## [1] "pathway" "brite" "module" "disease" "drug" "environ"
## [7] "ko" "genome" "compound" "glycan" "reaction" "rpair"
## [13] "rclass" "enzyme" "organism"
You can use these databases in further queries. Note that in many cases you can also use a three-letter KEGG organism code or a “T number” (genome identifier) in the same place you would use one of these database names.
You can obtain the list of organisms available in KEGG with the keggList() function:
org <- keggList("organism")
head(org)
## T.number organism
## [1,] "http://rest.kegg.jp/list/organism" "http://rest.kegg.jp/list/organism"
## species phylogeny
## [1,] "http://rest.kegg.jp/list/organism" "http://rest.kegg.jp/list/organism"
From KEGGREST's point of view, you've just asked KEGG to show you the name of every entry in the “organism” database. Therefore, the complete list of entities that can be queried with KEGGREST can be obtained as follows:
queryables <- c(listDatabases(), org[,1], org[,2])
You could also ask for every entry in the “hsa” (Homo sapiens) database as follows:
keggList("hsa")
keggGet()
Once you have a list of specific KEGG identifiers, use keggGet() to get more information about them. Here we look up a human gene and an E. coli O157 gene:
query <- keggGet(c("hsa:10458", "ece:Z5100"))
As expected, this returns two items:
length(query)
## [1] 2
Behind the scenes, KEGGREST downloaded and parsed a KEGG flat file, which you can now explore:
names(query[[1]])
## [1] "ENTRY" "NAME" "DEFINITION" "ORTHOLOGY" "ORGANISM"
## [6] "PATHWAY" "BRITE" "POSITION" "MOTIF" "DBLINKS"
## [11] "STRUCTURE" "AASEQ" "NTSEQ"
query[[1]]$ENTRY
## CDS
## "10458"
query[[1]]$DBLINKS
## [1] "NCBI-ProteinID: NP_001138360" "NCBI-GeneID: 10458"
## [3] "OMIM: 605475" "HGNC: 947"
## [5] "HPRD: 05686" "Ensembl: ENSG00000175866"
## [7] "Vega: OTTHUMG00000177698" "UniProt: Q9UQB8"
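The DBLINKS field comes back as plain "database: identifier" strings. As a rough sketch, they can be turned into a named vector for lookup by database. parse_dblinks() is a hypothetical helper, and the input is copied from part of the output above rather than fetched from KEGG.

```r
# A few DBLINKS entries copied from the output above.
dblinks <- c("NCBI-ProteinID: NP_001138360", "NCBI-GeneID: 10458",
             "OMIM: 605475", "UniProt: Q9UQB8")

# Hypothetical helper: split each "database: identifier" string at the
# first colon and return the identifiers named by database.
parse_dblinks <- function(x) {
  db <- sub(":.*$", "", x)             # text before the first colon
  id <- trimws(sub("^[^:]*:", "", x))  # text after the first colon
  setNames(id, db)
}

links <- parse_dblinks(dblinks)
links[["UniProt"]]
```

With a live result you would pass query[[1]]$DBLINKS instead of the copied vector.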
keggGet() can also return amino acid sequences as AAStringSet objects (from the Biostrings package):
keggGet(c("hsa:10458", "ece:Z5100"), "aaseq") ## retrieves amino acid sequences
## A AAStringSet instance of length 2
## width seq names
## [1] 534 MSLSRSEEMHRLTENVYKTIMEQ...RNPFAHVQLKPTVTNDRSAPLLS hsa:10458 BAIAP2,...
## [2] 248 MLNGISNAASTLGRQLVGIASRV...SGLPPLAQALKDHLAAYEQSKKG ece:Z5100 espF; e...
…or DNAStringSet objects if option is ntseq:
keggGet(c("hsa:10458", "ece:Z5100"), "ntseq") ## retrieves nucleotide sequences
## A DNAStringSet instance of length 2
## width seq names
## [1] 1605 ATGTCTCTGTCTCGCTCAGAGGA...GGTCTGCCCCCCTCCTCAGCTGA hsa:10458 BAIAP2,...
## [2] 747 ATGCTTAATGGAATTAGTAACGC...ATGAGCAATCGAAGAAAGGGTAA ece:Z5100 espF; e...
keggGet() can also return images:
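library(png)
img <- keggGet("hsa05130", "image")    ## pathway map as an image array
outfile <- tempfile(fileext = ".png")  ## extension lets the viewer recognise the file
writePNG(img, outfile)
if (interactive()) browseURL(outfile)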
NOTE: keggGet() can only return 10 result sets at once (this limitation is on the server side). If you supply more than 10 inputs to keggGet(), KEGGREST will warn that only the first 10 results will be returned.
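One way to work within the 10-result limit is to split a long identifier vector into batches of at most 10 and query each batch in turn. The sketch below assumes this approach; split_into_batches() and fetch_in_batches() are hypothetical helpers, and the identifiers are placeholders rather than real KEGG entries.

```r
# Hypothetical helper: split a vector into consecutive batches of at
# most `batch_size` elements.
split_into_batches <- function(ids, batch_size = 10L) {
  if (length(ids) == 0L) return(list())
  groups <- ceiling(seq_along(ids) / batch_size)
  unname(split(ids, groups))
}

# Hypothetical helper: call `fetch` once per batch and concatenate the
# per-batch results into one list.
fetch_in_batches <- function(ids, fetch, batch_size = 10L) {
  do.call(c, lapply(split_into_batches(ids, batch_size), fetch))
}

ids <- paste0("id", 1:23)          # placeholder identifiers
batches <- split_into_batches(ids)
lengths(batches)                   # sizes of the three batches

# With the package loaded and real identifiers, this might be used as:
# results <- fetch_in_batches(gene_ids, KEGGREST::keggGet)
```

Passing the fetch function as an argument keeps the batching logic independent of the network, so it can be checked with a stand-in function before it is pointed at KEGG.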
keggFind()
You can search for two separate keywords (“shiga” and “toxin” in this case):
head(keggFind("genes", c("shiga", "toxin")))
## ece:Z1464
## "stx2A; shiga-like toxin II A subunit encoded by bacteriophage BP-933W;
K11006 shiga toxin subunit A"
## ece:Z1465
## "stx2B; shiga-like toxin II B subunit encoded by bacteriophage BP-933W;
K11007 shiga toxin subunit B"
## ece:Z3343
## "stx1B; shiga-like toxin 1 subunit B encoded within prophage CP-933V; K11007
shiga toxin subunit B"
## ece:Z3344
## "stx1A; shiga-like toxin 1 subunit A encoded within prophage CP-933V; K11006
shiga toxin subunit A"
## ecs:ECs1205
## "Shiga toxin 2 subunit A; K11006 shiga toxin subunit A"
## ecs:ECs1206
## "Shiga toxin 2 subunit B; K11007 shiga toxin subunit B"
Or search for the two words together:
head(keggFind("genes", "shiga toxin"))
## ece:Z1464
## "stx2A; shiga-like toxin II A subunit encoded by bacteriophage BP-933W;
K11006 shiga toxin subunit A"
## ece:Z1465
## "stx2B; shiga-like toxin II B subunit encoded by bacteriophage BP-933W;
K11007 shiga toxin subunit B"
## ece:Z3343
## "stx1B; shiga-like toxin 1 subunit B encoded within prophage CP-933V; K11007
shiga toxin subunit B"
## ece:Z3344
## "stx1A; shiga-like toxin 1 subunit A encoded within prophage CP-933V; K11006
shiga toxin subunit A"
## ecs:ECs1205
## "Shiga toxin 2 subunit A; K11006 shiga toxin subunit A"
## ecs:ECs1206
## "Shiga toxin 2 subunit B; K11007 shiga toxin subunit B"
Search for a chemical formula:
head(keggFind("compound", "C7H10O5", "formula"))
## cpd:C00493 cpd:C04236 cpd:C16588 cpd:C18307 cpd:C18312 cpd:C20961
## "C7H10O5" "C7H10O5" "C7H10O5" "C7H10O5" "C7H10O5" "C7H10N2O5"
Search for a chemical formula containing “O5” and “C7”:
head(keggFind("compound", "O5C7", "formula"))
## cpd:C00493 cpd:C00624 cpd:C01215 cpd:C01424 cpd:C02123 cpd:C02236
## "C7H10O5" "C7H11NO5" "C7H9NO5" "C7H6O5" "C7H12O5" "C7H6O5S"
You can search for compounds with a particular exact mass:
keggFind("compound", 174.05, "exact_mass")
## cpd:C00493 cpd:C04236 cpd:C16588 cpd:C18307 cpd:C18312
## "174.052823" "174.052823" "174.052823" "174.052823" "174.052823"
Because we've supplied a number with two decimal digits of precision, KEGG will find all compounds with exact mass between 174.045 and 174.055.
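The window implied by a given number of decimal digits can be worked out directly. This is a small sketch of the arithmetic described above, not of how the KEGG server implements the search; mass_window() is a hypothetical helper.

```r
# Hypothetical helper: the range of values that round to `x` when `x`
# is given with `digits` decimal digits.
mass_window <- function(x, digits) {
  half_step <- 0.5 * 10^(-digits)
  c(lower = x - half_step, upper = x + half_step)
}

w <- mass_window(174.05, digits = 2)
w

# The exact mass reported in the result above falls inside this window.
174.052823 >= w[["lower"]] && 174.052823 <= w[["upper"]]
```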
Integer ranges can be used to find compounds by molecular weight:
head(keggFind("compound", 300:310, "mol_weight"))
## cpd:C00051 cpd:C00200 cpd:C00219 cpd:C00239 cpd:C00270 cpd:C00357
## "307.32348" "306.33696" "304.46688" "307.197122" "309.26986" "301.187702"
keggConv()
Convert between KEGG identifiers and outside identifiers.
You can either specify fully qualified identifiers:
keggConv("ncbi-proteinid", c("hsa:10458", "ece:Z5100"))
## hsa:10458 ece:Z5100
## "ncbi-proteinid:NP_001138360" "ncbi-proteinid:AAG58814"
…or get the mapping for an entire species:
head(keggConv("eco", "ncbi-geneid"))
## ncbi-geneid:11115378 ncbi-geneid:11115379 ncbi-geneid:1450231
## "eco:b4704" "eco:b4703" "eco:b4500"
## ncbi-geneid:1450232 ncbi-geneid:1450233 ncbi-geneid:1450235
## "eco:b4501" "eco:b4406" "eco:b4503"
Reversing the arguments does the opposite mapping:
head(keggConv("ncbi-geneid", "eco"))
## eco:b0001 eco:b0002 eco:b0003
## "ncbi-geneid:944742" "ncbi-geneid:945803" "ncbi-geneid:947498"
## eco:b0004 eco:b0005 eco:b0006
## "ncbi-geneid:945198" "ncbi-geneid:944747" "ncbi-geneid:944749"
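keggConv() returns a named character vector in which both the names and the values carry a database prefix. As an illustrative sketch, the prefixes can be stripped and the mapping laid out as a two-column table. strip_prefix() is a hypothetical helper, and the input reproduces the first three entries of the output above instead of querying KEGG.

```r
# First three entries of the keggConv("ncbi-geneid", "eco") output above.
conv <- c("eco:b0001" = "ncbi-geneid:944742",
          "eco:b0002" = "ncbi-geneid:945803",
          "eco:b0003" = "ncbi-geneid:947498")

# Hypothetical helper: drop everything up to and including the first colon.
strip_prefix <- function(x) sub("^[^:]+:", "", x)

mapping <- data.frame(kegg = strip_prefix(names(conv)),
                      ncbi_geneid = strip_prefix(unname(conv)),
                      stringsAsFactors = FALSE)
mapping
```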
keggLink()
Most of the KEGGSOAP functions whose names started with “get”, for example get.pathways.by.genes(), can be replaced with the keggLink() function. Here we query all pathways for human:
head(keggLink("pathway", "hsa"))
## hsa:10 hsa:10 hsa:10 hsa:10 hsa:100
## "path:hsa00232" "path:hsa00983" "path:hsa01100" "path:hsa05204" "path:hsa00230"
## hsa:100
## "path:hsa01100"
…but you can also specify one or more genes (from multiple species):
keggLink("pathway", c("hsa:10458", "ece:Z5100"))
## hsa:10458 hsa:10458 ece:Z5100
## "path:hsa04520" "path:hsa04810" "path:ece05130"
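Because a gene can belong to several pathways, the names in a keggLink() result repeat. A minimal sketch of one way to regroup such a result into one element per gene, using the values printed above rather than a live query:

```r
# The keggLink() result shown above, reproduced as a named vector.
links <- c("hsa:10458" = "path:hsa04520",
           "hsa:10458" = "path:hsa04810",
           "ece:Z5100" = "path:ece05130")

# Group the pathway identifiers by the gene they are linked to.
by_gene <- split(unname(links), names(links))
by_gene[["hsa:10458"]]
lengths(by_gene)
```

split() orders the groups by name, so the genes in by_gene may not appear in the order in which they were queried.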