XML documents are a series of nested tags, possibly with attributes. An example is the MSigDB xml file, which contains curated gene sets stored following a format specification.
Download a copy to your AMI and store it in the directory ~/xml/.
The first few lines of this file look like
[1] <?xml version="1.0" encoding="UTF-8"?>
[2]
[3] <MSIGDB NAME="msigdb" VERSION="4.0" BUILD_DATE="May 31, 2013">
[4] <GENESET STANDARD_NAME="NUCLEOPLASM" ...></GENESET>
...
[10299] </MSIGDB>
Line 1 tells us about the version of XML used in the document, and the character encoding. Line 3 opens the MSIGDB node. The node has several attributes, NAME, VERSION, and BUILD_DATE, as described in the format specification. Nested inside the MSIGDB node is the first of many GENESET nodes; the MSIGDB node itself terminates on the final line of the file, with </MSIGDB>. The GENESET node has several attributes (of which only one is shown) and an empty body, and terminates with </GENESET>.
Load the database into R
library(XML)
xml <- xmlTreeParse("~/xml/msigdb_v4.0.xml", useInternalNodes=TRUE)
Don't bother printing xml; it will scroll across your screen for quite a while.
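To make the node and attribute vocabulary concrete without the full download, here is a minimal sketch that parses a tiny inline document. The GENESET names and ORGANISM values are made up for illustration; only the MSIGDB attributes are copied from the listing above.

```r
library(XML)

# Toy stand-in for the MSigDB file; the GENESET attribute values are invented
txt <- '<MSIGDB NAME="msigdb" VERSION="4.0" BUILD_DATE="May 31, 2013">
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"></GENESET>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Mus musculus"></GENESET>
</MSIGDB>'

# asText=TRUE tells the parser that 'txt' is XML content, not a file name
toy <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

root <- xmlRoot(toy)   # the top-level MSIGDB node
xmlName(root)          # tag name of the node
xmlAttrs(root)         # named character vector of its attributes
```

Working on a toy document like this is a cheap way to try out queries before running them against the full file.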
Elements of XML can be addressed using XPath. The idea is to specify the path from the root of the document to the node(s) or attributes that you're interested in. The path is like a Linux file path, starting with /. Attributes are specified with @ before their name. We can subset the xml object using this language, e.g.,
xml[["/MSIGDB/@NAME"]]
An alternative is to use the xmlAttrs function to extract the attributes of the node we're interested in
xmlAttrs(xml[["/MSIGDB"]])
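When only one attribute is needed, the XML package's xmlGetAttr() is a third option. The sketch below uses a small inline document (an assumption, so that it runs without the MSigDB file) and shows the default argument, which is returned when the attribute is absent. The attribute name NO_SUCH_ATTRIBUTE is hypothetical.

```r
library(XML)

# Toy document with the same root attributes as the listing above
txt <- '<MSIGDB NAME="msigdb" VERSION="4.0" BUILD_DATE="May 31, 2013"></MSIGDB>'
toy <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)
root <- xmlRoot(toy)

# Extract a single attribute value as an unnamed character string
version <- xmlGetAttr(root, "VERSION")

# A missing attribute yields the supplied default instead of an error
missing <- xmlGetAttr(root, "NO_SUCH_ATTRIBUTE", default=NA_character_)
```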
There is only one NAME attribute of MSIGDB, but there are many GENESET child nodes. Here we create a node set of all of these
sets <- xml["/MSIGDB/GENESET"]
class(sets)
length(sets)
XPath provides a convenient syntax for querying nested paths: //GENESET says to start at the root and find all paths that have GENESET at any level.
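The difference between an absolute path and // only shows up when nodes sit at different depths. The following sketch uses a deliberately artificial document in which one GENESET is wrapped in a hypothetical GROUP node; nothing here implies that the real MSigDB file is nested this way.

```r
library(XML)

# Artificial document: GROUP is a made-up wrapper used only to create nesting
txt <- '<MSIGDB>
  <GENESET STANDARD_NAME="SET_A"></GENESET>
  <GROUP>
    <GENESET STANDARD_NAME="SET_B"></GENESET>
  </GROUP>
</MSIGDB>'
toy <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

# Absolute path: only GENESET nodes that are direct children of MSIGDB
direct <- getNodeSet(toy, "/MSIGDB/GENESET")

# Abbreviated path: GENESET nodes at any depth below the root
anywhere <- getNodeSet(toy, "//GENESET")

length(direct)
length(anywhere)
```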
We could manipulate sets at the R level, e.g., selecting the second element and viewing the first four attributes
head(xmlAttrs(sets[[2]]), 4)
but it's more fun to formulate this query using XPath to select all attributes of the second gene set
head(xml["//GENESET[2]/@*"], 4)
Notice that this gene set has a STANDARD_NAME attribute. We can use this to select the gene set
yy <- xml[["//GENESET[@STANDARD_NAME = 'EXTRINSIC_TO_PLASMA_MEMBRANE']"]]
xmlAttrs(yy)[1:4]
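Positional predicates such as [2] and attribute predicates such as [@STANDARD_NAME = '...'] can point at the same node. A minimal sketch on an inline toy document (set names and organisms are invented) compares the two:

```r
library(XML)

# Toy document; attribute values are illustrative only
txt <- '<MSIGDB>
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"></GENESET>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Mus musculus"></GENESET>
  <GENESET STANDARD_NAME="SET_C" ORGANISM="Homo sapiens"></GENESET>
</MSIGDB>'
toy <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

# Select by position: the second GENESET child of its parent
by_position <- getNodeSet(toy, "//GENESET[2]")[[1]]

# Select by attribute value
by_name <- getNodeSet(toy, "//GENESET[@STANDARD_NAME = 'SET_B']")[[1]]

xmlGetAttr(by_position, "STANDARD_NAME")
xmlGetAttr(by_name, "ORGANISM")
```

Selecting by attribute value is usually the more robust choice, because it does not depend on the order of nodes in the file.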
There are many gene sets in our document; we might like to visit them all and extract a particular element, e.g., the ORGANISM attribute. We can do this by iterating over the node set in R
organism <- sapply(sets, function(elt) xmlAttrs(elt)["ORGANISM"])
but again a fun way to do this is to use an sapply-like formulation on the XML document itself
organism <- xpathSApply(xml, "//GENESET/@ORGANISM")
table(organism)
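xpathSApply() also accepts a function that is applied to each matched node, with further arguments passed through to it. The sketch below is one possible variant, run on a toy document with invented attribute values: it selects the GENESET nodes and lets xmlGetAttr() pull out ORGANISM.

```r
library(XML)

# Toy document; attribute values are illustrative only
txt <- '<MSIGDB>
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"></GENESET>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Mus musculus"></GENESET>
  <GENESET STANDARD_NAME="SET_C" ORGANISM="Homo sapiens"></GENESET>
</MSIGDB>'
toy <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

# Apply xmlGetAttr(node, "ORGANISM") to every node matched by the path
toy_organism <- xpathSApply(toy, "//GENESET", xmlGetAttr, "ORGANISM")

# Tabulate the organisms, as done for the full document above
toy_counts <- table(toy_organism)
toy_counts
```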
The XPath specification includes functions that are useful for, e.g., string matching. A simple example is to count the number of gene sets in our document
xml[["count(//GENESET)"]]
xml[["count(//GENESET[@ORGANISM='Homo sapiens'])"]]
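For the string matching mentioned above, XPath 1.0 provides functions such as starts-with() and contains(), which can be used inside predicates. A minimal sketch on a toy document (all attribute values are invented):

```r
library(XML)

# Toy document; attribute values are illustrative only
txt <- '<MSIGDB>
  <GENESET STANDARD_NAME="SET_A" ORGANISM="Homo sapiens"></GENESET>
  <GENESET STANDARD_NAME="SET_B" ORGANISM="Mus musculus"></GENESET>
  <GENESET STANDARD_NAME="OTHER_C" ORGANISM="Mus musculus"></GENESET>
</MSIGDB>'
toy <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

# Gene sets whose STANDARD_NAME begins with the prefix 'SET_'
prefixed <- getNodeSet(toy, "//GENESET[starts-with(@STANDARD_NAME, 'SET_')]")

# Gene sets whose ORGANISM contains the substring 'sapiens'
sapiens <- getNodeSet(toy, "//GENESET[contains(@ORGANISM, 'sapiens')]")

length(prefixed)
length(sapiens)
```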
Section 2.5 Abbreviated Syntax of the XPath specification is a very handy introduction to the flexibility of XPath queries.
Exercise: Use an XPath query to select the 5 gene sets that have ORGANISM equal to 'Danio rerio'. Use a single XPath query to determine the STANDARD_NAME of these gene sets.
xmlEventParse()
Example: from StackOverflow
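For orientation, event (SAX-style) parsing never builds the whole tree; instead, xmlEventParse() calls user-supplied handler functions as the parser encounters each piece of the document. The following is only a minimal sketch under stated assumptions: it runs on an inline toy document, the helper name count_genesets is hypothetical, and it merely counts GENESET start tags rather than collecting attributes.

```r
library(XML)

# Toy document; attribute values are illustrative only
txt <- '<MSIGDB>
  <GENESET STANDARD_NAME="SET_A"></GENESET>
  <GENESET STANDARD_NAME="SET_B"></GENESET>
  <GENESET STANDARD_NAME="SET_C"></GENESET>
</MSIGDB>'

# Hypothetical helper: count GENESET nodes using a startElement handler
count_genesets <- function(text) {
  n <- 0
  handlers <- list(
    # Called once per opening tag; 'attrs' holds that tag's attributes
    startElement = function(name, attrs, ...) {
      if (name == "GENESET") n <<- n + 1   # update the counter in the closure
    }
  )
  xmlEventParse(text, handlers=handlers, asText=TRUE)
  n
}

n_sets <- count_genesets(txt)
n_sets
```

The handler keeps its state in the enclosing function's environment, which is a common pattern for accumulating results across events.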
Advanced exercise: implement event parsing to retrieve the STANDARD_NAME and DESCRIPTION_BRIEF attributes from all GENESET nodes.