\documentclass[a4paper]{article}

\usepackage[margin=2cm]{geometry}
\usepackage[round]{natbib}
\usepackage{url}

\newcommand{\acronym}[1]{\textsc{#1}}
\newcommand{\pkg}[1]{{\normalfont\fontseries{b}\selectfont #1}}
\newcommand{\proglang}[1]{\textsf{#1}}
\let\code\texttt

%% \VignetteIndexEntry{Extensions}

\begin{document}
<<Init,echo=FALSE,results=hide>>=
library("tm")
library("xml2")
@
\title{Extensions\\How to Handle Custom File Formats}
\author{Ingo Feinerer}
\maketitle

\section*{Introduction}
The ability to handle custom file formats is a substantial feature of any
modern text mining infrastructure. \pkg{tm} has been designed with this aspect
in mind from the beginning and provides modular components which allow for
extensions. A general explanation of \pkg{tm}'s extension mechanism is given
by~\citet[Sec.~3.3]{Feinerer_etal_2008}; an updated description follows.

\section*{Sources}
A source abstracts input locations and provides uniform methods for access.
Each source must provide implementations for the following interface functions:
\begin{description}
\item[close()] closes the source and returns it,
\item[eoi()] returns \code{TRUE} if the end of input of the source is reached,
\item[getElem()] fetches the element at the current position,
\item[length()] gives the number of elements,
\item[open()] opens the source and returns it,
\item[pGetElem()] (optional) retrieves all elements in parallel at once,
\item[reader()] returns a default reader for processing elements, and
\item[stepNext()] increases the position in the source to the next element.
\end{description}
Retrieved elements must be encapsulated in a list with the named components
\code{content} holding the document and \code{uri} pointing to the origin of
the document (e.g., a file path or a \acronym{URL}; \code{NULL} if not
applicable or unavailable).

Custom sources are required to inherit from the virtual base class
\code{Source} and typically do so by extending the functionality provided by the
simple reference implementation \code{SimpleSource}.

E.g., a simple source which accepts an \proglang{R} vector as input could be
defined as
<<keep.source=TRUE>>=
VecSource <- function(x)
    SimpleSource(length = length(x), content = as.character(x),
                 class = "VecSource")
@
which overrides a few defaults (see \code{?SimpleSource} for the full list) and
stores the vector in the \code{content} component. The functions
\code{close()}, \code{eoi()}, \code{open()}, and \code{stepNext()} already have
reasonable default methods for the \code{SimpleSource} class: the identity
function for \code{open()} and \code{close()}, incrementing a position counter
for \code{stepNext()}, and comparing the current position with the number of
available elements as claimed by \code{length()} for \code{eoi()}. So we only
need custom methods for element access:
<<keep.source=TRUE>>=
getElem.VecSource <- function(x)
    list(content = x$content[x$position], uri = NULL)
pGetElem.VecSource <- function(x)
    lapply(x$content, function(y) list(content = y, uri = NULL))
@
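To check that the new source works end to end, we can already plug it into a
corpus constructor (a minimal sanity check; the input vector is made up for
illustration):
<<keep.source=TRUE>>=
txt <- c("First document.", "Second document.")
corpus <- VCorpus(VecSource(txt))
corpus[[2]]
@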

\section*{Readers}
Readers are functions for extracting textual content and metadata out of
elements delivered by a source and for constructing a text document. Each reader
must accept the following arguments in its signature:
\begin{description}
\item[elem] a list with the named components \code{content} and \code{uri} (as
  delivered by a source via \code{getElem()} or \code{pGetElem()}),
\item[language] a string giving the language, and
\item[id] a character giving a unique identifier for the created text document.
\end{description}
The element \code{elem} is typically provided by a source, whereas the language
and the identifier are normally provided by a corpus constructor (in case
\code{elem\$content} does not give information on these two essential items).

In case a reader expects configuration arguments we can use a function
generator. A function generator is indicated by inheriting from class
\code{FunctionGenerator} and \code{function}. It allows us to process
additional arguments, store them in an environment, return a reader function
with the well-defined signature described above, and still be able to access
the additional arguments via lexical scoping. All corpus constructors in
package \pkg{tm} check the reader function for being a function generator and
if so apply it to yield the reader with the expected signature.
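For illustration, a configurable reader could be sketched as follows (the name
\code{myPrefixReader} and its \code{prefix} argument are made up for this
example):
<<keep.source=TRUE>>=
myPrefixReader <- function(prefix = "") {
    stopifnot(is.character(prefix))
    ## The returned closure has the well-defined reader signature and can
    ## still access `prefix` via lexical scoping.
    function(elem, language, id)
        PlainTextDocument(paste0(prefix, elem$content),
                          id = id, language = language)
}
class(myPrefixReader) <- c("FunctionGenerator", "function")
@
Passed to a corpus constructor (via its \code{readerControl} argument), such a
generator is detected by its class and applied to yield the actual reader.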

E.g., the reader function \code{readPlain()} is defined as
<<keep.source=TRUE>>=
readPlain <- function(elem, language, id)
    PlainTextDocument(elem$content, id = id, language = language)
@
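Invoked directly with a hand-crafted element (made up for illustration), it
yields a plain text document:
<<keep.source=TRUE>>=
doc <- readPlain(list(content = "A simple text.", uri = NULL),
                 language = "en", id = "doc1")
as.character(doc)
@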
For examples of readers using the function generator mechanism have a look at
\code{?readDOC} or \code{?readPDF}.

However, in many cases it is not necessary to define every detailed aspect of
how to extend \pkg{tm}. Typical examples are \acronym{XML} files, which are very
common but can rather easily be handled via standards-conforming \acronym{XML}
parsers. The aim of the remainder of this document is to give an overview of
how simpler, more user-friendly forms of extension mechanisms can be applied
in \pkg{tm}.

\section*{Custom Data Formats}
A common situation is that you have gathered some information into a tabular
data structure (like a data frame or a list matrix) that suffices to describe
the documents of a corpus. However, you do not have a distinct file format
because you extracted the information from various resources, e.g., as
delivered by \code{readtext()} from package \pkg{readtext}. Now you want to use
your information to build a corpus which is recognized by \pkg{tm}.

We assume that your information is put together in a data frame, as in the
following example:
<<keep.source=TRUE>>=
df <- data.frame(doc_id  = c("doc 1"    , "doc 2"    , "doc 3"    ),
                 text    = c("content 1", "content 2", "content 3"),
                 title   = c("title 1"  , "title 2"  , "title 3"  ),
                 authors = c("author 1" , "author 2" , "author 3" ),
                 topics  = c("topic 1"  , "topic 2"  , "topic 3"  ),
                 stringsAsFactors = FALSE)
@
We want to map the data frame rows to the relevant entries of a text document.

The column \code{text} fills the actual content of each text document,
\code{doc\_id} is used as the document ID, and all other columns are used as
metadata tags. So we can construct a corpus out of the data frame:
<<>>=
(corpus <- Corpus(DataframeSource(df)))
corpus[[1]]
meta(corpus[[1]])
@
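Since the remaining columns were turned into document-level metadata, they are
also accessible in tabular form for the whole corpus (assuming the
\code{SimpleCorpus} created by \code{Corpus()} above):
<<>>=
meta(corpus)
@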

\section*{Custom XML Sources}
Many modern file formats already come in \acronym{XML}, which allows
information to be extracted with any standards-conforming \acronym{XML}
parser, e.g., as implemented in \proglang{R} by the \pkg{xml2} package.

Now assume we have some custom \acronym{XML} format which we want to
access with \pkg{tm}. Then a viable way is to create a custom
\acronym{XML} source which can be configured with only a few
commands. E.g., have a look at the following example:
<<CustomXMLFile>>=
custom.xml <- system.file("texts", "custom.xml", package = "tm")
print(readLines(custom.xml), quote = FALSE)
@
As you see there is a top-level tag stating that there is a corpus,
and several document tags below. In fact, this structure is very
common in \acronym{XML} files found in text mining applications (e.g.,
both the Reuters-21578 and the Reuters Corpus Volume 1 data sets follow
this general scheme). In \pkg{tm} we expect a source to deliver
self-contained blocks of information to a reader function, each block
containing all information necessary such that the reader can
construct a (subclass of a) \code{TextDocument} from it.

The \code{XMLSource()} function can now be used to construct a custom
\acronym{XML} source. It has three arguments:
\begin{description}
\item[x] a character giving a uniform resource identifier,
\item[parser] a function accepting an \acronym{XML} document (as delivered by
  \code{read\_xml()} in package \pkg{xml2}) as input and returning a list of
  \acronym{XML} elements/nodes (each element/node will then be delivered to
  the reader as a self-contained block), and
\item[reader] a reader function capable of turning \acronym{XML}
  elements/nodes as returned by the parser into a subclass of
  \code{TextDocument}.
\end{description}
E.g., a custom source which can cope with our custom \acronym{XML}
format could be:
<<mySource,keep.source=TRUE>>=
mySource <- function(x)
    XMLSource(x, parser = xml2::xml_children, reader = myXMLReader)
@
Notice that in this example we also provide a custom reader function
(\code{myXMLReader}); see the next section for details.

\section*{Custom XML Readers}
As we saw in the previous section we often need a custom reader
function to extract information out of \acronym{XML} chunks (typically
as delivered by some source). Fortunately, \pkg{tm} provides an easy
way to define custom \acronym{XML} reader functions. All you need to
do is to provide a so-called \emph{specification}.

Let us start with an example which defines a reader function for the
file format from the previous section:
<<myXMLReader,keep.source=TRUE>>=
myXMLReader <- readXML(
    spec = list(author = list("node", "writer"),
                content = list("node", "description"),
                datetimestamp = list("function",
                    function(x) as.POSIXlt(Sys.time(), tz = "GMT")),
                description = list("node", "@short"),
                heading = list("node", "caption"),
                id = list("function", function(x) tempfile()),
                origin = list("unevaluated", "My private bibliography"),
                type = list("node", "type")),
    doc = PlainTextDocument())
@

Formally, \code{readXML()} is the relevant function which constructs a reader.
The customization is done via the first argument \code{spec}; the second
argument provides an empty instance of the document class to be returned
(augmented with the information extracted from the \acronym{XML} chunks). The
specification must consist of a named list of lists, each containing two
components. The constructed reader will map each list entry to the content or
to a metadatum of the text document, as specified by the entry's name. Valid
names include \code{content}, to access the document's content, and character
strings, which are mapped to metadata entries.

For each entry, the first component describes the type of the second, and the
second component is the specification entry itself. Valid combinations are:
\begin{description}
\item[\code{type = "node", spec = "XPathExpression"}] the XPath (1.0) expression
  \code{spec} extracts information out of an \acronym{XML} node (as seen for
  \code{author}, \code{content}, \code{description}, \code{heading}, and
  \code{type} in our example specification).
\item[\code{type = "function", spec = function(doc) \ldots}] The function
  \code{spec} is called, passing over the \acronym{XML} document (as
  delivered by \code{read\_xml()} from package \pkg{xml2})
  as first argument (as seen for \code{datetimestamp} and \code{id}).
  As the example shows, nothing forces us to actually use the passed
  document; instead we can compute anything we want (e.g., create a unique
  identification string via \code{tempfile()}).
\item[\code{type = "unevaluated", spec = "String"}] the character vector
  \code{spec} is returned without modification (e.g., \code{origin} in
  our specification).
\end{description}
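To see what the \code{"node"} type does conceptually, we can evaluate such
XPath expressions ourselves with \pkg{xml2} on the first document node of our
example file (a rough illustration; the actual extraction code in \pkg{tm} may
differ in detail):
<<keep.source=TRUE>>=
node <- xml2::xml_children(xml2::read_xml(custom.xml))[[1]]
xml2::xml_text(xml2::xml_find_all(node, "writer"))
xml2::xml_text(xml2::xml_find_all(node, "@short"))
@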

Now that we have all we need to cope with our custom file format, we
can apply the source and reader function at any place in \pkg{tm}
where a source or reader is expected, respectively. E.g.,
<<>>=
corpus <- VCorpus(mySource(custom.xml))
@
constructs a corpus out of the information in our \acronym{XML}
file:
<<>>=
corpus[[1]]
meta(corpus[[1]])
@

\bibliographystyle{abbrvnat}
\bibliography{references}

\end{document}