%\documentclass[nojss]{jss}
\documentclass[]{article}
\usepackage{verbatim}
\usepackage{url}
\usepackage{booktabs}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{fullpage}
\usepackage{hyperref}
\usepackage{enumitem}
\usepackage[nounderscore]{syntax}
\setlength{\grammarparsep}{2pt plus 1pt minus 1pt} % increase separation between rules
\setlength{\grammarindent}{6em} % increase separation between LHS/RHS

% Grammar Definitions
\newcommand{\ntelem}[1] {{\textless \textit{#1}\textgreater}}
\newcommand{\ntgram}[1] {{\texttt{<{#1}>}}}

% Rewriting JSS commands
\newcommand{\code}[1] {{\texttt{#1}}}
\newcommand{\pkg}[1] {{\textbf{#1}}}
\newcommand{\proglang}[1] {{\textsf{#1}}}

<<echo=FALSE>>=
#options(width=70)
#options(prompt="R> ")
options(warn = (-1))
@

<<echo=FALSE>>=
set.seed(0)
@

%opening
%\VignetteIndexEntry{Grammatical evolution: A tutorial using gramEvol}

\title{Grammatical Evolution: A Tutorial using \pkg{gramEvol}}
\author{Farzad Noorian, Anthony M. de Silva, Philip H.W. Leong}

\begin{document}
%\SweaveOpts{concordance=TRUE}
%\VignetteEngine{knitr::knitr}

\maketitle

%\begin{abstract}
%\end{abstract}

\section{Introduction}
Grammatical evolution (GE) is an evolutionary search algorithm, similar to genetic programming (GP).
It is typically used to generate programs whose syntax is defined through a grammar.
The original authors' website \cite{o2001grammatical} is a good resource for a formal introduction to this technique:
\begin{itemize}
\item \url{http://www.grammatical-evolution.org/}
\end{itemize}

This document serves as a quick and informal tutorial on GE, with examples implemented using the \pkg{gramEvol} package in \proglang{R}.

\section{Grammatical Evolution}
The goal of using GE is to automatically generate a program that minimises a cost function:
\begin{enumerate}
\item A \emph{grammar} is defined to describe the syntax of the programs.
\item A \emph{cost function} is defined to assess the quality (the \emph{cost} or \emph{fitness}) of a program.
\item An \emph{evolutionary algorithm}, such as the genetic algorithm (GA), is used to search within the space of all programs definable by the grammar, in order to find the program with the lowest cost.
\end{enumerate}

Notice that by a \emph{program}, we refer to any sequence of instructions that performs a specific task.
This ranges from a single expression (e.g., \code{sin(x)}) to several statements with function declarations, assignments, and control flow.

The rest of this section describes each component in more detail.

\subsection{Grammar}
A grammar is a set of rules that describes the syntax of sentences and expressions in a language.
While grammars were originally invented for studying natural languages, they are extensively used in computer science for describing programming languages.

\subsubsection{Informal introduction to context-free grammars}
\label{sec:informal_grammar}
GE uses a \emph{context-free grammar} to describe the syntax of programs.
A grammar in which the rules are not sensitive to the sentence's context is called a \emph{context-free grammar} (CFG),
and is defined using a collection of \emph{terminal} symbols, \emph{non-terminal} symbols, \emph{production rules}, and a \emph{start} symbol \cite{Sethi1986Compiler}:
\begin{itemize}
\item Terminal symbols are the lexicon of the language.
\item Non-terminal symbols are used to describe the classes of words in the language,
or \emph{variables} that can take different values; for example, a \ntelem{subject}, a \ntelem{verb}, or an \ntelem{object}.
\item A production rule defines what symbols replace a non-terminal. For example, each of the four following lines is a production rule:
\begin{grammar}
<sentence> ::= \ntelem{subject} \ntelem{verb} \ntelem{object}. | \ntelem{subject} \ntelem{verb}. \hspace*{\fill} (1.a), (1.b)

<subject> ::= I | You | They \hspace*{\fill} (2.a), (2.b), (2.c)

<verb> ::= read | write | check \hspace*{\fill} (3.a), (3.b), (3.c)

<object> ::= books | stories | academic papers \hspace*{\fill} (4.a), (4.b), (4.c)
\end{grammar}
In each rule, the ``|'' symbol separates different replacement possibilities; for example, \ntelem{subject} can be replaced with ``I'', ``You'' or ``They''.
One must note that a non-terminal symbol can be replaced with other non-terminals as well as terminal symbols, as in the example's \ntelem{sentence}.
This style of notation, including the use of angle brackets (\textless{} and \textgreater), is known as the \emph{Backus--Naur Form} (BNF).
\item A start symbol determines the non-terminal where the generation of the expression starts. For example:
\begin{itemize}
\item Start: \ntelem{sentence}
\end{itemize}
\end{itemize}
Informally, only the start symbol and the production rules are required to define a grammar.

\subsubsection{Formal definition of a context-free grammar}
In formal language theory, a context-free grammar is a \textit{formal grammar} where every production rule, formalised by the pair $(n, V)$, is of the form $n \rightarrow V$.
The CFG is defined by the 4-tuple $(\mathcal{T}, \mathcal{N}, \mathcal{R}, \mathcal{S})$,
where $\mathcal{T}$ is the finite set of terminal symbols,
$\mathcal{N}$ is the finite set of non-terminal symbols,
$\mathcal{R}$ is the set of production rules,
and $\mathcal{S} \in \mathcal{N}$ is the start symbol.
A production rule $n \rightarrow V$ is applied by replacing the non-terminal symbol $n \in \mathcal{N}$
%in string $\alpha$
with the sequence $V \in (\mathcal{T} \cup \mathcal{N})^*$ of terminal and/or non-terminal symbols.

For more details on CFGs, their relation to context-free languages, parsing, compilers and other related topics, refer to \cite{Sethi1986Compiler} or Wikipedia:
\begin{itemize}
\item \url{https://en.wikipedia.org/wiki/Context-free_grammar}
\end{itemize}

\subsubsection{From grammar to an expression}
\label{sec:grammarmapping}
Notice that each rule in the grammar of Section \ref{sec:informal_grammar} is numbered.
Using these numbers, one can precisely refer to a certain expression:
starting from the start symbol, the leftmost non-terminal is repeatedly replaced with the alternative selected by the next number in the sequence.
For example, the sequence [2, 3, 1] selects rules (1.b), (2.c) and (3.a) in the following four-step sequence:

%\begin{table}[h]
%\centering
\begin{tabular}{ c c c c l }
\toprule
Step & Sequence & Rule & & Current state \\ \hline
0 & & Start & ~ & \ntelem{sentence} \\
1 & 2 & (1.b) & ~ & \ntelem{subject} \ntelem{verb}. \\
2 & 3 & (2.c) & ~ & They \ntelem{verb}. \\
3 & 1 & (3.a) & ~ & They read. \\
\bottomrule
\end{tabular}
%\end{table}

\subsection{Evolutionary optimisation}
Evolutionary optimisation algorithms are a class of optimisation techniques inspired by natural evolution. They are used in cases where:
\begin{itemize}
\item The solution to the problem can be represented by a certain structure; for example, an array of binary variables or integer numbers.
\begin{itemize}
\item Typically the array size is fixed, and each unique value arrangement is considered a candidate solution.
\item Using biological terminology, this structure is referred to as the \emph{chromosome} or \emph{genotype}.
\end{itemize}
\item There exists a cost function that can quickly return the \emph{cost} or \emph{fitness} of any candidate solution.
\item Solving the problem using gradient descent techniques is hard or impossible,
because the cost function is non-smooth, has multiple local optima, or is simply discrete,
as in the travelling salesman problem (or, in our case, a program generated by a grammar).
\end{itemize}
It must be noted that, owing to their stochastic nature, evolutionary algorithms do not guarantee finding the optimal solution:
most practical problems involve very large search spaces, and it is often not computationally feasible to search the whole space.

The oldest and simplest of these algorithms is the genetic algorithm (GA), which optimises a vector of binary variables.
In this vignette, when referring to GA, we refer to an extended GA that handles integers.
For an in-depth introduction, readers are referred to Wikipedia:
\begin{itemize}
\item \url{https://en.wikipedia.org/wiki/Evolutionary_algorithm}
\end{itemize}

\subsubsection{Optimising a program by evolution}
GA only optimises numeric arrays. By \emph{mapping} an integer array to a program using a grammar, GA can be readily applied to evolve programs:
\begin{enumerate}
\item The solution is represented by an array of integers.
\item The array is mapped to a program through the grammar, using the technique explained in Section \ref{sec:grammarmapping}.
\begin{itemize}
\item Using biological terminology, the program is called a \emph{phenotype}, and the mapping is referred to as \emph{genotype to phenotype mapping}.
\end{itemize}
\item The cost function measures the fitness of the program.
\item Any evolutionary optimisation technique is applied on the integer array.
\end{enumerate}

\subsection{Applications of grammatical evolution}
Any application that requires a program definable by a grammar can be implemented using GE.
Using a grammar allows integration of domain knowledge and a custom program syntax,
which adds flexibility and precision to GE compared to other techniques such as GP.
Applications of GE include computational finance, music, and robotic control, among others.
%The authors have previously used GE for electricity load forecasting in \cite{MNDL_13}.
See \url{http://www.grammatical-evolution.org/pubs.html} for a collection of publications in this area.

%These following classic problems are usually used to demonstrate the abilities of GE (or other algorithms like GP)
%and are used in this vignette.
%\begin{itemize}
% \item \href{https://en.wikipedia.org/wiki/Symbolic_regression}{Symbolic regression}
% \item \href{https://en.wikipedia.org/wiki/Santa_Fe_Trail_problem}{Santa Fe Trail problem}
%\end{itemize}

\section{gramEvol Package}
The package \pkg{gramEvol} simplifies defining a grammar and offers a GA implementation.
\pkg{gramEvol} hides many details, including the grammar mapping and the GA parameters, and all the user has to do is to:
\begin{enumerate}
\item Define a grammar using \code{CreateGrammar}.
\item Define a cost function. It should accept one (or more) \proglang{R} \code{expression}(s) and return a numeric value.
\item Call \code{GrammaticalEvolution}.
\end{enumerate}
In this section, examples are used to demonstrate its usage.
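As a first taste, the sentence grammar of Section~\ref{sec:informal_grammar} and the mapping of Section~\ref{sec:grammarmapping}
could be reproduced as follows (a minimal sketch, not evaluated in this vignette;
note that \code{GrammarMap}, introduced later in this document, uses zero-indexed rules,
so \code{c(1, 2, 0)} corresponds to rules (1.b), (2.c) and (3.a)):
<<eval=FALSE>>=
library("gramEvol")

# a sketch of the sentence grammar of Section 2.1
ruleDef <- list(sentence = gsrule("<subject> <verb> <object>.",
                                  "<subject> <verb>."),
                subject  = gsrule("I", "You", "They"),
                verb     = gsrule("read", "write", "check"),
                object   = gsrule("books", "stories", "academic papers"))
grammarDef <- CreateGrammar(ruleDef)

GrammarMap(c(1, 2, 0), grammarDef)  # expected to yield: They read.
@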
\subsection{Rediscovery of Kepler's law by symbolic regression}
Symbolic regression is the process of discovering a function, in symbolic form, which fits a given set of data.
Evolutionary algorithms such as GP and GE are commonly used to solve symbolic regression problems.
For more information, visit \url{https://en.wikipedia.org/wiki/Symbolic_regression} or \url{http://www.symbolicregression.com/}.

Rediscovery of Kepler's law has been used as a benchmark for symbolic regression \cite{koza1992chapter10, Ferreira2006Chapter6, langley1987heuristics}.
Here, the goal is to find a relationship between the orbital periods and the distances of the solar system planets from the sun.
The distance and period data, normalised to Earth's values, are shown in Table~\ref{tab:planets}.

\begin{table}[!hb]\centering
\begin{tabular}{ l c r }
\toprule
Planet & Distance & Period \\ \midrule
Venus & 0.72 & 0.61 \\
Earth & 1.00 & 1.00 \\
Mars & 1.52 & 1.84 \\
Jupiter & 5.20 & 11.90 \\
Saturn & 9.53 & 29.40 \\
Uranus & 19.10 & 83.50 \\
\bottomrule
\end{tabular}
\caption{Orbit period and distance from the sun for planets in the solar system.}
\label{tab:planets}
\end{table}

Kepler's third law states:
\begin{equation}
period^2 = constant \times distance^3
\end{equation}

\subsubsection{Defining a grammar}
To use grammatical evolution to find this relationship from the data, we define a grammar as illustrated in Table~\ref{tab:gram_kepler}.
Here $\mathcal{S}$ denotes the start symbol and $\mathcal{R}$ is the collection of production rules.

\begin{table}[!ht]\centering
{
\raggedright
\hrule
\vspace{1mm}
%$\mathcal{N}$ = \{$expr$, $sub$-$expr$, $func$, $op$, $var$, $n$\}
%$\mathcal{T}$ = \{\texttt{+}, \texttt{-}, \texttt{$\times$}, \texttt{\^},
%log, sqrt, sin, cos,
%\code{distance}, \texttt{1}, \texttt{2}, \texttt{3}, \texttt{4}, \texttt{(}, \texttt{)}\}

$\mathcal{S}$ = \ntelem{expr}

\medskip
Production rules : $\mathcal{R}$
\begin{grammar}
<expr> ::= <expr><op><expr> | <sub-expr> \hspace*{\fill} (1.a), (1.b)

<sub-expr> ::= <func>(<expr>) | <var> | <var>$\hat{~}$<n> \hspace*{\fill} (2.a), (2.b), (2.c)

<func> ::= log | sqrt | sin | cos \hspace*{\fill} (3.a), (3.b), (3.c), (3.d)

<op> ::= + | - | $\times$ \hspace*{\fill} (4.a), (4.b), (4.c)

<var> ::= \code{distance} | \code{distance}$\hat{~}$<n> | <n> \hspace*{\fill} (5.a), (5.b), (5.c)

<n> ::= \texttt{1} | \texttt{2} | \texttt{3} | \texttt{4} \hspace*{\fill} (6.a), (6.b), (6.c), (6.d)
\end{grammar}
}
\hrule
\medskip
\caption{Grammar for discovering Kepler's equation.}
\label{tab:gram_kepler}
\end{table}

This is a general-purpose grammar: it can create different expressions, corresponding to different formulas, which can explain and model the data.

The first step for using \pkg{gramEvol} is loading the grammar:
%defined in Table~\ref{tab:gram_kepler}:
<<>>=
library("gramEvol")

ruleDef <- list(expr = grule(op(expr, expr), func(expr), var),
                func = grule(sin, cos, log, sqrt),
                op = grule(`+`, `-`, `*`),
                var = grule(distance, distance^n, n),
                n = grule(1, 2, 3, 4))

grammarDef <- CreateGrammar(ruleDef)
@
Here, the BNF notation is implemented in \proglang{R}:
\begin{itemize}
\item Rules are defined as a \code{list}.
\item Each rule is defined using the \code{non.terminal.name = grule(replacement1, replacement2, ...)} format.
\item \code{CreateGrammar} is used to load the list and create the grammar object.
\end{itemize}
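Although the evolutionary search below handles the genotype to phenotype mapping automatically,
an integer sequence can also be mapped by hand using \code{GrammarMap}, which is covered later in this document
(a small illustrative sketch, not evaluated here; rule choices are zero-indexed):
<<eval=FALSE>>=
# 1 -> func(expr), 3 -> sqrt, 2 -> var, 0 -> distance
GrammarMap(c(1, 3, 2, 0), grammarDef)  # expected to yield: sqrt(distance)
@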
The \code{print} function reproduces the grammar in a format similar to Table~\ref{tab:gram_kepler}:
<<>>=
print(grammarDef)
@

Note that \code{`+`} and \code{op(expr, expr)} are used in the code above because \code{grule} expects \proglang{R} expressions,
and \code{expr op expr} is not valid in \proglang{R}.
As it is tedious to convert between the functional form and the operator form,
the package also provides \code{gsrule} (or grammar string rule),
which accepts strings in which non-terminal symbols are denoted by angle brackets (e.g., \code{<expr>}):
<<>>=
ruleDef <- list(expr = gsrule("<expr><op><expr>", "<func>(<expr>)", "<var>"),
                func = gsrule("sin", "cos", "log", "sqrt"),
                op = gsrule("+", "-", "*"),
                var = grule(distance, distance^n, n),
                n = grule(1, 2, 3, 4))

CreateGrammar(ruleDef)
@
Note that \code{gsrule} and \code{grule} can be mixed, as in the example above.

\subsubsection{Defining a cost function}
We use the following equation to normalise the error, balancing its impact on small values (e.g., Venus) versus large values (e.g., Uranus):
\begin{equation}
\label{eq:kepler_err}
e = \frac{1}{N} \sum \log(1 + |p - \hat{p}|)
\end{equation}
where $e$ is the normalised error, $N$ is the number of samples, $p$ is the orbital period and $\hat{p}$ is the result of symbolic regression.

We implement this as the fitness function \code{SymRegFitFunc}:
<<>>=
planets <- c("Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus")
distance <- c(0.72, 1.00, 1.52, 5.20, 9.53, 19.10)
period <- c(0.61, 1.00, 1.84, 11.90, 29.40, 83.50)

SymRegFitFunc <- function(expr) {
  result <- eval(expr)
  if (any(is.nan(result)))
    return(Inf)
  return(mean(log(1 + abs(period - result))))
}
@

Here, \code{SymRegFitFunc} receives an \proglang{R} \code{expression} and evaluates it.
It is assumed that the \code{expression} uses \code{distance} to estimate the \code{period}.
Invalid expressions are handled by returning a very high cost (infinite error).
Valid results are compared with the actual period according to (\ref{eq:kepler_err}) to compute the expression's fitness.

\subsubsection{Evolving the grammar}
\code{GrammaticalEvolution} can now be run. All of the parameters are determined automatically.
To avoid wasting time, and as the best possible outcome and its error are known (because we know the answer),
a \code{terminationCost} is computed and set to terminate GE when Kepler's equation is found.
<<>>=
ge <- GrammaticalEvolution(grammarDef, SymRegFitFunc, terminationCost = 0.021)
ge
@

Now that the result is found, it can be used in production. Here we only use it in a simple comparison:
<<>>=
best.expression <- ge$best$expression

data.frame(distance, period, Kepler = sqrt(distance^3),
           GE = eval(best.expression))
@
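For example, the evolved expression could be wrapped in an ordinary \proglang{R} function for later use
(a hypothetical sketch, not evaluated here; \code{predictPeriod} is not part of the package):
<<eval=FALSE>>=
# hypothetical helper: evaluate the evolved expression for new distances
predictPeriod <- function(distance) {
  eval(best.expression, list(distance = distance))
}

predictPeriod(30.1)  # Neptune's distance; Kepler's law gives about 165
@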
%Other GE runs find expressions such as
%\texttt{sqrt(distance)*distance}
%or \texttt{sqrt(distance\textasciicircum3+cos(distance)*log(1\textasciicircum4))},
%$\sqrt{distance}\times distance$,
%$\sqrt{distance^3}$,
%or $\sqrt{distance^3}+\cos(distance)*log(1^4)$,
%which all simplify to Kepler's equation.

%The fitness function handles invalid values (e.g., $\log(-1)$)
%by assigning a low fitness (infinite error) to any individual
%with an invalid value. However, \proglang{R} may show warnings
%about \code{NaN}s being produced. To suppress these warnings,
%it is enough to wrap \code{eval} in the fitness function
%inside \code{suppressWarnings}:
%\begin{verbatim}
%R> result <- suppressWarnings(eval(expr))
%\end{verbatim}

%As an incomplete search is performed, sometimes the GE
%fails to find a perfect solution. In such cases, a symbolic result
%with error is presented
%(i.e., $\log(distance)$ in
%\code{sqrt(distance\textasciicircum3+log(distance))}).
%%$\sqrt{distance^3+\log(distance)}$.

%To characterise this behaviour,
%the code was run 100 times and its error from Kepler's equation was noted.
%The measurements were performed on a single thread on a 3.40 GHz Intel Core i7-2600 CPU.
%To ensure reproducibility, \code{set.seed(0)} was executed before running the code.
%The results are presented in Table~\ref{tab:symreg_performance}.
%Notice that the average performance can be improved at the expense of time,
%by increasing the GE's number of generations \code{iterations} or population size \code{popSize}.

%\begin{table} \centering
% \begin{tabular}{ l r r r}
% \toprule
% Value & Minimum & Average & Maximum \\ \midrule
% Error & 0.00 & 0.92 & 1.61 \\
% No. of generations & 2.00 & 77.46 & 100.00 \\
% Time (s) & 0.40 & 4.16 & 20.31 \\
% Error & 0.00 & 0.00 & 0.00 \\
% No. of generations & 1.00 & 45.24 & 186.00 \\
% Time (s) & 0.08 & 0.79 & 3.17 \\
% \bottomrule
% \end{tabular}
% \caption{Summary of grammatical evolution's performance for 100 runs of the symbolic regression example.}
% \label{tab:symreg_performance}
%\end{table}

\subsubsection{Monitoring evolution}
As a real-world optimisation may take a long time, feedback on the state of the optimisation is desirable.
\code{GrammaticalEvolution} allows monitoring this status using a callback function.
This function, if provided to the parameter \code{monitorFunc}, receives an object similar to the return value of \code{GrammaticalEvolution}.
For example, the following function prints the current generation, the best individual's expression and its error:
<<eval=FALSE>>=
customMonitorFunc <- function(results) {
  cat("-------------------\n")
  print(results)
}

ge <- GrammaticalEvolution(grammarDef, SymRegFitFunc,
                           terminationCost = 0.021,
                           monitorFunc = customMonitorFunc)
@
or even using the \code{print} function directly:
<<eval=FALSE>>=
ge <- GrammaticalEvolution(grammarDef, SymRegFitFunc,
                           terminationCost = 0.021,
                           monitorFunc = print)
@
which prints:
\begin{verbatim}
Grammatical Evolution Search Results:
  No. Generations:  1
  Best Expression:  distance
  Best Cost:        1.60700784338907

Grammatical Evolution Search Results:
  No. Generations:  2
  Best Expression:  distance
  Best Cost:        1.60700784338907
\end{verbatim}
\dots until:
\begin{verbatim}
Grammatical Evolution Search Results:
  No. Generations:  9
  Best Expression:  distance + distance
  Best Cost:        1.54428158317392

Grammatical Evolution Search Results:
  No. Generations:  10
  Best Expression:  1 - distance + (cos(distance) - 1) * sin(distance^2) +
                    distance + (log(distance) + distance + (cos(distance) -
                    1) * sin(distance^2) + distance)
  Best Cost:        1.4186428597461

Grammatical Evolution Search Results:
  No. Generations:  11
  Best Expression:  sqrt(distance^3)
  Best Cost:        0.0201895728693592
\end{verbatim}

\subsection{Discovering regular expressions}
A regular expression (RE) is a string that defines a character pattern.
REs are more expressive and precise in determining sub-string matches compared to wildcards,
and are widely used in many string pattern matching tasks, such as searching through log files or parsing a program's output.
See the Wikipedia entry at \url{https://en.wikipedia.org/wiki/Regular_expression} for an in-depth introduction to REs.

Creating a regular expression requires careful assembly of symbols and operators to match the desired pattern.
%which is often different between implementations.
While this is usually performed by an expert programmer, it is possible to use evolutionary optimisation techniques to infer a RE from examples \cite{bartoli2012automatic}.
In this example, we demonstrate how \pkg{gramEvol} can be used to learn REs.

<<echo=FALSE>>=
set.seed(0)
@

\subsubsection[Regular expressions in R]{Regular expressions in \proglang{R}}
\label{sec:decimal_RE}
In formal language theory, a regular expression is a sequence of symbols and operators that describes a character pattern.
REs are translated by RE processors into a non-deterministic finite automaton (NFA)
and subsequently into a deterministic finite automaton (DFA).
The DFA can then be executed on any character string to recognise sub-strings that match the regular expression.
For a theoretical introduction to REs, including their relationship with context-free grammars, readers are referred to \cite{Sethi1986Compiler}.

\proglang{R} supports standard regular expressions with both the POSIX and the \proglang{Perl} syntax.
In addition, the \href{https://github.com/kevinushey/rex}{\pkg{rex} package} \cite{rex_package} offers a functional interface for creating REs in \proglang{R}.

\subsubsection{Matching a decimal real number}
Consider matching a decimal real number in the form of $[\pm]nnn[.nnn]$,
where [~] means optional and $nnn$ denotes one or more digits.
The following table compares this notation with the syntax of \proglang{Perl}, POSIX, and \pkg{rex}: \\

\begin{tabular}{ l c c c c }
\toprule
 & Notation & \proglang{Perl} & POSIX & \pkg{rex} \\ \hline
One digit & $n$ & \textbackslash d & [[:digit:]] & \code{number} \\
One or more digits & $nnn$ & \textbackslash d+ & [[:digit:]]+ & \code{numbers} \\
Optional presence of X & [X] & X? & X? & \code{maybe(}X\code{)} \\
Alternative presence of X or Y & X|Y & X|Y & X|Y & \code{or(}X, Y\code{)} \\
Plus sign & + & \textbackslash + & \textbackslash + & \code{"+"} \\
Minus sign & - & - & - & \code{"-"} \\
Dot & . & \textbackslash . & \textbackslash . & \code{"."} \\
\bottomrule
\end{tabular}
~ \\

Using the above table, $[\pm]nnn[.nnn]$ is translated to:
\begin{itemize}
\item \proglang{Perl}: \verb;(\+|-)?\d+(\.\d+)?;
\item POSIX: \verb;(\+|-)?[[:digit:]]+(\.[[:digit:]]+)?;
\item \pkg{rex}: \code{maybe(or("+", "-")), numbers, maybe(".", numbers)}
\end{itemize}

% Note: $ and ^ are removed on purpose!
To use a RE, the expression has to be wrapped in a start and a stop symbol
(\code{\^{}...\$} in POSIX and \proglang{Perl}, and \code{rex(start, ..., end)} for \pkg{rex}):
<<>>=
re <- "^(\\+|-)?[[:digit:]]+(\\.[[:digit:]]+)?$"
@

\code{grepl} can be used to check if a string matches the RE pattern or not:
<<>>=
grepl(re, "+1.1")
grepl(re, "1+1")
@

Some matching and non-matching examples are listed below:
<<>>=
matching <- c("1", "11.1", "1.11", "+11", "-11", "-11.1")

non.matching <- c("a", "1.", "1..1", "-.1", "-", "1-",
                  "1.-1", ".-1", "1.-", "1.1.1", "",
                  ".", "1.1-", "11-11")
@
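As a quick sanity check (a small sketch, not evaluated in this vignette),
the hand-written RE should match every string in the first vector and none in the second:
<<eval=FALSE>>=
all(grepl(re, matching))       # expected TRUE
any(grepl(re, non.matching))   # expected FALSE
@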
\subsubsection{Inferring a regular expression}
In this section, we use \pkg{gramEvol} to learn a RE that matches a decimal real number, as explained in the previous section.

\paragraph{Defining a cost function:}
The objective is to infer a RE that matches the decimal numbers in the vector \code{matching}, but not those in \code{non.matching}.
Consequently, the cost of any candidate RE is determined by counting the correctly matched and correctly rejected strings,
and returning the number of remaining errors:
<<>>=
re.score <- function(re) {
  score <- sum(sapply(matching, function(x) grepl(re, x))) +
           sum(sapply(non.matching, function(x) !grepl(re, x)))
  return(length(matching) + length(non.matching) - score)
}
@

The fitness function in \pkg{gramEvol} receives an \proglang{R} \code{expression},
which has to be evaluated before being passed to \code{re.score}:
<<>>=
fitfunc <- function(expr) re.score(eval(expr))
@

\paragraph{Defining a grammar:}
We use \pkg{rex} RE functions to create a grammar.
The grammar only includes the functions explored in Section \ref{sec:decimal_RE}, and is designed such that the search space is reduced:
<<>>=
library("rex")
library("gramEvol")

grammarDef <- CreateGrammar(list(
  re    = grule(rex(start, rules, end)),
  rules = grule(rule, .(rule, rules)),
  rule  = grule(numbers, ".", or("+", "-"), maybe(rules))))

grammarDef
@
\begin{itemize}
\item The first rule, \ntelem{re}, creates a valid \pkg{rex} command that uses \ntelem{rules} for pattern matching.
\item The second element, \ntelem{rules}, is \emph{recursive} and can create a collection of rules by repeating itself,
e.g., \ntelem{rule}, \ntelem{rule}, \ntelem{rule}.
The \code{.()} allows using a comma inside a \code{grule} definition,
where otherwise it would have been interpreted as another replacement rule in the list.
\item The last element, \ntelem{rule}, expands to a RE function or character pattern.
These include \code{numbers} and \code{maybe} from \pkg{rex}, a decimal point, and \code{+} or \code{-}.
\end{itemize}

\paragraph{Evolving the grammar:}
The last step is to perform a search for a regular expression that minimises the score function.
Here the minimum \code{terminationCost} is known (i.e., zero error),
and \code{max.depth} is increased to allow for more expansion of the recursive \ntelem{rules}.
We use \code{GrammaticalExhaustiveSearch} to exhaustively search for the answer among all possible combinations of the grammar:
% this takes some time to run, use "monitorFunc = print" for more convenience
<<eval=FALSE>>=
GrammaticalExhaustiveSearch(grammarDef, fitfunc, max.depth = 7, terminationCost = 0)
@
\begin{verbatim}
GE Search Results:
  Expressions Tested: 6577
  Best Chromosome:    0 1 3 0 2 1 3 1 0 0 1 1 0 0 3 0 0
  Best Expression:    rex(start, maybe(or("+", "-")), maybe(numbers, "."),
                          numbers, maybe(numbers), end)
  Best Cost:          0
\end{verbatim}

The result, while correct, is different from what we expected:
$[\pm][nnn.]nnn[nnn]$, which is nevertheless true for any decimal real number.
Furthermore, the search takes a considerable amount of time:
<<eval=FALSE>>=
system.time(GrammaticalExhaustiveSearch(grammarDef, fitfunc,
                                        max.depth = 7, terminationCost = 0))
@
\begin{verbatim}
    user  system elapsed
 380.469  17.022 392.637
\end{verbatim}
which was measured on a 3.40 GHz Intel Core i7-2600 CPU.
In conclusion, one might find it easier to design REs by hand in real-world scenarios,
rather than using evolutionary optimisation techniques.
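That said, the stochastic search of \code{GrammaticalEvolution} can also be tried in place of the exhaustive search
(a minimal sketch, not run here; being stochastic, it may require several runs to reach zero cost):
<<eval=FALSE>>=
ge <- GrammaticalEvolution(grammarDef, fitfunc,
                           max.depth = 7, terminationCost = 0)
ge$best$expression
@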
Here, all of the examples are demonstrated using the following grammar: <>= grammarDef <- CreateGrammar(list( expr = gsrule("()()", "*"), op = gsrule("+", "-", "*", "/"), coef = gsrule("c1", "c2"), var = gsrule("v1", "v2"))) grammarDef @ \subsection{Manual mapping} To \emph{map} a numeric sequence to an expression manually, use \code{GrammarMap}: <>= GrammarMap(c(0, 1, 0, 0, 1, 1, 0, 0), grammarDef) @ The sequence is zero-indexed (the first rule is zero). To see the step by step mapping, use the \code{verbose} parameter option: <>= GrammarMap(c(0, 1, 0, 0, 1, 1, 0, 0), grammarDef, verbose = TRUE) @ If the length of a sequence is insufficient for the mapping process, such that a few non-terminal elements still remain in the resulting expression, a wrapping of up to \code{wrappings} is performed. For example: <>= GrammarMap(c(0, 1, 0, 0, 1, 1), grammarDef, verbose = TRUE) @ \subsection{Examining a grammar} \pkg{gramEvol} offers several functions to examine grammar definitions. \code{summary} reports a summary of what grammar presents: <>= summary(grammarDef) @ Many of these properties are available through individual functions: \code{GetGrammarDepth} computes the depth of grammar tree. The parameter \code{max.depth} is used to limit recursion in \emph{cyclic} grammars. For example, this grammar is cyclic because of rule \ntelem{expr} $\rightarrow$ \ntelem{expr}\ntelem{op}\ntelem{expr}, i.e., replacing a \ntelem{expr} with other \ntelem{expr}s. By default \code{GetGrammarDepth} limits recursion to the number of symbols defined in the grammar: <>= GetGrammarDepth(grammarDef) GetGrammarDepth(grammarDef, max.depth = 10) @ For grammars without recursion, the value returned by \code{GetGrammarDepth} is the actual depth of the tree: <>= grammarDef2 <- CreateGrammar(list( expr = gsrule("()()"), subexpr = gsrule("*"), op = gsrule("+", "-", "*", "/"), coef = gsrule("c1", "c2"), var = gsrule("v1", "v2"))) GetGrammarDepth(grammarDef2) @ \code{GetGrammarDepth} also supports computing the depth from any symbol: <>= GetGrammarDepth(grammarDef2, startSymb = "") GetGrammarDepth(grammarDef2, startSymb = "") @ \code{GetGrammarMaxRuleSize} returns the maximum number of production rules per symbol. Here, \ntelem{op} has the highest number of production rules: <>= GetGrammarMaxRuleSize(grammarDef) @ \code{GetGrammarNumOfExpressions} returns the number of possible expressions existing in the grammar space. This function also uses the optional argument \code{max.depth} to limit the number of recursions and \code{startSymb} to set the starting symbol: <>= GetGrammarNumOfExpressions(grammarDef) GetGrammarNumOfExpressions(grammarDef, max.depth = 2) GetGrammarNumOfExpressions(grammarDef, startSymb = "") @ Here, the only expressions with depth of 2 or less are constructed if rule (\ntelem{coef}$\times$\ntelem{var}) is applied first, creating 4 expressions (i.e., $c_1 \times v_1$, $c_1 \times v_2$, $c_2 \times v_1$ and $c_2 \times v_2$). Also if \ntelem{coef} is chosen as the starting symbol, the expressions are limited to $c_1$ and $c_2$. \code{GetGrammarMaxSequenceLen} computes the length of integer sequence required for iterating through the grammar space without wrapping. As with the previous functions, \code{max.depth} is set to the number of symbols defined in the grammar. 
<<>>=
GetGrammarMaxSequenceLen(grammarDef)
GetGrammarMaxSequenceLen(grammarDef, max.depth = 3)
GetGrammarMaxSequenceLen(grammarDef2, startSymb = "<subexpr>")
@

\subsection{Grammatical evolution options}
\code{GrammaticalEvolution} is defined as follows:
<<eval=FALSE>>=
GrammaticalEvolution(grammarDef, evalFunc,
                     numExpr = 1,
                     max.depth = GrammarGetDepth(grammarDef),
                     startSymb = GrammarStartSymbol(grammarDef),
                     seqLen = GrammarMaxSequenceLen(grammarDef, max.depth, startSymb),
                     wrappings = 3,
                     suggestions = NULL,
                     optimizer = c("auto", "es", "ga"),
                     popSize = 8, newPerGen = "auto", elitism = 2,
                     mutationChance = NA,
                     iterations = 1000,
                     terminationCost = NA,
                     monitorFunc = NULL,
                     plapply = lapply, ...)
@
\code{max.depth} and \code{startSymb} determine recursive grammar limitations, similar to what was explained in the previous section.
The rest of the parameters are the evolutionary optimisation options (see the sketch after this list for an explicit call):
\begin{itemize}
\item \code{GrammaticalEvolution} evolves a population of \code{popSize} chromosomes for a number of \code{iterations}.
\item If \code{optimizer} is set to ``auto'', \code{GrammaticalEvolution} uses the information obtained about the grammar
(e.g., the number of possible expressions and the maximum sequence length)
and a heuristic algorithm based on \cite{deb1999understanding}
to automatically determine suitable values for the \code{popSize} (population size)
and \code{iterations} (number of iterations) parameters.
\item The ordinary cross-over operator of GA is considered destructive when homologous production rules are not aligned,
such as for cyclic grammars \cite{oneill2003crossover}.
Consequently, \code{GrammaticalEvolution} automatically changes cross-over parameters depending on the grammar to improve optimisation results.
A user can turn this off by manually setting the \code{optimizer}.
\item The first generation is created from the \code{suggestions}, given in the form of integer chromosomes,
along with randomly generated individuals.
\item Each integer chromosome is mapped using the grammar, and its fitness is assessed by calling \code{evalFunc}.
\item For each generation, the top $n$ scoring chromosomes, where $n =$~\code{elitism},
are directly added to the next generation's population.
The rest of the population is created using cross-over of chromosomes selected with the roulette selection operator.
\item Each chromosome may mutate with a probability of \code{mutationChance}.
\item After reaching a termination criterion, e.g., the maximum number of \code{iterations} or the desired \code{terminationCost},
the algorithm stops and returns the best expression found so far.
\item \code{GrammaticalEvolution} supports multi-gene operations,
generating more than one expression per chromosome using the \code{numExpr} parameter.
\item The number of integer codons in the chromosome is determined by \code{seqLen} times \code{numExpr}
(i.e., the sequence length per expression, times the number of expressions).
\item \code{monitorFunc} is called after each generation with information and statistics about the current state of the population.
\item \code{plapply} is used for parallel processing.
\item \code{GrammaticalEvolution} automatically filters non-terminal expressions
(i.e., sequences that do not yield a terminal expression even after the allowed number of \code{wrappings}).
Therefore the end-user does not need to worry about them while using \pkg{gramEvol}.
\end{itemize}
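For example, the automatically chosen settings can be overridden explicitly
(a sketch with illustrative parameter values; \code{evalFunc} stands for a user-supplied cost function):
<<eval=FALSE>>=
ge <- GrammaticalEvolution(grammarDef, evalFunc,
                           optimizer = "ga",
                           popSize = 100, iterations = 200,
                           elitism = 4, mutationChance = 0.05,
                           terminationCost = 0.01,
                           monitorFunc = print)
@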
\subsection{Parallel processing option}
Processing expressions and computing their fitness is often computationally expensive.
The \pkg{gramEvol} package can utilise parallel processing facilities in \proglang{R} to improve its performance.
This is done through the \code{plapply} argument of the \code{GrammaticalEvolution} function.
By default, the \code{lapply} function is used to evaluate all individuals in the population.

Multi-core systems simply benefit from using \code{mclapply} from the package \pkg{parallel},
% \cite{Rlang},
which is a drop-in replacement for \code{lapply} on POSIX-compatible systems.
The following code optimises \code{evalFunc} on 4 cores:
<<eval=FALSE>>=
library("parallel")
options(mc.cores = 4)
ge <- GrammaticalEvolution(grammarDef, evalFunc, plapply = mclapply)
@

To run \pkg{gramEvol} on a cluster, the \code{clusterApply} family of functions can be used instead.
The \pkg{gramEvol} package must first be installed on all machines,
and the fitness function and its data dependencies exported, before GE is called.
The following example demonstrates a four-process cluster running on the local machine:
<<eval=FALSE>>=
library("parallel")
cl <- makeCluster(type = "PSOCK", c("127.0.0.1", "127.0.0.1",
                                    "127.0.0.1", "127.0.0.1"))
clusterEvalQ(cl, library("gramEvol"))
clusterExport(cl, c("evalFunc"))
ge <- GrammaticalEvolution(grammarDef, evalFunc,
                           plapply = function(...) parLapply(cl, ...))
stopCluster(cl)
@
It must be noted that, for any problem, the achievable speed-up depends on the communication overhead
compared with the computational complexity of the fitness function.

\subsection{Generating more than one expression}
\pkg{gramEvol} supports generation and evaluation of multiple expressions:
\begin{itemize}
\item \code{numExpr} in \code{GrammaticalEvolution} is used to pass a list of more than one \proglang{R} \code{expression} to the fitness function.
\item \code{EvalExpressions} offers a simpler interface for evaluating multiple expressions.
\end{itemize}
The following example showcases \code{EvalExpressions}:
it uses a dataset for the variables defined in the grammar,
and evaluates four expressions given as an \proglang{R} \code{expression} object:
<<>>=
df <- data.frame(c1 = c(1, 2), c2 = c(2, 3),
                 v1 = c(3, 4), v2 = c(4, 5))

quad.expr <- expression(c1 * v1, c1 * v2, c2 * v1, c2 * v2)
EvalExpressions(quad.expr, envir = df)
@

This is useful in applications where more than one expression is required,
or where the collective power of several simple expressions outperforms a single complex program.
For example, in \cite{MNDL_13} the authors used GE for electricity load forecasting;
instead of using a complex machine learning algorithm,
pools of string expressions were generated in a guided manner
and were used as features in a simpler machine learning algorithm to obtain better results.
The idea of generating \emph{features} using GE is further explored in \cite{noorian16gramEvol}.

\subsection{Alternative optimisation algorithms}
\pkg{gramEvol} also provides a random search and an exhaustive search.
Their syntax is similar to that of \code{GrammaticalEvolution}:
<<eval=FALSE>>=
result1 <- GrammaticalExhaustiveSearch(grammarDef, evalFunc)
result2 <- GrammaticalRandomSearch(grammarDef, evalFunc)
@

\subsection{Using vectors as rules}
\code{gvrule} allows members of a vector to be used as individual rules. For example,
<<>>=
gvrule(1:5)
@
which is equivalent to
<<>>=
grule(1, 2, 3, 4, 5)
@
Without \code{gvrule}, \code{1:5} would have been interpreted as a single rule:
<<>>=
grule(1:5)
@

\subsection{Using commas and assignments in rules}
There are two ways to use commas and assignments in \pkg{gramEvol} rules:
\begin{enumerate}
\item Rules are defined in character string form using \code{gsrule}.
\item Rules are wrapped in \code{.()} and defined using \code{grule}.
\end{enumerate}
For example, consider the following rules:

\ntelem{assignment} ::= A = B | A = C

\ntelem{comma} ::= A, B | B, C

Their definition using \pkg{gramEvol} is as follows:
<<>>=
CreateGrammar(list(assignment = gsrule("A = B", "A = C"),
                   comma = gsrule("A, B", "B, C")))
@
or
<<>>=
CreateGrammar(list(assignment = grule(.(A = B), .(A = C)),
                   comma = grule(.(A, B), .(B, C))))
@

\section{Conclusion}
GE offers a flexible yet powerful framework for automatic program generation.
The syntax and structure of the programs are described using a context-free grammar,
and their objective is determined by a cost function.
An evolutionary search is performed on the grammar to find the program that minimises the cost function.

\pkg{gramEvol} implements GE in \proglang{R}.
It allows a grammar to be defined using \proglang{R} expressions, as well as in custom string formats.
A GE program generator using \pkg{gramEvol} only requires a grammar definition and a cost function;
the other steps, including the evolutionary search and the selection of its optimal parameters, are handled automatically by the package.
It also supports parallel computing, and includes facilities for exhaustive and random search.

In this vignette, some of the functionalities of \pkg{gramEvol} were explored.
Furthermore, two examples were used to demonstrate the flexibility of GE and \pkg{gramEvol}.

\bibliographystyle{IEEEtran}
\bibliography{vig_refs}

\end{document}