%\documentclass[article]{jss}
\documentclass[nojss]{jss}

\usepackage{tikz}
\usetikzlibrary{shapes,arrows,fit,calc,positioning}
\tikzset{box/.style={draw, rectangle, thick, text centered, minimum height=3em}}
\tikzset{oval/.style={draw, ellipse, thick, text centered, minimum height=3em}}
\tikzset{line/.style={draw, thick, -latex'}}

%\VignetteIndexEntry{rmake: A Build Process Manager for Complex Analyses in R}
%\VignetteEngine{knitr::knitr}
%\VignetteEncoding{UTF-8}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% almost as usual
\author{Michal Burda\\
  Institute for Research and Applications of Fuzzy Modeling\\
  Centre of Excellence IT4Innovations, Division University of Ostrava\\
  30. dubna 22, 701 03 Ostrava, Czech Republic\\
  E-mail: \email{Michal.Burda@osu.cz}}
\title{\pkg{rmake}: A Build Process Manager for Complex Analyses in \R{}}

%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Michal Burda} %% comma-separated
\Plaintitle{rmake: A Build Process Manager for Complex Analyses in R}
%\Shorttitle{\pkg{rmake}: Build Process for Analyses in \R{}} %% a short title (if necessary)

%% an abstract and keywords
\Abstract{
  The \R{} software makes it possible to develop \emph{repeatable} statistical analyses, i.e., analyses that can be re-computed automatically whenever the data or some data processing step changes. However, as analyses grow in complexity, their manual re-execution after every change may become tedious and error-prone. \emph{Make} is a widely accepted tool for managing the generation of resulting files from source data and script files. \emph{Make} reads dependencies between data and script files from a text file called the \code{Makefile}. The aim of this paper is to present \pkg{rmake}, an \R{} package for easy generation of \code{Makefile}s for statistical and data manipulation projects.
}
\Keywords{statistical analyses, build process, \code{make}, \code{Makefile}, \R{}}
\Plainkeywords{statistical analyses, build process, make, Makefile, R} %% without formatting
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{50}
%% \Issue{9}
%% \Month{June}
%% \Year{2012}
%% \Submitdate{2012-06-04}
%% \Acceptdate{2012-06-04}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
  Michal Burda\\
  Institute for Research and Applications of Fuzzy Modeling\\
  Centre of Excellence IT4Innovations, Division University of Ostrava\\
  30. dubna 22, 701 03 Ostrava, Czech Republic\\
  E-mail: \email{Michal.Burda@osu.cz}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/512/507-7103
%% Fax: +43/512/507-2851

%% for those who use Sweave please include the following line (with % symbols):
%% need no \usepackage{Sweave.sty}

%% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\todo}[1]{\textbf{\color{red}#1}}
\newcommand{\R}{\proglang{R}}
\newcommand{\pipe}{\code{\%\textgreater{}\textgreater{}\%}}

\begin{document}

<>=
library(knitr)
options(prompt='R> ')
options(continue='+ ')
opts_chunk$set(prompt=TRUE, echo=TRUE, comment=NA)
@

\section{Introduction}

\R{} \citep{R} is a mature scripting language for statistical computations and data processing. Among other benefits, an important advantage of \R{} is that it allows users to write \emph{repeatable} statistical analyses, that is, to program all steps of data processing in a scripting language, so that the whole process can be re-executed after any change in the data or in any processing step. Several \R{} packages make repeatability of statistical computations much easier to achieve.
Among others, let us name \pkg{knitr} \citep{knitr2, knitr} and \pkg{rmarkdown} \citep{rmarkdown}. These tools allow users to write \R{} scripts that generate reports combining text with tables and figures generated from data. Creating the final statistical report with such a script is as simple as issuing a single statement from the command line or clicking an icon in an integrated development environment (IDE) such as \emph{RStudio} \citep{rstudio}.

However, as analyses grow in complexity, manual re-execution of the whole process may become tedious, error-prone, and computationally very demanding. Complex analyses typically involve many pre-processing steps on large data sets, repetitive execution of commands that differ only in a few parameters, and the production of multiple output files in various formats. It is very inefficient to re-run all the pre-processing steps over and over again just to refresh the final report after any change to it. The caching mechanism provided by \pkg{knitr} is very helpful here, but its use is limited to a single report.

It is reasonable to split a complex analysis into several parts and to save intermediate results into files. However, such an approach brings another challenge: managing the dependencies between inputs, outputs and underlying scripts so that all results are refreshed whenever any prerequisite changes. In the open source community, (GNU) \emph{Make} is a popular tool for this purpose.

\emph{Make} is a tool that controls the generation of intermediate and resulting files from source data and script files. It was originally created to help software developers build executable binaries from source code. However, it can be used to generate any type of file. \emph{Make} gets its knowledge of how to build the results from a file called the \code{Makefile}, which defines the dependencies between files as well as the commands that create dependent files from their sources.
\emph{Make} compares the last-change timestamps of the files listed in the \code{Makefile} to determine which files have to be refreshed, and in what order, to bring all of them up to date.

\emph{Make} is also quite popular in the \R{} community, with direct support in, e.g., \emph{RStudio} and other tools. Writing a \code{Makefile} manually is quite straightforward, and many \R{} users write simple \code{Makefile}s by hand. However, for more complex build processes, manual maintenance of a \code{Makefile} may become laborious.

The aim of this paper is to present \pkg{rmake}, an \R{} package providing tools for easy generation of \code{Makefile}s for statistical and data manipulation tasks in \R. The main features of \pkg{rmake} are as follows:
\begin{itemize}
  \item the use of the well-known \emph{Make} tool;
  \item easy definitions of file dependencies in the \R{} language;
  \item high flexibility provided by parameterized execution of \R{} scripts and programmatically generated dependencies;
  \item simple and short code thanks to the special \verb|%>>%| pipeline operator and a templating mechanism;
  \item support for \R{} scripts and \R{} markdown files;
  \item extensibility for user-defined rule types;
  \item isolated and parallel execution of building tasks obtained for free thanks to \emph{Make}'s parallel processing features;
  \item support for all platforms including Unix (Linux), macOS, MS Windows, and Solaris;
  \item compatibility with \emph{RStudio}.
\end{itemize}

A different approach to building complex analytical projects is represented, e.g., by the \pkg{drake} package \citep{drake}, which (unlike \pkg{rmake}) processes all the tasks within a single \R{} session and handles all the dependencies between \R{} objects by itself. In contrast, \pkg{rmake} is simply a lightweight generator of the \code{Makefile} dependency file, which leaves the re-generation of obsolete results to the \emph{Make} utility.
A large list of other pipelining projects may also be found at \code{https://github.com/pditommaso/awesome-pipeline}.

The rest of the paper is organized as follows. Section~\ref{sec:installation} lists all steps necessary for correct installation and setup. Section~\ref{sec:basic} describes the basic usage of the \pkg{rmake} package including initialization of a new project, creating the build rules, running the build process, and visualization of file dependencies. Section~\ref{sec:details} provides all details about pre-defined build rules and custom rule definitions. In Section~\ref{sec:advanced}, the advanced topics of \pkg{rmake} usage are discussed: the mechanism of tasks for grouping the rules that have to be executed together, parameterized execution of rules, and rule templates. Section~\ref{sec:conclusion} concludes the paper.

\section{Installation}
\label{sec:installation}

In order to use \pkg{rmake}, the \R{} environment and the \emph{Make} program have to be installed and properly configured in advance. On Linux-based systems, it is usually a matter of installing the appropriate distribution packages. On Windows, installing \code{Rtools}, which includes the \emph{Make} tool, is recommended.

To install the \pkg{rmake} package from CRAN, execute the following command from within the \R{} session (note that the leading ``\code{R>}'' denotes the \R{} prompt and is not part of the command):

<>=
install.packages("rmake")
@

Alternatively, a development version of \pkg{rmake} may be obtained directly from GitHub:

<>=
install.packages("devtools")
devtools::install_github("beerda/rmake")
@

\subsection[Settings for Use Outside of the R Session]{Settings for Use Outside of the \R{} Session}
\label{sec:outside-r}

For \pkg{rmake} to work properly, the \code{R_HOME} and \code{R_ARCH} environment variables have to be set correctly.
If \emph{Make} is executed from within an \R{} session (e.g., from \emph{RStudio}), these environment variables are set automatically by \R{}. In order to execute \emph{Make} outside \R{}, for instance, from the shell, these variables have to be set manually. The \code{R_HOME} variable should contain the path to \R{}'s installation directory and \code{R_ARCH} the architecture variant. The correct values may be obtained from \R{} by issuing the following commands:

<<>>=
Sys.getenv("R_HOME")
Sys.getenv("R_ARCH")
@

To set the environment variables on a Unix-like system, issue the following shell commands (``\code{\$}'' is a shell prompt):

\begin{verbatim}
$ export R_HOME=/usr/lib/R
$ export R_ARCH=
\end{verbatim}

These commands may be added to the \code{.profile} file in your home directory so that the variables are set automatically when you log in.

\section{Basic Usage}
\label{sec:basic}

\subsection{New Project Initialization}

To start managing an \R{} project with \pkg{rmake}, an \R{} script \code{Makefile.R} has to be created, which then generates the \code{Makefile}. That script file may be created manually, or from a skeleton provided by \pkg{rmake}. To start from the skeleton, first load the \pkg{rmake} package:

<<>>=
library(rmake)
@

and then enter the following command:

<>=
rmakeSkeleton(".")
@

This will create two files in the current directory (``\code{.}''): \code{Makefile.R} and \code{Makefile}. The first file is an \R{} script intended to generate the second file. Initially, \code{Makefile.R} contains the following:

<>=
library(rmake)
job <- list()
makefile(job, "Makefile")
@

The \code{makefile()} function generates the \code{Makefile} based on the \code{job} variable, which currently contains an empty list. Nevertheless, the generated \code{Makefile} contains at least a single rule that ensures an automatic re-creation of the \code{Makefile} after any future change to the \code{Makefile.R} script.

\subsection{Running the Build Process}

Once the \code{Makefile} exists, the \emph{Make} tool may be executed from within the \R{} session by calling the following function:

<>=
make()
@

\begin{verbatim}
make: Nothing to be done for 'all'.
\end{verbatim}

Indeed, nothing needs to be done, since the only rule present is the one that generates the \code{Makefile} itself, and the \code{Makefile} is already up to date. Once \code{Makefile.R} is updated, the \code{Makefile} will be re-generated and any other tasks specified by the rules in \code{Makefile.R} will be executed.

To run \emph{Make} from the shell, just enter the following command (note the settings needed for everything to work properly outside the \R{} session in Section~\ref{sec:outside-r}):

\begin{verbatim}
$ make
make: Nothing to be done for 'all'.
\end{verbatim}

If working in \emph{RStudio}, it is beneficial to set up its environment to use \emph{Make}: in the \emph{Build}/\emph{Configure Build Tools} menu, set \emph{Project build tools} to \emph{Makefile}. A \emph{Build All} command becomes available that runs \emph{Make} using the generated \code{Makefile}.

\subsection{Adding a Build Rule}
\label{sec:addrule}

Now, let us do some ``real'' work. Suppose we have \code{data.csv} with the following content:

\begin{verbatim}
ID,V1,V2
a,2,8
b,9,1
c,3,3
\end{verbatim}

We would like to compute the sums of columns \code{V1} and \code{V2} and store the result into a file \code{sums.csv}. Therefore, we create the following script file \code{script.R}:

<>=
d <- read.csv("data.csv")
sums <- data.frame(ID="sum", V1=sum(d$V1), V2=sum(d$V2))
write.csv(sums, "sums.csv", row.names=FALSE)
@

The script reads \code{data.csv} into a variable \code{d}, creates a data frame \code{sums} with the computed sums and writes it to the file \code{sums.csv}.

Now, let us modify the \code{Makefile.R} script to build the \code{sums.csv} file automatically whenever either \code{data.csv} or \code{script.R} changes.
All that has to be done is to update the line of code where the \code{job} variable is created in \code{Makefile.R}:

<>=
library(rmake)
job <- list(rRule(target="sums.csv", script="script.R", depends="data.csv"))
makefile(job, "Makefile")
@

The \code{rRule()} function creates a new rule for execution of an \R{} script \code{script.R}, whose target is \code{sums.csv} and which depends on \code{data.csv}. Whenever any dependency file or the script file changes, the rule triggers and re-executes the given script in order to build the given target.

Let us run the \code{make} command again:

<>=
make()
@

The \emph{Make} utility should first re-generate the \code{Makefile} itself (since \code{Makefile.R} has changed) and then execute \code{script.R} in a new \R{} session to create \code{sums.csv}. Subsequent calls of \code{make} will do nothing until a change is detected again.

To finalize this toy example, let us create a file named \code{analysis.Rmd} with the following content:

\begin{verbatim}
---
title: "Analysis"
output: pdf_document
---

# Sums of data rows

```{r, echo=FALSE, results='asis'}
sums <- read.csv('sums.csv')
knitr::kable(sums)
```
\end{verbatim}

This is a markdown file, which we are going to process with the \pkg{rmarkdown} package to create a PDF document with the results. This script reads \code{sums.csv} and prints its content in a tabular layout. For more details on how to work with markdown documents see \cite{rmarkdown}.

The \code{analysis.Rmd} script depends on the \code{sums.csv} data file. The markdown processor produces an \code{analysis.pdf} file from it. Let us now update \code{Makefile.R} so that the PDF file is refreshed every time either the script or the data change.
The job creation command should be modified as follows:

<>=
library(rmake)
job <- list(
  rRule(target="sums.csv", script="script.R", depends="data.csv"),
  markdownRule(target="analysis.pdf", script="analysis.Rmd", depends="sums.csv")
)
makefile(job, "Makefile")
@

See Fig.~\ref{fig:basicUsageChain} for an illustrative diagram of dependencies. After calling

<>=
make()
@

the \code{analysis.pdf} file is generated.

\begin{figure}
  \centering
  \begin{tikzpicture}[auto]
    \node [box] (data) {data.csv};
    \node [oval, right=0.5cm of data] (script) {script.R};
    \node [box, right=0.5cm of script] (sums) {sums.csv};
    \node [oval, right=0.5cm of sums] (rmd) {analysis.Rmd};
    \node [box, right=0.5cm of rmd] (pdf) {analysis.pdf};
    \path [line] (data) -- (script);
    \path [line] (script) -- (sums);
    \path [line] (sums) -- (rmd);
    \path [line] (rmd) -- (pdf);
  \end{tikzpicture}
  \caption{A simple chain of dependencies from the example in Section~\ref{sec:addrule}}
  \label{fig:basicUsageChain}
\end{figure}

\subsection{The Pipe Operator}
\label{sec:pipes}

The sequence of the above-listed \pkg{rmake} rules forms a chain: \code{data.csv} is a prerequisite for \code{sums.csv}, which is a prerequisite for \code{analysis.pdf}. Such a sequence of rules may be equivalently written using the new ``\pipe{}'' pipe operator introduced by \pkg{rmake} (cf. Fig.~\ref{fig:basicUsageChain}):

<<>>=
job <- "data.csv" %>>%
  rRule("script.R") %>>%
  "sums.csv" %>>%
  markdownRule("analysis.Rmd") %>>%
  "analysis.pdf"
@

Generally, every $k$-th element of the pipe chain (for $k = 2, 4, 6, \ldots$) must be a call to a function that creates an \pkg{rmake} rule. Prior to its execution, each such function call internally receives two additional named arguments, \code{depends} and \code{target}, whose values are taken from the preceding (i.e., $(k-1)$-th) and the following (i.e., $(k+1)$-th) element of the chain, respectively.
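
As a small illustration of this rule, the sketch below writes one link of the chain in both ways. It assumes only the argument names \code{target}, \code{script} and \code{depends} used in Section~\ref{sec:addrule}; the variable names \code{piped} and \code{explicit} are introduced here purely for illustration.

<<eval=FALSE>>=
library(rmake)

# pipe form: the rule call sits at position k = 2 of the chain
piped <- "data.csv" %>>% rRule("script.R") %>>% "sums.csv"

# explicit form: the neighbours of the rule call are passed as
# 'depends' (element k - 1) and 'target' (element k + 1)
explicit <- list(rRule(target="sums.csv", script="script.R", depends="data.csv"))

# both objects may be printed and compared side by side
print(piped)
print(explicit)
@

Printing both objects is a quick way to check that a pipe chain expresses the intended dependencies before the \code{Makefile} is generated.
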

If a rule depends on or creates multiple files, their names have to be specified as a character vector (see the \code{c()} function) -- for instance, the \code{run.R} script reads and writes two files:

<>=
job <- c('in1.csv', 'in2.csv') %>>% rRule('run.R') %>>% c('out1.csv', 'out2.csv')
@

If the dependencies are more complex than a single chain, multiple chains may be merged with the \code{c()} function as follows:

<<>>=
chain1 <- "data1.csv" %>>% rRule("preprocess1.R") %>>% "intermed1.rds"
chain2 <- "data2.csv" %>>% rRule("preprocess2.R") %>>% "intermed2.rds"
chain3 <- c("intermed1.rds", "intermed2.rds") %>>%
  rRule("merge.R") %>>%
  "merged.rds" %>>%
  markdownRule("report.Rmd") %>>%
  "report.pdf"
job <- c(chain1, chain2, chain3)
@

\begin{figure}
  \centering
  \begin{tikzpicture}[auto]
    \node [box] (data1) {data1.csv};
    \node [oval, right=0.5cm of data1] (prep1) {preprocess1.R};
    \node [box, right=0.5cm of prep1] (intermed1) {intermed1.rds};
    \node [box, below=1.0cm of data1] (data2) {data2.csv};
    \node [oval, right=0.5cm of data2] (prep2) {preprocess2.R};
    \node [box, right=0.5cm of prep2] (intermed2) {intermed2.rds};
    \node [oval, right=0.5cm of intermed1, yshift=-1cm] (merge) {merge.R};
    \node [box, right=0.5cm of merge] (merged) {merged.rds};
    \node [oval, below=0.5cm of merged] (rmd) {report.Rmd};
    \node [box, below=0.5cm of rmd] (pdf) {report.pdf};
    \path [line] (data1) -- (prep1);
    \path [line] (prep1) -- (intermed1);
    \path [line] (data2) -- (prep2);
    \path [line] (prep2) -- (intermed2);
    \path [line] (intermed1) -| (merge);
    \path [line] (intermed2) -| (merge);
    \path [line] (merge) -- (merged);
    \path [line] (merged) -- (rmd);
    \path [line] (rmd) -- (pdf);
  \end{tikzpicture}
  \caption{A more complex chain of dependencies from the example in Section~\ref{sec:pipes}}
  \label{fig:complexChain}
\end{figure}

A graphical representation of the defined dependencies is shown in Fig.~\ref{fig:complexChain}.

\subsection{Cleaning Up}

It is good practice for \code{Makefile} writers to provide a clean-up task that deletes all files of the project that were generated within the build process. This is traditionally executed by the following command:

\begin{verbatim}
$ make clean
\end{verbatim}

Each \pkg{rmake} rule type automatically adds to the \code{Makefile} a command that deletes all target files generated by that rule. The only exception is the \code{Makefile} itself, which is never deleted, although it is generated too.

From within \R{}, the clean-up task may be executed by

<>=
make("clean")
@

\subsection{Parallel Execution}

Some implementations of the \emph{Make} utility allow building multiple targets in parallel. For example, \emph{GNU Make} recognizes the \code{-j} option, which specifies the number of processes to run simultaneously. The command

\begin{verbatim}
$ make -j8
\end{verbatim}

causes up to 8 targets to be built in parallel. If the \code{-j} option is given without a number, the \emph{Make} utility will not limit the number of rules that can run simultaneously.

From \R{}, the parallel execution may be started with the following command:

<>=
make("-j8")
@

\subsection{Visualization}
\label{sec:visualization}

The list of rules may be printed to obtain a concise overview of the defined dependencies.
For instance, the \code{job} from Section~\ref{sec:pipes} would produce the following output:

<<>>=
print(job)
@

A much more comprehensible view of the dependency graph is obtained by visualizing the \code{job}:

<>=
visualize(job, legend=FALSE)
@

\begin{figure}
  \centering
  \includegraphics[scale=1.5]{pipefig.png}
  \caption{Visualization of the \code{job} from Section~\ref{sec:pipes} with the \code{visualize()} function}
  \label{fig:pipefig}
\end{figure}

The \code{visualize()} function converts the \code{job} into a directed graph of the \pkg{visNetwork} package and renders it as an interactive web page, in which the picture may be zoomed and the nodes re-arranged with a mouse -- see Fig.~\ref{fig:pipefig} for an overview. Setting the optional \code{legend} argument to \code{FALSE} turns off the legend in the resulting figure. In the graph, the main script files are depicted with diamonds, rules are represented with ovals, and the other files are symbolized with squares.

\section{Details on Build Rules}
\label{sec:details}

The \pkg{rmake} package provides several functions that represent the most common build-rule types. Each function has a mandatory \code{target} argument for a character vector of file names created by that rule. Additionally, a character vector of file names the rule depends on can be specified as an optional \code{depends} argument. The \code{task} argument allows rules to be grouped into \emph{tasks} -- see Section~\ref{sec:tasks} for more details. Some rules also allow an optional \code{params} argument to pass parameters to the scripts; Section~\ref{sec:params} contains detailed information on that topic.

Each rule, when executed by the \emph{Make} tool, is started as a separate operating system process, that is, the \R{} scripts and markdown processors do not share the running instance of the \R{} interpreter.
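
The common arguments described above can be combined in a single call. The following sketch is illustrative only: the file names \code{train.csv}, \code{settings.csv}, \code{fit.R} and \code{model.rds}, as well as the task name \code{models}, are hypothetical.

<<eval=FALSE>>=
library(rmake)

r <- rRule(target="model.rds",                      # file created by the rule
           script="fit.R",                          # script executed to build it
           depends=c("train.csv", "settings.csv"),  # prerequisite files
           params=list(alpha=0.1),                  # values passed to the script
           task=c("all", "models"))                 # tasks the rule belongs to
job <- list(r)
@

Because the rule is a member of both tasks, it would be built by a plain build of \code{all} as well as by a build restricted to the hypothetical \code{models} task.
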

\subsection{Pre-defined Rule Types}

<>=
rRule(target, script, depends = NULL, params = list(), task = "all")
@

The \code{rRule()} rule executes the \R{} script by calling the \code{Rscript} executable file from the shell and \code{source}-ing the \code{script} file. This rule is triggered whenever any file from \code{depends} or the \code{script} file changes. Cleaning removes all \code{target} files.

<>=
markdownRule(target, script, depends = NULL, format = "all", params = list(), task = "all")
@

The \code{markdownRule()} rule renders a \code{target} document from the source Markdown \code{script} file. The rendering is done by calling the \code{render()} function of the \pkg{rmarkdown} package \citep{rmarkdown}. The \code{format} argument is passed to the \code{render()} function of \pkg{rmarkdown} as the \code{output\_format} parameter and thus may be used to specify the desired format of the resulting file: \code{'all'} to render all formats defined by the output directive specified in the \code{Rmd} file, the name of a format to render a single format, or a vector of format names to render multiple formats. The recognized format names are: \code{'html\_document'} (HTML web page), \code{'pdf\_document'} (Adobe Portable Document Format), \code{'word\_document'} (Microsoft Word document), \code{'odt\_document'} (OpenDocument Text), \code{'rtf\_document'} (Rich Text Format), and \code{'md\_document'} (Markdown document). Cleaning removes all \code{target} files.

<>=
offlineRule(target, message, depends = NULL, task = "all")
@

The \code{offlineRule()} rule provides a way to force some non-automated action within the \emph{Make} build process. It should be used whenever a transformation of prerequisites from the \code{depends} vector into \code{target} files requires a manual action.
During the build, a custom error \code{message} is shown that instructs the user to perform the task by hand; the message keeps being shown until the \code{target} files have more recent modification timestamps than the files in the \code{depends} vector. Cleaning removes all \code{target} files.

\subsection{Custom Rules}

Besides the predefined rule types, any custom rule may be added to the build process. Internally, the \code{makefile()} function generates the \code{Makefile} from a list of instances of the S3 \code{rmake.rule} class. To create an object of this type, the following general function may be used:

<>=
rule(target, depends = NULL, build = NULL, clean = NULL, task = "all", phony = FALSE)
@

where \code{target} is a character vector of target files that the rule is intended to generate, and \code{depends} is a character vector of prerequisite file names. In contrast to the predefined rule types, custom rules should also list script file names in the vector of dependencies, preceding other data files. The \code{task} argument may be used to assign the rule to a certain task, see Section~\ref{sec:tasks} for more information.

The \code{phony} argument is a boolean (\code{TRUE}/\code{FALSE}) value specifying whether the \emph{Make} rule has a \code{.PHONY} (i.e., non-file) target. A rule should be marked with \code{phony} if the target is not a file name that would be generated by the \code{build} command. E.g., \code{all} or \code{clean} are phony targets. All targets representing tasks (see Section~\ref{sec:tasks}) are phony as well. See also the manual of the \emph{Make} tool for more details on \code{.PHONY} targets.

The \code{build} and \code{clean} arguments are character vectors with shell commands that have to be executed during the build or the clean-up, respectively.
It is advisable to write shell commands with the use of \emph{Make} variables that are predefined by \pkg{rmake} at the beginning of the \code{Makefile}:
\begin{itemize}
  \item \code{\$(R)} -- a path and file name of the \code{Rscript} binary, as obtained from the \code{R_HOME} and \code{R_ARCH} environment variables (see Section~\ref{sec:outside-r} for details);
  \item \code{\$(RM)} -- a name of the file deletion command (\code{rm} on Unix, \code{del} on Windows).
\end{itemize}

For instance, the following rule runs the \emph{NodeJS} \proglang{JavaScript} interpreter on the \code{test.js} script, which generates the \code{test.json} file:

<>=
r <- rule(target="test.json", depends="test.js",
          build="node test.js", clean="$(RM) test.json")
@

\emph{Make} variables other than \code{\$(R)} or \code{\$(RM)} may be defined by modifying the \code{defaultVars} global variable. For example, let us introduce a new \code{\$(JS)} variable with the path to the \proglang{JavaScript} interpreter:

<>=
defaultVars["JS"] <- "/usr/bin/node"
@

One can then modify the above-listed rule in \code{Makefile.R} to use the new \code{\$(JS)} variable:

<>=
library(rmake)
defaultVars["JS"] <- "/usr/bin/node"
job <- list(rule(target="test.json", depends="test.js",
                 build="$(JS) test.js", clean="$(RM) test.json"))
makefile(job, "Makefile")
@

The \pkg{rmake} package provides a useful tool to help write rules that execute a custom sequence of \R{} commands. The \code{inShell()} function simply takes an \R{} language expression and transforms it into a character vector with a shell command that calls \code{Rscript} with parameters that execute the given expression:

<<>>=
inShell({
  result <- 1+1;
  saveRDS(result, "result.rds")
})
@

The \code{inShell()} function may be used to define a new rule:

<>=
rule(target="result.rds",
     build=inShell({
       result <- 1+1;
       saveRDS(result, "result.rds")
     }),
     clean="$(RM) result.rds")
@

However, the overuse of \code{inShell()} is not recommended.
For normal processing of data, it is far better to store \R{} commands in a separate script file, because then, after any change to the code is made, the \emph{Make} tool can detect it and force an update of the dependent artifacts. In contrast, a change to a rule that is made only at the level of the \code{Makefile} does not cause a rebuild of the target. The intended use of \code{inShell()} is to ease the implementation of internals such as in \code{rRule()} or \code{markdownRule()}.

\section{Advanced Usage}
\label{sec:advanced}

\subsection{Tasks}
\label{sec:tasks}

The task mechanism allows rules to be grouped so that a whole group of rules may be executed together. Unless stated otherwise, each \pkg{rmake} rule is a member of the \code{all} task. \pkg{rmake} also provides a special \code{clean} task for project clean-up.

To build all rules grouped into a task, simply invoke the \code{make} command and give it the name of the task. For instance:

\begin{verbatim}
$ make all
\end{verbatim}

executes all rules grouped into the \code{all} task. Equivalently, we may execute the task from within the \R{} session:

<>=
make('all')
@

A rule is assigned to a task by specifying the task name in the \code{task} argument of the rule creation function. A rule may be a member of more than one task -- simply pass all task names as a character vector in the \code{task} argument.

In the following example, the rules are divided into two tasks:

<>=
library(rmake)
job <- c(
  "data.csv" %>>% rRule("preprocess.R") %>>% "data.rds",
  "data.rds" %>>% markdownRule("preview.Rmd", task="preview") %>>% "preview.pdf",
  "data.rds" %>>% markdownRule("final.Rmd", task="final") %>>% "final.pdf"
)
makefile(job, "Makefile")
@

This code creates a rule for \code{preprocess.R}, which transforms \code{data.csv} into \code{data.rds}, and two markdown rules, for \code{preview.Rmd} and \code{final.Rmd}, which are assigned to their own tasks named \code{preview} and \code{final}, respectively.
Thus invoking <>= make("preview") @ would create \code{data.rds} (because \code{preview.Rmd} depends on it, irrespective of \code{preprocess.R} is a member of the \code{all} task) and \code{preview.pdf}, but not \code{final.pdf}. Similarly, <>= make("final") @ would generate \code{final.pdf} (possibly with previous build of \code{data.rds}, if it was not done already), but not \code{preview.pdf}. All rules will be triggered by calling <>= make("all") @ \subsection{Parameterized Execution of Rules} \label{sec:params} The \pkg{rmake} package allows to send parameters to the main script of a rule. Both \code{rRule()} and \code{markdownRule()} functions may obtain a list of arbitrary data as the \code{params} argument. The content of that argument would be available as the \code{params} global variable from within the script. Through such parameterization, a single \R{} script may be used to produce multiple outputs. An example is as follows: <>= library(rmake) job <- c( "data.csv" %>>% rRule("fit.R", params=list(alpha=0.1)) %>>% "out-0.1.rds", "data.csv" %>>% rRule("fit.R", params=list(alpha=0.2)) %>>% "out-0.2.rds", "data.csv" %>>% rRule("fit.R", params=list(alpha=0.3)) %>>% "out-0.3.rds", "data.csv" %>>% rRule("fit.R", params=list(alpha=0.4)) %>>% "out-0.4.rds" ) makefile(job, "Makefile") @ We can create the following \code{fit.R} script file to see what is inside of the \code{params} global variable: <>= # the fit.R file str(params) @ Executing <>= make("all") @ will show the following result: \begin{verbatim} List of 5 $ .target : chr "out-0.1.rds" $ .script : chr "fit.R" $ .depends: chr "data.csv" $ .task : chr "all" $ alpha : num 0.1 List of 5 $ .target : chr "out-0.2.rds" $ .script : chr "fit.R" $ .depends: chr "data.csv" $ .task : chr "all" $ alpha : num 0.2 List of 5 $ .target : chr "out-0.3.rds" $ .script : chr "fit.R" $ .depends: chr "data.csv" $ .task : chr "all" $ alpha : num 0.3 List of 5 $ .target : chr "out-0.4.rds" $ .script : chr "fit.R" $ .depends: 
chr "data.csv"
 $ .task : chr "all"
 $ alpha : num 0.4
\end{verbatim}

Indeed, the \code{params} variable contains the \code{alpha} parameter with the expected value. Besides that, \code{params} contains several dot-named values that correspond to the arguments of \code{rRule()}: \code{.target}, \code{.script}, \code{.depends}, and \code{.task}.

The \code{fit.R} script may access the \code{params} global variable directly, but it is advisable to use the \code{getParam()} function instead, which issues a warning if the script is executed without \code{params} having been defined beforehand:

<>=
# the fit.R file
library(rmake)
dataName <- getParam(".depends")
resultName <- getParam(".target")
alpha <- getParam("alpha")

# now we can use these variables to do some real work here...
cat("dataName:", dataName, "\n")
cat("resultName:", resultName, "\n")
cat("alpha:", alpha, "\n")
@

Executing \code{fit.R} outside the generated \code{Makefile} triggers a warning message about the missing parameters:

<>=
source("fit.R")
@

\begin{verbatim}
dataName: NA
resultName: NA
alpha: NA
Warning message:
In getParam(".depends") :
  rmake parameters not found - using default value for ".depends": NA
Warning message:
In getParam(".target") :
  rmake parameters not found - using default value for ".target": NA
Warning message:
In getParam("alpha") :
  rmake parameters not found - using default value for "alpha": NA
\end{verbatim}

Meaningful default values may be supplied as the second argument of the \code{getParam()} function:

<>=
dataName <- getParam(".depends", "data.csv")
resultName <- getParam(".target", "result.rds")
alpha <- getParam("alpha", 0.2)
@

The warning does not disappear, but the script can now run with sensible parameter values, which may be useful when debugging the script in \emph{RStudio}:

<>=
source("fit.R")
@

\begin{verbatim}
dataName: data.csv
resultName: result.rds
alpha: 0.2
Warning message:
In getParam(".depends", "data.csv") :
  rmake parameters not found - using default value for ".depends": data.csv
Warning message:
In getParam(".target", "result.rds") :
  rmake parameters not found - using default value for ".target": result.rds
Warning message:
In getParam("alpha", 0.2) :
  rmake parameters not found - using default value for "alpha": 0.2
\end{verbatim}

\subsection{Rule Templates}
\label{sec:templates}

More complex analyses may contain similar rule sequences that repeat multiple times. Think of fitting multiple models that differ only in some parameters and are stored in files whose names are derived from the parameter values. The \pkg{rmake} package provides a templating mechanism that avoids tedious copy-and-paste of rule definitions and makes the \code{Makefile.R} script quick to create and easy to maintain.

The idea of rule templates is best illustrated by the following example. Suppose we have many CSV data files that all have to be processed and saved in a uniform way. We may create a script that processes all files in a loop, but that would make it difficult to selectively recompute only the data files that change later. Instead, we might write a separate \pkg{rmake} rule for each data file:

<>=
job <- c(
  "data-1.csv" %>>% rRule("process.R") %>>% "result-1.csv",
  "data-2.csv" %>>% rRule("process.R") %>>% "result-2.csv",
  # ...
  "data-99.csv" %>>% rRule("process.R") %>>% "result-99.csv"
)
@

Rule templates simplify this code significantly:

<>=
tmpl <- "data-$[NUM].csv" %>>% rRule("process.R") %>>% "result-$[NUM].csv"
variants <- data.frame(NUM=1:99)
job <- expandTemplate(tmpl, variants)
@

The \code{expandTemplate()} function simply takes a list of rules, \code{tmpl}, and replaces all occurrences of \emph{template variables} in all strings with their values provided by the \code{variants} data frame. The rules in \code{tmpl} are replicated for each row of the \code{variants} data frame.
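
The substitution step can be illustrated outside of \pkg{rmake} with a minimal base-\R{} sketch. The helper \code{expandStrings()} below is hypothetical and is not part of the package; it only assumes that template variables are written as \verb|$[NAME]| and that each row of \code{variants} yields one copy of the strings. The actual \code{expandTemplate()} operates on rule objects and may differ in details:

<<>>=
# Hypothetical helper (not part of rmake): substitute $[NAME] placeholders
# in a character vector once for each row of the 'variants' data frame.
expandStrings <- function(strings, variants) {
  out <- character(0)
  for (i in seq_len(nrow(variants))) {
    expanded <- strings
    for (name in colnames(variants)) {
      placeholder <- paste0("$[", name, "]")
      value <- as.character(variants[[name]][i])
      # fixed = TRUE treats the placeholder literally, not as a regex
      expanded <- gsub(placeholder, value, expanded, fixed = TRUE)
    }
    out <- c(out, expanded)
  }
  # drop exact duplicates, e.g., strings that contain no placeholder
  unique(out)
}

variants <- data.frame(NUM = 1:3)
expandStrings(c("data-$[NUM].csv", "result-$[NUM].csv"), variants)
@

The call returns six file names, one \code{data-}/\code{result-} pair for each value of \code{NUM}.
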
The following example creates rules for each combination of \code{DATA} and \code{TYPE}:

<<>>=
variants <- expand.grid(DATA=c("dataSimple", "dataComplex"),
                        TYPE=c("lm", "rf", "nnet"))
print(variants)
tmpl <- "$[DATA].csv" %>>% rRule("fit-$[TYPE].R") %>>% "result-$[DATA]_$[TYPE].csv"
job <- expandTemplate(tmpl, variants)
@

The resulting \code{job} contains six rules that combine the specified variants as follows:

<<>>=
print(job)
@

If template expansion creates duplicate rules, they are omitted, as in the following job:

<<>>=
tmpl <- "data.csv" %>>% rRule("pre.R") %>>% "pre.rds" %>>%
  rRule("comp.R", params=list(alpha="$[NUM]")) %>>% "result-$[NUM].csv"
variants <- data.frame(NUM=1:5)
job <- expandTemplate(tmpl, variants)
@

Expanding the template would repeat the rule

<>=
"data.csv" %>>% rRule("pre.R") %>>% "pre.rds"
@

However, the repeated rules are removed automatically, as the printed job shows:

<<>>=
print(job)
@

On the other hand, template expansion (and not only that) may result in distinct rules that produce the same target. Such a sequence of rules is prohibited and causes an error in the \code{makefile()} function. The problem is illustrated in the example below.

<<>>=
tmpl <- "data-$[TYPE].csv" %>>% markdownRule("report.Rmd") %>>% "report.pdf"
variants <- data.frame(TYPE=c("a", "b", "c"))
job <- expandTemplate(tmpl, variants)
print(job)
@

Here, three different rules produce the same target (\code{report.pdf}). An attempt to generate the \code{Makefile} would end with an error:

<>=
makefile(job)
@

\section{Conclusion}
\label{sec:conclusion}

The presented \pkg{rmake} package provides an easy yet powerful way of managing complex data manipulation processes in \R{} using the well-known and broadly adopted \emph{Make} utility. \pkg{rmake} provides tools for generating the \code{Makefile}, in which the file dependencies and build rules are defined.
Advanced features of \pkg{rmake}, such as pipelining (\verb|%>>%|), parameterized rules, and rule templates, enable quick definition of these file dependencies.

\section*{Acknowledgements}

The support by the project ``LQ1602 IT4Innovations excellence in science'' is gratefully acknowledged.

\bibliography{bibl}

\end{document}