\documentclass[article]{jss} %\VignetteIndexEntry{Non parametric} \usepackage{amsfonts,amsmath,amssymb} \usepackage{float} % need no \usepackage{Sweave.sty} \definecolor{orange}{rgb}{1.0, 0.5, 0.0} \def\indic{\hbox{\it 1\hskip -3.5pt I}} % indicator function \def\R{{\mathbb{R}}} \def\P{{\mathbb{P}}} \def\E{{\mathbb{E}}} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% declarations for jss.cls %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% almost as usual \author{Abdelaati Daouia \\ Toulouse School of Economics \\ \& Universit\'e Catholique de Louvain \And Thibault Laurent \\ Toulouse \\ School of Economics \And Hohsuk Noh \\ Sookmyung Women's \\ University } \title{\pkg{npbr}: A Package for Nonparametric Boundary Regression in \proglang{R}} %% for pretty printing and a nice hypersummary also set: \Plainauthor{Abdelaati Daouia, Thibault Laurent, Hohsuk Noh} %% comma-separated \Plaintitle{npbr: A Package for Nonparametric Boundary Regression in R} %% without formatting \Shorttitle{\pkg{npbr}: Nonparametric Boundary Regression in \proglang{R}} %% a short title (if necessary) %% an abstract and keywords \Abstract{ The package \pkg{npbr} is the first free specialized software for data edge and frontier analysis in the statistical literature. It provides a variety of functions for the best known and most innovative approaches to nonparametric boundary estimation. The selected methods are concerned with empirical, smoothed, unrestricted as well as constrained fits under both single and multiple shape constraints. They also cover data envelopment techniques as well as robust approaches to outliers. The routines included in \pkg{npbr} are user friendly and afford a large degree of flexibility in the estimation specifications. They provide smoothing parameter selection for the modern local linear and polynomial spline methods as well as for some promising extreme value techniques. Also, they seamlessly allow for Monte Carlo comparisons among the implemented estimation procedures. This package will be very useful for statisticians and applied researchers interested in employing nonparametric boundary regression models. Its use is illustrated with a number of empirical applications and simulated examples. 
} \Keywords{boundary curve, concavity, extreme-values, kernel smoothing, linear programming, local linear fitting, monotonicity, multiple shape constraints, piecewise polynomials, spline smoothing, \proglang{R}} \Plainkeywords{boundary curve, concavity, extreme-values, kernel smoothing, linear programming, local linear fitting, monotonicity, multiple shape constraints, piecewise polynomials, spline smoothing, R} %% without formatting %% at least one keyword must be supplied %% publication information %% NOTE: Typically, this can be left commented and will be filled out by the technical editor %% \Volume{} %% \Issue{} %% \Month{} %% \Year{} %% \Submitdate{} %% \Acceptdate{} %% The address of (at least) one author should be given %% in the following format: \Address{ Thibault Laurent\\ Toulouse School of Economics (CNRS) \\ Universit\'e Toulouse 1 Capitole\\ 21, all\'ee de Brienne\\ 31042 Toulouse, FRANCE \\ E-mail: \email{Thibault.Laurent@univ-tlse1.fr}\\ } %% It is also possible to add a telephone and fax number %% before the e-mail in the following format: %% Telephone: +43/512/507-7103 %% Fax: +43/512/507-2851 %% for those who use Sweave please include the following line (with % symbols): % need no \usepackage{Sweave.sty} %% end of declarations %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{document} <>= options(prompt = "R> ", continue = "+ ", width = 70, useFancyQuotes = FALSE) .PngNo <- 0 @ <>= .PngNo <- .PngNo + 1; name.file <- paste("Figures/Fig-bitmap-", .PngNo, ".pdf", sep="") pdf(file=name.file, width = 7, height = 3.5, bg = "white") @ <>= .PngNo <- .PngNo + 1; name.file <- paste("Figures/Fig-bitmap-", .PngNo, ".pdf", sep="") pdf(file=name.file, width = 9, height = 3.5, bg = "white") @ <>= .PngNo <- .PngNo + 1; name.file <- paste("Figures/Fig-bitmap-", .PngNo, ".pdf", sep="") pdf(file=name.file, width = 18, height = 14, pointsize = 14, bg = "white") @ <>= dev.null <- dev.off() cat("\\includegraphics[width=0.9\\textwidth]{", name.file, "}\n\n", sep="") @ <>= dev.null <- dev.off() cat("\\includegraphics[width=0.8\\textwidth]{", name.file, "}\n\n", sep="") @ \section[Introduction]{Introduction \label{intro}} In the standard regression model $$y_i = \varphi(x_i) + \varepsilon_i,\quad i=1,\ldots,n, $$ where the data $(x_i,y_i)$ are observed, a variety of programs specializing in nonparametric and semi-parametric estimation have recently appeared. Prominent among these routines is the popular \pkg{np} package \citep{Hayfield:2008}, which allows \proglang{R} \citep{Rcore} users to conduct, for instance, nonparametric mean and quantile regression. In the non-standard boundary regression model, in contrast to classical theory, the regression errors $(\varepsilon_i)$ are not assumed to be centred, but to have a one-sided support $(-\infty, 0]$, and the regression function $\varphi$ describes some boundary curve. The present \pkg{npbr} package \citep{laurent} is a collection of functions that perform a variety of nonparametric estimation techniques of the frontier function $\varphi$ in the statistical software environment \proglang{R}. Specifically, suppose that we have $n$ pairs of observations $(x_i,y_i),~i=1,\ldots,n$, from a bivariate distribution with a density $f(x,y)$ in $\mathbb{R}^2$. 
The support $\Psi$ of $f$ is assumed to be of the form
\begin{eqnarray*}
\Psi = \{ (x,y) | y \leq \varphi(x) \} &\supseteq& \{ (x,y) | f(x,y) > 0 \} ,\\
\{ (x,y) | y > \varphi(x) \} &\subseteq& \{ (x,y) | f(x,y) = 0 \}, \nonumber
\end{eqnarray*}
where the graph of $\varphi$ corresponds to the locus of the curve above which the density $f$ is zero. More specifically, this graph is the extremal regression quantile curve corresponding to the probability level $1$. We consider the estimation of the frontier function $\varphi$ based on the sample $\{ (x_i,y_i),~i=1,\ldots,n\}$. This problem arises in a growing number of fields, such as classification, cluster analysis, economics, education, finance, management, physics, public policy and scatter-point image analysis. For example, in image reconstruction, the frontier-or-edge is typically the interface of areas of different intensities or differing color tones, perhaps black above the boundary (where no observations are recorded) and grey below \citep[see][for a nice summary and an extensive bibliography]{Park:2001}. In most applications, the frontier function $\varphi$ is assumed to be monotone or concave (convex) monotone. This naturally occurs when analyzing, for instance, the reliability of nuclear reactors, where $x_i$ represents the temperature of the reactor pressure vessel material $i$ and $y_i$ represents its fracture toughness. The main goal is to estimate the highest fracture toughness $\varphi$ as a function of the temperature. From a physical point of view, this master curve is known to be increasing and is believed to be concave \citep*[see][]{Daouia:2014,Daouia:2015}. According to the micro-economic theory of the firm, the support boundary is interpreted as the set of the most efficient businesses or industries that are optimally using inputs $x_i$ (labor, energy, capital, etc.) to produce their outputs $y_i$ (produced goods or services). Econometric considerations often lead to the assumption that the cost/production function $\varphi$ is monotone nondecreasing, with or without concavity. The concavity assumption is not always valid, although it is widely used in economics. For example, the production set $\Psi$ might admit increasing returns to scale, that is, the outputs might increase faster than the inputs \citep*[see, {\it e.g.,}][]{Daouia:2014}. Another related field of application where monotone boundaries and convex supports naturally appear is portfolio management. In the Capital Asset Pricing Model, the upper support extremity gives a benchmark relative to which the performance of an investment portfolio can be measured. Here, $x_i$ measures the risk (volatility or variance) of a portfolio, $y_i$ its average return, and $\varphi$ is required to be both monotone and concave \citep*[see, {\it e.g.,}][]{Gijbels:1999}. Such examples are abundant in economics and related fields. Nonparametric boundary regression is clearly a problem involving extreme value theory. Already in the case of production econometrics, \citet{hendricks:1992} stated, ``In the econometric literature on the estimation of production technologies, there has been considerable interest in estimating so called frontier production models that correspond closely to models for extreme quantiles of a stochastic production surface''.
\citet{chernozhukov:2005} and \citet{daouia:2013} may be viewed as the first attempts to actually implement the idea of Hendricks and Koenker, in a linear regression model and in a general nonparametric model, respectively. However, their approaches aim to estimate an extreme regression quantile curve instead of the true full frontier $\varphi$. The use of high regression quantiles might therefore be viewed as an exploratory tool, rather than as a method for final frontier analysis. To this end, one may employ the \proglang{R} packages \pkg{cobs} \citep{ng:2007}, \pkg{quantreg} \citep{koenker:2015} and \pkg{splines} \citep{Rcore}, to name a few. There is a vast literature on nonparametric frontier estimation, including extreme-value methods \citep*{deHaan:1994,Hall:1997,Gijbels:2000,Girard:2003,Girard:2004,Daouia:2010}, projection techniques \citep{Jacob:1995}, piecewise polynomials \citep*{Korostelev:1993,Hardle:1995} and local polynomials \citep*{Hall:2004,Hall:1998,Knight:2001}. It is often assumed that the joint density of the data $f(x,y)$ is an algebraic function of the distance from the upper support extremity with a power $\beta_x>-1$, {\it i.e.},
$$ f(x,y) = c_{x}\, \{\varphi(x) - y\}^{\beta_x}+o(\{\varphi(x) - y\}^{\beta_x})\quad\textrm{ as }\quad y\uparrow \varphi (x), $$
with $c_x$ being a strictly positive function in $x$. The quantity $\beta_x\not= 0$ describes the rate at which the density decays to zero smoothly ($\beta_x>0$) or rises up to infinity ($\beta_x<0$) as it approaches the boundary. The power $\beta_x=0$ corresponds to a jump of the density at the boundary $\varphi(x)$. The cases $\beta_x>0$, $\beta_x=0$ and $\beta_x<0$ are referred to as ``non-sharp boundary'', ``sharp boundary'' and ``default-type boundary'', respectively. For instance, the more realistic case of non-sharp boundaries has been studied in \citet{Hardle:1995}, where piecewise polynomials are utilized for estimating $\varphi(x)$. The particular range $\beta_x>1$ has been considered in \citet{Hall:1997}, where the estimation of $\varphi(x)$ is based on an increasing number of large order statistics generated by the $y_i$ values of observations falling into a strip around $x$. The case of general $\beta_x$ has been handled by \citet{Gijbels:2000}, where the maximum of all $y_i$ values of observations falling into a strip around $x$ and another extreme-value estimator based on three upper order statistics of these $y_i$'s are considered. None of the elegant approaches mentioned above relies, however, on the inherent shape constraints of monotonicity and concavity/convexity. There are two common empirical approaches for estimating monotone data edges: the free disposal hull (FDH) estimator \citep{Deprins:1984} and the data envelopment analysis (DEA) estimator \citep{Farrell:1957}, which relies on the additional assumption of concavity/convexity of the boundary curve. Despite the simple nature of these two estimators, their full asymptotic theory has been elucidated only during the last decade \citep*[see, {\it e.g.,}][]{Simar:2008}. An improved version of the FDH estimator, referred to as the linearized FDH (LFDH), has been considered in \citet{Hall:2002} and \citet{Jeong:2006}. Although the FDH, LFDH and DEA estimators provide the fitted values at the observed predictors with monotonicity or monotone concavity, they undersmooth the data and underestimate the true frontier.
To reduce these defects, \citet{Daouia:2015} suggested combining spline smoothing with constrained estimation under both separate and simultaneous shape constraints. Modern kernel smoothing fits have also been proposed by \citet{Parmeter:2013} to estimate the smooth frontier function, based on recent advances in constrained kernel estimation by \citet{Hall:2001}. More recently, \citet{Noh:2014} improved the kernel smoothing device of \citet{Parmeter:2013} by considering more adequate optimization criteria and bandwidth selection strategies for the estimator. Most of the available empirical and smooth estimation techniques are, however, based on envelopment ideas, and hence are highly non-robust to outliers and/or extreme values. Efforts to remedy such a deficiency have appeared in some nonparametric frontier models \citep*[see, {\it e.g.,}][]{Daouia:2005,Daouia:2006,Daouia:2011,Daouia:2012}. Prominent among these recent developments are the contributions of \citet{Daouia:2010,Daouia:2012}. Instead of using only the top observations lying on the sample boundary to estimate the true frontier, they show how other extreme observations can help to build robust frontier estimators by using ideas from \citet{Dekkers:1989} and \citet{Dekkers:1989b}. Moreover, they provide different useful asymptotic confidence bands for the boundary function under the monotonicity constraint in the case of general $\beta_x$. However, such techniques are not without their disadvantages. As is often the case in extreme-value theory, they require a large sample size to ensure acceptable results. The overall objective of the present package is to provide a large variety of functions for the best known approaches to nonparametric boundary regression, including the vast class of methods employed in the Monte Carlo comparisons of \citet{Daouia:2015} and \citet{Noh:2014}, as well as other promising nonparametric devices, namely the extreme-value techniques of \citet{Gijbels:2000}, \citet{Daouia:2010} and \citet{Daouia:2012}. The various functions in the \pkg{npbr} package are summarized in Table \ref{tab:fcts}. We are not aware of any other existing set of statistical routines more adapted to data envelope fitting and robust frontier estimation. Only the classical nonsmooth FDH and DEA methods can be found in some available packages dedicated to the economic literature on the measurement of the production performance of enterprises, such as the \proglang{R} package \pkg{Benchmarking} \citep{Bogetoft:2011}. Other contributions to the econometric literature on frontier analysis by \citet{Parmeter:2013} can be found at \url{http://socserv.mcmaster.ca/racinej/Gallery/Home.html}. The package \pkg{npbr} is actually the first free specialized software for the statistical literature on nonparametric frontier analysis. The routines included in \pkg{npbr} are user friendly and highly flexible in terms of estimation specifications. They allow the user to filter out noise in edge data by making use of both empirical and smooth fits, as well as (un)constrained estimates under separate and simultaneous multiple shape constraints. They also provide smoothing parameter selection for the innovative methods based on local linear techniques, polynomial splines, extreme values and kernel smoothing, though the proposed selection procedures can be computationally demanding.
To solve the various optimization problems involved, we mainly use the \pkg{Rglpk} package \citep[version >= 0.6-2]{Rglpk}, based on the \proglang{C} library \pkg{GLPK} \citep{glpk}, version 4.61. In addition, the package will be very useful for researchers and practitioners interested in employing nonparametric boundary regression methods. On the one hand, such methods are very appealing because they rely on very few assumptions and benefit from their modeling flexibility, function approximation power and ability to detect the boundary structure of data without recourse to any {\it a priori} parametric restrictions on the shape of the frontier and/or the distribution of noise. On the other hand, the package offers \proglang{R} users and statisticians in this active area of research simple functions to compute the empirical mean integrated squared error, the empirical integrated squared bias and the empirical integrated variance of various frontier estimators. This seamlessly allows the interested researcher to reproduce the Monte Carlo estimates obtained in the original articles and, perhaps most importantly, to easily compare the quality of any new proposal with the existing competing methods. The package \pkg{npbr} is available from the Comprehensive \proglang{R} Archive Network (CRAN) at \url{http://CRAN.R-project.org/package=npbr}. Section \ref{dataexamples} briefly presents five unrelated motivating data examples concerned with annual sport records, master curve prediction in the reliability programs of nuclear reactors, and optimal cost/production assessment in applied econometrics. Section \ref{functions} describes in detail the implemented functions of the package and provides practical guidelines for carrying out the necessary computations. In Section \ref{illustrations}, we provide some computational tips that facilitate Monte Carlo comparisons among frontier estimation methods in a similar way to the simulation studies undertaken by \citet{Daouia:2015} and \citet{Noh:2014}.
\begin{table}[h!]
\begin{center}
\begin{tabular}{ll|l|l}
\hline
& Function & Description & Reference \\
\hline
& \code{dea\_est} & DEA, FDH & \citet{Farrell:1957}, \\
& & & \citet{Deprins:1984}, \\
& & and linearized FDH & \citet{Hall:2002}, \\
& & & \citet{Jeong:2006} \\
& \code{loc\_est} & Local linear fitting & \citet{Hall:1998}, \\
& & & \citet{Hall:2004} \\
& \code{loc\_est\_bw} & Bandwidth choice & \citet{Hall:2004} \\
& & for local linear fitting & \\
& \code{poly\_est} & Polynomial estimation & \citet{Hall:1998} \\
& \code{poly\_degree} & Optimal polynomial & \citet{Daouia:2015} \\
& & degree selection & \\
& \code{dfs\_momt} & Moment type estimation & \citet{Daouia:2010}, \\
& & & \citet{Dekkers:1989} \\
& \code{dfs\_pick} & Pickands type estimation & \citet{Daouia:2010}, \\
& & & \citet{Dekkers:1989b} \\
& \code{rho\_momt\_pick} & Conditional tail & \citet{Daouia:2010}, \\
& & index estimation & \citet{Dekkers:1989}, \\
& & & \citet{Dekkers:1989b} \\
& \code{kopt\_momt\_pick} & Threshold selection for & \citet{Daouia:2010} \\
& & moment/Pickands frontiers & \\
& \code{dfs\_pwm\_regul} & Nonparametric frontier & \citet{Daouia:2012} \\
& & regularization & \\
& \code{loc\_max} & Local constant estimation & \citet{Gijbels:2000} \\
& \code{pick\_est} & Local extreme-value estimation & \citet{Gijbels:2000} \\
& \code{quad\_spline\_est} & Quadratic spline fitting & \citet{Daouia:2015} \\
& \code{quad\_spline\_kn} & Knot selection for & \citet{Daouia:2015} \\
& & quadratic spline fitting & \\
& \code{cub\_spline\_est} & Cubic spline fitting & \citet{Daouia:2015} \\
& \code{cub\_spline\_kn} & Knot selection for & \citet{Daouia:2015} \\
& & cubic spline fitting & \\
& \code{kern\_smooth} & Nonparametric kernel & \citet{Parmeter:2013}, \\
& & boundary regression & \citet{Noh:2014} \\
& \code{kern\_smooth\_bw} & Bandwidth choice for & \citet{Parmeter:2013}, \\
& & kernel boundary regression & \citet{Noh:2014} \\
\hline
\end{tabular}
\caption{\pkg{npbr} functions.}\label{tab:fcts}
\end{center}
\end{table}
%\section[Encountered applied settings and data examples]{Encountered applied settings and data examples \label{dataexamples}}
\section[Empirical applications]{Empirical applications \label{dataexamples}}
In this section, we illustrate the use of the \pkg{npbr} package via five different empirical applications taken from the recent literature. Each dataset is chosen to highlight the specifics of a class of estimation methods:
\begin{itemize}
\item The dataset \code{records} is concerned with the yearly best men's outdoor 1500-metre run times starting from 1966. These annual records, depicted in Figure \ref{datasets} (a), display some interesting features. Following \citet*{Jirak:2014}, the lower boundary can be interpreted as the best possible time for a given year. This boundary steadily decreases from 1970 until around the year 2000, followed by a sudden increase. This event leaves room for speculation given that, until the year 2000, it had been very difficult to distinguish between biological and synthetic EPO. Here, the boundary is not believed to be shape constrained and can be estimated by the polynomial, local linear, spline or kernel smoothing methods described in Sections \ref{piecewise_polynomials_splines} and \ref{kern_smooth}.
\item The dataset \code{nuclear} from the US Electric Power Research Institute (EPRI) consists of 254 toughness results obtained from non-irradiated representative steels. For each steel $i$, fracture toughness $y_i$ and temperature $x_i$ were measured.
The scatterplot is given in Figure \ref{datasets} (b). The objective is to estimate the lower and upper limits of fracture toughness for the reactor pressure vessel materials as a function of the temperature. Given that the nuclear reactors' data are measured accurately, it is natural and more realistic for practitioners to rely on data envelopment estimation techniques that we regroup in Sections \ref{piecewise_polynomials_splines}-\ref{kern_smooth}. Here, the lower support boundary is believed to be both increasing and convex, while the upper extremity is only known to be monotone nondecreasing \citep[see][]{Daouia:2014, Daouia:2015}. \item The dataset \code{air} is concerned with the assessment of the efficiency of 37 European Air Controllers. The performance of each controller can be measured by its ``distance'' from the upper support boundary, or equivalently, the set of the most efficient controllers. This dataset is taken from \citet*{Mouchart:2002}. The scatterplot of the controllers in the year 2000 is given in Figure \ref{datasets} (c), where their activity is described by one input (an aggregate factor of different kinds of labor) and one output (an aggregate factor of the activity produced, based on the number of controlled air movements, the number of controlled flight hours, etc.). Given the very small sample size and the sparsity in data, only the class of polynomials, piecewise polynomials and spline approximations seems to provide satisfactory fits in this applied setting. This class includes the families of empirical and smooth estimation methods described in Section \ref{piecewise_polynomials_splines}. Note also that the efficient frontier here is monotone and can be assumed to be in addition concave \citep[see][]{Daouia:2008, Daouia:2015}. \item The dataset \code{post} about the cost of the delivery activity of the postal services in France was first analyzed by \citet*{Cazals:2002} and then by \citet*{Aragon:2005} and \citet*{Daouia:2010} among others. There are 4000 post offices observed in 1994. For each post office $i$, the input $x_i$ is the labor cost measured by the quantity of labor, which accounts for more than $80\%$ of the total cost of the delivery activity. The output $y_i$ is defined as the volume of delivered mail (in number of objects). As can be seen from the scatterplot in Figure \ref{datasets} (d), some observations look so isolated in the output direction that they seem hardly related to the other observations. As a matter of fact, this dataset is known to contain outliers and it would then look awkward for practitioners to rely on estimation techniques based on data envelopment ideas \citep[see][]{Daouia:2011}. This motivated the quest for robust frontier estimation methods in Section \ref{robustness}. It should be clear that only these methods allow one to construct valid asymptotic confidence intervals for the unknown support boundary. \item The dataset \code{green} consists of 123 American electric utility companies. As in the set-up of \citet*{Gijbels:1999}, we used the measurements of the variables $y_i = \log(q_i)$ and $x_i = \log(c_i)$, where $q_i$ is the production output of the company $i$ and $c_i$ is the total cost involved in the production. A detailed description and analysis of these data can be found in \citet*{Christensen:1976}. The scatterplot is given in Figure \ref{datasets} (e). 
Here, the assumption of both monotonicity and concavity constraints is well accepted and any restricted data envelopment technique such as, for instance, kernel smoothing in Section \ref{kern_smooth} can be applied. Also, in the absence of information on whether these data are recorded accurately, one may favor robust frontier estimation. We caution the user that the robust methods based on extreme-value ideas may require a large sample size of the order of thousands to achieve acceptable fits and confidence intervals. %%In case of small and moderate samples, it is most efficient to employ the R package "frontiles" for any robust analysis by making use of the concept of partial frontier modeling (Daouia and Laurent) : tbc \end{itemize} To help users navigate the methods in the \pkg{npbr} package, we describe in Table \ref{table2} the type of estimation and shape constraints allowed by each method. \begin{table}[h!] \begin{center} \begin{tabular}{ll|l|l} \hline & Function & Type of estimator & Allowed constraints \\ \hline & \code{dea\_est} & envelope, piecewise linear & monotonicity, concavity \\ & \code{loc\_est} & envelope, local linear & unconstrained \\ & \code{poly\_est} & envelope, polynomial & unconstrained \\ & \code{dfs\_momt} & robust, extreme quantile & monotonicity \\ & \code{dfs\_pick} & robust, extreme quantile & monotonicity \\ & \code{dfs\_pwm\_regul} & robust, probability- & monotonicity \\ & & weighted moment & \\ & \code{loc\_max} & envelope, local constant, & unconstrained \\ & & local DEA & monotonicity, concavity \\ & \code{pick\_est} & robust/envelope, & unconstrained \\ & & extreme quantile & \\ & \code{quad\_spline\_est} & envelope, quadratic spline & unconstrained, monotonicity, \\ & & & concavity \\ & \code{cub\_spline\_est} & envelope, cubic spline & unconstrained, concavity \\ & \code{kern\_smooth} & envelope, kernel smoother & unconstrained, monotonicity, \\ & & & concavity \\ \hline \end{tabular} \caption{Characteristics of the estimation methods in \pkg{npbr}.}\label{table2} \end{center} \end{table} \vspace{2mm} \noindent For our illustration purposes, each of the five datasets contains only two variables: one input and one output. <<>>= require("npbr") data("records", "nuclear", "air", "post", "green") @ The following code will generate Figure \ref{datasets}. <>= plot(result ~ year, data = records, xlab = "year", ylab = "1500m record") plot(ytab ~ xtab, data = nuclear, xlab = "temp. of the reactor vessel", ylab = "fracture toughness") plot(ytab ~ xtab, data = air, xlab = "input", ylab = "output") plot(yprod ~ xinput, data = post, xlab = "quantity of labor", ylab = "volume of delivered mail") plot(log(OUTPUT) ~ log(COST), data = green) @ \begin{figure}[!h] \begin{center} <>= <> op <- par(mfrow = c(2,3), mar = c(3,3.1,2.1,2.1), mgp = c(2,.4,0), oma = c(0,0,0,0), cex.lab = 1.7, cex.axis = 1.7, cex.main = 1.7, col.main = "blue") plot(result ~ year, data = records, col = "blue2", pch = 1, xlab = "year", ylab = "1500m record", main = "(a)") plot(ytab ~ xtab, data = nuclear, pch = 1, col = "blue2", main = "(b)", xlab = "temp. 
of the reactor vessel", ylab = "fracture toughness") plot(ytab ~ xtab, data = air, pch = 1, col = "blue2", xlab = "input", ylab = "output", main = "(c)") plot(yprod ~ xinput, data = post, pch = 1, col = "blue2", main = "(d)", xlab = "quantity of labor", ylab = "volume of delivered mail") plot(log(OUTPUT) ~ log(COST), data = green, pch = 1, col = "blue2", main = "(e)") par(op) <> @ \end{center} \caption{From left to right and from top to bottom, the scatterplots of the yearly best men's outdoor 1500-metre run times in seconds, the 254 nuclear reactors' data, the 37 European Air Controllers, the 4000 European post offices and the 123 American electric utility companies.}\label{datasets} \end{figure} \section[Main functions]{Main functions \label{functions}} This section describes in detail the main functions of the \pkg{npbr} package. The two first arguments of these functions correspond to the observed inputs $x_1,\ldots,x_n$ and the observed outputs $y_1,\ldots,y_n$. The third argument is a numeric vector of evaluation points at which the estimator is to be computed. Basically, the user can generate a regular sequence of size 100, or any finer grid of points, from the minimum value of inputs $x_i$ to their maximum value. The other arguments of the functions depend on the underlying statistical methods. \vspace{2mm} \noindent We do not presume that the user is familiar with nonparametric frontier modeling hence briefly describe the underlying estimation methodology and tuning parameters selection for each method. Section \ref{piecewise_polynomials_splines} is concerned with piecewise polynomial fitting, Section \ref{local_polynomials} with local polynomial estimation, Section \ref{kern_smooth} with kernel smoothing techniques, and Section \ref{robustness} with robust regularization approaches. %%%%%%%% %%%Plan%%% %%%%%%%% %\subsection{Piecewise polynomial fitting}\label{piecewise_polynomials_splines} % \subsubsection{DEA, FDH and LFDH frontiers}\label{dea_fdh_lfdh} % \subsubsection{Polynomial boundaries}\label{polynomials} % \subsubsection{Quadratic spline frontiers}\label{quadratic_splines} % \subsubsection{Cubic spline frontiers}\label{cubic_splines} %\subsection{Localizing boundary regression}\label{local_polynomials} % \subsubsection{Local linear fitting}\label{local_linear} % \subsubsection{Local maximum estimation}\label{local_max} % \subsubsection{Local extreme-value smoothing}\label{local_ev} %\subsection{Kernel smoothing}\label{kern_smooth} %\subsection{Robust extreme-value approaches}\label{robustness} \subsection[Piecewise polynomial fitting]{Piecewise polynomial fitting \label{piecewise_polynomials_splines}} We commence with the traditional empirical DEA, FDH and Linearized FDH estimators. We then proceed to polynomial boundary estimators \citep*{Hall:1998}, and finally to constrained spline estimators \citep*{Daouia:2015}. \subsubsection[DEA, FDH and LFDH frontiers]{DEA, FDH and LFDH frontiers \label{dea_fdh_lfdh}} The function \code{dea\_est} implements the empirical FDH, LFDH and DEA frontier estimators programmed earlier in the \pkg{Benchmarking} package \citep{Bogetoft:2011}. There are two popular methods for preserving monotonicity in the frontier setting: the free disposal hull (FDH) introduced by \citet{Deprins:1984} and the data envelopment analysis (DEA) proposed by \citet{Farrell:1957}. 
The FDH boundary is the lowest ``stair-case'' monotone curve covering all the data points $$\varphi_{fdh}(x):=\max\{y_i,\,i:x_i\leq x\}.$$ An improved version of this estimator, referred to as the linearized FDH (LFDH), is obtained by drawing the polygonal line smoothing the staircase FDH curve. It has been considered in \citet{Hall:2002} and \citet{Jeong:2006}. When the joint support of the data is in addition convex, the DEA estimator is defined as the least concave majorant of the FDH frontier. Formally, the DEA estimator of the joint support $\Psi$ is defined by \begin{align*} &\hat\Psi = \left\{ (x,y) | y \leq \sum_{i=1}^n \gamma_i y_i ; \, x\geq\sum_{i=1}^n \gamma_i x_i \quad \textrm{for some} \quad (\gamma_1,\ldots,\gamma_n), \quad \right. \\ & \left. \textrm{such that} \quad \sum_{i=1}^n \gamma_i = 1; \, \gamma_i \geq 0; \, i=1,\ldots,n \right\}. \end{align*} Then the DEA estimator of the frontier function $\varphi$ at $x$ is defined by $$\varphi_{dea}(x):=\sup\{ y | (x,y)\in\hat\Psi \}.$$ Note that the FDH, LFDH and DEA estimators are well defined whenever there exists an $x_i$ such that $x_i\leq x$. To illustrate the difference between these three empirical frontiers, we consider the \code{air} and \code{green} data. First, we generate a vector of evaluation points. <<>>= x.air <- seq(min(air$xtab), max(air$xtab), length.out = 101) x.green <- seq(min(log(green$COST)), max(log(green$COST)), length.out = 101) @ Then, we compute the DEA, FDH and LFDH estimates. <<>>= y.dea.green = dea_est(log(green$COST), log(green$OUTPUT), x.green, type = "dea") y.fdh.green = dea_est(log(green$COST), log(green$OUTPUT), x.green, type = "fdh") y.lfdh.green = dea_est(log(green$COST), log(green$OUTPUT), x.green, type = "lfdh") y.dea.air <- dea_est(air$xtab, air$ytab, x.air, type = "dea") y.fdh.air <- dea_est(air$xtab, air$ytab, x.air, type = "fdh") y.lfdh.air <- dea_est(air$xtab, air$ytab, x.air, type = "lfdh") @ Figure~\ref{DEA-fig} plots the resulting piecewise linear curves. The following code will generate Figure~\ref{DEA-fig}. 
<>=
plot(y.dea.green ~ x.green, lty = 4, col = "cyan", type = "l", xlab = "log(cost)", ylab = "log(output)")
lines(x.green, y.fdh.green, lty = 1, col = "green")
lines(x.green, y.lfdh.green, lty = 2, col = "magenta")
legend("topleft", legend = c("DEA", "FDH", "LFDH"), bty = "n", col = c("cyan", "green", "magenta"), lty = c(4, 1, 2))
points(log(OUTPUT) ~ log(COST), data = green)
plot(x.air, y.dea.air, lty = 4, col = "cyan", type = "l", xlab = "input", ylab = "output")
lines(x.air, y.fdh.air, lty = 1, col = "green")
lines(x.air, y.lfdh.air, lty = 2, col = "magenta")
legend("topleft", legend = c("DEA", "FDH", "LFDH"), bty = "n", col = c("cyan", "green", "magenta"), lty = c(4, 1, 2))
points(ytab ~ xtab, data = air)
@
\begin{figure}[htbp] \begin{center}
<>=
<>
op = par(mfrow = c(1, 2), mar = c(3, 3.1, 2.1, 2.1), mgp = c(2, .4, 0), oma = c(0, 0, 0, 0), cex.lab = 1.4)
plot(x.green, y.dea.green, lty = 4, lwd = 4, col = "cyan", type = "l", xlab = "log(cost)", ylab = "log(output)")
lines(x.green, y.fdh.green, lty = 1, lwd = 4, col = "green")
lines(x.green, y.lfdh.green, lty = 2, lwd = 4, col = "magenta")
legend("topleft", legend = c("DEA", "FDH", "LFDH"), cex = 1.2, bty = "n", col = c("cyan", "green", "magenta"), lty = c(4, 1, 2), lwd = 4)
points(log(OUTPUT) ~ log(COST), data = green, cex = 1)
plot(x.air, y.dea.air, lty = 4, lwd = 4, col = "cyan", type = "l", xlab = "input", ylab = "output")
lines(x.air, y.fdh.air, lty = 1, lwd = 4, col = "green")
lines(x.air, y.lfdh.air, lty = 2, lwd = 4, col = "magenta")
legend("topleft", legend = c("DEA", "FDH", "LFDH"), cex = 1.2, bty = "n", col = c("cyan", "green", "magenta"), lty = c(4, 1, 2), lwd = 4)
points(ytab ~ xtab, data = air)
par(op)
<>
@
\end{center} \caption{DEA, FDH and LFDH estimates of the optimal frontier for the 37 European air controllers (left) and the 123 American electric utility companies (right). \label{DEA-fig}} \end{figure}
\subsubsection[Polynomial estimators]{Polynomial estimators \label{polynomials}}
The function \code{poly\_est} is an implementation of the unconstrained polynomial-type estimators of \citet*{Hall:1998} for support frontiers and boundaries. \vspace{2mm} \noindent Here, the data edge is modeled by a single polynomial $\varphi_{\theta}(x) = \theta_0+\theta_1 x+\ldots+\theta_p x^p$ of known degree $p$ that envelopes the full data and minimizes the area under its graph for $x\in[a,b]$, with $a$ and $b$ being respectively the lower and upper endpoints of the design points $x_1,\ldots,x_n$. The function returns the estimate $\hat\varphi_{n,P}(x) = \hat\theta_0+\hat\theta_1 x+\ldots+\hat\theta_p x^p$ of $\varphi(x)$, where $\hat\theta=(\hat\theta_0,\hat\theta_1,\ldots,\hat\theta_p)^\top$ minimizes $\int_{a}^b \varphi_{\theta}(x) \,dx$ over $\theta\in\R^{p+1}$ subject to the envelopment constraints $\varphi_{\theta}(x_i)\geq y_i$, $i=1,\ldots,n$. The polynomial degree $p$ has to be fixed by the user in the 4th argument of the function. \vspace{3mm} \noindent {\it Selection of the polynomial degree} \vspace{3mm} \noindent As the degree $p$ determines the dimensionality of the approximating function, we may view the problem of choosing $p$ as model selection by calling the function \code{poly\_degree}.
By analogy to the information criteria proposed by \citet{Daouia:2015} in the boundary regression context, we obtain the optimal polynomial degree by minimizing \begin{eqnarray*} AIC(p) &=& \log \left( \sum_{i=1}^n (\hat \varphi_n(x_i)-y_i) \right) + (p+1)/n , \label{aic} \\BIC(p) &=& \log \left( \sum_{i=1}^n (\hat \varphi_n(x_i)-y_i) \right) + \log n \cdot (p+1)/(2n). \label{bic} \end{eqnarray*} The first one (option \code{type = "AIC"}) is similar to the famous Akaike information criterion \citep{Akaike:1973} and the second one (option \code{type = "BIC"}) to the Bayesian information criterion \citep{Schwartz:1978}. They aim to balance the fidelity to data and the complexity of the fit in the boundary regression context. There are several ways to motivate the use of the total absolute residuals in these criteria instead of the standard residual sum of squares. For instance, it can be derived directly assuming exponential errors as motivated by \citet{Daouia:2015} in Section 2.1. \vspace{3mm} \noindent {\it Practical guidelines}\vspace{3mm} \noindent By way of example, we consider the \code{records}, \code{air} and \code{nuclear} datasets. To determine the optimal polynomial degrees via the AIC criterion, we employ the commands <<>>= (p.aic.records <- poly_degree(records$year, 1/records$result, prange = 0:12, type = "AIC")) (p.aic.air <- poly_degree(air$xtab, air$ytab, type = "AIC")) (p.aic.nuc <- poly_degree(nuclear$xtab, nuclear$ytab, type = "AIC")) @ We find the same degrees by applying the BIC criterion. The \proglang{R}~specifications for the corresponding polynomial boundaries to be estimated are given by <<>>= x.records<-seq(min(records$year), max(records$year), length.out = 101) y.poly.records <- poly_est(records$year, 1/records$result, x.records, deg = p.aic.records) y.poly.air <- poly_est(air$xtab, air$ytab, x.air, deg = p.aic.air) x.nucl <- seq(min(nuclear$xtab), max(nuclear$xtab), length.out = 101) y.poly.nuc <- poly_est(nuclear$xtab, nuclear$ytab, x.nucl, deg = p.aic.nuc) @ The following code can be used to construct the plots of the resulting estimators appearing in Figure~\ref{fig-poly}. 
<>=
plot(x.records, 1/y.poly.records, type = "l", col = "green")
points(result ~ year, data = records)
legend("bottomleft", legend = paste("degree =", p.aic.records), col = "green", lty = 1, bty = "n")
plot(x.air, y.poly.air, type = "l", col = "magenta")
points(ytab ~ xtab, data = air)
legend("topleft", legend = paste("degree =", p.aic.air), col = "magenta", lty = 1, bty = "n")
plot(y.poly.nuc ~ x.nucl, type = "l", col = "cyan", ylim = range(nuclear$ytab))
points(ytab ~ xtab, data = nuclear)
legend("topleft", legend = paste("degree =", p.aic.nuc), col = "cyan", lty = 1, bty = "n")
@
\begin{figure}[htbp] \begin{center}
<>=
<>
op = par(mfrow = c(1, 3), mar = c(4.5, 4.5, 2.1, 2.1), mgp = c(2.9, 1.1, 0), oma = c(0, 0, 0, 0), cex.lab = 2.3, cex.axis = 2.3)
plot(x.records, 1/y.poly.records, lty = 1, lwd = 4, col = "green", type = "l", xlab = "year", ylab = "1500m record")
points(result ~ year, data = records)
legend("bottomleft", legend = paste("degree =", p.aic.records), bty = "n", col = "green", lwd = 4, lty = 1, cex = 2.2)
plot(x.air, y.poly.air, lty = 1, lwd = 4, col = "magenta", type = "l", xlab = "input", ylab = "output")
points(ytab ~ xtab, data = air)
legend("topleft", legend = paste("degree =", p.aic.air), bty = "n", col = "magenta", lwd = 4, lty = 1, cex = 2.2)
plot(x.nucl, y.poly.nuc, lty = 1, lwd = 4, col = "cyan", type = "l", ylim = range(nuclear$ytab), xlab = "temperature", ylab = "toughness")
points(ytab ~ xtab, data = nuclear)
legend("topleft", legend = paste("degree =", p.aic.nuc), bty = "n", col = "cyan", lwd = 4, lty = 1, cex = 2.2)
par(op)
<>
@
\end{center} \caption{Polynomial boundary estimators for the 46 annual sport records (left), the 37 European air controllers (middle) and the 254 nuclear reactors' data (right).\label{fig-poly}} \end{figure}
\subsubsection[Quadratic spline smoothers]{Quadratic spline smoothers \label{quadratic_splines}}
The function \code{quad\_spline\_est} is an implementation of the (un)constrained quadratic spline estimates proposed by \citet{Daouia:2015}. \vspace{3mm} \noindent{\it Unconstrained quadratic fit} \vspace{3mm} \noindent Let $a$ and $b$ be, respectively, the minimum and maximum of the design points $x_1,\ldots,x_n$. Denote a partition of $[a,b]$ by $a=t_0<t_1<\ldots<t_{k_n}=b$. The unconstrained fit $\tilde\varphi_n$ (option \code{method = "u"}) is defined as the quadratic spline with these knots that envelopes all the data points and minimizes the area under its graph, in the same spirit as the polynomial estimate of Section~\ref{polynomials}; we refer to \citet{Daouia:2015} for the precise formulation of the underlying linear program. \vspace{3mm} \noindent {\it Constrained fits and knot selection} \vspace{3mm} \noindent Adding the monotonicity constraint to the optimization problem yields the monotone spline $\hat\varphi_n$ (option \code{method = "m"}), while imposing both monotonicity and concavity yields the constrained fit $\hat\varphi^{\star}_n$ (option \code{method = "mc"}). The value of $k_n$, which governs the fineness of the partition, plays the role of the smoothing parameter and can be selected by the function \code{quad\_spline\_kn} by means of AIC and BIC type criteria analogous to those of Section~\ref{polynomials} (options \code{type = "AIC"} and \code{type = "BIC"}). \vspace{3mm} \noindent {\it Practical guidelines} \vspace{3mm} \noindent For the \code{air} and \code{green} data, the optimal values of $k_n$ for the unconstrained fit according to the BIC criterion are obtained via the commands
<<>>=
(kn.bic.air.u <- quad_spline_kn(air$xtab, air$ytab, method = "u", type = "BIC"))
(kn.bic.green.u <- quad_spline_kn(log(green$COST), log(green$OUTPUT), method = "u", type = "BIC"))
@
When applying the AIC criterion, we get the optimal values 12 and 20 of $k_n$, respectively. The \proglang{R}~specification for the unconstrained spline estimate $\tilde\varphi_n$ to be calculated is given by
<<>>=
y.quad.air.u <- quad_spline_est(air$xtab, air$ytab, x.air, kn = kn.bic.air.u, method = "u")
y.quad.green.u <- quad_spline_est(log(green$COST), log(green$OUTPUT), x.green, kn = kn.bic.green.u, method = "u")
@
When only the monotonicity constraint is of interest, we calculate the optimal number $k_n$ via the following specification:
<<>>=
(kn.bic.air.m <- quad_spline_kn(air$xtab, air$ytab, method = "m", type = "BIC"))
(kn.bic.green.m <- quad_spline_kn(log(green$COST), log(green$OUTPUT), method = "m", type = "BIC"))
@
Note that we find the values 6 and 19 of the optimal number $k_n$ when applying the AIC criterion.
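If desired, these AIC-based values can be checked directly by switching the \code{type} argument of \code{quad\_spline\_kn} from \code{"BIC"} to \code{"AIC"}; the sketch below does so for the monotone fit, with all other arguments left at the same values as above (the object names are chosen here purely for illustration).
<<eval = FALSE>>=
(kn.aic.air.m <- quad_spline_kn(air$xtab, air$ytab, method = "m", type = "AIC"))
(kn.aic.green.m <- quad_spline_kn(log(green$COST), log(green$OUTPUT), method = "m", type = "AIC"))
@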
The monotonic spline $\hat\varphi_n$ can then be produced by employing the command <<>>= y.quad.air.m <- quad_spline_est(air$xtab, air$ytab, x.air, kn = kn.bic.air.m, method = "m") y.quad.green.m <- quad_spline_est(log(green$COST), log(green$OUTPUT), x.green, kn = kn.bic.green.m, method = "m") @ When the concavity constraint is also of interest, we obtain the optimal number $k_n$ via the BIC criterion and the corresponding constrained spline $\hat\varphi^{\star}_n$ by proceeding as follows: <<>>= (kn.bic.air.mc <- quad_spline_kn(air$xtab, air$ytab, method = "mc", type = "BIC")) (kn.bic.green.mc <- quad_spline_kn(log(green$COST), log(green$OUTPUT), method = "mc", type = "BIC")) @ When applying the AIC criterion, we get the optimal values 2 and 7 of $k_n$, respectively. To compute the smoother $\hat\varphi^{\star}_n$ by utilizing all the DEA points as knots, we use the command <<>>= y.quad.air.mc <- quad_spline_est(air$xtab, air$ytab, x.air, kn = kn.bic.air.mc, method = "mc", all.dea = TRUE) y.quad.green.mc <- quad_spline_est(log(green$COST), log(green$OUTPUT), x.green, kn = kn.bic.green.mc, method = "mc", all.dea = TRUE) @ The resulting unrestricted and two constrained estimates of the econometric frontiers ({\it i.e.}, the sets of the most efficient companies and controllers) are graphed in Figure~\ref{fig-splines} for each dataset. The following code will generate Figure~\ref{fig-splines}. <>= plot(y.quad.air.u ~ x.air, lty = 1, col = "green", type = "l", xlab = "input", ylab = "output") lines(x.air, y.quad.air.m, lty = 2, col = "cyan") lines(x.air, y.quad.air.mc, lty = 3, col = "magenta") points(ytab ~ xtab, data = air) legend("topleft", col = c("green", "cyan", "magenta"), lty = c(1, 2, 3), bty = "n", lwd = 4, legend = c("unconstrained", "monotone", "monotone + concave")) plot(y.quad.green.u ~ x.green, lty = 1, col = "green", type = "l", xlab = "log(COST)", ylab = "log(OUTPUT)") lines(x.green, y.quad.green.m, lty = 2, col = "cyan") lines(x.green, y.quad.green.mc, lty = 3, col = "magenta") points(log(OUTPUT) ~ log(COST), data = green) legend("topleft", col = c("green", "cyan", "magenta"), bty = "n", lty = c(1, 2, 3), legend = c("unconstrained", "monotone", "monotone + concave")) @ \begin{figure}[htbp] \begin{center} <>= <> op=par(mfrow=c(1,2),mar=c(3,3.1,2.1,2.1),mgp=c(2,.4,0),oma=c(0,0,0,0), cex.lab=1.3, cex.axis=1.2) plot(x.air, y.quad.air.u, lty=1, lwd=4, col="green", type="l", xlab="input", ylab="output") lines(x.air, y.quad.air.m, lty=2, lwd=4, col="cyan") lines(x.air, y.quad.air.mc, lty=3, lwd=4, col="magenta") points(ytab~xtab, data=air) legend("topleft", col=c("green","cyan","magenta"), bty = "n", lty=c(1,2,3), legend=c("unconstrained", "monotone", "monotone + concave"), lwd=4, cex=1.1) plot(x.green, y.quad.green.u, lty=1, lwd=4, col="green", type="l", xlab="log(COST)", ylab="log(OUTPUT)") lines(x.green, y.quad.green.m, lty=2, lwd=4, col="cyan") lines(x.green, y.quad.green.mc, lwd=4, lty=3, col="magenta") points(log(OUTPUT)~log(COST), data=green) legend("topleft", col=c("green","cyan","magenta"), bty = "n", lty=c(1,2,3), legend=c("unconstrained", "monotone", "monotone + concave"), lwd=4, cex=1.1) par(op) <> @ \end{center} \caption{The quadratic spline frontiers $\tilde\varphi_n$, $\hat\varphi_n$ and $\hat\varphi^{\star}_n$ for the 37 European air controllers (left) and the 123 American electric utility companies (right).\label{fig-splines}} \end{figure} \subsubsection[Cubic spline frontiers]{Cubic spline frontiers \label{cubic_splines}} The function \code{cub\_spline\_est} is an 
implementation of the (un)constrained cubic spline estimates proposed by \citet{Daouia:2015}. \vspace{2mm} \noindent As in the quadratic spline setting, let $a$ and $b$ be respectively the minimum and maximum of the design points $x_1,\ldots,x_n$, and denote a partition of $[a,b]$ by $a=t_0<t_1<\ldots<t_{k_n}=b$. The unconstrained estimate (option \code{method = "u"}) is the cubic spline with these knots that envelopes all the data points and minimizes the area under its graph, while the option \code{method = "m"} adds the monotonicity constraint and \code{method = "mc"} further adds concavity; see \citet{Daouia:2015} for the exact formulation of the corresponding optimization problems. As in the quadratic case, the optimal value of $k_n$ can be selected by means of the AIC and BIC type criteria implemented in the function \code{cub\_spline\_kn}. For the \code{air} and \code{green} data, the BIC choices of $k_n$ for the unconstrained, monotone and concave fits are obtained as follows:
<<>>=
(kn.bic.air.u <- cub_spline_kn(air$xtab, air$ytab, method = "u", type = "BIC"))
(kn.bic.green.u <- cub_spline_kn(log(green$COST), log(green$OUTPUT), method = "u", type = "BIC"))
(kn.bic.air.m <- cub_spline_kn(air$xtab, air$ytab, method = "m", type = "BIC"))
(kn.bic.green.m <- cub_spline_kn(log(green$COST), log(green$OUTPUT), method = "m", type = "BIC"))
(kn.bic.air.mc <- cub_spline_kn(air$xtab, air$ytab, method = "mc", type = "BIC"))
(kn.bic.green.mc <- cub_spline_kn(log(green$COST), log(green$OUTPUT), method = "mc", type = "BIC"))
@
Note that we find the same values by applying the AIC criterion. To compute the corresponding (un)constrained cubic spline frontiers, we employ the following commands
<<>>=
y.cub.air.u <- cub_spline_est(air$xtab, air$ytab, x.air, kn = kn.bic.air.u, method = "u")
y.cub.green.u <- cub_spline_est(log(green$COST), log(green$OUTPUT), x.green, kn = kn.bic.green.u, method = "u")
y.cub.air.m <- cub_spline_est(air$xtab, air$ytab, x.air, kn = kn.bic.air.m, method = "m")
y.cub.green.m <- cub_spline_est(log(green$COST), log(green$OUTPUT), x.green, kn = kn.bic.green.m, method = "m")
y.cub.air.mc <- cub_spline_est(air$xtab, air$ytab, x.air, kn = kn.bic.air.mc, method = "mc")
y.cub.green.mc <- cub_spline_est(log(green$COST), log(green$OUTPUT), x.green, kn = kn.bic.green.mc, method = "mc")
@
The resulting unconstrained and constrained cubic spline estimates are graphed in Figure~\ref{fig-splines-cub} for each dataset. The following code will generate Figure~\ref{fig-splines-cub}.
<>=
plot(y.cub.air.u ~ x.air, type = "l", col = "green", xlab = "input", ylab = "output")
lines(x.air, y.cub.air.m, lty = 2, col = "cyan")
lines(x.air, y.cub.air.mc, lty = 3, col = "magenta")
points(ytab ~ xtab, data = air)
legend("topleft", col = c("green", "cyan", "magenta"), lty = c(1, 2, 3), bty = "n", legend = c("unconstrained", "monotone", "monotone+concave"))
plot(y.cub.green.u ~ x.green, type = "l", col = "green", xlab = "log(COST)", ylab = "log(OUTPUT)")
lines(x.green, y.cub.green.m, lty = 2, col = "cyan")
lines(x.green, y.cub.green.mc, lty = 3, col = "magenta")
points(log(OUTPUT) ~ log(COST), data = green)
legend("topleft", col = c("green", "cyan", "magenta"), lty = c(1, 2, 3), bty = "n", legend = c("unconstrained", "monotone", "monotone+concave"))
@
\begin{figure}[htbp] \begin{center}
<>=
<>
op = par(mfrow = c(1,2), mar = c(3, 3.1, 2.1, 2.1), mgp = c(2, .4, 0), oma = c(0, 0, 0, 0), cex.lab = 1.3, cex.axis = 1.2)
plot(x.air, y.cub.air.u, lty = 1, lwd = 4, col = "green", type = "l", xlab = "input", ylab = "output")
lines(x.air, y.cub.air.m, lty = 2, lwd = 4, col = "cyan")
lines(x.air, y.cub.air.mc, lty = 3, lwd = 4, col = "magenta")
points(ytab ~ xtab, data = air)
legend("topleft", col = c("green", "cyan", "magenta"), bty = "n", lty = c(1, 2, 3), legend = c("unconstrained", "monotone", "monotone+concave"), lwd = 4, cex = 1.1)
plot(x.green, y.cub.green.u, lty = 1, lwd = 4, col = "green", type = "l", xlab = "log(COST)", ylab = "log(OUTPUT)")
lines(x.green, y.cub.green.m, lty = 2, lwd = 4, col = "cyan")
lines(x.green, y.cub.green.mc, lty = 3, lwd = 4, col = "magenta")
points(log(OUTPUT) ~ log(COST), data = green)
legend("topleft", col = c("green", "cyan", "magenta"), bty = "n", lty = c(1, 2, 3), legend = c("unconstrained", "monotone", "monotone+concave"), lwd = 4, cex = 1.1)
par(op)
<>
@
\end{center} \caption{The unconstrained and constrained cubic spline frontiers for the 37 European air controllers (left) and the 123 American electric utility companies (right).\label{fig-splines-cub}} \end{figure}
\subsection{Localized boundary regression}\label{local_polynomials}
This section is concerned with localizing the frontier estimation and considers local linear fitting \citep{Hall:1998, Hall:2004}, as well as local maximum and extreme-value smoothing \citep{Gijbels:2000}.
\subsubsection{Local linear fitting}\label{local_linear}
The function \code{loc\_est} computes the local linear smoothing frontier estimators of \citet*{Hall:1998} and \citet{Hall:2004}. In the unconstrained case (option \code{method = "u"}), the implemented estimator of $\varphi(x)$ is defined by
\begin{eqnarray*}
\hat \varphi_{n,LL}(x) &=& \min \Big\{ z : {\rm there~exists~} \theta ~{\rm such~that~} y_i \leq z + \theta(x_i - x) \nonumber \\
& & {\rm for~all}~i~{\rm such~that~}x_i \in (x-h,x+h) \Big\},
\end{eqnarray*}
where the bandwidth $h$ has to be fixed by the user in the 4th argument of the function. This estimator may lack smoothness in the case of small samples and has no guarantee of being monotone even if the true frontier is so. Owing to the curvature of the monotone frontier $\varphi$, the unconstrained estimator $\hat \varphi_{n,LL}$ is likely to exhibit substantial bias, especially at the sample boundaries.
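To make the above definition concrete, the following minimal sketch computes $\hat \varphi_{n,LL}$ at a single point by solving the corresponding small linear program in $(z,\theta)$ with the \pkg{Rglpk} package. This is not the routine used internally by \code{loc\_est}; the helper name \code{ll\_point}, the evaluation point and the bandwidth are chosen here purely for illustration. The constrained variant discussed next simply adds the restriction $\theta \geq 0$ to this program.
<<eval = FALSE>>=
## Illustrative sketch only: the unconstrained local linear estimate at a
## single point x0 solves  min z  subject to  z + theta * (x_i - x0) >= y_i
## for all i with |x_i - x0| < h, with z and theta unrestricted in sign.
library("Rglpk")
ll_point <- function(xtab, ytab, x0, h) {
  in.strip <- abs(xtab - x0) < h
  xi <- xtab[in.strip]
  yi <- ytab[in.strip]
  sol <- Rglpk_solve_LP(obj = c(1, 0), mat = cbind(1, xi - x0),
    dir = rep(">=", length(yi)), rhs = yi,
    bounds = list(lower = list(ind = 1:2, val = c(-Inf, -Inf))), max = FALSE)
  sol$solution[1]  # the optimal z, i.e., the estimate of phi(x0)
}
## For instance, at the median input of the air controllers with h = 2:
ll_point(air$xtab, air$ytab, x0 = median(air$xtab), h = 2)
@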
A simple way to remedy this drawback is by imposing the extra condition $\theta \geq 0$ in the definition of $\hat \varphi_{n,LL}(x)$ to get
\begin{eqnarray*}
\tilde \varphi_{n,LL}(x) &=& \min \Big\{ z : {\rm there~exists~} \theta \geq 0 ~{\rm such~that~} y_i \leq z + \theta(x_i - x) \nonumber \\
& & {\rm for~all}~i~{\rm such~that~}x_i \in (x-h,x+h) \Big\}.
\end{eqnarray*}
As shown in \citet{Daouia:2015}, this version reduces the vexing bias and border defects of the original estimator only when the true frontier is monotone. The option \code{method = "m"} indicates that the improved fit $\tilde \varphi_{n,LL}$ should be utilized in place of $\hat \varphi_{n,LL}$. \vspace{3mm} \noindent {\it Optimal bandwidth choice} \vspace{3mm} \noindent \citet{Hall:2004} proposed a bootstrap procedure for selecting the optimal bandwidth $h$ in $\hat \varphi_{n,LL}$ and $\tilde \varphi_{n,LL}$. The function \code{loc\_est\_bw} computes this optimal bootstrap bandwidth. To initiate Hall and Park's bootstrap device, one needs to set a pilot bandwidth, which seems to be quite critical to the quality of $\hat \varphi_{n,LL}$ and $\tilde \varphi_{n,LL}$. \vspace{3mm} \noindent {\it Practical guidelines} \vspace{3mm} \noindent To see how the unconstrained local linear estimate $\hat \varphi_{n,LL}$ and its improved version $\tilde \varphi_{n,LL}$ perform in the case of the \code{records}, \code{air} and \code{nuclear} data, we first compute the optimal bandwidths over 100 bootstrap replications by using, for instance, the values 2, 2 and 40 as pilot bandwidths.
<>=
h.records.u <- loc_est_bw(records$year, 1/records$result, x.records, hini = 2, B = 100, method = "u")
@
<>=
(h.records.u <- 22.5)
@
<>=
h.air.u <- loc_est_bw(air$xtab, air$ytab, x.air, hini = 2, B = 100, method = "u")
@
<>=
(h.air.u <- 2.89278)
@
<>=
h.air.m <- loc_est_bw(air$xtab, air$ytab, x.air, hini = 2, B = 100, method = "m")
@
<>=
(h.air.m <- 3.586696)
@
<>=
h.nucl.u <- loc_est_bw(nuclear$xtab, nuclear$ytab, x.nucl, hini = 40, B = 100, method = "u")
@
<>=
(h.nucl.u <- 82.32759)
@
<>=
h.nucl.m <- loc_est_bw(nuclear$xtab, nuclear$ytab, x.nucl, hini = 40, B = 100, method = "m")
@
<>=
(h.nucl.m <- 82.32759)
@
Note that the computational burden here is very demanding, so be forewarned. Now, to evaluate $\hat \varphi_{n,LL}$ and/or $\tilde \varphi_{n,LL}$, we employ the commands
<<>>=
y.records.u <- loc_est(records$year, 1/records$result, x.records, h = h.records.u, method = "u")
y.air.u <- loc_est(air$xtab, air$ytab, x.air, h = h.air.u, method = "u")
y.air.m <- loc_est(air$xtab, air$ytab, x.air, h = h.air.m, method = "m")
y.nucl.u <- loc_est(nuclear$xtab, nuclear$ytab, x.nucl, h = h.nucl.u, method = "u")
y.nucl.m <- loc_est(nuclear$xtab, nuclear$ytab, x.nucl, h = h.nucl.m, method = "m")
@
\noindent Figure~\ref{fig-ll} superimposes the obtained estimates for each dataset. For the particular datasets \code{air} and \code{nuclear}, the resulting unconstrained and improved estimates are very similar. The following code will generate Figure~\ref{fig-ll}.
<>=
plot(x.records, 1/y.records.u, type = "l", col = "magenta")
points(result ~ year, data = records)
legend("topright", legend = "unconstrained", bty = "n", col = "magenta", lty = 1)
plot(y.air.u ~ x.air, type = "l", col = "magenta")
lines(x.air, y.air.m, lty = 2, col = "cyan")
points(ytab ~ xtab, data = air)
legend("topleft", legend = c("unconstrained", "improved"), bty = "n", col = c("magenta", "cyan"), lty = c(1, 2))
plot(y.nucl.u ~ x.nucl, type = "l", col = "magenta")
lines(x.nucl, y.nucl.m, lty = 2, col = "cyan")
points(ytab ~ xtab, data = nuclear)
legend("topleft", legend = c("unconstrained", "improved"), bty = "n", col = c("magenta", "cyan"), lty = c(1, 2))
@
\begin{figure}[htbp] \begin{center}
<>=
<>
op = par(mfrow = c(1, 3), mar = c(4.5, 4.5, 2.1, 2.1), mgp = c(2.9, 1.1, 0), oma = c(0, 0, 0, 0), cex.lab = 2.3, cex.axis = 2.3)
plot(x.records, 1/y.records.u, lty = 1, lwd = 4, col = "magenta", type = "l", xlab = "year", ylab = "1500m record")
points(result ~ year, data = records)
legend("topright", legend = "unconstrained", col = "magenta", lwd = 4, lty = 1, cex = 2.2, bty = "n")
plot(x.air, y.air.u, lty = 1, lwd = 4, col = "magenta", type = "l", xlab = "input", ylab = "output")
lines(x.air, y.air.m, lty = 2, lwd = 4, col = "cyan")
points(ytab ~ xtab, data = air)
legend("topleft", legend = c("unconstrained", "improved"), bty = "n", col = c("magenta", "cyan"), lwd = 4, lty = c(1, 2), cex = 2.2)
plot(x.nucl, y.nucl.u, lty = 1, lwd = 4, col = "magenta", type = "l", ylim = range(nuclear$ytab), xlab = "temperature", ylab = "toughness")
lines(x.nucl, y.nucl.m, lty = 2, lwd = 4, col = "cyan")
points(ytab ~ xtab, data = nuclear)
legend("topleft", legend = c("unconstrained", "improved"), bty = "n", col = c("magenta", "cyan"), lwd = 4, lty = c(1, 2), cex = 2.2)
par(op)
<>
@
\end{center} \caption{Local linear frontier estimates $\hat \varphi_{n,LL}$ and $\tilde \varphi_{n,LL}$ for the 46 annual sport records (left), the 37 European air controllers (middle) and the 254 nuclear reactors (right).\label{fig-ll}} \end{figure}
\subsubsection[Local maximum estimation]{Local maximum estimation \label{local-max}}
The function \code{loc\_max} implements the local maximum estimates of $\varphi(x)$ proposed by \citet{Gijbels:2000}: a local constant estimator at first (option \code{type = "one-stage"}) and subsequently a local DEA estimator (option \code{type = "two-stage"}). \vspace{2mm} \noindent The methodology of Gijbels and Peng consists of considering a strip around $x$ of width $2h$, where $h=h_n\to 0$ with $nh_n\to\infty$ as $n\to\infty$, and then focusing on the $y_i$ values of observations falling into this strip. More precisely, they consider the transformed variables $z^{xh}_i = y_i\indic_{\{|x_i-x|\le h\}}$, $i=1,\ldots,n$, and the corresponding order statistics $z^{xh}_{(1)}\le\ldots\le z^{xh}_{(n)}$. The simple maximum $z^{xh}_{(n)}=\max_{i=1,\ldots,n}z^{xh}_i$ then defines the local constant estimator (option \code{type = "one-stage"}) of the frontier point $\varphi(x)$. This opens a way to a two-stage estimation procedure as follows. In a first stage, Gijbels and Peng calculate the maximum $z^{xh}_{(n)}$. Then, they suggest replacing each observation $y_i$ in the strip of width $2h$ around $x$ by this maximum, leaving all observations outside the strip unchanged. More specifically, they define
\begin{equation*} \tilde{y}_i= \left \{ \renewcommand{\arraystretch}{1} \begin{array}{lll} y_i & \quad \mbox{if} & |x_i-x| > h \\ z^{xh}_{(n)} & \quad \mbox{if} & |x_i-x| \leq h.
\end{array}
\right .
\end{equation*}
Then, they apply the DEA estimator (see the function \code{dea\_est}) to these transformed data $(x_i,\tilde{y}_i)$, giving the local DEA estimator (option \code{type = "two-stage"}).
\vspace{2mm}
\noindent The bandwidth $h$ has to be fixed by the user in the 4th argument of the function. By way of example, in the case of the \code{green} data, the value $h=0.5$ allows us to reproduce in Figure~\ref{fig-loc-max} (left) the estimates obtained by \citet{Gijbels:2000}.
<>=
loc_max_1stage <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
  h = 0.5, type = "one-stage")
loc_max_2stage <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
  h = 0.5, type = "two-stage")
@
\vspace{3mm}
\noindent {\it A data-driven rule for selecting $h$}
\vspace{3mm}
\noindent Note that the frontier point $\varphi(x)$ is identical to the right-endpoint of the cumulative distribution function $F(\cdot|x)$ of $Y$ given $X=x$, and that the local constant estimate $z^{xh}_{(n)}$ coincides with the right-endpoint of the kernel estimator
$$F_n(y|x)=\sum_{i=1}^n K(\frac{x-x_i}{h})\indic_{(y_i\leq y)}/\sum_{i=1}^n K(\frac{x-x_i}{h}),$$
with $K(\cdot)$ being the uniform kernel. When the interest is in the estimation of the conditional distribution function, one way to select the bandwidth $h$ is by making use of the following commands
<>=
require("np")
bw <- npcdistbw(log(OUTPUT) ~ log(COST), data = green,
  cykertype = "uniform", bwtype = "fixed")$xbw
(h.opt <- max(bw, max(diff(sort(log(green$COST))))/2))
@
<>=
(h.opt = 0.4152283)
@
The first command returns the bandwidth \code{bw} computed via the least squares cross-validation method \citep*[see][for details]{Li:2013}. As the resulting bandwidth can be smaller than half the maximum spacing due to sparsity in the data, the second command selects the maximum of the two values. One may then use this value to compute the estimates of the conditional endpoint $\varphi(x)$ itself. This is an {\it ad hoc} choice, but it works quite well. It might be viewed as an exploratory tool, rather than as a method for final analysis. The corresponding local maximum frontier estimates are graphed in Figure~\ref{fig-loc-max} (right).
<<>>=
loc_max_1stage.opt <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
  h = h.opt, type = "one-stage")
loc_max_2stage.opt <- loc_max(log(green$COST), log(green$OUTPUT), x.green,
  h = h.opt, type = "two-stage")
@
The following code will generate Figure~\ref{fig-loc-max}.
<>=
plot(log(OUTPUT) ~ log(COST), data = green)
lines(x.green, loc_max_1stage, lty = 1, col = "magenta")
lines(x.green, loc_max_2stage, lty = 2, col = "cyan")
legend("topleft", legend = c("one-stage", "two-stage"), bty = "n",
  col = c("magenta", "cyan"), lty = c(1, 2))
plot(log(OUTPUT) ~ log(COST), data = green)
lines(x.green, loc_max_1stage.opt, lty = 1, col = "magenta")
lines(x.green, loc_max_2stage.opt, lty = 2, col = "cyan")
legend("topleft", legend = c("one-stage", "two-stage"), bty = "n",
  col = c("magenta", "cyan"), lty = c(1, 2))
@
\begin{figure}[htbp]
\begin{center}
<>=
<>
op = par(mfrow = c(1, 2), mar = c(3, 3.1, 2.1, 2.1), mgp = c(2, .4, 0),
  oma = c(0, 0, 0, 0), cex.lab = 1.2, cex.axis = 1.2)
plot(log(OUTPUT) ~ log(COST), data = green, main = "Peng and Gijbels choice")
lines(x.green, loc_max_1stage, lty = 1, lwd = 2, col = "magenta")
lines(x.green, loc_max_2stage, lty = 2, lwd = 2, col = "cyan")
legend("topleft", bty = "n", legend = c("one-stage", "two-stage"),
  col = c("magenta", "cyan"), lty = c(1, 2), lwd = 2, cex = 1.2)
plot(log(OUTPUT) ~ log(COST), data = green, main = "Automatic selection")
lines(x.green, loc_max_1stage.opt, lty = 1, lwd = 2, col = "magenta")
lines(x.green, loc_max_2stage.opt, lty = 2, lwd = 2, col = "cyan")
legend("topleft", legend = c("one-stage", "two-stage"), bty = "n",
  col = c("magenta", "cyan"), lty = c(1, 2), lwd = 2, cex = 1.2)
par(op)
<>
@
\end{center}
\caption{Local maximum frontier estimates for the 123 American electric utility companies with $h=0.5$ (left) and \code{h.opt} $= 0.4152283$ (right).\label{fig-loc-max}}
\end{figure}
\subsubsection[Local extreme-value estimation]{Local extreme-value estimation \label{local-ev}}
The function \code{pick\_est} computes the local Pickands type of estimator introduced by \citet{Gijbels:2000}. The implemented estimator of $\varphi(x)$, obtained by applying the well-known extreme value approach of \citet{Dekkers:1989} in conjunction with the transformed sample $(z^{xh}_1,\ldots,z^{xh}_n)$ described above in Section~\ref{local-max}, is defined as:
\begin{equation*}\label{pickands}
\tilde\varphi_{pick}(x):= z^{xh}_{(n-k)} + \left(z^{xh}_{(n-k)}-z^{xh}_{(n-2k)}\right) \left\{2^{-\log\frac{z^{xh}_{(n-k)}-z^{xh}_{(n-2k)}}{z^{xh}_{(n-2k)}-z^{xh}_{(n-4k)}}/\log 2}-1\right\}^{-1}.
\end{equation*}
It is based on three upper order statistics $z^{xh}_{(n-k)}$, $z^{xh}_{(n-2k)}$, $z^{xh}_{(n-4k)}$, and depends on the bandwidth $h$ as well as an intermediate sequence $k=k(n)\to\infty$ with $k/n\to 0$ as $n\to\infty$. The two smoothing parameters $h$ and $k$ have to be fixed by the user in the 4th and 5th arguments of the function. Also, as for the two-stage local frontier estimator presented above, writing
\begin{equation*}
\tilde{y}_i= \left \{
\renewcommand{\arraystretch}{1}
\begin{array}{lll}
y_i & \quad \mbox{if} & |x_i-x| > h \\
\tilde\varphi_{pick}(x) & \quad \mbox{if} & |x_i-x| \leq h,
\end{array}
\right .
\end{equation*}
one can then apply the DEA estimator to these transformed data $(x_i,\tilde{y}_i)$, thus giving the local DEA estimator (option \code{type = "two-stage"}).
\vspace{2mm}
\noindent Regarding the choice of the smoothing parameters, it should be clear that any automatic data-driven method has to select $h$ and $k$ simultaneously, which is a daunting problem. Undoubtedly, further work on how to select appropriate values of $h$ and $k$ will yield new refinements.
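In the meantime, the following minimal sketch (not evaluated here) illustrates a call to \code{pick\_est} on the \code{green} data; the values chosen for $h$ and $k$ are purely illustrative and have to be tuned by the user.
<<eval = FALSE>>=
## Illustrative smoothing parameters only; h and k are not data-driven choices
pick_1stage <- pick_est(log(green$COST), log(green$OUTPUT), x.green,
  h = 0.5, k = 3, type = "one-stage")
pick_2stage <- pick_est(log(green$COST), log(green$OUTPUT), x.green,
  h = 0.5, k = 3, type = "two-stage")
@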
\subsection{Kernel smoothing}\label{kern_smooth}
Recently, kernel smoothing methods have been developed for estimating smooth frontier functions. The function \code{kern\_smooth} implements two up-to-date approaches in this direction.%: \citet{Parmeter:2013} and \citet{Noh:2014}.
\subsubsection[Parmeter and Racine's estimator]{Parmeter and Racine's estimator}
The function \code{kern\_smooth} computes \citet{Parmeter:2013}'s estimator (option \code{technique = "pr"}) without constraints (option \code{method = "u"}), and with the monotonicity constraint (option \code{method = "m"}) as well as the monotone concavity constraint (option \code{method = "mc"}).
\vspace{3mm}
\noindent {\it Definition of the estimator}
\vspace{3mm}
\noindent To estimate the frontier function, \citet{Parmeter:2013} considered the following generalization of linear regression smoothers $\hat\varphi(x|p) = \sum_{i=1}^n p_i A_i(x)y_i$, where $A_i(x)$ is the kernel weight assigned to the $i$-th observation at the point $x$, depending on the $x_i$'s and on the type of linear smoother. For example, the Nadaraya-Watson kernel weights are $A_i(x) = K_i(x)/(\sum_{j=1}^n K_j(x))$, where $K_i(x) = h^{-1} K\{ (x-x_i)/h\}$, with the kernel function $K$ being a bounded and symmetric probability density, and $h$ being a bandwidth. Then, the weight vector $p=(p_1,\ldots,p_n)^\top$ is chosen to minimize the distance $D(p)=(p-p_u)^\top(p-p_u)$ subject to the envelopment constraints and the chosen shape constraints, where $p_u$ is an $n$-dimensional vector with all elements being one. The envelopment and shape constraints are
\begin{eqnarray*}
\hat \varphi(x_i|p) - y_i &=& \sum_{j=1}^n p_j A_j(x_i)y_j - y_i \geq 0,~i=1,\ldots,n;~~~{\rm (envelopment~constraints)} \label{env} \\
\hat \varphi^{(1)}(x|p) &=& \sum_{i=1}^n p_i A_i^{(1)}(x)y_i \geq 0,~x \in \mathcal{M};~~~{\rm (monotonicity~constraints)} \label{mono} \\
\hat \varphi^{(2)}(x|p) &=& \sum_{i=1}^n p_i A_i^{(2)}(x)y_i \leq 0,~x \in \mathcal{C},~~~{\rm (concavity~constraints)} \label{con}
\end{eqnarray*}
where $\hat \varphi^{(s)}(x|p) = \sum_{i=1}^n p_i A_i^{(s)}(x) y_i$ is the $s$-th derivative of $\hat \varphi(x|p)$, with $\mathcal{M}$ and $\mathcal{C}$ being the collections of points where monotonicity and concavity are imposed, respectively. In our implementation of the estimator, we simply take the entire dataset $\{(x_i,y_i),~i=1,\ldots,n\}$ to be $\mathcal{M}$ and $\mathcal{C}$ and, in case of small samples, we augment the sample points by an equispaced grid of length 201 over the observed support $[\min_i x_i,\max_i x_i]$ of $X$. For the weight $A_i(x)$, we use the Nadaraya-Watson weights.
\vspace{3mm}
\noindent {\it Optimal bandwidth}
\vspace{3mm}
\noindent As with other kernel smoothing estimators, bandwidth selection is crucial to the good performance of the frontier estimator. \citet{Parmeter:2013}'s recommendation is to adapt the optimal bandwidth for mean regression curve estimation chosen by least squares cross-validation to the boundary regression context. This is implemented with \code{bw\_method = "cv"} in the function \code{kern\_smooth\_bw}. We also refer to existing functions from the \pkg{np} \citep{Hayfield:2008} and \pkg{quadprog} \citep{quadprog} packages that can be found at \url{http://socserv.mcmaster.ca/racinej/Gallery/Home.html}.
\subsubsection[Noh's estimator]{Noh's estimator}
\citet{Noh:2014} considered the same generalization of linear smoothers $\hat \varphi(x|p)$ for frontier estimation, but with a different method for choosing the weight $p$.
This is implemented in the function \code{kern\_smooth} with option \code{technique = "noh"}.
\vspace{3mm}
\noindent {\it Definition of the estimator}
\vspace{3mm}
In contrast with \citet{Parmeter:2013}, and subject to the same envelopment and shape constraints, the weight vector $p$ is chosen to minimize the area under the estimator $\hat \varphi(x|p)$, that is $A(p) = \int_a^b\hat \varphi(x|p) dx = \sum_{i=1}^n p_i y_i \left( \int_a^b A_i(x) dx \right)$, where $[a,b]$ is the true support of $X$. In practice, we integrate over the observed support $[\min_i x_i,\max_i x_i]$ since the theoretical one is unknown. As for the kernel weights $A_i(x)$, we use the Priestley-Chao weights
$$ A_i(x) = \left\{ \begin{array}{cc} 0 &,~i=1 \\ (x_i - x_{i-1}) K_i(x) &,~i \neq 1 \end{array} \right., $$
where it is assumed that the pairs $(x_i,y_i)$ have been ordered so that $x_1 \leq \ldots \leq x_n$. The choice of such weights is motivated by their convenience for the evaluation of the integral $\int A_i(x) dx$.
\vspace{3mm}
\noindent {\it Optimal bandwidth}
\vspace{3mm}
Following \citet{Parmeter:2013}'s recommendation, we may use the bandwidth resulting from cross-validation for \citet{Noh:2014}'s estimator. Another option proposed by \citet{Noh:2014} is to select the bandwidth which minimizes a BIC-type criterion developed for frontier estimation. The criterion is the following:
\begin{eqnarray*}
BIC(h) = \log \left( \sum_{i=1}^n (\hat \varphi(x_i|\hat p(h))-y_i)\right)+\frac {\log n \cdot tr(S(h))}{2n},
\end{eqnarray*}
where $\hat p(h)$ is the chosen weight vector given the bandwidth $h$, and $tr(S(h))$ is the trace of the smoothing matrix
$$ S(h) = \left( \begin{array}{ccc} A_1(x_1) & \ldots & A_n(x_1) \\ \vdots & \ddots& \vdots \\ A_1(x_n) & \ldots & A_n(x_n) \end{array} \right). $$
We refer to \citet{Noh:2014} for a thorough discussion of the rationale for this BIC-type criterion. The function \code{kern\_smooth\_bw} computes the optimal bandwidth from this criterion with option \code{bw\_method = "bic"}.
\subsubsection[Comparison between the two estimators]{Comparison between the two estimators}
To illustrate the use of \code{kern\_smooth} and compare the two estimators, we consider the \code{green} data and compute each estimator under the monotonicity constraint (option \code{method = "m"}). First, using the function \code{kern\_smooth\_bw}, we compute the optimal bandwidth for each estimator.
<>=
require("npbr")
require("np")
data("green")
@
<>=
require("np")
(h.pr.green.m <- kern_smooth_bw(log(green$COST), log(green$OUTPUT),
  method = "m", technique = "pr", bw_method = "cv"))
@
<>=
(h.pr.green.m <- 0.8304566)
@
<>=
(h.noh.green.m <- kern_smooth_bw(log(green$COST), log(green$OUTPUT),
  method = "m", technique = "noh", bw_method = "bic"))
@
To compute the estimates with the chosen bandwidths under the monotonicity constraint, we employ the following commands:
<<>>=
y.pr.green.m <- kern_smooth(log(green$COST), log(green$OUTPUT), x.green,
  h = h.pr.green.m, method = "m", technique = "pr")
y.noh.green.m <- kern_smooth(log(green$COST), log(green$OUTPUT), x.green,
  h = h.noh.green.m, method = "m", technique = "noh")
@
The resulting two constrained estimates are graphed in Figure~\ref{fig-kern-two} from the following commands:
<>=
plot(log(OUTPUT) ~ log(COST), data = green, xlab = "log(COST)",
  ylab = "log(OUTPUT)")
lines(x.green, y.pr.green.m, lty = 2, col = "blue")
lines(x.green, y.noh.green.m, lty = 3, col = "red")
legend("topleft", bty = "n", legend = c("noh", "pr"), col = c("red", "blue"),
  lty = c(3,2))
@
\begin{figure}[htbp]
\begin{center}
<>=
<>
plot(log(OUTPUT) ~ log(COST), data = green, xlab = "log(COST)",
  ylab = "log(OUTPUT)", cex.lab = 1.2, cex.axis = 1.2)
lines(x.green, y.pr.green.m, lwd = 4, lty = 2, col = "blue")
lines(x.green, y.noh.green.m, lwd = 4, lty = 3, col = "red")
legend("topleft", col = c("red", "blue"), bty = "n", lty = c(3,2),
  legend = c("noh", "pr"), lwd = 4, cex = 1.3)
<>
@
\end{center}
\caption{The two kernel smoothing frontier estimators for 123 American electric utility companies.\label{fig-kern-two}}
\end{figure}
\subsection[Robust regularization approaches]{Robust regularization approaches \label{robustness}}
In applied settings where outlying observations are omnipresent, as is the case for instance in production data, it is prudent to seek a ``robustification'' strategy. To achieve this objective, we propose in this section three extreme-value based regularization methods \citep*{Daouia:2010,Daouia:2012}. All of these methods are based on the assumption that the frontier function $\varphi$ is monotone nondecreasing.
\subsubsection[Moment frontier estimator]{Moment frontier estimator}
The function \code{dfs\_momt} is an implementation of the moment-type estimator and the corresponding confidence interval developed by \citet{Daouia:2010} under the monotonicity constraint. Combining the ideas from \citet*{Dekkers:1989} with the dimensionless transformation $\{z^{x}_i := y_i\indic_{\{x_i\le x\}}, \,i=1,\ldots,n\}$ of the observed sample $\{(x_i,y_i), \,i=1,\ldots,n\}$, they estimate the conditional endpoint $\varphi(x)$ by
$$ \tilde\varphi_{momt}(x) = z^{x}_{(n-k)} + z^{x}_{(n-k)} M^{(1)}_n \left\{1 + \rho_x \right\} $$
where $M^{(1)}_n = (1/k)\sum_{i=0}^{k-1}\left(\log z^x_{(n-i)}- \log z^x_{(n-k)}\right)$, $z^{x}_{(1)}\leq \ldots\leq z^{x}_{(n)}$ are the ascending order statistics of the transformed sample $\{z^{x}_i, \,i=1,\ldots,n\}$, and $\rho_x>0$ is referred to as the extreme-value index and has the following interpretation: when $\rho_x > 2$, the joint density of the data decays smoothly to zero at a speed of power $\rho_x -2$ of the distance from the frontier; when $\rho_x = 2$, the density has sudden jumps at the frontier; when $\rho_x < 2$, the density increases toward infinity at a speed of power $\rho_x -2$ of the distance from the frontier.
As a matter of fact, we have $\rho_x = \beta_x + 2$, where $\beta_x$ is the shape parameter of the joint density introduced in Section \ref{intro}. Most of the contributions to the econometric literature on frontier analysis assume that the joint density is strictly positive at its support boundary, or equivalently, $\rho_x=2$ for all $x$.
\vspace{3mm}
\noindent {\it Estimation strategy when $\rho_x$ is unknown}
\vspace{3mm}
\noindent In this case, \citet{Daouia:2010} suggest using the following two-step estimator: First, estimate $\rho_x$ by the moment estimator $\tilde{\rho}_x$ implemented in the function \code{rho\_momt\_pick} by utilizing the option \code{method = "moment"}, or by the Pickands estimator $\hat{\rho}_x$ by using the option \code{method = "pickands"} (see the paragraph {\bf Moment and Pickands estimates of the tail-index $\rho_x$} below for a detailed description of the function \code{rho\_momt\_pick}). Second, use the estimator $\tilde\varphi_{momt}(x)$, as if $\rho_x$ were known, by substituting the estimated value $\tilde{\rho}_x$ or $\hat{\rho}_x$ in place of $\rho_x$.
\vspace{3mm}
\noindent {\it Confidence interval}
\vspace{3mm}
\noindent The $95\%$ confidence interval of $\varphi(x)$ derived from the asymptotic normality of $\tilde\varphi_{momt}(x)$ is given by
$$ [\tilde\varphi_{momt}(x) \pm 1.96 \sqrt{V(\rho_x) / k} z^{x}_{(n-k)} M^{(1)}_n (1 + 1/\rho_x) ], $$
where $ V(\rho_x) = \rho^2_x (1+2/\rho_x)^{-1}. $
\vspace{3mm}
\noindent {\it Selection of the sequence $k$}
\vspace{3mm}
\noindent The number $k=k_n(x)$ plays here the role of the smoothing parameter and varies between 1 and $N_x-1$, with $N_x=\sum_{i=1}^n\indic_{\{x_i\le x\}}$ being the number of observations $(x_i,y_i)$ such that $x_i \leq x$. The question of selecting the optimal value of $k_n(x)$ remains an open issue. \citet{Daouia:2010} have only suggested an empirical rule implemented in the function \code{kopt\_momt\_pick} (option \code{method = "moment"}) that turns out to give reasonable values of the sequence $k_n(x)$ for estimating the frontier $\varphi(x)$ [see the paragraph {\bf Threshold selection for moment and Pickands frontiers} below for a detailed description of the function \code{kopt\_momt\_pick}]. However, as is common in extreme-value theory, good results require a large sample size $N_x$ of the order of several hundred. If the resulting pointwise frontier estimates and confidence intervals exhibit severe instabilities, the user should call the function \code{kopt\_momt\_pick} by tuning the parameter \code{wind.coef} in the interval $(0,1]$ until more stable curves are obtained (default option \code{wind.coef = 0.1}). See \code{help(kopt\_momt\_pick)} for further details.
\vspace{3mm}
\noindent {\it Practical guidelines}
\vspace{3mm}
\noindent For illustration purposes, using the large dataset \code{post}, we consider the following three possible scenarios: either $\rho_x$ is known (typically equal to $2$ if the assumption of a jump at the frontier is reasonable), or $\rho_x$ is unknown and estimated by the moment estimator $\tilde{\rho}_x$, or $\rho_x$ is unknown but independent of $x$ and estimated by the (trimmed) mean of $\tilde{\rho}_x$. First, we select the points at which we want to evaluate the frontier estimator.
<<>>= x.post <- seq(post$xinput[100], max(post$xinput), length.out = 100) @ In the case where the extreme-value index $\rho_x$ is known and equal to $2$, we set <<>>= rho <- 2 @ Then, we determine the sequence $k=k_n(x)$ in $\tilde\varphi_{momt}(x)$. <<>>= best_kn.1 <- kopt_momt_pick(post$xinput, post$yprod, x.post, rho = rho) @ When $\rho_x$ is unknown and dependent of $x$, its estimate $\tilde{\rho}_x$ is computed via the command <>= rho_momt <- rho_momt_pick(post$xinput, post$yprod, x.post, method = "moment") @ To determine the number $k$ in the two-stage estimator $\tilde\varphi_{momt}(x)$, we use <>= best_kn.2 <- kopt_momt_pick(post$xinput, post$yprod, x.post, rho = rho_momt) @ <>= rho_momt<-c(1.993711,2.360920,2.245450,3.770526,2.724960,3.667846,4.026203, 2.281109,1.363260,1.150343,2.567832,2.228400,3.106491,2.592477, 2.233479,2.040209,1.916878,1.494831,1.961430,1.930942,1.927990, 1.833530,1.808632,1.758135,1.717626,1.686540,1.707200,1.711357, 1.720839,1.704845,1.678985,1.686872,1.686907,1.747732,1.741290, 1.792388,1.805144,1.855829,1.919817,1.929348,2.046588,2.135351, 2.196834,2.224797,2.221043,2.290578,2.390179,2.042884,2.087287, 2.158198,2.173314,2.260872,2.311427,1.865147,1.874019,1.913673, 1.922869,1.918484,1.949220,1.961016,1.998101,2.023605,2.041663, 2.067775,2.088982,2.107949,2.152688,2.170959,1.283350,1.285458, 1.295437,1.296902,1.316896,1.331668,1.330163,1.339701,1.322501, 1.326488,1.373837,1.392537,1.419458,1.426513,1.448544,1.473716, 1.517720,1.549229,1.561259,1.567216,1.580512,1.647293,1.672556, 1.750994,1.743083,1.801643,1.823678,1.869798,1.906898,1.873269, 1.893699,1.916469) best_kn.2 <- kopt_momt_pick(post$xinput, post$yprod, x.post, rho = rho_momt) @ Here, for the \code{post} data, we used the default value \code{wind.coef = 0.1} in the function \code{kopt\_momt\_pick} to avoid numerical instabilities. When employing another large dataset, the user should tune this coefficient until the resulting pointwise frontier estimates and confidence intervals exhibit stable curves (see the function \code{kopt\_momt\_pick} for details). \noindent When $\rho_x$ is unknown but independent of $x$, which is a more realistic setting in practice, a robust estimation strategy is obtained by using the (trimmed) mean over the moment estimates $\tilde{\rho}_x$. <<>>= rho_trimmean <- mean(rho_momt, trim = 0.05) best_kn.3 <- kopt_momt_pick(post$xinput, post$yprod, x.post, rho = rho_trimmean) @ Finally, we compute the frontier estimates and confidence intervals as follows: <<>>= res.momt.1 = dfs_momt(post$xinput, post$yprod, x.post, rho = rho, k = best_kn.1) res.momt.2 = dfs_momt(post$xinput, post$yprod, x.post, rho = rho_momt, k = best_kn.2) res.momt.3 = dfs_momt(post$xinput, post$yprod, x.post, rho = rho_trimmean, k = best_kn.3) @ The following code can be used to construct the resulting moment frontier plots graphed in Figure~\ref{fig-dfs}. 
<>= my_samp <- post[sample(1:nrow(post), 1000), ] plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.momt.1[,1], lty = 1, col = "cyan") lines(x.post, res.momt.1[,2], lty = 3, col = "magenta") lines(x.post, res.momt.1[,3], lty = 3, col = "magenta") plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.momt.2[,1], lty = 1, col = "cyan") lines(x.post, res.momt.2[,2], lty = 3, col = "magenta") lines(x.post, res.momt.2[,3], lty = 3, col = "magenta") plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.momt.3[,1], lty = 1, col = "cyan") lines(x.post, res.momt.3[,2], lty = 3, col = "magenta") lines(x.post, res.momt.3[,3], lty = 3, col = "magenta") @ \begin{figure}[htbp] \begin{center} <>= <> my_samp <- post[sample(1:nrow(post), 1000), ] op = par(mfrow = c(1, 3), mar = c(4.5, 4.5, 2.1, 2.1), mgp = c(2.9, 1.1, 0), oma = c(0, 0, 0, 0), cex.lab = 2.3, cex.axis = 2.3) plot(yprod ~ xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.momt.1[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.momt.1[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.momt.1[,3], lty = 3, lwd = 4, col = "magenta") plot(yprod ~ xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.momt.2[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.momt.2[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.momt.2[,3], lty = 3, lwd = 4, col = "magenta") plot(yprod ~ xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.momt.3[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.momt.3[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.momt.3[,3], lty = 3, lwd = 4, col = "magenta") par(op) <> @ \end{center} \caption{Resulting moment estimator $\tilde\varphi_{momt}$ and $95\%$ confidence bands of $\varphi$ for the 4000 European post offices. From left to right, we have the case $\rho_x=2$, plugging $\tilde{\rho}_x$ and plugging the mean of $\tilde{\rho}_x$.\label{fig-dfs}} \end{figure} \subsubsection[Pickands frontier estimator]{Pickands frontier estimator \label{pickds}} The function \code{dfs\_pick} computes the Pickands type of estimator and its associated confidence interval introduced by \citet{Daouia:2010} under the monotonicity constraint. \vspace{2mm} \noindent Built on the ideas of \cite{Dekkers:1989b}, \citet{Daouia:2010} proposed to estimate the frontier point $\varphi(x)$ by $$ \hat\varphi_{pick}(x) = \frac{z^x_{(n-k+1)} - z^x_{(n-2k+1)}} {2^{1/\rho_x} - 1} + z^x_{(n-k+1)} $$ from the transformed data $\{z^{x}_i := y_i\indic_{\{x_i\le x\}}, \,i=1,\ldots,n\}$, where $\rho_x>0$ is the same tail-index as in \code{dfs\_momt}. \vspace{2mm} \noindent If $\rho_x$ is known (typically equal to $2$ if the joint density of data is believed to have sudden jumps at the frontier), then one can use the estimator $\hat\varphi_{pick}(x)$ in conjunction with the data driven method for selecting the threshold $k$ as described below. 
\vspace{2mm}
\noindent In contrast, if $\rho_x$ is unknown, one could consider using the following two-step estimator: First, estimate $\rho_x$ by the Pickands estimator $\hat{\rho}_x$ implemented in the function \code{rho\_momt\_pick} by using the option \code{method = "pickands"}, or by the moment estimator $\tilde{\rho}_x$ by utilizing the option \code{method = "moment"} [a detailed description of the function \code{rho\_momt\_pick} is provided below in a separate paragraph]. Second, use the estimator $\hat\varphi_{pick}(x)$, as if $\rho_x$ were known, by substituting the estimated value $\hat{\rho}_x$ or $\tilde{\rho}_x$ in place of $\rho_x$.
\vspace{2mm}
\noindent The pointwise $95\%$ confidence interval of the frontier function obtained from the asymptotic normality of $\hat\varphi_{pick}(x)$ is given by
$$ [\hat\varphi_{pick}(x) \pm 1.96 \sqrt{v(\rho_x) / (2 k)} ( z^x_{(n-k+1)} - z^x_{(n-2k+1)})] $$
where $ v(\rho_x) =\rho^{-2}_x 2^{-2/\rho_x}/(2^{-1/\rho_x} -1)^4. $
\vspace{2mm}
\noindent Finally, to select the threshold $k=k_n(x)$, one could use the automatic data-driven method of \citet{Daouia:2010} implemented in the function \code{kopt\_momt\_pick} (option \code{method = "pickands"}) as described below in the last paragraph.
\vspace{3mm}
\noindent {\it Practical guidelines}
\vspace{3mm}
\noindent For illustration purposes, we used again the large dataset \code{post} and considered the following three scenarios: either $\rho_x$ is known (typically equal to $2$ if the joint density has sudden jumps at the frontier), $\rho_x$ is unknown and estimated by the Pickands estimator $\hat{\rho}_x$, or $\rho_x$ is unknown but independent of $x$ and estimated by the (trimmed) mean of $\hat{\rho}_x$. When $\rho_x$ is known and equal to $2$, we set
<<>>=
rho <- 2
@
Then, we determine the sequence $k=k_n(x)$ in $\hat\varphi_{pick}(x)$.
<<>>=
best_kn.1 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
  method = "pickands", rho = rho)
@
To estimate $\rho_x$ by $\hat{\rho}_x$, we use the command
<<>>=
rho_pick <- rho_momt_pick(post$xinput, post$yprod, x.post,
  method = "pickands")
@
Then, we compute the number $k=k_n(x)$ in the two-stage estimator $\hat\varphi_{pick}(x)$ as follows:
<<>>=
best_kn.2 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
  method = "pickands", rho = rho_pick)
@
When $\rho_x$ is unknown but independent of $x$, a robust estimation strategy is obtained by using the (trimmed) mean of the Pickands estimates $\hat{\rho}_x$.
<<>>=
rho_trimmean <- mean(rho_pick, trim = 0.05)
best_kn.3 <- kopt_momt_pick(post$xinput, post$yprod, x.post,
  rho = rho_trimmean, method = "pickands")
@
Finally, the specifications to calculate the frontier estimates and confidence intervals are given by
<<>>=
res.pick.1 <- dfs_pick(post$xinput, post$yprod, x.post, rho = rho,
  k = best_kn.1)
res.pick.2 <- dfs_pick(post$xinput, post$yprod, x.post, rho = rho_pick,
  k = best_kn.2)
res.pick.3 <- dfs_pick(post$xinput, post$yprod, x.post, rho = rho_trimmean,
  k = best_kn.3)
@
The obtained Pickands frontiers are graphed in Figure~\ref{fig-pickands}. The following code will generate Figure~\ref{fig-pickands}.
<>= plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pick.1[,1], lty = 1, col = "cyan") lines(x.post, res.pick.1[,2], lty = 3, col = "magenta") lines(x.post, res.pick.1[,3], lty = 3, col = "magenta") plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pick.2[,1], lty = 1, col = "cyan") lines(x.post, res.pick.2[,2], lty = 3, col = "magenta") lines(x.post, res.pick.2[,3], lty = 3, col = "magenta") plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pick.3[,1], lty = 1, col = "cyan") lines(x.post, res.pick.3[,2], lty = 3, col = "magenta") lines(x.post, res.pick.3[,3], lty = 3, col = "magenta") @ \begin{figure}[htbp] \begin{center} <>= <> op = par(mfrow = c(1, 3), mar = c(4.5, 4.5, 2.1, 2.1), mgp = c(2.9, 1.1, 0), oma = c(0, 0, 0, 0), cex.lab = 2.3, cex.axis = 2.3) plot(yprod ~ xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pick.1[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.pick.1[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.pick.1[,3], lty = 3, lwd = 4, col = "magenta") plot(yprod~xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pick.2[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.pick.2[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.pick.2[,3], lty = 3, lwd = 4, col = "magenta") plot(yprod ~ xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pick.3[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.pick.3[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.pick.3[,3], lty = 3, lwd = 4, col = "magenta") par(op) <> @ \end{center} \caption{Resulting Pickands estimator $\hat\varphi_{pick}$ and $95\%$ confidence interval of $\varphi$ for the 4000 European post offices. From left to right, we have the case $\rho_x=2$, plugging $\hat{\rho}_x$, and plugging the mean of $\hat{\rho}_x$. \label{fig-pickands}} \end{figure} \vspace{3mm} \noindent {\it Moment and Pickands estimates of the tail-index $\rho_x$} \vspace{3mm} \noindent The function \code{rho\_momt\_pick} computes the moment and Pickands estimates of the extreme-value index $\rho_x$ involved in the frontier estimators $\tilde\varphi_{momt}(x)$ [see \code{dfs\_momt}] and $\hat\varphi_{pick}(x)$ [see \code{dfs\_pick}]. \vspace{2mm} \noindent For the case where \code{method = "moment"}, the estimator of $\rho_x$ defined as $$ \tilde{\rho}_x = -\left(M^{(1)}_n + 1 -\frac{1}{2}\left[1-(M^{(1)}_n)^2/M^{(2)}_n\right]^{-1}\right)^{-1} $$ is based on the moments $M^{(j)}_n = (1/k)\sum_{i=0}^{k-1}\left(\log z^x_{(n-i)}- \log z^x_{(n-k)}\right)^j$ for $j=1,2$, with $ z^{x}_{(1)}\leq \ldots\leq z^{x}_{(n)}$ being the ascending order statistics which correspond to the transformed sample $\{z^{x}_i := y_i\indic_{\{x_i\le x\}}, \,i=1,\ldots,n\}$. See the note in \code{help(rho\_momt\_pick)} for further details. \vspace{2mm} \noindent In the case where \code{method = "pickands"}, the estimator of $\rho_x$ is given by $$ \hat{\rho}_x = - \log 2/\log\{(z^x_{(n-k+1)} - z^x_{(n-2k+1)})/(z^x_{(n-2k+1)} - z^x_{(n-4k+1)})\}. 
$$
\noindent To select the threshold $k=k_n(x)$ in $\tilde{\rho}_x$ and $\hat{\rho}_x$, \citet{Daouia:2010} have suggested using the following data-driven method for each $x$: They first select a grid of values for $k_n(x)$. For the Pickands estimator $\hat{\rho}_x$, they choose $k_n(x) = [N_x /4] - k + 1$, where $k$ is an integer varying between $1$ and the integer part $[N_x/4]$ of $N_x/4$, with $N_x=\sum_{i=1}^n\indic_{\{x_i\le x\}}$. For the moment estimator $\tilde{\rho}_x$, they choose $k_n(x) = N_x - k$, where $k$ is an integer varying between $1$ and $N_x -1$. Then, they evaluate the estimator $\hat{\rho}_x(k)$ (respectively, $\tilde{\rho}_x(k)$) and select the $k$ where the variation of the results is the smallest. They achieve this by computing the standard deviation of $\hat{\rho}_x(k)$ (respectively, $\tilde{\rho}_x(k)$) over a ``window'' of $\max([\sqrt{N_x /4}],3)$ (respectively, $\max([\sqrt{N_x-1}],3)$) successive values of $k$. The value of $k$ where this standard deviation is minimal defines the value of $k_n(x)$.
\vspace{2mm}
\noindent The user can also appreciably improve the estimation of $\rho_x$ and $\varphi(x)$ itself by tuning the choice of the lower limit (default option \code{lrho = 1}) and upper limit (default option \code{urho = Inf}).
\vspace{3mm}
\noindent {\it Threshold selection for moment and Pickands frontiers}
\vspace{3mm}
\noindent The function \code{kopt\_momt\_pick} is an implementation of an experimental method by \citet{Daouia:2010} for the automated threshold selection (choice of $k=k_n(x)$) for the moment frontier estimator $\tilde\varphi_{momt}(x)$ [see \code{dfs\_momt}] in the case where \code{method = "moment"} and for the Pickands frontier estimator $\hat\varphi_{pick}(x)$ [see \code{dfs\_pick}] in the case where \code{method = "pickands"}. The idea is to select first (for each $x$) a grid of values for the number $k_n(x)$ given by $k = 1, \ldots, [\sqrt{N_x}]$, where $[\sqrt{N_x}]$ stands for the integer part of $\sqrt{N_x}$ with $N_x=\sum_{i=1}^n\indic_{\{x_i\le x\}}$, and then select the $k$ where the variation of the results is the smallest. To achieve this here, \citet{Daouia:2010} compute the standard deviations of $\tilde\varphi_{momt}(x)$ [option \code{method = "moment"}] or $\hat\varphi_{pick}(x)$ [option \code{method = "pickands"}] over a ``window'' of size $\max(3, [ \texttt{wind.coef} \times \sqrt{N_x} /2])$, where the coefficient \code{wind.coef} should be selected in the interval $(0,1]$ in such a way as to avoid numerical instabilities. The default option \code{wind.coef = 0.1} corresponds to having a window large enough to cover around $10\%$ of the possible values of $k$ in the selected range of values for $k_n(x)$. The value of $k$ where the standard deviation is minimal defines the desired number $k_n(x)$. See the note in \code{help(kopt\_momt\_pick)} for further details.
\subsubsection[Probability-weighted moment frontier estimator]{Probability-weighted moment frontier estimator \label{pwm}}
The function \code{dfs\_pwm} computes the regularized frontier estimator introduced by \citet{Daouia:2012}. It is based on the unregularized probability-weighted moment (PWM) estimator
$$ \hat{\varphi}_m(x) = \varphi_{fdh}(x) - \int_{0}^{\varphi_{fdh}(x)} \hat{F}^m(y|x)dy $$
where the trimming order $m\geq 1$ is an integer such that $m=m_n\to\infty$ as $n\to\infty$, and $\hat{F}(y|x)=\sum_{i=1}^n\indic_{(x_i\leq x,y_i\leq y)}/\sum_{i=1}^n\indic_{(x_i\leq x)}$.
The implemented estimator of $\varphi(x)$ is then defined as
$$ \tilde{\varphi}_{pwm}(x) = \hat{\varphi}_m(x) + \Gamma\left(1 + 1/\bar\rho_x\right)\left( 1/m\,\hat\ell_x\right)^{1/\bar\rho_x} $$
where
$$ \bar{\rho}_x = \log (a)\left\{ \log\Big( \frac{\hat\varphi_{m}(x)-\hat\varphi_{am}(x)}{\hat\varphi_{am}(x)-\hat\varphi_{a^2m}(x)} \Big) \right\}^{-1} , \quad \hat{\ell}_x = \frac {1}{m}\left[\frac{\Gamma(1+ 1/\bar\rho_x)\big(1-a^{-1/\bar\rho_x}\big)}{\hat\varphi_{m}(x)-\hat\varphi_{am}(x)}\right]^{\bar\rho_x}, $$
with $a\geq 2$ being a fixed integer and $\bar{\rho}_x$ estimating the same tail-index $\rho_x=\beta_x+2$ as in \code{dfs\_momt} and \code{dfs\_pick}. If the true value of $\rho_x$ is known, we set $\bar{\rho}_x=\rho_x$ in the expressions above. In contrast, if $\rho_x$ is unknown, its estimate $\bar{\rho}_x$ can be obtained separately in an optimal way by calling the function \code{rho\_pwm} described below in the last paragraph. In both cases, we use the frontier estimator $\tilde{\varphi}_{pwm}(x)$ as if $\rho_x$ were known, by plugging in the corresponding value $\bar{\rho}_x$. As pointed out by \citet{Daouia:2012}, it is most efficient to conduct tail-index estimation and frontier estimation separately. Then, knowing the value $\bar{\rho}_x$, it remains to fix the two smoothing parameters $a$ and $m$ in order to calculate the frontier estimator $\tilde{\varphi}_{pwm}(x)$. A practical choice of these parameters that \citet{Daouia:2012} have employed is the simple rule of thumb $a=2$ [default option in the 5th argument of the function] and $m=\text{coefm} \times N^{1/3}_x$, where $N_x=\sum_{i=1}^n\indic_{\{x_i\le x\}}$ and the integer \code{coefm} is to be tuned by the user in the 4th argument of the function. \citet{Daouia:2012} have suggested using, for instance, the value \code{coefm = 1} in their numerical illustrations. An automatic data-driven rule for choosing the optimal tuning parameter \code{coefm} is implemented in the function \code{mopt\_pwm} described below.
\vspace{3mm}
\noindent {\it Confidence interval}
\vspace{3mm}
\noindent The pointwise $95\%$ confidence interval of $\varphi(x)$ derived from the asymptotic normality of $\tilde{\varphi}_{pwm}(x)$ is given by $[\tilde{\varphi}_{pwm}(x) \pm 1.96 \, \hat\sigma(m,x)/\sqrt{n} ]$, where
$$ \hat\sigma^2(m,x)= \frac{2m^2}{ \hat F_X(x)}\int_0^{\varphi_{fdh}(x)} \int_0^{\varphi_{fdh}(x)} \hat F^{m}(y|x)\hat F^{m-1}(u|x)(1-\hat F(u|x)) \indic_{(y\le u)}\, dy\,du, $$
with $\hat F_X(x) =(1/n)\sum_{i=1}^n\indic_{(x_i\leq x)}$. Note that the standard deviation $\sigma(m,x)/\sqrt{n}$ of the bias-corrected estimator $\tilde{\varphi}_{pwm}(x)$ is adjusted by a bootstrap estimator in the numerical illustrations of \citet{Daouia:2012}, whereas the exact estimate $\hat\sigma(m,x)/\sqrt{n}$ is utilized in our implemented function.
\vspace{3mm}
\noindent {\it Practical guidelines}
\vspace{3mm}
\noindent By way of example, we used as before the large dataset \code{post} and considered the following three possible scenarios: either $\rho_x$ is known (typically equal to $2$ if the assumption of a jump at the frontier is valid), or $\rho_x$ is unknown and estimated by the PWM estimator $\bar{\rho}_x$, or $\rho_x$ is unknown but independent of $x$ and estimated by the (trimmed) mean of $\bar{\rho}_x$.
When $\rho_x=2$, <<>>= rho <- 2 @ we get \code{coefm} in $\tilde\varphi_{pwm}(x)$ and the frontier estimate $\tilde\varphi_{pwm}(x)$ itself via the commands <>= best_cm.1 <- mopt_pwm(post$xinput, post$yprod, x.post, a = 2, rho = rho, wind.coef = 0.1) res.pwm.1 <- dfs_pwm(post$xinput, post$yprod, x.post, coefm = best_cm.1, a = 2, rho = rho) @ <>= res.pwm.1<-matrix(c(3693.689,3391.852,3995.527,3698.156,3393.867,4002.446,3664.170,3378.494,3949.845,4052.185,3678.129, 4426.240,4583.859,4082.118,5085.599,4544.788,4058.775,5030.800,4514.398,4040.244,4988.552,4370.231,3966.160,4774.301, 4299.334,3920.889,4677.779,5996.560,5104.971,6888.149,5767.108,5006.010,6528.207,8134.150,6676.470,9591.829,7714.785, 6449.040,8980.529,7293.492,6169.318,8417.665,7182.106,6156.329,8207.884,6996.592,6065.089,7928.095,7237.842,6381.361, 8094.322,7028.955,6258.911,7798.999,6880.657,6164.981,7596.333,6871.169,6193.560,7548.777,7087.487,6442.267,7732.707, 7160.956,6542.813,7779.100,7473.784,6873.938,8073.630,7468.800,6889.968,8047.631,7423.528,6855.837,7991.220,7399.736, 6846.585,7952.887,7466.078,6922.605,8009.552,7695.183,7152.966,8237.401,7674.784,7146.509,8203.058,7632.812,7115.122, 8150.503,7656.791,7143.767,8169.815,7694.408,7188.657,8200.158,7743.059,7245.345,8240.773,7784.993,7295.642,8274.344, 7768.774,7282.830,8254.718,7744.851,7268.383,8221.318,7737.486,7263.321,8211.652,7816.011,7351.348,8280.673,8050.885, 7582.127,8519.643,8065.387,7600.336,8530.437,8379.410,7909.951,8848.869,8691.761,8217.325,9166.196,8697.871,8234.138, 9161.604,8671.017,8213.569,9128.464,8673.969,8215.666,9132.272,9115.328,8640.232,9590.424,9388.969,8909.159,9868.779, 9496.951,9037.316,9956.587,9500.138,9050.094,9950.183,9709.822,9256.602,10163.042,9684.555,9236.256,10132.853,9973.046, 9498.580,10447.511,10399.506,9868.298,10930.713,10379.734,9851.823,10907.644,10406.428,9882.258,10930.599,10473.119, 9956.109,10990.130,10524.562,10009.960,11039.163,10520.942,10006.962,11034.921,10583.200,10072.521,11093.879,10643.052, 10134.391,11151.712,10638.578,10141.266,11135.889,10625.220,10133.219,11117.221,10611.912,10122.349,11101.475,10731.387, 10240.820,11221.954,10709.384,10223.295,11195.474,10697.309,10217.597,11177.020,11335.434,10745.236,11925.632,11320.899, 10733.487,11908.310,11357.880,10778.002,11937.758,11364.218,10782.596,11945.840,11345.092,10767.871,11922.313,11340.068, 10763.661,11916.475,11483.711,10903.637,12063.785,11543.050,10967.724,12118.376,11551.772,10983.171,12120.374,11576.991, 11015.011,12138.971,11546.826,10990.595,12103.056,11536.971,10982.598,12091.344,11880.266,11316.796,12443.736,11950.866, 11394.222,12507.509,11994.852,11452.743,12536.962,11989.201,11447.978,12530.425,11964.985,11431.400,12498.571,12014.366, 11490.019,12538.714,12291.219,11762.119,12820.319,12390.599,11866.754,12914.443,12356.355,11845.739,12866.971,12336.726, 11830.709,12842.744,12581.846,12055.270,13108.422,12996.801,12447.918,13545.684,12976.842,12437.372,13516.311,13558.110, 12982.361,14133.860,13541.724,12968.784,14114.664,13505.548,12946.694,14064.402,13490.559,12933.933,14047.185,13622.065, 13062.711,14181.420,13611.193,13061.849,14160.538,13586.358,13044.210,14128.507,13570.022,13030.314,14109.731,13564.974, 13029.502,14100.446),100,3,byrow=TRUE) @ To obtain the estimate $\bar{\rho}_x$ and its (trimmed) mean, we use the following specifications <>= rho_pwm <- rho_pwm(post$xinput, post$yprod, x.post, a = 2, lrho = 1, urho = Inf) rho_pwm_trim <- mean(rho_pwm, trim = 0.05) @ <>= 
rho_pwm<-c(1.023594,1.024185,1.039690,1.052159,1.024773,1.039298,1.927103, 1.837867,1.677647,1.550235,1.454431,1.379524,1.310404,1.260026,1.234148, 1.203759,1.178933,1.170742,1.161470,1.149357,1.198126,1.314633,1.515514, 1.599483,1.665202,1.722867,1.817279,1.879157,1.945815,1.990754,2.031745, 2.105846,2.165757,2.237778,2.259292,2.295147,2.315305,2.397803,2.079720, 2.092493,2.206906,2.359096,2.385121,2.400971,2.422183,2.024605,2.243093, 2.294995,2.310249,2.317997,2.334523,1.181741,1.189079,1.191125,1.199886, 1.204921,1.205482,1.208576,1.202842,1.209828,1.215217,1.195819,1.198989, 1.185889,1.188102,1.190547,1.580017,1.581408,1.586728,1.589427,1.590941, 1.592341,1.706439,1.721817,1.724899,1.729585,1.732874,1.735634,2.072950, 2.097440,2.137919,2.140791,2.144488,2.162599,5.093260,6.434999,6.410624, 6.412833,4.358651,3.123234,3.148811,1.078259,1.079294,1.074666,1.076102, 1.082311,1.083827,1.077977,1.079619,1.080355) rho_pwm_trim<-mean(rho_pwm, trim=0.05) @ The corresponding smoothing parameters \code{coefm} and frontier estimates are computed as follows: <>= best_cm.2 <- mopt_pwm(post$xinput, post$yprod, x.post, a = 2, rho = rho_pwm) best_cm.3 <- mopt_pwm(post$xinput, post$yprod, x.post, a = 2, rho = rho_pwm_trim) res.pwm.2 <- dfs_pwm(post$xinput, post$yprod, x.post, coefm = best_cm.2, rho = rho_pwm) res.pwm.3 <- dfs_pwm(post$xinput, post$yprod, x.post, coefm = best_cm.3, rho = rho_pwm_trim) @ <>= res.pwm.2<-matrix(c(3423.634,3133.666,3713.601,3420.217,3127.762,3712.672, 3403.614,3117.938,3689.290,3709.516,3335.461,4083.571,4093.768,3592.028,4595.509, 4056.693,3570.681,4542.706,4476.577,4002.423,4950.731,4294.037,3889.966,4698.107, 4152.061,3773.616,4530.506,5570.299,4678.710,6461.887,5307.500,4546.401,6068.599, 7158.666,5700.986,8616.345,6731.634,5465.889,7997.379,6339.216,5215.043,7463.390, 6232.965,5207.188,7258.743,6068.923,5137.420,7000.426,6265.447,5408.967,7121.928, 6137.926,5367.883,6907.970,6035.611,5319.935,6751.287,6039.558,5361.950,6717.166, 6290.624,5645.404,6935.844,6477.025,5858.882,7095.168,6966.139,6366.293,7565.985, 7056.248,6477.417,7635.079,7083.572,6515.880,7651.263,7125.345,6572.194,7678.496, 7285.332,6741.859,7828.805,7571.149,7028.932,8113.367,7620.769,7092.495,8149.044, 7623.786,7106.095,8141.477,7687.649,7174.624,8200.673,7797.465,7291.714,8303.215, 7904.813,7407.099,8402.527,8013.950,7524.599,8503.301,8016.589,7530.645,8502.533, 8020.621,7544.154,8497.088,8028.042,7553.876,8502.208,8182.375,7717.713,8647.037, 8127.045,7658.287,8595.803,8153.546,7688.495,8618.597,8586.068,8116.609,9055.527, 9069.230,8594.794,9543.665,9094.677,8630.944,9558.409,9076.375,8618.927,9533.822, 9100.556,8642.253,9558.859,9142.547,8667.451,9617.643,9669.278,9189.468,10149.088, 9830.317,9370.682,10289.953,9842.657,9392.612,10292.701,10069.847,9616.627,10523.067, 10058.510,9610.212,10506.808,9019.517,8545.051,9493.982,9359.357,8828.149,9890.564, 9351.436,8823.526,9879.347,9389.890,8865.720,9914.060,9469.730,8952.719,9986.740, 9517.669,9003.067,10032.270,9518.444,9004.464,10032.424,9571.469,9060.790,10082.148, 9634.284,9125.624,10142.945,9656.974,9159.662,10154.285,9631.020,9139.019,10123.022, 9627.760,9138.197,10117.324,9716.570,9226.003,10207.137,9709.098,9223.009,10195.188, 9713.317,9233.605,10193.029,10746.776,10156.577,11336.974,10738.178,10150.766,11325.589, 10786.367,10206.489,11366.245,10797.398,10215.776,11379.019,10786.170,10208.949,11363.391, 10783.844,10207.437,11360.252,11073.602,10493.528,11653.676,11154.008,10578.682,11729.334, 
11170.647,10602.046,11739.248,11205.198,10643.218,11767.177,11184.990,10628.759,11741.220, 11180.277,10625.904,11734.650,11982.912,11419.442,12546.382,12087.762,11531.119,12644.406, 12185.205,11643.096,12727.315,12183.219,11641.995,12724.442,12159.491,11625.905,12693.077, 12231.392,11707.045,12755.740,16571.616,16042.516,17100.716,18554.292,18030.448,19078.137, 18317.125,17806.509,18827.741,18235.602,17729.584,18741.619,15878.405,15351.829,16404.981, 14649.501,14100.618,15198.383,14637.705,14098.236,15177.175,12136.540,11560.790,12712.289, 12130.230,11557.290,12703.170,12131.561,11572.707,12690.416,12124.857,11568.231,12681.483, 12246.209,11686.855,12805.564,12260.300,11710.956,12809.644,12246.975,11704.827,12789.123, 12240.520,11700.812,12780.229,12245.702,11710.229,12781.174),100,3,byrow=TRUE) res.pwm.3<-matrix(c(3646.226,3344.389,3948.064,3648.807,3344.517,3953.097, 3622.910,3337.234,3908.586,3997.221,3623.165,4371.276,4507.417,4005.677,5009.158, 4467.527,3981.515,4953.540,4436.192,3962.038,4910.346,4299.358,3895.287,4703.428, 4230.362,3851.917,4608.807,5853.342,4961.754,6744.931,5639.696,4878.597,6400.795, 7896.196,6438.517,9353.876,7498.835,6233.090,8764.580,7098.036,5973.863,8222.210, 6994.209,5968.432,8019.987,6819.883,5888.380,7751.386,7058.151,6201.671,7914.632, 6865.907,6095.864,7635.951,6727.713,6012.037,7443.389,6722.773,6045.165,7400.381, 6936.750,6291.530,7581.970,7009.811,6391.668,7627.954,7315.399,6715.554,7915.245, 7313.206,6734.375,7892.037,7270.222,6702.530,7837.913,7250.309,6697.159,7803.460, 7316.880,6773.406,7860.353,7540.428,6998.210,8082.645,7524.539,6996.264,8052.813, 7485.710,6968.019,8003.401,7510.348,6997.323,8023.372,7547.774,7042.024,8053.525, 7596.133,7098.420,8093.847,7640.059,7150.708,8129.410,7624.931,7138.987,8110.875, 7604.246,7127.779,8080.713,7598.825,7124.659,8072.991,7677.471,7212.809,8142.133, 7906.991,7438.233,8375.749,7921.832,7456.781,8386.883,8229.054,7759.595,8698.513, 8533.614,8059.179,9008.049,8542.872,8079.139,9006.605,8518.945,8061.497,8976.392, 8521.985,8063.682,8980.288,8948.660,8473.564,9423.756,9215.412,8735.602,9695.222, 9326.893,8867.258,9786.529,9334.011,8883.967,9784.056,9539.464,9086.243,9992.684, 9516.357,9068.059,9964.655,9796.244,9321.778,10270.709,10204.918,9673.710,10736.125, 10186.882,9658.972,10714.792,10213.718,9689.548,10737.888,10281.710,9764.700,10798.721, 10332.350,9817.748,10846.952,10328.829,9814.849,10842.809,10390.698,9880.019,10901.376, 10449.435,9940.774,10958.095,10448.894,9951.583,10946.206,10437.687,9945.686,10929.688, 10425.548,9935.984,10915.111,10542.273,10051.706,11032.840,10522.477,10036.387,11008.566, 10512.897,10033.185,10992.608,11123.680,10533.481,11713.878,11110.586,10523.175,11697.997, 11148.966,10569.088,11728.844,11155.662,10574.040,11737.283,11138.683,10561.462,11715.904, 11133.952,10557.545,11710.360,11272.850,10692.776,11852.924,11331.986,10756.659,11907.312, 11342.691,10774.090,11911.292,11369.501,10807.521,11931.481,11342.411,10786.180,11898.641, 11333.361,10778.988,11887.734,11668.327,11104.857,12231.797,11739.272,11182.628,12295.915, 11787.024,11244.914,12329.133,11781.696,11240.473,12322.920,11762.284,11228.699,12295.870, 11813.403,11289.055,12337.750,12083.976,11554.877,12613.076,12182.628,11658.783,12706.472, 12154.117,11643.501,12664.733,12136.689,11630.671,12642.707,12372.389,11845.813,12898.965, 12775.894,12227.011,13324.777,12759.797,12220.327,13299.266,13323.740,12747.991,13899.490, 13308.758,12735.818,13881.698,13279.889,12721.034,13838.743,13265.917,12709.291,13822.544, 
13394.246,12834.892,13953.601,13387.143,12837.799,13936.488,13365.605,12823.457,13907.753, 13350.512,12810.804,13890.221,13346.982,12811.509,13882.454),100,3,byrow=TRUE) @ The following code can be used to construct the resulting PWM frontier plots graphed in Figure~\ref{fig-pwm}. <>= plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pwm.1[,1], lty = 1, col = "cyan") lines(x.post, res.pwm.1[,2], lty = 3, col = "magenta") lines(x.post, res.pwm.1[,3], lty = 3, col = "magenta") plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pwm.2[,1], lty = 1, col = "cyan") lines(x.post, res.pwm.2[,2], lty = 3, col = "magenta") lines(x.post, res.pwm.2[,3], lty = 3, col = "magenta") plot(yprod ~ xinput, data = my_samp, xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pwm.3[,1], lty = 1, col = "cyan") lines(x.post, res.pwm.3[,2], lty = 3, col = "magenta") lines(x.post, res.pwm.3[,3], lty = 3, col = "magenta") @ \begin{figure}[htbp] \begin{center} <>= <> op = par(mfrow = c(1, 3), mar = c(4.5, 4.5, 2.1, 2.1), mgp = c(2.9, 1.1, 0), oma = c(0, 0, 0, 0), cex.lab = 2.3, cex.axis = 2.3) plot(yprod ~ xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pwm.1[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.pwm.1[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.pwm.1[,3], lty = 3, lwd = 4, col = "magenta") plot(yprod~xinput, data = my_samp, col = "grey", xlab = "Quantity of labor", ylab = "Volume of delivered mail") lines(x.post, res.pwm.2[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.pwm.2[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.pwm.2[,3], lty = 3, lwd = 4, col = "magenta") plot(yprod~xinput, data = my_samp, xlab = "Quantity of labor", col = "grey", ylab = "Volume of delivered mail") lines(x.post, res.pwm.3[,1], lty = 1, lwd = 2, col = "cyan") lines(x.post, res.pwm.3[,2], lty = 3, lwd = 4, col = "magenta") lines(x.post, res.pwm.3[,3], lty = 3, lwd = 4, col = "magenta") par(op) <> @ \end{center} \caption{Resulting regularized PWM estimator $\tilde\varphi_{pwm}$ and $95\%$ confidence interval of $\varphi$ for the 4000 European post offices. From left to right, we have the case $\rho_x=2$, plugging $\bar{\rho}_x$ and plugging the mean of $\bar{\rho}_x$.\label{fig-pwm}} \end{figure} \vspace{3mm} \noindent {\it Threshold selection for the PWM frontier estimator} \vspace{3mm} \noindent The function \code{mopt\_pwm} implements an automated selection of the parameter \code{coefm} involved in the probability-weighted moment (PWM) estimator $\tilde\varphi_{pwm}(x)$ [see \code{dfs\_pwm}]. It is an adaptation of the experimental method \code{kopt\_momt\_pick} by \citet{Daouia:2010}. The idea is to select first (for each $x$) a grid of values for the parameter \code{coefm} given by $c = 1, \ldots, \min(10,[\sqrt{N_x}])$, where $N_x=\sum_{i=1}^n\indic_{\{x_i\le x\}}$, and then select the $c$ where the variation of the results is the smallest. To achieve this, we compute the standard deviations of $\tilde\varphi_{pwm}(x)$ over a ``window'' of size $wind.coef \times \min(10,[\sqrt{N_x}])$, where the coefficient \code{wind.coef} should be selected in the interval $(0,1]$ in such a way to avoid numerical instabilities. 
The default option \code{wind.coef = 0.1} corresponds to having a window large enough to cover around $10\%$ of the possible values of $c$ in the selected range of values for \code{coefm}. The value of $c$ where the standard deviation is minimal defines the desired \code{coefm}.
\vspace{3mm}
\noindent {\it PWM estimate of the tail-index $\rho_x$}
\vspace{3mm}
\noindent The function \code{rho\_pwm} computes the probability-weighted moment (PWM) estimator $\bar\rho_x$ utilized in the frontier estimate $\tilde\varphi_{pwm}(x)$ [see \code{dfs\_pwm}]. This estimator depends on the smoothing parameters $a$ and $m$. A simple selection rule of thumb that \citet{Daouia:2012} have employed is $a=2$ [default option in the 4th argument of the function] and $m=\text{coefm} \times N^{1/3}_x$, where $N_x=\sum_{i=1}^n\indic_{\{x_i\le x\}}$ and the integer \code{coefm} is to be tuned by the user. To choose this parameter in an optimal way for each $x$, we adapt the automated threshold selection method of \citet{Daouia:2010} as follows: We first evaluate the estimator $\bar\rho_x$ over a grid of values of \code{coefm} given by $c = 1, \ldots, 150$. Then, we select the $c$ where the variation of the results is the smallest. This is achieved by computing the standard deviation of the estimates $\bar\rho_x$ over a ``window'' of $\max([\sqrt{150}],3)$ successive values of $c$. The value of $c$ where this standard deviation is minimal defines the value of \code{coefm}.
\vspace{2mm}
\noindent The user can also appreciably improve the estimation of $\rho_x$ and $\varphi(x)$ itself by tuning the choice of the lower limit (default option \code{lrho = 1}) and upper limit (default option \code{urho = Inf}).
\vspace{3mm}
\section[Numerical illustrations]{Numerical illustrations \label{illustrations}}
Comparisons among most of the selected estimation methods described above have been undertaken by \citet{Daouia:2015} and more recently by \citet{Noh:2014} via simulation experiments. To encourage others to explore these methods and easily compare the quality of any new proposal with the competing existing methods, we provide some guidelines that facilitate comparisons based on Monte-Carlo simulations in a similar way to the simulation designs of \citet{Daouia:2015} and \citet{Noh:2014}.
\subsection[Comparison criteria]{Comparison criteria \label{com_cri}}
After estimating the true frontier function $\varphi(x)$ from $N$ independent samples of size $n$, \citet{Daouia:2015} and \citet{Noh:2014} considered the empirical mean integrated squared error (MISE), the empirical integrated squared bias (${\rm IBIAS2}$) and the empirical integrated variance (IVAR), which are given by
\begin{align*}\label{eq17}
{\rm MISE} &=\frac{1}{N}\sum_{j=1}^{N} {\rm ISE}(\hat{\varphi}^{(j)}):=\frac{1}{N}\sum_{j=1}^{N} \left[ \frac{1 }{I}\sum_{i=0}^I\left(\hat{\varphi}^{(j)}(z_{i})-\varphi(z_i)\right)^{2}\right] \\
&=\frac{1}{I}\sum_{i=0}^I \left(\varphi(z_{i})-\bar{\hat{\varphi}}(z_{i})\right) ^{2} + \frac{1}{I}\sum_{i=0}^I\left[\frac{1}{N}\sum_{j=1}^N(\hat{\varphi}^{(j)}(z_{i})-\bar{\hat{\varphi}}(z_{i}))^2\right] \nonumber \\
&\equiv {\rm IBIAS2}+{\rm IVAR}, \nonumber
\end{align*}
where $\{{z}_{i}$, $i=0,\ldots,I \}$ is an equispaced grid having width $1/I$ over $[a,b]$ (the true support of the input variable), with $I=1000$, $\hat{\varphi}^{(j)}(\cdot)$ is the estimated frontier function from the $j$-th data sample and $\bar{\hat{\varphi}}({z}_{i}) = N^{-1} \sum_{j=1}^N \hat \varphi^{(j)}({z}_i)$.
Although the definition of these comparison criteria is quite straightforward, some caution should be taken when calculating them. The reason is that the estimation of $\hat \varphi^{(j)}(z_i)$ is possible only when $z_i$ lies between the minimum and maximum of the inputs of the $j$-th sample $x_1^{(j)},\ldots,x_n^{(j)}$. In our package, when storing the estimates $\hat \varphi^{(j)}(z_i),~i=0,\ldots,I$, we set the value $\hat \varphi^{(j)}(z_i)$ to zero whenever the estimation is not possible, so that these cases can be identified. The function \code{evaluation} automatically computes the comparison criteria using only nonzero estimates at every grid point $z_i$. The first argument of this function is the matrix where the estimation results are stored, the second argument is the evaluation grid vector, and the third argument is the vector of values of the true frontier function at the grid points.
\subsection[Some Monte Carlo evidence]{Some Monte Carlo evidence}
By way of example, to evaluate the finite-sample performance of the empirical LFDH and DEA frontier estimators in comparison with the polynomial, spline and kernel smoothing estimators, we have undertaken some simulation experiments following \citet{Daouia:2015}'s study. The experiments all employ the model $y_i=\varphi(x_i)\,v_i,$ where $x_i$ is uniform on $[0, 1]$ and $v_i,$ independent of $x_i,$ is ${\rm Beta}(\beta, \beta)$ with values of $\beta=0.5, 1$ and $3$ [corresponding, respectively, to a joint density of the $(x_i,y_i)$'s increasing toward infinity, having a jump or decreasing to zero as it approaches the support boundary]. Tables \ref{results1} and \ref{results2} report the obtained Monte Carlo estimates when $\varphi(x)= x^{1/2}$ and $\varphi(x)= \exp(-5+10x)/(1+\exp(-5+10x))$, respectively. All the experiments were performed over $N=5000$ independent samples of size $n=25$, $50$, $100$ and $200$. The code which generates the results in Tables \ref{results1} and \ref{results2} is given in the supplementary file. Note that the computational burden here is demanding, so be forewarned. Note also that only $N = 200$ replications were considered in \citet{Daouia:2015}.
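To give a rough idea of how such an experiment can be organized around the function \code{evaluation}, the following minimal sketch (not evaluated here, and not the supplementary code) simulates a much smaller experiment for the DEA estimator only; the settings $N=50$, $n=100$ and $\beta=1$ are purely illustrative, and the object names are ours.
<<eval = FALSE>>=
## Minimal Monte Carlo sketch for the DEA estimator with phi(x) = sqrt(x).
N <- 50; n <- 100; beta <- 1; I <- 1000
grid <- seq(0, 1, length.out = I + 1)   # equispaced grid of width 1/I
true.phi <- sqrt(grid)
res.dea <- matrix(0, N, I + 1)          # zero flags grid points where
                                        # estimation is not possible
set.seed(1)
for (j in 1:N) {
  xtab <- runif(n)
  ytab <- sqrt(xtab) * rbeta(n, beta, beta)
  inside <- grid >= min(xtab) & grid <= max(xtab)
  res.dea[j, inside] <- dea_est(xtab, ytab, grid[inside], type = "dea")
}
evaluation(res.dea, grid, true.phi)
@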
\begin{table}[H] \scriptsize{ \centering \begin{tabular}{llrrrrrr} \hline & & \scriptsize{\texttt{dea\_est}} & \scriptsize{\texttt{cub\_spline\_est}} & \scriptsize{\texttt{quad\_spline\_est}} & \scriptsize{\texttt{kern\_smooth}} & \scriptsize{\texttt{kern\_smooth}} & \scriptsize{\texttt{poly\_est}} \\ & & \scriptsize{(\texttt{type="dea"})} & \scriptsize{(\texttt{type="mc"})} & \scriptsize{(\texttt{type="mc"})} & \scriptsize{(\texttt{type="mc"})} & \scriptsize{(\texttt{type="mc"})} & \scriptsize{(\texttt{"BIC"})} \\ & & & \scriptsize{(\texttt{all.dea=T})} & \scriptsize{(\texttt{all.dea=T})} & \scriptsize{(\texttt{technique="pr"})} & \scriptsize{(\texttt{technique="noh"})} & \\ $\beta=0.5$ & & & & & \scriptsize{(\texttt{"cv"})} & \scriptsize{(\texttt{"bic"})} & \\ \hline $n=25$ & ${\rm IBIAS2}$ & 0.002655 & 0.001525 & 0.001759 & 0.018617 & 0.001208 & 0.011507 \\ & IVAR & 0.001942 & 0.001955 & 0.002045 & 0.007450 & 0.001935 & 0.031622 \\ & IMSE & 0.004597 & \textcolor{orange}{0.003480} & 0.003803 & 0.026067 & \textcolor{red}{0.003143} & 0.043130 \\ $n=50$ & ${\rm IBIAS2}$ & 0.000793 & 0.000412 & 0.000481 & 0.009313 & 0.000347 & 0.001429 \\ & IVAR & 0.000615 & 0.000594 & 0.000621 & 0.003511 & 0.000584 & 0.007217 \\ & IMSE & 0.001408 & \textcolor{orange}{0.001006} & 0.001102 & 0.012824 & \textcolor{red}{0.000931} & 0.008646 \\ $n=100$& ${\rm IBIAS2}$ & 0.000226 & 0.000105 & 0.000127 & 0.005078 & 0.000152 & 0.000350 \\ & IVAR & 0.000183 & 0.000168 & 0.000174 & 0.001336 & 0.000168 & 0.001007 \\ & MISE & 0.000409 & \textcolor{red}{0.000274} & \textcolor{orange}{0.000300} & 0.006414 & 0.000320 & 0.001358 \\ $n=200$& ${\rm IBIAS2}$ & 0.000061 & 0.000025 & 0.000032 & 0.003399 & 0.000105 & 0.000198 \\ & IVAR & 0.000048 & 0.000044 & 0.000045 & 0.000539 & 0.000049 & 0.000167 \\ & MISE & 0.000109 & \textcolor{red}{0.000069} & \textcolor{orange}{0.000077} & 0.003938 & 0.000154 & 0.000365\\ $\beta=1$ & & & & & (N=4799) & & \\ \hline $n=25$& ${\rm IBIAS2}$ & 0.008049 & 0.005598 & 0.006092 & 0.014150 & 0.005202 & 0.024311 \\ & IVAR & 0.002856 & 0.003188 & 0.003282 & 0.006447 & 0.003160 & 0.027117 \\ & MISE & 0.010905 & \textcolor{orange}{0.008786} & 0.009374 & 0.020597 & \textcolor{red}{0.008362} & 0.051428 \\ $n=50$ & ${\rm IBIAS2}$ & 0.003401 & 0.002223 & 0.002447 & 0.007114 & 0.002065 & 0.007184 \\ & IVAR & 0.001288 & 0.001390 & 0.001438 & 0.003040 & 0.001400 & 0.010049 \\ & MISE & 0.004688 & \textcolor{orange}{0.003613} & 0.003885 & 0.010154 & \textcolor{red}{0.003465} & 0.017233 \\ $n=100$ & ${\rm IBIAS2}$ & 0.001305 & 0.000784 & 0.000878 & 0.003904 & 0.000747 & 0.001928 \\ & IVAR & 0.000497 & 0.000538 & 0.000537 & 0.001320 & 0.000539 & 0.003196 \\ & MISE & 0.001802 & \textcolor{orange}{0.001322} & 0.001415 & 0.005224 & \textcolor{red}{0.001286} & 0.005124 \\ & & & & & (N=4999) & & \\ $n=200$ & ${\rm IBIAS2}$ & 0.000525 & 0.000298 & 0.000342 & 0.002540 & 0.000310 & 0.000589 \\ & IVAR & 0.000201 & 0.000219 & 0.000212 & 0.000551 & 0.000216 & 0.001054 \\ & MISE & 0.000727 & \textcolor{red}{0.000517} & 0.000555 & 0.003091 & \textcolor{orange}{0.000526} & 0.001643 \\ $\beta=3$ & & & & & (N=4773) & & \\ \hline $n=25$& ${\rm IBIAS2}$ & 0.029439 & 0.024860 & 0.025751 & 0.021526 & 0.024553 & 0.050245 \\ & IVAR & 0.002940 & 0.003485 & 0.003555 & 0.004882 & 0.003441 & 0.014190 \\ & MISE & 0.032379 & 0.028345 & 0.029306 & \textcolor{red}{0.026407} & \textcolor{orange}{0.027994} & 0.064435 \\ $n=50$ & ${\rm IBIAS2}$ & 0.018980 & 0.015737 & 0.016307 & 0.014489 & 0.015749 & 0.030942 \\ & IVAR & 0.001857 & 0.002204 & 
0.002258 & 0.002895 & 0.002196 & 0.007812 \\ & MISE & 0.020837 & \textcolor{orange}{0.017941} & 0.018565 & \textcolor{red}{0.017384} & 0.017944 & 0.038755 \\ $n=100$ & ${\rm IBIAS2}$ & 0.012697 & 0.010435 & 0.010824 & 0.010368 & 0.010586 & 0.019784 \\ & IVAR & 0.001177 & 0.001411 & 0.001447 & 0.001708 & 0.001366 & 0.004460 \\ & MISE & 0.013874 & \textcolor{red}{0.011846} & 0.012271 & 0.012076 & \textcolor{orange}{0.011952} & 0.024244 \\ & & & & & (N=4995) & & \\ $n=200$ & ${\rm IBIAS2}$ & 0.008182 & 0.006616 & 0.006885 & 0.007150 & 0.006820 & 0.012588 \\ & IVAR & 0.000735 & 0.000901 & 0.000903 & 0.000976 & 0.000843 & 0.002722 \\ & MISE & 0.008917 & \textcolor{red}{0.007518} & 0.007788 & 0.008126 & \textcolor{orange}{0.007673} & 0.015310\\ & & & & & (N=4682) & & \\ \hline \end{tabular} } \centering \caption{Monte-Carlo comparison when the true frontier is monotone and concave ($\varphi(x) = \sqrt{x}$), with $N=5000$ replications. Colors code : \textcolor{red}{1st rank}, \textcolor{orange}{2nd rank}. When $N < 5000$, this means that \texttt{solve.QP} was unable to find a solution \citep[][suggest then to adjust constraints and restart]{Hayfield:2008}.\label{results1}} \end{table} \begin{table}[H] \scriptsize{ \centering \begin{tabular}{rrrrrrrrr} \hline & & \scriptsize{\texttt{dea\_est}} & \scriptsize{\texttt{cub\_spline\_est}} & \scriptsize{\texttt{quad\_spline\_est}} & \scriptsize{\texttt{kern\_smooth}} & \scriptsize{\texttt{kern\_smooth}} & \scriptsize{\texttt{poly\_est}} \\ & & \scriptsize{(\texttt{type="lfdh"})} & \scriptsize{(\texttt{type="m"})} & \scriptsize{(\texttt{type="m"})} & \scriptsize{(\texttt{type="m"})} & \scriptsize{(\texttt{type="m"})} & \scriptsize{(\texttt{"BIC"})} \\ & & & \scriptsize{(\texttt{all.dea=F})} & \scriptsize{(\texttt{all.dea=F})} & \scriptsize{(\texttt{technique="pr"})} & \scriptsize{(\texttt{technique="noh"})} & \\ $\beta=0.5$ & & & & & \scriptsize{(\texttt{"cv"})} & \scriptsize{(\texttt{"bic"})} & \\ \hline $n=25$ & ${\rm IBIAS2}$ & 0.007032 & 0.001601 & 0.001580 & 0.002627 & 0.002071 & 0.015539 \\ & IVAR & 0.005284 & 0.004923 & 0.005130 & 0.006121 & 0.004485 & 0.030231 \\ & MISE & 0.012316 & \textcolor{red}{0.006524} & 0.006710 & 0.008748 & \textcolor{orange}{0.006557} & 0.045770 \\ $n=50$ & ${\rm IBIAS2}$ & 0.002492 & 0.000200 & 0.000294 & 0.000570 & 0.000554 & 0.003047 \\ & IVAR & 0.002073 & 0.000988 & 0.001595 & 0.002283 & 0.001264 & 0.009148 \\ & MISE & 0.004565 & \textcolor{red}{0.001188} & \textcolor{orange}{0.001889} & 0.002854 & 0.001818 & 0.012195\\ $n=100$& ${\rm IBIAS2}$ & 0.000916 & 0.000023 & 0.000122 & 0.000099 & 0.000126 & 0.000489 \\ & IVAR & 0.000829 & 0.000180 & 0.000549 & 0.000818 & 0.000431 & 0.002151 \\ & MISE & 0.001745 & \textcolor{red}{0.000203} & 0.000672 & 0.000918 & \textcolor{orange}{0.000557} & 0.002640 \\ & & & & & (N=4997) & & \\ $n=200$& ${\rm IBIAS2}$ & 0.000347 & 0.000009 & 0.000044 & 0.000014 & 0.000017 & 0.000080 \\ & IVAR & 0.000335 & 0.000039 & 0.000171 & 0.000149 & 0.000102 & 0.000409 \\ & MISE & 0.000682 & \textcolor{red}{0.000048} & 0.000215 & 0.000163 & \textcolor{orange}{0.000119} & 0.000489\\ $\beta=1$ & & & & & (N=4769) & & \\ \hline $n=25$& ${\rm IBIAS2}$ & 0.016099 & 0.006079 & 0.006077 & 0.005694 & 0.007217 & 0.025964 \\ & IVAR & 0.005824 & 0.006517 & 0.006624 & 0.006149 & 0.005723 & 0.023157 \\ & MISE & 0.021923 & \textcolor{orange}{0.012597} & 0.012700 & \textcolor{red}{0.011843} & 0.012941 & 0.049121\\ $n=50$ & ${\rm IBIAS2}$ & 0.007761 & 0.001880 & 0.001845 & 0.002213 & 0.003455 & 0.009294 \\ & IVAR & 
0.003058 & 0.002341 & 0.002930 & 0.002528 & 0.002561 & 0.010293 \\ & MISE & 0.010819 & \textcolor{red}{0.004221} & 0.004775 & \textcolor{orange}{0.004741} & 0.006016 & 0.019587 \\ $n=100$ & ${\rm IBIAS2}$ & 0.003700 & 0.000541 & 0.000525 & 0.000869 & 0.001590 & 0.002977 \\ & IVAR & 0.001511 & 0.000767 & 0.001274 & 0.001098 & 0.001252 & 0.003713 \\ & MISE & 0.005211 & \textcolor{red}{0.001308} & \textcolor{orange}{0.001799} & 0.001967 & 0.002842 & 0.006690 \\ & & & & & (N=4998) & & \\ $n=200$ & ${\rm IBIAS2}$ & 0.001757 & 0.000151 & 0.000176 & 0.000332 & 0.000708 & 0.000898 \\ & IVAR & 0.000745 & 0.000246 & 0.000609 & 0.000400 & 0.000602 & 0.001303 \\ & MISE & 0.002502 & \textcolor{red}{0.000397} & 0.000785 & \textcolor{orange}{0.000732} & 0.001310 & 0.002202 \\ $\beta=3$ & & & & & (N=4706) & & \\ \hline $n=25$& ${\rm IBIAS2}$ & 0.038773 & 0.024179 & 0.024325 & 0.021889 & 0.025572 & 0.044459 \\ & IVAR & 0.004420 & 0.005280 & 0.005490 & 0.004586 & 0.004817 & 0.011995 \\ & MISE & 0.043193 & \textcolor{orange}{0.029459} & 0.029815 & \textcolor{red}{0.026475} & 0.030389 & 0.056454 \\ $n=50$ & ${\rm IBIAS2}$ & 0.026249 & 0.014575 & 0.014621 & 0.015485 & 0.018127 & 0.027638 \\ & IVAR & 0.002996 & 0.003205 & 0.003465 & 0.002635 & 0.002996 & 0.006719 \\ & MISE & 0.029245 & \textcolor{red}{0.017779} & \textcolor{orange}{0.018087} & 0.018121 & 0.021123 & 0.034357 \\ $n=100$ & ${\rm IBIAS2}$ & 0.018164 & 0.009415 & 0.009543 & 0.011111 & 0.013310 & 0.017469 \\ & IVAR & 0.001981 & 0.001902 & 0.002250 & 0.001677 & 0.002002 & 0.003854 \\ & MISE & 0.020144 & \textcolor{red}{0.011317} & \textcolor{orange}{0.011794} & 0.012788 & 0.015312 & 0.021323 \\ & & & & & (N=4996) & & \\ $n=200$ & ${\rm IBIAS2}$ & 0.012546 & 0.006264 & 0.006477 & 0.007611 & 0.009773 & 0.011220 \\ & IVAR & 0.001328 & 0.001196 & 0.001488 & 0.001081 & 0.001351 & 0.002338 \\ & MISE & 0.013874 & \textcolor{red}{0.007460} & \textcolor{orange}{0.007966} & 0.008692 & 0.011123 & 0.013558 \\ & & & & & (N=4631) & & \\ \hline \end{tabular} } \centering \caption{Monte-Carlo comparison when the true frontier is only monotone $(\varphi(x)=\frac{\exp(-5 + 10\times x)}{(1 + \exp(-5 + 10\times x))})$, with $N=5000$ replications. Colors code : \textcolor{red}{1st rank}, \textcolor{orange}{2nd rank}. 
When $N<5000$, this means that \texttt{solve.QP} was unable to find a solution for some of the simulated samples \citep[][who suggest adjusting the constraints and restarting in such cases]{Hayfield:2008}.\label{results2}}
\end{table}

<<eval=FALSE>>=
## Code generating the results of Tables 1 and 2 (heavy computation:
## B = 5000 replications for each of the 24 parameter combinations).
f <- function(core) {
  ## betav and n (length 12) are recycled over the two frontier
  ## specifications (phi), giving 24 parameter combinations in total
  parametre <- data.frame(betav = rep(c(0.5, 1, 3), each = 4),
                          n = rep(c(25, 50, 100, 200), 3),
                          phi = c(rep(1, 12), rep(2, 12)))
  n <- parametre$n[core]
  betav <- parametre$betav[core]
  B <- 5000
  n.phi <- parametre$phi[core]
  if (n.phi == 1) {
    Fron <- function(x) sqrt(x)
    type <- "dea"
    method <- "mc"
  } else {
    Fron <- function(x) exp(-5 + 10*x)/(1 + exp(-5 + 10*x))
    type <- "lfdh"
    method <- "m"
  }
  x.sim <- seq(0, 1, length.out = 1000)
  y.dea <- matrix(0, B, 1000)
  y.cub <- matrix(0, B, 1000)
  y.quad <- matrix(0, B, 1000)
  y.noh <- matrix(0, B, 1000)
  y.pr <- matrix(0, B, 1000)
  y.poly <- matrix(0, B, 1000)
  for (k in 1:B) {
    xtab <- runif(n, 0, 1)
    V <- rbeta(n, betav, betav)
    ytab <- Fron(xtab)*V
    cind <- which((x.sim >= min(xtab)) & (x.sim <= max(xtab)))
    x <- x.sim[cind]
    # DEA (or LFDH)
    y.dea[k, cind] <- npbr::dea_est(xtab, ytab, x, type = type)
    # cubic spline
    kopt <- npbr::cub_spline_kn(xtab, ytab, method = method,
                                krange = 1:20, type = "BIC")
    y.cub[k, cind] <- npbr::cub_spline_est(xtab, ytab, x, kn = kopt,
      method = method, all.dea = ifelse(n.phi == 1, TRUE, FALSE))
    # quadratic spline
    kopt <- npbr::quad_spline_kn(xtab, ytab, method = method,
                                 krange = 1:20, type = "BIC")
    y.quad[k, cind] <- npbr::quad_spline_est(xtab, ytab, x, kn = kopt,
      method = method, all.dea = ifelse(n.phi == 1, TRUE, FALSE))
    options(np.tree = TRUE, crs.messages = FALSE, np.messages = FALSE)
    # kernel smoother, Parmeter-Racine technique
    h.pr <- npbr::kern_smooth_bw(xtab, ytab, method = method,
                                 technique = "pr", bw_method = "cv")
    y.pr[k, cind] <- npbr::kern_smooth(xtab, ytab, x, h = h.pr,
                                       method = method, technique = "pr")
    # kernel smoother, Noh technique
    h.noh <- npbr::kern_smooth_bw(xtab, ytab, method = method,
                                  technique = "noh", bw_method = "bic")
    y.noh[k, cind] <- npbr::kern_smooth(xtab, ytab, x, h = h.noh,
                                        method = method, technique = "noh")
    # polynomial
    p.bic <- npbr::poly_degree(xtab, ytab, type = "BIC")
    y.poly[k, cind] <- npbr::poly_est(xtab, ytab, x, deg = p.bic)
  }
  # internal function: comparison criteria based on the nonzero estimates
  evaluation <- function(MAT, xeval, true_vec) {
    denzero <- function(vec) {
      return(sum(vec != 0, na.rm = TRUE))
    }
    nzr <- apply(MAT, 1, denzero)
    nzc <- apply(MAT, 2, denzero)
    nzc_ind <- which(nzc != 0)
    nz_mat <- matrix(as.numeric(MAT != 0), dim(MAT)[1], length(xeval),
                     byrow = FALSE)
    cmean <- rep(0, dim(MAT)[2])
    temp <- apply(MAT, 2, sum, na.rm = TRUE)
    cmean[nzc_ind] <- temp[nzc_ind]*(1/nzc[nzc_ind])
    temp2 <- apply((MAT - rep(1, dim(MAT)[1]) %*% t(cmean))^2 * nz_mat,
                   2, sum, na.rm = TRUE)
    IVAR <- mean(temp2[nzc_ind]*(1/nzc[nzc_ind]))
    temp3 <- (true_vec - cmean)^2
    IBIAS <- mean(temp3[nzc_ind])
    IMSE <- IBIAS + IVAR
    return(list(IBIAS2 = IBIAS, IVAR = IVAR, MISE = IMSE))
  }
  result.dea <- evaluation(y.dea, x.sim, Fron(x.sim))
  result.cub <- evaluation(y.cub, x.sim, Fron(x.sim))
  result.quad <- evaluation(y.quad, x.sim, Fron(x.sim))
  result.pr <- evaluation(y.pr, x.sim, Fron(x.sim))
  result.noh <- evaluation(y.noh, x.sim, Fron(x.sim))
  result.poly <- evaluation(y.poly, x.sim, Fron(x.sim))
  return(list(tab = cbind(result.dea, result.cub, result.quad,
                          result.pr, result.noh, result.poly),
              na.nb = c(sum(apply(y.dea, 1, function(x) any(is.na(x)))),
                        sum(apply(y.cub, 1, function(x) any(is.na(x)))),
                        sum(apply(y.quad, 1, function(x) any(is.na(x)))),
                        sum(apply(y.pr, 1, function(x) any(is.na(x)))),
                        sum(apply(y.noh, 1, function(x) any(is.na(x)))),
                        sum(apply(y.poly, 1, function(x) any(is.na(x)))))))
}
require("snowfall")
nb.cpus <- 24
sfInit(parallel = TRUE, cpus = nb.cpus)
system.time(result <- sfClusterApplyLB(1:24, f))
sfStop()
@

\section*{Acknowledgments}

The first author acknowledges financial support from the Toulouse School of
Economics Individual Research Fund (IRF/Daouia-20125) and from the Seventh Framework Programme of the European Union (IEF/273584/EMBAF-project). The research of the third author was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2014R1A1A2059875).

\bibliography{npbrvignette}

\end{document}