\documentclass[mathserif,12pt]{amsart}
%\VignetteIndexEntry{MEDDE: Penalized Renyi Density Estimation}
%\VignetteEngine{knitr::knitr}
%% additional packages
\usepackage[latin1]{inputenc}
\usepackage{a4wide,graphicx,color,thumbpdf}
\usepackage{hyperref}
\usepackage{alltt}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{upquote}
\usepackage{mdframed}
\usepackage{xcolor}
\usepackage{eulervm}
%% BibTeX settings
\usepackage[authoryear,round]{natbib}
\bibliographystyle{jae}
\bibpunct{(}{)}{,}{a}{,}{,}
\newcommand{\doi}[1]{\href{http://dx.doi.org/#1}{\normalfont\texttt{doi:#1}}}
%% markup commands for code/software
\let\code=\texttt
\let\pkg=\textbf
\let\proglang=\textsf
\newcommand{\file}[1]{`\code{#1}'}
%\newcommand{\email}[1]{\href{mailto:#1}{\normalfont\texttt{#1}}}
%% paragraph formatting
%\renewcommand{\baselinestretch}{1.5}
%% \usepackage{Sweave} is essentially
\RequirePackage[T1]{fontenc}
\RequirePackage{ae,fancyvrb}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontshape=sl}
\def\RR{\mathbb{R}}
\def\OO{\mathcal O}
\def\PP{\mathbb{P}}
\def\LL{\mathcal L}
\def\II{\mathcal I}
\def\EE{\mathbb E}
\def\VB{\mathbb V}
\def\FF{\mathcal F}
\def\NN{\mathcal N}
\def\XX{\mathcal X}
\def\SS{\mathcal S}
\def\GG{\mathcal G}
\def\CC{\mathcal C}
\def\AA{\mathcal A}
\def\TT{\mathcal T}
\def\HH{\mathcal H}
\def\JJ{\mathcal J}
\def\KK{\mathcal K}
\def\PP{\mathcal{P}}
\def\DD{\mathcal{D}}
\def\LL{\mathcal L}
\def\II{\mathcal I}
\def\pperp{\perp\!\!\!\perp}
\def\Polya{P\'olya }
\def\Renyi{R\'enyi }
\def\diag{\mbox{diag}}
\def\sgn{\mbox{sgn}}
\def\TV{\bigvee_\Omega }
%\newtheorem{example}{Example}[section]
\newtheorem{exmp}{Example}
%\newtheorem{remark}{Remark}
%\renewcommand{\abstractname}{Abstract}
\renewcommand{\abstractname}{K\={o}an}
\def\argmax{\text{argmax}}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{proposition}{Proposition}
\renewenvironment{proof}{\noindent {\bf Proof.\ }}{\hfill{\rule{2mm}{2mm}}}
%\renewenvironment{remark}{\noindent {\bf Remark.\ }} {\hfill{ \rule{2mm}{2mm}}}
%\renewenvironment{example}{\noindent {\bf Example.\ }}{\hfill{ \rule{2mm}{2mm}}}

\begin{document}
\bibliographystyle{jae}

\title{ Medde:\\ Maximum Entropy Deregularized Density Estimation }

\author{Roger Koenker}

\thanks{Version: \today . The author would like to thank Ivan Mizera and Jon
Wellner for very helpful conversations, and Fatih Guvenen for providing the
kernel density estimate reconsidered in Example \ref{ex.Guv}.  All of the
computational experience reported here was conducted in the R language with
the package \pkg{REBayes}, \cite{REBayes}.  Code for all of the reported
computation may be found in the vignette {\tt medde.Rnw} that appears as part
of that package.  This vignette is a very preliminary version of a much more
fully baked paper titled ``Shape Constrained Density Estimation via Penalized
Renyi Divergence'' that is now forthcoming in {\emph{Statistical Science}}}

\begin{abstract}
The trouble with likelihood is not the likelihood itself, it's the log.
\end{abstract}
\maketitle

%\noindent {\bf Keywords:}

<<echo = FALSE>>=
hasMosek <- requireNamespace("Rmosek", quietly = TRUE)
if(hasMosek)
    hasMosek <- tryCatch(example(mosek)$value$response$code == 0,
        warning = function(c) 0, error = function(c) 0)
if (!hasMosek) {
    knitr::opts_chunk$set(eval = FALSE)
    msg <- paste("This vignette requires Mosek, but this system",
                 "does not have Mosek",
                 "installed, so code examples will not be evaluated.")
    msg <- paste(strwrap(msg), collapse="\n")
    message(msg)
}
require("REBayes")
knitr::render_sweave()
options(prompt = "R> ", continue = "+ ", digits = 2, show.signif.stars = FALSE)
cleanup <- FALSE
@

\section{Introduction}
\label{sec:intro}

Suppose that you would like to estimate a unimodal density.  There are several
exploratory inference approaches that could be employed, notably \cite{Cox66},
\cite{S.81, S.83} and \cite{HH.85}.  More recently interest has focused on
maximum likelihood estimation of log concave densities as described by
\cite{Rufibach}, \cite{Walther} and \cite{CSS}, who offer a more direct
approach to estimation of {\it{strongly}} unimodal densities as characterized
by \cite{I.56}.  Weaker notions of unimodality have been explored in
\cite{KM.10} and \cite{HW.16}.  In this note I would like to briefly describe
some further experience with these weaker forms of unimodality and their
relevance for both shape constrained and norm constrained estimation of
densities.  Our path leads us away from the well paved highway of maximum
likelihood, but it is arguably more scenic.  I will briefly review means of
order $\rho$ and their connection to classes of concave densities and \Renyi
entropy, and then illustrate their application to shape constrained and norm
constrained density estimation.  Further details about the computational
methods are provided in Section \ref{sec.medde}.

\section{Means of order $\rho$ and a Hierarchy of Concave Densities}

A natural hierarchy of concave functions can be built on the foundation of the
weighted means of order $\rho$ studied by \citet*{HLP}.  For any $p$ in the
unit simplex, $\SS = \{ p \in \RR^n | p \geq 0, \sum p_i =1 \}$, let
\[
M_{\rho} (a;p) = M_{\rho}(a_1, \cdots , a_n ; p) =
\Bigl(\sum_{i=1}^n p_i a_i^{\rho} \Bigr)^{\! 1/\rho}, \quad \rho \ne 0,
\]
with $M_{0} (a;p) = M_{0}(a_1, \cdots , a_n ; p) = \prod_{i=1}^n a_i^{p_i}$,
interpreted as the limit as $\rho \rightarrow 0$.  The familiar arithmetic,
geometric, and harmonic means correspond to $\rho$ equal to $1$, $0$, and
$-1$, respectively.  Following \citet{A.72}, a non-negative, real function
$f$, defined on a convex set $C \subset \RR^d$, is called \emph{$\rho$-concave}
if for any $x_0, x_1 \in C$, and $p \in \SS$,
\[
f(p_0 x_0 + p_1 x_1 ) \geq M_\rho (f(x_0), f(x_1); p).
\]
In this terminology $\log$-concave functions are $0$-concave, and concave
functions are $1$-concave.  Since $M_\rho(a;p)$ is monotone increasing in
$\rho$ for $a \geq 0$ and any $p \in \SS$, it follows that if $f$ is
$\rho$-concave, then $f$ is also $\rho '$-concave for any $\rho ' < \rho$.
Thus, concave functions are $\log$-concave, but not vice-versa.  In the limit,
$-\infty$-concave functions satisfy the condition
\[
f(p_0 x_0 + p_1 x_1 ) \geq \min \{f(x_0), f(x_1) \},
\]
so they are \emph{quasi-concave}, and consequently so are all $\rho$-concave
functions.  Further details and motivation for $\rho$-concave densities can be
found in \citet{Prekopa}, \citet{Borell}, and \citet{DJD}.
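To fix ideas, the following small numerical illustration, which is not part of
the estimation code proper and whose helper function \code{Mrho} is purely
illustrative, evaluates $M_\rho(a;p)$ for the harmonic, geometric, and
arithmetic cases and confirms the monotonicity in $\rho$ noted above.

<<Mrho>>=
# Weighted mean of order rho, M_rho(a; p), with the geometric mean as the rho = 0 limit
Mrho <- function(a, p, rho)
    if(rho == 0) prod(a^p) else sum(p * a^rho)^(1/rho)
a <- c(1, 4, 9)
p <- rep(1/3, 3)
round(sapply(c(-1, 0, 1), function(r) Mrho(a, p, r)), 3)  # harmonic <= geometric <= arithmetic
@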
\subsection{Estimation of log concave densities}
A probability density function, $f$, is called \emph{log-concave} if $-\log f$
is a (proper) convex function on the support of $f$.  Maximum likelihood
estimation of log concave densities can be formulated as a convex optimization
problem.  Let $X = \{ X_1 , \cdots , X_n \}$ be a collection of data points in
$\RR^d$ such that the convex hull of $X$, $\HH (X)$, has a nonempty interior
in~$\RR^d$; such a configuration occurs with probability $1$ if $n \geq d+1$
and the $X_i$ behave like a random sample from $f_0$, a probability density
with respect to the Lebesgue measure on $\RR^d$.  Setting $g = -\log f$, we
can formulate the maximum likelihood problem as,
\begin{equation} \label{!lcmlelog}
\min_g \Bigl\{ \sum_{i=1}^n g(X_i) \; | \; \int\!\mathrm{e}^{-g} \,dx = 1, \;
g \in \KK(X) \Bigr\}, \tag{$\PP_0$}
\end{equation}
where $\KK(X)$ denotes the class of closed convex functions on $\HH(X) \subset
\RR^d$.  As shown in \cite{KM.10} such problems have solutions that admit a
finite dimensional characterization:  $\hat g$ is determined by its values at
the observed $X_i$, with values of $g$ elsewhere given by linear
interpolation.  This primal problem has an equivalent dual formulation as,
\begin{equation} \label{!mledual}
\max_f \Bigl\{ - \int f \log f \,dy \; | \; f = d(P_n-G)/dy, \quad G \in \KK
(X)^o \Bigr\},\tag{$\DD_0$}
\end{equation}
where $\KK(X)^o$ denotes the polar cone corresponding to $\KK(X)$, that is,
\begin{equation*}
\KK(X)^o = \Bigl\{ G \in \mathcal{C}^\ast (X) \mid \int g \,dG \leq 0 \text{
for all } g \in \KK(X) \Bigr\},
\end{equation*}
and $\mathcal{C}^\ast (X)$ denotes the space of (signed) Radon measures on
$\HH(X)$, whose distinguished element is $P_n$, the empirical measure
supported by the data points $\{ X_i, \; i = 1, \cdots , n \}$.

It is a notable feature of the log concave MLE that it is tuning parameter
free:  it is well posed without any further need for regularization.  This
feature carries over to the weaker forms of concave, shape constrained
estimators we will consider next.  It is hardly surprising in view of prior
likelihood experience that our dual formulation involves maximizing Shannon
entropy, since we are already well aware of the close connection to
Kullback-Leibler divergence.

One potentially disturbing aspect of the foregoing formulation is the finding
that solutions $\hat g$ must be piecewise linear, so the estimated density
$\hat f$ must be piecewise exponential, which, when extrapolated into the
tails, implies sub-exponential tail behavior.  This finding motivated
consideration of weaker forms of concavity that permit heavier tail behavior
as well as more peaked densities.  A second, seemingly anomalous feature of
the dual problem is the fact that $G$ must be chosen to annihilate the jumps
in $P_n$ in order to produce a density, $f$, with respect to Lebesgue measure.
Further details on this may be found in \cite{KM.10}.

\subsection{\Renyi likelihood and weaker forms of concave densities}
Given the appearance of Shannon entropy in the likelihood formulation of log
concave density estimation, it is natural, or at least tempting, to consider
the family of \cite{r.61} entropies,
\[
R_\alpha (f) = (1 - \alpha)^{-1} \log \Bigl( \int f^\alpha (x) \,dx \Bigr)
\]
as a vehicle for estimating $\rho$-concave densities.  Shannon entropy is
conveniently nested in this family as the limiting case $\alpha = 1$.  We will
focus attention on $\alpha < 1$, corresponding to $\rho = \alpha - 1 < 0$.
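As a quick sanity check on this parameterization, the following snippet, again
purely illustrative and not part of the package, approximates $R_\alpha$ for
the standard Gaussian by numerical integration and confirms that the Shannon
entropy, $\tfrac{1}{2}\log(2\pi e) \approx 1.42$, is recovered as $\alpha
\rightarrow 1$.

<<Renyicheck>>=
# Renyi entropy R_alpha(f) of the standard Gaussian by numerical integration;
# the alpha = 1 branch is the Shannon limit, -int f log f
Renyi <- function(alpha, f = dnorm, lo = -10, hi = 10) {
    if(alpha == 1) integrate(function(x) -f(x) * log(f(x)), lo, hi)$value
    else log(integrate(function(x) f(x)^alpha, lo, hi)$value) / (1 - alpha)
}
round(sapply(c(0.5, 0.9, 0.99, 1), Renyi), 3)
@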
Maximizing $R_\alpha (f)$ with respect to $f$ is equivalent to maximizing, for
fixed $\alpha < 1$,
\[
\alpha^{-1} \int f^\alpha (x) \,dx.
\]
This yields the new pairing of primal and dual problems:\footnote{The primal
version of this pairing corrects an error in formula (3.7) of \cite{KM.10} in
the event that $\alpha < 0$.}
\begin{equation} \label{mleR}
\min_g \Bigl\{ \sum_{i=1}^n g(X_i) + \frac{|1 - \alpha|}{\alpha} \int g^\beta
\,dx \; \vert \: g \in \KK(X) \Bigr\}, \tag{$\PP_\alpha$}
\end{equation}
and
\begin{equation}
\max_f \Bigl\{ \frac{1}{\alpha} \int f^\alpha (y) \,dy \; \vert \; f =
d(P_n-G)/dy, \quad G \in \KK (X)^o \Bigr\},\tag{$\DD_\alpha$}
\end{equation}
with $\alpha$ and $\beta$ conjugates in the usual sense that $\alpha^{-1} +
\beta^{-1} = 1$.  This formulation weakens the unimodality constraint,
admitting a larger class of heavier tailed, more peaked densities; at the same
time it modifies the fidelity criterion, replacing log likelihood with a
criterion based on \Renyi entropy.  Why not stick with log likelihood and just
modify the constraint, as suggested by \cite{SW.10}?  The pragmatic reason is
that modifying both preserves an extremely convenient form of the convex
optimization problem.  This motivation is further elaborated in \cite{KM.10}.
From a more theoretical perspective weaker concavity requirements pose
difficulties for the standard likelihood formulation; \cite{DW.16} elaborate
on these difficulties and demonstrate, among many other things, that when
imposing concavity constraints with $\alpha < 0$ the MLE fails to exist.

\vspace{3mm}
\begin{exmp}\label{ex.Guv}
In \cite{KM.10} we stressed the (Hellinger) case that $\alpha = 1/2$, so
$\beta = -1$, and $g = 1/\sqrt{f}$.  Since $g$ is constrained to be convex, or
equivalently $-1/\sqrt{f}$ to be concave, we conclude that the estimated
density, $f$, is $\rho$-concave for $\rho = -1/2$, a class that includes all
of the Student t densities with degrees of freedom $\nu \geq 1$, as well as
all the log concaves.  To illustrate the applicability of this Hellinger
criterion within the \Renyi family, we reconsider a density estimation problem
arising in econometrics.  \cite{Guvenen} have estimated models of income
dynamics using very large (10 percent) samples of U.S. Social Security records
linked to W2 data.  Their work reveals quite extreme tail behavior in annual
log income increments.  In the left panel of Figure \ref{fig:Guv} we reproduce
the Guvenen et al.\ plot of the logarithm of a conventional kernel density
estimate of annual increments of log income based on their sample.  Clearly,
this density is {\it{not}} log-concave; however, when we plot instead
$-1/\sqrt{f}$ we see that concavity looks extremely plausible.  When the
\Renyi estimate is superimposed in red, it fits almost perfectly.

<<echo = FALSE>>=
Guv.cap <- "Density estimation of annual increments in log income for U.S. individuals over the period 1994-2013.  The left panel of the figure reproduces a plot of the logarithm of a kernel density estimate from \\cite{Guvenen} showing that annual income increments are clearly not log concave.  However the middle panel showing $-1/\\sqrt{f}$ does appear to be nicely concave and is fit remarkably well by the \\Renyi procedure with $\\alpha = 1/2$."
@

<<Guv, fig.cap = Guv.cap>>=
# Estimate Hellinger Model for Guvenen Annual Income Increment Data
data(Guvenen)
x <- Guvenen[,1]
y <- Guvenen[,2] # Log(f(x))
par(mfrow = c(1,3))
par(mgp = c(2,1,0))
# Left Panel:  Log Concavity is violated
plot(x, y, xlab = "x ~ log income annual increments", ylab = "log f(x)",
     type = "l")
# But Hellinger Concavity is plausible
plot(x, -1/sqrt(exp(y)), xlab = "x ~ log income annual increments",
     ylab = expression(-1/sqrt(f(x))), type = "l")
w <- exp(y)*diff(x)[1]
f <- medde(x, w = w, lambda = - 0.5, alpha = 0.5)
s <- (f$x > -3.7) & (f$x < 3.7)
xx <- f$x[s]
yy <- f$y[s]
zz <- -1/sqrt(yy)
lines(xx, zz, col = "red")
# Right Panel:  Density Estimates
plot(xx, yy, xlab = "x ~ log income annual increments", ylab = "f(x)",
     type = "l")
lines(x, exp(y), col = "red")
@
\end{exmp}

Permitting Cauchy tail behavior may be regarded as sufficiently indulgent for
most statistical purposes, but the next example illustrates that a more
extreme \Renyi fitting criterion with $\alpha < 1/2$ is sometimes needed to
accommodate sharp peaks in the target density.

\vspace{3mm}
\begin{exmp}\label{ex.velo}
We reconsider the rotational velocity of stars data analyzed previously in
\cite{KM.10}.  The data was taken originally from \cite{hofwar91} and is
available from the R package {\pkg{REBayes}}.  Figure \ref{fig:velo}
illustrates a histogram of the 3806 positive rotational velocities from the
original sample of 3933.  After dropping the 127 zero velocity observations,
the histogram looks plausibly unimodal and we superimpose three distinct
\Renyi shape constrained estimates.  The Hellinger ($\alpha = 1/2$) estimate
is clearly incapable of capturing the sharp peak around $x = 18$, and even the
fit for $\alpha = 0$ fails to do so.  But pressing further, we see that
setting $\alpha = -2$ provides an excellent fit by constraining $-1/f^3$ to be
concave.

<<echo = FALSE>>=
velo.cap <- "Rotational velocity of stars with three quasi concave shape constrained density estimates using the \\Renyi likelihood."
@

<<velo, fig.cap = velo.cap>>=
# Star velocity Hellinger plot in K&M Quasi paper Revisited
data(velo)
x <- velo
x <- x[!(x == 0)] # drop zero velocities
hist(x, 100, freq = FALSE, main = "")
alphas <- c(0.5, 0, -2)
for(i in 1:3){
    f <- medde(x, v = 1000, lambda = -0.5, alpha = alphas[i], rtol = 1e-8)
    lines(f$x, f$y, col = i+1)
}
leg <- as.expression(lapply(c(1/2,0,-2), function(x) bquote(alpha ==.(x))))
legend(300, 0.025, leg, lty = 1, col = 2:4)
@
\end{exmp}

\section{\Renyi likelihood and norm constrained density estimation}

Although our original intent for the \Renyi likelihood was strictly pragmatic
-- to preserve the convexity of the optimization problem underlying the
estimation while imposing weaker forms of the concavity constraint -- I would
now like to briefly consider its use in norm constrained settings where the
objective of penalization is smoothness of the estimated density rather than a
shape constraint.  There is a long tradition of norm penalized nonparametric
maximum likelihood estimation of densities.  Perhaps the earliest example is
\cite{Goo71} who proposed the penalty,
\[
J(f) = \int \bigl( (\sqrt{f})' \bigr)^2 \,dx,
\]
which shrinks the estimated density toward densities with smaller Fisher
information for location.  The deeper rationale for this form of shrinkage
remains obscure, and most of the subsequent literature has instead focused on
penalizing derivatives of $\log f$, with the familiar cubic smoothing spline
penalty,
\[
J(f) = \int \bigl( (\log f)'' \bigr)^2 \,dx,
\]
receiving most of the attention.
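To give these functionals a more concrete feel, the short sketch below, which
is illustrative only and not part of the estimation code, evaluates crude
finite difference versions of both penalties for the standard Gaussian on a
bounded grid; the object names \code{xg} and \code{fg} are of course just
placeholders.  Good's penalty is one quarter of the Fisher information for
location, so the first value should be near $1/4$; since $\log f$ is quadratic
for the Gaussian, its second derivative is constant and the spline penalty
simply accumulates the length of the grid.

<<penalties>>=
# Finite difference approximations of the two penalties above for the N(0,1)
xg <- seq(-4, 4, by = 0.01)
fg <- dnorm(xg)
h <- diff(xg)[1]
c(Good = sum(diff(sqrt(fg))^2 / h),                      # ~ I(f)/4 = 1/4
  Spline = sum(diff(log(fg), differences = 2)^2 / h^3))  # ~ length of the grid
@

Note that since $\log f$ is quadratic for the Gaussian, any penalty on its
third derivative vanishes there; this observation motivates the penalties
considered next.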
\cite{Sil82} proposed penalizing the squared $L_2$ norm of the {\it{third}}
derivative of $\log f$ as a means of shrinking toward the Gaussian density.
Squared $L_2$ norm penalties are ideal for smoothly varying densities, but
they abhor sharp bends and kinks, so there has been some interest in exploring
total variation penalization as a way to expand the scope of penalty methods.
The taut-string methods of \cite{dk.01,dk.04} penalize total variation of the
density itself.  \cite{KM.07} describe some experience with penalties of the
form,
\[
J(f) = \int | (\log f)'' | \,dx,
\]
that penalize the total variation of the first derivative of $\log f$.  In the
spirit of \cite{Sil82} the next example illustrates penalization of the total
variation of the second derivative of $\log f$, that is, the $L_1$ norm of its
third derivative, again with the intent of shrinking toward the Gaussian, but
in a manner somewhat more tolerant of abrupt changes in the derivatives than
with Silverman's squared $L_2$ norm.

<<echo = FALSE>>=
Sil.cap <- "Gaussian histogram based on 500 observations and two penalized maximum likelihood estimates with total variation norm penalty and $\\lambda \\in \\{ 1, 0.05 \\}$."
@

<<Silverman, fig.cap = Sil.cap>>=
# Silverman type total variation roughness penalty estimation in medde
# Silverman BW (1982).  On the Estimation of a Probability Density Function
# by the Maximum Penalized Likelihood Method.  Annals of Statistics, 10, 795-810
n <- 500
set.seed(18)
x <- rnorm(n)
hist(x, 70, freq = FALSE, main = "", col = grey(.95))
f <- medde(x, Dorder = 2, lambda = 1, verb = 0)
lines(f, col = "red")
f <- medde(x, Dorder = 2, lambda = 0.05, verb = 0)
lines(f, col = "blue")
@

\vspace{3mm}
\begin{exmp}\label{ex.Silverman}
In Figure \ref{fig:Silverman} we illustrate a histogram based on 500 standard
Gaussian observations, and superimpose two fitted densities estimated by
penalized maximum likelihood as solutions to
\[
\min_f \Bigl\{ - \sum_{i=1}^n \log f(X_i) + \lambda \int | (\log f)''' | \,dx
\Bigr\},
\]
for two choices of $\lambda$.  For $\lambda$ sufficiently large solutions to
this problem conform to the parametric Gaussian MLE since the penalty forces
the solution to take a Gaussian shape, but does not constrain the location or
scale of the estimated density.  For smaller $\lambda$ we obtain a more
oscillatory estimate that conforms more closely to the vagaries of the
histogram.
\end{exmp}
\vspace{3mm}

Penalizing the total variation of $(\log f)''$ as in Figure
\ref{fig:Silverman} raises the question:  What about other \Renyi exponents
for $\alpha \neq 1$?  Penalizing $(\log f)''$ implicitly presumes
sub-exponential tail behavior that may be better controlled by weaker \Renyi
penalties.  To explore this conjecture we consider in the next example the
estimation of a mixture of three lognormals.

<<echo = FALSE>>=
lnorm.cap <- "Mixture of three lognormals histogram and two \\Renyi likelihood estimates with total variation ($L_1$ norm) penalty with $\\alpha \\in \\{ 0, 1 \\}$."
@

<<lnormmix, fig.cap = lnorm.cap>>=
# Mixture of log normals example from ancient problem set
rlambda <- function(n, mu = c(0.5, 1.1, 2.6), sigma = c(0.2, 0.3, 0.2),
                    alpha = c(0.4, 1.2, 2.4), w = c( 0.33, 0.33, 0.34)) {
    #mixture of lognormals -- random numbers
    #No error checking!  w is a weight vector which should add to one.
    m <- length(w)
    w <- cumsum(w)
    U <- runif(n)
    W <- matrix(0, n, m)
    W[, 1] <- U < w[1]
    for(i in 2:m) {
        W[, i] <- (U < w[i]) & (U >= w[i - 1])
    }
    z <- rep(0, n)
    for(i in 1:m) {
        z <- z + W[, i] * (alpha[i] + exp(rnorm(n, mu[i], sigma[i])))
    }
    z
}
dlambda <- function(z, mu = c(0.5, 1.1, 2.6), sigma = c(0.2, 0.3, 0.2),
                    alpha = c(0.4, 1.2, 2.4), w = c( 0.33, 0.33, 0.34),
                    eps = 0.0001) {
    #mixture of lognormals density function
    m <- length(w)
    f <- 0 * z
    for(i in 1:m)
        f <- f + (w[i] * dnorm(log(pmax(z - alpha[i], eps)), mu[i], sigma[i]))/((z - alpha[i]))
    f
}
set.seed(17)
x <- rlambda(500)
hist(x, 200, freq = FALSE, border = "grey", main = "")
z <- seq(0, max(x), length = 2000)
lines(z, dlambda(z), col = 2)
f <- medde(x, lambda = 0.5, Dorder = 2, alpha = 0)
lines(f, col = "blue")
f <- medde(x, lambda = 0.05, Dorder = 2, alpha = 1)
lines(f, col = "green")
leg <- c("True", as.expression(lapply(c(0,1), function(x) bquote(alpha ==.(x)))))
legend(15, 0.4, leg, lty = 1, lwd = 1.5, col = c("red","blue","green"))
@

\vspace{3mm}
\begin{exmp}\label{ex.lnormmix}
Figure \ref{fig:lnormmix} illustrates a histogram based on 500 observations
from the mixture of lognormals density depicted in red.  I have used this
density for several years in class to illustrate how difficult it can be to
choose an effective bandwidth for conventional kernel density estimation.  A
bandwidth sufficiently small to distinguish the two left-most modes is almost
surely incapable of producing a smooth fit to the upper mode.  Logspline
methods as proposed by \cite{ks.91} perform much better in such cases, but
they can be sensitive to knot selection strategies.  The methods under
consideration here are allied more closely to the smoothing spline literature,
and thereby circumvent the knot selection task, but in so doing introduce new
knobs to turn and buttons to push.  Not only do we need to choose the familiar
$\lambda$; there is now also a choice of the order of the derivative in the
penalty, and the \Renyi exponent, $\alpha$, determining the transformation of
the density.  I would argue that these choices are more easily adapted to
particular applications, but others may disagree.  From a Bayesian
perspective, however, it seems indisputable that more diversity in the class
of tractable prior specifications is desirable.  Examining Figure
\ref{fig:lnormmix} we see that the $\alpha = 1$ maximum likelihood estimate is
a bit too smooth, failing to find the second mode, whereas the $\alpha = 0$
solution is too enthusiastic about fitting the first mode, but at least does
distinguish the second mode.  Both methods produce an excellent fit to the
third mode, almost indistinguishable from the true density.
\end{exmp}

\section{Discrete Implementation Details}
\label{sec.medde}

The discrete formulation of the variational problems described above leads to
extremely efficient algorithms that exploit modern interior point methods for
convex optimization.  All of our computational results were carried out with
the function \code{medde} from the package \pkg{REBayes} for the \proglang{R}
language, available from the CRAN website, \url{https://cran.r-project.org}.
This package relies in turn on the \cite{Mosek} optimization system and its
\cite{Rmosek} interface for the \proglang{R} language.  The current
implementation in \code{medde} is restricted to univariate densities, as is
our discussion here; \cite{KM.10} describes some extensions to bivariate
settings.
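As a quick orientation, the following sketch, which is merely illustrative and
reuses only parameter settings that already appear in the examples above,
contrasts the two ways \code{medde} has been called in this vignette:  a
negative \code{lambda} imposes the concavity (shape) constraint for the chosen
\Renyi \code{alpha}, while a positive \code{lambda} with \code{Dorder = 2}
imposes the $L_1$ penalty on the third derivative of the log density.  The
fitted object carries the evaluation grid \code{x} and fitted density values
\code{y} used for plotting; the names \code{fshape} and \code{fsmooth} are
just placeholders.

<<meddesketch>>=
# Two illustrative calls, reusing parameter settings that appear above
set.seed(42)
xs <- rnorm(200)
fshape <- medde(xs, lambda = -0.5, alpha = 0.5)             # Hellinger shape constrained fit
fsmooth <- medde(xs, lambda = 0.05, Dorder = 2, alpha = 1)  # L1 penalty on (log f)'''
plot(fshape$x, fshape$y, type = "l", xlab = "x", ylab = "f(x)")
lines(fsmooth$x, fsmooth$y, col = "blue")
@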
Most of the other functionality of the \pkg{REBayes} package is devoted to
empirical Bayes methods and is described in \cite{KG.16}.

For univariate densities convexity of piecewise linear functions can be
enforced by imposing linear inequality constraints on the set of function
values $\gamma_i = g(\xi_i)$ at selected points $\xi_1, \xi_2, \dots , \xi_m$.
We typically choose these $\xi_i$'s on an equally spaced grid of a few hundred
points, and the convex cone constraint can then be expressed as $D \gamma \geq
0$ for a tridiagonal matrix $D$ when the penalization is imposed on second
derivatives, as in the case of our concavity constraints, and a quindiagonal
one in the case of third derivative constraints.  By default in \code{medde}
we choose $m = 300$, with the support of the $\xi_i$'s extending slightly
beyond the empirical support of the observations.

As described in \cite{KM.10} the primal formulation of the shape-constrained
problem takes the discrete form,
\[
\{ w^\top L \gamma + s^\top \Psi(\gamma) \; | \; D \gamma \geq 0 \} = \min !
\tag{P}
\]
where $\Psi(\gamma)$ denotes an $m$-vector with typical element $\psi
(g(\xi_i)) = \psi (\gamma_i)$, $L$ is an ``evaluation operator'' which either
selects the elements from $\gamma$, or performs the appropriate linear
interpolation from the neighboring ones, so that $L \gamma$ denotes the
$n$-vector with typical element $g(X_i)$, and $w$ is an $n$-vector of
observation weights, typically $w_i \equiv 1/n$.  The matrix $D$ is the
discrete derivative operator that constrains the fitted function to lie in the
convex cone $\KK (X)$.  The vector $s$ denotes weights that impose the
integrability constraint on the fitted density.  As long as the grid is
sufficiently fine, in univariate settings the elements of $s$ can be taken to
be averages of the adjacent spacings between the $\xi_i$'s.

Associated with the primal problem (P) is the dual problem,
\[
\{ -s^\top \Psi^{\ast} (- \phi)\; | \; S \phi = -L^\top w + D^\top \eta, \;
\phi \geq 0, \; D^\top \eta \geq 0 \} = \max ! \tag{D}
\]
Here, $\eta$ is an $m$-vector of dual variables, $\phi$ is an $m$-vector of
function values representing the density evaluated at the $\xi_i$'s, and $S =
\diag (s)$.  The function $\Psi^{\ast}$ is the convex conjugate of $\Psi$,
defined coordinate-wise with typical element $\psi^{\ast} (y) = \sup_x \{ y x
- \psi(x) \}$.  Problems (P) and (D) are strongly dual; this duality may be
viewed as the discrete counterpart of Theorem 2 of \cite{KM.10}.

For $\Psi (x)$ with typical element $\psi (x) = e^{-x}$ we have $\Psi^{\ast}$
with elements $\psi^{\ast} (y) = -y \log y + y$, so the dual problem
corresponding to maximum likelihood can be interpreted as maximizing the
Shannon entropy of the estimated density subject to the constraints appearing
in (D).  Since $g$ was interpreted in (P) as $-\log f$, this result justifies
our interpretation of solutions of (D) as densities, provided that they
satisfy our integrability condition.  This is easily verified and thus
justifies the implicit Lagrange multiplier of one on the integrability
constraint in (P).  Solutions $\phi$ of (D) then satisfy $s^\top \phi = 1$ and
$\phi \geq 0$, as shown in Proposition 2 of \cite{KM.10}.  The crucial element
of the proof of this proposition is that the differencing operator $D$
annihilates the constant vector, and therefore the result extends immediately
to other norm-type penalties as well as to the other entropy objectives that
we have discussed.  Indeed, since the second difference operator representing
our convexity constraint annihilates any affine function, it follows by the
same argument that the mean of the estimated density also coincides with the
sample mean of the observed $X_i$'s.
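The following sketch, which mimics but is not the internal code of
\code{medde}, may help make these ingredients concrete:  it builds a small
grid $\xi$, integration weights $s$ (averages of adjacent spacings, with half
spacings at the endpoints as one possible choice), the second difference
operator $D$ whose constraint $D \gamma \geq 0$ enforces convexity of the
piecewise linear $g$, and an interpolating evaluation operator $L$ for a pair
of hypothetical observations.

<<discretesketch>>=
# Illustrative construction of the discrete objects described above (not the
# internal medde code): grid xi, integration weights s, second difference
# operator D, and the linear interpolation operator L for two observations.
m <- 31
xi <- seq(0, 1, length = m)
h <- diff(xi)
s <- c(h[1]/2, (h[-1] + h[-(m-1)])/2, h[m-1]/2)  # trapezoidal weights, sum(s) = 1
D <- diff(diag(m), differences = 2)              # (m-2) x m second difference matrix
xobs <- c(0.15, 0.62)
L <- t(sapply(xobs, function(xx) {               # interpolation weights for g(xobs)
    i <- findInterval(xx, xi)
    u <- (xx - xi[i]) / h[i]
    e <- numeric(m); e[i] <- 1 - u; e[i + 1] <- u
    e
}))
gamma <- (xi - 0.5)^2                            # a convex test function on the grid
all(D %*% gamma >= 0)                            # so the constraint D gamma >= 0 holds
rbind(interpolated = drop(L %*% gamma), exact = (xobs - 0.5)^2)
@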
For penalties with \Renyi exponents less than one, the dual formulation
replaces the Shannon entropy term by $\psi (y) = y^\alpha / \alpha$, except of
course for $\alpha = 0$, for which $\psi (y) = \log y$.  To implement the
total variation regularization rather than the concavity constraint, the $L_1$
penalty on $D \gamma$ in the primal becomes an $L_\infty$ constraint in the
dual, so in the dual formulation we simply constrain $\| D^\top \eta \|_\infty
\leq \lambda$, and similarly for total variation ($L_1$ norm) penalties on
higher order derivatives.

Code to reproduce each of the figures appearing above is available from the
\pkg{REBayes} package in the file \code{medde.Rnw} located in the
\code{vignettes} directory.  Readers are cautioned that although all of the
computational problems described above are strictly convex and therefore
possess unique solutions, extreme choices of the parameters can stress even
the excellent optimization software provided by Mosek.  In Example
\ref{ex.velo} we have seen that attempts to push the \Renyi $\alpha$ parameter
much below $-1$ cause difficulty.  In Example \ref{ex.Silverman} choices of
$\lambda$ somewhat larger than those reported here can also cause trouble.
Fortunately, it is relatively easy to find values of these parameters that are
within an empirically sensible range.

\section{Conclusion}

Density estimation by penalty methods is one of those [IJ] Good ideas of the
1970's that has matured rather slowly.  Fortunately, recent developments in
convex optimization have greatly expanded the menu of possible penalties, and
there are promising opportunities for embedding these methods into more
complex semi-parametric analyses.  Many aspects remain to be explored.  We
have elementary Fisher consistency results from \cite{KM.10} and some rate and
limiting distributional results from \cite{HW.16}, and others, but many
interesting theoretical questions remain.  It would be nice to know more about
multivariate extensions.  Little is known about the choice of the \Renyi
$\alpha$; can it be estimated in a reasonable way?  If only we could divert
some energy away from kernel methods, maybe some progress could be made in one
or more of these directions.

\bibliography{medde}
\end{document}