\documentclass[a4paper]{article}
\usepackage{amsmath}
\usepackage{bbm}
\usepackage[inline]{enumitem}
\usepackage[T1]{fontenc}
\usepackage[bookmarks=TRUE, colorlinks, pdfpagemode=none, pdfstartview=FitH,
citecolor=black, filecolor=black, linkcolor=blue, urlcolor=black,
]{hyperref}
\usepackage{graphicx}
\usepackage{icomma}
\usepackage[utf8]{inputenc}
\usepackage{mathtools}  % for extended pderiv arguments
\usepackage{natbib}
\usepackage{xargs}  % for extended pderiv arguments
\usepackage{xspace}
% \SweaveUTF8
\newcommand{\COii}{\ensuremath{\mathit{CO}_{2}}\xspace}
\DeclareMathOperator*{\E}{\mathbbm{E}}% expectation
\newcommand*{\mat}[1]{\mathsf{#1}}
\newcommand{\likelihood}{\mathcal{L}}% likelihood
\newcommand{\loglik}{\ell}% log likelihood
\newcommand{\maxlik}{\texttt{maxLik}\xspace}
\newcommand{\me}{\mathrm{e}}  % the constant e = 2.71828
\newcommandx{\pderiv}[3][1={}, 2={}]{\frac{\partial^{#2}{#1}}{\mathmbox{\partial{#3}}^{#2}}}
% #1: function to differentiate (optional, empty = write after the formula)
% #2: the order of differentiation (optional, empty = 1)
% #3: the variable to differentiate wrt (mandatory)
\newcommand{\R}{\texttt{R}\xspace}
\newcommand*{\transpose}{^{\mkern-1.5mu\mathsf{T}}}
\renewcommand*{\vec}[1]{\boldsymbol{#1}}
% \VignetteIndexEntry{Maximum likelihood estimation with maxLik}
\title{Maximum Likelihood Estimation with \emph{maxLik}}
\author{Ott Toomet}
\begin{document}
\maketitle

<<>>=
library(maxLik)
set.seed(6)
@

\section{Introduction}
\label{sec:introduction}

This vignette is intended for users who are familiar with the concepts of
likelihood and related methods, such as the information equality and the
BHHH approximation, and with the \R language.  The vignette focuses on
\maxlik usage and does not explain the underlying mathematical concepts.
The potential target group includes researchers, graduate students, and
industry practitioners who want to apply their own custom maximum
likelihood estimators.  If you need a refresher, consult the accompanying
vignette ``Getting started with maximum likelihood and \maxlik''.

The next section introduces the basic usage, including the \maxlik
function, the main entry point for the package; gradients; different
optimizers; and how to control the optimization behavior.  These are
topics that are hard to avoid when doing applied ML estimation.
Section~\ref{sec:advanced-usage} contains a selection of more niche
topics, including additional arguments to the log-likelihood function,
other types of optimization, testing condition numbers, and constrained
optimization.

\section{Basic usage}
\label{sec:basic-usage}

\subsection{The maxLik function}
\label{sec:maxlik-function}

The main entry point to \maxlik functionality is the function of the same
name, \verb|maxLik|.  It is a wrapper around the underlying optimization
algorithms which ensures that the returned object is of the right class,
so that one can use convenience methods such as \verb|summary| or
\verb|logLik|.  It is important to keep in mind that \maxlik
\emph{maximizes}, not minimizes, functions.

The basic usage of the function is very simple: just pass the
log-likelihood function (argument \verb|logLik|) and the start value
(argument \verb|start|).  Let us demonstrate the basic usage by
estimating the normal distribution parameters.  We create 100 standard
normal variables and estimate the best-fit mean and standard deviation.
Instead of explicitly coding the formula for the log-likelihood, we rely
on the \R function \verb|dnorm| (see
Section~\ref{sec:different-optimizers} for a version that does not use
\verb|dnorm|):
<<>>=
x <- rnorm(100)  # data.  true mu = 0, sigma = 1
loglik <- function(theta) {
   mu <- theta[1]
   sigma <- theta[2]
   sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
}
m <- maxLik(loglik, start=c(mu=1, sigma=2))
                           # give start value somewhat off
summary(m)
@
The algorithm converged in 7 iterations, and one can check that the
results are equal to the sample mean and
variance.\footnote{Note that the \R function \texttt{var} returns the
  unbiased estimator, using denominator $n-1$; the ML estimator is
  biased, with denominator $n$.}
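Indeed, we can verify this directly by comparing the estimates with the
sample moments (a quick check; note the ML variance estimator uses
denominator $n$, hence the correction factor):
<<>>=
coef(m)
c(mu=mean(x), sigma=sqrt(var(x)*(length(x) - 1)/length(x)))
@
The numbers should agree up to the convergence tolerance of the
optimizer.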
This example demonstrates a number of key features of \verb|maxLik|:
\begin{itemize}
\item The first argument of the log-likelihood function must be the
  parameter vector.  In this example we define it as
  $\vec{\theta} = (\mu, \sigma)$, and the first lines of \verb|loglik|
  extract these values from the vector.
\item The \verb|loglik| function returns a single number, the sum of the
  log-likelihood contributions of the individual components of $x$.  (It
  may also return the contributions individually; see the BHHH method in
  Section~\ref{sec:different-optimizers} below.)
\item The vector of start values must be of the correct length.  If its
  components are named, those names are also displayed by \verb|summary|
  (and used by \verb|coef| and \verb|stdEr|; see below).
\item The \verb|summary| method displays a handy overview of the results,
  including the convergence message, the estimated values, and
  statistical significance.
\item \verb|maxLik| (and the other optimizers in the package) is a
  \emph{maximizer}, not a minimizer.
\end{itemize}
As we did not specify the optimizer, \verb|maxLik| picked Newton-Raphson
by default and computed the necessary gradient and Hessian matrix
numerically.
\bigskip

Besides \verb|summary|, \maxlik also contains a number of utility
functions to simplify handling of estimated models:
\begin{itemize}
\item \verb|coef| extracts the model coefficients:
<<>>=
coef(m)
@
\item \verb|stdEr| returns the standard errors (by inverting the
  Hessian):
<<>>=
stdEr(m)
@
\item Other functions include \verb|logLik| to return the log-likelihood
  value, \verb|returnCode| and \verb|returnMessage| to return the
  convergence code and message respectively, and \verb|AIC| to return
  Akaike's information criterion (see the short demonstration after this
  list, and the respective documentation for more information).
\item One can also query the number of observations with \verb|nObs|, but
  this requires the likelihood values to be supplied by observation (see
  the BHHH method in Section~\ref{sec:different-optimizers} below).
\end{itemize}
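As a short demonstration of these accessors, applied to the model
\verb|m| estimated above:
<<>>=
logLik(m)
returnMessage(m)
AIC(m)
@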
\subsection{Supplying analytic gradient}
\label{sec:supplying-gradients}

The simple example above worked fast and well.  In particular, the
numeric gradient \verb|maxLik| computed internally did not pose any
problems.  But users are strongly advised to supply an analytic gradient,
or even better, both the gradient and the Hessian matrix.  More complex
problems may be intractably slow, converge to a sub-optimal solution, or
not converge at all if the numeric gradients are noisy.  Needless to say,
an unreliable Hessian also leads to unreliable inference.  Here we show
how to supply the gradient to the \verb|maxLik| function.

We demonstrate this with a linear regression example.  Non-linear
optimizers perform best in regions where the level sets (contours) are
roughly circular.  In the following example we use data on very different
scales, which makes the level sets of the log-likelihood function
extremely elongated ellipses.  Now the Newton-Raphson algorithm fails to
converge when relying on numeric derivatives, but works well with the
analytic gradient.

% using matrix notation
We combine three vectors, $\vec{x}_{1}$, $\vec{x}_{2}$ and
$\vec{x}_{3}$, created on very different scales, into the design matrix
$\mat{X} = \begin{pmatrix} \vec{x}_{1} & \vec{x}_{2} & \vec{x}_{3}
\end{pmatrix}$ and compute $\vec{y}$ as
\begin{equation}
  \label{eq:linear-regression-matrix}
  \vec{y} = \mat{X}
  \begin{pmatrix}
    1 \\ 1 \\ 1
  \end{pmatrix}
  + \vec{\epsilon}.
\end{equation}
We create $\vec{x}_{1}$, $\vec{x}_{2}$ and $\vec{x}_{3}$ as random
normals with standard deviations of 1, 1000 and $10^{7}$ respectively,
and let $\vec{\epsilon}$ be a standard normal disturbance term:
<<>>=
## create 3 variables with very different scale
X <- cbind(rnorm(100), rnorm(100, sd=1e3), rnorm(100, sd=1e7))
## note: correct coefficients are 1, 1, 1
y <- X %*% c(1,1,1) + rnorm(100)
@
Next, we maximize the negative of the sum of squared errors \emph{SSE}
(remember, \verb|maxLik| is a maximizer, not a minimizer),
\begin{equation}
  \label{eq:ols-sse-matrix}
  \mathit{SSE}(\vec{\beta}) =
  (\vec{y} - \mat{X} \cdot \vec{\beta})^{\transpose}
  (\vec{y} - \mat{X} \cdot \vec{\beta}),
\end{equation}
as this is equivalent to likelihood maximization:
<<>>=
negSSE <- function(beta) {
   e <- y - X %*% beta
   -crossprod(e)  # note '-': we are maximizing
}
m <- maxLik(negSSE, start=c(0,0,0))
                           # give start values a bit off
summary(m, eigentol=1e-15)
@
As one can see, the algorithm gets stuck and fails to converge; the final
parameter values are also far from the correct value $(1, 1, 1)$.  We
have amended \verb|summary| with an extra argument,
\verb|eigentol=1e-15|; otherwise \maxlik refuses to compute standard
errors for a near-singular Hessian (see the documentation of
\verb|summary.maxLik|).  It makes no difference right here, but we want
to keep the call consistent with the two following examples.

Now let us improve the performance with an analytic gradient.  The
gradient of \emph{SSE} can be written as
\begin{equation}
  \label{eq:ols-sse-gradient-matrix}
  \pderiv{\vec{\beta}}\mathit{SSE}(\vec{\beta}) =
  -2(\vec{y} - \mat{X}\vec{\beta})^{\transpose} \mat{X}.
\end{equation}
\maxlik uses the numerator layout, i.e.\ the derivative of the scalar
log-likelihood with respect to the column vector of parameters is a row
vector.  We can code the negative of the gradient as
<<>>=
grad <- function(beta) {
   2*t(y - X %*% beta) %*% X
}
@
We can pass the gradient to \verb|maxLik| through the additional argument
\verb|grad|:
<<>>=
m <- maxLik(negSSE, grad=grad, start=c(0,0,0))
summary(m, eigentol=1e-15)
@
Now the algorithm converges rapidly, and the estimate is close to the
true value.

Let us also add the analytic Hessian, in this case
\begin{equation}
  \label{eq:ols-sse-hessian-matrix}
  \frac{\partial^{2}}{\partial\vec{\beta}\,\partial\vec{\beta}^{\transpose}}
  \mathit{SSE}(\vec{\beta}) = 2\mat{X}^{\transpose}\mat{X},
\end{equation}
and we implement the negative of it as
<<>>=
hess <- function(beta) {
   -2*crossprod(X)
}
@
The analytic Hessian matrix can be included with the argument
\verb|hess|, and now the results are
<<>>=
m <- maxLik(negSSE, grad=grad, hess=hess, start=c(0,0,0))
summary(m, eigentol=1e-15)
@
The analytic Hessian did not change the convergence behavior here.  Note
that as the objective function is quadratic, Newton-Raphson should find
the correct solution in a single iteration.  However, this example has
numerical issues when inverting the near-singular Hessian.  One can
easily check that if the covariates are created on a less extreme scale,
convergence is indeed immediate.
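For instance, we can rescale the columns of the design matrix to a
similar magnitude and re-estimate the model.  The quick check below
re-uses the data from above; note that the true coefficients are now
$(1, 10^{3}, 10^{7})$ because of the rescaling:
<<>>=
Xs <- X %*% diag(c(1, 1e-3, 1e-7))  # columns now on a similar scale
negSSEs <- function(beta) -crossprod(y - Xs %*% beta)
ms <- maxLik(negSSEs,
             grad=function(beta) 2*t(y - Xs %*% beta) %*% Xs,
             hess=function(beta) -2*crossprod(Xs),
             start=c(0,0,0))
summary(ms)
@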
While using the separate arguments \texttt{grad} and \texttt{hess} is
perhaps the most straightforward way to supply derivatives, \maxlik also
supports the gradient and Hessian supplied as attributes of the
log-likelihood value.  This is motivated by the fact that computing the
gradient often involves many of the same computations as the
log-likelihood, and one may want to re-use some of the results.  We
demonstrate this on the same example, by writing a version of the
log-likelihood function that also computes the gradient and Hessian:
<<>>=
negSSEA <- function(beta) {
   ## negative SSE with attributes
   e <- y - X %*% beta  # we will re-use 'e'
   sse <- -crossprod(e)  # note '-': we are maximizing
   attr(sse, "gradient") <- 2*t(e) %*% X
   attr(sse, "Hessian") <- -2*crossprod(X)
   sse
}
m <- maxLik(negSSEA, start=c(0,0,0))
summary(m, eigentol=1e-15)
@
The log-likelihood with ``gradient'' and ``Hessian'' attributes,
\verb|negSSEA|, computes the log-likelihood as above, but also computes
its gradient and adds it as the attribute ``gradient'' to the
log-likelihood value.  This offers a potential efficiency gain, as the
residuals $\vec{e}$ are re-used.  \maxlik checks for the presence of the
attribute and, if it is there, uses the provided gradient.  In real
applications the efficiency gain depends on the amount of computation
re-used, and on the number of likelihood calls versus gradient calls.

While analytic gradients are always helpful and often necessary, they may
be hard to derive and code.  In order to help to derive and debug the
analytic gradient, the package provides another function,
\verb|compareDerivatives|, which takes the log-likelihood function and
the analytic gradient, and compares the analytic and numeric gradients.
As an example, we compare the log-likelihood and gradient functions we
just coded:
<<>>=
compareDerivatives(negSSE, grad, t0=c(0,0,0))
                           # 't0' is the parameter value
@
The function prints the analytic gradient, the numeric gradient, their
relative difference, and the largest relative difference (in absolute
value).  The latter is handy in case of long gradient vectors where it
may be hard to spot a single component that is off.  For reasonably
smooth functions, expect the relative difference to be smaller than
$10^{-7}$.  In this example, however, the numeric gradients are clearly
problematic.  \verb|compareDerivatives| supports vector functions, so one
can test the analytic Hessian in the same way by calling
\verb|compareDerivatives| with the gradient function \verb|grad| as the
first argument and the analytic Hessian \verb|hess| as the second
argument.
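For instance, the following call (not evaluated here) would compare the
analytic Hessian \verb|hess| with the numeric derivative of \verb|grad|:
<<eval=FALSE>>=
compareDerivatives(grad, hess, t0=c(0,0,0))
@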
\subsection{Different optimizers}
\label{sec:different-optimizers}

By default, \maxlik uses the Newton-Raphson optimizer, but one can easily
swap the optimizer with the \verb|method| argument.  The supported
optimizers include ``NR'' for the default Newton-Raphson, ``BFGS'' for
the gradient-only Broyden-Fletcher-Goldfarb-Shanno method, ``BHHH'' for
the information-equality-based Berndt-Hall-Hall-Hausman method, and
``NM'' for the gradient-less Nelder-Mead method.

Different optimizers may be based on very different approaches, and
certain concepts, such as \emph{iteration}, may mean quite different
things.  For instance, although Newton-Raphson is a simple, fast and
intuitive method that approximates the function with a parabola, it needs
to know the Hessian matrix (the second derivatives).  This is usually
even harder to program than the gradient, and even slower and more
error-prone when computed numerically.

Let us replace NR with the gradient-only BFGS method.  It is a
quasi-Newton method that computes its own internal approximation of the
Hessian while relying only on gradients.  We re-use the data and
log-likelihood function from the first example where we estimated the
normal distribution parameters:
<<>>=
m <- maxLik(loglik, start=c(mu=1, sigma=2), method="BFGS")
summary(m)
@
One can see that the results are identical, but while NR converged in 7
iterations, it took 20 iterations for BFGS.  In this example the errors
of the internal BFGS Hessian approximation were larger than the errors of
the numerically computed Hessian, but this may not be true for more
complex objective functions.  In a similar fashion, one can simply drop
in most of the other provided optimizers.
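For instance, the derivative-free Nelder-Mead method works as a drop-in
replacement:
<<>>=
m <- maxLik(loglik, start=c(mu=1, sigma=2), method="NM")
coef(m)
@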
One method that is very popular for ML estimation is BHHH.  We discuss it
here at some length because the method requires both the log-likelihood
function and the gradient function to return a somewhat different type of
value.  The essence of BHHH is the information equality: for a
log-likelihood function $\loglik(\vec{\theta})$, the expected value of
the Hessian at the true parameter value $\vec{\theta}_{0}$ can be
expressed through the expected value of the outer product of the
gradient:
\begin{equation}
  \label{eq:information-equality}
  \E \left[
    \frac{\partial^2 \loglik(\vec{\theta})}
    {\partial\vec{\theta}\, \partial\vec{\theta}^{\transpose}}
  \right]_{\vec{\theta} = \vec{\theta}_0}
  =
  - \E \left[
    \left.
      \frac{\partial \loglik(\vec{\theta})}
      {\partial\vec{\theta}^{\transpose}}
    \right|_{\vec{\theta} = \vec{\theta}_0}
    \cdot
    \left.
      \frac{\partial \loglik(\vec{\theta})}
      {\partial\vec{\theta}}
    \right|_{\vec{\theta} = \vec{\theta}_0}
  \right].
\end{equation}
Hence we can approximate the Hessian by the negative of the average outer
product of the gradient.  Obviously, this is only an approximation, and
it is less accurate when we are far from the true value
$\vec{\theta}_{0}$.  Note also that when approximating the expected value
by the sample average we rely on the assumption that the observations are
independent.  This may not be true for certain types of data, such as
time series.

However, in order to compute the average outer product, we need to
compute the gradient \emph{by observation}.  Hence it is not enough to
return a single gradient vector; we have to compute a matrix where the
rows correspond to the individual data points and the columns to the
gradient components.

We demonstrate the BHHH method by replicating the normal distribution
example from above.  Remember, the normal probability density is
\begin{equation}
  \label{eq:normal-pdf}
  f(x; \mu, \sigma) =
  \frac{1}{\sqrt{2\pi}} \frac{1}{\sigma} \,
  \me^{ -\displaystyle\frac{1}{2} \frac{(x - \mu)^{2}}{\sigma^{2}} },
\end{equation}
hence the log-likelihood contribution of $x$ is
\begin{equation}
  \label{eq:normal-loglik}
  \loglik(\mu, \sigma; x) =
  - \log{\sqrt{2\pi}} - \log \sigma
  - \frac{1}{2} \frac{(x - \mu)^{2}}{\sigma^{2}}
\end{equation}
and its gradient is
\begin{equation}
  \label{eq:normal-loglik-gradient}
  \begin{split}
    \pderiv{\mu} \loglik(\mu, \sigma; x) &=
    \frac{1}{\sigma^{2}}(x - \mu)
    \\
    \pderiv{\sigma} \loglik(\mu, \sigma; x) &=
    -\frac{1}{\sigma} + \frac{1}{\sigma^{3}}(x - \mu)^{2}.
  \end{split}
\end{equation}
We can code these two functions as
<<>>=
loglik <- function(theta) {
   mu <- theta[1]
   sigma <- theta[2]
   N <- length(x)
   -N*log(sqrt(2*pi)) - N*log(sigma) - sum(0.5*(x - mu)^2/sigma^2)
                           # sum over observations
}
gradlikB <- function(theta) {
   ## BHHH-compatible gradient
   mu <- theta[1]
   sigma <- theta[2]
   N <- length(x)  # number of observations
   gradient <- matrix(0, N, 2)  # gradient is a matrix:
                           # N datapoints (rows), 2 components
   gradient[, 1] <- (x - mu)/sigma^2
                           # first column: derivative wrt mu
   gradient[, 2] <- -1/sigma + (x - mu)^2/sigma^3
                           # second column: derivative wrt sigma
   gradient
}
@
Note that in this case we do not sum over the individual values in the
gradient function (but we still do in the log-likelihood).  Instead, we
fill the rows of the $N\times2$ gradient matrix with the values
observation-wise.  The results are similar to what we got above, and the
convergence speed is in between that of Newton-Raphson and BFGS:
\label{code:bhhh-example}
<<>>=
m <- maxLik(loglik, gradlikB, start=c(mu=1, sigma=2), method="BHHH")
summary(m)
@
If we do not have the time or energy to code the analytic gradient, we
can let \maxlik compute the numeric one for BHHH too.  In this case we
have to supply the log-likelihood by observation.  This essentially means
we remove the summing from the original log-likelihood function:
<<>>=
loglikB <- function(theta) {
   mu <- theta[1]
   sigma <- theta[2]
   -log(sqrt(2*pi)) - log(sigma) - 0.5*(x - mu)^2/sigma^2
                           # no summing here
                           # also no 'N*' terms as we work by
                           # individual observations
}
m <- maxLik(loglikB, start=c(mu=1, sigma=2), method="BHHH")
summary(m)
@
Apart from relying on the information equality, BHHH is essentially the
same algorithm as NR.  As the Hessian is only approximated, it converges
more slowly than NR with an analytic Hessian.  But when relying on
numeric derivatives only, BHHH may be more reliable.

For convenience, the other methods also support observation-wise
gradients and log-likelihood values; those numbers are simply summed
internally.  So one can code the problem in a BHHH-compatible manner and
use it with all supported optimizers, as the following example shows.
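For instance, the observation-wise \verb|loglikB| also works with the
default Newton-Raphson method; as a bonus, the observation-wise values
make \verb|nObs| (see Section~\ref{sec:maxlik-function}) available:
<<>>=
m <- maxLik(loglikB, start=c(mu=1, sigma=2))  # default NR method
coef(m)
nObs(m)  # observation count, recovered from the observation-wise values
@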
The \maxlik package also includes a stochastic gradient ascent optimizer.
As that method is rarely used for ML estimation, it cannot be selected
through the \verb|method| argument.  Consult the separate vignette
``Stochastic gradient ascent in \maxlik''.

\subsection{Control options}
\label{sec:control-options}

\maxlik supports a number of control options, most of which can be
supplied through the argument \verb|control=list(...)|.  Some of the most
important options include \verb|printLevel| to control the amount of
debugging information, \verb|iterlim| to control the maximum number of
iterations, and various \verb|tol|-parameters to control the convergence
tolerances.  For instance, we can limit the number of iterations to two,
while also printing out the parameter estimates at each step.  We use the
previous example with the BHHH optimizer:
<<>>=
m <- maxLik(loglikB, start=c(mu=1, sigma=2),
            method="BHHH",
            control=list(printLevel=3, iterlim=2))
summary(m)
@
The first option, \verb|printLevel=3|, makes \verb|maxLik| print out the
parameters, the gradient, and a few other bits of information at every
step.  Larger levels output more information; \verb|printLevel=1| only
prints the first and the last parameter values.  The output from the
optimizers implemented in \maxlik is fairly consistent, but methods that
call optimizers from other packages, such as BFGS, may output debugging
information in a rather different way.  The second option,
\verb|iterlim=2|, stops the algorithm after two iterations.  It returns
with code 4: iteration limit exceeded.

Another set of handy options is the convergence tolerances.  There are
three of them:
\begin{description}
\item[tol] The absolute convergence tolerance: stop if successive
  function evaluations differ by less than \emph{tol} (default
  $10^{-8}$).
\item[reltol] Somewhat similar to \emph{tol}, but relative to the
  function value: stop if successive function evaluations differ by less
  than $\mathit{reltol}\cdot (|\loglik(\vec{\theta})| + \mathit{reltol})$
  (default \verb|sqrt(.Machine[["double.eps"]])|, approximately
  \Sexpr{formatC(sqrt(.Machine[["double.eps"]]), digits=1)} on a modern
  computer).
\item[gradtol] Stop if the (Euclidean) norm of the gradient is smaller
  than this value (default $10^{-6}$).
\end{description}
The default tolerance values are typically good enough, but in certain
cases one may want to adjust them.  For instance, if the function values
are very large, one may want to rely on the absolute tolerance only, and
ignore the relative tolerance and gradient tolerance criteria.  A simple
way to achieve this is to set both \emph{reltol} and \emph{gradtol} to
zero.  In that case these two conditions are never satisfied, and the
algorithm stops only when the absolute convergence criterion is
fulfilled.  For instance, in the previous case we get:
<<>>=
m <- maxLik(loglikB, start=c(mu=1, sigma=2),
            method="BHHH",
            control=list(reltol=0, gradtol=0))
summary(m)
@
Comparing the result with that on Page~\pageref{code:bhhh-example}, we
can see that the optimizer now needs more iterations and stops with a
return code related to the absolute tolerance, not the relative
tolerance.

Note that BFGS and the other optimizers based on \verb|stats::optim| do
not report convergence results in the same way as BHHH and NR, the
algorithms implemented in the \maxlik package itself.  Instead of a
message about tolerance limits or a gradient close to zero, we only hear
about ``successful convergence''.
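For instance, with the observation-wise log-likelihood from above:
<<>>=
m <- maxLik(loglikB, start=c(mu=1, sigma=2), method="BFGS")
returnMessage(m)
@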
Stochastic gradient ascent relies on completely different convergence
criteria; see the dedicated vignette ``Stochastic Gradient Ascent in
\maxlik''.

\section{Advanced usage}
\label{sec:advanced-usage}

This section describes more advanced and less frequently used aspects of
\maxlik.

\subsection{Additional arguments to the log-likelihood function}
\label{sec:additional-arguments-loglik}

\maxlik expects the first argument of the log-likelihood function to be
the parameter vector.  But the function may have further arguments, which
can be passed as additional named arguments to the \verb|maxLik|
function.  For instance, let us change the log-likelihood function so
that it expects the data $\vec{x}$ to be passed as an argument \verb|x|.
Now we have to call \maxlik with an additional argument \verb|x=...|:
<<>>=
loglik <- function(theta, x) {
   mu <- theta[1]
   sigma <- theta[2]
   sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
}
m <- maxLik(loglik, start=c(mu=1, sigma=2),
            x=x)  # named argument 'x' will be passed
                           # to loglik
summary(m)
@
This approach only works if the argument names do not overlap with the
argument names of \verb|maxLik| itself.  If they do, \verb|maxLik| stops
with an informative error message.

\subsection{Maximizing other functions}
\label{sec:maximizing-other-functions}

The \verb|maxLik| function is basically a wrapper around a number of
maximization algorithms and a set of likelihood-related methods, such as
standard errors.  However, from time to time we need to optimize other
functions, for which inverting the Hessian to compute standard errors is
not applicable.  In such cases one can call the included optimizers
directly, using the form \verb|maxXXX| where \verb|XXX| stands for the
name of the method, e.g.\ \verb|maxNR| for Newton-Raphson
(\verb|method="NR"|) and \verb|maxBFGS| for BFGS.  There is also
\verb|maxBHHH|, although the information-equality-based BHHH
approximation is only valid for log-likelihood functions.

The arguments of the \verb|maxXXX| functions are largely similar to those
of \maxlik: the first argument is the function to be maximized, and one
also has to supply start values.  Let us demonstrate this functionality
by optimizing a two-dimensional bell curve,
\begin{equation}
  \label{eq:2d-bell-curve}
  f(x, y) = \me^{-x^{2} - y^{2}}.
\end{equation}
We code this function and just call \verb|maxBFGS| on it:
<<>>=
f <- function(theta) {
   x <- theta[1]
   y <- theta[2]
   exp(-x^2 - y^2)  # optimum at (0, 0)
}
m <- maxBFGS(f, start=c(1,1))  # give start value a bit off
summary(m)
@
Note that the summary output is slightly different: it reports the
parameter and gradient values, appropriate for a task that is not
likelihood maximization.  Behind the scenes, this is because the
\verb|maxXXX| functions return an object of class \emph{maxim}, not
\emph{maxLik}.

\subsection{Testing condition numbers}
\label{sec:testing-condition-numbers}

The analytic gradient we demonstrated in
Section~\ref{sec:supplying-gradients} helps to avoid numerical problems.
But not all problems can or should be solved by analytic gradients.  For
instance, multicollinearity should be addressed at the data or model
level.  \maxlik provides a helper function, \verb|condiNumber|, to detect
such problems.  We demonstrate it by creating a highly multicollinear
dataset and estimating a linear regression model.  We re-use the
regression code from Section~\ref{sec:supplying-gradients}, but this time
we create multicollinear data on a similar scale:
<<>>=
## create 3 variables, two independent, third collinear
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, sd=1e-6)  # highly correlated w/x1, x2
X <- cbind(x1, x2, x3)
y <- X %*% c(1, 1, 1) + rnorm(100)
m <- maxLik(negSSEA, start=c(x1=0, x2=0, x3=0))
                           # negSSEA: negative sum of squared errors
                           # with gradient, hessian attribute
summary(m)
@
As one can see, the model converges but the standard errors are missing
(because the Hessian is not negative definite).  In such cases we may
learn more about the problem by testing the condition number $\kappa$ of
either the design matrix $\mat{X}$ or the Hessian matrix.  It is
instructive to test not just the whole matrix, but to do it column by
column and see where the number suddenly jumps.  This hints at which
variable does not play nicely with the rest of the data.
\verb|condiNumber| provides such functionality.  First, we test the
condition number of the design matrix:
<<>>=
condiNumber(X)
@
We can see that when including only $\vec{x}_{1}$ and $\vec{x}_{2}$ in
the design matrix, the condition number is 1.35, far from any
singularity-related problems.  However, adding $\vec{x}_{3}$ to the
matrix causes $\kappa$ to jump to over 5 million.  This suggests that
$\vec{x}_{3}$ is highly collinear with $\vec{x}_{1}$ and $\vec{x}_{2}$.
In this example the problem is obvious, as this is how we created
$\vec{x}_{3}$; in real applications one often needs further analysis.
For instance, the problem may lie in categorical variables with too few
observations in some categories, or in complex fixed effects that turn
out to be perfectly multicollinear.  A good suggestion is to estimate a
linear regression model where one explains the offending variable using
all the previous variables.  In this example we might estimate
\verb|lm(x3 ~ x1 + x2)| and see which variables help to explain
$\vec{x}_{3}$ perfectly.
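In this example such an auxiliary regression confirms the problem: the
$R^2$ is essentially equal to one:
<<>>=
summary(lm(x3 ~ x1 + x2))$r.squared
@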
Sometimes the design matrix is fine but the problem arises because the
data and the model do not match.  In that case it may be more informative
to test the condition number of the Hessian matrix instead.  The example
below creates a linearly separable set of observations and estimates it
with logistic regression.  As a refresher, the log-likelihood of logistic
regression is
\begin{equation}
  \label{eq:logistic-loglik}
  \loglik(\vec{\beta}) =
  \sum_{i: y_{i} = 1} \log\Lambda(\vec{x}_{i}^{\transpose} \vec{\beta})
  +
  \sum_{i: y_{i} = 0} \log\Lambda(-\vec{x}_{i}^{\transpose} \vec{\beta})
\end{equation}
where $\Lambda(x) = 1/(1 + \exp(-x))$ is the logistic cumulative
distribution function.  We implement it using the \R function
\verb|plogis|:
<<>>=
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
X <- cbind(x1, x2, x3)
y <- X %*% c(1, 1, 1) > 0  # 0/1 y values, linearly separable
loglik <- function(beta) {
   link <- X %*% beta
   sum(ifelse(y > 0, plogis(link, log.p=TRUE), plogis(-link, log.p=TRUE)))
}
m <- maxLik(loglik, start=c(x1=0, x2=0, x3=0))
summary(m)
@
Not surprisingly, all the coefficients tend to infinity and inference is
problematic.  In this case the design matrix does not show any issues:
<<>>=
condiNumber(X)
@
But the Hessian reveals that including $\vec{x}_{3}$ in the model is
still problematic:
<<>>=
condiNumber(hessian(m))
@
Now the problem is not multicollinearity but the fact that
$\vec{x}_{3}$ makes the data linearly separable.  In such cases we may
want to adjust our model or estimation strategy.

\subsection{Fixed parameters and constrained optimization}
\label{sec:fixed-parameters}

\maxlik supports three types of constraints.  The simplest case just
keeps certain parameter values fixed.  The other two, general linear
equality and inequality constraints, are somewhat more complex.

Occasionally we want to treat one of the model parameters as a constant.
This can be achieved in a very simple manner, through the argument
\verb|fixed|.  It must be an index vector, either numeric, such as
\verb|c(2,4)|, logical, such as \verb|c(FALSE, TRUE, FALSE, TRUE)|, or
character, such as \verb|c("beta2", "beta4")|, provided \verb|start| is a
named vector.  We revisit the first example of this vignette and estimate
the normal distribution parameters again.  However, this time we fix
$\sigma = 1$:
<<>>=
x <- rnorm(100)
loglik <- function(theta) {
   mu <- theta[1]
   sigma <- theta[2]
   sum(dnorm(x, mean=mu, sd=sigma, log=TRUE))
}
m <- maxLik(loglik, start=c(mu=1, sigma=1),
            fixed="sigma")
                           # fix the component named 'sigma'
summary(m)
@
In the result, $\sigma$ is exactly equal to $1$, its standard error is
$0$, and its $t$ value is undefined.  The fixed components are ignored
when computing the gradient and Hessian in the optimizer, essentially
reducing the problem from two dimensions to one.  Hence the inference for
$\mu$ is still correct.

Next, we demonstrate equality constraints.  We take the two-dimensional
function we used in Section~\ref{sec:maximizing-other-functions} and add
the constraint $x + y = 1$.  The constraint must be described in the
matrix form $\mat{A}\,\vec{\theta} + \vec{B} = 0$, where $\vec{\theta}$
is the parameter vector and the matrix $\mat{A}$ and the vector $\vec{B}$
describe the constraints.  In this case we can write
\begin{equation}
  \label{eq:equality-constraints}
  \begin{pmatrix}
    1 & 1
  \end{pmatrix}
  \cdot
  \begin{pmatrix}
    x \\ y
  \end{pmatrix}
  +
  \begin{pmatrix}
    -1
  \end{pmatrix}
  = 0,
\end{equation}
i.e.\ $\mat{A} = (1 \; 1)$ and $\vec{B} = -1$.  These values must be
supplied to the optimizer argument \verb|constraints|.  This is a list
with components named \verb|eqA| and \verb|eqB| for $\mat{A}$ and
$\vec{B}$ respectively.  We do not demonstrate this with a likelihood
example, as no corrections are made to the Hessian matrix and hence the
standard errors would be incorrect.  But if you are not interested in
likelihood-based inference, it works well:
<<>>=
f <- function(theta) {
   x <- theta[1]
   y <- theta[2]
   exp(-x^2 - y^2)  # optimum at (0, 0)
}
A <- matrix(c(1, 1), ncol=2)
B <- -1
m <- maxNR(f, start=c(1,1),
           constraints=list(eqA=A, eqB=B))
summary(m)
@
The problem is solved using the sequential unconstrained maximization
technique (SUMT).  The idea is to add a small penalty for violating the
constraint, and then to slowly increase the penalty until violations
become prohibitively expensive.  As the example indicates, the solution
is extremely close to the constraint line.
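The following sketch illustrates the idea (a simplified illustration
only, not the actual implementation; it re-uses \verb|f|, \verb|A| and
\verb|B| from above):
<<eval=FALSE>>=
## maximize 'f' while penalizing violations of A theta + B = 0,
## increasing the penalty weight 'rho' step by step
theta <- c(1, 1)  # start value
for(rho in c(1, 100, 1e4, 1e6)) {
   penalized <- function(theta) {
      f(theta) - rho*sum((A %*% theta + B)^2)
   }
   theta <- maxNR(penalized, start=theta)$estimate
}
theta  # close to the constrained optimum (0.5, 0.5)
@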
The usage of inequality constraints is fairly similar.  We have to code
the inequalities in the form $\mat{A}\,\vec{\theta} + \vec{B} > 0$, where
the matrix $\mat{A}$ and the vector $\vec{B}$ are defined as above.  Let
us optimize the function over the region $x + y > 1$.  In matrix form
this is
\begin{equation}
  \label{eq:inequality-constraints-1}
  \begin{pmatrix}
    1 & 1
  \end{pmatrix}
  \cdot
  \begin{pmatrix}
    x \\ y
  \end{pmatrix}
  +
  \begin{pmatrix}
    -1
  \end{pmatrix}
  > 0.
\end{equation}
Supplying the constraints is otherwise similar to the equality
constraints, except that the components of the constraints list must be
called \verb|ineqA| and \verb|ineqB|.  As \verb|maxNR| does not support
inequality constraints, we use \verb|maxBFGS| instead.  The corresponding
code is
<<>>=
A <- matrix(c(1, 1), ncol=2)
B <- -1
m <- maxBFGS(f, start=c(1,1),
             constraints=list(ineqA=A, ineqB=B))
summary(m)
@
Not surprisingly, the result is exactly the same as with the equality
constraints: the optimum is found on the boundary, the same line we
specified when demonstrating the equality constraints.

One can supply more than one constraint; in that case all of them must be
satisfied at the same time.  For instance, let us add another condition,
$x - y > 1$.  This should be coded as another row of $\mat{A}$ and
another component of $\vec{B}$; in matrix form the constraints are now
\begin{equation}
  \label{eq:inequality-constraints-2}
  \begin{pmatrix}
    1 & 1\\
    1 & -1
  \end{pmatrix}
  \cdot
  \begin{pmatrix}
    x \\ y
  \end{pmatrix}
  +
  \begin{pmatrix}
    -1 \\ -1
  \end{pmatrix}
  >
  \begin{pmatrix}
    0 \\ 0
  \end{pmatrix}
\end{equation}
where ``$>$'' must be understood element-wise.  We also have to ensure
that the initial value satisfies the constraints, so we choose
$\vec{\theta}_{0} = (2, 0)$.  The corresponding code is
<<>>=
A <- matrix(c(1, 1, 1, -1), ncol=2)
B <- c(-1, -1)
m <- maxBFGS(f, start=c(2, 0),
             constraints=list(ineqA=A, ineqB=B))
summary(m)
@
The solution, $(1, 0)$, is the point closest to the origin where both
constraints are satisfied.
\bigskip

This example concludes the introduction to \maxlik usage.
For more information, consult the fairly extensive documentation and the
other vignettes.

% \bibliographystyle{apecon}
% \bibliography{maxlik}

\end{document}