%\VignetteIndexEntry{R packages: LaTeX vignettes}
%\VignetteEngine{R.rsp::tex}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteKeyword{ANOVA}
\documentclass[12pt]{article}
\parindent 0cm
\usepackage{amsmath}
\usepackage[UKenglish]{isodate}
\usepackage[round]{natbib}
\newcommand{\btheta}{\mbox{$\boldsymbol{\theta}$}}
\newcommand{\bphi}{\mbox{$\boldsymbol{\phi}$}}
\newcommand{\bpsi}{\mbox{$\boldsymbol{\psi}$}}
\newcommand{\bsigma}{\mbox{$\boldsymbol{\sigma}$}}
\newcommand{\bdelta}{\mbox{$\boldsymbol{\delta}$}}
\newcommand{\bPhi}{\mbox{$\boldsymbol{\Phi}$}}
\newcommand{\brho}{\mbox{$\boldsymbol{\rho}$}}
\newcommand{\bx}{\mbox{$\boldsymbol{x}$}}
\newcommand{\bzero}{\mbox{$\boldsymbol{0}$}}
\newcommand{\ntop}{\mbox{$n_{\textup{\textrm{top}}}$}}
\newcommand{\nbot}{\mbox{$n_{\textup{\textrm{bot}}}$}}
\begin{document}
\cleanlookdateon
\title{\textbf{Extended Generalised Linear Hidden Markov Models}}
\author{Rolf Turner}
\date{\today}
\maketitle

\section{Introduction}
\label{sec:intro}

The \texttt{eglhmm} package provides means of fitting hidden Markov models \citep{Rabiner1989} in contexts in which the data conform to generalised linear models or slightly extended versions thereof. The package accommodates models in which the observations (``emissions'') are assumed to arise from one of a number of distributions: Gaussian, Poisson, Binomial, Db (discretised beta, \citealp{Turner2021}), and Multinom. In the Poisson and Binomial cases the models are generalised linear models. In the Gaussian and Db cases the models are ``something like, but not exactly'' generalised linear models. In the case of the Multinom (or ``discnp'' --- discrete non-parametric) distribution the model in question bears some relationship to a generalised linear model but is of a substantially different form. We shall use the expression ``extended generalised linear hidden Markov models'' to describe the collection of all models under consideration, including those based on the Gaussian, Db and Multinom distributions.

The package fits the models in question by several different methods, namely the EM algorithm \citep{DempsterEtAl1977}, the Levenberg-Marquardt algorithm \citep{Turner2008}, and ``brute force'', which uses either the \texttt{optim()} or the \texttt{nlm()} function to optimise the log likelihood. The Levenberg-Marquardt algorithm, and in certain circumstances the ``brute force'' procedure, require the analytic calculation of the gradient and Hessian of the log likelihood. This calculation is intricate in the hidden Markov model context. (In fact simply calculating the log likelihood is intricate.) Most of this vignette is devoted to the calculation of the first and second derivatives of the log likelihood.

\section{Recursive calculations}
\label{sec:recurse}

The likelihood of a hidden Markov model may feasibly be calculated in terms of the ``forward'' probabilities developed by Baum et al. (see \citealt{BaumEtAl1970}). These probabilities are calculated by means of a recursive procedure which of course depends on the likelihoods of individual observations. These likelihoods, which may be expressed in the form $f(y,\btheta)$, may be either probability density functions or probability mass functions. The symbol $y$ represents an observation (emission) and $\btheta$ represents a vector of parameters upon which the distribution in question depends. These parameters depend in turn on the underlying state of the hidden Markov chain and in general upon other predictors (in addition to ``state'').
The dependence of $\btheta$ upon the predictors will involve further parameters. The derivatives of the log likelihood of the model must therefore, in turn, be calculated via recursive procedures. In order to effect these procedures, we need to calculate the first and second derivatives, with respect to all of the parameters that are involved, of the single observation likelihoods $f(y,\btheta)$.

In the case of the Gaussian distribution $\btheta = (\mu,\sigma)^{\top}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution. In the cases of the Poisson and Binomial distributions $\btheta$ is actually a scalar (which we consequently write simply as $\theta$). For the Poisson distribution $\theta$ is equal to $\lambda$, the Poisson mean, and for the Binomial distribution $\theta$ is equal to $p$, the binomial success probability. In the case of the Db distribution, $\btheta$ is equal to $(\alpha,\beta)^{\top}$, the vector of ``shape'' parameters of the distribution. In the case of the Multinom distribution, the model (as indicated above) has a rather different structure.

Except in the Gaussian case we assume that $\btheta$ is completely determined by a vector $\bx$ of predictor variables and a vector $\bphi$ of predictor coefficients. We need to determine the first and second derivatives of the likelihood of a single observation with respect to the entries of $\bphi$. In the case of the Gaussian distribution $\btheta$ also includes the values of $\sigma$ corresponding to the different states. In the current implementation of the package these $\sigma$ values are not obtained from the predictor coefficients $\bphi$.

\section{Derivatives specific to each of the distributions}
\label{sec:derivs}

We now provide the details of the calculation of these derivatives for each of the five distributions in question.

\subsection{The Gaussian distribution}
\label{sec:gauss}

We denote the vector of standard deviations by $\bsigma = (\sigma_1, \ldots, \sigma_K)^{\top}$ (where $K$ is the number of states). In the current development we assume that $\sigma_i$ depends only on the state $i$ of the underlying hidden Markov chain (and not on any other predictors included in $\bx$). It is thus convenient to make explicit the dependence of the probability density functions upon the underlying state. We write the probability density function corresponding to state $i$ as
\[
f_i(y) = \frac{1}{\sqrt{2\pi} \sigma_i} \exp \left ( \frac{-(y-\mu)^2}{2\sigma_i^2} \right ) \; .
\]
We model $\mu$ as $\mu = \bx^{\top}\bphi$. Note that consequently $\mu$ depends, in general, upon the state $i$, although this dependence (which arises through $\bx$) is not made explicit in the foregoing expression for $f_i(y)$. We need to differentiate $f_i(y)$ with respect to $\bphi$ and $\bsigma$. It is straightforward, using logarithmic differentiation, to determine that:
\begin{equation}
\begin{split}
\frac{\partial f_i(y)}{\partial \mu} &= f_i(y)\left( \frac{y-\mu}{\sigma_i^2} \right ) \\
\frac{\partial f_i(y)}{\partial \sigma_j} &= \left \{ \begin{array}{ll}
f_i(y)\left( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right)/\sigma_i & \mbox{~if~} j = i\\
0 & \mbox{~if~} j \neq i \end{array} \right .
\\
\frac{\partial^2 f_i(y)}{\partial \mu^2} &= f_i(y) \left( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right )/\sigma_i^2 \\
\frac{\partial^2 f_i(y)}{\partial \sigma_i \partial \sigma_j} &= \left \{ \begin{array}{ll}
f_i(y) \left( \left ( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right )^2 + 1 - \frac{3(y-\mu)^2}{\sigma_i^2} \right )/\sigma_i^2 & \mbox{~if~} j = i\\
0 & \mbox{~if~} j \neq i \end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} &= \left \{ \begin{array}{ll}
f_i(y) \left (\frac{(y-\mu)^2}{\sigma_i^3} - \frac{3}{\sigma_i} \right )(y-\mu)/\sigma_i^2 & \mbox{~if~} j = i\\
0 & \mbox{~if~} j \neq i \end{array} \right . \; .
\end{split}
\label{eq:muSigPartials}
\end{equation}
Recalling that $\mu = \bx^{\top} \bphi$ we see that
\[
\frac{\partial \mu}{\partial \bphi} = \bx \; .
\]
An application of the chain rule then gives:
\[
\frac{\partial f_i(y)}{\partial \bphi} = \frac{\partial f_i(y)}{\partial \mu} \bx \; .
\]
The second derivatives of $f_i(y)$ with respect to $\bphi$ are given by
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bphi} &= \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f_i(y)}{\partial \mu} \bx \right ) \\
&= \bx \left(\frac{\partial^2 f_i(y)}{\partial \mu^2} \frac{\partial \mu}{\partial \bphi^{\top}} + \frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i} \frac{\partial \sigma_i}{\partial \bphi^{\top}} \right ) \\
&= \left(\frac{\partial^2 f_i(y)}{\partial \mu^2} \right) \bx \bx^{\top}
\end{align*}
since $\partial \sigma_i/\partial \bphi^{\top} = \bzero$. The second derivatives of $f_i(y)$ with respect to $\bphi$ and $\bsigma$ are given by
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \sigma_j} &= \left \{ \begin{array}{cl}
\left(\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} \right) \bx^{\top} & \mbox{~if~} j = i \\[0.25cm]
\bzero^{\top} & \mbox{~if~} j \neq i \end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \sigma_j \partial \bphi} &= \left \{ \begin{array}{cl}
\left(\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} \right) \bx & \mbox{~if~} j = i \\[0.25cm]
\bzero & \mbox{~if~} j \neq i \end{array} \right . \; .
\end{align*}
Note that
\[
\frac{\partial^2 f_i(y)}{\partial \sigma_i \partial \sigma_j}
\]
is provided in \eqref{eq:muSigPartials}.
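Although not needed for the development, it is easy to check the expressions in \eqref{eq:muSigPartials} numerically. The following R snippet is a minimal sketch of such a check (it is purely illustrative and is not part of the \texttt{eglhmm} code); it compares the analytic first derivatives displayed above with central finite differences at an arbitrarily chosen point.
\begin{verbatim}
## Gaussian single-observation density and the analytic first
## derivatives displayed above; illustrative only.
f        <- function(y, mu, sigma) dnorm(y, mean = mu, sd = sigma)
dfdmu    <- function(y, mu, sigma) f(y, mu, sigma)*(y - mu)/sigma^2
dfdsigma <- function(y, mu, sigma)
               f(y, mu, sigma)*((y - mu)^2/sigma^2 - 1)/sigma

y <- 1.7; mu <- 0.5; sigma <- 1.3; h <- 1e-6
## Central finite differences.
fd.mu    <- (f(y, mu + h, sigma) - f(y, mu - h, sigma))/(2*h)
fd.sigma <- (f(y, mu, sigma + h) - f(y, mu, sigma - h))/(2*h)
all.equal(dfdmu(y, mu, sigma), fd.mu, tolerance = 1e-6)       # TRUE
all.equal(dfdsigma(y, mu, sigma), fd.sigma, tolerance = 1e-6) # TRUE
\end{verbatim}
The second derivatives in \eqref{eq:muSigPartials} can be checked in the same manner.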
The structure of the first and second derivatives of $f_i(y)$ with respect to $\bphi$ and $\bsigma$ can be expressed concisely by letting
\[
\bpsi = \left [ \begin{array}{l} \bsigma\\ \bphi \end{array} \right ]
\]
and then writing
\begin{align*}
\frac{\partial f_i(y)}{\partial \bpsi} &= \left[ \begin{array}{l}
\frac{\partial f_i(y)}{\partial \bsigma} \\[0.1cm]
\frac{\partial f_i(y)}{\partial \bphi} \end{array} \right] \\[0.25cm]
&=\left[ \begin{array}{l}
\frac{\partial f_i(y)}{\partial \sigma_i} \bdelta_i\\[0.1cm]
\frac{\partial f_i(y)}{\partial \mu} \bx \end{array} \right]
\end{align*}
where $\bdelta_i$ is a vector of dimension $K$ whose $i$th entry is 1 and whose other entries are all 0, and
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bpsi^{\top} \partial \bpsi} &= \left[ \begin{array}{ll}
\frac{\partial^2 f_i(y)}{\partial \bsigma^{\top} \partial \bsigma} & \frac{\partial^2 f_i(y)}{\partial \bsigma^{\top} \partial \bphi} \\
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bsigma} & \frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bphi} \end{array} \right] \\[0.25cm]
& = \left[ \begin{array}{ll}
\frac{\partial^2 f_i(y)}{\partial \sigma_i^2} \bdelta_i \bdelta_i^{\top} & \frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i} \bdelta_i \bx^{\top} \\
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i} \bx \bdelta_i^{\top} & \frac{\partial^2 f_i(y)}{\partial \mu^2} \bx \bx^{\top} \\
\end{array} \right] \; .
\end{align*}
Note that the first and second partial derivatives of $f_i(y)$ with respect to $\mu$ and $\sigma_i$ are provided in \eqref{eq:muSigPartials}.

\subsection{The Poisson distribution}
\label{sec:pois}

The likelihood is the probability mass function
\[
f(y) = e^{-\lambda} \frac{\lambda^y}{y!}
\]
for $y = 0, 1, 2, \ldots$. Here $\btheta$ is a scalar, $\theta = \lambda$, and we model $\lambda$ via $\lambda = \exp(\bx^{\top} \bphi)$, where $\bx$ is a vector of predictors and $\bphi$ is a vector of predictor coefficients. The first and second derivatives of $f(y)$ with respect to $\lambda$ are
\begin{align*}
\frac{\partial f(y)}{\partial \lambda} &= f(y) \left (\frac{y}{\lambda} - 1 \right ) \\
\frac{\partial^2 f(y)}{\partial \lambda^2} &= f(y) \left( \left(\frac{y}{\lambda} - 1 \right)^2 - \frac{y}{\lambda^2} \right) \; .
\end{align*}
Since $\lambda = \exp(\bx^{\top} \bphi)$ it follows readily that the first and second derivatives of $\lambda$ with respect to $\bphi$ are $\lambda \bx$ and $\lambda \bx \bx^{\top}$, respectively. Applying the chain rule we get
\begin{align*}
\frac{\partial f(y)}{\partial \bphi} &= \frac{\partial f(y)}{\partial \lambda} \lambda \bx \\
\frac{\partial^2 f(y)}{\partial \bphi^{\top} \partial \bphi} &= \left(\frac{\partial f(y)}{\partial \lambda} \lambda + \frac{\partial^2 f(y)}{\partial \lambda^2} \lambda^2 \right) \bx \bx^{\top} \; .
\end{align*}

\subsection{The Binomial distribution}
\label{sec:bino}

The likelihood is the probability mass function
\[
f(y) = \binom{n}{y} p^y (1-p)^{n-y}
\]
for $y = 0, 1, 2, \ldots, n$, where $n$ is the number of independent binomial trials on which the success count $y$ is based, and $p$ is the probability of success. Here $\btheta$ is a scalar, $\theta = p$, and we model $p$ via $p = h(u)$ where $u = \bx^{\top} \bphi$, with $\bx$ a vector of predictors, $\bphi$ a vector of predictor coefficients, and $h(u)$ the logistic (inverse logit) function $h(u) = (1 + e^{-u})^{-1}$. In what follows we will need the first and second derivatives of the logistic function.
These are given by
\begin{equation}
\begin{split}
h'(u) &= \frac{e^{-u}}{(1+e^{-u})^2} \mbox{~and} \\
h''(u) &= \frac{e^{-u}(e^{-u} - 1)}{(1+e^{-u})^3} \; .
\end{split}
\label{eq:logitDerivs}
\end{equation}
The first and second derivatives of $f(y)$ with respect to $p$ are
\begin{align*}
\frac{\partial f(y)}{\partial p} &= f(y) \left ( \frac{y}{p} - \frac{n-y}{1-p} \right )\\
\frac{\partial^2 f(y)}{\partial p^2} &= f(y) \left ( \left (\frac{y}{p} - \frac{n-y}{1-p} \right)^2 - \frac{y}{p^2} - \frac{n-y}{(1-p)^2} \right ) \; .
\end{align*}
Since $p = h(\bx^{\top} \bphi)$ we see that
\begin{align*}
\frac{\partial p}{\partial \bphi} &= h'(\bx^{\top} \bphi) \bx \mbox{~ and}\\
\frac{\partial^2 p}{\partial \bphi^{\top} \partial \bphi} &= h''(\bx^{\top} \bphi) \bx \bx^{\top} \; .
\end{align*}
Applying the chain rule we see that
\begin{align*}
\frac{\partial f(y)}{\partial \bphi} &= \frac{\partial f(y)}{\partial p} h'(\bx^{\top} \bphi) \bx \mbox{~ and}\\
\frac{\partial^2 f(y)}{\partial \bphi^{\top} \partial \bphi} &= \left ( \frac{\partial f(y)}{\partial p} h''(\bx^{\top} \bphi) + \frac{\partial^2 f(y)}{\partial p^2} \left(h'(\bx^{\top} \bphi)\right)^2 \right) \bx \bx^{\top} \; .
\end{align*}
Recall that expressions for $h'(\cdot)$ and $h''(\cdot)$ are given by \eqref{eq:logitDerivs}.

\subsection{The Db distribution}
\label{sec:dbd}

The likelihood is the probability mass function, which depends on a vector of parameters $\btheta = (\alpha,\beta)^{\top}$ and is somewhat complicated to write down. In order to obtain an expression for this probability mass function we need to define
\begin{align*}
h_0(y) &= (y(1-y))^{-1} \\
h(y) &= h_0((y - \nbot + 1)/(\ntop - \nbot + 2)) \\
T_1(y) &= \log((y - \nbot + 1)/(\ntop - \nbot + 2)) \\
T_2(y) &= \log((\ntop - y + 1)/(\ntop - \nbot + 2)) \\
A(\alpha,\beta) &= \log \left( \sum_{i=\nbot}^{\ntop} h(i) \exp\{\alpha T_1(i) + \beta T_2(i) \} \right) \; ,
\end{align*}
where $\nbot$ and $\ntop$ are respectively the smallest and largest possible values of the Db variate. Given these definitions, the probability mass function of the Db distribution can be written as
\[
f(y,\alpha,\beta) = \Pr(X=y \mid \alpha, \beta) = h(y) \exp\{\alpha T_1(y) + \beta T_2(y) - A(\alpha,\beta)\} \; .
\]
We model $\alpha$ and $\beta$ via
\begin{align*}
\alpha &= \bx^{\top} \bphi_1 \\
\beta &= \bx^{\top} \bphi_2
\end{align*}
where $\bx$ is a vector of predictors and $\bphi_1$ and $\bphi_2$ are vectors of predictor coefficients. The vector $\bphi$, with respect to which we seek to differentiate the likelihood, is the catenation of $\bphi_1$ and $\bphi_2$.
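Before turning to the derivatives it may be helpful to see the probability mass function written out computationally. The following R function is a direct, stand-alone transcription of the definitions above; it is purely illustrative (the name \texttt{dbPmf} and its argument names are ours, and the snippet is not \texttt{eglhmm} code).
\begin{verbatim}
## The Db probability mass function, transcribed directly from the
## definitions of h, T_1, T_2 and A given above.  The arguments nbot
## and ntop are the smallest and largest possible values of y.
dbPmf <- function(y, alpha, beta, nbot, ntop) {
    h0 <- function(u) 1/(u*(1 - u))
    h  <- function(z) h0((z - nbot + 1)/(ntop - nbot + 2))
    T1 <- function(z) log((z - nbot + 1)/(ntop - nbot + 2))
    T2 <- function(z) log((ntop - z + 1)/(ntop - nbot + 2))
    i  <- nbot:ntop
    A  <- log(sum(h(i)*exp(alpha*T1(i) + beta*T2(i))))
    h(y)*exp(alpha*T1(y) + beta*T2(y) - A)
}
## The probabilities sum to 1 over the support {nbot, ..., ntop}:
sum(dbPmf(0:5, alpha = 1.2, beta = -0.3, nbot = 0, ntop = 5))  # 1
\end{verbatim}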
The first derivative of the likelihood with respect to $\bphi$ is \begin{align*} \frac{\partial f}{\partial \bphi} &= \frac{\partial f}{\partial \alpha} \frac{\partial \alpha}{\partial \bphi} + \frac{\partial f}{\partial \beta} \frac{\partial \beta}{\partial \bphi} \\ &= \frac{\partial f}{\partial \alpha} \left [ \begin{array}{c} \frac{\partial \alpha}{\partial \bphi_1} \\ \bzero \end{array} \right ] + \frac{\partial f}{\partial \beta} \left [ \begin{array}{c} \bzero \\ \frac{\partial \beta}{\partial \bphi_2} \end{array} \right ] \\ &= \frac{\partial f}{\partial \alpha} \left [ \begin{array}{c} \bx \\ \bzero \end{array} \right ] + \frac{\partial f}{\partial \beta} \left [ \begin{array}{c} \bzero \\ \bx \end{array} \right ] \\ &= \left [ \begin{array}{c} \frac{\partial f}{\partial \alpha} \bx \\ \frac{\partial f}{\partial \beta} \bx \end{array} \right ] \end{align*} The second derivative is calculated as \[ \frac{\partial^2 f}{\partial \bphi^{\top} \partial \bphi} = \left [ \begin{array}{c} \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \alpha} \bx \right) \\ \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \beta} \bx \right) \end{array} \right ] \; . \] Taking this expression one row at a time we see that \begin{align*} \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \alpha} \right) &= \left [ \begin{array}{lr} \frac{\partial}{\partial \bphi_1^{\top}} \left ( \frac{\partial f}{\partial \alpha} \right) & \frac{\partial}{\partial \bphi_2^{\top}} \left ( \frac{\partial f}{\partial \alpha} \right) \end{array} \right ] \\ &= \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \alpha^2} \frac{\partial \alpha}{\partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \beta \partial \alpha} \frac{\partial \beta}{\partial \bphi_2^{\top}} \end{array} \right ] \\ &= \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \alpha^2} \bx^{\top} & \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx^{\top} \end{array} \right ] \mbox{~and likewise}\\ \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \beta} \right) & = \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx^{\top} & \frac{\partial^2 f}{\partial \beta^2} \bx^{\top} \end{array} \right ] \; . \end{align*} Combining the foregoing we get \[ \frac{\partial^2 f}{\partial \bphi^{\top} \partial \bphi} = \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \alpha^2} \bx \bx^{\top} & \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx \bx^{\top} \\[0.25cm] \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx \bx^{\top} & \frac{\partial^2 f}{\partial \beta^2} \bx \bx^{\top} \end{array} \right ] \; . 
\]
As was the case for the three preceding distributions, it is expedient to express the partial derivatives of $f(y,\alpha,\beta)$, with respect to the parameters of the distribution, in terms of $f(y,\alpha,\beta)$. The required expressions are as follows:
\begin{align*}
\frac{\partial f}{\partial \alpha} &= f(y,\alpha,\beta) \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right )\\
\frac{\partial f}{\partial \beta} &= f(y,\alpha,\beta) \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right )\\
\frac{\partial^2 f}{\partial \alpha^2} &= f(y,\alpha,\beta) \left [ \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right )^2 - \frac{\partial^2 A}{\partial \alpha^2} \right ]\\
\frac{\partial^2 f}{\partial \alpha \partial \beta} &= f(y,\alpha,\beta) \left [ \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right ) \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right ) - \frac{\partial^2 A}{\partial \alpha \partial \beta} \right ] \\
\frac{\partial^2 f}{\partial \beta^2} &= f(y,\alpha,\beta) \left [ \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right )^2 - \frac{\partial^2 A}{\partial \beta^2} \right ] \; .
\end{align*}
\newpage
It remains to provide expressions for the partial derivatives of $A$ with respect to $\alpha$ and $\beta$. Let
\[
E = \exp(A) = \sum_{i=\nbot}^{\ntop} h(i) \exp\{\alpha T_1(i) + \beta T_2(i) \} \; .
\]
Clearly
\begin{align*}
\frac{\partial A}{\partial \alpha} &= \frac{1}{E} \frac{\partial E}{\partial \alpha} \\
\frac{\partial A}{\partial \beta} &= \frac{1}{E} \frac{\partial E}{\partial \beta} \\
\frac{\partial^2 A}{\partial \alpha^2} &= \frac{1}{E} \frac{\partial^2 E}{\partial \alpha^2} - \frac{1}{E^2} \left (\frac{\partial E}{\partial \alpha} \right )^2 \\
\frac{\partial^2 A}{\partial \alpha \partial \beta} &= \frac{1}{E} \frac{\partial^2 E}{\partial \alpha \partial \beta} - \frac{1}{E^2} \left (\frac{\partial E}{\partial \alpha} \frac{\partial E}{\partial \beta} \right ) \\
\frac{\partial^2 A}{\partial \beta^2} &= \frac{1}{E} \frac{\partial^2 E}{\partial \beta^2} - \frac{1}{E^2} \left (\frac{\partial E}{\partial \beta} \right )^2 \; .
\end{align*}
\enlargethispage{1\baselineskip}
Finally, the relevant partial derivatives of $E$ are:
\begin{align*}
\frac{\partial E}{\partial \alpha} &= \sum_{i=\nbot}^{\ntop} h(i) T_1(i) \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial E}{\partial \beta} &= \sum_{i=\nbot}^{\ntop} h(i) T_2(i) \exp(\alpha T_1(i) + \beta T_2(i))\\
\frac{\partial^2 E}{\partial \alpha^2} &= \sum_{i=\nbot}^{\ntop} h(i) T_1(i)^2 \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial^2 E}{\partial \alpha \partial \beta} &= \sum_{i=\nbot}^{\ntop} h(i) T_1(i) T_2(i) \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial^2 E}{\partial \beta^2} &= \sum_{i=\nbot}^{\ntop} h(i) T_2(i)^2 \exp(\alpha T_1(i) + \beta T_2(i)) \; .
\end{align*}

\subsection{The Multinom distribution}
\label{sec:multinom}

This distribution is very different from those with which we have previously dealt. It is defined effectively in terms of \emph{tables}. In the hidden Markov model context, these tables take the form
\[
\Pr(Y = y_i \mid S = k) = \rho_{ik}
\]
where $Y$ is the emissions variate, its possible values or ``levels'' are $y_1, y_2, \ldots, y_m$, and $S$ denotes ``state'', which (without loss of generality) takes values $1, 2, \ldots, K$. Of course $\sum_{i=1}^{m} \rho_{ik} = 1$ for all $k$. We shall denote $\Pr(Y = y \mid S = k)$ by $f_k(y)$, so that $f_k(y_i) = \rho_{ik}$.
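For concreteness, in a toy setting with $m = 3$ levels of $Y$ and $K = 2$ states such a table might look as follows. The numbers are arbitrary and purely illustrative; all that matters is that each column sums to 1.
\begin{verbatim}
## An illustrative emission probability table Rho, with rows indexed
## by the levels of Y and columns indexed by state; entries are made up.
Rho <- matrix(c(0.2, 0.5, 0.3,
                0.6, 0.1, 0.3),
              nrow = 3, ncol = 2,
              dimnames = list(level = c("y1", "y2", "y3"),
                              state = c("1", "2")))
colSums(Rho)  # both columns sum to 1
\end{verbatim}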
The maximisation of the likelihood with respect to the $\rho_{ik}$ is awkward, due to the ``sum-to-1'' constraints that they must satisfy, and it is better to impose these constraints ``smoothly'' via a logistic parameterisation. See \cite{Turner2008}. Such a parameterisation allows us to express the dependence of the emissions probabilities upon ``state'' in terms of linear predictors. This in turn opens up the possibility of including other predictors, in addition to those determined by ``state'', in the model.

To this end we define vectors of parameters $\bphi_i$, $i = 1, \ldots, m$, corresponding to each of the possible values of $Y$. For identifiability we take $\bphi_m$ to be identically 0. Each $\bphi_i$ is a vector of length $np$, say, where $np$ is the number of predictors. If, in a $K$-state model, there are no predictors other than those determined by state, then $np = K$. In this case there are $K \times (m-1)$ ``free'' parameters, just as there should be (and just as there are in the original parameterisation in terms of the $\rho_{ik}$). Let the $j$th entry of $\bphi_i$ be $\phi_{ij}$, $j = 1,\ldots,np$. Let $\bphi$ be the vector consisting of the catenation of all of the $\phi_{ij}$, excluding the entries of $\bphi_m$ which are all 0:
\[
\bphi = (\phi_{11}, \phi_{12}, \ldots, \phi_{1,np}, \phi_{21}, \phi_{22}, \ldots, \phi_{2,np}, \ldots, \phi_{m-1,1}, \phi_{m-1,2}, \ldots, \phi_{m-1,np})^{\top} \; .
\]
Let $\bx$ be a vector of predictors. In terms of the foregoing notation, $f_k(y)$ can be written as
\[
f_k(y) = \frac{e^{\bx^{\top} \bphi_y}}{Z}
\]
where in turn
\[
Z = \sum_{\ell = 1}^{m} e^{\bx^{\top} \bphi_{\ell}} \; .
\]
The dependence of $f_k(y)$ upon the state $k$ is implicit in the predictor vector $\bx$, which includes predictors indicating state. We now calculate the partial derivatives of $f_k(y)$ with respect to $\bphi$. First note that $\frac{\partial f_k(y)}{\partial \bphi}$ can be written as
\[
\left [ \begin{array}{c}
\frac{\partial f_k}{\partial \bphi_1} \\[0.25cm]
\frac{\partial f_k}{\partial \bphi_2} \\
\vdots \\
\frac{\partial f_k}{\partial \bphi_{m-1}}
\end{array} \right ] \;.
\]
Next we calculate
\[
\frac{\partial f_k(y)}{\partial \bphi_i}, \mbox{~~} i = 1, \ldots, m-1 \; .
\]
Using logarithmic differentiation we see that
\[
\frac{1}{f_k(y)} \frac{\partial f_k(y)}{\partial \bphi_i} = \delta_{yi} \bx - \frac{1}{Z} e^{\bx^{\top} \bphi_i} \bx \; ,
\]
where $\delta_{yi}$ is equal to 1 if $y = y_i$ and to 0 otherwise, so that
\[
\frac{\partial f_k(y)}{\partial \bphi_i} = f_k(y) \left ( \delta_{yi} - \frac{e^{\bx^{\top} \bphi_i}}{Z} \right ) \bx
\]
which can be written as $f_k(y)(\delta_{yi} - f_k(i)) \bx$.
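Before summarising, a quick numerical check of this gradient formula may be reassuring. The following R snippet is a minimal sketch (all values are arbitrary and none of it is \texttt{eglhmm} code); for simplicity the dependence on state through $\bx$ is suppressed, and we take $m = 3$ (so that $\bphi_3 = \bzero$).
\begin{verbatim}
## Emission probabilities under the logistic parameterisation, with
## m = 3 levels and an arbitrary two-entry predictor vector x.
x   <- c(1, 0.4)
Phi <- cbind(c(0.3, -0.7), c(1.1, 0.2), c(0, 0))  # columns are phi_1, phi_2, phi_3
fk  <- function(y, Phi) {          # probability of level y
    e <- exp(drop(x %*% Phi))
    e[y]/sum(e)
}
y <- 2                             # the observed level
## Analytic gradient with respect to phi_1 (here delta_{y1} = 0):
grad1 <- fk(y, Phi)*((y == 1) - fk(1, Phi))*x
## Finite-difference check of its first entry:
h <- 1e-6
Pp <- Phi; Pp[1, 1] <- Pp[1, 1] + h
Pm <- Phi; Pm[1, 1] <- Pm[1, 1] - h
c(analytic = grad1[1], numeric = (fk(y, Pp) - fk(y, Pm))/(2*h))
\end{verbatim}
The two displayed values agree to within the accuracy of the finite difference approximation.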
In summary we have
\[
\frac{\partial f_k(y)}{\partial \bphi} = f_k(y) \left [ \begin{array}{l}
( \delta_{y1} - f_k(1) ) \bx \\
( \delta_{y2} - f_k(2) ) \bx \\
\multicolumn{1}{c}{\vdots} \\
( \delta_{y,m-1} - f_k(m-1) ) \bx
\end{array} \right ] \; .
\]
The second derivatives of $f_k(y)$ with respect to $\bphi$ are given by
\[
\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}} = \left [ \begin{array}{llcl}
\frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_2^{\top}} & \ldots & \frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_{m-1}^{\top}} \\[0.5cm]
\frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_2^{\top}} & \ldots & \frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_{m-1}^{\top}} \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} \\
\frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_2^{\top}} & \ldots & \frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_{m-1}^{\top}}
\end{array} \right] \; .
\]
The $(i,j)$th entry of $\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}}$, i.e.\ $\frac{\partial^2 f}{\partial \bphi_i \partial \bphi_j^{\top}}$, is given by
\begin{align*}
\frac{\partial}{\partial \bphi_i} \left ( \frac{\partial f_k(y)}{\partial \bphi_j^{\top}} \right )
&=\frac{\partial}{\partial \bphi_i} \left (f_k(y)(\delta_{yj} - f_k(j))\bx^{\top} \right) \\
&= f_k(y)\left(0 - f_k(j)(\delta_{ij} - f_k(i))\right)\bx\bx^{\top} + f_k(y)(\delta_{yi} - f_k(i))\bx \, (\delta_{yj}- f_k(j))\bx^{\top} \\
&= f_k(y)\left(-f_k(j)(\delta_{ij} - f_k(i)) + (\delta_{yi} - f_k(i))(\delta_{yj} - f_k(j))\right) \bx \bx^{\top} \\
&= f_k(y)\left(f_k(i)f_k(j) - \delta_{ij}f_k(j) + (\delta_{yi} - f_k(i)) (\delta_{yj} - f_k(j))\right) \bx \bx^{\top} \; .
\end{align*}
At first glance this expression seems to be anomalously asymmetric in $i$ and $j$, but the asymmetry is illusory. Note that when $i \neq j$, $\delta_{ij}f_k(j)$ is 0, and when $i=j$, $\delta_{ij}f_k(j) = f_k(j) = f_k(i)$. In summary we see that
\[
\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}} = \left [ \begin{array}{llcl}
a_{11} \bx \bx^{\top} & a_{12} \bx \bx^{\top} & \ldots & a_{1,m-1} \bx \bx^{\top} \\
a_{21} \bx \bx^{\top} & a_{22} \bx \bx^{\top} & \ldots & a_{2,m-1} \bx \bx^{\top} \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} \\
a_{m-1,1} \bx \bx^{\top} & a_{m-1,2} \bx \bx^{\top} & \ldots & a_{m-1,m-1} \bx \bx^{\top}
\end{array} \right ]
\]
where $a_{ij} = f_k(y)\left(f_k(i)f_k(j) - \delta_{ij}f_k(j) + (\delta_{yi} - f_k(i))(\delta_{yj} - f_k(j))\right)$, $i, j = 1, \ldots, m - 1$.

\newpage
\bibliography{eglhmm}
\bibliographystyle{plainnat}
\end{document}