%\VignetteIndexEntry{R packages: LaTeX vignettes}
%\VignetteEngine{R.rsp::tex}
%\VignetteKeyword{R}
%\VignetteKeyword{package}
%\VignetteKeyword{vignette}
%\VignetteKeyword{ANOVA}
\documentclass[12pt]{article}
\parindent 0cm
\usepackage{amsmath}
\usepackage[UKenglish]{isodate}
\usepackage[round]{natbib}
\newcommand{\btheta}{\mbox{$\boldsymbol{\theta}$}}
\newcommand{\bphi}{\mbox{$\boldsymbol{\phi}$}}
\newcommand{\bpsi}{\mbox{$\boldsymbol{\psi}$}}
\newcommand{\bsigma}{\mbox{$\boldsymbol{\sigma}$}}
\newcommand{\bdelta}{\mbox{$\boldsymbol{\delta}$}}
\newcommand{\bPhi}{\mbox{$\boldsymbol{\Phi}$}}
\newcommand{\brho}{\mbox{$\boldsymbol{\rho}$}}
\newcommand{\bx}{\mbox{$\boldsymbol{x}$}}
\newcommand{\bzero}{\mbox{$\boldsymbol{0}$}}
\newcommand{\ntop}{\mbox{$n_{\textup{\textrm{top}}}$}}
\newcommand{\nbot}{\mbox{$n_{\textup{\textrm{bot}}}$}}
\begin{document}
\cleanlookdateon
\title{\textbf{Extended Generalised Linear Hidden Markov Models}}
\author{Rolf Turner}
\date{\today}
\maketitle

\section{Introduction}
\label{sec:intro}

The \texttt{eglhmm} package provides means of fitting hidden Markov models \citep{Rabiner1989} in contexts in which the data conform to generalised linear models or slightly extended versions thereof. The package accommodates models in which the observations (``emissions'') are assumed to arise from one of a number of distributions: Gaussian, Poisson, Binomial, Db (discretised beta, \citealp{Turner2021}), and Multinom. In the Poisson and Binomial cases the models are generalised linear models. In the Gaussian and Db cases the models are ``something like, but not exactly'' generalised linear models. In the case of the Multinom (or ``discnp'' --- discrete non-parametric) distribution the model in question bears some relationship to a generalised linear model but is of a substantially different form. We shall use the expression ``extended generalised linear hidden Markov models'' to describe the collection of all models under consideration, including those based on the Gaussian, Db and Multinom distributions.

The package fits the models in question by several different methods, namely the EM algorithm \citep{DempsterEtAl1977}, the Levenberg-Marquardt algorithm \citep{Turner2008}, and ``brute force'', which uses either the \texttt{optim()} or the \texttt{nlm()} function to optimise the log likelihood. The Levenberg-Marquardt algorithm, and in certain circumstances the ``brute force'' procedure, require the analytic calculation of the gradient and Hessian of the log likelihood. This calculation is intricate in the hidden Markov model context. (In fact simply calculating the log likelihood is intricate.) Most of this vignette is devoted to the calculation of the first and second derivatives of the log likelihood.

\section{Recursive calculations}
\label{sec:recurse}

The likelihood of a hidden Markov model may feasibly be calculated in terms of the ``forward'' probabilities developed by Baum et al. (see \citealt{BaumEtAl1970}). These probabilities are calculated by means of a recursive procedure which of course depends on the likelihoods of individual observations. These likelihoods, which may be expressed in the form $f(y,\btheta)$, may be either probability density functions or probability mass functions. The symbol $y$ represents an observation (emission) and $\btheta$ represents a vector of parameters upon which the distribution in question depends. These parameters depend in turn on the underlying state of the hidden Markov chain and in general upon other predictors (in addition to ``state'').
The dependence of $\btheta$ upon the predictors will involve further parameters. The derivatives of the log likelihood of the model must therefore, in turn, be calculated via recursive procedures. In order to effect these procedures, we need to calculate the first and second derivatives, with respect to all of the parameters that are involved, of the single observation likelihoods $f(y,\btheta)$.

In the case of the Gaussian distribution $\btheta = (\mu,\sigma)^{\top}$, where $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution. In the cases of the Poisson and Binomial distributions $\btheta$ is actually a scalar (which we consequently write simply as $\theta$). For the Poisson distribution $\theta$ is equal to $\lambda$, the Poisson mean, and for the Binomial distribution $\theta$ is equal to $p$, the binomial success probability. In the case of the Db distribution, $\btheta$ is equal to $(\alpha,\beta)^{\top}$, the vector of ``shape'' parameters of the distribution. In the case of the Multinom distribution, the model (as indicated above) has a rather different structure.

Except in the Gaussian case we assume that $\btheta$ is completely determined by a vector $\bx$ of predictor variables and a vector $\bphi$ of predictor coefficients. We need to determine the first and second derivatives of the likelihood of a single observation with respect to the entries of $\bphi$. In the case of the Gaussian distribution $\btheta$ also includes the values of $\sigma$ corresponding to the different states. In the current implementation of the package these $\sigma$ values are not obtained from the predictor coefficients $\bphi$.

\section{Derivatives specific to each of the distributions}
\label{sec:derivs}

We now provide the details of the calculation of these derivatives for each of the five distributions in question.

\subsection{The Gaussian distribution}
\label{sec:gauss}

We denote the vector of standard deviations by $\bsigma = (\sigma_1, \ldots, \sigma_K)^{\top}$ (where $K$ is the number of states). In the current development we assume that $\sigma_i$ depends only on the state $i$ of the underlying hidden Markov chain (and not on any other predictors included in $\bx$). It is thus convenient to make explicit the dependence of the probability density functions upon the underlying state. We write the probability density function corresponding to state $i$ as
\[
f_i(y) = \frac{1}{\sqrt{2\pi} \sigma_i} \exp \left ( \frac{-(y-\mu)^2}{2\sigma_i^2} \right ) \; .
\]
We model $\mu$ as $\mu = \bx^{\top}\bphi$. Note that consequently $\mu$ depends, in general, upon the state $i$, although this dependence (which arises through $\bx$) is not made explicit in the foregoing expression for $f_i(y)$. We need to differentiate $f_i(y)$ with respect to $\bphi$ and $\bsigma$. It is straightforward, using logarithmic differentiation, to determine that:
\begin{equation}
\begin{split}
\frac{\partial f_i(y)}{\partial \mu} &= f_i(y)\left( \frac{y-\mu}{\sigma_i^2} \right ) \\
\frac{\partial f_i(y)}{\partial \sigma_j} &= \left \{ \begin{array}{ll}
f_i(y)\left( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right)/\sigma_i & \mbox{~if~} j = i\\
0 & \mbox{~if~} j \neq i \end{array} \right .
\\
\frac{\partial^2 f_i(y)}{\partial \mu^2} &= f_i(y) \left( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right )/\sigma_i^2 \\
\frac{\partial^2 f_i(y)}{\partial \sigma_i \partial \sigma_j} &= \left \{ \begin{array}{ll}
f_i(y) \left( \left ( \frac{(y-\mu)^2}{\sigma_i^2} - 1 \right )^2 + 1 - \frac{3(y-\mu)^2}{\sigma_i^2} \right )/\sigma_i^2 & \mbox{~if~} j = i\\
0 & \mbox{~if~} j \neq i \end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} &= \left \{ \begin{array}{ll}
f_i(y) \left (\frac{(y-\mu)^2}{\sigma_i^3} - \frac{3}{\sigma_i} \right )(y-\mu)/\sigma_i^2 & \mbox{~if~} j = i\\
0 & \mbox{~if~} j \neq i \end{array} \right . \; .
\end{split}
\label{eq:muSigPartials}
\end{equation}
Recalling that $\mu = \bx^{\top} \bphi$ we see that
\[
\frac{\partial \mu}{\partial \bphi} = \bx \; .
\]
An application of the chain rule then gives:
\[
\frac{\partial f_i(y)}{\partial \bphi} = \frac{\partial f_i(y)}{\partial \mu} \bx \; .
\]
The second derivatives of $f_i(y)$ with respect to $\bphi$ are given by
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bphi} &= \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f_i(y)}{\partial \mu} \bx \right ) \\
&= \bx \left(\frac{\partial^2 f_i(y)}{\partial \mu^2} \frac{\partial \mu}{\partial \bphi^{\top}} + \frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i} \frac{\partial \sigma_i}{\partial \bphi^{\top}} \right ) \\
&= \left(\frac{\partial^2 f_i(y)}{\partial \mu^2} \right) \bx \bx^{\top}
\end{align*}
since $\partial \sigma_i/\partial \bphi^{\top} = \bzero$. The second derivatives of $f_i(y)$ with respect to $\bphi$ and $\bsigma$ are given by
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \sigma_j} &= \left \{ \begin{array}{cl}
\left(\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} \right) \bx^{\top} & \mbox{~if~} j = i \\[0.25cm]
\bzero^{\top} & \mbox{~if~} j \neq i \end{array} \right . \\
\frac{\partial^2 f_i(y)}{\partial \sigma_j \partial \bphi} &= \left \{ \begin{array}{cl}
\left(\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_j} \right) \bx & \mbox{~if~} j = i \\[0.25cm]
\bzero & \mbox{~if~} j \neq i \end{array} \right . \; .
\end{align*}
Note that
\[
\frac{\partial^2 f_i(y)}{\partial \sigma_i \partial \sigma_j}
\]
is provided in \eqref{eq:muSigPartials}.
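Although not needed for the development, it is easy to check the expressions in \eqref{eq:muSigPartials} numerically. The following R snippet is a minimal sketch of such a check (it is purely illustrative and is not part of the \texttt{eglhmm} code); it compares the analytic first derivatives displayed above with central finite differences at an arbitrarily chosen point.
\begin{verbatim}
## Gaussian single-observation density and the analytic first
## derivatives displayed above; illustrative only.
f        <- function(y, mu, sigma) dnorm(y, mean = mu, sd = sigma)
dfdmu    <- function(y, mu, sigma) f(y, mu, sigma)*(y - mu)/sigma^2
dfdsigma <- function(y, mu, sigma)
               f(y, mu, sigma)*((y - mu)^2/sigma^2 - 1)/sigma

y <- 1.7; mu <- 0.5; sigma <- 1.3; h <- 1e-6
## Central finite differences.
fd.mu    <- (f(y, mu + h, sigma) - f(y, mu - h, sigma))/(2*h)
fd.sigma <- (f(y, mu, sigma + h) - f(y, mu, sigma - h))/(2*h)
all.equal(dfdmu(y, mu, sigma), fd.mu, tolerance = 1e-6)       # TRUE
all.equal(dfdsigma(y, mu, sigma), fd.sigma, tolerance = 1e-6) # TRUE
\end{verbatim}
The second derivatives in \eqref{eq:muSigPartials} can be checked in the same manner.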
The structure of the first and second derivatives of $f_i(y)$ with respect to $\bphi$ and $\bsigma$ can be expressed concisely by letting
\[
\bpsi = \left [ \begin{array}{l} \bsigma\\ \bphi \end{array} \right ]
\]
and then writing
\begin{align*}
\frac{\partial f_i(y)}{\partial \bpsi} &= \left[ \begin{array}{l}
\frac{\partial f_i(y)}{\partial \bsigma} \\[0.1cm]
\frac{\partial f_i(y)}{\partial \bphi} \end{array} \right] \\[0.25cm]
&=\left[ \begin{array}{l}
\frac{\partial f_i(y)}{\partial \sigma_i} \bdelta_i\\[0.1cm]
\frac{\partial f_i(y)}{\partial \mu} \bx \end{array} \right]
\end{align*}
where $\bdelta_i$ is a vector of dimension $K$ whose $i$th entry is 1 and whose other entries are all 0, and
\begin{align*}
\frac{\partial^2 f_i(y)}{\partial \bpsi^{\top} \partial \bpsi} &= \left[ \begin{array}{ll}
\frac{\partial^2 f_i(y)}{\partial \bsigma^{\top} \partial \bsigma} & \frac{\partial^2 f_i(y)}{\partial \bsigma^{\top} \partial \bphi} \\
\frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bsigma} & \frac{\partial^2 f_i(y)}{\partial \bphi^{\top} \partial \bphi} \end{array} \right] \\[0.25cm]
& = \left[ \begin{array}{ll}
\frac{\partial^2 f_i(y)}{\partial \sigma_i^2} \bdelta_i \bdelta_i^{\top} & \frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i} \bdelta_i \bx^{\top} \\
\frac{\partial^2 f_i(y)}{\partial \mu \partial \sigma_i} \bx \bdelta_i^{\top} & \frac{\partial^2 f_i(y)}{\partial \mu^2} \bx \bx^{\top} \\
\end{array} \right] \; .
\end{align*}
Note that the first and second partial derivatives of $f_i(y)$ with respect to $\mu$ and $\sigma_i$ are provided in \eqref{eq:muSigPartials}.

\subsection{The Poisson distribution}
\label{sec:pois}

The likelihood is the probability mass function
\[
f(y) = e^{-\lambda} \frac{\lambda^y}{y!}
\]
for $y = 0, 1, 2, \ldots$. Here $\btheta$ is a scalar, $\theta = \lambda$, and we model $\lambda$ via $\lambda = \exp(\bx^{\top} \bphi)$, where $\bx$ is a vector of predictors and $\bphi$ is a vector of predictor coefficients. The first and second derivatives of $f(y)$ with respect to $\lambda$ are
\begin{align*}
\frac{\partial f(y)}{\partial \lambda} &= f(y) \left (\frac{y}{\lambda} - 1 \right ) \\
\frac{\partial^2 f(y)}{\partial \lambda^2} &= f(y) \left( \left(\frac{y}{\lambda} - 1 \right)^2 - \frac{y}{\lambda^2} \right) \; .
\end{align*}
Since $\lambda = \exp(\bx^{\top} \bphi)$ it follows readily that the first and second derivatives of $\lambda$ with respect to $\bphi$ are $\lambda \bx$ and $\lambda \bx \bx^{\top}$, respectively. Applying the chain rule we get
\begin{align*}
\frac{\partial f(y)}{\partial \bphi} &= \frac{\partial f(y)}{\partial \lambda} \lambda \bx \\
\frac{\partial^2 f(y)}{\partial \bphi^{\top} \partial \bphi} &= \left(\frac{\partial f(y)}{\partial \lambda} \lambda + \frac{\partial^2 f(y)}{\partial \lambda^2} \lambda^2 \right) \bx \bx^{\top} \; .
\end{align*}

\subsection{The Binomial distribution}
\label{sec:bino}

The likelihood is the probability mass function
\[
f(y) = \binom{n}{y} p^y (1-p)^{n-y}
\]
for $y = 0, 1, 2, \ldots, n$, where $n$ is the number of independent binomial trials on which the success count $y$ is based, and $p$ is the probability of success. Here $\btheta$ is a scalar, $\theta = p$, and we model $p$ via $p = h(u)$ where $u = \bx^{\top} \bphi$, with $\bx$ a vector of predictors, $\bphi$ a vector of predictor coefficients, and $h(u)$ the logistic (inverse logit) function $h(u) = (1 + e^{-u})^{-1}$. In what follows we will need the first and second derivatives of the logistic function.
These are given by
\begin{equation}
\begin{split}
h'(u) &= \frac{e^{-u}}{(1+e^{-u})^2} \mbox{~and} \\
h''(u) &= \frac{e^{-u}(e^{-u} - 1)}{(1+e^{-u})^3} \; .
\end{split}
\label{eq:logitDerivs}
\end{equation}
The first and second derivatives of $f(y)$ with respect to $p$ are
\begin{align*}
\frac{\partial f(y)}{\partial p} &= f(y) \left ( \frac{y}{p} - \frac{n-y}{1-p} \right )\\
\frac{\partial^2 f(y)}{\partial p^2} &= f(y) \left ( \left (\frac{y}{p} - \frac{n-y}{1-p} \right)^2 - \frac{y}{p^2} - \frac{n-y}{(1-p)^2} \right ) \; .
\end{align*}
Since $p = h(\bx^{\top} \bphi)$ we see that
\begin{align*}
\frac{\partial p}{\partial \bphi} &= h'(\bx^{\top} \bphi) \bx \mbox{~ and}\\
\frac{\partial^2 p}{\partial \bphi^{\top} \partial \bphi} &= h''(\bx^{\top} \bphi) \bx \bx^{\top} \; .
\end{align*}
Applying the chain rule we see that
\begin{align*}
\frac{\partial f(y)}{\partial \bphi} &= \frac{\partial f(y)}{\partial p} h'(\bx^{\top} \bphi) \bx \mbox{~ and}\\
\frac{\partial^2 f(y)}{\partial \bphi^{\top} \partial \bphi} &= \left ( \frac{\partial f(y)}{\partial p} h''(\bx^{\top} \bphi) + \frac{\partial^2 f(y)}{\partial p^2} \left(h'(\bx^{\top} \bphi)\right)^2 \right) \bx \bx^{\top} \; .
\end{align*}
Recall that expressions for $h'(\cdot)$ and $h''(\cdot)$ are given by \eqref{eq:logitDerivs}.

\subsection{The Db distribution}
\label{sec:dbd}

The likelihood is the probability mass function, which depends on a vector of parameters $\btheta = (\alpha,\beta)^{\top}$ and is somewhat complicated to write down. In order to obtain an expression for this probability mass function we need to define
\begin{align*}
h_0(y) &= (y(1-y))^{-1} \\
h(y) &= h_0((y - \nbot + 1)/(\ntop - \nbot + 2)) \\
T_1(y) &= \log((y - \nbot + 1)/(\ntop - \nbot + 2)) \\
T_2(y) &= \log((\ntop - y + 1)/(\ntop - \nbot + 2)) \\
A(\alpha,\beta) &= \log \left( \sum_{i=\nbot}^{\ntop} h(i) \exp\{\alpha T_1(i) + \beta T_2(i) \} \right) \; ,
\end{align*}
where $\nbot$ and $\ntop$ are respectively the smallest and largest possible values of the Db variate. Given these definitions, the probability mass function of the Db distribution can be written as
\[
f(y,\alpha,\beta) = \Pr(X=y \mid \alpha, \beta) = h(y) \exp\{\alpha T_1(y) + \beta T_2(y) - A(\alpha,\beta)\} \; .
\]
We model $\alpha$ and $\beta$ via
\begin{align*}
\alpha &= \bx^{\top} \bphi_1 \\
\beta &= \bx^{\top} \bphi_2
\end{align*}
where $\bx$ is a vector of predictors and $\bphi_1$ and $\bphi_2$ are vectors of predictor coefficients. The vector $\bphi$, with respect to which we seek to differentiate the likelihood, is the catenation of $\bphi_1$ and $\bphi_2$.
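Before turning to the derivatives it may be helpful to see the probability mass function written out computationally. The following R function is a direct, stand-alone transcription of the definitions above; it is purely illustrative (the name \texttt{dbPmf} and its argument names are ours, and the snippet is not \texttt{eglhmm} code).
\begin{verbatim}
## The Db probability mass function, transcribed directly from the
## definitions of h, T_1, T_2 and A given above.  The arguments nbot
## and ntop are the smallest and largest possible values of y.
dbPmf <- function(y, alpha, beta, nbot, ntop) {
    h0 <- function(u) 1/(u*(1 - u))
    h  <- function(z) h0((z - nbot + 1)/(ntop - nbot + 2))
    T1 <- function(z) log((z - nbot + 1)/(ntop - nbot + 2))
    T2 <- function(z) log((ntop - z + 1)/(ntop - nbot + 2))
    i  <- nbot:ntop
    A  <- log(sum(h(i)*exp(alpha*T1(i) + beta*T2(i))))
    h(y)*exp(alpha*T1(y) + beta*T2(y) - A)
}
## The probabilities sum to 1 over the support {nbot, ..., ntop}:
sum(dbPmf(0:5, alpha = 1.2, beta = -0.3, nbot = 0, ntop = 5))  # 1
\end{verbatim}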
The first derivative of the likelihood with respect to $\bphi$ is \begin{align*} \frac{\partial f}{\partial \bphi} &= \frac{\partial f}{\partial \alpha} \frac{\partial \alpha}{\partial \bphi} + \frac{\partial f}{\partial \beta} \frac{\partial \beta}{\partial \bphi} \\ &= \frac{\partial f}{\partial \alpha} \left [ \begin{array}{c} \frac{\partial \alpha}{\partial \bphi_1} \\ \bzero \end{array} \right ] + \frac{\partial f}{\partial \beta} \left [ \begin{array}{c} \bzero \\ \frac{\partial \beta}{\partial \bphi_2} \end{array} \right ] \\ &= \frac{\partial f}{\partial \alpha} \left [ \begin{array}{c} \bx \\ \bzero \end{array} \right ] + \frac{\partial f}{\partial \beta} \left [ \begin{array}{c} \bzero \\ \bx \end{array} \right ] \\ &= \left [ \begin{array}{c} \frac{\partial f}{\partial \alpha} \bx \\ \frac{\partial f}{\partial \beta} \bx \end{array} \right ] \end{align*} The second derivative is calculated as \[ \frac{\partial^2 f}{\partial \bphi^{\top} \partial \bphi} = \left [ \begin{array}{c} \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \alpha} \bx \right) \\ \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \beta} \bx \right) \end{array} \right ] \; . \] Taking this expression one row at a time we see that \begin{align*} \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \alpha} \right) &= \left [ \begin{array}{lr} \frac{\partial}{\partial \bphi_1^{\top}} \left ( \frac{\partial f}{\partial \alpha} \right) & \frac{\partial}{\partial \bphi_2^{\top}} \left ( \frac{\partial f}{\partial \alpha} \right) \end{array} \right ] \\ &= \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \alpha^2} \frac{\partial \alpha}{\partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \beta \partial \alpha} \frac{\partial \beta}{\partial \bphi_2^{\top}} \end{array} \right ] \\ &= \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \alpha^2} \bx^{\top} & \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx^{\top} \end{array} \right ] \mbox{~and likewise}\\ \frac{\partial}{\partial \bphi^{\top}} \left ( \frac{\partial f}{\partial \beta} \right) & = \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx^{\top} & \frac{\partial^2 f}{\partial \beta^2} \bx^{\top} \end{array} \right ] \; . \end{align*} Combining the foregoing we get \[ \frac{\partial^2 f}{\partial \bphi^{\top} \partial \bphi} = \left [ \begin{array}{lr} \frac{\partial^2 f}{\partial \alpha^2} \bx \bx^{\top} & \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx \bx^{\top} \\[0.25cm] \frac{\partial^2 f}{\partial \beta \partial \alpha} \bx \bx^{\top} & \frac{\partial^2 f}{\partial \beta^2} \bx \bx^{\top} \end{array} \right ] \; . 
\]
As was the case for the three preceding distributions, it is expedient to express the partial derivatives of $f(y,\alpha,\beta)$, with respect to the parameters of the distribution, in terms of $f(y,\alpha,\beta)$. The required expressions are as follows:
\begin{align*}
\frac{\partial f}{\partial \alpha} &= f(y,\alpha,\beta) \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right )\\
\frac{\partial f}{\partial \beta} &= f(y,\alpha,\beta) \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right )\\
\frac{\partial^2 f}{\partial \alpha^2} &= f(y,\alpha,\beta) \left [ \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right )^2 - \frac{\partial^2 A}{\partial \alpha^2} \right ]\\
\frac{\partial^2 f}{\partial \alpha \partial \beta} &= f(y,\alpha,\beta) \left [ \left ( T_1(y) - \frac{\partial A}{\partial \alpha} \right ) \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right ) - \frac{\partial^2 A}{\partial \alpha \partial \beta} \right ] \\
\frac{\partial^2 f}{\partial \beta^2} &= f(y,\alpha,\beta) \left [ \left ( T_2(y) - \frac{\partial A}{\partial \beta} \right )^2 - \frac{\partial^2 A}{\partial \beta^2} \right ] \; .
\end{align*}
\newpage
It remains to provide expressions for the partial derivatives of $A$ with respect to $\alpha$ and $\beta$. Let
\[
E = \exp(A) = \sum_{i=\nbot}^{\ntop} h(i) \exp\{\alpha T_1(i) + \beta T_2(i) \} \; .
\]
Clearly
\begin{align*}
\frac{\partial A}{\partial \alpha} &= \frac{1}{E} \frac{\partial E}{\partial \alpha} \\
\frac{\partial A}{\partial \beta} &= \frac{1}{E} \frac{\partial E}{\partial \beta} \\
\frac{\partial^2 A}{\partial \alpha^2} &= \frac{1}{E} \frac{\partial^2 E}{\partial \alpha^2} - \frac{1}{E^2} \left (\frac{\partial E}{\partial \alpha} \right )^2 \\
\frac{\partial^2 A}{\partial \alpha \partial \beta} &= \frac{1}{E} \frac{\partial^2 E}{\partial \alpha \partial \beta} - \frac{1}{E^2} \left (\frac{\partial E}{\partial \alpha} \frac{\partial E}{\partial \beta} \right ) \\
\frac{\partial^2 A}{\partial \beta^2} &= \frac{1}{E} \frac{\partial^2 E}{\partial \beta^2} - \frac{1}{E^2} \left (\frac{\partial E}{\partial \beta} \right )^2 \; .
\end{align*}
\enlargethispage{1\baselineskip}
Finally, the relevant partial derivatives of $E$ are:
\begin{align*}
\frac{\partial E}{\partial \alpha} &= \sum_{i=\nbot}^{\ntop} h(i) T_1(i) \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial E}{\partial \beta} &= \sum_{i=\nbot}^{\ntop} h(i) T_2(i) \exp(\alpha T_1(i) + \beta T_2(i))\\
\frac{\partial^2 E}{\partial \alpha^2} &= \sum_{i=\nbot}^{\ntop} h(i) T_1(i)^2 \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial^2 E}{\partial \alpha \partial \beta} &= \sum_{i=\nbot}^{\ntop} h(i) T_1(i) T_2(i) \exp(\alpha T_1(i) + \beta T_2(i)) \\
\frac{\partial^2 E}{\partial \beta^2} &= \sum_{i=\nbot}^{\ntop} h(i) T_2(i)^2 \exp(\alpha T_1(i) + \beta T_2(i)) \; .
\end{align*}

\subsection{The Multinom distribution}
\label{sec:multinom}

This distribution is very different from those with which we have previously dealt. It is defined effectively in terms of \emph{tables}. In the hidden Markov model context, these tables take the form
\[
\Pr(Y = y_i \mid S = k) = \rho_{ik}
\]
where $Y$ is the emissions variate, its possible values or ``levels'' are $y_1, y_2, \ldots, y_m$, and $S$ denotes ``state'', which (without loss of generality) takes values $1, 2, \ldots, K$. Of course $\sum_{i=1}^{m} \rho_{ik} = 1$ for all $k$. We shall denote $\Pr(Y = y \mid S = k)$ by $f_k(y)$, so that $f_k(y_i) = \rho_{ik}$.
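For concreteness, in a toy setting with $m = 3$ levels of $Y$ and $K = 2$ states such a table might look as follows. The numbers are arbitrary and purely illustrative; all that matters is that each column sums to 1.
\begin{verbatim}
## An illustrative emission probability table Rho, with rows indexed
## by the levels of Y and columns indexed by state; entries are made up.
Rho <- matrix(c(0.2, 0.5, 0.3,
                0.6, 0.1, 0.3),
              nrow = 3, ncol = 2,
              dimnames = list(level = c("y1", "y2", "y3"),
                              state = c("1", "2")))
colSums(Rho)  # both columns sum to 1
\end{verbatim}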
The maximisation of the likelihood with respect to the $\rho_{ik}$ is awkward, due to the ``sum-to-1'' constraints that they must satisfy, and it is better to impose these constraints ``smoothly'' via a logistic parameterisation. See \cite{Turner2008}. Such a parameterisation allows us to express the dependence of the emissions probabilities upon ``state'' in terms of linear predictors. This in turn opens up the possibility of including other predictors, in addition to those determined by ``state'', in the model.

To this end we define vectors of parameters $\bphi_i$, $i = 1, \ldots, m$, corresponding to each of the possible values of $Y$. For identifiability we take $\bphi_m$ to be identically 0. Each $\bphi_i$ is a vector of length $np$, say, where $np$ is the number of predictors. If, in a $K$-state model, there are no predictors other than those determined by state, then $np = K$. In this case there are $K \times (m-1)$ ``free'' parameters, just as there should be (and just as there are in the original parameterisation in terms of the $\rho_{ik}$). Let the $j$th entry of $\bphi_i$ be $\phi_{ij}$, $j = 1,\ldots,np$. Let $\bphi$ be the vector consisting of the catenation of all of the $\phi_{ij}$, excluding the entries of $\bphi_m$ which are all 0:
\[
\bphi = (\phi_{11}, \phi_{12}, \ldots, \phi_{1,np}, \phi_{21}, \phi_{22}, \ldots, \phi_{2,np}, \ldots, \phi_{m-1,1}, \phi_{m-1,2}, \ldots, \phi_{m-1,np})^{\top} \; .
\]
Let $\bx$ be a vector of predictors. In terms of the foregoing notation, $f_k(y)$ can be written as
\[
f_k(y) = \frac{e^{\bx^{\top} \bphi_y}}{Z}
\]
where in turn
\[
Z = \sum_{\ell = 1}^{m} e^{\bx^{\top} \bphi_{\ell}} \; .
\]
The dependence of $f_k(y)$ upon the state $k$ is implicit in the predictor vector $\bx$, which includes predictors indicating state. We now calculate the partial derivatives of $f_k(y)$ with respect to $\bphi$. First note that $\frac{\partial f_k(y)}{\partial \bphi}$ can be written as
\[
\left [ \begin{array}{c}
\frac{\partial f_k}{\partial \bphi_1} \\[0.25cm]
\frac{\partial f_k}{\partial \bphi_2} \\
\vdots \\
\frac{\partial f_k}{\partial \bphi_{m-1}}
\end{array} \right ] \;.
\]
Next we calculate
\[
\frac{\partial f_k(y)}{\partial \bphi_i}, \mbox{~~} i = 1, \ldots, m-1 \; .
\]
Using logarithmic differentiation we see that
\[
\frac{1}{f_k(y)} \frac{\partial f_k(y)}{\partial \bphi_i} = \delta_{yi} \bx - \frac{1}{Z} e^{\bx^{\top} \bphi_i} \bx \; ,
\]
where $\delta_{yi}$ is equal to 1 if $y = y_i$ and to 0 otherwise, so that
\[
\frac{\partial f_k(y)}{\partial \bphi_i} = f_k(y) \left ( \delta_{yi} - \frac{e^{\bx^{\top} \bphi_i}}{Z} \right ) \bx
\]
which can be written as $f_k(y)(\delta_{yi} - f_k(i)) \bx$.
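Before summarising, a quick numerical check of this gradient formula may be reassuring. The following R snippet is a minimal sketch (all values are arbitrary and none of it is \texttt{eglhmm} code); for simplicity the dependence on state through $\bx$ is suppressed, and we take $m = 3$ (so that $\bphi_3 = \bzero$).
\begin{verbatim}
## Emission probabilities under the logistic parameterisation, with
## m = 3 levels and an arbitrary two-entry predictor vector x.
x   <- c(1, 0.4)
Phi <- cbind(c(0.3, -0.7), c(1.1, 0.2), c(0, 0))  # columns are phi_1, phi_2, phi_3
fk  <- function(y, Phi) {          # probability of level y
    e <- exp(drop(x %*% Phi))
    e[y]/sum(e)
}
y <- 2                             # the observed level
## Analytic gradient with respect to phi_1 (here delta_{y1} = 0):
grad1 <- fk(y, Phi)*((y == 1) - fk(1, Phi))*x
## Finite-difference check of its first entry:
h <- 1e-6
Pp <- Phi; Pp[1, 1] <- Pp[1, 1] + h
Pm <- Phi; Pm[1, 1] <- Pm[1, 1] - h
c(analytic = grad1[1], numeric = (fk(y, Pp) - fk(y, Pm))/(2*h))
\end{verbatim}
The two displayed values agree to within the accuracy of the finite difference approximation.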
In summary we have
\[
\frac{\partial f_k(y)}{\partial \bphi} = f_k(y) \left [ \begin{array}{l}
( \delta_{y1} - f_k(1) ) \bx \\
( \delta_{y2} - f_k(2) ) \bx \\
\multicolumn{1}{c}{\vdots} \\
( \delta_{y,m-1} - f_k(m-1) ) \bx
\end{array} \right ] \; .
\]
The second derivatives of $f_k(y)$ with respect to $\bphi$ are given by
\[
\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}} = \left [ \begin{array}{llcl}
\frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_2^{\top}} & \ldots & \frac{\partial^2 f}{\partial \bphi_1 \partial \bphi_{m-1}^{\top}} \\[0.5cm]
\frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_2^{\top}} & \ldots & \frac{\partial^2 f}{\partial \bphi_2 \partial \bphi_{m-1}^{\top}} \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} \\
\frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_1^{\top}} & \frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_2^{\top}} & \ldots & \frac{\partial^2 f}{\partial \bphi_{m-1} \partial \bphi_{m-1}^{\top}}
\end{array} \right] \; .
\]
The $(i,j)$th entry of $\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}}$, i.e.\ $\frac{\partial^2 f}{\partial \bphi_i \partial \bphi_j^{\top}}$, is given by
\begin{align*}
\frac{\partial}{\partial \bphi_i} \left ( \frac{\partial f_k(y)}{\partial \bphi_j^{\top}} \right )
&=\frac{\partial}{\partial \bphi_i} \left (f_k(y)(\delta_{yj} - f_k(j))\bx^{\top} \right) \\
&= f_k(y)\left(0 - f_k(j)(\delta_{ij} - f_k(i))\right)\bx\bx^{\top} + f_k(y)(\delta_{yi} - f_k(i))\bx \, (\delta_{yj}- f_k(j))\bx^{\top} \\
&= f_k(y)\left(-f_k(j)(\delta_{ij} - f_k(i)) + (\delta_{yi} - f_k(i))(\delta_{yj} - f_k(j))\right) \bx \bx^{\top} \\
&= f_k(y)\left(f_k(i)f_k(j) - \delta_{ij}f_k(j) + (\delta_{yi} - f_k(i)) (\delta_{yj} - f_k(j))\right) \bx \bx^{\top} \; .
\end{align*}
At first glance this expression seems to be anomalously asymmetric in $i$ and $j$, but the asymmetry is illusory. Note that when $i \neq j$, $\delta_{ij}f_k(j)$ is 0, and when $i=j$, $\delta_{ij}f_k(j) = f_k(j) = f_k(i)$. In summary we see that
\[
\frac{\partial^2 f}{\partial \bphi \partial \bphi^{\top}} = \left [ \begin{array}{llcl}
a_{11} \bx \bx^{\top} & a_{12} \bx \bx^{\top} & \ldots & a_{1,m-1} \bx \bx^{\top} \\
a_{21} \bx \bx^{\top} & a_{22} \bx \bx^{\top} & \ldots & a_{2,m-1} \bx \bx^{\top} \\
\multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} & \multicolumn{1}{c}{\vdots} \\
a_{m-1,1} \bx \bx^{\top} & a_{m-1,2} \bx \bx^{\top} & \ldots & a_{m-1,m-1} \bx \bx^{\top}
\end{array} \right ]
\]
where $a_{ij} = f_k(y)\left(f_k(i)f_k(j) - \delta_{ij}f_k(j) + (\delta_{yi} - f_k(i))(\delta_{yj} - f_k(j))\right)$, $i, j = 1, \ldots, m - 1$.

\newpage
\bibliography{eglhmm}
\bibliographystyle{plainnat}
\end{document}