% File apc/vignettes/CheckPrediction.Rnw % Part of the apc package, http://www.R-project.org % Copyright 2015 Bent Nielsen % Distributed under GPL 3 %\VignetteIndexEntry{Identification: illustrate and check identification used in plot fit function} \documentclass[a4paper,twoside,12pt]{article} \usepackage[english]{babel} \usepackage{booktabs,rotating,graphicx,amsmath,verbatim,fancyhdr,Sweave} \usepackage[colorlinks,linkcolor=red,urlcolor=blue]{hyperref} \newcommand{\R}{\textsf{\bf R}} \renewcommand{\topfraction}{0.95} \renewcommand{\bottomfraction}{0.95} \renewcommand{\textfraction}{0.1} \renewcommand{\floatpagefraction}{0.9} \DeclareGraphicsExtensions{.pdf,.jpg} \setcounter{secnumdepth}{1} \setcounter{tocdepth}{1} \oddsidemargin 1mm \evensidemargin 1mm \textwidth 160mm \textheight 230mm \topmargin -5mm \headheight 8mm \headsep 5mm \footskip 15mm \begin{document} \SweaveOpts{concordance=TRUE} \raggedleft \pagestyle{empty} \vspace*{0.1\textheight} \Huge {\bf Identification in \texttt{apc} package} \noindent\rule[-1ex]{\textwidth}{5pt}\\[2.5ex] \Large 12 April 2015 \vfill \normalsize \begin{tabular}{rl} Bent Nielsen & Department of Economics, University of Oxford \\ & \small \& Nuffield College \\ & \small \& Institute for Economic Modelling \\ & \normalsize \texttt{bent.nielsen@nuffield.ox.ac.uk} \\ & \url{http://users.ox.ac.uk/~nuff0078} \end{tabular} \normalsize \newpage \raggedright \parindent 3ex \parskip 0ex \tableofcontents \cleardoublepage \setcounter{page}{1} \pagestyle{fancy} \renewcommand{\sectionmark}[1]{\markboth{\thesection #1}{\thesection \ #1}} \fancyhead[OL,ER]{\sl Identification in \texttt{apc} package.} %\fancyhead[ER]{\sl \rightmark} \fancyhead[EL,OR]{\bf \thepage} \fancyfoot{} \renewcommand{\headrulewidth}{0.1pt} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The \texttt{apc} package works with a number of parametrisations. The estimation is done in terms of the canonical parameter defined in Kuang, Nielsen and Nielsen (2008). This parametrisation is invariant to the group of transformations characterising the identification problem. The plots of the fit do, however, allow the display of a two ad hoc identifications of the time effects. These time effects are not invariant, but may nonetheless be helpful in applications. The purpose with the vignette is to demonstrate that these parametrisations are equivalent. This is done by showing that they give the same fit. The two ad hoc identifications are as follows. The first, "sum.sum", displays the components that appear in the representation theorem in Nielsen (2014), which is a development of the representation in Kuang, Nielsen and Nielsen (2008). The second, "detrend" is possibly more useful in practice. This is the default option in \texttt{apc.plot.fit}. It displays detrended versions of the double sums, so that they start and end in zero. Thus, it displays the cumulated deviations from linearity arising from the double difference parameters in canonical parameter. Moreover, the three time effects are treated symmetrically so that the identification of one time trend has no bearing on the other time trends. This is based on the development in Nielsen (2014). The two ad hoc identifications have the feature that the number of non-zero elements is the same as the number of elements in the canonical parameter. We use the Belgian lung cancer data from Clayton and Schifflers (1987a) as an example. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Ad hoc identification of the time effects} The age-period-cohort model is given by \begin{equation} \mu_{ik} =\alpha_i +\beta_j +\gamma_k +\delta , \end{equation} for age $i$, cohort $k$ and period $j=i+k-1$. The time effects $\alpha_i$, $\beta_j$, $\gamma_k$ are not fully identified. \textbf{Canonical parameter.} The \texttt{apc} package works around the canonical parameter from Kuang, Nielsen, Nielsen (2008) \begin{equation} \xi = (\mu_{UU}, \mu_{U+1,U} , \mu_{U,U+1}, \dots , \Delta^2\alpha_i , \dots , \dots , \Delta^2\beta _j , \dots , \dots , \Delta^2\gamma_k , \dots )' . \end{equation} The design matrix is defined from the following representation taken from Nielsen (2014) \begin{equation} \mu_{ik} = \nu_0 +(i-U)\nu_a +(k-U)\nu_c +A_i +B_j +C_k \label{mu:xi} \end{equation} where $j=i+k-1$ and $U=\textrm{integer}\{(L+3)/2\}$ while \begin{eqnarray} A_i &=& 1_{(iU+1)} \sum_{t=U+2}^i \sum_{s=U+2}^t \Delta^2\alpha_s \label{A} \\ B_j &=& 1_{(L\text{ odd \& }j=2U-2)}\Delta^2\beta_{2U} + 1_{(j>2U)} \sum_{t=2U+1}^j \sum_{s=2U+1}^t \Delta^2\beta_s \label{B} \\ C_k &=& 1_{(kU+1)} \sum_{t=U+2}^k \sum_{s=U+2}^t \Delta^2\gamma_s \label{C} \end{eqnarray} Six of these coefficients are zero. That is $A_U=A_{U+1}=C_U=C_{U+1}=0$, if $L$ is odd then $B_{2U+1}=B_{2U+2}=0$, whereas if $L$ is even then $B_{2U}=B_{2U+1}=0$. We get the canonical parameter through the command <<>>= # attach apc library library(apc) # get data from precoded function data <- data.Belgian.lung.cancer() # Estimate APC model model.family <- "poisson.dose.response" model.design <- "APC" fit <- apc.fit.model(data,model.family,model.design) c.c <- fit$coefficients.canonical @ \textbf{ss.dd} or \textbf{sum.sum} parametrisation. We can also parametrise in terms of the coefficients $A_i$, $B_j$, $C_k$. These coefficients are ad hoc identified versions of the time effects $\alpha_i$, $\beta_j$, $\gamma_k$. We get these parameters through the command \texttt{apc.identify}. Note, that this command is called indirectly by \texttt{apc.plot.fit}, so there is no need to use this command when working with plots only. The command is <<>>= id <- apc.identify(fit) c.ssdd <- id$coefficients.ssdd @ \textbf{detrend} parametrisation. In practise it is easier to see deviations from linearity through a different ad hoc identification where the time effects are constrained to start and end in zero; see Nielsen (2014b) for discussion of the interpretation. From $A_i$, $B_j$ and $C_k$ defined in (\ref{A})-(\ref{C}) we form \begin{eqnarray} A^{detrend}_i &=& A_i-A_1 - \frac{i-1}{I-1}(A_I-A_1) , \label{Adetrend} \\ B^{detrend}_{L+j} &=& B_{L+j}-B_{L+1} - \frac{j-1}{J-1}(B_{L+J}-B_{L+1}) , \\ C^{detrend}_k &=& C_k-C_1 - \frac{k-1}{K-1}(C_K-C_1) . \label{Cdetrend} \end{eqnarray} The constants and linear slopes are corrected correspondingly giving \begin{eqnarray} \nu_0^{detrend} &=& \nu_0 + A_1 + B_{L+1} - \frac{L}{J+1}(B_{L+J}-B_{L+1}) + C_1 , \\ \nu_a^{detrend} &=& \nu_a + \frac{A_I-A_1}{I-1} + \frac{B_{L+J}-B_{L+1}}{J-1} \\ \nu_c^{detrend} &=& \nu_c + \frac{B_{L+J}-B_{L+1}}{J-1} + \frac{C_K-C_1}{K-1} \end{eqnarray} We then get the representation \begin{eqnarray} \mu_{ik} &=& \nu_0^{detrend} +(i-1) \nu_a^{detrend} +(k-1) \nu_c^{detrend} +A_i^{detrend} +B_{L+j}^{detrend} +C_k^{detrend} \label{mu:xi_detrend} \end{eqnarray} These coefficients are also found using the identification command \texttt{apc.identify}, so that <<>>= c.detrend <- id$coefficients.detrend @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Ad hoc identification for 2-factor models} When one of the three time effects, $\alpha_i$, $\beta_j$ or $\gamma_k$ is absent the linear trends can be attributed uniquely to the time effects, whereas the level is still unidentified. The \texttt{apc} package offers a parametrisation of that type. \textbf{demean} parametrisation. The model design "AC" is a two factor model. The representation derives from the detrended representation (\ref{mu:xi_detrend}). With $\Delta^2\beta_j=0$ then $B_j=0$ and hence $B^{detrend}_j=0$ in that formula. Thus we can attribute the linear trends to the age and cohort time effect. This is done so that the first element of each time effect is zero. This identification choice is called the demeaned parameter. \begin{eqnarray} \mu_{ik} &=& \nu_0^{demean} +A_i^{demean} +C_k^{demean} \label{mu:xi_demean} \end{eqnarray} where \begin{eqnarray} A_i^{demean} &=& A_i^{detrend} +(i-1) \nu_a^{detrend} \qquad C_k^{demean} = C_k^{detrend} +(k-1) \nu_c^{detrend} \end{eqnarray} Note that $A_1^{demean}=C_1^{demean}=0$. We get the single sum of difference parameters through the following command <<>>= # fit AC model model.design <- "AC" fit.ac <- apc.fit.model(data,model.family,model.design) # identify to get sums of difference parameters id.ac <- apc.identify(fit.ac) c.demean <- id.ac$coefficients.demean @ We get an exactly, invariant identified parameter by taking first differences of $A_i^{demean},C_k^{demean}$. This could also have been used as canonical parameter for this two factor model. The level is as before. <<>>= # get difference parameters c.dif <- id.ac$coefficients.dif @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Check canonical parametrisation} The fit of the canonical parametrisation is discussed. In the background the generalized linear model package is used. \texttt{apc} supplies a design matrix that is fed into \texttt{glm.fit}. This delivers the following standard output including coefficients and linear predictors. <<>>= # Coefficients fit$coefficients # Arrange linear predictors as matrix in original format # Create matrix of original dimension m.fit <- fit$response m.fit[fit$index.data] <-fit$linear.predictors m.fit @ The coefficients are equal to the canonical parameters. We get the summary of these and check that the coefficients are the same as for the generalized linear model estimation. <<>>= # Canonical paramters c.c # Check canonical coefficients are the same as the standard coefficients sum(abs(c.c[,1]-fit$coefficients)) @ We check that the canonical parameters give the same fit as the standard generalized linear model fit. We get the design matrix and multiply with the canonical parameter <<>>= # get design matrix m.design <- apc.get.design(fit)$design # create matrix of original dimension m.fit.canonical.no.dose <- fit$response m.fit.canonical.no.dose[fit$index.data] <- m.design %*% c.c[,1] if(is.null(data$dose)==TRUE) m.fit.canonical <- m.fit.canonical.no.dose if(is.null(data$dose)==FALSE) m.fit.canonical <- m.fit.canonical.no.dose + log(data$dose) m.fit.canonical # Check canonical coefficients give same fit as standard fit sum(abs(m.fit-m.fit.canonical),na.rm=TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Check identification of double sums of double differences} We check the ad hoc identification in terms of the double sums of double differences (\ref{A})-(\ref{C}). We first get the cooefficients through <<>>= c.ssdd @ Compare the output for \texttt{c.ssdd} with that for \texttt{c.c} above. \texttt{c.ssdd} has more elements as some zero elements are squeezed in according to the indicator functions in (\ref{mu:xi}). The coefficients and standard errors neighbouring the zero elements can be recognised directly from \texttt{c.ssdd}. The number of non-zero elements is the same. The prediction based on these parameters is found as follows. An age-cohort indexation is needed for the data. This is generated by the command \texttt{apc.get.index}. It is automatically linked as an object for fit. We can therefore generate the linear prediction as follows and compare with the previous predictions. Note, this first check requires that model design is APC. <<>>= age <- fit$index.trap[,1] coh <- fit$index.trap[,2] # From this we get the period. Need to correct for lowest period value. per.zero <- fit$per.zero per <- age+coh-1-per.zero U <- fit$U # Then we can compute the prediction as a vector if(model.design=="APC") { prediction <- c.ssdd[1,1] + + c.ssdd[2,1]*(age-U) + + c.ssdd[3,1]*(coh-U) + + c.ssdd[id$index.age.max[age],1] + + c.ssdd[id$index.per.max[per],1] + + c.ssdd[id$index.coh.max[coh],1] # Then we embed it into a matrix m.fit.ssdd <- fit$response m.fit.ssdd[fit$index.data] <- prediction # Add dose m.fit.ssdd <- m.fit.ssdd + log(data$dose) # Check fit is correct sum(abs(m.fit.canonical-m.fit.ssdd),na.rm=TRUE) } @ The above code will fail if \texttt{model.design} is not APC. We then need to introduce various checks as in the following snippet, that recycles \texttt{age}, \texttt{per}, \texttt{coh}, \texttt{per.zero}, \texttt{U} defined above. <<>>= # We need two further variables slopes <- fit$slopes difdif <- fit$difdif # Compute the prediction as a vector prediction <- c.ssdd[1,1] # Add the age double differences and age slope if(difdif[1]) # TRUE if age double differences prediction <- prediction + c.ssdd[id$index.age.max[age],1] if(difdif[2]) # TRUE if period double differences prediction <- prediction + c.ssdd[id$index.per.max[per],1] if(difdif[3]) # TRUE if cohort double differences prediction <- prediction + c.ssdd[id$index.coh.max[coh],1] if(slopes[1]) # TRUE if age linear trend { prediction <- prediction + c.ssdd[2,1]*(age-U) if(slopes[3]) prediction <- prediction + c.ssdd[3,1]*(coh-U) } if(slopes[1]==FALSE) { if(slopes[2]) prediction <- prediction + c.ssdd[2,1]*(per-1) if(slopes[3]) prediction <- prediction + c.ssdd[2,1]*(coh-U) } # Then we embed it into a matrix m.fit.ssdd <- fit$response m.fit.ssdd[fit$index.data] <- prediction # Add dose if(is.null(data$dose)==FALSE) m.fit.ssdd <- m.fit.ssdd + log(data$dose) # Check fit is correct sum(abs(m.fit.canonical-m.fit.ssdd),na.rm=TRUE) @ We can plot these coefficients using \texttt{apc.plot.fit} and the reconcile with the prediction as follows. <<>>= apc.plot.fit(fit,type="sum.sum") m.fit.canonical.no.dose @ The anchoring point is for $age=cohort=U=6$ and $period=L+1$, or in real coordinates $age="50-54"$ and $period="1955-1959"$ corresponding to cohort $1905$. The predictor is $1.9574$. This is the sum of the entries in panels (d)-(i) of the plot, which are all zero apart from the constant. %%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Check identification of detrended double sums of double differences} The double differences capture variation in the time trends over and above linear trends. Thus, when considering double sums of double differences only variation over and above linear trends should be interpreted. To emphasize this variation, Nielsen (2014) suggests to detrend the double sums so that they start and end in zero as outlined in (\ref{mu:xi:detrend}). Some interpretations of those detrended double sums are also offered in that paper. The prediction based on these parameters is found as follows. <<>>= # We use age, coh, per, per.zero, id, slopes, difdif defined above. # Compute the prediction as a vector prediction <- c.detrend[1,1] # Add the age double differences and age slope if(difdif[1]) # TRUE if age double differences prediction <- prediction + c.detrend[id$index.age.max[age],1] if(difdif[2]) # TRUE if period double differences prediction <- prediction + c.detrend[id$index.per.max[per],1] if(difdif[3]) # TRUE if cohort double differences prediction <- prediction + c.detrend[id$index.coh.max[coh],1] if(slopes[1]) # TRUE if age linear trend { prediction <- prediction + c.detrend[2,1]*(age-1) if(slopes[3]) prediction <- prediction + c.detrend[3,1]*(coh-1) } if(slopes[1]==FALSE) { if(slopes[2]) prediction <- prediction + c.detrend[2,1]*(per-1) if(slopes[3]) prediction <- prediction + c.detrend[2,1]*(coh-1) } # Then we embed it into a matrix m.fit.detrend <- fit$response m.fit.detrend[fit$index.data] <- prediction # Add dose if(is.null(data$dose)==FALSE) m.fit.detrend <- m.fit.detrend + log(data$dose) # Check fit is correct sum(abs(m.fit.canonical-m.fit.detrend),na.rm=TRUE) @ We can plot these coefficients using \texttt{apc.plot.fit} and the reconcile with the prediction as follows. <<>>= apc.plot.fit(fit) m.fit.canonical.no.dose @ We consider the top left point of the prediction matrix with real coordinates $age="25-29"$ and $period="1970-74"$ corresponding to cohort $1945$. In mathematical coordinates, that is $age=1$, $cohort=coh.max$, $period=per.max$. Thus all double difference sums in panels (g)-(i) and age slope in panel (d) is zero. The level in panel (e) is -2.34. The cohort slope value in (f) is $0.68=(14-1)*0.052$. In total this is $-1.66$. %%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Check identification for model design "AC"} We now check the identification for the two factor model. To check the parametrisation we first recalculate the fit using the canonical parameter. Then the alternative parametrisations are checked <<>>= if(model.design=="AC") { ################################### # Get fit of canonical parameters # get the canonical parameters c.c.ac <- fit.ac$coefficients.canonical # Get design matrix m.design.ac <- apc.get.design(fit.ac)$design # Create matrix of original dimension m.fit.canonical.ac <- fit.ac$response m.fit.canonical.ac[fit.ac$index.data] <- m.design.ac %*% c.c.ac[,1] ##################################### # Get fit of sum of difference parameters prediction <- c.demean[1,1] + + c.demean[id.ac$index.age.sub[age],1] + + c.demean[id.ac$index.coh.sub[coh],1] # Create matrix of original dimension m.fit.demean.ac <- fit.ac$response m.fit.demean.ac[fit.ac$index.data] <- prediction } @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Summary of all checks} <<>>= # ALL CHECKS sum(abs(c.c[,1]-fit$coefficients)) sum(abs(m.fit-m.fit.canonical),na.rm=TRUE) sum(abs(m.fit.canonical-m.fit.ssdd),na.rm=TRUE) sum(abs(m.fit.canonical-m.fit.detrend),na.rm=TRUE) sum(abs(m.fit.canonical.ac-m.fit.demean.ac),na.rm=TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{References} \begin{description} \item Clayton, D., Schifflers, E. (1987a) Models for temperoral variation in cancer rates. I: age-period and age-cohort models. \textit{Statistics in Medicine} 6, 449-467. \item Kuang, D., Nielsen, B. and Nielsen, J.P. (2008) Identification of the age-period-cohort model and the extended chain ladder model. \textit{Biometrika} 95, 979-986. \textit{Download}: Earlier version: \url{http://www.nuffield.ox.ac.uk/economics/papers/2007/w5/KuangNielsenNielsen07.pdf}. \item Nielsen, B. (2014) Deviance analysis of age-period-cohort models. \textit{Download}: \url{http://www.nuffield.ox.ac.uk/economics/papers/2014/apc_deviance.pdf}. \end{description} \end{document}