% File apc/vignettes/CheckPrediction.Rnw % Part of the apc package, http://www.R-project.org % Copyright 2016 Bent Nielsen % Distributed under GPL 3 %\VignetteIndexEntry{Illustrate and check identification used in plot fit function} \documentclass[a4paper,twoside,12pt]{article} \usepackage[english]{babel} \usepackage{booktabs,rotating,graphicx,amsmath,verbatim,fancyhdr,Sweave} \usepackage[colorlinks,linkcolor=red,urlcolor=blue]{hyperref} \newcommand{\R}{\textsf{\bf R}} \renewcommand{\topfraction}{0.95} \renewcommand{\bottomfraction}{0.95} \renewcommand{\textfraction}{0.1} \renewcommand{\floatpagefraction}{0.9} \DeclareGraphicsExtensions{.pdf,.jpg} \setcounter{secnumdepth}{1} \setcounter{tocdepth}{1} \oddsidemargin 1mm \evensidemargin 1mm \textwidth 160mm \textheight 230mm \topmargin -5mm \headheight 8mm \headsep 5mm \footskip 15mm \begin{document} \SweaveOpts{concordance=TRUE} \raggedleft \pagestyle{empty} \vspace*{0.1\textheight} \Huge {\bf Reproducing \\ Mart\'{i}nez Miranda,\\ Nielsen and Nielsen (2015)\\ using the \texttt{apc} package} \noindent\rule[-1ex]{\textwidth}{5pt}\\[2.5ex] \Large 10 September 2016, version 2 \vfill \normalsize \begin{tabular}{rl} Bent Nielsen & Department of Economics, University of Oxford \\ & \small \& Nuffield College \\ & \small \& Institute for Economic Modelling \\ & \normalsize \texttt{bent.nielsen@nuffield.ox.ac.uk} \\ & \url{http://users.ox.ac.uk/~nuff0078} \end{tabular} \normalsize \newpage \raggedright \parindent 3ex \parskip 0ex \tableofcontents \cleardoublepage \setcounter{page}{1} \pagestyle{fancy} \renewcommand{\sectionmark}[1]{\markboth{\thesection #1}{\thesection \ #1}} \fancyhead[OL,ER]{\sl Reproducing Mart\'{i}nez Miranda, Nielsen and Nielsen (2015).} %\fancyhead[ER]{\sl \rightmark} \fancyhead[EL,OR]{\bf \thepage} \fancyfoot{} \renewcommand{\headrulewidth}{0.1pt} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The purpose of this vignette is to use the \texttt{apc} package version 1.2.2. to reproduce some of the result in Mart\'{i}nez Miranda, Nielsen and Nielsen (2015): \textit{Inference and forecasting in the age-period-cohort model with unknown exposure with an application to mesothelioma mortality}, published in \textit{Journal of the Royal Statistical Society} A 178, 29-55. The \texttt{apc} package builds on the identification analysis in Kuang, Nielsen and Nielsen (2008a), the development of deviance analysis for general data arrays in Nielsen (2014). The package is discussed in Nielsen (2015). The data consists of counts of mesothelioma deaths in the UK by age, $25-89$, and period $1967-2007$. This is modelling using a response-only Poisson regression using an age-period-cohort structure. The purpose of analysis is to forecast the future burden of mesothelioma deaths. The data are available in the \texttt{apc} package. They can be called with the command <<>>= library(apc) data <- data.asbestos() @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Figure 1: Summary of data} The data is organised as a matrix with period as row index and age as column index. Figure 1(a,b,c) in the paper shows sums of the data by age, period and cohort. Figure 1(d) shows log-cumulative deaths by 5-year age and cohort group. A range of plots illustrating the data can be generated by the command <>= apc.plot.data.all(data) @ This command calls a range of particular commands. Some warnings are reported. This is because one of the plots, \texttt{apc.plot.data.within} groups the indices and the index ranges are not divisible by the default group size. We can also generate plots in a more basic way so as to allow more customization. In particular, Figure 1(a,b,c) can be reproduced by <<>>= apc.plot.data.sums(data) @ To get individual plots one can generate the data sums and plot them as desired. For instance, <>= data.sums <- apc.data.sums(data) par(mfrow=c(2,2),oma=c(0,0,2,0),mar=c(4,4,2,0)+0.1) plot(seq(25,89),data.sums$sums.age, main="(a) sums by age", type="l",xlab="age",ylab="age data sum") plot(seq(1967,2007),data.sums$sums.per, main="(b) sums by period", type="l",xlab="period",ylab="period data sum") plot(seq(1878,1982),data.sums$sums.coh, main="(c) sums by cohort", type="l",xlab="cohort",ylab="cohort data sum") apc.plot.data.within(data,plot.type="pwc", thin=5,type="l",main="(d)",lty=1,legend=FALSE) title("Figure 1",outer=TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Table 1: Deviance analysis} The deviance Table 1 can be reproduced by a single command <>= apc.fit.table(data,"poisson.response")[1:4,1:6] @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Figure 3: The standardized residuals} In the first instance we consider the unrestricted age-period-cohort model. This is fitted by the command <<>>= fit.apc <- apc.fit.model(data,"poisson.response","APC") @ A range of plots illustrating the data can be generated by the command <<>>= apc.plot.fit.all(fit.apc) @ This command calls a range of particular commands. A warning is reported. This is because one of the plots, \texttt{apc.plot.fit} may not show the standard errors one would expect. We can also generate individual plots The package does not exactly reproduce Figure 2. Rather than looking at standardized residuals, we can look at probability transforms of the data given the fitted values through the command <>= apc.plot.fit.pt(fit.apc,main="Figure 3 - approximately") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Figure 6: Forecasts based on full sample analysis, but decomposed by cohort} Figure 6 presents forecasts for particular cohorts based on an age-cohort model. The age-cohort model is fitted as follows <<>>= fit.ac <- apc.fit.model(data,"poisson.response","AC") @ We now generate the forecasts for particular cohorts. We need to truncate the range of cohorts when forecasting. This requires a little calculation. In the paper the range for the cohorts is denoted 1878-1982. In the \texttt{apc} package, version 1.2, the range of cohorts is denoted 1879-1983. In any case the index for these cohorts is 1-105. Note that there are 65 age groups and 41 period groups, so that the number of cohorts is 65+41-1=105. The first 41 cohorts are not going to be extrapolated in any case. Thus, we can potentially forecast 105-41=64 cohorts without having to extrapolate cohort parameters. In Figure 6 the cohorts are truncated by 1966/1952/1937, in the notation of the paper. This corresponds to truncating the last 16/30/45 cohorts. We get the truncated forecasts as follows <>= forecast.16 <- apc.forecast.ac(fit.ac,sum.per.by.coh=c(42,89)) forecast.30 <- apc.forecast.ac(fit.ac,sum.per.by.coh=c(42,75)) forecast.45 <- apc.forecast.ac(fit.ac,sum.per.by.coh=c(42,60)) @ Some warnings are produced. These relate to the command \texttt{apc.data.list.subset}, which is used for truncating the point forecasts to the desired cohorts. We need to sum the data by period to show the actual outcomes. <>= data.sum.per <- apc.data.sums(data.asbestos())$sums.per @ Figure 6 can now be reproduced as follows. The command \texttt{apc.polygon} allows easy plotting of forecast with confidence bands. The function uses \texttt{lines} function from the \texttt{graphics} package to plot the point forecasts. It also uses the \texttt{polygon} function from the \texttt{graphics} package to draw up shaded areas for the forecast standard error, and possibly also for the process standard error and the estimation standard error. The darker shaded area represents plus/minus twice the overall forecast standard deviation. The lighter aerea represents plus/minus twice the process error forecast standard deviation, that is the estimates are taken is parameters without estimation uncertainty. <>= plot(seq(1967,2007),data.sum.per,xlim=c(1967,2047),ylim=c(0,3500), xlab="period",ylab="number of cases", main="Figure 6") apc.polygon(forecast.16$response.forecast.per.by.coh, 2007,TRUE,TRUE,col.line=1) apc.polygon(forecast.30$response.forecast.per.by.coh, 2007,TRUE,TRUE,col.line=2) apc.polygon(forecast.45$response.forecast.per.by.coh, 2007,TRUE,TRUE,col.line=3) legend("topleft",legend=c("1966","1952","1937"), col=c(1,2,3),lty=1) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Figure 7: Recursive forecasts} Figure 7 presents recursive forecasts using age-cohort models. The darker shaded area represents plus/minus twice the overall forecast standard deviation. The lighter aerea represents plus/minus twice the process error forecast standard deviation, that is the estimates are taken is parameters without estimation uncertainty. To produce the forecasts we start by extracting a subset of the data array. Then we rerun the age-cohort model and finally produce the forecasts. <<>>= data.1991 <- apc.data.list.subset(data.asbestos(),0,0,0,16,0,0) fit.ac.1991 <- apc.fit.model(data.1991,"poisson.response","AC") forecast.1991 <- apc.forecast.ac(fit.ac.1991) @ There are two warnings relating to the command \texttt{apc.data.list.subset} The first warning concerns the truncation of the data array. Truncating the last 16 periods implies that we also truncate the last 16 cohorts. The second warning shows that the coordinate system has been changed from the original per-age coordinates to age-cohort coordinates. We start by generating the other forecasts. <>= data.2001 <- apc.data.list.subset(data.asbestos(),0,0,0,6,0,0) fit.ac.2001 <- apc.fit.model(data.2001,"poisson.response","AC") forecast.2001 <- apc.forecast.ac(fit.ac.2001) data.2006 <- apc.data.list.subset(data.asbestos(),0,0,0,1,0,0) fit.ac.2006 <- apc.fit.model(data.2006,"poisson.response","AC") forecast.2006 <- apc.forecast.ac(fit.ac.2006) fit.ac.2007 <- apc.fit.model(data.asbestos(),"poisson.response","AC") forecast.2007 <- apc.forecast.ac(fit.ac.2007) @ This is followed by the plot. <>= plot(seq(1967,2007),data.sum.per,xlim=c(1967,2047),ylim=c(0,3500), xlab="period",ylab="number of cases", main="Figure 7") apc.polygon(forecast.2007$response.forecast.per.ic, 2007,TRUE,TRUE,col.line=1) apc.polygon(forecast.2007$response.forecast.per , 2007,FALSE ,col.line=2) apc.polygon(forecast.2006$response.forecast.per , 2006,FALSE ,col.line=3) apc.polygon(forecast.2001$response.forecast.per , 2001,FALSE ,col.line=4) apc.polygon(forecast.1991$response.forecast.per , 1991,FALSE ,col.line=5) legend("topleft",legend=c("2007ic","2007","2006","2001","1991"), col=c(1,2,3,4,5),lty=1) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Table 2: Peaks from recursive analysis} We can find the peaks by inspecting the output as follows. <<>>= forecast.2007$response.forecast.per[10:14,1] forecast.2007$response.forecast.per.ic[10:14,1] @ The peak is 2019 in both cases - as it must be - because the intercept correction simply scales the forecasts. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Figure 8: Sensitivity analysis} Figure 8 includes forecasts from different models. One forecast is based on the unrestricted age-period-cohort model. This is not implemented in the \texttt{apc} package as yet. The other forecasts use age-cohort models, but for different subsets of the data. The forecasts are generated as follows. <>= data.coh.1966 <- apc.data.list.subset(data.asbestos(),0,0,0,0,0,16) fit.ac.coh.1966 <- apc.fit.model(data.coh.1966,"poisson.response","AC") forecast.coh.1966 <- apc.forecast.ac(fit.ac.coh.1966) data.age.35 <- apc.data.list.subset(data.asbestos(),10,0,0,0,0,0) fit.ac.age.35 <- apc.fit.model(data.age.35,"poisson.response","AC") forecast.age.35 <- apc.forecast.ac(fit.ac.age.35,sum.per.by.coh=c(42,89)) @ Finally, a forecast from an age-period-cohort is needed. This requires an extrapolation of the period parameters, see Kuang, Nielsen and Nielsen (2008b). The data appear to be fairly smooth, so an "I0" forecast is chosen. This is the default for \texttt{apc.forecast.apc}, so the forecast is generated as follows. <>= fit.apc.1966 <- apc.fit.model(data.coh.1966,"poisson.response","APC") forecast.apc.1966 <- apc.forecast.apc(fit.apc.1966) @ The plot is then generated as follows. Note that the plot in the paper has no standard deviations, so it would be nearly as easy to use the \texttt{lines} function from the \texttt{graphics} package as the \texttt{apc.polygon} function <>= plot(seq(1967,2007),data.sum.per,xlim=c(1967,2047),ylim=c(0,3500), xlab="period",ylab="number of cases", main="Figure 8") apc.polygon(forecast.16$response.forecast.per.by.coh , 2007,FALSE,col.line=1) apc.polygon(forecast.coh.1966$response.forecast.per , 2007,FALSE,col.line=2) apc.polygon(forecast.age.35$response.forecast.per.by.coh, 2007,FALSE,col.line=3) apc.polygon(forecast.16$response.forecast.per.by.coh.ic , 2007,FALSE,col.line=4) apc.polygon(forecast.apc.1966$response.forecast.per , 2007,FALSE,col.line=5) legend("topleft",legend=c("est: full, fore: coh<=1966", "est: coh<=1966, fore: coh<=1966", "est: age>=35, fore: coh<=1966", "est: full, fore: coh<=1966, ic", "est: apc, I0: coh<=1966"), col=c(1,2,3,4,5),lty=1) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Figure 9: Preferred forecast} The preferred forecast is now generated at follows. <>= plot(seq(1967,2007),data.sum.per,xlim=c(1967,2047),ylim=c(0,3500), xlab="period",ylab="number of cases", main="Figure 8") apc.polygon(forecast.16$response.forecast.per.by.coh.ic,2007,TRUE,TRUE) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section*{References} \begin{description} \item Kuang, D., Nielsen, B. and Nielsen, J.P. (2008a) Identification of the age-period-cohort model and the extended chain ladder model. \textit{Biometrika} 95, 979-986. \textit{Download}: Earlier version: \url{http://www.nuffield.ox.ac.uk/economics/papers/2007/w5/KuangNielsenNielsen07.pdf}. \item Kuang, D., Nielsen, B. and Nielsen, J.P. (2008b) Forecasting with the age-period-cohort model and the extended chain-ladder model. \textit{Biometrika} 95, 987-991. \textit{Download}: Earlier version: \url{http://www.nuffield.ox.ac.uk/economics/papers/2008/w9/KuangNielsenNielsen_Forecast.pdf}. \item Mart\'{i}nez Miranda, M.D., Nielsen, B. and Nielsen, J.P. (2015) Inference and forecasting in the age-period-cohort model with unknown exposure with an application to mesothelioma mortality. \textit{Journal of the Royal Statistical Society} A 178, 29-55. \textit{Download}: \url{http://www.nuffield.ox.ac.uk/economics/papers/2013/Asbestos8mar13.pdf}. \item Nielsen, B. (2014) Deviance analysis of age-period-cohort models. \textit{Download}: \url{http://www.nuffield.ox.ac.uk/economics/papers/2014/apc_deviance.pdf}. \item Nielsen, B. apc: An R package for age-period-cohort analysis. To appear in \textit{R Journal}. \textit{Download}: \url{https://journal.r-project.org/archive/accepted/nielsen.pdf}. \end{description} \end{document}