% File apc/vignettes/apc_aggregate.Rnw % Part of the apc package, http://www.R-project.org % Copyright 2020 Bent Nielsen % Distributed under GPL 3 %\VignetteIndexEntry{Introduction: analysis of aggregate data} \documentclass[a4paper,twoside,12pt]{article} \usepackage[english]{babel} \usepackage{booktabs,rotating,graphicx,amsmath,verbatim,fancyhdr,Sweave} \usepackage[colorlinks,linkcolor=red,urlcolor=blue]{hyperref} \newcommand{\R}{\textsf{\bf R}} \renewcommand{\topfraction}{0.95} \renewcommand{\bottomfraction}{0.95} \renewcommand{\textfraction}{0.1} \renewcommand{\floatpagefraction}{0.9} \DeclareGraphicsExtensions{.pdf,.jpg} \setcounter{secnumdepth}{1} \setcounter{tocdepth}{1} \oddsidemargin 1mm \evensidemargin 1mm \textwidth 160mm \textheight 230mm \topmargin -5mm \headheight 8mm \headsep 5mm \footskip 15mm \begin{document} \SweaveOpts{concordance=TRUE} \raggedleft \pagestyle{empty} \vspace*{0.1\textheight} \Huge {\bf Introduction to\\analysis of aggregate data\\in the package\texttt{apc} } \noindent\rule[-1ex]{\textwidth}{5pt}\\[2.5ex] \Large 25 August 2020 \vfill \normalsize \begin{tabular}{rl} Bent Nielsen & Department of Economics, University of Oxford \\ & \small \& Nuffield College \\ & \normalsize \texttt{bent.nielsen@nuffield.ox.ac.uk} \\ & \url{http://users.ox.ac.uk/~nuff0078} \end{tabular} \normalsize \newpage \raggedright \parindent 3ex \parskip 0ex \tableofcontents \cleardoublepage \setcounter{page}{1} \pagestyle{fancy} \renewcommand{\sectionmark}[1]{\markboth{\thesection #1}{\thesection \ #1}} \fancyhead[OL,ER]{\sl apc indiv} %\fancyhead[ER]{\sl \rightmark} \fancyhead[EL,OR]{\bf \thepage} \fancyfoot{} \renewcommand{\headrulewidth}{0.1pt} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The purpose of this vignette is give an introduction the age-period-cohort analysis of tables of aggregate data using R package \texttt{apc}. The apc package uses the canonical parameters suggested by Kuang, Nielsen and Nielsen (2008a) and generalized by Nielsen (2014). These evolve around the second differences of age, period and cohort factors as well as an three parameters (level and two slopes) for a linear plane. The age, period and cohort factors themselves are not identifiable. They could be ad hoc identified by associating the levels and two slopes to the age, period and cohort factors in a particular way. This should be done with great care as such ad hoc identification easily masks which information is coming from the data and which information is coming from the choice of ad hoc identification scheme. An illustration is given below. A short description of the package can be found in Nielsen (2015). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Analysis of Belgian lung cancer data} The very first step is to call the package. <<>>= library(apc) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{The data} The data set is taken from table VIII of Clayton and Schifflers (1987a), which contains age-specific incidence rates (per 100,000 person-years observation) of lung cancer in Belgian females during the period 1955-1978. Numerators are also available. The original source was the WHO mortality database. The package uses a special data format for aggregate data, where the data is kept in a matrix format. This data format also keeps track of the labels of the different time scales. This comes in handy when listing parameters, when plotting and when seeking to truncate the data. The package does not use the vectorized format in the traditional data.frame in R. The Belgian data are already in the apc.data.list format. Other data can be organized using the apc.data.list function. <<>>= data.list <- data.Belgian.lung.cancer() objects(data.list) data.list @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Plot the data} The data analysis can be initiated by plotting the data. The package has a number of plots. All plot formats can be called with a single command. Some of plots involve grouping of data. A warning is produced because the defaults settings lead to an unbalanced grouping of data. <<>>= apc.plot.data.all(data.list) @ Alternatively, the plots can be called individually. The first plot contains data sums. <<>>= graphics.off() apc.plot.data.sums(data.list) @ A sparsity plot shows where data are thin. In this case, the plots are blank with default settings. We therefore change sparsity.limits. <<>>= graphics.off() apc.plot.data.sparsity(data.list) apc.plot.data.sparsity(data.list,sparsity.limits=c(5,10)) @ The next plots visualize data using different pairs of the three time scales. These plots are done for mortality ratios. All plots appear to have approximately parallel lines. This indicates that interpretation should be done carefully. <<>>= graphics.off() apc.plot.data.within.all.six(data.list,"m") @ \subsection{Get a deviance table} A deviance table is constructed. For this, we need to formulated the distribution of the model. The table show that the sub-models "AC" and "Ad" cannot be rejected relative to the unrestricted "APC" model <<>>= apc.fit.table(data.list,"poisson.dose.response") @ \subsection{Estimate selected models} We consider the "APC" and "Ad" model. We also consider also the sub-model "A", which is not supported by the tests in the deviance table. We get the three fits <<>>= fit.apc <- apc.fit.model(data.list,"poisson.dose.response","APC") fit.ad <- apc.fit.model(data.list,"poisson.dose.response","Ad") fit.a <- apc.fit.model(data.list,"poisson.dose.response","A") @ The coefficients for canonical parameters are found through <<>>= fit.apc$coefficients.canonical fit.ad$coefficients.canonical @ \subsection{Residual analysis} We get a number of plots to illustrate the fit. We plot estimators, probability transforms of responses given fit, residuals, fitted values, linear predictors, and data. In the probability transform plot: Black circle are used for central part of distribution. Triangles are used in tails, green/blue/red as responses are further in tail No sign of mis-specification for "APC" and "Ad": there are many black circles and only few coloured triangles. In comparison the model "A" yields more extreme observations. That model is not supported by the data. To get numerical values see apc.plot.fit.pt <<>>= graphics.off() apc.plot.fit.all(fit.apc) apc.plot.fit.all(fit.ad) apc.plot.fit.all(fit.a) @ \subsection{Plot estimated coefficients for sub models } We consider the "APC", "Ad" and "A" models and plot the estimated coefficients The first row of plots show double differences of parameters The second row of plots shows level and slope determining a linear plane The third row shows double sums of double differences, all identified to be zero at the begining and at the end. Thus the plots in third row must be interpreted jointly with those in the second row. The interpretation of the third row plots is that they show deviations from linear trends. The third row plots are not invariant to changes to data array. For the "APC" and "Ad" the estimated coefficients are similar. For the "A" model the coefficients are different, reflecting the misspecification. <<>>= graphics.off() apc.plot.fit(fit.apc) apc.plot.fit(fit.ad) apc.plot.fit(fit.a) @ \subsection{Recursive analysis} Cut the first period group and redo analysis <<>>= data.list.subset.1 <- apc.data.list.subset(data.list,0,0,1,0,0,0) apc.fit.table(data.list.subset.1,"poisson.dose.response") @ \subsection{Effect of ad hoc identification} At first a subset is chosen where youngest age and cohort groups are truncated. This way sparsity is eliminated and ad hoc identification effects are dominated by estimation uncertainty. Then consider Plot 1: parameters estimated from data without first age groups Plot 2: parameters estimated from all data Note that estimates for double difference very similar. Estimates for linear slopes are changed because the indices used for parametrising these are changed Estimates for detrended double sums of age and cohort double differences are changed, because they rely on a particular ad hoc identifications that have changed. Nonetheless these plots are useful to evaulate variation in time trends over and above linear trends. <<>>= graphics.off() data.list <- data.Belgian.lung.cancer() data.list.subset <- apc.data.list.subset(data.list,2,0,0,0,0,0) fit.apc <- apc.fit.model(data.list,"poisson.dose.response","APC") fit.apc.subset <- apc.fit.model(data.list.subset,"poisson.dose.response","APC") apc.plot.fit(fit.apc.subset, main.outer="1. Belgian lung cancer: cut first two age groups") apc.plot.fit(fit.apc,main.outer="2. Belgian lung cancer data: all data") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{References} \begin{description} \item Clayton, D. and Schifflers, E. (1987a) Models for temperoral variation in cancer rates. I: age-period and age-cohort models. \emph{Statistics in Medicine} 6, 449-467. \item Kuang, D., Nielsen, B. and Nielsen, J.P. (2008a) Identification of the age-period-cohort model and the extended chain ladder model. \emph{Biometrika} 95, 979-986. \emph{Download}: Earlier version: \url{http://www.nuffield.ox.ac.uk/economics/papers/2007/w5/KuangNielsenNielsen07.pdf}. \item Kuang, D., Nielsen, B. and Nielsen, J.P. (2008b) Forecasting with the age-period-cohort model and the extended chain-ladder model. \emph{Biometrika} 95, 987-991. \emph{Download}: Earlier version: \url{http://www.nuffield.ox.ac.uk/economics/papers/2008/w9/KuangNielsenNielsen_Forecast.pdf}. \item Nielsen, B. (2014) Deviance analysis of age-period-cohort models. \emph{Download}: \url{http://www.nuffield.ox.ac.uk/economics/papers/2014/apc_deviance.pdf}. \item Nielsen, B. (2015) apc: An R package for age-period-cohort analysis. \emph{R Journal} 7, 52-64. \emph{Download}: \url{https://journal.r-project.org/archive/2015-2/nielsen.pdf}. \end{description} \end{document}