% File apc/vignettes/apc_indiv.Rnw % Part of the apc package, http://www.R-project.org % Copyright 2020 Zoe Fannon, Bent Nielsen % Distributed under GPL 3 %\VignetteIndexEntry{Introduction: analysis of individual data} \documentclass[a4paper,twoside,12pt]{article} \usepackage[english]{babel} \usepackage{booktabs,rotating,graphicx,amsmath,verbatim,fancyhdr,Sweave} \usepackage[colorlinks,linkcolor=red,urlcolor=blue]{hyperref} \newcommand{\R}{\textsf{\bf R}} \renewcommand{\topfraction}{0.95} \renewcommand{\bottomfraction}{0.95} \renewcommand{\textfraction}{0.1} \renewcommand{\floatpagefraction}{0.9} \DeclareGraphicsExtensions{.pdf,.jpg} \setcounter{secnumdepth}{1} \setcounter{tocdepth}{1} \oddsidemargin 1mm \evensidemargin 1mm \textwidth 160mm \textheight 230mm \topmargin -5mm \headheight 8mm \headsep 5mm \footskip 15mm \begin{document} \SweaveOpts{concordance=TRUE} \raggedleft \pagestyle{empty} \vspace*{0.1\textheight} \Huge {\bf Introduction to\\analysis of individual data\\using the \texttt{apc.indiv} functions\\in the package\texttt{apc} } \noindent\rule[-1ex]{\textwidth}{5pt}\\[2.5ex] \Large 25 August 2020 \vfill \normalsize \begin{tabular}{rl} Zoe Fannon & Department of Economics, University of Oxford \end{tabular} \normalsize \newpage \raggedright \parindent 3ex \parskip 0ex \tableofcontents \cleardoublepage \setcounter{page}{1} \pagestyle{fancy} \renewcommand{\sectionmark}[1]{\markboth{\thesection #1}{\thesection \ #1}} \fancyhead[OL,ER]{\sl apc indiv} %\fancyhead[ER]{\sl \rightmark} \fancyhead[EL,OR]{\bf \thepage} \fancyfoot{} \renewcommand{\headrulewidth}{0.1pt} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} The purpose of this vignette is to demonstrate the use of some of the new commands available as part of the \texttt{apc.indiv} update to the R package \texttt{apc}. This code is designed to allow the user to study the effects of age, period, and cohort on an outcome of interest. The age-period-cohort identification problem is avoided because the code uses the reparametrization approach developed in Kuang, Nielsen and Nielsen (2008). This approach does not attempt to separate the linear effects of age, period, and cohort, which are unidentified due to the well-known identification problem. Instead, the focus is on estimation of the non-linear effects. The non-linear effects that are identified are ``double-differences'' in each of age, period, and cohort. These ``double-differences'' are the accelerations in each of age, period, and cohort. By cumulating these accelerations a picture of the non-linear part of the relationship between age, period, or cohort and the outcome of interest can be constructed. Further details of the reparametrization approach and how the double-differences and cumulated double-differences should be interpreted are available in Nielsen (2015). The new code allows for estimation of the reparametrized APC effects from the following: \begin{itemize} \item Gaussian models using repeated cross-section data \item Logistic models using repeated cross-section data \item Both of the above with survey weights \item Gaussian models using panel data (with POLS, random effects, and fixed effects options) \item All of the above with covariates included in the model \end{itemize} The tools build on several other packages. In particular \texttt{plm} (Croissant and Millo, 2008) and \texttt{survey} (Lumley, 2019) are used to perform the estimation for panel data and survey data respectively, while \texttt{lmtest} (Zeileis and Hothorn, 2002) and \texttt{car} (Fox and Weisberg, 2019) are used for testing restrictions. The aggregate-data functions from the package \texttt{apc} Nielsen (2015) were cannibalised extensively to produce the \texttt{apc.indiv} functions. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Repeated Cross Section} To illustrate the use of the code for repeated cross-section data, I use the \texttt{Wage} data from the \texttt{ISLR} package (James et al, 2017). This data records information about 3000 male workers in the Mid-Atlantic region of the US, and was manually assembled from the March 2011 supplement to the American Current Population Survey. I examine the age, period, and cohort effects on the log wage of these workers (a continuous outcome), and on the probability that they hold a job classified as ``industrial'' rather than ``information'' (a binary outcome). There is a concave non-linear relationship between age and the log wage, but no non-linear relationship with period or cohort. There is a sharp acceleration in the probability of holding an industrial job in 2008, followed by a compensating deceleration; this may indicate temporary layoffs in response to the financial crisis that are job class-specific. Note that this data does not contain weights and there is no evidence that the wage information has been corrected for inflation. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Data assessment and cleaning} I begin by examining the age-period-cohort structure of the data % Chunk 1 <>= library("plyr") library("reshape") library("ISLR") data("Wage") summary(Wage) AP_count <- count(Wage, c("age", "year")) AP_show <- cast(AP_count, age~year) AP_show[1:10,] @ The output of the above is a long table, with a column for each of the seven periods in the data and a row for each of the 61 ages. Each cell shows the number of observations in the data for that age-period combination. The \texttt{apc.indiv} functions require a contiguous dataset, and so it is necessary to restrict the data by age and period so that no cells have 0 observations. In this case I will omit some of the youngest and oldest ages, which are sparsely observed. % Chunk 2 <<>>= Wage2 <- Wage[Wage$age >= 25 & Wage$age <= 55, ] @ need to change the names <<>>= names(Wage2)[names(Wage2) %in% c("year","age")] <- c("period","age") @ tidy some variables for the analysis <<>>= cohort <- Wage2$period - Wage2$age indust_job <- ifelse(Wage2$jobclass=="1. Industrial", 1, 0) hasdegree <- ifelse(Wage2$education %in% c("4. College Grad", "5. Advanced Degree"), 1, 0) married <- ifelse(Wage2$maritl == "2. Married", 1, 0) Wage3 <- cbind(Wage2, cohort, indust_job, hasdegree, married) @ In the above, I have restricted the data to those aged between 25 and 55. Note that I have also renamed some of the variables; the \texttt{apc.indiv} functions require that at least two of the variables \texttt{age}, \texttt{period}, and \texttt{cohort} are present in the data. I have also tidied some of the other variables that are of interest in the analysis, creating indicators for whether the job is industrial (as opposed to informational), whether the worker has a college degree, and whether the worker is married. I will be interested in how the wage of the worker and the nature of their job (industrial or otherwise) is related to their age, cohort, and period of observation. Before performing a formal analysis of these relationships using the \texttt{apc.indiv} functions, I can use a visualisation to conduct a preliminary search for patterns in these variables along age, period, or cohort. This is done using \texttt{ggplot2} (Wickham, 2016). % Chunk 3 <<>>= library("ggplot2") mean_logwage <- ddply(Wage3, .variables=c("period", "age"), function(dfr, colnm){mean(dfr[, colnm])}, "logwage") names(mean_logwage)[3] <- "Mean_logwage" plot_mean_logwage <- ggplot(mean_logwage, aes(period, age)) + theme_bw() + xlab('\n Period') + ylab('Age\n') + geom_tile(aes(fill = Mean_logwage)) + scale_fill_gradientn(colours=c("red", "blue"), space = 'Lab', name="Mean logwage \n") + scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + theme(axis.text=element_text(size=18), axis.title=element_text(size=24, face="bold"), legend.title=element_text(size=20, face="bold"), legend.key.size = unit(1, "cm"), legend.text=element_text(size=18)) @ % Chunk 4 <>= plot_mean_logwage @ The output of the above code is seen in figure \ref{fig:mean_logwage_RCS}. Each block shows the mean log wage among observations with that age-period combination in the data. The red colour corresponds to a lower mean log wage and the blue to a higher mean log wage. The concentration could be a combination of age and period effects: young people have lower wages, and in later years people have higher nominal wages (the data may not be adjusted for inflation). We can use similar code to produce an analagous graph, showing the mean value of the indicator \texttt{indust\_job} in each age-period cell. That mean indicates the proportion of people in that cell who have an industrial, rather than an information, role. This graph is not shown here; it has a similar pattern, with a higher probability of being in an industrial job concentrated among the young in the early 2000s. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Analysis of log wage} I now use the functions developed in \texttt{apc.indiv} to investigate the non-linear patterns in age, period, and cohort that can be identified from this data. These functions have a number of dependency packages which must be loaded if they are not already in the environment. % Chunk 5 <<>>= library("apc") library("plyr") library("lmtest") library("car") library("plm") library("survey") @ The first stage of the analysis is to estimate a table which can be used to determine how many of the elements of the age-period-cohort reparametrization are needed to describe the variation in this data. The relevant code is below: I specify the dataset, dependent variable, covariates I am including (in this case, I include the indicator for whether a person has a degree, as this is expected to influence their wage), the appropriate model family for this data, and which type of test I want to use. Here I chose a Wald test, compared against an F distribution; one could perform the Wald test using a Chi-squared distribution, or use a Likelihood Ratio test which must be compared with a Chi-squared distribution. I also choose to include a ``TS'' model in the table. This Time-Saturated (TS) model is a more general model than the reparametrized age-period-cohort model; it includes an indicator for each age-period combination present in the data. It nests the age-period-cohort model and therefore allows us to test whether the age-period-cohort model is sufficient to describe the variation in the data. % Chunk 6 <<>>= logwage_tab <- apc.indiv.model.table(Wage3, dep.var="logwage", covariates="hasdegree", model.family="gaussian", test="Wald", dist="F", TS=TRUE) @ % Chunk 7 <>= logwage_tab$table @ The output of the above code is seen in table \ref{logwage_tab}. Look for the model which minimises the AIC; this is the model where the AIC takes the value $1010.99$, i.e. the Ad model. We can also see that the p-values of the Wald tests of this model against the more general TS and APC models are quite large, indicating support for the reduction from those more general models. The Ad model, or age-drift model, includes double-differences in age only; the double-differences in period and cohort are constrained to zero. Further details of the APC sub-models are available in Nielsen (2015). I now estimate the Ad model alone. I use a table to inspect the covariate coefficients, but the best way to examine the estimated time effects is by a visualisation. There are 30 ages in my dataset, which means 28 double-differences in age. Rather than looking at 28 estimates for double-differences, it is easier to understand their interpretation by plotting them. % Chunk 8 <<>>= logwage_ad <- apc.indiv.est.model(Wage3, dep.var = "logwage", covariates="hasdegree", model.family="gaussian", model.design="Ad") logwage_ad$coefficients.covariates apc.plot.fit(logwage_ad, main.outer="") @ As expected, the coefficient on having a degree is positive ($0.285$) and highly significant (p-value of $8.2e^{-105}$). The visual representation of the Ad part of the model is seen below. Subfigures (d) through (f) show the estimated linear plane, which combines the unidentified linear effects of age, period, and cohort. This is the ``drift'' part of the model. The first linear trend is plotted in the age dimension, while the second linear trend is plotted in the cohort dimension; respectively they combine the linear effects of age and period, and of cohort and period. The net effect of the three unidentifiable slopes then is that there is an increase in log wage with age and an increase in log wage with cohort, which may be used for forecasting purposes. Subfigures (a) and (g) are of greater interest. Subfigure (a) shows the estimated double-differences in age, of which there are 28. Subfigure (g) shows the result of cumulating these to get a picture of the non-linear relationship between age and log-wage. We see that this relationship is concave, which would be consistent with acceleration in log wage up to the mid-30s and a plateauing thereafter. This concavity is of interest because we can use it to evaluate consistency with theoretical models of the evolution of log wages over the life-cycle. For example, we could imagine a theory model which predicted that log wage is not only concave over the life cycle but is also quadratic. That the concavity is quadratic is a testable restriction in our APC model. The advantage to testing this quadratic hypothesis in this reparametrized APC model rather than another form of model is that this model has isolated the non-linear portion of age from the non-linear portion of cohort and period; therefore the test of the quadratic age effect is not contaminated by period or cohort effects. We can perform this quadratic test as follows, using the \texttt{linearHypothesis} function from the package \texttt{car} (Fox and Weisberg, 2019). % Chunk 9 <<>>= allageDD <- rownames(logwage_ad$coefficients.canonical)[grep("DD_age", rownames(logwage_ad$coefficients.canonical))] ageDD1 <- allageDD[-1] ageDD2 <- allageDD[-length(allageDD)] quadratic_hyp <- paste(ageDD2, ageDD1, sep = " = ") rm(list=ls(pattern="ageDD")) linearHypothesis(logwage_ad$fit, quadratic_hyp, test="F") @ Again, a Wald test is used, with comparison to an F distribution. The resulting test statistic of $1.297$, with degrees of freedom $(28, 2382)$, has a p-value of $0.14$. This indicates that the hypothesis of a quadratic relationship between the age of the worker and his log wage cannot be rejected. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Analysis of industrial job} The code can be used in a very similar way to investigate the relationship between a binary variable and age, period, and cohort. I illustrate this by building a model for whether or not the worker has an industrial job. Again, the analysis begins with a table comparing the time-saturated (TS) model, the full APC model, and submodels of the APC model. % Chunk 9 <<>>= indust_job_tab <- apc.indiv.model.table(Wage3, dep.var="indust_job", covariates="hasdegree", model.family="binomial", test="LR", dist= "Chisq", TS=TRUE) indust_job_tab$table @ Note that here the model family is binomial, and the test used is a likelihood ratio test. The time-saturated model here is estimated by a custom Newton-Rhapson iteration procedure, so part of the output of this table is a report on the behaviour of that algorithm. If the print statement does not report convergence, the Newton Rhapson parameters should be modified using the option \texttt{NR.controls} until convergence is achieved - for example by increasing the number of iterations. Looking at the estimated table, we see that the AIC is minimised towards the end of the table, by the tC model. However, the likelihood ratio tests comparing the tC model to the TS and APC models reject the restriction. Indeed most restrictions are rejected by the likelihood ratio test; the APC model itself is barely accepted as a restriction on the TS model. It is therefore difficult to select a model from this table. Ultimately I favour the PC model; it has one of the lower AIC values, is the most supported sub-model against the APC model, and is almost supported against the TS model. That said, this setting is one in which there is a strong argument that the APC model and its submodels do a poor job of capturing the time variation in the data, and some other reduction of the TS model should instead be used. % Chunk 10 <<>>= indust_job_pc <- apc.indiv.est.model(Wage3, dep.var="indust_job", covariates="hasdegree", model.family="binomial", model.design="PC") indust_job_pc$coefficients.covariates @ % Chunk 11 <>= apc.plot.fit(indust_job_pc) @ Again, I directly estimate the preferred model using \texttt{apc.indiv.est.model} and inspect the estimated non-linearities in period and cohort using \texttt{apc.plot.fit}. There is a somewhat interesting pattern in the period effects, where there appears to be a substantial acceleration in the probability of having an industrial job in 2008; given that these are shipping workers, this may reflect a streamlining of operations during the financial crisis. However, there is no clear pattern in the cohort non-linearities. The effect of having a degree on the probability of having an industrial job is, unsurprisingly, significant and negative. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Extensions}\label{ss: RCS extensions} The data I have been using does not include survey weights. However if they were present in the data, they could be quite easily added to all of the above analysis by simply specifying the name of the weight variable using the option \texttt{wt.var} in all of the above commands. It should be noted that since models incorporating survey weights are not estimated by maximum likelihood, the likelihood column is omitted from the data and one must use Wald tests rather than likelihood ratio tests. A psuedo-AIC is reported; see Lumley (2004, 2019) for details. Additionally, estimation of the time-saturated model has not yet been implemented for survey data, and so that will not be reported. Sometimes the fact that the earliest and latest cohorts are only observed in one or two age-period cells can lead to instability in the estimates. This can be seen to some extent in the PC model for having an industrial job; the magnitude of the estimate for the earliest cohort is very large. This problem can be addressed by ``censoring'' those early and late cohorts out of the data, by first dropping them from the data using standard R techniques and then specifying the options \texttt{n.coh.excl.start} and \texttt{n.coh.excl.end}. The structure of this particular dataset is such that it can be displayed as a rectangle in age-period space. We say the data has an ``age-period'' format. In this case, it is the cohort double-differences where instability will appear and censoring should occur. However, not all data is of the ``age-period'' format. For example in the next section we will deal with data in ``period-cohort'' format; in that case, one would want to censor ages, using \texttt{n.age.excl.start} and \texttt{n.age.excl.end}. One might also have ``age-cohort'' data, in which case periods could be censored. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Panel data} To illustrate the use of the code for panel data, I use the \texttt{PSID7682} data from the \texttt{AER} package Kleiber and Zeileis (2008). This is an excerpt from the Panel Survey of Income Dynamics, covering 595 individuals over a seven-year period from 1976-1982 which has been used in economics textbooks such as Baltagi (2005) and Greene(2008). It is therefore an age-period dataset. Note that in this data the conflated variables are not age, period, and cohort, but rather years of work experience, period, and year of entering the workforce. There is an equivalent identification problem to the APC problem among these three variables, see for example Heckman and Robb (1985). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Data assessment and cleaning} I begin by inspecting the data. After initially displaying the data in an age-period format, it became clear that the period-cohort format was more appropriate. % Chunk 12 <<>>= library("plyr") library("reshape") library("AER") data("PSID7682") summary(PSID7682) AP_count <- count(PSID7682, c("experience", "year")) AP_show <- cast(AP_count, experience~year) AP_show[1:10,] @ The missing corners of the data show that this is actually cohort-period data (i.e. take a given set of people and follow them for X years, rather than observe people within an age group in a series of years). % Chunk 13 <<>>= period <- as.numeric(PSID7682$year) + 1975 entry <- period - PSID7682$experience psid <- cbind(PSID7682, period, entry) CP_count <- count(psid, c("entry", "year")) CP_show <- cast(CP_count, entry~year) CP_show[1:10,] @ It is easily seen from \texttt{CP\_show} that this is a balanced panel; the number of observations in a given cohort does not change over period. This makes it quite easy to see how we should restrict the data to ensure a sufficient number of observations in each cell. Again, I tidy some of the variables that will be used in the analysis, and rename the variables corresponding to \texttt{age}, \texttt{period}, and \texttt{cohort}. % Chunk 14 <<>>= psid2 <- psid[psid$entry >= 1939, ] # which variables do we want to use? logwage <- log(psid2$wage) inunion <- ifelse(psid2$union == "yes", 1, 0) insouth <- ifelse(psid2$south == "yes", 1, 0) bluecollar <- ifelse(psid2$occupation == "blue", 1, 0) # also education which is a continuous covariate psid3 <- cbind(psid2, logwage, inunion, insouth, bluecollar) names(psid3)[names(psid3) %in% c("experience","entry")] <- c("age","cohort") @ It is important to visualise the data before estimating any models. This is done using \texttt{ggplot2} in the same way as for repeated cross-sectional data. However, since this is period-cohort data, I plot cohort (year of entry) instead of age (experience) on the Y-axis. % Chunk 15 <<>>= library("ggplot2") mean_logwage <- ddply(psid3, .variables=c("period", "cohort"), function(dfr, colnm){mean(dfr[, colnm])}, "logwage") names(mean_logwage)[3] <- "Mean_logwage" plot_mean_logwage <- ggplot(mean_logwage, aes(period, cohort)) + theme_bw() + xlab('\n Period') + ylab('Entry \n') + geom_tile(aes(fill = Mean_logwage)) + scale_fill_gradientn(colours=c("red", "blue"), space = 'Lab', name="Mean logwage \n") + scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + theme(axis.text=element_text(size=18), axis.title=element_text(size=24, face="bold"), legend.title=element_text(size=20, face="bold"), legend.key.size = unit(1, "cm"), legend.text=element_text(size=18)) @ % Chunk 16 <>= plot_mean_logwage @ I display the visualization in figure \ref{fig:mean_logwage_panel}; note that in the labelling I have replaced ``cohort'' with ``entry''. There is a clear period effect; the colour becomes more blue towards the right of the graph, indicating higher wages in later years. There is also evidence of cohort effects, appearing as horizontal bands of colour. Those starting work around 1968, for instance, appear to have lower wages throughout their lives. That said, with a small panel we must be careful of confounding between cohort effects and individual fixed effects. Finally, age effects are evidenced by the predominance of red in the top-left corner of the graph; this is the area where individuals have the least work experience (those entering the workforce in the 1970s, observed in the 1970s), and we can unsurprisingly see that lack of experience means low wages. To begin with, I consider a model with no covariates, just to get a sense of how the patterns seen in the graph above are reflected in a formal analysis. As was the case with repeated cross-sectional data, I begin with a table containing the full APC model and all submodels. Note that the time-saturated (TS) model is not currently implemented for panel data. Additionally, since the panel data model I consider (the random effects model) is not estimated by maximum likelihood, I lose the likelihood and AIC columns. Therefore model selection is by Wald test only. % Chunk 17 <<>>= library(apc) panel_tab <- apc.indiv.model.table(psid3, dep.var="logwage", model.family = "gaussian", test="Wald", dist="F", plmmodel="random", id.var="id") @ % Chunk 18 <>= panel_tab$table @ It is clear from the table, seen in \ref{tab:panel_tab}, that none of the restrictions of the APC model pass muster. It is also worth noticing that some of the submodels seen in previous tables do not appear here. Those are: the C, tC, and 1 models. This is because random effects estimation requires at least one explanatory variable which changes over time within an individual, and these models do not satisfy this requirement. The model selected by this analysis is, clearly, the APC model, and so I proceed to estimate and plot that using the standard tools. % Chunk 19 <<>>= panel_apc <- apc.indiv.est.model(psid3, dep.var="logwage", model.family="gaussian", plmmodel="random", id.var="id") apc.plot.fit(panel_apc) @ There is clear concavity in both age and period, while the non-linearity in cohort, despite being significant, lacks a clear pattern. This model can also be estimated using fixed effects. This changes the set of models which are available, since the fixed effects are perfectly collinear with both the cohort double-differences and the combined slope that is estimated in the cohort dimension. The set of available models are as follows: FAP, FA, FP, Ft. These stand for ``fixed effects with age and period non-linearities'', ``fixed effects with age non-linearities'', ``fixed effects with period non-linearities'', and ``fixed effects with trend''. Note that FAP, FA, and FP all also contain the single linear trend that can be identified in these models, which is represented in the age dimension and combines the linear effects of age and period. % Chunk 20 <<>>= panel_tab_fe <- apc.indiv.model.table(psid3, dep.var="logwage", covariates = c("inunion", "insouth", "bluecollar"), model.family = "gaussian", test="Wald", dist="F", plmmodel="within", id.var="id") panel_tab_fe$table @ Again, restrictions not accepted % Chunk 21 <<>>= panel_fap <- apc.indiv.est.model(psid3, dep.var="logwage", covariates = c("inunion", "insouth", "bluecollar", "education"), model.family = "gaussian", plmmodel="within", id.var="id", model.design="FAP") panel_fap$coefficients.covariates @ % Chunk 21 <>= apc.plot.fit(panel_fap) @ The first step is to construct a table which compares all submodels to the most general model, which in the context of fixed effects is the FAP model. This table is not shown, as the conclusion is straightforward; even with the large set of covariates and the individual fixed effects, none of the submodel reductions is accepted. The FAP model is then estimated. The age and period non-linearities are largely unchanged from the random effects model, indicating that they are robust to the introduction of fixed effects and covariates. The covariate estimates are also shown. Most are not significant, which may be due to limited within-individual variation in those variables. Note that education was not included as a covariate: it has no within-individual variation and therefore its effect cannot be identified in a fixed effects model. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Extensions} It should be noted that at present panel data analysis only works for OLS models, so binary outcomes must be analysed in a linear probability framework. The time-saturated model is also not implemented for panel data. The censoring of cohorts, ages, or periods to improve the stability of estimates, described in section \ref{ss: RCS extensions}, is available for panel data. In addition to the random and fixed effects models illustrated here, it is also possible to estimate panel data using pooled OLS. However, this is really no different to repeated cross section analysis. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{References} \begin{description} \item Baltagi, B. H. (2005). \textit{Econometric Analysis of Panel Data} (3rd ed.). Chichester, England: John Wiley \& Sons. \item Croissant, Y. and Millo, G. (2008) Panel data econometrics in R: The plm package. \textit{Journal of Statistical Software} 27. \item Greene, W. H. (2008). \textit{Econometric Analysis}. Upper Saddle River, NJ: Pearson Prentice Hall. \item Heckman, J. and Robb, R. (1985). \textit{Using longitudinal data to estimate age, period and cohort effects in earnings equations}, pp. 137-150. New York, NY: Springer New York. \item James, G., Witten, D., Hastie, T. and Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package version 1.2. \item Kleiber, C. and Zeileis, A. (2008). \textit{Applied Econometrics with R}. New York: Springer-Verlag. \item Kuang, D., Nielsen, B. and Nielsen, J.P. (2008) Identification of the age-period-cohort model and the extended chain ladder model. \textit{Biometrika} 95, 979-986. \textit{Download}: Earlier version: \url{http://www.nuffield.ox.ac.uk/economics/papers/2007/w5/KuangNielsenNielsen07.pdf}. \item Lumley, T. (2004). Analysis of complex survey samples. \textit{Journal of Statistical Software}, 9 (1), 1-19. R package verson 2.2. \item Lumley, T. (2019) survey: analysis of complex survey samples. R package version 3.35-1. \item Nielsen, B.\ (2015) apc: An R package for age-period-cohort analysis. \textit{R Journal} 7, 52-64. \textit{Download}: \url{https://journal.r-project.org/archive/2015-2/nielsen.pdf}. \item Wickham, H. (2016). \textit{ggplot2: Elegant Graphics for Data Analysis}. Springer Verlag, New York. \item Zeileis, A. and Hothorn, T. (2002). Diagnostic checking in regression relationships. \textit{R News}, 2 (3), 7-10. \end{description} \end{document}