%\VignetteIndexEntry{Basic direct and indirect estimators} \documentclass{article} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \title{Basic direct and indirect estimators in sae package} \author{ Isabel Molina\footnote{Department of Statistics, Universidad Carlos III de Madrid. Address: C/Madrid 126, 28903 Getafe (Madrid), Spain, Tf: +34 916249887, Fax: +34 916249849, E-mail: isabel.molina@uc3m.es}, Yolanda Marhuenda} \date{March, 2015} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle This document presents design unbiased direct estimators and simple indirect estimators of domain means $\bar Y_d$, $d=1,\ldots,D$. For a general random sampling without replacement within each domain $U_d$. We denote by $\pi_{dj}$ the inclusion probability of $j$-th unit from $d$-th domain in the corresponding domain sample $s_d$ and $w_{dj}=\pi_{dj}^{-1}$ is the corresponding sampling weight. A design-unbiased direct estimator of $\bar Y_d$ is the Horvitz-Thompson (HT) estimator, given by \begin{equation}\label{HTest} \hat{\bar{Y}}_d^{DIR}=N_d^{-1}\sum_{j\in s_d}w_{dj}Y_{dj}. \end{equation} Unbiased estimation of the sampling variance of the HT estimator requires availability of the second order inclusion probabilities $\pi_{d,jk}$ of each pair of units $j$ and $k$ in $s_d$. A simple approximation that avoids the use of second order inclusion probabilities is obtained by considering $\pi_{d,jk}\approx \pi_{dj}\pi_{dk}$ and is given by \begin{equation}\label{estvar} \hat V_{\pi}(\hat{\bar{Y}}_d^{DIR})=\frac{1}{N_d^2}\sum_{j\in s_d}w_{dj}(w_{dj}-1)Y_{dj}^2. \end{equation} Under Poisson sampling, $\pi_{d,jk}= \pi_{dj}\pi_{dk}$ and in that case the estimator in (\ref{estvar}) is exactly unbiased. Under simple random sampling (SRS) without replacement within each area $U_d$, $d=1,\ldots,D$, the HT estimator of the mean $\bar{Y}_d$ is the usual sample mean $\hat{\bar Y}_d=\bar y_d=n_d^{-1}\sum_{j \in s_d}Y_{dj}$, and the (exactly) unbiased estimator of the sampling variance is $\hat V_{\pi}(\hat{\bar{Y}}_d^{DIR})=(1-f_d)S_{d}^2/n_d$, for $S_d^2=\sum_{j\in s_d}(Y_{dj}-\bar y_d)^2/(n_d-1)$. When the sampling is with replacement within each domain $U_d$, and units are selected with probabilities $P_{dj}$, $j=1,\ldots,N_d$, proportional to some size measure, if we define new weights $w_{dj}=(n_dP_{dj})^{-1}$, the estimator defined in (\ref{HTest}) remains unbiased and the unbiased estimator of the sampling variance is given by %badbox $$ \hat V_{\pi}(\hat{\bar Y}_d^{DIR})=\frac{1}{n_d}\sum_{j\in s_d}\left(f_dw_{dj}Y_{dj}-\hat{\bar Y}_d\right)^2, $$ which becomes $S_d^2/n_d$ under SRS with replacement. The post-stratified synthetic estimator assumes that data are distributed into $K$ (large) groups called post-strata that cut across the domains, and such that the within group mean is constant across domains, that is, if $\bar Y_{dk}$ denotes the mean in the crossing of post-stratum $k$ and domain $d$ and $\bar Y_{+k}$ is the mean of post-stratum $k$, it holds that $\bar Y_{dk}=\bar Y_{+k}$, $k=1,\ldots,K$. The groups are assumed to have large enough sample sizes to allow direct estimation with high efficiency. Since the mean of domain $d$ is given by $\bar Y_d=N_d^{-1}\sum_{k=1}^K N_{dk}\bar Y_{dk}$, replacing $\bar Y_{dk}=\bar Y_{+k}$ by the ratio HT estimator $\hat{\bar Y}_{+k}^R=\hat Y_{+k}^{DIR}/\hat N_{+k}^{DIR}$, where $\hat Y_{+k}^{DIR}$ is the direct estimator of the total in post-stratum $k$ and $\hat N_{+k}$ is the direct estimator of the population size $N_{+k}$ in the same post-stratum, we obtain the post-stratified synthetic estimator $$ \hat{\bar Y}_d^{SYN}=\frac{1}{N_d}\sum_{k=1}^K N_{dk}\hat{\bar Y}_{+k}^R. $$ Note that this estimator requires the population sizes of the crossings between each post-stratum $k$ and domain $d$, $N_{dk}$ for all $k$ and $d$. The direct estimator is inefficient for a domain with small sample size. On the other hand, the post-stratified synthetic estimator is biased when the assumption of constant means across domains within a stratum does not hold. To balance the bias of a synthetic estimator and the instability of the direct estimator, \cite{DrewSinghChoudhry82} proposed the sample-size dependent (SSD) estimator defined as a composition of the two mentioned estimators, that is, $$ \hat{\bar Y}_d^{SSD}=\phi_d\hat{\bar Y}_d^{DIR}+(1-\phi_d)\hat{\bar Y}_d^{SYN}, $$ where the composition weight $\phi_d$ depends on the sample size of the domain as $$ \phi_d=\left\{\begin{array}{cc} 1,& \hat N_d^{DIR}\geq \delta N_d; \\ \hat N_d^{DIR}/(\delta N_d),& \hat N_d^{DIR}< \delta N_d,\end{array}\right. $$ for a given constant $\delta>0$ that controls how much weight is attached to the synthetic estimator, with larger value of $\delta$ meaning that more strength is borrowed from other domains. However, if the expected sample size is small, then the SSD estimator is not borrowing strength in domains $d$ with $\hat N_d^{DIR}\geq \delta N_d$ even if they have small sample sizes. Functions \texttt{direct()}, \texttt{pssynt()} and \texttt{ssd()} give respectively direct, post-stratified synthetic and sample size dependent estimates. The calls to these functions are: \begin{verbatim} direct(y, dom, sweight, domsize, data, replace = FALSE) pssynt(y, sweight, ps, domsizebyps, data) ssd(dom, sweight, domsize, direct, synthetic, delta = 1, data) \end{verbatim} Function \texttt{direct()} returns unbiased direct estimates of the area means, where the result depends on the sampling design specified through the sampling weight vector \texttt{sweight} and the argument \texttt{replace} for with or without replacement sampling. We must provide the area population sizes in the data frame \texttt{domsize}, whose first column must contain the area codes. In \texttt{pssynt()}, we must specify our selected post-stratifying variable in argument \texttt{ps}. The population sizes of each crossing between domain and post-strata must be specified in the data frame \texttt{domsizebyps}, whose first column must be again the area codes. Function \texttt{ssd()} gives SSD estimators obtained by composition of direct and synthetic estimators. We need to introduce the direct estimators (\texttt{direct}) and the synthetic estimators (\texttt{synthetic}) to compose, together with the constant $\delta$ (\texttt{delta}) involved in the SSD estimator. Domain codes (\texttt{dom}) and domain population sizes (\texttt{domsize}) are also required arguments. The vector of sampling weights (\texttt{sweight}) must be included in the three functions. The variables specified in \texttt{y}, \texttt{dom}, \texttt{sweight} and \texttt{ps} can be selected from the data set specified in argument \texttt{data}. \subsection*{Example. Poverty mapping} In this example, we calculate several simple estimates of poverty incidences in Spanish provinces, namely direct estimates, post-stratified synthetic estimates with education levels as post-strata and SSD estimates obtained from the composition of direct and post-stratified synthetic estimates. The poverty incidence for a province is the province mean of a binary variable taking value 1 when person's income is below a given poverty line and 0 otherwise. Direct estimates can be obtained easily applying the usual theory for means to this binary variable. First, we load the data set \texttt{incomedata} containing the input data for each individual and the data sets \texttt{sizeprov} and \texttt{sizeprovedu} containing the population sizes and the population sizes by education level, respectively. <<>>= library("sae") data("incomedata") data("sizeprov") data("sizeprovedu") @ Next, we define the poverty line \texttt{z}, calculate the binary variable \texttt{poor}, with value \texttt{1} if the corresponding income value is below the poverty line and \texttt{0} otherwise, and calculate province poverty incidences as province means of this variable. <<>>= z <- 6557.143 poor <- as.integer(incomedata$income < z) @ We use the province name \texttt{provlab} as the domain code (\texttt{dom}) and calculate direct estimates \texttt{DIR}. <<>>= Popn <- sizeprov[, c("provlab", "Nd")] DIR <- direct(y = poor, dom = incomedata$provlab, sweight = incomedata$weight, domsize = Popn) @ Next, we calculate post-stratified synthetic estimates with education levels as post-strata. For the function \texttt{pssynt()}, we construct the data frame \texttt{domsizebyps}, containing the domain codes \texttt{provlab} in the first column and, in the remaining columns, the province sizes by education level. The names of the columns (except for the first one) in this data frame must be the education levels, namely \texttt{0} (age<16), \texttt{1} (primary education), \texttt{2} (secondary education) and \texttt{3} (post-secondary education): <<>>= Popn.educ <- sizeprovedu[,-2] colnames(Popn.educ) <- c("provlab","0", "1", "2", "3") PSYN.educ <- pssynt(y = poor, sweight = incomedata$weight, ps = incomedata$educ, domsizebyps = Popn.educ) @ We calculate SSD estimates by composition of the previous direct and post-stratified estimates, and taking the default value \texttt{delta=1} in function \texttt{ssd()}. Again, the first columns of \texttt{domsize}, \texttt{direct} and \texttt{synthetic} must be the province names. <<>>= SSD <- ssd(dom = provlab, sweight = weight, domsize = Popn, direct = DIR[, c("Domain", "Direct")], synthetic = PSYN.educ, data = incomedata) @ We collect the province names, sample sizes and the three sets of percent poverty incidence estimates in the data frame \texttt{results}: <<>>= results <- data.frame(Province = DIR$Domain, SampleSize = DIR$SampSize, DIR = DIR$Direct * 100, PSYN.educ = PSYN.educ$PsSynthetic * 100, SSD = SSD$ssd * 100) print(results, row.names = FALSE) @ These estimates are plotted in the Figure %~\ref{AllEstimates} for each province (area), with provinces sorted by decreasing sample size. This Figure shows that direct estimates and SSD estimates are very similar, with direct estimates slightly more unstable. However, the post-stratified synthetic estimates appear to be too stable, giving practically the same values for all provinces. This estimator is based on the unrealistic assumption of constant poverty incidence for all the population with the same education level and therefore might be seriously biased. \begin{center} <>= # Sorted results by decreasing sample size results <- results[order(results$SampleSize, decreasing = TRUE), ] plot(results$DIR, type = "n", xlab = "area (sorted by decreasing sample size)", ylab = "Estimate", cex.axis = 1.5, cex.lab = 1.5) points(results$DIR, type = "b", col = 3, lwd = 2, pch = 1) points(results$PSYN.educ, type= "b", col = 5, lwd = 2, pch = 2) points(results$SSD, type = "b", col = 2, lwd = 2, pch = 5) legend("bottom", legend = c("Direct", "Post-strat educ", "SSD"), ncol = 1, col = c(3, 5, 2), lwd = rep(2, 3), pch = c(1, 2, 5), cex = 1.3) @ \end{center} Comparing direct estimates with the EB estimates of poverty incidences obtained in the data frame \texttt{results.EB} of Example 5 in \cite{MolinaMarhuenda15}, we can see that estimates differ significantly for the 5 selected provinces and the CVs show great gains in efficiency of EB estimates as compared with direct estimates. <<>>= DIR[c("42","5","34","44","40"), -4] @ \addcontentsline{toc}{section}{References} \begin{thebibliography}{27} \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi \bibitem{DrewSinghChoudhry82} \textsc{Drew, D.}, \textsc{Singh, M.P.} \& \textsc{Choudhry, G.H.} (1982). \newblock Evaluation of small area estimation techniques for the Canadian Labour Force Survey. \newblock \textit{Survey Methodology} \textbf{8}, 17--47. \bibitem{MolinaMarhuenda15} \textsc{Molina, I.} \& \textsc{Marhuenda, Y.} (1982). \newblock sae: An R package for Small Area Estimation. \newblock \textit{R Journal}, Under revision. \end{thebibliography} \end{document}