Estimation of Population Total Using Spline Functions
Gladys Gakenia Njoroge
Department of Physical Sciences, Chuka University, Chuka, Kenya
Email address:
To cite this article:
Gladys Gakenia Njoroge. Estimation of Population Total Using Spline Functions. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 5, 2015, pp. 396-403. doi: 10.11648/j.ajtas.20150405.20
Abstract: This study sought to estimate the finite population total using spline functions. The patterns emerging from the spline smoothers were compared with those obtained from the model-based, the model-assisted and the nonparametric estimators. To measure the performance of each estimator, three aspects were considered: the average bias, the efficiency as measured by the average mean square error, and the robustness as measured by the rate of change of efficiency. We used six populations: four natural and two simulated. The findings showed that the model-based estimator works very well in terms of efficiency, while the model-assisted estimator is almost unbiased, when the model is linear and homoscedastic. However, these estimators break down when the underlying model assumptions are violated. The kernel estimator (Nadaraya-Watson) was found to be the most robust of the five estimators considered. Between the two spline functions that we considered, the periodic spline was found to perform better. The spline functions were found to provide good results whether or not the design points were uniformly spaced. We also found that, under certain conditions, a smoothing spline estimator and a kernel estimator are equivalent. The study recommends that both the ratio estimator and the local polynomial estimator be used within the confines of a linear homoscedastic model. The Nadaraya-Watson and the periodic spline estimators, both of which are nonparametric, are highly robust; the Nadaraya-Watson estimator, however, is even more robust than the periodic spline.
Keywords: Population Total, Estimator, Efficiency, Homoscedasticity, Robustness
1. Introduction
The name "spline function" was given by [11] to the piecewise polynomial functions known as univariate polynomial splines. The name reflects their resemblance to the curves that draftsmen obtained with a mechanical spline: a thin flexible rod with a groove, held in position by a set of weights called "ducks", which was used to draw smooth interpolating curves through prescribed points. The basic idea dates back at least to [16]. More recent papers on the subject include [6, 12, 14], among others.
The available literature in statistics indicates that the approaches most used in the estimation of a population total are the model-based, the design-based and the model-assisted approaches. The nonparametric approach has also gained ground, especially with works such as [5, 10] on kernel estimation. Spline smoothing is another nonparametric approach to the estimation of a finite population total. However, little literature is available on this approach, and it has seen far less application to the estimation of population totals than the previous approaches. This study therefore sought to estimate the finite population total using spline functions, with the ratio estimator, the local polynomial estimator and kernel functions used for a numerical comparison, to determine whether the resulting estimates would be as accurate as those derived from the previous approaches. To measure the performance of each estimator, we considered three aspects: the bias, the efficiency as measured by the average mean square error, and the robustness as measured by the rate of change of efficiency.
2. The Estimators
2.1. Ratio Estimator (Model-Based)
The prediction approach is based on a model. Royall [9] summarizes the philosophy behind this approach. Suppose the number $N$ of units in the finite population is known and that with each unit $i$ is associated a number $y_i$. The general problem is to choose some of the units as a sample, observe the $y_i$'s for the sample units and then use those observations to estimate the value of some function of all the $y_i$'s in the population. The prediction approach treats the numbers $y_1, \dots, y_N$ as realized values of random variables $Y_1, \dots, Y_N$. After the sample has been observed, estimation entails predicting a function of the unobserved $y_i$'s. The relationships among the random variables, both the auxiliary variable $x$ and the survey variable $y$, are expressed in a model, the general model being
$$y_i = m(x_i) + e_i \qquad (1)$$
where $m(x_i)$ is the mean function and $e_i$ a random error term. After selecting and observing a sample, the $y_i$'s for the sample units become known but the values for the nonsample units remain unknown. This ignorance of the nonsample values means that some function of those values must be predicted in order to have an estimator or predictor for the full population. Suppose a scatter diagram reveals that the sample points $(x_i, y_i)$ are clustered around a straight line passing through the origin. Then the ratios $y_i / x_i$ are more or less the same, and we may postulate the approximate relation
$$\frac{\bar{y}}{\bar{x}} \approx \frac{\bar{Y}}{\bar{X}}.$$
Hence we can write
$$\bar{Y} \approx \frac{\bar{y}}{\bar{x}}\,\bar{X} \qquad (2)$$
from which we can suggest an estimator of $\bar{Y}$ as
$$\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\,\bar{X} \qquad (3)$$
where $\bar{y}$ and $\bar{x}$ are the sample means of $y$ and $x$, respectively, and $\bar{Y}$ and $\bar{X}$ are the corresponding population means. $\bar{X}$ is assumed to be known beforehand. The estimator in (3) is popularly known as the ratio estimator [7]. The estimator of the population total using the model-based approach (prediction approach) thus becomes
(4)
Where
(5)
substituting equation (5) in (4) gives
(6)
we take for the nonsample where is linear and i.e. homoscedastic [9]. Let be the predictor of of the non sample values which is given as
Thus, our estimate of the population total under Royall’s prediction model is
Therefore,
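$$\hat{T}_R = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i} \sum_{i=1}^{N} x_i \qquad (7)$$
is the ratio estimator for the population total, where $s$ denotes the sample and the $x_i$ are known for all $N$ units of the population.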
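As a minimal sketch of the ratio estimator of the total, assuming the standard form (sum of sampled $y$ over sum of sampled $x$, multiplied by the known population total of $x$), the computation could look as follows. The function name `ratio_total` and the toy data are hypothetical and are not taken from the study.

```python
def ratio_total(y_sample, x_sample, x_population_total):
    """Ratio estimate of the population total of y.

    Assumes the auxiliary total (sum of x over the whole population) is known.
    """
    sum_x = sum(x_sample)
    if sum_x == 0:
        raise ValueError("sampled x values must not sum to zero")
    return sum(y_sample) / sum_x * x_population_total


# Hypothetical toy population in which y is exactly proportional to x,
# so the ratio estimator recovers the true total from any sample.
x_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_pop = [2.0 * x for x in x_pop]
sample_idx = [0, 2, 5]

estimate = ratio_total(
    [y_pop[i] for i in sample_idx],
    [x_pop[i] for i in sample_idx],
    sum(x_pop),
)
true_total = sum(y_pop)
```

When the points do not lie on a line through the origin, the same call still runs, but the estimate is no longer guaranteed to match the true total.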
2.2. The Local Polynomial Regression Estimator (Model-Assisted)
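Breidt and Opsomer [2] assumed that the population is generated by the superpopulation model $y_i = m(x_i) + \varepsilon_i$, where the $\varepsilon_i$ form an independent sequence of random variables with mean zero and variance $v(x_i)$, and $m(x)$ is a smooth function of $x$. They employed local polynomial smoothing techniques to obtain a model-assisted regression estimator for the finite population total. We consider a finite population of $N$ units with label set $U = \{1, \dots, N\}$. For each $i \in U$, an auxiliary variable $x_i$ is observed. A probability sample $s$ is drawn from $U$ according to a fixed-size sampling design $p(\cdot)$, where $p(s)$ is the probability of drawing the sample $s$. Let $n$ be the size of $s$. Assume
$$\pi_i = \Pr\{i \in s\} = \sum_{s \ni i} p(s) > 0$$
and
$$\pi_{ij} = \Pr\{i, j \in s\} = \sum_{s \ni i, j} p(s) > 0 \quad \text{for all } i, j \in U.$$
The study variable $y_i$ is observed for each $i \in s$. The goal is to estimate the population total $T = \sum_{i \in U} y_i$.
Let $I_i = 1$ if $i \in s$ and $I_i = 0$ otherwise. Then $E_p(I_i) = \pi_i$, where $E_p(\cdot)$ denotes expectation with respect to the sampling design, i.e. averaging over all possible samples from the finite population.
Using this notation, an estimator $\hat{T}$ of $T$ is said to be design-unbiased if $E_p(\hat{T}) = T$.
A well-known design-unbiased estimator of $T$ is the Horvitz-Thompson estimator,
$$\hat{T}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in U} \frac{y_i I_i}{\pi_i} \qquad (8)$$
The variance of the Horvitz-Thompson estimator under the sampling design is
$$\operatorname{Var}_p(\hat{T}_{HT}) = \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}, \quad \pi_{ii} = \pi_i \qquad (9)$$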
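The design-unbiasedness of the Horvitz-Thompson estimator and its variance formula can be checked by brute force on a tiny population. The sketch below assumes simple random sampling without replacement, for which the inclusion probabilities are $n/N$ and $n(n-1)/(N(N-1))$. The population values and function names are hypothetical.

```python
from itertools import combinations


def horvitz_thompson(y_sample, pi_sample):
    """Horvitz-Thompson estimate: sampled values weighted by 1 / inclusion probability."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))


def ht_variance(y_pop, pi, pi_joint):
    """Design variance of the HT estimator; pi_joint is the common joint probability for i != j."""
    total = 0.0
    for i in range(len(y_pop)):
        for j in range(len(y_pop)):
            p_ij = pi[i] if i == j else pi_joint
            total += (p_ij - pi[i] * pi[j]) * (y_pop[i] / pi[i]) * (y_pop[j] / pi[j])
    return total


# Hypothetical population of N = 4 units, samples of size n = 2.
y_pop = [3.0, 5.0, 8.0, 12.0]
N, n = len(y_pop), 2
pi = [n / N] * N
pi_joint = n * (n - 1) / (N * (N - 1))

# Enumerate every possible sample; each is equally likely under this design.
estimates = [
    horvitz_thompson([y_pop[i] for i in s], [pi[i] for i in s])
    for s in combinations(range(N), n)
]
mean_estimate = sum(estimates) / len(estimates)
empirical_variance = sum((e - sum(y_pop)) ** 2 for e in estimates) / len(estimates)
formula_variance = ht_variance(y_pop, pi, pi_joint)
```

Averaging over all six samples reproduces the true total of 28, and the enumerated variance agrees with the double-sum formula.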
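An estimator is proposed that is motivated by modeling the finite population of $y_i$'s, conditioned on the auxiliary variable $x_i$, as a realization from a superpopulation $\xi$ in which $y_i = m(x_i) + \varepsilon_i$. Given $x_i$, $m(x_i)$ is called the regression function, while $v(x_i)$ is the variance function.
Let $K$ denote a continuous kernel function and let $h$ denote the bandwidth. We begin by defining the local polynomial kernel estimator of degree $q$ based on the entire finite population. Let $\mathbf{y}_U = [y_i]_{i \in U}$ be the $N$-vector of $y_i$'s in the finite population.
Define the $N \times (q+1)$ matrix
$$\mathbf{X}_{Ui} = \left[\, 1 \quad x_j - x_i \quad \cdots \quad (x_j - x_i)^q \,\right]_{j \in U}$$
and define the $N \times N$ matrix of kernel weights
$$\mathbf{W}_{Ui} = \operatorname{diag}\left\{ \frac{1}{h} K\!\left( \frac{x_j - x_i}{h} \right) \right\}_{j \in U},$$
where $h$ is the smoothing parameter (bandwidth). Let $\mathbf{e}_r$ represent a vector with a 1 in the $r$th position and 0 elsewhere. The local polynomial kernel estimator of the regression function at $x_i$, based on the entire finite population, is then given by
$$m_i = \mathbf{e}_1^{\top} \left( \mathbf{X}_{Ui}^{\top} \mathbf{W}_{Ui} \mathbf{X}_{Ui} \right)^{-1} \mathbf{X}_{Ui}^{\top} \mathbf{W}_{Ui}\, \mathbf{y}_U \qquad (10)$$
which is well defined as long as $\mathbf{X}_{Ui}^{\top} \mathbf{W}_{Ui} \mathbf{X}_{Ui}$ is invertible.
Since only the $y_i$ in $s$ are known, $m_i$ is replaced by a sample-based consistent estimator to make its calculation possible.
Let $\mathbf{y}_s = [y_i]_{i \in s}$ be the $n$-vector of $y_i$'s obtained in the sample.
Define the $n \times (q+1)$ matrix
$$\mathbf{X}_{si} = \left[\, 1 \quad x_j - x_i \quad \cdots \quad (x_j - x_i)^q \,\right]_{j \in s}$$
and define the $n \times n$ matrix
$$\mathbf{W}_{si} = \operatorname{diag}\left\{ \frac{1}{\pi_j h} K\!\left( \frac{x_j - x_i}{h} \right) \right\}_{j \in s}.$$
A sample design-based estimator of $m_i$ is then given by
$$\hat{m}_i = \mathbf{e}_1^{\top} \left( \mathbf{X}_{si}^{\top} \mathbf{W}_{si} \mathbf{X}_{si} \right)^{-1} \mathbf{X}_{si}^{\top} \mathbf{W}_{si}\, \mathbf{y}_s = \mathbf{w}_{si}^{\top} \mathbf{y}_s,$$
as long as $\mathbf{X}_{si}^{\top} \mathbf{W}_{si} \mathbf{X}_{si}$ is invertible, where $\mathbf{w}_{si}$ is an $n$-vector of weights.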
The above shows that local polynomial estimators are linear smoothers, that is, they are of the form $\hat{m}_i = \sum_{j \in s} w_{ij}\, y_j$.
The coefficients of the linear combination depend on the degree of the polynomial approximation. We note that for $q = 0$, the estimator reduces to the Nadaraya-Watson estimator [1]. Now, based on the proposed estimator in equation (6), and assuming that throughout, due to mathematical complexity, the local polynomial regression estimator for the finite population total is given by
(11)
where is the sample estimator for . Substituting equation (9) in (11) above gives
(12)
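The sample-based local polynomial fit described in this section amounts to a weighted least-squares problem at each evaluation point. The sketch below is one way to compute it with NumPy, assuming an Epanechnikov kernel and design weights $1/\pi_j$. The function names and toy data are hypothetical, and this is not the code used in the study.

```python
import numpy as np


def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def local_poly_fit(x0, x_s, y_s, pi_s, h, q=1):
    """Design-weighted local polynomial estimate of m(x0) of degree q."""
    x_s = np.asarray(x_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    pi_s = np.asarray(pi_s, dtype=float)
    d = x_s - x0
    # Columns 1, d, ..., d^q: the local design matrix centred at x0.
    X = np.vander(d, N=q + 1, increasing=True)
    # Kernel weights divided by the inclusion probabilities and the bandwidth.
    w = epanechnikov(d / h) / (pi_s * h)
    XtW = X.T * w
    # Fails with LinAlgError if X'WX is singular (too few points in the window).
    beta = np.linalg.solve(XtW @ X, XtW @ y_s)
    # The intercept is the fitted value at x0 (the e_1 selection in the text).
    return float(beta[0])


# Hypothetical sample lying exactly on the line y = 1 + 2x.
x_s = np.linspace(0.0, 1.0, 11)
y_s = 1.0 + 2.0 * x_s
pi_s = np.full(x_s.shape, 0.5)

fit_linear = local_poly_fit(0.4, x_s, y_s, pi_s, h=0.3, q=1)
fit_constant = local_poly_fit(0.4, x_s, y_s, pi_s, h=0.3, q=0)
```

With `q=0` the fit is a kernel-weighted average, which is the Nadaraya-Watson form. With `q=1` an exactly linear relationship is reproduced without error.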
2.3. Kernel Estimation
We consider the Nadaraya-Watson kernel estimator. It is assumed that the auxiliary information is available for the entire population and that the auxiliary variable and the study variable are related in a more general way. The properties of the proposed estimator are studied conditional on the available sample and nonsample values of the auxiliary variable. A conceptually simple approach to representing the weight sequence $\{W_{hi}(x)\}$ is to describe the shape of the weight function by a density function with a scale parameter that adjusts the size and the form of the weights near $x$. This function is commonly referred to as the kernel $K$. The kernel is a continuous, bounded and symmetric function which integrates to one,
$$\int K(u)\,du = 1 \qquad (13)$$
To estimate $m(x)$ in model (1), one method is to average the nearby values of $y_i$, where "nearby" is measured in terms of the distance $|x - x_i|$.
Let $K_h(u) = h^{-1} K(u/h)$ be the kernel with bandwidth $h$.
The weight sequence for kernel smoothers (for one-dimensional $x$) is given by
$$W_{hi}(x) = \frac{K_h(x - x_i)}{n^{-1} \sum_{j=1}^{n} K_h(x - x_j)} \qquad (14)$$
This form of kernel weights (14) was proposed by [8, 15].
The Nadaraya-Watson estimator of $m(x)$ in (1) is
$$\hat{m}_h(x) = n^{-1} \sum_{i=1}^{n} W_{hi}(x)\, y_i \qquad (15)$$
On substituting (14) in (15) we get
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)} \qquad (16)$$
The shape of the kernel weights is determined by $K$. A notable feature of the bandwidth $h$ is that the smaller it is, the more concentrated the weights are around $x$.
Selection of the bandwidth is an important part of the kernel estimation method. When selecting the bandwidth we need to consider the error that the selection induces. For this reason precision is measured in terms of the pointwise Mean Squared Error (MSE), the sum of the variance and the squared bias. The MSE is given by
$$\operatorname{MSE}[\hat{m}_h(x)] = E\left[\hat{m}_h(x) - m(x)\right]^2 = \operatorname{Var}[\hat{m}_h(x)] + \left\{ E[\hat{m}_h(x)] - m(x) \right\}^2,$$
which tends to zero for the kernel estimator if $h \to 0$ and $nh \to \infty$.
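The nonparametric regression-based estimator, $\hat{T}_{np}$, for the population total $T$ is given by
$$\hat{T}_{np} = \sum_{i \in s} y_i + \sum_{j \notin s} \hat{m}(x_j) \qquad (17)$$
where $\hat{m}(x_j)$ is the Nadaraya-Watson estimator in (15), evaluated at the nonsample point $x_j$.
Therefore the Nadaraya-Watson estimator of the population total is obtained by substituting (16) in (17), which gives
$$\hat{T}_{NW} = \sum_{i \in s} y_i + \sum_{j \notin s} \frac{\sum_{i \in s} K_h(x_j - x_i)\, y_i}{\sum_{i \in s} K_h(x_j - x_i)} \qquad (18)$$
where $\hat{T}_{NW}$ represents the Nadaraya-Watson estimator of the population total.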
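A minimal sketch of the Nadaraya-Watson estimator of the total follows: the observed sample values are summed, and each nonsample value is predicted by a kernel-weighted average of the sampled $y$'s. It assumes an Epanechnikov kernel and a single global bandwidth `h`. The function names and toy data are hypothetical.

```python
import numpy as np


def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)


def nw_estimate(x0, x_s, y_s, h):
    """Nadaraya-Watson estimate of m(x0): kernel-weighted mean of the sampled y's."""
    w = epanechnikov((x0 - x_s) / h)
    if w.sum() == 0.0:
        raise ValueError("no sampled points within the bandwidth of x0")
    return float(np.dot(w, y_s) / w.sum())


def nw_total(x_pop, sample_idx, y_s, h):
    """Sum of observed sample values plus kernel predictions for nonsample units."""
    x_pop = np.asarray(x_pop, dtype=float)
    sample_idx = np.asarray(sample_idx)
    y_s = np.asarray(y_s, dtype=float)   # ordered like sample_idx
    in_sample = np.zeros(x_pop.size, dtype=bool)
    in_sample[sample_idx] = True
    x_s = x_pop[sample_idx]
    predicted = sum(nw_estimate(x0, x_s, y_s, h) for x0 in x_pop[~in_sample])
    return float(y_s.sum() + predicted)


# Hypothetical population of 10 units in which every y equals 5.
x_pop = np.linspace(0.0, 1.0, 10)
sample_idx = [0, 2, 4, 6, 8]
y_s = np.full(len(sample_idx), 5.0)
total_constant = nw_total(x_pop, sample_idx, y_s, h=0.3)
```

A compactly supported kernel can leave a nonsample point with no sampled neighbours, so the sketch raises an error in that case rather than dividing by zero.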
2.4. The Spline Smoothing
A measure of the rapid local variation of a curve can be given by a roughness penalty such as the integrated squared second derivative. Various penalties have been suggested and used (see, for example, [3]), but $\int g''(x)^2\,dx$ is the most convenient for our purpose. Using this measure, we define the modified sum of squares as
$$S(g) = \sum_{i=1}^{n} \left\{ y_i - g(x_i) \right\}^2 + \lambda \int g''(x)^2\,dx \qquad (19)$$
The idea behind spline estimation, then, is to find the function $\hat{g}$ that solves the minimization problem
$$\hat{g} = \arg\min_{g} S(g) \qquad (20)$$
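The effect of the smoothing parameter in the penalized criterion can be illustrated with a discrete analogue. The sketch below replaces the integrated squared second derivative by a sum of squared second differences on equally spaced points, which gives a Whittaker-type smoother rather than a true cubic smoothing spline. This substitution is an assumption made for brevity, and the function name and data are hypothetical.

```python
import numpy as np


def penalized_fit(y, lam):
    """Minimize sum((y - g)^2) + lam * sum((second differences of g)^2) over g."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # (n - 2) x n second-difference matrix: rows of the form (1, -2, 1).
    D = np.diff(np.eye(n), n=2, axis=0)
    # Setting the gradient to zero gives the linear system (I + lam * D'D) g = y.
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), y)


x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x)

fit_rough = penalized_fit(y, 0.0)    # no penalty: reproduces the data
fit_smooth = penalized_fit(y, 1e8)   # heavy penalty: close to a straight line

roughness_data = float(np.max(np.abs(np.diff(y, n=2))))
roughness_smooth = float(np.max(np.abs(np.diff(fit_smooth, n=2))))
```

With no penalty the fit interpolates the data, and with a very large penalty the second differences are driven towards zero, leaving an essentially linear fit.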
The parameter $\lambda$ is a smoothing parameter which controls the trade-off between smoothness and goodness of fit to the data. If $\lambda \to \infty$, the minimization of (19) gives a linear fit, whereas letting $\lambda \to 0$ gives a wiggly function. The larger the value of $\lambda$, the more the data will be smoothed to produce the curve estimate. The basic underlying idea of penalising a measure of goodness of fit by one of roughness was described by [16]. Equation (19) shows that the function to be minimized consists of two components. First, the deviation of the fitted function from the observed values should be small, which gives the goodness of fit. Second, complex functions are penalised by the second term in (19), as measured by the second-order derivative. From [3] and from the quadratic nature of equation (19), the spline smoother is linear in the observations in the sense that there exists a weight function $G(x, x_i)$ such that
$$\hat{g}(x) = n^{-1} \sum_{i=1}^{n} G(x, x_i)\, y_i \qquad (21)$$
where
$$G(x, x_i) \approx \frac{1}{f(x_i)} \frac{1}{h(x_i)}\, \kappa\!\left( \frac{x - x_i}{h(x_i)} \right) \qquad (22)$$
with the kernel function $\kappa$ given by
$$\kappa(u) = \frac{1}{2} \exp\!\left( -\frac{|u|}{\sqrt{2}} \right) \sin\!\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right) \qquad (23)$$
and the local bandwidth $h(x_i)$ satisfies
$$h(x_i) = \lambda^{1/4}\, n^{-1/4}\, f(x_i)^{-1/4} \qquad (24)$$
It has been assumed that $n$ is large and that the design points have local density $f(x)$, in that the proportion of $x_i$ in an interval of length $dx$ near $x$ is approximately $f(x)\,dx$. Equation (22) applies for large $n$ provided $x$ is not too near the edge of the interval on which the data lie, and $\lambda$ is not too big or too small.
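The equivalent kernel and local bandwidth for spline smoothing can be evaluated numerically. The sketch below assumes the form usually attributed to Silverman, $\kappa(u) = \tfrac{1}{2} e^{-|u|/\sqrt{2}} \sin(|u|/\sqrt{2} + \pi/4)$ with $h = (\lambda / (n f))^{1/4}$, and checks that the kernel integrates to one. The function names are hypothetical.

```python
import numpy as np


def spline_equivalent_kernel(u):
    """Equivalent kernel of the smoothing spline (assumed Silverman form)."""
    a = np.abs(np.asarray(u, dtype=float)) / np.sqrt(2.0)
    return 0.5 * np.exp(-a) * np.sin(a + np.pi / 4.0)


def local_bandwidth(lam, n, density):
    """Local bandwidth h = lam^(1/4) * n^(-1/4) * f^(-1/4)."""
    return (lam / (n * density)) ** 0.25


# Riemann-sum check on a wide grid; the tails beyond |u| = 40 are negligible.
u = np.linspace(-40.0, 40.0, 400001)
area = float(np.sum(spline_equivalent_kernel(u)) * (u[1] - u[0]))

value_at_zero = float(spline_equivalent_kernel(0.0))
h_example = local_bandwidth(lam=16.0, n=1, density=1.0)
```

Unlike the Epanechnikov kernel, this kernel takes small negative values in its tails, so the implied weights are not all positive. The bandwidth shrinks where the design density is high, which is what makes the method a variable-bandwidth kernel smoother.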
After obtaining the spline smoother in equation (21), we substitute it for the regression estimator in equation (17) to obtain the smoothing spline estimator of the population total as
(25)
While the periodic Spline Estimator of the Population Total is obtained as
(26)
3. Empirical Study
We present the analysis and results of the five estimators, i.e. the ratio, the local polynomial, the Nadaraya-Watson kernel, the smoothing spline and the periodic spline estimators. We used four natural and two artificial populations in the study.
3.1. Description of the Study Populations
In artificial population I, we generated 100 data points according to the linear homoscedastic model:
with and
In artificial population II, we again generated 100 data points according to the quadratic homoscedastic model:
with
We obtained the natural populations from the Kenya Central Bureau of Statistics for the years between 2006 and 2014. The description of each of the populations is given in Table 3.1 below.
Table 3.1. Description of the natural populations.
| Population | Data Points | Survey variable ($y$) | Auxiliary variable ($x$) |
|---|---|---|---|
| I | 100 | Value (in millions) of road transport equipment imported. | Quantity (number) of road transport equipment imported. |
| II | 126 | Value (in thousands) of principal articles traded. | Quantity (units) of principal articles traded. |
| III | 130 | Total number of employees engaged per industry. | Total number of firms and establishments per industry. |
| IV | 130 | Total outputs per industry in the manufacturing sector. | Total inputs per industry in the manufacturing sector. |
Scatter plots drawn for each of the four natural populations (Populations I-IV) were used to deduce the form of the population structures as below:
Population I: the structure of the population could be nonlinear and heteroscedastic.
Population II: the structure of the population could be linear and heteroscedastic.
Population III: the structure of the population could be linear and heteroscedastic.
Population IV: the structure of the population could be linear and homoscedastic.
Populations V and VI were the artificial populations with known population structures:
Population V: follows a linear homoscedastic model passing through the origin.
Population VI: follows a quadratic homoscedastic model.
3.2. Design of the Study
For each of the six populations, 500 samples of size 50 were drawn by simple random sampling without replacement. The Epanechnikov kernel, defined as
$$K(u) = \tfrac{3}{4}\,(1 - u^2) \ \text{for } |u| \le 1, \qquad K(u) = 0 \ \text{otherwise},$$
was used in the study for the local polynomial estimator and the Nadaraya-Watson kernel estimator. An optimal bandwidth for the Nadaraya-Watson smoother was sought within an interval defined in terms of $\sigma$, the standard deviation of the $x_i$'s. The kernel function used in the smoothing spline and the periodic spline is the one given in (23) [14], with the local bandwidth satisfying (24).
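The sampling step of this design, 500 simple random samples of size 50 drawn without replacement, can be sketched as below. The seed, the function name and the population size of 100 are assumptions chosen for illustration, not details reported by the study.

```python
import numpy as np


def draw_srswor_samples(population_size, sample_size=50, n_samples=500, seed=0):
    """Return a list of index arrays, each a simple random sample without replacement."""
    rng = np.random.default_rng(seed)
    return [
        rng.choice(population_size, size=sample_size, replace=False)
        for _ in range(n_samples)
    ]


samples = draw_srswor_samples(population_size=100)

# Empirical inclusion frequency of each unit; the design value is n / N = 0.5.
counts = np.zeros(100)
for s in samples:
    counts[s] += 1
inclusion_freq = counts / len(samples)
```

Each estimator would then be evaluated on every index array in `samples`, and the resulting 500 estimates summarized by their average bias and average mean square error.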
3.3. Description of the Computation Procedure
For each of the six populations, we computed the true population total $T = \sum_{i=1}^{N} y_i$, where $N$ is the number of units in each population. The estimator of the population total, $\hat{T}$, was then obtained for each population using the five different estimators as follows:
Ratio Estimator: equation (7)
Local Polynomial: equation (12)
Nadaraya-Watson: equation (18)
Smoothing Spline: equation (25)
Periodic Spline: equation (26)
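To compare the five estimators, the average biases and the average Mean Square Errors (MSE) for each population were calculated. For populations V and VI, the relative change in efficiency was calculated to measure the robustness of the estimators. The Average Bias for each estimator was calculated as
$$\text{Average Bias} = \frac{1}{500} \sum_{r=1}^{500} \left( \hat{T}_k^{(r)} - T \right),$$
where $k$ denotes the different estimators and $r$ indexes the 500 samples.
The Average Mean Square Error for each estimator was obtained from
$$\text{Average MSE} = \frac{1}{500} \sum_{r=1}^{500} \left( \hat{T}_k^{(r)} - T \right)^2.$$
The Relative Change in Efficiency (RCE) for each estimator was given by
$$\text{RCE} = \frac{\text{AMSE}_{VI} - \text{AMSE}_{V}}{\text{AMSE}_{V}},$$
where $\text{AMSE}_{V}$ and $\text{AMSE}_{VI}$ are the Average MSEs of the estimator in populations V and VI respectively.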
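The two Monte Carlo summaries described above can be sketched as plain averages over the replicated estimates, assuming bias is measured as estimate minus true total. The function names and the four toy estimates are hypothetical.

```python
def average_bias(estimates, true_total):
    """Mean of (estimate - true total) over the replicated samples."""
    return sum(e - true_total for e in estimates) / len(estimates)


def average_mse(estimates, true_total):
    """Mean of squared (estimate - true total) over the replicated samples."""
    return sum((e - true_total) ** 2 for e in estimates) / len(estimates)


# Hypothetical estimates from four replicated samples of a population with total 10.
toy_estimates = [9.0, 11.0, 10.0, 14.0]
toy_bias = average_bias(toy_estimates, 10.0)
toy_mse = average_mse(toy_estimates, 10.0)

# The MSE splits into the variance of the estimates plus the squared bias.
mean_est = sum(toy_estimates) / len(toy_estimates)
toy_variance = sum((e - mean_est) ** 2 for e in toy_estimates) / len(toy_estimates)
```

For these toy values the bias is 1.0 and the MSE is 4.5, which equals the variance of 3.5 plus the squared bias of 1.0.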
3.4. Results
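The results of this study are summarized in Tables 3.2, 3.3, 3.4, 3.5 and 3.6 below. Negative biases in Table 3.4 indicate that the average estimate in Table 3.3 falls below the population total in Table 3.2.
Table 3.2. Population totals.
| | Pop 1 | Pop 2 | Pop 3 | Pop 4 | Pop 5 | Pop 6 |
|---|---|---|---|---|---|---|
| Population Sums | 131.002 | 598.124317 | 510.177 | 178.7683 | 12.18925 | 111.4207 |
Table 3.3. Average estimates of the population total.
| Estimator | Pop 1 | Pop 2 | Pop 3 | Pop 4 | Pop 5 | Pop 6 |
|---|---|---|---|---|---|---|
| Nadaraya-Watson | 135.4742 | 617.397269 | 509.3305 | 186.1202 | 11.53005 | 111.2965 |
| Smoothing Spline | 90.64416 | 1836.954517 | 317.3828 | 295.2271 | 22.32936 | 211.1834 |
| Local Polynomial | 131.8575 | 484.6628646 | 395.1619 | 139.7997 | 12.23147 | 113.7409 |
| Ratio Estimator | 163.0781 | 623.9877722 | 534.3458 | 188.1737 | 16.65574 | 152.2789 |
| Periodic Spline | 129.4973 | 598.0745695 | 449.508 | 173.8463 | 11.23823 | 104.4373 |
Table 3.4. Average biases.
| Estimator | Pop 1 | Pop 2 | Pop 3 | Pop 4 | Pop 5 | Pop 6 |
|---|---|---|---|---|---|---|
| Nadaraya-Watson | 4.472212 | 19.27295201 | -0.84652 | 7.351878 | -0.6592 | -0.12418 |
| Smoothing Spline | -40.3578 | 1238.8302 | -192.794 | 116.4588 | 10.1401 | 99.7627 |
| Local Polynomial | 0.855476 | -113.461452 | -115.015 | -38.9686 | 0.042218 | 2.320245 |
| Ratio Estimator | 32.07607 | 25.86345519 | 24.16878 | 9.405401 | 4.466489 | 40.85822 |
| Periodic Spline | -1.5047 | -0.0497475 | -60.669 | -4.92197 | -0.95103 | -6.98339 |
Table 3.5. Average mean square errors (AMSE).
| Estimator | Pop 1 | Pop 2 | Pop 3 | Pop 4 | Pop 5 | Pop 6 |
|---|---|---|---|---|---|---|
| Nadaraya-Watson | 372.2935 | 19113.57612 | 3152.551 | 508.5213 | 0.965757 | 25.60268 |
| Smoothing Spline | 2157.35 | 1714144.509 | 43854.85 | 16611.31 | 103.8781 | 9994.21 |
| Local Polynomial | 4168.499 | 109812.6846 | 6061.498 | 1345.238 | 20.49219 | 1731.601 |
| Ratio Estimator | 332.4187 | 24818.69519 | 15134.26 | 1791.341 | 0.568978 | 31.90509 |
| Periodic Spline | 2448.407 | 18474.74869 | 110964 | 675.3513 | 1.418111 | 70.75136 |
Table 3.6. Relative change in efficiency (RCE).
| Estimator | Nadaraya-Watson | Smoothing Spline | Local Polynomial | Ratio Estimator | Periodic Spline |
|---|---|---|---|---|---|
| RCE | 25.51048 | 95.21094 | 83.50053 | 55.07438 | 48.89127 |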
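The reported RCE values can be cross-checked against the AMSE figures for populations V and VI. The sketch below assumes that RCE is the relative increase in AMSE when moving from population V to population VI. The formula is an assumption, but it reproduces the tabulated RCE values to the reported precision.

```python
# AMSE for (population V, population VI), copied from the AMSE table above.
amse = {
    "Nadaraya-Watson": (0.965757, 25.60268),
    "Smoothing Spline": (103.8781, 9994.21),
    "Local Polynomial": (20.49219, 1731.601),
    "Ratio Estimator": (0.568978, 31.90509),
    "Periodic Spline": (1.418111, 70.75136),
}

# RCE values as reported in the RCE table above.
reported_rce = {
    "Nadaraya-Watson": 25.51048,
    "Smoothing Spline": 95.21094,
    "Local Polynomial": 83.50053,
    "Ratio Estimator": 55.07438,
    "Periodic Spline": 48.89127,
}

# Assumed definition: (AMSE in VI - AMSE in V) / AMSE in V.
computed_rce = {name: (m6 - m5) / m5 for name, (m5, m6) in amse.items()}

# Estimators ordered from most robust (lowest RCE) to least robust.
ranking = sorted(computed_rce, key=computed_rce.get)
```

The resulting ordering places the Nadaraya-Watson estimator first and the smoothing spline last.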
3.5. Discussion of the Results
For population I, which is approximately nonlinear and heteroscedastic, the local polynomial estimator has the smallest bias in absolute value, making it the best estimator for this population. The periodic spline has the smallest bias for population II, which is approximately linear and heteroscedastic. On the other hand, the Nadaraya-Watson estimator has the lowest bias for population III, which is also approximately linear and heteroscedastic. In population IV (approximately linear and homoscedastic), the periodic spline has the lowest bias, making it a good estimator for this population. Table 3.4 shows that, in general, all the estimators have lower biases in population V than in the rest of the populations. The lowest bias there, however, is that of the local polynomial estimator, which makes it a good estimator for the linear homoscedastic model. We further notice that the Nadaraya-Watson estimator has the smallest bias in population VI, making it the best estimator for the nonlinear homoscedastic model.
We next consider the performance of each estimator across the six populations in terms of the average biases shown in Table 3.4. The Nadaraya-Watson estimator performed relatively well in all the populations. It did best, however, in populations III and VI, which are linear and heteroscedastic, and quadratic and homoscedastic, respectively. The smoothing spline, on the other hand, had the largest bias in all the populations. It had its best performance with the linear homoscedastic population. The local polynomial estimator had the lowest bias in population I, which is nonlinear and heteroscedastic, and in population V, which is linear and homoscedastic. Its bias in population VI, which is quadratic and homoscedastic, is also relatively low. As for the ratio estimator, its performance is generally low compared with the other estimators but better than that of the smoothing spline. Its best performance is in population III, which is approximately linear and heteroscedastic.
We then moved on to the Average Mean Square Error (AMSE) in Table 3.5. The smaller the AMSE, the higher the efficiency of the estimator for the given population. In population I, the lowest AMSE was given by the ratio estimator, while in population II it was given by the periodic spline. The Nadaraya-Watson estimator had the lowest AMSE in populations III and IV, while for population V it was the ratio estimator. The Nadaraya-Watson estimator was also the most efficient estimator for the nonlinear homoscedastic population VI.
Finally, we compared the Relative Change in Efficiency (RCE) among the five estimators. Table 3.6 shows that the Nadaraya-Watson estimator had the lowest RCE. The implication is that it is the least sensitive to a change in the structure of the population and hence the most robust of the five estimators. It was followed by the periodic spline, the ratio estimator and the local polynomial estimator, in that order. The smoothing spline was the least robust among them.
4. Summary, Conclusions and Recommendations
4.1. Summary of the Findings
The research set out to estimate the population total using spline functions. Other estimators of the population total were also included for comparative purposes. In all the six populations considered, the periodic spline had a smaller average bias, had a lower average AMSE and was found to be more robust than the smoothing spline. The Nadaraya-Watson estimator performed generally well in terms of the average bias, efficiency and robustness. It had very small biases in both the linear and the nonlinear homoscedastic models. Its bias in the heteroscedastic models was also relatively low. Its efficiency was likewise high in most of the populations, and it had the lowest RCE value of the five estimators considered.
The local polynomial estimator was found to be almost unbiased for a linear homoscedastic model. Its bias, however, goes up when a nonlinear homoscedastic population is considered. In terms of efficiency, the estimator is far more efficient in a linear homoscedastic model than in a nonlinear one. It has a high RCE value.
We observed that the ratio estimator is relatively highly biased across the six populations considered. In terms of efficiency, however, it was the most efficient of the five estimators for a linear homoscedastic model. Its efficiency went down when a nonlinear homoscedastic population was considered, and its RCE value is relatively high. We also observed that the periodic spline and the Nadaraya-Watson estimators gave results that were quite similar in terms of bias, efficiency and robustness.
4.2. Conclusions and Recommendations
We observed from this study that the two spline functions considered perform quite differently. The periodic spline performed better than the smoothing spline in all the aspects considered: bias, efficiency and robustness. We therefore concluded that the periodic spline is a better estimator than the smoothing spline, both for a linear homoscedastic model and when the model assumptions are violated. It was also shown that the Nadaraya-Watson estimator performed well in the linear homoscedastic model and also when these conditions were violated. It had the lowest RCE value. We therefore concluded that the Nadaraya-Watson estimator was the most robust of the five estimators. The results also showed the periodic spline and the Nadaraya-Watson estimators to be quite similar. Thus, we concluded from both the theoretical results and the empirical study that spline smoothing corresponds approximately to smoothing by a kernel method, concurring with the theoretical observation made by [13].
The local polynomial estimator was very sensitive to violation of the model assumptions, and we therefore concluded that it is not robust. The results also indicated that the ratio estimator was the most efficient of the five estimators for a linear homoscedastic model. Nevertheless, when these conditions are violated, the estimator completely breaks down. We conclude that this estimator is not robust to violation of the linear and homoscedastic conditions.
From the findings of the study, we gave the following recommendations:
1. Both the ratio estimator (model-based) and the local polynomial estimator (model-assisted) should be used within the confines of a linear homoscedastic model. They are not appropriate for use when the model is unspecified or when the linear and homoscedastic assumptions are violated.
2. The Nadaraya-Watson and the periodic spline estimators, both of which are nonparametric, should be used in the case of a linear and homoscedastic model and also when the model assumptions are violated. Their sensitivity to a change in the structure of the population is relatively low, and hence they are highly robust. The Nadaraya-Watson estimator, however, is even more robust than the periodic spline.
References