American Journal of Theoretical and Applied Statistics
Volume 4, Issue 5, September 2015, Pages: 396-403

Estimation of Population Total Using Spline Functions

Gladys Gakenia Njoroge

Department of Physical Sciences, Chuka University, Chuka, Kenya


To cite this article:

Gladys Gakenia Njoroge. Estimation of Population Total Using Spline Functions. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 5, 2015, pp. 396-403. doi: 10.11648/j.ajtas.20150405.20


Abstract: This study sought to estimate finite population total using spline functions. The emerging patterns from spline smoother were compared with those that were obtained from the model-based, the model-assisted and the non-parametric estimators. To measure the performance of each estimator, three aspects were considered: the average bias, the efficiency by use of the average mean square error and the robustness using the rate of change of efficiency. We used six populations: four natural and two simulated. The findings showed that the model-based estimator works very well in terms of efficiency while the model-assisted is almost unbiased when the model is linear and homoscedastic. However, the estimators break down when the underlying model assumptions are violated. The Kernel Estimator (Nadaraya-Watson) is found to be the most robust of the five estimators considered. Between the two spline functions that we considered, the periodic spline was found to perform better. The spline functions were found to provide good results whether or not the design points were uniformly spaced. We also found out that, under certain conditions, a smoothing spline estimator and a Kernel estimator are equivalent. The study recommends that both the ratio estimator and the local polynomial estimator should be used within the confines of a linear homoscedastic model. The Nadaraya-Watson and the periodic spline estimators, both of which are non-parametric, are highly robust. The Nadaraya-Watson however is even more robust than the periodic spline.

Keywords: Population Total, Estimator, Efficiency, Homoscedasticity, Robustness


1. Introduction

The name "spline function" was given by [11] to the piecewise polynomial functions known as univariate polynomial splines. The name reflects their resemblance to the curves obtained by draftsmen using a mechanical spline: a thin flexible rod with a groove and a set of weights called "ducks", used to position the rod at the points through which it was desired to draw a smooth interpolating curve. The basic idea dates back at least to [16]. More recent work on the subject includes [6, 12, 14], among others.

The available literature in statistics indicates that the approaches most commonly used in estimating a population total are the model-based, the design-based and the model-assisted approaches. The non-parametric approach has also gained ground, especially with works such as [5, 10] on kernel estimation. Spline smoothing is another non-parametric approach to the estimation of a finite population total. However, little literature is available on this approach, and it has seldom been applied to the estimation of population totals compared with the other approaches. This study therefore sought to estimate the finite population total using spline functions, with the ratio estimator, the local polynomial estimator and kernel functions used for a numerical comparison, to determine whether the resulting estimates would be as accurate as those derived from the earlier approaches. To measure the performance of each estimator, we considered three aspects: the bias, the efficiency as measured by the average mean square error, and the robustness as measured by the rate of change of efficiency.

2. The Estimators

2.1. Ratio Estimator (Model-Based)

The prediction approach is based on a model; Royall [9] summarizes the philosophy behind it. Suppose the number of units $N$ in the finite population is known and that with each unit $i$ is associated a number $y_i$. The general problem is to choose some of the units as a sample, observe the $y_i$'s for the sample units, and then use those observations to estimate the value of some function $h(y_1, \dots, y_N)$ of all the $y_i$'s in the population. The prediction approach treats the numbers $y_1, \dots, y_N$ as realized values of random variables $Y_1, \dots, Y_N$. After the sample has been observed, estimating $h$ entails predicting a function of the unobserved $Y_i$'s. The relationships among the random variables, involving both the auxiliary variable $x$ and the survey variable $y$, are expressed in a model. The general model is

$$Y_i = m(x_i) + e_i, \tag{1}$$

where $m(\cdot)$ is the mean function and $e_i$ is a random error term. After a sample has been selected and observed, the $y_i$'s for the sample units are known, but the values for the non-sample units remain unknown. Because the non-sample $y$ values are unknown, some function of those values must be predicted in order to obtain an estimator or predictor for the full population. Suppose the scatter diagram reveals that the $n$ sample points $(x_i, y_i)$ are clustered around a straight line passing through the origin. Then the ratios $y_i / x_i$, $i = 1, \dots, n$, are more or less the same. We may then postulate the approximate relation

$$\frac{\bar{Y}}{\bar{X}} \approx \frac{\bar{y}}{\bar{x}}.$$

Hence we can write

$$\bar{Y} \approx \frac{\bar{y}}{\bar{x}} \bar{X}, \tag{2}$$

from which we can suggest an estimator of the population mean $\bar{Y}$ as

$$\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}} \bar{X}, \tag{3}$$

where $\bar{y}$ and $\bar{x}$ refer to the sample means for $y$ and $x$, respectively, and $\bar{X}$ is the population mean of $x$. The value of $\bar{X}$ is assumed to be known beforehand. The estimator in (3) is popularly known as the ratio estimator [7]. The estimator of the population total using the model-based approach (prediction approach) thus becomes

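$$\hat{T} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i, \tag{4}$$

where

$$\hat{y}_i = \hat{m}(x_i), \quad i \notin s. \tag{5}$$

Substituting equation (5) in (4) gives

$$\hat{T} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{m}(x_i). \tag{6}$$

We take $m(x_i) = \beta x_i$ for the non-sample units, where $m$ is linear and the error variance is constant, i.e. homoscedastic [9]. Let $\hat{\beta} x_i$ be the predictor of the non-sample values, where $\hat{\beta}$ is given as

$$\hat{\beta} = \frac{\bar{y}}{\bar{x}} = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i}.$$

Thus, our estimate of the population total under Royall's prediction model is

$$\hat{T} = \sum_{i \in s} y_i + \hat{\beta} \sum_{i \notin s} x_i.$$

Therefore,

$$\hat{T}_R = \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i} \sum_{i=1}^{N} x_i \tag{7}$$

is the ratio estimator for the population total.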
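As a minimal sketch of the ratio estimator of the total, assuming NumPy and hypothetical function names (`ratio_total`, `ratio_total_prediction_form`) that do not come from the paper, the estimate can be computed either as a scaled sum of the auxiliary variable or in prediction form (observed sample total plus predicted non-sample total). The two forms agree algebraically; the toy data are illustrative only.

```python
import numpy as np

def ratio_total(y_sample, x_sample, x_population):
    """Ratio estimate: (sum of sampled y / sum of sampled x) * sum of all x."""
    y_sample = np.asarray(y_sample, dtype=float)
    x_sample = np.asarray(x_sample, dtype=float)
    x_population = np.asarray(x_population, dtype=float)
    slope = y_sample.sum() / x_sample.sum()
    return float(slope * x_population.sum())

def ratio_total_prediction_form(y_sample, x_sample, x_nonsample):
    """Same estimate as observed sample total plus predicted non-sample total."""
    y_sample = np.asarray(y_sample, dtype=float)
    x_sample = np.asarray(x_sample, dtype=float)
    x_nonsample = np.asarray(x_nonsample, dtype=float)
    slope = y_sample.sum() / x_sample.sum()
    return float(y_sample.sum() + slope * x_nonsample.sum())

# Toy population (illustrative): y is exactly 2x; units 0, 2 and 4 are sampled.
x_pop = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sample_idx = np.array([0, 2, 4])
nonsample_idx = np.array([1, 3])
y_s = 2.0 * x_pop[sample_idx]

t_hat = ratio_total(y_s, x_pop[sample_idx], x_pop)
t_hat_pred = ratio_total_prediction_form(y_s, x_pop[sample_idx], x_pop[nonsample_idx])
```

Because the toy data lie exactly on a line through the origin, both forms recover the true total of 30.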

2.2. The Local Polynomial Regression Estimator (Model–Assisted)

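Breidt and Opsomer [2] assumed that the population is generated by the super population model $y_i = m(x_i) + \varepsilon_i$, where $\{\varepsilon_i\}$ is an independent sequence of random variables with mean zero and the variance $v(x_i)$ is a smooth function of $x_i$. They employed local polynomial smoothing techniques to obtain a model-assisted regression estimator for the finite population total. We consider a finite population of $N$ units with label set $U = \{1, \dots, N\}$; for each $i \in U$ an auxiliary variable $x_i$ is observed. A probability sample $s$ is drawn from $U$ according to a fixed-size sampling design $p(\cdot)$, where $p(s)$ is the probability of drawing the sample $s$. Let $n$ be the size of $s$. Assume

$$\pi_i = \Pr(i \in s) = \sum_{s \ni i} p(s) > 0$$

and

$$\pi_{ij} = \Pr(i, j \in s) = \sum_{s \ni i, j} p(s) > 0 \quad \text{for all } i, j \in U.$$

The study variable $y_i$ is observed for each $i \in s$. The goal is to estimate

$$T = \sum_{i \in U} y_i.$$

Let $I_i = 1$ if $i \in s$ and $I_i = 0$ otherwise. Then

$$E_p[I_i] = \pi_i,$$

where $E_p$ denotes expectation with respect to the sampling design, i.e. averaging over all possible samples from the finite population.

Using this notation, an estimator $\hat{T}$ of $T$ is said to be design-unbiased if

$$E_p[\hat{T}] = T.$$

A well-known design-unbiased estimator of $T$ is the Horvitz-Thompson estimator,

$$\hat{T}_{HT} = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in U} \frac{y_i I_i}{\pi_i}. \tag{8}$$

The variance of the Horvitz-Thompson estimator under the sampling design is

$$\mathrm{Var}_p(\hat{T}_{HT}) = \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) \frac{y_i}{\pi_i} \frac{y_j}{\pi_j}, \tag{9}$$

with $\pi_{ii} = \pi_i$.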
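The design-unbiasedness of the Horvitz-Thompson estimator can be checked numerically on a small population. The sketch below is illustrative rather than part of the study: it assumes NumPy, simple random sampling without replacement (so every inclusion probability equals $n/N$), a made-up five-unit population, and a hypothetical helper name `horvitz_thompson_total`. It enumerates every possible sample and averages the estimates, which is exactly the design expectation.

```python
from itertools import combinations

import numpy as np

def horvitz_thompson_total(y_sample, inclusion_prob):
    """Sum of sampled y values, each weighted by 1 / inclusion probability."""
    y_sample = np.asarray(y_sample, dtype=float)
    inclusion_prob = np.asarray(inclusion_prob, dtype=float)
    return float(np.sum(y_sample / inclusion_prob))

# Made-up population of N = 5 units; samples of size n = 3 drawn by SRSWOR.
y_pop = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
N, n = len(y_pop), 3
pi = np.full(n, n / N)  # under SRSWOR every unit has inclusion probability n/N

# Averaging over all equally likely samples gives the design expectation.
estimates = [
    horvitz_thompson_total(y_pop[list(s)], pi)
    for s in combinations(range(N), n)
]
design_expectation = float(np.mean(estimates))
true_total = float(y_pop.sum())
```

Here `design_expectation` coincides with `true_total` up to floating-point error, although individual estimates vary from sample to sample.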

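An estimator is proposed that is motivated by modeling the finite population of $y_i$'s, conditioned on the auxiliary variable $x_i$, as a realization from a super population $\xi$ in which $y_i = m(x_i) + \varepsilon_i$. Given $x_i$, $m(x_i)$ is called the regression function, while $v(x_i)$ is the variance function.

Let $K$ denote a continuous kernel function and let $h$ denote the bandwidth. We begin by defining the local polynomial kernel estimator of degree $q$ based on the entire finite population. Let $y_U = [y_i]_{i \in U}$ be the $N$-vector of $y_i$'s in the finite population.

Define the $N \times (q+1)$ matrix

$$X_{Ui} = \left[ 1, \; x_j - x_i, \; \dots, \; (x_j - x_i)^q \right]_{j \in U}$$

and define the $N \times N$ matrix

$$W_{Ui} = \mathrm{diag}\left\{ \frac{1}{h} K\left( \frac{x_j - x_i}{h} \right) \right\}_{j \in U},$$

the kernel weights, where $h$ is the smoothing parameter (bandwidth). Let $e_r$ represent a vector with a 1 in the $r$th position and 0 elsewhere. The local polynomial kernel estimator of the regression function at $x_i$, based on the entire finite population, is then given by

$$m_i = e_1^{\top} \left( X_{Ui}^{\top} W_{Ui} X_{Ui} \right)^{-1} X_{Ui}^{\top} W_{Ui} \, y_U, \tag{10}$$

which is well defined as long as $X_{Ui}^{\top} W_{Ui} X_{Ui}$ is invertible.

Since only the $y_i$ with $i \in s$ are known, $m_i$ is replaced by a sample-based consistent estimator to make its calculation possible.

Let $y_s = [y_i]_{i \in s}$ be the $n$-vector of $y_i$'s obtained in the sample.

Define the $n \times (q+1)$ matrix

$$X_{si} = \left[ 1, \; x_j - x_i, \; \dots, \; (x_j - x_i)^q \right]_{j \in s}$$

and define the $n \times n$ matrix

$$W_{si} = \mathrm{diag}\left\{ \frac{1}{\pi_j h} K\left( \frac{x_j - x_i}{h} \right) \right\}_{j \in s}.$$

A sample design-based estimator of $m_i$ is then given by

$$\hat{m}_i = e_1^{\top} \left( X_{si}^{\top} W_{si} X_{si} \right)^{-1} X_{si}^{\top} W_{si} \, y_s,$$

as long as $X_{si}^{\top} W_{si} X_{si}$ is invertible, where $e_1$ is a $(q+1) \times 1$ vector.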

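The above shows that local polynomial estimators are linear smoothers of the form

$$\hat{m}(x_i) = \sum_{j \in s} w_{ij} \, y_j.$$

The coefficients $w_{ij}$ of the linear combination depend on the degree $q$ of the polynomial approximation. We note that for $q = 0$ the estimator reduces to the Nadaraya-Watson estimator [1]. Now, based on the proposed estimator in equation (6), and assuming that  throughout, due to mathematical complexity, the local polynomial regression estimator for the finite population total is given by

$$\hat{T}_{LP} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{m}(x_i), \tag{11}$$

where $\hat{m}(x_i)$ is the sample estimator for $m(x_i)$. Substituting the sample-based form of equation (10) in (11) above gives

$$\hat{T}_{LP} = \sum_{i \in s} y_i + \sum_{i \notin s} e_1^{\top} \left( X_{si}^{\top} W_{si} X_{si} \right)^{-1} X_{si}^{\top} W_{si} \, y_s. \tag{12}$$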
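The local polynomial fit amounts to a kernel-weighted least squares problem solved separately at each non-sample point. The following is a sketch under stated assumptions, not the study's implementation: it uses NumPy, the Epanechnikov kernel, equal inclusion probabilities (as under simple random sampling, so the design weights cancel out of the fit), and a prediction-form total (sample sum plus fitted values at the non-sample points). The function names and the default degree `q=1` are illustrative choices.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_poly_fit(x0, x_s, y_s, h, q=1):
    """Degree-q local polynomial estimate of m(x0) from sample data."""
    d = np.asarray(x_s, dtype=float) - x0
    X = np.vander(d, N=q + 1, increasing=True)  # columns 1, d, ..., d^q
    w = epanechnikov(d / h) / h                 # kernel weights (equal pi_j assumed)
    XtW = X.T * w                               # X' W without forming diag(W)
    beta = np.linalg.solve(XtW @ X, XtW @ np.asarray(y_s, dtype=float))
    return float(beta[0])                       # e_1' beta: the local intercept

def local_poly_total(x_pop, y_s, sample_idx, h, q=1):
    """Sample sum of y plus local polynomial predictions for non-sample units."""
    x_pop = np.asarray(x_pop, dtype=float)
    in_sample = np.zeros(len(x_pop), dtype=bool)
    in_sample[sample_idx] = True
    x_s = x_pop[sample_idx]                     # same order as y_s
    fitted = [local_poly_fit(x0, x_s, y_s, h, q) for x0 in x_pop[~in_sample]]
    return float(np.sum(y_s) + np.sum(fitted))

# Toy check (illustrative): a local linear fit reproduces an exactly linear mean.
x_pop = np.linspace(0.0, 1.0, 21)
sample_idx = np.arange(0, 21, 2)
y_s = 1.0 + 2.0 * x_pop[sample_idx]
t_hat = local_poly_total(x_pop, y_s, sample_idx, h=0.3, q=1)
```

Each fit requires at least `q + 1` sample points inside the kernel window; otherwise the weighted normal equations are singular, which corresponds to the invertibility condition stated above.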

2.3. Kernel Estimation

We consider the Nadaraya-Watson kernel estimator. It is assumed that the auxiliary information is available for the entire population and that the auxiliary variable $x$ and the study variable $y$ are related in a more general way. The properties of the proposed estimator are studied conditional on the available sample and non-sample values of the auxiliary variable. A conceptually simple approach to representing the weight sequence $\{w_i(x)\}$ is to describe the shape of the weight function by a density function with a scale parameter that adjusts the size and the form of the weights near $x$. This function is commonly referred to as the kernel $K$. The kernel is a continuous, bounded and symmetric function which integrates to one,

$$\int K(u) \, du = 1. \tag{13}$$

To estimate $m(x)$ in model (1), one method is to average the nearby values of $y_i$, where "nearby" is measured in terms of the distance $|x - x_i|$.

Let $K_h(u) = h^{-1} K(u/h)$ be the kernel with bandwidth $h$.

The weight sequence for kernel smoothers (for one-dimensional $x$) is given by

$$w_i(x) = \frac{K_h(x - x_i)}{\sum_{j \in s} K_h(x - x_j)}. \tag{14}$$

This form of kernel weights (14) was proposed by [8, 15].

The Nadaraya-Watson estimator of $m(x)$ in (1) is

$$\hat{m}(x) = \sum_{i \in s} w_i(x) \, y_i. \tag{15}$$

On substituting (14) in (15) we get

$$\hat{m}(x) = \frac{\sum_{i \in s} K\left( \frac{x - x_i}{h} \right) y_i}{\sum_{i \in s} K\left( \frac{x - x_i}{h} \right)}. \tag{16}$$

The shape of the kernel weights is determined by $K$, while their size is governed by the bandwidth $h$: the smaller the bandwidth, the more concentrated the weights are around $x$.

Selection of the bandwidth is the most important part of the kernel estimation method. When selecting the bandwidth we need to consider the error that our selection induces, which is why precision is measured in terms of the pointwise mean squared error (MSE), the sum of the variance and the squared bias. The MSE is given by

$$\mathrm{MSE}[\hat{m}(x)] = E\left[ \hat{m}(x) - m(x) \right]^2 = \mathrm{Var}[\hat{m}(x)] + \left\{ \mathrm{Bias}[\hat{m}(x)] \right\}^2,$$

which tends to zero for the kernel estimator if $h \to 0$ and $nh \to \infty$.

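The non-parametric regression-based estimator, $\hat{T}$, for the population total $T$ is given by

$$\hat{T} = \sum_{i \in s} y_i + \sum_{j \notin s} \hat{m}(x_j), \tag{17}$$

where $\hat{m}(x_j)$ is the Nadaraya-Watson estimator in (16) evaluated at the non-sample point $x_j$.

Therefore the Nadaraya-Watson estimator of the population total is obtained by substituting (16) in (17), which gives

$$\hat{T}_{NW} = \sum_{i \in s} y_i + \sum_{j \notin s} \frac{\sum_{i \in s} K\left( \frac{x_j - x_i}{h} \right) y_i}{\sum_{i \in s} K\left( \frac{x_j - x_i}{h} \right)}, \tag{18}$$

where $\hat{T}_{NW}$ represents the Nadaraya-Watson estimator of the population total.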
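A short sketch may help make the Nadaraya-Watson estimator of the total concrete. It assumes NumPy, the Epanechnikov kernel, a single global bandwidth `h`, and hypothetical helper names; it is an illustration of the kernel-weighted average and the resulting prediction-form total, not the code used in the study.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def nadaraya_watson(x0, x_s, y_s, h):
    """Kernel-weighted average of sampled y values around x0."""
    w = epanechnikov((x0 - np.asarray(x_s, dtype=float)) / h)
    # Assumes at least one sample point lies within distance h of x0.
    return float(np.sum(w * np.asarray(y_s, dtype=float)) / np.sum(w))

def nw_total(x_pop, y_s, sample_idx, h):
    """Sample sum of y plus Nadaraya-Watson predictions for non-sample units."""
    x_pop = np.asarray(x_pop, dtype=float)
    in_sample = np.zeros(len(x_pop), dtype=bool)
    in_sample[sample_idx] = True
    x_s = x_pop[sample_idx]  # same order as y_s
    fitted = [nadaraya_watson(x0, x_s, y_s, h) for x0 in x_pop[~in_sample]]
    return float(np.sum(y_s) + np.sum(fitted))

# Toy data (illustrative): ten units, four of them sampled.
x_pop = np.arange(10, dtype=float)
sample_idx = np.array([0, 3, 6, 9])
y_s = np.array([2.0, 8.0, 14.0, 20.0])

m_mid = nadaraya_watson(4.5, x_pop[sample_idx], y_s, h=4.0)
t_const = nw_total(x_pop, np.full(4, 5.0), sample_idx, h=4.0)
```

At `x0 = 4.5` only the sample points 3 and 6 fall inside the window and they are equidistant, so `m_mid` is their plain average; with a constant study variable the estimator returns the exact total. A smaller `h` concentrates the weights but risks empty windows, in which case the denominator is zero.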

2.4. The Spline Smoothing

A measure of the rapid local variation of a curve can be given by a roughness penalty such as the integrated squared second derivative. Various penalties have been suggested and used, for example in [3], but $\int (g''(x))^2 \, dx$ is most convenient for our purpose. Using this measure, we define the modified sum of squares as

$$S_\lambda(g) = \sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2 + \lambda \int \left( g''(x) \right)^2 dx. \tag{19}$$

The idea behind spline estimation, then, is to find the function $\hat{g}_\lambda$ that solves the minimization problem

$$\hat{g}_\lambda = \arg\min_{g} S_\lambda(g). \tag{20}$$
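The effect of the penalty weight can be seen in a discrete analogue of this criterion. The sketch below is not the smoothing spline itself: it swaps in the Whittaker-type smoother for equally spaced points, in which the integrated squared second derivative is replaced by a sum of squared second differences, so that the minimizer has the closed form $(I + \lambda D^{\top} D)^{-1} y$. NumPy, the function name `whittaker_smooth` and the toy data are assumptions made for illustration.

```python
import numpy as np

def whittaker_smooth(y, lam):
    """Minimize sum (y - g)^2 + lam * sum (second differences of g)^2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)  # (n-2) x n second-difference operator
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), y)

# Toy data (illustrative): a smooth curve plus an alternating disturbance.
x = np.arange(20, dtype=float)
y = np.sin(x / 3.0) + 0.1 * (-1.0) ** np.arange(20)

rough = whittaker_smooth(y, 1e-8)   # almost no penalty: follows the data
smooth = whittaker_smooth(y, 1e8)   # heavy penalty: second differences vanish
line = np.polyval(np.polyfit(x, y, 1), x)  # ordinary least squares line
```

With a negligible penalty the fit reproduces the data, and with a very large penalty it approaches the least squares straight line, mirroring the two limiting cases of the smoothing parameter.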

The parameter $\lambda$ is a smoothing parameter which controls the trade-off between smoothness and goodness of fit to the data. Letting $\lambda \to \infty$ in the minimization of (20) gives a linear fit, whereas letting $\lambda \to 0$ gives a wiggly function. The larger the value of $\lambda$, the more the data will be smoothed to produce the curve estimate. The basic underlying idea of penalising a measure of goodness of fit by one of roughness was described by [16]. Equation (19) shows that the function to be minimized consists of two components. The first is the deviation of the fitted function from the observed values, which measures the goodness of fit. The second is a roughness term, measured by the second-order derivative, which penalises complex functions. From [3] and from the quadratic nature of (19), the spline smoother $\hat{g}_\lambda$ is linear in the observations $y_i$, in the sense that there exists a weight function $G(x, x_i)$ such that

$$\hat{g}_\lambda(x) = n^{-1} \sum_{i=1}^{n} G(x, x_i) \, y_i, \tag{21}$$

where

$$G(x, x_i) \approx \frac{1}{f(x_i)} \, \frac{1}{h(x_i)} \, K\left( \frac{x - x_i}{h(x_i)} \right), \tag{22}$$

with the kernel function $K$ given by

$$K(u) = \frac{1}{2} \exp\left( -\frac{|u|}{\sqrt{2}} \right) \sin\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right) \tag{23}$$

and the local bandwidth $h(x_i)$ satisfies

$$h(x_i) = \lambda^{1/4} \, n^{-1/4} \, f(x_i)^{-1/4}. \tag{24}$$

It has been assumed that $n$ is large and that the design points have local density $f(x)$, in that the proportion of $x_i$ in an interval of length $dx$ near $x$ is approximately $f(x) \, dx$. The approximation in (22) applies for large $n$ provided $x$ is not too near the edge of the interval on which the data lie, and $\lambda$ is not too big or too small.
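The equivalent-kernel description of the spline smoother can be explored numerically. The sketch below assumes the kernel $\frac{1}{2} e^{-|u|/\sqrt{2}} \sin(|u|/\sqrt{2} + \pi/4)$ and the local bandwidth $(\lambda / (n f))^{1/4}$ from the equivalent variable kernel result cited as [13]; NumPy and the function names are illustrative choices, and the design density is supplied by the user rather than estimated.

```python
import numpy as np

def silverman_kernel(u):
    """Equivalent kernel: 0.5 * exp(-|u|/sqrt(2)) * sin(|u|/sqrt(2) + pi/4)."""
    a = np.abs(np.asarray(u, dtype=float)) / np.sqrt(2.0)
    return 0.5 * np.exp(-a) * np.sin(a + np.pi / 4.0)

def local_bandwidth(lam, n, density):
    """Local bandwidth (lam / (n * f))^(1/4) for design density f."""
    return (lam / (n * np.asarray(density, dtype=float))) ** 0.25

def equivalent_weights(x0, x_design, lam, density):
    """Approximate weights n^-1 * G(x0, x_i) attached to each y_i at x0."""
    x_design = np.asarray(x_design, dtype=float)
    n = len(x_design)
    f = np.asarray(density, dtype=float)
    h = local_bandwidth(lam, n, f)
    return silverman_kernel((x0 - x_design) / h) / (n * f * h)

# Numerical check that the kernel integrates to one (simple Riemann sum).
step = 0.001
grid = np.arange(-40.0, 40.0, step)
kernel_integral = float(np.sum(silverman_kernel(grid)) * step)

# Weights at an interior point of a uniform design on [0, 1] (density f = 1).
x_design = np.linspace(0.0, 1.0, 101)
weights = equivalent_weights(0.5, x_design, lam=1e-3, density=1.0)
```

Unlike the Epanechnikov kernel, this kernel takes small negative values in its tails, so some of the weights are negative; at interior points the weights still sum to approximately one.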

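After obtaining the spline smoother $\hat{g}_\lambda$ in equation (21), we can substitute it in equation (17) to obtain the population total. From

$$\hat{g}_\lambda(x) = n^{-1} \sum_{i \in s} G(x, x_i) \, y_i \quad \text{and} \quad G(x, x_i) \approx \frac{1}{f(x_i)} \, \frac{1}{h(x_i)} \, K\left( \frac{x - x_i}{h(x_i)} \right),$$

substituting in

$$\hat{T} = \sum_{i \in s} y_i + \sum_{j \notin s} \hat{g}_\lambda(x_j),$$

we get the smoothing spline estimator of the population total, $\hat{T}_{SS}$, as

$$\hat{T}_{SS} = \sum_{i \in s} y_i + \sum_{j \notin s} n^{-1} \sum_{i \in s} \frac{1}{f(x_i) \, h(x_i)} \, K\left( \frac{x_j - x_i}{h(x_i)} \right) y_i. \tag{25}$$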

While the periodic Spline Estimator of the Population Total  is obtained as

(26)

3. Empirical Study

We present the analysis and results of the five estimators i.e. the ratio, the local polynomial, the Nadaraya-Watson Kernel, the spline smoother and the periodic spline. We used four natural and two artificial populations in the study.

3.1. Description of the Study Populations

In artificial population I, we generated 100 data points according to the linear homoscedastic model:

with and

In artificial population II, we again generated 100 data points according to the quadratic homoscedastic model:

with

We obtained the natural populations from the Kenya Central Bureau of Statistics for the period between 2006 and 2014. The description of each of the populations is given in Table 3.1 below.

Table 3.1. Description of the four natural populations.

Population   Data Points   Description
I            100           Value (in millions) of road transport equipment imported; quantity (number) of road transport equipment imported.
II           126           Value (in thousands) of principal articles traded; quantity (units) of principal articles traded.
III          130           Total number of employees engaged per industry; total number of firms and establishments per industry.
IV           130           Total outputs per industry in the manufacturing sector; total inputs per industry in the manufacturing sector.

Scatter plots drawn for each of the four natural populations (Population I-IV) were used to deduce the form of the population structures as below:

Population I: the structure of the population could be non-linear and heteroscedastic

Population II: the structure of the population could be linear and heteroscedastic.

Population III: the structure of the population could be linear and heteroscedastic

Population IV: the structure of the population could be linear and homoscedastic.

Populations V and VI were the artificial populations with known population structures:

Population V: follows a linear homoscedastic model passing through the origin.

Population VI: follows a quadratic homoscedastic model.

3.2. Design of the Study

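For each of the six populations, 500 samples of size 50 were drawn by simple random sampling without replacement. The Epanechnikov kernel, defined as

$$K(u) = \begin{cases} \frac{3}{4}(1 - u^2), & |u| \le 1, \\ 0, & \text{otherwise,} \end{cases}$$

was used in the study for the local polynomial estimator and the Nadaraya-Watson kernel estimator. An optimal bandwidth for the Nadaraya-Watson smoother was sought within an interval defined in terms of the standard deviation of the $x_i$'s. The kernel function used in the spline smoothing and periodic spline is $K(u) = \frac{1}{2} \exp\left( -|u|/\sqrt{2} \right) \sin\left( |u|/\sqrt{2} + \pi/4 \right)$ [14], with the local bandwidth $h(x)$ satisfying $h(x) = \lambda^{1/4} n^{-1/4} f(x)^{-1/4}$.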

3.3. Description of the Computation Procedure

For each of the six populations, we computed the true population total $T = \sum_{i=1}^{N} y_i$, where $N$ is the number of units in each population. The estimator of the population total, $\hat{T}$, was then obtained for each population using the five different estimators as follows:

Ratio Estimator: equation (7)

Local Polynomial: equation (12)

Nadaraya-Watson: equation (18)

Smoothing Spline: equation (25)

Periodic Spline: equation (26)

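To compare the five estimators, the average biases and the average mean square errors (MSE) for each population were calculated. For populations V and VI, the relative change in efficiency was calculated to measure the robustness of the estimators. The average bias for each estimator was calculated as

$$\text{Average Bias}(\hat{T}_e) = \frac{1}{500} \sum_{k=1}^{500} \left( \hat{T}_{e,k} - T \right),$$

where $e$ denotes the different estimators and $\hat{T}_{e,k}$ is the estimate obtained from the $k$th sample.

The average mean square error for each estimator was obtained from

$$\text{Average MSE}(\hat{T}_e) = \frac{1}{500} \sum_{k=1}^{500} \left( \hat{T}_{e,k} - T \right)^2.$$

The relative change in efficiency (RCE) for each estimator was given by

$$\text{RCE} = \frac{\text{AMSE}_{VI} - \text{AMSE}_{V}}{\text{AMSE}_{V}},$$

where $\text{AMSE}_{V}$ and $\text{AMSE}_{VI}$ are the average mean square errors of the estimator in populations V and VI, respectively.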
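The computation procedure can be sketched as a small Monte Carlo loop. The example below is illustrative only: it assumes NumPy, uses the ratio estimator as the estimator under test, and generates a hypothetical linear homoscedastic population through the origin whose slope, noise level and range of $x$ are made-up values, not those of the study's artificial populations. The sampling scheme (500 samples of size 50 drawn without replacement) follows the design described above.

```python
import numpy as np

def ratio_total(y_s, x_s, x_pop):
    """Ratio estimator of the total, used here as the estimator under test."""
    return y_s.sum() / x_s.sum() * x_pop.sum()

def simulate(x_pop, y_pop, estimator, n=50, reps=500, seed=0):
    """Average bias and average MSE of an estimator over repeated SRSWOR samples."""
    rng = np.random.default_rng(seed)
    true_total = y_pop.sum()
    errors = np.empty(reps)
    for k in range(reps):
        idx = rng.choice(len(y_pop), size=n, replace=False)  # SRSWOR
        errors[k] = estimator(y_pop[idx], x_pop[idx], x_pop) - true_total
    return float(errors.mean()), float(np.mean(errors**2))

# Hypothetical population: y = 2x + noise (parameters are illustrative).
rng = np.random.default_rng(1)
x_pop = rng.uniform(1.0, 10.0, size=100)
y_pop = 2.0 * x_pop + rng.normal(0.0, 1.0, size=100)

avg_bias, avg_mse = simulate(x_pop, y_pop, ratio_total)
```

Any other estimator with the same call signature can be passed to `simulate` to obtain comparable average bias and average MSE figures on the same set of samples, since the seed fixes the sample sequence.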

3.4. Results

The results of this study are summarized in Tables 3.2, 3.3, 3.4, 3.5 and 3.6 below:

Table 3.2. True Population Totals.

                   Pop 1      Pop 2         Pop 3      Pop 4      Pop 5      Pop 6
Population Sums    131.002    598.124317    510.177    178.7683   12.18925   111.4207

Table 3.3. Estimates of Population Totals.

Estimator          Pop 1      Pop 2          Pop 3      Pop 4      Pop 5      Pop 6
Nadaraya-Watson    135.4742   617.397269     509.3305   186.1202   11.53005   111.2965
Smoothing Spline   90.64416   1836.954517    317.3828   295.2271   22.32936   211.1834
Local Polynomial   131.8575   484.6628646    395.1619   139.7997   12.23147   113.7409
Ratio Estimator    163.0781   623.9877722    534.3458   188.1737   16.65574   152.2789
Periodic Spline    129.4973   598.0745695    449.508    173.8463   11.23823   104.4373

Table 3.4. Average Bias.

Estimator          Pop 1      Pop 2          Pop 3      Pop 4      Pop 5      Pop 6
Nadaraya-Watson    4.472212   19.27295201    -0.84652   7.351878   -0.6592    -0.12418
Smoothing Spline   -40.3578   1238.8302      -192.794   116.4588   10.1401    99.7627
Local Polynomial   0.855476   -113.461452    -115.015   -38.9686   0.042218   2.320245
Ratio Estimator    32.07607   25.86345519    24.16878   9.405401   4.466489   40.85822
Periodic Spline    -1.5047    -0.0497475     -60.669    -4.92197   -0.95103   -6.98339

Table 3.5. Average Mean Square Error.

Estimator          Pop 1      Pop 2          Pop 3      Pop 4      Pop 5      Pop 6
Nadaraya-Watson    372.2935   19113.57612    3152.551   508.5213   0.965757   25.60268
Smoothing Spline   2157.35    1714144.509    43854.85   16611.31   103.8781   9994.21
Local Polynomial   4168.499   109812.6846    6061.498   1345.238   20.49219   1731.601
Ratio Estimator    332.4187   24818.69519    15134.26   1791.341   0.568978   31.90509
Periodic Spline    2448.407   18474.74869    110964     675.3513   1.418111   70.75136

Table 3.6. Relative Change in Efficiency (RCE).

Estimator   Nadaraya-Watson   Smoothing Spline   Local Polynomial   Ratio Estimator   Periodic Spline
RCE         25.51048          95.21094           83.50053           55.07438          48.89127
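The RCE values can be recomputed from the average mean square errors of populations V and VI in Table 3.5. The sketch below assumes that RCE is the change in AMSE from population V to population VI relative to the AMSE in population V; this definition is inferred from the reported figures, which it reproduces, and the dictionary and function names are illustrative.

```python
# Average MSE in populations V and VI, copied from Table 3.5.
amse = {
    "Nadaraya-Watson": (0.965757, 25.60268),
    "Smoothing Spline": (103.8781, 9994.21),
    "Local Polynomial": (20.49219, 1731.601),
    "Ratio Estimator": (0.568978, 31.90509),
    "Periodic Spline": (1.418111, 70.75136),
}

# RCE values as reported in Table 3.6.
reported = {
    "Nadaraya-Watson": 25.51048,
    "Smoothing Spline": 95.21094,
    "Local Polynomial": 83.50053,
    "Ratio Estimator": 55.07438,
    "Periodic Spline": 48.89127,
}

def relative_change_in_efficiency(amse_v, amse_vi):
    """Relative change in AMSE when moving from population V to population VI."""
    return (amse_vi - amse_v) / amse_v

rce = {name: relative_change_in_efficiency(*pair) for name, pair in amse.items()}

# Estimators ordered from most robust (smallest RCE) to least robust.
ranking = sorted(rce, key=rce.get)
```

The recomputed values agree with Table 3.6 to the precision reported, and `ranking` orders the estimators from the Nadaraya-Watson estimator (most robust) to the smoothing spline (least robust).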

3.5. Discussion of the Results

For population I, which is approximately non-linear and heteroscedastic, the bias of the local polynomial estimator is the smallest, making it the best estimator for this population in terms of bias. The periodic spline has the smallest bias for population II, which is approximately linear and heteroscedastic. On the other hand, the Nadaraya-Watson estimator has the lowest bias for population III, which is also approximately linear and heteroscedastic. In population IV (approximately linear and homoscedastic), the periodic spline has the lowest bias, making it a good estimator for this population. Table 3.4 shows that, in general, all the estimators have lower biases in population V than in the other populations. The lowest bias, however, is that of the local polynomial estimator, which makes it a good estimator for the linear homoscedastic model. We further notice that the Nadaraya-Watson estimator has the smallest bias in population VI, making it the best estimator for the non-linear homoscedastic model.

We next consider the performance of each estimator across the six populations in terms of average bias, as shown in Table 3.4. The Nadaraya-Watson estimator performed relatively well in all the populations. It did best, however, in populations III and VI, which are linear and heteroscedastic and quadratic and homoscedastic, respectively. The smoothing spline, on the other hand, had the largest bias in all the populations. It had its best performance with a linear homoscedastic population. The local polynomial estimator had the lowest bias in population I, which is non-linear and heteroscedastic, and in population V, which is linear and homoscedastic. Its bias in population VI, which is quadratic and homoscedastic, is also relatively low. The ratio estimator generally performed poorly compared to the other estimators, although better than the smoothing spline. Its best performance is in population three which is approximately linear and heteroscedastic.

Then we moved on to the Average Mean Square Error (AMSE) in table 3.5. The smaller the AMSE, the higher the efficiency of the estimator for the given population. In population I, the lowest AMSE was given by the Ratio Estimator while in population II, it was the periodic spline. Nadaraya-Watson had the lowest AMSE in population III and IV while for Population V it was the Ratio Estimator. On the other hand, the Nadaraya-Watson was the most efficient estimator for the non-linear homoscedastic population VI.

Finally, we compared the Relative Change in Efficiency (RCE) among the five estimators. We noticed from table 3.6 that the Nadaraya-Watson had the lowest RCE. The implication here was that it is the least sensitive to the change of structure of the population and hence the most Robust among the five estimators. It was then followed by the Periodic Spline, the Ratio Estimator and the Local polynomial. The Smoothing Spline was the least Robust among them.

4. Summary, Conclusions and Recommendations

4.1. Summary of the Findings

The research set out to estimate the population total using spline functions. Other estimators of the population total were also included for comparative purposes. In all the six populations considered, the periodic spline had a smaller average bias, had a smaller AMSE and was found to be more robust than the smoothing spline. The Nadaraya-Watson estimator performed generally well in terms of average bias, efficiency and robustness. It had very small biases in both the linear and the non-linear homoscedastic models. Its bias in the heteroscedastic models was also relatively low. Its efficiency was also high in most of the populations, and it had the lowest RCE value of the five estimators considered.

The local polynomial estimator was found to be almost unbiased for a linear homoscedastic model. Its bias however goes up when a non-linear homoscedastic population is considered. In terms of efficiency, the estimator is far more efficient in a linear homoscedastic model than a non-linear one. It has a high RCE value.

We observed that the ratio estimator is relatively highly biased across the six populations considered. In terms of efficiency, however, it was the most efficient of the five estimators for a linear homoscedastic model. Its efficiency went down when a non-linear homoscedastic population was considered, and its RCE value is relatively high. We also observed that the periodic spline and the Nadaraya-Watson estimators gave results that were quite similar in terms of bias, efficiency and robustness.

4.2. Conclusions and Recommendations

We observed from this study that the two spline functions considered perform quite differently. The periodic spline performed better than the smoothing spline in all the aspects considered: bias, efficiency and robustness. We, therefore, concluded that the periodic spline is a better estimator than the smoothing spline in a case of a linear homoscedastic model and even when the model assumptions have been violated. It was also shown that the Nadaraya-Watson estimator performed well in the linear homoscedastic model and also when the conditions were violated. It had the lowest RCE value. Therefore, we came to the conclusion that, Nadaraya-Watson estimator was the most robust of the five estimators. The results also showed the periodic spline and the Nadaraya-Watson estimators to be quite similar. Thus, we concluded from both the theoretical results and the empirical study that spline smoothing corresponds approximately to smoothing by a Kernel method thus concurring with the theoretical observation made by [13].

The local polynomial estimator was very sensitive to violation of the model assumptions, and we therefore concluded that it is not robust. The results also indicated that the ratio estimator was the most efficient of the five estimators for a linear homoscedastic model. Nevertheless, when these conditions are violated, the estimator completely breaks down. We conclude that this estimator is not robust to the violation of the linear and homoscedastic conditions.

From the findings of the study, we gave the following recommendations:

1. Both the ratio estimator (model-based) and the local polynomial (model-assisted) estimator should be used within the confines of a linear homoscedastic model. They are not appropriate for use when the model is unspecified or when the linear and homoscedastic assumptions are violated.

2. The Nadaraya-Watson and the periodic spline estimators, both of which are non-parametric, should be used in the case of a linear and homoscedastic model and even when the model assumptions are violated. Their sensitivity to a change in the structure of the population is relatively low, and hence they are highly robust. The Nadaraya-Watson estimator, however, is even more robust than the periodic spline.


References

  1. Aerts, M., Augustyns, I. and Janssen, P., "Smoothing Sparse Multinomial Data Using Local Polynomial Fitting," Journal of Nonparametric Statistics, 8, 127-147, 1997.
  2. Breidt, F. J. and Opsomer, J. D., "Local Polynomial Regression Estimators in Survey Sampling," Annals of Statistics, 28, 1026-1053, 2000.
  3. Cardot, H., "Local Roughness Penalties for Regression Splines," Computational Statistics, 17, 89-102, 2002.
  4. Fuller, W. A., Sampling Statistics, Wiley, Hoboken, 2009.
  5. Harms, T. and Duchesne, P., "On Kernel Non-Parametric Regression Designed for Complex Survey," Metrika, 72 (1), 111-138, July 2010.
  6. Kauermann, G., Krivobokova, T. and Fahrmeir, L., "Some Asymptotic Results on Generalized Penalized Spline Smoothing," J. R. Statist. Soc. Series B, 71, 487-503, 2009.
  7. Lu, J. and Yan, Z., "A Class of Ratio Estimators of a Finite Population Mean Using Two Auxiliary Variables," PLoS ONE, 9 (2), e89538, doi: 10.1371/journal.pone.0089538, 2014.
  8. Nadaraya, E. A., "On Estimating Regression," Jour. Theory Probab. Appl., 9 (1), 141-142, 1964.
  9. Royall, R. M., "Likelihood Functions in Finite Population Sampling Theory," Biometrika, 63, 605-614, 1976.
  10. Sarda, P. and Vieu, P., "Kernel Regression," in Smoothing and Regression: Approaches, Computation and Application, Ed. M. G. Schimek, Wiley Series in Probability and Statistics, 2000, 43-70.
  11. Schoenberg, I. J., "Spline Functions and the Problem of Graduation," Proc. Nat. Acad. Sci. U.S.A., 52, 947-950, 1964.
  12. Schumaker, L. L., Spline Functions: Computational Methods, SIAM, Philadelphia, 2015.
  13. Silverman, B. W., "Spline Smoothing: The Equivalent Variable Kernel Method," The Annals of Statistics, 12 (3), 898-916, 1984.
  14. Wahba, G., "Smoothing Noisy Data with Spline Functions," Numerische Mathematik, 24, 383-393, 1975.
  15. Watson, G. S., "Smooth Regression Analysis," Sankhya Ser. A, 26, 359-372, 1964.
  16. Whittaker, E., "On a New Method of Graduation," Proc. Edinburgh Math. Soc., 41, 63-75, 1923.
