American Journal of Theoretical and Applied Statistics
Volume 5, Issue 5, September 2016, Pages: 252-259

Non-parametric Variance Estimation Using Donor Imputation Method

Hellen W. Waititu1, *, Edward Njenga2

1Department of Statistics and Computer Sciences, Moi University, Nairobi, Kenya

2Department of Mathematics, Kenyatta University, Nairobi, Kenya

Email address:

(H. W. Waititu)

*Corresponding author

To cite this article:

Hellen W. Waititu, Edward Njenga. Non-parametric Variance Estimation Using Donor Imputation Method. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 5, 2016, pp. 252-259. doi: 10.11648/j.ajtas.20160505.11

Received: July 1, 2016; Accepted: July 16, 2016; Published: August 3, 2016


Abstract: The main objective of this study is to investigate the relative performance of the donor imputation method in situations that are likely to occur in practice, and to carry out a numerical comparison of variance estimators based on Nadaraya-Watson kernel estimators with other estimators. The Nadaraya-Watson kernel estimator can be viewed as a non-parametric imputation method, as it leads to an imputed estimator with negligible bias without requiring the specification of a parametric imputation model. Simulation studies were carried out to investigate the performance of Nadaraya-Watson kernel estimators in terms of variance. The results show that the Nadaraya-Watson kernel estimator has negligible bias and small variance. When compared with the naïve, jackknife and bootstrap estimators, the Nadaraya-Watson kernel estimator was found to perform better than the bootstrap estimator in both linear and non-linear populations.

Keywords: Hot Deck Imputation, Non-parametric, Unbiased Estimator, Donor, Recipient, Donor Imputation


1. Introduction

Donor imputation is a method in which the missing values for one or more variables of a non-responding unit (recipient) are replaced by the corresponding values of a responding unit (donor) with no missing values for these variables. A variance estimation method that accounts for donor imputation and remains valid even in the presence of high sampling fractions is given in [1]. However, very few variance estimation methods that take donor imputation into account have been developed. Donor imputation is convenient and has some interesting statistical properties. Although it may not be the most efficient method in any specific scenario, it is popular in surveys due to its practical advantages. Therefore, it remains useful to develop variance estimation methods that take donor imputation into account. In this study, a variance estimator after donor imputation has been investigated and compared with the naïve, jackknife and bootstrap estimators.

Variance estimation methods accounting for the effect of imputation have been studied by [11], [13] and [8], among others. Some methods of variance estimation that have been developed for use with imputed data include a model-assisted method [11], an adjusted jackknife method [11], and multiple imputation [8]. [2] considered Random Hot-Deck (RHD) imputation under more general sampling designs, assuming that a one-factor analysis of variance model holds. [9], [6] and [5] dealt with Nearest Neighbor Imputation (NNI). [3] considered an alternative to re-sampling variance estimation methods for NNI. [10] considered NNI under simple random sampling, assuming that a ratio imputation model holds. [1] dealt with general donor imputation methods, including NNI, possibly with post-imputation edit rules and hierarchical imputation classes, under general sampling designs and more general imputation models.

In this paper, non-parametric variance estimation using the donor imputation method is considered, with the parameters $\mu$ and $\sigma^2$ estimated using the kernel method proposed by Nadaraya (1964) and Watson (1964).

2. Estimation Procedure

Consider a population of $N$ elements identified by a set of indices $U = \{1, 2, \dots, N\}$. Associated with the $i$th unit in the population are two variables $(x_i, y_i)$, where $i = 1, 2, \dots, N$. The variable $y$ has some unknown values and is the variable under study. The variable $x$ is the auxiliary variable, assumed to be known for all units of the population. A simple random sample without replacement (SRSWOR) of size $n$, denoted by $s$, is drawn from the population. Suppose that $y_1, \dots, y_r$ are observed (respondents) and $y_{r+1}, \dots, y_n$ are missing (non-respondents). That is, $r$ units respond for $y$ and $m$ do not respond. Therefore $n = r + m$. Consider a unit $j$ with $r + 1 \le j \le n$. The NNI method imputes a missing $y_j$ by $y_i$, where $1 \le i \le r$ and $i$ is the nearest neighbor of $j$ measured by the $x$ variable. That is, $i$ satisfies $|x_i - x_j| = \min_{1 \le l \le r} |x_l - x_j|$. If there are tied $x$ values, then there may be multiple nearest neighbors of $j$, and $i$ is randomly selected from them. Suppose that the minimum occurs for unit $i$. Then the value $y_i$ is imputed for the missing $y_j$.
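The nearest neighbor rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `nearest_neighbor_impute`, the use of `None` to mark a missing value and the toy data are assumptions made for illustration only.

```python
import random

def nearest_neighbor_impute(x, y, rng=None):
    """Fill each missing y (marked None) with the y of the respondent closest in x.

    Ties in |x_i - x_j| are broken at random. Returns the completed list and,
    for every unit, the index of the donor used (None for respondents).
    """
    rng = rng or random.Random()
    donors = [i for i, v in enumerate(y) if v is not None]
    if not donors:
        raise ValueError("at least one respondent is required")
    completed, donor_of = list(y), [None] * len(y)
    for j, v in enumerate(y):
        if v is None:
            dmin = min(abs(x[i] - x[j]) for i in donors)
            ties = [i for i in donors if abs(x[i] - x[j]) == dmin]
            i = rng.choice(ties)  # random selection among tied neighbors
            completed[j] = y[i]
            donor_of[j] = i
    return completed, donor_of

# Toy sample (hypothetical values): units 1 and 3 did not respond.
x = [1.0, 2.0, 3.5, 4.0, 6.0]
y = [10.0, None, 14.0, None, 20.0]
completed, donor_of = nearest_neighbor_impute(x, y, random.Random(0))
print(completed)  # [10.0, 10.0, 14.0, 14.0, 20.0]
print(donor_of)   # [None, 0, None, 2, None]
```

In this toy sample there are no ties, so the result does not depend on the random seed.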

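The completed data set is

$$\{\tilde{y}_i : i \in s\} \qquad (1)$$

where $\tilde{y}_i = y_i$ if unit $i$ is a respondent and $\tilde{y}_i$ is the imputed value if unit $i$ is a non-respondent. If the survey has 100% response, then the population mean

$$\bar{Y} = \frac{1}{N} \sum_{i \in U} y_i \qquad (2)$$

is estimated by the sample mean $\bar{y} = \frac{1}{n} \sum_{i \in s} y_i$ and its variance is estimated by

$$v(\bar{y}) = \left(1 - \frac{n}{N}\right) \frac{s_y^2}{n} \qquad (3)$$

where $s_y^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y})^2$.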

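In the presence of non-response, the customary approach to point estimation is to take the formula for 100% response and calculate it on the completed data set. Thus from (2), the estimator of $\bar{Y}$ is $\bar{y}_I = \frac{1}{n} \sum_{i=1}^{r} (1 + d_i) y_i$, where $d_i$ is the number of times the $i$th responding unit is used as a donor. For variance estimation, the naïve approach is to apply the ordinary variance estimator (3) to the data after imputation, i.e. $v_{naive} = \left(1 - \frac{n}{N}\right) \frac{s_{\tilde{y}}^2}{n}$, where $s_{\tilde{y}}^2 = \frac{1}{n-1} \sum_{i \in s} (\tilde{y}_i - \bar{y}_I)^2$ and $\tilde{y}_i$ is defined by (1). This variance estimator can be biased.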
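As a small numerical illustration of the naïve approach, the sketch below applies the usual SRSWOR variance formula, assumed here to be $(1 - n/N)\,s^2/n$, to a completed data set in which imputed values are treated as if they were reported. The function name, the population size `N = 50` and the data values are hypothetical.

```python
def srswor_variance_of_mean(values, N):
    """Usual SRSWOR variance estimator of the sample mean: (1 - n/N) * s^2 / n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    s2 = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (1 - n / N) * s2 / n

# Completed data set after imputation (two of the five values are imputed).
completed = [10.0, 10.0, 14.0, 14.0, 20.0]
v_naive = srswor_variance_of_mean(completed, N=50)
print(round(v_naive, 3))  # 3.024
```

Because donated values are copies of observed values, the completed data tend to look less variable than the true responses would, which is one reason the naïve estimator can be biased.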

Let $p(s)$ denote the sampling design, that is, $p(s)$ is the known probability of obtaining a sample $s$. In our case, $p(s)$ denotes the SRSWOR design. Given $s$, denote the response mechanism by $q(s_r \mid s)$, i.e. $q(s_r \mid s)$ is the unknown conditional probability that the response set $s_r$ is obtained. We assume that $q(s_r \mid s)$ may depend on the auxiliary variable $x$ but not on the values of $y$. The total error of $\bar{y}_I$ (the sum of the sampling error and the imputation error) can be broken down as follows:

$$\bar{y}_I - \bar{Y} = (\bar{y} - \bar{Y}) + (\bar{y}_I - \bar{y})$$

We note that

, where

Thus the bias of  is  Variance of  denoted by  is given by

(4)

 is a standard variance estimator using the imputed values as if they were reported values. This is called the naïve variance estimator. [2] show that under the cell mean model and hot deck imputation, the bias of the naïve variance estimator as an estimator for  is small when no respondent is used too often as a donor of an imputed value.

The jackknife variance estimator is given by [8]. In the presence of non-response to item $y$, the use of this estimator may lead to serious underestimation of the variance of the estimator, especially if the non-response rate is high. [11] proposed an adjusted jackknife method that is calculated in a similar fashion, except that, whenever a responding unit is deleted, the imputed values are adjusted. The imputed values are unchanged if a non-responding unit is deleted. Let $y_k^{*(j)}$ denote the adjusted imputed value for unit $k$ when unit $j$ is deleted. For mean imputation, we have $y_k^{*(j)} = \bar{y}_r(j)$, where $\bar{y}_r(j)$ denotes the mean of the respondents excluding unit $j$. The Rao-Shao jackknife variance estimator is then given by

$$v_J = \frac{n-1}{n} \sum_{j \in s} \left(\bar{y}_I^{a}(j) - \bar{y}_I\right)^2$$

where $\bar{y}_I^{a}(j)$ is the imputed estimator computed from the adjusted data set with unit $j$ deleted.
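The adjustment can be made concrete with a short sketch for the mean imputation case. It assumes the delete-one form $v_J = \frac{n-1}{n}\sum_j (\bar{y}_I^{a}(j) - \bar{y}_I)^2$ with no finite population correction. The function name and the data are hypothetical, and this is an illustration rather than the code used in the study.

```python
def rao_shao_jackknife_mean_imputation(y):
    """Adjusted delete-one jackknife variance of the imputed mean (mean imputation).

    y holds observed values, with None marking non-respondents.
    """
    n = len(y)
    resp = [v for v in y if v is not None]
    r, m = len(resp), n - len(resp)
    if r < 2:
        raise ValueError("need at least two respondents")
    total_r = sum(resp)
    ybar_r = total_r / r
    ybar_imputed = ybar_r  # under mean imputation the imputed mean equals the respondent mean
    acc = 0.0
    for v in y:
        if v is None:
            # Non-respondent deleted: the remaining imputed values are unchanged.
            replicate = (total_r + (m - 1) * ybar_r) / (n - 1)
        else:
            # Respondent deleted: imputed values are adjusted to the mean
            # of the remaining respondents.
            ybar_r_j = (total_r - v) / (r - 1)
            replicate = ((total_r - v) + m * ybar_r_j) / (n - 1)
        acc += (replicate - ybar_imputed) ** 2
    return (n - 1) / n * acc

y = [10.0, None, 14.0, None, 20.0]
v_jack = rao_shao_jackknife_mean_imputation(y)
print(round(v_jack, 4))  # 10.1333
```

Deleting a non-respondent leaves the replicate equal to the full-sample imputed mean, so only deleted respondents contribute to the sum.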

The bootstrap variance estimator is obtained by re-sampling. [3] proposed a rescaling bootstrap method in order to estimate the variance. Their method draws bootstrap samples of size $n^*$ with replacement from the rescaled sample. Note that $n^*$ may be different from $n$. The rescaling factor is chosen so that the variance under re-sampling matches the usual variance estimator of the population mean.

The Rao-Wu bootstrap variance estimator is given by where . Applying the Rao-Wu bootstrap in the presence of missing responses and treating the missing values as true values, may lead to serious underestimation of the variance of the estimator. In the presence of imputed data, [12] proposed a bootstrap procedure for imputed survey data. The Shao-Sitter bootstrap variance estimator is given by

, where
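The idea of a bootstrap for imputed data, in which each bootstrap sample is re-imputed with the same imputation method, can be sketched as follows. This is a simplified illustration: it re-samples $n$ units with replacement, ignores the rescaling factor and the finite population correction, and uses nearest neighbor imputation. The function names and the data are hypothetical.

```python
import random

def nn_impute(x, y, rng):
    """Nearest neighbor imputation on x; None marks a missing y; ties broken at random."""
    donors = [i for i, v in enumerate(y) if v is not None]
    out = list(y)
    for j, v in enumerate(y):
        if v is None:
            dmin = min(abs(x[i] - x[j]) for i in donors)
            ties = [i for i in donors if abs(x[i] - x[j]) == dmin]
            out[j] = y[rng.choice(ties)]
    return out

def bootstrap_variance_with_reimputation(x, y, n_boot=1000, seed=0):
    """Variance of the imputed mean over bootstrap samples that are each re-imputed."""
    rng = random.Random(seed)
    n = len(y)
    means = []
    while len(means) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        if all(v is None for v in yb):
            continue  # no donors in this bootstrap sample; draw again
        completed = nn_impute(xb, yb, rng)
        means.append(sum(completed) / n)
    centre = sum(means) / n_boot
    return sum((mean - centre) ** 2 for mean in means) / (n_boot - 1)

x = [1.0, 2.0, 3.5, 4.0, 6.0, 7.5, 8.0, 9.0]
y = [10.0, None, 14.0, None, 20.0, 22.0, None, 27.0]
v_boot = bootstrap_variance_with_reimputation(x, y, n_boot=500, seed=1)
print(v_boot)
```

Re-imputing inside every bootstrap sample lets the re-sampling variability reflect the imputation step as well as the sampling step, which is what treating imputed values as true values fails to do.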

2.1. Donor Imputation

A sample $s$ of size $n$ is drawn from the population $U$ according to a probability sampling design $p(s)$. In the absence of non-response, we assume SRSWOR with mean $\bar{y} = \frac{1}{n} \sum_{k \in s} y_k$.

Variable $y$ is only observed for a subset $s_r$ of $s$ according to a response mechanism $q(s_r \mid s)$. This subset of size $r$ is called the set of respondents (or donors), while its complement $s_m$ of size $m$ is called the set of non-respondents (or recipients). To compensate for the missing values, donor imputation is performed. This leads to the imputed estimator of the mean given by

$$\bar{y}_I = \frac{1}{n} \left( \sum_{k \in s_r} y_k + \sum_{k \in s_m} y_k^* \right)$$

where $y_k^* = y_{l(k)}$ and $l(k)$ is the donor used to impute the recipient $k$. A variety of strategies can be considered in practice in order to find donors for imputing recipients. Usually, a vector $\mathbf{x}_k$ of auxiliary variables, available for all the sample units $k \in s$, is used to determine a set of selected donors that are "close" to the corresponding recipients in $s_m$.

2.2. Approach to Inference

To evaluate properties of the imputed mean estimator $\bar{y}_I$ and to make inferences, the following imputation model is used:

$$E_m(y_k \mid \mathbf{X}) = \mu(\mathbf{x}_k), \quad V_m(y_k \mid \mathbf{X}) = \sigma^2(\mathbf{x}_k), \quad \mathrm{cov}_m(y_k, y_j \mid \mathbf{X}) = 0 \ \text{for } k \ne j \qquad (5)$$

where the subscript $m$ indicates that the expectation, variance, and covariance are evaluated with respect to the imputation model, $\mathbf{X}$ is the $N$-row matrix containing $\mathbf{x}_k'$ in its $k$th row, and $\mu(\cdot)$ and $\sigma^2(\cdot)$ are parametric or non-parametric smooth functions of $\mathbf{x}$. Note that the subscript $m$ in $s_m$ indicates missing values and should not be confused with the imputation model.

The vector $\mathbf{x}$ contains variables used at the imputation stage for the selection of donors. In principle, the imputer uses available variables that are associated with the $y$-variable. The vector $\mathbf{x}$ may thus contain design variables (e.g., strata and cluster indicators, size measure), the domain of interest or other auxiliary variables. It is assumed in model (5) that the imputer has appropriately chosen the vector $\mathbf{x}$ of auxiliary variables so that the design variables and the domain of interest do not further explain the $y$-variable after conditioning on $\mathbf{x}$. This allows us to treat the design variables and the domain(s) of interest as being fixed under model (5).

3. Proposed Variance Estimator

Considering model (5), the total error of $\bar{y}_I$ can be broken down into sampling error and imputation error as shown in (4).

The expectation appearing in the true variance component can be evaluated, leading to expressions which depend on the known $x$ values and on the unknown model parameters $\mu$ and $\sigma^2$. Therefore, to estimate the three components of the variance, all we need to provide are model-unbiased estimators of $\mu$ and $\sigma^2$. However, this will not completely lead to an explicit variance estimator, since we still have to obtain expectations of some terms with respect to the response mechanism.

3.1. Estimation of VSAM

Hence, unbiased estimator of  is

where $\hat{\mu}$ and $\hat{\sigma}^2$ are model-unbiased estimators of $\mu$ and $\sigma^2$ respectively.

3.2. Estimation of VIMP

Hence an unbiased estimator of  is

3.3. Estimation of VMIX

It follows that the unbiased estimator of  is

The estimator for the Variance is given by

=

3.4. Estimation of $\mu$ and $\sigma^2$

One of the most common methods in non-parametric regression is the kernel method introduced by Nadaraya (1964) and Watson (1964), whose estimate depends on a bandwidth [7]. Kernel estimators with varying bandwidths are used in particular to estimate the densities of long-tailed and multimodal distributions. A kernel estimate is introduced here to obtain a non-parametric estimate of a regression function.

Smooth linear estimate of $\mu(x)$

A smooth linear estimate of a function $\mu(x)$, denoted by $\hat{\mu}(x)$, can be written in general form as $\hat{\mu}(x) = \sum_i W_k(x, x_i)\, y_i$,

where $W_k$ denotes a smoothing function with a bandwidth parameter $k$. This bandwidth parameter determines the amount of smoothing to be done. The estimates proposed by Nadaraya (1964) and Watson (1964) associated with kernel functions [7] will be considered.

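3.5. Nadaraya-Watson Smooth Estimate of $\mu(x)$

Nadaraya (1964) and Watson (1964) independently proposed the following estimate of $\mu(x)$:

$$\hat{\mu}(x) = \frac{\sum_i K\left(\frac{x - x_i}{k}\right) y_i}{\sum_i K\left(\frac{x - x_i}{k}\right)}$$

where $k$ denotes the bandwidth parameter and the sums run over the observed pairs $(x_i, y_i)$. $K(\cdot)$ is called the kernel function, with the following properties.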

1) 

2) 

3)   [7]
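A minimal sketch of the Nadaraya-Watson estimate is given below, assuming the standard kernel-weighted-average form and the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)$ on $|u| \le 1$. The function names and the data points are hypothetical and serve only to show the computation.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def nadaraya_watson(x0, xs, ys, k, kernel=epanechnikov):
    """Kernel-weighted average of ys at the point x0 with bandwidth k."""
    weights = [kernel((x0 - xi) / k) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within the bandwidth of x0")
    return sum(w * yi for w, yi in zip(weights, ys)) / total

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
mu_hat = nadaraya_watson(1.0, xs, ys, k=2.0)
print(mu_hat)  # 3.0
```

Here the weights at the three points are 0.5625, 0.75 and 0.5625, so the estimate at $x = 1$ is the symmetric weighted average 3.0. A smaller bandwidth gives more weight to the nearest observations and less smoothing.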

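3.6. Smooth Linear Estimate of $\sigma^2(x)$

Consider $y_i = \mu(x_i) + e_i$, where $E(e_i) = 0$ and $V(e_i) = \sigma^2(x_i)$.

The estimate of the residual term is given by $\hat{e}_i = y_i - \hat{\mu}(x_i)$.

The square of the estimate of this residual term is given by

$$\hat{e}_i^2 = \left(y_i - \hat{\mu}(x_i)\right)^2 \qquad (6)$$

To smooth $\hat{e}_i^2$, we choose a smooth function $W_h$ with a bandwidth parameter $h$. Using (6), we get $\hat{\sigma}^2(x) = \sum_i W_h(x, x_i)\, \hat{e}_i^2$, which is a smooth estimate of $\sigma^2(x)$.

A corresponding Nadaraya-Watson estimate of $\sigma^2(x)$ is given by $\hat{\sigma}^2(x) = \frac{\sum_i K\left(\frac{x - x_i}{h}\right) \hat{e}_i^2}{\sum_i K\left(\frac{x - x_i}{h}\right)}$, where $h$ denotes the bandwidth parameter.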
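The two-step construction, first smoothing $y$ to estimate the mean function and then smoothing the squared residuals to estimate the variance function, can be sketched as below. The helper names, the two bandwidth arguments `k_mean` and `k_var`, and the data are assumptions for illustration only.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def nw_smooth(x0, xs, values, bandwidth):
    """Nadaraya-Watson kernel-weighted average of values at x0."""
    weights = [epanechnikov((x0 - xi) / bandwidth) for xi in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within the bandwidth of x0")
    return sum(w * v for w, v in zip(weights, values)) / total

def variance_function_estimate(x0, xs, ys, k_mean, k_var):
    """Smooth the squared residuals y_i - mu_hat(x_i) to estimate sigma^2 at x0."""
    fitted = [nw_smooth(xi, xs, ys, k_mean) for xi in xs]
    squared_residuals = [(yi - fi) ** 2 for yi, fi in zip(ys, fitted)]
    return nw_smooth(x0, xs, squared_residuals, k_var)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.5, 4.5, 7.5, 9.0]
sigma2_hat = variance_function_estimate(2.0, xs, ys, k_mean=1.5, k_var=2.5)
print(sigma2_hat)
```

Since the result is a weighted average of squared residuals with non-negative weights, the estimate cannot be negative, and it is exactly zero when the mean estimate reproduces every observation.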

The estimator for the Variance is given by

where $\hat{\mu}$ and $\hat{\sigma}^2$ are as given above.

4. Simulation Studies

In our simulation study, the performance of the proposed donor estimator was compared with the naïve estimator, Jackknife estimator and bootstrap estimator empirically. In our comparison, two artificial population structures (linear and non-linear), one real population (linear) and two non-response mechanisms were considered. We conducted a simulation study to evaluate the performance of our variance estimator in terms of Relative Bias (RB) and Variance.

The first population (linear population) was generated as follows: 100 data points were generated according to the linear homoscedastic model;

This was done by first generating the values of the auxiliary variable $x$ and then the values of $y$. In the second population structure (non-linear population), 100 data points were generated according to the quadratic homoscedastic model;

A simple random sample with a sampling fraction of 0.225 was taken without replacement from each population structure. We considered two non-response mechanisms: random and non-random non-response.

For the random non-response mechanism, non-responses were generated using independent Bernoulli trials with a constant parameter 0.3 representing the probability of non-response.

For a non random non-response mechanism, the sample values were arranged in order of magnitude using  values and then the largest 30% of the values were regarded as missing.
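The two non-response mechanisms can be sketched as follows. For the non-random mechanism, the sketch assumes that units are ordered by the auxiliary variable $x$, which is an assumption made here because the ordering variable is not legible in the text. The function names and the data are hypothetical.

```python
import random

def random_nonresponse(n, p=0.3, rng=None):
    """Independent Bernoulli trials: True marks a non-respondent, with probability p."""
    rng = rng or random.Random()
    return [rng.random() < p for _ in range(n)]

def nonrandom_nonresponse(x, fraction=0.3):
    """Mark the units with the largest x values (a given fraction) as non-respondents."""
    n_missing = int(round(fraction * len(x)))
    order = sorted(range(len(x)), key=lambda i: x[i])
    missing = set(order[len(x) - n_missing:])
    return [i in missing for i in range(len(x))]

x = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
mask_nonrandom = nonrandom_nonresponse(x)
mask_random = random_nonresponse(len(x), p=0.3, rng=random.Random(42))
print(mask_nonrandom)
print(mask_random)
```

Under the random mechanism the realized non-response rate varies from sample to sample around 0.3, whereas the non-random mechanism always removes exactly the chosen fraction of units.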

Non-responses were generated for each non-response mechanism. To compensate for the missing values, nearest neighbor imputation was performed. After imputation, the four variance estimates were calculated. The experiment was repeated 1000 times independently and the average of each estimate was computed. In the case of the bootstrap estimator, 1000 bootstrap iterations were used. In the case of the donor estimator, we used the bandwidth parameter that minimized the mean squared error and satisfied Silverman's (1986) condition.

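The Epanechnikov kernel function, $K(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$ and $K(u) = 0$ otherwise, was used, since it is optimal in the sense of minimizing the asymptotic mean integrated squared error.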

The performances of estimators were assessed using two criteria: the relative bias and the Variance. The relative bias of the estimators is calculated as follows:

= where ,  is the value of  for the  experiment and  represents the value of the estimator for the  experiment.
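One common way to compute a Monte Carlo relative bias of a variance estimator is sketched below. It assumes that the reference value is the Monte Carlo variance of the point estimates across experiments. The exact definition used in the study is not legible here, so this form, the function name and the numbers are assumptions for illustration.

```python
def relative_bias(variance_estimates, point_estimates):
    """(mean of variance estimates - Monte Carlo variance) / Monte Carlo variance."""
    n_rep = len(point_estimates)
    centre = sum(point_estimates) / n_rep
    mc_variance = sum((t - centre) ** 2 for t in point_estimates) / n_rep
    mean_estimate = sum(variance_estimates) / len(variance_estimates)
    return (mean_estimate - mc_variance) / mc_variance

# Two hypothetical experiments: point estimates 1 and 3, variance estimates 1.2 and 1.0.
rb = relative_bias([1.2, 1.0], [1.0, 3.0])
print(round(rb, 3))  # 0.1
```

A positive value indicates that the variance estimator overestimates the Monte Carlo variance on average, and a negative value indicates underestimation.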

5. Results

The results were then tabulated showing the performance of the estimators in terms of relative bias and Variance. Three populations were analyzed with each population having two tables. One table shows the case when the non-response mechanism is random while the other shows the case when the non-response mechanism is non-random.

a)  Case when population is linear.

Fig. 1. Graph of Survey variables against design variables.

From Table 1, the naïve estimator has the smallest variance, followed by the jackknife, while our proposed estimator performs better than the bootstrap. The proposed estimator has the highest relative bias, followed by the naïve estimator, while the jackknife and bootstrap estimators seem to do well in terms of relative bias.

b)  Case when population is real

Fig. 2. Graph of withdrawals against deposits.

The results of Table 2 are similar to those of Table 1. This implies that whether the population is real or artificial, as long as it is linear, the estimators behave in the same way.

c)  Case when population scatter is non-linear.

Fig. 3. Graph of Survey variables against design variables.

Table 1. Variance, Relative bias and M.S.E for the four variance estimators.

Table 2. Variance, Relative bias and M.S.E for the four variance estimators.

Table 3. Variance, Relative bias and M.S.E for the four variance estimators.

According to Table 3, our proposed estimator performs better than the bootstrap estimator while the naïve and Jackknife estimators have the smallest Variance. Bootstrap seems to be the best in terms of relative bias while our proposed estimator has the highest relative bias.

Discussion of the results

Considering the above three tables, which compare the estimators when the population is linear or non-linear, the naïve estimator seems to have the smallest variance, followed by the jackknife estimator, while our proposed estimator alternates with the bootstrap. In the non-linear population, our proposed estimator performs better than the bootstrap in terms of variance. It is also noted that the variance and relative bias of the four estimators have close numerical values, suggesting that they are all valid.

It is worth noting that donor imputation may not be the most efficient imputation method in any specific scenario. Nevertheless, it is quite a popular imputation method in surveys due to its practical advantages. Therefore it is useful to develop variance estimation methods that take donor imputation into account.

6. Conclusion

The simulation study examined the performance of four variance estimators. Two population structures (linear and non-linear) and two non-response mechanisms were considered. The performance of the variance estimators was evaluated in terms of Relative Bias (RB) and variance. It was noted that the variance and the relative bias of the four estimators have very close numerical values. Hence all are valid and work well in the simulation study. We have proposed a variance estimation method for any type of donor imputation. It is valid and was shown to work well in a simulation study. The variance of the proposed estimator is small and its relative bias is also small.

Thus, it is useful to develop a variance estimation method that takes donor imputation into account. Its main drawback is that it depends on the validity of an imputation model. This is also a characteristic of the methods for NN imputation. Two key issues with any variance estimation method that relies on an imputation model are the appropriate choice of auxiliary variables for donor selection and the estimation of the model mean $\mu$ and variance $\sigma^2$ given the chosen auxiliary variables. Auxiliary variables should be associated with the variable of interest so as to ensure that the conditional model bias remains small [1].


References

  1. Beaumont, J. F. and Bocci, C. (2009). Variance estimation when donor imputation is used to fill in missing values. Canadian Journal of Statistics, 96 (4), 917-932.
  2. Beaumont, J. F. and Bocci, C. (2007). Variance estimation when donor imputation is used to fill in missing values. Proceedings of the Third International Conference on Establishment Surveys, Montréal.
  3. Scholz, F. W. (2007). The Bootstrap Small Sample Properties. University of Washington.
  4. Brick, J. M., Kalton, G. and Kim, J. K. (2004). Variance estimation with hot deck imputation using a model. Survey Methodology, 30, 57-66.
  5. Chen, J. and Shao, J. (2000). Nearest neighbour imputation for survey data. Journal of Official Statistics, 16, 113–131.
  6. Chen, J. and Shao, J. (2001). Jackknife variance estimation for nearest neighbor imputation. Journal of the American Statistical Association, 96, 260-269.
  7. Fuller, A. and Kim, J. K. (2000). Hot Deck Imputation for the Response Model. Vol. 31, No. 2, pp. 139-149 Statistics Canada, Catalogue No. 12-00 Statistica Sinica 10, 1153-1169.
  8. Kim, J. K. (2001). Variance Estimation After Imputation. Survey Methodology, Vol. 27, No. 1, pp. 75-83. Statistics Canada, Catalogue No. 12-001.
  9. Njenga, E. G. (1990). Robust estimation of the regression coefficients in complex surveys. (Doctoral dissertation, 1990).
  10. Rao, J. N. K. and Shao, J. (1992). Jackknife Variance Estimation with Survey Data under Hot Deck Imputation. Biometrika, 79, 811-822.
  11. Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.
  12. Shao, J., and Steel, P. (1999). Variance estimation for survey data with composite imputation and nonnegligible sampling fractions. Journal of the American Statistical Association, 94, 254-265.
  13. Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer-Verlag.
