American Journal of Theoretical and Applied Statistics
Volume 5, Issue 1, January 2016, Pages: 13-22

Kernel-Type Estimators of Divergence Measures and Its Strong Uniform Consistency

Hamza Dhaker1, *, Papa Ngom1, El Hadji Deme2, Pierre Mendy3

1Departement de Mathématiques et Informatique, Faculté des Sciences et Technique, Université Cheikh Anta Diop, Dakar, Sénégal

2Sciences Appliquées et Technologie, Unité de Formation et de Recherche, Université Gaston Berger, Saint-Louis, Sénégal

3Département de Techniques Quantitatives, Faculté des Sciences Economiques et de Gestion, Université Cheikh Anta Diop, Dakar, Sénégal


To cite this article:

Hamza Dhaker, Papa Ngom, El Hadji Deme, Pierre Mendy. Kernel-Type Estimators of Divergence Measures and Its Strong Uniform Consistency. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 1, 2016, pp. 13-22. doi: 10.11648/j.ajtas.20160501.13


Abstract: Nonparametric density estimation, based on kernel-type estimators, is a very popular method in statistical research, especially when we want to model the probabilistic or stochastic structure of a data set. In this paper, we investigate asymptotic confidence bands built from kernel estimators for some types of divergence measures (the Rényi-α and Tsallis-α divergences). Our aim is to use a method based on empirical process techniques in order to derive some asymptotic results. Under different assumptions, we establish a variety of fundamental and theoretical properties, such as the uniform-in-bandwidth strong consistency of the divergence estimators. We further apply the previous results to simulated examples, including the kernel-type estimators for the Hellinger, Bhattacharyya and Kullback-Leibler divergences, to illustrate this approach, and we show that the method performs competitively.

Keywords: Divergence Measures, Kernel Estimation, Strong Uniform Consistency


1. Introduction

In this paper, we focus on the similarity between two distributions. Given a sample from one distribution, a fundamental and classical question is: how similar is its density to another, known density? First, one must specify what it means for two distributions to be close, and many different measures quantifying the degree of similarity between distributions have been studied in the past. They are frequently called distance measures, although some of them are not strictly metrics. Divergence measures play an important role in statistical theory, especially in estimation and testing. They have been applied to different areas, such as medical image registration [21], classification and retrieval. There are several important problems in machine learning and statistics that require the estimation of the distance or divergence between distributions. Divergence between distributions also proves useful in neuroscience: for example, [14] employs divergence to quantify the difference between neural response patterns.

Many papers have appeared in the literature in which divergence or entropy-type measures of information are used for testing statistical hypotheses. For more examples and other possible applications of divergence measures, see the extended technical report [23,24]. Given the key role divergence measures play in these various applications, it is necessary to estimate them accurately.

Recently, Ngom et al. [16] introduced a divergence-based indicator by proposing a test, built on a divergence measure, for choosing between a random walk and an AR(1) model.

The class of divergence measures is large; it includes the Rényi-α [25, 26], Tsallis-α [30], Kullback-Leibler (KL), Hellinger, Bhattacharyya and Euclidean divergences, among others. These divergence measures can be related to the Csiszár φ-divergence [3]. The Kullback-Leibler, Hellinger and Bhattacharyya divergences are special cases of the Rényi-α and Tsallis-α divergences, the Kullback-Leibler divergence being the most popular of these measures. The estimation of divergences and its applications have been studied extensively using different approaches. For example, Pardo [20] presented methods and applications in the case of discrete distributions. Exploring a nonparametric method for estimating the divergence in the continuous case, Póczos and Schneider [23] proposed a k-nearest-neighbor estimator and proved its weak consistency for the Rényi-α and Tsallis-α divergences.

Finding nonparametric estimators of divergence measures remains an open issue. Krishnamurthy et al. [15] used an initial plug-in estimator, corrected by estimates of the higher-order terms in the von Mises expansion of the divergence functional. In their framework, they proposed three estimators, for the Rényi-α, Tsallis-α and Euclidean divergences between two continuous distributions, and established the rates of convergence of these estimators. The main purpose of this paper is to analyze estimators of divergence measures between two continuous distributions. Our approach is similar to that of Krishnamurthy et al. [15] and is based on a plug-in estimation scheme: first, we apply a consistent density estimator for the underlying densities, and then we plug them into the desired formulas. Unlike their framework, we study the strong consistency of estimators for a general class of divergence measures. We emphasize that plug-in estimation techniques are heavily used by [2, 9] in the case of entropy. Bouzebda and Elhattab [2] proposed a method to establish consistency for kernel-type estimators of the differential entropy. We generalize this method to a large class of divergence measures in order to establish the consistency of kernel-type estimators of divergence measures when the bandwidth is allowed to range in a small interval which may decrease in length with the sample size. Our results are immediately applicable to proving strong consistency of kernel-type estimators of this class of divergence measures.

The rest of this paper is organized as follows: in Section 2, we introduce divergence measures and construct their kernel-type estimators. In Section 3, we study the uniform strong consistency of the proposed estimators. Section 4 is devoted to the proofs. In Section 5, numerical examples are proposed in order to illustrate the performance of our method. Finally, in Section 6, we present our conclusion.

2. Kernel-Type Estimators of Divergence Measures

In this section we fix the notation and present some basic definitions. We are interested in two densities f and f_0 on ℝ^d, where d denotes the dimension. The divergence measures of interest, the Rényi-α and Tsallis-α divergences, are defined respectively as follows:

\[ D^{R}_{\alpha}(f, f_0) = \frac{1}{\alpha - 1}\,\log \int_{\mathbb{R}^d} f^{\alpha}(x)\, f_0^{1-\alpha}(x)\, dx, \qquad \alpha \neq 1, \tag{1} \]

\[ D^{T}_{\alpha}(f, f_0) = \frac{1}{\alpha - 1}\left( \int_{\mathbb{R}^d} f^{\alpha}(x)\, f_0^{1-\alpha}(x)\, dx - 1 \right), \qquad \alpha \neq 1. \tag{2} \]

These quantities are nonnegative and equal zero if and only if f = f_0 almost surely (a.s.). Remark that in the special cases α = 1/2 and α → 1, we obtain from (1) and (2) the well-known Hellinger, Bhattacharyya and Kullback-Leibler divergences; the latter is related to the Shannon entropy. For some statistical properties of the Shannon entropy, one can refer to [2].

In the following, we focus only on the estimation of D^R_α(f, f_0) and D^T_α(f, f_0); the Kullback-Leibler, Hellinger and Bhattacharyya divergences can then be deduced immediately.
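
For concreteness, the standard relations behind this remark are recalled below; they assume the usual normalizations of these divergences, which may differ from the paper's conventions by constant factors.

\[
\lim_{\alpha\to 1} D^{R}_{\alpha}(f,f_0)=\lim_{\alpha\to 1} D^{T}_{\alpha}(f,f_0)=\int f(x)\,\log\frac{f(x)}{f_0(x)}\,dx = D_{KL}(f,f_0),
\]
\[
D_{B}(f,f_0)=-\log\!\int\!\sqrt{f(x)\,f_0(x)}\,dx=\tfrac{1}{2}\,D^{R}_{1/2}(f,f_0),
\qquad
H^{2}(f,f_0)=1-\int\!\sqrt{f(x)\,f_0(x)}\,dx=\tfrac{1}{2}\,D^{T}_{1/2}(f,f_0).
\]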

We will next provide a consistent estimator of the following quantity

\[ I_{\alpha}(f, f_0) := \int_{\mathbb{R}^d} f^{\alpha}(x)\, f_0^{1-\alpha}(x)\, dx, \tag{3} \]

whenever this integral is meaningful. Plugging its estimate into the appropriate formula immediately leads to a consistent estimator for the divergence measures D^R_α and D^T_α.
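
To make these definitions concrete, here is a minimal numerical sketch (not taken from the paper): it approximates the quantity in (3) by quadrature for two known univariate densities and plugs the result into (1) and (2). The density choices f = N(0, 1) and f_0 = N(1, variance 2) are illustrative assumptions.

import numpy as np
from scipy import stats
from scipy.integrate import quad

def I_alpha(f, f0, alpha, lo=-20.0, hi=20.0):
    """Numerical approximation of I_alpha = int f(x)^alpha * f0(x)^(1-alpha) dx."""
    val, _ = quad(lambda x: f(x) ** alpha * f0(x) ** (1.0 - alpha), lo, hi)
    return val

def renyi(f, f0, alpha):
    """Renyi-alpha divergence, as in (1)."""
    return np.log(I_alpha(f, f0, alpha)) / (alpha - 1.0)

def tsallis(f, f0, alpha):
    """Tsallis-alpha divergence, as in (2)."""
    return (I_alpha(f, f0, alpha) - 1.0) / (alpha - 1.0)

# Illustrative densities (assumptions of this sketch): f = N(0, 1), f0 = N(1, variance 2).
f = stats.norm(0.0, 1.0).pdf
f0 = stats.norm(1.0, np.sqrt(2.0)).pdf

for a in (0.5, 0.9, 2.0):
    print(f"alpha = {a}: Renyi = {renyi(f, f0, a):.4f}, Tsallis = {tsallis(f, f0, a):.4f}")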

For the rest of the paper, we assume that the density f is unknown, whereas the density f_0 is known and satisfies a finiteness condition ensuring that the integral in (3) is finite. Next, consider X_1, X_2, ..., a sequence of independent and identically distributed ℝ^d-valued random vectors with cumulative distribution function F and density function f with respect to the Lebesgue measure on ℝ^d. To construct our divergence estimators, we start with a kernel density estimator of f and then substitute f by its estimator in the divergence-like functional (3). To this end, we introduce a measurable kernel function K that satisfies the following conditions.

(K.1) K is of bounded variation on ℝ^d;

(K.2) K is right continuous on ℝ^d;

(K.3) ‖K‖_∞ := sup_{x∈ℝ^d} |K(x)| < ∞;

(K.4) ∫_{ℝ^d} K(t) dt = 1.

Rosenblatt [27] first proposed an estimator of f, and Parzen [19] generalized it, eventually leading to the Parzen-Rosenblatt estimator, defined in the following way for any x ∈ ℝ^d:

\[ f_{n,h_n}(x) = \frac{1}{n h_n}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_n^{1/d}}\right), \tag{4} \]

where h_n is the bandwidth sequence. Assuming that the density f is continuous, one obtains a strongly consistent estimator f_{n,h_n} of f, that is, with probability 1, f_{n,h_n}(x) → f(x) for every x as n → ∞. There are also results concerning uniform convergence and convergence rates. For proving such results one usually writes the difference f_{n,h_n}(x) − f(x) as the sum of a probabilistic term f_{n,h_n}(x) − E[f_{n,h_n}(x)] and a deterministic term E[f_{n,h_n}(x)] − f(x), also called the bias. For further explanation one can refer to [10, 12, 13], among other authors. After having estimated f, we estimate I_α(f, f_0) by setting

(5)

where  and  is a sequence of positive constants. Thus, using (5), the associated divergences D^R_α and D^T_α can be estimated by plugging this estimate into (1) and (2).

The approach used to define the plug-in estimators is also developed in [2] in order to introduce a kernel-type estimator of Shannon's entropy. Some statistical properties of these divergence estimators are related to those of the kernel estimator of the continuous density f. The limiting behavior of this estimator, for appropriate choices of the bandwidth, has been widely studied in the literature; examples include the work of Devroye [6, 7], Bosq and Lecoutre [1] and Prakasa Rao [22]. In particular, under our assumptions, the condition that  together with  is necessary and sufficient for the convergence in probability of  towards the limit , independently of  and the density . Other results on the uniform consistency of the estimator can be found in [4, 5, 10] and the references therein. In the next section, we use the methods developed in these references to establish convergence results for the estimator (5) and deduce the convergence results for the estimators of D^R_α and D^T_α.
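
The plug-in scheme just described can be sketched in a few lines. In the sketch below, the Gaussian kernel, the small truncation level (used to avoid raising zero to a negative power) and the Riemann-sum integration grid are assumptions of this illustration, not choices prescribed by the paper.

import numpy as np
from scipy import stats

def kde_gauss(sample, h):
    """Parzen-Rosenblatt estimator with a Gaussian kernel and bandwidth h (d = 1)."""
    s = np.asarray(sample)
    def f_n(x):
        u = (np.atleast_1d(x)[:, None] - s[None, :]) / h
        return stats.norm.pdf(u).mean(axis=1) / h
    return f_n

def plug_in_I_alpha(f_n, f0, alpha, grid, eps=1e-12):
    """Riemann-sum approximation of int f_n(x)^alpha * f0(x)^(1-alpha) dx over the grid."""
    dx = grid[1] - grid[0]
    return np.sum(np.maximum(f_n(grid), eps) ** alpha * f0(grid) ** (1.0 - alpha)) * dx

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=500)          # observations from the "unknown" f
f0 = stats.norm(1.0, np.sqrt(2.0)).pdf           # known reference density (assumed)
grid = np.linspace(-10.0, 10.0, 4001)

alpha = 0.5
I_hat = plug_in_I_alpha(kde_gauss(sample, h=0.3), f0, alpha, grid)
print("Renyi estimate:", np.log(I_hat) / (alpha - 1.0))
print("Tsallis estimate:", (I_hat - 1.0) / (alpha - 1.0))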

3. Statistical Properties of the Estimators

We first study the strong consistency of the estimator defined in (5). Throughout the remainder of this paper, we will use a specific notation for the term which is delicate to handle; it is given by

Lemma 1 Let K satisfy (K.1-2-3-4) and let f be a continuous bounded density. Then, for each pair of sequences ,  such that , together with ,  as , and for any , one has with probability 1

The proof of Lemma 1 is postponed until Section 4.

Lemma 2 Let K satisfy (K.3-4) and let f be a uniformly Lipschitz and continuous density. Then, for each pair of sequences ,  such that , together with , as , and for any , we have

The proof of Lemma 2 is postponed until Section 4.

Theorem 1 Let K satisfy (K.1-2-3-4) and let f be a uniformly Lipschitz, bounded and continuous density. Then, for each pair of sequences ,  such that , together with ,  as , and for any , one has with probability 1

This, in turn, implies that

(6)

The proof of Theorem 1 is postponed until Section 4.

The following corollaries handle respectively the uniform deviation of the estimate  and  with respect to  and .

Corollary 1 Assume that the assumptions of Theorem 1 hold. Then, we have

This, in turn, implies that

(7)

The proof of Corollary 1 is postponed until Section 4.

Corollary 2 Assume that the assumptions of Theorem 1 hold. Then, we have

This, in turn, implies that

(8)

The proof of Corollary 2 is postponed until Section 4.

Note that a divergence estimator such as (5) also requires an appropriate choice of the smoothing parameter. The results given in (6), (7) and (8) show that any choice of the bandwidth within the prescribed interval ensures the strong consistency of the underlying divergence estimators. In other words, fluctuations of the bandwidth within a small interval do not affect the consistency of the nonparametric estimators of these divergences (a numerical illustration of this insensitivity is sketched at the end of this section). The work of Bouzebda and Elhattab [2] is very important for establishing our results; these authors considered a class of compactly supported densities and used the following additional conditions.

(F.1) f has a compact support, say , and is -times continuously differentiable, and there exists a constant  such that

(K.5) K is of order , i.e., for some constant

and

Under (F.1) the expression  may be written as follows

(9)

Theorem 2 Assume that conditions (K.1-2-3-4-5) hold and let f fulfill (F.1). Then, for each pair of sequences  with ,  as , and for any , we have

where

The proof of Theorem 2 is postponed until Section 4.

Corollary 3 Assume that the assumptions of Theorem 2 hold. Then,

Corollary 4 Assume that the assumptions of Theorem 2 hold. Then, for any  we have

The proofs of Corollaries 3 and 4 are given in Section 4.

Now, assume that there exists a sequence  of strictly nondecreasing compact subsets of , such that  For the estimation of the support , we may refer to [8] and the references therein. Throughout, we let , where  and  are as in Corollaries 3 and 4. We choose an estimator of  in Corollaries 3 and 4 of the form

Using the techniques developed in [5] together with Corollaries 3 and 4, one can construct asymptotic certainty intervals for the true divergences D^R_α and D^T_α.
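
To illustrate numerically the uniform-in-bandwidth remark made after Corollaries 1 and 2, the following sketch (reusing the kde_gauss and plug_in_I_alpha helpers and the simulated data from the sketch in Section 2) evaluates the Rényi-0.5 plug-in estimate over a grid of bandwidths in an interval [a_n, b_n] shrinking with n. The particular rates chosen for a_n and b_n are illustrative assumptions only.

import numpy as np

# Bandwidth interval [a_n, b_n]; these rates are illustrative assumptions, not the paper's choices.
n = len(sample)
a_n, b_n = 0.5 * n ** (-1.0 / 5.0), 2.0 * n ** (-1.0 / 5.0)

alpha = 0.5
for h in np.linspace(a_n, b_n, 7):
    I_hat = plug_in_I_alpha(kde_gauss(sample, h), f0, alpha, grid)
    print(f"h = {h:.3f}  ->  Renyi-0.5 plug-in estimate = {np.log(I_hat) / (alpha - 1.0):.4f}")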

4. Proofs of Our Results

Proof of Lemma 1. To show the strong consistency of , we use the following expression

where  and  is a sequence of positive constants. Define

We have

Since  is a 1-Lipschitz function, for  then

.

Therefore for , we have

where  denotes, as usual, the supremum norm, i.e., . Hence,

(10)

Finally,

(11)

Using the conditions on the kernel imposed by Einmahl and Mason [11], consider the class of functions

For , set , where the supremum is taken over all probability measures  on , where  represents the σ-field of Borel sets of , i.e., the smallest σ-field containing all the open (and/or closed) balls of . Here,  denotes the -metric and  is the minimal number of balls  of -radius  needed to cover .

We assume that  satisfies the following uniform entropy condition.

(K.6) for some  and ,

(K.7)  is a pointwise measurable class, that is, there exists a countable sub-class  of  such that we can find, for any function , a sequence of functions  in  for which

This condition is discussed in [27]. It is satisfied whenever  is right continuous.

Remark that condition (K.6) is satisfied whenever (K.1) holds, i.e.,  is of bounded variation on ; we refer the reader to Van der Vaart and Wellner [28] for details on entropy conditions (see also Pakes and Pollard [18], and Nolan and Pollard [17]). Condition (K.7) is satisfied whenever (K.2) holds, i.e.,  is right continuous; this condition is discussed in [28] (see also [5] and [11]).

From Theorem 1 in [11], whenever  is measurable and satisfies (K.3-4-6-7), and when  is bounded, we have, for each pair of sequences ,  such that , together with  and  as , with probability 1

(12)

Since , in view of (11) and (12), we obtain with probability 1.

(13)

This concludes the proof of the lemma.

Proof of Lemma 2.

Let  be the complement of  in  (i.e, ). We have

with

and

Term . Repeating the arguments above for the term , with the formal change of  by , we show that, for any ,

(14)

which implies

(15)

On the other hand, we know (see, e.g., [11]) that, since the density  is uniformly Lipschitz and continuous, we have for each sequence , with , as ,

(16)

Thus,

(17)

Term . It is obvious that

Thus,

(18)

Hence,

(19)

Thus, in view of (16), we get

(20)

Finally, in view of (17) and (20), we get

(21)

This completes the proof of the lemma.

Proof of Theorem 1. We have

Combining Lemmas 1 and 2, we obtain

This concludes the proof of the theorem.

Proof of Corollary 1. Remark that

Using Theorem 1, we have

and the Corollary 1 holds.

Proof of Corollary 2. A first-order Taylor expansion of  around  and  gives

Remark that from Theorem 1,

which in turn, implies that

Thus, for all

Consequently

and the Corollary 2 holds.

Proof of Theorem 2. Under conditions  and , and using a Taylor expansion of order , we get, for ,

where  and . Thus a straightforward application of the Lebesgue dominated convergence theorem gives, for  large enough,

Let  be a nonempty compact subset of the interior of . First, note that, from Corollary 3.1.2, p. 62, of Viallon [29] (see also [2], statement (4.16)), we have

(22)

Set, for all ,

(23)

(24)

by combining (22) and (24)

(25)

Let  be a sequence of nondecreasing nonempty compact subsets of the interior of  such that

Now, from (25), it is straightforward to observe that

The proof of Theorem 2 is completed.

Proof of Corollary 3. A direct application of Theorem 2 leads to Corollary 3.

Proof of Corollary 4. Here again, set, for all ,

A first-order Taylor expansion of  leads to

Using condition (F.1) ( is compactly supported),  is bounded away from zero on its support; thus, for  large enough, there exists  such that , for all  in the support of . From (23), we have

Hence,

by combining the last equation with (22)

The proof of Corollary 4 is complete.

5. Simulation Study

Summarizing the ideas and results of the previous sections, we propose to study the performance of the kernel estimators of the Hellinger (DH), Bhattacharyya (DB) and Kullback-Leibler (DK) measures and their uniform-in-bandwidth consistency.

Hellinger, Bhattacharyya and Kullback-Leibler divergences are defined respectively as follows:
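
The definitions below are given in their standard forms, as an assumed reading of the paper's conventions; the exact normalizations used in the paper may differ by constant factors.

\[
D_{H}(f,f_0)=\Bigl(1-\int\sqrt{f(x)\,f_0(x)}\,dx\Bigr)^{1/2},\qquad
D_{B}(f,f_0)=-\log\int\sqrt{f(x)\,f_0(x)}\,dx,\qquad
D_{K}(f,f_0)=\int f(x)\,\log\frac{f(x)}{f_0(x)}\,dx .
\]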

The asymptotic behavior, uniformly in the bandwidth, of the kernel-type estimators of these divergence criteria is assessed using Corollary 3 and Corollary 4, respectively.

We compute, for each chosen value of α, the expressions

where the corresponding bounds are defined by

We consider an experiment in which the DGP (data generating process) for the true distribution f is a mixture of two normal distributions, and the reference density f_0 is taken to be normal with mean 1 and variance 2.

The sample size varies from 10 to 1000 and, for each size, the divergence statistics and the corresponding bounds are evaluated.

In order to plot these statistics against the sample size, we need to perform three sets of experiments.
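
A minimal sketch of one such experiment is given below. The mixture weights and component parameters of the DGP are illustrative guesses (the exact values could not be recovered from the text), and the estimates are formed by plugging a Gaussian kernel density estimate into the standard formulas recalled above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]

# Assumed DGP: equal-weight mixture of N(0,1) and N(2,1); f0 = N(1, variance 2) as in the text.
def sample_mixture(n):
    comp = rng.integers(0, 2, size=n)
    return np.where(comp == 0, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

f_true = lambda x: 0.5 * stats.norm.pdf(x, 0.0, 1.0) + 0.5 * stats.norm.pdf(x, 2.0, 1.0)
f0 = stats.norm(1.0, np.sqrt(2.0)).pdf

def divergences(f, g, eps=1e-12):
    bc = np.sum(np.sqrt(f(grid) * g(grid))) * dx            # Bhattacharyya coefficient
    dh = np.sqrt(max(1.0 - bc, 0.0))                         # Hellinger distance
    db = -np.log(bc)                                         # Bhattacharyya divergence
    dk = np.sum(f(grid) * np.log(np.maximum(f(grid), eps) / np.maximum(g(grid), eps))) * dx  # KL
    return dh, db, dk

def kde(sample, h):
    s = np.asarray(sample)
    return lambda x: stats.norm.pdf((np.atleast_1d(x)[:, None] - s[None, :]) / h).mean(axis=1) / h

true_dh, true_db, true_dk = divergences(f_true, f0)
for n in (10, 20, 50, 100, 300, 500, 1000):
    x = sample_mixture(n)
    h = 1.06 * x.std() * n ** (-1.0 / 5.0)                   # rule-of-thumb bandwidth (assumption)
    dh, db, dk = divergences(kde(x, h), f0)
    print(n, round(abs(dh - true_dh), 4), round(abs(db - true_db), 4), round(abs(dk - true_dk), 4))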

The results are presented in tables 1-3 and figures 1-3.

Table 1. DH and the corresponding bound against the sample size n.

n       DH       Bound
10      0.14     0.0627
20      0.25     0.0627
50      0.07     0.0627
100     0.02     0.0627
300     0.01     0.0628
500     0.006    0.0629
1000    0.003    0.0629

Table 2. DB and the corresponding bound against the sample size n.

n       DB       Bound
10      0.048    0.040
20      0.037    0.030
50      0.007    0.027
100     0.007    0.025
300     0.005    0.024
500     0.003    0.024
1000    0.002    0.023

Table 3. DK and the corresponding bound against the sample size n.

n       DK       Bound
10      0.15     0.19
20      0.128    0.167
50      0.213    0.11
100     0.053    0.095
300     0.034    0.093
500     0.025    0.842
1000    0.007    0.825

Tables 1-3 show that the kernel-type estimators of the divergence measures converge rapidly to their pseudo-true values, and confirm our asymptotic results. They all show that the discrepancy between the estimated and the true divergence criterion converges rapidly to zero. Similarly, in Table 2 and Table 3, DB and DK converge, as expected, to zero, which is the mean of the asymptotic distribution when the estimated density is close to f.

Fig. 1. DH and the corresponding bound as a function of the sample size n (Hellinger divergence).

Fig. 2. DB and the corresponding bound as a function of the sample size n (Bhattacharyya divergence).

Fig. 3. DK and the corresponding bound as a function of the sample size n (Kullback-Leibler divergence).

Figures 1-3 show the plots for the Hellinger, Bhattacharyya and Kullback-Leibler divergences, respectively. The preceding comments on Tables 1-3 also apply to Figures 1-3. When dealing with the divergence error, it is much more revealing to graph DH, DB and DK against the sample size. These plots also confirm our asymptotic results. We note that, as the sample size increases, the divergence error converges, as it should, to zero. These plots provide a great deal of information about how the sample size affects the performance of these informational criteria.

6. Concluding Remarks and Future Works

In this paper, we are concerned with the problem of nonparametric estimation of a class of divergence measures. For this purpose, many estimators are available; the most recent ones are the estimates developed by Bouzebda and Elhattab [2]. We introduce an estimator that can be seen as a generalization of those previously suggested, in the sense that Bouzebda and Elhattab were only interested in the case of entropy, while we focus on the Rényi-α and Tsallis-α divergence measures. From our study, one can easily deduce Kullback-Leibler, Hellinger and Bhattacharyya nonparametric estimators. The results presented in this work are general, since the required conditions are fulfilled by a large class of densities. We mention that the estimator in (5) can be calculated by using a Monte-Carlo method under the given distribution f_0, and a practical choice of the bandwidth is  where  and .
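
As a rough sketch of this Monte-Carlo remark (an illustration under assumed densities, not the paper's prescription): the integral in (3) equals an expectation under f_0, so it can be approximated by averaging over draws from f_0.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def monte_carlo_I_alpha(f_n, alpha, m=100_000):
    """int f_n^alpha f0^(1-alpha) dx = E_{X ~ f0}[(f_n(X)/f0(X))^alpha], here with f0 = N(1, var 2)."""
    x = rng.normal(1.0, np.sqrt(2.0), size=m)
    return np.mean((f_n(x) / stats.norm.pdf(x, 1.0, np.sqrt(2.0))) ** alpha)

# A fixed density stands in for the kernel estimate f_n of Section 2 in this illustration.
f_n = lambda x: stats.norm.pdf(x, 0.0, 1.0)
print(monte_carlo_I_alpha(f_n, alpha=0.5))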

It would be interesting to enrich the results presented here by an additional uniformity in terms of  in the supremum appearing in all our theorems; this requires non-trivial mathematics and would go well beyond the scope of the present paper. Another direction of research is to obtain results in the case where the continuous distributions f and f_0 are both unknown. The problems and methods described here are all inherently univariate; a natural and useful multivariate extension appears through the use of copula functions.


References

  1. Bosq, D. and Lecoutre, J. P. (1987). Théorie de l’estimation fonctionnelle. Économie et Statistiques Avancées. Economica, Paris.
  2. Bouzebda, S. and Elhattab, I. (2011). Uniform-in-bandwidth consistency for kernel-type estimators of Shannon's entropy. Electronic Journal of Statistics, 5, 440-459.
  3. Csiszár, I. (1967). Information-type measures of differences of probability distributions and indirect observations. Studia Sci. Math. Hungarica, 2: 299-318.
  4. Deheuvels, P. (2000). Uniform limit laws for kernel density estimators on possibly unbounded intervals. In Recent advances in reliability theory (Bordeaux, 2000), Stat. Ind. Technol., pages 477-492. Birkhäuser Boston.
  5. Deheuvels, P. and Mason, D. M. (2004). General asymptotic confidence bands based on kernel-type function estimators. Stat. Inference Stoch. Process., 7(3), 225-277.
  6. Devroye, L. and Györfi, L. (1985). Nonparametric density estimation: the L1 view. Wiley Series in Probability and Mathematical Statistics: Tracts on Probability and Statistics. John Wiley & Sons Inc., New York.
  7. Devroye, L. and Lugosi, G. (2001). Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York.
  8. Devroye, L. and Wise, G. L. (1980). Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math., 38(3),480-488.
  9. Dmitriev, J. G. and Tarasenko, F. P. (1973). The estimation of functionals of a probability density and its derivatives. Teor. Verojatnost. i Primenen., 18, 662-668.
  10. Einmahl, U. and Mason, D. M. (2000). An empirical process approach to the uniform consistency of kernel-type function estimators. J. Theoret. Probab., 13 (1), 1-37.
  11. Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist., 33(3), 1380-1403.
  12. Giné, E. and Guillou, A. (2002). Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist., 38, 907-921.
  13. Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes (with discussion). Ann. Probab., 12, 929-998.
  14. Johnson, D. H., Gruner, C. M., Baggerly, K., and Seshagiri, C. (2001). Information-theoretic analysis of neural coding. Journal of Computational Neuroscience.
  15. Krishnamurthy A., Kandasamy K., Póczos B., and Wasserman L., (2014). Nonparametric Estimation of Rényi Divergence and Friends. http://www.arxiv.org/1402.2966v2.
  16. Ngom, P., Dhaker, H., Mendy, P., and Deme, E. Generalized divergence criteria for model selection between random walk and AR(1) model. https://hal.archives-ouvertes.fr/hal-01207476v1
  17. Nolan, D. and Pollard, D. (1987). U-processes: rates of convergence. Ann. Statist., 15(2), 780-799.
  18. Pakes, A. and Pollard, D. (1989). Simulation and the asymptotics of optimization estimators. Econometrica, 57(5), 1027-1057.
  19. Parzen, E. (1962). On estimation of a probability density function and mode. Ann. Math. Statist., 33, 1065-1076.
  20. Pardo, L.(2005) Statistical inference based on divergence measures. CRC Press.
  21. Pluim B M, Safran M. From breakpoint to advantage. description, treatment, and prevention of all tennis injuries. Vista: USRSA, 2004.
  22. Prakasa Rao, B. L. S. (1983). Nonparametric functional estimation. Probability and Mathematical Statistics. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York.
  23. Póczos, B. and Schneider, J. On the estimation of alpha-divergences. CMU, Auton Lab Technical Report, http://www.cs.cmu.edu/bapoczos/articles/poczos11alphaTR.pdf.
  24. Póczos, B., Xiong, L., Sutherland, D. J., and Schneider, J. (2012). Nonparametric kernel estimators for image classification. In IEEE Conference on Computer Vision and Pattern Recognition.
  25. Rényi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability.
  26. Rényi, A. (1970). Probability Theory. North-Holland Publishing Company, Amsterdam.
  27. Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. Ann. Math. Statist., 27, 832-837.
  28. Van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
  29. Viallon, V. (2006). Processus empiriques, estimation non paramétrique et données censurées. Ph.D. thesis, Université Paris 6.
  30. Villmann, T. and Haase, S. (2010). Mathematical aspects of divergence based vector quantization using Fréchet derivatives. University of Applied Sciences Mittweida.
