American Journal of Theoretical and Applied Statistics
Volume 4, Issue 6, November 2015, Pages: 602-609

Discriminant Analysis Procedures Under Non-optimal Conditions for Binary Variables

I. Egbo

Department of Mathematics, Alvan Ikoku University of Education, Owerri, Nigeria

To cite this article:

I. Egbo. Discriminant Analysis Procedures Under Non-optimal Conditions for Binary Variables. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 6, 2015, pp. 602-609. doi: 10.11648/j.ajtas.20150406.32


Abstract: The performance of four discriminant analysis procedures for the classification of observations from unknown populations was examined by Monte Carlo methods. The procedures examined were the Fisher linear discriminant function, the quadratic discriminant function, a polynomial discriminant function, and the A-B linear procedure designed for use in situations where the covariance matrices are unequal. Each procedure was observed under conditions of equal sample sizes, equal covariance matrices, and conditions in which the samples were drawn from populations that have a multivariate normal distribution. When the population covariance matrices were equal, or not greatly different, the quadratic discriminant function performed similarly to the linear procedures. In all cases the polynomial discriminant function performed poorest, and the linear discriminant function performed much better than the other procedures. All of the procedures were greatly affected by non-normality and tended to make many more errors in the classification of one group than the other, suggesting that data be standardized when non-normality is suspected.

Keywords: Apparent Error Rates, Fisher’s Linear Discriminant, Quadratic Discriminant Function, A-B Discriminant Function, Polynomial Discriminant Function


1. Introduction

Many practical problems can be reduced to the assignment of various objects to different classes. For example, in the case of medical diagnosis, the task is to recognize the pathology of a given patient: the objects correspond to the patients and the classes to the various pathologies. In the field of economics, a bank wants to know whether a customer applying for a loan is a good or a bad risk, on the basis of several variables such as age, profession, past loyalty, and the size of the credit requested. A review of these applications appears in [16]. In assignment problems in biomedical research, one or more of these techniques is often used. The assumptions underlying these techniques are not always evident to the user, nor are the consequences of their violation. The assumptions include multivariate normality, common covariance matrices and correct assignment of the initial groups [17], [18] and [19]. While a good deal is known in the two-group situation, the robustness of these procedures under non-optimal conditions for binary variables is essentially unknown. The purpose of this paper is to compare and delineate these problems systematically and to suggest useful areas of research.

The problem of classifying an individual into one of two concerned groups (called populations) arises in many areas, typically in anthropology, education, psychology, medical diagnosis, biology, engineering, etc. An anthropometrician may wish to identify ancient human remains with two different racial groups or two different time periods by measuring certain skull characters [2]. A plant breeder discriminates a desired from an undesirable species by observing some heritable characters [14]. A company hires or rejects an applicant frequently based on a certain measurement. Similarly, a college accepts or denies a prospective student usually based on his entrance examination scores. In a hospital, a patient may be diagnosed and classified into a certain potential disease group by a battery of tests. Usually it is assumed that there are two populations, say $\pi_1$ and $\pi_2$, and that the individual to be classified comes from either $\pi_1$ or $\pi_2$; furthermore, it is assumed that from previous experiments or records we have in our possession the characteristic measurements of $n_1$ individuals who were known to belong to $\pi_1$, and of $n_2$ individuals who were known to belong to $\pi_2$. Based on the available data obtained from the previous $n_1 + n_2$ individuals and the corresponding characteristic measurements of a new individual, we would like to classify the new individual into either $\pi_1$ or $\pi_2$ by using a certain criterion. The case of more than two populations will not be considered in this paper.

In this inferential setting, the researcher can commit one of the following errors: an object from $\pi_1$ may be misclassified into $\pi_2$; likewise, an object from $\pi_2$ may be misclassified into $\pi_1$. If misclassification occurs, a loss is suffered. Let $c(i\mid j)$ be the cost of misclassifying an object into $\pi_i$ given that it is from $\pi_j$. For the two-population setting, $c(2\mid 1)$ is the cost of misclassifying an object into $\pi_2$ given that it is from $\pi_1$, and $c(1\mid 2)$ is the cost of misclassifying an object into $\pi_1$ given that it is from $\pi_2$. The relative magnitude of the loss ratio $c = c(1\mid 2)/c(2\mid 1)$ depends on the case in question: for example, failure to detect an early cancer in a patient is costlier than stating that a patient has cancer and discovering otherwise.

2. Classification Procedures

2.1. Fisher’s Linear Discriminant Function (FLDF Rules)

The linear discriminant function for discrete variables is given by

$$L(x) = \sum_{i=1}^{r}\sum_{j=1}^{r} s^{ij}\,(\bar{x}_{1i} - \bar{x}_{2i})\,x_j \;-\; \frac{1}{2}\sum_{i=1}^{r}\sum_{j=1}^{r} s^{ij}\,(\bar{x}_{1i} - \bar{x}_{2i})(\bar{x}_{1j} + \bar{x}_{2j}) \qquad (1)$$

where the $s^{ij}$ are the elements of the inverse of the pooled sample covariance matrix and $\bar{x}_{1i}$, $\bar{x}_{2i}$ are the elements of the sample means in $\pi_1$ and $\pi_2$ respectively. The classification rule obtained using this estimate is: classify an item with response pattern $x$ into $\pi_1$

$$\text{if } L(x) \ge 0, \text{ and into } \pi_2 \text{ otherwise.} \qquad (2)$$
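As a concrete illustration, the following Python sketch implements the sample rule of Equations (1)-(2) in matrix form, with the midpoint cut-off that corresponds to equal priors. The generated binary training data and all variable names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of Fisher's linear discriminant rule, Equations (1)-(2):
# classify a response pattern x into population 1 when L(x) >= 0.
import numpy as np

def fisher_ldf(X1, X2):
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    b = np.linalg.solve(S, m1 - m2)       # coefficients S^{-1}(xbar1 - xbar2)
    c = 0.5 * b @ (m1 + m2)               # midpoint cut-off (equal priors)
    return lambda x: b @ x - c

# illustrative binary training data (assumed, not from the paper)
rng = np.random.default_rng(0)
X1 = rng.binomial(1, 0.7, size=(40, 3)).astype(float)
X2 = rng.binomial(1, 0.3, size=(40, 3)).astype(float)
L = fisher_ldf(X1, X2)
print(1 if L(np.array([1.0, 1.0, 0.0])) >= 0 else 2)
```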

2.2. The Quadratic Discriminant Function

When an observation vector $x$ is drawn from a MVN distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$, the MVN density function $f_i(x)$ can be expressed as:

$$f_i(x) = (2\pi)^{-r/2}\,|\Sigma_i|^{-1/2}\exp\left\{-\tfrac{1}{2}(x-\mu_i)'\Sigma_i^{-1}(x-\mu_i)\right\} \qquad (3)$$

In the case of two groups, an individual is classified as belonging to population 1 if $p_1 f_1(x) > p_2 f_2(x)$, that is, if $f_1(x)/f_2(x) > p_2/p_1$. Alternatively, an individual is assigned to population 2 if $p_1 f_1(x) < p_2 f_2(x)$, that is, if $f_1(x)/f_2(x) < p_2/p_1$, where $p_1$ and $p_2$ are the proportions of individuals from the two groups in the populations [7]. When the two groups have a common covariance matrix, $\Sigma_1 = \Sigma_2 = \Sigma$, and mean vectors $\mu_1$ and $\mu_2$, the above rule becomes

$$\frac{f_1(x)}{f_2(x)} = \exp\left\{(\mu_1-\mu_2)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1+\mu_2)\right\} > \frac{p_2}{p_1} \qquad (4)$$

and, taking logarithms, the rule is to assign an individual to population 1 if

$$(\mu_1-\mu_2)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1+\mu_2) \;\ge\; \log(p_2/p_1) \qquad (5)$$

and to group 2 otherwise. The sample analogue of the above equation is

$$(\bar{x}_1-\bar{x}_2)'S^{-1}x - \tfrac{1}{2}(\bar{x}_1-\bar{x}_2)'S^{-1}(\bar{x}_1+\bar{x}_2) \;\ge\; \log(p_2/p_1) \qquad (6)$$

and the coefficients $(\bar{x}_1-\bar{x}_2)'S^{-1}$ are seen to be identical to Fisher’s result for the LDF.

When the covariance matrices are unequal and cannot be pooled, but the population distributions are multivariate normal, the classification rule has the form: assign an individual to population 1 if

$$Q(x) \;\ge\; \log(p_2/p_1) \qquad (7)$$

where

$$Q(x) = -\tfrac{1}{2}\log\frac{|\Sigma_1|}{|\Sigma_2|} - \tfrac{1}{2}(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1) + \tfrac{1}{2}(x-\mu_2)'\Sigma_2^{-1}(x-\mu_2) \qquad (8)$$

In these cases, the discriminant function is quadratic, since the term $\tfrac{1}{2}x'(\Sigma_2^{-1}-\Sigma_1^{-1})x$ is still present [7]. From the above, with $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$ estimated by their respective sample mean vectors $\bar{x}_1$, $\bar{x}_2$ and covariance matrices $S_1$ and $S_2$, the sample analogue of $Q(x)$ is

$$\hat{Q}(x) = -\tfrac{1}{2}\log\frac{|S_1|}{|S_2|} - \tfrac{1}{2}(x-\bar{x}_1)'S_1^{-1}(x-\bar{x}_1) + \tfrac{1}{2}(x-\bar{x}_2)'S_2^{-1}(x-\bar{x}_2) \qquad (9)$$

In each of the conditions of the present study the proportions of each group in the population were assumed to be equal to each other and not proportional to sample size, since the true proportions are not usually known in most areas of psychological research. When the population proportions are equal, $\log(p_2/p_1) = 0$, and the quadratic decision rule is then to classify an individual into population 1 if $\hat{Q}(x) > 0$ or into population 2 if $\hat{Q}(x) < 0$.
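The sample quadratic rule can be sketched directly from Equation (9). The following fragment is a minimal illustration under the equal-prior convention of the study, so that the cut-off $\log(p_2/p_1)$ is zero; the names are illustrative.

```python
# Minimal sketch of the sample quadratic discriminant Q-hat(x) of Equation (9);
# classify into population 1 when Q(x) > 0 (equal priors assumed).
import numpy as np

def qdf(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    # 0.5 * log(|S1| / |S2|), computed stably via slogdet
    logdet = 0.5 * (np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1])
    def Q(x):
        d1, d2 = x - m1, x - m2
        return -logdet - 0.5 * d1 @ S1i @ d1 + 0.5 * d2 @ S2i @ d2
    return Q
```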

2.3. The A-B Discriminant Function

[1] proposed a linear discriminant function of the form $y = b'x$, with $b$ chosen so that $x$ is classified as from population 1 if $b'x > c$ and from population 2 if $b'x \le c$, where $c$ is also suitably determined. With this procedure, the misclassification probabilities are:

$$P(2\mid 1) = 1 - F(y_1) \quad\text{and}\quad P(1\mid 2) = 1 - F(y_2) \qquad (10)$$

where $F$ is the cumulative distribution function of a standard normal variable. The $y_1$ and $y_2$ are determined by

$$y_1 = \frac{b'\mu_1 - c}{(b'\Sigma_1 b)^{1/2}} \quad\text{and}\quad y_2 = \frac{c - b'\mu_2}{(b'\Sigma_2 b)^{1/2}} \qquad (11)$$

where $\mu_1$ and $\mu_2$ are the means of population 1 and population 2. Now $y_2$ can be expressed as

$$y_2 = \frac{b'(\mu_1 - \mu_2) - y_1\,(b'\Sigma_1 b)^{1/2}}{(b'\Sigma_2 b)^{1/2}} \qquad (12)$$

The vector $b$ is then chosen to maximize $y_2$ for a given $y_1$. By differentiating $y_2$ with respect to $b$, it can be shown that the solution consists of solving the following equation in $b$ and a scalar $t$, $0 \le t \le 1$:

$$b = [\,t\Sigma_1 + (1-t)\Sigma_2\,]^{-1}(\mu_1 - \mu_2) \qquad (13)$$

The solution to these equations is obtained by a trial-and-error procedure, and $c$ is then obtained by:

$$c = b'\mu_1 - t\,b'\Sigma_1 b \qquad (14)$$

Now $y_1$ can be obtained from

$$y_1 = t\,(b'\Sigma_1 b)^{1/2} \qquad (15)$$

[1] also considered an alternative method for the case when the two misclassification probabilities are equal, i.e. $P(2\mid 1) = P(1\mid 2)$, so that $y_1 = y_2$. In this case, $b$ and $t$ are found from:

$$b = [\,t\Sigma_1 + (1-t)\Sigma_2\,]^{-1}(\mu_1 - \mu_2), \qquad t\,(b'\Sigma_1 b)^{1/2} = (1-t)\,(b'\Sigma_2 b)^{1/2} \qquad (16)$$

The determination of the value of $t$ was accomplished by using the result due to [3], in which the admissible values of $t$ were expressed through

$$\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_r) \qquad (17)$$

where the $\lambda_i$ are the roots of the determinantal equation $|\Sigma_1 - \lambda\Sigma_2| = 0$; the ratio $t/(1-t)$ must then lie between the minimum and maximum roots of this characteristic equation.

In the present study the optimal value of $t$ was approximated by evaluating the rule with $t/(1-t)$ equal to the minimum and maximum characteristic roots and computing the vector $b$ in each case from Equation (13). The value of $c$ was then calculated from Equation (14), and an observation was assigned to population 1 if $b'x > c$ or to population 2 if $b'x \le c$. In this manner, the interval between the two roots was successively bisected, and for each value of $t$ the proportion of correct classifications was calculated. The interval was bisected a maximum of five times or until classification did not improve. The resultant discriminant function was then applied to the cross-validation sample, and the proportion of correct classifications was calculated.
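The search just described can be sketched as follows: $b$ is computed from Equation (13) for a trial value of $t$, $c$ from Equation (14), and $t$ is refined by bisecting on the training-sample classification accuracy. Searching $t$ directly over (0, 1) rather than over the interval of characteristic roots, and the simple accuracy-based bisection, are simplifying assumptions of this sketch.

```python
# Sketch of the A-B (Anderson-Bahadur) procedure with a bisection search on t.
import numpy as np

def ab_rule(m1, m2, S1, S2, t):
    b = np.linalg.solve(t * S1 + (1 - t) * S2, m1 - m2)   # Equation (13)
    c = b @ m1 - t * (b @ S1 @ b)                         # Equation (14)
    return b, c                    # classify into population 1 if b'x > c

def ab_fit(X1, X2, n_bisect=5):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    def accuracy(t):
        b, c = ab_rule(m1, m2, S1, S2, t)
        correct = np.sum(X1 @ b > c) + np.sum(X2 @ b <= c)
        return correct / (len(X1) + len(X2))
    lo, hi = 0.01, 0.99
    for _ in range(n_bisect):      # bisect the interval at most five times
        mid = 0.5 * (lo + hi)
        if accuracy(hi) > accuracy(lo):
            lo = mid
        else:
            hi = mid
    return ab_rule(m1, m2, S1, S2, 0.5 * (lo + hi))
```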

2.4. The Polynomial Discriminant Function

In this case, the discriminant function was constructed by estimating the probability density function for each sample directly from the observed data, as described in [15]. This was accomplished by expanding the estimate $\hat{f}_i(x)$ of the probability density function of the $i$th population in a series. Tou and Gonzalez show that if it is required that the estimate of the probability density function minimize a mean-square error function defined as

$$R = \int_\Omega \frac{\left[f_i(x) - \hat{f}_i(x)\right]^2}{w(x)}\,dx \qquad (18)$$

where $w$ is a weighting function, then $\hat{f}_i(x)$ may be expanded in the series

$$\hat{f}_i(x) = w(x)\sum_{j=1}^{m} c_{ij}\,\phi_j(x) \qquad (19)$$

where the $c_{ij}$ are coefficients to be determined, and the $\phi_j(x)$ are a set of specified basis functions.

A set of univariate basis functions associated with the normal distribution, from which multivariate basis functions can be obtained, are the Hermite polynomials $H_k(x)$, generated by the recursive relation

$$H_{k+1}(x) = x\,H_k(x) - k\,H_{k-1}(x) \qquad (20)$$

where $H_0(x) = 1$ and $H_1(x) = x$. The first few Hermite polynomials are:

$$H_0(x) = 1,\quad H_1(x) = x,\quad H_2(x) = x^2 - 1,\quad H_3(x) = x^3 - 3x \qquad (21)$$
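The recursion in Equation (20) is straightforward to evaluate; the short snippet below (probabilists' convention, as in Equation (21)) reproduces the listed polynomials at a test point.

```python
# Hermite polynomials by the recursion H_{k+1}(x) = x H_k(x) - k H_{k-1}(x).
def hermite(k, x):
    h_prev, h = 1.0, x            # H_0(x) = 1, H_1(x) = x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

print(hermite(2, 2.0))   # H_2(2) = 2^2 - 1 = 3.0
print(hermite(3, 2.0))   # H_3(2) = 2^3 - 3*2 = 2.0
```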

Substituting the expansion of $\hat{f}_i(x)$ into the mean-square error function yields

$$R = \int_\Omega \frac{1}{w(x)}\left[f_i(x) - w(x)\sum_{j=1}^{m} c_{ij}\,\phi_j(x)\right]^2 dx \qquad (22)$$

and minimizing $R$ with respect to the coefficients $c_{ij}$ yields

$$c_{ij} = \int_\Omega \phi_j(x)\,f_i(x)\,dx = E\{\phi_j(x)\} \qquad (23)$$

The right side of this equation is the definition of the expected value of the function $\phi_j(x)$ and may be approximated by the sample average

$$E\{\phi_j(x)\} \approx \frac{1}{n_i}\sum_{k=1}^{n_i}\phi_j(x_{ik}) \qquad (24)$$

Since the basis functions are orthonormal and are chosen orthogonal with respect to the weighting function, the coefficients may be determined from

$$\hat{c}_{ij} = \frac{1}{n_i}\sum_{k=1}^{n_i}\phi_j(x_{ik}) \qquad (25)$$

and the resultant density estimate may be obtained from

$$\hat{f}_i(x) = w(x)\sum_{j=1}^{m}\hat{c}_{ij}\,\phi_j(x) \qquad (26)$$

By using Bayes’ formula

$$P(\pi_i \mid x) = \frac{p_i\,f_i(x)}{\sum_{l} p_l\,f_l(x)} \qquad (27)$$

where $p_i$ is the prior probability of the $i$th population, the discriminant functions for this problem are then given by:

$$d_1(x) = p_1\,\hat{f}_1(x) \quad\text{and}\quad d_2(x) = p_2\,\hat{f}_2(x) \qquad (28)$$

$$d(x) = d_1(x) - d_2(x) \qquad (29)$$

and if $p_1 = p_2$, the decision boundary is given by $\hat{f}_1(x) - \hat{f}_2(x) = 0$.

In the present study a two-dimensional set of orthogonal functions was obtained by forming pairwise combinations of the one-dimensional functions. Six terms were used to approximate the density function and were constructed as follows:

$$\phi_1 = 1,\;\; \phi_2 = H_1(x_1),\;\; \phi_3 = H_1(x_2),\;\; \phi_4 = H_2(x_1),\;\; \phi_5 = H_1(x_1)H_1(x_2),\;\; \phi_6 = H_2(x_2) \qquad (30)$$

The set of orthogonal functions for the six-variable case was constructed in the same manner as for the bivariate case, by forming products of one-dimensional Hermite polynomials. In order for the estimates of the density functions to be polynomials of degree two in all the variables, 28 terms were constructed as follows: the constant term, the six first-degree terms $H_1(x_i)$, the six second-degree terms $H_2(x_i)$, and the fifteen cross-product terms $H_1(x_i)H_1(x_j)$, $i < j$. (31)

The vector of coefficients $c$ was then computed for each sample from Equation (25), and the polynomial estimates of the density functions were constructed as in Equation (26). The two estimates of the density functions were then subtracted to form the polynomial discriminant function, which was then applied to the observations in each of the original and cross-validation samples. Finally, the proportion of correct classifications was calculated.
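For the bivariate case, the whole construction can be sketched compactly: the six basis terms of Equation (30), coefficients from the sample averages of Equation (25), densities from Equation (26), and the difference of the two density estimates as the discriminant. The standard normal weight function and the normalization of the second-degree terms are assumptions made here so that the sample-average coefficient formula applies; they are not spelled out in the source.

```python
# Polynomial discriminant for the bivariate case: basis terms of Equation (30)
# (second-degree terms scaled by 1/sqrt(2) for orthonormality, an added
# assumption), coefficients by Equation (25), densities by Equation (26).
import numpy as np

SQRT2 = np.sqrt(2.0)

def basis(x):
    x1, x2 = x
    return np.array([1.0, x1, x2,
                     (x1 * x1 - 1.0) / SQRT2,      # H_2(x1), normalized
                     x1 * x2,                      # H_1(x1) H_1(x2)
                     (x2 * x2 - 1.0) / SQRT2])     # H_2(x2), normalized

def fit_coeffs(X):
    # Equation (25): coefficients as sample averages of the basis functions
    return np.mean([basis(x) for x in X], axis=0)

def density(c, x):
    # Equation (26) with a standard bivariate normal weight w(x) (assumed)
    w = np.exp(-0.5 * float(x @ x)) / (2.0 * np.pi)
    return w * float(c @ basis(x))

def poly_discriminant(X1, X2):
    c1, c2 = fit_coeffs(X1), fit_coeffs(X2)
    return lambda x: density(c1, x) - density(c2, x)   # > 0 -> population 1
```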

2.5. Testing Adequacy of Discriminant Coefficients

Consider the discrimination problem between two multivariate normal populations with means $\mu_1$, $\mu_2$ and common covariance matrix $\Sigma$. The coefficients of the maximum likelihood discriminant (MLD) function $\beta'x$ are given by $\beta = \Sigma^{-1}(\mu_1 - \mu_2)$; in practice, of course, the parameters are estimated by

$$\hat{\beta} = S^{-1}(\bar{x}_1 - \bar{x}_2) \qquad (32)$$

Letting $d = \bar{x}_1 - \bar{x}_2$, the coefficients of the sample MLDF are given by $\hat{\beta} = S^{-1}d$.

A test of the hypothesis $H_0$ that the coefficients corresponding to the last $r - q$ variables are zero, using the sample Mahalanobis distance $D^2$, has been proposed by [12]. This test uses the statistic:

$$F = \frac{(n_1 + n_2 - r - 1)}{(r - q)}\cdot\frac{c^2\,(D_r^2 - D_q^2)}{(n_1 + n_2 - 2) + c^2 D_q^2}, \qquad c^2 = \frac{n_1 n_2}{n_1 + n_2} \qquad (33)$$

which, under the null hypothesis, has an $F$ distribution with $r - q$ and $n_1 + n_2 - r - 1$ degrees of freedom; we reject $H_0$ for large values of this statistic.
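A short numerical sketch of the statistic in Equation (33), under the Rao-type form reconstructed above (the original equation is not legible in the source, so this form is an assumption); `D2_r` and `D2_q` denote the sample Mahalanobis distances computed from all r variables and from the retained q variables respectively.

```python
# F statistic of Equation (33) under the assumed Rao-type form; compare the
# result with the F(r - q, n1 + n2 - r - 1) reference distribution and reject
# H0 for large values.
def rao_F(D2_r, D2_q, n1, n2, r, q):
    c2 = n1 * n2 / (n1 + n2)                      # c^2 = n1 n2 / (n1 + n2)
    num = (n1 + n2 - r - 1) * c2 * (D2_r - D2_q)
    den = (r - q) * ((n1 + n2 - 2) + c2 * D2_q)
    return num / den

print(rao_F(D2_r=2.1, D2_q=1.8, n1=50, n2=50, r=5, q=3))
```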

2.6. Evaluation of Classification Functions

One important way of judging the performance of any classification procedure is to calculate its error rates, or misclassification probabilities [13]. When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease. Because parent populations are rarely known, we shall concentrate on the error rates associated with the sample classification functions. Once a classification function is constructed, a measure of its performance in future samples is of interest. The total probability of misclassification (TPM) is given as:

$$\mathrm{TPM} = p_1\sum_{x\in R_2} f_1(x) + p_2\sum_{x\in R_1} f_2(x) \qquad (34)$$

The smallest value of this quantity, attained by a judicious choice of the classification regions $R_1$ and $R_2$, is called the optimum error rate (OER):

$$\mathrm{OER} = \min_{R_1,\,R_2}\ \mathrm{TPM}$$

2.7. Probability of Misclassification

In constructing a procedure of classification, it is desired to minimize, on average, the bad effects of misclassification [10], [13] and [11]. Suppose we have an item with response pattern $x$ from either $\pi_1$ or $\pi_2$. We think of an item as a point in an $r$-dimensional space. We partition the space $R$ into regions $R_1$ and $R_2$ which are mutually exclusive. If the item falls in $R_1$, we classify it as coming from $\pi_1$, and if it falls in $R_2$ we classify it as coming from $\pi_2$. In following a given classification procedure, the researcher can make two kinds of errors in classification: an item actually from $\pi_1$ can be classified as coming from $\pi_2$, and an item from $\pi_2$ can be classified as coming from $\pi_1$. We need to know the relative undesirability of these two kinds of errors. Let the prior probability that an observation comes from $\pi_1$ be $p_1$, and from $\pi_2$ be $p_2$. Let the probability mass function of $\pi_1$ be $f_1(x)$ and that of $\pi_2$ be $f_2(x)$, and let the region of classifying into $\pi_i$ be $R_i$. Then the probability of correctly classifying an observation that is actually from $\pi_1$ into $\pi_1$ is:

$$P(1\mid 1) = \sum_{x\in R_1} f_1(x) \qquad (35)$$

Similarly, the probability of misclassifying an observation from $\pi_1$ into $\pi_2$ is

$$P(2\mid 1) = \sum_{x\in R_2} f_1(x) \qquad (36)$$

Likewise, the probability of correctly classifying an observation from $\pi_2$ into $\pi_2$ is $P(2\mid 2) = \sum_{x\in R_2} f_2(x)$, and the probability of misclassifying an item from $\pi_2$ into $\pi_1$ is

$$P(1\mid 2) = \sum_{x\in R_1} f_2(x) \qquad (37)$$

The total probability of misclassification using the rule is

$$\mathrm{TPM}(R) = p_1\,P(2\mid 1) + p_2\,P(1\mid 2) = p_1\sum_{x\in R_2} f_1(x) + p_2\sum_{x\in R_1} f_2(x) \qquad (38)$$
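For binary variables the sums in Equation (38) run over finitely many response patterns, so the TPM can be evaluated exactly by enumeration, as in this sketch; the independence of the r Bernoulli variables is an assumption made only to keep the probability mass function simple.

```python
# Equation (38) evaluated exactly for binary data by enumerating all 2^r
# response patterns.
import numpy as np
from itertools import product

def pmf(x, p):
    """P(X = x) for independent Bernoulli variables with parameter vector p."""
    x, p = np.asarray(x, dtype=float), np.asarray(p, dtype=float)
    return float(np.prod(np.where(x == 1, p, 1 - p)))

def tpm(rule, prior1, p1, p2):
    """rule(x) returns 1 or 2; p1, p2 are the Bernoulli parameter vectors."""
    total = 0.0
    for x in product([0, 1], repeat=len(p1)):
        xv = np.array(x, dtype=float)
        if rule(xv) == 2:
            total += prior1 * pmf(x, p1)           # x in R2 but from pi_1
        else:
            total += (1 - prior1) * pmf(x, p2)     # x in R1 but from pi_2
    return total

# e.g. a simple majority rule on r = 3 binary variables
print(tpm(lambda x: 1 if x.sum() >= 2 else 2, 0.5, [0.7] * 3, [0.3] * 3))
```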

In order to determine the performance of a classification rule $R$ in the classification of future items, we compute the total probability of misclassification, known as the error rate. [7] defined the following types of error rates.

i. Optimum error rate: the error rate for the optimum classification rule, obtained when the parameters of the distributions are known.

ii. Actual error rate: the error rate for the classification rule as it will perform in future samples.

iii. Expected actual error rate: the expected error rate for classification rules based on samples of size $n_1$ from $\pi_1$ and $n_2$ from $\pi_2$.

iv. Plug-in estimate of the error rate: obtained by using the estimated parameters for $f_1(x)$ and $f_2(x)$.

v. Apparent error rate: the fraction of items in the initial samples which are misclassified by the classification rule.

Table 1. Confusion matrix of the apparent error rate.

Actual population   Classified as π1   Classified as π2   Total
π1                  n1C                n1M                n1
π2                  n2M                n2C                n2

Here $n_{1C}$ and $n_{2C}$ are the numbers of items correctly classified from $\pi_1$ and $\pi_2$, while $n_{1M}$ and $n_{2M}$ are the numbers misclassified. The table above is called the confusion matrix, and the apparent error rate is given by

$$\mathrm{APER} = \frac{n_{1M} + n_{2M}}{n_1 + n_2} \qquad (39)$$

[6] called the second error rate the actual error rate and the third the expected actual error rate. Hills showed that the actual error rate is greater than the optimum error rate, which in turn is greater than the expectation of the plug-in estimate of the error rate. [9] proved a similar inequality. An algebraic expression for the exact bias of the apparent error rate of the sample multinomial discriminant rule was obtained by [5], who tabulated it under various combinations of the sample sizes, the number of multinomial cells and the cell probabilities. Their results demonstrated that the bound described above is generally loose.
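The apparent error rate is immediate to compute from the confusion matrix; a small sketch follows, with the rows taken as the true populations and the columns as the classified populations (the counts are invented for illustration).

```python
# APER of Equation (39): off-diagonal counts over the total sample size.
import numpy as np

def aper(confusion):
    confusion = np.asarray(confusion, dtype=float)
    n1M, n2M = confusion[0, 1], confusion[1, 0]     # misclassified counts
    return (n1M + n2M) / confusion.sum()

print(aper([[18, 2],
            [3, 17]]))    # (2 + 3) / 40 = 0.125
```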

3. The Simulation Experiments and Results

The four classification procedures are evaluated at each configuration of n, r and d, where the configurations are all possible combinations of n = 40, 60, 80, 100, 200; r = 3, 4, 5; and d = 0.1, 0.2, 0.3, 0.4. A simulation experiment which generates the data and evaluates the procedures is now described.

(i) A training data set of size n is generated via an R program, where $n_1$ observations are sampled from $\pi_1$, which has a multivariate Bernoulli distribution with input parameter $p_1$, and $n_2$ observations are sampled from $\pi_2$, which is multivariate Bernoulli with input parameter $p_2$. These samples are used to construct the rule for each procedure, and the probability of misclassification for each procedure is estimated by the plug-in rule, or the confusion matrix, in the sense of the full multinomial.

(ii) The likelihood ratios are used to define classification rules. The plug-in estimates of error rates are determined for each of the classification rules.

(iii) Steps (i) and (ii) are repeated 1000 times, and the mean plug-in error rate and variance over the 1000 trials are recorded. The method of estimation used here is called the resubstitution method. A minimal sketch of this simulation loop is given below.
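The sketch assumes independent Bernoulli coordinates with success probabilities 0.5 ± d; the paper does not state how the input parameters are derived from d, so this parametrization is an assumption. `fisher_ldf` is the sketch given in Section 2.1; any of the four procedures can be substituted.

```python
# Steps (i)-(iii): generate training data, build a rule, record its apparent
# (resubstitution) error rate, and average over 1000 trials.
import numpy as np

def one_trial(rng, n1, n2, r, d):
    p1 = np.full(r, 0.5 + d)      # assumed parametrization of the separation d
    p2 = np.full(r, 0.5 - d)
    X1 = rng.binomial(1, p1, size=(n1, r)).astype(float)
    X2 = rng.binomial(1, p2, size=(n2, r)).astype(float)
    L = fisher_ldf(X1, X2)        # plug in any of the four procedures here
    errors = sum(L(x) < 0 for x in X1) + sum(L(x) >= 0 for x in X2)
    return errors / (n1 + n2)     # apparent error rate by resubstitution

rng = np.random.default_rng(1)
rates = [one_trial(rng, 20, 20, 3, 0.2) for _ in range(1000)]
print(np.mean(rates), np.var(rates))    # mean plug-in error and variance
```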

The following tables display one set of the results obtained.

Table 2(a). Mean apparent error rates.

Sample sizes A-B Polynomial LDA Quadratic
40 0.157125 0.110074 0.110787 0.204512
60 0.161900 0.127855 0.127958 0.207491
100 0.163290 0.143526 0.143680 0.209940
140 0.162967 0.149837 0.150407 0.209826
200 0.162565 0.156384 0.155280 0.211542

Table 2(b). Actual Error rates.

Sample sizes A-B Polynomial LDA Quadratic
40 0.040271 0.052706 0.037112 0.041686
60 0.032751 0.042691 0.031487 0.033007
100 0.027786 0.037015 0.026152 0.027125
140 0.022462 0.031623 0.022112 0.024082
200 0.017981 0.026657 0.018218 0.019071

Tables 2(a) and 2(b) present the mean apparent error rates and the actual error rates for the classification rules under different parameter values. The mean apparent error rate increases with increasing sample size, while the actual error rate decreases with increasing sample size. From the analysis, the linear discriminant function is ranked first, followed by the A-B discriminant function and the quadratic function, with the polynomial discriminant function last.

Table 3. Performance of classification rules by rank.

Classification Rule Performance/rank
Linear Discriminant 1
A-B Discriminant 2
Quadratic function 3
Polynomial Discriminant function 4

4. Discussion and Conclusion

The results in Table 2(b) indicate that, in general, with samples drawn from MVN populations with equal covariance matrices, the Fisher LDF, the A-B procedure, the quadratic discriminant function (QDF) and the polynomial discriminant function (PDF) performed similarly; but as the degree of heterogeneity increases (not shown in the table), the QDF outperformed the other procedures. These results are consistent with those of [8] and [4], since it can be observed that the Fisher LDF performed well, relative to the QDF, for mild departures from homogeneity of covariance matrices, but as the degree of heterogeneity increased, the QDF outperformed the Fisher LDF, the A-B procedure and the polynomial discriminant function.

However, we obtained two major results from this study. First, using the simulation experiments we ranked the procedures as follows: linear discriminant function, A-B discriminant function, quadratic discriminant function, and polynomial discriminant function. The best method was the linear discriminant procedure. Second, we concluded that it is better to increase the number of variables, because accuracy increases with an increasing number of variables. Moreover, our study showed that the linear discriminant function is more flexible, in that it allows the analyst to incorporate some a priori information in the model. Nevertheless, this does not exclude the use of other statistical techniques once the required hypotheses are satisfied.


References

  1. Anderson, T.W. & Bahadur, R.R. (1962). Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics, 33, 420-431.
  2. Barnard, M.M. (1935). The secular variations of skull characteristics in four series of Egyptian skulls. Annals of Eugenics, 6, 352-371.
  3. Banerjee, K.S. & Marcus, L.F. (1965). On a minimax classification procedure. Biometrics, 52, 654-654.
  4. Gilbert, S.E. (1968). On discrimination using qualitative variables. Journal of the American Statistical Association, 63, 1399-1418.
  5. Goldstein, M. & Wolf (1977). On the problem of bias in multinomial classification. Biometrics, 33, 325-331.
  6. Hills, M. (1967). Discrimination and allocation with discrete data. Applied Statistics, 16, 237-250.
  7. Lachenbruch, P.A. (1975). Discriminant Analysis. Hafner Press, New York.
  8. Marks, S. & Dunn, O.J. (1974). Discriminant functions when covariance matrices are unequal. Journal of the American Statistical Association, 69, 555-559.
  9. Martins, D.C. & Bradley, R.R. (1972). Probability models, estimation and classification for multivariate dichotomous populations. Biometrics, 23, 203-221.
  10. Onyeagu, S.I. (2003). Derivation of an optimal classification rule for discrete variables. Journal of the Nigerian Statistical Association, 73, 724-745.
  11. Oluadare, S. (2011). Robust linear classifier for equal cost ratios of misclassification. CBN Journal of Applied Statistics, 2(1).
  12. Rao, C.R. (1965). Linear Statistical Inference and Its Applications. John Wiley, New York.
  13. Johnson, R.A. & Wichern, D.W. (1988). Applied Multivariate Statistical Analysis. 4th edition, Prentice Hall, New Jersey.
  14. Smith, H.F. (1936). A discriminant function for plant selection. Annals of Eugenics, 7, 240-250.
  15. Tou, J.T. & Gonzalez, R.C. (1974). Pattern Recognition Principles. Addison-Wesley, Reading, Mass.
  16. Slah, B.Y. & Rebai, A. (2007). Comparison between statistical approaches and linear programming for resolving classification problem. International Mathematics Forum, 2(63), 3125-3141.
  17. Egbo, I., Onyeagu, S.I. & Ekezie, D.D. (2014). A comparison of multinomial classification rules for binary variables. International Journal of Mathematical Science and Engineering Applications (IJMSEA), 8, 141-157.
  18. Ekezie, D.D. (2012). Comparison of seven asymptotic error rate expansions for the sample linear discriminant function. Unpublished Ph.D. thesis, Department of Statistics, Imo State University, Owerri, Nigeria.
  19. Egbo, I., Onyeagu, S.I. & Ekezie, D.D. (2014). A comparison of multivariate discrimination of binary data. International Journal of Mathematics and Statistics Studies, 2(4), 40-61.
