Sequentially Selecting Between Two Experiment for Optimal Estimation of a Trait with Misclassification
George Matiri^{1}, Kennedy Nyongesa^{2}, Ali Islam^{1}
^{1}Department of Mathematics, Egerton University, Nakuru, Kenya
^{2}Department of Mathematics, Masinde Muliro University of Science and Technology, Kakamega, Kenya
To cite this article:
George Matiri, Kennedy Nyongesa, Ali Islam. Sequentially Selecting Between Two Experiment for Optimal Estimation of a Trait with Misclassification. American Journal of Theoretical and Applied Statistics. Vol. 6, No. 2, 2017, pp. 79-89. doi: 10.11648/j.ajtas.20170602.12
Received: January 18, 2017; Accepted: February 3, 2017; Published: February 27, 2017
Abstract: The idea of pool testing originated with Dorfman during World War II as an economical method of testing blood samples of army inductees in order to detect the presence of infection. Dorfman proposed that rather than testing each blood sample individually, portions of each sample can be pooled and the pooled sample tested first. If the pooled sample is free of infection, all inductees in the pool are passed with no further tests; otherwise the remaining portions of each blood sample are tested individually. Apart from the classification problem, pool testing can also be used to estimate the prevalence rate of a trait in a population, which was the focus of our study. In estimating the prevalence rate, one-at-a-time testing is time consuming, not cost effective and prone to errors, hence pool testing procedures have been proposed to address these problems. This study has developed a statistical model which is used to switch sequentially between two experiments when the sensitivity and specificity of the test kits are less than 100%. The experiments are selected sequentially, so that at each stage the information available at that stage is used to determine which experiment to carry out at the next stage. The method of maximum likelihood was used to obtain the estimators. The Fisher information of the two experiments is compared, and the cut off values at which one experiment becomes better than the other are calculated. The variances of the estimators have also been compared. The joint model has been compared to the one-at-a-time and pool testing models by computing the asymptotic relative efficiency (ARE). The joint model is found to be more efficient.
Keywords: Pool, Pool Testing, Cut off Value, Prevalence Rate, Sensitivity, Specificity
1. Introduction
Sequential testing of a population in the form of pools began with Dorfman [2] as an economical method of testing blood samples of army inductees in order to detect the presence of infection. Johnson et al. [6] and Nyongesa [14] extended the work of Dorfman [2] to multiple stages with the aim of reducing the number of tests. Computational pool testing with the first objective, classifying subjects, has been developed by Maheswaran et al. [10]. More recent research has focused on the second objective, estimating the prevalence rate of a trait. Thomson [18] studied the estimation problem using pool testing. This was later considered by Brookmayer [1], who introduced errors in testing.
A sufficiently accurate estimate of the prevalence can be obtained from testing pooled samples, as demonstrated by Hammick and Gastwirth [4]. Their procedure provides greater protection of respondents' anonymity, which can lead to greater participation in the survey. In the same year, Gastwirth and Johnson [3] used pool testing to estimate HIV prevalence cost-effectively. More recently, Xie et al. [19] have demonstrated how pool testing can reduce costs in the early stages of drug discovery. Janis et al. [5] considered sequentially deciding between two experiments for estimating a common success prevalence rate, where the two experiments were individual Bernoulli (p) trials and the product of k independent individual Bernoulli trials. Nyongesa [13] proposed a pool testing procedure in which the members of the population under investigation are grouped into pools and each pool is given a test. For pools that test negative, further testing is discontinued, but if the reading is positive the pool is divided into blocks of equal size. The blocks are then tested, and the constituent members of blocks that test positive are tested individually for the presence or absence of the trait under investigation. Pools that test negative are given a retest, and the constituent members of those that test positive on retest are tested individually. Nyongesa [13] used the method of moments to estimate the prevalence and observed that the proposed testing procedure reduced misclassification, particularly false positives. Computational statistics has been used in pool testing to compute the statistical measures when perfect and imperfect tests are used (Syaywa and Nyongesa [16]; Tamba et al. [17]).
Pool testing can be applied in many areas, as outlined by Sobel and Groll [15]. The first application of pool testing was to the problem of pooling blood samples in order to classify each one of a large group of people according to whether or not they have a particular disease. Mundel [12] showed that group testing can be applied in industry. For example, in making a "leak test" on a large number of gas-filled electrical devices, one can test any number of units in a single test, and the result of a test on k units is that either all k are good (no leak) or at least 1 of the k is defective. Another application is in testing various electrical devices such as condensers and resistors. Pool testing has been applied in screening the population for the presence of HIV antibody (Kline et al. [8] and Manzon et al. [11]). Litvak et al. [9] applied pool testing in screening for HIV antibody to help curb the further spread of the virus. Litvak et al. [9] showed that pooling offers a feasible way to lower the error rates associated with labelling samples when screening a low-risk HIV population. For instance, given the limited precision of the available test kits, each of which has a certain sensitivity and specificity, it has been shown that screening pooled sera can reduce the probability that a sample labelled negative in fact has antibodies. Juan and Wenju [7] have provided an algorithm for the computation of pool sizes.
The essence of this study is to devise a method of selecting between two experiments, namely:
i) individual testing of items of a population with a view to estimating the prevalence rate. In this experiment we assume the tests are imperfect, that is to say, the sensitivity and specificity of the test are each less than 100%. This experiment is herein denoted by the P^{I}-experiment.
ii) pool testing as proposed by Dorfman [2] but with errors in inspection. This experiment is herein denoted by the P^{G}-experiment.
The rest of the paper is arranged as follows. In Section 2 we develop the models and the formulae for calculating their Fisher information. In Section 3 we plot the Fisher information against the value of p. In Section 4 we compute the cut off values. In Section 5 we develop the maximum likelihood estimators of p and their asymptotic variances. In Section 6 we compare the asymptotic variances of the maximum likelihood estimators by plotting their graphs. In Section 7 we compute the ARE values, and in Sections 8 and 9 we give the discussion and conclusion of the study.
2. The Models
The model has been split into two parts, the P^{I}-experiment and the P^{G}-experiment. The P^{I}-experiment means estimating the prevalence rate of the characteristic of interest by testing each individual under study, while the P^{G}-experiment means estimating the prevalence rate of the characteristic of interest by putting items or individuals together to form a pool and testing the pool rather than testing each subject. Throughout the study, the numbers of observations from the P^{I}-experiment and the P^{G}-experiment are treated as given, and their sum is the total number of observations, N, from both experiments.
2.1. The P^{I}-experiment
In our study the P^{I}-experiment involves estimating the prevalence rate of the characteristic of interest by testing each individual under study. Suppose the P^{I}-experiment is used to estimate the prevalence rate p, and let the test outcomes be a sequence of independent and identically distributed random variables. Each outcome is then a Bernoulli random variable whose probability of success is the probability of declaring an individual positive.
For a single experiment, the probability density function is
(1)
The Fisher information on the prevalence rate p contained in a single observation is
(2)
If observations from only the P^{I}-experiment are used to estimate p, then the likelihood function of Equation (1) is
.
Therefore the estimator of p from the P^{I}-experiment is
(3)
and the asymptotic variance of this estimator is
(4)
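The quantities in this subsection can be sketched numerically. The block below is a minimal illustration, not the authors' code: it assumes the standard misclassification model in which an individual is declared positive with probability se * p + (1 - sp) * (1 - p), where `se` and `sp` are hypothetical names for the sensitivity and specificity, and it assumes se + sp > 1. The function names are likewise illustrative.

```python
def prob_positive_individual(p, se, sp):
    """Probability that one imperfect test declares an individual positive.

    Assumed model: true positives detected with probability se,
    true negatives wrongly flagged with probability 1 - sp.
    """
    return se * p + (1.0 - sp) * (1.0 - p)


def fisher_info_individual(p, se, sp):
    """Fisher information about p in one individual test (Bernoulli outcome)."""
    pi_1 = prob_positive_individual(p, se, sp)
    return (se + sp - 1.0) ** 2 / (pi_1 * (1.0 - pi_1))


def mle_individual(x, n, se, sp):
    """Estimate of p from x positive readings out of n individual tests.

    Obtained by inverting the assumed positive-reading probability.
    """
    return (x / n - (1.0 - sp)) / (se + sp - 1.0)


def asymptotic_variance_individual(p, n, se, sp):
    """Asymptotic variance of the estimator: inverse of n times the information."""
    return 1.0 / (n * fisher_info_individual(p, se, sp))


if __name__ == "__main__":
    print(fisher_info_individual(0.01, 0.99, 0.99))
    print(mle_individual(198, 10000, 0.99, 0.99))
```

Under this assumed model the estimate can fall outside the interval from 0 to 1 when the observed proportion of positives is smaller than 1 - sp; how the original study handled that case is not stated.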
2.2. The P^{G}-experiment
The P^{G}-experiment involves putting items together to form a pool and testing the pool rather than testing each individual for evidence of the characteristic of interest. A negative reading indicates that the pool contains no defective item and a positive reading indicates at least one defective item in the pool. Pooling procedures have been shown to reduce the cost of testing when the prevalence rate is low. For analysis purposes, we assume that the constituent members of a pool of size k act independently of each other, and we work with the probability of declaring such a pool positive. Let the pool test outcomes be a sequence of independent and identically distributed random variables, so that each outcome is a Bernoulli random variable with this probability of success. For a single experiment, the probability density function is
(5)
from which the Fisher information is
(6)
Suppose a number of pools, each of size k, are available from the P^{G}-experiment for estimating p, and suppose some of these pools test positive. Then from Equation (5), the maximum likelihood estimator of p from the P^{G}-experiment is
(7)
and the asymptotic variance of this estimator is
(8)
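The pooled analogue can be sketched in the same way. This is an illustration under assumptions, not the authors' implementation: a pool of size k is taken to be truly positive with probability 1 - (1 - p)^k, and the test reads positive with probability se for a truly positive pool and 1 - sp for a truly negative one. The names `se`, `sp` and all function names are hypothetical.

```python
def prob_positive_pool(p, k, se, sp):
    """Probability that an imperfect test declares a pool of size k positive."""
    q_k = (1.0 - p) ** k          # probability the pool holds no positive member
    return se * (1.0 - q_k) + (1.0 - sp) * q_k


def fisher_info_pool(p, k, se, sp):
    """Fisher information about p in one pool test (Bernoulli outcome)."""
    pi_k = prob_positive_pool(p, k, se, sp)
    # Derivative of the positive-reading probability with respect to p.
    d_pi_k = k * (se + sp - 1.0) * (1.0 - p) ** (k - 1)
    return d_pi_k ** 2 / (pi_k * (1.0 - pi_k))


def mle_pool(y, n, k, se, sp):
    """Estimate of p from y positive pools out of n pools of size k."""
    q_k = (se - y / n) / (se + sp - 1.0)
    q_k = min(max(q_k, 0.0), 1.0)  # keep the k-th root well defined
    return 1.0 - q_k ** (1.0 / k)


def asymptotic_variance_pool(p, n, k, se, sp):
    """Asymptotic variance of the pooled estimator for n pools."""
    return 1.0 / (n * fisher_info_pool(p, k, se, sp))


if __name__ == "__main__":
    print(fisher_info_pool(0.01, 2, 0.99, 0.99))
    print(mle_pool(40, 100, 5, 0.95, 0.95))
```

With k = 1 the pooled expressions reduce to those of individual testing, which is a quick consistency check on the assumed model.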
2.3. The Joint Model
Given the number of observations from the P^{I}-experiment and the number of observations from the P^{G}-experiment, and assuming independence, the joint probability density function of the random variables from the P^{I}-experiment and the P^{G}-experiment is a multinomial probability density function given by the product of their density functions
(9)
The joint likelihood function of Equation (9) is
where the maximum likelihood estimator (MLE) is obtained by solving
(10)
Since all other quantities in Equation (10) are known constants, Equation (10) is a continuous function of q, and a unique value of q that satisfies the equation exists, since its plot cuts the q-axis at one point as q varies from 0 to 1. The value of q that satisfies Equation (10) can be found iteratively as follows:
Let
,
then a unique root of this function exists between 0 and 1. Consider the tangent line to the function at the initial approximation of the root. The point at which this tangent crosses the q-axis gives the next approximation, and repeating the step yields the Newton-Raphson iteration, in which each new approximation equals the current approximation minus the ratio of the function value to its derivative at that point, provided the derivative is not zero. The iteration stops when two successive approximations differ by less than a small, arbitrarily chosen tolerance. Since the sequence converges, the last approximation is taken as the solution of Equation (10). A MATLAB 'while' loop was used to solve Equation (10).
The asymptotic variance of the maximum likelihood estimator of p under the joint model is
(11)
where .
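The iterative solution described above can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' MATLAB routine: it uses the standard imperfect-test model (hypothetical names `se`, `sp` for sensitivity and specificity), it iterates on p directly instead of on q = 1 - p, and it replaces the analytic derivative with a central finite difference. `x` is the number of positive individual tests out of `n1`, and `y` the number of positive pools out of `n2` pools of size `k`; all of these names are illustrative.

```python
def joint_score(p, x, n1, y, n2, k, se, sp):
    """Derivative of the joint log-likelihood with respect to p."""
    r = se + sp - 1.0
    pi_1 = se * p + (1.0 - sp) * (1.0 - p)        # individual test reads positive
    q_k = (1.0 - p) ** k
    pi_k = se * (1.0 - q_k) + (1.0 - sp) * q_k    # pool test reads positive
    d_pi_1 = r
    d_pi_k = k * r * (1.0 - p) ** (k - 1)
    term_i = d_pi_1 * (x - n1 * pi_1) / (pi_1 * (1.0 - pi_1))
    term_g = d_pi_k * (y - n2 * pi_k) / (pi_k * (1.0 - pi_k))
    return term_i + term_g


def joint_mle(x, n1, y, n2, k, se, sp, p0=0.1, tol=1e-10, max_iter=100):
    """Newton-Raphson 'while' loop for the root of the joint score."""
    args = (x, n1, y, n2, k, se, sp)
    h = 1e-6                       # step for the numerical derivative
    p, step, n_iter = p0, 1.0, 0
    while abs(step) > tol and n_iter < max_iter:
        slope = (joint_score(p + h, *args) - joint_score(p - h, *args)) / (2.0 * h)
        step = joint_score(p, *args) / slope
        # Keep the iterate strictly inside (0, 1).
        p = min(max(p - step, 1e-4), 1.0 - 1e-4)
        n_iter += 1
    return p


if __name__ == "__main__":
    # Illustrative data: 12 of 100 individuals and 40 of 100 pools of size 5 positive.
    print(joint_mle(12, 100, 40, 100, 5, 0.95, 0.95))
```

For the illustrative data the joint estimate falls between the two single-experiment estimates, as expected when each experiment's score pulls toward its own estimate.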
3. Comparison of Fisher Information of P^{I} and P^{G}-experiments
In this section we compare the performance of the two procedures by plotting the Fisher information of the P^{I}-experiment and the P^{G}-experiment against p for various values of sensitivity, specificity and k.
Figure 1. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 2. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 3. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 4. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 5. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 6. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 7. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 8. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 9. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 10. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 11. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
Figure 12. A graph of Fisher information against the value of p for fixed sensitivity, specificity and k.
As seen from Figures 1 to 12, the plot of the Fisher information of the P^{I}-experiment is symmetric and concave upwards, i.e. the Fisher information is very high for values of p close to 0 and close to 1, and it is at its minimum for values of p around 0.5. A change in k does not affect the Fisher information of the P^{I}-experiment, since the P^{I}-experiment is independent of k. As the sensitivity and specificity of the tests increase, the Fisher information of the P^{I}-experiment also increases. The Fisher information of the P^{G}-experiment is strictly decreasing as p increases from 0 to 1. A striking feature is that the relationship between the Fisher information and p is sensitive to k, as the slope of the curve changes with k. The curve becomes steeper as k increases, but the slope becomes less steep and almost levels off as p approaches 1. As k increases, the curve of the Fisher information of the P^{G}-experiment shifts to the left, meaning that the region in which the P^{G}-experiment is better than the P^{I}-experiment shrinks. As the sensitivity and specificity of the tests increase, the region in which the Fisher information of the P^{G}-experiment is higher than that of the P^{I}-experiment grows. Pool testing is therefore better than individual testing only where the prevalence rate is small, which concurs with the idea of Dorfman [2] that pool testing is only viable if the prevalence rate is low; otherwise the use of the P^{I}-experiment is recommended.
4. Computation of Cut off Values
The cut off value is defined as the value of p at which the Fisher information of the P^{I}-experiment and that of the P^{G}-experiment are equal, i.e. the value of p at the point of intersection of the graphs of the two Fisher informations.
The cut off value is then the unique root in the interval (0, 1) of the equation obtained by equating the two Fisher informations, i.e.
(12)
Since all other quantities are known constants, Equation (12) is a function of p alone, and the cut off value can be found iteratively as follows:
Let
(13)
then the function is continuous on the interval (0, 1), and from the graphs of Fisher information in Figures 1 to 12 there exists a value of p at which Equation (13) equals zero, namely the point of intersection of the two curves. Consider the tangent line to the function at the initial approximation of the root. The point at which this tangent crosses the p-axis gives the next approximation, and repeating the step yields the Newton-Raphson iteration, in which each new approximation equals the current approximation minus the ratio of the function value to its derivative at that point, provided the derivative is not zero. The iteration stops when two successive approximations differ by less than an arbitrary small value, the error tolerance. If the sequence converges, the last approximation is taken as the cut off value, which is the solution of Equation (12). A MATLAB 'while' loop was used to solve Equation (13).
For various values of sensitivity, specificity and k, the roots of Equation (13), i.e. the cut off values, are given in Table 1:
Table 1. Cut off values for various values of sensitivity, specificity and k.
k  Cut off value at each of four sensitivity and specificity settings
2  0.646  0.596  0.563  0.528 
3  0.555  0.507  0.477  0.446 
5  0.439  0.395  0.371  0.348 
10  0.296  0.263  0.248  0.234 
15  0.227  0.201  0.190  0.181 
20  0.185  0.164  0.156  0.150 
50  0.092  0.082  0.080  0.078 
From Table 1 it can be observed that as the pool size k increases, the cut off value decreases for all the values of sensitivity and specificity considered, i.e. the region in which the P^{G}-experiment is better shrinks. This concurs with the conclusion that pool testing is only feasible when the pool sizes are reasonably small. It can also be observed that as the sensitivity and specificity of the test kits increase, the region in which the P^{G}-experiment is better also grows.
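The cut off computation can be sketched as a root search on the difference of the two Fisher informations. The block below is an illustration under assumptions, not the authors' MATLAB code: it uses the standard imperfect-test expressions for the Fisher information (hypothetical names `se`, `sp` for sensitivity and specificity), compares one individual test with one pool test, and approximates the derivative by a central finite difference.

```python
def fisher_info_individual(p, se, sp):
    """Fisher information about p in one individual test (assumed model)."""
    pi_1 = se * p + (1.0 - sp) * (1.0 - p)
    return (se + sp - 1.0) ** 2 / (pi_1 * (1.0 - pi_1))


def fisher_info_pool(p, k, se, sp):
    """Fisher information about p in one test of a pool of size k (assumed model)."""
    q_k = (1.0 - p) ** k
    pi_k = se * (1.0 - q_k) + (1.0 - sp) * q_k
    d_pi_k = k * (se + sp - 1.0) * (1.0 - p) ** (k - 1)
    return d_pi_k ** 2 / (pi_k * (1.0 - pi_k))


def cut_off_value(k, se, sp, p0=0.5, tol=1e-10, max_iter=100):
    """Newton-Raphson 'while' loop for the p at which the two informations agree."""

    def g(p):
        return fisher_info_individual(p, se, sp) - fisher_info_pool(p, k, se, sp)

    h = 1e-6                       # step for the numerical derivative
    p, step, n_iter = p0, 1.0, 0
    while abs(step) > tol and n_iter < max_iter:
        slope = (g(p + h) - g(p - h)) / (2.0 * h)
        step = g(p) / slope
        # Keep the iterate strictly inside (0, 1).
        p = min(max(p - step, 1e-4), 1.0 - 1e-4)
        n_iter += 1
    return p


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(k, round(cut_off_value(k, 0.99, 0.99), 3))
```

With perfect tests and k = 2 the equation can be solved by hand and gives p = 2/3, which the sketch reproduces. With an assumed sensitivity and specificity of 0.99 and k = 2 it returns a value near 0.646, in line with the first entry of Table 1, although the settings behind the table columns are an inference here and not stated in the surviving text.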
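For example, with k = 5 and N tests available, the information about p is maximized by using a single experiment throughout, and which experiment that is depends on whether p lies below or above the corresponding cut off value in Table 1.
In general, if N tests are available, the allocation that maximizes the information about p assigns all N tests to the P^{G}-experiment when p is below the cut off value and all N tests to the P^{I}-experiment when p is above it.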
Note that the region where one experiment is better than the other depends on the unknown parameter p. Thus an adaptive rule is suggested, in which p is estimated at each stage and the next observation is allocated according to whether the estimate lies below or above the cut off value.
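The adaptive rule can be sketched in a few lines. This is a hypothetical illustration: the function names, the labels "PG" and "PI", and the tie-breaking choice (an estimate exactly equal to the cut off goes to individual testing) are assumptions, and the stage estimates are supplied by hand rather than computed from data.

```python
def choose_experiment(p_hat, cut_off):
    """Pick the experiment for the next observation from the current estimate."""
    # Pool testing carries more information below the cut off value.
    return "PG" if p_hat < cut_off else "PI"


def allocation_path(stage_estimates, cut_off):
    """Apply the rule stage by stage to a sequence of current estimates of p."""
    return [choose_experiment(p_hat, cut_off) for p_hat in stage_estimates]


if __name__ == "__main__":
    # 0.439 is the k = 5 entry in the first column of Table 1.
    print(allocation_path([0.60, 0.48, 0.41, 0.12], 0.439))
```

In a full sequential procedure each stage estimate would come from the maximum likelihood estimator applied to all observations collected so far.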
5. Estimator of Prevalence Rate, Its Variance and Confidence Interval
In this section we compute the maximum likelihood estimator of the prevalence rate, its variance and the 95% Wald-type confidence interval of the maximum likelihood estimator for various values of sensitivity, specificity and pool size.
Table 2. Maximum likelihood estimator, variance and 95% confidence interval for different values of p, for given sensitivity, specificity and pool size.
p  MLE  Variance  95% confidence interval
0.01  0.0160  0.3266  -0.0086, 0.0407
0.05  0.0465  0.8728  0.0052, 0.0878
0.10  0.1190  2.291  0.0556, 0.1825
0.20  0.2027  4.226  0.1239, 0.2815
0.01  0.0113  0.1224  -0.0094, 0.0319
0.05  0.0567  0.6592  0.01138, 0.1021
0.10  0.1119  1.605  0.0501, 0.1736
0.20  0.2337  6.136  0.1500, 0.3168
Table 3. Maximum likelihood estimator, variance and 95% confidence interval for different values of p, for given sensitivity, specificity and pool size.
p  MLE  Variance  95% confidence interval
0.01  0.0034  0.6200  -0.0081, 0.0150
0.05  0.0561  1.9000  0.0110, 0.1013
0.10  0.0831  2.6000  0.0290, 0.1373
0.20  0.1634  5.1800  0.0909, 0.2359
0.01  0.0073  0.2310  -0.0094, 0.0238
0.05  0.0597  1.1100  0.0133, 0.1061
0.10  0.1106  2.6000  0.0491, 0.1720
0.20  0.2039  9.6000  0.1249, 0.2828
Table 4. Maximum likelihood estimator, variance and 95% confidence interval for different values of p, for given sensitivity, specificity and pool size.
p  MLE  Variance  95% confidence interval
0.01  0.0148  2.1900  -0.0089, 0.0385
0.05  0.0542  3.6400  0.0098, 0.0986
0.10  0.1164  6.5640  0.0535, 0.1793
0.20  0.1789  10.748  0.1038, 0.2547
0.01  0.0172  0.0780  -0.0083, 0.0428
0.05  0.0306  0.0940  -0.0032, 0.0644
0.10  0.1000  4.120  0.0412, 0.1588
0.20  0.2767  4.128  0.1890, 0.3644
From Tables 2 to 4 it can be noted that the maximum likelihood estimates of the prevalence rate are very close to the actual values used in the simulation. The estimates resulting from the experiments are used to evaluate the limits of the 100(1 - α)% confidence interval of the simulated estimators, where α is the level of significance, and it can be noted from Tables 2 to 4 that the actual value lies between the lower and upper limits.
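A Wald-type interval of the kind used in this section can be sketched as the estimate plus or minus a normal quantile times the estimated standard error. The function name and the example numbers below are illustrative assumptions and are not taken from the tables.

```python
import math
from statistics import NormalDist


def wald_interval(p_hat, variance, level=0.95):
    """Wald-type confidence interval: estimate +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% interval
    half_width = z * math.sqrt(variance)
    return p_hat - half_width, p_hat + half_width


if __name__ == "__main__":
    print(wald_interval(0.10, 0.0004))
    print(wald_interval(0.01, 0.0004))
```

Because the interval is symmetric about the estimate, the lower limit can fall below zero when the estimate is close to zero; truncating it at zero is a common practical choice, though the text does not say whether that was done here.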
6. Comparison of Variances
In this section we plot the variances of the estimators from the P^{I}-experiment, the P^{G}-experiment and the joint model against p for various values of sensitivity, specificity and k.
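Figure 13. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 14. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 15. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 16. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 17. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 18. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 19. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 20. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 21. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 22. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 23. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.
Figure 24. A graph of the asymptotic variances as a function of p for fixed sensitivity, specificity and k.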
As seen from Figures 13 to 24, the plot of the asymptotic variance of the P^{I}-experiment estimator is concave downwards and symmetric, with a maximum at approximately p = 0.5. This variance is unaffected by a change in k, holding specificity and sensitivity constant, since the P^{I} model is independent of k. As the specificity and sensitivity of the tests increase, this variance decreases. It can also be noted that the asymptotic variance of the P^{G}-experiment estimator increases exponentially as p increases from 0 to 1. As k increases, the variance of the P^{G}-experiment estimator decreases, keeping sensitivity and specificity constant, while holding k constant and increasing the sensitivity and specificity of the tests also decreases it. The asymptotic variance of the joint-model estimator increases as p increases, but thereafter it starts decreasing as p gets closer to 1. The variance of the joint-model estimator increases as k increases, keeping sensitivity and specificity constant, while holding k constant and increasing sensitivity and specificity decreases it. As k increases, the plot of the variance of the P^{G}-experiment estimator shifts to the left, meaning that the region in which the variance of the P^{I}-experiment estimator is higher than that of the P^{G}-experiment estimator shrinks. As the sensitivity and specificity of the tests increase, the region in which the variance of the P^{I}-experiment estimator is higher than that of the P^{G}-experiment estimator grows. For small values of p, the variance of the joint-model estimator is smaller than the variances of both the P^{I}- and P^{G}-experiment estimators, but it is equal to the variance of the P^{I}-experiment estimator for values of p close to 1. The gap by which the variance of the P^{G}-experiment estimator exceeds that of the joint model increases exponentially as p increases from 0 to 1, whereas the gap between the joint model and the P^{I}-experiment first increases and then decreases, and the two variances are equal for values of p close to 1. As k increases, the region in which the variances of the joint-model and P^{I}-experiment estimators are equal grows. In general we observed that the variance of the joint-model estimator is smaller than or equal to that of the P^{I}- or P^{G}-experiment estimator over the range of p considered.
7. Asymptotic Relative Efficiency (ARE)
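In this section, the asymptotic variances of the estimators from the P^{I}-experiment, the P^{G}-experiment and the joint model are compared. This is accomplished by computing asymptotic relative efficiency (ARE) values for various values of sensitivity, specificity, k and p. Let ARE^{1} be the ratio of the asymptotic variance of the joint-model estimator to that of the P^{I}-experiment estimator, and let ARE^{2} be the corresponding ratio for the P^{G}-experiment estimator; then ARE < 1 implies that the joint model is more efficient than the other two models, namely the P^{I} and P^{G}-procedures.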
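The ARE values can be sketched from the Fisher informations. The block below is an illustration under assumptions, not the authors' computation: each ARE is taken as the joint-model asymptotic variance divided by the single-experiment asymptotic variance, both experiments are assumed to contribute the same number of observations to the joint model, and the standard imperfect-test model is used with hypothetical names `se` and `sp` for sensitivity and specificity.

```python
def fisher_info_individual(p, se, sp):
    """Fisher information about p in one individual test (assumed model)."""
    pi_1 = se * p + (1.0 - sp) * (1.0 - p)
    return (se + sp - 1.0) ** 2 / (pi_1 * (1.0 - pi_1))


def fisher_info_pool(p, k, se, sp):
    """Fisher information about p in one test of a pool of size k (assumed model)."""
    q_k = (1.0 - p) ** k
    pi_k = se * (1.0 - q_k) + (1.0 - sp) * q_k
    d_pi_k = k * (se + sp - 1.0) * (1.0 - p) ** (k - 1)
    return d_pi_k ** 2 / (pi_k * (1.0 - pi_k))


def are_values(p, k, se, sp):
    """Return (ARE1, ARE2): joint variance over P^I and over P^G variance.

    With equal numbers of observations from each experiment the sample
    sizes cancel, leaving ratios of Fisher informations.
    """
    info_i = fisher_info_individual(p, se, sp)
    info_g = fisher_info_pool(p, k, se, sp)
    joint = info_i + info_g        # information adds for independent experiments
    return info_i / joint, info_g / joint


if __name__ == "__main__":
    are1, are2 = are_values(0.01, 2, 0.99, 0.99)
    print(round(are1, 3), round(are2, 3))   # 0.273 0.727
```

Under these assumptions the two ratios always sum to 1, which is the pattern the tabulated pairs follow. With an assumed sensitivity and specificity of 0.99, k = 2 and p = 0.01 the sketch gives 0.273 and 0.727, agreeing with the first column of Table 5, though the settings behind that table are inferred rather than stated in the surviving text.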
Table 5. The ARE of the joint model relative to the P^{I} and P^{G}-models at fixed sensitivity and specificity.
p-value  ARE  k = 2  k = 3  k = 5  k = 10
0.01  ARE^{1}  0.273  0.183  0.109  0.054
0.01  ARE^{2}  0.727  0.817  0.891  0.946
0.05  ARE^{1}  0.320  0.238  0.162  0.098
0.05  ARE^{2}  0.680  0.762  0.838  0.902
0.10  ARE^{1}  0.336  0.260  0.189  0.136
0.10  ARE^{2}  0.664  0.740  0.810  0.864
0.15  ARE^{1}  0.346  0.276  0.215  0.186
0.15  ARE^{2}  0.654  0.724  0.785  0.814
0.20  ARE^{1}  0.356  0.293  0.244  0.257
0.20  ARE^{2}  0.644  0.707  0.756  0.743
0.30  ARE^{1}  0.376  0.331  0.321  0.513
0.30  ARE^{2}  0.623  0.669  0.679  0.487
Table 6. The ARE of the joint model relative to the P^{I} and P^{G}-models at fixed sensitivity and specificity.
p-value  ARE  k = 2  k = 3  k = 5  k = 10
0.01  ARE^{1}  0.213  0.115  0.051  0.018
0.01  ARE^{2}  0.787  0.885  0.94  0.982
0.05  ARE^{1}  0.252  0.160  0.092  0.048
0.05  ARE^{2}  0.748  0.840  0.908  0.951
0.10  ARE^{1}  0.283  0.199  0.134  0.096
0.10  ARE^{2}  0.717  0.801  0.866  0.904
0.15  ARE^{1}  0.306  0.231  0.175  0.172
0.15  ARE^{2}  0.694  0.769  0.825  0.828
0.20  ARE^{1}  0.325  0.261  0.223  0.304
0.20  ARE^{2}  0.675  0.739  0.777  0.696
0.30  ARE^{1}  0.362  0.326  0.357  0.746
0.30  ARE^{2}  0.638  0.674  0.643  0.254
Table 7. The ARE of the joint model relative to the P^{I} and P^{G}-models at fixed sensitivity and specificity.
p-value  ARE  k = 2  k = 3  k = 5  k = 10
0.01  ARE^{1}  0.207  0.108  0.045  0.014
0.01  ARE^{2}  0.793  0.892  0.955  0.986
0.05  ARE^{1}  0.231  0.136  0.071  0.034
0.05  ARE^{2}  0.769  0.864  0.929  0.966
0.10  ARE^{1}  0.257  0.169  0.107  0.077
0.10  ARE^{2}  0.743  0.831  0.893  0.923
0.15  ARE^{1}  0.281  0.202  0.151  0.164
0.15  ARE^{2}  0.719  0.798  0.849  0.836
0.20  ARE^{1}  0.304  0.238  0.208  0.332
0.20  ARE^{2}  0.696  0.762  0.792  0.668
0.30  ARE^{1}  0.351  0.321  0.383  0.816
0.30  ARE^{2}  0.649  0.679  0.617  0.184
The computed ARE values of the proposed model relative to the P^{I} and P^{G}-models in Tables 5 to 7 reveal the same trend. With sensitivity and specificity held constant, as k increases from 2 to 10, ARE^{1} decreases for small values of p, but for larger values of p it first decreases and then starts increasing. ARE^{2} increases as k increases from 2 to 10 for small values of p, but for larger values of p it eventually starts decreasing. Holding k constant and increasing p increases ARE^{1} while ARE^{2} decreases. As the sensitivity and specificity of the tests decrease, ARE^{1} decreases while ARE^{2} increases. For the given interval of p, ARE^{1} is less than 0.5, implying that the P^{I}-experiment is less than 50% as efficient as the proposed model, while ARE^{2} is more than 0.5, implying that the P^{G}-experiment is more than 50% as efficient as the proposed model. However, the computed values of ARE^{1} and ARE^{2} are all less than 1, hence the proposed joint model is more efficient than the other two existing models for the given range of p.
8. Discussion
The study found that the curve of the Fisher information for the P^{I}-experiment is concave upwards and symmetric, and that it is not affected by a change in the pool size. The Fisher information for the P^{G}-experiment is very high for small values of p and decreases exponentially as p increases from 0 to 1. Increasing the pool size decreases the value of the Fisher information and at the same time shifts the plot of the P^{G}-experiment to the left. If the pool size is held constant, increasing the sensitivity and specificity of the tests increases the Fisher information of both the P^{I} and P^{G}-experiments.
The plot of the asymptotic variance of the maximum likelihood estimator of p from the P^{I}-experiment against p is concave downwards. This variance is not affected by a change in the pool size when sensitivity and specificity remain the same, but holding the pool size constant and increasing the sensitivity and specificity of the tests decreases the variance. Similarly, the asymptotic variance of the maximum likelihood estimator of p from the P^{G}-experiment increases exponentially as p increases from 0 to 1. The curve for the P^{G}-experiment shifts to the left and becomes steeper as k increases from 2 to 10, holding sensitivity and specificity constant. Holding the pool size constant and increasing the sensitivity and specificity of the tests decreases the asymptotic variance of the P^{G}-experiment estimator.
The constructed estimator is affected by changes in the pool size and in the sensitivity and specificity of the test kits. An increase in pool size increases the variance of the estimator when specificity and sensitivity are held constant, while increasing sensitivity and specificity with the pool size held constant decreases the variance. The variance of the constructed estimator is smaller than the variances of the one-at-a-time experiment and the pooled experiment over the range of p considered, hence the constructed estimator is more efficient than the previous estimators, especially for small values of p.
9. Conclusion
This study focused on the construction of a new model for estimating the prevalence rate of a trait in a population with imperfect tests by selecting between two experiments, namely the P^{I} and P^{G}-experiments. Ideally the model should select the better experiment, and once the better experiment is being used, the estimator should approximate the individual maximum likelihood estimator for that experiment. From this study it can be concluded that the P^{G}-experiment is better than the P^{I}-experiment for values of p close to zero, but for values of p close to 1.0 the P^{I}-experiment is recommended. Hence, from the results on the Fisher information, asymptotic variance and ARE, the proposed joint model for sequentially selecting between two experiments for estimating the prevalence rate of a trait in a population with imperfect tests is more efficient than the P^{I} and P^{G}-models across the entire range of parameter values, regardless of the pool size, sensitivity and specificity of the tests.
The developed model has potential application in HIV testing because it gives a superior estimator of the disease prevalence without necessarily identifying the subject. The models may also be applied by pharmaceutical companies in the early stages of drug discovery.
Based on the constructed model, one can extend the present work to include a model with more than two experiments with misclassification. The present work can also be extended to estimate not only p but also the value of k (pool size) that optimizes the group testing scenario based on the new model. A model based on cost analysis when sampling from different experiments with imperfect kits can also be considered.
References