The Log Normal and the Poisson Gravity Models in the Analysis of Interactions Phenomena
Giuseppe Ricciardo Lamonica
Department of Economics and Social Sciences, Polytechnic University of Marche, Ancona, Italy
To cite this article:
Giuseppe Ricciardo Lamonica. The Log Normal and the Poisson Gravity Models in the Analysis of Interactions Phenomena. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 4, 2015, pp. 291-299. doi: 10.11648/j.ajtas.20150404.19
Abstract: Three problems are often encountered when bilateral interaction data are analyzed by means of the log-normal gravity model: the bias created by the logarithmic transformation, the failure of the homoscedasticity assumption, and the treatment of zero-valued flows. When the interactions are count data, taking non-negative integer values, the literature suggests overcoming these problems by using a Poisson gravity model instead of the log-normal model. In this paper, a comparative analysis of the two models is carried out using a real interaction phenomenon. The most important result is that, if the model is correctly specified, the two specifications of the gravity model behave very similarly.
Keywords: Gravity Model, Poisson Model, Log Normal Model, Comparisons, Count Data
1. Introduction
The analysis of interaction (or flow) phenomena of any type is an area of particular interest. Its aim is to describe, explain and predict the interactions that arise between the units of a collective.
So many models have been developed for this purpose in the literature that it is impossible to list them here. However, put briefly, it is possible to cluster them into two classic categories: stochastic models and econometric models.
The former are probabilistic models of Markov type, which aim to highlight the fundamental constants of the interactions. The latter are models characterized by a number of explanatory variables (covariates), which aim to explain and predict the interactions.
Belonging to the latter group is the gravity model, which is considered one of the most important models for the analysis of interaction phenomena, and to which the literature has devoted close interest, especially from an empirical point of view.
The idea underlying this model is that the interaction which arises between two units of a collective, in conformity with Newton’s gravitational law, is directly proportional to the masses of those units and inversely proportional to the distance between them. In its classic form, the model is set out as follows:
fij = b0 · pi^b1 · pj^b2 · dij^(-b3) · eij (1)
where fij is the interaction from the i-th unit (origin) to the j-th unit (destination); pi and pj represent the masses of the two units; dij is the distance between them; and eij is the residual variable. Finally, b0 is a constant of proportionality which, together with the parameters b1, b2 and b3, is subject to estimation.
Taking the logarithm of (1) gives a double log-linear model (henceforth ‘log-normal model’), which is easily estimated by the Ordinary Least Squares (OLS) method:
lg(fij)=lg(b0)+b1lg(pi)+b2lg(pj)-b3lg(dij)+lg(eij) (2)
The literature has highlighted that the log-normal model (2), under the classical assumptions, has a series of potential drawbacks. In particular:
• The logarithmic transformation produces estimates of the logarithm of the flows, not of the flows themselves. The antilogarithms of these estimates are biased estimates of the flows (because of Jensen’s inequality). A consequence of this is the underprediction of large flows.
• Model (2) assumes the homoscedasticity of the residual variable, so the variance of lg(fij) is identical for all ij pairs. Thus, an observed flow of 2 in relation to an estimate of 1 is as likely as an observed flow of 200 in relation to an estimate of 100. This property, homoscedasticity, is implausible for data sets with a wide variation in the size of interaction flows. The first consequence of heteroscedasticity is that the usual standard errors of the least-squares estimates of the regression parameters are incorrect, so the confidence intervals and hypothesis tests that use them are invalid. Since standard software packages use these formulas by default, their output is unreliable when heteroscedasticity is present. The second consequence is that the OLS estimators lose some of their desirable statistical properties: they remain unbiased but no longer have minimum variance, even if the correct formulas are used to estimate the variances, because the minimum-variance estimators are then the generalized, not the ordinary, least-squares estimators. Generalized least squares (GLS) is usually applied through a transformation of the regression model that makes the resulting disturbances homoscedastic.
• When flows are zero, the logarithmic transformation cannot be computed. To avoid this problem, a small positive constant α ∈ (0, 1] is added to all observations. However, when there are many zero flows, the choice of this constant has a considerable impact on the parameters of the model and on its explanatory power.
• If the residual variable of model (2) is normally distributed, then lg(fij) is also normal and fij is log-normally distributed. This is implausible because the flows take non-negative integer values, whereas the log-normal distribution is continuous and strictly positive.
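The retransformation bias described in the first point can be illustrated with a minimal sketch. It assumes that lg denotes the natural logarithm and that lg(fij) is normal with mean mu and standard deviation sigma; both values below are hypothetical and are not estimates from this paper. Under these assumptions the antilog of the mean log flow falls short of the mean flow by the factor exp(-sigma²/2).

```python
import math

def naive_backtransform(mu):
    """Antilog of the expected log flow: exp(E[lg f])."""
    return math.exp(mu)

def lognormal_mean(mu, sigma):
    """Exact mean of a log-normal variable: E[f] = exp(mu + sigma^2 / 2)."""
    return math.exp(mu + 0.5 * sigma ** 2)

# Hypothetical values, chosen only for illustration.
mu, sigma = 3.0, 0.8
naive = naive_backtransform(mu)
exact = lognormal_mean(mu, sigma)

# The ratio equals exp(-sigma^2 / 2), which is below 1 whenever sigma > 0,
# so the naive antilog systematically underpredicts the mean flow.
bias_ratio = naive / exact
```

The larger the residual variance, the stronger the underprediction, which is why the effect is most visible on large flows.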
When the dependent variable is of a count data type that takes non-negative integer values – for example the number of people that move from one place to another – to avoid these pitfalls the literature suggests using a Poisson regression model.
This model is based on the hypothesis that if the probability of interaction between two generic units is small and constant, then it is possible to assume that fij is a realization of the variable Fij with Poisson probability distribution and mean λij. Thus the probability that Fij=fij is:
P(Fij=fij) = exp(-λij) · λij^fij / fij! (3)
Moreover, the parameter λij is logarithmically linked with the covariates:
lg(λij)=lg(b0)+b1lg(pi)+b2lg(pj)-b3lg(dij) (4)
Although the Poisson gravity model does not present the drawbacks mentioned above, it has pitfalls of its own. The most important is that the model is characterized by a single parameter, which is both the mean and the variance of the distribution.
When real data are used the variance is often greater than the mean (over-dispersion) and the Poisson regression may not be appropriate for count data.
Another problem with Poisson regression is the excess of zeros, i.e. real data have more zeros than a Poisson regression would predict.
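The equality of mean and variance noted above can be checked numerically. The sketch below evaluates the Poisson probabilities on a truncated support and computes both moments; the mean lam is a hypothetical value, and the truncation point is an assumption that is harmless for a mean this small.

```python
import math

def poisson_pmf(k, lam):
    """P(F = k) for a Poisson variable with mean lam, computed on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

lam = 4.0                 # hypothetical mean flow
support = range(0, 200)   # truncated support; the omitted tail is negligible for lam = 4
probs = [poisson_pmf(k, lam) for k in support]

mean = sum(k * p for k, p in zip(support, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(support, probs))
# Both moments coincide with lam: the distribution has no separate dispersion parameter.
```

Observed count data whose variance clearly exceeds the mean, or which contain more zeros than poisson_pmf(0, lam) implies, therefore depart from the Poisson assumption.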
Referring the reader to the numerous econometric manuals in the literature for the details, the aim of this paper is to analyze these two models by means of a real phenomenon in order to identify their shared characteristics and those specific to each of them.
In particular, the paper will focus on the problem of zero values and on the homoscedasticity of the residual variable in the log-normal gravity model. The main result obtained is that the log-normal gravity model is still a reference scheme of undoubted interest for describing and interpreting interaction phenomena and that, contrary to several claims in the literature, the benefits of using the Poisson regression are minor and only theoretical.
The paper is organized as follows: section 2 describes the data used and the results obtained, section 3 concludes.
All the analyses were performed using the SAS System software, version 9.3.
2. Data and Results
To compare the log-normal and the Poisson gravity models, the analysis considered as interaction phenomenon the migratory flows of resident foreigners among the Italian regions in 1995 (see Figure 1 of the appendix). The regions correspond to the second level of the Nomenclature of Territorial Units for Statistics (NUTS 2).
We are aware that the data used are not recent. However, this does not limit the validity of the results, which do not depend on the age of the data.
Table 10 in the Appendix reports the data used for the inquiry. Excluding the movements within the Italian regions (i.e., fii for i=1,..,20) from the analysis, the following Table 1 shows the frequency distribution of the observed flows:
|Class interval||Number of flows||Frequency (%)|
As can be seen, in 1995 the mean size of the flows of resident foreigners among the Italian regions was 33.9 and the variance was 3673.52. Furthermore, 65.58% of the flows did not exceed 20 movements and 9.21% were greater than 100. The largest flow was recorded from Lazio to Lombardia and involved 398 migrants. By contrast, 11.32% (43/380) of the flows were zeros.
The distinctive feature of the data set is that it includes a very large number of zero and small flows, which makes it particularly suited to the type of experimentation carried out in this paper.
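A rough arithmetic check of these summary figures against the Poisson benchmark can be sketched as follows. The mean, the variance and the zero count are those reported above; comparing them with a Poisson distribution that has one common mean for all pairs is a simplifying assumption made here, since a Poisson regression concerns the variance conditional on the covariates, not the marginal variance.

```python
import math

mean_flow = 33.9          # mean flow reported in the text
variance_flow = 3673.52   # variance reported in the text
n_flows = 380             # 20 x 19 origin-destination pairs
n_zero = 43               # zero flows reported in the text

# Marginal variance-to-mean ratio; it would be 1 for a Poisson variable.
dispersion_ratio = variance_flow / mean_flow

# Observed share of zeros versus the share implied by a Poisson
# distribution with the same (common) mean.
observed_zero_share = n_zero / n_flows
poisson_zero_share = math.exp(-mean_flow)
```

The ratio is above 100 and the implied share of zeros is practically nil, so the marginal distribution is far from Poisson; whether the conditional distribution is closer depends on how much of the variation the covariates absorb.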
When real phenomena are analysed, the gravity model is usually extended in order to consider, besides the classic determinant, other potential factors that may influence the phenomenon under investigation.
Indeed, as reported by a large body of literature, migratory phenomena are influenced not only by masses and distance but also by economic, social and demographic disparities among the territorial units considered. Hence, for the purposes of the analysis, 18 variables were considered for each Italian region (see the Appendix), because they were deemed able to measure the principal aspects of the characteristics just mentioned.
Preliminary examination of these indexes revealed correlations strong enough to counsel against their direct use in the gravity regression model. Consequently, the 18 indicators were synthesised by means of factor analysis.
The results of this analysis are set out in Table 13 of the appendix. They show that the factor structure identified has a considerable power of synthesis. The first two factors, retained on the basis of the usual criteria for choosing the number of factors, can be immediately interpreted.
The high and positive coefficients of correlation between the first factor and all the variables of an economic nature suggest identification of this factor as a complex index of the economic structure, while the close correlations of the second factor with the remaining indexes suggest its identification as a complex index of the demographic structure.
For the purposes of the analysis, the following log-normal gravity model (5) was considered, where F1i and F1j are the first factor (economic factor) in the origin and the destination regions of the flows, while F2i and F2j are the second factor (demographic factor) in the origin and the destination regions of the flows. Finally, lg(b0) and bi (for i=1,…,7) are the parameters of the model.
lg(fij)=lg(b0)+b1lg(pi)+b2lg(pj)-b3lg(dij)+b4F1i+b5F1j+b6F2i+b7F2j +lg(eij) (5)
In order to estimate the model parameters, the masses (pi and pj) were calculated as the geometric average of the population at the beginning and at the end of the year. The distances (dij) between the regions were instead calculated by considering the Euclidean distance between the demographic barycentres of each region. The pairs of co-ordinates identifying each regional demographic barycentre were determined by calculating the arithmetic average, weighted with the population, of the latitude and longitude of each provincial capital in the same region.
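The construction of the masses and distances described above can be sketched as follows. The function names and the toy coordinates are hypothetical, and the sketch assumes that the Euclidean distance is taken directly on the latitude and longitude pairs, which is one possible reading of the text.

```python
import math

def geometric_mean_population(pop_start, pop_end):
    """Mass of a region: geometric average of start- and end-of-year population."""
    return math.sqrt(pop_start * pop_end)

def barycentre(capitals):
    """Population-weighted mean of (latitude, longitude) over provincial capitals.

    capitals: list of (latitude, longitude, population) tuples.
    """
    total = sum(pop for _, _, pop in capitals)
    lat = sum(la * pop for la, _, pop in capitals) / total
    lon = sum(lo * pop for _, lo, pop in capitals) / total
    return lat, lon

def euclidean_distance(a, b):
    """Euclidean distance between two (latitude, longitude) barycentres."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Toy example with made-up coordinates and populations.
region_a = barycentre([(40.0, 10.0, 1), (42.0, 12.0, 3)])
region_b = barycentre([(44.5, 15.5, 2)])
d_ab = euclidean_distance(region_a, region_b)
```

Because the weights are populations, the barycentre is pulled towards the most populous provinces of each region.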
Since some observed flows were zeros, the following experiment, in line with the literature, was conducted: a constant α taking values from 0.1 to 1 in steps of 0.1 was added to all flows, and model (5) was fitted for each value.
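Why the choice of α matters mainly for zero and small flows can be seen from the transformation itself. The sketch below builds the grid of constants described above and compares the image of a zero flow with that of the largest observed flow (398); it assumes that lg is the natural logarithm and does not reproduce the model fit.

```python
import math

# Grid of constants: 0.1, 0.2, ..., 1.0
alphas = [round(0.1 * k, 1) for k in range(1, 11)]

def shifted_log(flow, alpha):
    """Dependent variable of the log-normal model after adding the constant."""
    return math.log(flow + alpha)

# Image of a zero flow and of the largest observed flow for each constant.
zero_images = {a: shifted_log(0, a) for a in alphas}
large_images = {a: shifted_log(398, a) for a in alphas}

# Range of the transformed value across the grid of constants.
zero_spread = max(zero_images.values()) - min(zero_images.values())
large_spread = max(large_images.values()) - min(large_images.values())
```

Across the grid, the transformed zero flow moves by more than two log units while the transformed largest flow is essentially unchanged, so any sensitivity of the estimates to α comes from the zero and near-zero observations.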
The results of this analysis are shown in Table 2, which, as said above, does not consider the intra-region flows (i.e. the fii for i=1,..,20).
Legend: p-values in parentheses
For various values of α, the estimates of the constant (intercept) and the parameters associated with the population size of the regions (lg(pi) and lg(pj)), as well as the parameter relative to the distance (lg(dij)), were always highly significant. The parameter sign of the latter variable was negative and consistent with expectations.
The estimates of the parameters associated with the economic factor in the regions of origin (F1i) were never significant. By contrast, those for the destination regions of the flows (F1j) were always highly significant.
According to the signs, this factor was a push determinant in the regions of origin and a pull determinant in the destination regions of flows, while in the absolute values a predominant effect of the pull rather than push determinant was evident.
As regards the demographic factor of the places of origin (F2i) and of the places of destination (F2j), the estimates of the associated parameters were non-significant.
Moreover (Table 3), the White and the Kolmogorov-Smirnov tests showed that the regression residuals were, respectively, homoscedastic and normally distributed. Finally, the coefficient of determination (adjusted R2) was found to be very high (from 77% to 81%).
As in results reported in the literature, in this analysis too the parameter estimates associated with the intercept, the masses of the regions, and the distance decrease as the constant α increases.
However, owing to the inclusion of the two factors, the parameter estimates of the model are much more stable than those reported in the literature, highlighting a nearly constant effect of α.
Even if we admit that α has an effect on the parameters of the model, the problem is easily solved, because the criteria shown in Table 3 indicate α=1 as the optimal value.
This choice concurs with that of several studies, which have recommended the use of the lowest possible non-zero count in this situation.
|α||0.1||0.2||0.3||0.4||0.5||0.6||0.7||0.8||0.9||1.0|
|White test of heteroskedasticity||71.82||62.41||55.11||50.88||47.92||45.74||44.02||42.58||41.32||40.20|
|Kolmogorov test of normality||0.07||0.06||0.06||0.05||0.05||0.05||0.04||0.04||0.04||0.04|
Legend: p-values in parentheses
Table 4 shows a more detailed analysis of the goodness of fit of the log-normal model. In particular, it reports the cross tabulation between the fitted and the observed flows. As can be seen, about 57% (the sum of the frequencies on the main diagonal) of the predicted flows match the observed flows, in the sense that they fall in the same class. By contrast, 17% (frequencies above the main diagonal) are overestimated and 26% are underestimated. Underestimation is more pronounced for flows greater than 40.
|Observed flows||up to10||(10-20]||(20-30]||(30-40]||(40-50]||(50-100]||(100-200]||(200-300]||Total|
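The classification behind this kind of cross tabulation can be sketched as follows. The class bounds are those of the table headers; the function names and the four observed and fitted values are hypothetical and serve only to show the mechanics.

```python
from bisect import bisect_left

# Upper bounds of the classes: up to 10, (10-20], ..., (200-300], over 300.
UPPER_BOUNDS = [10, 20, 30, 40, 50, 100, 200, 300]

def flow_class(flow):
    """Index of the class containing the flow (right-closed intervals)."""
    return bisect_left(UPPER_BOUNDS, flow)

def classification_shares(observed, fitted):
    """Shares of fitted flows in the same, a higher, or a lower class than observed."""
    match = over = under = 0
    for obs, fit in zip(observed, fitted):
        o, f = flow_class(obs), flow_class(fit)
        if f == o:
            match += 1      # main diagonal of the cross tabulation
        elif f > o:
            over += 1       # fitted class above the observed class
        else:
            under += 1      # fitted class below the observed class
    n = len(observed)
    return match / n, over / n, under / n

# Made-up observed and fitted flows, for illustration only.
observed = [0, 15, 45, 120]
fitted = [3.2, 25.0, 41.0, 90.0]
shares = classification_shares(observed, fitted)
```

With these toy values, two of the four fitted flows fall in the observed class, one falls in a higher class and one in a lower class.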
For comparative purposes, the following Poisson gravity model (6) was estimated; Table 5 shows the results obtained:
lg(λij)=lg(b0)+b1lg(pi)+b2lg(pj)-b3lg(dij)+b4F1i+b5F1j+b6F2i+b7F2j (6)
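How a Poisson regression with a logarithmic link is fitted by maximum likelihood can be illustrated with a minimal sketch. It is not the SAS procedure used in this paper: it assumes a single covariate x (which could stand, for example, for the log of distance), uses Newton's method with step halving, and runs on made-up data.

```python
import math

def log_likelihood(a, b, xs, ys):
    """Poisson log-likelihood, omitting the term -sum(log(y!)) that is free of a and b."""
    return sum(y * (a + b * x) - math.exp(a + b * x) for x, y in zip(xs, ys))

def fit_poisson(xs, ys, max_iter=100, tol=1e-10):
    """Maximum-likelihood fit of lg(lambda) = a + b * x by Newton's method."""
    a, b = math.log(sum(ys) / len(ys)), 0.0  # start from the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(a + b * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(ys, mu))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mu))
        # Information matrix (negative Hessian), 2 x 2 and symmetric.
        h00 = sum(mu)
        h01 = sum(m * x for x, m in zip(xs, mu))
        h11 = sum(m * x * x for x, m in zip(xs, mu))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        # Halve the step while it fails to improve the log-likelihood.
        step, current = 1.0, log_likelihood(a, b, xs, ys)
        while step > 1e-8 and log_likelihood(a + step * da, b + step * db, xs, ys) < current:
            step /= 2
        a, b = a + step * da, b + step * db
        if abs(step * da) + abs(step * db) < tol:
            break
    return a, b

# Made-up data lying exactly on lambda = exp(0 + lg(2) * x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1, 2, 4, 8]
a_hat, b_hat = fit_poisson(xs, ys)
```

Unlike the log-normal model, the fit works on the untransformed counts, so zero flows enter the likelihood without any added constant.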
Also in this case, the estimates of the intercept, the parameters associated with the population size of the regions (lg(pi) and lg(pj)), and that relative to the distance (lg(dij)), are significant.
The estimate of the parameter associated with the economic factor (F1i) in the regions of origin is not significant; by contrast, in the destination regions of flows (F1j) it is highly significant.
According to the signs, this factor is a push determinant in the regions of origin and a pull determinant in the destination regions of flows, while in the absolute values a predominant effect of the pull rather than push determinant is evident.
When consideration was made of the demographic factor of the places of origin (F2i) and of the places of destination (F2j), the estimates of the associated parameters were found to be non-significant.
|Parameter||Estimates||Wald 95% Confidence Limits||Wald Chi-Square||Pr > Chi-Square|
|Criteria for assessing goodness of fit|
Also in this case, the parameters associated with the population size of the regions, the distance, and the economic factor in the destination region of the flows are highly significant. Therefore, from this point of view, the two models are equivalent.
The Chi-square and pseudo R2 indices show that the Poisson gravity model also has high explanatory power. This result is confirmed in Table 6, which, as for the log-normal model, reports the cross tabulation between the fitted and the observed flows. In summary, 56% of the fitted flows match the observed flows; 17% are underestimated; and 27% are overestimated.
|Observed flows||up to10||(10-20]||(20-30]||(30-40]||(40-50]||(50-100]||(100-200]||(200-300]||over 300||Total|
Comparing the estimates of the Poisson model with the corresponding estimates of the log-normal model (Table 2) reveals, in general, no substantial differences. In particular, using the Euclidean distance between the parameter vectors of the two models as the similarity index (the smaller the distance, the greater the similarity), Table 7 shows that for α=0.5 and α=0.6 the Poisson and log-normal estimates almost coincide. If the constant of proportionality (the intercept) is excluded, the similarity between the two models is even more marked, and the distance decreases as α increases: the maximum similarity is reached when α is equal to 1.
|With the intercept|
|Without the intercept|
Another important characteristic is that the particular similarity between the two models tends to weaken if α approaches 0.1 or 1.
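The similarity index used in this comparison is an ordinary Euclidean distance between two vectors of estimates. A minimal sketch follows; the function name and the two vectors are hypothetical, with the intercept assumed to be stored first.

```python
import math

def parameter_distance(est_a, est_b, skip_intercept=False):
    """Euclidean distance between two vectors of parameter estimates.

    The intercept is assumed to be the first element of each vector.
    """
    start = 1 if skip_intercept else 0
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(est_a[start:], est_b[start:])))

# Made-up estimates, for illustration only.
log_normal_est = [1.0, 2.0, 3.0]
poisson_est = [4.0, 2.0, 7.0]

with_intercept = parameter_distance(log_normal_est, poisson_est)
without_intercept = parameter_distance(log_normal_est, poisson_est, skip_intercept=True)
```

Dropping the intercept removes the component most directly affected by the added constant, which is why the two versions of the index can rank the values of α differently.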
Since the criteria shown in Table 3 and the results set out in Table 7 indicated α=1 as the preferred value, the predictive performance of the Poisson and log-normal (for α=1) models was analysed.
Tables 8 and 9 compare the predicted (or fitted) flows of the log-normal gravity model and the Poisson gravity model. On analysing these two tables, a uniform behaviour of the two models is once again apparent.
In particular, Table 8 reports a classification of the predicted flows of the two models according to whether they are greater (overestimated) or smaller (underestimated) than the observed flows.
The Poisson model overestimates 63.68% and underestimates 36.32% of the observed flows. Consequently, this model is inclined to overestimate the flows. By contrast, the log-normal model is more evenly balanced between overestimation and underestimation.
Moreover, 84.48% of the predicted flows of the two models are concordant, i.e. 36.32% are underestimated and 48.16% are overestimated by both models.
|Log-normal model (α=1)||Poisson model: underestimated||Poisson model: overestimated||Total|
|Underestimated||138 (36.32%)||59 (15.52%)||197 (51.84%)|
|Overestimated||0 (0.00%)||183 (48.16%)||183 (48.16%)|
|Total||138 (36.32%)||242 (63.68%)||380 (100.0%)|
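The percentages quoted for this table follow directly from the four cell counts. The sketch below recomputes them, assuming (as the surrounding text suggests) that the rows refer to the log-normal model and the columns to the Poisson model; the dictionary layout is purely illustrative.

```python
# Cell counts keyed by (log-normal outcome, Poisson outcome).
counts = {
    ("under", "under"): 138,
    ("under", "over"): 59,
    ("over", "under"): 0,
    ("over", "over"): 183,
}
total = sum(counts.values())

def pct(n):
    """Percentage of the total number of flows."""
    return 100 * n / total

# Marginal shares of each model.
poisson_over = pct(counts[("under", "over")] + counts[("over", "over")])
log_normal_under = pct(counts[("under", "under")] + counts[("under", "over")])

# Flows on which both models err in the same direction.
concordant = pct(counts[("under", "under")] + counts[("over", "over")])
```

Computed from the counts, the concordant share is 321/380, or about 84.5%; the figure of 84.48% quoted in the text is the sum of the two rounded cell percentages.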
A more detailed analysis is set out in Table 9, which shows the cross tabulation between the fitted flows of the two models.
As can be seen, about 73% of the fitted values of the two models are concordant (that is, classified in the same classes) while 27% are discordant (that is, classified in different classes).
Moreover, considering the flows with a discordant classification, it is very clear that the Poisson fitted flows are in general less than the log-normal fitted flows.
|Fitted flows of the log-normal model|
|Fitted flows of the Poisson model||up to10||(10-20]||(20-30]||(30-40]||(40-50]||(50-100]||(100-200]||(200-300]||Total|
Put briefly, from the experimentation conducted it clearly emerges that in regard to the phenomenon analyzed:
• The problem of zero flows may be easily solved, and the solution is in line with those in the literature: a constant greater than or equal to 0.5 is a good choice, but the optimal choice is a constant equal to 1.
• All the classical hypotheses on the residual variable of the log-normal gravity model are satisfied.
• The estimates of the parameters of the two types of regression considered are very similar.
• The log-normal model tends slightly to underpredict the flows, whereas the Poisson model tends to overpredict the flows.
In conclusion, if the gravity model, as usually happens in real analyses, is extended in order to consider, besides the classic determinants, other potential factors that may influence the phenomenon under investigation, the log-normal model and the Poisson model behave in the same way and, contrary to claims in the literature, there are no reasons to prefer one model to the other, especially when the analysis is of explanatory type: that is, when it aims to determine the covariates that influence the interactions.
3. Conclusions
In the analysis of interaction phenomena of count data type, the literature suggests using the Poisson regression instead of the log-normal regression because the former model does not have certain drawbacks and seems to perform better in real analyses.
Starting from the hypothesis that the results in the literature are not completely convincing owing to the use of a model that suffers from omitted variables, this paper has compared the two regression models by means of a real interaction phenomenon.
In particular, the comparison was carried out using the migratory flows of foreign residents among the Italian regions. Following the literature, in addition to the classic covariates of the gravity model, the economic, social and demographic disparities among the territorial units were also considered.
The most important result obtained is that the two models show, in general, very similar behaviours in terms of both parameter estimates and goodness of fit. The only differences are that the Poisson model tends to overestimate small flows, while the log-linear model tends to underestimate the largest flows.
However, in contrast with the literature, the residual variable of the log-normal gravity model satisfies all the classic hypotheses, and the presence of the zero flows is an easily resolvable problem which does not restrict the model’s operability.
In conclusion, if the empirical analysis is of explanatory type, i.e. the goal is only to identify the covariates influencing the interaction phenomena, then both models are equally valid for use in practice. However, since the log-normal gravity model is richer in statistical properties and easier to interpret, it may be preferred to the Poisson model.
By contrast, if the analysis is of predictive type, the Poisson model may be preferable because it guarantees non-negative predictions, provided that the data do not show over- or under-dispersion and bearing in mind that the model overestimates the small flows.
|Valle d'Aosta (R2)||13||77||4||-||2||-||-||5||5||-||2||-||-||-||-||-||-||-||-||-|
|Trentino-Alto A. (R4)||12||-||68||919||75||28||4||23||10||4||7||4||6||-||5||4||-||-||5||2|
|Friuli-V. G. (R6)||9||-||69||12||165||754||10||28||21||-||3||15||-||1||2||5||-||1||2||2|
|Correlations between the variables and the first two factors|
Socio-economic and demographic variables
X1) Employment rate = Employed resident population / Total population resident in the region.
X2) Added value per capita = Regional added value / Total population resident in the region.
X3) Added value per person employed = Regional added value / Employed resident population.
X4) GDP per capita = Regional GDP / Total population resident in the region.
X5) GDP per person employed = Regional GDP / Employed resident population.
X6) % of employed in industry = Share of population resident in the region employed in industry.
X7) % of employed in agriculture = Share of population resident in the region employed in agriculture.
X8) % of employed in other activities = Share of population resident in the region employed in activities other than industry and agriculture.
X9) Consumption per capita = Resident population consumption / Total population resident in the region.
X10) Income per capita = Resident population income / Total population resident in the region.
X11) Units of labour per inhabitant = Number of regional labour units / Total population resident in the region.
X12) Size of unit of labour = Number of employed in the region / Number of regional labour units.
X13) Age dependency ratio = Regional resident population aged 65+ / Regional resident population aged 15-64.
X14) Index of turnover in the active population = Regional resident population aged 15-19 / Regional resident population aged 60-64.
X15) Proportion of persons aged 65 and over = Regional resident population aged 65+ / Regional resident population.
X16) Old-age dependency ratio = Regional resident population aged 65+ / Regional resident population aged 15-64.
X17) % of resident foreigners to total population = Number of foreigners resident in the region / Total population resident in the region.
X18) Index of active population structure = Regional resident population aged 40-64 / Regional resident population aged 15-39.