Servicios
Descargas
Buscar
Idiomas
P. Completa
Significance estimation for the Kullback-Leibler divergence: the Poissonian case in seismological studies
F. A. Nava
F. A. Nava
Significance estimation for the Kullback-Leibler divergence: the Poissonian case in seismological studies
Estimación de significatividad para la divergencia Kullback-Leibler: el caso Poissoniano en estudios sismológicos
Geofísica internacional, vol. 62, no. 3, pp. 519-523, 2023
Universidad Nacional Autónoma de México, Instituto de Geofísica
resúmenes
secciones
referencias
imágenes

Abstract: The Kullback-Leibler divergence, κ, is a widely used measure of the difference between an observed probability distribution and a reference one; κ=0 when the two distributions are equal, but it has no upper limit to help interpret the significance of any other κ value. Using as an example the problem of distinguishing clustering or gaps in the time occurrence of earthquakes from seismicity uniformly distributed in time, a Monte Carlo method for evaluating the significance of a particular κ value is presented, a method that takes into account the number of classes in the distributions and the length of the sample. Application of this method yields a probability according to which the hypothesis of the observed distribution being a realization of the reference one can be discarded or accepted with a quantitative degree of confidence. This method, and two possible reference values, are presented using the Poisson distribution as an example, but they can be used for other reference distributions.

Key words: Kullback-Leibler divergence, Poisson distribution, Statistical seismology.

Resumen: La divergencia Kullback-Leibler, κ, es una medida ampliamente usada de la diferencia entre una distribución de probabilidad observada y otra distribución de referencia; κ=0 cuando ambas distribuciones son iguales, pero no tiene un valor tope que permita interpretar la significatividad de cualquier otro valor de κ. Usando como ejemplo el problema de distinguir cúmulos o vacancias en la ocurrencia temporal de sismos de sismicidad distribuida con probabilidad uniforme en el tiempo, se presenta un método de Monte Carlo para evaluar la significatividad de algún valor de κ, método que toma en cuenta el largo de la muestra. Este método y dos posibles valores de referencia son presentados usando la distribución de Poisson como ejemplo, pero pueden ser utilizados con cualquier otra distribución de referencia.

Palabras clave: Divergencia Kullback-Leibler, Distribución de Poisson, Sismología estadística.

Carátula del artículo

Articles

Significance estimation for the Kullback-Leibler divergence: the Poissonian case in seismological studies

Estimación de significatividad para la divergencia Kullback-Leibler: el caso Poissoniano en estudios sismológicos

F. A. Nava
Centro de Investigación y de Educación Superior de Ensenada, México
Geofísica internacional, vol. 62, no. 3, pp. 519-523, 2023
Universidad Nacional Autónoma de México, Instituto de Geofísica

Received: 18 February 2023

Accepted: 10 April 2023

Published: 01 July 2023

Introduction

In many kinds of statistical studies, including seismological ones, it is a common task to compare some observed probability distribution P={pj; j=1,…,M} with some reference distribution Π={πj ; j=1,…,M}, and the difference between them is often measured by using the Kullback-Leibler divergence (K-L) κ:

κ j = 1 M p j log 2 p j π j (1)

(Kullback and Leibler, 1951; Eguchi and Copas, 2006). This measure is zero when P=Π, but it does not have a fixed upper limit; it can be infinite when one or more πj ’s are zero and the corresponding pj ’s are not (Lin, 1991; Shlens, 2007), but, for a reasonable reference distribution with no unbalanced zeros, how large can it be? It is necessary to have a reference value in order to assess the significance of any result other than zero.

In what follows, we will present a method for evaluating the confidence that can be had about two distributions being similar, based on a K-L measure, using a seismological example.

In assessing seismic hazard for a given region a common tool is to look for clustering or gaps in the times of occurrence of earthquakes above a given magnitude in the background seismicity, because those features may be precursors to a large earthquake. If the observed seismicity appears to show clusters or gaps, to assess their significance it is necessary to test whether they may be due to random concentrations in events occurring with uniform probability over time, i.e. to test the observations versus the null hypothesis.

One way to test the null hypothesis is to use the well known fact that if events are occurring randomly with uniform probability in time at a rate of λ events per unit time, then the number of events n occurring within intervals of a given length T are distributed according to the Poisson distribution:

Pr n , T = e - λ T λ T n n ! (2)

(e.g. Mack, 1967; Dekking et al., 2005; Boxma and Yechiali, 2007), so we will compare the distribution of observed n’s with the Poisson distribution using the K-L divergence.

Significance of the K-L measure

Suppose that the times of occurrence of Ne =160 earthquakes ocurred over a period Ny =60 years have been observed, as shown in Figure 1 (top), which gives an occurrence rate λ=2.6̅6̅6̅6̅6̅ events/year, and , for the sake of simplicity, let us consider yearly intervals so T=1.0 yr in (2). Figure 1 (middle) shows the number of events per year, n, for each observed year, and (bottom) the corresponding histogram, as well as the expected number of events from the Poisson distribution.


Figure 1
Times of occurrence 160 earthquakes with Gutenberg-Richter distributed magnitudes over a period of 60 years (top), number of earthquakes per year for each year (middle), and (bottom) histogram of numbers of earthquakes per year (blue line) and expected frequencies of earthquakes per year from the Poisson distribution.

The two distributions are not equal but, quantitatively, how different are they? We will measure their divergence using K-L. Observed probabilities are estimated from the histogram as

p n = N n k = 0 n m a x N k ; n = 0,1 , , n m a x , (3)

where Nn is the observed number of incidences of n events/ year and nmax is the maximum observed n. We will compare this observed probability distribution with the reference

π n = e - λ λ n n ! ; n = 0,1 , , , (4)

as shown in Figure 2.


Figure 2
Observed distribution of n and corresponding values from the Poisson distribution (red diamonds) for λ=2.6̅6̅6̅6̅6̅ events/year indicated by the vertical dashed line. The K-L divergence between the two distributions is κ=0.1870.

To evaluate the K-L divergence between these two distributions, although the Poisson distribution is non-zero for an infinite number of terms, the summation in (1) is only done from 0 to nmax , because P can be considered equal to zero for n>nmax and terms with pn =0 do not contribute to the summation.

The K-L evaluation yields κ=0.1870, but, what does this number mean (apart from the distributions not being equal)? Here, it must be considered that, as is common for seismological studies, particularly those involving large magnitudes, the observed distribution comes from only one very short realization consisting of only Ny =60 events.

Let us estimate how probable is the observed κ for samples of size Ny of a Poisson process. We will do this through a Monte Carlo simulation (Yakowitz, 1977; Rubinstein and Kroese, 2016) of Nr =100,000 realizations of synthetic samples of Ny Poisson distributed numbers n; the divergence κ between the resulting distribution P and Π from (3) is evaluated for each realization.

The probability distribution f(κ) resulting from the simulation is shown in Figure 3 (top), the distribution has mean μκ =0.1066 and standard deviation σκ =0.0499, and it is clear that the probability that a 60 samples long random realization of a Poissonian process actually results in κ=0 is extremely small. Indeed, if the sampled process were indeed Poissonian, instead of κ=0, a value around κ=0.082 would be much more probable.


Figure 3
Distribution of κ values from the Monte Carlo simulation (top), ∆κ is the class width, the dashed thick vertical line indicates the mean μκ of the Nr =100,000 realizations, and the thin dashed vertical lines indicate plus/minus one standard deviation σκ from the mean. Cumulative κ distribution (bottom), the dashed black line is μκ , and the dotted red line indicates the observed κ=0.1870.

The cumulative F(κ) in Figure 3 (bottom) shows the observed κ=0.1870, and gives Pr(κ≥0.1870)=0.0664, so the possibility of the null hypothesis, that the observed seismicity occurred with uniform probability in time, can be rejected with 0.9336 confidence. This number constitutes a firm basis for the decision of whether to reject the null hypothesis or not; in this case the observed seismicity is, with high probability, not distributed uniformly in time, although the null hypothesis cannot be rejected at the widely used significance level of 0.05. It should be pointed out that this confidence estimation takes into account the sample length.

Other reference measures

We have presented a practical way for assessing the significance of a K-L measurement. Now, just for argument’s sake, let us consider two other reference measures.

First, consider the uniform distribution, which is a common reference because it has the highest entropy (Shannon, 1948), with probabilities

p n U = 1 n m a x + 1 ; n = 0,1 , , n m a x , (5)

shown as circles in Figure 4 (left). The K-L divergence between the uniform and Poisson distributions for nmax =9 is κU =1.22055, much higher than the value for our example distribution and has a probability value of nearly zero.


Figure 4
Poisson distribution for λ=2.6̅6̅6̅6̅6̅ events/year, and nmax =9, indicated by red diamonds, and the uniform (left) and opposite (right) reference distributions.

The second reference distribution is the “opposite” distribution to Poisson:

p n 0 = π m a x - π n k = 0 n m a x π m a x - π κ ; n = 0,1 , , n m a x , (6)

where πmax is the maximum value of the Poisson distribution. It is an inverted and normalized Poisson as shown by circles in Figure 4 (right). The K-L divergence for the opposite distribution is κO =2.82680, and it is considerably higher than the observed κ and the reference κU ; its divergence is not the highest possible between the Poisson distribution and another distribution of the same length, but it is the most different distribution in an intuitive way.

Of course, both reference values should be estimated for the same n range as that of the observed distribution (they increase with the length of the range), and may not be very useful because they have very large κ values with vanishingly small probabilities, but still they are better reference values than infinity.

Discussion

We presented a method of estimating significance levels associated with measures of the K-L divergence, and proposed two possible reference values. The Poisson distribution was used as an example, but the method is applicable to any other reference distribution.

The Monte Carlo evaluation of the significance of a κ value illustrates quite clearly the problem of having to work with small samples, which is unfortunately the case with many studies in statistical seismology, due to the relative shortness and heterogeneity of the seismic catalogs, particularly for studies dealing with large earthquakes. One of the advantages of the proposed method is that it takes into account variations caused from samples that are but small realizations of a stochastic process. In the example presented here we showed that a Ny =60 long sample from a true Poissonian process has almost null probability of resulting in κ=0, because for that sample length the κ values are distributed with mean μκ =0.1066 and standard deviation σκ =0.0499, while samples of Ny =120 have μκ =0.0561 and σκ =0.0255, and for Ny =180 μκ =0.0384 and σκ =0.0171 (all for equal λ). Hence, κ values cannot be correctly interpreted without taking into account the sample length, something that our proposed method does implicitly.

The method proposed here can be a useful tool in studies of seismic hazard, where it is essential to distinguish, with a quantitative bas is, between an apparently anomalous distribution being observed and the null hypothesis.

Supplementary material
Acknowledgments

Many thanks to two anonimous reviewers and to the editorial staff of Geofísica Internacional: Ana Teresa Mendoza Rosas, Servando de la Cruz, and Andrea Rostan.

References
Boxma, O. J., & Yechiali, U. (2007). Poisson processes, ordinary and compound. Encyclopedia of statistics in quality and reliability.
Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., & Meester, L. E. (2005). A Modern Introduction to Probability and Statistics: Understanding why and how (Vol. 488). London: Springer.
Eguchi, S., & Copas, J. (2006). Interpreting kullback-leibler divergence with the neyman-pearson lemma. Journal of Multivariate Analysis, 97(9), 2034-2040.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory, 37(1), 145-151.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
Mack, C. (1967). Essentials of statistics for scientists and technologists. Plenum Press, New York, 173pp.
Rubinstein, R. Y., & Kroese, D. P. (2016). Simulation and the Monte Carlo method. John Wiley & Sons
Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3), 379-423.
Shlens, J. (2007). Notes on kullback-leibler divergence and likelihood theory. Systems Neurobiology Laboratory, 92037, 1-4.
Yakowitz, S. (1977). Computational probability and simulation. Addison-Wesley Pub. Co.
Notes
Author notes
Editorial responsibility: Dra. Ana Teresa Mendoza Rosas.

Figure 1
Times of occurrence 160 earthquakes with Gutenberg-Richter distributed magnitudes over a period of 60 years (top), number of earthquakes per year for each year (middle), and (bottom) histogram of numbers of earthquakes per year (blue line) and expected frequencies of earthquakes per year from the Poisson distribution.

Figure 2
Observed distribution of n and corresponding values from the Poisson distribution (red diamonds) for λ=2.6̅6̅6̅6̅6̅ events/year indicated by the vertical dashed line. The K-L divergence between the two distributions is κ=0.1870.

Figure 3
Distribution of κ values from the Monte Carlo simulation (top), ∆κ is the class width, the dashed thick vertical line indicates the mean μκ of the Nr =100,000 realizations, and the thin dashed vertical lines indicate plus/minus one standard deviation σκ from the mean. Cumulative κ distribution (bottom), the dashed black line is μκ , and the dotted red line indicates the observed κ=0.1870.

Figure 4
Poisson distribution for λ=2.6̅6̅6̅6̅6̅ events/year, and nmax =9, indicated by red diamonds, and the uniform (left) and opposite (right) reference distributions.
Buscar:
Contexto
Descargar
Todas
Imágenes
Scientific article viewer generated from XML JATS by Redalyc