A comparative study between two discrete Lindley distributions

Ricardo Puziol Oliveira; Josmar Mazucheli; Jorge Alberto Achcar

Biology-Botany

Received: 14/03/17

Accepted: 11/05/17

Abstract: Methods for generating a probability function from a probability density function have been widely used in recent years. In general, the discretization process produces probability functions that can rival traditional distributions used in the analysis of count data, such as the geometric, Poisson and negative binomial distributions. Discretization also avoids the use of a continuous distribution in the analysis of strictly discrete data. In this paper, using the method based on an infinite series proposed by Good (1953), we study an alternative discrete Lindley distribution to those studied in Gómez-Déniz and Calderín-Ojeda (2011) and Bakouch et al. (2014). For both distributions, a simulation study is carried out to examine the bias and mean squared error of the maximum likelihood estimators of the parameters, as well as the coverage probability and the width of the confidence intervals. For the discrete Lindley distribution obtained by the infinite series method, we present the analytical expression for bias reduction of the maximum likelihood estimator. Some examples using real data from the literature show the potential of these distributions. Although the discretization methods are quite different, the resulting distributions are interchangeable; however, the distribution generated by an infinite series has simple mathematical expressions and can be applied directly to count data in the presence of covariates.

Keywords: Discretization methods, Lindley distribution, likelihood, series, survival analysis, Monte Carlo simulation.

1 Introduction

In recent years, the generation of discrete analogues of continuous random variables has been considered by several authors (see, for example, Chakraborty 2015). The main purpose of discretizing a continuous probability density function is to generate a distribution for the analysis of strictly discrete data. For example, in survival data analysis it is common to use continuous distributions for discrete data, so discretization offers a way to avoid this practice. Many applications of continuous distributions to the analysis of discrete data are presented in lifetime data books, for example: Hamada et al. (2008); Collett (2003); Lee and Wang (2003); Lawless (2003); Kalbfleisch and Prentice (2002); Meeker and Escobar (1998); Klein and Moeschberger (1997), among others.

One of the first discretized distributions introduced in the literature was the Weibull distribution. From the Weibull distribution with probability density function:

(1)

and survival function:

(2)

Nakagawa and Osaki (1975) proposed the discrete Weibull distribution, whose probability function can be written as:

(3)

where x ∈ {0, 1, 2, …} and μ, β > 0 are, respectively, the scale and shape parameters. It is easy to verify that (3) is, in fact, a probability function.

Recently, Nekoukhou et al. (2012), using the method based on an infinite series, introduced the discrete generalized exponential distribution, whose probability function is written as:

(4)

where x Є , (^α_j⁻¹ ) = (α − 1) · · · (α − j), 0 < λ < 1 and α > 0.

In this paper, also considering the method based on an infinite series, we introduce an alternative discrete Lindley distribution and compare it with the version presented in Gómez-Déniz and Calderín-Ojeda (2011) and Bakouch et al. (2014). In Section 2, two discretization methods are presented, and the expressions resulting from their application to the Lindley distribution are given in Section 3. In Section 4, the biases and mean squared errors of the maximum likelihood estimates are studied. Some applications are presented in Section 5, and Section 6 presents some concluding remarks.

2 Discretization methods

2.1 Discretization by survival function

Proposed by Nakagawa and Osaki (1975), this method discretizes a continuous random variable from its survival function. Some properties of discrete analogues of continuous distributions obtained by this method were studied by, among others, Kemp (2004), Bracquemond and Gaudoin (2003), Roy (2003) and Chakraborty (2015).

Following Kemp (2004), we can define a discrete analogue of a continuous random variable as follows:

Definition 1: Let X be a continuous random variable. If X has survival function S_X(x), then the discrete random variable Y = ⌊X⌋, where ⌊X⌋ denotes the largest integer less than or equal to X, has PMF (probability mass function) written as:

(5)

It is easily verified that (5) is, in fact, a probability function on the integer support of Y. If the survival function of X has a compact form, then the PMF (5) will also have a compact form.
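As a minimal sketch of Definition 1, assuming that (5) takes the usual form P(Y = x) = S_X(x) − S_X(x + 1), the Python snippet below discretizes an exponential lifetime. The helper name `discretize_by_survival` and the rate value are illustrative choices and are not part of the original study.

```python
import math

def discretize_by_survival(survival):
    """Return the PMF x -> S(x) - S(x + 1) of Y = floor(X)."""
    def pmf(x):
        return survival(x) - survival(x + 1)
    return pmf

# Exponential lifetime, S(t) = exp(-rate * t); the rate is an arbitrary example value.
rate = 0.7
pmf = discretize_by_survival(lambda t: math.exp(-rate * t))

# The discretized exponential is the geometric law q**x * (1 - q), with q = exp(-rate).
q = math.exp(-rate)
total = sum(pmf(x) for x in range(200))
print(round(total, 10), pmf(3), q ** 3 * (1 - q))
```

In this example the discretized law coincides with the geometric distribution, which gives a quick check that the construction yields a proper probability function.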

Some distributions discretized by this method in the literature are: the inverse Rayleigh distribution (Hussain and Ahmad, 2014), the Lindley distribution (Gómez-Déniz and Calderín-Ojeda, 2011; Bakouch et al., 2014), the type II generalized exponential distribution (Nekoukhou et al., 2013), the gamma distribution (Chakraborty and Chakravarty, 2012), the inverse Weibull distribution (Aghababaei Jazi et al., 2010), the Burr XII and Pareto distributions (Krishna and Pundir, 2009), the Rayleigh distribution (Roy, 2004) and the geometric Weibull distribution (Bracquemond and Gaudoin, 2003), among others.

2.2 Discretization by an infinite series

The first traces of this method appeared in Good (1953), in a study modeling the population frequencies of species. Later, other authors such as Kulasekera and Tonkyn (1992), Doray and Luong (1997), Kemp (1997) and Sato et al. (1999) studied this method and presented versions of it for a continuous random variable with support (−∞, ∞) or (0, ∞).

Definition 2: Let X be a continuous random variable. If X has pdf f(x) with support −∞ < x < ∞, then the corresponding discrete random variable Y has PMF as follows:

(6)

In the case where the support of X is (0, ∞), according to Sato et al. (1999), the PMF of Y is:

(7)
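The idea behind Definition 2 can be sketched in Python as follows: the density is evaluated at the integers and divided by the sum of those values. The starting index of the series (`start=0` here), the truncation after `terms` values and the helper name `discretize_by_series` are assumptions made for illustration only.

```python
import math

def discretize_by_series(pdf, start=0, terms=5000):
    """PMF proportional to the density evaluated at start, start + 1, ..."""
    # The infinite series is truncated; the tail is assumed to be negligible.
    norm = sum(pdf(j) for j in range(start, start + terms))
    def pmf(x):
        return pdf(x) / norm
    return pmf

# Exponential density with an arbitrary example rate.
rate = 0.7
pmf = discretize_by_series(lambda t: rate * math.exp(-rate * t))

# With the series started at zero, the result is again geometric: (1 - q) * q**x.
q = math.exp(-rate)
print(pmf(2), (1 - q) * q ** 2)
```

For a density supported on the whole real line, the same normalization would run over all integers instead of the non-negative ones.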

Some distributions discretized by this method in the literature are: the Pearson III distribution (Haight, 1957), Dirichlet's series distribution (Siromoney, 1964), the Gaussian distribution (Kemp, 1997), the gamma and exponential distributions (Sato et al., 1999), the log-Gaussian distribution (Bi et al., 2001), the Laplace distribution (Inusah and Kozubowski, 2006), the skew-Laplace distribution (Kozubowski and Inusah, 2006), the half-Gaussian distribution (Kemp, 2008) and the beta-exponential distribution (Nekoukhou et al., 2012), among others.

3 The discrete Lindley distribution

3.1 Discretization by survival function

Let X be a continuous random variable with Lindley distribution. Using the survival function of X, Gómez-Déniz and Calderín-Ojeda (2011) and Bakouch et al. (2014) presented the discrete Lindley distribution with PMF written in the form:

(8)

where x ∈ {0, 1, 2, …} and β > 0.

The behavior of (8) for some values of β is shown in Figure 1. Note that the PMF is unimodal and, when β > 1, the mode is at zero (Bakouch et al., 2014).

From (8) we have:

(9)

and:

(10)

Analyzing the ratio between E(X) and V(X), we can see that E(X) < V(X) for all β > 0. Therefore, this discrete version should only be used in the analysis of data with overdispersion. For more details, see Bakouch et al. (2014).
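The overdispersion property can be checked numerically. The sketch below uses the Lindley survival function S(t) = (1 + β + βt)e^{−βt}/(1 + β) and the closed form of S(x) − S(x + 1); the moments are approximated by truncated sums, and the function names and the truncation point are illustrative assumptions.

```python
import math

def lindley_sf(t, beta):
    """Survival function of the continuous Lindley distribution."""
    return (1 + beta + beta * t) / (1 + beta) * math.exp(-beta * t)

def dls_pmf(x, beta):
    """S(x) - S(x + 1) written in closed form."""
    q = math.exp(-beta)
    bracket = beta * (1 - 2 * q) + (1 - q) * (1 + beta * x)
    return math.exp(-beta * x) * bracket / (1 + beta)

def dls_mean_var(beta, upper=5000):
    """Mean and variance from a truncated sum over the support."""
    probs = [dls_pmf(x, beta) for x in range(upper)]
    mean = sum(x * p for x, p in enumerate(probs))
    second = sum(x * x * p for x, p in enumerate(probs))
    return mean, second - mean ** 2

# The values of beta are those used in the panels of Figure 1.
for beta in (0.1, 0.2, 0.5, 1.2):
    mean, var = dls_mean_var(beta)
    print(beta, round(mean, 4), round(var, 4), mean < var)
```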

3.2 Estimation

Let x₁, . . . , x_n be a random sample from a distribution with PMF (8); the log-likelihood function of the discrete Lindley distribution is given by:

(11)

The maximum likelihood estimator β̂ of β is obtained by solving numerically, for β, the score equation dl(β | x)/dβ = 0, where:

(12)

Note that this expression is non-linear in β and must be solved numerically. However ≈ −0.5 ( 1 − ) if e^{− β} ≈ 1 (Bakouch et al., 2014).

Confidence intervals for β, as well as hypothesis tests of interest, can be constructed from the asymptotic normality of the maximum likelihood estimator for large sample sizes.
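A possible numerical treatment is sketched below in Python. It maximizes the log-likelihood implied by the survival-function PMF with a golden-section search and builds a Wald interval from a finite-difference approximation of the observed information. The data, the search bounds and the function names are hypothetical, and the search assumes that the log-likelihood is unimodal on the chosen interval.

```python
import math

def dls_loglik(beta, data):
    """Log-likelihood of the DLS model for a list of counts."""
    q = math.exp(-beta)
    total = 0.0
    for x in data:
        bracket = beta * (1 - 2 * q) + (1 - q) * (1 + beta * x)
        total += -beta * x - math.log(1 + beta) + math.log(bracket)
    return total

def golden_max(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximiser of a unimodal function."""
    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if f(left) > f(right):
            hi = right
        else:
            lo = left
    return (lo + hi) / 2

def wald_interval(beta_hat, data, z=1.96, h=1e-4):
    """Asymptotic interval from a finite-difference observed information."""
    f = lambda b: dls_loglik(b, data)
    second = (f(beta_hat + h) - 2 * f(beta_hat) + f(beta_hat - h)) / h ** 2
    se = math.sqrt(-1.0 / second)
    return beta_hat - z * se, beta_hat + z * se

counts = [0, 1, 1, 2, 3, 5, 0, 4, 2, 7, 1, 3]  # hypothetical counts
beta_hat = golden_max(lambda b: dls_loglik(b, counts), 1e-3, 10.0)
lower, upper = wald_interval(beta_hat, counts)
print(round(beta_hat, 4), round(lower, 4), round(upper, 4))
```

A derivative-free search is used here only to keep the sketch free of external dependencies; any one-dimensional optimizer or root finder applied to the score equation would serve the same purpose.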

Figure 1:
Behavior of the probability function of the discrete Lindley distribution, obtained by survival function, considering different values for β (upper-left panel: β = 0.1, upper-right panel: β = 0.2, lower-left panel: β = 0.5 and lower-right panel: β = 1.2).

3.3 Discretization by infinite series

By the discretization method presented in Section 2.2, the discrete Lindley distribution has PMF written in the form:

(13)

where x ∈ {0, 1, 2, …} and β > 0.

From (13), the PMF is unimodal, with mode given by:

(14)

In fact, note that:

The right-hand side of the above inequality is the same as P(X = x − 1)P(X = x + 1). Thus, (13) satisfies the log-concavity inequality P²(X = x) ≥ P(X = x − 1)P(X = x + 1) for x = 1, 2, . . . and, therefore, by Theorem 3 of Keilson and Gerber (1971), it is unimodal. Figure 2 illustrates the behavior of (13) for some values of β.

For a random variable X with PMF (13), the corresponding probability generating function and moment generating function can be expressed, respectively, as:

Figure 2:
Behavior of the probability function of the discrete Lindley distribution, obtained by infinite series, considering different values for β (upper-left panel: β = 0.1, upper-right panel: β = 0.2, lower left panel: β = 0.5 and lower-right panel: β = 1.2).

Unlike the discretized version obtained by survival function, the version proposed here has simple expressions for the mean and variance:

For all β > 0 it is easy to see that E(X) < V(X). Thus, this distribution can be used in the analysis of count data with overdispersion. The dispersion index is written as .
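The following Python sketch illustrates these quantities under an explicit assumption: that the series normalizing the Lindley density f(x) = β²(1 + x)e^{−βx}/(1 + β) runs over x = 0, 1, 2, …. Under that assumption the PMF reduces to (1 − e^{−β})²(1 + x)e^{−βx}, with mean 2/(e^β − 1), variance 2e^β/(e^β − 1)² and dispersion index e^β/(e^β − 1). These closed forms are derived here for illustration and are not transcribed from the displayed equations; the function names are hypothetical.

```python
import math

def lindley_pdf(t, beta):
    """Density of the continuous Lindley distribution."""
    return beta ** 2 / (1 + beta) * (1 + t) * math.exp(-beta * t)

def dlis_pmf_series(x, beta, terms=5000):
    """Density at x divided by the truncated series over 0, 1, 2, ..."""
    norm = sum(lindley_pdf(j, beta) for j in range(terms))
    return lindley_pdf(x, beta) / norm

def dlis_pmf(x, beta):
    """Closed form obtained under the stated assumption."""
    return (1 - math.exp(-beta)) ** 2 * (1 + x) * math.exp(-beta * x)

def dlis_moments(beta):
    """Mean, variance and dispersion index under the stated assumption."""
    mean = 2 / math.expm1(beta)
    var = 2 * math.exp(beta) / math.expm1(beta) ** 2
    return mean, var, var / mean

for beta in (0.1, 0.2, 0.5, 1.2):
    mean, var, index = dlis_moments(beta)
    print(beta, round(mean, 4), round(var, 4), round(index, 4))
```

Under this assumption the dispersion index exceeds one for every β > 0, in line with the overdispersion noted above.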

3.4 Estimation

Let x₁, . . . , x_n be a random sample from (13); the log-likelihood function is given by:

(15)

The maximum likelihood estimator β̂ of β is obtained by solving dl(β | x)/dβ = 0 in β. That is:

(16)

The second derivative of the log-likelihood function is given by:

(17)

therefore:

(18)

Solving (18) locally in β, we have:

(19)

Theorem 1: The estimator β̂ of β is positively biased, that is, E(β̂) − β > 0.

Proof: Let

and

for t > 0. Since

g(t) is strictly convex. Thus, by Jensen's inequality, we have E(g(X̄)) > g(E(X̄)). Finally, since:

we obtain E(β̂) > β. Therefore, the estimator β̂ of β is positively biased.
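To make the argument concrete, the sketch below assumes that the PMF obtained by infinite series reduces to (1 − e^{−β})²(1 + x)e^{−βx} on x = 0, 1, 2, …. Under that assumption the score equation has the closed-form root β̂ = log(1 + 2/x̄), so that β̂ = g(x̄) with g(t) = log(1 + 2/t) and g''(t) = 4(t + 1)/(t²(t + 2)²) > 0. These expressions are derived here for illustration and may differ in presentation from the displayed equations; the data are hypothetical.

```python
import math

def dlis_score(beta, data):
    """Derivative of the assumed log-likelihood with respect to beta."""
    return 2 * len(data) / math.expm1(beta) - sum(data)

def dlis_mle(data):
    """Closed-form root of the score equation (requires a positive mean)."""
    xbar = sum(data) / len(data)
    return math.log(1 + 2 / xbar)

def g(t):
    """Map from the sample mean to the estimator of beta."""
    return math.log(1 + 2 / t)

def g_second(t):
    """Second derivative of g, positive for every t > 0."""
    return 4 * (t + 1) / (t ** 2 * (t + 2) ** 2)

counts = [0, 1, 1, 2, 3, 5, 0, 4, 2, 7, 1, 3]  # hypothetical counts
beta_hat = dlis_mle(counts)
print(round(beta_hat, 4), dlis_score(beta_hat, counts))

# Strict convexity: the value at a midpoint lies below the chord.
print(g(3.0) < (g(1.0) + g(5.0)) / 2)
```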

Cox and Snell (1968) provided a framework for estimating the bias, to O(n⁻¹), of the maximum likelihood estimators of the parameters of regular densities. Subtracting the estimated bias from the original maximum likelihood estimator then produces a bias-corrected estimator that is unbiased to O(n⁻²). This type of bias adjustment can be applied successfully to the discrete Lindley distribution given in (13). Following Cox and Snell (1968) we have:

(20)

In this way, the bias-corrected maximum likelihood estimator β̂_CMLE can be written as:

(21)
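The effect of a first-order correction can be examined without simulation. The sketch below assumes that the PMF obtained by infinite series is (1 − e^{−β})²(1 + x)e^{−βx} on x = 0, 1, 2, …, so that β̂ = log(1 + 2/x̄) and the sample total follows a negative binomial law with size 2n. Under that assumption a second-order expansion gives the first-order bias sinh(β)/(2n). This expression is derived here for illustration and is not a transcription of (20) or (21); all function names are hypothetical.

```python
import math

def total_count_pmf(n, beta, s_max):
    """P(S = s), s = 0..s_max, for the sample total S (negative binomial, size 2n)."""
    q = math.exp(-beta)
    probs = [(1 - q) ** (2 * n)]
    for s in range(s_max):
        probs.append(probs[-1] * q * (s + 2 * n) / (s + 1))
    return probs

def expected_estimators(n, beta, s_max=5000):
    """Exact means of the MLE and of the corrected MLE, given a positive total."""
    probs = total_count_pmf(n, beta, s_max)
    mass = sum(probs[1:])
    raw = corrected = 0.0
    for s in range(1, s_max + 1):
        b = math.log(1 + 2 * n / s)            # MLE when the total equals s
        raw += b * probs[s]
        corrected += (b - math.sinh(b) / (2 * n)) * probs[s]
    return raw / mass, corrected / mass

n, beta = 50, 0.5
raw_mean, corrected_mean = expected_estimators(n, beta)
print(raw_mean - beta, corrected_mean - beta, math.sinh(beta) / (2 * n))
```

Samples with a zero total are excluded because the estimator is not finite there; for the values used in the example this event has negligible probability.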

Re-parameterizing (13) in terms of the mean θ = we have β = log( ) such that = x. The bias-corrected maximum likelihood estimator for θ is given by − θ. It is important to point out that in terms of θ we have

(22)

such that E (X) = θ and V(X) = θ +

4 Simulation study

In this section we estimate, by Monte Carlo simulation, the biases, the mean squared errors, the coverage probabilities and the coverage lengths of the maximum likelihood estimator β̂ for the discrete Lindley distributions obtained by survival function and by infinite series. For computational stability, we assumed the values β = 0.2, 0.5, 0.8, 1.0, 1.2 and sample sizes n = 10, 20, . . . , 90, 100. For each scenario, we calculated:

where I{·} denotes the indicator function and N = 10,000 is the number of simulations. The simulation study was performed using R version 3.3.0 (R Core Team, 2015).
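A reduced version of such a study can be sketched in Python (the original study used R). The sketch assumes that the PMF obtained by infinite series is (1 − e^{−β})²(1 + x)e^{−βx} on x = 0, 1, 2, …, which can be sampled as the sum of two geometric variables and gives β̂ = log(1 + 2/x̄) with observed information 2n e^{β̂}/(e^{β̂} − 1)². The seed, the number of replications and the function names are illustrative choices.

```python
import math
import random

def sample_dlis(n, beta, rng):
    """Draw n counts; each is the sum of two geometric variables on {0, 1, ...}."""
    def geometric():
        u = 1.0 - rng.random()              # uniform on (0, 1]
        return int(math.log(u) / -beta)     # inversion of P(G >= k) = exp(-beta * k)
    return [geometric() + geometric() for _ in range(n)]

def simulate(beta, n, reps, seed=2017, z=1.96):
    """Bias, MSE, coverage probability and mean length of the Wald interval."""
    rng = random.Random(seed)
    estimates, covered, lengths = [], 0, []
    while len(estimates) < reps:
        total = sum(sample_dlis(n, beta, rng))
        if total == 0:                      # estimator not finite; redraw the sample
            continue
        b = math.log(1 + 2 * n / total)
        se = math.expm1(b) / math.sqrt(2 * n * math.exp(b))
        lo, hi = b - z * se, b + z * se
        estimates.append(b)
        covered += lo <= beta <= hi
        lengths.append(hi - lo)
    bias = sum(estimates) / reps - beta
    mse = sum((b - beta) ** 2 for b in estimates) / reps
    return bias, mse, covered / reps, sum(lengths) / reps

bias, mse, coverage, length = simulate(beta=0.5, n=20, reps=4000)
print(round(bias, 4), round(mse, 4), round(coverage, 3), round(length, 3))
```

Repeating the call over a grid of β and n values would reproduce the layout of the scenarios described above, at a smaller number of replications.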

Table 1 presents the simulation results for the discrete Lindley distribution obtained by survival function. Tables 2 and 3 present the simulation results for the maximum likelihood estimator and the bias-corrected maximum likelihood estimator for the discrete Lindley distribution obtained by infinite series.

In every scenario, for both discretizations, the bias of β̂ is positive and tends to zero as the sample size increases. The mean squared error of β̂ also tends to zero in every scenario. Regarding the coverage probabilities, CP_β(n) ranges from 0.94 to 0.96, and the coverage length tends to zero as the sample size increases.

Table 1:
Estimated bias, mean squared error, coverage probability and coverage length for β (by survival function).

Table 2:
Estimated bias, mean squared error, coverage probability and coverage length for β (by infinite series).

Table 3:
Estimated bias, mean squared error, coverage probability and coverage length for β̂_CMLE (by infinite series).

5 Applications

5.1 Application 1 (without covariates)

Consider a dataset on the number of times that a computer broke down in each of 128 consecutive weeks of operation (Chakraborty and Chakravarty, 2012). The mean and variance are, respectively, x̄ = 4.023 times and s² = 14.464 times², which indicates overdispersion. The fit of the discrete Lindley distribution obtained by infinite series (DLIS) was compared to the fits of the discrete Lindley distribution obtained by survival function (DLS), P(X = x | β) = e^{−βx}(1 + β)⁻¹[β(1 − 2e^{−β}) + (1 − e^{−β})(1 + βx)] (Bakouch et al., 2014), a discrete Rayleigh (DR), P(X = x | θ) = θ^{x²} − θ^{(x+1)²} (Roy, 2004), a geometric (G), P(X = x | θ) = θ^x − θ^{x+1}, and a Poisson (P), P(X = x | β) = e^{−β}β^x / x!.

The parameters were estimated by the maximum likelihood method, and to compare the fits we considered the values of −log L, AIC, BIC and the X² goodness-of-fit statistic (see Table 5). We conclude that the results for DLIS and DLS are almost the same. However, in terms of the simplicity of its equations and its computational stability, the DLIS distribution is preferable to the other distributions considered in this application.
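The comparison by information criteria can be sketched as follows for three of the one-parameter models, each of which has a closed-form maximum likelihood estimate. The sketch assumes the DLIS PMF (1 − e^{−β})²(1 + x)e^{−βx} on x = 0, 1, 2, …, uses AIC = 2k − 2 log L and BIC = k log n − 2 log L, and is run on hypothetical counts rather than on the computer breakdown data.

```python
import math

def dlis_loglik(data):
    """Maximised DLIS log-likelihood, with beta_hat = log(1 + 2 / mean)."""
    n, total = len(data), sum(data)
    beta = math.log(1 + 2 * n / total)
    return (2 * n * math.log(1 - math.exp(-beta))
            + sum(math.log(1 + x) for x in data) - beta * total)

def geometric_loglik(data):
    """Maximised log-likelihood of P(X = x) = theta**x - theta**(x + 1)."""
    n, total = len(data), sum(data)
    theta = total / (n + total)
    return total * math.log(theta) + n * math.log(1 - theta)

def poisson_loglik(data):
    """Maximised Poisson log-likelihood, with rate equal to the sample mean."""
    n, total = len(data), sum(data)
    lam = total / n
    return total * math.log(lam) - n * lam - sum(math.lgamma(x + 1) for x in data)

def criteria(loglik, k, n):
    """AIC and BIC for a model with k parameters fitted to n observations."""
    return 2 * k - 2 * loglik, k * math.log(n) - 2 * loglik

counts = [0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 8, 12]  # hypothetical overdispersed counts
models = {"DLIS": dlis_loglik, "G": geometric_loglik, "P": poisson_loglik}
table = {name: criteria(f(counts), 1, len(counts)) for name, f in models.items()}
for name, (aic, bic) in table.items():
    print(name, round(aic, 2), round(bic, 2))
```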

Table 4:
Observed and expected numbers of computer breakdowns considering the DLIS, DLS, DR, G, P and NB distributions.

Table 5:
Parameter estimates and goodness-of-fit measures.

5.2 Application 2 (with covariates)

In this application, we considered a dataset introduced by Long (1990) on the number of publications produced by Ph.D. biochemists to illustrate the application of the discrete Lindley distributions (DLIS and DLS) in the presence of covariates. Their fit is compared to that of the negative binomial distribution.

This dataset has also been analyzed by Long et al. (2001) and is available from the Stata website http://www.stata-press.com/data/lf2/couart2.dta. The mean number of articles is 1.69 and the variance is 3.71, a little more than twice the mean (see Table 6). The data are overdispersed. Results are shown in Tables 7 and 8. For both distributions we consider: log(β) = β₀ + Σ β_i x_i, where the x_i are described in Table 6.

Table 6:

Dataset: Number of publications produced by Ph.D. biochemists.

Table 7:
Parameter estimates and standard errors for Negative Binomial and Discrete Lindley models.

Table 8:
Goodness-of-fit measures.

It is observed from the results in Table 7 that the DLIS estimates are not very different from those obtained under the negative binomial model, and both sets would lead to the same conclusions. Looking at the standard errors, we see that both approaches to overdispersion lead to very similar estimated standard errors. However, the DLS estimates, except for the sign, are basically the same as those of the other models. Regarding the regression coefficients, we conclude that, for the DLS distribution, β₂ and β₄ are not significant; for the DLIS distribution, β₀, β₂ and β₄ are not significant; and, for the negative binomial distribution, β₀, β₂ and β₄ are not significant (see the confidence intervals in Table 7). The AIC (Akaike, 1974), AICc (Cavanaugh, 1997) and BIC (Bhat and Kumar, 2010) criteria presented in Table 8 are basically the same across models, but the DLS model is better in terms of parsimony and goodness of fit.

5.3 Application 3 (with covariates)

In this application, we considered the dataset analyzed by Deb and Trivedi (1997) and Liu and Cela (2008) to illustrate the application of the discrete Lindley (DLIS), zero-inflated discrete Lindley (ZIDLIS) and hurdle discrete Lindley (HDLIS) models in the presence of covariates (see Remark 1). Their fit is compared to that of the Poisson, negative binomial, zero-inflated Poisson and hurdle Poisson models. For all distributions we consider: log(β) = β₀ + Σ β_i x_i and logit(p) = α₀ + Σ α_i x_i, where the x_i are described in Table 9.
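The structure of these models can be sketched in Python under two assumptions: that the base PMF is (1 − e^{−β})²(1 + x)e^{−βx} on x = 0, 1, 2, …, and that p plays the usual role of the extra-zero probability in the zero-inflated model and of the zero probability in the hurdle model. The coefficient values, the covariate vector and the function names below are hypothetical.

```python
import math

def dlis_pmf(x, beta):
    """Base PMF under the stated assumption."""
    return (1 - math.exp(-beta)) ** 2 * (1 + x) * math.exp(-beta * x)

def zero_inflated_pmf(x, beta, p):
    """Mixture of a point mass at zero (weight p) and the base PMF."""
    base = (1 - p) * dlis_pmf(x, beta)
    return p + base if x == 0 else base

def hurdle_pmf(x, beta, p):
    """Zero with probability p; positive counts from the zero-truncated base PMF."""
    if x == 0:
        return p
    return (1 - p) * dlis_pmf(x, beta) / (1 - dlis_pmf(0, beta))

def linked_parameters(covariates, betas, alphas):
    """Log link for beta and logit link for p, with intercepts in position 0."""
    eta = betas[0] + sum(b * v for b, v in zip(betas[1:], covariates))
    zeta = alphas[0] + sum(a * v for a, v in zip(alphas[1:], covariates))
    return math.exp(eta), 1 / (1 + math.exp(-zeta))

# Hypothetical coefficients and a single covariate value.
beta, p = linked_parameters([1.0], betas=[-0.5, 0.3], alphas=[0.2, -0.4])
zi_total = sum(zero_inflated_pmf(x, beta, p) for x in range(2000))
hurdle_total = sum(hurdle_pmf(x, beta, p) for x in range(2000))
print(round(beta, 4), round(p, 4), round(zi_total, 10), round(hurdle_total, 10))
```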

Table 9:
Dataset: The number of hospital stays of 4406 respondents who were aged 66 or older and covered by Medicare program.

Remark 1: In this application, we used only the DLIS distribution, since with the DLS distribution the parameter estimation was computationally unstable.

This dataset was originally obtained from the National Medical Expenditure Survey (NMES), conducted in 1987, and includes 4406 respondents who were aged 66 or older and covered by the Medicare program. The dataset description and summary statistics are given in Table 9; the variance of hosp is about twice the mean, suggesting overdispersion.

Estimated coefficients of all models, together with related statistics, are listed in Tables 10 and 11. While Poisson regression provides a baseline model for count data, the other models show a better fit than the basic Poisson regression model. The zero-inflated discrete Lindley model has the best fit among the models considered.

Looking at the standard errors of all models, we see that both approaches to overdispersion lead to very similar estimated standard errors. From the AIC, AICc and BIC criteria, we conclude that the zero-inflated discrete Lindley model is the best-fitting model.

Table 10:
Parameter estimates and standard errors for all models.

P: Poisson, NB: Negative Binomial, HP: Hurdle Poisson, ZIP: Zero-Inflated Poisson, DL: Discrete Lindley, HDL: Hurdle Discrete Lindley and ZIDL: Zero-Inflated Discrete Lindley

Table 11:
Goodness-of-fit measures.

P: Poisson, NB: Negative Binomial, HP: Hurdle Poisson, ZIP: Zero-Inflated Poisson, DL: Discrete Lindley, HDL: Hurdle Discrete Lindley and ZIDL: Zero-Inflated Discrete Lindley

6 Conclusion

In this paper, considering a discretization method based on an infinite series, we introduced an alternative discrete Lindley distribution. Some characteristics and properties of this distribution were presented and studied, and it was found that it can be used in the analysis of data with overdispersion. Monte Carlo studies showed that the maximum likelihood estimator for this distribution is asymptotically unbiased, with biases and mean squared errors lower than those for the discrete Lindley distribution obtained by survival function considered in Bakouch et al. (2014); the coverage probabilities range from 0.94 to 0.96, and the coverage length goes to zero as the sample size increases. In the applications considered, the DLIS distribution had a better or equivalent fit compared to the other distributions, leading to the conclusion that it could be a good alternative for overdispersed count data with or without covariates. In particular, it is better than the DLS distribution in computational aspects (simulation and estimation), simplicity of equations and goodness of fit.

References

Aghababaei Jazi, M., Lai, C. D., Hossein Alamatsaz, M. (2010). A discrete inverse Weibull distribution and estimation of its parameters. Statistical Methodology, 7, 121–132.

Akaike, H. (1974). A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19 (6), 716–723.

Bakouch, H. S., Jazi, M. A., Nadarajah, S. (2014). A new discrete distribution. Statistics, 48 (1), 200–240.

Bhat, H. S., Kumar, N. (2010). On the derivation of the Bayesian information criterion. School of Natural Sciences, University of California.

Bi, Z., Faloutsos, C., Korn, F. (2001). The DGX distribution for mining massive, skewed data. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 17–26.

Bracquemond, C., Gaudoin, O. (2003). A survey on discrete lifetime distributions. International Journal of Reliability, Quality and Safety Engineering, 10 (01), 69–98.

Cavanaugh, J. E. (1997). Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters, 33 (2), 201–208.

Chakraborty, S. (2015). Generating discrete analogues of continuous probability distributions - a survey of methods and constructions. Journal of Statistical Distributions and Applications, 2 (1), 1–30.

Chakraborty, S., Chakravarty, D. (2012). Discrete gamma distributions: properties and parameter estimations. Communications in Statistics-Theory and Methods, 41 (18), 3301–3324.

Collett, D. (2003). Modelling Survival Data in Medical Research, 2nd edn. Chapman and Hall, New York.

Cox, D. R., Snell, E. J. (1968). A general definition of residuals. Journal of the Royal Statistical Society Series B (Methodological), 30 (2), 248–275.

Deb, P., Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture approach. Journal of Applied Econometrics, 12 (3), 313–336.

Doray, L. G., Luong, A. (1997). Efficient estimators for the Good family. Communications in Statistics-Simulation and Computation, 26 (3), 1075–1088.

Gómez-Déniz, E., Calderín-Ojeda, E. (2011). The discrete Lindley distribution: properties and applications. Journal of Statistical Computation and Simulation, 81 (11), 1405–1416.

Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40 (3-4), 237–264.

Haight, F. A. (1957). Queueing with balking. Biometrika, 44 (3/4), 360–369.

Hamada, M. S., Wilson, A. G., Reese, C. S., Martz, H. F. (2008). Bayesian reliability. Springer Series in Statistics, Springer, New York.

Hussain, T., Ahmad, M. (2014). Discrete inverse Rayleigh distribution. Pak J Statist, 30 (2), 203–222.

Inusah, S., Kozubowski, T. J. (2006). A discrete analogue of the Laplace distribution. Journal of Statistical Planning and Inference, 136.

Kalbfleisch, J. D., Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd edn. Wiley, New York, NY.

Keilson, J., Gerber, H. (1971). Some results for discrete unimodality. Journal of the American Statistical Association, 66.

Kemp, A. W. (1997). Characterizations of a discrete normal distribution. Journal of Statistical Planning and Inference, 63 (2), 223 – 229, in Honor of C.R. Rao.

Kemp, A. W. (2004). Classes of discrete lifetime distributions. Taylor & Francis.

Kemp, A. W. (2008). The discrete half-normal distribution. In: Advances in mathematical and statistical modeling, Springer, pp. 353–360.

Klein, J. P., Moeschberger, M. L. (1997). Survival Analysis: Techniques for Censored and Truncated Data. Springer-Verlag, New York.

Kozubowski, T. J., Inusah, S. (2006). A skew Laplace distribution on integers. Annals of the Institute of Statistical Mathematics, 58 (3), 555–571.

Krishna, H., Pundir, P. S. (2009). Discrete Burr and discrete Pareto distributions. Statistical Methodology, 6 (2), 177–188.

Kulasekera, K., Tonkyn, D. W. (1992). A new discrete distribution, with applications to survival, dispersal and dispersion. Communications in Statistics-Simulation and Computation, 21 (2), 499–518.

Lawless, J. F. (2003). Statistical models and methods for lifetime data, 2nd edn. Wiley Series in Probability and Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ.

Lee, E. T., Wang, J. W. (2003). Statistical methods for survival data analysis, 3rd edn. Wiley Series in Probability and Statistics, Hoboken, NJ.

Liu, W., Cela, J. (2008). Count data models in SAS. In: SAS Global Forum, Citeseer.

Long, J. S. (1990). The origins of sex differences in science. Social forces.

Long, J. S., Freese, J., et al. (2001). Predicted probabilities for count models. Stata Journal, 1 (1), 51–7.

Meeker, W. Q., Escobar, L. A. (1998). Statistical Methods for Reliability Data. John Wiley & Sons, New York.

Nakagawa, T., Osaki, S. (1975). The discrete Weibull distribution. IEEE Transactions on Reliability, 5, 300–301.

Nekoukhou, V., Alamatsaz, M. H., Bidram, H. (2012). A discrete analog of the generalized exponential distribution. Communication in Statistics- Theory and Methods, 41, 2000–2013.

Nekoukhou, V., Alamatsaz, M. H., Bidram, H. (2013). Discrete generalized exponential distribution of a second type. Statistics, 47, 876–887.

R Core Team (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria.

Roy, D. (2003). The discrete normal distribution. Communication in Statistics- Theory and Methods, 32, 1871–1883.

Roy, D. (2004). Discrete Rayleigh distribution. Reliability, IEEE Transactions on, 53 (2), 255–260.

Sato, H., Ikota, M., Sugimoto, A., Masuda, H. (1999). A new defect distribution metrology with a consistent discrete exponential formula and its applications. Semiconductor Manufacturing, IEEE Transactions on, 12 (4), 409–418.

Siromoney, G. (1964). The general Dirichlet’s series distribution. Journal of the Indian Statistical Association, 2.