Biology-Botany
Received: 14/03/17
Accepted: 11/05/17
Abstract: Methods for generating a probability function from a probability density function have been widely used in recent years. In general, the discretization process produces probability functions that can rival the traditional distributions used in the analysis of count data, such as the geometric, Poisson and negative binomial distributions. Discretization also avoids the use of a continuous distribution in the analysis of strictly discrete data. In this paper, using the method based on an infinite series proposed by Good (1953), we study an alternative discrete Lindley distribution to the one studied in Gómez-Déniz and Calderín-Ojeda (2011) and Bakouch et al. (2014). For both distributions, a simulation study is carried out to examine the bias and mean squared error of the maximum likelihood estimators of the parameters, as well as the coverage probability and the width of the confidence intervals. For the discrete Lindley distribution obtained by the infinite series method, we present the analytical expression for the bias reduction of the maximum likelihood estimator. Some examples using real data from the literature show the potential of these distributions. Although the discretization methods are quite different, the resulting distributions are interchangeable. However, the distribution generated by an infinite series has simple mathematical expressions and can be applied directly to count data in the presence of covariates.
Keywords: Discretization methods, Lindley distribution, likelihood, series, survival analysis, Monte Carlo simulation.
1 Introduction
In recent years, the generation of discrete analogues of continuous random variables has been considered by several authors (see, for example, Chakraborty, 2015). The main purpose of discretizing a continuous probability density function is to generate a distribution for the analysis of strictly discrete data. For example, in survival data analysis it is common to use continuous distributions for discrete data, and discretization provides a way to avoid this practice. Many applications of continuous distributions to the analysis of discrete data are presented in lifetime data books such as Hamada et al. (2008), Collett (2003), Lee and Wang (2003), Lawless (2003), Kalbfleisch and Prentice (2002), Meeker and Escobar (1998) and Klein and Moeschberger (1997), among others.
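One of the first discretized distributions introduced in the literature was the Weibull distribution. From the Weibull distribution with probability density function:
$$f(x \mid \mu, \beta) = \frac{\beta}{\mu}\left(\frac{x}{\mu}\right)^{\beta-1}\exp\left\{-\left(\frac{x}{\mu}\right)^{\beta}\right\}, \quad x > 0, \tag{1}$$
and survival function:
$$S(x \mid \mu, \beta) = \exp\left\{-\left(\frac{x}{\mu}\right)^{\beta}\right\}, \tag{2}$$
Nakagawa and Osaki (1975) proposed the discrete Weibull distribution, whose probability function can be written as:
$$P(X = x \mid \mu, \beta) = \exp\left\{-\left(\frac{x}{\mu}\right)^{\beta}\right\} - \exp\left\{-\left(\frac{x+1}{\mu}\right)^{\beta}\right\}, \tag{3}$$
where $x \in \{0, 1, 2, \ldots\}$ and $\mu, \beta > 0$ are, respectively, the scale and shape parameters. It is easy to verify that (3) is, in fact, a probability function.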
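As a quick numerical illustration of the discrete Weibull construction, the following Python sketch evaluates the difference of the Weibull survival function at consecutive integers and checks that the probabilities sum to one. It assumes the usual scale-shape parametrization $S(x) = \exp\{-(x/\mu)^{\beta}\}$; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def weibull_survival(x, mu, beta):
    """Survival function S(x) = exp{-(x/mu)^beta} of the continuous Weibull."""
    return math.exp(-((x / mu) ** beta))

def discrete_weibull_pmf(x, mu, beta):
    """P(X = x) = S(x) - S(x + 1) for x = 0, 1, 2, ..."""
    return weibull_survival(x, mu, beta) - weibull_survival(x + 1, mu, beta)

# Illustrative (hypothetical) parameter values.
mu, beta = 3.0, 1.5
probs = [discrete_weibull_pmf(x, mu, beta) for x in range(200)]
total = sum(probs)  # telescopes to S(0) - S(200), which is numerically 1
print(total)
```

With $\beta = 1$ the same function reduces to a geometric probability function with ratio $e^{-1/\mu}$, which is a convenient sanity check.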
Recently, Nekoukhou et al. (2012), using the method based on an infinite series, introduced the discrete generalized exponential distribution, whose probability function is written as:
(4)
where x ranges over the support of the distribution, $\binom{\alpha-1}{j} = \frac{(\alpha-1)\cdots(\alpha-j)}{j!}$, $0 < \lambda < 1$ and $\alpha > 0$.
In this paper, also considering the method based on an infinite series, we introduce an alternative discrete Lindley distribution and compare this model with the version presented in Gómez-Déniz and Calderín-Ojeda (2011) and Bakouch et al. (2014). In Section 2, two discretization methods are presented, and the expressions resulting from their application to the Lindley distribution are given in Section 3. In Section 4, the biases and mean squared errors of the maximum likelihood estimates are studied. Some applications are presented in Section 5, and Section 6 presents some concluding remarks.
2 Discretization methods
2.1 Discretization by survival function
Proposed by Nakagawa and Osaki (1975), this method discretizes a continuous random variable through its survival function. Some properties of discrete analogues of continuous distributions obtained by this method were studied, among others, by Kemp (2004), Bracquemond and Gaudoin (2003), Roy (2003) and Chakraborty (2015).
Following Kemp (2004), we can define a discrete analogue of a continuous random variable as follows:
Definition 1: Let X be a continuous random variable. If X has survival function $S_X(x)$, then the discrete random variable $Y = \lfloor X \rfloor$, where $\lfloor X \rfloor$ denotes the largest integer less than or equal to X, has PMF (probability mass function) written as:
$$P(Y = x) = S_X(x) - S_X(x+1). \tag{5}$$
It is easily verified that (5) is, in fact, a probability function for $x \in \{0, 1, 2, \ldots\}$. If the survival function of X has a compact form, then the PMF (5) will also have a compact form.
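The survival-function construction can be sketched generically in Python: any survival function is turned into a probability function by differencing at consecutive integers. The helper names below are hypothetical, and the exponential distribution is used only because its discretized version is known to be geometric, which gives an easy check.

```python
import math

def discretize_by_survival(survival):
    """Return the PMF x -> S(x) - S(x + 1) of Y = floor(X)."""
    def pmf(x):
        return survival(x) - survival(x + 1)
    return pmf

def exponential_survival(rate):
    """Survival function of an exponential distribution with the given rate."""
    return lambda x: math.exp(-rate * x)

rate = 0.7  # hypothetical value, for illustration only
pmf = discretize_by_survival(exponential_survival(rate))
q = math.exp(-rate)

# Discretizing an exponential yields the geometric PMF (1 - q) * q**x.
for x in range(5):
    print(x, pmf(x), (1 - q) * q ** x)
```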
Some distributions discretized by this method and introduced in the literature are: Inverse Rayleigh distribution (Hussain and Ahmad, 2014), Lindley distribution (Gómez-Déniz and Calderín-Ojeda, 2011; Bakouch et al., 2014), Type II generalized exponential distribution (Nekoukhou et al., 2013), Gamma distribution (Chakraborty and Chakravarty, 2012), Inverse Weibull distribution (Aghababaei Jazi et al., 2010), Burr XII and Pareto distributions (Krishna and Pundir, 2009), Rayleigh distribution (Roy, 2004), geometric Weibull distribution (Bracquemond and Gaudoin, 2003), among others.
2.2 Discretization by an infinite series
The first traces of this method were presented in Good (1953) in a modeling study of the population frequencies of species. Later, other authors such as Kulasekera and Tonkyn (1992), Doray and Luong (1997), Kemp (1997) and Sato et al. (1999) studied this method and presented versions of it for continuous random variables with support $(-\infty, \infty)$ or $(0, \infty)$.
Definition 2: Let X be a continuous random variable. If X has pdf $f(x)$ with support $-\infty < x < \infty$, then the corresponding discrete random variable Y has PMF as follows:
$$P(Y = x) = \frac{f(x)}{\sum_{j=-\infty}^{\infty} f(j)}, \quad x \in \mathbb{Z}. \tag{6}$$
In the case where the support of X is $(0, \infty)$, according to Sato et al. (1999), the PMF of Y is:
$$P(Y = x) = \frac{f(x)}{\sum_{j=0}^{\infty} f(j)}, \quad x \in \{0, 1, 2, \ldots\}. \tag{7}$$
Some distributions discretized by this method and introduced in the literature are: Pearson III distribution (Haight, 1957), Dirichlet's series distribution (Siromoney, 1964), Gaussian distribution (Kemp, 1997), Gamma and exponential distributions (Sato et al., 1999), Log-Gaussian distribution (Bi et al., 2001), Laplace distribution (Inusah and Kozubowski, 2006), Skew-Laplace distribution (Kozubowski and Inusah, 2006), Half-Gaussian distribution (Kemp, 2008), Generalized exponential distribution (Nekoukhou et al., 2012), among others.
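The infinite-series method can be sketched in a few lines of Python: the density is evaluated on the integers and renormalized. This sketch assumes a support starting at zero and truncates the infinite sum at a fixed number of terms; the function names, the truncation point and the rate value are hypothetical choices made for illustration.

```python
import math

def discretize_by_series(pdf, start=0, terms=5000):
    """PMF x -> pdf(x) / sum_j pdf(j), with the infinite sum truncated after `terms` terms."""
    norm = sum(pdf(j) for j in range(start, start + terms))
    def pmf(x):
        return pdf(x) / norm
    return pmf

rate = 0.7  # hypothetical rate of an exponential density

def exponential_pdf(x):
    return rate * math.exp(-rate * x)

pmf = discretize_by_series(exponential_pdf)
q = math.exp(-rate)

# For the exponential density the normalized values form a geometric PMF.
print([pmf(x) for x in range(5)])
print([(1 - q) * q ** x for x in range(5)])
```

For the exponential density both discretization methods give the same geometric distribution, which is not the case in general.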
3 The discrete Lindley distribution
3.1 Discretization by survival function
Let X be a continuous random variable with Lindley distribution. Using the survival function of X, Gómez-Déniz and Calderín-Ojeda (2011) and Bakouch et al. (2014) presented the discrete Lindley distribution with PMF written in the form:
$$P(X = x \mid \beta) = \frac{e^{-\beta x}}{1+\beta}\left[\beta\left(1 - 2e^{-\beta}\right) + \left(1 - e^{-\beta}\right)(1 + \beta x)\right], \tag{8}$$
where $x \in \{0, 1, 2, \ldots\}$ and $\beta > 0$.
The behavior of (8) for some values of β is shown in Figure 1. Note that the PMF is unimodal and, when β > 1, the mode is at zero (Bakouch et al., 2014).
From (8) we have:
(9)
and:
(10)
Analyzing the ratio between $V(X)$ and $E(X)$, we can see that $E(X) < V(X)$ for all β > 0. So this discrete version should only be used in the analysis of overdispersed data. For more details, see Bakouch et al. (2014).
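The overdispersion of the survival-function version can be checked numerically. The sketch below builds the PMF as $S(x) - S(x+1)$ from the Lindley survival function $S(x) = (1 + \beta + \beta x)e^{-\beta x}/(1+\beta)$ and computes the mean and variance by truncated sums. The function names, the truncation point and the values of β are hypothetical choices for illustration.

```python
import math

def lindley_survival(x, beta):
    """Survival function of the continuous Lindley distribution."""
    return (1 + beta + beta * x) * math.exp(-beta * x) / (1 + beta)

def dls_pmf(x, beta):
    """Discrete Lindley PMF obtained by the survival function: S(x) - S(x + 1)."""
    return lindley_survival(x, beta) - lindley_survival(x + 1, beta)

def dls_moments(beta, terms=5000):
    """Mean and variance by direct (truncated) summation."""
    mean = sum(x * dls_pmf(x, beta) for x in range(terms))
    second = sum(x * x * dls_pmf(x, beta) for x in range(terms))
    return mean, second - mean ** 2

for beta in (0.2, 0.5, 1.0, 2.0):
    mean, var = dls_moments(beta)
    print(beta, mean, var, var / mean)  # a ratio above 1 indicates overdispersion
```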
3.2 Estimation
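Let $x_1, \ldots, x_n$ be a random sample from a distribution with PMF (8); the log-likelihood function of the discrete Lindley distribution is given by:
$$l(\beta \mid \mathbf{x}) = -\beta \sum_{i=1}^{n} x_i - n \log(1+\beta) + \sum_{i=1}^{n} \log\left[\beta\left(1 - 2e^{-\beta}\right) + \left(1 - e^{-\beta}\right)(1 + \beta x_i)\right]. \tag{11}$$
The maximum likelihood estimator $\hat{\beta}$ of β is obtained by solving numerically, for β, the equation $\frac{d}{d\beta} l(\beta \mid \mathbf{x}) = 0$, where:
$$\frac{d}{d\beta} l(\beta \mid \mathbf{x}) = -\sum_{i=1}^{n} x_i - \frac{n}{1+\beta} + \sum_{i=1}^{n} \frac{(1 + x_i)\left(1 - e^{-\beta}\right) + \beta e^{-\beta}(2 + x_i)}{\beta\left(1 - 2e^{-\beta}\right) + \left(1 - e^{-\beta}\right)(1 + \beta x_i)}. \tag{12}$$
Note that this expression is non-linear in β and must be solved numerically. However, Bakouch et al. (2014) give a closed-form approximation for $\hat{\beta}$ that is valid when $e^{-\beta} \approx 1$.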
Confidence intervals for β, as well as hypothesis tests of interest, can be constructed from the asymptotic normality of the maximum likelihood estimates for large sample sizes.
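Because the likelihood equation for the survival-function version has no closed-form solution, a numerical search is needed. The following Python sketch maximizes the log-likelihood with a golden-section search. This optimizer is our choice for a dependency-free illustration, not the routine used by the authors, and the sample and the search interval are hypothetical.

```python
import math

def lindley_survival(x, beta):
    return (1 + beta + beta * x) * math.exp(-beta * x) / (1 + beta)

def dls_loglik(beta, data):
    """Log-likelihood of the survival-function discrete Lindley, with PMF S(x) - S(x + 1)."""
    return sum(math.log(lindley_survival(x, beta) - lindley_survival(x + 1, beta))
               for x in data)

def golden_max(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d
        else:
            a = c
    return (a + b) / 2

sample = [0, 1, 1, 2, 3, 0, 4, 2, 1, 5]  # hypothetical counts
beta_hat = golden_max(lambda b: dls_loglik(b, sample), 1e-3, 10.0)
print(beta_hat, dls_loglik(beta_hat, sample))
```

The search assumes the log-likelihood is unimodal on the chosen interval; in practice a plot of the log-likelihood or several starting intervals can be used to confirm this.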

3.3 Discretization by infinite series
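Using the discretization method presented in Section 2.2, the discrete Lindley distribution has PMF written in the form:
$$P(X = x \mid \beta) = \left(1 - e^{-\beta}\right)^{2} (1 + x)\, e^{-\beta x}, \tag{13}$$
where $x \in \{0, 1, 2, \ldots\}$ and $\beta > 0$.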
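According to equation (13), the PMF is unimodal, with mode given as follows:
$$x_{\text{mode}} = \left\lfloor \frac{1}{e^{\beta} - 1} \right\rfloor, \tag{14}$$
with $P(X = x_{\text{mode}} - 1) = P(X = x_{\text{mode}})$ when $1/(e^{\beta} - 1)$ is a positive integer. In fact, note that:
$$\left(1 - e^{-\beta}\right)^{4} (1 + x)^{2} e^{-2\beta x} \geq \left(1 - e^{-\beta}\right)^{4} x\,(2 + x)\, e^{-2\beta x}.$$
The right side of the above inequality is the same as $P(X = x - 1)P(X = x + 1)$. So, equation (13) satisfies the log-concavity inequality $P^{2}(X = x) \geq P(X = x - 1)P(X = x + 1)$ for x = 1, 2, ... and, therefore, by Theorem 3 of Keilson and Gerber (1971), is unimodal. Figure 2 illustrates the behavior of (13) for some values of β.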
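The log-concavity argument can be checked numerically. The sketch below assumes that the infinite-series discrete Lindley PMF is $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$ on x = 0, 1, 2, ..., obtained by normalizing the Lindley density over the non-negative integers. The closed form $\lfloor 1/(e^{\beta} - 1) \rfloor$ for the mode is derived here from the ratio of consecutive probabilities and should be read as an assumption of this sketch; all names and parameter values are hypothetical.

```python
import math

def dlis_pmf(x, beta):
    """Assumed infinite-series discrete Lindley PMF on x = 0, 1, 2, ..."""
    q = math.exp(-beta)
    return (1 - q) ** 2 * (1 + x) * q ** x

def mode_by_search(beta, upper=2000):
    """Locate the mode by brute force over 0, ..., upper - 1."""
    return max(range(upper), key=lambda x: dlis_pmf(x, beta))

def is_log_concave(beta, upper=100):
    """Check P(x)^2 >= P(x - 1) P(x + 1) for x = 1, ..., upper - 1."""
    return all(dlis_pmf(x, beta) ** 2 >= dlis_pmf(x - 1, beta) * dlis_pmf(x + 1, beta)
               for x in range(1, upper))

for beta in (0.2, 0.5, 1.0):
    print(beta, mode_by_search(beta), math.floor(1 / math.expm1(beta)), is_log_concave(beta))
```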
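For a random variable X with PMF (13), the corresponding probability generating function and moment generating function can be expressed, respectively, as:
$$G_X(s) = \left(\frac{1 - e^{-\beta}}{1 - s\,e^{-\beta}}\right)^{2}, \quad |s| < e^{\beta},$$
$$M_X(t) = \left(\frac{1 - e^{-\beta}}{1 - e^{t - \beta}}\right)^{2}, \quad t < \beta.$$
Unlike the discretized version obtained by the survival function, the version proposed here has simple expressions for the mean and variance:
$$E(X) = \frac{2}{e^{\beta} - 1}, \qquad V(X) = \frac{2e^{\beta}}{\left(e^{\beta} - 1\right)^{2}}.$$
For all β > 0, it is easy to see that $E(X) < V(X)$. Thus, this distribution can be used in the analysis of count data with overdispersion. The dispersion index is written as $V(X)/E(X) = e^{\beta}/(e^{\beta} - 1)$.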
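The simple moment expressions can be verified by direct summation. The sketch below assumes the PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$ on the non-negative integers, for which the closed forms $2/(e^{\beta} - 1)$ and $2e^{\beta}/(e^{\beta} - 1)^{2}$ follow from its negative binomial structure. The function names, the truncation point and the values of β are hypothetical.

```python
import math

def dlis_pmf(x, beta):
    """Assumed infinite-series discrete Lindley PMF on x = 0, 1, 2, ..."""
    q = math.exp(-beta)
    return (1 - q) ** 2 * (1 + x) * q ** x

def numeric_moments(beta, terms=5000):
    """Mean and variance by direct (truncated) summation."""
    mean = sum(x * dlis_pmf(x, beta) for x in range(terms))
    second = sum(x * x * dlis_pmf(x, beta) for x in range(terms))
    return mean, second - mean ** 2

def closed_form_moments(beta):
    """Mean 2/(e^beta - 1) and variance 2 e^beta / (e^beta - 1)^2."""
    return 2 / math.expm1(beta), 2 * math.exp(beta) / math.expm1(beta) ** 2

for beta in (0.2, 0.5, 1.0):
    print(beta, numeric_moments(beta), closed_form_moments(beta))
```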
3.4 Estimation
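Let $x_1, \ldots, x_n$ be a random sample from (13); the log-likelihood function is given by:
$$l(\beta \mid \mathbf{x}) = 2n \log\left(1 - e^{-\beta}\right) + \sum_{i=1}^{n} \log(1 + x_i) - \beta \sum_{i=1}^{n} x_i. \tag{15}$$
The maximum likelihood estimator $\hat{\beta}$ of β is obtained by solving $\frac{d}{d\beta} l(\beta \mid \mathbf{x}) = 0$ in β. That is:
$$\frac{d}{d\beta} l(\beta \mid \mathbf{x}) = \frac{2n}{e^{\beta} - 1} - \sum_{i=1}^{n} x_i = 0. \tag{16}$$
The second derivative of the log-likelihood function is given by:
$$\frac{d^{2}}{d\beta^{2}} l(\beta \mid \mathbf{x}) = -\frac{2n\,e^{\beta}}{\left(e^{\beta} - 1\right)^{2}} < 0, \tag{17}$$
therefore the log-likelihood is strictly concave and its unique maximizer satisfies:
$$\frac{2}{e^{\hat{\beta}} - 1} = \bar{x}. \tag{18}$$
Solving (18) for $\hat{\beta}$, we have:
$$\hat{\beta} = \log\left(1 + \frac{2}{\bar{x}}\right). \tag{19}$$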
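A closed-form estimator is easy to check in code. The sketch below assumes the PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$, whose score function is $2n/(e^{\beta} - 1) - \sum x_i$ and whose maximizer is $\log(1 + 2/\bar{x})$, and verifies that the score vanishes at the estimate. The sample is hypothetical, and the function names are ours.

```python
import math

def dlis_score(beta, data):
    """Derivative of the log-likelihood: 2n / (e^beta - 1) - sum(x_i)."""
    return 2 * len(data) / math.expm1(beta) - sum(data)

def dlis_mle(data):
    """Closed-form estimate log(1 + 2 / xbar); undefined if every count is zero."""
    xbar = sum(data) / len(data)
    if xbar == 0:
        raise ValueError("the estimate does not exist when all observations are zero")
    return math.log1p(2 / xbar)

sample = [0, 1, 1, 2, 3, 0, 4, 2, 1, 5]  # hypothetical counts
beta_hat = dlis_mle(sample)
print(beta_hat, dlis_score(beta_hat, sample))
```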
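Theorem 1: The estimator $\hat{\beta}$ of β is positively biased, that is, $E(\hat{\beta}) - \beta > 0$.
Proof: Let
$$\hat{\beta} = g(\bar{X})$$
and
$$g(t) = \log\left(1 + \frac{2}{t}\right)$$
for t > 0. Since
$$g''(t) = \frac{4(t + 1)}{t^{2}(t + 2)^{2}} > 0,$$
g(t) is strictly convex. Thus, by Jensen's inequality, we have $E(g(\bar{X})) > g(E(\bar{X}))$. Finally, since:
$$g(E(\bar{X})) = \log\left(1 + \left(e^{\beta} - 1\right)\right) = \beta,$$
we obtain $E(\hat{\beta}) > \beta$. Therefore, the estimator $\hat{\beta}$ of β is positively biased.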
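Cox and Snell (1968) provided a framework for estimating the bias, to $O(n^{-1})$, of the maximum likelihood estimators of the parameters of regular densities. Subtracting the estimated bias from the original maximum likelihood estimator then produces a bias-corrected estimator that is unbiased to $O(n^{-2})$. This type of bias adjustment can be applied successfully to the discrete Lindley distribution given in (13). Following Cox and Snell (1968), we have:
$$B(\hat{\beta}) = \frac{e^{2\beta} - 1}{4n\,e^{\beta}} + O(n^{-2}). \tag{20}$$
In this way, the bias-corrected maximum likelihood estimator $\hat{\beta}_{CMLE}$ can be written as:
$$\hat{\beta}_{CMLE} = \hat{\beta} - \frac{e^{2\hat{\beta}} - 1}{4n\,e^{\hat{\beta}}}. \tag{21}$$
Re-parameterizing (13) in terms of the mean $\theta = 2/(e^{\beta} - 1)$, we have $\beta = \log\left((\theta + 2)/\theta\right)$ such that $\hat{\theta} = \bar{x}$. Since $E(\bar{X}) = \theta$, the estimator $\hat{\theta}$ is unbiased, so the bias-corrected maximum likelihood estimator for θ coincides with $\hat{\theta}$. It is important to point out that, in terms of θ, we have
$$P(X = x \mid \theta) = \frac{4\,(1 + x)\,\theta^{x}}{(\theta + 2)^{x + 2}}, \tag{22}$$
such that $E(X) = \theta$ and $V(X) = \theta + \theta^{2}/2$.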
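The effect of the bias correction can be illustrated with a small Monte Carlo experiment. The sketch below assumes the PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$, which is the distribution of the sum of two independent geometric variables, so samples are drawn as sums of two floored exponential variates. The first-order bias used in the correction, $\sinh(\beta)/(2n) = (e^{2\beta} - 1)/(4n e^{\beta})$, is derived here under that assumption. The seed, sample size and number of replicates are hypothetical.

```python
import math
import random

def sample_dlis(beta, n, rng):
    """Draw n values as sums of two independent geometric variables (floored exponentials)."""
    return [int(rng.expovariate(beta)) + int(rng.expovariate(beta)) for _ in range(n)]

def mle(data):
    """Closed-form estimate log(1 + 2 / xbar)."""
    return math.log1p(2 * len(data) / sum(data))

def corrected_mle(beta_hat, n):
    """Subtract the estimated first-order bias sinh(beta_hat) / (2n)."""
    return beta_hat - math.sinh(beta_hat) / (2 * n)

def average_biases(beta, n, reps, seed=2017):
    rng = random.Random(seed)
    raw = corrected = 0.0
    done = 0
    while done < reps:
        data = sample_dlis(beta, n, rng)
        if sum(data) == 0:  # estimate undefined for an all-zero sample; redraw
            continue
        b = mle(data)
        raw += b - beta
        corrected += corrected_mle(b, n) - beta
        done += 1
    return raw / reps, corrected / reps

bias_raw, bias_corrected = average_biases(beta=0.5, n=20, reps=10_000)
print(bias_raw, bias_corrected)
```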
4 Simulation study
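In this section we estimate, by Monte Carlo simulation, the biases, the mean squared errors, the coverage probabilities and the coverage lengths for the maximum likelihood estimator, $\hat{\beta}$, for the discrete Lindley distributions obtained by the survival function and by the infinite series. For computational stability, we assumed the values β = 0.2, 0.5, 0.8, 1.0, 1.2 and sample sizes n = 10, 20, ..., 90, 100. For each scenario, we calculated:
$$\text{Bias}_{\hat{\beta}}(n) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\beta}_i - \beta\right), \qquad \text{MSE}_{\hat{\beta}}(n) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\beta}_i - \beta\right)^{2},$$
$$CP_{\beta}(n) = \frac{1}{N}\sum_{i=1}^{N} I\left\{\hat{\beta}_i^{L} < \beta < \hat{\beta}_i^{U}\right\}, \qquad CL_{\beta}(n) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\beta}_i^{U} - \hat{\beta}_i^{L}\right),$$
where $\hat{\beta}_i$ is the estimate obtained from the i-th simulated sample, $\hat{\beta}_i^{L}$ and $\hat{\beta}_i^{U}$ are the corresponding lower and upper confidence limits, I{·} denotes the indicator function and the number of simulations is N = 10,000. The simulation study was performed using R version 3.3.0 (R Core Team, 2015).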
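The simulation design can be sketched in Python for the infinite-series version (the study itself was run in R). The sketch assumes the PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$, the closed-form estimate $\log(1 + 2/\bar{x})$, and a 95% Wald interval with standard error $(e^{\hat{\beta}} - 1)/\sqrt{2n e^{\hat{\beta}}}$ taken from the observed information. The interval type, the confidence level, the seed and the reduced number of replicates are assumptions of this illustration.

```python
import math
import random

def simulate(beta, n, reps, seed=1, z=1.96):
    """Monte Carlo bias, MSE, coverage probability and coverage length of the MLE."""
    rng = random.Random(seed)
    bias = mse = cover = length = 0.0
    done = 0
    while done < reps:
        # Sum of n draws, each the sum of two geometric (floored exponential) variables.
        total = sum(int(rng.expovariate(beta)) + int(rng.expovariate(beta))
                    for _ in range(n))
        if total == 0:  # estimate undefined for an all-zero sample; redraw
            continue
        b = math.log1p(2 * n / total)
        se = math.expm1(b) / math.sqrt(2 * n * math.exp(b))
        lower, upper = b - z * se, b + z * se
        bias += b - beta
        mse += (b - beta) ** 2
        cover += lower < beta < upper
        length += upper - lower
        done += 1
    return {"bias": bias / reps, "mse": mse / reps,
            "cp": cover / reps, "cl": length / reps}

result = simulate(beta=0.5, n=50, reps=2000)
print(result)
```

Looping the same function over the grid of β values and sample sizes gives a table with the same layout as the simulation tables.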
Table 1 presents the simulation results for the discrete Lindley distribution obtained by the survival function. Tables 2 and 3 present the simulation results for the maximum likelihood estimator and the bias-corrected maximum likelihood estimator for the discrete Lindley distribution obtained by infinite series.
In every scenario, for both discretizations, the bias of $\hat{\beta}$ is positive and tends to zero as the sample size increases. It is also observed that the mean squared error of $\hat{\beta}$ tends to zero in every scenario. Regarding the coverage probabilities, $CP_{\beta}(n)$ ranges from 0.94 to 0.96, and the coverage length tends to zero as the sample size increases.



5 Applications
5.1 Application 1 (without covariates)
Consider a dataset related to the number of times that a computer broke down in each of 128 consecutive weeks of operation (Chakraborty and Chakravarty, 2012). The mean and variance are given, respectively, by $\bar{x} = 4.023$ times and $s^{2} = 14.464$ times², which is evidence of overdispersion. The fit of the discrete Lindley distribution obtained by infinite series (DLIS) was compared to the fits of the discrete Lindley distribution obtained by survival function (DLS), $P(X = x \mid \beta) = e^{-\beta x}(1 + \beta)^{-1}\left[\beta(1 - 2e^{-\beta}) + (1 - e^{-\beta})(1 + \beta x)\right]$ (Bakouch et al., 2014), a discrete Rayleigh (DR), $P(X = x \mid \theta) = \theta^{x^{2}} - \theta^{(x+1)^{2}}$ (Roy, 2004), a geometric (G), $P(X = x \mid \theta) = \theta^{x} - \theta^{x+1}$, and a Poisson (P), $P(X = x \mid \beta) = e^{-\beta}\beta^{x}/x!$.
The parameters were estimated by the maximum likelihood method and, to compare the fits, we considered the values of −log L, AIC, BIC and the χ² goodness-of-fit statistic (see Table 5). We conclude that the results for DLIS and DLS are almost the same. However, the DLIS distribution has simpler expressions and better computational stability, and it fits better than the other distributions considered in this application.
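Using only the summary statistics quoted for this dataset, the DLIS estimate can be worked out by hand. The sketch below assumes the PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$, for which the maximum likelihood estimate is $\log(1 + 2/\bar{x})$ and the implied variance is $\bar{x} + \bar{x}^{2}/2$. The raw counts are not reproduced here, so nothing beyond the two reported statistics is used.

```python
import math

xbar, s2 = 4.023, 14.464  # summary statistics quoted in the text

# Estimate implied by matching the mean: beta = log(1 + 2 / xbar).
beta_hat = math.log1p(2 / xbar)

# Variance implied by the fitted model, to compare with the sample variance.
implied_variance = xbar + xbar ** 2 / 2

print(beta_hat)          # roughly 0.40
print(implied_variance)  # roughly 12.1, against a sample variance of 14.464
```

The fitted variance is much closer to the sample variance than the Poisson value (equal to the mean, 4.023), which is consistent with the evidence of overdispersion.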


5.2 Application 2 (with covariates)
In this application, we considered a dataset introduced by Long (1990), related to the number of publications produced by Ph.D. biochemists, to illustrate the application of the discrete Lindley distributions (DLIS and DLS) in the presence of covariates. Their fits are compared to that of the negative binomial distribution.
This dataset has also been analyzed by Long et al. (2001) and is available from the Stata website http://www.stata-press.com/data/lf2/couart2.dta. The mean number of articles is 1.69 and the variance is 3.71, a little more than twice the mean (see Table 6), so the data are overdispersed. Results are shown in Tables 7 and 8. For both distributions we consider $\log(\beta) = \beta_0 + \sum_{i} \beta_i x_i$, where the $x_i$ are described in Table 6.
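To make the regression structure concrete, the following Python sketch evaluates a DLIS log-likelihood with the log link $\log(\beta_i) = \beta_0 + \sum_j \beta_j x_{ij}$. It assumes the PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$; the function name, the toy responses and the toy covariate are hypothetical and are not the biochemists data.

```python
import math

def dlis_reg_loglik(coefs, X, y):
    """Log-likelihood with log(beta_i) = coefs[0] + sum_j coefs[j] * X[i][j - 1]."""
    total = 0.0
    for row, yi in zip(X, y):
        eta = coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))
        beta_i = math.exp(eta)
        # log PMF: 2 log(1 - e^{-beta}) + log(1 + y) - beta * y
        total += 2 * math.log(-math.expm1(-beta_i)) + math.log1p(yi) - beta_i * yi
    return total

y = [0, 1, 3, 2, 0, 4]                       # hypothetical counts
X = [[0.1], [0.4], [0.9], [0.5], [0.2], [1.0]]  # one hypothetical covariate
print(dlis_reg_loglik([0.0, -0.5], X, y))

# Intercept-only check: the maximizer is log(beta_hat) with beta_hat = log(1 + 2 / ybar).
no_covariates = [[] for _ in y]
b0 = math.log(math.log1p(2 * len(y) / sum(y)))
print(b0, dlis_reg_loglik([b0], no_covariates, y))
```

In practice the coefficients would be obtained by passing the negative of this function to a general-purpose optimizer.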

Dataset: Number of publications produced by Ph.D. biochemists.


It is observed from the results in Table 7 that the DLIS estimates are not very different from those obtained assuming the negative binomial model, and both sets would lead to the same conclusions. Looking at the standard errors, we see that both approaches to overdispersion lead to very similar estimated standard errors. However, the DLS estimates, except for the sign, are basically the same as those of the other models. Now, looking at the regression coefficients, we conclude that, for the DLS distribution, β2 and β4 are not significant; for the DLIS distribution, β0, β2 and β4 are not significant; and, for the negative binomial distribution, β0, β2 and β4 are not significant (see the confidence intervals in Table 7). Also, the AIC (Akaike, 1974), AICc (Cavanaugh, 1997) and BIC (Bhat and Kumar, 2010) criteria reported in Table 8 are basically the same, but the DLS model is better in terms of parsimony and goodness of fit.
5.3 Application 3 (with covariates)
In this application, we considered the dataset analyzed by Deb and Trivedi (1997) and Liu and Cela (2008) to illustrate the application of the discrete Lindley (DLIS), zero-inflated discrete Lindley (ZIDLIS) and hurdle discrete Lindley (HDLIS) models in the presence of covariates (see Remark 1). Their fits are compared to those of the Poisson, negative binomial, zero-inflated Poisson and hurdle Poisson models. For all distributions we consider $\log(\beta) = \beta_0 + \sum_{i} \beta_i x_i$ and $\text{logit}(p) = \alpha_0 + \sum_{i} \alpha_i x_i$, where the $x_i$ are described in Table 9.
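The zero-inflated and hurdle variants can be sketched from their standard definitions. The code below assumes the DLIS PMF $(1 - e^{-\beta})^{2}(1 + x)e^{-\beta x}$ as the base distribution, with p interpreted as the extra-zero probability in the zero-inflated model and as the probability of a zero in the hurdle model. The exact parametrization used in the fitted models is not stated in the text, so these forms and the function names are assumptions.

```python
import math

def dlis_pmf(x, beta):
    """Assumed infinite-series discrete Lindley PMF on x = 0, 1, 2, ..."""
    q = math.exp(-beta)
    return (1 - q) ** 2 * (1 + x) * q ** x

def zidlis_pmf(x, beta, p):
    """Zero-inflated version: a point mass p at zero mixed with the base PMF."""
    base = (1 - p) * dlis_pmf(x, beta)
    return p + base if x == 0 else base

def hdlis_pmf(x, beta, p):
    """Hurdle version: zeros with probability p, positive counts from the zero-truncated base PMF."""
    if x == 0:
        return p
    return (1 - p) * dlis_pmf(x, beta) / (1 - dlis_pmf(0, beta))

beta, p = 0.8, 0.3  # hypothetical values
print(sum(zidlis_pmf(x, beta, p) for x in range(500)))
print(sum(hdlis_pmf(x, beta, p) for x in range(500)))
```

With covariates, β and p would be replaced by $\exp(\beta_0 + \sum_i \beta_i x_i)$ and the inverse logit of $\alpha_0 + \sum_i \alpha_i x_i$, respectively.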

Remark 1: In this application, we used only the DLIS distribution, since with the DLS distribution the parameter estimation was computationally unstable.
This dataset was originally obtained from the National Medical Expenditure Survey (NMES) conducted in 1987 and includes 4406 respondents who were aged 66 or older and covered by the Medicare program. The dataset description and summary statistics are given in Table 9, which shows that the variance of hosp is about twice the mean, suggesting overdispersion.
Estimated coefficients of all models, together with related statistics, are listed in Tables 10 and 11. While Poisson regression provides a baseline model for count data, the other models fit better than the basic Poisson regression model. The zero-inflated discrete Lindley model has the best fit of all the models considered.
Looking at the standard errors of all models, we see that the different approaches to overdispersion lead to very similar estimated standard errors and, looking at the AIC, AICc and BIC criteria, we conclude that the zero-inflated discrete Lindley model is the best-fitting model.


6 Conclusion
In this paper, considering a discretization method based on an infinite series, we introduced an alternative discrete Lindley distribution. Some characteristics and properties of this distribution were presented and studied, and it was found that it can be used in the analysis of overdispersed data. Monte Carlo studies showed that the maximum likelihood estimator for this distribution is asymptotically unbiased, that its bias and mean squared error are lower than those for the discrete Lindley distribution obtained by survival function considered in Bakouch et al. (2014), that the coverage probabilities range from 0.94 to 0.96, and that the coverage length goes to zero as the sample size increases. In the applications considered, the DLIS distribution had a better or equivalent fit compared to the other distributions, leading to the conclusion that it could be a good alternative for overdispersed count data with or without covariates. In particular, it is better than the DLS distribution in computational aspects (simulation and estimation), in the simplicity of its equations and in goodness of fit.
References
Aghababaei Jazi, M., Lai, C. D., Hossein Alamatsaz, M. (2010). A discrete inverse Weibull distribution and estimation of its parameters. Statistical Methodology, 7, 121–132.
Akaike, H. (1974). A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19 (6), 716–723.
Bakouch, H. S., Jazi, M. A., Nadarajah, S. (2014). A new discrete distribution. Statistics, 48 (1), 200–240.
Bhat, H. S., Kumar, N. (2010). On the derivation of the bayesian information criterion. School of Natural Sciences, University of California.
Bi, Z., Faloutsos, C., Korn, F. (2001). The DGX distribution for mining massive, skewed data. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp. 17–26.
Bracquemond, C., Gaudoin, O. (2003). A survey on discrete lifetime distributions. International Journal of Reliability, Quality and Safety Engineering, 10 (01), 69–98.
Cavanaugh, J. E. (1997). Unifying the derivations for the akaike and corrected akaike information criteria. Statistics & Probability Letters, 33 (2), 201–208.
Chakraborty, S. (2015). Generating discrete analogues of continuous probability distributions - a survey of methods and constructions. Journal of Statistical Distributions and Applications, 2 (1), 1–30.
Chakraborty, S., Chakravarty, D. (2012). Discrete gamma distributions: properties and parameter estimations. Communications in Statistics-Theory and Methods, 41 (18), 3301–3324.
Collett, D. (2003). Modelling Survival Data in Medical Research, 2nd edn. Chapman and Hall, New York.
Cox, D. R., Snell, E. J. (1968). A general definition of residuals. Journal of the Royal Statistical Society Series B (Methodological), 30 (2), 248–275.
Deb, P., Trivedi, P. K. (1997). Demand for medical care by the elderly: a finite mixture approach. Journal of Applied Econometrics, 12 (3), 313–336.
Doray, L. G., Luong, A. (1997). Efficient estimators for the good family. Communications in Statistics-Simulation and Computation, 26 (3), 1075–1088.
Gómez-Déniz, E., Calderín-Ojeda, E. (2011). The discrete Lindley distribution: properties and applications. Journal of Statistical Computation and Simulation, 81 (11), 1405–1416.
Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40 (3-4), 237–264.
Haight, F. A. (1957). Queueing with balking. Biometrika, 44 (3/4), 360–369.
Hamada, M. S., Wilson, A. G., Reese, C. S., Martz, H. F. (2008). Bayesian reliability. Springer Series in Statistics, Springer, New York.
Hussain, T., Ahmad, M. (2014). Discrete inverse Rayleigh distribution. Pak J Statist, 30 (2), 203–222.
Inusah, S., J. Kozubowski, T. (2006). A discrete analogue of the Laplace distribution. Journal of Statistical Planning and Inference, 136.
Kalbfleisch, J. D., Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data, 2nd edn. Wiley, New York, NY.
Keilson, J., Gerber, H. (1971). Some results for discrete unimodality. Journal of the American Statistical Association, 66.
Kemp, A. W. (1997). Characterizations of a discrete normal distribution. Journal of Statistical Planning and Inference, 63 (2), 223 – 229, in Honor of C.R. Rao.
Kemp, A. W. (2004). Classes of discrete lifetime distributions. Taylor & Francis.
Kemp, A. W. (2008). The discrete half-normal distribution. In: Advances in mathematical and statistical modeling, Springer, pp. 353–360.
Klein, J. P., Moeschberger, M. L. (1997). Survival Analysis: Techniques for Censored and Truncated Data. Springer-Verlag, New York.
Kozubowski, T. J., Inusah, S. (2006). A skew Laplace distribution on integers. Annals of the Institute of Statistical Mathematics, 58 (3), 555–571.
Krishna, H., Pundir, P. S. (2009). Discrete Burr and discrete Pareto distributions. Statistical Methodology, 6 (2), 177–188.
Kulasekera, K., Tonkyn, D. W. (1992). A new discrete distribution, with applications to survival, dispersal and dispersion. Communications in Statistics-Simulation and Computation, 21 (2), 499–518.
Lawless, J. F. (2003). Statistical models and methods for lifetime data, 2nd edn. Wiley Series in Probability and Statistics, Wiley-Interscience [John Wiley & Sons], Hoboken, NJ.
Lee, E. T., Wang, J. W. (2003). Statistical methods for survival data analysis, 3rd edn. Wiley Series in Probability and Statistics, Hoboken, NJ.
Liu, W., Cela, J. (2008). Count data models in SAS. In: SAS Global Forum, Citeseer.
Long, J. S. (1990). The origins of sex differences in science. Social forces.
Long, J. S., Freese, J., et al. (2001). Predicted probabilities for count models. Stata Journal, 1 (1), 51–7.
Meeker, W. Q., Escobar, L. A. (1998). Statistical Methods for Reliability Data. John Wiley & Sons, New York.
Nakagawa, T., Osaki, S. (1975). The discrete Weibull distribution. IEEE Transactions on Reliability, 5, 300–301.
Nekoukhou, V., Alamatsaz, M. H., Bidram, H. (2012). A discrete analog of the generalized exponential distribution. Communications in Statistics-Theory and Methods, 41, 2000–2013.
Nekoukhou, V., Alamatsaz, M. H., Bidram, H. (2013). Discrete generalized exponential distribution of a second type. Statistics, 47, 876–887.
R Core Team (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria.
Roy, D. (2003). The discrete normal distribution. Communications in Statistics-Theory and Methods, 32, 1871–1883.
Roy, D. (2004). Discrete Rayleigh distribution. Reliability, IEEE Transactions on, 53 (2), 255–260.
Sato, H., Ikota, M., Sugimoto, A., Masuda, H. (1999). A new defect distribution metrology with a consistent discrete exponential formula and its applications. Semiconductor Manufacturing, IEEE Transactions on, 12 (4), 409–418.
Siromoney, G. (1964). The general Dirichlet’s series distribution. Journal of the Indian Statistical Association, 2.