Abstract:
INTRODUCTION On March 11, 2020, WHO declared COVID-19 a pandemic and called on governments to impose drastic measures to fight it. It is vitally important for government health authorities and leaders to have reliable estimates of infected cases and deaths in order to apply the necessary measures with the resources at their disposal.
OBJECTIVE Test the validity of the logistic regression and Gompertz curve to forecast peaks of confirmed cases and deaths in Cuba, as well as total number of cases.
METHODS An inferential, predictive study was conducted using logistic and Gompertz growth curves, fitted by the least squares method with software tools, to analyze and predict growth in COVID-19 cases and deaths. Italy and Spain, countries that had passed the initial peak of infection rates, were studied, and it was inferred from their results that the models were applicable to Cuba. This hypothesis was tested by applying goodness-of-fit tests to the models and significance tests to their parameters.
RESULTS Both models showed good fit and low mean square errors, and all parameters were highly significant.
CONCLUSIONS The validity of models was confirmed based on logistic regression and the Gompertz curve to forecast the dates of peak infections and deaths, as well as total number of cases in Cuba.
Keywords: COVID-19, SARS-CoV-2, logistic models, pandemic, mortality, Cuba
ORIGINAL RESEARCH
COVID-19 Forecasts for Cuba Using Logistic Regression and Gompertz Curves
Received: 29 April 2020
Accepted: 17 July 2020
Published: 31 July 2020
The COVID-19 pandemic and the characteristics of the SARS-CoV-2 viral agent[1] have led many governments to restrict social contact in order to cut the chain of transmission and thus reduce cases and deaths. The measures include some variation of lockdown, which in various countries has proven effective at curbing disease spread, flattening the curve and avoiding health system saturation.[2] Thus, it is vitally important for decision-makers to be able to approximate the maximum number of infections and deaths expected, as well as when caseload peaks will occur.
In Cuba, many measures have been implemented to mitigate COVID-19 spread and to limit the severity of cases and deaths.[3] However, until April 22, 2020, the increase in confirmed case numbers was approximately exponential. Going forward, reliable estimates are needed to inform decision-making in the context of limited resources.
Many such forecasts are made using mathematical modeling. A classic epidemiological model is SIR (Susceptible, Infectious, Recovered), based on ordinary differential equations. This modeling has been used successfully for the COVID-19 pandemic in some regions.[4,5] In Cuba, several authors have also applied it to the COVID-19 pandemic.[6–8]
Other techniques that have been used for modeling COVID-19 are:
Statistical time-series models to predict the number of infections and/or deaths[9]
Data processing to obtain forecasting models using the internet[10]
Models based on artificial intelligence and machine learning[11,12]
These approaches are based on parameters that describe different characteristics of the pandemic. The estimation of these guiding parameters is complex, requiring controlled study of samples or use of approximations. Interpreting the models themselves is also complex.
Among the statistical models are logistic population growth models and the Gompertz growth model.[13] These models have been used in the COVID-19 pandemic and are less complex than those previously mentioned. However, they are limited to short-term forecasts, since they incorporate few parameters related to changes in epidemic dynamics, such as those that are sensitive to actions of a clinical nature or to transmission-mitigation measures. The parameters of these models are estimated by the nonlinear least squares method. This modeling has been applied worldwide to forecast incidence and prevalence rates.
Various studies have used logistic models to make predictions regarding COVID-19’s epidemiologic dynamics and the disease’s effects. Batista used the logistic regression model to study the magnitude of the pandemic in China through February 25, 2020;[14] Morais used it in forecasting deaths in China, Iran, Italy, South Korea and Spain;[15] Tátrai and Várallyay applied the model to predict the peaks in various countries affected by COVID-19 and assessed the quality of its fit with data from various regions in China affected by COVID-19.[16] Wu used a logistic model to estimate the peak in confirmed cases for Europe and the United States, and evaluated goodness-of-fit using a sample of 29 provinces in China and 19 countries that had passed the peak.[17] Qaedan used a logarithmic-logistic model to obtain predictions for the state of Utah in the United States and assessed its fit based on adjustments made in South Korea and Italy.[18]
Some studies have implemented the Gompertz model. Mazurek and Nenickova applied it to predict the pandemic’s peaks in the United States.[19] Mazurek took a similar approach to study data for the United Kingdom, the Russian Federation, Turkey and the world as a whole;[20] and Razzak applied the model to predict the course of the pandemic in New Zealand.[21]
Other studies have used both models simultaneously to obtain forecasts for COVID-19. Jia used Gompertz, Bertalanffy, and logistic models to predict COVID-19 case numbers in various regions in China. These authors first studied the models’ goodness-of-fit using data from SARS-CoV-1 confirmed cases in China in 2003.[22] Similarly, based on the goodness-of-fit of the logistic model and the Gompertz model for the data from China and South Korea, Villalobos presented predictions for Costa Rica.[23] Milhinhos and Costa adjusted logarithmic-logistic models and logarithmic-Gaussian models to obtain forecasts for Portugal based on their goodness-of-fit for distribution of COVID-19 data in South Korea.[24]
Dattoli used a three-parameter logistic model and the Gompertz model to make estimates for Italy.[25] Bauckhage used the logistic and Gompertz models to obtain predictions for Germany for mid-April 2020,[26] while Rodrigues-Silva used these models to obtain predictions for the state of Goias in Brazil[27] and Dutra used them to estimate the number of persons affected by COVID-19 for various US states and the whole country.[28] Attanyake fitted logistic, Gompertz and other exponential models to data corresponding to the impact of COVID-19 in Sri Lanka, Italy and Hubei, a province in central China.[29] Ahmadi adjusted the Gompertz, Bertalanffy and cubic polynomial models to forecast pandemic dynamics for April 2020 in Iran.[30]
The ordinary differential equations presented in Equation 1 and Equation 2 are known as the logistic differential equation (or Verhulst equation) and Gompertz equation, respectively.[31]
Equation 1: Logistic differential equation
[Equation 1]
Equation 2: Gompertz differential equation
[Equation 2]
Both describe the growth of populations, where P(t) represents the number of organisms or the size of a population at a given moment in time, r represents the instantaneous rate of increase, and K corresponds to the carrying capacity of the environment, that is, the maximum number of individuals that the population can sustain. K and r are positive real numbers, and the function P(t) is positive and monotonically increasing. It is suitable for representing epidemics because it shows rapid initial growth that is approximately exponential; then, as the number of infections increases, the number of non-infected individuals in the population decreases. As a result, the relative growth rate within the population decreases until growth stops when there are no individuals left to infect.
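For reference, the textbook forms of these two equations can be sketched with the symbols defined above. This is an assumption about notation: the typeset equations are not reproduced in this text, and the authors' exact arrangement of terms may differ.

```latex
% Standard (textbook) forms, assumed here; P = P(t), r > 0, K > 0.
\begin{align}
  \frac{dP}{dt} &= r\,P\left(1 - \frac{P}{K}\right)
      && \text{(logistic, or Verhulst, equation)} \\
  \frac{dP}{dt} &= r\,P\,\ln\!\left(\frac{K}{P}\right)
      && \text{(Gompertz equation)}
\end{align}
```

In both forms the right-hand side vanishes as P approaches K, which matches the description of growth stopping at the carrying capacity.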
Both models have explicit solutions, given by Equation 3 for the logistic model and Equation 4 for the Gompertz model.
Equation 3: Logistic model (b > 0)
[Equation 3]
Equation 4: Gompertz model (b > 0)
[Equation 4]
P0 represents the population (P) at the start of the growth process (0 < P0 < K). In both sigmoid models, the b parameter is associated with displacement along the abscissa (time) axis. It is obtained through a change of variables (in Equation 3, algebraic transformations were applied before the change of variables).
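One common way to write these explicit solutions, with b acting as a shift along the time axis, is sketched below. The parameterization is an assumption made for illustration; the authors' Equations 3 and 4 may be arranged differently, although any equivalent form solves the same differential equations.

```latex
% Assumed parameterization: b shifts the curve along the time axis.
\begin{align}
  \text{Logistic:} \quad
    P(t) &= \frac{K}{1 + e^{-r\,(t - b)}},
    & b &= \frac{1}{r}\,\ln\!\left(\frac{K - P_0}{P_0}\right) \\
  \text{Gompertz:} \quad
    P(t) &= K \exp\!\left(-e^{-r\,(t - b)}\right),
    & b &= \frac{1}{r}\,\ln\!\left(\ln\frac{K}{P_0}\right)
\end{align}
% The expressions for b follow from setting P(0) = P_0.
% Under this form, b > 0 requires P_0 < K/2 (logistic) and P_0 < K/e (Gompertz).
```

Under this assumed form, a larger b moves the whole curve to the right without changing its shape, which is the displacement role described above.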
The inflection point for these population growth models is of interest, as it represents the moment at which the rate of growth is highest, which can be interpreted as the peak of the pandemic. The inflection point for the logistic model is presented in Equation 5 while the inflection point for the Gompertz curve is presented in Equation 6. In the logistic model, this point is at 50% of population growth (the logistic function is symmetrical with regard to this point) while this point on the Gompertz model is approximately located between 35% and 40% of population growth.[31]
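The location of the inflection point can be checked numerically. The sketch below assumes one common parameterization of the two curves, P(t) = K / (1 + exp(-r (t - b))) and P(t) = K exp(-exp(-r (t - b))), which may differ from the authors' Equations 3 and 4; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math

def logistic(t, K, r, b):
    """Logistic curve under the assumed parameterization."""
    return K / (1.0 + math.exp(-r * (t - b)))

def gompertz(t, K, r, b):
    """Gompertz curve under the assumed parameterization."""
    return K * math.exp(-math.exp(-r * (t - b)))

def daily_growth(model, t, K, r, b, h=1e-4):
    """Central-difference estimate of dP/dt at time t."""
    return (model(t + h, K, r, b) - model(t - h, K, r, b)) / (2.0 * h)

def peak_day(model, K, r, b, t_max=120.0, step=0.01):
    """Scan a time grid and return the t at which dP/dt is largest."""
    n = int(round(t_max / step))
    best_t, best_rate = 0.0, float("-inf")
    for i in range(n + 1):
        t = i * step
        rate = daily_growth(model, t, K, r, b)
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t

# Illustrative (made-up) parameter values, not fitted to any country.
K, r, b = 1000.0, 0.2, 30.0
for name, model in (("logistic", logistic), ("Gompertz", gompertz)):
    t_peak = peak_day(model, K, r, b)
    share = model(t_peak, K, r, b) / K
    print(f"{name}: peak near day {t_peak:.2f}, at {share:.1%} of K")
```

With this parameterization the peak falls at t = b for both curves, at half of K for the logistic curve and at K/e (about 37% of K) for the Gompertz curve, in line with the 50% and 35% to 40% figures quoted above.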
Equation 5: Inflection point of the logistic model
[Equation 5]
Equation 6: Inflection point of the Gompertz model
[Equation 6]
The relative rate of population growth is linear in the logistic process (Equation 7) and logarithmic in the Gompertz process (Equation 8). The Gompertz growth process develops more slowly than the logistic process.[31]
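The two relative growth rates can be written out by dividing the standard forms of the logistic and Gompertz differential equations by P. This is a sketch under the assumption that the equations take their usual textbook form; the authors' typeset Equations 7 and 8 are not reproduced in this text.

```latex
% Relative (per-capita) growth rate, assuming the textbook differential equations.
\begin{align}
  \frac{1}{P}\frac{dP}{dt} &= r\left(1 - \frac{P}{K}\right)
      && \text{(logistic: linear in } P\text{)} \\
  \frac{1}{P}\frac{dP}{dt} &= r\,\ln\!\left(\frac{K}{P}\right) = r\,(\ln K - \ln P)
      && \text{(Gompertz: logarithmic in } P\text{)}
\end{align}
```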
Equation 7: Relative population growth rate in the logistic model
[Equation 7]
Equation 8: Relative population growth rate in the Gompertz model
[Equation 8]
This study aims to fit logistic and Gompertz models to the distribution of COVID-19 confirmed cases and deaths in Cuba, to demonstrate that the models fit these distributions well enough to be generalized as predictive models, and to forecast the peak dates of confirmed cases and deaths due to COVID-19 in Cuba.
The first aspect studied was the fit of the models used for the distribution of COVID-19 confirmed cases and deaths in Spain and Italy, countries that had passed the peak of the pandemic. The good fit of these models in those countries and their comparative simplicity in relation to other models has piqued interest in applying them to forecasting in Cuba. The adequacy of the models in estimating distribution of confirmed cases and deaths in Cuba was assessed by analyzing the parameters for goodness of fit and testing the models themselves for statistical significance.
Design and participants This is an inferential and predictive study using the logistic model and the Gompertz growth curve. Curves were fitted by the least squares technique for models that are nonlinear in their parameters.
This study was conducted from March 16 to April 22, 2020, while Cuba was experiencing the impact of COVID-19, by a group of professors from the Mathematics Department at the Carlos Rafael Rodriguez University of Cienfuegos in collaboration with the Department of Educational Technology at the same institution.
Official data on the number of confirmed cases and deaths from COVID-19 reported by the governments of different countries were studied as summarized by WHO and recorded and published by Johns Hopkins University. These data are updated daily and show cumulative confirmed cases, deaths and recoveries from the disease for different countries and territories. The first record in this database is from January 22, 2020.[32] Data were collected until April 22, 2020.
For the countries studied, documentation began with the date of the first recorded confirmed cases or deaths in the territory (Table 1). The daily cumulative cases were recorded in both analyses. In Cuba, the first cases were confirmed on March 11, 2020, but they were recorded in the database the following day.

Study variables The variables analyzed in this investigation are discrete quantitative variables, specifically:
Number of days elapsed since the first positive cases of COVID-19 were confirmed. Each data point for this variable is recorded on a daily basis: for example, in the case of Cuba, the first day corresponds to March 12, 2020 and the second corresponds to March 13.
Number of days elapsed since the first confirmed deaths of patients diagnosed with COVID-19. These values are recorded in a similar way to the previous variable, but using the database corresponding to deaths.
Number of confirmed daily cumulative cases for COVID-19.
Number of daily cumulative deaths for patients diagnosed with COVID-19.
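The day-count variables above amount to simple date arithmetic, with the date of the first recorded case or death counted as day 1. A minimal sketch follows, using only dates stated in this article; the function names are hypothetical.

```python
from datetime import date, timedelta

def day_index(first_record, day):
    """Days elapsed, counting the date of the first record as day 1."""
    return (day - first_record).days + 1

def date_of_day(first_record, index):
    """Calendar date corresponding to a given day index (inverse of day_index)."""
    return first_record + timedelta(days=index - 1)

# Dates taken from the text: Cuba's first cases were recorded on March 12, 2020,
# its first death on March 18, 2020, and data collection ended on April 22, 2020.
cuba_first_case = date(2020, 3, 12)
cuba_first_death = date(2020, 3, 18)
last_day = date(2020, 4, 22)

print(day_index(cuba_first_case, last_day))    # days of case data
print(day_index(cuba_first_death, last_day))   # days of death data
print(date_of_day(cuba_first_case, 2))         # second day of the series
```

The same convention converts a forecast expressed as a day number back into a calendar date.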
Data Management and Processing Data were downloaded daily as .csv files and decoded using scripts written for that purpose.
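The authors' decoding scripts are not shown. The following sketch only illustrates what such a step might look like, assuming a wide file layout with one row per territory and one column per date (similar to publicly distributed time-series files). The column names, the sample rows and the country names are made up for illustration and are not real data.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample in an assumed wide layout; the numbers are placeholders.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,3/10/20,3/11/20,3/12/20,3/13/20
,Exampleland,0.0,0.0,0,0,3,4
North,Otherland,1.0,1.0,1,2,2,5
South,Otherland,2.0,2.0,0,1,1,1
"""

def cumulative_series(csv_text, country):
    """Return [(date, cumulative count)] for a country, summing its rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    date_columns = header[4:]            # the first four columns are metadata
    totals = [0] * len(date_columns)
    for row in reader:
        if row[1] != country:
            continue
        for i, value in enumerate(row[4:]):
            totals[i] += int(value)
    dates = [datetime.strptime(d, "%m/%d/%y").date() for d in date_columns]
    return list(zip(dates, totals))

def from_first_report(series):
    """Drop leading days with zero counts so the series starts at day 1."""
    for i, (_, count) in enumerate(series):
        if count > 0:
            return series[i:]
    return []

print(from_first_report(cumulative_series(SAMPLE_CSV, "Exampleland")))
```

Trimming the leading zeros reflects the convention described above, in which each country's series starts on the date of its first recorded case or death.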
The Maxima 5.41.0 symbolic computation program[33] and the R 3.6.1 programming language[34] were used for symbolic and numerical processing of the data.
To apply the least squares method, the lsquares.mac package (version 5.41.0) was used in Maxima; in R, the nls, SSlogis and SSgompertz functions from the stats package (version 3.6.1) and drm from the drc package (version 3.0-1) were used. The root mean square error (RMSE) and the significance of the model parameters were obtained with the summary function from the stats package (version 3.6.1), and the adjusted R2 was calculated using rSquared from the miscTools package (version 0.6-22). Goodness-of-fit of the models was tested with the neill.test function from the drc package (version 3.0-1).
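The study itself relied on R and Maxima for curve fitting. As a language-neutral illustration of the least squares idea, the sketch below fits either curve with a swapped-in technique: separable least squares (for fixed r and b, the best K has a closed form) combined with a repeatedly refined grid search. It is not the Gauss-Newton iteration that nls uses by default, and it is not the authors' code. The parameterization, the function name fit_curve, the search ranges and the synthetic data are all assumptions.

```python
import math

def logistic(t, K, r, b):
    return K / (1.0 + math.exp(-r * (t - b)))

def gompertz(t, K, r, b):
    return K * math.exp(-math.exp(-r * (t - b)))

def fit_curve(model, days, counts, r_range=(0.01, 1.0), b_range=(1.0, 120.0),
              rounds=6, steps=20):
    """Least squares fit of K, r, b by grid search with successive refinement."""
    r_lo, r_hi = r_range
    b_lo, b_hi = b_range
    best = None  # (sse, K, r, b)
    for _ in range(rounds):
        r_step = (r_hi - r_lo) / steps
        b_step = (b_hi - b_lo) / steps
        for i in range(steps + 1):
            r = r_lo + r_step * i
            for j in range(steps + 1):
                b = b_lo + b_step * j
                # Curve shape for K = 1; the model is linear in K.
                shape = [model(t, 1.0, r, b) for t in days]
                denom = sum(s * s for s in shape)
                if denom == 0.0:
                    continue
                K = sum(y * s for y, s in zip(counts, shape)) / denom
                sse = sum((y - K * s) ** 2 for y, s in zip(counts, shape))
                if best is None or sse < best[0]:
                    best = (sse, K, r, b)
        # Zoom in: search one grid cell either side of the current best point.
        _, _, r_best, b_best = best
        r_lo, r_hi = max(r_best - r_step, 1e-6), r_best + r_step
        b_lo, b_hi = b_best - b_step, b_best + b_step
    sse, K, r, b = best
    return {"K": K, "r": r, "b": b, "sse": sse}

# Synthetic, noise-free data generated from known (made-up) parameters.
days = list(range(1, 61))
counts = [logistic(t, 1000.0, 0.2, 30.0) for t in days]
print(fit_curve(logistic, days, counts))
```

A grid search is slower and cruder than a gradient-based solver, but it needs no starting values, which is the role the self-starting SSlogis and SSgompertz models play in R.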
Analysis The logistic and Gompertz models were fitted to the data published for COVID-19 for confirmed cases and deaths in Spain and Italy. Italy had its peak of confirmed cases on March 26, 2020 and its peak deaths on March 27, 2020.[35] Spain had its peaks of confirmed cases and deaths on March 31, 2020 and April 2, 2020, respectively.[36] As of April 22, 2020, according to the Johns Hopkins database, Italy had reported a total of 187,327 confirmed cases due to COVID-19 with 25,085 deaths, while Spain had recorded 208,389 confirmed cases and 21,717 deaths. As these countries had passed the peak of the pandemic, the official published data on the peaks was compared to the forecasts obtained using the models.
The RMSE and the adjusted coefficient of determination (adjusted R2) were calculated to study the goodness-of-fit of the models. For both models, values of R2 close to 1 and lower values of RMSE indicate a better fit.
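Both statistics are simple functions of the residuals. The sketch below uses common textbook definitions, which are assumptions here: RMSE is taken as the square root of the mean squared residual, and the adjusted R2 penalizes the number of fitted parameters. The exact conventions of the R functions cited in this article may differ; for example, R's summary of an nls fit reports a residual standard error that divides by n minus the number of parameters rather than by n. The numbers are toy values, not study data.

```python
import math

def rmse(observed, fitted):
    """Root mean square error: sqrt(mean of squared residuals)."""
    n = len(observed)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(observed, fitted)) / n)

def adjusted_r_squared(observed, fitted, n_params):
    """Adjusted R2 = 1 - (SSE / (n - p)) / (SST / (n - 1)), p = fitted parameters."""
    n = len(observed)
    mean_y = sum(observed) / n
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    sst = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - (sse / (n - n_params)) / (sst / (n - 1))

# Toy values for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
fitted = [3.0, 3.0, 7.0, 7.0]
print(rmse(observed, fitted))
print(adjusted_r_squared(observed, fitted, n_params=2))
```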
The models were fitted to the data published for COVID-19 confirmed cases and deaths in Cuba. Goodness-of-fit was assessed using R2 and RMSE. Significance of the models’ fitted coefficients was determined using the t test. Goodness-of-fit was verified using the Neill test, which is suitable for models that are nonlinear in their parameters and which uses grouping techniques when there are no replicates.[37] The significance threshold selected a priori was alpha = 0.05.
Once the models’ statistical significance had been demonstrated for distributions of confirmed COVID-19 cases and deaths in Cuba, these models were used to forecast the same.
Case Study, Italy The first case was recorded on January 31, 2020. However, it was not until February 21 that exponential growth of the pandemic was officially reported. Figure 1 plots the cumulative confirmed cases together with the fitted logistic model (Equation 3) and Gompertz curve (Equation 4). Table 2 presents the fitted coefficients for each model, R2, the RMSE values obtained for each, and the forecasted peaks. Both models show an R2 greater than 0.99, with a notably lower RMSE for the Gompertz model. The logistic model forecast the peak at 60 days after the first case (March 30), while the Gompertz model forecast it at 57 days (March 27).
Case Study, Spain The first case was recorded on February 1, 2020. However, it was not until February 25 that the pandemic’s exponential growth was officially reported. Figure 2 plots the cumulative confirmed cases together with the fitted logistic model (Equation 3) and Gompertz model (Equation 4).
Table 2 shows R2 greater than 0.99 for both models. The Gompertz model shows a lower RMSE than the logistic model, which suggests a better fit. The logistic model places the estimated peak at 62 days (April 2), while the Gompertz model places it at 59 days (March 30).




Case Study, Italy The first death was reported on February 21. Figure 3 plots the observed cumulative deaths and the estimates from the logistic model (Equation 3) and Gompertz curve (Equation 4). Both models have an R2 greater than 0.99; however, the Gompertz model has a lower RMSE than the logistic model (Table 3). The logistic model has a forecasted peak at 41 days (April 1), while the forecasted peak for the Gompertz model is at 39 days (March 30) after the first reported death in the country (February 21).
Case Study, Spain The first death was reported on March 3. Figure 4 plots the observed and predicted cases and deaths according to the logistic model (Equation 3) and Gompertz model (Equation 4). Both models had an R2 higher than 0.99; however, the Gompertz model had a smaller RMSE (Table 3). The logistic model had a projected peak at 33 days (April 4), while the projected peak for the Gompertz model was at 30 days (April 1) after the reporting of the first death in the country.
Estimation for Cuba The first cases were diagnosed on March 11 and recorded on March 12, and the first death occurred on March 18. As of April 22, it had been 42 days since the first report of infection and 36 days since the first death. Figure 5 plots the observed cumulative confirmed cases and deaths together with the fitted logistic model (Equation 3) and Gompertz curve (Equation 4). The graph shows that the models fit the data well and that the growth in the data lies within the 95% prediction interval.
The model-generated forecasts for Cuba provide a projected peak of infection between 34 and 39 days after first report of COVID-19 cases (March 12) and put the peak of deaths between 32 and 49 days after confirmation of the first death in the country (March 18). As with Spain and Italy, the Gompertz model forecast a greater total number of confirmed cases and deaths than the logistic model. Table 4 shows the coefficients corresponding to the logistic models and Gompertz models fitted to the reported Cuban data for confirmed COVID-19 cases and deaths. The criteria for the goodness-of-fit were similar for both models; they are slightly better in the Gompertz model for the distribution of confirmed cases and in the logistic model for the distribution of deaths.
Associated p values for the significance tests for the coefficients were all less than 0.05, indicating that the models were acceptable. Goodness-of-fit was assessed using the Neill test, which yielded significance levels higher than 0.05 for each model in each of the applied distributions (confirmed cases and deaths), so that lack of fit was not detected. This also indicates an acceptable fit for the models and thus their suitability for prognostic purposes.
Forecasts for the days with the highest numbers of infection and deaths were obtained using the calculation of the inflection point in each adjusted model and the cumulative totals corresponding to the K parameter (Table 4).


The logistic growth and Gompertz models provided good forecasts for Italy and Spain. For both countries, the Gompertz model gave better estimates of the peaks in confirmed cases and deaths. For Italy, its forecasts were one day later than the actual peak of infections and three days later than the actual peak of deaths. For Spain, its forecasts for the peaks in infections and deaths were each one day earlier than the dates on which these peaks actually occurred. The Gompertz model forecast a higher total number of cases and deaths than the logistic model in both countries.
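The forecast errors quoted above can be reproduced from the peak dates reported earlier in this article. The sketch below uses only those dates; the dictionary layout and names are illustrative choices. A positive error means the forecast peak fell after the actual peak.

```python
from datetime import date

# Actual peak dates reported in the text (all in 2020).
actual = {
    ("Italy", "cases"): date(2020, 3, 26),
    ("Italy", "deaths"): date(2020, 3, 27),
    ("Spain", "cases"): date(2020, 3, 31),
    ("Spain", "deaths"): date(2020, 4, 2),
}

# Forecast peak dates reported in the text for each model.
forecast = {
    "logistic": {
        ("Italy", "cases"): date(2020, 3, 30),
        ("Italy", "deaths"): date(2020, 4, 1),
        ("Spain", "cases"): date(2020, 4, 2),
        ("Spain", "deaths"): date(2020, 4, 4),
    },
    "gompertz": {
        ("Italy", "cases"): date(2020, 3, 27),
        ("Italy", "deaths"): date(2020, 3, 30),
        ("Spain", "cases"): date(2020, 3, 30),
        ("Spain", "deaths"): date(2020, 4, 1),
    },
}

def forecast_errors(model):
    """Signed error in days (forecast minus actual) for each country and series."""
    return {key: (forecast[model][key] - actual[key]).days for key in actual}

for model in forecast:
    print(model, forecast_errors(model))
```

Under these dates the Gompertz forecasts are off by at most three days, while the logistic forecasts are two to five days late, which is consistent with the comparison made in the text.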


The authors hypothesized that if the models provided good forecasts for Spain and Italy, they would also do so for Cuba. Various authors[16–18,22–24] have used this subjective principle of plausibility and have anticipated goodness-of-fit in territories that had not yet passed the peak of the pandemic, based on adequate fit in other territories that had passed their peaks.
To test this hypothesis, the models were fitted to the distribution of confirmed cases and deaths recorded in Cuba and goodness-of-fit was assessed. Significance testing for the models’ coefficients demonstrated their validity. Each of the models passed the Neill goodness-of-fit test, which makes it possible to generalize these models to mathematically describe the dynamics of the pandemic.
The logistic and Gompertz population growth models used to predict peaks and total numbers of infected cases and deaths due to COVID-19 have been statistically validated with the usual analytical resources, which confirmed the initial hypothesis that these models could be extrapolated and applied in Cuba. This provides two additional options that are methodologically viable to model epidemiological processes over time, especially for short-term forecasting and when the aim is not to include the influence of a large number of external factors.
Disclosures: None
jfelipemm@ucf.edu.cu








