Abstract: Hypothesis testing is the most widely used method in scientific research for estimating the statistical significance of any finding. However, its use is now questionable because it fails to integrate other statistical criteria that would make studies credible and reproducible. On this basis, this paper reviews how null hypothesis significance testing has been used and the recommendations made regarding the application of other complementary statistical criteria for interpreting results. The central controversy over using only the probability value to reject or accept a hypothesis is described. According to the literature reviewed, interpreting a non-significant value as proof of the absence of an effect, or a significant value as proof of its existence, is a frequent mistake in scientific research. It is suggested that the data obtained in a study be assessed rigorously and that research reports include other statistical measures, such as the power of the test and the effect size of the intervention, to offer a more complete interpretation and increase the quality of the results. Specifically, editors of scientific journals are advised to consider the reporting of these statistics in papers that require them as part of the criteria taken into account in their evaluation.
Key words: Null hypothesis significance testing, probability value, statistical power, effect size.
Biomathematics
Statistical significance and other complementary measures for the interpretation of the research results
Received: 10 June 2023
Accepted: 15 September 2023
In the context of research activity, null hypothesis significance testing (NHST) is the inductive inferential method most frequently used in research reports (Antúnez et al. 2021). However, the criticisms of this test are so numerous that it would be difficult to deal with them exhaustively in a single paper. They range from those focused on its incorrect use in research reports to those that question its scientific value and propose its abandonment (Díaz-Batanero et al. 2019).
Over the years, the controversy about the NHST has been so intense that some scientific and professional associations, such as the American Psychological Association, the American Educational Research Association and the American Statistical Association, have recommended changes in the editorial policy of scientific journals regarding the use of the test, and have favored the use of other criteria that allow a fuller discussion of the findings (Frías et al. 2002).
The proposed changes are not alternatives to the classic statistical inference model, but a way of compensating for some of the limitations of the NHST. These recommendations refer mainly to two aspects: the need to take into account the power of the test in the studies, and the inclusion of estimates of the effect size (ES) (Hickey et al. 2018).
This study aims to review the use of the NHST and the recommendations for applying other complementary statistical measures in the interpretation of results.
All scientific research aims to explain phenomena, to build theories about their behavior and, from these, to derive estimates about reality. However, to test theories or estimate the effects of a treatment, researchers must carry out a process of hypothesis verification, in which the scientific hypothesis is translated into statistical terms (Kuffner and Walker 2019). According to Frías et al. (2002), the statistical technique of hypothesis testing and research design have depended on each other for decades.
The null hypothesis significance testing: p value. The methodological proposal of the NHST was developed between 1915 and 1933, as a result of the analyses of two schools of thought: that of Ronald Fisher (1890-1962) and that represented by Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980). The main difference between these two theories lies not in the calculations, but in the conceptions and in the underlying reasoning (Bono and Arnau 1995).
Fisher (1925) defined only a null hypothesis (H0) and, based on the sampling distribution of the test statistic, estimated the probability of the sample data in order to decide whether or not to reject it. In general terms, the decision rule was based on a probability value (p): H0 was rejected if the calculated p value was lower than 0.05. It should be noted that, although Fisher (1935, 1950 and 1955) gave priority to a significance level of 0.05, he never prescribed that this level remain fixed; it depends on the characteristics of the research.
Neyman and Pearson (1928) proposed adding an alternative hypothesis (H1) to be compared with H0, which led to the definition of two regions: rejection and acceptance. Under this framework, the decision process can lead to two potential errors: type I, defined as the probability (α) of rejecting H0 when it is true, and type II, understood as the probability (β) of accepting H0 when it is false, that is, concluding that there is no treatment effect when in fact there is one. Controlling the latter error increases the probability of detecting true positives and correctly rejecting H0, with a degree of certainty named power (1 − β) (Cochran and Cox 1999).
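The interplay of α, β and power described above can be illustrated with a short numerical sketch (an illustration added here, not part of the original studies). For a two-sided comparison of two group means with a standardized difference d and n observations per group, a normal-approximation power calculation is:

```python
from statistics import NormalDist

def power_two_sample(d, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided, two-sample z test
    for a standardized mean difference d with n_per_group per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)             # critical value for type I error alpha
    noncentrality = d * (n_per_group / 2) ** 0.5  # expected test statistic under H1
    # probability of falling in the rejection region when H1 is true
    return 1 - z.cdf(z_crit - noncentrality) + z.cdf(-z_crit - noncentrality)

# A medium standardized difference (d = 0.5) with 64 subjects per group
# yields power close to the conventional 0.80
print(round(power_two_sample(0.5, 64), 3))
```

This reproduces the well-known benchmark that roughly 64 subjects per group are needed to detect a medium effect at α = 0.05 with power near 0.80; exact t-based calculations differ slightly.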
Over the years, Fisher's p value became a way of estimating the result of the intervention group under the assumption that H0 is correct. The period 1940-1960 became known as the inference revolution, and the statistical manuals of the period presented the hybrid NHST model combining the approaches of Fisher and Neyman-Pearson. This period was characterized by an exponential increase in the application of the NHST procedure by researchers, for whom inference from the sample to the population was considered the crucial point of the studies (Bono and Arnau 1995).
Hypothesis testing acquired great importance in the seventies and eighties. Many journals adopted the obtaining of statistically significant results as a criterion for accepting articles (publication bias) (Cohen 1994). The Journal of Experimental Psychology, for example, included among its editorial rules the acceptance of only those articles with results significant at the 0.05 level, while those significant at 0.01 deserved a priority place in the journal. However, some of these results were of little practical interest, and most papers did not consider the level of risk the research could accept when interpreting the results of the statistical test (Cohen 1992). In addition, researchers in the social and psychological sciences need to know not only whether the treatment effect was significant, but also the true value of that effect (Rothman 1978).
As a result, by the nineties there were important statistical elements available for a complete interpretation of results. Authors such as Schmidt (1996) urged researchers to focus on ES estimation in the final discussion of the findings. Wilkinson et al. (1999) recommended reporting this statistic alongside the probability value. Likewise, the fourth edition of the publication manual of the American Psychological Association (1994) made certain recommendations on the style of research reports, emphasizing that researchers should provide the probability values yielded by statistical significance tests together with the values of the effect size (ES) and the statistical power, as a precaution on the reliability of the result. In this sense, a link between the significant, the important and the valid is established.
Despite these recommendations, many researchers still publish without taking them into account. Nevertheless, since the beginning of the new century there has been a tendency not to report the hypothesis test as the sole element for establishing significant differences, but to accompany it with other complementary measures that allow a practical and precise scientific discussion (Marín and Paredes 2020).
Serdar et al. (2021) state that the controversy over using the NHST as a valid instrument for scientific progress continues, as shown in meetings and congresses of the American Psychological Association, where working sessions are devoted to this discussion. In this sense, there are many studies in which researchers examine in depth the merits and shortcomings of the NHST (table 1). Some defend the practical use of the test and others question it (Díaz-Batanero et al. 2019).

Table 1 shows the chronological analysis of the historical development of statistical hypothesis testing. These studies reveal the confusion, criticism and controversy among researchers, who at first considered that reporting the p value was enough to reject or accept a hypothesis (Ioannidis 2018). Ochoa et al. (2020) state that in the scientific literature the error of interpreting a non-significant p value as proof of the absence of an effect or association is frequently observed. It is also common to interpret a significant value as evidence of the existence of an effect or association. In this sense, the absence of statistical significance (p>0.05) does not prove H0, nor does the presence of significance (p<0.05) prove H1. Any decision about superiority or inferiority is subject to uncertainty, which is not resolved by whether p falls above or below 0.05.
Wasserstein and Lazar (2016) report that, owing to the errors of interpretation in the results of hypothesis testing and the many criticisms of statistical significance, the American Statistical Association issued a statement setting out its position on the matter.
Statistical power. Bono and Arnau (1995), reviewing the development of the concept of the power of a test, point out that in the theory developed by Neyman and Pearson in 1928 the power of a statistical test is the probability of obtaining significant results. Its estimation, according to these authors, is determined by three basic components: sample size, level of significance (α) and the ES to be detected.
There are two ways to estimate the power: a priori (prefixed) and a posteriori. The first informs the researcher of the sample size needed for adequate power; power tables have been constructed for this purpose. The a posteriori power is important in the interpretation of the results of completed studies (Guerra et al. 2019).
Scheffé (1959) discusses the power of Fisher's F test in fixed-effects analysis of variance (ANOVA) models. He refers to the power tables, calculated for the values α = 0.01 and 0.05, and reproduces power charts for the F test.
Menchaca (1974, 1975), Venereo (1976), Caballero (1979) and Menchaca and Torres (1985) contributed tables of sample sizes and numbers of replications in analysis of variance models, associated with completely randomized, randomized block, Latin square and change-over designs. They include the maximum standardized difference between two means (Δ), the number of treatments (t), the level of significance (α) and the power of the test. These tables are valuable working tools for researchers in different fields. Currently, with the advance of computing, there are statistical packages that include the calculation of power, such as InfoStat, G*Power and SPSS, among others (Guerra et al. 2019).
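The a priori use of such tables (fixing α, the desired power and the difference to detect, then solving for the sample size) can be sketched with a normal approximation. This is an illustrative calculation, not a reproduction of the cited tables:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample z test
    to detect a standardized difference d with the requested power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # quantile controlling the type I error
    z_beta = z.inv_cdf(power)           # quantile controlling the type II error
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Subjects per group to detect a medium effect (d = 0.5) at alpha = 0.05, power 0.80
print(n_per_group(0.5))  # → 63 with this normal approximation
```

The normal approximation slightly understates the requirement of exact t-based tables (which give 64 per group for this case), but it shows how sample size grows as the difference to detect shrinks or the requested power rises.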
Despite the contributions of different specialists in the topic, articles still lack reports of statistical power as an indicator of the credibility of the research, which has been one of the most highlighted criticisms over the years (Cohen 1992, Clark-Carter 1997, Frías et al. 2000 and Bakker and Wicherts 2011).
Cohen (1988, 1992) established by convention a minimum power of 0.80, since it is usually more serious to claim that there is an effect when there is none than to claim that there is no effect when there is one. Authors such as Funder and Ozer (2019) report that, when the power value is lower than 0.80, it cannot be concluded that the study is totally useless; rather, valid conclusions should be drawn in light of the sample size.
The importance of statistical power when designing a study should be highlighted, so that the sample size used guarantees a high probability of detecting differences if they really exist. Performing studies of low statistical power is not ethically acceptable, since it can lead to results of uncertain scientific validity.
Effect size (ES). Cohen (1988) defined the ES as the degree to which a phenomenon is present in the population, or the degree to which the null hypothesis is false. This statistical measure evaluates in a logical way the magnitude of an aspect of interest in a quantitative study and therefore eases the assessment of its practical importance (Botella and Zamora 2017). In brief, it is not enough to identify whether a certain effect occurs; its magnitude or size must also be determined in order to know its relevance or practical significance (Ponce et al. 2021).
In general, ES indices can be classified into three broad categories: indices of the mean-difference family, indices of the relation or association family, and risk indices (relative or absolute) (Ventura 2018). Rivera (2017) showed that the scientific literature contains different formulations for calculating the ES, according to the phenomenon under study. The final interpretation of the results tends to be based on a scale of values, according to the statistical test performed in the research (table 2) (Serdar et al. 2021).
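As an illustration of the mean-difference family, Cohen's d for two independent groups divides the mean difference by the pooled standard deviation. The following sketch uses made-up data, not data from any of the cited studies:

```python
from statistics import mean, stdev

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two independent group means."""
    n1, n2 = len(group1), len(group2)
    s1, s2 = stdev(group1), stdev(group2)
    # pooled standard deviation, weighted by each group's degrees of freedom
    sp = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group1) - mean(group2)) / sp

treated = [10, 12, 14, 16, 18]   # hypothetical response values
control = [8, 10, 12, 14, 16]
print(round(cohens_d(treated, control), 3))  # ≈ 0.632, a medium effect on Cohen's scale
```

Because d is expressed in standard deviation units, the same value can be compared across studies that measured the response on different scales.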

The ES statistic provides information about how well the independent variable or variables explain the dependent variable. Low ES values mean that the independent variables do not predict adequately because they are only slightly related to the dependent variable; high ES values mean that the independent variables are good predictors of it. The ES is therefore an important statistical indicator for evaluating the efficacy of any treatment or intervention on a given response (Ventura 2018). In addition, Bologna (2014) states that ES measures, when standardized, overcome the dependence of hypothesis tests on sample size, and they serve to make comparisons among studies on the same topic by bringing the results to a common metric.
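Bologna's point that standardized ES measures do not depend on sample size, while p values do, can be shown numerically (a normal-approximation sketch added here for illustration): the same small standardized difference is non-significant with 20 subjects per group but highly significant with 500, even though d itself is unchanged.

```python
from statistics import NormalDist

def p_two_sample(d, n_per_group):
    """Two-sided p value of a two-sample z test for a standardized difference d."""
    z_stat = d * (n_per_group / 2) ** 0.5  # test statistic grows with sample size
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

d = 0.2  # a small effect on Cohen's scale, identical in both scenarios
print(round(p_two_sample(d, 20), 3))   # small sample: p > 0.05, not significant
print(round(p_two_sample(d, 500), 5))  # large sample: p < 0.01, highly significant
```

This is why reporting only the p value can make a trivial effect look important in a large study, or hide a meaningful effect in a small one.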
The publication manual of the American Psychological Association (2001) concludes that it is necessary to report the ES together with the p value to answer three basic questions of the research: (a) is there a real effect, or should the result be attributed to chance? (b) if the effect is real, how large is it? and (c) is the effect large enough to be considered important or useful?
For all these reasons, the ES is considered a complementary analysis to the NHST that helps to correct the limitations of this test. However, despite its practical value, its use in research reports is not frequent.
Use of the complementary measures in the literature. Since the nineties, statistical experts have been aware that the NHST is, in many respects, inadequate for interpreting research results. Even so, the use of other complementary measures in the reporting of results is still not fully established (Ochoa et al. 2019).
The sixth edition of the publication manual of the American Psychological Association (2010) pointed out the need to take statistical power seriously, providing information showing that the study has enough power to detect effects of substantive interest. However, the persistent lack of interest in the power of statistical tests will change only when the editors of leading journals demand this analysis in their editorial policies (Frías et al. 2002).
Studies performed by different authors since 2010 show an increase in the use of the ES, mainly in psychology journals, which demand the use of this statistic as a rule. Authors such as Odgaard and Fowler (2010) reviewed the intervention studies published in 2003, 2004, 2007 and 2008 in the Journal of Consulting and Clinical Psychology and found that, overall, 75 % of the studies reported some ES index.
Sun et al. (2010) analyzed the articles published between 2005 and 2007 in five journals (Journal of Educational Psychology, Journal of Experimental Psychology: Applied, Journal of Experimental Psychology: Human Perception and Performance, Journal of Experimental Psychology: Learning, Memory & Cognition, and School Psychology Quarterly) and found that only 40 % of them reported some ES index.
McMillan and Foley (2011) consulted a total of 417 articles published between 2008 and 2010 in four specialized journals of education and psychology (Journal of Educational Psychology, Journal of Experimental Education, Journal of Educational Research, and Contemporary Educational Psychology) and found that 74 % of the studies reported some measure of the ES. These authors concluded that, although the use of ES indices in research reports has increased, the discussions of their meaning remain poor owing to a lack of argument, or to ignorance of what this value represents in the study.
Sesé and Palmer (2012) analyzed the use of these statistics in the articles published in 2010 in eight journals (Journal of Behavioural Medicine, Behaviour Research and Therapy, Depression and Anxiety, Behavior Therapy, Journal of Anxiety Disorders, International Journal of Clinical and Health Psychology, British Journal of Clinical Psychology, and British Journal of Health Psychology). These authors found that ES indices were reported in 61.04 % of the articles.
Caperos and Pardo (2013) examined the articles published in four Spanish journals covering several disciplines (Anales de Psicología, Psicológica, Psicothema, and Spanish Journal of Psychology), indexed in the Journal Citation Reports (JCR) database. Their results show that only 24.3 % of the NHSTs performed were accompanied by an ES statistic and by the statistical power.
Rendón et al. (2021) concluded that one of the seven most common mistakes in articles is to omit the report of the statistical power and the ES. Currently, some mainstream journals do not accept for publication quantitative research articles in which these statistics are not reported. Since 2020, journals such as Memory and Cognition, Educational and Psychological Measurement, Measurement and Evaluation in Counseling and Development, Journal of Experimental Education and Journal of Applied Psychology have decided to regulate the use of measures complementary to the NHST in the statistical analysis, for the correct interpretation and practical importance of the results (Serdar et al. 2021).
It is concluded that the NHST is not enough to perform a rigorous assessment of the data obtained in a research study. It is considered necessary to include in study reports other statistical measures, such as the power of the test and the effect size, to offer a more complete interpretation of the results. Although many authors have addressed the subject, these measures still need to be calculated routinely to evaluate the quality of scientific research. Editors of scientific journals are advised to include these statistics in their editorial rules.
*Email: femtorresm@ica.edu.cu

