Evidence Section
How Do We Use Risk Scores?
Acta Gastroenterológica Latinoamericana, vol. 51, no. 3, pp. 245-247, 2021
Sociedad Argentina de Gastroenterología

Received: July 6, 2021
Accepted: August 21, 2021
Published: September 27, 2021
Ronán Conroy, in the book Therapeutic Strategies in Cardiovascular Risk, begins his chapter "From epidemiological risk to clinical practice by way of statistics" with a quote from Tavia Gordon: "The power and elegance of the logistic function make it an attractive and elegant statistical instrument, but in the end we cannot push a button and hope that everything will come out all right. Because frequently it will not".1
In today's medicine, risk scores are widely used. We use them for diagnosis and risk stratification during hospitalization. These scores guide us when we decide on therapeutic strategies and when we assess risk at discharge and over the long term. Since they are applied so widely and so frequently, it is crucial that they be easy to use, uncomplicated, and not too time-consuming. Once we find one with these fundamental attributes, other questions arise: does it comply with the laws to which scores are subject? Does it work in every situation? Why this one and not another?
Scores work in the same way as a common diagnostic method (such as troponin for acute coronary syndrome or NT-proBNP for heart failure), and the same laws of sensitivity and specificity apply to them. In turn, they have some peculiarities in how they are generated and in how their performance is evaluated.
In simplified terms, and without going into complex statistical twists and turns, the two most relevant requirements are, first, that the endpoint the score predicts be clear, standardized, and easily replicable. Death as an endpoint is one thing; dyspnea is another. The latter does not seem to be a good endpoint for generating a risk score, since its diagnosis is in many cases subjective.
The second requirement is that the score must arise from a representative sample of the population to which it will be applied.2 In this regard, we can use two scores as examples. The Framingham score was derived from a cohort of 5,345 people from that community, followed for up to twelve years; its endpoint was a composite of death attributed to coronary artery disease, myocardial infarction, angina, or coronary "insufficiency".3 Meanwhile, closer in time, the SCORE Project group used a cohort of 200,000 individuals from eleven European countries, followed for up to thirteen years; the endpoint evaluated was fatal cardiovascular disease, and surrogate genetic and environmental factors from the different geographical regions were also used in its generation.4 At first glance, it would seem that the second score was derived from a more representative sample of its target population and that, moreover, its endpoint was very specific. Even considering these important differences, in their editorial accompanying the publication of SCORE, Topol and Lauer criticize the endpoint used in both scores: in the first case because it is ambiguous and sensitive to bias, and in the second because it is insufficient.5 This shows how difficult it can be to find appropriate endpoints.
The correct way to evaluate the performance of the risk scores, once generated, is by measuring three characteristics: discrimination efficiency, calibration and reclassification capacity when variables are added to the original model.
The first of these qualities, discrimination efficiency, is the ability of the function to separate those who have a high probability of presenting the evaluated endpoint from those who do not. We can evaluate it within the same population used to generate the function, using a part of the sample that is set aside, before the score is derived, for validation purposes; we call this internal validity. There is also external validity, which is the most important and comes from applying the function to other populations and assessing its discriminatory capacity.
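Discrimination is usually summarized as the c-statistic (the area under the ROC curve). As a minimal sketch, with invented scores and outcomes, it can be computed as the probability that a randomly chosen patient who presented the endpoint was assigned a higher score than one who did not:

```python
def c_statistic(scores, events):
    """Probability that a patient with the event has a higher score
    than a patient without it (ties count as half a win)."""
    pos = [s for s, e in zip(scores, events) if e == 1]
    neg = [s for s, e in zip(scores, events) if e == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # predicted risks (invented)
events = [1,   1,   0,   1,   0,   0]    # observed endpoint (invented)
print(c_statistic(scores, events))       # 8/9: 0.5 = chance, 1.0 = perfect
```

The same computation on a held-out part of the derivation sample gives internal validity; on an independent cohort, external validity.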
To understand how important it is that the population where we apply the score be similar to the one that generated it, we can use age as an example. When applying risk scores to patients in extreme age ranges, for example over 65 years of age, we must be careful: if these ages are not well represented in the sample from which the score was generated, the performance of the function may be poor when applied to a population that is mostly of the same age or older. In turn, these models propose an identical beta coefficient for all age ranges, and this may not hold in practice: other factors may affect the function differently depending on age. This could be corrected, at least partially, by creating an interaction term between age and the other factors included in the function.9
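The effect of an interaction term can be made concrete with a toy logistic model (all coefficients below are invented for illustration, not taken from any published score): with an age-by-smoking interaction, the odds ratio associated with smoking is no longer constant but shrinks as age rises.

```python
import math

def risk(age, smoker, b0=-7.0, b_age=0.08, b_smk=1.2, b_int=-0.012):
    """Toy logistic model with an age x smoking interaction term.
    All coefficients are invented for illustration."""
    lp = b0 + b_age * age + b_smk * smoker + b_int * age * smoker
    return 1 / (1 + math.exp(-lp))

def smoking_odds_ratio(age, b_smk=1.2, b_int=-0.012):
    """With the interaction, the odds ratio for smoking depends on age."""
    return math.exp(b_smk + b_int * age)

print(round(smoking_odds_ratio(40), 2))  # ~2.05: stronger effect at 40
print(round(smoking_odds_ratio(70), 2))  # ~1.43: weaker effect at 70
```

Without the interaction term (b_int = 0), the model would impose the same odds ratio, exp(1.2), on a 40-year-old and a 70-year-old alike.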
Calibration, on the other hand, is a measure of how reliable the prediction is. That is, how many of those who were predicted to present the endpoint actually presented it? Both where and when the formula is applied will affect calibration. For example, if we apply it in a place with a higher prevalence of the event than the place where it was generated, there will be a tendency to underestimate; the opposite will occur where the incidence of the endpoint is lower, and there will be a tendency to overestimate. The best calibration is obtained by applying the formula to a population with characteristics similar to the one from which it was derived.2,6 This occurs because the prior probability, when applying any type of test or score, is decisive for its performance. Will it be necessary to do something to the formula so that it calibrates better in my population? The answer is often yes. In fact, countries such as China7 and the United Kingdom8 have done so with the Framingham risk score.
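In practice, calibration is checked by grouping patients by predicted risk and comparing the mean predicted risk in each group with the event rate actually observed. A minimal sketch, with invented numbers, mimicking a higher-prevalence population where the score underestimates in the low-risk group:

```python
def calibration_table(pred, obs, n_groups=2):
    """Sort patients by predicted risk, split into groups, and return
    (mean predicted risk, observed event rate) for each group."""
    pairs = sorted(zip(pred, obs))
    size = len(pairs) // n_groups
    table = []
    for g in range(n_groups):
        chunk = (pairs[g * size:(g + 1) * size]
                 if g < n_groups - 1 else pairs[g * size:])
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(o for _, o in chunk) / len(chunk)
        table.append((round(mean_pred, 2), round(obs_rate, 2)))
    return table

pred = [0.10, 0.15, 0.20, 0.60, 0.70, 0.80]  # predicted risks (invented)
obs  = [0,    0,    1,    1,    1,    0]     # observed events (invented)
print(calibration_table(pred, obs))
# low-risk group: predicted ~0.15 vs observed ~0.33 -> underestimation
```

A well-calibrated score yields predicted and observed values that track each other across all risk groups; systematic gaps like the one above are what motivated the recalibrations cited for China and the United Kingdom.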
Finally, reclassification: it is a measure of the percentage of individuals who did or did not present the event and who were correctly reclassified into a new risk category after some risk variable was added to the formula.2 It is a newer concept than the previous ones but is widely used nowadays.
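One common way to quantify this is the net reclassification improvement (NRI): correct moves are upward reclassifications among patients who presented the event and downward ones among those who did not. A hedged sketch with invented categories and patients:

```python
def nri(old_cat, new_cat, events):
    """Net reclassification improvement: net proportion of correct
    upward moves among events plus net proportion of correct
    downward moves among non-events."""
    up_e = down_e = up_n = down_n = 0
    for o, n, e in zip(old_cat, new_cat, events):
        if e == 1:
            up_e += n > o      # event moved to a higher category: correct
            down_e += n < o    # event moved lower: incorrect
        else:
            up_n += n > o      # non-event moved higher: incorrect
            down_n += n < o    # non-event moved lower: correct
    n_events = sum(events)
    n_nonevents = len(events) - n_events
    return (up_e - down_e) / n_events + (down_n - up_n) / n_nonevents

# Risk categories: 0 = low, 1 = intermediate, 2 = high (invented data)
old = [1, 1, 0, 2]     # categories under the original score
new = [2, 0, 0, 1]     # categories after adding a new variable
events = [1, 0, 0, 1]  # observed endpoint
print(nri(old, new, events))  # 0.5
```

An NRI above zero suggests the added variable moves patients into more appropriate categories on balance; zero means the reclassifications cancel out.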
As a conceptual example, it is only a matter of time before genetics shows us risk substrates for cardiovascular disease, as it has done for other pathologies. Today, no risk score directly considers genetic variables in its formula. It is possible that, in the future, the genotype will be as relevant as, or more relevant than, the phenotype, which is what we currently use to stratify our patients' risk.
When science finds these variables and incorporates them into existing models, quantifying reclassification capacity will be important to measure their contribution to the various scores.
In conclusion, risk scores, whatever they may be, are subject to the same laws as other studies or diagnostic methods. We must always use them taking into account the prior probability of the population to which we apply them, since the results we extract from the estimate depend on that population. Even when the population has similar characteristics, its age distribution should be considered, because age is one of the most important predictors: if an age range is poorly represented, the function will not perform adequately. Discrimination ability, calibration, and reclassification power are the three characteristics we must evaluate in a risk score.
Scores, although they are a tool, can never replace the clinical judgment of the professional applying them. Tavia Gordon's remark, with which we began this text, remains true today, perhaps with some nuances and greater complexity.
References
1. Gordon T. Editorial: Hazards in the use of the logistic function with special reference to data from prospective cardiovascular studies. J Chronic Dis. 1974;27(3):97-102.
2. Cooney MT, Dudina AL, Graham IM. Value and limitations of existing scores for the assessment of cardiovascular risk: a review for clinicians. J Am Coll Cardiol. 2009;54(14):1209-27.
3. Wilson PW, D'Agostino RB, Levy D, Belanger AM, Silbershatz H, Kannel WB. Prediction of coronary heart disease using risk factor categories. Circulation. 1998;97(18):1837-47.
4. Conroy RM, Pyörälä K, Fitzgerald AP, et al. Estimation of ten-year risk of fatal cardiovascular disease in Europe: the SCORE project. Eur Heart J. 2003;24(11):987-1003.
5. Topol EJ, Lauer MS. The rudimentary phase of personalised medicine: coronary risk scores. Lancet. 2003;362(9398):1776-7.
6. Brindle P, Beswick A, Fahey T, Ebrahim S. Accuracy and impact of risk assessment in the primary prevention of cardiovascular disease: a systematic review. Heart. 2006;92(12):1752-9.
7. Liu J, Hong Y, D'Agostino RB Sr, et al. Predictive value for the Chinese population of the Framingham CHD risk assessment tool compared with the Chinese Multi-Provincial Cohort Study. JAMA. 2004;291(21):2591-9.
8. Brindle P, May M, Gill P, et al. Primary prevention of cardiovascular disease: a web-based risk score for seven British black and minority ethnic groups. Heart. 2006;92(11):1595-602.
9. Hippisley-Cox J, Coupland C, Vinogradova Y, et al. Predicting cardiovascular risk in England and Wales: prospective derivation and validation of QRISK2. BMJ. 2008;336(7659):1475-82.
Author notes
Luciano Oscar Lucas Email: luciano.lucas@hospitalitaliano.org.ar