Academic papers report AUC almost universally. It measures how well the model ranks patients: whether high-risk patients get higher scores than low-risk patients. That is useful, but it is not enough.
What AUC does and does not tell you
AUC tells you about ranking. A model with an AUC of 0.85 does a good job of putting sicker patients above healthier patients in the risk ranking. But AUC says nothing about whether the predicted probabilities are accurate.
A model can have an AUC of 0.85 and systematically overestimate risk by 20 percentage points. Every patient gets a score that is too high, but the ranking is preserved because the shift leaves the order of the scores unchanged. The AUC looks fine. The probabilities are wrong.
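The shift-invariance of AUC is easy to demonstrate with a toy example. The numbers below are invented for illustration; `auc` is a direct pairwise implementation of the AUC definition (the probability that a randomly chosen positive case outranks a randomly chosen negative case):

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 0, 1, 1, 1]
probs  = [0.10, 0.15, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80]

# Shift every probability up by 20 points (capped below 1.0): the order of
# the scores is unchanged, so the AUC is identical even though every
# individual probability is now wrong.
shifted = [min(p + 0.20, 0.99) for p in probs]

auc(probs, labels)    # 0.9375
auc(shifted, labels)  # 0.9375, same ranking, wrong probabilities
```

The same invariance holds for any order-preserving distortion of the scores, which is exactly why AUC alone cannot certify the probabilities.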
This distinction is not academic. A clinician using the model to counsel a patient needs accurate probabilities, not just a correct ranking. Telling a patient they have a 60% chance of recurrence when the true probability is 35% changes the treatment conversation, the patient's anxiety level, and potentially the treatment decision itself. The ranking is correct in both cases (higher-risk patients still get higher scores), but the clinical impact is very different.
Why calibration matters more in practice
When a model tells a clinician that a patient has a 45% probability of recurrence, the clinician needs that number to mean something. If 45% of patients who receive that score actually experience recurrence, the model is well calibrated. If the real rate is 25%, the model is misleading even though the ranking might be correct.
Calibration matters whenever the output is used as a probability, not just as a ranking. Treatment decisions, patient communication, and resource planning all depend on the probability being approximately right, not just correctly ordered.
This is why our demo models report the score as a percentage and include interpretation text that explains what the number means: '45.52 out of 100 people with the same risk factors will develop Type II diabetes.' That sentence only makes sense if the model is calibrated.
Resource planning makes the point even clearer. If a programme uses a model to estimate how many patients in a district will need a particular intervention, calibration determines whether the estimate is useful. A model that consistently overestimates risk by 20% will predict that 600 patients need the intervention when the true number is 500. The programme orders supplies for 600, wastes the surplus, and looks inefficient. A well-calibrated model that predicts 500 allows the programme to plan accurately. AUC is irrelevant to this use case; calibration is everything.
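The arithmetic behind this example is worth making explicit: the expected caseload is simply the sum of the predicted probabilities, so a systematic 20% inflation carries straight through to the planning number. The cohort below is hypothetical, sized to mirror the 500-versus-600 example:

```python
# Hypothetical district of 2,000 patients with a true risk of 25% each.
true_probs = [0.25] * 2000
inflated   = [p * 1.2 for p in true_probs]  # model overestimates risk by 20%

# Expected caseload = sum of predicted probabilities.
expected_true     = sum(true_probs)   # 500.0 patients expected
expected_inflated = sum(inflated)     # ~600, so the programme over-orders
```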
- AUC measures ranking: are high-risk patients scored higher than low-risk patients?
- Calibration measures accuracy of probabilities: does a 40% prediction happen 40% of the time?
- A model can rank well but give wrong probabilities
- Calibration should be checked with calibration plots or the Hosmer-Lemeshow test
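The Hosmer-Lemeshow statistic mentioned in the list can be sketched in a few lines of plain Python. This is an illustrative version, not a substitute for a statistics package: it returns only the statistic, which would then be compared against a chi-square distribution with `groups - 2` degrees of freedom (that lookup requires a stats library and is omitted here):

```python
def hosmer_lemeshow(probs, labels, groups=10):
    """Hosmer-Lemeshow statistic: sum over risk groups of
    (observed - expected)^2 / variance."""
    pairs = sorted(zip(probs, labels))  # order patients by predicted risk
    h, n = 0.0, len(pairs)
    for k in range(groups):
        chunk = pairs[k * n // groups:(k + 1) * n // groups]
        if not chunk:
            continue
        nk = len(chunk)
        obs = sum(y for _, y in chunk)  # observed events in the group
        exp = sum(p for p, _ in chunk)  # expected events (sum of probabilities)
        var = exp * (1 - exp / nk)
        if var > 0:
            h += (obs - exp) ** 2 / var
    return h
```

A well-calibrated model yields a small statistic (observed counts match expected counts in every risk group); large values signal systematic deviation.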
How calibration breaks down in practice
A model that was well calibrated in the development population can lose calibration when applied to a different population. If the model was developed in a population with a 30% event rate and is applied to a population with a 15% event rate, the predicted probabilities will systematically overestimate risk. This overall mismatch between mean predicted and observed risk is a failure of calibration-in-the-large, the most common form of miscalibration when models are transported across populations.
Recalibration can fix this. The simplest approach is to update the model's intercept to reflect the new population's event rate while keeping the coefficients (the relative effects of each predictor) unchanged. More sophisticated recalibration adjusts both the intercept and the slope of the linear predictor. These adjustments require outcome data from the target population, which means some prospective validation is needed before deployment.
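The simple intercept update can be sketched as follows. This is an illustrative helper with names chosen for this example; it shifts the baseline log-odds to the target population's event rate while leaving every coefficient, and therefore the ranking, unchanged. Slope recalibration (refitting a coefficient on the linear predictor against target-population outcomes) is not shown:

```python
from math import exp, log

def logit(p):
    return log(p / (1 - p))

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def recalibrate_intercept(prob, dev_rate, target_rate):
    """Shift the linear predictor by the difference in baseline log-odds
    between the development and target populations."""
    return sigmoid(logit(prob) + logit(target_rate) - logit(dev_rate))

# A 60% prediction from a model developed at a 30% event rate, transported
# to a population with a 15% event rate, drops to roughly 38%.
recalibrate_intercept(0.60, dev_rate=0.30, target_rate=0.15)  # ~0.38
```

Because the shift is monotone, the AUC is untouched; only the probabilities move, which is exactly the point.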
We assess calibration using calibration plots that divide predicted probabilities into deciles and compare the predicted probability in each decile with the observed event rate. A well-calibrated model shows points close to the 45-degree line. A miscalibrated model shows systematic deviation. We include these plots in our model documentation so partners can see, visually, whether the model gives accurate probabilities across the full range of risk.
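Building the data behind such a plot takes only a few lines. The sketch below returns the per-decile (mean predicted, observed) pairs that would be plotted against the 45-degree line; the plotting itself, and the synthetic cohort used to exercise it, are illustrative only:

```python
def calibration_table(probs, labels, bins=10):
    """Per-decile (mean predicted probability, observed event rate) pairs,
    i.e. the points plotted against the 45-degree line."""
    pairs = sorted(zip(probs, labels))
    n = len(pairs)
    rows = []
    for k in range(bins):
        chunk = pairs[k * n // bins:(k + 1) * n // bins]
        if chunk:
            rows.append((sum(p for p, _ in chunk) / len(chunk),
                         sum(y for _, y in chunk) / len(chunk)))
    return rows

# Synthetic, perfectly calibrated cohort: in each risk band the observed
# event rate matches the predicted probability, so points sit on the diagonal.
probs, labels = [], []
for i in range(10):
    probs += [i / 10] * 10
    labels += [1] * i + [0] * (10 - i)
```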
Checking calibration before deployment
Before we deploy a model as a live demo, we verify that the published research includes calibration assessment or that the model structure (logistic regression with adequate sample size) supports reasonable calibration. We do not deploy models where the probabilities are likely to be misleading, regardless of how strong the AUC looks.
For models where calibration data is available, we reproduce the calibration plot using the published data and check that the predicted probabilities align with observed outcomes across the risk spectrum. For models where only the coefficients and summary statistics are available, we assess whether the model structure is likely to produce calibrated probabilities based on the sample size, the number of predictors, and the event rate.
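One way to make that structural assessment concrete is an events-per-variable (EPV) calculation, a common rule of thumb often quoted as requiring at least 10 to 20 events per candidate predictor. The helper below is a hypothetical sketch of that check, not part of any actual tooling described here:

```python
def events_per_variable(n_patients, event_rate, n_predictors):
    """Events per candidate predictor, counting the rarer outcome class,
    which is what limits how many coefficients can be estimated reliably."""
    events = n_patients * min(event_rate, 1 - event_rate)
    return events / n_predictors

# 1,000 patients at a 15% event rate gives 150 events; spread over 8
# predictors that is 18.75 events per variable, comfortably above 10.
events_per_variable(1000, event_rate=0.15, n_predictors=8)  # 18.75
```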
We also document the population in which calibration was assessed and flag any known differences between that population and the intended users. If the model was calibrated on a hospital-based population and the intended users are primary care clinicians, the calibration may not transfer because the case mix is different. We state this explicitly in the tool's documentation rather than leaving users to discover it after making clinical decisions based on the output.