
Health Equity · Machine Learning

Training Data Geography Shapes Who Benefits From a Model

A model trained on patients from one country will not necessarily work for patients from another. This is not a technical footnote. It is an equity issue.

April 3, 2026 · 10 min read · Africure Analytics

Every prediction model is a product of the data it was trained on. When that data comes from a specific population, the model's behaviour reflects that population's characteristics: its demographics, its disease patterns, its healthcare system, and its measurement practices.

Geography is not just a footnote

Our diabetes risk model was built on a Nigerian patient database. The coefficients, the standardisation parameters, and the risk thresholds all reflect the characteristics of that population. If you apply the same model to patients in Japan or Sweden, the predicted probabilities may not be accurate because the baseline risk, the distribution of risk factors, and even the measurement methods may differ.
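To make the point concrete, here is a minimal sketch of a logistic risk model. The coefficients and intercepts are made up for illustration, not taken from our production model; the point is that the intercept encodes the source population's baseline risk, so the same patient receives a different probability under a different baseline.

```python
# A minimal sketch (hypothetical coefficients, not Africure's actual model)
# showing how the intercept baked into a logistic risk model carries the
# source population's baseline risk with it.
import math

def predicted_risk(age, bmi, hba1c, intercept):
    """Logistic risk model: probability = sigmoid(linear predictor)."""
    lp = intercept + 0.04 * age + 0.08 * bmi + 0.55 * hba1c  # illustrative
    return 1 / (1 + math.exp(-lp))

patient = dict(age=52, bmi=29.0, hba1c=6.1)

# The same patient, scored with an intercept fitted to the source
# population versus one reflecting a lower-baseline-risk population.
p_source = predicted_risk(**patient, intercept=-9.0)
p_other = predicted_risk(**patient, intercept=-10.0)
print(f"source-population intercept: {p_source:.1%}")   # ~22%
print(f"lower-baseline intercept:    {p_other:.1%}")    # ~10%
# Identical risk factors, different baseline risk: the predicted
# probability roughly halves even though nothing about the patient changed.
```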

This is why the R Shiny disclaimer states explicitly that the calculator 'is based on Nigeria patient database populations, and may not perform reliably in relation to other populations.' That is not legal boilerplate. It is a scientific statement about the model's validity.

The specificity goes deeper than country-level differences. A model trained on patients from tertiary hospitals in Lagos may not transfer well to patients in rural primary care clinics in northern Nigeria. The populations differ in disease severity at presentation, comorbidity profiles, measurement precision, and baseline risk. Urban teaching hospital patients tend to present later, have more comorbidities, and be assessed with better equipment than rural primary care patients. These differences affect the model's coefficients and therefore its predictions.

Even measurement methods matter. HbA1c values measured on a point-of-care device in a rural clinic may differ systematically from values measured by a laboratory analyser in a referral hospital. If the model was trained on laboratory-grade measurements and is applied to point-of-care results, the input data is measured on a different scale, and the predicted probabilities inherit that discrepancy. This is not hypothetical. Studies comparing point-of-care and laboratory HbA1c in African settings have found mean differences of 0.3-0.5 percentage points, which is enough to shift patients between risk categories.
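A small continuation of the same sketch shows how a 0.4 percentage-point measurement bias can move a patient across a risk threshold. The coefficients and the 20% cut-off are again hypothetical.

```python
# A sketch of the measurement-scale problem: same illustrative logistic
# model as above; only the HbA1c input shifts by the 0.3-0.5 percentage-
# point bias reported between point-of-care and laboratory assays.
import math

def predicted_risk(age, bmi, hba1c):
    lp = -9.0 + 0.04 * age + 0.08 * bmi + 0.55 * hba1c  # illustrative
    return 1 / (1 + math.exp(-lp))

HIGH_RISK_THRESHOLD = 0.20  # hypothetical cut-off

lab_value = 6.1              # laboratory analyser, the scale the model saw
poc_value = lab_value - 0.4  # the same patient on a point-of-care device

for label, hba1c in [("laboratory", lab_value), ("point-of-care", poc_value)]:
    p = predicted_risk(age=52, bmi=29.0, hba1c=hba1c)
    category = "high" if p >= HIGH_RISK_THRESHOLD else "moderate"
    print(f"{label}: HbA1c {hba1c:.1f}% -> risk {p:.1%} ({category})")
# The laboratory value lands at ~22% (high); the point-of-care value at
# ~19% (moderate). One assay difference, two different risk categories.
```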

The equity dimension

Most widely cited prediction models in medicine were developed using data from high-income countries. When those models are applied in African or South Asian settings without recalibration, they may systematically over- or under-estimate risk for those populations.

This is an equity problem because it means the people who most need accurate risk prediction (those in settings with fewer specialist resources) are the ones most likely to receive inaccurate predictions. Building models from local data, as our demos do, is one way to address this.

It is not enough to build one model and claim it works everywhere. Responsible analytics means stating where the model was validated and being honest about where it has not been tested.

The Framingham cardiovascular risk score is the canonical example. Developed on a predominantly white American population in the 1990s, it has been widely adopted globally. But when validated in African populations, it consistently overestimates cardiovascular risk, sometimes by a factor of two. A patient classified as high-risk by Framingham may be at moderate risk according to a locally calibrated model. The overestimation leads to unnecessary medication, unnecessary anxiety, and wasted resources in health systems that can afford none of those.

The solution is not to avoid using models in settings where they were not developed. The solution is to validate them locally and recalibrate where needed. This requires local outcome data, which requires investment in data collection and analytical capacity in the populations that stand to benefit most. That investment is part of the equity equation.
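The standard statistical tool here is logistic recalibration: refit an intercept (calibration-in-the-large) and a slope on the original model's log-odds using local outcomes. Below is a minimal sketch, assuming you have local outcome data and can compute the original model's linear predictor for each patient; the data is simulated, not from any real cohort.

```python
# Logistic recalibration sketch: fit a new intercept and slope on the
# original model's log-odds against locally observed outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Original model's log-odds for 500 local patients (simulated), in a
# population whose true risk runs lower than the model predicts.
original_log_odds = rng.normal(loc=-1.0, scale=1.2, size=500)
true_log_odds = -0.8 + 0.9 * original_log_odds
outcomes = rng.binomial(1, 1 / (1 + np.exp(-true_log_odds)))

recal = LogisticRegression().fit(original_log_odds.reshape(-1, 1), outcomes)
print(f"recalibrated intercept: {recal.intercept_[0]:+.2f}")
print(f"calibration slope:      {recal.coef_[0][0]:.2f}")

def recalibrated_risk(log_odds):
    """Apply the recalibration to a new patient's original log-odds."""
    z = recal.intercept_[0] + recal.coef_[0][0] * log_odds
    return 1 / (1 + np.exp(-z))

# A patient the original model scores at 30% maps to a lower local risk.
print(f"original 30% risk -> {recalibrated_risk(np.log(0.3 / 0.7)):.1%} locally")
```

Recalibration of this kind adjusts the baseline and the spread of predictions without discarding the original model's risk-factor relationships; when those relationships themselves differ, full redevelopment on local data is the safer route.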

Building local evidence bases

Addressing the geography problem requires more than recalibrating existing models. It requires building local evidence bases that support the development of new models from local data. This means investing in patient registries, electronic medical records, and research infrastructure in settings that have historically been underrepresented in clinical prediction research.

Our diabetes demo demonstrates what this looks like in practice. The model was built from a Nigerian patient dataset, validated on that population, and deployed with clear scope statements. When a partner in another country wants a similar tool, we do not simply relabel the Nigerian model. We work with them to identify or build a local dataset, develop a model from that data, validate it, and deploy it as a separate tool with its own scope statements.

This approach is slower and more expensive than applying a single global model everywhere. But it produces tools that are accurate for the populations they serve, and it builds local analytical capacity in the process. The research team that developed the local dataset and validated the model owns that capability going forward. They are not dependent on a model developed elsewhere and maintained by someone else.

What we do about it

Every demo on this platform states the source population. The interpretation text makes clear that the probability applies to patients with the same risk factors from the same population context. When we work with partners in other countries, we recalibrate or rebuild the model using local data rather than assuming the original coefficients transfer.

We maintain a model registry that documents, for each deployed model, the source population, the development dataset size, the validation approach, the calibration assessment, and the known limitations. This registry is accessible to partners and serves as the basis for any conversation about extending a model to a new population.
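As an illustration of the structure such an entry might take, the sketch below mirrors the fields listed above. The format and the figures in it are hypothetical; our actual registry format is not public.

```python
# A hypothetical registry-entry structure mirroring the documented fields.
from dataclasses import dataclass, field

@dataclass
class ModelRegistryEntry:
    model_name: str
    source_population: str       # who the model was developed on
    development_n: int           # development dataset size
    validation_approach: str     # e.g. internal CV, temporal or external split
    calibration_assessment: str  # e.g. calibration slope/intercept, plots
    known_limitations: list[str] = field(default_factory=list)

entry = ModelRegistryEntry(
    model_name="diabetes-risk-demo",
    source_population="Nigerian patient database",
    development_n=4_200,  # illustrative figure
    validation_approach="internal cross-validation on the source cohort",
    calibration_assessment="calibration slope and intercept on held-out data",
    known_limitations=[
        "not validated outside Nigeria",
        "trained on laboratory-grade HbA1c measurements",
    ],
)
```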

When partners ask whether an existing model can be used in their setting, we follow a structured assessment: first, compare the demographic and clinical characteristics of the source population with the target population; second, check whether the same variables can be measured the same way; third, if outcome data is available in the target population, run a transportability analysis. Only if the model passes these checks do we recommend deployment. Otherwise, we recommend local development or recalibration, and we support the partner through that process.
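The first of those checks can be made concrete with standardised mean differences between the source and target populations on the model's input variables. The sketch below uses simulated data and a conventional 0.1 flag threshold; variable names and figures are illustrative.

```python
# First assessment step as a sketch: compare source and target populations
# on the model's inputs via standardised mean differences (SMD). An SMD
# above ~0.1 is a common flag that the populations differ enough to
# warrant caution. Data below is simulated.
import numpy as np

rng = np.random.default_rng(1)
source = {"age": rng.normal(54, 11, 2000), "bmi": rng.normal(28, 5, 2000)}
target = {"age": rng.normal(47, 13, 800), "bmi": rng.normal(25, 4, 800)}

def smd(a, b):
    """Standardised mean difference with a pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

for var in source:
    d = smd(source[var], target[var])
    flag = "review" if d > 0.1 else "ok"
    print(f"{var}: SMD = {d:.2f} ({flag})")
```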
