
Machine Learning · Applied AI

Feature Selection Discipline in Small Clinical Datasets

When your dataset has 300 patients and 40 variables, every variable you include is a gamble. Most applied health ML projects face exactly this problem.

April 2, 2026 · 10 min read · Africure Analytics

Clinical datasets in African health research are often small by machine learning standards. A study with 500 patients is considered substantial. A study with 5,000 is rare. This changes the rules for how models should be built.

More variables is not better

There is a common instinct to include every available variable in a model on the grounds that the algorithm will figure out what matters. In large datasets with tens of thousands of observations, this sometimes works. In small datasets, it is a recipe for overfitting.

Overfitting means the model learns patterns that exist in this specific dataset but do not generalise. A variable that happens to correlate with the outcome in 300 patients may have no real predictive value. The model cannot tell the difference, so it treats noise as signal.

The result is a model that performs well on the training data and poorly on new data. The published paper reports strong performance. The deployed model disappoints everyone.

A useful rule of thumb is the events-per-variable ratio. For logistic regression, you need roughly 10-20 outcome events per predictor variable to get stable coefficient estimates. If your dataset has 300 patients and 60 events (a 20% event rate), you can reliably support 3-6 predictor variables. Include 15 variables, and the model will find patterns that do not exist. This arithmetic is often ignored in published studies, which report models with dozens of predictors trained on a few hundred observations.
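The events-per-variable arithmetic above is simple enough to sketch directly. A minimal helper (the function name and interface are ours, not part of any published tool) makes the trade-off explicit:

```python
def supported_predictors(n_events, epv_range=(10, 20)):
    """Return the (min, max) number of predictor variables a dataset
    can support under an events-per-variable rule of thumb.

    n_events: count of the rarer outcome class, not total patients.
    """
    epv_low, epv_high = epv_range
    # A stricter EPV (20 events per variable) supports fewer predictors;
    # a looser EPV (10) supports more.
    return n_events // epv_high, n_events // epv_low

# 300 patients with a 20% event rate -> 60 events
print(supported_predictors(60))  # -> (3, 6)
```

Note that the binding quantity is the number of events, not the number of patients: 3,000 patients with 60 events support no more predictors than 300 patients with 60 events.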

The problem is compounded when researchers use automated feature selection methods like stepwise regression on small datasets. These methods test many variable combinations and select the one that fits best, which in a small sample means selecting the combination that best fits the noise. The selected model looks strong in the training data and collapses on external validation.
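A quick simulation (ours, using NumPy; the sample sizes mirror the example in the text) illustrates why automated screening misleads: even when every predictor is pure noise, the best of 40 candidates can look convincingly associated with the outcome in 300 patients, because screening implicitly runs 40 hypothesis tests and keeps the winner.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 300, 40
X = rng.normal(size=(n, p))        # 40 predictors of pure noise
y = rng.binomial(1, 0.2, size=n)   # outcome generated independently of X

# Univariate screening, as stepwise-style procedures do at each step:
# rank predictors by absolute correlation with the outcome.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])

# Rough two-sided 5% significance threshold for a single correlation test
threshold = 1.96 / np.sqrt(n)
print(f"strongest noise correlation: {corr.max():.3f}")
print(f"nominal 5% threshold:        {threshold:.3f}")
```

With 40 independent tests at the 5% level, the chance that at least one noise variable clears the threshold is roughly 1 - 0.95^40, or about 87%, which is exactly the pattern a stepwise procedure would then "discover".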

Domain knowledge is the first filter

We start feature selection with clinical knowledge, not algorithmic screening. A clinician or domain expert reviews the candidate variables and identifies which ones have a plausible biological or clinical relationship with the outcome. Variables without a credible mechanism are excluded before the model sees them.

This is not because those variables could never matter. It is because with a small sample, the model does not have enough data to reliably distinguish real associations from spurious ones. Restricting the search space reduces the chance of finding patterns that do not replicate.

In our diabetes risk model, the four variables (age, BMI, family history, and HbA1c) were selected because each has a well-established mechanistic relationship with Type II diabetes. Age reflects cumulative metabolic stress and declining beta-cell function. BMI reflects insulin resistance. Family history captures genetic predisposition. HbA1c directly measures glycaemic control. A machine learning algorithm might have selected blood pressure or cholesterol instead, but those variables would have been picked for their statistical association in that particular sample, not for their causal relevance to the outcome.

After the domain filter, we use penalised regression to further refine the variable set. LASSO regression shrinks weak coefficients toward zero, effectively removing variables whose contribution is not strong enough to justify their inclusion. Ridge regression shrinks coefficients but retains all variables, which is useful when you believe all the selected variables are relevant but want to guard against inflated estimates. The choice between LASSO and ridge depends on the analytical question and the clinical context, not on which produces the better-looking AUC.

  • Start with variables that have a plausible clinical mechanism
  • Remove variables with >30% missing values unless missingness itself is informative
  • Check for multicollinearity: correlated predictors waste degrees of freedom
  • Use penalised regression (LASSO, ridge) to shrink weak coefficients toward zero
  • Validate with cross-validation, not a single train-test split
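The penalised-regression step can be sketched with scikit-learn; the synthetic data, penalty strength (`C=0.1`), and solver choice below are our illustrative assumptions, not the production configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Outcome driven by the first two predictors only; the rest are noise.
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

Xs = StandardScaler().fit_transform(X)  # penalties assume comparable scales

# LASSO (L1): shrinks weak coefficients exactly to zero, dropping them.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Xs, y)
# Ridge (L2): shrinks all coefficients but retains every variable.
ridge = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(Xs, y)

print("L1 non-zero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
print("L2 non-zero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-8)))
```

Standardising before penalisation matters: both penalties act on coefficient magnitude, so unscaled inputs would let a variable's measurement units, rather than its predictive value, decide whether it survives.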

Validation in small samples requires extra care

A single train-test split is unreliable when the dataset is small because the results depend heavily on which patients end up in which split. A model trained on one random 80% of the data may perform very differently than a model trained on a different 80%. With 300 patients, a single split can produce AUC estimates that vary by 0.10 or more depending on the random seed.

We use repeated k-fold cross-validation as the standard validation approach for small datasets. The data is split into k folds (typically 5 or 10), the model is trained on k-1 folds and tested on the remaining fold, and this process is repeated across all folds. The entire procedure is then repeated with different random splits (typically 10-20 repetitions) to estimate the variability of the performance metric. The result is a distribution of AUC values, not a single point estimate.
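A minimal sketch of repeated k-fold cross-validation using scikit-learn (the synthetic dataset and the 5-fold, 10-repeat configuration are our illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a small clinical dataset:
# 300 patients, 6 predictors, roughly 20% event rate
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           weights=[0.8], random_state=0)

# 5 folds x 10 repeats -> 50 AUC estimates instead of a single split
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")

print(f"AUC: {aucs.mean():.3f} +/- {aucs.std():.3f} ({len(aucs)} estimates)")
```

Stratified folds keep the event rate roughly constant across folds, which matters at 20% prevalence: an unstratified fold can end up with too few events to score reliably.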

Bootstrapped validation is another approach we use, particularly for estimating optimism-corrected performance. The model is trained on a bootstrap sample (a random sample with replacement from the original data), tested on the original data, and the difference between training performance and test performance is the optimism estimate. Subtracting this optimism from the apparent training performance gives a more realistic estimate of how the model will perform on new data.
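The bootstrap optimism correction described above can be sketched as follows; the model, dataset, and 200-replicate count are our illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           weights=[0.8], random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

rng = np.random.default_rng(0)
optimism = []
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
    if len(np.unique(y[idx])) < 2:
        continue                                 # need both classes to fit
    boot = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    # Performance on the bootstrap sample (the model's own training data)...
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    # ...minus performance on the original data is one optimism estimate.
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)

corrected = apparent - float(np.mean(optimism))
print(f"apparent AUC {apparent:.3f}, optimism-corrected {corrected:.3f}")
```

The corrected estimate is almost always lower than the apparent one; how much lower is itself informative, since a large optimism gap signals that the model is leaning on sample-specific noise.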

Our three demos follow this approach

The diabetes demo uses four variables. The breast cancer demo uses four. The osteoporosis demo uses six. These are not toy models. They are models built from published research where the variable selection was driven by clinical reasoning and statistical discipline, not by throwing everything at a random forest.

Each model uses logistic regression with standardised inputs and transparent coefficients. Any clinician can trace how the inputs produce the output. That traceability is worth more than a marginal improvement in AUC from a black-box model that nobody can explain.

The small number of variables is a strength, not a limitation. Each variable can be collected in a routine clinical encounter without specialised equipment or expensive lab tests. This matters for deployment in resource-limited settings where the model is most needed. A 40-variable model that requires a full metabolic panel, imaging studies, and genetic testing is useless in a primary care clinic that has a scale, a glucometer, and a patient history form.

