Every study has a population. How you define that population (who is included, who is excluded, and how you handle the people you lose track of) is not a technicality. It is the foundation of every result you report.
Inclusion criteria are analytical choices
Deciding to include only patients aged 18 and above, only those with a confirmed diagnosis, or only those who completed follow-up is not a neutral act. Each decision changes the composition of the study population and therefore changes the results.
A recurrence model trained only on patients who completed five years of follow-up will miss everyone who dropped out earlier. If dropout is related to disease severity, the model learns from a biased sample and produces optimistic estimates.
This is survivorship bias in its most literal form. The people who survived long enough to be in the dataset are not representative of everyone who had the disease.
Consider a concrete scenario from an HIV treatment programme. The programme wants to model viral load suppression rates at 12 months. If the cohort is defined as patients who attended their 12-month visit, the model excludes everyone who dropped out, transferred, or died before that point. In a programme with 30% loss to follow-up, the cohort represents only the 70% who stayed in care. The suppression rate looks good because the most challenging patients have been removed from the denominator.
The alternative is to define the cohort at the point of treatment initiation and track everyone forward, regardless of whether they completed follow-up. This produces a less flattering suppression rate, but it describes the reality of the programme, not a curated subset of it.
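The denominator effect described above can be sketched in a few lines. The patient records below are hypothetical, using the 30% loss-to-follow-up figure from the scenario; treating lost patients as unsuppressed is itself one choice among several, stated here explicitly.

```python
# Hypothetical cohort: 10 patients initiated on treatment; 3 lost to
# follow-up (30% LTFU, as in the scenario). Of the 7 who attended the
# 12-month visit, 6 were virally suppressed. All values are illustrative.
records = [
    {"id": f"p{i}", "attended_12m": i <= 7, "suppressed": i <= 6}
    for i in range(1, 11)
]

def suppression_rate(cohort):
    return sum(r["suppressed"] for r in cohort) / len(cohort)

# Cohort defined at the 12-month visit: LTFU patients leave the denominator.
visit_cohort = [r for r in records if r["attended_12m"]]

# Cohort defined at treatment initiation: everyone stays in the denominator
# (pessimistic assumption: patients lost to follow-up were not suppressed).
initiation_cohort = records

print(round(suppression_rate(visit_cohort), 2))       # → 0.86
print(round(suppression_rate(initiation_cohort), 2))  # → 0.6
```

The gap between the two numbers is exactly the "curated subset" problem: same data, same outcome, different denominator.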
Loss to follow-up is not just missing data
In many African health systems, loss to follow-up rates are high. Patients move, change facilities, or stop attending. The standard approach is to censor them, treating them as though we simply stopped observing them. But if the reason they left is related to the outcome, censoring introduces bias.
A patient who stopped coming to the clinic because they felt better is different from one who stopped because they were too sick to travel. Treating both the same way is a modelling choice that should be stated explicitly, not hidden in a methods section.
We handle loss to follow-up transparently. The cohort definition document lists every rule: patients with no clinic visit for 12 months are classified as lost to follow-up, patients who transferred to another facility are censored at the transfer date, and patients who died from unrelated causes are treated as competing risks. Each rule is stated in plain language so collaborators can challenge or modify it before the analysis runs.
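Rules like these can be encoded as an ordered check so that the code and the plain-language document say the same thing. This is a minimal sketch; the field names (`last_visit`, `transfer_date`, `death_unrelated`) and the rule ordering are assumptions to be agreed with collaborators.

```python
from datetime import date, timedelta

# No clinic visit for 12 months → lost to follow-up (rule from the
# cohort definition document; threshold shown as 365 days for simplicity).
LTFU_THRESHOLD = timedelta(days=365)

def classify(patient, as_of):
    """Return (status, effective_date) for one patient record."""
    if patient.get("transfer_date") is not None:
        # Transferred to another facility: censor at the transfer date.
        return "censored", patient["transfer_date"]
    if patient.get("death_date") is not None and patient.get("death_unrelated"):
        # Died of an unrelated cause: treat as a competing risk.
        return "competing_risk", patient["death_date"]
    if as_of - patient["last_visit"] >= LTFU_THRESHOLD:
        # No visit within the threshold: lost to follow-up.
        return "lost_to_follow_up", patient["last_visit"]
    return "in_care", as_of

status, _ = classify({"last_visit": date(2022, 1, 10)}, as_of=date(2023, 6, 1))
print(status)  # → lost_to_follow_up
```

Because the rules are applied in a fixed order, a patient who both transferred and later had no visits is censored rather than counted as lost; that precedence is itself a decision worth surfacing.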
Sensitivity analyses are part of every cohort analysis we produce. We run the primary analysis under the stated assumptions, then repeat it under alternative assumptions: what if all patients lost to follow-up had the outcome? What if none of them did? What if they had the outcome at twice the rate of patients who stayed in care? The range of results across these scenarios tells the reader how sensitive the conclusions are to the assumptions about loss to follow-up. If the conclusions change substantially, the primary result should be interpreted with caution.
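The three scenarios amount to simple arithmetic on the counts. A sketch, with illustrative numbers rather than programme data:

```python
def sensitivity_bounds(n_followed, n_events, n_ltfu):
    """Outcome rate under the three LTFU scenarios described above.
    All inputs are counts; the function names and labels are illustrative."""
    total = n_followed + n_ltfu
    observed_rate = n_events / n_followed
    # Events imputed to LTFU patients cannot exceed the number of LTFU patients.
    double_rate_events = min(n_ltfu, round(2 * observed_rate * n_ltfu))
    return {
        "primary (observed only)": n_events / n_followed,
        "ltfu all had outcome": (n_events + n_ltfu) / total,
        "ltfu none had outcome": n_events / total,
        "ltfu at twice observed rate": (n_events + double_rate_events) / total,
    }

# Hypothetical example: 700 patients followed, 140 with the outcome (20%),
# 300 lost to follow-up.
bounds = sensitivity_bounds(n_followed=700, n_events=140, n_ltfu=300)
for label, rate in bounds.items():
    print(f"{label}: {rate:.2f}")
```

Here the estimate ranges from 0.14 to 0.44 across scenarios; a reader can see at a glance how much of the primary result rests on assumptions about the patients who left.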
Temporal boundaries matter as much as inclusion criteria
When the cohort starts and ends can be as consequential as who is included. A study that defines its cohort as 'all patients diagnosed between January 2018 and December 2020' will capture different treatment protocols, different diagnostic criteria, and different health system conditions than one that uses 2021-2023. If treatment guidelines changed in 2019, the cohort spans two different clinical realities, and the results are an average across both.
We define temporal boundaries based on the analytical question, not on data availability. If the question is about the effectiveness of a specific treatment protocol, the cohort should span the period when that protocol was in use, not the full period for which data happens to exist. This sometimes means using a smaller cohort, but a smaller cohort that answers the right question is more useful than a larger one that answers a muddled version of it.
Date fields in African health records are often incomplete or inconsistent. A patient's diagnosis date might be recorded as the date they first visited the facility, the date the lab result came back, or the date the clinician entered the data. These can differ by days or weeks. The cohort definition should specify which date field is used and how ambiguous dates are handled, because a one-week difference in the start date can move patients in or out of the cohort.
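One way to make that rule explicit is a stated precedence order over the candidate date fields. The field names and the order shown are assumptions for illustration, not a standard; the point is that the choice lives in one visible place.

```python
from datetime import date

# Precedence: lab confirmation first, then first facility visit, then the
# data-entry date as a last resort. Field names are hypothetical.
DATE_PRECEDENCE = ("lab_result_date", "first_facility_visit", "data_entry_date")

def diagnosis_date(record):
    for field in DATE_PRECEDENCE:
        if record.get(field) is not None:
            return record[field]
    return None  # no usable date: flag for review rather than guess

def in_cohort(record, start, end):
    d = diagnosis_date(record)
    return d is not None and start <= d <= end

rec = {"first_facility_visit": date(2019, 1, 3),
       "lab_result_date": date(2019, 1, 10)}
print(diagnosis_date(rec))  # lab result preferred → 2019-01-10
```

Note how the one-week gap between the two fields can move this patient across a cohort boundary: a window ending 2019-01-05 includes the patient under a first-visit rule but excludes them under a lab-result rule.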
Make the cohort definition visible
When we build analytics for a project, the cohort definition is documented in plain language, not just in code. Collaborators can see who is included, who is excluded, and why. If they disagree with a decision, that conversation happens before the analysis runs, not after the results are published.
This is basic methodology, but it is often skipped in applied analytics work where the pressure is to produce results quickly. We treat it as a required step because the alternative is building on a foundation that nobody examined.
The cohort definition document is delivered alongside the analysis results, not buried in an appendix. It includes a flow diagram showing how many patients were in the initial dataset, how many were excluded at each step and why, and how many remain in the final analysis cohort. This CONSORT-style flow diagram is standard in clinical trials but rare in applied analytics. We include it because it makes the analytical choices visible to anyone reading the results, whether they are a statistician or a programme director.
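The counts behind such a flow diagram fall out naturally if exclusions are applied as an ordered list of named steps. A minimal sketch, with made-up patients and rules:

```python
def exclusion_flow(patients, steps):
    """Apply exclusion steps in order and record a CONSORT-style count at
    each one. `steps` is a list of (reason, predicate) pairs; a predicate
    returning True excludes the patient. All names here are illustrative."""
    flow = [("initial dataset", len(patients))]
    remaining = list(patients)
    for reason, excludes in steps:
        kept = [p for p in remaining if not excludes(p)]
        flow.append((f"excluded: {reason}", len(remaining) - len(kept)))
        remaining = kept
    flow.append(("final analysis cohort", len(remaining)))
    return remaining, flow

patients = [{"age": a, "confirmed": c} for a, c in
            [(17, True), (25, True), (40, False), (31, True), (16, False)]]
steps = [
    ("under 18", lambda p: p["age"] < 18),
    ("no confirmed diagnosis", lambda p: not p["confirmed"]),
]
cohort, flow = exclusion_flow(patients, steps)
for label, n in flow:
    print(f"{label}: {n}")
```

Because each step carries its reason as data, the same structure that filters the cohort also produces the human-readable flow for the definition document, so the two cannot drift apart.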