
Bioinformatics · Applied AI

From Sequencing Output to Actionable Research Questions

Many institutions can now generate genomic data. Fewer can turn that data into research questions that lead somewhere. The gap is analytical, not technological.

April 3, 2026 · 10 min read · Africure Analytics

Sequencing costs have dropped to the point where generating molecular data is no longer the bottleneck. The bottleneck is what happens after the sequencer finishes running.

The data-to-question gap

A sequencing run produces millions of reads. Those reads need to be aligned, quality-filtered, and processed through a bioinformatics pipeline before they become anything a researcher can work with. That pipeline requires choices (which aligner, which variant caller, which quality thresholds), and each choice affects the results.

Once the processing is done, the researcher has a dataset. But a dataset is not a research question. The question requires context: what is the clinical problem, what has already been studied, what would a meaningful finding look like, and what sample size is needed to detect it.

In many institutions, the person running the sequencer and the person asking the research question are not the same person, and the gap between them is where projects stall.

A typical scenario: a research team at a West African university obtains funding to sequence 200 tumour samples from breast cancer patients. The sequencing is completed in six weeks. The raw data is transferred to a hard drive and handed to the bioinformatics unit. The bioinformatics unit runs a standard variant calling pipeline and delivers a spreadsheet of 45,000 variants. The clinical researcher looks at the spreadsheet and asks: 'What do we do with this?' The project stalls for eight months while the team figures out how to move from a list of variants to a research question that can be answered with their data.

The gap exists because sequencing technology scaled faster than analytical training. Buying a sequencer and running protocols is a capital and operational challenge that many institutions have solved. Building the expertise to design a study around genomic data, process the output with appropriate quality control, and analyse the results with statistical rigour is a human capital challenge that takes years to address.

Analysis is more than computation

Running a differential expression analysis or calling variants is computation. Deciding which comparisons are meaningful, which confounders to account for, and how to handle multiple testing is analysis. The second part requires statistical training and domain knowledge, not just access to a server.

We position bioinformatics as an analytical discipline within the broader health analytics platform. The computational pipelines are important, but they are infrastructure. The value is in connecting the processed data to a well-formed question and producing results that are statistically credible and biologically interpretable.

Multiple testing is a good example of where computation and analysis diverge. A standard RNA-seq experiment may test 20,000 genes for differential expression. Without correction, if every gene were truly null, a p-value threshold of 0.05 would produce around 1,000 false positives. The Benjamini-Hochberg procedure controls the false discovery rate, but the choice of FDR threshold (0.05? 0.10? 0.01?) is an analytical decision that depends on the study's goals. An exploratory study looking for candidate genes to follow up may accept an FDR of 0.10. A confirmatory study intended to validate previous findings should use 0.01 or stricter. A computational pipeline applies a default threshold without considering the context. An analyst chooses the threshold based on the question.
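The step-up procedure itself is short enough to sketch. Here is a minimal, illustrative Benjamini-Hochberg implementation in plain Python; in practice a tested library routine (for example `statsmodels.stats.multitest.multipletests`) would be used instead:

```python
def benjamini_hochberg(pvals, fdr=0.05):
    """Boolean 'reject' flag for each p-value under the
    Benjamini-Hochberg step-up procedure at the given FDR."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k (1-based) with p_(k) <= (k/m) * fdr.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= (rank / m) * fdr:
            k_max = rank
    # Reject the k_max smallest p-values.
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject
```

Note that the whole analytical decision discussed above is carried by the single `fdr` argument: the code is trivial, the choice of threshold is not.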

Biological interpretation is another area where pure computation falls short. A list of 200 differentially expressed genes is not a finding. Pathway enrichment analysis, gene set analysis, and network analysis can identify the biological processes represented in the list, but interpreting those results requires someone who understands both the statistics and the biology. A pathway that is statistically enriched may not be biologically meaningful if it is driven by a single hub gene that happens to interact with many other genes. Conversely, a biologically important pathway may not reach statistical significance if only a few of its genes are differentially expressed.
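The simplest form of pathway enrichment is a one-sided hypergeometric (over-representation) test. A self-contained sketch, using only the standard library; the gene counts in the example are illustrative, not from any real study:

```python
from math import comb

def enrichment_pvalue(overlap, de_count, pathway_size, universe_size):
    """P(X >= overlap) where X is the number of pathway genes among
    `de_count` differentially expressed genes drawn without replacement
    from a universe of `universe_size` genes, `pathway_size` of which
    belong to the pathway (one-sided hypergeometric test)."""
    upper = min(pathway_size, de_count)
    numer = sum(
        comb(pathway_size, x) * comb(universe_size - pathway_size, de_count - x)
        for x in range(overlap, upper + 1)
    )
    return numer / comb(universe_size, de_count)

# Illustrative: 10 of 200 DE genes fall in a 100-gene pathway drawn
# from a 20,000-gene universe; the expected overlap is only about 1,
# so the enrichment p-value is very small.
p = enrichment_pvalue(10, 200, 100, 20000)
```

This computes the statistical enrichment only; as the paragraph above notes, whether such a hit is biologically meaningful (e.g. whether it is driven by a single hub gene) still requires an analyst's judgement.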

Study design comes before sequencing

The most common analytical mistake in genomics projects is designing the study after the data has been generated. Sample selection, batch allocation, and covariate recording all need to be planned before the first sample enters the sequencer. If cases and controls are processed in different batches, batch effects will confound the biological signal. If important covariates (age, sex, tumour stage, tissue preservation method) are not recorded, they cannot be adjusted for in the analysis.
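One concrete design step is allocating samples to batches so that case/control status is balanced within every batch rather than confounded with it. A stratified round-robin allocation can be sketched as follows (function and key names are illustrative, not part of any specific pipeline):

```python
import random
from collections import defaultdict

def allocate_batches(samples, n_batches, stratify_key, seed=0):
    """Assign samples to sequencing batches round-robin within each
    stratum (e.g. case vs control), so no batch is dominated by one
    group and batch effects remain separable from biology."""
    rng = random.Random(seed)  # fixed seed: allocation is reproducible
    strata = defaultdict(list)
    for s in samples:
        strata[stratify_key(s)].append(s)
    batches = [[] for _ in range(n_batches)]
    i = 0
    for group in strata.values():
        rng.shuffle(group)  # randomise order within each stratum
        for s in group:
            batches[i % n_batches].append(s)
            i += 1
    return batches

# Illustrative use: 20 cases and 20 controls across 4 batches gives
# 5 cases and 5 controls in every batch.
samples = [("case", n) for n in range(20)] + [("control", n) for n in range(20)]
batches = allocate_batches(samples, 4, stratify_key=lambda s: s[0])
```

The ten minutes this takes before sequencing begins is exactly the kind of inexpensive design conversation that prevents a confounded dataset later.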

We advise research partners to involve analytical expertise at the study design stage, not after the data is generated. A 30-minute conversation about sample allocation, power calculation, and covariate recording can prevent months of frustration trying to rescue a confounded dataset. The conversation is inexpensive. The rescue attempt is expensive and often unsuccessful.

Power calculations for genomic studies require assumptions about effect sizes, within-group variability, and the number of tests. These assumptions are harder to specify than in a traditional clinical study because the effect sizes are often unknown and the number of tests is large. We use pilot data or published studies with similar designs to inform the power calculation, and we report the detectable effect size alongside the sample size justification so the research team can judge whether the study is worth conducting.
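Under a normal approximation for a two-group comparison of a single gene, both the power calculation and the detectable-effect-size report reduce to a few lines. A minimal sketch using only the standard library; the genome-wide alpha shown is a simple Bonferroni-style adjustment over 20,000 tests, which is one of several possible planning choices, not a recommendation:

```python
from statistics import NormalDist

def two_group_power(n_per_group, effect_size, alpha):
    """Approximate power of a two-sided two-sample z-test, where
    effect_size is the group difference in standard-deviation units."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = effect_size * (n_per_group / 2) ** 0.5  # non-centrality
    return (1 - nd.cdf(z_crit - ncp)) + nd.cdf(-z_crit - ncp)

def detectable_effect(n_per_group, alpha, target_power=0.8):
    """Smallest standardised effect size detectable at the target
    power, found by bisection on the power function."""
    lo, hi = 0.0, 10.0
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if two_group_power(n_per_group, mid, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative: the effect detectable at 80% power with 100 samples
# per group is much larger at a genome-wide alpha than at 0.05.
alpha_gw = 0.05 / 20000
d_gw = detectable_effect(100, alpha_gw)
```

Reporting `d_gw` alongside the sample size is the "detectable effect size" disclosure described above: it lets the research team judge whether effects of that magnitude are plausible before committing to the study.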

Building local analytical capacity

When bioinformatics analysis is outsourced entirely, the local institution learns nothing. They send samples, receive results, and co-author a paper, but the analytical capability stays elsewhere. We work with partners to build their capacity to ask and answer questions from their own molecular data, not just to generate it.

Building capacity means more than running a training workshop. It means embedding analytical skills into the institution's research workflow. This includes establishing standard operating procedures for common analyses, documenting pipeline configurations so they can be reproduced and modified, and creating a mentorship structure where junior analysts can progress from running established pipelines to designing new analyses.

The goal is independence. A research institution that can design genomic studies, process sequencing data, and analyse results without external support is scientifically autonomous. That autonomy means faster iteration, more relevant research questions (because the people asking the questions are closest to the clinical context), and a self-sustaining research programme that does not collapse when external funding or collaboration ends.
