Data sharing agreements are written by lawyers and compliance officers. Analytics workflows are designed by data scientists and engineers. The two groups rarely sit in the same room, and the gaps between their assumptions cause problems that neither group anticipated.
What agreements usually cover
A typical data sharing agreement specifies who can access the raw data, for what purpose, for how long, and under what security conditions. It may specify that data must be stored in a particular jurisdiction, that access is limited to named individuals, and that the data must be destroyed at the end of the project.
These provisions are straightforward when the data sits in a single file on a single server. They become complicated when the data enters an analytics workflow.
Most agreements also include provisions about publication and attribution. The data provider may have the right to review publications before submission, to be named as a co-author or acknowledged, and to approve the specific analyses that are published. These provisions are reasonable and standard. But they assume a linear workflow: data goes in, a paper comes out. Analytics workflows are rarely that simple.
What agreements usually miss
An analytics workflow produces intermediate outputs: cleaned datasets, derived variables, model coefficients, prediction scores, summary statistics, and visualisations. The agreement says the raw data is restricted. But what about a summary table that shows average blood pressure by age group and district? That table was derived from the restricted data. Can it be shared?
What about model coefficients? If a logistic regression was trained on restricted data, the coefficients encode information about that data. Can the model be shared even if the data cannot? Can the model be deployed as a live tool that produces predictions for new patients?
These questions come up in every applied analytics project. Most data sharing agreements do not address them because the people who wrote the agreement did not think about the analytics workflow in detail.
The issue extends to visualisations and dashboards. A dashboard that shows district-level disease rates derived from restricted patient data could, in principle, be used to re-identify individuals if the district population is small enough. A map showing malaria incidence by village may effectively reveal which households are affected if the village has only 50 residents. The agreement restricts access to individual-level data, but the derived output may carry enough information to compromise confidentiality. Standard data sharing agreements rarely address this risk because the people drafting them do not think in terms of dashboard design and minimum cell sizes.
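The small-cell risk described above can be checked mechanically before an aggregate table reaches a dashboard. The sketch below is illustrative only: the threshold value, field names, and the `suppress_small_cells` helper are assumptions for this example, not part of any specific agreement or platform.

```python
# Hypothetical sketch: suppress cells whose denominator is too small to
# publish safely. The threshold (10) is an assumed policy value.

MIN_CELL_SIZE = 10  # minimum population per reported cell (assumption)

def suppress_small_cells(rows, count_key="n"):
    """Withhold rates for cells whose denominator is below the threshold."""
    safe = []
    for row in rows:
        if row[count_key] < MIN_CELL_SIZE:
            # Publishing a rate for so few people risks re-identification.
            safe.append({**row, "rate": None, "suppressed": True})
        else:
            safe.append({**row, "suppressed": False})
    return safe

village_rates = [
    {"village": "A", "n": 412, "rate": 0.031},
    {"village": "B", "n": 8, "rate": 0.125},  # tiny village: suppress
]
print(suppress_small_cells(village_rates))
```

A real implementation would also consider complementary suppression (a suppressed cell can sometimes be recovered by subtraction from row and column totals), which is why dashboard review is a human step, not just a filter.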
Machine learning model outputs create another grey area. If a prediction model is trained on restricted data and then used to generate predictions for new patients, the predictions are new data that did not exist in the original dataset. But the model that generated them was shaped by the restricted data. If the agreement prohibits derivative works, does a deployed prediction model count as a derivative work? Different legal interpretations reach different conclusions, and most agreements do not address this question directly.
Questions that most agreements leave unanswered include:
- Can derived summary statistics be shared outside the project team?
- Can model coefficients trained on restricted data be published?
- Can a prediction tool built from restricted data be deployed publicly?
- Who owns the intellectual property in derived analytical outputs?
- What happens to intermediate datasets when the agreement expires?
Practical approaches to closing the gap
The most effective way to close the gap between data sharing agreements and analytics workflows is to involve an analyst in the agreement drafting process. Not to write the legal language, but to walk the legal team through the anticipated workflow: here is the raw data, here are the intermediate outputs we expect to produce, here is how we plan to use the model, and here is what we want to publish. This conversation surfaces the ambiguities before they become problems.
We have developed a standard analytics workflow appendix that partners can attach to their data sharing agreements. The appendix lists the types of intermediate outputs the analytics workflow will produce, specifies which outputs are treated as restricted (subject to the same access controls as the raw data) and which are treated as derived (shareable within the terms of the agreement), and clarifies the status of model coefficients and prediction tools.
For outputs that fall in the grey area, we recommend a tiered approach: summary statistics with cell sizes above a minimum threshold (typically 5 or 10) are treated as non-identifiable and shareable; model coefficients are shareable if the model has more than a specified number of training observations (to prevent coefficient-based re-identification); and dashboards are reviewed for re-identification risk before external access is granted.
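The tiered rules above amount to a small decision procedure. The sketch below encodes them directly; the threshold values, tier names, and the `classify_output` function are assumptions chosen for illustration, not fixed policy.

```python
# Illustrative encoding of the tiered approach described above.
# Thresholds are assumed example values, not prescribed ones.

MIN_CELL = 10          # minimum cell size for shareable summary statistics
MIN_TRAIN_OBS = 1000   # minimum training set size for shareable coefficients

def classify_output(kind, **attrs):
    """Return 'shareable', 'review', or 'restricted' for an analytics output."""
    if kind == "summary_statistic":
        return "shareable" if attrs["min_cell_size"] >= MIN_CELL else "restricted"
    if kind == "model_coefficients":
        return "shareable" if attrs["n_training_obs"] >= MIN_TRAIN_OBS else "restricted"
    if kind == "dashboard":
        # Dashboards always pass through a re-identification review first.
        return "review"
    # Default: unknown output types inherit the raw data's restrictions.
    return "restricted"

print(classify_output("summary_statistic", min_cell_size=12))
```

Defaulting unknown output types to "restricted" mirrors the logic of the appendix: an output is shareable only because a rule explicitly says so, never by omission.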
How we handle it
Our platform uses project-scoped access. All data, files, messages, and outputs are contained within a project. Access is controlled at the project level, and the audit log tracks every interaction. When a data sharing agreement has specific restrictions, we configure the project's access controls to match.
We also raise the intermediate-output question early in the partnership discussion. It is easier to agree on what can and cannot be shared before the analysis produces results than to negotiate retroactively when a manuscript is ready for submission.
The platform's project structure naturally supports the boundary between restricted and shareable outputs. Raw data files are uploaded with a 'data_upload' category and are accessible only to project-assigned users. Analysis outputs such as results and reports are uploaded with their own categories and can have different access rules. The audit log records every access to every file, so compliance with the data sharing agreement can be verified after the fact.
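A category-based access check of the kind described above can be sketched as a simple lookup. The 'data_upload' category comes from the text; the second category name, the role names, and the `can_access` helper are hypothetical placeholders, not the platform's actual API.

```python
# Hypothetical sketch of per-category access rules. Only 'data_upload'
# is named in the text; other names here are illustrative assumptions.

ACCESS_RULES = {
    "data_upload": {"project_member"},                       # raw data: team only
    "analysis_output": {"project_member", "partner_reviewer"},
}

def can_access(category, role):
    """Check whether a role may open a file in the given category."""
    # Unknown categories default to no access, matching a restrictive-by-default stance.
    return role in ACCESS_RULES.get(category, set())

print(can_access("data_upload", "partner_reviewer"))
```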
When an agreement specifies that data must be destroyed at the end of the project, our platform supports project archival with file deletion. The project metadata (timelines, stage history, audit log) is retained for accountability purposes, but the data files are permanently deleted. The audit log entry for the deletion serves as evidence that the destruction clause was fulfilled.
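The archival flow above, deleting the data files while retaining metadata and recording the deletion in the audit log, can be sketched as follows. The data structures and the `archive_project` function are illustrative assumptions, not the platform's real implementation.

```python
# Minimal sketch, under assumed data structures, of archival with file
# deletion: files go, metadata and audit history stay, and the deletion
# itself is logged as evidence that the destruction clause was fulfilled.

from datetime import datetime, timezone

def archive_project(project):
    """Delete data files but retain project metadata and audit history."""
    deleted = [f["name"] for f in project["files"]]
    project["files"] = []  # data files permanently removed
    project["audit_log"].append({
        "event": "files_deleted",
        "files": deleted,  # names only; contents are gone
        "at": datetime.now(timezone.utc).isoformat(),
    })
    project["status"] = "archived"
    return project
```

Note that the log records file names, not contents: the evidence of destruction must not itself become a copy of the restricted data.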