Old biases in new data: Inclusive preprocessing to create disability representation in synthetic datasets

Vicki Austin, Jamie Danemayer

March 25, 2026

Academic Research Publications

Overview

Population-based data disaggregated by disability are essential for informed policymaking, especially for disability-inclusive development and the realisation of the rights of persons with disabilities. Both areas rely on accurate evidence and its efficient use, especially in the current global context of resource constriction. Disability-inclusive data, and inclusive disaggregated datasets more widely, can enable assessment of whether people with disabilities participate in society on equal terms with those without disabilities. They can also support difficult decisions about what to prioritise, and how, in a resource-poor context.

The importance of inclusive datasets is growing in the context of population ageing, as governments must plan for increasing levels of support need. When individuals are not represented in the data used to design policy, they are less likely to benefit from it. This perpetuates exclusionary practices, a phenomenon described in clinical health research as data poverty. Yet effectively representing disability in population data, and then interpreting what resources are needed, when and by whom, is not straightforward.

Disability is a protected characteristic, similar to gender or sexual orientation, meaning that its collection and use pose risks around privacy, discrimination, and misuse. Combined with non-inclusive and inaccessible data collection, this contributes to the systematic underrepresentation of disabled people in many population datasets. Data poverty and evidence exclusion can then result in poor, ill-informed decision-making, which can lead to outcomes that disadvantage disabled people. However, rapid developments in data science mean new statistical methods can be drawn upon to improve existing datasets, including techniques to fill in missing information and mitigate bias. These methods warrant investigation, as they potentially offer new hope for evidence-based, disability-inclusive policymaking in data-sparse settings.

One advancement in data science is the use of artificially generated datasets (known as synthetic data) to fill in the gaps where complete, representative data are not available.

Synthetic data creation involves generating a new set of datapoints with the same properties as the original dataset (i.e. preserving key statistical relationships between variables), while removing identifying characteristics. This process can also include steps to augment the data, for example by correcting historical biases and underrepresentation. Further, because individuals cannot be identified, synthetic datasets can be made more freely available to researchers. For these reasons, synthetic data are increasingly used to study relationships within data and to inform decision-making where access to sensitive data is restricted, or complete data are unavailable.
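
As a rough illustration of the idea, the sketch below draws new records from the observed joint distribution of two variables and leaves the identifying field out. It is a deliberately minimal toy, not a description of any particular synthesiser: the records, the field names (`name`, `age_band`, `disability`) and the function `synthesise` are all hypothetical, and real tools use richer models and formal privacy safeguards that simply dropping a name does not provide.

```python
import random
from collections import Counter

# Toy records invented purely for illustration; they are not real data.
real_records = [
    {"name": "P1", "age_band": "18-64", "disability": False},
    {"name": "P2", "age_band": "18-64", "disability": False},
    {"name": "P3", "age_band": "18-64", "disability": True},
    {"name": "P4", "age_band": "65+", "disability": False},
    {"name": "P5", "age_band": "65+", "disability": True},
    {"name": "P6", "age_band": "65+", "disability": True},
]


def synthesise(records, keys, n, seed=0):
    """Draw n synthetic rows from the empirical joint distribution of `keys`.

    Only the fields listed in `keys` are carried over, so identifying
    fields such as `name` never appear in the output.
    """
    rng = random.Random(seed)
    # Count how often each combination of values occurs in the real data.
    joint = Counter(tuple(r[k] for k in keys) for r in records)
    combos = list(joint)
    counts = [joint[c] for c in combos]
    # Sample combinations in proportion to their observed frequency.
    draws = rng.choices(combos, weights=counts, k=n)
    return [dict(zip(keys, d)) for d in draws]


synthetic = synthesise(real_records, ["age_band", "disability"], n=1000)
```

Because the sampler can only return combinations that occur in the original records, any group that is rare or absent there will be equally rare or absent in the synthetic output.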

However, synthetic data are inherently dependent on the quality and representativeness of the original, real dataset upon which they are based. Without careful preprocessing, synthetic data risk reproducing and amplifying the representation gaps of the real dataset, and carrying them into any evidence and policy they inform.
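
One simple form such preprocessing could take is sketched below: reweighting records so that group shares match an external benchmark before any synthetic data are generated. This is one possible technique (post-stratification weighting) chosen for illustration, not a method prescribed by this article. The sample, the 20% benchmark and all function names are hypothetical assumptions.

```python
import random
from collections import Counter


def group_shares(records, key):
    """Return the share of records falling in each group of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def poststratification_weights(records, key, target_shares):
    """Weight each record by target share / observed share for its group."""
    observed = group_shares(records, key)
    return [target_shares[r[key]] / observed[r[key]] for r in records]


def weighted_share(records, weights, key, value):
    """Share of total weight held by records where `key` equals `value`."""
    matching = sum(w for r, w in zip(records, weights) if r[key] == value)
    return matching / sum(weights)


# Toy sample invented for illustration: disabled people are 5% of records.
sample = [{"disability": True}] * 5 + [{"disability": False}] * 95

# Hypothetical benchmark (an assumption, e.g. taken from a census).
benchmark = {True: 0.20, False: 0.80}

weights = poststratification_weights(sample, "disability", benchmark)

naive_share = group_shares(sample, "disability")[True]
corrected_share = weighted_share(sample, weights, "disability", True)

# Resampling with these weights gives a synthesiser a rebalanced input.
rng = random.Random(0)
rebalanced = rng.choices(sample, weights=weights, k=10_000)
rebalanced_share = group_shares(rebalanced, "disability")[True]
```

Reweighting corrects the headline proportion but adds no new information: the few disabled respondents who were observed are simply counted more heavily. It therefore cannot recover experiences that the original data collection excluded altogether, and it depends on a trustworthy benchmark being available.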

Although the use of synthetic data in disability policy has not been widely documented, its application in adjacent areas of public health, such as clinical guidance and health system planning, suggests that its use will continue to expand. As population ageing also drives governments to revise public health policies, the appetite for population-based evidence that factors in demographic and age structure is rapidly increasing. Without clear guidance on fairness and bias mitigation, it is highly likely that limited datasets will be used to generate synthetic data for policy-level decision-making in the near future. Opportunities to improve disability representation in data, by protecting individuals’ privacy and expanding access to quality data on which to base policy decisions, would consequently be missed. Researchers in this space can seek to test, develop and apply new good practice for assessing and improving the representation of disability (among other protected characteristics) using new statistical methods. These inclusive efforts are most vital and impactful at the preprocessing stage, where they improve the quality of evidence and advance global coherence on the fair use of synthetic data.

To develop the idea of inclusive preprocessing, this article brings together three intersecting areas, which together set out our research agenda: the role and limitations of administrative disability data; the opportunities and risks associated with synthetic data for disability inclusion; and the future methodological work needed to ensure that new data practices do not reproduce old biases and are in fact useful for disability-inclusive policymaking. While appropriate methods will vary by context and use case, there is a clear need for well-developed, overarching guidance on the use of synthetic data, and AI more broadly, in structural health research, so that analyses can delineate and address inequities.