Synthetic Data – Are We All Talking About the Same Thing?

In the rapidly evolving world of healthcare and drug development, data is the driving force behind innovation. Yet the lack of frameworks for sharing data across institutions and the need to protect patient privacy pose significant challenges that slow progress and call for new solutions. One solution gaining significant momentum is synthetic data: artificially generated datasets that closely mirror observed (real) data from a statistical perspective without revealing information about any particular individual, thus minimizing privacy risks.

Our latest collaborative paper, “Synthetic Data in Healthcare and Drug Development: Definitions, Regulatory Frameworks, Issues,” explores the potential of synthetic data to transform healthcare research. We collaborated with experts from the UK Medicines and Healthcare Products Regulatory Agency, Unlearn.AI, KU Leuven’s Centre for IT & IP Law, University of Oxford’s HeLEX, University of Catania, and Humanitas Research Center to clarify what synthetic data is, to frame it in the context of external control arms, and to highlight key challenges such as regulatory acceptance. External control arms are also called “synthetic control arms” even when they are built from true/observed data (Burcu et al., 2020), a usage that does not fully encompass the broader scope introduced by recent AI-driven methods for generating synthetic data.

Why Synthetic Data Matters

Synthetic data, which can now be created with powerful generative AI algorithms (though not only with those), has the potential to significantly enhance the speed, accuracy, and efficiency of drug development. These datasets allow researchers to simulate clinical scenarios without using sensitive patient data, thereby reducing privacy concerns. Fewer privacy concerns also mean greater accessibility, since real-world data (RWD) and data from randomized controlled trials (RCTs) are notoriously difficult to obtain. Another key aspect of synthetic data is that its generating process can often be “conditioned”: a researcher can set the algorithm to generate arbitrary numbers of patients belonging to specific sub-populations (e.g., minority groups based on sex or ethnicity). This means that what-if scenarios can be explored, assessing how downstream analysis results change when a specific sub-population is considered.
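
To make the idea of “conditioning” concrete, here is a deliberately simple sketch. It is not the method used in the paper or by any particular generative model: it assumes a toy dataset with one hypothetical grouping variable (`sex`) and one hypothetical measurement (`biomarker`), fits a separate normal distribution per subgroup, and then draws as many synthetic patients as requested from the chosen subgroup. All names and values below are made up for illustration.

```python
import random
import statistics
from collections import defaultdict


def fit_conditional_gaussians(records, group_key, value_key):
    """Estimate the mean and standard deviation of value_key within each subgroup."""
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[group_key]].append(rec[value_key])
    # statistics.stdev needs at least two observations per subgroup.
    return {
        group: (statistics.mean(vals), statistics.stdev(vals))
        for group, vals in by_group.items()
    }


def generate_conditioned(params, group, n, group_key, value_key, seed=None):
    """Draw n synthetic records for one subgroup from its fitted normal distribution."""
    rng = random.Random(seed)
    mean, std = params[group]
    return [
        {group_key: group, value_key: rng.gauss(mean, std), "synthetic": True}
        for _ in range(n)
    ]


# Hypothetical observed data (illustrative values only, not real patients).
observed = [
    {"sex": "F", "biomarker": 4.1},
    {"sex": "F", "biomarker": 4.6},
    {"sex": "F", "biomarker": 3.9},
    {"sex": "M", "biomarker": 5.2},
    {"sex": "M", "biomarker": 5.8},
    {"sex": "M", "biomarker": 5.5},
]

params = fit_conditional_gaussians(observed, "sex", "biomarker")
# Request 50 synthetic patients from the "F" subgroup only.
synthetic = generate_conditioned(params, "F", 50, "sex", "biomarker", seed=0)
```

A what-if analysis would then rerun the downstream analysis on datasets generated with different subgroup sizes and compare the results. Real generators model many variables jointly; a per-subgroup normal distribution is only the smallest example of the same principle.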

However, the introduction of synthetic data also raises critical questions about provenance, trust, and regulatory oversight. Where does the data originate? How do we ensure it faithfully represents real-world conditions? And how can we validate its utility in clinical research?

Key Insights from Our Paper

We explore several pivotal aspects of synthetic data use in healthcare, offering insights drawn from both our research and the collaborative efforts of our team:

Provenance of Synthetic Data
We highlight the importance of establishing robust provenance mechanisms to track the origin, transformation, and use of synthetic data. More than observed (real) data, synthetic data requires detailed documentation of the models and processes used in its creation to maintain its credibility.
Distinguishing Between Synthetic and Observed Data
We highlight the need for clearly labeling synthetic data to avoid confusion when it is integrated with (true) observed data (e.g., real-world data, historical randomized controlled trials). To address this, we propose the use of data cards: structured summaries that document the data’s origins, generation processes, and intended uses, thereby enhancing transparency and reducing misinterpretation.
Replicability and Validation
Ensuring that synthetic data can reliably replicate real-world outcomes is essential for its use in research. We call attention to approaches like sequential synthesis to improve replicability and validate the consistency of conclusions drawn from synthetic data compared to observed data.
Data Privacy and Ethical Considerations
We explore the privacy risks associated with generative AI models, particularly the possibility of models memorizing real data points.
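
As an illustration of what a data card might look like in practice, the sketch below defines one as a small structured record. The field names are assumptions chosen to match the three elements mentioned above (origins, generation process, intended uses) plus an explicit synthetic/observed flag; they are not the schema proposed in the paper, and the example values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataCard:
    """Hypothetical data card: a structured summary that travels with a dataset."""
    dataset_name: str
    is_synthetic: bool              # explicit label to separate synthetic from observed data
    source_description: str         # origin of the data the generator was built from
    generation_method: str          # model or process used to create the dataset
    intended_uses: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the card so it can be stored alongside the dataset."""
        return json.dumps(asdict(self), indent=2)


# Example card with made-up values.
card = DataCard(
    dataset_name="example_synthetic_control_arm",
    is_synthetic=True,
    source_description="Derived from a historical randomized controlled trial (hypothetical)",
    generation_method="Data-driven generative model (details documented separately)",
    intended_uses=["exploratory what-if analyses"],
    known_limitations=["not validated for regulatory submission"],
)

card_json = card.to_json()
```

Keeping the card machine-readable means that a pipeline mixing synthetic and observed data can check the `is_synthetic` flag instead of relying on file names or convention.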

The Road Ahead for Synthetic Data in Healthcare

As synthetic data is poised to play a growing role in healthcare research, it is essential for the regulatory landscape to evolve in tandem. While process-driven synthetic data (such as data produced by pharmacokinetic models) is already accepted by regulatory bodies like the FDA and EMA in drug development, data-driven synthetic data (generated via AI models) still lacks clear regulatory definitions and terms of use.

Our paper highlights the need for continued collaboration across various sectors—academic, clinical, regulatory, and industry stakeholders—to drive the development of synthetic data standards. By clearly defining terminology and ensuring transparency in its usage, we believe synthetic data can become a key enabler in clinical research and drug development.

Read the full, open access paper here: https://ascpt.onlinelibrary.wiley.com/doi/10.1002/psp4.70021

References

Burcu, M., Dreyer, N. A., Franklin, J. M., Blum, M. D., Critchlow, C. W., Perfetto, E. M., & Zhou, W. (2020). Real‐world evidence to support regulatory decision‐making for medicines: Considerations for external control arms. Pharmacoepidemiology and Drug Safety, 29(10), 1228–1235. https://doi.org/10.1002/pds.4975
