Synthetic Data – Are We All Talking About the Same Thing?

In the rapidly evolving world of healthcare and drug development, data is the driving force behind innovation. Yet, the growing complexities of patient privacy and data access are creating challenges that demand innovative solutions. One such solution gaining significant momentum is synthetic data—artificially generated datasets that closely mirror observed (real) data while minimizing privacy risks.

 

Our latest collaborative paper, “Synthetic Data in Healthcare and Drug Development: Definitions, Regulatory Frameworks, Issues,” explores the potential of synthetic data in transforming healthcare research. In this paper, we collaborated with experts from the UK Medicines and Healthcare Products Regulatory Agency, Unlearn.AI, KU Leuven’s Centre for IT & IP Law, University of Oxford’s HeLEX, University of Catania, and Humanitas Research Center to address key challenges surrounding data provenance, transparency, and replicability, while highlighting synthetic data as a powerful tool for advancing medical science.

 

Why Synthetic Data Matters

Synthetic data, created through AI and machine learning algorithms, has the potential to significantly enhance the speed, accuracy, and efficiency of drug development. These datasets allow researchers to simulate clinical scenarios without the use of sensitive patient data, thereby reducing privacy concerns. Moreover, synthetic data can help where real-world data (RWD) or data from randomized controlled trials (RCTs) are scarce or difficult to obtain.

However, the introduction of synthetic data also raises critical questions about provenance, trust, and regulatory oversight. Where does the data originate? How do we ensure it faithfully represents real-world conditions? And how can we validate its utility in clinical research?

 

Key Insights from Our Paper

In this paper, we explore several pivotal aspects of synthetic data use in healthcare, offering insights drawn from both our research and the collaborative efforts of our team:

  1. Provenance of Synthetic Data
    We highlight the importance of establishing robust provenance mechanisms to track the origin, transformation, and use of synthetic data. Unlike observed (real) data, which usually has a clearer origin, synthetic data requires detailed documentation of the models and processes used in its creation to maintain its credibility.
  2. Distinguishing Between Synthetic and Observed Data
    We discuss the significance of clearly labeling synthetic data to avoid confusion when it is integrated with (true) observed data (e.g., real-world data, historical randomized controlled trials). To address this, we propose the use of data cards—structured summaries that document the data’s origins, generation processes, and intended uses, thereby enhancing transparency and reducing misinterpretation.
  3. Replicability and Validation
    Ensuring that synthetic data can reliably replicate real-world outcomes is essential for its use in research. We discuss approaches like sequential synthesis to improve replicability and validate the consistency of conclusions drawn from synthetic data compared to observed data.
  4. Data Privacy and Ethical Considerations
    We explore the privacy risks associated with generative AI models, particularly the possibility of models memorizing real data points. The paper discusses strategies for ensuring compliance with privacy regulations, like the GDPR, while still leveraging synthetic data for research purposes.
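To make the data card idea from point 2 more concrete, here is a minimal sketch in Python. It is only an illustration under our own assumptions: the field names, the `DataCard` class, and the `label_records` helper are hypothetical and are not a schema defined in the paper. The sketch records a dataset's origin, generation process, and intended uses, and tags every record with its data type so that synthetic and observed records remain distinguishable after they are pooled.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DataCard:
    """Hypothetical data card; the fields are illustrative, not taken from the paper."""

    dataset_name: str
    data_type: str           # either "synthetic" or "observed"
    origin: str              # where the underlying data came from
    generation_process: str  # model or process used to create the dataset
    intended_uses: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject ambiguous labels so every dataset is explicitly one type or the other.
        if self.data_type not in ("synthetic", "observed"):
            raise ValueError("data_type must be 'synthetic' or 'observed'")

    def to_json(self) -> str:
        """Serialize the card so it can be stored alongside the dataset."""
        return json.dumps(asdict(self), indent=2)


def label_records(records: List[Dict], card: DataCard) -> List[Dict]:
    """Return copies of the records tagged with the card's data type and name."""
    return [
        {**record, "data_type": card.data_type, "source_card": card.dataset_name}
        for record in records
    ]


# Example usage with made-up values (not real study data).
card = DataCard(
    dataset_name="example_synthetic_cohort",
    data_type="synthetic",
    origin="derived from a hypothetical historical trial dataset",
    generation_process="hypothetical generative model, version and seed documented",
    intended_uses=["method development", "software testing"],
)
labeled = label_records([{"age": 54}, {"age": 61}], card)
print(card.to_json())
```

Carrying the label on each record, rather than only in a separate document, is one possible design choice; it means the synthetic or observed status travels with the data even when datasets are later merged.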

 

FIGURE 1 Comparison of observed (“true”) data and synthetic data via Venn diagram, illustrating the key characteristics, benefits, and limitations of each data type while highlighting their unique attributes and areas of overlap.

 

The Road Ahead for Synthetic Data in Healthcare

As synthetic data continues to play a growing role in healthcare research, it is essential for the regulatory landscape to evolve in tandem. While process-driven synthetic data (such as pharmacokinetic models) is already accepted by regulatory bodies like the FDA and EMA in drug development, data-driven synthetic data (generated via AI models) still lacks clear regulatory definitions and agreed terms for its use.

Our paper highlights the need for continued collaboration across various sectors—academic, clinical, regulatory, and industry stakeholders—to drive the development of synthetic data standards. By clearly defining terminology and ensuring transparency in its usage, we believe synthetic data can become a key enabler in clinical research and drug development.

 

Read the paper here: https://ascpt.onlinelibrary.wiley.com/doi/10.1002/psp4.70021

 
