Synthetic Data for Rare Subgroups – A Decision-Grade Tool to Inform Drug Development

In drug development, access to diverse and representative patient data is often a limiting factor, especially when studying rare subgroups. This is where synthetic data can offer powerful, practical value.
At InSilicoTrials, we use synthetic data not as a replacement for real-world data (RWD), but as a decision-grade tool to augment, simulate, and explore what real data alone can’t reveal.

Where Does Synthetic Data Fit?

At InSilicoTrials, we see synthetic data as part of a continuum of approaches based on prior knowledge, such as published literature, meta-analyses, Bayesian priors, and RWD.

Like these approaches, synthetic data can inform study design, assess robustness, and optimize populations. However, this value comes with caveats. Synthetic data faces challenges such as dependence on model assumptions, the potential to introduce artifacts, and unclear generalizability. For correct use, it is critical that these limitations are acknowledged and tackled rigorously.

That said, synthetic data can bring advantages. Below, we illustrate a toy scenario that shows how synthetic data can help. Specifically, we learn a prior from a large population and use it to obtain more precise estimates for a small subgroup.

A Toy Example: Exploring Underrepresented Subgroups

Suppose we want to analyze a rare subgroup, such as patients in the top 1% of a biomarker distribution. Traditional analysis using RWD alone is problematic: subgroup sizes are small, and statistical estimates become unstable.

We ask ourselves:

“Can we infer outcomes in a rare subgroup using synthetic data generated from the full population?”

We prepared a toy setting to explore this question, with R code that can be found here: https://rpubs.com/pmessina511/1315710

In short, we considered a scenario with a large population (1,000 individuals) and a small subgroup (10 individuals). We trained a density estimator on the whole dataset and then sampled synthetic members of the rare subgroup from it.
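
To make the setup concrete, the following is a minimal Python sketch of this kind of experiment. It is not the R code linked above. The linear outcome model, the bivariate Gaussian used as the density estimator, the synthetic sample size, and all function names are assumptions chosen for illustration only.

```python
import math
import random
from statistics import NormalDist, fmean


def true_subgroup_mean(slope=1.0, frac=0.01):
    # Exact mean outcome in the top `frac` of a standard normal biomarker,
    # under the assumed model y = slope * x + noise: E[x | x > z] = pdf(z) / frac.
    z = NormalDist().inv_cdf(1 - frac)
    return slope * NormalDist().pdf(z) / frac


def simulate_once(rng, n_pop=1000, n_sub=10, n_syn=10_000, slope=1.0, noise_sd=1.0):
    # "Real" population (hypothetical model): biomarker x, outcome y.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_pop)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]

    # RWD-only estimate: mean outcome of the n_sub individuals
    # with the highest biomarker values.
    top = sorted(zip(xs, ys), reverse=True)[:n_sub]
    rwd_est = fmean(y for _, y in top)

    # Density estimator (assumption): a bivariate Gaussian fitted
    # to the whole population by its sample moments.
    mx, my = fmean(xs), fmean(ys)
    sxx = fmean((x - mx) ** 2 for x in xs)
    syy = fmean((y - my) ** 2 for y in ys)
    sxy = fmean((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    resid_sd = math.sqrt(max(syy - beta * sxy, 0.0))
    sd_x = math.sqrt(sxx)

    # Sample synthetic individuals from the fitted model and keep those
    # that fall in the fitted top 1% of the biomarker.
    cutoff = NormalDist(mx, sd_x).inv_cdf(1 - n_sub / n_pop)
    syn_ys = []
    for _ in range(n_syn):
        x = rng.gauss(mx, sd_x)
        if x > cutoff:
            syn_ys.append(rng.gauss(my + beta * (x - mx), resid_sd))
    syn_est = fmean(syn_ys)
    return rwd_est, syn_est


def compare(n_sims=100, seed=0):
    # Root-mean-square error of each estimate against the known truth.
    rng = random.Random(seed)
    truth = true_subgroup_mean()
    sq_rwd = sq_syn = 0.0
    for _ in range(n_sims):
        rwd_est, syn_est = simulate_once(rng)
        sq_rwd += (rwd_est - truth) ** 2
        sq_syn += (syn_est - truth) ** 2
    return math.sqrt(sq_rwd / n_sims), math.sqrt(sq_syn / n_sims)


if __name__ == "__main__":
    rmse_rwd, rmse_syn = compare()
    print(f"RMSE, RWD only:  {rmse_rwd:.3f}")
    print(f"RMSE, synthetic: {rmse_syn:.3f}")
```

In this sketch the fitted Gaussian matches the way the data were generated, so the model can borrow strength from all 1,000 individuals. With a misspecified estimator, or a subgroup that behaves differently from the rest of the population, the same gain should not be expected.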

Our findings?

Synthetic data outperformed RWD in estimating subgroup outcomes.
Results were consistent across 100 simulations.

Note that this holds when the subgroup is “in-distribution,” i.e., when it shares latent traits with the training population.

Takeaways

Synthetic data offers a compelling addition to the modern data ecosystem in drug development, particularly as a decision-grade tool for early-phase strategy, simulation, and subgroup analysis. Its greatest strength lies in its ability to augment real-world data and enable robust “what-if” exploration when traditional datasets fall short, provided it is used with methodological care.

Our example covers a specific scenario, but it highlights a critical insight:
When a rare subgroup shares latent characteristics with the broader true population, synthetic data can fill gaps left by biased or sparse real-world samples. This holds especially in an ideal scenario where the model has been trained on large and diverse datasets that capture the full spectrum of variability relevant to the subgroup, so that it can generate realistic and informative synthetic representations. The idea is similar to fine-tuning a pre-trained foundation model in machine learning!

Again, however, synthetic data has important limitations:

Synthetic data is only as good as the models and input data on which it is based.
Although synthetic datasets can reduce sampling variability, they do not eliminate true uncertainty about causal effects, and care must be taken not to mistake precision for confidence.
While synthetic data is often heralded as inherently privacy-safe, it is not private by default and requires formal risk assessments.
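
The point about precision can be illustrated with a small Python sketch. It assumes a hypothetical setting in which a Gaussian model is fitted to 50 real observations and then used to generate 2,000 synthetic records; the function name, sample sizes, and distribution are illustrative assumptions, not part of the toy study above.

```python
import math
import random
from statistics import fmean, stdev


def naive_se_vs_actual_error(n_real=50, n_syn=2000, n_reps=200, seed=1):
    # Hypothetical truth: outcomes are N(0, 1).
    rng = random.Random(seed)
    true_mean, true_sd = 0.0, 1.0
    naive_ses, sq_errors = [], []
    for _ in range(n_reps):
        # Fit a Gaussian to a small real sample.
        real = [rng.gauss(true_mean, true_sd) for _ in range(n_real)]
        fit_mean, fit_sd = fmean(real), stdev(real)
        # Generate many synthetic records from the fitted model.
        syn = [rng.gauss(fit_mean, fit_sd) for _ in range(n_syn)]
        # Naive standard error: treats synthetic records as independent real data.
        naive_ses.append(stdev(syn) / math.sqrt(n_syn))
        # Actual squared error of the synthetic mean against the truth.
        sq_errors.append((fmean(syn) - true_mean) ** 2)
    return fmean(naive_ses), math.sqrt(fmean(sq_errors))


if __name__ == "__main__":
    naive_se, actual_rmse = naive_se_vs_actual_error()
    print(f"average naive standard error: {naive_se:.4f}")
    print(f"actual RMSE against truth:    {actual_rmse:.4f}")
```

The naive standard error shrinks as more synthetic records are generated, while the actual error stays tied to the 50 real observations behind the fitted model. Generating more synthetic data therefore makes the estimate look more precise without making it more trustworthy.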

When these caveats are respected and transparent validation is applied, synthetic data can serve as a powerful tool to inform design decisions, simulate populations, and optimize trials, bringing rigor, flexibility, and inclusiveness to the next generation of clinical research.

Let’s explore together how synthetic data can elevate your early-phase strategy and unlock deeper understanding of subgroups that matter.
