A Quarter-Century of Synthetic Data in Healthcare: Unveiling Trends with Structural Topic Modeling

Abstract

Data-driven approaches are transforming healthcare, yet the acquisition of comprehensive datasets is hindered by high costs, privacy regulations, and ethical concerns. Synthetic data, meaning artificially generated datasets that mimic the statistical properties of real-world data, offers a promising way to address these challenges. Despite its growing adoption, the thematic landscape of synthetic data research in healthcare remains underexplored. We therefore applied structural topic modeling (STM) to map this landscape, identifying prevalent topics and tracking their evolution over time and across geographic locations. PubMed publications from 2000 to 2024 containing “synthetic data,” “artificial data,” or “simulated data” in the title or abstract were retrieved. After text preprocessing (lowercasing, punctuation and stopword removal, stemming), STM was fitted with publication year and continent as covariates. The number of topics (K=10) was selected on the basis of held-out likelihood and interpretability. Topic prevalence, temporal trends, and inter-topic correlations were analyzed using stacked area charts and network analysis. The analysis covered 14,788 PubMed articles and revealed a tenfold increase in publications over the study period. Geographically, North America (48.6%) and Europe (33.5%) were the primary contributors, while Asia’s share rose steadily from 2.9% to 23.1%. STM identified ten key topics, grouped into four themes: Biomedical Imaging & Signal Processing (25.2%), Synthetic Data Applications in Biomedical Research (17.7%), Computational & Statistical Methods (23.9%), and Genomics & Evolutionary Biology (33.2%). Initially prominent topics declined gradually, including “Bayesian Modelling” (23.1% to 9.9%), “Neuroimaging” (16.0% to 9.3%), and “Image Simulation” (17.7% to 9.1%), giving way to “Synthetic Data Generation” (2.2% to 27.1%) and “Disease Modeling and Public Health” (4.8% to 11.9%) by 2024.
Synthetic data research in healthcare has experienced increasing interest, marked by shifts in geographic distribution and dynamic evolution of key topics. Realizing the full potential of synthetic data requires fostering cross-disciplinary collaborations, implementing bias mitigation strategies, and establishing equitable partnerships.
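The preprocessing steps named in the abstract (lowercasing, punctuation removal, stopword removal, stemming) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the stopword list, the suffix-stripping rule, and the function names `naive_stem` and `preprocess` are all assumptions made for illustration, and a real analysis would more likely rely on an established stemmer such as Porter's.

```python
import string

# Tiny illustrative stopword list (assumption; the study's actual list is not stated).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for",
             "is", "are", "that", "with", "was", "were"}

# Crude suffix stripping used as a stand-in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop ASCII punctuation, remove stopwords, then stem."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("Synthetic data, generated for modeling!"))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes would need separate handling. The resulting token lists would then be converted to a document-term matrix before any topic model is fitted.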

Author Summary

In recent years, synthetic data (artificially generated datasets designed to reflect real-world information) has gained attention as a way to advance healthcare research while addressing concerns around data privacy, costs, and accessibility. Our work explores how this field has evolved over the past 25 years, identifying key research trends and shifts in geographic contributions. By analyzing over 14,000 published studies, we found that synthetic data research has grown nearly tenfold, with increasing contributions from Asia alongside traditional leaders in North America and Europe. The focus of research has also changed: earlier work emphasized medical imaging and statistical modeling, while recent studies highlight synthetic data generation and its use in disease modeling, public health, and clinical trials. Despite this progress, important gaps remain. Areas such as drug discovery, mental health, and ethical considerations in artificial intelligence need further attention. By mapping these trends, our work underscores the importance of cross-disciplinary collaboration and equitable global partnerships to maximize the benefits of synthetic data in improving healthcare worldwide.
