Blogpost
Synthetic medical images can effectively fill voids in training datasets, addressing critical challenges in medical imaging AI
Data scarcity has long been a significant barrier to the development of robust AI models for medical imaging. Many medical conditions, particularly pathology outliers, have limited representation in available datasets, making it challenging to train AI models that perform well across diverse cases. Accessing high-quality, annotated medical imaging data is both time-consuming and expensive, further compounding the issue. As a result, AI systems may struggle to generalize effectively, particularly when detecting or segmenting edge cases or when applied to underrepresented demographic groups.
To address these challenges, our studies focused on filling voids in training data by generating synthetic medical images. Using advanced generative models, we created synthetic datasets to address two critical gaps: (1) underrepresented cohorts and (2) true positive cases. This approach enabled us to enhance data diversity and balance, ultimately improving the performance and robustness of AI models. This whitepaper summarizes our findings and translates them into actionable insights for AI developers.
Supplementing Training Data with Challenging Cohorts
Our first study, “Evaluating the Utility of Memory-Efficient Medical Image Generation: A Study on Lung Nodule Segmentation”, explores the potential of synthetic data to address the lack of diversity in lung nodule segmentation datasets. Outlier pathologies, such as pleural nodules, present a significant challenge for nodule detection models due to their scarcity in existing datasets. As a result, these models are often inadequately trained to detect such cases. This study focuses on overcoming this limitation by synthetically oversampling these outlier pathologies and evaluating the impact of this approach on the performance of nodule detection models.
Synthetic CT scans were generated using a diffusion model and applied in two scenarios:
1. Training Exclusively with Synthetic Data
A segmentation model was trained solely on the synthetic images generated by the diffusion model. This scenario tested whether synthetic data could fully replace real-world data in the training pipeline. When trained exclusively on synthetic data, the segmentation model achieved a mean Dice Similarity Coefficient (DSC) of 0.5016 (STD: 0.0206). For comparison, the model trained on real-world data alone achieved a mean DSC of 0.4913 (STD: 0.02733). These results demonstrate that synthetic data alone can provide a viable alternative to real-world data for training segmentation models, with performance metrics that are comparable to models trained on real-world datasets.
2. Augmenting Real-World Data with Synthetic Images
Synthetic images were added to a real-world training dataset to assess whether augmentation could improve the segmentation model’s performance, testing the ability of synthetic data to enhance the diversity and coverage of real-world datasets. Augmenting the real-world training data with synthetic images led to a noticeable improvement: the augmented dataset produced a mean DSC of 0.5418 (STD: 0.03015), compared to 0.4913 for real-world data alone. This improvement highlights the value of synthetic data in expanding the diversity and richness of training datasets, enabling models to generalize better to unseen cases.
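Both scenarios were scored with the Dice Similarity Coefficient (DSC), which measures the overlap between a predicted segmentation mask and the ground truth. As a minimal illustration of how that metric is computed — the masks below are toy 2D examples, not data from the study:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice Similarity Coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy 2D masks standing in for a CT slice and its ground-truth annotation
pred = np.zeros((8, 8), dtype=bool)
target = np.zeros((8, 8), dtype=bool)
pred[2:6, 2:6] = True    # 16 predicted pixels
target[3:7, 3:7] = True  # 16 ground-truth pixels; the overlap is a 3x3 region

print(round(dice_coefficient(pred, target), 4))  # 2*9 / (16+16) = 0.5625
```

A DSC of 1.0 means perfect overlap and 0.0 means none, so the study's scores in the 0.49–0.54 range should be read relative to each other rather than against the theoretical maximum.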
By filling gaps in challenging cohorts, this study highlighted the potential of synthetic data to address critical challenges in medical imaging AI.
Balancing True and False Positive Cases in Training Data
In our second study, we focused on addressing the imbalance between true positive and false positive cases in lung nodule classification datasets. The usual process in early-stage lung cancer detection entails two steps: First, nodules are segmented. Second, a classification model determines whether the detected areas are actually nodules.
The detection step must be highly sensitive to abnormalities so that no cancer cases are missed. As a consequence, the model flags a large number of abnormalities that are not cancerous nodules – false positives. By comparison, the number of actually cancerous nodules – true positives – is very low. The classification step is meant to reduce the number of false positives. However, the inherent imbalance of the training data for such a classifier poses a challenge: it undermines the classifier’s ability to accurately identify true positive cases.
To mitigate this, synthetic true positive data points were generated using guided diffusion models and added to the training dataset, creating a balanced composition of true positive and false positive cases. The results demonstrated a measurable improvement in the classifier’s performance, with accuracy increasing from 93.28% to 94.17%, an absolute improvement of 0.89 percentage points. While the improvement may appear modest, such incremental gains are significant in clinical applications, as they can lead to more accurate diagnoses and better patient outcomes.
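The rebalancing idea can be sketched as follows. The counts and the `generate_synthetic_nodule` helper are hypothetical stand-ins for sampling from a guided diffusion model; they illustrate the balancing strategy, not the study’s actual pipeline:

```python
import random

random.seed(0)

# Hypothetical counts reflecting the imbalance described above:
# true positives (cancerous nodules) are far scarcer than false positives.
real_true_positives = [{"label": 1, "synthetic": False} for _ in range(200)]
false_positives = [{"label": 0, "synthetic": False} for _ in range(1800)]

def generate_synthetic_nodule():
    """Stand-in for drawing one synthetic true positive from a guided diffusion model."""
    return {"label": 1, "synthetic": True}

# Generate exactly enough synthetic true positives to match the majority class.
n_needed = len(false_positives) - len(real_true_positives)
synthetic_true_positives = [generate_synthetic_nodule() for _ in range(n_needed)]

training_set = real_true_positives + synthetic_true_positives + false_positives
random.shuffle(training_set)

n_pos = sum(sample["label"] for sample in training_set)
n_neg = len(training_set) - n_pos
print(n_pos, n_neg)  # 1800 1800 — balanced classes
```

Balancing via synthetic generation, rather than simply duplicating the rare real positives, gives the classifier new, varied true positive examples instead of repeated copies of the same few cases.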
By filling the void of underrepresented true positive cases, this study highlights the role of synthetic data in addressing data imbalances and enhancing the performance of even high-performing models.
Implications for Medical AI Development
The findings of our studies underscore several key benefits of using synthetic medical images for AI development:
Synthetic data provides the flexibility to generate images that are underrepresented in real-world datasets, such as outlier pathologies or specific demographic cohorts. By including these edge cases, AI models can be trained to handle a broader range of scenarios, improving their robustness and fairness.
Synthetic data can either replace or supplement real-world datasets. In the first study, synthetic data was shown to perform comparably to real-world data when used exclusively. When used for augmentation, it significantly improved model performance by enhancing data diversity.
Generating synthetic data eliminates the need for manual annotation, which is a costly and time-intensive process. Synthetic images are generated with corresponding annotations, bypassing this bottleneck and enabling the rapid creation of large, diverse datasets.
Conclusion
Our studies demonstrate that synthetic medical images can effectively fill voids in training datasets, addressing critical challenges in medical imaging AI. By generating synthetic data to supplement underrepresented cohorts and true positive cases, we improved model performance. These findings underscore the potential of synthetic data to replace or augment real-world datasets.
As generative modeling technologies continue to advance, synthetic data is poised to play a vital role in medical imaging AI, accelerating development, improving clinical outcomes, and enhancing the fairness and robustness of AI systems. To this end, Ryver.AI and Segmed have partnered to develop comprehensive models for generating medical imaging datasets.
Connect with us to learn more about how synthetic data can fill gaps in your datasets and assess the benefit for your AI development.
Ryver.AI is at the forefront of leveraging synthetic data to advance medical AI. The company’s mission is to reduce bias in medical imaging datasets by generating high-quality synthetic radiology images. With models trained on diverse medical data, Ryver.AI helps ensure that AI systems perform reliably across all demographic groups, addressing the disparities present in current medical AI tools. Ryver.AI’s cutting-edge generative AI technology empowers medical AI teams to develop more accurate, inclusive, and robust diagnostic tools that can benefit patients worldwide.
Media Contact:
contact@ryver.ai
Partnership
This case study shows how synthetic 3D lung CTs containing nodules of varying size and texture can be used to enhance a best-in-class classification model.
Preprint
This paper evaluates the quality and effectiveness of synthetic data by testing its impact on downstream segmentation tasks.