Out of Domain Generalization in Medical Imaging via Vision Language Models
Team
Supervisors
Table of Contents
- Abstract
- Related Work
- Methodology
- Experiment Setup and Implementation
- Results and Analysis
- Conclusion
- Publications
- Links
Abstract
This research addresses domain generalization challenges in medical imaging, using the vision-language model BiomedCLIP as the baseline and optimizing it via advanced prompting strategies. We propose an automatic prompting method that improves interpretability and generalization through iterative feedback to large language models (LLMs), adapting prompts specifically for disease classification from histopathological images.
Related Work
Vision-language models (VLMs) such as CLIP and BiomedCLIP have shown great promise in biomedical tasks, with models like BiomedCoOp and XCoOp introducing domain-specific prompt learning. However, many of these approaches lack interpretability and rely on a single static LLM output. Our method builds on this line of work by integrating iterative feedback for prompt refinement, improving both robustness and transparency in clinical tasks.
Methodology
We use BiomedCLIP as our base model and apply a series of prompt optimization techniques to enhance out-of-domain generalization. The methodology includes:
Preprocessing
- Cleaning: File integrity checks and metadata validation
- Normalization: Standardizing pixel values
- Resizing: Images resized to 224×224 pixels (see the preprocessing sketch after this list)
- Standardization: Label unification and demographic balancing
- Splitting: Domain-generalization-based data split
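The resizing and normalization steps above can be illustrated with a minimal sketch. The helper name and the normalization constants below are illustrative assumptions (standard CLIP statistics), not necessarily the exact values used in our pipeline.

```python
from PIL import Image
import numpy as np

# Standard CLIP channel statistics; assumed here for illustration.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess_image(path: str, size: int = 224) -> np.ndarray:
    """Load a histopathology tile, resize it to 224x224, and normalize pixel values."""
    img = Image.open(path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0   # scale to [0, 1]
    arr = (arr - CLIP_MEAN) / CLIP_STD                # channel-wise standardization
    return arr.transpose(2, 0, 1)                     # HWC -> CHW for PyTorch models
```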
Prompt Optimization
An LLM-driven prompt generation framework starts with an initial set of prompts from Gemini. Using performance scores, prompts are iteratively refined to improve classification (a sketch of the loop follows the list below). The process includes:
- Prompt diversity strategies inspired by evolutionary algorithms
- Scoring and feedback loops
- Final prompts remain human-readable, improving interpretability
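A minimal sketch of this feedback loop is given below. Here `llm` and `score_fn` are hypothetical placeholders for a Gemini wrapper and a validation-set scorer; they are assumptions for illustration, not the project's actual interfaces.

```python
def refine_prompts(llm, score_fn, n_rounds: int = 5, pool_size: int = 20) -> str:
    """Iteratively ask an LLM for candidate prompts, score them on a validation
    split, and feed the best and worst examples back as context for the next round.

    `llm` is any callable mapping an instruction string to a list of candidate
    prompts; `score_fn` maps a prompt to a classification metric (e.g. accuracy).
    """
    prompts = llm("Write 20 diverse, human-readable descriptions of this histopathology class.")
    k = max(1, pool_size // 4)
    for _ in range(n_rounds):
        scored = sorted(((score_fn(p), p) for p in prompts), reverse=True)
        best, worst = scored[:k], scored[-k:]
        feedback = (
            "These prompts scored well:\n" + "\n".join(p for _, p in best)
            + "\nThese scored poorly:\n" + "\n".join(p for _, p in worst)
            + "\nPropose improved, human-readable variants."
        )
        # Elitism + mutation: keep the top performers, refill the pool with new candidates.
        prompts = [p for _, p in best] + llm(feedback)
    return max(prompts, key=score_fn)
```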
CLIP Fine-Tuning Techniques
Three strategies are explored and compared:
- Prompt Tuning: CoOp, CoCoOp
- OOD Fine-Tuning: CLIPood strategy
- Adapter Layers: Task-specific feature learning (see the sketch after this list)
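As an illustration of the adapter option, the sketch below follows the common CLIP-Adapter recipe of a small residual MLP on top of frozen image features. The feature dimension, reduction factor, and residual ratio are assumptions, not the configuration we trained.

```python
import torch
import torch.nn as nn

class CLIPAdapter(nn.Module):
    """Lightweight residual adapter over frozen BiomedCLIP image features
    (illustrative sketch; hyperparameters are assumptions)."""

    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the frozen backbone's features so the
        # pretrained representation is preserved (residual connection).
        adapted = self.fc(image_features)
        return self.ratio * adapted + (1 - self.ratio) * image_features
```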
Validation Strategy
Model performance is validated both in-domain and out-of-domain using accuracy, F1-score, AUC, and OOD metrics.
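A minimal sketch of how these metrics could be computed with scikit-learn is shown below; macro averaging and one-vs-rest AUC are assumptions for the multi-class case, not necessarily our exact evaluation settings.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score) -> dict:
    """Compute accuracy, macro F1, and macro one-vs-rest AUC for a split
    (in-domain or out-of-domain). `y_score` holds per-class probabilities."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"),
    }
```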
Results and Analysis
Conclusion
The integration of iterative, interpretable prompt generation using LLMs significantly improves the domain generalization capabilities of vision-language models in medical imaging. The approach offers a path forward for deploying robust and explainable AI tools in clinical settings.