Out-of-Domain Generalization in Medical Imaging via Vision-Language Models

Team

Supervisors

Table of Contents

  1. Abstract
  2. Introduction and Background
  3. Problem Statement
  4. Related Works
  5. Methodology
  6. BiomedXPro Architecture
  7. Experimental Setup and Implementation
  8. Results and Analysis
  9. Additional Experiments
  10. Limitations
  11. Future Directions
  12. Impact and Contributions
  13. Conclusion
  14. Publications
  15. Links

Abstract

This research addresses critical domain generalization challenges in medical imaging by introducing BiomedXPro, a novel framework that leverages Vision-Language Models (VLMs) with interpretable prompt optimization. Using BiomedCLIP as the baseline model, we propose an evolutionary algorithm-based automatic prompting method that significantly improves both interpretability and out-of-domain generalization through iterative feedback mechanisms with Large Language Models (LLMs).

Our approach specifically targets disease classification tasks from histopathological images, addressing the fundamental challenge of maintaining model performance when deploying AI systems across different medical institutions with varying equipment, protocols, and patient demographics. The framework generates human-readable diagnostic prompts that capture clinically relevant visual discriminative features, ensuring both robust performance and explainable decision-making processes essential for clinical adoption.

Key achievements include 93.06% accuracy on the CAMELYON17 dataset with full interpretability, superior out-of-domain generalization across multiple hospital centers, and quantifiable contributions of each diagnostic observation to the final prediction.


Introduction and Background

Medical Image Analysis Challenges

Medical image analysis plays a crucial role in modern healthcare, encompassing disease diagnosis, treatment guidance, image segmentation, and various other clinical applications. While significant advances have been made in medical AI systems, several critical challenges persist when deploying these systems in real-world clinical scenarios:

Domain Shift Problem

Domain shift represents one of the most significant barriers to successful deployment of AI systems in medical imaging. This phenomenon occurs when the training distribution (source domain) differs from the unseen distribution (target domain) where the model is deployed. In medical imaging, domain shift typically arises from variations in scanning equipment, staining and acquisition protocols, and patient demographics across institutions.

The consequence of domain shift is significant performance degradation when models trained on data from one institution are deployed in new clinical settings, potentially compromising diagnostic accuracy and patient safety.

Explainability Requirements

Modern healthcare demands AI systems that not only perform accurately but also provide transparent, interpretable decision-making. Medical professionals require diagnostic reasoning that they can inspect, trust, and validate before acting on a model's output.

Vision Language Models in Medical Imaging

Vision Language Models (VLMs), particularly Contrastive Language-Image Pretraining (CLIP) and its biomedical variant BiomedCLIP, have emerged as promising solutions for addressing both domain generalization and explainability challenges:

Key Advantages of VLMs:

  1. Zero-Shot Classification Capabilities: Models can classify images without task-specific training by leveraging natural language descriptions
  2. Inherent Interpretability: Classifications are based on natural language prompts that humans can understand and validate
  3. Robustness to Distribution Shifts: Pre-training on diverse datasets provides inherent resilience to domain variations
  4. Multimodal Understanding: Joint representation learning enables sophisticated reasoning about visual-textual relationships
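The zero-shot mechanism can be illustrated with a minimal NumPy sketch: each class is described by a text prompt, both the image and the prompts are embedded, and the image is assigned to the most similar prompt. The embeddings and temperature below are toy stand-ins for real BiomedCLIP encoder outputs, not the actual model.

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """CLIP-style zero-shot classification: score the image against one
    text prompt per class and pick the most similar prompt."""
    # Normalize so the dot product equals cosine similarity.
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature          # scaled cosine similarities
    probs = np.exp(logits - logits.max())     # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

# Toy 3-d embeddings standing in for BiomedCLIP encoder outputs.
image = np.array([0.9, 0.1, 0.0])
prompts = np.array([
    [1.0, 0.0, 0.0],   # e.g. "tumor tissue present"
    [0.0, 1.0, 0.0],   # e.g. "no tumor tissue"
])
label, probs = zero_shot_classify(image, prompts)
```

Because classification reduces to similarity against natural-language prompts, swapping in better prompts changes the classifier without any retraining, which is the property BiomedXPro exploits.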

BiomedCLIP Architecture

BiomedCLIP represents the state of the art in biomedical vision-language models, specifically designed for medical imaging applications. The architecture pairs a Vision Transformer (ViT) image encoder with a PubMedBERT text encoder, trained contrastively on large-scale biomedical image-text pairs.


Problem Statement

Despite the promising capabilities of biomedical CLIP models, current approaches face a critical limitation that hinders their clinical adoption:

Core Challenge: Biomedical CLIP models demonstrate inherent robustness to distribution shifts, and their performance can be significantly enhanced through context optimization techniques. However, existing context optimization methods rely on uninterpretable “soft” prompts - learned vector representations that lack human readability.

Clinical Implications: This lack of interpretability presents a fundamental barrier for clinical adoption, where both out-of-domain generalization AND explainability are paramount requirements for reliable AI-driven diagnostics. Medical professionals cannot trust or validate diagnostic decisions based on abstract vector representations that provide no clinical reasoning.

Specific Limitations of Current Approaches:

  1. Soft Vector Learning: Context optimization methods like CoOp generate numerical vectors (e.g., [1.3, 2.3, 4.2, …]) that cannot be interpreted by medical professionals
  2. Single Static Outputs: Many existing methods rely on single LLM outputs without iterative refinement
  3. Limited Clinical Validation: Generated prompts often lack verification against established medical knowledge
  4. Insufficient Diversity: Single optimal prompt approaches fail to capture the complexity of medical diagnostic reasoning

Vision-Language Models in Biomedical Applications

The development of vision-language models has revolutionized biomedical image analysis, with several key contributions shaping the field:

Foundation Models

CLIP (Contrastive Language-Image Pretraining): Introduced the concept of learning joint representations of images and text through contrastive learning. While effective for natural images, CLIP’s performance on specialized biomedical images remained limited due to domain-specific terminology and visual characteristics.

BiomedCLIP: Specifically designed for biomedical applications, this model was trained on biomedical image-text pairs, significantly improving performance on medical imaging tasks while maintaining the interpretability advantages of the original CLIP architecture.

Prompt Learning Approaches

BiomedCoOp: Extended the Context Optimization (CoOp) approach to biomedical domains, learning continuous prompt vectors that optimize classification performance. However, these learned vectors lack interpretability, making clinical validation challenging.

XCoOp: Introduced cross-modal prompt learning, attempting to bridge vision and language modalities more effectively. Despite improved performance, the fundamental interpretability limitation persisted.

Limitations of Existing Approaches

  1. Interpretability Gap: Most existing methods focus solely on performance optimization, neglecting the critical need for explainable diagnostic reasoning
  2. Single Prompt Limitation: Many approaches optimize for a single “best” prompt, failing to capture the diversity of diagnostic observations
  3. Static Generation: Limited use of iterative refinement processes that could improve prompt quality over time
  4. Clinical Validation Deficit: Insufficient integration of medical domain expertise in prompt generation and validation

Gap in Current Research

Our comprehensive literature review revealed a significant gap: no existing method successfully combines high performance with interpretable prompt generation for biomedical vision-language models. This gap represents a critical barrier to clinical adoption and motivated our development of BiomedXPro.


Methodology

Our methodology introduces BiomedXPro, a novel framework that addresses the interpretability-performance trade-off through evolutionary prompt optimization. The approach consists of several integrated components:

1. Preprocessing Pipeline

Data Quality Assurance

Image Standardization

Dataset Preparation

2. Evolutionary Prompt Optimization Framework

Theoretical Foundation

Our approach leverages Large Language Models as implicit optimizers, drawing inspiration from evolutionary algorithms and gradient-free optimization techniques. The key insight is that a single LLM can serve multiple roles: generating the initial prompt population, proposing improved candidates conditioned on fitness feedback, and grouping semantically redundant prompts to maintain diversity.

LLM Integration Strategy

We utilize Gemma3 27B as our primary LLM, selected for its strong performance in medical domain tasks and cost-effectiveness for iterative optimization processes.

3. Multi-Stage Optimization Process

Stage 1: Initial Prompt Population Generation

Meta-Prompt Design (Q₀):

Give 50 textual description pairs of visual discriminative features to identify whether the central region of a histopathological image patch contains tumor tissue or not. The patch is extracted from an H&E‑stained whole‑slide image of a lymph node section.

This initial meta-prompt is crafted to specify the clinical task (tumor versus non-tumor tissue), the imaging context (H&E-stained lymph node whole-slide patches), and the required output format of paired negative/positive descriptions.

Stage 2: Fitness Evaluation and Selection

Fitness Score Calculation: The fitness function evaluates each prompt pair based on classification performance:

f(p) = Performance_Metric(BiomedCLIP_with_prompt_p, training_data)

We experimented with multiple fitness metrics:

Roulette Wheel Selection: Parent prompt pairs are sampled with probability proportional to their fitness, so high-scoring pairs are favored while lower-scoring pairs retain a chance of selection, preserving diversity in the population.
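Fitness-proportionate selection is a few lines of standard-library Python; the population size and scores below are illustrative, as the report does not fix these hyperparameters here.

```python
import random

def roulette_select(population, scores, k, rng=random):
    """Roulette wheel (fitness-proportionate) selection: each prompt
    pair is drawn with probability proportional to its fitness score,
    so strong prompts are favored while weaker ones still occasionally
    survive, which preserves diversity in the next generation."""
    total = sum(scores)
    weights = [s / total for s in scores]
    return rng.choices(population, weights=weights, k=k)

# Three candidate prompt pairs with fitness scores; draw 5 parents.
parents = roulette_select(["p1", "p2", "p3"], [0.90, 0.85, 0.40], k=5)
```

Sampling with replacement means a very fit pair can be chosen as a parent more than once, concentrating the search around strong regions without discarding weaker material outright.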

Stage 3: Iterative Prompt Generation

Optimizer Meta-Prompt (Qᵢ):

The task is to generate textual description pairs of visual discriminative features to identify whether the central region of a histopathological image patch contains tumor tissue or not.

Here are the best performing pairs in descending order:
1. (..., ...) Score: 90
2. (..., ...) Score: 84
...

Write 10 new prompt pairs that are different from the old ones and have a score as high as possible.
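The optimizer meta-prompt above can be assembled programmatically from the current scored population. The sketch below assumes the population is held as (negative, positive, score) triples and mirrors the template's wording; the production formatting may differ.

```python
def build_optimizer_prompt(scored_pairs, n_new=10):
    """Assemble the optimizer meta-prompt from the current best
    (negative, positive, score) triples, listed in descending fitness
    order as shown in the template."""
    header = (
        "The task is to generate textual description pairs of visual "
        "discriminative features to identify whether the central region "
        "of a histopathological image patch contains tumor tissue or not.\n\n"
        "Here are the best performing pairs in descending order:\n"
    )
    ranked = sorted(scored_pairs, key=lambda p: p[2], reverse=True)
    body = "".join(
        f"{i}. ({neg}, {pos}) Score: {score:.0f}\n"
        for i, (neg, pos, score) in enumerate(ranked, start=1)
    )
    footer = (
        f"\nWrite {n_new} new prompt pairs that are different from the "
        "old ones and have a score as high as possible."
    )
    return header + body + footer

q_i = build_optimizer_prompt([
    ("no atypia", "marked atypia", 84),
    ("no fibrosis", "stromal fibrosis", 90),
])
```

Feeding the LLM its own best-scoring outputs, ranked, is what turns it into a gradient-free optimizer: each query conditions generation on the fitness signal from the previous round.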

This iterative process lets the LLM act as a gradient-free optimizer: each generation is conditioned on the best-scoring pairs found so far and asked to propose refinements.

Stage 4: Diversity Maintenance Through Crowding

Crowding Meta-Prompt (Qc): The crowding mechanism prevents convergence to semantically identical prompts with different linguistic expressions:

Group the prompt pairs that have exactly the same medical observation but differ only in language variations. Provide the output as grouped indices.

This process collapses paraphrased duplicates into a single representative, keeping the population semantically diverse rather than merely linguistically varied.
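Once the LLM returns grouped indices, the crowding step itself reduces to keeping the fittest member of each group. A sketch under that assumption:

```python
def apply_crowding(prompts, scores, groups):
    """Crowding step: `groups` lists indices of prompts that the LLM
    judged to express the same medical observation in different words;
    only the fittest member of each group is kept."""
    keep = set(range(len(prompts)))
    for group in groups:
        best = max(group, key=lambda i: scores[i])
        keep -= set(group) - {best}
    kept = sorted(keep)
    return [prompts[i] for i in kept], [scores[i] for i in kept]

prompts = [
    "no fibrosis / fibrosis",
    "absence of fibrotic stroma / fibrotic stroma",   # paraphrase of index 0
    "preserved architecture / disrupted architecture",
]
scores = [0.88, 0.90, 0.85]
kept_prompts, kept_scores = apply_crowding(prompts, scores, groups=[[0, 1]])
```

Prompts not mentioned in any group pass through unchanged, so only genuine paraphrases are pruned.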

4. Final Prompt Selection and Ensemble

Elbow Analysis for Optimal Prompt Count

Rather than using a fixed number of final prompts, we employ elbow analysis on the fitness score distribution to automatically determine the optimal number of prompts that maximize diversity while maintaining high performance.
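One common way to implement the elbow analysis is the maximum-distance-to-chord heuristic sketched below; the report does not specify the exact criterion, so this particular geometric rule is an illustrative assumption.

```python
import numpy as np

def elbow_cutoff(scores):
    """Sort fitness scores in descending order and keep prompts up to
    the elbow: the point farthest from the straight line joining the
    highest and lowest scores (a common geometric elbow heuristic)."""
    s = np.sort(np.asarray(scores, dtype=float))[::-1]
    n = len(s)
    x = np.arange(n, dtype=float)
    x1, y1, x2, y2 = 0.0, s[0], float(n - 1), s[-1]
    # Perpendicular distance from each (x_i, s_i) to the end-to-end line.
    num = np.abs((y2 - y1) * x - (x2 - x1) * s + x2 * y1 - y2 * x1)
    den = np.hypot(y2 - y1, x2 - x1)
    return int(np.argmax(num / den)) + 1

# Four strong prompts followed by a sharp drop-off: keep the first four.
n_keep = elbow_cutoff([0.90, 0.89, 0.88, 0.87, 0.60, 0.55, 0.50])
```

Tying the cutoff to the shape of the fitness distribution lets the ensemble size adapt per task instead of being a fixed hyperparameter.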

Weighted Ensemble Voting

Final Classification Process:

  1. Individual Prompt Evaluation: Each selected prompt pair provides a binary vote (tumor/normal)
  2. Weight Calculation: Fitness scores are normalized to create prompt-specific weights
  3. Weighted Aggregation: Final decision combines all votes using calculated weights
  4. Threshold Application: Scores > 0.5 indicate tumor presence

Mathematical Formulation:

Final_Score = Σᵢ (Normalized_Fitness_Score_i × Vote_i)
Prediction = Final_Score > 0.5 ? "Tumor" : "Normal"
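The weighted voting scheme follows directly from the formulation above; the votes and fitness scores in the example are illustrative inputs.

```python
import numpy as np

def ensemble_predict(votes, fitness_scores, threshold=0.5):
    """Weighted ensemble vote: Final_Score = sum_i(normalized_fitness_i
    * vote_i), with votes in {0, 1} (0 = normal, 1 = tumor)."""
    w = np.asarray(fitness_scores, dtype=float)
    w = w / w.sum()                            # normalize fitness to weights
    final_score = float(w @ np.asarray(votes, dtype=float))
    return ("Tumor" if final_score > threshold else "Normal"), final_score

# Four prompt pairs: three vote tumor, one votes normal.
label, score = ensemble_predict(
    votes=[1, 1, 0, 1],
    fitness_scores=[0.9013, 0.8997, 0.8994, 0.8940],
)
```

Because the weights are the normalized fitness scores, the contribution of each diagnostic observation to the final prediction is directly quantifiable, which is the interpretability property the framework advertises.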

BiomedXPro Architecture

System Architecture Overview

BiomedXPro implements an evolutionary algorithm designed specifically for medical prompt optimization. The architecture consists of several interconnected components that together deliver strong classification performance while preserving interpretability.

Core Components

1. LLM-Driven Prompt Generation Engine

2. Fitness Evaluation System

3. Evolutionary Optimization Framework

4. Interpretability Layer

Optimization Parameters

Performance Characteristics


Experimental Setup and Implementation

Dataset Configuration

Primary Dataset: CAMELYON17 WILDS

Dataset Characteristics:

Domain Distribution:

Secondary Datasets for Generalization Testing

NIH ChestX-ray14:

CheXpert:

RETOUCH:

Implementation Details

Hardware Configuration

Software Framework

Hyperparameter Configuration

Evolutionary Algorithm Parameters:

LLM Generation Parameters:

Evaluation Methodology

Performance Metrics

Cross-Validation Strategy

Baseline Comparisons


Results and Analysis

CAMELYON17 Primary Results

Evolutionary Optimization Progress

The optimization process demonstrates clear convergence characteristics over 1,000 iterations:

Convergence Analysis:

Final Optimized Prompts (Top 8 selected via elbow analysis):

  1. Primary Diagnostic Prompt (Score: 0.9013):
    • Negative: “No atypical cells infiltrating surrounding tissues”
    • Positive: “Atypical cells infiltrating surrounding tissues and disrupting normal structures”
  2. Cellular Atypia Assessment (Score: 0.8997):
    • Negative: “No significant atypia in the surrounding lymphocytes”
    • Positive: “Significant atypia observed in lymphocytes adjacent to tumor nests”
  3. Stromal Changes Detection (Score: 0.8994):
    • Negative: “No evidence of fibrosis”
    • Positive: “Prominent stromal fibrosis surrounding tumor nests”
  4. Architectural Preservation (Score: 0.8940):
    • Negative: “Normal follicular architecture is preserved”
    • Positive: “Disrupted follicular architecture with loss of polarity”

These prompts demonstrate sophisticated understanding of histopathological features that pathologists use for tumor diagnosis, including cellular infiltration patterns, nuclear atypia, stromal reactions, and architectural disruption.

Comparative Performance Analysis

Method                  Main Test Set   Hospital 0 (ID)   Hospital 1 (ID)   Hospital 2 (ID)
Zero-Shot BiomedCLIP    88.22%          81.97%            80.35%            79.86%
BiomedCLIP + CoOp       93.90%          95.14%            92.83%            95.95%
BiomedXPro (Ours)       93.06%          92.23%            85.66%            93.69%

Key Performance Insights

Competitive Performance: BiomedXPro achieves 93.06% accuracy on the main test set, demonstrating only a 0.84% performance gap compared to CoOp while providing full interpretability.

Domain Generalization: Strong performance across all hospital domains indicates robust out-of-domain generalization, with particularly impressive results on Hospital 2 (93.69%).

Interpretability Advantage: Unlike CoOp’s uninterpretable context vectors, BiomedXPro provides clinically meaningful prompts that medical professionals can understand and validate.

Detailed Method Comparison

BiomedXPro vs. CoOp Analysis

Aspect              CoOp                                         BiomedXPro
Interpretability    Context vectors (e.g., [1.3, 2.3, 4.2, …])   Human-readable clinical descriptions
Training Time       Higher (requires gradient computation)       Lower (gradient-free optimization)
Peak Performance    93.90%                                       93.06%
Clinical Adoption   Limited due to black-box nature              Suitable for clinical validation
Flexibility         Fixed optimization for single task           Adaptable across medical domains
Expert Validation   Impossible to validate learned vectors       Direct expert review of prompts possible

Statistical Significance Analysis

Performance Distribution: Multiple optimization runs (n=5) show consistent results:

Domain Robustness: Cross-domain evaluation demonstrates:


Additional Experiments

Multi-Dataset Evaluation

Cross-Modality Generalization

NIH ChestX-ray14 Results:

CheXpert Validation:

RETOUCH OCT Analysis:

LLM Comparison Study

Performance Variation Across Different LLMs

ChatGPT 4.1 vs. Gemma3 27B:

Key Findings:

Advanced Ensemble Methods

Stacking Approach Evaluation

Meta-Model Performance on CAMELYON17:

Meta-Model            Test Center Accuracy
Logistic Regression   92.60%
Decision Tree         91.42%
Random Forest         92.41%
Gradient Boosting     92.57%
SVM                   92.67%
Naive Bayes           92.29%
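The stacking idea (each prompt pair's per-image vote becomes a feature, and a meta-model learns how to combine them) can be sketched in plain NumPy. A logistic-regression meta-model is hand-rolled here to keep the example dependency-light; it stands in for the library implementations behind the table above, and the toy data is illustrative.

```python
import numpy as np

def train_stacking_meta_model(votes, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression meta-model on prompt votes via batch
    gradient descent on the log-loss."""
    X = np.asarray(votes, dtype=float)        # (n_images, n_prompts)
    y = np.asarray(labels, dtype=float)       # (n_images,)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        grad = p - y                             # dLoss/dlogits for log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def meta_predict(votes, w, b):
    p = 1.0 / (1.0 + np.exp(-(np.asarray(votes, dtype=float) @ w + b)))
    return (p > 0.5).astype(int)

# Toy setup: prompt 0 is perfectly informative, prompt 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w, b = train_stacking_meta_model(X, y)
preds = meta_predict(X, w, b)
```

Unlike the fixed fitness-weighted vote, the meta-model can learn to down-weight prompts whose votes are uninformative on held-out data, which is where the stacking variants gain their edge.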

Stacking Benefits:

Implementation Considerations:

Ablation Studies

Component Contribution Analysis

Evolutionary Components:

  1. Roulette Wheel Selection: 2.3% performance improvement over random selection
  2. Crowding Mechanism: 1.8% improvement through diversity maintenance
  3. Iterative Refinement: 4.5% improvement over single-generation prompts
  4. Ensemble Voting: 1.2% improvement over single best prompt

Optimization Parameters:


Limitations

Clinical Validation Requirements

Expert Feedback Integration

Pathologist Review (Dr. Sumanarasekara):

Identified Issues:

  1. Generality vs. Specificity Trade-off: Balancing broad applicability with diagnostic precision
  2. Medical Terminology: Ensuring proper use of standardized pathological terminology
  3. Diagnostic Hierarchy: Incorporating proper diagnostic decision trees used by pathologists

Validation Framework Needs

Expert Integration Process:

Computational and Economic Constraints

LLM Generation Costs

Current Cost Structure:

Cost Optimization Strategies:

Performance Trade-offs

BiomedXPro vs. CoOp Performance Gap:

Scalability Considerations

Multi-Disease Extension

Current Limitations:

Required Developments:

Dataset Dependency

Training Data Requirements:


Conclusion

The integration of iterative, interpretable prompt generation using LLMs significantly improves the domain generalization capabilities of vision-language models in medical imaging. The approach offers a path forward for deploying robust and explainable AI tools in clinical settings.


Publications