KneeOA-VLM | Multimodal Phenotyping Research

01 — Background

Why standard grading isn't enough

Doctors currently grade knee osteoarthritis on a scale of 0 to 4 — the Kellgren-Lawrence (KL) system — based on what bone damage looks like on an X-ray. But this creates a deep problem: two patients with the same grade can feel completely different.

One patient with KL Grade 2 might have severe pain and rapid deterioration, while another with the same grade feels fine and stays stable for years. Most current AI models are trained to reproduce this grading, so they inherit the same blind spot.

⚠ The Discordance Paradox

A patient can have a "Mild" X-ray but severe pain — or a "Severe" X-ray and no pain at all. This isn't noise; it's a signal that OA has multiple biological subtypes that one number cannot capture.

Grade · What it means · X-ray signs
0 · Healthy · None
1 · Doubtful · Possible bone spurs
2 · Mild · Definite bone spurs
3 · Moderate · Joint space narrowing
4 · Severe · Bone-on-bone contact

Our approach doesn't replace this system — it looks deeper inside each grade to find biologically distinct patient subgroups (phenotypes) using a Vision-Language Model trained on medical images.

02 — How It Works

A 7-stage AI pipeline

From raw hospital X-rays to clinically validated patient phenotypes with 10-year longitudinal confirmation.

🗂️

DICOM Preprocessing

4,502 bilateral X-rays downloaded from OAI. YOLO crops left & right knees separately.

🔬

Zero-Shot Embedding

BiomedCLIP extracts 512-dimensional vectors from each knee. Tested as baseline — produced 55 fragmented clusters.

🎯

Fine-Tuning

KL-grade regression trains the upper 6 transformer blocks. MAE drops to 0.865 grade units.

🔗

Multimodal Fusion

Visual embeddings fused with pain scores, JSN, osteophytes, age & BMI. UMAP reduces to 2D.

🧬

HDBSCAN Clustering

Clusters independently within each KL grade. Finds phenotypes that aren't just severity differences.

👁️

XAI Validation

CLS-token attention maps confirm the model looks at the joint space — not image borders.

📈

Longitudinal Check

OAI follow-ups (V00–V10) confirm clusters predict real disease progression over 10 years.

Fig. 1 — Full seven-stage architecture · BiomedCLIP fine-tuning · HDBSCAN within-grade clustering
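The Multimodal Fusion stage can be illustrated with a minimal sketch. It assumes that fusion means z-scoring each clinical variable across knees and concatenating the result onto the image embedding; the fusion operator is not stated here, so treat this as one plausible reading. The names `zscore_columns` and `fuse` are hypothetical, the data is a toy example, and the UMAP step is omitted.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardise each column of a list of equal-length rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant column has zero spread; divide by 1 so it maps to all zeros.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def fuse(embeddings, clinical):
    """Concatenate each image embedding with its standardised clinical row."""
    if len(embeddings) != len(clinical):
        raise ValueError("one clinical row per embedding is required")
    scaled = zscore_columns(clinical)
    return [list(e) + c for e, c in zip(embeddings, scaled)]

# Toy data: 3 knees with 4-D stand-in embeddings (the real ones are 512-D).
# Clinical columns (assumed order): pain, JSN, osteophyte grade, age, BMI.
emb = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.2], [0.5, 0.4, 0.3, 0.2]]
clin = [[4.7, 0, 0, 61, 27.5], [0.2, 0, 0, 58, 24.1], [2.1, 1, 2, 70, 31.0]]
fused = fuse(emb, clin)
print(len(fused), len(fused[0]))
```

Standardising before concatenation keeps variables measured on large scales, such as age, from dominating the neighbourhood distances that UMAP relies on.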

03 — What We Found

Results that matter clinically

Fine-Tuning Performance

86.4%

Within-1 Grade Accuracy

The model predicts KL severity within one grade step 86.4% of the time. MAE = 0.865 grade units. Exact match = 31.0% — comparable to inter-rater radiologist agreement.
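The four numbers above can be computed from continuous regression outputs as in the following sketch. It assumes that the accuracy-style metrics use the prediction rounded to the nearest grade and clipped to 0..4; the function name `kl_regression_metrics` and the toy inputs are illustrative only.

```python
from math import sqrt

def kl_regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, exact-match and within-1 accuracy for KL regression."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    # Assumption: accuracy-style metrics use the prediction rounded to the
    # nearest grade (half up) and clipped to the valid 0..4 range.
    rounded = [min(4, max(0, int(p + 0.5))) for p in y_pred]
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(sum(e * e for e in errors) / n),
        "exact": sum(r == t for r, t in zip(rounded, y_true)) / n,
        "within_1": sum(abs(r - t) <= 1 for r, t in zip(rounded, y_true)) / n,
    }

# Toy example: five knees, one per KL grade.
metrics = kl_regression_metrics([0, 1, 2, 3, 4], [0.4, 2.2, 2.6, 1.4, 3.9])
print(metrics)
```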

Clustering Quality — Silhouette Scores

KL 1: 0.789
KL 2: 0.783
KL 3: 0.781
KL 4: 0.841

All four scores exceed 0.78, well above the 0.5 level we treat as "strong" cluster separation. Scores above 0.3 are adequate for biological discovery.
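For reference, the silhouette score averages, over all points, the quantity (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The sketch below is a plain standard-library version for small inputs; it assumes Euclidean distance and that any HDBSCAN noise points have already been dropped by the caller, neither of which is confirmed above.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes 0
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

# Two tight, well-separated toy clusters score close to 1.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])
bad = silhouette_score(pts, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```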

Key Clinical Findings

F.01 Lateral JSN progresses about 2× faster than Medial JSN over 10 years (p < 0.0001, Bonferroni-corrected). This is the headline result: the same KL grade at baseline can lead to a very different long-term outcome.
F.02 Pain-susceptibility phenotype discovered in KL Grade 0: a cluster of 1,219 knees with elevated pain (WOMAC = 4.7) and zero structural damage. These cases are currently invisible to radiographic grading.
F.03 Lateral JSN cluster stability = 97.2% and Medial JSN stability = 95.0% over 8 years — confirming these are genuine biological phenotypes, not statistical noise.
F.04 50% of Lateral JSN patients progressed ≥1 KL grade by 8 years, vs 25.8% Medial, 22.1% No JSN, 19.6% Pain-Dominant, 16.7% Healthy — enabling early risk stratification.
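The progression percentages in F.04 are per-phenotype fractions of cases whose KL grade rose by at least one step between baseline and follow-up. A minimal sketch of that calculation follows; the record layout, the function name `progression_rates`, and the toy records are assumptions made for illustration and are not study data.

```python
from collections import defaultdict

def progression_rates(records, min_delta=1):
    """Fraction of cases per phenotype whose KL grade rose by >= min_delta.

    records: iterable of (phenotype, baseline_kl, followup_kl) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # phenotype -> [progressed, total]
    for phenotype, kl_baseline, kl_followup in records:
        counts[phenotype][1] += 1
        if kl_followup - kl_baseline >= min_delta:
            counts[phenotype][0] += 1
    return {name: prog / total for name, (prog, total) in counts.items()}

# Toy records only.
records = [
    ("Lateral JSN", 2, 3), ("Lateral JSN", 2, 2),
    ("Lateral JSN", 3, 4), ("Lateral JSN", 1, 1),
    ("Healthy", 0, 0), ("Healthy", 0, 1),
    ("Healthy", 0, 0), ("Healthy", 0, 0),
]
rates = progression_rates(records)
print(rates)
```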

XAI — Where the model looks

CLS-Token Attention Validation

After fine-tuning, CLS-token attention maps show the model consistently focuses on the joint space line, femoral condyles, and tibial plateau — the exact anatomical regions relevant to JSN and osteophyte grading. Zero-shot BiomedCLIP instead focused on image borders and ruler strips.
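A CLS-token attention map of this kind can be derived from a single attention matrix, as sketched below. With 16-pixel patches on a 224×224 input the encoder sees 14×14 = 196 patch tokens plus one CLS token. The sketch assumes the map is taken from one transformer block, averaged over heads, with patches in row-major order; which block and which aggregation were actually used is not stated above, and `cls_attention_grid` is a hypothetical helper name.

```python
def cls_attention_grid(attn, grid=14):
    """Turn CLS-to-patch attention into a grid x grid map that sums to 1.

    attn: nested list shaped [heads][tokens][tokens], with token 0 = CLS.
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])
    if n_tokens != grid * grid + 1:
        raise ValueError("expected grid*grid patch tokens plus one CLS token")
    # Row 0 is the CLS query; skip column 0 (CLS attending to itself).
    cls_to_patch = [
        sum(attn[h][0][j] for h in range(n_heads)) / n_heads
        for j in range(1, n_tokens)
    ]
    total = sum(cls_to_patch) or 1.0
    flat = [v / total for v in cls_to_patch]
    # Assumption: patch tokens are stored row by row (row-major order).
    return [flat[r * grid:(r + 1) * grid] for r in range(grid)]

# Tiny 2x2 example with one head; only the CLS row matters here.
toy = [[[0.2, 0.1, 0.3, 0.2, 0.2]] + [[0.2] * 5 for _ in range(4)]]
heat = cls_attention_grid(toy, grid=2)
print(heat)
```

The resulting grid is usually upsampled to the image size and overlaid on the X-ray to check whether the brightest cells fall on the joint space.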

KL Grade 0 — CLS attention · Cluster 0 (pain=4.7) vs Cluster 1 (pain=0.2)

10-Year Longitudinal Progression

KL Grade Change from Baseline

The Lateral JSN phenotype accumulates a mean ΔKL of 0.75 by V120 — nearly double the Medial JSN group (0.41) and nearly triple the Healthy group (0.28).

Fig. — KL grade trajectories across 5 phenotypes · V00 to V120 (10 years)

04 — Discovered Phenotypes

Five distinct patient subtypes

These are the clinically meaningful groups our framework discovered — each requiring a different treatment strategy.

Lateral JSN

KL Grades 1–4 · n = 671

Dominant narrowing of the lateral compartment. Fastest-progressing phenotype — 50% advance ≥1 KL grade within 8 years. Likely benefits from lateral unloading bracing.

jsn_lat ≈ 1.0 · 50.0% ≥1 grade/8yr

Medial JSN

KL Grades 1–4 · n = 3,082

Dominant narrowing of the medial compartment. Most common phenotype. Moderate progression rate. At KL 4, shows slightly higher pain than Lateral group (2.07 vs 1.69).

jsn_med ≈ 1.0 · 25.8% ≥1 grade/8yr

No JSN

KL Grades 1–2 · n = 1,780

Osteophyte-dominant involvement with minimal joint space narrowing. Structurally present but without compartment-specific damage pattern.

jsn ≈ 0 · 22.1% ≥1 grade/8yr

Pain-Dominant

KL Grade 0 · n = 1,219

High pain (WOMAC = 4.7) with zero structural damage on X-ray: the classic discordance paradox. May represent neurobiological pain susceptibility that standard KL grading cannot see.

pain = 4.7 · jsn = 0

Healthy

KL Grade 0 · n = 2,193

No structural damage and low pain (WOMAC = 0.2). Slowest progression. Serves as the true baseline comparator across all longitudinal analyses.

pain = 0.2 · 16.7% ≥1 grade/8yr

🔬

Clinical Implication

Instead of treating all KL 2 patients identically, physicians can now stratify by compartment phenotype and prescribe targeted interventions — valgus/varus bracing, anti-inflammatories, or pain management protocols.

05 — Technical Details

Model architecture & training

BiomedCLIP as the backbone

Pre-trained on 15 million PubMed figure-caption pairs using InfoNCE contrastive loss. Its ViT vision encoder maps images to 512-dimensional vectors in a shared image-text embedding space.

microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
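The InfoNCE objective used for this kind of pre-training can be sketched in a few lines. The version below is the symmetric CLIP-style form (image-to-text and text-to-image cross-entropy averaged) on L2-normalised embeddings; the symmetric form and the temperature of 0.07 are assumptions chosen for illustration, not details taken from the backbone's training recipe.

```python
from math import exp, log, sqrt

def _normalise(v):
    norm = sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        lse = m + log(sum(exp(x - m) for x in row))
        loss += lse - row[i]
    return loss / len(logits)

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    imgs = [_normalise(v) for v in image_embs]
    txts = [_normalise(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

# Matched pairs give a loss near 0; swapping the captions gives a large loss.
matched = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(matched < swapped)
```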

Selective layer freezing

Transformer blocks 0–5, patch embed, positional embed, and CLS token are frozen (152.9M params). Only blocks 6–11 and the regression head are trained (42.9M params, 21.9% of model).

Prevents catastrophic forgetting
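The freezing rule can be expressed as a predicate over parameter names, sketched below. The name patterns (`blocks.N.`, `patch_embed`, `pos_embed`, `cls_token`) follow the common timm-style ViT layout and are an assumption about how the vision encoder is named; `head` is a hypothetical name for the regression head. Anything the text does not mention, such as a final norm layer, is left frozen here, which is a guess.

```python
import re

FROZEN_PREFIXES = ("patch_embed", "pos_embed", "cls_token")
_BLOCK = re.compile(r"^blocks\.(\d+)\.")

def is_trainable(param_name, first_trainable_block=6):
    """Return True if a vision-encoder parameter should receive gradients."""
    if param_name.startswith(FROZEN_PREFIXES):
        return False
    match = _BLOCK.match(param_name)
    if match:
        # Blocks 0-5 stay frozen; blocks 6-11 are fine-tuned.
        return int(match.group(1)) >= first_trainable_block
    # Hypothetical name for the regression head; everything else is frozen.
    return param_name.startswith("head")

names = ["cls_token", "pos_embed", "patch_embed.proj.weight",
         "blocks.5.attn.qkv.weight", "blocks.6.mlp.fc1.bias",
         "blocks.11.norm1.weight", "head.weight"]
for name in names:
    print(f"{name:28s} trainable={is_trainable(name)}")

# Consistency check on the reported counts: 42.9M of 195.9M parameters.
print(round(100 * 42.9 / 195.9, 1))
```

With PyTorch, a predicate like this would typically be applied by setting `param.requires_grad = is_trainable(name)` while iterating over the encoder's `named_parameters()`.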

Regression, not classification

MSE loss with sigmoid activation scaled to [0,4]. Regression imposes ordinal structure — KL grades are treated as continuous, not discrete classes. This avoids hard decision boundaries that reproduce existing bias.

AdamW · lr=5e-4 · cosine annealing
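A minimal sketch of the output mapping, loss, and learning-rate schedule described above, using only the standard library. The sigmoid squashes a raw logit into (0, 1) and the result is scaled by 4 to cover the KL range. The cosine schedule assumes per-epoch stepping over 30 epochs with a floor of 0; the floor and stepping granularity are assumptions, and the ranking term listed in the configuration is not shown because its form is not specified.

```python
from math import cos, exp, pi

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def predict_kl(logit, max_grade=4.0):
    """Map a raw regression logit to a continuous KL grade in [0, 4]."""
    return max_grade * sigmoid(logit)

def mse_loss(logits, targets):
    """Mean squared error between scaled predictions and KL targets."""
    preds = [predict_kl(z) for z in logits]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def cosine_lr(epoch, base_lr=5e-4, total_epochs=30, min_lr=0.0):
    """Cosine-annealed learning rate (min_lr = 0 is an assumption)."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * epoch / total_epochs))

print(predict_kl(0.0))             # a zero logit maps to the mid-scale grade
print(mse_loss([0.0, 0.0], [2, 2]))
print(cosine_lr(0), cosine_lr(15), cosine_lr(30))
```

Because the output is bounded, predictions can never fall outside the 0 to 4 range, and an error of two grades is penalised four times as heavily as an error of one.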

Within-grade HDBSCAN clustering

Clustering independently inside each KL grade ensures discovered clusters represent phenotypic variation — not severity differences that are already known. Noise points are rejected rather than forced into clusters.

min_cluster_size tuned per grade
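The within-grade idea is independent of the clustering algorithm: split the knees by KL grade, cluster each subset on its own, and keep the grade alongside each label. The sketch below shows that control flow with a pluggable `cluster_fn`. The `toy_clusterer` is a deliberately trivial stand-in, not HDBSCAN, and both function names are hypothetical; the label -1 marks rejected noise points, following the HDBSCAN convention.

```python
from collections import defaultdict

def cluster_within_grades(features, grades, cluster_fn):
    """Run cluster_fn separately on the rows of each KL grade.

    Returns one (grade, label) pair per input row; label -1 means noise.
    """
    by_grade = defaultdict(list)
    for idx, grade in enumerate(grades):
        by_grade[grade].append(idx)
    result = [None] * len(grades)
    for grade, idxs in sorted(by_grade.items()):
        labels = cluster_fn([features[i] for i in idxs])
        if len(labels) != len(idxs):
            raise ValueError("cluster_fn must return one label per row")
        for i, label in zip(idxs, labels):
            result[i] = (grade, label)
    return result

def toy_clusterer(rows, split=0.5, far=5.0):
    """Stand-in for HDBSCAN: threshold on the first coordinate and
    reject far-out points as noise (-1)."""
    return [-1 if abs(r[0]) > far else int(r[0] > split) for r in rows]

features = [[0.1], [0.9], [0.2], [9.0]]
grades = [2, 2, 3, 3]
assignments = cluster_within_grades(features, grades, toy_clusterer)
print(assignments)
```

In practice `cluster_fn` could wrap a real implementation, for example `hdbscan.HDBSCAN(min_cluster_size=k).fit_predict` with `k` chosen per grade. Labels are only meaningful within a grade, which is why each label is returned together with its grade.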

Model Configuration

Base model: BiomedCLIP ViT-B/16
Total parameters: 195.9M
Trainable params: 42.9M (21.9%)
Embedding dimension: 512-D
Training loss: MSE + Ranking
Best epoch: 21 / 30
Val MAE: 0.865 KL units
Val RMSE: 1.055
Within-1 accuracy: 86.4%
Dataset: OAI V00 (8,945 knees)
UMAP n_neighbors: (not stated)
UMAP min_dist: 0.1
Clustering algorithm: HDBSCAN (per grade)
XAI method: CLS-token attention

06 — Research Team

The people behind this work

I.A.U. Siriwardane

E/20/378

Computer Engineering · University of Peradeniya


K.G.H. Nirmani

E/20/271

Computer Engineering · University of Peradeniya


N.R.P. Gunathilake

E/20/122

Computer Engineering · University of Peradeniya


Supervisors

Ms. Yasodha Vimukthi

Dept. of Computer Engineering · Faculty of Engineering

Dr. Damayanthi Herath

Dept. of Computer Engineering · Faculty of Engineering

Mr. A.M. Mohamed Rikas

Dept. of Physiotherapy · Faculty of Allied Health Sciences