We show that high-confidence, cardiologist-validated ECG labels produce more accurate, better-calibrated, and clinically interpretable deep learning models than larger but noisier datasets — across two architectures.
Most medical AI research focuses on collecting more data. We ask a different question: does the certainty of diagnostic labels matter more than dataset size?
Classifying 12-lead ECGs as Myocardial Infarction (MI) vs. Normal using the PTB-XL dataset with cardiologist-annotated SCP codes.
Same normal cases across all three datasets. Only the MI label certainty varies — isolating the effect of ground truth quality on model performance.
Explainable AI (Grad-CAM & Integrated Gradients) validates that high-certainty labels produce models focusing on physiologically correct ECG regions.
All training sets share the same 4,451 pure normal ECGs. They differ only in which MI cases are included — enabling a controlled comparison of label certainty.
| Dataset | MI Type | MI Cases | Normal Cases | Total Train | Label Certainty |
|---|---|---|---|---|---|
| Dataset A | Certain MI | 1,194 | 4,451 | 5,645 | 100% confidence, human-validated |
| Dataset C | All MI | 2,387 | 4,451 | 6,838 | Mixed (certain + uncertain) |
| Dataset D | Uncertain MI | 1,193 | 4,451 | 5,644 | <100% confidence only |
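The split above can be derived from PTB-XL's `scp_codes` field, which maps each SCP statement to a cardiologist-assigned likelihood (100 = certain diagnosis). A minimal sketch of the partition logic — the MI code subset shown is illustrative, not the full list used:

```python
import ast
import pandas as pd

MI_CODES = {"IMI", "AMI", "ASMI", "ILMI"}  # illustrative subset of PTB-XL MI statements

def split_by_certainty(df: pd.DataFrame):
    """Partition record IDs into certain-MI (likelihood == 100) and
    uncertain-MI (likelihood < 100) based on the scp_codes column."""
    certain, uncertain = [], []
    for ecg_id, codes in df["scp_codes"].items():
        if isinstance(codes, str):            # the CSV stores the dict as a string
            codes = ast.literal_eval(codes)
        mi_likelihoods = [v for k, v in codes.items() if k in MI_CODES]
        if not mi_likelihoods:
            continue                          # no MI statement on this record
        if max(mi_likelihoods) == 100:
            certain.append(ecg_id)            # Dataset A candidate
        else:
            uncertain.append(ecg_id)          # Dataset D candidate
    return certain, uncertain
```

Dataset A is then certain-MI plus the shared normals, Dataset D is uncertain-MI plus the same normals, and Dataset C is their union.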
Train, validation, and test sets have zero patient overlap — preventing data leakage and ensuring realistic performance estimates.
Validation (1,234 records) and test (1,253 records) sets are identical across all variants for fair, controlled comparison.
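One standard way to get patient-disjoint splits is PTB-XL's recommended `strat_fold` column (folds 1–8 train, 9 validation, 10 test), which already keeps every record of a patient in a single fold; whether the authors used exactly this scheme is an assumption. A sketch with a sanity check:

```python
import pandas as pd

def patient_disjoint_split(df: pd.DataFrame):
    """Split records using PTB-XL's strat_fold column (1-8 / 9 / 10),
    then verify that no patient crosses split boundaries."""
    train = df[df["strat_fold"] <= 8]
    val = df[df["strat_fold"] == 9]
    test = df[df["strat_fold"] == 10]
    # Guard against leakage: a patient must appear in at most one split
    assert not set(train["patient_id"]) & set(val["patient_id"])
    assert not set(train["patient_id"]) & set(test["patient_id"])
    assert not set(val["patient_id"]) & set(test["patient_id"])
    return train, val, test
```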
We compare a conventional hybrid architecture with a modern state space model to show our findings generalize across model families.
Type: Convolutional + Recurrent
Parameters: ~367K trainable
Framework: PyTorch
XAI Method: Grad-CAM
Input: (batch, 1000, 12)
Architecture: 3× Conv1D → BiLSTM → FC → Sigmoid
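The CNN-LSTM pipeline can be sketched in PyTorch as follows; the channel widths, kernel sizes, and hidden size here are assumptions for illustration, not the exact configuration behind the ~367K parameter count:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Illustrative 3x Conv1D -> BiLSTM -> FC -> Sigmoid classifier.
    Layer widths are assumed, not the paper's exact ~367K setup."""
    def __init__(self, n_leads: int = 12, hidden: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 1)

    def forward(self, x):                   # x: (batch, 1000, 12)
        x = self.conv(x.permute(0, 2, 1))   # Conv1d wants (batch, channels, time)
        x, _ = self.lstm(x.permute(0, 2, 1))
        return torch.sigmoid(self.fc(x[:, -1]))  # MI probability per record
```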
Type: State Space Model (SSM)
Advantage: Linear complexity O(n)
Framework: PyTorch
XAI Method: Integrated Gradients (Captum)
Input: (batch, 1000, 12)
Architecture: BiMamba-2 blocks → Classifier → Sigmoid
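Integrated Gradients (the attribution method used here, via Captum) attributes the MI score to each input sample and lead by averaging gradients along a straight path from a baseline to the input. The core computation can be sketched in plain PyTorch; this is a Riemann approximation of what Captum computes, with a zero-signal ECG assumed as the baseline:

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Approximate IG: average input gradients at `steps` points on the
    line from baseline to x, scaled elementwise by (x - baseline)."""
    if baseline is None:
        baseline = torch.zeros_like(x)     # assumed baseline: flat zero signal
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        point.requires_grad_(True)
        model(point).sum().backward()      # gradient of the score w.r.t. this point
        total_grad += point.grad
    return (x - baseline) * total_grad / steps
```

For a linear model the attributions satisfy IG's completeness property exactly: they sum to the difference between the model's output at the input and at the baseline.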
Across both architectures, Dataset A (highest label certainty) achieves the best discrimination, calibration, and interpretability.
AUROC • Accuracy • Calibration (ECE) • Brier Score • Clinical Interpretability
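The two calibration metrics above are straightforward to compute: ECE bins predictions by confidence and takes the sample-weighted mean gap between per-bin accuracy and per-bin mean confidence, while the Brier score is the mean squared error of the predicted probabilities. A minimal NumPy sketch:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: partition predictions into confidence bins and take the
    sample-weighted mean of |bin accuracy - bin mean confidence|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predictions = (probs >= 0.5).astype(int)
    confidence = np.where(predictions == 1, probs, 1 - probs)
    # Confidence lies in [0.5, 1]; map it to bin indices 0..n_bins-1
    bin_ids = np.minimum(((confidence - 0.5) / 0.5 * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            acc = (predictions[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - confidence[mask].mean())
    return float(ece)

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))
```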
| Metric | CNN-LSTM (Dataset A) | Description |
|---|---|---|
| AUROC | 99.06% | Near-perfect discrimination |
| Accuracy | 95.87% | Overall correctness |
| Recall | 93.31% | MI detection sensitivity |
| Precision | 88.76% | Positive predictive value (few false alarms) |
| F1-Score | 90.98% | Balanced performance |
| Specificity | 96.60% | Normal case identification |
Dataset D showed significantly degraded ECE (Expected Calibration Error) on Mamba-2 — uncertain labels harm not just accuracy but model trustworthiness.
When cardiologists expressed lower diagnostic confidence, model prediction confidence dropped proportionally — clinically valuable for flagging ambiguous cases.
Dataset A models focus on V1–V4 for anterior MI and II/III/aVF for inferior MI. Dataset D models show diffuse, non-specific activations — losing interpretability.
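Grad-CAM produces the per-timestep importance curves behind this lead-level check: the last convolutional layer's feature maps are weighted by the time-averaged gradient of the MI score, rectified, and normalised. A generic 1D sketch using forward/backward hooks (the hook-based wiring is an assumption about the implementation; the model and input layout below are illustrative):

```python
import torch

def grad_cam_1d(model, conv_layer, x):
    """Per-timestep Grad-CAM: weight `conv_layer` activations by the
    time-averaged gradient of the model score, ReLU, normalise to [0, 1]."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, inp, out: acts.update(a=out))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model(x).sum().backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=2, keepdim=True)        # (batch, C, 1)
    cam = torch.relu((weights * acts["a"]).sum(dim=1))    # (batch, T)
    return (cam / cam.amax(dim=1, keepdim=True).clamp(min=1e-8)).detach()
```

Mapping the resulting curve back onto individual leads (e.g. V1–V4 vs. II/III/aVF) is what allows the anatomical plausibility check described above.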
Computer Engineering undergraduates at the University of Peradeniya, Sri Lanka.