Quantitative Comparison and Reader-Based Evaluation of a Vision-Language Model for Chest Radiograph Interpretation

Purpose To evaluate the classification and segmentation capabilities of a vision-language model (VLM) for chest radiograph interpretation through a two-stage study: (1) quantitative comparison with a unimodal segmentation model (uSEG), and (2) reader-based evaluation of interpretability and perceived clinical utility.
Materials and Methods A newly developed VLM generates similarity maps between chest radiographs and textual descriptions. Thresholding the similarity maps allows the model to perform open-vocabulary segmentation. In contrast, the uSEG is a supervised segmentation model trained to detect five common thoracic abnormalities.
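The thresholding step can be illustrated with a minimal sketch. It assumes the similarity map is a 2D grid of scores in [0, 1] and uses a fixed cutoff of 0.5. The function name, the cutoff, and the toy values are hypothetical and are not taken from the study.

```python
def threshold_similarity_map(sim_map, threshold=0.5):
    """Return a binary mask: 1 where similarity >= threshold, else 0.

    `threshold` is an illustrative assumption; the study's actual
    cutoff is not stated in the abstract.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in sim_map]


# Toy 3x3 similarity map for one text prompt (hypothetical values).
sim_map = [
    [0.10, 0.20, 0.15],
    [0.30, 0.80, 0.70],
    [0.25, 0.90, 0.60],
]

mask = threshold_similarity_map(sim_map, threshold=0.5)
for row in mask:
    print(row)
```

Because the mask depends only on the text prompt and the cutoff, the same procedure applies to any finding that can be described in words, which is what makes the segmentation open-vocabulary.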
Classification and segmentation performance of the VLM and uSEG was compared using a publicly available chest radiograph dataset annotated for findings including consolidation, pneumothorax, fibrosis, nodule/mass, and pleural effusion. Classification metrics included area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and accuracy, with AUROC comparisons performed using the DeLong test. Segmentation performance was assessed using Dice scores and compared using paired t-tests.
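For reference, the Dice score and the threshold-based classification metrics named above follow standard definitions, sketched below on toy data. The function names and inputs are hypothetical. Returning a Dice score of 1.0 when both masks are empty is one common convention and is an assumption here, not something the abstract specifies.

```python
def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from binary labels.

    Assumes at least one positive and one negative case in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy flattened masks (hypothetical values).
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 1, 1, 0, 1, 0]
dice = dice_score(pred_mask, true_mask)

# Toy image-level labels (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)

print(round(dice, 3), metrics)
```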
For reader evaluation, 100 chest radiographs were retrospectively collected from a tertiary care hospital under IRB approval. Two general practitioners and two senior medical students independently evaluated the VLM’s heatmaps, rating agreement with findings and perceived helpfulness for clinical or educational use on a 5-point Likert scale. Score distributions and acceptable response rates (scores ≥3) were analyzed.
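The acceptable response rate is the percentage of ratings at or above 3 on the 5-point scale. A minimal sketch, using hypothetical ratings rather than study data:

```python
from collections import Counter


def acceptable_rate(scores, cutoff=3):
    """Percentage of Likert scores at or above `cutoff`."""
    return 100.0 * sum(1 for s in scores if s >= cutoff) / len(scores)


# Hypothetical ratings from one reader for ten radiographs.
scores = [5, 4, 3, 2, 4, 3, 1, 5, 4, 3]

distribution = Counter(scores)   # count of each Likert score
rate = acceptable_rate(scores)   # share of scores >= 3, in percent

print(sorted(distribution.items()), rate)
```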
Results The VLM demonstrated comparable classification performance to the uSEG across most findings (e.g., consolidation AUROC: 0.930 vs. 0.929; p = .83). For segmentation, the VLM achieved higher Dice scores for pleural effusion, comparable performance for fibrosis (p = .05), and modestly lower performance for other findings.
In the reader evaluation, general practitioners gave moderate agreement and helpfulness ratings, with acceptable rates of 94.0% and 81.0%, respectively, for GP1 and 79.0% and 67.0% for GP2. Medical students had higher acceptable rates than general practitioners for both agreement and helpfulness.
Conclusion The VLM demonstrated performance comparable to a unimodal segmentation model while generating interpretable outputs without requiring manual annotation. Reader evaluations further supported its potential clinical and educational relevance.
Clinical Relevance Statement The VLM offers a scalable solution for chest radiograph interpretation by producing interpretable visual outputs. It may support clinical practice and education, pending prospective validation.