Diagnostic Performance of a Large Language Model in Multimodal Retrieval-Augmented Radiology Report Generation: A Comparative Study with an Experienced Human Reader

Purpose: To evaluate the diagnostic performance of a large language model (LLM) in multimodal retrieval-augmented radiology report generation for chest radiographs and to compare it with that of an experienced human reader.
Materials and Methods: This retrospective study included 800 anonymized chest radiographs obtained in five clinical settings: screening, outpatient non-respiratory clinics, outpatient respiratory clinics, inpatient wards, and emergency departments. The evaluation was limited to posteroanterior and anteroposterior radiographs, with chest computed tomography (CT) performed on the same day or within one week serving as the reference standard. Diagnostic accuracy and report quality (rated on a 5-point Likert scale) were compared between LLM-generated reports and those of an experienced human reader, using the McNemar test for diagnostic accuracy and the Wilcoxon signed-rank test for report quality. Efficiency was assessed by measuring the time the LLM required for interpretation, and hallucinations in the generated reports were analyzed to evaluate the reliability of the LLM's interpretations.
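The McNemar test named above compares paired outcomes through the discordant pairs only: radiographs the human reader interpreted correctly and the LLM did not (b), and the reverse (c). The abstract reports only the overall totals (732 and 535 correct of 800, see Results), not b and c themselves. The following is a minimal sketch, not the authors' analysis code: it assumes the exact binomial form of the test (the study may have used the chi-square form), and the function name and the enumeration over c are illustrative. Because the totals fix b - c = 197 and limit c to at most 68 (the human reader's errors), the sketch checks the P value for every split of discordant pairs compatible with those totals.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant-pair counts.

    b: cases reader A got right and reader B got wrong
    c: cases reader B got right and reader A got wrong
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Lower binomial tail P(X <= k) for X ~ Binomial(n, 0.5), kept as an exact
    # integer numerator until the final division.
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Overall totals taken from the abstract.
TOTAL = 800
HUMAN_CORRECT = 732
LLM_CORRECT = 535

human_wrong = TOTAL - HUMAN_CORRECT      # upper bound on c
difference = HUMAN_CORRECT - LLM_CORRECT  # b - c, fixed by the totals

# The individual discordant counts are NOT reported, so every feasible value
# of c is tried; b follows from the fixed difference.
p_values = [mcnemar_exact(c + difference, c) for c in range(human_wrong + 1)]
largest_p = max(p_values)

print(f"largest exact McNemar P over all feasible splits: {largest_p:.3g}")
```

Under these assumptions, every split of discordant pairs that is compatible with the reported totals gives a P value below 0.001, which agrees with the overall comparison reported in the Results.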
Results: The human reader achieved an overall diagnostic accuracy of 91.5% (732/800), significantly higher than the LLM's accuracy of 66.9% (535/800) (P < 0.001). The mean Likert score for report quality was 4.53 for the human reader and 3.43 for the LLM (P < 0.001). The LLM's hallucination rate was 5.6% (45/800). The mean reporting time for the LLM was 3.4 seconds (SD 0.9; range, 2.0 to 9.6 seconds). In every clinical setting, the human reader achieved higher diagnostic accuracy than the LLM (all P < 0.001). Both achieved their highest accuracy in the screening setting (human reader, 96.3% [154/160]; LLM, 84.4% [135/160]), while the largest performance gap occurred in outpatient respiratory clinics (human reader, 89.4% [143/160]; LLM, 55.0% [88/160]). In inpatient wards and emergency departments, the human reader maintained accuracies above 90%, significantly outperforming the LLM (63.8% [102/160] and 65.0% [104/160], respectively).
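The percentages above can be recomputed from the reported counts. The sketch below is an illustrative consistency check, not part of the study: the helper name `pct` is hypothetical, and half-up rounding to one decimal place is an assumption chosen because it reproduces the reported figures. The LLM's count for outpatient non-respiratory clinics is not stated in the abstract; the value computed here is derived from the overall total and assumes 160 radiographs per setting, as the four reported denominators suggest.

```python
from decimal import Decimal, ROUND_HALF_UP


def pct(numerator: int, denominator: int) -> Decimal:
    """Percentage rounded half-up to one decimal place (assumed convention)."""
    value = Decimal(numerator) * 100 / Decimal(denominator)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


PER_SETTING = 160  # denominator of every per-setting figure in the abstract

# Overall figures reported in the abstract.
overall = {
    "human accuracy": pct(732, 800),
    "LLM accuracy": pct(535, 800),
    "LLM hallucination rate": pct(45, 800),
}

# LLM correct counts for the four settings where the abstract gives them.
llm_correct = {
    "screening": 135,
    "outpatient respiratory": 88,
    "inpatient wards": 102,
    "emergency departments": 104,
}
# Human reader counts are reported for two settings only.
human_correct = {"screening": 154, "outpatient respiratory": 143}

# Derived, NOT reported: the LLM's remaining correct reads must come from
# outpatient non-respiratory clinics if each setting has 160 radiographs.
llm_non_respiratory = 535 - sum(llm_correct.values())

# Accuracy gap in percentage points where both readers' counts are known.
gaps = {
    setting: pct(human_correct[setting] - llm_correct[setting], PER_SETTING)
    for setting in human_correct
}

for name, value in overall.items():
    print(f"{name}: {value}%")
for setting, count in llm_correct.items():
    print(f"LLM, {setting}: {pct(count, PER_SETTING)}%")
print(f"LLM, outpatient non-respiratory (derived): {llm_non_respiratory}/160")
for setting, gap in gaps.items():
    print(f"gap, {setting}: {gap} percentage points")
```

Under these assumptions, the recomputed values match the reported percentages, and the gap in outpatient respiratory clinics (55 of 160 reads) is roughly three times the gap in screening (19 of 160 reads).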
Conclusions: The human reader showed superior diagnostic accuracy and report quality across all clinical settings. However, the LLM demonstrated acceptable diagnostic accuracy in screening examinations (84.4%), highlighting its potential utility in specific scenarios where efficiency and scalability are prioritized.