Multimodal Retrieval-Augmented Radiology Reporting: Diagnostic Comparisons of Single- and Multi-View LLMs

Objective: To evaluate the diagnostic performance of a multi-view large language model (LLM) in multimodal radiology report generation, leveraging both chest posteroanterior (PA) and lateral views to improve diagnostic accuracy.

Materials and Methods: This study analyzed 219 anonymized chest radiographs, comparing the performance of single-view and multi-view LLMs. The single-view LLM analyzed only chest PA radiographs, while the multi-view LLM incorporated both chest PA and lateral views. Chest computed tomography (CT) performed on the same day or within one week served as the reference standard. Diagnostic accuracy, report quality (assessed using a 5-point Likert scale), reporting time, and hallucination rates were compared between the two models using McNemar’s test, the Wilcoxon signed-rank test, and the paired t-test.

Results: The multi-view LLM demonstrated higher diagnostic accuracy (81.7% [179/219]) than the single-view LLM (76.3% [167/219]) (P = 0.017). Report quality, measured by the average Likert score, was slightly higher for the multi-view LLM (4.02) than for the single-view LLM (3.97), but the difference was not statistically significant (P = 0.719). The multi-view LLM also exhibited a lower hallucination rate (0.9% [2/219] vs. 2.3% [5/219]), though this difference was not statistically significant (P = 0.25). However, the multi-view approach required more time for report generation, with a mean reporting time of 6.0 seconds (SD 1.3) compared with 3.2 seconds (SD 0.8) for the single-view LLM (P < 0.001).
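Because both models read the same 219 radiographs, the accuracy comparison is paired, and McNemar's test depends only on the discordant cases (radiographs that one model got right and the other got wrong). The sketch below shows how an exact (binomial) McNemar P value could be computed. The function name and the discordant counts are hypothetical: the abstract reports only the marginal totals (179 vs. 167 correct reports), not the discordant counts or which McNemar variant was used, so the counts here are chosen only to match the reported difference of 12 correct reports.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant pair counts.

    b: cases where only the first model was correct
    c: cases where only the second model was correct
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, each discordant pair favors either model
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# HYPOTHETICAL discordant counts (not reported in the study); they are only
# required to differ by 12, matching 179 - 167 correct reports.
multi_only_correct = 17
single_only_correct = 5

p_value = mcnemar_exact(multi_only_correct, single_only_correct)
print(f"exact McNemar p = {p_value:.3f}")
```

Concordant pairs (both models correct or both wrong) do not enter the calculation, which is why the marginal accuracies alone are not enough to recompute the P value.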

Conclusions: Although it required longer reporting times, the multi-view LLM achieved higher diagnostic accuracy than the single-view LLM, highlighting the potential benefit of incorporating additional imaging views to enhance diagnostic performance in multimodal radiology report generation.