Demonstrating the Reliability of Reasoning-Capable Large Language Models in Radiologic Numerical Tasks

In our latest open-access study, “Large Language Models in Radiologic Numerical Tasks: A Thorough Evaluation and Error Analysis,” the Sohn Lab demonstrates that modern reasoning-capable large language models can reliably perform clinically meaningful numerical tasks directly from radiology reports. Evaluating six real-world extraction and judgment tasks across CT, PET, DEXA, and ultrasound reports, using large cohorts from MIMIC-III and institutional data, the study shows that reasoning models trained with reinforcement learning achieve near-perfect performance: GPT-5-mini reached at least 99 percent accuracy on every task, with no mathematical errors observed, even with minimal prompt engineering. Importantly, a detailed manual error analysis clarifies where failures arise, most often from medical context rather than arithmetic, and highlights how output formatting can meaningfully affect model performance in applied pipelines. This work reflects a close collaboration among Ali Nowroozi, MD; Masha Bondarenko, BS; Adrian Serapio, BS; Tician Schnitzler, MD, MS; Sukhmanjit S. Brar, MBBS; and PI Jae Ho Sohn, MD, MS, and underscores the lab’s commitment to rigorous evaluation and responsible integration of LLMs into radiology research and clinical workflows.

Find the paper at https://doi.org/10.1007/s10278-025-01824-9. 

Figure from Nowroozi et al.

Nowroozi, A., Bondarenko, M., Serapio, A., et al. Large Language Models in Radiologic Numerical Tasks: A Thorough Evaluation and Error Analysis. J Imaging Inform Med (2026). https://doi.org/10.1007/s10278-025-01824-9