62nd National Congress of the Italian Society of Rheumatology
Vol. 77 No. s1 (2025): Abstract book of the 62nd Conference of the Italian Society for Rheumatology, Rimini, 26-29 November 2025

PO:09:139 | Comparative evaluation of GPT-4.0, Claude 4, and MedGEMMA in automatic Kellgren-Lawrence grading of knee osteoarthritis

Teresa Caferri¹, Vincenzo Venerito¹, Angela Carenza¹, Daniele Catamerò¹, Daniele Domanico¹, Sergio Del Vescovo¹, Lucia Cristiana Colaprico¹, Lavista Marlea¹, Maria Giannotta¹, Giuseppe Lopalco¹, Florenzo Iannone¹ | ¹Rheumatology Unit, Department of Precision and Regenerative Medicine and Jonian Area, University of Bari, Bari, Italy.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
Published: 26 November 2025

Background. Automated grading of knee osteoarthritis (OA) severity using the Kellgren-Lawrence (KL) scale is a critical task in musculoskeletal radiology. Recently developed multimodal large language models (LLMs) offer the potential to interpret clinical images alongside text, but their performance on fine-grained ordinal classification tasks remains poorly characterized.

Methods. We evaluated three multimodal LLMs, OpenAI GPT-4o (vision-enabled), Anthropic Claude 4, and the open-source Google MedGEMMA, on their ability to predict KL grades from knee radiographs in a publicly available, expert-annotated dataset¹. Model predictions were compared with ground-truth labels using exact-match accuracy, ±1 tolerance accuracy (i.e., a prediction within one KL grade of the reference; Figure 1), and macro-averaged precision and recall. Confusion matrices were also analyzed to examine misclassification trends.
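The metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy labels, and the convention of scoring precision as 0 for a grade that is never predicted are all assumptions made here.

```python
KL_GRADES = [0, 1, 2, 3, 4]

def confusion_matrix(y_true, y_pred, labels=KL_GRADES):
    """Rows are true grades, columns are predicted grades."""
    index = {grade: i for i, grade in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def exact_accuracy(y_true, y_pred):
    """Share of predictions that equal the reference grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def tolerance_accuracy(y_true, y_pred, tol=1):
    """Share of predictions within `tol` grades of the reference."""
    return sum(abs(t - p) <= tol for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_precision_recall(y_true, y_pred, labels=KL_GRADES):
    """Unweighted mean of per-grade precision and recall."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    precisions, recalls = [], []
    for i in range(len(labels)):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        # Assumed convention: an empty denominator scores 0.
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical toy labels, two images per grade (not study data).
toy_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
toy_pred = [0, 0, 2, 2, 3, 1, 1, 3, 3, 2]

toy_exact = exact_accuracy(toy_true, toy_pred)
toy_within_one = tolerance_accuracy(toy_true, toy_pred)
toy_precision, toy_recall = macro_precision_recall(toy_true, toy_pred)
print(toy_exact, toy_within_one, round(toy_precision, 3), round(toy_recall, 3))
```

In the toy data, grade 4 is never predicted, so it contributes zero to both macro averages. This is the kind of effect a macro-averaged score exposes and a pooled accuracy can hide.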

Results. The dataset included 100 radiographic images, equally distributed across KL grades:
• Grade 0: 20 images
• Grade 1: 20 images
• Grade 2: 20 images
• Grade 3: 20 images
• Grade 4: 20 images

GPT-4o demonstrated the best overall performance, with 26% exact match accuracy, 63% ±1 tolerance accuracy, macro precision 0.38, and macro recall 0.26. Claude 4 and MedGEMMA each reached 21% exact match and 58% ±1 tolerance accuracy, with macro precision/recall of 0.23/0.21 and 0.20/0.21, respectively (Figure 2). All models exhibited frequent misclassification between adjacent KL grades, particularly underestimating moderate to severe OA (grades 3–4).
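To put these percentages in context, the sketch below computes what a uniform random guesser would score on five equally represented grades. The baseline is an assumption introduced here for illustration and is not reported in the abstract; only the model figures are taken from the text.

```python
from fractions import Fraction

KL_GRADES = range(5)  # KL grades 0 to 4, equally represented

def chance_accuracy(tol):
    """Expected accuracy of uniform random guessing on balanced grades."""
    pairs = [(t, p) for t in KL_GRADES for p in KL_GRADES]
    hits = sum(1 for t, p in pairs if abs(t - p) <= tol)
    return Fraction(hits, len(pairs))

exact_chance = chance_accuracy(0)
within_one_chance = chance_accuracy(1)

# Figures reported in the abstract: (exact match, within one grade).
reported = {
    "GPT-4o": (Fraction(26, 100), Fraction(63, 100)),
    "Claude 4": (Fraction(21, 100), Fraction(58, 100)),
    "MedGEMMA": (Fraction(21, 100), Fraction(58, 100)),
}

# Margin of each model over the assumed random baseline.
margins = {
    name: (exact - exact_chance, within_one - within_one_chance)
    for name, (exact, within_one) in reported.items()
}

for name, (d_exact, d_within) in margins.items():
    print(f"{name}: {float(d_exact) * 100:+.0f} points exact, "
          f"{float(d_within) * 100:+.0f} points within one grade")
```

Under this assumption the baseline is 1/5 for exact match and 13/25 for the ±1 tolerance, because the end grades 0 and 4 have only one neighbour each. Separately, with 20 images per grade the macro recall equals the exact match accuracy by construction, which agrees with the reported pairs 26% and 0.26, and 21% and 0.21.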

Conclusion. Although GPT-4o outperformed the other models, its accuracy remains insufficient for clinical reliability. These findings reveal that current multimodal LLMs, while promising, still struggle with ordinal radiographic interpretation tasks. Targeted training on medical imaging datasets and improved domain adaptation are necessary to enhance their diagnostic utility. ¹
[Figures 1 and 2: images not included]

How to Cite



Teresa Caferri, Vincenzo Venerito, Angela Carenza, Daniele Catamerò, Daniele Domanico, Sergio Del Vescovo, Lucia Cristiana Colaprico, Lavista Marlea, Maria Giannotta, Giuseppe Lopalco, Florenzo Iannone. PO:09:139 | Comparative evaluation of GPT-4.0, Claude 4, and MedGEMMA in automatic Kellgren-Lawrence grading of knee osteoarthritis. Reumatismo [Internet]. 2025 Nov. 26 [cited 2026 Jan. 22];77(s1). Available from: https://www.reumatismo.org/reuma/article/view/2025