62nd National Congress of the Italian Society of Rheumatology

Vol. 77 No. s1 (2025): Abstract book of the 62th Conference of the Italian Society for...

https://doi.org/10.4081/reumatismo.2025.2025

PO:09:139 | Comparative evaluation of GPT-4.0, Claude 4, and MedGEMMA in automatic Kellgren-Lawrence grading of knee osteoarthritis

Teresa Caferri¹, Vincenzo Venerito¹, Angela Carenza¹, Daniele Catamerò¹, Daniele Domanico¹, Sergio Del Vescovo¹, Lucia Cristiana Colaprico¹, Lavista Marlea¹, Maria Giannotta¹, Giuseppe Lopalco¹, Florenzo Iannone¹. | ¹Rheumatology Unit, Department of Precision and Regenerative Medicine and Jonian Area, University of Bari, Bari, Italy.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.

Published: 25 November 2025

Osteoarthritis, Large language model, Artificial intelligence

Background. Automated grading of knee osteoarthritis (OA) severity using the Kellgren-Lawrence (KL) scale is a critical task in musculoskeletal radiology. Recently developed multimodal large language models (LLMs) offer the potential to interpret clinical images alongside text, but their performance on fine-grained ordinal classification tasks remains poorly characterized.

Methods. We evaluated the ability of three multimodal LLMs, OpenAI GPT-4o (vision-enabled), Anthropic Claude 4, and the open-source Google MedGEMMA, to predict KL grades from knee radiographs in a publicly available, expert-annotated dataset¹. Model predictions were compared to ground truth labels using exact match accuracy, ±1 tolerance accuracy (i.e., prediction within one KL grade) (Figure 1), and macro-averaged precision and recall. Confusion matrices were also analyzed to examine misclassification trends.
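The metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation code: the function names, the toy labels, and the convention of scoring precision as 0.0 for a grade that is never predicted are all assumptions made here.

```python
# Illustrative sketch only: function names and the zero-division
# convention are assumptions, not taken from the abstract.
KL_GRADES = [0, 1, 2, 3, 4]


def confusion_matrix(y_true, y_pred, labels=KL_GRADES):
    """Rows are true grades, columns are predicted grades."""
    index = {grade: i for i, grade in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def exact_accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true grade."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def tolerance_accuracy(y_true, y_pred, tol=1):
    """Fraction of predictions within `tol` grades of the true grade."""
    return sum(abs(t - p) <= tol for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision_recall(y_true, y_pred, labels=KL_GRADES):
    """Unweighted mean of per-grade precision and recall."""
    cm = confusion_matrix(y_true, y_pred, labels)
    n = len(labels)
    precisions, recalls = [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        # Assumed convention: score 0.0 when the denominator is zero.
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    return sum(precisions) / n, sum(recalls) / n


# Toy labels (hypothetical, not study data).
y_true = [0, 1, 2, 3, 4]
y_pred = [0, 2, 2, 2, 3]
print(exact_accuracy(y_true, y_pred))
print(tolerance_accuracy(y_true, y_pred))
print(macro_precision_recall(y_true, y_pred))
```

Reading the confusion matrix row by row shows where each true grade ends up, which is how a systematic shift toward lower grades would be visible as mass below the diagonal.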

Results. The dataset included 100 radiographic images, equally distributed across KL grades:
• Grade 0: 20 images
• Grade 1: 20 images
• Grade 2: 20 images
• Grade 3: 20 images
• Grade 4: 20 images

GPT-4o demonstrated the best overall performance with 26% exact match accuracy, 63% ±1 tolerance accuracy, macro precision 0.38, and macro recall 0.26. Claude 4 and MedGEMMA each reached 21% exact match, 58% ±1 tolerance, with macro precision/recall of 0.23/0.21 and 0.20/0.21, respectively (Figure 2). All models exhibited frequent misclassification between adjacent KL grades, particularly underestimating moderate to severe OA (grades 3–4).
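One consistency check on these figures, offered as a worked identity rather than something the authors state: because every grade contributes the same number of images, macro-averaged recall reduces to exact match accuracy. This is why the reported macro recall values (0.26 and 0.21) coincide with the exact match percentages. The symbol TP_k, the number of grade-k images graded correctly, is introduced here for illustration.

```latex
\[
\text{macro recall}
  = \frac{1}{K}\sum_{k=0}^{K-1}\frac{TP_k}{n}
  = \frac{1}{Kn}\sum_{k=0}^{K-1} TP_k
  = \frac{\sum_{k} TP_k}{N}
  = \text{exact match accuracy},
\qquad K = 5,\; n = 20,\; N = Kn = 100 .
\]
```

No such identity holds for macro precision, whose denominators are the numbers of images each model assigned to a grade, so it can differ from accuracy (0.38 versus 0.26 for GPT-4o). With N = 100, each percentage point also corresponds to exactly one image.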

Conclusion. Although GPT-4o outperformed the other models, its accuracy remains insufficient for clinical reliability. These findings indicate that current multimodal LLMs, while promising, still struggle with ordinal radiographic interpretation tasks. Targeted training on medical imaging datasets and improved domain adaptation are necessary to enhance their diagnostic utility.

How to Cite

PO:09:139 | Comparative evaluation of GPT-4.0, Claude 4, and MedGEMMA in automatic Kellgren-Lawrence grading of knee osteoarthritis: Teresa Caferri¹, Vincenzo Venerito¹, Angela Carenza¹, Daniele Catamerò¹, Daniele Domanico¹, Sergio Del Vescovo¹, Lucia Cristiana Colaprico¹, Lavista Marlea¹, Maria Giannotta¹, Giuseppe Lopalco¹, Florenzo Iannone¹. | ¹Rheumatology Unit, Department of Precision and Regenerative Medicine and Jonian Area, University of Bari, Bari, Italy. Reumatismo [Internet]. 2025 Nov. 25 [cited 2026 Jul. 18];77(s1). Available from: https://www.reumatismo.org/reuma/article/view/2025

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
