62nd National Congress of the Italian Society of Rheumatology
Vol. 77 No. s1 (2025): Abstract book of the 62th Conference of the Italian Society for...

PO:24:060 | Evaluation of a Large Language Model Performance in Rating Cutaneous Manifestations of Dermatomyositis: A Comparison with Expert Assessors

Gianmarco Roselli1, Marco Fornaro1, Swapnasha Panigrahi2, Sara Sabbagh3, Florenzo Iannone1, Vincenzo Venerito1, Latika Gupta4 | 1Università degli Studi di Bari, Unità di Reumatologia, DiMePRe-J, Bari, Italy; 2University of Birmingham, Birmingham, United Kingdom; 3Medical College of Wisconsin, Unit of Rheumatology, Department of Paediatrics, Milwaukee, USA; 4University of Manchester, Manchester Academic Health Centre, Division of Musculoskeletal and Dermatological Sciences, Manchester, United Kingdom

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
Published: 18 March 2026

Background. Large language models (LLMs) are increasingly applied in medicine and show promise. In idiopathic inflammatory myopathies (IIM), an LLM (Claude-2) was recently shown to correlate strongly with expert assessors on the Myositis Disease Activity Assessment Tool-Visual Analogue Scale (MDAAT-VAS). LLM performance in the cutaneous domain of dermatomyositis (DM) specifically remains unexplored.

Objectives. This study aimed to evaluate the LLM 'Claude v. 3.5 Sonnet' in scoring cutaneous manifestations of DM against expert assessors, with implications for automated clinical trial screening, where competing trials often limit recruitment from an already restricted patient pool.

Methods. Twenty-seven DM cases with standardized clinical photographs were identified through a systematic PubMed review. Two rheumatologists with expertise in Cutaneous Dermatomyositis Disease Area and Severity Index (CDASI) scoring and trial recruitment independently assessed the images. The LLM 'Claude' analysed the same images with chain-of-thought prompting and was asked to score the CDASI domains: erythema, scaling, erosion/ulceration, poikiloderma and calcinosis. Hand lesions were scored with specific attention to papules (requiring doubled erythema scores) and periungual changes. Agreement was assessed with the intraclass correlation coefficient (ICC) using a two-way random-effects model (Stata 18).
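As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of a two-way random-effects ICC computed from a cases-by-raters score matrix. The abstract does not say which ICC form was reported, so the single-rater, absolute-agreement form, ICC(2,1), is an assumption here. The function name `icc_2_1` and the example ratings are hypothetical and are not study data; the study itself used Stata 18, not this code.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is an (n_cases x k_raters) array with no missing values.
    This form is an assumption; the abstract does not state which one was used.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares for cases (rows), raters (columns) and residual error
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols

    # Mean squares from the two-way ANOVA decomposition
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Illustrative ratings only (6 cases x 4 raters), not data from this study
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")  # prints "ICC(2,1) = 0.29"
```

In this layout, each row would be one DM case and each column one assessor (for example, the two experts, or one expert and the LLM), with one matrix per CDASI domain.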

Results. Global ICC analysis demonstrated excellent agreement between Claude and expert assessors (0.92, 95% CI: 0.89-0.94), comparable to inter-expert reliability (0.87, 95% CI: 0.82-0.91). Domain-specific analysis revealed:

1. Moderate agreement for core features:
   - Erythema (0.61, 95% CI: 0.27-0.81)
   - Scaling (0.57, 95% CI: 0.19-0.79)
   - Erosions (0.57, 95% CI: 0.21-0.79)
   - Poikiloderma (0.47, 95% CI: 0.10-0.73)
2. Strong concordance for hand assessment, a ubiquitous and specific feature of the disease:
   - Global hand score (0.95, 95% CI: 0.91-0.97)
   - Hand erythema (0.78, 95% CI: 0.24-0.95)
   - Perfect agreement for periungual vasculitis (ICC 1.0)
3. Lower reliability for damage assessment:
   - Hand damage (0.37, 95% CI: 0.10-0.85)

Time efficiency analysis:
   - Expert assessors: mean 8.4 minutes per case (range 6-12 minutes)
   - LLM assessment: mean 42 seconds per case (range 35-50 seconds)
   - Total time saved: 93% reduction in scoring time
   - Additional efficiency: simultaneous batch processing capability for the LLM versus sequential expert assessment.
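The time saving can be sanity-checked from the two reported means. The sketch below uses only the figures given above; the helper name `percent_reduction` is hypothetical. Computed from the means alone, the reduction comes to roughly 91.7%, slightly below the 93% stated, so the reported figure may rest on per-case timings that are not shown in the abstract.

```python
def percent_reduction(baseline_seconds, new_seconds):
    """Relative time saving, as a percentage of the baseline."""
    return 100.0 * (baseline_seconds - new_seconds) / baseline_seconds

# Means reported in the abstract
expert_mean_s = 8.4 * 60   # 8.4 minutes per case, in seconds
llm_mean_s = 42.0          # 42 seconds per case

reduction = percent_reduction(expert_mean_s, llm_mean_s)
print(f"{reduction:.1f}%")  # prints "91.7%"
```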

Conclusions. The LLM demonstrates excellent reliability for global disease assessment (ICC 0.92) and for objective features such as periungual changes (ICC 1.0), with a substantial time saving (93% reduction in scoring time) and batch processing capability. This could enhance clinical trial recruitment workflows in DM, where competing trials often limit recruitment from an already restricted patient pool. However, important limitations persist: subtle features are assessed less reliably (poikiloderma ICC 0.47, damage ICC 0.37), and performance depends on technical factors such as image quality. For now, the LLM is therefore best used as a screening tool that supports, rather than replaces, expert assessment.




How to Cite



1. PO:24:060 | Evaluation of a Large Language Model Performance in Rating Cutaneous Manifestations of Dermatomyositis: A Comparison with Expert Assessors: Gianmarco Roselli1, Marco Fornaro1, Swapnasha Panigrahi2, Sara Sabbagh3, Florenzo Iannone1, Vincenzo Venerito1, Latika Gupta4 | 1Università degli Studi di Bari, Unità di Reumatologia, DiMePRe-J, Bari, Italy; 2University of Birmingham, Birmingham, United Kingdom; 3Medical College of Wisconsin, Unit of Rheumatology, Department of Paediatrics, Milwaukee, USA; 4University of Manchester, Manchester Academic Health Centre, Division of Musculoskeletal and Dermatological Sciences, Manchester, United Kingdom. Reumatismo [Internet]. 2026 Mar. 18 [cited 2026 Apr. 17];77(s1). Available from: https://www.reumatismo.org/reuma/article/view/2349