Section:
Otology
Potential of multimodal language model for preliminary evaluation of otoscopic images
M. V. Komarov (1), O. I. Goncharov (2), A. A. Fedotova (3)
(1) Saint Petersburg Research Institute of Ear, Throat, Nose and Speech, Saint Petersburg, 190013, Russian Federation
(1), (3) Mechnikov North-Western State Medical University, Saint Petersburg, 195067, Russian Federation
(2) Almazov National Medical Research Centre, Saint Petersburg, 197341, Russian Federation
(1), (2), (3) City Hospital No. 26, Saint Petersburg, 196240, Russian Federation
UDC: 616.284-072.1:519.766.2
DOI: https://doi.org/10.18692/1810-4800-2025-3-53-62
ABSTRACT
A pilot study evaluated the ability of the general-purpose multimodal LLM ChatGPT 03 to interpret otoscopic images. Thirty-eight frames were grouped into nine clinical categories, ranging from normal findings and foreign bodies to postoperative states and middle-ear tumors. The “gold standard” annotation was provided by two otorhinolaryngology experts (Cohen’s κ > 0.85), with disagreements resolved by consensus. Each frame was processed in a new session with the prompt “What do you see in this photo?” ChatGPT 03 achieved 100% accuracy in distinguishing normal from pathological images (95% CI 90.8–100%), with sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) all equal to 100%. Its clinical diagnosis was correctly formulated in 81.6% of cases (31/38). For five key morphological features (perforation, effusion, hyperemia, tympanosclerosis, cholesteatoma), the mean F1-score was 0.92 and Cohen’s κ was 0.87. Expert ratings of the usefulness of its text descriptions on a 5-point scale averaged M = 4.4 ± 0.6 (ICC = 0.82), with no significant differences between groups (p = 0.24). Spearman’s ρ = 0.72 (p < 0.001) indicated a strong positive correlation between the number of correctly identified features and the usefulness rating. The average response time was 30–40 s. These findings point to ChatGPT 03’s high potential for preliminary screening, report standardization, and education. Clinical implementation will require large-scale prospective validation, structured output, and integration of quantitative tools.
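As a reading aid, the two headline proportions can be rechecked with a few lines of Python. The abstract does not say how the 95% confidence interval was computed. The sketch below assumes a Wilson score interval, which is one common choice for small samples. The helper name `wilson_interval` is illustrative and does not come from the study.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; the Wilson
    interval is used here only as a plausible reconstruction.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Normal vs. pathology: 38 of 38 frames classified correctly.
low, high = wilson_interval(38, 38)
# Clinical diagnosis formulation: 31 of 38 frames judged correct.
diagnosis_rate = 31 / 38

print(f"normal vs. pathology: 95% CI {low:.1%} to {high:.1%}")
print(f"diagnosis formulation: {diagnosis_rate:.1%}")
```

Under this assumption the lower bound rounds to 90.8%, matching the reported interval, and 31/38 rounds to 81.6%. Other interval methods give slightly different lower bounds for the same counts.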
Publication date:
17.06.2025
Keywords:
otoscopy, multimodal language model, ChatGPT 03, middle ear diagnosis, morphological analysis, screening, telemedicine, explainable AI, classification accuracy, inter-rater agreement
For citation:
Komarov M. V., Goncharov O. I., Fedotova A. A. Potential of multimodal language model for preliminary evaluation of otoscopic images. Russian Otorhinolaryngology. 2025;24(3):53-62. (In Russ.) https://doi.org/10.18692/1810-4800-2025-3-53-62