Generative AI based on medical visual question answering (VQA) techniques
Date
Authors
Kaushik, Sarthak
Abstract
Medical visual question answering (MedVQA) lets clinicians ask direct questions about medical images instead of relying on single-label image classification. It is especially useful in gastrointestinal (GI) endoscopy, and in colonoscopy in particular, where clinical questions tend to concern whether findings are present, where they are located, how many there are, or how much disease is present. Nevertheless, existing GI MedVQA systems still suffer from critical problems, including unbalanced dataset structure, class imbalance, answer-format mismatch, and poor visual grounding in free-text generation.
This dissertation examines these problems using five GI datasets: HyperKvasir, Kvasir-VQA, Kvasir-VQA-x1, ImageCLEF MEDVQA-GI and LIMUC. Benchmarking shows a consistent trend across these datasets: for current GI tasks, supervised or otherwise constrained pipelines are more trustworthy than unsupervised, raw zero-shot vision-language generation. This matters most when the answer space is fixed or when clinically important classes are rare.
Building on this finding, the dissertation develops a generative approach to ulcerative colitis severity scoring on LIMUC, with the Mayo endoscopic subscore as the target task. The model is tested under explicit output constraints and assessed with clinically relevant analysis. The work then carries this model-level contribution into a physician-support setting through a modular wrapper that interprets queries, retrieves supporting evidence as needed, generates citation-linked answers, rejects unsupported queries, and maintains traceable system logs.
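To illustrate what an explicit output constraint with rejection and logging could look like, the following is a minimal sketch and not the dissertation's implementation. It assumes the generative model returns free text and that a valid answer must name exactly one Mayo endoscopic subscore on the standard 0 to 3 scale. The names `constrain_to_mayo`, `ScoredAnswer`, and `audit_log` are hypothetical, and the digit-matching rule is a deliberately crude stand-in for whatever constraint mechanism the work actually uses.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# The Mayo endoscopic subscore is an ordinal scale from 0 to 3.
MAYO_LABELS = {0: "normal or inactive", 1: "mild", 2: "moderate", 3: "severe"}


@dataclass
class ScoredAnswer:
    score: Optional[int]  # None means the output was rejected
    label: str
    raw: str


def constrain_to_mayo(raw_output: str, audit_log: List[dict]) -> ScoredAnswer:
    """Map free-text output onto the fixed answer space, or reject it.

    Hypothetical helper: accepts the output only if it mentions exactly
    one distinct standalone digit in 0-3; anything else is rejected.
    """
    found = set(re.findall(r"(?<!\d)[0-3](?!\d)", raw_output))
    if len(found) == 1:
        score = int(found.pop())
        answer = ScoredAnswer(score, MAYO_LABELS[score], raw_output)
    else:
        answer = ScoredAnswer(None, "rejected", raw_output)
    # Record every decision so the outcome can be traced later.
    audit_log.append({"raw": raw_output, "score": answer.score})
    return answer


log: List[dict] = []
accepted = constrain_to_mayo("Findings are consistent with Mayo 2.", log)
ambiguous = constrain_to_mayo("Either Mayo 2 or Mayo 3.", log)
out_of_range = constrain_to_mayo("Mayo 4", log)
```

In this sketch, an ambiguous or out-of-range output is rejected rather than coerced to the nearest class, and every decision, accepted or not, leaves an entry in the log.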
The overall argument of the dissertation is that GI MedVQA can progress toward clinical decision support only with systems that remain visually grounded, whose limits are traceable, and that can be audited under physician supervision. The work does not claim readiness for independent clinical use. Rather, it offers a gradual route from benchmark assessment to safer clinician-facing colonoscopy assistance.
Keywords
medical visual question answering, gastrointestinal endoscopy, ulcerative colitis, Mayo endoscopic subscore, vision-language models, retrieval-augmented generation, clinical decision support, colonoscopy
