Digital Repository

A Novel Approach to Improving the Robustness of Visual Question Answering Systems


dc.contributor.author Kodippily, Methmi
dc.date.accessioned 2026-03-23T06:39:16Z
dc.date.available 2026-03-23T06:39:16Z
dc.date.issued 2025
dc.identifier.citation Kodippily, Methmi (2025) A Novel Approach to Improving the Robustness of Visual Question Answering Systems. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 2019383
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/3024
dc.description.abstract Visual question answering (VQA) systems aim to interpret natural language questions about images and generate accurate answers by leveraging both visual and textual information. Despite their growing popularity, however, existing VQA models face significant performance challenges due to biases inherent in their training datasets. Current debiasing approaches often struggle to maintain robustness across different distribution settings; moreover, some methods that are robust in out-of-distribution (OOD) scenarios achieve this at the expense of in-distribution performance. To address this challenge, this research proposes a novel architecture that combines transfer learning and debiasing techniques to create a more balanced and robust VQA model. The implementation uses the Xception network to extract high-level image features and the SBERT model to capture semantic question embeddings. These features are fused using a pointwise multiplication strategy to form a multimodal representation, which is then passed through a classifier trained on the VQA v2 dataset. A class-balanced loss function counteracts answer-frequency bias by inversely weighting the loss contributions of frequent and infrequent answers. Finally, the entire system was encapsulated in a web-based application with a Flask backend and a React frontend, allowing end users to interactively upload images, pose questions, and receive model-generated answers. Extensive testing measured how well the system performs on both familiar (in-distribution) and unseen (out-of-distribution) data. While the baseline model without debiasing slightly outperformed on standard accuracy, it struggled with rare question types in the OOD dataset. The debiased model, however, showed better balance across different answer types, improved accuracy for underrepresented classes, and achieved a 1.69% boost in tail accuracy on the GQA-OOD benchmark. These gains came without significantly lowering performance on common answers. Overall, the results show that combining class-balanced debiasing with pretrained models yields a more robust and fairer VQA system. en_US
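The abstract's two core mechanisms, pointwise-multiplication fusion of the two feature streams and a class-balanced loss weighted inversely to answer frequency, can be sketched as follows. This is a minimal illustrative sketch only, not the dissertation's implementation: the dimensions, random projection weights, and helper names (`project`, `class_balanced_ce`) are assumptions, and trained Xception/SBERT outputs are replaced by random vectors of plausible size.

```python
import numpy as np

# Illustrative dimensions (assumptions): Xception pooled features are
# 2048-d, SBERT sentence embeddings are 384-d; the fused size and answer
# vocabulary size here are placeholders.
IMG_DIM, TXT_DIM, FUSED_DIM, NUM_ANSWERS = 2048, 384, 512, 1000

rng = np.random.default_rng(0)

def project(x, w):
    """Linear projection followed by ReLU (stand-in for a trained layer)."""
    return np.maximum(x @ w, 0.0)

# Randomly initialised projection weights, in place of learned parameters.
w_img = rng.normal(scale=0.01, size=(IMG_DIM, FUSED_DIM))
w_txt = rng.normal(scale=0.01, size=(TXT_DIM, FUSED_DIM))

img_feat = rng.normal(size=(IMG_DIM,))   # stand-in for Xception output
txt_feat = rng.normal(size=(TXT_DIM,))   # stand-in for SBERT embedding

# Multimodal fusion by pointwise (Hadamard) multiplication, as described.
fused = project(img_feat, w_img) * project(txt_feat, w_txt)

# Class-balanced weights: each answer class is weighted inversely to its
# frequency in the (here, simulated) training set, so rare answers
# contribute more to the loss; weights are normalised to mean 1.
answer_counts = rng.integers(1, 10_000, size=NUM_ANSWERS)
class_weights = 1.0 / answer_counts
class_weights *= NUM_ANSWERS / class_weights.sum()

def class_balanced_ce(logits, target, weights):
    """Frequency-weighted cross-entropy loss for a single example."""
    z = logits - logits.max()                    # numerical stability
    log_probs = z - np.log(np.exp(z).sum())      # log-softmax
    return -weights[target] * log_probs[target]
```

With uniform logits, the weighted loss for a rare answer class exceeds that for a frequent one, which is exactly the rebalancing effect the abstract attributes to the class-balanced loss.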
dc.language.iso en en_US
dc.subject Transfer Learning en_US
dc.subject NLP en_US
dc.subject Computer Vision en_US
dc.title A Novel Approach to Improving the Robustness of Visual Question Answering Systems en_US
dc.type Thesis en_US

