Abstract:
Visual question answering (VQA) systems aim to interpret natural language questions about images and generate accurate responses by leveraging both visual and textual information. Although these systems are growing in popularity, existing VQA models face significant performance challenges due to biases inherent in their training datasets. Current debiasing approaches often struggle to remain robust across different distribution settings, and some methods that are robust in out-of-distribution scenarios achieve this at the expense of in-distribution performance.
To address this challenge, this research proposes a novel architecture that combines transfer learning and debiasing techniques to create a more balanced and robust VQA model. The implementation uses the Xception network to extract high-level image features and the SBERT model to produce semantic question embeddings. These features are fused by pointwise multiplication to form a multimodal representation, which is then passed to a classifier trained on the VQA v2 dataset. A class-balanced loss function counteracts answer-frequency bias by weighting each answer's loss contribution inversely to its frequency, so that rare answers contribute more and frequent answers contribute less. Finally, the entire system is packaged as a web application with a Flask backend and a React frontend, allowing end users to upload images, pose questions, and receive model-generated answers interactively.
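The fusion and loss-weighting steps described above can be illustrated with a minimal sketch. It is not the implementation from this work: it assumes the image and question features have already been projected to a common dimension, it uses plain inverse-frequency weights normalised to average 1 (the exact class-balanced formulation is not specified here), and the function names and toy values are hypothetical.

```python
import math

def fuse(image_feat, question_feat):
    """Pointwise (element-wise) product of two equal-length feature vectors."""
    if len(image_feat) != len(question_feat):
        raise ValueError("features must share a dimension before fusion")
    return [v * q for v, q in zip(image_feat, question_feat)]

def inverse_frequency_weights(answer_counts):
    """Weight each answer class by 1/count, scaled so the weights average to 1.

    Assumption: simple inverse-frequency weighting, chosen for illustration.
    """
    raw = [1.0 / count for count in answer_counts]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]

def weighted_cross_entropy(logits, target, weights):
    """Cross-entropy for one example, scaled by the weight of its target class."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return weights[target] * (log_sum_exp - logits[target])

# Toy values: 3-dimensional features and three answer classes
# with made-up training counts (frequent, medium, rare).
fused = fuse([0.5, 2.0, -1.0], [2.0, 0.5, 3.0])
weights = inverse_frequency_weights([900, 90, 10])
loss_frequent = weighted_cross_entropy([0.0, 0.0, 0.0], 0, weights)
loss_rare = weighted_cross_entropy([0.0, 0.0, 0.0], 2, weights)
```

With identical logits, the rare answer incurs a larger loss than the frequent one, which is the intended effect of the weighting: mistakes on underrepresented answers are penalised more heavily during training.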
The system was evaluated on both in-distribution data and out-of-distribution (OOD) data. While the baseline model without debiasing achieved slightly higher standard accuracy, it struggled with rare question types in the OOD dataset. The debiased model, by contrast, showed better balance across answer types, improved accuracy on underrepresented classes, and a 1.69% gain in tail accuracy on the GQA-OOD benchmark. These gains came without significantly lowering performance on common answers. Overall, the results indicate that combining class-balanced debiasing with pretrained models leads to a more robust and fairer VQA system.