| dc.description.abstract |
Understanding complex scenes in real time remains a major challenge in computer vision, especially when scenes contain unseen or dynamic elements. Traditional models struggle to generalize across environments and lack the contextual enrichment available from visual, textual, and knowledge-based data. Scalability and computational efficiency are further barriers, particularly for large-scale real-time applications. This research proposes an adaptive scene understanding model that integrates zero-shot learning with multimodal data, combining visual features, text, and knowledge graphs as external knowledge, to enhance performance and adaptability.
This project presents a multimodal framework for adaptive zero-shot scene understanding using visual, textual, and knowledge graph data. Embeddings from these modalities are mapped into a shared semantic space. Given an input image, visual features are extracted and passed to a zero-shot learning model, which uses the shared embeddings to identify unseen objects by semantic similarity. The zero-shot model's predictions are then passed to a generative language model such as Flan-T5, which produces a caption describing the scene.
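The zero-shot identification step can be sketched as a nearest-neighbour search in the shared semantic space: the image embedding is compared against the text embeddings of candidate labels, including labels never seen at training time. The toy four-dimensional vectors below are illustrative only; in the real system they would come from trained visual and text encoders aligned to the same space.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_embedding, label_embeddings):
    """Pick the label whose text embedding is most similar to the image
    embedding. Labels need no training examples, only a text embedding."""
    scores = {label: cosine_similarity(image_embedding, emb)
              for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings for illustration only.
label_embeddings = {
    "dog":     [0.9, 0.1, 0.0, 0.2],
    "bicycle": [0.1, 0.8, 0.3, 0.0],
    "zebra":   [0.0, 0.2, 0.9, 0.4],  # "unseen" class: only its text embedding exists
}
image_embedding = [0.1, 0.25, 0.85, 0.35]

label, scores = zero_shot_classify(image_embedding, label_embeddings)
print(label)  # → zebra
```

The unseen class wins purely because its text embedding lies closest to the image embedding, which is the core mechanism that lets the framework label objects absent from the training set.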
The system was successfully designed, implemented, and tested, integrating multimodal data with external knowledge sources such as ConceptNet. It achieved a BLEU-4 score of 49.8%, a CIDEr score of 112.4, and a ROUGE-L score of 65.3%, demonstrating strong caption quality. ConceptNet integration improved contextual relevance, with a 94% query success rate and an average of 2.3 additional relevant concepts per caption. Planned optimizations, including improved knowledge extraction, batch processing, and UI enhancements, are expected to further improve efficiency and scalability, supporting real-world applicability in multimodal AI. |
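The knowledge-enrichment step described above can be sketched as follows. The `RELATED` dictionary is a hard-coded stand-in for live queries to the ConceptNet API (each lookup would be an HTTP request in the real system), and all concept names and the `enrich_caption` helper are illustrative assumptions, not the project's actual code.

```python
# Stand-in for ConceptNet responses: maps a detected concept to
# semantically related concepts (e.g. via RelatedTo / IsA edges).
RELATED = {
    "dog":     ["pet", "animal", "bark"],
    "park":    ["outdoors", "grass", "recreation"],
    "frisbee": ["toy", "throw"],
}

def enrich_caption(concepts, max_extra=3):
    """Collect related concepts for each detected concept, skipping
    duplicates, to give the caption generator extra context."""
    extra = []
    for concept in concepts:
        for rel in RELATED.get(concept, []):
            if rel not in concepts and rel not in extra:
                extra.append(rel)
    return extra[:max_extra]

print(enrich_caption(["dog", "park"]))  # → ['pet', 'animal', 'bark']
```

Capping the enrichment (here at three extra concepts) keeps prompts to the caption generator short while still adding context, consistent with the reported average of 2.3 additional relevant concepts per caption.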
en_US |