Digital Repository

AURIX: Augmented Multimodal Knowledge Integration for Adaptive Zero-Shot Scene Understanding


dc.contributor.author Amarawickrama, Malith
dc.date.accessioned 2026-04-21T07:32:25Z
dc.date.available 2026-04-21T07:32:25Z
dc.date.issued 2025
dc.identifier.citation Amarawickrama, Malith (2025) AURIX: Augmented Multimodal Knowledge Integration for Adaptive Zero-Shot Scene Understanding. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20210353
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/3176
dc.description.abstract Understanding complex scenes in real time remains a major challenge in computer vision, particularly when scenes contain unseen or dynamic elements. Traditional models generalize poorly across environments and lack contextual enrichment from visual, textual, and knowledge-based data; scalability and computational efficiency are further barriers for large-scale real-time applications. This research proposes an adaptive scene understanding framework that integrates zero-shot learning with multimodal data (visual features, text, and knowledge graphs as external knowledge) to improve performance and adaptability. Embeddings from these modalities are mapped into a shared semantic space. Given an input image, visual features are extracted and passed to a zero-shot learning model, which uses the shared embeddings to identify unseen objects by semantic similarity. The zero-shot predictions are then passed to a generative model (Flan-T5), which produces a caption describing the scene. The system was designed, implemented, and evaluated with multimodal data and external knowledge from ConceptNet. It achieved a BLEU-4 score of 49.8%, a CIDEr score of 112.4, and a ROUGE-L score of 65.3%, demonstrating strong caption generation. ConceptNet integration improved contextual relevance, with a 94% query success rate and an average of 2.3 additional relevant concepts per caption. Further optimizations, including improved knowledge extraction, batch processing, and UI enhancements, are expected to increase efficiency and scalability for real-world multimodal AI applications. en_US
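The zero-shot step described in the abstract, matching an image against unseen labels by semantic similarity in a shared embedding space, can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the embeddings here are synthetic, and the function name `zero_shot_match` is an assumption; in the actual system the vectors would come from vision and text encoders mapped into one space.

```python
import numpy as np

def normalize(v):
    """L2-normalize vectors along the last axis so dot products give cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_match(image_emb, label_embs, labels):
    """Return the label whose embedding is closest (by cosine similarity)
    to the image embedding, together with that similarity score."""
    sims = normalize(label_embs) @ normalize(image_emb)
    best = int(np.argmax(sims))
    return labels[best], float(sims[best])

# Synthetic embeddings for illustration only; real embeddings would be
# produced by pretrained encoders sharing a semantic space.
rng = np.random.default_rng(0)
labels = ["zebra", "horse", "car"]
label_embs = rng.normal(size=(3, 8))
image_emb = label_embs[0] + 0.1 * rng.normal(size=8)  # image close to "zebra"

best_label, score = zero_shot_match(image_emb, label_embs, labels)
print(best_label, round(score, 3))
```

Because matching is done purely by nearest-neighbor search over label embeddings, labels never seen during training can be recognized simply by embedding their names, which is what makes the approach zero-shot.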
dc.language.iso en en_US
dc.subject Zero Shot Learning en_US
dc.subject Multimodal en_US
dc.subject Scene Understanding en_US
dc.subject Computer Vision en_US
dc.title AURIX: Augmented Multimodal Knowledge Integration for Adaptive Zero-Shot Scene Understanding en_US
dc.type Thesis en_US

