Abstract:
"With continuous advancements in social technology, image/video and textual data growth has been rapid. While audio captioning has seen effective implementation, image captioning demands further meticulous attention in accurately captioning images, focusing on detecting small and overlapping objects and exotic fonts within images. These elements, often overlooked by current ICS, lead to generated captions that are inaccurate and lack detail. The omission of small and overlapping objects and certain text styles from captions reduces the captioning system's overall quality and risks conveying incorrect information to users.
A new approach is proposed to address the problem of detecting small and overlapping objects within an image. The approach incorporates depth estimation into Convolutional Neural Networks (CNNs) to improve the precision of object detection and thereby generate more accurate and detailed image captions. This methodology aims to bridge the gap in current image captioning technologies and provide a more comprehensive understanding of image content.
In evaluation, the model achieves 62% accuracy despite being trained on a small dataset. Captions produced using feature extraction, object detection, and depth estimation achieve a cosine similarity of 66%. These results demonstrate the effectiveness and reliability of the model."
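As context for the method summarized above, the following is a minimal sketch of how an estimated depth map could be fused with the RGB input of a CNN before detection. The module names, the four-channel fusion scheme, and the placeholder backbone are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DepthAugmentedDetector(nn.Module):
    """Illustrative CNN that fuses an estimated depth map with the RGB image.

    The depth estimator and the detection head are placeholders; the paper's
    actual architecture may differ.
    """

    def __init__(self, depth_estimator: nn.Module, num_classes: int = 80):
        super().__init__()
        self.depth_estimator = depth_estimator  # assumed pretrained monocular depth network
        # 4-channel input: RGB + estimated depth
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            depth = self.depth_estimator(rgb)   # (N, 1, H, W) estimated depth map
        x = torch.cat([rgb, depth], dim=1)      # fuse depth as a fourth input channel
        feats = self.backbone(x).flatten(1)
        return self.classifier(feats)           # per-class scores (box regression omitted)
```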
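The reported 66% figure is a cosine similarity between captions. The snippet below shows one common way to compute such a score, using TF-IDF vectors from scikit-learn; the paper may use different text embeddings, so this is only an illustration of the metric.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def caption_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between TF-IDF vectors of two captions (illustrative metric)."""
    vectors = TfidfVectorizer().fit_transform([generated, reference])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

# Example: compare a generated caption against a human-written reference caption.
print(caption_similarity(
    "a small dog sitting behind a large chair near a window",
    "a small dog partially hidden behind a chair by the window",
))
```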