Abstract:
"One of the limitations of the current search functionalities embedded into existing operating system file explorers is that they are restricted to metadata-based searches, such as file names and creation dates. This means that these search systems cannot retrieve images based on their content, such as the semantic meaning or the visual features of the image. This poses a challenge for users who want to find relevant photos without manually labeling or organizing them.
In this research, the author proposes an encoder-decoder model for the image captioning task. The encoder uses a pre-trained ResNet-50 network in which only the final few layers are fine-tuned during training. The ResNet features are projected into an embedding space by a linear layer, followed by batch normalization and dropout. The decoder is an LSTM that takes the encoder embedding and generates the caption word by word, with a linear layer mapping the LSTM outputs to vocabulary scores. During inference, beam search is used to select caption tokens from the output probabilities.
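To make the architecture concrete, the following is a minimal PyTorch sketch of the described encoder-decoder. The embedding size, hidden size, dropout rate, and the choice of which ResNet block to unfreeze are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class EncoderCNN(nn.Module):
    """Pre-trained ResNet-50 encoder with only the last block fine-tuned."""

    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the 2048-d pooled features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for param in self.backbone.parameters():
            param.requires_grad = False
        # Unfreeze only the last residual block (an assumption; the paper
        # says "the final few layers" without naming them).
        for param in resnet.layer4.parameters():
            param.requires_grad = True
        self.embed = nn.Linear(resnet.fc.in_features, embed_size)
        self.bn = nn.BatchNorm1d(embed_size)
        self.dropout = nn.Dropout(p=0.5)  # rate is illustrative

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.backbone(images).flatten(1)          # (B, 2048)
        return self.dropout(self.bn(self.embed(features)))   # (B, embed_size)


class DecoderRNN(nn.Module):
    """LSTM decoder that emits the caption one word at a time."""

    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # The image embedding serves as the first input step of the sequence.
        embeddings = self.word_embed(captions[:, :-1])                  # (B, T-1, E)
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)  # (B, T, E)
        hiddens, _ = self.lstm(inputs)                                  # (B, T, H)
        return self.fc(hiddens)                                         # (B, T, V) word scores
```

At inference time, the decoder is instead unrolled step by step, keeping the top-k partial captions at each step (beam search) rather than teacher-forcing ground-truth words as above.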
The custom image captioning model shows moderate performance across the evaluation metrics. It achieves a BLEU score of 0.604, indicating good overlap between individual words in the generated and reference captions. The METEOR score of 0.188 and ROUGE_L score of 0.445 show that the generated captions reflect moderate alignment with human-like semantic accuracy and sentence structure. The CIDEr score of 0.526 indicates that the captions capture reference content to an average degree, while the SPICE score of 0.113 suggests room for improvement in semantic fidelity and image understanding.
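The scores above are the standard COCO caption metrics. The following sketch shows how such an evaluation is typically run with the pycocoevalcap package; the toolkit choice and the example captions are assumptions, as the paper does not name its evaluation code.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice      # requires a Java runtime

# Hypothetical example data: image id -> list of tokenized, lowercased captions.
references = {"img1": ["a dog runs across a grassy field",
                       "a brown dog running on grass"]}
candidates = {"img1": ["a dog running in the grass"]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE_L"),
    (Cider(), "CIDEr"),
    (Spice(), "SPICE"),
]
for scorer, name in scorers:
    score, _ = scorer.compute_score(references, candidates)
    if isinstance(name, list):  # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s:.3f}")
    else:
        print(f"{name}: {score:.3f}")
```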