Abstract:
"This document introduces a solution to the challenge of static music recommendations in
dynamic environments, particularly in crowded gatherings. Traditional music recommendation
systems (MRS) often fall short in these settings by neglecting real-time context, such as the
ambiance in a room, leading to suboptimal user experiences. To address this issue, a novel MRS
is proposed, leveraging both video and audio inputs to comprehend the ambiance and deliver
context-aware music suggestions.
The methodology combines image processing and audio analysis. Image frames are first classified
with 3D Convolutional Neural Networks (CNNs) to identify ambient illumination and room settings.
To capture the temporal and spectral characteristics of the audio data, it is simultaneously
transformed into Mel spectrograms and Mel-frequency cepstral coefficients (MFCCs). The features
extracted from the image frames and the audio are then fused at an early stage to combine visual
and auditory information. Further layers process the fused representation to ascertain the
atmosphere of the room, and the suggested music selection is determined on the basis of this
analysis.
The initial prototype demonstrates promising results in ambiance detection, although, as expected,
performance is constrained by dataset limitations. This project represents a significant advancement in MRS by incorporating
visual and auditory elements. Future enhancements are planned to further refine the model's
performance."