Abstract:
Depression is a significant public health issue with serious personal and societal consequences. Traditional methods for diagnosing depression, such as self-reported questionnaires and clinical interviews, rely on subjective assessments that may not accurately reflect an individual's emotional state. This study proposes a multimodal approach to depression detection, integrating audio, visual, and textual data to improve diagnostic accuracy and facilitate early intervention.
To address this challenge, this study developed a Multimodal Depression Detection System (D-DiTECT) that combines facial expression analysis using Convolutional Neural Networks (CNNs), audio processing with an MFCC-LSTM model, and sentiment analysis with a BERT-CNN model. Each modality captures distinct aspects of depressive symptoms, providing a more comprehensive evaluation.
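The abstract does not specify how the three modality branches are combined; one common approach for such systems is late fusion, where each branch produces a score and the scores are merged. The sketch below is purely illustrative, assuming hypothetical per-modality probabilities and a simple weighted average; the names `visual`, `audio`, and `text` and the uniform weighting are assumptions, not details from the paper.

```python
import numpy as np

def late_fusion(scores, weights=None):
    """Combine per-modality depression probabilities (0-1) by weighted average.

    scores: dict mapping a modality name to the probability output by that
    branch (e.g. a CNN on facial frames, an MFCC-LSTM on audio, a BERT-CNN
    on transcripts).
    weights: optional dict of fusion weights; defaults to uniform weighting.
    """
    if weights is None:
        weights = {m: 1.0 for m in scores}
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

# Hypothetical outputs from the three branches for one subject
scores = {"visual": 0.72, "audio": 0.58, "text": 0.80}
fused = late_fusion(scores)
print(round(fused, 2))  # 0.7
```

In practice the fusion weights could be learned jointly with the branches or tuned on a validation set; this sketch only shows the shape of the decision-level combination.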