Abstract:
Emotion recognition in conversations (ERC) plays a critical role in building emotionally
intelligent systems for mental health monitoring, virtual assistants, and human-computer
interaction (HCI). However, the majority of existing ERC models struggle to detect emotion
shifts — the subtle, dynamic changes in emotional states that occur naturally in multi-party
conversations (MPCs). Recognizing these shifts is challenging because they are highly context-dependent and because dataset imbalance allows dominant emotions such as "neutral" to overshadow minority classes, ultimately limiting overall system performance.
This study proposes EmoShiftNet, a multi-task learning (MTL) framework designed to jointly perform emotion classification and emotion shift detection. The framework integrates multimodal features: contextualized text embeddings from a BERT transformer; acoustic features such as MFCCs, pitch, and loudness; and temporal features such as utterance duration and pause intervals. Attention-based fusion and a customized multi-task loss function enhance the model's sensitivity to both static emotional states and dynamic emotional transitions across conversation turns.
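The abstract does not give the exact form of the customized multi-task loss. A common formulation for jointly training a classification task and an auxiliary binary task is a weighted sum of the per-task losses; the sketch below illustrates that idea with NumPy. The weights `alpha` and `beta` and both loss terms are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def cross_entropy(probs, label):
    # Negative log-likelihood of the true emotion class,
    # given a softmax probability vector over emotion labels.
    return -np.log(probs[label])

def binary_cross_entropy(p, y):
    # Standard binary cross-entropy for the shift / no-shift decision.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def multitask_loss(emotion_probs, emotion_label, shift_prob, shift_label,
                   alpha=0.7, beta=0.3):
    # Weighted sum of the two task losses; alpha and beta are
    # hypothetical task weights, not values reported in the study.
    return (alpha * cross_entropy(emotion_probs, emotion_label)
            + beta * binary_cross_entropy(shift_prob, shift_label))
```

Training on such a joint objective is what lets the auxiliary shift-detection signal shape the shared representation used for emotion classification.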
In experiments on the Multimodal EmotionLines Dataset (MELD),
EmoShiftNet achieves 70.03% emotion recognition accuracy while demonstrating
significant improvements in F1-score and minority class detection compared to traditional
single-task baselines. Ablation studies further validate that incorporating emotion shift
detection as an auxiliary task improves the contextual understanding and robustness of ERC
systems, highlighting the critical role of modeling emotional dynamics in conversational AI.