dc.description.abstract |
"Emotional recognition in speech is essential for enhancing Human-Computer Interaction. Over the past few decades, researchers have employed many methodologies to construct a model for SER system. The effectiveness of SER is significantly influenced by the magnitude of the dataset. The algorithm's ability to learn more variations from the input data helps to prevent overfitting. Nevertheless, acquiring a suitable dataset for training purposes poses further challenges. In practical situations, there will inevitably be ambient noise, and various emotional states might show various levels of variability. However, the majority of the datasets used for speech emotion detection are obtained from controlled studio environments devoid of any background noise. Additionally, these datasets feature performances by skilled professional actors. There are a few natural datasets available, but they can only be accessed under a permission agreement.
To enhance the model's generalizability, we merged multiple datasets for speech emotion recognition: RAVDESS, CREMA-D, TESS, and SAVEE. After merging these datasets, we selected the neutral, happy, sad, angry, fear, and disgust emotion categories for the model. The merged dataset was then balanced, and data augmentation techniques were applied, including pitch modification and stretching of the voice signal.
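For illustration, a minimal sketch of the pitch-shift and time-stretch augmentation steps is given below, assuming librosa as the audio-processing library (the abstract does not name one); the augment function and the shift/stretch factors are hypothetical choices, not the study's actual parameters.

```python
# Sketch only: assumes librosa; the n_steps and rate values are illustrative,
# not the parameters used in the study.
import librosa

def augment(path, n_steps=2, rate=1.1):
    """Return pitch-shifted and time-stretched variants of one utterance."""
    y, sr = librosa.load(path, sr=None)  # load waveform at its native sample rate
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # pitch modification
    stretched = librosa.effects.time_stretch(y, rate=rate)            # stretch the voice signal
    return pitched, stretched
```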
Based on initial test results, our model achieved an accuracy of 72% across six distinct emotions, with 1923 data points per emotion category.
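Since the keywords point to a convolutional neural network but the abstract does not describe the architecture, the following is only a hedged sketch of a small six-class CNN classifier over fixed-length feature vectors (e.g., per-utterance MFCC statistics); the feature dimension, layer sizes, and hyperparameters are assumptions, not the reported model.

```python
# Illustrative six-class CNN sketch; architecture and hyperparameters are assumed,
# not taken from the study.
from tensorflow.keras import layers, models

def build_model(n_features=40, n_classes=6):
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),                    # one feature vector per utterance
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        layers.GlobalAveragePooling1D(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),          # one output per emotion class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```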
Subject Descriptions
Machine Learning; Low Diversity Dataset; High Accuracy but Low Generalizability; Improve Dataset with Combining; Balance the Dataset; Data Augmentation Techniques; Improve the Generalizability
Keywords
Data Augmentation, Convolutional Neural Network
" |
en_US |