Abstract:
"The majority of the previous works on the automated evaluation of the oral presentation were trained using traditional machine learning approaches due to the need for large labelled datasets. Due to this issue, deep learning approaches were underexplored in this domain, and it’s the biggest hurdle to improving the system’s performance and functionalities. Moreover, this system can greatly help students and public speakers who are in need of feedback to improve their skills. Hence, it is necessary to make the automated evaluation system to be more robust and efficient.
Because labelled data are scarce and the model must generalize across presentation styles, which differ from person to person, a novel approach based on multi-source transfer learning was proposed. The approach is multimodal: the text modality is trained with a Bi-LSTM and the audio modality with a CNN. A weighted soft-voting scheme ensembles the multiple sources and modalities.
The results showed that the proposed approach outperformed the baselines, improving accuracy by 1.11% over the baseline. Although the text modality underperformed on its own, the ensemble still improved overall accuracy. The results therefore suggest that the proposed architecture can enhance the accuracy of the system.
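To make the weighted soft-voting step concrete, here is a minimal sketch of how per-source, per-modality class probabilities could be combined. The model names, weights, and three-class setup are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical class probabilities (three proficiency classes) produced by
# each source/modality model for a single presentation sample.
predictions = {
    "text_bilstm_source1": np.array([0.20, 0.50, 0.30]),
    "text_bilstm_source2": np.array([0.10, 0.60, 0.30]),
    "audio_cnn_source1":   np.array([0.30, 0.40, 0.30]),
    "audio_cnn_source2":   np.array([0.25, 0.45, 0.30]),
}

# Assumed per-model weights (e.g. validation accuracy); normalized below.
weights = {
    "text_bilstm_source1": 0.80,
    "text_bilstm_source2": 0.70,
    "audio_cnn_source1":   0.90,
    "audio_cnn_source2":   0.85,
}

def weighted_soft_vote(predictions, weights):
    """Average the probability vectors, weighting each model's vote."""
    total = sum(weights[name] for name in predictions)
    combined = sum(weights[name] * p for name, p in predictions.items()) / total
    return combined, int(np.argmax(combined))  # fused probabilities, predicted class

probs, label = weighted_soft_vote(predictions, weights)
print(probs, label)
```

Weighting each model's probabilities before averaging lets a weaker modality (here, text) still contribute to the fused decision without dominating it.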