Abstract:
With the steady growth of user-generated content on the internet, the amount of hateful content is also increasing rapidly. Social media sites, review forums, and microblogging sites encourage users to convey their thoughts with minimal restrictions. This freedom can lead users to express hate towards others who do not share their beliefs. This study focuses on identifying hate speech in texts written in Sinhala-English code-mixed language (Singlish), which is widely used by Sri Lankans on the internet. Because no Sinhala-English code-mixed dataset was available, a dataset was created from comments on YouTube and Facebook. Eight machine learning algorithms and three ensemble approaches were evaluated for detecting hate speech in Singlish, using accuracy, precision, recall, and F1-score as performance measures. Based on the performance of the individual algorithms, the Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB), AdaBoost, and Logistic Regression classifiers were selected as base learners for the ensemble solutions. The ensemble approaches evaluated were soft voting, hard voting, and stacking. The hard voting approach outperformed all baseline algorithms and the other ensemble approaches, achieving an accuracy and F1-score of 84%.
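The difference between the hard and soft voting schemes mentioned above can be illustrated with a minimal sketch. This is not the study's implementation: the function names `hard_vote` and `soft_vote`, the labels, and the probability values are hypothetical and chosen only for illustration, and stacking is not shown.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties go to the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average per-class probabilities across classifiers and return the top class."""
    classes = probabilities[0].keys()
    averaged = {
        c: sum(p[c] for p in probabilities) / len(probabilities)
        for c in classes
    }
    return max(averaged, key=averaged.get)

# Hypothetical outputs of four base classifiers for a single comment
# (made-up values, not taken from the study's data).
labels = ["hate", "not_hate", "hate", "hate"]
probs = [
    {"hate": 0.55, "not_hate": 0.45},
    {"hate": 0.10, "not_hate": 0.90},
    {"hate": 0.60, "not_hate": 0.40},
    {"hate": 0.55, "not_hate": 0.45},
]

hard_result = hard_vote(labels)  # majority of predicted labels
soft_result = soft_vote(probs)   # class with the highest mean probability
print(hard_result, soft_result)
```

In this made-up case the two schemes disagree: three of the four classifiers predict "hate", so hard voting returns "hate", while one very confident "not_hate" prediction pulls the averaged probabilities the other way under soft voting.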