Abstract:
"This study proposes a novel system for identifying hate speech in transliterated text that mixes Sinhala and English. The intricacy of the language and the prevalence of code-mixing on social media platforms make hate speech in such text difficult to identify.
The proposed system uses two pre-trained transformer models to detect hate speech in Sinhala-English code-mixed text, which is first transliterated and then used to train a hate speech detection model. The approach consists of three components: a pre-processing module, a transliteration module, and a hate speech detection module. These components work together to process the input text, transliterate it into Sinhala, and classify it for hate speech content.
The approach employs an aggregated Sinhala-English code-mixed dataset with hate speech annotations and uses a pre-trained transformer model to detect hate speech content. The proposed solution outperforms the existing benchmarks for identifying hate speech in Sinhala-English code-mixed language, achieving over 92% in Precision, Recall, and F1-score. The system can be readily adapted to other low-resource code-mixed languages and can aid in identifying hate speech content on social media sites."
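
The three-component flow described in the abstract (pre-processing, transliteration, hate speech detection) could be wired together roughly as in the sketch below. This is only an illustration of the data flow, not the authors' implementation: the function names, the three-entry `TOY_LEXICON`, and the token-lookup `classify` are all hypothetical stand-ins. In the described system, the transliteration and classification steps would each be backed by a pre-trained transformer model rather than by a lookup table.

```python
import re
from typing import Dict, FrozenSet

# Hypothetical cleaning rule: strip URLs and @-mentions.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")

# Hypothetical toy lexicon mapping romanized Sinhala tokens to Sinhala script.
# A real transliteration module would be a trained model, not a dictionary.
TOY_LEXICON: Dict[str, str] = {
    "mama": "මම",
    "oya": "ඔයා",
    "kohomada": "කොහොමද",
}


def preprocess(text: str) -> str:
    """Lowercase, drop URLs and mentions, and collapse whitespace."""
    text = URL_OR_MENTION.sub(" ", text.lower())
    return " ".join(text.split())


def transliterate(text: str, lexicon: Dict[str, str] = TOY_LEXICON) -> str:
    """Replace known romanized tokens; leave unknown tokens unchanged."""
    return " ".join(lexicon.get(tok, tok) for tok in text.split())


def classify(text: str, flagged: FrozenSet[str]) -> str:
    """Placeholder classifier: a token lookup standing in for a
    fine-tuned transformer model."""
    return "hate" if any(tok in flagged for tok in text.split()) else "not_hate"


def detect(text: str, flagged: FrozenSet[str]) -> str:
    """Run the three stages in the order given in the abstract."""
    return classify(transliterate(preprocess(text)), flagged)
```

Keeping each stage as a separate function with a string-in, string-out interface means a stand-in can later be swapped for a model-backed component without changing `detect`.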