Abstract:
This research aims to detect offensive words in images containing Sinhala text, particularly words that are covered with extraneous ("jargon") characters. It first describes the problem domain, highlighting the prevalence of abusive language on social media and its appearance in memes, and then defines the problem statement, emphasizing the need to close the loopholes that "memers" exploit to hide offensive content. The research question, the identified research gap, the novelty of the work in this domain, and its potential contributions are presented. The work aims to contribute to the body of knowledge by developing a robust system for detecting and flagging offensive words in Sinhala images. The proposed system detects crossed letters in an image and extracts the clean text; an abusive language detection model then classifies that text as offensive or not offensive. A sequential CNN was implemented to detect the crossed letters, and K-Nearest Neighbor, Random Forest, and Support Vector Machine algorithms were used to detect offensive language in text form. Trained on a limited dataset, the sequential CNN achieved an accuracy of 96%, while the K-Nearest Neighbor, Random Forest, and Support Vector Machine algorithms achieved accuracies of 88%, 86%, and 83%, respectively.
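
As a rough illustration of the text classification stage only, the sketch below implements a small K-Nearest Neighbor classifier using the Python standard library. It is not the system evaluated in this research. The feature representation (character bigrams compared by cosine similarity), the function names, the choice of k = 3, and the English placeholder sentences are all assumptions made for illustration; the actual features, hyperparameters, and Sinhala dataset are not specified in this abstract.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a space-padded string."""
    padded = f" {text} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(text, train, k=3):
    """Label `text` by majority vote among its k most similar training items."""
    query = char_ngrams(text)
    ranked = sorted(
        train,
        key=lambda item: cosine(query, char_ngrams(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical placeholder data (assumption): "badword" stands in for an
# offensive term. The study itself uses Sinhala text extracted from images.
TRAIN = [
    ("you are a badword", "offensive"),
    ("such a badword person", "offensive"),
    ("total badword behaviour", "offensive"),
    ("have a nice day", "not offensive"),
    ("see you tomorrow", "not offensive"),
    ("thank you very much", "not offensive"),
]

if __name__ == "__main__":
    print(knn_predict("what a badword", TRAIN))
```

In a full pipeline, the input to `knn_predict` would be the clean text recovered from the image after the crossed letters have been detected. Random Forest and Support Vector Machine classifiers would be applied to the same kind of text features.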