Abstract:
This research addressed the urgent issue of online safety by developing a system to detect and classify Sinhala adult-only textual content using natural language processing (NLP) and machine learning (ML). Given the increasing presence of inappropriate content on Sinhala digital platforms, especially in social media posts and video comment sections, the study aimed to create a robust detection pipeline tailored to a low-resource language environment. A custom-labeled dataset of 1,500 Sinhala text samples was compiled and annotated for binary classification.
The implementation used a deep learning (DL) model consisting of an embedding layer followed by fully connected layers, optimized for binary text classification.
Text preprocessing involved cleaning the Sinhala text, tokenizing it with custom logic built on NLTK, and generating a vocabulary dictionary dynamically from the dataset. Inputs were transformed into padded sequences of integer tokens. The model was trained using cross-entropy loss and the Adam optimizer. Hyperparameters such as hidden layer size, batch size, learning rate, and number of epochs were tuned through iterative experimentation. The system was trained on 70% of the dataset and validated on the remaining 30%, with performance assessed using classification reports, confusion matrices, and AUC-ROC curves.
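The preprocessing steps described above (cleaning, tokenizing, building a vocabulary, and padding integer sequences) can be illustrated with a minimal sketch. It is not the study's implementation: it uses a plain regular expression in place of the NLTK-based custom logic, and the function names, the reserved ids for padding and unknown tokens, and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: "cleaning" means keeping only characters from the Sinhala
# Unicode block (U+0D80-U+0DFF) plus the zero-width joiner used in conjuncts.
NON_SINHALA = re.compile(r"[^\u0D80-\u0DFF\u200D]+")

PAD, UNK = 0, 1  # hypothetical reserved ids for padding and unknown tokens


def tokenize(text):
    """Replace every run of non-Sinhala characters with a space, then split."""
    return NON_SINHALA.sub(" ", text).split()


def build_vocab(texts, min_freq=1):
    """Build a token-to-id dictionary from the dataset, most frequent first."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok, n in counts.most_common():
        if n >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Map tokens to integer ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # Made-up sample sentences, not drawn from the study's dataset.
    samples = ["මම ගෙදර යනවා!", "ඔයා ගෙදර 123 එනවා"]
    vocab = build_vocab(samples)
    print(encode(samples[0], vocab, max_len=5))  # [3, 2, 4, 0, 0]
```

Reserving fixed ids for padding and unknown tokens keeps the embedding layer's input range stable when unseen words appear at inference time; whether the original system handled unknown words this way is not stated in the source.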
To enhance usability and accessibility, the solution was deployed as a web-based system with three user interfaces. The first UI collected YouTube comments using Google Developer APIs, while the second allowed administrators to upload datasets and trigger model training and evaluation. The third UI, built for public users, enabled real-time classification of Sinhala text, documents, web pages, and audio/video. The final system demonstrated reliable classification accuracy, offering a scalable and ethical tool to support digital content moderation in Sinhala and contributing to both the technical and linguistic domains of adult content detection research.