Sinhala Adult Content Detector

Sirisena, Pramuditha

dc.contributor.author	Sirisena, Pramuditha
dc.date.accessioned	2026-03-10T11:16:34Z
dc.date.available	2026-03-10T11:16:34Z
dc.date.issued	2025
dc.identifier.citation	Sirisena, Pramuditha (2025) Sinhala Adult Content Detector. MSc Dissertation, Informatics Institute of Technology	en_US
dc.identifier.issn	20221563
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/2907
dc.description.abstract	This research addressed the urgent issue of online safety by developing a system to detect and classify Sinhala adult-only textual content using natural language processing (NLP) and machine learning (ML). Given the increasing presence of inappropriate content on Sinhala digital platforms, especially on social media and in video comment sections, the study aimed to create a robust detection pipeline tailored to a low-resource language setting. A custom-labeled dataset of 1,500 Sinhala text samples was compiled and annotated for binary classification. The implementation used a deep learning (DL) model with an embedding layer followed by fully connected layers optimized for binary text classification. Text preprocessing involved cleaning Sinhala characters, tokenizing with custom logic built on NLTK, and generating a vocabulary dictionary dynamically from the dataset. Inputs were transformed into padded sequences of integer tokens. The model was trained using cross-entropy loss and the Adam optimizer. Hyperparameters such as hidden layer size, batch size, learning rate, and number of epochs were tuned through iterative experimentation. The system was trained on 70% of the dataset and validated on the remaining 30%, with performance assessed using classification reports, confusion matrices, and AUC-ROC curves. To enhance usability and accessibility, the solution was deployed as a web-based system with three user interfaces (UIs). The first UI collected YouTube comments using Google Developer APIs, while the second allowed administrators to upload datasets and trigger model training and evaluation. The third UI, built for public users, enabled real-time classification of Sinhala text, documents, web pages, and audio/video.
The final system demonstrated reliable classification accuracy, offering a scalable and ethical tool to support digital content moderation in Sinhala and thereby contributing to both the technical and linguistic domains of adult content detection research.	en_US
dc.language.iso	en	en_US
dc.subject	Adult Content Detection	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Sinhala Text Classification	en_US
dc.title	Sinhala Adult Content Detector	en_US
dc.type	Thesis	en_US
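
The preprocessing steps named in the abstract (cleaning Sinhala characters, tokenizing, building a vocabulary from the dataset, and padding integer sequences) can be illustrated with a minimal sketch. This is not the dissertation's code: the abstract mentions custom logic built with NLTK, whereas this example uses a plain regular expression and whitespace splitting as a stand-in. The function names, the `<pad>` and `<unk>` tokens, and the choice to keep only the Sinhala Unicode block are all assumptions made for illustration.

```python
import re

# Assumption: "cleaning" means keeping only characters from the Sinhala
# Unicode block (U+0D80..U+0DFF), the zero-width joiner used in Sinhala
# conjuncts (U+200D), and whitespace. Everything else becomes a space.
NON_SINHALA = re.compile(r"[^\u0D80-\u0DFF\u200D\s]")

# Hypothetical reserved tokens for padding and out-of-vocabulary words.
PAD, UNK = "<pad>", "<unk>"


def clean(text):
    """Drop non-Sinhala characters and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", NON_SINHALA.sub(" ", text)).strip()


def tokenize(text):
    """Whitespace tokenizer applied to the cleaned text."""
    return clean(text).split()


def build_vocab(texts):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of ids, truncating or right-padding."""
    ids = [vocab.get(token, vocab[UNK]) for token in tokenize(text)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Usage: build the vocabulary from a tiny corpus, then encode new text.
corpus = ["මම ගෙදර යනවා"]
vocab = build_vocab(corpus)
encoded = encode("මම hello යනවා 123", vocab, max_len=5)
```

Fixed-length integer sequences of this kind are the usual input to an embedding layer, with the padding id reserved so that padded positions can be distinguished from real tokens.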