Digital Repository

Sinhala Adult Content Detector

dc.contributor.author Sirisena, Pramuditha
dc.date.accessioned 2026-03-10T11:16:34Z
dc.date.available 2026-03-10T11:16:34Z
dc.date.issued 2025
dc.identifier.citation Sirisena, Pramuditha (2025) Sinhala Adult Content Detector. MSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20221563
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/2907
dc.description.abstract This research addressed the urgent issue of online safety by developing a system to detect and classify Sinhala adult-only textual content using natural language processing (NLP) and machine learning (ML). Given the increasing presence of inappropriate content on Sinhala digital platforms, especially across social media and video comment sections, the study aimed to create a robust detection pipeline tailored to a low-resource language environment. A custom-labeled dataset of 1,500 Sinhala text samples was compiled and annotated for binary classification. The implementation used a deep learning (DL) model with an embedding layer followed by fully connected layers optimized for binary text classification. Text preprocessing involved cleansing Sinhala characters, tokenizing with custom logic built on NLTK, and generating a vocabulary dictionary dynamically from the dataset. Inputs were transformed into padded sequences of integer tokens. The model was trained using cross-entropy loss and the Adam optimizer. Hyperparameters such as hidden layer size, batch size, learning rate, and number of epochs were tuned through iterative experimentation. The system was trained on 70% of the dataset and validated on the remaining 30%, with performance assessed using classification reports, confusion matrices, and AUC-ROC curves. To enhance usability and accessibility, the solution was deployed as a web-based system with three user interfaces. The first UI collected YouTube comments using Google Developer APIs, while the second allowed administrators to upload datasets and trigger model training and evaluation. The third UI, built for public users, enabled real-time classification of Sinhala text, documents, web pages, and audio/video.
The final system demonstrated reliable classification accuracy, offering a scalable and ethical tool to support digital content moderation in Sinhala, thereby contributing significantly to both the technical and linguistic domains of adult content detection research. en_US
dc.language.iso en en_US
dc.subject Adult Content Detection en_US
dc.subject Natural Language Processing en_US
dc.subject Sinhala Text Classification en_US
dc.title Sinhala Adult Content Detector en_US
dc.type Thesis en_US
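
The preprocessing steps named in the abstract (cleansing Sinhala characters, tokenizing, building a vocabulary from the dataset, converting text to padded integer sequences, and a 70/30 split) could look roughly like the sketch below. It is an illustration under stated assumptions, not the dissertation's code: the whitespace split stands in for the NLTK-based custom tokenizer, and the function names, the reserved `<pad>`/`<unk>` ids, the right-padding choice, and the sample sentences are all hypothetical.

```python
import random
import re

# Assumption: "cleansing" means keeping only Sinhala-block characters
# (U+0D80-U+0DFF), the zero-width joiner used in Sinhala conjuncts,
# and whitespace. Everything else is replaced by a space.
NON_SINHALA = re.compile(r"[^\u0D80-\u0DFF\u200D\s]")

PAD, UNK = 0, 1  # reserved ids (hypothetical convention for this sketch)


def tokenize(text):
    """Clean the text and split on whitespace (stand-in for the NLTK-based logic)."""
    return NON_SINHALA.sub(" ", text).split()


def build_vocab(texts):
    """Assign ids to tokens in order of first appearance, after the reserved ids."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with PAD."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def split_70_30(samples, seed=0):
    """Shuffle a copy of the samples and split it 70% / 30%."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Neutral placeholder sentences; punctuation and digits are stripped.
texts = ["මම පොතක් කියවමි!", "මම ගෙදර යමි 123"]
vocab = build_vocab(texts)
encoded = [encode(t, vocab, max_len=5) for t in texts]

# Placeholder indices standing in for the 1,500 labeled samples.
train, val = split_70_30(list(range(1500)))
```

Tokens that were not seen when the vocabulary was built map to the `<unk>` id, so text submitted at classification time can still be encoded to a fixed length before reaching the embedding layer.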