Digital Repository

Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts

dc.contributor.author Dissanayake, Navinda
dc.date.accessioned 2026-03-10T09:29:20Z
dc.date.available 2026-03-10T09:29:20Z
dc.date.issued 2025
dc.identifier.citation Dissanayake, Navinda (2025) Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts. MSc Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20220874
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/2900
dc.description.abstract Problem: Large Language Models (LLMs) such as GPT have revolutionised digital applications but remain vulnerable to "jailbreak prompts" that bypass safety mechanisms and can lead to harmful outputs. Existing detection solutions are primarily internal to model providers and inaccessible to external developers. This project addresses that gap by creating a real-time detection system that developers can integrate into their applications. Methodology: A supervised learning approach was used to develop the detection system. A BERT-based sequence classification model was fine-tuned on 60,000 balanced prompts, with tokenization and contextual embeddings used to preprocess and transform the data. The system's architecture includes a REST API built with FastAPI, which provides real-time classification, and a web interface for prompt submission and result visualization. This design is intended to integrate into existing workflows and to be accessible for testing and deployment. Results: The MVP achieved 86.93% accuracy, with precision at 86.81% and recall at 87.07%. Integrated with the API, the model classified prompts in real time, giving developers a classification result for each submitted prompt. en_US
dc.language.iso en en_US
dc.subject Large Language Models en_US
dc.subject Jailbreak Prompt Detection en_US
dc.subject LLM Safety en_US
dc.title Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts en_US
dc.type Thesis en_US
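
The accuracy, precision and recall figures quoted in the abstract follow the standard binary-classification definitions. The sketch below is illustrative only and is not taken from the dissertation: the `ConfusionCounts` class, the function names and the toy counts are all assumptions made for the example, and the F1 value is derived here from the reported precision and recall rather than stated in the record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # jailbreak prompts correctly flagged
    fp: int  # benign prompts wrongly flagged
    tn: int  # benign prompts correctly passed
    fn: int  # jailbreak prompts missed


def accuracy(c: ConfusionCounts) -> float:
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: ConfusionCounts) -> float:
    actual_positive = c.tp + c.fn
    return c.tp / actual_positive if actual_positive else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy counts for illustration; NOT the dissertation's confusion matrix.
toy = ConfusionCounts(tp=8, fp=2, tn=7, fn=3)
toy_accuracy = accuracy(toy)
toy_precision = precision(toy)
toy_recall = recall(toy)

# F1 derived from the precision (86.81%) and recall (87.07%) reported in
# the abstract; the abstract itself does not state an F1 score.
derived_f1 = f1_score(0.8681, 0.8707)

if __name__ == "__main__":
    print(f"toy accuracy={toy_accuracy:.4f} "
          f"precision={toy_precision:.4f} recall={toy_recall:.4f}")
    print(f"F1 derived from reported precision/recall: {derived_f1:.4f}")
```

Because the reported precision and recall are close to each other, the derived F1 lands between them, which is consistent with a model trained on a balanced dataset.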

