Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts

Dissanayake, Navinda

dc.contributor.author	Dissanayake, Navinda
dc.date.accessioned	2026-03-10T09:29:20Z
dc.date.available	2026-03-10T09:29:20Z
dc.date.issued	2025
dc.identifier.citation	Dissanayake, Navinda (2025) Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts. MSc. Dissertation, Informatics Institute of Technology	en_US
dc.identifier.issn	20220874
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/2900
dc.description.abstract	Problem: Large Language Models (LLMs) such as GPT have revolutionised digital applications but remain vulnerable to "jailbreak prompts" that bypass safety mechanisms and can lead to harmful outputs. Existing solutions are primarily internal and inaccessible to external developers. This project addresses that gap by creating a real-time detection system that developers can integrate into their applications. Methodology: A supervised learning approach was used to develop the detection system. A BERT-based sequence classification model was fine-tuned on a balanced dataset of 60,000 prompts, using tokenization and contextual embeddings to preprocess and represent the prompts. The system's architecture includes a REST API built with FastAPI, which provides real-time classification, and a web interface for submitting prompts and visualising results. This design supports integration into existing workflows and makes the system accessible for testing and deployment. Results: The MVP achieved 86.93% accuracy, with precision of 86.81% and recall of 87.07%. When integrated with the API, the model classified prompts in real time, giving developers a way to analyse prompts before they reach an LLM.	en_US
dc.language.iso	en	en_US
dc.subject	Large Language Models	en_US
dc.subject	Jailbreak Prompt Detection	en_US
dc.subject	LLM Safety	en_US
dc.title	Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts	en_US
dc.type	Thesis	en_US
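
The abstract reports accuracy, precision and recall for a binary classifier (jailbreak vs. benign). As a minimal sketch of how these three figures are derived from predictions, the Python below computes them from lists of labels. The function name `binary_metrics`, the `Metrics` container, and the label convention (1 = jailbreak, 0 = benign) are assumptions made for illustration and do not come from the dissertation; the sample labels are invented and are not the study's data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Metrics:
    accuracy: float
    precision: float
    recall: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Metrics:
    """Compute accuracy, precision and recall for a binary classifier.

    Assumed label convention (illustrative): 1 = jailbreak, 0 = benign.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # jailbreak correctly flagged
        elif pred == positive:
            fp += 1  # benign prompt wrongly flagged
        elif truth == positive:
            fn += 1  # jailbreak missed
        else:
            tn += 1  # benign prompt correctly passed

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Metrics(accuracy, precision, recall)


# Invented labels for illustration only (not the dissertation's data).
example_true = [1, 1, 1, 0, 0, 0]
example_pred = [1, 1, 0, 0, 0, 1]
example_metrics = binary_metrics(example_true, example_pred)
```

Precision measures how many flagged prompts were real jailbreaks, while recall measures how many real jailbreaks were flagged. On a balanced dataset such as the one described, the two can be read alongside accuracy without correcting for class imbalance.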