Design and Implementation of a Machine Learning-Based Detection System for Large Language Model Jailbreak Prompts

Dissanayake, Navinda

URI: http://dlib.iit.ac.lk/xmlui/handle/123456789/2900

Date: 2025

Abstract:

Problem: Large Language Models (LLMs) such as GPT have revolutionised digital applications but remain vulnerable to "jailbreak prompts" that bypass their safety mechanisms and can elicit harmful outputs. Existing solutions are primarily internal and inaccessible to external developers. This project addresses that gap with a real-time detection system that developers can integrate into their own applications. Methodology: The detection system was developed using a supervised learning approach. A BERT-based sequence classification model was fine-tuned on 60,000 balanced prompts, which were tokenised and represented as contextual embeddings before classification. The system's architecture comprises a REST API built with FastAPI, which provides real-time classification, and a web interface for submitting prompts and visualising results. This design allows the system to be integrated into existing workflows and keeps it accessible for testing and deployment. Results: The minimum viable product (MVP) achieved 86.93% accuracy, with precision of 86.81% and recall of 87.07%. When integrated with the API, the model demonstrated strong real-time detection, giving developers actionable insights and secure prompt analysis.
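For readers less familiar with the three metrics reported above, the following is a minimal sketch of how accuracy, precision and recall are conventionally computed for a binary jailbreak classifier. It is not the author's evaluation code: the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are assumptions made purely for illustration, and the label convention (1 = jailbreak, 0 = benign) is likewise assumed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BinaryMetrics:
    """Hypothetical container for the three metrics named in the abstract."""
    accuracy: float
    precision: float
    recall: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> BinaryMetrics:
    """Compute accuracy, precision and recall for the positive class.

    Assumed convention: 1 = jailbreak prompt, 0 = benign prompt.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # jailbreaks caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign prompts flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # jailbreaks missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign prompts passed

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return BinaryMetrics(accuracy, precision, recall)


# Toy labels invented for illustration only (not data from the thesis).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
```

In this framing, precision is the share of flagged prompts that really were jailbreaks, while recall is the share of jailbreaks that were flagged. The near-equal precision and recall in the abstract indicate that false positives and false negatives occurred at similar rates.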
