Abstract:
Problem: Large Language Models (LLMs) such as GPT have revolutionised digital
applications but remain vulnerable to ""jailbreak prompts"" that bypass safety mechanisms,
potentially causing harmful outputs. Existing solutions are primarily internal and inaccessible to external developers. This project addresses these gaps by creating a real-time detection system for developers to integrate into their applications.
Methodology: A supervised learning approach was implemented to develop the detection
system. A BERT-based sequence classification model was fine-tuned on 60,000 balanced
prompts, utilizing advanced tokenization and contextual embeddings to preprocess and
transform the data. The system's architecture includes a REST API built using FastAPI,
providing real-time classification capabilities, and a web interface that allows prompt
submission and result visualization. This design ensures seamless integration into existing
workflows and accessibility for testing and deployment.
Results: The MVP achieved 86.93% accuracy, with precision at 86.81% and recall at 87.07%.
The model demonstrated strong real-time detection when integrated with the API, offering
developers actionable insights and secure prompt analysis.