dc.description.abstract |
"The management of Personally Identifiable Information (PII) within system and debugging logs
is crucial due to stringent data protection regulations such as GDPR and PDPA. These logs often
contain sensitive data that, if mishandled, can pose significant privacy risks. Existing
methods for PII detection in logs are often inadequate, which makes it difficult both to
protect privacy and to comply with these regulations. This research addresses these issues by
improving the detection and anonymization of PII in log data.
This research introduces SysPII, a prototype system designed to detect and anonymize PII in system
and debugging logs. The system combines AI-based detection with traditional rule-based
methods using the Presidio framework. Specifically, a DistilBERT transformer model, fine-tuned
and adapted for log data, is used to improve detection accuracy.
The implementation integrates these models within a Streamlit-based user interface to make the
system accessible to users.
SysPII was evaluated by benchmarking it against two Convolutional Neural
Network (CNN) models optimized for accuracy and efficiency, and the DistilBERT transformer
model outperformed both. SysPII achieved an accuracy of 0.97, a precision of 0.91, a recall
of 0.83, and an F1 score of 0.87, with a ROC AUC of 0.91. These results indicate that it is
effective at identifying and anonymizing PII within system logs, and they support its potential
for real-world applications and for compliance with data protection regulations.
" |
en_US |