Using Machine Learning to Identify and Categorize Personally Identifiable Information and Payment Card Industry Data in Textual Content

Arambawela, Milinda; Aponso, Achala

dc.contributor.author	Arambawela, Milinda
dc.contributor.author	Aponso, Achala
dc.date.accessioned	2025-04-23T07:42:54Z
dc.date.available	2025-04-23T07:42:54Z
dc.date.issued	2024
dc.identifier.citation	Arambawela, M. and Aponso, A. (2024) ‘Using Machine Learning to Identify and Categorize Personally Identifiable Information and Payment Card Industry Data in Textual Content’, in 2024 4th International Conference on Advanced Research in Computing (ICARC), pp. 201–205. Available at: https://doi.org/10.1109/ICARC61713.2024.10499783.	en_US
dc.identifier.uri	https://ieeexplore.ieee.org/document/10499783
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/2266
dc.description.abstract	The advent of the Internet has significantly streamlined daily tasks through the rapid increase of online services. Everyday activities, such as purchasing goods and scheduling appointments with healthcare professionals, have become faster, more efficient and more user-friendly with the integration of the Internet. The continuous improvement of online services has led many people to move towards digital activities. As a result, more personal and payment transaction data is being recorded across various storage mediums, including databases and log files. The protection and regulation of this sensitive data are imperative, in line with the requirements of GDPR and PCI DSS. Recognizing exposed personal data poses a considerable challenge. This research introduces a novel approach to identifying payment card industry (PCI) data and personally identifiable information (PII). The research project proposes a machine learning-based text classification model built on a Convolutional Neural Network (CNN) to discern PII and PCI data within a given text. The CNN model was constructed and compared against Naive Bayes, Gradient Boost, Random Forest, and Support Vector Machine (SVM) models. The CNN model achieved the highest accuracy at 0.96 (96%). The F1 scores for each class were also high, with PII scoring 0.94, PCI scoring 0.95, and Normal scoring 0.99. After the model was constructed and trained, it was used, together with the saved tokenizer's word index and label encoders, in the developed classification tool. This tool successfully identified exposed PII and PCI data.	en_US
dc.subject	Machine learning	en_US
dc.subject	Support vector machines	en_US
dc.subject	Text categorization	en_US
dc.title	Using Machine Learning to Identify and Categorize Personally Identifiable Information and Payment Card Industry Data in Textual Content	en_US
dc.type	Article	en_US
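
The abstract reports overall accuracy and a separate F1 score for each of the three classes (PII, PCI, Normal). As a reminder of how those two metrics relate, the following is a minimal sketch in plain Python. The function names and the toy label lists are hypothetical and chosen only for illustration; they are not the authors' code, data, or results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 per label, using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels (assumption: invented for this example, not from the paper).
y_true = ["PII", "PII", "PCI", "PCI", "Normal", "Normal", "Normal", "Normal"]
y_pred = ["PII", "PCI", "PCI", "PCI", "Normal", "Normal", "Normal", "PII"]

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
print(acc, f1)
```

Per-class F1 is worth reporting alongside accuracy because a majority class (here, Normal) can dominate the accuracy figure while a smaller class is classified poorly.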