Digital Repository

Using Machine Learning to Identify and Categorize Personally Identifiable Information and Payment Card Industry Data in Textual Content

dc.contributor.author Arambawela, Milinda
dc.contributor.author Aponso, Achala
dc.date.accessioned 2025-04-23T07:42:54Z
dc.date.available 2025-04-23T07:42:54Z
dc.date.issued 2024
dc.identifier.citation Arambawela, M. and Aponso, A. (2024) ‘Using Machine Learning to Identify and Categorize Personally Identifiable Information and Payment Card Industry Data in Textual Content’, in 2024 4th International Conference on Advanced Research in Computing (ICARC), pp. 201–205. Available at: https://doi.org/10.1109/ICARC61713.2024.10499783. en_US
dc.identifier.uri https://ieeexplore.ieee.org/document/10499783
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/2266
dc.description.abstract The advent of the Internet has significantly streamlined daily tasks through the rapid growth of online services. Everyday activities, such as purchasing goods and scheduling appointments with healthcare professionals, have become faster, more efficient and more user-friendly with the integration of the Internet. The continuous improvement of online services has led many people to move their activities online. As a result, more personal and payment transaction data is being recorded across various storage media, including databases and log files. Protecting and regulating this sensitive data is imperative and must align with the requirements of GDPR and PCI DSS. Recognizing exposed personal data poses a considerable challenge. This research introduces a novel approach to identifying payment card industry (PCI) data and personally identifiable information (PII). It proposes a machine learning-based text classification model that uses a Convolutional Neural Network (CNN) to detect PII and PCI data within a given text. The CNN model was constructed and compared against Naive Bayes, Gradient Boost, Random Forest, and Support Vector Machine (SVM) models. The CNN model achieved the highest accuracy, at 0.96 (96%). The per-class F1 scores were also high: 0.94 for PII, 0.95 for PCI, and 0.99 for Normal. After the model was constructed and trained, it was deployed in the developed classification tool together with the saved tokenizer's word indexes and label encoders. The tool successfully identified exposed PII and PCI data. en_US
dc.subject Machine learning en_US
dc.subject Support vector machines en_US
dc.subject Text categorization en_US
dc.title Using Machine Learning to Identify and Categorize Personally Identifiable Information and Payment Card Industry Data in Textual Content en_US
dc.type Article en_US
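
The abstract states that the trained model is used together with the saved tokenizer's word indexes and label encoders. The following is a minimal, library-free sketch of what that pre- and post-processing could look like. It is not the authors' code: the function names, the reserved padding and out-of-vocabulary indexes, the whitespace tokenization, and the label order are all assumptions made for illustration. The CNN itself is omitted, and its output is represented only by a list of hypothetical class scores.

```python
from typing import Dict, List

# Reserved indexes (assumption): 0 pads short inputs, 1 marks unseen words.
PAD, OOV = 0, 1

# Class names come from the abstract; their order here is an assumption.
LABELS = ["Normal", "PCI", "PII"]


def build_word_index(texts: List[str]) -> Dict[str, int]:
    """Assign each distinct lowercase word an integer, starting at 2."""
    index: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 2
    return index


def encode(text: str, word_index: Dict[str, int], max_len: int) -> List[int]:
    """Map words to indexes, then truncate or right-pad to max_len."""
    ids = [word_index.get(w, OOV) for w in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def decode_label(scores: List[float]) -> str:
    """Return the class name with the highest score (argmax)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best]


if __name__ == "__main__":
    # Toy, made-up training texts used only to build a word index.
    word_index = build_word_index(["card number 4111", "my name is Ann"])
    print(encode("card number unknown", word_index, max_len=5))
    # Hypothetical scores standing in for a model's output.
    print(decode_label([0.1, 0.7, 0.2]))
```

Saving the word index and label order alongside the model matters because the network only sees integers: if either mapping differs between training and classification, the same text would be encoded differently and predictions would be decoded to the wrong class names.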

