Digital Repository

Classifying Personally Identifiable Information and Payment Card Industry Data in Text Contents Using Machine Learning

dc.contributor.author Arambawela, Milinda
dc.date.accessioned 2024-02-15T06:11:22Z
dc.date.available 2024-02-15T06:11:22Z
dc.date.issued 2023
dc.identifier.citation Arambawela, Milinda (2023) Classifying Personally Identifiable Information and Payment Card Industry Data in Text Contents Using Machine Learning. MSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20210104
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/1692
dc.description.abstract "With the growth of the Internet, online services have made everyday life easier. Daily tasks such as purchasing goods and booking an appointment with a doctor are quicker and easier online than they used to be. The improvement of online services attracts many people to carry out their activities online. As a result, larger amounts of personal and payment transaction data are recorded in many forms of storage, such as databases and log files. This sensitive data should be protected and regulated according to the requirements of GDPR and PCI-DSS. Identifying exposed personal data is not an easy task. In this research, a novel approach is introduced to identify personally identifiable information (PII) and payment card industry (PCI) data. A machine learning based text classification model that uses a Support Vector Machine model to identify PII and PCI data in a given text is proposed in this research project. The CNN model was built and benchmarked against SVM, Naive Bayes, Random Forest, and gradient boosting models. Among all the models, the CNN model achieved the highest accuracy, 0.96 (96%). The per-class F1 scores were 0.96 for PII, 0.96 for PCI, and 0.96 for Normal. After training, the model was used, together with the saved tokenizer's word index and label encoder classes, in a classification tool developed to identify exposed PII and PCI data. The classification tool successfully displayed the exposed PII and PCI data." en_US
dc.language.iso en en_US
dc.publisher IIT en_US
dc.subject CNN en_US
dc.subject NLP en_US
dc.subject PCI-DSS en_US
dc.title Classifying Personally Identifiable Information and Payment Card Industry Data in Text Contents Using Machine Learning en_US
dc.type Thesis en_US

