Digital Repository

Towards Universal NLP: Explainable AI-Guided Noise Injection for Language Agnostic Automated Text Data Augmentation

dc.contributor.author Wijesundera, Sarith
dc.date.accessioned 2026-04-07T04:01:09Z
dc.date.available 2026-04-07T04:01:09Z
dc.date.issued 2025
dc.identifier.citation Wijesundera, Sarith (2025) Towards Universal NLP: Explainable AI-Guided Noise Injection for Language Agnostic Automated Text Data Augmentation. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20210010
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/3116
dc.description.abstract Data augmentation is a powerful strategy for addressing data scarcity. It improves the generalization and robustness of machine learning models and thereby mitigates overfitting. In natural language processing, data augmentation is still in its early stages because textual data is difficult to transform. Selecting optimal augmentation techniques while preserving the keywords of a text is therefore a challenging task. In multilingual settings, further challenges arise because an augmentation technique that performs well in one language may not perform as well in another. The author proposes a novel approach that automates the textual data augmentation pipeline and enables cross-lingual capabilities through the use of language models and noising-based augmentation strategies. By framing the problem as a hyperparameter optimisation task, the author defines the augmentation search space with language agnostic noise injection techniques and a restructured augmentation policy. To preserve the keywords of a text during augmentation, XAI techniques are used to compute each word's contribution score to the model prediction. Experiments were conducted on sentiment analysis in three languages (English, Sinhala, and Korean) in an extremely low-resource setting, using only 80 training samples and 60 validation samples. XLM-R-base was used as the backbone classifier model. After applying the proposed augmentation strategies, accuracy improved by 16%-18% for English, 9%-10% for Sinhala, and 21%-23% for Korean. These results highlight the potential of combining noising-based techniques with XAI to perform language agnostic automated data augmentation in low-resource contexts. en_US
dc.language.iso en en_US
dc.subject Automated Text Data Augmentation en_US
dc.subject Low-resource Languages en_US
dc.subject Hyperparameter Optimisation en_US
dc.subject Explainable AI en_US
dc.title Towards Universal NLP: Explainable AI-Guided Noise Injection for Language Agnostic Automated Text Data Augmentation en_US
dc.type Thesis en_US
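
The abstract describes protecting high-contribution words while injecting noise into the rest of a text. The following is a minimal sketch of that idea under stated assumptions, not the dissertation's implementation: the function names, the adjacent-character swap used as the noise operation, and the parameters `noise_prob` and `keep_ratio` are all hypothetical, and the word contribution scores are assumed to come from some external XAI method. In a hyperparameter optimisation framing, values such as `noise_prob` and `keep_ratio` would be the kind of settings a search procedure tunes.

```python
import random


def protected_indices(scores, keep_ratio):
    """Return indices of the highest-|score| words, which are shielded from noise."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return set(ranked[:n_keep])


def swap_adjacent_chars(word, rng):
    """Character-level noise that needs no language-specific resources:
    swap two neighbouring characters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(words, scores, noise_prob=0.3, keep_ratio=0.3, seed=0):
    """Noise unprotected words with probability noise_prob.

    words  : list of tokens (assumed to be pre-split)
    scores : per-word contribution scores from an XAI method (assumed given)
    """
    if len(words) != len(scores):
        raise ValueError("words and scores must have the same length")
    rng = random.Random(seed)
    keep = protected_indices(scores, keep_ratio)
    augmented = []
    for i, word in enumerate(words):
        if i not in keep and rng.random() < noise_prob:
            augmented.append(swap_adjacent_chars(word, rng))
        else:
            augmented.append(word)
    return augmented


# Hypothetical example: the scores below are made up for illustration.
words = ["the", "movie", "was", "absolutely", "wonderful"]
scores = [0.01, 0.20, 0.02, 0.40, 0.90]
result = augment(words, scores, noise_prob=1.0, keep_ratio=0.4)
```

With `keep_ratio=0.4`, the two highest-scoring words ("absolutely" and "wonderful") are left untouched, while the remaining words are candidates for noise.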