Abstract:
Data augmentation is a powerful strategy to address data scarcity. It improves the generalization and robustness of machine learning models which mitigates model overfitting issues. Data augmentation is in its early stages in natural language processing due to the difficulty in textual data transformation. Therefore, selecting the optimal augmentation techniques has become a challenging task while preserving keywords of texts. In the context of multiple languages, challenges arise because an augmentation technique that performs well in one language may not perform the same in another language.
The author proposes a novel approach to automate the pipeline of textual data augmentation by enabling cross-lingual capabilities through the use of language models and noising-based augmentation strategies. By framing the problem as a hyperparameter optimisation task, the author defines the augmentation search space with language agnostic noise injection techniques and a restructured augmentation policy. To maintain keywords of texts during the augmentation, XAI techniques are leveraged to compute the word contribution scores on the model prediction.
Experiments were conducted for sentiment analysis in three languages: English, Sinhala, and Korean, within an extremely low resource setting, utilising only 80 training samples and 60 validation samples. XLM-R-base was used as the backbone classifier model. For English, the accuracy improvement after applying the proposed augmentation strategies increased by 16%-18%, while for Sinhala, the improvement was 9%-10% and for Korean, the improvement was ranged from 21%-23%. These results highlight the potential of leveraging noising-based techniques and XAI to perform language agnostic automated data augmentation in low-resource contexts.