SentiGEN: Synthetic Data Generation for Sentiment Analysis

Sundarreson, Pushpika

URI: http://dlib.iit.ac.lk/xmlui/handle/123456789/2712

Date: 2024

Abstract:

"Obtaining high-quality, diverse, and accurate datasets for sentiment analysis has always been a significant challenge. Traditional approaches rely on human annotators, which can introduce bias into datasets and is also time-consuming and expensive. Such datasets may also lack the variety needed to train robust and generalizable classification models. This research introduces a novel combination of techniques to address the problem. The proposed system, SentiGEN, uses a transformer, T5, fine-tuned and optimized using Neural Architecture Search (NAS), to generate high-quality, diverse, and accurate data for sentiment analysis. The generated data is validated using another transformer, XLNet, to ensure high sentiment accuracy. This combination of technologies proved successful in evaluations across multiple models. From complex transformers such as BERT and RoBERTa to more straightforward approaches like Random Forest and Logistic Regression, models trained on synthetic data outperformed their counterparts trained on real data. This improvement in predictive accuracy was observed on benchmark datasets such as the Stanford Sentiment Treebank 2 (SST-2) and the Yelp test dataset. SentiGEN is thus capable of generating high-quality, diverse, and accurate data for sentiment analysis, and models trained on SentiGEN-generated data performed better than the same models trained on real data."
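The generate-then-validate flow described in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation: the function `generate_validated`, the `min_confidence` threshold, and the toy generator and validator are all hypothetical stand-ins. In the actual system the generator would be the fine-tuned T5 model and the validator would be XLNet; here they are plain callables so the example runs without any model downloads.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not from the thesis):
# a generator maps (target_label, n) to n candidate texts;
# a validator maps a text to (predicted_label, confidence).
Generator = Callable[[str, int], List[str]]
Validator = Callable[[str], Tuple[str, float]]


def generate_validated(
    labels: Sequence[str],
    per_label: int,
    generate: Generator,
    validate: Validator,
    min_confidence: float = 0.9,  # assumed threshold, for illustration only
) -> List[Tuple[str, str]]:
    """Keep only generated texts whose validated label matches the target."""
    kept = []
    for label in labels:
        for text in generate(label, per_label):
            predicted, confidence = validate(text)
            if predicted == label and confidence >= min_confidence:
                kept.append((text, label))
    return kept


# Toy stand-ins so the sketch is runnable; a real setup would wrap models.
def toy_generate(label: str, n: int) -> List[str]:
    candidates = {
        "positive": ["great food", "bad service", "loved it"],
        "negative": ["terrible film", "awful plot", "great acting"],
    }
    return candidates[label][:n]


def toy_validate(text: str) -> Tuple[str, float]:
    words = set(text.split())
    if words & {"great", "loved"}:
        return ("positive", 0.95)
    if words & {"terrible", "awful", "bad"}:
        return ("negative", 0.95)
    return ("positive", 0.5)  # low confidence when no cue word is found


dataset = generate_validated(
    ["positive", "negative"], 3, toy_generate, toy_validate
)
for text, label in dataset:
    print(label, "|", text)
```

In this toy run, "bad service" and "great acting" are discarded because the validator's label disagrees with the label the generator was asked to produce, which is the role the abstract assigns to the validation step.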
