Abstract:
"Obtaining high quality, diverse, accurate datasets for sentiment analysis has always been a
significant challenge. Traditional approaches rely on human annotators, who may introduce bias
into datasets, and manual annotation is also time-consuming and expensive. Such datasets may
also lack the variety needed to train robust and generalizable classification models.
This research introduces a novel combination of techniques to address this problem. The
proposed system, SentiGEN, uses a transformer, T5, fine-tuned and optimized using Neural
Architecture Search (NAS), to generate high quality, diverse, accurate data for sentiment
analysis. The generated data is validated using another transformer, XLNet, to ensure high
sentiment accuracy.
This combination of technologies proved successful in evaluations across multiple models.
From complex transformers such as BERT and RoBERTa to simpler approaches such as Random
Forest and Logistic Regression, models trained on synthetic data outperformed their
counterparts trained on real data. This improvement in predictive accuracy was observed on
benchmark datasets such as the Stanford Sentiment Treebank 2 (SST-2) and the Yelp test dataset.
The proposed system, SentiGEN, is thus capable of generating high quality, diverse, accurate
data for sentiment analysis, and models trained on SentiGEN-generated data outperformed the
same models trained on real data."
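
The generate-then-validate idea described in the abstract can be illustrated with a minimal sketch. It is not the SentiGEN implementation: the `generate` and `classify` callables, the `min_confidence` threshold of 0.9, and the `max_rounds` cap are all hypothetical names and values chosen for illustration. In the setting the abstract describes, `generate` would presumably wrap the fine-tuned T5 model and `classify` the XLNet validator.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the source):
#   generate(label, n) -> up to n candidate texts for the requested sentiment label
#   classify(text)     -> (predicted_label, confidence in [0, 1])
Generator = Callable[[str, int], Iterable[str]]
Classifier = Callable[[str], Tuple[str, float]]


def build_validated_dataset(
    generate: Generator,
    classify: Classifier,
    labels: Sequence[str],
    per_label: int,
    min_confidence: float = 0.9,  # assumed threshold, not from the source
    max_rounds: int = 10,         # assumed cap so the loop always terminates
) -> List[Tuple[str, str]]:
    """Keep only generated texts whose validated sentiment matches the target label."""
    dataset: List[Tuple[str, str]] = []
    for label in labels:
        kept: List[str] = []
        seen = set()  # skip exact duplicates to preserve some diversity
        for _ in range(max_rounds):
            if len(kept) >= per_label:
                break
            # Request only as many candidates as are still missing.
            for text in generate(label, per_label - len(kept)):
                if text in seen:
                    continue
                seen.add(text)
                predicted, confidence = classify(text)
                # Accept a sample only if the validator agrees with the
                # intended label and is sufficiently confident.
                if predicted == label and confidence >= min_confidence:
                    kept.append(text)
                    if len(kept) == per_label:
                        break
        dataset.extend((text, label) for text in kept)
    return dataset
```

Rejected samples are simply regenerated in the next round, so the threshold trades dataset size and generation cost against label accuracy. If the validator rejects too often, a label may end up with fewer than `per_label` samples once `max_rounds` is reached.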