Digital Repository

SentiGEN: Synthetic Data Generation for Sentiment Analysis

dc.contributor.author Sundarreson, Pushpika
dc.date.accessioned 2025-06-27T03:57:05Z
dc.date.available 2025-06-27T03:57:05Z
dc.date.issued 2024
dc.identifier.citation Sundarreson, Pushpika (2024) SentiGEN: Synthetic Data Generation for Sentiment Analysis. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 20200456
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/2712
dc.description.abstract "Obtaining high-quality, diverse, and accurate datasets for sentiment analysis has always been a significant challenge. Traditional approaches rely on human annotators, who may introduce bias into the datasets and whose work is time-consuming and expensive. Such datasets may also lack the variety needed to train robust and generalizable classification models. This research addresses the problem with a novel combination of techniques. The proposed system, SentiGEN, uses a transformer, T5, fine-tuned and optimized using Neural Architecture Search (NAS), to generate high-quality, diverse, and accurate data for sentiment analysis. The generated data is validated using another transformer, XLNet, to ensure high sentiment accuracy. This combination of technologies proved successful based on the results of evaluating multiple models. From complex transformers such as BERT and RoBERTa to more straightforward approaches such as Random Forest and Logistic Regression, models trained on synthetic data outperformed their counterparts trained on real data. This improvement in predictive accuracy was observed on benchmark datasets such as the Stanford Sentiment Treebank 2 (SST-2) and the Yelp test dataset. SentiGEN is therefore capable of generating high-quality, diverse, and accurate data for sentiment analysis, and models trained on SentiGEN-generated data outperformed the same models trained on real data." en_US
dc.language.iso en en_US
dc.subject Synthetic Data en_US
dc.subject Sentiment Analysis en_US
dc.subject Machine Learning en_US
dc.title SentiGEN: Synthetic Data Generation for Sentiment Analysis en_US
dc.type Thesis en_US

