Sinhala News Clustering using Contextual Word Embeddings

Hewaralalage Amaradasa, Dharshika Nayanathara

URI: http://dlib.iit.ac.lk/xmlui/handle/123456789/1402

Date: 2022

Abstract:

With modern technological expansion, accessing and sharing news has become a core function of websites and social media platforms. As a result, many news sites on the web provide vast amounts of information to end users. News aggregators such as Google News and Bing News make this information easy to browse. The main task of these tools is to collect news articles from different news sites and bring them together in one location for easier access. These aggregators can also automatically cluster news articles based on the similarity of their content. While there are many news aggregators for the English language, limited research has been conducted on implementing Sinhala news aggregators [1]. Moreover, those aggregators were implemented using traditional data representation techniques such as TF-IDF [2] combined with clustering algorithms. However, recent research on news document clustering in other languages [3] [4] [5] demonstrates the use of modern data representation techniques such as pre-trained word embeddings, namely GloVe [6], FastText [8] and Word2Vec [9]. Contextual word embeddings such as BERT [7] have also been used in some of the newest text clustering research due to their higher performance compared to word embeddings. This research explores the possibilities of clustering Sinhala news using contextual word embeddings with suitable clustering algorithms.
