Abstract:
"With modern technological advances, accessing and sharing news has become a
core function of websites and social media platforms. As a result, many news
sites on the web provide vast amounts of information to end-users. News aggregators such as
Google News and Bing News let users view this information effortlessly.
The main task of these tools is to collect news articles from different news sites and bring
them together in one location for easier access. These aggregators can also automatically
cluster news articles based on the similarity of their content.
While there are many news aggregators for the English language, limited research has been
conducted on implementing Sinhala news aggregators [1]. Moreover, the existing Sinhala
aggregators were implemented using traditional data representation techniques such as TF-IDF
[2] combined with clustering algorithms.
However, recent research on news document clustering in other languages [3] [4] [5]
demonstrates the use of modern data representation techniques such as pre-trained word
embeddings, namely GloVe [6], FastText [8] and Word2Vec [9]. Contextual word embeddings
such as BERT [7] have also been used in some of the newest text clustering
research due to their higher performance compared to non-contextual word embeddings.
This research will explore the possibilities of clustering Sinhala-language news using
contextual word embeddings with suitable clustering algorithms.
"