dc.description.abstract |
The growth of Information and Communication Technologies (ICT) has raised the popularity of the World Wide Web (WWW). Consequently, vast amounts of text content in the form of articles, newspapers, books, etc. have spread across the web. Since most of these documents are unstructured and heterogeneous, this has opened a new path for text analysis research: retrieving information from unstructured text and turning it into structured data.
To improve the connection between information and human language, Natural Language Processing (NLP) research contributes to tasks such as information extraction, speech recognition, machine translation, summarization, and topic modelling. As technology adoption has grown, Sinhala text usage on the web has also increased and started to gain attention among researchers. Although several techniques such as text classification, clustering, and named entity extraction have been applied to Sinhala, open research areas remain due to the limited number of studies and the lack of resources.
The SinTM system was built to discover topics in Sinhala text documents through topic modelling. The system provides a novel hybrid approach that combines topic modelling and keyword extraction techniques to detect topics with better interpretability. It was evaluated with prominent topic model evaluation metrics such as likelihood, R-squared, perplexity, and coherence, and benchmarked against a well-known topic modelling algorithm, Latent Dirichlet Allocation (LDA). The SinTM system comes with a web user interface that offers more controllable parameter tuning and an easily understandable graph-based view, and lets the user compare the novel model against other well-known approaches. |
en_US |