SinDOC: A Combined Approach of Summarizing Low Resource Sinhala Language Documents

Ziyard, Hamza

dc.contributor.author	Ziyard, Hamza
dc.date.accessioned	2024-03-13T07:29:18Z
dc.date.available	2024-03-13T07:29:18Z
dc.date.issued	2023
dc.identifier.citation	Ziyard, Hamza (2023) SinDOC: A Combined Approach of Summarizing Low Resource Sinhala Language Documents. BSc. Dissertation, Informatics Institute of Technology	en_US
dc.identifier.issn	2019407
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/1880
dc.description.abstract	"Summarization is one of the most widely researched NLP tasks in recent times. Because of the large amount of data stored on the internet, users do not consume all of the information that is available; they look for a short piece of text that conveys the most important information as quickly as possible. Sinhala is a low-resource language with few contributions in the field of document summarization, mainly because of the lack of resources. To address this problem, the author proposes a novel approach that combines extractive and abstractive techniques to summarize Sinhala documents. The extractive model uses word frequency and sentence scoring, and the abstractive model uses a pre-trained model. The author also presents a new dataset, translated from an existing English dataset, which is used to train the pre-trained model. The author further demonstrates that automating hyperparameter tuning to generate training arguments for the abstractive model gives better results in less time than traditional approaches. Users are given the option of generating summaries through all three approaches so that they get the best overall result. The model built in this research shows good results that outperform previous work. Overall, this research proposes a combined approach to generating summaries for Sinhala documents. Because all the models and code are publicly available, the author believes that this work can provide a strong foundation for future researchers."	en_US
dc.language.iso	en	en_US
dc.publisher	IIT	en_US
dc.subject	Natural Language Processing	en_US
dc.subject	Sinhala Language	en_US
dc.subject	Pre-trained models	en_US
dc.title	SinDOC: A Combined Approach of Summarizing Low Resource Sinhala Language Documents	en_US
dc.type	Thesis	en_US