Digital Repository

SinDOC: A Combined Approach of Summarizing Low Resource Sinhala Language Documents

dc.contributor.author Ziyard, Hamza
dc.date.accessioned 2024-03-13T07:29:18Z
dc.date.available 2024-03-13T07:29:18Z
dc.date.issued 2023
dc.identifier.citation Ziyard, Hamza (2023) SinDOC: A Combined Approach of Summarizing Low Resource Sinhala Language Documents. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.issn 2019407
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/1880
dc.description.abstract "Summarization is an NLP task that has been widely researched in recent times. Because of the large amount of data stored on the internet, users do not consume all the information that is available. Instead, they look for a short piece of text that conveys the most important information as quickly as possible. Sinhala is a low-resource language that has seen few contributions in the field of document summarization, mainly because of the lack of resources. To address this problem, the author proposes a novel combined approach that summarizes Sinhala documents using both extractive and abstractive techniques. The extractive model uses word frequency and sentence scoring, and the abstractive summarization model uses a pre-trained model. The author presents a new dataset, translated from an existing English dataset, on which the pre-trained model is trained. The author also shows that automating hyperparameter tuning to generate training arguments for the abstractive model gives better results in less time than traditional approaches. The user can generate summaries through all three approaches (extractive, abstractive, and combined) to obtain the best overall result. The model built in this research shows good results that outperform previous work. Overall, this research proposes a combined approach to generating summaries for Sinhala documents. As all the models and code are made publicly available, the author believes that this work could build a strong foundation for future researchers." en_US
dc.language.iso en en_US
dc.publisher IIT en_US
dc.subject Natural Language Processing en_US
dc.subject Sinhala Language en_US
dc.subject Pre-trained models en_US
dc.title SinDOC: A Combined Approach of Summarizing Low Resource Sinhala Language Documents en_US
dc.type Thesis en_US
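
The abstract mentions word frequency and sentence scoring for the extractive model. The following is a minimal illustrative sketch of that general technique, not the author's implementation: the function names, the punctuation set, the sentence-splitting rule, and the length-normalized scoring are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed punctuation set to strip from tokens (hypothetical choice).
PUNCT = ".,!?;:\"'()[]"


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Whitespace tokenization, so Sinhala vowel signs stay attached to words."""
    tokens = (tok.strip(PUNCT).lower() for tok in sentence.split())
    return [t for t in tokens if t]


def summarize(text, num_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    # Word frequency over the whole document.
    freq = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        # Average frequency of the words in the sentence (assumed scoring rule).
        toks = tokenize(sentence)
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    keep = sorted(ranked[:num_sentences])
    return [sentences[i] for i in keep]
```

Calling `summarize(document_text, num_sentences=3)` returns the three highest-scoring sentences as a list. A practical system for Sinhala would likely also need stop-word removal and a more careful sentence splitter, which this sketch omits.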