Digital Repository

Sinhala Multi Document Similarity Detection Tool

dc.contributor.author Piyarathna, Achala Yasas
dc.date.accessioned 2020-07-11T10:43:47Z
dc.date.available 2020-07-11T10:43:47Z
dc.date.issued 2019
dc.identifier.citation Piyarathna, Achala Yasas (2019). Sinhala Multi Document Similarity Detection Tool. BSc. Dissertation, Informatics Institute of Technology en_US
dc.identifier.other 2014245
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/475
dc.description.abstract The word plagiarism simply refers to using someone else’s work without attribution, whether intentionally or not. There are a number of software tools that detect plagiarism in various domains. Almost all of these tools are built for the English language, and similar tools for the Sinhala language are not yet available. Language independence is a crucial factor that affects the accuracy of similarity detection. There have been many attempts to develop language-dependent similarity detection tools for languages such as Hindi, Chinese, Malayalam, Arabic and Persian, and most of these tools outperform the available language-independent commercial plagiarism detection tools. Because Sinhala is similar to these languages and is an official language of Sri Lanka along with Tamil, there is a need for a comprehensive similarity detection tool. Due to the complexity of the language itself, the available language-independent tools produce very poor results on plagiarism detection. The main objective of this research is to address the need for a similarity detection tool for the Sinhala language that detects similarity across multiple documents. A novel algorithm has been developed to detect the similarity among multiple documents. The proposed system consists of two main stages: text pre-processing and similarity detection. A prototype of a multi-document Sinhala similarity detection tool has been developed and introduced for demonstration. The Sinhala language resources used in this project were taken from the Language Technology Research Laboratory of the University of Colombo and the Natural Language Processing Research group of the University of Kelaniya. Testing and validation were carried out on random text samples collected from school students, which were examined by experts. The prototype’s plagiarism calculations for these datasets were cross-referenced by the experts, and the actual plagiarized content was identified. The developed prototype has been successful in identifying the plagiarized content with high accuracy. en_US
dc.subject Plagiarism Detection en_US
dc.subject Natural Language Linguistics en_US
dc.subject text pre-processing en_US
dc.subject Natural Language Processing en_US
dc.title Sinhala Multi Document Similarity Detection Tool en_US
dc.type Thesis en_US

