Sinhala Multi Document Similarity Detection Tool

Piyarathna, Achala Yasas

dc.contributor.author	Piyarathna, Achala Yasas
dc.date.accessioned	2020-07-11T10:43:47Z
dc.date.available	2020-07-11T10:43:47Z
dc.date.issued	2019
dc.identifier.citation	Piyarathna, Achala Yasas (2019). Sinhala Multi Document Similarity Detection Tool. BSc. Dissertation. Informatics Institute of Technology	en_US
dc.identifier.other	2014245
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/475
dc.description.abstract	The word plagiarism simply refers to using someone else’s work without attribution, whether intentionally or not. There are a number of software tools to detect plagiarism in various domains. Almost all of these tools are available for the English language, but similar tools for the Sinhala language are not yet available. Whether a tool is language dependent is a crucial factor that affects the accuracy of similarity detection. There have been many attempts to develop language-dependent similarity detection tools for languages such as Hindi, Chinese, Malayalam, Arabic and Persian. Most of these tools outperform the available language-independent commercial plagiarism detection tools. Because Sinhala is similar to these languages and is an official language of Sri Lanka along with Tamil, there is a need for a comprehensive similarity detection tool. Due to the complexity of the language itself, the available language-independent tools produce very poor results on plagiarism detection. The main objective of this research is to address the need for a similarity detection tool for the Sinhala language that detects similarity across multiple documents. A novel algorithm has been developed to detect the similarity among multiple documents. The proposed system consists mainly of two stages: text pre-processing and similarity detection. A prototype of a multi-document Sinhala similarity detection tool has been developed and introduced for demonstration. Sinhala language resources used in this project were taken from the Language Technology Research Laboratory of the University of Colombo and the Natural Language Processing Research Group of the University of Kelaniya. Testing and validation were carried out by collecting random text samples from school students, which were examined by experts. The prototype’s plagiarism calculations for these datasets were cross-referenced by the experts, and the actual plagiarized content was identified. 
The developed prototype successfully identified the plagiarized content with high accuracy.	en_US
dc.subject	Plagiarism Detection	en_US
dc.subject	Natural Language Linguistics	en_US
dc.subject	text pre-processing	en_US
dc.subject	Natural Language Processing	en_US
dc.title	Sinhala Multi Document Similarity Detection Tool	en_US
dc.type	Thesis	en_US
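
The abstract describes a two-stage system: text pre-processing followed by similarity detection across multiple documents. The record does not describe the dissertation's novel algorithm, so the sketch below is only a generic stand-in for that pipeline shape, not the author's method. It assumes tokenization by the Sinhala Unicode block, a tiny illustrative stop-word set (`STOP_WORDS` is a placeholder, not one of the language resources named in the abstract), and Jaccard overlap of word bigrams as the similarity measure. All function names are hypothetical.

```python
import re
from itertools import combinations

# Illustrative placeholder only; a real system would load a curated
# Sinhala stop-word list from a proper language resource.
STOP_WORDS = {"සහ", "හා"}


def preprocess(text, stop_words=STOP_WORDS):
    """Stage 1 (assumed): tokenize and drop stop words.

    Tokens are runs of characters from the Sinhala Unicode block
    (U+0D80 to U+0DFF) plus the zero-width joiner used in conjuncts.
    Punctuation, digits and non-Sinhala text are discarded.
    """
    tokens = re.findall(r"[\u0D80-\u0DFF\u200D]+", text)
    return [t for t in tokens if t not in stop_words]


def shingles(tokens, n=2):
    """Return the set of word n-grams; short inputs yield one shingle."""
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pairwise_similarity(docs, n=2):
    """Stage 2 (assumed): score every pair of documents.

    `docs` maps a document name to its raw text. The result maps each
    name pair (in sorted order) to a similarity score in [0, 1].
    """
    sets = {name: shingles(preprocess(text), n) for name, text in docs.items()}
    return {
        (x, y): jaccard(sets[x], sets[y])
        for x, y in combinations(sorted(sets), 2)
    }


if __name__ == "__main__":
    sample = {
        "a": "මම පොත කියවමි",
        "b": "මම පොත ලියමි",
        "c": "ගෙදර යමි",
    }
    for pair, score in pairwise_similarity(sample).items():
        print(pair, round(score, 3))
```

Comparing every pair grows quadratically with the number of documents, which is acceptable for a small batch of student submissions but would need indexing for a large corpus. A language-dependent tool of the kind the abstract argues for would likely add Sinhala-specific steps such as stemming or synonym handling in the pre-processing stage; those are omitted here because the record gives no detail on them.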