STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing

Koswatte, D.D

dc.contributor.author	Koswatte, D.D
dc.date.accessioned	2022-03-08T07:01:17Z
dc.date.available	2022-03-08T07:01:17Z
dc.date.issued	2021
dc.identifier.citation	Koswatte, D.D (2021) STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing. BSc. Dissertation, Informatics Institute of Technology	en_US
dc.identifier.issn	2017198
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/883
dc.description.abstract	"Programming Community Question & Answer (PCQA) platforms receive many questions daily. At this volume, manual Duplicate Question Detection (DQD) has become infeasible, since the same question can be phrased in several ways. Currently, these platforms rely on moderators to identify duplicate questions manually. Because this task is tedious, an automated DQD process is needed. Several researchers have addressed the problem, but fast retrieval with fewer comparisons has not been considered. Existing work compares a new query question with every question in the data source, which is infeasible. Some authors have acknowledged this limitation, yet they have concentrated on other aspects, such as features, instead. This research presents a novel approach that addresses the limitation by combining Semantic Text Similarity (STS) and hashing. STS helps identify the contextual meaning of text, while hashing is the key mechanism that reduces the dimensionality of the data and the search space. A given query question is first mapped to the hash space (semantic hashing), and the closest questions within a given radius (Hamming distance) of it are retrieved, thereby reducing the search space. The retrieved questions are then ranked using the cosine similarity measure. On the Stack Overflow dataset obtained from the MSR15 conference, the approach achieved average improvements of 1.73%, 6.52% and 7.22% at recall_rate@5, recall_rate@10 and recall_rate@20, respectively, over previous research work. Furthermore, the results showed that reducing data dimensionality and search space is important for increasing the accuracy and efficiency of DQD on PCQA platforms (the graphical results in the research benchmarking provide supporting statistics for this outcome)."	en_US
dc.language.iso	en	en_US
dc.subject	Recommender Systems	en_US
dc.subject	Information Systems	en_US
dc.subject	Information Systems – Question & Answering	en_US
dc.subject	Information Extraction	en_US
dc.subject	Computer Methodologies	en_US
dc.title	STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing	en_US
dc.type	Thesis	en_US
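
The retrieval pipeline summarized in the abstract (map the query to a hash code, keep only questions within a Hamming radius, then rank the survivors by cosine similarity) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes questions are already embedded as numeric vectors, and it substitutes random-hyperplane hashing for the learned semantic hash. All function names, the `radius` and `top_k` parameters, and the single-duplicate definition of recall_rate@k are hypothetical choices made for illustration.

```python
import math
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Random hyperplanes: an assumed stand-in for a learned semantic hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Map a vector to an integer code, one bit per hyperplane (sign of the dot product)."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_candidates(query_vec, corpus, planes, radius, top_k):
    """corpus maps question id -> (vector, precomputed hash code).

    Only questions within `radius` bits of the query code are scored with
    cosine similarity. This sketch still scans every stored code (a cheap
    integer comparison); a real index would look up hash buckets directly.
    """
    q_code = hash_code(query_vec, planes)
    shortlist = [
        (qid, vec)
        for qid, (vec, code) in corpus.items()
        if hamming(q_code, code) <= radius
    ]
    ranked = sorted(shortlist, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [qid for qid, _ in ranked[:top_k]]


def recall_rate_at_k(results, truth, k):
    """Fraction of queries whose known duplicate appears in the top k results.

    Assumes exactly one known duplicate per query (an illustrative definition).
    """
    hits = sum(1 for query, ranked in results.items() if truth[query] in ranked[:k])
    return hits / len(results)
```

A smaller `radius` shrinks the shortlist and the number of cosine comparisons, at the risk of excluding true duplicates whose codes differ in more bits. A larger radius trades speed for recall.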