STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing

Koswatte, D.D

URI: http://dlib.iit.ac.lk/xmlui/handle/123456789/883

Date: 2021

Abstract:

"Programming Community Question & Answer (PCQA) platforms receive many questions daily. At this volume, manual Duplicate Question Detection (DQD) has become impractical, since the same question can be phrased in many different ways. Currently, these platforms rely on moderators to identify duplicate questions by hand. Because this manual task is tedious, an automated DQD process is needed. Several researchers have addressed this problem, but they have not considered fast retrieval with fewer comparisons. Existing approaches compare a new query question against every question in the data source, which does not scale. Some authors have acknowledged this limitation, yet they have concentrated on other aspects, such as features, instead. This research presents a novel approach to the limitation that combines Semantic Text Similarity (STS) with hashing. STS helps identify the contextual meaning of text, while hashing is the key mechanism that reduces both the dimensionality of the data and the search space. A given query question is first mapped into the hash space (semantic hashing), and only the closest questions within a given radius (Hamming distance) of it are retrieved, which reduces the search space. The retrieved questions are then ranked using the cosine similarity measure. On the Stack Overflow dataset obtained from the MSR15 conference, the approach achieved average improvements of 1.73%, 6.52%, and 7.22% in recall_rate@5, recall_rate@10, and recall_rate@20, respectively, over previous research work. Furthermore, the results show that reducing data dimensionality and search space is important for increasing the accuracy and efficiency of DQD on PCQA platforms (the graphical benchmarking results provide supporting statistics for this outcome)."
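
The retrieval pipeline described in the abstract (hash the query, collect candidates within a Hamming radius, rank them by cosine similarity) can be illustrated with a minimal sketch. This is not the author's implementation. It assumes questions have already been converted to dense vectors, and it substitutes random-hyperplane hashing for the learned semantic hash, which the abstract does not specify. All names here (make_planes, hash_code, build_index, codes_within, find_duplicates) and the parameter values are hypothetical.

```python
import math
import random
from itertools import combinations


def make_planes(n_bits, dim, seed=0):
    """Random hyperplanes; an assumed stand-in for a learned semantic hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot > 0 else 0)
    return code


def build_index(corpus, planes):
    """Map each hash code to the ids of the questions that share it."""
    index = {}
    for i, vec in enumerate(corpus):
        index.setdefault(hash_code(vec, planes), []).append(i)
    return index


def codes_within(code, n_bits, radius):
    """Yield every code whose Hamming distance from `code` is <= radius."""
    for r in range(radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            yield flipped


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates(query, corpus, index, planes, radius=2, top_k=5):
    """Retrieve candidates near the query's hash code, then rank by cosine."""
    q_code = hash_code(query, planes)
    candidates = []
    for code in codes_within(q_code, len(planes), radius):
        candidates.extend(index.get(code, []))
    ranked = sorted(candidates, key=lambda i: cosine(query, corpus[i]), reverse=True)
    return ranked[:top_k], len(candidates)


# Toy data: 200 random 16-dimensional "question vectors" (illustrative only).
rng = random.Random(1)
corpus = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
planes = make_planes(n_bits=8, dim=16)
index = build_index(corpus, planes)

# A rescaled copy of question 7 plays the role of a duplicate query.
query = [2.0 * x for x in corpus[7]]
top, n_compared = find_duplicates(query, corpus, index, planes)
print("top matches:", top)
print("cosine comparisons:", n_compared, "of", len(corpus))
```

In this sketch only the questions whose codes fall inside the Hamming ball are compared with the query, so the number of cosine computations is at most, and usually well below, the size of the corpus. The radius and code length trade recall against the number of comparisons; the values above are arbitrary.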
