Digital Repository

STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing

dc.contributor.author Koswatte, D.D
dc.date.accessioned 2022-03-08T07:01:17Z
dc.date.available 2022-03-08T07:01:17Z
dc.date.issued 2021
dc.identifier.citation Koswatte, D.D (2021) STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing. BSc. Dissertation, Informatics Institute of Technology. en_US
dc.identifier.issn 2017198
dc.identifier.uri http://dlib.iit.ac.lk/xmlui/handle/123456789/883
dc.description.abstract " Programming Community Question & Answer (PCQA) platforms receive many questions daily. At this volume, Duplicate Question Detection (DQD) has become infeasible, since the same question can be phrased in several different ways. Currently, these platforms rely on moderators to identify duplicate questions manually. Because this manual task is tedious, an automated process for DQD is required. Several researchers have addressed this issue, but fast retrieval of data with fewer comparisons has not been considered. Existing approaches compare a new query question with every question in the data source, which is infeasible. Some researchers have noted this limitation, yet they concentrated on other aspects, such as features, instead. This research presents a novel approach that addresses this limitation by combining Semantic Text Similarity (STS) and hashing. STS helps in identifying the contextual meaning of text, while hashing is the key mechanism that reduces the dimensionality of the data and the search space. A given query question is first mapped to the hash space (semantic hashing), and the closest questions within a given radius (Hamming distance) of it are retrieved, thereby reducing the search space. The retrieved questions are then ranked with the cosine similarity measure. On the Stack Overflow dataset obtained from the MSR15 conference, there was an average improvement of 1.73%, 6.52%, and 7.22% at recall_rate@5, recall_rate@10, and recall_rate@20, respectively, over previous research work. Furthermore, the results indicate that reducing data dimensionality and search space is important for increasing the accuracy and efficiency of DQD on PCQA platforms (the graphical results in the research benchmarking provide supporting statistics for this outcome). " en_US
dc.language.iso en en_US
dc.subject Recommender Systems en_US
dc.subject Information Systems en_US
dc.subject Information Systems – Question & Answering en_US
dc.subject Information Extraction en_US
dc.subject Computer Methodologies en_US
dc.title STACKO-DQD : Optimized duplicate question detection in programming community question & answer Platforms using Semantic Hashing en_US
dc.type Thesis en_US
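
The retrieval pipeline described in the abstract (map a query to a hash code, keep only questions within a Hamming radius, then rank by cosine similarity) can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes a toy bag-of-words vector in place of a real STS embedding and random-hyperplane hashing in place of the learned semantic hashing model, and every function name and parameter below (embed, make_planes, hash_code, build_index, query_duplicates, n_bits, radius, top_k) is hypothetical.

```python
import math
import random
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words vector (assumption: stands in for an STS embedding)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def make_planes(dim, n_bits, seed=0):
    """Random hyperplanes (assumption: stands in for a learned hash function)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """One bit per hyperplane: 1 if the projection is non-negative, else 0."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(corpus, n_bits=16, seed=0):
    """Precompute a hash code and vector for every stored question."""
    vocab = sorted({w for text in corpus for w in text.lower().split()})
    planes = make_planes(len(vocab), n_bits, seed)
    entries = []
    for text in corpus:
        vec = embed(text, vocab)
        entries.append((hash_code(vec, planes), vec, text))
    return vocab, planes, entries


def query_duplicates(query, index, radius=4, top_k=5):
    """Shortlist by Hamming radius, then rank the shortlist by cosine similarity."""
    vocab, planes, entries = index
    q_vec = embed(query, vocab)
    q_code = hash_code(q_vec, planes)
    # Cheap integer comparison on codes; cosine is computed only for the shortlist.
    shortlist = [(vec, text) for code, vec, text in entries
                 if hamming(code, q_code) <= radius]
    ranked = sorted(((cosine(q_vec, vec), text) for vec, text in shortlist),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    questions = [
        "how to reverse a list in python",
        "reverse a python list in place",
        "what is a null pointer exception in java",
    ]
    idx = build_index(questions)
    for score, text in query_duplicates("python reverse list", idx, radius=6):
        print(f"{score:.3f}  {text}")
```

This sketch still scans every stored hash code linearly. A system aiming for the search-space reduction the abstract describes would more likely bucket questions by code and probe only the buckets within the radius. The radius and code length trade recall against the number of cosine comparisons, and the values used here are arbitrary.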

