Abstract:
"
Programming Community Question & Answer (PCQA) platforms receive many
questions daily. At this volume, Duplicate Question Detection (DQD) has become
impractical, since the same question can be phrased in many different ways.
Currently, these platforms rely on moderators to identify duplicate questions
manually. Because this manual task is tedious, there is a need for an automated
DQD process. Several researchers have addressed this issue, but fast retrieval
with fewer comparisons has not been considered. Existing work compares a new
query question with every question in the data source, which is infeasible.
Some researchers have recognized this limitation, yet they have concentrated
on other aspects, such as features, instead. This research presents a novel
approach that addresses this limitation.
This novel approach combines Semantic Text Similarity (STS) and hashing to
address the above-mentioned limitation. STS helps identify the contextual
meaning of text, while hashing is the key mechanism that reduces both the
dimensionality of the data and the search space. A given query question is first
mapped to the hash space (semantic hashing), and the closest questions within a
given radius (Hamming distance) of it are retrieved, which reduces the search
space. The retrieved questions are then ranked using the cosine similarity measure.
On the Stack Overflow dataset obtained from the MSR15 conference, the approach
achieved average improvements of 1.73%, 6.52%, and 7.22% in recall_rate@5,
recall_rate@10, and recall_rate@20, respectively, over previous research work.
Furthermore, the results show that reducing data dimensionality and the search
space is important for increasing the accuracy and efficiency of DQD on PCQA
platforms (the graphical results in the research benchmarking provide supporting
statistics for this outcome).
"
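
The retrieval pipeline described in the abstract (map the query to a hash space,
keep only questions within a Hamming radius, then rank by cosine similarity) can
be illustrated with a minimal sketch. The abstract does not say how the hash
codes are produced, so this sketch assumes random-hyperplane (sign) hashing as a
stand-in for semantic hashing, and it assumes questions have already been turned
into numeric vectors. All function names here are hypothetical and chosen for
illustration only.

```python
import math
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (assumed stand-in for a learned hash)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, hyperplanes):
    """Map a vector to an integer whose bits are the signs of its projections."""
    code = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(vectors, hyperplanes):
    """Precompute a hash code for every stored question vector."""
    return {qid: (vec, hash_code(vec, hyperplanes)) for qid, vec in vectors.items()}


def retrieve(query_vec, index, hyperplanes, radius, top_k):
    """Filter by Hamming radius, then rank the survivors by cosine similarity."""
    q_code = hash_code(query_vec, hyperplanes)
    candidates = [
        (qid, vec)
        for qid, (vec, code) in index.items()
        if hamming(q_code, code) <= radius
    ]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [qid for qid, _ in ranked[:top_k]]
```

In this sketch the Hamming filter still scans every stored code, which is cheap
but linear in the number of questions; a practical system would presumably
bucket questions by hash code so that only nearby buckets are visited. The
expensive cosine comparison is applied only to the candidates that pass the filter.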