| dc.description.abstract |
The presence of duplicate bug reports in large-scale software repositories continues to hinder
efficient triaging and resource allocation. Traditional similarity detection techniques often lack
the semantic depth required to accurately identify duplicates in natural language bug
descriptions. This research addresses the challenge by proposing DeepIssueMatch, a token-
interaction-based framework designed to semantically retrieve and rank similar bug reports.
The system was implemented as a modular pipeline consisting of a Sentence-BERT (SBERT)
based semantic retriever, an optional reranker, and a lightweight language model for response
formulation. The framework was evaluated against classical information retrieval
models such as TF-IDF and BM25, as well as semantic baselines including GloVe and
BERT. Although more advanced models such as ColBERT were considered, their high computational
complexity and inference overhead made them unsuitable for deployment in the target
setting. The architecture was deployed through a FastAPI interface, and experiments were
performed on a labeled HBase bug report dataset.
The results demonstrated that SBERT alone achieved a Recall@10 of approximately 56%,
which improved to over 61% when augmented with a reranker. Classical models such as BM25
and TF-IDF yielded Recall@10 scores around 55% and 51%, respectively, while shallow
embedding-based methods remained below 30%. These findings confirm that SBERT-based
retrieval provides a practical balance between performance and scalability for duplicate
detection in bug triaging systems. Furthermore, fine-tuning SBERT on the target dataset
yielded further gains in Recall@K, making it the preferred configuration for real-world deployment. |
en_US |
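
The abstract reports its results as Recall@10 but does not spell out how the metric is computed. The sketch below shows one common definition used in duplicate detection, under the assumption that a query counts as a hit when at least one of its labeled duplicates appears among the top K retrieved reports. The function name `recall_at_k`, the dictionary layout, and the toy report identifiers are all hypothetical and are not taken from the DeepIssueMatch implementation.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    duplicates: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one true duplicate in the top-k results.

    ranked:     query id -> candidate report ids, best match first (assumed layout)
    duplicates: query id -> set of report ids labeled as duplicates of the query
    """
    # Only queries that have at least one labeled duplicate can be scored.
    queries = [q for q in ranked if duplicates.get(q)]
    if not queries:
        return 0.0
    hits = sum(
        1
        for q in queries
        if any(candidate in duplicates[q] for candidate in ranked[q][:k])
    )
    return hits / len(queries)


# Toy data with made-up identifiers, for illustration only.
toy_ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
toy_duplicates = {"q1": {"b"}, "q2": {"z"}}

print(recall_at_k(toy_ranked, toy_duplicates, k=2))  # q1 hits, q2 misses
```

Under this definition the same function can score any retriever that returns a ranked list, so classical and embedding-based models can be compared on equal terms.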