DeepIssueMatch: A Token-Interaction and LLM-Based Framework for Bug Report Similarity and Triaging

Herath, Dinith

dc.contributor.author	Herath, Dinith
dc.date.accessioned	2026-03-11T05:58:07Z
dc.date.available	2026-03-11T05:58:07Z
dc.date.issued	2025
dc.identifier.citation	Herath, Dinith (2025) DeepIssueMatch: A Token-Interaction and LLM-Based Framework for Bug Report Similarity and Triaging. MSc Dissertation, Informatics Institute of Technology	en_US
dc.identifier.issn	20230583
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/2923
dc.description.abstract	The presence of duplicate bug reports in large-scale software repositories continues to hinder efficient triaging and resource allocation. Traditional similarity detection techniques often lack the semantic depth required to accurately identify duplicates in natural language bug descriptions. This research addresses the challenge by proposing DeepIssueMatch, a token-interaction-based framework designed to semantically retrieve and rank similar bug reports. The system was implemented as a modular pipeline consisting of a Sentence-BERT (SBERT) based semantic retriever, an optional reranker, and a lightweight language model for response formulation. Comparative evaluations were conducted against classical information retrieval models such as TF-IDF and BM25, alongside other semantic baselines including GloVe and BERT. While advanced models such as ColBERT were considered, their high computational complexity and inference overhead were found to make them unsuitable for deployment in the target setting. The architecture was deployed through a FastAPI interface, and experiments were performed on a labeled HBase bug report dataset. The results demonstrated that SBERT alone achieved a Recall@10 of approximately 56%, which improved to over 61% when augmented with a reranker. Classical models such as BM25 and TF-IDF yielded Recall@10 scores of around 55% and 51%, respectively, while shallow embedding-based methods remained below 30%. These findings indicate that SBERT-based retrieval provides a practical balance between performance and scalability for duplicate detection in bug triaging systems. Furthermore, fine-tuning SBERT on a specific dataset yielded a higher Recall@K, making it the preferred option for real-world deployment.	en_US
dc.language.iso	en	en_US
dc.subject	Semantic Retrieval	en_US
dc.subject	Bug Report Deduplication	en_US
dc.subject	Large Language Models	en_US
dc.title	DeepIssueMatch: A Token-Interaction and LLM-Based Framework for Bug Report Similarity and Triaging	en_US
dc.type	Thesis	en_US
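
The abstract reports its results as Recall@10. As a minimal sketch of how such a metric could be computed, the function below assumes the common "hit rate" definition for duplicate detection: a query counts as a hit if at least one of its known duplicates appears among the top K retrieved reports. The record does not state which definition the dissertation uses, so this definition, the function name `recall_at_k`, and the bug IDs are illustrative assumptions only.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    duplicates: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one known duplicate in the top-k results.

    Assumed definition (hit rate); the dissertation may define Recall@K differently.
    `ranked` maps a query report ID to its retrieved IDs, best match first.
    `duplicates` maps a query report ID to the set of its labeled duplicate IDs.
    """
    # Only score queries that actually have labeled duplicates.
    queries = [q for q in ranked if duplicates.get(q)]
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if duplicates[q] & set(ranked[q][:k]))
    return hits / len(queries)


# Hypothetical report IDs, for illustration only.
example_ranked = {
    "BUG-1": ["BUG-7", "BUG-3"],
    "BUG-2": ["BUG-9", "BUG-8"],
}
example_duplicates = {
    "BUG-1": {"BUG-3"},
    "BUG-2": {"BUG-5"},
}
example_score = recall_at_k(example_ranked, example_duplicates, k=2)
```

Under this definition, a reranker can only raise Recall@K if it draws candidates from a pool deeper than K, since reordering within the top K leaves the hit count unchanged.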