Abstract:
"This research introduces an innovative approach that integrates Structured Indexing and
Retrieval (SIR) techniques with Large Language Models (LLMs) to improve codebase
retrieval and interaction. By employing Abstract Syntax Tree (AST) parsing, this
method maintains the structural integrity of code, enabling a rich representation that captures
both static and dynamic aspects of software projects. This structured representation facilitates
the extraction of key software elements and their relationships, which are efficiently queried
using a sequenced database. An LLM then interprets these structured inputs, improving
context-awareness and precision in responding to user queries.
The approach departs from traditional methods by treating codebases not as mere text but as
complex structures that must be understood at both macro and micro levels. Combining AST
parsing with LLMs makes retrieval more intuitive and accurate: the system improves the
relevance of its responses as well as the clarity and utility of the retrieved information.
Preliminary evaluations on Python codebases indicate that the system is effective, with
strong results for relevance, precision, clarity, and utility. These results suggest that
the proposed method is a promising step toward intelligent codebase management and
interaction.
Through its novel use of SIR techniques and LLMs, this research paves the way for more
efficient and accurate codebase management solutions."
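
To make the AST parsing and querying steps described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It uses Python's standard ast module to extract classes, functions, and call relationships from a small sample, then loads them into an in-memory SQLite database as a stand-in for the database the abstract mentions. The names SAMPLE_SOURCE, extract_elements, and build_index, as well as the table layout, are assumptions made for illustration only.

```python
import ast
import sqlite3

# Hypothetical sample code to index; any Python source string would work.
SAMPLE_SOURCE = '''
class Greeter:
    def greet(self, name):
        return format_greeting(name)

def format_greeting(name):
    return "Hello, " + name.strip()
'''


def extract_elements(source):
    """Return (definitions, calls) found in the source via AST parsing."""
    tree = ast.parse(source)
    definitions = []  # (kind, qualified_name, line)
    calls = []        # (caller_qualified_name, callee_name)

    class Visitor(ast.NodeVisitor):
        def __init__(self):
            self.scope = []  # names of the enclosing classes/functions

        def _enter(self, kind, node):
            qualified = ".".join(self.scope + [node.name])
            definitions.append((kind, qualified, node.lineno))
            self.scope.append(node.name)
            self.generic_visit(node)
            self.scope.pop()

        def visit_ClassDef(self, node):
            self._enter("class", node)

        def visit_FunctionDef(self, node):
            self._enter("function", node)

        def visit_Call(self, node):
            func = node.func
            if isinstance(func, ast.Name):         # plain call: f(x)
                callee = func.id
            elif isinstance(func, ast.Attribute):  # method call: obj.f(x)
                callee = func.attr
            else:
                callee = None
            # Record only calls made inside a class or function body.
            if callee and self.scope:
                calls.append((".".join(self.scope), callee))
            self.generic_visit(node)

    Visitor().visit(tree)
    return definitions, calls


def build_index(definitions, calls):
    """Load the extracted elements into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE definitions (kind TEXT, name TEXT, line INTEGER)")
    conn.execute("CREATE TABLE calls (caller TEXT, callee TEXT)")
    conn.executemany("INSERT INTO definitions VALUES (?, ?, ?)", definitions)
    conn.executemany("INSERT INTO calls VALUES (?, ?)", calls)
    return conn


definitions, calls = extract_elements(SAMPLE_SOURCE)
index = build_index(definitions, calls)

# Structured query: which definitions call format_greeting?
callers = [row[0] for row in index.execute(
    "SELECT caller FROM calls WHERE callee = ?", ("format_greeting",))]
print(callers)
```

In a pipeline of the kind the abstract outlines, rows returned by such a query would presumably be passed to the LLM as structured context in place of raw text matches. The sketch covers static structure only and ignores details such as async functions and name resolution across modules.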