SchemaRAG: Semantic dictionary based Agentic LLM for Text-to-SQL

Nugara, Binura

dc.contributor.author	Nugara, Binura
dc.date.accessioned	2026-03-10T07:51:35Z
dc.date.available	2026-03-10T07:51:35Z
dc.date.issued	2025
dc.identifier.citation	Nugara, Binura (2025) SchemaRAG: Semantic dictionary based Agentic LLM for Text-to-SQL. MSc. Dissertation, Informatics Institute of Technology	en_US
dc.identifier.issn	20211182
dc.identifier.uri	http://dlib.iit.ac.lk/xmlui/handle/123456789/2896
dc.description.abstract	Text-to-SQL generates SQL queries and retrieves data from databases using natural language. It helps non-technical users in domains such as education, healthcare, and finance access database information for decision making, reporting, and analytics without the technical expertise needed to write SQL queries. With the rapid development of Large Language Models (LLMs) in recent years, the number of Text-to-SQL studies has increased, and the performance of these models has greatly improved owing to the advanced reasoning capabilities of LLMs. However, due to the inherent ambiguity of natural language and the complexity of schemas and the required SQL queries, existing Text-to-SQL solutions suffer from issues such as hallucinations, a lack of domain knowledge, and an inability to accurately generate complex queries with join operations. To address these limitations, this study implemented an Agentic LLM system for Text-to-SQL tasks that uses Retrieval-Augmented Generation (RAG) to enhance the LLM’s domain knowledge. The Text-to-SQL workflow uses four LLM agents. The process starts with an Assistant Agent that captures the user’s natural language question and delegates schema retrieval, query generation, and query validation to the Retrieval Agent, Query Generation Agent, and Validation Agent respectively, in that order. The schema knowledge is stored in a database as vector embeddings, which are used to fetch schema information relevant to the user’s question based on semantic similarity. SchemaRAG achieved an execution accuracy (EX) of 73.9% during the testing phase. The proposed system performs reasonably well considering the minimal effort needed for SQL query generation compared to existing work, which is evident through evaluation feedback.
The proposed system shows great potential in real-world use cases with its usability improvements, such as minimal reliance on user interaction and access for non-technical users through automatic schema knowledge retrieval.	en_US
dc.language.iso	en	en_US
dc.subject	Large Language Models	en_US
dc.subject	Retrieval Augmented Generation	en_US
dc.subject	Natural Language Processing	en_US
dc.title	SchemaRAG: Semantic dictionary based Agentic LLM for Text-to-SQL	en_US
dc.type	Thesis	en_US
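
The four-agent workflow described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the schema dictionary, the function names, the bag-of-words "embedding" (standing in for a real embedding model and vector database), and the stubbed query generator (standing in for an LLM call) are all assumptions made for illustration. SQLite is used only so that the validation step has something concrete to check against.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical semantic dictionary: table name -> natural language description.
SCHEMA_DOCS = {
    "students": "students table: id, name, enrollment year of each student",
    "courses": "courses table: id, title, credits of each course",
    "payments": "payments table: id, student_id, amount, paid_on for fee payments",
}


def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_agent(question, top_k=1):
    """Return the names of the top_k schema entries most similar to the question."""
    q = embed(question)
    ranked = sorted(
        SCHEMA_DOCS,
        key=lambda name: cosine(q, embed(SCHEMA_DOCS[name])),
        reverse=True,
    )
    return ranked[:top_k]


def query_generation_agent(question, tables):
    """Placeholder for an LLM call; returns a trivial query on the first table."""
    return f"SELECT * FROM {tables[0]}"


def validation_agent(sql, conn):
    """Check that the query compiles against the database without fetching data."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False


def assistant_agent(question, conn):
    """Delegate to retrieval, generation, and validation, in that order."""
    tables = retrieval_agent(question)
    sql = query_generation_agent(question, tables)
    return sql if validation_agent(sql, conn) else None
```

Calling `assistant_agent(question, conn)` with an open `sqlite3` connection returns the generated SQL string when validation passes and `None` otherwise. Keeping each agent as a separate function mirrors the delegation order given in the abstract and makes it straightforward to swap the toy retrieval or the stubbed generator for real components.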