dc.description.abstract |
"Many people around the world use open-domain LLM-based conversational agents such as ChatGPT and Bard in their day-to-day lives. However, most of these chatbots mainly serve English speakers. A community in Sri Lanka knows only their official language, Sinhala, and cannot experience the full potential of these LLM-based chatbots because the backbone LLMs are not well adapted to low-resource languages such as Sinhala. This community should also be able to benefit from the capabilities of LLMs. Only then can LLMs truly be considered socialized in Sri Lanka.
The author of this study attempts to give LLMs the ability to comprehend and generate Romanized Sinhala using the Retrieval-Augmented Generation (RAG) approach. The comprehension and generation capabilities of various open-source LLMs are analyzed. Experiments with Parameter-Efficient Fine-Tuning (PEFT) to adapt LLMs to Romanized Sinhala were also conducted for comparison. In this research, the author aims to achieve state-of-the-art results in adapting LLMs to understand and generate Romanized Sinhala.
A RAG architecture with the Gemini-Pro model as the text generation model gives the best BLEU and ROUGE scores for Romanized Sinhala content. A conversational dataset for Romanized Sinhala was also published for public use as part of this study." |
en_US |