Abstract:
The widespread use of code-mixed Romanized Sinhala (also known as 'Singlish') in digital
communication has exposed a gap in current text-to-speech (TTS) systems, which are limited to
native Sinhala script. Existing TTS models cannot handle informal spelling variations, shorthand
notations, or code-mixed Romanized words.
This work proposes a fine-tuned NLLB (No Language Left Behind) model for back-transliteration
of Romanized Sinhala to native Sinhala script, together with a VITS (Variational Inference with
adversarial learning for end-to-end Text-to-Speech) model for high-quality speech synthesis. The
system is implemented with a modular, scalable architecture and a user-friendly Graphical User
Interface (GUI) for easy text input and speech generation.
Evaluation with Word Error Rate (WER), Character Error Rate (CER), and BLEU shows that the
NLLB model reaches a 21% WER, a 6% CER, and a BLEU score of 58%, while the VITS TTS model
achieves a Mean Opinion Score (MOS) of 3.8 for the male speaker and 2.6 for the female speaker.
Nevertheless, code-mixed sentences remain difficult to handle accurately, and voice naturalness
remains a challenge for the female voice. These findings contribute to the progress of Sinhala
speech technology and help bridge the accessibility gap for users who rely on TTS technology.
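
For readers unfamiliar with the transliteration metrics named above, the following is a minimal sketch of how WER and CER are conventionally computed: the edit distance between a reference and a hypothesis, divided by the reference length. It is not the evaluation code used in this study; the function names and example strings are hypothetical, and CER is computed here over Unicode code points, which is an assumption (Sinhala text could also be scored over grapheme clusters).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (counts substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.
    Assumption: characters are Unicode code points, not grapheme clusters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, for illustration only.
    reference = "mama gedara yanawa"
    hypothesis = "mama gedera yanawa"
    print(f"WER = {wer(reference, hypothesis):.2f}")
    print(f"CER = {cer(reference, hypothesis):.2f}")
```

Both metrics are error rates, so lower is better, and either can exceed 1.0 when the hypothesis contains many insertions.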