| dc.description.abstract |
The lack of accessible facilities for hearing-impaired people can create severe communication barriers. Most current lip-reading systems are developed for English and a few other widely spoken languages, so underrepresented languages such as Sinhala lag behind. This paper presents the development of LipVeda, a Sinhala lip-reading model that addresses this gap by providing better communication support for the hearing-impaired community. The project follows the Waterfall methodology to ensure structured development. A custom-made dataset was created following the format of the GRID dataset, containing videos along with their aligned transcripts. The deep learning approach combines CNN and LSTM models to analyse video sequences. LipVeda achieved a Word Error Rate (WER) of approximately 29.13%, meaning the model correctly predicted about 70.87% of the words in the test data. The Character Error Rate (CER) was notably low at 7.53%, indicating strong phoneme-level and character-level prediction accuracy. However, the Exact Match Accuracy, which counts only predictions entirely identical to the ground truth, was around 2%, which is expected for sequence models using CTC decoding. These results suggest that while minor errors prevent full-sequence matches, the model is highly effective at predicting individual components of speech, laying a strong foundation for further improvements. |
en_US |