Abstract:
This research project addresses the pressing need for robust countermeasures against audio spoofing, focusing on the emerging field of audio deepfake detection (ADD). Despite recent strides in using self-supervised speech models for feature extraction, current approaches face limitations in handling multi-speaker tasks and struggle under cross-domain conditions, hindering their effectiveness in real-world scenarios. This project proposes a novel solution by integrating WavLM, a state-of-the-art self-supervised speech model, as a front-end feature extractor for ADD. Leveraging advanced training techniques such as masked speech prediction and denoising, WavLM is better able to capture non-ASR (non-automatic-speech-recognition) features, thereby enhancing the robustness of ADD systems. Moreover, to address the challenge of generalizing to unfamiliar target domains with limited source data, this project develops a framework for training and evaluating detection models on custom data, so that researchers can build their own models for domain-specific scenarios. After training for 10 epochs on a selection of datasets containing around 11,000 samples each, the model with the WavLM front end detects audio deepfakes with average EER and min t-DCF scores of 0.152237 and 0.36142, respectively, outperforming both comparable speech-based SSL models and benchmark models on the popular ASVspoof 2019/2021 and WaveFake datasets. The research proved successful, with the proposed model also being 11 times smaller than the next closest model.
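To make the "WavLM as a front-end feature extractor" design concrete, the sketch below shows one plausible way to wire a frozen WavLM backbone to a small bonafide/spoof classification head. This is a minimal illustration under assumptions, not the authors' exact pipeline: the checkpoint name (`microsoft/wavlm-base-plus`), mean pooling over frames, and the two-layer head are illustrative choices.

```python
# Minimal sketch (assumed, not the authors' exact model): frozen WavLM front end
# feeding a small binary (bonafide vs. spoof) classification head.
import torch
import torch.nn as nn
from transformers import WavLMModel


class WavLMSpoofDetector(nn.Module):
    def __init__(self, backbone="microsoft/wavlm-base-plus", hidden=768):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained(backbone)
        self.wavlm.requires_grad_(False)          # freeze the SSL front end
        self.head = nn.Sequential(                # hypothetical classification head
            nn.Linear(hidden, 128),
            nn.ReLU(),
            nn.Linear(128, 2),                    # logits: bonafide vs. spoof
        )

    def forward(self, waveform):                  # waveform: (batch, samples) at 16 kHz
        feats = self.wavlm(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = feats.mean(dim=1)                # average over time frames
        return self.head(pooled)


# Usage example with a one-second dummy input:
# logits = WavLMSpoofDetector()(torch.randn(1, 16000))
```

Freezing the backbone and training only the head is one common way to keep the trainable parameter count small, which is consistent with the abstract's emphasis on a compact model, though other fine-tuning regimes are equally possible.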