Abstract:
Violence is a recurring problem in public spaces where groups of people gather. When a violent incident occurs, the public lacks a prompt alert mechanism and cannot immediately obtain the evidence about the culprits and the affected persons or property that legal investigations require. Violent incidents unfold swiftly and last only a short time, which underlines the need for accurate, real-time violence detection (VD) systems capable of detecting both ongoing and future violence without human intervention. This project proposes a system that combines three architectures from the Transformer family, a recent attention-based technology in the computer vision domain. Video classification for fight detection is implemented using the Video Vision Transformer (ViViT), weapon classification from images is built using the Vision Transformer (ViT), and violent audio classification is to be achieved using the Audio Spectrogram Transformer (AST). Currently, the implemented ViViT model achieves an accuracy of 60.0% over 50 epochs of training on a V100 GPU. The ViT model achieves 99.53% overall accuracy across multiple classes, and the AST model achieves an overall accuracy of 59.06% for audio classification.