Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/3112
Title: Adversarial attack on audio-visual speech recognition model
Authors: Mishra, Saumya
Supervisors: Gupta, Puneet
Keywords: Computer Science and Engineering
Issue Date: 24-Sep-2021
Publisher: Department of Computer Science and Engineering, IIT Indore
Series/Report no.: MSR011
Abstract: The Audio-Visual Speech Recognition (AVSR) model is a promising solution for predicting the text corresponding to spoken words using both audio and face videos, especially when the audio is corrupted by noise. These models are widely used in applications such as biometric verification, assistance for hearing-impaired persons, speaker verification in multi-speaker scenarios, and event recognition in surveillance videos. However, these models are vulnerable to adversarial examples, which can have profound implications such as distress to differently-abled users and security breaches in surveillance systems. Adversarial examples are generated by adding imperceptible perturbations to clean samples with the intention of fooling machine learning models. It is difficult to attack an AVSR model since the audio and visual modalities complement each other. Furthermore, generating an adversarial example decreases the correlation between the audio and video features, which can be used to detect adversarial examples for the AVSR model. In this thesis, we introduce an end-to-end targeted attack, the Fooling Audio visuaL Speech rEcognition (FALSE) attack, that effectively performs an imperceptible adversarial attack while evading the existing synchronisation-based detection network (SyncNet). To the best of our knowledge, we are the first to perform an adversarial attack that simultaneously fools the AVSR model and SyncNet while introducing little distortion in the audio and face videos. The experimental results show that the proposed attack successfully fools the state-of-the-art AVSR model on a publicly available dataset while avoiding detection. Moreover, some well-known defences are easily circumvented while our FALSE attack maintains a 100% targeted attack success rate.
Keywords: Audio-Visual Speech Recognition; Cross-modality; Detection Network; Adversarial Attacks and Defenses.
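
The attack the abstract describes amounts to jointly optimising small perturbations on both modalities under three objectives: drive the AVSR model toward an attacker-chosen transcript, keep the audio-video correlation high enough to evade SyncNet, and keep the perturbations imperceptible. The following PyTorch-style sketch illustrates that idea only; the model interfaces (avsr_model.loss, syncnet.sync_distance), loss weights, and hyperparameters are hypothetical assumptions for illustration, not the thesis's actual implementation.

# Minimal sketch of the joint-objective targeted attack outlined in the
# abstract. The interfaces avsr_model.loss and syncnet.sync_distance are
# hypothetical placeholders, not the thesis's actual code.
import torch

def false_attack(audio, video, target, avsr_model, syncnet,
                 steps=100, lr=1e-3, lam_sync=1.0, lam_pert=0.1):
    # Perturbations on both modalities, optimised jointly.
    delta_a = torch.zeros_like(audio, requires_grad=True)
    delta_v = torch.zeros_like(video, requires_grad=True)
    opt = torch.optim.Adam([delta_a, delta_v], lr=lr)
    for _ in range(steps):
        adv_a, adv_v = audio + delta_a, video + delta_v
        # (1) targeted loss: push the AVSR output toward the chosen transcript
        loss_avsr = avsr_model.loss(adv_a, adv_v, target)
        # (2) sync loss: keep audio/video correlation high to evade SyncNet
        loss_sync = syncnet.sync_distance(adv_a, adv_v)
        # (3) imperceptibility: penalise large perturbations
        loss_pert = delta_a.norm() + delta_v.norm()
        loss = loss_avsr + lam_sync * loss_sync + lam_pert * loss_pert
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (audio + delta_a).detach(), (video + delta_v).detach()

In this sketch, the targeted term would be whatever sequence loss the AVSR model is trained with, and the sync term is what lets the adversarial pair slip past a synchronisation-based detector; the norm penalty is one simple stand-in for an imperceptibility constraint.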
URI: https://dspace.iiti.ac.in/handle/123456789/3112
Type of Material: Thesis_MS Research
Appears in Collections:Department of Computer Science and Engineering_ETD

Files in This Item:
File: MSR011_Saumya_Mishra_1904101010.pdf
Size: 1.67 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
