Please use this identifier to cite or link to this item: https://dspace.iiti.ac.in/handle/123456789/4804
Full metadata record
DC Field | Value | Language
dc.contributor.author | Mishra, Saumya | en_US
dc.contributor.author | Gupta, Anup Kumar | en_US
dc.contributor.author | Gupta, Puneet | en_US
dc.date.accessioned | 2022-03-17T01:00:00Z | -
dc.date.accessioned | 2022-03-17T15:35:33Z | -
dc.date.available | 2022-03-17T01:00:00Z | -
dc.date.available | 2022-03-17T15:35:33Z | -
dc.date.issued | 2021 | -
dc.identifier.citation | Mishra, S., Gupta, A. K., & Gupta, P. (2021). DARE: Deceiving Audio–Visual speech recognition model. Knowledge-Based Systems, 232. doi:10.1016/j.knosys.2021.107503 | en_US
dc.identifier.issn | 0950-7051 | -
dc.identifier.other | EID(2-s2.0-85115893673) | -
dc.identifier.uri | https://doi.org/10.1016/j.knosys.2021.107503 | -
dc.identifier.uri | https://dspace.iiti.ac.in/handle/123456789/4804 | -
dc.description.abstract | Audio–Visual speech recognition (AVSR) is an effective way to predict the text corresponding to spoken words using both audio and face videos, even in a noisy environment. These models find extensive applications in areas such as assisting the hearing-impaired, biometric verification and speaker verification. Adversarial examples are created by adding imperceptible perturbations to the original input, causing deep learning models to produce an incorrect classification. Attacking an AVSR model is quite challenging, as the audio and visual modalities complement each other. Moreover, the correlation between audio and video features decreases while crafting an adversarial example, which can be exploited to detect the adversarial example. We propose an end-to-end targeted attack, Deceiving Audio–visual speech Recognition model (DARE), which performs an imperceptible adversarial attack while remaining undetected by the existing synchronisation-based detection network, SyncNet. To this end, we are the first to perform an adversarial attack that fools the AVSR model and SyncNet simultaneously. Experimental results on a publicly available dataset using a state-of-the-art AVSR model reveal that the proposed attack successfully deceives the AVSR model while remaining undetected. Furthermore, our DARE attack circumvents well-known defences while maintaining a 100% targeted attack success rate. © 2021 Elsevier B.V. | en_US
dc.language.iso | en | en_US
dc.publisher | Elsevier B.V. | en_US
dc.source | Knowledge-Based Systems | en_US
dc.subject | Character recognition | en_US
dc.subject | Deep learning | en_US
dc.subject | Speech recognition | en_US
dc.subject | Adversarial attack | en_US
dc.subject | Audiovisual speech recognition | en_US
dc.subject | Biometric verification | en_US
dc.subject | Cross modality | en_US
dc.subject | Detection networks | en_US
dc.subject | Hearing impaired | en_US
dc.subject | Noisy environment | en_US
dc.subject | Recognition models | en_US
dc.subject | Speaker verification | en_US
dc.subject | Spoken words | en_US
dc.subject | Audition | en_US
dc.title | DARE: Deceiving Audio–Visual speech Recognition model | en_US
dc.type | Journal Article | en_US
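
The abstract above describes adversarial examples as imperceptible perturbations that push a deep learning model towards an attacker-chosen output. For readers unfamiliar with how such targeted perturbations are crafted in general, the sketch below shows a generic projected-gradient-descent (PGD) style targeted attack. It is illustrative only and is not the DARE method from the paper: `model`, `x`, and `target` are hypothetical placeholders, and the actual DARE attack additionally keeps the perturbed audio and video consistent enough to evade the SyncNet synchronisation check, which this sketch does not model.

```python
# Minimal, generic sketch of a targeted adversarial attack (PGD-style).
# NOT the DARE attack from the paper; `model`, `x`, and `target` are
# hypothetical placeholders supplied by the caller.
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target, eps=0.03, alpha=0.005, steps=40):
    """Craft a targeted adversarial example inside an L-infinity ball of radius eps.

    model  -- a classifier returning logits of shape (batch, num_classes)
    x      -- input tensor in [0, 1], shape (batch, ...)
    target -- LongTensor of attacker-chosen class indices, shape (batch,)
    """
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Loss towards the attacker-chosen target class
        loss = F.cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Descend on the target loss (targeted attack), one signed step at a time
        x_adv = x_adv.detach() - alpha * grad.sign()
        # Project back into the eps-ball around the original input and the valid range
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1)
    return x_adv.detach()
```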
Appears in Collections: Department of Computer Science and Engineering

Files in This Item:
There are no files associated with this item.


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
