Don’t twist my words (nay Google): Adversarial Automatic Speech Recognition

Saskia Lensink, Anne Merel Sternheim, Dominique Blok and Yori Kamphuis

There is growing demand for and reliance on automatic speech recognition (ASR) in a wide range of domains [1]. Examples include the automatic transcription of patient-doctor interactions to alleviate the administrative burden on medical staff, the automatic transcription of court cases or interrogations of criminal suspects, and diverse applications of voice assistants in cars, customer service, and smart devices such as Google Home. Although these technologies offer many opportunities, they also carry severe risks with potentially disruptive consequences. One of those risks is the possibility of adversarial attacks, in which a targeted manipulation of the audio disrupts or even completely changes the resulting transcription. An attacker could add specifically crafted noise to speech audio so that the ASR system produces a completely different transcription. Such attacks have been demonstrated for images, but are possible for audio as well [2]. Imagine telling your Google device to play the latest hits, but an attacker makes the device believe that you ordered it to unlock your car door… Adversarial attacks on speech recognition systems could lead to incorrect medical notes, incorrect records of hearings or interrogations, or misleading and malicious instructions for smart devices.

Given the diverse set of potential use cases for speech technologies, any vulnerability to adversarial attacks could have a large impact. It is therefore crucial to have an overview of the potential risks and mitigation strategies. We will present findings of our recent work on adversarial attacks on automatic speech recognition for Dutch. In particular, we will
1. describe our deployment of an existing open-source ASR tool for Dutch (Kaldi [3] or DeepSpeech [4]),
2. demonstrate how to attack the ASR system in such a way that a speech signal is transcribed in any way the attacker sees fit, starting with the approach outlined in [2], and
3. explore techniques and tools to counter these attacks.
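
To give a feel for what "specifically crafted noise" means, the sketch below runs a targeted attack against a toy stand-in for an ASR system. It is an illustration under strong assumptions, not the attack from [2] and not our actual setup: the "recognizer" is a random linear classifier over three hypothetical commands, the "audio" is random noise, and the names `targeted_perturbation`, `eps`, `lr` and `steps` are invented for this example. The idea it shares with real attacks is that the attacker follows the gradient of the model's loss toward a chosen target output while keeping the perturbation within a small amplitude bound.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def targeted_perturbation(x, W, b, target, eps=0.01, lr=0.001, steps=100):
    """Search for a perturbation delta with max|delta| <= eps that pushes
    the toy linear model toward the class index `target`."""
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target] = 1.0
    for _ in range(steps):
        probs = softmax(W @ (x + delta) + b)
        # Gradient of the cross-entropy loss (w.r.t. the target) on the input.
        grad = W.T @ (probs - onehot)
        # Signed gradient step, then clip back into the amplitude budget.
        delta = np.clip(delta - lr * np.sign(grad), -eps, eps)
    return delta

# Hypothetical setup: 16000 "samples" and 3 possible "commands".
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16000))      # toy model weights (assumption)
b = np.zeros(3)
x = 0.1 * rng.normal(size=16000)     # stand-in for a speech signal
eps = 0.01                           # perturbation amplitude budget

original = int(np.argmax(W @ x + b))
target = (original + 1) % 3          # attacker picks a different command
delta = targeted_perturbation(x, W, b, target, eps=eps)
adversarial = int(np.argmax(W @ (x + delta) + b))

print("original:", original, "target:", target, "after attack:", adversarial)
print("largest perturbation amplitude:", float(np.max(np.abs(delta))))
```

In this toy setting the perturbation amplitude is capped at a tenth of the signal's standard deviation, yet it is enough to flip the predicted command to the attacker's choice. A real ASR system is nonlinear and outputs whole sentences, so an actual attack is considerably more involved, but the optimization loop has the same overall shape.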

[1] Lensink, S. E., & Greeff, J. (2021, October). Landschap Nederlandstalige Taal- en Spraaktechnologie.
[2] Carlini, N., & Wagner, D. (2018, May). Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW) (pp. 1-7). IEEE.
[3] Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., … & Vesely, K. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
[4] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., … & Ng, A. Y. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.