Conformer is All You Need for Visual Speech Recognition

Published in ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 2024

Visual speech recognition (VSR) aims to transcribe speech from silent video of a speaker’s face. In this work, we demonstrate that the Conformer architecture, which combines convolutions with self-attention, is highly effective for VSR tasks. Our experiments show that a properly configured Conformer model achieves state-of-the-art performance on standard VSR benchmarks, suggesting that the architectural innovations developed for audio speech recognition transfer well to the visual domain.

Recommended citation:
@inproceedings{chang2024conformer,
  title={Conformer is All You Need for Visual Speech Recognition},
  author={Chang, Oscar and Liao, Hank and Serdyuk, Dmitriy and Shah, Ankit and Siohan, Olivier},
  booktitle={ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing},
  pages={10501--10505},
  year={2024},
  organization={IEEE}
}