Audio-visual fine-tuning of audio-only ASR models

Published as arXiv preprint arXiv:2312.09369, 2023

This paper investigates using visual information from lip movements to fine-tune audio-only automatic speech recognition (ASR) models. We demonstrate that incorporating visual cues during fine-tuning can improve ASR performance, particularly in noisy conditions where the audio signal is degraded. Our approach leverages the complementary nature of the audio and visual modalities without requiring visual features at inference time.
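To make the core idea concrete, below is a minimal sketch, not the paper's implementation, of fine-tuning with a visual branch that can be dropped at inference. The module names, dimensions, additive fusion, and modality-dropout scheme are all illustrative assumptions; the paper's actual architecture and training procedure may differ.

```python
# Hypothetical sketch: an ASR model whose visual branch is used only during
# fine-tuning. Randomly dropping the visual stream keeps the audio path
# usable on its own, so inference can be audio-only.
import torch
import torch.nn as nn

class AudioVisualASR(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.audio_encoder = nn.GRU(80, dim, batch_first=True)  # e.g. log-mel input
        self.visual_proj = nn.Linear(512, dim)                  # e.g. lip-region embeddings
        self.classifier = nn.Linear(dim, vocab)                 # per-frame token logits

    def forward(self, audio, visual=None):
        h, _ = self.audio_encoder(audio)       # (B, T, dim)
        if visual is not None:                 # fuse visual cues when available
            h = h + self.visual_proj(visual)   # simple additive fusion (illustrative)
        return self.classifier(h)

model = AudioVisualASR()
audio = torch.randn(4, 100, 80)    # batch of log-mel audio features
visual = torch.randn(4, 100, 512)  # time-aligned visual embeddings

# Fine-tuning step: randomly drop the visual stream so the audio-only
# path is also trained, enabling audio-only inference later.
use_visual = torch.rand(()) > 0.5
logits = model(audio, visual if use_visual else None)

# Inference: audio only, no visual features required.
with torch.no_grad():
    logits = model(audio)
```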

Recommended citation:
@article{may2023audiovisual,
  title={Audio-Visual Fine-Tuning of Audio-Only ASR Models},
  author={May, Avner and Serdyuk, Dmitriy and Shah, Ankit Parag and Braga, Otavio and Siohan, Olivier},
  journal={arXiv preprint arXiv:2312.09369},
  year={2023}
}
Download paper: https://arxiv.org/abs/2312.09369