Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10

Published in Proceedings of DSTC10 Workshop at AAAI-2022, 2022

The Audio-Visual Scene-Aware Dialog (AVSD) task was proposed in the Dialog System Technology Challenge (DSTC), where an AVSD dataset was collected and AVSD technologies were developed. An AVSD challenge track was hosted at both the 7th and 8th DSTCs (DSTC7, DSTC8). In these challenges, the best-performing systems relied heavily on human-generated descriptions of the video content, which were available in the datasets but would be unavailable in real-world applications. To promote further advancements for real-world applications, a third AVSD challenge is proposed at DSTC10, with two modifications: 1) the human-created description is unavailable at inference time, and 2) systems must demonstrate temporal reasoning by finding evidence from the video to support each answer. This paper introduces the new task, which includes temporal reasoning, and the new extension of the AVSD dataset for DSTC10, for which human-generated temporal reasoning data were collected. A baseline system was built using an AV-transformer, and the new datasets were released for the challenge. Finally, this paper reports the challenge results of 12 systems submitted to the AVSD task in DSTC10. The two systems using GPT-2-based multimodal transformers achieved the best performance in human rating, BLEU4, and CIDEr. The temporal reasoning performed by those systems outperformed the baseline method with temporal attention.

Citation: @article{horioverview, title={Overview of Audio Visual Scene-Aware Dialog with Reasoning Track for Natural Language Generation in DSTC10}, author={Hori, Chiori and Shah, Ankit Parag and Geng, Shijie and Gao, Peng and Cherian, Anoop and Hori, Takaaki and Le Roux, Jonathan and Marks, Tim K} }