Shutao Li

Temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, the task aims to locate the frame span in the video that semantically answers the question. Existing methods tend to formulate the TAGV task with a visual span-based predictor that matches the video frame span queried by the text question. However, because the semantic features of the textual question and the visual answer are only weakly correlated, such visual span-based predictors perform poorly on the TAGV task. In this paper, we propose a visual-prompt text span localizing (VPTSL) method, which introduces timestamped subtitles to perform text span localization. Specifically, we design a text span-based predictor in which the input text question, video subtitles, and visual prompt features are jointly learned with a pre-trained language model to enhance the joint semantic representations. As a result, the TAGV task is reformulated as visual-prompt subtitle span prediction that matches the visual answer. Extensive experiments on three instructional video datasets, namely MedVidQA, TutorialVQA, and VehicleVQA, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, demonstrating the effectiveness of the proposed visual prompt and text span-based predictor. All experimental code and datasets are open-sourced at https://github.com/wengsyx/VPTSL.
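To make the formulation concrete, the following is a minimal PyTorch sketch of the general idea described above: visual features are projected into "prompt" embeddings, prepended to the token embeddings of the question plus subtitles, encoded, and then mapped to start/end logits over the subtitle tokens. It is not the authors' implementation (see the linked repository for that); the class name `VisualPromptTextSpanPredictor`, the generic Transformer encoder standing in for the pre-trained language model, and all dimensions and prompt lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisualPromptTextSpanPredictor(nn.Module):
    """Hypothetical sketch of a visual-prompt text span predictor.

    Visual features are projected into prompt embeddings and prepended to
    the token embeddings of [question; subtitles]; an encoder (a stand-in
    for the pre-trained language model used in the paper) yields contextual
    states from which start/end logits over subtitle tokens are predicted.
    """

    def __init__(self, vocab_size=30522, hidden=256, n_prompt=8, visual_dim=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # Generic Transformer encoder as a placeholder for the pre-trained LM.
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.n_prompt = n_prompt
        # Project pooled video features into n_prompt "visual prompt" vectors.
        self.visual_proj = nn.Linear(visual_dim, n_prompt * hidden)
        # Two logits per token: span start and span end.
        self.span_head = nn.Linear(hidden, 2)

    def forward(self, token_ids, visual_feat):
        # token_ids: (B, L) ids of [question; subtitles]
        # visual_feat: (B, visual_dim) pooled video features
        B, L = token_ids.shape
        prompts = self.visual_proj(visual_feat).view(B, self.n_prompt, -1)
        x = torch.cat([prompts, self.tok_emb(token_ids)], dim=1)
        h = self.encoder(x)[:, self.n_prompt:]            # drop prompt positions
        start_logits, end_logits = self.span_head(h).unbind(-1)
        return start_logits, end_logits                   # each (B, L)


# Toy usage with random inputs.
model = VisualPromptTextSpanPredictor()
ids = torch.randint(0, 30522, (2, 128))
vis = torch.randn(2, 1024)
start, end = model(ids, vis)
print(start.shape, end.shape)  # torch.Size([2, 128]) torch.Size([2, 128])
```

The predicted subtitle span would then be mapped back to the corresponding timestamped video frames, which is how a text span prediction can answer a temporal grounding query.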