Video Understanding

Published: 2022-07-19

Updated 2022-07-19

Prompting Visual-Language Models for Efficient Video Understanding

Authors: Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie

Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model, and exploit its powerful ability for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed as continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite optimising significantly fewer parameters.
ECCV 2022. Project page: https://ju-chen.github.io/efficient-prompt/
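
To make the data flow in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It only illustrates the two ideas the abstract names: a few continuous prompt vectors placed around the class-name tokens before the text encoder, and a temporal module applied on top of frame-wise visual features. Everything else is an assumption made for illustration: the function names are hypothetical, the "frozen text encoder" is replaced by simple mean pooling, the lightweight Transformer is reduced to one parameter-free self-attention pass, all vectors are random rather than trained, and there is no training loop.

```python
import math
import random

DIM = 8          # embedding width (illustrative; real I-VL models are far wider)
NUM_PROMPT = 4   # prompt vectors on each side of the class name (assumed value)

rng = random.Random(0)


def rand_vec(dim=DIM):
    """Random vector standing in for a token, prompt, or frame embedding."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def normalise(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def frozen_text_encoder(token_embeddings):
    # Stand-in for the frozen pre-trained text encoder: mean pooling only.
    return normalise(mean_pool(token_embeddings))


def build_class_embedding(prefix, class_tokens, suffix):
    # [prompt prefix] + [class-name tokens] + [prompt suffix] -> text encoder.
    # In a trained system only prefix and suffix would receive gradients.
    return frozen_text_encoder(prefix + class_tokens + suffix)


def temporal_attention(frame_features):
    # Simplified stand-in for the temporal Transformer: each frame attends
    # to all frames (no learned weights), then the results are mean pooled.
    scale = math.sqrt(len(frame_features[0]))
    mixed = []
    for query in frame_features:
        weights = softmax([dot(query, key) / scale for key in frame_features])
        mixed.append([
            sum(w * key[i] for w, key in zip(weights, frame_features))
            for i in range(len(query))
        ])
    return normalise(mean_pool(mixed))


def classify(video_embedding, class_embeddings, temperature=0.07):
    # Cosine similarity between video and class embeddings, scaled by an
    # illustrative temperature, turned into class probabilities.
    logits = [dot(video_embedding, c) / temperature for c in class_embeddings]
    return softmax(logits)


# Hypothetical example: two action classes and a 16-frame video.
prefix = [rand_vec() for _ in range(NUM_PROMPT)]
suffix = [rand_vec() for _ in range(NUM_PROMPT)]
class_tokens = {"archery": [rand_vec()], "bowling": [rand_vec(), rand_vec()]}
class_embeddings = [
    build_class_embedding(prefix, tokens, suffix)
    for tokens in class_tokens.values()
]
frames = [rand_vec() for _ in range(16)]  # stand-in for frame-wise image features
video_embedding = temporal_attention(frames)
probs = classify(video_embedding, class_embeddings)
print(dict(zip(class_tokens, probs)))
```

Because the class names enter only as tokens between the shared prompt vectors, a new class can be scored by building one more class embedding, without changing the visual side. That is one way to read how the same format could cover the closed-set, few-shot, and zero-shot settings listed in the abstract.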


木子已

https://ipaper.today/2022/07/19/2022-07-19-shi-pin-li-jie/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 when reposting.
