Video Understanding

Published: 2023-05-03

Updated 2023-05-03

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Authors: Junke Wang, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang

Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios. In this paper, we present our vision for multimodal and versatile video understanding and propose a prototype system, ChatVideo. Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit and employs various Video Foundation Models (ViFMs) to annotate their properties, e.g., appearance, motion, etc. All the detected tracklets are stored in a database and interact with the user through a database manager. We have conducted extensive case studies on different types of in-the-wild videos, which demonstrate the effectiveness of our method in answering various video-related problems. Our project is available at https://www.wangjunke.info/ChatVideo/
PDF work in progress
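
To make the tracklet-centric idea in the abstract more concrete, here is a minimal sketch of how annotated tracklets might be stored in a database and looked up on behalf of a user. It is not the ChatVideo implementation: the table schema, the column names, and the helper functions create_tracklet_db, add_tracklet, and tracklets_visible_at are all assumptions made for illustration, and the annotations are hand-written strings standing in for the output of video foundation models.

```python
import sqlite3


def create_tracklet_db():
    """Create an in-memory database with one row per tracklet (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE tracklets (
               id INTEGER PRIMARY KEY,
               category TEXT,        -- object class, e.g. 'dog'
               start_frame INTEGER,  -- first frame in which the tracklet appears
               end_frame INTEGER,    -- last frame in which the tracklet appears
               appearance TEXT,      -- appearance annotation (placeholder text)
               motion TEXT           -- motion annotation (placeholder text)
           )"""
    )
    return conn


def add_tracklet(conn, category, start_frame, end_frame, appearance, motion):
    """Insert one annotated tracklet and return its row id."""
    cur = conn.execute(
        "INSERT INTO tracklets (category, start_frame, end_frame, appearance, motion) "
        "VALUES (?, ?, ?, ?, ?)",
        (category, start_frame, end_frame, appearance, motion),
    )
    conn.commit()
    return cur.lastrowid


def tracklets_visible_at(conn, frame):
    """Return (category, appearance, motion) for every tracklet covering the given frame."""
    return conn.execute(
        "SELECT category, appearance, motion FROM tracklets "
        "WHERE start_frame <= ? AND end_frame >= ? ORDER BY id",
        (frame, frame),
    ).fetchall()


if __name__ == "__main__":
    conn = create_tracklet_db()
    add_tracklet(conn, "dog", 0, 120, "brown dog", "running")
    add_tracklet(conn, "person", 90, 300, "red jacket", "walking")
    print(tracklets_visible_at(conn, 100))
```

In a system of the kind the abstract describes, a database manager would translate a user's question into lookups like tracklets_visible_at; how ChatVideo actually does this is not specified in the text above.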


木子已

https://ipaper.today/2023/05/03/2023-05-03-shi-pin-li-jie/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting.
