Video Understanding


Updated 2023-12-24

RTQ: Rethinking Video-language Understanding Based on Image-text Model

Authors:Xiao Wang, Yaoyu Li, Tian Gan, Zheng Zhang, Jingjing Lv, Liqiang Nie

Recent advancements in video-language understanding have been established on the foundation of image-text models, resulting in promising outcomes due to the shared knowledge between images and videos. However, video-language understanding presents unique challenges due to the inclusion of highly complex semantic details, which result in information redundancy, temporal dependency, and scene complexity. Current techniques have only partially tackled these issues, and our quantitative analysis indicates that some of these methods are complementary. In light of this, we propose a novel framework called RTQ (Refine, Temporal model, and Query), which addresses these challenges simultaneously. The approach involves refining redundant information within frames, modeling temporal relations among frames, and querying task-specific information from the videos. Remarkably, our model demonstrates outstanding performance even in the absence of video-language pre-training, and the results are comparable with or superior to those achieved by state-of-the-art pre-training methods. Code is available at https://github.com/SCZwangxiao/RTQ-MM2023.
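Below is a minimal, self-contained sketch of the three stages named in the abstract (Refine, Temporal model, Query), intended only to make the pipeline concrete. The module names, dimensions, and the token-scoring heuristic are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Hypothetical sketch of an RTQ-style pipeline: refine tokens within frames,
# model temporal relations across frames, then query task-specific information.
import torch
import torch.nn as nn


class RTQSketch(nn.Module):
    def __init__(self, dim=512, num_heads=8, keep_tokens=16, num_queries=4):
        super().__init__()
        # Refine: score patch tokens and keep only the most salient ones per frame.
        self.token_scorer = nn.Linear(dim, 1)
        self.keep_tokens = keep_tokens
        # Temporal model: self-attention across frame-level features.
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Query: learnable task queries cross-attend to the video representation.
        self.task_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, frames, patches, dim) from a frozen image encoder.
        b, t, p, d = frame_tokens.shape
        # --- Refine: drop redundant tokens within each frame. ---
        scores = self.token_scorer(frame_tokens).squeeze(-1)           # (b, t, p)
        idx = scores.topk(self.keep_tokens, dim=-1).indices            # (b, t, k)
        kept = torch.gather(
            frame_tokens, 2, idx.unsqueeze(-1).expand(-1, -1, -1, d))  # (b, t, k, d)
        # --- Temporal model: attend over per-frame features. ---
        frame_feat = self.temporal_encoder(kept.mean(dim=2))           # (b, t, d)
        # --- Query: pull task-specific information from the video. ---
        queries = self.task_queries.unsqueeze(0).expand(b, -1, -1)     # (b, q, d)
        out, _ = self.cross_attn(queries, frame_feat, frame_feat)      # (b, q, d)
        return out


if __name__ == "__main__":
    video = torch.randn(2, 8, 196, 512)   # 2 clips, 8 frames, 196 patch tokens
    print(RTQSketch()(video).shape)       # torch.Size([2, 4, 512])
```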
PDF Accepted by ACM MM 2023 as an oral presentation

Click here to view paper screenshots

SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models

Authors:Lee Hyun, Kim Sung-Bin, Seungju Han, Youngjae Yu, Tae-Hyun Oh

Despite recent advances in artificial intelligence, building social intelligence remains a challenge. Among social signals, laughter is one of the distinctive expressions that occur during social interactions between humans. In this work, we tackle a new challenge for machines to understand the rationale behind laughter in video: Video Laugh Reasoning. We introduce this new task, which aims to explain why people laugh in a particular video, along with a dataset for it. Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh. We propose a baseline by leveraging the reasoning capacity of large language models (LLMs) with a textual video representation. Experiments show that our baseline can generate plausible explanations for laughter. We further investigate the scalability of our baseline by probing other video understanding tasks and in-the-wild videos. We release our dataset, code, and model checkpoints at https://github.com/SMILE-data/SMILE.
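The following is a rough sketch of the kind of baseline the abstract describes: the video is first turned into text (for example, frame captions and a speech transcript) and an LLM is then prompted to explain the laughter. The prompt template and the `call_llm` stub are hypothetical placeholders; the actual SMILE baseline is in the linked repository.

```python
# Hypothetical laugh-reasoning baseline: textual video representation + LLM prompt.
from dataclasses import dataclass
from typing import List


@dataclass
class TextualVideo:
    frame_captions: List[str]   # e.g. produced by an off-the-shelf captioner
    transcript: str             # e.g. produced by an ASR model


def build_laugh_reasoning_prompt(video: TextualVideo) -> str:
    # Assemble the textual video representation into a single prompt.
    captions = "\n".join(f"- {c}" for c in video.frame_captions)
    return (
        "You are given a textual description of a video clip in which people laugh.\n"
        f"Visual captions:\n{captions}\n"
        f"Transcript:\n{video.transcript}\n"
        "Explain in one or two sentences why the people in the video are laughing."
    )


def call_llm(prompt: str) -> str:
    # Placeholder: swap in any chat/completions API or a local model here.
    raise NotImplementedError


if __name__ == "__main__":
    clip = TextualVideo(
        frame_captions=["A man slips on a banana peel", "Friends point and smile"],
        transcript="Oh no, did you see that? Classic!",
    )
    print(build_laugh_reasoning_prompt(clip))
```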
PDF 18 pages, 14 figures

Click here to view paper screenshots

Artificial intelligence optical hardware empowers high-resolution hyperspectral video understanding at 1.2 Tb/s

Authors:Maksim Makarenko, Qizhou Wang, Arturo Burguete-Lopez, Silvio Giancola, Bernard Ghanem, Luca Passone, Andrea Fratalocchi

Foundation models, exemplified by GPT technology, are discovering new horizons in artificial intelligence by executing tasks beyond their designers’ expectations. While the present generation provides fundamental advances in understanding language and images, the next frontier is video comprehension. Progress in this area must overcome the 1 Tb/s data rate demanded to grasp real-time multidimensional video information. This speed limit lies well beyond the capabilities of the existing generation of hardware, imposing a roadblock to further advances. This work introduces a hardware-accelerated integrated optoelectronic platform for multidimensional video understanding in real time. The platform combines artificial intelligence hardware, processing information optically, with state-of-the-art machine vision networks, resulting in a data processing speed of 1.2 Tb/s with hundreds of frequency bands and megapixel spatial resolution at video rates. Such performance, validated on the AI tasks of video semantic segmentation and object understanding in indoor and aerial applications, surpasses the speed of the closest technologies with similar spectral resolution by three to four orders of magnitude. This platform opens up new avenues for research in real-time AI video understanding of multidimensional visual information, helping to empower future human-machine interactions and developments in cognitive processing.
PDF

Click here to view paper screenshots

Text-Conditioned Resampler For Long Form Video Understanding

Authors:Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, Federico Tombari

Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve any given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state of the art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine which tasks require longer video contexts and can thus be used effectively for further evaluation of long-range video models.
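Below is a minimal sketch of a text-conditioned resampler in the spirit of TCR: a small set of learnable queries is conditioned on text tokens, cross-attends into a long sequence of frozen visual features, and yields a fixed number of tokens for an LLM. The dimensions, layer counts, and conditioning scheme are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical text-conditioned resampler: compress many frames into a few
# query tokens, guided by a text condition, before handing them to an LLM.
import torch
import torch.nn as nn


class TextConditionedResampler(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=768, dim=512, num_queries=32, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)   # frozen-encoder features -> shared dim
        self.txt_proj = nn.Linear(txt_dim, dim)   # text-condition tokens -> shared dim
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)       # to the LLM input space (assumed)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (batch, frames * patches, vis_dim) from a frozen visual encoder
        # text_tokens:   (batch, text_len, txt_dim) encoding the task or question
        b = visual_tokens.size(0)
        vis = self.vis_proj(visual_tokens)
        txt = self.txt_proj(text_tokens)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Condition the queries on the text by letting them attend to it.
        q = q + self.text_attn(q, txt, txt)[0]
        # Cross-attend into the (long) sequence of visual tokens.
        q = q + self.cross_attn(q, vis, vis)[0]
        return self.out_proj(q)   # a fixed, small set of tokens for the LLM


if __name__ == "__main__":
    vis = torch.randn(1, 128 * 16, 1024)   # 128 frames, 16 visual tokens each
    txt = torch.randn(1, 12, 768)
    print(TextConditionedResampler()(vis, txt).shape)   # torch.Size([1, 32, 512])
```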
PDF

Click here to view paper screenshots

Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 as the source when reposting!