Video Understanding


Updated 2022-11-30

Egocentric scene context for human-centric environment understanding from video

Authors:Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman

First-person video highlights a camera-wearer’s activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and only capture what is directly seen. We present an approach that links egocentric video and camera pose over time by learning representations that are predictive of the camera-wearer’s (potentially unseen) local surroundings to facilitate human-centric environment understanding. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on real-world videos of house tours from unseen environments. We show that by grounding videos in their physical environment, our models surpass traditional scene classification models at predicting which room a camera-wearer is in (where frame-level information is insufficient), and can leverage this grounding to localize video moments corresponding to environment-centric queries, outperforming prior methods. Project page: http://vision.cs.utexas.edu/projects/ego-scene-context/
PDF

Click here to view paper screenshots
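
As a rough illustration of the idea (not the authors' implementation), the sketch below fuses per-frame visual features with camera poses and accumulates them over time to predict a property of the surrounding environment, such as the current room. All module names, dimensions, and the pose format are assumptions.

```python
# Hypothetical sketch: fuse frame features and camera pose over time,
# then predict a label for the (possibly unseen) local surroundings.
import torch
import torch.nn as nn

class EgoEnvPredictor(nn.Module):
    def __init__(self, feat_dim=512, pose_dim=7, hidden=256, num_rooms=10):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + pose_dim, hidden)   # join appearance + pose
        self.gru = nn.GRU(hidden, hidden, batch_first=True)  # accumulate context over time
        self.room_head = nn.Linear(hidden, num_rooms)        # e.g. "which room am I in?"

    def forward(self, frame_feats, poses):
        # frame_feats: (B, T, feat_dim), poses: (B, T, pose_dim)
        x = torch.relu(self.fuse(torch.cat([frame_feats, poses], dim=-1)))
        h, _ = self.gru(x)
        return self.room_head(h[:, -1])  # prediction after seeing the whole clip

model = EgoEnvPredictor()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 16, 7))
print(logits.shape)  # torch.Size([2, 10])
```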

Point Primitive Transformer for Long-Term 4D Point Cloud Video Understanding

Authors:Hao Wen, Yunze Liu, Jingwei Huang, Bo Duan, Li Yi

This paper proposes a 4D backbone for long-term point cloud video understanding. A typical way to capture spatial-temporal context is using 4D convolutions or transformers without hierarchy. However, those methods are neither effective nor efficient enough due to camera motion, scene changes, sampling patterns, and the complexity of 4D data. To address those issues, we leverage the primitive plane as a mid-level representation to capture the long-term spatial-temporal context in 4D point cloud videos and propose a novel hierarchical backbone named Point Primitive Transformer (PPTr), which is mainly composed of intra-primitive point transformers and primitive transformers. Extensive experiments show that PPTr outperforms the previous state of the art on different tasks.
PDF

Click here to view paper screenshots
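
The hierarchy described above (attention among points inside each primitive, then attention among primitive-level tokens) can be sketched roughly as below. This is a hedged approximation, not the official PPTr code; the pooling, dimensions, and layer counts are assumptions.

```python
# Hypothetical sketch of a two-level hierarchy: attention among points inside each
# primitive, then attention among primitive tokens across space and time.
import torch
import torch.nn as nn

class HierarchicalPointBackbone(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        enc = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.point_tf = nn.TransformerEncoder(enc(), num_layers=2)      # intra-primitive
        self.primitive_tf = nn.TransformerEncoder(enc(), num_layers=2)  # across primitives

    def forward(self, points):
        # points: (B, P, N, dim) = batch, primitives, points per primitive, features
        B, P, N, D = points.shape
        x = self.point_tf(points.reshape(B * P, N, D))        # attend within each primitive
        prim_tokens = x.mean(dim=1).reshape(B, P, D)          # pool to one token per primitive
        return self.primitive_tf(prim_tokens)                 # long-range spatio-temporal context

out = HierarchicalPointBackbone()(torch.randn(2, 8, 64, 128))
print(out.shape)  # torch.Size([2, 8, 128])
```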

Two-Stream Transformer Architecture for Long Video Understanding

Authors:Edward Fish, Jon Weinbren, Andrew Gilbert

Pure vision transformer architectures are highly effective for short video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource intensive and suffer from data inefficiencies. Long-form video understanding tasks amplify the data and memory efficiency problems of transformers, making current approaches infeasible to implement in data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN) which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.
PDF

Click here to view paper screenshots
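
A minimal sketch of the two-stream idea follows, assuming pre-extracted per-frame static and temporal-context features; the concatenation-based fusion and all dimensions are illustrative choices, not the paper's exact design.

```python
# Hypothetical sketch of a two-stream design: one transformer over per-frame
# appearance tokens, one over temporal context tokens, fused for classification.
import torch
import torch.nn as nn

class TwoStreamVideoClassifier(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=400):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.static_stream = nn.TransformerEncoder(layer(), num_layers=2)    # frame appearance
        self.temporal_stream = nn.TransformerEncoder(layer(), num_layers=2)  # motion/context
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, static_feats, temporal_feats):
        # static_feats, temporal_feats: (B, T, dim) pre-extracted per-frame features
        s = self.static_stream(static_feats).mean(dim=1)
        t = self.temporal_stream(temporal_feats).mean(dim=1)
        return self.head(torch.cat([s, t], dim=-1))

logits = TwoStreamVideoClassifier()(torch.randn(2, 120, 256), torch.randn(2, 120, 256))
print(logits.shape)  # torch.Size([2, 400])
```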

Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling

Authors:Hsin-Ying Lee, Hung-Ting Su, Bing-Chen Tsai, Tsung-Han Wu, Jia-Fong Yeh, Winston H. Hsu

While recent large-scale video-language pre-training has made great progress in video question answering, the spatial modeling of video-language models is less fine-grained than that of image-language models; existing practices of temporal modeling also suffer from weak and noisy alignment between modalities. To learn fine-grained visual understanding, we decouple spatial-temporal modeling and propose a hybrid pipeline, Decoupled Spatial-Temporal Encoders, integrating an image- and a video-language encoder. The former encodes spatial semantics from larger but sparsely sampled frames independently of time, while the latter models temporal dynamics at lower spatial but higher temporal resolution. To help the video-language model learn temporal relations for video QA, we propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences. Extensive experiments demonstrate that our model outperforms previous work pre-trained on orders of magnitude larger datasets.
PDF BMVC 2022. Code is available at https://github.com/shinying/dest

Click here to view paper screenshots
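
The decoupling can be illustrated with the toy model below: one encoder sees a few high-resolution frames for spatial semantics, another sees many low-resolution frames for temporal dynamics, and their outputs are fused with the question to score answers. The encoder stand-ins (a linear layer and a GRU) and all feature dimensions are assumptions; the actual method uses pretrained image-language and video-language encoders.

```python
# Hypothetical sketch of decoupled spatial-temporal encoding for video QA.
import torch
import torch.nn as nn

class DecoupledVideoQA(nn.Module):
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.spatial_enc = nn.Linear(2048, dim)                  # stands in for an image-language encoder
        self.temporal_enc = nn.GRU(512, dim, batch_first=True)   # stands in for a video-language encoder
        self.classifier = nn.Linear(3 * dim, num_answers)

    def forward(self, sparse_frames, dense_frames, question):
        # sparse_frames: (B, Ts, 2048) few high-res frames; dense_frames: (B, Td, 512) many low-res frames
        spatial = self.spatial_enc(sparse_frames).mean(dim=1)    # time-independent semantics
        temporal, _ = self.temporal_enc(dense_frames)            # temporal dynamics
        fused = torch.cat([spatial, temporal[:, -1], question], dim=-1)
        return self.classifier(fused)

model = DecoupledVideoQA()
out = model(torch.randn(2, 4, 2048), torch.randn(2, 32, 512), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 1000])
```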

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks

Authors:Santiago Castro, Fabian Caba Heilbron

Large-scale pretrained image-text models have shown incredible zero-shot performance in a handful of tasks, including video ones such as action recognition and text-to-video retrieval. However, these models have not been adapted to video, mainly because they do not account for the time dimension, but also because video frames differ from typical images (e.g., they contain motion blur and are less sharp). In this paper, we present a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks. We show that by carefully adapting these models we obtain considerable improvements on two zero-shot Action Recognition tasks and three zero-shot Text-to-video Retrieval tasks. The code is available at https://github.com/bryant1410/fitclip
PDF Accepted at BMVC 2022. It includes the supplementary material. The margins and page size were modified to fit the arXiv ID stamp on the left side

Click here to view paper screenshots
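
For context, this is the zero-shot inference path that such a refined image-text model typically plugs into: embed each frame, mean-pool over time, and compare against class-prompt embeddings. The pooling and the temperature value are common conventions, not details taken from the paper.

```python
# Hypothetical sketch of zero-shot video classification with an image-text model.
import torch
import torch.nn.functional as F

def zero_shot_video_scores(frame_embeds, text_embeds, temperature=0.01):
    # frame_embeds: (T, D) per-frame image embeddings; text_embeds: (C, D) one per class prompt
    video = F.normalize(frame_embeds.mean(dim=0), dim=-1)     # simple temporal pooling
    text = F.normalize(text_embeds, dim=-1)
    return (video @ text.t()) / temperature                   # similarity logits over classes

scores = zero_shot_video_scores(torch.randn(16, 512), torch.randn(400, 512))
print(scores.argmax().item())  # predicted class index
```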

Free Lunch for Surgical Video Understanding by Distilling Self-Supervisions

Authors:Xinpeng Ding, Ziwei Liu, Xiaomeng Li

Self-supervised learning has witnessed great progress in vision and NLP; recently, it has also attracted much attention for various medical imaging modalities such as X-ray, CT, and MRI. Existing methods mostly focus on building new pretext self-supervision tasks such as reconstruction, orientation, and masking identification according to the properties of medical images. However, the publicly available self-supervision models are not fully exploited. In this paper, we present a powerful yet efficient self-supervision framework for surgical video understanding. Our key insight is to distill knowledge from publicly available models trained on large generic datasets to facilitate the self-supervised learning of surgical videos. To this end, we first introduce a semantic-preserving training scheme to obtain our teacher model, which not only contains semantics from the publicly available models but can also produce accurate knowledge for surgical data. Besides training with contrastive learning alone, we also introduce a distillation objective to transfer the rich learned information from the teacher model to self-supervised learning on surgical data. Extensive experiments on two surgical phase recognition benchmarks show that our framework can significantly improve the performance of existing self-supervised learning methods. Notably, our framework demonstrates a compelling advantage under a low-data regime. Our code is available at https://github.com/xmed-lab/DistillingSelf.
PDF accepted at MICCAI 2022

Click here to view paper screenshots
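
A hedged sketch of the kind of combined objective described above: an InfoNCE contrastive term plus a feature-distillation term against a frozen teacher. The specific distance (cosine) and the loss weight are assumptions; the paper's exact formulation may differ.

```python
# Hypothetical sketch of contrastive learning combined with teacher distillation.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))                # positives on the diagonal
    return F.cross_entropy(logits, targets)

def distill_loss(student_feat, teacher_feat):
    # cosine-style feature matching against a frozen teacher
    return 1 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=-1).mean()

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)     # student features of two augmented views
teacher = torch.randn(8, 128)                          # teacher features of the same clips
loss = info_nce(z1, z2) + 0.5 * distill_loss(z1, teacher)
print(loss.item())
```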

Compressed Vision for Efficient Video Understanding

Authors:Olivia Wiles, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski

Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels (data transfer, speed, and memory), making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes, corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using a compressed representation.
PDF ACCV

Click here to view paper screenshots
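
The pipeline shape can be sketched as follows: a stand-in neural "codec" maps frames to compact latents, a tiny network perturbs the latents in place of pixel-space augmentations, and the downstream model only ever consumes latents. The single strided convolution used as the codec here is purely illustrative; the paper uses a learned neural compressor.

```python
# Hypothetical sketch of training on neurally compressed video representations.
import torch
import torch.nn as nn

class LatentPipeline(nn.Module):
    def __init__(self, latent_dim=64, num_classes=600):
        super().__init__()
        self.compressor = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)                 # stand-in neural codec
        self.latent_aug = nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)       # augmentation in latent space
        self.video_model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(latent_dim, num_classes))

    def forward(self, frames, augment=True):
        # frames: (B*T, 3, H, W); in practice latents would be precomputed once and stored
        z = self.compressor(frames)
        if augment:
            z = self.latent_aug(z)          # mimics crops/flips without decoding the video
        return self.video_model(z)

out = LatentPipeline()(torch.randn(4, 3, 128, 128))
print(out.shape)  # torch.Size([4, 600])
```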

VideoPipe 2022 Challenge: Real-World Video Understanding for Urban Pipe Inspection

Authors:Yi Liu, Xuan Zhang, Ying Li, Guixin Liang, Yabing Jiang, Lixia Qiu, Haiping Tang, Fei Xie, Wei Yao, Yi Dai, Yu Qiao, Yali Wang

Video understanding is an important problem in computer vision. Currently, the most well-studied task in this area is human action recognition, where the clips are manually trimmed from long videos and a single class of human action is assumed for each clip. However, we may face more complicated scenarios in industrial applications. For example, in real-world urban pipe systems, anomaly defects are fine-grained, multi-labeled, and domain-relevant. To recognize them correctly, we need to understand the detailed video content. For this reason, we propose to advance the research area of video understanding, with a shift from traditional action recognition to industrial anomaly analysis. In particular, we introduce two high-quality video benchmarks, namely QV-Pipe and CCTV-Pipe, for anomaly inspection in real-world urban pipe systems. Based on these new datasets, we host two competitions: (1) Video Defect Classification on QV-Pipe and (2) Temporal Defect Localization on CCTV-Pipe. In this report, we describe the details of these benchmarks, the problem definitions of the competition tracks, the evaluation metric, and a summary of the results. We expect that this competition will bring new opportunities and challenges for video understanding in smart cities and beyond. The details of our VideoPipe challenge can be found at https://videopipe.github.io.
PDF VideoPipe Challenge @ ICPR2022. Homepage: https://videopipe.github.io/

Click here to view paper screenshots

EgoTaskQA: Understanding Human Tasks in Egocentric Videos

Authors:Baoxiong Jia, Ting Lei, Song-Chun Zhu, Siyuan Huang

Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such a capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism of multi-tasking and the partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark, which provides a single home for the crucial dimensions of task understanding through question-answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?), to provide diagnostic analyses on the spatial, temporal, and causal understanding of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show that they fall significantly short of humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move forward on goal-oriented video understanding and reasoning.
PDF Published at NeurIPS Track on Datasets and Benchmarks 2022

Click here to view paper screenshots

STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding

Authors:Zihang Lin, Chaolei Tan, Jian-Fang Hu, Zhi Jin, Tiancai Ye, Wei-Shi Zheng

In this technical report, we introduce our solution to the human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatio-temporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding within a single frame and learns to localize the target object spatially according to intra-frame visual cues such as object appearance. The dynamic branch performs cross-modal understanding across multiple frames and learns to predict the starting and ending time of the target moment according to dynamic visual cues such as motion. Both the static and dynamic branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block that enables the static and dynamic branches to transfer useful and complementary information to each other, which is shown to be effective in improving predictions on hard cases. Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the 4th Person in Context Challenge.
PDF Technical report. The 1st place solution in the HC-STVG track of the 4th Person in Context Challenge(2022)

Click here to view paper screenshots
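
A rough sketch of the static/dynamic split, assuming frame features already fused with the query text: one branch regresses a per-frame box, the other scores start/end times. The box parameterization, head design, and shared input are assumptions rather than the paper's exact architecture.

```python
# Hypothetical sketch of a static branch (per-frame box) and a dynamic branch (start/end times).
import torch
import torch.nn as nn

class StaticDynamicGrounder(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.static_branch = nn.TransformerEncoder(layer(), num_layers=2)   # per-frame cross-modal cues
        self.dynamic_branch = nn.TransformerEncoder(layer(), num_layers=2)  # motion cues across frames
        self.box_head = nn.Linear(dim, 4)     # (cx, cy, w, h) per frame
        self.time_head = nn.Linear(dim, 2)    # start / end logits per frame

    def forward(self, frame_text_feats):
        # frame_text_feats: (B, T, dim) visual features already fused with the query text
        boxes = self.box_head(self.static_branch(frame_text_feats)).sigmoid()
        times = self.time_head(self.dynamic_branch(frame_text_feats))
        return boxes, times

boxes, times = StaticDynamicGrounder()(torch.randn(2, 30, 256))
print(boxes.shape, times.shape)  # torch.Size([2, 30, 4]) torch.Size([2, 30, 2])
```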

GraphVid: It Only Takes a Few Nodes to Understand a Video

Authors:Eitan Kosman, Dotan Di Castro

We propose a concise representation of videos that encodes perceptually meaningful features into graphs. With this representation, we aim to leverage the large amount of redundancy in videos and save computation. First, we construct superpixel-based graph representations of videos by treating superpixels as graph nodes and creating spatial and temporal connections between adjacent superpixels. Then, we leverage Graph Convolutional Networks to process this representation and predict the desired output. As a result, we are able to train models with far fewer parameters, which translates into shorter training periods and a reduction in computational resource requirements. A comprehensive experimental study on the publicly available datasets Kinetics-400 and Charades shows that the proposed method is highly cost-effective and uses limited commodity hardware during training and inference. It reduces the computational requirements 10-fold while achieving results that are comparable to state-of-the-art methods. We believe that the proposed approach is a promising direction that could open the door to solving video understanding more efficiently and enable more resource-limited users to thrive in this research field.
PDF Accepted to ECCV2022 (Oral)

Click here to view paper screenshots
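
A minimal sketch of the superpixel-graph idea: node features for superpixels, a single adjacency matrix covering spatial and temporal edges, and a small GCN that pools nodes into a video-level prediction. The toy adjacency and all dimensions below are placeholders, not the paper's construction.

```python
# Hypothetical sketch: superpixels as graph nodes, spatial + temporal adjacency, small GCN.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) normalized adjacency (spatial + temporal edges)
        return torch.relu(self.linear(adj @ x))

class GraphVideoClassifier(nn.Module):
    def __init__(self, node_dim=64, hidden=128, num_classes=400):
        super().__init__()
        self.gcn1 = SimpleGCNLayer(node_dim, hidden)
        self.gcn2 = SimpleGCNLayer(hidden, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, node_feats, adj):
        h = self.gcn2(self.gcn1(node_feats, adj), adj)
        return self.head(h.mean(dim=0))                # mean-pool nodes into one video prediction

N = 200                                                 # superpixel nodes across all frames
adj = torch.eye(N) + torch.rand(N, N).round()           # toy adjacency; real edges come from superpixel neighbours
adj = adj / adj.sum(dim=1, keepdim=True)                # row-normalize
print(GraphVideoClassifier()(torch.randn(N, 64), adj).shape)  # torch.Size([400])
```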

Prompting Visual-Language Models for Efficient Video Understanding

Authors:Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie

Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model and exploit its powerful ability for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve competitive or state-of-the-art performance relative to existing methods, despite optimising significantly fewer parameters.
PDF ECCV 2022. Project page: https://ju-chen.github.io/efficient-prompt/

Click here to view paper screenshots
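
A hedged sketch of the two ingredients: a few learnable prompt vectors on the text side and a lightweight temporal transformer over frozen frame features. In the actual method the prompts pass through the frozen text encoder; here they are simply pooled with the class embeddings for brevity, and all dimensions are assumptions.

```python
# Hypothetical sketch: learnable prompt vectors + a small temporal transformer over frozen features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedVideoClassifier(nn.Module):
    def __init__(self, dim=512, num_prompts=8, heads=4):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)   # the only text-side parameters trained
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)          # lightweight temporal module

    def forward(self, frame_feats, class_token_embeds):
        # frame_feats: (B, T, dim) frozen per-frame features; class_token_embeds: (C, dim)
        video = self.temporal(frame_feats).mean(dim=1)                       # (B, dim)
        # prepend prompts to each class embedding and pool into one text vector per class
        text = torch.cat([self.prompts.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1),
                          class_token_embeds.unsqueeze(1)], dim=1).mean(dim=1)   # (C, dim)
        return F.normalize(video, dim=-1) @ F.normalize(text, dim=-1).t()    # similarity logits

logits = PromptedVideoClassifier()(torch.randn(2, 16, 512), torch.randn(400, 512))
print(logits.shape)  # torch.Size([2, 400])
```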

Federated Self-supervised Learning for Video Understanding

Authors:Yasar Abbas Ur Rehman, Yan Gao, Jiajun Shen, Pedro Porto Buarque de Gusmao, Nicholas Lane

The ubiquity of camera-enabled mobile devices has led to large amounts of unlabelled video data being produced at the edge. Although various self-supervised learning (SSL) methods have been proposed to harvest their latent spatio-temporal representations for task-specific training, practical challenges including privacy concerns and communication costs prevent SSL from being deployed at large scale. To mitigate these issues, we propose applying Federated Learning (FL) to the task of video SSL. In this work, we evaluate the performance of current state-of-the-art (SOTA) video-SSL techniques and identify their shortcomings when integrated into a large-scale FL setting simulated with the Kinetics-400 dataset. We then propose a novel federated SSL framework for video, dubbed FedVSSL, that integrates different aggregation strategies and partial weight updating. Extensive experiments demonstrate the effectiveness and significance of FedVSSL as it outperforms the centralized SOTA for the downstream retrieval task by 6.66% on UCF-101 and 5.13% on HMDB-51.
PDF

Click here to view paper screenshots
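
A minimal sketch of federated averaging with partial weight updating, one of the ingredients mentioned above: only parameters under an assumed "backbone." prefix are averaged across clients, while the rest of the global state is left unchanged. This is a generic illustration, not the FedVSSL aggregation rule.

```python
# Hypothetical sketch of FedAvg restricted to a subset of parameters (partial weight updating).
import copy
import torch

def federated_aggregate(global_state, client_states, update_prefix="backbone."):
    new_state = copy.deepcopy(global_state)
    for name in global_state:
        if name.startswith(update_prefix):                     # partial update: backbone only
            stacked = torch.stack([c[name].float() for c in client_states])
            new_state[name] = stacked.mean(dim=0)              # plain FedAvg over clients
    return new_state

# toy example with two "clients"
global_state = {"backbone.w": torch.zeros(3), "head.w": torch.zeros(3)}
clients = [{"backbone.w": torch.ones(3), "head.w": torch.ones(3)},
           {"backbone.w": 3 * torch.ones(3), "head.w": torch.ones(3)}]
agg = federated_aggregate(global_state, clients)
print(agg["backbone.w"], agg["head.w"])  # tensor([2., 2., 2.]) tensor([0., 0., 0.])
```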

Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 when reposting!