Video Understanding

Published: 2022-08-04

Updated 2022-08-04

Two-Stream Transformer Architecture for Long Video Understanding

Authors: Edward Fish, Jon Weinbren, Andrew Gilbert

Pure vision transformer architectures are highly effective for short video classification and action recognition tasks. However, due to the quadratic complexity of self-attention and a lack of inductive bias, transformers are resource-intensive and data-inefficient. Long-form video understanding tasks amplify these data and memory efficiency problems, making current approaches infeasible in data- or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN), which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data-efficient, and achieves state-of-the-art (SOTA) performance on several long video understanding tasks.
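To make the quadratic-complexity point concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper and does not describe STAN. It only counts the entries of a single full self-attention score matrix when every patch of every frame is treated as one token. The function names, the 196 patches per frame (a 14 x 14 grid) and the 4-byte float size are all illustrative assumptions.

```python
# Illustrative only: cost of ONE full self-attention score matrix over all
# space-time tokens. Patch count and byte size are assumptions, not values
# reported in the paper.

def attention_matrix_entries(num_frames: int, patches_per_frame: int) -> int:
    """Number of entries in a (tokens x tokens) attention score matrix."""
    tokens = num_frames * patches_per_frame
    return tokens * tokens


def attention_matrix_bytes(num_frames: int, patches_per_frame: int,
                           bytes_per_entry: int = 4) -> int:
    """Memory for that matrix, assuming 4-byte (float32) entries by default."""
    return attention_matrix_entries(num_frames, patches_per_frame) * bytes_per_entry


if __name__ == "__main__":
    patches = 196  # assumed: 14 x 14 patches per frame
    for frames in (8, 16, 32, 64, 128):
        mib = attention_matrix_bytes(frames, patches) / 2**20
        print(f"{frames:4d} frames -> {frames * patches:6d} tokens, "
              f"{mib:10.1f} MiB per head per layer")
```

Under these assumptions, doubling the number of frames quadruples the size of the score matrix, which is why naively extending a short-clip transformer to minutes of video quickly exhausts the memory of a single GPU.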

木子已

https://ipaper.today/2022/08/04/2022-08-04-shi-pin-li-jie/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. When reposting, please credit the source: 木子已.
