Vision Transformer

Published: 2022-03-10

Updated 2022-03-10

ChiTransformer: Towards Reliable Stereo from Cues

Authors: Qing Su, Shihao Ji

Current stereo matching techniques are challenged by restricted search space, occluded regions, and sheer size. While single-image depth estimation is spared from these challenges and can achieve satisfactory results with the extracted monocular cues, the lack of a stereoscopic relationship renders the monocular prediction less reliable on its own, especially in highly dynamic or cluttered environments. To address these issues in both scenarios, we present an optic-chiasm-inspired self-supervised binocular depth estimation method, wherein a vision transformer (ViT) with a gated positional cross-attention (GPCA) layer is designed to enable feature-sensitive pattern retrieval between views while retaining the extensive context information aggregated through self-attention. Monocular cues from a single view are thereafter conditionally rectified by a blending layer with the retrieved pattern pairs. This crossover design is biologically analogous to the optic-chiasm structure in the human visual system, hence the name ChiTransformer. Our experiments show that this architecture yields a substantial 11% improvement over state-of-the-art self-supervised stereo approaches, and that it can be used on both rectilinear and non-rectilinear (e.g., fisheye) images.
PDF 11 pages, 3 figures, CVPR 2022
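
The abstract names a gated positional cross-attention (GPCA) layer but does not say how it is computed. As a rough illustration only, the sketch below assumes one plausible form: tokens from one view query tokens from the other view, and a scalar gate blends content-based attention weights with position-based ones. The function name `gated_cross_attention`, the parameters `pos_scores` and `gate`, and the blending rule are all assumptions made for this example, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gated_cross_attention(queries, keys, values, pos_scores, gate):
    """Blend content-based and position-based cross-attention (hypothetical sketch).

    queries:    tokens from the reference view, shape [n_q][d]
    keys:       tokens from the other view, shape [n_k][d]
    values:     tokens from the other view, shape [n_k][d_v]
    pos_scores: positional logits, shape [n_q][n_k] (assumed to be given)
    gate:       scalar logit; sigmoid(gate) is the weight of the positional term
    """
    d = len(queries[0])
    d_v = len(values[0])
    g = 1.0 / (1.0 + math.exp(-gate))  # gate squashed into (0, 1)
    outputs = []
    for q, pos_row in zip(queries, pos_scores):
        # Scaled dot-product logits between this query and every key.
        content_logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        content = softmax(content_logits)
        positional = softmax(pos_row)
        # Convex combination of two distributions, so the weights still sum to 1.
        weights = [(1.0 - g) * c + g * p for c, p in zip(content, positional)]
        # Weighted sum of the other view's value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# One query token attending over two tokens from the other view.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
pos_scores = [[0.0, 0.0]]  # uniform positional prior
blended = gated_cross_attention(queries, keys, values, pos_scores, gate=0.0)
```

In this sketch a strongly negative `gate` leaves almost pure content-based retrieval, while a strongly positive one falls back on the positional prior. The blending layer that rectifies the monocular cues is a separate component and is not modeled here.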


Harvey

https://ipaper.today/2022/03/10/2022-03-10-vision-transformer/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Harvey as the source when reposting.
