发布日期: 2022-12-01

2022-12-01 更新

ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention

Authors:Dylan Auty, Krystian Mikolajczyk

While monocular depth estimation (MDE) is an important problem in computer vision, it is difficult due to the ambiguity that results from the compression of a 3D scene into only 2 dimensions. It is common practice in the field to treat it as simple image-to-image translation, without consideration for the semantics of the scene and the objects within it. In contrast, humans and animals have been shown to use higher-level information to solve MDE: prior knowledge of the nature of the objects in the scene, their positions and likely configurations relative to one another, and their apparent sizes have all been shown to help resolve this ambiguity. In this paper, we present a novel method to enhance MDE performance by encouraging use of known-useful information about the semantics of objects and inter-object relationships within a scene. Our novel ObjCAViT module sources world-knowledge from language models and learns inter-object relationships in the context of the MDE problem using transformer attention, incorporating apparent size information. Our method produces highly accurate depth maps, and we obtain competitive results on the NYUv2 and KITTI datasets. Our ablation experiments show that the use of language and cross-attention within the ObjCAViT module increases performance. Code is released at https://github.com/DylanAuty/ObjCAViT.
PDF 9 pages, 4 figures. Code is released at https://github.com/DylanAuty/ObjCAViT

点此查看论文截图

Authors:Samarth Sinha, Jason Y. Zhang, Andrea Tagliasacchi, Igor Gilitschenski, David B. Lindell

Camera pose estimation is a key step in standard 3D reconstruction pipelines that operate on a dense set of images of a single object or scene. However, methods for pose estimation often fail when only a few images are available because they rely on the ability to robustly identify and match visual features between image pairs. While these methods can work robustly with dense camera views, capturing a large set of images can be time-consuming or impractical. We propose SparsePose for recovering accurate camera poses given a sparse set of wide-baseline images (fewer than 10). The method learns to regress initial camera poses and then iteratively refine them after training on a large-scale dataset of objects (Co3D: Common Objects in 3D). SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations. We also demonstrate our pipeline for high-fidelity 3D reconstruction using only 5-9 images of an object.
PDF

点此查看论文截图

木子已

https://ipaper.today/2022/12/01/2022-12-01-i2i-translation/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源木子已 !

I2I Translation

视频理解

2022-12-01 视频理解

视频理解

场景文本检测识别

2022-12-01 场景文本检测识别

场景文本检测识别

I2I Translation

2022-12-01 更新

ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention

SparsePose: Sparse-View Camera Pose Regression and Refinement

打赏用于支持本站流量费