2022-07-19 更新
Joint stereo 3D object detection and implicit surface reconstruction
Authors:Shichao Li, Kwang-Ting Cheng
We present the first learning-based framework for category-level 3D object detection and implicit shape estimation based on a pair of stereo RGB images in the wild. Previous stereo 3D object detection approaches cannot describe the complete shape details of the detected objects and often fails for the small objects. In contrast, we propose a new progressive approach that can (1) perform precise localization as well as provide a complete and resolution-agnostic shape description for the detected objects and (2) produce significantly more accurate orientation predictions for the tiny instances. This approach features a new instance-level network that explicitly models the unseen surface hallucination problem using point-based representations and uses a new geometric representation for orientation refinement. Extensive experiments show that our approach achieves state-of-the-art performance using various metrics on the KITTI benchmark. Code and pre-trained models will be available at this https URL.
PDF
点此查看论文截图
Class-agnostic Object Detection with Multi-modal Transformer
Authors:Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang
What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: \url{https://git.io/J1HPY}.
PDF ECCV 2022 accepted
点此查看论文截图
DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection
Authors:Liang Peng, Xiaopei Wu, Zheng Yang, Haifeng Liu, Deng Cai
Monocular 3D detection has drawn much attention from the community due to its low cost and setup simplicity. It takes an RGB image as input and predicts 3D boxes in the 3D space. The most challenging sub-task lies in the instance depth estimation. Previous works usually use a direct estimation method. However, in this paper we point out that the instance depth on the RGB image is non-intuitive. It is coupled by visual depth clues and instance attribute clues, making it hard to be directly learned in the network. Therefore, we propose to reformulate the instance depth to the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects’ appearances and positions on the image. By contrast, the attribute depth relies on objects’ inherent attributes, which are invariant to the object affine transformation on the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining different types of depths and associated uncertainties, we can obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited due to the physical nature, hindering the boost of performance. Based on the proposed instance depth disentanglement strategy, we can alleviate this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component in our method. The codes are released at https://github.com/SPengLiang/DID-M3D.
PDF ECCV 2022
点此查看论文截图
Weakly Supervised Video Salient Object Detection via Point Supervision
Authors:Shuyong Gao, Haozhe Xing, Wei Zhang, Yan Wang, Qianyu Guo, Wenqiang Zhang
Video salient object detection models trained on pixel-wise dense annotation have achieved excellent performance, yet obtaining pixel-by-pixel annotated datasets is laborious. Several works attempt to use scribble annotations to mitigate this problem, but point supervision as a more labor-saving annotation method (even the most labor-saving method among manual annotation methods for dense prediction), has not been explored. In this paper, we propose a strong baseline model based on point supervision. To infer saliency maps with temporal information, we mine inter-frame complementary information from short-term and long-term perspectives, respectively. Specifically, we propose a hybrid token attention module, which mixes optical flow and image information from orthogonal directions, adaptively highlighting critical optical flow information (channel dimension) and critical token information (spatial dimension). To exploit long-term cues, we develop the Long-term Cross-Frame Attention module (LCFA), which assists the current frame in inferring salient objects based on multi-frame tokens. Furthermore, we label two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and the DAVSOD dataset. Experiments on the six benchmark datasets illustrate our method outperforms the previous state-of-the-art weakly supervised methods and even is comparable with some fully supervised approaches. Source code and datasets are available.
PDF accepted by ACM MM 2022
点此查看论文截图
Automatic dataset generation for specific object detection
Authors:Xiaotian Lin, Leiyang Xu, Qiang Wang
In the past decade, object detection tasks are defined mostly by large public datasets. However, building object detection datasets is not scalable due to inefficient image collecting and labeling. Furthermore, most labels are still in the form of bounding boxes, which provide much less information than the real human visual system. In this paper, we present a method to synthesize object-in-scene images, which can preserve the objects’ detailed features without bringing irrelevant information. In brief, given a set of images containing a target object, our algorithm first trains a model to find an approximate center of the object as an anchor, then makes an outline regression to estimate its boundary, and finally blends the object into a new scene. Our result shows that in the synthesized image, the boundaries of objects blend very well with the background. Experiments also show that SOTA segmentation models work well with our synthesized data.
PDF 5 pages, 4 figures, accepted by ICIP 2022
点此查看论文截图
SPSN: Superpixel Prototype Sampling Network for RGB-D Salient Object Detection
Authors:Minhyeok Lee, Chaewon Park, Suhwan Cho, Sangyoun Lee
RGB-D salient object detection (SOD) has been in the spotlight recently because it is an important preprocessing operation for various vision tasks. However, despite advances in deep learning-based methods, RGB-D SOD is still challenging due to the large domain gap between an RGB image and the depth map and low-quality depth maps. To solve this problem, we propose a novel superpixel prototype sampling network (SPSN) architecture. The proposed model splits the input RGB image and depth map into component superpixels to generate component prototypes. We design a prototype sampling network so that the network only samples prototypes corresponding to salient objects. In addition, we propose a reliance selection module to recognize the quality of each RGB and depth feature map and adaptively weight them in proportion to their reliability. The proposed method makes the model robust to inconsistencies between RGB images and depth maps and eliminates the influence of non-salient objects. Our method is evaluated on five popular datasets, achieving state-of-the-art performance. We prove the effectiveness of the proposed method through comparative experiments.
PDF Accepted to European Conference on Computer Vision (ECCV) 2022
点此查看论文截图
GroupViT: Semantic Segmentation Emerges from Text Supervision
Authors:Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang
Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text encoder on a large-scale image-text dataset via contrastive losses. With only text supervision and without any pixel-level annotations, GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner, i.e., without any further fine-tuning. It achieves a zero-shot accuracy of 52.3% mIoU on the PASCAL VOC 2012 and 22.4% mIoU on PASCAL Context datasets, and performs competitively to state-of-the-art transfer-learning methods requiring greater levels of supervision. We open-source our code at https://github.com/NVlabs/GroupViT .
PDF CVPR 2022. Project page and code: https://jerryxu.net/GroupViT
点此查看论文截图
Consistency of Implicit and Explicit Features Matters for Monocular 3D Object Detection
Authors:Qian Ye, Ling Jiang, Yuyang Du
Monocular 3D object detection is a common solution for low-cost autonomous agents to perceive their surrounding environment. Monocular detection has progressed into two categories: (1)Direct methods that infer 3D bounding boxes directly from a frontal-view image; (2)3D intermedia representation methods that map image features to 3D space for subsequent 3D detection. The second category is standing out not only because 3D detection forges ahead at the mercy of more meaningful and representative features, but because of emerging SOTA end-to-end prediction and planning paradigms that require a bird’s-eye-view feature map from a perception pipeline. However, in transforming to 3D representation, these methods do not guarantee that objects’ implicit orientations and locations in latent space are consistent with those explicitly observed in Euclidean space, which will hurt model performance. Hence, we argue that the consistency of implicit and explicit features matters and present a novel monocular detection method, named CIEF, with the first orientation-aware image backbone to eliminate the disparity of implicit and explicit features in subsequent 3D representation. As a second contribution, we introduce a ray attention mechanism. In contrast to previous methods that repeat features along the projection ray or rely on another intermedia frustum point cloud, we directly transform image features to voxel representations with well-localized features. We also propose a handcrafted gaussian positional encoding function that outperforms the sinusoidal encoding function but maintains the benefit of being continuous. CIEF ranked 1st among all reported methods on both 3D and BEV detection benchmark of KITTI at submission time.
PDF 9 pages,7 figures
点此查看论文截图
Co-Training for Unsupervised Domain Adaptation of Semantic Segmentation Models
Authors:Jose L. Gómez, Gabriel Villalonga, Antonio M. López
Semantic image segmentation is addressed by training deep models. Since supervised training draws to a curse of human-based image labeling, using synthetic images with automatically generated ground truth together with unlabeled real-world images is a promising alternative. This implies to address an unsupervised domain adaptation (UDA) problem. In this paper, we proposed a new co-training process for synth-to-real UDA of semantic segmentation models. First, we design a self-training procedure which provides two initial models. Then, we keep training these models in a collaborative manner for obtaining the final model. The overall process treats the deep models as black boxes and drives their collaboration at the level of pseudo-labeled target images, i.e., neither modifying loss functions is required, nor explicit feature alignment. We test our proposal on standard synthetic and real-world datasets. Our co-training shows improvements of 15-20 percentage points of mIoU over baselines, so establishing new state-of-the-art results.
PDF