2022-10-05 Update
A General Divergence Modeling Strategy for Salient Object Detection
Authors:Xinyu Tian, Jing Zhang, Yuchao Dai
Salient object detection is subjective in nature, which implies that multiple plausible estimations can correspond to the same input image. Most existing salient object detection models are deterministic, following a point-to-point estimation learning pipeline, making them incapable of estimating the predictive distribution. Although latent-variable-model-based stochastic prediction networks exist to model prediction variants, a latent space built on a single clean saliency annotation is less reliable for exploring the subjective nature of saliency, leading to less effective divergence modeling. Given multiple saliency annotations, we introduce a general divergence modeling strategy via random sampling, and apply it to an ensemble-based framework and three latent-variable-model-based solutions to explore the subjective nature of saliency. Experimental results demonstrate the superior performance of our general divergence modeling strategy.
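The key ingredient described above is supervising a stochastic or ensemble predictor with a randomly sampled annotation per image rather than a single fixed label. Below is a minimal PyTorch-style sketch of that random-sampling supervision, not the authors' implementation; the tensor layout, `model`, and loss choice are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def divergence_training_step(model, images, annotations, optimizer):
    """One training step that randomly samples one of the multiple
    saliency annotations per image as supervision (sketch only)."""
    # annotations: (B, M, 1, H, W), i.e. M human annotations per image (assumed layout)
    B, M = annotations.shape[:2]
    idx = torch.randint(0, M, (B,), device=annotations.device)   # random annotator per image
    target = annotations[torch.arange(B, device=annotations.device), idx]  # (B, 1, H, W)

    pred = model(images)                                          # (B, 1, H, W) saliency logits
    loss = F.binary_cross_entropy_with_logits(pred, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```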
PDF Code is available at: https://npucvr.github.io/Divergence_SOD/
Multi-Grained Angle Representation for Remote Sensing Object Detection
Authors:Hao Wang, Zhanchao Huang, Zhengchao Chen, Ying Song, Wei Li
Arbitrary-oriented object detection (AOOD) plays a significant role in image understanding for remote sensing scenarios. Existing AOOD methods face the challenges of ambiguity and high cost in angle representation. To this end, a multi-grained angle representation (MGAR) method, consisting of coarse-grained angle classification (CAC) and fine-grained angle regression (FAR), is proposed. Specifically, the designed CAC avoids the ambiguity of angle prediction through discrete angular encoding (DAE) and reduces complexity by coarsening the granularity of DAE. Based on CAC, FAR is developed to refine the angle prediction at much lower cost than narrowing the granularity of DAE. Furthermore, an Intersection over Union (IoU) aware FAR-Loss (IFL) is designed to improve the accuracy of angle prediction via an adaptive re-weighting mechanism guided by IoU. Extensive experiments on several public remote sensing datasets demonstrate the effectiveness of the proposed MGAR. Moreover, experiments on embedded devices show that MGAR is also friendly to lightweight deployment.
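To make the coarse-plus-fine angle representation concrete, here is a hedged sketch of how a coarse angle class and a fine regression offset could be decoded into an angle, and how an IoU-guided weight could rescale the regression loss. The bin count, offset range, and weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def decode_angle(coarse_logits, fine_offset, num_bins=12, angle_range=180.0):
    """Decode an angle from a coarse class plus a fine regression offset
    (illustrative only; bin count and ranges are assumptions)."""
    bin_size = angle_range / num_bins                  # e.g. 15 degrees per bin
    coarse_idx = coarse_logits.argmax(dim=-1)          # (N,) discrete angular class
    # fine_offset refines the coarse bin center, expressed in fractions of a bin
    angle = (coarse_idx.float() + 0.5 + fine_offset.squeeze(-1).clamp(-0.5, 0.5)) * bin_size
    return angle

def iou_aware_far_loss(fine_offset, target_offset, iou):
    """IoU-guided re-weighting of the fine angle regression loss (sketch):
    boxes with lower IoU receive larger weights (hypothetical weighting rule)."""
    weight = (1.0 - iou).detach()
    return (weight * torch.abs(fine_offset - target_offset)).mean()
```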
PDF 13 pages, 9 figures, 14 tables
A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation
Authors:Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou
This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers have recently demonstrated competitive performance on image classification tasks. To adapt ViT to object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and rely on highly customized ViT architectures. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices in ViT (spatial reduction, doubled channels, and multiscale features) and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy. We further derive a scaling rule to optimize our model's trade-off between accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.
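As a concrete illustration of the "constant feature resolution and hidden size" idea, the following is a minimal single-scale ViT backbone sketch in PyTorch. All hyper-parameters (patch size, width, depth) and the class name `SingleScaleViT` are assumptions; this is not the UViT reference implementation.

```python
import torch
import torch.nn as nn

class SingleScaleViT(nn.Module):
    """Minimal single-scale ViT backbone: one patch embedding, then encoder
    blocks that keep the token resolution and hidden size constant."""
    def __init__(self, img_size=640, patch=16, dim=384, depth=12, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.patch_embed(x)                        # (B, dim, H/16, W/16)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2) + self.pos_embed
        x = self.encoder(x)                            # same resolution and width throughout
        return x.transpose(1, 2).reshape(B, C, H, W)   # single-scale feature map for a head
```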
PDF ECCV 2022 accepted
InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction
Authors:Yinghao Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas
Humans constantly interact with daily objects to accomplish tasks. To understand such interactions, computers need to reconstruct them from cameras observing whole-body interaction with scenes. This is challenging due to occlusion between the body and objects, motion blur, depth/scale ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community focuses either on interacting hands, ignoring the body, or on interacting bodies, ignoring hands. The GRAB dataset addresses dexterous whole-body interaction but uses marker-based MoCap and lacks images, while BEHAVE captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole bodies and objects from multi-view RGB-D data, using the parametric whole-body model SMPL-X and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) contact between the hand and object can be used to improve the pose estimation of both; (ii) Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system that minimizes the effect of occlusion while providing reasonable inter-camera synchronization. With this method we capture the InterCap dataset, which contains 10 subjects (5 males and 5 females) interacting with 10 objects of various sizes and affordances, including contact with the hands or feet. In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images. Our method provides pseudo ground-truth body meshes and objects for each video frame. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Our data and code are available for research purposes.
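The contact-aware fitting idea can be illustrated with a toy objective that combines a multi-view 2D keypoint reprojection term with a term pulling assumed hand-object contact vertex pairs together. The function below is only a sketch of that intuition; the inputs, weights, and `project_fn` are hypothetical and not the InterCap pipeline.

```python
import torch

def contact_aware_objective(body_verts, obj_verts, joints3d,
                            kp2d_obs, project_fn, contact_pairs,
                            w_kp=1.0, w_contact=0.1):
    """Illustrative fitting objective: a 2D keypoint reprojection term plus a
    hand-object contact term over assumed contact vertex pairs (all hypothetical)."""
    # Data term: project model joints into each view and compare to 2D detections
    kp_loss = sum(((project_fn(joints3d, cam) - kp) ** 2).mean()
                  for cam, kp in kp2d_obs)
    # Contact term: encourage paired hand and object vertices to touch
    hand_idx, obj_idx = contact_pairs
    contact_loss = (body_verts[hand_idx] - obj_verts[obj_idx]).norm(dim=-1).mean()
    return w_kp * kp_loss + w_contact * contact_loss
```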
PDF To appear at GCPR2022
2022-10-05 Update
Automated Medical Device Display Reading Using Deep Learning Object Detection
Authors:Lucas P. Moreira
Telemedicine and mobile health applications, especially during the quarantine imposed by the COVID-19 pandemic, have led to an increased need to transfer health monitor readings from patients to specialists. Considering that most home medical devices use seven-segment displays, an automatic display reading algorithm would provide a more reliable tool for remote health care. This work proposes an end-to-end method for detecting and reading seven-segment displays on medical devices based on deep learning object detection models. Two state-of-the-art model families, EfficientDet and EfficientDet-lite, previously trained on the MS-COCO dataset, were fine-tuned on a dataset of medical device photos taken with mobile digital cameras to simulate real-world applications. Evaluation of the trained models shows high efficiency: all models achieved more than 98% detection precision and more than 98% classification accuracy, with EfficientDet-lite1 reaching 100% detection precision and 100% correct digit classification on a test set of 104 images and 438 digits.
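After the detector localizes and classifies individual digits, the display value can be recovered by sorting the confident detections left to right. Below is a small post-processing sketch; the detection format and threshold are assumptions, not the paper's exact pipeline, and the detector itself would be a fine-tuned EfficientDet/EfficientDet-lite model.

```python
def read_display(detections, score_thresh=0.5):
    """Turn per-digit detections into a display reading by keeping confident
    boxes and sorting them left to right (post-processing sketch)."""
    digits = [(box, label) for box, label, score in detections if score >= score_thresh]
    digits.sort(key=lambda d: d[0][0])          # sort by x_min of each bounding box
    return "".join(str(label) for _, label in digits)

# Example: detections given as (box=[x_min, y_min, x_max, y_max], label, score)
print(read_display([([120, 30, 160, 90], 7, 0.99),
                    ([60, 30, 100, 90], 3, 0.98)]))   # -> "37"
```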
PDF 6 pages, 5 figures
FreDSNet: Joint Monocular Depth and Semantic Segmentation with Fast Fourier Convolutions
Authors:Bruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
In this work we present FreDSNet, a deep learning solution that obtains semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images offer task-specific advantages for scene understanding problems thanks to the 360-degree contextual information they provide about the entire environment. However, the inherent characteristics of omnidirectional images introduce additional difficulties for accurate object detection and segmentation, as well as for good depth estimation. To overcome these problems, we exploit convolutions in the frequency domain, obtaining a wider receptive field in each convolutional layer. These convolutions make it possible to leverage the whole contextual information from omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image by exploiting fast Fourier convolutions. Our experiments show that FreDSNet performs comparably to specialized state-of-the-art methods for semantic segmentation and depth estimation. The FreDSNet code is publicly available at https://github.com/Sbrunoberenguel/FreDSNet
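To illustrate the frequency-domain convolution idea, here is a minimal PyTorch sketch of a spectral block in the spirit of fast Fourier convolutions: FFT, a pointwise convolution over the stacked real/imaginary channels, then inverse FFT. It is an illustrative simplification, not FreDSNet's actual block.

```python
import torch
import torch.nn as nn

class SpectralConv(nn.Module):
    """Frequency-domain convolution sketch: each layer sees the whole feature
    map at once, giving a global receptive field."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        B, C, H, W = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")                 # complex, (B, C, H, W//2 + 1)
        y = torch.cat([freq.real, freq.imag], dim=1)            # real-valued tensor for the conv
        y = self.conv(y)
        real, imag = y.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        return torch.fft.irfft2(freq, s=(H, W), norm="ortho")   # back to the spatial domain
```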
PDF 7 pages, 5 figures, 3 tables
Cooperative Self-Training for Multi-Target Adaptive Semantic Segmentation
Authors:Yangsong Zhang, Subhankar Roy, Hongtao Lu, Elisa Ricci, Stéphane Lathuilière
In this work we address multi-target domain adaptation (MTDA) in semantic segmentation, which consists of adapting a single model from an annotated source dataset to multiple unannotated target datasets that differ in their underlying data distributions. To address MTDA, we propose a self-training strategy that employs pseudo-labels to induce cooperation among multiple domain-specific classifiers. We employ feature stylization as an efficient way to generate image views that form an integral part of self-training. Additionally, to prevent the network from overfitting to noisy pseudo-labels, we devise a rectification strategy that leverages the predictions from different classifiers to estimate the quality of pseudo-labels. Our extensive experiments on numerous settings, based on four different semantic segmentation datasets, validate the effectiveness of the proposed self-training strategy and show that our method outperforms state-of-the-art MTDA approaches. Code available at: https://github.com/Mael-zys/CoaST
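One way to picture the rectification idea is to let a second domain-specific classifier's agreement down-weight the pseudo-labels produced by the first. The snippet below is a hedged sketch of that general mechanism; the threshold and weighting rule are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def rectified_pseudo_label_loss(logits_main, logits_aux, threshold=0.9):
    """Pseudo-label rectification sketch: weight each pixel's pseudo-label by
    the main classifier's confidence times the auxiliary classifier's agreement."""
    with torch.no_grad():
        prob_main = logits_main.softmax(dim=1)                       # (B, C, H, W)
        prob_aux = logits_aux.softmax(dim=1)
        conf, pseudo = prob_main.max(dim=1)                          # pseudo-labels + confidence
        agree = prob_aux.gather(1, pseudo.unsqueeze(1)).squeeze(1)   # agreement of the other head
        weight = (conf * agree) * (conf > threshold).float()         # per-pixel rectification weight
    loss = F.cross_entropy(logits_aux, pseudo, reduction="none")     # train aux head on pseudo-labels
    return (weight * loss).mean()
```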
PDF Accepted at WACV 2023
Bridged Transformer for Vision and Point Cloud 3D Object Detection
Authors:Yikai Wang, TengQi Ye, Lele Cao, Wenbing Huang, Fuchun Sun, Fengxiang He, Dacheng Tao
3D object detection is a crucial research topic in computer vision, and conventional setups usually take 3D point clouds as input. Recently, there is a trend of leveraging multiple sources of input data, such as complementing the 3D point cloud with 2D images that often have richer color and less noise. However, the heterogeneous geometry of the 2D and 3D representations prevents us from applying off-the-shelf neural networks to achieve multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective; it learns to identify 3D and 2D object bounding boxes from both points and image patches. A key element of BrT is the use of object queries to bridge the 3D and 2D spaces, which unifies different sources of data representations in the Transformer. We adopt a form of feature aggregation realized by point-to-patch projections, which further strengthens the correlations between images and points. Moreover, BrT works seamlessly for fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on the SUN RGB-D and ScanNetV2 datasets.
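The point-to-patch projection can be illustrated by projecting 3D points through the camera intrinsics and gathering the feature of the image patch each point falls into. The helper below is a simplified sketch under assumed input shapes, not BrT's actual fusion module.

```python
import torch

def point_to_patch_features(points_xyz, image_feats, intrinsics, patch_size=16):
    """Project 3D points into the image, find the patch each one lands in, and
    gather that patch's feature vector (illustrative fusion sketch)."""
    # points_xyz: (N, 3) in camera coordinates; image_feats: (C, Hp, Wp) patch features
    uv = (intrinsics @ points_xyz.T).T                   # (N, 3) homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)          # perspective division
    cols = (uv[:, 0] / patch_size).long().clamp(0, image_feats.shape[2] - 1)
    rows = (uv[:, 1] / patch_size).long().clamp(0, image_feats.shape[1] - 1)
    return image_feats[:, rows, cols].T                  # (N, C) image feature per 3D point
```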
PDF CVPR 2022