Detection / Segmentation / Tracking


Updated on 2023-03-29

Sample Imbalance Adjustment and Similar Object Exclusion in Underwater Object Tracking

Authors:Yunfeng Li, Bo Wang, Ye Li, Wei Huo, Zhuoyan Liu

Although modern trackers exhibit competitive performance when trained and evaluated on open-air benchmarks, two problems remain when they are applied to underwater object tracking (UOT). A single-object tracker is trained on open-air datasets, which results in a serious sample imbalance between underwater objects and open-air objects when it is applied to UOT. Moreover, underwater targets such as fish and dolphins usually have a similar appearance, making it challenging for models to learn discriminative features for them. Existing detection-based post-processing approaches struggle to distinguish a tracked target from similar objects. In this study, UOSTrack is proposed, which combines underwater images and open-air sequence hybrid training (UOHT) with motion-based post-processing (MBPP). The UOHT training paradigm is designed to train the sample-imbalanced underwater tracker. In particular, underwater object detection (UOD) images are converted into image pairs through customised data augmentation, such that the tracker is exposed to more underwater-domain training samples and learns the feature expressions of underwater objects. The MBPP paradigm is proposed to exclude similar objects near the target. In particular, it uses the estimation box predicted by a Kalman filter, together with the candidate boxes in each frame, to reconfirm a tracked target that is hidden in the candidate area after it has been lost. UOSTrack provides an average performance improvement of 3.5% over OSTrack on the similar-object challenge attribute of UOT100 and UTB180, and average overall improvements of 1% and 3% on the two benchmarks, respectively. The results on the two UOT benchmarks demonstrate that UOSTrack sets a new state of the art, confirm the effectiveness of UOHT and MBPP, and show the generalisation and applicability of MBPP for use in UOT.
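As an illustration of the MBPP idea, the sketch below pairs a constant-velocity Kalman prediction with an IoU gate over the per-frame candidate boxes to re-confirm a lost target. It is a minimal sketch, not the authors' code: the state model, the fixed box size carried over from the last confident frame, and the 0.5 IoU gate are all assumptions.

```python
# Minimal sketch of motion-based candidate reconfirmation (assumptions: a
# constant-velocity Kalman model on the box centre, a fixed box size carried
# over from the last confident frame, and a 0.5 IoU gate; none of these
# choices are taken from the paper).
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class CentreKalman:
    """Constant-velocity Kalman filter over the box centre (cx, cy)."""
    def __init__(self, cx, cy):
        self.x = np.array([cx, cy, 0.0, 0.0])                    # position + velocity
        self.P = np.eye(4) * 10.0                                # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0    # transition model
        self.H = np.eye(2, 4)                                    # observe (cx, cy) only
        self.Q = np.eye(4) * 0.01                                # process noise
        self.R = np.eye(2)                                       # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, cx, cy):
        y = np.array([cx, cy]) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def reconfirm(kf, last_wh, candidates, iou_gate=0.5):
    """Re-identify a lost target among candidate boxes using the motion prior."""
    cx, cy = kf.predict()
    w, h = last_wh
    est_box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
    best = max(candidates, key=lambda b: iou(est_box, b), default=None)
    if best is not None and iou(est_box, best) >= iou_gate:
        kf.update((best[0] + best[2]) / 2, (best[1] + best[3]) / 2)
        return best    # target re-confirmed in the candidate area
    return None        # keep reporting the target as lost this frame
```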
PDF

Detecting Everything in the Open World: Towards Universal Object Detection

Authors:Zhenyu Wang, Yali Li, Xi Chen, Ser-Nam Lim, Antonio Torralba, Hengshuang Zhao, Shengjin Wang

In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that has the ability to recognize enormous categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations. 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both vision and language modalities. 3) it further promotes the generalization ability to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. Our UniDetector exhibits strong zero-shot generalization ability on large-vocabulary datasets like LVIS, ImageNetBoxes, and VisualGenome: it surpasses the traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data.
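The probability calibration mentioned above can be illustrated with a prior-based re-weighting of class scores. The sketch below is a generic form; the prior-estimation strategy (averaging the detector's own predictions) and the exponent are chosen for illustration rather than taken from the paper.

```python
# Sketch of prior-based probability calibration (assumptions: the class prior
# is estimated from the detector's own predictions on unlabeled images, and
# the exponent gamma is a hand-picked hyper-parameter).
import torch

def calibrate(probs: torch.Tensor, prior: torch.Tensor, gamma: float = 0.6):
    """probs: (num_boxes, num_classes) raw class probabilities.
    prior: (num_classes,) estimated frequency of each class.
    Frequent classes are down-weighted so rare/unseen ones can surface."""
    calibrated = probs / prior.clamp_min(1e-6) ** gamma
    return calibrated / calibrated.sum(dim=-1, keepdim=True)

# Usage idea: estimate the prior by averaging predictions over many images,
# then calibrate per-image detections.
# prior = torch.cat(all_image_probs, dim=0).mean(dim=0)
# scores = calibrate(image_probs, prior)
```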
PDF Accepted by CVPR2023

Adaptive Base-class Suppression and Prior Guidance Network for One-Shot Object Detection

Authors:Wenwen Zhang, Xinyu Xiao, Hangguan Shan, Eryun Liu

One-shot object detection (OSOD) aims to detect all object instances of the category specified by a query image. Most existing studies in OSOD endeavor to explore effective cross-image correlation and alleviate semantic feature misalignment; however, they ignore the model's bias towards the base classes and the resulting generalization degradation on the novel classes. Observing this, we propose a novel framework, namely the Base-class Suppression and Prior Guidance (BSPG) network, to overcome the problem. Specifically, the objects of base categories can be explicitly detected by a base-class predictor and adaptively eliminated by our base-class suppression module. Moreover, a prior guidance module is designed to calculate the correlation of high-level features in a non-parametric manner, producing a class-agnostic prior map that provides the target features with rich semantic cues and guides the subsequent detection process. Equipped with the proposed two modules, we endow the model with a strong discriminative ability to distinguish the target objects from distractors belonging to the base classes. Extensive experiments show that our method outperforms the previous techniques by a large margin and achieves new state-of-the-art performance under various evaluation settings.
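The prior guidance module computes feature correlations non-parametrically. A minimal version of such a class-agnostic prior map, assuming a globally pooled query vector and cosine similarity (the real module may use a richer correlation), could look like this:

```python
# Sketch of a non-parametric, class-agnostic prior map (assumption: the query
# image is pooled to a single vector and correlated with the target feature
# map by cosine similarity).
import torch
import torch.nn.functional as F

def prior_map(target_feat: torch.Tensor, query_feat: torch.Tensor):
    """target_feat: (B, C, H, W) high-level features of the target image.
    query_feat: (B, C, h, w) high-level features of the query patch.
    Returns a (B, 1, H, W) class-agnostic prior map in [0, 1]."""
    q = F.adaptive_avg_pool2d(query_feat, 1)            # (B, C, 1, 1) query vector
    sim = F.cosine_similarity(target_feat, q, dim=1)    # (B, H, W) correlation
    return sim.clamp(min=0).unsqueeze(1)                # keep positive correlation only
```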
PDF

Learned Two-Plane Perspective Prior based Image Resampling for Efficient Object Detection

Authors:Anurag Ghosh, N. Dinesh Reddy, Christoph Mertz, Srinivasa G. Narasimhan

Real-time efficient perception is critical for autonomous navigation and city scale sensing. Orthogonal to architectural improvements, streaming perception approaches have exploited adaptive sampling improving real-time detection performance. In this work, we propose a learnable geometry-guided prior that incorporates rough geometry of the 3D scene (a ground plane and a plane above) to resample images for efficient object detection. This significantly improves small and far-away object detection performance while also being more efficient both in terms of latency and memory. For autonomous navigation, using the same detector and scale, our approach improves detection rate by +4.1 $AP_{S}$ or +39% and in real-time performance by +5.3 $sAP_{S}$ or +63% for small objects over state-of-the-art (SOTA). For fixed traffic cameras, our approach detects small objects at image scales other methods cannot. At the same scale, our approach improves detection of small objects by 195% (+12.5 $AP_{S}$) over naive-downsampling and 63% (+4.2 $AP_{S}$) over SOTA.
PDF CVPR 2023 Accepted Paper, 21 pages, 16 Figures

Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation

Authors:Xu Zheng, Jinjing Zhu, Yexin Liu, Zidong Cao, Chong Fu, Lin Wang

The ability of scene understanding has sparked active research for panoramic image semantic segmentation. However, the performance is hampered by distortion of the equirectangular projection (ERP) and a lack of pixel-wise annotations. For this reason, some works treat the ERP and pinhole images equally and transfer knowledge from the pinhole to ERP images via unsupervised domain adaptation (UDA). However, they fail to handle the domain gaps caused by: 1) the inherent differences between camera sensors and captured scenes; 2) the distinct image formats (e.g., ERP and pinhole images). In this paper, we propose a novel yet flexible dual-path UDA framework, DPPASS, taking ERP and tangent projection (TP) images as inputs. To reduce the domain gaps, we propose cross-projection and intra-projection training. The cross-projection training includes tangent-wise feature contrastive training and prediction consistency training. That is, the former formulates the features with the same projection locations as positive examples and vice versa, for the models' awareness of distortion, while the latter ensures the consistency of cross-model predictions between the ERP and TP. Moreover, adversarial intra-projection training is proposed to reduce the inherent gap between the features of the pinhole images and those of the ERP and TP images, respectively. Importantly, the TP path can be freely removed after training, leading to no additional inference cost. Extensive experiments on two benchmarks show that our DPPASS achieves a +1.06% mIoU improvement over the state-of-the-art approaches.
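The tangent-wise feature contrastive training treats features at the same projection locations as positives. A minimal InfoNCE-style sketch of that idea, assuming the location-aligned features have already been sampled (sampling strategy and temperature are assumptions), is shown below.

```python
# Sketch of the cross-projection contrastive idea (assumptions: ERP and TP
# features have already been sampled at N matching locations per image, and a
# standard InfoNCE objective with temperature tau is used).
import torch
import torch.nn.functional as F

def cross_projection_infonce(erp_feats, tp_feats, tau: float = 0.1):
    """erp_feats, tp_feats: (N, C) features at the same N projection locations.
    Position i in one view is the positive of position i in the other view;
    all other positions act as negatives."""
    erp = F.normalize(erp_feats, dim=1)
    tp = F.normalize(tp_feats, dim=1)
    logits = erp @ tp.t() / tau                 # (N, N) similarity matrix
    targets = torch.arange(erp.size(0), device=erp.device)
    return F.cross_entropy(logits, targets)
```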
PDF Accepted by CVPR 2023

Spatio-Temporal Pixel-Level Contrastive Learning-based Source-Free Domain Adaptation for Video Semantic Segmentation

Authors:Shao-Yuan Lo, Poojan Oza, Sumanth Chennupati, Alejandro Galindo, Vishal M. Patel

Unsupervised Domain Adaptation (UDA) of semantic segmentation transfers labeled source knowledge to an unlabeled target domain by relying on accessing both the source and target data. However, the access to source data is often restricted or infeasible in real-world scenarios. Under such source-data-restricted circumstances, UDA is less practical. To address this, recent works have explored solutions under the Source-Free Domain Adaptation (SFDA) setup, which aims to adapt a source-trained model to the target domain without accessing source data. Still, existing SFDA approaches use only image-level information for adaptation, making them sub-optimal in video applications. This paper studies SFDA for Video Semantic Segmentation (VSS), where temporal information is leveraged to address video adaptation. Specifically, we propose Spatio-Temporal Pixel-Level (STPL) contrastive learning, a novel method that takes full advantage of spatio-temporal information to better tackle the absence of source data. STPL explicitly learns semantic correlations among pixels in the spatio-temporal space, providing strong self-supervision for adaptation to the unlabeled target domain. Extensive experiments show that STPL achieves state-of-the-art performance on VSS benchmarks compared to current UDA and SFDA approaches. Code is available at: https://github.com/shaoyuanlo/STPL
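As a rough illustration of pixel-level contrast in a spatio-temporal setting, the sketch below builds class prototypes from one frame's pseudo-labelled pixels and contrasts the next frame's pixels against them. The pairing strategy and loss form are assumptions, not the paper's exact STPL formulation.

```python
# Rough sketch of pixel-to-prototype contrast across two frames (assumptions:
# pseudo-labels come from the source-trained model, prototypes are built from
# frame t and contrasted against pixels of frame t+1).
import torch
import torch.nn.functional as F

def pixel_prototype_contrast(feat_t, feat_t1, pseudo_t, pseudo_t1,
                             num_classes: int, tau: float = 0.1):
    """feat_*: (C, H, W) pixel features for two consecutive frames.
    pseudo_*: (H, W) integer (torch.long) pseudo-labels for those frames."""
    C = feat_t.size(0)
    f_t = F.normalize(feat_t.flatten(1).t(), dim=1)      # (HW, C)
    f_t1 = F.normalize(feat_t1.flatten(1).t(), dim=1)
    labels_t, labels_t1 = pseudo_t.flatten(), pseudo_t1.flatten()

    # Class prototypes from frame t (zero vector if a class is absent).
    protos = torch.stack([
        f_t[labels_t == c].mean(0) if (labels_t == c).any() else f_t.new_zeros(C)
        for c in range(num_classes)])
    protos = F.normalize(protos, dim=1)

    # Pixels of frame t+1 are pulled towards the prototype of their pseudo-class.
    logits = f_t1 @ protos.t() / tau                     # (HW, num_classes)
    return F.cross_entropy(logits, labels_t1)
```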
PDF Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023

DoNet: Deep De-overlapping Network for Cytology Instance Segmentation

Authors:Hao Jiang, Rushan Zhang, Yanning Zhou, Yumeng Wang, Hao Chen

Cell instance segmentation in cytology images is of significant importance for biological analysis and cancer screening, yet it remains challenging due to 1) the extensive overlapping of translucent cell clusters, which causes ambiguous boundaries, and 2) the confusion of mimics and debris as nuclei. In this work, we propose a De-overlapping Network (DoNet) with a decompose-and-recombine strategy. A Dual-path Region Segmentation Module (DRM) explicitly decomposes the cell clusters into intersection and complement regions, followed by a Semantic Consistency-guided Recombination Module (CRM) for integration. To further introduce the containment relationship of the nucleus in the cytoplasm, we design a Mask-guided Region Proposal Strategy (MRP) that integrates the cell attention maps for inner-cell instance prediction. We validate the proposed approach on the ISBI2014 and CPS datasets. Experiments show that our proposed DoNet significantly outperforms other state-of-the-art (SOTA) cell instance segmentation methods. The code is available at https://github.com/DeepDoNet/DoNet.
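The decompose-and-recombine strategy can be pictured with plain mask arithmetic. The toy sketch below uses boolean operations, whereas DoNet's DRM/CRM modules predict these regions with learned heads.

```python
# Toy illustration of the decompose-and-recombine idea (assumption: binary
# instance masks of two overlapping cells are available as numpy arrays).
import numpy as np

def decompose(mask_a: np.ndarray, mask_b: np.ndarray):
    """Split two overlapping cell masks into the shared intersection region
    and the two complement (non-overlapping) regions."""
    intersection = mask_a & mask_b
    comp_a = mask_a & ~intersection
    comp_b = mask_b & ~intersection
    return intersection, comp_a, comp_b

def recombine(intersection: np.ndarray, comp: np.ndarray):
    """Recover a full instance mask from its complement region plus the
    shared intersection region."""
    return comp | intersection
```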
PDF Accepted by CVPR2023

IFSeg: Image-free Semantic Segmentation via Vision-Language Model

Authors:Sukmin Yun, Seong Hyeon Park, Paul Hongsuck Seo, Jinwoo Shin

Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer) across various visual tasks. However, VL-driven segmentation has been under-explored, and the existing approaches still have the burden of acquiring additional training images or even segmentation annotations to adapt a VL model to downstream segmentation tasks. In this paper, we introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories, but without any task-specific images and annotations. To tackle this challenging task, our proposed method, coined IFSeg, generates VL-driven artificial image-segmentation pairs and updates a pre-trained VL model to a segmentation task. We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens. Given that a pre-trained VL model projects visual and text tokens into a common space where tokens that share the semantics are located closely, this artificially generated word map can replace the real image inputs for such a VL model. Through an extensive set of experiments, our model not only establishes an effective baseline for this novel task but also demonstrates strong performance compared to existing methods that rely on stronger supervision, such as task-specific images and segmentation masks. Code is available at https://github.com/alinlab/ifseg.
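The artificial image-free training pair (a random category map plus the matching word-token map) can be sketched as follows; the grid sizes and the nearest-neighbour upsampling scheme are illustrative assumptions.

```python
# Sketch of generating an "image-free" artificial training pair (assumptions:
# a coarse random category grid is upsampled to the token resolution, and the
# pseudo image tokens are simply the text embeddings of the sampled category
# names; grid sizes are illustrative).
import torch

def make_artificial_pair(word_embeds: torch.Tensor, grid: int = 4, out: int = 16):
    """word_embeds: (num_categories, C) text embeddings of the target classes.
    Returns (pseudo_image_tokens, label_map) for one artificial sample."""
    num_categories = word_embeds.size(0)
    coarse = torch.randint(num_categories, (grid, grid))      # random category layout
    label_map = coarse.repeat_interleave(out // grid, 0) \
                      .repeat_interleave(out // grid, 1)      # (out, out) dense labels
    pseudo_tokens = word_embeds[label_map]                    # (out, out, C) "image" tokens
    return pseudo_tokens, label_map
```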
PDF Accepted to CVPR 2023

Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images

Authors:Bowei Du, Yecheng Huang, Jiaxin Chen, Di Huang

Object detection on drone images with low latency is an important but challenging task on the resource-constrained unmanned aerial vehicle (UAV) platform. This paper investigates optimizing the detection head based on sparse convolution, which proves effective in balancing accuracy and efficiency. Nevertheless, it suffers from inadequate integration of contextual information of tiny objects as well as clumsy control of the mask ratio in the presence of foreground with varying scales. To address the issues above, we propose a novel global context-enhanced adaptive sparse convolutional network (CEASC). It first develops a context-enhanced group normalization (CE-GN) layer, by replacing the statistics based on sparsely sampled features with the global contextual ones, and then designs an adaptive multi-layer masking strategy to generate optimal mask ratios at distinct scales for compact foreground coverage, promoting both accuracy and efficiency. Extensive experimental results on two major benchmarks, i.e., VisDrone and UAVDT, demonstrate that CEASC remarkably reduces the GFLOPs and accelerates the inference procedure when plugged into typical state-of-the-art detection frameworks (e.g., RetinaNet and GFL V1) with competitive performance. Code is available at https://github.com/Cuogeihong/CEASC.
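The CE-GN idea of normalising sparsely convolved features with statistics taken from a full-context feature map can be sketched as a small module; the group count and the way the dense context is obtained here are assumptions for illustration.

```python
# Sketch of a context-enhanced normalization layer (assumptions: sparse
# features are normalized with group statistics computed from a dense,
# full-context feature map; group count and affine parameters are
# illustrative, not the paper's configuration).
import torch
import torch.nn as nn

class ContextEnhancedGN(nn.Module):
    def __init__(self, channels: int, groups: int = 8, eps: float = 1e-5):
        super().__init__()
        self.groups, self.eps = groups, eps
        self.weight = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, sparse_feat, dense_feat):
        """sparse_feat: (B, C, H, W) output of the sparse convolution (zeros
        outside the mask). dense_feat: (B, C, H, W) features carrying global
        context, used only for the normalization statistics."""
        B, C, H, W = dense_feat.shape
        g = dense_feat.reshape(B, self.groups, C // self.groups, H * W)
        mean = g.mean(dim=(2, 3), keepdim=True)        # global per-group statistics
        var = g.var(dim=(2, 3), keepdim=True)
        s = sparse_feat.reshape(B, self.groups, C // self.groups, H * W)
        out = (s - mean) / (var + self.eps).sqrt()
        return out.reshape(B, C, H, W) * self.weight + self.bias
```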
PDF Accepted by CVPR 2023

Viewpoint Equivariance for Multi-View 3D Object Detection

Authors:Dian Chen, Jie Li, Vitor Guizilini, Rares Ambrus, Adrien Gaidon

3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input. In this work we gain intuition from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting image features with positional encodings from their 3D perspective geometry. We design view-conditioned queries at the output level, which enables the generation of multiple virtual frames during training to learn viewpoint equivariance by enforcing multi-view consistency. The multi-view geometry injected at the input level as positional encodings and regularized at the loss level provides rich geometric cues for 3D object detection, leading to state-of-the-art performance on the nuScenes benchmark. The code and model are made available at https://github.com/TRI-ML/VEDet.
PDF 11 pages, 4 figures; accepted to CVPR 2023

BoxVIS: Video Instance Segmentation with Box Annotations

Authors:Minghan Li, Lei Zhang

It is expensive and labour-intensive to label pixel-wise object masks in a video. As a result, the amount of pixel-wise annotations in existing video instance segmentation (VIS) datasets is small, limiting the generalization capability of trained VIS models. An alternative but much cheaper solution is to use bounding boxes to label instances in videos. Inspired by the recent success of box-supervised image instance segmentation, we first adapt the state-of-the-art pixel-supervised VIS models to a box-supervised VIS (BoxVIS) baseline, and observe only slight performance degradation. We consequently propose to improve BoxVIS performance from two aspects. First, we propose a box-center guided spatial-temporal pairwise affinity (STPA) loss to predict instance masks for better spatial and temporal consistency. Second, we collect a larger-scale box-annotated VIS dataset (BVISD) by consolidating the videos from current VIS benchmarks and converting images from the COCO dataset to short pseudo video clips. With the proposed BVISD and the STPA loss, our trained BoxVIS model demonstrates promising instance mask prediction performance. Specifically, it achieves 43.2% and 29.0% mask AP on the YouTube-VIS 2021 and OVIS validation sets, respectively, exhibiting comparable or even better generalization performance than state-of-the-art pixel-supervised VIS models while using only 16% of the annotation time and cost. Code and data for BoxVIS can be found at https://github.com/MinghanLi/BoxVIS.
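The pairwise affinity idea behind the STPA loss can be sketched in its simplest single-frame form: neighbouring pixels with similar colour inside the box are encouraged to receive the same mask prediction. The temporal pairing and box-centre guidance of the actual loss are omitted here, and the colour threshold is an assumption.

```python
# Simplified sketch of a colour-similarity pairwise affinity term (assumptions:
# a single frame and only right/down neighbour pairs are used; the paper's
# STPA loss additionally uses temporal pairs and box-centre guidance).
import torch

def pairwise_affinity_loss(mask_logits, image_lab, box_mask, color_tau: float = 0.3):
    """mask_logits: (B, 1, H, W) predicted mask logits.
    image_lab: (B, 3, H, W) colour image (e.g. in LAB space).
    box_mask: (B, 1, H, W) float mask, 1 inside the ground-truth box."""
    prob = mask_logits.sigmoid()
    H, W = prob.shape[-2:]
    total = prob.new_zeros(())
    for dy, dx in [(0, 1), (1, 0)]:                       # right and down neighbours
        a = lambda t: t[..., dy:H, dx:W]                  # shifted view
        b = lambda t: t[..., :H - dy, :W - dx]            # reference view
        agree = (a(prob) * b(prob) +
                 (1 - a(prob)) * (1 - b(prob))).clamp(1e-6, 1 - 1e-6)
        color_diff = (a(image_lab) - b(image_lab)).norm(dim=1, keepdim=True)
        # Only neighbour pairs that look alike and lie inside the box are supervised.
        weight = torch.exp(-color_diff / color_tau) * a(box_mask) * b(box_mask)
        total = total + (-agree.log() * weight).sum() / weight.sum().clamp_min(1.0)
    return total / 2
```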
PDF

Transformer-based Multi-Instance Learning for Weakly Supervised Object Detection

Authors:Zhaofei Wang, Weijia Zhang, Min-Ling Zhang

Weakly Supervised Object Detection (WSOD) enables the training of object detection models using only image-level annotations. State-of-the-art WSOD detectors commonly rely on multi-instance learning (MIL) as the backbone of their detectors and assume that the bounding box proposals of an image are independent of each other. However, since such approaches only utilize the highest-scoring proposal and discard the potentially useful information from other proposals, their independent MIL backbone often limits models to salient parts of an object or causes them to detect only one object per class. To solve the above problems, we propose a novel backbone for WSOD based on our tailored Vision Transformer, named Weakly Supervised Transformer Detection Network (WSTDN). Our algorithm not only is the first to demonstrate that self-attention modules that consider inter-instance relationships are effective backbones for WSOD, but also introduces a novel bounding box mining method (BBM) integrated with a memory transfer refinement (MTR) procedure that exploits instance dependencies to facilitate instance refinement. Experimental results on the PASCAL VOC2007 and VOC2012 benchmarks demonstrate the effectiveness of our proposed WSTDN and modified instance refinement modules.
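A minimal transformer-style MIL head in the spirit described above could look like the sketch below; the two-branch (classification/detection) scoring and the layer sizes are generic WSOD choices rather than the paper's exact WSTDN design.

```python
# Minimal sketch of a transformer-style MIL head for WSOD (assumptions:
# proposal features come from a backbone + RoI pooling, the image-level label
# is the only supervision, and layer sizes are illustrative).
import torch
import torch.nn as nn

class TransformerMILHead(nn.Module):
    def __init__(self, dim: int = 512, num_classes: int = 20, heads: int = 8):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)                          # models inter-proposal relations
        self.cls = nn.Linear(dim, num_classes)     # "what class" branch
        self.det = nn.Linear(dim, num_classes)     # "which proposal" branch

    def forward(self, proposal_feats: torch.Tensor):
        """proposal_feats: (B, N, dim) RoI features of N proposals per image."""
        x = self.encoder(proposal_feats)
        cls_score = self.cls(x).softmax(dim=2)     # softmax over classes
        det_score = self.det(x).softmax(dim=1)     # softmax over proposals
        proposal_scores = cls_score * det_score    # (B, N, num_classes)
        image_scores = proposal_scores.sum(dim=1)  # image-level prediction for BCE
        return image_scores, proposal_scores
```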
PDF

The Devil is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation

Authors:Beomyoung Kim, Joonhyun Jeong, Dongyoon Han, Sung Ju Hwang

In this paper, we introduce a novel learning scheme named weakly semi-supervised instance segmentation (WSSIS) with point labels for budget-efficient and high-performance instance segmentation. Namely, we consider a dataset setting consisting of a few fully-labeled images and a large number of point-labeled images. Motivated by the observation that the main challenge of semi-supervised approaches derives from the trade-off between false-negative and false-positive instance proposals, we propose a method for WSSIS that can effectively leverage the budget-friendly point labels as a powerful weak supervision source to resolve the challenge. Furthermore, to deal with the hard case where the amount of fully-labeled data is extremely limited, we propose a MaskRefineNet that refines noise in rough masks. We conduct extensive experiments on the COCO and BDD100K datasets, and the proposed method achieves promising results comparable to those of the fully-supervised model, even with 50% of the fully labeled COCO data (38.8% vs. 39.7%). Moreover, when using as little as 5% of the fully labeled COCO data, our method shows significantly superior performance over the state-of-the-art semi-supervised learning method (33.7% vs. 24.9%). The code is available at https://github.com/clovaai/PointWSSIS.
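One simple way to exploit point labels is to keep, for each annotated point, the most confident proposal whose mask covers the point and whose class matches; the sketch below illustrates that idea, though the exact selection rule used in the paper may differ.

```python
# Sketch of point-guided proposal selection (assumptions: each annotated point
# carries a class id, and a proposal is accepted if its mask covers the point
# and its predicted class matches).
import torch

def select_by_points(masks, scores, labels, points, point_classes):
    """masks: (N, H, W) binary proposal masks, scores: (N,), labels: (N,)
    points: (P, 2) integer (y, x) point annotations, point_classes: (P,)."""
    keep = []
    for (y, x), cls in zip(points.tolist(), point_classes.tolist()):
        covering = (masks[:, y, x] > 0) & (labels == cls)   # proposals covering this point
        if covering.any():
            idx = torch.nonzero(covering).squeeze(1)
            keep.append(idx[scores[idx].argmax()].item())   # keep the most confident one
    return sorted(set(keep))
```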
PDF CVPR 2023

What Can Human Sketches Do for Object Detection?

Authors:Pinaki Nath Chowdhury, Ayan Kumar Bhunia, Aneeshan Sain, Subhadeep Koley, Tao Xiang, Yi-Zhe Song

Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues. The exploration of such innate properties of human sketches has, however, been limited to that of image retrieval. In this paper, for the first time, we cultivate the expressiveness of sketches for the fundamental vision task of object detection. The end result is a sketch-enabled object detection framework that detects based on what you sketch: "that zebra" (e.g., one that is eating the grass) in a herd of zebras (instance-aware detection), and only the part (e.g., the "head" of a "zebra") that you desire (part-aware detection). We further dictate that our model works without (i) knowing which category to expect at testing (zero-shot) and (ii) requiring additional bounding boxes (as per fully supervised) or class labels (as per weakly supervised). Instead of devising a model from the ground up, we show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR), which can already elegantly solve the task: CLIP provides model generalisation, and SBIR bridges the (sketch → photo) gap. In particular, we first perform independent prompting on both the sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders on the back of the generalisation ability of CLIP. We then devise a training paradigm to adapt the learned encoders for object detection, such that the region embeddings of detected boxes are aligned with the sketch and photo embeddings from SBIR. Evaluated on standard object detection datasets like PASCAL-VOC and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised object detectors (WSOD) in zero-shot setups. Project Page: https://pinakinathc.github.io/sketch-detect
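The alignment between region embeddings and sketch embeddings suggests a straightforward inference step: score each detected region by its similarity to the query sketch in the shared embedding space. The sketch below assumes such a shared CLIP-like space and an illustrative threshold.

```python
# Sketch of sketch-conditioned region scoring (assumptions: region embeddings
# and the query-sketch embedding already live in a shared CLIP-like space, and
# the similarity threshold is illustrative).
import torch
import torch.nn.functional as F

def score_regions(region_embeds, boxes, sketch_embed, thresh: float = 0.25):
    """region_embeds: (N, C) embeddings of detected boxes, boxes: (N, 4),
    sketch_embed: (C,) embedding of the user's query sketch."""
    sim = F.cosine_similarity(region_embeds, sketch_embed.unsqueeze(0), dim=1)
    keep = sim > thresh
    return boxes[keep], sim[keep]    # boxes that match what the user sketched
```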
PDF Accepted as Top 12 Best Papers. Will be presented in special single-track plenary sessions to all attendees in Computer Vision and Pattern Recognition (CVPR), 2023. Project Page: www.pinakinathc.me/sketch-detect

AIR-DA: Adversarial Image Reconstruction for Unsupervised Domain Adaptive Object Detection

Authors:Kunyang Sun, Wei Lin, Haoqin Shi, Zhengming Zhang, Yongming Huang, Horst Bischof

Unsupervised domain adaptive object detection is a challenging vision task where object detectors are adapted from a label-rich source domain to an unlabeled target domain. Recent advances prove the efficacy of adversarial-based domain alignment, where adversarial training between the feature extractor and the domain discriminator results in domain invariance in the feature space. However, due to the domain shift, domain discrimination, especially on low-level features, is an easy task. This results in an imbalance of the adversarial training between the domain discriminator and the feature extractor. In this work, we achieve a better domain alignment by introducing an auxiliary regularization task to improve the training balance. Specifically, we propose Adversarial Image Reconstruction (AIR) as the regularizer to facilitate the adversarial training of the feature extractor. We further design a multi-level feature alignment module to enhance the adaptation performance. Our evaluations across several datasets of challenging domain shifts demonstrate that the proposed method outperforms all previous methods, of both one- and two-stage, in most settings.
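A generic sketch of adversarial feature alignment with an auxiliary reconstruction head is given below; the gradient-reversal formulation, the L1 reconstruction loss, and the loss weighting are assumptions for illustration, not the paper's exact AIR design.

```python
# Generic sketch of adversarial feature alignment with an auxiliary image
# reconstruction head (assumptions: a gradient-reversal layer drives the
# adversarial part and an L1 reconstruction loss acts as the regularizer;
# architectures and loss weights are illustrative).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None    # reverse gradients into the feature extractor

def domain_losses(feat, image, is_target: bool, discriminator, decoder, lam=0.1):
    """feat: backbone feature map, image: the corresponding input image."""
    d_logit = discriminator(GradReverse.apply(feat, lam))        # adversarial branch
    target = torch.full_like(d_logit, float(is_target))          # 0 = source, 1 = target
    adv = nn.functional.binary_cross_entropy_with_logits(d_logit, target)
    recon = nn.functional.l1_loss(decoder(feat), image)          # reconstruction regularizer
    return adv + recon
```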
PDF Accepted at IEEE Robotics and Automation Letters 2023

3D Video Object Detection with Learnable Object-Centric Global Optimization

Authors:Jiawei He, Yuntao Chen, Naiyan Wang, Zhaoxiang Zhang

We explore long-term temporal visual correspondence-based optimization for 3D video object detection in this work. Visual correspondence refers to one-to-one mappings for pixels across multiple images. Correspondence-based optimization is the cornerstone for 3D scene reconstruction but is less studied in 3D video object detection, because moving objects violate multi-view geometry constraints and are treated as outliers during scene reconstruction. We address this issue by treating objects as first-class citizens during correspondence-based optimization. In this work, we propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment. Empirically, we verify the effectiveness and efficiency of BA-Det for multiple baseline 3D detectors under various setups. Our BA-Det achieves SOTA performance on the large-scale Waymo Open Dataset (WOD) with only marginal computation cost. Our code is available at https://github.com/jiaweihe1996/BA-Det.
PDF CVPR2023

AutoKary2022: A Large-Scale Densely Annotated Dataset for Chromosome Instance Segmentation

Authors:Dan You, Pengcheng Xia, Qiuzhu Chen, Minghui Wu, Suncheng Xiang, Jun Wang

Automated chromosome instance segmentation from metaphase cell microscopic images is critical for the diagnosis of chromosomal disorders (i.e., karyotype analysis). However, it is still a challenging task due to the lack of densely annotated datasets and the complicated morphologies of chromosomes, e.g., dense distribution, arbitrary orientations, and a wide range of lengths. To facilitate the development of this area, we take a big step forward and manually construct a large-scale densely annotated dataset named AutoKary2022, which contains over 27,000 chromosome instances in 612 microscopic images from 50 patients. Specifically, each instance is annotated with a polygonal mask and a class label to assist in precise chromosome detection and segmentation. On top of it, we systematically investigate representative methods on this dataset and obtain a number of interesting findings, which help us gain a deeper understanding of the fundamental problems in chromosome instance segmentation. We hope this dataset can advance research towards medical understanding. The dataset is available at: https://github.com/wangjuncongyu/chromosome-instance-segmentation-dataset.
PDF Accepted by ICME 2023

Author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 木子已 when reposting!