# 2022-09-24 更新

### Review On Deep Learning Technique For Underwater Object Detection

Repair and maintenance of underwater structures as well as marine science rely heavily on the results of underwater object detection, which is a crucial part of the image processing workflow. Although many computer vision-based approaches have been presented, no one has yet developed a system that reliably and accurately detects and categorizes objects and animals found in the deep sea. This is largely due to obstacles that scatter and absorb light in an underwater setting. With the introduction of deep learning, scientists have been able to address a wide range of issues, including safeguarding the marine ecosystem, saving lives in an emergency, preventing underwater disasters, and detecting, spooring, and identifying underwater targets. However, the benefits and drawbacks of these deep learning systems remain unknown. Therefore, the purpose of this article is to provide an overview of the dataset that has been utilized in underwater object detection and to present a discussion of the advantages and disadvantages of the algorithms employed for this purpose.
PDF 15 pages, with 9 figures,3rd International Conference on Data Science and Machine Learning (DSML 2022)

### Domain Adaptive Semantic Segmentation via Regional Contrastive Consistency Regularization

Authors:Qianyu Zhou, Chuyun Zhuang, Ran Yi, Xuequan Lu, Lizhuang Ma

Unsupervised domain adaptation (UDA) for semantic segmentation has been well-studied in recent years. However, most existing works largely neglect the local regional consistency across different domains and are less robust to changes in outdoor environments. In this paper, we propose a novel and fully end-to-end trainable approach, called regional contrastive consistency regularization (RCCR) for domain adaptive semantic segmentation. Our core idea is to pull the similar regional features extracted from the same location of different images, i.e., the original image and augmented image, to be closer, and meanwhile push the features from the different locations of the two images to be separated. We innovatively propose a region-wise contrastive loss with two sampling strategies to realize effective regional consistency. Besides, we present momentum projection heads, where the teacher projection head is the exponential moving average of the student. Finally, a memory bank mechanism is designed to learn more robust and stable region-wise features under varying environments. Extensive experiments on two common UDA benchmarks, i.e., GTAV to Cityscapes and SYNTHIA to Cityscapes, demonstrate that our approach outperforms the state-of-the-art methods.
PDF Accepted to IEEE International Conference on Multimedia and Expo (ICME), 2022

### Self-adversarial Multi-scale Contrastive Learning for Semantic Segmentation of Thermal Facial Images

Authors:Jitesh Joshi, Nadia Bianchi-Berthouze, Youngjun Cho

Reliable segmentation of thermal facial images in unconstrained settings such as thermal ambience and occlusions is challenging as facial features lack salience. Limited availability of datasets from such settings further makes it difficult to train segmentation networks. To address the challenge, we propose Self-Adversarial Multi-scale Contrastive Learning (SAM-CL) as a generic learning framework to train segmentation networks. SAM-CL framework constitutes SAM-CL loss function and a thermal image augmentation (TiAug) as a domain-specific augmentation technique to simulate unconstrained settings based upon existing datasets collected from controlled settings. We use the Thermal-Face-Database to demonstrate effectiveness of our approach. Experiments conducted on the existing segmentation networks- UNET, Attention-UNET, DeepLabV3 and HRNetv2 evidence the consistent performance gain from the SAM-CL framework. Further, we present a qualitative analysis with UBComfort and DeepBreath datasets to discuss how our proposed methods perform in handling unconstrained situations.
PDF Submitted to the British Machine Vision Conference (BMVC), 2022

### IoU-Enhanced Attention for End-to-End Task Specific Object Detection

Authors:Jing Zhao, Shengjian Wu, Li Sun, Qingli Li

Without densely tiled anchor boxes or grid points in the image, sparse R-CNN achieves promising results through a set of object queries and proposal boxes updated in the cascaded training manner. However, due to the sparse nature and the one-to-one relation between the query and its attending region, it heavily depends on the self attention, which is usually inaccurate in the early training stage. Moreover, in a scene of dense objects, the object query interacts with many irrelevant ones, reducing its uniqueness and harming the performance. This paper proposes to use IoU between different boxes as a prior for the value routing in self attention. The original attention matrix multiplies the same size matrix computed from the IoU of proposal boxes, and they determine the routing scheme so that the irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads to provide the dynamic channel masks based on object query, and they multiply with the output from dynamic convs, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets, including MS-COCO and CrowdHuman, showing that it significantly improves the performance and increases the model convergence speed.
PDF ACCV2022

### Rethinking Unsupervised Domain Adaptation for Semantic Segmentation

Authors:Zhijie Wang, Masanori Suganuma, Takayuki Okatani

Unsupervised domain adaptation (UDA) adapts a model trained on one domain (called source) to a novel domain (called target) using only unlabeled data. Due to its high annotation cost, researchers have developed many UDA methods for semantic segmentation, which assume no labeled sample is available in the target domain. We question the practicality of this assumption for two reasons. First, after training a model with a UDA method, we must somehow verify the model before deployment. Second, UDA methods have at least a few hyper-parameters that need to be determined. The surest solution to these is to evaluate the model using validation data, i.e., a certain amount of labeled target-domain samples. This question about the basic assumption of UDA leads us to rethink UDA from a data-centric point of view. Specifically, we assume we have access to a minimum level of labeled data. Then, we ask how much is necessary to find good hyper-parameters of existing UDA methods. We then consider what if we use the same data for supervised training of the same model, e.g., finetuning. We conducted experiments to answer these questions with popular scenarios, {GTA5, SYNTHIA}$\rightarrow$Cityscapes. We found that i) choosing good hyper-parameters needs only a few labeled images for some UDA methods whereas a lot more for others; and ii) simple finetuning works surprisingly well; it outperforms many UDA methods if only several dozens of labeled images are available.
PDF

### AcroFOD: An Adaptive Method for Cross-domain Few-shot Object Detection

Authors:Yipeng Gao, Lingxiao Yang, Yunmu Huang, Song Xie, Shiyong Li, Wei-shi Zheng

Under the domain shift, cross-domain few-shot object detection aims to adapt object detectors in the target domain with a few annotated target data. There exists two significant challenges: (1) Highly insufficient target domain data; (2) Potential over-adaptation and misleading caused by inappropriately amplified target samples without any restriction. To address these challenges, we propose an adaptive method consisting of two parts. First, we propose an adaptive optimization strategy to select augmented data similar to target samples rather than blindly increasing the amount. Specifically, we filter the augmented candidates which significantly deviate from the target feature distribution in the very beginning. Second, to further relieve the data limitation, we propose the multi-level domain-aware data augmentation to increase the diversity and rationality of augmented data, which exploits the cross-image foreground-background mixture. Experiments show that the proposed method achieves state-of-the-art performance on multiple benchmarks.
PDF Accepted in ECCV 2022

### Few-Shot Object Detection in Unseen Domains

Authors:Karim Guirguis, George Eskandar, Matthias Kayser, Bin Yang, Juergen Beyerer

Few-shot object detection (FSOD) has thrived in recent years to learn novel object classes with limited data by transferring knowledge gained on abundant base classes. FSOD approaches commonly assume that both the scarcely provided examples of novel classes and test-time data belong to the same domain. However, this assumption does not hold in various industrial and robotics applications, where a model can learn novel classes from a source domain while inferring on classes from a target domain. In this work, we address the task of zero-shot domain adaptation, also known as domain generalization, for FSOD. Specifically, we assume that neither images nor labels of the novel classes in the target domain are available during training. Our approach for solving the domain gap is two-fold. First, we leverage a meta-training paradigm, where we learn the domain shift on the base classes, then transfer the domain knowledge to the novel classes. Second, we propose various data augmentations techniques on the few shots of novel classes to account for all possible domain-specific information. To constraint the network into encoding domain-agnostic class-specific representations only, a contrastive loss is proposed to maximize the mutual information between foreground proposals and class embeddings and reduce the network’s bias to the background information from target domain. Our experiments on the T-LESS, PASCAL-VOC, and ExDark datasets show that the proposed approach succeeds in alleviating the domain gap considerably without utilizing labels or images of novel categories from the target domain.
PDF

### Continual Learning for Class- and Domain-Incremental Semantic Segmentation

Authors:Tobias Kalb, Masoud Roschani, Miriam Ruf, Jürgen Beyerer

The field of continual deep learning is an emerging field and a lot of progress has been made. However, concurrently most of the approaches are only tested on the task of image classification, which is not relevant in the field of intelligent vehicles. Only recently approaches for class-incremental semantic segmentation were proposed. However, all of those approaches are based on some form of knowledge distillation. At the moment there are no investigations on replay-based approaches that are commonly used for object recognition in a continual setting. At the same time while unsupervised domain adaption for semantic segmentation gained a lot of traction, investigations regarding domain-incremental learning in an continual setting is not well-studied. Therefore, the goal of our work is to evaluate and adapt established solutions for continual object recognition to the task of semantic segmentation and to provide baseline methods and evaluation protocols for the task of continual semantic segmentation. We firstly introduce evaluation protocols for the class- and domain-incremental segmentation and analyze selected approaches. We show that the nature of the task of semantic segmentation changes which methods are most effective in mitigating forgetting compared to image classification. Especially, in class-incremental learning knowledge distillation proves to be a vital tool, whereas in domain-incremental learning replay methods are the most effective method.
PDF

### FusionRCNN: LiDAR-Camera Fusion for Two-stage 3D Object Detection

Authors:Xinli Xu, Shaocong Dong, Lihe Ding, Jie Wang, Tingfa Xu, Jianan Li

3D object detection with multi-sensors is essential for an accurate and reliable perception system of autonomous driving and robotics. Existing 3D detectors significantly improve the accuracy by adopting a two-stage paradigm which merely relies on LiDAR point clouds for 3D proposal refinement. Though impressive, the sparsity of point clouds, especially for the points far away, making it difficult for the LiDAR-only refinement module to accurately recognize and locate objects.To address this problem, we propose a novel multi-modality two-stage approach named FusionRCNN, which effectively and efficiently fuses point clouds and camera images in the Regions of Interest(RoI). FusionRCNN adaptively integrates both sparse geometry information from LiDAR and dense texture information from camera in a unified attention mechanism. Specifically, it first utilizes RoIPooling to obtain an image set with a unified size and gets the point set by sampling raw points within proposals in the RoI extraction step; then leverages an intra-modality self-attention to enhance the domain-specific features, following by a well-designed cross-attention to fuse the information from two modalities.FusionRCNN is fundamentally plug-and-play and supports different one-stage methods with almost no architectural changes. Extensive experiments on KITTI and Waymo benchmarks demonstrate that our method significantly boosts the performances of popular detectors.Remarkably, FusionRCNN significantly improves the strong SECOND baseline by 6.14% mAP on Waymo, and outperforms competing two-stage approaches. Code will be released soon at https://github.com/xxlbigbrother/Fusion-RCNN.
PDF 7 pages, 3 figures

### A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Authors:Wuyang Chen, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou

This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers recently demonstrate competitive performance in image classification tasks. To adopt ViT to object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and highly customized ViT architectures. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT — spatial reduction, doubled channels, and multiscale features — and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy. We further complete a scaling rule to optimize our model’s trade-off on accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks.
PDF ECCV 2022 accepted

### Self-supervised 3D Object Detection from Monocular Pseudo-LiDAR

Authors:Curie Kim, Ue-Hwan Kim, Jong-Hwan Kim

There have been attempts to detect 3D objects by fusion of stereo camera images and LiDAR sensor data or using LiDAR for pre-training and only monocular images for testing, but there have been less attempts to use only monocular image sequences due to low accuracy. In addition, when depth prediction using only monocular images, only scale-inconsistent depth can be predicted, which is the reason why researchers are reluctant to use monocular images alone. Therefore, we propose a method for predicting absolute depth and detecting 3D objects using only monocular image sequences by enabling end-to-end learning of detection networks and depth prediction networks. As a result, the proposed method surpasses other existing methods in performance on the KITTI 3D dataset. Even when monocular image and 3D LiDAR are used together during training in an attempt to improve performance, ours exhibit is the best performance compared to other methods using the same input. In addition, end-to-end learning not only improves depth prediction performance, but also enables absolute depth prediction, because our network utilizes the fact that the size of a 3D object such as a car is determined by the approximate size.
PDF Accepted for the 2022 IEEE International Conference on Multisensor Fusion and Integration (MFI 2022)

### Energy Efficient Automatic Streetlight Controlling System using Semantic Segmentation

Authors:Md Sakib Ullah Sourav, Huidong Wang

This study aims to develop a novel streetlight management system powered by computer vision technology mounted with the close circuit television (CCTV) camera that allows the light emitting diode (LED) streetlight to automatically light up with proper brightness by recognizing the presence of pedestrians or vehicles and reversely dimming the streetlight in their absence by semantic image segmentation from video.
PDF

目录