Detection / Segmentation / Tracking


Updated 2022-03-10

A high-precision underwater object detection based on joint self-supervised deblurring and improved spatial transformer network

Authors: Xiuyuan Li, Fengchao Li, Jiangang Yu, Guowen An

Deep learning-based underwater object detection (UOD) remains a major challenge due to degraded visibility and the difficulty of obtaining enough underwater object images captured from various perspectives for training. To address these issues, this paper presents a high-precision UOD method based on a joint self-supervised deblurring and improved spatial transformer network. A self-supervised deblurring subnetwork is introduced into the designed multi-task-learning-aided object detection architecture to force the shared feature extraction module to output clean features for the detection subnetwork. To alleviate the shortage of photos from different perspectives, an improved spatial transformer network is designed based on perspective transformation, adaptively enriching image features within the network. Experimental results show that the proposed UOD approach achieves 47.9 mAP on URPC2017 and 70.3 mAP on URPC2018, outperforming many state-of-the-art UOD methods and indicating that the designed method is well suited to UOD.
PDF

[Paper screenshots]
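The paper does not link code, but the perspective-transformation STN idea can be illustrated: a small localization head predicts the 8 free parameters of a homography, which is then used to resample the feature map. Below is a minimal, hypothetical PyTorch sketch of such a module; none of the names or layer sizes come from the paper.

```python
# Hypothetical sketch of a perspective-transform STN (names/sizes assumed,
# not from the paper): a localization head predicts the 8 free parameters
# of a homography, which is used to resample the feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerspectiveSTN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(8),
            nn.Flatten(),
            nn.Linear(channels * 64, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 8),  # 8 DoF of a homography; H[2,2] is fixed to 1
        )
        # Start at the identity transform so training begins as a no-op
        nn.init.zeros_(self.loc[-1].weight)
        self.loc[-1].bias.data.copy_(
            torch.tensor([1., 0., 0., 0., 1., 0., 0., 0.]))

    def forward(self, x):
        b, _, h, w = x.shape
        theta = self.loc(x)                                    # (B, 8)
        H = torch.cat([theta, x.new_ones(b, 1)], 1).view(b, 3, 3)
        # Normalized sampling grid in [-1, 1]^2, warped by the homography
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        grid = torch.stack([xs, ys, torch.ones_like(xs)], -1)  # (H, W, 3)
        warped = grid.view(1, -1, 3) @ H.transpose(1, 2)       # (B, H*W, 3)
        # Perspective divide; clamp guards against division by ~0
        warped = warped[..., :2] / warped[..., 2:].clamp(min=1e-6)
        return F.grid_sample(x, warped.view(b, h, w, 2), align_corners=True)
```

Initializing the last layer to the identity homography means the module starts as a pass-through and only learns deviations from it, which tends to keep training stable.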

Evaluation of YOLO Models with Sliced Inference for Small Object Detection

Authors: Muhammed Can Keles, Batuhan Salmanoglu, Mehmet Serdar Guzel, Baran Gursoy, Gazi Erkan Bostanci

Small object detection has major applications in UAVs, surveillance, farming, and many other fields. In this work we investigate the performance of state-of-the-art YOLO-based object detection models for small object detection, as they are among the most popular and easiest-to-use detection models. We evaluate YOLOv5 and YOLOX models, and also investigate the effects of slicing-aided inference and of fine-tuning the models for slicing-aided inference. We use the VisDrone2019Det dataset for training and evaluation; it is challenging in that most objects are relatively small compared to the image sizes. This work aims to benchmark the YOLOv5 and YOLOX models for small object detection. We find that sliced inference increases the AP50 score in all experiments, and that this effect is greater for the YOLOv5 models than for the YOLOX models. Combining sliced fine-tuning with sliced inference produces a substantial improvement for all models. The highest AP50 score, 48.8, is achieved by the YOLOv5-Large model on the VisDrone2019Det test-dev subset.
PDF 6 pages, 5 figures

[Paper screenshots]
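The slicing-aided inference scheme being benchmarked is straightforward to sketch: tile the image with overlap, run the detector on each tile, shift the boxes back into full-image coordinates, and merge duplicates with NMS. The sketch below assumes a `detect(tile) -> (boxes, scores)` callable returning xyxy boxes, and it ignores right/bottom edge tiles when the image size is not a multiple of the stride; it is an illustration, not the paper's evaluation code.

```python
# Minimal sketch of slicing-aided inference; `detect` and its output
# format are assumptions, not the paper's actual interface.
import numpy as np

def nms(boxes, scores, iou_thr):
    """Greedy non-maximum suppression over xyxy boxes."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou < iou_thr]
    return keep

def sliced_inference(detect, image, slice_size=640, overlap=0.2, iou_thr=0.5):
    """Tile with overlap, detect per tile, shift boxes back, merge with NMS."""
    h, w = image.shape[:2]
    step = max(1, int(slice_size * (1 - overlap)))
    all_boxes, all_scores = [], []
    for y0 in range(0, max(h - slice_size, 0) + 1, step):
        for x0 in range(0, max(w - slice_size, 0) + 1, step):
            tile = image[y0:y0 + slice_size, x0:x0 + slice_size]
            boxes, scores = detect(tile)   # (N, 4) xyxy in tile coords, (N,)
            all_boxes.append(boxes + np.array([x0, y0, x0, y0]))
            all_scores.append(scores)
    boxes = np.concatenate(all_boxes)
    scores = np.concatenate(all_scores)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```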

U$^2$-Net: Going Deeper with Nested U-Structure for Salient Object Detection

Authors: Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R. Zaiane, Martin Jagersand

In this paper, we design a simple yet powerful deep network architecture, U$^2$-Net, for salient object detection (SOD). The architecture of U$^2$-Net is a two-level nested U-structure. The design has the following advantages: (1) it captures more contextual information from different scales thanks to the mixture of receptive fields of different sizes in our proposed ReSidual U-blocks (RSU); (2) it increases the depth of the whole architecture without significantly increasing the computational cost, owing to the pooling operations used in the RSU blocks. This architecture enables us to train a deep network from scratch without using backbones pretrained on image classification tasks. We instantiate two models of the proposed architecture, U$^2$-Net (176.3 MB, 30 FPS on a GTX 1080Ti GPU) and U$^2$-Net$^{\dagger}$ (4.7 MB, 40 FPS), to facilitate use in different environments. Both models achieve competitive performance on six SOD datasets. The code is available at: https://github.com/NathanUA/U-2-Net.
PDF Accepted in Pattern Recognition 2020

[Paper screenshots]
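The building block behind the nested U-structure is the ReSidual U-block: a small encoder-decoder whose output is added back to its input branch. A condensed RSU-style block in PyTorch might look like the following; the depth and channel widths are illustrative, not the released U$^2$-Net configuration.

```python
# Condensed RSU-style block; sizes are illustrative, not the released config.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, dilation=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class RSU4(nn.Module):
    """A small U-Net inside one block, with a residual connection."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.conv_in = conv_bn_relu(c_in, c_out)
        self.enc1 = conv_bn_relu(c_out, c_mid)
        self.enc2 = conv_bn_relu(c_mid, c_mid)
        self.enc3 = conv_bn_relu(c_mid, c_mid)
        self.bottom = conv_bn_relu(c_mid, c_mid, dilation=2)  # no more pooling
        self.dec3 = conv_bn_relu(c_mid * 2, c_mid)
        self.dec2 = conv_bn_relu(c_mid * 2, c_mid)
        self.dec1 = conv_bn_relu(c_mid * 2, c_out)
        self.pool = nn.MaxPool2d(2, stride=2, ceil_mode=True)

    def forward(self, x):
        def up(t, ref):  # upsample t to ref's spatial size
            return F.interpolate(t, size=ref.shape[2:], mode="bilinear",
                                 align_corners=False)
        xin = self.conv_in(x)
        e1 = self.enc1(xin)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottom(e3)                  # dilated conv widens the RF
        d3 = self.dec3(torch.cat([b, e3], 1))
        d2 = self.dec2(torch.cat([up(d3, e2), e2], 1))
        d1 = self.dec1(torch.cat([up(d2, e1), e1], 1))
        return d1 + xin                      # residual U-structure
```

Because the deepest level uses a dilated convolution instead of further pooling, the block enlarges its receptive field while keeping the computational cost of the extra depth low, which is the advantage the abstract highlights.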

A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection

Authors: Yukun Su, Jingliang Deng, Ruizhou Sun, Guosheng Lin, Qingyao Wu

Humans tend to mine objects by learning from a group of images or several frames of video, since we live in a dynamic world. In computer vision, much research focuses on co-segmentation (CoS), co-saliency detection (CoSD), and video salient object detection (VSOD) to discover co-occurring objects. However, previous approaches design separate networks for these similar tasks, which are difficult to apply to one another and lower the upper bound on the transferability of deep learning frameworks. They also fail to take full advantage of inter- and intra-image cues within a group of images. In this paper, we introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object Segmentation). Specifically, we first introduce a transformer block, which views image features as patch tokens and captures their long-range dependencies through the self-attention mechanism. This helps the network excavate the patch-level structural similarities among the relevant objects. Furthermore, we propose an intra-MLP learning module that produces self-masks to keep the network from partial activation. Extensive experiments on four CoS benchmarks (PASCAL, iCoseg, Internet, and MSRC), three CoSD benchmarks (Cosal2015, CoSOD3k, and CoCA), and four VSOD benchmarks (DAVIS16, FBMS, ViSal, and SegV2) show that our method outperforms other state-of-the-art approaches on the three tasks in both accuracy and speed using the same network architecture, reaching 140 FPS in real time.
PDF Code: https://github.com/suyukun666/UFO

[Paper screenshots]
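The core mechanism is easy to sketch: flatten the features of all images in a group into a single token sequence, so that self-attention can relate co-occurring objects across images as well as within each one. The block below is a rough PyTorch illustration under that reading of the abstract; shapes and names are assumptions, not the released code.

```python
# Rough sketch of group-level self-attention: N related images share one
# token sequence so attention can link co-occurring objects across images.
import torch
import torch.nn as nn

class GroupAttentionBlock(nn.Module):
    def __init__(self, dim, heads=8):  # dim must be divisible by heads
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, feats):
        # feats: (N, C, H, W), features of N images from the same group
        n, c, h, w = feats.shape
        # One sequence of N*H*W patch tokens for the whole group
        tokens = feats.flatten(2).transpose(1, 2).reshape(1, n * h * w, c)
        x = self.norm1(tokens)
        tokens = tokens + self.attn(x, x, x, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        # Back to per-image feature maps
        return tokens.reshape(n, h * w, c).transpose(1, 2).reshape(n, c, h, w)
```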

CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers

Authors: Huayao Liu, Jiaming Zhang, Kailun Yang, Xinxin Hu, Rainer Stiefelhagen

The performance of semantic segmentation of RGB images can be advanced by exploiting informative features from supplementary modalities. In this work, we propose CMX, a vision-transformer-based cross-modal fusion framework for RGB-X semantic segmentation. To generalize to different sensing modalities with various uncertainties, we argue that comprehensive cross-modal interactions should be provided. CMX is built with two streams to extract features from RGB images and the complementary modality (X-modality). In each feature extraction stage, we design a Cross-Modal Feature Rectification Module (CM-FRM) that calibrates the features of one modality using the features of the other, along both spatial and channel dimensions. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to mix them for the final semantic prediction. FFM is built around a cross-attention mechanism, which enables the exchange of long-range context and enhances both modalities' features at a global level. Extensive experiments show that CMX generalizes to diverse multi-modal combinations, achieving state-of-the-art performance on four RGB-Depth benchmarks as well as on RGB-Thermal and RGB-Polarization datasets. In addition, to investigate generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets a new state of the art. Code is available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation
PDF Code is available at https://github.com/huaaaliu/RGBX_Semantic_Segmentation

[Paper screenshots]
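A rough sketch of the CM-FRM idea as described in the abstract: each modality's features are recalibrated with channel and spatial attention weights derived from the other modality. The module below is an assumption-laden illustration (for instance, it shares the gating layers between the two directions), not the authors' implementation; see their repository for the real module.

```python
# Rough sketch of cross-modal feature rectification; details are assumed,
# not taken from the CMX code (e.g., gating layers are shared here).
import torch
import torch.nn as nn

class CrossModalRectify(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel gate: global pooling -> bottleneck MLP -> per-channel weights
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # Spatial gate: 7x7 conv -> per-pixel weight map
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, rgb, x_mod):
        # Weights computed from one modality rectify the other, and vice versa
        b, c, _, _ = rgb.shape
        rgb_out = rgb + rgb * self.channel(x_mod).view(b, c, 1, 1) \
                      + rgb * self.spatial(x_mod)
        x_out = x_mod + x_mod * self.channel(rgb).view(b, c, 1, 1) \
                      + x_mod * self.spatial(rgb)
        return rgb_out, x_out
```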

CEU-Net: Ensemble Semantic Segmentation of Hyperspectral Images Using Clustering

Authors: Nicholas Soucy, Salimeh Yasaei Sekeh

Most semantic segmentation approaches for hyperspectral images (HSIs) require a patching preprocessing step to accurately classify diversified land cover in remotely sensed images. Patching incorporates the rich neighborhood information in images and exploits the simplicity and segmentability of the most common HSI datasets. In contrast, most landmasses in the world consist of overlapping and diffuse classes, making neighborhood information weaker than in common HSI datasets. To combat this issue and generalize segmentation models to more complex and diverse HSI datasets, we propose our novel flagship model, Clustering Ensemble U-Net (CEU-Net). CEU-Net uses an ensemble method that combines spectral information extracted by convolutional neural networks (CNNs) trained on clusters of landscape pixels. CEU-Net outperforms existing state-of-the-art HSI semantic segmentation methods and achieves competitive performance with and without patching compared to baseline models. We highlight CEU-Net's strong performance on the Botswana, KSC, and Salinas datasets compared to the HybridSN and AeroRIT methods.
PDF 11 Pages, 5 Tables, 1 Algorithm, 5 Figures

[Paper screenshots]
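The clustering-ensemble idea can be sketched at a high level: cluster pixels by their spectra, train one sub-model per cluster, and route each test pixel to its cluster's model. The sketch below uses scikit-learn's KMeans and a hypothetical `make_subnet` factory standing in for the per-cluster CNNs; it is conceptual, not the paper's pipeline.

```python
# Conceptual sketch of a clustering-ensemble classifier; `make_subnet` is a
# hypothetical factory (any fit/predict estimator), not the paper's CNN.
import numpy as np
from sklearn.cluster import KMeans

def train_cluster_ensemble(X, y, make_subnet, k=5):
    """X: (num_pixels, num_bands) spectra; y: per-pixel class labels."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    subnets = []
    for c in range(k):
        mask = km.labels_ == c
        net = make_subnet()
        net.fit(X[mask], y[mask])      # train only on this cluster's pixels
        subnets.append(net)
    return km, subnets

def predict_cluster_ensemble(km, subnets, X):
    """Route each pixel to its cluster's sub-model and collect predictions."""
    labels = km.predict(X)
    y_pred = np.empty(len(X), dtype=int)
    for c, net in enumerate(subnets):
        mask = labels == c
        if mask.any():
            y_pred[mask] = net.predict(X[mask])
    return y_pred
```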
