2022-10-21 更新
Pseudo-Label Noise Suppression Techniques for Semi-Supervised Semantic Segmentation
Authors:Sebastian Scherer, Robin Schön, Rainer Lienhart
Semi-supervised learning (SSL) can reduce the need for large labelled datasets by incorporating unlabelled data into the training. This is particularly interesting for semantic segmentation, where labelling data is very costly and time-consuming. Current SSL approaches use an initially supervised trained model to generate predictions for unlabelled images, called pseudo-labels, which are subsequently used for training a new model from scratch. Since the predictions usually do not come from an error-free neural network, they are naturally full of errors. However, training with partially incorrect labels often reduce the final model performance. Thus, it is crucial to manage errors/noise of pseudo-labels wisely. In this work, we use three mechanisms to control pseudo-label noise and errors: (1) We construct a solid base framework by mixing images with cow-patterns on unlabelled images to reduce the negative impact of wrong pseudo-labels. Nevertheless, wrong pseudo-labels still have a negative impact on the performance. Therefore, (2) we propose a simple and effective loss weighting scheme for pseudo-labels defined by the feedback of the model trained on these pseudo-labels. This allows us to soft-weight the pseudo-label training examples based on their determined confidence score during training. (3) We also study the common practice to ignore pseudo-labels with low confidence and empirically analyse the influence and effect of pseudo-labels with different confidence ranges on SSL and the contribution of pseudo-label filtering to the achievable performance gains. We show that our method performs superior to state of-the-art alternatives on various datasets. Furthermore, we show that our findings also transfer to other tasks such as human pose estimation. Our code is available at https://github.com/ChristmasFan/SSL_Denoising_Segmentation.
PDF Accepted to BMVC 2022
点此查看论文截图
MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection
Authors:Dehua Zheng, Xiaochen Zheng, Laurence T. Yang, Yuan Gao, Chenlu Zhu, Yiheng Ruan
Recent research about camouflaged object detection (COD) aims to segment highly concealed objects hidden in complex surroundings. The tiny, fuzzy camouflaged objects result in visually indistinguishable properties. However, current single-view COD detectors are sensitive to background distractors. Therefore, blurred boundaries and variable shapes of the camouflaged objects are challenging to be fully captured with a single-view detector. To overcome these obstacles, we propose a behavior-inspired framework, called Multi-view Feature Fusion Network (MFFN), which mimics the human behaviors of finding indistinct objects in images, i.e., observing from multiple angles, distances, perspectives. Specifically, the key idea behind it is to generate multiple ways of observation (multi-view) by data augmentation and apply them as inputs. MFFN captures critical boundary and semantic information by comparing and fusing extracted multi-view features. In addition, our MFFN exploits the dependence and interaction between views and channels. Specifically, our methods leverage the complementary information between different views through a two-stage attention module called Co-attention of Multi-view (CAMV). And we design a local-overall module called Channel Fusion Unit (CFU) to explore the channel-wise contextual clues of diverse feature maps in an iterative manner. The experiment results show that our method performs favorably against existing state-of-the-art methods via training with the same data. The code will be available at https://github.com/dwardzheng/MFFN_COD.
PDF In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
点此查看论文截图
p$^3$VAE: a physics-integrated generative model. Application to the semantic segmentation of optical remote sensing images
Authors:Romain Thoreau, Laurent Risser, Véronique Achard, Béatrice Berthelot, Xavier Briottet
The combination of machine learning models with physical models is a recent research path to learn robust data representations. In this paper, we introduce p$^3$VAE, a generative model that integrates a perfect physical model which partially explains the true underlying factors of variation in the data. To fully leverage our hybrid design, we propose a semi-supervised optimization procedure and an inference scheme that comes along meaningful uncertainty estimates. We apply p$^3$VAE to the semantic segmentation of high-resolution hyperspectral remote sensing images. Our experiments on a simulated data set demonstrated the benefits of our hybrid model against conventional machine learning models in terms of extrapolation capabilities and interpretability. In particular, we show that p$^3$VAE naturally has high disentanglement capabilities. Our code and data have been made publicly available at https://github.com/Romain3Ch216/p3VAE.
PDF 21 pages, 11 figures, submitted to the International Journal of Computer Vision
点此查看论文截图
MGTUNet: An new UNet for colon nuclei instance segmentation and quantification
Authors:Liangrui Pan, Lian Wang, Mingting Liu, Zhujun Xu, Liwen Xu, Shaoliang Peng
Colorectal cancer (CRC) is among the top three malignant tumor types in terms of morbidity and mortality. Histopathological images are the gold standard for diagnosing colon cancer. Cellular nuclei instance segmentation and classification, and nuclear component regression tasks can aid in the analysis of the tumor microenvironment in colon tissue. Traditional methods are still unable to handle both types of tasks end-to-end at the same time, and have poor prediction accuracy and high application costs. This paper proposes a new UNet model for handling nuclei based on the UNet framework, called MGTUNet, which uses Mish, Group normalization and transposed convolution layer to improve the segmentation model, and a ranger optimizer to adjust the SmoothL1Loss values. Secondly, it uses different channels to segment and classify different types of nucleus, ultimately completing the nuclei instance segmentation and classification task, and the nuclei component regression task simultaneously. Finally, we did extensive comparison experiments using eight segmentation models. By comparing the three evaluation metrics and the parameter sizes of the models, MGTUNet obtained 0.6254 on PQ, 0.6359 on mPQ, and 0.8695 on R2. Thus, the experiments demonstrated that MGTUNet is now a state-of-the-art method for quantifying histopathological images of colon cancer.
PDF Accepted in BIBM2022(regular paper)
点此查看论文截图
MOVE: Unsupervised Movable Object Segmentation and Detection
Authors:Adam Bielski, Paolo Favaro
We introduce MOVE, a novel method to segment objects without any form of supervision. MOVE exploits the fact that foreground objects can be shifted locally relative to their initial position and result in realistic (undistorted) new images. This property allows us to train a segmentation model on a dataset of images without annotation and to achieve state of the art (SotA) performance on several evaluation datasets for unsupervised salient object detection and segmentation. In unsupervised single object discovery, MOVE gives an average CorLoc improvement of 7.2% over the SotA, and in unsupervised class-agnostic object detection it gives a relative AP improvement of 53% on average. Our approach is built on top of self-supervised features (e.g. from DINO or MAE), an inpainting network (based on the Masked AutoEncoder) and adversarial training.
PDF 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
点此查看论文截图
Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection
Authors:Pei Lv, Suqi Hu, Tianran Hao
Weakly supervised object detection (WSOD) has attracted more and more attention since it only uses image-level labels and can save huge annotation costs. Most of the WSOD methods use Multiple Instance Learning (MIL) as their basic framework, which regard it as an instance classification problem. However, these methods based on MIL tends to converge only on the most discriminate regions of different instances, rather than their corresponding complete regions, that is, insufficient integrity. Inspired by the habit of observing things by the human, we propose a new method by comparing the initial proposals and the extension ones to optimize those initial proposals. Specifically, we propose one new strategy for WSOD by involving contrastive proposal extension (CPE), which consists of multiple directional contrastive proposal extensions (D-CPE), and each D-CPE contains encoders based on LSTM network and corresponding decoders. Firstly, the boundary of initial proposals in MIL is extended to different positions according to well-designed sequential order. Then, CPE compares the extended proposal and the initial proposal by extracting the feature semantics of them using the encoders, and calculates the integrity of the initial proposal to optimize the score of the initial proposal. These contrastive contextual semantics will guide the basic WSOD to suppress bad proposals and improve the scores of good ones. In addition, a simple two-stream network is designed as the decoder to constrain the temporal coding of LSTM and improve the performance of WSOD further. Experiments on PASCAL VOC 2007, VOC 2012 and MS-COCO datasets show that our method has achieved the state-of-the-art results.
PDF 15 pages,12 figures, accepted to IEEE Transactions on Image Processing
点此查看论文截图
VITA: Video Instance Segmentation via Object Token Association
Authors:Miran Heo, Sukjun Hwang, Seoung Wug Oh, Joon-Young Lee, Seon Joo Kim
We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the entire sequence. To this end, we propose VITA, a simple structure built on top of an off-the-shelf Transformer-based image instance segmentation model. Specifically, we use an image object detector as a means of distilling object-specific contexts into object tokens. VITA accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features. By effectively building relationships between objects using the condensed information, VITA achieves the state-of-the-art on VIS benchmarks with a ResNet-50 backbone: 49.8 AP, 45.7 AP on YouTube-VIS 2019 & 2021, and 19.6 AP on OVIS. Moreover, thanks to its object token-based structure that is disjoint from the backbone features, VITA shows several practical advantages that previous offline VIS methods have not explored - handling long and high-resolution videos with a common GPU, and freezing a frame-level detector trained on image domain. Code is available at https://github.com/sukjunhwang/VITA.
PDF