2022-06-26 更新
Occlusion-Resistant Instance Segmentation of Piglets in Farrowing Pens Using Center Clustering Network
Authors:Endai Huang, Axiu Mao, Yongjian Wu, Haiming Gan, Maria Camila Ceballos, Thomas D. Parsons, Junhui Hou, Kai Liu
Computer vision enables the development of new approaches to monitor the behavior, health, and welfare of animals. Instance segmentation is a high-precision method in computer vision for detecting individual animals of interest. This method can be used for in-depth analysis of animals, such as examining their subtle interactive behaviors, from videos and images. However, existing deep-learning-based instance segmentation methods have been mostly developed based on public datasets, which largely omit heavy occlusion problems; therefore, these methods have limitations in real-world applications involving object occlusions, such as farrowing pen systems used on pig farms in which the farrowing crates often impede the sow and piglets. In this paper, we propose a novel occlusion-resistant Center Clustering Network for instance segmentation, dubbed as CClusnet-Inseg. Specifically, CClusnet-Inseg uses each pixel to predict object centers and trace these centers to form masks based on clustering results, which consists of a network for segmentation and center offset vector map, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, Centers-to-Mask (C2M) and Remain-Centers-to-Mask (RC2M) algorithms, and a pseudo-occlusion generator (POG). In all, 4,600 images were extracted from six videos collected from six farrowing pens to train and validate our method. CClusnet-Inseg achieves a mean average precision (mAP) of 83.6; it outperformed YOLACT++ and Mask R-CNN, which had mAP values of 81.2 and 74.7, respectively. We conduct comprehensive ablation studies to demonstrate the advantages and effectiveness of core modules of our method. In addition, we apply CClusnet-Inseg to multi-object tracking for animal monitoring, and the predicted object center that is a conjunct output could serve as an occlusion-resistant representation of the location of an object.
PDF Accepted in CV4Animals Workshop of CVPR 2022 (IJCV journal track)
论文截图
Efficient textual explanations for complex road and traffic scenarios based on semantic segmentation
Authors:Yiyue Zhao, Xinyu Yun, Chen Chai, Zhiyu Liu, Wenxuan Fan, Xiao Luo
The complex driving environment brings great challenges to the visual perception of autonomous vehicles. It’s essential to extract clear and explainable information from the complex road and traffic scenarios and offer clues to decision and control. However, the previous scene explanation had been implemented as a separate model. The black box model makes it difficult to interpret the driving environment. It cannot detect comprehensive textual information and requires a high computational load and time consumption. Thus, this study proposed a comprehensive and efficient textual explanation model. From 336k video frames of the driving environment, critical images of complex road and traffic scenarios were selected into a dataset. Through transfer learning, this study established an accurate and efficient segmentation model to obtain the critical traffic elements in the environment. Based on the XGBoost algorithm, a comprehensive model was developed. The model provided textual information about states of traffic elements, the motion of conflict objects, and scenario complexity. The approach was verified on the real-world road. It improved the perception accuracy of critical traffic elements to 78.8%. The time consumption reached 13 minutes for each epoch, which was 11.5 times more efficient than the pre-trained network. The textual information analyzed from the model was also accordant with reality. The findings offer clear and explainable information about the complex driving environment, which lays a foundation for subsequent decision and control. It can improve the visual perception ability and enrich the prior knowledge and judgments of complex traffic situations.
PDF
论文截图
VPIT: Real-time Embedded Single Object 3D Tracking Using Voxel Pseudo Images
Authors:Illia Oleksiienko, Paraskevi Nousi, Nikolaos Passalis, Anastasios Tefas, Alexandros Iosifidis
In this paper, we propose a novel voxel-based 3D single object tracking (3D SOT) method called Voxel Pseudo Image Tracking (VPIT). VPIT is the first method that uses voxel pseudo images for 3D SOT. The input point cloud is structured by pillar-based voxelization, and the resulting pseudo image is used as an input to a 2D-like Siamese SOT method. The pseudo image is created in the Bird’s-eye View (BEV) coordinates, and therefore the objects in it have constant size. Thus, only the object rotation can change in the new coordinate system and not the object scale. For this reason, we replace multi-scale search with a multi-rotation search, where differently rotated search regions are compared against a single target representation to predict both position and rotation of the object. Experiments on KITTI Tracking dataset show that VPIT is the fastest 3D SOT method and maintains competitive Success and Precision values. Application of a SOT method in a real-world scenario meets with limitations such as lower computational capabilities of embedded devices and a latency-unforgiving environment, where the method is forced to skip certain data frames if the inference speed is not high enough. We implement a real-time evaluation protocol and show that other methods lose most of their performance on embedded devices, while VPIT maintains its ability to track the object.
PDF 10 pages, 5 figures, 4 tables. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
论文截图
Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation
Authors:Ming Li, Jie Wu, Jinhang Cai, Jie Qin, Yuxi Ren, Xuefeng Xiao, Min Zheng, Rui Wang, Xin Pan
Recently, Synthetic data-based Instance Segmentation has become an exceedingly favorable optimization paradigm since it leverages simulation rendering and physics to generate high-quality image-annotation pairs. In this paper, we propose a Parallel Pre-trained Transformers (PPT) framework to accomplish the synthetic data-based Instance Segmentation task. Specifically, we leverage the off-the-shelf pre-trained vision Transformers to alleviate the gap between natural and synthetic data, which helps to provide good generalization in the downstream synthetic data scene with few samples. Swin-B-based CBNet V2, SwinL-based CBNet V2 and Swin-L-based Uniformer are employed for parallel feature learning, and the results of these three models are fused by pixel-level Non-maximum Suppression (NMS) algorithm to obtain more robust results. The experimental results reveal that PPT ranks first in the CVPR2022 AVA Accessibility Vision and Autonomy Challenge, with a 65.155% mAP.
PDF The solution of 1st Place in AVA Accessibility Vision and Autonomy Challenge on CVPR 2022 workshop. Website: https://accessibility-cv.github.io/
论文截图
Object Occlusion of Adding New Categories in Objection Detection
Authors:Boyang Deng, Meiyan Lin, Shoulun Long
Building instance detection models that are data efficient and can handle rare object categories is an important challenge in computer vision. But data collection methods and metrics are lack of research towards real scenarios application using neural network. Here, we perform a systematic study of the Object Occlusion data collection and augmentation methods where we imitate object occlusion relationship in target scenarios. However, we find that the simple mechanism of object occlusion is good enough and can provide acceptable accuracy in real scenarios adding new category. We illustate that only adding 15 images of new category in a half million training dataset with hundreds categories, can give this new category 95% accuracy in unseen test dataset including thousands of images of this category.
PDF
论文截图
Depth-Adapted CNNs for RGB-D Semantic Segmentation
Authors:Zongwei Wu, Guillaume Allibert, Christophe Stolz, Chao Ma, Cédric Demonceaux
Recent RGB-D semantic segmentation has motivated research interest thanks to the accessibility of complementary modalities from the input side. Existing works often adopt a two-stream architecture that processes photometric and geometric information in parallel, with few methods explicitly leveraging the contribution of depth cues to adjust the sampling position on RGB images. In this paper, we propose a novel framework to incorporate the depth information in the RGB convolutional neural network (CNN), termed Z-ACN (Depth-Adapted CNN). Specifically, our Z-ACN generates a 2D depth-adapted offset which is fully constrained by low-level features to guide the feature extraction on RGB images. With the generated offset, we introduce two intuitive and effective operations to replace basic CNN operators: depth-adapted convolution and depth-adapted average pooling. Extensive experiments on both indoor and outdoor semantic segmentation tasks demonstrate the effectiveness of our approach.
PDF
论文截图
DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection
Authors:Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Laurent Itti, Vibhav Vineet
Object cut-and-paste has become a promising approach to efficiently generate large sets of labeled training data. It involves compositing foreground object masks onto background images. The background images, when congruent with the objects, provide helpful context information for training object recognition models. While the approach can easily generate large labeled data, finding congruent context images for downstream tasks has remained an elusive problem. In this work, we propose a new paradigm for automatic context image generation at scale. At the core of our approach lies utilizing an interplay between language description of context and language-driven image generation. Language description of a context is provided by applying an image captioning method on a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images using the language-based DALL-E image generation framework. These are then composited with objects to provide an augmented training set for a classifier. We demonstrate the advantages of our approach over the prior context image generation approaches on four object detection datasets. Furthermore, we also highlight the compositional nature of our data generation approach on out-of-distribution and zero-shot data generation scenarios.
PDF 28 pages (including appendix), 13 figures
论文截图
Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation
Authors:Wouter Van Gansbeke, Simon Vandenhende, Luc Van Gool
The task of unsupervised semantic segmentation aims to cluster pixels into semantically meaningful groups. Specifically, pixels assigned to the same cluster should share high-level semantic properties like their object or part category. This paper presents MaskDistill: a novel framework for unsupervised semantic segmentation based on three key ideas. First, we advocate a data-driven strategy to generate object masks that serve as a pixel grouping prior for semantic segmentation. This approach omits handcrafted priors, which are often designed for specific scene compositions and limit the applicability of competing frameworks. Second, MaskDistill clusters the object masks to obtain pseudo-ground-truth for training an initial object segmentation model. Third, we leverage this model to filter out low-quality object masks. This strategy mitigates the noise in our pixel grouping prior and results in a clean collection of masks which we use to train a final segmentation model. By combining these components, we can considerably outperform previous works for unsupervised semantic segmentation on PASCAL (+11% mIoU) and COCO (+4% mask AP50). Interestingly, as opposed to existing approaches, our framework does not latch onto low-level image cues and is not limited to object-centric datasets. The code and models will be made available.
PDF Code: https://github.com/wvangansbeke/MaskDistill
论文截图
Simple and Efficient Architectures for Semantic Segmentation
Authors:Dushyant Mehta, Andrii Skliar, Haitam Ben Yahia, Shubhankar Borse, Fatih Porikli, Amirhossein Habibian, Tijmen Blankevoort
Though the state-of-the architectures for semantic segmentation, such as HRNet, demonstrate impressive accuracy, the complexity arising from their salient design choices hinders a range of model acceleration tools, and further they make use of operations that are inefficient on current hardware. This paper demonstrates that a simple encoder-decoder architecture with a ResNet-like backbone and a small multi-scale head, performs on-par or better than complex semantic segmentation architectures such as HRNet, FANet and DDRNets. Naively applying deep backbones designed for Image Classification to the task of Semantic Segmentation leads to sub-par results, owing to a much smaller effective receptive field of these backbones. Implicit among the various design choices put forth in works like HRNet, DDRNet, and FANet are networks with a large effective receptive field. It is natural to ask if a simple encoder-decoder architecture would compare favorably if comprised of backbones that have a larger effective receptive field, though without the use of inefficient operations like dilated convolutions. We show that with minor and inexpensive modifications to ResNets, enlarging the receptive field, very simple and competitive baselines can be created for Semantic Segmentation. We present a family of such simple architectures for desktop as well as mobile targets, which match or exceed the performance of complex models on the Cityscapes dataset. We hope that our work provides simple yet effective baselines for practitioners to develop efficient semantic segmentation models.
PDF To be presented at Efficient Deep Learning for Computer Vision Workshop at CVPR 2022
论文截图
Confidence Calibration for Object Detection and Segmentation
Authors:Fabian Küppers, Anselm Haselhoff, Jan Kronenberger, Jonas Schneider
Calibrated confidence estimates obtained from neural networks are crucial, particularly for safety-critical applications such as autonomous driving or medical image diagnosis. However, although the task of confidence calibration has been investigated on classification problems, thorough investigations on object detection and segmentation problems are still missing. Therefore, we focus on the investigation of confidence calibration for object detection and segmentation models in this chapter. We introduce the concept of multivariate confidence calibration that is an extension of well-known calibration methods to the task of object detection and segmentation. This allows for an extended confidence calibration that is also aware of additional features such as bounding box/pixel position, shape information, etc. Furthermore, we extend the expected calibration error (ECE) to measure miscalibration of object detection and segmentation models. We examine several network architectures on MS COCO as well as on Cityscapes and show that especially object detection as well as instance segmentation models are intrinsically miscalibrated given the introduced definition of calibration. Using our proposed calibration methods, we have been able to improve calibration so that it also has a positive impact on the quality of segmentation masks as well.
PDF Book chapter in: Tim Fingerscheidt, Hanno Gottschalk, Sebastian Houben (eds.): “Deep Neural Networks and Data for Automated Driving”, pp. 225—250, Springer Nature, Switzerland, 2022
论文截图
Access Control of Semantic Segmentation Models Using Encrypted Feature Maps
Authors:Hiroki Ito, AprilPyone MaungMaung, Sayaka Shiota, Hitoshi Kiya
In this paper, we propose an access control method with a secret key for semantic segmentation models for the first time so that unauthorized users without a secret key cannot benefit from the performance of trained models. The method enables us not only to provide a high segmentation performance to authorized users but to also degrade the performance for unauthorized users. We first point out that, for the application of semantic segmentation, conventional access control methods which use encrypted images for classification tasks are not directly applicable due to performance degradation. Accordingly, in this paper, selected feature maps are encrypted with a secret key for training and testing models, instead of input images. In an experiment, the protected models allowed authorized users to obtain almost the same performance as that of non-protected models but also with robustness against unauthorized access without a key.
PDF
论文截图
BED: A Real-Time Object Detection System for Edge Devices
Authors:Guanchu Wang, Zaid Pervaiz Bhat, Zhimeng Jiang, Yi-Wei Chen, Daochen Zha, Alfredo Costilla Reyes, Afshin Niktash, Gorkem Ulkar, Erman Okman, Xia Hu
Deploying deep neural networks~(DNNs) on edge devices provides efficient and effective solutions for the real-world tasks. Edge devices have been used for collecting a large volume of data efficiently in different domains. DNNs have been an effective tool for data processing and analysis. However, designing DNNs on edge devices is challenging due to the limited computational resources and memory. To tackle this challenge, we demonstrate Object Detection System for Edge Devices~(BED) on the MAX78000 DNN accelerator. It integrates on-device DNN inference with a camera and an LCD display for image acquisition and detection exhibition, respectively. BED is a concise, effective and detailed solution, including model training, quantization, synthesis and deployment. Experiment results indicate that BED can produce accurate detection with a 300-KB tiny DNN model, which takes only 91.9 ms of inference time and 1.845 mJ of energy.
PDF
论文截图
On Efficient Real-Time Semantic Segmentation: A Survey
Authors:Christopher J. Holder, Muhammad Shafique
Semantic segmentation is the problem of assigning a class label to every pixel in an image, and is an important component of an autonomous vehicle vision stack for facilitating scene understanding and object detection. However, many of the top performing semantic segmentation models are extremely complex and cumbersome, and as such are not suited to deployment onboard autonomous vehicle platforms where computational resources are limited and low-latency operation is a vital requirement. In this survey, we take a thorough look at the works that aim to address this misalignment with more compact and efficient models capable of deployment on low-memory embedded systems while meeting the constraint of real-time inference. We discuss several of the most prominent works in the field, placing them within a taxonomy based on their major contributions, and finally we evaluate the inference speed of the discussed models under consistent hardware and software setups that represent a typical research environment with high-end GPU and a realistic deployed scenario using low-memory embedded GPU hardware. Our experimental results demonstrate that many works are capable of real-time performance on resource-constrained hardware, while illustrating the consistent trade-off between latency and accuracy.
PDF 18 pages, 13 figures, 4 tables This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
论文截图
Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving
Authors:Peixuan Li, Jieyu Jin
While separately leveraging monocular 3D object detection and 2D multi-object tracking can be straightforwardly applied to sequence images in a frame-by-frame fashion, stand-alone tracker cuts off the transmission of the uncertainty from the 3D detector to tracking while cannot pass tracking error differentials back to the 3D detector. In this work, we propose jointly training 3D detection and 3D tracking from only monocular videos in an end-to-end manner. The key component is a novel spatial-temporal information flow module that aggregates geometric and appearance features to predict robust similarity scores across all objects in current and past frames. Specifically, we leverage the attention mechanism of the transformer, in which self-attention aggregates the spatial information in a specific frame, and cross-attention exploits relation and affinities of all objects in the temporal domain of sequence frames. The affinities are then supervised to estimate the trajectory and guide the flow of information between corresponding 3D objects. In addition, we propose a temporal -consistency loss that explicitly involves 3D target motion modeling into the learning, making the 3D trajectory smooth in the world coordinate system. Time3D achieves 21.4\% AMOTA, 13.6\% AMOTP on the nuScenes 3D tracking benchmark, surpassing all published competitors, and running at 38 FPS, while Time3D achieves 31.2\% mAP, 39.4\% NDS on the nuScenes 3D detection benchmark.
PDF Accepted to CVPR 2022
论文截图
Hybrid Instance-aware Temporal Fusion for Online Video Instance Segmentation
Authors:Xiang Li, Jinglu Wang, Xiao Li, Yan Lu
Recently, transformer-based image segmentation methods have achieved notable success against previous solutions. While for video domains, how to effectively model temporal context with the attention of object instances across frames remains an open problem. In this paper, we propose an online video instance segmentation framework with a novel instance-aware temporal fusion method. We first leverages the representation, i.e., a latent code in the global context (instance code) and CNN feature maps to represent instance- and pixel-level features. Based on this representation, we introduce a cropping-free temporal fusion approach to model the temporal consistency between video frames. Specifically, we encode global instance-specific information in the instance code and build up inter-frame contextual fusion with hybrid attentions between the instance codes and CNN feature maps. Inter-frame consistency between the instance codes are further enforced with order constraints. By leveraging the learned hybrid temporal consistency, we are able to directly retrieve and maintain instance identities across frames, eliminating the complicated frame-wise instance matching in prior methods. Extensive experiments have been conducted on popular VIS datasets, i.e. Youtube-VIS-19/21. Our model achieves the best performance among all online VIS methods. Notably, our model also eclipses all offline methods when using the ResNet-50 backbone.
PDF AAAI 2022
论文截图
BFS-Net: Weakly Supervised Cell Instance Segmentation from Bright-Field Microscopy Z-Stacks
Authors:Shervin Dehghani, Benjamin Busam, Nassir Navab, Ali Nasseri
Despite its broad availability, volumetric information acquisition from Bright-Field Microscopy (BFM) is inherently difficult due to the projective nature of the acquisition process. We investigate the prediction of 3D cell instances from a set of BFM Z-Stack images. We propose a novel two-stage weakly supervised method for volumetric instance segmentation of cells which only requires approximate cell centroids annotation. Created pseudo-labels are thereby refined with a novel refinement loss with Z-stack guidance. The evaluations show that our approach can generalize not only to BFM Z-Stack data, but to other 3D cell imaging modalities. A comparison of our pipeline against fully supervised methods indicates that the significant gain in reduced data collection and labelling results in minor performance difference.
PDF