2022-02-19 更新
Recent Trends in 2D Object Detection and Applications in Video Event Recognition
Authors:Prithwish Jana, Partha Pratim Mohanta
Object detection serves as a significant step in improving performance of complex downstream computer vision tasks. It has been extensively studied for many years now and current state-of-the-art 2D object detection techniques proffer superlative results even in complex images. In this chapter, we discuss the geometry-based pioneering works in object detection, followed by the recent breakthroughs that employ deep learning. Some of these use a monolithic architecture that takes a RGB image as input and passes it to a feed-forward ConvNet or vision Transformer. These methods, thereby predict class-probability and bounding-box coordinates, all in a single unified pipeline. Two-stage architectures on the other hand, first generate region proposals and then feed it to a CNN to extract features and predict object category and bounding-box. We also elaborate upon the applications of object detection in video event recognition, to achieve better fine-grained video classification performance. Further, we highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
PDF Book chapter: P Jana and PP Mohanta, Recent Trends in 2D Object Detection and Applications in Video Event Recognition, published in Advancement of Deep Learning and its Applications in Object Detection and Recognition, edited by R N Mir et al, 2022, published by River Publishers
论文截图
STURE: Spatial-Temporal Mutual Representation Learning for Robust Data Association in Online Multi-Object Tracking
Authors:Haidong Wang, Zhiyong Li, Yaping Li, Ke Nai, Ming Wen
Online multi-object tracking (MOT) is a longstanding task for computer vision and intelligent vehicle platform. At present, the main paradigm is tracking-by-detection, and the main difficulty of this paradigm is how to associate the current candidate detection with the historical tracklets. However, in the MOT scenarios, each historical tracklet is composed of an object sequence, while each candidate detection is just a flat image, which lacks the temporal features of the object sequence. The feature difference between current candidate detection and historical tracklets makes the object association much harder. Therefore, we propose a Spatial-Temporal Mutual {Representation} Learning (STURE) approach which learns spatial-temporal representations between current candidate detection and historical sequence in a mutual representation space. For the historical trackelets, the detection learning network is forced to match the representations of sequence learning network in a mutual representation space. The proposed approach is capable of extracting more distinguishing detection and sequence representations by using various designed losses in object association. As a result, spatial-temporal feature is learned mutually to reinforce the current detection features, and the feature difference can be relieved. To prove the robustness of the STURE, it is applied to the public MOT challenge benchmarks and performs well compared with various state-of-the-art online MOT trackers based on identity-preserving metrics.
PDF
论文截图
From Pixel to Patch: Synthesize Context-aware Features for Zero-shot Semantic Segmentation
Authors:Zhangxuan Gu, Siyuan Zhou, Li Niu, Zihan Zhao, Liqing Zhang
Zero-shot learning has been actively studied for image classification task to relieve the burden of annotating image labels. Interestingly, semantic segmentation task requires more labor-intensive pixel-wise annotation, but zero-shot semantic segmentation has only attracted limited research interest. Thus, we focus on zero-shot semantic segmentation, which aims to segment unseen objects with only category-level semantic representations provided for unseen categories. In this paper, we propose a novel Context-aware feature Generation Network (CaGNet), which can synthesize context-aware pixel-wise visual features for unseen categories based on category-level semantic representations and pixel-wise contextual information. The synthesized features are used to finetune the classifier to enable segmenting unseen objects. Furthermore, we extend pixel-wise feature generation and finetuning to patch-wise feature generation and finetuning, which additionally considers inter-pixel relationship. Experimental results on Pascal-VOC, Pascal-Context, and COCO-stuff show that our method significantly outperforms the existing zero-shot semantic segmentation methods. Code is available at https://github.com/bcmi/CaGNetv2-Zero-Shot-Semantic-Segmentation.
PDF accepted by TNNLS
论文截图
SOIT: Segmenting Objects with Instance-Aware Transformers
Authors:Xiaodong Yu, Dahu Shi, Xing Wei, Ye Ren, Tingqun Ye, Wenming Tan
This paper presents an end-to-end instance segmentation framework, termed SOIT, that Segments Objects with Instance-aware Transformers. Inspired by DETR \cite{carion2020end}, our method views instance segmentation as a direct set prediction problem and effectively removes the need for many hand-crafted components like RoI cropping, one-to-many label assignment, and non-maximum suppression (NMS). In SOIT, multiple queries are learned to directly reason a set of object embeddings of semantic category, bounding-box location, and pixel-wise mask in parallel under the global image context. The class and bounding-box can be easily embedded by a fixed-length vector. The pixel-wise mask, especially, is embedded by a group of parameters to construct a lightweight instance-aware transformer. Afterward, a full-resolution mask is produced by the instance-aware transformer without involving any RoI-based operation. Overall, SOIT introduces a simple single-stage instance segmentation framework that is both RoI- and NMS-free. Experimental results on the MS COCO dataset demonstrate that SOIT outperforms state-of-the-art instance segmentation approaches significantly. Moreover, the joint learning of multiple tasks in a unified query embedding can also substantially improve the detection performance. Code is available at \url{https://github.com/yuxiaodongHRI/SOIT}.
PDF AAAI 2022
论文截图
ResiDualGAN: Resize-Residual DualGAN for Cross-Domain Remote Sensing Images Semantic Segmentation
Authors:Yang Zhao, Han Gao, Peng Guo, Zihao Sun
The performance of a semantic segmentation model for remote sensing (RS) images pretrained on an annotated dataset would greatly decrease when testing on another unannotated dataset because of the domain gap. Adversarial generative methods, e.g., DualGAN, are utilized for unpaired image-to-image translation to minimize the pixel-level domain gap, which is one of the common approaches for unsupervised domain adaptation (UDA). However, existing image translation methods are facing two problems when performing RS images translation: 1) ignoring the scale discrepancy between two RS datasets which greatly affect the accuracy performance of scale-invariant objects, 2) ignoring the characteristic of real-to-real translation of RS images which brings an unstable factor for the training of the models. In this paper, ResiDualGAN is proposed for RS images translation, where a resizer module is used for addressing the scale discrepancy of RS datasets, and a residual connection is used for strengthening the stability of real-to-real images translation and improving the performance in cross-domain semantic segmentation tasks. Combining with an output space adaptation method, the proposed method greatly improves the accuracy performance on common benchmarks, which demonstrates the superiority and reliability of ResiDuanGAN. At the end of the paper, a thorough discussion is also conducted to give a reasonable explanation for the improvement of ResiDualGAN.
PDF
论文截图
Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling
Authors:Lixiang Ru, Bo Du, Yibing Zhan, Chen Wu
Weakly-Supervised Semantic Segmentation (WSSS) methods with image-level labels generally train a classification network to generate the Class Activation Maps (CAMs) as the initial coarse segmentation labels. However, current WSSS methods still perform far from satisfactorily because their adopted CAMs 1) typically focus on partial discriminative object regions and 2) usually contain useless background regions. These two problems are attributed to the sole image-level supervision and aggregation of global information when training the classification networks. In this work, we propose the visual words learning module and hybrid pooling approach, and incorporate them in the classification network to mitigate the above problems. In the visual words learning module, we counter the first problem by enforcing the classification network to learn fine-grained visual word labels so that more object extents could be discovered. Specifically, the visual words are learned with a codebook, which could be updated via two proposed strategies, i.e. learning-based strategy and memory-bank strategy. The second drawback of CAMs is alleviated with the proposed hybrid pooling, which incorporates the global average and local discriminative information to simultaneously ensure object completeness and reduce background regions. We evaluated our methods on PASCAL VOC 2012 and MS COCO 2014 datasets. Without any extra saliency prior, our method achieved 70.6% and 70.7% mIoU on the $val$ and $test$ set of PASCAL VOC dataset, respectively, and 36.2% mIoU on the $val$ set of MS COCO dataset, which significantly surpassed the performance of state-of-the-art WSSS methods.
PDF Accepted to IJCV
论文截图
Global and Local Contrastive Self-Supervised Learning for Semantic Segmentation of HR Remote Sensing Images
Authors:Haifeng Li, Yi Li, Guo Zhang, Ruoyun Liu, Haozhe Huang, Qing Zhu, Chao Tao
Supervised learning for semantic segmentation requires a large number of labeled samples, which is difficult to obtain in the field of remote sensing. Self-supervised learning (SSL), can be used to solve such problems by pre-training a general model with a large number of unlabeled images and then fine-tuning it on a downstream task with very few labeled samples. Contrastive learning is a typical method of SSL that can learn general invariant features. However, most existing contrastive learning methods are designed for classification tasks to obtain an image-level representation, which may be suboptimal for semantic segmentation tasks requiring pixel-level discrimination. Therefore, we propose a global style and local matching contrastive learning network (GLCNet) for remote sensing image semantic segmentation. Specifically, 1) the global style contrastive learning module is used to better learn an image-level representation, as we consider that style features can better represent the overall image features. 2) The local features matching contrastive learning module is designed to learn representations of local regions, which is beneficial for semantic segmentation. The experimental results show that our method mostly outperforms SOTA self-supervised methods and the ImageNet pre-training method. Specifically, with 1\% annotation from the original dataset, our approach improves Kappa by 6\% on the ISPRS Potsdam dataset relative to the existing baseline. Moreover, our method outperforms supervised learning methods when there are some differences between the datasets of upstream tasks and downstream tasks. Since SSL could directly learn the essential characteristics of data from unlabeled data, which is easy to obtain in the remote sensing field, this may be of great significance for tasks such as global mapping. The source code is available at https://github.com/GeoX-Lab/G-RSIM.
PDF 14 pages, 13 figures, 4 tables
论文截图
Deep Learning for UAV-based Object Detection and Tracking: A Survey
Authors:Xin Wu, Wei Li, Danfeng Hong, Ran Tao, Qian Du
Owing to effective and flexible data acquisition, unmanned aerial vehicle (UAV) has recently become a hotspot across the fields of computer vision (CV) and remote sensing (RS). Inspired by recent success of deep learning (DL), many advanced object detection and tracking approaches have been widely applied to various UAV-related tasks, such as environmental monitoring, precision agriculture, traffic management. This paper provides a comprehensive survey on the research progress and prospects of DL-based UAV object detection and tracking methods. More specifically, we first outline the challenges, statistics of existing methods, and provide solutions from the perspectives of DL-based models in three research topics: object detection from the image, object detection from the video, and object tracking from the video. Open datasets related to UAV-dominated object detection and tracking are exhausted, and four benchmark datasets are employed for performance evaluation using some state-of-the-art methods. Finally, prospects and considerations for the future work are discussed and summarized. It is expected that this survey can facilitate those researchers who come from remote sensing field with an overview of DL-based UAV object detection and tracking methods, along with some thoughts on their further developments.
PDF
论文截图
Synthesis in Style: Semantic Segmentation of Historical Documents using Synthetic Data
Authors:Christian Bartz, Hendrik Rätz, Jona Otholt, Christoph Meinel, Haojin Yang
One of the most pressing problems in the automated analysis of historical documents is the availability of annotated training data. The problem is that labeling samples is a time-consuming task because it requires human expertise and thus, cannot be automated well. In this work, we propose a novel method to construct synthetic labeled datasets for historical documents where no annotations are available. We train a StyleGAN model to synthesize document images that capture the core features of the original documents. While originally, the StyleGAN architecture was not intended to produce labels, it indirectly learns the underlying semantics to generate realistic images. Using our approach, we can extract the semantic information from the intermediate feature maps and use it to generate ground truth labels. To investigate if our synthetic dataset can be used to segment the text in historical documents, we use it to train multiple supervised segmentation models and evaluate their performance. We also train these models on another dataset created by a state-of-the-art synthesis approach to show that the models trained on our dataset achieve better results while requiring even less human annotation effort.
PDF Code available at: https://github.com/hendraet/synthesis-in-style
论文截图
Sparse Object-level Supervision for Instance Segmentation with Pixel Embeddings
Authors:Adrian Wolny, Qin Yu, Constantin Pape, Anna Kreshuk
Most state-of-the-art instance segmentation methods have to be trained on densely annotated images. While difficult in general, this requirement is especially daunting for biomedical images, where domain expertise is often required for annotation and no large public data collections are available for pre-training. We propose to address the dense annotation bottleneck by introducing a proposal-free segmentation approach based on non-spatial embeddings, which exploits the structure of the learned embedding space to extract individual instances in a differentiable way. The segmentation loss can then be applied directly to instances and the overall pipeline can be trained in a fully- or weakly supervised manner, including the challenging case of positive-unlabeled supervision, where a novel self-supervised consistency loss is introduced for the unlabeled parts of the training data. We evaluate the proposed method on 2D and 3D segmentation problems in different microscopy modalities as well as on the Cityscapes and CVPPP instance segmentation benchmarks, achieving state-of-the-art results on the latter. The code is available at: https://github.com/kreshuklab/spoco
PDF
论文截图
MuSCLe: A Multi-Strategy Contrastive Learning Framework for Weakly Supervised Semantic Segmentation
Authors:Kunhao Yuan, Gerald Schaefer, Yu-Kun Lai, Yifan Wang, Xiyao Liu, Lin Guan, Hui Fang
Weakly supervised semantic segmentation (WSSS) has gained significant popularity since it relies only on weak labels such as image level annotations rather than pixel level annotations required by supervised semantic segmentation (SSS) methods. Despite drastically reduced annotation costs, typical feature representations learned from WSSS are only representative of some salient parts of objects and less reliable compared to SSS due to the weak guidance during training. In this paper, we propose a novel Multi-Strategy Contrastive Learning (MuSCLe) framework to obtain enhanced feature representations and improve WSSS performance by exploiting similarity and dissimilarity of contrastive sample pairs at image, region, pixel and object boundary levels. Extensive experiments demonstrate the effectiveness of our method and show that MuSCLe outperforms the current state-of-the-art on the widely used PASCAL VOC 2012 dataset.
PDF
论文截图
The devil is in the labels: Semantic segmentation from sentences
Authors:Wei Yin, Yifan Liu, Chunhua Shen, Anton van den Hengel, Baichuan Sun
We propose an approach to semantic segmentation that achieves state-of-the-art supervised performance when applied in a zero-shot setting. It thus achieves results equivalent to those of the supervised methods, on each of the major semantic segmentation datasets, without training on those datasets. This is achieved by replacing each class label with a vector-valued embedding of a short paragraph that describes the class. The generality and simplicity of this approach enables merging multiple datasets from different domains, each with varying class labels and semantics. The resulting merged semantic segmentation dataset of over 2 Million images enables training a model that achieves performance equal to that of state-of-the-art supervised methods on 7 benchmark datasets, despite not using any images therefrom. By fine-tuning the model on standard semantic segmentation datasets, we also achieve a significant improvement over the state-of-the-art supervised segmentation on NYUD-V2 and PASCAL-context at 60% and 65% mIoU, respectively. Based on the closeness of language embeddings, our method can even segment unseen labels. Extensive experiments demonstrate strong generalization to unseen image domains and unseen labels, and that the method enables impressive performance improvements in downstream applications, including depth estimation and instance segmentation.
PDF 18 pages
论文截图
Consistency-Regularized Region-Growing Network for Semantic Segmentation of Urban Scenes with Point-Level Annotations
Authors:Yonghao Xu, Pedram Ghamisi
Deep learning algorithms have obtained great success in semantic segmentation of very high-resolution (VHR) images. Nevertheless, training these models generally requires a large amount of accurate pixel-wise annotations, which is very laborious and time-consuming to collect. To reduce the annotation burden, this paper proposes a consistency-regularized region-growing network (CRGNet) to achieve semantic segmentation of VHR images with point-level annotations. The key idea of CRGNet is to iteratively select unlabeled pixels with high confidence to expand the annotated area from the original sparse points. However, since there may exist some errors and noises in the expanded annotations, directly learning from them may mislead the training of the network. To this end, we further propose the consistency regularization strategy, where a base classifier and an expanded classifier are employed. Specifically, the base classifier is supervised by the original sparse annotations, while the expanded classifier aims to learn from the expanded annotations generated by the base classifier with the region-growing mechanism. The consistency regularization is thereby achieved by minimizing the discrepancy between the predictions from both the base and the expanded classifiers. We find such a simple regularization strategy is yet very useful to control the quality of the region-growing mechanism. Extensive experiments on two benchmark datasets demonstrate that the proposed CRGNet significantly outperforms the existing state-of-the-art methods. Codes and pre-trained models will be available online.
PDF
论文截图
Random Ferns for Semantic Segmentation of PolSAR Images
Authors:Pengchao Wei, Ronny Hänsch
Random Ferns — as a less known example of Ensemble Learning — have been successfully applied in many Computer Vision applications ranging from keypoint matching to object detection. This paper extends the Random Fern framework to the semantic segmentation of polarimetric synthetic aperture radar images. By using internal projections that are defined over the space of Hermitian matrices, the proposed classifier can be directly applied to the polarimetric covariance matrices without the need to explicitly compute predefined image features. Furthermore, two distinct optimization strategies are proposed: The first based on pre-selection and grouping of internal binary features before the creation of the classifier; and the second based on iteratively improving the properties of a given Random Fern. Both strategies are able to boost the performance by filtering features that are either redundant or have a low information content and by grouping correlated features to best fulfill the independence assumptions made by the Random Fern classifier. Experiments show that results can be achieved that are similar to a more complex Random Forest model and competitive to a deep learning baseline.
PDF This is the author’s version of the article as accepted for publication in IEEE Transactions on Geoscience and Remote Sensing, 2021. Link to original: https://ieeexplore.ieee.org/document/9627989
论文截图
Implicit Feature Refinement for Instance Segmentation
Authors:Lufan Ma, Tiancai Wang, Bin Dong, Jiangpeng Yan, Xiu Li, Xiangyu Zhang
We propose a novel implicit feature refinement module for high-quality instance segmentation. Existing image/video instance segmentation methods rely on explicitly stacked convolutions to refine instance features before the final prediction. In this paper, we first give an empirical comparison of different refinement strategies,which reveals that the widely-used four consecutive convolutions are not necessary. As an alternative, weight-sharing convolution blocks provides competitive performance. When such block is iterated for infinite times, the block output will eventually convergeto an equilibrium state. Based on this observation, the implicit feature refinement (IFR) is developed by constructing an implicit function. The equilibrium state of instance features can be obtained by fixed-point iteration via a simulated infinite-depth network. Our IFR enjoys several advantages: 1) simulates an infinite-depth refinement network while only requiring parameters of single residual block; 2) produces high-level equilibrium instance features of global receptive field; 3) serves as a plug-and-play general module easily extended to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks show that our IFR achieves improved performance on state-of-the-art image/video instance segmentation frameworks, while reducing the parameter burden (e.g.1% AP improvement on Mask R-CNN with only 30.0% parameters in mask head). Code is made available at https://github.com/lufanma/IFR.git
PDF Published at ACM MM 2021. Code is available at https://github.com/lufanma/IFR.git
论文截图
GiraffeDet: A Heavy-Neck Paradigm for Object Detection
Authors:Yiqi Jiang, Zhiyu Tan, Junyan Wang, Xiuyu Sun, Ming Lin, Hao Li
In conventional object detection frameworks, a backbone body inherited from image recognition models extracts deep latent features and then a neck module fuses these latent features to capture information at different scales. As the resolution in object detection is much larger than in image recognition, the computational cost of the backbone often dominates the total inference cost. This heavy-backbone design paradigm is mostly due to the historical legacy when transferring image recognition models to object detection rather than an end-to-end optimized design for object detection. In this work, we show that such paradigm indeed leads to sub-optimal object detection models. To this end, we propose a novel heavy-neck paradigm, GiraffeDet, a giraffe-like network for efficient object detection. The GiraffeDet uses an extremely lightweight backbone and a very deep and large neck module which encourages dense information exchange among different spatial scales as well as different levels of latent semantics simultaneously. This design paradigm allows detectors to process the high-level semantic information and low-level spatial information at the same priority even in the early stage of the network, making it more effective in detection tasks. Numerical evaluations on multiple popular object detection benchmarks show that GiraffeDet consistently outperforms previous SOTA models across a wide spectrum of resource constraints.
PDF
论文截图
Mask2Former for Video Instance Segmentation
Authors:Bowen Cheng, Anwesa Choudhuri, Ishan Misra, Alexander Kirillov, Rohit Girdhar, Alexander G. Schwing
We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. We believe Mask2Former is also capable of handling video semantic and panoptic segmentation, given its versatility in image segmentation. We hope this will make state-of-the-art video segmentation research more accessible and bring more attention to designing universal image and video segmentation architectures.
PDF Code and models: https://github.com/facebookresearch/Mask2Former
论文截图
Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods
Authors:Steve Ahlswede, Nimisha Thekke-Madam, Christian Schulz, Birgit Kleinschmit, Begüm Demir
The collection of a high number of pixel-based labeled training samples for tree species identification is time consuming and costly in operational forestry applications. To address this problem, in this paper we investigate the effectiveness of explanation methods for deep neural networks in performing weakly supervised semantic segmentation using only image-level labels. Specifically, we consider four methods:i) class activation maps (CAM); ii) gradient-based CAM; iii) pixel correlation module; and iv) self-enhancing maps (SEM). We compare these methods with each other using both quantitative and qualitative measures of their segmentation accuracy, as well as their computational requirements. Experimental results obtained on an aerial image archive show that:i) considered explanation techniques are highly relevant for the identification of tree species with weak supervision; and ii) the SEM outperforms the other considered methods. The code for this paper is publicly available at https://git.tu-berlin.de/rsim/rs_wsss.
PDF 4 pages, 1 figure, submitted to IEEE Geosciences and Remote Sensing Symposium (2022)
论文截图
InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images
Authors:Anoop Cherian, Goncalo Dias Pais, Siddarth Jain, Tim K. Marks, Alan Sullivan
In this paper, we present InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. Using an analysis-by-synthesis approach, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vectors), each encoding the 3D pose of an object that is represented by a learned implicit object template. The generator has two distinct modules. The first module, the instance feature generator, uses each encoded pose to transform the implicit template into a feature map representation of each object instance. The second module, the depth image renderer, aggregates all of the single-instance feature maps output by the first module and generates a multiple-instance depth image. A discriminator distinguishes the generated multiple-instance depth images from the distribution of true depth images. To use our model for instance segmentation, we propose an instance pose encoder that learns to take in a generated depth image and reproduce the pose code vectors for all of the object instances. To evaluate our approach, we introduce a new synthetic dataset, “Insta-10”, consisting of 100,000 depth images, each with 5 instances of an object from one of 10 classes. Our experiments on Insta-10, as well as on real-world noisy depth images, show that InSeGAN achieves state-of-the-art performance, often outperforming prior methods by large margins.
PDF Accepted at ICCV 2021. Code & data @ https://www.merl.com/research/license/InSeGAN
论文截图
Few-shot semantic segmentation via mask aggregation
Authors:Wei Ao, Shunyi Zheng, Yan Meng
Few-shot semantic segmentation aims to recognize novel classes with only very few labelled data. This challenging task requires mining of the relevant relationships between the query image and the support images. Previous works have typically regarded it as a pixel-wise classification problem. Therefore, various models have been designed to explore the correlation of pixels between the query image and the support images. However, they focus only on pixel-wise correspondence and ignore the overall correlation of objects. In this paper, we introduce a mask-based classification method for addressing this problem. The mask aggregation network (MANet), which is a simple mask classification model, is proposed to simultaneously generate a fixed number of masks and their probabilities of being targets. Then, the final segmentation result is obtained by aggregating all the masks according to their locations. Experiments on both the PASCAL-5^i and COCO-20^i datasets show that our method performs comparably to the state-of-the-art pixel-based methods. This competitive performance demonstrates the potential of mask classification as an alternative baseline method in few-shot semantic segmentation. Our source code will be made available at https://github.com/TinyAway/MANet.
PDF
论文截图
DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer
Authors:Sanket Biswas, Ayan Banerjee, Josep Lladós, Umapada Pal
Understanding documents with rich layouts is an essential step towards information extraction. Business intelligence processes often require the extraction of useful semantic content from documents at a large scale for subsequent decision-making tasks. In this context, instance-level segmentation of different document objects(title, sections, figures, tables and so on) has emerged as an interesting problem for the document layout analysis community. To advance the research in this direction, we present a transformer-based model for end-to-end segmentation of complex layouts in document images. To our knowledge, this is the first work on transformer-based document segmentation. Extensive experimentation on the PubLayNet dataset shows that our model achieved comparable or better segmentation performance than the existing state-of-the-art approaches. We hope our simple and flexible framework could serve as a promising baseline for instance-level recognition tasks in document images.
PDF Submitted to International Workshop on Document Analysis Systems (DAS) 2022
论文截图
HM-Net: A Regression Network for Object Center Detection and Tracking on Wide Area Motion Imagery
Authors:Hakki Motorcu, Hasan F. Ates, H. Fatih Ugurdag, Bahadir Gunturk
Wide Area Motion Imagery (WAMI) yields high-resolution images with a large number of extremely small objects. Target objects have large spatial displacements throughout consecutive frames. This nature of WAMI images makes object tracking and detection challenging. In this paper, we present our deep neural network-based combined object detection and tracking model, namely, Heat Map Network (HM-Net). HM-Net is significantly faster than state-of-the-art frame differencing and background subtraction-based methods, without compromising detection and tracking performances. HM-Net follows the object center-based joint detection and tracking paradigm. Simple heat map-based predictions support an unlimited number of simultaneous detections. The proposed method uses two consecutive frames and the object detection heat map obtained from the previous frame as input, which helps HM-Net monitor spatio-temporal changes between frames and keeps track of previously predicted objects. Although reuse of prior object detection heat map acts as a vital feedback-based memory element, it can lead to an unintended surge of false-positive detections. To increase the robustness of the method against false positives and to eliminate low confidence detections, HM-Net employs novel feedback filters and advanced data augmentations. HM-Net outperforms state-of-the-art WAMI moving object detection and tracking methods on the WPAFB dataset with its 96.2% F1 and 94.4% mAP detection scores while achieving a 61.8% mAP tracking score on the same dataset. This performance corresponds to an improvement of 2.1% for F1, 6.1% for mAP scores on detection, and 9.5% for mAP score on tracking over the state-of-the-art.
PDF 14 pages, 13 figures
论文截图
NucMM Dataset: 3D Neuronal Nuclei Instance Segmentation at Sub-Cubic Millimeter Scale
Authors:Zudi Lin, Donglai Wei, Mariela D. Petkova, Yuelong Wu, Zergham Ahmed, Krishna Swaroop K, Silin Zou, Nils Wendt, Jonathan Boulanger-Weill, Xueying Wang, Nagaraju Dhanyasi, Ignacio Arganda-Carreras, Florian Engert, Jeff Lichtman, Hanspeter Pfister
Segmenting 3D cell nuclei from microscopy image volumes is critical for biological and clinical analysis, enabling the study of cellular expression patterns and cell lineages. However, current datasets for neuronal nuclei usually contain volumes smaller than $10^{\text{-}3}\ mm^3$ with fewer than 500 instances per volume, unable to reveal the complexity in large brain regions and restrict the investigation of neuronal structures. In this paper, we have pushed the task forward to the sub-cubic millimeter scale and curated the NucMM dataset with two fully annotated volumes: one $0.1\ mm^3$ electron microscopy (EM) volume containing nearly the entire zebrafish brain with around 170,000 nuclei; and one $0.25\ mm^3$ micro-CT (uCT) volume containing part of a mouse visual cortex with about 7,000 nuclei. With two imaging modalities and significantly increased volume size and instance numbers, we discover a great diversity of neuronal nuclei in appearance and density, introducing new challenges to the field. We also perform a statistical analysis to illustrate those challenges quantitatively. To tackle the challenges, we propose a novel hybrid-representation learning model that combines the merits of foreground mask, contour map, and signed distance transform to produce high-quality 3D masks. The benchmark comparisons on the NucMM dataset show that our proposed method significantly outperforms state-of-the-art nuclei segmentation approaches. Code and data are available at https://connectomics-bazaar.github.io/proj/nucMM/index.html.
PDF MICCAI 2021. Fix typos and update citations
论文截图
MODS — A USV-oriented object detection and obstacle segmentation benchmark
Authors:Borja Bovcon, Jon Muhovič, Duško Vranac, Dean Mozetič, Janez Perš, Matej Kristan
Small-sized unmanned surface vehicles (USV) are coastal water devices with a broad range of applications such as environmental control and surveillance. A crucial capability for autonomous operation is obstacle detection for timely reaction and collision avoidance, which has been recently explored in the context of camera-based visual scene interpretation. Owing to curated datasets, substantial advances in scene interpretation have been made in a related field of unmanned ground vehicles. However, the current maritime datasets do not adequately capture the complexity of real-world USV scenes and the evaluation protocols are not standardised, which makes cross-paper comparison of different methods difficult and hinders the progress. To address these issues, we introduce a new obstacle detection benchmark MODS, which considers two major perception tasks: maritime object detection and the more general maritime obstacle segmentation. We present a new diverse maritime evaluation dataset containing approximately 81k stereo images synchronized with an on-board IMU, with over 60k objects annotated. We propose a new obstacle segmentation performance evaluation protocol that reflects the detection accuracy in a way meaningful for practical USV navigation. Nineteen recent state-of-the-art object detection and obstacle segmentation methods are evaluated using the proposed protocol, creating a benchmark to facilitate development of the field. The proposed dataset, as well as evaluation routines, are made publicly available at vicos.si/resources.
PDF 16 pages, 15 figures. The dataset, as well as the proposed evaluation protocols, are published on our website: https://www.vicos.si/resources/
论文截图
SRT3D: A Sparse Region-Based 3D Object Tracking Approach for the Real World
Authors:Manuel Stoiber, Martin Pfanne, Klaus H. Strobl, Rudolph Triebel, Alin Albu-Schäffer
Region-based methods have become increasingly popular for model-based, monocular 3D tracking of texture-less objects in cluttered scenes. However, while they achieve state-of-the-art results, most methods are computationally expensive, requiring significant resources to run in real-time. In the following, we build on our previous work and develop SRT3D, a sparse region-based approach to 3D object tracking that bridges this gap in efficiency. Our method considers image information sparsely along so-called correspondence lines that model the probability of the object’s contour location. We thereby improve on the current state of the art and introduce smoothed step functions that consider a defined global and local uncertainty. For the resulting probabilistic formulation, a thorough analysis is provided. Finally, we use a pre-rendered sparse viewpoint model to create a joint posterior probability for the object pose. The function is maximized using second-order Newton optimization with Tikhonov regularization. During the pose estimation, we differentiate between global and local optimization, using a novel approximation for the first-order derivative employed in the Newton method. In multiple experiments, we demonstrate that the resulting algorithm improves the current state of the art both in terms of runtime and quality, performing particularly well for noisy and cluttered images encountered in the real world.
PDF Submitted to the International Journal of Computer Vision
论文截图
Scribble-based Boundary-aware Network for Weakly Supervised Salient Object Detection in Remote Sensing Images
Authors:Zhou Huang, Tian-Zhu Xiang, Huai-Xin Chen, Hang Dai
Existing CNNs-based salient object detection (SOD) heavily depends on the large-scale pixel-level annotations, which is labor-intensive, time-consuming, and expensive. By contrast, the sparse annotations become appealing to the salient object detection community. However, few efforts are devoted to learning salient object detection from sparse annotations, especially in the remote sensing field. In addition, the sparse annotation usually contains scanty information, which makes it challenging to train a well-performing model, resulting in its performance largely lagging behind the fully-supervised models. Although some SOD methods adopt some prior cues to improve the detection performance, they usually lack targeted discrimination of object boundaries and thus provide saliency maps with poor boundary localization. To this end, in this paper, we propose a novel weakly-supervised salient object detection framework to predict the saliency of remote sensing images from sparse scribble annotations. To implement it, we first construct the scribble-based remote sensing saliency dataset by relabelling an existing large-scale SOD dataset with scribbles, namely S-EOR dataset. After that, we present a novel scribble-based boundary-aware network (SBA-Net) for remote sensing salient object detection. Specifically, we design a boundary-aware module (BAM) to explore the object boundary semantics, which is explicitly supervised by the high-confidence object boundary (pseudo) labels generated by the boundary label generation (BLG) module, forcing the model to learn features that highlight the object structure and thus boosting the boundary localization of objects. Then, the boundary semantics are integrated with high-level features to guide the salient object detection under the supervision of scribble labels.
PDF 33 pages, 10 figures
论文截图
Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics
Authors:Hangjie Yuan, Mang Wang, Dong Ni, Liangpeng Xu
Human-Object Interaction (HOI) detection is an essential task to understand human-centric images from a fine-grained perspective. Although end-to-end HOI detection models thrive, their paradigm of parallel human/object detection and verb class prediction loses two-stage methods’ merit: object-guided hierarchy. The object in one HOI triplet gives direct clues to the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy. Similarity KL (SKL) loss is proposed to optimize VSM to align with the HOI dataset’s priors. To overcome the static semantic embedding problem, we propose to generate cross-modality-aware visual and semantic features by Cross-Modal Calibration (CMC). The above modules combined composes Object-guided Cross-modal Calibration Network (OCN). Experiments conducted on two popular HOI detection benchmarks demonstrate the significance of incorporating the statistical prior knowledge and produce state-of-the-art performances. More detailed analysis indicates proposed modules serve as a stronger verb predictor and a more superior method of utilizing prior knowledge. The codes are available at \url{https://github.com/JacobYuan7/OCN-HOI-Benchmark}.
PDF Accepted to AAAI2022
论文截图
Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation
Authors:Anirudh S Chakravarthy, Won-Dong Jang, Zudi Lin, Donglai Wei, Song Bai, Hanspeter Pfister
Video instance segmentation aims to detect, segment, and track objects in a video. Current approaches extend image-level segmentation algorithms to the temporal domain. However, this results in temporally inconsistent masks. In this work, we identify the mask quality due to temporal stability as a performance bottleneck. Motivated by this, we propose a video instance segmentation method that alleviates the problem due to missing detections. Since this cannot be solved simply using spatial information, we leverage temporal context using inter-frame attentions. This allows our network to refocus on missing objects using box predictions from the neighbouring frame, thereby overcoming missing detections. Our method significantly outperforms previous state-of-the-art algorithms using the Mask R-CNN backbone, by achieving 36.0% mAP on the YouTube-VIS benchmark. Additionally, our method is completely online and requires no future frames. Our code is publicly available at https://github.com/anirudh-chakravarthy/ObjProp.
PDF Accepted at CVPR RVSU Workshop 2021
论文截图
Benchmarking Deep Models for Salient Object Detection
Authors:Huajun Zhou, Yang Lin, Lingxiao Yang, Jianhuang Lai, Xiaohua Xie
In recent years, deep network-based methods have continuously refreshed state-of-the-art performance on Salient Object Detection (SOD) task. However, the performance discrepancy caused by different implementation details may conceal the real progress in this task. Making an impartial comparison is required for future researches. To meet this need, we construct a general SALient Object Detection (SALOD) benchmark to conduct a comprehensive comparison among several representative SOD methods. Specifically, we re-implement 14 representative SOD methods by using consistent settings for training. Moreover, two additional protocols are set up in our benchmark to investigate the robustness of existing methods in some limited conditions. In the first protocol, we enlarge the difference between objectness distributions of train and test sets to evaluate the robustness of these SOD methods. In the second protocol, we build multiple train subsets with different scales to validate whether these methods can extract discriminative features from only a few samples. In the above experiments, we find that existing loss functions usually specialized in some metrics but reported inferior results on the others. Therefore, we propose a novel Edge-Aware (EA) loss that promotes deep networks to learn more discriminative features by integrating both pixel- and image-level supervision signals. Experiments prove that our EA loss reports more robust performances compared to existing losses.
PDF 24 pages
论文截图
Deep Level Set for Box-supervised Instance Segmentation in Aerial Images
Authors:Wentong Li, Yijie Chen, Wenyu Liu, Jianke Zhu
Box-supervised instance segmentation has recently attracted lots of research efforts while little attention is received in aerial image domain. In contrast to the general object collections, aerial objects have large intra-class variances and inter-class similarity with complex background. Moreover, there are many tiny objects in the high-resolution satellite images. This makes the recent pairwise affinity modeling method inevitably to involve the noisy supervision with the inferior results. To tackle these problems, we propose a novel aerial instance segmentation approach, which drives the network to learn a series of level set functions for the aerial objects with only box annotations in an end-to-end fashion. Instead of learning the pairwise affinity, the level set method with the carefully designed energy functions treats the object segmentation as curve evolution, which is able to accurately recover the object’s boundaries and prevent the interference from the indistinguishable background and similar objects. The experimental results demonstrate that the proposed approach outperforms the state-of-the-art box-supervised instance segmentation methods. The source code is available at https://github.com/LiWentomng/boxlevelset.
PDF 10 pages, 5 figures
论文截图
Multi-source Pseudo-label Learning of Semantic Segmentation for the Scene Recognition of Agricultural Mobile Robots
Authors:Shigemichi Matsuzaki, Jun Miura, Hiroaki Masuzawa
This paper describes a novel method of training a semantic segmentation model for scene recognition of agricultural mobile robots exploiting publicly available datasets of outdoor scenes that are different from the target greenhouse environments. Semantic segmentation models require abundant labels given by tedious manual annotation. A method to work around it is unsupervised domain adaptation (UDA) that transfers knowledge from labeled source datasets to unlabeled target datasets. However, the effectiveness of existing methods is not well studied in adaptation between heterogeneous environments, such as urban scenes and greenhouses. In this paper, we propose a method to train a semantic segmentation model for greenhouse images without manually labeled datasets of greenhouse images. The core of our idea is to use multiple rich image datasets of different environments with segmentation labels to generate pseudo-labels for the target images to effectively transfer the knowledge from multiple sources and realize a precise training of semantic segmentation. Along with the pseudo-label generation, we introduce state-of-the-art methods to deal with noise in the pseudo-labels to further improve the performance. We demonstrate in experiments with multiple greenhouse datasets that our proposed method improves the performance compared to the single-source baselines and an existing approach.
PDF Submitted to Advanced Robotics
论文截图
Characterization of Semantic Segmentation Models on Mobile Platforms for Self-Navigation in Disaster-Struck Zones
Authors:Ryan Zelek, Hyeran Jeon
The role of unmanned vehicles for searching and localizing the victims in disaster impacted areas such as earthquake-struck zones is getting more important. Self-navigation on an earthquake zone has a unique challenge of detecting irregularly shaped obstacles such as road cracks, debris on the streets, and water puddles. In this paper, we characterize a number of state-of-the-art FCN models on mobile embedded platforms for self-navigation at these sites containing extremely irregular obstacles. We evaluate the models in terms of accuracy, performance, and energy efficiency. We present a few optimizations for our designed vision system. Lastly, we discuss the trade-offs of these models for a couple of mobile platforms that can each perform self-navigation. To enable vehicles to safely navigate earthquake-struck zones, we compiled a new annotated image database of various earthquake impacted regions that is different than traditional road damage databases. We train our database with a number of state-of-the-art semantic segmentation models in order to identify obstacles unique to earthquake-struck zones. Based on the statistics and tradeoffs, an optimal CNN model is selected for the mobile vehicular platforms, which we apply to both low-power and extremely low-power configurations of our design. To our best knowledge, this is the first study that identifies unique challenges and discusses the accuracy, performance, and energy impact of edge-based self-navigation mobile vehicles for earthquake-struck zones. Our proposed database and trained models are publicly available.
PDF 12 pages, 18 figures
论文截图
3D-FCT: Simultaneous 3D Object Detection and Tracking Using Feature Correlation
Authors:Naman Sharma, Hocksoon Lim
3D object detection using LiDAR data remains a key task for applications like autonomous driving and robotics. Unlike in the case of 2D images, LiDAR data is almost always collected over a period of time. However, most work in this area has focused on performing detection independent of the temporal domain. In this paper we present 3D-FCT, a Siamese network architecture that utilizes temporal information to simultaneously perform the related tasks of 3D object detection and tracking. The network is trained to predict the movement of an object based on the correlation features of extracted keypoints across time. Calculating correlation across keypoints only allows for real-time object detection. We further extend the multi-task objective to include a tracking regression loss. Finally, we produce high accuracy detections by linking short-term object tracklets into long term tracks based on the predicted tracks. Our proposed method is evaluated on the KITTI tracking dataset where it is shown to provide an improvement of 5.57% mAP over a state-of-the-art approach.
PDF
论文截图
Probabilistic 3D Multi-Modal, Multi-Object Tracking for Autonomous Driving
Authors:Hsu-kuang Chiu, Jie Li, Rares Ambrus, Jeannette Bohg
Multi-object tracking is an important ability for an autonomous vehicle to safely navigate a traffic scene. Current state-of-the-art follows the tracking-by-detection paradigm where existing tracks are associated with detected objects through some distance metric. The key challenges to increase tracking accuracy lie in data association and track life cycle management. We propose a probabilistic, multi-modal, multi-object tracking system consisting of different trainable modules to provide robust and data-driven tracking results. First, we learn how to fuse features from 2D images and 3D LiDAR point clouds to capture the appearance and geometric information of an object. Second, we propose to learn a metric that combines the Mahalanobis and feature distances when comparing a track and a new detection in data association. And third, we propose to learn when to initialize a track from an unmatched object detection. Through extensive quantitative and qualitative results, we show that when using the same object detectors our method outperforms state-of-the-art approaches on the NuScenes and KITTI datasets.
PDF IEEE International Conference on Robotics and Automation (ICRA) 2021
论文截图
Compensation Tracker: Reprocessing Lost Object for Multi-Object Tracking
Authors:Zhibo Zou, Junjie Huang, Ping Luo
Tracking by detection paradigm is one of the most popular object tracking methods. However, it is very dependent on the performance of the detector. When the detector has a behavior of missing detection, the tracking result will be directly affected. In this paper, we analyze the phenomenon of the lost tracking object in real-time tracking model on MOT2020 dataset. Based on simple and traditional methods, we propose a compensation tracker to further alleviate the lost tracking problem caused by missing detection. It consists of a motion compensation module and an object selection module. The proposed method not only can re-track missing tracking objects from lost objects, but also does not require additional networks so as to maintain speed-accuracy trade-off of the real-time model. Our method only needs to be embedded into the tracker to work without re-training the network. Experiments show that the compensation tracker can efficaciously improve the performance of the model and reduce identity switches. With limited costs, the compensation tracker successfully enhances the baseline tracking performance by a large margin and reaches 66% of MOTA and 67% of IDF1 on MOT2020 dataset.
PDF
论文截图
Camouflaged Instance Segmentation In-The-Wild: Dataset, Method, and Benchmark Suite
Authors:Trung-Nghia Le, Yubo Cao, Tan-Cong Nguyen, Minh-Quan Le, Khanh-Duy Nguyen, Thanh-Toan Do, Minh-Triet Tran, Tam V. Nguyen
This paper pushes the envelope on decomposing camouflaged regions in an image into meaningful components, namely, camouflaged instances. To promote the new task of camouflaged instance segmentation of in-the-wild images, we introduce a dataset, dubbed CAMO++, that extends our preliminary CAMO dataset (camouflaged object segmentation) in terms of quantity and diversity. The new dataset substantially increases the number of images with hierarchical pixel-wise ground truths. We also provide a benchmark suite for the task of camouflaged instance segmentation. In particular, we present an extensive evaluation of state-of-the-art instance segmentation methods on our newly constructed CAMO++ dataset in various scenarios. We also present a camouflage fusion learning (CFL) framework for camouflaged instance segmentation to further improve the performance of state-of-the-art methods. The dataset, model, evaluation suite, and benchmark will be made publicly available on our project page: https://sites.google.com/view/ltnghia/research/camo_plus_plus
PDF TIP acceptance. Project page: https://sites.google.com/view/ltnghia/research/camo_plus_plus
论文截图
3D Object Detection from Images for Autonomous Driving: A Survey
Authors:Xinzhu Ma, Wanli Ouyang, Andrea Simonelli, Elisa Ricci
3D object detection from images, one of the fundamental and challenging problems in autonomous driving, has received increasing attention from both industry and academia in recent years. Benefiting from the rapid development of deep learning technologies, image-based 3D detection has achieved remarkable progress. Particularly, more than 200 works have studied this problem from 2015 to 2021, encompassing a broad spectrum of theories, algorithms, and applications. However, to date no recent survey exists to collect and organize this knowledge. In this paper, we fill this gap in the literature and provide the first comprehensive survey of this novel and continuously growing research field, summarizing the most commonly used pipelines for image-based 3D detection and deeply analyzing each of their components. Additionally, we also propose two new taxonomies to organize the state-of-the-art methods into different categories, with the intent of providing a more systematic review of existing methods and facilitating fair comparisons with future works. In retrospect of what has been achieved so far, we also analyze the current challenges in the field and discuss future directions for image-based 3D detection research.
PDF
论文截图
Identification of Driver Phone Usage Violations via State-of-the-Art Object Detection with Tracking
Authors:Steven Carrell, Amir Atapour-Abarghouei
The use of mobiles phones when driving have been a major factor when it comes to road traffic incidents and the process of capturing such violations can be a laborious task. Advancements in both modern object detection frameworks and high-performance hardware has paved the way for a more automated approach when it comes to video surveillance. In this work, we propose a custom-trained state-of-the-art object detector to work with roadside cameras to capture driver phone usage without the need for human intervention. The proposed approach also addresses the issues caused by windscreen glare and introduces the steps required to remedy this. Twelve pre-trained models are fine-tuned with our custom dataset using four popular object detection methods: YOLO, SSD, Faster R-CNN, and CenterNet. Out of all the object detectors tested, the YOLO yields the highest accuracy levels of up to 96% (AP10) and frame rates of up to ~30 FPS. DeepSort object tracking algorithm is also integrated into the best-performing model to collect records of only the unique violations, and enable the proposed approach to count the number of vehicles. The proposed automated system will collect the output images of the identified violations, timestamps of each violation, and total vehicle count. Data can be accessed via a purpose-built user interface.
PDF 10 pages
论文截图
Large-scale Unsupervised Semantic Segmentation
Authors:Shanghua Gao, Zhong-Yu Li, Ming-Hsuan Yang, Ming-Ming Cheng, Junwei Han, Philip Torr
Powered by the ImageNet dataset, unsupervised learning on large-scale data has made significant advances for classification tasks. There are two major challenges to allowing such an attractive learning modality for segmentation tasks: i) a large-scale benchmark for assessing algorithms is missing; ii) unsupervised category/shape representation learning is difficult. We propose a new problem of large-scale unsupervised semantic segmentation (LUSS) with a newly created benchmark dataset to track the research progress. Based on the ImageNet dataset, we propose the ImageNet-S dataset with 1.2 million training images and 50k high-quality semantic segmentation annotations for evaluation. Our benchmark has a high data diversity and a clear task objective. We also present a simple yet effective method that works surprisingly well for LUSS. In addition, we benchmark related un/weakly/fully supervised methods accordingly, identifying the challenges and possible directions of LUSS.
PDF benchmark: https://github.com/UnsupervisedSemanticSegmentation
论文截图
Cross-Modal Object Tracking: Modality-Aware Representations and A Unified Benchmark
Authors:Chenglong Li, Tianhao Zhu, Lei Liu, Xiaonan Si, Zilin Fan, Sulan Zhai
In many visual systems, visual tracking often bases on RGB image sequences, in which some targets are invalid in low-light conditions, and tracking performance is thus affected significantly. Introducing other modalities such as depth and infrared data is an effective way to handle imaging limitations of individual sources, but multi-modal imaging platforms usually require elaborate designs and cannot be applied in many real-world applications at present. Near-infrared (NIR) imaging becomes an essential part of many surveillance cameras, whose imaging is switchable between RGB and NIR based on the light intensity. These two modalities are heterogeneous with very different visual properties and thus bring big challenges for visual tracking. However, existing works have not studied this challenging problem. In this work, we address the cross-modal object tracking problem and contribute a new video dataset, including 654 cross-modal image sequences with over 481K frames in total, and the average video length is more than 735 frames. To promote the research and development of cross-modal object tracking, we propose a new algorithm, which learns the modality-aware target representation to mitigate the appearance gap between RGB and NIR modalities in the tracking process. It is plug-and-play and could thus be flexibly embedded into different tracking frameworks. Extensive experiments on the dataset are conducted, and we demonstrate the effectiveness of the proposed algorithm in two representative tracking frameworks against 17 state-of-the-art tracking methods. We will release the dataset for free academic usage, dataset download link and code will be released soon.
PDF In Submission
论文截图
Joint 3D Object Detection and Tracking Using Spatio-Temporal Representation of Camera Image and LiDAR Point Clouds
Authors:Junho Koh, Jaekyum Kim, Jinhyuk Yoo, Yecheol Kim, Dongsuk Kum, Jun Won Choi
In this paper, we propose a new joint object detection and tracking (JoDT) framework for 3D object detection and tracking based on camera and LiDAR sensors. The proposed method, referred to as 3D DetecTrack, enables the detector and tracker to cooperate to generate a spatio-temporal representation of the camera and LiDAR data, with which 3D object detection and tracking are then performed. The detector constructs the spatio-temporal features via the weighted temporal aggregation of the spatial features obtained by the camera and LiDAR fusion. Then, the detector reconfigures the initial detection results using information from the tracklets maintained up to the previous time step. Based on the spatio-temporal features generated by the detector, the tracker associates the detected objects with previously tracked objects using a graph neural network (GNN). We devise a fully-connected GNN facilitated by a combination of rule-based edge pruning and attention-based edge gating, which exploits both spatial and temporal object contexts to improve tracking performance. The experiments conducted on both KITTI and nuScenes benchmarks demonstrate that the proposed 3D DetecTrack achieves significant improvements in both detection and tracking performances over baseline methods and achieves state-of-the-art performance among existing methods through collaboration between the detector and tracker.
PDF
论文截图
Exploring Fusion Strategies for Accurate RGBT Visual Object Tracking
Authors:Zhangyong Tang, Tianyang Xu, Hui Li, Xiao-Jun Wu, Xuefeng Zhu, Josef Kittler
We address the problem of multi-modal object tracking in video and explore various options of fusing the complementary information conveyed by the visible (RGB) and thermal infrared (TIR) modalities including pixel-level, feature-level and decision-level fusion. Specifically, different from the existing methods, paradigm of image fusion task is heeded for fusion at pixel level. Feature-level fusion is fulfilled by attention mechanism with channels excited optionally. Besides, at decision level, a novel fusion strategy is put forward since an effortless averaging configuration has shown the superiority. The effectiveness of the proposed decision-level fusion strategy owes to a number of innovative contributions, including a dynamic weighting of the RGB and TIR contributions and a linear template update operation. A variant of which produced the winning tracker at the Visual Object Tracking Challenge 2020 (VOT-RGBT2020). The concurrent exploration of innovative pixel- and feature-level fusion strategies highlights the advantages of the proposed decision-level fusion method. Extensive experimental results on three challenging datasets, \textit{i.e.}, GTOT, VOT-RGBT2019, and VOT-RGBT2020, demonstrate the effectiveness and robustness of the proposed method, compared to the state-of-the-art approaches. Code will be shared at \textcolor{blue}{\emph{https://github.com/Zhangyong-Tang/DFAT}.
PDF 13 pages, 10 figures
论文截图
ROFT: Real-Time Optical Flow-Aided 6D Object Pose and Velocity Tracking
Authors:Nicola A. Piga, Yuriy Onyshchuk, Giulia Pasquale, Ugo Pattacini, Lorenzo Natale
6D object pose tracking has been extensively studied in the robotics and computer vision communities. The most promising solutions, leveraging on deep neural networks and/or filtering and optimization, exhibit notable performance on standard benchmarks. However, to our best knowledge, these have not been tested thoroughly against fast object motions. Tracking performance in this scenario degrades significantly, especially for methods that do not achieve real-time performance and introduce non negligible delays. In this work, we introduce ROFT, a Kalman filtering approach for 6D object pose and velocity tracking from a stream of RGB-D images. By leveraging real-time optical flow, ROFT synchronizes delayed outputs of low frame rate Convolutional Neural Networks for instance segmentation and 6D object pose estimation with the RGB-D input stream to achieve fast and precise 6D object pose and velocity tracking. We test our method on a newly introduced photorealistic dataset, Fast-YCB, which comprises fast moving objects from the YCB model set, and on the dataset for object and hand pose estimation HO-3D. Results demonstrate that our approach outperforms state-of-the-art methods for 6D object pose tracking, while also providing 6D object velocity tracking. A video showing the experiments is provided as supplementary material.
PDF To cite this work, please refer to the journal reference entry. For more information, code, pictures and video please visit https://github.com/hsp-iit/roft
论文截图
Space Non-cooperative Object Active Tracking with Deep Reinforcement Learning
Authors:Dong Zhou, Guanghui Sun, Wenxiao Lei
Active visual tracking of space non-cooperative object is significant for future intelligent spacecraft to realise space debris removal, asteroid exploration, autonomous rendezvous and docking. However, existing works often consider this task into different subproblems (e.g. image preprocessing, feature extraction and matching, position and pose estimation, control law design) and optimize each module alone, which are trivial and sub-optimal. To this end, we propose an end-to-end active visual tracking method based on DQN algorithm, named as DRLAVT. It can guide the chasing spacecraft approach to arbitrary space non-cooperative target merely relied on color or RGBD images, which significantly outperforms position-based visual servoing baseline algorithm that adopts state-of-the-art 2D monocular tracker, SiamRPN. Extensive experiments implemented with diverse network architectures, different perturbations and multiple targets demonstrate the advancement and robustness of DRLAVT. In addition, We further prove our method indeed learnt the motion patterns of target with deep reinforcement learning through hundreds of trial-and-errors.
PDF
论文截图
Two stages for visual object tracking
Authors:Fei Chen, Fuhan Zhang, Xiaodong Wang
Siamese-based trackers have achived promising performance on visual object tracking tasks. Most existing Siamese-based trackers contain two separate branches for tracking, including classification branch and bounding box regression branch. In addition, image segmentation provides an alternative way to obetain the more accurate target region. In this paper, we propose a novel tracker with two-stages: detection and segmentation. The detection stage is capable of locating the target by Siamese networks. Then more accurate tracking results are obtained by segmentation module given the coarse state estimation in the first stage. We conduct experiments on four benchmarks. Our approach achieves state-of-the-art results, with the EAO of 52.6$\%$ on VOT2016, 51.3$\%$ on VOT2018, and 39.0$\%$ on VOT2019 datasets, respectively.
PDF 2021 International Conference on Intelligent Computing, Automation and Applications (ICAA)