Updated 2023-01-26
RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving
Authors: Angelika Ando, Spyros Gidaris, Andrei Bursuc, Gilles Puy, Alexandre Boulch, Renaud Marlet
Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques that use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs, but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we ask whether projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively, but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with ViTs pre-trained on large image datasets. (b) We compensate for ViTs’ lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine the low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
PDF Code at https://github.com/valeoai/rangevit
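The abstract above relies on range projection to turn a LiDAR point cloud into a 2D image but does not spell it out. The following is a minimal sketch of the commonly used spherical projection, not the RangeViT implementation: the function names `range_project` and `build_range_image`, the image size (32 x 2048), and the vertical field of view (+10° to -30°) are illustrative assumptions and should be replaced with the values of the actual sensor.

```python
import math


def range_project(point, width=2048, height=32,
                  fov_up_deg=10.0, fov_down_deg=-30.0):
    """Map one LiDAR point (x, y, z) to (row, col, range) in a range image.

    All default parameters are hypothetical; they are not taken from the paper.
    Returns None for a point located exactly at the sensor origin.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return None

    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down   # total vertical field of view

    yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
    pitch = math.asin(z / r)    # vertical angle

    # Normalise both angles to [0, 1], then scale to pixel coordinates.
    col = 0.5 * (1.0 - yaw / math.pi) * width
    row = (1.0 - (pitch - fov_down) / fov) * height

    # Clamp so that points on or beyond the field-of-view edge stay in bounds.
    col = min(width - 1, max(0, int(math.floor(col))))
    row = min(height - 1, max(0, int(math.floor(row))))
    return row, col, r


def build_range_image(points, width=2048, height=32):
    """Fill a height x width grid with ranges, keeping the closest point per pixel."""
    image = [[None] * width for _ in range(height)]
    for p in points:
        hit = range_project(p, width, height)
        if hit is None:
            continue
        row, col, r = hit
        if image[row][col] is None or r < image[row][col]:
            image[row][col] = r
    return image


if __name__ == "__main__":
    cloud = [(10.0, 0.0, 1.0), (20.0, 0.0, 2.0), (3.0, 4.0, 0.0)]
    img = build_range_image(cloud)
    filled = sum(v is not None for line in img for v in line)
    print("filled pixels:", filled)
```

Once every point has a pixel, a 2D network can predict a label per pixel, and each point can read its label back from the pixel it was projected to. Points that fall into the same pixel share that prediction, which is one reason projection-based pipelines typically add a refinement step.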
Image Memorability Prediction with Vision Transformers
Authors: Thomas Hagen, Thomas Espeseth
Behavioral studies have shown that the memorability of images is similar across groups of people, suggesting that memorability is a function of the intrinsic properties of images and is unrelated to people’s individual experiences and traits. Deep learning networks can be trained on such properties and used to predict memorability in new data sets. Convolutional neural networks (CNNs) have pioneered image memorability prediction, but more recently developed vision transformer (ViT) models may have the potential to yield even better predictions. In this paper, we present ViTMem, a new memorability model based on ViT, and compare its memorability predictions with those of state-of-the-art CNN-derived models. Results showed that ViTMem performed as well as or better than state-of-the-art models on all data sets. Additional semantic-level analyses revealed that ViTMem is particularly sensitive to the semantic content that drives memorability in images. We conclude that ViTMem provides a new step forward, and propose that ViT-derived models can replace CNNs for computational prediction of image memorability. Researchers, educators, advertisers, visual designers and other interested parties can leverage the model to improve the memorability of their image material.
PDF