Domain Adaptation


Updated 2024-04-14

Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation

Authors: Elham Amin Mansour, Ozan Unal, Suman Saha, Benjamin Bejar, Luc Van Gool

The increasing relevance of panoptic segmentation is tied to the advancements in autonomous driving and AR/VR applications. However, the deployment of such models has been limited due to the expensive nature of dense data annotation, giving rise to unsupervised domain adaptation (UDA). A key challenge in panoptic UDA is reducing the domain gap between a labeled source and an unlabeled target domain while harmonizing the subtasks of semantic and instance segmentation to limit catastrophic interference. While considerable progress has been achieved, existing approaches mainly focus on the adaptation of semantic segmentation. In this work, we focus on incorporating instance-level adaptation via a novel instance-aware cross-domain mixing strategy IMix. IMix significantly enhances the panoptic quality by improving instance segmentation performance. Specifically, we propose inserting high-confidence predicted instances from the target domain onto source images, retaining the exhaustiveness of the resulting pseudo-labels while reducing the injected confirmation bias. Nevertheless, such an enhancement comes at the cost of degraded semantic performance, attributed to catastrophic forgetting. To mitigate this issue, we regularize our semantic branch by employing CLIP-based domain alignment (CDA), exploiting the domain-robustness of natural language prompts. Finally, we present an end-to-end model incorporating these two mechanisms called LIDAPS, achieving state-of-the-art results on all popular panoptic UDA benchmarks.
PDF
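
The instance-pasting idea described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' IMix implementation: the function name `paste_confident_instances`, the instance dictionary layout, and the `threshold` value are all assumptions made for this example, and images are plain nested lists to keep the sketch dependency-free.

```python
def paste_confident_instances(source_img, source_lbl, target_img, instances, threshold=0.9):
    """Copy pixels of confident target-domain instances onto a copy of a source image.

    instances: list of dicts with hypothetical keys
      'mask' (2D list of 0/1), 'class_id' (int), 'score' (float confidence).
    """
    # Work on copies so the labeled source sample itself is left untouched.
    mixed_img = [row[:] for row in source_img]
    mixed_lbl = [row[:] for row in source_lbl]
    for inst in instances:
        if inst["score"] < threshold:
            continue  # skip low-confidence predicted instances
        for y, mask_row in enumerate(inst["mask"]):
            for x, inside in enumerate(mask_row):
                if inside:
                    mixed_img[y][x] = target_img[y][x]   # target pixel
                    mixed_lbl[y][x] = inst["class_id"]   # pseudo-label
    return mixed_img, mixed_lbl


source_img = [[1, 1], [1, 1]]
source_lbl = [[0, 0], [0, 0]]
target_img = [[9, 9], [9, 9]]
instances = [
    {"mask": [[1, 0], [0, 0]], "class_id": 5, "score": 0.95},
    {"mask": [[0, 0], [0, 1]], "class_id": 7, "score": 0.40},
]
mixed_img, mixed_lbl = paste_confident_instances(
    source_img, source_lbl, target_img, instances
)
```

Because the background of the mixed sample comes from the fully labeled source image, every pixel in the result carries a label, which is one way to read the abstract's remark about retaining exhaustive pseudo-labels.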

Frequency Decomposition-Driven Unsupervised Domain Adaptation for Remote Sensing Image Semantic Segmentation

Authors: Xianping Ma, Xiaokang Zhang, Xingchen Ding, Man-On Pun, Siwei Ma

Cross-domain semantic segmentation of remote sensing (RS) imagery based on unsupervised domain adaptation (UDA) techniques has significantly advanced deep-learning applications in the geosciences. Recently, with its ingenious and versatile architecture, the Transformer model has been successfully applied in RS-UDA tasks. However, existing UDA methods mainly focus on domain alignment in the high-level feature space. It is still challenging to retain cross-domain local spatial details and global contextual semantics simultaneously, which is crucial for the RS image semantic segmentation task. To address these problems, we propose novel high/low-frequency decomposition (HLFD) techniques to guide representation alignment in cross-domain semantic segmentation. Specifically, HLFD attempts to decompose the feature maps into high- and low-frequency components before performing the domain alignment in the corresponding subspaces. Secondly, to further facilitate the alignment of decomposed features, we propose a fully global-local generative adversarial network, namely GLGAN, to learn domain-invariant detailed and semantic features across domains by leveraging global-local transformer blocks (GLTBs). By integrating HLFD techniques and the GLGAN, a novel UDA framework called FD-GLGAN is developed to improve the cross-domain transferability and generalization capability of semantic segmentation models. Extensive experiments on two fine-resolution benchmark datasets, namely ISPRS Potsdam and ISPRS Vaihingen, highlight the effectiveness and superiority of the proposed approach as compared to the state-of-the-art UDA methods. The source code for this work will be accessible at https://github.com/sstary/SSRS.
PDF 28 pages, 13 figures
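
To make the high/low-frequency split concrete, here is a minimal sketch under a simplifying assumption: the low-frequency component is taken as a box-filter (local mean) of the feature map and the high-frequency component as the residual. The abstract does not say how HLFD performs the decomposition, so the filter choice, the function name `decompose_frequencies`, and the `radius` parameter are illustrative only.

```python
def decompose_frequencies(feature_map, radius=1):
    """Split a 2D feature map into low- and high-frequency parts.

    low  = local mean over a (2*radius+1)^2 window, clipped at the borders
    high = feature_map - low, so low + high reconstructs the input
    """
    h, w = len(feature_map), len(feature_map[0])
    low = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                feature_map[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            low[y][x] = sum(window) / len(window)
    high = [[feature_map[y][x] - low[y][x] for x in range(w)] for y in range(h)]
    return low, high


# A single sharp peak: the smooth part spreads it out, the residual keeps the detail.
feature_map = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
low, high = decompose_frequencies(feature_map)
```

In a domain-adaptation setting each component would then be aligned separately across domains; that alignment step is outside the scope of this sketch.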

Mixup Domain Adaptations for Dynamic Remaining Useful Life Predictions

Authors: Muhammad Tanzil Furqon, Mahardhika Pratama, Lin Liu, Habibullah, Kutluyil Dogancay

Remaining Useful Life (RUL) predictions play a vital role in asset planning and maintenance, bringing industries benefits such as reduced downtime and lower maintenance costs. Although various efforts have been devoted to this topic, most existing works are restricted to i.i.d. conditions, assuming the same conditions in the training phase and the deployment phase. This paper proposes a solution to this problem in the form of a mix-up domain adaptation (MDAN) approach. MDAN encompasses a three-staged mechanism where the mix-up strategy is not only performed to regularize the source and target domains but also applied to establish an intermediate mix-up domain where the source and target domains are aligned. A self-supervised learning strategy is implemented to prevent the supervision collapse problem. Rigorous evaluations have been performed in which MDAN is compared to recently published works for dynamic RUL predictions. MDAN outperforms its counterparts by substantial margins in 12 out of 12 cases. In addition, MDAN is evaluated on the bearing machine dataset, where it beats prior art by significant margins in 8 of 12 cases. The source code of MDAN is publicly available at https://github.com/furqon3009/MDAN.
PDF accepted for publication in Knowledge-based Systems
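
The intermediate mix-up domain mentioned in the abstract can be illustrated with the generic mixup operation: a convex combination of a source sample and a target sample. The sketch below shows only that basic operation, not MDAN's three-stage procedure; the helper names and the Beta parameter `alpha=0.2` are assumptions chosen for illustration.

```python
import random


def mixup(source_x, target_x, lam):
    """Convex combination of one source sample and one target sample.

    lam = 1.0 returns the source sample, lam = 0.0 the target sample;
    values in between lie in an intermediate domain.
    """
    return [lam * s + (1.0 - lam) * t for s, t in zip(source_x, target_x)]


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


source_window = [1.0, 2.0]   # e.g. a short sensor window from the source domain
target_window = [3.0, 4.0]   # the corresponding window from the target domain
halfway = mixup(source_window, target_window, 0.5)
random_mix = mixup(source_window, target_window, sample_lambda())
```

With a small `alpha`, most sampled coefficients fall near 0 or 1, so mixed samples stay close to one of the two domains; a larger `alpha` pushes them toward the middle.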

UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather

Authors: Haimei Zhao, Jing Zhang, Zhuo Chen, Shanshan Zhao, Dacheng Tao

LiDAR semantic segmentation (LSS) is a critical task in autonomous driving and has achieved promising progress. However, prior LSS methods are conventionally investigated and evaluated on datasets within the same domain in clear weather. The robustness of LSS models in unseen scenes and all weather conditions is crucial for ensuring safety and reliability in real applications. To this end, we propose UniMix, a universal method that enhances the adaptability and generalizability of LSS models. UniMix first leverages physically valid adverse weather simulation to construct a Bridge Domain, which serves to bridge the domain gap between the clear weather scenes and the adverse weather scenes. Then, a Universal Mixing operator is defined regarding spatial, intensity, and semantic distributions to create the intermediate domain with mixed samples from given domains. Integrating the proposed two techniques into a teacher-student framework, UniMix efficiently mitigates the domain gap and enables LSS models to learn weather-robust and domain-invariant representations. We apply UniMix to two main setups: 1) unsupervised domain adaptation, adapting the model from the clear weather source domain to the adverse weather target domain; 2) domain generalization, learning a model that generalizes well to unseen scenes in adverse weather. Extensive experiments validate the effectiveness of UniMix across different tasks and datasets, all achieving superior performance over state-of-the-art methods. The code will be released.
PDF Accepted by CVPR 2024

Multi-modal Document Presentation Attack Detection With Forensics Trace Disentanglement

Authors: Changsheng Chen, Yongyi Deng, Liangwei Lin, Zitong Yu, Zhimao Lai

Document Presentation Attack Detection (DPAD) is an important measure in protecting the authenticity of a document image. However, recent DPAD methods demand additional resources, such as manual effort in collecting additional data or knowing the parameters of acquisition devices. This work proposes a DPAD method based on multi-modal disentangled traces (MMDT) without the above drawbacks. We first disentangle the recaptured traces by a self-supervised disentanglement and synthesis network to enhance the generalization capacity in document images with different contents and layouts. Then, unlike the existing DPAD approaches that rely only on data in the RGB domain, we propose to explicitly employ the disentangled recaptured traces as new modalities in the transformer backbone through adaptive multi-modal adapters to fuse RGB/trace features efficiently. Visualization of the disentangled traces confirms the effectiveness of the proposed method in different document contents. Extensive experiments on three benchmark datasets demonstrate the superiority of our MMDT method on representing forensic traces of recapturing distortion.
PDF Accepted to ICME 2024

ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling

Authors: Ege Özsoy, Chantal Pellegrini, Matthias Keicher, Nassir Navab

Every day, countless surgeries are performed worldwide, each within the distinct settings of operating rooms (ORs) that vary not only in their setups but also in the personnel, tools, and equipment used. This inherent diversity poses a substantial challenge for achieving a holistic understanding of the OR, as it requires models to generalize beyond their initial training datasets. To reduce this gap, we introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling, which incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios. This capability is further enhanced by our novel data augmentation framework, which significantly diversifies the training dataset, ensuring ORacle’s proficiency in applying the provided knowledge effectively. In rigorous testing, in scene graph generation, and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so requiring less data than existing models. Furthermore, its adaptability is displayed through its ability to interpret unseen views, actions, and appearances of tools and equipment. This demonstrates ORacle’s potential to significantly enhance the scalability and affordability of OR domain modeling and opens a pathway for future advancements in surgical data science. We will release our code and data upon acceptance.
PDF 11 pages, 3 figures, 7 tables

MoCap-to-Visual Domain Adaptation for Efficient Human Mesh Estimation from 2D Keypoints

Authors: Bedirhan Uguz, Ozhan Suat, Batuhan Karagoz, Emre Akbas

This paper presents Key2Mesh, a model that takes a set of 2D human pose keypoints as input and estimates the corresponding body mesh. Since this process does not involve any visual (i.e. RGB image) data, the model can be trained on large-scale motion capture (MoCap) datasets, thereby overcoming the scarcity of image datasets with 3D labels. To enable the model’s application on RGB images, we first run an off-the-shelf 2D pose estimator to obtain the 2D keypoints, and then feed these 2D keypoints to Key2Mesh. To improve the performance of our model on RGB images, we apply an adversarial domain adaptation (DA) method to bridge the gap between the MoCap and visual domains. Crucially, our DA method does not require 3D labels for visual data, which enables adaptation to target sets without the need for costly labels. We evaluate Key2Mesh for the task of estimating 3D human meshes from 2D keypoints, in the absence of RGB and mesh label pairs. Our results on widely used H3.6M and 3DPW datasets show that Key2Mesh sets the new state-of-the-art by outperforming other models in PA-MPJPE for both datasets, and in MPJPE and PVE for the 3DPW dataset. Thanks to our model’s simple architecture, it operates at least 12x faster than the prior state-of-the-art model, LGD. Additional qualitative samples and code are available on the project website: https://key2mesh.github.io/.
PDF accepted to CVPRW 2024
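
For readers unfamiliar with the metrics named in the abstract, MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth 3D joints. The sketch below computes it for a toy two-joint skeleton; the function name and the sample coordinates are made up for illustration. PA-MPJPE additionally aligns the prediction to the ground truth with a Procrustes (similarity) transform before measuring, which is omitted here.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error: average Euclidean distance over joints.

    Both arguments are equal-length sequences of (x, y, z) joint coordinates,
    expressed in the same units (commonly millimetres).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("joint counts differ")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


predicted_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
true_joints = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(predicted_joints, true_joints)  # joint errors are 3 and 4
```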

Unified Language-driven Zero-shot Domain Adaptation

Authors: Senqiao Yang, Zhuotao Tian, Li Jiang, Jiaya Jia

This paper introduces Unified Language-driven Zero-shot Domain Adaptation (ULDA), a novel task setting that enables a single model to adapt to diverse target domains without explicit domain-ID knowledge. We identify the constraints in the existing language-driven zero-shot domain adaptation task, particularly the requirement for domain IDs and domain-specific models, which may restrict flexibility and scalability. To overcome these issues, we propose a new framework for ULDA, consisting of Hierarchical Context Alignment (HCA), Domain Consistent Representation Learning (DCRL), and Text-Driven Rectifier (TDR). These components work synergistically to align simulated features with target text across multiple visual levels, retain semantic correlations between different regional representations, and rectify biases between simulated and real target visual features, respectively. Our extensive empirical evaluations demonstrate that this framework achieves competitive performance in both settings, surpassing even the model that requires domain-ID, showcasing its superiority and generalization ability. The proposed method is not only effective but also maintains practicality and efficiency, as it does not introduce additional computational costs during inference. Our project page is https://senqiaoyang.com/project/ULDA .
PDF Accepted by CVPR 2024

Scalable Language Model with Generalized Continual Learning

Authors: Bohao Peng, Zhuotao Tian, Shu Liu, Mingchang Yang, Jiaya Jia

Continual learning has gained increasing importance as it facilitates the acquisition and refinement of scalable knowledge and skills in language models. However, existing methods typically encounter strict limitations and challenges in real-world scenarios, such as reliance on experience replay, optimization constraints, and inference task-ID. In this study, we introduce the Scalable Language Model (SLM) to overcome these limitations within a more challenging and generalized setting, representing a significant advancement toward practical applications for continual learning. Specifically, we propose the Joint Adaptive Re-Parameterization (JARe), integrated with Dynamic Task-related Knowledge Retrieval (DTKR), to enable adaptive adjustment of language models based on specific downstream tasks. This approach leverages the task distribution within the vector space, aiming to achieve a smooth and effortless continual learning process. Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting. Moreover, while prior research primarily focused on a single task type such as classification, our study goes beyond, with the large language model, i.e., LLaMA-2, to explore the effects across diverse domains and task types, such that a single language model can be decently scaled to broader applications.
PDF The Twelfth International Conference on Learning Representations

Generative Probabilistic Planning for Optimizing Supply Chain Networks

Authors: Hyung-il Ahn, Santiago Olivar, Hershel Mehta, Young Chol Song

Supply chain networks in enterprises are typically composed of complex topological graphs involving various types of nodes and edges, accommodating numerous products with considerable demand and supply variability. However, as supply chain networks expand in size and complexity, traditional supply chain planning methods (e.g., those found in heuristic rule-based and operations research-based systems) tend to become locally optimal or lack computational scalability, resulting in substantial imbalances between supply and demand across nodes in the network. This paper introduces a novel Generative AI technique, which we call Generative Probabilistic Planning (GPP). GPP generates dynamic supply action plans that are globally optimized across all network nodes over the time horizon for changing objectives like maximizing profits or service levels, factoring in time-varying probabilistic demand, lead time, and production conditions. GPP leverages attention-based graph neural networks (GNN), offline deep reinforcement learning (Offline RL), and policy simulations to train generative policy models and create optimal plans through probabilistic simulations, effectively accounting for various uncertainties. Our experiments using historical data from a global consumer goods company with complex supply chain networks demonstrate that GPP accomplishes objective-adaptable, probabilistically resilient, and dynamic planning for supply chain networks, leading to significant improvements in performance and profitability for enterprises. Our work plays a pivotal role in shaping the trajectory of AI adoption within the supply chain domain.
PDF

PromptSync: Bridging Domain Gaps in Vision-Language Models through Class-Aware Prototype Alignment and Discrimination

Authors: Anant Khandelwal

The potential for zero-shot generalization in vision-language (V-L) models such as CLIP has spurred their widespread adoption in addressing numerous downstream tasks. Previous methods have employed test-time prompt tuning to adapt the model to unseen domains, but they overlooked the issue of imbalanced class distributions. In this study, we explicitly address this problem by employing class-aware prototype alignment weighted by mean class probabilities obtained for the test sample and filtered augmented views. Additionally, we ensure that the class probabilities are as accurate as possible by performing prototype discrimination using contrastive learning. The combination of alignment and discriminative loss serves as a geometric regularizer, preventing the prompt representation from collapsing onto a single class and effectively bridging the distribution gap between the source and test domains. Our method, named PromptSync, synchronizes the prompts for each test sample on both the text and vision branches of the V-L model. In empirical evaluations on the domain generalization benchmark, our method outperforms previous best methods by 2.33% in overall performance, by 1% in base-to-novel generalization, and by 2.84% in cross-dataset transfer tasks.
PDF Accepted at CVPR 2024 LIMIT, 12 pages, 8 Tables, 2 Figures
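
The weighting scheme in the abstract (an alignment term weighted by mean class probabilities over filtered views) might look roughly like the sketch below. Several parts are assumptions rather than details from the paper: views are filtered by keeping the lowest-entropy ones, the alignment term is a squared Euclidean distance to each class prototype, and all function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one probability vector (lower means more confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_class_probs(view_probs, keep=2):
    """Average class probabilities over the `keep` most confident views."""
    kept = sorted(view_probs, key=entropy)[:keep]
    n_classes = len(kept[0])
    return [sum(v[c] for v in kept) / len(kept) for c in range(n_classes)]


def weighted_alignment_loss(feature, prototypes, weights):
    """Sum over classes of weight * squared distance to that class prototype."""
    return sum(
        w * sum((f - p) ** 2 for f, p in zip(feature, proto))
        for w, proto in zip(weights, prototypes)
    )


# Class probabilities predicted for three augmented views of one test sample.
views = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]
weights = mean_class_probs(views, keep=2)      # the uniform view is dropped
prototypes = [[1.0, 0.0], [0.0, 1.0]]          # one prototype per class
loss = weighted_alignment_loss([1.0, 0.0], prototypes, weights)
```

Weighting by class probability means a test sample is pulled mostly toward the prototypes of classes it is likely to belong to, rather than equally toward all of them.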

Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain

Authors: Iker García-Ferrero, Rodrigo Agerri, Aitziber Atutxa Salazar, Elena Cabrio, Iker de la Iglesia, Alberto Lavelli, Bernardo Magnini, Benjamin Molinet, Johana Ramirez-Romero, German Rigau, Jose Maria Villa-Gonzalez, Serena Villata, Andrea Zaninello

Research on language technology for the development of medical applications is currently a hot topic in Natural Language Understanding and Generation. Thus, a number of large language models (LLMs) have recently been adapted to the medical domain, so that they can be used as a tool for mediating in human-AI interaction. While these LLMs display competitive performance on automated medical texts benchmarks, they have been pre-trained and evaluated with a focus on a single language (English mostly). This is particularly true of text-to-text models, which typically require large amounts of domain-specific pre-training data, often not easily accessible for many languages. In this paper, we address these shortcomings by compiling, to the best of our knowledge, the largest multilingual corpus for the medical domain in four languages, namely English, French, Italian and Spanish. This new corpus has been used to train Medical mT5, the first open-source text-to-text multilingual model for the medical domain. Additionally, we present two new evaluation benchmarks for all four languages with the aim of facilitating multilingual research in this domain. A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks, while being competitive with current state-of-the-art LLMs in English.
PDF LREC-COLING 2024

Finding Dino: A plug-and-play framework for unsupervised detection of out-of-distribution objects using prototypes

Authors: Poulami Sinhamahapatra, Franziska Schwaiger, Shirsha Bose, Huiyu Wang, Karsten Roscher, Stephan Guennemann

Detecting and localising unknown or out-of-distribution (OOD) objects in any scene can be a challenging task in vision, particularly in safety-critical cases involving autonomous systems like automated vehicles or trains. Supervised anomaly segmentation or open-world object detection models depend on training on exhaustively annotated datasets for every domain and still struggle to distinguish between background and OOD objects. In this work, we present a plug-and-play generalised framework: PRototype-based zero-shot OOD detection Without Labels (PROWL). It is an inference-based method that does not require training on the domain dataset and relies on extracting relevant features from self-supervised pre-trained models. PROWL can be easily adapted to detect OOD objects in any operational design domain by specifying a list of known classes from this domain. PROWL, as an unsupervised method, outperforms other supervised methods trained without auxiliary OOD data on the RoadAnomaly and RoadObstacle datasets provided in the SegmentMeIfYouCan (SMIYC) benchmark. We also demonstrate its suitability for other domains such as rail and maritime scenes.
PDF

Flatness Improves Backbone Generalisation in Few-shot Classification

Authors: Rui Li, Martin Trapp, Marcus Klasson, Arno Solin

Deployment of deep neural networks in real-world settings typically requires adaptation to new tasks with few examples. Few-shot classification (FSC) provides a solution to this problem by leveraging pre-trained backbones for fast adaptation to new classes. Surprisingly, most efforts have only focused on developing architectures for easing the adaptation to the target domain without considering the importance of backbone training for good generalisation. We show that flatness-aware backbone training with vanilla fine-tuning results in a simpler yet competitive baseline compared to the state-of-the-art. Our results indicate that for in- and cross-domain FSC, backbone training is crucial to achieving good generalisation across different adaptation methods. We advocate more care should be taken when training these models.
PDF
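
As a rough idea of what flatness-aware training can involve, the sketch below implements one update of sharpness-aware minimization (SAM), a well-known flatness-seeking optimizer. The abstract does not say which flatness-aware method the authors use, so SAM is a stand-in here, and the toy quadratic loss, learning rate, and `rho` value are assumptions for illustration.

```python
import math


def sam_step(weights, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization update.

    1. Move to the approximate worst-case point within an L2 ball of radius rho
       by stepping along the normalized gradient.
    2. Apply the gradient measured at that perturbed point to the original weights.
    """
    grad = grad_fn(weights)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(weights)  # stationary point: nothing to perturb toward
    perturbed = [w + rho * g / norm for w, g in zip(weights, grad)]
    sharp_grad = grad_fn(perturbed)
    return [w - lr * g for w, g in zip(weights, sharp_grad)]


def quadratic_grad(w):
    """Gradient of the toy loss L(w) = sum(w_i ** 2)."""
    return [2.0 * wi for wi in w]


w = [1.0, -2.0]
for _ in range(50):
    w = sam_step(w, quadratic_grad)
final_loss = sum(wi * wi for wi in w)
```

Each step costs two gradient evaluations instead of one, which is the usual price of this kind of flatness-seeking update.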

Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

Authors: Yiwen Tang, Jiaming Liu, Dong Wang, Zhigang Wang, Shanghang Zhang, Bin Zhao, Xuelong Li

Large foundation models have recently emerged as a prominent focus of interest, attaining superior performance in widespread scenarios. Due to the scarcity of 3D data, many efforts have been made to adapt pre-trained transformers from vision to 3D domains. However, such 2D-to-3D approaches are still limited, due to the potential loss of spatial geometries and high computation cost. More importantly, their frameworks are mainly designed for 2D models, lacking a general any-to-3D paradigm. In this paper, we introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding. Given a frozen transformer from any source modality, we propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality. This mechanism enables us to assign each 3D token with a positional encoding paired with the pre-trained model, which avoids 3D geometry loss caused by the true projection and better motivates the transformer for 3D learning with 1D/2D positional priors. Then, within each transformer block, we insert an any-to-3D guided adapter module for parameter-efficient fine-tuning. The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers. We conduct extensive experiments to showcase the effectiveness and efficiency of our method. Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point.
PDF Code and models are released at https://github.com/Ivan-Tang-3D/Any2Point

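Article author: 木子已
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit the source 木子已 when reposting.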