Domain Adaptation


2024-04-16 更新

PromptSync: Bridging Domain Gaps in Vision-Language Models through Class-Aware Prototype Alignment and Discrimination

Authors:Anant Khandelwal

The potential for zero-shot generalization in vision-language (V-L) models such as CLIP has spurred their widespread adoption in addressing numerous downstream tasks. Previous methods have employed test-time prompt tuning to adapt the model to unseen domains, but they overlooked the issue of imbalanced class distributions. In this study, we explicitly address this problem by employing class-aware prototype alignment weighted by mean class probabilities obtained for the test sample and filtered augmented views. Additionally, we ensure that the class probabilities are as accurate as possible by performing prototype discrimination using contrastive learning. The combination of alignment and discriminative loss serves as a geometric regularizer, preventing the prompt representation from collapsing onto a single class and effectively bridging the distribution gap between the source and test domains. Our method, named PromptSync, synchronizes the prompts for each test sample on both the text and vision branches of the V-L model. In empirical evaluations on the domain generalization benchmark, our method outperforms previous best methods by 2.33% in overall performance, by 1% in base-to-novel generalization, and by 2.84% in cross-dataset transfer tasks.
PDF Accepted at CVPR 2024 LIMIT, 12 pages, 8 Tables, 2 Figures

点此查看论文截图

Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis

Authors:Maged Shoman, Dongdong Wang, Armstrong Aboah, Mohamed Abdel-Aty

This paper introduces our solution for Track 2 in AI City Challenge 2024. The task aims to solve traffic safety description and analysis with the dataset of Woven Traffic Safety (WTS), a real-world Pedestrian-Centric Traffic Video Dataset for Fine-grained Spatial-Temporal Understanding. Our solution mainly focuses on the following points: 1) To solve dense video captioning, we leverage the framework of dense video captioning with parallel decoding (PDVC) to model visual-language sequences and generate dense caption by chapters for video. 2) Our work leverages CLIP to extract visual features to more efficiently perform cross-modality training between visual and textual representations. 3) We conduct domain-specific model adaptation to mitigate domain shift problem that poses recognition challenge in video understanding. 4) Moreover, we leverage BDD-5K captioned videos to conduct knowledge transfer for better understanding WTS videos and more accurate captioning. Our solution has yielded on the test set, achieving 6th place in the competition. The open source code will be available at https://github.com/UCF-SST-Lab/AICity2024CVPRW
PDF

点此查看论文截图

Generalized Population-Based Training for Hyperparameter Optimization in Reinforcement Learning

Authors:Hui Bai, Ran Cheng

Hyperparameter optimization plays a key role in the machine learning domain. Its significance is especially pronounced in reinforcement learning (RL), where agents continuously interact with and adapt to their environments, requiring dynamic adjustments in their learning trajectories. To cater to this dynamicity, the Population-Based Training (PBT) was introduced, leveraging the collective intelligence of a population of agents learning simultaneously. However, PBT tends to favor high-performing agents, potentially neglecting the explorative potential of agents on the brink of significant advancements. To mitigate the limitations of PBT, we present the Generalized Population-Based Training (GPBT), a refined framework designed for enhanced granularity and flexibility in hyperparameter adaptation. Complementing GPBT, we further introduce Pairwise Learning (PL). Instead of merely focusing on elite agents, PL employs a comprehensive pairwise strategy to identify performance differentials and provide holistic guidance to underperforming agents. By integrating the capabilities of GPBT and PL, our approach significantly improves upon traditional PBT in terms of adaptability and computational efficiency. Rigorous empirical evaluations across a range of RL benchmarks confirm that our approach consistently outperforms not only the conventional PBT but also its Bayesian-optimized variant.
PDF IEEE TETCI

点此查看论文截图

NC-TTT: A Noise Contrastive Approach for Test-Time Training

Authors:David Osowiechi, Gustavo A. Vargas Hakim, Mehrdad Noori, Milad Cheraghalikhani, Ali Bahri, Moslem Yazdanpanah, Ismail Ben Ayed, Christian Desrosiers

Despite their exceptional performance in vision tasks, deep learning models often struggle when faced with domain shifts during testing. Test-Time Training (TTT) methods have recently gained popularity by their ability to enhance the robustness of models through the addition of an auxiliary objective that is jointly optimized with the main task. Being strictly unsupervised, this auxiliary objective is used at test time to adapt the model without any access to labels. In this work, we propose Noise-Contrastive Test-Time Training (NC-TTT), a novel unsupervised TTT technique based on the discrimination of noisy feature maps. By learning to classify noisy views of projected feature maps, and then adapting the model accordingly on new domains, classification performance can be recovered by an important margin. Experiments on several popular test-time adaptation baselines demonstrate the advantages of our method compared to recent approaches for this task. The code can be found at:https://github.com/GustavoVargasHakim/NCTTT.git
PDF

点此查看论文截图

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Authors:Zhiwei Yang, Jing Liu, Peng Wu

Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak-label and then self-training a classifier is currently a promising solution. However, since the existing methods use only RGB visual modality and the utilization of category text information is neglected, thus limiting the generation of more accurate pseudo-labels and affecting the performance of self-training. Inspired by the manual labeling process based on the event description, in this paper, we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. Specifically, We first fine-tune the CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further, we propose a learnable text prompt mechanism with the assist of a normality visual prompt to further improve the matching accuracy of video event description text and video frames. Then, we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally, we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets, UCF-Crime and XD-Viole
PDF Accepted to CVPR2024

点此查看论文截图

Pathological Primitive Segmentation Based on Visual Foundation Model with Zero-Shot Mask Generation

Authors:Abu Bakor Hayat Arnob, Xiangxue Wang, Yiping Jiao, Xiao Gan, Wenlong Ming, Jun Xu

Medical image processing usually requires a model trained with carefully crafted datasets due to unique image characteristics and domain-specific challenges, especially in pathology. Primitive detection and segmentation in digitized tissue samples are essential for objective and automated diagnosis and prognosis of cancer. SAM (Segment Anything Model) has recently been developed to segment general objects from natural images with high accuracy, but it requires human prompts to generate masks. In this work, we present a novel approach that adapts pre-trained natural image encoders of SAM for detection-based region proposals. Regions proposed by a pre-trained encoder are sent to cascaded feature propagation layers for projection. Then, local semantic and global context is aggregated from multi-scale for bounding box localization and classification. Finally, the SAM decoder uses the identified bounding boxes as essential prompts to generate a comprehensive primitive segmentation map. The entire base framework, SAM, requires no additional training or fine-tuning but could produce an end-to-end result for two fundamental segmentation tasks in pathology. Our method compares with state-of-the-art models in F1 score for nuclei detection and binary/multiclass panoptic(bPQ/mPQ) and mask quality(dice) for segmentation quality on the PanNuke dataset while offering end-to-end efficiency. Our model also achieves remarkable Average Precision (+4.5%) on the secondary dataset (HuBMAP Kidney) compared to Faster RCNN. The code is publicly available at https://github.com/learner-codec/autoprom_sam.
PDF 2024 IEEE International Symposium on Biomedical Imaging

点此查看论文截图

文章作者: 木子已
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 木子已 !
  目录