
Dual Prompt Tuning for Enhancing Discriminative and Generalization Ability in Vision-Language Models

Abstract

Prompt tuning has emerged as a key technique for improving the transfer-learning performance of large-scale vision-language models. In particular, models such as CLIP achieve strong zero-shot and few-shot performance by learning tunable text contexts. Building on this, various methods have been proposed to improve discriminative performance on seen tasks without sacrificing generalization on unseen tasks. However, we point out that most existing prompt-tuning methods rely heavily on manual class names, and that learning the context alone is often insufficient to form clear class boundaries. This limitation arises not only on fine-grained datasets such as FGVC-Aircraft but also on general datasets like ImageNet. To address it, we propose a new prompt structure that learns class embeddings directly from images and fuses them with manual class names at the representation level. Our method can be combined with existing text prompt-tuning approaches and improves performance on both base and novel tasks, thereby alleviating the base-to-novel trade-off commonly observed in transfer learning. We validate its effectiveness through extensive experiments and benchmark evaluations across various models.
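The prompt structure described above can be sketched at a toy scale. This is a minimal illustration, not the thesis's implementation: the embedding sizes, the random stand-ins for name and image-derived embeddings, and the weighted-sum fusion rule (`alpha`) are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # embedding dimension (toy size; CLIP text encoders use 512)
M = 4          # number of learnable context tokens, as in context tuning
classes = ["cat", "dog"]

# Shared learnable context tokens (the tunable text context from the abstract)
context = rng.normal(size=(M, D))

# Stand-ins for the embeddings of the manual class names
name_emb = {c: rng.normal(size=(1, D)) for c in classes}

# Per-class learnable class embeddings; in the proposed method these would
# be learned from images rather than initialized at random as here
learned_cls = {c: rng.normal(size=(1, D)) for c in classes}

def build_prompt(c, alpha=0.5):
    """Build one class prompt: fuse the manual-name embedding with the
    learned class embedding at the representation level, then append the
    fused token to the shared context. `alpha` is an assumed fusion weight."""
    fused = alpha * name_emb[c] + (1 - alpha) * learned_cls[c]
    return np.concatenate([context, fused], axis=0)

for c in classes:
    print(c, build_prompt(c).shape)   # each prompt is (M + 1, D) = (5, 8)
```

The point of the sketch is the fusion step: the class token fed to the text encoder is no longer the manual name alone, so class boundaries do not depend solely on how informative the hand-written names are.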


Table of Contents

1. Introduction
2. Related Works
2.1 Vision-Language Models
2.2 Prompt Tuning in Vision-Language Models
3. Method
3.1 Learnable Class Embedding
3.2 Sharing Learnable Context
3.3 Inference for Base to Novel
4. Experiments
4.1 Experimental Setup
4.2 Experimental Results
5. Ablation Studies
5.1 Results Under Different Shot Settings
5.2 Results on Proposed Main Components
5.3 Results on Unsupervised Prompt Distillation
5.4 Similarity Map Visualization
6. Conclusion
References
