Dual Prompt Tuning for Enhancing Discriminative and Generalization Ability in Vision-Language Models
- Subject (Keywords): Computer Vision, Vision-Language Models, Prompt Tuning
- Subject (DDC): 006.31
- Institution: Graduate School, Ajou University
- Advisor: Jongbin Ryu
- Year of Publication: 2025
- Degree Conferral: August 2025
- Degree: Master's
- Department and Major: Department of Artificial Intelligence, Graduate School
- URI: http://www.dcollection.net/handler/ajou/000000035060
- Language: English
- Copyright: Ajou University theses are protected by copyright.
Abstract
Prompt tuning has emerged as a key technique for enhancing the transfer learning performance of large-scale Vision-Language Models. In particular, models like CLIP demonstrate strong performance even in zero-shot and few-shot scenarios by learning tunable text contexts. Building on this, various methods have been proposed to improve discriminative performance on seen tasks without sacrificing generalization on unseen tasks. However, we point out that most existing prompt tuning methods rely heavily on manual class names, and that simply learning the context is often insufficient to form clear class boundaries. This limitation arises not only in fine-grained datasets such as FGVC-Aircraft but also in general datasets like ImageNet. To address this issue, we propose a new prompt structure that directly learns class embeddings from images and fuses them at the representation level with manual class names. Our method is adaptable to existing text prompt tuning approaches and demonstrates improved performance not only on base tasks but also on novel tasks. As a result, it alleviates the fundamental base-to-novel trade-off commonly observed in transfer learning. The effectiveness of our approach is validated through extensive experiments and benchmark evaluations across various models.
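For illustration, below is a minimal PyTorch sketch of the idea the abstract describes: a shared learnable context (as in CoOp-style prompt tuning) combined with a per-class embedding that is learned and fused at the representation level with the frozen embedding of the manual class name. The module name `DualPrompt`, the dimensions, the random initialization, and the averaging fusion rule are illustrative assumptions, not the thesis's actual formulation.

```python
# Hedged sketch: shared learnable context + learnable per-class embedding,
# fused with frozen class-name embeddings. Not the thesis's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPrompt(nn.Module):
    def __init__(self, num_classes: int, ctx_len: int = 4, dim: int = 512):
        super().__init__()
        # Shared learnable context tokens, analogous to "a photo of a ...".
        self.context = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        # Per-class embedding; assumed random init here (in practice it
        # could be initialized from class-mean image features).
        self.class_embed = nn.Parameter(torch.randn(num_classes, dim) * 0.02)

    def forward(self, name_embed: torch.Tensor) -> torch.Tensor:
        # name_embed: (num_classes, dim) frozen embeddings of the manual
        # class names. Averaging is an assumed fusion rule.
        fused = 0.5 * (self.class_embed + name_embed)
        # Prepend the shared context to each class's fused embedding:
        # result has shape (num_classes, ctx_len + 1, dim).
        ctx = self.context.unsqueeze(0).expand(name_embed.size(0), -1, -1)
        return torch.cat([ctx, fused.unsqueeze(1)], dim=1)

# Usage sketch with stand-ins for the frozen CLIP encoders.
num_classes, dim = 10, 512
prompt = DualPrompt(num_classes, dim=dim)
name_embed = torch.randn(num_classes, dim)             # frozen name embeddings
prompts = prompt(name_embed)                           # (10, 5, 512) token sequences
text_feat = F.normalize(prompts.mean(dim=1), dim=-1)   # stand-in text encoder
image_feat = F.normalize(torch.randn(8, dim), dim=-1)  # stand-in image features
logits = 100.0 * image_feat @ text_feat.t()            # temperature-scaled similarity
```

Because only `context` and `class_embed` receive gradients, such a setup keeps the pretrained encoders frozen, which is the standard efficiency argument for prompt tuning.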
Table of Contents
1. Introduction
2. Related Works
2.1 Vision-Language Models
2.2 Prompt Tuning in Vision-Language Models
3. Method
3.1 Learnable Class Embedding
3.2 Sharing Learnable Context
3.3 Inference for Base-to-Novel
4. Experiments
4.1 Experimental Setup
4.2 Experimental Results
5. Ablation Studies
5.1 Results Under Different Shot Settings
5.2 Results on Proposed Main Components
5.3 Results on Unsupervised Prompt Distillation
5.4 Similarity Map Visualization
6. Conclusion
References

