
Dual Prompt Tuning for Enhancing Discriminative and Generalization Ability in Vision-Language Models

Abstract

Prompt tuning has emerged as a key technique for improving the transfer-learning performance of large-scale vision-language models. In particular, models such as CLIP achieve strong zero-shot and few-shot performance by learning tunable text contexts. Building on this, various methods have been proposed to improve discriminative performance on seen tasks without sacrificing generalization on unseen tasks. However, we point out that most existing prompt-tuning methods rely heavily on manual class names, and that learning the context alone is often insufficient to form clear class boundaries. This limitation arises not only on fine-grained datasets such as FGVC-Aircraft but also on general datasets like ImageNet. To address it, we propose a new prompt structure that learns class embeddings directly from images and fuses them with manual class names at the representation level. Our method can be combined with existing text prompt-tuning approaches and improves performance on both base and novel tasks, thereby alleviating the base-to-novel trade-off commonly observed in transfer learning. We validate its effectiveness through extensive experiments and benchmark evaluations across various models.
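The prompt structure described above can be sketched at a toy scale. This is a minimal illustration, not the thesis's implementation: the embedding sizes, the random stand-ins for name and image-derived embeddings, and the weighted-sum fusion rule (`alpha`) are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # embedding dimension (toy size; CLIP text encoders use 512)
M = 4          # number of learnable context tokens, as in context tuning
classes = ["cat", "dog"]

# Shared learnable context tokens (the tunable text context from the abstract)
context = rng.normal(size=(M, D))

# Stand-ins for the embeddings of the manual class names
name_emb = {c: rng.normal(size=(1, D)) for c in classes}

# Per-class learnable class embeddings; in the proposed method these would
# be learned from images rather than initialized at random as here
learned_cls = {c: rng.normal(size=(1, D)) for c in classes}

def build_prompt(c, alpha=0.5):
    """Build one class prompt: fuse the manual-name embedding with the
    learned class embedding at the representation level, then append the
    fused token to the shared context. `alpha` is an assumed fusion weight."""
    fused = alpha * name_emb[c] + (1 - alpha) * learned_cls[c]
    return np.concatenate([context, fused], axis=0)

for c in classes:
    print(c, build_prompt(c).shape)   # each prompt is (M + 1, D) = (5, 8)
```

The point of the sketch is the fusion step: the class token fed to the text encoder is no longer the manual name alone, so class boundaries do not depend solely on how informative the hand-written names are.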


Table of Contents

1. Introduction
2. Related Works
2.1 Vision-Language Models
2.2 Prompt Tuning in Vision-Language Models
3. Method
3.1 Learnable Class Embedding
3.2 Sharing Learnable Context
3.3 Inference for Base to Novel
4. Experiments
4.1 Experimental Setup
4.2 Experimental Results
5. Ablation Studies
5.1 Results Under Different Shot Settings
5.2 Results on Proposed Main Components
5.3 Results on Unsupervised Prompt Distillation
5.4 Similarity Map Visualization
6. Conclusion
References
