
Enhancing Representation Learning for Vision Transformers with Semantic Anchor Tokens

Abstract

Although Vision Transformers (ViTs) have achieved remarkable performance in image classification, they remain limited by their inability to utilize the rich metadata often accompanying images and their tendency toward 'shortcut learning,' relying on superficial cues such as background or texture. In this paper, we propose the Semantic Anchor Token (SAT) methodology to simultaneously address these two issues by utilizing text metadata as an auxiliary class token. SAT integrates metadata embeddings, encoded via a pre-trained language model, into the ViT's input sequence as a native token. This approach facilitates natural interaction between visual and semantic information through ViT's inherent Self-Attention mechanism without requiring complex fusion modules. Experimental results on the DVM-CAR dataset—a large-scale Fine-Grained Visual Classification (FGVC) benchmark characterized by high visual ambiguity—demonstrate that the proposed method improves the accuracy of the ViT-Small model by approximately 5.7%, proving its efficiency by outperforming the structurally more complex Swin Transformer. Notably, through Attention Map analysis, we confirm that SAT serves as a 'Semantic Anchor,' shifting the model's visual attention from local features to a holistic representation of the object. These findings suggest that the robustness of vision models can be significantly enhanced by effectively injecting external knowledge without increasing structural complexity.
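The token-injection scheme described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the thesis's implementation: the `SATViT` name, the tiny dimensions, and the random tensor standing in for a frozen language-model embedding of the metadata are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SATViT(nn.Module):
    """Toy ViT that prepends a Semantic Anchor Token (SAT), derived from a
    text-metadata embedding, to the patch-token sequence. Illustrative only:
    the metadata encoder and all dimensions are placeholders."""

    def __init__(self, img_size=32, patch=8, dim=64, text_dim=128,
                 depth=2, heads=4, classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Project the (frozen) language-model embedding into the ViT width,
        # producing the SAT that joins the sequence as a native token.
        self.sat_proj = nn.Linear(text_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, images, text_emb):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        sat = self.sat_proj(text_emb).unsqueeze(1)                     # (B, 1, dim)
        # [CLS] + SAT + patch tokens; self-attention then mixes visual
        # and semantic information without any extra fusion module.
        x = torch.cat([cls, sat, patches], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the [CLS] token

model = SATViT()
images = torch.randn(2, 3, 32, 32)
text_emb = torch.randn(2, 128)  # stand-in for a frozen LM embedding of the metadata
logits = model(images, text_emb)
print(logits.shape)  # torch.Size([2, 10])
```

Because the SAT enters the sequence like any other token, no architectural change beyond a single linear projection is needed, which is the efficiency argument the abstract makes against heavier fusion modules.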


Table of Contents

Ⅰ. Introduction
Ⅱ. Related Works
A. Image Classification
B. Vision Transformer
C. Research Positioning and Rationale
Ⅲ. Method
A. Overview
B. Model Architecture
C. Semantic Anchor Token
1. Concept of Semantic Anchor Token
2. Metadata Preprocessing
3. SAT Generation and Integration
4. Learning and Inference Mechanism
Ⅳ. Experiments
A. Dataset
B. Data Preprocessing
C. Implementation Details
Ⅴ. Results and Analysis
A. Performance Comparison
B. Attention Map Analysis
C. Ablation Studies
1. Image and Metadata Fusion Effect
2. Importance of Semantic Information
Ⅵ. Conclusion
Ⅶ. References
