
Enhancing Representation Learning for Vision Transformers with Semantic Anchor Tokens

Abstract

Although Vision Transformers (ViTs) have achieved remarkable performance in image classification, they remain limited by their inability to utilize the rich metadata often accompanying images and their tendency toward 'shortcut learning,' relying on superficial cues such as background or texture. In this paper, we propose the Semantic Anchor Token (SAT) methodology to simultaneously address these two issues by utilizing text metadata as an auxiliary class token. SAT integrates metadata embeddings, encoded via a pre-trained language model, into the ViT's input sequence as a native token. This approach facilitates natural interaction between visual and semantic information through ViT's inherent Self-Attention mechanism without requiring complex fusion modules. Experimental results on the DVM-CAR dataset—a large-scale Fine-Grained Visual Classification (FGVC) benchmark characterized by high visual ambiguity—demonstrate that the proposed method improves the accuracy of the ViT-Small model by approximately 5.7%, proving its efficiency by outperforming the structurally more complex Swin Transformer. Notably, through Attention Map analysis, we confirm that SAT serves as a 'Semantic Anchor,' shifting the model's visual attention from local features to a holistic representation of the object. These findings suggest that the robustness of vision models can be significantly enhanced by effectively injecting external knowledge without increasing structural complexity.
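The token-injection scheme described in the abstract can be sketched in PyTorch as follows. This is a minimal illustration, not the thesis's implementation: the `SATViT` name, the tiny dimensions, and the random tensor standing in for a frozen language-model embedding of the metadata are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SATViT(nn.Module):
    """Toy ViT that prepends a Semantic Anchor Token (SAT), derived from a
    text-metadata embedding, to the patch-token sequence. Illustrative only:
    the metadata encoder and all dimensions are placeholders."""

    def __init__(self, img_size=32, patch=8, dim=64, text_dim=128,
                 depth=2, heads=4, classes=10):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Project the (frozen) language-model embedding into the ViT width,
        # producing the SAT that joins the sequence as a native token.
        self.sat_proj = nn.Linear(text_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, images, text_emb):
        b = images.size(0)
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        sat = self.sat_proj(text_emb).unsqueeze(1)                     # (B, 1, dim)
        # [CLS] + SAT + patch tokens; self-attention then mixes visual
        # and semantic information without any extra fusion module.
        x = torch.cat([cls, sat, patches], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])  # classify from the [CLS] token

model = SATViT()
images = torch.randn(2, 3, 32, 32)
text_emb = torch.randn(2, 128)  # stand-in for a frozen LM embedding of the metadata
logits = model(images, text_emb)
print(logits.shape)  # torch.Size([2, 10])
```

Because the SAT enters the sequence like any other token, no architectural change beyond a single linear projection is needed, which is the efficiency argument the abstract makes against heavier fusion modules.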


Table of Contents

Ⅰ. Introduction
Ⅱ. Related Works
A. Image Classification
B. Vision Transformer
C. Research Positioning and Rationale
Ⅲ. Method
A. Overview
B. Model Architecture
C. Semantic Anchor Token
1. Concept of Semantic Anchor Token
2. Metadata Preprocessing
3. SAT Generation and Integration
4. Learning and Inference Mechanism
Ⅳ. Experiments
A. Dataset
B. Data Preprocessing
C. Implementation Details
Ⅴ. Results and Analysis
A. Performance Comparison
B. Attention Map Analysis
C. Ablation Studies
1. Image and Metadata Fusion Effect
2. Importance of Semantic Information
Ⅵ. Conclusion
Ⅶ. References
