CLIP-based Image Caption Prompting for Zero-shot Hateful Meme Detection
- Subject (keywords): Large Language Model, Hateful Meme Detection, Contrastive Learning, In-context Learning
- Subject (DDC): 006.31
- Issuing institution: Ajou University Graduate School
- Advisor: 조현석
- Year of publication: 2025
- Degree conferred: August 2025
- Degree: Master's
- Department and major: Department of Artificial Intelligence, Graduate School
- URI: http://www.dcollection.net/handler/ajou/000000034841
- Language: English
- Copyright: Ajou University theses are protected by copyright.
Abstract
Recent advancements in vision-language models have significantly enhanced multimodal understanding capabilities, yet substantial challenges remain in detecting implicitly hateful memes that rely on subtle image-text interplay. This paper introduces a novel framework integrating CLIP-guided caption optimization with large language model summarization to bridge the modality gap in zero-shot hate speech detection. Our three-stage architecture first generates diverse image descriptions through strategic prompt engineering, then synthesizes these into semantically dense captions, and finally selects optimal representations using contrastive cross-modal alignment. Extensive evaluations on the Facebook Hateful Memes benchmark demonstrate enhanced performance, with accuracy improvements of 3.4% and 1.0% for InstructBLIP-T5-xl and InstructBLIP-T5-xxl models respectively. The results establish caption quality optimization as a critical factor in enhancing multimodal reasoning while maintaining model interpretability. This approach provides a viable pathway for content moderation systems to address evolving challenges in implicit hate speech detection across digital platforms.
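The third stage described above selects, among several candidate captions, the one best aligned with the image via contrastive cross-modal (CLIP-score style) similarity. A minimal sketch of that selection step is shown below, assuming image and caption embeddings have already been computed by a CLIP-like encoder; the function name and toy embeddings are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def select_best_caption(image_emb: np.ndarray, caption_embs: np.ndarray) -> int:
    """Return the index of the caption whose embedding has the highest
    cosine similarity to the image embedding (CLIP-score style selection).

    image_emb:    shape (d,)   -- image embedding
    caption_embs: shape (n, d) -- one embedding per candidate caption
    """
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    scores = caps @ img  # cosine similarities, shape (n,)
    return int(np.argmax(scores))

# Toy example with hand-made 3-d embeddings; in practice these would come
# from CLIP's image and text encoders.
image = np.array([1.0, 0.0, 0.0])
captions = np.array([
    [0.2, 0.9, 0.0],   # weakly aligned caption
    [0.9, 0.1, 0.0],   # well-aligned caption
    [0.0, 0.0, 1.0],   # unrelated caption
])
best = select_best_caption(image, captions)  # -> 1
```

In practice the same argmax-over-cosine-similarity rule applies regardless of embedding dimension; only the encoders producing `image_emb` and `caption_embs` change.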
Table of Contents
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis Outline
2 Related Works
2.1 Multimodal Large Language Model
2.1.1 BLIP-2
2.1.2 InstructBLIP
2.2 Hateful Meme Detection
2.3 Reference-free Image-Caption Scoring
3 Method
3.1 Stage 1: Diverse Image Caption Generation on Various Prompts
3.2 Stage 2: Diverse Caption Summarization from Large Language Model
3.3 Stage 3: CLIP-Score based Well-matched Caption Selection
4 Experiments
4.1 Experimental Setup
4.1.1 Dataset
4.1.2 Implementation Details
4.2 Main Results
4.2.1 Results of Hateful Meme Detection
4.2.2 Analysis of CLIP-Guided Caption Selection Effectiveness
4.3 Ablation Study
4.3.1 Effect of the Number of Captions
5 Conclusion
5.1 Limitations
5.2 Future Works
References

