CLIP-based Image Caption Prompting for Zero-shot Hateful Meme Detection

Abstract

Recent advancements in vision-language models have significantly enhanced multimodal understanding capabilities, yet substantial challenges remain in detecting implicitly hateful memes that rely on subtle image-text interplay. This paper introduces a novel framework integrating CLIP-guided caption optimization with large language model summarization to bridge the modality gap in zero-shot hate speech detection. Our three-stage architecture first generates diverse image descriptions through strategic prompt engineering, then synthesizes these into semantically dense captions, and finally selects optimal representations using contrastive cross-modal alignment. Extensive evaluations on the Facebook Hateful Memes benchmark demonstrate enhanced performance, with accuracy improvements of 3.4% and 1.0% for InstructBLIP-T5-xl and InstructBLIP-T5-xxl models respectively. The results establish caption quality optimization as a critical factor in enhancing multimodal reasoning while maintaining model interpretability. This approach provides a viable pathway for content moderation systems to address evolving challenges in implicit hate speech detection across digital platforms.
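The final selection stage described above can be illustrated with a minimal sketch. It assumes the image and each candidate caption have already been embedded by a CLIP-style encoder, and it scores each pair with the commonly used reference-free CLIPScore form, w * max(cos, 0) with w = 2.5. Whether the thesis uses exactly this weighting is an assumption, and the function names and toy vectors below are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clip_score(image_emb, text_emb, w=2.5):
    """Reference-free CLIPScore form: w * max(cos(image, text), 0).

    The weight w = 2.5 follows the common CLIPScore definition;
    its use here is an assumption, not taken from the thesis.
    """
    return w * max(cosine(image_emb, text_emb), 0.0)

def select_caption(image_emb, candidates):
    """Return the caption text whose embedding best matches the image.

    candidates: list of (caption_text, caption_embedding) pairs.
    """
    best_text, _ = max(candidates, key=lambda c: clip_score(image_emb, c[1]))
    return best_text

# Toy embeddings standing in for real encoder outputs (hypothetical values).
image_emb = [1.0, 0.0, 0.0]
candidates = [
    ("unrelated caption", [0.0, 1.0, 0.0]),
    ("well-matched caption", [0.8, 0.6, 0.0]),
    ("contradictory caption", [-1.0, 0.0, 0.0]),
]
best = select_caption(image_emb, candidates)
```

In a real pipeline the embeddings would come from the image and text encoders of a CLIP model, and the selected caption would then be passed to the detection model together with the meme text.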

Table of Contents

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis Outline
2 Related Works
2.1 Multimodal Large Language Model
2.1.1 BLIP-2
2.1.2 InstructBLIP
2.2 Hateful Meme Detection
2.3 Reference-free Image-Caption Scoring
3 Method
3.1 Stage 1: Diverse Image Caption Generation on Various Prompts
3.2 Stage 2: Diverse Caption Summarization from Large Language Model
3.3 Stage 3: CLIP-Score based Well-matched Caption Selection
4 Experiments
4.1 Experimental Setup
4.1.1 Dataset
4.1.2 Implementation Details
4.2 Main Results
4.2.1 Results of Hateful Meme Detection
4.2.2 Analysis of CLIP-Guided Caption Selection Effectiveness
4.3 Ablation Study
4.3.1 Effect of the Number of Captions
5 Conclusion
5.1 Limitations
5.2 Future Works
References
