Rethinking the Training Paradigm of Discrete Token-Based Multimodal LLMs
Wansik Jo
- Subject (Keywords): Multimodal Large Language Model, Discrete Token-Based Models, Text-Centric Bias, Neuron-Level Interpretability, Uni-modal Continual Pre-Training
- Subject (DDC): 006.31
- Publisher: Graduate School of Ajou University
- Advisor: 조현석
- Year of Publication: 2025
- Date of Degree Conferral: August 2025
- Degree: Master's
- Department and Major: Department of Artificial Intelligence, Graduate School
- URI: http://www.dcollection.net/handler/ajou/000000035070
- Language: English
- Copyright: Ajou University theses are protected by copyright.
Abstract
Discrete token-based multimodal large language models (MLLMs), such as AnyGPT and MIO, integrate diverse modalities into an autoregressive framework by discretizing modality inputs into tokens compatible with language models. Unlike encoder-based approaches, such as LLaVA and Flamingo, which utilize pretrained modality-specific encoders, discrete token-based MLLMs simultaneously learn modality token representations and their alignment with language, yet are trained exclusively on modality-text paired datasets without additional unimodal training. We identify a structural limitation inherent in this training paradigm, termed text-centric bias, defined as an over-reliance on textual context that restricts intrinsic modality understanding. To systematically analyze the existence of this bias, we propose an analytical framework involving external perplexity-based and internal neuron-level analyses. Furthermore, to verify whether the bias originates from the paired-only training paradigm, we introduce an analytical methodology named Monotune, a simple unimodal training stage. Our analyses demonstrate that minimal exposure to unimodal data effectively mitigates text-centric bias, providing empirical evidence that the bias is fundamentally induced by the paired-only training strategy. Through comprehensive downstream task evaluations, we further reveal that this structural bias meaningfully affects real-world multimodal task performance, particularly under limited textual contexts. Our findings highlight a fundamental limitation in current discrete token-based MLLM training paradigms and suggest directions for future multimodal training strategies.
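The external, perplexity-based side of the analysis can be illustrated with a minimal sketch: score the same modality-token span under the model twice, once with the paired text prompt in context and once without, and compare perplexities. The function name `perplexity` and the log-probability values below are purely illustrative assumptions, not figures or an API from the thesis.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(mean negative log-likelihood) over a token span."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# Hypothetical per-token log-probs for the same image-token span, scored by
# the model with and without the paired text prompt in context.
with_text = [-1.2, -0.9, -1.1, -1.0]      # text context present
without_text = [-2.6, -2.4, -2.9, -2.7]   # text context removed

ppl_with = perplexity(with_text)
ppl_without = perplexity(without_text)

# A large ratio (ppl_without >> ppl_with) would signal that the model leans
# on textual context rather than intrinsic modality modeling, i.e. the
# text-centric bias described in the abstract.
gap = ppl_without / ppl_with
```

Under this reading, a unimodal training stage such as Monotune would be expected to shrink `gap` by improving the no-text perplexity.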
Table of Contents
1 Introduction 1
1.1 Motivation 1
1.2 Contributions 4
2 Related Works 5
2.1 Biases in Unimodal and Multimodal Models 5
2.2 Interpretability Methods for Investigating Model Behavior 6
3 Analyzing Text-Centric Bias 7
3.1 Background: Encoder-Based vs. Discrete Token-Based MLLMs 7
3.2 Text-Centric Bias: Definition and Origin 8
3.3 Probabilistic Analysis: Perplexity 9
3.4 Parametric Analysis: Specialized Neurons 10
4 Mitigating Text-Centric Bias 13
4.1 Monotune: A Unimodal Training Stage 13
4.2 Experimental Setup 14
4.3 Evaluating the Effect of Monotune 15
4.3.1 External Evaluation via Perplexity Analysis 15
4.3.2 Internal Evaluation via Neuron-Masking Analysis 16
5 How Text-Centric Bias Affects Downstream Tasks 17
5.1 Direct Text-Conditioned Tasks 17
5.2 Indirect Text-Conditioned Tasks 18
5.3 Modality Generation Quality 19
6 Conclusion 21
References 22

