
Rethinking the Training Paradigm of Discrete Token-Based Multimodal LLMs
Wansik Jo (Artificial Intelligence)

Abstract

Discrete token-based multimodal large language models (MLLMs), such as AnyGPT and MIO, integrate diverse modalities into an autoregressive framework by discretizing modality inputs into tokens compatible with language models. Unlike encoder-based approaches, such as LLaVA and Flamingo, which utilize pretrained modality-specific encoders, discrete token-based MLLMs simultaneously learn modality token representations and their alignment with language, yet are trained exclusively on modality-text paired datasets without additional unimodal training. We identify a structural limitation inherent in this training paradigm, termed text-centric bias, defined as an over-reliance on textual context that restricts intrinsic modality understanding. To systematically analyze the existence of this bias, we propose an analytical framework involving external perplexity-based and internal neuron-level analyses. Furthermore, to verify whether the bias originates from the paired-only training paradigm, we introduce an analytical methodology named Monotune, a simple unimodal training stage. Our analyses demonstrate that minimal exposure to unimodal data effectively mitigates text-centric bias, providing empirical evidence that the bias is fundamentally induced by the paired-only training strategy. Through comprehensive downstream task evaluations, we further reveal that this structural bias meaningfully affects real-world multimodal task performance, particularly under limited textual contexts. Our findings highlight a fundamental limitation in current discrete token-based MLLM training paradigms and suggest directions for future multimodal training strategies.
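The external, perplexity-based analysis mentioned above can be sketched in a few lines: score the same sequence of modality tokens under an autoregressive model once with and once without the accompanying text, and compare the resulting perplexities. The sketch below is illustrative only; the log-probability values are invented for demonstration and are not results from the thesis.

```python
import math

def perplexity(log_probs):
    # Perplexity is exp of the mean negative log-likelihood per token.
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token log-probabilities of discrete image tokens,
# as an autoregressive MLLM might assign them under two conditions
# (illustrative values only):
with_text = [-1.2, -0.8, -1.0, -0.9]      # conditioned on a paired caption
without_text = [-3.1, -2.7, -2.9, -3.3]   # same tokens, no textual context

gap = perplexity(without_text) / perplexity(with_text)
# A large ratio suggests the model relies heavily on the textual context
# to predict modality tokens, i.e. the text-centric bias under analysis.
print(f"PPL with text: {perplexity(with_text):.2f}, "
      f"without text: {perplexity(without_text):.2f}, ratio: {gap:.2f}")
```

In practice the log-probabilities would come from the MLLM itself; the ratio (or difference) between the two conditions serves as an external, model-agnostic signal of how much the modality-token predictions depend on text.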


Table of Contents

1 Introduction
1.1 Motivation
1.2 Contributions
2 Related Works
2.1 Biases in Unimodal and Multimodal Models
2.2 Interpretability Methods for Investigating Model Behavior
3 Analyzing Text-Centric Bias
3.1 Background: Encoder-Based vs. Discrete Token-Based MLLMs
3.2 Text-Centric Bias: Definition and Origin
3.3 Probabilistic Analysis: Perplexity
3.4 Parametric Analysis: Specialized Neurons
4 Mitigating Text-Centric Bias
4.1 Monotune: A Unimodal Training Stage
4.2 Experimental Setup
4.3 Evaluating the Effect of Monotune
4.3.1 External Evaluation via Perplexity Analysis
4.3.2 Internal Evaluation via Neuron-Masking Analysis
5 How Text-Centric Bias Affects Downstream Tasks
5.1 Direct Text-Conditioned Tasks
5.2 Indirect Text-Conditioned Tasks
5.3 Modality Generation Quality
6 Conclusion
References
