Pre-trained Language Model Fine-tuning on Non-textual Data Understanding
- Subject (Keywords): Pre-trained language model, Non-textual data, Language model transferability
- Subject (DDC): 006.31
- Issuing Institution: Ajou University, Graduate School
- Advisor: 조현석
- Year of Publication: 2025
- Date of Degree Conferral: 2025. 2
- Degree: Master's
- Department and Major: Department of Artificial Intelligence, Graduate School
- URI: http://www.dcollection.net/handler/ajou/000000034362
- Language: English
- Copyright: Ajou University theses are protected by copyright.
Abstract
Pre-training a language model on a large text corpus improves performance on downstream tasks, which can then be learned with only a small amount of fine-tuning data. Unlike text corpora, which are easy to collect, non-textual data is difficult to gather in large quantities, so the existing approach of pre-training a model on non-textual data is costly. Therefore, an approach that fine-tunes a pre-trained language model directly on non-textual data is needed in order to apply such models to downstream tasks. Recently, pre-trained language models have become easy to share and access. Although these various pre-trained language models can be applied to non-textual data, the model must be selected in consideration of the characteristics of each pre-trained language model and of the non-textual data. With these characteristics in mind, we analyze the impact of different pre-trained language models and hyperparameter settings on non-textual data fine-tuning. To this end, we use a BERT model pre-trained on a natural-language corpus and a CodeBERT model pre-trained on programming code, an artificial language. In addition, to investigate the influence of hyperparameters in the pre-trained language model, we analyze several perspectives, such as the domain of the non-textual data, the type of pre-training language, the length of the input token sequence, and the token frequency, and we find that optimal hyperparameters exist for each case. In conclusion, this thesis observes and analyzes the characteristics of the input data and of the pre-trained language model that should be considered when fine-tuning non-textual data with a pre-trained language model. The goal of this study is to present an approach that achieves better performance when fine-tuning on non-textual data by applying these analysis results.
Keywords: Pre-trained language model, Non-textual data, Language model transferability
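To make the setup the abstract describes more concrete, the following is a minimal sketch (not the thesis code) of fine-tuning a pre-trained language model on non-textual sequence-classification data using the Hugging Face transformers library. The model checkpoints, the toy DNA-style inputs, the number of labels, and the max_length value are illustrative assumptions rather than details taken from the thesis.

```python
# Minimal sketch: fine-tuning a pre-trained language model on non-textual
# sequences treated as text. All data and hyperparameter choices below are
# illustrative assumptions, not the thesis's actual configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Choose the pre-training language type: natural language (BERT) or
# programming code (CodeBERT).
model_name = "bert-base-uncased"  # or "microsoft/codebert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy non-textual inputs: DNA fragments written out as character sequences.
sequences = ["A T G C G T A C", "G G C A T T A C"]
labels = torch.tensor([0, 1])

# Input token sequence length is one of the hyperparameters studied in the
# thesis; here max_length is fixed to 128 purely for illustration.
batch = tokenizer(sequences, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

# One fine-tuning step: standard cross-entropy loss on the classification head.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(f"loss: {outputs.loss.item():.4f}")
```

In this kind of setup, swapping the checkpoint name is all that is needed to compare pre-training language types, while the tokenizer's max_length controls the input token sequence length that the thesis analyzes as a hyperparameter.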
Table of Contents
1 Introduction 1
1.1 Motivation 1
1.2 Contributions 2
1.3 Thesis Outline 2
2 Related Works 4
2.1 Pre-trained Language Model 4
2.2 Pre-trained Language Model Structure 5
2.2.1 Transformer 5
2.2.2 BERT 6
2.2.3 RoBERTa 7
2.2.4 CodeBERT 7
2.3 Non-textual Data 8
2.3.1 DNA Sequence Classification Data 8
2.3.2 Protein Sequence Classification Data 9
2.3.3 Music Composer Classification Data 9
2.4 Non-textual Data Fine-tuning 10
2.5 Non-textual Data Fine-tuning on Pre-trained Language Model 10
3 Approach 11
3.1 Language Type 11
3.2 Input Token Sequence Length 13
3.3 Token Frequency 14
4 Experiments 16
4.1 Experimental Setting 16
4.1.1 Dataset 16
4.1.2 Pre-trained Model Selection 17
4.1.3 Evaluation Metrics 17
4.2 Results 18
4.2.1 Language Type Experiment 18
4.2.2 Input Token Sequence Length Experiment 18
4.2.3 Token Frequency Experiment 22
5 Discussion 24
5.1 Effect of Pre-training Language Types 24
5.2 Effect of Input Token Sequence Length Distribution 25
5.3 Effect of Token Frequency Based Token Mapping 25
6 Conclusion 29
6.1 Limitation 29
6.1.1 Limited Pre-training Datasets Analysis 29
6.1.2 Model Comparison Fairness 30
6.2 Future Works 30
6.2.1 Pre-training Transferability on Multimodal Data 30
6.2.2 Token Frequency Distribution Based One-to-one Token Mapping 30
References 31
Appendix 35
A Pre-training Dataset Analysis 35
B Input Token Sequence Length Overall Experiment Results 36

