Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl

Abstract

Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts text from Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware.
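To illustrate the adapter-based fine-tuning step described in the abstract, the sketch below shows how a quantized multilingual causal LM can be wrapped with LoRA adapters using the Hugging Face transformers, peft, and bitsandbytes libraries. The model checkpoint, LoRA rank, alpha, and target modules are illustrative assumptions, not the exact settings used in this thesis.

```python
# Minimal QLoRA setup sketch: load a multilingual LLM in 4-bit and attach LoRA adapters.
# The model name and LoRA hyper-parameters below are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "facebook/xglm-564m"  # any multilingual causal LM checkpoint

# 4-bit NF4 quantization keeps the frozen base weights small in VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (casts norms to fp32,
# enables gradient checkpointing by default).
model = prepare_model_for_kbit_training(model)

# Only the small LoRA matrices are trained; the 4-bit base model stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The adapter weights can then be trained on the collected monolingual corpus with a standard causal language modeling objective (for example via the transformers Trainer), which is what allows adaptation to fit on a single consumer GPU.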

Table of Contents

1 Introduction
1.1 Problem Definition
1.2 Motivation and Significance
1.3 Research Questions
1.4 Proposed Method
1.5 Contribution
1.6 Organization
2 Related Works
2.1 Multilingual Large Language Models
2.2 Large Multilingual or Monolingual Datasets
2.3 Common Crawl and Dataset Extraction
2.4 Deduplication
2.5 Low Resource Model Adaptation
3 Methods
3.1 Data Collection Framework
3.1.1 Index Filtering
3.1.2 Extracting WARC Files
3.1.3 Text Extraction
3.1.4 Deduplication
3.2 Low Resource Model Adaptation
4 Experimental Settings and Implementation Details
4.1 Languages, Benchmark Datasets, and Dataset Collection
4.1.1 Dataset Collection
4.1.2 Compute Requirements
4.1.3 Languages
4.1.4 Benchmark Datasets
4.2 Models and Model Adaptation Settings
4.2.1 Models
4.2.2 Model Adaptation
4.3 Evaluation Settings
4.3.1 Language Modeling Evaluation
4.3.2 Downstream Evaluation
4.3.2.1 Question Answering Task
4.3.2.2 Few Shot Prompting Evaluation
4.3.3 Evaluation Metrics
5 Performance Evaluation
5.1 Data Collection Evaluation
5.1.1 UnifiedCrawl Amharic
5.1.2 UnifiedCrawl for other Languages
5.1.3 Dataset Comparison with other Datasets
5.2 Method Evaluation
5.2.1 Language Modeling Evaluation
5.2.2 Downstream Few Shot Prompting
6 Ablation Studies
6.1 Comparison with Full Finetuning
6.2 Comparison with Training from Scratch
6.3 Comparison on Downstream Supervised Training
7 Limitations and Future Works
8 Conclusion
