Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl

Abstract

Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts text from Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware.
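To illustrate the adapter-based fine-tuning step described in the abstract, the sketch below shows how a quantized multilingual causal LM can be wrapped with LoRA adapters using the Hugging Face transformers, peft, and bitsandbytes libraries. The model checkpoint, LoRA rank, alpha, and target modules are illustrative assumptions, not the exact settings used in this thesis.

```python
# Minimal QLoRA setup sketch: load a multilingual LLM in 4-bit and attach LoRA adapters.
# The model name and LoRA hyper-parameters below are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "facebook/xglm-564m"  # any multilingual causal LM checkpoint

# 4-bit NF4 quantization keeps the frozen base weights small in VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (casts norms to fp32,
# enables gradient checkpointing by default).
model = prepare_model_for_kbit_training(model)

# Only the small LoRA matrices are trained; the 4-bit base model stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The adapter weights can then be trained on the collected monolingual corpus with a standard causal language modeling objective (for example via the transformers Trainer), which is what allows adaptation to fit on a single consumer GPU.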

Table of Contents

1 Introduction
1.1 Problem Definition
1.2 Motivation and Significance
1.3 Research Questions
1.4 Proposed Method
1.5 Contribution
1.6 Organization
2 Related Works
2.1 Multilingual Large Language Models
2.2 Large Multilingual or Monolingual Datasets
2.3 Common Crawl and Dataset Extraction
2.4 Deduplication
2.5 Low Resource Model Adaptation
3 Methods
3.1 Data Collection Framework
3.1.1 Index Filtering
3.1.2 Extracting WARC Files
3.1.3 Text Extraction
3.1.4 Deduplication
3.2 Low Resource Model Adaptation
4 Experimental Settings and Implementation Details
4.1 Languages, Benchmark Datasets, and Dataset Collection
4.1.1 Dataset Collection
4.1.2 Compute Requirements
4.1.3 Languages
4.1.4 Benchmark Datasets
4.2 Models and Model Adaptation Settings
4.2.1 Models
4.2.2 Model Adaptation
4.3 Evaluation Settings
4.3.1 Language Modeling Evaluation
4.3.2 Downstream Evaluation
4.3.2.1 Question Answering Task
4.3.2.2 Few Shot Prompting Evaluation
4.3.3 Evaluation Metrics
5 Performance Evaluation
5.1 Data Collection Evaluation
5.1.1 UnifiedCrawl Amharic
5.1.2 UnifiedCrawl for other Languages
5.1.3 Dataset Comparison with other Datasets
5.2 Method Evaluation
5.2.1 Language Modeling Evaluation
5.2.2 Downstream Few Shot Prompting
6 Ablation Studies
6.1 Comparison with Full Finetuning
6.2 Comparison with Training from Scratch
6.3 Comparison on Downstream Supervised Training
7 Limitations and Future Works
8 Conclusion
