R-CoA: A Relational Chain-of-Anchor Framework for Low-Resource Language Understanding

Abstract

Classical Chinese and Korean Literary Sinitic (KLS) represent a foundational yet severely under-resourced domain in natural language processing. Classical texts are characterized by radical parataxis, pervasive ellipsis, dense intertextual allusions, and genre-specific reasoning patterns rooted in commentarial and examination traditions. These properties suppress explicit syntactic cues and require external philological knowledge for interpretation, fundamentally diverging from the assumptions embedded in modern large language models. This dissertation frames classical language processing as a representation problem rather than a data-scaling problem. Through systematic evaluation, we show that state-of-the-art multilingual and instruction-tuned models consistently fail on classical texts despite strong performance on modern benchmarks. These failures—hallucinated context, anachronistic syntax, and breakdowns in inference—stem from a mismatch between classical linguistic structure and distributional learning assumptions, where sparse and elliptical contexts invalidate surface co-occurrence signals. To expose these limitations, we introduce KLSBench, a multi-task benchmark spanning classification, retrieval, punctuation restoration, natural language inference, and translation, with stratified metadata capturing genre, commentary density, and linguistic difficulty. To address representation instability, we propose Relational Chain-of-Anchor (R-CoA), a dual-space architecture that integrates distributional semantics with structured historical knowledge. R-CoA aligns classical texts with translations or commentaries via contrastive learning, while anchoring representations through knowledge graph relations modeled with TransE-style objectives. By freezing the backbone and applying parameter-efficient adaptation, R-CoA improves complex reasoning and retrieval performance while avoiding catastrophic forgetting. Empirical results demonstrate substantial gains on inference and large-scale retrieval tasks, alongside a clear task complexity threshold: knowledge-augmented architectures are most effective when baseline performance is moderate, but unnecessary for trivial tasks. Together, these findings establish design principles for historical and low-resource language processing and show that classical language understanding fundamentally benefits from integrating textual representations with symbolic knowledge.
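The abstract describes a dual-space objective: a contrastive (InfoNCE) loss aligning classical passages with their translations or commentaries, combined with a TransE-style margin loss that anchors representations to knowledge-graph relations. The sketch below is an illustrative reconstruction of that combination, not the thesis implementation; all tensor names, the loss weighting, and the use of in-batch negatives are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """Contrastive (InfoNCE) loss: each classical-text embedding should match
    its paired translation/commentary embedding against in-batch negatives."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, targets)

def transe_margin(head, rel, tail, neg_tail, margin=1.0):
    """TransE relational loss: for a triple (h, r, t), push ||h + r - t|| to be
    smaller than the distance to a corrupted (negative-sampled) tail by a margin."""
    pos = torch.norm(head + rel - tail, p=2, dim=-1)
    neg = torch.norm(head + rel - neg_tail, p=2, dim=-1)
    return F.relu(margin + pos - neg).mean()

# Toy usage with random vectors standing in for frozen-backbone features;
# the 0.5 weighting between the two heads is a placeholder, not a thesis value.
B, d = 8, 64
text = torch.randn(B, d)   # classical-passage embeddings
pair = torch.randn(B, d)   # paired translation/commentary embeddings
h, r, t = (torch.randn(B, d) for _ in range(3))
t_neg = torch.randn(B, d)  # corrupted tails from negative sampling

loss = info_nce(text, pair) + 0.5 * transe_margin(h, r, t, t_neg)
```

Under the frozen-backbone, parameter-efficient regime the abstract describes, a loss of this shape would update only lightweight adapter or head parameters, which is what allows the relational anchoring to be learned without overwriting the pretrained distributional space.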


Abstract (Korean)

Korean Literary Sinitic (KLS) constitutes a foundational yet severely under-resourced language domain in natural language processing (NLP). These classical texts are characterized by radical parataxis, pervasive ellipsis, dense intertextual allusion, and genre-specific modes of reasoning rooted in commentarial and civil-examination traditions. Such linguistic properties weaken explicit syntactic cues and make external philological knowledge indispensable for interpretation, fundamentally diverging from the assumptions embedded in modern large language models (LLMs). This dissertation reframes classical language processing as a problem of representation rather than of data scaling. Through systematic evaluation, it shows that state-of-the-art multilingual and instruction-tuned language models, despite strong performance on modern benchmarks, fail consistently on classical texts. These errors (hallucinated context, anachronistic syntax, and inference breakdown) arise from the sparse and elliptical composition of classical language, which invalidates the assumptions of distributional learning, in particular its reliance on surface co-occurrence signals. To expose these limitations, this study proposes KLSBench, a multi-task benchmark covering classification, retrieval, punctuation restoration, natural language inference (NLI), and translation, with stratified metadata on genre, commentary density, and linguistic difficulty. To resolve representation instability, this dissertation proposes the Relational Chain-of-Anchor (R-CoA) architecture: a dual-space design that combines distributional semantics with structured historical knowledge, aligning classical (古文) texts with translations and commentaries through contrastive learning while anchoring representations via TransE-based knowledge-graph relational objectives. In addition, by keeping the backbone frozen and applying parameter-efficient adaptation, R-CoA improves complex inference and retrieval performance without catastrophic forgetting. Experimental results show that R-CoA yields substantial gains on inference and large-scale retrieval tasks while revealing a clear task-complexity threshold: knowledge-augmented architectures are most effective when baseline performance is moderate, and unnecessary for trivial tasks. These findings provide design principles for historical and low-resource language processing and demonstrate that classical language understanding fundamentally benefits from the integration of textual representations with symbolic knowledge.


Table of Contents

1 Introduction 1
1.1 Motivation 1
1.2 Research Questions 4
1.2.1 Problem Space: The Representation Gap 4
1.2.2 Existing Approaches and Their Limitations 5
1.2.3 Natural Language Inference as a Diagnostic 6
1.2.4 Formulation of Research Questions 7
1.3 Contributions 9
1.3.1 KLSBench: A Comprehensive Benchmark for Classical Language Understanding 9
1.3.2 Systematic Characterization of LLM Failures on Classical Texts 10
1.3.3 R-CoA: Relational Chain-of-Anchor Framework 10
1.3.4 Design Principles for Historical Language Representation Learning 12
1.4 Thesis Roadmap 13
2 Background and Related Work 16
2.1 Low-Resource and Historical NLP 16
2.1.1 Defining Low-Resource NLP: Criteria and Dimensions 16
2.2 Methodological Approaches in NLP and Low-Resource Constraints 18
2.2.1 Knowledge Representation: RDF and Ontologies 19
2.2.2 Statistical Modeling: Surface-Level Probabilities and Sparsity 20
2.2.3 Vector Representation Models 21
2.3 Historical Languages as Extreme Low-Resource Settings 23
2.3.1 Case Study: Classical Chinese and Literary Sinitic (KLS) 25
3 KLSBench: Case Study 30
3.1 KLSBench v1: Foundational Task Suite 30
3.1.1 Task Overview 30
3.1.2 Classification: Literary Genre Recognition 30
3.1.3 Retrieval: Source Attribution 31
3.1.4 Punctuation Restoration 32
3.1.5 Natural Language Inference 33
3.1.6 Translation 33
3.2 Limitations of KLSBench v1 34
3.2.1 Representational Requirements Analysis 34
3.2.2 Empirical Evidence 34
Note on Retrieval Performance Scale 35
3.3 KLSBench v2: Knowledge Graph-Aware Extension 35
3.3.1 Design Rationale 36
3.3.2 Task Formalization 36
3.3.3 Dataset Construction 37
3.3.4 Expected Performance Patterns 38
3.4 Comparative Summary 39
4 R-CoA Framework 40
4.1 Conceptual Foundations 40
4.1.1 The Dual-Cord Metaphor 40
4.1.2 Knowledge Graph Anchors 41
4.1.3 Dual Embedding Spaces 42
4.2 Architecture 43
4.2.1 Model Overview 43
4.2.2 Training Objectives 46
Anchor Head: InfoNCE Contrastive Loss 46
KG Head: TransE Relational Loss 47
Multi-Hop Consistency Regularization 48
4.3 Optimization Strategy 49
4.3.1 Staged Training Pipeline 49
4.3.2 Negative Sampling Strategy 51
4.3.3 Hyperparameters and Training Details 51
4.4 Theoretical Considerations 51
4.4.1 Representation Stability via Anchors 53
4.4.2 Multi-Hop Transitivity 53
4.4.3 Gradient Conflict in Joint Training 53
4.4.4 Task-Specific Mode Selection 54
5 Experiments 55
5.1 Experimental Setup 55
5.1.1 Evaluation Benchmark 55
5.1.2 Baseline Systems 56
5.1.3 Evaluation Metrics 57
5.1.4 Implementation Details 58
5.2 Main Results 59
5.2.1 KLSBench v1 Performance 59
Task-Specific Mode Analysis 59
5.2.2 KLSBench v2: Relation Prediction 63
5.2.3 Catastrophic Forgetting Analysis 63
5.3 Ablation Studies 63
5.3.1 Ablation 1: Anchor Head vs. KG Head Only 63
5.3.2 Ablation 2: Joint Training vs. Staged Curriculum 64
5.3.3 Ablation 3: LoRA vs. Full Fine-tuning vs. Frozen Encoder 64
5.3.4 Ablation 4: Anti-Collapse Regularization 64
Experimental Design 64
Stepwise Performance Progression 65
5.3.5 Ablation 5: Data Scale vs. Architectural Complexity 66
5.4 Cross-Domain Transfer: C3 Benchmark 69
5.4.1 Domain Mismatch 70
5.4.2 Experimental Protocol 70
5.4.3 Results 70
5.5 Summary 78
6 Discussion and Conclusion 83
6.1 Answering the Research Questions 83
6.1.1 RQ1: Effects of Training Sufficiency on Retrieval Performance 83
6.1.2 RQ2: Stratified Performance Across Tasks and Metadata 87
6.1.3 RQ3: R-CoA's Representation Stabilization 89
6.1.4 RQ4: Model Paradigm Comparison 92
6.2 Core Contributions and Achievements 95
6.2.1 Contribution 1: Task-Adaptive Dual-Space Architecture 95
6.2.2 Contribution 2: Representation Stabilization Mechanisms 96
6.2.3 Contribution 3: Anti-Collapse Regularization 97
6.2.4 Contribution 4: Knowledge Graph Precision Boost 99
6.2.5 Contribution 5: Training Sufficiency as Dominant Factor 100
6.3 Design Principles for Low-Resource Classical NLP 102
6.3.1 Principle 1: Diagnose Before Redesigning 102
6.3.2 Principle 2: Temporally Isolate Conflicting Objectives 103
6.3.3 Principle 3: Task Complexity Threshold for Model Selection 104
6.3.4 Principle 4: Leverage Anti-Collapse Regularization Carefully 105
6.4 Interpretation of Observed Training Dynamics 106
6.4.1 Representation Collapse in Low-Resource Settings 106
Why Optimizer Iterations Influence Recovery 107
6.4.2 Heterogeneous Objectives and Gradient Conflicts 108
6.4.3 Model Collapse: Causes and Manifestations 109
6.4.4 Why Classical Languages Differ from Modern NLP 110
6.5 Generalization Beyond Classical Korean 111
6.6 Limitations and Future Directions 114
6.6.1 Boundary Conditions and Training Sufficiency 114
6.6.2 Future Directions 117
6.7 Conclusion 118
References 121
Appendix A: List of Publications 133
Appendix B: KLSBench Technical Details 135
Appendix C: Qualitative Inference Examples 147
국문초록 (Korean Abstract) 159
