Search Details

Normalization Distillation for Heterogeneous Architecture

Abstract / Summary

Knowledge distillation has been established as a powerful approach for model compression and acceleration, improving model performance and efficiency through a teacher-student framework. However, many existing distillation methods are limited to scenarios where the teacher and student models share similar architectures. When applied to heterogeneous model architectures, these methods often yield suboptimal performance improvements and occasionally even degrade performance. This limitation is particularly prevalent in feature distillation approaches, where the discrepancy in representations between teacher and student models exacerbates these issues. Using centered kernel alignment (CKA), we observed that in scenarios involving heterogeneous architectures, the discrepancy in representational power between teacher and student models renders conventional logit-based and feature-based distillation methods ineffective. To address the challenge of heterogeneous model distillation, we propose NormKD, an efficient and effective knowledge distillation framework that significantly improves distillation performance across heterogeneous architectures. Specifically, NormKD transforms the norm-layer outputs of the student model to resemble those of the teacher model, projecting them into an aligned latent space to facilitate more effective and consistent knowledge transfer. Additionally, by employing centered kernel alignment, NormKD dynamically matches and distills knowledge between the most correlated layers of the teacher and student models. Experiments on benchmarks such as CIFAR-100 with diverse teacher-student pairs (e.g., CNNs and Transformers) show that our NormKD framework achieves superior performance in heterogeneous distillation scenarios.

Keywords: Knowledge Distillation, Heterogeneous Architecture, Transfer Learning.
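To make the CKA-based layer matching described in the abstract concrete, the following is a minimal sketch of linear CKA and a greedy matching of student layers to their most correlated teacher layers. It assumes per-sample features flattened to a (batch, dim) matrix; the helper names linear_cka and match_layers are illustrative and do not come from the thesis itself.

```python
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA similarity between two feature matrices of shape (n_samples, dim)."""
    # Center each feature dimension across the batch.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = torch.norm(y.t() @ x, p="fro") ** 2
    norm_x = torch.norm(x.t() @ x, p="fro")
    norm_y = torch.norm(y.t() @ y, p="fro")
    return cross / (norm_x * norm_y)


def match_layers(student_feats, teacher_feats):
    """For each student layer, return the index of the teacher layer with the highest CKA."""
    matches = []
    for s in student_feats:
        scores = [linear_cka(s.flatten(1), t.flatten(1)) for t in teacher_feats]
        matches.append(int(torch.stack(scores).argmax()))
    return matches
```

Under this sketch, the returned indices would decide which teacher norm-layer output each transformed student output is distilled toward; the actual matching and transform used by NormKD are specified in Sections 3.2 and 3.3 of the thesis.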


Table of Contents

1. Introduction
2. Related Work
2.1 Knowledge Distillation
2.2 Models
2.3 Normalization
3. Proposed Method
3.1 Revisit Knowledge Distillation
3.2 Normalization Transform
3.3 Adaptive Norm Layer Matching
3.4 Normalization Distillation
4. Experimental Results
4.1 Experimental Setup
4.2 CIFAR-100
4.3 Ablation Study
5. Conclusion
Bibliography
