CPPpred-En: Ensemble framework integrating a protein language model and conventional features for highly accurate cell-penetrating peptide prediction
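- Subject (keywords): cell penetrating peptide, machine learning, ensemble learning, protein language model, feature encoding, bioinformatics, peptide prediction
- Subject (DDC): 547
- Publisher: Ajou University Graduate School
- Advisor: Gwang Lee
- Year of publication: 2026
- Degree conferral date: February 2026
- Degree: Doctoral
- Department and major: Graduate School, Department of Molecular Science and Technology
- URI: http://www.dcollection.net/handler/ajou/000000035835
- Language of text: English
- Copyright: Ajou University theses are protected by copyright.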
Abstract
Cell-penetrating peptides (CPPs) have attracted considerable interest in biomedical research owing to their ability to traverse cellular membranes, which enables their use in drug delivery and therapeutic development. Accurately identifying CPPs is essential for expediting the development of novel peptide-based treatments. Existing approaches for CPP prediction rely mainly on either conventional features derived from peptide characteristics or one or two protein language models (PLMs); however, these strategies often fail to exploit the complementary strengths of heterogeneous feature representations. To overcome this limitation, we propose CPPpred-En, a predictive framework that assesses a wide range of conventional and PLM-derived features using multiple machine learning classifiers. The framework identifies the most effective feature-classifier pairs and combines them through ensemble learning to improve prediction performance. Trained on both the CPP924 and MLCPP 2.0 datasets, CPPpred-En surpassed existing state-of-the-art tools, reaching an accuracy (Acc) of 97.27% and a Matthews correlation coefficient (MCC) of 0.964 on the CPP924 dataset, and an Acc of 96.10% with an MCC of 0.707 on the MLCPP 2.0 dataset. This ensemble-based framework generalized well across datasets, underscoring its robustness and reliability. Integrating conventional features and PLM-based representations within an ensemble learning architecture is a promising direction for peptide-based drug discovery. CPPpred-En could serve as a highly accurate and dependable tool for identifying CPPs and has potential applications in targeted therapeutics and drug delivery.
Keywords: cell penetrating peptide, machine learning, ensemble learning, protein language model, feature encoding, bioinformatics, peptide prediction
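The selection-then-ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state how pairs are ranked or combined, so ranking by validation MCC and combining by soft voting (averaging predicted probabilities) are assumptions made here for illustration. The pair names, probabilities, and labels are hypothetical toy values, not results from CPP924 or MLCPP 2.0.

```python
import math

def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 = CPP and 0 = non-CPP."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def to_labels(probs, threshold=0.5):
    return [1 if p >= threshold else 0 for p in probs]

def select_top_pairs(val_probs, y_val, k):
    """Rank feature-classifier pairs by validation MCC (assumed criterion), keep the best k."""
    ranked = sorted(val_probs, key=lambda name: mcc(y_val, to_labels(val_probs[name])),
                    reverse=True)
    return ranked[:k]

def soft_vote(prob_lists, threshold=0.5):
    """Average the selected models' probabilities per sample (assumed ensemble rule)."""
    n_models = len(prob_lists)
    averaged = [sum(col) / n_models for col in zip(*prob_lists)]
    return to_labels(averaged, threshold)

# Hypothetical validation labels and per-pair predicted probabilities.
y_val = [1, 1, 1, 0, 0, 0]
val_probs = {
    "conventional_A+clf_1": [0.9, 0.8, 0.4, 0.2, 0.3, 0.1],
    "plm_A+clf_2":          [0.7, 0.4, 0.9, 0.1, 0.6, 0.2],
    "conventional_B+clf_3": [0.3, 0.6, 0.2, 0.7, 0.4, 0.6],
}

selected = select_top_pairs(val_probs, y_val, k=2)
ensemble_pred = soft_vote([val_probs[name] for name in selected])
print(selected, accuracy(y_val, ensemble_pred), mcc(y_val, ensemble_pred))
```

In this toy case each selected pair misclassifies a different peptide, so averaging their probabilities corrects both errors. The sketch also shows why the abstract reports MCC alongside Acc: MCC uses all four confusion-matrix cells, so it can be much lower than accuracy when the classes are imbalanced.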
Table of Contents
1. Introduction
2. Materials and Methods
2.1. Dataset
2.2. Framework of CPPpred-En
2.2.1. Conventional feature
2.2.2. PLM-based feature
2.2.3. Feature and Classifier Selection
2.2.4. Ensemble learning
2.2.5. Evaluation metrics
3. Results & Discussion
3.1. Evaluation of individual and ensemble model performance using the CPP924 and MLCPP 2.0 datasets
3.2. Comparison with existing CPP prediction models on the CPP924 and MLCPP 2.0 datasets
4. Ablation study
4.1. Impact of individual models on the performance of the ensemble
CONCLUSION
REFERENCES
Appendix Figures
Appendix Tables

