OMOP-CDM ETL (Extract, Transform, Load) Semi-Automation Using Large Language Models
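- Subject (keywords): OMOP-CDM, Large Language Models, Extract-Transform-Load
- Subject (DDC): 610
- Publisher: Ajou University Graduate School
- Advisor: Rae Woong Park
- Year of publication: 2025
- Degree conferral date: February 2025
- Degree: Doctorate
- Department and major: Graduate School, Department of Medicine
- URI: http://www.dcollection.net/handler/ajou/000000034400
- Language of text: English
- Copyright: Ajou University theses are protected by copyright.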
Abstract
This study explores the use of large language models (LLMs) in the standardization of medical data across four key areas: design, code mapping, implementation, and quality control. It introduces an optimized Retrieval-Augmented Generation structure (LLM-RAG) that semi-automates the OMOP-CDM ETL process and reduces human intervention. In design, LLM-RAG achieved over 95% accuracy in tasks such as Table-to-Table and Column-to-Column mapping, outperforming standard LLMs by 8-21% in specific tasks. In human evaluations of code mapping, LLM-RAG scored 84.5%, significantly higher than Usagi’s 42.3%, although slightly lower than the 92.3% scored by LLMs alone. However, more than 60% of the concept names and 99% of the concept IDs produced by LLMs alone were hallucinated, which highlights the need for RAG to ensure reliable mapping. In implementation, LLM-generated code improved Themis compliance checks from 16 to 25 and showed lower complexity than expert-created code. In quality control, LLM-enhanced ETL processes improved conformance (93.9 to 95.3), plausibility (92.7 to 94.2), and completeness (97.3 to 98.1), driven by better concept standardization and fewer null values. Although MIMIC-III was used for this study, future work must validate the approach on more complex EMR datasets, with the aim of an AI-driven, end-to-end ETL process.
Keywords: OMOP-CDM, Large Language Models, Extract-Transform-Load
Table of Contents
I. INTRODUCTION
A. BACKGROUND
1. OMOP-CDM AND CHALLENGES IN STANDARDIZATION
2. TRADITIONAL ETL METHODS AND CHALLENGES
3. LARGE LANGUAGE MODELS (LLMS) IN DATA STANDARDIZATION
B. OBJECTIVES
II. MATERIALS AND METHODS
A. DATA SOURCES
B. OVERALL PROCESS
C. LARGE LANGUAGE MODELS
D. DESIGN
E. CODE MAPPING
F. IMPLEMENTING
G. QUALITY CONTROL
III. RESULTS
A. DESIGN
B. CODE MAPPING
C. IMPLEMENTING
D. QUALITY CONTROL
IV. DISCUSSION
A. MAIN FINDINGS
B. DESIGN
C. CODE MAPPING
D. IMPLEMENTING
E. QUALITY CONTROL
F. LIMITATIONS
V. CONCLUSION

