
Medical record summarization framework using episodic memory generating agent for clinical context and causality preservation

Abstract

The volume and complexity of electronic medical records (EMRs) are rapidly increasing, driven not only by advances in digital documentation systems but also by the growing use of diagnostic tools, diverse therapeutic interventions, and long-term patient monitoring. This expansion has produced temporally dense and semantically rich datasets. Effective interpretation therefore requires models that can reason over time, preserve causal relationships, and maintain clinical context. However, large language models (LLMs) are constrained by fixed context windows, which limits their ability to process long and complex patient narratives. To address these challenges, we propose a memory-augmented summarization framework built on the Memory Context/Causality Attuned Representation Encoding (MemCARE) agent, which constructs problem-based episodic memory to support temporal reasoning, clinical summarization, and decision making.

We developed a memory-augmented framework that connects structured long-term memory to an LLM to support longitudinal clinical reasoning. At the center of this framework is the MemCARE agent, which constructs episodic memory by summarizing and organizing clinical notes into temporally ordered, problem-linked segments. The system also includes preprocessing for episode segmentation, an Emergency Room (ER)-specific knowledge base, and a chatbot interface tailored for usability. The data were derived from electronic health records structured in the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) at Ajou University School of Medicine (AUSOM). Our evaluation proceeded in stages. First, we tested whether temporal memory alone (refine-style summarization) improves information retention compared with a stuff-style baseline (the whole text is input as a single chunk).
Second, we compared refine-style summarization with the MemCARE agent to determine which produces outputs better suited to serve as components of long-term memory. We then evaluated reasoning tasks. For causal chain generation, we compared retrieval-augmented generation (RAG) from raw records against agent-generated memory, measuring the proportion of cases in which each source was selected and the strength of preference on a scale from -10 (toward RAG) to +10 (toward memory). For temporal reconstruction, we compared RAG, memory-based, and combined approaches. Outputs were rated as either perfectly correct (score: 5) or partially correct (score: 2.5) based on predefined evaluation criteria. Next, we tested whether integrating an ER-specific knowledge base improved severity scoring. Finally, we implemented the system in a chatbot interface using high-parameter LLMs and evaluated its clinical usability. Evaluation metrics included ROUGE, BLEU, and METEOR for surface similarity, expert ratings for clinical quality, LLM-based scoring for summary correctness and completeness, and preference strength for reasoning tasks. Temporal accuracy for chronology reconstruction was assessed by human evaluators.

Compared with the stuff-style baseline, temporal memory strategies such as refine-style summarization significantly improved both surface-level and semantic scores with qwen3:32b. With the larger model (gpt-oss:120b), the improvements diminished, especially in surface similarity, but significant gains in LLM-based evaluations of correctness (difference: 0.42, p<0.001) and completeness (difference: 0.75, p<0.001) were preserved. In the problem-based summarization task, the MemCARE agent outperformed the memory refine method.
Expert Likert-scale evaluations provided stronger support, with MemCARE receiving higher ratings in conciseness (difference: 1.007, p<0.001), classification (difference: 0.572, p<0.001), temporal ordering (difference: 0.613, p<0.001), and correctness (difference: 0.313, p<0.001). For causal chain generation, evaluated using an LLM-based preference scoring system, the MemCARE agent was preferred over RAG in 74% of cases, with a total preference score of +253 and a mean score of +3.28, indicating stronger causal linkage and narrative coherence. In temporal reconstruction, mean scores increased from 1.70 (RAG) to 3.45 (memory) and 3.95 (RAG + memory). Lastly, in the ER severity scoring task, integrating an emergency-specific knowledge base (ERKB) led to a marked difference in scoring outputs. The mean score differed by approximately nine points when the ERKB was used, suggesting that the knowledge base meaningfully influenced severity estimation in the ER context. Furthermore, the chatbot interface demonstrated usability in simulated ER scenarios.

In conclusion, this study demonstrates the potential of memory-augmented architectures for clinical language modeling by integrating temporally structured, problem-linked long-term memory with large language models. The MemCARE agent effectively constructs episodic memory representations that enhance clinical summarization, causal inference, and temporal reconstruction across longitudinal patient records. Our results show that even basic temporal memory strategies mitigate context window limitations, and that structured long-term memory substantially improves reasoning performance, particularly in complex tasks such as problem-based summarization and causal chain generation. The incorporation of ER-specific knowledge and a usability-focused chatbot interface further supports contextual adaptation and practical applicability.
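
The difference between the stuff-style baseline and refine-style summarization described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the `LLM` callable, the prompt wording, and the function names are all assumptions introduced here for illustration, and the sketch assumes notes arrive in chronological order.

```python
from typing import Callable, List

# Hypothetical LLM interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


def stuff_summarize(notes: List[str], llm: LLM) -> str:
    """Stuff style: concatenate every note and summarize in a single call."""
    joined = "\n\n".join(notes)
    return llm("Summarize these clinical notes:\n\n" + joined)


def refine_summarize(notes: List[str], llm: LLM) -> str:
    """Refine style: carry a running summary forward, one note at a time."""
    summary = ""
    for note in notes:  # assumed to be in chronological order
        if summary:
            # Later notes are merged into the summary built so far.
            prompt = (
                "Current summary:\n" + summary
                + "\n\nNew clinical note:\n" + note
                + "\n\nUpdate the summary to include the new note."
            )
        else:
            # The first note has no prior summary to refine.
            prompt = "Summarize this clinical note:\n\n" + note
        summary = llm(prompt)
    return summary
```

In the stuff style the whole record must fit into one prompt, whereas the refine style only ever needs the running summary plus one note in the prompt, which is why it can act as a simple form of temporal memory under a fixed context window.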
Keywords: Agent, Artificial Intelligence Memory, Electronic Medical Record, Large Language Model, Summarization
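
The preference measures reported for causal chain generation (share of cases favoring memory, total score, and mean score on the -10 to +10 scale) can be aggregated as in the sketch below. The function name, the treatment of a score of 0 as no preference, and the example values are assumptions for illustration only; they are not the thesis data or its scoring code.

```python
from typing import Dict, List


def summarize_preferences(scores: List[int]) -> Dict[str, float]:
    """Aggregate per-case preference scores.

    Each score lies in [-10, 10]: negative values favor RAG,
    positive values favor agent-generated memory.
    Assumption: a score of 0 counts as no preference for either source.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s < -10 or s > 10 for s in scores):
        raise ValueError("scores must lie between -10 and +10")
    n = len(scores)
    memory_wins = sum(1 for s in scores if s > 0)
    total = sum(scores)
    return {
        "n": n,
        "memory_preferred_share": memory_wins / n,
        "total_score": total,
        "mean_score": total / n,
    }


# Illustrative values only, not taken from the study.
example = summarize_preferences([6, 4, -3, 8, 0])
```

Reporting the share and the mean together is useful because the share shows how often memory was chosen, while the mean shows how strongly it was favored when the signed scores are averaged.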


Table of Contents

I. Introduction
A. Background
1. The growing need for medical record summarization
2. Characteristics of a high-quality clinical summary
3. Limitations of current large language models for clinical summarization
4. From human cognition to AI memory
5. Agent and memory-augmented clinical reasoning
6. Previous studies
7. Our approach
B. Objectives
II. Materials and Methods
A. Data sources
B. Data preprocessing
C. Summary construction strategies
D. Summary evaluation framework
E. Evaluation metrics
1. Surface similarity
2. Semantic similarity
3. Metrics used by judge
4. LLM as judge
5. Human as judge
F. Statistical analysis
G. Model and experimental configuration
III. Results
A. Cohort demographics and clinical characteristics
B. Temporal memory mitigates context window limitations
C. Performance of problem-based summarization between memory refiner and MemCARE agent
D. Causal reasoning performance by information source
E. Temporal reconstruction performance by different information sources
F. Influence of ER contextual knowledge on severity estimation
G. Integration of chatbot and agent-developed long-term memory for clinical application
IV. Discussion
A. Main findings
1. Memory mitigates context window limitations
2. Problem-based summarization
3. Causal reasoning
4. Temporal reconstruction
5. Influence of ER contextual knowledge
6. Integration of chatbot and agent-developed long-term memory
B. Limitations
C. Further research
V. Conclusion
References
Appendix
Korean abstract
