
Development of cancer pathology data model and natural language processing based data conversion methodology

암 병리 데이터 모델 및 자연어처리기반 데이터 변환 방법론 개발

Abstract

According to cancer statistics, the total number of cancer patients in Korea in 2017 was 232,255, which is 1,019 more than in 2016. Various tests are conducted to diagnose and treat cancer; consequently, a large amount of unstructured clinical data, including text, images, and videos, is produced. The cancer pathology report is an extremely important source of guidance for cancer diagnosis and treatment because it contains information on cancer type, characteristics, and stage. However, the report is usually written as free text, which must be converted into structured data before machines can process it. Therefore, this study aims to develop a cancer pathology data model for creating structured data and, based on that data model, a cancer pathology data conversion methodology that uses a natural language processing (NLP) model for information extraction. For this purpose, we collected overseas and domestic pathology reports and documents related to breast, thyroid, colorectal, and gastric cancers, which have a high incidence in Korea. We then analyzed the structure of these reports and established a dictionary of the vocabulary used in them. On this basis, we developed a cancer pathology data model comprising four tables: specimen basic information, specimen common observation information, specimen-specific observation information, and immunohistochemical test information. To extract information from the unstructured text of cancer pathology reports, two types of models were developed. The first is a named entity recognition (NER) model that applies a convolutional neural network (CNN) algorithm using spaCy, an NLP library. The second is a hybrid model created by adding a rule-based algorithm to the first model.
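The abstract does not say how the rule-based algorithm was combined with the CNN model. As a minimal sketch of one plausible arrangement, the following assumes that the statistical model's entity spans are kept as-is and that regex rules only add spans the model missed. The label names, the patterns, and the `merge_entities` helper are hypothetical illustrations, and the model output is passed in as plain tuples rather than produced by spaCy.

```python
import re

# Hypothetical rules: labels and patterns are illustrative, not taken from the thesis.
RULES = {
    "TUMOR_SIZE": re.compile(
        r"\b\d+(?:\.\d+)?\s*(?:x\s*\d+(?:\.\d+)?\s*)*cm\b", re.IGNORECASE
    ),
    "HISTOLOGIC_GRADE": re.compile(r"\bgrade\s+(?:[1-3]|I{1,3})\b", re.IGNORECASE),
}


def rule_entities(text):
    """Return (start, end, label) spans found by the regex rules."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_entities(model_spans, rule_spans):
    """Keep every model span; add rule spans that overlap none of the kept spans."""
    merged = list(model_spans)
    for span in rule_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


text = "Invasive ductal carcinoma, grade 2, size 2.5 x 1.3 cm"
# Stand-in for the statistical model's output on this sentence (assumed, not real).
model_spans = [(16, 25, "HISTOLOGIC_TYPE"), (27, 34, "HISTOLOGIC_GRADE")]
for start, end, label in merge_entities(model_spans, rule_entities(text)):
    print(label, "->", text[start:end])
```

Giving the model priority on overlaps is only one design choice; a rules-first policy would be equally easy to express by swapping the two arguments.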
A total of 1,200 reports were randomly selected from Ajou University Hospital's pathology reports for the four types of cancer, and entity annotation was then performed. The model was trained on 960 training sets, and its performance was evaluated on 240 test sets. To assess how well the model generalizes, 200 cancer pathology reports from external institution B and 30 sets of online data were prepared and used. The data conversion methodology was developed in two steps: first, the cancer pathology data model was populated by applying the final selected NER model to Ajou University Hospital's pathology reports for the four types of cancer; second, an extract, transform, and load (ETL) process was designed to convert these data into the OMOP Common Data Model (CDM). To verify the methodology, 400 sets were randomly extracted from the resulting cancer pathology data model and common data model, and their accuracy was manually reviewed by one researcher. In the comparison of the two NER models, the single CNN-based model achieved an F1-score of 0.965, which was 0.111 higher than that of the hybrid model. When the models were evaluated on the external data, F1-scores of 0.711 and 0.854 were obtained, suggesting that the approach can be applied to external institutions. The cancer pathology data model built with the newly developed data conversion methodology confirmed that the data can be generalized regardless of cancer type. Moreover, the manual review of the data showed an accuracy of 96.91%. The cancer pathology data model and data conversion methodology proposed in this study are highly effective for extracting, storing, and using large amounts of varied data from different types of cancer pathology reports. Together with existing clinical data, this is expected to help advance precision research for cancer treatment.
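To make the evaluation figures concrete, the sketch below shows a reproducible 960/240 split of 1,200 report identifiers and an F1-score computed over entity spans. It assumes exact-match scoring on `(start, end, label)` tuples; the abstract does not state which matching or averaging scheme the study used, and the function names and seed are hypothetical.

```python
import random


def split_reports(report_ids, n_train, seed=0):
    """Shuffle report ids reproducibly and split them into train and test lists."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


def entity_f1(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of (start, end, label)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


train_ids, test_ids = split_reports(range(1200), n_train=960)
print(len(train_ids), len(test_ids))

# Toy annotations (assumed): two of three predicted spans match the gold spans.
gold = [(0, 5, "A"), (6, 9, "B"), (10, 12, "C")]
predicted = [(0, 5, "A"), (6, 9, "B"), (20, 22, "C")]
print(entity_f1(gold, predicted))
```

Exact-match scoring is the strictest common choice; a partial-overlap criterion would generally yield higher scores on the same predictions.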


Table of Contents

I. Introduction
A. Study Background and Necessity
1. The Current Cancer Incidence in Korea
2. Importance of Cancer Pathology Report
3. Status of converting medical data into structured data based on Natural Language Processing method
4. The necessity of modeling and unifying cancer pathology data
B. Study Purpose
II. Materials and Method
A. Study Subject data
B. Analysis of system and vocabulary of the cancer pathology report
C. Development of Cancer Pathology Data Model
D. Development of natural language processing model for data extraction
1. Text preprocessing
2. Named Entity Recognition Model Structure
3. Model training
4. Realization of rule-based algorithm
E. Model evaluation method
F. Design of common data model conversion
III. Result
A. Establishment of Korean cancer pathology reports system and vocabulary dictionary
B. Comparison of Model performance for data extraction
1. Comparison of performance by parameter setting
2. Comparison of performance between CNN model and Hybrid (CNN+Rule-based) model
C. Development of data conversion methodology for the structured cancer pathology report
1. Establishment of clinical data model for cancer pathology report
2. Result evaluation through manual review of data
3. Establishment of common data model
IV. Discussion
A. Consideration on study method and result
B. Study limitations
V. Conclusion
References
