Search Details

Design of a Knowledge Discovery and Extraction Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks

Abstract/Summary

Internet resources have grown rapidly in recent years owing to advances in technology, and digital documents have become more popular than traditional paper documents for storing and disseminating information. In such an environment, the problem for ordinary users is finding the information they want: so much data is available that processing it usually incurs great computational cost in time and memory, and the available hardware may be unable to handle it. In 2020, the world experienced an unprecedented global coronavirus outbreak, and many public and private institutions acted quickly to make resources available, for example by opening big data repositories, so that the discovery process could be made faster and more efficient.

In this thesis, we propose an automatic text summarization system that merges big data modeling, information mapping, and extraction to deliver top-down and bottom-up searching and browsing, and their intersection, for faster knowledge discovery and extraction. The problem consists of finding, in the documents, information that confirms or denies a correlation between respiratory syndrome and weather in the spread of COVID-19, taking social, ethical, and media aspects into consideration. To discover, extract, and then summarize texts from a massive collection of paper documents, we designed an automatic knowledge discovery and extraction analytical pipeline system based on natural language processing, text mining techniques, and an unsupervised learning mechanism.

The input dataset to the proposed system is first processed by tokenization, which breaks each text into linguistically meaningful units called tokens; words that occur frequently but do not contribute to the content of the text are then removed, and an N-gram algorithm is applied to capture multiword phrases precisely. Next, an LDA model is used to discover the topics hidden in the set of text documents, and Hadoop and Spark are adopted to run the topic model and provide the summarized text. In total, terabyte-scale (TB) input datasets were preprocessed in this way.
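As a rough illustration of the pipeline the abstract describes, the following is a minimal sketch in PySpark (Spark MLlib), which provides built-in stages for tokenization, stop-word removal, N-grams, and LDA. The HDFS path, column names, vocabulary size, and topic count k are illustrative assumptions, not the thesis's actual settings.

# A minimal sketch of the described preprocessing + LDA pipeline in PySpark.
# The input path, column names, and all parameter values are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (RegexTokenizer, StopWordsRemover, NGram,
                                SQLTransformer, CountVectorizer)
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("covid19-topic-pipeline").getOrCreate()

# Load the document collection from HDFS (hypothetical path and "text" column).
df = spark.read.parquet("hdfs:///data/cord19/abstracts.parquet")

# 1. Tokenization: split each text into lowercase word tokens.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")

# 2. Remove frequent words that do not contribute to the content (stop words).
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")

# 3. N-grams to capture multiword phrases such as "respiratory syndrome".
bigrams = NGram(n=2, inputCol="filtered", outputCol="bigrams")

# Concatenate unigrams and bigrams into one term list per document.
combine = SQLTransformer(
    statement="SELECT *, concat(filtered, bigrams) AS terms FROM __THIS__")

# 4. Bag-of-words vectors for LDA; vocabSize and minDF are assumed settings.
vectorizer = CountVectorizer(inputCol="terms", outputCol="features",
                             vocabSize=50000, minDF=5.0)

# 5. LDA uncovers k latent topics; k=10 is an illustrative choice.
lda = LDA(k=10, maxIter=50, featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, bigrams, combine,
                            vectorizer, lda])
model = pipeline.fit(df)

# Show the ten highest-weighted terms of each discovered topic.
model.stages[-1].describeTopics(maxTermsPerTopic=10).show(truncate=False)

In a setup like this, the number of topics k would typically be tuned by comparing the fitted model's log-likelihood or log-perplexity (LDAModel.logLikelihood / logPerplexity) across candidate values, which corresponds to the selection-of-the-number-of-topics step listed in the table of contents below.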


Table of Contents

1 Introduction 1
1.1 Background Information 2
1.1.1 Programming Model 2
1.1.2 Yarn Architecture 2
1.1.3 Hadoop Distributed File System (HDFS) 3
1.1.4 Cluster Computing (Spark) 4
1.1.5 Searching the Text 4
1.1.6 Natural Language Processing 6
1.1.7 Capabilities of NLP 6
1.1.8 Applications of NLP 7
1.1.9 Text Preprocessing 7
1.1.10 Techniques for Text Preprocessing 7
1.1.11 LDA Topic Modeling 8
1.2 Problem Description 10
2 Proposed Approach 11
2.1 Knowledge Finding and Extraction 11
2.1.1 Work Environment 14
2.2 Keyword Query DataFrame 15
2.3 LDA Modeling 17
2.3.1 Loading Data 17
2.3.2 Importing a DataFrame from HDFS 18
2.4 Data Cleaning 18
2.4.1 Lowercasing and Punctuation Removal 18
2.5 Preparing Data for LDA Analysis (Data Preprocessing) 18
2.6 Selecting the Number of Topics 19
3 Results 20
3.1 Analyzing Model Results 20
3.2 Results Exploration 22
4 Conclusion 35
4.1 Future Studies and Drawbacks 35
5 References 36
