Search Details

Design of a Knowledge Discovery and Extraction Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks

Abstract/Summary

Internet resources have grown rapidly in recent years owing to advances in technology, and digital documents have become more popular than traditional paper documents for storing and disseminating information. In such an environment, the problem for ordinary users is finding the information they want: so much data is available that processing it usually incurs great computational cost in time and memory, and the available hardware may be unable to handle it. In 2020, the world experienced an unprecedented global coronavirus outbreak, and many public and private institutions acted quickly to make resources available, for example by opening big data repositories, so that the discovery process could be made faster and more efficient.

In this thesis, we propose an automatic text summarization system that merges big data modeling, information mapping, and extraction to deliver top-down and bottom-up searching and browsing, and their intersection, for faster knowledge discovery and extraction. The problem consists of finding, in the documents, information that confirms or denies a correlation between respiratory syndrome and weather in the spread of COVID-19, taking social, ethical, and media aspects into consideration. To discover, extract, and then summarize texts from a massive collection of paper documents, we designed an automatic knowledge discovery and extraction analytical pipeline system based on natural language processing, text mining techniques, and an unsupervised learning mechanism.

The input dataset to the proposed system is first processed by tokenization, which breaks each text into linguistically meaningful units called tokens; words that occur frequently but do not contribute to the content of the text are then removed, and an N-gram algorithm is applied to capture multiword phrases precisely. Next, an LDA model is used to discover the topics hidden in the set of text documents, and Hadoop and Spark are adopted to run the topic model and provide the summarized text. In total, terabyte-scale (TB) input datasets were preprocessed in this way.
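As a rough illustration of the pipeline the abstract describes, the following is a minimal sketch in PySpark (Spark MLlib), which provides built-in stages for tokenization, stop-word removal, N-grams, and LDA. The HDFS path, column names, vocabulary size, and topic count k are illustrative assumptions, not the thesis's actual settings.

# A minimal sketch of the described preprocessing + LDA pipeline in PySpark.
# The input path, column names, and all parameter values are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (RegexTokenizer, StopWordsRemover, NGram,
                                SQLTransformer, CountVectorizer)
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("covid19-topic-pipeline").getOrCreate()

# Load the document collection from HDFS (hypothetical path and "text" column).
df = spark.read.parquet("hdfs:///data/cord19/abstracts.parquet")

# 1. Tokenization: split each text into lowercase word tokens.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")

# 2. Remove frequent words that do not contribute to the content (stop words).
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")

# 3. N-grams to capture multiword phrases such as "respiratory syndrome".
bigrams = NGram(n=2, inputCol="filtered", outputCol="bigrams")

# Concatenate unigrams and bigrams into one term list per document.
combine = SQLTransformer(
    statement="SELECT *, concat(filtered, bigrams) AS terms FROM __THIS__")

# 4. Bag-of-words vectors for LDA; vocabSize and minDF are assumed settings.
vectorizer = CountVectorizer(inputCol="terms", outputCol="features",
                             vocabSize=50000, minDF=5.0)

# 5. LDA uncovers k latent topics; k=10 is an illustrative choice.
lda = LDA(k=10, maxIter=50, featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, bigrams, combine,
                            vectorizer, lda])
model = pipeline.fit(df)

# Show the ten highest-weighted terms of each discovered topic.
model.stages[-1].describeTopics(maxTermsPerTopic=10).show(truncate=False)

In a setup like this, the number of topics k would typically be tuned by comparing the fitted model's log-likelihood or log-perplexity (LDAModel.logLikelihood / logPerplexity) across candidate values, which corresponds to the selection-of-the-number-of-topics step listed in the table of contents below.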


Table of Contents

1 Introduction 1
1.1 Background Information 2
1.1.1 Programming Model 2
1.1.2 Yarn Architecture 2
1.1.3 Hadoop Distributed File System (HDFS) 3
1.1.4 Cluster Computing (Spark) 4
1.1.5 Searching the Text 4
1.1.6 Natural Language Processing 6
1.1.7 Capabilities of NLP 6
1.1.8 Applications of NLP 7
1.1.9 Text Preprocessing 7
1.1.10 Techniques for Text Preprocessing 7
1.1.11 LDA Topic Modeling 8
1.2 Problem Description 10
2 Proposed Approach 11
2.1 Knowledge Finding and Extraction 11
2.1.1 Work Environment 14
2.2 Keyword Query DataFrame 15
2.3 LDA Modeling 17
2.3.1 Loading Data 17
2.3.2 Importing a DataFrame from HDFS 18
2.4 Data Cleaning 18
2.4.1 Lowercasing and Punctuation Removal 18
2.5 Preparing Data for LDA Analysis (Data Preprocessing) 18
2.6 Selecting the Number of Topics 19
3 Results 20
3.1 Analyzing Model Results 20
3.2 Results Exploration 22
4 Conclusion 35
4.1 Future Studies and Drawbacks 35
5 References 36
