A Step-Wise Methodology using Semi-Supervised Topic Modelling to Recommend Contextual Relationships for an Ontology Engineer

Abstract

Students face an increasing number of complex decisions over the course of their studies. Those decisions might involve choosing a degree that lines up with their career objectives. They may also have a specific curiosity about mathematics or science that they do not entirely understand and wish to explore further. This may lead a learner to a search engine. Searching can itself be a challenging task, as the learner has little knowledge of the subject matter to start with and may not know how to initiate their enquiry. They may search for “Computer Science” but find it hard to understand the purpose or contents of the subject matter. Thus, we found it necessary to create a frequently asked questions (FAQ) system called GOLD. One of the challenges of building a system like this is getting at the large amounts of information hidden in a corpus; we therefore wanted a high-performance algorithm to gather topics from large corpora. This led us to investigate different approaches, including Naive Bayes and Support Vector Machines, both of which can be used to classify text. After careful examination and consideration of their suitability for our context, we decided that the Latent Dirichlet Allocation family of topic modelling algorithms was the most suitable. Two semi-supervised algorithms, namely Labeled Latent Dirichlet Allocation (L-LDA) and Partially Labeled Dirichlet Allocation (P-LDA), both provide a means of supervising topic models around specific topic labels. The next challenge is determining suitable labels with which to train our topic models. Our approach was to generate a set of concepts that make up all possible answers generatable from the ontology. We organise these concepts into concept hierarchies, which represent the logical flow of information. These form what we call the known, or visible, layer of our system.
The hidden layer of the system consists of a corpus, the concept trees from the known layer, a local taxonomy and a label selection algorithm that chooses the most suitable labels for training our L-LDA models. This pre-processing step highlights unsuitable labels and constructs a CSV file for consumption by our Scala code, which is subsequently fed into the Stanford Topic Modeling Toolbox. The results form the basis for evaluating the completeness of the ontology and the derived ontology answers. We then approached the problem from a different angle by focusing on providing new branches and relationships that did not exist within the ontology. Firstly, we created a tool that uses topic modelling to discover topics related to a concept and provides an interface to WordNet that allows exploration of connected graphs of words. Secondly, we used Brill's part-of-speech tagging algorithm to classify the topic models into parts of speech and used the result to construct relationships between nodes and new concepts. We evaluated this system using the case study methodology, a set of topic model evaluation experiments and two separate surveys: one on the quality of the topic models and the second on the ontology analysis results.

Table of Contents

CHAPTER 1. Introduction
1.1 Motivation
1.2 Challenges
1.3 Contribution
1.4 Scope
1.5 Context
CHAPTER 2. Related Work
2.1 Assessing Topic Models
2.2 Topic Interpretability
2.3 Meta Driven Topic Models
2.4 Recommendation Systems
2.5 Topics that change over time
CHAPTER 3. Proposed Approach
3.1 Method
3.1.1 Data Construction
3.1.2 Pre-Processing
3.1.3 Train Topic Models
3.1.4 Identify Concepts
3.2 Analysis Tools
3.2.1 Topic Model Coverage
3.2.2 Horizontal Graph Analysis
3.2.3 Vertical Graph Analysis
3.2.4 Topic Similarity
3.2.5 Predicate Extraction
CHAPTER 4. Topic Modeling using Term Trees
4.1 Pre-Processing
4.2 Label Selection Algorithm
CHAPTER 5. User Interfaces
5.1 GOLD FAQ
5.1.1 Concept and Individual Selection
5.1.2 Question Selection
5.1.3 Term Tree Selection
5.1.4 Topic Coverage
5.2 GOLD Analysis
5.2.1 Application overview
5.2.2 Menu System
5.2.3 Import Ontology
5.2.4 Define Taxonomy
5.2.5 Download Corpus
5.2.6 Prepare Training Data
5.2.7 Import Topic Models
5.2.8 (C.1) Horizontal Graph Analysis
5.2.9 (C.2) Vertical Graph Analysis
5.2.10 (C.3) Topic Similarity
5.2.11 (C.4) Predicate Analysis
5.2.12 (C.5) Topic Model Recommendations
CHAPTER 6. System Architecture
6.1 GOLD ANALYSIS
6.1.1 Components
6.2 GOLD WEB
6.2.1 Components
6.2.2 Object Model
6.2.3 Entity Relationship Diagram
CHAPTER 7. Evaluation
7.1 Theoretical Evaluation of Proposed Approach
7.1.1 Study Questions
7.1.2 Study Proposition
7.1.3 Unit of Analysis
7.1.4 Linking Data
7.1.5 Understanding these results
7.2 Experimental Results
7.2.1 Topic Selection Algorithm
7.2.2 Topic Analysis Tool
CHAPTER 8. Conclusion and future work
REFERENCES
Appendix
Script for calculation of optimal K
LABEL Selection Survey
GOLD ANALYSIS Survey