Representation Learning of Biomedical Ontologies using Poincaré Embedding and Application to Genetic Risk Model

Abstract

Knowledge manipulation of Gene Ontology (GO) and Gene Ontology Annotation (GOA) relies primarily on vector representations of GO terms and genes. Previous studies have represented GO terms and genes or gene products in Euclidean space, using Word2Vec-based embedding methods to encode entities as numeric vectors and measure their semantic similarity. However, embedding large graph-structured data in Euclidean space inevitably loses information about latent hierarchies, so the semantics of GO and GOA are not captured optimally. Hyperbolic spaces such as the Poincaré ball, on the other hand, are better suited to modeling hierarchies: because of their negative curvature, distances grow exponentially toward the boundary. In this thesis, we propose hierarchical representations of GO and genes (HiG2Vec), obtained by applying Poincaré embedding, which is specialized for representing hierarchy, in a two-step procedure: GO embedding followed by gene embedding. Through experiments, we show that our model represents the hierarchical structure better than other approaches and predicts interactions between genes or gene products as well as or better than previous methods. The results indicate that HiG2Vec is superior to other methods both in capturing GO and gene semantics and in data utilization.
As one effective downstream application of gene embeddings, we propose TransformerPRS, a deep learning model built on a transformer module derived from language models, and compare it with the conventional polygenic risk score (PRS), a widely used risk scoring approach that derives a genetic risk for each individual as the sum of risk variants weighted by effect sizes from genome-wide association studies (GWASs). In the experiments, TransformerPRS initialized with HiG2Vec showed better prediction performance than both TransformerPRS trained from scratch and the conventional PRS. In addition, the self-attention module in the transformer block identified important features and their interactions. Our models can improve genetic risk prediction by indicating which genes and gene-gene interactions have an important impact on prediction, information that the conventional PRS does not capture.
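The two quantities the abstract describes in words can be illustrated with a minimal sketch. It assumes the standard Poincaré ball distance, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))), and a conventional PRS computed as a plain weighted sum. The function names, the use of allele dosages as input, and the toy values are illustrative assumptions, not the thesis implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # The denominator shrinks toward zero near the boundary, which inflates distances there.
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


def polygenic_risk_score(dosages, effect_sizes):
    """Conventional PRS: risk-allele dosages weighted by GWAS effect sizes, then summed."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))


# Hypothetical toy values: the same Euclidean gap (0.1) at two places in the ball.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.89, 0.0), (0.99, 0.0))

# Hypothetical individual with three variants (dosage 0, 1, or 2 per variant).
toy_prs = polygenic_risk_score([2, 1, 0], [0.3, 0.1, 0.5])
```

In this sketch the pair near the boundary is much farther apart than the pair near the origin even though their Euclidean gap is identical, which is the property that leaves room for the many leaves of a deep hierarchy.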

Table of Contents

1. Introduction
2. Related work
2.1 GO Semantic similarity measures
2.1.1 Resnik method
2.1.2 Wang method
2.1.3 GOGO
2.2 Embedding of GO and Gene/Protein
2.2.1 Onto2Vec
2.2.2 OPA2Vec
2.2.3 EL embeddings
2.2.4 Gene2Vec
2.3 Transformer in other domains
3. Materials and Methods
3.1 Materials
3.1.1 Gene Ontology (GO) and Gene Ontology Annotation (GOA)
3.1.2 Interaction Database
3.1.3 Alzheimer's Disease Neuroimaging Initiative and GWAS summary statistics
3.2 Poincaré Embedding
3.3 Learning GO Representations by Poincaré Embedding
3.4 Gene Embedding by Fine-tuning
3.5 Evaluation Methods for Embeddings
3.5.1 Link Reconstruction, Hierarchy Reconstruction, and Level Reconstruction at the GO Level
3.5.2 Interaction Prediction at the Gene Level
3.6 New Relations Gene Ontology
3.7 Genetic Risk Score
3.7.1 Polygenic Risk Score
3.7.2 Monogenic Risk Score
3.8 TransformerPRS
3.8.1 Sequences of Risky Genes and Protective Genes
3.8.2 Input Embedding
3.8.3 Transformer block
3.8.4 Multihead self-attention
3.8.5 Linear Head
3.8.6 Training Details
3.9 TransformerPRS with HiG2Vec
3.10 Miscellanea
3.10.1 Transitive Closure
3.10.2 Similarity Measurement
3.10.3 Evaluation Metrics
3.10.4 Statistical test for model comparisons
4. Result
4.1 Nearest Neighbor of the Root GO terms in the Hierarchy
4.2 Link Reconstruction, Hierarchy Reconstruction, and Level Reconstruction at the GO Level
4.3 Interaction Prediction at the Gene Level
4.3.1 Binary Interaction
4.3.2 Interaction Score
4.3.3 Interaction Type
4.4 New Relations in Gene Ontology
4.5 Alzheimer's disease Classification
5. Discussion and Conclusion
5.1 Data Leakage in the Evaluation Tasks
5.2 Advantages: Capturing Semantics, Data Utilization, and Robustness
5.3 Dimensionality
5.4 Two-step Procedure
5.5 Limitation and Future work
5.6 HiG2Vec is beneficial to TransformerPRS
5.7 Interpretability of TransformerPRS
5.8 Conclusion
