A knowledge-based approach for semantic similarity, relatedness, and word sense disambiguation using WordNet
Al-Mousa, Mohannad Adel
Discipline: Engineering : Electrical & Computer
Many applications in computational linguistics and artificial intelligence rely on high-performing semantic similarity and word sense disambiguation techniques to solve challenging tasks such as information retrieval, machine translation, question answering, and document clustering. While text comprehension is intuitive for humans, machines face tremendous challenges in processing and interpreting human natural language. This thesis discusses two interconnected natural language processing tasks using a contextual semantic approach and a knowledge-based repository. The first task is knowledge-based semantic similarity and relatedness between words using WordNet, and the second is knowledge-based word sense disambiguation. The semantic similarity and relatedness task determines the level of likeness and connectedness between two words within a given context based on their semantic representation within a knowledge graph. The word sense disambiguation task determines the correct sense (meaning) of a word within its sentence and document contexts. Most current research in this field relies solely on the taxonomic relation "ISA" to evaluate semantic similarity and relatedness between terms. Semantic similarity and relatedness have not been exploited to their full potential to solve integral natural language processing tasks, such as word sense disambiguation. Despite the wide range of knowledge-based word sense disambiguation approaches, most of them use word overlap (i.e., the Lesk similarity measure) as the underlying similarity measure, which is, by definition, limited to exact matches of terms between the compared texts. This thesis explores the benefits of using all types of non-taxonomic relations in the WordNet knowledge graph to enhance existing semantic similarity and relatedness measures.
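The exact-match limitation of word overlap noted above can be illustrated with a minimal sketch. This is a simplified Lesk-style overlap count, not the measure proposed in this thesis. The stopword list, the function names, and the glosses are assumptions made for illustration; the glosses are paraphrases, not verbatim WordNet entries.

```python
import string

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "the", "of", "and", "on", "that", "with", "to", "in"}

def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stopwords and empties."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

def overlap(text_a, text_b):
    """Simplified Lesk-style score: number of exactly matching content words."""
    return len(tokenize(text_a) & tokenize(text_b))

# Hypothetical context and sense glosses for the ambiguous word "bank".
context = "they walked along the river and sat on the land near the water"
gloss_river = "sloping land beside a body of water"
gloss_finance = "a financial institution that accepts deposits"

river_score = overlap(context, gloss_river)      # shares "land" and "water"
finance_score = overlap(context, gloss_finance)  # shares nothing

# Two descriptions of the same concept with no word in common score zero,
# which is the exact-match limitation described in the abstract.
synonym_score = overlap("a motor vehicle with four wheels",
                        "an automobile driven on roads")
```

Because the score counts only identical tokens, related terms such as "vehicle" and "automobile" contribute nothing, which is the gap a relation-aware similarity measure is meant to close.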
We propose a holistic poly-relational approach based on a new relational-based information content and non-taxonomic-based weighted paths to devise a comprehensive semantic similarity and relatedness measure. Furthermore, we propose a novel knowledge-based word sense disambiguation algorithm, namely the Sequential Contextual Similarity Matrix Multiplication (SCSMM) algorithm. The SCSMM algorithm combines semantic similarity, heuristic knowledge, and document context to exploit, respectively, the merits of local context between consecutive terms, human knowledge about terms, and a document's main topic in disambiguating terms. Unlike other algorithms, the SCSMM algorithm guarantees the capture of the maximum sentence context while maintaining the terms' order within the sentence. We also identify the core factors that affect our proposed algorithm and most existing word sense disambiguation systems. The results of the proposed algorithms demonstrate an improvement over the benchmark methods, including state-of-the-art knowledge-based techniques. Our proposed semantic similarity and relatedness measure demonstrated improvement gains ranging from 3.8% to 23.8%, 1.3% to 18.3%, 31.8% to 117.2%, and 19.1% to 111.1% on the gold standard datasets MC, RG, WordSim, and Mturk, respectively. The proposed SCSMM algorithm, in turn, outperformed all other algorithms when disambiguating nouns on the combined gold standard datasets, while demonstrating results comparable to current state-of-the-art word sense disambiguation systems when dealing with each dataset separately. Finally, the thesis discusses the impact of granularity level, ambiguity rate, sentence size, and part-of-speech distribution on the performance of the proposed algorithm.
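Gains above 100%, as reported for WordSim and Mturk, are most naturally read as improvements relative to a baseline score. The abstract does not state the formula, so the following is a sketch under that assumption; the function name and the example scores are hypothetical and are not results from the thesis.

```python
def relative_gain(baseline, proposed):
    """Percentage improvement of `proposed` over `baseline` (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (proposed - baseline) / abs(baseline) * 100.0

# Hypothetical correlation scores, chosen only to show how a gain can reach 100%.
gain = relative_gain(baseline=0.40, proposed=0.80)
```

Under this reading, a gain of 100% means the proposed measure doubled the baseline score, which is possible only when the baseline is well below its maximum.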