A knowledge-based approach for semantic similarity, relatedness, and word sense disambiguation using WordNet
Abstract
Various applications in computational linguistics and artificial intelligence rely on
high-performing semantic similarity and word sense disambiguation techniques to solve
challenging tasks such as information retrieval, machine translation, question answering,
and document clustering. While text comprehension is intuitive for humans, machines
face tremendous challenges in processing and interpreting human natural language.
This thesis discusses two interconnected natural language processing tasks using a contextual semantic approach and a knowledge-based repository. The first task is measuring knowledge-based semantic similarity and relatedness between words using WordNet, and the second
is knowledge-based word sense disambiguation. The semantic similarity and
relatedness task determines the level of likeness and connectedness between two words
within a given context based on their semantic representation within a knowledge graph.
The word sense disambiguation task determines the correct sense (meaning) of a word
within sentence and document contexts.
Most current research in this field relies solely on the taxonomic relation
“ISA” to evaluate semantic similarity and relatedness between terms. Semantic similarity and relatedness have not been exploited to their full potential to solve integral natural
language processing tasks, such as the word sense disambiguation task. Despite the wide
range of knowledge-based word sense disambiguation approaches, the underlying similarity measure for most of them is the word overlap measure (i.e., the Lesk similarity measure),
which is, by definition, limited to the exact match of terms between the compared texts.
This thesis explores the benefits of using all types of non-taxonomic relations in the WordNet
knowledge graph to enhance existing semantic similarity and relatedness measures. We
propose a holistic poly-relational approach based on a new relational-based information
content and non-taxonomic-based weighted paths to devise a comprehensive semantic
similarity and relatedness measure. Furthermore, we propose a novel knowledge-based
word sense disambiguation algorithm, namely the Sequential Contextual Similarity Matrix
Multiplication (SCSMM) algorithm. The SCSMM algorithm combines semantic similarity, heuristic knowledge, and document context to exploit, respectively, the merits of local context between consecutive terms, human knowledge about terms, and a document’s
main topic in disambiguating terms. Unlike other algorithms, the SCSMM algorithm
guarantees the capture of the maximum sentence context while maintaining the terms’
order within the sentence. Also, we identify the core factors that affect our proposed
algorithm and most existing word sense disambiguation systems.
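As a minimal illustration of the exact-match limitation of word overlap noted above, the following Python sketch computes a simplified overlap score between a context and a sense gloss. The function name `word_overlap`, the whitespace tokenization, and the example sentence and glosses are assumptions made for illustration only; they are not taken from WordNet or from the algorithms proposed in this thesis.

```python
def word_overlap(text_a: str, text_b: str) -> int:
    """Count distinct tokens shared by two texts (exact string match only)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    return len(tokens_a & tokens_b)


# Hypothetical context and glosses for two senses of "bank".
context = "she deposited cash at the bank"
gloss_financial = "an institution that accepts deposits and lends money"
gloss_river = "sloping land beside a body of water"

# Both scores are zero: "deposited" does not match "deposits" and
# "cash" does not match "money", so overlap alone cannot pick a sense.
score_financial = word_overlap(context, gloss_financial)
score_river = word_overlap(context, gloss_river)

# Overlap is non-zero only when surface forms coincide exactly.
score_exact = word_overlap("the bank accepts deposits", gloss_financial)
```

In this sketch the semantically related pairs contribute nothing to the score, which is the kind of gap a similarity and relatedness measure over a knowledge graph is intended to close.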
The results of the proposed algorithms demonstrate an improvement over the benchmark methods, including the state-of-the-art knowledge-based techniques. Our proposed
semantic similarity and relatedness measure demonstrated improvement gains ranging
from 3.8% to 23.8%, 1.3% to 18.3%, 31.8% to 117.2%, and 19.1% to 111.1% on the gold standard
datasets MC, RG, WordSim, and Mturk, respectively. For word sense disambiguation, the proposed
SCSMM algorithm outperformed all other algorithms when disambiguating nouns on the
combined gold standard datasets, while demonstrating comparable results to current state-of-the-art word sense disambiguation systems when dealing with each dataset separately.
Finally, the thesis discusses the impact of granularity level, ambiguity rate, sentence size,
and part of speech distribution on the performance of the proposed algorithm.