Semantic similarity between words and sentences using lexical database and word embeddings

Pawar, Atish Shivaji

dc.contributor.advisor	Mago, Vijay
dc.contributor.author	Pawar, Atish Shivaji
dc.date.accessioned	2018-11-13T20:51:47Z
dc.date.available	2018-11-13T20:51:47Z
dc.date.created	2018
dc.date.issued	2018
dc.identifier.uri	http://knowledgecommons.lakeheadu.ca/handle/2453/4308
dc.description.abstract	Calculating the semantic similarity between sentences is a long-standing problem in natural language processing, and semantic analysis plays a crucial role in text analytics research. The meaning of a word in general English changes with its context, so semantic similarity varies significantly from one domain to another. For this reason, it is crucial to consider the appropriate definition of each word when words are compared semantically. We present an unsupervised method that can be applied across multiple domains by incorporating corpus-based statistics into a standardized semantic similarity algorithm. To calculate the semantic similarity between words and sentences, the proposed method follows an edge-based approach using a lexical database. When tested on both benchmark standards and a mean human similarity dataset, the method achieves high correlation for both word similarity (Pearson's correlation coefficient, PCC = 0.8753) and sentence similarity (PCC = 0.8793) against the Rubenstein and Goodenough standard, and on the SICK dataset (PCC = 0.8324), outperforming other unsupervised models. We then extend the semantic similarity algorithm to compare the Learning Objectives from course outlines. The course description provided by instructors is an essential piece of information, as it defines what is expected of the instructor and what he or she will deliver during a particular course. One of the key components of a course description is the Learning Objectives section. The contents of this section are used by program managers who are tasked with comparing and matching two different courses during the development of Transfer Agreements between institutions. This research develops semantic similarity algorithms to calculate the similarity between two learning objectives of the same domain.
We present a methodology that addresses semantic similarity by integrating a previously established algorithm with a domain corpus in order to utilize domain statistics. The disambiguated domain serves as supervised learning data for the algorithm. We also introduce the Bloom Index to calculate the similarity between action verbs in the Learning Objectives with reference to Bloom's taxonomy. We also study and present an approach to calculating the semantic similarity between words under the word2vec model for a specific domain. We present a methodology for compiling a corpus for a specific domain using Wikipedia, and then present a case showing the variance in semantic similarity between words across different corpora. The core contributions of this thesis are a semantic similarity algorithm for words and sentences, and the compilation of a domain-specific corpus to train the word2vec model. We also provide practical uses of the algorithms and their implementation.	en_US
dc.language.iso	en_US	en_US
dc.subject	Semantic similarity (Computer science)	en_US
dc.subject	Semantic analysis	en_US
dc.subject	Bloom's taxonomy	en_US
dc.subject	Corpus statistics	en_US
dc.title	Semantic similarity between words and sentences using lexical database and word embeddings	en_US
dc.type	Thesis	en_US
etd.degree.name	Master of Science	en_US
etd.degree.level	Master	en_US
etd.degree.discipline	Computer Science	en_US
etd.degree.grantor	Lakehead University	en_US
dc.contributor.committeemember	Choudhury, Salimur
dc.contributor.committeemember	Benlamri, Rachid

Files in this item

Name: PawarA2018m-1a.pdf
Size: 1.259Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations from 2009 [1744]
