Developing machine learning coding similarity indicators for C & C++ corpora

Kunjir, Ajinkya

dc.contributor.advisor	Fiaidhi, Jinan
dc.contributor.author	Kunjir, Ajinkya
dc.date.accessioned	2021-01-11T16:13:18Z
dc.date.available	2021-01-11T16:13:18Z
dc.date.created	2020
dc.date.issued	2020
dc.identifier.uri	http://knowledgecommons.lakeheadu.ca/handle/2453/4734
dc.description.abstract	Digital data is vulnerable to being copied, altered and claimed as one's own work. In programming assignments, this practice is referred to as source-code theft or e-plagiarism. Despite years of effort, existing similarity detection engines perform well at detecting plagiarism by novice programmers but give insufficient results when a student uses more sophisticated disguises such as word substitution, structural changes, altered line spacing and placeholder comments. This thesis research aims to address these issues by delivering an assistive forensic engine named ‘SimDec’ that helps evaluators detect similar assignments. The system's primary objective is to help assignment evaluators identify code thieves and uphold the university's dishonesty regulations. The forensic engine is developed in the Java programming language to detect similarities in C and C++ source code. The research is split into two modules, labelled ‘Software forensic engine development’ and ‘Similarity level classification with machine learning’. The proposed system has a three-stage workflow: lexical analysis, tokenizer customization, and a final stage that displays the similarity percentage and the corresponding level of ‘Low’, ‘Average’ or ‘High’. The similarity algorithms integrated in the engine are Levenshtein distance, the Jaro and Jaro-Winkler measures, the Dice coefficient and cosine similarity. The first module comprises the lexical analysis workflow and the application of this set of similarity measures to token categories. The machine learning algorithms selected for the classification task are multi-class SVM, logistic regression and a simple neural network.
In the second module, the data gathered and generated by the similarity detection engine is fed to the ML algorithms to train the models so that they can predict the plagiarism or similarity level of newly entered data. This hybrid approach is expected to reduce the time complexity and processing time of the software engine.	en_US
dc.language.iso	en_US	en_US
dc.subject	Plagiarism	en_US
dc.subject	Plagiarism detection software	en_US
dc.subject	Lexical analysis	en_US
dc.subject	ANTLR	en_US
dc.subject	Machine learning	en_US
dc.title	Developing machine learning coding similarity indicators for C & C++ corpora	en_US
dc.type	Thesis	en_US
etd.degree.name	Master of Science	en_US
etd.degree.level	Master	en_US
etd.degree.discipline	Computer Science	en_US
etd.degree.grantor	Lakehead University	en_US
dc.contributor.committeemember	Mohammed, Sabah
dc.contributor.committeemember	Al-Khanjari, Zuhoor
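
The abstract names Levenshtein distance and the Dice coefficient among the similarity measures applied to tokens. The following is a minimal Java sketch of those two measures only, not the thesis's SimDec implementation. The class name `SimilaritySketch`, the method names, the whitespace tokenization and the choice to compute Dice over sets of distinct tokens are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of SimDec.
final class SimilaritySketch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Dice coefficient over distinct tokens: 2 * |X ∩ Y| / (|X| + |Y|).
    // Assumption: duplicates are ignored; a bigram or multiset variant
    // would give different values.
    static double dice(List<String> x, List<String> y) {
        Set<String> sx = new HashSet<>(x);
        Set<String> sy = new HashSet<>(y);
        if (sx.isEmpty() && sy.isEmpty()) {
            return 1.0; // two empty token lists are treated as identical
        }
        Set<String> common = new HashSet<>(sx);
        common.retainAll(sy);
        return 2.0 * common.size() / (sx.size() + sy.size());
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3

        // Two C statements that differ only in the identifier name,
        // tokenized naively on whitespace for the sake of the example.
        List<String> first = Arrays.asList("int x = 0 ;".split(" "));
        List<String> second = Arrays.asList("int y = 0 ;".split(" "));
        System.out.println(dice(first, second)); // prints 0.8
    }
}
```

In the second example, four of the five distinct tokens are shared, so a simple identifier rename still leaves the Dice score at 0.8. This illustrates why token-level measures are a natural fit for the word-substitution disguise mentioned in the abstract.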

Files in this item

Name: KunjirA2020m-1a.pdf
Size: 3.673Mb
Format: PDF


This item appears in the following Collection(s)

Electronic Theses and Dissertations from 2009 [1743]
