Developing machine learning coding similarity indicators for C & C++ corpora

Kunjir, Ajinkya

Please use this identifier to cite or link to this item: https://knowledgecommons.lakeheadu.ca/handle/2453/4734

Title:	Developing machine learning coding similarity indicators for C & C++ corpora
Authors:	Kunjir, Ajinkya
Keywords:	Plagiarism; Plagiarism detection software; Lexical analysis; ANTLR; Machine learning
Issue Date:	2020
Abstract:	Digital work is vulnerable to being copied, altered, and claimed by someone other than its author. In programming assignments, this practice is referred to as source-code theft or e-plagiarism. Despite years of effort, existing similarity detection engines perform well at detecting plagiarism by novice programmers but give insufficient results when a student uses more sophisticated plagiarism techniques such as word substitution, structure changes, line spacing, and placeholder comments. This thesis aims to deliver an assistive forensic engine named ‘SimDec’ that helps evaluators detect similar assignments and addresses these issues. The system's primary objective is to help assignment evaluators identify copied code and apply the university's academic dishonesty regulations. The forensic engine is developed in Java and detects similarities in C and C++ source code. The research is split into two modules, labelled ‘Software forensic engine development’ and ‘Similarity level classification with machine learning’. The proposed system has a three-stage workflow: lexical analysis, tokenizer customization, and a final stage that displays the similarity percentage and the corresponding level of ‘Low’, ‘Average’ or ‘High’. The similarity algorithms integrated in the engine are Levenshtein distance, the Jaro and Jaro-Winkler measures, the Dice coefficient and cosine similarity. The first module comprises lexical analysis and the application of this set of similarity measures to token categories.
In the second module, the machine learning algorithms selected for the classification task are multi-class SVM, logistic regression and a simple neural network. The data gathered and generated by the similarity detection engine is fed to these algorithms to train models that predict the plagiarism or similarity level of newly entered data. This hybrid approach is expected to reduce the time complexity and improve the processing speed of the software engine.
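As an illustration of the kind of token-level comparison the abstract describes, the following is a minimal Java sketch of three of the named measures (Levenshtein distance, Dice coefficient and cosine similarity) applied to token sequences. It is not the SimDec implementation: the class name `TokenSimilarity`, the method names, and the normalized token labels such as `ID` and `NUM` are assumptions made for this example, and the Jaro and Jaro-Winkler measures and the Low/Average/High level mapping are left out because the record does not specify them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; names are illustrative, not taken from the thesis.
final class TokenSimilarity {

    // Levenshtein distance over token sequences: the minimum number of
    // token insertions, deletions and substitutions turning a into b.
    static int levenshtein(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Dice coefficient over the sets of distinct tokens: 2|A ∩ B| / (|A| + |B|).
    static double dice(List<String> a, List<String> b) {
        Set<String> sa = new HashSet<>(a);
        Set<String> sb = new HashSet<>(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(sa);
        common.retainAll(sb);
        return 2.0 * common.size() / (sa.size() + sb.size());
    }

    // Cosine similarity between token-frequency vectors.
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> fa = counts(a);
        Map<String, Integer> fb = counts(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * fb.getOrDefault(e.getKey(), 0);
        }
        for (int v : fb.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    private static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<>();
        for (String t : tokens) {
            freq.merge(t, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Two statements that differ only in the operator, after a
        // hypothetical normalization of identifiers to ID and literals to NUM.
        List<String> first = List.of("int", "ID", "=", "ID", "+", "NUM", ";");
        List<String> second = List.of("int", "ID", "=", "ID", "-", "NUM", ";");
        System.out.println("Levenshtein distance: " + levenshtein(first, second));
        System.out.println("Dice coefficient:     " + dice(first, second));
        System.out.println("Cosine similarity:    " + cosine(first, second));
    }
}
```

Comparing normalized token sequences rather than raw text is one way such measures can stay stable under renamed identifiers or changed spacing, which are among the disguises the abstract lists; how SimDec itself categorizes and weights tokens is described in the thesis, not here.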
URI:	http://knowledgecommons.lakeheadu.ca/handle/2453/4734
Degree Discipline:	Computer Science
Degree Name:	Master of Science
Degree Level:	Master
Advisor:	Fiaidhi, Jinan
Committee Members:	Mohammed, Sabah; Al-Khanjari, Zuhoor
Appears in Collections:	Electronic Theses and Dissertations from 2009

Files in This Item:

File	Description	Size	Format
KunjirA2020m-1a.pdf		3.76 MB	Adobe PDF