Developing machine learning coding similarity indicators for C & C++ corpora

dc.contributor.advisor: Fiaidhi, Jinan
dc.contributor.author: Kunjir, Ajinkya
dc.date.accessioned: 2021-01-11T16:13:18Z
dc.date.available: 2021-01-11T16:13:18Z
dc.date.created: 2020
dc.date.issued: 2020
dc.identifier.uri: http://knowledgecommons.lakeheadu.ca/handle/2453/4734
dc.description.abstract: Digital work is vulnerable to being copied, altered and claimed as someone else's own. In programming assignments, this practice can be referred to as source-code theft or e-plagiarism. Despite years of effort, existing similarity detection engines perform well at detecting plagiarism by novice programmers but provide insufficient results when a student uses more sophisticated plagiarism techniques such as word substitution, structural changes, line spacing and placeholder comments. This thesis aims to deliver an assistive forensic engine named ‘SimDec’ that helps evaluators detect similar assignments and so addresses these issues. The system's primary objective is to help assignment evaluators identify code theft and uphold the university's regulations on dishonesty. The forensic engine has been developed in the Java programming language to detect similarities in C and C++ source code. The research is split into two modules, labelled ‘Software forensic engine development’ and ‘Similarity level classification with machine learning’. The proposed system has a three-stage workflow: lexical analysis, tokenizer customization, and a final stage that displays the similarity percentage and the corresponding level of ‘Low’, ‘Average’ or ‘High’. The similarity algorithms integrated in the engine are Levenshtein distance, the Jaro and Jaro-Winkler measures, the Dice coefficient and cosine similarity. The first module comprises the lexical analysis workflow and the application of this set of similarity measures to token categories. The machine learning algorithms selected for the classification task are multi-class SVM, logistic regression and a simple neural network.
In the second module, the data gathered and generated by the similarity detection engine is fed to the ML algorithms to train models that predict the plagiarism or similarity level of newly entered data. This hybrid approach is expected to reduce the time complexity and improve the processing speed of the software engine. [en_US]
dc.language.iso: en_US [en_US]
dc.subject: Plagiarism [en_US]
dc.subject: Plagiarism detection software [en_US]
dc.subject: Lexical analysis [en_US]
dc.subject: ANTLR [en_US]
dc.subject: Machine learning [en_US]
dc.title: Developing machine learning coding similarity indicators for C & C++ corpora [en_US]
dc.type: Thesis [en_US]
etd.degree.name: Master of Science [en_US]
etd.degree.level: Master [en_US]
etd.degree.discipline: Computer Science [en_US]
etd.degree.grantor: Lakehead University [en_US]
dc.contributor.committeemember: Mohammed, Sabah
dc.contributor.committeemember: Al-Khanjari, Zuhoor
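
Illustrative sketch (not taken from the thesis): the abstract names Levenshtein distance and the Dice coefficient among the similarity measures, and says a similarity percentage is mapped to a level of ‘Low’, ‘Average’ or ‘High’. The Java fragment below shows one way such measures could be computed over token sequences. Everything specific in it is an assumption made for illustration: the class name SimilaritySketch, the whitespace tokenizer (the record lists ANTLR as a subject, so the thesis presumably uses a proper lexer), the normalization of edit distance to a percentage, and the 40/70 cut-offs for the three levels, which the abstract does not state.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimilaritySketch {

    // Levenshtein (edit) distance between two token sequences,
    // computed with the classic two-row dynamic programming table.
    public static int levenshtein(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Assumed normalization: 100 * (1 - distance / longer sequence length).
    public static double levenshteinPercent(List<String> a, List<String> b) {
        int max = Math.max(a.size(), b.size());
        if (max == 0) {
            return 100.0;
        }
        return 100.0 * (1.0 - (double) levenshtein(a, b) / max);
    }

    // Dice coefficient over the sets of distinct tokens: 2|A ∩ B| / (|A| + |B|).
    public static double dicePercent(List<String> a, List<String> b) {
        Set<String> sa = new HashSet<>(a);
        Set<String> sb = new HashSet<>(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 100.0;
        }
        Set<String> common = new HashSet<>(sa);
        common.retainAll(sb);
        return 100.0 * 2.0 * common.size() / (sa.size() + sb.size());
    }

    // Hypothetical cut-offs; the abstract names the levels but not their thresholds.
    public static String level(double percent) {
        if (percent >= 70.0) {
            return "High";
        }
        if (percent >= 40.0) {
            return "Average";
        }
        return "Low";
    }

    // Stand-in tokenizer: splits on whitespace only (an assumption, not a real C/C++ lexer).
    public static List<String> tokens(String source) {
        return Arrays.asList(source.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> a = tokens("int sum = a + b ;");
        List<String> b = tokens("int total = a + b ;");
        double lev = levenshteinPercent(a, b);
        double dice = dicePercent(a, b);
        System.out.printf("Levenshtein: %.1f%% (%s)%n", lev, level(lev));
        System.out.printf("Dice: %.1f%% (%s)%n", dice, level(dice));
    }
}
```

In this toy pair, renaming one identifier changes a single token out of seven, so both measures still report a high similarity. That is the kind of word-substitution case the abstract describes.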