Lakehead University Library Logo
    • Login
    View Item 
    •   Knowledge Commons Home
    • Electronic Theses and Dissertations
    • Electronic Theses and Dissertations from 2009
    • View Item
    •   Knowledge Commons Home
    • Electronic Theses and Dissertations
    • Electronic Theses and Dissertations from 2009
    • View Item
    JavaScript is disabled for your browser. Some features of this site may not work without it.
    quick search

    Browse

    All of Knowledge CommonsCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsDisciplineAdvisorCommittee MemberThis CollectionBy Issue DateAuthorsTitlesSubjectsDisciplineAdvisorCommittee Member

    My Account

    Login

    Developing machine learning coding similarity indicators for C & C++ corpora

    Thumbnail
    View/Open
    KunjirA2020m-1a.pdf (3.673Mb)
    Date
    2020
    Author
    Kunjir, Ajinkya
    Metadata
    Show full item record
    Abstract
    The digital data in this modern world is vulnerable to copying, altering and claiming someone else’s work as their own. Performing the same activity in programming assignments can be referred to as source-code theft or e-plagiarism. Despite years of efforts, the already existing similarity detection engines perform pretty well in detecting plagiarism for novice programmers, but provides insufficient results when a student uses complex and smart plagiarism hacks such as word substitution, structure change, line spacing placeholder comments. This thesis research aims to deliver an assistive forensic engine named ‘SimDec’, for the evaluators to help detect similar assignments to address the aforementioned issues. The system's primary objective is to aid the assignment evaluators to get closer to the code thieves and abide by the university's dishonesty regulations. The forensic engine has been developed in Java programming language to detect C and C++ source code's similarities. The research has been split into two modules labelled as ‘software forensic engine development’ and ‘Similarity level classification with machine learning’. The proposed system has a workflow of three stages starting with lexical analysis, tokenizer customization and the final stage displaying similarity percentage and the corresponding level of ‘Low’, ‘Average’ and ‘High’. The combination of similarity algorithms integrated in the engine are Levenshtein distance, Jaro & JaroWinkler measure, Dice coefficient and Cosine similarity. The workflow of lexical analysis and implementing the set of similarity measures on token categories is defined as the first module. The machine learning algorithms selected for performing the classification task are multi-class SVM, logistic regression and a simple neural network. In this second module, the data gathered and generated by the similarity detection engine is fed to the ML algorithms to train the models and make them efficient for predicting the plagiarism or similarity level of newly entered data. This hybrid approach would be impactful in reducing the time complexity and processing speed for the software engine.
    URI
    http://knowledgecommons.lakeheadu.ca/handle/2453/4734
    Collections
    • Electronic Theses and Dissertations from 2009 [1638]

    Lakehead University Library
    Contact Us | Send Feedback

     

     


    Lakehead University Library
    Contact Us | Send Feedback