    Extracting specific text from documents using machine learning algorithms

    BudhirajaS2018m-1a.pdf (15.07Mb)
    Date
    2018
    Author
    Budhiraja, Sahib Singh
    Abstract
    The increasing use of Portable Document Format (PDF) files has promoted research into analyzing file layout for text extraction. It is therefore important to have a system that can analyze these documents and extract the required text. This research addresses that need by extracting specific text from PDF documents while taking the document layout into account. The approach is used to extract learning outcomes from academic course outlines. Our algorithm combines a supervised learning algorithm with white space analysis: the supervised algorithm locates the relevant text, and white space analysis then determines the document layout before extraction. The supervised learning approach detects relevant text by looking for relevant headings, which mimics how humans read through a document. The data set used for this research consists of 500 course outlines randomly sampled from the internet. To show that our text detection algorithm works with documents other than course outlines, it is also tested on 25 reports and articles sampled from the internet. The implemented system shows promising results, with an accuracy of 81.8%, and addresses a limitation of the current literature by supporting documents of unknown format. The algorithm has a wide scope of applications and takes a step towards automating text extraction from PDF documents.
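The two stages named in the abstract (locating a relevant heading, then using white space to find the text that belongs to it) can be illustrated with a minimal sketch. This is not the thesis implementation. The input format (one `(top, bottom, text)` tuple per text line), the function names `split_blocks` and `extract_section`, the `factor` threshold, and the keyword match that stands in for the supervised heading classifier are all assumptions made for illustration.

```python
from statistics import median


def split_blocks(lines, factor=1.5):
    """Group text lines into blocks, breaking where the vertical gap is unusually large.

    `lines` is a hypothetical input: (top, bottom, text) tuples in reading
    order, with y-coordinates growing down the page.
    """
    if not lines:
        return []
    # Vertical white space between each line and the next one.
    gaps = [max(0.0, nxt[0] - cur[1]) for cur, nxt in zip(lines, lines[1:])]
    # Assumed heuristic: a gap well above the median gap separates two blocks.
    threshold = factor * median(gaps) if gaps else 0.0
    blocks, current = [], [lines[0][2]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > threshold:
            blocks.append(current)
            current = []
        current.append(line[2])
    blocks.append(current)
    return blocks


def extract_section(lines, keywords=("learning outcomes",)):
    """Return the text lines that follow the first heading matching a keyword.

    The keyword test is a stand-in for a trained heading classifier.
    """
    blocks = split_blocks(lines)
    for i, block in enumerate(blocks):
        if any(k in block[0].lower() for k in keywords):
            if len(block) > 1:
                # Heading sits directly on top of its body text.
                return block[1:]
            # Heading stands alone; take the next block if there is one.
            return blocks[i + 1] if i + 1 < len(blocks) else []
    return []


page = [
    (0, 10, "Course Description"),
    (12, 22, "An introduction to databases."),
    (40, 50, "Learning Outcomes"),
    (70, 80, "Design a relational schema."),
    (82, 92, "Write SQL queries."),
]
outcomes = extract_section(page)
```

In this sketch a real system would replace the keyword test with a classifier trained on labelled headings, and would take line coordinates from a PDF parsing library rather than from a hand-written list.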
    URI
    http://knowledgecommons.lakeheadu.ca/handle/2453/4288
    Collections
    • Electronic Theses and Dissertations from 2009 [1632]

    Lakehead University Library
    Contact Us | Send Feedback