Data preparation & model development for text simplification
Sikka, Punardeep S.
Master of Science
Text simplification (TS), defined narrowly, is the process of reducing the linguistic complexity of a text while retaining its original information and meaning. More broadly, text simplification encompasses other operations: conceptual simplification, which simplifies content as well as form; elaborative modification, where redundancy and explicitness are used to emphasize key points; and omission of peripheral or inappropriate information. TS is a diverse field with a number of target audiences, such as beginner and foreign language learners and people with dyslexia or aphasia. This research provides a significant contribution towards the goal of building the first fully automated, open-source system for TS. TS originally relied on hand-crafted rules of language, an approach that is not only extremely time-consuming but also not applicable across languages. More recent techniques have automated the rule-learning process by using deep neural networks and language models. Before system development began, an extensive survey of TS was conducted to identify current limitations and research obstacles. TS can be divided into two major components: lexical simplification, which involves substituting complex words or phrases with simpler ones, and novel text generation, which generates a new, simplified version of the input text. This thesis focuses on the latter, and in particular on the data and models involved. For deep learning models to learn simplification rules automatically, a large amount of data is needed, especially in the form of complex/simple sentence pairs for training sequence-to-sequence models. The lack of existing high-quality data of this kind necessitated a focus on dataset development first. Only two sources are available from which complex/simple sentence pairs can be extracted: Regular and Simple English Wikipedia, and the Newsela corpus.
A Newsela dataset was extracted for this thesis, and models trained on it are shown to outperform models trained on any previous Newsela extraction. Three deep learning models were also developed and used to benchmark the datasets most commonly used for training TS models, quantifying the effect of each dataset. The models were then used to set state-of-the-art benchmarks using the best training datasets available. An initial version of a web application for TS, which uses one of the three developed models, was built in conjunction with other developers. With an initial system in place, this research is expected to continue, with the next steps focusing on lexical simplification and multi-language simplification.