An autoencoder and generative adversarial networks framework for multi-omics data analysis

Authors

Al-Hurani, Ibrahim

Abstract

The rapid advancement of high-throughput sequencing has generated vast multi-omics datasets that offer unprecedented insights into complex cancer phenotypes. However, effective integration of these modalities, including DNA methylation, gene expression, and copy number alterations, is frequently hindered by high dimensionality, significant noise, and severe class imbalance, which collectively pose substantial challenges to traditional statistical and machine learning approaches. This dissertation addresses these challenges through a progressive and unified computational framework that evolves from interpretable linear modelling to advanced deep generative architectures for robust data integration and predictive modelling. In the first stage, a linear framework was developed to identify menopause-related biomarkers in breast cancer. By using a systematic preprocessing pipeline with MutSigCV, applying the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance, and leveraging Principal Component Analysis (PCA) for dimensionality reduction, this research identified and validated biologically significant markers including RUNX1, PTEN, MAP3K1, and CDH1. Interpretability was ensured via Shapley value-based explainable AI (XGBoost), demonstrating the framework’s ability to extract clinically relevant insights. Recognizing the limitations of linear methods in capturing complex nonlinear relationships, the second stage introduced a deep learning-based framework integrating an autoencoder (AE) with a generative adversarial network (GAN). This integration enabled the learning of a compact, nonlinear latent representation while simultaneously synthesizing realistic minority-class samples to improve model generalization. The proposed AE–GAN framework achieved marked performance improvements, with classification accuracies of 88.82% for bladder cancer and 95.09% for breast cancer.
Building on these findings, the final stage of this dissertation proposes a novel architecture-level integration of an AE with a Conditional Tabular Generative Adversarial Network (CTGAN). Unlike conventional approaches that generate synthetic data in the original feature space, this method trains CTGAN directly within a shared latent space, enabling the generation of high-fidelity synthetic samples that preserve the intrinsic biological structure of the data. Extensive evaluation demonstrates that the AE–CTGAN framework outperforms the earlier models, achieving near-perfect accuracies of 99.29% for bladder cancer and 97.48% for breast cancer. Furthermore, fidelity analysis reveals that latent-space generation reduced the average Euclidean distance between real and synthetic samples by up to 84% compared to standard GANs. Overall, this research contributes a robust and scalable methodology for predicting cancer outcomes, supporting the development of personalized treatment strategies in precision medicine. Future work will focus on adapting the framework to multi-class and longitudinal omics data, integrating attention-based or transformer architectures to improve interpretability, and validating the approach on prospective clinical cohorts to assess real-world generalizability. The proposed AE–CTGAN pipeline also holds promise beyond oncology, with potential applications in other multimodal biomedical domains such as neurodegenerative disease profiling, pharmacogenomics, and rare disease diagnosis, where high dimensionality and class imbalance are similarly pervasive. Ultimately, this dissertation establishes a foundation for robust, scalable, and fidelity-evaluated generative modelling in multi-omics research.
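
The fidelity comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how the average Euclidean distance between real and synthetic samples is computed, so the code below assumes one plausible definition: the distance from each synthetic sample to its nearest real sample, averaged over all synthetic samples. The function names and the toy two-dimensional data are hypothetical and are not taken from the dissertation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_nearest_real_distance(real: Sequence[Vector],
                               synthetic: Sequence[Vector]) -> float:
    """Average distance from each synthetic sample to its closest real sample.

    Assumed definition of the fidelity score; the source may use a
    different one (for example, mean pairwise distance).
    """
    if not real or not synthetic:
        raise ValueError("both sample sets must be non-empty")
    nearest = [min(euclidean(s, r) for r in real) for s in synthetic]
    return sum(nearest) / len(nearest)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of `candidate` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline distance must be non-zero")
    return (baseline - candidate) / baseline


# Toy data for illustration only (not results from the dissertation).
real = [(0.0, 0.0), (10.0, 0.0)]
baseline_synth = [(3.0, 4.0), (10.0, 5.0)]   # stand-in for feature-space GAN output
latent_synth = [(0.6, 0.8), (10.0, 1.0)]     # stand-in for decoded latent-space output

d_baseline = mean_nearest_real_distance(real, baseline_synth)
d_latent = mean_nearest_real_distance(real, latent_synth)
print(f"baseline: {d_baseline:.3f}, latent: {d_latent:.3f}, "
      f"reduction: {relative_reduction(d_baseline, d_latent):.1%}")
```

A lower score means the synthetic samples sit closer to the real data. A reduction such as the reported "up to 84%" would correspond to `relative_reduction` returning a value near 0.84 under this assumed definition.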
