An autoencoder and generative adversarial networks framework for multi-omics data analysis
Authors
Al-Hurani, Ibrahim
Abstract
The rapid advancement of high-throughput sequencing has generated vast multi-omics
datasets that offer unprecedented insights into complex cancer phenotypes. However, effective
integration of these modalities, including DNA methylation, gene expression, and
copy number alterations, is frequently hindered by inherent high dimensionality, significant
noise, and severe class imbalance, which collectively pose substantial challenges to
traditional statistical and machine learning approaches. This dissertation addresses these
challenges through a progressive and unified computational framework that evolves from
interpretable linear modelling to advanced deep generative architectures for robust data
integration and predictive modelling.
In the first stage, a linear framework was developed to identify menopause-related
biomarkers in breast cancer. By utilizing a systematic preprocessing pipeline with MutSigCV,
applying the Synthetic Minority Oversampling Technique (SMOTE) to address class
imbalance, and leveraging Principal Component Analysis (PCA) for dimensionality reduction,
this research successfully identified and validated biologically significant markers
including RUNX1, PTEN, MAP3K1, and CDH1. Interpretability was ensured via
Shapley-value-based explainable AI (XGBoost), demonstrating the framework’s ability to extract
clinically relevant insights.
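
The oversampling and dimensionality-reduction steps of this first stage can be illustrated with a minimal sketch. It is not the dissertation's pipeline: the function names `smote_like` and `pca_fit_transform` are hypothetical, the interpolation is a simplified NumPy rendering of the SMOTE idea rather than a call to an existing library, and the toy data, neighbour count, and number of components are assumptions chosen only for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = X_min.shape[0]
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # indices of the k nearest neighbours
    base = rng.integers(0, n, size=n_new)       # which minority sample to start from
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def pca_fit_transform(X, n_components):
    """Project X onto its leading principal components using an SVD."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    return Xc @ components.T, components, mean

# Toy imbalanced data (assumed shapes; real omics matrices are far wider).
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(90, 20))
X_minor = rng.normal(1.5, 1.0, size=(10, 20))

X_new = smote_like(X_minor, n_new=80, k=5, rng=0)   # balance the two classes
X_bal = np.vstack([X_major, X_minor, X_new])
y_bal = np.array([0] * 90 + [1] * 90)
Z, components, mean = pca_fit_transform(X_bal, n_components=5)
```

In practice, oversampling and the PCA fit would normally be restricted to the training split so that no information leaks into the evaluation data.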
Recognizing the limitations of linear methods in capturing complex nonlinear relationships,
the second stage introduced a deep learning-based framework integrating an autoencoder
(AE) with a Generative Adversarial Network (GAN). This integration enabled the
learning of a compact, nonlinear latent representation while simultaneously synthesizing
realistic minority class samples to improve model generalization. The proposed AE–GAN
framework achieved marked performance improvements, with classification accuracies of
88.82% for bladder cancer and 95.09% for breast cancer. Building on these findings, the final
stage of this dissertation proposes a novel architecture-level integration of an AE with a Conditional
Tabular Generative Adversarial Network (CTGAN). Unlike conventional approaches
that generate synthetic data in the original feature space, this method trains CTGAN directly
within a shared latent space, enabling the generation of high-fidelity synthetic samples that preserve the intrinsic biological structure of the data.
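
The data flow of latent-space generation (encode, fit a generator on the latent codes, sample, decode) can be sketched as follows. This is only an illustration under loud assumptions: a linear encoder/decoder obtained from an SVD stands in for the trained autoencoder, and a simple Gaussian sampler stands in for CTGAN. The class names `LinearAutoencoder` and `GaussianLatentGenerator`, the latent dimension, and the toy data are all hypothetical and do not reproduce the dissertation's architecture.

```python
import numpy as np

class LinearAutoencoder:
    """Stand-in for a trained AE: linear encoder/decoder from a truncated SVD."""
    def __init__(self, latent_dim):
        self.latent_dim = latent_dim

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.W_ = Vt[:self.latent_dim].T        # shape (n_features, latent_dim)
        return self

    def encode(self, X):
        return (X - self.mean_) @ self.W_

    def decode(self, Z):
        return Z @ self.W_.T + self.mean_

class GaussianLatentGenerator:
    """Stand-in for CTGAN: fits a Gaussian to latent codes and samples from it."""
    def fit(self, Z):
        self.mu_ = Z.mean(axis=0)
        # Small ridge term keeps the covariance well conditioned.
        self.cov_ = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
        return self

    def sample(self, n, rng=None):
        rng = np.random.default_rng(rng)
        return rng.multivariate_normal(self.mu_, self.cov_, size=n)

# Toy imbalanced data: 85 majority and 15 minority samples, 50 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))
y = np.array([0] * 85 + [1] * 15)
X[y == 1] += 2.0

ae = LinearAutoencoder(latent_dim=8).fit(X)      # 1. learn the latent space
Z_min = ae.encode(X[y == 1])                     # 2. encode minority samples
gen = GaussianLatentGenerator().fit(Z_min)       # 3. fit the generator in latent space
Z_syn = gen.sample(70, rng=2)                    # 4. sample synthetic latent codes
X_syn = ae.decode(Z_syn)                         # 5. decode back to feature space
```

The point of the sketch is the ordering of the steps: the generator never sees the high-dimensional feature space, only the compact codes produced by the encoder.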
Extensive evaluation demonstrates that the AE–CTGAN framework outperforms the
earlier models, achieving near-perfect accuracies of 99.29% for bladder
cancer and 97.48% for breast cancer. Furthermore, fidelity analysis reveals that latent-space
generation reduced the average Euclidean distance between real and synthetic samples by
up to 84% compared to standard GANs. Overall, this research contributes a robust
and scalable methodology for predicting cancer outcomes, supporting the development of
personalized treatment strategies in precision medicine.
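
A distance-based fidelity comparison of this kind can be sketched as below. The abstract does not say how real and synthetic samples are paired, so this sketch assumes the mean distance from each synthetic sample to its nearest real sample. The function names `mean_nn_distance` and `relative_reduction` and the toy points are hypothetical.

```python
import numpy as np

def mean_nn_distance(real, synthetic):
    """Mean Euclidean distance from each synthetic sample to its nearest real sample.
    (Assumed pairing rule; the source does not specify one.)"""
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Toy example: two real points and two candidate sets of synthetic points.
real = np.array([[0.0, 0.0], [1.0, 0.0]])
syn_baseline = np.array([[0.0, 3.0], [1.0, 4.0]])   # far from the real data
syn_latent = np.array([[0.0, 0.5], [1.0, 0.5]])     # close to the real data

d_baseline = mean_nn_distance(real, syn_baseline)
d_latent = mean_nn_distance(real, syn_latent)
reduction_pct = relative_reduction(d_baseline, d_latent)
```

A smaller distance indicates that synthetic samples sit closer to the real data, and the relative reduction expresses the kind of percentage improvement reported above.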
Future work will focus on adapting the framework to multi-class and longitudinal omics
data, integrating attention-based or transformer architectures to improve interpretability, and
validating the approach on prospective clinical cohorts to assess real-world generalizability.
The proposed AE–CTGAN pipeline also holds promise beyond oncology, with potential
applications in other multimodal biomedical domains such as neurodegenerative disease
profiling, pharmacogenomics, and rare disease diagnosis, where high dimensionality and
class imbalance are similarly pervasive. Ultimately, this dissertation establishes a foundation
for robust, scalable, and fidelity-evaluated generative modelling in multi-omics research,
contributing to the broader goal of precision medicine.
