A hybrid framework for weak signal learning in breast cancer prediction using metabolomics data
Authors
Fang, Jiahui
Abstract
Clinical mass spectrometry (MS)-based metabolomics prediction in small cohorts is often constrained by weak
class separation, class imbalance, and heterogeneous sample reliability. Under these
conditions, predictive performance is limited not by a single factor, but by the combined
effects of unstable feature structure, limited minority class support, and unequal learning
difficulty across samples. Existing methods have addressed some of these challenges
separately, but a unified framework for stable learning under weak-signal conditions
remains insufficiently developed.
This thesis studies weak-signal clinical metabolomics prediction as a structured learning
problem rather than a standard supervised classification task. To address this setting,
a unified and fold-disciplined framework is developed that integrates transformer representation
learning, conditional generative adversarial network (cGAN) augmentation,
and curriculum learning (CL) within stratified cross-validation (CV). The framework is
designed to provide a more stable representation space, strengthen minority class support
during training, and organize training in a way that better reflects variation in
sample reliability.
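
As an illustration only, the following minimal sketch reads "fold-disciplined" as meaning that every data-dependent step, including minority augmentation, is fitted on the training portion of each stratified fold and never touches the held-out portion. This reading is an assumption, not a description of the thesis code. The helper names (`stratified_folds`, `augment_minority`, `fold_disciplined_splits`) are hypothetical, and jittered resampling of minority rows is a simple stand-in for the cGAN, which is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs that roughly preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold gets its share.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def augment_minority(X, y, seed=0):
    """Stand-in for cGAN augmentation (assumption): resample minority rows
    with small Gaussian jitter until all classes have equal counts."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for label in y:
        counts[label] += 1
    target = max(counts.values())
    X_aug, y_aug = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            base = rng.choice(pool)
            X_aug.append([v + rng.gauss(0.0, 0.01) for v in base])
            y_aug.append(label)
    return X_aug, y_aug


def fold_disciplined_splits(X, y, k=5, seed=0):
    """Augment only the training part of each fold; test rows stay untouched."""
    for fold, (train, test) in enumerate(stratified_folds(y, k, seed)):
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        X_tr, y_tr = augment_minority(X_tr, y_tr, seed + fold)
        X_te = [X[i] for i in test]
        y_te = [y[i] for i in test]
        yield X_tr, y_tr, X_te, y_te
```

The point of the design is that synthetic samples are generated after the split, so no information from a held-out sample can leak into the data used to train or augment for that fold.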
The proposed framework is evaluated on two breast cancer-related metabolomics datasets
with different signal conditions. ST004145 is used as the primary weak-signal dataset,
while ST000355 is used as a strong-signal stability-check dataset. On ST004145, the
full hybrid model achieved the highest mean Area Under the ROC Curve (AUC) among
the compared methods (0.6794 ± 0.0871). Ablation analysis further indicated that both
cGAN minority support and CL difficulty-aware training contributed to the final performance
pattern. On ST000355, performance differences between models were much
smaller, although the proposed model remained highly competitive, with an AUC of
0.9896 ± 0.0195.
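
For readers unfamiliar with the "mean ± spread" reporting format used above, the sketch below shows one common way such a summary is computed from per-fold AUC values. It is an assumption that the reported spread is a sample standard deviation across folds; the abstract does not state the number of folds or the estimator. The function names `roc_auc` and `summarize` are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition for binary labels coded 0 and 1.

```python
from statistics import mean, stdev


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for s, label in zip(scores, y_true) if label == 1]
    neg = [s for s, label in zip(scores, y_true) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def summarize(fold_aucs):
    """Return (mean, sample standard deviation) over per-fold AUC values."""
    return mean(fold_aucs), stdev(fold_aucs)
```

With only a handful of small folds, the spread term can be large relative to differences between models, which is worth keeping in mind when comparing mean AUC values.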
These findings suggest that the value of the proposed framework is most evident under
weak-signal conditions, where predictive robustness depends on addressing multiple
interacting sources of instability within a single training design. This thesis therefore
contributes a more structured methodological perspective on weak-signal clinical
metabolomics prediction and supports the usefulness of a unified, fold-disciplined learning
framework in small, class-imbalanced clinical cohorts.
