Causal discovery and treatment effect modeling in breast cancer
Abstract
Modeling breast cancer outcomes remains challenging because of extreme molecular heterogeneity
and the inability of associative models, including those developed through traditional
machine learning, to support counterfactual, intervention-based clinical reasoning. Building on
recent advances in causal feature selection, multiomics variable selection, and individual treatment
effect estimation, this thesis proposes a hybrid pipeline within a unified computational
multiomics framework that integrates high-dimensional data with causal modeling to produce
interpretable precision oncology models that extend beyond risk prediction.
The proposed pipeline was developed using the TCGA-BRCA cohort as the discovery set and
validated on the independent retrospective METABRIC cohort to assess transportability. To
address the curse of dimensionality, the framework applies Markov Blanket-based local causal
discovery across seven data modalities and reduces more than 600,000 initial features to a sparse
and stable causal core. This causal representation is then used for survival modeling (C-index =
0.8085, 5-year AUC = 0.8676) and individual treatment effect (ITE) estimation for chemotherapy,
hormone therapy, and targeted therapy. External validation on METABRIC achieved a
C-index of 0.7200 and a 5-year AUC of 0.7639, indicating moderate but clear transportability
across cohorts and assay platforms. The final causal core combined clinical,
proteomic, and epigenetic signals and identified a long non-coding RNA as a structurally
relevant driver.
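For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index on right-censored data. It is not the thesis implementation: the function name `concordance_index` and the four-patient toy values are illustrative assumptions, and tied event times are simply skipped.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i has an observed event and
    a strictly shorter follow-up time than patient j. Higher risk scores
    are assumed to mean shorter expected survival.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort (not thesis data): follow-up time, event flag, risk.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.7, 0.8, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for comparing the discovery-cohort and METABRIC figures.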
The treatment-effect stage used treatment-specific arm definitions reconstructed from clinical
records together with a robustness-oriented validation protocol. Chemotherapy showed
the strongest and most stable beneficial treatment effect, most notably in the TNBC subgroup,
where treatment-effect estimates remained consistently protective across estimators and
overlap-adjusted variants. Hormone-therapy estimates showed a consistently protective direction
in receptor-positive subgroup analyses, although the magnitude of the effect was attenuated
under stricter overlap control, indicating residual confounding and limited positivity in the observational
setting. Targeted therapy also showed a protective direction under most evaluated
estimators, but given the very small number of treated patients and partial estimator disagreement,
these effect estimates should be interpreted as exploratory.
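The sensitivity of treatment-effect estimates to overlap control, noted above, can be illustrated with a small sketch. It assumes a normalized inverse-probability-weighted (Hajek) estimator with propensity trimming, which is one common overlap adjustment and not necessarily the one used in the thesis. The function name `ipw_effect`, the `trim` parameter, and all toy values are hypothetical.

```python
def ipw_effect(outcomes, treated, propensity, trim=0.0):
    """Normalized IPW estimate of E[Y(1)] - E[Y(0)].

    Units whose propensity score lies outside [trim, 1 - trim] are
    dropped, a simple way to enforce overlap (positivity).
    """
    num1 = den1 = num0 = den0 = 0.0
    for y, t, e in zip(outcomes, treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly in (0, 1)")
        if e < trim or e > 1.0 - trim:
            continue  # poor overlap: exclude from the estimate
        if t:
            w = 1.0 / e
            num1 += w * y
            den1 += w
        else:
            w = 1.0 / (1.0 - e)
            num0 += w * y
            den0 += w
    if den1 == 0.0 or den0 == 0.0:
        raise ValueError("an arm is empty after trimming")
    return num1 / den1 - num0 / den0


# Hypothetical toy data (not thesis data): outcome 1 = event, so a
# negative difference points in the protective direction.
toy_outcomes = [0, 0, 1, 1, 1, 0, 1, 0]
toy_treated = [1, 1, 1, 0, 0, 0, 1, 0]
toy_propensity = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.05, 0.95]

effect_untrimmed = ipw_effect(toy_outcomes, toy_treated, toy_propensity)
effect_trimmed = ipw_effect(toy_outcomes, toy_treated, toy_propensity, trim=0.1)
```

In this toy example two units with extreme propensity scores dominate the untrimmed estimate, so the two variants disagree. Comparing estimates across trimming thresholds is one way to judge how much a result depends on regions of limited positivity.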
Description
The thesis is embargoed until May 15, 2027.
