Abstract
Objective: Accurate detection of PIK3CA mutations is essential for guiding PI3K-targeted therapies in breast cancer, yet sequencing is not universally accessible, and single-modality prediction models have limited performance. This study developed a multimodal deep learning framework integrating whole-slide imaging (WSI) and structured clinical data to improve mutation prediction.
Methods: A total of 1,047 patients from TCGA and 166 patients from 3 external centers were included. The histopathology model used a transformer-based pretrained encoder (H-optimus-0) and a clustering-constrained attention multiple instance learning (CLAM-SB MIL) classifier to generate WSI-level representations. The clinical model incorporated engineered clinical variables and an extreme gradient boosting (XGBoost) model. A decision-level late fusion strategy (Multimodal PIK3CA Model, MPM) combined probabilistic outputs from both branches. Performance was evaluated with the area under the curve (AUC) and secondary metrics. Interpretability was assessed via attention heatmaps and Shapley additive explanations (SHAP) analysis.
Results: MPM outperformed single-modality models. It achieved an AUC of 0.745 on TCGA and maintained stable performance across external cohorts (0.695, 0.690, and 0.680). SHAP analysis identified molecular subtype as the most influential clinical feature, whereas attention maps highlighted mutation-associated morphological regions.
Conclusions: The developed multimodal framework effectively integrates complementary morphological and clinical information, and provides a robust and generalizable method for predicting PIK3CA mutation status. Strong multicenter adaptability and biological interpretability support its potential use as a clinical decision-support tool and an accessible alternative to molecular testing.
Keywords
- Breast cancer
- PIK3CA mutation
- Multimodal artificial intelligence
- Whole-slide imaging
- Computational pathology
Introduction
Breast cancer is among the most common malignancies in women worldwide and remains a major global health burden. According to global cancer statistics, breast cancer accounts for nearly 30% of all cancer diagnoses in women and substantially contributes to cancer-related mortality each year1,2. Despite advances in diagnosis and treatment, the heterogeneous molecular landscape and complex biological behavior of breast cancer continue to challenge precision oncology3. Among the genomic alterations implicated in breast tumorigenesis, mutations within the PI3K/AKT/mTOR (PAM) signaling pathway are among the most prevalent4,5. Recent large-scale genomic studies have shown that approximately 62.6% of Chinese patients with breast cancer bear at least one PAM pathway mutation, and PIK3CA is the most frequently altered gene. PIK3CA mutations occur in as many as 40% of HR+/HER2− tumors and approximately 30% of HER2+ tumors6.
With the development of PI3K-targeted therapies, PIK3CA mutations have emerged as a key therapeutic biomarker. Several novel PI3K inhibitors have demonstrated promising anti-tumor activity7, and the first-in-class PI3Kα-selective inhibitor (alpelisib) has been approved by the U.S. Food and Drug Administration and European Medicines Agency for the treatment of patients with PIK3CA-mutated HR+/HER2− breast cancer8. Consequently, accurate determination of PIK3CA mutation status has become increasingly important for treatment decision-making and individualized therapy.
Traditional approaches for detecting PIK3CA mutations rely on molecular assays such as PCR and NGS. Although highly accurate, these techniques require advanced laboratory infrastructure, incur substantial cost, and depend heavily on tissue quality; therefore, their application in routine clinical practice has been limited. Deep learning models can now predict key gene mutations, such as TP53, EGFR, and KRAS, directly from routine H&E-stained whole-slide images (WSIs); this major advancement in computational pathology9 offers a scalable and cost-effective alternative. However, most existing studies have relied solely on single-modality inputs, primarily pathology images, without leveraging complementary clinical information. Multimodal learning, which integrates WSIs with clinical data, has the potential to capture tumor biology more comprehensively and improve mutation prediction accuracy.
Moreover, current Artificial Intelligence (AI) models for predicting PIK3CA mutation status were trained predominantly on single-center datasets with limited sample sizes or restricted imaging diversity. Differences in staining protocols, scanning devices, and tissue preparation across institutions often introduce domain shift and substantially limit model robustness in real-world clinical settings. Although recent studies have suggested that incorporating structured clinical variables might enhance prediction performance, large-scale, multi-center, multimodal frameworks specifically designed to predict PIK3CA mutations remain lacking.
This study integrated The Cancer Genome Atlas (TCGA) breast cancer dataset with 3 independent multi-center cohorts to develop a decision-level late-fusion multimodal model for predicting PIK3CA mutation status (Multimodal PIK3CA Model, MPM). By jointly leveraging deep representations from whole-slide histopathology images and structured clinical variables, the model demonstrated robust predictive performance across heterogeneous datasets. MPM effectively capitalizes on the complementary strengths of both modalities in generating stable and generalizable predictions, thereby providing a solid methodological and empirical foundation for subsequent algorithmic refinement, clinical translation, and real-world deployment.
Materials and methods
Data sources and patient selection
The training dataset for this study was obtained from TCGA, which provides multimodal breast cancer data including genomic profiles, clinical information, and WSIs. External validation datasets were collected from 3 independent medical centers: Hebei Medical University Fourth Hospital (HBMU), Beijing Xuanwu Hospital of Capital Medical University (CMU), and West China Hospital of Sichuan University (SCU). The external cohorts consisted of core needle biopsy specimens from patients with breast cancer.
The inclusion criteria were (1) histologically confirmed primary breast cancer; (2) available PIK3CA mutation status; (3) high-quality WSIs suitable for downstream analysis; and (4) complete clinical information. The exclusion criteria were (1) absence of PIK3CA mutation results; (2) poor-quality WSIs (e.g., severe blurring, artifacts, or incomplete tissue); (3) diagnosis of carcinoma in situ; and (4) intraoperative frozen-section slides.
A total of 1,047 patients with 1,112 WSIs were included in the TCGA training cohort. The external validation cohorts consisted of 70 patients with 175 WSIs from HBMU, 52 patients with 52 WSIs from CMU, and 44 patients with 44 WSIs from SCU. Clinical variables included demographic and pathological characteristics, such as age, molecular subtype, lymph node status, and American Joint Committee on Cancer (AJCC) stage10. All enrolled patients in the external validation cohorts had complete information for the core variables required by the model. External datasets were entirely withheld from model development and used exclusively for independent performance evaluation.
Histopathology image preprocessing and feature extraction
To ensure that only high-quality tissue regions were analyzed, we subjected WSIs to a standardized preprocessing workflow. Tissue segmentation was performed with a combination of the Clustering-constrained Attention Multiple Instance Learning (CLAM)11 framework and an HSV color-space thresholding approach to exclude background, debris, and low-quality regions. Tile extraction was performed with a fixed tile size of 224 × 224 pixels, and all tiles were stored in HDF5 format.
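The HSV-based filtering step described above can be sketched as follows. This is a minimal illustration only: the saturation statistic and threshold value are assumptions for demonstration, and the actual CLAM pipeline applies its thresholding to downsampled slide thumbnails rather than individual tiles.

```python
import numpy as np

def tissue_mask_hsv(rgb_tile, sat_thresh=0.08):
    """Flag a tile as tissue if its mean HSV saturation exceeds a threshold.

    rgb_tile: H x W x 3 float array in [0, 1].
    The threshold is illustrative; the study's exact value is not reported.
    """
    r, g, b = rgb_tile[..., 0], rgb_tile[..., 1], rgb_tile[..., 2]
    cmax = np.maximum(np.maximum(r, g), b)
    cmin = np.minimum(np.minimum(r, g), b)
    # HSV saturation: ~0 for achromatic white background, high for stained tissue
    sat = np.where(cmax > 0, (cmax - cmin) / np.maximum(cmax, 1e-8), 0.0)
    return float(sat.mean()) > sat_thresh

# A white background tile vs. a pink (eosin-like) tissue tile, 224 x 224 each
background = np.ones((224, 224, 3))
tissue = np.stack([np.full((224, 224), 0.9),
                   np.full((224, 224), 0.5),
                   np.full((224, 224), 0.7)], axis=-1)
```

Tiles passing this kind of mask would then be cropped at 224 × 224 pixels and written to HDF5 for feature extraction.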
Five state-of-the-art pathology foundation models were systematically evaluated as feature encoders, representing diverse architectures and pretraining paradigms:
UNI-V212: Vision Transformer (ViT) models trained via DINOv2 self-supervised learning (1,536-d features)
CTransPath13: Hybrid CNN–Swin Transformer encoder trained with contrastive learning (768-d features)
Virchow214: ViT model combining convolution and global attention to capture multiscale context (2,560-d features)
H-optimus-015: Efficient pathology-specific ViT encoder (1,536-d features)
CONCH-V1.516: Vision-language foundation model specialized for pathology (768-d features)
These encoders were benchmarked on the TCGA dataset to determine the optimal feature extractor for downstream WSI-level mutation prediction.
Architecture of the multimodal PIK3CA model
Herein, we introduce the MPM, an artificial intelligence framework designed to predict PIK3CA mutation status in breast cancer by leveraging heterogeneous multimodal inputs. The MPM uses a decision-level fusion mechanism to combine deep morphological representations from WSIs with clinically relevant variables. The framework is composed of 2 distinct predictive components that jointly form its core:
A histopathology model dedicated to processing WSIs: We used a multiple instance learning pipeline built upon a pretrained Transformer-based architecture to extract high-resolution morphological patterns associated with PIK3CA mutations. The model outputs an independent probability prediction for each case.
A clinical model designed to analyze structured clinical variables, including age, molecular subtype, lymph node status, and AJCC stage: We used machine learning algorithms to model the nonlinear relationships between these clinical attributes and PIK3CA mutation status, thereby yielding an independent probability prediction for each case.
During the training phase, both models were optimized independently to maximize their respective predictive capabilities. In the inference phase, their output probabilities were combined through our decision-level fusion module, thus yielding a unified and robust prediction of PIK3CA mutation status.
Construction of the histopathology model
The histopathology model used a two-stage weakly supervised learning pipeline. The framework first used the pathology foundation model H-optimus-0, pretrained on large-scale histopathology image datasets, to learn deep feature representations. These features were then passed to an attention-based multiple instance learning classifier, CLAM-SB11, which performed slide-level prediction of PIK3CA mutation status.
Given a high-resolution WSI, we first divided it into a sequence of N non-overlapping image patches {p_1, p_2, …, p_N}. Each patch p_i was then processed through the H-optimus-0 model via forward propagation to obtain its high-dimensional deep feature representation:

h_i = f_H-optimus-0(p_i; θ_pre) ∈ ℝ^d

Here, f_H-optimus-0 denotes the forward computation function of the pretrained model, θ_pre represents its fixed parameters, and d is the dimensionality of the feature vector. This step converted the large-scale WSI into a set of deep feature vectors H = {h_1, h_2, …, h_N}, and effectively encoded the morphological information of local regions.
Subsequently, the feature set H was fed into the CLAM-SB classification model. This model uses an attention network to compute the importance weight of each image patch with respect to the final classification decision. Its core mechanism involves spatially encoding the features and calculating the corresponding attention scores:

a_i = exp{v^⊤ tanh(W h_i + b)} / Σ_{j=1}^{N} exp{v^⊤ tanh(W h_j + b)}

Here, W ∈ ℝ^{d×d}, b ∈ ℝ^d, and v ∈ ℝ^d are learnable parameters, and a_i denotes the normalized attention score for the i-th image patch. The attention-weighted average of all image patch features forms the bag-level representation of the entire WSI:

z = Σ_{i=1}^{N} a_i h_i
Finally, the WSI-level representation z is passed into a fully connected classification layer to produce the predicted probability of PIK3CA mutation:

ŷ = σ(W_c z + b_c)

where σ is the sigmoid activation function, and W_c and b_c are the parameters of the classification layer.
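The attention-based aggregation described above can be sketched in numpy as follows. The weights here are random and untrained, purely for illustration; the actual CLAM-SB model uses gated attention with trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 1536  # patches per slide; H-optimus-0 feature dimension

H = rng.normal(size=(N, d))          # patch features h_i from the frozen encoder
W = rng.normal(size=(d, d)) * 0.01   # learnable projection (untrained here)
b = np.zeros(d)
v = rng.normal(size=d) * 0.01        # learnable attention vector
Wc = rng.normal(size=d) * 0.01       # classification-layer weights
bc = 0.0

# Attention scores: a_i = softmax_i(v^T tanh(W h_i + b))
scores = np.tanh(H @ W + b) @ v
a = np.exp(scores - scores.max())
a /= a.sum()

z = a @ H                                      # bag-level representation sum_i a_i h_i
p_mut = 1.0 / (1.0 + np.exp(-(Wc @ z + bc)))   # sigmoid classification head
```

The attention weights a sum to 1, so z is a convex combination of patch features, and p_mut is the slide-level mutation probability.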
This two-stage architecture has the advantage of operating without pixel-level or region-level annotations. Instead, it relies solely on weak WSI-level labels to automatically identify morphologically informative regions associated with genetic mutations and to perform end-to-end prediction. This design provides an efficient and interpretable solution for molecular biomarker detection in digital pathology.
Construction of the clinical model
The clinical model included 4 fundamental variables: age, molecular subtype, lymph node status, and AJCC stage. To construct the clinical data prediction model, we used the extreme gradient boosting (XGBoost) algorithm, which has demonstrated outstanding performance in processing tabular data. By sequentially constructing a series of decision trees to iteratively correct the prediction residuals of preceding models and incorporating regularization terms to effectively control model complexity, XGBoost achieves remarkable predictive accuracy, high computational efficiency, and robust resistance to overfitting.
Given a clinical dataset with n samples, D = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ ℝ^m denotes an m-dimensional feature vector and y_i ∈ {0, 1} represents the PIK3CA mutation status label, the XGBoost model produces its prediction as an additive combination of K decision trees:

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ Ƒ

Here, ŷ_i denotes the predicted output, f_k represents the k-th decision tree, and Ƒ denotes the space of all possible decision trees. The objective function of the model consists of a loss term and a regularization term:

Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k),  Ω(f) = γT + (1/2)λ‖w‖² + α‖w‖₁
Here, l(yi, ŷi) denotes the cross-entropy loss function. The regularization term includes T, the number of leaf nodes in the tree; w, the weight of leaf nodes; γ, controlling the tree structure complexity; λ, the L2 regularization coefficient; and α, the L1 regularization coefficient.
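As an illustrative check of the objective described above, the sketch below evaluates the cross-entropy loss plus the γ, λ, and α penalties over hypothetical leaf weights; all numeric values are invented, and XGBoost itself computes this internally during training.

```python
import numpy as np

def xgb_objective(y, y_hat, leaf_weights, gamma=0.1, lam=1.0, alpha=0.1):
    """Regularized objective: cross-entropy loss plus tree-complexity penalties.

    leaf_weights: list of per-tree leaf-weight arrays (one array per tree).
    Hyperparameter values are illustrative.
    """
    eps = 1e-12
    loss = -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
    reg = sum(gamma * w.size                   # gamma * T: number of leaves
              + 0.5 * lam * np.sum(w ** 2)     # L2 penalty on leaf weights
              + alpha * np.sum(np.abs(w))      # L1 penalty on leaf weights
              for w in leaf_weights)
    return loss + reg

y = np.array([1, 0, 1, 1])                      # toy labels
y_hat = np.array([0.8, 0.3, 0.6, 0.9])          # toy predicted probabilities
trees = [np.array([0.2, -0.1]), np.array([0.05, 0.1, -0.2])]  # toy leaf weights
obj = xgb_objective(y, y_hat, trees)
```

Adding trees (or larger leaf weights) strictly increases the regularization term, which is how the γ, λ, and α terms discourage overly complex ensembles.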
Construction of the multimodal PIK3CA model
To integrate the complementary information provided by WSI and clinical data for predicting PIK3CA mutations in breast cancer, we developed a late decision-level fusion multimodal learning framework, referred to as the MPM (Figure 1). The core idea of this framework is using the prediction outputs from 2 independent branches, the histopathology model and the clinical model, and fusing them through a shallow classifier. Each branch produces a binary probability vector, wherein the 2 elements represent the predicted probabilities of the sample being PIK3CA wild-type or mutant, respectively, with a sum equal to 1. The implementation of MPM consists of 2 major stages: training and inference.
Training phase
This phase constructs a multimodal fusion training set and trains the logistic regression fusion module. The detailed steps are as follows:
Histopathology model: For the i-th patient in the training set, if the patient has S_i WSIs, each slide s is processed independently through the histopathology model to obtain a binary probability vector:

p_img^{(i,s)} = [p_wt^{(i,s)}, p_mut^{(i,s)}],  s = 1, …, S_i
The final histopathology model representation for the patient is computed as the arithmetic mean of all slide-level probability vectors:

p_img^{(i)} = (1/S_i) Σ_{s=1}^{S_i} p_img^{(i,s)}
This aggregation assumes that each slide contributes equally to the final prediction and is designed to reduce random errors from individual slides through ensemble averaging, thereby producing a more robust case-level representation.
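As a toy illustration of this equal-weight slide averaging (the probability values are hypothetical):

```python
import numpy as np

# Slide-level [P(wild-type), P(mutant)] vectors for one patient with 3 WSIs
slide_probs = np.array([[0.40, 0.60],
                        [0.30, 0.70],
                        [0.50, 0.50]])

# Case-level representation: arithmetic mean across the patient's slides
patient_prob = slide_probs.mean(axis=0)
```

Because each slide-level vector sums to 1, the averaged case-level vector also sums to 1 and remains a valid probability vector.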
Clinical model: For the same patient i, the clinical model directly outputs a binary probability vector based on the structured clinical features:

p_clin^{(i)} = [p_wt^{(i)}, p_mut^{(i)}]
MPM: The multimodal fusion feature for patient i is formed by concatenating p_img^{(i)} and p_clin^{(i)}:

z^{(i)} = [p_img^{(i)}; p_clin^{(i)}] ∈ ℝ^4

All fusion feature vectors {z^{(i)}} and their corresponding ground-truth labels {y^{(i)}} are used to train a logistic regression model, which learns the optimal fusion weights and bias.

Inference phase
For a new sample, we first obtain its image-based and clinical-based probability vectors with the trained image and clinical branch models. After the same procedure as in the training phase, the multimodal fusion feature is constructed and is then fed into the trained logistic regression fusion module to produce the final prediction:

y_fusion = σ(Wz + b)

where W and b are the weight matrix and bias vector learned during training, and σ is the sigmoid function. The output y_fusion represents the final probability predicted by the MPM.
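The fusion inference step can be sketched as follows. The fusion weights and bias here are hypothetical stand-ins for the parameters a trained logistic regression would learn, and the branch probabilities are invented.

```python
import numpy as np

def fuse(p_img, p_clin, W, b):
    """Decision-level fusion: logistic regression over concatenated
    branch probabilities. W and b stand in for trained parameters."""
    z = np.concatenate([p_img, p_clin])         # 4-d fusion feature
    return 1.0 / (1.0 + np.exp(-(W @ z + b)))   # P(mutant)

p_img = np.array([0.35, 0.65])   # histopathology branch [P(wt), P(mut)]
p_clin = np.array([0.45, 0.55])  # clinical branch [P(wt), P(mut)]
W = np.array([-1.2, 1.2, -0.8, 0.8])  # hypothetical learned fusion weights
p_fused = fuse(p_img, p_clin, W, b=0.0)
```

With weights of this sign pattern, the fused score rises when either branch assigns higher mutant probability, which is the intended complementarity of the two modalities.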
Experimental settings
All experiments in this study were conducted under a unified experimental setup. The hardware platform was equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM). The software environment was built on Python 3.8 and PyTorch 2.0.0, with parallel computing acceleration enabled via CUDA 11.8. All model development and training were performed in the Ubuntu 20.04 operating system.
To ensure rigor and reproducibility, we used a standardized 5-fold cross-validation scheme. In each fold, patient-level stratified random sampling was used to partition the dataset into training (80%), validation (10%), and test (10%) subsets, ensuring balanced class distributions across splits. The validation set was used for monitoring training progress and early stopping, whereas the test set was reserved for final performance evaluation.
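The patient-level stratified partition can be sketched as below. This is a simplified single-split sketch under stated assumptions: the study repeats such splits within 5-fold cross-validation, and the seed and toy cohort are illustrative.

```python
import random

def stratified_patient_split(patient_labels, seed=42):
    """Patient-level 80/10/10 stratified split.

    All WSIs of a patient follow that patient into one subset, preventing
    leakage across splits. Proportions follow the study; seed is illustrative.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in set(patient_labels.values()):
        ids = sorted(pid for pid, y in patient_labels.items() if y == label)
        rng.shuffle(ids)
        n_train = int(0.8 * len(ids))
        n_val = int(0.1 * len(ids))
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]
    return train, val, test

labels = {f"pt{i}": (i % 3 == 0) for i in range(100)}  # toy cohort, ~1/3 mutant
train, val, test = stratified_patient_split(labels)
```

Splitting within each label stratum keeps the mutant/wild-type ratio approximately equal across the three subsets.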
For the histopathology model, the CLAM-SB architecture was used to process WSI features. Input features with a dimensionality of 1,536 were refined and aggregated through a gated attention network. Model weights were initialized with the Xavier method, and Dropout (P = 0.25) was applied to enhance generalization. Training was performed with the Adam optimizer with an initial learning rate of 1 × 10−3 and a weight decay of 1 × 10−5. The model was trained in a whole-slide-per-batch manner for as many as 200 epochs, and early stopping was triggered after 30 epochs without improvement. Key CLAM-SB hyperparameters included a bag-level loss weight of 0.7 and 8 instances sampled per bag (balanced between positive and negative examples). L2 regularization was applied during training to prevent overfitting, and random seeds were fixed to ensure reproducibility.
For the clinical model, the XGBoost ensemble learning model was used to process features including age, lymph node metastasis status, molecular subtype, and AJCC stage. The model consisted of 100 decision trees with a maximum depth of 3, a learning rate of 0.1, and subsampling ratios of 80% for both instances and features, to mitigate overfitting. L1 (α = 0.1) and L2 (λ = 1.0) regularization were applied. To address class imbalance, the model automatically adjusted sample weights based on the positive-negative ratio.
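Expressed as xgboost-style keyword arguments, the clinical-branch configuration described above would look roughly like the following. Parameter names follow the XGBoost scikit-learn API; scale_pos_weight is shown as a placeholder, since the study computes it from the actual negative-to-positive class ratio.

```python
# Sketch of the clinical-model configuration; scale_pos_weight is a
# placeholder value, computed in practice from the training-set class ratio.
clinical_params = {
    "n_estimators": 100,       # 100 decision trees
    "max_depth": 3,            # maximum tree depth
    "learning_rate": 0.1,
    "subsample": 0.8,          # 80% instance subsampling
    "colsample_bytree": 0.8,   # 80% feature subsampling
    "reg_alpha": 0.1,          # L1 regularization (alpha)
    "reg_lambda": 1.0,         # L2 regularization (lambda)
    "scale_pos_weight": 1.5,   # placeholder: n_negative / n_positive
}
```

These keyword arguments would be passed to an XGBoost classifier constructor together with the 4 structured clinical features.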
Throughout the study, all random partitions and initializations were controlled by a fixed random seed to ensure stability and comparability of results. All training procedures were executed in the described hardware and software environment, thereby providing a reliable foundation for subsequent internal validation and external generalization assessments.
Evaluation metrics
The primary evaluation metric in this study was the area under the curve (AUC). In addition, we used the accuracy (ACC), F1 score (F1), sensitivity (SEN), and specificity (SPEC) to provide a more comprehensive assessment of model performance. All metrics were evaluated with the bootstrap method (1,000 resamples) to calculate 95% confidence intervals (CIs), thereby quantifying the statistical stability of the evaluation results. To assess the significance of differences in AUC among models, we used the DeLong test. Furthermore, Platt scaling was applied to calibrate the model-predicted probabilities, and calibration curves were plotted to assess their agreement with the observed event rates. Decision curve analysis (DCA) was applied to quantify the clinical net benefit of the model across various thresholds. This comprehensive approach ensured balanced consideration of the model’s discriminative ability, calibration performance, and clinical utility. To ensure strict data independence, we partitioned the dataset at the patient level, such that all WSIs originating from the same patient were assigned exclusively to the training, validation, or test set. This process prevented any information leakage across folds and ensured fair evaluation.
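The bootstrap CI computation can be sketched as follows. The rank-based AUC and percentile interval are standard constructions; the synthetic labels and scores are purely illustrative, not study data.

```python
import numpy as np

def auc(y, scores):
    """Rank-based AUC (equivalent to the Mann-Whitney U statistic);
    assumes continuous scores, so ties are not corrected for."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auc_ci(y, scores, n_boot=1000, seed=0):
    """Percentile 95% CI for AUC over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    stats = []
    n = len(y)
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y[idx].sum() in (0, n):   # a resample must contain both classes
            continue
        stats.append(auc(y[idx], scores[idx]))
    return np.percentile(stats, [2.5, 97.5])

# Synthetic example: 50 wild-type and 50 mutant cases with overlapping scores
y = np.array([0] * 50 + [1] * 50)
scores = np.concatenate([np.random.default_rng(1).normal(0.0, 1.0, 50),
                         np.random.default_rng(2).normal(1.0, 1.0, 50)])
lo, hi = bootstrap_auc_ci(y, scores)
```

The same resampling loop applies to ACC, F1, SEN, and SPEC by swapping in the corresponding metric function.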
To further assess the generalization capability of the developed method, we conducted external testing with independent datasets collected from multiple medical centers. These external evaluations enabled a more comprehensive and rigorous examination of the model’s robustness across heterogeneous clinical environments.
Attention-based visualization of histopathology model decisions
To interpret the decision-making process of the histopathology model and identify the morphological features most critical for PIK3CA mutation prediction, we generated heatmap visualizations based on the attention weights produced by the model. These visualizations revealed the microscopic pathological structures focused on by the model when determining mutation status.
Specifically, for a given WSI, the pretrained vision Transformer architecture assigns a normalized attention score to each image patch. This score, derived from the softmax normalization operation within the attention mechanism, quantifies the relative contribution of each patch to the final prediction of PIK3CA mutation status. The normalized attention values are then mapped back to their corresponding spatial locations on the original slide and upsampled to reconstruct a full-resolution heatmap. In the resulting heatmap, regions with high attention scores are highlighted in warm colors (e.g., red), representing tissue areas that the model identifies as most predictive of PIK3CA mutation, such as specific tumor cell arrangements, degrees of nuclear atypia, or patterns of stromal reaction.
This visualization approach transforms the internal decision mechanisms of the model into visual cues that can be readily interpreted by pathologists, thereby enhancing the transparency of the deep learning framework. More importantly, it enables verification of whether the regions highlighted by the model correspond to established pathological knowledge associated with PIK3CA mutations, thus establishing a reliable pathway from computational decision-making to clinical interpretation.
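The mapping from patch-level attention scores back to a slide-scale heatmap can be sketched as follows, using a toy 2 × 3 patch grid; real slides use the patch coordinates recorded at tiling, and typically a smoother interpolation than the nearest-neighbor upsampling shown here.

```python
import numpy as np

# Normalized attention scores for a toy 2 x 3 grid of 224 px patches
attn = np.array([0.05, 0.10, 0.35, 0.05, 0.40, 0.05])

grid = attn.reshape(2, 3)                     # map scores back to patch positions
heatmap = np.kron(grid, np.ones((224, 224)))  # nearest-neighbor upsample to slide scale
```

The upsampled heatmap is then overlaid on the original WSI with a warm-to-cool colormap, so high-attention regions appear in red.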
SHAP analysis of clinical feature contributions
To gain deeper insight into the decision-making process of the XGBoost model and identify the clinical features most critical for predicting PIK3CA mutations, we used the SHapley Additive exPlanations (SHAP) method for interpretability analysis. SHAP, which is based on Shapley value theory from cooperative game theory, assigns a unique contribution value to each feature for each individual prediction, thereby providing consistent and reliable explanations at both global and local levels. For a given sample x and model f, the SHAP value φ_j represents the contribution of feature j to the prediction f(x), computed over all possible subsets of the full feature set F:

φ_j = Σ_{S ⊆ F∖{j}} [ |S|! (m − |S| − 1)! / m! ] ( f(S ∪ {j}) − f(S) )
Here, S denotes a subset of features not including feature j, f(S) represents the model prediction with only the feature subset S, and m is the total number of features. By aggregating the absolute SHAP values across all samples, we obtain the global importance ranking for each feature:

I_j = (1/n) Σ_{i=1}^{n} |φ_j^{(i)}|

where I_j denotes the global importance of feature j, and φ_j^{(i)} is the SHAP value of feature j for sample i.
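The mean-|SHAP| aggregation can be sketched with invented per-sample values (these numbers are not study outputs; they only illustrate the computation):

```python
import numpy as np

# Toy per-sample SHAP values, shape (n_samples, n_features)
feature_names = ["age", "molecular_subtype", "lymph_node_status", "ajcc_stage"]
shap_values = np.array([[ 0.10, -0.40,  0.05,  0.02],
                        [-0.05,  0.35, -0.10,  0.01],
                        [ 0.08, -0.30,  0.02, -0.03]])

# Global importance I_j = mean_i |phi_j^(i)|, then rank features by it
global_importance = np.abs(shap_values).mean(axis=0)
ranking = [feature_names[j] for j in np.argsort(global_importance)[::-1]]
```

Taking absolute values before averaging prevents positive and negative per-sample contributions from canceling out, so features with large effects in either direction rank highly.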
Ethical approval
All digital pathology slides were de-identified according to the DICOM standard (with removal of patient ID, hospital ID, and date of birth). Structured clinical data were pseudonymized by replacing identifiers with unique codes. Access to raw data was restricted to the core research team, with encrypted storage and role-based access controls. This study was approved by the Ethics Committee of Hebei Medical University Fourth Hospital (approval No. 2025KT033). The requirement for informed consent was waived, because of the retrospective nature of the study.
Study flowchart. This study developed a multimodal artificial intelligence framework to predict PIK3CA mutations in breast cancer. Step 1: data collection. Multi-center cohorts were assembled from TCGA and 3 external hospitals. Each case underwent standardized tissue processing, histologic scanning, and genomic testing to obtain WSIs, clinical records, and mutation labels. Step 2: model construction. WSIs were tiled and encoded with a pretrained transformer-based histopathology model, and predictive scores were output via the multiple instance learning (MIL) framework. For the clinical model, structured clinical variables were fed into the dedicated clinical model for the derivation of predictive scores. The 2 models generated probabilistic predictions that were fused through the Multimodal PIK3CA Model (MPM). Step 3: model validation. Model performance was evaluated with ROC curves, confusion matrices, and attention or SHAP-based interpretability analyses. Overall, the multimodal approach integrated complementary morphological and clinical information to improve the accuracy and generalizability of PIK3CA mutation prediction. AJCC, American Joint Committee on Cancer; AUC, area under the curve; CMU, Beijing Xuanwu Hospital of Capital Medical University; HBMU, Hebei Medical University Fourth Hospital; SCU, West China Hospital of Sichuan University; SHAP, Shapley Additive Explanations; TCGA, The Cancer Genome Atlas; WSI, whole-slide image.
Results
Patient cohorts and dataset composition
On the basis of predefined inclusion and exclusion criteria, all cases from TCGA and the 3 external clinical centers were systematically screened. The TCGA cohort included 1,047 patients with 1,112 WSIs, each with complete PIK3CA mutation status, high-quality WSIs, and full clinical information. The external validation cohorts comprised HBMU (70 patients, 175 WSIs), CMU (52 patients, 52 WSIs), and SCU (44 patients, 44 WSIs). All included cases were primary breast cancers. Patients with carcinoma in situ, recurrent or metastatic disease, poor-quality WSIs, missing mutation data, or frozen-section slides were excluded. The final datasets and clinicopathologic characteristics are summarized in Figure 2 and Table 1.
Study cohort construction and data flow diagram. This flowchart summarizes the inclusion and exclusion criteria across the TCGA cohort and 3 external clinical cohorts (HBMU, CMU, and SCU). After application of exclusion criteria including missing PIK3CA mutation status, poor image quality, carcinoma in situ, or frozen sections, eligible WSIs and clinical data were retained for analysis. The TCGA dataset was further divided into training, validation, and internal test sets. The 3 external cohorts were used for independent external validation. CMU, Beijing Xuanwu Hospital of Capital Medical University; HBMU, Hebei Medical University Fourth Hospital; SCU, West China Hospital of Sichuan University; TCGA, The Cancer Genome Atlas; WSIs, whole-slide images.
Clinicopathologic characteristics of the TCGA and external validation cohorts
Performance of the histopathology model
Within the TCGA dataset, we conducted a systematic benchmarking analysis of 5 advanced pathology foundation models serving as pretrained encoders in the CLAM-SB architecture (performance metrics summarized in Table 2). Among the evaluated encoders, H-optimus-0 achieved the best performance, with an AUC of 0.715 (95% CI: 0.665–0.775) and ACC of 0.682 (95% CI: 0.597–0.767). UNI-V2 ranked second (AUC: 0.698), whereas Virchow2 had the weakest performance (AUC: 0.635). Statistical analysis (DeLong test) revealed that all other evaluated models performed significantly worse than the top-performing H-optimus-0 model (all P < 0.05). Because of its superior performance across all evaluation metrics, H-optimus-0 was selected as the feature encoder for the final histopathology branch.
Performance comparison of various pretrained feature extractors on the TCGA test set
The trained image model was subsequently evaluated on the external multi-center datasets (Figure 3 and Table 4). The results demonstrated stable generalization. The HBMU cohort had an AUC of 0.680 (95% CI: 0.540–0.800) and ACC of 0.620. The CMU cohort had an AUC of 0.680 (95% CI: 0.550–0.810) and ACC of 0.600. The SCU cohort had an AUC of 0.670 (95% CI: 0.520–0.820) and ACC of 0.590. Overall, the histopathology model maintained consistent performance across centers, thus supporting its robustness against staining and scanning variability.
Performance of the histopathology model across datasets. (A) Comparison of prediction performance with various pretrained encoders. (B) Radar plots summarizing AUC, ACC, F1, SEN, and SPEC across centers. (C–F) ROC curves with 5-fold cross-validation results and corresponding confusion matrices of the histopathology model in the (C) TCGA test set, (D) HBMU dataset, (E) CMU dataset, and (F) SCU dataset. ACC, accuracy; AUC, area under the curve; CI, confidence interval; CMU, Beijing Xuanwu Hospital of Capital Medical University; F1, F1 score; HBMU, Hebei Medical University Fourth Hospital; ROC, receiver operating characteristic; SCU, West China Hospital of Sichuan University; SEN, sensitivity; SPEC, specificity; TCGA, The Cancer Genome Atlas.
Performance of the clinical model
In the clinical model, 5 machine-learning algorithms were benchmarked: k-nearest neighbors (k-NN), logistic regression (LR), gradient boosting (GB), random forest (RF), and XGBoost. XGBoost outperformed all other models on the TCGA dataset, achieving an AUC of 0.694 (95% CI: 0.647–0.741) and ACC of 0.619 (95% CI: 0.561–0.678) (Table 3). Statistical analysis (DeLong test) confirmed the significant performance advantage of XGBoost over k-NN and LR (P < 0.05). GB and RF yielded comparable AUC values (both 0.681), whereas k-NN exhibited the lowest predictive performance (AUC = 0.534). Because of its superior and statistically significant predictive capability, XGBoost was selected as the classifier for the clinical model.
Comparison of machine-learning algorithms for the clinical model
To assess the model’s true generalization ability, we directly applied the trained clinical model to 3 completely independent external validation cohorts. The model demonstrated stable performance across all external cohorts (Figure 4 and Table 4). In the HBMU cohort, it achieved an AUC of 0.660 (95% CI: 0.560–0.760), specificity of 0.680, and sensitivity of 0.600; in the CMU cohort, the AUC was 0.640 (95% CI: 0.520–0.760), and the model exhibited balanced performance characteristics; and in the SCU cohort, the AUC was 0.650 (95% CI: 0.530–0.770). These results consistently indicated the model’s robust discriminative ability across multiple external datasets.
Performance evaluation of single-modal vs. multi-modal models
Performance of the clinical model across datasets. (A) Comparison across multiple machine learning classifiers. (B) Radar plots summarizing model performance across datasets. (C–F) ROC curves with 5-fold cross-validation results and corresponding confusion matrices of the clinical model in the (C) TCGA test set, (D) HBMU dataset, (E) CMU dataset, and (F) SCU dataset. ACC, accuracy; AUC, area under the curve; CI, confidence interval; CMU, Beijing Xuanwu Hospital of Capital Medical University; F1, F1 score; HBMU, Hebei Medical University Fourth Hospital; ROC, receiver operating characteristic; SCU, West China Hospital of Sichuan University; SEN, sensitivity; SPEC, specificity; TCGA, The Cancer Genome Atlas.
Performance of the multimodal PIK3CA model
Using a decision-level late fusion strategy, the MPM integrates the predictive outputs from both the histopathology and clinical models. MPM achieved the best overall performance on the internal TCGA test set, with an AUC of 0.745 (95% CI: 0.715–0.775), ACC of 0.700 (95% CI: 0.660–0.740), F1 of 0.640 (95% CI: 0.600–0.680), SEN of 0.615 (95% CI: 0.570–0.660), and SPEC of 0.775 (95% CI: 0.740–0.810) (Figure 5 and Table 4). Significance testing (DeLong test) confirmed the MPM’s statistically superior performance to those of the histopathology model (P = 0.038) and clinical model (P < 0.001).
Performance of the multimodal PIK3CA Model (MPM). ROC curves, confusion matrices, calibration curves, and DCA results of the MPM in 4 independent cohorts: (A) TCGA test set, (B) HBMU dataset, (C) CMU dataset, and (D) SCU dataset. AUC, area under the curve; CMU, Beijing Xuanwu Hospital of Capital Medical University; DCA, decision curve analysis; HBMU, Hebei Medical University Fourth Hospital; MPM, multimodal PIK3CA model; SCU, West China Hospital of Sichuan University; TCGA, The Cancer Genome Atlas.
On the external validation cohorts, MPM consistently maintained robust and balanced predictive capability. In the HBMU cohort, it attained an AUC of 0.695 (95% CI: 0.615–0.775) and ACC of 0.645 (95% CI: 0.580–0.710). In the CMU cohort, the model achieved an AUC of 0.690 (95% CI: 0.610–0.770) and ACC of 0.615 (95% CI: 0.535–0.695). For the SCU cohort, the MPM yielded an AUC of 0.680 (95% CI: 0.600–0.760) and ACC of 0.620 (95% CI: 0.540–0.700). Across all external centers, the MPM demonstrated stable specificity (0.740–0.760) and provided a more balanced trade-off between sensitivity and specificity than either single-modality model, with no observed statistically significant performance degradation (all P > 0.05 vs. single modalities in external sets).
Calibration curve analysis demonstrated that the predictive probabilities generated by the MPM were in good agreement with the actual event rates across both the internal TCGA cohort and all external validation cohorts. DCA further validated the clinical utility of the model. Within the threshold probability range of 0.020–0.500, the MPM yielded significantly greater net benefits than the 2 conventional strategies of universal sequencing and no sequencing. These findings underscored the advantages and translational potential of the multimodal fusion strategy in clinical practice.
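For readers unfamiliar with DCA, the net benefit at a threshold probability pt is TP/N − FP/N × pt/(1 − pt), compared against “treat all” (here, universal sequencing) and “treat none” (net benefit 0). A toy sketch of the computation, not the study’s code:

```python
def net_benefit(y_true, y_prob, pt):
    """Decision-curve net benefit of a model at threshold probability pt:
    NB = TP/N - FP/N * pt/(1 - pt), where a patient is classified
    positive when the predicted probability is >= pt."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= pt and y == 0)
    return tp / n - fp / n * pt / (1 - pt)

def net_benefit_treat_all(y_true, pt):
    """'Universal sequencing' reference strategy: everyone is positive,
    so TP/N is the prevalence and FP/N is 1 - prevalence."""
    prev = sum(y_true) / len(y_true)
    return prev - (1 - prev) * pt / (1 - pt)
```

Sweeping pt over a grid and plotting the three curves (model, treat-all, treat-none at 0) reproduces the structure of the DCA panels in Figure 5.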
Overall, these results indicated that, by leveraging complementary morphological features from WSIs and structured clinical data, the MPM substantially improved the predictive performance on the internal cohort and maintained strong, generalizable performance across heterogeneous external centers, thereby supporting the efficacy and robustness of the multimodal fusion strategy.
Attention heatmap interpretation of the histopathology model
Attention heatmaps derived from the CLAM-SB model highlighted the WSI regions contributing most strongly to PIK3CA mutation prediction. Attention heatmaps were generated by the histopathology model across the TCGA, CMU, SCU, and HBMU datasets (Figure 6). High-attention regions (in red) localized predominantly to tumor-rich areas characterized by densely packed malignant cells or prominent glandular/tubular structures, whereas low-attention regions (in blue) typically corresponded to normal tissue or non-informative background areas. These patterns indicated the model’s accurate focus on diagnostically relevant histomorphologic features, thereby achieving automatic identification of key regions within WSIs. Furthermore, the consistency of heatmap distributions across datasets from different centers highlighted the robustness and generalizability of the developed image-branch attention mechanism and underscored its strong interpretability and cross-domain adaptability.
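The per-patch weights visualized in these heatmaps arise from attention-based MIL pooling. The sketch below illustrates the general idea with toy dimensions (a simplified, non-gated variant, not the CLAM-SB implementation): each patch embedding receives a learned score, the softmax over scores gives the weights rendered as the heatmap, and the weighted sum yields the slide-level representation.

```python
import math

def matvec(M, v):
    """Matrix-vector product for small nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attention_pool(patches, V, w):
    """Attention-based MIL pooling: each patch embedding h_i gets a score
    w . tanh(V h_i); the softmax of the scores gives per-patch weights a_i
    (the values overlaid as a heatmap); the slide embedding is sum_i a_i h_i."""
    scores = [sum(wi * math.tanh(t) for wi, t in zip(w, matvec(V, h)))
              for h in patches]
    a = softmax(scores)
    dim = len(patches[0])
    slide = [sum(a[i] * patches[i][d] for i in range(len(patches)))
             for d in range(dim)]
    return slide, a
```

Because the weights sum to 1 over the patches of a slide, the heatmap is directly interpretable as the model’s allocation of attention across the WSI.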
Attention heatmaps of WSI regions contributing to PIK3CA mutation prediction. Representative WSIs from TCGA and 3 external cohorts are shown with attention overlays. High-attention regions (red) highlight morphologic areas strongly contributing to mutation prediction. Adjacent panels show magnified patches and corresponding attention heatmaps, reflecting model interpretability and center-to-center robustness. CMU, Beijing Xuanwu Hospital of Capital Medical University; HBMU, Hebei Medical University Fourth Hospital; SCU, West China Hospital of Sichuan University; TCGA, The Cancer Genome Atlas; WSI, whole-slide image.
SHAP-based interpretation of the clinical model
SHAP analysis revealed the relative contributions and directional effects of the clinical features in the PIK3CA mutation prediction model. Molecular subtype exhibited the highest predictive contribution (SHAP importance = 0.4197), followed by lymph node metastasis status (SHAP importance = 0.3321), age (SHAP importance = 0.1507), and AJCC stage (SHAP importance = 0.0976) (Figure 7A). Further analysis indicated significant differences in the effects of molecular subtypes on mutation risk (Figure 7B). Specifically, the luminal A subtype showed the highest tendency toward mutation risk (SHAP value = 1.7702), whereas the basal-like subtype exhibited a lower mutation risk (SHAP value = −0.4977). Additionally, patients without lymph node metastasis demonstrated elevated mutation risk.
SHAP-based interpretation of the clinical feature model. (A) Global feature importance ranked by mean absolute SHAP values. (B) SHAP beeswarm plot illustrating feature-level contributions. (C) SHAP dependence plots showing interaction patterns for key features. AJCC, American Joint Committee on Cancer; SHAP, SHapley Additive exPlanations.
SHAP dependence analysis further elucidated the independent influence patterns of each feature (Figure 7C). Molecular subtype displayed a clear negative gradient, thus further supporting luminal subtypes as high-risk categories; lymph node metastasis contributed most strongly to prediction at moderate levels; the positive influence of age was weakest in the age range of 40–60 years; and earlier AJCC stages were associated with higher predicted mutation risk. Inter-feature correlation analysis indicated a moderate association between AJCC stage and lymph node metastasis in terms of their influence on SHAP values (r = 0.47), whereas age, molecular subtype, and other features showed weaker correlations (r = 0.09). In summary, the SHAP interpretability framework not only quantified the predictive weight of each feature in the model but also systematically revealed their independent and interactive patterns, thus providing an interpretable computational basis for understanding the clinical predictive pathways of PIK3CA mutation.
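SHAP values are Shapley values from cooperative game theory: a feature’s attribution is its average marginal contribution over all feature orderings, and the attributions sum to the difference between the model output and a baseline (the “efficiency” property). The exact enumeration below is exponential in the number of features and only feasible for toy inputs; SHAP libraries use efficient approximations (e.g., TreeSHAP for XGBoost). This is an illustration of the definition, not the analysis pipeline used in this study.

```python
import itertools

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at input x relative to a baseline:
    for every ordering of the features, switch them one by one from the
    baseline value to the observed value and credit each feature with
    the resulting change in f; average over all orderings."""
    n = len(x)
    phi = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for perm in perms:
        z = list(baseline)
        prev = f(z)
        for j in perm:
            z[j] = x[j]
            cur = f(z)
            phi[j] += cur - prev
            prev = cur
    return [p / len(perms) for p in phi]
```

For an additive model the attributions reduce to each feature’s own contribution, and in general they always satisfy sum(phi) = f(x) − f(baseline), which is what makes the global importance rankings in Figure 7A directly comparable across features.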
Discussion
In this study, we developed and validated a multimodal artificial intelligence framework, the MPM, integrating digital histopathology and clinical data to predict PIK3CA mutation status in patients with breast cancer. The MPM achieved robust predictive performance across both the internal TCGA validation set and multiple external independent cohorts, with AUC values ranging from 0.680 to 0.745. Importantly, the MPM consistently outperformed each single-modality model (histopathology-only or clinical-only), thereby highlighting the substantial advantage of multimodal fusion in enhancing model generalizability and predictive accuracy. Our work was aimed at developing a reliable histology-based prediction tool for important clinical biomarkers. We systematically evaluated multiple foundation models and algorithms and provided interpretability analyses, thereby offering a substantive advance for the field.
Within the histopathology model, we systematically evaluated 7 state-of-the-art pathology foundation models as pretrained feature extractors. H-optimus-0 achieved the best performance on the TCGA dataset (AUC = 0.715), thus indicating its strong capability in capturing histomorphologic patterns associated with PIK3CA mutations. Its advanced architecture and pre-training strategies highlighted the notable value of self-supervised learning and contrastive learning in feature extraction from pathological images. Furthermore, the histopathology model’s maintenance of favorable generalizability on external datasets (AUC = 0.670–0.680) supported its robustness across multi-center cohorts with heterogeneous staining and scanning conditions.
For the clinical model, the XGBoost model achieved the highest predictive performance (TCGA AUC = 0.694) among multiple machine-learning methods. The strength of XGBoost is its ability to model nonlinear relationships and high-order interactions, which are common in clinical oncology data. On the basis of SHAP interpretability analysis, molecular subtype was identified as the most influential predictor, surpassing all other clinical variables. The luminal A subtype exhibited elevated propensity for mutation risk, in agreement with previously reported associations between these clinical factors and PIK3CA-driven tumor biology in the literature17,18.
The MPM, which integrates the predicted probabilities from the image and clinical models through logistic regression, achieved an AUC of 0.745 on the TCGA cohort and substantially outperformed the single-modality models. This finding confirmed the complementary nature of digital histopathology and clinical data in predicting PIK3CA mutations: the histopathology model captures subtle morphologic patterns within the tumor microenvironment that are associated with underlying genomic alterations, whereas the clinical model incorporates patient-level biological and clinical context. By leveraging both sources of information, the fused model not only improves overall predictive accuracy but also enhances robustness across heterogeneous patient populations. Our results align with the growing trend in medical AI toward using multimodal fusion frameworks to effectively reflect the multifaceted nature of clinical decision-making19,20. Importantly, however, the MPM is intended to serve as a pre-screening tool to prioritize cases for confirmatory sequencing, rather than as a definitive surrogate, given its current absolute performance level.
The interpretability analyses further strengthened the clinical credibility of the developed model. Through attention heatmap visualization, we identified the specific tissue regions that the histopathology model prioritized during prediction. These high-attention areas mapped primarily to regions with dense tumor cell proliferation or distinctive stromal architectures, and demonstrated substantial concordance with pathological patterns known to be associated with PIK3CA mutations. Notably, these results aligned with findings from Howard et al. indicating that PIK3CA-associated morphologic signatures frequently include enhanced cytoplasmic eosinophilia and elevated tubular or glandular formation21. The SHAP analysis further elucidated the contribution and directional effects of each clinical feature on the model’s predictions, thereby offering a quantitative basis for understanding the decision-making process of the clinical model. However, existing interpretability methods still fall short of fully elucidating the intricate interactions between morphological and clinical characteristics. Future studies will explore advanced techniques such as concept-based interpretability to improve model transparency and clinical credibility.
Although the MPM demonstrated strong performance across multiple centers, several limitations remain. The relatively small external cohorts might restrict the assessment of generalizability. The clinical model, despite incorporating key variables, might have omitted potentially important factors such as treatment history or family history. In addition, because the histopathology model might be affected by variations in staining and slide quality, there remains room for further optimization. Moreover, the model achieves optimal performance when both high-quality WSIs and structured clinical data are available, thus potentially posing practical constraints in certain resource-limited clinical settings. The variability in model specificity across multicenter validations is likely to reflect the inherent heterogeneity of real-world clinical and pathological practice. DCA suggested that, across a broad range of threshold probabilities, the model might provide clinical net benefit when used as a decision-support pre-screening tool. However, differences in patient demographics, disease stage distributions, histopathological staining protocols, and digital scanning parameters across centers might contribute to performance variability and also pose practical challenges in large-scale clinical deployment, including infrastructure requirements, data integration, and regulatory considerations. In practice, computationally intensive components of the framework may be deployed via centralized or cloud-based infrastructures, to enable flexible integration into clinical workflows without requiring extensive local computing resources.
To facilitate more standardized application in multicenter settings, several technical strategies may be considered, such as stain-normalization approaches to mitigate inter-institutional staining variability, privacy-preserving federated learning frameworks to support collaborative modeling across centers, and robust-transfer techniques such as domain adaptation and meta-learning to enhance generalizability to new clinical environments.
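As a concrete illustration of the stain-normalization idea, one simple family of methods (Reinhard-style color transfer) matches per-channel statistics of a source slide to a reference slide. The per-channel toy sketch below is illustrative only; production implementations operate in LAB color space, clip to the valid intensity range, and often use stain-deconvolution approaches such as Macenko normalization instead.

```python
def mean_std(vals):
    """Population mean and standard deviation of a channel's intensities."""
    m = sum(vals) / len(vals)
    var = sum((v - m) ** 2 for v in vals) / len(vals)
    return m, var ** 0.5

def normalize_channel(src, ref):
    """Shift and scale one color channel of a source image so that its
    mean and standard deviation match those of a reference image
    (Reinhard-style transfer, shown per channel for illustration)."""
    ms, ss = mean_std(src)
    mr, sr = mean_std(ref)
    scale = sr / ss if ss > 0 else 1.0
    return [(v - ms) * scale + mr for v in src]
```

Applying such a transform before feature extraction reduces the inter-institutional color shifts that otherwise degrade the transferability of histopathology models.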
Future work will be aimed at expanding external validation cohorts, exploring deeper multimodal fusion strategies, and extending this framework to additional driver-gene mutations and treatment-response prediction. In addition, future studies may explore the incorporation of longitudinal clinical data and treatment history, to enable more dynamic modeling of disease evolution. We also plan to investigate the integration of MPM as a pre-screening decision-support tool within digital pathology workflows, to help prioritize cases for confirmatory molecular testing, thereby strengthening its translational potential in precision oncology.
Conclusions
Herein, we developed a multimodal artificial intelligence framework integrating WSI with structured clinical data to predict PIK3CA mutation status in breast cancer. The developed MPM consistently demonstrated robust performance across internal and multi-center external cohorts, and outperformed single-modality approaches, thus highlighting the complementary value of image-derived morphologic features and patient-level clinical information. Interpretability analyses further confirmed that the model captured biologically meaningful patterns associated with PIK3CA alterations, thereby strengthening its potential for clinical translation. Collectively, these findings suggested that multimodal fusion is a promising strategy for molecular biomarker prediction that might serve as an effective pre-screening tool to decrease reliance on, rather than replace, routine sequencing assays, thereby optimizing testing strategies and resource allocation to support precision oncology.
Conflict of interest statement
No potential conflicts of interest are disclosed.
Author contributions
Conceived and designed the analysis: Yueping Liu, Xiao Han, Lianghong Teng.
Collected the data: Jiaxian Miao, Qi Liu, Jianing Zhao, Shishun Fan, Shenwen Wang, Si Wu, Jinze Li, Huirui Zhang, Meng Zhang.
Contributed data or analysis tools: Feng Ye, Hong Bu.
Performed the analysis: Jiaxian Miao, Jianing Zhao.
Wrote the article: Jiaxian Miao, Jianing Zhao.
Data availability statement
The data generated in this study are publicly available in The Cancer Genome Atlas (TCGA) at https://portal.gdc.cancer.gov/. The multicenter external validation data are available on reasonable request from the corresponding author.
- Received November 30, 2025.
- Accepted February 23, 2026.
- Copyright: © 2026, The Authors
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.