Abstract
Objective: We aimed to develop clinical decision support tools for patients with treatment-naïve advanced esophageal squamous cell carcinoma (ESCC).
Methods: Patients who received first-line immunochemotherapy with or without radiotherapy between 2018 and 2023 across 9 centers were included in the current study. Candidate predictors were routine clinical variables such as patient demographics, tumor characteristics, and treatment details. Feature selection was conducted using the least absolute shrinkage and selection operator (LASSO), Boruta, and stepwise feature selection method. The models were fitted using 4 machine learning algorithms and a Cox proportional hazards model was set as a benchmark, comparing model predictive performance by computing the time-dependent area under the receiver operator curve (tAUC). Concordance index (C-index), Brier score, calibration plots, precision-recall (PR) curves, decision curve analysis (DCA), and held-out testing and external validation were utilized for evaluating model robustness, transportability, and clinical utility. SHapley Additive exPlanations (SHAP) analysis was used to rank feature importance and explain the models.
Results: A total of 1,048 patients were included, comprising the training/test (n = 828) and validation cohorts (n = 220). The random survival forest model that was constructed following feature selection using Boruta [Boruta-random survival forest (RSF)] showed the highest predictive performance. The tAUCs for 6-, 12-, and 18-month OS were 0.886, 0.775, and 0.772, respectively, in the training set, which significantly outperformed all other models (P-adjust < 0.05). This superiority persisted in the external validation cohort (all tAUCs > 0.750). The Boruta-RSF model exhibited excellent calibration and the highest clinical net benefit on DCA. A publicly accessible web calculator (https://escc.shinyapps.io/ESCC/) was implemented, enabling individualized overall survival prediction under varying treatment scenarios.
Conclusions: The Boruta-RSF model, leveraging routinely available clinical variables with the corresponding web calculator, facilitates personalized prognostic assessment and optimization of treatment strategies.
keywords
Introduction
Approximately 40% of patients with esophageal squamous cell carcinoma (ESCC) are initially diagnosed at an advanced stage due to the lack of effective screening strategies and early-stage clinical symptoms1–3.
Platinum-based chemotherapy has historically been considered the first-line standard for advanced ESCC, yielding a median overall survival (OS) of < 13 months1,2,4. Recently, several randomized clinical trials (RCTs) have demonstrated that adding anti-PD1 antibodies to chemotherapy can significantly improve survival, resulting in a median OS of 13.2–17.2 months4–10, thereby establishing immunochemotherapy as the current standard first-line treatment for advanced ESCC. However, not all patients with advanced ESCC will exhibit a favorable response to the combination of PD-1 inhibitors and chemotherapy11. In addition, most patients who initially have a positive response to immunochemotherapy eventually develop immune resistance and experience rapid disease progression within a short timeframe11–13. Moreover, the improvement in median OS remains moderate (approximately 3–4 months) compared to chemotherapy alone, even when combined with immunotherapy 4–10, which is not satisfactory. To address this clinical bottleneck, it is essential to explore additional therapeutic strategies and to quantify and predict OS for various treatment plans, thereby assisting physicians in formulating optimal personalized treatment plans for each patient.
Radiotherapy (RT) is frequently utilized in patients with advanced ESCC for palliation of local symptoms (e.g., dysphagia)12–14. The synergistic effects between RT and immunotherapy are well-documented. Radiation-induced immunogenic cell death (ICD) enhances tumor immunogenicity15,16, while immunotherapy can improve radiosensitivity via vascular normalization17. This biological rationale supports studying the clinical efficacy of RT combined with immunochemotherapy. Emerging clinical evidence indicates that integrating aggressive primary tumor RT into first-line immunochemotherapy may yield survival benefits13,14,18–20, thereby driving the growing adoption in real-world practice. Our previous study showed that incorporating aggressive primary tumor RT (≥ 40 Gy) into first-line immunochemotherapy significantly improves OS while maintaining good tolerability19.
Although providing an individualized prognosis for each patient is of great importance regarding patient care and clinical trial design, no validated prognostic models exist for advanced ESCC in the contemporary immunotherapy era, despite immunochemotherapy with or without RT having significantly transformed the treatment landscape. In fact, the existing prognostic prediction models developed in the pre-immunotherapy era are outdated and unable to predict the outcomes of these newly incorporated treatments (i.e., PD-1 inhibitors and/or RT). All the existing prognostic models are traditional regression analysis models developed in the chemotherapy era that lack multi-center external validation21–28.
To bridge these gaps, multi-center real-world data were leveraged to develop and externally validate a clinically robust model capable of reliably predicting OS in patients with advanced ESCC after first-line immunochemotherapy with or without aggressive RT (≥ 40 Gy) targeting the primary tumor in line with contemporary best clinical practice. Machine learning (ML) algorithms can detect non-linear associations or handle higher-order interactions, so ML can be relatively easily retrained to account for emerging cancer treatment and survival patterns29,30. Given the absence of an a priori method to determine which modeling approach would yield the most effective predictive model for this specific scenario, we systematically compared results achieved by the traditional Cox proportional hazards model with the results obtained from four representative non-linear ML algorithms [random survival forest (RSF), extreme gradient boosting (XGBoost), gradient boosting machine (GBM), and decision tree DT], which are particularly adept at capturing high-dimensional interactions.
Materials and methods
Study design and participants
The details of the study design are shown in the Study flowchart. The study was approved by the ethics committee of the participating institutions (approval number: bc20253361). The report was in accordance with the TRIPOD statement31.
The medical records of consecutive EC patients who were treated at 9 academic cancer centers in China between January 2018 and December 2023 were screened. The inclusion criteria were as follows: (i) pathologically confirmed ESCC; (ii) presence of clinical stage cTanyNanyM1 (stage IVB, the 8th AJCC edition); (iii) underwent PD-1 inhibitor plus chemotherapy with or without aggressive RT (≥ 40 Gy) targeting the primary tumor as first-line therapy; (iv) no prior anticancer treatment; (v) Eastern Cooperative Oncology Group (ECOG) score ≤ 2; and (vi) age > 18 years. Additional treatment details of the included patients are provided in the Supplementary methods. The exclusion criteria were as follows: (i) history of other malignancies; (ii) a total radiation dose of less than 40 Gy to the primary tumor; (iii) only metastases were treated with RT.
(A) Patient inclusion from 9 Chinese academic cancer centers (n = 1,048). (B) Data preprocessing. Handling of clinical variables, immunotherapy/radiotherapy details, and missing data imputation. (C) Model development. Feature selection (Boruta, LASSO, and stepwise selection), survival model construction (RSF, XGBoost, GBM, DT, and CoxPH), and performance comparison among models. (D) Model validation and deployment. Held-out testing and external validation, performance evaluation (tAUC, calibration curve, and DCA), model interpretation, and web calculator implementation.
The primary endpoint of this study was the development of a clinically useful model for predicting 6-, 12-, and 18-month OS. OS was defined as the time from the first treatment-to-death or last follow-up evaluation.
Data collection and candidate predictor parameters
Candidate predictors and patient survival status were obtained from the hospital electronic medical records (EMRs). Seventeen candidate predictors were collected based on data availability, clinical significance, and the literature regarding ESCC prognostic models (Table 1). Notably, our previous research demonstrated that incorporating RT at a dose ≥ 40 Gy targeting the primary tumor into first-line immunochemotherapy significantly improved OS19. Therefore, RT at a dose ≥ 40 Gy was defined as aggressive RT in this study.
Characteristics of the training set and held-out testing and external validation set
Missing data, imputation, and sensitivity analysis
Missing values for candidate predictors were visualized using a bar chart to show the extent of missingness. Predictors with a missing rate > 30% and no significant association with OS (P ≥ 0.05) were excluded32,33.
The types of missing data were further analyzed if candidate predictors had a missing rate < 30%. The details of the judgment for missing data types and the imputation of missing data are provided in the Supplementary methods. A sensitivity analysis was performed on models in which candidate predictors were not imputed in the external validation set to check robustness30.
Model and feature selection
The features included in the four ML-based models and the Cox proportional hazards (CoxPH) model were selected using the least absolute shrinkage and selection operator (LASSO) method34–36. Additionally, a traditional CoxPH model was developed using forward stepwise feature selection as a benchmark37,38. An RSF model was specifically constructed based on Boruta feature selection (i.e., Boruta-RSF)30. The details of feature selection are provided in the Supplementary methods.
Model development and validation
A staged approach was used for model development and validation. First, patient data from 6 medical centers were used to form the development cohort, which was randomly divided into a training set (70%) and an internal test set (30%). A rigorous “held-out testing and external validation” strategy was adopted to strictly prevent data leakage and ensure generalizability. Specifically, feature selection, hyperparameter optimization, and model training were solely performed on the training set. The internal test set and the external validation cohort were strictly preserved as “unseen” datasets during supervised feature selection and model training, and were only utilized for the final model evaluation to evaluate the generalization performance. Seven prognostic models (RSF, Boruta-RSF, CoxPH, traditional CoxPH, XGBoost, GBM, and DT) were developed to predict 6-, 12-, and 18-month OS. Detailed information regarding the construction of each model is provided in the Supplementary methods.
Model evaluation, comparison, and statistical analysis
The performance of the models was evaluated visually and numerically. The predictive efficacy of models was comprehensively evaluated on the training, internal test, and external validation sets. The time-dependent area under the receiver operator curve (tAUC) was designated as the core criterion. The algorithm with the highest tAUC value and which was significantly superior to other models, as confirmed by statistical tests (DeLong + Benjamini-Hochberg), was given priority in selection. The Benjamini-Hochberg procedure was applied to all possible pairwise comparisons (n = 21) for each metric to strictly control for the false discovery rate (FDR) across the multiple models evaluated and to prevent type I error inflation. The concordance index (C-index), Brier score, calibration curve, precision-recall (PR) curve, and decision curve analysis (DCA) results were further referred to if there were no significant differences in the tAUC values of models and the predictive stability and clinical utility were comprehensively weighed. More detailed statistical analyses are provided in the Supplementary methods.
Model explanation and deployment
The permutation importance and SHapley Additive exPlanations (SHAP) values were computed and visualized to enhance transparency and interpretability39. The top-performing prediction model was integrated into a web platform based on a Shiny application (RRID:SCR_001626) to facilitate the clinical application. All statistical analyses were performed using SPSS (version 24.0) (RRID:SCR_002865) and R software (version 4.4.1) (RRID:SCR_001905). A 2-tailed P < 0.05 was considered statistically significant for all analyses. Additional information is described in the Supplementary methods.
Results
Study population
A total of 1,048 patients with treatment-naïve advanced ESCC were included in the current study (Study flowchart). All 1,048 patients received immunochemotherapy, while 484 (46.2%) underwent a combination of immunochemotherapy and RT. The treatment details are provided in Table S1. The developmental databases from Sun Yat-sen University Cancer Center, Tianjin Medical University Cancer Institute & Hospital, Zhejiang Cancer Hospital, Sichuan Cancer Center, Affiliated Cancer Hospital of Zhengzhou University, and Hubei Cancer Hospital were randomly split into training (n = 577) and internal testing datasets (n = 251). Significant differences were observed in some characteristics between the developmental and external validation cohorts because the external validation cohort (n = 220) was derived from an independent dataset collected from three other academic cancer centers (Table 1).
The median follow-up time was 23.7 months [interquartile range (IQR): 14.7–32.7 months]. There were 543 deaths (51.8%) during the follow-up period and the corresponding median OS was 19.0 months [95% confidence interval (CI): 17.1–20.9 months]. The specific Kaplan-Meier survival curves for the three datasets are presented in detail in Figure S1.
Feature selection
The missing rate of the pathology grade in the development set was 50.2% and there was no significant association with OS [hazard ratio (HR) = 1.04, P = 0.723], as shown in Figure S2A. The missing rates for body mass index (BMI) and tumor length were 9.1% and 9.2%, respectively, which supported the missing at random (MAR) mechanism according to Little’s MCAR test and stratified tests. Therefore, the missForest algorithm was utilized to address the missing values40,41. The missing data patterns prior to imputation, as well as the comparative data distributions before and after imputation for these validation sets, are detailed in Figures S2–S4. Descriptive statistics and Kolmogorov-Smirnov (K-S) tests confirmed that the imputation preserved the original data distributions across the training, internal test, and external validation sets (all K-S test P-values > 0.05), as detailed in Table S2.
The Boruta-RSF model identified gender, BMI, tumor length, number of metastases, location of the metastasis, chemotherapy cycle, immunotherapy cycle, and RT as the most influential variables (Figure 1A). Lasso regression was used to screen the relevant features of the training set and the characteristics of the variable coefficients are shown in Figure 1B. A total of 6 variables were selected for inclusion in the RSF, XGBoost, GBM, DT, and CoxPH models based on Lasso regression. The traditional-CoxPH model retained 8 relevant predictors of OS after applying a forward stepwise feature selection (Figure 1C and Table S3). Remarkably, both “radiotherapy” and “immunotherapy cycles” were consistently selected by all models. This finding strongly underscores the pivotal roles of RT and immunotherapy in influencing OS in patients with treatment-naïve advanced ESCC, which is aligned with prospective clinical research evidence (e.g., ChiCTR2100046715, ORIENT-15, and JUPITER-06)4,7,18. Following feature selection, sample size adequacy was rigorously evaluated based on the events per variable (EPV) criterion. A total of 297 mortality events were observed in the training set (n = 577). Given that the final models incorporated 6–8 predictors, the EPV ratios ranged from 37.1–49.5 (297/8–297/6). This value substantially exceeds the widely recommended minimum thresholds of 10–20 events per predictor, ensuring sufficient statistical power for robust parameter estimation and minimizing the risk of model overfitting.
Feature selection results for prognostic models. (A) Boruta algorithm selection results for the random survival forest (Boruta-RSF) model. The Boruta algorithm categorized features as either confirmed (contributing significantly to prediction) or rejected (no significant contribution). The horizontal axis shows the ranking of feature importance from left-to-right in ascending order, including clinical variables (e.g., ECOG and T/N stage), treatment-related factors (e.g., immunotherapy/chemotherapy cycles and radiotherapy), and auto-generated shadow features (shadow min/mean/max) used as importance thresholds. All “confirmed” features were incorporated into the final Boruta-RSF model. (B) LASSO regression feature selection results for machine learning models. (Left figure) Cross-validation curve: the optimal λ (λ.1se) was selected using the “1 standard error” rule (right dashed vertical line). (Right figure) Coefficient shrinkage paths: the paths illustrate how coefficients shrink toward zero as λ increases. Note: Three unordered multi-categorical variables (tumor location, chemotherapy regimens, and immunotherapy regimens) were converted into binary indicator variables through dummy variable coding. Specifically, tumor location was decomposed into 3 binary indicator variables, chemotherapy regimens were decomposed into 2 binary indicator variables, and immunotherapy regimens were decomposed into 5 binary indicator variables. Therefore, a total of 7 binary indicator variables were newly added based on the original variables. Eventually, the total number of variables increased from the original 16-to-23. (C) Stepwise selection results for the traditional CoxPH model.
Model development, performance comparison, and sensitivity analysis
Seven ML models were constructed after data preprocessing and feature selection. The optimal hyperparameters for each model are listed in Table S4. The Boruta-RSF model exhibited significantly higher tAUC values overall compared to other models, as illustrated in Figures 2 and S5. The tAUC and C-index values for all seven models are provided in Table 2. The Boruta-RSF model showed the best predictive performance in the training set, as evidenced by the tAUC for predicting the 6-, 12-, and 18-month OS (0.886, 0.775, and 0.772, respectively), which was significantly higher than the other 6 models (Tables 2 and S5). Notably, this advantage was consistent in the internal/external validation cohorts, in which all tAUC values of the Boruta-RSF model exceeded 0.750. While the statistical significance of pairwise differences was attenuated in the validation sets after rigorous global FDR correction [likely due to sample size constraints (Tables S6 and S7)], the Boruta-RSF model consistently achieved the highest or competitive tAUC values. To provide a more comprehensive evaluation, the accuracy, sensitivity, and specificity for all models was further calculated based on the optimal cut-off values determined by the Youden index. The Boruta-RSF model demonstrated robust classification performance, as detailed in Table S8, achieving a sensitivity of 0.962 in the external validation cohort at 6 months, which indicates the strong capability in identifying high-risk patients. The Boruta-RSF model maintained consistently high tAUC values (all tAUCs > 0.750) across all timepoints (Figures 3A–C and S6). In addition, Figure S7 shows that the Boruta-RSF displayed higher C-index values compared to the other six models at different time points. The calibration curves demonstrated a high level of concordance between the predicted and observed probabilities of the Boruta-RSF model (Figure 3D–F). Quantitatively, the Boruta-RSF model exhibited excellent calibration-in-the-large with intercepts remaining close to 0 across cohorts (e.g., 0.100 at the 12-month external validation). The calibration slopes were consistently >1 (e.g., 2.610 at the 12-month external validation; Table S9), a characteristic of random forest ensembles that reflects a conservative prediction strategy robust against the overfitting observed in boosting models (e.g., a GBM slope < 1). The PR curves demonstrated a substantial increase in the area under the precision-recall curve (AUPRC) over time, from 0.298 at 6 months-to-0.718 at 18 months in the training set, from 0.225 at 6 months-to-0.709 at 18 months in the internal test set, and from 0.465 at 6 months-to-0.729 at 18 months in the external validation set (Figure 3G–I). This trend aligned with the dynamic changes in the survival rate of patients with advanced ESCC, from 91.6% at 6 months-to-51.9% at 18 months in the training set, from 91.3% at 6 months-to-52.2% at 18 months in the internal test set, and from 87.0% at 6 months-to-51.3% at 18 months in the external validation set (Figure S1). These results reflected the impact of event rate dynamics on precision-recall metrics. The Boruta-RSF also displayed lower Brier scores, as shown in Figure S8 and Table S10. The DCA curve further demonstrated that the Boruta-RSF had the highest clinical utility (Figure S9). Quantitative analysis confirmed that the model provided superior net benefit across clinically relevant threshold intervals that expanded over time (e.g., from 2%–19% at 6 months-to-22%–64% at 18 months in the external validation), effectively covering the intermediate-risk zone during which decision support is most critical (Table S11). Finally, sensitivity analysis in the external validation cohort confirmed model robustness against imputation bias. The Boruta-RSF model yielded virtually identical performance on the full imputed dataset (n = 220) vs. the complete-case subset (n = 197) (e.g., 6-month AUC: 0.867 vs. 0.871). DeLong’s tests showed no significant differences across time points (all P > 0.05; Table S12), validating that the imputation enhanced data completeness without compromising model stability.
Comparative performance of survival prediction models across cohorts with statistical validation. (A, C, E) Time-dependent receiver operating characteristic (ROC) curves for 7 models predicting 6-, 12-, and 18-month overall survival in (A) training, (C) internal test, and (E) external validation cohorts. Diagonal lines indicate random guessing (AUC = 0.5). (B, D, F) Pairwise comparison matrices of the time-dependent AUC differences for the corresponding cohorts. Diagonal cells display the specific AUC value for each model at the given time point, while off-diagonal cells represent the pairwise difference in AUC (ΔAUC) between the row model and the column model. The color intensity reflects the magnitude of ΔAUC. Asterisks (*) denote statistical significance (adjusted P < 0.05), derived from DeLong’s test followed by the global Benjamini-Hochberg FDR correction for all 21 pairwise comparisons.
Predictive performance of the random survival forest model with Boruta feature selection (Boruta-RSF) across cohorts. (A–C) Time-dependent receiver operating characteristic (ROC) curves for 6-, 12-, and 18-month overall survival in the training, internal test, and external validation cohorts with area under the curve (AUC) and 95% confidence intervals annotated. (D–F) The calibration curves of the Boruta-RSF model at 6-, 12-, and 18-months in the training, internal test, and external validation cohorts. (G–I) Precision-recall (PR) curves of the Boruta-RSF model for predicting overall survival at 6, 12, and 18 months in the training, internal test, and external validation sets. The area under the PR curve (AUPR with 95% CI) is annotated for each time point. Note the increasing AUPR over time, reflecting the impact of rising event rates (6-month OS rate: 92%; 12-month OS rate: 67%; 18-month OS rate: 52%). Lower AUPR in early time points (especially 6 months) is expected due to class imbalance, whereas consistently rising precision-recall balance demonstrates model robustness across follow-up durations.
Details of the 7 models performance values
Model explanation
Figure 4A depicts the dynamics of feature importance for the Boruta-RSF model. Figure 4B presents a SHAP summary plot for the Boruta-RSF model. Figure 4C offers a detailed case study, demonstrating the prediction process of the model for a specific patient. The performance and interpretability analyses, including analogous SHAP visualizations, for the other six models are visually presented in Figures S10 and S11.
(A) Feature importance dynamics for 6-, 12-, and 18-month overall survival. The horizontal axis represents the permutation importance score, which quantifies the mean decrease in prediction accuracy when each variable is randomized, while the vertical axis displays predictors ranked by the overall impact. Key observations reveal that radiotherapy was the dominant predictor for 6-month outcomes, highlighting the critical role in local tumor control. Systemic factors, such as immunotherapy cycles and BMI, showed increasing influence during the 12- and 18-month periods. (B) SHAP summary plot of the Boruta-RSF model (sampling SHAP with 300 samples). The vertical axis displays the feature names, arranged in descending order based on feature importance (average |SHAP value|). The horizontal axis indicates the aggregated SurvSHAP(t) value, for which positive values correspond to increased risk of death and negative values correspond to decreased risk of death. Each data point represents a sample with the color coding reflecting the actual value of the feature in that sample (red for low values and blue for high values). The color gradient corresponds to the numerical value for continuous features (e.g., BMI, tumor length). The color distinguishes between categories for categorical features [e.g., radiotherapy, gender, and location of the metastasis (for radiotherapy, blue signifies radiotherapy and red signifies no radiotherapy; for gender, blue denotes male and red denotes female; and for location of the metastasis, blue signifies non-regional nodes only and red signifies visceral metastases)]. (C) Waterfall plot of SHAP values for individual patient risk prediction (Boruta-RSF model). Each bar represents the contribution of a clinical feature to the model output, in which positive values indicate an increased risk of death and negative values indicate a decreased risk of death.
Implementation of the web calculator
The Boruta-RSF model was integrated into a web application (https://escc.shinyapps.io/ESCC/), as shown in Figure 5. Specifically, by inputting the actual values of the 8 features required for the model, the application is capable of automatically predicting OS probabilities at 6, 12, and 18 months. Notably, among the 8 clinical variables, only the 3 treatment-related variables (immunotherapy cycle, chemotherapy cycle, and the administration of ≥ 40 Gy RT to the primary tumor) are adjustable according to the treatment plan. The remaining five clinical variables remain fixed for any specific patient. Consequently, this Boruta-RSF model, in conjunction with its associated online application platform, enables clinicians to conduct individualized prognostic assessments and develop optimal personalized treatment strategies by quantifying and predicting OS probabilities corresponding to various treatment regimens for each patient. The range of variables for the RSF model is provided in Table S13.
Online computing platform presentation of the optimal Boruta-RSF model using the Shiny platform (https://escc.shinyapps.io/ESCC/). Refer to Table S13 for the allowable range of input values of continuous variables.
Discussion
The key findings of this study are summarized as follows. First, The RSF model using Boruta feature selection (Boruta-RSF) demonstrated significantly superior discrimination performance in predicting OS at 6, 12, and 18 months compared to traditional Cox regression and four other ML-based models (RSF, XGBoost, GBM, and DT), achieving tAUC values consistently > 0.750 in both the training/test and external validation sets. The robustness was further confirmed by the sensitivity analysis. Second, the immunotherapy cycle was the most critical factor for predicting the 12- and 18-month OS for the Boruta-RSF model. In contrast, RT, defined as receiving ≥ 40 Gy targeting the primary tumor, was identified as the predominant factor in predicting the 6-month OS. Third, by incorporating readily accessible clinical variables, including novel treatment factors, like immunotherapy cycles and primary tumor RT, this model addresses a critical gap in prognostication for contemporary first-line management of treatment-naïve advanced ESCC. The web calculator provides a practical tool for quantifying survival probabilities under different treatment scenarios, facilitating individualized prognosis and personalized therapeutic planning.
Table S14 summarizes previous studies on survival prediction models for advanced ESCC using clinical parameters. All of the models were developed during the chemotherapy era and lack integration of PD-1 inhibitors, rendering the models obsolete for predicting outcomes under current standard-of-care. Furthermore, these models often have limited external validation and an inability to model complex, non-linear interactions inherent in multimodal treatments, like immune-chemo-radiotherapy, making it difficult to reflect detailed treatment assignments and complex interactions between variables.
The current study directly addresses this information gap. The first validated prognostic tool specifically designed for the immunotherapy era is provided by incorporating real-world data reflecting contemporary clinical practice (PD-1 inhibitors + chemotherapy ± primary tumor RT) and using advanced ML algorithms capable of capturing complex feature interactions. Notably, the AUPRC demonstrated a substantial increase over time (Figure 3G–I) in contrast to the consistently high tAUC (all tAUCs > 0.750) of the Boruta-RSF model across all timepoints (Figure 3A–C). The divergence between tAUC and AUPRC, particularly at early time points (e.g., 6-month tAUC = 0.886 vs. AUPRC = 0.298 in the training set) is an expected phenomenon in survival settings with high short-term survival rates (e.g., 6-month OS = 91.6% in the training set; Figure S1). The consistent model performance across held-out testing and external validation cohorts (all tAUC > 0.750) rules out data artifacts and AUPRC dynamics were replicated in all cohorts, confirming the AUPRC as a data-driven property. When the event rate is low (e.g., < 10% at 6 months), achieving high precision requires near-perfect specificity to minimize false-positive predictions, a statistically challenging task that inherently limits the AUPRC. As events accumulated over time, the increasing AUPR (reaching 0.718 at 18 months in the training set) indicated improved precision-recall balance, which is consistent with the natural history of advanced ESCC. Notably, the model functions best as a “high-sensitivity sentinel” at the 6-month mark. A high-risk prediction at this stage should prompt intensified surveillance (e.g., shortened follow-up intervals) rather than immediate therapeutic escalation, thus effectively avoiding overtreatment. Conversely, at 12 and 18 months, as events accumulated and model precision naturally improved, the model provided reliable support for definitive decision-making regarding long-term patient management. Furthermore, the quantitative DCA demonstrated that the model conferred net benefit across critical intermediate-risk thresholds (e.g., 22%–64% at 18 months). This confirmed the specific utility in guiding decisions for patients falling into the grey area of clinical uncertainty, in which standard “treat-all” or “treat-none” policies are least effective. In addition, the sensitivity analysis (Table S12) confirmed the consistency of the model in performance, even with missing data addressed, underscoring the reliability for practical use. Time-dependent performance divergence was observed, as shown in Table 2. For example, in the external validation, the Boruta-RSF model demonstrated the highest early performance (6-month AUC: 0.867), which gradually diminished over time. In contrast, the CoxPH model displayed a flatter but consistently lower performance trajectory. Mechanistically, this divergence stems from the Boruta-RSF inclusion of tumor length (a proxy for tumor burden excluded by LASSO) and the ability to model non-linear interactions. Boruta-RSF maximized sensitivity to immediate events, a critical capability for guiding initial clinical decisions that linear models, like CoxPH averaged out, by capturing the compounded risk of high tumor burden on early mortality. These superiorities may be attributed to several inherent strengths of the RSF algorithm enhanced by Boruta feature selection, as follows: (1) This algorithm inherently models complex, non-linear relationships and higher-order interactions without predefined assumptions, a critical capability for capturing the intricate interplay between immunotherapy, chemotherapy, and RT42–45. (2) This algorithm mitigates the overfitting variance typically observed in single DT and improves predictive stability across heterogeneous populations by aggregating predictions from multiple de-correlated decision trees39,46,47. This finding was substantiated by the robust discrimination performance in the external validation dataset, in which all tAUC values exceeded 0.750 at various time points, even though significant baseline differences existed between the training and external validation datasets (e.g., RT rates: 49.05% vs. 34.09%, respectively; P < 0.001). (3) The Boruta algorithm, by identifying features confirmed to be relevant against randomized “shadow” features, effectively mitigates issues of multicollinearity and over-reliance on pre-screening methods, like LASSO, potentially leading to a more parsimonious and generalizable model30,48,49. (4) The Boruta-RSF model provides dynamic estimates of survival probabilities under different treatment scenarios unlike static risk scores, thus better aligning with the requirements of clinical decision-making. The findings herein are consistent with previous studies demonstrating the efficacy of Boruta-RSF model in complex survival prediction tasks30,50. Notably, the performance of Boruta-RSF model in predicting cancer survival warrants further investigation. The current study and the study conducted by Audureau et al.30 may serve as valuable references for future research in this area.
Notably, irrespective of the feature selection method employed, the variables (“immunotherapy cycle” and “RT”) were consistently integrated into all seven predictive models. This consistency highlights the critical role in influencing OS, which is corroborated by recent clinical evidence indicating that immunotherapy has become a core component of first-line treatment for advanced ESCC4,6–8 and incorporating aggressive RT (≥ 40 Gy) targeting the primary tumor significantly improves OS19. Conversely, irrespective of the feature selection method, neither specific chemotherapy (e.g., TP and PF) nor immunotherapy regimens (e.g., camrelizumab) were incorporated into all seven models (Figure 1). These findings are in alignment with the RCTs and meta-analyses showing comparable efficacy among common platinum-based regimens and PD-1 inhibitors in this setting51–53. An RCT conducted by Ai et al.53 showed no significant difference in OS among ESCC patients treated with paclitaxel plus fluorouracil, paclitaxel plus cisplatin, or paclitaxel plus carboplatin. To date, there is a lack of sufficient evidence-based data to indicate that different PD-1 inhibitors exhibit significant differences in therapeutic efficacy for advanced ESCC. The RSF model built with the Boruta (i.e., Boruta-RSF model) demonstrated superior performance when compared to the RSF model constructed using the LASSO variable selection method. This effect can be ascribed to the capacity of the Boruta algorithm to address the limitations of conventional statistical methods by eliminating the dependence on pre-screening processes, linear assumptions, and multicollinearity issues30,54.
Although predictive accuracy is critical, model interpretability is indispensable for facilitating clinical adoption and establishing trust among healthcare professionals. As is well-known, at different stages of disease progression, the efficacy of various treatment modalities and the interrelationships exhibit dynamic changes. RT (≥ 40 Gy) targeting the primary tumor was identified as the primary determinant of short-term (i.e., 6-month) OS (Figure 4A). This observation aligns with the rapid cytoreductive effect of early aggressive RT targeting the primary tumor, which eliminates life-threatening local symptoms (e.g., dysphagia)55. Conversely, the number of immunotherapy cycles was the most critical predictor for longer-term OS (i.e., 12 and 18 months). This finding is consistent with the results of RCTs, emphasizing the importance of treatment adherence and effective management of immune-related adverse events for ensuring the completion of planned immunotherapy cycles4,7,56. SHAP can help physicians understand the rationale underpinning a specific prognosis for an individual patient, as shown in Figure 4C. This allows clinicians to further confirm or challenge the model outputs based on professional expertise. The interplay between SHAP and clinical assessment not only enhances the acceptability and reliability of the Boruta-RSF model’ in practical applications but also highlights its potential value in supporting the decision-making processes of clinicians.
Currently, the combination of PD-1 inhibitors and chemotherapy is the standard first-line treatment for treatment-naïve advanced ESCC. Nevertheless, attaining individualized treatment regimens, such as determining the optimal cycles of immunotherapy and chemotherapy, within the context of standardized treatment protocols continues to pose a significant and enduring clinical challenge. For example, most RCTs for advanced solid tumors recommend 2 years of PD-1 inhibitor maintenance. However, most treatment-naïve advanced ESCC patients progress within 1 year [median progression-free survival (PFS) < 7 months] and cannot complete the standard 2-year maintenance of PD-1 inhibitors6–8. In addition, the results of a recent prospective trial showed that the concurrent use of immunochemotherapy and RT, involving 1-year (12 cycles) of PD-1 inhibitors and 4 cycles of chemotherapy, resulted in a median PFS of 9.8 months18. Therefore, in the era of immunochemotherapy for advanced ESCC, determining the optimal treatment intensity (number of cycles) and appropriately integrating RT continue to pose individualized challenges for clinical practice despite PD-1 inhibitors + chemotherapy being established as a standard treatment.
The web calculator (https://escc.shinyapps.io/ESCC/) transforms the Boruta-RSF model into a dynamic clinical decision-support tool with two key functions, moving beyond a one-size-fits-all approach. (1) Quantify individualized prognosis: The Boruta-RSF model provides patient-specific estimates of 6-, 12-, and 18-month survival probabilities based on the unique clinical profiles. (2) Simulate treatment impact: Clinicians can input different potential treatment plans (e.g., varying immunotherapy or chemotherapy cycles, adding or omitting primary tumor RT) and instantly visualize the predicted impact on OS probabilities for a specific patient (Figures 5 and S12). This function supports evidence-based, shared decision-making by quantifying the potential survival benefit of tailored treatment strategies. This function also addresses unresolved clinical questions in advanced ESCC management, such as the optimal duration of immunotherapy/chemotherapy maintenance and which patients benefit most from aggressive primary tumor RT (≥ 40 Gy).
Inevitably, the study had several limitations. First, the retrospective design of the study may have introduced potential confounding factors. Although rigorous statistical methods, such as sensitivity analysis, DeLong, and Benjamini-Hochberg, along with internal and external validation, demonstrate robustness, unmeasured confounding factors cannot be completely ruled out. Specifically, nuanced data, such as the detailed clinical rationale for specific treatment allocations, the dynamic evolution of patients’ performance status during therapy, and variations in the management of immune-related adverse events, were not fully captured. These unmeasured variables could theoretically influence treatment efficacy and survival outcomes. Second, “aggressive RT (≥ 40 Gy)” was defined based on our prior real-world study. However, the optimal RT dose and fractionation schedule when combined with immunotherapy require further RCT confirmation. Third, while the predictive role of PD-L1 in ESCC remains debated, future iterations of this model should prioritize the integration of PD-L1 status and other emerging molecular biomarkers to further enhance predictive precision. Fourth, radiomic features and specific hematologic biomarkers were excluded due to limited data availability across the 9 centers. Prioritizing standard clinical variables preserved the sample size and ensured broad clinical utility. Our research group has already initiated the collection of multi-omics data and is currently developing a multimodal model to further enhance predictive precision in future studies. Finally, the web calculator for the Boruta-RSF model is currently intended solely for scientific research by healthcare professionals because this medical device has not undergone regulatory review.
In conclusion, we have successfully developed and rigorously validated the first interpretable clinical model for predicting OS in patients with advanced ESCC receiving contemporary first-line immunochemotherapy ± aggressive primary tumor RT. The Boruta-RSF model significantly outperformed traditional Cox regression and other ML algorithms, demonstrating robust generalizability across multi-center datasets. The reasonable accuracy, interpretability through SHAP, and implementation as a user-friendly web calculator provide clinicians with a powerful tool for individualized prognostic assessment and dynamic simulation of survival outcomes under various treatment strategies. By enabling quantification of the potential survival benefit associated with key modifiable treatment factors (i.e., immunotherapy duration, chemotherapy cycles, and inclusion of RT), this model advances the goal of personalized medicine in advanced ESCC. Prospective application and continuous refinement hold promise for further optimizing treatment selection and improving patient outcomes.
Supporting Information
Conflict of interest statement
No potential conflicts of interest are disclosed.
Author contributions
Conceived and designed the analysis: Hui Bai, Xiaofeng Wang, Qifeng Wang, Jiacheng Li, Chengcheng Fan, Zhiyong Yuan, Jianzhong Cao, Baoqing Chen, Wencheng Zhang.
Collected the data: Xiao Liu, Yujin Xu, Qin Xiao, Hao Zhang, Bingyi Wang, Qingsong Pang, Biqi Chen, Mian Xi.
Contributed data or analysis tools: Zhiyong Yuan, Jianzhong Cao, Baoqing Chen, Wencheng Zhang.
Performed the analysis: Hui Bai, Xiaofeng Wang, Qifeng Wang, Jiacheng Li, Chengcheng Fan.
Wrote the paper: Hui Bai, Xiaofeng Wang, Jiacheng Li, Wencheng Zhang.
Data availability statement
The de-identified datasets generated and analyzed during the current study are not publicly available due to patient privacy and ethical restrictions but are available from the corresponding authors upon reasonable request. The code used for model development is accessible at the following GitHub repository: https://github.com/xiaofeng-bai/ESCC-Survival-Prediction-Tool.
- Received October 9, 2025.
- Accepted February 10, 2026.
- Copyright: © 2026, The Authors
This work is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License.






















