From THESAN-ZOOM to JWST: Predicting ionizing photon escape and the rise of UV-bright reionization sources

Source: arXiv:2606.14671 · Published 2026-06-12 · By Zebedee Summerfield, William McClymont, Sandro Tacchella, Aaron Smith, Rahul Kannan, Enrico Garaldi et al.

TL;DR

This paper asks which galaxies drove cosmic reionization and how their contribution evolved, by predicting the escape fraction (f_esc) and escape rate (\dot{N}_ion,esc) of ionizing Lyman-continuum (LyC) photons from early galaxies. Because the intergalactic medium is opaque to LyC photons at high redshift, their escape cannot be observed directly during the Epoch of Reionization (EoR). The authors instead use over 35,000 galaxy realizations from the high-resolution THESAN-ZOOM radiation-hydrodynamic cosmological simulations, spanning redshifts z=3 to 16. They identify indirect diagnostics that correlate with ionizing photon escape, then train random forest regression models to predict f_esc and \dot{N}_ion,esc, including versions restricted to JWST-accessible observables. The ratio of star-formation rates measured over 10 and 100 Myr (SFR_10/SFR_100) and the gas-to-stellar mass ratio are the strongest predictors of f_esc, while the rest-frame UV absolute magnitude M_UV dominates the prediction of \dot{N}_ion,esc. Applying the predictive relations to JWST photometric samples, the authors construct reionization histories that are consistent with observational constraints and avoid a previously noted photon budget crisis. Their analysis suggests rapid reionization after z≈8, driven primarily by UV-bright galaxies with M_UV < -17. The work thereby links cosmological zoom-in simulations to JWST observations through machine-learned indirect diagnostics of the sources of reionization.

Key findings

The 10-to-100 Myr star-formation rate ratio (SFR_10/SFR_100) and gas-to-stellar mass ratio (M_gas/M_*) are the top predictors for the ionizing photon escape fraction f_esc.
Rest-frame UV absolute magnitude (M_UV) is the dominant predictor of the ionizing photon escape rate \dot{N}_ion,esc. The correlation is monotonic and tighter than for f_esc, with a scatter of 0.628 dex compared with 0.829 dex.
Four random forest models trained on THESAN-ZOOM simulation data achieved Pearson correlation coefficients of 0.809 and 0.919 for f_esc and \dot{N}_ion,esc respectively using all diagnostics, and 0.739 and 0.899 using only JWST-accessible observables.
Models trained with JWST-accessible diagnostics retained strong predictive performance, with mean absolute errors of 0.476 dex for f_esc and 0.446 dex for \dot{N}_ion,esc (Table 2).
Application of the predicted \dot{N}_ion,esc - M_UV relations combined with observed UV luminosity functions produced reionization histories consistent with observational constraints and without requiring an excessive photon budget.
Galaxy samples with M_UV < -17 were found to provide the dominant contribution to reionization photon emission, implying UV-bright galaxies drive the bulk of reionization rapidly after z≈8.
Distinct physical processes govern f_esc and \dot{N}_ion,esc: f_esc is connected with gas clearing via bursty star formation (supported by SFR ratios and gas/stellar mass), while \dot{N}_ion,esc correlates strongly with UV luminosity reflecting galaxy ionizing output.
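
The step from a \dot{N}_ion,esc versus M_UV relation to a cosmic emissivity, mentioned in the findings above, amounts to integrating that relation against a UV luminosity function. The following is a minimal sketch of the bookkeeping only, assuming a Schechter luminosity function and a log-linear relation. All parameter values (phi_star, m_star, alpha, log_n_ref, slope, m_ref) are illustrative placeholders and are not taken from the paper.

```python
import math

def schechter_mag(m_uv, phi_star, m_star, alpha):
    """Schechter luminosity function per unit magnitude [Mpc^-3 mag^-1]."""
    x = 10.0 ** (-0.4 * (m_uv - m_star))  # L / L_star
    return 0.4 * math.log(10.0) * phi_star * x ** (alpha + 1.0) * math.exp(-x)

def log_ndot_esc(m_uv, log_n_ref=52.0, slope=-0.4, m_ref=-19.0):
    """Hypothetical log-linear relation: log10(Ndot_ion,esc / s^-1) vs M_UV."""
    return log_n_ref + slope * (m_uv - m_ref)

def emissivity(m_bright, m_faint, n_steps=2000, **lf_params):
    """Trapezoidal integral of phi(M) * Ndot_esc(M) dM [s^-1 Mpc^-3]."""
    h = (m_faint - m_bright) / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        m = m_bright + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += weight * schechter_mag(m, **lf_params) * 10.0 ** log_ndot_esc(m)
    return total * h

# Placeholder luminosity-function parameters (assumed, for illustration).
lf = dict(phi_star=1.0e-3, m_star=-21.0, alpha=-2.0)

total = emissivity(-24.0, -13.0, **lf)
bright = emissivity(-24.0, -17.0, **lf)   # galaxies with M_UV < -17
bright_fraction = bright / total
print(f"share of emissivity from M_UV < -17: {bright_fraction:.2f}")
```

The printed share depends entirely on the placeholder parameters, so it says nothing about the paper's result; the sketch only shows how a bright-end contribution would be computed once a relation and a luminosity function are given.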

Methodology — deep read

Threat Model & Assumptions: There is no adversary in the conventional sense; the challenge is to infer the unobservable LyC escape fraction and escape rate from indirect galaxy observables. The authors assume that high-fidelity cosmological radiation-hydrodynamic simulations, which model galactic ISM physics, capture LyC photon escape realistically.
Data: Their primary data source is the THESAN-ZOOM suite of zoom-in cosmological radiation-hydrodynamic simulations that improve resolution and ISM modelling on galaxy scales compared to the parent THESAN volume. The analysis focuses on the largest zoom-in region (m12.6) with 43,076 resolved galaxies from redshift 3 to 16. Applying a star-formation cut (SFR_50 > 0), the final sample for training included 35,512 mock galaxies. Each galaxy’s properties include stellar mass, gas mass within ISM, multiple star-formation rates over different timescales, UV luminosities (1500 Å), H-alpha line luminosities, metallicities, radii, and dust attenuation derived from post-processed Monte Carlo radiative transfer using the COLT code.
Architecture/Algorithm: The core predictive method is random forest (RF) regression, a machine learning ensemble of decision trees that captures nonlinear dependencies and interactions without requiring parametric model assumptions. Four RF models were trained: two to predict f_esc and \dot{N}_ion,esc using the full set of simulation-derived diagnostics, and two trained only on observables realistically accessible to JWST photometric surveys (e.g., M_UV, UV slopes). The input features were transformed via base-10 logarithms (except magnitudes), and include key galaxy diagnostics such as SFR ratios, gas-to-stellar mass ratios, UV magnitudes, stellar masses, metallicities, and radii. A redshift term log10(1+z) was also included to incorporate cosmic evolution. Hyperparameters of the RF models (e.g., number of trees, depth) were tuned to optimize mean absolute error on test sets.
Training Regime: For each model, the dataset was split 75:25 into train and test sets. 1000 RF regressors were trained, and the one with the lowest test MAE was selected for final evaluation. Bootstrap resampling of the training samples and random feature selection at each split help limit overfitting. Feature importances were averaged over the models and used to prune uninformative features iteratively.
Evaluation Protocol: Prediction accuracy was evaluated by mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and Pearson correlation coefficient (PCC) on test sets. Model A (full f_esc predictor) reached PCC=0.809 and MAE=0.420 dex; Model B (full \dot{N}_ion,esc predictor) reached PCC=0.919 and MAE=0.407 dex (Table 2). JWST-compatible models had slightly lower performance but still strong correlations. The models were evaluated over the full redshift range and galaxy sample, with no explicit cross-validation described beyond train-test splits.
Reproducibility: The models rely on non-public THESAN-ZOOM simulation outputs from large-scale radiation-hydrodynamic runs, post-processed with COLT, which may limit external reproduction. The authors describe the model input features and the hyperparameter tuning strategy for the RF regressors, and they reference the public COLT code for radiative transfer. No code release or frozen model weights are mentioned.
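
The training regime described above (repeated 75:25 splits, keeping the regressor with the lowest test MAE, and averaging feature importances) can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline: the catalogue is synthetic, the toy target relation is invented, the hyperparameters are arbitrary, and only 5 realisations are run instead of the 1000 quoted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a simulation catalogue (NOT THESAN-ZOOM data).
n = 600
log_burst = rng.normal(0.0, 0.4, n)          # log10(SFR_10 / SFR_100)
log_gas_ratio = rng.normal(0.5, 0.3, n)      # log10(M_gas / M_*)
m_uv = rng.uniform(-22.0, -14.0, n)          # magnitudes stay untransformed
log_1pz = np.log10(1.0 + rng.uniform(3.0, 16.0, n))
X = np.column_stack([log_burst, log_gas_ratio, m_uv, log_1pz])

# Toy target standing in for log10(f_esc); the coefficients are made up.
y = -2.0 + 0.8 * log_burst - 0.5 * log_gas_ratio + rng.normal(0.0, 0.3, n)

n_realisations = 5  # the summary quotes 1000; reduced here for speed
best_mae, best_model = np.inf, None
importances = []
for seed in range(n_realisations):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = RandomForestRegressor(
        n_estimators=50, max_features="sqrt", bootstrap=True,
        random_state=seed)
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    importances.append(model.feature_importances_)
    if mae < best_mae:                        # keep the lowest test-MAE model
        best_mae, best_model = mae, model

mean_importance = np.mean(importances, axis=0)  # averaged over realisations
print("best test MAE [dex]:", round(best_mae, 3))
print("mean importances:", np.round(mean_importance, 3))
```

Because the retained model is chosen by its test-set MAE, the quoted MAE is mildly optimistic; a separate validation split would be needed for an unbiased estimate.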

Concrete end-to-end example: For a galaxy at z=6 with measured stellar mass, gas mass, UV magnitude, SFR over 10 and 100 Myr, and metallicity, the RF model takes the log-transformed features together with the redshift term and passes them through its decision trees to predict f_esc or \dot{N}_ion,esc. The prediction uncertainty is quantified by comparing test-set predictions with the ground truth from the simulations. The procedure is repeated for over 35,000 galaxies, which supports statistical inference and the construction of relations such as \dot{N}_ion,esc versus M_UV used in modelling the reionization history.
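
The feature preparation in this example (base-10 logarithms for everything except magnitudes, plus a log10(1+z) term) can be sketched as follows. The function name, the dictionary keys, and the galaxy values are hypothetical and chosen only for illustration.

```python
import math

# Quantities stored as magnitudes are passed through unchanged;
# every other quantity is log10-transformed.
MAGNITUDE_KEYS = {"m_uv"}

def build_features(galaxy, z):
    """Return a dict of model inputs for one galaxy at redshift z."""
    features = {}
    for key, value in galaxy.items():
        if key in MAGNITUDE_KEYS:
            features[key] = value
        else:
            features["log_" + key] = math.log10(value)
    features["log_1pz"] = math.log10(1.0 + z)  # redshift term
    return features

# Hypothetical z = 6 galaxy; the numbers are not taken from the paper.
galaxy = {
    "m_star": 1.0e8,       # stellar mass [Msun]
    "m_gas": 5.0e8,        # ISM gas mass [Msun]
    "sfr_ratio": 2.0,      # SFR_10 / SFR_100
    "metallicity": 0.1,    # in solar units
    "m_uv": -18.5,         # rest-frame UV absolute magnitude
}
x = build_features(galaxy, z=6.0)
for name, value in x.items():
    print(f"{name:16s} {value: .3f}")
```

The resulting values, in a fixed column order, would form the single-row input passed to a trained regressor's predict method.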

Technical innovations

Application of high-resolution THESAN-ZOOM cosmological radiation-hydrodynamic zoom-in simulations to build a rich dataset of galaxy properties and ionizing photon escape characteristics across z=3-16.
Development of random forest regression models that reliably predict LyC photon escape fraction and escape rate using indirect galaxy diagnostics, including JWST-accessible observables.
Identification of the 10-to-100 Myr star-formation rate ratio and gas-to-stellar mass ratio as key physical diagnostics linked to LyC escape fraction, highlighting the role of bursty star formation and gas clearing.
Demonstration that rest-frame UV absolute magnitude is the strongest single predictor of the LyC photon escape rate, enabling robust reionization emissivity modeling from direct photometric observables.

Datasets

THESAN-ZOOM simulation catalogue — 43,076 galaxies (35,512 after SFR cut) — cosmological radiation-hydrodynamic zoom-in simulations spanning redshift 3 to 16 (non-public)
JWST/NIRCam JADES photometric galaxy sample — ~40,000 galaxies — public JWST photometric survey data (public)

Baselines vs proposed

Model A (full diagnostics for f_esc): MAE = 0.420 dex, PCC = 0.809 vs Model C (JWST-compatible diagnostics for f_esc): MAE = 0.476 dex, PCC = 0.739
Model B (full diagnostics for \dot{N}_ion,esc): MAE = 0.407 dex, PCC = 0.919 vs Model D (JWST-compatible for \dot{N}_ion,esc): MAE = 0.446 dex, PCC = 0.899
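
The metrics quoted in these comparisons (MAE, MSE, RMSE, and the Pearson correlation coefficient) have standard definitions, sketched below in plain Python. The input arrays are toy values; when the targets are log10 quantities, as in the summary, the error metrics come out in dex.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and Pearson correlation for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pcc = cov / math.sqrt(var_t * var_p)

    return {"mae": mae, "mse": mse, "rmse": rmse, "pcc": pcc}

# Toy log10 targets and predictions (illustrative, not from the paper).
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 3.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Note that MAE and PCC measure different things: a model can track the ranking of galaxies well (high PCC) while still carrying a sizeable absolute error in dex.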

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2606.14671.

Fig 1

Fig 1: Ionizing photon escape fraction (f_esc; left panel) and ionizing photon escape rate (\dot{N}_ion,esc; right panel) as a function of rest-frame UV absolute

Fig 2

Fig 2: Gas mass (M_gas) versus stellar mass (M_*) for THESAN-ZOOM galaxies, colour-coded by median ionizing photon escape fraction (f_esc; left panel) and

Fig 3

Fig 3: Mean feature importances for the four random forest models, averaged over 1000 train–test realisations. Importances represent the fractional contribution

Fig 4

Fig 4 (page 5).

Fig 4

Fig 4: True versus predicted values for the four random forest models presented. From left to right they are: Model A, our best f_esc predictor trained with all

Fig 5

Fig 5: Model predictions of f_esc and \dot{N}_ion,esc for galaxies in the JWST photometric catalogue from Simmonds et al. (2025). Top left: Predicted f_esc as a

Fig 6

Fig 6: Comoving cosmic ionizing photon emissivity, \dot{n}_ion, as a function of redshift, z, derived from the THESAN-ZOOM-based (violet points) and JWST-based

Fig 8

Fig 8 (page 8).

Limitations

The predictors rely on the fidelity and physical accuracy of the THESAN-ZOOM simulations; any systematic uncertainties in these simulations propagate into the models.
Direct observational validation of f_esc and \dot{N}_ion,esc during the EoR remains impossible, so model accuracy against real high-z galaxies is untestable currently.
The study focuses on a single zoom-in region (m12.6) for computational tractability, risking limited sampling of galaxy environmental diversity and cosmic variance.
No explicit adversarial analysis or robustness testing of the machine learning models to out-of-distribution galaxies or extreme ISM conditions was conducted.
JWST photometric observable models exclude spectroscopic diagnostics and may omit subtle features relevant to escape fractions.
Redshift evolution was incorporated as a global variable log10(1+z), but possible evolution in feature-to-escape relations at fixed z was not deeply analyzed.

Open questions / follow-ons

How robust are the identified LyC escape diagnostics to variations in ISM physics, feedback prescriptions, and cosmic environment not fully captured in the THESAN-ZOOM simulations?
Can additional spectroscopy-based observables from JWST or future facilities improve escape fraction predictions beyond photometric diagnostics?
What is the detailed time variability and anisotropy of LyC escape in galaxies, and how could temporal or spatially resolved models improve bulk predictions?
How do AGN contributions, known to affect LyC escape, integrate into the diagnostic framework beyond star formation dominated galaxies?

Why it matters for bot defense

Although astrophysical in nature, the methodology of extracting hard-to-observe internal state properties (LyC photon escape fractions/rates) from indirect, noisy, high-dimensional observables using machine learning on simulation data parallels challenges in bot-defense where intrinsic bot behaviors must be inferred from indirect features. The study exemplifies rigorous selection of predictive diagnostics, detailed feature importance analysis, and training of robust random forest models to predict complex unobservable quantities, a pattern transferable to CAPTCHA and bot-detection contexts where direct labels are limited or missing. The emphasis on restricting models to observables accessible to a given platform (like JWST photometry) mirrors practical constraints in bot-defense where only certain observable features are available for classification. Finally, the careful analysis of model performance metrics, feature pruning, and physical interpretability underscores best practices for trustworthy bot-detection ML pipelines.

Cite

bibtex

@article{arxiv2606_14671,
  title={From THESAN-ZOOM to JWST: Predicting ionizing photon escape and the rise of UV-bright reionization sources},
  author={Zebedee Summerfield and William McClymont and Sandro Tacchella and Aaron Smith and Rahul Kannan and Enrico Garaldi and Ewald Puchwein and Xuejian Shen and Josh Borrow and A. Lola Danhaive and Laura Keating and Gabriel Maheson and Parth Mehrotra and Charlotte Simmonds and Amanda Stoffers and Mark Vogelsberger and Oliver Zier},
  journal={arXiv preprint arXiv:2606.14671},
  year={2026},
  url={https://arxiv.org/abs/2606.14671}
}
