Evaluating the performance of GCM trajectories using Weather Type frequencies for persistence and transitions: the Iberian Peninsula and Lamb classification

Source: arXiv:2605.00779 · Published 2026-05-01 · By Elsa Barrio-Torres, Swen Brands, Jesús Asín, Jesús Abaurrea, Zeus Gracia-Tabuenca

TL;DR

This paper addresses the challenge of objectively evaluating how well global climate model (GCM) trajectories reproduce the atmospheric circulation patterns that matter for regional climate impact studies. The authors assess 36 historical CMIP6 GCM runs over the Iberian Peninsula during the summer months (June–September, 1979–2005) using the Lamb Weather Type (WT) classification. The key innovation is to extend the traditional evaluation of daily occurrence frequencies so that it also covers day-to-day persistence and transition probabilities between weather types, quantified with an Overlap coefficient similarity metric. Applying a filtering threshold on the joint reproduction of daily and conditional WT distributions narrows the ensemble to 12 trajectories that capture these dynamics. The performance metrics indicate that models from the EC-Earth3 family, particularly EC-Earth3-AerChem, deliver the most consistent representation across the domain. Performance varies markedly in space, with better fidelity in the northwest Iberian Peninsula than in the central and southern Mediterranean regions. The authors propose the methodology as a broadly applicable, objective framework for GCM selection based on atmospheric circulation dynamics rather than on marginal frequency distributions alone.

Key findings

Out of 36 CMIP6 GCM historical trajectories, 16 can adequately reproduce daily WT frequency distributions over the Iberian Peninsula summer months.
Only 12 trajectories meet the minimum Overlap coefficient similarity threshold (t_sim = 0.8) for both the daily and the conditional WT distributions at 10 or more of the 30 grid points, indicating that they capture persistence and transitions.
The EC-Earth3-AerChem trajectory shows the highest and most consistent performance across the metrics for daily frequencies, conditional transitions, and persistence reproduction.
There is a pronounced geographical performance gap: northwest Iberian Peninsula grid points exhibit better model skill in reproducing WT dynamics than central or southern Mediterranean areas.
Purely anticyclonic (PA) and unclassified (U) weather types are highly persistent, with persistence relative frequencies reaching 0.59 and 0.74 respectively in the ERA5 reanalysis.
The filtering approach using joint daily and conditional WT distribution Overlap discriminates better among GCMs than daily frequency comparison alone.
The sample size available for estimating conditional WT transition probabilities can be as low as ~30 days for rare types, which poses a statistical challenge.
The proposed categorical distribution similarity measures (Overlap coefficient, Bhattacharyya coefficient, Hellinger distance) provide a robust framework for comparing model and reanalysis WT patterns.
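
A minimal sketch of the three similarity measures named above, written from their standard textbook definitions for two discrete distributions over the same categories. The function names and the three-category example values are illustrative assumptions, not taken from the paper or its code.

```python
import numpy as np

def overlap(p, q):
    """Overlap coefficient: sum of element-wise minima (1 means identical)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.minimum(p, q).sum())

def bhattacharyya(p, q):
    """Bhattacharyya coefficient: sum of sqrt(p_i * q_i) (1 means identical)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(p * q).sum())

def hellinger(p, q):
    """Hellinger distance: sqrt(1 - Bhattacharyya coefficient) (0 means identical)."""
    # Clip at zero to guard against tiny negative values from rounding.
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya(p, q))))

# Hypothetical relative frequencies over three weather types.
era = [0.5, 0.3, 0.2]
gcm = [0.4, 0.4, 0.2]

print(overlap(era, gcm), bhattacharyya(era, gcm), hellinger(era, gcm))
```

For distributions that each sum to 1, the Overlap coefficient equals one minus the total variation distance, so a threshold of 0.8 tolerates at most 0.2 of probability mass being misallocated between categories.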

Methodology — deep read

The study presents a thorough, stepwise methodological framework to evaluate GCM performance in reproducing atmospheric circulation dynamics via Weather Types (WT). The process includes:

Assumptions: GCM fidelity is assessed for 36 CMIP6 historical runs (1979–2005) against the ERA5 reanalysis as the reference. WT daily and transition probabilities are assumed to be stationary over the chosen period, and the analysis is restricted to the summer months (June–September) to limit seasonality and non-stationarity effects.
Data: WT time series were derived from daily SLP fields for each grid point (2.5°×2.5°) in the Iberian Peninsula region (30 grid points in 35°–45°N and 8.75°W–3.75°E), using the Jenkinson–Collison Lamb Weather Type classification, generating 27 WT categories. ERA5 daily WT data serve as ground truth; GCM model daily WTs were extracted analogously. The time series cover 27 years (1979–2005) during summer months with 3,294 days per grid point.
Architecture/Algorithm: WT dynamics are characterized by calculating joint relative frequencies (joint rf) of two consecutive days' weather types, enabling computation of daily occurrence distributions (marginals) and conditional distributions (transitions and persistence). These categorical distributions are compared via similarity measures including the Overlap coefficient, Bhattacharyya coefficient, and Hellinger distance.
Training: There is no training in the machine-learning sense; the data processing computes empirical probabilities from the reanalysis and modeled daily WT sequences.
Evaluation Protocol: At each grid point s, the Overlap similarity O_{ERA,n}(s) between ERA5 and each GCM trajectory n is computed for the daily frequency distribution and for the conditional distributions. Trajectories are filtered with a similarity threshold t_sim = 0.8 and a minimum number of grid points k = 10 at which the threshold must be met. Summary metrics aggregate the results over the region: Daily Reproduction (DR) sums the overlap of the daily distributions, Conditional Reproduction (CR) that of the transition probabilities, and Persistence Reproduction (PerR) that of within-type persistence. These metrics are used to rank the surviving trajectories.
Reproducibility: The method and code are provided publicly via GitHub, allowing reuse for other regions or with alternative weather typing schemes. The data sources (ERA5 and CMIP6 GCM outputs) are standard climate datasets. The procedure is fully documented, and no random seeds are involved because it relies only on deterministic frequency estimation.
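
The joint, marginal, and conditional relative frequencies described in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, weather types are encoded as integers 0..n_types-1, and the sequence is treated as one continuous run of days (a real summer-only series would need to drop the day pairs that straddle two different years).

```python
import numpy as np

def joint_rf(wt, n_types):
    """Joint relative frequencies rf(i, j): i = type on day d, j = type on day d-1."""
    wt = np.asarray(wt, dtype=int)
    counts = np.zeros((n_types, n_types))
    # Count every consecutive-day pair; rows index today, columns index yesterday.
    np.add.at(counts, (wt[1:], wt[:-1]), 1)
    return counts / counts.sum()

def marginal_rf(joint):
    """Daily occurrence distribution rf(i), obtained by summing over j."""
    return joint.sum(axis=1)

def conditional_rf(joint):
    """Conditional distribution rf(i | j); columns for unseen types j stay at zero."""
    col = joint.sum(axis=0, keepdims=True)
    return np.divide(joint, col, out=np.zeros_like(joint), where=col > 0)

# Toy sequence with two weather types (hypothetical data).
wt = [0, 0, 1, 0, 0, 0, 1, 1]
joint = joint_rf(wt, n_types=2)
daily = marginal_rf(joint)
cond = conditional_rf(joint)
persistence = np.diag(cond)  # rf(i | i): same type on two consecutive days

print(daily, persistence)
```

Storing the previous day's type on the column axis keeps the indexing aligned with the rf(i, j) convention used in the text, so persistence is simply the diagonal of the conditional matrix.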

Concrete end-to-end example: For grid point s, the ERA5 joint WT relative frequency rf_ERA(i,j;s) is computed from pairs of consecutive summer days (i = WT on day d, j = WT on day d-1). For GCM trajectory n, rf_n(i,j;s) is computed in the same way. The marginal daily frequencies rf_ERA(i;s) and rf_n(i;s) are then obtained by summing over j. For the daily distributions, the Overlap coefficient is O_{ERA,n}(s) = sum over i of min(rf_ERA(i;s), rf_n(i;s)). The conditional relative frequencies rf_ERA(i|j;s) and rf_n(i|j;s) are formed next, and Overlap coefficients are computed for the transitions. Grid points where these similarity scores reach t_sim = 0.8 count toward the filter, and trajectories with at least k = 10 such grid points survive selection. Finally, the DR, CR, and PerR scores are summed over the region for each trajectory and compared to identify the best performers, e.g. EC-Earth3-AerChem.
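
The filtering step in this example can be sketched as below. Two assumptions are made that this summary does not confirm: the conditional similarity is already summarised as one Overlap value per grid point, and a grid point counts only when the daily and the conditional value both reach the threshold at that same point. The function name, trajectory names, and overlap values are hypothetical.

```python
import numpy as np

def select_trajectories(daily_ov, cond_ov, t_sim=0.8, k=10):
    """Keep trajectories whose daily and conditional overlaps both reach
    t_sim at no fewer than k grid points.

    daily_ov, cond_ov: dicts mapping trajectory name -> per-grid-point overlaps.
    """
    kept = []
    for name in daily_ov:
        daily_ok = np.asarray(daily_ov[name]) >= t_sim
        cond_ok = np.asarray(cond_ov[name]) >= t_sim
        n_passing = int((daily_ok & cond_ok).sum())
        if n_passing >= k:
            kept.append(name)
    return kept

# Synthetic overlaps on 30 grid points for two made-up trajectories.
daily_ov = {"gcm_a": np.full(30, 0.90), "gcm_b": np.full(30, 0.90)}
cond_ov = {
    "gcm_a": np.r_[np.full(12, 0.85), np.full(18, 0.60)],  # 12 points pass
    "gcm_b": np.r_[np.full(5, 0.85), np.full(25, 0.60)],   # only 5 points pass
}

kept = select_trajectories(daily_ov, cond_ov)
print(kept)
```

Both made-up trajectories match the daily distribution everywhere, yet only one survives, which mirrors how the conditional criterion separates runs that the daily criterion alone cannot.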

Technical innovations

Introduction of a joint distribution similarity framework using Overlap coefficients to evaluate both daily WT frequency and day-to-day transition/persistence dynamics in GCM trajectories.
A filtering algorithm that selects GCM runs by requiring minimum similarity thresholds simultaneously on marginal daily and conditional transition WT distributions across spatial grid points.
Demonstration that the persistence and transition dynamics of WTs add discriminative power to GCM evaluation beyond daily frequency distributions alone (16 trajectories pass the daily criterion, 12 pass the joint one).
Application of the Jenkinson–Collison Lamb Weather Type classification together with categorical similarity measures to regional GCM performance evaluation, in a workflow the authors present as transferable to other regions.

Datasets

ERA5 reanalysis — daily WT sequences over Iberian Peninsula summer months (1979–2005) — publicly available
CMIP6 GCM historical trajectories — 36 ensemble members from multiple climate models (1979–2005) — public ESGF nodes
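
The sample size and grid dimensions quoted in the methodology can be cross-checked with a few lines of arithmetic. The sketch assumes that the grid-point coordinates fall exactly on the quoted domain boundaries at 2.5° spacing, which this summary does not state explicitly.

```python
from datetime import date
import numpy as np

# Days from 1 June to 30 September inclusive (the same in every year).
days_per_summer = (date(1979, 10, 1) - date(1979, 6, 1)).days
n_years = 2005 - 1979 + 1
total_days = days_per_summer * n_years

# 2.5-degree grid spanning 35-45 N and 8.75 W - 3.75 E (assumed point locations).
lats = np.arange(35.0, 45.0 + 1e-9, 2.5)
lons = np.arange(-8.75, 3.75 + 1e-9, 2.5)
n_grid_points = len(lats) * len(lons)

print(days_per_summer, n_years, total_days, n_grid_points)
```

Under these assumptions the result is 122 days per summer over 27 years, giving 3,294 days per grid point on a 5 by 6 grid of 30 points, in agreement with the figures above.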

Baselines vs proposed

Baseline: ERA5 reanalysis daily and conditional WT distribution (overlap = 1.0)
GCMs passing daily WT frequency threshold (Overlap >= 0.8): 16 out of 36 trajectories
After filtering joint daily and conditional WT thresholds (Overlap >= 0.8 at >=10 grid points): 12 trajectories remain
Top performer EC-Earth3-AerChem: highest DR, CR, and PerR scores across the Iberian Peninsula among the selected trajectories (exact numeric values not given)
Geographic difference in model fidelity: Northwest Iberian grid points show higher overlap (approaching 0.8–0.9) versus central/southern Mediterranean points with lower overlap (< 0.7)

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.00779.

Fig 3: Flow chart of the proposed algorithm.

Limitations

Conditional WT transition probabilities for rare weather types suffer from small sample sizes (~30 occurrences), reducing statistical robustness.
Assumption of stationarity and homogeneity over 27-year summer period may not hold for longer or full-year datasets with evident seasonal cycles and trends.
Evaluations are limited to the Iberian Peninsula summer months and may not generalize directly to other regions or seasons.
Models evaluated only on historical period; no direct assessment of performance under future climate forcing or non-stationarity.
Method relies on the Jenkinson–Collison WT classification; other WT schemes or circulation pattern definitions might yield different model rankings.
Spatial aggregation choices (e.g., grid size, number of grid points k) influence filtering thresholds and final trajectory selection.
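
To give a sense of scale for the small-sample limitation, the sketch below computes the binomial standard error of an estimated relative frequency. It assumes independent days, which is optimistic for persistent weather types, so the true uncertainty is likely larger. The function name and the value p = 0.5 are illustrative assumptions.

```python
import math

def rf_standard_error(p, n):
    """Binomial standard error of a relative frequency p estimated from n days."""
    return math.sqrt(p * (1.0 - p) / n)

se_rare = rf_standard_error(0.5, 30)      # conditioning type seen on ~30 days
se_full = rf_standard_error(0.5, 3294)    # full summer sample at one grid point

print(round(se_rare, 3), round(se_full, 4))
```

With about 30 days the standard error is roughly 0.09, about ten times larger than with the full 3,294-day sample, so an Overlap value built on such conditional frequencies can shift noticeably through sampling noise alone.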

Open questions / follow-ons

How robust are the WT transition and persistence similarity metrics when applied to different climate regions or seasons beyond the Iberian Peninsula summer?
Can the approach be adapted or extended to assess GCM performance under future climate scenarios with non-stationary circulation regimes?
What is the impact of using alternative weather typing schemes or other categorical circulation classifications on trajectory selection?
How does weighting or filtering rare weather types influence the sensitivity and specificity of the filtering algorithm for GCM selection?

Why it matters for bot defense

For bot-defense or CAPTCHA practitioners, this work illustrates a rigorous framework for evaluating categorical time-series dynamics beyond simplistic frequency comparisons. Analogously, in CAPTCHA lifecycle analysis or bot behavior modeling, assessing transition probabilities and persistence of categorical behavioral states can provide more discriminative power for distinguishing genuine from automated patterns. The methodology highlights the value of joint statistical evaluation of temporal state sequences rather than marginal distributions alone, a principle applicable to multi-step event or user-session modeling in bot detection pipelines. However, the domain-specific circulation pattern classification does not directly transfer; adaptation to behavioral 'weather types' or user states would be required. The spatial heterogeneity considerations remind practitioners to consider context-dependent performance across subpopulations or geographies in defense models.

Cite

bibtex

@article{arxiv2605_00779,
  title={Evaluating the performance of GCM trajectories using Weather Type frequencies for persistence and transitions: the Iberian Peninsula and Lamb classification},
  author={Elsa Barrio-Torres and Swen Brands and Jesús Asín and Jesús Abaurrea and Zeus Gracia-Tabuenca},
  journal={arXiv preprint arXiv:2605.00779},
  year={2026},
  url={https://arxiv.org/abs/2605.00779}
}
