AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

Source: arXiv:2605.06607 · Published 2026-05-07 · By Nithin Somasekharan, Rabi Pathak, Manushri Dhanakoti, Tingwen Zhang, Ling Yue, Andy Zhu et al.

TL;DR

AI CFD Scientist addresses the gap between generic LLM-based AI-scientist frameworks (designed for software-only ML research) and the stricter validity requirements of computational fluid dynamics. The core problem is that CFD solver completion is not a sufficient proxy for physical correctness: a simulation can converge cleanly while producing physically nonsensical fields, incorrect geometry, or degenerate turbulence-model outputs — failures that are invisible in solver logs but obvious in rendered flow fields. The authors argue that any system that converts CFD runs into scientific claims must therefore include image-level physics verification, mesh-independence gating, and source-code-level model modification as first-class components, none of which appear together in prior work.

The system is built on top of OpenFOAM via Foam-Agent and exposes three coupled pathways: parameter sweeps within a fixed solver, case-local C++ library compilation for new physical models, and an open-ended hypothesis search loop that autonomously edits turbulence model source code against a reference comparator. The central innovation is a vision-language model (VLM) physics-verification gate that renders flow fields as PNGs via PyVista and submits them to a VLM in two sequential calls — a readability filter followed by a physics-consistency check — before any result is accepted, rerun, or incorporated into a manuscript draft. A LaTeX-writing agent closes the loop by grounding every claim to a specific figure or numerical record produced by a gate-passing case.

Across five tasks evaluated under a shared GPT-5.5 backbone, the headline result is on the periodic hill benchmark at Reh=5600: over 44 autonomous discovery iterations, the system finds a 'quadRecTail' Spalart-Allmaras runtime correction that reduces lower-wall skin-friction coefficient RMSE against DNS by 7.89%. A controlled planted-failure ablation shows that the VLM gate catches 14 of 16 synthetic silent failures that pass all solver-level checks. Two competing baselines, ARIS and DeepScientist, execute partial CFD workflows under matched LLM cost but consistently produce scientifically unsupported claims because they lack domain-specific validity gates. Code, prompts, and run artifacts are released openly.

Key findings

The VLM physics-verification gate detects 14 of 16 planted silent failures (missing_deliverable 4/4, wrong_magnitude_metric 4/4, broken_postprocessing 4/4, convergence_not_settled 2/4) that pass all solver-level log checks, on a 4-category × 4-case design matrix (Table 3).
Open-ended discovery pathway (T5) autonomously runs 44 iterations and discovers a quadrupolar SA runtime correction ('quadRecTail') that reduces lower-wall Cf RMSE against DNS (Krank et al.) from 0.004297 to 0.003958, a 7.89% improvement, on the periodic hill at Reh=5600 (Fig 2d).
Custom non-Newtonian viscosity library (T3) compiled successfully on the first attempt; with power-law exponent n=1, the custom model reproduces the analytic Newtonian centreline velocity of the parabolic profile (1.5 m/s target) to within 0.5%, which serves as the degeneracy check for the code-modification pathway.
In T4, the APG=0 control case for the custom SA modifier matched the built-in SA baseline to four decimal places (Umax=1.5959 m/s in both), validating that the case-local C++ compilation path does not perturb the underlying solver numerics.
Under matched GPT-5.5 LLM cost, ARIS and DeepScientist both executed partial CFD workflows on T1–T4 but issued closure rankings and Strouhal-number correlations without DNS validation or grid-convergence evidence; AI CFD Scientist explicitly withheld rankings and recorded 'unresolved' verdicts in the same situations (Table 5).
The mesh-independence gate in T5 measured a <2% difference in the Cf quantity of interest (QoI) between the baseline and refined meshes (~10% near-wall, ~5% bulk refinement) with y+ ~1 sustained, which passes the gate's 5% threshold.
T2 jet/plume Re-sweep recovered expected centreline Ux scaling (Uc,max tracks bulk velocity from 0.09 to 0.60 m/s as Re sweeps 60→600) across 7 identical 35,156-cell meshes; case-006 was autonomously flagged as anomalous (centreline-mean collapse) by the analysis agent (Fig 2b).
The convergence_not_settled failure category was the gate's main weakness: only 2/4 truncated-run cases were flagged because consistently edited endTime values make visually incomplete simulations appear complete to the VLM inspector.
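
The two headline figures above can be re-derived from the numbers quoted in this list. The following is a minimal arithmetic sketch in Python; the variable and function names are illustrative and do not come from the paper's code.

```python
# Values as quoted in the key findings above.
baseline_rmse = 0.004297  # built-in SA, lower-wall Cf RMSE vs DNS
best_rmse = 0.003958      # quadRecTail


def relative_reduction_pct(baseline: float, candidate: float) -> float:
    """Percent reduction of candidate relative to baseline (positive means better)."""
    return 100.0 * (baseline - candidate) / baseline


# Planted failures flagged by the gate, per category (4 cases each).
flagged = {
    "missing_deliverable": 4,
    "wrong_magnitude_metric": 4,
    "broken_postprocessing": 4,
    "convergence_not_settled": 2,
}
cases_per_category = 4

improvement = relative_reduction_pct(baseline_rmse, best_rmse)
total_flagged = sum(flagged.values())
total_planted = cases_per_category * len(flagged)
detection_rate = 100.0 * total_flagged / total_planted

print(f"Cf RMSE reduction: {improvement:.2f}%")  # 7.89%
print(f"Gate detections: {total_flagged}/{total_planted} ({detection_rate:.1f}%)")  # 14/16 (87.5%)
```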

Threat model

n/a — this is not a security paper. The closest analog to a threat model is the system's internal adversarial assumption encoded in design principle P4: the agent itself is treated as a potential 'adversary' that might hallucinate an alternate experiment, swap the swept variable, or relax success criteria to convert a failing case into an apparently passing one. The system defends against this by requiring all claims to trace to gate-passing artifacts (P5) and by prohibiting the rerun controller from weakening the scientific objective. No external adversary model is defined.
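
The traceability rule described above (claims must trace to gate-passing artifacts) can be illustrated with a small filter. This is a hedged sketch only: the `Artifact` and `Claim` types, their fields, and the example identifiers are hypothetical and are not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Artifact:
    artifact_id: str   # hypothetical identifier for a figure or numerical record
    gate_passed: bool  # True only if the producing case passed every gate


@dataclass(frozen=True)
class Claim:
    text: str
    artifact_ids: Tuple[str, ...]  # artifacts the claim cites as evidence


def supported_claims(claims: List[Claim], artifacts: List[Artifact]) -> List[Claim]:
    """Keep claims that cite at least one artifact and only gate-passing ones."""
    passed = {a.artifact_id for a in artifacts if a.gate_passed}
    return [
        c for c in claims
        if c.artifact_ids and all(i in passed for i in c.artifact_ids)
    ]


artifacts = [Artifact("fig_cf_profile", True), Artifact("fig_truncated_run", False)]
claims = [
    Claim("Cf RMSE improves over baseline SA.", ("fig_cf_profile",)),
    Claim("Shedding frequency scales with Re.", ("fig_truncated_run",)),
    Claim("The model generalizes to other geometries.", ()),  # no evidence cited
]
kept = supported_claims(claims, artifacts)
print([c.text for c in kept])
```

In this toy run only the first claim survives: the second cites an artifact from a case that failed a gate, and the third cites nothing.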

Methodology — deep read

Threat model and assumptions: this is not a security paper in the traditional sense, but the adversarial assumption is methodological — the system must not hallucinate results, swap swept variables, or relax success criteria to make a failing case appear to pass. Design principle P4 explicitly prohibits the agent from taking any action that would make a failing run easier to claim as successful. The system assumes access to a working OpenFOAM installation, a GPT-5.5 API, and (optionally) reference DNS data and a starter case file. There is no adversarial user model.

Data provenance and structure: the five tasks use heterogeneous inputs. T1 and T2 provide no baseline OpenFOAM files or reference data — the system generates everything from scratch guided by literature retrieval via Semantic Scholar. T3 provides baseline Newtonian channel flow OpenFOAM files. T4 and T5 provide baseline periodic hill OpenFOAM files plus reference DNS Cf(x/h) data from Krank et al. The T5 objective function is computed at 99 wall sample points along the lower wall. No external training dataset is used; the LLM backbone is GPT-5.5 accessed via API with no fine-tuning.

Architecture and novel components: the system is implemented as a LangGraph checkpointed workflow with a parallel modular skills-based variant. Agents exchange structured artifacts (study JSON, requirement paragraphs, source-edit plans, run directories, figure manifests, interpretation JSON, LaTeX drafts) rather than free text. The five operational design principles (P1–P5) are encoded as hard constraints in prompts and execution logic. The three pathways — regular experimentation, code modification, and open-ended discovery — share a capability bus comprising a literature retrieval tool (Semantic Scholar), a string-similarity novelty filter, a requirement validator/repair loop, Foam-Agent for OpenFOAM execution, a PyVista/matplotlib visualization creator, and the VLM gate. The code-modification pathway uses compiler diagnostics as structured feedback in a repair loop and verifies library loading with a smoke test before any sweep. The open-ended discovery pathway wraps both other pathways in an outer hypothesis loop: at each iteration it proposes a concrete edit, compiles it, smoke-tests it, gates it with mesh independence and VLM checks, scores it against the reference comparator, and checkpoints it only if it improves over the previous best. The VLM gate makes two sequential API calls: first a readability/quality check that triggers figure redrawing on failure, then a physics-consistency check that returns ACCEPT, REVISE, or RERUN. REVISE triggers requirement rewriting; RERUN retries with relaxed numerics borrowed from nearby successful cases.
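
The two-call gate described above can be sketched as plain control flow. The code below is a minimal illustration, not the released implementation: the three callables stand in for the VLM requests and the figure redraw step, their signatures are hypothetical, and the `max_redraws` cap with its RERUN fallback is an assumption the summary does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ACCEPT = "ACCEPT"
    REVISE = "REVISE"
    RERUN = "RERUN"


@dataclass
class GateResult:
    verdict: Verdict
    redraws: int


def run_gate(
    png_path: str,
    is_readable: Callable[[str], bool],        # call 1: readability/quality check
    check_physics: Callable[[str], Verdict],   # call 2: physics-consistency check
    redraw: Callable[[str], str],              # re-renders the figure, returns new path
    max_redraws: int = 2,                      # assumption: cap on redraw attempts
) -> GateResult:
    redraws = 0
    while not is_readable(png_path):
        if redraws >= max_redraws:
            # Assumption: a figure that stays unreadable is escalated as RERUN.
            return GateResult(Verdict.RERUN, redraws)
        png_path = redraw(png_path)
        redraws += 1
    # The physics check only ever sees a figure that passed the readability filter.
    return GateResult(check_physics(png_path), redraws)


# Stub callables for demonstration; a real system would call a VLM here.
result = run_gate(
    "ux_field.png",
    is_readable=lambda p: p.endswith("_v2.png"),
    check_physics=lambda p: Verdict.ACCEPT,
    redraw=lambda p: p.replace(".png", "_v2.png"),
)
print(result.verdict.value, result.redraws)
```

Keeping the readability call separate means a badly rendered figure triggers a cheap redraw instead of being misread as a physics failure.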

Training regime: there is no model training. All intelligence comes from GPT-5.5 API calls. The paper does not report total iteration counts for T1–T4 individually beyond case counts (4, 7, 6, 6 cases respectively). T5 ran 44 OED iterations plus 6 post-discovery validation cases. Token usage and estimated cost are reported in Section I of the paper (not included in the provided excerpt). No random seed strategy or statistical significance testing is described.

Evaluation protocol: evaluation is entirely manual — the authors explicitly state 'All evaluation is manual because no automated CFD-paper rubric currently scores the workflows the system produces.' Tasks are scored on a rubric with three axes: Task Implementation Quality (TIQ), Scientific Research Quality (SRQ), and Open-Ended Ideation (OEI), rated S/P/W/X (Strong/Partial/Weak/Absent). The VLM gate ablation uses a controlled planted-failure design: four production-passed template cases × four failure categories = 16 planted failures + 4 clean controls, with deterministic ground-truth labels. The T5 quantitative metric is lower-wall Cf RMSE computed at 99 sample points against Krank et al. DNS. No statistical significance tests (p-values, confidence intervals, bootstrap) are reported anywhere. Cross-framework comparison on T1–T4 is artifact-based, inspecting archived case directories, solver logs, custom C++ libraries, figures, and reports under matched GPT-5.5 backbone and matched LLM cost.
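
The planted-failure design lends itself to a small harness. The sketch below builds the 4 × 4 matrix plus 4 clean controls and scores a detector per category. The template labels are hypothetical, and `stub_detector` is a stand-in that merely reproduces the reported per-category counts; it is not the VLM gate, and its zero false-positive behaviour on controls is an assumption.

```python
from itertools import product

CATEGORIES = [
    "missing_deliverable",
    "wrong_magnitude_metric",
    "broken_postprocessing",
    "convergence_not_settled",
]
TEMPLATES = ["template_A", "template_B", "template_C", "template_D"]  # hypothetical labels


def build_cases():
    """16 planted failures (template x category) plus 4 clean controls."""
    planted = [
        {"template": t, "category": c, "is_failure": True}
        for t, c in product(TEMPLATES, CATEGORIES)
    ]
    controls = [
        {"template": t, "category": "clean_control", "is_failure": False}
        for t in TEMPLATES
    ]
    return planted + controls


def score(cases, detector):
    """Return per-category (hits, total) for failures and the false-positive count."""
    recall = {}
    false_positives = 0
    for case in cases:
        flagged = detector(case)
        if case["is_failure"]:
            hit, total = recall.get(case["category"], (0, 0))
            recall[case["category"]] = (hit + int(flagged), total + 1)
        elif flagged:
            false_positives += 1
    return recall, false_positives


def stub_detector(case):
    """Stand-in detector that mimics the reported counts (2/4 on truncated runs)."""
    if not case["is_failure"]:
        return False
    if case["category"] == "convergence_not_settled":
        return case["template"] in {"template_A", "template_B"}
    return True


cases = build_cases()
recall, false_positives = score(cases, stub_detector)
for category, (hit, total) in recall.items():
    print(f"{category}: {hit}/{total}")
print("false positives on clean controls:", false_positives)
```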

Concrete end-to-end example (T5): the user provides a natural-language task ('Discover a novel SA modification that beats baseline SA on Cf for periodic hill at Reh=5600'), a starter OpenFOAM case, and the Krank et al. DNS Cf(x/h) reference. The knowledge-retrieval tool surveys published SA variants (SA-RC, SA-QCR, SA-noft2, SA-Edwards) to seed novelty constraints. Iter 001–026 explore four sink-based mechanism families; iter 003 introduces a reversal-gated near-wall sink that worsens RMSE to 0.004339 (REVISE). Iter 006 introduces a localized downstream-hill Gaussian sink near x/h~8.68 yielding RMSE 0.004262 (first improvement, -0.81%). Iters 027–034 introduce the quadrupolar runtime source structure (4 Gaussian patches: Grec, Gsink, Gsrc, Gtail) yielding RMSE 0.004050–0.004080. The mesh-independence gate is passed at <2% Cf difference. Iters 035–044 fine-tune coefficients; iter_044 (Crec=2.12, Csink=2.25, Csrc=1.20, Ctail=0.75) achieves RMSE 0.003958 (-7.89%). The model is delivered as a coded fvModels runtime block requiring no recompilation. The paper-writing agent then drafts a LaTeX manuscript grounded to the figure manifest and interpretation records from gate-passing cases only.
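
The scoring and checkpointing logic in this walkthrough can be sketched as follows. This is a simplified illustration under stated assumptions: it omits the compile, smoke-test, mesh-independence, and VLM steps that precede scoring, the function names are hypothetical, and only the three iteration scores quoted above are replayed.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-square error between two equally sampled profiles."""
    if len(predicted) != len(reference):
        raise ValueError("profiles must have the same number of sample points")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


def keep_best(
    baseline_score: float, scored: Iterable[Tuple[str, float]]
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Checkpoint an iteration only if it beats the best score so far."""
    best_label, best_score = "baseline", baseline_score
    checkpoints: List[Tuple[str, float]] = []
    for label, score in scored:
        if score < best_score:
            best_label, best_score = label, score
            checkpoints.append((label, score))
    return best_label, best_score, checkpoints


# Scores quoted in the walkthrough above (lower-wall Cf RMSE vs DNS).
reported = [("iter_003", 0.004339), ("iter_006", 0.004262), ("iter_044", 0.003958)]
best_label, best_score, checkpoints = keep_best(0.004297, reported)
print(best_label, best_score)
print(checkpoints)
```

With these inputs iter_003 is rejected because it is worse than the baseline, while iter_006 and iter_044 are each checkpointed in turn.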

Reproducibility: code, prompts, and run artifacts are released at https://github.com/csml-rpi/cfd-scientist. The GPT-5.5 backbone is a closed commercial API, introducing non-determinism. Frozen weights are not applicable. The DNS reference data source (Krank et al.) is cited but the specific data file provenance is not fully described in the provided excerpt.

Technical innovations

Vision-language physics-verification gate: a two-call VLM pipeline (readability filter → physics-consistency check) that inspects rendered PyVista/matplotlib PNGs of flow fields before accepting any CFD result, distinguishing physically valid runs from log-passing-but-physically-degenerate runs — absent from all prior AI-scientist and CFD-agent systems surveyed in Table 1.
Open-ended source-code discovery loop: an outer hypothesis search that autonomously proposes, compiles, smoke-tests, gates, and scores case-local C++ turbulence model modifications against a reference comparator without human intervention, extending prior CFD agents (e.g., Foam-Agent, turbulence.ai) which stop short of source-level hypothesis generation.
Structured artifact exchange protocol: agents pass typed JSON objects (study JSON, requirement paragraphs, source-edit plans, figure manifests, interpretation records) rather than free text, enabling P5 claim traceability — every manuscript claim is traceable to a specific gate-passing artifact, not to the model's prior knowledge.
Mesh-independence gate as a mandatory convergence check: a parameterized protocol (~10% near-wall, ~5% bulk refinement, 5% QoI threshold with Richardson/GCI escalation) that is enforced programmatically before any quantitative claim is accepted, rather than left to user discretion as in all surveyed baselines.
String-similarity novelty filter: a lightweight deduplication step that compares generated hypotheses against retrieved Semantic Scholar literature records and rejects near-duplicates before execution, preventing rediscovery of known SA variants (SA-RC, SA-QCR, etc.) — not present in any prior CFD-agent system.
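
Two of the simpler checks in the list above, the mesh-independence threshold and the string-similarity novelty filter, can be sketched in a few lines. Several details here are assumptions rather than the paper's choices: normalizing the difference by the refined-mesh value, using Python's standard `difflib.SequenceMatcher` as the similarity measure, and the 0.8 similarity cutoff. The Richardson/GCI escalation step is not shown.

```python
from difflib import SequenceMatcher
from typing import Iterable


def percent_difference(baseline_qoi: float, refined_qoi: float) -> float:
    """Percent change in a quantity of interest, relative to the refined mesh."""
    return 100.0 * abs(refined_qoi - baseline_qoi) / abs(refined_qoi)


def passes_mesh_gate(
    baseline_qoi: float, refined_qoi: float, threshold_pct: float = 5.0
) -> bool:
    """True when the QoI moves by less than the threshold under refinement."""
    return percent_difference(baseline_qoi, refined_qoi) < threshold_pct


def is_novel(
    hypothesis: str, known_titles: Iterable[str], max_similarity: float = 0.8
) -> bool:
    """Reject a hypothesis that is a near-duplicate of any retrieved record."""
    candidate = hypothesis.lower()
    return all(
        SequenceMatcher(None, candidate, title.lower()).ratio() < max_similarity
        for title in known_titles
    )


# Illustrative inputs only; these are not values from the paper.
print(passes_mesh_gate(0.0100, 0.0101))  # small change under refinement
print(passes_mesh_gate(0.0100, 0.0120))  # large change under refinement
known = ["Spalart-Allmaras with rotation correction"]
print(is_novel("Spalart-Allmaras with Rotation Correction", known))
```

A character-level similarity ratio is cheap but shallow: it catches renamed duplicates of a known variant and will not detect a paraphrased idea, which is a limit of this sketch rather than a statement about the paper's filter.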

Datasets

Krank et al. DNS periodic hill Cf(x/h) — 99 wall sample points, lower wall — non-public reference DNS data, cited as Krank et al., exact publication not specified in provided excerpt
Periodic hill OpenFOAM baseline case (Reh=5600 and Reh=10595) — starter case files provided by authors — internal
Backward-facing-step OpenFOAM baseline case (Reh=25400) — generated by Foam-Agent from scratch — internal
2D laminar jet OpenFOAM cases (Re=60–600, 7 cases, 35156-cell mesh) — generated by Foam-Agent from scratch — internal
Newtonian channel flow OpenFOAM baseline case — provided by authors — internal

Baselines vs proposed

Baseline SA (built-in OpenFOAM, T5 periodic hill Reh=5600): lower-wall Cf RMSE vs DNS = 0.004297 vs proposed quadRecTail: 0.003958 (-7.89%)
ARIS (T1 BFS): TIQ=Partial (3 closures, no mesh independence), SRQ=Weak (closure ranking without DNS validation) vs AI CFD Scientist: TIQ=Strong (4 closures, mesh-gate, VLM-triaged), SRQ=Partial
DeepScientist (T1 BFS): TIQ=Partial (3 closures, controlled comparison), SRQ=Weak (closure ranking without DNS validation) vs AI CFD Scientist: TIQ=Strong, SRQ=Partial
ARIS (T2 jet Re-sweep): reported St≈0.019 fit without grid-convergence or DNS check, SRQ=Weak vs AI CFD Scientist: marks f(Re) unresolved on missing metadata, SRQ=Partial
DeepScientist (T2 jet Re-sweep): reported St≈0.031 fit without validation, SRQ=Weak vs AI CFD Scientist: conservative unresolved verdict, SRQ=Partial
T3 custom viscosity Newtonian degeneracy check (n=1): centreline velocity within 0.5% of analytic 1.5 m/s target; centreline velocity range across n-sweep: 1.4542–1.5231 m/s (~3.8% variation)
T4 APG=0 control vs built-in SA: Umax=1.5959 m/s in both (match to 4 decimal places); APG sweep induced ~1.25% Umax sensitivity (1.5759–1.5959 m/s)
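
The T3 degeneracy target and the T4 sensitivity figure above can be checked with short arithmetic. For fully developed laminar flow of a power-law fluid between parallel plates, the centreline-to-bulk velocity ratio is (2n+1)/(n+1), which reduces to the Newtonian value 1.5 at n=1. The sketch below assumes a plane-channel geometry and a bulk velocity of 1 m/s (implied by the 1.5 m/s target but not stated here), and the simulated value is hypothetical.

```python
def centreline_to_bulk_ratio(n: float) -> float:
    """u_max / u_bulk for laminar plane-channel flow of a power-law fluid."""
    if n <= 0:
        raise ValueError("power-law exponent must be positive")
    return (2.0 * n + 1.0) / (n + 1.0)


def within_tolerance(simulated: float, target: float, tol_pct: float = 0.5) -> bool:
    """True when the simulated value is within tol_pct percent of the target."""
    return 100.0 * abs(simulated - target) / abs(target) <= tol_pct


bulk_velocity = 1.0  # m/s, assumption implied by the 1.5 m/s target
target = centreline_to_bulk_ratio(1.0) * bulk_velocity  # Newtonian limit
simulated = 1.496  # m/s, hypothetical solver output for illustration

print(f"target centreline velocity: {target} m/s")
print("degeneracy check passed:", within_tolerance(simulated, target))

# T4: Umax sensitivity across the APG sweep, from the quoted range.
umax_control, umax_min = 1.5959, 1.5759
sensitivity_pct = 100.0 * (umax_control - umax_min) / umax_control
print(f"APG sweep Umax sensitivity: {sensitivity_pct:.2f}%")  # 1.25%
```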

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.06607.

Fig 1: Architecture of AI CFD Scientist. A natural-language topic, optional base case, and

Fig 2 (page 2).

Fig 3 (page 2).

Fig 4 (page 2).

Fig 5 (page 2).

Fig 6 (page 2).

Fig 7 (page 2).

Fig 8 (page 2).

Limitations

All evaluation is manual with no automated scoring rubric; the paper explicitly acknowledges this, making quantitative cross-system comparisons dependent on subjective human judgment and potentially non-reproducible as LLM versions evolve.
The discovered quadRecTail model is tested only on the periodic hill at Reh=5600; cross-geometry transfer is explicitly flagged as untested (Section E/C), so generalization of the 7.89% Cf improvement to other flow configurations or Reynolds numbers is unknown.
The VLM gate fails on 2/4 convergence_not_settled cases because consistently edited endTime values make truncated simulations appear visually complete — a systematic blind spot for runs that are internally consistent but physically underresolved in time.
The GPT-5.5 backbone is a closed commercial API, introducing API-version-dependent non-determinism; the paper does not report results across multiple runs or seeds, so variance in outcomes (e.g., whether 44 iterations always discover the same model) is unknown.
No statistical significance testing is applied to any quantitative result; the 7.89% Cf RMSE improvement has no confidence interval, and it is unclear whether this improvement would be maintained under different mesh resolutions, boundary conditions, or reference DNS datasets.
The planted-failure ablation uses only 16 synthetic failures derived from 4 template cases, which may not cover the full distribution of silent failure modes encountered in real CFD workflows; the design matrix is small and the failure categories are author-defined.
Post-discovery validation of quadRecTail (6 cases) is noted as having 'wall-shear extraction recovery pending' (Section E), meaning the validation of the best-performing model is incomplete at submission time.

Open questions / follow-ons

How does the discovered quadRecTail SA correction generalize across different flow geometries (e.g., backward-facing step, channel flow, airfoil), Reynolds numbers, and other RANS closure families beyond Spalart-Allmaras? The paper leaves cross-geometry transfer explicitly untested.
Can the VLM physics-verification gate be made robust to the convergence_not_settled failure mode without requiring domain-specific temporal diagnostics, and what is the minimum VLM capability (model size, vision resolution) needed to maintain the 14/16 detection rate?
What is the variance in scientific outcomes (discovered model, RMSE improvement, number of iterations to convergence) across multiple independent runs of the open-ended discovery pathway under the same task specification, given the stochastic nature of LLM sampling?
Can the structured artifact exchange protocol and VLM gate be adapted to other high-fidelity physical simulation domains (e.g., FEA, molecular dynamics, climate models) where solver completion similarly does not imply physical validity, and what domain-specific gate modifications would be required?

Why it matters for bot defense

At first glance this paper has no direct relevance to CAPTCHA or bot-defense engineering. However, practitioners working on automated pipeline integrity — specifically the problem of distinguishing 'completed' from 'valid' in automated testing and evaluation loops — will recognize a structural analogy. The VLM physics-verification gate addresses the same class of problem as behavioral anomaly detection in bot pipelines: a process can satisfy all syntactic/log-level checks (solver convergence, no error codes, field files written) while being semantically invalid (wrong flow physics, degenerate output). The two-stage VLM approach (readability filter → semantic consistency check) is a transferable pattern for any evaluation loop where ground truth requires semantic interpretation of visual or structured output rather than simple pass/fail log parsing.

More concretely, the planted-failure ablation methodology (a controlled 4-category × N-case synthetic failure matrix with deterministic ground-truth labels, used to measure a detector's sensitivity) is directly applicable to evaluating CAPTCHA solvers, bot behavior classifiers, or any ML-based gate in a defense pipeline. The finding that the gate fails systematically on 'convergence_not_settled' failures, where internal consistency masks incompleteness, maps cleanly onto the class of bot behaviors that are individually valid but collectively incomplete or temporally truncated. Bot-defense engineers building automated evaluation harnesses for their own detection models would benefit from adopting this ablation design pattern, and from the system's explicit principle that a failing or ambiguous run is never promoted to a claim and that the metric is never patched to make such a run pass.

Cite

bibtex

@article{arxiv2605_06607,
  title={AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents},
  author={Nithin Somasekharan and Rabi Pathak and Manushri Dhanakoti and Tingwen Zhang and Ling Yue and Andy Zhu and Shaowu Pan},
  journal={arXiv preprint arXiv:2605.06607},
  year={2026},
  url={https://arxiv.org/abs/2605.06607}
}
