Can Coding Agents Reproduce Findings in Computational Materials Science?

Source: arXiv:2605.00803 · Published 2026-05-01 · By Ziyang Huang, Yi Cao, Ali K. Shargh, Jing Luo, Ruidong Mei, Mohd Zaki et al.

TL;DR

This paper introduces AUTOMAT, a new benchmark designed to evaluate the ability of large language model (LLM) based autonomous coding agents to reproduce scientific claims in computational materials science. Unlike typical software engineering benchmarks focused on explicit tasks and test suites, AUTOMAT poses the more complex challenge of recovering underspecified computational workflows from paper text or partial artifacts, navigating specialized scientific toolchains, and producing evidence supporting or undermining domain-specific claims. The benchmark comprises 85 expert-curated claims from recent materials science publications, each packaged with paper text, metadata, and when available, computational artifacts.

The authors evaluate five LLM-based coding agent settings: an orchestrated workflow using Claude Sonnet 4.6, off-the-shelf Claude Code agents (with Opus 4.6, Sonnet 4.6, and Kimi K2.5 backbones), and OpenAI Codex with GPT-5.4. Results show that current agents struggle to achieve reliable reproduction, with the best system achieving only a 54.1% success rate (overall score ≥4) and an average score of 3.52 on a 1-5 scale. Performance is worst in the from-paper reproduction setting that requires end-to-end workflow recovery solely from text, where success approaches zero. Agents fare better on from-artifact reproduction where executable code or data are provided, with success rates up to 77%. Even when outputs are already available, from-artifact interpretation success rates top out at 50%, indicating challenges in scientific evidence evaluation.

Detailed failure analyses reveal that the main obstacles are incomplete computational procedures, deviations from required methodologies, and fragility during execution rather than mere code generation errors. A specifically orchestrated agent improves scientific rigor scoring but does not substantially increase overall reproduction success, highlighting the complexity of long-horizon autonomous scientific workflows. The work positions AUTOMAT as a diagnostic tool and benchmark for scientific reproducibility with LLM-based agents and highlights current limitations in AI for scientific discovery.

Key findings

  • Best-performing agent (Claude Code with Opus) achieves a mean reproducibility score of 3.52/5 and a success rate (score ≥4) of 54.1% over 85 claims on AUTOMAT.
  • Codex with GPT-5.4 is the weakest performing system with mean score 2.44 and success rate 23.5%.
  • From-paper reproduction (recovering workflows solely from text) is the hardest setting, with near-zero success rates across all agents and mean scores ranging 1.5 to 2.2.
  • From-artifact reproduction (where scripts, codebases or pretrained models are provided) achieves higher success rates between 39% and 77%.
  • From-artifact interpretation (analyzing existing outputs) yields success rates from 33% to 50%, showing interpretation remains challenging.
  • The task-specific orchestrated agent scores statistically significantly higher (p < 0.05) only on scientific rigor but does not improve overall success compared to general-purpose coding agents.
  • Failure modes are dominated by procedural incompleteness (incomplete workflows), methodological deviations, and execution failures, rather than by code generation errors.
  • LLM-based evaluator calibrated against human experts achieves a quadratic-weighted kappa of 0.69 and within-1 difference accuracy of 0.80, supporting reliability of benchmark evaluations.
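
For clarity on how the headline metrics are defined, here is a minimal sketch that aggregates per-claim scores on the 1-5 scale into a mean score and a success rate (score ≥4); the scores below are invented for illustration, not the paper's data.

python
# Hypothetical per-claim reproducibility scores on the 1-5 scale.
scores = [5, 4, 2, 1, 3, 4, 5, 2, 3, 4]

mean_score = sum(scores) / len(scores)
# Success is defined as a score of at least 4.
success_rate = sum(s >= 4 for s in scores) / len(scores)

print(f"mean score: {mean_score:.2f}, success rate: {success_rate:.1%}")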

Methodology — deep read

  1. Threat Model & Assumptions: There is no adversary; this is not a security paper but an evaluation of AI agents’ capabilities. Each agent receives the paper text, plus optional artifacts and metadata, for a claimed computational materials science result. The agents have general coding and reasoning abilities but no ground-truth instructions beyond the provided materials. They attempt to autonomously reconstruct computational workflows, execute them in HPC-like environments, and produce evidence supporting or undermining the claim.

  2. Data: Provenance and Curation: AUTOMAT comprises 85 distinct claims curated by domain experts (senior PhD students and postdocs) from recent peer-reviewed computational materials science papers spanning areas like density functional theory, molecular dynamics, discrete dislocation dynamics, and ML modeling. Each claim centers on a quantitative numerical result critical to the paper’s conclusions. The data includes the claim text, full paper PDFs, metadata describing reproduction steps and requirements, and when available, computational artifacts such as codebases, scripts, pretrained models, or datasets. The claims are categorized into three reproduction types: from-paper (workflow recovered from text), from-artifact reproduction (code etc. provided), and from-artifact interpretation (analyzing existing outputs).

  3. Agent Architectures: Five agent settings are evaluated:

  • Orchestrated Agent (Orch./Sonnet): a benchmark-specific pipeline using Claude Sonnet 4.6 that separates the workflow into phases: planning, environment setup, deterministic execution with failure diagnosis, and result extraction with self-assessment. Designed to improve process auditability.
  • Claude Code with three backbones (Opus 4.6, Sonnet 4.6, and Kimi K2.5): CLI terminal agents offering free-form coding and debugging loops.
  • OpenAI Codex CLI with GPT-5.4.

  All agents read the provided materials, plan investigations, run shell commands in a controlled HPC environment (including Slurm scheduling and Singularity containers), inspect outputs, and return final textual reports.

  4. Training Regime: The agents are not trained for this task; they use off-the-shelf foundation LLMs with their publicly disclosed training pipelines. The orchestrated agent follows a programmed workflow with calls to the Claude Agent SDK.

  5. Evaluation Protocol: Each agent run generates a complete execution trace including terminal logs, files produced, intermediate output, and a final assessment report. A separate LLM evaluator agent, also based on Claude Sonnet, inspects all artifacts, the source claim, paper, and SME-annotated ground-truth reproduction steps (hidden from the reproduction agent) to assign a reproducibility score on a 5-point scale: 5=fully reproduced, 4=mostly reproduced, … 1=failed. Success requires a score ≥4. The evaluator also scores five subdimensions: methodological fidelity, execution competence, result accuracy, completeness, and scientific rigor, with justifications. The LLM evaluator can navigate artifact directories and run tools to mimic a human expert’s review process.

To validate the LLM evaluator’s reliability, subject-matter experts (SMEs) independently scored a stratified sample of 40 reproduction runs, blinded to the LLM’s judgments, achieving substantial inter-rater agreement (quadratic-weighted kappa 0.69).
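
As a sketch of how this calibration can be quantified, the snippet below computes a quadratic-weighted kappa (via scikit-learn) and a within-1 agreement rate between two sets of 1-5 scores; the score vectors are made up for illustration and are not the paper's annotations.

python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 scores for the same runs: one vector from SMEs, one from the LLM evaluator.
sme = np.array([5, 4, 2, 1, 3, 4, 5, 2, 3, 4])
llm = np.array([5, 3, 2, 2, 3, 4, 4, 2, 4, 4])

# Quadratic weighting penalizes large disagreements more heavily than near-misses.
qwk = cohen_kappa_score(sme, llm, weights="quadratic")

# "Within-1" accuracy: fraction of runs where the two scores differ by at most one point.
within_one = float(np.mean(np.abs(sme - llm) <= 1))

print(f"quadratic-weighted kappa: {qwk:.2f}, within-1 accuracy: {within_one:.2f}")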

  6. Reproducibility: AUTOMAT’s 85 claims, task packages, code for launching agents, and evaluation tools are publicly released on GitHub and as Hugging Face datasets, supporting benchmarking and analysis by others.
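
To make the packaging concrete, here is a minimal sketch of how one might load such a task package; the directory layout and file names (claim.txt, metadata.json, paper.pdf, artifacts/) are assumptions for illustration, not the released format.

python
import json
from pathlib import Path

def load_task_package(task_dir: str) -> dict:
    """Load one claim package: claim text, metadata, paper, and optional artifacts.

    The layout assumed here (claim.txt, metadata.json, paper.pdf, artifacts/)
    is hypothetical and used for illustration only.
    """
    root = Path(task_dir)
    artifacts_dir = root / "artifacts"
    return {
        "claim": (root / "claim.txt").read_text(),
        "metadata": json.loads((root / "metadata.json").read_text()),
        "paper_pdf": root / "paper.pdf",
        # Artifacts (codebases, scripts, pretrained models, datasets) are optional.
        "artifacts": sorted(p for p in artifacts_dir.rglob("*") if p.is_file())
                     if artifacts_dir.exists() else [],
    }

pkg = load_task_package("AUTOMAT-0030")
print(pkg["claim"][:200], "|", len(pkg["artifacts"]), "artifact files")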

Concrete example: For the claim on defect formation energy in CsPbBr3 (AUTOMAT-0030), the agent must create the specified supercells and defects, set up DFT calculations with Quantum ESPRESSO, run multiple computational stages, and extract energy metrics. The orchestrated agent separates this into phases (environment preparation, controlled execution with failure diagnosis, result extraction) and then checks outputs against the expected energies. The LLM evaluator compares the agent’s results and relevant logs to the ground-truth annotated reproduction steps to assign a final reproducibility score.
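
To give a flavor of the kind of quantity being reproduced, the sketch below evaluates a simplified neutral-vacancy formation energy from supercell total energies of the sort a Quantum ESPRESSO run would produce; the numbers and the stripped-down formula (neutral defect, single chemical-potential term, no charge or finite-size corrections) are illustrative assumptions, not the paper's values or workflow.

python
# Simplified neutral vacancy formation energy: E_f = E_defect - E_pristine + mu_removed.
# All energies below are placeholder values in eV, not results from the paper.

E_pristine = -1234.567   # total energy of the pristine CsPbBr3 supercell
E_defect   = -1228.901   # total energy of the same supercell with one Br vacancy
mu_Br      = -1.900      # chemical potential of the removed Br atom

E_f = E_defect - E_pristine + mu_Br
print(f"Br vacancy formation energy (illustrative): {E_f:.3f} eV")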

Overall, the methodology integrates subject-matter expert knowledge, the domain complexities of computational materials science, resource-controlled execution reflective of HPC workflows, a comparison of staged agent orchestration against free-form coding agents, and artifact-grounded evaluation, providing a rigorous benchmark environment.
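
The resource-controlled HPC execution mentioned above (Slurm scheduling plus Singularity containers) can be pictured with a sketch like the following; the container image, partition name, and command are placeholders, and this is not the benchmark's actual harness.

python
import subprocess

def submit_containerized_step(command: str, image: str = "qe.sif",
                              partition: str = "compute",
                              time_limit: str = "02:00:00") -> str:
    """Submit one workflow step to Slurm, running inside a Singularity container.

    The image name, partition, and time limit are hypothetical placeholders.
    """
    wrapped = f"singularity exec {image} {command}"
    result = subprocess.run(
        ["sbatch", "--partition", partition, "--time", time_limit, "--wrap", wrapped],
        capture_output=True, text=True, check=True,
    )
    # sbatch prints e.g. "Submitted batch job 123456"; return the trailing job id.
    return result.stdout.strip().split()[-1]

job_id = submit_containerized_step("pw.x -in scf.in > scf.out")
print("submitted Slurm job", job_id)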

Technical innovations

  • Introduction of AUTOMAT, the first benchmark targeting end-to-end reproduction of computational materials science claims requiring recovery and execution of underspecified workflows.
  • Novel packaging of SME-curated materials science claims into reproducible task units with paper, metadata, and optional artifacts reflecting real scientific reproducibility settings.
  • Use of LLM-based autonomous coding agents embedded in controlled HPC-like environments, enabling interactive investigation of complex workflows involving multiple tools.
  • Deployment of an artifact-grounded LLM evaluator agent capable of navigating filesystem outputs and applying domain-calibrated criteria to assign multidimensional reproducibility scores.
  • Comparative evaluation of task-specific orchestrated agents against general-purpose CLI coding agents to examine effects of workflow structuring on scientific reproducibility.
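
As a rough sketch of what the artifact-grounded evaluator's multidimensional output could look like as a data structure, the snippet below mirrors the five subdimensions and the 1-5 overall scale described in the methodology; the field names and report format are assumptions, not the benchmark's actual schema.

python
from dataclasses import dataclass

@dataclass
class EvaluationReport:
    """Hypothetical container for one claim's evaluation (field names assumed)."""
    claim_id: str
    overall: int                  # 1-5 reproducibility score; success means overall >= 4
    methodological_fidelity: int  # each subdimension is also scored on 1-5
    execution_competence: int
    result_accuracy: int
    completeness: int
    scientific_rigor: int
    justification: str = ""

    @property
    def success(self) -> bool:
        return self.overall >= 4

report = EvaluationReport(
    claim_id="AUTOMAT-0030", overall=4, methodological_fidelity=4,
    execution_competence=5, result_accuracy=4, completeness=3,
    scientific_rigor=4, justification="Illustrative placeholder only.",
)
print(report.success)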

Datasets

  • AUTOMAT: 85 expert-curated claims from recent peer-reviewed computational materials science publications, each packaged with the paper text, metadata describing reproduction steps and requirements, and, when available, computational artifacts (codebases, scripts, pretrained models, or datasets).

Baselines vs proposed

  • Claude Code with Opus 4.6: mean reproducibility score = 3.52, success rate = 54.1% vs Codex GPT-5.4: mean score = 2.44, success rate = 23.5%
  • Orchestrated Sonnet agent vs Claude Code Sonnet 4.6: no statistically significant overall success difference (win/tie/loss ratio 23%/45%/32%), but orchestrated agent scores higher on scientific rigor (p < 0.05).
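
The summary does not say which statistical test produced the p < 0.05 result for scientific rigor; a paired nonparametric test over per-claim subscores, such as the Wilcoxon signed-rank test, is one common choice and is sketched below with made-up scores purely for illustration.

python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-claim scientific-rigor scores (1-5) for the two agents being compared.
orchestrated = np.array([4, 5, 3, 4, 4, 5, 3, 4, 5, 4, 3, 4])
claude_code  = np.array([3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 2, 4])

# Paired, nonparametric comparison; zero differences (ties) are dropped by default.
stat, p_value = wilcoxon(orchestrated, claude_code)
print(f"Wilcoxon statistic: {stat}, p-value: {p_value:.3f}")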

Figures from the paper

Figures are reproduced from the source paper for academic discussion. Original copyright: the paper authors. See arXiv:2605.00803.

Fig 1: Overview of AUTOMAT. Claims from domain experts are packaged into runnable …

Fig 2: An example claim reproduction package from AUTOMAT.

Limitations

  • AUTOMAT focuses on a single scientific domain (computational materials science) and 85 claims, limiting generalizability across other scientific fields.
  • The benchmark emphasizes claims deemed reproducible; it does not evaluate agents’ ability to identify genuinely non-reproducible or fraudulent claims.
  • The LLM-based evaluator, while calibrated against SMEs, is an approximation and may not fully replace detailed human expert review.
  • Limited experimental budget restricted the breadth of model and agent architecture variations evaluated.
  • Execution failures reflect the fragility of long-horizon scientific workflows; hardware and software environment variations may further compound reproducibility challenges.
  • Recovery from paper text alone remains extremely challenging; current agents lack deeper scientific understanding or domain-specific contextual reasoning.

Open questions / follow-ons

  • Can incorporating explicit scientific domain knowledge or ontologies improve agents’ ability to recover underspecified workflows from paper text?
  • How can autonomous agents achieve greater robustness and fault tolerance during multi-stage scientific workflow execution in HPC environments?
  • What architectures or training paradigms enable agents to better evaluate and justify scientific claims beyond mere code execution?
  • How well do these findings generalize to other computational science fields such as computational biology or chemistry with distinct toolchains?

Why it matters for bot defense

From a bot-defense and CAPTCHA perspective, this work highlights key challenges in deploying autonomous coding agents for complex, real-world scientific tasks that require interpretation beyond code generation. It shows that current LLM-based agents remain error-prone when reconstructing complex workflows from underspecified descriptions, stressing that domain expertise and holistic validation remain necessary for trustworthiness. While software engineering benchmarks demonstrate strong coding abilities, the scientific reproducibility domain demands deeper understanding, fault-tolerance, and interpretability that generic agents currently lack.

Practitioners designing bot defenses or CAPTCHAs for automated scientific code execution or analysis pipelines should recognize these agents’ fragility and limited reliability on end-to-end reproduction tasks. Tests evaluating domain knowledge, procedural completeness, and scientific rigor verification could help detect failures or misuse. AUTOMAT offers a detailed diagnostic framework that can help researchers differentiate authentic scientific workflows from bot-driven but incomplete or incorrect attempts, informing more robust defense and validation mechanisms.

Cite

bibtex
@article{arxiv2605_00803,
  title={Can Coding Agents Reproduce Findings in Computational Materials Science?},
  author={Ziyang Huang and Yi Cao and Ali K. Shargh and Jing Luo and Ruidong Mei and Mohd Zaki and Zhan Liu and Wyatt Bunstine and William Jurayj and Somdatta Goswami and Tyrel McQueen and Michael Shields and Jaafar El-Awady and Paulette Clancy and Benjamin Van Durme and Nicholas Andrews and William Walden and Daniel Khashabi},
  journal={arXiv preprint arXiv:2605.00803},
  year={2026},
  url={https://arxiv.org/abs/2605.00803}
}

Read the full paper

Articles are CC BY 4.0 — feel free to quote with attribution