Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models

Source: arXiv:2605.15131 · Published 2026-05-14 · By Frederik Schmitt, Matthias Cosler, Niklas Metzger, Julian Siber, Vladimir Krsmanovic, Mohamed Ghanem et al.

TL;DR

This paper addresses the longstanding challenge of reactive synthesis: automatically creating hardware circuit designs from formal logical specifications. Traditional methods are held back by the problem's extreme computational complexity and the difficulty of writing formal temporal logic specifications by hand. The authors present a neuro-symbolic approach that leverages Large Reasoning Models (LRMs) combined with symbolic model checkers to iteratively generate and repair Verilog hardware modules based on formal Temporal Logic Synthesis Format (TLSF) specifications. This counterexample-guided loop improves LRM performance significantly, enabling them to solve more benchmarks than state-of-the-art symbolic synthesis tools on the SYNTCOMP 2025 competition dataset.

Beyond classical reactive synthesis, the authors extend their method to the parameterized synthesis problem, where designs must be correct for arbitrary parameter values—a problem that is formally undecidable. Despite this, the approach produces reusable hardware templates performing almost as well as fixed-parameter designs. The authors also address the usability bottleneck of requiring formal specifications by introducing an autoformalization step, translating natural language specifications into formal TLSF. They evaluate on a new hand-authored dataset of natural-language specs and demonstrate comparable performance synthesizing correct hardware directly from natural language. Overall, the work establishes a practical, end-to-end natural synthesis workflow that outperforms prior algorithmic and neuro-symbolic reactive synthesis methods.

Key findings

The proposed counterexample-guided LRM approach (CEX-LRM) solves 92% of SYNTCOMP 2025 benchmarks, exceeding the best symbolic tool ltlsynt's 82% success rate (Table 1).
Including model checker counterexamples as feedback boosts the number of solved instances by 162 for Gemini 3.1 Pro (from 1193 to 1355) and by 75 for GPT-5.5 (from 1392 to 1467).
LRMs achieve roughly equal success on realizable and unrealizable specifications, solving 640 realizable and 752 unrealizable specs respectively out of 1392 solved (Sec 3.3).
Performance scales roughly log-linearly with reasoning token budgets in LRMs, with higher budgets yielding more solved instances but diminishing returns (Fig 3).
On the undecidable parameterized synthesis task, Gemini 3.1 Pro solves 38.3 ± 1.1 / 57 benchmarks and GPT-5.5 solves 35.7 ± 2.5 / 57, showing moderate degradation versus fixed-parameter synthesis (Table 2).
The natural language autoformalization pipeline synthesizes Verilog that satisfies the ground-truth TLSF spec for 30.0/57 benchmarks, while direct synthesis from natural language solves 31.7/57 (Table 3).
Autoformalized specifications exhibit 9.3/57 equivalences to ground truth and 8.7/57 provably inequivalent; syntactic errors do not always prevent correct Verilog synthesis (Sec 5.2).
The approach produces parameterized Verilog modules and handles underspecified or ambiguous natural language by generating formal specs allowing for verification and defect localization.

Threat model

n/a — The work focuses on formal synthesis and verification under no explicit security adversary model; challenges come from computational complexity and specification ambiguity rather than malicious actors.

Methodology — deep read

Threat Model & Assumptions: The adversary is not explicitly defined since this is not a security paper. The synthesizer's aim is to generate hardware modules meeting temporal logic specs. The challenge comes from computational hardness (2-EXPTIME-completeness) and potential ambiguity of natural language input. The method assumes access to a large reasoning model capable of generating Verilog code given a TLSF spec or natural language description, and a symbolic model checker to verify implementations or provide counterexamples. Adversaries do not influence the input specs or model checker.
Data: The main dataset is the SYNTCOMP 2025 reactive synthesis benchmark suite with 1586 TLSF specifications, where 458 are realizable, 419 unrealizable, and the remainder unknown. The dataset is stripped of comments and names to avoid hints. For parameterized synthesis, 57 parameterized specs are derived by grouping variants of specs with varying parameter values. For natural language synthesis, the authors manually authored natural language descriptions for each parameterized specification (the NATURAL dataset). Splits follow standard SYNTCOMP setup.

Preprocessing involves removing comments. For evaluation of natural language inputs, specifications are manually annotated to align with TLSF ground truth.

Architecture/Algorithm: The approach uses Large Reasoning Models (LRMs)—specifically Gemini 3.1 Pro and GPT-5.5—prompted to output Verilog modules from TLSF specifications or natural language. Prompts include TLSF spec structure and constraints on Verilog format. Instead of reducing TLSF to automata or parity games, the full TLSF spec is fed to the LRM, preserving structure to utilize context. Generated Verilog is verified using symbolic model checkers (nuXmv with IC3 algorithm).

If verification fails, counterexamples (error traces) are fed back to the LRM as input with instructions to repair the code (counterexample-guided synthesis). This guess-and-check loop is iterated multiple times to refine output. For parameterized synthesis, the LRM is prompted to generate parameterized Verilog modules with parameter declarations. Verification is undecidable, so they verify instantiated modules for multiple parameter values, feeding back counterexamples for failing instances.

For natural language specs, two approaches are used: (1) autoformalization—LRM translates NL to TLSF spec with syntax repair loops, then synthesis proceeds as above; (2) direct synthesis—LRM produces Verilog directly from NL. Verification is performed against both ground truth and autoformalized TLSF to measure semantic shifts.

Training Regime: Not explicitly detailed, presumably inference only using commercial or pre-trained LRMs (Gemini 3.1 Pro, GPT-5.5). Experiments use highest reasoning token budgets (e.g. HIGH, XHIGH) to maximize reasoning context windows.
Evaluation Protocol: Metrics: number of benchmarks solved (correct Verilog satisfying TLSF spec) out of total benchmarks. Baselines: state-of-the-art symbolic synthesis tools ltlsynt (1297 solved) and SemML (1295 solved). Ablations include reasoning token budget variations and effect of counterexample-guided feedback. Parameterized synthesis results averaged over 3 runs. Verification timeouts and memory limits follow SYNTCOMP standards (60 min, 160GB memory). Verification used nuXmv with timeout 10 minutes per instance. For natural language, synthesis correctness is evaluated by verifying against ground truth and autoformalized specs, while also checking TLSF syntax correctness and equivalence using Spot ltlfilt tool.
Reproducibility: Code and prompt details are provided in appendices, including full Verilog generation prompts and symbolic verification scripts. Dataset from SYNTCOMP is publicly available; natural language dataset is newly authored and presumably released. The specific weights of Gemini 3.1 Pro and GPT-5.5 are commercial models without frozen checkpoints discussed.

Concrete Example (End-to-End): Given a TLSF spec for a 27-client completion detector, the LRM is prompted to generate Verilog code implementing the system. The generated Verilog is translated into AIGER format and model checked with nuXmv against the TLSF spec. If verification fails, a counterexample trace illustrating which input-output sequence caused violation is extracted and appended as feedback to the next LRM prompt, which attempts a repair. After 1-2 iterations, the module is accepted as correct if verification passes. This loop enables LRM to refine code based on symbolic sound feedback rather than trial-and-error.

For natural language input, the LRM either produces TLSF specifications autoformalized from the NL text, or directly generates Verilog code, then the same verification and repair loops apply.

Technical innovations

Counterexample-guided neuro-symbolic synthesis loop coupling Large Reasoning Models with symbolic model checkers for iterative hardware design repair.
Direct synthesis of parameterized hardware designs from parameterized temporal logic specifications despite undecidability, verified by testing multiple parameter instantiations.
End-to-end natural synthesis workflow autoformalizing natural language system specifications into temporal logic, enabling hardware synthesis from informal descriptions.
Preserving TLSF and Verilog's high-level, programmatic structure as input/output modalities to LRMs instead of flattening into low-level automata or gate-level circuits.

Datasets

SYNTCOMP 2025 LTL synthesis track benchmarks — 1586 TLSF specifications — publicly available
NATURAL dataset — 57 manual natural language descriptions paired with parameterized TLSF specs — newly authored by authors

Baselines vs proposed

ltlsynt: 1297 / 1586 instances solved vs GPT-5.5 (CEX-LRM): 1467 / 1586
SemML: 1295 / 1586 instances solved vs GPT-5.5 (CEX-LRM): 1467 / 1586
Gemini 3.1 Pro no counterexample feedback: 1193 / 1586 vs with feedback: 1355 / 1586
Parameter synthesis: no baseline vs Gemini 3.1 Pro + CE-Guidance 38.3 ± 1.1 / 57 solved
Natural synthesis via autoformalization: 30.0 / 57 solved vs direct NL synthesis: 31.7 / 57 solved

Limitations

Verification of parameterized synthesis remains incomplete and undecidable; correctness is approximated via parameter instantiation testing only.
Autoformalization step often produces syntactically incorrect or semantically shifted TLSF specifications (only 9.3 out of 57 correct equivalences proven).
LRM synthesis quality depends heavily on reasoning token budgets, indicating scalability and cost concerns.
Evaluation limited to standardized benchmarks and manually authored natural language specs; robustness in broader industrial settings unknown.
No adversarial evaluation or attack simulation on specification or synthesis robustness presented.
Closed-source commercial LRMs (Gemini 3.1 Pro, GPT-5.5) used, limiting exact reproducibility.

Open questions / follow-ons

How to fully guarantee correctness of parameterized hardware synthesis despite formal undecidability?
Can autoformalization accuracy and robustness be improved to reduce semantic shifts and syntax errors?
What is the impact of diverse natural language styles and ambiguities on synthesis correctness?
How well does this approach scale to industrial-scale hardware designs beyond SYNTCOMP benchmarks?

Why it matters for bot defense

Although this work focuses on hardware circuit synthesis, its key insight of combining large reasoning models with sound symbolic feedback to iteratively repair outputs is broadly applicable to bot-defense and CAPTCHA domains. For CAPTCHA designers, integrating feedback loops from symbolic or rule-based verification into large AI models can improve reliability and robustness of generated challenges or defenses. Further, the demonstrated ability to synthesize complex structured artifacts directly from natural language specifications suggests potential for generating adaptive CAPTCHA designs from high-level policies. Finally, extending neuro-symbolic synthesis to parameterized or configurable problem families can inform scalable bot defense workflows. However, direct application requires adaptation to the different modality and threat landscapes of CAPTCHA and bot defense systems.

Cite

bibtex

@article{arxiv2605_15131,
  title={ Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models },
  author={ Frederik Schmitt and Matthias Cosler and Niklas Metzger and Julian Siber and Vladimir Krsmanovic and Mohamed Ghanem and Bernd Finkbeiner },
  journal={arXiv preprint arXiv:2605.15131},
  year={ 2026 },
  url={https://arxiv.org/abs/2605.15131}
}

Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models ​

TL;DR ​

Key findings ​

Threat model ​

Methodology — deep read ​

Technical innovations ​

Datasets ​

Baselines vs proposed ​

Limitations ​

Open questions / follow-ons ​

Why it matters for bot defense ​

Cite ​

Read the full paper ​