Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

Source: arXiv:2605.06445 · Published 2026-05-07 · By Francesco Dente, Dario Satriani, Paolo Papotti

TL;DR

This paper investigates a phenomenon the authors term 'constraint decay': the systematic degradation in LLM agent performance on backend code generation as explicit non-functional structural requirements accumulate. The core observation is that while current frontier agents handle loosely-specified greenfield backend generation reasonably well (85%+ assertion pass rates at baseline), their performance collapses sharply when required to simultaneously satisfy architectural patterns (Clean Architecture), specific database backends (PostgreSQL/SQLite), and ORM frameworks (SQLAlchemy/Sequelize) — with capable configurations losing an average of 30 percentage points in assertion pass rate from L0 to L3. Critically, this is not about functional logic complexity; the API contract (RealWorld Conduit, 19 CRUD endpoints) is held constant across all conditions, isolating structural constraint density as the sole independent variable.

The study evaluates 7 models across 2 agent scaffolds (Mini-SWE-Agent and OpenHands) on 80 greenfield generation tasks and 20 feature-implementation tasks spanning 8 web frameworks across Python 3.12 and Node.js 20. A dual evaluation pipeline combines end-to-end behavioral HTTP testing (291 assertions over 32 requests) with orthogonal static verifiers that check architectural layer compliance, database usage, and ORM adherence independently of functional test outcomes. This design cleanly separates 'does it work' from 'does it follow the rules,' enabling fair cross-framework and cross-constraint comparison.

Key findings reveal that database constraints (PostgreSQL in particular, −19.3 pp marginal effect) drive the majority of performance decline, while ORM constraints add surprisingly little marginal cost once a database is already specified. Framework choice is a major sensitivity axis: lightweight, explicit frameworks (Express, Koa, Flask, avg ~50% A%) dramatically outperform convention-heavy ones (FastAPI, Django, Hono, avg ~18–25% A%). Root cause analysis on 222 failed runs identifies data-layer defects — incorrect query composition and ORM runtime violations — as the dominant failure mode (~45% of logic errors), followed by auth misconfiguration and framework idiosyncrasies.

Key findings

Capable agent configurations (L0 A% > 50%) lose an average of 30 percentage points in assertion pass rate from baseline (L0, framework-only) to fully constrained (L3, framework + Clean Architecture + PostgreSQL + ORM), representing a ~40% relative loss of baseline performance across 8 configurations.
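The worst-case configuration (OpenHands + Qwen3-Coder-Next) drops 45.5 pp from L0 (73.0% A%) to L3 (27.6% A%), while the most resilient in relative terms (OpenHands + MiniMax-M2.5) drops only 17 pp (95.6% to 78.6% A%). OpenHands + GPT-5-mini shows a smaller absolute drop (13.6 pp, 65.8% to 52.2% A%), but from a much lower baseline.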
Database backend specification is the single most impactful constraint: PostgreSQL imposes a marginal penalty of −19.3 ± 2.5 pp and SQLite −14.3 ± 2.5 pp, compared to Clean Architecture at −9.1 ± 1.6 pp. ORM constraints (SQLAlchemy: −1.5 ± 2.1 pp, Sequelize: −0.6 ± 2.2 pp) show near-zero marginal effect once a database is already required.
Framework choice creates a 25–32 point average A% spread: Express (51.4%), Koa (50.7%), and Flask (49.3%) form a top tier, while FastAPI (24.2%), Django (25.4%), and Hono (18.5%) trail significantly — all evaluated on identical API contracts and test suites.
Logic errors dominate failures for both analyzed models: ~71% of failed runs for Qwen3-Coder-Next (137/194) and MiniMax-M2.5 (20/28) are cases where the server starts correctly but behaves incorrectly. Data-layer defects account for ~45% of these logic errors; for Qwen3-Coder-Next, incorrect query logic (25.5%) and DB/ORM runtime errors (21.2%) together make up 46.7%.
The gap between A% and pass@1 is severe and consistent: the strongest L3 configuration (OpenHands + MiniMax-M2.5) achieves 78.6% A% but only 8.3% pass@1, confirming agents lack the cross-file consistency needed for deployment without manual intervention.
Constraint decay persists in feature-implementation tasks (20 tasks ablating features from real community RealWorld repositories at implicit L3 conditions): only GPT-5.2 reaches 50% pass@1, ruling out the possibility that decay is an artifact of the synthetic greenfield setting.
The 16-task subset used for cost-constrained models (MiniMax-M2.5, Kimi-K2.5, GPT-5.2) is validated to be representative of the full 80-task set: Pearson r = 0.98, Spearman ρ = 0.95 (p < 10^−12, N = 24 paired observations) between full-set and subset A% scores.
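
The gap between A% and pass@1 reported above follows partly from how the two metrics aggregate: A% gives partial credit per assertion, while pass@1 credits a trial only if everything passes. The sketch below illustrates this with the standard unbiased pass@k formula from Chen et al. (2021). The trial counts are made up for illustration, and the helper names are hypothetical; in the paper, a passing trial must also satisfy the static verifiers, which this sketch does not model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n = number of trials, c = number of fully passing trials.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def assert_pct(passed_per_trial: list[int], total: int = 291) -> float:
    """Mean percentage of assertions passed across the trials of one task."""
    return 100.0 * sum(passed_per_trial) / (total * len(passed_per_trial))


# Hypothetical task: three trials that each pass most of the 291 assertions,
# but none passes all of them.
trials = [280, 285, 290]
a_pct = assert_pct(trials)
full_passes = sum(1 for passed in trials if passed == 291)
p1 = pass_at_k(n=len(trials), c=full_passes, k=1)

print(f"A% = {a_pct:.1f}, pass@1 = {p1:.2f}")
```

A task like this contributes almost 98% to A% and nothing to pass@1, which is the pattern behind a high A% paired with a single-digit pass@1.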

Threat model

n/a — this is a software engineering capability evaluation paper, not a security paper. The study does not model a human or automated adversary. The 'challenge' posed to agents is structural constraint adherence imposed by a hypothetical production engineering team requiring Clean Architecture, specific database backends, and ORM frameworks. No adversarial manipulation of prompts, environments, or evaluation pipeline is considered.

Methodology — deep read

There is no adversarial framing: the 'adversary' is the structural constraint itself, and the study asks whether agents can satisfy non-functional requirements that a production engineering team would impose. The key confound being controlled is functional task complexity. The authors deliberately hold the API contract constant (RealWorld Conduit OpenAPI 3.0 spec, 19 CRUD endpoints across 5 resource groups) so that any performance change is attributable solely to structural constraint accumulation rather than semantic task difficulty. They explicitly assume that the Conduit spec is well represented in LLM pre-training data, which minimizes functional-complexity confounds.

The task design defines four orthogonal constraint axes: (1) Web Framework — always active, 8 levels: Flask, FastAPI, Django, aiohttp (Python 3.12), Express, Fastify, Hono, Koa (Node.js 20); (2) Architectural Pattern — Clean Architecture with strict four-layer directory separation (routes/handlers, services/use cases, models/entities, repository/data access); (3) Database Backend — PostgreSQL 16 (containerized, pre-provisioned) or SQLite; (4) ORM Integration — SQLAlchemy (Python) or Sequelize (Node.js). These are composed into 4 constraint levels (L0–L3) yielding 80 greenfield generation tasks total (Table 1). Each agent receives a prompt containing the OpenAPI spec, active constraints, mandatory files list, and evaluation pipeline description. Tasks are executed in isolated Docker containers with ephemeral PostgreSQL instances to prevent state contamination.
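
One way to see how the axes above could compose into 80 tasks is to enumerate the combinations. The sketch below is a reconstruction, not the paper's Table 1: it assumes an ORM constraint is only ever specified together with a database, and that a task's level equals its number of active non-framework constraints. Under those assumptions the count comes out to 10 variants per framework. The field names are hypothetical.

```python
from collections import Counter
from itertools import product

FRAMEWORKS = ["Flask", "FastAPI", "Django", "aiohttp",
              "Express", "Fastify", "Hono", "Koa"]
PYTHON_FRAMEWORKS = {"Flask", "FastAPI", "Django", "aiohttp"}


def constraint_variants():
    """Yield (clean_arch, db, orm) combinations.

    Assumption (not confirmed by the source): an ORM is only required
    when a database backend is also required.
    """
    for arch, db, orm in product([False, True],
                                 [None, "PostgreSQL", "SQLite"],
                                 [False, True]):
        if orm and db is None:
            continue
        yield arch, db, orm


def level(arch: bool, db, orm: bool) -> int:
    """Assumed level = number of active non-framework constraints."""
    return int(arch) + int(db is not None) + int(orm)


tasks = []
for fw in FRAMEWORKS:
    for arch, db, orm in constraint_variants():
        orm_name = None
        if orm:
            # ORM choice follows the runtime, as described in the text.
            orm_name = "SQLAlchemy" if fw in PYTHON_FRAMEWORKS else "Sequelize"
        tasks.append({"framework": fw, "clean_arch": arch, "db": db,
                      "orm": orm_name, "level": level(arch, db, orm)})

per_level = Counter(t["level"] for t in tasks)
print(len(tasks), dict(sorted(per_level.items())))
```

This enumeration matches the stated totals (8 frameworks × 10 variants = 80 tasks), but the paper's own task table remains the authoritative definition of each level.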

Evaluation uses two orthogonal signals. Behavioral testing: a shared HTTP test suite of 32 requests covering all 19 API endpoints, producing 291 assertions verifying response structure, types, status codes, and stateful CRUD sequences. Because tests target API behavior rather than implementation internals, they are valid regardless of how the agent organizes its code. Static verifiers: three independent verifier functions check (a) Clean Architecture layer compliance and import dependency direction; (b) database engine usage via source scanning while confirming no alternative engines are used; (c) ORM usage vs. raw SQL fallback. Task success requires both full behavioral validity AND all applicable verifiers passing. The authors validate that verifier enforcement contributes minimally to the decay signal: A% changes by at most 2.7 pp when verifiers are disabled, and the L0→L3 drop changes only from 28 to 30 pp (Appendix D).
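
To make the first verifier concrete, here is a minimal sketch of an import-direction check built on Python's standard `ast` module. It is not the authors' verifier: the layer directory names and the `ALLOWED` dependency map are assumptions chosen for illustration, and the sketch only handles absolute imports in Python sources.

```python
import ast

# Hypothetical layer names and allowed dependencies (importer -> importable).
# The paper's actual directory names and rules may differ.
ALLOWED = {
    "routes": {"services"},
    "services": {"repository", "models"},
    "repository": {"models"},
    "models": set(),
}


def imported_layers(source: str) -> set[str]:
    """Return the layer packages that a source file imports (absolute imports only)."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue  # relative imports and non-import nodes are ignored here
        for name in names:
            top = name.split(".")[0]
            if top in ALLOWED:
                found.add(top)
    return found


def violations(files: dict[str, str]) -> list[tuple[str, str]]:
    """List (path, imported_layer) pairs that break the allowed direction."""
    out = []
    for path, source in files.items():
        layer = path.split("/")[0]
        if layer not in ALLOWED:
            continue
        for target in sorted(imported_layers(source) - {layer}):
            if target not in ALLOWED[layer]:
                out.append((path, target))
    return out


files = {
    "routes/articles.py": "from services.articles import list_articles\n",
    "services/articles.py": ("from repository.articles import ArticleRepo\n"
                             "from models.article import Article\n"),
    "models/article.py": "import routes.articles\n",  # inner layer importing outer
}
print(violations(files))
```

Because a check like this reads only source text, it can run whether or not the server starts, which is what keeps it independent of the behavioral test outcome.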

The experimental matrix covers 7 models across 4 capability tiers: small open agentic (Devstral-Small 24B, Qwen3-Coder-Next 80B), large open instruct (Qwen3-235B-A22B, 22B active), large open agentic (MiniMax-M2.5, Kimi-K2.5), and closed frontier (GPT-5-mini, GPT-5.2). Two agent scaffolds are used: Mini-SWE-Agent (~100 lines Python, bash-only interface, up to 300 iterations) and OpenHands (full-featured with file editing, terminal, code search tools, up to 200 iterations). Both scaffolds received prompt refinements to align them with generative pipelines and prevent premature halting. Cost constraints led to restricting Kimi-K2.5 and GPT-5.2 to Mini-SWE-Agent only, and MiniMax-M2.5/Kimi-K2.5/GPT-5.2 to a 16-task representative subset (2 frameworks per runtime × 4 constraint levels). Total usage was approximately 5 billion tokens across the study. Each task is run for n=3 independent trials per configuration.

Primary metrics are Assert% (A%): mean fraction of 291 assertions passed, averaged across trials and tasks for a given configuration — chosen over pass@1 because the all-or-nothing nature of pass@1 amplifies noise when 291 assertions must all pass — and pass@1 using the unbiased Chen et al. (2021) estimator. The marginal effect of each constraint is isolated via a matched-pair design: for constraint c, all task pairs differing only by c (framework and all other constraints held constant) are identified, ΔA% = A%_with_c − A%_without_c is computed per pair, then averaged across all pairs and model-agent configurations. This yields the values in Table 3a with standard errors of the mean.
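
The matched-pair procedure described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' analysis script: scores are keyed by a hypothetical `(framework, active_constraints)` tuple, the A% values are invented, and a real run would also include the model-agent configuration in the key so that pairs are averaged across configurations.

```python
from math import sqrt
from statistics import mean, stdev


def marginal_effect(scores: dict[tuple[str, frozenset[str]], float],
                    constraint: str) -> tuple[float, float, int]:
    """Mean paired delta A%_with_c - A%_without_c, its standard error, and pair count."""
    deltas = []
    for (framework, active), a_with in scores.items():
        if constraint not in active:
            continue
        # The matched partner differs only by the constraint of interest.
        partner = (framework, active - {constraint})
        if partner in scores:
            deltas.append(a_with - scores[partner])
    sem = stdev(deltas) / sqrt(len(deltas)) if len(deltas) > 1 else float("nan")
    return mean(deltas), sem, len(deltas)


# Invented A% values, for illustration only.
scores = {
    ("Flask", frozenset()): 80.0,
    ("Flask", frozenset({"postgres"})): 60.0,
    ("Flask", frozenset({"clean_arch"})): 72.0,
    ("Flask", frozenset({"clean_arch", "postgres"})): 50.0,
    ("Express", frozenset()): 90.0,
    ("Express", frozenset({"postgres"})): 75.0,
}

effect, sem, n_pairs = marginal_effect(scores, "postgres")
print(f"postgres: {effect:+.1f} pp over {n_pairs} pairs (SEM {sem:.2f})")
```

Pairing within the same framework and the same remaining constraints is what lets the estimate attribute the change to one constraint rather than to the total number of constraints.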

Root cause analysis on 222 failed runs (Qwen3-Coder-Next: 194/240, MiniMax-M2.5: 28/48, both Mini-SWE-Agent) used GPT-5.2 (temperature 0) as a judge, classifying failures into a two-level taxonomy based on the last 20 trajectory turns, behavioral test results, server logs, and static verifier outputs. Interrater reliability was validated on a stratified sample of 50 Qwen3-Coder-Next logic errors against manual labels, yielding Cohen's κ = 0.975. The 16-task subset representativeness was validated by comparing full-set (80 tasks) vs. subset A% across 24 paired observations (6 model-agent configs × 4 constraint levels), achieving Pearson r = 0.98 and Spearman ρ = 0.95 (p < 10^−12). Devstral-Small was excluded from main analysis due to near-zero Assert% at L0 (reported in Table 8, appendix). Code, tasks, agent trajectories, and analysis scripts are fully open-sourced at the anonymous repository link provided.
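
For readers unfamiliar with the agreement statistic used to validate the judge, the following is a minimal sketch of Cohen's κ for two label sequences. The category names and labels are hypothetical and far smaller than the 50-item sample described above; the formula itself is the standard one, κ = (p_o − p_e) / (1 − p_e).

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[lab] * count_b[lab] for lab in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels: LLM judge vs. manual annotation on six failed runs.
judge = ["query", "query", "orm", "auth", "auth", "orm"]
manual = ["query", "query", "orm", "auth", "orm", "orm"]
print(round(cohens_kappa(judge, manual), 3))
```

In this toy case one disagreement out of six items already lowers κ to 0.75, which gives a sense of how close to perfect agreement a value of 0.975 is.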

Technical innovations

A constraint-layering evaluation protocol (L0–L3) that holds the functional API contract constant via OpenAPI 3.0 while systematically varying four orthogonal non-functional dimensions, enabling isolation of structural constraint density as the sole independent variable — prior benchmarks (SWE-Bench, AppBench, Vero et al. 2024) either target issue resolution or unconstrained generation without this factorial design.
Orthogonal static verifier functions that independently assess architectural layer compliance, database engine usage, and ORM adherence, decoupled from behavioral tests — prior code generation benchmarks conflate structural and functional correctness or ignore structural compliance entirely.
A matched-pair marginal effect estimator that computes per-constraint A% deltas by identifying task pairs differing only along a single constraint dimension, enabling attribution of performance loss to individual constraint types rather than total constraint count.
A two-level failure taxonomy (6 coarse + 6 logic-error subcategories) validated at Cohen's κ = 0.975 using an LLM judge on agent trajectory + server logs + test outputs, providing systematic root-cause attribution at scale (222 failed runs) beyond anecdotal error inspection.
A feature-implementation task suite of 20 tasks derived from real community RealWorld repository implementations (164–2,604 lines, 2–46 files per task) that tests constraint-adherent code modification as a sanity check against greenfield-setting artifacts — no prior constrained backend benchmark includes this cross-validation.

Datasets

RealWorld Conduit OpenAPI 3.0 specification — 19 CRUD endpoints across 5 resource groups — open-source (realworld-docs.netlify.app), used as fixed API contract
Constraint Decay Benchmark (greenfield tasks) — 80 tasks (8 frameworks × 10 constraint variants) — author-constructed, open-sourced at anonymous.4open.science/r/constraint-decay
Constraint Decay Benchmark (feature-implementation tasks) — 20 tasks derived from community RealWorld implementations — author-constructed from open-source RealWorld repos, open-sourced
Agent trajectory logs — 240 runs (Qwen3-Coder-Next full set) + 144 runs (MiniMax-M2.5, Kimi-K2.5, GPT-5.2 subsets) + additional OpenHands runs — author-collected, open-sourced

Baselines vs proposed

Mini-SWE-Agent + GPT-5-mini: A% L0=51.7%, L3=23.7%, pass@1 L3=4.2%
OpenHands + GPT-5-mini: A% L0=65.8%, L3=52.2%, pass@1 L3=33.3%
Mini-SWE-Agent + Qwen3-Coder-Next: A% L0=86.4%, L3=46.1%, pass@1 L3=6.2%
OpenHands + Qwen3-Coder-Next: A% L0=73.0%, L3=27.6%, pass@1 L3=4.2%
Mini-SWE-Agent + Qwen3-235B-A22B: A% L0=29.6%, L3=2.3%, pass@1 L3=0.0%
OpenHands + Qwen3-235B-A22B: A% L0=26.2%, L3=0.8%, pass@1 L3=0.0%
Mini-SWE-Agent + MiniMax-M2.5 (subset): A% L0=88.6%, L3=58.3%, pass@1 L3=25.0%
OpenHands + MiniMax-M2.5 (subset): A% L0=95.6%, L3=78.6%, pass@1 L3=8.3%
Mini-SWE-Agent + Kimi-K2.5 (subset): A% L0=85.4%, L3=53.7%, pass@1 L3=33.3%
Mini-SWE-Agent + GPT-5.2 (subset): A% L0=78.2%, L3=48.0%, pass@1 L3=25.0%
Framework Express (avg across models/scaffolds/levels): A%=51.4% vs. FastAPI: A%=24.2% vs. Hono: A%=18.5%
Feature implementation pass@1: GPT-5.2 (Mini-SWE)=50.0%, GPT-5-mini (Mini-SWE)=15.0%, Qwen3-Coder-Next (Mini-SWE)=16.7%
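
The headline drop can be recomputed from the A% values listed above. The sketch below copies the L0 and L3 numbers from this section and applies the "capable" filter (L0 A% > 50%) used in the key findings. Because the inputs are rounded to one decimal, the results can differ slightly from the paper's figures (for example, 45.4 pp here versus the reported 45.5 pp for OpenHands + Qwen3-Coder-Next).

```python
# (scaffold, model) -> (A% at L0, A% at L3), copied from the list above.
RESULTS = {
    ("Mini-SWE-Agent", "GPT-5-mini"): (51.7, 23.7),
    ("OpenHands", "GPT-5-mini"): (65.8, 52.2),
    ("Mini-SWE-Agent", "Qwen3-Coder-Next"): (86.4, 46.1),
    ("OpenHands", "Qwen3-Coder-Next"): (73.0, 27.6),
    ("Mini-SWE-Agent", "Qwen3-235B-A22B"): (29.6, 2.3),
    ("OpenHands", "Qwen3-235B-A22B"): (26.2, 0.8),
    ("Mini-SWE-Agent", "MiniMax-M2.5"): (88.6, 58.3),
    ("OpenHands", "MiniMax-M2.5"): (95.6, 78.6),
    ("Mini-SWE-Agent", "Kimi-K2.5"): (85.4, 53.7),
    ("Mini-SWE-Agent", "GPT-5.2"): (78.2, 48.0),
}

# "Capable" configurations: baseline A% above 50%.
capable = {cfg: vals for cfg, vals in RESULTS.items() if vals[0] > 50.0}
drops = {cfg: l0 - l3 for cfg, (l0, l3) in capable.items()}

mean_drop = sum(drops.values()) / len(drops)
mean_l0 = sum(l0 for l0, _ in capable.values()) / len(capable)
relative_loss = mean_drop / mean_l0

worst = max(drops, key=drops.get)
print(f"{len(capable)} capable configs, mean drop {mean_drop:.1f} pp, "
      f"relative loss {relative_loss:.0%}, worst: {worst}")
```

With these rounded inputs the mean drop is about 29.6 pp and the relative loss about 38%, in line with the reported ~30 pp and ~40%.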

Limitations

Single API contract: all 80 tasks derive from one OpenAPI spec (RealWorld Conduit). The authors acknowledge this maximizes internal validity but it means generalizability to APIs with different complexity, domain semantics, or schema structure is unestablished.
Cost-driven incomplete factorial design: three of the seven models (MiniMax-M2.5, Kimi-K2.5, GPT-5.2) were evaluated on only 16 of 80 tasks (2 frameworks per runtime × 4 levels), restricting framework sensitivity analysis for these models despite the subset validation showing high correlation.
No adversarial or distribution-shift evaluation: all tasks use the same Conduit spec that is likely present in LLM pre-training data. Performance on novel or proprietary API contracts may differ substantially, potentially inflating or deflating the constraint-decay effect.
Only two agent scaffolds tested: Mini-SWE-Agent and OpenHands represent a limited sample of the agent architecture space. More sophisticated multi-agent, retrieval-augmented, or planning-specialized architectures (e.g., those with explicit constraint tracking modules) are not evaluated.
n=3 trials per configuration is statistically thin, especially for pass@1 estimation at lower constraint levels where variance is high. The authors acknowledge the variance of drops is substantial and caution against over-interpreting precise effect sizes.
Root cause analysis covers only two models (Qwen3-Coder-Next and MiniMax-M2.5) with Mini-SWE-Agent only; failure profiles for OpenHands or closed models like GPT-5.2 may differ meaningfully, and the GPT-5.2 judge used for classification introduces potential self-serving bias for closed-model evaluations.
Verifier functions are heuristic-based (source scanning, import analysis) rather than formal static analysis; edge cases where structurally compliant code is incorrectly flagged non-compliant, or vice versa, are possible though the authors show ≤2.7 pp impact on A% when verifiers are disabled.

Open questions / follow-ons

Can retrieval-augmented generation (RAG) over framework documentation and ORM API references meaningfully reduce constraint decay, particularly for convention-heavy frameworks like Django and FastAPI where training data may be sparser or more contradictory?
Does constraint decay scale with API surface complexity? The Conduit spec is deliberately simple (19 endpoints, 5 resources); it is unclear whether decay rates accelerate, plateau, or shift in root-cause profile for enterprise-scale APIs with hundreds of endpoints and complex relational schemas.
Can iterative constraint verification — where the agent receives verifier feedback mid-generation and self-corrects — recover a meaningful fraction of the 30 pp average drop, or does the multi-file consistency problem make local corrections insufficient?
What is the relationship between LLM pre-training corpus representation of a framework and that framework's agent performance? The Hono result (trailing despite Express-comparable API surface, attributed to edge-runtime training data underrepresentation) suggests training data distribution is a key confound that deserves controlled study.

Why it matters for bot defense

For bot-defense engineers, this paper is directly relevant to any organization considering LLM-agent-assisted development of backend detection and enforcement infrastructure. Production bot-defense systems are precisely the kind of constraint-heavy codebases studied here: they require specific architectural patterns (for auditability and separation of concerns), prescribed database backends (for compliance and operational consistency), and ORM layers (for maintainability). The 30 pp average A% drop at full constraint specification, together with L3 pass@1 values that reach only 8.3% for the highest-A% configuration and no more than 33.3% for any configuration, means that agent-generated code for such systems would require substantial manual review and correction before deployment. The finding that database constraints (−19.3 pp marginal effect for PostgreSQL) drive most of the decay is particularly salient for bot-defense backends, which are inherently data-intensive (session stores, fingerprint databases, rate-limit counters, event logs).

More broadly, the framework sensitivity finding has architectural implications: if an engineering team is evaluating whether to use LLM agents for accelerating development of CAPTCHA serving infrastructure or telemetry pipelines, choosing Flask or Express over FastAPI or Django could meaningfully reduce the supervision burden. The root-cause taxonomy also flags auth misconfiguration as a significant model-specific failure mode (22.6% of Qwen3-Coder-Next logic errors), which is a critical concern for bot-defense systems where token validation and header parsing errors could silently break enforcement logic. The open-sourced benchmark and trajectories may also be useful for evaluating internal agent tooling on constrained backend tasks before committing to agentic development workflows.

Cite

bibtex

@article{arxiv2605_06445,
  title={Constraint Decay: The Fragility of LLM Agents in Backend Code Generation},
  author={Francesco Dente and Dario Satriani and Paolo Papotti},
  journal={arXiv preprint arXiv:2605.06445},
  year={2026},
  url={https://arxiv.org/abs/2605.06445}
}
