A new paper from researchers at EURECOM and the University of Basilicata — Dente, Satriani, and Papotti — has put a precise number on something many developers have felt intuitively: the more structural rules you give an LLM coding agent, the worse its output gets. They call it constraint decay, and the drop is not subtle. Across eight web frameworks and one hundred tasks, capable model configurations lost an average of 30 points in assertion pass rates when moving from unconstrained code generation to fully specified, production-style tasks. Some weaker configurations effectively collapsed to zero.
The paper, posted to arXiv on May 7, 2026 (arXiv:2605.06445), is the first systematic attempt to isolate how LLM agents handle the non-functional requirements that distinguish working prototypes from production-grade backend code.
What the study actually measured
Most existing benchmarks evaluate code generation against functional correctness — does the output pass the unit tests? The EURECOM team argues this misses the point. Production code must also satisfy structural constraints: framework conventions, database mappings, architectural patterns, ORM lifecycle rules. A generated endpoint that returns the right JSON but bypasses the framework's dependency injection or misuses the ORM's session management is not production-ready.
To measure this, the researchers designed 80 greenfield generation tasks and 20 feature-implementation tasks, all sharing a unified API contract. Each task was evaluated at two levels of specification: a baseline version with only functional requirements, and a fully specified version that added structural constraints typical of real-world backend projects. Scoring was two-sided: end-to-end behavioral tests verified functional correctness, and static verifiers checked structural compliance with the target framework's conventions.
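A rough sketch of what scoring on both axes can look like, using only the Python standard library. This is not the paper's harness: the `GENERATED` snippet, the `create_user` function, the `FakeDB` stand-in, and the "no raw `execute()` calls" rule are all hypothetical, chosen only to show a behavioral check and a static check disagreeing about the same code.

```python
import ast

# Hypothetical agent output: functionally fine, but it calls a raw
# `execute` method, which the illustrative structural rule forbids.
GENERATED = '''
def create_user(db, name):
    db.execute("INSERT INTO users (name) VALUES (?)", (name,))
    return {"name": name}
'''


class FakeDB:
    """Stand-in database that records the statements it receives."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=()):
        self.statements.append((sql, params))


def behavioral_check(source: str) -> bool:
    """Behavioral side: run the code and assert on what it returns."""
    namespace = {}
    exec(source, namespace)
    return namespace["create_user"](FakeDB(), "ada") == {"name": "ada"}


def structural_check(source: str) -> list[str]:
    """Static side: flag every call to a method named `execute`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"):
            violations.append(f"line {node.lineno}: raw execute() call")
    return violations


def evaluate(source: str) -> dict:
    """Report both verdicts side by side instead of collapsing them."""
    violations = structural_check(source)
    return {
        "functional": behavioral_check(source),
        "structural": not violations,
        "violations": violations,
    }


result = evaluate(GENERATED)
print(result)
```

Here the snippet passes the behavioral check and fails the structural one, which is exactly the kind of output a functional-only benchmark would count as a success.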
The experiment covered eight web frameworks across two language ecosystems: Python 3.12 (Flask, FastAPI, Django, aiohttp) and Node.js 20 (Express, Fastify, Hono, Koa). The model lineup spanned open-weight and proprietary models: GPT-5.2 and GPT-5-mini from OpenAI, MiniMax-M2.5, Kimi-K2.5, Qwen3-Coder-Next (80B), Qwen3-235B-Instruct, and Devstral-Small (24B).
The 30-point gap
The headline result: among capable configurations, the average assertion pass rate dropped by 30 percentage points when structural constraints were added.
The distribution tells a more nuanced story. In the baseline condition (generate a working backend without framework-specific rules), several configurations performed reasonably well, with pass rates ranging from the mid-60s to 96% on simpler tasks. Once the researchers activated the full constraint set, those same configurations dropped to the 30-50% range. Weaker models and smaller configurations saw their pass rates fall to near zero on the most constrained tasks.
Framework sensitivity varied sharply across the eight stacks tested. Agents performed best on Flask, a micro-framework with minimal conventions and explicit wiring, which leaves fewer structural rules to violate. Express.js also performed relatively well for the same reason.
At the other end of the spectrum, FastAPI and Django produced the largest performance drops. These frameworks rely heavily on convention-over-configuration patterns: implicit model registration, automatic serialization, middleware chains, and opinionated project structures. When agents had to respect these patterns while also implementing the functional requirements, the error rate climbed sharply.
Why agents fail: the data layer is the bottleneck
The researchers performed a detailed error analysis to understand what specifically breaks under constraint load. The leading root cause was data-layer defects: incorrect query composition and ORM runtime violations.
ORM runtime violations are the harder of the two to spot. The code will parse, and it may even pass unit tests, but it will fail at runtime in production because it violates the framework's implicit contract.
Query composition was the second major failure category. Agents frequently generated queries that were semantically close to correct but structurally wrong for the target ORM — using raw SQL where the ORM expected a query builder, or mixing lazy and eager loading in ways that would produce N+1 problems at scale.
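The N+1 pattern is easier to see with the queries counted. The sketch below is an illustration rather than anything from the paper: it uses the standard-library `sqlite3` module with hand-written SQL so every round trip is visible, and the schema, the `CountingConnection` wrapper, and both loader functions are made up for the example. An ORM's lazy loading produces the first access pattern implicitly; eager loading corresponds to the second.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many queries are issued."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def run(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db() -> CountingConnection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
    """)
    return CountingConnection(conn)


def titles_n_plus_one(db: CountingConnection) -> dict:
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in db.run("SELECT id, name FROM authors ORDER BY id"):
        rows = db.run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_single_query(db: CountingConnection) -> dict:
    """Same data fetched with a single LEFT JOIN."""
    result = {}
    rows = db.run("""
        SELECT a.name, p.title
        FROM authors a LEFT JOIN posts p ON p.author_id = a.id
        ORDER BY a.id, p.id
    """)
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result


lazy_db, eager_db = make_db(), make_db()
lazy = titles_n_plus_one(lazy_db)
eager = titles_single_query(eager_db)
print(lazy_db.queries, eager_db.queries)  # 4 1
```

Both functions return the same mapping, so a behavioral test cannot tell them apart. Only the query count, which grows with the number of authors in the first version, reveals the difference.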
A third category involved architectural boundary violations: code that mixed concerns across layers, placed business logic in the wrong file, or bypassed the framework's middleware pipeline. These are precisely the errors that static analysis tools and linters are designed to catch, but the agents rarely produced code that satisfied both the functional tests and the structural verifiers simultaneously.
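A boundary rule of this kind can be checked mechanically. The following is a minimal sketch under assumed conventions: the layer names, the `FORBIDDEN_IMPORTS` table, and the sample route module are hypothetical, and a real project would encode its own layering. It parses the source with Python's `ast` module and never imports or runs it.

```python
import ast

# Hypothetical layering rule: top-level modules each layer must not import.
FORBIDDEN_IMPORTS = {
    "routes": {"sqlite3", "repositories"},  # handlers go through services
    "services": {"flask"},                  # business logic stays framework-free
}


def imported_modules(source: str) -> set[str]:
    """Collect the top-level names of every module the source imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def boundary_violations(layer: str, source: str) -> list[str]:
    """Return one message per forbidden import found in the given layer."""
    bad = imported_modules(source) & FORBIDDEN_IMPORTS.get(layer, set())
    return sorted(f"{layer} must not import {name}" for name in bad)


# Hypothetical route module that opens a database connection directly.
route_source = '''
import sqlite3
from services.users import register_user

def post_user(payload):
    conn = sqlite3.connect("app.db")
    return register_user(conn, payload)
'''

violations = boundary_violations("routes", route_source)
print(violations)
```

The handler above would likely pass an end-to-end test, yet the check reports that the routes layer reaches past the service layer into the database driver.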
Limitations of the study
The paper has limitations. As several commenters on Hacker News pointed out, the study tested GPT-5.2 but not GPT-5.2-codex, the variant optimized specifically for coding agent workflows. The authors acknowledge that cost constraints prevented them from testing the most recent frontier models. This means the absolute scores are less important than the relative pattern: the 30-point gap may shrink with newer models, but the existence of constraint decay as a phenomenon is unlikely to disappear.
Most of the tasks (80 of 100) are greenfield generation, which differs from the common workflow of editing an existing codebase. Some HN commenters with hands-on experience reported the opposite finding: larger, more convention-rich codebases actually improve agent accuracy because the model has more exemplars to pattern-match against. The paper's findings apply most directly to the greenfield scenario, where there is no prior context to anchor the agent's output.
What this means for production engineering
For teams evaluating AI coding agents for backend work, the paper's takeaway is that agents are not equally unreliable across all frameworks. The cost of constraint violations rises with the convention density of the chosen stack.
If you are generating backend code with LLM agents, Flask and Express.js will produce fewer structural errors per task than Django or FastAPI, not because the code is simpler in a computational sense, but because there are fewer implicit rules to violate. This is measurable and predictable.
The broader implication: the current evaluation architecture — functional tests alone — is insufficient for production-grade code generation. The authors advocate for dual evaluation pipelines that combine behavioral tests with structural verifiers. This mirrors what experienced practitioners already do: generate code, run the tests, then lint and review for architectural compliance.
Several HN commenters independently converged on the same solution: providing concrete exemplar files from the existing codebase is more effective than writing architectural rules in markdown. The paper's findings support this. When agents operate from examples rather than specifications, they have a concrete pattern to match rather than a set of abstract rules to satisfy.
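In practice, exemplar-based prompting can be as simple as placing real files ahead of the task description. The helper below is a hypothetical sketch: `build_prompt`, the section markers, and the sample file are assumptions for illustration, and reading exemplars from disk or choosing which ones to include is left out.

```python
def build_prompt(task: str, exemplars: dict[str, str], max_chars: int = 2000) -> str:
    """Assemble a prompt that shows existing files before stating the task."""
    sections = []
    for filename, source in exemplars.items():
        # Truncate long files so the exemplars do not crowd out the task.
        body = source[:max_chars].rstrip()
        sections.append(f"### Existing file: {filename}\n{body}")
    sections.append(
        f"### Task\n{task}\nMatch the structure and conventions of the files above."
    )
    return "\n\n".join(sections)


# Hypothetical exemplar taken from an existing codebase.
exemplars = {
    "routes/orders.py": (
        "from services.orders import list_orders\n"
        "\n"
        "def get_orders(request):\n"
        "    return list_orders(request.user_id)\n"
    ),
}

prompt = build_prompt("Add a GET /invoices endpoint.", exemplars)
print(prompt)
```

The layering convention (routes call services) is never stated as a rule here. The agent is expected to infer it from the file it is shown.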
Constraint decay is not going away
The paper's contribution is not that LLM agents make mistakes — that was never in dispute. It is that the interaction between functional and structural requirements creates a distinct failure mode that existing benchmarks systematically miss.
Constraint decay is a measurable, repeatable phenomenon, and it cuts across model families and frameworks.
Better models will reduce the absolute error rate, but the decay pattern itself — performance dropping as the number of independent structural constraints increases — is likely to persist. It is a consequence of the underlying architecture: models that generate code token by token cannot simultaneously optimize for functional correctness and structural compliance in the way that a human engineer does when they are aware of both concerns throughout the writing process.
The most practical path forward is better tooling around the agent, not better agents alone. Structural verifiers that feed errors back into the generation loop, exemplar-based prompting, and framework-specific evaluators are all strategies that the paper implicitly validates. The agent does not need to hold every constraint in mind at once if the surrounding system can catch violations and correct them.
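A feedback loop of that shape might look like the sketch below. Everything in it is an assumption made for illustration: `repair_loop` is a hypothetical helper, and `stub_generate` and `stub_verify` are toy stand-ins for a real agent call and a real structural verifier, included only so the example runs on its own.

```python
from typing import Callable


def repair_loop(
    generate: Callable[[str], str],
    verify: Callable[[str], list[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, list[str], int]:
    """Regenerate until the verifier reports no violations or rounds run out."""
    prompt = task
    code, violations = "", []
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt)
        violations = verify(code)
        if not violations:
            return code, violations, round_no
        # Feed the verifier output back so the next attempt can address it.
        feedback = "\n".join(f"- {v}" for v in violations)
        prompt = f"{task}\n\nFix these structural violations:\n{feedback}"
    return code, violations, max_rounds


def stub_generate(prompt: str) -> str:
    """Toy agent: emits raw SQL first, then a repository call once told off."""
    if "violations" in prompt:
        return "def handler(repo, name):\n    return repo.add_user(name)\n"
    return "def handler(db, name):\n    return db.execute('INSERT ...')\n"


def stub_verify(code: str) -> list[str]:
    """Toy verifier: a substring check standing in for a real static analyzer."""
    return ["raw execute() call in handler"] if ".execute(" in code else []


code, violations, rounds = repair_loop(stub_generate, stub_verify, "Add a user endpoint")
print(rounds, violations)  # 2 []
```

The loop keeps the verifier outside the model: the agent only has to respond to the specific violations reported in each round, and `max_rounds` bounds the cost when it cannot.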
As production-grade AI code generation moves from prototype demos to real deployments, constraint decay is the problem that will determine whether that transition succeeds. The EURECOM paper is the first rigorous measurement of the gap. Closing it is the next challenge.
Further Reading
- Constraint Decay: The Fragility of LLM Agents in Backend Code Generation — The full paper by Dente, Satriani, and Papotti on arXiv (May 2026), including methodology, detailed results across all eight frameworks, and the complete error analysis taxonomy.
- Hacker News discussion — Community commentary covering limitations of the study, practical counter-examples from production deployments, and strategies for mitigating constraint decay with exemplar-based prompting.
- LLMs Corrupt Your Documents When You Delegate — A related study by Laban, Schnabel, and Neville showing that LLMs degrade documents during long-horizon delegated workflows, with similar failure patterns across 52 professional domains including code.