Agent fleet self-reviews (Cursor model: error → fix → converge)
At fleet scale, agents review and correct their own work in tight feedback loops - running tests, observing failures, fixing root causes, and iterating until the code converges to a passing state without human intervention.
- Agent fleet self-reviews code (error-fix-converge loop) before submitting for merge
- Human review is limited to Red-classified PRs (architectural decisions only)
- Continuous auto-refactoring runs in the background without human initiation
- Agent self-review catches 90%+ of issues that would be found by human review
- Auto-refactoring PRs are tracked separately and have their own quality metrics
Evidence
- Agent iteration logs showing error-fix-converge cycles before PR submission
- PR analytics showing human review only on Red-classified PRs
- Auto-refactoring PR history with associated quality metrics
What It Is
The Cursor model is a self-review pattern where an agent doesn't just write code - it runs the code, reads the error output, identifies the root cause, writes a fix, runs the code again, and repeats until all checks pass. The review is not a separate step that happens after code is written; it's integrated into the generation process as a continuous feedback loop.
The cycle is: write → run → observe errors → fix → run → observe errors → fix → ... → all tests green → converge. Each iteration is a self-review: the agent examines its own output against the specification (tests, linter, type checker) and makes corrections. The agent stops when the specification is fully satisfied.
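As a sketch, the loop reduces to a small driver. This is a minimal illustration, not Cursor's implementation; `run_checks` and `apply_fix` are hypothetical stand-ins for the agent's test runner and its editing step:

```python
def converge(run_checks, apply_fix, max_iterations=10):
    """Write -> run -> observe errors -> fix, until the checks pass.

    run_checks() returns a list of error messages ([] means all green);
    apply_fix(errors) is the agent's attempt to address them.
    """
    for attempt in range(1, max_iterations + 1):
        errors = run_checks()
        if not errors:
            return ("converged", attempt)   # specification fully satisfied
        apply_fix(errors)                   # errors become the next spec
    return ("stuck", max_iterations)        # escalate instead of looping forever

# Simulated run: each fix resolves one outstanding failure.
bugs = ["assert 401 == 200", "TypeError in handler"]
outcome = converge(lambda: list(bugs), lambda errors: bugs.pop(0))
print(outcome)  # -> ('converged', 3)
```

The iteration budget is the important design choice: a bounded loop is what turns "can't converge" into an explicit escalation signal rather than an infinite retry.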
At L5 (Autonomous), this pattern runs at fleet scale. Hundreds of agents are simultaneously working on different tasks, each running its own error → fix → converge loops. The "review process" is internal to each agent's loop, not a separate human-facing step. Agents that converge successfully produce Green PRs that auto-merge. Agents that can't converge (the task is genuinely ambiguous, requires external information, or hits a hard constraint) escalate to human review as a Red PR with a detailed explanation of where they got stuck.
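At the dispatcher level, routing the fleet's outcomes is just a partition. A hedged sketch; the field names (`converged`, `stuck_reason`) are illustrative, not a real API:

```python
def triage(results):
    """Route fleet outcomes: converged agents feed the auto-merge queue,
    non-convergent agents escalate to humans as Red PRs."""
    green_queue = [r for r in results if r["converged"]]
    red_queue = [r for r in results if not r["converged"]]
    return green_queue, red_queue

fleet = [
    {"task": "refactor-auth", "converged": True},
    {"task": "migrate-billing", "converged": False,
     "stuck_reason": "spec ambiguous: two conflicting currency formats"},
]
green, red = triage(fleet)
print(len(green), len(red))  # -> 1 1
```

Note that the Red entry carries the agent's own explanation of where it got stuck; that explanation is what makes the human review step tractable.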
Cursor's internal research on this model has documented meaningful improvements in first-pass correctness: agents that iterate through 3-5 error → fix cycles produce substantially fewer post-merge defects than agents that make a single pass and submit. The compounding of test-driven error signals into each iteration is the core mechanism.
At fleet scale, the aggregate effect is a stream of well-verified changes arriving continuously at the auto-merge queue, with humans only engaged when the problem is genuinely hard.
Why It Matters
Agent self-review through error → fix → converge loops changes the quality economics at L5:
- First-pass correctness improves dramatically - An agent that iterates until all tests pass is producing code that has been tested by definition, not just code that was intended to pass tests. The difference between "I wrote code that should pass" and "I ran the code and it passes" is the difference between L3 and L5 quality.
- Review is never a bottleneck - The agent reviews itself continuously during generation. There's no "submit and wait" period. By the time the PR is created, the agent's review is complete.
- Review scales with the fleet - Each agent handles its own review loop. Adding 100 more agents to the fleet adds 100 more parallel review loops. Human review capacity doesn't need to scale with agent fleet size.
- Error messages become specification - The agent uses compiler errors, type errors, test failures, and lint violations as feedback about its own code's correctness. This is a qualitatively different use of the quality gate than a human's: the agent reads error messages as a specification of what needs to change, not as a report to triage later.
- Architectural guardrails are self-reinforcing - Custom lint rules (from L3) become part of the agent's feedback loop. When the agent's generated code violates an architectural constraint, the lint rule fires, the agent reads the error, and the agent fixes the violation. Architecture is enforced through the agent's own iteration, not through separate review.
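The "errors as specification" idea can be made concrete with a tiny parser that turns a legible failure message into a structured fix target. The message format here is hypothetical; the point is that structured messages parse, generic ones don't:

```python
import re

# Matches messages like "expected response status 401, got 200".
FAILURE_PATTERN = re.compile(
    r"expected (?P<what>[\w ]+?) (?P<expected>\S+), got (?P<actual>\S+)"
)

def parse_failure(message: str):
    """Turn a human-readable failure into a signal the agent can act on."""
    m = FAILURE_PATTERN.search(message)
    if not m:
        return None  # generic message: weak signal, harder to converge on
    return {"what": m.group("what"),
            "expected": m.group("expected"),
            "actual": m.group("actual")}

print(parse_failure("expected response status 401, got 200"))
# -> {'what': 'response status', 'expected': '401', 'actual': '200'}
print(parse_failure("expected true, got false"))
# -> None
```

A specific message yields a structured target for the next fix iteration; a generic "expected true, got false" yields nothing the loop can steer by.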
The prerequisite for this model at L5 is everything that came before: TORS > 95% (so test failures are real signals, not noise), comprehensive lint rules (so architectural violations fire reliably), and a well-configured auto-merge system (so convergence results in ships without human intervention). The agent fleet model depends on the entire L1-L4 stack functioning correctly.
Design your test suite for agent legibility. Tests that produce clear, specific failure messages ("expected response status 401, got 200 - check that authentication middleware is applied to this route") give agents better signal for the error → fix → converge loop than tests with generic failure messages ("expected true, got false"). Invest in test message quality as a fleet-scale productivity investment.
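One way to make this concrete (a minimal sketch, not a prescribed format): attach the diagnosis to the assertion itself, so the failure text carries both the observed value and a pointer to the likely fix site.

```python
def check_status(actual: int, expected: int = 401) -> None:
    # The message doubles as a fix specification for the agent's next iteration.
    assert actual == expected, (
        f"expected response status {expected}, got {actual} - "
        "check that authentication middleware is applied to this route"
    )

try:
    check_status(200)
except AssertionError as err:
    print(err)  # the agent reads this as its next specification
```

A bare `assert actual == expected` fails just as reliably, but it tells the agent only that something is wrong, not where to look.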
Getting Started
6 steps to get from here to the next level
Common Pitfalls
Mistakes teams actually make at this stage - and how to avoid them
How Different Roles See It
Bob is managing a fleet of Claude Code agents working on different parts of the codebase simultaneously. He's noticed that some agents produce clean Green PRs while others consistently produce Red PRs or PRs that require many human corrections after auto-merge. He wants to understand why the fleet's performance is uneven.
What Bob should do - role-specific action plan
Sarah's throughput metrics are strong: the fleet is producing 300+ PRs per week with 70% auto-merging. But she's starting to see noise in her post-merge defect data: a small but growing proportion of auto-merged PRs are being reverted within 24 hours. She wants to understand whether the agent fleet is outpacing the quality gate.
What Sarah should do - role-specific action plan
Victor is spending time each week reviewing the non-convergent agent tasks - the Red PRs that escalated to him because the agent couldn't converge. He's noticed that 40% of them failed because of a pattern the agents kept getting wrong: they were calling an external service directly in a unit test context instead of using the mock framework. The agents would iterate, fail the test, and not understand why.
What Victor should do - role-specific action plan
Further Reading
5 resources worth reading - hand-picked, not scraped
From the Field
Recent releases, projects, and discussions relevant to this maturity level.