Agent fleet self-reviews (Cursor model: error → fix → converge)
At fleet scale, agents review and correct their own work in tight feedback loops - running tests, observing failures, fixing root causes, and iterating until the code converges to a passing state without human intervention.
- Agent fleet self-reviews code (error-fix-converge loop) before submitting for merge
- Human review is limited to Red-classified PRs (architectural decisions only)
- Continuous auto-refactoring runs in the background without human initiation
- Agent self-review catches 90%+ of issues that would be found by human review
- Auto-refactoring PRs are tracked separately and have their own quality metrics
Evidence
- Agent iteration logs showing error-fix-converge cycles before PR submission
- PR analytics showing human review only on Red-classified PRs
- Auto-refactoring PR history with associated quality metrics
What It Is
The Cursor model is a self-review pattern where an agent doesn't just write code - it runs the code, reads the error output, identifies the root cause, writes a fix, runs the code again, and repeats until all checks pass. The review is not a separate step that happens after code is written; it's integrated into the generation process as a continuous feedback loop.
The cycle is: write → run → observe errors → fix → run → observe errors → fix → ... → all tests green → converge. Each iteration is a self-review: the agent examines its own output against the specification (tests, linter, type checker) and makes corrections. The agent stops when the specification is fully satisfied.
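As a sketch, the loop reduces to a small driver. This is a minimal illustration, not Cursor's implementation; `run_checks` and `apply_fix` are hypothetical stand-ins for the agent's test runner and its editing step:

```python
def converge(run_checks, apply_fix, max_iterations=10):
    """Write -> run -> observe errors -> fix, until the checks pass.

    run_checks() returns a list of error messages ([] means all green);
    apply_fix(errors) is the agent's attempt to address them.
    """
    for attempt in range(1, max_iterations + 1):
        errors = run_checks()
        if not errors:
            return ("converged", attempt)   # specification fully satisfied
        apply_fix(errors)                   # errors become the next spec
    return ("stuck", max_iterations)        # escalate instead of looping forever

# Simulated run: each fix resolves one outstanding failure.
bugs = ["assert 401 == 200", "TypeError in handler"]
outcome = converge(lambda: list(bugs), lambda errors: bugs.pop(0))
print(outcome)  # -> ('converged', 3)
```

The iteration budget is the important design choice: a bounded loop is what turns "can't converge" into an explicit escalation signal rather than an infinite retry.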
At L5 (Autonomous), this pattern runs at fleet scale. Hundreds of agents are simultaneously working on different tasks, each running its own error → fix → converge loops. The "review process" is internal to each agent's loop, not a separate human-facing step. Agents that converge successfully produce Green PRs that auto-merge. Agents that can't converge (the task is genuinely ambiguous, requires external information, or hits a hard constraint) escalate to human review as a Red PR with a detailed explanation of where they got stuck.
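At the dispatcher level, routing the fleet's outcomes is just a partition. A hedged sketch; the field names (`converged`, `stuck_reason`) are illustrative, not a real API:

```python
def triage(results):
    """Route fleet outcomes: converged agents feed the auto-merge queue,
    non-convergent agents escalate to humans as Red PRs."""
    green_queue = [r for r in results if r["converged"]]
    red_queue = [r for r in results if not r["converged"]]
    return green_queue, red_queue

fleet = [
    {"task": "refactor-auth", "converged": True},
    {"task": "migrate-billing", "converged": False,
     "stuck_reason": "spec ambiguous: two conflicting currency formats"},
]
green, red = triage(fleet)
print(len(green), len(red))  # -> 1 1
```

Note that the Red entry carries the agent's own explanation of where it got stuck; that explanation is what makes the human review step tractable.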
Cursor's internal research on this model has documented meaningful improvements in first-pass correctness: agents that iterate through 3-5 error → fix cycles produce substantially fewer post-merge defects than agents that make a single pass and submit. The compounding of test-driven error signals into each iteration is the core mechanism.
At fleet scale, the aggregate effect is a stream of well-verified changes arriving continuously at the auto-merge queue, with humans only engaged when the problem is genuinely hard.
Why It Matters
Agent self-review through error → fix → converge loops changes the quality economics at L5:
- First-pass correctness improves dramatically - An agent that iterates until all tests pass is producing code that has been tested by definition, not just code that was intended to pass tests. The difference between "I wrote code that should pass" and "I ran the code and it passes" is the difference between L3 and L5 quality.
- Review is never a bottleneck - The agent reviews itself continuously during generation. There's no "submit and wait" period. By the time the PR is created, the agent's review is complete.
- Review scales with the fleet - Each agent handles its own review loop. Adding 100 more agents to the fleet adds 100 more parallel review loops. Human review capacity doesn't need to scale with agent fleet size.
- Error messages become specification - The agent uses compiler errors, type errors, test failures, and lint violations as feedback about its own code's correctness. This is a qualitatively different use of the quality gate than a human's: the agent reads error messages as a specification of what needs to change, not as a report to triage later.
- Architectural guardrails are self-reinforcing - Custom lint rules (from L3) become part of the agent's feedback loop. When the agent's generated code violates an architectural constraint, the lint rule fires, the agent reads the error, and the agent fixes the violation. Architecture is enforced through the agent's own iteration, not through separate review.
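The "errors as specification" idea can be made concrete with a tiny parser that turns a legible failure message into a structured fix target. The message format here is hypothetical; the point is that structured messages parse, generic ones don't:

```python
import re

# Matches messages like "expected response status 401, got 200".
FAILURE_PATTERN = re.compile(
    r"expected (?P<what>[\w ]+?) (?P<expected>\S+), got (?P<actual>\S+)"
)

def parse_failure(message: str):
    """Turn a human-readable failure into a signal the agent can act on."""
    m = FAILURE_PATTERN.search(message)
    if not m:
        return None  # generic message: weak signal, harder to converge on
    return {"what": m.group("what"),
            "expected": m.group("expected"),
            "actual": m.group("actual")}

print(parse_failure("expected response status 401, got 200"))
# -> {'what': 'response status', 'expected': '401', 'actual': '200'}
print(parse_failure("expected true, got false"))
# -> None
```

A specific message yields a structured target for the next fix iteration; a generic "expected true, got false" yields nothing the loop can steer by.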
The prerequisite for this model at L5 is everything that came before: TORS > 95% (so test failures are real signals, not noise), comprehensive lint rules (so architectural violations fire reliably), and a well-configured auto-merge system (so convergence results in ships without human intervention). The agent fleet model depends on the entire L1-L4 stack functioning correctly.
Design your test suite for agent legibility. Tests that produce clear, specific failure messages ("expected response status 401, got 200 - check that authentication middleware is applied to this route") give agents better signal for the error → fix → converge loop than tests with generic failure messages ("expected true, got false"). Invest in test message quality as a fleet-scale productivity investment.
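One way to make this concrete (a minimal sketch, not a prescribed format): attach the diagnosis to the assertion itself, so the failure text carries both the observed value and a pointer to the likely fix site.

```python
def check_status(actual: int, expected: int = 401) -> None:
    # The message doubles as a fix specification for the agent's next iteration.
    assert actual == expected, (
        f"expected response status {expected}, got {actual} - "
        "check that authentication middleware is applied to this route"
    )

try:
    check_status(200)
except AssertionError as err:
    print(err)  # the agent reads this as its next specification
```

A bare `assert actual == expected` fails just as reliably, but it tells the agent only that something is wrong, not where to look.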
Getting Started
6 steps to get from here to the next level
Common Pitfalls
Mistakes teams actually make at this stage - and how to avoid them
How Different Roles See It
Bob is managing a fleet of Claude Code agents working on different parts of the codebase simultaneously. He's noticed that some agents produce clean Green PRs while others consistently produce Red PRs or PRs that require many human corrections after auto-merge. He wants to understand why the fleet's performance is uneven.
What Bob should do - role-specific action plan
Sarah's throughput metrics are strong: the fleet is producing 300+ PRs per week with 70% auto-merging. But she's starting to see noise in her post-merge defect data: a small but growing proportion of auto-merged PRs are being reverted within 24 hours. She wants to understand whether the agent fleet is outpacing the quality gate.
What Sarah should do - role-specific action plan
Victor is spending time each week reviewing the non-convergent agent tasks - the Red PRs that escalated to him because the agent couldn't converge. He's noticed that 40% of them failed because of a pattern the agents kept getting wrong: they were calling an external service directly in a unit test context instead of using the mock framework. The agents would iterate, fail the test, and not understand why.
What Victor should do - role-specific action plan
Further Reading
5 resources worth reading - hand-picked, not scraped
From the Field
Recent releases, projects, and discussions relevant to this maturity level.