Max 2 CI rounds per PR (Stripe benchmark)
- Policy-based merge rules are enforced (OPA, branch protection, or equivalent)
- Deterministic merge ordering with conflict detection prevents concurrent merge failures
- PRs require a maximum of 2 CI rounds before merge (Stripe benchmark)
- Merge rules are versioned as code and reviewed when changed
- PRs exceeding 2 CI rounds are flagged for investigation
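The flagging criterion above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `CiRun` record and its fields are hypothetical, and it assumes every CI run on a PR counts as one round, including re-runs of the same commit.

```python
from collections import Counter
from dataclasses import dataclass

MAX_CI_ROUNDS = 2  # the benchmark named in the criteria


@dataclass(frozen=True)
class CiRun:
    """One CI run on a PR (hypothetical record shape)."""
    pr_number: int
    passed: bool


def ci_rounds_per_pr(runs):
    """Count CI runs per PR; each run is treated as one round (assumption)."""
    return Counter(run.pr_number for run in runs)


def flag_for_investigation(runs, max_rounds=MAX_CI_ROUNDS):
    """Return the sorted PR numbers that used more than max_rounds CI rounds."""
    rounds = ci_rounds_per_pr(runs)
    return sorted(pr for pr, count in rounds.items() if count > max_rounds)


example_runs = [
    CiRun(pr_number=101, passed=True),
    CiRun(pr_number=102, passed=False),
    CiRun(pr_number=102, passed=False),
    CiRun(pr_number=102, passed=True),
]
flagged = flag_for_investigation(example_runs)  # PR 102 used 3 rounds
```

Counting re-runs as rounds is a deliberate choice here: it keeps flaky-test retries visible in the metric instead of hiding them.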
Evidence
- Policy-as-code configuration (OPA rules, branch protection API config)
- CI round count per PR metrics showing 2-round maximum adherence
- Merge ordering logs showing deterministic processing
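One way to picture deterministic merge ordering with conflict detection is sketched below. It is an assumption-laden illustration: the `QueuedPr` shape, the tie-break on PR number, and the file-overlap definition of "conflict" are all hypothetical choices, not a description of any particular merge queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedPr:
    """A PR waiting to merge (hypothetical record shape)."""
    number: int
    enqueued_at: int            # e.g. Unix timestamp when the PR entered the queue
    touched_files: frozenset    # paths modified by the PR


def merge_order(queue):
    """Sort by (enqueue time, PR number) so ties always break the same way."""
    return sorted(queue, key=lambda pr: (pr.enqueued_at, pr.number))


def plan_merges(queue):
    """Walk the queue in order and hold back PRs that overlap an earlier PR's files.

    Held PRs still claim their files, so anything queued behind them that
    touches the same paths is held too and relative order is preserved.
    """
    claimed = set()
    mergeable, held = [], []
    for pr in merge_order(queue):
        if claimed & pr.touched_files:
            held.append(pr.number)
        else:
            mergeable.append(pr.number)
        claimed |= pr.touched_files
    return mergeable, held


queue = [
    QueuedPr(3, 100, frozenset({"api/handlers.py"})),
    QueuedPr(1, 100, frozenset({"docs/readme.md"})),
    QueuedPr(2, 50, frozenset({"api/handlers.py"})),
]
mergeable, held = plan_merges(queue)  # same result for any input ordering
```

Because the sort key is total, logging the output of `merge_order` for each batch gives the kind of ordering record the evidence list asks for.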
What It Is
The "max 2 CI rounds per PR" benchmark comes from Stripe's engineering culture, where one of the key efficiency metrics for their agent-assisted development program (the Minions model) is that each pull request should reach a green CI state within two CI runs. The first CI run is the initial submission. If the first run fails, the agent (or developer) fixes the issues and resubmits. If the second run fails, something is systematically wrong: the agent is guessing rather than reasoning, the task specification is unclear, or the test suite is flaky and unreliable.
This benchmark matters because CI compute is expensive and iteration cycles are slow. Each full CI run for a large codebase can take 10-30 minutes and cost $5-20 in compute. An agent that requires 5-8 CI rounds to produce a green build consumes 2.5-4x the resources of an agent that stays within two rounds, and its PRs are slow to close. At the scale of hundreds of agent-produced PRs per week, the gap between 2 and 5 CI rounds per PR shows up directly in both CI spend and time to merge.
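The per-PR arithmetic can be made concrete with a small sketch. It uses only the ranges quoted above and assumes rounds run one after another with no queueing delay; the function names are illustrative.

```python
RUN_MINUTES = (10, 30)   # per-run duration range quoted above
RUN_COST_USD = (5, 20)   # per-run compute cost range quoted above


def per_pr_range(rounds):
    """Return ((min_minutes, max_minutes), (min_usd, max_usd)) for one PR."""
    minutes = tuple(rounds * m for m in RUN_MINUTES)
    cost = tuple(rounds * c for c in RUN_COST_USD)
    return minutes, cost


def resource_multiplier(rounds, baseline_rounds=2):
    """How many times more CI a PR consumes than a PR at the baseline."""
    return rounds / baseline_rounds


two_rounds = per_pr_range(2)    # ((20, 60), (10, 40))
five_rounds = per_pr_range(5)   # ((50, 150), (25, 100))
```

Against a two-round baseline, `resource_multiplier(5)` is 2.5 and `resource_multiplier(8)` is 4.0.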
The deeper implication is about agent quality. An agent that consistently needs 2 or fewer CI rounds understands the codebase, reads test failures correctly, and produces targeted fixes. An agent that needs 5+ rounds is either operating without sufficient context, facing flaky tests that give misleading signals, or being given tasks that are underspecified. The 2-round benchmark is a quality signal about the agent's task execution capability, not just a cost metric.
At L3 (Systematic), measuring and targeting this benchmark means tracking CI rounds per PR as a metric, identifying why PRs exceed 2 rounds (flaky tests, poor agent context, underspecified tasks), and systematically addressing the root causes. This is distinct from L4/L5, where agents automatically analyze CI failures and self-correct within the 2-round target.
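A root-cause tally for PRs that exceed the target might look like the sketch below. The tuple layout and the cause labels are hypothetical; it assumes someone (or some tool) has tagged each over-target PR with a cause during triage.

```python
from collections import Counter

MAX_CI_ROUNDS = 2


def root_cause_breakdown(prs, max_rounds=MAX_CI_ROUNDS):
    """Tally root causes for PRs over the round target, most frequent first.

    prs: iterable of (pr_number, ci_rounds, root_cause_or_None) tuples.
    PRs without a tagged cause are counted as "untriaged".
    """
    over_target = [cause for _, rounds, cause in prs if rounds > max_rounds]
    return Counter(cause or "untriaged" for cause in over_target).most_common()


week = [
    (201, 1, None),
    (202, 4, "flaky_test"),
    (203, 3, "flaky_test"),
    (204, 5, "missing_context"),
    (205, 2, None),
    (206, 3, "flaky_test"),
]
breakdown = root_cause_breakdown(week)
```

Sorting by frequency puts the cause worth fixing first at the top, which matches the goal of addressing root causes once rather than per PR.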
Why It Matters
- Direct cost signal - each CI round costs compute; at scale, tracking rounds per PR reveals whether your agent workflow is economically efficient or wasteful; 5 rounds per PR at $10/run across 200 PRs/week is $10,000/week in CI spend, of which $6,000/week (the 3 rounds beyond the 2-round target) is avoidable
- Agent quality proxy - agents that consistently hit green in 1-2 rounds are reading failure logs correctly and producing targeted fixes; agents that need 5+ rounds are not; this metric distinguishes agent capability levels without requiring code quality review
- Identifies systemic issues - PRs that consistently require 3+ rounds often share root causes (flaky test suite, missing context file, ambiguous task specs); the metric surfaces these patterns so they can be fixed once rather than tolerated repeatedly
- Enables throughput at scale - at 1000 merges/week (Stripe scale), a 2-round average means CI runs 2000 times; a 5-round average means 5000 runs; the difference determines whether your CI infrastructure can keep up with agent throughput
- Sets a clear quality bar - "2 CI rounds or fewer" is a specific, measurable standard for agent-produced PRs; it gives teams a concrete target rather than vague quality aspirations
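The weekly figures in the list above follow from simple multiplication, sketched here. The function and its parameter names are illustrative; "avoidable" is taken to mean spend on rounds beyond the 2-round target.

```python
def weekly_ci(prs_per_week, rounds_per_pr, cost_per_run_usd, target_rounds=2):
    """Weekly CI runs, total spend, and spend on rounds beyond the target."""
    runs = prs_per_week * rounds_per_pr
    spend = runs * cost_per_run_usd
    extra_rounds = max(rounds_per_pr - target_rounds, 0)
    avoidable = prs_per_week * extra_rounds * cost_per_run_usd
    return {"runs": runs, "spend_usd": spend, "avoidable_usd": avoidable}


cost_example = weekly_ci(prs_per_week=200, rounds_per_pr=5, cost_per_run_usd=10)
at_two_rounds = weekly_ci(prs_per_week=1000, rounds_per_pr=2, cost_per_run_usd=10)
at_five_rounds = weekly_ci(prs_per_week=1000, rounds_per_pr=5, cost_per_run_usd=10)
```

For the 200 PRs/week case this gives 1,000 runs and $10,000 in total spend, with $6,000 attributable to rounds beyond the target. For 1000 merges/week it gives 2,000 runs at a 2-round average and 5,000 at a 5-round average.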
How Different Roles See It
Bob's team uses Claude Code to generate PRs and has noticed that "some PRs just keep bouncing" - they fail CI, the agent tries to fix them, they fail again, the agent tries again. Bob is paying for significant CI compute and is concerned that the cost is disproportionate to the output.
Sarah has been tracking CI costs and noticed a spike in the last two months that correlates with increased AI tool adoption. She wants to understand whether the spike represents waste (excess CI rounds due to poor agent performance) or investment (more PRs going through CI, which is expected at higher throughput).
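Sarah's waste-versus-investment question can be framed as a decomposition of the spend change into a volume effect (more PRs at the old round count) and a rounds effect (more rounds per PR at the new volume). The sketch below assumes a constant cost per run and uses made-up numbers purely for illustration.

```python
def decompose_spend_change(prs_before, rounds_before,
                           prs_after, rounds_after, cost_per_run_usd):
    """Split the change in CI spend into volume-driven and rounds-driven parts.

    volume: extra PRs, priced at the earlier rounds-per-PR (expected growth)
    rounds: extra rounds per PR, applied to the later PR volume (potential waste)
    The two parts sum exactly to the total change in spend.
    """
    volume = (prs_after - prs_before) * rounds_before * cost_per_run_usd
    rounds = prs_after * (rounds_after - rounds_before) * cost_per_run_usd
    return {"volume_usd": volume, "rounds_usd": rounds, "total_usd": volume + rounds}


# Hypothetical months: 100 PRs at 2 rounds, then 150 PRs at 3 rounds, $10/run.
change = decompose_spend_change(100, 2, 150, 3, 10)
```

With these illustrative inputs, spend rises from $2,000 to $4,500; $1,000 of the increase comes from volume and $1,500 from extra rounds per PR.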
Victor tracks his personal CI round count and is averaging 1.8 rounds per PR. His secret: he reads CI failure logs carefully and includes the relevant failure pattern in his task specification when retrying, rather than just asking the agent to "fix the CI failure." He's been meaning to write this up as a reusable pattern.
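Victor's pattern could be captured as a small helper that pulls failure lines out of a CI log and folds them into the retry task specification. This is a sketch under stated assumptions: the `FAILED`/`ERROR` line prefixes are hypothetical markers (real logs vary by test runner), and the prompt wording is illustrative.

```python
import re

# Hypothetical failure markers; adjust to whatever your test runner prints.
FAILURE_PATTERN = re.compile(r"^(FAILED|ERROR)\b.*$", re.MULTILINE)


def extract_failures(ci_log, limit=5):
    """Return up to `limit` log lines that start with a failure marker."""
    return [m.group(0).strip() for m in FAILURE_PATTERN.finditer(ci_log)][:limit]


def build_retry_spec(original_task, ci_log):
    """Append the concrete failure lines to the task instead of 'fix the CI failure'."""
    failures = extract_failures(ci_log)
    if not failures:
        return (f"{original_task}\n\n"
                "CI failed, but no failure lines were recognized; review the full log.")
    bullets = "\n".join(f"- {line}" for line in failures)
    return (f"{original_task}\n\n"
            "The previous attempt failed CI with these specific failures:\n"
            f"{bullets}\n"
            "Address these failures directly and leave unrelated code unchanged.")


sample_log = (
    "collected 3 items\n"
    "FAILED tests/test_api.py::test_timeout - AssertionError\n"
    "PASSED tests/test_ok.py::test_ok\n"
    "ERROR tests/test_db.py - ConnectionError\n"
)
retry_spec = build_retry_spec("Add a timeout to the API client.", sample_log)
```

Capping the number of extracted lines keeps the retry specification focused on the first few failures rather than flooding the agent with the whole log.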