Max 2 CI rounds per PR (Stripe benchmark)

·Policy-based merge rules are enforced (OPA, branch protection, or equivalent)
·Deterministic merge ordering with conflict detection prevents concurrent merge failures
·PRs require a maximum of 2 CI rounds before merge (Stripe benchmark)

·Merge rules are versioned as code and reviewed when changed
·PRs exceeding 2 CI rounds are flagged for investigation

Evidence

·Policy-as-code configuration (OPA rules, branch protection API config)
·CI round count per PR metrics showing 2-round maximum adherence
·Merge ordering logs showing deterministic processing

What It Is

The "max 2 CI rounds per PR" benchmark comes from Stripe's engineering culture, where one of the key efficiency metrics for their agent-assisted development program (the Minions model) is that each pull request should reach a green CI state within two CI runs. The first CI run is the initial submission. If the first run fails, the agent (or developer) fixes the issues and resubmits. If the second run fails, something is systematically wrong: the agent is guessing rather than reasoning, the task specification is unclear, or the test suite is flaky and unreliable.

This benchmark matters because CI compute is expensive and iteration cycles are slow. Each full CI run for a large codebase can take 10-30 minutes and cost $5-20 in compute. An agent that requires 5-8 CI rounds to produce a green build consumes roughly 2.5-4x the resources of an agent that stays within two rounds, and it generates PRs that are slow to close. At the scale of hundreds of agent-produced PRs per week, the difference between 2 and 5 CI rounds per PR is significant in both cost and throughput.

The deeper implication is about agent quality. An agent that consistently needs 2 or fewer CI rounds understands the codebase, reads test failures correctly, and produces targeted fixes. An agent that needs 5+ rounds is either operating without sufficient context, facing flaky tests that give misleading signals, or being given tasks that are underspecified. The 2-round benchmark is a quality signal about the agent's task execution capability, not just a cost metric.

At L3 (Systematic), measuring and targeting this benchmark means tracking CI rounds per PR as a metric, identifying why PRs exceed 2 rounds (flaky tests, poor agent context, underspecified tasks), and systematically addressing the root causes. This is distinct from L4/L5, where agents automatically analyze CI failures and self-correct within the 2-round target.

Why It Matters

Direct cost signal - each CI round costs compute; at scale, tracking rounds per PR reveals whether your agent workflow is economically efficient or wasteful; 5 rounds per PR at $10/run and 200 PRs/week is $10,000/week in CI spend, of which $6,000/week is avoidable relative to a 2-round target
Agent quality proxy - agents that consistently hit green in 1-2 rounds are reading failure logs correctly and producing targeted fixes; agents that need 5+ rounds are not; this metric distinguishes agent capability levels without requiring code quality review
Identifies systemic issues - PRs that consistently require 3+ rounds often share root causes (flaky test suite, missing context file, ambiguous task specs); the metric surfaces these patterns so they can be fixed once rather than tolerated repeatedly
Enables throughput at scale - at 1000 merges/week (Stripe scale), a 2-round average means CI runs 2000 times; a 5-round average means 5000 runs; the difference determines whether your CI infrastructure can keep up with agent throughput
Sets a clear quality bar - "2 CI rounds or fewer" is a specific, measurable standard for agent-produced PRs; it gives teams a concrete target rather than vague quality aspirations
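
The cost and throughput figures in the list above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inputs (200 PRs/week, $10/run, 1000 merges/week) are the example numbers from the list, not measured data.

```python
def weekly_ci_cost(prs_per_week: int, rounds_per_pr: float, cost_per_run: float) -> float:
    """Total weekly CI spend for a given average number of rounds per PR."""
    return prs_per_week * rounds_per_pr * cost_per_run


def weekly_excess_cost(prs_per_week: int, rounds_per_pr: float,
                       cost_per_run: float, target_rounds: float = 2.0) -> float:
    """Spend attributable to rounds beyond the target (never negative)."""
    excess_rounds = max(0.0, rounds_per_pr - target_rounds)
    return prs_per_week * excess_rounds * cost_per_run


# Example figures from the list: 200 PRs/week, 5 rounds/PR, $10 per run.
total = weekly_ci_cost(200, 5, 10.0)
excess = weekly_excess_cost(200, 5, 10.0)

# Throughput example: CI runs per week at 1000 merges/week.
runs_at_2_rounds = 1000 * 2
runs_at_5_rounds = 1000 * 5

print(f"total ${total:,.0f}/week, above 2-round target ${excess:,.0f}/week")
print(f"CI runs/week: {runs_at_2_rounds} at 2 rounds vs {runs_at_5_rounds} at 5 rounds")
```

Separating total spend from spend above the target keeps the conversation focused on the part that better agent context or less flaky tests could actually recover.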

Getting Started

Instrument CI rounds per PR - add tracking to your CI pipeline to count the number of CI runs each PR needs before it reaches green. GitHub Actions doesn't expose this directly, but you can derive it from workflow run counts per PR using the GitHub API or a tool like LinearB. Export this as a weekly average metric.
Set a baseline - run the measurement for 30 days before making changes. What is your current average CI rounds per PR? What's the distribution? Are some PRs outliers (10+ rounds) pulling the average up? Understanding the baseline tells you where to focus.
Identify the root cause categories - for PRs that required 3+ CI rounds in the last 30 days, categorize why: (a) flaky tests that passed on retry, (b) agent misread the failure and fixed the wrong thing, (c) task specification was unclear, (d) missing codebase context. Each category has a different fix.
Fix flaky tests first - flaky tests are the single biggest driver of excess CI rounds in most codebases. A test that fails 20% of the time forces agents to run CI multiple times just to confirm their fix is correct. Quarantine or fix your top 10 flaky tests before optimizing anything else.
Improve agent failure analysis - configure your agents to receive the full CI failure log (not just "CI failed") and explicit instructions for how to analyze failures: "read the test output carefully, identify the specific failing assertion, determine if the test is flaky or if your code change caused the failure." Better failure analysis context reduces the rounds needed to fix a failure.
Track improvement over 90 days - after fixing flaky tests and improving agent context, measure CI rounds per PR at 30, 60, and 90 days. Target: below 2.5 rounds average within 90 days. If you're not improving, the bottleneck is somewhere you haven't yet addressed.
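
As a rough illustration of the instrumentation and baseline steps above, the sketch below counts CI rounds per PR from a list of run records. The record shape (`pr`, `created_at`, `conclusion`) is an assumption made for this example; it is loosely modeled on what a CI provider's API might return, and you would need to map your own provider's fields onto it.

```python
from collections import defaultdict
from statistics import mean


def rounds_to_green(runs):
    """Map PR number -> number of CI runs up to and including the first success.

    PRs that never reached green are left out of the result.
    """
    by_pr = defaultdict(list)
    for run in runs:
        by_pr[run["pr"]].append(run)

    result = {}
    for pr, pr_runs in by_pr.items():
        # ISO 8601 timestamps in a uniform format sort correctly as strings.
        pr_runs.sort(key=lambda r: r["created_at"])
        for round_number, run in enumerate(pr_runs, start=1):
            if run["conclusion"] == "success":
                result[pr] = round_number
                break
    return result


# Hypothetical sample data; PR numbers and timestamps are made up.
sample_runs = [
    {"pr": 101, "created_at": "2024-05-01T10:00:00Z", "conclusion": "failure"},
    {"pr": 101, "created_at": "2024-05-01T10:40:00Z", "conclusion": "success"},
    {"pr": 102, "created_at": "2024-05-01T11:00:00Z", "conclusion": "success"},
    {"pr": 103, "created_at": "2024-05-02T09:00:00Z", "conclusion": "failure"},
    {"pr": 103, "created_at": "2024-05-02T09:30:00Z", "conclusion": "failure"},
    {"pr": 103, "created_at": "2024-05-02T10:00:00Z", "conclusion": "success"},
    {"pr": 104, "created_at": "2024-05-02T12:00:00Z", "conclusion": "failure"},
]

rounds = rounds_to_green(sample_runs)
average_rounds = mean(rounds.values())
over_target = sorted(pr for pr, n in rounds.items() if n > 2)

print(rounds, average_rounds, over_target)
```

PRs that have not yet gone green (104 in the sample) are excluded here; whether to count them separately as "still open" is a reporting choice this sketch does not make for you.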

Tip

Separate the "1st CI round fail rate" from the "2nd CI round fail rate." If 40% of PRs fail on the first CI run but only 5% fail on the second, that's a different problem than 40% failing on the first AND 30% failing on the second. The first pattern suggests task complexity; the second suggests agent inability to read and respond to failures.
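
One way to compute the two rates is sketched below. It assumes each PR's CI history is available as an ordered list of pass/fail booleans, and it expresses both rates as a share of all PRs, which is one reading of the percentages above; a rate conditional on first-round failures is an equally reasonable alternative. The function name and sample data are hypothetical.

```python
def round_fail_rates(outcomes_by_pr):
    """Return (first_round_fail_rate, second_round_fail_rate).

    outcomes_by_pr maps a PR id to an ordered list of booleans, one per CI
    round (True = green). Both rates are shares of all PRs with at least
    one CI run.
    """
    total = first_failed = second_failed = 0
    for outcomes in outcomes_by_pr.values():
        if not outcomes:
            continue
        total += 1
        if outcomes[0]:
            continue  # green on the first round
        first_failed += 1
        if len(outcomes) > 1 and not outcomes[1]:
            second_failed += 1
    if total == 0:
        return 0.0, 0.0
    return first_failed / total, second_failed / total


# Hypothetical sample: 10 PRs, 4 fail round one, 1 of those also fails round two.
sample = {pr: [True] for pr in range(1, 7)}
sample.update({7: [False, True], 8: [False, True], 9: [False, True]})
sample[10] = [False, False, True]

first_rate, second_rate = round_fail_rates(sample)
print(f"1st round fail rate {first_rate:.0%}, 2nd round fail rate {second_rate:.0%}")
```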


Common Pitfalls

Optimizing for the metric instead of the underlying quality. An agent that retries CI with trivially different code until it passes by chance is gaming the 2-round metric while producing low-quality PRs. Measure rounds per PR alongside PR quality (post-merge incident rate, reviewer rejection rate) to ensure the metric reflects genuine quality improvement.

Treating flaky tests as acceptable overhead. Teams often accept 15-20% flaky test rates as "normal." For human developers who understand test flakiness, this is frustrating but manageable. For agents that read test failures as ground truth, flaky tests are catastrophic: the agent makes changes to "fix" a test that wasn't actually failing due to a bug, introducing regressions. Flaky tests are a more serious problem in agent-assisted workflows than in human workflows.

Not giving agents access to CI failure logs. If agents can only see "CI failed" without the actual log output, they have to guess what went wrong. Most agents will guess wrong at least once. Ensure your agent workflow provides full test output, stack traces, and environment details. The failure log is the most important context for the second CI round.

Measuring average rounds instead of distribution. An average of 2.1 rounds per PR looks good but can hide a tail of 10 PRs per week that require 8+ rounds each. Those tail PRs are the signal - they indicate systematic problems with specific task categories or code areas. Measure P90 and P99 rounds per PR alongside the average.
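
The effect is easy to reproduce with made-up numbers. The sketch below uses a nearest-rank percentile (one of several common percentile definitions, chosen here for simplicity) on a hypothetical week of 80 PRs in which 10 tail PRs need 8 rounds each.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it. pct is a number in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical week: 70 PRs go green in 1-2 rounds, 10 tail PRs need 8 rounds.
rounds = [1] * 52 + [2] * 18 + [8] * 10

avg = mean(rounds)
p90 = percentile(rounds, 90)
p99 = percentile(rounds, 99)

print(f"average {avg}, P90 {p90}, P99 {p99}")
```

In this sample the average sits close to the 2-round target while P90 lands on the tail value, which is the pattern the average alone would hide.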

Ignoring the queue-round interaction. A PR that requires 4 CI rounds before hitting green but does so quickly (10-minute CI) has less impact on merge queue throughput than a PR that requires 2 rounds with 30-minute CI. The metric that matters for throughput is total CI wall time per PR, not just round count. Track both.
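
The comparison above is simple multiplication, but it is worth recording next to the round count. A minimal sketch, assuming every round takes the same time (real runs vary, and queue wait time is ignored here); the function name is hypothetical:

```python
def ci_wall_minutes(rounds: int, minutes_per_run: float) -> float:
    """Total CI wall time for a PR, assuming uniform run duration."""
    return rounds * minutes_per_run


# The two PRs from the paragraph above.
many_fast_rounds = ci_wall_minutes(rounds=4, minutes_per_run=10)
few_slow_rounds = ci_wall_minutes(rounds=2, minutes_per_run=30)

print(many_fast_rounds, few_slow_rounds)
```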


How Different Roles See It

Bob, Head of Engineering

Bob's team uses Claude Code to generate PRs and has noticed that "some PRs just keep bouncing" - they fail CI, the agent tries to fix them, they fail again, the agent tries again. Bob is paying for significant CI compute and is concerned that the cost is disproportionate to the output.

What Bob should do: Bob should pull the CI round distribution for the last 30 days. If the distribution has a long tail (for example, 10% of PRs require 5+ rounds), he should investigate those specific PRs: what tasks were they for? What CI failures did they encounter? Were those failures real, or were they flaky tests? Typically 2-3 root causes explain about 80% of the long tail. Bob should fix those root causes (most commonly by quarantining flaky tests and adding more context to the CLAUDE.md for problem-prone code areas) and re-measure. The goal is to get the P90 below 3 rounds within 60 days.
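
A small sketch of that investigation follows. It assumes each PR record already carries a round count and a manually assigned root-cause label; the field names, cause labels, and sample data are hypothetical.

```python
from collections import Counter


def tail_report(prs, tail_threshold=5):
    """Return (share of PRs at or above the threshold, root causes ranked by count)."""
    tail = [pr for pr in prs if pr["rounds"] >= tail_threshold]
    share = len(tail) / len(prs) if prs else 0.0
    causes = Counter(pr["root_cause"] for pr in tail)
    return share, causes.most_common()


# Hypothetical 30-day sample: 16 PRs green on the first round, 4 in the long tail.
prs = [{"number": n, "rounds": 1, "root_cause": None} for n in range(1, 17)]
prs += [
    {"number": 17, "rounds": 6, "root_cause": "flaky_test"},
    {"number": 18, "rounds": 5, "root_cause": "flaky_test"},
    {"number": 19, "rounds": 9, "root_cause": "missing_context"},
    {"number": 20, "rounds": 7, "root_cause": "flaky_test"},
]

tail_share, ranked_causes = tail_report(prs)
print(f"{tail_share:.0%} of PRs needed 5+ rounds; causes: {ranked_causes}")
```

Ranking causes by count makes it clear which fix to fund first; in this made-up sample, flaky tests account for three of the four tail PRs.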


Sarah, Productivity Lead

Sarah has been tracking CI costs and noticed a spike in the last two months that correlates with increased AI tool adoption. She wants to understand whether the spike represents waste (excess CI rounds due to poor agent performance) or investment (more PRs going through CI, which is expected at higher throughput).

What Sarah should do: Sarah should separate volume from waste by calculating CI runs per merged PR rather than total CI runs. If total CI runs increased 3x but merged PRs increased only 1.5x, runs per merged PR have doubled, and the additional runs are excess rounds - waste. If both increased proportionally, the cost increase is expected and appropriate. Presenting this decomposition to leadership converts "CI costs went up" into either "CI costs went up because we're shipping more" or "CI costs went up because agents are less efficient than expected, and here's why." The second framing is a problem; the first is progress. Sarah should also track the trend over time: as agents improve and flaky tests are fixed, rounds per PR should decrease even as total PR volume increases.
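
The decomposition can be sketched as a comparison of CI runs per merged PR across two periods. The 10% tolerance, the labels, and the sample counts below are assumptions chosen for illustration, not recommended thresholds.

```python
def ci_runs_per_merged_pr(total_runs: int, merged_prs: int) -> float:
    """Average CI runs consumed per merged PR in a period."""
    if total_runs <= 0 or merged_prs <= 0:
        raise ValueError("total_runs and merged_prs must be positive")
    return total_runs / merged_prs


def classify_cost_growth(before, after, tolerance=0.10):
    """Classify a change between two periods.

    before and after are (total_runs, merged_prs) tuples. The tolerance is
    the relative change in runs-per-PR treated as noise (hypothetical value).
    """
    ratio_before = ci_runs_per_merged_pr(*before)
    ratio_after = ci_runs_per_merged_pr(*after)
    change = ratio_after / ratio_before - 1.0
    if change > tolerance:
        return "excess rounds"
    if change < -tolerance:
        return "improving efficiency"
    return "volume growth"


# Runs up 3x, merged PRs up 1.5x: runs per PR doubled.
wasteful = classify_cost_growth(before=(1000, 500), after=(3000, 750))
# Runs and merged PRs both up 2x: runs per PR unchanged.
proportional = classify_cost_growth(before=(1000, 500), after=(2000, 1000))

print(wasteful, proportional)
```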


Victor, Staff Engineer - AI Champion

Victor tracks his personal CI round count and is averaging 1.8 rounds per PR. His secret: he reads CI failure logs carefully and includes the relevant failure pattern in his task specification when retrying, rather than just asking the agent to "fix the CI failure." He's been meaning to write this up as a reusable pattern.

What Victor should do: Victor should formalize his CI failure analysis pattern as a reusable agent workflow. The pattern: (1) extract the specific failing assertion from the CI log, (2) determine whether it's a flaky test (by checking a known flaky test list) or a genuine failure, (3) include the specific failure context in the retry prompt, (4) if the PR is still failing after two rounds, escalate to human review rather than continuing to retry. Documented and added to the team's CLAUDE.md, this pattern can help move the team's average CI rounds toward Victor's 1.8 baseline. That's the kind of systemic improvement that justifies documentation investment.
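
The four-step pattern might be encoded roughly as follows. This is a sketch, not Victor's actual workflow: it assumes step (1) has already produced a structured failure record, and the `CiFailure` type, the flaky-test list, the action labels, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass

MAX_ROUNDS = 2
KNOWN_FLAKY_TESTS = frozenset({"test_checkout_timeout"})  # hypothetical list


@dataclass
class CiFailure:
    test_name: str      # failing test, extracted from the CI log (step 1)
    assertion: str      # the specific failing assertion text
    round_number: int   # which CI round produced this failure


def next_action(failure, known_flaky=KNOWN_FLAKY_TESTS, max_rounds=MAX_ROUNDS):
    """Return (action, message) for a failed CI round."""
    # Step 2: a known-flaky test is not evidence of a bug in the change.
    if failure.test_name in known_flaky:
        return ("rerun", f"{failure.test_name} is on the known-flaky list; "
                         "rerun CI without changing code.")
    # Step 4: stop retrying once the round budget is spent.
    if failure.round_number >= max_rounds:
        return ("escalate", f"Still failing after {failure.round_number} rounds; "
                            "hand off to human review.")
    # Step 3: retry with the specific failure context in the prompt.
    prompt = (f"CI round {failure.round_number} failed in {failure.test_name}.\n"
              f"Failing assertion: {failure.assertion}\n"
              "Explain the cause, then make the smallest change that fixes it.")
    return ("retry", prompt)


action, message = next_action(CiFailure("test_order_total", "assert total == 42", 1))
print(action)
print(message)
```

Returning an explicit "escalate" action keeps the two-round budget enforceable in code rather than relying on the agent to decide when to stop.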
