ITS (Iterations-to-Success): target 1-3

Iterations-to-Success (ITS) is an AI-native metric that measures how many CI attempts it takes for an agent's PR to pass.

·ITS (Iterations-to-Success) is tracked with a target of 1-3
·CPI (Cost-per-Iteration) is tracked with a target under $0.50
·CI feedback latency is tracked as a metric (time from push to CI result)

·Metrics are broken down per team and per repository
·Cost tracking includes model API costs, CI compute costs, and runner costs per iteration

Evidence

·ITS dashboard showing iteration count distribution per PR
·CPI dashboard showing cost per CI iteration
·CI feedback latency chart with P50, P95, P99 breakdowns
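
The latency breakdown in the last item can be sketched in a few lines. This is a minimal illustration, assuming you already have push-to-result latencies in seconds; the function name and the input shape are hypothetical, not part of any CI platform's API.

```python
from statistics import quantiles


def latency_percentiles(latencies_seconds):
    """Return P50, P95, and P99 of push-to-CI-result latencies (in seconds)."""
    if len(latencies_seconds) < 2:
        raise ValueError("need at least two samples to estimate percentiles")
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    cuts = quantiles(latencies_seconds, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

The "inclusive" method treats the samples as the full population for the period, which is a reasonable default for a dashboard; other interpolation methods give slightly different tail values.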

What It Is

Iterations-to-Success (ITS) is an AI-native metric that measures how many CI attempts it takes for an agent's PR to pass. When an agent opens a PR, it runs CI. If CI fails, the agent reads the failure, modifies the code, and pushes another commit. CI runs again. The cycle continues until CI passes (or the agent gives up or hits a timeout). ITS counts those cycles: how many times did CI run before the PR was green?

A target of 1-3 ITS means: a well-functioning agent workflow should get to a passing CI result within one to three attempts. ITS of 1 means the agent's first commit passed CI - near-perfect task execution. ITS of 2-3 means the agent needed to correct course once or twice - acceptable but worth investigating the failure mode. ITS of 4+ means the agent is thrashing - trying variations without clear progress, burning through compute and CI time, and likely producing progressively worse code as it over-corrects.
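
As a rough sketch, the bands above could be encoded like this. The band labels are hypothetical names chosen for illustration; only the thresholds come from the text.

```python
def classify_its(its: int) -> str:
    """Map an ITS value to the bands described above."""
    if its < 1:
        raise ValueError("ITS counts CI runs, so it must be at least 1")
    if its == 1:
        return "first-pass"   # the agent's first commit passed CI
    if its <= 3:
        return "acceptable"   # one or two corrections; check the failure mode
    return "thrashing"        # 4+: repeated attempts without clear progress
```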

ITS is one of the most important AI-native metrics because it directly measures agent quality, not just agent output. High ITS values reveal specific, actionable problems: insufficient context (the agent didn't know about a dependency), unclear task specification (the agent was solving the wrong problem), flaky tests (the agent is fighting test failures that aren't its fault), or slow CI feedback loops (the agent is guessing instead of using feedback efficiently). Each of these has a different fix, and ITS is the signal that tells you the fix is needed.

At L3 (Systematic), teams that track ITS typically find a wide distribution: some agents consistently achieve ITS of 1, others regularly hit ITS of 5-8. The high-ITS cases are where the AI investment is being wasted. An agent that takes 8 CI cycles to complete a task has consumed roughly 8x the per-iteration cost of a task that passes on its first run, and it has likely produced lower-quality code because the late iterations are increasingly reactive rather than thoughtful. Tracking ITS by task type and agent configuration reveals where the optimization opportunities are.
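
One possible way to segment ITS by task type and agent configuration is a simple group-by. The record shape (task type, agent configuration, ITS) and the example labels are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import median


def median_its_by_segment(records):
    """Group (task_type, agent_config, its) records and return the median ITS per group."""
    groups = defaultdict(list)
    for task_type, agent_config, its in records:
        groups[(task_type, agent_config)].append(its)
    return {segment: median(values) for segment, values in groups.items()}


# Hypothetical usage: segments with a high median are the first places to look.
records = [
    ("test-writing", "config-a", 1),
    ("test-writing", "config-a", 2),
    ("feature", "config-a", 5),
    ("feature", "config-a", 8),
    ("feature", "config-b", 2),
]
by_segment = median_its_by_segment(records)
```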

Why It Matters

·Directly measures agent quality - ITS is the clearest available signal for whether your agent setup is working well; low ITS means the agent has good context, clear tasks, and reliable CI feedback; high ITS means something is broken in the pipeline
·Enables cost management - each CI iteration has a cost (compute time, token cost, CI runner time); teams that track ITS can compute the direct cost of agent thrashing and prioritize fixes accordingly
·Identifies flaky test pollution - when ITS is high on many different task types, flaky tests are often the root cause; the agent is retrying because of test failures that aren't real; ITS makes this visible in a way that PR-level analysis misses
·Drives context and specification quality - high ITS on a specific task type or agent configuration is a signal to improve either the context window contents or the task specification format; ITS creates a feedback loop for improving agent setup
·Enables SLO-like commitments - teams with consistent ITS tracking can make commitments like "95% of agent PRs complete in 3 or fewer CI iterations"; this is a meaningful quality SLO for agent-powered delivery
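
The cost and SLO points above reduce to simple arithmetic. The sketch below assumes that cost per iteration is the sum of the three components named earlier and that "thrashing cost" means the cost of iterations beyond the target of 3; both are simplifying assumptions, and all function names are hypothetical.

```python
def cost_per_iteration(model_api_usd, ci_compute_usd, runner_usd):
    """CPI as the sum of model API, CI compute, and runner costs for one iteration."""
    return model_api_usd + ci_compute_usd + runner_usd


def excess_iteration_cost(its_values, cpi_usd, target_its=3):
    """Cost of the iterations beyond the target, summed over all PRs."""
    excess_runs = sum(max(0, its - target_its) for its in its_values)
    return excess_runs * cpi_usd


def slo_compliance(its_values, max_its=3):
    """Fraction of PRs that went green within max_its iterations."""
    if not its_values:
        raise ValueError("no ITS values to evaluate")
    return sum(1 for its in its_values if its <= max_its) / len(its_values)
```

A commitment such as "95% of agent PRs complete in 3 or fewer CI iterations" then becomes a check that `slo_compliance(its_values) >= 0.95` for the reporting period.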

Getting Started

·Define iteration precisely - An iteration is one CI run triggered by an agent commit. A human-pushed commit that fixes an agent's PR does not count. A CI run triggered by a merge to main does not count. Only agent-triggered CI runs on a feature branch, from the first agent commit to the last one before human review, count toward ITS.
·Instrument your CI webhook - Most CI platforms (GitHub Actions, CircleCI, Buildkite) emit webhooks on CI run start and completion. Build a lightweight listener that tracks: PR identifier, commit SHA, commit author (is it an agent?), CI result (pass/fail), and timestamp. This data feeds the ITS calculation.
·Tag agent-authored commits - Agents that commit code should include a recognizable signature in the commit message or in a Git trailer (e.g., Co-authored-by: claude-code). This lets you identify agent commits in the webhook stream and separate them from human commits.
·Build a per-PR ITS report - For each PR tagged as agent-authored, compute ITS as the count of agent-triggered CI runs up to and including the first passing run, so a PR that passes on its first run has an ITS of 1. Publish this per-PR and as a weekly aggregate (median ITS, 90th percentile ITS, percentage of PRs with ITS > 3).
·Segment ITS by task type - ITS varies by task type. Test writing typically has low ITS (tests either pass or they don't, and agents are good at this). New feature implementation has higher ITS (more context dependencies). Bug fixes can be very high or very low depending on how well the bug is specified. Segment by task type to identify which types need the most improvement.
·Set a quarterly ITS reduction target - If your current median ITS is 4.5, set a target of 3.0 for next quarter. The path to lower ITS involves improving context, improving task specification, and reducing flaky tests. Track which improvements produce the biggest ITS reduction.
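
The first four steps can be sketched as a small calculation over recorded CI runs. This is an illustration under stated assumptions, not a reference implementation: the `CIRun` record, the `AGENT_TRAILER_NAMES` set, and both function names are hypothetical, and the webhook listener that would populate the records is out of scope. The passing run is included in the count so that a first-commit pass yields an ITS of 1.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional

# Assumption: trailer values that identify agent-authored commits.
AGENT_TRAILER_NAMES = {"claude-code"}


@dataclass(frozen=True)
class CIRun:
    pr_id: str
    commit_sha: str
    commit_message: str
    passed: bool
    finished_at: datetime


def is_agent_commit(message: str) -> bool:
    """True if the commit message carries a Co-authored-by trailer naming an agent."""
    for line in message.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "co-authored-by":
            name = value.split("<")[0].strip().lower()
            if name in AGENT_TRAILER_NAMES:
                return True
    return False


def iterations_to_success(runs: Iterable[CIRun]) -> Optional[int]:
    """Count agent-triggered runs for one PR up to and including the first pass.

    Returns None if no agent-triggered run has passed yet.
    """
    agent_runs = sorted(
        (run for run in runs if is_agent_commit(run.commit_message)),
        key=lambda run: run.finished_at,
    )
    for count, run in enumerate(agent_runs, start=1):
        if run.passed:
            return count
    return None
```

Returning None for PRs that never went green keeps abandoned or timed-out PRs out of the ITS distribution; you may prefer to report those separately.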

Tip

ITS spikes after codebase changes that introduce new dependencies or patterns. Monitor ITS when your team does major refactors, library upgrades, or architecture changes. A sudden ITS increase is often the first signal that agents don't have enough context about the new patterns to complete tasks efficiently.
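
A simple way to monitor for such spikes is to compare a recent window against a baseline window. The sketch below is one possible heuristic; the 1.5x ratio threshold is an arbitrary assumption that you would tune to your own data.

```python
from statistics import median


def its_spike(baseline_its, recent_its, ratio_threshold=1.5):
    """Flag a spike when the recent median ITS reaches ratio_threshold x the baseline median."""
    if not baseline_its or not recent_its:
        raise ValueError("both windows need at least one ITS value")
    return median(recent_its) >= ratio_threshold * median(baseline_its)
```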

Common Pitfalls

Counting human-triggered CI runs. If you count every CI run on an agent's branch including the ones humans trigger manually, ITS becomes meaningless. The metric is strictly about agent-triggered CI runs. This requires careful commit attribution - make sure your instrumentation can distinguish agent commits from human commits on the same branch.

Using ITS as a developer performance metric. High ITS for a developer's agents reflects the quality of their task specifications and context setup, not their engineering skill. Using ITS as part of developer performance evaluation creates perverse incentives: developers will give agents simpler tasks to keep ITS low, rather than tackling harder problems where agent assistance would add more value.

Not accounting for intentional iteration. Some agent workflows intentionally include multiple CI rounds as part of the process: a "red-green-refactor" agent that writes a failing test, implements the fix, and refactors will have an ITS of at least 2 by design. These intentional multi-iteration patterns should be excluded from ITS targets or tracked separately with a different target.
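
One way to track these workflows separately is a per-workflow target table. The workflow labels and the target of 4 for red-green-refactor are hypothetical values chosen for illustration; only the default target of 3 comes from the text.

```python
# Assumption: PRs are labeled with the agent workflow that produced them.
ITS_TARGETS = {
    "default": 3,
    "red-green-refactor": 4,  # hypothetical target for a workflow that iterates by design
}


def its_target_for(workflow: str) -> int:
    """Look up the ITS target for a workflow, falling back to the default."""
    return ITS_TARGETS.get(workflow, ITS_TARGETS["default"])


def within_target(its: int, workflow: str = "default") -> bool:
    """True if the PR's ITS is at or below the target for its workflow."""
    return its <= its_target_for(workflow)
```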

Ignoring the tail distribution. Median ITS can look healthy (2.0) while the 90th percentile is broken (12). The tail represents the worst agent experiences - the ones consuming the most resources and producing the most frustration. Track and act on the 90th percentile, not just the median.
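
A short sketch of a summary that reports the tail alongside the median. It uses the nearest-rank definition of a percentile, which is one of several common conventions; the function names are hypothetical.

```python
from statistics import median


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no ITS values")
    ordered = sorted(values)
    # Integer ceiling of len * pct / 100, avoiding floating-point rounding.
    rank = max(1, (len(ordered) * pct + 99) // 100)
    return ordered[rank - 1]


def its_summary(its_values):
    """Median, 90th percentile, and share of PRs with ITS above 3."""
    return {
        "median": median(its_values),
        "p90": nearest_rank_percentile(its_values, 90),
        "pct_over_3": 100 * sum(1 for v in its_values if v > 3) / len(its_values),
    }


# Hypothetical week: the median looks healthy while the tail does not.
summary = its_summary([1, 1, 2, 2, 2, 2, 2, 3, 12, 12])
```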

Treating ITS reduction as the only goal. An ITS of 1 could mean the agent is producing perfect code, or it could mean the task was so trivially simple that no iteration was needed. ITS should be tracked alongside task complexity and PR size to ensure that low ITS reflects genuine agent quality rather than agents being assigned only trivial work.

How Different Roles See It

Bob, Head of Engineering

Bob has deployed agent workflows across three teams and is seeing significant variance in agent productivity. Team A's agents seem to produce working code quickly; Team B's agents require lots of human intervention. Bob doesn't have the data to understand why.

What Bob should do: Bob should instrument ITS as the first step toward understanding the variance. Once he has per-team ITS distributions, the picture usually clarifies quickly: Team B likely has much higher ITS (4-7) compared to Team A (1-2). The next question is why. Is it task specification quality? Is it a particular agent configuration? Is Team B working in a part of the codebase with more flaky tests? Bob should schedule a cross-team "ITS review" - a monthly meeting where each team shares their ITS data, their highest-ITS task types, and what they've tried to improve. The shared learning accelerates improvement faster than each team experimenting independently.

Sarah, Productivity Lead

Sarah has heard about ITS as a metric but isn't sure how to present it to engineering managers who are skeptical of AI-specific metrics. The managers want to understand what they should actually do when ITS is high.

What Sarah should do: Sarah should build a simple ITS decision tree and include it in the engineering manager playbook. When ITS is consistently above 3 for a specific task type: (1) check if the tests for that area are flaky (look at TORS for that test suite), (2) review the context window contents the agent is given for that task type, (3) review the task specification format and improve it based on the most common failure modes. Sarah should present ITS alongside a one-page troubleshooting guide so managers have a clear action to take when the number is bad. Without the decision tree, ITS is just a number. With the decision tree, it's a diagnostic tool that drives concrete improvements.
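
The decision tree could be written down as an ordered set of checks. This sketch interprets "consistently above 3" as a median above 3, which is an assumption; the two boolean inputs stand in for the manual reviews described above, and the returned strings are illustrative.

```python
from statistics import median


def next_action(its_values, tests_are_flaky, context_is_adequate):
    """Return the first applicable step of the ITS troubleshooting order for one task type."""
    if median(its_values) <= 3:
        return "no action: ITS is within target"
    if tests_are_flaky:
        return "address flaky tests in this area"
    if not context_is_adequate:
        return "improve the context the agent is given"
    return "improve the task specification format"
```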

Victor, Staff Engineer - AI Champion

Victor tracks his own ITS obsessively. He knows his average ITS is 1.4 for test writing tasks and 2.8 for new feature implementation. He's worked to bring both numbers down by improving his context files and task specification templates. He considers ITS his primary agent health signal.

What Victor should do: Victor should formalize his context and specification improvements as shareable templates. The CLAUDE.md files, agent context documents, and task specification formats that drive his low ITS are an organizational asset that other developers don't have access to. Victor should create a library of high-quality agent context templates organized by task type (test writing, bug fix, feature implementation, refactoring, documentation) and share them as a platform contribution. Each template should include the context that reduces ITS for that task type. Victor should also track ITS before and after other developers adopt his templates - this creates the evidence base that the templates work, which is the argument for making them the team standard.
