ITS (Iterations-to-Success): target 1-3
Iterations-to-Success (ITS) is an AI-native metric that measures how many CI attempts it takes for an agent's PR to pass.
- ITS (Iterations-to-Success) is tracked with a target of 1-3
- CPI (Cost-per-Iteration) is tracked with a target under $0.50
- CI feedback latency is tracked as a metric (time from push to CI result)
- Metrics are broken down per team and per repository
- Cost tracking includes model API costs, CI compute costs, and runner costs per iteration
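The cost breakdown in the checklist above can be sketched as a small calculation. This is a minimal illustration, not a prescribed implementation: the `IterationCost` class, its field names, and the choice to average across a PR's iterations are assumptions made for the example. Only the three cost components and the $0.50 target come from the checklist.

```python
from dataclasses import dataclass

# Target from the checklist: CPI under $0.50.
CPI_TARGET_USD = 0.50


@dataclass(frozen=True)
class IterationCost:
    """Cost components for one CI iteration (hypothetical field names)."""
    model_api_usd: float
    ci_compute_usd: float
    runner_usd: float

    @property
    def total_usd(self) -> float:
        # Sum of the three components named in the checklist.
        return self.model_api_usd + self.ci_compute_usd + self.runner_usd


def cost_per_iteration(iterations: list[IterationCost]) -> float:
    """Mean total cost across the CI iterations of one PR (assumed definition)."""
    if not iterations:
        raise ValueError("need at least one iteration")
    return sum(it.total_usd for it in iterations) / len(iterations)


def meets_cpi_target(iterations: list[IterationCost]) -> bool:
    """True when the mean cost per iteration is under the target."""
    return cost_per_iteration(iterations) < CPI_TARGET_USD
```

Multiplying the per-iteration cost by a PR's ITS gives the total CI-loop cost of that PR, which is the number to compare across teams and repositories.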
Evidence
- ITS dashboard showing iteration count distribution per PR
- CPI dashboard showing cost per CI iteration
- CI feedback latency chart with P50, P95, P99 breakdowns
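The P50, P95, and P99 breakdowns for CI feedback latency could be computed along these lines. This is a sketch under stated assumptions: latencies are already available as push-to-result durations in seconds, and the nearest-rank percentile definition is used. Other definitions (for example, linear interpolation) give slightly different values on small samples. The function names are hypothetical.

```python
def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("need at least one value")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(pct / 100 * n), which avoids float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """P50/P95/P99 of push-to-CI-result latency, in seconds."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}
```

With only a handful of samples, P95 and P99 collapse to the slowest observation, so the tail percentiles are only informative once a repository has accumulated a reasonable number of CI runs.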
What It Is
Iterations-to-Success (ITS) is an AI-native metric that measures how many CI attempts it takes for an agent's PR to pass. When an agent opens a PR, it runs CI. If CI fails, the agent reads the failure, modifies the code, and pushes another commit. CI runs again. The cycle continues until CI passes (or the agent gives up or hits a timeout). ITS counts those cycles: how many times did CI run before the PR was green?
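The counting rule described above can be sketched as follows. The `CiRun` record and its fields are hypothetical; in practice the runs would come from whatever CI system is in use. The sketch assumes ITS is the number of CI runs up to and including the first passing run, and that a PR that never goes green has no ITS value.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CiRun:
    """One CI run on an agent PR (hypothetical record shape)."""
    pr_id: str
    attempt: int  # 1-based order of the run within the PR
    passed: bool


def iterations_to_success(runs: list[CiRun]) -> Optional[int]:
    """Number of CI runs up to and including the first pass, or None if never green."""
    for count, run in enumerate(sorted(runs, key=lambda r: r.attempt), start=1):
        if run.passed:
            return count
    return None


def its_by_pr(runs: list[CiRun]) -> dict[str, Optional[int]]:
    """Group runs by PR and compute ITS for each PR."""
    grouped: dict[str, list[CiRun]] = defaultdict(list)
    for run in runs:
        grouped[run.pr_id].append(run)
    return {pr_id: iterations_to_success(pr_runs) for pr_id, pr_runs in grouped.items()}
```

Keeping the "never passed" case as `None` rather than a large number avoids skewing averages and makes abandoned or timed-out PRs easy to report separately.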
A target of 1-3 ITS means: a well-functioning agent workflow should get to a passing CI result within one to three attempts. ITS of 1 means the agent's first commit passed CI - near-perfect task execution. ITS of 2-3 means the agent needed to correct course once or twice - acceptable but worth investigating the failure mode. ITS of 4+ means the agent is thrashing - trying variations without clear progress, burning through compute and CI time, and likely producing progressively worse code as it over-corrects.
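The three bands above map onto a simple classifier, which is also enough to build the iteration count distribution for a dashboard. The band labels (`first-pass`, `corrected`, `thrashing`) are names chosen for this sketch, not terms from any tool.

```python
from collections import Counter


def classify_its(its: int) -> str:
    """Map an ITS value to one of the three bands described above."""
    if its < 1:
        raise ValueError("ITS must be at least 1")
    if its == 1:
        return "first-pass"  # first commit passed CI
    if its <= 3:
        return "corrected"   # one or two course corrections
    return "thrashing"       # 4+ iterations


def its_distribution(its_values: list[int]) -> dict[str, int]:
    """Count how many PRs fall into each band."""
    return dict(Counter(classify_its(v) for v in its_values))
```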
ITS is one of the most important AI-native metrics because it directly measures agent quality, not just agent output. High ITS values reveal specific, actionable problems: insufficient context (the agent didn't know about a dependency), unclear task specification (the agent was solving the wrong problem), flaky tests (the agent is fighting test failures that aren't its fault), or slow CI feedback loops (the agent is guessing instead of using feedback efficiently). Each of these has a different fix, and ITS is the signal that tells you the fix is needed.
At L3 (Systematic), teams that track ITS typically find a wide distribution: some agents consistently achieve ITS of 1, others regularly hit ITS of 5-8. The high-ITS cases are where the AI investment is being wasted. An agent that takes 8 CI cycles to complete a task has consumed roughly eight times the CI and model cost of a task that passes on the first attempt, and has likely produced lower-quality code because the late iterations are increasingly reactive rather than thoughtful. Tracking ITS by task type and agent configuration reveals where the optimization opportunities are.
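Breaking ITS down by task type and agent configuration might look like the sketch below. The input shape (tuples of task type, agent configuration, and ITS), the use of the median as the summary statistic, and the default threshold of 3 are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

Segment = tuple[str, str]  # (task_type, agent_config)


def its_by_segment(records: list[tuple[str, str, int]]) -> dict[Segment, float]:
    """Median ITS for each (task_type, agent_config) pair."""
    groups: dict[Segment, list[int]] = defaultdict(list)
    for task_type, agent_config, its in records:
        groups[(task_type, agent_config)].append(its)
    return {segment: median(values) for segment, values in groups.items()}


def worst_segments(
    records: list[tuple[str, str, int]], threshold: float = 3
) -> list[tuple[Segment, float]]:
    """Segments whose median ITS exceeds the threshold, worst first."""
    flagged = [(seg, m) for seg, m in its_by_segment(records).items() if m > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

The segments returned by `worst_segments` are the candidates for better context files or tighter task specifications.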
Why It Matters
- Directly measures agent quality - ITS is the clearest available signal for whether your agent setup is working well; low ITS means the agent has good context, clear tasks, and reliable CI feedback; high ITS means something is broken in the pipeline
- Enables cost management - each CI iteration has a cost (compute time, token cost, CI runner time); teams that track ITS can compute the direct cost of agent thrashing and prioritize fixes accordingly
- Identifies flaky test pollution - when ITS is high on many different task types, flaky tests are often the root cause; the agent is retrying because of test failures that aren't real; ITS makes this visible in a way that PR-level analysis misses
- Drives context and specification quality - high ITS on a specific task type or agent configuration is a signal to improve either the context window contents or the task specification format; ITS creates a feedback loop for improving agent setup
- Enables SLO-like commitments - teams with consistent ITS tracking can make commitments like "95% of agent PRs complete in 3 or fewer CI iterations"; this is a meaningful quality SLO for agent-powered delivery
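The SLO-style commitment in the last point ("95% of agent PRs complete in 3 or fewer CI iterations") can be checked with a few lines. This is a sketch: the function names are hypothetical, and it assumes PRs that never reached a passing CI run are recorded as `None` and count against the SLO.

```python
from typing import Optional, Sequence


def slo_attainment(its_values: Sequence[Optional[int]], max_iterations: int = 3) -> float:
    """Fraction of PRs that went green within max_iterations CI runs."""
    if not its_values:
        raise ValueError("need at least one PR")
    within = sum(1 for v in its_values if v is not None and v <= max_iterations)
    return within / len(its_values)


def meets_slo(
    its_values: Sequence[Optional[int]],
    max_iterations: int = 3,
    target: float = 0.95,
) -> bool:
    """True when attainment reaches the target fraction."""
    return slo_attainment(its_values, max_iterations) >= target
```

Evaluating this over a rolling window (for example, per week and per repository) makes regressions visible before they show up as cost overruns.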
Getting Started
6 steps to get from here to the next level
Common Pitfalls
Mistakes teams actually make at this stage - and how to avoid them
How Different Roles See It
Bob has deployed agent workflows across three teams and is seeing significant variance in agent productivity. Team A's agents seem to produce working code quickly; Team B's agents require lots of human intervention. Bob doesn't have the data to understand why.
What Bob should do - role-specific action plan
Sarah has heard about ITS as a metric but isn't sure how to present it to engineering managers who are skeptical of AI-specific metrics. The managers want to understand what they should actually do when ITS is high.
What Sarah should do - role-specific action plan
Victor tracks his own ITS obsessively. He knows his average ITS is 1.4 for test-writing tasks and 2.8 for new feature implementation. He has worked to bring both numbers down by improving his context files and task specification templates. He considers ITS his primary agent health signal.
What Victor should do - role-specific action plan
Further Reading
5 resources worth reading - hand-picked, not scraped
From the Field
Recent releases, projects, and discussions relevant to this maturity level.