CPI (Cost-per-Iteration): target < $0.50

Cost-per-Iteration (CPI) measures what a single agent CI attempt costs - in model API costs, CI compute, and related infrastructure.

·ITS (Iterations-to-Success) is tracked with a target of 1-3
·CPI (Cost-per-Iteration) is tracked with a target under $0.50
·CI feedback latency is tracked as a metric (time from push to CI result)

·Metrics are broken down per team and per repository
·Cost tracking includes model API costs, CI compute costs, and runner costs per iteration

Evidence

·ITS dashboard showing iteration count distribution per PR
·CPI dashboard showing cost per CI iteration
·CI feedback latency chart with P50, P95, P99 breakdowns

May 2026 Update

Cost telemetry stopped being optional. ccusage (13.2k GitHub stars, ccusage.com) prints token spend per Claude Code session from local JSONL files - cache-aware, offline-capable, MCP-integrated. Claude-Code-Usage-Monitor adds live charts and "when will I hit my limit" predictions. Both /usage and /context shipped as built-in commands. Reddit documented multiple "agentic fork bomb" incidents - including a $3,800 overnight bill - that turned per-session spend caps into baseline governance.

The bigger picture: Pawel Dolega's "AI subscriptions are on borrowed time" (Apr 26) makes the structural case that CPI thresholds will move. A $20 Pro plan currently burns $50-100 of compute; total enterprise LLM spend doubled in six months despite per-token prices falling (Jevons paradox); Anthropic pulled Claude Code from Pro and GitHub paused Copilot signups. Track CPI now, set per-session caps now, and assume re-pricing within 12 months.
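A per-session cap can be as small as a running total that refuses further spend once a limit is reached. The sketch below is a minimal illustration under stated assumptions: `SessionBudget`, `SessionBudgetExceeded`, and the $5.00 cap are hypothetical names and values, not part of any tool mentioned above.

```python
class SessionBudgetExceeded(RuntimeError):
    """Raised when a charge would push a session past its cap."""


class SessionBudget:
    """Tracks spend for one agent session against a hard cap (hypothetical helper)."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, usd: float) -> None:
        # Refuse the charge before recording it, so spend never exceeds the cap.
        if self.spent_usd + usd > self.cap_usd:
            raise SessionBudgetExceeded(
                f"cap ${self.cap_usd:.2f} would be exceeded "
                f"(spent ${self.spent_usd:.2f}, requested ${usd:.2f})"
            )
        self.spent_usd += usd


budget = SessionBudget(cap_usd=5.00)  # illustrative cap, not a recommendation
budget.charge(2.00)
budget.charge(2.50)
try:
    budget.charge(1.00)  # would bring the total to $5.50
    stopped = False
except SessionBudgetExceeded:
    stopped = True
print(f"spent ${budget.spent_usd:.2f}, stopped={stopped}")
```

In practice the orchestration layer would call `charge` before each model request and halt the session when the exception is raised.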

June 2026 Update

The market turn became explicit: spending pivoted from "tokenmaxxing" to efficiency, with the metric shifting from tokens consumed to value per token (CNBC, June 26). The dominant cost pattern is now token arbitrage: spend the premium model only on judgment and planning, then hand the bulk code-writing to a cheaper specialist model. This makes model tiering a CPI lever rather than a nice-to-have. Pricing itself is a live risk: Anthropic announced Agent SDK billing changes for June 15 and then paused them on June 16, so budget against re-pricing within the quarter and keep per-session caps in place.

In May 2026 cost-per-merged-PR crossed over from advanced telemetry into a CFO-facing line item. Microsoft cancelled Claude Code internally on cost (The Verge, May 14) and Uber's COO publicly questioned AI ROI after a team burned a full-year budget in four months (Fortune, May 26). The DORA ROI report (analyzed May 11) found roughly 39% first-year ROI but only about a 10% gain on complex legacy code - a J-curve with a reliability penalty - while Goldman Sachs (May 5) projected a 24x rise in token consumption. The lesson for CPI: cheaper per-token prices do not lower the bill, so per-session caps and the CPI * ITS total are now the numbers leadership will ask about by name.

What It Is

Cost-per-Iteration (CPI) measures what a single agent CI attempt costs - in model API costs, CI compute, and related infrastructure. When an agent submits a commit and CI runs, that's one iteration. The cost of that iteration includes: the token cost of the agent's reasoning and code generation to produce the commit, plus the CI runner cost for executing the test suite. The target at L3 is below $0.50 per iteration.

The $0.50 target is not arbitrary. It's derived from the math of agent economics: at $0.50/iteration and an ITS (Iterations-to-Success) target of 1-3, the total agent cost per PR is $0.50-$1.50. A typical engineer-hour of review and direction costs $50-100 (fully loaded). An agent that costs $1.50 to produce a PR and needs 15 minutes of human review is delivering enormous leverage. But if CPI is $3-5 per iteration and ITS is 5-8, a single PR can cost $15-40 in raw compute - approaching the cost of the human time it was meant to save.
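The arithmetic in the paragraph above can be reproduced in a few lines. This is only a sketch: the function name `pr_cost` is illustrative, and the inputs are the figures quoted in the text rather than measured data.

```python
def pr_cost(cpi_usd: float, its: int) -> float:
    """Raw agent cost of one PR: cost per iteration times iterations to success."""
    return cpi_usd * its


# On-target scenario from the text: $0.50 per iteration, 1-3 iterations.
on_target = (pr_cost(0.50, 1), pr_cost(0.50, 3))

# Off-target scenario from the text: $3-5 per iteration, 5-8 iterations.
off_target = (pr_cost(3.00, 5), pr_cost(5.00, 8))

print(f"on target:  ${on_target[0]:.2f} to ${on_target[1]:.2f} per PR")
print(f"off target: ${off_target[0]:.2f} to ${off_target[1]:.2f} per PR")
```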

CPI has two components that must be optimized separately. The AI token cost depends on model choice, context window size, and output length. Claude Sonnet is significantly cheaper per token than Opus; using the right model for the task type can cut AI costs by 5-10x without sacrificing quality for well-specified tasks. The CI compute cost depends on CI pipeline efficiency, test suite duration, and runner instance type. A test suite that takes 20 minutes to run on a large instance costs dramatically more per iteration than a 2-minute suite on a standard runner.
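A minimal sketch of the two components, kept separate so each can be optimized on its own. All rates below are placeholders chosen for illustration, not any vendor's actual pricing; substitute the published rates for your model and CI runner. The `Iteration` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    input_tokens: int   # context sent to the model
    output_tokens: int  # code and reasoning generated
    ci_minutes: float   # billed CI runner time for this attempt


def token_cost(it: Iteration, usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """AI component: tokens priced per million, input and output separately."""
    return (it.input_tokens * usd_per_mtok_in
            + it.output_tokens * usd_per_mtok_out) / 1_000_000


def ci_cost(it: Iteration, usd_per_minute: float) -> float:
    """CI component: billed minutes times the runner's per-minute rate."""
    return it.ci_minutes * usd_per_minute


# Placeholder rates (assumptions for the example only).
USD_PER_MTOK_IN = 3.00
USD_PER_MTOK_OUT = 15.00
USD_PER_CI_MINUTE = 0.008

it = Iteration(input_tokens=40_000, output_tokens=2_000, ci_minutes=6.0)
ai = token_cost(it, USD_PER_MTOK_IN, USD_PER_MTOK_OUT)
ci = ci_cost(it, USD_PER_CI_MINUTE)
print(f"AI ${ai:.3f} + CI ${ci:.3f} = ${ai + ci:.3f} per iteration")
```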

At L3, teams that track CPI for the first time frequently discover that 20% of their agent tasks are consuming 80% of their iteration costs. These high-cost tasks typically share two characteristics: high ITS (many failed iterations before success) and large context windows (agents consuming excessive tokens trying to understand complex requirements). Fixing these high-cost outliers - through better context management and task specification - produces dramatic improvements in overall CPI.
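One way to check for this concentration is to compute the share of total cost consumed by the most expensive tasks. The sketch below assumes you already have a total cost per task; the function name and the sample costs are hypothetical.

```python
def cost_share_of_top(costs: list[float], top_fraction: float = 0.2) -> float:
    """Fraction of total cost consumed by the most expensive `top_fraction` of tasks."""
    if not costs:
        return 0.0
    ranked = sorted(costs, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


# Hypothetical per-task costs in USD for one week.
task_costs = [12.0, 8.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
share = cost_share_of_top(task_costs)
print(f"top 20% of tasks account for {share:.0%} of cost")
```

A share well above the task fraction suggests that outlier tasks, not the typical task, are where optimization effort should go first.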

Why It Matters

Prevents agent cost from becoming a budget blocker - without CPI tracking, agent costs grow silently as the team adds more agents and more tasks; the first monthly cloud bill that shocks leadership can cause an overcorrection that cuts the entire agent program
Creates incentive for CI speed investment - CI pipeline cost is directly visible in CPI; teams that track CPI have a clear financial argument for CI infrastructure investment: "cutting CI from 20 minutes to 5 minutes saves $0.30/iteration, and at 500 iterations/week that's $150/week, or roughly $650/month"
Drives model right-sizing - not every task needs the most capable (and expensive) model; CPI tracking reveals that many tasks can use a cheaper model without quality loss, creating a data-driven argument for model tiering
Enables per-task-type cost optimization - some task types (test generation) have inherently low CPI; others (complex feature implementation) have inherently high CPI; knowing this allows teams to structure agent workflows to maximize value per dollar
Provides the unit economics for scaling - before expanding agent usage from 10 to 100 concurrent agents, you need to know what that costs; CPI * ITS * weekly PR volume gives you the monthly cost projection for any scale level
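The projection in the last point (CPI * ITS * weekly PR volume) can be sketched as follows. The weeks-per-month conversion and the sample volumes are assumptions made for the example.

```python
WEEKS_PER_MONTH = 52 / 12  # assumption: average month length


def monthly_cost(cpi_usd: float, its: float, prs_per_week: float) -> float:
    """Projected monthly agent spend: CPI * ITS * weekly PR volume, scaled to a month."""
    return cpi_usd * its * prs_per_week * WEEKS_PER_MONTH


# Hypothetical scale levels: same unit economics, 10x the PR volume.
current = monthly_cost(cpi_usd=0.50, its=2, prs_per_week=200)
scaled = monthly_cost(cpi_usd=0.50, its=2, prs_per_week=2000)
print(f"current: ${current:,.0f}/month, at 10x volume: ${scaled:,.0f}/month")
```

The projection is linear only if CPI and ITS hold steady at the higher volume, which is worth verifying before committing a budget.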

Getting Started

Instrument token costs per agent session - Add token counting to your agent orchestration layer. Every time an agent completes a CI iteration (one commit + CI run), log: tokens in (context), tokens out (code generated), model used, and duration. These map directly to API costs using the model's published pricing.
Instrument CI runner costs per iteration - Most CI platforms provide cost or usage data per run. GitHub Actions charges by minute; the per-iteration CI cost is (minutes per run * cost per minute). Log this alongside token costs for each iteration.
Build a per-PR cost rollup - Sum the token cost and CI cost for all iterations of a PR to get total PR cost. Publish this as a dashboard: median PR cost, 90th percentile PR cost, total weekly agent cost, and weekly cost trend.
Compute CPI as the mean iteration cost - CPI = (total week cost) / (total iterations in the week). Track this weekly and set alerts when CPI exceeds your target. A CPI above $1.00 in any week is a signal that something has gone wrong - a specific task type or agent configuration is producing expensive failures.
Identify the high-CPI outliers - Run a weekly analysis: which 10% of PRs have the highest total cost? What do they have in common? Are they in a specific area of the codebase, on a specific task type, or using a specific agent configuration? These outliers are where CPI optimization effort pays off most.
Experiment with model tiering - Run a two-week experiment: use a cheaper model (Sonnet vs. Opus, or Haiku for very simple tasks) for the task types where quality hasn't suffered. Measure ITS for both model tiers on the same task types. If ITS doesn't significantly increase with the cheaper model, you've found a cost reduction that doesn't sacrifice quality.
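The rollup and CPI steps above can be sketched over a simple iteration log. The record layout, PR identifiers, and dollar figures are hypothetical; the $1.00 alert threshold is the one named in the steps.

```python
from collections import defaultdict
from statistics import median

# Hypothetical log: one record per iteration (one commit + CI run).
iterations = [
    {"pr": "PR-1", "token_usd": 0.12, "ci_usd": 0.05},
    {"pr": "PR-1", "token_usd": 0.10, "ci_usd": 0.05},
    {"pr": "PR-2", "token_usd": 0.30, "ci_usd": 0.05},
    {"pr": "PR-3", "token_usd": 0.90, "ci_usd": 0.20},
    {"pr": "PR-3", "token_usd": 0.80, "ci_usd": 0.20},
    {"pr": "PR-3", "token_usd": 0.85, "ci_usd": 0.20},
]


def pr_rollup(records: list[dict]) -> dict[str, float]:
    """Total cost per PR: token cost plus CI cost, summed over its iterations."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["pr"]] += r["token_usd"] + r["ci_usd"]
    return dict(totals)


def weekly_cpi(records: list[dict]) -> float:
    """CPI = total cost for the week / total iterations in the week."""
    if not records:
        return 0.0
    total = sum(r["token_usd"] + r["ci_usd"] for r in records)
    return total / len(records)


per_pr = pr_rollup(iterations)
cpi = weekly_cpi(iterations)
median_pr_cost = median(per_pr.values())
most_expensive_pr = max(per_pr, key=per_pr.get)

print(f"CPI ${cpi:.2f}, median PR ${median_pr_cost:.2f}, top PR {most_expensive_pr}")
if cpi > 1.00:
    print("CPI above $1.00: investigate expensive failures")
```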

Tip

Claude Sonnet typically offers a 4-5x cost reduction vs. Opus for equivalent tasks, while Claude Haiku offers another 3-4x reduction for simple tasks. For agent workflows where the task is well-specified (test writing, documentation, simple refactoring), Haiku or Sonnet can match Opus quality at a fraction of the cost. Run a pilot with task type segmentation before committing to a tiering strategy.
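A pilot like this comes down to comparing expected PR cost (CPI times mean ITS) per tier on the same task type. The sketch below uses made-up pilot numbers; the tier labels, CPI values, and ITS samples are all assumptions.

```python
from statistics import mean


def expected_pr_cost(cpi_usd: float, its_samples: list[int]) -> float:
    """Expected cost per PR for one tier: CPI times the mean observed ITS."""
    return cpi_usd * mean(its_samples)


# Hypothetical pilot results for a single task type.
pilot = {
    "premium": {"cpi_usd": 0.60, "its": [1, 2, 2, 1, 2]},
    "cheaper": {"cpi_usd": 0.15, "its": [2, 2, 3, 1, 2]},
}

costs = {tier: expected_pr_cost(d["cpi_usd"], d["its"]) for tier, d in pilot.items()}
best_tier = min(costs, key=costs.get)

for tier, cost in costs.items():
    print(f"{tier}: mean ITS {mean(pilot[tier]['its']):.1f}, expected PR cost ${cost:.2f}")
print(f"lower expected PR cost: {best_tier}")
```

With only a handful of PRs per tier the difference in mean ITS may be noise, so a real pilot needs enough samples per task type before the comparison is trusted.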


Common Pitfalls

Tracking only token costs and ignoring CI compute. In many agent workflows, CI compute is the dominant cost, not token costs. A 30-minute test suite on a large runner can cost $2-5 per iteration - far more than the $0.05-0.20 in token costs for the agent's code generation. Teams that optimize only for token cost are optimizing the smaller part of the cost equation.

Setting a single CPI target across all task types. A CPI target of $0.50 is reasonable for test writing tasks (where the agent has a clear, bounded output) but may be unrealistically low for complex feature implementation (where more context and iteration is inherently needed). Set CPI targets by task type: aggressive targets for high-frequency, well-defined tasks, more lenient targets for high-complexity, exploratory tasks.
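Per-task-type targets can live in a small lookup table. In this sketch only the $0.50 test-writing figure comes from the text; the other task types and values are hypothetical placeholders to be replaced with targets derived from your own data.

```python
# Targets in USD per iteration. Only the 0.50 figure is from the text;
# the rest are illustrative assumptions.
CPI_TARGETS_USD = {
    "test_writing": 0.50,
    "feature_implementation": 1.50,
    "exploratory": 2.00,
}
DEFAULT_TARGET_USD = 0.50


def over_target(task_type: str, cpi_usd: float) -> bool:
    """True when the observed CPI exceeds the target for this task type."""
    return cpi_usd > CPI_TARGETS_USD.get(task_type, DEFAULT_TARGET_USD)


print(over_target("test_writing", 0.80))            # True: above the tight target
print(over_target("feature_implementation", 0.80))  # False: within the lenient target
```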

Optimizing CPI at the expense of ITS. If you reduce context window size to cut token costs and ITS goes from 2 to 6 as a result, you've saved on per-iteration cost but increased total PR cost. CPI and ITS must be optimized together. Total PR cost (CPI * ITS) is the number that matters, not either metric in isolation.
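A quick break-even check makes the trade-off concrete: if ITS rises, the new CPI must fall below the old CPI times the ratio of old to new ITS just to keep total PR cost flat. The ITS figures (2 and 6) come from the example above; the CPI values are assumptions.

```python
def break_even_cpi(old_cpi_usd: float, old_its: float, new_its: float) -> float:
    """CPI at which total PR cost (CPI * ITS) is unchanged after ITS moves."""
    return old_cpi_usd * old_its / new_its


old_cpi, old_its = 0.60, 2  # assumed starting point
new_cpi, new_its = 0.40, 6  # assumed result of trimming context

threshold = break_even_cpi(old_cpi, old_its, new_its)
old_total = old_cpi * old_its
new_total = new_cpi * new_its

print(f"PR cost before ${old_total:.2f}, after ${new_total:.2f}")
print(f"new CPI would need to be under ${threshold:.2f} to break even")
```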

Not alerting on CPI spikes. CPI can spike suddenly if an agent configuration change, a library update, or a codebase change causes agents to consume dramatically more context. Without alerts, these spikes go unnoticed until the monthly cloud bill arrives. Set weekly CPI alerts: if this week's CPI is more than 50% above last week's, trigger an investigation.
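The week-over-week rule described above fits in one function. This is a sketch: the function name is illustrative, and the 50% threshold is the one stated in the text.

```python
def cpi_spike(this_week_usd: float, last_week_usd: float, threshold: float = 0.50) -> bool:
    """True when this week's CPI is more than `threshold` (50%) above last week's."""
    if last_week_usd <= 0:
        return False  # no baseline to compare against
    return (this_week_usd - last_week_usd) / last_week_usd > threshold


print(cpi_spike(0.80, 0.50))  # True: 60% increase
print(cpi_spike(0.70, 0.50))  # False: 40% increase
```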

Underestimating organizational overhead costs. CPI captures infrastructure costs but not human overhead costs: the time developers spend reviewing high-ITS PRs, the time spent investigating agent failures, the time spent updating context files. The true cost-per-PR includes human time. Track this separately as a monthly estimate rather than trying to instrument it precisely - but don't forget it when making decisions about agent complexity.


How Different Roles See It

Bob, Head of Engineering

Bob approved an expansion of the agent program last quarter and is now seeing unexpectedly high cloud bills. The AI API costs are three times what he projected. He doesn't know which agents are consuming the budget or why.

What Bob should do: Bob needs CPI instrumentation immediately. He should ask his platform engineer to add token logging to the agent orchestration layer and connect it to the billing data from the cloud provider. Within a week, he should have a report showing: which agent configurations are most expensive, what the per-PR cost distribution looks like, and which task types are producing the highest costs. The analysis will almost certainly reveal that a small number of high-ITS, high-context tasks are consuming a disproportionate share of the budget. Bob should put a temporary cap on context window size for new agent tasks (this is a single configuration change) and measure the impact over two weeks. The combination of CPI instrumentation and context window capping typically reduces AI API costs by 30-50% without significant quality impact.


Sarah, Productivity Lead

Sarah is building the quarterly AI productivity report and wants to include unit economics alongside throughput metrics. She wants to show: "Here is the cost of producing each agent PR and here is the value it delivers."

What Sarah should do: Sarah should build a simple cost-vs-value model. Cost side: CPI * ITS = cost per PR. Value side: estimated time saved per PR (based on the type of task - a test-writing PR saves ~30 minutes of developer time, a bug fix PR saves ~90 minutes). The ratio of value to cost is the agent ROI per PR type. Sarah should present this model with ranges and uncertainty estimates rather than false precision. The goal isn't an exact ROI number - it's a framework that the team can use to make decisions about which task types to prioritize for agent automation. High-value, low-cost tasks (test writing) should be automated first. High-cost, low-value tasks (simple boilerplate) should be the last priority.
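The cost-vs-value model can be sketched as a range rather than a point estimate. The minutes saved (30 and 90) and the $50-100 hourly range come from this document; the CPI and ITS inputs per PR type are assumptions, and the function name is illustrative.

```python
def roi_range(minutes_saved: float, cpi_usd: float, its: float,
              hourly_low_usd: float = 50.0, hourly_high_usd: float = 100.0) -> tuple[float, float]:
    """Value-to-cost ratio per PR, as a (low, high) range over the hourly-rate range."""
    cost = cpi_usd * its
    hours = minutes_saved / 60
    return (hours * hourly_low_usd / cost, hours * hourly_high_usd / cost)


# CPI and ITS per PR type are assumed values for illustration.
test_writing = roi_range(minutes_saved=30, cpi_usd=0.50, its=2)
bug_fix = roi_range(minutes_saved=90, cpi_usd=0.50, its=3)

print(f"test writing: {test_writing[0]:.0f}x to {test_writing[1]:.0f}x")
print(f"bug fix:      {bug_fix[0]:.0f}x to {bug_fix[1]:.0f}x")
```

These ratios count only raw compute on the cost side. Adding human review time to the cost, as the pitfalls section recommends tracking, would lower them considerably.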


Victor, Staff Engineer - AI Champion

Victor tracks CPI for all his agent workflows and has achieved sub-$0.30 CPI through a combination of model tiering (Haiku for simple tasks, Sonnet for complex tasks, Opus reserved for architectural reasoning), optimized context windows (only the most relevant files included, not the whole codebase), and a fast CI pipeline (2-minute test runs via incremental test selection).

What Victor should do: Victor should publish his model tiering configuration and context window strategy as a platform template. The specific configuration choices - which model for which task type, how to determine context window contents, how to structure the agent's working directory to avoid loading unnecessary files - are the optimizations that took Victor months to develop. Packaging them as a template that other developers can adopt with minimal modification is the highest-leverage contribution Victor can make. Victor should also set up a monthly CPI review where he helps other teams analyze their own CPI data and identify their highest-cost outliers. The combination of a good template and ongoing coaching is how the team moves from "Victor's CPI is $0.30" to "the team's median CPI is $0.45."
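A tiering template of the kind described could start as a plain mapping from task type to model tier. The sketch below is hypothetical: the task-type names and the default tier are assumptions, and the tier labels are generic names rather than API model identifiers.

```python
# Hypothetical template: task type -> model tier, following the split described
# above (Haiku for simple tasks, Sonnet for complex tasks, Opus for architecture).
TIER_BY_TASK = {
    "test_writing": "haiku",
    "documentation": "haiku",
    "feature_implementation": "sonnet",
    "architecture_review": "opus",
}
DEFAULT_TIER = "sonnet"  # assumption: unknown task types get the middle tier


def pick_tier(task_type: str) -> str:
    """Return the model tier for a task type, falling back to the default."""
    return TIER_BY_TASK.get(task_type, DEFAULT_TIER)


for task in ("test_writing", "architecture_review", "data_migration"):
    print(f"{task} -> {pick_tier(task)}")
```

Keeping the mapping as data rather than logic lets other teams adopt the template and adjust entries as their own CPI and ITS measurements come in.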
