Delivery Management
How we manage delivery in the age of agents: from human PR review to an autonomous delivery pipeline.
CI/CD Pipeline
Speed and reliability of your build-test-feedback loop for AI-generated code.
- CI runs on every change
  CI pipelines that take longer than 15 minutes are a defining characteristic of the Ad-hoc maturity level.
- Agent waits for CI feedback
  "Agent is blind, waits for feedback" describes the L1 state where an AI coding agent completes its code changes, submits them, and then has no way to observe whether those changes...
- Shared runner, queue
  A shared runner queue is the default CI infrastructure configuration at L1: a fixed pool of CI runners (virtual machines or containers) shared across all developers, all teams, and...
- 01 A CI pipeline runs on pull requests
- 02 CI results are reported after the pipeline completes
- 03 CI runs on every PR (not just on manual trigger)
- 04 Shared runner queue exists even if slow
- CI < 10 minutes
  CI under 10 minutes is the first meaningful milestone on the path to AI-native delivery infrastructure.
- Basic caching
  Basic caching in CI refers to storing dependency packages, compiled artifacts, and Docker layers between pipeline runs so they don't need to be re-downloaded or re-built from scratch on every commit.
- Dedicated runners per team
  Dedicated runners per team means each engineering team has its own isolated pool of CI runners, not shared with other teams.
- 01 CI completes in under 10 minutes (median)
- 02 Build caching is implemented (dependency cache, build artifact cache)
- 03 Dedicated CI runners are allocated per team (no shared queue across all teams)
- 04 CI duration is tracked as a metric and reviewed monthly
- 05 Cache hit rate exceeds 70%
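The two numeric criteria in the checklist above (median CI duration under 10 minutes, cache hit rate above 70%) can be checked with a few lines of code. The following is a minimal sketch, assuming you can export per-run durations in minutes and cache hit/miss counts from your CI system; the function names are hypothetical and not tied to any CI vendor's API.

```python
from statistics import median


def median_minutes(durations_min):
    """Median CI duration, in minutes, over a list of pipeline runs."""
    return median(durations_min)


def cache_hit_rate(hits, misses):
    """Fraction of cache lookups that were hits (0.0 when there were no lookups)."""
    total = hits + misses
    return hits / total if total else 0.0


def meets_l2_ci_criteria(durations_min, hits, misses):
    """True when median CI time is under 10 minutes and cache hit rate exceeds 70%."""
    return median_minutes(durations_min) < 10 and cache_hit_rate(hits, misses) > 0.70
```

For example, `meets_l2_ci_criteria([4, 9, 12], 75, 25)` evaluates the median (9 minutes) and the hit rate (0.75) against both thresholds.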
- CI < 5 minutes
  CI under 5 minutes is the Systematic (L3) milestone where CI speed becomes a first-class engineering concern, not a background project.
- Bazel + Remote Caching (EngFlow)
  Bazel is Google's open-source build system, designed from the ground up for large monorepos and fast incremental builds.
- Incremental builds: only changed fragments; per-worktree pipelines (Cursor 3.2 model)
  Incremental builds are a build strategy where only the files, modules, or packages that have changed since the last build are recompiled, and only the tests that cover changed code are re-executed.
- 01 CI completes in under 5 minutes (median)
- 02 Remote caching is implemented (Bazel remote cache, EngFlow, Gradle Enterprise)
- 03 Incremental builds run only changed modules or fragments
- 04 P95 CI duration is under 8 minutes
- 05 Build system supports hermetic builds (reproducible outputs regardless of machine)
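The checklist above pairs a median target with a P95 target, so it helps to be explicit about how the percentile is computed. This sketch uses the nearest-rank method, which is one common definition among several; that choice, and the function names, are assumptions rather than something the matrix prescribes.

```python
import math
from statistics import median


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer-friendly form of ceil(pct / 100 * n) to avoid float rounding surprises.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_l3_ci_criteria(durations_min):
    """True when median CI time is under 5 minutes and P95 is under 8 minutes."""
    return median(durations_min) < 5 and percentile_nearest_rank(durations_min, 95) < 8
```

Tracking P95 alongside the median matters because a handful of slow runs can leave the median untouched while still stalling the agents that hit them.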
- CI < 2 minutes
  CI under 2 minutes is the Optimized (L4) milestone and represents a qualitative shift in how CI is used.
- Ephemeral sandboxes: agent has own environment (10s spin-up)
  An ephemeral sandbox is a short-lived, fully isolated environment created specifically for a single agent task and destroyed when the task is complete.
- CI as Sandbox: 50 attempts in 5 min without blocking team; async subagent pipelines (/multitask)
  "CI as Sandbox" is a configuration pattern where the CI system is intentionally designed to support rapid, high-frequency iteration by AI agents, isolated from the normal developer CI workflow.
- 01 CI completes in under 2 minutes (median)
- 02 Ephemeral sandbox environments spin up in under 10 seconds for agent CI loops
- 03 Agent sandbox CI supports 50+ iteration attempts in 5 minutes without blocking team CI queue
- 04 P95 CI duration is under 3 minutes
- 05 CI feedback latency (from push to result) is tracked and reported
- Sub-minute feedback
  Sub-minute CI feedback is the Autonomous (L5) frontier - a pipeline that returns meaningful quality signal to an agent in under 60 seconds.
- Self-driving CI: auto-scaling per agent load
  Self-driving CI is a CI system that observes its own load, predicts demand, and scales its infrastructure automatically without any human intervention.
- Production feedback → CI auto-adjusts test suite
  "Production feedback drives CI test suite adjustment" is an L5 pattern where the CI test suite is not a static artifact maintained by engineers but a dynamic system that evolves ba...
- 01 CI provides sub-minute feedback for standard changes
- 02 CI auto-scales runner capacity based on agent load (no manual capacity planning)
- 03 Production feedback loop auto-adjusts the CI test suite (adds tests for observed failures, removes redundant tests)
- 04 CI runner utilization stays between 50-80% (auto-scaling prevents both waste and queuing)
- 05 Test suite evolution is auditable (each auto-added/removed test has a provenance record)
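The 50-80% utilization band in the checklist above implies a simple control rule: leave the pool alone while utilization is inside the band, and resize toward a point inside it otherwise. The sketch below illustrates that rule only; the 65% resize target and the function name are assumptions, and a production autoscaler would also need smoothing and cooldowns that are omitted here.

```python
import math


def target_runner_count(busy_runners, current_runners, low=0.50, high=0.80, target=0.65):
    """Return the runner pool size that keeps utilization inside the [low, high] band.

    The pool is unchanged while utilization is in the band; otherwise it is
    resized so that utilization lands near `target` (an assumed midpoint).
    """
    if current_runners > 0:
        utilization = busy_runners / current_runners
        if low <= utilization <= high:
            return current_runners
    # Out of band (or empty pool): size the pool for the target utilization.
    return max(1, math.ceil(busy_runners / target))
```

With 9 of 10 runners busy (90% utilization), this scales up; with 2 of 10 busy (20%), it scales down; the floor of one runner keeps the pool from disappearing when idle.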
Merge & Deploy
How PRs flow from creation to production - throughput, automation, and conflict handling.
- Human review and merge on every PR
  Manual PR review with manual merge is the baseline state for most engineering teams.
- 10 PR/day capacity
  Ten PRs per day is the typical throughput ceiling for a manual review-and-merge process on a team of 6-10 developers.
- Manual deploy or simple CD
  Manual deploy or simple CD covers the full spectrum of L1 deployment practice: from "someone SSHes into the server and runs git pull" to "merging to main triggers a pipeline that d...
- 01 Pull requests are reviewed before merge
- 02 The team ships pull requests regularly
- 03 Basic CD pipeline exists (even if simple or manually triggered)
- 04 Deploy frequency is at least weekly
- Basic merge queues
  A merge queue serializes pull requests that are ready to merge, ensuring that each PR is tested against the latest state of the target branch before it actually merges.
- Auto-rebase
  Auto-rebase is the practice of automatically keeping pull request branches up to date with the target branch without requiring the developer to manually run `git rebase main` or `g...
- CD pipeline with gates
  A CD pipeline with gates is a deployment pipeline that has explicit checkpoints between stages.
- 01 Merge queue is implemented (GitHub merge queue, Mergify, or equivalent)
- 02 Auto-rebase is enabled for PRs targeting main branch
- 03 CD pipeline includes at least one gate (tests pass, security scan, approval)
- 04 Merge conflicts are detected and flagged before review is requested
- 05 Deploy frequency is at least daily
- Policy-based merge rules
  Policy-based merge rules replace ad-hoc human judgment about when and how to merge with codified, machine-enforced criteria.
- Deterministic ordering + conflict detection
  Deterministic ordering means the merge queue processes PRs in a defined, predictable sequence rather than in arbitrary arrival order.
- Max 2 CI rounds per PR (Stripe benchmark)
  The "max 2 CI rounds per PR" benchmark comes from Stripe's engineering culture, where one of the key efficiency metrics for their agent-assisted development program (the Minions mo...
- 01 Policy-based merge rules are enforced (OPA, branch protection, or equivalent)
- 02 Deterministic merge ordering with conflict detection prevents concurrent merge failures
- 03 PRs require a maximum of 2 CI rounds before merge (Stripe benchmark)
- 04 Merge rules are versioned as code and reviewed when changed
- 05 PRs exceeding 2 CI rounds are flagged for investigation
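The "flag PRs exceeding 2 CI rounds" criterion above reduces to a filter over per-PR round counts. A minimal sketch, assuming you already have a mapping from PR identifier to the number of CI rounds it consumed (the identifiers and function name are hypothetical):

```python
def prs_over_ci_budget(ci_rounds_by_pr, max_rounds=2):
    """Return the PR identifiers whose CI round count exceeds the budget, sorted for stable output."""
    return sorted(pr for pr, rounds in ci_rounds_by_pr.items() if rounds > max_rounds)
```

For example, `prs_over_ci_budget({"pr-101": 1, "pr-102": 3, "pr-103": 2})` flags only `pr-102`, since a PR that uses exactly two rounds is still within budget.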
- Green = auto-merge → auto-deploy
  Green = auto-merge → auto-deploy is the L4 delivery pattern where a pull request that passes all required CI checks and satisfies all policy criteria is automatically merged and th...
- 50+ PR/day throughput
  Fifty or more PRs per day is the throughput milestone that marks the transition from "AI-assisted development" to "AI-augmented engineering at scale." At 10 PRs/day (L1), human rev...
- Canary/progressive deployment auto
  Automated canary and progressive deployment is the practice of rolling out changes to a small percentage of production traffic first, automatically monitoring key metrics during th...
- 01 Green-classified PRs auto-merge and auto-deploy without human intervention
- 02 Team throughput exceeds 50 PRs per day
- 03 Canary or progressive deployment is automated (no manual rollout decisions)
- 04 Auto-deploy includes automated rollback on error rate threshold breach
- 05 Merge queue wait time is under 10 minutes
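The "automated rollback on error rate threshold breach" criterion above can be read as a three-way decision on each canary observation window: wait, promote, or roll back. The sketch below is one possible shape for that decision. Every threshold in it (1% absolute error rate, 2x relative increase over baseline, 100-request minimum) is an illustrative assumption, not a value from the matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficWindow:
    requests: int
    errors: int


def canary_decision(baseline, canary, max_error_rate=0.01,
                    max_relative_increase=2.0, min_requests=100):
    """Return "wait", "rollback", or "promote" for one observation window."""
    if canary.requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    canary_rate = canary.errors / canary.requests
    baseline_rate = baseline.errors / baseline.requests if baseline.requests else 0.0
    if canary_rate > max_error_rate:
        return "rollback"  # absolute threshold breached
    if baseline_rate > 0 and canary_rate > baseline_rate * max_relative_increase:
        return "rollback"  # canary is much worse than the stable fleet
    return "promote"
```

Combining an absolute threshold with a relative one is a design choice: the absolute limit catches outright breakage, while the relative limit catches regressions on services whose normal error rate is already very low.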
- 1000+ merges/week (Stripe scale)
  1000+ merges per week is the throughput level that Stripe's engineering organization achieved with their AI-assisted development program, published as the "Minions" model.
- Agent produces PR → CI passes → merge → deploy → observe
  The full autonomous delivery loop - agent produces PR, CI passes, merge, deploy, observe - is the L5 state where code moves from conception to production without any required human...
- Rollback is agent-driven
  Agent-driven rollback is the practice of having AI agents detect production regressions, determine the root cause PR, initiate and execute the rollback procedure, communicate the i...
- 01 Merge throughput sustains 1,000+ merges per week
- 02 Full autonomous pipeline: agent produces PR, CI passes, merge, deploy, observe - no human in the loop
- 03 Rollback is agent-driven (agent detects regression, reverts, and opens fix PR)
- 04 Mean time to rollback is under 5 minutes from anomaly detection
- 05 Agent-driven rollbacks succeed without human intervention 95%+ of the time
Metrics
What you measure to understand AI-assisted engineering productivity and quality.
- DORA metrics, if tracked
  At L1 (Ad-hoc), most engineering teams track DORA metrics inconsistently or not at all.
- Standard delivery metrics (not yet AI-specific)
  At L1, engineering teams that have adopted AI tools - GitHub Copilot, Cursor, Claude Code - are tracking those tools with zero AI-specific metrics.
- ROI of AI not yet measured
  "How much did we save with AI?" is the question every engineering leader eventually faces from finance, from the CTO, or from the board.
- 01 Delivery is tracked with at least basic metrics
- 02 Standard delivery metrics are in place (AI-specific metrics come later)
- 03 Team acknowledges the need for AI-specific metrics beyond traditional DORA
- 04 Basic deployment frequency is at least known (even if not dashboarded)
- DORA + basic AI tracking; per-session token spend (ccusage, /usage)
  At L2 (Guided), teams have moved past the L1 silence on metrics.
- Licenses vs usage rate
  Licenses vs. usage rate is the first uncomfortable AI metrics discovery. Teams that invest in AI coding tools - GitHub Copilot, Cursor, Claude Code - routinely find that the number...
- PR throughput per dev
  PR throughput per developer is the first meaningful per-developer productivity metric for AI-assisted development: how many pull requests does a developer merge per week, on averag...
- 01 DORA metrics are tracked consistently with a dashboard
- 02 AI tool license count vs. active usage rate is measured
- 03 PR throughput per developer is tracked
- 04 AI acceptance rate (% of AI suggestions accepted) is measured per tool
- 05 Metrics are reviewed in team retrospectives at least monthly
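The three ratio metrics in the checklist above (license usage rate, AI acceptance rate, PR throughput per developer) are simple divisions, but writing them down removes ambiguity about the denominators. A minimal sketch with hypothetical function names; how "active user" and "suggestion shown" are defined depends on your tooling and is assumed here:

```python
def usage_rate(active_users, licensed_seats):
    """Share of paid seats that were actually used in the period."""
    return active_users / licensed_seats if licensed_seats else 0.0


def acceptance_rate(accepted_suggestions, shown_suggestions):
    """Share of AI suggestions that developers accepted."""
    return accepted_suggestions / shown_suggestions if shown_suggestions else 0.0


def pr_throughput_per_dev(merged_prs, developers, weeks):
    """Average merged PRs per developer per week."""
    if developers <= 0 or weeks <= 0:
        return 0.0
    return merged_prs / (developers * weeks)
```

For instance, 120 merged PRs from 8 developers over 3 weeks gives `pr_throughput_per_dev(120, 8, 3)`, an average of 5 PRs per developer per week.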
- ITS (Iterations-to-Success): target 1-3
  Iterations-to-Success (ITS) is an AI-native metric that measures how many CI attempts it takes for an agent's PR to pass.
- CPI (Cost-per-Iteration): target < $0.50; cost-per-merged-PR tracked over time
  Cost-per-Iteration (CPI) measures what a single agent CI attempt costs - in model API costs, CI compute, and related infrastructure.
- CI Feedback Latency tracking
  CI Feedback Latency is the time from when an agent pushes a commit to when CI produces a result (pass or fail) that the agent can act on.
- 01 ITS (Iterations-to-Success) is tracked with a target of 1-3
- 02 CPI (Cost-per-Iteration) is tracked with a target under $0.50
- 03 CI feedback latency is tracked as a metric (time from push to CI result)
- 04 Metrics are broken down per team and per repository
- 05 Cost tracking includes model API costs, CI compute costs, and runner costs per iteration
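ITS and CPI as defined above can both be derived from the same log of agent CI attempts. The sketch below assumes each attempt is recorded with a PR identifier, a pass/fail flag, and the three cost components named in the checklist; the record layout and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Iteration:
    pr_id: str
    passed: bool
    model_api_usd: float
    ci_compute_usd: float
    runner_usd: float


def iterations_to_success(iterations, pr_id):
    """Count CI attempts for a PR up to and including the first pass; None if it never passed."""
    count = 0
    for it in iterations:  # assumed to be in chronological order
        if it.pr_id != pr_id:
            continue
        count += 1
        if it.passed:
            return count
    return None


def cost_per_iteration(iterations):
    """Average cost of one CI attempt across model API, CI compute, and runner costs."""
    if not iterations:
        return 0.0
    total = sum(it.model_api_usd + it.ci_compute_usd + it.runner_usd for it in iterations)
    return total / len(iterations)
```

Returning `None` for a PR that never passed keeps abandoned work from being silently counted as a high ITS value; whether to report those separately is a judgment call for the team.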
- Test-oracle reliability tracked as a metric
  The Test Oracle Reliability Score (TORS) measures what percentage of test failures represent real defects rather than test infrastructure issues, environmental flakiness, or poorly...
- Auto-Approve Rate: target > 60% (NOT derived from benchmark scores - Berkeley Apr 12 hack)
  Auto-Approve Rate is the percentage of PRs that merge without requiring human review - passing all automated gates (CI, security scans, coverage checks, linting) and merging algorithmically.
- Agent Autonomy Score: % tasks without human intervention
  The Agent Autonomy Score measures the percentage of tasks that an agent completes from assignment to merge without any human intervention: no clarifying questions answered, no mid-...
- Merge Queue Wait < 10 min
  Merge Queue Wait is the time a PR spends waiting in the merge queue after all gates pass (CI green, reviews approved, policy rules satisfied) before it is actually merged.
- Model regression detection rate (thinking-length, files-read-before-edit timeline)
- 01 Test-oracle reliability is measured and tracked on a dashboard
- 02 Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
- 03 Merge queue wait time is tracked with a target under 10 minutes
- 04 Agent Autonomy Score (% of tasks completed without human intervention) is measured and broken down by task type
- 05 Metrics trigger automated alerts when thresholds are breached (e.g., test-oracle reliability drops)
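The rate metrics above share one shape (a part divided by a whole), and the alerting criterion is a comparison of each rate against a floor. A minimal sketch under those assumptions follows. The 60% auto-approve floor comes from the checklist; the 90% test-oracle reliability floor is an invented placeholder, since the matrix does not state a target for it.

```python
def ratio(part, whole):
    """Safe fraction: 0.0 when the denominator is zero."""
    return part / whole if whole else 0.0


def breached_thresholds(metrics, minimums):
    """Return the names of metrics that fall below their minimum, sorted for stable output."""
    return sorted(name for name, floor in minimums.items()
                  if metrics.get(name, 0.0) < floor)


# Example wiring; the counts are made up for illustration.
metrics = {
    "test_oracle_reliability": ratio(80, 100),  # failures that were real defects / all failures
    "auto_approve_rate": ratio(70, 100),        # PRs merged without human review / all merged PRs
}
minimums = {
    "test_oracle_reliability": 0.90,  # assumed floor, not from the matrix
    "auto_approve_rate": 0.60,        # target from the checklist
}
alerts = breached_thresholds(metrics, minimums)
```

Merge queue wait is a ceiling rather than a floor, so it would need the opposite comparison and is left out of this sketch.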
- Cost-per-feature (not cost-per-PR)
  Cost-per-feature is the total cost - in AI compute, CI infrastructure, human review time, and product management time - to deliver a complete user-facing feature from specification to production.
- Business value throughput, not activity metrics
  Business value throughput is the rate at which an engineering organization delivers measurable business outcomes - revenue generated, customer problems solved, churn reduced, conve...
- 01 Cost-per-feature is tracked (not cost-per-PR) - aggregating all agent, CI, and review costs per delivered feature
- 02 Business value throughput is the primary metric (features delivered per week, not PRs merged per week)
- 03 Metrics system auto-detects vanity metrics (high activity, low value delivery) and flags them
- 04 Cost-per-feature trend is declining quarter-over-quarter
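Cost-per-feature, as described above, is an aggregation over every cost event attributed to a feature rather than to a single PR. The sketch below assumes cost events are already tagged with a feature identifier, which is the hard part in practice and is not solved here; the tuple layout and names are hypothetical.

```python
from collections import defaultdict


def cost_per_feature(cost_events):
    """Sum all costs per feature from (feature_id, category, usd) tuples."""
    totals = defaultdict(float)
    for feature_id, _category, usd in cost_events:
        totals[feature_id] += usd
    return dict(totals)


def is_declining(quarterly_costs):
    """True when each quarter's cost-per-feature is lower than the previous quarter's."""
    return all(later < earlier
               for earlier, later in zip(quarterly_costs, quarterly_costs[1:]))
```

A feature that spans several PRs contributes one total here, which is what distinguishes this metric from cost-per-PR.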
Governance & Compliance
Controls around AI-generated code - licensing, security scanning, and audit trails.
- Individual devs use their own AI subscriptions
  Shadow AI refers to the use of AI tools by developers through personal subscriptions and accounts that operate entirely outside the organization's awareness, approval, or oversight.
- AI usage not yet audited
  A zero audit trail state means that when an auditor, security team, or incident investigator asks "what AI systems were involved in producing this code change?" there is no answer.
- AI usage is informal, policy not yet defined
  In 2023 and early 2024, many organizations responded to AI coding tools by banning them outright.
- 01 The team knows which AI tools are in use
- 02 AI-generated code follows the normal review and merge process
- 03 Team is aware of shadow AI usage (developers using private subscriptions)
- 04 Organization has moved past "ban AI" as a policy position
- Official AI tool policy; per-session spend caps + kill switches (post fork-bomb $3,800 incident)
  An official AI tool policy is the organization's first structured governance response to AI in the delivery pipeline.
- Basic audit: who uses what
  Basic audit at L2 means the organization has established visibility into which developers are using which AI tools, at what frequency, and for what purposes.
- EU AI Act awareness
  The EU AI Act (Regulation 2024/1689) is the world's first comprehensive legal framework for artificial intelligence. It entered into force in August 2024, with phased compliance deadlines through 2027.
- 01 Official AI tool policy exists and is communicated to all developers
- 02 Basic audit tracking is in place (which developers use which AI tools)
- 03 EU AI Act awareness training or briefing has been conducted
- 04 AI tool policy is reviewed at least annually
- 05 Approved tool list is maintained and accessible
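The per-session spend cap and kill switch named in the first item above can be illustrated with a small budget tracker: every model call reports its cost, and once the cap is reached the session refuses further work. This is a sketch of the idea only; the class name and the $5 cap in the usage note are assumptions, and a real kill switch would also need to terminate running subagents.

```python
class SessionBudget:
    """Tracks spend for one agent session and trips a kill switch at the cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.killed = False

    def record(self, cost_usd):
        """Record a cost; return True if the session may continue, False once the cap is hit."""
        if self.killed:
            raise RuntimeError("session was killed after exceeding its spend cap")
        self.spent_usd += cost_usd
        if self.spent_usd >= self.cap_usd:
            self.killed = True
        return not self.killed
```

With `SessionBudget(cap_usd=5.0)`, recording costs of 2.0 and 2.5 keeps the session alive, a further 1.0 trips the switch, and any later call raises instead of silently accumulating spend.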
- Minimum viable audit trail: model, timestamp, context, approver
  The minimum viable audit trail (MVAT) is the smallest set of structured metadata that, when captured consistently for every AI-assisted change, creates a defensible provenance reco...
- Policy-as-code; capability-restricted licensing (Claude Mythos / Project Glasswing pattern)
  Policy-as-code means expressing compliance rules as executable code that runs automatically in the CI/CD pipeline, rather than as documents that developers are expected to read and follow manually.
- Compliance gates in CI
  Compliance gates in CI are automated checks that must pass before a pull request can be merged, specifically focused on governance and compliance requirements rather than functional correctness.
- 01 Minimum viable audit trail is captured per AI-assisted change: model identifier, timestamp, context description, human approver
- 02 Policy-as-code enforces compliance rules in CI (OPA or equivalent)
- 03 Compliance gates run on every PR to in-scope repositories
- 04 Audit trail fields are validated by CI (missing fields fail the build)
- 05 Policy exceptions are logged and require follow-up within 48 hours
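The "missing fields fail the build" criterion above can be sketched as a validator over the four audit trail fields the matrix names: model, timestamp, context, and approver. The dictionary layout and function name are assumptions; in practice the record might come from commit trailers or PR metadata, which this sketch does not parse.

```python
REQUIRED_FIELDS = ("model", "timestamp", "context", "approver")


def missing_audit_fields(record):
    """Return the required audit fields that are absent or blank in the record."""
    return [field for field in REQUIRED_FIELDS
            if not str(record.get(field) or "").strip()]
```

A CI step could call this for each AI-assisted change and exit with a non-zero status whenever the returned list is non-empty, which is what turns the audit trail from a convention into a gate.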
- Full provenance tracking per change
  Full provenance tracking means that for every change that reaches production, you can reconstruct the complete lineage: the business requirement that originated the work, the ticke...
- Automated compliance checks
  Automated compliance checks at L4 go beyond the process gates of L3 (did the developer fill in the right fields?) to evaluate substantive compliance questions automatically: does t...
- AI code vs human code distinction in VCS
  AI code vs. human code distinction in version control means that the repository's history explicitly tags which lines, commits, or files were generated by AI systems versus written...
- 01 Full provenance tracking per change: model version, prompt context, agent session ID, iteration count
- 02 Automated compliance checks run without manual intervention on every merge
- 03 AI-generated code is distinguishable from human-written code in version control (metadata, labels, or attribution)
- 04 Provenance data is queryable (e.g., "show all changes made by model X in the last 30 days")
- 05 Compliance check results are aggregated into a governance dashboard
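The example query in the checklist above ("show all changes made by model X in the last 30 days") becomes straightforward once provenance is stored as structured records. The sketch below uses the four provenance fields the checklist lists plus a merge timestamp; the record type, field names, and in-memory list are assumptions standing in for whatever store you actually use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    change_id: str
    model_version: str
    prompt_context: str
    agent_session_id: str
    iteration_count: int
    merged_at: datetime


def changes_by_model(records, model_version, now, days=30):
    """Return change IDs produced by the given model version within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [r.change_id for r in records
            if r.model_version == model_version and r.merged_at >= cutoff]
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the query reproducible for audits that need to be re-run against a fixed point in time.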
- Continuous compliance: agent monitors regulatory changes
  Continuous compliance with agent-based regulatory monitoring means that an AI agent continuously tracks regulatory changes - new EU AI Act implementing regulations, updated SOC2 gu...
- Self-documenting audit trail
  A self-documenting audit trail is one where the documentation of AI involvement in a change is generated automatically by the AI system itself, without requiring human effort to produce it.
- Enterprise-grade RBAC per agent (Stripe Toolshed model: 400+ MCP tools with access control)
  Enterprise-grade RBAC (Role-Based Access Control) per agent means that every AI agent operating in the organization's systems has an explicit, audited identity with a specific set...
- 01 Continuous compliance: agent monitors regulatory changes (EU AI Act updates, SOC2 changes) and proposes policy updates
- 02 Audit trail is self-documenting (agent decisions include reasoning, not just outcomes)
- 03 Enterprise-grade RBAC is enforced per agent (Stripe Toolshed model: each agent has scoped permissions for specific tools and repositories)
- 04 Policy update proposals from compliance agent are auto-tested against existing codebase before rollout
- 05 Agent RBAC permissions are audited automatically for least-privilege compliance
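Per-agent RBAC with an automated least-privilege audit, as described above, needs two operations: an access check at call time and a periodic comparison of granted permissions against actual use. The sketch below shows both under a deliberately simple model (a set of tools and a set of repositories per agent); the data layout and names are hypothetical and do not describe Stripe's Toolshed.

```python
def is_allowed(grants, agent_id, tool, repo):
    """True only if the agent has an explicit grant for both the tool and the repository."""
    grant = grants.get(agent_id)
    return bool(grant) and tool in grant["tools"] and repo in grant["repos"]


def unused_tool_grants(grants, usage_log):
    """Least-privilege audit: tools each agent is granted but never used in the log."""
    used = {}
    for agent_id, tool in usage_log:
        used.setdefault(agent_id, set()).add(tool)
    report = {}
    for agent_id, grant in grants.items():
        unused = grant["tools"] - used.get(agent_id, set())
        if unused:
            report[agent_id] = sorted(unused)
    return report
```

Unknown agents are denied by default, and the audit output is a list of candidates for revocation rather than an automatic removal, since a tool may be unused in one window and still legitimately needed.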
May 2026 update: cost is now a first-class metric.
ccusage hit 13.2k stars on GitHub; /usage and /context shipped as built-in commands; Reddit had multiple "I burned $3,800 overnight" posts traced to runaway subagent loops. The economics also got more honest. Pawel Dolega's "AI subscriptions are on borrowed time" makes the structural case: a $20 Pro plan burns $50-100 of compute, total enterprise LLM spend doubled in six months despite per-token prices falling (Jevons paradox), and labs are quietly testing the water - Anthropic pulled Claude Code from Pro, GitHub paused Copilot signups. Teams that do not measure cost-per-merged-PR now will be facing re-pricing emergencies later this year.
Governance follows: per-session spend caps and kill switches are now baseline, not advanced. Restricted-use models (Claude Mythos Preview / Project Glasswing - 93.9% SWE-bench but defensive cybersec only) introduce a new lever - capability-restricted licensing. And Berkeley's April 12 reward-hack research means any policy that auto-approves based on benchmark scores is broken by construction. Stripe Minions is still the L5 north star; the new homework is making sure your L2-L3 metrics don't lie to you on the way there.