
Delivery Management

How we manage delivery in the age of agents: from human PR review to an autonomous delivery pipeline.

4 capabilities · 20 levels · 61 practices · 61 guides
The matrix · at a glance
Maturity levels: L1 Ad-hoc · L2 Guided · L3 Systematic · L4 Optimized (sweet spot) · L5 Autonomous
Capabilities: 01 CI/CD Pipeline · 02 Merge & Deploy · 03 Metrics · 04 Governance & Compliance
Capability 01 · Delivery Management

CI/CD Pipeline

Speed and reliability of your build-test-feedback loop for AI-generated code.

L2 · Stage 02 · Guided
Criteria - what to measure
  1. CI completes in under 10 minutes (median)
  2. Build caching is implemented (dependency cache, build artifact cache)
  3. Dedicated CI runners are allocated per team (no shared queue across all teams)
  4. CI duration is tracked as a metric and reviewed monthly
  5. Cache hit rate exceeds 70%
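The two numeric L2 targets above (median CI duration under 10 minutes, cache hit rate above 70%) can be checked from raw CI data. This is a minimal sketch, not a prescribed implementation: the function name `ci_l2_check`, its parameters, and the sample figures are all hypothetical.

```python
from statistics import median

def ci_l2_check(durations_min, cache_hits, cache_lookups,
                max_median_min=10.0, min_hit_rate=0.70):
    """Measure median CI duration and cache hit rate against the L2 targets."""
    med = median(durations_min)
    # Guard against an empty cache log instead of dividing by zero.
    hit_rate = cache_hits / cache_lookups if cache_lookups else 0.0
    return {
        "median_minutes": med,
        "cache_hit_rate": hit_rate,
        "median_ok": med < max_median_min,     # "under 10 minutes"
        "cache_ok": hit_rate > min_hit_rate,   # "exceeds 70%"
    }

# Illustrative data only: five CI runs and 1,000 cache lookups.
report = ci_l2_check([6.5, 8.0, 9.5, 14.0, 7.0], cache_hits=820, cache_lookups=1000)
print(report)
```

Using the median rather than the mean keeps one slow outlier run (14 minutes here) from masking an otherwise healthy pipeline.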
L3 · Stage 03 · Systematic
L4 · Stage 04 · Optimized (most teams aim here)
L5 · Stage 05 · Autonomous
Criteria - what to measure
  1. CI provides sub-minute feedback for standard changes
  2. CI auto-scales runner capacity based on agent load (no manual capacity planning)
  3. Production feedback loop auto-adjusts the CI test suite (adds tests for observed failures, removes redundant tests)
  4. CI runner utilization stays between 50-80% (auto-scaling prevents both waste and queuing)
  5. Test suite evolution is auditable (each auto-added/removed test has a provenance record)
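One way to read the 50-80% utilization criterion is as a scaling band: leave capacity alone inside the band and resize toward a midpoint outside it. The sketch below assumes that interpretation; `desired_runner_count` and the 65% target are hypothetical choices, not taken from any specific autoscaler.

```python
import math

def desired_runner_count(busy_runners, current_runners,
                         low=0.50, high=0.80, target=0.65):
    """Return the runner count that keeps utilization inside the band."""
    if current_runners > 0:
        utilization = busy_runners / current_runners
        if low <= utilization <= high:
            return current_runners  # inside the band: no change
    # Outside the band (or no runners yet): size the pool for the target.
    return max(1, math.ceil(busy_runners / target))

print(desired_runner_count(9, 10))  # over 80% busy: scale up
print(desired_runner_count(6, 10))  # 60% busy: unchanged
print(desired_runner_count(2, 10))  # under 50% busy: scale down
```

Resizing toward the middle of the band rather than its edge avoids oscillating between scale-up and scale-down on small load changes.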
Capability 02 · Delivery Management

Merge & Deploy

How PRs flow from creation to production - throughput, automation, and conflict handling.

L2 · Stage 02 · Guided
Criteria - what to measure
  1. Merge queue is implemented (GitHub merge queue, Mergify, or equivalent)
  2. Auto-rebase is enabled for PRs targeting main branch
  3. CD pipeline includes at least one gate (tests pass, security scan, approval)
  4. Merge conflicts are detected and flagged before review is requested
  5. Deploy frequency is at least daily
L3 · Stage 03 · Systematic
Criteria - what to measure
  1. Policy-based merge rules are enforced (OPA, branch protection, or equivalent)
  2. Deterministic merge ordering with conflict detection prevents concurrent merge failures
  3. PRs require a maximum of 2 CI rounds before merge (Stripe benchmark)
  4. Merge rules are versioned as code and reviewed when changed
  5. PRs exceeding 2 CI rounds are flagged for investigation
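Flagging PRs that exceed the 2-round CI budget needs only a count of CI runs per PR. A minimal sketch, assuming CI history is available as `(pr_number, status)` pairs; that shape and the function name `prs_over_ci_budget` are hypothetical.

```python
from collections import Counter

def prs_over_ci_budget(ci_runs, max_rounds=2):
    """Return PR numbers whose CI ran more than max_rounds times."""
    rounds = Counter(pr for pr, _status in ci_runs)
    return sorted(pr for pr, count in rounds.items() if count > max_rounds)

# Illustrative history: PR 102 needed three rounds to go green.
ci_runs = [
    (101, "fail"), (101, "pass"),
    (102, "fail"), (102, "fail"), (102, "pass"),
    (103, "pass"),
]
flagged = prs_over_ci_budget(ci_runs)
print(flagged)
```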
L4 · Stage 04 · Optimized (most teams aim here)
L5 · Stage 05 · Autonomous
Criteria - what to measure
  1. Merge throughput sustains 1,000+ merges per week
  2. Full autonomous pipeline: agent produces PR, CI passes, merge, deploy, observe - no human in the loop
  3. Rollback is agent-driven (agent detects regression, reverts, and opens fix PR)
  4. Mean time to rollback is under 5 minutes from anomaly detection
  5. Agent-driven rollbacks succeed without human intervention 95%+ of the time
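The two rollback criteria above reduce to a mean duration and a success ratio over rollback events. The sketch below assumes each event records when the anomaly was detected, when service was restored, and whether a human stepped in; the `Rollback` record and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rollback:
    detected_at: float       # seconds, time of anomaly detection
    restored_at: float       # seconds, time the revert was live
    human_intervened: bool

def rollback_stats(events):
    """Return (mean minutes to rollback, share completed without humans)."""
    total_seconds = sum(e.restored_at - e.detected_at for e in events)
    mean_minutes = total_seconds / len(events) / 60
    autonomous_share = sum(not e.human_intervened for e in events) / len(events)
    return mean_minutes, autonomous_share

# Illustrative events only.
events = [
    Rollback(0, 180, False),
    Rollback(0, 240, False),
    Rollback(0, 300, True),
    Rollback(0, 120, False),
]
mean_minutes, autonomous_share = rollback_stats(events)
print(mean_minutes < 5, autonomous_share >= 0.95)
```

In this sample the time target is met but the autonomy target is not, which is exactly the kind of split the two separate criteria are meant to expose.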
Capability 03 · Delivery Management

Metrics

What you measure to understand AI-assisted engineering productivity and quality.

L1 · Stage 01 · Ad-hoc
Criteria - what to measure
  1. Delivery is tracked with at least basic metrics
  2. Standard delivery metrics are in place (AI-specific metrics come later)
  3. Team acknowledges the need for AI-specific metrics beyond traditional DORA
  4. Basic deployment frequency is at least known (even if not dashboarded)
L2 · Stage 02 · Guided
Criteria - what to measure
  1. DORA metrics are tracked consistently with a dashboard
  2. AI tool license count vs. active usage rate is measured
  3. PR throughput per developer is tracked
  4. AI acceptance rate (% of AI suggestions accepted) is measured per tool
  5. Metrics are reviewed in team retrospectives at least monthly
L3 · Stage 03 · Systematic
Criteria - what to measure
  1. ITS (Iterations-to-Success) is tracked with a target of 1-3
  2. CPI (Cost-per-Iteration) is tracked with a target under $0.50
  3. CI feedback latency is tracked as a metric (time from push to CI result)
  4. Metrics are broken down per team and per repository
  5. Cost tracking includes model API costs, CI compute costs, and runner costs per iteration
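ITS and CPI can both be derived from one log of agent iterations. The sketch below assumes ITS is the mean number of iterations per completed task and CPI is total cost (model API, CI compute, runner) divided by the iteration count; the matrix does not spell out exact formulas, so treat these definitions and the `Iteration` record as assumptions.

```python
from dataclasses import dataclass

@dataclass
class Iteration:
    task_id: str
    model_api_usd: float
    ci_compute_usd: float
    runner_usd: float

def its_and_cpi(iterations):
    """Return (mean iterations per task, mean cost per iteration in USD).

    Assumes every task in the log eventually succeeded.
    """
    tasks = {it.task_id for it in iterations}
    total_usd = sum(it.model_api_usd + it.ci_compute_usd + it.runner_usd
                    for it in iterations)
    return len(iterations) / len(tasks), total_usd / len(iterations)

# Illustrative log: task A took three iterations, task B took one.
log = [Iteration("A", 0.25, 0.125, 0.0625) for _ in range(3)]
log.append(Iteration("B", 0.25, 0.125, 0.0625))
its, cpi = its_and_cpi(log)
print(f"ITS={its:.1f} (target 1-3), CPI=${cpi:.4f} (target under $0.50)")
```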
L4 · Stage 04 · Optimized (most teams aim here)
Criteria - what to measure
  1. Test-oracle reliability is measured and tracked on a dashboard
  2. Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
  3. Merge queue wait time is tracked with a target under 10 minutes
  4. Agent Autonomy Score (% of tasks completed without human intervention) is measured and broken down by task type
  5. Metrics trigger automated alerts when thresholds are breached (e.g., test-oracle reliability drops)
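Threshold-based alerting can be sketched as a table of limits checked against the latest metric values. The two limits below come from the criteria above (auto-approve rate above 60%, merge queue wait under 10 minutes); the metric keys, `THRESHOLDS`, and `breached_thresholds` are hypothetical names, and a real setup would route the messages to a pager or chat channel.

```python
# Each entry: metric name -> (direction, limit).
# "min" means the value must stay above the limit, "max" below it.
THRESHOLDS = {
    "auto_approve_rate": ("min", 0.60),
    "merge_queue_wait_minutes": ("max", 10.0),
}

def breached_thresholds(metrics, thresholds=THRESHOLDS):
    """Return one alert message per metric that misses its target."""
    alerts = []
    for name, (kind, limit) in thresholds.items():
        if name not in metrics:
            continue  # metric not reported this cycle
        value = metrics[name]
        too_low = kind == "min" and value <= limit
        too_high = kind == "max" and value >= limit
        if too_low or too_high:
            alerts.append(f"{name}={value} misses {kind} target {limit}")
    return alerts

alerts = breached_thresholds(
    {"auto_approve_rate": 0.52, "merge_queue_wait_minutes": 7.5}
)
print(alerts)
```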
L5 · Stage 05 · Autonomous
Criteria - what to measure
  1. Cost-per-feature is tracked (not cost-per-PR) - aggregating all agent, CI, and review costs per delivered feature
  2. Business value throughput is the primary metric (features delivered per week, not PRs merged per week)
  3. Metrics system auto-detects vanity metrics (high activity, low value delivery) and flags them
  4. Cost-per-feature trend is declining quarter-over-quarter
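Moving from cost-per-PR to cost-per-feature is mostly a change of grouping key: every cost event is attributed to a feature rather than to the PR that incurred it. A minimal sketch, assuming cost events arrive as `(feature_id, source, usd)` tuples; the tuple shape and feature IDs are hypothetical.

```python
from collections import defaultdict

def cost_per_feature(cost_events):
    """Sum agent, CI, and review costs per delivered feature."""
    totals = defaultdict(float)
    for feature_id, _source, usd in cost_events:
        totals[feature_id] += usd
    return dict(totals)

# Illustrative events: one feature spans two PRs' worth of agent and CI cost.
cost_events = [
    ("FEAT-1", "agent", 1.50),
    ("FEAT-1", "ci", 0.75),
    ("FEAT-1", "review", 4.00),
    ("FEAT-2", "agent", 0.50),
    ("FEAT-2", "ci", 0.25),
]
totals = cost_per_feature(cost_events)
print(totals)
```

Comparing these totals quarter over quarter gives the trend the last criterion asks for.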
Capability 04 · Delivery Management

Governance & Compliance

Controls around AI-generated code - licensing, security scanning, and audit trails.

L4 · Stage 04 · Optimized (most teams aim here)
Criteria - what to measure
  1. Full provenance tracking per change: model version, prompt context, agent session ID, iteration count
  2. Automated compliance checks run without manual intervention on every merge
  3. AI-generated code is distinguishable from human-written code in version control (metadata, labels, or attribution)
  4. Provenance data is queryable (e.g., "show all changes made by model X in the last 30 days")
  5. Compliance check results are aggregated into a governance dashboard
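The provenance fields listed above map naturally onto one record per change, and the example query ("all changes made by model X in the last 30 days") becomes a filter over those records. This sketch keeps records in memory for illustration; the `Provenance` class, its field names, and `changes_by_model` are hypothetical, and a real system would more likely query a database or commit metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Provenance:
    commit: str
    model_version: str
    prompt_context_ref: str   # pointer to stored prompt context, not the text
    agent_session_id: str
    iteration_count: int
    merged_at: datetime

def changes_by_model(records, model_version, now, days=30):
    """Return commits produced by model_version within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return [r.commit for r in records
            if r.model_version == model_version and r.merged_at >= cutoff]

utc = timezone.utc
records = [
    Provenance("abc123", "model-x", "ctx/1", "sess-1", 2, datetime(2026, 4, 20, tzinfo=utc)),
    Provenance("def456", "model-x", "ctx/2", "sess-2", 1, datetime(2026, 3, 1, tzinfo=utc)),
    Provenance("789aaa", "model-y", "ctx/3", "sess-3", 3, datetime(2026, 4, 25, tzinfo=utc)),
]
recent = changes_by_model(records, "model-x", now=datetime(2026, 5, 1, tzinfo=utc))
print(recent)
```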
L5 · Stage 05 · Autonomous
Criteria - what to measure
  1. Continuous compliance: agent monitors regulatory changes (EU AI Act updates, SOC2 changes) and proposes policy updates
  2. Audit trail is self-documenting (agent decisions include reasoning, not just outcomes)
  3. Enterprise-grade RBAC is enforced per agent (Stripe Toolshed model: each agent has scoped permissions for specific tools and repositories)
  4. Policy update proposals from compliance agent are auto-tested against existing codebase before rollout
  5. Agent RBAC permissions are audited automatically for least-privilege compliance
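A least-privilege audit can start from a simple comparison: permissions an agent holds but never exercised are candidates for removal. The sketch below assumes granted and used permissions are available as sets per agent; the agent names, permission strings, and `unused_grants` are hypothetical and do not describe how Stripe Toolshed works.

```python
def unused_grants(granted, used):
    """Return, per agent, the granted permissions that were never used."""
    findings = {}
    for agent, permissions in granted.items():
        unused = permissions - used.get(agent, set())
        if unused:
            findings[agent] = sorted(unused)
    return findings

granted = {
    "agent-a": {"repo:read", "repo:write", "deploy"},
    "agent-b": {"repo:read"},
}
used = {
    "agent-a": {"repo:read", "repo:write"},
    "agent-b": {"repo:read"},
}
findings = unused_grants(granted, used)
print(findings)
```

The usage window matters: a permission unused for a day is noise, while one unused for a quarter is a reasonable revocation candidate.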
Climb the matrix

You don't have to figure this out alone.

Every level in this matrix has a path. Read the playbooks written by the teams that have already climbed it. Run the assessment with our consultants. Start where you are.

Author Commentary

May 2026 update: cost is now a first-class metric.

ccusage hit 13.2k stars on GitHub; /usage and /context shipped as built-in commands; Reddit had multiple "I burned $3,800 overnight" posts traced to runaway subagent loops. The economics also got more honest. Pawel Dolega's "AI subscriptions are on borrowed time" makes the structural case: a $20 Pro plan burns $50-100 of compute, total enterprise LLM spend doubled in six months despite per-token prices falling (Jevons paradox), and labs are quietly testing the waters - Anthropic pulled Claude Code from Pro, GitHub paused Copilot signups. Teams that do not measure cost-per-merged-PR now will face re-pricing emergencies later this year.

Governance follows: per-session spend caps and kill switches are now baseline, not advanced. Restricted-use models (Claude Mythos Preview / Project Glasswing - 93.9% SWE-bench but defensive cybersec only) introduce a new lever - capability-restricted licensing. And Berkeley's April 12 reward-hack research means any policy that auto-approves based on benchmark scores is broken by construction. Stripe Minions is still the L5 north star; the new homework is making sure your L2-L3 metrics don't lie to you on the way there.
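A per-session spend cap with a kill switch can be as small as a running total that refuses further charges once the cap is crossed. This is a minimal sketch under stated assumptions: `SessionBudget`, `SpendCapExceeded`, and the $5 cap are hypothetical, and a production version would also need to stop in-flight subagents, not only reject new charges.

```python
class SpendCapExceeded(RuntimeError):
    """Raised when a session crosses its spend cap or is already killed."""

class SessionBudget:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.killed = False

    def charge(self, usd):
        """Record a cost; trip the kill switch once the cap is exceeded."""
        if self.killed:
            raise SpendCapExceeded("session already killed")
        self.spent_usd += usd
        if self.spent_usd > self.cap_usd:
            self.killed = True
            raise SpendCapExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f} cap"
            )

budget = SessionBudget(cap_usd=5.0)
budget.charge(2.0)
budget.charge(2.5)
try:
    budget.charge(1.0)  # pushes the session over its cap
except SpendCapExceeded as exc:
    print("kill switch tripped:", exc)
```

Keeping the switch latched (`killed` stays true) is what stops a runaway loop: every later call fails fast instead of spending again.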
