Delivery Management

How we manage delivery in the age of agents: from human PR review to an autonomous delivery pipeline.

CI/CD Pipeline

L1 Ad-hoc practices

  • CI pipeline exists but takes longer than 15 minutes
  • Agents receive no real-time CI feedback (wait for full pipeline completion)

Evidence

  • CI pipeline configuration file in repository
  • CI run duration logs showing median > 15 minutes
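Evidence like the duration logs above is easy to check mechanically. A minimal sketch, assuming run durations have been exported from the CI system as seconds (the sample values are hypothetical):

```python
import statistics

def median_ci_minutes(durations_seconds):
    """Return the median CI run duration in minutes."""
    return statistics.median(durations_seconds) / 60

# Hypothetical sample of recent CI run durations (seconds).
runs = [1100, 950, 1320, 870, 1600]
median = median_ci_minutes(runs)
print(f"median CI duration: {median:.1f} min")
# A median above 15 minutes places this check at L1.
print("exceeds 15-minute threshold:", median > 15)
```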

L2 Guided practices

  • CI completes in under 10 minutes (median)
  • Build caching is implemented (dependency cache, build artifact cache)
  • Dedicated CI runners are allocated per team (no shared queue across all teams)

Evidence

  • CI run duration dashboard showing median under 10 minutes
  • Cache configuration in CI pipeline (e.g., actions/cache, Gradle build cache)
  • Runner allocation configuration showing per-team resources

L3 Systematic practices

  • CI completes in under 5 minutes (median)
  • Remote caching is implemented (Bazel remote cache, EngFlow, Gradle Enterprise)
  • Incremental builds run only changed modules or fragments

Evidence

  • CI run duration dashboard showing median under 5 minutes
  • Remote cache configuration and cache hit rate metrics
  • Build configuration showing incremental/changed-only targeting
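Changed-only targeting at L3 reduces to mapping a diff onto build modules. A minimal sketch, assuming a flat layout where each top-level directory is a module (all module and file names are hypothetical):

```python
def affected_modules(changed_files, module_roots):
    """Select only the modules whose files changed (changed-only targeting)."""
    hit = set()
    for path in changed_files:
        for module in module_roots:
            if path.startswith(module + "/"):
                hit.add(module)
    return sorted(hit)

# Hypothetical repo layout and a diff touching two of four modules.
modules = ["billing", "auth", "search", "ui"]
changed = ["auth/session.py", "auth/tests/test_session.py", "ui/theme.css"]
print(affected_modules(changed, modules))  # → ['auth', 'ui']
```

Build systems like Bazel derive this from the dependency graph rather than path prefixes, which also catches downstream modules that depend on the changed ones.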

L4 Optimized practices

  • CI completes in under 2 minutes (median)
  • Ephemeral sandbox environments spin up in under 10 seconds for agent CI loops
  • Agent sandbox CI supports 50+ iteration attempts in 5 minutes without blocking the team CI queue

Evidence

  • CI run duration dashboard showing median under 2 minutes
  • Sandbox spin-up time metrics showing sub-10-second P50
  • Agent CI iteration logs showing 50+ attempts within 5-minute windows

L5 Autonomous practices

  • CI provides sub-minute feedback for standard changes
  • CI auto-scales runner capacity based on agent load (no manual capacity planning)
  • Production feedback loop auto-adjusts the CI test suite (adds tests for observed failures, removes redundant tests)

Evidence

  • CI run duration dashboard showing sub-minute median for standard changes
  • Auto-scaling configuration and runner utilization metrics
  • Test suite change log showing production-feedback-driven additions and removals

Merge & Deploy

L1 Ad-hoc practices

  • PRs require manual human review and manual merge
  • Team capacity is approximately 10 PRs per day or fewer

Evidence

  • PR merge history showing manual approvals
  • Deploy logs showing manual trigger or simple CD pipeline

L2 Guided practices

  • Merge queue is implemented (GitHub merge queue, Mergify, or equivalent)
  • Auto-rebase is enabled for PRs targeting the main branch
  • CD pipeline includes at least one gate (tests pass, security scan, approval)

Evidence

  • Merge queue configuration in repository settings or CI
  • Auto-rebase configuration (branch protection rules, bot configuration)
  • CD pipeline definition showing gate conditions

L3 Systematic practices

  • Policy-based merge rules are enforced (OPA, branch protection, or equivalent)
  • Deterministic merge ordering with conflict detection prevents concurrent merge failures
  • PRs require a maximum of 2 CI rounds before merge (Stripe benchmark)

Evidence

  • Policy-as-code configuration (OPA rules, branch protection API config)
  • CI round count per PR metrics showing 2-round maximum adherence
  • Merge ordering logs showing deterministic processing
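Deterministic ordering plus conflict detection can be sketched as batching PRs by file overlap: PRs are processed in a stable order, and any PR touching files already in flight is pushed into a later, serialized batch. This is an illustrative model, not how GitHub's merge queue or Mergify actually implement it:

```python
def order_merges(prs):
    """
    Deterministically order PRs for the merge queue and serialize any PR
    that touches the same files as one already in the current batch, so
    conflicting PRs are never merged concurrently.
    Each PR is a dict with an 'id' and the set of 'files' it modifies.
    """
    ordered = sorted(prs, key=lambda pr: pr["id"])  # stable, deterministic
    batches, batch, in_flight_files = [], [], set()
    for pr in ordered:
        if pr["files"] & in_flight_files:
            # Conflict with the current batch: close it, start a new one.
            batches.append(batch)
            batch, in_flight_files = [], set()
        batch.append(pr["id"])
        in_flight_files |= pr["files"]
    if batch:
        batches.append(batch)
    return batches

prs = [
    {"id": 12, "files": {"api.py"}},
    {"id": 7,  "files": {"db.py"}},
    {"id": 9,  "files": {"api.py", "cli.py"}},  # overlaps with PR 12
]
print(order_merges(prs))  # → [[7, 9], [12]]
```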

L4 Optimized practices

  • Green-classified PRs auto-merge and auto-deploy without human intervention
  • Team throughput exceeds 50 PRs per day
  • Canary or progressive deployment is automated (no manual rollout decisions)

Evidence

  • Auto-merge and auto-deploy logs for Green PRs
  • PR throughput dashboard showing 50+ per day
  • Canary deployment configuration with automated promotion/rollback rules
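Automated canary promotion at L4 boils down to a mechanical rule on live metrics. A minimal sketch with an illustrative error-rate comparison; real systems typically apply statistical tests across several signals, and the tolerance value here is not a recommendation:

```python
def canary_decision(canary_error_rate, baseline_error_rate, tolerance=0.001):
    """
    Automated promotion rule: promote the canary only if its error rate
    stays within `tolerance` of the baseline; otherwise roll back.
    """
    if canary_error_rate <= baseline_error_rate + tolerance:
        return "promote"
    return "rollback"

print(canary_decision(0.0012, 0.0010))  # within tolerance → 'promote'
print(canary_decision(0.0150, 0.0010))  # regression → 'rollback'
```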

L5 Autonomous practices

  • Merge throughput sustains 1,000+ merges per week
  • Fully autonomous pipeline: the agent produces the PR, CI passes, then merge, deploy, and observe happen with no human in the loop
  • Rollback is agent-driven (agent detects regression, reverts, and opens a fix PR)

Evidence

  • Merge throughput dashboard showing 1,000+ per week
  • End-to-end autonomous pipeline logs (PR to production with no human steps)
  • Agent-driven rollback logs with timestamps and success rate

Metrics

L1 Ad-hoc practices

  • DORA metrics (deployment frequency, lead time, change failure rate, MTTR) are not tracked, or tracked inconsistently
  • No AI-specific metrics exist

Evidence

  • Absence of metrics dashboard, or inconsistent/manual tracking
  • No AI-specific fields in existing metrics systems

L2 Guided practices

  • DORA metrics are tracked consistently with a dashboard
  • AI tool license count vs. active usage rate is measured
  • PR throughput per developer is tracked

Evidence

  • DORA metrics dashboard with current data
  • License utilization report (licenses purchased vs. active users)
  • PR throughput chart showing per-developer breakdown

L3 Systematic practices

  • ITS (Iterations-to-Success) is tracked with a target of 1-3
  • CPI (Cost-per-Iteration) is tracked with a target under $0.50
  • CI feedback latency is tracked as a metric (time from push to CI result)

Evidence

  • ITS dashboard showing iteration count distribution per PR
  • CPI dashboard showing cost per CI iteration
  • CI feedback latency chart with P50, P95, P99 breakdowns
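ITS and CPI as defined above are cheap to compute from CI records. A minimal sketch, with hypothetical sample data:

```python
import statistics

def its(ci_rounds_per_pr):
    """Iterations-to-Success: median CI rounds a PR needs before it passes."""
    return statistics.median(ci_rounds_per_pr)

def cpi(total_ci_cost_usd, total_iterations):
    """Cost-per-Iteration: average spend per CI iteration."""
    return total_ci_cost_usd / total_iterations

rounds = [1, 2, 1, 3, 2, 1, 2]        # hypothetical per-PR CI round counts
print("ITS:", its(rounds))             # target range: 1-3
print("CPI: $%.2f" % cpi(84.0, 240))   # target: under $0.50
```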

L4 Optimized practices

  • TORS > 95% is measured and tracked on a dashboard
  • Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
  • Merge queue wait time is tracked with a target under 10 minutes

Evidence

  • TORS dashboard showing 95%+ with per-service breakdown
  • Auto-approve rate report showing 60%+ Green target
  • Merge queue wait time chart showing sub-10-minute target

L5 Autonomous practices

  • Cost-per-feature is tracked (not cost-per-PR), aggregating all agent, CI, and review costs per delivered feature
  • Business value throughput is the primary metric (features delivered per week, not PRs merged per week)

Evidence

  • Cost-per-feature dashboard with feature-level cost attribution
  • Business value throughput chart correlated with product delivery milestones
  • Quarter-over-quarter cost-per-feature trend report
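Cost-per-feature attribution can be sketched as a simple aggregation over cost events tagged with a feature ID. The event shape, cost categories, and feature IDs below are assumptions for illustration:

```python
def cost_per_feature(events):
    """
    Aggregate agent, CI, and review costs per delivered feature.
    Each event is a (feature_id, cost_category, usd) tuple; categories
    here are illustrative ('agent', 'ci', 'review').
    """
    totals = {}
    for feature_id, _category, usd in events:
        totals[feature_id] = totals.get(feature_id, 0.0) + usd
    return totals

events = [
    ("FEAT-1", "agent", 12.40), ("FEAT-1", "ci", 3.10), ("FEAT-1", "review", 6.00),
    ("FEAT-2", "agent", 4.20),  ("FEAT-2", "ci", 1.90),
]
print(cost_per_feature(events))
```

The hard part in practice is not the arithmetic but the tagging: agent sessions, CI runs, and review time all need a reliable feature identifier before this aggregation means anything.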

Governance & Compliance

L1 Ad-hoc practices

  • No official AI tool policy exists
  • No audit trail for AI-generated code (who used what model, when, on what code)

Evidence

  • Absence of a written AI tool policy
  • No AI-related fields in commit metadata or PR templates

L2 Guided practices

  • Official AI tool policy exists and is communicated to all developers
  • Basic audit tracking is in place (which developers use which AI tools)
  • EU AI Act awareness training or briefing has been conducted

Evidence

  • Published AI tool policy document with distribution records
  • AI tool usage tracking dashboard or report
  • EU AI Act training completion records

L3 Systematic practices

  • Minimum viable audit trail is captured per AI-assisted change: model identifier, timestamp, context description, human approver
  • Policy-as-code enforces compliance rules in CI (OPA or equivalent)
  • Compliance gates run on every PR to in-scope repositories

Evidence

  • Sample commit or PR metadata showing model, timestamp, context, approver fields
  • OPA policy configuration in CI pipeline
  • Compliance gate pass/fail logs
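The minimum viable audit trail translates directly into a compliance gate. A minimal Python sketch of the rule; in practice it would be expressed as policy-as-code (e.g., an OPA/Rego policy evaluated in CI) as described above. The field names mirror the four required items; the sample metadata, including the model name, is hypothetical:

```python
REQUIRED_FIELDS = ("model", "timestamp", "context", "approver")

def compliance_gate(pr_metadata):
    """
    Minimal audit-trail gate: fail the PR if any required provenance
    field is missing or empty, and report which ones.
    """
    missing = [f for f in REQUIRED_FIELDS if not pr_metadata.get(f)]
    return ("pass", []) if not missing else ("fail", missing)

pr = {"model": "gpt-x", "timestamp": "2026-04-01T12:00:00Z", "context": "refactor billing"}
print(compliance_gate(pr))  # → ('fail', ['approver'])
```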

L4 Optimized practices

  • Full provenance tracking per change: model version, prompt context, agent session ID, iteration count
  • Automated compliance checks run without manual intervention on every merge
  • AI-generated code is distinguishable from human-written code in version control (metadata, labels, or attribution)

Evidence

  • Provenance metadata on commits/PRs showing the full attribution chain
  • Automated compliance check configuration with zero manual steps
  • VCS query showing AI-vs-human code distinction

L5 Autonomous practices

  • Continuous compliance: agent monitors regulatory changes (EU AI Act updates, SOC2 changes) and proposes policy updates
  • Audit trail is self-documenting (agent decisions include reasoning, not just outcomes)
  • Enterprise-grade RBAC is enforced per agent (Stripe Toolshed model: each agent has scoped permissions for specific tools and repositories)

Evidence

  • Compliance agent logs showing regulatory monitoring and policy update proposals
  • Self-documenting audit trail entries with agent reasoning chains
  • Agent RBAC configuration showing per-agent tool and repository permissions
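Per-agent scoped permissions can be modeled as a grant table keyed by agent. This is a generic sketch in the spirit of the scoped-permission idea, not Stripe's actual Toolshed implementation; all agent, tool, and repository names are hypothetical:

```python
# Hypothetical per-agent grant table: each agent may use only specific
# tools against specific repositories.
AGENT_GRANTS = {
    "refactor-bot": {"tools": {"git", "test-runner"}, "repos": {"payments"}},
    "docs-bot":     {"tools": {"git"},                "repos": {"docs", "payments"}},
}

def authorize(agent, tool, repo):
    """Allow the action only if the agent holds both the tool and repo grant."""
    grant = AGENT_GRANTS.get(agent)
    return bool(grant) and tool in grant["tools"] and repo in grant["repos"]

print(authorize("refactor-bot", "test-runner", "payments"))  # → True
print(authorize("docs-bot", "test-runner", "docs"))          # → False
```

Denying by default (an unknown agent gets no access) is the important design choice; the grant table itself would normally live in the same policy-as-code repository as the compliance rules.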

Author Commentary

April 2026 update: The productivity paradox is real. Teams report 2-3x more PRs merged, yet customer-facing feature velocity often stays flat. The culprit: vanity metrics. PR throughput per dev (L2) is a starting point, but without quality-adjusted metrics (ITS, CPI, TORS at L3-L4) you're measuring motion, not progress. AI-generated code ships faster but breaks more often if your review and CI gates aren't keeping up. Invest in quality-adjusted metrics before celebrating raw throughput numbers.

Stripe Minions remains the best public case study of enterprise coding agents. The key pattern: Slack invocation → isolated sandbox (10s spin-up) → MCP context → CI loop (max 2 rounds) → human review → merge. This isn't sci-fi; it's a working production system on one of the most demanding codebases in the world. But note: Stripe built this on YEARS of investment in developer tooling. Without fast CI, solid MCP, and mature sandboxes, agents don't work. L3 is the prerequisite.