Delivery Management
How we manage delivery in the age of agents: from human PR review to an autonomous delivery pipeline.
CI/CD Pipeline
- CI pipeline exists but takes longer than 15 minutes
- Agents receive no real-time CI feedback (wait for full pipeline completion)
Evidence
- CI pipeline configuration file in repository
- CI run duration logs showing median > 15 minutes
- CI completes in under 10 minutes (median)
- Build caching is implemented (dependency cache, build artifact cache)
- Dedicated CI runners are allocated per team (no shared queue across all teams)
Evidence
- CI run duration dashboard showing median under 10 minutes
- Cache configuration in CI pipeline (e.g., actions/cache, Gradle build cache)
- Runner allocation configuration showing per-team resources
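The median-duration criteria above are easy to verify directly from exported CI run data rather than eyeballing a dashboard. A minimal sketch, assuming run durations have already been exported as seconds (the sample values and variable names are illustrative):

```python
from statistics import median, quantiles

# Hypothetical export of recent CI run durations, in seconds.
ci_run_durations = [412, 530, 488, 601, 455, 720, 390, 505]

def ci_duration_summary(durations_s):
    """Return median and P95 CI run duration in minutes."""
    p95 = quantiles(durations_s, n=20)[-1]  # last cut point = 95th percentile
    return {
        "median_min": median(durations_s) / 60,
        "p95_min": p95 / 60,
    }

summary = ci_duration_summary(ci_run_durations)
# Check the under-10-minutes (median) criterion.
meets_target = summary["median_min"] < 10
```

Tracking P95 alongside the median matters for agent loops: an agent iterating against CI feels the tail latency, not the typical run.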
- CI completes in under 5 minutes (median)
- Remote caching is implemented (Bazel remote cache, EngFlow, Gradle Enterprise)
- Incremental builds run only changed modules or fragments
Evidence
- CI run duration dashboard showing median under 5 minutes
- Remote cache configuration and cache hit rate metrics
- Build configuration showing incremental/changed-only targeting
- CI completes in under 2 minutes (median)
- Ephemeral sandbox environments spin up in under 10 seconds for agent CI loops
- Agent sandbox CI supports 50+ iteration attempts in 5 minutes without blocking the team CI queue
Evidence
- CI run duration dashboard showing median under 2 minutes
- Sandbox spin-up time metrics showing sub-10-second P50
- Agent CI iteration logs showing 50+ attempts within 5-minute windows
- CI provides sub-minute feedback for standard changes
- CI auto-scales runner capacity based on agent load (no manual capacity planning)
- Production feedback loop auto-adjusts the CI test suite (adds tests for observed failures, removes redundant tests)
Evidence
- CI run duration dashboard showing sub-minute median for standard changes
- Auto-scaling configuration and runner utilization metrics
- Test suite change log showing production-feedback-driven additions and removals
Merge & Deploy
- PRs require manual human review and manual merge
- Team capacity is approximately 10 PRs per day or fewer
Evidence
- PR merge history showing manual approvals
- Deploy logs showing manual trigger or simple CD pipeline
- Merge queue is implemented (GitHub merge queue, Mergify, or equivalent)
- Auto-rebase is enabled for PRs targeting the main branch
- CD pipeline includes at least one gate (tests pass, security scan, approval)
Evidence
- Merge queue configuration in repository settings or CI
- Auto-rebase configuration (branch protection rules, bot configuration)
- CD pipeline definition showing gate conditions
- Policy-based merge rules are enforced (OPA, branch protection, or equivalent)
- Deterministic merge ordering with conflict detection prevents concurrent merge failures
- PRs require a maximum of 2 CI rounds before merge (Stripe benchmark)
Evidence
- Policy-as-code configuration (OPA rules, branch protection API config)
- CI round count per PR metrics showing 2-round maximum adherence
- Merge ordering logs showing deterministic processing
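One way to read the "deterministic merge ordering with conflict detection" criterion: PRs are processed in a stable, reproducible order, and any PR whose changes overlap an earlier PR in the batch is kicked back for rebase rather than merged concurrently. A minimal sketch, with an illustrative `QueuedPR` shape and a file-overlap heuristic standing in for real conflict detection (production merge queues typically run speculative CI against the batch head instead):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueuedPR:
    number: int
    queued_at: float           # epoch seconds when the PR entered the queue
    touched_files: frozenset   # files the PR modifies

def deterministic_merge_order(queue):
    """Order PRs by enqueue time (ties broken by PR number), then flag
    any PR whose file set overlaps an earlier PR in the batch."""
    ordered = sorted(queue, key=lambda pr: (pr.queued_at, pr.number))
    seen_files, mergeable, conflicted = set(), [], []
    for pr in ordered:
        if pr.touched_files & seen_files:
            conflicted.append(pr.number)   # needs rebase + retest
        else:
            mergeable.append(pr.number)
            seen_files |= pr.touched_files
    return mergeable, conflicted
```

The point of the deterministic sort key is that two runs over the same queue always produce the same order, which is what makes merge-ordering logs auditable.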
- Green-classified PRs auto-merge and auto-deploy without human intervention
- Team throughput exceeds 50 PRs per day
- Canary or progressive deployment is automated (no manual rollout decisions)
Evidence
- Auto-merge and auto-deploy logs for Green PRs
- PR throughput dashboard showing 50+ per day
- Canary deployment configuration with automated promotion/rollback rules
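The Green/Yellow/Red split above implies a classification function sitting between CI and the deploy pipeline. A minimal sketch of that gate; the signal names and thresholds here are illustrative assumptions, not a prescribed rubric:

```python
def classify_pr(ci_passed, security_scan_passed, risk_score, coverage_delta):
    """Classify a PR as 'green' (auto-merge + auto-deploy), 'yellow'
    (route to human review), or 'red' (block).

    risk_score: hypothetical 0.0-1.0 risk estimate for the change.
    coverage_delta: change in test coverage; negative means coverage dropped.
    """
    if not ci_passed or not security_scan_passed:
        return "red"
    if risk_score < 0.2 and coverage_delta >= 0.0:
        return "green"
    return "yellow"

def should_auto_deploy(pr_signals):
    """Only green-classified PRs skip the human step entirely."""
    return classify_pr(**pr_signals) == "green"
```

Whatever the real signals are, the key property is that the classification is logged per PR, since the auto-approve rate tracked in the Metrics section is derived from exactly these decisions.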
- Merge throughput sustains 1,000+ merges per week
- Full autonomous pipeline: the agent produces a PR, CI passes, and the change merges, deploys, and is observed with no human in the loop
- Rollback is agent-driven (agent detects regression, reverts, and opens a fix PR)
Evidence
- Merge throughput dashboard showing 1,000+ per week
- End-to-end autonomous pipeline logs (PR to production with no human steps)
- Agent-driven rollback logs with timestamps and success rate
Metrics
- DORA metrics (deployment frequency, lead time, change failure rate, MTTR) are not tracked, or tracked inconsistently
- No AI-specific metrics exist
Evidence
- Absence of metrics dashboard or inconsistent/manual tracking
- No AI-specific fields in existing metrics systems
- DORA metrics are tracked consistently with a dashboard
- AI tool license count vs. active usage rate is measured
- PR throughput per developer is tracked
Evidence
- DORA metrics dashboard with current data
- License utilization report (licenses purchased vs. active users)
- PR throughput chart showing per-developer breakdown
- ITS (Iterations-to-Success) is tracked with a target of 1-3
- CPI (Cost-per-Iteration) is tracked with a target under $0.50
- CI feedback latency is tracked as a metric (time from push to CI result)
Evidence
- ITS dashboard showing iteration count distribution per PR
- CPI dashboard showing cost per CI iteration
- CI feedback latency chart with P50, P95, P99 breakdowns
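ITS and CPI fall out of the same per-PR iteration log. A minimal sketch of how they relate, with a hypothetical log format of `(pr_number, ci_iterations, total_cost_usd)` tuples:

```python
from statistics import mean

# Hypothetical per-PR agent iteration log:
# (PR number, CI iterations before merge, total agent+CI cost in USD).
iteration_log = [
    (101, 2, 0.60),
    (102, 1, 0.25),
    (103, 4, 1.80),
]

def its_and_cpi(log):
    """ITS: mean CI iterations a PR needs before it merges.
    CPI: mean cost of a single CI iteration (total spend / total iterations)."""
    its = mean(iters for _, iters, _ in log)
    total_iters = sum(iters for _, iters, _ in log)
    total_cost = sum(cost for _, _, cost in log)
    return its, total_cost / total_iters

its, cpi = its_and_cpi(iteration_log)
# Rubric targets: ITS in the 1-3 range, CPI under $0.50.
```

Note that the two metrics pull against each other: driving ITS down with more expensive, higher-context attempts can push CPI up, which is why the rubric sets targets for both.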
- TORS > 95% is measured and tracked on a dashboard
- Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
- Merge queue wait time is tracked with a target under 10 minutes
Evidence
- TORS dashboard showing 95%+ with per-service breakdown
- Auto-approve rate report showing 60%+ Green target
- Merge queue wait time chart showing sub-10-minute target
- Cost-per-feature is tracked (not cost-per-PR), aggregating all agent, CI, and review costs per delivered feature
- Business value throughput is the primary metric (features delivered per week, not PRs merged per week)
Evidence
- Cost-per-feature dashboard with feature-level cost attribution
- Business value throughput chart correlated with product delivery milestones
- Quarter-over-quarter cost-per-feature trend report
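Cost-per-feature is just cost-per-PR with a different grouping key: every cost event (agent tokens, CI minutes, review time) is tagged with the feature it served and summed. A minimal sketch, with hypothetical feature names and event records:

```python
from collections import defaultdict

# Hypothetical cost events, each tagged with the feature it contributed to.
cost_events = [
    {"feature": "checkout-v2",    "kind": "agent",  "usd": 14.20},
    {"feature": "checkout-v2",    "kind": "ci",     "usd": 6.80},
    {"feature": "checkout-v2",    "kind": "review", "usd": 25.00},
    {"feature": "search-filters", "kind": "agent",  "usd": 4.10},
    {"feature": "search-filters", "kind": "ci",     "usd": 2.30},
]

def cost_per_feature(events):
    """Aggregate agent, CI, and review spend per delivered feature,
    rather than per PR."""
    totals = defaultdict(float)
    for event in events:
        totals[event["feature"]] += event["usd"]
    return dict(totals)

totals = cost_per_feature(cost_events)
```

The hard part in practice is not the arithmetic but the tagging: every agent session, CI run, and review has to carry a feature identifier, which is itself a provenance requirement (see Governance & Compliance).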
Governance & Compliance
- No official AI tool policy exists
- No audit trail for AI-generated code (who used what model, when, on what code)
Evidence
- Absence of written AI tool policy
- No AI-related fields in commit metadata or PR templates
- Official AI tool policy exists and is communicated to all developers
- Basic audit tracking is in place (which developers use which AI tools)
- EU AI Act awareness training or briefing has been conducted
Evidence
- Published AI tool policy document with distribution records
- AI tool usage tracking dashboard or report
- EU AI Act training completion records
- Minimum viable audit trail is captured per AI-assisted change: model identifier, timestamp, context description, human approver
- Policy-as-code enforces compliance rules in CI (OPA or equivalent)
- Compliance gates run on every PR to in-scope repositories
Evidence
- Sample commit or PR metadata showing model, timestamp, context, approver fields
- OPA policy configuration in CI pipeline
- Compliance gate pass/fail logs
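One cheap way to implement the minimum viable audit trail is `Key: value` trailers in the commit message, checked by a CI gate. A minimal sketch; the trailer names, model identifier, and example message are illustrative, and the parser is deliberately simplified (real trailer parsing, as in `git interpret-trailers`, only considers the final block of the message):

```python
REQUIRED_FIELDS = ("ai-model", "ai-timestamp", "ai-context", "approved-by")

def parse_trailers(commit_message):
    """Collect `Key: value` lines from a commit message (simplified:
    scans every line, not just the trailing trailer block)."""
    trailers = {}
    for line in commit_message.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            trailers[key.strip().lower()] = value.strip()
    return trailers

def audit_trail_missing(commit_message):
    """Return the required audit fields that are absent or empty
    (an empty list means the change is compliant)."""
    trailers = parse_trailers(commit_message)
    return [f for f in REQUIRED_FIELDS if not trailers.get(f)]

msg = """Fix flaky retry logic

AI-Model: example-model-v1
AI-Timestamp: 2026-04-02T14:31:00Z
AI-Context: retry-backoff refactor
Approved-By: jane.doe
"""
missing = audit_trail_missing(msg)  # empty list: all four fields present
```

A CI compliance gate then reduces to failing the build whenever `audit_trail_missing` returns a non-empty list for any commit in the PR.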
- Full provenance tracking per change: model version, prompt context, agent session ID, iteration count
- Automated compliance checks run without manual intervention on every merge
- AI-generated code is distinguishable from human-written code in version control (metadata, labels, or attribution)
Evidence
- Provenance metadata on commits/PRs showing full attribution chain
- Automated compliance check configuration with zero manual steps
- VCS query showing AI-vs-human code distinction
- Continuous compliance: agent monitors regulatory changes (EU AI Act updates, SOC2 changes) and proposes policy updates
- Audit trail is self-documenting (agent decisions include reasoning, not just outcomes)
- Enterprise-grade RBAC is enforced per agent (Stripe Toolshed model: each agent has scoped permissions for specific tools and repositories)
Evidence
- Compliance agent logs showing regulatory monitoring and policy update proposals
- Self-documenting audit trail entries with agent reasoning chains
- Agent RBAC configuration showing per-agent tool and repository permissions
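The per-agent RBAC criterion boils down to a deny-by-default lookup: an agent may invoke a tool against a repository only if its scope explicitly includes both. A minimal sketch; the agent names, tools, and repositories are hypothetical, and this stands in for whatever policy engine actually enforces the rules:

```python
# Hypothetical per-agent permission registry: each agent is scoped to
# specific tools and repositories (Toolshed-style RBAC).
AGENT_PERMISSIONS = {
    "refactor-bot": {
        "tools": {"git", "test-runner"},
        "repos": {"payments-service"},
    },
    "docs-bot": {
        "tools": {"git"},
        "repos": {"docs-site", "payments-service"},
    },
}

def is_allowed(agent, tool, repo):
    """Deny by default: allow only when the agent's scope explicitly
    includes both the requested tool and the target repository."""
    scope = AGENT_PERMISSIONS.get(agent)
    if scope is None:
        return False
    return tool in scope["tools"] and repo in scope["repos"]
```

Keeping the registry declarative like this is what makes the evidence item above (RBAC configuration showing per-agent permissions) trivially auditable.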
Author Commentary
April 2026 update: The productivity paradox is real. Teams report 2-3x more PRs merged, yet customer-facing feature velocity often stays flat. The culprit is vanity metrics. PR throughput per developer (Level 2) is a starting point, but without quality-adjusted metrics such as ITS, CPI, and TORS (Levels 3-4), you are measuring motion, not progress. AI-generated code ships faster but breaks more often if your review and CI gates are not keeping up. Invest in quality-adjusted metrics before celebrating raw throughput numbers.

Stripe Minions remains the best public case study of enterprise coding agents. The key pattern: Slack invocation → isolated sandbox (10-second spin-up) → MCP context → CI loop (max 2 rounds) → human review → merge. This is not sci-fi; it is a working production system on one of the most demanding codebases in the world. But note: Stripe built this on years of investment in developer tooling. Without fast CI, solid MCP integration, and mature sandboxes, agents do not work. Level 3 is the prerequisite.