Development
How developers work with AI day-to-day, from sidebar chat to agent fleets.
Coding Agent Usage
Level 1
- At least one AI coding assistant (Copilot, Cursor, Claude Code) is installed and active for at least one developer
- AI autocomplete or chat is used at least once per week by the team
Evidence
- IDE plugin install count or license allocation records
- Git history showing AI-assisted commits (Copilot attribution tags or similar)
Level 2
- At least one agentic IDE (Cursor, Windsurf, or Claude Code) is used by 50%+ of the team
- CLAUDE.md, .cursorrules, or an equivalent agent instruction file exists in 100% of active repositories
- Agents operate in agentic/YOLO mode (multi-step edits without per-step approval)
Evidence
- Agent instruction files committed in the repository root
- IDE telemetry or license dashboard showing agentic-mode usage
- PR descriptions referencing agent-assisted development
Level 3
- CLI agents (Claude Code, Codex) are the primary coding interface for 50%+ of feature work
- Per-team or per-repo rules files exist and are maintained through code review
- Coding conventions are written as explicit, agent-parseable rules (not implicit tribal knowledge)
Evidence
- CLI agent session logs or telemetry showing primary usage
- Rules files in the repository with commit history showing regular updates
- Coding conventions document cross-referenced from agent instruction files
Level 4
- Unattended agents (Stripe Minions model, Cursor Automations) execute tasks without a developer present
- Agents are invocable from at least two channels (Slack, CLI, web, PagerDuty)
- Each developer runs 3-5 agent sessions in parallel
Evidence
- Agent invocation logs from multiple channels, with timestamps
- Dashboard showing parallel agent session counts per developer
- PR history showing agent-authored PRs merged without synchronous developer oversight
Level 5
- A multi-agent orchestration system (planner-worker hierarchy) is in production
- The agent fleet sustains 100+ concurrent agents on the codebase
- The agent fleet produces 1,000+ commits per week without manual dispatch
Evidence
- Orchestration system dashboard showing planner-worker task flow
- Git history showing 1,000+ weekly commits attributed to the agent fleet
- Agent fleet monitoring showing concurrent agent count and error-recovery rate
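A planner-worker hierarchy like the one described above can be sketched in a few lines. Everything here (the goal string, the fake decomposition, the in-process worker) is a hypothetical stand-in: in production the planner is an LLM call and each task runs in a sandboxed agent session with retries and error recovery.

```python
# Minimal planner-worker sketch (all task content is hypothetical).
from queue import Queue

def planner(goal: str) -> list[str]:
    # In production this is an LLM call; here we fake the decomposition.
    return [f"{goal}: step {i}" for i in range(1, 4)]

def worker(task_queue: Queue, results: list[str]) -> None:
    # Drain the queue; each task would normally spawn an agent session.
    while not task_queue.empty():
        task = task_queue.get()
        results.append(f"done: {task}")
        task_queue.task_done()

tasks: Queue = Queue()
results: list[str] = []
for t in planner("migrate logging"):
    tasks.put(t)
worker(tasks, results)
```

A real fleet replaces the single `worker` call with a pool of concurrent workers and feeds results back to the planner for re-planning.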
Context Engineering
Level 1
- The agent sees only the currently open file (no project-wide context)
- No structured context files (CLAUDE.md, AGENTS.md) exist in the repository
Evidence
- Absence of agent instruction files in the repository
- README.md last modified more than 6 months ago
Level 2
- CLAUDE.md or equivalent exists with project description, tech stack, and top conventions
- A written coding conventions document exists and is referenced from agent instruction files
- Agent instruction files are committed to the repository (not local-only)
Evidence
- CLAUDE.md, .cursorrules, or .github/copilot-instructions.md in the repository root
- Coding conventions document accessible from agent instruction files
- Commit history showing agent instruction file updates
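A baseline instruction file does not need to be elaborate. A hypothetical CLAUDE.md sketch (the project name, stack, and conventions are invented for illustration):

```markdown
# CLAUDE.md — agent instructions (hypothetical example)

## Project
acme-billing: invoicing service. Python 3.12, FastAPI, PostgreSQL.

## Conventions
- All money amounts are integer cents; never use floats for currency.
- New endpoints require a unit test and an entry in docs/api.md.
- Run `make lint test` before proposing a commit.
```

The same content works in .cursorrules or .github/copilot-instructions.md; what matters is that it is committed and reviewed like code.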
Level 3
- MCP servers provide structured context (architecture, ownership, SLAs) to agents
- Context is organized across at least 3 of the 5 levels: System, Code, Org, Historical, Operational
- Token budget management is implemented (agents receive context within defined token limits)
Evidence
- MCP server configuration files listing active context sources
- Token budget configuration in agent settings
- Context coverage audit showing 3+ context levels populated
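Token budget management can start as a simple priority-ordered trim. A minimal sketch, assuming a crude 4-characters-per-token estimate (the section contents and the budget are invented; a real implementation would use the model's actual tokenizer):

```python
# Sketch of token-budget management for agent context (names hypothetical).
# Sections are ranked by priority; lower-priority sections are dropped until
# the assembled context fits the budget.

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_to_budget(sections: list[tuple[int, str]], budget: int) -> str:
    """sections are (priority, text) pairs; lower number = more important."""
    kept, used = [], 0
    for _, text in sorted(sections, key=lambda s: s[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return "\n\n".join(kept)

context = fit_to_budget(
    [(0, "System: payment service, Python/FastAPI."),
     (1, "Code: module map and ownership table..."),
     (2, "Historical: last 20 incident summaries...")],
    budget=20,
)
```

With a budget of 20 tokens, the highest-priority sections are kept and the historical section is dropped; production systems would summarize rather than drop, but the budget discipline is the same.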
Level 4
- The organization pushes context to agents automatically (BYOC: Bring Your Own Context)
- A knowledge graph (Graph Buddy, CodeTale, or equivalent) is integrated with the agent context pipeline
- Ticket-to-spec automation generates acceptance tests from requirements without manual writing
Evidence
- BYOC pipeline configuration showing automated context-push triggers
- Knowledge graph dashboard showing repository coverage percentage
- Sample ticket-to-spec outputs with auto-generated acceptance tests
Level 5
- Agents maintain persistent identity and memory across sessions (Beads/Git-backed)
- Production telemetry feeds back into agent context automatically (deploy, error, and performance data)
- Agents detect stale documentation and update it without human initiation
Evidence
- Agent memory store with session-spanning entries and timestamps
- Production telemetry-to-context pipeline configuration with update frequency
- Git history showing agent-authored documentation updates with passing CI
Code Review & Quality
Level 1
- All code is reviewed by a human before merge
- No automated review tooling beyond basic CI checks
Evidence
- PR approval records showing a human reviewer on every merged PR
- Average review turnaround time in PR analytics
Level 2
- An AI-assisted review tool (CodeRabbit, Qodo, or equivalent) is active on all repositories
- Linter rules are configured and run in CI on every PR
- PRs clearly indicate whether code is AI-generated or AI-assisted (labels, tags, or commit metadata)
Evidence
- AI review tool configuration in the CI pipeline
- Linter configuration file in the repository
- PR labels or commit metadata distinguishing AI-generated code
Level 3
- An AI review agent runs as a first-pass reviewer on every PR before human review
- Lint rules enforce architectural standards, not just style: the "Bug to Codify to Lint Rule" pipeline is active
- At least 3 architectural guardrail rules have been created from past bugs or incidents
Evidence
- CI configuration showing the AI review agent as a required check
- Lint rule change history showing rules created from incident post-mortems
- AI review agent output logs with severity categories
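The "Bug to Codify to Lint Rule" pipeline turns an incident into an executable guardrail. A minimal sketch, assuming a past outage caused by an HTTP call that hung with no timeout (the incident and rule are illustrative; a real setup would package this as a flake8, Ruff, or Semgrep rule):

```python
# Hypothetical guardrail codified from an incident: flag requests.get/post
# calls made without a timeout= argument, using a simple AST walk.
import ast

def missing_timeout_calls(source: str) -> list[int]:
    """Return line numbers of requests.get/post calls lacking timeout=."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            value = node.func.value
            if (isinstance(value, ast.Name) and value.id == "requests"
                    and node.func.attr in {"get", "post"}
                    and not any(kw.arg == "timeout" for kw in node.keywords)):
                flagged.append(node.lineno)
    return flagged

bad = "import requests\nr = requests.get(url)\n"
good = "import requests\nr = requests.get(url, timeout=5)\n"
```

Each rule like this encodes one post-mortem lesson, so the check fires in CI instead of relying on reviewers remembering the incident.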
Level 4
- Automated Green/Yellow/Red classification runs on every PR
- Green-classified PRs auto-merge without human review
- An auto-approve target of 60%+ Green PRs is tracked and reported
Evidence
- Dashboard showing Green/Yellow/Red distribution across PRs
- Auto-merge logs for Green PRs with zero post-merge reverts
- Monthly auto-approve rate report tracking the 60%+ Green target
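A Green/Yellow/Red classifier can begin as a handful of deterministic signals before any model is involved. The signals and thresholds below are illustrative assumptions, not a standard:

```python
# Hypothetical PR risk classifier: Green auto-merges, Yellow gets a fast
# human pass, Red requires full architectural review.
from dataclasses import dataclass

@dataclass
class PRSignals:
    ci_green: bool
    lines_changed: int
    touches_sensitive_paths: bool  # e.g. auth/, billing/, migrations/
    ai_review_findings: int        # blocking findings from first-pass AI review

def classify(pr: PRSignals) -> str:
    # Hard stops first: failing CI or sensitive paths always need a human.
    if not pr.ci_green or pr.touches_sensitive_paths:
        return "Red"
    # Large or flagged diffs get a lightweight human look.
    if pr.lines_changed > 300 or pr.ai_review_findings > 0:
        return "Yellow"
    return "Green"

print(classify(PRSignals(True, 40, False, 0)))  # → Green
```

Tracking the Green share of this classifier's output over time is what the 60%+ auto-approve target measures.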
Level 5
- The agent fleet self-reviews code (error-fix-converge loop) before submitting for merge
- Human review is limited to Red-classified PRs (architectural decisions only)
- Continuous auto-refactoring runs in the background without human initiation
Evidence
- Agent iteration logs showing error-fix-converge cycles before PR submission
- PR analytics showing human review only on Red-classified PRs
- Auto-refactoring PR history with associated quality metrics
Testing Strategy
Level 1
- A test suite exists, but coverage is below 40%
- Tests are written manually by developers
Evidence
- Coverage report showing sub-40% line coverage
- Test authorship in git history (manual, no agent attribution)
Level 2
- Agents generate unit tests; humans write acceptance tests
- A flaky-test quarantine process is active (flaky tests are isolated, not deleted)
- Test oracle stabilization is underway (deterministic expected values for AI-generated tests)
Evidence
- Test files with agent attribution alongside human-authored acceptance tests
- Quarantine list or label in the test framework configuration
- Flaky-test tracking dashboard or issue tracker labels
Level 3
- TORS (Test Oracle Reliability Score) is measured and exceeds 90%
- Acceptance tests are auto-generated from ticket requirements (Autonomous Requirements pipeline)
- Incremental test selection runs only the tests affected by changed code paths
Evidence
- TORS dashboard showing a 90%+ score with per-service breakdown
- Ticket-to-test pipeline configuration with sample outputs
- CI configuration showing incremental test selection (e.g., Bazel test targeting, Jest --changedSince)
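Incremental test selection can be wired directly into CI. A hypothetical GitHub Actions fragment using Jest's `--changedSince` flag, which runs only the tests related to files changed since the given branch (Bazel users would express the same idea with query-based test targeting):

```yaml
# Hypothetical CI fragment: run only tests affected by the diff against main.
jobs:
  incremental-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so --changedSince can compute the diff
      - run: npx jest --changedSince=origin/main
```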
Level 4
- TORS exceeds 95%
- Agents iterate tests to green in an isolated sandbox CI without blocking the team CI queue
- Mutation testing validates that tests catch real defects (not just achieve coverage)
Evidence
- TORS dashboard showing 95%+ with per-service breakdown
- Sandbox CI logs showing agent iteration cycles separate from team CI
- Mutation testing reports showing kill rate and surviving mutants
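Mutation testing can be demonstrated end-to-end in miniature. The sketch below hand-rolls two mutants of a toy function and checks that the single invented test kills them; real suites would use tools such as mutmut, Stryker, or PIT:

```python
# Miniature mutation-testing sketch (illustrative only). Each mutant flips
# one operator in the source; a mutant is "killed" when a test fails against
# it. Surviving mutants indicate tests that only achieve coverage.
SOURCE = "def price_with_tax(price, rate):\n    return price + price * rate\n"

def run_tests(ns: dict) -> bool:
    """Return True if the (toy) test suite passes against namespace ns."""
    try:
        assert ns["price_with_tax"](100, 0.2) == 120
        return True
    except AssertionError:
        return False

def kill_rate() -> float:
    mutants = [SOURCE.replace("+", "-", 1),   # price - price * rate
               SOURCE.replace("*", "/", 1)]   # price + price / rate
    killed = 0
    for mutant in mutants:
        ns: dict = {}
        exec(compile(mutant, "<mutant>", "exec"), ns)
        if not run_tests(ns):  # a failing test kills the mutant
            killed += 1
    return killed / len(mutants)
```

Here both mutants change the computed price, so the one assertion kills both; the mutation report's kill rate is this ratio computed across thousands of generated mutants.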
Level 5
- The test suite is self-healing (an agent detects broken tests, diagnoses the root cause, and fixes them without human input)
- Production logs automatically generate regression tests for observed failures
- Agents detect edge cases, write tests, fix bugs, and ship: the full autonomous loop
Evidence
- Self-healing test commit history showing agent-diagnosed and agent-fixed test failures
- Production log-to-test pipeline configuration with sample generated tests
- End-to-end autonomous bug-fix PRs (edge case detected, test written, fix shipped)
Author Commentary
The April 2026 zeitgeist is sobriety. After six months of "AI makes everything faster," the data is in: AI-generated code carries 2.74x more security vulnerabilities, produces 30-41% more tech debt, and developers who feel 20% faster are in fact 19% slower. This doesn't mean AI coding is wrong — it means Level 1-2 AI coding without review infrastructure is dangerous. The organizations winning are the ones at Level 3 and above: lint-as-architecture, AI review agents as first pass, compliance gates in CI. The models are better than ever (Claude 4.6 Opus: 80.8% SWE-bench). The tooling is better than ever (Cursor 3, Claude Code Computer Use, MCP everywhere). The gap is governance. Start there.