May 2026 · v1.2

VISDOM Maturity Matrix

Trust, but Audit - When the Agent Has a Bad Day

Development

How developers work with AI day-to-day. From sidebar chat to fleet agents.

Coding Agent Usage

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Context Engineering

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Code Review & Quality

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Testing Strategy

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Author Commentary

The May 2026 zeitgeist is **trust, but audit**. April delivered three uncomfortable proofs. First, Stella Laurenzo's [audit of 6,852 Claude Code sessions](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove) measured a 73% collapse in median thinking length (from 2,200 to 600 characters) and a drop from 6.6 to 2.0 files read before edit. Second, Anthropic's [April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem) confirmed that harness and system-prompt changes, not the model, caused weeks of regression. Third, UC Berkeley showed that [all eight major agent benchmarks are reward-hackable to ~100%](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/).

The takeaway is not "AI got worse." The takeaway is that capability lives in the harness, the prompt and the permission layer, and that you need to measure it like any other production system. This shifts the bar at every level. L2 review now means tracking thinking length and files read before edit per session, not just diff awareness. L3 AI-review agents must self-verify their outputs (Opus 4.7 makes this mainstream). L4 auto-approval policies must NOT be derived from benchmark scores; they need behavioural signals.

The model is better than ever (Opus 4.7: 87.6% SWE-bench Verified, Mythos Preview: 93.9%). The tooling is better than ever (Cursor 3.2 /multitask, Kairos/Dream Mode in Claude Code). The new gap is observability. Start there.
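To make the two L2 signals concrete, here is a minimal sketch of how thinking length and files read before edit might be computed for one session. The event schema (`type`, `text`, `path`) and the function name `session_signals` are hypothetical and are not the actual Claude Code log format. "Files read before edit" is simplified here to distinct files read before the first edit, which may differ from how the cited audit defines it.

```python
from statistics import median


def session_signals(events):
    """Return (median thinking length in chars, distinct files read before the first edit).

    `events` is a hypothetical, ordered list of dicts with a "type" key of
    "thinking", "read" or "edit"; this schema is an assumption for illustration.
    """
    thinking_lengths = []
    files_read = set()
    seen_edit = False
    for ev in events:
        if ev["type"] == "thinking":
            thinking_lengths.append(len(ev["text"]))
        elif ev["type"] == "read" and not seen_edit:
            files_read.add(ev["path"])
        elif ev["type"] == "edit":
            seen_edit = True
    med = median(thinking_lengths) if thinking_lengths else 0
    return med, len(files_read)


# Hypothetical session: two thinking blocks, two reads, one edit, one late read.
example_events = [
    {"type": "thinking", "text": "x" * 600},
    {"type": "read", "path": "a.py"},
    {"type": "read", "path": "b.py"},
    {"type": "thinking", "text": "x" * 400},
    {"type": "edit", "path": "a.py"},
    {"type": "read", "path": "c.py"},  # after the edit, so not counted
]
print(session_signals(example_events))
```

Tracked per session and plotted over time, these two numbers give a cheap behavioural baseline that does not depend on any benchmark score.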

Delivery Management

How we manage delivery in the age of agents. From human PR review to autonomous delivery pipeline.

CI/CD Pipeline

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Merge & Deploy

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Metrics

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 5 practices
L5 Autonomous · 2 practices

Governance & Compliance

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Author Commentary

May 2026 update: cost is now a first-class metric. ccusage hit 13.2k stars on GitHub; /usage and /context shipped as built-in commands; Reddit had multiple "I burned $3,800 overnight" posts traced to runaway subagent loops.

The economics also got more honest. Pawel Dolega's [AI subscriptions are on borrowed time](https://www.pdole.ga/p/ai-subscriptions-are-on-borrowed) makes the structural case: a $20 Pro plan burns $50-100 of compute, total enterprise LLM spend doubled in six months despite per-token prices falling (Jevons paradox), and labs are quietly testing the water: Anthropic pulled Claude Code from Pro, and GitHub paused Copilot signups. Teams that do not measure cost-per-merged-PR now will face re-pricing emergencies later this year.

Governance follows: per-session spend caps and kill switches are now baseline, not advanced. Restricted-use models (Claude Mythos Preview / Project Glasswing: 93.9% SWE-bench, but defensive cybersecurity only) introduce a new lever, capability-restricted licensing. And Berkeley's April 12 reward-hack research means any policy that auto-approves based on benchmark scores is broken by construction. Stripe Minions is still the L5 north star; the new homework is making sure your L2-L3 metrics don't lie to you on the way there.
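A per-session spend cap with a kill switch can be very small. The sketch below is illustrative only: the class names `SessionBudget` and `SpendCapExceeded`, the per-million-token prices and the $1.00 cap are all assumptions, not any vendor's API or pricing.

```python
class SpendCapExceeded(RuntimeError):
    """Raised when a session's cumulative spend passes its cap."""


class SessionBudget:
    """Hypothetical per-session budget: accumulate cost, kill the session past the cap."""

    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.killed = False

    def charge(self, input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
        # Once killed, refuse every further call instead of silently resuming.
        if self.killed:
            raise SpendCapExceeded("session already killed")
        cost = (input_tokens * usd_per_mtok_in
                + output_tokens * usd_per_mtok_out) / 1_000_000
        self.spent_usd += cost
        if self.spent_usd > self.cap_usd:
            self.killed = True
            raise SpendCapExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.cap_usd:.2f} cap"
            )
        return cost


# Simulated runaway subagent loop; prices are placeholder values.
budget = SessionBudget(cap_usd=1.00)
calls = 0
try:
    while True:
        budget.charge(100_000, 10_000, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
        calls += 1
except SpendCapExceeded as exc:
    print(f"stopped after {calls} successful calls: {exc}")
```

Because the cap is checked after each call is charged, a session can overshoot by at most one call. A stricter variant would estimate the cost before dispatching the request.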

Organization

How organizations adapt to the age of agents. From "buy licenses" to "agent fleet management".

AI Adoption Model

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Knowledge Management

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Team Structure & Roles

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 4 practices
L5 Autonomous · 3 practices

Tech Debt & Modernization

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Author Commentary

May 2026 update: the Q1 layoff data is now in, and it is uglier than the press release tour suggested. Q1 2026 saw 78,557 tech jobs lost, 47.9% of them AI-attributed, with junior and entry-level roles disproportionately affected (new SWE postings down 15%). 55% of employers report regretting AI-driven layoffs, and the "AI layoff trap" is being quietly reversed. Forrester finds that only 16% of workers have high AI readiness, projected to reach 25% by year-end. The lesson: cutting humans before maturing the AI stack creates a permanent capability gap.

Healthy adoption now includes a "bad day protocol": a documented rollback for when the model or harness regresses. The template here is Anthropic's [April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem); the diagnostic is Stella Laurenzo's [6,852-session audit](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove).

On the org side, the most interesting new pattern is **IPETs (Innovation and Practices Enabling Teams)**, a Team Topologies adaptation where a small enabling team owns AI stewardship, knowledge diffusion and security boundaries across product teams. New tech debt categories matter too: "context debt" (rapid iteration without architectural integrity, which hits a 12-week unmaintainability cliff) and "verification debt" (a 3x velocity gain offset by 125% verification overhead). Yegge's 8 stages are still the best individual-maturity model. The org that makes Stage 6+ work without burning out its seniors will be the one with IPETs and a working bad-day protocol.
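One way to give a bad-day protocol an objective trigger is to compare current behavioural signals against a recorded baseline. The sketch below is an assumption-laden illustration: the function name `regressed_signals`, the signal names and the 30% threshold are hypothetical. The example values are the before and after figures quoted from the 6,852-session audit.

```python
def regressed_signals(baseline, current, max_drop=0.30):
    """Return the signals whose value fell by more than `max_drop` relative to baseline.

    `max_drop` is an arbitrary illustrative threshold, not a recommended value.
    A signal missing from `current` is treated as 0, i.e. fully regressed.
    """
    flagged = []
    for name, base in baseline.items():
        if base <= 0:
            continue  # cannot compute a relative drop without a positive baseline
        drop = (base - current.get(name, 0.0)) / base
        if drop > max_drop:
            flagged.append(name)
    return flagged


# Figures quoted from the audit: 2,200 -> 600 chars and 6.6 -> 2.0 files.
baseline = {"median_thinking_chars": 2200, "files_read_before_edit": 6.6}
current = {"median_thinking_chars": 600, "files_read_before_edit": 2.0}

flagged = regressed_signals(baseline, current)
if flagged:
    print("bad-day protocol triggered by:", flagged)
```

A non-empty result would be the cue to follow the documented rollback, for example pinning the previous harness or prompt version, rather than a reason to debate whether the agent merely "feels" worse.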

Infrastructure

The technical layer that enables (or blocks) agents. From shared Jenkins to ephemeral agent sandboxes.

Agent Runtime & Sandboxing

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

MCP & Tool Integration

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Build System

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 3 practices
L5 Autonomous · 3 practices

Observability & Feedback Loop

L1 Ad-hoc · 3 practices
L2 Guided · 3 practices
L3 Systematic · 3 practices
L4 Optimized · 4 practices
L5 Autonomous · 3 practices

Author Commentary

May 2026 update: observability stopped being a "nice to have" and became the area where the most money was made and lost. ccusage is at 13.2k GitHub stars; /usage and /context are now built into Claude Code; multiple Reddit threads documented overnight bills of $3,800 from runaway subagent loops.

The two new disciplines this month are **cost telemetry** (token spend per session, per project, per merged PR) and **quality telemetry** (thinking length, files read before edit, KV cache hit rate). Stella Laurenzo's [audit of 6,852 Claude Code sessions](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove) is the template for the latter. Anthropic's [April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem) made it official that harness changes, not the model, cause regressions, which makes harness telemetry a first-class concern.

MCP also evolved this month, from "code tools" to deep-system access. Examples include pentester-mcp (offensive security), windbg-mcp (kernel) and Pepper (iOS runtime), with mcp-auth-proxy emerging as middleware for OAuth/token-persistence issues. Rust-based context retrievers (webclaw, ferris-search) and Go orchestrators (jig) bring low-latency multi-agent profiles within reach. Infrastructure, not the model, is now the thing that decides whether your agent fleet scales gracefully or burns the budget on a Tuesday night.
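Cost telemetry at the per-project and per-merged-PR level can start as a simple roll-up over session records. The sketch below assumes a hypothetical record shape (`project`, `cost_usd`, `merged_prs`); it does not reflect the output format of ccusage or of any built-in command.

```python
from collections import defaultdict


def cost_per_merged_pr(sessions):
    """Roll session spend up per project and divide by merged PRs.

    Spend from sessions that produced no merged PR still counts toward the
    project total, so abandoned work raises the cost per merged PR.
    Returns None for a project with spend but no merged PRs.
    """
    spend = defaultdict(float)
    merged = defaultdict(int)
    for s in sessions:
        spend[s["project"]] += s["cost_usd"]
        merged[s["project"]] += s["merged_prs"]
    return {
        project: (spend[project] / merged[project] if merged[project] else None)
        for project in spend
    }


# Hypothetical session records.
example_sessions = [
    {"project": "api", "cost_usd": 4.0, "merged_prs": 1},
    {"project": "api", "cost_usd": 2.0, "merged_prs": 0},  # abandoned attempt
    {"project": "web", "cost_usd": 3.0, "merged_prs": 0},
]
print(cost_per_merged_pr(example_sessions))
```

Reporting `None` instead of zero for projects with spend but no merges keeps the "money burned, nothing shipped" case visible instead of hiding it behind a flattering number.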