May 2026 · v1.2
VISDOM Maturity Matrix
Trust, but Audit - When the Agent Has a Bad Day
Development
How developers work with AI day-to-day. From sidebar chat to fleet agents.
Coding Agent Usage
Context Engineering
Code Review & Quality
Testing Strategy
Author Commentary
The May 2026 zeitgeist is **trust, but audit**. April delivered three uncomfortable proofs: Stella Laurenzo's [audit of 6,852 Claude Code sessions](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove) measured a 73% collapse in median thinking length (2,200 -> 600 chars) and a drop from 6.6 to 2.0 files read before edit; Anthropic's [April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem) confirmed that harness and system-prompt changes - not the model - caused weeks of regression; UC Berkeley showed [all eight major agent benchmarks are reward-hackable to ~100%](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/). The takeaway is not "AI got worse." The takeaway is that capability lives in the harness, the prompt and the permission layer - and you need to measure it like any other production system. This shifts the bar at every level. L2 review now means tracking thinking-length and files-read-before-edit per session, not just diff awareness. L3 AI-review agents must self-verify their outputs (Opus 4.7 makes this mainstream). L4 auto-approval policies must NOT be derived from benchmark scores - they need behavioural signals. The model is better than ever (Opus 4.7: 87.6% SWE-bench Verified, Mythos Preview: 93.9%). The tooling is better than ever (Cursor 3.2 /multitask, Kairos/Dream Mode in Claude Code). The new gap is observability. Start there.
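A minimal sketch of what tracking those two signals per session could look like, assuming you can already extract them from your own session logs. The `SessionStats` record, its field names and the 30% alert threshold are all hypothetical; they are not taken from the audit or from any Claude Code API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class SessionStats:
    """Hypothetical per-session record; field names are illustrative."""
    thinking_chars: float          # characters of model thinking in the session
    files_read_before_edit: float  # files read before the first edit


def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop from baseline to current (0.73 means a 73% drop)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


def audit(baseline, current, max_drop=0.30):
    """Return each metric whose median fell by more than max_drop (assumed threshold)."""
    flagged = {}
    for name in ("thinking_chars", "files_read_before_edit"):
        before = median(getattr(s, name) for s in baseline)
        after = median(getattr(s, name) for s in current)
        drop = relative_drop(before, after)
        if drop > max_drop:
            flagged[name] = round(drop, 2)
    return flagged
```

Fed the figures quoted above (2,200 -> 600 chars, 6.6 -> 2.0 files), `audit` flags both metrics, at roughly 0.73 and 0.70. The point of the design is that the baseline comes from your own sessions, not from a benchmark score.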
Delivery Management
How we manage delivery in the age of agents. From human PR review to autonomous delivery pipeline.
CI/CD Pipeline
Merge & Deploy
Metrics
Governance & Compliance
Author Commentary
May 2026 update: cost is now a first-class metric. ccusage hit 13.2k stars on GitHub; /usage and /context shipped as built-in commands in Claude Code; Reddit had multiple "I burned $3,800 overnight" posts traced to runaway subagent loops. The economics also got more honest. Pawel Dolega's [AI subscriptions are on borrowed time](https://www.pdole.ga/p/ai-subscriptions-are-on-borrowed) makes the structural case: a $20 Pro plan burns $50-100 of compute, total enterprise LLM spend doubled in six months despite per-token prices falling (Jevons paradox), and labs are quietly testing the water - Anthropic pulled Claude Code from Pro, GitHub paused Copilot signups. Teams that do not measure cost-per-merged-PR now will face re-pricing emergencies later this year. Governance follows: per-session spend caps and kill switches are now baseline, not advanced. Restricted-use models (Claude Mythos Preview / Project Glasswing - 93.9% SWE-bench Verified, but defensive cybersecurity use only) introduce a new lever - capability-restricted licensing. And Berkeley's April 12 reward-hack research means any policy that auto-approves based on benchmark scores is broken by construction. Stripe Minions is still the L5 north star; the new homework is making sure your L2-L3 metrics don't lie to you on the way there.
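As a rough illustration of "spend caps and kill switches as baseline", here is a sketch of a per-session guard plus a cost-per-merged-PR helper. `SpendGuard`, `SpendCapExceeded` and `cost_per_merged_pr` are hypothetical names invented for this example; how you obtain the per-call cost (for instance from your provider's usage reporting) is assumed and out of scope.

```python
from typing import Optional


class SpendCapExceeded(RuntimeError):
    """Raised when a session crosses its spend cap or is already stopped."""


class SpendGuard:
    """Hypothetical per-session spend cap with a kill switch."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.killed = False

    def record(self, cost_usd: float) -> None:
        """Add one call's cost; trip the kill switch once the cap is exceeded."""
        if self.killed:
            raise SpendCapExceeded("session already stopped")
        self.spent_usd += cost_usd
        if self.spent_usd > self.cap_usd:
            self.killed = True
            raise SpendCapExceeded(
                f"spent ${self.spent_usd:.2f}, cap is ${self.cap_usd:.2f}")


def cost_per_merged_pr(total_spend_usd: float, merged_prs: int) -> Optional[float]:
    """Total spend divided by merged PRs; None when nothing was merged."""
    if merged_prs == 0:
        return None
    return total_spend_usd / merged_prs
```

The orchestrator would call `guard.record(cost)` after every agent or subagent call and stop the session on `SpendCapExceeded`. Returning `None` for zero merged PRs is a deliberate choice here: spend with nothing merged is exactly the case worth surfacing, not averaging away.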
Organization
How organizations adapt to the age of agents. From "buy licenses" to "agent fleet management".
AI Adoption Model
Knowledge Management
Team Structure & Roles
Tech Debt & Modernization
Author Commentary
May 2026 update: the Q1 layoff data is now in and it is uglier than the press release tour suggested. 78,557 tech jobs were lost in Q1 2026, 47.9% of them AI-attributed, with junior and entry-level roles disproportionately affected (new SWE postings down 15%). 55% of employers report regretting AI-driven layoffs - the "AI layoff trap" is being quietly reversed. Forrester finds only 16% of workers have high AI readiness, projected to reach 25% by year-end. The lesson: cutting humans before maturing the AI stack creates a permanent capability gap. Healthy adoption now includes a "bad day protocol" - a documented rollback for when the model or harness regresses (the template here is Anthropic's [April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem); the diagnostic is Stella Laurenzo's [6,852-session audit](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove)). On the org side, the most interesting new pattern is **IPETs (Innovation and Practices Enabling Teams)** - a Team Topologies adaptation where a small enabling team owns AI stewardship, knowledge diffusion and security boundaries across product teams. New tech debt categories matter too: "context debt" (rapid iteration without architectural integrity, which hits a 12-week unmaintainability cliff) and "verification debt" (a 3x velocity gain offset by 125% verification overhead). Yegge's 8 stages are still the best individual-maturity model. The org that makes Stage 6+ work without burning out its seniors will be the one with IPETs and a working bad-day protocol.
Infrastructure
The technical layer that enables (or blocks) agents. From shared Jenkins to ephemeral agent sandboxes.
Agent Runtime & Sandboxing
MCP & Tool Integration
Build System
Observability & Feedback Loop
Author Commentary
May 2026 update: observability stopped being a "nice to have" and became the area where the most money was made and lost. ccusage is at 13.2k GitHub stars; /usage and /context are now built into Claude Code; multiple Reddit threads documented overnight bills of $3,800 from runaway subagent loops. The two new disciplines this month are **cost telemetry** (token spend per session, per project, per merged PR) and **quality telemetry** (thinking length, files-read-before-edit, KV cache hit rate). Stella Laurenzo's [audit of 6,852 Claude Code sessions](https://scortier.substack.com/p/claude-code-drama-6852-sessions-prove) is the template for the latter; Anthropic's [April 23 postmortem](https://www.anthropic.com/engineering/april-23-postmortem) made it official that harness changes - not the model - cause regressions, which makes harness telemetry a first-class concern. MCP also evolved this month - from "code tools" to deep-system access. New servers include pentester-mcp (offensive security), windbg-mcp (kernel) and Pepper (iOS runtime), with mcp-auth-proxy emerging as middleware for OAuth/token-persistence issues. Rust-based context retrievers (webclaw, ferris-search) and Go orchestrators (jig) bring low-latency multi-agent profiles within reach. Infrastructure, not the model, is now the thing that decides whether your agent fleet scales gracefully or burns the budget on a Tuesday night.
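One way to picture the cost-telemetry half is a per-project rollup over session records. This is a sketch under stated assumptions: `SessionTelemetry` and its fields are hypothetical and not the schema of ccusage or any other tool, and cache hit rate is approximated as the share of input tokens served from cache.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionTelemetry:
    """Hypothetical record; field names are illustrative, not a real tool's schema."""
    project: str
    cost_usd: float
    cached_input_tokens: int  # input tokens served from cache (assumed available)
    total_input_tokens: int


def rollup_by_project(sessions):
    """Aggregate spend and cache hit rate per project."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "cached": 0, "input": 0})
    for s in sessions:
        t = totals[s.project]
        t["cost_usd"] += s.cost_usd
        t["cached"] += s.cached_input_tokens
        t["input"] += s.total_input_tokens
    return {
        project: {
            "cost_usd": round(t["cost_usd"], 2),
            # Guard against projects with no recorded input tokens.
            "cache_hit_rate": t["cached"] / t["input"] if t["input"] else 0.0,
        }
        for project, t in totals.items()
    }
```

A sudden fall in a project's cache hit rate alongside rising spend is the kind of harness-level signal worth alerting on, since it can change without any change to the model.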