Agent Autonomy Score: % tasks without human intervention
- TORS > 95% is measured and tracked on a dashboard
- Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
- Merge queue wait time is tracked with a target under 10 minutes
- Agent Autonomy Score (% of tasks completed without human intervention) is measured and broken down by task type
- Metrics trigger automated alerts when thresholds are breached (e.g., TORS drops below 95%)
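The alerting requirement above can be sketched as a simple threshold check. This is a minimal illustration, not a prescribed implementation: the metric keys (`tors_pct`, `auto_approve_pct`, `merge_queue_wait_min`) and the `breached` helper are hypothetical names, and how a value exactly on the boundary is treated is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str              # hypothetical metric key
    limit: float             # target value from the checklist
    higher_is_better: bool   # True: alert when below; False: alert when at or above


# Limits come from the checklist; the key names are illustrative assumptions.
THRESHOLDS = [
    Threshold("tors_pct", 95.0, higher_is_better=True),
    Threshold("auto_approve_pct", 60.0, higher_is_better=True),
    Threshold("merge_queue_wait_min", 10.0, higher_is_better=False),
]


def breached(current: dict[str, float]) -> list[str]:
    """Return one alert message per metric that is missing or off target."""
    alerts = []
    for t in THRESHOLDS:
        value = current.get(t.metric)
        if value is None:
            alerts.append(f"{t.metric}: no data")
            continue
        # Assumption: "drops below" alerts for higher-is-better metrics,
        # "not under the target" alerts for lower-is-better metrics.
        off_target = value < t.limit if t.higher_is_better else value >= t.limit
        if off_target:
            alerts.append(f"{t.metric}: {value} breaches target {t.limit}")
    return alerts
```

In practice the returned messages would be routed to whatever paging or chat tool the team already uses.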
Evidence
- TORS dashboard showing 95%+ with per-service breakdown
- Auto-approve rate report showing 60%+ Green target
- Merge queue wait time chart showing sub-10-minute target
What It Is
The Agent Autonomy Score measures the percentage of tasks that an agent completes from assignment to merge without any human intervention: no clarifying questions answered, no mid-task course corrections, no manual code fixes before the PR is approved, no human-triggered CI reruns. A task either completes autonomously or it doesn't. The score is the fraction that complete autonomously.
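The definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `TaskRecord` fields are hypothetical names for the four intervention types listed above, and the denominator is assumed to be every assigned task, including tasks that never merged.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """One assigned task; all field names are hypothetical."""
    task_id: str
    merged: bool
    clarifications_answered: int = 0
    course_corrections: int = 0
    manual_code_fixes: int = 0
    human_ci_reruns: int = 0

    @property
    def autonomous(self) -> bool:
        # A task counts only if it merged with zero interventions of any kind.
        interventions = (
            self.clarifications_answered
            + self.course_corrections
            + self.manual_code_fixes
            + self.human_ci_reruns
        )
        return self.merged and interventions == 0


def autonomy_score(tasks: list[TaskRecord]) -> float:
    """Fraction of assigned tasks that completed autonomously (0.0 to 1.0)."""
    if not tasks:
        raise ValueError("autonomy score is undefined for an empty task list")
    return sum(t.autonomous for t in tasks) / len(tasks)
```

The binary `autonomous` property mirrors the strictness of the definition: a single re-prompt or manual fix moves the whole task out of the numerator.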
This is a more demanding metric than auto-approve rate. Auto-approve rate measures whether a PR merges without human code review. Agent Autonomy Score measures whether the entire task lifecycle - from task assignment to merge - happens without a human touching the workflow. A task where the developer had to re-prompt the agent twice before it understood the request is not autonomous, even if the final PR auto-merges. A task where the developer had to fix a test before the agent could proceed is not autonomous either. The bar is strict.
The target at L4 is context-dependent. For well-defined, bounded task types (writing tests for an existing function, updating documentation, fixing a lint warning), autonomy scores of 90%+ are achievable and expected. For complex, exploratory task types (designing a new feature, diagnosing a production issue, refactoring a tightly coupled module), autonomy scores of 40-60% are realistic and represent strong performance. An overall team-level autonomy score target of 70%+ means that the team has structured its agent workflow so that most assigned tasks are in the high-autonomy category.
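One way to see how the task mix drives the team-level number is to compute the score per task type alongside the overall figure. The sketch below assumes each task has already been labeled with a type and an autonomous/not-autonomous outcome; the function names and the task-type labels in the usage note are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def autonomy_by_type(outcomes: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each task type to its autonomy score (0.0 to 1.0).

    `outcomes` holds (task_type, completed_autonomously) pairs.
    """
    totals = defaultdict(lambda: [0, 0])  # task_type -> [autonomous, total]
    for task_type, autonomous in outcomes:
        totals[task_type][0] += int(autonomous)
        totals[task_type][1] += 1
    return {task_type: a / n for task_type, (a, n) in totals.items()}


def overall_autonomy(outcomes: list[tuple[str, bool]]) -> float:
    """Team-level score: autonomous tasks divided by all tasks, across types."""
    if not outcomes:
        raise ValueError("no tasks recorded")
    return sum(int(autonomous) for _, autonomous in outcomes) / len(outcomes)
```

For example, ten `write_tests` tasks with nine autonomous and five `refactor` tasks with two autonomous give per-type scores of 0.9 and 0.4, and an overall score of 11/15. Shifting the mix toward the bounded type raises the overall number without any change in per-type performance.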
Agent Autonomy Score is the metric that captures the quality of the human-agent collaboration at the system level. If the score is low, the bottleneck is usually in task specification (unclear or incomplete tasks that require clarification), context availability (agents that don't have enough context to proceed without asking), or test reliability (flaky tests that require human investigation rather than agent remediation). Each root cause has a different fix, and the score points to where the system needs work.
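To make the score point at a fix, each non-autonomous task can be tagged with its suspected root cause and the tags tallied. The sketch below is one possible shape for that tally; the `RootCause` enum and `rank_root_causes` helper are hypothetical, and the `OTHER` bucket is an added assumption for interventions that fit none of the three causes named above.

```python
from __future__ import annotations

from collections import Counter
from enum import Enum


class RootCause(Enum):
    TASK_SPECIFICATION = "task specification"
    CONTEXT_AVAILABILITY = "context availability"
    TEST_RELIABILITY = "test reliability"
    OTHER = "other"  # assumption: catch-all not named in the text


def rank_root_causes(causes: list[RootCause]) -> list[tuple[RootCause, int]]:
    """Return root causes ordered from most to least frequent."""
    return Counter(causes).most_common()
```

The top entry of the ranking indicates where to invest first: specification templates, context files, or test stabilization.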
Why It Matters
- Measures true agent leverage - auto-approve rate measures one step in the workflow; Agent Autonomy Score measures the whole workflow; it's the closest single metric to "what fraction of my capacity is being handled autonomously"
- Creates incentive for task specification quality - most autonomy failures trace to poorly specified tasks; tracking the score creates pressure to improve task descriptions, context files, and specification templates
- Identifies where human judgment is actually needed - consistently low autonomy on specific task types reveals which tasks genuinely require human oversight (complex architectural decisions, security-sensitive changes) vs. which are low-autonomy due to fixable friction
- Enables capacity planning - if Agent Autonomy Score is 70%, a developer managing 10 concurrent agent tasks should expect to intervene in about 3 of them on average; this is predictable overhead that can be planned for rather than experienced as constant interruption
- Tracks progress toward L5 patterns - at L5, entire feature delivery cycles (planning through deployment) run autonomously; Agent Autonomy Score is the leading indicator of progress toward that state; teams with 70% task-level autonomy are well-positioned to attempt feature-level autonomy
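The capacity-planning arithmetic in the list above can be sketched as follows. The expected number of interventions is simply the task count times one minus the autonomy score; the tail probability additionally assumes that tasks fail independently of one another, which is a simplifying assumption and not something the score itself guarantees. Both function names are hypothetical.

```python
from math import comb


def expected_interventions(concurrent_tasks: int, autonomy_score: float) -> float:
    """Average number of tasks needing a human, e.g. 10 tasks at 0.7 -> 3."""
    return concurrent_tasks * (1.0 - autonomy_score)


def prob_at_least(concurrent_tasks: int, autonomy_score: float, k: int) -> float:
    """Probability that k or more tasks need intervention.

    Assumes each task independently needs a human with probability
    (1 - autonomy_score), i.e. a binomial model.
    """
    q = 1.0 - autonomy_score
    n = concurrent_tasks
    return sum(comb(n, i) * q**i * (1.0 - q) ** (n - i) for i in range(k, n + 1))
```

The expected value is the planning baseline; the tail probability gives a rough sense of how often a developer will face a noticeably worse batch than average.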
How Different Roles See It
Bob wants to give his engineering managers a single number that captures how well their team's AI adoption is working. PR throughput is good for output, but it doesn't capture whether the AI is genuinely autonomous or whether developers are still doing most of the work. He thinks Agent Autonomy Score might be the right metric.
Sarah has been tracking PR throughput and auto-approve rate, but she's noticed a disconnect: some developers have high throughput but report feeling like they spend all day managing agents. The throughput is real, but the experience is different from "autonomous agents handling work."
Victor has personally achieved an 85% Agent Autonomy Score by investing heavily in context management: per-task CLAUDE.md files, agent-aware task specification templates, and a pre-task checklist that ensures the agent has everything it needs before starting. He considers autonomy score his primary measure of AI workflow quality.