Agent Autonomy Score: % tasks without human intervention

·Test-oracle reliability is measured and tracked on a dashboard
·Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
·Merge queue wait time is tracked with a target under 10 minutes

·Agent Autonomy Score (% of tasks completed without human intervention) is measured and broken down by task type
·Metrics trigger automated alerts when thresholds are breached (e.g., test-oracle reliability drops)

Evidence

·Oracle-reliability dashboard (e.g., TORS) with per-service breakdown
·Auto-approve rate report showing 60%+ Green target
·Merge queue wait time chart showing sub-10-minute target

What It Is

The Agent Autonomy Score measures the percentage of tasks that an agent completes from assignment to merge without any human intervention: no clarifying questions answered, no mid-task course corrections, no manual code fixes before the PR is approved, no human-triggered CI reruns. A task either completes autonomously or it doesn't. The score is the fraction that complete autonomously.
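The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `Task` record and its `interventions` field are assumed names, and the sketch assumes each finished task already carries a count of logged interventions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    interventions: int  # assumed field: number of logged human interventions


def autonomy_score(tasks: list[Task]) -> float:
    """Fraction of tasks that finished with zero human interventions."""
    if not tasks:
        raise ValueError("no completed tasks to score")
    autonomous = sum(1 for t in tasks if t.interventions == 0)
    return autonomous / len(tasks)


tasks = [Task("T-1", 0), Task("T-2", 2), Task("T-3", 0), Task("T-4", 1)]
print(f"{autonomy_score(tasks):.0%}")  # 2 of 4 tasks were untouched -> 50%
```

Note that a task with one intervention and a task with five both count as non-autonomous, which matches the binary definition.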

This is a more demanding metric than auto-approve rate. Auto-approve rate measures whether a PR merges without human code review. Agent Autonomy Score measures whether the entire task lifecycle, from assignment to merge, happens without a human touching the workflow. A PR that the developer had to re-prompt twice before the agent understood the task is not autonomous, even if the final PR auto-merges. A PR that required the developer to fix a test before the agent could proceed is not autonomous. The bar is strict.

The target at L4 is context-dependent. For well-defined, bounded task types (writing tests for an existing function, updating documentation, fixing a lint warning), autonomy scores of 90%+ are achievable and expected. For complex, exploratory task types (designing a new feature, diagnosing a production issue, refactoring a tightly coupled module), autonomy scores of 40-60% are realistic and represent strong performance. An overall team-level autonomy score target of 70%+ means that the team has structured its agent workflow so that most assigned tasks are in the high-autonomy category.
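To see how task mix drives the team-level number, the overall score can be treated as a weighted average of per-type scores. The sketch below assumes only two task categories and uses illustrative per-type values taken from the ranges above (90% for bounded work, 50% for exploratory work). The function name and the mixes are hypothetical.

```python
def blended_score(mix: dict[str, float], per_type: dict[str, float]) -> float:
    """Team-level autonomy as the per-type score weighted by share of tasks."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("task mix shares must sum to 1")
    return sum(share * per_type[task_type] for task_type, share in mix.items())


# Illustrative per-type scores drawn from the ranges in the text.
per_type = {"bounded": 0.90, "exploratory": 0.50}

mostly_bounded = {"bounded": 0.80, "exploratory": 0.20}
mostly_exploratory = {"bounded": 0.30, "exploratory": 0.70}

print(f"{blended_score(mostly_bounded, per_type):.0%}")      # 82%
print(f"{blended_score(mostly_exploratory, per_type):.0%}")  # 62%
```

Under these assumed numbers, a team only clears 70% when at least half of its assigned tasks fall in the bounded category.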

Agent Autonomy Score is the metric that captures the quality of the human-agent collaboration at the system level. If the score is low, the bottleneck is usually in task specification (unclear or incomplete tasks that require clarification), context availability (agents that don't have enough context to proceed without asking), or test reliability (flaky tests that require human investigation rather than agent remediation). Each root cause has a different fix, and the score points to where the system needs work.

Why It Matters

Measures true agent leverage - auto-approve rate measures one step in the workflow; Agent Autonomy Score measures the whole workflow; it's the closest single metric to "what fraction of my capacity is being handled autonomously"
Creates incentive for task specification quality - most autonomy failures trace to poorly specified tasks; tracking the score creates pressure to improve task descriptions, context files, and specification templates
Identifies where human judgment is actually needed - consistently low autonomy on specific task types reveals which tasks genuinely require human oversight (complex architectural decisions, security-sensitive changes) vs. which are low-autonomy due to fixable friction
Enables capacity planning - if Agent Autonomy Score is 70%, a developer managing 10 concurrent agent tasks should expect to intervene in about 3 of them on average; this is predictable overhead that can be planned for rather than experienced as constant interruption
Tracks progress toward L5 patterns - at L5, entire feature delivery cycles (planning through deployment) run autonomously; Agent Autonomy Score is the leading indicator of progress toward that state; teams with 70% task-level autonomy are well-positioned to attempt feature-level autonomy
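The capacity-planning arithmetic can be made concrete with a short sketch. The expected number of interventions is simply the task count times (1 - autonomy score). The second function goes one step further and estimates how often a developer will face a heavier-than-average load, under the simplifying assumption (not from the text) that tasks fail independently with the same probability.

```python
from math import comb


def expected_interventions(n_tasks: int, autonomy: float) -> float:
    """Average number of tasks that will need a human, given the score."""
    return n_tasks * (1 - autonomy)


def prob_at_least(k: int, n_tasks: int, autonomy: float) -> float:
    """P(at least k of n tasks need intervention).

    Assumes tasks are independent with identical intervention probability,
    i.e. a binomial model. Real workloads may be more correlated.
    """
    p = 1 - autonomy
    return sum(
        comb(n_tasks, i) * p**i * (1 - p) ** (n_tasks - i)
        for i in range(k, n_tasks + 1)
    )


print(round(expected_interventions(10, 0.70), 2))  # 3.0
print(round(prob_at_least(5, 10, 0.70), 2))        # chance of 5+ interventions
```

The average is the planning baseline. The tail probability is a reminder that some batches of 10 will need noticeably more than 3 interventions.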

Getting Started

Define "human intervention" precisely - The definition shapes the metric. A reasonable definition counts any of the following as an intervention: responding to an agent clarification question, editing the agent's code directly, manually triggering a CI run the agent didn't trigger, manually fixing a test failure so the agent can continue, or revising the task specification mid-execution. Document and communicate this definition.
Track interventions per task - Add an intervention log to your agent workflow. When a developer intervenes in an agent task (by the definition above), they record it: task ID, intervention type, and brief description. This can be a Slack bot, a GitHub comment command, or a lightweight web form. The goal is a dataset of intervention instances.
Compute autonomy score by task type - Aggregate the intervention log by task type: what fraction of "write tests" tasks completed without intervention? "Fix bug"? "Update documentation"? The task-type breakdown is where the actionable insight lives.
Root cause the low-autonomy task types - For the task types with the lowest autonomy score, analyze the intervention log: what kinds of interventions are most common? If 80% of interventions on "new feature" tasks are clarifying questions, the fix is better task specification. If 80% are test failures requiring human investigation, the fix is TORS improvement. Different root causes require different investments.
Design tasks to maximize autonomy - Once you understand which task types have high autonomy, design your agent workflow to prefer those types. Write a "task design guide" that helps developers specify tasks in ways that maximize agent autonomy: sufficient context, clear success criteria, explicit constraints, and references to relevant code examples.
Set a monthly autonomy score improvement target - If current autonomy score is 55%, target 62% next month by fixing the top root cause of the most common intervention type. Track progress weekly. Small systematic improvements (5-7 percentage points per month) compound into significant autonomy improvements over a quarter.
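The logging, per-type aggregation, and root-cause steps above can be sketched together. This is a hedged illustration rather than a reference design: the `Intervention` record, the `kind` labels (which mirror the intervention definition in the first step), and both function names are assumptions. In practice the log could come from a Slack bot, a GitHub comment command, or a web form.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    task_id: str
    kind: str  # e.g. "clarification", "code_edit", "ci_rerun", "test_fix", "spec_change"
    note: str = ""


def autonomy_by_type(task_types: dict[str, str], log: list[Intervention]) -> dict[str, float]:
    """task_types maps task_id -> task type for every completed task."""
    touched = {entry.task_id for entry in log}
    totals: Counter = Counter()
    clean: Counter = Counter()
    for task_id, task_type in task_types.items():
        totals[task_type] += 1
        if task_id not in touched:
            clean[task_type] += 1
    return {t: clean[t] / totals[t] for t in totals}


def top_intervention_kinds(task_types, log, task_type) -> list[tuple[str, int]]:
    """Most common intervention kinds for one task type, for root-cause work."""
    kinds = Counter(e.kind for e in log if task_types.get(e.task_id) == task_type)
    return kinds.most_common()


task_types = {"T1": "write tests", "T2": "write tests",
              "T3": "new feature", "T4": "new feature", "T5": "new feature"}
log = [Intervention("T3", "clarification"), Intervention("T3", "clarification"),
       Intervention("T4", "test_fix"), Intervention("T4", "clarification")]

print(autonomy_by_type(task_types, log))
print(top_intervention_kinds(task_types, log, "new feature"))
# [('clarification', 3), ('test_fix', 1)]
```

In this made-up data, "new feature" tasks have the lowest autonomy and clarifications dominate, which would point toward better task specification as the first investment.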

Tip

The most common reason for low agent autonomy is agents that ask clarifying questions before starting work. This is easy to fix: front-load the context. Instead of assigning a task and waiting for the agent to ask questions, spend 5 minutes writing a richer task description that preemptively answers the questions you know the agent will ask. This single practice improvement often increases autonomy score by 15-20 percentage points.

Common Pitfalls

Tracking autonomy without segmenting by task type. A team-level autonomy score of 60% is uninterpretable without knowing what types of tasks are being assigned. If 90% of tasks are well-specified, bounded tasks and 10% are exploratory, the 60% might represent a system failure, because bounded tasks alone should score 90%+. If the task mix is 50% exploratory, 60% is much closer to what the per-type ranges predict and might represent solid performance. Always segment.

Penalizing appropriate human intervention. Some tasks should involve human input even when agents are capable. Security architecture decisions, API design choices, and product feature tradeoffs benefit from human judgment. Tracking these interventions as "failures" of autonomy creates pressure to exclude humans where they add value. Distinguish between "unplanned interventions caused by poor task specification" and "planned human checkpoints in the workflow." Only the former should be counted against autonomy score.
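One way to keep planned checkpoints out of the score is to flag them in the intervention log and ignore flagged entries when deciding whether a task was autonomous. The sketch below assumes a `planned` boolean on each log entry. That field, the record shape, and the function name are illustrative, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    task_id: str
    kind: str
    planned: bool = False  # assumed flag: True for a checkpoint the workflow requires


def autonomous_task_ids(all_task_ids: list[str], log: list[Intervention]) -> list[str]:
    """Tasks with no *unplanned* interventions; planned checkpoints are ignored."""
    blocked = {entry.task_id for entry in log if not entry.planned}
    return [task_id for task_id in all_task_ids if task_id not in blocked]


all_task_ids = ["A", "B", "C"]
log = [
    Intervention("A", "security_review", planned=True),  # deliberate checkpoint
    Intervention("B", "clarification"),                  # unplanned
]

clean = autonomous_task_ids(all_task_ids, log)
print(clean)                           # ['A', 'C']
print(len(clean) / len(all_task_ids))  # score counts A as autonomous
```

Task A still had a human in the loop, but because the review was designed into the workflow it does not lower the score.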

Gaming the metric by restricting task complexity. An easy way to improve autonomy score is to only assign agents simple, trivial tasks where autonomy is near-guaranteed. This inflates the metric without capturing real value. Pair autonomy score with task complexity or task value to ensure that high autonomy is achieved on meaningful work, not just on trivial tasks.
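A simple way to pair autonomy with complexity is to weight each task by a complexity or value estimate before computing the score. The sketch below is one possible approach, not an established formula. The weights are hypothetical (story points, estimated hours, or any other proxy the team already uses).

```python
def weighted_autonomy(tasks: list[tuple[float, bool]]) -> float:
    """tasks: (weight, completed_autonomously) pairs.

    The weight is an assumed complexity or value estimate for the task.
    """
    total = sum(weight for weight, _ in tasks)
    if total <= 0:
        raise ValueError("total task weight must be positive")
    return sum(weight for weight, autonomous in tasks if autonomous) / total


# Eight trivial tasks (weight 1) done autonomously, two complex tasks (weight 5) not.
tasks = [(1.0, True)] * 8 + [(5.0, False)] * 2

raw = sum(1 for _, autonomous in tasks if autonomous) / len(tasks)
print(f"raw: {raw:.0%}")                             # raw: 80%
print(f"weighted: {weighted_autonomy(tasks):.0%}")   # weighted: 44%
```

A large gap between the raw and weighted numbers suggests the headline score is being carried by trivial work.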

Not tracking what autonomy failures cost. An intervention that takes 30 seconds (responding to a simple clarification question) is very different from an intervention that takes 30 minutes (debugging a complex agent failure). Track intervention duration alongside intervention count to understand the actual human overhead of autonomy gaps. The expensive interventions are the ones to fix first.
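Tracking duration can be as simple as recording minutes next to each logged intervention and summing by intervention kind. The sketch below assumes the log is available as (kind, minutes) pairs. The kind labels and the function name are illustrative.

```python
from collections import defaultdict


def minutes_by_kind(log: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Total human minutes per intervention kind, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for kind, minutes in log:
        totals[kind] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Six quick clarifications versus one long debugging session.
log = [("clarification", 0.5)] * 6 + [("agent_debugging", 30.0)]

print(minutes_by_kind(log))
# [('agent_debugging', 30.0), ('clarification', 3.0)]
```

By count, clarifications dominate this made-up log six to one. By time, the single debugging session costs ten times as much, so it is the one to fix first.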

Expecting autonomy score to be monotonically increasing. As the team takes on more complex AI workflows, introduces new task types, and expands agent usage to harder problems, autonomy score may temporarily decrease. This is fine - it reflects the team pushing the boundary of what agents can do. Autonomy score should trend upward over months but may fluctuate week to week as the team experiments.

How Different Roles See It

Bob, Head of Engineering

Bob wants to give his engineering managers a single number that captures how well their team's AI adoption is working. PR throughput is good for output, but it doesn't capture whether the AI is genuinely autonomous or whether developers are still doing most of the work. He thinks Agent Autonomy Score might be the right metric.

What Bob should do: Bob should introduce Agent Autonomy Score as a quarterly report metric, not a weekly one. Weekly measurement of interventions is valuable, but quarterly aggregation is the right cadence for leadership reporting. Bob should present autonomy score alongside the story it tells: "Our agents complete 68% of assigned tasks without human intervention. The 32% that require intervention are primarily new feature tasks where the agent needs clarification. Our investment next quarter is in improving task specification templates for this task type, which we project will raise autonomy score to 75%." This narrative - metric, root cause, investment, projected outcome - is how leadership should see AI maturity progress.

Sarah, Productivity Lead

Sarah has been tracking PR throughput and auto-approve rate, but she's noticed a disconnect: some developers have high throughput but report feeling like they spend all day managing agents. The throughput is real, but the experience is different from "autonomous agents handling work."

What Sarah should do: Sarah should introduce Agent Autonomy Score as the companion metric to PR throughput. Her hypothesis: the developers who feel like they're managing agents all day have low autonomy scores - their agents complete tasks but require frequent intervention. The developers who feel genuinely productive have high autonomy scores - the agents run with minimal oversight. Sarah should survey the high-throughput/high-friction developers and the high-throughput/low-friction developers and compare their intervention logs. The findings will identify the workflow differences that explain the experience gap, which Sarah can then systematize as best practices.

Victor, Staff Engineer - AI Champion

Victor has personally achieved an 85% Agent Autonomy Score by investing heavily in context management: per-task CLAUDE.md files, agent-aware task specification templates, and a pre-task checklist that ensures the agent has everything it needs before starting. He considers autonomy score his primary measure of AI workflow quality.

What Victor should do: Victor should lead a "task specification hackathon" where the team spends a day auditing its top 20 most common agent task types and rewriting the specification templates for each. The goal is to preemptively eliminate the clarifying questions and context gaps that cause the most interventions. Victor should bring his own templates as examples and facilitate the group in creating templates they'll actually use. After the hackathon, the team should track autonomy score for the rewritten task types specifically and measure the improvement. The hackathon is a high-leverage 1-day investment that directly improves one of the team's most important AI productivity metrics.
