Development · L4 Optimized · Code Review & Quality

Policy-based auto-approval: 60%+ Green target

Setting a 60%+ Green rate as an organizational policy turns code quality into a measurable team KPI - and makes the auto-merge system self-reinforcing as teams work to qualify more PRs.

  • Automated Green/Yellow/Red classification runs on every PR
  • Green-classified PRs auto-merge without human review
  • Auto-approve rate target of 60%+ Green PRs is tracked and reported
  • Yellow PRs receive expedited human review (within 1 hour)
  • Classification model accuracy is validated monthly against human review outcomes
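The classification rules above can be sketched in code. This is a minimal illustrative model, not the real classifier: the field names (`test_coverage`, `lint_errors`, `ai_review_flags`, `touches_high_risk_path`) and the thresholds are assumptions chosen to match the criteria in the list.

```python
# Hypothetical sketch of per-PR Green/Yellow/Red classification.
# All field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PullRequest:
    test_coverage: float       # fraction of changed lines covered by tests
    lint_errors: int
    lines_changed: int
    ai_review_flags: int       # issues raised by the AI review agent
    touches_high_risk_path: bool

def classify(pr: PullRequest) -> str:
    """Return "Green" (auto-merge), "Yellow" (expedited review), or "Red"."""
    # Hard failures go straight to Red.
    if pr.lint_errors > 0 or pr.ai_review_flags > 2:
        return "Red"
    # Green requires coverage, safe size, no AI flags, and no high-risk files.
    green = (
        pr.test_coverage >= 0.8
        and pr.lines_changed <= 400
        and pr.ai_review_flags == 0
        and not pr.touches_high_risk_path
    )
    return "Green" if green else "Yellow"
```

A well-tested 120-line change outside high-risk paths would score Green here; the same change touching a high-risk file drops to Yellow, matching the behavior Victor observes later in this page.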

Evidence

  • Dashboard showing Green/Yellow/Red distribution across PRs
  • Auto-merge logs for Green PRs with zero post-merge reverts
  • Monthly auto-approve rate report showing 60%+ Green target tracking
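The monthly auto-approve rate report reduces to a simple per-team aggregation. This is a hedged sketch assuming each PR record is a `(team, classification)` pair; the record shape and function names are illustrative.

```python
# Illustrative monthly Green-rate report against the 60% policy target.
from collections import Counter

TARGET = 0.60  # 60%+ Green organizational target

def green_rate_report(prs):
    """prs: iterable of (team, classification) pairs for the month.
    Returns {team: {"green_rate": float, "meets_target": bool}}."""
    totals, greens = Counter(), Counter()
    for team, label in prs:
        totals[team] += 1
        if label == "Green":
            greens[team] += 1
    return {
        team: {
            "green_rate": greens[team] / totals[team],
            "meets_target": greens[team] / totals[team] >= TARGET,
        }
        for team in totals
    }
```

A team with 3 Green PRs out of 4 reports a 75% rate and meets the target; 1 Green out of 3 reports 33% and does not.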

What It Is

Policy-based auto-approval with a 60%+ Green target is the organizational layer on top of the Green auto-merge system. Rather than just enabling auto-merge and letting teams use it however they want, the 60% target sets an explicit expectation: 60% or more of a team's PRs should score Green and auto-merge each week.

The target is both a quality signal and a productivity signal. A team consistently hitting 60%+ Green is a team that writes code meeting a high quality bar before review: good test coverage, clean lint, well-structured changes of safe size, no AI review issues. A team below 60% has one or more systemic quality problems - test coverage gaps, persistent lint violations, oversized PRs, or changes consistently touching high-risk areas without the quality to match.

The 60% figure represents a specific balance: it's enough automation to meaningfully reduce the review bottleneck (if 60% of PRs auto-merge, human reviewers handle 40% of the volume they would otherwise handle), while keeping human review for the complex and high-risk 40%. Teams far above 60% might be defining Green too loosely. Teams far below 60% are leaving significant efficiency gains unrealized.

The policy creates an incentive structure: teams want to be above 60% because it means their code is flowing quickly to production. When a team's Green rate drops, engineering leads investigate: what changed? Is test coverage degrading? Are PRs getting larger? Is the AI review agent flagging more issues? The metric surfaces process problems before they become quality incidents.
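The "investigate when the rate drops" trigger can be made mechanical: compare the current week's Green rate to a trailing baseline and flag large drops. A minimal sketch, assuming weekly rates are already computed; the 10-point drop threshold is an assumption, not a stated policy.

```python
# Hedged sketch of a Green-rate drop alert: flag a team when this week's
# rate falls more than `threshold` below its recent trailing average.
def green_rate_dropped(history, current, threshold=0.10):
    """history: list of recent weekly Green rates (0.0-1.0) for one team.
    Returns True when the drop exceeds the threshold and merits a look."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = sum(history) / len(history)
    return (baseline - current) > threshold
```

A team averaging ~63% that falls to 48% in one week gets flagged for the lead to dig into coverage, PR size, or AI review findings; normal week-to-week noise does not.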

Why It Matters

The 60% target transforms auto-merge from a technical feature into an organizational practice:

  • Creates accountability - Teams have a quality target they're measured against. The Green rate is visible, comparable across teams, and tracked over time.
  • Drives process improvement - Teams that care about their Green rate invest in the practices that raise it: better test writing, smaller PRs, using AI review to pre-check before submitting. The metric incentivizes the right behaviors.
  • Makes trade-offs explicit - A team working in a high-risk area (core security, payment processing) will naturally have a lower Green rate because more changes qualify as Yellow or Red. The policy accommodates this: the target for high-risk teams might be 40% rather than 60%. Making this explicit is better than pretending all teams are the same.
  • Produces fleet-level visibility - With the Green rate tracked per team and per repository, engineering leadership can see quality trends across the organization. A sudden drop in one team's Green rate signals a quality problem that merits investigation.
  • Compounds with AI code generation - At L4-L5, AI agents are generating significant code volumes. Agent-generated code that's well-structured and tested should score Green at high rates. If agents are producing code that consistently scores Yellow or Red, the agent configuration needs tuning.

The 60% target also makes the business case for quality investments self-evident. Every percentage point increase in the Green rate is measurable throughput improvement: more PRs auto-merge, less time in review queues, faster cycle time. Quality and efficiency are the same metric.

Tip

Publish the Green rate leaderboard internally. Teams that are consistently at 70%+ are doing something right - make it visible. Teams at 30% are struggling with something specific - offer support rather than blame. The metric works best as a learning tool, not a performance evaluation.
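Publishing the leaderboard is straightforward to automate. A sketch under illustrative assumptions: team rates arrive as a dict, and teams below 40% get a support marker rather than a ranking penalty, in the spirit of the tip above.

```python
# Illustrative Green-rate leaderboard, best first, with a "support" marker
# for struggling teams instead of blame. Threshold is an assumption.
def leaderboard(rates, support_below=0.40):
    """rates: {team: green_rate}; returns formatted lines, highest rate first."""
    lines = []
    for team, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
        note = " (offer support)" if rate < support_below else ""
        lines.append(f"{team}: {rate:.0%}{note}")
    return lines
```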

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob - Head of Engineering

Bob has enabled auto-merge and is tracking the Green rate across his 8 teams. Four teams are consistently above 55%, two are at 40%, and two are below 30%. The sub-30% teams are his most experienced teams working on the core transaction processing service. He's not sure whether the low rate reflects a problem or just the nature of their work.

What Bob should do - role-specific action plan

Sarah - Productivity Lead

Sarah has been reporting PR cycle time to her stakeholders for a year. Now that auto-merge is enabled, she's watching cycle time fall - but she wants a metric that shows the underlying quality trend, not just the throughput trend. She wants to show that the team is getting better at writing code that meets quality standards, not just that they've automated the merge step.

What Sarah should do - role-specific action plan

Victor - Staff Engineer, AI Champion

Victor is now focused on helping the teams below 40% improve their Green rate. Digging into the data for the team at 28%, he has identified the root cause: 60% of their PRs score Yellow because they touch authentication code, which is on the high-risk file list. The authentication code needs changes more often than the team expected when they categorized it.

What Victor should do - role-specific action plan