Development · L4 Optimized · Code Review & Quality

Green/Yellow/Red auto-evaluation

A traffic-light quality evaluation system that replaces binary pass/fail CI with a nuanced, policy-driven assessment - enabling selective automation and focusing human review where it genuinely adds value.

  • Automated Green/Yellow/Red classification runs on every PR
  • Green-classified PRs auto-merge without human review
  • Auto-approve rate target of 60%+ Green PRs is tracked and reported
  • Yellow PRs receive expedited human review (within 1 hour)
  • Classification model accuracy is validated monthly against human review outcomes
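The monthly validation in the last criterion can be sketched as a simple agreement check, assuming a log that pairs each machine-assigned tier with the verdict a human reviewer eventually reached on the same PR. The record shape and data here are illustrative, not a real API:

```python
# Hypothetical monthly validation: compare the classifier's tier against
# the outcome a human reviewer reached on the same PR.
pairs = [  # (machine_tier, human_verdict) - illustrative data
    ("green", "approve"),
    ("yellow", "approve"),
    ("green", "request_changes"),  # a miss: Green should not need changes
    ("yellow", "request_changes"),
    ("red", "request_changes"),
]

def agrees(tier: str, verdict: str) -> bool:
    # Green should never have needed changes; Red should never be a clean approve.
    if tier == "green":
        return verdict == "approve"
    if tier == "red":
        return verdict == "request_changes"
    return True  # Yellow defers to human judgment either way

accuracy = sum(agrees(t, v) for t, v in pairs) / len(pairs)
print(f"monthly agreement with human outcomes: {accuracy:.0%}")  # 80%
```

A falling agreement score is the trigger to tighten the Green criteria before trusting more auto-merges.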

Evidence

  • Dashboard showing Green/Yellow/Red distribution across PRs
  • Auto-merge logs for Green PRs with zero post-merge reverts
  • Monthly auto-approve rate report showing 60%+ Green target tracking
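A minimal sketch of how such a monthly report might be computed from a classification log. The record shape and data are illustrative, not a prescribed format:

```python
# Illustrative monthly report: Green/Yellow/Red distribution and the
# auto-approve (Green) rate against the 60% target.
from collections import Counter

prs = [  # (pr_id, color) - hypothetical classification log for one month
    (101, "green"), (102, "green"), (103, "yellow"),
    (104, "green"), (105, "red"),
]

dist = Counter(color for _, color in prs)
green_rate = dist["green"] / len(prs)
print(f"distribution: {dict(dist)}")
print(f"auto-approve rate: {green_rate:.0%} (target: 60%)")
```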

What It Is

Green/Yellow/Red auto-evaluation is an automated system that assesses every pull request and assigns a quality tier that determines what happens next. Unlike traditional CI (which gives a binary pass/fail on tests and lint), the traffic-light system is nuanced:

  • Green - The PR meets all quality criteria: tests pass, coverage is maintained, lint is clean, the AI review agent found no issues, the diff is within safe size limits, and the changes touch no high-risk areas. Green PRs are candidates for auto-merge.
  • Yellow - The PR passes basic checks but requires human review for specific reasons: it touches security-sensitive code, modifies a core shared interface, introduces a new dependency, or has an AI review comment flagged as requiring human judgment. Yellow PRs route to the appropriate human reviewer.
  • Red - The PR has blocking issues: failing tests, lint violations, AI review agent flagged a high-confidence security or correctness issue, or the changes touch architectural boundaries that require explicit approval. Red PRs are returned to the author for remediation.

The evaluation is fully automated - a CI pipeline step (or the AI review agent itself) evaluates the PR against defined criteria and sets the status. No human decides the color; the algorithm does. The criteria are explicit, documented, and consistent.
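As a concrete illustration, the tiering above can be expressed as a small rule function evaluated in CI. The `PullRequest` shape, field names, and thresholds here are all hypothetical, not a prescribed schema:

```python
# Hypothetical traffic-light classifier; every field and limit is illustrative.
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    tests_pass: bool
    lint_clean: bool
    coverage_delta: float     # change in coverage, percentage points
    diff_lines: int
    new_dependency: bool
    touches_security_code: bool
    touches_core_interface: bool
    touches_architecture: bool
    ai_findings: list = field(default_factory=list)  # e.g. [("high", "sql-injection")]

MAX_GREEN_DIFF = 400  # illustrative safe-size limit

def classify(pr: PullRequest) -> str:
    """Return 'red', 'yellow', or 'green' per the policy above."""
    high_confidence_issue = any(sev == "high" for sev, _ in pr.ai_findings)
    # Red: blocking issues go back to the author.
    if (not pr.tests_pass or not pr.lint_clean
            or high_confidence_issue or pr.touches_architecture):
        return "red"
    # Yellow: passes basic checks but needs a human for a specific reason.
    if (pr.touches_security_code or pr.touches_core_interface
            or pr.new_dependency or pr.ai_findings):
        return "yellow"
    # Green: all criteria met and the diff is within safe size limits.
    if pr.coverage_delta >= 0 and pr.diff_lines <= MAX_GREEN_DIFF:
        return "green"
    return "yellow"  # anything else defaults to human review

print(classify(PullRequest(True, True, 0.2, 120, False, False, False, False)))  # green
```

Note the ordering: Red conditions are checked first, then Yellow, so a PR is Green only by surviving every disqualifier, which mirrors the "explicit, documented, consistent" requirement.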

This system is the bridge between L3's automated first-pass review and L4's auto-merge. Before you can auto-merge Green PRs, you need a trustworthy definition of what "Green" means. The traffic-light evaluation makes that definition explicit and machine-executable.

Why It Matters

The binary pass/fail model of traditional CI has a fundamental limitation: it treats all passing PRs as equivalent. A PR with 100% test coverage, zero lint errors, and an AI review that found nothing is equivalent, in CI's eyes, to a PR that barely passes the minimum checks. Both get a green checkmark.

The traffic-light system introduces nuance that the binary model can't express:

  • Enables proportional response - Green gets auto-merged; Yellow routes to the right reviewer; Red goes back to the author. Each outcome is calibrated to the actual risk of the change.
  • Makes policy explicit - What does "requires human review" actually mean? The Yellow criteria answer that question precisely and consistently. This is better than the implicit judgment call every reviewer currently makes.
  • Provides a quality signal - Tracking the Green rate over time tells you whether development process quality is improving or degrading. A falling Green rate is an early warning signal.
  • Eliminates ambiguity - Developers know exactly what determines their PR's path. If they want to go Green, the criteria are explicit. This drives quality improvement at the source (developers write better code to avoid Yellow/Red) rather than at review time.
  • Scales review capacity - Human reviewers only see Yellow and Red PRs. If 60% of PRs go Green, human review load is reduced by 60%. The same reviewer bandwidth handles significantly more PRs.

The specific criteria for each color tier will vary by team and codebase. What matters is that the criteria are explicit, documented, and enforced algorithmically - not dependent on reviewer judgment to apply consistently.

Tip

Define your Green criteria conservatively at first. It is better to start with a stricter Green definition (few PRs qualify) and loosen it as confidence grows than to start permissive and discover that auto-merged Green PRs are introducing bugs. Trust must be earned incrementally.
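One way to keep that starting policy explicit, and make every later loosening a reviewable change rather than a silent drift, is to encode it as data. All names, limits, and paths here are illustrative:

```python
# A deliberately strict v1 Green policy, encoded as data so that loosening
# it later is an explicit, version-controlled change. Names are illustrative.
GREEN_POLICY_V1 = {
    "max_diff_lines": 150,      # start small; raise as auto-merge earns trust
    "min_coverage_delta": 0.0,  # coverage may not drop
    "allowed_paths": ["src/", "tests/"],  # exclude e.g. infra/ entirely at first
    "require_zero_ai_findings": True,
}

def within_policy(diff_lines, coverage_delta, path, ai_findings,
                  policy=GREEN_POLICY_V1):
    return (diff_lines <= policy["max_diff_lines"]
            and coverage_delta >= policy["min_coverage_delta"]
            and any(path.startswith(p) for p in policy["allowed_paths"])
            and (not ai_findings or not policy["require_zero_ai_findings"]))

print(within_policy(80, 0.1, "src/app.py", []))  # True under the strict v1 policy
```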

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob, Head of Engineering

Bob's team has adopted the AI review agent and linting, and is seeing good results. Human review time has dropped. But he's started wondering if some PRs could be merged without any human review at all - the AI consistently rates them as clean, the tests pass, and the human reviewer always approves them with no additional comments. He wants a way to identify and automate these PRs.

What Bob should do - role-specific action plan

Sarah, Productivity Lead

Sarah wants to demonstrate to her engineering leadership that the team's quality investments are paying off. She has data on PR cycle time, post-merge bugs, and AI adoption. But she doesn't have a metric that shows the quality of the overall development process improving over time.

What Sarah should do - role-specific action plan

Victor, Staff Engineer - AI Champion

Victor is skeptical of the traffic-light system. He's worried that Green will be defined too loosely, that auto-merge will create incidents, and that the team will lose the "second pair of eyes" that review provides for catching unexpected issues. He's not wrong to be cautious.

What Victor should do - role-specific action plan