Development · L4 Optimized · Code Review & Quality

Green/Yellow/Red auto-evaluation

A traffic-light quality evaluation system that replaces binary pass/fail CI with a nuanced, policy-driven assessment - enabling selective automation and focusing human review where it genuinely adds value.

  • Automated Green/Yellow/Red classification runs on every PR
  • Green-classified PRs auto-merge without human review
  • Auto-approve rate target of 60%+ Green PRs is tracked and reported
  • Yellow PRs receive expedited human review (within 1 hour)
  • Classification model accuracy is validated monthly against human review outcomes
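The monthly validation in the last criterion can be sketched as a simple agreement check, assuming a log that pairs each machine-assigned tier with the verdict a human reviewer eventually reached on the same PR. The record shape and data here are illustrative, not a real API:

```python
# Hypothetical monthly validation: compare the classifier's tier against
# the outcome a human reviewer reached on the same PR.
pairs = [  # (machine_tier, human_verdict) - illustrative data
    ("green", "approve"),
    ("yellow", "approve"),
    ("green", "request_changes"),  # a miss: Green should not need changes
    ("yellow", "request_changes"),
    ("red", "request_changes"),
]

def agrees(tier: str, verdict: str) -> bool:
    # Green should never have needed changes; Red should never be a clean approve.
    if tier == "green":
        return verdict == "approve"
    if tier == "red":
        return verdict == "request_changes"
    return True  # Yellow defers to human judgment either way

accuracy = sum(agrees(t, v) for t, v in pairs) / len(pairs)
print(f"monthly agreement with human outcomes: {accuracy:.0%}")  # 80%
```

A falling agreement score is the trigger to tighten the Green criteria before trusting more auto-merges.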

Evidence

  • Dashboard showing Green/Yellow/Red distribution across PRs
  • Auto-merge logs for Green PRs with zero post-merge reverts
  • Monthly auto-approve rate report showing 60%+ Green target tracking
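A minimal sketch of how such a monthly report might be computed from a classification log. The record shape and data are illustrative, not a prescribed format:

```python
# Illustrative monthly report: Green/Yellow/Red distribution and the
# auto-approve (Green) rate against the 60% target.
from collections import Counter

prs = [  # (pr_id, color) - hypothetical classification log for one month
    (101, "green"), (102, "green"), (103, "yellow"),
    (104, "green"), (105, "red"),
]

dist = Counter(color for _, color in prs)
green_rate = dist["green"] / len(prs)
print(f"distribution: {dict(dist)}")
print(f"auto-approve rate: {green_rate:.0%} (target: 60%)")
```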

What It Is

Green/Yellow/Red auto-evaluation is an automated system that assesses every pull request and assigns a quality tier that determines what happens next. Unlike traditional CI (which gives a binary pass/fail on tests and lint), the traffic-light system is nuanced:

  • Green - The PR meets all quality criteria: tests pass, coverage is maintained, lint is clean, the AI review agent found no issues, the diff is within safe size limits, and the changes touch no high-risk areas. Green PRs are candidates for auto-merge.
  • Yellow - The PR passes basic checks but requires human review for specific reasons: it touches security-sensitive code, modifies a core shared interface, introduces a new dependency, or has an AI review comment flagged as requiring human judgment. Yellow PRs route to the appropriate human reviewer.
  • Red - The PR has blocking issues: failing tests, lint violations, AI review agent flagged a high-confidence security or correctness issue, or the changes touch architectural boundaries that require explicit approval. Red PRs are returned to the author for remediation.

The evaluation is fully automated - a CI pipeline step (or the AI review agent itself) evaluates the PR against defined criteria and sets the status. No human decides the color; the algorithm does. The criteria are explicit, documented, and consistent.
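As a concrete illustration, the tiering above can be expressed as a small rule function evaluated in CI. The `PullRequest` shape, field names, and thresholds here are all hypothetical, not a prescribed schema:

```python
# Hypothetical traffic-light classifier; every field and limit is illustrative.
from dataclasses import dataclass, field

@dataclass
class PullRequest:
    tests_pass: bool
    lint_clean: bool
    coverage_delta: float     # change in coverage, percentage points
    diff_lines: int
    new_dependency: bool
    touches_security_code: bool
    touches_core_interface: bool
    touches_architecture: bool
    ai_findings: list = field(default_factory=list)  # e.g. [("high", "sql-injection")]

MAX_GREEN_DIFF = 400  # illustrative safe-size limit

def classify(pr: PullRequest) -> str:
    """Return 'red', 'yellow', or 'green' per the policy above."""
    high_confidence_issue = any(sev == "high" for sev, _ in pr.ai_findings)
    # Red: blocking issues go back to the author.
    if (not pr.tests_pass or not pr.lint_clean
            or high_confidence_issue or pr.touches_architecture):
        return "red"
    # Yellow: passes basic checks but needs a human for a specific reason.
    if (pr.touches_security_code or pr.touches_core_interface
            or pr.new_dependency or pr.ai_findings):
        return "yellow"
    # Green: all criteria met and the diff is within safe size limits.
    if pr.coverage_delta >= 0 and pr.diff_lines <= MAX_GREEN_DIFF:
        return "green"
    return "yellow"  # anything else defaults to human review

print(classify(PullRequest(True, True, 0.2, 120, False, False, False, False)))  # green
```

Note the ordering: Red conditions are checked first, then Yellow, so a PR is Green only by surviving every disqualifier, which mirrors the "explicit, documented, consistent" requirement.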

This system is the bridge between L3's automated first-pass review and L4's auto-merge. Before you can auto-merge Green PRs, you need a trustworthy definition of what "Green" means. The traffic-light evaluation makes that definition explicit and machine-executable.

Why It Matters

The binary pass/fail model of traditional CI has a fundamental limitation: it treats all passing PRs as equivalent. A PR with 100% test coverage, zero lint errors, and an AI review that found nothing is equivalent, in CI's eyes, to a PR that barely passes the minimum checks. Both get a green checkmark.

The traffic-light system introduces nuance that the binary model can't express:

  • Enables proportional response - Green gets auto-merged; Yellow routes to the right reviewer; Red goes back to the author. Each outcome is calibrated to the actual risk of the change.
  • Makes policy explicit - What does "requires human review" actually mean? The Yellow criteria answer that question precisely and consistently. This is better than the implicit judgment call every reviewer currently makes.
  • Provides a quality signal - Tracking the Green rate over time tells you whether development process quality is improving or degrading. A falling Green rate is an early warning signal.
  • Eliminates ambiguity - Developers know exactly what determines their PR's path. If they want to go Green, the criteria are explicit. This drives quality improvement at the source (developers write better code to avoid Yellow/Red) rather than at review time.
  • Scales review capacity - Human reviewers only see Yellow and Red PRs. If 60% of PRs go Green, human review load is reduced by 60%. The same reviewer bandwidth handles significantly more PRs.

The specific criteria for each color tier will vary by team and codebase. What matters is that the criteria are explicit, documented, and enforced algorithmically - not dependent on reviewer judgment to apply consistently.

Tip

Define your Green criteria conservatively at first. It is better to start with a stricter Green definition (few PRs qualify) and loosen it as confidence grows than to start permissive and discover that auto-merged Green PRs are introducing bugs. Trust must be earned incrementally.
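One way to keep that starting policy explicit, and make every later loosening a reviewable change rather than a silent drift, is to encode it as data. All names, limits, and paths here are illustrative:

```python
# A deliberately strict v1 Green policy, encoded as data so that loosening
# it later is an explicit, version-controlled change. Names are illustrative.
GREEN_POLICY_V1 = {
    "max_diff_lines": 150,      # start small; raise as auto-merge earns trust
    "min_coverage_delta": 0.0,  # coverage may not drop
    "allowed_paths": ["src/", "tests/"],  # exclude e.g. infra/ entirely at first
    "require_zero_ai_findings": True,
}

def within_policy(diff_lines, coverage_delta, path, ai_findings,
                  policy=GREEN_POLICY_V1):
    return (diff_lines <= policy["max_diff_lines"]
            and coverage_delta >= policy["min_coverage_delta"]
            and any(path.startswith(p) for p in policy["allowed_paths"])
            and (not ai_findings or not policy["require_zero_ai_findings"]))

print(within_policy(80, 0.1, "src/app.py", []))  # True under the strict v1 policy
```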

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob, Head of Engineering

Bob's team has adopted the AI review agent and linting, and is seeing good results. Human review time has dropped. But he's started wondering if some PRs could be merged without any human review at all - the AI consistently rates them as clean, the tests pass, and the human reviewer always approves them with no additional comments. He wants a way to identify and automate these PRs.

What Bob should do - role-specific action plan

Sarah, Productivity Lead

Sarah wants to demonstrate to her engineering leadership that the team's quality investments are paying off. She has data on PR cycle time, post-merge bugs, and AI adoption. But she doesn't have a metric that shows the quality of the overall development process improving over time.

What Sarah should do - role-specific action plan

Victor, Staff Engineer - AI Champion

Victor is skeptical of the traffic-light system. He's worried that Green will be defined too loosely, that auto-merge will create incidents, and that the team will lose the "second pair of eyes" that review provides for catching unexpected issues. He's not wrong to be cautious.

What Victor should do - role-specific action plan