Review shifts: from writing to evaluating code

When most code in a PR is agent-generated, the reviewer's job changes fundamentally.

·Platform Engineer role with AI tooling responsibility exists on the platform team
·Context Engineer is a full dedicated role (not part-time, not combined with other duties)
·Team's primary activity has shifted from writing code to evaluating and reviewing AI-generated code

·Role definitions are updated to reflect AI-augmented responsibilities
·Hiring criteria include AI tool proficiency

Evidence

·Platform Engineer job description including AI tooling responsibilities
·Context Engineer role as a dedicated position (headcount or full-time allocation)
·Time tracking showing that the majority of developer time goes to review/evaluation rather than writing

What It Is

When most code in a PR is agent-generated, the reviewer's job changes fundamentally. In a world where every line was written by a human who thought carefully about the implementation, review is primarily about finding bugs and improving the code. In a world where most code was generated by an agent working from a task specification, review is primarily about evaluating whether the agent understood the intent and produced something that fits the system - and secondarily about finding the specific failure modes of AI-generated code.

The shift is from microreview to macrofit evaluation. Human-written code review looks at individual lines: is this the right algorithm? Is this null check necessary? Is this naming clear? AI-generated code review looks at the system level: did the agent understand what this feature is supposed to do? Does the implementation fit how the rest of the system is structured? Does it handle the edge cases that this domain requires? The individual line-level concerns are still there, but they are a smaller fraction of the review value.

This shift requires new review skills that aren't obvious from traditional code review practice. The reviewer needs to hold two things in mind simultaneously: the original intent (what the task specification said to build) and the actual implementation (what the agent built). A significant class of AI-generated defects is not "the code is wrong" - it's "the code is a plausible implementation of a misunderstanding of the spec." These are harder to catch than traditional bugs because the code runs, passes tests, and looks reasonable. Catching them requires the reviewer to independently reason about whether the implementation achieves the intended purpose, not just whether it is correct code.

The practical change is that reviewers should spend more time reading the task specification and less time tracing the implementation. A reviewer who starts with the code and works their way back to the intent is less effective than one who starts with the intent and checks whether the code achieves it. This may feel counterintuitive for developers trained in traditional code review, where the code is the primary artifact.

By mid-2026, this shift was moving even further up the lifecycle. A June 24 report on AI moving from code review to PRD governance describes Uber, DoorDash, and Cloudflare using AI to evaluate PRDs and specs before any implementation begins - the same intent-versus-implementation evaluation, applied one stage earlier, where catching a misaligned spec is cheaper than catching a misaligned PR. Alongside this, "loop engineering" emerged as a named skill sitting above harness engineering (Addy Osmani, LangChain, Loopcraft): the discipline of designing and steering the agent's evaluate-and-iterate loop rather than just reviewing its final output. Reviewers at the leading edge are not only evaluating generated code - they are evaluating the specs that drive it and the loops that produce it.

Why It Matters

This shift in review approach produces better outcomes than applying traditional review methods to AI-generated code:

·Catches intent misalignment before it reaches production - the most expensive AI-generated defects are spec misunderstandings that pass tests; intent-first review catches these where line-by-line review misses them
·Makes review faster - evaluating whether 200 lines of code achieve the stated intent is often faster than checking every line for correctness; reviewers who make this shift report review becoming less cognitively exhausting
·Improves agent task specification quality - when reviewers consistently identify that agents misunderstood the spec, it creates evidence for improving the spec template; review becomes a feedback channel for prompt quality
·Enables higher throughput - L3 teams produce more code volume than L1 or L2 teams, and traditional line-by-line review becomes a bottleneck at this volume; the shift to intent-based review is what makes the throughput sustainable
·Develops a new high-value skill - developers who become expert at evaluating AI-generated code fitness are developing a skill that becomes more valuable as AI code volume increases; this is the foundation of the senior-reviewer role at L4

Tip

A good review question to start with for AI-generated code is: "What would the agent have had to misunderstand to write this?" If you can answer this, you know what to check first. Most AI-generated defects fall into 3-5 consistent misunderstanding categories for any given codebase - learn them and look for them explicitly.
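One way to learn those categories is to tag past review findings and count them. The sketch below is only an illustration: the category names and the sample findings are made up, and a real team would substitute the tags its own reviewers assign.

```python
from collections import Counter

# Hypothetical review findings, each tagged with the misunderstanding
# category the reviewer assigned. Names and data are illustrative only.
findings = [
    "auth_pattern",
    "error_handling",
    "data_model_assumption",
    "error_handling",
    "auth_pattern",
    "error_handling",
    "pagination",
]


def top_categories(tags, limit=5):
    """Return the most frequent categories, most common first."""
    return [category for category, _count in Counter(tags).most_common(limit)]


# The two categories a reviewer should check first on the next PR.
print(top_categories(findings, limit=2))
```

Rerunning the count every few weeks shows whether the list of categories is stable or whether context improvements have retired one of them.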

Getting Started

·Adopt an intent-first review protocol - before reading any code in an AI-assisted PR, read the task specification (the prompt that was given to the agent). Form a mental model of what a correct implementation should look like. Then read the code against your mental model rather than building your mental model from the code.
·Create an AI code review checklist - identify the 5-8 most common categories of agent misunderstanding in your codebase (authentication patterns, error handling, data model assumptions, etc.) and create a checklist that reviewers run through for every AI-generated PR. This checklist catches the systematic failure modes that reviewers might otherwise miss.
·Review the diff against the spec, not against the previous code - traditional review diffs compare new code to old code. For AI-generated PRs, the more useful comparison is new code against the task specification. Train reviewers to read the spec first and verify each requirement is met, rather than reading the diff first and inferring what changed.
·Distinguish "good enough" from "perfect" - AI-generated code often has stylistic variations from the codebase norm that are harmless. Reviewers who apply the same standards they would to human-written code will request changes on many low-priority items, creating unnecessary friction. Define a tiered review standard: blocking (must change), suggested (should change), optional (take it or leave it). Most AI-specific stylistic variations belong in optional.
·Track review metrics separately for AI and human code - measure review time, defects found, and defects per PR separately for AI-generated code. This reveals whether the new review approach is working, where the AI-specific failure modes are concentrated, and how to prioritize context improvements.
·Build reviewer pairing into the transition - for developers learning the new review approach, pair them with an experienced reviewer for their first ten AI-generated PR reviews. Watching an expert apply intent-first review is more effective than any description of the process.
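The metrics step above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `ReviewedPR` record, the "ai"/"human" origin labels, and the sample numbers are all hypothetical, and real data would come from whatever review tooling the team already uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReviewedPR:
    origin: str            # "ai" or "human" (hypothetical labels)
    review_minutes: float  # time the reviewer spent on the PR
    defects_found: int     # defects raised during review


def summarize(prs, origin):
    """Summarize review time and defects for PRs of one origin."""
    subset = [pr for pr in prs if pr.origin == origin]
    if not subset:
        return None
    return {
        "prs": len(subset),
        "mean_review_minutes": mean([pr.review_minutes for pr in subset]),
        "defects_per_pr": sum(pr.defects_found for pr in subset) / len(subset),
    }


# Made-up sample data, for illustration only.
sample = [
    ReviewedPR("ai", 20, 1),
    ReviewedPR("ai", 30, 3),
    ReviewedPR("human", 45, 2),
]

for origin in ("ai", "human"):
    print(origin, summarize(sample, origin))
```

Keeping the two summaries side by side is what makes the comparison useful: a rising defects-per-PR figure on the AI side points at context or spec quality rather than at reviewer effort.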


Common Pitfalls

Applying traditional review discipline to AI-generated code. Reviewing every line of AI-generated code with the same thoroughness as hand-written code is exhausting and unsustainable at scale. It's also unnecessary - the failure modes are different. Traditional thorough review catches individual mistakes; intent-based review catches systematic misalignments. Both are valuable, but they do not need the same depth of scrutiny on every PR.

Reviewing code without seeing the task specification. Reviewers who evaluate AI-generated code without knowing what the agent was asked to do are missing the most important context. The PR description for AI-assisted work should always include the task specification that was given to the agent. If it doesn't, the reviewer should ask for it before reviewing.
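A team could enforce this with a small automated check on the PR description. The sketch below assumes a purely hypothetical convention, namely that the description carries the agent's prompt under a heading called "Task specification"; the heading name and the function are illustrative, not part of any real tool.

```python
import re

# Hypothetical convention: the agent's prompt sits under a heading
# named "Task specification" in the PR description.
SPEC_HEADING = re.compile(
    r"^#+\s*task specification\s*$", re.IGNORECASE | re.MULTILINE
)


def has_task_spec(description: str) -> bool:
    """Return True if the description has the heading and text after it.

    This only checks that something follows the heading; it does not
    judge whether that text is a usable specification.
    """
    match = SPEC_HEADING.search(description)
    if match is None:
        return False
    return bool(description[match.end():].strip())


print(has_task_spec("## Task specification\nAdd rate limiting to the login endpoint"))
print(has_task_spec("Fixes the login bug"))
```

A check like this cannot tell a good spec from a bad one, but it does stop a review from starting without the context the reviewer needs.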

Treating all AI-generated code as untrustworthy. Some reviewers, particularly experienced senior developers, apply extra scrutiny to AI-generated code because they don't trust it. This is a reasonable instinct at L1 and L2 but becomes counterproductive at L3, where AI code quality with good context is comparable to average human code quality. Trust should be calibrated to context quality, not to whether the code was AI-generated.

Missing the feedback opportunity. When review finds that an agent misunderstood the spec, this is information for two systems: the task specification process (the prompt needs to be clearer) and the context infrastructure (the CLAUDE.md needs to cover this case). If review just fixes the code and moves on, this information is lost. Create a lightweight mechanism to capture review findings that indicate context or spec quality issues.
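A lightweight capture mechanism can be as small as a record with a routing field. The following is a sketch under assumptions: the `Finding` record, the two `Target` values, and the sample entries (including the placeholder PR labels) are hypothetical and stand in for whatever tracker the team already has.

```python
from dataclasses import dataclass
from enum import Enum


class Target(Enum):
    SPEC = "task specification"  # the prompt needs to be clearer
    CONTEXT = "context file"     # e.g. CLAUDE.md needs to cover the case


@dataclass
class Finding:
    pr: str        # placeholder PR label, not a real identifier
    summary: str   # what the agent misunderstood
    target: Target # which system should absorb the lesson


def group_by_target(findings):
    """Group finding summaries by the system that should be improved."""
    grouped = {target: [] for target in Target}
    for finding in findings:
        grouped[finding.target].append(finding.summary)
    return grouped


# Made-up examples, for illustration only.
log = [
    Finding("PR-A", "Spec did not say retries must be idempotent", Target.SPEC),
    Finding("PR-B", "Agent used the legacy auth helper", Target.CONTEXT),
    Finding("PR-C", "Agent ignored the error-wrapping convention", Target.CONTEXT),
]

for target, summaries in group_by_target(log).items():
    print(target.value, len(summaries))
```

The routing field is the important part: it forces the reviewer to decide, at the moment of the finding, whether the fix belongs in the prompt or in the shared context.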

Not adapting review criteria to the task type. AI-generated code for a well-defined, well-tested business logic function requires different review focus than AI-generated code for a new feature in an under-tested area. Review criteria should vary by task type and risk level, not be uniform across all AI-generated code.
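One possible way to make varying criteria explicit is a small function that maps task type and risk to review focus areas. Everything in this sketch is an assumption made for illustration: the task-type and risk labels, the rules, and the wording of the focus areas would need to come from the team's own experience.

```python
def review_focus(task_type: str, risk: str, well_tested: bool) -> list[str]:
    """Return the review focus areas for one AI-generated PR.

    All labels and rules here are illustrative assumptions.
    """
    # Intent alignment is checked on every AI-generated PR.
    focus = ["implementation matches the task specification"]
    if task_type == "new_feature":
        focus.append("fit with existing system structure")
    if not well_tested:
        focus.append("edge cases not covered by tests")
    if risk == "high":
        focus.append("line-level read of security- or data-sensitive paths")
    return focus


# A well-tested business logic change needs only the baseline check.
print(review_focus("business_logic", "low", well_tested=True))
# A risky new feature in an under-tested area gets the full list.
print(review_focus("new_feature", "high", well_tested=False))
```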


How Different Roles See It

Bob, Head of Engineering

Bob's team is producing more code volume than ever, but review has become a bottleneck. His senior developers are drowning in PR reviews and complaining about the volume. The junior developers, who are generating most of the AI-assisted code, are frustrated by the slow review cycle. Bob suspects the review process is not adapting to the new reality.

What Bob should do: Bob should run a review retrospective that asks three questions. How are people currently reviewing AI-generated PRs? What is taking the most time? Where are the most back-and-forth review cycles happening? This retrospective will almost certainly reveal that reviewers are applying traditional line-by-line standards to AI-generated code that mostly fails on intent alignment rather than individual line correctness. Bob should introduce the intent-first protocol and the AI code review checklist, run a workshop to walk through the approach with examples, and measure review cycle time before and after. The expectation is that review time per PR drops by 20-30% and the quality of review feedback improves (more "this doesn't match the intent" comments, fewer "move this variable" comments).
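The before-and-after measurement can be as simple as comparing medians. In this sketch the cycle times are made-up numbers chosen only to show the arithmetic, and using the median rather than the mean is an assumption, picked because a few very slow reviews would otherwise dominate the result.

```python
from statistics import median


def percent_change(before, after):
    """Percent change in median review cycle time (negative means faster)."""
    baseline = median(before)
    return (median(after) - baseline) / baseline * 100


# Made-up review cycle times in hours, for illustration only.
before_hours = [10, 12, 8, 14, 11]  # median 11
after_hours = [8, 9, 7, 10, 8]      # median 8

print(round(percent_change(before_hours, after_hours), 1))  # -27.3
```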


Sarah, Productivity Lead

Sarah's developer survey shows a split: AI tool users rate code review as "more frustrating" than non-AI users, despite generating code faster. When she digs into the comments, the pattern is clear: AI-generated PRs are getting the same level of detailed stylistic feedback as hand-written PRs, and developers feel the standard is unreasonably high. She needs to recalibrate the review culture without reducing quality.

What Sarah should do: Sarah should make the review standard explicit. Currently it's implicit and inconsistent - different reviewers apply different standards to AI-generated code. She should work with the engineering leads to define the tiered review standard (blocking/suggested/optional) for AI-generated PRs, create example review comments for each tier, and share this standard in the engineering handbook. She should also create an internal guide to intent-first review that developers can reference. The goal is not lower standards - it's appropriately targeted standards that focus review energy on the defects that actually matter and don't create friction on matters of style.
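The tiered standard can be written down precisely enough that tooling could enforce it. The sketch below is illustrative only: the `Tier` and `Comment` types, the merge rule, and the sample comments are hypothetical, and the tier names simply follow the blocking/suggested/optional split described above.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BLOCKING = "must change"
    SUGGESTED = "should change"
    OPTIONAL = "take it or leave it"


@dataclass
class Comment:
    tier: Tier
    text: str


def can_merge(comments):
    """A PR may merge once no blocking comment remains open."""
    return all(comment.tier is not Tier.BLOCKING for comment in comments)


# Made-up example comments, one per tier.
review = [
    Comment(Tier.BLOCKING, "This does not match the intent: the spec asks for per-user limits"),
    Comment(Tier.SUGGESTED, "Reuse the existing retry helper here"),
    Comment(Tier.OPTIONAL, "Naming differs from the codebase norm"),
]

print(can_merge(review))
print(can_merge([c for c in review if c.tier is not Tier.BLOCKING]))
```

Requiring reviewers to pick a tier for every comment is what makes the standard explicit: a purely stylistic note can still be left, but it can no longer hold up the merge.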


Victor, Staff Engineer - AI Champion

Victor reviews more AI-generated code than anyone on the team and has developed strong intuitions for what to look for. He can scan a 200-line AI-generated PR in 10 minutes and find the one or two issues that actually matter. He can also spot when an AI has misunderstood the spec before reading a single line of code, just by comparing the size of the diff with the complexity of the spec. He hasn't articulated how he does this.

What Victor should do: Victor should externalize his review process. He should write up the three or four heuristics he applies when reviewing AI-generated PRs and share them as the team's AI code review guide. He should also create a "before/after" example: take a real PR review from the past month, show the original review comments, and then show how the intent-first review would have approached the same PR. This makes the skill concrete and teachable. Victor should also advocate for the team to track review metrics separately for AI and human code - not to apply different standards, but to learn where AI-generated code is and isn't working so that context infrastructure can be improved in the right places.
