Agent-generated unit tests + human acceptance tests

A hybrid testing strategy at L2 that uses AI to generate unit test scaffolding at scale while keeping business-behavior verification in human hands.

·Agents generate unit tests; humans write acceptance tests
·Flaky test quarantine process is active (flaky tests are isolated, not deleted)
·Humans define the expected results for important paths (not just snapshotting current output)

·Flaky test count is tracked and reported weekly
·Quarantined tests have a resolution SLA (e.g., fix or delete within 30 days)
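
The resolution SLA can be checked mechanically. The sketch below is one possible shape, not a prescribed tool: the QuarantinedTest record, its field names, and the overdueQuarantined helper are all hypothetical, and the 30-day default simply mirrors the example above.

```typescript
// Hypothetical quarantine entry; the field names are assumptions for illustration.
interface QuarantinedTest {
  name: string;
  quarantinedOn: string; // ISO date, e.g. "2024-01-10"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the quarantined tests whose age exceeds the SLA as of `today`.
function overdueQuarantined(
  tests: QuarantinedTest[],
  today: Date,
  slaDays: number = 30,
): QuarantinedTest[] {
  return tests.filter((t) => {
    const ageDays = (today.getTime() - Date.parse(t.quarantinedOn)) / MS_PER_DAY;
    return ageDays > slaDays;
  });
}

const quarantine: QuarantinedTest[] = [
  { name: "checkout retries on timeout", quarantinedOn: "2024-01-10" }, // 51 days old
  { name: "cart badge updates", quarantinedOn: "2024-02-20" },          // 10 days old
];

// Only the first entry is past the 30-day SLA on 2024-03-01.
const overdue = overdueQuarantined(quarantine, new Date("2024-03-01T00:00:00Z"));
```

A list like `overdue` could feed the weekly flaky-test report, so that tests past the SLA are either fixed or deleted rather than left in quarantine indefinitely.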

Evidence

·Test files with agent attribution alongside human-authored acceptance tests
·Quarantine list or label in test framework configuration
·Flaky test tracking dashboard or issue tracker labels

What It Is

The hybrid testing strategy at Level 2 (Guided) makes a deliberate division of labor: AI agents automatically generate unit tests for individual functions and components, while humans write acceptance tests that verify the system meets its business requirements. This isn't a compromise - it's a recognition that AI and humans are good at different things in the testing domain.

Unit tests are well-suited to AI generation. They test discrete, well-defined units of code - a function, a class, a module - against specific inputs and outputs. The structure is formulaic: setup, execute, assert. Given a clear function signature and type information, an AI agent can generate comprehensive unit tests covering happy paths, boundary values, null inputs, and error conditions faster and more exhaustively than a human would bother to do manually.
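
As a rough illustration of that formulaic structure, the sketch below shows the kind of unit tests an agent might produce for a small function. The clampPercent function and the runTest/expectEqual helpers are hypothetical stand-ins (a real project would use its own test framework); the point is only the setup, execute, assert shape and the case categories.

```typescript
// Hypothetical unit under test: clamps a percentage into the range 0..100.
function clampPercent(value: number | null): number {
  if (value === null) return 0; // treat a missing value as 0%
  if (Number.isNaN(value)) throw new RangeError("value must be a number");
  return Math.min(100, Math.max(0, value));
}

// Tiny stand-ins for a test framework so the sketch runs on its own.
function runTest(body: () => void): boolean {
  try {
    body();
    return true;
  } catch {
    return false;
  }
}

function expectEqual(actual: unknown, expected: unknown): void {
  if (actual !== expected) {
    throw new Error(`expected ${String(expected)}, got ${String(actual)}`);
  }
}

// One case per category: happy path, boundary values, null input, error condition.
const unitResults: Record<string, boolean> = {
  "happy path": runTest(() => {
    const input = 42;                   // setup
    const output = clampPercent(input); // execute
    expectEqual(output, 42);            // assert
  }),
  "boundary: lower edge": runTest(() => expectEqual(clampPercent(0), 0)),
  "boundary: just above upper edge": runTest(() => expectEqual(clampPercent(100.5), 100)),
  "null input": runTest(() => expectEqual(clampPercent(null), 0)),
  "error condition: NaN throws RangeError": runTest(() => {
    let threw = false;
    try {
      clampPercent(NaN);
    } catch (e) {
      threw = e instanceof RangeError;
    }
    expectEqual(threw, true);
  }),
};
```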

Acceptance tests are different. They verify that the system does what the product intended - and that intent lives in a ticket, a user story, a product specification, or the minds of the people who wrote the requirements. The AI doesn't have access to that information at L1-L2 (it learns to read requirements at L3). A human writing an acceptance test must understand what the feature is supposed to do and encode that understanding as an assertion. This is the one part of the testing workflow that cannot be delegated to an agent without first solving the requirements-comprehension problem.

The hybrid model at L2 captures most of the efficiency gains of AI test generation while preserving the correctness guarantees that only human-authored acceptance tests can provide. It also resolves the circular testing problem: unit tests may reflect what the code does, but acceptance tests verify what the code should do.
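
The circular testing problem is easy to reproduce in miniature. In the sketch below, the shipping rule, the shippingFee function, and the deliberate off-by-one bug are all invented for illustration: a test whose expected value was recorded from the code's own output passes, while a test whose expected value comes from the requirement fails and exposes the bug.

```typescript
// Hypothetical requirement: "Orders of $50 or more ship free; otherwise shipping is $5.99."
// The implementation below has a deliberate bug: it uses > where the requirement implies >=.
function shippingFee(orderTotal: number): number {
  return orderTotal > 50 ? 0 : 5.99;
}

// Circular test: the expected value was captured from the code's current output.
const recordedOutput = shippingFee(50); // 5.99, recorded from the buggy code
const circularTestPasses = shippingFee(50) === recordedOutput;

// Requirement-derived test: the expected value comes from the requirement text.
const expectedFromRequirement = 0; // a $50 order ships free
const requirementTestPasses = shippingFee(50) === expectedFromRequirement;

// circularTestPasses is true; requirementTestPasses is false.
```

The first test can only ever confirm that the code does what it currently does. The second encodes intent, which is why it has to come from someone who has read the requirement.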

Why It Matters

The hybrid model matters because it operationalizes the division between two types of correctness, mechanical and behavioral, and that division brings three practical benefits:

Mechanical correctness - Does the function handle null inputs? Does it throw the right exception? Does the loop terminate? AI is excellent at generating tests for these questions.
Behavioral correctness - Does the feature do what the customer expected? Does the discount apply to the right tier? Is the permission model correct? An AI cannot answer these questions without access to the requirements, so humans have to.
Coverage velocity - AI-generated unit tests can bring coverage from 35% to 75% within a sprint on a single service. Humans alone cannot move that fast. The hybrid model makes the coverage climb tractable.
Risk-proportionate testing - High-risk business logic gets human-authored acceptance tests. Low-risk utility code gets AI-generated unit tests. Testing effort is proportionate to consequence, not distributed evenly.
Scalable as code grows - As AI agents write more code at L3+, the unit test generation can scale automatically alongside code generation. The hybrid model is designed to scale with AI-assisted development.

Tip

Establish a naming convention to distinguish AI-generated unit tests from human-authored acceptance tests. A simple approach: *.unit.test.ts for AI-generated, *.acceptance.test.ts for human-authored. This makes the distinction visible in the codebase and enables tracking coverage by test type.
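
With that suffix convention in place, counting tests by type becomes a simple string check. The sketch below assumes the two suffixes suggested above; the classifyTestFile and countByKind helpers and the file paths are hypothetical.

```typescript
type TestKind = "unit" | "acceptance" | "unclassified";

// Classifies a test file by the suffix convention described in the tip.
function classifyTestFile(path: string): TestKind {
  if (path.endsWith(".unit.test.ts")) return "unit";
  if (path.endsWith(".acceptance.test.ts")) return "acceptance";
  return "unclassified";
}

// Tallies how many files fall into each kind.
function countByKind(paths: string[]): Record<TestKind, number> {
  const counts: Record<TestKind, number> = { unit: 0, acceptance: 0, unclassified: 0 };
  for (const p of paths) {
    counts[classifyTestFile(p)] += 1;
  }
  return counts;
}

// Hypothetical file list, e.g. gathered from the repository by a separate script.
const testFileCounts = countByKind([
  "src/cart/total.unit.test.ts",
  "src/cart/parse.unit.test.ts",
  "src/checkout/discount.acceptance.test.ts",
  "src/legacy/misc.test.ts",
]);
// testFileCounts is { unit: 2, acceptance: 1, unclassified: 1 }
```

The "unclassified" bucket is a design choice: it surfaces test files that predate the convention instead of silently counting them as one type or the other.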

Getting Started

Define the boundary - Agree as a team on what constitutes a "unit test" vs. an "acceptance test." A useful heuristic: unit tests test a single function or class in isolation; acceptance tests test a user-observable behavior, usually spanning multiple components. Document this in your CLAUDE.md or test guidelines.
Set up AI-assisted unit test generation - Configure your AI tool (GitHub Copilot, Claude, or a custom agent) to generate unit tests when given a function or module. Provide it with your existing test structure as context so it follows your conventions.
Create an acceptance test template - Write a standardized template for acceptance tests that includes: reference to the ticket/requirement, the user scenario being tested, and the expected outcome in business terms. This template guides humans writing acceptance tests.
Start with new code - Apply the hybrid model to all new code going forward. Don't try to retrofit it to legacy code immediately - you'll stall before gaining momentum.
Track coverage by test type - Measure how much of your coverage comes from unit tests vs. acceptance tests. Over time, you want the acceptance test layer growing as you add features, not just the unit test layer.
Set a minimum acceptance test requirement - For every user-facing feature in a ticket, require at least one acceptance test before the PR can merge. AI generates unit tests; the human PR author writes the acceptance test.
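
The template and the minimum requirement from the steps above can be sketched together. Everything here is an assumption for illustration: the AcceptanceTestRecord fields mirror the three template elements, and the PullRequest shape, the featuresMissingAcceptanceTest helper, and the ticket IDs are hypothetical.

```typescript
// Hypothetical template fields, mirroring the three elements of the template.
interface AcceptanceTestRecord {
  ticket: string;          // reference to the ticket/requirement
  scenario: string;        // the user scenario being tested
  expectedOutcome: string; // the expected outcome in business terms
}

// Hypothetical summary of a PR as seen by a merge check.
interface PullRequest {
  userFacingFeatures: string[]; // ticket IDs of user-facing features in this PR
  acceptanceTests: AcceptanceTestRecord[];
}

// Returns the tickets with no acceptance test; an empty list means the PR meets the minimum.
function featuresMissingAcceptanceTest(pr: PullRequest): string[] {
  const covered = new Set(pr.acceptanceTests.map((t) => t.ticket));
  return pr.userFacingFeatures.filter((ticket) => !covered.has(ticket));
}

const examplePr: PullRequest = {
  userFacingFeatures: ["SHOP-101", "SHOP-102"],
  acceptanceTests: [
    {
      ticket: "SHOP-101",
      scenario: "A Gold-tier customer checks out a $200.00 cart",
      expectedOutcome: "The customer is charged $180.00",
    },
  ],
};

// SHOP-102 has no acceptance test, so this PR would not yet meet the minimum.
const missing = featuresMissingAcceptanceTest(examplePr);
```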

Common Pitfalls

Letting AI generate acceptance tests anyway. The boundary you establish only works if you enforce it. Developers under deadline pressure will ask the AI to generate everything, including acceptance tests. If those tests are reviewed without checking whether they reflect the ticket requirements, the circular testing problem returns. The enforcement mechanism is code review - reviewers must check that acceptance test assertions trace to requirements.

Under-investing in acceptance tests while over-investing in unit tests. Because AI makes unit test generation easy, the unit test layer can become disproportionately large. A codebase with 90% unit test coverage and 10% acceptance test coverage may have high numbers but low confidence in end-to-end behavior. Balance matters.

Treating acceptance tests as end-to-end tests. Acceptance tests verify business behavior - but they don't have to run against the full stack. An acceptance test for a discount calculation can test the discount service in isolation while still verifying the business rule. Conflating acceptance tests with slow, brittle end-to-end tests leads teams to write fewer of them.
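
A minimal sketch of such an isolated acceptance test follows. The business rule (Gold 10%, Silver 5%), the DiscountService class, and the scenario amounts are invented for illustration; amounts are kept in cents to avoid floating-point rounding.

```typescript
// Hypothetical business rule: Gold customers get 10% off, Silver 5%, everyone else nothing.
type Tier = "gold" | "silver" | "standard";

class DiscountService {
  private static readonly ratePercent: Record<Tier, number> = {
    gold: 10,
    silver: 5,
    standard: 0,
  };

  // Returns the amount to charge, in cents, after the tier discount.
  discountedTotalCents(tier: Tier, totalCents: number): number {
    const discount = Math.round((totalCents * DiscountService.ratePercent[tier]) / 100);
    return totalCents - discount;
  }
}

// Acceptance scenario, stated in business terms:
// "Given a Gold-tier customer with a $200.00 cart, the customer pays $180.00."
// No HTTP layer, database, or UI is involved; only the service that owns the rule.
const service = new DiscountService();
const goldScenarioPasses = service.discountedTotalCents("gold", 20000) === 18000;
const silverScenarioPasses = service.discountedTotalCents("silver", 20000) === 19000;
const standardScenarioPasses = service.discountedTotalCents("standard", 20000) === 20000;
```

What makes these acceptance tests is the source of the expected values (the business rule), not the number of components they exercise.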

No feedback loop on AI-generated test quality. AI-generated unit tests should be reviewed like any other code. Developers who rubber-stamp AI test output without reading it miss generated assertions that are trivially true, tests that don't actually cover the code path they appear to cover, or mock setups that defeat the purpose of the test.
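
Two of these failure modes are easy to demonstrate. In the sketch below, the coupon functions and the three checks are hypothetical: a trivially true assertion and a test that never reaches the path it claims to cover both pass against a broken implementation, while a meaningful assertion catches it.

```typescript
type CouponFn = (totalCents: number, code: string) => number;

// Hypothetical implementations: one correct, one that silently ignores coupons.
const correctImpl: CouponFn = (total, code) => (code === "SAVE10" ? total - 1000 : total);
const brokenImpl: CouponFn = (total, _code) => total;

// Trivially true: compares the output to itself, so it passes for any implementation.
const trivialTest = (f: CouponFn): boolean => f(5000, "SAVE10") === f(5000, "SAVE10");

// Misses the path: meant to cover the coupon branch, but the empty code never triggers it.
const wrongPathTest = (f: CouponFn): boolean => f(5000, "") === 5000;

// Meaningful: exercises the coupon branch and states the expected amount.
const meaningfulTest = (f: CouponFn): boolean => f(5000, "SAVE10") === 4000;

const againstBroken = {
  trivial: trivialTest(brokenImpl),       // true: the bug goes unnoticed
  wrongPath: wrongPathTest(brokenImpl),   // true: the bug goes unnoticed
  meaningful: meaningfulTest(brokenImpl), // false: the bug is caught
};
```

A reviewer reading only the test names and the green checkmarks would not see the difference, which is why the assertions themselves need to be read.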

How Different Roles See It

Bob, Head of Engineering

Bob wants to move the team from L1 to L2 testing practices and has been pitched the hybrid model. He's worried about the implementation cost: configuring AI tools, training the team on the new process, and maintaining the boundary between test types over time. He's not sure it's worth the setup cost given the team's existing backlog.

What Bob should do: Bob should pilot the hybrid model on one team for one sprint before rolling it out broadly. The key metrics to track: total test coverage before and after, time spent on test-writing tasks, and any review comments about test quality. If one team's coverage climbs from 40% to 65% in a sprint without significant time investment, that's the business case. The ongoing cost of maintaining the boundary is lower than the ongoing cost of L1 testing debt and the inevitable production incidents it produces.

Sarah, Productivity Lead

Sarah needs to demonstrate that the hybrid model provides better value than the L1 baseline. Coverage numbers are one metric, but she wants to show that the tests are actually catching bugs - not just providing coverage.

What Sarah should do: The metric Sarah needs is bug escape rate - the number of bugs that reach production vs. bugs caught in testing. As acceptance tests grow from requirements, they should catch more pre-production bugs. Track this alongside coverage: if coverage goes up and bug escape rate stays flat, the tests may be too superficial. If coverage goes up and bug escape rate drops, the hybrid model is working. Sarah should also track the ratio of acceptance tests to features shipped - ensuring that the human testing layer is growing proportionately with product scope.
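
One way to make this concrete is sketched below. The exact formula is an assumption (escaped bugs divided by all bugs found in the period), as are the Snapshot shape, the interpretTrend helper, and the sample numbers; "stays flat" is treated here as "did not drop".

```typescript
// Assumed definition: share of all bugs found in a period that reached production.
function bugEscapeRate(escapedToProduction: number, caughtInTesting: number): number {
  const total = escapedToProduction + caughtInTesting;
  return total === 0 ? 0 : escapedToProduction / total;
}

interface Snapshot {
  coverage: number;   // e.g. 0.40 for 40%
  escapeRate: number; // output of bugEscapeRate
}

type Signal = "working" | "possibly superficial" | "inconclusive";

// Encodes the reading described above: rising coverage with a falling escape rate
// suggests the model is working; rising coverage without a drop suggests shallow tests.
function interpretTrend(before: Snapshot, after: Snapshot): Signal {
  if (after.coverage <= before.coverage) return "inconclusive";
  return after.escapeRate < before.escapeRate ? "working" : "possibly superficial";
}

// Hypothetical numbers: 8 of 20 bugs escaped before the pilot, 3 of 20 after.
const beforePilot: Snapshot = { coverage: 0.4, escapeRate: bugEscapeRate(8, 12) };
const afterPilot: Snapshot = { coverage: 0.65, escapeRate: bugEscapeRate(3, 17) };
const signal = interpretTrend(beforePilot, afterPilot); // "working"
```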

Victor, Staff Engineer - AI Champion

Victor has already adopted a version of the hybrid model informally: he writes unit tests quickly (sometimes using AI) and carefully authors scenario-based tests for complex business logic. He wants to formalize this into team-wide practice but isn't sure how to enforce the boundary without creating bureaucratic friction.

What Victor should do: Victor should write the team's test guidelines document. This should not be a heavy process doc, just a two-page reference that explains what each test type is, when to use each, and what the review expectation is. Then he should keep enforcement lightweight: a single review checklist item, "Does this PR have at least one acceptance test for each user-facing behavior?", is enough. The most important thing Victor can do is model the pattern visibly - when his PRs consistently follow the hybrid model, other engineers learn by example.
