AI tests circularly test what code DOES, not what it SHOULD do

When AI generates tests by reading your implementation, it encodes existing behavior as correct - including bugs - giving you coverage numbers that feel like safety but provide none.

·An automated test suite exists and runs
·The team writes and maintains its own tests

·Team is aware of flaky test impact (16% of dev time per Google data)
·AI-generated tests are reviewed for circular testing (testing what code does, not what it should do)

Evidence

·Coverage report from the existing test suite
·Test authorship in git history (manual, no agent attribution)

What It Is

When you ask an AI to generate tests for existing code at Level 1, it does the natural thing: it reads the implementation and writes tests that verify what the code currently does. If your calculateDiscount() function returns 15% for platinum customers when it should return 20%, the AI will write a test that asserts the result is 15%. The test passes. The bug is hidden. Your coverage number goes up. You feel safer. You aren't.

This is the circular testing problem. A test is only valuable if it can fail when something is wrong. A test that was written by observing the code and asserting its current outputs can only fail if the code changes - it cannot fail because the code was always wrong. The AI doesn't know what calculateDiscount() should return for platinum customers. It only knows what it currently returns. So it tests the current behavior, including the bug.
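The difference can be sketched in a few lines. This is a minimal illustration, not production code: `calculate_discount` is a hypothetical Python stand-in for the `calculateDiscount()` example above, the tier table is assumed, and `passes` is a small helper invented here so the two checks can be compared without a test runner.

```python
def calculate_discount(tier: str) -> int:
    """Return the discount percentage for a customer tier.

    Deliberately buggy for illustration: the requirement says platinum
    customers get 20%, but this implementation returns 15%.
    """
    rates = {"standard": 0, "gold": 10, "platinum": 15}  # bug: should be 20
    return rates.get(tier, 0)


def check_platinum_discount_circular() -> None:
    # Expected value copied from what the code currently returns.
    # It passes today and keeps passing for as long as the bug exists.
    assert calculate_discount("platinum") == 15


def check_platinum_discount_from_spec() -> None:
    # Expected value taken from the requirement, not from the code.
    # It fails against the buggy implementation, which is the point.
    assert calculate_discount("platinum") == 20


def passes(check) -> bool:
    """Run a check function and report whether its assertion held."""
    try:
        check()
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    print("circular check passes:", passes(check_platinum_discount_circular))
    print("spec-derived check passes:", passes(check_platinum_discount_from_spec))
```

Both checks execute exactly the same line of the implementation, so they contribute identical coverage. Only the one whose expected value comes from the requirement can expose the bug.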

At L1, this is the dominant mode of AI-assisted test generation because it's the easiest workflow: write code, ask AI to write tests, get coverage. The result is a test suite that looks healthy but is structurally circular. It guards against regressions from future changes, but it provides zero verification that the original implementation was correct. Every bug present at the time the tests were generated is permanently enshrined in the test suite as "expected behavior."

The fix requires separating the source of truth for test expectations from the implementation. Acceptance tests derived from requirements (tickets, specifications, user stories) cannot be circular because their expected values come from a statement of intended behavior, not from the code's current output. This is why the maturity matrix progression moves toward requirements-derived tests (L2-L3) and eventually toward autonomous requirements generation (L3).

Why It Matters

Circular tests are worse than no tests in one specific way: they provide false confidence. A team with 0% test coverage knows they have a coverage problem. A team with 70% circular coverage believes they're protected when they're not:

Bug preservation - Every bug present when tests are generated gets encoded as expected behavior. The tests will continue passing until someone notices the behavior is wrong in production.
Refactoring trap - If you change the code to fix a bug, the circular tests will fail - not because you broke something, but because the tests asserted the wrong values. Developers learn to distrust test failures, which compounds the flakiness problem.
AI amplification - As AI generates more code faster, circular test generation at scale can create an enormous volume of tests that appear to provide coverage while actually encoding AI-generated bugs as correct.
Review blind spot - It's difficult to detect circular testing in code review because the assertions look correct - they match the implementation. You need to know what the implementation should do to catch the circularity.
False L2 metrics - Coverage targets set at L2 can be hit entirely with circular tests. If you're measuring coverage without auditing test quality, the metric is misleading.

Tip

When reviewing AI-generated tests, ask one question for every assertion: "Where does this expected value come from?" If the answer is "the current code," the test is circular. The expected value should come from a spec, a ticket, or a business rule - not from running the implementation and recording its output.

Getting Started

Recognize the pattern - Train your team to identify circular tests. The telltale sign: expected values in tests that match exactly what the current implementation produces, with no reference to a specification or business rule.
Source expected values from requirements - Before writing a function, write down what it should produce for a given input - based on the product spec, the ticket, or the business rule. Use those values as test expectations, not the output of the function itself.
Use test-first for high-risk code - For business-critical logic (pricing, permissions, financial calculations), write the test before the implementation. This structurally prevents circular testing because the test is written without knowledge of the implementation.
Ask AI for test structure, not test values - Use AI to generate the test scaffolding (test file setup, describe blocks, test names), but provide the expected values yourself. This splits AI's strength (boilerplate) from the part that requires human judgment (correctness).
Introduce acceptance test categories - Distinguish between "unit tests" (which verify implementation mechanics and may be AI-generated) and "acceptance tests" (which verify business behavior and must be human-authored or requirements-derived). Keep them in separate directories and track them separately.
Review tests with requirements in hand - When reviewing a PR that includes tests, open the original ticket alongside the test file. Ask: does the test verify what the ticket described?
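One way to combine the steps above is to keep human-supplied expected values in a small table and let generated scaffolding iterate over it. The sketch below assumes a hypothetical pricing requirement (0% standard, 10% gold, 20% platinum); `DISCOUNT_SPEC`, `find_spec_violations`, and the buggy `calculate_discount` are illustrative names, not part of any real codebase.

```python
# Expected values transcribed from the requirement (a hypothetical pricing
# ticket), written down without consulting the implementation.
DISCOUNT_SPEC = {
    "standard": 0,
    "gold": 10,
    "platinum": 20,
}


def calculate_discount(tier: str) -> int:
    """Implementation under test; it contains the platinum bug (15, not 20)."""
    rates = {"standard": 0, "gold": 10, "platinum": 15}
    return rates.get(tier, 0)


def find_spec_violations(implementation, spec):
    """Return (input, expected, actual) for every case that disagrees with the spec.

    This loop is the kind of scaffolding an AI can reasonably generate;
    the spec table it iterates over is the human-authored part.
    """
    violations = []
    for tier, expected in spec.items():
        actual = implementation(tier)
        if actual != expected:
            violations.append((tier, expected, actual))
    return violations


if __name__ == "__main__":
    for tier, expected, actual in find_spec_violations(calculate_discount, DISCOUNT_SPEC):
        print(f"{tier}: spec says {expected}%, code returns {actual}%")
```

Keeping the table separate from the scaffolding also makes review easier: a reviewer can compare the table against the ticket without reading any test mechanics.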


Common Pitfalls

Confusing code coverage with behavior coverage. Coverage tools measure whether lines of code were executed during tests - they say nothing about whether the assertions are correct. A function can have 100% line coverage with circular tests that assert the wrong values. Coverage is a measure of test reach, not test correctness.

Using AI snapshots as test oracles. Some testing workflows use AI to generate "snapshot" tests - capture the current output and assert it doesn't change. This is circular testing by design. Snapshot tests are useful for detecting unintentional changes to UI or serialization formats, but they're dangerous for business logic where the current output may be wrong.
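A snapshot check can be reduced to a few lines, which makes the circularity easy to see. This is a simplified sketch of the general idea, not the behavior of any particular snapshot library; `snapshot_matches` and the two discount functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def snapshot_matches(name: str, value, snapshot_dir: Path) -> bool:
    """Record the value on the first run; compare against the recording afterwards."""
    path = snapshot_dir / f"{name}.json"
    if not path.exists():
        path.write_text(json.dumps(value))
        return True  # the first run always "passes", whatever the value is
    return json.loads(path.read_text()) == value


def discount_before_fix(tier: str) -> int:
    return {"gold": 10, "platinum": 15}.get(tier, 0)  # platinum bug


def discount_after_fix(tier: str) -> int:
    return {"gold": 10, "platinum": 20}.get(tier, 0)  # matches the requirement


def run_demo() -> list:
    """Return the snapshot result for: first run, repeat run, run after the fix."""
    with tempfile.TemporaryDirectory() as tmp:
        snapshots = Path(tmp)
        first = snapshot_matches("platinum", discount_before_fix("platinum"), snapshots)
        repeat = snapshot_matches("platinum", discount_before_fix("platinum"), snapshots)
        after_fix = snapshot_matches("platinum", discount_after_fix("platinum"), snapshots)
        return [first, repeat, after_fix]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the buggy value is recorded and accepted, and the only run that fails is the one where the bug has been fixed. The snapshot reports change, not correctness.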

Generating tests after debugging. A common workflow: AI-generated code has a bug, you debug and fix it, then ask AI to generate tests. If the AI generates tests after the fix without being told what was wrong, it will test the fixed behavior, but it will not necessarily cover the edge case that caused the original bug. You need to explicitly write a test for the bug case, not just re-run AI generation on the fixed code.
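As an illustration, consider a hypothetical rule that orders of 100 units or more get a 5% bulk discount, where the original bug was an off-by-one comparison. Everything in this sketch (the rule, the function names, and the case lists) is assumed for the example.

```python
BULK_THRESHOLD = 100  # assumed business rule: 100 units or more qualifies


def bulk_discount_before_fix(quantity: int) -> int:
    """Original implementation: `>` wrongly excludes an order of exactly 100."""
    return 5 if quantity > BULK_THRESHOLD else 0


def bulk_discount_after_fix(quantity: int) -> int:
    """Fixed implementation: `>=` includes the boundary."""
    return 5 if quantity >= BULK_THRESHOLD else 0


# Typical cases a generator might pick from the fixed code. They pass
# both before and after the fix, so they never exercised the bug.
GENERATED_CASES = [(50, 0), (200, 5)]

# Cases written by hand from the bug report: the boundary itself.
REGRESSION_CASES = [(99, 0), (100, 5)]


def all_pass(implementation, cases) -> bool:
    """True if the implementation returns the expected value for every case."""
    return all(implementation(quantity) == expected for quantity, expected in cases)


if __name__ == "__main__":
    print("generated cases catch the old bug:",
          not all_pass(bulk_discount_before_fix, GENERATED_CASES))
    print("regression cases catch the old bug:",
          not all_pass(bulk_discount_before_fix, REGRESSION_CASES))
```

A quick check for any regression test is to run it against the pre-fix code: if it does not fail there, it is not testing the bug.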

Treating test generation as a one-time action. Circular tests become more dangerous over time. As the codebase evolves, circular tests accumulate. Periodically auditing your test suite for circularity - especially in business-critical code - is a maintenance activity, not a one-time task.


How Different Roles See It

Bob, Head of Engineering

Bob's team has been using AI to generate tests for three months. Coverage has climbed from 35% to 65% and he's been celebrating the progress with stakeholders. Then a bug reaches production: the discount calculation has been wrong for six months, and there are tests that assert the wrong values. The coverage number was real; the safety was not.

What Bob should do: This is a painful lesson, but it's a structural problem, not a blame problem. Bob needs to distinguish between "coverage" (lines executed) and "correctness coverage" (business behavior verified). The fix is process: require that acceptance tests for business logic are written from requirements, not generated from implementation. Bob should pick one critical domain (pricing, permissions, billing) and have the team audit existing tests for circularity. The goal is not to redo all tests - it's to ensure the tests that matter are non-circular.


Sarah, Productivity Lead

Sarah presented the coverage improvement (35% to 65%) to stakeholders as evidence of AI tooling value. Now a production incident has revealed that the coverage was partially circular, and her credibility is on the line. Stakeholders are asking whether they should trust the AI tooling at all.

What Sarah should do: Sarah needs a better metric than coverage percentage. The relevant metric is TORS - Test Oracle Reliability Score - which measures what fraction of test failures indicate real bugs. Coverage measures reach; TORS measures trustworthiness. Sarah should also reframe the narrative: the issue isn't AI tooling, it's that the team was using AI for the wrong part of the testing workflow. AI is excellent at test scaffolding and unit test generation from specs; it should not be setting expected values for business logic without requirements input.
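Taking the definition above at face value (the fraction of test failures that indicate real bugs), TORS can be computed from triaged failure records. The sketch below is one possible reading: the `FailureRecord` type and the manual `real_bug` triage label are assumptions, not part of an established tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureRecord:
    """One test failure, labeled during triage (hypothetical record format)."""
    test_name: str
    real_bug: bool  # True if the failure pointed at a genuine defect


def tors(failures: List[FailureRecord]) -> Optional[float]:
    """Fraction of failures that indicated real bugs; None if there were no failures."""
    if not failures:
        return None
    return sum(1 for failure in failures if failure.real_bug) / len(failures)


# Illustrative data only: four triaged failures, three of them real bugs.
SAMPLE_FAILURES = [
    FailureRecord("test_platinum_discount", real_bug=True),
    FailureRecord("test_invoice_rounding", real_bug=True),
    FailureRecord("test_checkout_timeout", real_bug=False),  # flaky
    FailureRecord("test_permission_check", real_bug=True),
]

if __name__ == "__main__":
    print(tors(SAMPLE_FAILURES))
```

Under this reading a purely circular suite is hard to score at all: it rarely fails on existing bugs, so a low failure count should not be mistaken for a high score.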


Victor, Staff Engineer - AI Champion

Victor suspected the circular testing problem months ago but didn't have language for it. He noticed that tests added by the AI always passed immediately on the first run, which he found suspicious - good tests should fail first and require working code to make them pass. He mentioned it in a code review and got pushback: "The coverage is going up, isn't it?"

What Victor should do: Victor's intuition is correct and he now has the terminology to make the case explicitly. His next step is to write a short internal document explaining the circular testing problem with a concrete example from the team's codebase - ideally the kind of calculation bug that just made it to production. Then propose a testing policy: acceptance tests for business logic must have expected values documented in a ticket or spec before the AI generates the test structure. Victor can also propose the L2 hybrid model: AI generates unit tests, humans author acceptance tests.
