AI tests circularly test what code DOES, not what it SHOULD do
When AI generates tests by reading your implementation, it encodes existing behavior as correct - including bugs - giving you coverage numbers that feel like safety but provide none.
- Test suite exists but coverage is below 40%
- Tests are written manually by developers
- Team is aware of flaky test impact (16% of dev time per Google data)
- AI-generated tests are reviewed for circular testing (testing what code does, not what it should do)
Evidence
- Coverage report showing sub-40% line coverage
- Test authorship in git history (manual, no agent attribution)
What It Is
When you ask an AI to generate tests for existing code at Level 1, it does the natural thing: it reads the implementation and writes tests that verify what the code currently does. If your calculateDiscount() function returns 15% for platinum customers when it should return 20%, the AI will write a test that asserts the result is 15%. The test passes. The bug is hidden. Your coverage number goes up. You feel safer. You aren't.
This is the circular testing problem. A test is only valuable if it can fail when something is wrong. A test that was written by observing the code and asserting its current outputs can only fail if the code changes - it cannot fail because the code was always wrong. The AI doesn't know what calculateDiscount() should return for platinum customers. It only knows what it currently returns. So it tests the current behavior, including the bug.
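The loop described above can be made concrete with a minimal sketch. Everything here is illustrative: `calculate_discount` is a hypothetical Python rendering of calculateDiscount(), the standard and gold tiers and their rates are invented for the example, and `record_current_behavior` only mimics what implementation-derived test generation amounts to. The one detail taken from the text is the platinum rate of 15% where 20% was intended.

```python
def calculate_discount(tier: str) -> float:
    """Hypothetical implementation containing the bug from the text."""
    rates = {
        "standard": 0.0,   # assumed tier, for illustration only
        "gold": 0.10,      # assumed tier, for illustration only
        "platinum": 0.15,  # bug: the intended rate is 0.20
    }
    return rates[tier]


def record_current_behavior(tiers):
    """Mimic implementation-derived test generation: run the code and
    treat whatever it returns as the expected value."""
    return {tier: calculate_discount(tier) for tier in tiers}


def run_circular_tests(expected):
    """Return the tiers whose current output differs from the recorded one."""
    return [tier for tier, want in expected.items()
            if calculate_discount(tier) != want]


expected = record_current_behavior(["standard", "gold", "platinum"])
failures = run_circular_tests(expected)

print(expected["platinum"])  # 0.15: the buggy value is now "expected"
print(failures)              # []: every test passes, bug included
```

Because the expected values and the actual values come from the same function, the comparison can only fail after the function changes. It has no way to fail on the day it is written.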
At L1, this is the dominant mode of AI-assisted test generation because it's the easiest workflow: write code, ask AI to write tests, get coverage. The result is a test suite that looks healthy but is structurally circular. It guards against regressions from future changes, but it provides zero verification that the original implementation was correct. Every bug present at the time the tests were generated is permanently enshrined in the test suite as "expected behavior."
The fix requires separating the source of truth for test expectations from the implementation. Acceptance tests derived from requirements (tickets, specifications, user stories) are not circular because their expected values come from a source that is independent of the implementation, ideally one written before the implementation existed. This is why the maturity matrix progression moves toward requirements-derived tests (L2-L3) and eventually toward autonomous requirements generation (L3).
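A minimal sketch of what a requirements-derived check could look like, assuming a hypothetical `SPEC` table transcribed from a ticket or specification rather than from the code. The names `SPEC` and `check_against_spec` are inventions for this example, and `calculate_discount` stands in for the buggy calculateDiscount() described earlier.

```python
# Expected values transcribed from a (hypothetical) requirements document.
# Nothing in this table was produced by running the implementation.
SPEC = {
    "platinum": 0.20,  # the rate the text says platinum customers should get
}


def calculate_discount(tier: str) -> float:
    """Hypothetical implementation with the bug from the text."""
    rates = {"platinum": 0.15}  # bug: should be 0.20
    return rates[tier]


def check_against_spec(spec, implementation):
    """Compare the implementation to the spec and return every mismatch
    as a (tier, required, actual) tuple."""
    mismatches = []
    for tier, required in spec.items():
        actual = implementation(tier)
        if abs(actual - required) > 1e-9:
            mismatches.append((tier, required, actual))
    return mismatches


mismatches = check_against_spec(SPEC, calculate_discount)
print(mismatches)  # [('platinum', 0.2, 0.15)]
```

The check fails on first run, which is the point: the expectation has an independent origin, so it can disagree with the code.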
Why It Matters
Circular tests are worse than no tests in one specific way: they provide false confidence. A team with 0% test coverage knows they have a coverage problem. A team with 70% circular coverage believes they're protected when they're not:
- Bug preservation - Every bug present when tests are generated gets encoded as expected behavior. The tests will continue passing until someone notices the behavior is wrong in production.
- Refactoring trap - If you refactor the code to fix a bug, the circular tests will fail - not because you broke something, but because the tests were wrong. Developers learn to distrust test failures, which compounds the flakiness problem.
- AI amplification - As AI generates more code faster, circular test generation at scale can create an enormous volume of tests that appear to provide coverage while actually encoding AI-generated bugs as correct.
- Review blind spot - It's difficult to detect circular testing in code review because the assertions look correct - they match the implementation. You need to know what the implementation should do to catch the circularity.
- False L2 metrics - Coverage targets set at L2 can be hit entirely with circular tests. If you're measuring coverage without auditing test quality, the metric is misleading.
When reviewing AI-generated tests, ask one question for every assertion: "Where does this expected value come from?" If the answer is "the current code," the test is circular. The expected value should come from a spec, a ticket, or a business rule - not from running the implementation and recording its output.
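One way to make that review question mechanical is to record where each expected value came from and flag anything without an independent origin. The sketch below is an assumption about how a team might do this, not an existing tool: the `Assertion` record, the `source` labels, and `find_circular` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    name: str
    expected: float
    source: str  # where the expected value came from (hypothetical labels)


# Origins that are independent of the implementation, per the text:
# a spec, a ticket, or a business rule.
INDEPENDENT_SOURCES = {"spec", "ticket", "business_rule"}


def find_circular(assertions):
    """Return the names of assertions whose expected value does not come
    from an independent source."""
    return [a.name for a in assertions if a.source not in INDEPENDENT_SOURCES]


assertions = [
    Assertion("platinum_discount_is_15_percent", 0.15, "current_code"),
    Assertion("platinum_discount_is_20_percent", 0.20, "spec"),
]

circular = find_circular(assertions)
print(circular)  # ['platinum_discount_is_15_percent']
```

The labels still have to be filled in honestly by a person who knows the requirement. The sketch only makes a missing or implementation-derived origin visible in review.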
How Different Roles See It
Bob's team has been using AI to generate tests for three months. Coverage has climbed from 35% to 65%, and he has been celebrating the progress with stakeholders. Then a bug reaches production: the discount calculation has been wrong for six months, and there are tests that assert the wrong values. The coverage number was real; the safety was not.
Sarah presented the coverage improvement (35% to 65%) to stakeholders as evidence of AI tooling value. Now a production incident has revealed that the coverage was partially circular, and her credibility is on the line. Stakeholders are asking whether they should trust the AI tooling at all.
Victor suspected the circular testing problem months ago but didn't have language for it. He noticed that tests added by the AI always passed immediately on the first run, which he found suspicious - good tests should fail first and require working code to make them pass. He mentioned it in a code review and got pushback: "The coverage is going up, isn't it?"