Maturity Matrix

Flaky tests: 16% of dev time (Google data)

Flaky tests silently consume a sixth of your engineering capacity - Google's internal research quantified the cost, and at L1 they're accepted as normal instead of eliminated.

  • Test suite exists but coverage is below 40%
  • Tests are written manually by developers
  • Team is aware of flaky test impact (16% of dev time per Google data)
  • AI-generated tests are reviewed for circular testing (testing what code does, not what it should do)

Evidence

  • Coverage report showing sub-40% line coverage
  • Test authorship in git history (manual, no agent attribution)

What It Is

A flaky test is one that non-deterministically passes or fails without any change to the code under test. The same commit, the same test, run twice in succession: one pass, one failure. At Level 1 (Ad-hoc), flaky tests are treated as background noise - annoying but inevitable. Developers learn to re-run CI when it fails, to skip tests that "always flake," and to mentally downgrade test failures from "something is broken" to "probably nothing."

Google's internal research on their test infrastructure found that 16% of developer time was consumed by flaky test overhead: investigating failures that weren't real bugs, waiting for re-runs, updating skipped or quarantined tests, and maintaining CI infrastructure buckling under unreliable signal. This number is staggering - roughly one out of every six working hours, evaporated, before a single line of productive code is written.
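To make that share concrete for a given team, a back-of-the-envelope sketch may help. It simply applies the 16% figure cited above to a team's weekly hours. The function name, the team size, and the hourly rate are hypothetical placeholders, and the share itself should be replaced with your own measurement where you have one.

```python
def flaky_overhead(engineers: int,
                   hours_per_week: float = 40.0,
                   overhead_share: float = 0.16,
                   hourly_rate: float = 0.0) -> dict:
    """Estimate weekly hours (and optional cost) lost to flaky-test overhead.

    overhead_share defaults to the 16% figure cited in the text; every other
    input is an assumption the caller must supply.
    """
    if not 0.0 <= overhead_share <= 1.0:
        raise ValueError("overhead_share must be between 0 and 1")
    hours_lost = engineers * hours_per_week * overhead_share
    return {"hours_lost": hours_lost, "cost": hours_lost * hourly_rate}


if __name__ == "__main__":
    # Hypothetical team of 25 engineers at a made-up loaded rate of 100/hour.
    estimate = flaky_overhead(engineers=25, hourly_rate=100.0)
    print(f"{estimate['hours_lost']:.0f} hours/week, cost {estimate['cost']:.0f}/week")
```

The result is only as good as the share you plug in, so treat it as a way to frame the conversation, not as a measurement.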

Flakiness has many sources. Network calls in unit tests. Time-dependent assertions (expect(timestamp).toBe(Date.now())). Tests that depend on database state from previous tests. Race conditions in async code. Order-dependent test suites where one test mutates global state that another test relies on. At L1, these root causes aren't systematically addressed - they're patched individually when they become unbearable, then forgotten.
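As an illustration of one of these sources, here is a minimal sketch of how a time-dependent assertion can be made deterministic by injecting the clock. It is written in Python, not the Jest style of the inline example, and `make_token`, its fields, and the 15-minute lifetime are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

Clock = Callable[[], datetime]


def system_clock() -> datetime:
    return datetime.now(timezone.utc)


def make_token(clock: Clock = system_clock,
               ttl: timedelta = timedelta(minutes=15)) -> dict:
    """Hypothetical code under test: stamps a token using the supplied clock."""
    issued_at = clock()
    return {"issued_at": issued_at, "expires_at": issued_at + ttl}


# Flaky version (do not copy): it compares against a second clock reading,
# so it fails whenever time advances between the two calls.
#
#     assert make_token()["issued_at"] == datetime.now(timezone.utc)


def test_token_expiry_is_deterministic() -> None:
    # A fixed instant replaces the real clock, so the result never varies.
    fixed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    token = make_token(clock=lambda: fixed)
    assert token["issued_at"] == fixed
    assert token["expires_at"] == fixed + timedelta(minutes=15)


if __name__ == "__main__":
    test_token_expiry_is_deterministic()
    print("deterministic test passed")
```

The same idea applies to the other sources listed: replace the uncontrolled input (network, shared database state, global variables) with one the test fully controls.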

The compounding effect is what makes this particularly dangerous as teams move toward AI-assisted development. When flakiness is high, the CI signal is unreliable. When CI is unreliable, developers learn to dismiss failures as noise. When failures are dismissed as noise, the entire automated quality gate collapses. An AI agent operating in a high-flakiness environment has no reliable signal to know if its code changes introduced a real bug or triggered a pre-existing flaky test.

Why It Matters

Flaky tests don't just waste time - they erode the culture of testing:

  • Signal destruction - Every false positive teaches developers to distrust the test suite. Once distrust sets in, real failures get dismissed too.
  • CI inflation - Re-running pipelines on flaky failures can double or triple CI infrastructure costs. At scale, this is a significant budget item.
  • Incident risk - Developers who habitually dismiss test failures will eventually dismiss a real one. The flakiness problem directly contributes to production incidents.
  • Agent reliability - AI agents that iterate to fix test failures cannot distinguish between real failures and flaky noise, causing them to thrash on phantom issues.
  • Blocker for automation - Automated merge decisions (L4's green auto-merge) are impossible if the test suite has meaningful flakiness. If 5% of CI runs hit a flaky failure, 1 in 20 runs fails even on correct code.

Addressing flakiness is not a nice-to-have - it's a prerequisite for advancing through the maturity levels. You cannot build reliable automation on an unreliable foundation.
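The arithmetic behind the automation blocker can be sketched as follows. The sketch assumes each flaky test fails independently with the same probability, which is a simplification, and the function names are illustrative only. It shows how small per-test flake rates compound into a sizeable chance that a run on correct code goes red, and how many runs are needed on average before one goes green.

```python
def p_spurious_failure(per_test_flake_rate: float, flaky_tests: int) -> float:
    """Probability that at least one flaky test fails in a single CI run.

    Assumes independent failures with identical rates (a simplification).
    """
    if not 0.0 <= per_test_flake_rate <= 1.0:
        raise ValueError("per_test_flake_rate must be between 0 and 1")
    return 1.0 - (1.0 - per_test_flake_rate) ** flaky_tests


def expected_runs_until_green(p_run_fails: float) -> float:
    """Mean number of runs until the first green one (geometric distribution)."""
    if not 0.0 <= p_run_fails < 1.0:
        raise ValueError("p_run_fails must be in [0, 1)")
    return 1.0 / (1.0 - p_run_fails)


if __name__ == "__main__":
    # Ten tests that each flake only 0.5% of the time already push the
    # per-run failure chance close to the 5% level discussed above.
    p = p_spurious_failure(0.005, 10)
    print(f"chance a correct commit goes red: {p:.3f}")
    print(f"average runs until green: {expected_runs_until_green(p):.3f}")
```

The takeaway is that a suite does not need any single badly behaved test to become unreliable. Many slightly flaky tests have the same effect.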

Tip

Track your flake rate as a metric, not just an anecdote. Run your test suite 10 times without any code changes. The share of distinct tests that fail in some runs but pass in others is your flake rate. At L1, this number is often shocking to see quantified.
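One way to turn the repeated-run measurement into a number is sketched below. It assumes you can export each run as a mapping from test name to pass/fail. That input shape and the function names are assumptions, not the output format of any particular test runner. A test counts as flaky only if it both passed and failed on identical code. A test that fails every time is broken, not flaky.

```python
from typing import Dict, List, Set


def find_flaky_tests(runs: List[Dict[str, bool]]) -> Set[str]:
    """Return tests that both passed and failed across runs of the same commit."""
    outcomes: Dict[str, Set[bool]] = {}
    for run in runs:
        for name, passed in run.items():
            outcomes.setdefault(name, set()).add(passed)
    return {name for name, seen in outcomes.items() if seen == {True, False}}


def flake_rate(runs: List[Dict[str, bool]]) -> float:
    """Fraction of distinct tests that showed inconsistent results."""
    all_tests = {name for run in runs for name in run}
    if not all_tests:
        return 0.0
    return len(find_flaky_tests(runs)) / len(all_tests)


if __name__ == "__main__":
    # Hypothetical results from three runs of the same commit.
    runs = [
        {"test_login": True, "test_checkout": True, "test_report": False},
        {"test_login": True, "test_checkout": False, "test_report": False},
        {"test_login": True, "test_checkout": True, "test_report": False},
    ]
    print(sorted(find_flaky_tests(runs)))
    print(f"flake rate: {flake_rate(runs):.1%}")
```

Recording this number per week gives you a trend line, which is more persuasive than a list of individual complaints.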

How Different Roles See It

Bob, Head of Engineering

Bob's team spends about 20% of their time dealing with CI failures. He's noticed that developers have started merging PRs after one passing run, even when previous runs failed, because "it's probably just flaky." He doesn't know if their latest deploy caused a regression or if CI is just being unreliable again.

Sarah, Productivity Lead

Sarah is tracking developer satisfaction scores and sees a recurring complaint: "CI is unreliable and we waste too much time on it." She wants to address it but doesn't know how to put a dollar figure on the problem to justify prioritizing it.

Victor, Staff Engineer - AI Champion

Victor has been fixing flaky tests on his own for years, quietly adding @Flaky annotations and retry logic across the codebase. He's frustrated that the problem keeps growing faster than he can fix it, and no one seems to care because the builds still pass (eventually).
