Flaky tests: 16% of dev time (Google data)

Flaky tests silently consume a sixth of your engineering capacity - Google's internal research quantified the cost, and at L1 they're accepted as normal instead of eliminated.

·An automated test suite exists and runs
·The team writes and maintains its own tests

·Team is aware of flaky test impact (16% of dev time per Google data)
·AI-generated tests are reviewed for circular testing (testing what code does, not what it should do)

Evidence

·Coverage report from the existing test suite
·Test authorship in git history (manual, no agent attribution)

What It Is

A flaky test is one that non-deterministically passes or fails without any change to the code under test. The same commit, the same test, run twice in succession: one pass, one failure. At Level 1 (Ad-hoc), flaky tests are treated as background noise - annoying but inevitable. Developers learn to re-run CI when it fails, to skip tests that "always flake," and to mentally downgrade test failures from "something is broken" to "probably nothing."

Google's internal research on their test infrastructure found that 16% of developer time was consumed by flaky test overhead: investigating failures that weren't real bugs, waiting for re-runs, updating skipped or quarantined tests, and maintaining CI infrastructure buckling under unreliable signal. That is roughly one out of every six working hours gone before a single line of productive code is written.

Flakiness has many sources. Network calls in unit tests. Time-dependent assertions (expect(timestamp).toBe(Date.now())). Tests that depend on database state from previous tests. Race conditions in async code. Order-dependent test suites where one test mutates global state that another test relies on. At L1, these root causes aren't systematically addressed - they're patched individually when they become unbearable, then forgotten.
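The time-dependent assertion above is flaky because the code under test and the assertion each read the system clock separately, and the two reads can land on different milliseconds. A minimal sketch of one common remedy, passing the time source in as a parameter, follows. The names TimeSource, Receipt, and createReceipt are hypothetical and exist only for illustration.

```typescript
// Hypothetical names throughout: TimeSource, Receipt, createReceipt.
type TimeSource = () => number;

interface Receipt {
  orderId: string;
  issuedAt: number; // milliseconds since the Unix epoch
}

// The time source is a parameter: production code relies on the default
// (Date.now), while tests pass a function that always returns the same value.
function createReceipt(orderId: string, now: TimeSource = Date.now): Receipt {
  return { orderId, issuedAt: now() };
}

// Flaky comparison (for contrast): two separate clock reads may differ.
//   createReceipt("A-1").issuedAt === Date.now()

// Deterministic comparison: the test controls the time source.
const fixedTime: TimeSource = () => 1_700_000_000_000;
const receipt = createReceipt("A-1", fixedTime);
console.log(receipt.issuedAt === 1_700_000_000_000); // true
```

The same idea applies to the other sources listed: whatever varies between runs (time, network, shared state) becomes an explicit input that the test pins down.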

The compounding effect is what makes this particularly dangerous as teams move toward AI-assisted development. When flakiness is high, the CI signal is unreliable. When CI is unreliable, developers learn to dismiss failures as noise. When failures are dismissed as noise, the entire automated quality gate collapses. An AI agent operating in a high-flakiness environment has no reliable signal to know if its code changes introduced a real bug or triggered a pre-existing flaky test.

Why It Matters

Flaky tests don't just waste time - they erode the culture of testing:

Signal destruction - Every false positive teaches developers to distrust the test suite. Once distrust sets in, real failures get dismissed too.
CI inflation - Re-running pipelines on flaky failures can double or triple CI infrastructure costs for the affected builds. At scale, this is a significant budget item.
Incident risk - Developers who habitually dismiss test failures will eventually dismiss a real one. The flakiness problem directly contributes to production incidents.
Agent reliability - AI agents that iterate to fix test failures cannot distinguish between real failures and flaky noise, causing them to thrash on phantom issues.
Blocker for automation - Automated merge decisions (L4's green auto-merge) are impossible if the test suite has meaningful flakiness. A 5% suite-level flake rate means 1 in 20 CI runs fails even on correct code.
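The 1-in-20 figure can be reached with surprisingly little per-test flakiness. The sketch below shows the arithmetic under a simplifying assumption that is not from the source: every flaky test fails independently with the same probability. Real flaky tests are often correlated, so treat the result as an estimate. Both function names are hypothetical.

```typescript
// Probability that at least one flaky test fails in a single run of correct code,
// assuming `flakyTests` tests that each fail independently with `perTestFailureRate`.
function suiteFalseFailureRate(flakyTests: number, perTestFailureRate: number): number {
  return 1 - Math.pow(1 - perTestFailureRate, flakyTests);
}

// Expected number of CI runs needed to see one green run on correct code
// (mean of a geometric distribution with success probability 1 - falseFailureRate).
function expectedRunsUntilGreen(falseFailureRate: number): number {
  return 1 / (1 - falseFailureRate);
}

// Ten tests that each fail only 0.5% of the time already fail about 1 run in 20.
console.log(suiteFalseFailureRate(10, 0.005).toFixed(3)); // 0.049
```

Under these assumptions the suite-level rate grows quickly with the number of flaky tests, which is why a handful of "rarely flaky" tests is enough to block automated merging.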

Addressing flakiness is not a nice-to-have - it's a prerequisite for advancing through the maturity levels. You cannot build reliable automation on an unreliable foundation.

Tip

Track your flake rate as a metric, not just an anecdote. Run your test suite 10 times on the same commit, without any code changes. The percentage of tests that pass in some runs and fail in others is your flake rate. At L1, this number is often shocking to see quantified.
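One way to turn repeated runs into a number is sketched below. It assumes each run has already been reduced to a map from test name to pass/fail; that shape (RunResult) and the function measureFlakeRate are hypothetical, and real CI systems export results in their own formats. The sketch counts a test as flaky only if it both passed and failed on the same commit, and reports tests that failed every time separately, since those are broken rather than flaky.

```typescript
// One run of the suite: test name -> whether it passed (hypothetical shape).
type RunResult = Record<string, boolean>;

interface FlakeReport {
  flakyTests: string[];    // passed in at least one run and failed in at least one other
  alwaysFailing: string[]; // failed in every run: broken, not flaky
  flakeRate: number;       // flaky tests / all tests seen, between 0 and 1
}

function measureFlakeRate(runs: RunResult[]): FlakeReport {
  const passCount: Record<string, number> = {};
  const failCount: Record<string, number> = {};
  for (const run of runs) {
    for (const testName of Object.keys(run)) {
      const counter = run[testName] ? passCount : failCount;
      counter[testName] = (counter[testName] ?? 0) + 1;
    }
  }
  const allTests = Object.keys({ ...passCount, ...failCount }).sort();
  const flakyTests = allTests.filter(
    (t) => (passCount[t] ?? 0) > 0 && (failCount[t] ?? 0) > 0
  );
  const alwaysFailing = allTests.filter((t) => (passCount[t] ?? 0) === 0);
  const flakeRate = allTests.length === 0 ? 0 : flakyTests.length / allTests.length;
  return { flakyTests, alwaysFailing, flakeRate };
}

// Three runs of the same commit with made-up results.
const flakeReport = measureFlakeRate([
  { login: true, checkout: true, search: false },
  { login: true, checkout: false, search: false },
  { login: true, checkout: true, search: false },
]);
console.log(flakeReport.flakyTests.join(", "));    // checkout
console.log(flakeReport.alwaysFailing.join(", ")); // search
```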

Getting Started

Measure your flake rate - Before fixing anything, measure the baseline. Run your test suite 5-10 times on the same commit and catalog which tests produce inconsistent results. Flakiness that was "occasional" often turns out to be systematic when measured.
Categorize by root cause - Group your flaky tests into buckets: timing issues, network dependencies, state pollution, order dependencies, resource contention. Different root causes require different fixes, and categorization reveals where to invest first.
Stop accepting re-runs as a solution - Re-running CI on failure is a symptom masking a problem. Every time a developer clicks "retry" on a CI failure, they should file a ticket. The ticket backlog makes the problem visible and prioritizable.
Fix or quarantine - For each flaky test, make a binary decision: fix it this week, or move it to a quarantine suite that runs separately and doesn't block merges. Quarantine is not a permanent state - it's a holding area with an SLA. (This becomes the L2 flaky test quarantine process.)
Add determinism to the most common sources - Mock external calls in unit tests, use fixed timestamps in time-sensitive tests, reset database state between tests, and avoid shared mutable state. These fixes resolve the majority of flakiness in most codebases.
Track flake rate over time - Once you're measuring, add the flake rate to your engineering metrics dashboard. A declining flake rate is one of the clearest signals that test quality is improving.
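The categorize and quarantine steps above can be backed by a very small data structure. The sketch below is one possible shape, not a prescribed process: the QuarantineEntry fields, the ticket ids, and the 14-day SLA are all assumptions made for illustration. The root-cause values mirror the buckets named in the list.

```typescript
// Hypothetical record for a quarantined test; field names are illustrative.
type RootCause =
  | "timing"
  | "network"
  | "state-pollution"
  | "order-dependency"
  | "resource-contention";

interface QuarantineEntry {
  testName: string;
  rootCause: RootCause;
  quarantinedOn: string; // ISO date, e.g. "2024-03-01"
  ticket: string;        // tracking ticket id (made up here)
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Entries that have sat in quarantine longer than the SLA allows.
// `today` is a parameter rather than a clock read, so the check is deterministic.
function overdueEntries(entries: QuarantineEntry[], today: string, slaDays: number): QuarantineEntry[] {
  const now = Date.parse(today);
  return entries.filter((e) => (now - Date.parse(e.quarantinedOn)) / MS_PER_DAY > slaDays);
}

// Count of quarantined tests per root cause, to show where to invest first.
function countByRootCause(entries: QuarantineEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of entries) {
    counts[e.rootCause] = (counts[e.rootCause] ?? 0) + 1;
  }
  return counts;
}

const quarantine: QuarantineEntry[] = [
  { testName: "checkout retries payment", rootCause: "network", quarantinedOn: "2024-03-01", ticket: "FLAKY-101" },
  { testName: "session expires after idle", rootCause: "timing", quarantinedOn: "2024-03-20", ticket: "FLAKY-102" },
];
const overdue = overdueEntries(quarantine, "2024-03-25", 14);
console.log(overdue.map((e) => e.ticket).join(", ")); // FLAKY-101
```

Running a check like this in CI, and failing or alerting when anything is overdue, is one way to keep quarantine from becoming a permanent state.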

Common Pitfalls

Normalizing flakiness. The most dangerous response to flaky tests is cultural acceptance. When "just re-run it" becomes standard advice, the team has implicitly decided that unreliable CI is acceptable. This is a trap: once the culture normalizes flakiness, it becomes nearly impossible to prioritize fixing it because "it's always been this way."

Fixing symptoms instead of root causes. The instinctive fix for a flaky test is adding a sleep(1000) or wrapping it in a retry loop. These fixes make the test fail less often without making it deterministic - they just add slack to cover up race conditions or timing dependencies, and they slow the suite down. Proper fixes address the root cause: eliminate the non-determinism, not the symptom.
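As an illustration of removing the timing dependency rather than padding it, the sketch below routes a timer through an injected scheduler so that a test advances time explicitly instead of sleeping. All names here (TimerScheduler, ManualTimerScheduler, ExpiringFlag) are hypothetical. Many test frameworks ship their own fake timers (Jest has jest.useFakeTimers, for example); this hand-rolled version only shows the principle.

```typescript
// Hypothetical minimal interface so code under test never touches real timers.
interface TimerScheduler {
  setTimeout(callback: () => void, delayMs: number): void;
}

// Example code under test: a flag that switches off after ttlMs.
class ExpiringFlag {
  private active = true;
  constructor(timers: TimerScheduler, ttlMs: number) {
    timers.setTimeout(() => {
      this.active = false;
    }, ttlMs);
  }
  isActive(): boolean {
    return this.active;
  }
}

// Test double: time moves only when the test calls advance().
class ManualTimerScheduler implements TimerScheduler {
  private nowMs = 0;
  private pending: { dueAt: number; callback: () => void }[] = [];

  setTimeout(callback: () => void, delayMs: number): void {
    this.pending.push({ dueAt: this.nowMs + delayMs, callback });
  }

  advance(ms: number): void {
    this.nowMs += ms;
    const due = this.pending.filter((t) => t.dueAt <= this.nowMs);
    this.pending = this.pending.filter((t) => t.dueAt > this.nowMs);
    // Run due callbacks in the order they were scheduled to fire.
    due.sort((a, b) => a.dueAt - b.dueAt).forEach((t) => t.callback());
  }
}

// No sleep(1000) and no retry loop: the boundary is checked exactly.
const fakeTimers = new ManualTimerScheduler();
const flag = new ExpiringFlag(fakeTimers, 1000);
fakeTimers.advance(999);
console.log(flag.isActive()); // true
fakeTimers.advance(1);
console.log(flag.isActive()); // false
```

The test now runs instantly and gives the same answer on a loaded CI machine as on a laptop, which a sleep-based version cannot guarantee.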

Treating all flaky tests equally. A flaky test in a utility function and a flaky test in the authentication flow have very different risk profiles. Prioritize quarantine and fixing based on the criticality of the code path, not alphabetical order or most recent failure.

Ignoring infrastructure flakiness. Some flakiness is in the tests; some is in the CI infrastructure itself - resource contention, network timeouts, container startup timing. These are harder to fix because they're not visible in the code. If tests pass locally but flake in CI, the infrastructure is the suspect.

How Different Roles See It

Bob, Head of Engineering

Bob's team spends about 20% of their time dealing with CI failures. He's noticed that developers have started merging PRs after one passing run, even when previous runs failed, because "it's probably just flaky." He doesn't know if their latest deploy caused a regression or if CI is just being unreliable again.

What Bob should do: This is the flakiness trust erosion problem, and it's a leadership issue before it's a technical one. Bob needs to make flakiness visible as a metric and declare it unacceptable, not just annoying. The practical first step: mandate that every test failure gets a ticket before re-running CI. Within two weeks, the ticket volume will make clear which tests are the worst offenders. Then assign one engineer per sprint to work the flakiness backlog. The ROI is immediate - every hour spent fixing flaky tests recovers multiple hours of wasted developer time.

Sarah, Productivity Lead

Sarah is tracking developer satisfaction scores and sees a recurring complaint: "CI is unreliable and we waste too much time on it." She wants to address it but doesn't know how to put a dollar figure on the problem to justify prioritizing it.

What Sarah should do: The Google 16% figure is the business case. For a team of 50 engineers at an average fully-loaded cost of $200k/year, 16% of time wasted on flakiness is $1.6M/year in lost productivity (50 × $200k × 0.16). Even if your team loses only half the share of time Google reported, that's $800k/year. The fix - a few weeks of a dedicated engineer systematically quarantining and addressing root causes - costs a fraction of that. Sarah should present this calculation to stakeholders as a high-ROI investment in engineering infrastructure, not a cost center.
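Sarah's calculation is easy to parameterize for a different team size or cost basis. The sketch below reproduces the figures from the paragraph above; the function names are hypothetical, and the four-week remediation effort is an assumed stand-in for "a few weeks", not a number from the source.

```typescript
// Annual cost of time lost to flakiness.
function annualFlakinessCost(
  engineers: number,
  fullyLoadedCostPerYear: number,
  fractionOfTimeLost: number
): number {
  return engineers * fullyLoadedCostPerYear * fractionOfTimeLost;
}

// Cost of one dedicated engineer working on flakiness for a number of weeks.
function remediationCost(fullyLoadedCostPerYear: number, weeks: number): number {
  return (fullyLoadedCostPerYear * weeks) / 52;
}

// Figures from the text: 50 engineers, $200k/year each, 16% of time lost.
console.log(Math.round(annualFlakinessCost(50, 200_000, 0.16))); // 1600000
// Half that share of lost time.
console.log(Math.round(annualFlakinessCost(50, 200_000, 0.08))); // 800000
// Assumption: four weeks of one engineer's time.
console.log(Math.round(remediationCost(200_000, 4))); // 15385
```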

Victor, Staff Engineer - AI Champion

Victor has been fixing flaky tests on his own for years, quietly adding @Flaky annotations and retry logic across the codebase. He's frustrated that the problem keeps growing faster than he can fix it, and no one seems to care because the builds still pass (eventually).

What Victor should do: Victor's individual heroics are insufficient at scale - and his retry annotations are hiding the problem rather than fixing it. His most impactful move is to shift from fixing to measuring and to stop masking flakiness with retry logic. Victor should instrument the CI pipeline to track and report the flake rate explicitly, then bring that metric to Bob. Once the scope of the problem is visible in numbers, it becomes a planning priority instead of a Victor problem. Victor's second move is to write the quarantine process (L2) so that flaky tests have a structured path to resolution rather than informal annotation.
