Maturity Matrix

Flaky test quarantine

A systematic L2 process for removing flaky tests from the main CI signal without deleting them - creating accountability to fix them while keeping builds reliable.

  • Agents generate unit tests; humans write acceptance tests
  • Flaky test quarantine process is active (flaky tests are isolated, not deleted)
  • Test oracle stabilization is underway (deterministic expected values for AI-generated tests)
  • Flaky test count is tracked and reported weekly
  • Quarantined tests have a resolution SLA (e.g., fix or delete within 30 days)

Evidence

  • Test files with agent attribution alongside human-authored acceptance tests
  • Quarantine list or label in test framework configuration
  • Flaky test tracking dashboard or issue tracker labels

What It Is

Flaky test quarantine is a structured process for handling tests that fail intermittently: when a test is identified as flaky (fails without code changes, or fails on some CI runs but passes on others), it is immediately moved to a quarantine suite that runs separately from the main CI pipeline. Quarantined tests don't block merges, don't trigger alerts, and don't pollute the CI signal. But they are tracked, reported on a schedule, and assigned to engineers for resolution.
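The definition above (a test that both passes and fails with no code change) can be checked mechanically against CI history. The following is a minimal sketch, assuming a hypothetical export of CI results as (test name, commit, passed) records; `RunRecord` and `find_flaky_tests` are illustrative names, not part of any CI tool.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RunRecord(NamedTuple):
    """One CI execution of one test (hypothetical record format)."""
    test: str
    commit: str
    passed: bool


def find_flaky_tests(runs: Iterable[RunRecord]) -> set[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for run in runs:
        outcomes[(run.test, run.commit)].add(run.passed)
    # Two distinct outcomes for identical code means the test is flaky.
    return {test for (test, _), seen in outcomes.items() if len(seen) == 2}


history = [
    RunRecord("test_login", "abc123", True),
    RunRecord("test_login", "abc123", False),    # same commit, different result
    RunRecord("test_checkout", "abc123", False),
    RunRecord("test_checkout", "def456", True),  # result changed with the code
]
flaky = find_flaky_tests(history)
```

Here `test_checkout` is not flagged, because its outcome changed only when the commit changed. That may be a real fix rather than flakiness.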

At Level 1, the response to flaky tests is ad hoc: developers retry CI runs, add @Ignore annotations, or just accept intermittent failures as background noise. There's no process, no ownership, and no resolution timeline. The flakiness accumulates until it consumes a significant fraction of engineering time (Google has reported that almost 16% of its tests show some level of flakiness).

Quarantine is the L2 answer. It solves two problems simultaneously. First, it removes flaky tests from the primary CI signal, making the build reliable again - a passing build now means something. Second, it creates explicit accountability: quarantined tests are not "ignored" tests. They are tests on a repair schedule. The quarantine suite runs nightly and reports to a dashboard. Engineers are assigned to fix them. The quarantine is a holding area, not a permanent state.

The key distinction is between quarantine and deletion. Teams that simply delete flaky tests lose the coverage they provided. Teams that quarantine them preserve the intent (this behavior should be tested) while temporarily suspending the signal (this specific test is unreliable) until the root cause is fixed.
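One way to picture the difference between quarantine and deletion is a registry that routes tests to one of two suites while keeping every test in the codebase. This is a sketch under stated assumptions: the `QuarantineEntry` fields, the test names, and the owner are all hypothetical, and a real setup would more likely use the test framework's own tags or labels.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineEntry:
    """Hypothetical registry record: the test stays in the codebase."""
    test: str
    owner: str
    quarantined_on: date
    reason: str


def split_suites(all_tests, registry):
    """Partition tests into (main, quarantine) without dropping any."""
    quarantined = {entry.test for entry in registry}
    main = [t for t in all_tests if t not in quarantined]
    held = [t for t in all_tests if t in quarantined]
    return main, held


registry = [
    QuarantineEntry("test_upload_retry", "alice", date(2024, 3, 1),
                    "times out intermittently on CI"),
]
all_tests = ["test_login", "test_upload_retry", "test_checkout"]
main_suite, quarantine_suite = split_suites(all_tests, registry)
```

The main suite gates merges; the quarantine suite still runs, but on its own schedule. Each entry carries an owner and a date, which is what separates it from an ignored test.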

Why It Matters

Flaky test quarantine is the bridge between the chaos of L1 and the reliable test infrastructure required at L3-L5:

  • Restores CI trust - Once developers can trust that a red build means a real failure, they stop dismissing failures as noise. This is foundational to everything else.
  • Preserves coverage intent - Quarantined tests remain in the codebase, documented, assigned, and scheduled for repair. The behavior they tested is not forgotten.
  • Makes flakiness visible - The quarantine queue is a metric: how many tests are quarantined, for how long, and in which areas of the codebase. This makes the problem prioritizable.
  • Enables agent reliability - AI agents that iterate on failing tests need a reliable CI signal. An agent in a high-flakiness environment thrashes on phantom failures. Quarantine gives agents a clean signal to work with.
  • Prerequisite for TORS - The Test Oracle Reliability Score (TORS) target of 90%+ at L3 is only achievable after systematic quarantine has cleared the backlog of known flaky tests.

Tip

Set a quarantine SLA: no test should stay in quarantine for more than 2 weeks without a resolution decision (fix, rewrite, or delete). Without an SLA, quarantine becomes permanent storage for neglected tests. Deletion feels final, but quarantine does not; the SLA supplies the urgency that quarantine otherwise lacks.
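An SLA like this is easy to enforce with a scheduled check. The sketch below assumes the quarantine dates are available as a simple mapping from test name to date; the test names and dates are made up for illustration.

```python
from datetime import date, timedelta

SLA = timedelta(days=14)  # the two-week limit suggested above


def overdue(quarantined_on: dict[str, date], today: date) -> list[str]:
    """Return tests quarantined for longer than the SLA, oldest first."""
    late = [(since, test) for test, since in quarantined_on.items()
            if today - since > SLA]
    return [test for since, test in sorted(late)]


quarantined_on = {
    "test_upload_retry": date(2024, 3, 1),
    "test_cache_eviction": date(2024, 3, 20),
    "test_session_expiry": date(2024, 2, 10),
}
needs_decision = overdue(quarantined_on, today=date(2024, 3, 25))
```

The result lists the tests that need a fix, rewrite, or delete decision. A nightly job could post that list to the dashboard or open issues for the owners.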

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob, Head of Engineering

Bob's team has 47 tests with @Disabled or @Ignore annotations in their codebase, accumulated over 18 months. No one knows when they were disabled, why, or whether the code they tested still exists. The ignored tests feel like a time bomb - something important might be broken and nobody would know.
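A backlog like this usually starts with an inventory. The sketch below counts JUnit's `@Ignore` (JUnit 4) and `@Disabled` (JUnit 5) annotations per file; the file names and contents are invented samples, and the sources are passed in as strings so the example stays self-contained.

```python
import re

# Matches JUnit 4's @Ignore and JUnit 5's @Disabled at the start of a line.
ANNOTATION = re.compile(r"^\s*@(Ignore|Disabled)\b", re.MULTILINE)


def count_disabled(sources: dict[str, str]) -> dict[str, int]:
    """Map each file name to its number of disabling annotations."""
    counts = {name: len(ANNOTATION.findall(text))
              for name, text in sources.items()}
    return {name: n for name, n in counts.items() if n > 0}


sample_sources = {
    "LoginTest.java": (
        "class LoginTest {\n"
        "    @Disabled(\"flaky on CI\")\n"
        "    @Test void logsIn() {}\n"
        "    @Ignore\n"
        "    @Test void logsOut() {}\n"
        "}\n"
    ),
    "CartTest.java": "class CartTest {\n    @Test void addsItem() {}\n}\n",
}
inventory = count_disabled(sample_sources)
```

In a real repository the mapping would be built by reading the test files from disk. Pairing each hit with `git blame` output could help recover when and by whom a test was disabled.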


Sarah, Productivity Lead

Sarah has been tracking build reliability as a metric and it's been declining: 78% of CI runs complete without requiring a retry, down from 91% six months ago. She's getting pressure to address it but doesn't have a concrete intervention to propose.
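The reliability figure Sarah tracks can be computed directly from CI records. This is a minimal sketch, assuming each CI run is reduced to its retry count; the sample data is hypothetical and chosen only to reproduce a 78% rate.

```python
def retry_free_rate(retries_per_run: list[int]) -> float:
    """Fraction of CI runs that completed with zero retries."""
    if not retries_per_run:
        raise ValueError("no CI runs recorded")
    clean = sum(1 for retries in retries_per_run if retries == 0)
    return clean / len(retries_per_run)


# Hypothetical sample: 100 runs, 22 of which needed at least one retry.
sample = [0] * 78 + [1] * 15 + [2] * 7
rate = retry_free_rate(sample)
```

Computing the same rate before and after quarantine is introduced gives a concrete way to show whether the intervention is working.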


Victor, Staff Engineer - AI Champion

Victor has been manually tracking flaky tests in a shared spreadsheet, but adoption is inconsistent. Some teams mark their tests, others ignore the spreadsheet. The process isn't enforced and the spreadsheet is increasingly out of date.
