Flaky test quarantine

A systematic L2 process for removing flaky tests from the main CI signal without deleting them - creating accountability to fix them while keeping builds reliable.

·Agents generate unit tests; humans write acceptance tests
·Flaky test quarantine process is active (flaky tests are isolated, not deleted)
·Humans define the expected results for important paths (not just snapshotting current output)

·Flaky test count is tracked and reported weekly
·Quarantined tests have a resolution SLA (e.g., fix or delete within 30 days)

Evidence

·Test files with agent attribution alongside human-authored acceptance tests
·Quarantine list or label in test framework configuration
·Flaky test tracking dashboard or issue tracker labels

What It Is

Flaky test quarantine is a structured process for handling tests that fail intermittently: when a test is identified as flaky (fails without code changes, or fails on some CI runs but passes on others), it is immediately moved to a quarantine suite that runs separately from the main CI pipeline. Quarantined tests don't block merges, don't trigger alerts, and don't pollute the CI signal. But they are tracked, reported on a schedule, and assigned to engineers for resolution.
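
One way to picture the mechanics, as a minimal sketch rather than a prescribed implementation: the decorator below skips a quarantined test in the main run and lets it execute only when the scheduled quarantine job sets an environment variable. The names `quarantined`, `QUARANTINE_REGISTRY`, and `RUN_QUARANTINE` are assumptions made for illustration; only `unittest.skipUnless` comes from the Python standard library.

```python
import os
import unittest
from datetime import date

# Hypothetical in-memory registry; a real setup would likely export this to a dashboard.
QUARANTINE_REGISTRY = []


def quarantined(owner, since, reason):
    """Skip the test in the main suite, but keep it tracked and runnable on demand."""
    def decorator(test_func):
        QUARANTINE_REGISTRY.append(
            {"test": test_func.__name__, "owner": owner, "since": since, "reason": reason}
        )
        # RUN_QUARANTINE is an assumed variable that only the scheduled job would set.
        run_quarantine = os.environ.get("RUN_QUARANTINE") == "1"
        return unittest.skipUnless(run_quarantine, f"quarantined: {reason}")(test_func)
    return decorator


class CheckoutTests(unittest.TestCase):
    def test_total_is_sum_of_items(self):
        self.assertEqual(sum([2, 3]), 5)

    @quarantined(owner="payments-team", since=date(2024, 1, 15), reason="intermittent timeout")
    def test_payment_gateway_roundtrip(self):
        self.assertTrue(True)
```

In this sketch the main pipeline runs the suite unchanged and sees the quarantined test as skipped, while a nightly job would run the same suite with `RUN_QUARANTINE=1`. The registry keeps the owner and start date next to the test, which is what separates quarantine from a bare skip.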

At Level 1, the response to flaky tests is ad-hoc: developers retry CI runs, add @Ignore annotations, or just accept intermittent failures as background noise. There's no process, no ownership, and no resolution timeline. The flakiness accumulates until it consumes a significant fraction of engineering time (Google has reported that almost 16% of its tests show some level of flakiness).

Quarantine is the L2 answer. It solves two problems simultaneously. First, it removes flaky tests from the primary CI signal, making the build reliable again - a passing build now means something. Second, it creates explicit accountability: quarantined tests are not "ignored" tests. They are tests on a repair schedule. The quarantine suite runs nightly and reports to a dashboard. Engineers are assigned to fix them. The quarantine is a holding area, not a permanent state.

The key distinction is between quarantine and deletion. Teams that simply delete flaky tests lose the coverage they provided. Teams that quarantine them preserve the intent (this behavior should be tested) while temporarily suspending the signal (this specific test is unreliable) until the root cause is fixed.

Why It Matters

Flaky test quarantine is the bridge between the chaos of L1 and the reliable test infrastructure required at L3-L5:

·Restores CI trust - Once developers can trust that a red build means a real failure, they stop dismissing failures as noise. This is foundational to everything else.
·Preserves coverage intent - Quarantined tests remain in the codebase, documented, assigned, and scheduled for repair. The behavior they tested is not forgotten.
·Makes flakiness visible - The quarantine queue is a metric: how many tests are quarantined, for how long, and in which areas of the codebase. This makes the problem prioritizable.
·Enables agent reliability - AI agents that iterate on failing tests need a reliable CI signal. An agent in a high-flakiness environment thrashes on phantom failures. Quarantine gives agents a clean signal to work with.
·Prerequisite for TORS - The Test Oracle Reliability Score (TORS) target of 90%+ at L3 is only achievable after systematic quarantine has cleared the backlog of known flaky tests.

Tip

Set a quarantine SLA: no test should stay in quarantine for more than 2 weeks without a resolution decision (fix, rewrite, or delete). Without an SLA, quarantine becomes permanent storage for neglected tests. Deletion feels final, but quarantine doesn't, so the SLA has to supply the urgency that quarantine lacks on its own.
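
As a rough illustration of the tip, the check below lists quarantine entries that have passed the two-week limit. The entry layout (a dict with `test` and `quarantined_on` keys) and the function name `overdue` are assumptions for this sketch, not part of any particular tool.

```python
from datetime import date, timedelta

SLA = timedelta(days=14)  # the two-week limit suggested in the tip


def overdue(entries, today, sla=SLA):
    """Return entries that have been quarantined longer than the SLA, oldest first."""
    late = [entry for entry in entries if today - entry["quarantined_on"] > sla]
    return sorted(late, key=lambda entry: entry["quarantined_on"])


if __name__ == "__main__":
    queue = [
        {"test": "test_old", "quarantined_on": date(2024, 1, 1)},
        {"test": "test_new", "quarantined_on": date(2024, 1, 20)},
    ]
    for entry in overdue(queue, today=date(2024, 1, 24)):
        print(entry["test"], "needs a fix, rewrite, or delete decision")
```

A test quarantined exactly 14 days ago is treated as still within the SLA here; whether the boundary is inclusive is a policy choice.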

Getting Started

·Create the quarantine infrastructure - Add a quarantine test suite to your CI configuration. In most CI systems, this means a separate job that runs the quarantine suite on a schedule (nightly is typical) rather than on every push. The output goes to a dashboard but does not gate merges.
·Define the quarantine criteria - A test should be quarantined when it fails on a commit where no related code was changed, or when it has a documented history of intermittent failures. Write these criteria down. You want quarantine decisions to be objective, not judgment calls.
·Build the detection pipeline - Automatically detect candidates for quarantine by tracking test results over the last N runs. A test that fails more than 5% of the time without a corresponding code change is a quarantine candidate. Some CI platforms (Buildkite, Gradle Enterprise) have built-in flakiness detection.
·Create the ownership model - Assign quarantined tests to the team that owns the code they test. The owning team's tech lead reviews the quarantine queue weekly. Unowned tests go to a shared backlog with an on-call rotation.
·Set the fix SLA and track it - Two weeks is a reasonable default. Publish the quarantine queue and SLA status in your engineering metrics dashboard. When tests age out of the SLA, escalate to team leads.
·Implement the three resolutions - A quarantined test has three outcomes: fix the test (address the root cause), rewrite the test (replace it with a stable equivalent), or delete it (if the behavior is covered by other tests or the test was not worth keeping). No other outcomes are acceptable.
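
The detection step above can be sketched as a small function over recent test history. This is an illustrative sketch under stated assumptions: `RunRecord`, `quarantine_candidates`, and the `related_code_changed` flag are hypothetical names, and it assumes the CI system can say whether a commit touched code related to the test. The 5% threshold is the one given in the steps; the window of 100 runs is an arbitrary stand-in for N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    test_name: str
    passed: bool
    related_code_changed: bool  # assumed to be supplied by the CI system


def quarantine_candidates(runs, window=100, threshold=0.05):
    """Map test name to failure rate for tests that fail too often on unrelated commits.

    `runs` is expected in chronological order, oldest first.
    """
    history_by_test = {}
    for run in runs:
        history_by_test.setdefault(run.test_name, []).append(run)

    candidates = {}
    for name, history in history_by_test.items():
        recent = history[-window:]
        # Failures on commits that changed related code may be real bugs, so ignore them.
        unrelated = [run for run in recent if not run.related_code_changed]
        if not unrelated:
            continue
        failure_rate = sum(not run.passed for run in unrelated) / len(unrelated)
        if failure_rate > threshold:
            candidates[name] = failure_rate
    return candidates
```

The output is a list of candidates, not an automatic quarantine action; a human or a follow-up rule still applies the written criteria before a test moves.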

Common Pitfalls

Quarantine as permanent storage. Without an SLA and active tracking, quarantine becomes a graveyard of tests that no one will ever fix. The metric to watch: average days-in-quarantine. If it's rising, quarantine has become deletion with extra steps.
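
A minimal sketch of that metric, assuming the quarantine queue is available as a list of start dates. The function names and the weekly sampling are illustrative choices, not a standard.

```python
from datetime import date
from statistics import mean


def average_days_in_quarantine(quarantined_on_dates, as_of):
    """Mean age in days of the tests currently in quarantine (0.0 for an empty queue)."""
    if not quarantined_on_dates:
        return 0.0
    return mean((as_of - start).days for start in quarantined_on_dates)


def is_rising(weekly_averages):
    """True when every weekly average is higher than the one before it."""
    if len(weekly_averages) < 2:
        return False
    pairs = zip(weekly_averages, weekly_averages[1:])
    return all(later > earlier for earlier, later in pairs)
```

Recording the average once a week and passing the series to `is_rising` gives a simple warning signal. A stricter or smoother trend test may suit a noisy queue better.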

Quarantining too eagerly. Some teams interpret "quarantine flaky tests immediately" as "quarantine on first failure." A single failure doesn't indicate flakiness - it might indicate a real bug. The criterion should be a pattern: two or more failures on commits with no related code changes, or a documented history of intermittent failures.
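
The pattern-based criterion can be written down as a small rule. The sketch below assumes each recorded failure carries the list of files changed in its commit, and that a test's related code can be approximated by path prefixes; both the data layout and the function names are hypothetical.

```python
def touches_related_code(changed_files, related_prefixes):
    """True if any changed file lives under one of the test's related path prefixes."""
    return any(
        path.startswith(prefix)
        for path in changed_files
        for prefix in related_prefixes
    )


def meets_quarantine_criterion(failures, related_prefixes, min_unrelated_failures=2):
    """Apply the rule: two or more failures on commits with no related code changes."""
    unrelated = [
        failure
        for failure in failures
        if not touches_related_code(failure["changed_files"], related_prefixes)
    ]
    return len(unrelated) >= min_unrelated_failures
```

A single failure, or failures that coincide with changes to the code under test, never satisfies this rule, which keeps real bugs out of quarantine.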

Fixing flakiness with retries instead of root cause. It's tempting to "fix" a quarantined test by wrapping it in a retry loop and moving it back to the main suite. This passes the quarantine SLA while preserving the underlying non-determinism. Retries are acceptable as a short-term stabilizer while root cause investigation proceeds, but they are not a resolution.

Not tracking the quarantine backlog. The value of quarantine as a process comes from visibility. If the quarantine queue isn't tracked, reported, and reviewed regularly, it provides no more accountability than the L1 @Ignore annotation. The quarantine process is only as effective as the management attention it receives.

How Different Roles See It

Bob, Head of Engineering

Bob's team has 47 tests with @Disabled or @Ignore annotations in their codebase, accumulated over 18 months. No one knows when they were disabled, why, or whether the code they tested still exists. The ignored tests feel like a time bomb - something important might be broken and nobody would know.

What Bob should do: The ignored tests are a real risk but they're not an emergency - they're a maintenance debt. Bob should convert the team to a quarantine model: instead of @Disabled, tests move to a quarantine suite with an owner, a creation date, and a resolution deadline. For the existing 47 ignored tests, run a triage session: for each one, determine if it's still relevant and what the root cause of its failure was. Tests that can be fixed quickly get fixed. Tests that require deeper work get quarantined with proper ownership. Tests that are obsolete get deleted.

Sarah, Productivity Lead

Sarah has been tracking build reliability as a metric and it's been declining: 78% of CI runs complete without requiring a retry, down from 91% six months ago. She's getting pressure to address it but doesn't have a concrete intervention to propose.

What Sarah should do: The quarantine process is exactly the intervention Sarah needs, and it comes with a measurable outcome: build reliability (percentage of CI runs that complete without a retry). Implementing quarantine should drive that metric from 78% up toward 95%+ within one quarter. Sarah should propose the quarantine process as a time-boxed initiative with a clear metric goal - not an open-ended cleanup project - and pair it with the TORS metric to track ongoing test quality.
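
For reference, the build reliability metric as defined here reduces to a one-line calculation. The input format (one retry count per CI run) and the function name are assumptions made for this sketch.

```python
def build_reliability(retry_counts):
    """Percentage of CI runs that completed without needing a retry."""
    if not retry_counts:
        raise ValueError("at least one CI run is required")
    clean_runs = sum(1 for retries in retry_counts if retries == 0)
    return 100.0 * clean_runs / len(retry_counts)


if __name__ == "__main__":
    # 78 clean runs and 22 runs that needed one retry, matching Sarah's current figure.
    print(build_reliability([0] * 78 + [1] * 22))
```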

Victor, Staff Engineer - AI Champion

Victor has been manually tracking flaky tests in a shared spreadsheet, but adoption is inconsistent. Some teams mark their tests, others ignore the spreadsheet. The process isn't enforced and the spreadsheet is increasingly out of date.

What Victor should do: Spreadsheets don't scale as a quarantine mechanism. Victor should replace the spreadsheet with a proper quarantine infrastructure: a CI configuration that runs the quarantine suite separately, test annotations that are enforced by a linter (no raw @Ignore - only quarantine annotations with required metadata), and a dashboard that shows the queue status automatically. The manual spreadsheet process fails because it requires ongoing human maintenance. The automated infrastructure requires only that engineers follow the annotation protocol.
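
A linter of the kind described could start as a simple text scan. The sketch below is only an illustration: `@Quarantined` and its `owner`, `since`, and `ticket` fields are hypothetical, since the text does not name the quarantine annotation or its required metadata. It checks one line at a time and does not handle annotations split across lines or values that contain a closing parenthesis.

```python
import re

RAW_SKIP = re.compile(r"@(?:Ignore|Disabled)\b")
# Hypothetical quarantine annotation; group 1 captures the argument list when present.
QUARANTINED = re.compile(r"@Quarantined\b(?:\s*\(([^)]*)\))?")
REQUIRED_FIELDS = ("owner", "since", "ticket")  # assumed metadata fields


def lint_source(source):
    """Return (line_number, message) pairs for annotation protocol violations."""
    problems = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if RAW_SKIP.search(line):
            problems.append((line_number, "raw skip annotation; use @Quarantined with metadata"))
        match = QUARANTINED.search(line)
        if match:
            arguments = match.group(1) or ""
            missing = [
                field for field in REQUIRED_FIELDS
                if not re.search(rf"\b{field}\s*=", arguments)
            ]
            if missing:
                problems.append((line_number, "missing metadata: " + ", ".join(missing)))
    return problems
```

Run as a CI check that fails when `lint_source` returns any problems, this would enforce the annotation protocol without anyone maintaining a spreadsheet.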
