TORS > 90% (Test Oracle Reliability Score)

TORS quantifies what percentage of test failures are real bugs - at L3, achieving 90%+ is the prerequisite for trusting automated quality gates and enabling AI-driven testing workflows.

·Expected results are derived from requirements/specs (the requirement is the oracle, not the code)
·Acceptance tests are auto-generated from ticket requirements (Autonomous Requirements pipeline)
·Incremental test selection runs only tests affected by changed code paths

·Oracle reliability is reviewed per service, not just overall
·Test generation from tickets includes edge cases, not just happy paths

Evidence

·Oracle-reliability dashboard (e.g., TORS) with per-service breakdown
·Ticket-to-test pipeline configuration with sample outputs
·CI configuration showing incremental test selection (e.g., Bazel test targeting, Jest --changedSince)
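
The idea behind incremental test selection can be sketched in a few lines. This is a toy illustration with a hand-written dependency map, not a description of how Bazel or Jest work internally; the file names and the `TEST_DEPENDENCIES` map are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical map from each test file to the source files it exercises.
# Real tools derive this from a build or module graph instead of a literal.
TEST_DEPENDENCIES: Dict[str, Set[str]] = {
    "tests/test_cart.py": {"shop/cart.py", "shop/pricing.py"},
    "tests/test_search.py": {"shop/search.py"},
    "tests/test_checkout.py": {"shop/cart.py", "shop/payment.py"},
}


def select_affected_tests(
    changed_files: Iterable[str],
    deps: Dict[str, Set[str]] = TEST_DEPENDENCIES,
) -> List[str]:
    """Return the tests whose dependencies overlap the changed files."""
    changed = set(changed_files)
    return sorted(test for test, sources in deps.items() if sources & changed)


selected = select_affected_tests(["shop/cart.py"])
print(selected)  # ['tests/test_cart.py', 'tests/test_checkout.py']
```

A change to `shop/cart.py` selects the cart and checkout tests and skips the search test, which is the behavior the CI evidence should demonstrate.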

What It Is

The Test Oracle Reliability Score (TORS) measures the signal-to-noise ratio of your test suite: what percentage of test failures indicate a genuine defect in the code vs. false positives caused by flaky tests, environmental issues, or unstable assertions. A TORS of 90% means that 9 out of 10 test failures are worth investigating because they represent real problems. A TORS of 70% means that nearly one in three failures is noise - and developers learn to ignore CI output.

TORS is calculated by tracking test failure investigations: when a test fails, does it indicate a real bug (signal) or not (noise)? Over time, the ratio of signal failures to total failures is your TORS. Some CI platforms (Gradle Enterprise, Buildkite) automate this tracking. Teams without tooling support can approximate it by tracking re-runs: if a test failure disappears on re-run without code changes, it was noise. The re-run rate is therefore a rough proxy for the noise share of TORS.
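
As a minimal sketch of that ratio, assuming each investigated failure is recorded with a flag saying whether it was resolved by a code change: the `FailureRecord` shape and the test names below are hypothetical, not the format of any particular CI platform.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FailureRecord:
    """One investigated test failure (hypothetical record shape)."""
    test_id: str
    fixed_by_code_change: bool  # True = signal, False = noise (e.g. passed on re-run)


def tors(records: Sequence[FailureRecord]) -> Optional[float]:
    """Return signal failures / total failures, or None if nothing failed."""
    if not records:
        return None
    signal = sum(1 for r in records if r.fixed_by_code_change)
    return signal / len(records)


# Nine real failures and one that vanished on re-run.
records = [FailureRecord("test_checkout", True)] * 9 + [FailureRecord("test_login", False)]
score = tors(records)
print(f"TORS: {score:.0%}")  # TORS: 90%
```

Returning `None` for an empty window avoids reporting a misleading 0% or 100% when no failures were observed.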

At Level 1, TORS is typically below 70% due to accumulated flakiness and unstable oracles. At Level 2, quarantine and oracle stabilization push it toward 80-85%. At Level 3 (Systematic), TORS > 90% is a formal target - not an aspiration but a requirement. Automated quality gates (such as L4's green auto-merge) are not viable with TORS below 90%, because the automated system would block good code or pass bad code too frequently to be trusted.

The 90% threshold is not arbitrary. At 80% TORS, one in five failures is noise - enough to erode trust in automation. At 90%, a failing build is almost certainly a real problem, so automated systems can act on it with confidence, and the automation's own error rate is low enough that it provides more value than friction.

Why It Matters

TORS at 90% is the inflection point where the test suite transitions from a human-managed tool to an automation-ready infrastructure component:

Automated decisions become viable - At 90% TORS, you can begin moving toward automated merge policies. False positive rates are low enough that automation provides net value.
Agent iteration becomes reliable - AI agents fixing test failures need to know that failures indicate real problems. At 90% TORS, agents can trust the CI signal and make focused fixes rather than thrashing.
Developer trust is restored - When 9 out of 10 failures are real, developers stop dismissing CI output. The psychological shift from "CI is unreliable" to "CI is worth trusting" is critical for effective AI-assisted development.
Acceptance test coverage becomes meaningful - Acceptance tests derived from requirements (L3) only provide value if their failures are investigated. At L1-L2 TORS levels, even real failures get dismissed alongside noise.
Foundation for L4 automation - The 95% TORS target at L4 is only reachable if the team has established the practices and metrics to manage TORS systematically at L3.

Tip

Start measuring TORS before you set targets. Instrument your CI to track whether each test failure was resolved by code change (signal) or by re-run / environment fix (noise). Three weeks of data will give you a credible baseline and surface the specific tests or test categories that are dragging your score below 90%.
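
One way to surface the tests dragging the score down is to count noisy failures per test. The sketch below assumes a hypothetical failure log of `(test_id, resolution)` pairs; the resolution labels are illustrative, not a standard vocabulary.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical failure log collected from CI over the baseline period.
# "code_change" counts as signal; "rerun" and "env_fix" count as noise.
failure_log = [
    ("test_payment_timeout", "rerun"),
    ("test_payment_timeout", "rerun"),
    ("test_payment_timeout", "env_fix"),
    ("test_cart_total", "code_change"),
    ("test_search_index", "rerun"),
    ("test_search_index", "code_change"),
]

NOISE_RESOLUTIONS = {"rerun", "env_fix"}


def top_false_positive_tests(
    log: Iterable[Tuple[str, str]], n: int = 5
) -> List[Tuple[str, int]]:
    """Return up to n tests ranked by number of noisy failures."""
    noise = Counter(
        test_id for test_id, resolution in log if resolution in NOISE_RESOLUTIONS
    )
    return noise.most_common(n)


worst = top_false_positive_tests(failure_log)
print(worst)  # [('test_payment_timeout', 3), ('test_search_index', 1)]
```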

Getting Started

Instrument CI for TORS measurement - The minimum viable instrumentation: track which test failures led to code changes vs. which were resolved by re-running. A CI platform with test analytics (Gradle Enterprise, Buildkite, Datadog CI Visibility) can automate this. Manual tracking with a spreadsheet works for teams that need to start quickly.
Establish the baseline TORS - Run the measurement for 2-4 weeks before setting targets. Your actual baseline is often lower than expected - teams with high flakiness frequently discover TORS below 60% when they measure for the first time.
Set the 90% target and timeline - 90% TORS within one quarter is achievable for most teams that have already implemented quarantine and oracle stabilization. Publish the target and current score in your engineering metrics dashboard.
Work the remaining false positives - Below 90%, identify the top 5 tests by false positive frequency. Fix them. Repeat. The Pareto principle applies strongly here: a small number of tests usually accounts for the majority of false positives.
Automate false positive detection - Configure your CI system to flag test failures that match known flaky patterns. When a failing test has a history of false positives, automatically route it to the quarantine queue rather than blocking the build.
Validate TORS after each major change - Significant codebase changes (large refactors, infrastructure upgrades, dependency updates) can degrade TORS. Measure it explicitly after major changes and restore it before the next sprint.
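
The automated false positive detection step above could look roughly like this. It is a sketch under stated assumptions: the `history` map of past failure counts and both thresholds are hypothetical and should be tuned against your own baseline.

```python
from typing import Dict, Tuple

# Hypothetical thresholds; tune them against your own baseline data.
MIN_FAILURES = 5         # past failures required before the history is trusted
NOISE_RATIO_LIMIT = 0.5  # at or above this noisy share, treat the test as known-flaky


def route_failure(test_id: str, history: Dict[str, Tuple[int, int]]) -> str:
    """Decide whether a failing test blocks the build or goes to quarantine.

    history maps test_id -> (noisy_failures, total_failures) from past CI runs.
    """
    noise, total = history.get(test_id, (0, 0))
    if total >= MIN_FAILURES and noise / total >= NOISE_RATIO_LIMIT:
        return "quarantine"
    return "block"


history = {
    "test_payment_timeout": (8, 10),  # 80% of past failures were noise
    "test_cart_total": (0, 6),        # every past failure was a real bug
}

print(route_failure("test_payment_timeout", history))  # quarantine
print(route_failure("test_cart_total", history))       # block
print(route_failure("test_brand_new", history))        # block
```

Tests with no recorded history default to blocking the build, so a new test cannot bypass the gate just because nothing is known about it.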


Common Pitfalls

Measuring TORS only on failures, not on passes. TORS measures the reliability of failures - but there's an analogous problem with false negatives: tests that pass when they should fail (circular tests, tests with weakened oracles). A comprehensive quality metric includes both dimensions. TORS alone doesn't catch tests that are reliably wrong.

Gaming TORS by deleting flaky tests. Deleting all tests that produce false positives will mathematically raise TORS to 100% while destroying test coverage. Any TORS measurement should be paired with a coverage metric to ensure that TORS improvements reflect genuine quality improvements, not coverage reductions.

Tracking TORS only in aggregate rather than per service. Different areas of the codebase may have very different TORS profiles. A monorepo with 92% aggregate TORS may have one service at 60% hidden behind the healthier services around it. Track TORS by service or module to surface these anomalies.
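
A small worked example shows how an aggregate can hide a weak service. The service names and failure counts are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tors_by_service(failures: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute signal/total per service from (service, is_signal) pairs."""
    signal: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for service, is_signal in failures:
        total[service] += 1
        if is_signal:
            signal[service] += 1
    return {service: signal[service] / total[service] for service in total}


# Hypothetical data: 100 failures across three services.
failures = (
    [("billing", True)] * 6 + [("billing", False)] * 4
    + [("catalog", True)] * 43 + [("catalog", False)] * 2
    + [("search", True)] * 43 + [("search", False)] * 2
)

aggregate = sum(1 for _, is_signal in failures if is_signal) / len(failures)
per_service = tors_by_service(failures)

print(f"aggregate: {aggregate:.0%}")
for service, score in sorted(per_service.items()):
    print(f"{service}: {score:.0%}")
```

Here the aggregate is 92% while `billing` sits at 60%, which only the per-service view reveals.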

Treating 90% as a destination rather than a floor. At L3, 90% TORS is the minimum - the floor that enables systematic automation. It is not the ceiling. Teams should continue improving TORS as they progress to L4 (where the target rises to 95%). Treating 90% as "good enough" leads to stagnation.


How Different Roles See It

Bob, Head of Engineering

Bob has been tracking build reliability (percentage of CI runs completing without retry) as his proxy metric. It's at 89%. He wants to move to L3 automated quality gates but isn't sure if 89% build reliability translates to a reliable enough foundation.

What Bob should do: Build reliability and TORS are related but not the same. Build reliability measures run-level reliability (did the whole CI run succeed?), while TORS measures test-level reliability (do individual test failures mean something?). Bob needs to start tracking TORS explicitly to understand the quality of the test signal, not just the build surface. An 89% build reliability figure says nothing about whether TORS is above 90%; until TORS itself is measured at 90% or higher, the team isn't ready for automated quality gates. The specific metric Bob should add to his dashboard is TORS by service - it will reveal which services are ready for L4 automation and which need more stabilization work.


Sarah, Productivity Lead

Sarah wants to put a dollar value on reaching 90% TORS as a business case for the engineering investment. She has the Google 16% flaky test data but needs to translate it into a number for her team.

What Sarah should do: TORS provides a direct productivity calculation. At 75% TORS, 25% of test failure investigations are wasted - the developer investigates, finds nothing wrong, re-runs, and moves on. For a team spending an average of 30 minutes per test failure investigation, a 25% waste rate means roughly one full day per developer per month lost to false positives, if each developer investigates about 60 failures a month. Sarah should substitute her team's own failure count, multiply by team size, and present the result as the cost of staying below 90% TORS. The engineering investment in oracle stabilization and flakiness elimination pays back quickly at that rate.
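
The arithmetic behind that estimate can be sketched as follows. The 30-minute investigation time and the 75% TORS come from the scenario above; the monthly failure count, team size, and hourly cost are hypothetical inputs chosen for illustration.

```python
def wasted_hours_per_dev(
    failures_per_month: float, tors: float, minutes_per_investigation: float
) -> float:
    """Hours per developer per month spent investigating noisy failures."""
    noise_share = 1.0 - tors
    return failures_per_month * noise_share * minutes_per_investigation / 60.0


# Hypothetical inputs: replace with your team's measured values.
FAILURES_PER_DEV_PER_MONTH = 64  # assumption, picked so the result is one 8-hour day
TEAM_SIZE = 20                   # assumption
HOURLY_COST = 100.0              # assumption, in your currency of choice

at_75 = wasted_hours_per_dev(FAILURES_PER_DEV_PER_MONTH, 0.75, 30)
at_90 = wasted_hours_per_dev(FAILURES_PER_DEV_PER_MONTH, 0.90, 30)
monthly_saving = (at_75 - at_90) * TEAM_SIZE * HOURLY_COST

print(f"wasted hours per developer at 75% TORS: {at_75:.1f}")
print(f"wasted hours per developer at 90% TORS: {at_90:.1f}")
print(f"estimated monthly saving for the team: {monthly_saving:.0f}")
```

With these inputs, moving from 75% to 90% TORS cuts wasted time from about 8 hours to about 3.2 hours per developer per month.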


Victor, Staff Engineer - AI Champion

Victor has been informally tracking which tests are worth investigating - he mentally marks tests as "probably flaky" and ignores them. His personal TORS intuition is good, but it's not shared with the team and it's not encoded anywhere.

What Victor should do: Victor's mental model needs to become infrastructure. He should work with the DevOps team to instrument TORS measurement into the CI pipeline and publish the score on the engineering dashboard. His next step is to formalize his intuition: when does he dismiss a failure vs. investigate? The answer is the TORS measurement protocol. Once it's documented and automated, the rest of the team benefits from Victor's judgment without needing to consult him individually on every CI failure.
