TORS > 90% (Test Oracle Reliability Score)
TORS quantifies what percentage of test failures are real bugs. At L3, reaching 90% or higher is the prerequisite for trusting automated quality gates and enabling AI-driven testing workflows.
- TORS (Test Oracle Reliability Score) is measured and exceeds 90%
- Acceptance tests are auto-generated from ticket requirements (Autonomous Requirements pipeline)
- Incremental test selection runs only tests affected by changed code paths
- TORS is tracked per service or module, not just as an aggregate
- Test generation from tickets includes edge cases, not just happy paths
Evidence
- TORS dashboard showing 90%+ score with per-service breakdown
- Ticket-to-test pipeline configuration with sample outputs
- CI configuration showing incremental test selection (e.g., Bazel test targeting, Jest --changedSince)
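To illustrate the idea behind the incremental test selection listed above, here is a minimal Python sketch. It assumes a hand-written map from each test to the source files it depends on; the map, the file names, and the function `select_affected_tests` are hypothetical. Real tools such as Bazel or Jest derive this information themselves, so treat this as an illustration of the principle rather than how either tool works internally.

```python
def select_affected_tests(changed_files, test_dependencies):
    """Return the tests whose declared dependencies overlap the changed files."""
    changed = set(changed_files)
    return sorted(
        test
        for test, deps in test_dependencies.items()
        if changed & set(deps)
    )


# Hypothetical dependency map: test name -> source files it exercises.
TEST_DEPENDENCIES = {
    "test_checkout": ["cart.py", "payment.py"],
    "test_cart": ["cart.py"],
    "test_search": ["search.py", "index.py"],
}

selected = select_affected_tests(["cart.py"], TEST_DEPENDENCIES)
print(selected)  # ['test_cart', 'test_checkout']
```

A change that touches no tracked file selects no tests, which is why the quality of the dependency map matters: a missing edge means a skipped test.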
What It Is
The Test Oracle Reliability Score (TORS) measures the signal-to-noise ratio of your test suite: what percentage of test failures indicate a genuine defect in the code vs. false positives caused by flaky tests, environmental issues, or unstable assertions. A TORS of 90% means that 9 out of 10 test failures are worth investigating because they represent real problems. A TORS of 70% means that nearly one in three failures is noise - and developers learn to ignore CI output.
TORS is calculated by tracking test failure investigations: when a test fails, does it indicate a real bug (signal) or not (noise)? Over time, the ratio of signal failures to total failures is your TORS. Some CI platforms (Gradle Enterprise, Buildkite) automate this tracking. Teams without tooling support can approximate it by tracking re-runs: if a test failure disappears on re-run without code changes, it was noise. Re-run rate is therefore a rough proxy for TORS: it catches intermittent failures, but not noise that fails consistently, such as a persistently broken environment.
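The calculation described above can be sketched in a few lines of Python. The `FailureRecord` type, its field names, and both functions are assumptions made for illustration; they are not the API of any CI platform.

```python
from dataclasses import dataclass


@dataclass
class FailureRecord:
    test_name: str
    is_signal: bool  # True if the investigation found a real defect


def tors(records):
    """Percentage of failures that were signal; None if there are no failures."""
    if not records:
        return None
    signal = sum(1 for r in records if r.is_signal)
    return 100.0 * signal / len(records)


def is_signal_by_rerun(passed_on_rerun):
    """Re-run proxy: a failure that passes on re-run with no code change is noise."""
    return not passed_on_rerun


# Hypothetical data: ten failures, one of which passed on an unchanged re-run.
rerun_outcomes = [False] * 9 + [True]
records = [
    FailureRecord(f"test_{i}", is_signal_by_rerun(passed))
    for i, passed in enumerate(rerun_outcomes)
]
score = tors(records)
print(score)  # 90.0
```

Returning `None` for an empty record set avoids reporting a misleading 0% or 100% when no failures have been observed yet.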
At Level 1, TORS is typically below 70% due to accumulated flakiness and unstable oracles. At Level 2, quarantine and oracle stabilization push it toward 80-85%. At Level 3 (Systematic), TORS > 90% is a formal target: a requirement, not an aspiration. Automated quality gates (L4's green auto-merge) are not viable with TORS below 90%, because the automated system would block good code or pass bad code too frequently to be trusted.
The 90% threshold is not arbitrary. At 80% TORS, one in five failures is noise, which is enough to erode trust in automation. At 90%, a failing build is almost certainly a real problem, so automated systems can act on it with confidence and the automation provides more value than friction.
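As a worked example of the arithmetic, the sketch below converts a TORS value into the number of noise failures a team would have to triage. The figure of 200 failures per month is a hypothetical input chosen only to make the comparison concrete.

```python
def noise_failures(total_failures, tors_percent):
    """Number of failures expected to be noise at a given TORS."""
    return total_failures * (100 - tors_percent) / 100


# Hypothetical volume: 200 test failures per month.
at_80 = noise_failures(200, 80)
at_90 = noise_failures(200, 90)
print(at_80, at_90)  # 40.0 20.0
```

Under these assumed numbers, moving from 80% to 90% TORS halves the number of investigations that lead nowhere.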
Why It Matters
TORS at 90% is the inflection point where the test suite transitions from a human-managed tool to an automation-ready infrastructure component:
- Automated decisions become viable - At 90% TORS, you can begin moving toward automated merge policies. False positive rates are low enough that automation provides net value.
- Agent iteration becomes reliable - AI agents fixing test failures need to know that failures indicate real problems. At 90% TORS, agents can trust the CI signal and make focused fixes rather than thrashing.
- Developer trust is restored - When 9 out of 10 failures are real, developers stop dismissing CI output. The psychological shift from "CI is unreliable" to "CI is worth trusting" is critical for effective AI-assisted development.
- Acceptance test coverage becomes meaningful - Acceptance tests derived from requirements (L3) only provide value if their failures are investigated. At L1-L2 TORS levels, even real failures get dismissed alongside noise.
- Foundation for L4 automation - The 95% TORS target at L4 is only reachable if the team has established the practices and metrics to manage TORS systematically at L3.
Start measuring TORS before you set targets. Instrument your CI to track whether each test failure was resolved by code change (signal) or by re-run / environment fix (noise). Three weeks of data will give you a credible baseline and surface the specific tests or test categories that are dragging your score below 90%.
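One way to turn that instrumentation into a baseline is sketched below. It assumes each failure is logged as a dictionary with `service`, `test`, and `resolution` fields, where `resolution` is `"code_change"` (signal) or `"rerun"` / `"env_fix"` (noise). These field names and values are illustrative assumptions, not a standard CI schema.

```python
from collections import Counter

# Assumed resolution labels that count as noise.
NOISE_RESOLUTIONS = {"rerun", "env_fix"}


def tors_by_group(events, key):
    """TORS percentage per group (for example per service or per test)."""
    totals, signals = Counter(), Counter()
    for event in events:
        group = event[key]
        totals[group] += 1
        if event["resolution"] == "code_change":
            signals[group] += 1
    return {group: 100.0 * signals[group] / totals[group] for group in totals}


def noisiest_tests(events, top_n=3):
    """Tests with the most noise failures, most frequent first."""
    noise = Counter(
        e["test"] for e in events if e["resolution"] in NOISE_RESOLUTIONS
    )
    return noise.most_common(top_n)


# Hypothetical failure log.
events = [
    {"service": "checkout", "test": "test_pay", "resolution": "code_change"},
    {"service": "checkout", "test": "test_pay", "resolution": "code_change"},
    {"service": "checkout", "test": "test_timeout", "resolution": "rerun"},
    {"service": "checkout", "test": "test_timeout", "resolution": "rerun"},
    {"service": "search", "test": "test_index", "resolution": "code_change"},
    {"service": "search", "test": "test_index", "resolution": "code_change"},
    {"service": "search", "test": "test_index", "resolution": "code_change"},
    {"service": "search", "test": "test_cache", "resolution": "env_fix"},
]

per_service = tors_by_group(events, "service")
worst = noisiest_tests(events)
print(per_service)  # {'checkout': 50.0, 'search': 75.0}
print(worst)        # [('test_timeout', 2), ('test_cache', 1)]
```

The per-service view matches the requirement to track TORS per service or module, and the ranked list points at the individual tests worth quarantining or fixing first.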
How Different Roles See It
Bob has been tracking build reliability (percentage of CI runs completing without retry) as his proxy metric. It's at 89%. He wants to move to L3 automated quality gates but isn't sure if 89% build reliability translates to a reliable enough foundation.
Sarah wants to put a dollar value on reaching 90% TORS as a business case for the engineering investment. She has the Google 16% flaky test data but needs to translate it into a number for their team.
Victor has been informally tracking which tests are worth investigating - he mentally marks tests as "probably flaky" and ignores them. His personal TORS intuition is good, but it's not shared with the team and it's not encoded anywhere.