TORS > 95%

·Test-oracle reliability is measured and tracked on a dashboard
·Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
·Merge queue wait time is tracked with a target under 10 minutes

·Agent Autonomy Score (% of tasks completed without human intervention) is measured and broken down by task type
·Metrics trigger automated alerts when thresholds are breached (e.g., test-oracle reliability drops)

Evidence

·Oracle-reliability dashboard (e.g., TORS) with per-service breakdown
·Auto-approve rate report showing 60%+ Green target
·Merge queue wait time chart showing sub-10-minute target

What It Is

The Test Oracle Reliability Score (TORS) measures what percentage of test failures represent real defects rather than test infrastructure issues, environmental flakiness, or poorly designed test assertions. If CI reports 100 failures in a week, and 95 of those failures correctly identify real bugs while 5 are false positives caused by flaky tests or environment issues, TORS is 95%. The target at L4 is TORS above 95%.

TORS is the signal quality metric for the entire agent-driven development pipeline. Agents rely on CI feedback to determine whether their code is correct. When tests are flaky - failing intermittently without any change to the code being tested - agents receive false failure signals. The agent reads a CI failure, concludes that its code is wrong, and attempts to fix a problem that doesn't exist. This drives up ITS (each false failure is an unnecessary iteration), increases CPI (each unnecessary iteration costs money), and degrades the code as the agent "fixes" working code in response to phantom test failures.

At L4 (Optimized), where agents are running hundreds of CI iterations per day across multiple parallel workflows, the impact of low TORS compounds dramatically. A TORS of 90% means 10% of all CI failures are noise. At 100 CI failures per day, that's 10 false signals per day that agents are chasing. Each false signal costs one full iteration including model tokens and CI compute. At L4 scale, a 5 percentage point improvement in TORS (90% to 95%) can save tens of thousands of dollars per year in wasted agent compute.

TORS is measured by tagging CI test failures as "real" or "flaky" over time. A "flaky" failure is one where re-running the same test suite with the same code produces a different result. Modern CI systems can be configured to automatically re-run failed tests once and classify the failure as flaky if it passes on the retry. This retry data is the raw material for computing TORS: (successful reruns / total failures) = flaky rate; TORS = 1 - flaky rate.
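The formula above can be sketched in a few lines. This is a minimal illustration, assuming each first-run failure has already been retried once and recorded; the `FailureRecord` type and its field names are hypothetical, not part of any CI platform's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FailureRecord:
    """One first-run test failure plus the outcome of its single retry (hypothetical shape)."""
    test_id: str
    passed_on_retry: bool  # True means the failure is classified as flaky


def compute_tors(failures: List[FailureRecord]) -> Optional[float]:
    """Return TORS as a fraction in [0, 1], or None when there were no failures."""
    if not failures:
        return None  # TORS is undefined without failures
    flaky = sum(1 for f in failures if f.passed_on_retry)
    flaky_rate = flaky / len(failures)  # successful reruns / total failures
    return 1 - flaky_rate


# The example from the text: 100 failures in a week, 5 of which pass on retry.
week = [FailureRecord(f"test_{i}", passed_on_retry=(i < 5)) for i in range(100)]
print(f"TORS = {compute_tors(week):.0%}")  # TORS = 95%
```

Returning `None` for a week with no failures is a design choice in this sketch: it keeps a quiet week from being reported as either 0% or 100%.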

Why It Matters

Directly drives ITS quality - every flaky test failure that triggers agent re-iteration inflates ITS artificially; high TORS means agents' ITS reflects genuine task difficulty, not test noise; achieving ITS targets requires TORS above 95%
Reduces wasted CPI - flaky test failures are pure cost waste: the agent consumes tokens and CI compute responding to a failure that isn't real; at L4 scale, high TORS translates directly to significant cost savings
Enables algorithmic trust in CI - auto-merge systems that bypass human review only work if CI results are trustworthy; a TORS below 90% means auto-merge will occasionally merge broken code (real failures misclassified as flaky) or require human review of everything (defeating the purpose); high TORS is the prerequisite for high auto-approve rates
Agents make worse decisions in noisy environments - an agent that has learned (from many false signals) to discount CI failures will eventually miss a real failure; agents calibrate their response to the reliability of the feedback they receive; trustworthy CI produces more reliable agent behavior
TORS improvement has a clear ROI - unlike many quality metrics, low TORS has a direct, calculable cost: (flaky failure rate) * (failure-triggered agent iterations per week) * (CPI) = weekly wasted spend; investing in test reliability has a measurable payback period
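The cost formula in the last point can be worked through as follows. This is a sketch under stated assumptions: the iteration count covers only iterations triggered by CI failures, CPI is treated as a dollar cost per iteration, and every number in the example is hypothetical.

```python
def weekly_wasted_spend(flaky_failure_rate: float,
                        iterations_per_week: float,
                        cpi: float) -> float:
    """Weekly spend on iterations that respond to flaky (not real) failures.

    flaky_failure_rate: fraction of failures that are flaky, i.e. 1 - TORS.
    iterations_per_week: agent iterations triggered by CI failures (assumption).
    cpi: cost of one iteration in dollars (assumption about how CPI is expressed).
    """
    return flaky_failure_rate * iterations_per_week * cpi


def payback_weeks(investment: float, weekly_savings: float) -> float:
    """Weeks until a one-off reliability investment is recovered."""
    if weekly_savings <= 0:
        raise ValueError("weekly_savings must be positive")
    return investment / weekly_savings


# Hypothetical team: 2000 failure-triggered iterations per week at $3 each.
before = weekly_wasted_spend(0.10, 2000, 3.0)  # TORS 90%
after = weekly_wasted_spend(0.05, 2000, 3.0)   # TORS 95%
savings = before - after
print(f"weekly savings: ${savings:.2f}")
print(f"payback on a $6000 investment: {payback_weeks(6000, savings):.1f} weeks")
```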

Getting Started

Implement automatic test retry and failure tagging - Configure your CI system to automatically retry each failed test once. Tag any test that fails on the first run but passes on the retry as "flaky." Most modern CI platforms (GitHub Actions, Buildkite, CircleCI) support re-running failed jobs, and most test runners support retry-on-failure through a built-in option or a plugin. This is the minimum viable TORS instrumentation.
Build a TORS dashboard - Compute weekly TORS: (total failures - flaky failures) / total failures. Plot this over time. Add a team-level target line at 95%. Publish this dashboard alongside ITS and CPI so the three metrics are always reviewed together.
Identify the flakiest tests by frequency - Sort tests by flaky occurrence count over the past 30 days. The top 10 flakiest tests almost always account for 50-80% of all flaky failures. These are your highest-priority quarantine candidates.
Implement a flaky test quarantine process - When a test is identified as flaky above a threshold (e.g., flaky more than 5 times in 30 days), it should be automatically quarantined: removed from the blocking CI gate and moved to a non-blocking suite that runs but doesn't fail the build. Quarantined tests need a fix-or-delete decision within 30 days.
Fix flaky tests at the root cause - Quarantine stops the immediate pain, but it doesn't fix the test. Assign quarantined tests a priority based on what they're testing (high-coverage tests of critical paths are highest priority). Fix the flakiness: add proper waits instead of sleeps, use isolated test databases, eliminate shared state between tests.
Track TORS trend weekly and set an improvement rate - A team at 85% TORS should target 90% in 30 days and 95% in 90 days. Each weekly improvement is driven by quarantining and fixing the top flaky tests. The rate of improvement is entirely within the team's control - it's a function of investment in test reliability, not a complex technical challenge.
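The ranking and quarantine steps above can be sketched as follows. This assumes the input is a list of test names, one entry per flaky occurrence, already filtered to the past 30 days; the function names are hypothetical, and the threshold mirrors the "more than 5 times" example from the text.

```python
from collections import Counter
from typing import List, Tuple

QUARANTINE_THRESHOLD = 5  # "flaky more than 5 times in 30 days", as in the text


def rank_flaky_tests(flaky_events: List[str]) -> List[Tuple[str, int]]:
    """Return (test name, flaky count) pairs, most frequent first."""
    return Counter(flaky_events).most_common()


def quarantine_candidates(flaky_events: List[str],
                          threshold: int = QUARANTINE_THRESHOLD) -> List[str]:
    """Tests whose flaky count in the window exceeds the threshold."""
    return [name for name, count in rank_flaky_tests(flaky_events) if count > threshold]


def share_of_top(flaky_events: List[str], top_n: int = 10) -> float:
    """Fraction of all flaky failures caused by the top_n flakiest tests."""
    ranked = rank_flaky_tests(flaky_events)
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    return sum(count for _, count in ranked[:top_n]) / total


# Hypothetical 30-day window of flaky occurrences.
events = ["test_checkout"] * 8 + ["test_login"] * 6 + ["test_search"] * 2
print(quarantine_candidates(events))  # ['test_checkout', 'test_login']
print(f"top-1 share: {share_of_top(events, top_n=1):.0%}")  # top-1 share: 50%
```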

Tip

Flaky tests cluster in specific test types. Integration tests that hit real services (database, message queue, external API) are almost always flakier than pure unit tests. If you're below 90% TORS, your flakiest tests are almost certainly integration tests. The fastest path to improving TORS is isolating integration tests from external dependencies using test doubles, not retrying them hoping for better luck.
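As an illustration of the test-double approach, here is a minimal sketch using Python's standard `unittest.mock`. The `convert` function and the client's `get_rate` method are hypothetical stand-ins for code that would otherwise call a real external service.

```python
from unittest.mock import Mock


def convert(amount: float, currency: str, rate_client) -> float:
    """Convert an amount using a rate fetched through an external service client."""
    rate = rate_client.get_rate(currency)  # in production this would be a network call
    return round(amount * rate, 2)


def test_convert_uses_rate_from_client():
    # The double returns a fixed rate, so the test cannot fail because of
    # network latency, service downtime, or changing live data.
    fake_client = Mock()
    fake_client.get_rate.return_value = 1.25
    assert convert(10.0, "EUR", fake_client) == 12.5
    fake_client.get_rate.assert_called_once_with("EUR")


test_convert_uses_rate_from_client()
```

Passing the client in as a parameter is what makes the substitution possible; code that constructs its own client internally would need patching instead.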

Common Pitfalls

Conflating TORS with test coverage. A test suite with 95% TORS might have 40% code coverage. A test suite with 80% code coverage might have 70% TORS. The two metrics are nearly orthogonal. TORS measures whether the tests you have are reliable; coverage measures how much code those tests exercise. Both matter, but for agent reliability, TORS is the more critical short-term investment.

Using retry logic as a substitute for fixing flakiness. Configuring CI to retry tests up to 3 times until they pass improves the apparent pass rate but doesn't improve TORS - it just masks flakiness. Worse, it increases CI latency by up to 3x for flaky tests. Use a single retry to identify flaky tests, not to paper over them.
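The difference between retrying to classify and retrying to hide can be sketched as below. This is an illustration only: `run_test` is a hypothetical callable that returns True on pass, and the point is that a pass on retry is recorded as "flaky" rather than reported as a plain pass.

```python
from typing import Callable, Iterable


def classify_failure(run_test: Callable[[], bool]) -> str:
    """Run a test once; on failure, retry exactly once and classify the outcome."""
    if run_test():
        return "pass"
    # A single retry: a pass here marks the test as flaky, it does not erase the failure.
    return "flaky" if run_test() else "real"


def scripted(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a fake test that returns the given outcomes in order (for demonstration)."""
    it = iter(outcomes)
    return lambda: next(it)


print(classify_failure(scripted([True])))          # pass
print(classify_failure(scripted([False, True])))   # flaky
print(classify_failure(scripted([False, False])))  # real
```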

Quarantining without a fix-or-delete SLA. A quarantine process without a deadline for resolution becomes a quarantine graveyard. Tests are quarantined and then forgotten while they accumulate. Set a firm policy: quarantined tests must be fixed or deleted within 30 days, or they are automatically deleted. Deleted coverage is better than permanently quarantined coverage because it's honest about what's being tested.
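A deadline check of this kind is straightforward to automate. The sketch below assumes a mapping from test name to the date it was quarantined; that mapping, the test names, and the dates are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

FIX_OR_DELETE_SLA = timedelta(days=30)  # the 30-day policy from the text


def overdue_quarantined(quarantined: Dict[str, date], today: date) -> List[str]:
    """Return quarantined tests that have exceeded the fix-or-delete deadline."""
    return sorted(
        name for name, since in quarantined.items()
        if today - since > FIX_OR_DELETE_SLA
    )


# Hypothetical quarantine list.
quarantine = {
    "test_payment_webhook": date(2024, 1, 15),
    "test_search_index": date(2024, 2, 20),
}
print(overdue_quarantined(quarantine, today=date(2024, 3, 1)))  # ['test_payment_webhook']
```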

Not segmenting TORS by test type. An aggregate TORS of 93% might hide a unit test TORS of 99% and an integration test TORS of 80%. The aggregate looks reasonable but the integration test flakiness is actively harming agent workflows. Segment TORS by test type to identify where the real problem is.
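Segmenting is a small change to the TORS computation. The sketch below assumes each failure is recorded as a (test type, passed on retry) pair; the failure counts are hypothetical and chosen to reproduce the 99% / 80% / roughly 93% figures above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tors_by_type(failures: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute TORS per test type from (test_type, passed_on_retry) records."""
    totals: Dict[str, int] = defaultdict(int)
    flaky: Dict[str, int] = defaultdict(int)
    for test_type, passed_on_retry in failures:
        totals[test_type] += 1
        if passed_on_retry:
            flaky[test_type] += 1
    return {t: 1 - flaky[t] / totals[t] for t in totals}


# Hypothetical week: 100 unit failures (1 flaky), 50 integration failures (10 flaky).
failures = (
    [("unit", i < 1) for i in range(100)]
    + [("integration", i < 10) for i in range(50)]
)
aggregate = 1 - sum(1 for _, flaky_flag in failures if flaky_flag) / len(failures)
for test_type, score in tors_by_type(failures).items():
    print(f"{test_type}: {score:.0%}")  # unit: 99%, then integration: 80%
print(f"aggregate: {aggregate:.0%}")    # aggregate: 93%
```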

Treating TORS as a one-time cleanup project. Test flakiness is not a one-time problem - new flaky tests are introduced continuously as the codebase grows. TORS maintenance requires a permanent process: weekly flaky test review, monthly quarantine audit, and an engineering culture that treats new flaky tests as bugs to fix immediately. Teams that run a TORS cleanup sprint and then abandon the process see TORS degrade back to its previous level within 6 months.

How Different Roles See It

Bob, Head of Engineering

Bob's team has an auto-approve system that's not performing as expected. The system is set to auto-merge PRs when CI passes, but it's flagging PRs for human review more than expected. The root cause turns out to be that the system can't distinguish real CI failures from flaky ones - and the team's TORS is 82%.

What Bob should do: Bob should frame the TORS improvement project as a prerequisite for the auto-approve system working correctly. The target is 95% TORS within 90 days. He should assign one senior engineer for 2 weeks to identify and quarantine the 20 flakiest tests (which will likely bring TORS from 82% to ~90%). Then he should budget one engineer-sprint per month for the following 2 months to fix the quarantined tests at root cause. The connection between TORS and auto-approve rate makes the investment easy to justify: every 5-percentage-point improvement in TORS enables a measurable improvement in auto-approve rate, which directly reduces review queue burden on human developers.

Sarah, Productivity Lead

Sarah is investigating why some teams using agents have much better throughput than others. She's collected ITS data but the correlation with productivity is weaker than expected. She suspects test reliability is a confounding variable.

What Sarah should do: Sarah should pull TORS data alongside ITS data for each team and compute the correlation. Her hypothesis: teams with low TORS have artificially inflated ITS (because agents are responding to false failures), which makes their agent workflows appear less efficient than they would be with reliable tests. If the data confirms this, Sarah has a compelling story: improving TORS is the fastest path to reducing ITS for the low-performing teams, because the apparent quality gap between teams is partly a test reliability gap, not a capability gap. This reframes the intervention: instead of retraining teams on agent workflow, fix the tests that are giving agents false signals.
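The correlation Sarah needs can be computed with a plain Pearson coefficient. The sketch below is illustrative only: the per-team TORS and ITS values are invented, and a real analysis would need more teams than this before drawing conclusions.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points each")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Hypothetical per-team data: TORS as a fraction, ITS as the team's average.
team_tors = [0.82, 0.88, 0.93, 0.97]
team_its = [4.1, 3.4, 2.6, 2.0]
r = pearson(team_tors, team_its)
print(f"correlation between TORS and ITS: {r:.2f}")
```

A strongly negative value would be consistent with her hypothesis that teams with lower TORS show higher ITS, though correlation alone does not establish the cause.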

Victor, Staff Engineer - AI Champion

Victor has achieved 98% TORS for his agent workflows by systematically eliminating flaky tests over the past year. He considers TORS the most important infrastructure metric for agent-driven development. He's seen firsthand how a bad test suite makes agents effectively useless.

What Victor should do: Victor should build the team's flaky test detection and quarantine tooling as a platform contribution. The tooling should: automatically identify flaky tests from CI retry data, add them to a quarantine dashboard, create GitHub issues for each quarantined test with the flaky failure data, and send a weekly digest to the responsible team. Victor should also write up the playbook for fixing the most common types of flakiness in the team's tech stack: how to fix database isolation issues in tests, how to replace sleeps with proper async waits, how to mock external service calls reliably. The playbook turns TORS improvement from an expert task (only Victor knows how to fix these tests) into a systematized process that any developer can execute.