Maturity Matrix

TORS > 95%

The Test Oracle Reliability Score (TORS) measures what percentage of test failures represent real defects rather than test infrastructure issues, environmental flakiness, or poorly designed test assertions.

  • TORS > 95% is measured and tracked on a dashboard
  • Auto-approve rate (% of PRs auto-merged as Green) is tracked with a target above 60%
  • Merge queue wait time is tracked with a target under 10 minutes
  • Agent Autonomy Score (% of tasks completed without human intervention) is measured and broken down by task type
  • Metrics trigger automated alerts when thresholds are breached (e.g., TORS drops below 95%)
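
The alerting criterion in the checklist can be sketched as a small threshold check. This is a minimal illustration, not a prescribed implementation: the metric keys (tors_pct, auto_approve_rate_pct, merge_queue_wait_min), the Threshold class, and find_breaches are hypothetical names, and the sample values are invented. Only the three targets (95%, 60%, 10 minutes) come from the checklist above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """A target for one metric (hypothetical structure for illustration)."""
    name: str
    target: float
    higher_is_better: bool

    def breached(self, value: float) -> bool:
        # A "higher is better" metric breaches when it falls below target;
        # a "lower is better" metric breaches when it rises above target.
        return value < self.target if self.higher_is_better else value > self.target


# Targets taken from the checklist; the key names are assumptions.
THRESHOLDS = [
    Threshold("tors_pct", 95.0, higher_is_better=True),
    Threshold("auto_approve_rate_pct", 60.0, higher_is_better=True),
    Threshold("merge_queue_wait_min", 10.0, higher_is_better=False),
]


def find_breaches(metrics: Dict[str, float]) -> List[str]:
    """Return one alert message per metric that is outside its target."""
    alerts = []
    for t in THRESHOLDS:
        if t.name not in metrics:
            continue  # metric not reported in this snapshot; nothing to check
        value = metrics[t.name]
        if t.breached(value):
            direction = "below" if t.higher_is_better else "above"
            alerts.append(f"{t.name}={value} is {direction} target {t.target}")
    return alerts


# Invented sample snapshot: TORS and merge queue wait are out of bounds.
sample = {"tors_pct": 93.5, "auto_approve_rate_pct": 64.0, "merge_queue_wait_min": 12.0}
for alert in find_breaches(sample):
    print(alert)
```

In practice the returned messages would be handed to whatever paging or chat integration the team already uses; that wiring is left out here because it depends on the tooling in place.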

Evidence

  • TORS dashboard showing 95%+ with per-service breakdown
  • Auto-approve rate report showing 60%+ Green target
  • Merge queue wait time chart showing sub-10-minute target

What It Is

The Test Oracle Reliability Score (TORS) measures what percentage of test failures represent real defects rather than test infrastructure issues, environmental flakiness, or poorly designed test assertions. If CI reports 100 failures in a week, and 95 of those failures correctly identify real bugs while 5 are false positives caused by flaky tests or environment issues, TORS is 95%. The target at L4 is TORS above 95%.

TORS is the signal quality metric for the entire agent-driven development pipeline. Agents rely on CI feedback to determine whether their code is correct. When tests are flaky - failing intermittently without any change to the code being tested - agents receive false positive signals: failures that do not correspond to a real defect. The agent reads a CI failure, concludes that its code is wrong, and attempts to fix a problem that doesn't exist. This drives up ITS (each false positive is an unnecessary iteration), increases CPI (each unnecessary iteration costs money), and degrades the code as the agent "fixes" working code in response to phantom test failures.

At L4 (Optimized), where agents are running hundreds of CI iterations per day across multiple parallel workflows, the impact of low TORS compounds dramatically. A TORS of 90% means 10% of all CI failures are noise. At 100 CI failures per day, that's 10 false signals per day that agents are chasing. Each false signal costs one full iteration including model tokens and CI compute. At L4 scale, a 5 percentage point improvement in TORS (90% to 95%) can save tens of thousands of dollars per year in wasted agent compute.

TORS is measured by tagging CI test failures as "real" or "flaky" over time. A "flaky" failure is one where re-running the same test suite with the same code produces a different result. Modern CI systems can be configured to automatically re-run failed tests once and classify the failure as flaky if it passes on the retry. This retry data is the raw material for computing TORS: (successful reruns / total failures) = flaky rate; TORS = 1 - flaky rate.
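
The calculation above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the FailureRecord schema and the compute_tors function are hypothetical names, and the input is assumed to be a list of failed test runs that have already been retried once by the CI system. The sample data reproduces the 100-failure example from earlier in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FailureRecord:
    """One failed CI test run plus its retry outcome (hypothetical schema)."""
    test_id: str
    passed_on_retry: bool  # True: same code, different result, so classified as flaky


def compute_tors(failures: List[FailureRecord]) -> Optional[float]:
    """Return TORS as a fraction in [0, 1], or None when there were no failures."""
    if not failures:
        return None  # TORS is undefined without failures; avoid dividing by zero
    flaky = sum(1 for f in failures if f.passed_on_retry)
    flaky_rate = flaky / len(failures)  # successful reruns / total failures
    return 1.0 - flaky_rate


# 100 failures in a week, 5 of which pass on retry.
week = [FailureRecord(f"test_{i}", passed_on_retry=(i < 5)) for i in range(100)]
tors = compute_tors(week)
print(f"TORS = {tors:.0%}")
```

A single retry only catches failures that happen to pass the second time, so a test that is flaky but fails twice in a row is counted as real under this scheme; the resulting score is best read as an upper estimate of signal quality.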

Why It Matters

  • Directly drives ITS quality - every flaky test failure that triggers agent re-iteration inflates ITS artificially; high TORS means agents' ITS reflects genuine task difficulty, not test noise; achieving ITS targets requires TORS above 95%
  • Reduces wasted CPI - flaky test failures are pure cost waste: the agent consumes tokens and CI compute responding to a failure that isn't real; at L4 scale, high TORS translates directly to significant cost savings
  • Enables algorithmic trust in CI - auto-merge systems that bypass human review only work if CI results are trustworthy; a TORS below 90% means auto-merge will occasionally merge broken code (real failures misclassified as flaky) or require human review of everything (defeating the purpose); high TORS is the prerequisite for high auto-approve rates
  • Agents make worse decisions in noisy environments - an agent that has learned (from many false signals) to discount CI failures will eventually miss a real failure; agents calibrate their response to the reliability of the feedback they receive; trustworthy CI produces more reliable agent behavior
  • TORS improvement has a clear ROI - unlike many quality metrics, TORS has a direct, calculable cost: (flaky failure rate, expressed as the share of agent iterations triggered by a flaky failure) * (agent iterations per week) * (CPI) = weekly wasted spend; investing in test reliability has a measurable payback period
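
The wasted-spend formula in the last bullet can be worked through as follows. This is a rough sketch, not a costing model: the function name weekly_wasted_spend is hypothetical, the flaky failure rate is assumed to be the share of agent iterations that were triggered by a flaky failure, CPI is treated as a dollar cost attributed to one iteration, and every number in the example is invented for illustration.

```python
def weekly_wasted_spend(flaky_failure_rate: float,
                        iterations_per_week: int,
                        cost_per_iteration: float) -> float:
    """Weekly spend on iterations that responded to a failure that was not real.

    Assumes flaky_failure_rate is the fraction of agent iterations caused by
    a flaky failure, and cost_per_iteration stands in for CPI in dollars.
    """
    if not 0.0 <= flaky_failure_rate <= 1.0:
        raise ValueError("flaky_failure_rate must be a fraction between 0 and 1")
    if iterations_per_week < 0 or cost_per_iteration < 0:
        raise ValueError("iterations and cost must be non-negative")
    return flaky_failure_rate * iterations_per_week * cost_per_iteration


# Invented inputs: 500 agent iterations per week at $2.00 each.
before = weekly_wasted_spend(0.10, 500, 2.00)  # 10% of iterations chase flaky failures
after = weekly_wasted_spend(0.05, 500, 2.00)   # after halving the flaky share
annual_saving = (before - after) * 52

print(f"weekly waste before: ${before:.2f}")
print(f"weekly waste after:  ${after:.2f}")
print(f"annual saving:       ${annual_saving:.2f}")
```

Dividing the cost of a test-reliability investment by the weekly saving gives the payback period in weeks, which is the comparison the bullet points toward.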

How Different Roles See It

Bob, Head of Engineering

Bob's team has an auto-approve system that's not performing as expected. The system is set to auto-merge PRs when CI passes, but it's flagging PRs for human review more than expected. The root cause turns out to be that the system can't distinguish real CI failures from flaky ones - and the team's TORS is 82%.

Sarah, Productivity Lead

Sarah is investigating why some teams using agents have much better throughput than others. She's collected ITS data but the correlation with productivity is weaker than expected. She suspects test reliability is a confounding variable.

Victor, Staff Engineer - AI Champion

Victor has achieved 98% TORS for his agent workflows by systematically eliminating flaky tests over the past year. He considers TORS the most important infrastructure metric for agent-driven development. He's seen firsthand how a bad test suite makes agents effectively useless.
