Self-driving CI: auto-scaling per agent load

Self-driving CI is a CI system that observes its own load, predicts demand, and scales its infrastructure automatically without any human intervention.

·CI provides sub-minute feedback for standard changes
·CI auto-scales runner capacity based on agent load (no manual capacity planning)
·Production feedback loop auto-adjusts the CI test suite (adds tests for observed failures, removes redundant tests)

·CI runner utilization stays between 50% and 80% (auto-scaling prevents both waste and queuing)
·Test suite evolution is auditable (each auto-added/removed test has a provenance record)

Evidence

·CI run duration dashboard showing sub-minute median for standard changes
·Auto-scaling configuration and runner utilization metrics
·Test suite change log showing production-feedback-driven additions and removals

What It Is

Self-driving CI is a CI system that observes its own load, predicts demand, and scales its infrastructure automatically without any human intervention. When an agent fleet begins an intensive iteration session and submits 200 CI jobs in a 10-minute window, the CI system detects the increasing queue depth, proactively provisions additional runners, and serves all 200 jobs with minimal queuing - then scales back down when the burst ends. No engineer manually adjusts runner counts; no oncall rotation manages CI capacity; the system regulates itself.

The "self-driving" label captures the key property: the CI system drives its own operational parameters based on observed conditions. This goes beyond simple reactive autoscaling (add runners when queue grows, remove runners when queue shrinks). Self-driving CI incorporates predictive scaling (anticipate load based on historical patterns and current agent activity signals), heterogeneous scaling (provision different runner types for different job types - large machines for compilation, small machines for lint), and capacity reservation (maintain a pre-warmed pool sized for the team's peak agent usage, not just current load).
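
To make the three modes concrete, here is a minimal sketch that combines them for a single runner pool: reactive demand (busy runners plus queued jobs), a predictive forecast, and a reserved pre-warmed floor, all capped by a hard maximum. It illustrates the idea under stated assumptions and is not how any particular CI product computes capacity; `PoolSignals`, `desired_runners`, and the pool names are hypothetical. Heterogeneous scaling comes from evaluating the same rule separately for each runner type.

```python
import math
from dataclasses import dataclass


@dataclass
class PoolSignals:
    """Scaling inputs for one runner pool (one runner type)."""
    busy_runners: int      # runners currently executing jobs
    queued_jobs: int       # jobs waiting for a runner in this pool
    forecast_runners: int  # predicted need for this time slot, from historical load
    reserve_floor: int     # pre-warmed capacity sized for peak agent usage
    max_runners: int       # hard cap so no single signal can scale without bound


def desired_runners(s: PoolSignals, jobs_per_runner: int = 1) -> int:
    # Reactive: capacity for work already running plus work already queued.
    reactive = s.busy_runners + math.ceil(s.queued_jobs / jobs_per_runner)
    # Follow whichever signal asks for the most capacity, then enforce the cap.
    return min(max(reactive, s.forecast_runners, s.reserve_floor), s.max_runners)


# Heterogeneous scaling: the same rule, evaluated separately per runner type.
pools = {
    "large-compile": PoolSignals(busy_runners=10, queued_jobs=25, forecast_runners=20,
                                 reserve_floor=5, max_runners=60),
    "small-lint": PoolSignals(busy_runners=2, queued_jobs=0, forecast_runners=8,
                              reserve_floor=4, max_runners=100),
}
targets = {name: desired_runners(signals) for name, signals in pools.items()}
```

Taking the maximum means any one signal can raise capacity but none can push it below what another signal requires; the cap is what keeps a faulty forecast or a queue-depth spike from turning into an unbounded fleet.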

At L5, self-driving CI is the infrastructure that enables "hundreds of agents on a codebase" patterns. Organizations like Stripe have published accounts of running 1,000+ merges per week; Google runs millions of CI jobs per day. At this scale, static runner pools and manual capacity management are not viable. The CI infrastructure must be as dynamic as the agent workloads it serves.

Self-driving CI at the autonomous maturity level also includes loop-closing feedback: the CI system reports its own performance metrics back to an observability layer that agents can read. An orchestrating agent that sees "CI queue time is elevated, sandbox pool exhausted" can reduce its submission rate, batch changes, or escalate to a human. The CI system and the agent fleet are coupled in a feedback loop, not operating independently.
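
One way to close this loop is a small HTTP endpoint that publishes a CI health snapshot for orchestrators to read. The sketch below uses only Python's standard library; the `/ci/health` path, the field names, and the hard-coded sample values in `read_current_metrics` are assumptions standing in for a real metrics backend.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_payload(queue_depth, avg_wait_s, busy_runners, total_runners,
                   exhaustion_events_last_hour):
    """Shape the CI health snapshot that orchestrating agents read."""
    utilization = busy_runners / total_runners if total_runners else 0.0
    return {
        "queue_depth": queue_depth,
        "avg_wait_seconds": round(avg_wait_s, 1),
        "runner_utilization_pct": round(utilization * 100, 1),
        "pool_exhausted_last_hour": exhaustion_events_last_hour,
    }


def read_current_metrics():
    # Hypothetical placeholder: in practice, query Prometheus, CloudWatch, or the CI provider.
    return health_payload(150, 240.0, 48, 50, 3)


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ci/health":
            self.send_error(404)
            return
        body = json.dumps(read_current_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking call; run it as a small sidecar next to the autoscaler."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Started with `serve(8080)`, a `GET /ci/health` returns JSON such as `{"queue_depth": 150, "avg_wait_seconds": 240.0, ...}`. An orchestrator that sees a 150-job queue and a four-minute average wait can batch its next submissions instead of sending them individually.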

Why It Matters

Agent fleets scale without CI infrastructure becoming the bottleneck - a fleet of 50 concurrent agents generating 500 CI jobs per hour needs a CI system that matches its scale; self-driving CI provides this without manual intervention
Zero engineering oncall for CI capacity - capacity management is automated; engineers focus on CI correctness and optimization, not on "how many runners do we have right now"
Cost optimization through elastic scaling - self-driving CI scales down during off-hours and weekends, paying only for the capacity actually used; static over-provisioning wastes money continuously
Predictive scaling reduces cold-start latency - a CI system that anticipates morning load peaks and pre-warms runners before they're needed delivers consistent sub-minute job start times, even during demand spikes
Provides a CI health signal to the agent layer - agents and orchestrators that can observe CI system health can adapt their behavior (throttle submission, prioritize certain job types) based on actual infrastructure state

Getting Started

Implement reactive autoscaling as the foundation - Before adding predictive scaling, get reactive scaling working: when queue depth exceeds a threshold, provision new runners; when idle time exceeds a threshold, terminate runners. GitHub Actions with Actions Runner Controller (autoscaling runner scale sets on Kubernetes), the Buildkite Elastic CI Stack for AWS (AWS Auto Scaling Groups), and CircleCI's autoscaling resource classes all support reactive autoscaling. Start here.
Instrument job queue depth and runner utilization as real-time metrics - Autoscaling decisions require real-time data. Expose CI queue depth, runner utilization, job wait time, and job success rate as metrics that feed into your scaling controller. Prometheus metrics from the Actions Runner Controller, Buildkite agent metrics, or custom CloudWatch metrics are all viable. Without these metrics, you are tuning autoscaling blind.
Define scaling policies per job type - Agent sandbox jobs (30-second runs, high volume) need different scaling behavior than full integration test jobs (5-minute runs, lower volume). Define separate runner pools and separate autoscaling policies for each job type. A scaling policy that works for sandbox jobs (aggressive scale-up, aggressive scale-down) will thrash for integration test jobs.
Implement predictive scaling based on historical patterns - Analyze 30 days of CI load data to find patterns: load peaks at 10 AM, drops at lunch, rises again at 3 PM, falls to near-zero overnight. Configure scheduled scaling to pre-warm runners before predicted peaks. KEDA (Kubernetes Event-Driven Autoscaling) supports cron-based scaling through its cron scaler, and AWS Auto Scaling supports scheduled scaling actions. Scheduled pre-warming means runners are already available when peak demand arrives, not only after it has arrived.
Expose CI health as an API - Create an endpoint that returns CI system health: current queue depth, average wait time, runner utilization percentage, pool exhaustion incidents in the last hour. Make this endpoint readable by agent orchestrators so they can adapt to CI conditions. An orchestrator that sees "queue depth: 150, wait time: 4 minutes" can batch its next 20 jobs rather than submitting them individually.
Implement a CI SLO and automated response to SLO violations - Define CI SLOs: 95% of jobs start within 30 seconds, 99% complete within 10 minutes. When the SLO is violated (queue growing faster than scaling can respond), trigger an automated response: provision a burst capacity pool (spot instances, on-demand machines reserved for SLO rescue). Alert the platform team for awareness, but don't require human action to resolve the immediate capacity issue.
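
The metrics step and the SLO step fit together: the same job records that feed the scaling controller also tell you whether "95% of jobs start within 30 seconds" is being met. This is a minimal sketch assuming job records with queue, start, and finish timestamps; `Job`, `ci_metrics`, `check_slo`, and the two callbacks are hypothetical names, and exporting the values to Prometheus or CloudWatch is left out. The "99% complete within 10 minutes" SLO would be computed the same way from `finished_at - started_at`.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    queued_at: float                    # epoch seconds when the job entered the queue
    started_at: Optional[float] = None  # None while the job is still waiting
    finished_at: Optional[float] = None
    succeeded: Optional[bool] = None    # None until the job finishes


def ci_metrics(jobs, busy_runners, total_runners):
    waiting = [j for j in jobs if j.started_at is None]
    waits = sorted(j.started_at - j.queued_at for j in jobs if j.started_at is not None)
    done = [j for j in jobs if j.succeeded is not None]
    return {
        "queue_depth": len(waiting),
        # Nearest-rank 95th percentile of time-to-start.
        "p95_wait_s": waits[math.ceil(0.95 * len(waits)) - 1] if waits else 0.0,
        "runner_utilization": busy_runners / total_runners if total_runners else 0.0,
        "success_rate": sum(j.succeeded for j in done) / len(done) if done else None,
        "started_within_30s": sum(w <= 30 for w in waits) / len(waits) if waits else 1.0,
    }


def check_slo(metrics, provision_burst_pool, notify, start_slo=0.95):
    """Automated response to a start-latency SLO violation."""
    if metrics["started_within_30s"] < start_slo:
        provision_burst_pool()  # fix capacity first, without waiting for a human
        notify("CI start-latency SLO violated; burst pool provisioned")  # awareness only
        return True
    return False
```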
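
For the predictive-scaling step, the following sketch turns historical load samples into a 24-hour table of minimum runner counts. It assumes samples of `(timestamp, concurrent_jobs)` covering about 30 days; the function name, the 90th-percentile choice, and the one-hour lead are assumptions, and translating the table into KEDA cron triggers or AWS scheduled actions is not shown.

```python
import math
from collections import defaultdict
from datetime import datetime
from typing import Iterable


def hourly_prewarm_schedule(samples: Iterable[tuple[datetime, int]],
                            quantile: float = 0.9, lead_hours: int = 1) -> dict:
    """Return {hour_of_day: minimum_runners} built from historical load.

    The floor for hour h covers the demand of hour h and of the next
    lead_hours hours, so runners are warm before a predicted peak.
    """
    by_hour = defaultdict(list)
    for ts, concurrent in samples:
        by_hour[ts.hour].append(concurrent)

    demand = {}
    for hour, values in by_hour.items():
        values.sort()
        # Nearest-rank quantile: provision for a busy day, not an average one.
        demand[hour] = values[math.ceil(quantile * len(values)) - 1]

    return {
        hour: max(demand.get((hour + k) % 24, 0) for k in range(lead_hours + 1))
        for hour in range(24)
    }
```

With `lead_hours=1`, a recurring 10 AM peak raises the floor from 9 AM onward, and hours with no history get a floor of zero.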

Tip

KEDA (Kubernetes Event-Driven Autoscaling) is a good fit for teams running Actions Runner Controller on Kubernetes. KEDA's GitHub Runner scaler can scale the runner pool based on the GitHub Actions job queue directly, giving much tighter, more responsive scaling than time-based or CPU-based policies. A KEDA scaler watching the GitHub Actions queue is a production-grade approach for self-driving CI.


Common Pitfalls

Reactive autoscaling without adequate pre-warm time. If cloud VMs take 90 seconds to provision and your CI demand can spike from 0 to 50 concurrent jobs in 30 seconds (common with agent fleets), reactive scaling can't respond fast enough. The queue grows faster than capacity can be added. Solve this with: (1) pre-warmed container pools that can be activated immediately, (2) over-provisioning a small "burst reserve" pool, or (3) spot instance pre-warming.
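
A rough way to size the burst reserve in option (2) is to count how many jobs arrive within one provisioning delay. This sketch assumes jobs arrive evenly over the spike, each consumed warm runner immediately triggers provisioning of a replacement, and jobs hold their runners longer than the spike lasts; `burst_reserve` is a hypothetical helper.

```python
import math


def burst_reserve(peak_jobs: int, ramp_s: float, provision_s: float) -> int:
    """Warm runners needed so no job in a spike waits on a cold VM.

    Replacements requested as warm runners are consumed only come online
    provision_s later, so the reserve must cover every job that arrives
    within any provision_s-long window of the spike (at most the whole spike).
    """
    arrivals_per_window = math.ceil(peak_jobs * provision_s / ramp_s)
    return min(peak_jobs, arrivals_per_window)
```

With the numbers above (50 jobs in 30 seconds, 90-second provisioning) it returns 50: every job in the spike needs a runner that is already warm. Spread the same 50 jobs over five minutes and the estimate drops to 15.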

Scaling without cost controls. Self-driving CI that scales aggressively without cost controls can generate unexpected cloud bills. Set hard limits on maximum concurrent runners, configure budget alerts, and review scaling events weekly. A misconfigured scaling policy that scales to 500 runners instead of 50 is a billing emergency.
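
The hard limit and a budget check can sit directly in the scaling path. This sketch assumes a flat hourly price per runner and roughly 730 hours in a month; `CostGuard` and its fields are hypothetical, and it complements rather than replaces the cloud provider's own budget alerts.

```python
from dataclasses import dataclass


@dataclass
class CostGuard:
    max_runners: int            # hard ceiling on concurrent runners
    runner_cost_per_hour: float
    monthly_budget: float

    def clamp(self, desired: int) -> int:
        # Never exceed max_runners, however large the scaling request.
        return max(0, min(desired, self.max_runners))

    def projected_monthly_cost(self, runners: int, hours: float = 730) -> float:
        return runners * self.runner_cost_per_hour * hours

    def over_budget(self, runners: int) -> bool:
        # True means: alert someone before this level is sustained.
        return self.projected_monthly_cost(runners) > self.monthly_budget
```

With a 50-runner cap, a misconfigured request for 500 runners is clamped to 50; at a hypothetical $0.50 per runner-hour, sustaining those 50 runners projects to about $18,250 a month and would be flagged against a $10,000 budget.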

Ignoring spot instance interruption in scaling policies. Spot capacity (AWS Spot Instances, GCP Spot and preemptible VMs, Azure Spot VMs) is typically 60-90% cheaper than on-demand but can be reclaimed at short notice: a two-minute warning on AWS, about 30 seconds on GCP and Azure. CI jobs running on spot instances must be interruptible and retriable. Configure your scaling policy to use spot for non-urgent agent sandbox jobs and on-demand for time-sensitive gate jobs.
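
A minimal sketch of that placement rule plus a retry loop, assuming the runner layer raises an exception when the cloud reclaims a spot instance. `JobSpec`, `SpotInterrupted`, `capacity_type`, and `run_with_retry` are hypothetical; falling back to on-demand after repeated interruptions is a design choice added here so a job cannot be starved, not something the cloud providers do for you.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    name: str
    gate: bool           # True for time-sensitive merge-gate jobs
    max_retries: int = 2  # spot attempts before falling back to on-demand


class SpotInterrupted(Exception):
    """Raised by the (hypothetical) runner layer when a spot instance is reclaimed."""


def capacity_type(job: JobSpec) -> str:
    # Non-urgent agent sandbox jobs tolerate interruption; gate jobs stay on on-demand.
    return "on-demand" if job.gate else "spot"


def run_with_retry(job: JobSpec, run):
    """run(job, capacity) executes the job; it must be safe to re-run from scratch."""
    if capacity_type(job) == "on-demand":
        return run(job, "on-demand")
    for _ in range(job.max_retries):
        try:
            return run(job, "spot")
        except SpotInterrupted:
            pass  # instance reclaimed mid-run; retry on fresh spot capacity
    return run(job, "on-demand")  # last resort after repeated interruptions
```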

Not tuning scale-down delay appropriately. Aggressive scale-down (terminating idle runners immediately) reduces cost but causes constant thrashing in workloads with alternating busy and idle periods: every short lull terminates runners that the next burst must then wait to reprovision. Set the scale-down delay to 10-15 minutes. Runners that have been idle for that long are likely genuinely unused, so terminating them rarely forces the next burst to wait on a cold start.
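
The trade-off becomes visible when you replay observed idle gaps against candidate delays. This sketch models a single runner and assumes the gaps are measured in minutes from your own metrics; `scale_down_tradeoff` and the sample gaps below are hypothetical.

```python
def scale_down_tradeoff(idle_gaps_min, delay_min):
    """Compare one scale-down delay against observed idle gaps between bursts.

    Returns (cold_starts, idle_minutes_paid). A gap longer than the delay means
    the runner was terminated and the next burst waited for a fresh one; every
    gap is paid for as idle time up to the delay.
    """
    cold_starts = sum(1 for gap in idle_gaps_min if gap > delay_min)
    idle_paid = sum(min(gap, delay_min) for gap in idle_gaps_min)
    return cold_starts, idle_paid
```

For gaps of `[3, 8, 12, 40, 5, 90]` minutes, a delay of 0 gives `(6, 0)`: six cold starts and no idle time paid. A 15-minute delay gives `(2, 58)`: only the 40- and 90-minute gaps cause cold starts, at the cost of 58 idle runner-minutes.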

Building a self-driving CI system that humans don't understand. A CI system that automatically provisions and terminates infrastructure can become a black box that engineers don't understand when it misbehaves. Ensure all scaling events are logged with clear reasons, human-readable dashboards show current state, and runbooks exist for overriding automatic behavior (adding capacity manually, pausing autoscaling during incidents).
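
Structured scaling-event logs and an explicit pause switch cover the first and last of those needs. This sketch assumes Python's standard `logging` module feeds your log pipeline; `apply_scaling` and its `paused` flag are hypothetical stand-ins for whatever override your runbook documents.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("ci.autoscaler")


def apply_scaling(pool, current, proposed, reason, signals, paused=False):
    """Apply (or skip) a scaling decision and log a human-readable event for it."""
    target = current if paused else proposed
    if paused and proposed != current:
        reason = f"autoscaling paused by operator (would have acted on: {reason})"
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "pool": pool,
        "action": ("scale_up" if target > current
                   else "scale_down" if target < current else "no_change"),
        "runners_before": current,
        "runners_after": target,
        "reason": reason,    # why the controller acted, in plain words
        "signals": signals,  # the raw inputs behind the decision
    }
    logger.info(json.dumps(event, sort_keys=True))
    return event
```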


How Different Roles See It

Bob, Head of Engineering

Bob's engineering team has grown from 15 to 40 developers over the past year, and AI agent adoption is generating 10x the CI load of a year ago. The platform team is spending 3-4 hours per week manually adjusting runner counts based on observed load patterns - scaling up on Monday morning, scaling down Friday afternoon, emergency scaling during sprint-end pushes. This manual capacity management is unsustainable and error-prone.

Bob should fund the platform team to implement autoscaling properly. The work is well-defined: Actions Runner Controller with KEDA, configured with the team's load patterns. The implementation is a 1-2 sprint project with a clear success metric: zero manual runner count adjustments for 60 consecutive days after implementation. Bob should frame this as operational work with a direct cost impact: 3-4 hours per week of platform engineer time at fully-loaded cost, saved continuously after implementation. The ROI calculation is straightforward and the work is non-trivial but well-understood.

Sarah, Productivity Lead

Sarah's CI feedback latency dashboard shows a pattern: CI queue times spike Monday mornings and Thursday afternoons, correlating with sprint start (Monday) and sprint-end push (Thursday). These are predictable, recurring patterns that predictive scaling could address. But the current system only responds reactively, meaning CI is slow for 20-30 minutes at the start of each peak period while the reactive scaling catches up.

Sarah should present this pattern analysis to the platform team: "CI queue time exceeds SLO for 25 minutes at the start of every Monday and every Thursday afternoon. This is a predictable pattern that scheduled pre-warming would eliminate." She should estimate the developer-time cost of these recurring slowdowns: if 40 developers each experience 5 minutes of elevated queue time twice a week (Monday morning, Thursday afternoon), that's 400 developer-minutes per week - 6.7 developer-hours - lost to predictable, preventable CI slowness. The scheduled pre-warming implementation is a few hours of work that recovers 6.7 developer-hours per week indefinitely.

Victor, Staff Engineer - AI Champion

Victor has implemented KEDA-based autoscaling for his team's runner pool and it works well: the pool scales from 0 to 20 runners within 60 seconds of a demand spike and back to 0 within 15 minutes of inactivity. His CI costs dropped 60% compared to the previous static pool of 10 runners.

Victor should take the next step: make the CI system observable to agents. He should expose a simple HTTP endpoint that returns current CI metrics (/ci/health returning queue depth, wait time, utilization). He should then create a Claude Code MCP tool that agents can call to check CI health before submitting large batches of changes. An agent that checks CI health before submitting 20 sandbox runs can decide to stagger them (submit 5 at a time, wait for results, submit the next 5) when the queue is already long. This agent-CI feedback loop is the defining capability of L5 CI infrastructure: the CI system and the agents that use it are mutually aware and coordinate their behavior.
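
A minimal sketch of the staggering logic, assuming the orchestrator already has a way to submit a job and wait for results. `fetch_ci_health` reads a JSON snapshot from an endpoint such as `/ci/health`; `submit`, `wait_for`, the batch size of 5, and the queue-depth threshold of 50 are hypothetical, and wrapping this as a Claude Code MCP tool is not shown.

```python
import json
import urllib.request


def fetch_ci_health(url: str) -> dict:
    """Read the CI health snapshot, e.g. from a /ci/health endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def submit_staggered(jobs, submit, wait_for, health, batch_size=5, busy_queue_depth=50):
    """Submit everything at once when CI is quiet; otherwise submit in batches,
    waiting for each batch's results before sending the next."""
    if health["queue_depth"] < busy_queue_depth:
        return [submit(job) for job in jobs]
    handles = []
    for start in range(0, len(jobs), batch_size):
        batch = [submit(job) for job in jobs[start:start + batch_size]]
        wait_for(batch)  # let this batch finish before adding more load
        handles.extend(batch)
    return handles
```

With a health snapshot reporting a queue depth of 150, 20 jobs go out as four batches of five; below the threshold, all 20 are submitted at once.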