Infrastructure - L5: Autonomous Agent Runtime & Sandboxing

Auto-scaling: agents scale with load

Auto-scaling for agent fleets means that the compute infrastructure automatically adds or removes capacity in response to agent task load, without manual intervention.

  • Dedicated compute infrastructure exists for the agent fleet (not shared with developer workstations or production)
  • Agent fleet auto-scales with load (scales up during business hours, scales down off-hours)
  • Each agent runs in a fully isolated environment (Cursor approach: one machine per agent, or smart resource management)
  • Cost per agent-hour is tracked and optimized
  • Fleet scaling responds to demand within 60 seconds

Evidence

  • Infrastructure allocation showing dedicated agent compute (separate from dev and prod)
  • Auto-scaling configuration and scaling event logs
  • Agent fleet dashboard showing per-agent isolation and resource utilization

What It Is

Auto-scaling for agent fleets means the compute infrastructure adds or removes capacity automatically in response to agent task load, with no manual intervention. When a sprint kicks off and 50 developers simultaneously dispatch agent tasks, the fleet scales out to absorb the demand; when it is 2am and load is low, the fleet scales in to cut costs. Scaling is driven by metrics that reflect actual agent demand: task queue depth, waiting-task age, devbox utilization rate, and pre-warmed pool depletion rate.

Standard Kubernetes autoscaling (the Horizontal Pod Autoscaler) is designed for services with steady load, not for bursty task workloads. The right tool for agent fleet autoscaling is KEDA (Kubernetes Event-Driven Autoscaling), which can scale based on external queue metrics. KEDA reads the depth of the agent task queue (from SQS, Redis, RabbitMQ, or another queue backend) and scales the devbox pool size proportionally. When there are 50 tasks in the queue, KEDA creates 50 devboxes. When the queue is empty, KEDA scales down to the minimum pool size.
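A minimal KEDA ScaledObject for this pattern might look like the sketch below, using KEDA's Redis Lists scaler. The Deployment name (`devbox`), namespace, Redis address, and list name are illustrative assumptions, not details from this document:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: devbox-pool
  namespace: agents
spec:
  scaleTargetRef:
    name: devbox               # hypothetical Deployment backing the devbox pool
  minReplicaCount: 5           # small standing pool for baseline demand
  maxReplicaCount: 100         # cost cap
  cooldownPeriod: 300          # wait 5 minutes of empty queue before scaling in
  triggers:
    - type: redis              # Redis Lists scaler: queue depth drives replicas
      metadata:
        address: redis.agents.svc.cluster.local:6379
        listName: agent-tasks
        listLength: "1"        # target one devbox per queued task
```

With `listLength: "1"`, KEDA targets one replica per queued task, capped by `maxReplicaCount`; the `cooldownPeriod` keeps the pool from thrashing when the queue briefly empties between bursts.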

Agent task load has distinct patterns that make the scaling parameters different from typical web services. Agent load is bursty and correlated: developers tend to dispatch tasks at the start of work sessions (morning) and before major review cycles (end of sprint). These bursts can be 5-10x the average load. Burst absorption requires either maintaining a large standing pool (expensive) or scaling out quickly when burst demand arrives. The right approach combines a small standing pool (handles baseline demand immediately) with fast scale-out (adds capacity within 60-90 seconds when the queue grows).
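The standing-pool-plus-fast-scale-out approach reduces to a simple sizing function; the function name and defaults below are illustrative, not taken from the source:

```python
def desired_devboxes(queue_depth: int, standing_pool: int = 5, max_pool: int = 100) -> int:
    """Target pool size: one devbox per queued task, but never below the
    standing pool (instant burst absorption) and never above the cost cap."""
    return max(standing_pool, min(queue_depth, max_pool))
```

For example, `desired_devboxes(0)` returns 5 (an idle fleet keeps only the standing pool), `desired_devboxes(50)` returns 50, and `desired_devboxes(500)` returns 100 (the cap absorbs pathological spikes instead of exploding cost).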

At L5, scaling is not just about the compute nodes - it is about the full stack: the pre-warmed container pool scales with load (more pre-warmed containers when queue depth is high), the node pool scales with the container pool (more nodes when containers need more hosts), and the task queue scales with submission rate (more queue consumers when more tasks are arriving). Each layer has its own scaling policy, and the policies are coordinated to prevent cascading failures when load spikes.
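One way to coordinate the pre-warmed container layer with the same demand signals is to size it from current queue depth plus the pool's depletion rate. This is a sketch under assumed parameters (a 90-second replenish cycle, a floor of 3, a cap of 20), not a policy stated by the source:

```python
def prewarm_target(queue_depth: int, depletion_rate: float,
                   replenish_secs: int = 90, floor: int = 3, cap: int = 20) -> int:
    """Size the pre-warmed pool to cover expected depletion (containers
    consumed per second) over one replenish cycle, plus the current backlog."""
    needed = queue_depth + int(depletion_rate * replenish_secs)
    return max(floor, min(needed, cap))
```

When the queue is empty and nothing is being consumed, the pool sits at its floor; a rise in either signal grows it toward the cap, mirroring how the node layer grows underneath it, which is the cross-layer coordination described above.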

Why It Matters

  • Cost efficiency across day/night cycles - agent workloads are strongly time-of-day dependent; auto-scaling that scales in at night reduces compute costs by 60-80% during low-demand periods without manual intervention
  • Burst absorption without permanent over-provisioning - without auto-scaling, fleets must be sized for peak demand; with auto-scaling, fleets can be sized for average demand and scale out for bursts, significantly reducing standing compute costs
  • Developer trust requires consistent availability - developers who try to dispatch agent tasks during a peak period and encounter queuing will stop using agents; auto-scaling that keeps queue depth near zero maintains developer trust in the infrastructure
  • Enables load-responsive pre-warming - the pre-warmed container pool should grow when demand is rising and shrink when demand is falling; this requires the same scaling signals that drive node scaling, coordinated across both layers
  • Autonomous agent workflows depend on capacity availability - automated agent pipelines (triggered by CI, alerts, or schedules) cannot predict when compute will be available; auto-scaling ensures capacity is always available when automation triggers agent tasks


How Different Roles See It

Bob - Head of Engineering

Bob's dedicated agent fleet has a fixed size, and the operations team adjusts it manually every Monday morning to account for the coming week's expected load. They sometimes get it wrong: an end-of-sprint crunch causes more agent use than expected, the fleet gets saturated, tasks queue, developers complain. Bob wants to automate this but is not sure whether the infrastructure is ready for auto-scaling.


Sarah - Productivity Lead

Sarah has been monitoring agent task queue depths and has found a clear pattern: queue depth spikes on Monday mornings and Thursday afternoons (pre-sprint review), with wait times sometimes reaching 8-10 minutes during peak hours. These are exactly the times when developers most want fast agent results - the beginning of their work week and the run-up to sprint reviews. The infrastructure is slowest precisely when the need is highest.


Victor - Staff Engineer, AI Champion

Victor has been running KEDA in his personal Kubernetes cluster for experimentation and has a working configuration for queue-depth-based devbox scaling. He has tested it with simulated load and measured the scale-out latency: 75 seconds from queue depth exceeding the threshold to the first new devbox accepting tasks. He thinks 75 seconds is acceptable but wants to get it under 45 seconds by using pre-baked AMIs for faster node provisioning.
