Auto-scaling: agents scale with load

Auto-scaling for agent fleets means that the compute infrastructure automatically adds or removes capacity in response to agent task load, without manual intervention.

·Dedicated compute infrastructure exists for agent fleet (not shared with developer workstations or production)
·Agent fleet auto-scales with load (agents scale up during business hours, scale down off-hours)
·Each agent runs in a fully isolated environment (Cursor approach: one machine per agent, or smart resource management)

·Cost per agent-hour is tracked and optimized
·Fleet scaling responds to demand within 60 seconds

Evidence

·Infrastructure allocation showing dedicated agent compute (separate from dev and prod)
·Auto-scaling configuration and scaling event logs
·Agent fleet dashboard showing per-agent isolation and resource utilization

What It Is

Auto-scaling for agent fleets means that the compute infrastructure automatically adds or removes capacity in response to agent task load, without manual intervention. When a sprint kicks off and 50 developers simultaneously dispatch agent tasks, the fleet scales out to absorb the demand. When it is 2am and load is low, the fleet scales in to reduce costs. The scaling is driven by metrics that reflect actual agent demand: task queue depth, waiting task age, devbox utilization rate, and pre-warmed pool depletion rate.

Standard Kubernetes autoscaling (the Horizontal Pod Autoscaler) is designed for services with steady load, not for bursty task workloads. The right tool for agent fleet autoscaling is KEDA (Kubernetes Event-Driven Autoscaling), which can scale based on external queue metrics. KEDA reads the depth of the agent task queue (from SQS, Redis, RabbitMQ, or another queue backend) and scales the devbox pool size proportionally. When there are 50 tasks in the queue, KEDA creates 50 devboxes. When the queue is empty, KEDA scales down to the minimum pool size.

Agent task load has distinct patterns that make the scaling parameters different from typical web services. Agent load is bursty and correlated: developers tend to dispatch tasks at the start of work sessions (morning) and before major review cycles (end of sprint). These bursts can be 5-10x the average load. Burst absorption requires either maintaining a large standing pool (expensive) or scaling out quickly when burst demand arrives. The right approach combines a small standing pool (handles baseline demand immediately) with fast scale-out (adds capacity within 60-90 seconds when the queue grows).

At L5, scaling is not just about the compute nodes - it is about the full stack: the pre-warmed container pool scales with load (more pre-warmed containers when queue depth is high), the node pool scales with the container pool (more nodes when containers need more hosts), and the task queue scales with submission rate (more queue consumers when more tasks are arriving). Each layer has its own scaling policy, and the policies are coordinated to prevent cascading failures when load spikes.

Why It Matters

Cost efficiency across day/night cycles - agent workloads are strongly time-of-day dependent; auto-scaling that scales in at night reduces compute costs by 60-80% during low-demand periods without manual intervention
Burst absorption without permanent over-provisioning - without auto-scaling, fleets must be sized for peak demand; with auto-scaling, fleets can be sized for average demand and scale out for bursts, significantly reducing standing compute costs
Developer trust requires consistent availability - developers who try to dispatch agent tasks during a peak period and encounter queuing will stop using agents; auto-scaling that keeps queue depth near zero maintains developer trust in the infrastructure
Enables load-responsive pre-warming - the pre-warmed container pool should grow when demand is rising and shrink when demand is falling; this requires the same scaling signals that drive node scaling, coordinated across both layers
Autonomous agent workflows depend on capacity availability - automated agent pipelines (triggered by CI, alerts, or schedules) cannot predict when compute will be available; auto-scaling ensures capacity is always available when automation triggers agent tasks

Getting Started

Instrument your task queue for scaling signals - The primary scaling signal for agent fleets is task queue depth. Ensure your task queue (SQS, Redis, RabbitMQ) emits metrics that KEDA or your autoscaler can read: queue depth, oldest waiting task age, and consumer count. These metrics drive scaling decisions.
Deploy KEDA on your Kubernetes cluster - KEDA is a CNCF project that extends Kubernetes with event-driven scaling. Install KEDA with helm install keda kedacore/keda. Create a ScaledObject that points to your agent task queue and defines the scaling behavior: minimum replicas, maximum replicas, and the queue depth threshold that triggers scaling.
Configure the Cluster Autoscaler for node-level scaling - Install the Kubernetes Cluster Autoscaler configured for your cloud provider. The Cluster Autoscaler watches for unschedulable pods (devboxes that cannot find a node to run on) and triggers node provisioning. Set scale-out parameters: node provision time target (should be under 90 seconds for pre-baked AMIs), maximum node count, and scale-in delay (wait 10 minutes after load drops before removing nodes to avoid flapping).
Set scaling thresholds based on load measurement - Measure your typical load patterns over two weeks. Identify the average tasks-per-developer, the peak multiplier (peak load / average load), and the duration of typical peaks. Set your KEDA scaling target (one devbox per N queue items) to keep queue wait time under 30 seconds during typical peaks.
Implement scale-out pre-warming - When KEDA detects queue growth and starts scaling out the devbox pool, the new devboxes need to be pre-warmed before they can accept tasks. Implement a pipeline: KEDA scales out devbox pods, the devbox initialization process pre-warms the environment, a readiness probe signals when pre-warming is complete, and the task queue dispatcher assigns tasks only to ready devboxes.
Test scale-out under synthetic load - Before relying on auto-scaling in production, test it under synthetic load. Submit 50 tasks simultaneously and observe: how long before new nodes are available, how long before new devboxes are ready, what is the queue wait time for the last task submitted. This test reveals the actual scale-out latency and validates that the configuration meets your SLO targets.

Tip

Implement a scale-out "floor" that keeps a minimum number of nodes always running to absorb short bursts without triggering node provisioning. Node provisioning is typically 60-120 seconds even with pre-baked AMIs. A floor of 20% of peak capacity means that bursts up to 20% of peak can be absorbed immediately, and scaling only kicks in for larger bursts. This dramatically improves burst response time at modest standing cost.

Common Pitfalls

Scaling based on CPU/memory rather than queue depth. Standard Kubernetes HPA scales on CPU and memory utilization. Agent devboxes often have low CPU while waiting for LLM API responses. A devbox that is 5% CPU utilization but waiting on an LLM response will not trigger CPU-based scale-out even if the task queue is deep. Use KEDA with queue depth as the primary scaling signal.

Not accounting for pre-warming time in scale-out latency. A node that provisions in 60 seconds and a devbox that warms up in 60 seconds gives you a 2-minute scale-out latency. This latency is hidden when looking at node provisioning time alone. Measure the full scale-out latency from "queue depth exceeds threshold" to "task starts executing" and set your scaling thresholds to ensure queue wait time stays within your SLO.

Scale-in that destroys pre-warmed containers too aggressively. When load drops, aggressive scale-in destroys pre-warmed containers. If load then spikes again immediately (common at sprint transition times), the fleet has no pre-warmed capacity and falls back to cold-start times. Set a scale-in cooldown that keeps pre-warmed containers for at least 10-15 minutes after demand drops, and scale in gradually rather than all at once.

No cost controls on maximum scale. An auto-scaling fleet with no maximum cap can scale to 500 nodes during an unexpected load event (a CI pipeline that spawns 500 agent tasks in parallel, for example) and produce a surprise cloud bill. Always set a maximum node count and a maximum simultaneous devbox count. When the maximum is reached, task queue depth increases but costs are bounded.

Treating auto-scaling as infrastructure set-and-forget. Load patterns change as teams grow, as new agent workflows are introduced, and as sprint cadences evolve. Scaling parameters that are correct for a 10-developer team may be wrong for a 100-developer team. Review scaling parameters quarterly and after any significant change in agent usage patterns.

How Different Roles See It

BobHead of Engineering

Bob's dedicated agent fleet has a fixed size, and the operations team adjusts it manually every Monday morning to account for the coming week's expected load. They sometimes get it wrong: an end-of-sprint crunch causes more agent use than expected, the fleet gets saturated, tasks queue, developers complain. Bob wants to automate this but is not sure whether the infrastructure is ready for auto-scaling.

What Bob should do: Bob should check the prerequisites for auto-scaling: is the task queue instrumented with queue depth metrics? Is the Kubernetes Cluster Autoscaler deployed? If yes, adding KEDA is a one-day project. Bob should have his infrastructure engineer deploy KEDA, configure a ScaledObject that scales the devbox pool based on queue depth, and test it with a simulated load spike. If the prerequisites are not in place, Bob should make instrumenting the task queue and deploying the Cluster Autoscaler the immediate infrastructure sprint, with KEDA as the next step. The manual Monday morning scaling adjustment is a clear candidate for elimination, and its elimination is a concrete success metric.

SarahProductivity Lead

Sarah has been monitoring agent task queue depths and has found a clear pattern: queue depth spikes on Monday mornings and Thursday afternoons (pre-sprint review), with wait times sometimes reaching 8-10 minutes during peak hours. These are exactly the times when developers most want fast agent results - the beginning of their work week and the run-up to sprint reviews. The infrastructure is slowest precisely when the need is highest.

What Sarah should do: Sarah should use the queue depth data to make the case for auto-scaling. The data shows exactly when the fleet is undersized relative to demand, and the pattern is predictable enough to be addressed with either auto-scaling or scheduled scaling (scale up before Monday morning, scale up before Thursday afternoon). Scheduled scaling is simpler to implement than fully reactive auto-scaling and can be deployed in a day using Kubernetes CronJobs or KEDA's cron trigger. Sarah should propose scheduled scaling as an immediate improvement while reactive auto-scaling is implemented, giving developers immediate relief while the more complete solution is built.

VictorStaff Engineer - AI Champion

Victor has been running KEDA in his personal Kubernetes cluster for experimentation and has a working configuration for queue-depth-based devbox scaling. He has tested it with simulated load and measured the scale-out latency: 75 seconds from queue depth exceeding the threshold to the first new devbox accepting tasks. He thinks 75 seconds is acceptable but wants to get it under 45 seconds by using pre-baked AMIs for faster node provisioning.

What Victor should do: Victor should document his KEDA configuration and present it as a proposal for the team's production fleet. The documentation should include: the ScaledObject YAML, the Cluster Autoscaler configuration, the scale-out latency measurement methodology, and a runbook for what to do when scaling behaves unexpectedly. Victor should also propose the pre-baked AMI optimization - creating a custom AMI that has Docker pre-installed, the agent container image pre-pulled, and the node initialization scripts pre-executed. With a pre-baked AMI, node provisioning can drop from 90+ seconds to under 30 seconds, bringing the total scale-out latency close to the 45-second target. Victor should volunteer to build the pre-baked AMI as part of the same sprint that deploys auto-scaling to production.

From the Field

Recent releases, projects, and discussions relevant to this maturity level.

discoveredL5

KI-OS-org/ki-os-communityKI Operating System — Local-first AI infrastructure. Multi-Agent Orchestration, Ghost Control, One Voice Engine.KI-OS v1.6.0 establishes a local-first runtime layer to transition organizations from ad-hoc prompt engineering to systematic AI systems engineering. The framewgithub.com

releaseL5

openai/codexOpenAI's pre-release of rusty-v8 version 147.4.0 signals a critical infrastructure update for the Codex runtime environment, focusing on the Rust-to-V8 engine bgithub.com

articleL5

techcrunch.comReplit's Amjad Masad on Cursor deal, Apple fight, and why he'd rather not sellReplit’s strategic alignment with Cursor signals a shift from local-first AI editing to integrated cloud-native autonomous operations. The deal facilitates deeptechcrunch.com

discoveredL5

0xSero/turboquantTurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integrationTurboQuant doubles LLM context capacity (up to 914,144 tokens) through asymmetric KV cache quantization (3-bit keys, 2-bit values) using custom Triton kernels igithub.com

Where does your team actually sit on this?

This guide describes one level of one area. Run the assessment to place your team across all 16 areas, see which gates you have passed, and get a report you can take to your stakeholders.

Start the assessment

Agent Runtime & Sandboxing

Agent fleet on dedicated compute Each agent = isolated machine (Cursor approach) or shared with smart resource management; full sovereign runtime as a real option (open-weight GLM-5.2 / Kimi K2.7-Code / DeepSeek V4 on-device via llama.cpp/vLLM - same capability, no vendor lock-in)