Infrastructure · L5 · Autonomous Agent Runtime & Sandboxing

Agent fleet on dedicated compute


  • Dedicated compute infrastructure exists for the agent fleet (not shared with developer workstations or production)
  • Agent fleet auto-scales with load (agents scale up during business hours, scale down off-hours)
  • Each agent runs in a fully isolated environment (Cursor approach: one machine per agent, or smart resource management)
  • Cost per agent-hour is tracked and optimized
  • Fleet scaling responds to demand within 60 seconds
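The scaling-responsiveness criterion above can be checked mechanically from autoscaler event logs. A minimal sketch, assuming a hypothetical log format of `(timestamp, event_kind)` pairs — real autoscaler events will need parsing into this shape first:

```python
from datetime import datetime, timedelta

def scaling_response_times(events):
    """Pair each scale-up request with the time the new capacity
    became ready, and return the lag for each pair in seconds."""
    lags = []
    pending = []  # timestamps of unmatched scale-up requests
    for ts, kind in sorted(events):
        if kind == "scale_up_requested":
            pending.append(ts)
        elif kind == "capacity_ready" and pending:
            lags.append((ts - pending.pop(0)).total_seconds())
    return lags

def meets_60s_target(events, threshold=60.0):
    """True when every observed scale-up completed within the target."""
    lags = scaling_response_times(events)
    return bool(lags) and max(lags) <= threshold
```

Running this over a rolling window of scaling event logs (the second evidence item above) gives a pass/fail signal for the 60-second criterion rather than a one-off spot check.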

Evidence

  • Infrastructure allocation showing dedicated agent compute (separate from dev and prod)
  • Auto-scaling configuration and scaling event logs
  • Agent fleet dashboard showing per-agent isolation and resource utilization

What It Is

An agent fleet on dedicated compute is the infrastructure pattern where AI agent workloads run on a distinct, purpose-built compute layer that is separate from developer laptops, CI infrastructure, and application workloads. Instead of agents sharing resources with other systems, there is a compute cluster specifically sized, configured, and optimized for running agent processes. The fleet is managed as a first-class infrastructure concern with its own capacity planning, monitoring, autoscaling, and operational runbooks.

At L5, agent workloads are substantial enough to warrant dedicated compute. A team running 50-100 simultaneous agent tasks, 24 hours a day, 7 days a week, needs compute that is always available, appropriately sized, and not subject to resource contention from unrelated workloads. Sharing this with CI (which has peak-and-trough load patterns) or with application servers (which have different resource profiles) creates interference that degrades agent reliability and response time.

The dedicated compute layer is typically Kubernetes-based, because Kubernetes provides the primitives needed for agent fleet management: pod isolation, resource quotas, namespace separation, and integration with autoscaling systems. Agent devboxes run as Kubernetes pods in a dedicated namespace, with node selectors that pin them to nodes in the agent node pool. The agent node pool uses instance types optimized for agent workloads: high memory (agents are memory-intensive), fast local NVMe storage (disk I/O is a key bottleneck for concurrent agent workloads), and high-bandwidth networking (large codebases move significant data).
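The pinning described above can be expressed directly in a pod spec. A minimal sketch that builds the manifest as a plain dict — names such as the `agent-fleet` namespace, the `node-pool: agents` label, and the `agent-devbox:latest` image are illustrative assumptions, not prescribed values:

```python
def agent_devbox_pod(name, mem_gib=16, cpu=4):
    """Build a pod manifest that pins an agent devbox to the
    dedicated agent node pool with explicit resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": "agent-fleet"},
        "spec": {
            # nodeSelector pins the pod to nodes labeled for the agent pool
            "nodeSelector": {"node-pool": "agents"},
            "containers": [{
                "name": "devbox",
                "image": "agent-devbox:latest",  # illustrative image name
                "resources": {
                    # requests == limits: predictable scheduling, no overcommit
                    "requests": {"memory": f"{mem_gib}Gi", "cpu": str(cpu)},
                    "limits": {"memory": f"{mem_gib}Gi", "cpu": str(cpu)},
                },
                # scratch space backed by the node's fast local NVMe disk
                "volumeMounts": [{"name": "scratch", "mountPath": "/workspace"}],
            }],
            "volumes": [{"name": "scratch", "emptyDir": {}}],
        },
    }
```

Setting requests equal to limits gives each agent pod a guaranteed slice of the node, which is the point of the dedicated pool: agent task latency should not depend on what its neighbors are doing.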

The dedicated fleet is also the level at which hardware specialization becomes relevant. Cursor's engineering team documented that disk I/O is the hidden bottleneck when running hundreds of agents: each agent reads large numbers of files, and concurrent file reads on a shared disk system create I/O contention that serializes what should be parallel work. The solution at this scale is dedicated NVMe SSDs with high IOPS ratings, or a distributed filesystem optimized for concurrent reads.
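The I/O contention argument can be made concrete with a back-of-envelope sizing check. Every number below is an illustrative assumption (per-agent read rates and disk IOPS ratings vary widely); the point is the shape of the calculation, not the specific figures:

```python
def io_headroom(agents, reads_per_agent_per_s, disk_iops):
    """Return aggregate IOPS demand and the disk utilization
    fraction; a value above 1.0 means reads queue and nominally
    parallel agent work serializes behind the disk."""
    demand = agents * reads_per_agent_per_s
    return demand, demand / disk_iops

# A shared general-purpose disk (~10k IOPS, assumed) saturates quickly:
demand, util = io_headroom(agents=200, reads_per_agent_per_s=150, disk_iops=10_000)
# demand = 30000, util = 3.0 -> 3x oversubscribed

# A local NVMe SSD (~500k random-read IOPS, assumed) leaves headroom:
demand, util = io_headroom(agents=200, reads_per_agent_per_s=150, disk_iops=500_000)
# util = 0.06
```

Measuring the real per-agent read rate (e.g. with standard block-level I/O statistics) and substituting it into this check is a cheap way to validate whether disk is the binding constraint before buying hardware.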

Why It Matters

  • No resource contention from unrelated workloads - CI pipelines, application servers, and agent workloads have different resource profiles; mixing them on shared compute creates unpredictable performance characteristics that are hard to diagnose and fix
  • Compute can be optimized for agent-specific workloads - agent processes are memory-heavy, disk-read-heavy, and episodically network-intensive; dedicated compute uses instance types and storage configurations matched to this profile rather than to general-purpose workloads
  • Capacity planning becomes straightforward - with dedicated agent compute, you can measure utilization, forecast growth, and make capacity decisions independently of other workloads; shared compute makes capacity planning for agents nearly impossible
  • Fleet-level observability enables optimization - when all agent workloads run on dedicated compute, fleet-level metrics (tasks per node, disk IOPS per task, memory per task, network bandwidth per task) become meaningful for optimization; this data does not exist when agents are scattered across shared infrastructure
  • Reliability becomes a compute infrastructure SLO - a dedicated fleet can have an SLO (e.g., 99.9% task start within 30 seconds, 99.5% task completion without infrastructure failure) that is distinct from application SLOs; this SLO commitment drives infrastructure investment and is auditable
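The SLO framing in the last bullet is auditable precisely because it can be computed from per-task records. A minimal sketch, assuming a hypothetical record shape with a queue time and an infrastructure-failure flag — field names are illustrative:

```python
def fleet_slo_report(tasks, start_target_s=30.0):
    """Compute fleet SLO attainment from task records. Each record
    is a dict with 'queued_s' (seconds from submit to task start)
    and 'infra_failure' (True if infrastructure killed the task)."""
    total = len(tasks)
    started_in_time = sum(1 for t in tasks if t["queued_s"] <= start_target_s)
    completed_ok = sum(1 for t in tasks if not t["infra_failure"])
    return {
        "start_within_target": started_in_time / total,
        "completion_without_infra_failure": completed_ok / total,
    }
```

Comparing these two ratios against the SLO targets (99.9% and 99.5% in the example above) over each reporting window turns "reliability" from an impression into a number the infrastructure team is accountable for.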

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob - Head of Engineering

Bob's organization has reached the scale where agent workloads are competing with CI and application workloads for shared compute. CI times are getting longer, agent tasks are getting slower, and the operations team is complaining about resource contention they cannot explain. Bob needs to make the case for dedicated agent compute but is having trouble quantifying the business case.
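One way to quantify the business case is to price the time lost to peak-hour contention against the monthly cost of a dedicated node pool. A hedged sketch with entirely illustrative numbers; it assumes lost agent time translates into blocked developer time at a blended engineering rate, which is the strongest (and most contestable) assumption:

```python
def contention_cost_per_month(tasks_per_day, peak_fraction, slowdown,
                              avg_task_minutes, eng_cost_per_hour):
    """Estimate the monthly cost of peak-hour slowdown: extra wait
    time on peak-hour tasks, priced at blended engineering cost.
    Assumes ~22 working days per month."""
    peak_tasks = tasks_per_day * peak_fraction * 22
    extra_minutes = peak_tasks * avg_task_minutes * slowdown
    return extra_minutes / 60 * eng_cost_per_hour

# Illustrative: 400 tasks/day, 60% at peak, 35% slower at peak,
# 20-minute average task, $100/hour blended rate
cost = contention_cost_per_month(400, 0.6, 0.35, 20, 100)
# ~ $61,600/month of contention-induced waiting
```

If that figure exceeds the quoted monthly cost of a dedicated agent node pool, the case makes itself; if it does not, Bob has still replaced "contention we cannot explain" with a measured quantity.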

What Bob should do - role-specific action plan

Sarah - Productivity Lead

Sarah has been tracking agent task throughput and has noticed that performance degrades during peak hours - afternoons when most of the engineering team is online and running agents simultaneously. The per-task completion time is 30-40% longer during peak hours than off-peak hours. This degradation is directly impacting developer productivity: developers are running agents off-peak (evenings, early mornings) to get reasonable performance, which is not a sustainable workflow.

What Sarah should do - role-specific action plan

Victor - Staff Engineer, AI Champion

Victor has been watching the agent infrastructure evolve and is ahead of the curve on the dedicated compute question. He has been advocating for dedicated agent compute for three months based on his own performance analysis. He has data showing that disk IOPS are the binding constraint at the current shared infrastructure scale: when more than 20 agents run concurrently on the shared nodes, disk wait time accounts for 40% of agent task duration.
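Victor's analysis - disk wait share as a function of agent concurrency - can be reproduced with a simple grouping over per-task samples. A sketch assuming a hypothetical sample shape of `(concurrent_agents, disk_wait_s, total_duration_s)`:

```python
from collections import defaultdict

def disk_wait_by_concurrency(samples):
    """Group task samples by concurrent-agent count and return the
    mean fraction of task duration spent in disk wait per group.
    A fraction that climbs with concurrency indicates I/O contention
    is the binding constraint, as in Victor's data."""
    groups = defaultdict(list)
    for n, wait, total in samples:
        groups[n].append(wait / total)
    return {n: sum(fracs) / len(fracs) for n, fracs in sorted(groups.items())}
```

A table like `{10: 0.12, 20: 0.18, 30: 0.41}` from this function is the kind of evidence that turns Victor's three months of advocacy into a one-slide argument for the dedicated pool.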

What Victor should do - role-specific action plan