Infrastructure - L5 Autonomous: Agent Runtime & Sandboxing

Each agent gets an isolated machine (the Cursor approach), or agents share machines with smart resource management


  • Dedicated compute infrastructure exists for the agent fleet (not shared with developer workstations or production)
  • Agent fleet auto-scales with load (agents scale up during business hours, scale down off-hours)
  • Each agent runs in a fully isolated environment (Cursor approach: one machine per agent, or smart resource management)
  • Cost per agent-hour is tracked and optimized
  • Fleet scaling responds to demand within 60 seconds

Evidence

  • Infrastructure allocation showing dedicated agent compute (separate from dev and prod)
  • Auto-scaling configuration and scaling event logs
  • Agent fleet dashboard showing per-agent isolation and resource utilization

What It Is

At L5, organizations running agent fleets at scale face a fundamental architectural choice: does each agent get its own isolated machine (strong isolation, higher cost), or do multiple agents share machines with smart resource management (weaker isolation per agent, lower cost, higher density)? Both approaches work at scale; the choice depends on the organization's security requirements, cost constraints, and the types of tasks agents run.

The isolated machine approach - one physical or virtual machine per agent task - provides the strongest security guarantees. Cursor's engineering blog describes this as their production approach for running background agents: each agent task runs in its own Firecracker microVM on its own physical host. There is no sharing of memory, kernel, or storage between tasks. If one agent task is compromised or behaves unexpectedly, it has zero ability to affect any other task. This isolation level is appropriate for tasks that handle sensitive code, process secrets, or make financial decisions. The cost is significant: you are paying for a full machine for each agent, regardless of how much of that machine the agent uses.
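Cursor has not published its provisioning code, but a minimal sketch can show what "one microVM per task" means at the Firecracker API level: each task gets its own machine config, kernel, and private root filesystem, with nothing shared between VMs. All paths, sizes, and the task ID below are hypothetical placeholders.

```python
import json


def firecracker_task_config(task_id: str, vcpus: int = 2, mem_mib: int = 1024) -> dict:
    """Build the per-task payloads a provisioner would send to the
    Firecracker API socket. One VM per agent task: dedicated CPU and
    memory, a private kernel, and a private root filesystem."""
    return {
        # PUT /machine-config: resources dedicated to this task only
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
        # PUT /boot-source: each microVM boots its own kernel
        "boot-source": {
            "kernel_image_path": f"/var/agents/{task_id}/vmlinux",
            "boot_args": "console=ttyS0 reboot=k panic=1",
        },
        # PUT /drives/rootfs: private storage, never shared across tasks
        "drives": [{
            "drive_id": "rootfs",
            "path_on_host": f"/var/agents/{task_id}/rootfs.ext4",
            "is_root_device": True,
            "is_read_only": False,
        }],
    }


cfg = firecracker_task_config("task-42")
print(json.dumps(cfg["machine-config"]))
```

The cost consequence is visible in the structure: every field is per-task, so every task pays for a full machine's worth of configuration and capacity.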

The shared machine approach runs multiple agents on the same node with resource management controls (Kubernetes resource limits, cgroups, separate namespaces) to isolate them from each other. This is higher density and lower cost. A single node running 20 agent containers can process 20 simultaneous tasks at a fraction of the cost of 20 separate VMs. The isolation is not as strong - agents share the kernel, and container escape vulnerabilities can theoretically allow one agent to affect others on the same host. But for many use cases, especially tasks that do not handle particularly sensitive data, this risk is acceptable.
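In Kubernetes terms, the shared-machine approach comes down to a container spec per agent with hard resource limits (enforced by cgroups) plus a locked-down security context. The image name and default sizes below are assumptions for illustration, not a prescribed configuration.

```python
def agent_container_spec(task_id: str, cpu: str = "2", memory: str = "4Gi") -> dict:
    """Container spec for one agent sharing a node with ~10-20 others.
    Limits cap each agent's CPU and memory; the kernel is still shared,
    which is exactly the isolation trade-off of this approach."""
    return {
        "name": f"agent-{task_id}",
        "image": "agent-runtime:latest",  # hypothetical image name
        "resources": {
            # requests == limits puts the pod in the Guaranteed QoS class,
            # so one noisy agent cannot starve its neighbors
            "requests": {"cpu": cpu, "memory": memory},
            "limits": {"cpu": cpu, "memory": memory},
        },
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
        },
    }


spec = agent_container_spec("7")
print(spec["resources"]["limits"])
```

The security context narrows the container-escape surface but cannot eliminate it; that residual kernel-sharing risk is what the sensitivity tiers in the hybrid model are for.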

In practice, most organizations at L5 use a hybrid approach: different isolation tiers for different task types. Routine development tasks (write a test, fix a lint error, update documentation) run on shared machines with container isolation. High-sensitivity tasks (working in the payments service, modifying authentication code, handling credentials) run on isolated machines. The routing logic that assigns tasks to the appropriate tier is itself a nontrivial engineering problem, and its accuracy determines how well the hybrid architecture delivers on both security and cost.
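The tier-routing logic can be sketched as a small classifier over task metadata. The sensitivity signals below (path prefixes, capability names) are invented for illustration; a real router would draw them from code ownership, path rules, and secret-scanning metadata.

```python
from dataclasses import dataclass, field

ISOLATED_VM = "isolated-vm"
SHARED_CONTAINER = "shared-container"

# Hypothetical sensitivity signals for this sketch
SENSITIVE_PATHS = ("services/payments/", "services/auth/")
SENSITIVE_CAPABILITIES = {"secrets:read", "prod:deploy"}


@dataclass
class AgentTask:
    task_id: str
    touched_paths: list = field(default_factory=list)
    capabilities: set = field(default_factory=set)


def route(task: AgentTask) -> str:
    """Assign a task to an isolation tier: VM isolation if it needs a
    sensitive capability or touches sensitive code, containers otherwise."""
    if task.capabilities & SENSITIVE_CAPABILITIES:
        return ISOLATED_VM
    if any(p.startswith(SENSITIVE_PATHS) for p in task.touched_paths):
        return ISOLATED_VM
    return SHARED_CONTAINER


print(route(AgentTask("t1", ["services/payments/api.py"])))  # isolated-vm
print(route(AgentTask("t2", ["docs/readme.md"])))            # shared-container
```

Note the asymmetry: the router defaults to the cheap tier and escalates on any sensitive signal, so classification mistakes fail toward over-isolation only when the signals themselves are complete. Keeping the signal sources accurate is where most of the engineering effort goes.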

Why It Matters

  • The cost difference between approaches is large - isolated VMs for 100 simultaneous agents can cost 5-10x more than containerized agents on shared nodes; for large agent fleets, this cost difference is a significant budget consideration
  • Isolation model defines your security guarantee - the choice between Firecracker VMs and shared containers is a choice about the strength of the security guarantee you can make; organizations with strong security requirements need VM-level isolation, not just container-level isolation
  • Disk I/O sharing is the key performance bottleneck - Cursor's engineering team identified disk I/O (not CPU or memory) as the bottleneck when running hundreds of agents on shared machines; the architecture choice must account for storage isolation and IOPS allocation
  • Bin packing efficiency matters at scale - 1,000 agent tasks on isolated VMs requires 1,000 VMs; the same tasks on shared nodes might require 50-100 nodes, each running 10-20 agent containers; the infrastructure management overhead of 1,000 VMs vs. 100 nodes is substantial
  • The hybrid architecture scales better than pure approaches - a pure isolated-VM approach is too expensive at very high agent counts; a pure shared-container approach creates security concerns for sensitive tasks; the hybrid gives organizations a principled way to scale both security and cost simultaneously
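The bin-packing and cost claims above can be made concrete with a toy hourly cost model. All prices and densities here are made-up placeholders, chosen only so the numbers land in the ranges the bullets describe.

```python
import math


def fleet_cost(tasks: int, vm_hourly: float = 0.50,
               node_hourly: float = 1.50, agents_per_node: int = 20) -> dict:
    """Toy comparison: one VM per task vs. tasks bin-packed onto shared
    nodes. Prices are illustrative placeholders, not real cloud rates."""
    isolated = tasks * vm_hourly                      # 1 VM per task
    nodes = math.ceil(tasks / agents_per_node)        # bin packing
    shared = nodes * node_hourly
    return {
        "isolated_vms": isolated,
        "shared_nodes": nodes,
        "shared_cost": shared,
        "ratio": isolated / shared,
    }


print(fleet_cost(1000))
```

With these placeholder rates, 1,000 tasks need 1,000 VMs versus 50 shared nodes, and the isolated approach costs several times more, which is the shape of the trade-off the bullets describe, whatever the exact prices in a given cloud.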


How Different Roles See It

Bob - Head of Engineering

Bob's organization is planning a major expansion of agent usage that will bring simultaneous agent task counts from 50 to 300-500. At this scale, the infrastructure architecture choice has a significant cost and complexity impact. Bob needs to make an architectural decision that will serve the organization for 2-3 years, not just the current scale.


Sarah - Productivity Lead

Sarah cares about two things: developer experience (are agents fast and reliable?) and cost (are we spending money on agent infrastructure wisely?). Both concerns point to the same question about architecture: are we spending isolation budget in the right places? Spending a lot on isolated VMs for low-sensitivity tasks is poor ROI. Running sensitive tasks on shared containers is a security risk. Sarah wants the hybrid architecture but needs the data to support the routing decisions.


Victor - Staff Engineer, AI Champion

Victor has been following Cursor's engineering blog and has a clear technical opinion: Firecracker with per-task VMs is the right architecture for high-sensitivity work, and shared containers with tight cgroup controls are right for everything else. He has experimented with both and can provide concrete performance data: Firecracker startup at 200ms, container startup at 1 second, disk IOPS per task in both configurations.

