Maturity Matrix

Disk I/O optimized for concurrent agent workloads (Cursor lesson)

Disk I/O is the hidden bottleneck when running hundreds of concurrent agents on shared infrastructure.

  • Build is a commodity: near-instant feedback for agents regardless of codebase size
  • Codebase is structured into self-contained modules/crates to eliminate the compilation bottleneck (Cursor lesson)
  • Disk I/O is optimized for concurrent agent workloads (parallel reads/writes across modules)
  • Build latency is under 30 seconds for 90%+ of changes
  • Module dependency graph is automatically maintained and optimized
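One way to track the "under 30 seconds for 90%+ of changes" criterion is a periodic check over recorded build durations. The sketch below assumes you can export per-build wall-clock times in seconds from your build dashboard; the function and constant names are hypothetical.

```python
from typing import Sequence

# Thresholds from the criterion above: under 30 s for at least 90% of changes.
TARGET_SECONDS = 30.0
TARGET_FRACTION = 0.90


def fraction_under(durations_s: Sequence[float], limit_s: float = TARGET_SECONDS) -> float:
    """Fraction of builds whose wall-clock duration is strictly below limit_s."""
    if not durations_s:
        raise ValueError("need at least one build duration")
    return sum(1 for d in durations_s if d < limit_s) / len(durations_s)


def meets_latency_criterion(durations_s: Sequence[float]) -> bool:
    """True if at least 90% of recorded builds finished in under 30 seconds."""
    return fraction_under(durations_s) >= TARGET_FRACTION
```

Running the check separately for peak and off-peak windows keeps a busy morning from hiding behind a quiet night.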

Evidence

  • Build duration dashboard showing near-instant feedback for standard changes
  • Codebase architecture showing modular structure (crate/module boundaries)
  • Disk I/O benchmarks for concurrent agent build workloads
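A minimal benchmark sketch for the I/O evidence item above: it times concurrent stat() calls over a shared directory tree, with threads standing in for agents. The function names are hypothetical. Run it against the shared volume your agents actually use; a warm page and dentry cache hides most device contention, and Python thread overhead also caps throughput, so treat it as a smoke test and use a dedicated tool such as fio for real bandwidth numbers.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def make_tree(root: str, n_files: int) -> List[str]:
    """Create n_files small files under root and return their paths."""
    paths = []
    for i in range(n_files):
        p = os.path.join(root, f"src_{i}.txt")
        with open(p, "w") as f:
            f.write("x")
        paths.append(p)
    return paths


def stat_worker(paths: List[str], rounds: int) -> int:
    """Simulate one agent's change check: stat every file `rounds` times."""
    calls = 0
    for _ in range(rounds):
        for p in paths:
            os.stat(p)  # one metadata syscall per file
            calls += 1
    return calls


def bench_stat(paths: List[str], agents: int, rounds: int = 10) -> dict:
    """Run `agents` concurrent workers; report total stat calls and calls/sec."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=agents) as pool:
        totals = list(pool.map(lambda _: stat_worker(paths, rounds), range(agents)))
    elapsed = time.perf_counter() - start
    total = sum(totals)
    return {
        "agents": agents,
        "stat_calls": total,
        "seconds": elapsed,
        "calls_per_sec": total / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Replace the temporary directory with a path on the shared agent volume.
    with tempfile.TemporaryDirectory() as d:
        files = make_tree(d, 500)
        for n in (1, 8, 32):
            print(bench_stat(files, n))
```

If calls per second flatten or drop as the worker count rises, the volume's metadata path is saturating.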

What It Is

Disk I/O is the hidden bottleneck when running hundreds of concurrent agents on shared infrastructure. Cursor discovered this empirically: when scaling from 10 to 100 simultaneous agents, build times degraded far more than CPU or memory constraints could explain. The culprit was disk I/O saturation. Each agent needs to read source files, write compiled outputs, stage artifacts, and read and write cache entries. With 100 agents doing this simultaneously on shared storage, the available I/O bandwidth is fully saturated and every agent's build ends up waiting on disk operations.

The problem is not just raw I/O bandwidth. File system metadata operations - directory listings, stat calls, inode lookups - are particularly expensive under concurrent access patterns. A build system checking whether files have changed since the last build issues hundreds of stat calls per compilation unit. With 100 agents each doing this simultaneously, the metadata operation rate exceeds what most file systems handle efficiently, even on high-end SSDs.
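To see why metadata dominates, here is a simplified, hypothetical sketch of mtime/size-based change detection, the approach Make-style tools take. Real build systems differ in the details, but the key property holds: every input costs at least one stat() call even when nothing changed, so N concurrent agents multiply the metadata load by N.

```python
import os
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# (mtime_ns, size) is a cheap "did this file change?" fingerprint.
Fingerprint = Tuple[int, int]


@dataclass
class ChangeScan:
    changed: List[str]
    stat_calls: int


def scan_for_changes(
    paths: Iterable[str], previous: Dict[str, Fingerprint]
) -> Tuple[ChangeScan, Dict[str, Fingerprint]]:
    """Stat every input and compare it against the previous snapshot."""
    current: Dict[str, Fingerprint] = {}
    changed: List[str] = []
    calls = 0
    for p in paths:
        st = os.stat(p)  # one metadata syscall per input, changed or not
        calls += 1
        fp = (st.st_mtime_ns, st.st_size)
        current[p] = fp
        if previous.get(p) != fp:
            changed.append(p)
    return ChangeScan(changed, calls), current
```

A no-op rebuild of a module with a few hundred inputs still issues a few hundred stat calls; that is the traffic that piles up on shared storage.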

Three classes of solutions address this problem. First, NVMe SSDs with a separate volume per agent sandbox, so each agent's I/O is isolated from other agents' I/O. Second, tmpfs (a RAM-backed file system) for intermediate build artifacts, eliminating disk I/O entirely for ephemeral data. Third, remote storage with a local cache tier - EngFlow and BuildBuddy store build artifacts remotely, with a small local SSD used only as a write-through cache for active builds. The remote tier handles the storage volume; the local SSD handles the latency-sensitive operations.
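The third option's local tier can be modeled as a small LRU cache in front of a remote store. This is a generic write-through sketch, not EngFlow's or BuildBuddy's actual implementation; `RemoteStore` and `InMemoryRemote` are hypothetical stand-ins for a remote cache service, and the local tier is a dict here where a real agent host would use files on its local SSD.

```python
from collections import OrderedDict
from typing import Dict, Optional, Protocol


class RemoteStore(Protocol):
    """Hypothetical interface for the remote artifact tier."""
    def get(self, key: str) -> Optional[bytes]: ...
    def put(self, key: str, value: bytes) -> None: ...


class InMemoryRemote:
    """Stand-in for a remote cache service, for illustration only."""
    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}
        self.gets = 0

    def get(self, key: str) -> Optional[bytes]:
        self.gets += 1
        return self.data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self.data[key] = value


class WriteThroughCache:
    """Small local LRU tier in front of a remote store.

    Writes go to both tiers, so local eviction never loses data; reads are
    served locally when possible and fall back to the remote tier.
    """
    def __init__(self, remote: RemoteStore, max_bytes: int) -> None:
        self.remote = remote
        self.max_bytes = max_bytes
        self.local: "OrderedDict[str, bytes]" = OrderedDict()
        self.local_bytes = 0

    def _insert_local(self, key: str, value: bytes) -> None:
        if key in self.local:
            self.local_bytes -= len(self.local.pop(key))
        self.local[key] = value
        self.local_bytes += len(value)
        # Evict least-recently-used entries until the local budget fits.
        while self.local_bytes > self.max_bytes and self.local:
            _, evicted = self.local.popitem(last=False)
            self.local_bytes -= len(evicted)

    def put(self, key: str, value: bytes) -> None:
        self.remote.put(key, value)  # durable copy first
        self._insert_local(key, value)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.local:
            self.local.move_to_end(key)  # mark as recently used
            return self.local[key]
        value = self.remote.get(key)
        if value is not None:
            self._insert_local(key, value)
        return value
```

Because every write reaches the remote tier, the local budget can stay small: it only needs to hold the working set of active builds.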

The Cursor lesson is specifically relevant for teams running agents on shared compute infrastructure (Kubernetes clusters, shared CI runner pools, multi-tenant devboxes) rather than individual developer laptops. On individual laptops, each developer has their own SSD and there's no inter-agent I/O contention. On shared infrastructure running 50+ concurrent agents, I/O contention is guaranteed without explicit architectural choices to prevent it.

Why It Matters

  • Disk I/O becomes the bottleneck once CPU and memory are addressed - teams that have tuned their build system for CPU and memory efficiency hit I/O saturation as the next ceiling when they scale agent concurrency
  • I/O contention degrades all agents at once - unlike CPU contention, which slows individual agents roughly in proportion to load, I/O saturation slows every agent together because they queue on the same storage and contend for the same file system locks
  • The problem is invisible in small-scale testing - 5 agents on a developer laptop don't produce I/O contention; 50 agents on a shared Kubernetes node do; the problem only manifests at scale
  • Solutions are infrastructure-level, not build-system-level - you can't solve I/O contention with better Bazel configuration; it requires NVMe isolation, tmpfs, or remote storage architecture
  • Remote execution partially but not fully solves it - remote execution moves compilation off the local machine but doesn't eliminate source file reads, workspace setup, and cache staging from local storage

How Different Roles See It

Bob, Head of Engineering

Bob has invested in Bazel, remote execution, and agent-specific build profiles. Builds are fast for small concurrent loads but degrade noticeably when more than 20 agents are running simultaneously - a scenario that happens every morning when his team starts their day. The degradation is mysterious: CPU and memory are not saturated, but build times jump from 8 seconds to 45 seconds. His infrastructure team hasn't identified the root cause.
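In a situation like Bob's, where CPU and memory look healthy, Linux pressure stall information (PSI) can show whether tasks are stalling on I/O. The sketch below assumes Linux build hosts with kernel 4.20 or later and PSI enabled; the function names are hypothetical. Comparing the `avg10` value during the morning peak against a quiet period is one way to test the I/O hypothesis.

```python
from typing import Dict

# Linux pressure stall information for I/O (kernel 4.20+, PSI enabled).
PSI_IO_PATH = "/proc/pressure/io"


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI lines such as 'some avg10=1.50 avg60=0.80 avg300=0.20 total=123'."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def io_stall_percent(path: str = PSI_IO_PATH) -> float:
    """Percent of the last 10 s in which at least one task was stalled on I/O."""
    with open(path) as f:
        return parse_psi(f.read())["some"]["avg10"]
```

The `full` line reports time when all non-idle tasks were stalled at once; a rising `full` value during peak hours is a strong hint that the whole host is waiting on disk.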


Sarah, Productivity Lead

Sarah has noticed a daily pattern in her build time metrics: builds are fast from 7 to 9 AM, degrade from 10 AM to 12 PM when concurrent agent usage peaks, improve at lunch, and degrade again in the afternoon. This "rush hour" pattern is a classic I/O contention signature, and the correlation between time of day and concurrent agent count is the diagnostic clue.
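Sarah's diagnostic clue can be quantified by correlating hourly concurrency with hourly build time. This sketch assumes a hypothetical export of build records as (hour of day, concurrent agents at build start, build seconds); the record shape and function names are illustrative, not part of any particular metrics tool.

```python
import math
from collections import defaultdict
from statistics import fmean, median
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical export format: (hour_of_day, concurrent_agents_at_start, build_seconds).
BuildRecord = Tuple[int, int, float]


def hourly_profile(records: Iterable[BuildRecord]) -> Dict[int, Tuple[float, float]]:
    """Map each hour of the day to (median concurrent agents, median build seconds)."""
    by_hour: Dict[int, List[BuildRecord]] = defaultdict(list)
    for rec in records:
        by_hour[rec[0]].append(rec)
    return {
        hour: (median(r[1] for r in recs), median(r[2] for r in recs))
        for hour, recs in sorted(by_hour.items())
    }


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("one series is constant; correlation is undefined")
    return cov / math.sqrt(vx * vy)


def concurrency_latency_correlation(records: Iterable[BuildRecord]) -> float:
    """Correlate hourly median concurrency with hourly median build time.

    A strong positive value supports the contention hypothesis, but it does not
    by itself show that the contended resource is disk rather than, say, network.
    """
    profile = hourly_profile(records)
    if len(profile) < 2:
        raise ValueError("need builds from at least two different hours")
    agents = [a for a, _ in profile.values()]
    seconds = [s for _, s in profile.values()]
    return pearson(agents, seconds)
```

Medians per hour keep a handful of pathological builds from dominating the signal.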


Victor, Staff Engineer - AI Champion

Victor has been running agents on a Kubernetes cluster he manages. After hitting the I/O contention problem at 30 concurrent agents, he implemented per-pod local NVMe volumes using local PersistentVolumes. Each agent pod gets a dedicated 100GB NVMe volume for its workspace, and he configured Bazel's output_base to point to a tmpfs directory for intermediate artifacts. He's now running 80 concurrent agents with stable 8-second build times.
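A sketch of a per-agent layout like Victor's, assuming one local PersistentVolume per agent and an agent ID you already have. The manifest follows the structure of the local PersistentVolume example in the Kubernetes documentation; the `local-nvme` class name, paths, and function names are hypothetical. `--output_base` is a Bazel startup option, so it must precede the command. tmpfs consumes RAM, and inside containers `/dev/shm` is often small by default, so on Kubernetes a memory-backed `emptyDir` (`medium: Memory`) mounted at a dedicated path is a common way to get a properly sized tmpfs; `tmpfs_root` is a parameter for that reason.

```python
import json
from typing import Dict, List


def local_pv_manifest(agent_id: str, node: str, nvme_path: str, size: str = "100G") -> Dict:
    """Kubernetes local PersistentVolume pinned to one node's NVMe mount."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": f"agent-{agent_id}-workspace"},
        "spec": {
            "capacity": {"storage": size},
            "volumeMode": "Filesystem",
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Delete",
            "storageClassName": "local-nvme",  # hypothetical class name
            "local": {"path": nvme_path},
            # Local PVs need node affinity so pods land on the node that owns the disk.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [{
                        "matchExpressions": [{
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": [node],
                        }]
                    }]
                }
            },
        },
    }


def bazel_argv(agent_id: str, targets: List[str], tmpfs_root: str = "/dev/shm") -> List[str]:
    """Bazel invocation with a per-agent output_base on a tmpfs mount."""
    # --output_base is a startup option, so it goes before the "build" command.
    return ["bazel", f"--output_base={tmpfs_root}/bazel-{agent_id}", "build", *targets]


if __name__ == "__main__":
    # kubectl apply -f accepts JSON as well as YAML.
    print(json.dumps(local_pv_manifest("42", "node-a", "/mnt/nvme/agent-42"), indent=2))
    print(" ".join(bazel_argv("42", ["//..."])))
```

Giving each agent its own output base also avoids agents contending for a single shared output directory.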
