Disk I/O optimized for concurrent agent workloads (Cursor lesson)

Disk I/O is the hidden bottleneck when running hundreds of concurrent agents on shared infrastructure.

·Build is a commodity: near-instant feedback for agents regardless of codebase size
·Codebase is structured into self-contained modules/crates to eliminate compilation bottleneck (Cursor lesson)
·Disk I/O is optimized for concurrent agent workloads (parallel reads/writes across modules)

·Build latency is under 30 seconds for 90%+ of changes
·Module dependency graph is automatically maintained and optimized

Evidence

·Build duration dashboard showing near-instant feedback for standard changes
·Codebase architecture showing modular structure (crate/module boundaries)
·Disk I/O benchmarks for concurrent agent build workloads

What It Is

Disk I/O is the hidden bottleneck when running hundreds of concurrent agents on shared infrastructure. Cursor discovered this empirically: when scaling from 10 to 100 simultaneous agents, build times degraded far more than CPU or memory constraints could explain. The culprit was disk I/O saturation. Each agent needs to read source files, write compiled outputs, stage artifacts, and read/write cache entries. With 100 agents doing this simultaneously on shared storage, I/O bandwidth saturates and every agent's build ends up waiting on disk operations.

The problem is not just raw I/O bandwidth. File system metadata operations - directory listings, stat calls, inode lookups - are particularly expensive under concurrent access patterns. A build system checking whether files have changed since the last build issues hundreds of stat calls per compilation unit. With 100 agents each doing this simultaneously, the metadata operation rate exceeds what most file systems handle efficiently, even on high-end SSDs.
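To make the metadata cost concrete, here is a minimal sketch of the kind of change check a build system performs: walk a source tree, stat every file, and compare modification times against the last build. The function name `scan_for_changes` and the sample tree are hypothetical and are not taken from any particular build tool.

```python
import os
import tempfile
import time


def scan_for_changes(root, last_build_time):
    """Stat every file under `root`.

    Returns (changed_paths, stat_calls, elapsed_seconds). Only explicit
    os.stat calls are counted; os.walk issues its own directory reads on top.
    """
    changed = []
    stat_calls = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)  # one metadata operation per file
            stat_calls += 1
            if info.st_mtime > last_build_time:
                changed.append(path)
    return changed, stat_calls, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical tree of 200 small files standing in for a module's sources.
    with tempfile.TemporaryDirectory() as root:
        for i in range(200):
            with open(os.path.join(root, f"src_{i}.txt"), "w") as handle:
                handle.write("x")
        changed, calls, seconds = scan_for_changes(root, last_build_time=0.0)
        print(f"{calls} stat calls, {len(changed)} changed, {seconds:.4f}s")
```

Running the same scan from many processes at once against shared storage is a cheap way to see how the metadata rate scales before involving a real build.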

Three classes of solutions address this problem. First, NVMe SSDs with separate volumes per agent sandbox, so each agent's I/O is isolated and doesn't contend with other agents' I/O. Second, tmpfs (a RAM-backed file system) for intermediate build artifacts, which removes disk I/O for ephemeral data at the cost of memory. Third, remote storage with a local cache tier - EngFlow and BuildBuddy store build artifacts remotely, with a small local SSD used only as a write-through cache for active builds. The remote tier handles the storage volume; the local SSD handles the latency-sensitive operations.

The Cursor lesson is specifically relevant for teams running agents on shared compute infrastructure (Kubernetes clusters, shared CI runner pools, multi-tenant devboxes) rather than individual developer laptops. On individual laptops, each developer has their own SSD and there's no inter-agent I/O contention. On shared infrastructure running 50+ concurrent agents, I/O contention is guaranteed without explicit architectural choices to prevent it.

Why It Matters

·Disk I/O saturation is the first bottleneck after CPU and memory are addressed - teams that have optimized their build system for CPU efficiency and memory usage will hit I/O saturation as the next ceiling when scaling agent concurrency
·I/O contention degrades all agents simultaneously - unlike CPU contention (which slows individual agents proportionally), I/O saturation slows every agent at once, because all requests queue behind the same device and the same file system locks
·The problem is invisible in small-scale testing - 5 agents on a developer laptop don't produce I/O contention; 50 agents on a shared Kubernetes node do; the problem only manifests at scale
·Solutions are mostly infrastructure-level, not build-system-level - better Bazel configuration alone won't remove I/O contention; it takes NVMe isolation, tmpfs, or a remote storage architecture
·Remote execution partially but not fully solves it - remote execution moves compilation off the local machine but doesn't eliminate source file reads, workspace setup, and cache staging from local storage

Getting Started

·Measure I/O utilization under concurrent agent load - Before optimizing, measure: run 5, 10, 20, and 50 concurrent agent builds and monitor disk utilization with iostat -x 1 or an equivalent tool. If utilization hits 80%+ at 10 concurrent agents, you have an I/O constraint. If it stays under 40% at 50 agents, I/O is not your bottleneck. Treat %util as a rough signal on NVMe, which services many requests in parallel, and check queue size and await times as well.
·Move intermediate build artifacts to tmpfs - For ephemeral compilation outputs that don't need to persist (object files, intermediate generated code, test runners), configure the build system to use /dev/shm or a tmpfs mount. Bazel's --output_base startup option can be pointed at a tmpfs directory. This eliminates disk I/O for ephemeral data, but tmpfs pages count against memory, so size the mount deliberately.
·Isolate agent workspaces on separate volumes - On Kubernetes, provision each agent pod with its own local NVMe volume (using hostPath volumes with dedicated mount points, or local PVs). Isolated volumes eliminate cross-agent I/O contention at the file system layer.
·Configure remote execution to minimize local I/O - With Bazel remote execution, use --remote_download_minimal to avoid downloading remote build outputs to the local disk. Keep only source files and the final artifacts locally. This reduces local disk I/O by 60-80% for builds with high remote cache hit rates.
·Use a local SSD cache tier for the remote cache - Rather than sending every cache request directly to S3 or GCS (which have higher latency), configure a local SSD as a write-through cache in front of the remote storage. BuildBuddy and EngFlow support this architecture. The local SSD handles high-frequency small reads; the remote storage handles capacity.
·Profile I/O patterns to distinguish metadata from data operations - Use eBPF-based profilers to count metadata operations (stat, readdir, lookup) at the syscall or VFS level, and blktrace to see which data operations (read, write) actually reach the block device. Metadata-heavy I/O patterns suggest that the build system is performing many file change checks; data-heavy patterns suggest artifact transfer is the bottleneck. Different root causes have different solutions.
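The decision rule from the measurement step above can be written down as a small check. This is a sketch only: the function name `classify_io_constraint` and the sample numbers are hypothetical, and the utilization values are assumed to come from whatever tool you use (for example the %util column of iostat -x).

```python
def classify_io_constraint(samples):
    """Classify a set of load-test measurements.

    `samples` is a list of (concurrent_agents, disk_util_percent) pairs.
    Returns "io_constrained", "not_io_bound", or "inconclusive".
    """
    # 80%+ utilization at 10 or fewer agents: I/O is already a constraint.
    for agents, util in samples:
        if agents <= 10 and util >= 80.0:
            return "io_constrained"
    # Under 40% utilization at 50 or more agents: look elsewhere.
    for agents, util in samples:
        if agents >= 50 and util < 40.0:
            return "not_io_bound"
    return "inconclusive"


if __name__ == "__main__":
    # Hypothetical measurements from runs with 5, 10, 20, and 50 agents.
    measured = [(5, 35.0), (10, 62.0), (20, 88.0), (50, 99.0)]
    print(classify_io_constraint(measured))
```

An "inconclusive" result is the cue to move on to the profiling step and separate metadata from data operations.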

Tip

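On Linux agent runners, add noatime to the mount options for all volumes used by agent builds. When access times are updated on read, each file read also dirties the inode, which adds write I/O to what should be a read-only operation. For a build that reads thousands of files on a volume that updates atime on every read, noatime can reduce I/O by 20-30%. Modern Linux kernels default to relatime, which already skips most of these updates, so expect a smaller gain there and measure before and after.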
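A quick way to audit this is to parse the mount table and list build volumes that lack the option. The sketch below assumes the usual /proc/mounts layout (device, mount point, file system type, options); the function name and the mount points in the example are hypothetical.

```python
def mounts_without_noatime(mounts_text, build_mount_points):
    """Return the build mount points whose options do not include noatime.

    Mount points that do not appear in `mounts_text` at all are also
    returned, since they cannot be confirmed.
    """
    options_by_mount = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        options_by_mount[fields[1]] = set(fields[3].split(","))

    missing = []
    for mount_point in build_mount_points:
        options = options_by_mount.get(mount_point)
        if options is None or "noatime" not in options:
            missing.append(mount_point)
    return missing


if __name__ == "__main__":
    # Hypothetical mount table; on a real runner read /proc/mounts instead.
    sample = (
        "/dev/nvme0n1p1 /agents/a1 ext4 rw,noatime 0 0\n"
        "/dev/nvme1n1p1 /agents/a2 ext4 rw,relatime 0 0\n"
    )
    print(mounts_without_noatime(sample, ["/agents/a1", "/agents/a2"]))
```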


Common Pitfalls

Assuming NVMe means no I/O contention. NVMe SSDs have extremely high sequential throughput, but concurrent random small I/O (typical of build systems touching many small files) can still exhaust the device's IOPS budget and fill its queues. The relevant metric is IOPS (I/O operations per second) for 4K random reads, not GB/s throughput. Check your NVMe's random IOPS spec and compare it to your measured concurrent agent I/O workload.
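The comparison suggested above is simple arithmetic, sketched here with integer math. The function name, the 70% headroom default, and the example figures are assumptions for illustration; substitute the device's published 4K random-read IOPS and your own measured per-agent rate.

```python
def max_agents_within_iops(device_random_read_iops, per_agent_iops, headroom_percent=70):
    """Estimate how many agents fit on one device before IOPS runs out.

    Only `headroom_percent` of the rated IOPS is treated as usable, leaving
    the rest for bursts and for writes.
    """
    if per_agent_iops <= 0:
        raise ValueError("per_agent_iops must be positive")
    usable_iops = device_random_read_iops * headroom_percent // 100
    return usable_iops // per_agent_iops


if __name__ == "__main__":
    # Hypothetical figures: a device rated at 500,000 random-read IOPS and
    # agents that each peak at 5,000 IOPS during a build.
    print(max_agents_within_iops(500_000, 5_000))
```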

Using network-attached storage (NAS/NFS) for agent workspaces. NFS or NAS storage introduces network latency on every I/O operation. For a build system that issues thousands of file operations, 1ms of NFS latency per operation adds seconds of latency per build. Agent workspaces must be on local storage. Network storage is only appropriate for build outputs that are transferred to remote storage after the build completes.

Not accounting for the Gradle/Maven local cache under concurrent access. Gradle's local cache at ~/.gradle/caches coordinates concurrent processes with file locks, so many agents sharing it end up waiting on each other, and Maven's ~/.m2/repository has historically not been safe for concurrent writers. Multiple agents using the same user's home directory for caching will produce lock contention and, with Maven, risk cache corruption. Each agent should use a separate, isolated cache directory - either by running as different users or by explicitly configuring separate cache paths.
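One way to configure separate cache paths is through Gradle's GRADLE_USER_HOME environment variable and Maven's maven.repo.local property. The sketch below only builds the settings; the helper name, the cache root, and the directory layout are hypothetical.

```python
def isolated_cache_settings(agent_id, cache_root="/var/cache/agents"):
    """Build per-agent cache settings for Gradle and Maven.

    Returns (env, maven_args): environment variables to add to the agent's
    process, and extra command-line arguments for mvn.
    """
    agent_dir = f"{cache_root}/agent-{agent_id}"
    env = {"GRADLE_USER_HOME": f"{agent_dir}/gradle"}
    maven_args = [f"-Dmaven.repo.local={agent_dir}/m2/repository"]
    return env, maven_args


if __name__ == "__main__":
    env, maven_args = isolated_cache_settings(agent_id=7)
    print(env)
    print(maven_args)
    # A launcher would then run something like:
    #   subprocess.run(["mvn", *maven_args, "package"], env={**os.environ, **env})
```

Isolated caches start cold, so pairing them with a shared read-only remote cache keeps dependency downloads from repeating per agent.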

Scaling agents without scaling storage infrastructure. Storage infrastructure (IOPS capacity, volume count, cache tier size) needs to scale with agent count. A 10-agent setup that works perfectly will fail when scaled to 100 agents without proportional storage scaling. Build storage capacity planning into your agent fleet scaling plan, not as an afterthought when performance degrades.

Ignoring the I/O pattern differences between agent types. An agent doing iterative compilation (many small file reads and writes) has a very different I/O pattern from an agent running end-to-end tests (large sequential reads, occasional writes). Shared storage is more likely to produce contention when mixed I/O patterns compete. Consider whether different agent types should run on separately provisioned storage.


How Different Roles See It

Bob, Head of Engineering

Bob has invested in Bazel, remote execution, and agent-specific build profiles. Builds are fast for small concurrent loads but degrade noticeably when more than 20 agents are running simultaneously - a scenario that happens every morning when his team starts their day. The degradation is mysterious: CPU and memory are not saturated, but build times jump from 8 seconds to 45 seconds. His infrastructure team hasn't identified the root cause.

What Bob should do: Bob should ask his infrastructure lead to run an I/O profiling session during the morning peak: monitor iostat, inode operation rates, and file system wait times with 20+ concurrent agents building. The hypothesis to test is I/O saturation. If confirmed, Bob should approve two interventions: (1) move agent build workspaces to isolated NVMe volumes (eliminates cross-agent contention), and (2) enable --remote_download_minimal for agent CI builds (reduces local I/O by avoiding artifact downloads). These two changes should restore build performance under concurrent load. Bob should then set a capacity ceiling: "at what concurrent agent count does I/O performance degrade?" and ensure the infrastructure is scaled to stay comfortably below that ceiling.
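Bob's capacity question can be answered from a handful of load-test measurements. The following is a minimal sketch, assuming build time is measured at several concurrency levels; the function name, the 1.5x tolerance, and the sample data are hypothetical (the 8-second and 45-second figures echo Bob's scenario).

```python
def capacity_ceiling(measurements, baseline_seconds, tolerance=1.5):
    """Return the highest agent count measured before build time degrades.

    `measurements` is a list of (concurrent_agents, build_seconds) pairs.
    A build counts as degraded once it exceeds baseline_seconds * tolerance.
    Returns 0 if even the smallest measured load is degraded.
    """
    limit = baseline_seconds * tolerance
    ceiling = 0
    for agents, seconds in sorted(measurements):
        if seconds > limit:
            break  # first degraded level; stop at the previous one
        ceiling = agents
    return ceiling


if __name__ == "__main__":
    # Hypothetical load-test results: (concurrent agents, build seconds).
    results = [(5, 8.0), (10, 9.0), (20, 11.0), (30, 45.0)]
    print(capacity_ceiling(results, baseline_seconds=8.0))
```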


Sarah, Productivity Lead

Sarah has noticed a daily pattern in her build time metrics: builds are fast from 7-9 AM, degrade during 10 AM - 12 PM when peak concurrent agent usage happens, improve at lunch, and degrade again in the afternoon. This "rush hour" pattern is a classic I/O contention signature. The time-of-day correlation with concurrent agent count is the diagnostic clue.

What Sarah should do: Sarah should present the time-of-day build time correlation data to the infrastructure team with the hypothesis: peak concurrent agent usage is causing I/O contention on shared storage. She should request a monitoring instrumentation spike: add I/O wait percentage, IOPS utilization, and file system operation rate to the infrastructure dashboard so the I/O hypothesis can be confirmed or rejected with data. If confirmed, Sarah should work with the infrastructure team to prioritize the I/O isolation work. She should also investigate whether the morning peak is worse because developers tend to start all their agents at 9 AM in a burst - if so, a "staggered agent launch" recommendation could reduce peak concurrency without infrastructure changes.
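The time-of-day correlation Sarah wants to present can be quantified with a plain Pearson coefficient between concurrent agent count and build time. This is a sketch with hypothetical sample data; a high coefficient supports the contention hypothesis but does not by itself prove that disk I/O is the cause.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical hourly samples: concurrent agents and median build seconds.
    agents = [5, 12, 40, 55, 20, 48]
    build_seconds = [8, 9, 38, 45, 12, 41]
    print(round(pearson_correlation(agents, build_seconds), 3))
```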


Victor, Staff Engineer - AI Champion

Victor has been running agents on a Kubernetes cluster he manages. After hitting the I/O contention problem at 30 concurrent agents, he implemented per-pod local NVMe volumes using local PersistentVolumes. Each agent pod gets a dedicated 100GB NVMe volume for its workspace, and he configured Bazel's output_base to point to a tmpfs directory for intermediate artifacts. He's now running 80 concurrent agents with stable 8-second build times.

What Victor should do: Victor should open-source his Kubernetes configuration as a reference implementation. The combination of: (a) local PV provisioner for per-pod NVMe volumes, (b) tmpfs for Bazel output_base, (c) --remote_download_minimal in the Bazel configuration, and (d) noatime mount options creates an I/O-optimized agent build environment that is replicable by any team running agents on Kubernetes. Victor should write up the performance numbers - the before/after on I/O wait time, build time under concurrent load, and the cost comparison (NVMe provisioning cost vs. CI time savings) - as a technical blog post. This kind of concrete, data-driven infrastructure work establishes Victor as a technical authority in the AI engineering space.
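As a rough idea of what such a reference configuration might contain, the sketch below generates a pod manifest with a dedicated workspace volume and a RAM-backed scratch directory. It is not Victor's actual configuration: the image name, mount paths, claim name, and size limit are hypothetical, and it assumes a PersistentVolumeClaim bound to a local NVMe PersistentVolume already exists for each agent.

```python
import json


def agent_pod_manifest(agent_id, workspace_claim, scratch_size="8Gi"):
    """Build a Kubernetes Pod manifest for one agent as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"agent-{agent_id}"},
        "spec": {
            "containers": [
                {
                    "name": "agent",
                    "image": "registry.example.invalid/agent-runner:latest",  # hypothetical
                    "volumeMounts": [
                        {"name": "workspace", "mountPath": "/workspace"},
                        {"name": "build-scratch", "mountPath": "/scratch"},
                    ],
                }
            ],
            "volumes": [
                # Dedicated volume for this agent's workspace.
                {
                    "name": "workspace",
                    "persistentVolumeClaim": {"claimName": workspace_claim},
                },
                # emptyDir with medium "Memory" is backed by tmpfs.
                {
                    "name": "build-scratch",
                    "emptyDir": {"medium": "Memory", "sizeLimit": scratch_size},
                },
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(agent_pod_manifest(1, "agent-1-workspace"), indent=2))
```

Inside the container, Bazel's --output_base would then point under /scratch; the JSON output can be applied with kubectl like any other manifest.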
