CI as Sandbox: 50 attempts in 5 min without blocking team

"CI as Sandbox" is a configuration pattern where the CI system is intentionally designed to support rapid, high-frequency iteration by AI agents, isolated from the normal developer CI workflow.

·CI completes in under 2 minutes (median)
·Ephemeral sandbox environments spin up in under 10 seconds for agent CI loops
·Agent sandbox CI supports 50+ iteration attempts in 5 minutes without blocking team CI queue

·P95 CI duration is under 3 minutes
·CI feedback latency (from push to result) is tracked and reported

Evidence

·CI run duration dashboard showing median under 2 minutes
·Sandbox spin-up time metrics showing sub-10-second P50
·Agent CI iteration logs showing 50+ attempts within 5-minute windows

What It Is

"CI as Sandbox" is a configuration pattern where the CI system is intentionally designed to support rapid, high-frequency iteration by AI agents, isolated from the normal developer CI workflow. Instead of running CI as a quality gate that blocks merging, agents run CI as an exploration environment - attempting up to 50 runs in a 5-minute window without consuming any of the team's normal CI capacity or affecting other developers' queue times.

The key architectural requirement is isolation: agent sandbox CI runs must be on separate infrastructure from team CI runs. When an agent attempts 50 CI runs in 5 minutes, those 50 runs need to complete against an agent-specific runner pool, not the shared team pool. Without this isolation, agents would consume all available CI capacity and queue every developer's PR for the duration of the agent's iteration session.

The "50 attempts in 5 minutes" target comes from the iteration rate that enables agents to solve complex problems autonomously. An agent working on a failing test or a type error needs to try an approach, observe the result, and try again. At 50 attempts per 5 minutes, that's one attempt every 6 seconds on average. Hitting this rate requires sub-minute CI (see the Sub-minute Feedback guide) on the agent's dedicated runners, with several runs in flight at once unless a single run finishes in about 6 seconds. The agent that can attempt and discard 50 approaches in 5 minutes has a fundamentally different problem-solving capability than one that's limited to 5 attempts in the same period.
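The arithmetic behind the target can be checked with a small sketch. The function names are illustrative, and the model assumes every run takes the same time and that a fixed number of runs is always in flight, which real runner pools only approximate.

```python
import math


def attempts_in_window(window_s: float, run_s: float, concurrency: int) -> int:
    """Completed runs in a window when `concurrency` runs are always in flight."""
    return concurrency * math.floor(window_s / run_s)


def required_concurrency(target_attempts: int, window_s: float, run_s: float) -> int:
    """Smallest number of parallel runs that reaches the target within the window."""
    runs_per_slot = math.floor(window_s / run_s)
    if runs_per_slot == 0:
        raise ValueError("a single run does not fit in the window")
    return math.ceil(target_attempts / runs_per_slot)


print(attempts_in_window(300, 30, 1))    # 10: one 30-second run at a time
print(attempts_in_window(300, 30, 5))    # 50: five 30-second runs in flight
print(required_concurrency(50, 300, 6))  # 1: 6-second runs need no parallelism
```

Under these assumptions, a 30-second pipeline reaches 50 attempts only with five runs in parallel, which means the agent launches each batch of five before seeing the results of any of them.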

This pattern is already present in several frontier engineering organizations. Stripe's internal "Minions" agent system spawns lightweight sandboxes that can run dozens of CI checks in parallel. GitHub Copilot Workspace's fix-and-verify loop is designed around rapid CI iteration. The pattern is becoming standard at L4-L5: agents don't just submit code and wait for CI; they iterate against CI as a feedback mechanism during the task itself.

Why It Matters

Agents solve harder problems autonomously - an agent with 50 attempts in 5 minutes can converge on a working solution through rapid iteration; an agent with 3 attempts must be much more precise in each attempt, limiting problem complexity
Human developers are never blocked - dedicated agent sandbox runners mean agent burst activity doesn't affect human CI queue times; the team never notices agent iteration happening
Total solution time compresses - a feature that requires 20 iterations to implement takes 2 minutes at "CI as sandbox" rates vs. 100 minutes at 5-minute CI; this makes complex autonomous tasks viable within a single session
Enables TDD-style agent development - agents can write a failing test, attempt implementations against CI feedback, converge to green, and produce a clean test-driven implementation without human steering at each step
Eliminates the "try, wait, give up" pattern - agents on slow CI often "give up" on an approach after 2-3 slow iterations because the cost per attempt is too high; high-frequency iteration changes agent behavior to "try, adjust, converge"

Getting Started

Provision a dedicated agent sandbox runner pool - This pool is distinct from the team's standard CI runners. Size it for 10x the peak concurrent agent count, since agents running fast CI generate about 10x the jobs of human developers. Use autoscaling to handle burst patterns. Tag these runners agent-sandbox and never let human developer workflows route to them.
Create a lightweight "agent sandbox" CI pipeline - This pipeline runs only the checks relevant to the agent's current task, not the full suite. Typically: lint, type checking, and the test files directly related to the changed code. Target: under 30 seconds. This is distinct from the full CI pipeline that runs on PR creation. The agent sandbox pipeline is for rapid iteration; the full pipeline is for quality gates.
Implement a concurrency limit for sandbox pipelines - An agent iterating at 50 attempts per 5 minutes without any throttling can overwhelm even a dedicated runner pool. Set a per-agent concurrency limit: one agent can have at most N sandbox runs in progress simultaneously (start with N=5). This ensures predictable resource consumption without blocking the agent's iteration speed.
Connect the CI sandbox to the agent's feedback loop - The agent must receive CI results before deciding on the next attempt. Configure the sandbox pipeline to push results to the agent via: GitHub status checks (which agents can poll), webhook notifications to an agent controller, or a results endpoint the agent queries. Sub-30-second pipelines make polling practical.
Define the "sandbox to gate" transition - The agent sandbox pipeline and the PR quality gate pipeline are different. When the agent is confident its solution is correct (sandbox pipeline green for N consecutive runs, or the agent signals completion), trigger the full quality gate pipeline on the resulting PR. This transition is the point where sandbox iteration ends and validation begins.
Monitor sandbox usage for cost and abuse - Track: total sandbox runs per agent per session, cost per sandbox run, and completion rate (what fraction of sandbox sessions end in a successful quality gate run). This data identifies agents that are iterating inefficiently (many attempts, low convergence rate) and informs prompt engineering or capability improvements.
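Two of the steps above, the per-agent concurrency limit and the "sandbox to gate" transition, can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the class names are hypothetical, the limit of 5 is the starting value suggested above, and the threshold of 3 consecutive green runs is an assumed value for N.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SandboxLimiter:
    """Caps how many sandbox runs one agent may have in flight at once."""

    def __init__(self, max_in_flight: int = 5):
        self._max = max_in_flight
        self._semaphores: Dict[str, asyncio.Semaphore] = {}
        self._active: Dict[str, int] = {}
        self.peak: Dict[str, int] = {}  # highest concurrency seen per agent

    async def run(self, agent_id: str, job: Callable[[], Awaitable[bool]]) -> bool:
        sem = self._semaphores.setdefault(agent_id, asyncio.Semaphore(self._max))
        async with sem:  # waits here while the agent is at its limit
            self._active[agent_id] = self._active.get(agent_id, 0) + 1
            self.peak[agent_id] = max(self.peak.get(agent_id, 0), self._active[agent_id])
            try:
                return await job()
            finally:
                self._active[agent_id] -= 1


class GateTrigger:
    """Signals the full quality gate after N consecutive green sandbox runs."""

    def __init__(self, required_green: int = 3):  # 3 is an assumed threshold
        self.required_green = required_green
        self._streak = 0

    def record(self, passed: bool) -> bool:
        """Record one sandbox result; return True when the gate should run."""
        self._streak = self._streak + 1 if passed else 0
        return self._streak >= self.required_green


async def demo() -> int:
    """Submit 20 fake runs for one agent and report the peak concurrency."""
    limiter = SandboxLimiter(max_in_flight=5)

    async def fake_run() -> bool:
        await asyncio.sleep(0.01)  # stands in for a sandbox pipeline
        return True

    await asyncio.gather(*(limiter.run("agent-1", fake_run) for _ in range(20)))
    return limiter.peak["agent-1"]


if __name__ == "__main__":
    print(asyncio.run(demo()))  # 5
```

A failed run resets the streak, so a flaky sandbox result delays the gate rather than triggering it early.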

Tip

The cheapest way to implement agent sandbox CI is to use GitHub Actions with a separate workflow file (agent-sandbox.yml) triggered by a workflow_dispatch event, running on self-hosted runners tagged agent-sandbox. The agent triggers runs via the GitHub CLI (gh workflow run, which requires the workflow_dispatch trigger) and polls for results (gh run list --workflow=agent-sandbox.yml). This requires no custom infrastructure and can be implemented in a day.
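A trigger-and-poll loop along these lines might look like the sketch below. It assumes an authenticated gh CLI, a branch that has already been pushed, and a workflow that declares a workflow_dispatch trigger (which gh workflow run requires). The function names, the 5-second poll interval, and the 120-second timeout are illustrative choices, and matching "our" run by looking for a new run id is a heuristic that breaks down if several runs are started on the same branch at once.

```python
import json
import subprocess
import time
from typing import Callable, List, Optional

WORKFLOW = "agent-sandbox.yml"  # workflow file named in the tip


def gh(args: List[str]) -> str:
    """Run a gh CLI command and return stdout. Needs an authenticated gh install."""
    done = subprocess.run(["gh", *args], check=True, capture_output=True, text=True)
    return done.stdout


def latest_run(branch: str, run_cmd: Callable[[List[str]], str] = gh) -> Optional[dict]:
    """Return the newest sandbox run on the branch, or None if there is none."""
    out = run_cmd([
        "run", "list", f"--workflow={WORKFLOW}", "--branch", branch,
        "--limit", "1", "--json", "databaseId,status,conclusion",
    ])
    runs = json.loads(out)
    return runs[0] if runs else None


def run_once(
    branch: str,
    run_cmd: Callable[[List[str]], str] = gh,
    poll_s: float = 5.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Trigger one sandbox run and block until it reports a conclusion."""
    before = latest_run(branch, run_cmd)
    before_id = before["databaseId"] if before else None
    run_cmd(["workflow", "run", WORKFLOW, "--ref", branch])

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        run = latest_run(branch, run_cmd)
        # Heuristic: the first run with an id we have not seen is treated as ours.
        if run and run["databaseId"] != before_id and run["status"] == "completed":
            return run["conclusion"]  # e.g. "success" or "failure"
        sleep(poll_s)
    raise TimeoutError(f"sandbox run on {branch} did not finish in {timeout_s}s")
```

The run_cmd and sleep parameters exist so the loop can be exercised without touching GitHub; an agent controller would call run_once("my-branch") with the defaults.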

Common Pitfalls

Not isolating sandbox runners from team runners. If the sandbox runners are the same pool as the team's runners, 50 agent attempts in 5 minutes will block every developer's CI run for that period. Isolation is the core requirement, not an optimization. Without it, "CI as sandbox" is not viable.

Running the full test suite in the sandbox pipeline. The sandbox pipeline must be fast - under 30 seconds, ideally under 15. Running the full test suite in the sandbox defeats the purpose. Define a minimal, relevant subset of checks for the sandbox pipeline and save the full suite for the quality gate transition.

Allowing agents to run sandbox CI without human-defined task scope. An agent running sandbox CI without clear task scope may iterate on aspects of the codebase that weren't part of the original task, consuming sandbox resources on off-scope work. Define agent task scope explicitly (which files, which tests, which behavior) before initiating sandbox iteration sessions.

Treating sandbox success as equivalent to quality gate success. A sandbox pipeline that passes is not a substitute for the full quality gate. Teams that skip the quality gate step because "the agent's sandbox runs were green" will regress. The sandbox is for iteration speed; the quality gate is for correctness validation. Both are required.

Ignoring the cost of 50 CI runs per session. Even at cheap sandbox pipeline costs (30 seconds × 50 attempts × $0.001/second = $1.50 per session), high-frequency agent usage can generate significant compute costs. Monitor cost per agent session and set a budget alert. Agents that consistently need 50+ attempts before converging may need improved task specifications or agent capability, not more sandbox quota.
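The cost arithmetic above can be turned into a simple budget check. This is a sketch under the same example rates as the text (30-second runs at $0.001 per second); the class name, the function name, and the $2.00 budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SessionCost:
    attempts: int
    seconds_per_run: float
    dollars_per_second: float

    @property
    def total(self) -> float:
        """Compute cost of one agent session in dollars."""
        return self.attempts * self.seconds_per_run * self.dollars_per_second


def over_budget(session: SessionCost, budget_dollars: float) -> bool:
    """True when a session should raise a budget alert."""
    return session.total > budget_dollars


example = SessionCost(attempts=50, seconds_per_run=30, dollars_per_second=0.001)
print(f"${example.total:.2f}")             # $1.50, matching the figure above
print(over_budget(example, 2.00))          # False with a hypothetical $2.00 budget
```

Multiplying the per-session figure by sessions per day across all agents gives the number a budget alert should actually track.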

How Different Roles See It

Bob, Head of Engineering

Bob has 10 developers each running agents, and they've started complaining about CI slowness during afternoon peak hours. Investigation reveals the problem: two developers are running agents that are iterating rapidly on failing tests, generating 30-40 CI jobs per hour each on the shared runner pool. The 8 other developers are stuck waiting behind the agents' jobs in the queue.

Bob needs to implement the sandbox isolation before the organizational friction gets worse. His action: provision a dedicated agent sandbox runner pool (4 autoscaling runners), create an agent-sandbox.yml workflow that runs a 30-second subset of checks, and update the team's agent usage convention - agents use the sandbox workflow for iteration, the standard workflow only for final PR creation. Bob should implement this in a day and communicate the change to the team: "agents now have their own CI infrastructure; your CI queue times will return to normal." The rapid response to the shared queue problem demonstrates that engineering leadership is actively managing the infrastructure implications of AI tool adoption.

Sarah, Productivity Lead

Sarah has data showing that the two developers using agents most heavily are generating 80% of the team's CI load. She also has data showing that their agent sessions have an average of 28 CI runs per successful task completion - they iterate heavily before converging. She wants to understand: is 28 runs per task a sign of inefficiency, or is it the expected iteration rate for the complexity of tasks they're tackling?

Sarah should interview the two heavy agent users and review their agent session logs. If the 28-run average is driven by agents correcting type errors and lint violations through iteration (avoidable with better initial context), that's a prompt engineering problem. If it's driven by agents working through complex behavioral logic (expected iteration), it's a load planning problem. The distinction determines the intervention: improve agent instructions to reduce unnecessary iteration, or provision more sandbox capacity to support the expected iteration rate. Sarah should use this analysis to set a "reasonable iteration target" per task complexity category - a simple bug fix should converge in 5-10 iterations, a complex feature in 20-30 - and use deviations from this target as a signal for agent quality improvement.
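Sarah's iteration targets could be encoded as a small check over session logs. The sketch below uses the ranges from the text (5-10 runs for a simple bug fix, 20-30 for a complex feature); the category keys, function name, and return labels are illustrative.

```python
from typing import Dict, Tuple

# Target iteration ranges from the text; the category keys are illustrative.
TARGETS: Dict[str, Tuple[int, int]] = {
    "simple_bug_fix": (5, 10),
    "complex_feature": (20, 30),
}


def classify(category: str, runs: int) -> str:
    """Compare a session's sandbox run count to the target for its category."""
    low, high = TARGETS[category]
    if runs > high:
        return "above_target"   # candidate for better context or instructions
    if runs < low:
        return "below_target"
    return "within_target"


print(classify("simple_bug_fix", 28))   # above_target
print(classify("complex_feature", 28))  # within_target
```

The same 28-run session reads as a problem or as expected behavior depending on the task category, which is why the category has to be recorded alongside the run count.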

Victor, Staff Engineer - AI Champion

Victor already uses a custom bash script that connects his local Claude Code session to a GitHub Actions sandbox workflow. The script: (1) stages the current changes, (2) triggers the sandbox workflow via gh workflow run, (3) polls for completion every 10 seconds, (4) prints the result. His iteration loop is about 25 seconds end-to-end. He can attempt 2-3 iterations per minute while working on a problem.

Victor should formalize his script into a proper CI-as-sandbox integration: a Claude Code MCP tool that agents can invoke natively, triggering a sandbox CI run and returning the results in the agent's context window. This transforms the ad-hoc script into a first-class agent capability: agents can decide when to trigger a sandbox run, receive results, and continue iteration without human involvement. Victor should also document the runner setup and the workflow configuration so other developers can replicate his setup. The MCP tool plus the runner setup is the complete CI-as-sandbox implementation that other developers can adopt.