Context Engineering (Development, L3: Systematic)

Context budgeting (token economy)

AI models have finite context windows - context budgeting is the practice of deliberately allocating that budget across context types to maximize agent effectiveness per token spent.

  • MCP servers provide structured context (architecture, ownership, SLAs) to agents
  • Context is organized across at least 3 of the 5 levels: System, Code, Org, Historical, Operational
  • Token budget management is implemented (agents receive context within defined token limits)
  • Context sources are versioned and tested for correctness
  • Context budgeting policy defines priority order when token limits are reached
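A budgeting policy like the one in the last criterion can be as simple as a priority-ordered allocation table. The sketch below is illustrative only - the category names, token figures, and structure are assumptions, not taken from any specific tool:

```python
# Hypothetical context budgeting policy. All numbers and category
# names are illustrative; adjust to your model and workflow.
CONTEXT_BUDGET = {
    "total_tokens": 200_000,          # model context window
    "reserved_for_response": 16_000,  # leave room for the model's output
    # Allocation per context type. Lower priority number = more important;
    # when the window is full, higher-numbered categories are cut first.
    "allocations": [
        {"type": "system_prompt",        "max_tokens": 4_000,   "priority": 1},
        {"type": "task_instructions",    "max_tokens": 8_000,   "priority": 2},
        {"type": "project_conventions",  "max_tokens": 12_000,  "priority": 3},
        {"type": "relevant_code",        "max_tokens": 100_000, "priority": 4},
        {"type": "conversation_history", "max_tokens": 60_000,  "priority": 5},
    ],
}

def truncation_order(policy: dict) -> list[str]:
    """Categories to cut first when over budget (least important first)."""
    ranked = sorted(policy["allocations"], key=lambda a: a["priority"], reverse=True)
    return [a["type"] for a in ranked]
```

Under this policy, `truncation_order(CONTEXT_BUDGET)` cuts conversation history first and the system prompt last - the priority decision is made once, explicitly, rather than per session.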

Evidence

  • MCP server configuration files listing active context sources
  • Token budget configuration in agent settings
  • Context coverage audit showing 3+ context levels populated

What It Is

Every AI model operates within a context window - a maximum number of tokens it can process in a single call. As of 2025, leading models offer context windows from 128k to 1M tokens, which sounds large until you consider what competes for that space: system prompts, project conventions, relevant code files, conversation history, task instructions, tool outputs, and the model's own response. In practice, context windows fill up faster than expected, and the content that fills them has a direct impact on output quality.

Context budgeting is the practice of treating the context window as a finite resource and making deliberate allocation decisions. Which types of context get how many tokens? What gets truncated when the window is full? What gets prioritized when context competes? These decisions are as important as any other engineering optimization - more so, because they directly affect the quality of every agent action.

At L1 and L2, context budgeting is typically unmanaged. Developers paste in whatever seems relevant; tools auto-load files from the current project; conversation history accumulates until it gets cut off. The results are unpredictable: sometimes the right context is present, sometimes the most important instructions are truncated to make room for less important files.

At L3 (Systematic), organizations develop explicit context budgeting strategies. They know approximately how many tokens different context types consume, they prioritize context categories in order of importance, and they build tooling that enforces budget allocation. The goal is consistent, intentional context assembly rather than ad-hoc accumulation.
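The "tooling that enforces budget allocation" can start small. The sketch below shows the core idea - pack the highest-priority items into a fixed budget instead of accumulating ad hoc. The ~4-characters-per-token estimate and the item data are assumptions; a real implementation would use the model's own tokenizer:

```python
# Minimal sketch of intentional context assembly under a token budget.
# Assumes a rough ~4 chars/token estimate; swap in a real tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def assemble_context(items, budget_tokens):
    """items: list of (priority, label, text); lower priority number = keep first.
    Returns the kept (label, text) pairs and the tokens consumed."""
    assembled, used = [], 0
    for priority, label, text in sorted(items):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # item would overflow the budget: drop it
        assembled.append((label, text))
        used += cost
    return assembled, used

# Illustrative session: a huge conversation history competes with
# small, high-priority items for a 2,000-token budget.
items = [
    (1, "system_prompt", "You are a coding agent." * 10),
    (2, "task", "Refactor the billing module." * 5),
    (3, "conventions", "Use type hints everywhere." * 50),
    (4, "history", "..." * 100_000),
]
kept, used = assemble_context(items, budget_tokens=2_000)
kept_labels = [label for label, _ in kept]  # history is dropped, not the prompt
```

The point is the ordering: when something must be dropped, it is the lowest-priority item, not whatever happened to arrive last.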

Why It Matters

Context budgeting directly affects the reliability of agent behavior. An agent with a well-allocated context window behaves consistently. An agent with an overflowed or poorly prioritized context window behaves unpredictably - sometimes correctly, sometimes not, in ways that are hard to debug because the failure mode depends on what happened to get truncated.

  • Instructions that are truncated are instructions that are ignored - if your CLAUDE.md conventions scroll out of the context window before the agent processes the task, the agent doesn't know about them
  • Larger context is not always better - studies show that models pay less attention to content in the middle of very long contexts; strategically shorter, denser context can outperform longer, diluted context
  • Token cost is real - at L3+ with agent workflows running at scale, context token consumption is a significant API cost. Efficient context budgeting is directly reflected in the infrastructure budget.
  • Context quality matters more than context quantity - a well-chosen 50k-token context typically produces better results than an unfocused 200k-token context assembled without curation

The practical implication: treat your context window like RAM. Know how much you have, know how much each thing costs, and make deliberate allocation decisions.

Tip

Run a token audit on your current agent sessions. Use your model's token counting API to measure how much each context type consumes: system prompt, CLAUDE.md, conversation history, loaded files. You'll often find that conversation history is consuming 60-70% of the budget and crowding out more valuable context.
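A first-pass audit can be a few lines of code. The session structure below and the ~4-chars-per-token heuristic are assumptions for illustration - replace the estimate with your model's token counting API for real numbers:

```python
# Rough token audit sketch for one agent session: estimate each
# context type's share of the budget. Chars/4 is a crude stand-in
# for a real token counting API.

def audit(session: dict) -> dict:
    """Return each context type's estimated share of total tokens (%)."""
    counts = {k: max(1, len(v) // 4) for k, v in session.items()}
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

# Hypothetical session contents, sized to mimic a history-heavy session.
session = {
    "system_prompt": "..." * 400,
    "claude_md": "..." * 800,
    "loaded_files": "..." * 3_000,
    "conversation_history": "..." * 12_000,
}
shares = audit(session)  # history dominates this example session
```

In this synthetic example, conversation history accounts for roughly three quarters of the estimated budget - the pattern the tip describes.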

Getting Started

6 steps to get from here to the next level

Common Pitfalls

Mistakes teams actually make at this stage - and how to avoid them

How Different Roles See It

Bob, Head of Engineering

Bob's team has moved to using Claude Code for significant coding tasks. He's started receiving API cost reports and is surprised by how high they are. When he investigates, he finds that the token consumption per agent session is much higher than he expected - and not because the agents are doing more work, but because context is being assembled inefficiently: large files are loaded when only a few functions are needed, full conversation history is included in every call, and system prompts have grown unwieldy as people kept adding to them.

What Bob should do - role-specific action plan

Sarah, Productivity Lead

Sarah is tracking AI tooling costs and is struggling to explain why costs are growing faster than usage. The number of developer seats is stable, but monthly token consumption and API costs are climbing. When she asks engineering, she gets a vague answer: "We're using it more for bigger tasks." She needs to understand the cost drivers to make accurate forecasts.

What Sarah should do - role-specific action plan

Victor, Staff Engineer - AI Champion

Victor has been optimizing his agent workflows for quality and speed. He's noticed that his most effective sessions are also the ones with the most carefully curated context: small, focused, exactly what's needed. His least effective sessions are the ones where he let the tool auto-load everything and hoped for the best. He's developed an intuition for context curation but hasn't systematized it.

What Victor should do - role-specific action plan