What happens when an agent exceeds its context window?

Depending on the provider and framework, either the request is rejected with an error or the oldest content is truncated, which loses early context. Even before the hard limit, performance degrades: the model loses track of earlier instructions, tool outputs, or conversation state. You also pay for the full context on every API call.

Which context window strategy is best for production agents?

A hybrid approach — sliding window for recent conversation, summarization for mid-term memory, and structured memory (vector DB or key-value store) for facts that need to persist across sessions. No single strategy handles all cases.

How many tokens should I allocate to each part of the context?

A good starting budget: instructions 15%, tools/definitions 10%, conversation history 40%, structured memory retrieval 20%, output buffer 15%. Adjust based on your agent's task profile.

Does a larger context window eliminate the need for memory management?

No. Even with 200K token windows, you still need management. Larger windows are slower, more expensive, and the model's attention degrades on tokens in the middle of the context. Memory strategies remain essential.

How do I handle context across multiple agent sessions?

Use structured memory (vector database or key-value store) to persist important facts. Summarize the previous session's outcomes into the next session's system prompt. Don't blindly pass the full conversation history.

AI agent context window: keeping your agent from forgetting

Strategies for managing LLM context windows in agents — sliding windows, summarization, structured memory, and when to use each approach.

The Anthropic documentation on context windows covers how models manage large contexts, including prompt caching and compaction — both strategies recommended in this post.

TL;DR: Context window management is the difference between an agent that works reliably and one that degrades over time. This post covers four strategies (sliding windows, summarization, structured memory, and hybrid approaches) with code examples and the specific failure mode each strategy addresses. Budget your context: 40% for conversation history, 20% for memory retrieval, 15% each for instructions and output buffer, and 10% for tools.

I’ve built agents that start strong and deteriorate over a 30-minute conversation. The answers get shorter. The reasoning gets sloppier. The agent “forgets” instructions you gave at the start.

The culprit? Context window mismanagement.

Every LLM has a context window — a limited number of tokens it can process at once. As your agent accumulates conversation history, tool outputs, and intermediate results, it fills that window. Once it’s full, something has to give. Either the model truncates early content, or the cost becomes absurd (you’re paying for thousands of tokens of stale context on every call).

Context window management isn’t a nice-to-have. It’s the difference between an agent that works reliably and one that degrades over time.

Key takeaways:

Every token in the context costs money and attention, so be intentional about what stays
Sliding windows work for short conversations but lose long-term context
Summarization preserves key information but introduces compression loss
Structured memory (vector DB, key-value store) is the most reliable for persistent facts
Hybrid approaches beat any single strategy for production agents

OpenAI’s prompt caching guide describes how caching reduces latency and cost for repeated context prefixes, making sliding-window approaches more efficient.

How context windows actually work

When you send a message to an LLM, you send the entire conversation history plus the new message. The model attends over every token in that input and then generates its response one token at a time.

This means:

Cost scales with total tokens — every API call costs proportionally to your entire context, not just the new input
Attention degrades — models focus less on tokens in the middle of long contexts (the “lost-in-the-middle” problem)
Latency increases — more tokens means more computation per generation

Here’s the math for a 10-turn conversation with tool calls:

Component	Approximate tokens
System prompt	500
10 user messages (avg 100 tokens each)	1,000
10 assistant responses (avg 300 tokens each)	3,000
20 tool calls with results (avg 200 tokens each)	4,000
Total	8,500 tokens

At Claude Sonnet pricing ($3/M input tokens), a single call with this 8,500-token context costs about $0.026 in input. If each of 100 calls in a day carried a context of this size, that would be $2.55 in input costs alone, and more if you exceed the window and retry.
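The arithmetic can be checked with a few lines of Python. This is a rough sketch: the per-turn growth figure is derived from the table (one user message, one assistant response, two tool calls), the function names are made up for illustration, and real token counts depend on the tokenizer.

```python
# Token counts come from the table above; the price is the $3/M figure quoted in the text.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00
SYSTEM_PROMPT_TOKENS = 500
TOKENS_ADDED_PER_TURN = 100 + 300 + 2 * 200  # user + assistant + two tool calls


def context_tokens_at_turn(turn: int) -> int:
    """Approximate context size once a given 1-indexed turn has been added."""
    return SYSTEM_PROMPT_TOKENS + TOKENS_ADDED_PER_TURN * turn


def input_cost(tokens: int) -> float:
    """Input cost in dollars for one call carrying this many tokens."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


def cumulative_input_tokens(turns: int) -> int:
    """Total input tokens billed across all calls, since each call resends the history."""
    return sum(context_tokens_at_turn(t) for t in range(1, turns + 1))


print(context_tokens_at_turn(10))    # 8500
print(round(input_cost(8500), 4))    # 0.0255
print(cumulative_input_tokens(10))   # 49000
```

The last number is the one that tends to surprise: because every call resends what came before, total billed input grows roughly with the square of the number of turns when nothing is trimmed.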

Strategy 1: Sliding window

Keep the last N messages, discard everything older.

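def apply_sliding_window(messages: list, max_messages: int = 20) -> list:
    """
    Keep the system prompt (index 0) and the last N-1 messages.
    """
    if not messages:
        return []
    system_prompt = messages[0]
    keep = max(max_messages - 1, 0)
    # Slice from messages[1:] so the system prompt is never duplicated
    # when the conversation is shorter than the window.
    recent = messages[1:][-keep:] if keep else []
    return [system_prompt] + recent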

When it works: Short conversations where only recent context matters — customer support, simple Q&A agents, short task execution.

When it fails: Long-running research agents that need early findings. If an agent discards the research brief it compiled 50 turns ago, it can’t write the final report.

Problem case: I had a code review agent that used a sliding window of 30 messages. After reviewing 5 files, it had already forgotten its review guidelines from the first exchange. It started contradicting its earlier feedback.
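A message count is a blunt limit, because one large tool result can outweigh twenty short replies. A variant that trims by token budget instead might look like the sketch below. The `estimate_tokens` heuristic (about four characters per token) is an assumption for illustration only; a real implementation would use the provider's tokenizer or token-counting endpoint.

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def token_sliding_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit within max_tokens."""
    if not messages:
        return []
    system_prompt, rest = messages[0], messages[1:]
    budget = max_tokens - estimate_tokens(system_prompt["content"])

    kept = []
    # Walk backwards from the newest message and stop at the first one that does not fit.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    kept.reverse()  # restore chronological order
    return [system_prompt] + kept
```

Like the message-count version, this still drops early content, so it does not help with the code review case above unless the guidelines live in the pinned system prompt.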

Strategy 2: Summarization

Periodically compress old conversation history into a summary and replace the compressed messages with it.

import instructor
from pydantic import BaseModel

class ConversationSummary(BaseModel):
    key_decisions: list[str]
    completed_tasks: list[str]
    pending_items: list[str]
    important_context: str

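async def summarize_history(messages: list, llm) -> ConversationSummary:
    """Compress conversation history into a structured summary.

    Assumes `llm` is an async client wrapped by instructor (which is what
    makes `response_model` available) and that `format_messages_for_summary`
    is a helper defined elsewhere that renders messages as plain text.
    """
    history_text = format_messages_for_summary(messages)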
    
    summary = await llm.chat.completions.create(
        model="claude-sonnet-4-20250514",
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation history:\n\n{history_text}"
        }],
        response_model=ConversationSummary,
    )
    return summary

async def compress_context(messages: list, llm, summary_frequency: int = 20):
    """
    If message count exceeds threshold, summarize and compress.
    """
    if len(messages) <= summary_frequency + 1:
        return messages
    
    # Messages to summarize (exclude system prompt and recent N)
    to_summarize = messages[1:-10]
    recent = messages[-10:]
    
    # summarize_history is a coroutine, so it has to be awaited
    summary = await summarize_history(to_summarize, llm)
    summary_message = {
        "role": "system",
        "content": f"[Compressed History]\n{summary.model_dump_json()}"
    }
    
    return [messages[0], summary_message] + recent

When it works: Long research sessions, multi-turn analysis, agents that need to reference early decisions. The summary preserves key information at ~5-10% of the original token count.

When it fails: When the summary itself becomes too large after multiple compressions. Compressing summaries of summaries leads to information loss. I’ve seen agents lose critical edge cases after 3-4 summarization rounds.

Problem case: A compliance-checking agent that needed to track every rule it had verified. The third compression dropped the note that certificate expiry had not been checked yet, and the agent signed off on a non-compliant deployment.
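One way to guard against this kind of loss is to keep must-not-lose state out of the summarized text entirely and re-render it verbatim on every turn. The sketch below is hypothetical (the `Checklist` class and the rule names are made up for illustration), but it shows the idea: the list of pending checks is data, not prose, so no compression round can drop an item.

```python
from dataclasses import dataclass, field


@dataclass
class Checklist:
    """Tracks every required check verbatim, outside the summarized history."""
    required: list[str]
    verified: dict[str, str] = field(default_factory=dict)  # rule -> evidence

    def mark_verified(self, rule: str, evidence: str) -> None:
        if rule not in self.required:
            raise KeyError(f"Unknown rule: {rule}")
        self.verified[rule] = evidence

    def pending(self) -> list[str]:
        return [rule for rule in self.required if rule not in self.verified]

    def can_sign_off(self) -> bool:
        """Sign-off is decided in code, not inferred by the model from a summary."""
        return not self.pending()

    def render(self) -> str:
        """Text block to include in the prompt on every turn, never summarized."""
        lines = ["[Checklist]"]
        for rule in self.required:
            status = "done" if rule in self.verified else "PENDING"
            lines.append(f"- {rule}: {status}")
        return "\n".join(lines)


checklist = Checklist(required=["tls version", "certificate expiry"])
checklist.mark_verified("tls version", "TLS 1.3 negotiated")
print(checklist.render())
print(checklist.can_sign_off())  # False
```

The summary can still describe what happened in the conversation; the checklist carries the part that has to stay exact.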

Strategy 3: Structured memory

Store important facts in a separate database. Retrieve relevant facts when needed.

from typing import TypedDict, List, Optional
import chromadb

class MemoryEntry(TypedDict):
    key: str
    content: str
    metadata: dict  # timestamps and other attributes live here

class StructuredMemory:
    """Key-value memory with semantic search for agent state."""
    
    def __init__(self, collection_name: str = "agent_memory"):
        # chromadb.Client() keeps data in memory only; use
        # chromadb.PersistentClient(path=...) to survive restarts.
        self.client = chromadb.Client()
        # get_or_create avoids an error when the collection already exists
        self.collection = self.client.get_or_create_collection(collection_name)
        
    async def remember(self, key: str, content: str, metadata: Optional[dict] = None):
        """Store a fact in memory, overwriting any entry with the same key."""
        self.collection.upsert(
            documents=[content],
            # Some Chroma versions reject empty metadata dicts, so pass None instead
            metadatas=[metadata] if metadata else None,
            ids=[key]
        )
    
    async def recall(self, query: str, n: int = 5) -> List[MemoryEntry]:
        """Retrieve relevant memories based on semantic similarity."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n,
        )
        return [
            MemoryEntry(
                key=results["ids"][0][i],
                content=results["documents"][0][i],
                metadata=results["metadatas"][0][i] or {},
            )
            for i in range(len(results["ids"][0]))
        ]
    
    async def update(self, key: str, content: str):
        """Update an existing memory."""
        self.collection.update(ids=[key], documents=[content])

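# Agent loop with structured memory.
# build_prompt, llm_call, and extract_facts are application-specific helpers
# that are not shown here.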
async def agent_loop_with_memory(task: str, memory: StructuredMemory):
    context = {
        "instruction": "You are a research agent with structured memory.",
        "recent_history": [],
        "recalled_facts": await memory.recall(task, n=3),
    }
    
    for step in range(10):
        # Build prompt with recalled facts
        prompt = build_prompt(task, context)
        response = await llm_call(prompt)
        
        # Extract and store important facts
        facts = extract_facts(response)
        for fact in facts:
            await memory.remember(
                key=fact["id"],
                content=fact["content"],
                metadata={"step": step, "source": task}
            )
        
        context["recent_history"].append(response)
        context["recalled_facts"] = await memory.recall(task, n=5)

When it works: Long-running agents, multi-session conversations, any agent that needs to remember specific facts across restarts. This is the most reliable approach for production.

When it fails: When retrieval returns irrelevant content and pollutes the context. Bad retrieval = bad context = bad agent behavior. It requires tuning the embedding model, chunk size, and retrieval count.
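A cheap first defense against irrelevant hits is a distance cutoff applied before anything reaches the prompt. The sketch below works on a plain dict shaped like a Chroma query result (nested lists under `ids`, `documents`, and `distances`, where a smaller distance means a closer match). The function name and the threshold are assumptions; a usable cutoff depends on the embedding model and distance metric, so it has to be tuned on real queries.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[dict]:
    """Drop retrieved memories whose distance to the query exceeds max_distance."""
    ids = results["ids"][0]
    documents = results["documents"][0]
    distances = results["distances"][0]

    kept = [
        {"key": key, "content": doc, "distance": dist}
        for key, doc, dist in zip(ids, documents, distances)
        if dist <= max_distance
    ]
    # Closest matches first, so truncating the list later keeps the best ones.
    return sorted(kept, key=lambda m: m["distance"])


# Hand-written sample in the same shape as a query result for a single query
sample = {
    "ids": [["a", "b", "c"]],
    "documents": [["unrelated note", "deploy target is staging", "release is Friday"]],
    "distances": [[0.9, 0.2, 0.5]],
}
print([m["key"] for m in filter_by_distance(sample, max_distance=0.6)])  # ['b', 'c']
```

Returning fewer than `n` memories, or none at all, is usually better than filling the slot with weak matches.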

Strategy 4: Hybrid approaches

The best production systems combine all three:

from datetime import datetime

class ContextManager:
    """
    Hybrid context management:
    - Sliding window for recent interaction
    - Summarization for mid-term history
    - Structured memory for persistent facts
    """
    
    def __init__(
        self,
        llm,
        recent_window: int = 15,
        summary_threshold: int = 25,
    ):
        self.llm = llm  # client passed to summarize_history
        self.recent_window = recent_window
        self.summary_threshold = summary_threshold
        self.memory = StructuredMemory()
        self.summary: ConversationSummary | None = None
        self.recent_history: list = []  # read by build_context
    
    async def build_context(self, task: str) -> dict:
        # 1. Get persistent facts from structured memory
        memories = await self.memory.recall(task, n=5)
        
        # 2. Use summary for mid-term history
        summary_text = (
            self.summary.model_dump_json()
            if self.summary
            else "No prior context"
        )
        
        # 3. Recent history uses sliding window
        recent = self.recent_history[-self.recent_window:]
        
        return {
            "persistent_facts": memories,
            "history_summary": summary_text,
            "recent_exchanges": recent,
        }
    
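    async def after_turn(self, turn_messages: list):
        # turn_messages is treated as the full history so far; keep it so
        # build_context can apply the sliding window to it
        self.recent_history = turn_messages
        
        # Decide whether to summarize
        if len(turn_messages) > self.summary_threshold: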
            self.summary = await summarize_history(
                turn_messages[:-self.recent_window],
                self.llm
            )
        
        # Extract and store persistent facts
        facts = extract_key_facts(turn_messages[-1])
        for fact in facts:
            await self.memory.remember(
                key=fact["id"],
                content=fact["content"],
                metadata={"timestamp": datetime.now().isoformat()}
            )

This is what I run in production. The context stays manageable, under 6K tokens for most turns, and the agent has access to the information it needs.
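The dict returned by `build_context` still has to be turned into prompt text. A minimal sketch of that step follows. The `render_context` name and the section labels are made up for illustration, and it assumes recent exchanges are role/content dicts and persistent facts carry a `content` field.

```python
def render_context(context: dict) -> str:
    """Render the three context layers as one labeled text block."""
    facts = "\n".join(
        f"- {memory['content']}" for memory in context["persistent_facts"]
    ) or "- (none)"
    recent = "\n".join(
        f"{message['role']}: {message['content']}"
        for message in context["recent_exchanges"]
    ) or "(none)"

    # Slow-changing material goes first, the newest exchanges last.
    return (
        f"[Persistent facts]\n{facts}\n\n"
        f"[History summary]\n{context['history_summary']}\n\n"
        f"[Recent exchanges]\n{recent}"
    )


example = {
    "persistent_facts": [{"key": "f1", "content": "Deploy target is staging", "metadata": {}}],
    "history_summary": "No prior context",
    "recent_exchanges": [{"role": "user", "content": "What is left to review?"}],
}
print(render_context(example))
```

Labeling each layer makes it easier to see, when debugging a bad answer, which layer supplied the misleading text.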

Context budgeting

Beyond choosing a strategy, you need to think about how you allocate tokens within the window. I call this context budgeting:

Component	Allocation	Why
Instructions	15%	System prompt, task definition, rules
Tools	10%	Tool definitions, function schemas
Conversation history	40%	Recent exchanges + compressed history
Memory retrieval	20%	Facts from structured memory
Output buffer	15%	Room for the model to generate

These ratios change based on the agent type:

Code review agent: Allocate more to tools (file reading, git operations) and less to conversation history
Research agent: Allocate more to memory retrieval (stored findings) and conversation history
Customer support agent: Allocate more to recent history, less to tools
Data analysis agent: Allocate more to output buffer (for large result sets)
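Turning the percentages into token counts is simple arithmetic, sketched below. The default shares are the ones from the table; the `code_review` profile is a made-up example of shifting budget toward tools, not a recommendation, and `allocate_tokens` is a hypothetical helper name.

```python
# Shares are whole percentages so the arithmetic stays in integers.
DEFAULT_BUDGET = {
    "instructions": 15,
    "tools": 10,
    "conversation_history": 40,
    "memory_retrieval": 20,
    "output_buffer": 15,
}


def allocate_tokens(window_tokens: int, shares: dict[str, int] = DEFAULT_BUDGET) -> dict[str, int]:
    """Split a context window into per-component token limits."""
    total = sum(shares.values())
    if total != 100:
        raise ValueError(f"Budget shares must sum to 100, got {total}")
    return {name: window_tokens * pct // 100 for name, pct in shares.items()}


# Hypothetical profile for a tool-heavy agent
code_review = {**DEFAULT_BUDGET, "tools": 25, "conversation_history": 25}

print(allocate_tokens(200_000)["conversation_history"])   # 80000
print(allocate_tokens(200_000, code_review)["tools"])     # 50000
```

The check that shares sum to 100 catches the easy mistake of raising one component without lowering another.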

When each strategy fails

Strategy	Failure mode	Symptom
Sliding window	Loses early context	Agent contradicts earlier decisions
Summarization	Compression loss	Missing critical details
Structured memory	Bad retrieval	Agent acts on irrelevant context
Hybrid	Configuration complexity	Overhead outweighs benefit

Real-world recommendations

For a typical production agent, here’s what I’d suggest:

Start with a sliding window: it is the simplest option, and sufficient for agents with fewer than 15 turns
Add structured memory when the agent needs to remember facts across sessions or turns
Add summarization when conversations exceed 25 turns and you need mid-term context
Set a context budget: measure your actual token usage and adjust allocations
Monitor: log context token counts per turn, and alert when averages exceed 80% of the window
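The monitoring step can start very small. The sketch below logs each turn's context size and flags when a rolling average crosses 80% of the window. The `ContextMonitor` class, the logger name, and the 20-turn rolling window are assumptions for illustration; the token count itself should come from the usage figures your provider returns.

```python
import logging
from collections import deque

logger = logging.getLogger("context_monitor")


class ContextMonitor:
    """Logs per-turn context size and flags a high rolling average."""

    def __init__(self, window_tokens: int, alert_ratio: float = 0.8, history: int = 20):
        self.window_tokens = window_tokens
        self.alert_ratio = alert_ratio
        self.recent = deque(maxlen=history)  # last `history` token counts

    def record(self, turn: int, context_tokens: int) -> bool:
        """Record one turn; return True when the rolling average is over the threshold."""
        self.recent.append(context_tokens)
        average = sum(self.recent) / len(self.recent)
        logger.info("turn=%d context_tokens=%d rolling_avg=%.0f", turn, context_tokens, average)

        over = average > self.alert_ratio * self.window_tokens
        if over:
            logger.warning(
                "rolling average %.0f exceeds %.0f%% of the %d-token window",
                average, self.alert_ratio * 100, self.window_tokens,
            )
        return over


monitor = ContextMonitor(window_tokens=200_000)
if monitor.record(turn=1, context_tokens=8_500):
    pass  # trigger summarization or page someone
```

Alerting on the average rather than a single turn avoids noise from one unusually large tool result.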

The goal isn’t to maximize context utilization. It’s to minimize context while keeping the agent effective. The best context is the one you’re not paying for.

Related: AI agent error handling patterns — what to do when your agent breaks. Also see Preventing AI agent hallucinations — 7 techniques for more reliable agents.

Related: What is an AI agent? A complete beginner’s guide for developers — understanding the fundamentals of AI agents before diving into context management strategies.