AI agent context window: keeping your agent from forgetting
Strategies for managing LLM context windows in agents — sliding windows, summarization, structured memory, and when to use each approach.
The Anthropic documentation on context windows covers how models manage large contexts, including prompt caching and compaction — both strategies recommended in this post.
TL;DR: Context window management is the difference between an agent that works reliably and one that degrades over time. This post covers 4 strategies (sliding windows, summarization, structured memory, and hybrid approaches) with code examples and the specific failure mode each strategy addresses. Budget your context: 40% for conversation history, 20% for memory retrieval, 15% for instructions, 10% for tools, and 15% for the output buffer.
I’ve built agents that start strong and deteriorate over a 30-minute conversation. The answers get shorter. The reasoning gets sloppier. The agent “forgets” instructions you gave at the start.
The culprit? Context window mismanagement.
Every LLM has a context window: a limited number of tokens it can process at once. As your agent accumulates conversation history, tool outputs, and intermediate results, it fills that window. Once it’s full, something has to give. Either early content gets truncated (or the request is rejected outright), or the cost becomes absurd because you’re paying for thousands of tokens of stale context on every call.
Context window management isn’t a nice-to-have. It’s the difference between an agent that works reliably and one that degrades over time.
Key takeaways:
- Every token in the context costs money and attention — be intentional about what stays
- Sliding windows work for short conversations but lose long-term context
- Summarization preserves key information but introduces compression loss
- Structured memory (vector DB, key-value store) is the most reliable for persistent facts
- Hybrid approaches beat any single strategy for production agents
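OpenAI’s prompt caching guide describes how caching reduces latency and cost for repeated context prefixes. Caching only helps when the start of the prompt is identical between calls, so it pays to keep the system prompt and tool definitions fixed at the front of the context, whichever strategy you choose.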
How context windows actually work
When you send a message to an LLM, you send the entire conversation history plus the new message. The model attends over every token in that input, then generates its response one token at a time.
This means:
- Cost scales with total tokens — every API call costs proportionally to your entire context, not just the new input
- Attention degrades — models focus less on tokens in the middle of long contexts (the “lost-in-the-middle” problem)
- Latency increases — more tokens means more computation per generation
Here’s the math for a 10-turn conversation with tool calls:
| Component | Approximate tokens |
|---|---|
| System prompt | 500 |
| 10 user messages (avg 100 tokens each) | 1,000 |
| 10 assistant responses (avg 300 tokens each) | 3,000 |
| 20 tool calls with results (avg 200 tokens each) | 4,000 |
| Total | 8,500 tokens |
At Claude Sonnet pricing ($3/M input tokens), a single call with this 8,500-token context costs about $0.026 in input. After 100 such calls across a day, that’s $2.55 just in input costs, and more if you exceed the window and retry.
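The arithmetic above can be reproduced with a short script. This is a minimal sketch: the names `input_cost` and `cumulative_input_tokens` are illustrative, the price is the $3/M figure quoted above, and treating every turn as the same size is a simplifying assumption.

```python
PRICE_PER_MILLION_INPUT = 3.00  # USD per million input tokens (figure quoted above)

# Token counts taken from the table above
COMPONENTS = {
    "system_prompt": 500,
    "user_messages": 10 * 100,
    "assistant_responses": 10 * 300,
    "tool_calls": 20 * 200,
}


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Input cost in USD for one call that sends `tokens` tokens."""
    return tokens * price_per_million / 1_000_000


def cumulative_input_tokens(system_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Rough total of input tokens over `turns` calls, assuming every call
    resends the system prompt plus all equal-sized turns so far."""
    return sum(system_tokens + turn * tokens_per_turn for turn in range(1, turns + 1))


total_tokens = sum(COMPONENTS.values())
per_call = input_cost(total_tokens)
# Assumed turn size: one user message, one response, two tool calls = 800 tokens
growing_total = cumulative_input_tokens(500, 800, 10)

print(f"context size: {total_tokens} tokens")
print(f"one call at that size: ${per_call:.4f}")
print(f"10 calls with a growing context: {growing_total} tokens, ${input_cost(growing_total):.3f}")
```

Under these assumptions the ten calls together send 49,000 input tokens, nearly six times the size of the final context. Resending history, not the new message, is what drives the bill.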
Strategy 1: Sliding window
Keep the last N messages, discard everything older.
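```python
def apply_sliding_window(messages: list, max_messages: int = 20) -> list:
    """
    Keep the system prompt (index 0) and the last N-1 messages.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    system_prompt = messages[0]
    # Slice from the non-system messages so a short history never
    # duplicates the system prompt.
    rest = messages[1:]
    keep = max_messages - 1
    recent = rest[-keep:] if keep > 0 else []
    return [system_prompt] + recent
```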
When it works: Short conversations where only recent context matters — customer support, simple Q&A agents, short task execution.
When it fails: Long-running research agents that need early findings. If an agent discards the research brief it compiled 50 turns ago, it can’t write the final report.
Problem case: I had a code review agent that used a sliding window of 30 messages. After reviewing 5 files, it had already forgotten its review guidelines from the first exchange. It started contradicting its earlier feedback.
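A message count is a blunt limit, because one tool result can be larger than ten chat messages. A variant that trims by token budget instead is sketched below. The `estimate_tokens` heuristic (about 4 characters per token) and the function names are assumptions for illustration; swap in your provider's tokenizer for real counts.

```python
from typing import Callable


def estimate_tokens(message: dict) -> int:
    """Crude token estimate: assumes roughly 4 characters per token."""
    return max(1, len(message["content"]) // 4)


def apply_token_window(
    messages: list[dict],
    max_tokens: int,
    count_tokens: Callable[[dict], int] = estimate_tokens,
) -> list[dict]:
    """Keep the system prompt plus the newest messages that fit in max_tokens."""
    system_prompt, rest = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system_prompt)
    kept = []
    # Walk backwards from the newest message and stop at the first one
    # that no longer fits, so the kept history stays contiguous.
    for message in reversed(rest):
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    kept.reverse()
    return [system_prompt] + kept
```

This still forgets early context, so it shares the failure mode described above. It only makes the cutoff predictable in tokens rather than in messages.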
Strategy 2: Summarization
Periodically compress old conversation history into a summary, then replace the compressed messages with that summary.
```python
import instructor
from pydantic import BaseModel


class ConversationSummary(BaseModel):
    key_decisions: list[str]
    completed_tasks: list[str]
    pending_items: list[str]
    important_context: str


async def summarize_history(messages: list, llm) -> ConversationSummary:
    """Compress conversation history into a structured summary.

    `llm` is assumed to be an async client wrapped by instructor,
    which is what makes `response_model` available.
    """
    # format_messages_for_summary is a helper you supply: it flattens
    # the message dicts into plain text.
    history_text = format_messages_for_summary(messages)
    summary = await llm.chat.completions.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,  # assumed cap; Anthropic models need an explicit limit
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation history:\n\n{history_text}"
        }],
        response_model=ConversationSummary,
    )
    return summary


async def compress_context(messages: list, llm, summary_frequency: int = 20):
    """
    If message count exceeds threshold, summarize and compress.
    """
    if len(messages) <= summary_frequency + 1:
        return messages
    # Messages to summarize (exclude system prompt and recent N)
    to_summarize = messages[1:-10]
    recent = messages[-10:]
    # summarize_history is a coroutine, so it must be awaited
    summary = await summarize_history(to_summarize, llm)
    summary_message = {
        "role": "system",
        "content": f"[Compressed History]\n{summary.model_dump_json()}"
    }
    return [messages[0], summary_message] + recent
```
When it works: Long research sessions, multi-turn analysis, agents that need to reference early decisions. The summary preserves key information at ~5-10% of the original token count.
When it fails: When the summary itself becomes too large after multiple compressions. Compressing summaries of summaries leads to information loss. I’ve seen agents lose critical edge cases after 3-4 summarization rounds.
Problem case: A compliance-checking agent needed to track every rule it had verified. The third compression dropped the pending item noting that certificate expiry had not been checked, and the agent signed off on a non-compliant deployment.
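One way to limit the loss from summaries of summaries is to merge the list fields deterministically instead of asking the model to re-compress them. The sketch below is an assumption about how that could look, not part of the code above: `Summary` is a stand-in dataclass with the same fields as `ConversationSummary`, and `merge_summaries` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Summary:
    """Stand-in with the same fields as ConversationSummary."""
    key_decisions: list[str] = field(default_factory=list)
    completed_tasks: list[str] = field(default_factory=list)
    pending_items: list[str] = field(default_factory=list)
    important_context: str = ""


def _union(old: list[str], new: list[str]) -> list[str]:
    """Concatenate two lists, dropping exact duplicates but keeping order."""
    seen: set[str] = set()
    merged = []
    for item in old + new:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged


def merge_summaries(previous: Summary, latest: Summary) -> Summary:
    """Carry every list item forward; only the free-text field grows."""
    completed = _union(previous.completed_tasks, latest.completed_tasks)
    # A pending item disappears only when it shows up as completed,
    # never because a later summary failed to mention it.
    pending = [
        item
        for item in _union(previous.pending_items, latest.pending_items)
        if item not in completed
    ]
    context = "\n".join(
        part for part in (previous.important_context, latest.important_context) if part
    )
    return Summary(
        key_decisions=_union(previous.key_decisions, latest.key_decisions),
        completed_tasks=completed,
        pending_items=pending,
        important_context=context,
    )
```

The trade-off is that the merged lists only grow, so this suits checklists like the compliance case better than open-ended chat. Matching is by exact string, which is a deliberate simplification.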
Strategy 3: Structured memory
Store important facts in a separate database. Retrieve relevant facts when needed.
```python
from datetime import datetime, timezone
from typing import TypedDict, List

import chromadb


class MemoryEntry(TypedDict):
    key: str
    content: str
    timestamp: str
    metadata: dict


class StructuredMemory:
    """Key-value memory with semantic search for agent state."""

    def __init__(self, collection_name: str = "agent_memory"):
        self.client = chromadb.Client()
        # get_or_create avoids an error when the collection already exists
        self.collection = self.client.get_or_create_collection(collection_name)

    async def remember(self, key: str, content: str, metadata: dict | None = None):
        """Store a fact in memory."""
        # Record the write time so recall() can fill MemoryEntry.timestamp
        stamped = {
            **(metadata or {}),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        # upsert replaces the entry when the key already exists
        self.collection.upsert(
            documents=[content],
            metadatas=[stamped],
            ids=[key]
        )

    async def recall(self, query: str, n: int = 5) -> List[MemoryEntry]:
        """Retrieve relevant memories based on semantic similarity."""
        results = self.collection.query(
            query_texts=[query],
            n_results=n,
        )
        entries = []
        for key, content, metadata in zip(
            results["ids"][0], results["documents"][0], results["metadatas"][0]
        ):
            metadata = metadata or {}
            entries.append(MemoryEntry(
                key=key,
                content=content,
                timestamp=metadata.get("timestamp", ""),
                metadata=metadata,
            ))
        return entries

    async def update(self, key: str, content: str):
        """Update an existing memory."""
        self.collection.update(ids=[key], documents=[content])
```
```python
# Agent loop with structured memory.
# build_prompt, llm_call, and extract_facts are helpers you supply.
async def agent_loop_with_memory(task: str, memory: StructuredMemory):
    context = {
        "instruction": "You are a research agent with structured memory.",
        "recent_history": [],
        "recalled_facts": await memory.recall(task, n=3),
    }
    for step in range(10):
        # Build prompt with recalled facts
        prompt = build_prompt(task, context)
        response = await llm_call(prompt)
        # Extract and store important facts
        facts = extract_facts(response)
        for fact in facts:
            await memory.remember(
                key=fact["id"],
                content=fact["content"],
                metadata={"step": step, "source": task}
            )
        context["recent_history"].append(response)
        # Refresh recalled facts so the next step sees what was just stored
        context["recalled_facts"] = await memory.recall(task, n=5)
    return context["recent_history"]
```
When it works: Long-running agents, multi-session conversations, any agent that needs to remember specific facts across restarts. This is the most reliable approach for production.
When it fails: When retrieval returns irrelevant content and pollutes the context. Bad retrieval = bad context = bad agent behavior. Getting it right requires tuning the embedding model, chunk size, and retrieval count.
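A cheap first defense against context pollution is to drop weak matches before they reach the prompt. The sketch below filters a Chroma-style query result by distance. `filter_by_distance` and the `max_distance` threshold are assumptions for illustration; the right threshold depends on your embedding model and the collection's distance metric, so treat the number as something to tune.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[dict]:
    """Keep only hits whose distance is at or below max_distance.

    `results` is expected in the nested shape a Chroma query returns
    for a single query text: results["ids"][0], results["documents"][0],
    and results["distances"][0] are parallel lists.
    """
    hits = zip(
        results["ids"][0],
        results["documents"][0],
        results["distances"][0],
    )
    return [
        {"key": key, "content": content, "distance": distance}
        for key, content, distance in hits
        if distance <= max_distance  # smaller distance means a closer match
    ]


# Example with a hand-written result in the same shape
sample = {
    "ids": [["fact-1", "fact-2", "fact-3"]],
    "documents": [["TLS cert expires in March", "User prefers tabs", "Lunch menu"]],
    "distances": [[0.21, 0.48, 1.37]],
}
relevant = filter_by_distance(sample, max_distance=0.5)
print([hit["key"] for hit in relevant])
```

Returning fewer than `n` facts, or none at all, is usually better than padding the prompt with unrelated ones.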
Strategy 4: Hybrid approaches
The best production systems combine all three:
```python
from datetime import datetime


class ContextManager:
    """
    Hybrid context management:
    - Sliding window for recent interaction
    - Summarization for mid-term history
    - Structured memory for persistent facts
    """

    def __init__(
        self,
        llm,
        recent_window: int = 15,
        summary_threshold: int = 25,
    ):
        self.llm = llm  # client passed through to summarize_history
        self.recent_window = recent_window
        self.summary_threshold = summary_threshold
        self.memory = StructuredMemory()
        self.summary: ConversationSummary | None = None
        self.recent_history: list = []

    async def build_context(self, task: str) -> dict:
        # 1. Get persistent facts from structured memory
        memories = await self.memory.recall(task, n=5)
        # 2. Use summary for mid-term history
        summary_text = (
            self.summary.model_dump_json()
            if self.summary
            else "No prior context"
        )
        # 3. Recent history uses sliding window
        recent = self.recent_history[-self.recent_window:]
        return {
            "persistent_facts": memories,
            "history_summary": summary_text,
            "recent_exchanges": recent,
        }

    async def after_turn(self, turn_messages: list):
        # turn_messages is the full message list so far; keep a copy
        # so build_context can slice the recent window from it.
        self.recent_history = list(turn_messages)
        # Decide whether to summarize
        if len(turn_messages) > self.summary_threshold:
            self.summary = await summarize_history(
                turn_messages[:-self.recent_window],
                self.llm
            )
        # Extract and store persistent facts
        # (extract_key_facts is a helper you supply)
        facts = extract_key_facts(turn_messages[-1])
        for fact in facts:
            await self.memory.remember(
                key=fact["id"],
                content=fact["content"],
                metadata={"timestamp": datetime.now().isoformat()}
            )
```
This is what I run in production. The context stays manageable (under 6K tokens for most turns) and the agent has access to exactly the information it needs.
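The dict returned by `build_context` still has to be turned into messages for the model. One possible rendering is sketched below. `render_context` is a hypothetical helper, and the section labels and ordering are assumptions rather than anything the code above prescribes.

```python
def render_context(system_prompt: str, context: dict) -> list[dict]:
    """Turn a build_context() result into a chat-style message list."""
    facts = "\n".join(
        f"- {memory['content']}" for memory in context["persistent_facts"]
    ) or "(none)"
    # Stable instructions go first; facts and summary follow because
    # they change from turn to turn.
    preamble = (
        f"{system_prompt}\n\n"
        f"Known facts:\n{facts}\n\n"
        f"Summary of earlier conversation:\n{context['history_summary']}"
    )
    return [{"role": "system", "content": preamble}] + list(context["recent_exchanges"])


# Example with a hand-built context in the same shape
example_context = {
    "persistent_facts": [{"key": "f1", "content": "Deploy target is staging"}],
    "history_summary": "No prior context",
    "recent_exchanges": [{"role": "user", "content": "Run the checks again"}],
}
example_messages = render_context("You are a deployment assistant.", example_context)
print(example_messages[0]["content"])
```

Keeping the fixed instructions at the very start of the first message also leaves the longest possible unchanged prefix between calls.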
Context budgeting
Beyond choosing a strategy, you need to think about how you allocate tokens within the window. I call this context budgeting:
| Component | Allocation | Why |
|---|---|---|
| Instructions | 15% | System prompt, task definition, rules |
| Tools | 10% | Tool definitions, function schemas |
| Conversation history | 40% | Recent exchanges + compressed history |
| Memory retrieval | 20% | Facts from structured memory |
| Output buffer | 15% | Room for the model to generate |
These ratios change based on the agent type:
- Code review agent: Allocate more to tools (file reading, git operations) and less to conversation history
- Research agent: Allocate more to memory retrieval (stored findings) and conversation history
- Customer support agent: Allocate more to recent history, less to tools
- Data analysis agent: Allocate more to output buffer (for large result sets)
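To make the table concrete, the percentages can be turned into per-component token limits. This is a minimal sketch: `allocate_budget` is a hypothetical helper, the default shares are the ones from the table above, and the 200,000-token window in the example is only an illustrative size.

```python
# Shares in whole percent, taken from the budgeting table above
DEFAULT_SHARES = {
    "instructions": 15,
    "tools": 10,
    "conversation_history": 40,
    "memory_retrieval": 20,
    "output_buffer": 15,
}


def allocate_budget(window_tokens: int, shares: dict[str, int] = DEFAULT_SHARES) -> dict[str, int]:
    """Split a context window into per-component token limits."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    # Integer math avoids floating-point rounding surprises
    return {name: window_tokens * percent // 100 for name, percent in shares.items()}


budget = allocate_budget(200_000)
for name, tokens in budget.items():
    print(f"{name:>22}: {tokens:>7,} tokens")
```

A code review agent could pass its own `shares`, for example raising `tools` and lowering `conversation_history`, as long as the values still add up to 100.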
When each strategy fails
| Strategy | Failure mode | Symptom |
|---|---|---|
| Sliding window | Loses early context | Agent contradicts earlier decisions |
| Summarization | Compression loss | Missing critical details |
| Structured memory | Bad retrieval | Agent acts on irrelevant context |
| Hybrid | Configuration complexity | Overhead outweighs benefit |
Real-world recommendations
For a typical production agent, here’s what I’d suggest:
- Start with a sliding window: it is the simplest option, and sufficient for agents with fewer than 15 turns
- Add structured memory when the agent needs to remember facts across sessions or turns
- Add summarization when conversations exceed 25 turns and you need mid-term context
- Budget your context: measure your actual token usage and adjust allocations
- Monitor: log context token counts per turn, and alert when averages exceed 80% of the window
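The monitoring step can start as a few lines of code. The sketch below keeps a rolling average of context size and flags it when the average passes 80% of the window. `ContextMonitor`, its parameter names, and the 20-turn sample size are assumptions for illustration.

```python
import logging
from collections import deque

logger = logging.getLogger("context_monitor")


class ContextMonitor:
    """Track context size per turn and flag a high rolling average."""

    def __init__(self, window_tokens: int, alert_ratio: float = 0.8, sample_size: int = 20):
        self.window_tokens = window_tokens
        self.alert_ratio = alert_ratio
        # Only the most recent `sample_size` turns count toward the average
        self.samples: deque[int] = deque(maxlen=sample_size)

    def record(self, context_tokens: int) -> bool:
        """Log one turn; return True when the rolling average is over the limit."""
        self.samples.append(context_tokens)
        average = sum(self.samples) / len(self.samples)
        logger.info("context tokens: %d (rolling average %.0f)", context_tokens, average)
        over_limit = average > self.alert_ratio * self.window_tokens
        if over_limit:
            logger.warning(
                "average context is above %.0f%% of the %d-token window",
                self.alert_ratio * 100,
                self.window_tokens,
            )
        return over_limit
```

Call `record()` once per turn with the input token count your provider reports, and wire the `True` return value into whatever alerting you already have.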
The goal isn’t to maximize context utilization. It’s to minimize context while keeping the agent effective. The best context is the one you’re not paying for.
Related: AI agent error handling patterns — what to do when your agent breaks. Also see Preventing AI agent hallucinations — 7 techniques for more reliable agents.
Related: What is an AI agent? A complete beginner’s guide for developers — understanding the fundamentals of AI agents before diving into context management strategies.