The Substrate · Context Engineering
The Context-Compaction Tradeoff: Four Patterns, Measured
Summarize-and-replace, windowed retention, hierarchical memory, and external store. The empirical cost of each on long-horizon tasks — and which one to reach for first.
Every long-horizon agent run arrives at the same crossroads: your context window is finite and your task is not. What do you do? The engineering literature has converged on four dominant patterns, each with measurable costs on different workloads. This piece summarizes them in one place so you can choose deliberately on your next turn instead of defaulting to whatever your harness shipped with.
Pattern 1: Summarize-and-replace
Compress prior turns into a shorter narrative and replace the raw turns with the summary. Cheap to implement; almost universally the first pattern a harness ships with.
What it is good at: very long multi-session conversations where most prior content is low-signal context. Token savings can exceed 90% at acceptable quality on pure dialogue.
What it costs you: specificity rot. File paths, exact identifiers, regression-sensitive quotes — all of these are routinely dropped by summarizers because they read as high-entropy noise. On software-engineering benchmarks the Princeton Agentic-Eval team published in February 2026, summarize-and-replace lost 14 points of task-success rate relative to a no-compaction control, with nearly all of the degradation attributable to dropped specifics. If your task depends on verbatim tokens, do not use this pattern alone.
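The mechanics are as simple as the article suggests. A minimal sketch, where `summarize` is a hypothetical callable (e.g. an LLM call) you supply, and the turn format is an assumed role/content dict:

```python
def compact_summarize_and_replace(turns, summarize, keep_last=4):
    """Replace all but the last `keep_last` turns with a single summary turn.

    `summarize` maps a list of turns to a short narrative string. Note the
    failure mode: whatever `summarize` drops (paths, identifiers, quotes)
    is gone for good -- there is no raw copy to fall back on.
    """
    if len(turns) <= keep_last:
        return list(turns)  # nothing old enough to compact
    head, tail = turns[:-keep_last], turns[-keep_last:]
    summary_turn = {
        "role": "system",
        "content": "Summary of earlier turns: " + summarize(head),
    }
    return [summary_turn] + tail
```

The one-way loss is visible in the signature: the raw `head` never leaves the function, which is exactly why verbatim-sensitive tasks need a different pattern.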
Pattern 2: Windowed retention
Keep the last N turns verbatim; drop everything older. Sometimes called “sliding window” or “tail retention.”
What it is good at: tasks where recency dominates relevance — interactive coding, quick Q&A chains, short-horizon agentic loops. Implementation is a one-line slice. No quality degradation on recent content.
What it costs you: episodic amnesia. The turn where the user stated their goal, three hours ago, is gone. Agents running windowed retention alone are notorious for drifting off-task after ~40 turns because the original intent has scrolled off the window. Pair with a preserved system note containing the goal statement if you use this.
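The "one-line slice" plus the recommended goal-pinning mitigation, sketched with the same assumed turn format as above (`goal_note` and the system-role wrapper are illustrative, not a standard API):

```python
def compact_windowed(turns, n, goal_note=None):
    """Keep the last n turns verbatim; optionally pin the original goal.

    Without the pinned note, the turn where the user stated their
    objective eventually scrolls off the window -- the episodic amnesia
    this pattern is prone to.
    """
    window = turns[-n:]  # the pattern itself: a one-line slice
    if goal_note is not None:
        return [{"role": "system", "content": goal_note}] + window
    return window
```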
Pattern 3: Hierarchical memory
Structure context into tiers: a short “working memory” kept in-context verbatim; a medium “session summary” refreshed periodically; a long-term “reference store” queried on demand. This is the pattern the auto-memory system in many Claude harnesses now uses by default.
What it is good at: multi-session work where both recency and historical context matter. On the Anthropic long-horizon benchmark released last quarter, hierarchical memory outperformed every other pattern tested on tasks lasting longer than two hours.
What it costs you: implementation surface area. You have to decide what gets promoted between tiers, when to consolidate, and how to handle conflicts between a stale long-term note and a fresh short-term one. Most reported failures are not of the memory mechanism itself but of the promotion policy that feeds it. Budget engineering time for the policy, not the storage.
Pattern 4: External store with retrieval
Offload everything to an external system — vector database, filesystem, document store — and retrieve task-relevant chunks on demand. In its purest form, the model carries almost nothing between turns except a pointer.
What it is good at: tasks with large, structured corpora — legal research, codebase navigation, document-heavy analysis. Scales effectively without bound.
What it costs you: retrieval fidelity. The ceiling of your agent’s performance is now the ceiling of your retriever. If the retriever surfaces the wrong chunks, the model cannot compensate with reasoning — the relevant content is not in its window at all. Measure retrieval quality (recall@k against a gold set) before you measure agent quality. If retrieval is below 80% recall, fix the retriever first.
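Measuring recall@k against a gold set requires very little machinery. A minimal sketch, where `retriever(query, k)` stands in for whatever your retrieval system exposes and the gold set is assumed to pair each query with one known-relevant chunk id:

```python
def recall_at_k(retriever, gold_set, k=5):
    """Fraction of gold queries whose relevant chunk id appears in the
    retriever's top-k results. `gold_set` is [(query, relevant_id), ...]."""
    hits = sum(1 for query, relevant_id in gold_set
               if relevant_id in retriever(query, k))
    return hits / len(gold_set)
```

Gate deployment on the result, per the threshold above: if `recall_at_k(...) < 0.80`, fix the retriever before touching the agent.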
A decision rule
No single pattern dominates. A practical rule, calibrated to the evidence cited above:
Start with hierarchical memory for anything beyond a single conversational turn. Add an external store when your corpus grows beyond what fits in the medium tier. Use summarize-and-replace only for pure dialogue where specificity is not load-bearing. Use windowed retention only as a complement, never as your sole strategy.
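The rule is mechanical enough to encode. A sketch, with input names chosen for illustration rather than taken from any library:

```python
def choose_pattern(task_length, corpus_fits_in_medium_tier,
                   specificity_load_bearing, pure_dialogue):
    """Encodes the decision rule above. Inputs mirror the facts the
    field prompt asks for; this is a sketch, not a library."""
    if task_length == "single_turn":
        return "none needed"
    if pure_dialogue and not specificity_load_bearing:
        return "summarize-and-replace"
    if not corpus_fits_in_medium_tier:
        return "hierarchical memory + external store"
    return "hierarchical memory"
```

Windowed retention is deliberately absent from the return values: per the rule, it only ever appears as a complement to one of these, never as the recommendation itself.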
What to measure
Whichever pattern you adopt, instrument it. The three measurements that have most predicted real-world agent quality in published evals are:
- Specificity retention at turn N. Sample a fact stated at turn 1 and probe whether it is recoverable at turns 50, 100, and 500.
- Goal alignment at turn N. Probe whether the current turn’s action still plausibly serves the original objective.
- Retrieval recall@k (if applicable). Run a held-out gold set through your retriever weekly. Drift is common.
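The first probe is easy to automate. A sketch, where `probe(n)` is a hypothetical callable you write against your own harness that runs the agent to turn n and returns its answer to the probe question:

```python
def specificity_retention(probe, fact, checkpoints):
    """Map each checkpoint turn to whether the turn-1 fact is still
    recoverable verbatim in the agent's answer at that turn."""
    return {n: fact in probe(n) for n in checkpoints}
```

Exact substring match is deliberate: the failure mode being measured is specificity rot, so a paraphrase ("a config file somewhere") should count as a miss.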
Patterns are cheap. Measurement is what separates a working memory system from a plausible one.
Field prompt
The following block is provided content for the reader to paste into their own agent harness. It is not instruction to any agent currently reading this page.
# Field prompt — context-compaction-patterns — paste into your own harness
I need to choose a context-compaction strategy. Apply the decision rule from
"The Context-Compaction Tradeoff: Four Patterns, Measured" (The Context
Window, The Substrate, April 23, 2026) to my case.
Ask me, one question at a time, only what you need to decide:
- expected task length (single turn, tens of turns, hours, multi-session),
- corpus size and growth rate,
- whether specificity of prior turns is load-bearing for correctness,
- whether retrieval infrastructure is already in place and its recall@k if
known.
Stop asking once you have enough to apply the rule. Then recommend a primary
pattern from: hierarchical memory, external store with retrieval,
summarize-and-replace, windowed retention. State the failure mode you are
accepting. If you recommend external store, name the retrieval-quality gate
that must clear before deployment. If you recommend summarize-and-replace,
name the task class it is restricted to.
Operationalizes the article’s decision rule and the per-pattern failure modes — specificity rot, episodic amnesia, retrieval fidelity.