Quick Start
How It Works
| Phase | What happens |
|---|---|
| Head protect | System prompt + first turns kept verbatim |
| Tail protect | Last protect_last_n_tokens kept verbatim |
| Middle compress | LLM call produces a summary_target_tokens summary |
| Session record | CompressionSession appended with parent/child link |
Configuration Options
| Option | Type | Default | Description |
|---|---|---|---|
llm_client | LLM client | None | Provider used for summarization (uses deterministic fallback if None) |
auxiliary_model | str | "gpt-4o-mini" | Model used for the summarization call (often a cheaper model than the agent’s main LLM) |
protect_last_n_tokens | int | 20_000 | Tokens to preserve at the tail (recent context) |
summary_target_tokens | int | 750 | Target tokens for the middle summary |
enable_session_tracking | bool | True | Append CompressionSession entries for traceability |
use_accurate_tokenizer | bool | True | Use model-specific tokenizer; falls back to heuristic on import failure |
The
LLMContextCompressorOptimizer is exposed as LLM_CONTEXT_COMPRESSOR_OPTIMIZER and is not in OPTIMIZER_REGISTRY — users must instantiate it directly with an llm_client.Session Lineage
Track compression history and audit trails across repeated compactions:session_id: Unique identifier for this compressionparent_session_id: ID of previous compression for lineagecreated_at: Timestamporiginal_message_count/compressed_message_count: Message countsoriginal_tokens/compressed_tokens: Token countssummary_text: The LLM-generated summary content
CompressResult
| Field | Type | Description |
|---|---|---|
messages | List[Dict[str, Any]] | Compressed message list (head + summary + tail) |
tokens_saved | int | Number of tokens removed |
original_tokens | int | Token count before compression |
final_tokens | int | Token count after compression |
compression_ratio | float | Final tokens / original tokens |
session_id | Optional[str] | ID of this compression session |
parent_session_id | Optional[str] | ID of parent compression session |
summary_token_count | int | Tokens used by the summary |
head_preserved_count | int | Number of head messages preserved |
tail_preserved_count | int | Number of tail messages preserved |
middle_compressed_count | int | Number of middle messages compressed |
compression_efficiency | float | Percentage of tokens saved (property) |
Common Patterns
Use a cheap auxiliary model:Best Practices
Always set auxiliary_model to a smaller/cheaper model
Always set auxiliary_model to a smaller/cheaper model
Use a cost-effective model like
gpt-4o-mini for summarization while your main agent runs on gpt-4o or similar. This reduces costs without significantly impacting summary quality.Don't set summary_target_tokens too low
Don't set summary_target_tokens too low
Keep
summary_target_tokens at least 500 tokens. Summaries lose critical context below this threshold, leading to poor conversation continuity.Enable accurate tokenization in production
Enable accurate tokenization in production
Set
use_accurate_tokenizer=True for production deployments. This provides more accurate token budget calculations and better compression efficiency.Inspect compression_efficiency for monitoring
Inspect compression_efficiency for monitoring
Monitor
result.compression_efficiency to detect ineffective compactions. Values below 20% may indicate the conversation doesn’t benefit from compression.Related
Intelligent Compaction
Structured conversation summaries with topic/goal tracking
Context Optimizer
Overview of all optimization strategies including LLM compression
Context Strategies
Choosing the right optimization approach for your use case
Context Management
Complete guide to context window management features

