Headroom Complete Guide 2026: Cut LLM Token Costs by 60-95% with This Open Source Tool

2026-06-04 Headroom LLM AI Assistant Claude Code Token Optimization Open Source

What Is Headroom?

Headroom is an open source LLM token compression layer that smartly compresses everything an AI agent reads—tool outputs, logs, RAG retrieval results, files, and conversation history—before sending it to the LLM.

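The core value: the same answer from only 5–40% of the input tokens. For developers who rely heavily on AI coding assistants (Claude Code, Cursor, Codex, etc.) every day, that means input-token costs drop by 60–95% on workloads that compress well. The benchmark table later in this guide shows the spread, including one workload that saves 47%.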

Why Do You Need Headroom?

Here's how modern AI coding assistants usually work:

User asks a question → Agent searches the codebase → Returns 100+ file snippets →
Agent gathers all context → Sends to LLM → LLM answers

The problem? Most of the context the agent sends is redundant. For example:

Search results return 100 code snippets, but only 5–10 are truly relevant
Log files are full of irrelevant timestamps and debug noise
RAG-retrieved document chunks share lots of repeated prefixes

Headroom solves this with a three-layer architecture:

ContentRouter — Detects content type (JSON, code, plain text) and picks the best compressor automatically
Smart Compressors — SmartCrusher (JSON), CodeCompressor (AST-aware), Kompress-base (HF model)
CCR (Reversible Compression) — Raw data stays local; the LLM can fetch it on demand when needed
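
The routing step can be pictured with a small sketch. This is not Headroom's ContentRouter: `detect_content_type` and `route` are hypothetical names, the heuristics (try `json.loads`, then `ast.parse`) are assumptions, and only Python is recognized as code here. The compressor names in the mapping are the ones listed above.

```python
import ast
import json

# Compressor names taken from the article; the mapping itself is illustrative.
COMPRESSORS = {"json": "SmartCrusher", "python": "CodeCompressor", "text": "Kompress-base"}

# AST node types that signal "this is really source code", not prose that happens to parse.
_CODE_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Import, ast.ImportFrom)


def detect_content_type(text: str) -> str:
    """Classify text as 'json', 'python', or 'text' using cheap parse attempts."""
    stripped = text.strip()
    if not stripped:
        return "text"
    # Only treat JSON containers as JSON, so a bare word or number stays 'text'.
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(stripped)
    except (SyntaxError, ValueError):
        return "text"
    if any(isinstance(node, _CODE_NODES) for node in tree.body):
        return "python"
    return "text"


def route(text: str) -> str:
    """Return the name of the compressor that would handle this content."""
    return COMPRESSORS[detect_content_type(text)]


print(route('{"users": [{"id": 1}]}'))
print(route("def f(x):\n    return x + 1"))
print(route("ERROR connection refused at 10:32"))
```

A real router would need more content types (other languages, logs, diffs) and a cheaper first pass than full parsing for very large inputs.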

Headroom vs Other Solutions

Feature	Headroom	Native Provider Compression	Manual Prompt Trimming
Token savings	60–95%	20–40%	Depends on human effort
Cross-agent shared memory	✅	❌	❌
Reversible compression (CCR)	✅	❌	N/A
Zero-code access (Proxy)	✅	❌	N/A
Runs locally	✅	❌ (cloud)	✅
Multi-language support	Python + TS	SDK-only	N/A

Installing Headroom

Headroom supports both Python and Node.js. We recommend pip for the full-featured version.

Python Installation

# Install the full version (proxy, MCP, ML, everything)
pip install "headroom-ai[all]"

# Or install sub-modules as needed
pip install "headroom-ai[proxy]"   # Proxy mode only
pip install "headroom-ai[mcp]"     # MCP server only
pip install "headroom-ai[ml]"      # Machine-learning compression models

Requirements: Python 3.10+

Node.js / TypeScript Installation

npm install headroom-ai

Verify Installation

# Check version and features
headroom --version

# Run a performance test to see compression results in your environment
headroom perf

Quick Start: Three Usage Modes

Headroom offers three ways to plug in, from easiest to most flexible.

Mode 1: Wrap Command (Easiest, Zero Config)

If you're already using an AI coding assistant, just one command turns on Headroom:

# Wrap Claude Code
headroom wrap claude

# Wrap Codex
headroom wrap codex

# Wrap Cursor
headroom wrap cursor

# Wrap Aider
headroom wrap aider

# Wrap GitHub Copilot CLI
headroom wrap copilot

After running, Headroom will automatically:

1. Start a local proxy service (default port 8787)
2. Update the corresponding agent's config to route requests through the proxy
3. Print setup instructions so you can verify it's working

Example: Wrapping Claude Code

$ headroom wrap claude

✅ Headroom proxy started on port 8787
✅ Claude Code config updated

To verify, run:
  claude "What is 2+2?"

You should see compression stats in the output.

From then on, every time you use the claude command, requests go through Headroom for compression before hitting the Anthropic API.

Mode 2: Proxy Mode (Zero Code Changes, Works with Any Language)

If you don't want to touch existing code, or you're using a tool that isn't supported by wrap, just start the proxy standalone:

# Start the proxy, listening on port 8787
headroom proxy --port 8787

Then point your app or agent config to http://localhost:8787 instead of the original Anthropic/OpenAI endpoint.

Example: OpenAI-Compatible Client

from openai import OpenAI

# Original config
# client = OpenAI(api_key="sk-...")

# Route through Headroom proxy
client = OpenAI(
    api_key="sk-...",
    base_url="http://localhost:8787/v1"  # Point to Headroom proxy
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain this code"}]
)

Every request passing through the proxy gets compressed automatically—no business logic changes needed.
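
If the proxy might not be running (in CI, for example), a client can probe the port before choosing an endpoint. Here is a minimal sketch using only the standard library. `proxy_is_listening` and `choose_base_url` are hypothetical helpers, not part of Headroom, and the fallback URL is OpenAI's public API base.

```python
import socket
from typing import Callable


def proxy_is_listening(host: str = "localhost", port: int = 8787, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def choose_base_url(
    proxy_url: str = "http://localhost:8787/v1",
    direct_url: str = "https://api.openai.com/v1",
    probe: Callable[[], bool] = proxy_is_listening,
) -> str:
    """Prefer the proxy when the probe succeeds, otherwise fall back to the direct endpoint."""
    return proxy_url if probe() else direct_url


# Usage (not executed here):
#   client = OpenAI(api_key="sk-...", base_url=choose_base_url())
```

The probe only shows that a process is listening on the port, not that it is Headroom, so treat it as a convenience check rather than a health check.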

Mode 3: Library Mode (Most Flexible, Embed in Your App)

If you're building your own AI application, call Headroom's compression functions directly:

Python Example

import anthropic
from headroom import compress

messages = [
    {"role": "user", "content": "Analyze this log file"},
    {"role": "assistant", "content": "Please provide the log content"},
    {"role": "user", "content": "[... 10,000 lines of logs ...]"}
]

# Compress messages
compressed = compress(messages, model="claude-3-sonnet")

print(f"Original tokens: {compressed.original_tokens}")
print(f"Compressed tokens: {compressed.compressed_tokens}")
print(f"Savings: {compressed.savings_percent}%")

# Send to LLM (the client reads ANTHROPIC_API_KEY from the environment)
anthropic_client = anthropic.Anthropic()
response = anthropic_client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=compressed.messages  # Use compressed messages
)

TypeScript Example

import { compress } from 'headroom-ai';

const messages = [
  { role: 'user', content: 'Analyze this codebase' },
  { role: 'assistant', content: 'Please provide the files' },
  { role: 'user', content: '[... 50 files ...]' }
];

const compressed = await compress(messages, { model: 'gpt-4' });

console.log(`Saved ${compressed.savingsPercent}% tokens`);

Core Features Explained

1. Smart Compression Algorithms

Headroom ships with multiple compression algorithms and auto-selects based on content type.

SmartCrusher — JSON Compression

Built for structured data (API responses, config files, database query results):

from headroom import SmartCrusher

data = {
    "users": [
        {"id": 1, "name": "Alice", "email": "alice@example.com", "created_at": "2024-01-01"},
        {"id": 2, "name": "Bob", "email": "bob@example.com", "created_at": "2024-01-02"},
        # ... 1,000+ records
    ]
}

crusher = SmartCrusher()
compressed = crusher.compress(data)

# Keeps key fields, strips redundant metadata
# Original: 50,000 tokens → Compressed: 5,000 tokens (90% savings)
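
To see why arrays of similar records compress so well, consider one simple technique: store the keys once and each record as a row of values. The sketch below illustrates that idea only. It is not SmartCrusher's algorithm, and `to_columnar` and `from_columnar` are hypothetical names.

```python
import json


def to_columnar(records: list[dict]) -> dict:
    """Factor repeated keys out of a list of same-shaped dicts."""
    if not records:
        return {"columns": [], "rows": []}
    columns = list(records[0].keys())
    # Require identical keys in identical order so the transform is lossless.
    if any(list(record.keys()) != columns for record in records):
        raise ValueError("records must share the same keys in the same order")
    return {"columns": columns, "rows": [[record[c] for c in columns] for record in records]}


def from_columnar(table: dict) -> list[dict]:
    """Rebuild the original list of dicts from the columnar form."""
    return [dict(zip(table["columns"], row)) for row in table["rows"]]


users = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com", "created_at": "2024-01-01"}
    for i in range(1, 101)
]
table = to_columnar(users)
print(f"{len(json.dumps(users))} chars -> {len(json.dumps(table))} chars")
```

This transform is lossless, so it cannot reach the savings quoted above on its own. Those require dropping or summarizing fields, which is a lossy step.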

CodeCompressor — AST-Aware Code Compression

Understands the syntax tree, keeping only essential structure:

from headroom import CodeCompressor

code = """
def calculate_total(items):
    '''Calculate total price with tax'''
    total = 0
    for item in items:
        if item.active:
            total += item.price * item.quantity
    tax = total * 0.08
    return total + tax
"""

compressor = CodeCompressor(language="python")
compressed = compressor.compress(code)

# Preserves function signatures, control flow, key variables
# Removes comments, whitespace, non-critical implementation details

Supported languages: Python, JavaScript, Go, Rust, Java, C++
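
To make "AST-aware" concrete, here is a rough sketch built on Python's standard `ast` module. It keeps function signatures and docstrings and replaces each body with `...`. This is an assumption about the general approach, not CodeCompressor's implementation, and `skeletonize` is a hypothetical name.

```python
import ast


def skeletonize(source: str) -> str:
    """Keep signatures and docstrings; replace function bodies with `...`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            new_body = []
            if ast.get_docstring(node, clean=False) is not None:
                new_body.append(node.body[0])  # the docstring expression
            # An Ellipsis expression keeps the function syntactically valid.
            new_body.append(ast.Expr(value=ast.Constant(value=...)))
            node.body = new_body
    return ast.unparse(ast.fix_missing_locations(tree))


code = """
def calculate_total(items):
    '''Calculate total price with tax'''
    total = 0
    for item in items:
        if item.active:
            total += item.price * item.quantity
    tax = total * 0.08
    return total + tax
"""

print(skeletonize(code))
```

This sketch is more aggressive than the behavior described above: it discards control flow entirely, while the article says CodeCompressor preserves control flow and key variables.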

Kompress-base — General Text Compression

A dedicated model, distributed via Hugging Face, for compressing natural language, documents, logs, and more:

# The model downloads automatically on first use (~500 MB)
headroom proxy

# Model cache location: ~/.cache/headroom/kompress-base

2. CCR Reversible Compression

CCR (Compress-Cache-Retrieve) is Headroom's core innovation: compressed data goes to the LLM, but the raw data stays on disk. If the LLM needs the full picture, it calls the headroom_retrieve tool.

Workflow:

1. Headroom compresses content → sends to LLM
2. LLM spots a need for more detail → calls headroom_retrieve(chunk_id)
3. Headroom returns the original data from local cache
4. LLM gets the full info and keeps reasoning

Benefits:

- LLM receives the compressed version first, using fewer tokens
- Full content is fetched only when necessary, avoiding huge upfront sends
- Raw data is never lost
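
The compress, cache, retrieve workflow can be sketched in a few lines. This is a toy version under stated assumptions: "compression" is just truncation to a preview, originals live in an in-memory dict rather than on disk, and `ReversibleStore` is a hypothetical class, not Headroom's API.

```python
import hashlib


class ReversibleStore:
    """Toy compress-cache-retrieve store: send a preview, keep the original."""

    def __init__(self, preview_chars: int = 200):
        self.preview_chars = preview_chars
        self._originals: dict[str, str] = {}

    def compress(self, content: str) -> dict:
        # A content hash gives a stable ID, so identical content maps to one cache entry.
        chunk_id = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        self._originals[chunk_id] = content
        preview = content[: self.preview_chars]
        return {
            "chunk_id": chunk_id,
            "preview": preview,
            "omitted_chars": len(content) - len(preview),
        }

    def retrieve(self, chunk_id: str) -> str:
        """Return the untouched original for a previously compressed chunk."""
        return self._originals[chunk_id]


store = ReversibleStore(preview_chars=40)
log = "ERROR db timeout after 30s\n" + "DEBUG heartbeat ok\n" * 500
sent_to_llm = store.compress(log)
print(sent_to_llm["preview"], sent_to_llm["omitted_chars"])

# Later, if the model asks for the full chunk by ID:
assert store.retrieve(sent_to_llm["chunk_id"]) == log
```

The design point is that the ID travels with the compressed text, so the model can request exactly the chunk it needs instead of the whole context being resent.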

3. Cross-Agent Shared Memory

If you juggle multiple AI assistants (say, Claude Code + Codex + Cursor), Headroom lets them share compressed context:

# Enable shared memory
headroom wrap claude --memory
headroom wrap codex --memory

Now, codebase indexes processed by Claude Code are cached, and Codex can reuse them without rescanning. This is especially handy for large projects.

4. MCP Server Integration

Headroom can run as an MCP (Model Context Protocol) server, callable by any MCP-compatible client:

# Install the MCP server
headroom mcp install

# Available MCP tools:
# - headroom_compress: compress any content
# - headroom_retrieve: retrieve original data
# - headroom_stats: view compression statistics

Example: Using in Claude Desktop

// claude_desktop_config.json
{
  "mcpServers": {
    "headroom": {
      "command": "headroom",
      "args": ["mcp", "serve"]
    }
  }
}

Real-World Cases

Case 1: Optimizing Codebase Search

Scenario: Ask an AI assistant to find the implementation of a feature in a 100k-line project.

Without Headroom:

Agent searches → returns 100 related files → sends all to LLM →
Token usage: 17,765 → high cost, slow response

With Headroom:

headroom wrap claude
claude "Find the user authentication module implementation"

Agent searches → Headroom compresses 100 files →
Token usage: 1,408 → 92% savings

Actual benchmark data:

Workload	Before	After	Savings
Code search (100 results)	17,765	1,408	92%
SRE incident debugging	65,694	5,118	92%
GitHub Issue classification	54,174	14,761	73%
Codebase exploration	78,502	41,254	47%
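
The savings column follows directly from the before and after token counts. A short check of the arithmetic, using the figures from the table (`savings_percent` is just a helper name for this example):

```python
def savings_percent(before: int, after: int) -> float:
    """Percentage of tokens removed: (1 - after / before) * 100."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (1 - after / before) * 100


# (before, after) token counts copied from the benchmark table.
workloads = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub Issue classification": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}

for name, (before, after) in workloads.items():
    print(f"{name}: {savings_percent(before, after):.0f}%")
```

These figures cover the tokens sent to the model. Tokens the model generates in its answer are not reduced by compressing the input.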

Case 2: Multi-Agent Collaboration

Scenario: Use Claude Code for code review, Codex for unit-test generation, and Cursor for refactoring.

Traditional approach: Each agent scans the codebase independently, burning duplicate tokens.

With Headroom shared memory:

# Step 1: Claude Code scans and caches
headroom wrap claude --memory
claude "Review code quality in src/auth/"

# Step 2: Codex reuses the cache
headroom wrap codex --memory
codex "Generate unit tests for src/auth/"
# Codex uses Claude's cached index—no rescan needed

# Step 3: Cursor continues to reuse
headroom wrap cursor --memory
cursor "Refactor error handling in src/auth/"

Savings: Second and subsequent agents save 40–60% on initial scan tokens.

Case 3: Log File Analysis

Scenario: Debugging a production issue requires an AI to analyze a 10,000-line log file.

from headroom import compress
import anthropic

# Read the log
with open("production.log") as f:
    logs = f.read()

messages = [
    {"role": "user", "content": f"Analyze this log and find the root cause:\n{logs}"}
]

# Compress
compressed = compress(messages, model="claude-3-sonnet")

print(f"Original: {compressed.original_tokens} tokens")
print(f"Compressed: {compressed.compressed_tokens} tokens")
print(f"Savings: {compressed.savings_percent}%")

# Send
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=2048,
    messages=compressed.messages
)

print(response.content[0].text)

Typical result: 65,694 tokens → 5,118 tokens (92% savings), with critical error info intact.
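
Much of the redundancy in logs comes from timestamps and repeated messages. The sketch below shows one simple way to collapse them: strip the leading timestamp, keep only warning and error lines, and count duplicates. It illustrates the idea only. It is not how Headroom compresses logs, and the timestamp pattern and level names are assumptions about the log format.

```python
import re
from collections import Counter

# Matches a leading ISO-like timestamp such as "2024-01-01 10:00:00" or "2024-01-01T10:00:00.123Z".
TIMESTAMP = re.compile(
    r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?(?:Z|[+-]\d{2}:?\d{2})?\s*"
)


def collapse_log(text: str, keep_levels: tuple[str, ...] = ("ERROR", "WARN", "FATAL")) -> str:
    """Drop timestamps and low-severity lines, then merge duplicate messages with a count."""
    counts: Counter[str] = Counter()
    for line in text.splitlines():
        message = TIMESTAMP.sub("", line).strip()
        if message and any(level in message for level in keep_levels):
            counts[message] += 1
    # Counter keeps first-seen order, so the output follows the log's chronology.
    return "\n".join(f"{msg} (x{n})" if n > 1 else msg for msg, n in counts.items())


sample = "\n".join([
    "2024-01-01 10:00:00 INFO service started",
    "2024-01-01 10:00:01 ERROR db timeout",
    "2024-01-01 10:00:02 DEBUG heartbeat ok",
    "2024-01-01 10:00:03 ERROR db timeout",
    "2024-01-01 10:00:04 WARN slow query",
])
print(collapse_log(sample))
```

Dropping INFO and DEBUG lines is lossy and can hide the context around an error, which is the gap that reversible retrieval of the original text is meant to cover.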

Performance Benchmarks

Headroom maintains accuracy on standard benchmarks:

Benchmark	Category	Samples	Baseline Accuracy	Headroom Accuracy	Diff / Compression
GSM8K	Math	100	0.870	0.870	±0.000
TruthfulQA	Factual	100	0.530	0.560	+0.030
SQuAD v2	QA	100	—	97%	19% compression
BFCL	Tool calling	100	—	97%	32% compression

Conclusion: Headroom delivers significant token savings while keeping accuracy intact.
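
With 100 samples per benchmark, small accuracy differences fall within sampling noise. A rough check, assuming each sample is an independent pass/fail trial (a simplification), uses the standard error of a proportion, sqrt(p(1 - p) / n):

```python
import math


def accuracy_standard_error(accuracy: float, samples: int) -> float:
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    if samples <= 0:
        raise ValueError("samples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / samples)


# Baseline and Headroom accuracies copied from the table above.
for name, baseline, with_headroom in [("GSM8K", 0.870, 0.870), ("TruthfulQA", 0.530, 0.560)]:
    se = accuracy_standard_error(baseline, 100)
    print(f"{name}: diff={with_headroom - baseline:+.3f}, standard error ~{se:.3f}")
```

For TruthfulQA the +0.030 difference is smaller than one standard error, so it is best read as "no measurable change" rather than as an improvement.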

Reproduce benchmarks:

python -m headroom.evals suite --tier 1

Advanced Configuration

CacheAligner — Boost Provider KV Cache Hit Rate

CacheAligner stabilizes the prompt prefix so Anthropic/OpenAI KV caches hit more often, cutting costs further:

from headroom import CompressionMiddleware

# Add the middleware to your existing ASGI app
# (`app` is, for example, a FastAPI or Starlette application instance)
app.add_middleware(CompressionMiddleware)
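
Provider prompt caches match on an exact prefix, so the general idea is to keep the start of every request byte-identical and push anything volatile to the end. The sketch below illustrates that principle only. It is not CacheAligner's implementation, and `build_prompt` and its parameters are hypothetical.

```python
import json
from os.path import commonprefix


def build_prompt(system: str, tools: list[dict], volatile: dict) -> str:
    """Put stable content first and volatile content last, serialized deterministically."""
    # Sorting tools and keys removes ordering differences that would break the shared prefix.
    stable_tools = json.dumps(sorted(tools, key=lambda tool: tool["name"]), sort_keys=True)
    volatile_part = json.dumps(volatile, sort_keys=True)
    return f"{system}\n{stable_tools}\n{volatile_part}"


request_a = build_prompt(
    "You are a helpful coding assistant.",
    [{"name": "search"}, {"name": "read_file"}],
    {"now": "2026-06-04T10:00:00"},
)
request_b = build_prompt(
    "You are a helpful coding assistant.",
    [{"name": "read_file"}, {"name": "search"}],  # same tools, different order
    {"now": "2026-06-04T10:05:00"},
)

shared = commonprefix([request_a, request_b])
print(f"{len(shared)} of {len(request_a)} characters form a shared, cacheable prefix")
```

If the timestamp sat at the top of the prompt instead, the two requests would diverge in the first line and nothing after it could be reused.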

headroom learn — Learn from Failures

Headroom can mine failed sessions and auto-write fixes into CLAUDE.md or AGENTS.md:

# Enable learning mode
headroom wrap claude --learn

# When Claude Code gives a wrong answer, Headroom will:
# 1. Analyze why it failed
# 2. Generate a corrective prompt
# 3. Write it to the project's CLAUDE.md
# 4. Apply the fix automatically in the next session

Custom Compression Strategies

Add custom compression behavior by writing a pipeline extension:

from headroom import PipelineExtension

class MyCustomCompressor(PipelineExtension):
    def on_input_received(self, event):
        # Run custom logic when input arrives
        print(f"Received {len(event.content)} bytes")

    def on_input_compressed(self, event):
        # Run after compression finishes
        print(f"Compressed to {event.compressed_size} bytes")

# Register the extension (`pipeline` is your existing Headroom pipeline instance)
pipeline.register(MyCustomCompressor())

FAQ

Q1: Does Headroom hurt answer quality?

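A: Not measurably on the benchmarks above: GSM8K accuracy is unchanged (0.870 → 0.870) and TruthfulQA is slightly higher (0.530 → 0.560), each on 100 samples. CCR reversible compression also lets the LLM fetch the full original content whenever the compressed version is not enough.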

Q2: How about data security?

A: Headroom runs locally. Compression, caching, and storage of the raw data all happen on your machine. Only the compressed prompt goes on to your LLM provider, plus any original chunks the model explicitly retrieves through CCR.

Q3: Which LLM providers are supported?

A: In theory, any provider, since Headroom works at the prompt level. Verified providers include:

- Anthropic (Claude)
- OpenAI (GPT-4, GPT-3.5)
- AWS Bedrock
- Google Gemini
- Any OpenAI-compatible API

Q4: Does compression add latency?

A: Local compression usually finishes in 10–50 ms, negligible compared to network requests (hundreds of ms to seconds). And because fewer tokens are sent, overall response time is often faster.

Q5: Is it for individual devs or teams?

A: Both.

- Individual developers: Save on API bills when using AI assistants daily
- Teams: Share memory across agents to avoid rescanning codebases; enforce unified compression policies for cost control

Summary

Headroom is a high-quality open source tool that solves a real pain point. For developers who lean heavily on AI coding assistants, it:

✅ Cuts token costs by 60–95% — saves money directly
✅ Zero-code onboarding — one command: headroom wrap claude
✅ Cross-agent shared memory — Claude, Codex, Cursor share caches
✅ Reversible compression (CCR) — raw data stays safe, LLM fetches on demand
✅ Runs locally: raw data stays on your machine, and only compressed context reaches the LLM provider

If your team spends more than $100/month on LLM APIs, Headroom will almost certainly save you a hefty chunk of cash.

Get started now:

pip install "headroom-ai[all]"
headroom wrap claude  # or whichever agent you use
headroom perf         # see your savings

Project: https://github.com/chopratejas/headroom
Docs: https://headroom-docs.vercel.app/docs
