What Is Headroom?
Headroom is an open source LLM token compression layer that smartly compresses everything an AI agent reads—tool outputs, logs, RAG retrieval results, files, and conversation history—before sending it to the LLM.
The core value: the same answer from as little as 5–40% of the tokens. For developers who rely heavily on AI coding assistants (Claude Code, Cursor, Codex, etc.) every day, that can cut input-token costs by 60–95%, depending on the workload.
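To make the cost claim concrete, here is a rough back-of-the-envelope sketch. The token volumes and per-million-token prices are illustrative assumptions, not Headroom or provider figures, and the sketch assumes compression only reduces input tokens while output-token spend stays the same.

```python
def monthly_cost(input_tokens, output_tokens, input_price, output_price, input_kept=1.0):
    """Estimate monthly spend in dollars.

    Prices are per million tokens. `input_kept` is the fraction of input
    tokens still sent after compression (1.0 means no compression).
    """
    sent_input = input_tokens * input_kept
    return (sent_input * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 200M input tokens and 5M output tokens per month,
# at assumed prices of $3 (input) and $15 (output) per million tokens.
before = monthly_cost(200_000_000, 5_000_000, 3.0, 15.0)
after = monthly_cost(200_000_000, 5_000_000, 3.0, 15.0, input_kept=0.10)
print(f"before: ${before:.2f}, after: ${after:.2f}")
```

Under these assumptions, keeping 10% of the input tokens lowers the total bill by 80% rather than 90%, because output tokens are billed as before. The more input-heavy the workload, the closer the bill reduction gets to the token reduction.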
Why Do You Need Headroom?
Here's how modern AI coding assistants usually work:
User asks a question → Agent searches the codebase → Returns 100+ file snippets →
Agent gathers all context → Sends to LLM → LLM answers
The problem? Most of the context the agent sends is redundant. For example:
- Search results return 100 code snippets, but only 5–10 are truly relevant
- Log files are full of irrelevant timestamps and debug noise
- RAG-retrieved document chunks share lots of repeated prefixes
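As a toy illustration of how much of a log can be redundant, the sketch below strips leading timestamps and folds repeated messages into a single line with a count. It is not Headroom's algorithm; the timestamp format and the `collapse_log` helper are assumptions made for this example.

```python
import re
from collections import Counter

# Matches a leading ISO-style timestamp such as "2024-01-01 12:00:00 " or
# "2024-01-01T12:00:00.123Z ". The format is an assumption for this sketch.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?Z?\s+")


def collapse_log(lines):
    """Strip leading timestamps and fold repeated messages into one line each."""
    counts = Counter()
    for line in lines:
        message = TIMESTAMP.sub("", line.rstrip("\n"))
        if message:
            counts[message] += 1
    # Counter keeps first-seen order, so the output follows the log's order.
    return [f"{msg} (x{n})" if n > 1 else msg for msg, n in counts.items()]


sample = [
    "2024-01-01 12:00:00 DEBUG heartbeat ok",
    "2024-01-01 12:00:01 DEBUG heartbeat ok",
    "2024-01-01 12:00:02 ERROR db connection refused",
    "2024-01-01 12:00:03 DEBUG heartbeat ok",
]
print(collapse_log(sample))
```

Four lines collapse to two here, and the single error line survives untouched.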
Headroom solves this with a three-layer architecture:
- ContentRouter — Detects content type (JSON, code, plain text) and picks the best compressor automatically
- Smart Compressors — SmartCrusher (JSON), CodeCompressor (AST-aware), Kompress-base (HF model)
- CCR (Reversible Compression) — Raw data stays local; the LLM can fetch it on demand when needed
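The routing idea in the first layer can be sketched in a few lines. This is a simplified illustration, not Headroom's ContentRouter: `detect_content_type`, `route`, and the stand-in compressors are hypothetical names, and code detection is limited to Python source.

```python
import ast
import json


def detect_content_type(text):
    """Classify text as "json", "code" (Python only in this sketch), or "text"."""
    stripped = text.strip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    try:
        tree = ast.parse(stripped)
    except (SyntaxError, ValueError):
        return "text"
    # Require a definition or import so that prose which happens to parse
    # (for example a single word) is not labeled as code.
    code_nodes = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef,
                  ast.Import, ast.ImportFrom)
    if any(isinstance(node, code_nodes) for node in tree.body):
        return "code"
    return "text"


def route(text, compressors):
    """Apply the compressor registered for the detected content type."""
    return compressors[detect_content_type(text)](text)


# Stand-in compressors: minify JSON, pass code and text through unchanged.
compressors = {
    "json": lambda t: json.dumps(json.loads(t), separators=(",", ":")),
    "code": lambda t: t,
    "text": lambda t: t,
}
print(route('{"a": 1, "b": [1, 2, 3]}', compressors))
```

A dictionary of callables keeps the dispatch table easy to extend: supporting another content type means adding one detector branch and one entry.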
Headroom vs Other Solutions
| Feature | Headroom | Native Provider Compression | Manual Prompt Trimming |
|---|---|---|---|
| Token savings | 60–95% | 20–40% | Depends on human effort |
| Cross-agent shared memory | ✅ | ❌ | ❌ |
| Reversible compression (CCR) | ✅ | ❌ | N/A |
| Zero-code access (Proxy) | ✅ | ❌ | N/A |
| Runs locally | ✅ | ❌ (cloud) | ✅ |
| Multi-language support | Python + TS | SDK-only | N/A |
Installing Headroom
Headroom supports both Python and Node.js. We recommend pip for the full-featured version.
Python Installation
# Install the full version (proxy, MCP, ML, everything)
pip install "headroom-ai[all]"
# Or install sub-modules as needed
pip install "headroom-ai[proxy]" # Proxy mode only
pip install "headroom-ai[mcp]" # MCP server only
pip install "headroom-ai[ml]" # Machine-learning compression models
Requirements: Python 3.10+
Node.js / TypeScript Installation
npm install headroom-ai
Verify Installation
# Check version and features
headroom --version
# Run a performance test to see compression results in your environment
headroom perf
Quick Start: Three Usage Modes
Headroom offers three ways to plug in, from easiest to most flexible.
Mode 1: Wrap Command (Easiest, Zero Config)
If you're already using an AI coding assistant, just one command turns on Headroom:
# Wrap Claude Code
headroom wrap claude
# Wrap Codex
headroom wrap codex
# Wrap Cursor
headroom wrap cursor
# Wrap Aider
headroom wrap aider
# Wrap GitHub Copilot CLI
headroom wrap copilot
After running, Headroom will automatically:
1. Start a local proxy service (default port 8787)
2. Update the corresponding agent's config to route requests through the proxy
3. Print setup instructions so you can verify it's working
Example: Wrapping Claude Code
$ headroom wrap claude
✅ Headroom proxy started on port 8787
✅ Claude Code config updated
To verify, run:
claude "What is 2+2?"
You should see compression stats in the output.
From then on, every time you use the claude command, requests go through Headroom for compression before hitting the Anthropic API.
Mode 2: Proxy Mode (Zero Code Changes, Works with Any Language)
If you don't want to touch existing code, or you're using a tool that isn't supported by wrap, just start the proxy standalone:
# Start the proxy, listening on port 8787
headroom proxy --port 8787
Then point your app or agent config to http://localhost:8787 instead of the original Anthropic/OpenAI endpoint.
Example: OpenAI-Compatible Client
from openai import OpenAI
# Original config
# client = OpenAI(api_key="sk-...")
# Route through Headroom proxy
client = OpenAI(
    api_key="sk-...",
    base_url="http://localhost:8787/v1",  # Point to Headroom proxy
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain this code"}],
)
Every request passing through the proxy gets compressed automatically—no business logic changes needed.
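Before pointing a client at the proxy, it can help to confirm that something is actually listening on the port. The sketch below uses only the standard library; `proxy_is_listening` is a hypothetical helper, and it checks only that a TCP connection succeeds, not that the listener is Headroom.

```python
import socket


def proxy_is_listening(host="localhost", port=8787, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("proxy reachable:", proxy_is_listening())
```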
Mode 3: Library Mode (Most Flexible, Embed in Your App)
If you're building your own AI application, call Headroom's compression functions directly:
Python Example
import anthropic
from headroom import compress
messages = [
    {"role": "user", "content": "Analyze this log file"},
    {"role": "assistant", "content": "Please provide the log content"},
    {"role": "user", "content": "[... 10,000 lines of logs ...]"},
]
# Compress messages
compressed = compress(messages, model="claude-3-sonnet")
print(f"Original tokens: {compressed.original_tokens}")
print(f"Compressed tokens: {compressed.compressed_tokens}")
print(f"Savings: {compressed.savings_percent}%")
# Send to LLM (the client reads ANTHROPIC_API_KEY from the environment)
anthropic_client = anthropic.Anthropic()
response = anthropic_client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,
    messages=compressed.messages,  # Use compressed messages
)
TypeScript Example
import { compress } from 'headroom-ai';
const messages = [
  { role: 'user', content: 'Analyze this codebase' },
  { role: 'assistant', content: 'Please provide the files' },
  { role: 'user', content: '[... 50 files ...]' }
];
const compressed = await compress(messages, { model: 'gpt-4' });
console.log(`Saved ${compressed.savingsPercent}% tokens`);
Core Features Explained
1. Smart Compression Algorithms
Headroom ships with multiple compression algorithms and auto-selects based on content type.
SmartCrusher — JSON Compression
Built for structured data (API responses, config files, database query results):
from headroom import SmartCrusher
data = {
    "users": [
        {"id": 1, "name": "Alice", "email": "alice@example.com", "created_at": "2024-01-01"},
        {"id": 2, "name": "Bob", "email": "bob@example.com", "created_at": "2024-01-02"},
        # ... 1,000+ records
    ]
}
crusher = SmartCrusher()
compressed = crusher.compress(data)
# Keeps key fields, strips redundant metadata
# Original: 50,000 tokens → Compressed: 5,000 tokens (90% savings)
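To give a feel for why large, uniform record lists compress so well, here is a deliberately simple sketch: keep a few sample records and describe the rest by count and field names. This is not SmartCrusher's algorithm; `summarize_records` and its `keep` parameter are hypothetical.

```python
import json


def summarize_records(records, keep=3):
    """Keep the first `keep` records and describe the rest by count and keys."""
    keys = sorted({key for record in records for key in record})
    return {
        "sample": records[:keep],
        "omitted": max(len(records) - keep, 0),
        "keys": keys,
    }


users = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com"}
    for i in range(1, 1001)
]
summary = summarize_records(users)
print(len(json.dumps(users)), "->", len(json.dumps(summary)), "characters")
```

A real compressor has to be smarter about which records to keep (for example, outliers or error rows), which is where a reversible scheme helps: anything dropped can still be fetched later.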
CodeCompressor — AST-Aware Code Compression
Understands the syntax tree, keeping only essential structure:
from headroom import CodeCompressor
code = """
def calculate_total(items):
    '''Calculate total price with tax'''
    total = 0
    for item in items:
        if item.active:
            total += item.price * item.quantity
    tax = total * 0.08
    return total + tax
"""
compressor = CodeCompressor(language="python")
compressed = compressor.compress(code)
# Preserves function signatures, control flow, key variables
# Removes comments, whitespace, non-critical implementation details
Supported languages: Python, JavaScript, Go, Rust, Java, C++
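The general idea of AST-aware compression can be illustrated with Python's standard `ast` module: parse the source, keep signatures and docstrings, and elide function bodies. This is a minimal sketch of the technique, not CodeCompressor's implementation; `BodyStripper` and `skeleton` are hypothetical names, and a real compressor would also keep control flow and key variables.

```python
import ast


class BodyStripper(ast.NodeTransformer):
    """Replace each function body with its docstring (if any) plus `...`."""

    def visit_FunctionDef(self, node):
        # Nested functions are dropped together with the body they live in.
        new_body = []
        docstring = ast.get_docstring(node)
        if docstring is not None:
            new_body.append(ast.Expr(ast.Constant(docstring)))
        new_body.append(ast.Expr(ast.Constant(...)))
        node.body = new_body
        return node

    visit_AsyncFunctionDef = visit_FunctionDef


def skeleton(source):
    """Return the source with function bodies elided, keeping signatures."""
    tree = BodyStripper().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


source = '''
def calculate_total(items):
    """Calculate total price with tax"""
    total = 0
    for item in items:
        if item.active:
            total += item.price * item.quantity
    tax = total * 0.08
    return total + tax
'''
print(skeleton(source))
```

Working on the syntax tree rather than on raw text means the output is always valid Python, which a regex-based trimmer cannot guarantee.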
Kompress-base — General Text Compression
A dedicated Hugging Face model for natural language, documents, logs, and more:
# The model downloads automatically on first use (~500 MB)
headroom proxy
# Model cache location: ~/.cache/headroom/kompress-base
2. CCR Reversible Compression
CCR (Compress-Cache-Retrieve) is Headroom's core innovation: compressed data goes to the LLM, but the raw data stays on disk. If the LLM needs the full picture, it calls the headroom_retrieve tool.
Workflow:
1. Headroom compresses content → sends to LLM
2. LLM spots a need for more detail → calls headroom_retrieve(chunk_id)
3. Headroom returns the original data from local cache
4. LLM gets the full info and keeps reasoning
Benefits:
- LLM receives the compressed version first, using fewer tokens
- Full content is fetched only when necessary, avoiding huge upfront sends
- Raw data is never lost
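The Compress-Cache-Retrieve workflow above can be sketched as a small store that hands out a shortened view plus an ID and keeps the original for later lookup. This is an illustration under assumptions: `ReversibleStore` is a hypothetical class, an in-memory dict stands in for the on-disk cache, and truncation stands in for real compression.

```python
import hashlib


class ReversibleStore:
    """Toy Compress-Cache-Retrieve store; a dict stands in for the local cache."""

    def __init__(self, preview_chars=80):
        self.preview_chars = preview_chars
        self._cache = {}

    def compress(self, text):
        """Cache the original and return (chunk_id, shortened preview)."""
        chunk_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._cache[chunk_id] = text
        preview = text[: self.preview_chars]
        if len(text) > self.preview_chars:
            preview += f"... [truncated, retrieve with chunk_id={chunk_id}]"
        return chunk_id, preview

    def retrieve(self, chunk_id):
        """Return the original text for a chunk_id produced by compress()."""
        return self._cache[chunk_id]


store = ReversibleStore(preview_chars=40)
original = "ERROR timeout contacting payments service\n" * 200
chunk_id, preview = store.compress(original)
print(preview)
print(store.retrieve(chunk_id) == original)
```

Deriving the ID from a content hash means identical content always maps to the same cache entry, so repeated tool outputs are stored once.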
3. Cross-Agent Shared Memory
If you juggle multiple AI assistants (say, Claude Code + Codex + Cursor), Headroom lets them share compressed context:
# Enable shared memory
headroom wrap claude --memory
headroom wrap codex --memory
Now, codebase indexes processed by Claude Code are cached, and Codex can reuse them without rescanning. This is especially handy for large projects.
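One way to picture cross-agent reuse is a content-addressed cache on disk: any process that sees the same content finds the index a previous process already built. The sketch below is an assumption about the general pattern, not Headroom's storage format; `SharedIndexCache` and `build_index` are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class SharedIndexCache:
    """File-backed cache keyed by content hash, shared via a common directory."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_build(self, content, build):
        """Return (index, was_cached); call build(content) only on a miss."""
        path = self._path(content)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8")), True
        index = build(content)
        path.write_text(json.dumps(index), encoding="utf-8")
        return index, False


def build_index(source):
    # Stand-in for an expensive scan: list the names of top-level functions.
    return [line.split("(")[0][4:] for line in source.splitlines()
            if line.startswith("def ")]


with tempfile.TemporaryDirectory() as shared_dir:
    source = "def login(user):\n    pass\n\ndef logout(user):\n    pass\n"
    first = SharedIndexCache(shared_dir).get_or_build(source, build_index)   # "agent A"
    second = SharedIndexCache(shared_dir).get_or_build(source, build_index)  # "agent B"
    print(first, second)
```

The second lookup returns the cached index without calling `build_index`, which is the saving the shared-memory feature is aiming for.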
4. MCP Server Integration
Headroom can run as an MCP (Model Context Protocol) server, callable by any MCP-compatible client:
# Install the MCP server
headroom mcp install
# Available MCP tools:
# - headroom_compress: compress any content
# - headroom_retrieve: retrieve original data
# - headroom_stats: view compression statistics
Example: Using in Claude Desktop
// claude_desktop_config.json
{
  "mcpServers": {
    "headroom": {
      "command": "headroom",
      "args": ["mcp", "serve"]
    }
  }
}
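If the config file already lists other MCP servers, the Headroom entry should be merged in rather than pasted over them. A minimal sketch, assuming the config has been loaded into a dict; `add_headroom_server` is a hypothetical helper and the `filesystem` entry is a made-up example of a pre-existing server.

```python
import json


def add_headroom_server(config):
    """Return a copy of a Claude Desktop config dict with a "headroom" entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["headroom"] = {"command": "headroom", "args": ["mcp", "serve"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server already registered.
existing = {"mcpServers": {"filesystem": {"command": "npx", "args": ["some-server"]}}}
print(json.dumps(add_headroom_server(existing), indent=2))
```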
Real-World Cases
Case 1: Optimizing Codebase Search
Scenario: Ask an AI assistant to find the implementation of a feature in a 100k-line project.
Without Headroom:
Agent searches → returns 100 related files → sends all to LLM →
Token usage: 17,765 → high cost, slow response
With Headroom:
headroom wrap claude
claude "Find the user authentication module implementation"
Agent searches → Headroom compresses 100 files →
Token usage: 1,408 → 92% savings
Actual benchmark data:
| Workload | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub Issue classification | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |
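The Savings column follows directly from the Before and After token counts. A short check, using only the numbers from the table (`savings_percent` is a helper defined here for illustration):

```python
def savings_percent(before, after):
    """Percentage of tokens removed, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)


rows = {
    "Code search (100 results)": (17_765, 1_408),
    "SRE incident debugging": (65_694, 5_118),
    "GitHub Issue classification": (54_174, 14_761),
    "Codebase exploration": (78_502, 41_254),
}
for name, (before, after) in rows.items():
    print(f"{name}: {savings_percent(before, after)}%")
```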
Case 2: Multi-Agent Collaboration
Scenario: Use Claude Code for code review, Codex for unit-test generation, and Cursor for refactoring.
Traditional approach: Each agent scans the codebase independently, burning duplicate tokens.
With Headroom shared memory:
# Step 1: Claude Code scans and caches
headroom wrap claude --memory
claude "Review code quality in src/auth/"
# Step 2: Codex reuses the cache
headroom wrap codex --memory
codex "Generate unit tests for src/auth/"
# Codex uses Claude's cached index—no rescan needed
# Step 3: Cursor continues to reuse
headroom wrap cursor --memory
cursor "Refactor error handling in src/auth/"
Savings: Second and subsequent agents save 40–60% on initial scan tokens.
Case 3: Log File Analysis
Scenario: Debugging a production issue requires an AI to analyze a 10,000-line log file.
from headroom import compress
import anthropic
# Read the log
with open("production.log") as f:
    logs = f.read()
messages = [
    {"role": "user", "content": f"Analyze this log and find the root cause:\n{logs}"}
]
# Compress
compressed = compress(messages, model="claude-3-sonnet")
print(f"Original: {compressed.original_tokens} tokens")
print(f"Compressed: {compressed.compressed_tokens} tokens")
print(f"Savings: {compressed.savings_percent}%")
# Send
client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=2048,
    messages=compressed.messages,
)
print(response.content[0].text)
Typical result: 65,694 tokens → 5,118 tokens (92% savings), with critical error info intact.
Performance Benchmarks
Headroom maintains accuracy on standard benchmarks:
| Benchmark | Category | Samples | Baseline Accuracy | Headroom Accuracy | Notes |
|---|---|---|---|---|---|
| GSM8K | Math | 100 | 0.870 | 0.870 | Diff ±0.000 |
| TruthfulQA | Factual | 100 | 0.530 | 0.560 | Diff +0.030 |
| SQuAD v2 | QA | 100 | — | 97% | 19% compression |
| BFCL | Tool calling | 100 | — | 97% | 32% compression |
Conclusion: on these 100-sample runs, Headroom delivers significant token savings with no measured loss in accuracy.
Reproduce benchmarks:
python -m headroom.evals suite --tier 1
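When reading accuracy differences measured on 100 samples, it helps to know how much sampling noise to expect. The sketch below computes the standard error of a proportion with the usual binomial formula; it is a general statistics aid, not part of Headroom's eval suite.

```python
import math


def accuracy_standard_error(accuracy, samples):
    """Standard error of a proportion measured on `samples` independent items."""
    return math.sqrt(accuracy * (1 - accuracy) / samples)


se = accuracy_standard_error(0.530, 100)
print(f"standard error: {se:.3f}")
```

At 0.530 accuracy and 100 samples, one standard error is about 0.05, so a difference of +0.030 is within sampling noise. The data is consistent with "no degradation", but larger sample sizes would be needed to detect small changes in either direction.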
Advanced Configuration
CacheAligner — Boost Provider KV Cache Hit Rate
CacheAligner stabilizes the prompt prefix so Anthropic/OpenAI KV caches hit more often, cutting costs further:
from headroom import CompressionMiddleware
# Add middleware in your ASGI app
app.add_middleware(CompressionMiddleware)
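Provider prompt caches match on identical prompt prefixes, so anything volatile near the start of a prompt (a timestamp, a request ID) defeats the cache for everything after it. The sketch below illustrates the effect by measuring the shared prefix of two consecutive prompts. It does not show CacheAligner's implementation; the prompt layout and helper names are assumptions.

```python
import os

SYSTEM = "You are a code assistant. Follow the project conventions.\n"


def shared_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    return len(os.path.commonprefix([a, b]))


def volatile_first(timestamp, question):
    # The timestamp changes on every request and sits at the very start.
    return f"Current time: {timestamp}\n{SYSTEM}{question}"


def stable_first(timestamp, question):
    # Stable instructions lead; volatile content comes after them.
    return f"{SYSTEM}Current time: {timestamp}\n{question}"


t1, t2 = "2024-01-01T10:00:00", "2024-01-01T10:05:00"
unstable = shared_prefix_len(volatile_first(t1, "Explain auth.py"),
                             volatile_first(t2, "Explain auth.py"))
stable = shared_prefix_len(stable_first(t1, "Explain auth.py"),
                           stable_first(t2, "Explain auth.py"))
print(f"shared prefix: volatile-first={unstable}, stable-first={stable}")
```

With the stable layout, the whole system prompt is shared between requests and is therefore eligible for a cache hit; with the volatile layout, the prompts diverge inside the timestamp before the system prompt even begins.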
headroom learn — Learn from Failures
Headroom can mine failed sessions and auto-write fixes into CLAUDE.md or AGENTS.md:
# Enable learning mode
headroom wrap claude --learn
# When Claude Code gives a wrong answer, Headroom will:
# 1. Analyze why it failed
# 2. Generate a corrective prompt
# 3. Write it to the project's CLAUDE.md
# 4. Apply the fix automatically in the next session
Custom Compression Strategies
Extend custom compression behavior via Pipeline:
from headroom import PipelineExtension
class MyCustomCompressor(PipelineExtension):
    def on_input_received(self, event):
        # Run custom logic when input arrives
        print(f"Received {len(event.content)} bytes")

    def on_input_compressed(self, event):
        # Run after compression finishes
        print(f"Compressed to {event.compressed_size} bytes")
# Register the extension (assumes an existing `pipeline` object)
pipeline.register(MyCustomCompressor())
FAQ
Q1: Does Headroom hurt answer quality?
A: Not measurably in the benchmarks above: accuracy stays the same (GSM8K: 0.870 → 0.870). CCR reversible compression also lets the LLM fetch the full content whenever it needs it.
Q2: How about data security?
A: Headroom runs entirely locally. All compression, caching, and storage happen on your machine. Only the compressed prompt, plus anything the LLM explicitly retrieves through CCR, is sent to your LLM provider.
Q3: Which LLM providers are supported?
A: In theory, any provider, since Headroom works at the prompt level. Verified providers include:
- Anthropic (Claude)
- OpenAI (GPT-4, GPT-3.5)
- AWS Bedrock
- Google Gemini
- Any OpenAI-compatible API
Q4: Does compression add latency?
A: Local compression usually finishes in 10–50 ms, negligible compared to network requests (hundreds of ms to seconds). And because fewer tokens are sent, overall response time is often faster.
Q5: Is it for individual devs or teams?
A: Both.
- Individual developers: Save on API bills when using AI assistants daily
- Teams: Share memory across agents to avoid rescanning codebases; enforce unified compression policies for cost control
Summary
Headroom is a high-quality open source tool that solves a real pain point. For developers who lean heavily on AI coding assistants, it:
✅ Cuts token costs by 60–95% — saves money directly
✅ Zero-code onboarding — one command: headroom wrap claude
✅ Cross-agent shared memory — Claude, Codex, Cursor share caches
✅ Reversible compression (CCR) — raw data stays safe, LLM fetches on demand
✅ Runs locally — raw data stays on your machine
If your team spends more than $100/month on LLM APIs, Headroom will almost certainly save you a hefty chunk of cash.
Get started now:
pip install "headroom-ai[all]"
headroom wrap claude # or whichever agent you use
headroom perf # see your savings
Project: https://github.com/chopratejas/headroom
Docs: https://headroom-docs.vercel.app/docs