Intelligent prompt compression for LLMs — cutting token costs by 3–20x while preserving semantic intent
Brilliant people communicate to be understood by other humans. LLMs are not humans. The social scaffolding, emotional hedging, and grammatical structure that make prose readable are, from the model's perspective, expensive noise.
Omit is a production-grade SDK that compresses prompts for Large Language Models while preserving semantic intent, optimized for multi-model deployments with real-time analytics. Built to power the Euphie platform and any cost-sensitive LLM pipeline.
Inference costs have dropped 280-fold since 2020, yet context windows keep growing and teams are sending longer and longer prompts, eroding those efficiency gains. Prompt compression is the missing layer.
Euphie processes natural language task descriptions continuously. Every user interaction generates an LLM call. Omit sits in the preprocessing pipeline, compressing each prompt before the API call — compounding savings at scale.
Adaptive routing selects the right strategy for each input based on content type, length, and risk profile
Entropy-based token filtering that maintains grammatical coherence. Every critical fact is preserved. Ideal for short notes, high-stakes content, and any input where there is zero tolerance for quality loss.
Learned token importance pruning produces "token soup" — output optimized for LLM consumption, not human readability. Ideal for long brain dumps, batch processing, and RAG workloads where cost is the primary concern.
Adaptive routing based on input characteristics and model-specific compression tolerance. Automatically falls back to conservative mode on high-risk scenarios. This is the default for production deployments.
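As a rough illustration of the routing idea described above, the sketch below picks a strategy from input length, a keyword-based risk check, and a per-model tolerance score. The thresholds, the keyword list, the `route` function, and the 0-to-1 tolerance scale are all assumptions made for this example; they are not the SDK's actual rules or API.

```python
import re
from dataclasses import dataclass

# Hypothetical values for illustration; the SDK's real routing rules are not documented here.
SHORT_INPUT_TOKENS = 50
HIGH_RISK_PATTERN = re.compile(
    r"\b(contract|invoice|deadline|payment|legal|medical)\b", re.IGNORECASE
)


@dataclass(frozen=True)
class RoutingDecision:
    strategy: str  # "conservative" or "aggressive"
    reason: str


def route(prompt: str, model_tolerance: float = 0.5) -> RoutingDecision:
    """Choose a strategy; model_tolerance is an assumed score in [0, 1]."""
    token_count = len(prompt.split())  # whitespace split as a crude token estimate
    if HIGH_RISK_PATTERN.search(prompt):
        return RoutingDecision("conservative", "high-risk content")
    if token_count < SHORT_INPUT_TOKENS:
        return RoutingDecision("conservative", "short input")
    if model_tolerance < 0.5:
        return RoutingDecision("conservative", "low model tolerance")
    return RoutingDecision("aggressive", "long, low-risk input")
```

Every uncertain case resolves to the conservative strategy, which mirrors the fallback behavior described for high-risk scenarios.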
An interactive Python notebook for rigorous benchmarking of compression strategies against the promptfu.json dataset
The validation notebook (validation_notebook.py / validation.ipynb) runs each compression strategy against every prompt in the benchmark dataset, producing interactive Plotly visualizations and structured reports so you can see exactly where each strategy wins and loses.
A curated set of real-world task prompts spanning scheduling, follow-ups, project management, and multi-step workflows — each annotated with expected LLM output and known failure modes.
Conservative, Aggressive, and Recommended strategies are applied to each prompt. Compression ratio, token savings, estimated cost savings, and entity preservation are recorded for every result.
Distribution charts, scatter plots of compression ratio vs. quality delta, strategy comparison bar charts, and per-prompt breakdowns — all interactive and exportable.
The notebook outputs omit_validation_report.html and omit_validation_results.csv — shareable artifacts for stakeholder review and ongoing benchmarking.
The omit_analytics.jsonl log feeds a real-time dashboard (analytics_dashboard.tsx) for ongoing monitoring of compression performance in production.
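The per-result metrics named above (compression ratio, token savings, estimated cost savings, and entity preservation) could be computed along these lines. This is a minimal sketch, not the notebook's code: the whitespace tokenizer, the price constant, the entity regex, and the `score` function are all assumptions chosen to keep the example self-contained.

```python
import re

# Hypothetical price, for illustration only.
PRICE_PER_1K_TOKENS_USD = 0.01
# Crude entity stand-in: capitalized words and tokens containing a digit.
# Note it also flags any capitalized sentence-initial word.
ENTITY_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+|\w*\d\w*)\b")


def count_tokens(text: str) -> int:
    # Whitespace split as a rough proxy for a model tokenizer.
    return len(text.split())


def score(original: str, compressed: str) -> dict:
    """Compute the four per-result metrics for one prompt."""
    orig_tokens = count_tokens(original)
    comp_tokens = count_tokens(compressed)
    saved = orig_tokens - comp_tokens
    entities = set(ENTITY_PATTERN.findall(original))
    # Substring check: an entity counts as preserved if it appears verbatim.
    kept = {e for e in entities if e in compressed}
    return {
        "compression_ratio": orig_tokens / comp_tokens if comp_tokens else float("inf"),
        "tokens_saved": saved,
        "est_cost_saved_usd": saved / 1000 * PRICE_PER_1K_TOKENS_USD,
        "entity_preservation": len(kept) / len(entities) if entities else 1.0,
    }
```

A real run would swap in the target model's tokenizer and current pricing, since both change the savings estimate.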
A handcrafted benchmark of real-world task-management prompts. Each entry includes the raw prompt, the expected structured task list, quality flags, and annotated failure modes, giving the validation notebook everything it needs to score each compression strategy honestly.
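One way to picture an entry with the four parts listed above is a small typed record. The field names, the `BenchmarkEntry` class, and the sample entry below are hypothetical; the real promptfu.json schema is not shown in this document and may differ.

```python
import json
from dataclasses import dataclass, field


@dataclass
class BenchmarkEntry:
    # All field names are assumptions, not the actual promptfu.json schema.
    raw_prompt: str
    expected_tasks: list[str]
    quality_flags: list[str] = field(default_factory=list)
    failure_modes: list[str] = field(default_factory=list)


def parse_entries(json_text: str) -> list[BenchmarkEntry]:
    """Parse a JSON array of entries; raises TypeError on missing or unknown keys."""
    return [BenchmarkEntry(**item) for item in json.loads(json_text)]


# A made-up entry, used only to exercise the parser.
sample = json.dumps([{
    "raw_prompt": "call Sam tmrw re: budget, then book room for Thursday standup",
    "expected_tasks": ["Call Sam about the budget", "Book a room for Thursday standup"],
    "quality_flags": ["multi_task"],
    "failure_modes": ["date dropped"],
}])
entries = parse_entries(sample)
```

Failing loudly on malformed entries keeps a bad record from silently skewing benchmark scores.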
Open questions that drive the Level 2 investigation
Does aggressive compression that maintains quality on GPT-4 degrade significantly on smaller models like Claude Haiku or Mistral 7B? What is the minimum model capability that handles "token soup" reliably?
Can entropy-based pruning reliably detect and protect all task-critical entities (names, dates, deadlines, amounts) without a named-entity recognition pre-pass? Where does naive filtering fail?
In retrieval-augmented workflows, does compressing the query before retrieval harm recall, or does semantic equivalence hold? Can retrieved context itself be compressed without impacting answer quality?
As the user vocabulary grows, does protection overhead erode compression gains? Is there a vocabulary size threshold beyond which the Recommended strategy should fall back to Conservative by default?
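To make the entity-protection question concrete, here is a toy sketch of entropy-style filtering with a pattern-based protection pass. It scores each token by self-information under a tiny hand-written frequency table, which stands in for a real language model. The table, the 8-bit threshold, the `PROTECTED` regex, and the `compress` function are assumptions for illustration, not the SDK's implementation.

```python
import math
import re

# Toy background frequencies (hypothetical); a real system would use a language model.
BACKGROUND_FREQ = {
    "the": 0.05, "to": 0.03, "in": 0.02, "a": 0.02,
    "by": 0.005, "at": 0.005, "please": 0.005, "may": 0.005,
}
DEFAULT_FREQ = 1e-5  # unseen words are treated as rare, hence informative

# Crude stand-in for NER: capitalized words, tokens with digits, dollar amounts.
PROTECTED = re.compile(r"^(?:[A-Z]\w*|\w*\d\w*|\$[\d.,]+)$")


def self_information(token: str) -> float:
    """Bits of surprise for a token under the toy frequency table."""
    return -math.log2(BACKGROUND_FREQ.get(token.lower(), DEFAULT_FREQ))


def compress(prompt: str, threshold_bits: float = 8.0) -> str:
    """Keep protected tokens and tokens at or above the information threshold."""
    kept = []
    for token in prompt.split():
        core = token.strip(".,;:!?")
        if PROTECTED.match(core) or self_information(core) >= threshold_bits:
            kept.append(token)
    return " ".join(kept)


print(compress("please send the report to Dana by Friday at 5pm"))
print(compress("meet in may"))
```

With this toy table, the first call keeps "send report Dana Friday 5pm", while the second returns only "meet": a lowercase month looks like a common word and slips past both the threshold and the pattern check. That is the kind of miss a named-entity recognition pre-pass would be expected to catch.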
Concrete next steps for Level 2 — ordered from quick wins to longer-horizon investigations
Run the full validation notebook against GPT-4, GPT-3.5, Claude 3 Haiku, and Mistral 7B. Chart where quality degrades at each compression level and identify the model-specific "safe zone" for Aggressive mode.
Add a spaCy NER pre-pass before entropy filtering. Measure whether false-negative entity drops fall to zero and at what latency cost. Compare against the current pattern-matching approach on the promptfu.json dataset.
Apply Conservative compression to retrieved context passages before they are stuffed into the prompt. Measure answer quality delta vs. token savings across a set of QA benchmarks. Target: identical answer quality at 40% fewer tokens in context.
Benchmark Microsoft's LLMLingua (up to 20x compression) as an Aggressive-tier backend alongside the existing learned pruning approach. Measure quality, latency, and whether the SDK's adaptive router can select between them per-request.
Route 50% of Euphie traffic through the Recommended compression pipeline. Measure real-world cost reduction from omit_analytics.jsonl and user-reported task quality scores. Run for 30 days with automatic rollback if quality drops below threshold.
Fine-tune a small sequence-to-sequence model (T5-small or similar) on the promptfu.json dataset to learn task-domain compression directly. Compare against the entropy-based approach: can a learned model beat hand-engineered rules on this specific domain?
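For the 50% traffic experiment in the steps above, a deterministic hash split and a simple rollback check might look like the sketch below. The function names, the use of a user ID as the bucketing key, and the minimum-sample guard are assumptions for illustration; the actual Euphie rollout mechanism is not described in this document.

```python
import hashlib


def in_treatment(user_id: str, rollout_percent: int = 50) -> bool:
    """Deterministically assign a user to the compression arm."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent


def should_roll_back(
    quality_scores: list[float], threshold: float, min_samples: int = 100
) -> bool:
    """Signal rollback when mean quality falls below the threshold."""
    if len(quality_scores) < min_samples:
        return False  # too little data to act on
    return sum(quality_scores) / len(quality_scores) < threshold
```

Hashing the user ID keeps each user in the same arm for the whole experiment, so their quality scores are not mixed across compressed and uncompressed traffic.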
The full research paper underlying the Omit SDK. Covers entropy-based filtering, learned token pruning, entity protection mechanisms, multi-model compression tolerance, and cost modeling — with citations to LLMLingua, Selective Context, Glyph, COSTAR, and DSPy.