LEVEL 2: EARLY DEVELOPMENT

Prompt Exploration

Intelligent prompt compression for LLMs — cutting token costs by 3–20x while preserving semantic intent

3–20x token reduction achievable with minimal quality loss
75% of natural language is redundant for LLM processing
$28.5 in savings per 1,000 queries in RAG workloads (LongLLMLingua)
280x inference price reduction 2020–2024, yet context windows keep growing

The Noise Problem

Brilliant people communicate to be understood by other humans. LLMs are not humans. The social scaffolding, emotional hedging, and grammatical structure that make prose readable are, from the model's perspective, expensive noise.

The Omit SDK

Omit is a production-grade SDK that compresses prompts for Large Language Models while preserving semantic intent, optimized for multi-model deployments with real-time analytics. Built to power the Euphie platform and any cost-sensitive LLM pipeline.

"The eloquence of brilliant people is noise for machines. Omit removes the noise."

Why This Matters Now

Inference prices fell roughly 280-fold between 2020 and 2024, yet context windows keep growing. Teams send longer and longer prompts, which eats into those efficiency gains. Prompt compression is the missing layer.

Built for Euphie

Euphie processes natural language task descriptions continuously. Every user interaction generates an LLM call. Omit sits in the preprocessing pipeline, compressing each prompt before the API call — compounding savings at scale.

Three Compression Strategies

Adaptive routing selects the right strategy for each input based on content type, length, and risk profile

Conservative
~50% reduction
Safe & Coherent

Entropy-based token filtering that maintains grammatical coherence. Every critical fact is preserved. Ideal for short notes, high-stakes content, and any input where there is zero tolerance for quality loss.

Entropy-based filtering
Grammatical coherence preserved
Entity protection active
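
One way to read "entropy-based token filtering" is to score each token by its self-information, -log2 p(token), under a reference frequency model and drop the most predictable tokens. The sketch below does that with a toy unigram model. The scoring model Omit actually uses is not described on this page, so the function names, the `min_bits` threshold, the reference corpus, and the simple skip-list form of entity protection are all assumptions made for illustration.

```python
import math
from collections import Counter

def self_information(tokens, reference_counts):
    """Return -log2 p(token) for each token, using add-one smoothing."""
    total = sum(reference_counts.values())
    vocab_size = len(reference_counts) + 1  # +1 reserves mass for unseen tokens
    return [
        -math.log2((reference_counts.get(tok, 0) + 1) / (total + vocab_size))
        for tok in tokens
    ]

def entropy_filter(text, reference_counts, min_bits, protected=()):
    """Keep tokens carrying at least min_bits of information, plus protected terms."""
    tokens = text.split()
    normalized = [tok.lower().strip(".,!?;:") for tok in tokens]
    scores = self_information(normalized, reference_counts)
    protected_terms = {term.lower() for term in protected}
    kept = [
        tok
        for tok, norm, bits in zip(tokens, normalized, scores)
        if bits >= min_bits or norm in protected_terms
    ]
    return " ".join(kept)

# Toy reference corpus (hypothetical): frequent words carry little information.
REFERENCE = Counter(
    "the the the the a a a to to of of and and you you sprint sprint "
    "please could i we is it that just".split()
)

text = "Could you please move the sprint review to Friday"
unprotected = entropy_filter(text, REFERENCE, min_bits=4.5)
with_vocab = entropy_filter(text, REFERENCE, min_bits=4.5, protected=["sprint"])
print(unprotected)  # move review Friday
print(with_vocab)   # move sprint review Friday
```

In this toy setup "sprint" is frequent enough to be filtered out, which is the kind of loss that a protected vocabulary is meant to prevent.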
Aggressive
~75–80% reduction
Maximum Savings

Learned token importance pruning produces "token soup" — output optimized for LLM consumption, not human readability. Ideal for long brain dumps, batch processing, and RAG workloads where cost is the primary concern.

Learned importance pruning
LLM-optimized output
Best for batch workloads
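
The pruning step of the Aggressive strategy can be sketched separately from the model that learns token importance. The example below assumes per-token importance scores are already available (here they are hard-coded stand-ins, not the output of any real scorer) and keeps the highest-scoring fraction of tokens in their original order. The function name and `keep_ratio` parameter are hypothetical.

```python
from typing import List, Sequence

def prune_by_importance(
    tokens: Sequence[str], scores: Sequence[float], keep_ratio: float
) -> List[str]:
    """Keep the top keep_ratio share of tokens by score, preserving order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return []
    keep_n = max(1, round(len(tokens) * keep_ratio))
    # Rank by descending score; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep_indices = sorted(ranked[:keep_n])
    return [tokens[i] for i in keep_indices]

tokens = "so basically I think we should probably email Dana the budget by Friday".split()
# Stand-in importance scores; a learned scorer would produce these.
scores = [0.05, 0.05, 0.2, 0.1, 0.2, 0.3, 0.1, 0.9, 0.95, 0.1, 0.85, 0.6, 0.9]

token_soup = prune_by_importance(tokens, scores, keep_ratio=0.3)
print(" ".join(token_soup))  # email Dana budget Friday
```

Keeping 4 of 13 tokens is roughly a 70% reduction, close to the range quoted for this tier, and the result is readable to a model but no longer grammatical prose.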
Recommended
Smart hybrid
Production Default

Adaptive routing based on input characteristics and model-specific compression tolerance. Automatically falls back to conservative mode on high-risk scenarios. This is the default for production deployments.

Adaptive routing
Model-aware tolerance
Automatic safety fallback
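
The routing logic described for the Recommended strategy might look something like the following. This is a minimal sketch, not the SDK's router: the model tolerance table, the entity regex, the length threshold, and the density cutoff are all hypothetical values chosen to make the behavior visible.

```python
import re

# Hypothetical tolerance table: the most aggressive level each model is assumed to handle.
MODEL_MAX_LEVEL = {"gpt-4": "aggressive", "small-model": "conservative"}

# Rough signals of high-stakes content: ISO dates, emails, dollar amounts.
ENTITY_PATTERN = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b|\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b|\$\d[\d,.]*"
)

def choose_level(
    text: str,
    model: str,
    long_threshold: int = 200,
    max_entity_density: float = 0.05,
) -> str:
    """Pick a compression level from input length, entity density, and model tolerance."""
    words = text.split()
    if not words:
        return "conservative"
    entity_density = len(ENTITY_PATTERN.findall(text)) / len(words)
    high_risk = entity_density > max_entity_density
    # Unknown models are treated as low tolerance, which is the safe fallback.
    model_limit = MODEL_MAX_LEVEL.get(model, "conservative")
    if high_risk or model_limit == "conservative":
        return "conservative"
    return "aggressive" if len(words) >= long_threshold else "conservative"

print(choose_level("word " * 300, "gpt-4"))
print(choose_level("Pay dana@example.com $500 by 2025-01-31", "gpt-4"))
```

The design choice worth noting is that every uncertain branch resolves to the conservative level, so a routing mistake costs tokens rather than quality.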

Quick Start

from omit_sdk import Omit

# Initialize with optional domain vocabulary
omit = Omit(user_vocabulary=["standup", "retro", "sprint"])

# Compress a prompt before your LLM call
result = omit.compress(
    text=user_input,
    model="gpt-4",
    level="recommended"  # or "conservative", "aggressive"
)

print(f"Compressed: {result.compressed_text}")
print(f"Savings: ${result.estimated_savings_usd:.6f} per query")
print(f"Ratio: {result.compression_ratio:.2f}x")

# Pass compressed text to your LLM
response = your_llm_api.complete(result.compressed_text)

Production Features

Multi-model support — GPT-4, Claude, OpenRouter models
Entity protection — Dates, emails, names, domain terms preserved
Real-time analytics — JSONL logging with full metrics
Cost estimation — Per-query and monthly projections
User vocabulary learning — Custom domain terminology protection
Automatic fallback — Safety mechanisms for high-risk compressions
Production logging — Structured logs with compression analytics
promptfu.json — Curated benchmark dataset for validation
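
Entity protection can be approximated with a mask-and-restore pass around the compressor: protected spans are swapped for opaque placeholders, the text is compressed, and the originals are put back. The sketch below assumes a pattern-matching approach covering emails, ISO dates, and user vocabulary. The patterns, placeholder format, and function names are illustrative assumptions, and the "compressor" is a stand-in that only drops a few filler words.

```python
import re

PROTECTED_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),  # email addresses
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),             # ISO dates
]

def protect(text, vocabulary=()):
    """Replace protected spans with placeholders; return masked text and the originals."""
    patterns = list(PROTECTED_PATTERNS)
    if vocabulary:
        # Longest terms first so multi-word terms win over their substrings.
        terms = "|".join(
            re.escape(t) for t in sorted(vocabulary, key=len, reverse=True)
        )
        patterns.append(re.compile(rf"\b(?:{terms})\b", re.IGNORECASE))
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__ENT{len(saved) - 1}__"

    for pattern in patterns:
        text = pattern.sub(stash, text)
    return text, saved

def restore(text, saved):
    """Put the original spans back in place of their placeholders."""
    return re.sub(r"__ENT(\d+)__", lambda m: saved[int(m.group(1))], text)

masked, saved = protect(
    "Email dana@example.com about the retro before 2025-03-14", ["retro"]
)
# Stand-in compressor: drops a few filler words and leaves placeholders alone.
FILLER = {"about", "the", "before"}
compressed = " ".join(w for w in masked.split() if w.lower() not in FILLER)
print(restore(compressed, saved))  # Email dana@example.com retro 2025-03-14
```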

The Validation Notebook

An interactive Python notebook for rigorous benchmarking of compression strategies against the promptfu.json dataset

What the Notebook Does

The validation notebook (validation_notebook.py / validation.ipynb) runs each compression strategy against every prompt in the benchmark dataset, producing interactive Plotly visualizations and structured reports so you can see exactly where each strategy wins and loses.

1. Load the promptfu.json dataset

A curated set of real-world task prompts spanning scheduling, follow-ups, project management, and multi-step workflows — each annotated with expected LLM output and known failure modes.
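
A loader for this step could look like the sketch below. The actual schema of promptfu.json is not shown on this page, so the field names (`prompt`, `expected_tasks`, `quality_flags`, `failure_modes`) are guesses based on the description of the dataset, and the sample records are invented for the example.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class BenchmarkEntry:
    prompt: str
    expected_tasks: List[str]
    quality_flags: List[str] = field(default_factory=list)
    failure_modes: List[str] = field(default_factory=list)

def parse_entries(raw_json: str) -> List[BenchmarkEntry]:
    """Parse a JSON array of benchmark records, validating required fields."""
    entries = []
    for i, record in enumerate(json.loads(raw_json)):
        if "prompt" not in record or "expected_tasks" not in record:
            raise ValueError(f"entry {i} is missing a required field")
        entries.append(
            BenchmarkEntry(
                prompt=record["prompt"],
                expected_tasks=list(record["expected_tasks"]),
                quality_flags=list(record.get("quality_flags", [])),
                failure_modes=list(record.get("failure_modes", [])),
            )
        )
    return entries

# Invented sample in the assumed schema.
SAMPLE = """[
  {
    "prompt": "Remind me to send Dana the budget by Friday",
    "expected_tasks": ["Send Dana the budget (due Friday)"],
    "quality_flags": ["has_deadline"],
    "failure_modes": ["deadline dropped"]
  },
  {
    "prompt": "Book the retro room",
    "expected_tasks": ["Book the retro room"]
  }
]"""

entries = parse_entries(SAMPLE)
print(len(entries), entries[0].failure_modes)
```

To read the real file, the same function could be called with `Path("promptfu.json").read_text(encoding="utf-8")`, provided the file matches the assumed layout.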

2. Run all three compression levels

Conservative, Aggressive, and Recommended strategies are applied to each prompt. Compression ratio, token savings, estimated cost savings, and entity preservation are recorded for every result.
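
The per-result metrics listed here follow from simple arithmetic, sketched below. Token counts use whitespace splitting as a rough stand-in for a model tokenizer, the price per 1,000 input tokens is a made-up figure, and entity preservation is measured by plain substring checks. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class CompressionMetrics:
    original_tokens: int
    compressed_tokens: int
    compression_ratio: float
    tokens_saved: int
    savings_usd: float
    entities_preserved: float

def count_tokens(text: str) -> int:
    """Whitespace approximation; a real run would use the target model's tokenizer."""
    return len(text.split())

def score_compression(
    original: str,
    compressed: str,
    entities: Sequence[str],
    usd_per_1k_input_tokens: float,
) -> CompressionMetrics:
    original_tokens = count_tokens(original)
    compressed_tokens = count_tokens(compressed)
    tokens_saved = original_tokens - compressed_tokens
    kept = sum(1 for entity in entities if entity in compressed)
    return CompressionMetrics(
        original_tokens=original_tokens,
        compressed_tokens=compressed_tokens,
        compression_ratio=(
            original_tokens / compressed_tokens if compressed_tokens else float("inf")
        ),
        tokens_saved=tokens_saved,
        savings_usd=tokens_saved / 1000 * usd_per_1k_input_tokens,
        entities_preserved=kept / len(entities) if entities else 1.0,
    )

metrics = score_compression(
    original="Hey team, could you please remind Dana to send the budget by Friday, thanks",
    compressed="remind Dana send budget Friday",
    entities=["Dana", "Friday"],
    usd_per_1k_input_tokens=0.03,  # hypothetical price
)
print(f"{metrics.compression_ratio:.2f}x, saved {metrics.tokens_saved} tokens")
```

A ratio r corresponds to a token reduction of 1 - 1/r, so a reduction of about 50% is 2x and a reduction of 75–80% is 4–5x.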

3. Generate interactive Plotly dashboards

Distribution charts, scatter plots of compression ratio vs. quality delta, strategy comparison bar charts, and per-prompt breakdowns — all interactive and exportable.

4. Export the validation report

The notebook outputs omit_validation_report.html and omit_validation_results.csv — shareable artifacts for stakeholder review and ongoing benchmarking.

5. Integrate findings into the analytics dashboard

The omit_analytics.jsonl log feeds a real-time dashboard (analytics_dashboard.tsx) for ongoing monitoring of compression performance in production.
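
A JSONL log holds one JSON object per line, which makes it easy to append to in production and to aggregate later. The sketch below writes a few records and summarizes token reduction per compression level. The record fields (`level`, `original_tokens`, `compressed_tokens`) are assumptions, since the real layout of omit_analytics.jsonl is not documented here, and an in-memory buffer stands in for the file.

```python
import io
import json
from collections import defaultdict

def append_record(stream, record):
    """Write one record as a single JSON line."""
    stream.write(json.dumps(record, sort_keys=True) + "\n")

def summarize(lines):
    """Aggregate query counts and token reduction per compression level."""
    totals = defaultdict(
        lambda: {"queries": 0, "original_tokens": 0, "compressed_tokens": 0}
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        bucket = totals[record["level"]]
        bucket["queries"] += 1
        bucket["original_tokens"] += record["original_tokens"]
        bucket["compressed_tokens"] += record["compressed_tokens"]
    return {
        level: {**t, "reduction": 1 - t["compressed_tokens"] / t["original_tokens"]}
        for level, t in totals.items()
        if t["original_tokens"]
    }

log = io.StringIO()  # stands in for an open omit_analytics.jsonl file
append_record(log, {"level": "conservative", "original_tokens": 100, "compressed_tokens": 50})
append_record(log, {"level": "conservative", "original_tokens": 200, "compressed_tokens": 110})
append_record(log, {"level": "aggressive", "original_tokens": 400, "compressed_tokens": 100})

log.seek(0)
summary = summarize(log)
for level, stats in sorted(summary.items()):
    print(level, stats["queries"], f"{stats['reduction']:.1%}")
```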

The promptfu.json Dataset

A hand-curated benchmark of real-world task-management prompts. Each entry includes the raw prompt, the expected structured task list, quality flags, and annotated failure modes, giving the validation notebook everything it needs to score each compression strategy honestly.

Real-world scheduling prompts
Annotated known failure modes
Quality flags per example

Running the Notebook
# Install dependencies
pip install numpy pandas plotly scikit-learn

# Run the validation notebook
python validation_notebook.py

# Or open interactively in Jupyter
jupyter notebook validation.ipynb

Active Research Hypotheses

Open questions that drive the Level 2 investigation

H1
Quality Floor by Model

Does aggressive compression that maintains quality on GPT-4 degrade significantly on smaller models like Claude Haiku or Mistral 7B? What is the minimum model capability that handles "token soup" reliably?

H2
Entity Boundary Detection

Can entropy-based pruning reliably detect and protect all task-critical entities (names, dates, deadlines, amounts) without a named-entity recognition pre-pass? Where does naive filtering fail?

H3
Compression + RAG

In retrieval-augmented workflows, does compressing the query before retrieval harm recall, or does semantic equivalence hold? Can retrieved context itself be compressed without impacting answer quality?

H4
User Vocabulary Decay

As the user vocabulary grows, does protection overhead erode compression gains? Is there a vocabulary size threshold beyond which the Recommended strategy should fall back to Conservative by default?

Suggested Experiments

Concrete next steps for Level 2 — ordered from quick wins to longer-horizon investigations

E1 — Cross-Model Quality Sweep

Run the full validation notebook against GPT-4, GPT-3.5, Claude 3 Haiku, and Mistral 7B. Chart where quality degrades at each compression level and identify the model-specific "safe zone" for Aggressive mode.

Quick Win · Validation Notebook
E2 — NER-Augmented Entity Protection

Add a spaCy NER pre-pass before entropy filtering. Measure whether false-negative entity drops fall to zero, and at what latency cost. Compare against the current pattern-matching approach on the promptfu.json dataset.

Medium Effort · omit_sdk.py
E3 — RAG Context Compression

Apply Conservative compression to retrieved context passages before they are stuffed into the prompt. Measure answer quality delta vs. token savings across a set of QA benchmarks. Target: identical answer quality at 40% fewer tokens in context.

Medium Effort · New Dataset
E4 — LLMLingua Integration

Benchmark Microsoft's LLMLingua (reported to reach up to 20x compression) as an Aggressive-tier backend alongside the existing learned pruning approach. Measure quality and latency, and test whether the SDK's adaptive router can select between them per request.

Larger Scope · omit_sdk.py
E5 — Euphie Production A/B Test

Route 50% of Euphie traffic through the Recommended compression pipeline. Measure real-world cost reduction from omit_analytics.jsonl alongside user-reported task quality scores. Run for 30 days, with automatic rollback if quality drops below a predefined threshold.

High Impact · Production
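
The traffic split and rollback check for an A/B test like E5 can be sketched as follows. Hashing the user ID gives each user a stable assignment without storing state, and the rollback rule compares mean quality scores once both arms have enough samples. The salt, the sample minimum, and the `max_drop` threshold are hypothetical, and nothing here reflects how Euphie actually routes traffic.

```python
import hashlib
from typing import Sequence

def in_treatment(user_id: str, fraction: float = 0.5, salt: str = "omit-ab-v1") -> bool:
    """Deterministically assign a user to the compression arm."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction

def should_roll_back(
    control_scores: Sequence[float],
    treatment_scores: Sequence[float],
    max_drop: float,
    min_samples: int = 100,
) -> bool:
    """Signal rollback when mean treatment quality trails control by more than max_drop."""
    if len(control_scores) < min_samples or len(treatment_scores) < min_samples:
        return False  # not enough evidence yet
    control_mean = sum(control_scores) / len(control_scores)
    treatment_mean = sum(treatment_scores) / len(treatment_scores)
    return control_mean - treatment_mean > max_drop

share = sum(in_treatment(f"user-{i}") for i in range(10_000)) / 10_000
print(f"treatment share: {share:.3f}")
print(should_roll_back([4.5] * 100, [4.0] * 100, max_drop=0.3))
```

A fixed mean-difference threshold is the simplest possible rule; a production test would more likely use a statistical test with a confidence bound before triggering rollback.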
E6 — Fine-Tuned Compressor

Fine-tune a small sequence-to-sequence model (T5-small or similar) on the promptfu.json dataset to learn task-domain compression directly. Compare against the entropy-based approach: can a learned model beat hand-engineered rules on this specific domain?

Moonshot · Model Training

LLM Linguistic Compression Strategies

The full research paper underlying the Omit SDK. Covers entropy-based filtering, learned token pruning, entity protection mechanisms, multi-model compression tolerance, and cost modeling — with citations to LLMLingua, Selective Context, Glyph, COSTAR, and DSPy.
