How I Built a Prompt Compressor That Reduces LLM Token Costs Without Losing Meaning

Tools like LLMLingua (by Microsoft) use language models to compress prompts by learning which parts can be dropped while preserving meaning. It’s powerful — but also relies on another LLM to optimize prompts for the LLM.
I wanted to try something different: a lightweight, rule-based semantic compressor that doesn't require training or GPUs — just smart heuristics, NLP tools like spaCy, and a deep respect for meaning.
The Challenge: Every Token Costs
In the world of Large Language Models (LLMs), every token comes with a price tag. For organizations running thousands of prompts daily, these costs add up quickly. But what if we could reduce these costs without sacrificing the quality of interactions?
Real Results: Beyond Theory
Our experimental Semantic Prompt Compressor has shown promising results in real-world testing. Analyzing 135 diverse prompts, we achieved:
- 22.42% average compression ratio
- Reduction from 4,986 → 3,868 tokens
- 1,118 tokens saved while maintaining meaning
- Over 95% preservation of named entities and technical terms
Example 1
Original (33 tokens):
"I've been considering the role of technology in mental health treatment.
How might virtual therapy and digital interventions evolve?
I'm interested in both current applications and future possibilities."
Compressed (12 tokens):
"I've been considering role of technology in mental health treatment."
Compression ratio: 63.64%
Example 2
Original (29 tokens):
"All these apps keep asking for my location.
What are they actually doing with this information?
I'm curious about the balance between convenience and privacy."
Compressed (14 tokens):
"apps keep asking for my location. What are they doing with information."
Compression ratio: 51.72%
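For reference, the ratio in both examples is simply the fraction of tokens removed. A quick sketch of the arithmetic (the token counts themselves come from the project's own counter; any tokenizer will give slightly different raw numbers):

    def compression_ratio(original_tokens: int, compressed_tokens: int) -> float:
        # Fraction of tokens removed, reported as a percentage in this post
        return (original_tokens - compressed_tokens) / original_tokens

    print(f"{compression_ratio(33, 12):.2%}")  # Example 1 -> 63.64%
    print(f"{compression_ratio(29, 14):.2%}")  # Example 2 -> 51.72%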
The Cost Impact
Let’s translate these results into real business scenarios.
Customer Support AI (100,000 queries/day):
- Avg. 200 tokens per query
- GPT-4 API cost: $0.03 / 1K tokens
Without compression:
- 20M tokens/day → $600/day → $18,000/month
With 22.42% compression:
- 15.5M tokens/day → $465/day
- Monthly savings: $4,050
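The arithmetic behind these figures, as a short sketch (the query volume, token count, and $0.03/1K-token price are the assumptions listed above, not measured data):

    queries_per_day = 100_000
    tokens_per_query = 200
    price_per_1k_tokens = 0.03        # assumed GPT-4 input pricing
    compression = 0.2242              # average ratio from the 135-prompt test

    daily_tokens = queries_per_day * tokens_per_query             # 20,000,000
    daily_cost = daily_tokens / 1000 * price_per_1k_tokens        # $600.00
    compressed_daily_cost = daily_cost * (1 - compression)        # ~$465.48
    monthly_savings = (daily_cost - compressed_daily_cost) * 30
    # ~$4,036 exactly; the $4,050 figure above uses the rounded $465/day cost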
How It Works: A Three-Layer Approach
Rules Layer
We implemented a configurable rule system instead of using a black-box ML model. For example:
- Replace “Could you explain” with “explain”
- Replace “Hello, I was wondering” with “I wonder”
rule_groups:
  remove_fillers:
    enabled: true
    patterns:
      - pattern: "Could you explain"
        replacement: "explain"
  remove_greetings:
    enabled: true
    patterns:
      - pattern: "Hello, I was wondering"
        replacement: "I wonder"
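Conceptually, applying a rule group is ordered pattern substitution over the prompt text. A minimal sketch of how a config like the one above could be applied (the function name and the rules.yaml path are illustrative, not the project's actual API):

    import re
    import yaml

    def apply_rules(text: str, config: dict) -> str:
        # Walk every enabled rule group and apply its patterns as case-insensitive replacements
        for group in config.get("rule_groups", {}).values():
            if not group.get("enabled", False):
                continue
            for rule in group.get("patterns", []):
                text = re.sub(rule["pattern"], rule["replacement"], text, flags=re.IGNORECASE)
        return text

    with open("rules.yaml") as f:     # hypothetical path to the config above
        config = yaml.safe_load(f)

    print(apply_rules("Hello, I was wondering, could you explain rate limits?", config))
    # -> "I wonder, explain rate limits?"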
spaCy NLP Layer
We leverage spaCy’s linguistic analysis for intelligent compression:
- Named Entity Recognition to preserve key terms
- Dependency parsing for sentence structure
- POS tagging to remove non-essential parts
- Compound-word preservation for technical terms
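As a rough illustration of this layer (a simplified sketch, not the project's exact pipeline, and it assumes the en_core_web_sm model is installed): keep named entities, compounds, and content words, and drop filler parts of speech such as determiners and interjections.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Parts of speech that are usually safe to drop in a prompt (illustrative choice)
    DROPPABLE_POS = {"DET", "INTJ"}

    def compress(text: str) -> str:
        doc = nlp(text)
        kept = []
        for token in doc:
            # Named entities and compound modifiers are always kept
            if token.ent_type_ or token.dep_ == "compound":
                kept.append(token.text_with_ws)
            elif token.pos_ not in DROPPABLE_POS:
                kept.append(token.text_with_ws)
        # Join with the original whitespace so punctuation stays attached
        return "".join(kept).strip()

    print(compress("I've been considering the role of technology in mental health treatment."))
    # -> drops "the", keeps the rest of the sentence intact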
Entity Preservation Layer
We ensure critical information is not lost:
- Technical terms (e.g., "5G", "TCP/IP")
- Named entities (companies, people, places)
- Numerical values and measurements
- Domain-specific vocabulary
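A hedged sketch of what this check can look like on top of the spaCy layer (the whitelist and the measurement regex are illustrative assumptions, not the project's shipped defaults):

    import re

    # Hypothetical domain whitelist; in practice this would be loaded from config
    PROTECTED_TERMS = {"5G", "TCP/IP", "GPT-4", "spaCy"}
    MEASUREMENT_RE = re.compile(r"\d")  # anything containing a digit, e.g. "10ms", "v2.1"

    def is_protected(token) -> bool:
        # True if a spaCy token must survive compression untouched
        return (
            token.text in PROTECTED_TERMS               # domain-specific vocabulary
            or bool(token.ent_type_)                    # named entities: companies, people, places
            or token.like_num                           # numerical values
            or bool(MEASUREMENT_RE.search(token.text))  # measurements and versions
        )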
Real-World Applications
Customer Support
- Compress user queries while maintaining context
- Preserve product-specific language
- Reduce support costs, maintain quality
Content Moderation
- Efficiently process user reports
- Maintain critical context
- Cost-effective scaling
Technical Documentation
- Compress API or doc queries
- Preserve code snippets and terms
- Cut costs without losing accuracy
Beyond Simple Compression
What makes our approach unique?
Intelligent Preservation — Maintains technical accuracy and key data
Configurable Rules — Domain-adaptable, transparent, and editable
Transparent Processing — Understandable and debuggable
Current Limitations
- Requires domain-specific tuning
- Conservative in technical contexts
- Manual rule editing is still needed for best results
- Entity preservation may be overly cautious
Future Development
- ML-based adaptive compression
- Domain-specific profiles
- Real-time compression
- LLM platform integrations
- Custom vocabulary modules
Conclusion
The results from our testing show that intelligent semantic prompt compression is not only possible — it's practical.
With a 22.42% average compression ratio and high semantic preservation, LLM-based systems can reduce API costs while maintaining clarity and intent.
Whether you're building support bots, moderation tools, or technical assistants, prompt compression could be a key layer in your stack.
Project on GitHub:
github.com/metawake/prompt_compressor
(Open source, transparent, and built for experimentation.)