From $800 to $12,000/Month: Stopping a B2B SaaS LLM Cost Spiral
A structured 8-hour cost audit found $9,900 in monthly LLM waste. Four engineering changes reduced the bill by 82% — without touching the user experience.
The Situation at a Glance
The Problem
- LLM costs: $800 → $12,000/month in 6 weeks
- Costs grew 15× while user count grew only 8×
- Avg 8,200 tokens per API call (unbounded context)
- GPT-4 used for every task including classification
- No cost visibility — zero per-user or per-feature tracking
What We Built
- Token-level cost audit across all API call patterns
- Context window compression (sliding window + semantic truncation)
- Model routing — GPT-4o-mini for classification, GPT-4 for generation
- Redis semantic cache with 34% hit rate
- Real-time cost dashboard per user, feature, and request
The Results
- Monthly LLM cost: $12,000 → $2,100
- 82% cost reduction at identical user count
- Avg tokens per call: 8,200 → 1,100 (−87%)
- 34% cache hit rate reducing paid API calls
- Unit economics now viable for Series A
“The feature worked. The unit economics were fatal. We were heading into our Series A with a cost curve that would have destroyed the model at scale.”
— Co-founder, Nexus CRM (anonymised)
Where the Money Was Going
Nexus CRM shipped an AI email drafting feature to 50 users in February 2026. By mid-March, with 400 users active, the OpenAI bill had grown from $800 to $12,000/month: a 15× cost increase against 8× user growth. The feature was not broken. The cost architecture was.
Unbounded Context Windows
The email drafting feature sent the entire conversation history — every previous email in the thread — with every API call. Average context size grew as users became more active.
Impact: Average 8,200 tokens per call. Power users sending 40-email threads triggered 22,000-token calls.
No Model Routing
GPT-4 was used for every LLM call, including simple tasks like classifying email intent (2-class classification that needed 50 tokens of reasoning).
Impact: 60% of API spend was on calls that a smaller, cheaper model could handle at 1/10th the cost.
Missing Semantic Cache
Users frequently requested drafts for similar email types ("follow-up after demo", "overdue invoice reminder"). Each request was computed fresh against the LLM with no caching layer.
Impact: Analysis showed 34% of requests were semantically near-identical to a previous request within 24 hours.
Retry Storms
The retry logic used aggressive exponential backoff with no jitter and no circuit breaker. During a 40-minute OpenAI API degradation, each original request triggered 4–5 retries.
Impact: A single 40-minute API degradation generated 4.8× the normal API call volume — and 4.8× the cost.
Zero Cost Visibility
There was no per-request, per-user, or per-feature cost tracking. The team discovered the $12,000 monthly bill at the end of the billing cycle — not in real time.
Impact: By the time the bill arrived, the architecture causing it had been live for 6 weeks and was embedded in production.
Prompt Verbosity
System prompts were 1,400 tokens of instructions that had accumulated over months of tweaking. Most instructions were redundant with model defaults or contradicted each other.
Impact: 1,400-token system prompt on every call. Revised to 180 tokens with identical output quality.
Cost Audit Breakdown — Where the $12,000 Went
| Waste Driver | Monthly Cost | % of Total |
|---|---|---|
| Unbounded context windows (excess tokens) | $5,100 | 43% |
| GPT-4 on classification tasks (should be GPT-4o-mini) | $3,200 | 27% |
| Missing cache (duplicate near-identical calls) | $1,900 | 16% |
| Retry storms during API degradations | $1,100 | 9% |
| Verbose system prompts (1,400 → 180 tokens) | $700 | 6% |
Four Changes, 82% Cost Reduction
The cost audit took 8 hours and produced a prioritised fix list sorted by dollar impact. We addressed the top four drivers first — the fifth (verbose prompts) was simple enough to fix in the same pass. All changes were deployed without modifying the user-facing product.
Context Window Compression
Replaced full-history injection with a sliding window plus semantic importance scoring.
- Sliding window: last 6 messages always included
- Semantic scoring: important earlier messages retained based on relevance to current request
- Hard token cap: 1,200 tokens maximum context budget per call
- Compression applied before every API call — no change to stored data
Result: Avg tokens/call: 8,200 → 1,100. Cost impact: −$5,100/month.
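The compression steps above can be sketched as follows. This is a minimal illustration, not the production middleware: the `Message` type, its `relevance` field, and the whitespace-based `count_tokens` are hypothetical stand-ins (a real system would use the model's tokenizer and an embedding-based relevance score). It also assumes that when the sliding window and the hard cap conflict, the cap wins and the oldest window messages are dropped.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    relevance: float  # hypothetical score in [0, 1] against the current request


def count_tokens(text: str) -> int:
    # Assumption: whitespace split stands in for a real tokenizer.
    return len(text.split())


def compress_context(history: list[Message], window: int = 6,
                     budget: int = 1200) -> list[Message]:
    """Keep the newest `window` messages, then fill the remaining
    token budget with older messages in order of relevance."""
    first_recent = max(0, len(history) - window)
    keep: set[int] = set()
    used = 0

    # Sliding window, newest first, so the cap drops the oldest if needed.
    for i in range(len(history) - 1, first_recent - 1, -1):
        cost = count_tokens(history[i].text)
        if used + cost > budget:
            break
        keep.add(i)
        used += cost

    # Older messages: most relevant first, skipping any that do not fit.
    older = sorted(range(first_recent),
                   key=lambda i: history[i].relevance, reverse=True)
    for i in older:
        cost = count_tokens(history[i].text)
        if used + cost <= budget:
            keep.add(i)
            used += cost

    # Return in original chronological order.
    return [history[i] for i in sorted(keep)]
```

Because the function only selects which stored messages to send, the stored thread itself is never modified, which matches the "no change to stored data" constraint.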
Model Routing Layer
Introduced a lightweight classifier that routes requests to the right model based on task complexity.
- Intent classification (2-class): GPT-4o-mini at $0.15/1M tokens
- Simple formatting tasks: GPT-4o-mini
- Complex generation (full email drafts): GPT-4 at $2.50/1M tokens
- Routing adds <15ms latency — transparent to users
Result: 60% of calls routed to GPT-4o-mini. Cost impact: −$3,200/month.
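A routing layer of this kind can be very small. The sketch below swaps the lightweight classifier for a static lookup table keyed by task type, which is a simplification; the task names and the fallback-to-GPT-4 rule are assumptions for illustration. The per-token prices are the figures quoted above.

```python
# USD per 1M tokens, as quoted in this case study.
PRICE_PER_1M_TOKENS = {"gpt-4": 2.50, "gpt-4o-mini": 0.15}

# Hypothetical task names; a real router would classify the request first.
ROUTES = {
    "classify_intent": "gpt-4o-mini",
    "format": "gpt-4o-mini",
    "draft_email": "gpt-4",
}


def route(task_type: str) -> str:
    # Assumption: unknown task types fall back to the larger model,
    # trading cost for quality until they are explicitly routed.
    return ROUTES.get(task_type, "gpt-4")


def estimate_cost_usd(model: str, tokens: int) -> float:
    return tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]
```

For example, `route("classify_intent")` returns `"gpt-4o-mini"`, while an unlisted task type returns `"gpt-4"`.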
Redis Semantic Cache
Semantic similarity cache using text-embedding-3-small to match near-duplicate requests.
- Embed each request on arrival (text-embedding-3-small: $0.02/1M tokens)
- Cosine similarity search against last 24-hour cache
- Hit threshold: 0.92 similarity — conservative to ensure quality
- Cache TTL: 24 hours; invalidated on user context changes
Result: 34% cache hit rate. Cost impact: −$1,900/month.
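The lookup logic can be illustrated without Redis or the embeddings API. The sketch below is an in-memory stand-in under stated assumptions: embeddings are passed in as plain lists of floats, the similarity search is a linear scan, and the `SemanticCache` class and its `clock` parameter are hypothetical names. The 0.92 threshold and 24-hour TTL come from the list above; invalidation on user context changes is left out.

```python
import math
import time


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.92, ttl_seconds=24 * 3600, clock=time.time):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = []  # (embedding, response, stored_at)

    def put(self, embedding, response):
        self.entries.append((embedding, response, self.clock()))

    def get(self, embedding):
        """Return the most similar unexpired response at or above the
        threshold, or None on a miss."""
        now = self.clock()
        # Drop entries older than the TTL before searching.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]
        best, best_sim = None, self.threshold
        for cached_embedding, response, _ in self.entries:
            sim = cosine(embedding, cached_embedding)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best
```

A miss (`None`) means the request goes to the LLM and the result is stored with `put`. A linear scan is only workable for small caches; a vector index would replace it at production volume.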
Retry Architecture + Cost Dashboard
Fixed retry logic with jitter and circuit breaker. Added real-time cost instrumentation.
- Max 3 retries with exponential backoff and full jitter
- Circuit breaker: opens after 5 failures in 60 seconds
- Per-request cost logged: model, tokens_in, tokens_out, latency, cost_usd
- Grafana dashboard: cost per user, per feature, per day — live
Result: Retry storms eliminated. Full cost visibility. Cost impact: −$1,800/month ($1,100 from retry storms plus $700 from the trimmed system prompt, per the audit breakdown).
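The retry policy above can be sketched as follows. This is a simplified illustration, not the deployed code: the class and function names, `base_delay`, and `max_delay` are assumptions, the breaker has no half-open state (it simply closes again once old failures age out of the 60-second window), and the blanket `except Exception` should be narrowed to transient errors in real use.

```python
import random
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, max_failures=5, window_seconds=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window_seconds
        self.clock = clock
        self.failures = []  # timestamps of recent failures

    def is_open(self):
        now = self.clock()
        # Forget failures that have aged out of the window.
        self.failures = [t for t in self.failures if now - t < self.window]
        return len(self.failures) >= self.max_failures

    def record_failure(self):
        self.failures.append(self.clock())


def call_with_retry(fn, breaker, max_retries=3, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        if breaker.is_open():
            raise CircuitOpenError("too many recent failures; skipping API call")
        try:
            return fn()
        except Exception:  # assumption: narrow this to transient errors
            breaker.record_failure()
            if attempt == max_retries:
                raise
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)].
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

With these defaults a failing request makes at most four calls (one original plus three retries), and once five failures land inside a minute, further requests fail fast without touching the API.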
Engagement Timeline
Audit to production deployment — 48 hours.
Cost Audit
- Pulled full API call logs for 30 days
- Computed cost per call, per user, per feature
- Ranked waste drivers by monthly dollar impact
- Identified top 5 addressable issues
Architecture Redesign
- Designed context compression algorithm
- Defined model routing decision tree
- Specified cache similarity threshold
- Wrote retry + circuit breaker spec
Implementation
- Context compression middleware deployed
- Model router integrated into API layer
- Redis semantic cache wired in
- Retry logic rewritten with circuit breaker
Validate & Monitor
- Shadow mode: new vs. old cost comparison
- Quality check on 200 sampled outputs
- Grafana dashboard live with cost metrics
- Changes cut over to 100% of traffic
Before and After: The Numbers
All measurements taken at identical user count (400 active users). No feature was removed or degraded. User-facing quality was validated against a 200-output sample before and after.
Cost Metrics
- Monthly LLM cost: $12,000 → $2,100
- Cost per active user: $30 → $5.25
- Cost per email draft: $0.082 → $0.014
- 34% of paid calls replaced by cache hits
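The per-user and percentage figures follow directly from the monthly totals and the 400-user count. A quick check, using only numbers stated in this case study (the helper names are illustrative):

```python
def cost_per_user(monthly_cost_usd: float, active_users: int) -> float:
    return monthly_cost_usd / active_users


def reduction_pct(before: float, after: float) -> float:
    return round((before - after) / before * 100, 1)


USERS = 400
print(cost_per_user(12_000, USERS))   # 30.0
print(cost_per_user(2_100, USERS))    # 5.25
print(reduction_pct(12_000, 2_100))   # 82.5
```

The headline 82% is this 82.5% figure rounded down.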
Token Efficiency
- Avg tokens per call: 8,200 → 1,100
- System prompt tokens: 1,400 → 180
- GPT-4 call share: 100% → 40%
- p95 context size: 22,000 → 1,900 tokens
Reliability
- Retry storm incidents: eliminated
- API call multiplier during degradations: 4.8× → 1.2×
- Time-to-detect cost spikes: real-time
- Output quality score (200-sample): unchanged
Before — The Cost Spiral
- $12,000/month at 400 users, on track for $60,000 at 2,000 users even if cost only scaled linearly
- No visibility: bill discovered monthly, not in real time
- Every LLM call sending full conversation history
- GPT-4 processing 2-class email classification tasks
- Retry storms amplifying costs 4.8× during any API issue
- No path to viable unit economics at Series A scale
After — Viable Unit Economics
- $2,100/month at same 400 users — $5.25/user
- Real-time cost dashboard: per user, per feature, per request
- Context compressed to 1,100 tokens avg — no quality loss
- 60% of calls routed to GPT-4o-mini — 10× cheaper
- Circuit breaker prevents retry storms entirely
- Series A data room: cost model now defensible at scale
Stack Used in This Engagement
AI & Models
- OpenAI GPT-4 (complex generation)
- OpenAI GPT-4o-mini (routing target)
- text-embedding-3-small (cache embeddings)
- Custom model router (Python)
Backend
- Python + FastAPI
- LangChain (LLM orchestration)
- PostgreSQL (user data)
- Celery + Redis (async tasks)
Caching
- Redis (semantic cache store)
- text-embedding-3-small (similarity)
- Cosine similarity threshold: 0.92
- 24-hour TTL with smart invalidation
Infrastructure
- AWS ECS (Fargate)
- AWS ElastiCache (Redis)
- AWS CloudWatch Logs
- GitHub Actions CI/CD
Observability
- Grafana (cost dashboards)
- PagerDuty (cost spike alerts)
- Structured cost logs per request
- Real-time cost-per-user tracking
Reliability
- Retry: max 3 with jitter
- Circuit breaker (5 failures / 60s)
- Fallback to cached response on failure
- Shadow mode for change validation
What This Case Study Teaches
Cost is an engineering metric, not a billing afterthought
The team discovered a $12,000 bill at month-end because they had no real-time visibility. By the time they knew, the architecture causing it had been live for 6 weeks. Cost instrumentation belongs in the initial build, not the post-launch review.
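As an illustration of how little code that instrumentation needs, here is a minimal sketch of a per-request cost record. The field names model, tokens_in, tokens_out, and cost_usd mirror the logged fields listed earlier; `user_id`, `feature`, `latency_ms`, the function names, and the logger name are assumptions. Prices are passed in by the caller because input and output tokens are usually billed at different rates.

```python
import json
import logging
from dataclasses import asdict, dataclass


@dataclass
class CostRecord:
    user_id: str
    feature: str
    model: str
    tokens_in: int
    tokens_out: int
    latency_ms: float
    cost_usd: float


def build_record(user_id, feature, model, tokens_in, tokens_out, latency_ms,
                 usd_per_1m_in, usd_per_1m_out) -> CostRecord:
    # Cost = input tokens at the input rate + output tokens at the output rate.
    cost = (tokens_in * usd_per_1m_in + tokens_out * usd_per_1m_out) / 1_000_000
    return CostRecord(user_id, feature, model, tokens_in, tokens_out,
                      latency_ms, round(cost, 6))


def log_cost(record: CostRecord, logger=logging.getLogger("llm.cost")) -> str:
    # One JSON object per line, so a log pipeline can aggregate by
    # user, feature, or day without custom parsing.
    line = json.dumps(asdict(record))
    logger.info(line)
    return line
```

Emitting one such line per API call is enough for a dashboard to sum `cost_usd` by user, feature, or day.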
Context is the biggest LLM cost driver — and the most fixable
43% of the monthly bill was excess context tokens. Compressing context from 8,200 to 1,100 tokens average required no model change, no user-facing change, and no quality loss. It was purely an engineering decision that nobody had made.
Model routing is not an advanced optimisation — it is table stakes
Using GPT-4 for email intent classification is like renting a Ferrari to deliver pizza. 60% of calls were misrouted to an expensive model by default, not by design. A routing layer is now standard in any AI feature Kuvaka ships.
Semantic caching has a meaningful hit rate on real workloads
A 34% cache hit rate on a production B2B SaaS dataset was higher than predicted. Users in the same company tend to write similar emails to similar counterparties. The cache exploits a real pattern in how people work.
Retry storms are silent cost multipliers
The retry architecture cost $1,100 in one month, and that was without a major API outage. During one, the same architecture would have multiplied call volume and spend by the 4.8× seen in the 40-minute degradation, for the full length of the incident. Circuit breakers are cheap insurance.
The cost audit is a one-day investment with a multi-year return
The audit took 8 hours, found $9,900/month in waste, and cost a fraction of that amount. Every AI product should run this audit before scaling. Most never do until the bill forces them to.
Your LLM Bill Has a Fixable Architecture
If your LLM costs are growing faster than your user count, submit a description of your setup. We will return a scoped cost audit offer within 24 hours.
Cost Audit in 8 Hours
Full token-level breakdown of where your spend is going and which drivers to fix first.
Fixes Deployed in 48 Hours
Context compression, model routing, caching, and retry architecture — all in one sprint.
Fixed Price, No Surprises
Scoped engagement with a single price agreed before we start.
No retainer. No open-ended engagement. One problem, one price, one timeline.