TLDR: Real 2026 pricing across Claude, GPT-4o, and Gemini with worked cost scenarios. Context window inflation in chat apps is the hidden cost nobody warns you about. Prompt caching is the single highest-ROI optimization. Add cost logging before you need it — I learned this the hard way.
My first month building a production AI feature ended with a billing surprise. Not catastrophic, but enough to make me immediately add the cost instrumentation I should have built from day one.
Everyone talks about which model is best. Almost nobody talks about what it costs to run in production. The LLM market has matured fast: prices have dropped significantly from 2024 levels, mid-tier models have become surprisingly capable, and the "just use GPT-4" default now costs 10-20× what the smart choice would. Here's what you'll actually see on your bill.
How LLM Pricing Works
LLMs are priced per million tokens (MTok). A token is roughly 0.75 English words.
"Hello, how are you today?" = ~6 tokens
A typical user message = 50–200 tokens
A full page of text = ~750 tokens
Claude Sonnet context = up to 200,000 tokens (~150,000 words)
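For quick budgeting, the 0.75 words-per-token rule of thumb can be turned into a rough estimator. This is a minimal sketch: `estimateTokens` is a hypothetical helper, not part of any SDK, and real tokenizers differ by model and language, so treat the result as a ballpark figure only.

```typescript
// Rule of thumb from the text: 1 token is roughly 0.75 English words.
// This is an approximation, not a real tokenizer.
const WORDS_PER_TOKEN = 0.75;

// Hypothetical helper: estimates token count from whitespace-separated words.
function estimateTokens(text: string): number {
  const words = text
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

// Example: a short greeting and a synthetic ~560-word page.
const greeting = "Hello, how are you today?";
const page = Array(560).fill("word").join(" ");
console.log(estimateTokens(greeting), estimateTokens(page));
```

For billing-grade numbers, use the token counts the provider returns in each response rather than an estimate like this.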
You pay separately for:
- Input tokens: everything sent to the model (system prompt + conversation history + user message)
- Output tokens: what the model generates back (always more expensive; typically 4–5× the input rate for the models listed below, and up to 8×)
- Cache read tokens: reduced rate when a static system prompt is reused across calls
Model Pricing Comparison (2026)
Anthropic Claude
| Model | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Claude Opus 4.7 | $15.00 | $75.00 |
| Claude Sonnet 4.6 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $0.80 | $4.00 |
OpenAI
| Model | Input (per MTok) | Output (per MTok) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| o3 | $10.00 | $40.00 |
| o4-mini | $1.10 | $4.40 |
Google Gemini
| Model | Input (per MTok) | Output (per MTok) |
|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 |
| Gemini 2.5 Flash | $0.15 | $0.60 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
The 2026 shift worth paying attention to: Gemini 2.5 Flash and GPT-4o mini now deliver near-Sonnet quality at Haiku-tier prices. The cost gap between "smart" and "cheap" models has narrowed significantly. Always benchmark your specific task before defaulting to a premium model.
Real-World Cost Scenarios
Customer support chatbot — 1,000 conversations/day
- System prompt: 500 tokens
- Average user message: 100 tokens
- Average bot reply: 300 tokens
Per conversation — input: 500 + 100 = 600 tokens
Per conversation — output: 300 tokens
Daily: 600,000 input + 300,000 output tokens
Claude Sonnet 4.6:
Input: 0.6M × $3.00/MTok = $1.80/day
Output: 0.3M × $15.00/MTok = $4.50/day
Total: $6.30/day → ~$189/month
Claude Haiku 4.5:
Total: $1.68/day → ~$50/month
Gemini 2.5 Flash:
Total: $0.27/day → ~$8/month
For FAQ-style support, switching from Sonnet to Gemini Flash cuts costs from $189 to $8/month. In 2026, Flash-tier models handle routine queries well enough that this is a real production choice — but benchmark on your actual queries first.
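The arithmetic above is easy to reproduce for your own workload. The sketch below uses the per-MTok prices quoted in this post and, like the scenario, assumes one exchange per conversation. The `Workload` and `Price` types and the `dailyCost` function are illustrative names, not from any library.

```typescript
interface Price {
  inputPerMTok: number; // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

interface Workload {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
}

// Daily cost in USD for a workload at a given price point.
function dailyCost(w: Workload, p: Price): number {
  const inputMTok = (w.callsPerDay * w.inputTokensPerCall) / 1_000_000;
  const outputMTok = (w.callsPerDay * w.outputTokensPerCall) / 1_000_000;
  return inputMTok * p.inputPerMTok + outputMTok * p.outputPerMTok;
}

// The support chatbot scenario: 500-token system prompt + 100-token message in, 300 out.
const supportBot: Workload = {
  callsPerDay: 1000,
  inputTokensPerCall: 600,
  outputTokensPerCall: 300,
};

// Prices as quoted in the tables above.
const candidates: Array<{ name: string; price: Price }> = [
  { name: "Claude Sonnet 4.6", price: { inputPerMTok: 3.0, outputPerMTok: 15.0 } },
  { name: "Claude Haiku 4.5", price: { inputPerMTok: 0.8, outputPerMTok: 4.0 } },
  { name: "Gemini 2.5 Flash", price: { inputPerMTok: 0.15, outputPerMTok: 0.6 } },
];

candidates.forEach(({ name, price }) => {
  const perDay = dailyCost(supportBot, price);
  console.log(`${name}: $${perDay.toFixed(2)}/day, ~$${(perDay * 30).toFixed(0)}/month`);
});
```

Swapping in your own `Workload` numbers is usually the fastest way to see whether a model switch is worth benchmarking.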
AI code review tool — 500 reviews/day
- System prompt: 2,000 tokens
- Code diff per review: 3,000 tokens
- Review output: 1,500 tokens
Per review — input: 5,000 tokens
Per review — output: 1,500 tokens
Claude Sonnet 4.6:
Input: 2.5M × $3.00/MTok = $7.50/day
Output: 0.75M × $15.00/MTok = $11.25/day
Total: $18.75/day → ~$563/month
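Code review is where Sonnet earns its price. Flash-tier models tend to miss subtle bugs and produce shallow analysis on complex diffs. Caching the static 2,000-token system prompt helps here too, though less than you might expect: it removes about 36% of the input cost ($2.70 of $7.50/day), or roughly 14% of the total bill, because output tokens dominate. More on caching below.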
Solo developer / side project
- 100 interactions/day, avg 200 tokens in + 500 tokens out
Claude Haiku 4.5: ~$0.22/day → ~$6.50/month
GPT-4o mini: ~$0.03/day → ~$1/month
Gemini 2.5 Flash: ~$0.03/day → ~$1/month
Side projects are extremely cheap at current rates. Don't over-optimize at hobby scale — pick the model that gives the best output for your use case and move on.
The Hidden Cost: Context Window Inflation
This is the production billing surprise that gets almost every team building chatbots. Conversation history grows with every turn:
Turn 1: [system: 500] + [user: 100] = 600 input tokens
Turn 5: [system: 500] + [4 exchanges: 1,600] + [user: 100] = 2,200 input tokens
Turn 10: [system: 500] + [9 exchanges: 3,600] + [user: 100] = 4,200 input tokens
Turn 20: [system: 500] + [19 exchanges: 7,600] + [user: 100] = 8,200 input tokens
By turn 20, a single request carries nearly 14× the input tokens of the first turn (8,200 vs. 600). Summed over the whole conversation, that is 88,000 input tokens, about 147× the first turn alone. I've seen teams hit unexpected bills of hundreds of dollars per day from a chatbot feature that seemed cheap in testing, because testing always used short conversations.
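The growth pattern in the turn-by-turn breakdown can be checked with a few lines of arithmetic. This sketch assumes the same sizes as the example (500-token system prompt, 100-token user messages, 300-token replies) and that the full history is resent on every turn. Both function names are illustrative.

```typescript
// Input tokens sent on a given turn when the full history is resent each time.
function inputTokensAtTurn(
  turn: number,
  systemTokens = 500,
  userTokens = 100,
  replyTokens = 300,
): number {
  const priorExchanges = turn - 1; // each prior exchange = one user message + one reply
  return systemTokens + priorExchanges * (userTokens + replyTokens) + userTokens;
}

// Total input tokens billed across a whole conversation of `turns` turns.
function cumulativeInputTokens(turns: number): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    total += inputTokensAtTurn(t);
  }
  return total;
}

console.log(inputTokensAtTurn(20)); // 8200
console.log(cumulativeInputTokens(20)); // 88000
```

Per-turn input grows linearly, so the cumulative total grows quadratically with conversation length. That is why short test conversations underestimate production cost so badly.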
What to do:
- Sliding window — keep only the last N messages in context
- Progressive summarisation — every 10 turns, compress prior history into a short summary
- RAG for long-term memory — store facts in a vector store and retrieve on demand, rather than carrying everything in every prompt
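The first option, a sliding window, is the simplest to implement. Here is a minimal sketch, assuming history is stored as an array of role-tagged messages with the system prompt kept separately. `ChatMessage` and `slidingWindow` are hypothetical names, and the leading-assistant trim is there because some chat APIs expect the message list to start with a user turn.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Keep only the most recent `maxMessages` messages.
// The system prompt is assumed to live outside this array.
function slidingWindow(history: ChatMessage[], maxMessages: number): ChatMessage[] {
  if (maxMessages <= 0) return [];
  const recent = history.slice(-maxMessages);
  // If the cut landed mid-exchange, drop the orphaned assistant reply
  // so the window starts with a user message.
  return recent[0]?.role === "assistant" ? recent.slice(1) : recent;
}

// Example: three full exchanges plus a new user message, trimmed to 4 messages.
const history: ChatMessage[] = [
  { role: "user", content: "u1" },
  { role: "assistant", content: "a1" },
  { role: "user", content: "u2" },
  { role: "assistant", content: "a2" },
  { role: "user", content: "u3" },
  { role: "assistant", content: "a3" },
  { role: "user", content: "u4" },
];
console.log(slidingWindow(history, 4).map((m) => m.content)); // [ 'u3', 'a3', 'u4' ]
```

A message-count window is crude because messages vary in size. A token-budget window is a natural next step, but it needs a token count per message.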
Prompt Caching: The Biggest Cost Lever
If your system prompt is static across requests, prompt caching cuts the cost of those cached input tokens by up to 90% on repeated calls. This is non-negotiable for any production deployment with a large system prompt.
Claude Caching Rates (2026)
| Tier | Per MTok |
|---|---|
| Cache write (first call) | $3.75 |
| Cache read (subsequent calls) | $0.30 |
| Regular input (no cache) | $3.00 |
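Cache writes cost 25% more than regular input, but reads cost 90% less. The write premium is $0.75/MTok and each read saves $2.70/MTok, so caching pays for itself on the first reuse, as long as that reuse arrives before the cache entry expires. Everything after that is pure savings.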
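import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment
const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API; size it to your expected reply
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt, // the static prompt you want cached
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages,
});

// Check how the cache is performing
console.log(response.usage.cache_read_input_tokens); // tokens served from the cache
console.log(response.usage.cache_creation_input_tokens); // tokens written to the cache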
What this looks like in practice: A 2,000-token system prompt called 10,000 times/day:
Without caching:
20M tokens/day × $3.00/MTok = $60/day → $1,800/month
With caching (1 write + 9,999 reads):
Write: 2,000 tokens × $3.75/MTok = negligible
Reads: 19.998M tokens × $0.30/MTok = $6.00/day
Saving: $54/day → $1,620/month
If you do one thing to reduce LLM costs, it's this.
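To run the same comparison with your own prompt size and traffic, the worked example can be expressed as two small functions. This is a sketch using the Claude Sonnet rates quoted above. The `cacheWrites` parameter is a hypothetical knob added here because cache entries expire after a period of inactivity, so bursty traffic will typically see more than the single write the example assumes.

```typescript
interface CacheRates {
  write: number; // USD per MTok, cache write
  read: number; // USD per MTok, cache read
  regular: number; // USD per MTok, uncached input
}

// Rates from the table above.
const sonnetRates: CacheRates = { write: 3.75, read: 0.3, regular: 3.0 };

// Daily cost of sending the prompt uncached on every call.
function promptCostWithoutCache(promptTokens: number, calls: number, r: CacheRates): number {
  return ((promptTokens * calls) / 1_000_000) * r.regular;
}

// Daily cost with caching: `cacheWrites` calls pay the write rate, the rest pay the read rate.
function promptCostWithCache(
  promptTokens: number,
  calls: number,
  r: CacheRates,
  cacheWrites = 1,
): number {
  const writes = Math.min(cacheWrites, calls);
  const reads = calls - writes;
  const writeCost = ((promptTokens * writes) / 1_000_000) * r.write;
  const readCost = ((promptTokens * reads) / 1_000_000) * r.read;
  return writeCost + readCost;
}

const uncached = promptCostWithoutCache(2000, 10_000, sonnetRates);
const cached = promptCostWithCache(2000, 10_000, sonnetRates);
console.log(`uncached: $${uncached.toFixed(2)}/day, cached: $${cached.toFixed(2)}/day`);
```

Raising `cacheWrites` to model expiry shows how much of the saving survives uneven traffic. In practice, compare against the cache read and write token counts in your real usage data.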
Model Routing: Match Model to Task
Not every task needs your smartest model. In 2026, the routing decision is more nuanced because mid-tier models are genuinely stronger — but the principle holds.
Classification / intent detection → Gemini Flash / GPT-4o mini ($0.10–$0.15/MTok input)
FAQ, summarisation, simple edits → Haiku 4.5 / Gemini Flash ($0.15–$0.80/MTok input)
Code generation, reasoning, analysis → Sonnet 4.6 / GPT-4o ($2.50–$3.00/MTok input)
Complex research, long doc synthesis → Opus 4.7 / o3 ($10.00–$15.00/MTok input)
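// Listing every task in the union lets the compiler check that the switch is exhaustive
type TaskType =
  | 'intent_classification'
  | 'sentiment_tagging'
  | 'simple_faq'
  | 'text_summarisation'
  | 'code_review'
  | 'document_analysis'
  | 'complex_reasoning'
  | 'research_synthesis';

function selectModel(task: TaskType): string {
  switch (task) {
    case 'intent_classification':
    case 'sentiment_tagging':
      return 'gemini-2.0-flash';
    case 'simple_faq':
    case 'text_summarisation':
      return 'claude-haiku-4-5-20251001';
    case 'code_review':
    case 'document_analysis':
      return 'claude-sonnet-4-6';
    case 'complex_reasoning':
    case 'research_synthesis':
      return 'claude-opus-4-7';
  }
}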
Rule I use: start with the cheapest model that can do the job. Move up only when benchmark evidence says it can't. "Benchmark evidence" means actually testing on your real prompts, not just assuming.
Free and Low-Cost Tiers
| Provider | Free offering | Best for |
|---|---|---|
| Google Gemini API | Generous free tier with rate limits | Prototyping, MVPs |
| Groq | Free tier — fast inference on Llama, Mixtral | Low-latency dev testing |
| Cloudflare Workers AI | Free tier on Llama, Mistral, Phi | Edge-deployed features |
| Ollama | Completely free — runs models locally | Dev/test, privacy-sensitive work |
| Together AI | Low-cost open model inference | Production at scale with open models |
The free tiers available in 2026 are genuinely capable. Most teams can get through MVP and early beta without spending anything on LLM costs.
Cost Monitoring in Production
Never wait for your monthly bill to understand what you're spending. Add cost logging from day one — I've done this on every project since the billing surprise I mentioned upfront.
interface LLMCallLog {
model: string;
feature: string; // which product feature triggered this call
user_id?: string; // for per-user cost attribution
input_tokens: number;
output_tokens: number;
cached_tokens: number;
cost_usd: number;
latency_ms: number;
}
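// Prices are in USD per million tokens.
// Assumes inputTokens is the total prompt size, including cachedTokens.
// Anthropic reports cache reads separately from usage.input_tokens, so add
// cache_read_input_tokens to it before calling. Cache writes are not priced here.
function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  cachedTokens = 0,
): number {
  const pricing: Record<string, { input: number; output: number; cacheRead: number }> = {
    'claude-sonnet-4-6': { input: 3.0, output: 15.0, cacheRead: 0.3 },
    'claude-haiku-4-5-20251001': { input: 0.8, output: 4.0, cacheRead: 0.08 },
    'claude-opus-4-7': { input: 15.0, output: 75.0, cacheRead: 1.5 },
    'gpt-4o': { input: 2.5, output: 10.0, cacheRead: 1.25 },
    'gpt-4o-mini': { input: 0.15, output: 0.6, cacheRead: 0.075 },
    'gemini-2.5-flash': { input: 0.15, output: 0.6, cacheRead: 0.038 },
    'gemini-2.0-flash': { input: 0.1, output: 0.4, cacheRead: 0.025 },
  };
  const p = pricing[model];
  if (!p) {
    // An unknown model would otherwise be logged as free and hide real spend
    console.warn(`No pricing entry for model "${model}"; logging cost as 0`);
    return 0;
  }
  // Guard against a negative count if cachedTokens exceeds inputTokens
  const billableInput = Math.max(0, inputTokens - cachedTokens);
  return (
    (billableInput / 1_000_000) * p.input +
    (cachedTokens / 1_000_000) * p.cacheRead +
    (outputTokens / 1_000_000) * p.output
  );
}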
Track cost per feature and cost per user. Set alerts at 80% of your monthly budget. The teams that get blindsided by LLM bills are the ones who treated cost as an afterthought.
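Per-feature tracking and the 80% alert can start out very simple. The sketch below assumes you already collect log entries with a `feature` and a `cost_usd` field. `costByFeature` and `budgetStatus` are hypothetical helpers, and how you deliver the alert (email, Slack, pager) is left out.

```typescript
// Minimal shape needed for aggregation; a fuller call log would carry more fields.
interface CostEntry {
  feature: string;
  cost_usd: number;
}

// Sum cost per product feature.
function costByFeature(logs: CostEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    totals.set(log.feature, (totals.get(log.feature) ?? 0) + log.cost_usd);
  }
  return totals;
}

type BudgetStatus = "ok" | "alert" | "over";

// Flag spend once it crosses alertRatio (80% by default) of the monthly budget.
function budgetStatus(
  spentUsd: number,
  monthlyBudgetUsd: number,
  alertRatio = 0.8,
): BudgetStatus {
  if (spentUsd >= monthlyBudgetUsd) return "over";
  if (spentUsd >= monthlyBudgetUsd * alertRatio) return "alert";
  return "ok";
}

// Example with made-up log entries.
const sampleLogs: CostEntry[] = [
  { feature: "chat", cost_usd: 0.5 },
  { feature: "code_review", cost_usd: 2.0 },
  { feature: "chat", cost_usd: 0.25 },
];

let monthToDate = 0;
costByFeature(sampleLogs).forEach((cost, feature) => {
  monthToDate += cost;
  console.log(`${feature}: $${cost.toFixed(2)}`);
});
console.log(budgetStatus(monthToDate, 3.0)); // 2.75 of 3.00 spent, so "alert"
```

Running the same aggregation keyed on a user ID instead of a feature name gives per-user attribution with no extra machinery.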
The short version: in 2026, Gemini Flash and GPT-4o mini handle a surprising amount of production work at $0.10–$0.15/MTok input. Sonnet is the right default for reasoning-heavy tasks. Prompt caching is the single highest-ROI optimization. Context window inflation is the most common unexpected cost. Log everything from the start, and you won't get surprised.