LLM API Cost Guide: What You Actually Pay in 2026

TLDR: Real 2026 pricing across Claude, GPT-4o, and Gemini with worked cost scenarios. Context window inflation in chat apps is the hidden cost nobody warns you about. Prompt caching is the single highest-ROI optimization. Add cost logging before you need it — I learned this the hard way.

My first month building a production AI feature ended with a billing surprise. Not catastrophic, but enough to make me immediately add the cost instrumentation I should have built from day one.

Everyone talks about which model is best. Almost nobody talks about what it costs to run in production. The LLM market has matured fast: prices have dropped significantly from 2024 levels, mid-tier models have become surprisingly capable, and the "just use GPT-4" default now costs 10-20× what the smart choice would. Here's what you'll actually see on your bill.

How LLM Pricing Works

LLMs are priced per million tokens (MTok). A token is roughly 0.75 English words.

"Hello, how are you today?" = ~6 tokens
A typical user message      = 50–200 tokens
A full page of text         = ~750 tokens
Claude Sonnet context       = up to 200,000 tokens (~150,000 words)
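
For quick budgeting, the 0.75 words-per-token rule of thumb can be turned into a rough estimator. This is only a sketch: `estimateTokens` is a hypothetical helper, not part of any SDK, and real tokenizers vary by model and language, so rely on the token counts your provider reports for anything billing-related.

```typescript
// Rough heuristic only: assumes ~0.75 English words per token.
// `estimateTokens` is a hypothetical helper, not part of any SDK.
const WORDS_PER_TOKEN = 0.75;

function estimateTokens(text: string): number {
  const trimmed = text.trim();
  if (trimmed === '') return 0;
  // Count whitespace-separated words, then scale by the heuristic.
  const words = trimmed.split(/\s+/).length;
  return Math.ceil(words / WORDS_PER_TOKEN);
}

console.log(estimateTokens('Hello, how are you today?')); // 5 words -> 7
```

The estimate lands close to the ~6 tokens quoted above, which is about as precise as a word-count heuristic gets.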

You pay separately for:

Input tokens: everything sent to the model (system prompt + conversation history + user message)
Output tokens: what the model generates back (always more expensive; 4–5× the input rate for most models listed below, 8× for Gemini 2.5 Pro)
Cache read tokens: a reduced rate when a static prefix such as the system prompt is reused across calls

Model Pricing Comparison (2026)

Anthropic Claude

Model	Input (per MTok)	Output (per MTok)
Claude Opus 4.7	$15.00	$75.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Haiku 4.5	$0.80	$4.00

OpenAI

Model	Input (per MTok)	Output (per MTok)
GPT-4o	$2.50	$10.00
GPT-4o mini	$0.15	$0.60
o3	$10.00	$40.00
o4-mini	$1.10	$4.40

Google Gemini

Model	Input (per MTok)	Output (per MTok)
Gemini 2.5 Pro	$1.25	$10.00
Gemini 2.5 Flash	$0.15	$0.60
Gemini 2.0 Flash	$0.10	$0.40

The 2026 shift worth paying attention to: Gemini 2.5 Flash and GPT-4o mini are now strong enough for many tasks that used to need a Sonnet-class model, at roughly a fifth of Haiku's input price ($0.15 vs $0.80/MTok). The quality gap between "smart" and "cheap" models has narrowed significantly. Always benchmark your specific task before defaulting to a premium model.

Real-World Cost Scenarios

Customer support chatbot — 1,000 conversations/day

System prompt: 500 tokens
Average user message: 100 tokens
Average bot reply: 300 tokens

Per conversation — input:  500 + 100 = 600 tokens
Per conversation — output: 300 tokens

Daily: 600,000 input + 300,000 output tokens

Claude Sonnet 4.6:
  Input:  0.6M × $3.00/MTok   = $1.80/day
  Output: 0.3M × $15.00/MTok  = $4.50/day
  Total:  $6.30/day → ~$189/month

Claude Haiku 4.5:
  Total:  $1.68/day → ~$50/month

Gemini 2.5 Flash:
  Total:  $0.27/day → ~$8/month

For FAQ-style support, switching from Sonnet to Gemini Flash cuts costs from $189 to $8/month. In 2026, Flash-tier models handle routine queries well enough that this is a real production choice — but benchmark on your actual queries first.
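
The arithmetic above is easy to script so you can plug in your own volumes. Here is a minimal sketch; `Rates`, `Workload`, and `dailyCostUsd` are names I made up for illustration, and the rates are copied from the tables in this guide. Like the scenario itself, it assumes one exchange per conversation.

```typescript
// Rates are USD per million tokens (MTok), copied from the tables in this guide.
interface Rates {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface Workload {
  requestsPerDay: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
}

// Daily cost = (daily input MTok × input rate) + (daily output MTok × output rate).
function dailyCostUsd(w: Workload, r: Rates): number {
  const inputMTok = (w.requestsPerDay * w.inputTokensPerRequest) / 1_000_000;
  const outputMTok = (w.requestsPerDay * w.outputTokensPerRequest) / 1_000_000;
  return inputMTok * r.inputPerMTok + outputMTok * r.outputPerMTok;
}

// The support chatbot above: 1,000 single-exchange conversations per day.
const supportBot: Workload = {
  requestsPerDay: 1_000,
  inputTokensPerRequest: 600, // 500 system + 100 user
  outputTokensPerRequest: 300,
};

const sonnet: Rates = { inputPerMTok: 3.0, outputPerMTok: 15.0 };
const haiku: Rates = { inputPerMTok: 0.8, outputPerMTok: 4.0 };
const flash: Rates = { inputPerMTok: 0.15, outputPerMTok: 0.6 };

const models: Array<[string, Rates]> = [
  ['Claude Sonnet 4.6', sonnet],
  ['Claude Haiku 4.5', haiku],
  ['Gemini 2.5 Flash', flash],
];

for (const [name, rates] of models) {
  const daily = dailyCostUsd(supportBot, rates);
  console.log(`${name}: $${daily.toFixed(2)}/day, ~$${(daily * 30).toFixed(0)}/month`);
}
```

Swap in your own `Workload` numbers before trusting any monthly figure; real conversations are rarely a single exchange.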

AI code review tool — 500 reviews/day

System prompt: 2,000 tokens
Code diff per review: 3,000 tokens
Review output: 1,500 tokens

Per review — input:  5,000 tokens
Per review — output: 1,500 tokens

Claude Sonnet 4.6:
  Input:  2.5M × $3.00/MTok   = $7.50/day
  Output: 0.75M × $15.00/MTok = $11.25/day
  Total:  $18.75/day → ~$563/month

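Code review is where Sonnet earns its price. Flash-tier models miss subtle bugs and produce shallow analysis on complex diffs. Caching the static 2,000-token system prompt cuts the input portion by roughly 36% (from $7.50 to about $4.80/day), which works out to about 14% off the total bill because output tokens dominate. More on caching below.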

Solo developer / side project

100 interactions/day, avg 200 tokens in + 500 tokens out

Claude Haiku 4.5:    ~$0.22/day → ~$6.50/month
GPT-4o mini:         ~$0.03/day → ~$1.00/month
Gemini 2.5 Flash:    ~$0.03/day → ~$1.00/month

Side projects are extremely cheap at current rates. Don't over-optimize at hobby scale — pick the model that gives the best output for your use case and move on.

The Hidden Cost: Context Window Inflation

This is the production billing surprise that gets almost every team building chatbots. Conversation history grows with every turn:

Turn 1:  [system: 500] + [user: 100]                          =  600 input tokens
Turn 5:  [system: 500] + [4 exchanges: 1,600] + [user: 100]   = 2,200 input tokens
Turn 10: [system: 500] + [9 exchanges: 3,600] + [user: 100]   = 4,200 input tokens
Turn 20: [system: 500] + [19 exchanges: 7,600] + [user: 100]  = 8,200 input tokens

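By turn 20, each request carries about 13.7× the input tokens of the first turn (8,200 vs 600). Summed across all 20 turns, the conversation bills 88,000 input tokens, more than 7× the 12,000 you would estimate by multiplying the first turn by 20. I've seen teams hit unexpected bills of hundreds of dollars per day from a chatbot feature that seemed cheap in testing, because testing always used short conversations.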
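
To see how fast this compounds, here is a small sketch that reproduces the per-turn figures above and also sums them over a whole conversation. The constants are the same assumptions as the table (500-token system prompt, 100-token user message, 300-token reply); the function names are mine.

```typescript
// Assumptions taken from the table above: 500-token system prompt,
// 100-token user message, 300-token reply (400 tokens per past exchange).
const SYSTEM_TOKENS = 500;
const USER_TOKENS = 100;
const REPLY_TOKENS = 300;

// Input tokens billed on a given turn (1-indexed) when full history is resent.
function inputTokensAtTurn(turn: number): number {
  const priorExchanges = turn - 1;
  return SYSTEM_TOKENS + priorExchanges * (USER_TOKENS + REPLY_TOKENS) + USER_TOKENS;
}

// Total input tokens billed across every turn of the conversation.
function cumulativeInputTokens(turns: number): number {
  let total = 0;
  for (let t = 1; t <= turns; t++) total += inputTokensAtTurn(t);
  return total;
}

console.log(inputTokensAtTurn(20));     // 8200
console.log(cumulativeInputTokens(20)); // 88000
```

Per-turn input grows linearly, so the cumulative total grows quadratically with conversation length. That is why short test conversations hide the problem.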

What to do:

Sliding window — keep only the last N messages in context
Progressive summarisation — every 10 turns, compress prior history into a short summary
RAG for long-term memory — store facts in a vector store and retrieve on demand, rather than carrying everything in every prompt
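
The sliding window is the simplest of the three to implement. A minimal sketch, assuming a `ChatMessage` shape modelled on common chat APIs and a system prompt that is passed separately (both are my assumptions, not something tied to a specific SDK):

```typescript
// Message shape is an assumption modelled on common chat APIs.
interface ChatMessage {
  role: 'user' | 'assistant';
  content: string;
}

// Keep only the most recent `maxMessages` entries. The system prompt is
// assumed to be passed separately, so it is never trimmed here.
function slidingWindow(history: ChatMessage[], maxMessages: number): ChatMessage[] {
  if (maxMessages <= 0) return [];
  const recent = history.slice(-maxMessages);
  // Drop a leading assistant message so the window starts with a user turn,
  // which some chat APIs require.
  return recent[0]?.role === 'assistant' ? recent.slice(1) : recent;
}
```

Call it right before each request, for example `slidingWindow(history, 10)`. The trade-off is that anything older than the window is simply forgotten, which is where summarisation or retrieval comes in.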

Prompt Caching: The Biggest Cost Lever

If your system prompt is static across requests, prompt caching cuts input costs by up to 90% on repeated calls. This is non-negotiable for any production deployment with a large system prompt.

Claude Caching Rates (2026)

Tier	Per MTok
Cache write (first call)	$3.75
Cache read (subsequent calls)	$0.30
Regular input (no cache)	$3.00

Cache writes cost 25% more than regular input, but reads cost 90% less. The write premium is $0.75/MTok and each read saves $2.70/MTok, so caching pays for itself on the very first reuse. Everything after that is pure savings.

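import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

const response = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API; tune for your use case
  system: [
    {
      type: 'text',
      text: yourLongSystemPrompt,
      // Marks this block as cacheable; later calls with an identical prefix read from cache.
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages,
});

// Check how the cache is performing
console.log(response.usage.cache_read_input_tokens);     // tokens served from cache
console.log(response.usage.cache_creation_input_tokens); // tokens written to cache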

What this looks like in practice: A 2,000-token system prompt called 10,000 times/day:

Without caching:
  20M tokens/day × $3.00/MTok = $60/day → $1,800/month

With caching (1 write + 9,999 reads):
  Write: 2,000 tokens × $3.75/MTok   = negligible
  Reads: 19.998M tokens × $0.30/MTok = $6.00/day

Saving: $54/day → $1,620/month
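
If you want to run the same comparison for your own prompt size and call volume, here is a rough sketch using the Sonnet rates from the table above. The function names are mine, and the default of a single cache write is an idealisation: it assumes the cache never expires during the day, so workloads with idle gaps should pass a higher `writes` value.

```typescript
// Sonnet rates in USD per MTok, from the caching table above.
const REGULAR_INPUT = 3.0;
const CACHE_WRITE = 3.75;
const CACHE_READ = 0.3;

// Cost of sending the same prompt `calls` times with no caching.
function uncachedCostUsd(promptTokens: number, calls: number): number {
  return ((promptTokens * calls) / 1_000_000) * REGULAR_INPUT;
}

// Cost with caching: `writes` calls pay the write rate, the rest pay the read rate.
// `writes` defaults to 1 (assumption: the cache stays warm for the whole period).
function cachedCostUsd(promptTokens: number, calls: number, writes = 1): number {
  const actualWrites = Math.min(Math.max(writes, 0), calls);
  const reads = calls - actualWrites;
  const writeCost = ((promptTokens * actualWrites) / 1_000_000) * CACHE_WRITE;
  const readCost = ((promptTokens * reads) / 1_000_000) * CACHE_READ;
  return writeCost + readCost;
}

const without = uncachedCostUsd(2_000, 10_000);
const withCache = cachedCostUsd(2_000, 10_000);
console.log(`$${without.toFixed(2)} vs $${withCache.toFixed(2)} per day`);
```

Only the cached prefix gets this discount. The user message and conversation history after it are still billed at the regular input rate.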

If you do one thing to reduce LLM costs, it's this.

Model Routing: Match Model to Task

Not every task needs your smartest model. In 2026, the routing decision is more nuanced because mid-tier models are genuinely stronger — but the principle holds.

Classification / intent detection    → Gemini Flash / GPT-4o mini   ($0.10–$0.15/MTok input)
FAQ, summarisation, simple edits     → Haiku 4.5 / Gemini Flash      ($0.15–$0.80/MTok input)
Code generation, reasoning, analysis → Sonnet 4.6 / GPT-4o           ($2.50–$3.00/MTok input)
Complex research, long doc synthesis → Opus 4.7 / o3                 ($10.00–$15.00/MTok input)

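// Every task the router knows about; the switch below is exhaustive over this union.
type TaskType =
  | 'intent_classification'
  | 'sentiment_tagging'
  | 'simple_faq'
  | 'text_summarisation'
  | 'code_review'
  | 'document_analysis'
  | 'complex_reasoning'
  | 'research_synthesis';

function selectModel(task: TaskType): string {
  switch (task) {
    // Cheapest tier: short, high-volume classification calls
    case 'intent_classification':
    case 'sentiment_tagging':
      return 'gemini-2.0-flash';
    // Low-cost tier: routine text work
    case 'simple_faq':
    case 'text_summarisation':
      return 'claude-haiku-4-5-20251001';
    // Mid tier: code and analysis where quality matters
    case 'code_review':
    case 'document_analysis':
      return 'claude-sonnet-4-6';
    // Premium tier: reserve for the hardest tasks
    case 'complex_reasoning':
    case 'research_synthesis':
      return 'claude-opus-4-7';
  }
}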

Rule I use: start with the cheapest model that can do the job. Move up only when benchmark evidence says it can't. "Benchmark evidence" means actually testing on your real prompts, not just assuming.
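
One way to make that rule mechanical is to run your eval set against each candidate and pick the cheapest model that clears your quality bar. A minimal sketch: `BenchmarkResult` and `cheapestPassing` are hypothetical names, and the pass rates in the example are made-up placeholders, not measurements.

```typescript
// Hypothetical benchmark result for one model on your own eval set.
interface BenchmarkResult {
  model: string;
  inputPerMTok: number; // USD, used here as a simple proxy for cost
  passRate: number;     // fraction of eval cases passed, 0..1
}

// Return the cheapest model whose pass rate meets the threshold,
// or undefined if none qualifies.
function cheapestPassing(
  results: BenchmarkResult[],
  minPassRate: number,
): BenchmarkResult | undefined {
  return results
    .filter((r) => r.passRate >= minPassRate) // filter() copies, so the input is not mutated
    .sort((a, b) => a.inputPerMTok - b.inputPerMTok)[0];
}

// Pass rates below are made-up placeholders, not measurements.
const results: BenchmarkResult[] = [
  { model: 'claude-sonnet-4-6', inputPerMTok: 3.0, passRate: 0.96 },
  { model: 'claude-haiku-4-5-20251001', inputPerMTok: 0.8, passRate: 0.91 },
  { model: 'gemini-2.0-flash', inputPerMTok: 0.1, passRate: 0.84 },
];

console.log(cheapestPassing(results, 0.9)?.model);
```

Input price is a crude cost proxy. If your task is output-heavy, rank by estimated cost per request instead.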

Free and Low-Cost Tiers

Provider	Free offering	Best for
Google Gemini API	Generous free tier with rate limits	Prototyping, MVPs
Groq	Free tier — fast inference on Llama, Mixtral	Low-latency dev testing
Cloudflare Workers AI	Free tier on Llama, Mistral, Phi	Edge-deployed features
Ollama	Completely free — runs models locally	Dev/test, privacy-sensitive work
Together AI	Low-cost open model inference	Production at scale with open models

The free tiers available in 2026 are genuinely capable. Most teams can get through MVP and early beta without spending anything on LLM costs.

Cost Monitoring in Production

Never wait for your monthly bill to understand what you're spending. Add cost logging from day one — I've done this on every project since the billing surprise I mentioned upfront.

interface LLMCallLog {
  model: string;
  feature: string;     // which product feature triggered this call
  user_id?: string;    // for per-user cost attribution
  input_tokens: number;
  output_tokens: number;
  cached_tokens: number;
  cost_usd: number;
  latency_ms: number;
}

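// Returns the estimated cost of one call in USD.
// Assumes inputTokens is the TOTAL prompt size, including any cached tokens.
// Some APIs report cached tokens separately from input tokens; if yours does,
// add them together before calling. Cache write surcharges are not modelled.
function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  cachedTokens = 0,
): number {
  // USD per million tokens, from the pricing tables in this guide.
  const pricing: Record<string, { input: number; output: number; cacheRead: number }> = {
    'claude-sonnet-4-6':           { input: 3.0,  output: 15.0, cacheRead: 0.3  },
    'claude-haiku-4-5-20251001':   { input: 0.8,  output: 4.0,  cacheRead: 0.08 },
    'claude-opus-4-7':             { input: 15.0, output: 75.0, cacheRead: 1.5  },
    'gpt-4o':                      { input: 2.5,  output: 10.0, cacheRead: 1.25 },
    'gpt-4o-mini':                 { input: 0.15, output: 0.6,  cacheRead: 0.075},
    'gemini-2.5-flash':            { input: 0.15, output: 0.6,  cacheRead: 0.038},
    'gemini-2.0-flash':            { input: 0.1,  output: 0.4,  cacheRead: 0.025},
  };

  const p = pricing[model];
  if (!p) {
    // Don't fail the request over a missing price, but don't hide the gap either.
    console.warn(`calculateCost: no pricing entry for model "${model}"; logging cost as 0`);
    return 0;
  }

  // Cached tokens are billed at the cache-read rate, the rest at the regular input rate.
  const billableInput = Math.max(inputTokens - cachedTokens, 0);
  return (
    (billableInput / 1_000_000) * p.input +
    (cachedTokens / 1_000_000) * p.cacheRead +
    (outputTokens / 1_000_000) * p.output
  );
}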

Track cost per feature and cost per user. Set alerts at 80% of your monthly budget. The teams that get blindsided by LLM bills are the ones who treated cost as an afterthought.
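
Both of those are a few lines once the logs exist. A minimal sketch, assuming each log entry carries at least a `feature` and a `cost_usd` field; `CostEntry`, `costByFeature`, and `overAlertThreshold` are illustrative names, and wiring the alert to email, Slack, or a pager is left out.

```typescript
// Minimal subset of the logged fields needed for aggregation.
interface CostEntry {
  feature: string;
  cost_usd: number;
}

// Sum logged cost per product feature.
function costByFeature(logs: CostEntry[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    totals.set(log.feature, (totals.get(log.feature) ?? 0) + log.cost_usd);
  }
  return totals;
}

// True once month-to-date spend reaches the alert fraction of the budget.
function overAlertThreshold(
  spendUsd: number,
  monthlyBudgetUsd: number,
  fraction = 0.8,
): boolean {
  return spendUsd >= monthlyBudgetUsd * fraction;
}
```

Per-user attribution works the same way: key the map on `user_id` instead of `feature`.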

The short version: in 2026, Gemini Flash and GPT-4o mini handle a surprising amount of production work at $0.10–$0.15/MTok input. Sonnet is the right default for reasoning-heavy tasks. Prompt caching is the single highest-ROI optimization. Context window inflation is the most common unexpected cost. Log everything from the start, and you won't get surprised.
