Best Practices

Prompt Caching Can Cut Your AI API Bill by 90%. Here Is How to Set It Up.

7 min read

TL;DR: If your app sends the same system prompt on every API call, you are paying full price for tokens the provider has already processed. Prompt caching stores that processed context and lets subsequent requests reuse it at a fraction of the cost. OpenAI does it automatically. Anthropic gives you explicit control. Google offers context caching on Gemini. The savings are real and they compound fast.

You Are Probably Overpaying for the Same Tokens

Here is a pattern that plays out at almost every company running AI in production.

Your app has a system prompt. Maybe it is 1,500 tokens of instructions, rules, persona definitions, and few-shot examples. Every single API call sends that entire prompt. If your team makes 10,000 calls per day, that is 15 million input tokens spent just repeating context the model has seen before.

At GPT-4o pricing ($2.50 per million input tokens), that repeated context alone costs $37.50 per day, or roughly $1,125 per month. And that is before the actual user queries.

Prompt caching eliminates most of that cost. Instead of processing those 1,500 tokens fresh every time, the provider stores the processed version and charges you a fraction of the standard rate on subsequent calls.

The numbers are hard to ignore:

  • OpenAI charges 50% less for cached input tokens (automatic, no code changes)
  • Anthropic charges 90% less for cache reads ($0.30 vs $3.00 per million on Sonnet 4.6)
  • Google Gemini charges 75% less for cached context on supported models

For the team in our example, switching to cached prompts drops that $1,125 monthly cost to somewhere between $112 and $562, depending on the provider. That is real money back in the budget every month.

How Prompt Caching Actually Works

The concept is straightforward. When you send a request, the provider checks whether it has already processed an identical prefix of your prompt. If it finds a match, it skips the expensive processing step and uses the cached version instead.

Think of it like a compiled binary. The first time you compile code, it takes a while. But if the source has not changed, you can reuse the compiled output instantly. Prompt caching works on the same principle, just with transformer attention states instead of machine code.

There are two flavors worth understanding:

Provider-level prompt caching stores processed token states on the provider's infrastructure. This is what OpenAI, Anthropic, and Google offer. Your static prompt prefix gets cached after the first request, and subsequent requests with the same prefix get a discount.

Semantic caching is different. It sits in your own infrastructure and checks whether an incoming query is similar enough to a previous one that you can return the cached response entirely, skipping the API call. Tools like Redis with vector search, GPTCache, and Upstash handle this. The savings are even bigger (100% on a cache hit), but the accuracy tradeoff requires careful tuning.

For most teams, starting with provider-level caching is the right move. Zero risk, minimal effort, immediate savings.

Setting It Up: Provider by Provider

OpenAI: It Just Works

OpenAI made the smart decision to handle caching automatically. There is nothing to configure. Every API request to GPT-4o and newer models gets prompt caching applied behind the scenes.

Here is what happens: if the first 1,024+ tokens of your prompt match a recently cached prefix, OpenAI charges you the discounted cached rate. The cache typically stays warm for 5 to 10 minutes of inactivity, with an extended retention option that keeps it active for up to an hour.

The practical implication is simple. Put your static content (system prompt, instructions, few-shot examples) at the beginning of every request, and put variable content (the user's actual query) at the end. That is it. OpenAI handles the rest.

What you will see on your bill: Cached input tokens appear as a separate line item at 50% of the standard input rate. If you are using CostLayer to track your OpenAI spend, you will see the split between cached and uncached tokens in your model breakdown.
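Putting the two pieces together, here is a minimal sketch of the prefix-first structure using the official OpenAI Python SDK. The system prompt text and user query are placeholders; the `usage.prompt_tokens_details.cached_tokens` field is where the SDK reports tokens billed at the cached rate.

```python
import os

# Static content first, variable content last: OpenAI's automatic
# caching matches on the prompt prefix, so this ordering is what
# makes cache hits possible.
SYSTEM_PROMPT = "You are a support assistant. <static rules and few-shot examples>"

def build_messages(user_query: str) -> list[dict]:
    """Place the static system prompt before the variable user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ]

if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_messages("How do I reset my password?"),
    )
    # Tokens billed at the discounted cached rate for this request:
    print(response.usage.prompt_tokens_details.cached_tokens)
```

Note that the first request after a cold start will report zero cached tokens; the discount kicks in on subsequent requests that share the prefix.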

Anthropic: Explicit Control, Bigger Savings

Anthropic takes a different approach. Instead of automatic caching, you explicitly mark which parts of your prompt should be cached using cache control headers. This gives you more control but requires a small code change.

The savings are also significantly larger. Cache reads on Claude Sonnet 4.6 cost just $0.30 per million tokens, compared to $3.00 for standard input. That is a 90% discount.

Here is the key thing to know: there is a small write premium. The first time a prompt gets cached, Anthropic charges 25% more than the standard rate for those tokens. But that cost is amortized across every subsequent read, so if your cache gets hit even a handful of times, you come out way ahead.

The cache has a minimum size of 1,024 tokens and a TTL of about 5 minutes, which resets with each cache hit. For apps making frequent calls, the cache stays warm indefinitely.

Pro tip: Structure your messages so the cacheable content (system prompt, long documents, instructions) comes first, marked with cache breakpoints. Put the per-request variable content last. This maximizes your cache hit rate.
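The breakpoint pattern looks roughly like this with Anthropic's Python SDK. The instruction text is a placeholder, and the model id is an assumption (substitute whatever Sonnet version you are on); the `cache_control` marker and the `cache_creation_input_tokens` / `cache_read_input_tokens` usage fields are the real API surface.

```python
import os

LONG_INSTRUCTIONS = "You are a support assistant. <static instructions and examples>"

def build_system_blocks(static_text: str) -> list[dict]:
    """Mark the static prefix as cacheable.

    Anthropic caches everything up to and including the block that
    carries the cache_control breakpoint, so static content goes here
    and per-request content goes in the messages list."""
    return [
        {
            "type": "text",
            "text": static_text,
            "cache_control": {"type": "ephemeral"},
        }
    ]

if os.environ.get("ANTHROPIC_API_KEY"):
    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id; use your current Sonnet
        max_tokens=1024,
        system=build_system_blocks(LONG_INSTRUCTIONS),
        messages=[{"role": "user", "content": "How do I reset my password?"}],
    )
    # cache_creation_input_tokens: tokens written to cache (25% premium)
    # cache_read_input_tokens: tokens read from cache (90% discount)
    print(response.usage.cache_creation_input_tokens,
          response.usage.cache_read_input_tokens)
```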

Google Gemini: Context Caching for Long Documents

Google's approach is called Context Caching, and it is particularly useful if you are working with large documents. You create an explicit cache object with a TTL, and then reference it in subsequent requests.

The discount varies by model, but you can expect around 75% savings on cached tokens. The minimum cache size is larger than the other providers (32,768 tokens on Gemini 1.5 Pro), so this works best for use cases involving long context: legal documents, codebases, research papers, and similar workloads.

The tradeoff is that you manage cache lifecycle explicitly. You create caches, set their TTL, and delete them when you are done. It is more operational overhead than OpenAI's automatic approach, but the savings on large-context workloads make it worthwhile.
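The create-reference-delete lifecycle can be sketched with the `google-generativeai` SDK. The document text and instruction strings are placeholders, and the model version string is an assumption; the `CachedContent.create` / `from_cached_content` calls are the documented API.

```python
import datetime
import os

def cache_ttl(minutes: int) -> datetime.timedelta:
    """Gemini caches carry an explicit TTL that you manage yourself."""
    return datetime.timedelta(minutes=minutes)

if os.environ.get("GOOGLE_API_KEY"):
    import google.generativeai as genai
    from google.generativeai import caching

    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

    # 1. Create an explicit cache holding the large static context.
    cache = caching.CachedContent.create(
        model="models/gemini-1.5-pro-001",  # assumed version string
        system_instruction="Answer questions about the attached contract.",
        contents=["<full contract text, at least 32,768 tokens>"],
        ttl=cache_ttl(60),
    )

    # 2. Reference the cache in subsequent requests.
    model = genai.GenerativeModel.from_cached_content(cached_content=cache)
    response = model.generate_content("Summarize the termination clause.")
    print(response.text)

    # 3. Delete the cache when done to stop storage billing.
    cache.delete()
```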

The Math: What This Looks Like at Scale

Let us run the numbers for a real scenario. Say your team has a customer support AI that processes 50,000 requests per day. The system prompt is 2,000 tokens, and the average user message is 200 tokens.

Without caching:

  • Daily input tokens: 50,000 x 2,200 = 110 million tokens
  • Monthly input cost (GPT-4o at $2.50/M): $8,250

With prompt caching (assuming 85% cache hit rate):

  • Cached tokens: 93.5M/day x 30 days = 2,805M at $1.25/M = $3,506.25
  • Uncached tokens: 16.5M/day x 30 days = 495M at $2.50/M = $1,237.50
  • Monthly input cost: $4,743.75

Monthly savings: roughly $3,506, a 42.5% reduction in input costs. (The arithmetic is intuitive: 85% of your tokens are billed at half price, and 0.85 x 50% = 42.5%.)
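The scenario above can be reproduced with a small calculator, useful for plugging in your own traffic and pricing. The numbers below assume GPT-4o pricing ($2.50/M standard, $1.25/M cached); swap in Anthropic's $0.30/M cache-read rate to model the 90% discount instead.

```python
def monthly_input_cost(
    requests_per_day: int,
    prompt_tokens: int,          # input tokens sent per request
    hit_rate: float,             # fraction of tokens served from cache
    price_per_m: float,          # standard input price, $ per million tokens
    cached_price_per_m: float,   # discounted cached price, $ per million
    days: int = 30,
) -> float:
    """Estimate monthly input spend with prompt caching."""
    monthly_tokens = requests_per_day * prompt_tokens * days
    cached = monthly_tokens * hit_rate
    uncached = monthly_tokens - cached
    return (cached * cached_price_per_m + uncached * price_per_m) / 1_000_000

# Scenario from this section: 50,000 requests/day, 2,200 tokens each,
# GPT-4o pricing, 85% cache hit rate.
baseline = monthly_input_cost(50_000, 2_200, 0.0, 2.50, 1.25)
with_cache = monthly_input_cost(50_000, 2_200, 0.85, 2.50, 1.25)
print(f"without caching: ${baseline:,.2f}")    # $8,250.00
print(f"with caching:    ${with_cache:,.2f}")  # $4,743.75
```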

Even with more modest numbers, the pattern holds. If your application has any degree of prompt repetition, caching will save you money. The question is not whether to use it, but how much you will save.

Quick Wins to Maximize Your Cache Hit Rate

A few practical things you can do this week to get more value from caching:

Front-load your static content. Cache matching works on prefixes. If your system prompt changes between requests, the cache misses. Move all static instructions to the very beginning and keep variable content at the end.

Consolidate system prompts. If different features in your app use slightly different system prompts, look for a shared prefix. A single large cached system prompt with feature-specific instructions appended is often cheaper than multiple smaller uncached prompts.

Monitor your cache hit rate. Both OpenAI and Anthropic report cache hits in their API responses. Track this metric alongside your cost data. If your hit rate drops below 70%, your prompt structure probably needs adjustment.
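Computing the hit rate from a response is a one-liner; the sketch below uses the field names from OpenAI's chat completions usage object (Anthropic reports the equivalent via `cache_read_input_tokens`).

```python
def cache_hit_rate(cached_tokens: int, total_prompt_tokens: int) -> float:
    """Fraction of input tokens served from cache for one request."""
    if total_prompt_tokens == 0:
        return 0.0
    return cached_tokens / total_prompt_tokens

# With the OpenAI SDK, pull the per-request numbers from the usage object:
#   usage = response.usage
#   rate = cache_hit_rate(usage.prompt_tokens_details.cached_tokens,
#                         usage.prompt_tokens)
print(cache_hit_rate(1_700, 2_000))  # 0.85
```

Log this alongside each request and alert when the rolling average drops; a falling hit rate usually means something variable crept into your prompt prefix.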

Batch similar requests together. If you have multiple requests that use the same context, send them close together in time. This keeps the cache warm and maximizes hits.

Pair caching with model routing. Not every request needs your most expensive model. Use CostLayer's model swap recommendations to identify which calls can use cheaper models, then layer caching on top for maximum savings.

When Prompt Caching Is Not Enough

Caching works brilliantly for system-prompt-heavy applications. But if your cost problem is more about output tokens than input tokens, or if every request has unique context (like RAG with different retrieved documents each time), you will need additional strategies.

That is where tools like CostLayer come in. Instead of guessing where your money goes, you get a clear breakdown by model, by team, and by project. And the model swap engine automatically identifies where you are using expensive models for tasks that cheaper ones handle just as well.

Start by checking your current provider breakdown. If you do not know how your spend splits between input and output tokens, between models, or between teams, you are optimizing blind.

Start Saving Today

Prompt caching is one of those rare optimizations that is genuinely free. No infrastructure to deploy, no code rewrites, no quality tradeoffs. You just structure your prompts correctly and the savings appear on your next bill.

If you are spending more than $500/month on AI APIs and you have not looked at your caching strategy, you are almost certainly leaving money on the table.

Track your AI API costs in real time and get model swap recommendations that save you 30-50% on average.

Get started with CostLayer - from $7.49/mo.


Enjoyed this article?

Get weekly AI pricing updates, cost optimization strategies, and model comparison data.

Subscribe to the AI Spend Report →

Join 100+ engineering leaders. Unsubscribe anytime.
