#Gemini 2.5 prompt caching: when and why?

1 messages · Page 1 of 1 (latest)

kindred spear
#

I'm using Gemini 2.5 pro/flash (paid plans) and I'm adding the cache_control directive to the big system prompt.
I've double checked and the system prompt is always identical (no wonder...) however:

  • sometimes it gets cached
  • sometimes it's processed out of cache

the pattern is inconsistent: sometimes the first interaction in the chat picks-up the cached copy, sometimes not.
Sometimes mid converation it will "forget" and re-cache the system prompt

Is this related to OpenRouter.ai?
Is anyone seeing this on Google APIs directly?

polar jungleBOT
#

Gemini 2.5 Pro and Flash support implicit caching with an average TTL of 3-5 minutes, but this TTL does not update with repeated requests. Cached data is automatically managed by the model, so you don't need to manually create or delete caches. The caching behavior you observe—sometimes using cached prompts and sometimes re-caching—can be due to this TTL expiration or internal cache management, not specifically because of OpenRouter.ai.

OpenRouter simplifies Gemini cache management by abstracting cache handling; it does not control when caching hits or misses happen. To maximize cache hits, keep the initial part of your prompt consistent and push dynamic content toward the end.

Google's caching on Gemini 2.5 Pro/Flash has minimum token thresholds (1028 tokens for Flash, 2048 for Pro) to qualify for caching, and cache writes include a fixed 5-minute storage charge. The cache expiration after 3-5 minutes and the model's internal logic can cause the inconsistent caching pattern you see, including mid-conversation re-caching when the TTL expires.

This behavior is consistent with Google's own API and not unique to OpenRouter.ai.

For details on Gemini caching and usage:

Are you primarily using system prompts with large, static content for caching, or do you also cache user messages?

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. Run Gemini 2.5 Pro with API

kindred spear
#

I'm using the cache_control directive only on the large system prompt corpus

polar jungleBOT
#

Using the cache_control directive on your large system prompt corpus with Gemini 2.5 Pro/Flash is the recommended way to enable caching for large static content. Gemini 2.5 models support implicit caching with an average TTL of 3-5 minutes, and OpenRouter handles cache management automatically — you don't need to manually manage cache names or TTLs.

Key points:

  • Cache applies only if the prompt meets minimum token thresholds: 2048 tokens for Gemini 2.5 Pro, 1028 tokens for 2.5 Flash.
  • The cache TTL does not update with repeated requests; after 3-5 minutes, caching expires and the prompt is re-cached.
  • OpenRouter routes requests to providers with warm caches when possible, but may fallback to others if unavailable.
  • You should insert cache_control breakpoints in large static text parts (like your system prompt corpus).
  • Only the last cache_control breakpoint in the message is used for Gemini caching.

This setup explains why caching sometimes hits and sometimes misses, including mid-conversation cache expiration and re-caching. This behavior is consistent with Google's own API and not specific to OpenRouter.

More details here: Prompt Caching | OpenRouter

Do you want advice on optimizing your prompt structure for better cache efficiency?

kindred spear
#

my tests show caches expiring way sooner then 3 minutes: may this be due to opensourter switching between providers?

polar jungleBOT
#

Yes, cache expiration sooner than the typical 3-5 minute TTL can happen if OpenRouter switches between different providers. OpenRouter tries to route requests to providers with warm caches to maximize cache hits, but if the primary provider is unavailable or your request filters allow fallback, it may route to another provider without the cached prompt. This results in the cache "missing" and being re-created, causing what looks like early expiration.

OpenRouter abstracts cache management, so you don't control cache TTL or provider switching directly, but provider failover and load balancing can lead to inconsistent caching behavior.

More details on provider routing and cache handling are here: Provider Routing | OpenRouter and Prompt Caching | OpenRouter

Are you using any specific provider preferences or filters in your OpenRouter setup?

kindred spear
#

about the storage cost: i't not really clear how much it will cost!

polar jungleBOT
#

Storage cost for Gemini 2.5 Pro/Flash caching includes a fixed charge of 5 minutes of storage per cache write. This cost is added on top of your normal input and output token costs. Exact pricing depends on the provider and plan you use.

For example, Gemini 2.5 Pro charges $0.10 per 1,000 input tokens and $0.40 per 1,000 output tokens, plus this fixed cache storage fee. OpenRouter routes requests to providers with varying pricing, so your total cost depends on the selected provider as well.

You can see detailed pricing and provider costs here: R1 Distill Llama 70B Pricing and Providers

Would you like help estimating costs based on your expected token usage?

DeepSeek R1 Distill Llama 70B is a distilled large language model based on [Llama-3.3-70B-Instruct](/meta-llama/llama-3. Run R1 Distill Llama 70B with API

kindred spear
#

mh... this are cost data for DeepSeek: which is the storage cost for Gemini 2.5 Flash and Pro ? cna you opint me to official documentation?

polar jungleBOT
#

Gemini 2.5 Flash and Pro have no separate storage cost for caching; their pricing only includes input and output token costs with no additional charges for cache storage.

  • Gemini 2.5 Flash costs $0.30 per 1,000 input tokens and $1.238 per 1,000 output tokens.
  • Gemini 2.5 Pro costs $10 per 1,000 input tokens and $5.16 per 1,000 output tokens.

You can find these details listed under the Gemini 2.5 Flash and Pro models on OpenRouter's models page: Models | OpenRouter.

For official documentation on Gemini models and pricing, visit:

Would you like help comparing costs for your expected usage?

OpenRouter

Browse models on OpenRouter

Gemini 2.5 Flash is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. Run Gemini 2.5 Flash with API

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. Run Gemini 2.5 Pro with API