#Does Gemini 2.0 Flash not support context caching on OR?
Add "cache_control": {"type": "ephemeral"} to your requests.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)
print(completion.choices[0].message.content)
```
ah yeah didn't see it at the very bottom. Thanks!
Do I need to add it to every message I want cached or does it act as kind of a checkpoint where everything above it is cached (if possible)?
U add it to what u want to be cached.
So if u want all cached, yes.
Yw
Example: if u dont want ur system prompt cached, u dont add it there, u only add it to the msg u want cached.
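A minimal sketch of that per-message choice, assuming the content-parts form of `cache_control` that shows up later in this thread. The helper name `mark_cacheable` is made up for illustration and is not part of any SDK; it only builds the message dicts and sends nothing.

```python
from copy import deepcopy


def mark_cacheable(message):
    """Return a copy of a chat message whose text carries a cache_control marker.

    Hypothetical helper: string content is converted into a one-element list
    of text parts, and the last part is tagged as ephemeral.
    """
    marked = deepcopy(message)  # leave the caller's message untouched
    content = marked["content"]
    if isinstance(content, str):
        # Convert plain string content into a list with a single text part.
        content = [{"type": "text", "text": content}]
    # Assumes a non-empty list of dict parts.
    content[-1]["cache_control"] = {"type": "ephemeral"}
    marked["content"] = content
    return marked


system = {"role": "system", "content": "Your long system prompt here..."}
user = {"role": "user", "content": "What are the benefits of prompt caching in LLM APIs?"}

# Mark only the user message and leave the system prompt unmarked.
messages = [system, mark_cacheable(user)]
```

Swapping which messages go through `mark_cacheable` is all it takes to change what carries the marker.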
Can you please tell me if I need to write the full conversation again if I am iteratively passing information?
Like
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)
print(completion.choices[0].message.content)
```
and then in the next iteration,
```python
completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "assistant",
            "content": completion.choices[0].message.content,
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "My next input",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)
print(completion.choices[0].message.content)
```
You do have to resend every piece of context you want the model to “remember” on each chat/completions call.
Neither the OpenAI API nor the OpenRouter proxy keeps state between requests - the endpoint is stateless, so the model only sees the messages array you transmit in that single request.
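Since the endpoint is stateless, the client has to carry the history itself. A minimal sketch of that bookkeeping, with the network call left out; `Conversation` and its method names are hypothetical, not an OpenAI or OpenRouter API.

```python
class Conversation:
    """Keeps the running message list that must be resent on every request."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        self.messages.append({"role": "assistant", "content": text})

    def payload(self, model):
        # The full history goes into every request body, not just the new turn.
        return {"model": model, "messages": list(self.messages)}


chat = Conversation("Your long system prompt here...")
chat.add_user("What are the benefits of prompt caching in LLM APIs?")
first = chat.payload("google/gemini-2.5-pro-preview-03-25")

# After the reply comes back, store it and append the next user turn.
chat.add_assistant("(reply text from the first response)")
chat.add_user("My next input")
second = chat.payload("google/gemini-2.5-pro-preview-03-25")
```

The second payload contains everything from the first plus the two new turns, which is why the input size grows on every iteration.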
okay. I think I am being charged a lot more then, and I dont think it is able to context cache properly.
because I only send in like 2-3k new tokens in every iteration. However, I have seen that in the final iterations my input was 150k tokens and I was charged for that, even though almost all of those tokens were the chat history and under cache_control.
do let me know if my code syntax is incorrect:
```python
assistant_message = {
    "role": "assistant",
    "content": [
        {
            "type": "text",
            "text": content,
            "cache_control": {
                "type": "ephemeral"
            }
        }
    ]
}
self.conversation_history.append(assistant_message)

# ---------------------------------------------------------------

data = {
    "model": self.model,
    "messages": conversation_history,
    "usage": {
        "include": True
    }
}
response = requests.post(self.url, headers=self.headers, json=data)
response.raise_for_status()
return response.json()
```
You are charged for all input and output tokens.
This includes the entire conversation_history sent in the messages array on every request.
There’s no API support for marking certain messages as “ephemeral” or “do not bill”. You pay for all text sent.
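To see how a given request was counted, one option is to look at the `usage` object that the `"usage": {"include": True}` flag in the request above asks for. A rough sketch, assuming the response reports `prompt_tokens` and, on a cache hit, `prompt_tokens_details.cached_tokens`; those field names are an assumption here and worth checking against a real response. The sample dict is made up to mirror the numbers in this thread.

```python
def summarize_usage(response_json):
    """Split prompt tokens into cached and uncached counts.

    Field names are assumed, so every lookup falls back to 0 when a key
    is absent or null.
    """
    usage = response_json.get("usage") or {}
    prompt = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return {"prompt": prompt, "cached": cached, "uncached": prompt - cached}


# Hypothetical response body, not real API output.
sample = {
    "usage": {
        "prompt_tokens": 150000,
        "prompt_tokens_details": {"cached_tokens": 0},
    }
}
summary = summarize_usage(sample)
print(summary)
```

If `cached` stays at 0 across iterations, that would suggest the history is being counted as fresh input each time.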
