Does Gemini 2.0 Flash not support context caching on OR? | OpenRouter | Page 1

ashen jetty Jul 2, 2025, 10:09 AM

Does Gemini 2.0 Flash not support context caching on OR? I have been sending 10k+ token inputs that are almost identical, but they never get cached. According to the docs caching should happen automatically, but that doesn't seem to be the case.

near fractal Jul 2, 2025, 12:26 PM

ashen jetty Does Gemini 2.0 Flash not support context caching on OR? I have been sending 10k...

Add "cache_control": {"type": "ephemeral"} to your requests.
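In a raw request body, one way to attach that flag is on a text content part. A minimal sketch, assuming the content-part form (the same shape as the `assistant_message` snippet later in this thread); `build_cached_request` and `LONG_CONTEXT` are hypothetical names, the model slug is an assumption to check against the model page, and the body is only built here, not sent.

```python
# Hypothetical placeholder for the large part of the prompt that repeats on every request.
LONG_CONTEXT = "Reference material that is identical on every request. " * 500


def build_cached_request(model: str, question: str) -> dict:
    """Build a chat/completions body that marks the long context with cache_control.

    Assumption: cache_control is read from a text content part, not from the message itself.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": LONG_CONTEXT,
                        # Marks this part as cacheable.
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            },
            # The short, changing part of the prompt is left unmarked.
            {"role": "user", "content": question},
        ],
    }


# Assumed model slug; `body` could then be sent with requests.post(url, headers=..., json=body).
body = build_cached_request("google/gemini-2.0-flash-001", "Summarize the reference material.")
```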

near fractal Jul 2, 2025, 12:27 PM

ashen jetty Does Gemini 2.0 Flash not support context caching on OR? I have been sending 10k...

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)

print(completion.choices[0].message.content)
```

ashen jetty Jul 2, 2025, 12:32 PM

near fractal from openai import OpenAI client = OpenAI( base_url="https://openrouter.ai/...

ah yeah didn't see it at the very bottom. Thanks!

Do I need to add it to every message I want cached or does it act as kind of a checkpoint where everything above it is cached (if possible)?

near fractal Jul 2, 2025, 12:39 PM

ashen jetty Do I need to add it to every message I want cached or does it act as kind of a c...

U add it to what u want to be cached.

So if u want all cached, yes.

ashen jetty Jul 2, 2025, 12:39 PM

cheers

thanks for the help

near fractal Jul 2, 2025, 12:39 PM

Yw

near fractal Jul 2, 2025, 12:40 PM

ashen jetty :cheers:

Example: if u dont want ur system prompt cached, u dont add it there, only to the msg.
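A sketch of "add it only where you want it": `mark_cached` is a hypothetical helper that tags only the chosen messages, converting their content to the text-part form used elsewhere in this thread. Whether a marker caches just that message or everything up to it is provider-specific (the "checkpoint" question above), so this only controls where the markers go.

```python
import copy


def mark_cached(messages: list[dict], indices: set[int]) -> list[dict]:
    """Return a copy of `messages` with cache_control added to the messages at `indices`.

    Hypothetical helper: string content is converted to a single text part,
    and the marker is placed on the last part of each selected message.
    """
    result = copy.deepcopy(messages)  # leave the caller's history untouched
    for i in indices:
        msg = result[i]
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return result


history = [
    {"role": "system", "content": "Short system prompt, not worth caching."},
    {"role": "user", "content": "Very long document pasted here..."},
]

# Cache only the user message (index 1); the system prompt gets no marker.
marked = mark_cached(history, {1})
```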

proper trout Jul 3, 2025, 1:44 PM

near fractal from openai import OpenAI client = OpenAI( base_url="https://openrouter.ai/...

Can you please tell me whether I need to resend the full conversation every time if I'm passing information iteratively?
Like:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)

print(completion.choices[0].message.content)
```

And then in the next iteration:

```python
completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "assistant",
            "content": completion.choices[0].message.content,  # reply from the previous call
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "My next input",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)

print(completion.choices[0].message.content)
```

near fractal Jul 3, 2025, 1:55 PM

proper trout and then in the next iteration, completion = client.chat.completions.create...

You do have to resend every piece of context you want the model to “remember” on each chat/completions call.
Neither the OpenAI API nor the OpenRouter proxy keeps state between requests: the endpoint is stateless, so the model only sees the messages array you send in that single request.
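Because the endpoint is stateless, the client has to keep the history and send all of it every time. A minimal sketch of that bookkeeping: `ChatSession` and `fake_send` are hypothetical names, and the network call is passed in as a function (for example a wrapper around `client.chat.completions.create` from the snippets above) so the sketch runs without an API key.

```python
from typing import Callable


class ChatSession:
    """Keeps the full message list client-side and resends it on every call."""

    def __init__(self, system_prompt: str, send: Callable[[list[dict]], str]):
        # `send` receives the whole messages list and returns the assistant's reply text.
        self.messages: list[dict] = [{"role": "system", "content": system_prompt}]
        self.send = send

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(self.messages)  # the entire history goes out on every request
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# With the OpenAI SDK, `send` could be (not executed here):
# lambda msgs: client.chat.completions.create(model=..., messages=msgs).choices[0].message.content


def fake_send(msgs: list[dict]) -> str:
    """Offline stand-in that reports how many messages it received."""
    return f"saw {len(msgs)} messages"


session = ChatSession("Your long system prompt here...", fake_send)
first = session.ask("What are the benefits of prompt caching in LLM APIs?")  # 2 messages sent
second = session.ask("My next input")  # 4 messages sent
```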

proper trout Jul 3, 2025, 4:23 PM

near fractal You do have to resend every piece of context you want the model to “remember” on...

Okay. Then I think I'm being charged a lot more than I should be, and I don't think context caching is working properly.

Because I only send about 2-3k new tokens in each iteration, but by the last iterations my input was 150k tokens and I was charged for all of it, even though almost all of those tokens were chat history marked with cache_control.
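The growth described here follows from resending the full history. A back-of-the-envelope sketch, under the simplifying assumption that every turn adds the same 2,500 tokens (the middle of "2-3k", counting the user input and the reply together); `prompt_sizes` is a hypothetical helper.

```python
def prompt_sizes(new_tokens_per_turn: int, turns: int) -> list[int]:
    """Input tokens sent on each request when the full history is resent.

    Simplifying assumption: every turn adds the same number of tokens
    (new user input plus the assistant reply) to the history.
    """
    return [new_tokens_per_turn * k for k in range(1, turns + 1)]


sizes = prompt_sizes(2_500, 60)
print(sizes[-1])   # 150000: prompt size of the 60th request
print(sum(sizes))  # 4575000: total input tokens across all 60 requests
```

Under that assumption the prompt reaches 150k tokens on the 60th request, and the 60 requests together send about 4.6M input tokens. Caching can change how the repeated tokens are priced, but they are still sent on every request.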

Do let me know if my code syntax is incorrect:

```python
# Append the assistant reply as a text part marked for caching.
assistant_message = {
    "role": "assistant",
    "content": [
        {
            "type": "text",
            "text": content,
            "cache_control": {
                "type": "ephemeral"
            }
        }
    ]
}
self.conversation_history.append(assistant_message)
```

```python
# Inside a method of the same class; `requests` is imported at module level.
data = {
    "model": self.model,
    # Assumes the history lives on `self`, as in the append above
    # (a bare `conversation_history` would be a different, possibly undefined, name).
    "messages": self.conversation_history,
    "usage": {
        "include": True
    }
}
response = requests.post(self.url, headers=self.headers, json=data)
response.raise_for_status()
return response.json()
```
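One way to tell whether the cache_control markers are being honored is to read the usage block returned by the request above instead of the total input size. A sketch, assuming the response reports cached prompt tokens under `usage.prompt_tokens_details.cached_tokens` (OpenAI-style field names; check the actual JSON you get back). `cached_share` is a hypothetical helper that returns 0.0 when those fields are missing, and the example numbers are made up.

```python
def cached_share(response: dict) -> float:
    """Fraction of prompt tokens reported as served from cache (0.0 if not reported)."""
    usage = response.get("usage") or {}
    prompt_tokens = usage.get("prompt_tokens") or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    return cached / prompt_tokens if prompt_tokens else 0.0


# Illustrative response fragment with made-up numbers, not real OpenRouter output.
example = {"usage": {"prompt_tokens": 150_000, "prompt_tokens_details": {"cached_tokens": 145_000}}}
print(round(cached_share(example), 3))  # 0.967
```

If this stays at 0.0 across requests whose prefix is identical, the markers are probably not being picked up for that model.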

near fractal Jul 3, 2025, 4:36 PM

proper trout do let me know if my code syntax is incorrect: assistant_message =...

You are charged for all input and output tokens.
This includes the entire conversation_history sent in the messages array on every request.
There's no API support for marking messages as "do not bill": the "ephemeral" cache_control only affects caching, and you still pay for all text sent.
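Paying for all text sent is compatible with caching still saving money, since providers that support prompt caching generally bill cache reads at a reduced rate rather than for free. A sketch of that arithmetic; `input_cost`, `price_per_mtok` and `cache_read_multiplier` are hypothetical names whose values should come from the model's pricing page, and the example numbers below are made up.

```python
def input_cost(prompt_tokens: int, cached_tokens: int,
               price_per_mtok: float, cache_read_multiplier: float) -> float:
    """Estimated input cost in dollars for one request.

    Assumptions: uncached tokens cost `price_per_mtok` per million tokens, cached
    tokens cost that rate times `cache_read_multiplier`; cache-write and storage
    fees are ignored.
    """
    uncached = prompt_tokens - cached_tokens
    per_token = price_per_mtok / 1_000_000
    return uncached * per_token + cached_tokens * per_token * cache_read_multiplier


# Made-up numbers: $0.10 per million input tokens, cache reads at 25% of that rate.
no_cache = input_cost(150_000, 0, 0.10, 0.25)
with_cache = input_cost(150_000, 145_000, 0.10, 0.25)
```

With those made-up numbers, the 150k-token request costs $0.015 uncached and about $0.0041 when 145k of its tokens are read from cache; cache-write and storage fees, which some providers add, are left out.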
