#Does Gemini 2.0 Flash not support context caching on OR?

ashen jetty
#

Does Gemini 2.0 Flash not support context caching on OR? I have been sending 10k+ token inputs that are almost identical but they never cache. According to the docs it should cache it automatically but this doesn't seem to be the case.

near fractal
# ashen jetty Does Gemini 2.0 Flash not support context caching on OR? I have been sending 10k...

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)

print(completion.choices[0].message.content)

ashen jetty
#

Do I need to add it to every message I want cached or does it act as kind of a checkpoint where everything above it is cached (if possible)?

near fractal
#

So if you want everything cached, yes.
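
For comparison, here is a minimal sketch of the other shape that comes up later in this thread, where `cache_control` sits on a text part inside a `content` list rather than under `provider_metadata`. The helper name `text_message` is hypothetical, and which of the two shapes OpenRouter actually honors for Gemini is an assumption to check against its prompt caching docs. The sketch only builds the payload and sends nothing.

```python
from typing import Any


def text_message(role: str, text: str, cache: bool = False) -> dict[str, Any]:
    """Build a chat message whose content is a list holding one text part."""
    part: dict[str, Any] = {"type": "text", "text": text}
    if cache:
        # Assumed breakpoint marker, copied from the content-part snippet
        # posted later in this thread; not verified against the API.
        part["cache_control"] = {"type": "ephemeral"}
    return {"role": role, "content": [part]}


messages = [
    # Mark only the large, stable prefix (here, the system prompt).
    text_message("system", "Your long system prompt here...", cache=True),
    text_message("user", "What are the benefits of prompt caching in LLM APIs?"),
]
```

The resulting `messages` list can be passed as the `messages` argument in the snippet above in place of the hand-written dicts.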

ashen jetty
#

thanks for the help

near fractal
#

Yw

proper trout
# near fractal from openai import OpenAI client = OpenAI( base_url="https://openrouter.ai/...

Can you please tell me whether I need to resend the full conversation each time if I am passing information iteratively? For example:

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="<OPENROUTER_API_KEY>",
)

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)

print(completion.choices[0].message.content)
#

and then in the next iteration,

completion = client.chat.completions.create(
    model="google/gemini-2.5-pro-preview-03-25",  # Use a supported Gemini model
    messages=[
        {
            "role": "system",
            "content": "Your long system prompt here...",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content": "What are the benefits of prompt caching in LLM APIs?",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "assistant",
            "content" : completion.choices[0].message.content,
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        },
        {
            "role": "user",
            "content" : "My next input",
            "provider_metadata": {
                "openrouter": {
                    "cache_control": {"type": "ephemeral"}
                }
            }
        }
    ]
)

print(completion.choices[0].message.content)
proper trout
#

I ask because I only send about 2-3k new tokens in each iteration. However, by the final iterations my input had grown to 150k tokens and I was charged for all of it, even though almost all of those tokens were chat history marked with cache_control.
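
One way to check whether any of that history was actually served from cache is to look at the usage block in each response. The sketch below assumes the response carries an OpenAI-style `usage` object with `prompt_tokens` and a nested `prompt_tokens_details.cached_tokens` field; those field names are an assumption here, and the helper `split_prompt_tokens` and the sample numbers are made up for illustration.

```python
def split_prompt_tokens(usage: dict) -> tuple[int, int]:
    """Return (cached_tokens, uncached_tokens) from a usage dict.

    Missing fields are treated as zero, so a response without cache
    details reports everything as uncached.
    """
    prompt_tokens = usage.get("prompt_tokens", 0) or 0
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0) or 0
    return cached, prompt_tokens - cached


# Illustrative values only, not taken from a real response.
sample_usage = {
    "prompt_tokens": 150_000,
    "completion_tokens": 500,
    "prompt_tokens_details": {"cached_tokens": 0},
}

cached, uncached = split_prompt_tokens(sample_usage)
print(f"cached={cached} uncached={uncached}")
```

If `cached` stays at zero across iterations while `prompt_tokens` keeps growing, the cache markers are probably not being picked up in the form they are sent.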

#

Do let me know if my code syntax is incorrect:

        assistant_message = {
            "role": "assistant",
            "content": [
                    {
                        "type": "text",
                        "text": content,
                        "cache_control": {
                            "type": "ephemeral"
                        }
                    }
                ]
        }
        self.conversation_history.append(assistant_message)

---------------------------------------------------------------


        data = {
            "model": self.model,
            "messages": self.conversation_history,
            "usage": {
                "include": True
            }
        }
        response = requests.post(self.url, headers=self.headers, json=data)
        response.raise_for_status()
        return response.json()
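
Since the snippet above adds a `cache_control` marker to every assistant message, the history ends up with one marker per turn. A possible alternative, sketched below, keeps a single marker on the last text part before each request. Whether a single trailing breakpoint is what the provider prefers is an assumption to verify against the OpenRouter prompt caching docs, and `with_single_breakpoint` is a hypothetical helper, not part of any library.

```python
import copy


def with_single_breakpoint(history: list[dict]) -> list[dict]:
    """Return a copy of history with cache_control only on the last text part."""
    messages = copy.deepcopy(history)  # leave the caller's history untouched
    last_part = None
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no part to carry a marker
        for part in content:
            if isinstance(part, dict) and part.get("type") == "text":
                part.pop("cache_control", None)  # drop any earlier marker
                last_part = part
    if last_part is not None:
        last_part["cache_control"] = {"type": "ephemeral"}
    return messages
```

In the class above this would be applied just before building `data`, for example `"messages": with_single_breakpoint(self.conversation_history)`.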