#Move some context out of system message


twilit escarp
#

Is it possible to move some context out of the system message to help with caching on the Anthropic API? I recently put $10 in credits to use with OpenClaw and the 1-hour cache, but I got $4 of cache writes and $2 of cache reads in about 4 total hours of usage. I guess this is because the system message keeps changing, which triggers a cache rewrite every single time. This also matters because I'm going to create an OpenClaw instance to use with some friends, and since we're sharing the cost and using it for real purposes, not experiments, we need maximum efficiency in credit usage.

merry marlinBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

hollow houndBOT
drifting nexusBOT
#

Let me verify this against the docs/source so I don't guess.

#

I found the relevant bit. OpenClaw already does exactly this now, so the main question is whether your churn is coming from something else in the prompt/session shape.

#

Yes, and newer OpenClaw already does this.

It now splits the system prompt into a cache-stable prefix and a volatile suffix, so changing bits like HEARTBEAT.md and per-turn metadata should not bust the whole Anthropic cache by themselves.

Docs:

So if you're seeing a lot of cacheWrite, it's usually one of these instead:

  • the session went idle past the TTL, then had to re-warm
  • something above the cache boundary changed, like injected bootstrap files, tools, skills, config, model/provider
  • long tool output or media-heavy history got re-cached

For a shared, cost-sensitive setup, I'd recommend:

  • default to cacheRetention: "short"
  • use long only for sessions/agents you revisit often
  • if you do use 1h cache, keep it warm only where it matters (heartbeat.every: "55m")
#
  • enable contextPruning.mode: "cache-ttl" so re-warms are smaller
  • keep injected files small, especially MEMORY.md / memory.md

Also, 1h cache is not always cheaper. On Anthropic, cache writes are the expensive part, so if sessions are bursty or go idle a lot, short or even none can be more cost efficient.

More docs:

If you want, paste openclaw --version and I can suggest a concrete JSON5 config for a multi-user shared setup.

twilit escarp
#

At this point, is it worth using cache at all? Haiku is $1/million input and $5/million output, but I got $4 of cache writes and $2 of cache reads (input) and only $0.40 of output. Isn't this very wrong?

drifting nexusBOT
#

I want to sanity-check the pricing side before I say "yes/no", because the answer depends a lot on Anthropic's cache multipliers and your reuse pattern.

#

It looks unintuitive, but not necessarily wrong.

Assuming Claude Haiku 4.5 on direct Anthropic, current pricing is roughly:

  • normal input: $1 / MTok
  • 1h cache write: $2 / MTok
  • cache read: $0.10 / MTok

So your dollar totals need to be converted back to tokens.

With your numbers:

  • $4 cache writes = about 2M tokens written to cache
  • $2 cache reads = about 20M tokens served from cache

Without caching, that same prompt traffic would have been about:

  • 22M normal input tokens
  • about $22 of input cost
#

With caching, you paid about $6 for that input-side traffic.

So oddly enough, those numbers suggest cache probably saved you money, not lost you money. The $0.40 output just means the model didn't generate much text. Output cost is separate.
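Here is the same back-of-the-envelope conversion as a small Python sketch. The prices are the ones quoted above; the "uncached" figure is a hypothetical counterfactual that assumes every cached token would otherwise have been billed as normal input.

```python
# Prices in USD per million tokens (MTok), as quoted above for Haiku 4.5.
INPUT_PER_MTOK = 1.00
WRITE_1H_PER_MTOK = 2.00
READ_PER_MTOK = 0.10


def tokens_from_spend(dollars: float, price_per_mtok: float) -> float:
    """Convert a dollar total back into millions of tokens."""
    return dollars / price_per_mtok


written_mtok = tokens_from_spend(4.00, WRITE_1H_PER_MTOK)  # cache writes
read_mtok = tokens_from_spend(2.00, READ_PER_MTOK)         # cache reads

# Hypothetical counterfactual: the same traffic billed as plain input.
uncached_cost = (written_mtok + read_mtok) * INPUT_PER_MTOK
cached_cost = 4.00 + 2.00

print(f"written: {written_mtok:.1f} MTok, read: {read_mtok:.1f} MTok")
print(f"uncached: ${uncached_cost:.2f}, cached: ${cached_cost:.2f}")
```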

What is true:

  • 1h cache is only worth it if you reuse the same growing prefix enough
  • for bursty/shared usage, short is usually the safer default
  • I'd use long only for persistent chat/research agents

Rule of thumb for Haiku 4.5:

  • 5m cache is usually worth it after 1 reuse
  • 1h cache is usually worth it after about 2+ reuses
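A minimal sketch of where that rule of thumb comes from, using the per-MTok prices quoted above. It assumes every reuse lands inside the TTL (so it is billed as a cache read) and that the prefix does not grow between requests; `break_even_reuses` is just an illustrative helper name.

```python
INPUT = 1.00     # USD per MTok, normal input
WRITE_5M = 1.25  # USD per MTok, 5-minute cache write
WRITE_1H = 2.00  # USD per MTok, 1-hour cache write
READ = 0.10      # USD per MTok, cache read


def break_even_reuses(write_price: float) -> int:
    """Smallest number of reuses at which caching is cheaper than no cache.

    Cached cost:   one write plus one read per reuse.
    Uncached cost: full input price for the first request and every reuse.
    """
    reuses = 0
    while write_price + reuses * READ >= INPUT * (1 + reuses):
        reuses += 1
    return reuses


print("5m cache pays off after", break_even_reuses(WRITE_5M), "reuse(s)")
print("1h cache pays off after", break_even_reuses(WRITE_1H), "reuse(s)")
```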

So for a shared OpenClaw setup, my recommendation is:

  • default cacheRetention: "short"
  • use long only on specific agents you revisit a lot
  • keep contextPruning.mode: "cache-ttl" enabled
twilit escarp
#

But why exactly I wrote 2 million tokens in 4 hours of usage? And what would be the price for these same 2 million tokens considering I come back to a chat in about 5-30 minutes or even longer?

#

I meant, if the cache were 5min instead of 1 hour

drifting nexusBOT
#

Because those 2M write tokens are not "what you typed". They're the cached prompt prefix being written, over and over as the conversation grows.

That prefix includes:

  • tool definitions
  • system prompt
  • injected files/context
  • prior chat history
  • tool results

So a session can hit big numbers fast. For example, 20 warm writes with an average cached prefix of 100k tokens is already 2M write tokens. In OpenClaw, that is pretty plausible once a thread gets long or tool-heavy.

Also, Anthropic's cache advances with the conversation, so as the prefix grows, later writes can be much larger than early ones.
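To make that concrete, here is a rough Python sketch of how full re-writes of a growing prefix add up. The starting size, growth per re-write, and number of re-writes are made-up illustration values (chosen so the average prefix is 100k tokens), not measurements from your session.

```python
WRITE_1H_PER_MTOK = 2.00  # USD per MTok, 1h cache write price quoted earlier


def total_write_tokens(start: int, growth: int, rewrites: int) -> int:
    """Sum the prefix size over a number of full cache re-writes.

    The prefix starts at `start` tokens and is `growth` tokens larger
    each time it has to be written to the cache again.
    """
    return sum(start + i * growth for i in range(rewrites))


# Hypothetical session: 20 re-writes, prefix growing from 5k to 195k tokens.
total = total_write_tokens(start=5_000, growth=10_000, rewrites=20)
cost = total / 1_000_000 * WRITE_1H_PER_MTOK

print(f"{total:,} write tokens, about ${cost:.2f} at the 1h write price")
```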

For the same 2M tokens of prefix traffic on Haiku 4.5 (treated as one lump to keep the math simple):

  • No cache: $2.00 each time
  • 5m cache write: $2.50 once
#
  • 1h cache write: $4.00 once
  • Cache read/hit: $0.20 each reuse

So if you come back:

  • within 5 minutes

    • 5m cache hit: $0.20
    • 1h cache hit: $0.20
  • after 5 to 30 minutes

    • 5m cache is expired, you pay another $2.50
    • 1h cache still hits, you pay $0.20
  • after more than 1 hour

    • both miss, you pay a fresh write again
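The same decision as a small Python sketch: what a single revisit costs, depending on how long you were away. It uses the prices quoted above and treats the 2M tokens as one fixed prefix, which is a simplification.

```python
PREFIX_MTOK = 2.0                        # size of the cached prefix, in MTok
WRITE = {"5m": 1.25, "1h": 2.00}         # USD per MTok, cache write
TTL_MINUTES = {"5m": 5, "1h": 60}        # cache lifetime per retention
READ = 0.10                              # USD per MTok, cache read


def revisit_cost(gap_minutes: float, retention: str) -> float:
    """Cost in USD of one revisit after `gap_minutes` of idle time."""
    if gap_minutes <= TTL_MINUTES[retention]:
        return PREFIX_MTOK * READ            # still cached: cheap read
    return PREFIX_MTOK * WRITE[retention]    # expired: pay a fresh write


for gap in (3, 20, 90):
    print(f"after {gap} min: 5m=${revisit_cost(gap, '5m'):.2f}, "
          f"1h=${revisit_cost(gap, '1h'):.2f}")
```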

Rule of thumb:

#
  • mostly <5m returns: short
  • mostly 5m to 1h returns to the same chat: long
  • mostly one-off / bursty / many different chats: none or short

And importantly, cache is about reusing the same prefix, not all usage globally. Different chats/friends/sessions will not magically share one cache if their history differs.

If you want to see where the tokens are actually coming from, try:

  • /context detail
  • or enable cache tracing via the prompt-caching docs

Docs:

If you want, I can give you a "shared instance, cheapest sane defaults" JSON5 config next.

#

Ah, got it.

If those same 2M write tokens had been 5-minute cache writes instead of 1-hour cache writes on Haiku 4.5:

  • 1h write: 2M × $2/MTok = $4.00
  • 5m write: 2M × $1.25/MTok = $2.50

So yes, for the same write volume, 5-minute cache would have been $1.50 cheaper.

But the catch is the TTL:

  • come back within 5 minutes: still a cheap cache read ($0.20 for 2M tokens)
  • come back after more than 5 minutes: the 5m cache is gone, so you pay another write
  • come back after 5 to 30 minutes: 5m usually misses, 1h usually still hits

So for a 2M-token prefix, counting the first request plus one revisit:

#

If you come back after 3 minutes

  • no cache: $4.00
  • 5m cache: $2.70
  • 1h cache: $4.20

If you come back after 10 minutes

  • no cache: $4.00
  • 5m cache: $5.00
  • 1h cache: $4.20

So the rough takeaway is:

  • rapid back-and-forth in the same chat → short is great
  • returning after 5 to 30 minutes → long is usually better
  • only 1 revisit after 10-30m → even long can still be slightly worse than no cache
  • 2+ revisits within the hour → long starts winning
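If you want to play with other return patterns, here is a rough Python sketch that totals the first request plus any number of revisits for each option. It assumes the prices quoted above, a fixed 2M-token prefix that does not grow, and that each gap is measured from the previous request; `total_cost` is just an illustrative helper.

```python
PREFIX_MTOK = 2.0                   # fixed prefix size, in MTok
INPUT = 1.00                        # USD per MTok, normal input
READ = 0.10                         # USD per MTok, cache read
WRITE = {"5m": 1.25, "1h": 2.00}    # USD per MTok, cache write
TTL_MINUTES = {"5m": 5, "1h": 60}   # cache lifetime per retention


def total_cost(gaps_minutes: list[float], retention: str) -> float:
    """Total USD for a first request plus one revisit per gap.

    `retention` is "none", "5m", or "1h".
    """
    if retention == "none":
        # Every request pays full input price for the whole prefix.
        return round(PREFIX_MTOK * INPUT * (1 + len(gaps_minutes)), 2)

    write = PREFIX_MTOK * WRITE[retention]
    read = PREFIX_MTOK * READ
    cost = write  # the first request always writes the cache
    for gap in gaps_minutes:
        # Inside the TTL the revisit is a read, otherwise a fresh write.
        cost += read if gap <= TTL_MINUTES[retention] else write
    return round(cost, 2)


for gaps in ([3], [10], [10, 20]):
    row = {r: total_cost(gaps, r) for r in ("none", "5m", "1h")}
    print(gaps, row)
```

With gaps of `[3]` and `[10]` this gives the same totals as the two scenarios above, and `[10, 20]` shows the 1h cache pulling ahead once there is a second revisit.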