anthropic caching | Dagger | Page 1

livid wedge Feb 27, 2025, 4:47 PM

well that's annoying...

400 Bad Request {"type":"error","error":{"type":"invalid_request_error","message":"A maximum of 4 blocks with cache_control may be provided. Found 7."}}

ah, i guess it works more like a breakpoint
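
A minimal sketch of the limit behind that error, assuming the request body is a plain dict shaped like the Messages API JSON. `count_cache_breakpoints` and `check_cache_breakpoints` are hypothetical helper names, not part of any SDK, and the sketch ignores blocks nested inside other blocks:

```python
MAX_CACHE_BREAKPOINTS = 4  # limit quoted in the 400 error above


def count_cache_breakpoints(body: dict) -> int:
    """Count top-level blocks that carry a cache_control marker."""
    blocks = []
    # "system" may be a plain string (nothing to mark) or a list of blocks
    system = body.get("system")
    if isinstance(system, list):
        blocks.extend(system)
    # tool definitions can carry a marker as well
    blocks.extend(body.get("tools", []))
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            blocks.extend(content)
    return sum(
        1 for block in blocks
        if isinstance(block, dict) and "cache_control" in block
    )


def check_cache_breakpoints(body: dict) -> None:
    """Fail locally instead of waiting for the API to reject the request."""
    found = count_cache_breakpoints(body)
    if found > MAX_CACHE_BREAKPOINTS:
        raise ValueError(
            f"A maximum of {MAX_CACHE_BREAKPOINTS} blocks with cache_control "
            f"may be provided. Found {found}."
        )
```

Running a check like this before sending would surface the problem without a round trip; the marker count is per request, not per message.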

livid wedge Feb 27, 2025, 5:07 PM

the API for this is really clunky - even though it acts like a checkpoint for all prior messages/blocks, you have to attach it to a specific block, and blocks come in many different types, so you need to 1) figure out where the checkpoint should be, and then 2) modify whatever type of block happens to be last before that checkpoint
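
A rough sketch of those two steps, assuming messages are plain dicts in the Messages API JSON shape. `mark_checkpoint` is a hypothetical helper, not an SDK function. With untyped dicts the marker is just an extra key; in a typed SDK each block type would likely need its own handling, which is the clunky part:

```python
import copy

EPHEMERAL = {"type": "ephemeral"}


def mark_checkpoint(messages: list, index: int) -> list:
    """Return a copy of messages with a cache checkpoint at messages[index]."""
    marked = copy.deepcopy(messages)  # leave the caller's history untouched
    message = marked[index]
    content = message["content"]
    # A bare string has no block to hang the marker on, so wrap it first.
    if isinstance(content, str):
        content = [{"type": "text", "text": content}]
        message["content"] = content
    if not content:
        raise ValueError("message has no content blocks to mark")
    # Whatever the last block is (text, tool_use, tool_result, ...),
    # the marker is attached to it and covers everything before it.
    content[-1]["cache_control"] = dict(EPHEMERAL)
    return marked
```

For example, `mark_checkpoint(history, len(history) - 1)` would place the checkpoint at the end of the conversation so far.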

unborn trout Feb 27, 2025, 5:44 PM

Can we hide it from our devs?

If so it's a no-brainer. Otherwise.. tricky, need to figure out how to keep it portable

livid wedge Feb 27, 2025, 5:45 PM

can confirm prompt caching makes a HUGE difference - there's no way this would succeed without it :party_blob:
https://asciinema.org/a/G0wZ4Rb3HpZBWpcUw605NjOem

asciinema: "anthropic token caching", recorded by vito

unborn trout Feb 27, 2025, 5:45 PM

I don't understand the difference, looking at their examples

Oh I see it: "cache_control": {"type": "ephemeral"}

So you still re-send everything on the wire, the usual way. You just mark up some messages so they process it differently on their end

livid wedge Feb 27, 2025, 5:46 PM

> unborn trout: Can we hide it from our devs?

I implemented a heuristic that seems good enough:

1) Keep track of token usage in the API response, record it in the LLM history
2) For each history item that uses more than 2048 tokens, enable caching
3) Once we enable caching 4 times, just stop

The last step is a bit of a shame since we won't be able to cache past that point, but I don't think we can go back and remove blocks without forcing it to use a bunch of tokens on the next request
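
The heuristic above could be sketched roughly like this. `HistoryItem` and `apply_cache_heuristic` are hypothetical names and this is not the actual implementation; it only assumes each history item records the token usage reported by the API response:

```python
from dataclasses import dataclass

TOKEN_THRESHOLD = 2048  # "more than 2048 tokens" from the heuristic
MAX_BREAKPOINTS = 4     # limit from the 400 error


@dataclass
class HistoryItem:
    text: str
    tokens_used: int = 0  # recorded from the API response's usage
    cache: bool = False   # True once caching is enabled for this item


def apply_cache_heuristic(history: list) -> int:
    """Enable caching on large items, oldest first.

    Returns the number of breakpoints in use afterwards.
    """
    used = sum(1 for item in history if item.cache)
    for item in history:
        if used >= MAX_BREAKPOINTS:
            break  # budget spent: later items stay uncached
        if not item.cache and item.tokens_used > TOKEN_THRESHOLD:
            item.cache = True
            used += 1
    return used
```

Items already marked are never unmarked, which matches the "just stop" behaviour: once four are in use, anything later in the history is sent uncached.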

livid wedge Feb 27, 2025, 5:46 PM

> unborn trout: So you still re-send everything on the wire, the usual way. You just markup some...

yeah exactly

I wonder if Claude Desktop uses this internally

unborn trout Feb 27, 2025, 5:47 PM

I guess the safe/explicit way is to add an LLM.withPrompt(cache: Bool) optional arg
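
One way the explicit opt-in might look, as a sketch only. The `LLM` and `Prompt` classes and the `to_blocks` method here are made up for illustration and are not Dagger's real API; the idea is that the flag stays provider-neutral and only the translation step knows about `cache_control`:

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prompt:
    text: str
    cache: bool = False


@dataclass(frozen=True)
class LLM:
    prompts: tuple = ()

    def with_prompt(self, text: str, cache: bool = False) -> "LLM":
        """Return a new LLM with the prompt appended; caching is opt-in."""
        return replace(self, prompts=self.prompts + (Prompt(text, cache),))

    def to_blocks(self) -> list:
        """Translate prompts into Anthropic-style text blocks."""
        blocks = []
        for prompt in self.prompts:
            block = {"type": "text", "text": prompt.text}
            if prompt.cache:
                block["cache_control"] = {"type": "ephemeral"}
            blocks.append(block)
        return blocks
```

Usage would be something like `LLM().with_prompt(big_context, cache=True).with_prompt(question)`. A provider without prompt caching could simply ignore the flag, which is one way to keep it portable.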

livid wedge Feb 27, 2025, 5:48 PM

yeah, but the extra wrinkle is you'll also want caching of responses since those can be large, and that's harder to predict
