#anthropic caching

1 messages · Page 1 of 1 (latest)

livid wedge
#

well that's annoying...

400 Bad Request {"type":"error","error":{"type":"invalid_request_error","message":"A maximum of 4 blocks with cache_control may be provided. Found 7."}}
#

ah, i guess it works more like a breakpoint

livid wedge
#

the API for this is really clunky - despite it being more like a checkpoint for all prior messages/blocks, you have to attach it to a specific block, which are many different types, so you need to 1) figure out where the checkpoint should be, and then 2) modify whatever type the last block near that checkpoint is

unborn trout
#

Can we hide it from our devs?

#

If so it's a no-brainer. Otherwise.. tricky, need to figure out how to keep it portable

livid wedge
unborn trout
#

I don't understand the difference, looking at their examples

#

Oh I see it: "cache_control": {"type": "ephemeral"}

#

So you still re-send everything on the wire, the usual way. You just markup some messages so they process it differently on their end

livid wedge
# unborn trout Can we hide it from our devs?

I implemented a heuristic that seems good enough:

  • Keep track of token usage in the API response, record it in the LLM history
  • For each history item that uses more than 2048 tokens, enable caching
  • Once we enable caching 4 times, just stop

The last step is a bit of a shame since we won't be able to cache past that point, but I don't think we can go back and remove blocks without forcing it to use a bunch of tokens on the next request

livid wedge
#

I wonder if Claude Desktop uses this internally

unborn trout
#

I guess the safe/explicit way is to add a LLM.withPrompt(cache: Bool) optional arg

livid wedge
#

yeah, but the extra wrinkle is you'll also want caching of responses since those can be large, and that's harder to predict