#Being charged each time a new prompt is added to a context


frigid linden
#

Hey there, I am using a different website that implements OpenRouter for their Gemini 1.5 model and charges on a pay-as-you-go basis. After some back and forth, we have noticed that OpenRouter doesn't cache previous prompts (or at least this person's implementation doesn't), which means that for each new prompt I make, I am charged for the previous context as well, on top of the new one (i.e. if the current conversation is 20k tokens and the new input/output is 1k, I would be charged for 21k instead of the 1k I would expect).

Is there something in the documentation I am missing where you can cache previous answers? Otherwise, it would seem ridiculous to be charged for the past context for each new prompt.

#

So basically, if I'm trying to roleplay a longer RP, by the time I reach a 1M context, I would have cumulatively paid for 1M + 999k + 998k + ... tokens, instead of just the current 1k input/output each time
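A rough sketch of that arithmetic, for illustration only. It assumes every turn adds exactly 1k tokens and that the whole history is resent on each request; the function name is hypothetical and real billing also distinguishes input from output tokens.

```python
def total_billed_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total tokens processed when the full history is resent on every turn.

    On turn k the request contains k * tokens_per_turn tokens, so the total
    is the sum of that quantity over all turns.
    """
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))


# Assumption: 1,000 turns of 1k tokens each, ending at a 1M-token context.
total = total_billed_tokens(turns=1000, tokens_per_turn=1000)
print(f"{total:,} tokens billed in total")
```

Under these assumptions the total grows roughly with the square of the number of turns, which is why long conversations get expensive quickly.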

tender nymph
#

This is how LLMs work today unfortunately. They don’t cache the compute needed to do the next inference pass. They’re like humans with zero memory: you have to replay the whole convo for them each time

#

That said, there’s some work being done to enable prompt caching. When that comes out, we’ll of course pass those cost savings directly to you

#

Also, apps sometimes allow you to submit a summary of your previous messages instead of all of them
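A minimal sketch of that summary approach, under stated assumptions: the summary text is supplied by the app or the user (nothing here generates it), and `compact_history` and `keep_last` are hypothetical names, not part of any real API.

```python
def compact_history(messages: list[str], summary: str, keep_last: int = 4) -> list[str]:
    """Replace all but the most recent messages with a single summary line."""
    if len(messages) <= keep_last:
        # Short conversations are sent unchanged.
        return list(messages)
    recent = messages[len(messages) - keep_last:]
    return [f"Summary of earlier conversation: {summary}"] + recent


chat = [f"message {i}" for i in range(1, 51)]  # 50 placeholder messages
compacted = compact_history(chat, summary="(summary written by the app or the user)")
print(len(chat), "->", len(compacted))
```

The request then carries the summary plus the last few messages instead of the full history, so fewer tokens are billed per prompt, at the cost of losing detail from the older messages.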

frigid linden
#

Huh, is it the same when using APIs directly? Because I swear I have not heard of this in any other software dev practice ever

tender nymph
#

Yep

frigid linden
#

But that would make sense as to why I can't see anything in your docs

#

Man, that's some BS lol

#

But thank you Alex! It makes sense why I'm being charged 0.15 per Gemini 1.5 prompt with 20k tokens of context in

#

They should definitely look at caching or memoizing first before they bring in the bigger context sizes

#

Otherwise 7 USD/1M tokens would go by pretty fast
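As a sanity check on those numbers, a small sketch. It assumes a flat rate of 7 USD per 1M tokens as quoted in this thread (not an official price) and a 21k-token request (20k of context plus 1k new); the names are hypothetical.

```python
PRICE_PER_MILLION_USD = 7.0  # figure quoted in the thread, not an official price


def prompt_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Cost of one request at a flat per-token rate."""
    return tokens * price_per_million / 1_000_000


# 20k tokens of existing context plus 1k of new input/output.
cost = prompt_cost_usd(21_000)
print(f"about {cost:.2f} USD per prompt")
```

That lands close to the 0.15 per prompt mentioned above, which is consistent with the whole context being billed on every request.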

teal robin
frigid linden
#

Welp, they definitely have a DB of chats, so they could just feed that into the model at each step of a convo

#

instead of, you know, manually feeding it from the current context

#

and being charged for it

teal robin
#

I didn't get your point 😅

frigid linden
#

You have 50 messages in a chat

#

From what you're describing, each message is being sent each time you make a new prompt

#

So you're being charged for 51 messages

#

Save messages in a DB, so when you send a message it would be the 1 message you sent + the 50 others that are in the DB

#

at compute time, instead of at request time

#

so that way it won't recalculate the whole context from your chat, it will have the context saved in a DB or smth, aka caching xd

teal robin
#

I think you don't know exactly how LLMs work. To oversimplify, an LLM is an advanced autocomplete.

Your chat request will actually be converted into text which, once again oversimplified, looks like this:

```
User: Hello!
Assistant: How can I help you today?
User: Rephrase this text: <...>
Assistant:
```

and the "autocomplete" starts working after the final "Assistant:" and completes the assistant's message.
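A sketch of that flattening step, equally oversimplified. The `render_prompt` function and the `role`/`content` keys are illustrative assumptions; real chat templates differ per model.

```python
def render_prompt(messages: list[dict[str, str]]) -> str:
    """Flatten a chat history into one text prompt ending with an open assistant turn."""
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("Assistant:")  # the model continues writing from here
    return "\n".join(lines)


history = [
    {"role": "User", "content": "Hello!"},
    {"role": "Assistant", "content": "How can I help you today?"},
    {"role": "User", "content": "Rephrase this text: <...>"},
]
prompt = render_prompt(history)
print(prompt)
```

Because the model only ever sees this single block of text, every earlier message has to be included in it again on each request, and all of it counts as input tokens.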
tender nymph
#

Part of the compute requirement (the expensive part) is also holding the prompt's cache in GPU memory between requests

frigid linden
#

Man, my dissertation was on NLP, my conclusion was that fuck knows, but yeah, I can't say I know how LLMs actually work

tender nymph
#

That’s expensive for whoever is running the GPU

#

It could be stored somewhere else and maybe loaded quickly for the next request, which is one way to do caching

#

But that has to be done at the GPU level

frigid linden
#

Memoize it into RAM, save it in the DB, but again, I'm just talking BS

#

My NLP dissertation took 200 GB for a 2 GB file lmao

#

That's why they're researching AI GPUs nowadays anyway, I am a bit out of date

#

But yeah, this thread is going a bit outside of my question

#

But thank you both for your answers 🙂

teal robin
teal robin
frigid linden
#

Pulling up my dissertation, don't remember much. It was basically automatically categorising research papers by what scientific instruments they might need for their research

#

(previously it was done manually)

#

Used Word2Vec

frigid linden
#

Basically yes

#

Managed to get 85% accuracy before all the AI craze, so pretty proud of that

teal robin
#

Good result!

tender nymph