Being charged each time a new prompt is added to a context | OpenRouter | Page 1

frigid linden Apr 14, 2024, 12:13 AM

#

Hey there, I am using a different website that implements openrouter for their Gemini 1.5 model and charges on a pay-as-you-go model. After some back and forth, we have noticed that openrouter doesn't cache previous prompts (or at least this person's implementation doesn't), so that means, for each new prompt I would make, I would be charged for the previous context as well as well, on top of the current new one (i.e. if the current conversation is 20k, and new input/output is 1k, i would be charged for 21k, instead of the 1k I would expect).

Is there something in the documentation I am missing where you can cache previous answers? Otherwise, it would seem ridiculous to be charged for the past context for each new prompt.

#

So basically, if I'm trying to roleplay a longer rp, by the time I reach a 1M context, I would have recursively paid for 1m+999k+998k+... tokens, instead of just the current 1k input/output

tender nymph Apr 14, 2024, 12:20 AM

#

This is how LLMs work today unfortunately. They don’t cache the compute needed to do the next inference pass. They’re like humans with zero memory: you have to replay the whole convo for them each time

#

That said, there’s some work being done to enable prompt caching. When that comes out, we’ll of course pass those cost savings directly to you

#

Also, apps sometimes allow you to submit a summary of your previous messages instead of all of them

frigid linden Apr 14, 2024, 12:24 AM

#

Huh, is it the same for using apis dirrectly? Because I swearI have not heard of this in any other soft dev practices ever

tender nymph Apr 14, 2024, 12:24 AM

#

Yep

frigid linden Apr 14, 2024, 12:24 AM

#

But that would make sense as to why I can't see anything in your docs

#

Man, that's some BS lol

#

But thank you alex! It makes sense why im being charged 0.15/Gemini 1.5 prompt for 20k tokens context in

#

They should definitely look at caching or memoizing first before they bring in the bigger context sizes

#

Otherwise 7USD/1Mtokens would go by pretty fast

teal robin Apr 14, 2024, 12:32 AM

#

frigid linden They should definitely look at caching or memoizing first before they bring in t...

That would be super nice, unfortunately though I don't see this coming in the near future due to how LLMs work.

frigid linden Apr 14, 2024, 12:34 AM

#

Welp, they definitively have a DB of chats, so they could just feed that into the model at each step of a convo

#

instead, you know, manually feeding it from the current context

#

and being charged for it

teal robin Apr 14, 2024, 12:36 AM

#

I didn't get your point 😅

frigid linden Apr 14, 2024, 12:36 AM

#

You have 50 messages in a chat

#

From what you're describing, each message is being sent each time you make a new prompt

#

So youre being charged for 51 messages

#

Save messages in a DB so when you sent a message it would be the 1 message you sent + 50 other whatever is in DB

#

at compute time, instead of at request time

#

so that way it wont recalculate the whole context from your chat, it will have the context saved in a db or smth, aka caching xd

teal robin Apr 14, 2024, 12:43 AM

#

I think you don't know how exactly LLMs work. To oversimplify it, it is an advanced autocomplete.

Your request with a chat will actually be converted into a text which looks, once again, oversimplified like this:

User: Hello!
Assistant: How can I help you today?
User: Rephrase this text: <...>
Assistant:```

and the "autocomplete" starts to work after the words "Assistant:" and will complete assistant message.

tender nymph Apr 14, 2024, 12:44 AM

#

The compute requirements (the expensive parts) are also holding the cache internal to the GPU for the prompt between requests

frigid linden Apr 14, 2024, 12:45 AM

#

Man, my dissertation was on NLPs, my conclusion was that fuck knows, but yea, I cant say I know how LLMs actually work

tender nymph Apr 14, 2024, 12:45 AM

#

That’s expensive for whoever is running the gpu

#

It could be stored somewhere else and maybe loaded quickly for the next request, which is one way to do caching

#

But that has to be done at the gpu level

frigid linden Apr 14, 2024, 12:46 AM

#

Memoize it into the ram, save it in the DB, but again, just talking BS

#

My NLP disseration took 200GB for a 2 GB file lmao

#

Thats why they're researching into AI GPUs nowadays anyway, I am a bit out of date

#

But yea, this thread is going a bit outside of my question

#

But thank you both for your answers 🙂

teal robin Apr 14, 2024, 12:52 AM

#

tender nymph The compute requirements (the expensive parts) are also holding the cache intern...

I'm a little confused. LLMs need to have all previous tokens to predict the next one. How this can cut the costs if all conversation tokens will be considered by LLM to predict the next one?

teal robin Apr 14, 2024, 12:53 AM

#

frigid linden My NLP disseration took 200GB for a 2 GB file lmao

Can I know, what was it?

frigid linden Apr 14, 2024, 12:55 AM

#

Pulling up my dissertation, dont remember much. It was basically automatically categorising research papers into what scientific instruments they might need for their research

#

(previously it was done manually)

#

Used Word2Vec

teal robin Apr 14, 2024, 12:56 AM

#

frigid linden Pulling up my dissertation, dont remember much. It was basically automatically c...

So a classification problem?

frigid linden Apr 14, 2024, 12:57 AM

#

Basically yes

#

Managed to get 85% accuracy before all the ai craze, so pretty proud of that

teal robin Apr 14, 2024, 12:59 AM

#

Good result!

tender nymph Apr 14, 2024, 1:15 AM

#

teal robin I'm a little confused. LLMs need to have all previous tokens to predict the next...

Basically if the kv cache and other similar memory can be cached somewhere, it should be cheaper to load from cache than read in all the prompt tokens again

#Being charged each time a new prompt is added to a context