I'm seeing a lot of situations, across a variety of models where a single request (e.g. a single response from Silly Tavern) is leading to two rows on the billing (if you look at the timing you will see these are typically 1 sec after the original), with those second "responses" typically being only 1 token (occasionally 2 or 0).
Because these appear to be causing the input tokens to be counted again they are essentially near enough doubling the cost.
Any ideas what would be causing this and how to prevent it?
#Question about model billing patterns
33 messages · Page 1 of 1 (latest)
Any idea what would be causing that sort of behaviour? The model response is coming through ok, so this second one seems to follow the successful reply.
Can you try hitting with a similar request on the playground: https://openrouter.ai/playground
Not entirely sure how I would do that to be honest? (you've seen the ST logs for how it is assembling the various bits across a character and LB)
Oh what I mean by that is try to see if yuo can replicate the issue on the playground
(with the same model)
It seems to work for me i.e, no duplicate response)
So might be a feature within ST :-?... @barren nymph - is there some kind of retry feature in ST?
ok - just run a quick prompt on playground against that model and there is no issue at all with that.
Yeah so not related to your networking. If the router does any retry, it'll be on the same transaction (you don't get charged extra if we retry with another provider)
gotcha, thanks.
Strange thing is these 0-2 token ones seem to follow immediately on the heels of a good response (and are happening while I am still reading what it has sent back - i.e. within 1-2 seconds of drafting its response)
Auto-retry exists if your model does not produce anything but an EOS token or otherwise empty reply and you don't have streaming enabled
I have streaming enabled for these, and the empty/single token replies are coming within 1-2 seconds of the successfully generated reply
I have objectives and summarise switched off so AFAIK there should not be anything making a separate API call in tandem
@shy trench just a quick follow-up on this to note that although the example I posted above is 1:1 good:problematic replies, other LLMs are exhibiting a different frequency of the issue. You can see that I had a patch of solid responses, but another patch of alternating problem ones. No change at all in ST settings in that window - this was all a continuation of a single roleplay with the same character.
Hi Cohee, is there any way I can turn off the auto-retry to confirm that is having no impact here? I do have streaming enabled, and these blank hits always follow immediately on solid responses (without any further input from me). I cannot think of anything else in ST that I am aware of that could be driving this "double-tap" but because it is essentially doubling all billing using OR I am keen to resolve it. Hope you don't mind the ping on this, but wanted to double-check before I resort to an uninstall/reinstall of ST to see if that fixes.
Just disable everything that sends background requests
Summary is the likely offender.
AFAIK anything non-core is disabled, incl. Summary and Objectives (I use almost nothing bar these two)
One last question - if ST were in any way responsible for this, would it show in the console?
You will see all the requests there
Thank you, very much appreciated.
@shy trench I have tripple checked that anything in ST that could cause that is turned off. Will rerun, confirm I still see the issues and send you the log if case you can spot anything that you think could cause this
@barren nymph OR received 2 requests with the following diff:
https://www.diffchecker.com/FZ7W1tIO/
Looks like the 2nd request included more than just the completion of the first request. The diff shows there're injected contexts in the middle :-?... Does it look familiar to you?
(Both are prompts btw)
Is it possible that ST is incrementally sending prompt + completion to extract as much generation out of as single run as possible, and that the 2nd one was stopped because the model deemed it's the end of the convo?
It looks as though something is causing the "Continue" command to autofire, but I cannot for the life of me work out why that would auto-trigger in parallel to the first request being sent.
Hm, I had a similar problem happen but not with auto-continue option, but with "Request Token Probabilities" option, it fired off regardless of it's being enabled or disabled, and added logprobs: 10 to req, causing a 400 error on Aphro backend
it was on ST 1.11.8
Also on ST 1.11.8
In the case of my logprobs issue it got fixed on it's own when I switched to staging branch
ok, thanks, good to know
Theres “auto continue” function in advanced settings. But you have to explicitly enable it for chat comps
Strange...this was disabled, however enabling, then disabling and re-starting appears to have fixed the issue. Need to test a little more extensively, but neither of the first couple of tests generated the problem whereas it was occurring every message with this LLM (but not all LLMs). Wondering whether the inconsistency of the occurrence could be driven by differing frequencies in the cut-off from different models.
Thank you all again for the pointers, patience and troubleshooting here - very much appreciated.