#Question about model billing patterns


unborn egret
#

I'm seeing a lot of cases, across a variety of models, where a single request (e.g. a single response from Silly Tavern) leads to two rows on the billing. If you look at the timing, the second row is typically 1 sec after the original, and that second "response" is typically only 1 token (occasionally 2 or 0).
Because these appear to cause the input tokens to be counted again, they are very nearly doubling the cost.
Any ideas what would be causing this and how to prevent it?
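
For anyone wanting to quantify the pattern described above, here is a minimal sketch that flags near-empty rows arriving within a couple of seconds of a previous row for the same model. It assumes the billing rows have already been loaded into dictionaries; the field names (`created_at`, `model`, `prompt_tokens`, `completion_tokens`), the model name, and the sample values are all hypothetical and not taken from any real export format.

```python
from datetime import datetime, timedelta


def find_double_taps(rows, max_gap_seconds=2, max_completion_tokens=2):
    """Return (original, follow_up) pairs where a near-empty row
    closely follows another row for the same model."""
    ordered = sorted(rows, key=lambda r: r["created_at"])
    pairs = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = (cur["created_at"] - prev["created_at"]).total_seconds()
        if (
            prev["model"] == cur["model"]
            and 0 <= gap <= max_gap_seconds
            and cur["completion_tokens"] <= max_completion_tokens
            and cur["prompt_tokens"] >= prev["prompt_tokens"]
        ):
            pairs.append((prev, cur))
    return pairs


# Made-up sample data, for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    {"created_at": t0, "model": "example/model",
     "prompt_tokens": 3000, "completion_tokens": 250},
    {"created_at": t0 + timedelta(seconds=1), "model": "example/model",
     "prompt_tokens": 3250, "completion_tokens": 1},
    {"created_at": t0 + timedelta(minutes=2), "model": "example/model",
     "prompt_tokens": 3300, "completion_tokens": 240},
]

pairs = find_double_taps(rows)
# Input tokens billed a second time by the follow-up rows.
wasted_prompt_tokens = sum(follow_up["prompt_tokens"] for _, follow_up in pairs)
print(len(pairs), wasted_prompt_tokens)
```

The thresholds (2 seconds, 2 completion tokens) simply mirror the numbers reported in this thread and may need adjusting for other setups.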

shy trench
#

Likely ST having a retrying pattern?

#

Looks pretty consistent

unborn egret
shy trench
unborn egret
shy trench
#

Oh, what I mean by that is: try to see if you can replicate the issue on the playground

#

(with the same model)

#

It seems to work for me (i.e., no duplicate response)

#

So might be a feature within ST :-?... @barren nymph - is there some kind of retry feature in ST?

unborn egret
shy trench
#

Yeah so not related to your networking. If the router does any retry, it'll be on the same transaction (you don't get charged extra if we retry with another provider)

unborn egret
barren nymph
#

Auto-retry exists if your model does not produce anything but an EOS token or otherwise empty reply and you don't have streaming enabled

unborn egret
#

I have streaming enabled for these, and the empty/single token replies are coming within 1-2 seconds of the successfully generated reply

#

I have objectives and summarise switched off so AFAIK there should not be anything making a separate API call in tandem

unborn egret
#

@shy trench just a quick follow-up on this to note that although the example I posted above is 1:1 good:problematic replies, other LLMs are exhibiting a different frequency of the issue. You can see that I had a patch of solid responses, but another patch of alternating problem ones. No change at all in ST settings in that window - this was all a continuation of a single roleplay with the same character.

unborn egret
# barren nymph Auto-retry exists if your model does not produce anything but an EOS token or ot...

Hi Cohee, is there any way I can turn off the auto-retry to confirm that it is having no impact here? I do have streaming enabled, and these blank hits always follow immediately after solid responses (without any further input from me). I cannot think of anything else in ST that could be driving this "double-tap", but because it is essentially doubling all billing through OR I am keen to resolve it. Hope you don't mind the ping on this, but I wanted to double-check before I resort to an uninstall/reinstall of ST to see if that fixes it.

barren nymph
#

Just disable everything that sends background requests

#

Summary is the likely offender.

unborn egret
# barren nymph Summary is the likely offender.

AFAIK anything non-core is disabled, incl. Summary and Objectives (I use almost nothing bar these two)
One last question - if ST were in any way responsible for this, would it show in the console?

barren nymph
#

You will see all the requests there

unborn egret
#

Thank you, very much appreciated.
@shy trench I have triple-checked that anything in ST that could cause that is turned off. I will rerun, confirm I still see the issue, and send you the log in case you can spot anything that you think could cause this

shy trench
#

@barren nymph OR received 2 requests with the following diff:

https://www.diffchecker.com/FZ7W1tIO/

Looks like the 2nd request included more than just the completion of the first request. The diff shows there is injected context in the middle :-?... Does it look familiar to you?

#

(Both are prompts btw)

#

Is it possible that ST is incrementally sending prompt + completion to extract as much generation out of a single run as possible, and that the 2nd one was stopped because the model deemed it the end of the convo?
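
The same kind of comparison can be done locally without an external diff tool. This is a minimal sketch using Python's standard `difflib`: it lists the lines that appear only in the second prompt, which should expose both the appended completion and anything injected mid-prompt. The two prompts below are made-up placeholders, not the actual requests from this thread, and `added_lines` is a hypothetical helper name.

```python
import difflib


def added_lines(first_prompt, second_prompt):
    """Return the lines present in second_prompt but not in first_prompt,
    in the order they appear in second_prompt."""
    a = first_prompt.splitlines()
    b = second_prompt.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute new lines on the b side.
        if tag in ("insert", "replace"):
            added.extend(b[j1:j2])
    return added


# Made-up prompts, for illustration only.
first = "System: You are Alice.\nUser: Hello\n"
second = (
    "System: You are Alice.\n"
    "[Injected note]\n"
    "User: Hello\n"
    "Alice: Hi there, how are"
)

added = added_lines(first, second)
for line in added:
    print("+", line)
```

If the only added lines are the first reply's text at the end, the second request is a plain continuation; lines added in the middle point to something injecting extra context.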

unborn egret
#

It looks as though something is causing the "Continue" command to autofire, but I cannot for the life of me work out why that would auto-trigger in parallel to the first request being sent.

hidden meadow
hidden meadow
unborn egret
#

ok, thanks, good to know

barren nymph
unborn egret
# barren nymph Theres “auto continue” function in advanced settings. But you have to explicitly...

Strange...this was disabled, however enabling, then disabling and re-starting appears to have fixed the issue. Need to test a little more extensively, but neither of the first couple of tests generated the problem whereas it was occurring every message with this LLM (but not all LLMs). Wondering whether the inconsistency of the occurrence could be driven by differing frequencies in the cut-off from different models.

Thank you all again for the pointers, patience and troubleshooting here - very much appreciated.