#Limited outputs and cut offs
Does venus have an option to increase max_tokens? Likely in the advanced params
I did increase it but it doesn't work😭😭
How much did you increase it to BTW?
I tried 500, and all the way up to 1000
Tried again, and got 500-ish tokens
Yeah I think something's wrong with the chub + openrouter integration - everyone's seeing it
Hmm okay I’ll tell the chub discord about it
they said it seems like a problem on openrouter's side- any updates on this error?
Can you try it again
Just pushed a hotfix that might have fixed this
Awesome!
buuut the cutoffs aren't fixed yet, but it's fine!
@copper knot for the cutoff - it's weird. Can you ask what venus is passing down for the max_tokens?
wdym?
Do they have a slider somewhere that let you configure the max amount of token generated?
yup
By default, if they pass nothing down, we default to 256
But I'm not sure what chub's actually passing down
default is 0, which means unlimited
i think thats what you mean? right?
0 will be normalized to 1 token (if that's what they are sending) btw - it has to be "undefined" or set to a valid number
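The rule described here can be sketched as a small function. This is only an illustration of the behaviour stated in the thread, not OpenRouter's actual code: the name `normalize_max_tokens` and the constant are hypothetical, and clamping negative values is an assumption (the thread only mentions 0).

```python
from typing import Optional

DEFAULT_MAX_TOKENS = 256  # default stated in the thread when nothing is passed


def normalize_max_tokens(value: Optional[int]) -> int:
    """Hypothetical sketch of the max_tokens normalization described in the thread."""
    if value is None:
        # "undefined" in the request: fall back to the default
        return DEFAULT_MAX_TOKENS
    # 0 becomes 1 token; treating negatives the same way is an assumption
    return max(1, int(value))
```

So a client whose slider uses 0 to mean "unlimited" would need to omit the field entirely rather than send 0.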
right now, Im moving the slider up and down but nothing seems to work so...idk what the error issss😔
Can you try debugging with the network tab? I.e., press F12 and look at the network tab
can you send me a screenshot of what Chub's sending OpenRouter
does this work the same on mac?
im using mac rn
Yeah f12 should open up inspector for the venus web app
then, navigate to the network tab
then try making a request
you should see a new item in the network - one of them should be openrouter's completion endpoint
check what chub's sending us for max_tokens
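If a screenshot is awkward, the request body copied from the network tab can be checked with a few lines of Python. This is a rough sketch: the helper name `describe_max_tokens` is made up for illustration, and it only assumes the body is JSON with an optional `max_tokens` field.

```python
import json


def describe_max_tokens(raw_body: str) -> str:
    """Summarize the max_tokens field of a captured request body (JSON text)."""
    body = json.loads(raw_body)
    if "max_tokens" not in body or body["max_tokens"] is None:
        return "max_tokens not sent: the server-side default applies"
    value = body["max_tokens"]
    # bool is a subclass of int in Python, so rule it out explicitly
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return f"max_tokens is not a number: {value!r}"
    if value < 1:
        return f"max_tokens={value}: too low, the reply would be cut to almost nothing"
    return f"max_tokens={value}: looks valid"
```

Paste the copied body in as a string, e.g. `describe_max_tokens('{"max_tokens": 0}')`.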
that looks right to me lol
how cut-off is it btw? Like it doesn't finish all the way?
Try asking it to give you a long essay and see if it's actually "cut-off" or if your prompt can only generate that much.
like it stops mid sentence after about two paragraphs
consistently
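One way to tell a real cutoff from a model that simply stopped is to look at the finish reason in the raw response. A minimal sketch, assuming the response follows the OpenAI-style chat completion shape (`choices[0].finish_reason`, where `"length"` means the token limit was hit and `"stop"` means the model ended on its own). The thread does not show a response body, so that shape is an assumption.

```python
def was_cut_off_by_limit(response: dict) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    choices = response.get("choices") or []
    if not choices:
        return False
    # "length" = ran into max_tokens (or a provider cap); "stop" = natural end
    return choices[0].get("finish_reason") == "length"
```

A reply that stops mid-sentence with a finish reason of `"length"` points at a token cap rather than at the prompt.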
also, I checked the payload to messages, and noticed that every model only remembers about 20 messages
no matter how much context you have
my prompt isn't that long so..
That's chub's own configuration - it's what they send our API
Hi, which model are you using? Cuz I think that the same happens to me.
i think its the same for every model
at least for me
Wow I came on here with the exact same question
No matter the model I use, it always gets cut off early. Maybe after about 200 words. Same thing every model in playground
@eternal cedar on playground, you will need to set the max_tokens up to see it do more:
Because it sometimes automatically switches your model from the filtered to the unfiltered one. Mythomax in my case, and this model specifically cuts off any answers unless you play with the max tokens settings
Maybe you could check in your activity in open router?
Bump that to 1000 and you will find it's up
Oooh that's likely the case yeah
Even if you choose one model, it might sometimes switch to another
Try using Mythalion
Just kind of check if the model you chose in venus is the same that generated the answer in your activity tab in open router
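The same check can be done in code when calling the API directly. This sketch assumes the response body carries a top-level `model` field naming the model that actually produced the output; the thread itself only mentions the activity tab, so treat that field as an assumption. The helper name is hypothetical.

```python
def served_by_other_model(requested_model: str, response: dict) -> bool:
    """Return True if the response names a different model than the one requested."""
    served = response.get("model")
    # A missing field tells us nothing, so only report a mismatch when it is present
    return served is not None and served != requested_model
```

If this returns True for a request, the cutoff behaviour is probably that of the fallback model, not the one picked in the client.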
you mean the default model?
OHHH
bruh
its all mythomax
Oh yeah
wait how
That’s the case
because something in the prompt was marked as nsfw/flagged by your original model
It usually switches to Mythomax if you do nsfw or anything close to that
^ this will make it fall back to mythomax yeah
Then it’s the model’s issue unfortunately 😔
I usually increase the max tokens to 500-600 with Mythomax, and the cutoffs are usually rare
You can't go above 250 with Mythomax tho
Oh
But with Mythalion you should be able to I think
Weird, but I guess try playing with other models
Just tried this and it appears to work with GPT4 but thats about it
Phind code llama still outputting 200 wordsish
Code llama broke and is only outputting [INST] no matter what it is prompted
x.x
Also token count is wildly different in playground vs activity
The output is indeed a bit ghetto doh
yeah - on playground, it's normalized to GPT token, whereas on activity, it's counted by the token native to the model
The Llama2 family of models uses its own tokenizer, so on the activity page you will see llama2 tokens instead of GPT tokens
Llama2's vocab size is 32k vs 100k for GPT
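A toy example of why the same text gives different token counts under different vocabularies. This is not how either real tokenizer works (both use byte-pair-style merges, not the greedy longest-match used here), and both vocabularies are invented for the illustration.

```python
def greedy_token_count(text: str, vocab: set) -> int:
    """Count tokens using greedy longest-match; unknown characters count as one token each."""
    longest = max([len(piece) for piece in vocab] + [1])
    count = 0
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches
        for size in range(min(longest, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in vocab:
                i += size
                break
        count += 1
    return count


# Invented vocabularies: the larger one knows longer pieces of the same text
small_vocab = {"to", "ken", " "}
large_vocab = small_vocab | {"token", " count", "token count"}

text = "token count"
small = greedy_token_count(text, small_vocab)
large = greedy_token_count(text, large_vocab)
```

The smaller vocabulary has to spell the text out in more pieces, which is the same direction as the playground vs activity difference described here.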
Codellama is totally broken for me now and only outputs [INST], tried changing tokens back to 300, tried clearing browser cache, tried deleting and making new character
But it appears Phind is now able to display over 200 words. Lol
What's the last message you sent codellama before it outputs [INST]?
Woahhhh okay heres a hint
Seems to be something in the browser window, after I tabbed out and back into the playground it truncated the response I got from like 800 characters back down to 200
I changed the max token count
"Write a snake game using html and javascript. Only output entire code files"
Was the prompt, but that was unchanged when it broke. Had used the same prompt before
Think I figured out the problem, at least with Phind.... It appears to be a browser issue (using brave).
If I stay on the window through the entire generation it will generate the whole thing. If I don't, it will truncate down to about 200 words when I go back to the window (but still charge for the full API usage and show token count in activity).
Basically, it appears I have to keep the window active during generation or it will fail/truncate.
Once it is generated fully, it stays even if I go elsewhere
Pushing a fix to the inactive window issue - should be up in a bit
🚀 Sweet! Ty
should be up now, can you pls try it out
Yeah it doesnt appear to be cutting off anymore, it also seems to display text a lot quicker too
hey any way to fix this?
cuz I paid in order to use gpt 4
I tried to delete everything from my bot that is 'nsfw' but its not working..
The flagging is done by OpenAI so we have no control over it. Try removing jailbreaks and system prompt. (i.e, your character card must also not include nsfw stuff)
can I just ask how you played with the max token settings?
how do I get it to stop cutting off
I literally just adjust it every time when I get a cutoff - increase or decrease it. But maybe you could try the new model synthia? I heard it’s pretty good and unfiltered
I'm confused by the cut off... I tried Pygmalion and MythoMax L2 13B (beta), both got cut off at 250 tokens, even though I set max_tokens to 8192.
i have the same problem
cc @vast moss
can confirm this behavior even though max_tokens are set
This happens with most models, with or without max_tokens set. Makes most models unusable for my use case (chatbot). Would really appreciate if it's fixed.
Yes for me too
@vast moss any way to fix this? Or is it a limit of your models? I tried the same model in other platforms (hface, deepinfra) and the limitation isn't present
We’re working on a fix for this @velvet parcel - right now some open source models have an output limitation in order to support extremely long (eg 8k token) prompts.
We’re doing a refactor of some internal stuff cc @tawny acorn and then doing this next
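For context on why long prompts and long outputs compete: prompt and completion share one context window, so whatever the prompt uses is no longer available for the reply. The sketch below is just that arithmetic with made-up numbers; it is not how OpenRouter computes its output cap, and the function name is hypothetical.

```python
def completion_budget(context_length: int, prompt_tokens: int) -> int:
    """Tokens left for the reply once the prompt has used its share of the context window."""
    return max(0, context_length - prompt_tokens)
```

With a hypothetical 8192-token window, an 8000-token prompt leaves `completion_budget(8192, 8000)` = 192 tokens for the answer.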
Thanks. Right now the model that I want to use, which only outputs up to 250~280 tokens, is jondurbin/airoboros-l2-70b (many others do too), and the one that I also use but doesn't have that limitation is nousresearch/nous-hermes-llama2-13b.
Both show the same context length in the docs, but for some reason one is limited and the other isn't.
That's good to hear. Any ETA?
Likely a week or so. Just curious, is the deepinfra cost for airoboros higher? I can’t tell from their site
I'm just doing some tests with it right now. I'd prefer to only use openrouter. About the price, it shows $0.001 per thousand tokens on the model's page. (though it's not the same version of the model as the one in openrouter)
Will post an update here when we're closer to launching this
@here the cutoff for Mythomax should now be resolved. Airoboros is not fixed yet.
what about synthia?
Synthia is also not fixed yet
Synthia cuts off after like 30 tokens
it should have a 300 token cutoff right now
Hi, when will this cutoff issue be fixed? Or at least in the meantime, could you show the token cutoff in a column beside each model's info? Otherwise one has to manually try all the models to see which are limited and which aren't, which isn't a very reliable method.
Thanks in advance.
Deploying a change now to the /api/v1/models endpoint that will at least allow you to get the top_provider's max_completion_tokens, for each model
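A rough sketch of reading that field for every model. The host is taken from the docs link elsewhere in this thread and the path from the message above; the top-level `"data"` wrapper is an assumption about the response shape, and both helper names are hypothetical.

```python
import json
import urllib.request

MODELS_URL = "https://openrouter.ai/api/v1/models"


def completion_caps(models: list) -> dict:
    """Map each model id to its top provider's max_completion_tokens (None if absent)."""
    caps = {}
    for model in models:
        top_provider = model.get("top_provider") or {}
        caps[model["id"]] = top_provider.get("max_completion_tokens")
    return caps


def fetch_models(url: str = MODELS_URL) -> list:
    """Download the model list. Assumes the entries sit under a top-level "data" key."""
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return payload["data"] if isinstance(payload, dict) else payload
```

`completion_caps(fetch_models())` would then give one lookup table instead of trying each model by hand.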
Thanks. Is there an ETA for the limit increase?
Not at the moment, unfortunately
I was getting the same issue in the playground a few weeks back with Claude. So looking forward to seeing it fixed as otherwise it is not usable for me. BTW - is it only an issue in the playground and if I use python there will be no problems with cutoff in Claude?
What did you set max_tokens to for Claude in the playground? The default is I think 300 tokens.
Yeah, I increased it now and experimenting. Thanks!
Same in python. You must set a value for max_tokens :)
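A minimal sketch of what setting it explicitly looks like from Python, using only the standard library. The endpoint URL, the OpenAI-style body, and the example model id in the usage note are assumptions not stated in this thread, and the helper name is hypothetical.

```python
import json
import urllib.request

# Assumed endpoint; the thread only refers to "openrouter's completion endpoint"
API_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_completion_request(api_key: str, model: str, prompt: str,
                             max_tokens: int) -> urllib.request.Request:
    """Build a chat completion request with an explicit max_tokens value."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,  # set explicitly instead of relying on the default
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Something like `urllib.request.urlopen(build_completion_request(key, "anthropic/claude-2", "Write a long essay", 1000))` would send it; the model id there is only an example.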
Just checking in, cause i am having issues with GPT-4 cutoffs as well. I see that the GPT-4 model has very low prompt and completion token limits. is that what you're all referring to? I'm worried about it basically becoming unusable with such a low context size.
Here's what the models endpoint returns:
{
  "id": "openai/gpt-4-0314",
  "pricing": {
    "prompt": "0.00003",
    "completion": "0.00006",
    "discount": 0
  },
  "context_length": 8191,
  "top_provider": {
    "max_completion_tokens": 8191
  },
  "per_request_limits": {
    "prompt_tokens": "666",
    "completion_tokens": "333"
  }
},
{
  "id": "openai/gpt-4-32k",
  "pricing": {
    "prompt": "0.00006",
    "completion": "0.00012",
    "discount": 0
  },
  "context_length": 32767,
  "top_provider": {
    "max_completion_tokens": 32767
  },
  "per_request_limits": {
    "prompt_tokens": "333",
    "completion_tokens": "166"
  }
},
@plain raven this is a function of your account balance. if you increase it, your limits go up. see openrouter.ai/docs#limits. is that helpful?
(There has to be a limit here otherwise ppl go negative)
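To read an entry like the ones pasted above, it can help to compare the two limits side by side. The sketch below takes the smaller of the provider cap and the per-request limit; that the two combine as a minimum is an assumption, not something stated in this thread, and the helper name is hypothetical.

```python
def effective_completion_limit(model: dict) -> int:
    """Smaller of the provider's completion cap and the account's per-request completion limit."""
    provider_cap = int(model["top_provider"]["max_completion_tokens"])
    # per_request_limits values arrive as strings in the output shown in the thread
    account_cap = int(model["per_request_limits"]["completion_tokens"])
    return min(provider_cap, account_cap)
```

For the `openai/gpt-4-0314` entry above this gives 333, which is why the per-request numbers look so much lower than the context length.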
i have 20 USD in my acc. balance. wouldn't that be enough? (just checked: Credits: $27.385)
i can check with different api keys if that'd be helpful. maybe i got a "bad" one 😅
no, it shouldn't be. but let me check real quick
nope it's not. sk-or-v1-db7...ded is the one in my account giving me these results
ah found the issue, working on a fix
thanks for flagging
damn, it's been showing the free-tier limits to everyone (in /models)
in just that endpoint
okay, so does that only apply to the API? or generation too?
not when you actually make completion requests
gotcha!
then it uses the correct limits
This should be fixed now!
@velvet parcel fyi, we re-routed airoboros to DeepInfra, which improves the context length and uptime
@vast moss i'm noticing a lot of random cutoff in the AI responses. Happens with open hermes llama 13b which I've been using for a long time (which btw is very slow currently, was very fast), and also with open hermes mistral 7b.
The messages cut off at random lengths, could be 70 tokens or more than 200, but are mostly on the shorter side.
It has never happened before.
It's happening with the other model i'm using too (nousresearch/nous-hermes-llama2-70b). Strange.
Happens in the playground too
(that was open hermes mistral 7b with max tokens 4096)
Also I noticed that open hermes llama 13b started appending <s/> to responses. Both in my app and in the Playground.
the cutoffs make coding impossible. Even if I tell it to continue, it will write different code, and then get cut off again.
@vast moss can you confirm if you're looking into this?
@tawny acorn can you see if you can reproduce this? I couldn't on my end
Ah just repro'd
@velvet parcel Can you try Mistral now? Deploying a change that might fix it
Should also double the tokens per second!
Deep into a chat in SillyTavern, using XWin model, I get the following error message despite using the attached prompt settings. I am under the impression the LLM is set up to return 300 tokens no matter the setting I use.
Ah, just read some earlier messages that this might be intentional.