#FP4 models are unacceptable especially for coding models. There’s no way to filter this as a paying
57 messages · Page 1 of 1 (latest)
If there can be a option to sort providers on the basis of the quant models or block a particular quant (fp4) that would be helpful
(maybe in future updates)
in plugins? or how do we do that
https://openrouter.ai/docs/features/provider-routing#quantization
can't you use this?
That requires manually messing with headers or json request sent, I suppose
It's not only DeepInfra, I think Atlas did it, and some providers don't specify quants used, could be de-facto any size
i have never used this one in my programs
thanks for sharing
I never read entire docs 😢
this makes stuff easy
No, DeepSeek was trained in FP8 and don't have FP16, still some providers have blank in precision level
Chutes for examples, is very shy
Or maybe a way to specific the quant
Like how you can specific :floor or :nitro
You should be able to specify either like <model_name>:fp8
It would also be nice to be able to specify a specific provider in the slug
Like if I want baseten I can be like moonshot-ai/kimi-k2-instruct:baseten
That would make it so much easier to use and more powerful
Especially for apps that only let you apply a slug (like open WebUI)
I might build a proxy just so I can do this lol
I think realistically this would need to be paired with a policy that providers can't list open weight models on openrouter without disclosing the quantisation. Yes they could lie, but at least they would be actively choosing to lie publicly.
https://openrouter.ai/docs/features/presets are a thing if you can't modify your app but can specify a model slug.
Just checking, you mean slug as in like openai/gpt-4o-mini and not like 🐌 the animal
I use this now, but man, it really sucks setting up a preset for every new model that i'm interested in using
We really need a way to ignore/block the fp4 junk globally. They're a net negative. I don't want to use fp4 quantized models even if the providers paid me per token 😆
it's giving OR a bad reputation too because people say "don't use open Router for benchmarking because it's unreliable" they're referring to the junk quantized models
I wrote my proxy
To run it just use bun run or_proxy.js
It supports this syntax: <slug>:<option>
where option is:
["free", "beta", "floor", "nitro", "thinking"] : passed as-is (openrouter uses these)
["int4", "int8", "fp4", "fp6", "fp8", "fp16", "bf16", "fp32"]: forces that quantisation
anything else (baseten, deepinfra/fp8, any provider slug): force that provider
it should work for you, just add :fp8 to the end of your requests
to force fp8
To get the provider slug press the clipboard next to the providers name
hope it helps!
Yes, just put qwen/qwen3-coder:fp8 and it will force fp8 quantisation
Haven’t coded that in yet (if people want it I can), you can also just a specific provider in
too many sneaky ways of being scammed by providers on OR
Need someone patient and with money to run benchmarks on suspicious providers and compare to proven and official ones
Or just have OR to start labeling quantized models as quantized 😢
And a global way to block quantized models, not just per request
add deepinfra to that
fp4 doesn't automatically mean bad quality. I've tested DeepInfra fp4 for Kimi K2 and it was surprisingly better than some other providers:
https://eval.16x.engineer/blog/kimi-k2-provider-evaluation-results
Yeah, it depends on quant type, not only size. Exl3 at 4 bit would be much different in quality from legacy Q4 quant with no calibration dataset and no i-matrix. And people say some tasks and token probabilities get hit by quant much harder than others
I've seen benchmarks for quantization, and they simply never match reality
.
We're not asking for a ban of quantized models.
Give us a away to globally block anything below fp8. If some want to use them, great, have fun. I think anything below fp8 is brain dead and that I'm being scammed by secretly being served them
appreciate all the discussion here folks, this kind of thing is something we’re very aware of internally and want to make better in a few ways
has anyone tried my proxy? if so, any feedback?
Why do you say that? their models are labeled as fp8. Is that not true?
also depends on context quant
Some are fp8 but some are fp4
Also deepinfra(turbo) are fp4
More of a per model basis
I wish there was a way to get a real quant size through model request or provider data
That's a big damn difference. If claimed fp4 gets 90-93% AIME, what is the quant of 80% AIME, ? Or is it 4bit KV cache they use?
That's for smaller models, but still an example of llama3. And maybe MoE models are getting hit by quantization more
I think just saying fp4 is not enough to describe the quantization technique now. It's the layers which fp4 is applied to that matters.
Back in DeepSeek era, even though the model was trained on FP8, some layers are still BF16.
Question on top of this is there an proof if a providers says they serving fp8 but will be routing internally to something like fp4 or int4.
This was a concern for me when I was directly using provider. Which I was using llama 405 but was routed to 70b this was obviously was on the json but you get my point