#FP4 models are unacceptable, especially for coding models. There’s no way to filter this as a paying


next forge
#

You can ignore the provider at the global level

#

by going to settings
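To skip a provider for a single request instead of the whole account, here is a minimal sketch. It assumes OpenRouter's provider-routing object in the request body accepts an ignore list of provider slugs (check the provider routing docs for exact field names). The helper name buildRequestBody and the "deepinfra" slug are only illustrative.

```javascript
// Minimal sketch: per-request counterpart of the global "ignore provider" setting.
// Assumes the chat completions body accepts provider.ignore (an array of provider
// slugs). buildRequestBody is a hypothetical helper, not part of any SDK.
function buildRequestBody(model, messages, ignoredProviders) {
  return {
    model,
    messages,
    provider: {
      // Providers listed here are skipped when this request is routed.
      ignore: ignoredProviders,
    },
  };
}

const body = buildRequestBody(
  "qwen/qwen3-coder",
  [{ role: "user", content: "Hello" }],
  ["deepinfra"],
);
console.log(JSON.stringify(body, null, 2));
```

The settings page applies the block everywhere. The per-request field is only useful when a client lets you edit the JSON body, which is exactly the limitation raised further down the thread.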

safe acorn
#

If there could be an option to sort providers by quantization, or to block a particular quant (fp4), that would be helpful

(maybe in future updates)

latent rapids
#

That requires manually messing with the headers or the JSON request you send, I suppose

#

It's not only DeepInfra; I think Atlas did it too. And some providers don't specify which quant they use, so it could be de facto any size

latent rapids
#

No, DeepSeek was trained in FP8 and doesn't have FP16 weights. Still, some providers leave the precision level blank

#

Chutes, for example, is very shy about it

exotic juniper
#

Or maybe a way to specify the quant

#

Like how you can specify :floor or :nitro

#

You should be able to specify it like <model_name>:fp8

#

It would also be nice to be able to specify a specific provider in the slug

#

Like if I want Baseten, I could write moonshot-ai/kimi-k2-instruct:baseten

#

That would make it so much easier to use and more powerful

#

Especially for apps that only let you set a model slug (like Open WebUI)

#

I might build a proxy just so I can do this lol

hearty verge
#

I think realistically this would need to be paired with a policy that providers can't list open weight models on openrouter without disclosing the quantisation. Yes they could lie, but at least they would be actively choosing to lie publicly.

supple creek
#

We really need a way to ignore/block the fp4 junk globally. They're a net negative. I don't want to use fp4 quantized models even if the providers paid me per token 😆

#

It's giving OR a bad reputation too: when people say "don't use OpenRouter for benchmarking because it's unreliable", they're referring to the junk quantized models

exotic juniper
#

I wrote my proxy

#

To run it, just use bun run or_proxy.js

#

It supports this syntax: <slug>:<option>, where option is one of:
- "free", "beta", "floor", "nitro", "thinking": passed through as-is (OpenRouter uses these)
- "int4", "int8", "fp4", "fp6", "fp8", "fp16", "bf16", "fp32": forces that quantisation
- anything else (baseten, deepinfra/fp8, any provider slug): forces that provider
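A rewrite rule like the one described above could look roughly like this. It is a hypothetical re-implementation, not the actual or_proxy.js. It assumes OpenRouter's request body accepts provider.quantizations (an allowlist of quant levels) and provider.only (an allowlist of provider slugs). rewriteModel is an illustrative name.

```javascript
// Hypothetical sketch of the <slug>:<option> rules; NOT the real or_proxy.js.
// Assumes OpenRouter accepts provider.quantizations and provider.only in the body.
const PASSTHROUGH = new Set(["free", "beta", "floor", "nitro", "thinking"]);
const QUANTS = new Set(["int4", "int8", "fp4", "fp6", "fp8", "fp16", "bf16", "fp32"]);

function rewriteModel(body) {
  const idx = body.model.indexOf(":");
  if (idx === -1) return body; // no option: forward the request unchanged

  const slug = body.model.slice(0, idx);
  const option = body.model.slice(idx + 1);

  // OpenRouter's own variants (e.g. :nitro, :free) keep their suffix.
  if (PASSTHROUGH.has(option)) return body;

  if (QUANTS.has(option)) {
    // Only route to endpoints that declare this quantization.
    return { ...body, model: slug, provider: { ...body.provider, quantizations: [option] } };
  }

  // Anything else is treated as a provider slug, e.g. "baseten" or "deepinfra/fp8".
  return { ...body, model: slug, provider: { ...body.provider, only: [option] } };
}

const forced = rewriteModel({ model: "qwen/qwen3-coder:fp8", messages: [] });
console.log(JSON.stringify(forced));
```

Splitting on the first colon works because model and provider slugs use "/" rather than ":", so an option such as deepinfra/fp8 survives intact.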

exotic juniper
#

to force fp8

exotic juniper
#

To get the provider slug, press the clipboard icon next to the provider's name

#

hope it helps!

exotic juniper
#

Yes, just put qwen/qwen3-coder:fp8 and it will force fp8 quantisation

#

I haven’t coded that in yet (if people want it, I can); you can also just use a specific provider in

supple creek
#

too many sneaky ways of being scammed by providers on OR

latent rapids
#

We need someone patient, and with money, to run benchmarks on suspicious providers and compare them to proven, official ones
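The comparison step could be as simple as flagging providers whose score falls too far below a reference run, such as the model author's official API. This is a minimal sketch. The 5% threshold and the scores in the usage line are hypothetical inputs, not measured data, and flagSuspiciousProviders is an illustrative name.

```javascript
// Minimal sketch: flag providers whose benchmark score drops more than
// maxRelativeDrop below a reference score. All inputs are supplied by you;
// the default threshold is an arbitrary assumption.
function flagSuspiciousProviders(referenceScore, providerScores, maxRelativeDrop = 0.05) {
  const flagged = [];
  for (const [provider, score] of Object.entries(providerScores)) {
    const relativeDrop = (referenceScore - score) / referenceScore;
    if (relativeDrop > maxRelativeDrop) {
      flagged.push({ provider, score, relativeDrop });
    }
  }
  // Worst offenders first.
  return flagged.sort((a, b) => b.relativeDrop - a.relativeDrop);
}

// Hypothetical scores, for illustration only.
console.log(flagSuspiciousProviders(0.9, { providerA: 0.89, providerB: 0.8 }));
```

Single benchmark runs are noisy, especially at non-zero temperature. Averaging several runs per provider before comparing makes a flag more meaningful than a one-off gap.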

supple creek
#

And a global way to block quantized models, not just per request

latent rapids
#

Yeah, it depends on the quant type, not only the size. EXL3 at 4 bits would be very different in quality from a legacy Q4 quant with no calibration dataset and no importance matrix (imatrix). And people say some tasks and token probabilities get hit much harder by quantization than others

supple creek
#

We're not asking for a ban on quantized models.

Give us a way to globally block anything below fp8. If some people want to use them, great, have fun. I think anything below fp8 is brain-dead, and I feel I'm being scammed by secretly being served them
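"Block anything below fp8" amounts to an allowlist of quant levels at or above a minimum bit width. This is a minimal sketch, assuming OpenRouter's provider.quantizations field takes such a list and uses "unknown" for endpoints with no declared precision. The bit widths are nominal storage sizes, and allowedQuantizations is an illustrative name.

```javascript
// Minimal sketch: build a quantization allowlist from a minimum bit width.
// Assumes the values match OpenRouter's provider.quantizations options, and that
// "unknown" marks endpoints that leave precision blank.
const QUANT_BITS = { int4: 4, fp4: 4, fp6: 6, int8: 8, fp8: 8, fp16: 16, bf16: 16, fp32: 32 };

function allowedQuantizations(minBits, includeUnknown = false) {
  const allowed = Object.keys(QUANT_BITS).filter((q) => QUANT_BITS[q] >= minBits);
  // Opting in to "unknown" means accepting endpoints whose quant is unverified.
  if (includeUnknown) allowed.push("unknown");
  return allowed;
}

const body = {
  model: "qwen/qwen3-coder",
  messages: [{ role: "user", content: "Hello" }],
  provider: { quantizations: allowedQuantizations(8) },
};
console.log(JSON.stringify(body.provider));
```

Excluding "unknown" by default is the strict choice. As noted earlier in the thread, a blank precision field could hide any quant size.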

broken ether
#

Appreciate all the discussion here, folks. This is something we’re very aware of internally and want to improve in a few ways

exotic juniper
#

Also, deepinfra (turbo) is fp4

#

It's more of a per-model thing

latent rapids
#

I wish there were a way to get the real quant size from the model request or the provider data
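Provider data can at least show what each endpoint *declares*. This is a minimal sketch, assuming a response shaped like OpenRouter's per-model endpoints listing, i.e. { data: { endpoints: [{ provider_name, quantization }] } ]. Verify the route and field names against the API reference. The declared value is self-reported and proves nothing about what is actually served. summarizeQuantizations and the example providers are hypothetical.

```javascript
// Minimal sketch: group endpoints by the quantization they declare.
// Assumes the response shape described above; the values are self-reported.
function summarizeQuantizations(response) {
  const summary = {};
  for (const ep of response.data.endpoints) {
    // Treat a missing or empty field the same as an explicit "unknown".
    const quant = ep.quantization || "unknown";
    (summary[quant] ??= []).push(ep.provider_name);
  }
  return summary;
}

// Hypothetical response, for illustration only.
const example = {
  data: {
    endpoints: [
      { provider_name: "ProviderA", quantization: "fp8" },
      { provider_name: "ProviderB", quantization: "fp4" },
      { provider_name: "ProviderC", quantization: null },
    ],
  },
};
console.log(summarizeQuantizations(example));
```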

latent rapids
#

That's a big damn difference. If the claimed fp4 gets 90-93% on AIME, what quant is the one scoring 80%? Or are they using a 4-bit KV cache?
That one is for smaller models, but it's still an example with Llama 3. And maybe MoE models get hit harder by quantization

abstract salmon
#

I think just saying "fp4" is no longer enough to describe the quantization technique. What matters is which layers fp4 is applied to.

#

Back in the DeepSeek era, even though the model was trained in FP8, some layers were still BF16.
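One way to see why the label alone under-describes a checkpoint is to compute average bits per weight from how many parameters sit in each format. This is a minimal sketch. The parameter split in the usage line is hypothetical, and a storage average says nothing about *which* layers lost precision, which is the point above.

```javascript
// Minimal sketch: average bits per weight and weight memory for a mixed-precision
// checkpoint. Group sizes and formats are caller-supplied; nothing here is measured.
const BITS = { fp4: 4, fp8: 8, bf16: 16, fp32: 32 };

function mixedPrecisionStats(groups) {
  let totalParams = 0;
  let totalBits = 0;
  for (const { params, format } of groups) {
    totalParams += params;
    totalBits += params * BITS[format];
  }
  return {
    avgBitsPerWeight: totalBits / totalParams,
    weightGiB: totalBits / 8 / 2 ** 30,
  };
}

// Hypothetical split: most expert weights in fp4, other layers kept in bf16.
console.log(mixedPrecisionStats([
  { params: 600e9, format: "fp4" },
  { params: 70e9, format: "bf16" },
]));
```

Two checkpoints that are both labelled "fp4" can differ in average width and, more importantly, in which layers were quantized.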

nova sedge
#

A question on top of this: is there any proof if a provider says they're serving fp8 but is internally routing to something like fp4 or int4?

#

This was a concern for me when I was using a provider directly: I was requesting Llama 405B but got routed to 70B. That was obviously visible in the JSON, but you get my point

broken ether
#

I went into testing OpenRouter providers with several theories. One theory I had was that Qwen3 Coder with provider deepinfra/fp4 would perform significantly worse because of the fp4 quantization.

I was incredibly wrong; whatever magic DeepInfra is working on their fp4 version