Currently, the top provider for gpt-oss-120b is an fp4 provider which is $0.01 cheaper than a bf16 provider. The bf16 model will have significantly better response quality. By not factoring quantization into your best bid algorithm, it seems like you are penalizing the bf16 providers, or rather incentivizing open source inference providers to serve the most heavily quantized, lowest-quality version of open source models possible.
#Why are fp4 providers allowed to be used ahead of bf16 providers only due to a better price?
Our price sorting isn't so aggressive that the BF16 provider will get no traffic when the difference is only $0.01. Regarding "significantly better response quality", are there any specific evals you ran to measure that difference? I'm sure the team is happy to take that into consideration if we can reproduce those results.
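To make the point about price sorting concrete, here is a minimal sketch of one way price-weighted routing could behave. It assumes an inverse-power weighting, which is an illustrative guess and not something the thread confirms as the actual algorithm. The provider names, the prices, the `exponent` parameter, and the `traffic_shares` function are all hypothetical.

```python
def traffic_shares(prices, exponent=2.0):
    """Split traffic across providers in proportion to 1 / price**exponent.

    ASSUMPTION: inverse-power weighting is only an illustration; the real
    routing algorithm is not described in this thread.
    """
    weights = {name: 1.0 / (price ** exponent) for name, price in prices.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical prices in USD per million tokens, $0.01 apart.
prices = {"fp4_provider": 0.09, "bf16_provider": 0.10}
shares = traffic_shares(prices)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these made-up prices the cheaper provider gets the larger share, but the pricier one still receives a bit under half of the traffic. A larger `exponent` would tilt the split more sharply toward the cheaper provider.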
I do not have an eval, but anecdotal reports from open source subreddits are that fp4 shows a noticeable reduction in quality and intelligence compared with 8-bit or 16-bit.
I've never heard of open source subreddits
open source subreddit: https://www.reddit.com/r/LocalLLaMA/
Anecdotal reports:
https://www.google.com/search?q=quantized+performance+site%3Areddit.com
Looks like using fp4 doesn't make too much of a difference for large-parameter models, but smaller models suffer performance degradation
That’s a local llama subreddit. Reddit is inherently closed source. I think you mean public instead of open source
No need to be intentionally obtuse