#Llama 4
1082 messages · Page 2 of 2 (latest)
questionable benchmark if you ask me
how is o1 mini better than 3.7 sonnet
still, something to be said about people taking shortcuts
Just testing random stuff, this one was particular telling (identical prompt), interactive village webgame
When physics problems tell you to approximate a person as a point mass
Hahaha
Supposedly there were some implementation issues with maverick in vllm
does anybody know whether the current providers have introduced the fix?
so far it looks like there's been at least 3 bugs each a couple days apart that have each improved performance
I'm guessing most providers at this point have caught up
might not be the end of it though!
@wary warren hope its ok to tag
SambaNova added support for both new models a few days ago, is there a reason they aren't available?
they didn’t support images and we don’t have a good way of toggling that off for a specific endpoint yet
checking in with them now
I'm afraid to say - Llama 4 already came with a rocky start... but with the release of GLM-4-32B-0414, I think it's officially dead (for non-long-context tasks).
It's interesting that both Meta and OpenAI (both big players) released models that are basically dead out of the gate (OpenAI is already being abandoned after only 2 months from release and Llama 4 just seems to be a non-starter), while the smaller (and let's face it, mostly Chinese) modelmakers are eating their lunch with these amazing models that punch way above their weight
OpenAI, at least, is still the apparent king of multimodality for now, but Google's catching up there. There's a lot of catching up to do in that regard for the smaller guys
my guess is that openai is mostly milking margin instead of actually being behind right now
the challenger modelmakers simply aren't
4.5 was never practical
massive ass model they'll likely use to distill smaller ones off of (if they didn't already for 4.1)
reminds me of claude's opus
Hey folks 👋
I’ve enabled tool support for both Llama 4 Scout and Maverick models through OpenRouter, using the Together provider. We’re also using the paid endpoint.
However, the tools don’t seem to work properly with these models — unlike Claude and GPT, which are working as expected under the same conditions.
Just checking if anyone else has run into this issue or found a workaround. Appreciate any help!
Google announces that their Llama 4 and 3.3 serverless endpoints have graduated from (free) preview into (paid) GA, but doesn't publish pricing anywhere
https://developers.googleblog.com/en/llama-4-ga-maas-vertex-ai/
Can't wait to use Llama 4 everywhere!
@wary warren on Llama-4 Scout's model card it lists image input support https://www.llama.com/docs/model-cards-and-prompt-formats/llama4_omni/ but seems Openrouter only lists image input support for Llama 4 Maverick paid and not Scout. Is this a provider decision or is Llama 4 Scout on Openrouter meant to also support image inputs?
the wait-list link does not work properly for me
I hope they open voice up to api in the future, and maybe even open source it
answered my own question Llama 4 Scout does support image inputs just Openrouter is not listing it in it's model info page
What do you mean we dont list it as image input? It does support images on OpenRouter.
Llama api has early access, looks to be free usage, would be nice to get it added to BYOk
see screenshot Maverick lists image input pricing but Scout doesn't list it
In open web ui I think scout is a very good task model (e.g. title generation, search prompt generation, etc) better results so far compared to Gemma 27b, and GPT 4.1 nano.
That sounds about right. Very weird model. Terrible at code, zero rizz, and no creativity in writing. But it's smart for the size and reliable.
surely Behemoth will deliver (one can hope)
I bet the reason won't even touch qwen 3 scores
I wish we had a Qwen 3 with vision, thats the main reason I use maverick more than Qwen 3.
Idk, Qwen 3 is still a mixed bag
well maybe I just have not used it enough yet, it seemed impressive on first testing, but then I have not really used it because I need vision models for most tasks
The main thing that impressed me was that I could run at 10 tokens per second on my cpu with the 30b 3a, it was definitely much more capable than any other model that I can run at 10 tokens per second, I can’t run the bigger one so I have not really used it at all
The benchmarks seemed promising, on aider I think 30b a3 got 39%, Gemma 3 4b I think got sub 1% if I remember correctly
I haven't done enough testing yet either, I just hear very mixed things.
I'm trying to use Llama 4 via completion endpoint (https://openrouter.ai/docs/api-reference/completion)
But it always gives me instruct-style output.
This doesn't happen with other models (even the ones that's clearly marked as instruct models.)
Does a provide change my request to chat completion internally? Or is it a problem with Llama 4 itself?
do you think LLama Maverick is better than 4.1 mini for building Agent, tools?
According to https://artificialanalysis.ai basically everything is better then Llama Maverick
I mean 4.1 mini, it costs similar
There is mini in the charts, but hmm looking at other charts Llama 4 Maverick is 3x cheaper, 2x faster and has 2x lower latency
Oh, 4.1 was trained for agent use in mind, it would probably crush Llama then
4.1 mini looks like close 4.1 in benchmark, weird 😂
How about your practice test?
Yeah i was surprised too, maybe there is an issue with the benchmark
I don't use any of these models, I don't have any experience with them
i dont think there is an issue
i've been saying frfom the very beginning that 4.1 mini is very close in coding to 4,1
It's close in everything if you see the individual benchmarks
In my experience, 4.1 is smarter but follows instructions worse than 4o in both the legacy and new migration system prompts
I’m struggling to find between intelligence and the ability to follow instructions when comparing 4.1 Mini and 4o Mini.
is lama 4 that bad? that none is using it
Seem like more people ar sticking with Lama 3.3 70b
imma wait another year to take llma seriosuly.
their reasoning models will suck too
Yes, compare to other competition right now. they lack behind them
Llama 3.3 70B is not an alternative to Scout/Maverick. They fill different usecases. Scout/Maverick exist to have a general purpose model that can be served for stupidly cheap and fast. They suck at roleplay, EQ, coding, and others, but Maverick is $0.20/$0.60 on Groq at 240 tk/s which is just insane.
People will say "Sure but it's worse than V3." When #1, it's faster and cheaper than V3, and #2, they actually tie on multiple benchmarks. So depending on your usecase, it can very much be the smart choice for the job.
Plus the vision capabilities are quite good (wayyy better than llama 90b), i often chose it over V3 due to its vision ability and crazy fast speeds on samba nova. Definitely depends on the use case like you are saying, for code I often use V3.1, for general tasks I often use maverick, and for stem topics I often use R1 or for tough stem topics I will spend a little extra to use o4 mini high
Always gotta use the right tool for the job 👍
If you don't need the bonkers speed or open-weights though, it can definitely be hard to argue for. Flash Thinking 2.5 is better at nearly everything all-around, code, reasoning, EQ, creativity, etc. for close to the same price and still 100+ tk/s. 235B is significantly more creative and personable than either, at the same price, just way lower speed.
Considering Llama 4 Scout hosted by Cerebras is 3-8k TPS for dirt cheap, I use it to ask questions about code and I want a very fast (albeit probably deficient) answer
I'll definitely setup Maverick to try the vision capabilities
After seeing the crazy tokens per sec I now have it as the auto complete in openwebui.
I've not been able to use Llama 4 Maverick yet because it doesn't handle tool calling very well. It looks like they've just updated the chat template to try to improve this, as well as the tool parser in vllm: https://github.com/vllm-project/vllm/pull/17917
I don't think there's any way to tell if the different openrouter providers are using these latest changes though?
There isn't. You'll have to ask the providers
Is anyone hosting Scout at its full 10M context?
No provider on OpenRouter is hosting it at its full context length
I know you didn't ask, but Llama 4 Scout & Maverick are horrible at long context
Look here to see the performance of a model at long context:
https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87
Benchmarking AI Models for Long Context Comprehension