#Llama 4

1082 messages · Page 2 of 2 (latest)

flat whale
#

I wish they'd colorize this table

#

Transcribed with Gemini Flash 2.0, I did not double check the values

deft knot
candid slate
#

questionable benchmark if you ask me

#

how is o1 mini better than 3.7 sonnet

#

still, something to be said about people taking shortcuts

formal lantern
#

Just testing random stuff, this one was particular telling (identical prompt), interactive village webgame

flat whale
#

When physics problems tell you to approximate a person as a point mass

shut lodge
#

llama 4 moved up a bit so now its above 2.0 flash

#

oh but thats an older 2.0 flash

hollow yacht
#

Supposedly there were some implementation issues with maverick in vllm

#

does anybody know whether the current providers have introduced the fix?

stiff isle
#

I'm guessing most providers at this point have caught up

#

might not be the end of it though!

lucid pond
#

@wary warren hope its ok to tag
SambaNova added support for both new models a few days ago, is there a reason they aren't available?

wary warren
#

checking in with them now

lucid pond
#

ohhh ok

#

thanks!

stiff isle
#

I'm afraid to say - Llama 4 already came with a rocky start... but with the release of GLM-4-32B-0414, I think it's officially dead (for non-long-context tasks).

agile pendant
#

OpenAI, at least, is still the apparent king of multimodality for now, but Google's catching up there. There's a lot of catching up to do in that regard for the smaller guys

stiff isle
#

the challenger modelmakers simply aren't

mild blade
#

reminds me of claude's opus

rocky sphinx
#

Hey folks 👋
I’ve enabled tool support for both Llama 4 Scout and Maverick models through OpenRouter, using the Together provider. We’re also using the paid endpoint.
However, the tools don’t seem to work properly with these models — unlike Claude and GPT, which are working as expected under the same conditions.
Just checking if anyone else has run into this issue or found a workaround. Appreciate any help!

arctic canopy
median flint
#

Can't wait to use Llama 4 everywhere!

marble terrace
junior oyster
#

the wait-list link does not work properly for me

shut lodge
#

I hope they open voice up to api in the future, and maybe even open source it

marble terrace
neon sun
#

What do you mean we dont list it as image input? It does support images on OpenRouter.

shut lodge
#

Llama api has early access, looks to be free usage, would be nice to get it added to BYOk

marble terrace
shut lodge
#

In open web ui I think scout is a very good task model (e.g. title generation, search prompt generation, etc) better results so far compared to Gemma 27b, and GPT 4.1 nano.

inland falcon
#

That sounds about right. Very weird model. Terrible at code, zero rizz, and no creativity in writing. But it's smart for the size and reliable.

formal lantern
robust frigate
#

I bet the reason won't even touch qwen 3 scores

shut lodge
inland falcon
#

Idk, Qwen 3 is still a mixed bag

shut lodge
#

The main thing that impressed me was that I could run at 10 tokens per second on my cpu with the 30b 3a, it was definitely much more capable than any other model that I can run at 10 tokens per second, I can’t run the bigger one so I have not really used it at all

#

The benchmarks seemed promising, on aider I think 30b a3 got 39%, Gemma 3 4b I think got sub 1% if I remember correctly

inland falcon
#

I haven't done enough testing yet either, I just hear very mixed things.

gloomy vault
vague cypress
#

do you think LLama Maverick is better than 4.1 mini for building Agent, tools?

merry quarry
vague cypress
merry quarry
#

Oh, 4.1 was trained for agent use in mind, it would probably crush Llama then

vague cypress
#

4.1 mini looks like close 4.1 in benchmark, weird 😂

vague cypress
merry quarry
merry quarry
deft knot
#

i've been saying frfom the very beginning that 4.1 mini is very close in coding to 4,1

merry quarry
#

It's close in everything if you see the individual benchmarks

vague cypress
#

In my experience, 4.1 is smarter but follows instructions worse than 4o in both the legacy and new migration system prompts

#

I’m struggling to find between intelligence and the ability to follow instructions when comparing 4.1 Mini and 4o Mini.

hardy sparrow
#

is lama 4 that bad? that none is using it
Seem like more people ar sticking with Lama 3.3 70b

robust frigate
#

imma wait another year to take llma seriosuly.

#

their reasoning models will suck too

hollow beacon
inland falcon
#

Llama 3.3 70B is not an alternative to Scout/Maverick. They fill different usecases. Scout/Maverick exist to have a general purpose model that can be served for stupidly cheap and fast. They suck at roleplay, EQ, coding, and others, but Maverick is $0.20/$0.60 on Groq at 240 tk/s which is just insane.

#

People will say "Sure but it's worse than V3." When #1, it's faster and cheaper than V3, and #2, they actually tie on multiple benchmarks. So depending on your usecase, it can very much be the smart choice for the job.

shut lodge
inland falcon
#

Always gotta use the right tool for the job 👍

inland falcon
#

If you don't need the bonkers speed or open-weights though, it can definitely be hard to argue for. Flash Thinking 2.5 is better at nearly everything all-around, code, reasoning, EQ, creativity, etc. for close to the same price and still 100+ tk/s. 235B is significantly more creative and personable than either, at the same price, just way lower speed.

median flint
#

I'll definitely setup Maverick to try the vision capabilities

shut lodge
calm jasper
#

I've not been able to use Llama 4 Maverick yet because it doesn't handle tool calling very well. It looks like they've just updated the chat template to try to improve this, as well as the tool parser in vllm: https://github.com/vllm-project/vllm/pull/17917

I don't think there's any way to tell if the different openrouter providers are using these latest changes though?

proud basin
main obsidian
#

Is anyone hosting Scout at its full 10M context?

proud basin
#

I know you didn't ask, but Llama 4 Scout & Maverick are horrible at long context