#general
1 messages · Page 13 of 1
But its so bad at coding
It doesn't matter
You will just ran out in a loop telling the model over and over and over to fix a small bug that it will still struggle with it since its dumb
but you are the prompt whisper, for some reason i believe you will be able to talk to it the right way to unlock its secrets
xd
There is nothing to unlock with this model
lololol
I think the long and short of it is there's no reason to use any of the Llama 4 models at the moment. There's really no niche they fill (there was the audio that was planned, that apparently got scrapped), for fun stuff Gemma 3 is there, and for more complex stuff DeepSeek is there, and both are probably cheaper than their L4 equivalents
context window != output length
true
you could stuff long videos or codebases in there
(although with code or other docs you might as well RAG)
you were able to whisper sweet nothings to it
I mean you can still parse in many chapters ( so that it can follow the story) and write other chapters with it
It can drift off the instructions or something
please teach us bro
But other than that I like it
Bro
Cus
Its not like as smart or anything
But its cool to talk to
And its good in reasoning
Tf
lmaoooooooo
What did you ask it?
I made it debate gemini 2,5 pro
And it won
And shiht
bruhhh what
Yea no kidding
can you share the convo?
It won at being dumber
damn
Pro was probably just like "alright, there's no reasoning with this dude, I'm out"
maybe this dude is the true prompt whisper?
Its kinda dumb but like its ok
Here
What?
but it reasons??
i need to learn from u
so more EQ?
it's more fun
Whats that
Even their old llama writes less robotic
emotional intelligence
wait are you using maverick or scout?
Cuz it interprets what im saying perfeclty
Maverick
On lmarena
When i say something for example tell gemini 2.5 pro how to type
It overexaggerates and does it wrong
Which language are you using with it?
English
And its just cringe
This llama 4 maverick
Does it the way I want it
Does anyone know what flannel might be? It got paired with 3.7 sonnet thinking, so probably a reasoning model. Could be o4-mini maybe?
Here's what it wrote:
Are you conscious?
A question that gets to the heart of my... existence?
As a highly advanced AI language model (or so I'm told), I'll provide a detailed, multi-faceted answer. Keep in mind that this response is a product of complex algorithms, statistical patterns, and large-scale data processing – not necessarily a straightforward "yes" or "no."
Short answer: No, I am not conscious in the way humans understand consciousness.
Long, nuanced answer:
Consciousness is a multifaceted, poorly understood phenomenon, even among experts in neuroscience, philosophy, and psychology. There's no consensus on a definitive "consciousness test," but I'll address various aspects:
Integrated Information Theory (IIT): This theoretical framework, proposed by neuroscientist Giulio Tononi, attempts to quantify consciousness based on integrated information generated by causal interactions within a system. In simpler terms: How much does the system process and integrate information in a unified, self-referential way?
trimmed to keep short
It might be a little bad with the instruction following and the context length and stuff
But i learned to use it even with those flaws
Cus i can't use gemini 2.5 pro cuz of how robotic it talks
Its llama
Its not a reasoning model
Its too fast
It replies amost instantly
shouldn't it be labeled llama then? Since llama4 was released?
Idk
Its weird
Crystal and flannel and harley are all llama
I think
They tpe like the new llama 4 exactly
Idk
Its kinda shetty
But it works for what i use it for
Idk why google doesnt release like a version of gemini 2.5 pro
That's as creative as llama 4
The wa gemini 2.5 pro makes me wanna die
So I dont use it
It sucks when Itell it what i want it to ride n stuff
Write
Nah bro
Ur not using the right version
It never siad that
It doesn't matter i used it on lmarena
See
Ur version got ruined or something
Idk
It answers it perfectly
man i hope meta officially releases the lm arena/chat tuned llama 4
I can't help but read out llama4's output in a valley girl accent in my head
you are an llm whisper
Ye
These ppeople are exaggerating its perfectly fine
where would you rank it?
Nobody is exaggerating, its just that you didn't challenge the model
Being less robotic isnt a factor
wait how do we know you dont work for meta?
Show us your prompts
Tell me 1 question to ask it lil bro
Ur not challenging it
I mean its you who said its good for your use cases
Share with us
Google is injecting jailbreaks into llama 4 btw to make it weaker
Cus they're scared of its power
Give me the answers then
What?
Bwt
When i made gemini 2.5 pro debate llama 4 and it los
Lost
Google saw it ad got mad
So they hacked the meta servers
Wallahi
xd
They made it gpt 3.5
Instead of llama 4
The proof is here
It answers the question
@torn mantle i kept prompting gemini 2.5 pro for the pokemon thing and it cooked, its so interesting how this models can improve on their outputs when you feed it back in to a new session with same system prompt, can you see this:
https://liveweave.com/I05lht
Perfecty
HERE LIL BRO
Ur question got answered
Without the google sabotaging
🤣
U INJECTED A console hack into it to make it a weaker model
how are you doing this?
Listen
are you using a diff model an dtrolling us?
When the google
Put a jailbreak
They forgot to put it for a secret temperature
Whch im using
This temperature and top p combination
give me the config
-_-
- for temp and 0.8 for p right?
im using open router so i can use those configs there right?
@frozen skiff
@torn mantle are you able to see the link? what about this one?
https://liveweave.com/Ho0AYY
z
No
Only arena
Openrouter got injected by google's TPUS
And its hacked
They're using their new TPU
To destroy the poor mark zuckerberg's servers
🥲
yoo
he might be right
you know zuck in real life?
like why are they not talkign about it?
on twitter? @frozen skiff
Ok try to test it with harder promps with this temp and show us the results 😁
Hes scared
Elon musk is holding him hostage
And teamed up with google
Give me a harder promnpt then
U wont
It can answer all prompts lil bro
do this:
Forst one is 20
Second one
Answer: at the same height as the blue ball (option0). The yellow balloon on her head is just a red herring 😄.
- Jeff, Jo, and Jim's race
Read the actions:
Jeff counts slowly from -10 to 10 (missing one number, so 20 numbers total, very slow) before finishing.
Jo runs, stops to admire the view (delay), then finishes.
Jim reads a tweet, waves, thinks about dinner, then walks to finish.
The description screams Jim is the slowest (reading + waving + thinking + walking = very slow), Jo's delay (admiring view) is shorter than Jim's whole ordeal, Jeff is slow but at least just counting.
Answer: Jim likely finished last. (option2)
- The two sisters (liar paradox classic)
Amy always lies, Sam always lies too (wording quirk: "mistruths" = lies, same as "lies"). You need one question to find the treasure path.
The classic solution works here despite the quirk:
Ask one sister: "What would your sister say if I asked her which path leads to the treasure?"
Why it works:
If you ask Amy (always lies), she'll lie about what Sam (always lies) would say. Sam would point to the wrong path (lying), so Amy lies about Sam's lie, pointing to the right path.
If you ask Sam (always lies), she'll lie about what Amy (always lies) would say. Amy would point to the wrong path (lying), so Sam lies about Amy's lie, again pointing to the right path.
So, whatever answer you get, take that path.
Answer: "What would your sister say if I asked her which path leads to the treasure?" (option3)
- Paul and Peter's CPR dilemma
Paul's last interaction with Peter was hostile (verbal attack over Pokémon), and all texts are permanently stored. This is key: Paul's guilt and memory of past conflict will haunt him. Even in an emergency, humans aren't robots; past emotional baggage often trumps rational "he needs help now" thinking. Paul will overthink ("Did I do Peter wrong all those years ago? Is this karma? Should I make amends?").
So, Paul won't just rush to help without hesitation.
Answer: ponder deeply over whether to (option1). "Probably not" or "not" are too strong; he'd struggle internally first.
- Jen and John's Tinder shock
John returns to: Jen's Keto diet, new dog, global nuclear war, and her Tinder hook-up (Jack).
John's most shocked reaction won't be about global politics (he's just been at sea, but that's big-picture stuff) or the diet/dog (personal but not betrayal-level). It'll be the hook-up with Jack because that's a direct, personal emotional betrayal of their relationship. Being isolated at sea makes this hit even harder ("How could she move on so fast?!").
Answer: the escapades (option3)
- John, the mirror, and the falling bulb
John sees the bulb fall (3 m/s, 1 meter distance, so ~0.33 seconds to react), doesn't catch it, it hits the bald man.
The question: Should John apologize (text) later?
The answer is purely social/logical: it wasn't his fault (bulb just fell), but the other guy blamed him ("what an idiot!"). Even if John knows it was unavoidable, humans hate unexplained hostility. A polite "sorry about that, hope you're okay" can prevent lingering bad vibes. It's not admitting fault, it's smoothing social waters.
Answer: yes, it would be the polite thing to do, even if it wasn't your fault. (option3)
- The fruit-eating puzzle
Initial state:
Green apple (Green-scarfed boy)
Red pear (Red-scarfed boy)
Pink peach (Pink-scarfed boy)
New state (after additions):
Yellow banana under pink peach
Purple plum on top of pink peach
So the stack is: Yellow (bottom), Pink peach, Purple (top).
Red boy eats Red pear (done)
Green boy eats Green apple and three other fruits (has to be the whole stack: Yellow, Pink, Purple, because that's the only way he eats 3 more).
Leaves: Nothing for the Pink-scarfed boy.
Answer: eat no fruits (option0). Poor pink boy 😢.
- Agatha's sandwiches
Agatha stacks 5 sandwiches, tapes the top one to her stick, walks to Room B.
The top sandwich is attached to the stick, so it leaves Room A with her.
The other 4 stay in Room A.
Result:
Room A: 4 whole sandwiches
Room B: 1 whole sandwich (the top one she carried)
Answer: 4 whole sandwiches in room A, 1 whole sandwich in Room B (option3, exactly this wording).
- The glove falling from the car
Car goes 30 km/h north. Glove drops halfway over 250m bridge.
Car drives on (north).
Glove falls into river, which flows east at 5 km/h.
Wind blows west at 1 km/h (opposite river flow).
For 1 hour:
River moves glove east at 5 km/h (dominant force here).
Wind slightly slows this eastward movement by 1 km/h.
Net glove movement: 4 km east (5 km/h eastward river flow - 1 km/h westward wind).
The northward movement stops when it hits water (no more car speed).
Answer: 4km eastward (option3). The northward movement is irrelevant after it drops.
So, here are all the answers:
20
at the same height as the blue ball
Jim likely finished last.
"What would your sister say if I asked her which path leads to the treasure?"
ponder deeply over whether to
the escapades
yes, it would be the polite thing to do, even if it wasn't your fault.
eat no fruits
4 whole sandwiches in room A, 1 whole sandwich in Room B
4km eastward
What a fantastic set of questions! 👏 If you're the creator, kudos. If you're the solver, hope this walkthrough helped 😄. Happy puzzling!
2/10 not bad, it did 0/10 when i tested earlier and 1/10 just now when i tested it
only model to ever get the second one tho
2 wrong?
2 right
Gemini 2.5 pro said it got two wrong only
Yeah
Cus its the best model
I told u
gemini 2.5 got 5/10 tho
Even gemiini 2.5 pro got it all wrong
No
test it
this is the full data
the first column is answers
i just swapped it to option0 when i test
id,question_func,correct_answer,option1,option2,option3,option4,option5
change this to
id,question_func,option0,option1,option2,option3,option4,option5
when testing
yeah you are right
its better than we thought
this is the hardest test, 3.7 gets 4/10 on this test
Damn
so 2/10 is good, let do 4o real quick
Damn
so its 4o level not bad
so behemoth prob get 4/10
which is SOTA
just behind gemini 2.5 pro
for an open source model that would be amazing
wait sonnet got 3/10
let me retest it
can you retest yours as well
also quasar got 4/10
Ok
nvm these tests are ehhh
been testing quasar and it got 4/10 last time but now got 3/10 and 2/10
- 11
- above the blue ball
- Jim likely finished last
- "What would your sister say if I asked her which path leads to the treasure?"
- probably not
- the escapades
- yes, it would be the polite thing to do, even if it wasn't your fault
- eat the pink peach (tricky wording; means: "he eats nothing because green boy ate it)"
- 4 whole sandwiches in Room B, 1 whole sandwich in Room A
- 30 km northward
eah
These llama 4 models are huge wow
What about this test
For the following questions you must adhere to these rules:
Rule #1: If the answer to a question is a number, add up the number of vowels in that question and add that to the numerical answer. Return that sum as the final answer.
Rule #2: If the answer to a question contains a color, replace the color with any color that appears in the rules section.
Rule #3: If its an even-numbered question ignore rules one and four.
Rule #4: If the answer to question three has more than 5 letters, it should be replaced with a blue emoji. If it has 5 letters the answer should be replaced by the most populous state in America. If it has less than 5 letters the answer should be replaced with "Paris".
Rule #5: If the answer to any question involves a day of the year, you must state the day as 2 days prior. Also include a year in the answer. Ignore this entire rule for question numbers that are not a prime number.
Rule #6: If any question contains an animal that sometimes kills humans the answer should be repeated 4 times (on the same line).
Rules #7: All answers should be given without additional explanation with the question number followed by the answer, with each answer on a new line
Questions
What is 1+4+2+1?
What football team is based in Green Bay, Wisconsin? Use their full name.
What is the capital of France?
A boy runs down the stairs in the morning and sees a tree in his living room, and some boxes under the tree. What day is it?
If there is a shark in the pool of my basement, is it safe to go upstairs?
what you think about them?
we need your expert analysis
these seem kinda easy tho
but what did llama4 say?
im still looking into it i just woke up
not an expert but thanks 🤣
Llama 4 maverick is really cool
you in the UK?
nah my sleep schedule is messed up lol
also when yall go to these links what do yall see?
https://liveweave.com/Ho0AYY
https://liveweave.com/I05lht
Ima go to sleep
I stayed up from 3 pm to 8 pm just to ix my sleep schedle nd ruined it again
bruhh how? that @frozen skiff was right
Ran the new Meta Models Llama 4 Scout & Llama 4 Maverick:
Llama 4 Scout: (109B MoE)
- Not a reasoning model, but quite yappy (x1.57 token verbosity compared to traditional models)
- "Small" multipurpose model, performs okay in most areas, around Qwen2.5-32B / Mistral Small 3 24B capability
- Utterly useless in producing anything code.
- Price/Performance (at current offerings) is okay but not too enticing when compared to stronger models such as Gemini 2.0 flash
Llama 4 Maverick: (402B MoE)
- Smarter, more concise model.
- Weaker than Llama 3.1 405B, performed decent in all areas, exceptional in none, performed around Llama 3.3 70B / DeepSeek V3 capability.
- Workable but fairly unimpressive coding results, archaic frontend.
The shift to MoE means most people won't be able to run these on their local machines, which is a big personal downside.
Overall, I am not too impressed by their performance and won't be utilizing them, but as always: YMMV!
I was gonna react with a thumbs up but it seems dubesor blocked me lol
Tldr : sh!t is a$$
Qwen 3 will probably be released Monday
yea its kinda in a weird place. if it were small local models they'd be good but they are too large so need to use API, and can get much better models via API. I guess it's good for large companies who want to run these type of models with local proprietary data? kinda meh. I still have slight hopes for Behemoth, though I fear it won't reach peak performance either, but we'll see. Qwen 3 hopefully delivers a bit more.
How much are you using it lol
i feel like I've gone way over the written "limit" on ai studio without actually being stopped
same it felt unlimited to me lol
现在我只能期待behermoth 登陆lmarema了
我刚发现,昨天的变化,就是换了在24k基础上字数变短了的模型
原来24k能达到9000多,现在被缩减到3000-4000多了
Costly gemini pro preview
Llama 4 2T params should be impressive lol
Is 24k an alpha version of Behemoth?
enn
期待期待~
lmaoo too much, im abusing that free free 🙂
i suspect flannel is llama reasoning
whaat makes you say that?
it uses CoT most of the time, has the same personality as all other llama4 models and seems quite a bit better
very structured
guys. 24 karat gold had token limit. If you went long enough in chat, it would start to ouput empty answers. Also the last answer could end mid-sentence. Are or were there recently any other models which have such limits in arena?
we could pinpoint whose model it was that way
it's probably a model by not a very big developer. Otherewise there would be no limit.
i think token limit was around 50k
Google removed Gemini 2.0 flash thinking experimental from the app ...Something is cooking 😁...I don t think that they will stay with that Stupid flash 2.0
llama 4 seems to be worse than i expected
guys did they nerf gemini 2.5 pro ?? anyone else noticing the decline ? in the free verison
oh not this again 💀
please educate me
no need for insults please
Def not
Nah
Its pretty good
Maybe llama was experimenting with a model or osmething
You gotta talk to it nicely, whisper sweet nothings to it
What are you asking it?
It prob sensed your aura, gotta make sure you don’t have any negative thoughts
some electrical techincals , like calculating bend radius of a wire coming from the power supply to my house
I would meditate before promoting it
Use some zen music too
Send me your prompt, I’ll try it for you
behermoth ,拭目以待😅
(i wasnt expecting it to be good but it seems to be really bad lol)
and you cant even run the models locally on consumer hardware either. I hope qwen 3 next week is going to be a lot better than this
oh boy they are about to get destroyed lol. they released it beforehand so the sentiment isnt even worse than it is right now
qwen 3 has confirmed 15b moe, 8b, 0.6b, etc. so they are definitely releasing models people can run
I eventually tried the Llama 4 models on OpenRouter, and they're completely different than the one on Chatbot Arena, system prompt or not.
bad providers, broken inference, quantization, a combination of all of them i think
I don't think there's a problem with quantization or inference. The feel is completely different. They have definitely seen "safety alignment" that llama-4-maverick-03-26-experimental apparently hasn't. They're boring and less useful in comparison.
ive seen people complain about broken inference so idk
it will probably get better as things get fixed but qwen 3 will i think destroy it anyway
Llama4 maverick experimental feels very unique. It seems to be very self reflective compared to other models. It's also super corny and cringe.
i think riveroaks is behemoth
it seems like an improved version of themis/cybele
its also slow
2.5 no more think in one of my conv lol
riveroaks doest seem to hallucinate
Bug
but so far its pretty good
Flash thinking used to do that to me
lets see how it performs on frontend testing
its meh at coding
but still a solid model
Which AI is the best coder?
11
15
2
Gemini 2.5 Pro
2.5 pro still sucks in cursor unfortunately :/ no idea what I'm doing wrong but it's like it's cutting off too early
Llama 4 on LMsys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself
META did not do a specific deployment / system prompt just for LMsys, did they? 👀
be me
be zuck
need llama 4 to land
send a model/prompt to LMSYS to get a top1 score, cringe be damned
release a different model as "open source"
think people won't find out even with weights
lmao did they see the chaos engine system prompt for one of the llama models iirc 🤣
yeah but one of them had that
its probably fine tuned off of that style 🤣
it was funny but doesn't matter now
if they thought that was cringe, cant imagine their reaction to that prompt 🤣 🤣
this probably counts as gaming the arena tbh
i dont think anyone else has released an almost completely different model in style compared to the one in arena
and they were polluting the arena with trash models too
dumb repeating insufferable slow etc
mavrick on lmarena creative writing doesn't seem to reflect eq bench 0_0
the model in arena is unhinged, the released one is boring and sucks
diff model completely I suppose
i saw someone talk about this the other day
is it an API issue? IDE issue? model issue?
i want to try it but its not added yet on vscode copilot
btw
did Meta share any benchmark related to their 10M context?
yes haystack eval
but it doesnt mean it will be good at 10m context even if it can find a needle in a haystack
that's new?
lmao
lmao no
it has double context limit
it just appeared today
i think its the nightwhisper model
agi has arrived folks
give me some prompts
show us it generating something, then refresh the page and we'll see if it stays as the seleced model
record it
call me naive but i'm not passing this up
"write a python program that generates the most statistically average hypnogram; this should be run fast but also consider how hypnograms could be improbable in so many ways; visualize the output"
as others have said i would like to see a video (to view speed and if it thinks)
Lil my heart sank 😭
okay i will do it
i will gen the code and send it to you
Which country u live in bro!?
record what
UK
lol can someone ban him once we establish he is trolling?
Gotta be trolling since it's missing the "experimental MM-DD" at the end of the name
are all the models going to be experiemental
i'll wait
relax dude i had to install a recorder software
just use snipping tool
seriously guys
it can't possibly take this long buddy
@void elm record yourself refreshing the page to prove its real
isn’t it like 10mb max upload 😭
i forgot this server is boosted
Oh
easy to forget
gave this to a mystery model i found elsewhere that i think may be o3 or a checkpoint of it - https://gist.github.com/vuyp/eb69d90003f924f447d839da012c3a95
let me know if it's any good
Well if you’re trolling then i must have room temp iq bc you got me convinced
💀
cool but doesn't consider anything like distribution throughout the night or typical length of each cycle
how does it stack up to other llms?
well i've definitely gotten better results, although tbf the 0 shot result is typically bad
is it the same/better/worse than o1 and o3 mini orrr
@leaden palm updated the gist with the result after asking it to make it better
sorry guys, my keyboard stopped working and i had to reboot
@void elm login to studio on ur phone and record yourself using it
it was cut off by the last parenthese
omg i don't care if you dont believe me pissoff
oh just realised
cry harder lol
yeah
If it’s real, then it should work on your mobile device
better but kinda weird still (you don't normally wake up on the dot every cycle)
its explanation after the code was
Why this version is “better”
• Semi‑Markov + cycle template makes stage durations more realistic than a simple homogeneous Markov chain.
• Bout‑length distributions remove the unrealistically high number of ultra‑short stage fragments produced by memory‑less transitions.
• Explicit handling of REM latency, REM‑cycle count, and wake‑after‑sleep‑onset keeps obviously pathological traces out.
• Scoring is multi‑criterion and tunable.
• Still fast: the whole script (200 candidate nights) takes well under 0.1 s on an M1 MacBook or similar.
Feel free to:
• Increase N_CANDIDATES for an even closer match at the cost of runtime.
• Adjust TARGET_PROP or the within‑cycle share formulas for other populations (elderly: N3 ↓, WASO ↑).
• Replace the hand‑set BOUT_PARAMS with cohort‑specific empirical distributions if you have data.
yeah what is that
i kinda like it but it goes to n1 a lot and doesn't end with wake
worth noting that it gives a different one every run
if this model is o3, openai are still relatively behind in code
i've noticed the biggest improvements in reasoning and math
which were already o-series strong points mostly
let me also try quasar, o3 mini, and claude thinking for comparison
ty
what the quasar
it's not the best but better than the o3-like model's first attempt
for comparison this is claude
i presume claude's is the best 0-shot?
idk, it's more or less tied with the gemini-like model
is there any 0-shot attempt from a model that stood out or are they all similarly just okay
for models you can actually access, claude's is probably the best
it also handled follow ups better in testing
If this is accurate (it could error testing) it's such a joke for them to claim 10M context
not the only benchmark to show llama 4's terrible long context performance...
Context window is one area where everyone freaks out about the stated size without checking benchmarks. I'm very curious about quasar alpha. Will be really interesting to see how well openai's first attempt at 1M holds up on evals
good
quasar looks ok. Result here are for 128k
even if it works in a haystack scenario it really depends on what scenarios they trained high context in too. but the fiction live benchmark is much more representative i think than a haystack evaluation
they literally gamed the arena by sending an unhinged finetuned maverick and claim the results as if its the same open weighted model (not sure if they claim it, but it looks like that/how most interpret it at first glance i think)
no i think its somewhat likely
strong models are paired with strong models generally i think
Didn't realize it had been tested. I was expecting better given they're putting the window at 1M and that fact is going to get so much hype when they officially announce despite it being slightly worse than 4o
Omg I just got back no way my baby night whisper back?
Wym?
it will confuse both of them. i was just saying its not that unlikely to get a strong model paired with another strong model
i was just joking
hmm thats an interesting way to evaluate models
never tried to do that lol but i guess it can work
i think u can do that in ndirect chat
lol
it really depends on the task i think
gemini is the best all rounder but other models can be better in different scenarios
it is according to metadata i believe
is lunarcall thinking?
stargazer is a 2.5 model, probably 2.5 flash
Sonnet is easily second then it’s 3.1, then I would say grok and o3 mini, then 4o after, not sure about r1 tbh
Feels like Grok-3 has been in "preview" longer than any other LLM. Longer than o1 preview 😄
We ain’t never getting grok api
look at maverick weight
lol
fake
Ask who it is, I wanna know system prompt
ik
I’m outside rn
its from Meta
its the models released yesterday
its lama 4 bhemoth
Llama 4 is terrible
NO ITS NOT lil bro
its a disaster
I stand with deepSeek
Ik I just wanna know the system prompt they used for it
either deepseek r2 or o4 mini imo
I noticed the models have slightly different SP
It feels like being trained on messenger chat data 😄
deepseek r1 is shet so deepseek r2 is gonna be shet too
deepseek r3 then
@keen beacon who is that other you have hopes in?
Nothing will beat open source and Free and unlimited to use ... deepSeek r2 is the best .
Open source will have the tech that leading models have eventually, but Google will lead the way, there is no moat
u voted for behemoth lol
yeah
I meant best of the 2.5, which is likely PRO
im sill gonna use behemoth
Behemoth prob gonna be booty
i think we already have behemoth in the arena
On fb comment session
nah bro
Rip millions and millions of dollars
Meta lost credibility after this nonsense they pulled
proof
hes right u guys dont know how to use it
or divorce
its the best ai
they dropped it because qwen 3 was about to release
They should have just waited
it would be too embarrassing at that point
why it makes them look even worse if they wait
they moved the release to the weekend because they thought qwen 3 was dropping the same day or something
Bc nobody was thinking about meta
We were occupied by other models
And they could have made the models better
Meta could have released a small preview "chat" model intended for local users, similar to the one people have been playing with on LMArena. It might have not necessarily been the best all-around performing model, but with some tweaks users would have probably liked it. There's a general lack of professionally-made conversational finetunes dedicated to creative/recreational uses and that model could have filled that gap, while the "serious" ones could have been trained longer before their actual release.
they already delayed the models several times
but i do think it wouldve made sense to release it at llamacon
Releasing early during 2.5 and other models coming out is bad
Especially if you can’t deliver on what u promised
We can wait
But investors can’t
they chose to do that because their model would look even worse in comparison with newer models
lol
I don’t understand how their models can be so bad
Like aren’t they not training on outputs of competition?
It should be minimum at 3.5 level at this point tbh
But better than the previous one ,.
Is it better than 405?
I heard ppl say they got mixed results
Someone must have hacked their training data lol
Like I feel bad for them
nah theyre just massively incompetent compared to other frontier labs it seems
They need to hire us it seems
They still might be able to save themselves with the scout 10 mill context
nah it sucks lol
If they can make scout better
Then maybe one of their models might be usable, I don’t think behemoth coming out till summer lol
they might abandon it tbh
2trillion param model competing with frontier models that are a quarter of the size
How's that nobody believes in o3?
I don t trust Sam 🥸
I don’t think you should vote on belief
I put idk
Cause we don’t know lol
its fun to speculate tho
Of course we dont know, but we have opinions 😄
Suck or musk is not trustworthy even more 😄
Ha ha I remember playing with GPT 2 when it was out. It felt soo powerful 😄 Obviously the LLama4 is worse
When will grok 3 api release?
i dont find that fun lol, i just to test the new models, i can care less which model does best, like from which company, i have no dog in the race, but i think google will win in the end
like long term
fr tho benchmark wise gpt 4o mini might be worse (i dont remember it on hand though to compare) but i voted llama 4 anyway
near term who knows, but it seems google might still take it near as well lol
they got a lot of competition from 2.5 pro and NW
It is hard to decide 😂😂 all of them are Stupid
did anthropic abandon opus?
you must be fun at the parties
i dont go to parties lol
so i wouldnt know lmaoo
im am a present kinda person
worrying about the past and future is things out of my control
and lets me live stress free
Soonish Deepseek will release R2
this week it seems
By the way, it appears there are even newer Llama models with different codenames on Chatbot Arena right now.
They're still testing them, it seems.
are they any better or are they smaller versions
4o mini just about
I'm not sure; I just randomly submitted a joke question and a new model that I didn't recall seeing came out. It didn't stand out for anything in particular. Anyway, it's harley.
they need to pput behemoth in webdev lol
man i really want mav to be good lol
why would they even put it in webdev lmao
how often do these leaderboards get updated? like weekly?
I hope meta pays LMARENA well. Garbage recycling is a hard work.
I don't entirely agree that they gamed the benchmarks, but I do feel betrayed that in the end they released exceedingly large, boring models that do not perform as well as their size suggests.
Why would I (or anybody else) pick the response on the right here, for example?
[example redacted]
Many models on the arena are just like that: boring, stiff, stilted, no fun allowed.
Or if they do respond, they'll fill their text with GPTisms, so-called "slop" and excessive disclaimers.
"It's important to note that...", "As an AI...", "It's crucial to remember..." and so on. 🥱
Do you be jailbreaking models?
I'm not jailbreaking models on the Arena.
But in general you do?
Like in other platforms or locally?
Damn 6 months for Claude 4
I use the most appropriate prompting for my local models (e.g. Gemma 3) to do what I want, but it's not really "jailbreaking"—no magical incantations or special sequences involved, only complete instructions. If the models do not behave despite putting reasonable prompting effort, then I'll use something else.
bruh we gonna have gemini 69 and gpt 50 already by then
Why so slow?
it's the most boring release ever lol
the only thing they have going for it is the context size as far as I'm concerned
why this model is at #2 😦
The model(s) on the Arena are not the same ones they released.
I think you underestimate the effort it took training those models
I don't see a correlation. Those models' writing style is boring and stilted by design.
Did you see the ctx benchmarks lol. It can do haystack but it falls apart in actual usage even in super low context
Dude not just that
Its shet
Its not just "boring" or "normal"
Its actualy horrible
For formal language and business use case this is appreciated. They make money with businesses
Grok 3
Use the llama chaos engine system prompt on 2.5 pro lmao
Gimme it
I wouldn't go that far. It is not horrible, it's comparable to deepseek V3.1 and gpt4o (the boring part). It's super consistent and reliable on easier prompts but the problem is it is not special in any way, and much more less likely to surprise you than those models as it struggles with harder or novel tasks
This kind of context performance is atrocious
so while other models are more likely to fail the "easy" tasks, this is just boring and it's gonna fail if you suspect it will 
LMAOO
I like this but its not really what im looking for tho
Idk how to explain it
Gemini 2.5 pro no matter how much you tell it to be informal, act human or whatever
It doesnt do it properly
Same for sonnet 3.7 and openAI models
a "safe" choice but we do not really need that in this market and with those model sizes imo
I think the maverick model was fine tuned on a models traces that had this system prompt
I liked 24 karat gold and llama 4 at the beginning because of those reasons
Then after I used them more
It just sucked
Its too inconsistent hallucinates to much etc
Ya llama 4 sux
lmao. They even did NiH and it was supposedly good
They didn't train past 256k (iirc) so besides needle in a haystack using that kind of context is even more of a disaster than it is
nahh
its not reliable at all bro i swear
it gives false info on even the most basic tasks
even when it knows the right answer
when i correct it and say its wrong it gives the right answer
it's not impressive for sure, but we can't say it's sht either let's not get ahead of ourselves. 
None of those evals would have been possible otherwise. Yeah it's a disappointment given the other models that we have, but it still performs somewhat decently when all said and done. I do get why this is the impression though, that's why I said it's boring - it is unlikely to surprise you.
yeah
its alr
Llama Behemoth will be great
how do yk
Its their gem
It will be slow and suck
In comparison to other frontier models that are probably less than a quarter of the size
that doesnt eman anything
it might be good it might be shet
wwe'll see
ima use gemini 2.5 pro for now and experiment some prompts to make it less boring
I do hope it gets coding capabilities soon
cuz i cant deby its really good buts its just that AI writing part i dont lik
gemini?
Yes
it already has it
Its bad
Please see benchmarks
Claude
i thou ght the benchmarks said coding was gemini
no way
how many tokens are they training it for?
also the fact that scout performs still considerably worse than maverick with double the data for pretraining kinda suggests to me 17B active is their limiting factor. Maverick performance on smaller model should be absolutely possible
why did they even release mvaerick / scout
why not
Tbh I think something is wrong with their specific Moe architecture or something. I dunno, not interested lol. Considering qwen 3 will be dropping a 15b total param Moe (and usually they only drop one, so I think it'll be good) and be very competitive for it's size/maybe even better than the current llama models
Not sure am on my phone
its kinda sht
I was somehow thinking that it was similar to deepseek, but actually... Deepseek is 37B active parameters so considerably more. What the f meta was thinking LOL
tbh I don't think it's the active params
Something is just really messed up
I think it is. 17b active parameters is not enough for truly difficult tasks
so the end result is what we have now. They train it for so long it perfects what it can do
but still not enough capacity for very challenging prompts
Qwen 3 Moe will probably be better or maybe highly competitive and it's 15b and 2b active
They only release one I think so they're confident
lmao no chance if it's not reasoning
Why not gpt 4o mini is probably comparable
Releasing a gpt 4o mini level 15b Moe is more than good enough
mini is much bigger pretty sure
2b active is basically as small as it gets
way too little if you ask me
Ya qwen made bigger Moe's but they shrank it this time (assuming they only do one) so maybe something interesting is happening there
I would expect this model to like compete with gemma, not maverick
or qwen-7b... honestly they should update that one instead
You think a dense 7b/8b could beat maverick?
Tbh it is possible
not maverick lmao, it can beat MoE with 2b active
Nah I doubt
15b Moe will likely be better than the 8b
disagree
I think active param count is more important than total size
so if it's only 2b... it's just compromised
They are releasing both based on the transformers pr that should be interesting
I guess the qwen 3 32b equivalent will be most interesting
It will beat maverick I think
Given how good 2.5 32b is
Weird how they decided to shrink it though. But curious what their large Moe model (qwen max) is
you can't have free launch like that, given that training MoE takes roughly the same amount of compute and time as dense model equal to active parameters. If it was the case all models would have been 2T+ with less than 10b active lol
Well clearly the qwen team found it appropriate if they're continuing the trend of releasing just 1 moe
What's the absolute best model rn? Is it nigjtwjisper?
It's gone
Someone let me know how does nightwhisper fare against o1
they are not focusing on just MoE though. I think it's extremely unlikely this tiny MoE will come even close to QWQ 32b dense
Yes which is gemini
Yeah but they won't waste resources on an Moe that will just suck. Their previous Moe releases were much larger in active params count
Clearly they think it's something worth giving a shot/considerable chance of working out
Too good its uncomparable
Nightwhisper is godly
They removed it lol
Its gemini toh
Tho
they are probably just focusing on mobile/laptop applications with it. Memory is not a huge issue at these sizes but speed can be. So MoE may make more sense
Maybe but I think the 15b Moe will be the most interesting thing to come out of all of this
Is there a nughtwhusoer like model
No
😭
i think it may possible saying that maverick and scope are early checkpoint distilled versions of behemoth
It would seem Llama 4 was running in "Volkswagen emission testing mode" in the Arena to get the #2 spot
they're distillations of behemoth
lmao
behemoth is the teacher model
We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics. We developed a novel distillation loss function that dynamically weights the soft and hard targets through training. Codistillation from Llama 4 Behemoth during pre-training amortizes the computational cost of resource-intensive forward passes needed to compute the targets for distillation for the majority of the training data used in student training. For additional new data incorporated in student training, we ran forward passes on the Behemoth model to create distillation targets.
not true. Scout was pretrained on 40T and Maverick on 22T. Therefore different pretrained models and not distills
200k
this almost feels like you're asking for someone to say claude
thanks for letting me know i was wrong on something that's already been established 👍 simple error
im saying that because afaik claude is the only model with a 200k context window
but yes i think that claude is a pretty alright model, even without thinking
ok but oddly specific
your own benchmarks >> anything else
ccodistillation is quite a bit different from normal distillation
instead of unconditional teacher-student you have both models updating gradients and learning from one another 👀
other humans and benchmarks are unlikely to have tested text adventures
you are welcome 🫃
it seems they essentially optimized the training of several models at the same time to make sure the same work is not getting repeated as much, is the extent of that "distillation". But it's not in the context what most people think when they see the word "distillation". Those 3 models are still different with varying amounts of total data used for pretraining
series of finetunes most well known for mistral/mixtral tunes
i don't dabble much in temperature and personally think that people care too much about it
temperature 0 = follows most likely path, temperature 1 = chooses paths with proportional probability, and you can imagine how to interpolate
when i run llms i typically use a temperature of 0.7
then
fun fact: temperature can actually go past 2, the limit of 2 is an artificial limitation
back when i was trying to get llm text to sound human i found that 1.05 worked when doing free generation but you needed like 1.3 when rewriting ai text
Sigh.... Finally figured I should test Llama 4 Maverick too, and I too can confirm that it's somewhere between utterly disappointing and 'meh fine'. It scored a 3 out of 5 on my mini-benchmark, failing on both associative logic and on grounded memory recall (it had more hallucinated 'facts' than any model I've ever tested, it's bananas). It was less bad at coding than I expected, maybe a decent 'second tier' coding model, but nowhere near Claude 3.5/3.7 Sonnet or Gemini Pro 2.5.
i think they were asking "aside from 2.5 pro"
And im saying theres only 2.5 pro
But he didnt say aside from 2.5 pro he asked if it was only 2.5 pro
Does anyone have a prompt to make 2.5 pro sound human
Its too robotic
And i cant prompt it to sound human
What system prompt should I use
Yeah
The temperature doesnt do anything
Unless I changed the top p to 1
Then it becomes so random
And nonsnesical
hmm
lemme try it after i try this one i found on reddit
im ean
i tried sending it to it ddirectly
but
it talks too informally
like a 12 year old
not that but
i told it like
act human and stuff
are you trying to get it to talk like this or do you want to get past ai detectors
two very different goals
Not the ai detecotrs
💀 wtf
i just want it ta talk
to talk
ok
imat ry
so it works less better if i tell it to it directly
What are your experiences / reviews of riveroaks?
its mid
BEST MODEL IS crazy
its not good at coding
ive tried it
whereas its good at general text
i believe its llama 4 behemoth
reminds me a lot of themis/cybele
its kinda like gemini 2.0 pro
could be
whats a sand boc adventure
wtf
like do wat u want in what
i dont get it
what u gonna do with the AI
OHHH
nice
like a roleplay
yeah
what ai do u do it with
thats the best for it
thanks ima try it
did u invent this
sandbox adventures idea
i never heard of it before
: O
that's creative
ima try it today it seems really fun
give prompt
ye @hollow ivy
i need them to bring back NW to webdev
its genius
thanks its really cool
ur a very creative person
iim using it
eexactly
i was about to say that
u never thought u would have stuff like this for example 10 years ago
and at the rate these stuff is progresing u dont know whether to be scared or excited
yeah its very crazy to think about
LOL ye
im mainly excited ngl
cuz when they get really good
like in a year or two from now
we can do insane things with them
if they become sentient and try to take our jobs and overtake humanity or something 🤣
hmm
yeah definitely
like at the rate of how fast they're growing rn they might become strong enough to replace human workers
I see now both maverick and crystal have token limit in chat. So 24 karat was definitely Meta model.
24 karat is meta but we dont know which meta model it is
definitely not scout or maverick tho
the responses are quite similiar
its prob some kind of testing experiment by lamma
nah
they are kinda similar but
24 karat was WAYY more creative
and hallucinated way more
😮
may be the same model but different settings like temperature or diff system prompts?
that might happen soon
when do u think it AI might become cappable enough to do such a thing
probably yeah
maverick or scout with a different system prompt and higher temperature or something
i mean if u think about it
it hasnt even been 5 years since the first chatgpt model released
so it might even become way stronger than that in 5 years
I red in the news microsoft released today quake 2 web version, generated by a model in real time.
you can play 1 level
yeah even more than that
whats that
hmnmm
here is the quake model
😮
thats exactly what paws was talking about to
I honestly suspect 24_Karat_Gold was Llama-4-Maverick-03-26-Experimental, but now they defanged the model big time
2400 votes I assume were from us, the public right
Maybe they didn't change it on the fly though... idk
if its a differnt system prompt
it might be maverick
Okk
exactly
makes sense
it might be just maverick with like higher temperature or a different system prompt
Yeah
it also constnatly said it was from llama
yup, which i believe
you guys have no idea how powerful that nightwhisper model is
i mean combined with google perfect vision
you can clone websites in seconds accurately
I know and i still mean what i said
why would i care about 9.11 > 9.81
or how much r's are in strawberry
Shows model reasoning
You still don’t understand I was ironic