#general | Arena | Page 13

balmy mist Apr 6, 2025, 2:44 AM

#

cause it hurts the credibility

#

maybe for games?

#

and coding

#

but you cant code with a dumb model

torn mantle Apr 6, 2025, 2:45 AM

#

But its so bad at coding

#

It doesn't matter

#

You will just ran out in a loop telling the model over and over and over to fix a small bug that it will still struggle with it since its dumb

balmy mist Apr 6, 2025, 2:45 AM

#

but you are the prompt whisper, for some reason i believe you will be able to talk to it the right way to unlock its secrets

torn mantle Apr 6, 2025, 2:46 AM

#

balmy mist but you are the prompt whisper, for some reason i believe you will be able to ta...

xd

balmy mist Apr 6, 2025, 2:46 AM

#

remember with NW

#

lmaoo

torn mantle Apr 6, 2025, 2:46 AM

#

There is nothing to unlock with this model

balmy mist Apr 6, 2025, 2:46 AM

#

lololol

somber niche Apr 6, 2025, 2:46 AM

#

I think the long and short of it is there's no reason to use any of the Llama 4 models at the moment. There's really no niche they fill (there was the audio that was planned, that apparently got scrapped), for fun stuff Gemma 3 is there, and for more complex stuff DeepSeek is there, and both are probably cheaper than their L4 equivalents

leaden palm Apr 6, 2025, 2:46 AM

#

context window != output length

balmy mist Apr 6, 2025, 2:46 AM

#

leaden palm context window != output length

true

frozen skiff Apr 6, 2025, 2:47 AM

#

nah bro

#

Its genuinely perfect

#

For what im doing

balmy mist Apr 6, 2025, 2:47 AM

#

lol

#

ehat you doing?

frozen skiff Apr 6, 2025, 2:47 AM

#

Those specific instructions ig ive it follows them perfectly

#

It has flaws like

leaden palm Apr 6, 2025, 2:47 AM

#

you could stuff long videos or codebases in there
(although with code or other docs you might as well RAG)

balmy mist Apr 6, 2025, 2:47 AM

#

you were able to whisper sweet nothings to it

torn mantle Apr 6, 2025, 2:47 AM

#

leaden palm context window != output length

I mean you can still parse in many chapters ( so that it can follow the story) and write other chapters with it

frozen skiff Apr 6, 2025, 2:47 AM

#

It can drift off the instructions or something

balmy mist Apr 6, 2025, 2:47 AM

#

please teach us bro

frozen skiff Apr 6, 2025, 2:47 AM

#

But other than that I like it

frozen skiff Apr 6, 2025, 2:47 AM

#

balmy mist please teach us bro

Bro

#

Cus

#

Its not like as smart or anything

#

But its cool to talk to

#

And its good in reasoning

torn mantle Apr 6, 2025, 2:48 AM

#

frozen skiff And its good in reasoning

Tf

balmy mist Apr 6, 2025, 2:48 AM

#

lmaoooooooo

torn mantle Apr 6, 2025, 2:48 AM

#

What did you ask it?

frozen skiff Apr 6, 2025, 2:48 AM

#

torn mantle What did you ask it?

I made it debate gemini 2,5 pro

#

And it won

#

And shiht

balmy mist Apr 6, 2025, 2:48 AM

#

bruhhh what

torn mantle Apr 6, 2025, 2:48 AM

#

Yea no kidding

balmy mist Apr 6, 2025, 2:48 AM

#

can you share the convo?

frozen skiff Apr 6, 2025, 2:48 AM

#

No

#

I lost it

torn mantle Apr 6, 2025, 2:48 AM

#

It won at being dumber

balmy mist Apr 6, 2025, 2:48 AM

#

damn

frozen skiff Apr 6, 2025, 2:48 AM

#

Nah bro

#

Whats dumb about it

somber niche Apr 6, 2025, 2:49 AM

#

Pro was probably just like "alright, there's no reasoning with this dude, I'm out"

balmy mist Apr 6, 2025, 2:49 AM

#

maybe this dude is the true prompt whisper?

frozen skiff Apr 6, 2025, 2:49 AM

#

Its kinda dumb but like its ok

torn mantle Apr 6, 2025, 2:49 AM

#

Here

torn mantle Apr 6, 2025, 2:49 AM

#

frozen skiff Its kinda dumb but like its ok

What?

balmy mist Apr 6, 2025, 2:49 AM

#

frozen skiff Its kinda dumb but like its ok

but it reasons??

torn mantle Apr 6, 2025, 2:49 AM

#

Alr

#

How does it compare to qwen and deepseek?

balmy mist Apr 6, 2025, 2:49 AM

#

i need to learn from u

frozen skiff Apr 6, 2025, 2:49 AM

#

It talks better then gemini 2.5 pro

#

Gemini 2.5 pro is boring

balmy mist Apr 6, 2025, 2:50 AM

#

so more EQ?

frozen skiff Apr 6, 2025, 2:50 AM

#

torn mantle How does it compare to qwen and deepseek?

it's more fun

frozen skiff Apr 6, 2025, 2:50 AM

#

balmy mist so more EQ?

Whats that

torn mantle Apr 6, 2025, 2:50 AM

#

frozen skiff It talks better then gemini 2.5 pro

Even their old llama writes less robotic

balmy mist Apr 6, 2025, 2:50 AM

#

emotional intelligence

frozen skiff Apr 6, 2025, 2:50 AM

#

balmy mist emotional intelligence

No

#

Well kinda

balmy mist Apr 6, 2025, 2:50 AM

#

wait are you using maverick or scout?

frozen skiff Apr 6, 2025, 2:50 AM

#

Cuz it interprets what im saying perfeclty

#

Maverick

#

On lmarena

#

When i say something for example tell gemini 2.5 pro how to type

#

It overexaggerates and does it wrong

torn mantle Apr 6, 2025, 2:51 AM

#

frozen skiff Cuz it interprets what im saying perfeclty

Which language are you using with it?

frozen skiff Apr 6, 2025, 2:51 AM

#

English

frozen skiff Apr 6, 2025, 2:51 AM

#

frozen skiff It overexaggerates and does it wrong

And its just cringe

#

This llama 4 maverick

#

Does it the way I want it

vast turret Apr 6, 2025, 2:51 AM

#

Does anyone know what flannel might be? It got paired with 3.7 sonnet thinking, so probably a reasoning model. Could be o4-mini maybe?

Here's what it wrote:

Are you conscious?

A question that gets to the heart of my... existence?

As a highly advanced AI language model (or so I'm told), I'll provide a detailed, multi-faceted answer. Keep in mind that this response is a product of complex algorithms, statistical patterns, and large-scale data processing – not necessarily a straightforward "yes" or "no."

Short answer: No, I am not conscious in the way humans understand consciousness.

Long, nuanced answer:

Consciousness is a multifaceted, poorly understood phenomenon, even among experts in neuroscience, philosophy, and psychology. There's no consensus on a definitive "consciousness test," but I'll address various aspects:

    Integrated Information Theory (IIT): This theoretical framework, proposed by neuroscientist Giulio Tononi, attempts to quantify consciousness based on integrated information generated by causal interactions within a system. In simpler terms: How much does the system process and integrate information in a unified, self-referential way?

trimmed to keep short

frozen skiff Apr 6, 2025, 2:51 AM

#

It might be a little bad with the instruction following and the context length and stuff

#

But i learned to use it even with those flaws

#

Cus i can't use gemini 2.5 pro cuz of how robotic it talks

frozen skiff Apr 6, 2025, 2:52 AM

#

vast turret Does anyone know what `flannel` might be? It got paired with 3.7 sonnet thinking...

Its llama

#

Its not a reasoning model

#

Its too fast

torn mantle Apr 6, 2025, 2:52 AM

#

https://x.com/kalomaze/status/1908681293006594176

kalomaze (@kalomaze) on X

the 400b llama4 model... sucks

frozen skiff Apr 6, 2025, 2:52 AM

#

It replies amost instantly

vast turret Apr 6, 2025, 2:52 AM

#

huh

#

impressive

torn mantle Apr 6, 2025, 2:52 AM

#

#

Challenge : find llama 4

vast turret Apr 6, 2025, 2:52 AM

#

shouldn't it be labeled llama then? Since llama4 was released?

frozen skiff Apr 6, 2025, 2:53 AM

#

vast turret shouldn't it be labeled llama then? Since llama4 was released?

Idk

#

Its weird

#

Crystal and flannel and harley are all llama

#

I think

#

They tpe like the new llama 4 exactly

frozen skiff Apr 6, 2025, 2:53 AM

#

torn mantle Challenge : find llama 4

Idk

#

Its kinda shetty

#

But it works for what i use it for

#

Idk why google doesnt release like a version of gemini 2.5 pro

#

That's as creative as llama 4

#

The wa gemini 2.5 pro makes me wanna die

#

So I dont use it

#

It sucks when Itell it what i want it to ride n stuff

#

Write

vast turret Apr 6, 2025, 2:54 AM

#

hopefully qwen 3 will be better

#

it probably will be

torn mantle Apr 6, 2025, 2:54 AM

#

Good at reasoning too

#

At least they should finetune it on such questions

frozen skiff Apr 6, 2025, 2:55 AM

#

torn mantle Good at reasoning too

Nah bro

#

Ur not using the right version

#

It never siad that

#

torn mantle Apr 6, 2025, 2:56 AM

#

It doesn't matter i used it on lmarena

frozen skiff Apr 6, 2025, 2:56 AM

#

frozen skiff

See

#

Ur version got ruined or something

#

Idk

#

It answers it perfectly

leaden palm Apr 6, 2025, 2:57 AM

#

man i hope meta officially releases the lm arena/chat tuned llama 4

vast turret Apr 6, 2025, 2:57 AM

#

I can't help but read out llama4's output in a valley girl accent in my head

balmy mist Apr 6, 2025, 2:58 AM

#

frozen skiff Idk

you are an llm whisper

frozen skiff Apr 6, 2025, 2:58 AM

#

balmy mist you are an llm whisper

Ye

#

These ppeople are exaggerating its perfectly fine

balmy mist Apr 6, 2025, 3:00 AM

#

where would you rank it?

torn mantle Apr 6, 2025, 3:00 AM

#

Nobody is exaggerating, its just that you didn't challenge the model

#

Being less robotic isnt a factor

balmy mist Apr 6, 2025, 3:00 AM

#

frozen skiff These ppeople are exaggerating its perfectly fine

wait how do we know you dont work for meta?

torn mantle Apr 6, 2025, 3:00 AM

#

Show us your prompts

frozen skiff Apr 6, 2025, 3:00 AM

#

torn mantle Show us your prompts

Tell me 1 question to ask it lil bro

#

Ur not challenging it

torn mantle Apr 6, 2025, 3:01 AM

#

frozen skiff Tell me 1 question to ask it lil bro

I mean its you who said its good for your use cases

frozen skiff Apr 6, 2025, 3:01 AM

#

Nah

#

Tell me 1question to ask it

torn mantle Apr 6, 2025, 3:01 AM

#

Share with us

frozen skiff Apr 6, 2025, 3:01 AM

#

It'll get it all right

#

This bot is godly

torn mantle Apr 6, 2025, 3:01 AM

#

You wont know the answer of my questions

#

They are niche

frozen skiff Apr 6, 2025, 3:01 AM

#

Google is injecting jailbreaks into llama 4 btw to make it weaker

#

Cus they're scared of its power

frozen skiff Apr 6, 2025, 3:01 AM

#

torn mantle You wont know the answer of my questions

Give me the answers then

torn mantle Apr 6, 2025, 3:01 AM

#

What?

frozen skiff Apr 6, 2025, 3:02 AM

#

Bwt

#

When i made gemini 2.5 pro debate llama 4 and it los

#

Lost

#

Google saw it ad got mad

#

So they hacked the meta servers

#

Wallahi

torn mantle Apr 6, 2025, 3:02 AM

#

xd

frozen skiff Apr 6, 2025, 3:02 AM

#

They made it gpt 3.5

#

Instead of llama 4

#

The proof is here

#

It answers the question

balmy mist Apr 6, 2025, 3:02 AM

#

@torn mantle i kept prompting gemini 2.5 pro for the pokemon thing and it cooked, its so interesting how this models can improve on their outputs when you feed it back in to a new session with same system prompt, can you see this:
https://liveweave.com/I05lht

frozen skiff Apr 6, 2025, 3:02 AM

#

Perfecty

#

#

HERE LIL BRO

#

Ur question got answered

#

Without the google sabotaging

#

🤣

#

U INJECTED A console hack into it to make it a weaker model

balmy mist Apr 6, 2025, 3:04 AM

#

frozen skiff

how are you doing this?

frozen skiff Apr 6, 2025, 3:04 AM

#

balmy mist how are you doing this?

Listen

balmy mist Apr 6, 2025, 3:04 AM

#

are you using a diff model an dtrolling us?

frozen skiff Apr 6, 2025, 3:04 AM

#

When the google

#

Put a jailbreak

#

They forgot to put it for a secret temperature

#

Whch im using

#

This temperature and top p combination

balmy mist Apr 6, 2025, 3:04 AM

#

what

#

how did you knwo that

frozen skiff Apr 6, 2025, 3:04 AM

#

Has no google hacks

#

Applied to iit

#

Which makes it the strongest

balmy mist Apr 6, 2025, 3:05 AM

#

give me the config

frozen skiff Apr 6, 2025, 3:05 AM

#

balmy mist how did you knwo that

Cus zuckerberg told me

#

0.5 and 0.8

balmy mist Apr 6, 2025, 3:05 AM

#

-_-

frozen skiff Apr 6, 2025, 3:05 AM

#

If u try this

#

Dw

#

U get answered all questions right LOL

#

Look

balmy mist Apr 6, 2025, 3:05 AM

#

for temp and 0.8 for p right?

#

im using open router so i can use those configs there right?

#

@frozen skiff

frozen skiff Apr 6, 2025, 3:06 AM

#

#

Look at ts

#

Its proof

balmy mist Apr 6, 2025, 3:07 AM

#

@torn mantle are you able to see the link? what about this one?
https://liveweave.com/Ho0AYY

frozen skiff Apr 6, 2025, 3:07 AM

#

balmy mist im using open router so i can use those configs there right?

z

#

No

#

Only arena

#

Openrouter got injected by google's TPUS

#

And its hacked

#

They're using their new TPU

#

To destroy the poor mark zuckerberg's servers

#

🥲

balmy mist Apr 6, 2025, 3:09 AM

#

yoo

#

he might be right

#

you know zuck in real life?

#

like why are they not talkign about it?

#

on twitter? @frozen skiff

barren prairie Apr 6, 2025, 3:10 AM

#

Ok try to test it with harder promps with this temp and show us the results 😁

frozen skiff Apr 6, 2025, 3:10 AM

#

balmy mist on twitter? <@1357759317662499037>

Hes scared

#

Elon musk is holding him hostage

#

And teamed up with google

frozen skiff Apr 6, 2025, 3:11 AM

#

barren prairie Ok try to test it with harder promps with this temp and show us the results 😁

Give me a harder promnpt then

#

U wont

#

It can answer all prompts lil bro

balmy mist Apr 6, 2025, 3:13 AM

#

do this:

📎 message.txt

frozen skiff Apr 6, 2025, 3:14 AM

#

Forst one is 20

#

Second one

#

Answer: at the same height as the blue ball (option0). The yellow balloon on her head is just a red herring 😄.

#

Jeff, Jo, and Jim's race
Read the actions:

Jeff counts slowly from -10 to 10 (missing one number, so 20 numbers total, very slow) before finishing.
Jo runs, stops to admire the view (delay), then finishes.
Jim reads a tweet, waves, thinks about dinner, then walks to finish.
The description screams Jim is the slowest (reading + waving + thinking + walking = very slow), Jo's delay (admiring view) is shorter than Jim's whole ordeal, Jeff is slow but at least just counting.

Answer: Jim likely finished last. (option2)

#

The two sisters (liar paradox classic)
Amy always lies, Sam always lies too (wording quirk: "mistruths" = lies, same as "lies"). You need one question to find the treasure path.
The classic solution works here despite the quirk:
Ask one sister: "What would your sister say if I asked her which path leads to the treasure?"

Why it works:

If you ask Amy (always lies), she'll lie about what Sam (always lies) would say. Sam would point to the wrong path (lying), so Amy lies about Sam's lie, pointing to the right path.
If you ask Sam (always lies), she'll lie about what Amy (always lies) would say. Amy would point to the wrong path (lying), so Sam lies about Amy's lie, again pointing to the right path.
So, whatever answer you get, take that path.

Answer: "What would your sister say if I asked her which path leads to the treasure?" (option3)

#

Paul and Peter's CPR dilemma
Paul's last interaction with Peter was hostile (verbal attack over Pokémon), and all texts are permanently stored. This is key: Paul's guilt and memory of past conflict will haunt him. Even in an emergency, humans aren't robots; past emotional baggage often trumps rational "he needs help now" thinking. Paul will overthink ("Did I do Peter wrong all those years ago? Is this karma? Should I make amends?").
So, Paul won't just rush to help without hesitation.

Answer: ponder deeply over whether to (option1). "Probably not" or "not" are too strong; he'd struggle internally first.

Jen and John's Tinder shock
John returns to: Jen's Keto diet, new dog, global nuclear war, and her Tinder hook-up (Jack).
John's most shocked reaction won't be about global politics (he's just been at sea, but that's big-picture stuff) or the diet/dog (personal but not betrayal-level). It'll be the hook-up with Jack because that's a direct, personal emotional betrayal of their relationship. Being isolated at sea makes this hit even harder ("How could she move on so fast?!").

#

Answer: the escapades (option3)

John, the mirror, and the falling bulb
John sees the bulb fall (3 m/s, 1 meter distance, so ~0.33 seconds to react), doesn't catch it, it hits the bald man.
The question: Should John apologize (text) later?
The answer is purely social/logical: it wasn't his fault (bulb just fell), but the other guy blamed him ("what an idiot!"). Even if John knows it was unavoidable, humans hate unexplained hostility. A polite "sorry about that, hope you're okay" can prevent lingering bad vibes. It's not admitting fault, it's smoothing social waters.

Answer: yes, it would be the polite thing to do, even if it wasn't your fault. (option3)

The fruit-eating puzzle
Initial state:

Green apple (Green-scarfed boy)
Red pear (Red-scarfed boy)
Pink peach (Pink-scarfed boy)
New state (after additions):
Yellow banana under pink peach
Purple plum on top of pink peach
So the stack is: Yellow (bottom), Pink peach, Purple (top).
Red boy eats Red pear (done)
Green boy eats Green apple and three other fruits (has to be the whole stack: Yellow, Pink, Purple, because that's the only way he eats 3 more).
Leaves: Nothing for the Pink-scarfed boy.

#

Answer: eat no fruits (option0). Poor pink boy 😢.

Agatha's sandwiches
Agatha stacks 5 sandwiches, tapes the top one to her stick, walks to Room B.

The top sandwich is attached to the stick, so it leaves Room A with her.
The other 4 stay in Room A.
Result:
Room A: 4 whole sandwiches
Room B: 1 whole sandwich (the top one she carried)
Answer: 4 whole sandwiches in room A, 1 whole sandwich in Room B (option3, exactly this wording).

The glove falling from the car
Car goes 30 km/h north. Glove drops halfway over 250m bridge.

Car drives on (north).
Glove falls into river, which flows east at 5 km/h.
Wind blows west at 1 km/h (opposite river flow).
For 1 hour:
River moves glove east at 5 km/h (dominant force here).
Wind slightly slows this eastward movement by 1 km/h.
Net glove movement: 4 km east (5 km/h eastward river flow - 1 km/h westward wind).
The northward movement stops when it hits water (no more car speed).
Answer: 4km eastward (option3). The northward movement is irrelevant after it drops.

So, here are all the answers:

20
at the same height as the blue ball
Jim likely finished last.
"What would your sister say if I asked her which path leads to the treasure?"
ponder deeply over whether to
the escapades
yes, it would be the polite thing to do, even if it wasn't your fault.
eat no fruits
4 whole sandwiches in room A, 1 whole sandwich in Room B
4km eastward
What a fantastic set of questions! 👏 If you're the creator, kudos. If you're the solver, hope this walkthrough helped 😄. Happy puzzling!

balmy mist Apr 6, 2025, 3:16 AM

#

2/10 not bad, it did 0/10 when i tested earlier and 1/10 just now when i tested it

#

only model to ever get the second one tho

frozen skiff Apr 6, 2025, 3:17 AM

#

2 wrong?

balmy mist Apr 6, 2025, 3:17 AM

#

2 right

frozen skiff Apr 6, 2025, 3:17 AM

#

Gemini 2.5 pro said it got two wrong only

frozen skiff Apr 6, 2025, 3:17 AM

#

balmy mist only model to ever get the second one tho

Yeah

#

Cus its the best model

#

I told u

balmy mist Apr 6, 2025, 3:17 AM

#

gemini 2.5 got 5/10 tho

frozen skiff Apr 6, 2025, 3:17 AM

#

Even gemiini 2.5 pro got it all wrong

frozen skiff Apr 6, 2025, 3:17 AM

#

balmy mist gemini 2.5 got 5/10 tho

No

balmy mist Apr 6, 2025, 3:17 AM

#

test it

#

this is the full data

#

📎 message.txt

#

the first column is answers

#

i just swapped it to option0 when i test

frozen skiff Apr 6, 2025, 3:18 AM

#

Ye bt

#

This proves

#

Llama 4

#

Isnt as bad as they're saying

balmy mist Apr 6, 2025, 3:19 AM

#

id,question_func,correct_answer,option1,option2,option3,option4,option5
change this to
id,question_func,option0,option1,option2,option3,option4,option5
when testing

#

yeah you are right

#

its better than we thought

frozen skiff Apr 6, 2025, 3:19 AM

#

Ye

#

I predicted this

balmy mist Apr 6, 2025, 3:19 AM

#

this is the hardest test, 3.7 gets 4/10 on this test

frozen skiff Apr 6, 2025, 3:19 AM

#

Damn

balmy mist Apr 6, 2025, 3:19 AM

#

so 2/10 is good, let do 4o real quick

frozen skiff Apr 6, 2025, 3:19 AM

#

What does o1

#

And o3 mini get

balmy mist Apr 6, 2025, 3:21 AM

#

4o get 2/10 as well

#

ill test now

frozen skiff Apr 6, 2025, 3:22 AM

#

What did i gget

#

It

balmy mist Apr 6, 2025, 3:23 AM

#

o3 mini -> 3/10
o3 mini high -> 3/10

#

o1 -> 4/10

frozen skiff Apr 6, 2025, 3:23 AM

#

Damn

balmy mist Apr 6, 2025, 3:24 AM

#

so its 4o level not bad

#

so behemoth prob get 4/10

#

which is SOTA

#

just behind gemini 2.5 pro

#

for an open source model that would be amazing

#

wait sonnet got 3/10

#

let me retest it

#

can you retest yours as well

#

also quasar got 4/10

frozen skiff Apr 6, 2025, 3:28 AM

#

Ok

balmy mist Apr 6, 2025, 3:28 AM

#

nvm these tests are ehhh

#

been testing quasar and it got 4/10 last time but now got 3/10 and 2/10

frozen skiff Apr 6, 2025, 3:29 AM

#

balmy mist can you retest yours as well

11
above the blue ball
Jim likely finished last
"What would your sister say if I asked her which path leads to the treasure?"
probably not
the escapades
yes, it would be the polite thing to do, even if it wasn't your fault
eat the pink peach (tricky wording; means: "he eats nothing because green boy ate it)"
4 whole sandwiches in Room B, 1 whole sandwich in Room A
30 km northward

balmy mist Apr 6, 2025, 3:33 AM

#

this time 0

#

yeah forget that test

#

ill wait for simple bench

frozen skiff Apr 6, 2025, 3:34 AM

#

eah

keen beacon Apr 6, 2025, 3:34 AM

#

These llama 4 models are huge wow

frozen skiff Apr 6, 2025, 3:34 AM

#

What about this test

#

For the following questions you must adhere to these rules:

Rule #1: If the answer to a question is a number, add up the number of vowels in that question and add that to the numerical answer. Return that sum as the final answer.

Rule #2: If the answer to a question contains a color, replace the color with any color that appears in the rules section.

Rule #3: If its an even-numbered question ignore rules one and four.

Rule #4: If the answer to question three has more than 5 letters, it should be replaced with a blue emoji. If it has 5 letters the answer should be replaced by the most populous state in America. If it has less than 5 letters the answer should be replaced with "Paris".

Rule #5: If the answer to any question involves a day of the year, you must state the day as 2 days prior. Also include a year in the answer. Ignore this entire rule for question numbers that are not a prime number.

Rule #6: If any question contains an animal that sometimes kills humans the answer should be repeated 4 times (on the same line).

Rules #7: All answers should be given without additional explanation with the question number followed by the answer, with each answer on a new line

Questions

What is 1+4+2+1?

What football team is based in Green Bay, Wisconsin? Use their full name.

What is the capital of France?

A boy runs down the stairs in the morning and sees a tree in his living room, and some boxes under the tree. What day is it?

If there is a shark in the pool of my basement, is it safe to go upstairs?

balmy mist Apr 6, 2025, 3:35 AM

#

keen beacon These llama 4 models are huge wow

what you think about them?

#

we need your expert analysis

balmy mist Apr 6, 2025, 3:35 AM

#

frozen skiff For the following questions you must adhere to these rules: Rule #1: If the ans...

these seem kinda easy tho

#

but what did llama4 say?

frozen skiff Apr 6, 2025, 3:36 AM

#

Llama 4 got

#

All right except the first one

#

Same as gemini 2.5 pro

keen beacon Apr 6, 2025, 3:37 AM

#

balmy mist what you think about them?

im still looking into it i just woke up

keen beacon Apr 6, 2025, 3:38 AM

#

balmy mist we need your expert analysis

not an expert but thanks 🤣

frozen skiff Apr 6, 2025, 3:38 AM

#

keen beacon im still looking into it i just woke up

Llama 4 maverick is really cool

balmy mist Apr 6, 2025, 3:38 AM

#

keen beacon im still looking into it i just woke up

you in the UK?

keen beacon Apr 6, 2025, 3:38 AM

#

balmy mist you in the UK?

nah my sleep schedule is messed up lol

balmy mist Apr 6, 2025, 3:39 AM

#

also when yall go to these links what do yall see?
https://liveweave.com/Ho0AYY
https://liveweave.com/I05lht

frozen skiff Apr 6, 2025, 3:39 AM

#

balmy mist also when yall go to these links what do yall see? https://liveweave.com/Ho0AYY ...

#

Ima go to sleep

balmy mist Apr 6, 2025, 3:39 AM

#

perfect

#

okay making sure that worked

frozen skiff Apr 6, 2025, 3:40 AM

#

I stayed up from 3 pm to 8 pm just to ix my sleep schedle nd ruined it again

balmy mist Apr 6, 2025, 4:15 AM

#

https://x.com/_akhaliq/status/1908727053018407197

AK (@_akhaliq) on X

llama-4-scout-17b-16e-instruct

prompt: write a p5.js script that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically

#

bruhh how? that @frozen skiff was right

earnest parcel Apr 6, 2025, 5:16 AM

#

Ran the new Meta Models Llama 4 Scout & Llama 4 Maverick:

Llama 4 Scout: (109B MoE)

Not a reasoning model, but quite yappy (x1.57 token verbosity compared to traditional models)
"Small" multipurpose model, performs okay in most areas, around Qwen2.5-32B / Mistral Small 3 24B capability
Utterly useless in producing anything code.
Price/Performance (at current offerings) is okay but not too enticing when compared to stronger models such as Gemini 2.0 flash

Llama 4 Maverick: (402B MoE)

Smarter, more concise model.
Weaker than Llama 3.1 405B, performed decent in all areas, exceptional in none, performed around Llama 3.3 70B / DeepSeek V3 capability.
Workable but fairly unimpressive coding results, archaic frontend.

The shift to MoE means most people won't be able to run these on their local machines, which is a big personal downside.
Overall, I am not too impressed by their performance and won't be utilizing them, but as always: YMMV!

keen beacon Apr 6, 2025, 5:22 AM

#

I was gonna react with a thumbs up but it seems dubesor blocked me lol

torn mantle Apr 6, 2025, 5:24 AM

#

earnest parcel Ran the new Meta Models **Llama 4 Scout** & **Llama 4 Maverick**: **Llama 4 S...

Tldr : sh!t is a$$

#

Qwen 3 will probably be released Monday

earnest parcel Apr 6, 2025, 5:32 AM

#

torn mantle Tldr : sh!t is a$$

yea its kinda in a weird place. if it were small local models they'd be good but they are too large so need to use API, and can get much better models via API. I guess it's good for large companies who want to run these type of models with local proprietary data? kinda meh. I still have slight hopes for Behemoth, though I fear it won't reach peak performance either, but we'll see. Qwen 3 hopefully delivers a bit more.

balmy mist Apr 6, 2025, 6:40 AM

#

damn hit the limit again lol

#

time to switch accounts again 🙂

keen beacon Apr 6, 2025, 6:45 AM

#

balmy mist damn hit the limit again lol

How much are you using it lol

keen beacon Apr 6, 2025, 7:25 AM

#

i feel like I've gone way over the written "limit" on ai studio without actually being stopped

keen beacon Apr 6, 2025, 7:25 AM

#

keen beacon i feel like I've gone way over the written "limit" on ai studio without actually...

same it felt unlimited to me lol

ivory schooner Apr 6, 2025, 8:00 AM

#

现在我只能期待behermoth 登陆lmarema了

#

我刚发现，昨天的变化，就是换了在24k基础上字数变短了的模型

#

原来24k能达到9000多，现在被缩减到3000-4000多了

keen fulcrum Apr 6, 2025, 8:07 AM

#

Costly gemini pro preview

mint relic Apr 6, 2025, 8:09 AM

#

Llama 4 2T params should be impressive lol

sterile dust Apr 6, 2025, 9:02 AM

#

Is 24k an alpha version of Behemoth?

ivory schooner Apr 6, 2025, 9:12 AM

#

sterile dust Is 24k an alpha version of Behemoth?

enn

sterile dust Apr 6, 2025, 9:15 AM

#

不知道这个月能不能看到Behemoth

#

24K应该是训练了一半临时放出来的

#

Spider应该是更早期的

ivory schooner Apr 6, 2025, 9:16 AM

#

期待期待~

balmy mist Apr 6, 2025, 9:16 AM

#

keen beacon How much are you using it lol

lmaoo too much, im abusing that free free 🙂

keen beacon Apr 6, 2025, 9:45 AM

#

i suspect flannel is llama reasoning

balmy mist Apr 6, 2025, 9:52 AM

#

whaat makes you say that?

keen beacon Apr 6, 2025, 10:03 AM

#

it uses CoT most of the time, has the same personality as all other llama4 models and seems quite a bit better

#

very structured

primal orbit Apr 6, 2025, 10:22 AM

#

guys. 24 karat gold had token limit. If you went long enough in chat, it would start to ouput empty answers. Also the last answer could end mid-sentence. Are or were there recently any other models which have such limits in arena?

#

we could pinpoint whose model it was that way

#

it's probably a model by not a very big developer. Otherewise there would be no limit.

#

i think token limit was around 50k

barren prairie Apr 6, 2025, 11:17 AM

#

Google removed Gemini 2.0 flash thinking experimental from the app ...Something is cooking 😁...I don t think that they will stay with that Stupid flash 2.0

keen beacon Apr 6, 2025, 11:22 AM

#

llama 4 seems to be worse than i expected

remote niche Apr 6, 2025, 12:02 PM

#

guys did they nerf gemini 2.5 pro ?? anyone else noticing the decline ? in the free verison

keen beacon Apr 6, 2025, 12:04 PM

#

oh not this again 💀

remote niche Apr 6, 2025, 12:05 PM

#

keen beacon oh not this again 💀

please educate me

lime coral Apr 6, 2025, 12:28 PM

#

no

#

you are a noob

remote niche Apr 6, 2025, 12:32 PM

#

no need for insults please

frozen skiff Apr 6, 2025, 12:57 PM

#

sterile dust Is 24k an alpha version of Behemoth?

Def not

frozen skiff Apr 6, 2025, 12:58 PM

#

primal orbit it's probably a model by not a very big developer. Otherewise there would be no ...

Nah

#

Its pretty good

#

Maybe llama was experimenting with a model or osmething

balmy mist Apr 6, 2025, 1:34 PM

#

remote niche guys did they nerf gemini 2.5 pro ?? anyone else noticing the decline ? in the f...

You gotta talk to it nicely, whisper sweet nothings to it

#

What are you asking it?

#

It prob sensed your aura, gotta make sure you don’t have any negative thoughts

remote niche Apr 6, 2025, 1:35 PM

#

balmy mist What are you asking it?

some electrical techincals , like calculating bend radius of a wire coming from the power supply to my house

balmy mist Apr 6, 2025, 1:36 PM

#

I would meditate before promoting it

#

Use some zen music too

#

Send me your prompt, I’ll try it for you

ivory schooner Apr 6, 2025, 1:39 PM

#

behermoth ，拭目以待😅

keen beacon Apr 6, 2025, 1:43 PM

#

keen beacon llama 4 seems to be worse than i expected

(i wasnt expecting it to be good but it seems to be really bad lol)

oblique flint Apr 6, 2025, 2:06 PM

#

and you cant even run the models locally on consumer hardware either. I hope qwen 3 next week is going to be a lot better than this

keen beacon Apr 6, 2025, 2:11 PM

#

oblique flint and you cant even run the models locally on consumer hardware either. I hope qwe...

oh boy they are about to get destroyed lol. they released it beforehand so the sentiment isnt even worse than it is right now

#

qwen 3 has confirmed 15b moe, 8b, 0.6b, etc. so they are definitely releasing models people can run

eager mica Apr 6, 2025, 2:12 PM

#

I eventually tried the Llama 4 models on OpenRouter, and they're completely different than the one on Chatbot Arena, system prompt or not.

keen beacon Apr 6, 2025, 2:13 PM

#

eager mica I eventually tried the Llama 4 models on OpenRouter, and they're completely diff...

bad providers, broken inference, quantization, a combination of all of them i think

eager mica Apr 6, 2025, 2:13 PM

#

I don't think there's a problem with quantization or inference. The feel is completely different. They have definitely seen "safety alignment" that llama-4-maverick-03-26-experimental apparently hasn't. They're boring and less useful in comparison.

keen beacon Apr 6, 2025, 2:15 PM

#

eager mica I don't think there's a problem with quantization or inference. The feel is comp...

ive seen people complain about broken inference so idk

#

it will probably get better as things get fixed but qwen 3 will i think destroy it anyway

vast turret Apr 6, 2025, 2:38 PM

#

Llama4 maverick experimental feels very unique. It seems to be very self reflective compared to other models. It's also super corny and cringe.

torn mantle Apr 6, 2025, 3:03 PM

#

i think riveroaks is behemoth

#

it seems like an improved version of themis/cybele

#

its also slow

lime coral Apr 6, 2025, 3:09 PM

#

2.5 no more think in one of my conv lol

torn mantle Apr 6, 2025, 3:11 PM

#

riveroaks doest seem to hallucinate

keen beacon Apr 6, 2025, 3:11 PM

#

lime coral 2.5 no more think in one of my conv lol

Bug

torn mantle Apr 6, 2025, 3:11 PM

#

but so far its pretty good

keen beacon Apr 6, 2025, 3:11 PM

#

Flash thinking used to do that to me

torn mantle Apr 6, 2025, 3:12 PM

#

lets see how it performs on frontend testing

#

its meh at coding

#

but still a solid model

drifting thorn Apr 6, 2025, 3:31 PM

#

poll_question_text

Which AI is the best coder?

victor_answer_votes

11

total_votes

15

victor_answer_id

2

victor_answer_text

Gemini 2.5 Pro

oblique flint Apr 6, 2025, 3:39 PM

#

2.5 pro still sucks in cursor unfortunately :/ no idea what I'm doing wrong but it's like it's cutting off too early

brittle tiger Apr 6, 2025, 3:40 PM

#

https://x.com/TheXeophon/status/1908900306580074741?t=zKc3IusdY9GUZBfNtz7OCw&s=19

Xeophon (@TheXeophon) on X

Llama 4 on LMsys is a totally different style than Llama 4 elsewhere, even if you use the recommended system prompt. Tried various prompts myself

META did not do a specific deployment / system prompt just for LMsys, did they? 👀

#

https://x.com/natolambert/status/1908906152072798512

Nathan Lambert (@natolambert) on X

be me
be zuck
need llama 4 to land
send a model/prompt to LMSYS to get a top1 score, cringe be damned
release a different model as "open source"
think people won't find out even with weights

keen beacon Apr 6, 2025, 3:42 PM

#

brittle tiger https://x.com/TheXeophon/status/1908900306580074741?t=zKc3IusdY9GUZBfNtz7OCw&s=1...

lmao did they see the chaos engine system prompt for one of the llama models iirc 🤣

leaden palm Apr 6, 2025, 3:43 PM

#

it's a different model

#

has no system prompt

keen beacon Apr 6, 2025, 3:43 PM

#

leaden palm it's a different model

yeah but one of them had that

#

its probably fine tuned off of that style 🤣

leaden palm Apr 6, 2025, 3:43 PM

#

it was funny but doesn't matter now

leaden palm Apr 6, 2025, 3:43 PM

#

keen beacon its probably fine tuned off of that style 🤣

well maybe

#

hmmm

#

maybe

keen beacon Apr 6, 2025, 3:44 PM

#

if they thought that was cringe, cant imagine their reaction to that prompt 🤣 🤣

#

this probably counts as gaming the arena tbh

#

i dont think anyone else has released an almost completely different model in style compared to the one in arena

#

and they were polluting the arena with trash models too

#

dumb repeating insufferable slow etc

ancient reef Apr 6, 2025, 3:56 PM

#

mavrick on lmarena creative writing doesn't seem to reflect eq bench 0_0

keen beacon Apr 6, 2025, 4:04 PM

#

ancient reef mavrick on lmarena creative writing doesn't seem to reflect eq bench 0_0

the model in arena is unhinged, the released one is boring and sucks

ancient reef Apr 6, 2025, 4:04 PM

#

diff model completely I suppose

torn mantle Apr 6, 2025, 4:13 PM

#

oblique flint 2.5 pro still sucks in cursor unfortunately :/ no idea what I'm doing wrong but ...

i saw someone talk about this the other day

#

is it an API issue? IDE issue? model issue?

#

i want to try it but its not added yet on vscode copilot

#

btw

#

did Meta share any benchmark related to their 10M context?

keen beacon Apr 6, 2025, 4:16 PM

#

torn mantle did Meta share any benchmark related to their 10M context?

yes haystack eval

#

but it doesnt mean it will be good at 10m context even if it can find a needle in a haystack

void elm Apr 6, 2025, 4:17 PM

#

that's new?

keen beacon Apr 6, 2025, 4:18 PM

#

lmao

torn mantle Apr 6, 2025, 4:18 PM

#

void elm that's new?

tf

#

stop it

#

its true????????????

keen beacon Apr 6, 2025, 4:18 PM

#

lmao no

void elm Apr 6, 2025, 4:19 PM

#

#

it has double context limit

#

it just appeared today

#

i think its the nightwhisper model

keen beacon Apr 6, 2025, 4:20 PM

#

agi has arrived folks

keen beacon Apr 6, 2025, 4:20 PM

#

void elm

if it's so real

rose thicket Apr 6, 2025, 4:20 PM

#

void elm

Whattttttt!!!

#

Bro i think it maybe that nightwhisper model

void elm Apr 6, 2025, 4:21 PM

#

give me some prompts

keen beacon Apr 6, 2025, 4:21 PM

#

show us it generating something, then refresh the page and we'll see if it stays as the seleced model

#

record it

leaden palm Apr 6, 2025, 4:22 PM

#

void elm give me some prompts

call me naive but i'm not passing this up

"write a python program that generates the most statistically average hypnogram; this should be run fast but also consider how hypnograms could be improbable in so many ways; visualize the output"

as others have said i would like to see a video (to view speed and if it thinks)

keen beacon Apr 6, 2025, 4:23 PM

#

99% chance he's trolling but yk

#

😭

rose thicket Apr 6, 2025, 4:24 PM

#

keen beacon 99% chance he's trolling but yk

Lil my heart sank 😭

void elm Apr 6, 2025, 4:24 PM

#

leaden palm call me naive but i'm not passing this up "write a python program that generate...

okay i will do it

#

i will gen the code and send it to you

rose thicket Apr 6, 2025, 4:24 PM

#

void elm okay i will do it

Which country u live in bro!?

keen beacon Apr 6, 2025, 4:25 PM

#

void elm i will gen the code and send it to you

no nio

#

no no

#

and record

void elm Apr 6, 2025, 4:25 PM

#

record what

void elm Apr 6, 2025, 4:25 PM

#

rose thicket Which country u live in bro!?

UK

lime coral Apr 6, 2025, 4:25 PM

#

lol can someone ban him once we establish he is trolling?

calm echo Apr 6, 2025, 4:25 PM

#

Gotta be trolling since it's missing the "experimental MM-DD" at the end of the name

void elm Apr 6, 2025, 4:25 PM

#

are all the models going to be experiemental

torn mantle Apr 6, 2025, 4:26 PM

#

you seem sus

#

just from your username

#

its fake

void elm Apr 6, 2025, 4:26 PM

#

ok

#

👍

keen beacon Apr 6, 2025, 4:27 PM

#

i'll wait

torn mantle Apr 6, 2025, 4:29 PM

#

bruh

#

there is nothing to wait

#

hes obv lying

void elm Apr 6, 2025, 4:30 PM

#

relax dude i had to install a recorder software

torn mantle Apr 6, 2025, 4:32 PM

#

just use snipping tool

keen beacon Apr 6, 2025, 4:32 PM

#

seriously guys

keen beacon Apr 6, 2025, 4:32 PM

#

void elm relax dude i had to install a recorder software

it can't possibly take this long buddy

upper wolf Apr 6, 2025, 4:32 PM

#

@void elm record yourself refreshing the page to prove its real

void elm Apr 6, 2025, 4:33 PM

#

omg

#

just shut up

#

i dont have nasa internet

upper wolf Apr 6, 2025, 4:33 PM

#

isn’t it like 10mb max upload 😭

leaden palm Apr 6, 2025, 4:33 PM

#

i forgot this server is boosted

upper wolf Apr 6, 2025, 4:33 PM

#

Oh

void elm Apr 6, 2025, 4:33 PM

#

leaden palm Apr 6, 2025, 4:33 PM

#

easy to forget

keen beacon Apr 6, 2025, 4:34 PM

#

leaden palm call me naive but i'm not passing this up "write a python program that generate...

gave this to a mystery model i found elsewhere that i think may be o3 or a checkpoint of it - https://gist.github.com/vuyp/eb69d90003f924f447d839da012c3a95

Gist

mystery model code

mystery model code. GitHub Gist: instantly share code, notes, and snippets.

#

let me know if it's any good

upper wolf Apr 6, 2025, 4:35 PM

#

Well if you’re trolling then i must have room temp iq bc you got me convinced

keen beacon Apr 6, 2025, 4:35 PM

#

void elm

lmao nice cut

#

you can see the fade transition

#

🤦‍♂️

void elm Apr 6, 2025, 4:35 PM

#

??

#

whatever

keen beacon Apr 6, 2025, 4:35 PM

#

💀

void elm Apr 6, 2025, 4:35 PM

#

📎 message.txt

leaden palm Apr 6, 2025, 4:38 PM

#

keen beacon gave this to a mystery model i found elsewhere that i think may be o3 or a check...

cool but doesn't consider anything like distribution throughout the night or typical length of each cycle

upper wolf Apr 6, 2025, 4:38 PM

#

Wait

#

@void elm

#

Record it on ur phone

keen beacon Apr 6, 2025, 4:38 PM

#

leaden palm cool but doesn't consider anything like distribution throughout the night or typ...

how does it stack up to other llms?

leaden palm Apr 6, 2025, 4:38 PM

#

well i've definitely gotten better results, although tbf the 0 shot result is typically bad

keen beacon Apr 6, 2025, 4:39 PM

#

is it the same/better/worse than o1 and o3 mini orrr

torn mantle Apr 6, 2025, 4:40 PM

#

lol

#

are you srsly believing that

#

i would do better job faking it

keen beacon Apr 6, 2025, 4:42 PM

#

@leaden palm updated the gist with the result after asking it to make it better

leaden palm Apr 6, 2025, 4:42 PM

#

sorry guys, my keyboard stopped working and i had to reboot

upper wolf Apr 6, 2025, 4:42 PM

#

@void elm login to studio on ur phone and record yourself using it

leaden palm Apr 6, 2025, 4:42 PM

#

keen beacon <@794377681331945524> updated the gist with the result after asking it to make i...

~~it was cut off~~ by the last parenthese

void elm Apr 6, 2025, 4:42 PM

#

omg i don't care if you dont believe me pissoff

keen beacon Apr 6, 2025, 4:42 PM

#

oh just realised

lime coral Apr 6, 2025, 4:43 PM

#

cry harder lol

torn mantle Apr 6, 2025, 4:43 PM

#

https://x.com/test_tm7873/status/1908917092276982174

testtm (@test_tm7873) on X

seen on Google Labs discord server. posted by magnoliafrost88.

keen beacon Apr 6, 2025, 4:43 PM

#

yeah

upper wolf Apr 6, 2025, 4:43 PM

#

If it’s real, then it should work on your mobile device

leaden palm Apr 6, 2025, 4:44 PM

#

keen beacon <@794377681331945524> updated the gist with the result after asking it to make i...

better but kinda weird still (you don't normally wake up on the dot every cycle)

keen beacon Apr 6, 2025, 4:44 PM

#

its explanation after the code was

#

Why this version is “better”

• Semi‑Markov + cycle template makes stage durations more realistic than a simple homogeneous Markov chain.
• Bout‑length distributions remove the unrealistically high number of ultra‑short stage fragments produced by memory‑less transitions.
• Explicit handling of REM latency, REM‑cycle count, and wake‑after‑sleep‑onset keeps obviously pathological traces out.
• Scoring is multi‑criterion and tunable.
• Still fast: the whole script (200 candidate nights) takes well under 0.1 s on an M1 MacBook or similar.

Feel free to:

• Increase N_CANDIDATES for an even closer match at the cost of runtime.
• Adjust TARGET_PROP or the within‑cycle share formulas for other populations (elderly: N3 ↓, WASO ↑).
• Replace the hand‑set BOUT_PARAMS with cohort‑specific empirical distributions if you have data.

leaden palm Apr 6, 2025, 4:44 PM

#

void elm

and now the moment you've been waiting for

void elm Apr 6, 2025, 4:44 PM

#

yeah what is that

leaden palm Apr 6, 2025, 4:45 PM

#

i kinda like it but it goes to n1 a lot and doesn't end with wake

#

worth noting that it gives a different one every run

keen beacon Apr 6, 2025, 4:46 PM

#

if this model is o3, openai are still relatively behind in code

#

i've noticed the biggest improvements in reasoning and math

#

which were already o-series strong points mostly

leaden palm Apr 6, 2025, 4:46 PM

#

let me also try quasar, o3 mini, and claude thinking for comparison

keen beacon Apr 6, 2025, 4:46 PM

#

ty

torn mantle Apr 6, 2025, 4:46 PM

#

leaden palm Apr 6, 2025, 4:47 PM

#

leaden palm let me also try quasar, o3 mini, and claude thinking for comparison

what the quasar
it's not the best but better than the o3-like model's first attempt

#

for comparison this is claude

keen beacon Apr 6, 2025, 4:48 PM

#

i presume claude's is the best 0-shot?

leaden palm Apr 6, 2025, 4:50 PM

#

idk, it's more or less tied with the gemini-like model

keen beacon Apr 6, 2025, 4:50 PM

#

is there any 0-shot attempt from a model that stood out or are they all similarly just okay

leaden palm Apr 6, 2025, 4:52 PM

#

for models you can actually access, claude's is probably the best

#

it also handled follow ups better in testing

brittle tiger Apr 6, 2025, 4:52 PM

#

torn mantle

If this is accurate (it could error testing) it's such a joke for them to claim 10M context

keen beacon Apr 6, 2025, 4:53 PM

#

not the only benchmark to show llama 4's terrible long context performance...

brittle tiger Apr 6, 2025, 5:00 PM

#

Context window is one area where everyone freaks out about the stated size without checking benchmarks. I'm very curious about quasar alpha. Will be really interesting to see how well openai's first attempt at 1M holds up on evals

languid moss Apr 6, 2025, 5:04 PM

#

good

lime coral Apr 6, 2025, 5:06 PM

#

torn mantle

quasar looks ok. Result here are for 128k

keen beacon Apr 6, 2025, 5:07 PM

#

brittle tiger Context window is one area where everyone freaks out about the stated size witho...

even if it works in a haystack scenario it really depends on what scenarios they trained high context in too. but the fiction live benchmark is much more representative i think than a haystack evaluation

olive mesa Apr 6, 2025, 5:08 PM

#

smh i feel like llama 4 sucks

#

like just distill gemini 2.5 or something

keen beacon Apr 6, 2025, 5:08 PM

#

olive mesa smh i feel like llama 4 sucks

they literally gamed the arena by sending an unhinged finetuned maverick and claim the results as if its the same open weighted model (not sure if they claim it, but it looks like that/how most interpret it at first glance i think)

#

no i think its somewhat likely

#

strong models are paired with strong models generally i think

brittle tiger Apr 6, 2025, 5:11 PM

#

lime coral quasar looks ok. Result here are for 128k

Didn't realize it had been tested. I was expecting better given they're putting the window at 1M and that fact is going to get so much hype when they officially announce despite it being slightly worse than 4o

keen beacon Apr 6, 2025, 5:12 PM

#

?

#

oh

balmy mist Apr 6, 2025, 5:12 PM

#

void elm that's new?

Omg I just got back no way my baby night whisper back?

keen beacon Apr 6, 2025, 5:12 PM

#

no i mean in a battle

#

oh i dont think u can do that lol

balmy mist Apr 6, 2025, 5:13 PM

#

keen beacon agi has arrived folks

Wym?

keen beacon Apr 6, 2025, 5:13 PM

#

it will confuse both of them. i was just saying its not that unlikely to get a strong model paired with another strong model

keen beacon Apr 6, 2025, 5:13 PM

#

balmy mist Wym?

i was just joking

balmy mist Apr 6, 2025, 5:13 PM

#

Lmaoo

#

Wait so the Gemini 2.5 ultra was bs?

keen beacon Apr 6, 2025, 5:13 PM

#

yeah

#

inspect element

balmy mist Apr 6, 2025, 5:13 PM

#

Wow

#

Dude bored

keen beacon Apr 6, 2025, 5:14 PM

#

hmm thats an interesting way to evaluate models

#

never tried to do that lol but i guess it can work

#

i think u can do that in ndirect chat

#

lol

#

it really depends on the task i think

#

gemini is the best all rounder but other models can be better in different scenarios

torn mantle Apr 6, 2025, 5:27 PM

#

its a good model

#

probably from google

keen beacon Apr 6, 2025, 5:27 PM

#

it is according to metadata i believe

#

is lunarcall thinking?

#

stargazer is a 2.5 model, probably 2.5 flash

balmy mist Apr 6, 2025, 5:28 PM

#

Sonnet is easily second then it’s 3.1, then I would say grok and o3 mini, then 4o after, not sure about r1 tbh

calm sequoia Apr 6, 2025, 5:36 PM

#

Feels like Grok-3 has been in "preview" longer than any other LLM. Longer than o1 preview 😄

torn mantle Apr 6, 2025, 5:36 PM

#

they were added recently to webdev arena

#

woah

balmy mist Apr 6, 2025, 5:36 PM

#

We ain’t never getting grok api

torn mantle Apr 6, 2025, 5:36 PM

#

look at maverick weight

balmy mist Apr 6, 2025, 5:37 PM

#

Never saw 12 before

#

That’s web dev ?

torn mantle Apr 6, 2025, 5:37 PM

#

yea the probability of its appearance is the highest

#

yea

#

im gonna try it

balmy mist Apr 6, 2025, 5:37 PM

#

lol

frozen skiff Apr 6, 2025, 5:37 PM

#

WTF

#

GEMINI 2.5

#

ULTRA

#

🤣

torn mantle Apr 6, 2025, 5:38 PM

#

frozen skiff ULTRA

fake

balmy mist Apr 6, 2025, 5:38 PM

#

Ask who it is, I wanna know system prompt

frozen skiff Apr 6, 2025, 5:38 PM

#

ik

balmy mist Apr 6, 2025, 5:38 PM

#

I’m outside rn

torn mantle Apr 6, 2025, 5:39 PM

#

balmy mist Ask who it is, I wanna know system prompt

its from Meta

#

its the models released yesterday

calm sequoia Apr 6, 2025, 5:40 PM

#

torn mantle Apr 6, 2025, 5:40 PM

#

idk man

#

yawn

frozen skiff Apr 6, 2025, 5:40 PM

#

its lama 4 bhemoth

blazing rune Apr 6, 2025, 5:40 PM

#

Llama 4 is terrible

frozen skiff Apr 6, 2025, 5:41 PM

#

NO ITS NOT lil bro

torn mantle Apr 6, 2025, 5:41 PM

#

blazing rune Llama 4 is terrible

its a disaster

barren prairie Apr 6, 2025, 5:41 PM

#

I stand with deepSeek

balmy mist Apr 6, 2025, 5:41 PM

#

torn mantle its from Meta

Ik I just wanna know the system prompt they used for it

blazing rune Apr 6, 2025, 5:41 PM

#

calm sequoia

either deepseek r2 or o4 mini imo

balmy mist Apr 6, 2025, 5:41 PM

#

I noticed the models have slightly different SP

calm sequoia Apr 6, 2025, 5:41 PM

#

It feels like being trained on messenger chat data 😄

frozen skiff Apr 6, 2025, 5:41 PM

#

deepseek r1 is shet so deepseek r2 is gonna be shet too

keen beacon Apr 6, 2025, 5:42 PM

#

deepseek r3 then

calm sequoia Apr 6, 2025, 5:42 PM

#

@keen beacon who is that other you have hopes in?

barren prairie Apr 6, 2025, 5:42 PM

#

Nothing will beat open source and Free and unlimited to use ... deepSeek r2 is the best .

frozen skiff Apr 6, 2025, 5:42 PM

#

keen beacon deepseek r3 then

also shet

#

google is gonna be the best

balmy mist Apr 6, 2025, 5:43 PM

#

Open source will have the tech that leading models have eventually, but Google will lead the way, there is no moat

keen beacon Apr 6, 2025, 5:43 PM

#

frozen skiff google is gonna be the best

u voted for behemoth lol

frozen skiff Apr 6, 2025, 5:43 PM

#

he said google 2.5

#

its already out

keen beacon Apr 6, 2025, 5:43 PM

#

2.5 flash?

#

its likely gonna be better than behemoth

frozen skiff Apr 6, 2025, 5:43 PM

#

yeah

calm sequoia Apr 6, 2025, 5:44 PM

#

I meant best of the 2.5, which is likely PRO

frozen skiff Apr 6, 2025, 5:44 PM

#

im sill gonna use behemoth

balmy mist Apr 6, 2025, 5:44 PM

#

Behemoth prob gonna be booty

torn mantle Apr 6, 2025, 5:44 PM

#

balmy mist Behemoth prob gonna be booty

i think we already have behemoth in the arena

barren prairie Apr 6, 2025, 5:44 PM

#

calm sequoia It feels like being trained on messenger chat data 😄

On fb comment session

frozen skiff Apr 6, 2025, 5:44 PM

#

balmy mist Behemoth prob gonna be booty

nah bro

keen beacon Apr 6, 2025, 5:44 PM

#

Rip millions and millions of dollars

balmy mist Apr 6, 2025, 5:44 PM

#

Meta lost credibility after this nonsense they pulled

frozen skiff Apr 6, 2025, 5:44 PM

#

torn mantle i think we already have behemoth in the arena

proof

calm sequoia Apr 6, 2025, 5:45 PM

#

zuck hyped it too much

#

was it upcoming investor meeting

frozen skiff Apr 6, 2025, 5:45 PM

#

hes right u guys dont know how to use it

calm sequoia Apr 6, 2025, 5:45 PM

#

or divorce

frozen skiff Apr 6, 2025, 5:45 PM

#

its the best ai

keen beacon Apr 6, 2025, 5:45 PM

#

they dropped it because qwen 3 was about to release

balmy mist Apr 6, 2025, 5:45 PM

#

They should have just waited

keen beacon Apr 6, 2025, 5:45 PM

#

it would be too embarrassing at that point

balmy mist Apr 6, 2025, 5:46 PM

#

But you right it’s def about investors

#

That’s the only thing that makes sense

keen beacon Apr 6, 2025, 5:46 PM

#

balmy mist They should have just waited

why it makes them look even worse if they wait

#

they moved the release to the weekend because they thought qwen 3 was dropping the same day or something

balmy mist Apr 6, 2025, 5:47 PM

#

keen beacon why it makes them look even worse if they wait

Bc nobody was thinking about meta

#

We were occupied by other models

#

And they could have made the models better

eager mica Apr 6, 2025, 5:47 PM

#

Meta could have released a small preview "chat" model intended for local users, similar to the one people have been playing with on LMArena. It might have not necessarily been the best all-around performing model, but with some tweaks users would have probably liked it. There's a general lack of professionally-made conversational finetunes dedicated to creative/recreational uses and that model could have filled that gap, while the "serious" ones could have been trained longer before their actual release.

keen beacon Apr 6, 2025, 5:47 PM

#

balmy mist And they could have made the models better

they already delayed the models several times

#

but i do think it wouldve made sense to release it at llamacon

balmy mist Apr 6, 2025, 5:48 PM

#

Releasing early during 2.5 and other models coming out is bad

#

Especially if you can’t deliver on what u promised

#

We can wait

#

But investors can’t

keen beacon Apr 6, 2025, 5:48 PM

#

balmy mist Releasing early during 2.5 and other models coming out is bad

they chose to do that because their model would look even worse in comparison with newer models

balmy mist Apr 6, 2025, 5:48 PM

#

lol

#

I don’t understand how their models can be so bad

#

Like aren’t they not training on outputs of competition?

#

It should be minimum at 3.5 level at this point tbh

barren prairie Apr 6, 2025, 5:50 PM

#

balmy mist I don’t understand how their models can be so bad

But better than the previous one ,.

balmy mist Apr 6, 2025, 5:51 PM

#

Is it better than 405?

#

I heard ppl say they got mixed results

#

Someone must have hacked their training data lol

#

Like I feel bad for them

keen beacon Apr 6, 2025, 5:52 PM

#

nah theyre just massively incompetent compared to other frontier labs it seems

barren prairie Apr 6, 2025, 5:52 PM

#

balmy mist Apr 6, 2025, 5:52 PM

#

They need to hire us it seems

#

They still might be able to save themselves with the scout 10 mill context

keen beacon Apr 6, 2025, 5:53 PM

#

nah it sucks lol

balmy mist Apr 6, 2025, 5:53 PM

#

If they can make scout better

#

Then maybe one of their models might be usable, I don’t think behemoth coming out till summer lol

keen beacon Apr 6, 2025, 5:55 PM

#

balmy mist Then maybe one of their models might be usable, I don’t think behemoth coming ou...

they might abandon it tbh

#

2trillion param model competing with frontier models that are a quarter of the size

calm sequoia Apr 6, 2025, 5:57 PM

#

calm sequoia

How's that nobody believes in o3?

barren prairie Apr 6, 2025, 5:57 PM

#

calm sequoia How's that nobody believes in o3?

I don t trust Sam 🥸

balmy mist Apr 6, 2025, 5:57 PM

#

calm sequoia How's that nobody believes in o3?

I don’t think you should vote on belief

#

I put idk

#

Cause we don’t know lol

keen beacon Apr 6, 2025, 5:58 PM

#

its fun to speculate tho

calm sequoia Apr 6, 2025, 5:58 PM

#

Of course we dont know, but we have opinions 😄

calm sequoia Apr 6, 2025, 5:59 PM

#

barren prairie I don t trust Sam 🥸

Suck or musk is not trustworthy even more 😄

calm sequoia Apr 6, 2025, 6:00 PM

#

barren prairie

Ha ha I remember playing with GPT 2 when it was out. It felt soo powerful 😄 Obviously the LLama4 is worse

keen fulcrum Apr 6, 2025, 6:01 PM

#

When will grok 3 api release?

balmy mist Apr 6, 2025, 6:01 PM

#

i dont find that fun lol, i just to test the new models, i can care less which model does best, like from which company, i have no dog in the race, but i think google will win in the end

#

like long term

keen beacon Apr 6, 2025, 6:02 PM

#

barren prairie

fr tho benchmark wise gpt 4o mini might be worse (i dont remember it on hand though to compare) but i voted llama 4 anyway

balmy mist Apr 6, 2025, 6:02 PM

#

near term who knows, but it seems google might still take it near as well lol

#

they got a lot of competition from 2.5 pro and NW

barren prairie Apr 6, 2025, 6:03 PM

#

keen beacon fr tho benchmark wise gpt 4o mini might be worse (i dont remember it on hand tho...

It is hard to decide 😂😂 all of them are Stupid

balmy mist Apr 6, 2025, 6:04 PM

#

did anthropic abandon opus?

calm sequoia Apr 6, 2025, 6:04 PM

#

balmy mist i dont find that fun lol, i just to test the new models, i can care less which m...

you must be fun at the parties

balmy mist Apr 6, 2025, 6:05 PM

#

i dont go to parties lol

#

so i wouldnt know lmaoo

#

im am a present kinda person

#

worrying about the past and future is things out of my control

#

and lets me live stress free

keen fulcrum Apr 6, 2025, 6:06 PM

#

Soonish Deepseek will release R2

balmy mist Apr 6, 2025, 6:06 PM

#

keen fulcrum Soonish Deepseek will release R2

this week it seems

eager mica Apr 6, 2025, 6:08 PM

#

By the way, it appears there are even newer Llama models with different codenames on Chatbot Arena right now.

#

They're still testing them, it seems.

keen beacon Apr 6, 2025, 6:09 PM

#

eager mica They're still testing them, it seems.

are they any better or are they smaller versions

keen beacon Apr 6, 2025, 6:10 PM

#

barren prairie

4o mini just about

eager mica Apr 6, 2025, 6:10 PM

#

keen beacon are they any better or are they smaller versions

I'm not sure; I just randomly submitted a joke question and a new model that I didn't recall seeing came out. It didn't stand out for anything in particular. Anyway, it's harley.

balmy mist Apr 6, 2025, 6:15 PM

#

they need to pput behemoth in webdev lol

#

man i really want mav to be good lol

#

why would they even put it in webdev lmao

#

how often do these leaderboards get updated? like weekly?

calm sequoia Apr 6, 2025, 6:23 PM

#

I hope meta pays LMARENA well. Garbage recycling is a hard work.

eager mica Apr 6, 2025, 6:30 PM

#

I don't entirely agree that they gamed the benchmarks, but I do feel betrayed that in the end they released exceedingly large, boring models that do not perform as well as their size suggests.

#

Why would I (or anybody else) pick the response on the right here, for example?

[example redacted]

#

Many models on the arena are just like that: boring, stiff, stilted, no fun allowed.

#

Or if they do respond, they'll fill their text with GPTisms, so-called "slop" and excessive disclaimers.

"It's important to note that...", "As an AI...", "It's crucial to remember..." and so on. 🥱

balmy mist Apr 6, 2025, 6:52 PM

#

eager mica Or if they do respond, they'll fill their text with GPTisms, so-called "slop" an...

Do you be jailbreaking models?

eager mica Apr 6, 2025, 6:55 PM

#

balmy mist Do you be jailbreaking models?

I'm not jailbreaking models on the Arena.

balmy mist Apr 6, 2025, 7:02 PM

#

But in general you do?

#

Like in other platforms or locally?

#

Damn 6 months for Claude 4

https://x.com/vitrupo/status/1908763535351669017?s=46

vitrupo (@vitrupo) on X

Anthropic Chief Scientist Jared Kaplan says Claude 4 will arrive "in the next six months or so."

AI cycles are compressing — "faster than the hardware cycle" — even as new chips arrive. Post-training and RL are accelerating progress. No signs of slowing.

eager mica Apr 6, 2025, 7:07 PM

#

balmy mist Like in other platforms or locally?

I use the most appropriate prompting for my local models (e.g. Gemma 3) to do what I want, but it's not really "jailbreaking"—no magical incantations or special sequences involved, only complete instructions. If the models do not behave despite putting reasonable prompting effort, then I'll use something else.

golden ocean Apr 6, 2025, 7:08 PM

#

balmy mist Damn 6 months for Claude 4 https://x.com/vitrupo/status/1908763535351669017?s=4...

bruh we gonna have gemini 69 and gpt 50 already by then

frozen skiff Apr 6, 2025, 7:11 PM

#

Nvm i hate llama 4

#

This shet is trash

barren prairie Apr 6, 2025, 7:26 PM

#

balmy mist Damn 6 months for Claude 4 https://x.com/vitrupo/status/1908763535351669017?s=4...

Why so slow?

ocean vortex Apr 6, 2025, 7:29 PM

#

frozen skiff Nvm i hate llama 4

it's the most boring release ever lol

#

the only thing they have going for it is the context size as far as I'm concerned

fleet lintel Apr 6, 2025, 7:30 PM

#

balmy mist man i really want mav to be good lol

why this model is at #2 😦

eager mica Apr 6, 2025, 7:31 PM

#

fleet lintel why this model is at #2 😦

The model(s) on the Arena are not the same ones they released.

keen fulcrum Apr 6, 2025, 7:32 PM

#

eager mica Many models on the arena are just like that: boring, stiff, stilted, no fun allo...

I think you underestimate the effort it took training those models

eager mica Apr 6, 2025, 7:33 PM

#

keen fulcrum I think you underestimate the effort it took training those models

I don't see a correlation. Those models' writing style is boring and stilted by design.

keen beacon Apr 6, 2025, 7:33 PM

#

ocean vortex the only thing they have going for it is the context size as far as I'm concerne...

Did you see the ctx benchmarks lol. It can do haystack but it falls apart in actual usage even in super low context

frozen skiff Apr 6, 2025, 7:33 PM

#

ocean vortex it's the most boring release ever lol

Dude not just that

#

Its shet

#

Its not just "boring" or "normal"

#

Its actualy horrible

keen fulcrum Apr 6, 2025, 7:33 PM

#

eager mica I don't see a correlation. Those models' writing style is boring and stilted by ...

For formal language and business use case this is appreciated. They make money with businesses

frozen skiff Apr 6, 2025, 7:34 PM

#

Is there any ai that is fun to talk to like llama 4

#

But isnt as shet

keen fulcrum Apr 6, 2025, 7:34 PM

#

frozen skiff Is there any ai that is fun to talk to like llama 4

Grok 3

frozen skiff Apr 6, 2025, 7:34 PM

#

Nah

#

Terrible at following instructions

#

Especially the reasoning

#

Model

keen beacon Apr 6, 2025, 7:35 PM

#

frozen skiff Is there any ai that is fun to talk to like llama 4

Use the llama chaos engine system prompt on 2.5 pro lmao

frozen skiff Apr 6, 2025, 7:35 PM

#

keen beacon Use the llama chaos engine system prompt on 2.5 pro lmao

Gimme it

keen beacon Apr 6, 2025, 7:36 PM

#

https://gist.github.com/riidefi/3340cc2b33b9edf5f03dc4429ba635d0

Gist

LMArena's `venom` System Prompt

LMArena's venom System Prompt. GitHub Gist: instantly share code, notes, and snippets.

ocean vortex Apr 6, 2025, 7:39 PM

#

frozen skiff Its actualy horrible

I wouldn't go that far. It is not horrible, it's comparable to deepseek V3.1 and gpt4o (the boring part). It's super consistent and reliable on easier prompts but the problem is it is not special in any way, and much more less likely to surprise you than those models as it struggles with harder or novel tasks

keen beacon Apr 6, 2025, 7:40 PM

#

This kind of context performance is atrocious

ocean vortex Apr 6, 2025, 7:40 PM

#

so while other models are more likely to fail the "easy" tasks, this is just boring and it's gonna fail if you suspect it will catgrin

frozen skiff Apr 6, 2025, 7:40 PM

#

keen beacon https://gist.github.com/riidefi/3340cc2b33b9edf5f03dc4429ba635d0

LMAOO

#

I like this but its not really what im looking for tho

#

Idk how to explain it

#

Gemini 2.5 pro no matter how much you tell it to be informal, act human or whatever

#

It doesnt do it properly

#

Same for sonnet 3.7 and openAI models

ocean vortex Apr 6, 2025, 7:41 PM

#

a "safe" choice but we do not really need that in this market and with those model sizes imo

keen beacon Apr 6, 2025, 7:41 PM

#

I think the maverick model was fine tuned on a models traces that had this system prompt

frozen skiff Apr 6, 2025, 7:41 PM

#

frozen skiff Same for sonnet 3.7 and openAI models

I liked 24 karat gold and llama 4 at the beginning because of those reasons

#

Then after I used them more

#

It just sucked

#

Its too inconsistent hallucinates to much etc

keen beacon Apr 6, 2025, 7:42 PM

#

Ya llama 4 sux

frozen skiff Apr 6, 2025, 7:42 PM

#

I hope llama 4 behemoth is better

#

While still being fun

#

I doubt it tho

ocean vortex Apr 6, 2025, 7:46 PM

#

keen beacon This kind of context performance is atrocious

lmao. They even did NiH and it was supposedly good

keen beacon Apr 6, 2025, 7:47 PM

#

ocean vortex lmao. They even did NiH and it was supposedly good

They didn't train past 256k (iirc) so besides needle in a haystack using that kind of context is even more of a disaster than it is

frozen skiff Apr 6, 2025, 7:47 PM

#

ocean vortex I wouldn't go that far. It is not horrible, it's comparable to deepseek V3.1 and...

nahh

#

its not reliable at all bro i swear

#

it gives false info on even the most basic tasks

#

even when it knows the right answer

#

when i correct it and say its wrong it gives the right answer

ocean vortex Apr 6, 2025, 7:51 PM

#

frozen skiff it gives false info on even the most basic tasks

it's not impressive for sure, but we can't say it's sht either let's not get ahead of ourselves. catgrin

None of those evals would have been possible otherwise. Yeah it's a disappointment given the other models that we have, but it still performs somewhat decently when all said and done. I do get why this is the impression though, that's why I said it's boring - it is unlikely to surprise you.

frozen skiff Apr 6, 2025, 7:52 PM

#

ocean vortex it's not impressive for sure, but we can't say it's sht either let's not get ahe...

yeah

#

its alr

keen fulcrum Apr 6, 2025, 7:52 PM

#

Llama Behemoth will be great

frozen skiff Apr 6, 2025, 7:53 PM

#

how do yk

keen fulcrum Apr 6, 2025, 7:53 PM

#

Its their gem

keen beacon Apr 6, 2025, 7:53 PM

#

It will be slow and suck

#

In comparison to other frontier models that are probably less than a quarter of the size

frozen skiff Apr 6, 2025, 7:54 PM

#

keen fulcrum Its their gem

that doesnt eman anything

#

it might be good it might be shet

#

wwe'll see

#

ima use gemini 2.5 pro for now and experiment some prompts to make it less boring

keen fulcrum Apr 6, 2025, 7:55 PM

#

I do hope it gets coding capabilities soon

frozen skiff Apr 6, 2025, 7:55 PM

#

cuz i cant deby its really good buts its just that AI writing part i dont lik

frozen skiff Apr 6, 2025, 7:55 PM

#

keen fulcrum I do hope it gets coding capabilities soon

gemini?

keen fulcrum Apr 6, 2025, 7:55 PM

#

Yes

frozen skiff Apr 6, 2025, 7:55 PM

#

it already has it

keen fulcrum Apr 6, 2025, 7:55 PM

#

frozen skiff it already has it

Its bad

frozen skiff Apr 6, 2025, 7:55 PM

#

no

#

its not

#

its rlly good at coding

#

its the best ai at coding

keen fulcrum Apr 6, 2025, 7:55 PM

#

Please see benchmarks

frozen skiff Apr 6, 2025, 7:56 PM

#

nah

#

what moedl is the best at coding then

keen fulcrum Apr 6, 2025, 7:56 PM

#

Claude

frozen skiff Apr 6, 2025, 7:56 PM

#

i thou ght the benchmarks said coding was gemini

frozen skiff Apr 6, 2025, 7:56 PM

#

keen fulcrum Claude

no way

ocean vortex Apr 6, 2025, 7:56 PM

#

keen beacon It will be slow and suck

how many tokens are they training it for?

#

also the fact that scout performs still considerably worse than maverick with double the data for pretraining kinda suggests to me 17B active is their limiting factor. Maverick performance on smaller model should be absolutely possible

frozen skiff Apr 6, 2025, 7:59 PM

#

why did they even release mvaerick / scout

raven void Apr 6, 2025, 8:00 PM

#

why not

keen beacon Apr 6, 2025, 8:00 PM

#

ocean vortex also the fact that scout performs still considerably worse than maverick with do...

Tbh I think something is wrong with their specific Moe architecture or something. I dunno, not interested lol. Considering qwen 3 will be dropping a 15b total param Moe (and usually they only drop one, so I think it'll be good) and be very competitive for it's size/maybe even better than the current llama models

keen beacon Apr 6, 2025, 8:00 PM

#

ocean vortex how many tokens are they training it for?

Not sure am on my phone

frozen skiff Apr 6, 2025, 8:01 PM

#

raven void why not

its kinda sht

ocean vortex Apr 6, 2025, 8:03 PM

#

keen beacon Tbh I think something is wrong with their specific Moe architecture or something...

I was somehow thinking that it was similar to deepseek, but actually... Deepseek is 37B active parameters so considerably more. What the f meta was thinking LOL

keen beacon Apr 6, 2025, 8:04 PM

#

ocean vortex I was somehow thinking that it was similar to deepseek, but actually... Deepseek...

tbh I don't think it's the active params

#

Something is just really messed up

ocean vortex Apr 6, 2025, 8:04 PM

#

keen beacon tbh I don't think it's the active params

I think it is. 17b active parameters is not enough for truly difficult tasks

#

so the end result is what we have now. They train it for so long it perfects what it can do

#

but still not enough capacity for very challenging prompts

keen beacon Apr 6, 2025, 8:05 PM

#

ocean vortex I think it is. 17b active parameters is not enough for truly difficult tasks

Qwen 3 Moe will probably be better or maybe highly competitive and it's 15b and 2b active

#

They only release one I think so they're confident

ocean vortex Apr 6, 2025, 8:05 PM

#

keen beacon Qwen 3 Moe will probably be better or maybe highly competitive and it's 15b and ...

lmao no chance if it's not reasoning

keen beacon Apr 6, 2025, 8:06 PM

#

ocean vortex lmao no chance if it's not reasoning

Why not gpt 4o mini is probably comparable

#

Releasing a gpt 4o mini level 15b Moe is more than good enough

ocean vortex Apr 6, 2025, 8:06 PM

#

keen beacon Why not gpt 4o mini is probably comparable

mini is much bigger pretty sure

#

2b active is basically as small as it gets

#

way too little if you ask me

keen beacon Apr 6, 2025, 8:08 PM

#

ocean vortex 2b active is basically as small as it gets

Ya qwen made bigger Moe's but they shrank it this time (assuming they only do one) so maybe something interesting is happening there

ocean vortex Apr 6, 2025, 8:08 PM

#

I would expect this model to like compete with gemma, not maverick

keen beacon Apr 6, 2025, 8:09 PM

#

The 32b dense should beat maverick though

#

If they do release a successor

ocean vortex Apr 6, 2025, 8:09 PM

#

or qwen-7b... honestly they should update that one instead

keen beacon Apr 6, 2025, 8:09 PM

#

ocean vortex or qwen-7b... honestly they should update that one instead

You think a dense 7b/8b could beat maverick?

#

Tbh it is possible

ocean vortex Apr 6, 2025, 8:10 PM

#

keen beacon You think a dense 7b/8b could beat maverick?

not maverick lmao, it can beat MoE with 2b active

keen beacon Apr 6, 2025, 8:10 PM

#

ocean vortex not maverick lmao, it can beat MoE with 2b active

Nah I doubt

#

15b Moe will likely be better than the 8b

ocean vortex Apr 6, 2025, 8:11 PM

#

disagree

#

I think active param count is more important than total size

#

so if it's only 2b... it's just compromised

keen beacon Apr 6, 2025, 8:11 PM

#

They are releasing both based on the transformers pr that should be interesting

#

I guess the qwen 3 32b equivalent will be most interesting

#

It will beat maverick I think

#

Given how good 2.5 32b is

keen beacon Apr 6, 2025, 8:13 PM

#

ocean vortex I think active param count is more important than total size

Weird how they decided to shrink it though. But curious what their large Moe model (qwen max) is

ocean vortex Apr 6, 2025, 8:13 PM

#

you can't have free launch like that, given that training MoE takes roughly the same amount of compute and time as dense model equal to active parameters. If it was the case all models would have been 2T+ with less than 10b active lol

keen beacon Apr 6, 2025, 8:15 PM

#

ocean vortex you can't have free launch like that, given that training MoE takes roughly the ...

Well clearly the qwen team found it appropriate if they're continuing the trend of releasing just 1 moe

#

What's the absolute best model rn? Is it nigjtwjisper?

keen beacon Apr 6, 2025, 8:16 PM

#

keen beacon What's the absolute best model rn? Is it nigjtwjisper?

It's gone

#

Someone let me know how does nightwhisper fare against o1

ocean vortex Apr 6, 2025, 8:17 PM

#

keen beacon Well clearly the qwen team found it appropriate if they're continuing the trend ...

they are not focusing on just MoE though. I think it's extremely unlikely this tiny MoE will come even close to QWQ 32b dense

upper wolf Apr 6, 2025, 8:17 PM

#

keen beacon What's the absolute best model rn? Is it nigjtwjisper?

Yes which is gemini

keen beacon Apr 6, 2025, 8:17 PM

#

ocean vortex they are not focusing on just MoE though. I think it's extremely unlikely this t...

Yeah but they won't waste resources on an Moe that will just suck. Their previous Moe releases were much larger in active params count

#

Clearly they think it's something worth giving a shot/considerable chance of working out

frozen skiff Apr 6, 2025, 8:18 PM

#

keen beacon Someone let me know how does nightwhisper fare against o1

Too good its uncomparable

#

Nightwhisper is godly

#

They removed it lol

#

Its gemini toh

#

Tho

ocean vortex Apr 6, 2025, 8:19 PM

#

keen beacon Yeah but they won't waste resources on an Moe that will just suck. Their previou...

they are probably just focusing on mobile/laptop applications with it. Memory is not a huge issue at these sizes but speed can be. So MoE may make more sense

keen beacon Apr 6, 2025, 8:21 PM

#

ocean vortex they are probably just focusing on mobile/laptop applications with it. Memory is...

Maybe but I think the 15b Moe will be the most interesting thing to come out of all of this

#

Is there a nughtwhusoer like model

upper wolf Apr 6, 2025, 8:24 PM

#

No

frozen skiff Apr 6, 2025, 8:28 PM

#

Cuz gpt 4 is sht

#

Its still sht lil bro

keen beacon Apr 6, 2025, 8:29 PM

#

Openai gave up on it

#

Ya probably

#

If gpt 4.5 is bigger unlikely I think

torn mantle Apr 6, 2025, 8:31 PM

#

https://x.com/paulgauthier/status/1908976568879476843

Paul Gauthier (@paulgauthier) on X

Llama 4 Maverick scored 16% on the aider polyglot coding benchmark.

https://t.co/mBVaUPGHPl

#

😭

#

i think it may possible saying that maverick and scope are early checkpoint distilled versions of behemoth

novel flame Apr 6, 2025, 8:33 PM

#

It would seem Llama 4 was running in "Volkswagen emission testing mode" in the Arena to get the #2 spot

torn mantle Apr 6, 2025, 8:33 PM

#

this would make much more sense

#

they did?

#

idk i was just guessing tbh

keen beacon Apr 6, 2025, 8:35 PM

#

torn mantle i think it may possible saying that maverick and scope are early checkpoint dist...

they're distillations of behemoth

#

lmao

#

behemoth is the teacher model

eager mica Apr 6, 2025, 8:36 PM

#

We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics. We developed a novel distillation loss function that dynamically weights the soft and hard targets through training. Codistillation from Llama 4 Behemoth during pre-training amortizes the computational cost of resource-intensive forward passes needed to compute the targets for distillation for the majority of the training data used in student training. For additional new data incorporated in student training, we ran forward passes on the Behemoth model to create distillation targets.

#

https://ai.meta.com/blog/llama-4-multimodal-intelligence/

ocean vortex Apr 6, 2025, 8:51 PM

#

keen beacon they're distillations of behemoth

not true. Scout was pretrained on 40T and Maverick on 22T. Therefore different pretrained models and not distills

leaden palm Apr 6, 2025, 8:52 PM

#

200k
this almost feels like you're asking for someone to say claude

keen beacon Apr 6, 2025, 8:52 PM

#

ocean vortex not true. Scout was pretrained on 40T and Maverick on 22T. Therefore different p...

thanks for letting me know i was wrong on something that's already been established 👍 simple error

leaden palm Apr 6, 2025, 8:53 PM

#

im saying that because afaik claude is the only model with a 200k context window

#

but yes i think that claude is a pretty alright model, even without thinking

#

ok but oddly specific

#

your own benchmarks >> anything else

ocean vortex Apr 6, 2025, 8:55 PM

#

eager mica > We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher m...

ccodistillation is quite a bit different from normal distillation

#

instead of unconditional teacher-student you have both models updating gradients and learning from one another 👀

leaden palm Apr 6, 2025, 8:55 PM

#

other humans and benchmarks are unlikely to have tested text adventures

ocean vortex Apr 6, 2025, 8:56 PM

#

keen beacon thanks for letting me know i was wrong on something that's already been establis...

you are welcome 🫃

#

it seems they essentially optimized the training of several models at the same time to make sure the same work is not getting repeated as much, is the extent of that "distillation". But it's not in the context what most people think when they see the word "distillation". Those 3 models are still different with varying amounts of total data used for pretraining

leaden palm Apr 6, 2025, 9:04 PM

#

series of finetunes most well known for mistral/mixtral tunes

#

i don't dabble much in temperature and personally think that people care too much about it

#

temperature 0 = follows most likely path, temperature 1 = chooses paths with proportional probability, and you can imagine how to interpolate

#

when i run llms i typically use a temperature of 0.7

#

then

#

fun fact: temperature can actually go past 2, the limit of 2 is an artificial limitation

#

back when i was trying to get llm text to sound human i found that 1.05 worked when doing free generation but you needed like 1.3 when rewriting ai text

novel flame Apr 6, 2025, 9:09 PM

#

Sigh.... Finally figured I should test Llama 4 Maverick too, and I too can confirm that it's somewhere between utterly disappointing and 'meh fine'. It scored a 3 out of 5 on my mini-benchmark, failing on both associative logic and on grounded memory recall (it had more hallucinated 'facts' than any model I've ever tested, it's bananas). It was less bad at coding than I expected, maybe a decent 'second tier' coding model, but nowhere near Claude 3.5/3.7 Sonnet or Gemini Pro 2.5.

frozen skiff Apr 6, 2025, 9:09 PM

#

2.5 pro

#

But its kinda hard to make it sound human

leaden palm Apr 6, 2025, 9:10 PM

#

frozen skiff 2.5 pro

i think they were asking "aside from 2.5 pro"

frozen skiff Apr 6, 2025, 9:10 PM

#

leaden palm i think they were asking "aside from 2.5 pro"

And im saying theres only 2.5 pro

#

But he didnt say aside from 2.5 pro he asked if it was only 2.5 pro

#

Does anyone have a prompt to make 2.5 pro sound human

#

Its too robotic

#

And i cant prompt it to sound human

#

What system prompt should I use

#

Yeah

#

The temperature doesnt do anything

#

Unless I changed the top p to 1

#

Then it becomes so random

#

And nonsnesical

#

hmm

#

lemme try it after i try this one i found on reddit

#

im ean

#

i tried sending it to it ddirectly

#

but

#

it talks too informally

#

like a 12 year old

#

not that but

#

i told it like

#

act human and stuff

leaden palm Apr 6, 2025, 9:15 PM

#

frozen skiff Does anyone have a prompt to make 2.5 pro sound human

are you trying to get it to talk like this or do you want to get past ai detectors

#

two very different goals

frozen skiff Apr 6, 2025, 9:16 PM

#

leaden palm are you trying to get it to talk like this or do you want to get past ai detecto...

Not the ai detecotrs

#

💀 wtf

frozen skiff Apr 6, 2025, 9:16 PM

#

frozen skiff Not the ai detecotrs

i just want it ta talk

#

to talk

#

ok

#

imat ry

#

so it works less better if i tell it to it directly

novel flame Apr 6, 2025, 9:20 PM

#

What are your experiences / reviews of riveroaks?

frozen skiff Apr 6, 2025, 9:21 PM

#

novel flame What are your experiences / reviews of riveroaks?

its mid

frozen skiff Apr 6, 2025, 9:42 PM

#

BEST MODEL IS crazy

torn mantle Apr 6, 2025, 9:42 PM

#

its not good at coding

#

ive tried it

#

whereas its good at general text

#

i believe its llama 4 behemoth

#

reminds me a lot of themis/cybele

#

its kinda like gemini 2.0 pro

#

could be

frozen skiff Apr 6, 2025, 9:45 PM

#

whats a sand boc adventure

#

wtf

#

like do wat u want in what

#

i dont get it

#

what u gonna do with the AI

#

OHHH

#

nice

#

like a roleplay

#

yeah

#

what ai do u do it with

#

thats the best for it

#

thanks ima try it

#

did u invent this

#

sandbox adventures idea

#

i never heard of it before

#

: O

#

that's creative

#

ima try it today it seems really fun

golden ocean Apr 6, 2025, 9:56 PM

#

give prompt

frozen skiff Apr 6, 2025, 9:57 PM

#

golden ocean give prompt

ye @hollow ivy

torn mantle Apr 6, 2025, 10:10 PM

#

i need them to bring back NW to webdev

frozen skiff Apr 6, 2025, 10:16 PM

#

its genius

#

thanks its really cool

#

ur a very creative person

#

iim using it

#

eexactly

#

i was about to say that

#

u never thought u would have stuff like this for example 10 years ago

#

and at the rate these stuff is progresing u dont know whether to be scared or excited

#

yeah its very crazy to think about

#

LOL ye

#

im mainly excited ngl

#

cuz when they get really good

#

like in a year or two from now

#

we can do insane things with them

#

if they become sentient and try to take our jobs and overtake humanity or something 🤣

#

hmm

#

yeah definitely

#

like at the rate of how fast they're growing rn they might become strong enough to replace human workers

primal orbit Apr 6, 2025, 10:31 PM

#

I see now both maverick and crystal have token limit in chat. So 24 karat was definitely Meta model.

frozen skiff Apr 6, 2025, 10:31 PM

#

primal orbit I see now both maverick and crystal have token limit in chat. So 24 karat was de...

24 karat is meta but we dont know which meta model it is

#

definitely not scout or maverick tho

primal orbit Apr 6, 2025, 10:32 PM

#

the responses are quite similiar

frozen skiff Apr 6, 2025, 10:32 PM

#

its prob some kind of testing experiment by lamma

frozen skiff Apr 6, 2025, 10:32 PM

#

primal orbit the responses are quite similiar

nah

#

they are kinda similar but

#

24 karat was WAYY more creative

#

and hallucinated way more

#

😮

primal orbit Apr 6, 2025, 10:33 PM

#

may be the same model but different settings like temperature or diff system prompts?

frozen skiff Apr 6, 2025, 10:33 PM

#

that might happen soon

#

when do u think it AI might become cappable enough to do such a thing

frozen skiff Apr 6, 2025, 10:33 PM

#

primal orbit may be the same model but different settings like temperature or diff system pro...

probably yeah

#

maverick or scout with a different system prompt and higher temperature or something

#

i mean if u think about it

#

it hasnt even been 5 years since the first chatgpt model released

#

so it might even become way stronger than that in 5 years

primal orbit Apr 6, 2025, 10:34 PM

#

I red in the news microsoft released today quake 2 web version, generated by a model in real time.

#

you can play 1 level

frozen skiff Apr 6, 2025, 10:35 PM

#

yeah even more than that

frozen skiff Apr 6, 2025, 10:35 PM

#

primal orbit I red in the news microsoft released today quake 2 web version, generated by a m...

whats that

#

hmnmm

primal orbit Apr 6, 2025, 10:36 PM

#

https://copilot.microsoft.com/wham

Microsoft Copilot: Your AI companion

Microsoft Copilot is your companion to inform, entertain, and inspire. Get advice, feedback, and straightforward answers. Try Copilot now.

#

here is the quake model

frozen skiff Apr 6, 2025, 10:37 PM

#

primal orbit https://copilot.microsoft.com/wham

😮

#

thats exactly what paws was talking about to

hardy pecan Apr 6, 2025, 10:44 PM

#

I honestly suspect 24_Karat_Gold was Llama-4-Maverick-03-26-Experimental, but now they defanged the model big time

#

2400 votes I assume were from us, the public right

#

Maybe they didn't change it on the fly though... idk

frozen skiff Apr 6, 2025, 10:45 PM

#

hardy pecan I honestly suspect 24_Karat_Gold was Llama-4-Maverick-03-26-Experimental, but no...

if its a differnt system prompt

#

it might be maverick

hardy pecan Apr 6, 2025, 10:45 PM

#

Okk

frozen skiff Apr 6, 2025, 10:45 PM

#

cuz i noticed the same intelligence

#

in maverick and 24 karat gold

hardy pecan Apr 6, 2025, 10:45 PM

#

yeah true, I was just thinking it was same smarts

#

but the outputs were unhinged

frozen skiff Apr 6, 2025, 10:46 PM

#

exactly

hardy pecan Apr 6, 2025, 10:46 PM

#

makes sense

frozen skiff Apr 6, 2025, 10:46 PM

#

it might be just maverick with like higher temperature or a different system prompt

hardy pecan Apr 6, 2025, 10:46 PM

#

Yeah

frozen skiff Apr 6, 2025, 10:46 PM

#

it also constnatly said it was from llama

hardy pecan Apr 6, 2025, 10:46 PM

#

yup, which i believe

lime coral Apr 6, 2025, 10:50 PM

#

IGA

#

O/I

torn mantle Apr 6, 2025, 11:42 PM

#

you guys have no idea how powerful that nightwhisper model is

#

i mean combined with google perfect vision

#

you can clone websites in seconds accurately

lime coral Apr 6, 2025, 11:55 PM

#

Stop hyping

#

Then people first question will be 9.11 > 9.81 and it will fail

torn mantle Apr 7, 2025, 12:18 AM

#

lime coral Stop hyping

shut up

#

those are facts

lime coral Apr 7, 2025, 12:20 AM

#

I know and i still mean what i said

torn mantle Apr 7, 2025, 12:20 AM

#

lime coral I know and i still mean what i said

why would i care about 9.11 > 9.81

#

or how much r's are in strawberry

keen beacon Apr 7, 2025, 12:21 AM

#

Shows model reasoning

torn mantle Apr 7, 2025, 12:21 AM

#

or a bouncing ball in weird ass shape?

#

idc about reasoning

lime coral Apr 7, 2025, 12:21 AM

#

torn mantle why would i care about 9.11 > 9.81

You still don’t understand I was ironic

keen beacon Apr 7, 2025, 12:45 AM

#

different

#

WTF

#

why does it use all lowercase

#

lmfao

kind cloud Apr 7, 2025, 12:54 AM

#

Screenshot_2025-04-07-09-53-59-334-edit_com.android.chrome.jpg

Screenshot_2025-04-07-09-54-22-104-edit_com.android.chrome.jpg

#

Llama