#general | Arena | Page 81

raven helm Aug 3, 2025, 3:17 PM

#

@flint sandal

flint sandal Aug 3, 2025, 3:17 PM

#

raven helm @flint sandal

Check all providers

raven helm Aug 3, 2025, 3:17 PM

#

🙃

#

What provides grok 4 heavy and o3-pro

#

?

flint sandal Aug 3, 2025, 3:18 PM

#

Use all providers provider

ocean vortex Aug 3, 2025, 3:18 PM

#

Looks like it goes the Cloudflare bypass route. I don't think this is very stable though

https://github.com/xtekky/gpt4free/issues/2832

GitHub

Add Lmarena.ai as a provider · Issue #2832 · xtekky/gpt4free

https://lmarena.ai/

flint sandal Aug 3, 2025, 3:18 PM

#

There is section for all

#

But i think they removed o3 pro

raven helm Aug 3, 2025, 3:19 PM

#

Thanks!

raven helm Aug 3, 2025, 3:19 PM

#

flint sandal But i think they removed o3 pro

Oh, why is that?

flint sandal Aug 3, 2025, 3:19 PM

#

Grok 4 heavy is still there but it doesnt work

woven surge Aug 3, 2025, 3:19 PM

#

Which free tools do u guys use

raven helm Aug 3, 2025, 3:19 PM

#

I just learnt about g4f

#

(g4f.dev)

#

Like LMArena but more features and better

#

now I will only use LM Arena for image generation and even then they need to add the ability to select aspect ratio

woven surge Aug 3, 2025, 3:20 PM

#

Are you talking to me?

raven helm Aug 3, 2025, 3:20 PM

#

Yep

stray aspen Aug 3, 2025, 3:20 PM

#

It's not better

#

It doesn't have good models

raven helm Aug 3, 2025, 3:20 PM

#

It has every model

woven surge Aug 3, 2025, 3:21 PM

#

Is it a website or what

raven helm Aug 3, 2025, 3:21 PM

#

At one point it had o3 pro

flint sandal Aug 3, 2025, 3:21 PM

#

There is image generation on g4f too

ocean vortex Aug 3, 2025, 3:21 PM

#

flint sandal But i think they removed o3 pro

just opened to see what this fuss is all about and those are all the models it gives me:

raven helm Aug 3, 2025, 3:21 PM

#

Ok, that’s it, they are better

flint sandal Aug 3, 2025, 3:21 PM

#

ocean vortex just opened to see what this fuss is all about and those are all the models it g...

Use other provider

#

Use allproviders

flint sandal Aug 3, 2025, 3:21 PM

#

raven helm Ok, that’s it, they are better

Did i just teach you about g4f?

raven helm Aug 3, 2025, 3:22 PM

#

Yep

#

I never heard about them, thanks!

woven surge Aug 3, 2025, 3:22 PM

#

Guys can I use this using phone

stray aspen Aug 3, 2025, 3:22 PM

#

What do you mean use all provider

flint sandal Aug 3, 2025, 3:22 PM

#

woven surge Guys can I use this using phone

Yeah

flint sandal Aug 3, 2025, 3:22 PM

#

stray aspen What do you mean use all provider

There is a provider called allproviders

raven helm Aug 3, 2025, 3:22 PM

#

woven surge Guys can I use this using phone

Yes, but the interface gets a bit bad sometimes

woven surge Aug 3, 2025, 3:22 PM

#

flint sandal Yeah

Oh thanks

raven helm Aug 3, 2025, 3:23 PM

#

Is there video? No, right?

woven surge Aug 3, 2025, 3:23 PM

#

Ok

#

Ok

flint sandal Aug 3, 2025, 3:23 PM

#

raven helm Is there video? No, right?

There is not

raven helm Aug 3, 2025, 3:23 PM

#

Yea, video costs too much money, right?

flint sandal Aug 3, 2025, 3:24 PM

#

There was like a site where you can request someone to generate something with best models and it was free but i dont remember what site it was. It was pretty active for sure and i almost always generated videos only by requests

#

Or it was like discord channel

ocean vortex Aug 3, 2025, 3:25 PM

#

flint sandal Use other provider

which one is for o3-pro?

raven helm Aug 3, 2025, 3:25 PM

#

flint sandal There was like a site where you can request someone to generate something with b...

Oh, but I don’t think that would work well with veo 3, as for some reason, even on the ultra plans, you don’t get unlimited

woven surge Aug 3, 2025, 3:25 PM

#

flint sandal There was like a site where you can request someone to generate something with b...

Check your history

flint sandal Aug 3, 2025, 3:25 PM

#

ocean vortex which one is for o3-pro?

O3 pro is removed now. But grok 4 heavy is there but its not working

raven helm Aug 3, 2025, 3:26 PM

#

ocean vortex which one is for o3-pro?

woven surge Aug 3, 2025, 3:26 PM

#

I found a server with good video gen quality but due to a lot of traffic it is down

flint sandal Aug 3, 2025, 3:26 PM

#

woven surge Check your history

I have too many ai tools in history. I have like builder.ai, create.xyz, picoapps.xyz,

ocean vortex Aug 3, 2025, 3:27 PM

#

flint sandal O3 pro is removed now. But grok 4 heavy is there but its not working

ok so what is there that IS working and is not some crappy model? lol

flint sandal Aug 3, 2025, 3:28 PM

#

ocean vortex ok so what is there that IS working and is not some crappy model? lol

Claude 4 sonnet thinking, o3, kimi k2, GLM-4.5

#

And gemini 2.5 pro

woven surge Aug 3, 2025, 3:29 PM

#

Can someone pls explain me this interface

raven helm Aug 3, 2025, 3:29 PM

#

Yea, it’s bad on mobile

#

Technically you can use it, but it won’t be a good ui

flint sandal Aug 3, 2025, 3:30 PM

#

But sometimes you may get output like this

#

woven surge Aug 3, 2025, 3:32 PM

#

Oh ok there is something like pollination and openai...
But how do you change the model

raven helm Aug 3, 2025, 3:32 PM

#

Can someone explain what each one of these does?

ocean vortex Aug 3, 2025, 3:33 PM

#

flint sandal And gemini 2.5 pro

Yeah all of those have more reliable platforms I can already use them at and have access for.

#

Grok4-heavy, mystery models etc those would be interesting to try

raven helm Aug 3, 2025, 3:34 PM

#

But for other models like Claude 4 opus you can’t add system prompt unless you have the api but here you can

ocean vortex Aug 3, 2025, 3:34 PM

#

but they are all either down or removed

flint sandal Aug 3, 2025, 3:34 PM

#

Idk i just use g4f cuz i have all in one place and i can use og models

flint sandal Aug 3, 2025, 3:34 PM

#

ocean vortex but they are all either down or removed

Nah. Just opus, o3 pro and grok 4 heavy

#

Opus 4 thinking is working too

ocean vortex Aug 3, 2025, 3:35 PM

#

flint sandal Nah. Just opus, o3 pro and grok 4 heavy

returns an error for all mystery models too

raven helm Aug 3, 2025, 3:35 PM

#

Oh, they have mystery models?

flint sandal Aug 3, 2025, 3:35 PM

#

Idk for me all works great

ocean vortex Aug 3, 2025, 3:36 PM

#

and all remaining lmarena ones

ocean vortex Aug 3, 2025, 3:36 PM

#

raven helm Oh, they have mystery models?

lmarena aliases like wolfstride, steve....

#

yeah they do

#

but those don't work lol

raven helm Aug 3, 2025, 3:36 PM

#

I know, but you need to use battle mode to get them and here you can get them directly

flint sandal Aug 3, 2025, 3:37 PM

#

Is there still the gpt-5 secret model on lmarena? Because i cant find it on battle

ocean vortex Aug 3, 2025, 3:37 PM

#

raven helm I know, but you need to use battle mode to get them and here you can get them di...

except you can't cause it returns error

raven helm Aug 3, 2025, 3:37 PM

#

Oh

#

They have video generation!

flint sandal Aug 3, 2025, 3:38 PM

#

Use lmarena web not lmarena new and it should work, but for me new and web works perfectly fine

raven helm Aug 3, 2025, 3:38 PM

#

I just use the auto provider

flint sandal Aug 3, 2025, 3:38 PM

#

raven helm I just use the auto provider

They have only sora for video

wheat onyx Aug 3, 2025, 3:39 PM

#

flint sandal They have only sora for video

sora 2 is expected in coming months too

raven helm Aug 3, 2025, 3:39 PM

#

For me it shows others

#

I don’t have Sora

#

On that list

flint sandal Aug 3, 2025, 3:39 PM

#

Use provider name video (video generation), there is search and sora in it

#

But sora takes like 30 minutes here to generate

#

Or just gives me error

raven helm Aug 3, 2025, 3:40 PM

#

Oh

#

If you go to auto, there is a lot more

flint sandal Aug 3, 2025, 3:41 PM

#

If you want veo 3 for free. Download aSim app on phone and search for "Glow" then you can generate one video per day

#

With sound

#

And not even the fast model, the quality one.

#

2 videos per day if you are a new user

raven helm Aug 3, 2025, 3:42 PM

#

Oh wow

#

They don’t have o3

#

Why is that?

woven surge Aug 3, 2025, 3:45 PM

#

flint sandal If you want veo 3 for free. Download aSim app on phone and search for "Glow" the...

Where to search glow I mean play store google or where

flint sandal Aug 3, 2025, 3:46 PM

#

Search aSim on google play, and on the aSim search Glow

flint sandal Aug 3, 2025, 3:46 PM

#

raven helm They don’t have o3

They? Who?

raven helm Aug 3, 2025, 3:46 PM

#

g4f

woven surge Aug 3, 2025, 3:47 PM

#

Ok

flint sandal Aug 3, 2025, 3:47 PM

#

Provider auto dont have

#

Anyproviders have

woven surge Aug 3, 2025, 3:47 PM

#

flint sandal Search aSim on google play, and on the aSim search Glow

How are you supposed to search on an esim

flint sandal Aug 3, 2025, 3:48 PM

#

ASim

#

Not esim

#

raven helm Aug 3, 2025, 3:49 PM

#

flint sandal Anyproviders have

I also checked that, but they didn’t

woven surge Aug 3, 2025, 3:49 PM

#

flint sandal ASim

Oh sorry my bad😁

#

Asim build and share?

flint sandal Aug 3, 2025, 3:50 PM

#

Wait. G4f just removed lmarena new, legacy and op. There is no o3, and grok 4 heavy

raven helm Aug 3, 2025, 3:50 PM

#

Yea

#

There was just HR

#

Or something like that

#

LMArena HAR

#

That’s the only one which exists

flint sandal Aug 3, 2025, 3:51 PM

#

Yeah

raven helm Aug 3, 2025, 3:51 PM

#

Why did they do this?

flint sandal Aug 3, 2025, 3:51 PM

#

g4f is confusing now.

#

I guess its time to pay for ai...

raven helm Aug 3, 2025, 3:52 PM

#

No, I propose: go to LM Arena when you need to use an AI, but if it doesn’t exist there go to g4f

flint sandal Aug 3, 2025, 3:53 PM

#

And if it doesnt exist on g4f use opus 4 thinking on lmarena

raven helm Aug 3, 2025, 3:53 PM

#

🙃

#

This is what LMArena HAR is

#

https://github.com/lmarena/arena-hard-auto

GitHub

GitHub - lmarena/arena-hard-auto: Arena-Hard-Auto: An automatic LLM...

Arena-Hard-Auto: An automatic LLM benchmark. . Contribute to lmarena/arena-hard-auto development by creating an account on GitHub.

brave orbit Aug 3, 2025, 3:57 PM

#

Poll: What Is The OS you Loved The Most

Winning answer: 🪟 Windows (9 of 14 votes)

woven surge Aug 3, 2025, 4:11 PM

#

Guys listen

#

U can get Veo 3 for free!!

#

Is anyone listening?

#

Here's the result

hollow imp Aug 3, 2025, 4:28 PM

#

What about 0.2 fps

raven helm Aug 3, 2025, 4:34 PM

#

Niceee!

gusty loom Aug 3, 2025, 4:36 PM

#

woven surge U can get Veo 3 for free!!

How?!?!?!?! can you also give the prompt?!?! ITS CRAZY!

woven surge Aug 3, 2025, 4:39 PM

#

Do u need just the prompt

woven surge Aug 3, 2025, 4:39 PM

#

gusty loom How?!?!?!?! can you also give the prompt?!?! ITS CRAZY!

{
"prompt": "A hyper-dynamic and cinematic 8-second Coca-Cola commercial focusing on the ultimate moment of refreshment. The ad is a rapid, sensory explosion of cold, fizz, and vibrant joy, culminating in the iconic brand reveal.",
"duration_s": 8,
"style": [
"cinematic",
"hyper-realistic",
"vibrant high-contrast colors",
"shot on ARRI Alexa with anamorphic lenses",
"energetic",
"sensory",
"uplifting"
],
"negative_prompt": "slow, dull, blurry, distorted logo, weird hands, flat lighting, generic, sad",
"scenes": [
{
"prompt": "Extreme close-up. An iconic glass Coca-Cola bottle, covered in shimmering ice-cold condensation, is opened in glorious slow-motion (240fps). A fizzy mist erupts from the cap with a satisfying 'psssht'.",
"duration_s": 2.5,
"camera": ["macro detail", "slow motion"]
},
{
"prompt": "A dynamic match cut. As the bottle tilts to pour, the scene instantly cuts to a person's face, eyes closed in pure bliss as they take a refreshing drink. The background explodes into vibrant, joyful color and light, as if the drink transformed the world around them.",
"duration_s": 4,
"camera": ["tight close-up on face", "energetic whip pan effect", "beautiful lens flares"]
},
{
"prompt": "The final hero shot. A perfect, glistening glass of Coca-Cola, filled with ice and fizzing bubbles. The shot is clean and crisp. The red Coca-Cola logo is perfectly framed in the background.",
"duration_s": 1.5,
"camera": ["pristine studio quality product shot", "static"]
}
],
"generation_settings": {
"high_quality": true
}
}

Here it is
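A quick sanity check on a structured prompt like the one above is that the per-scene durations add up to the overall `duration_s`. The sketch below assumes only the field names visible in the posted JSON (`duration_s`, `scenes`); the scene text is shortened here, the helper name `scene_duration_gap` is made up for illustration, and none of this is a documented Veo 3 schema.

```python
import json

# Trimmed copy of the structure posted above; the long prompt texts are shortened.
prompt_json = """
{
  "duration_s": 8,
  "scenes": [
    {"prompt": "bottle opening, slow motion", "duration_s": 2.5},
    {"prompt": "match cut to a person drinking", "duration_s": 4},
    {"prompt": "final hero shot of the glass", "duration_s": 1.5}
  ]
}
"""


def scene_duration_gap(spec: dict) -> float:
    """Return total duration minus the sum of scene durations (0.0 means they match)."""
    scene_total = sum(scene["duration_s"] for scene in spec["scenes"])
    return spec["duration_s"] - scene_total


spec = json.loads(prompt_json)
gap = scene_duration_gap(spec)
print("scene durations match" if abs(gap) < 1e-9 else f"off by {gap} s")
```

For the prompt as posted, the three scenes (2.5 s, 4 s, 1.5 s) do add up to the 8 s total.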

gusty loom Aug 3, 2025, 4:40 PM

#

How did you generate it?

woven surge Aug 3, 2025, 4:40 PM

#

Using Veo 3

gusty loom Aug 3, 2025, 4:40 PM

#

No I mean the prompt

woven surge Aug 3, 2025, 4:40 PM

#

For Free

gusty loom Aug 3, 2025, 4:40 PM

#

How did you get for free?

woven surge Aug 3, 2025, 4:40 PM

#

gusty loom How did you get for free?

Add me to get it for free

gusty loom Aug 3, 2025, 4:41 PM

#

added 🙂

woven surge Aug 3, 2025, 4:41 PM

#

gusty loom No I mean the prompt

I said to Gemini to generate a perfect prompt for a coca cola advertisement and it did this

ocean vortex Aug 3, 2025, 4:48 PM

#

brave orbit

9 people haven't tried Mac OS 👀

willow grail Aug 3, 2025, 4:53 PM

#

gpt5 high is agi!!!

#

@rare python what?

rare python Aug 3, 2025, 4:54 PM

#

@civic flame so many people use your benchmark without source

willow grail Aug 3, 2025, 4:54 PM

#

rare python @civic flame so many people use your benchmark without source

what source

rare python Aug 3, 2025, 4:55 PM

#

willow grail what source

you posted the benchmark image without the source of the benchmark creator

willow grail Aug 3, 2025, 4:55 PM

#

ok and? since when does anyone post.... twitter links

#

since when do ai people care about copyright lol

rare python Aug 3, 2025, 4:57 PM

#

willow grail since when do ai people care about copyright lol

just a way of respect to the person that took money and time to do this benchmark

willow grail Aug 3, 2025, 4:58 PM

#

i thought one gets gpt5 for free

#

why else would they give out it to specific people

#

ps: he gets his respect on twitter etc

civic flame Aug 3, 2025, 5:12 PM

#

willow grail ok and? since when does anyone post.... twitter links

uh, all the time?

#

😭

willow grail Aug 3, 2025, 5:13 PM

#

i just wanted to show the pics, that's what matters. people will find your posts either way if they ask or search for this

civic flame Aug 3, 2025, 5:13 PM

#

willow grail i thought one gets gpt5 for free

benchmarking claude 4 opus, o3 high, o4 mini high and grok 4 weren't free

civic flame Aug 3, 2025, 5:13 PM

#

willow grail i just wanted to show the pics thats what matters. people will find out either w...

lol ok!

ocean vortex Aug 3, 2025, 5:13 PM

#

willow grail gpt5 high is agi!!!

https://girlcockx.com/synthwavedd/status/1951992959169085685

leo 🐈 (@synthwavedd)

🧵 Grok-4 scores 45% on Heiroglyph, making it the SoTA model publicly available and putting it on-par with GPT-5.

Observations:
- This is an impressive performance
- This model reasons for an extremely long time (10-20 mins per question); often tries to "brute force" the answer

Quoting leo 🐈 (@synthwavedd)

o3-pro and Grok 4 should be done by the end of tomorrow! thanks for the support today, goodnight

**💬 24 🔁 8 ❤️ 134 👁️ 13.9K **

#

but this benchmark seems full of sh'it tbh. o4-mini-high scores 5X of o3-high. There's no dataset, paper, or anything short of that tweet

#

Doesn't look like it's reliable at all lol

leaden palm Aug 3, 2025, 5:26 PM

#

ocean vortex but this benchmark seems full of sh'it tbh. o4-mini-high scores 5X of o3-high. T...

this actually isn't uncommon

#

ocean vortex Aug 3, 2025, 5:29 PM

#

leaden palm this actually isn't uncommon

5X is extremely uncommon

#

ArtificialAnalysis doesn't really have benchmarks in their test suite that test things bigger models are good at, not yet at least. So the result there is not too surprising, o4-mini-high is great at most benchmarks. But what it certainly isn't is scoring 5X of the o3 score in any established and reliable metric lmao

leaden palm Aug 3, 2025, 5:32 PM

#

idk... i think benchmarks that measure different things exist and are good, and the same training techniques that make o4 mini really good at stem may also make it really good at this benchmark

keen beacon Aug 3, 2025, 5:32 PM

#

i think the benchmark is interesting, but there might need to be more questions

ocean vortex Aug 3, 2025, 5:33 PM

#

leaden palm idk... i think benchmarks that measure different things exist and are good, and ...

That is good for sure, but I just see red flags for now when it comes to this specific benchmark tbh

leaden palm Aug 3, 2025, 5:33 PM

#

yeah more models and more questions needed

#

right now, o3's CI is 0.9%-23.6% and o4-mini's is 11.2%-46.9%
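To see why intervals this wide point to a small question set, here is a minimal sketch of a 95% Wilson score interval. The counts are an assumption, not something stated in the chat or the tweet: 20 questions, with 1 correct for o3 and 5 correct for o4-mini. Under that assumption the function reproduces the quoted ranges, and the "5X" gap comes down to four questions.

```python
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * (p * (1 - p) / trials + z**2 / (4 * trials**2)) ** 0.5 / denom
    return center - half, center + half


# Hypothetical counts: 20 questions, 1 correct vs 5 correct.
for name, correct in [("o3", 1), ("o4-mini", 5)]:
    low, high = wilson_interval(correct, 20)
    print(f"{name}: {low:.1%} - {high:.1%}")
# o3: 0.9% - 23.6%
# o4-mini: 11.2% - 46.9%
```

The two intervals overlap between roughly 11% and 24%, so a benchmark of this assumed size can't cleanly separate the two models.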

ocean vortex Aug 3, 2025, 5:39 PM

#

keen beacon i think the benchmark is interesting, but there might need to be more questions

Is there actually at least a single test question publicly available from that benchmark? Kinda curious what are you basing that interest on

#

or is it just the fact alone that they mentioned "lateral reasoning" in their tweet?

keen beacon Aug 3, 2025, 5:41 PM

#

leo has posted one of the questions before in the past (i think before he had a formal benchmark and i find that type of question interesting)

steady vale Aug 3, 2025, 6:24 PM

#

https://x.com/sama/status/1952071832972186018

Sam Altman (@sama)

@nicdunz turns out yes!

ornate stump Aug 3, 2025, 7:01 PM

#

steady vale https://x.com/sama/status/1952071832972186018

so, release soon

torn mantle Aug 3, 2025, 7:24 PM

#

ocean vortex but this benchmark seems full of sh'it tbh. o4-mini-high scores 5X of o3-high. T...

we do need more details about the benchmark

#

also im kinda curious if there is any recency included

#

since its hieroglyph i dont think it is

#

but still we need more details

#

also grok 4 isnt bad at reasoning

#

its just that its strict at following the normal token distribution

#

which makes it more generic

#

the answers given by grok 4 can be generated by any mid-sized model

#

you dont feel the uniqueness

flint sandal Aug 3, 2025, 7:26 PM

#

woven surge Using Veo 3

Did you use aSim?

torn mantle Aug 3, 2025, 7:26 PM

#

and also its reasoning is so so inefficient

stray aspen Aug 3, 2025, 7:26 PM

#

where did that gpt 5 benchmark come from

flint sandal Aug 3, 2025, 7:27 PM

#

I remember "leaks" of GPT-5 in 2023.

torn mantle Aug 3, 2025, 7:28 PM

#

which makes me question the intelligence of the instruct model

flint sandal Aug 3, 2025, 7:28 PM

#

torn mantle which makes me question the intelligence of the instruct model

CoT version?

torn mantle Aug 3, 2025, 7:30 PM

#

and also approach used for cot

#

i dont know what they are using exactly but its just so inefficient

#

its not well balanced, to when 'a lot of reasoning' is needed or not

#

feels like a weaker gemini 2.5 pro version

#

also it didnt improve a bit on multilingual

#

and lets not forget how its so bad at coding

stray aspen Aug 3, 2025, 7:33 PM

#

@torn mantle where did you get this gpt 5 benchmark?

torn mantle Aug 3, 2025, 7:33 PM

#

leo shared it

torn mantle Aug 3, 2025, 7:33 PM

#

ocean vortex https://girlcockx.com/synthwavedd/status/1951992959169085685

here

stray aspen Aug 3, 2025, 7:33 PM

#

ok

torn mantle Aug 3, 2025, 7:34 PM

#

a model that cant generate a good UI/UX is not worth it

#

i know they separated the models, grok coder and grok 4

#

but still

#

grok 4 is so bad at UI, its just so embarrassing, cant you just run half of the data used in grok coder into grok 4?

#

they seem so lost

#

without any clear objectives/goals

#

like what the hell is this companion thingy

#

i cant believe he spent billions for this

#

i really cant

warm fulcrum Aug 3, 2025, 7:36 PM

#

how does leo have early access to gpt5?

keen beacon Aug 3, 2025, 7:43 PM

#

i assume he used it when it was accidentally leaked for a bit

snow solar Aug 3, 2025, 7:44 PM

#

Hi, I'm new here, but I have a question: is this free or can you also take out a subscription?

torn mantle Aug 3, 2025, 7:49 PM

#

warm fulcrum how does leo have early access to gpt5?

openai gave access to perplexity and some people found a way to use it through perplexity api

stray aspen Aug 3, 2025, 7:53 PM

#

when will we get deepseek r2

obsidian shell Aug 3, 2025, 8:45 PM

#

is arena down?

torn mantle Aug 3, 2025, 8:49 PM

#

stray aspen when will we get deepseek r2

chinese are speculating monday we will have a new update like deepseek v4

iron meadow Aug 3, 2025, 8:58 PM

#

@echo aurora down

timber kiln Aug 3, 2025, 9:03 PM

#

torn mantle chinese are speculating monday we will have a new update like deepseek v4

Rumors are non reasoning?
That would be surprising

grand oak Aug 3, 2025, 9:04 PM

#

arena seems to be down ?

torn mantle Aug 3, 2025, 9:07 PM

#

timber kiln Rumors are non reasoning? That would be surprising

yea deepseek usually updates their instruct model first

#

so it will be deepseek v4 -> deepseek r2

echo aurora Aug 3, 2025, 9:13 PM

#

iron meadow @echo aurora down

Thanks for the heads up. Will let the team know.

iron meadow Aug 3, 2025, 9:14 PM

#

echo aurora Thanks for the heads up. Will let the team know.

back up! tysm

echo aurora Aug 3, 2025, 9:15 PM

#

iron meadow back up! tysm

Hmm not for me.

obsidian shell Aug 3, 2025, 9:15 PM

#

back for me too

wheat onyx Aug 3, 2025, 9:15 PM

#

https://x.com/sama/status/1952071832972186018?s=19

Sam Altman (@sama)

@nicdunz turns out yes!

#

First gpt5 sighting

echo aurora Aug 3, 2025, 9:17 PM

#

obsidian shell back for me too

Okay good to know

grand oak Aug 3, 2025, 9:17 PM

#

still giving me some trouble, it keeps popping up the security check looping

ocean vortex Aug 3, 2025, 9:18 PM

#

wheat onyx https://x.com/sama/status/1952071832972186018?s=19

There's probably gonna be gpt5 pro as well, come to think of it...

obsidian shell Aug 3, 2025, 9:18 PM

#

echo aurora Okay good to know

it just keeps saying that i didnt accept tos even after i do

refreshing didnt work

hushed sand Aug 3, 2025, 9:23 PM

#

it’s back for me too but my sessions aren’t there

wheat onyx Aug 3, 2025, 9:28 PM

#

ocean vortex There's probably gonna be gpt5 pro as well, come to think of it...

They've said there are different versions for each level

iron meadow Aug 3, 2025, 9:45 PM

#

echo aurora Okay good to know

Gone again, appeared for a few seconds I guess

echo aurora Aug 3, 2025, 9:47 PM

#

Yeah things are a bit on and off atm

storm needle Aug 3, 2025, 9:54 PM

#

https://stupidpenisx.com/taker_of_whizz/status/1952112757467394389

whizz taker (@taker_of_whizz)

Deleted @sama post

keen beacon Aug 3, 2025, 9:56 PM

#

storm needle https://stupidpenisx.com/taker_of_whizz/status/1952112757467394389

https://tenor.com/view/sweet-baby-jesus-what-is-that-shaed-off-sunglasses-gif-15770500

Tenor

hollow imp Aug 3, 2025, 10:08 PM

#

Can gpt5 or gemini Deepthink solve this simple question? If you are a calc student try this yourself.

obsidian shell Aug 3, 2025, 10:11 PM

#

@echo aurora we cant select models and it fails to accept tos

just a status report

hollow imp Aug 3, 2025, 10:37 PM

#

???

#

I just produced a message from o3 search rn on lmarena

#

Johann Wolfgang von Goethe
“Always hold fast to the present. Every situation, indeed every moment, is of infinite value; for it is the representative of a whole eternity.”
Why it is similar: Goethe says that each single moment contains (“represents”) all time. If every moment is “of infinite value,” the care we take with this moment determines the worth of all the others.
Aristotle
“In a word, acts of any kind produce habits or characters of the same kind. Hence we ought to make sure that our acts are of a certain kind; for the resulting character varies as they vary.” — Nicomachean Ethics II 1103 b
Why it is similar: Aristotle ties individual acts to the formation of permanent habits and character. Handle one act well, and you build the habit that will govern “every” act that follows.

(Note: the popular line “We are what we repeatedly do. Excellence, then, is not an act but a habit” is Will Durant’s 1926 paraphrase of the passage above, not Aristotle’s own wording.)
Friedrich Nietzsche
“The question in each and every thing, ‘Do you desire this once more and innumerable times more?’ would lie upon your actions as the greatest weight.” — The Gay Science, § 341 (“The greatest weight”)
Why it is similar: Nietzsche’s thought experiment of Eternal Recurrence asks you to act as though the present deed will recur “innumerable times.” How you handle this minute is exactly how you would handle an eternity of identical minutes.
Niccolò Machiavelli
“And above all things, a prince ought to live amongst his people in such a way that no unexpected circumstances, whether of good or of evil, shall make him change.” — The Prince, ch. 8 (Marriott tr.)
Why it is similar: Machiavelli insists on a steadiness that does not waver with events; the manner in which the ruler handles “today” or any sudden moment must be the manner in which he handles all situations, great or small.

ocean vortex Aug 3, 2025, 10:41 PM

#

AGI has been discovered 🫡

ocean vortex Aug 3, 2025, 10:58 PM

#

It's kinda interesting that o3 is this close to the top spot now:

ocean vortex Aug 3, 2025, 11:13 PM

#

ocean vortex AGI has been discovered 🫡

Shared context across models with this new interface also means this can now happen. Doesn't make the model in question look good at all. This is a complete mess lol

#

with a stretch it kinda could be a valid test to see how strongly the model links itself with its identity and how easily it can lose it with just some foreign assistant messages in the context, but at the same time... you can't properly assess this before voting. Shared context does make the models confused though, acting in unpredictable ways, and that is more of just an example of that

jade egret Aug 3, 2025, 11:54 PM

#

storm needle https://stupidpenisx.com/taker_of_whizz/status/1952112757467394389

wait in 20 hour gpt-5 release?

zinc ore Aug 3, 2025, 11:59 PM

#

Fake

jade egret Aug 4, 2025, 12:00 AM

#

bru

#

is gpt 5 gonna be better than deep think?

crimson oasis Aug 4, 2025, 12:00 AM

#

How can i get my AI, which is using an API, submitted?

wheat onyx Aug 4, 2025, 12:04 AM

#

jade egret wait in 20 hour gpt-5 release?

Probably os. Maybe 5 too idk

misty star Aug 4, 2025, 12:07 AM

#

9,000 members 🗣️ 🗣️ 🗣️

zinc ore Aug 4, 2025, 12:09 AM

#

jade egret is gpt 5 gonna be better than deep think?

Should be above the weaker version, else that's disappointing

wheat onyx Aug 4, 2025, 12:09 AM

#

I imagine gpt5 can deep think? It's currently based on o3?

golden ocean Aug 4, 2025, 2:02 AM

#

is lmarena image api dead again

#

or skill issue on my side

wintry tinsel Aug 4, 2025, 2:06 AM

#

This community has been as boring as eating dirt and sawdust since the last sota Claude opus released

#

I’ve literally been watching the paint dry waiting for a new sota

sacred quail Aug 4, 2025, 2:07 AM

#

Gemini 3/25 was big leap honestly

#

Surprised everyone

wintry tinsel Aug 4, 2025, 2:07 AM

#

Grok 4 was a bigger ruse than big chungus himself, a lot of hot air and “Elon hype”

#

GPT 5 will do many things but most importantly it will deliver me from mind rending boredom

olive mesa Aug 4, 2025, 2:09 AM

#

no new sota in too long, it's supposed to be every other week

#

only some interesting research papers on ai self improvement and such

wintry tinsel Aug 4, 2025, 2:11 AM

#

I expect a new Sota every 6-8 weeks

#

And the industry has not been keeping the pace

#

It will be worth it if the next updates are major

golden ocean Aug 4, 2025, 2:14 AM

#

gpt 5o

unborn shell Aug 4, 2025, 2:23 AM

#

Hello, one question: The rights of the videos I create on this lmarena server are mine or lmarena's?

jolly raven Aug 4, 2025, 2:39 AM

#

One question, do you plan to add image to image like kontext flux or gpt1?

whole sundial Aug 4, 2025, 3:05 AM

#

it's already there

#

just select that and upload an image and go to direct chat and select the model @jolly raven

#

both of those models are available (kontext in all 3 variations, dev, pro, and max and of course gpt-image-1)

jolly raven Aug 4, 2025, 3:07 AM

#

Oh, I didn't realize, thanks

#

Is there a limit, with those modes?

whole sundial Aug 4, 2025, 3:10 AM

#

yes, but you likely won't hit it unless you are continuously generating images @jolly raven

#

i think it's like 20 per hour or something like that

deft vigil Aug 4, 2025, 3:14 AM

#

cuttlefish from where ?

#

openai ?

jolly raven Aug 4, 2025, 3:16 AM

#

Thanks for the information

terse mango Aug 4, 2025, 3:36 AM

#

would it support Chinese?

rare python Aug 4, 2025, 4:55 AM

#

@echo aurora any news about Claude model support image uploading?

#

You guys disabled it for few months

verbal nimbus Aug 4, 2025, 5:14 AM

#

hollow imp Can gpt5 or gemini Deepthink solve this simple question? If you are a calc stude...

Gemini 2.5 Pro solved it in AIStudio. I gave it the equation in Latex. It found the implicit solution 1 + e^{\frac{x^2}{y} - x} = A e^x.
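The original equation was posted as an image and isn't reproduced in this log, so the result can't be checked against it directly. As a hedged sanity check, differentiating the stated implicit solution shows which first-order ODE it satisfies; `u` is shorthand introduced only for this sketch, and `A` is treated as an arbitrary constant.

```latex
u = \frac{x^2}{y} - x, \qquad 1 + e^{u} = A e^{x}
\;\Longrightarrow\; e^{u}\,u' = A e^{x} = 1 + e^{u}
\;\Longrightarrow\; u' = 1 + e^{-u}

u' = \frac{2x}{y} - \frac{x^2 y'}{y^2} - 1
\;\Longrightarrow\;
y' = \frac{y\left(2x - 2y - y\,e^{\,x - x^2/y}\right)}{x^2}
```

If the posted problem rearranges to that last form, the quoted solution is consistent with it.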

maiden kiln Aug 4, 2025, 6:01 AM

#

saw there were issues today and my chats are also gone, any way to retrieve them or just gotta start fresh

echo aurora Aug 4, 2025, 6:24 AM

#

rare python @echo aurora any news about Claude model support image uploading?

Nothing new to share

rare python Aug 4, 2025, 6:25 AM

#

Ok but it's weird

cedar tide Aug 4, 2025, 7:03 AM

#

https://fixupx.com/TencentHunyuan/status/1952262079051940322?t=GjWiqmR60n9e66mnOVWO4A&s=19

Hunyuan (@TencentHunyuan)

🚀We're expanding the Tencent Hunyuan open-source LLM ecosystem with four compact models (0.5B, 1.8B, 4B, 7B)! Designed for low-power scenarios like consumer-grade GPUs, smart vehicles, smart home devices, mobile phones, and PCs, these models support cost-effective fine-tuning for vertical applications, empowering developers and enterprises with a broader selection for diverse use cases.

Key Capabilities:
✅Available on GitHub and Hugging Face for direct download.
✅Choose "fast thinking" for concise output or "slow thinking" for deeper, comprehensive inference, adaptable to different scenarios.
✅Achieve industry-leading scores on multiple public test sets in areas like language understanding, mathematics, and reasoning.
✅Offer outstanding agent capabilities, including task planning, tool calling, and complex decision-making, alongside a native 256K long-context window.
✅Each of the four models only requires a si…

calm sequoia Aug 4, 2025, 9:25 AM

#

ocean vortex It's kinda interesting that o3 is this close to the top spot now:

Somehow GPT models are always rising 🙂

flat glade Aug 4, 2025, 9:29 AM

#

#video-arena-1 I want you to generate a video that shows my furnace project. It should look like a real furnace. I also attached two images, open and closed. You need to show molten metal inside the furnace, and then show it opening and then closing.


neon idol Aug 4, 2025, 9:31 AM

#

in your opinion, which is better for realistic images: seedream 3.0 or gpt image 1?

calm sequoia Aug 4, 2025, 9:41 AM

#

new Artificial Analysis bench scores dropped

#

It's so strange that the o4 mini is always at the top. What's the size of this beast. Is it really "mini"? If yes, then Grok 4 is a failure

ocean vortex Aug 4, 2025, 9:45 AM

#

calm sequoia new Artificial Analysis bench scores dropped

Minor changes huh. They added instruction following (IFBench) and are now using AIME2025 for math.

#

But Grok4 is no longer beating everything lol

calm sequoia Aug 4, 2025, 9:46 AM

#

Hmm I didn't realize Opus is so much more expensive than Grok. Not a failure.

ocean vortex Aug 4, 2025, 9:47 AM

#

calm sequoia It's so strange that the o4 mini is always at the top. What's the size of this b...

Yeah it is small. It's the same reason 2.5Flash is so high. With reasoning smaller models can still perform great on most benchmarks. I think if Google wanted they could make it notably closer to Pro, which has received more updates than Flash

calm sequoia Aug 4, 2025, 9:48 AM

#

calm sequoia New Artificial Analysis bench scores dropped

Why is o3 not on this chart? 😄

ocean vortex Aug 4, 2025, 9:48 AM

#

For OpenAI they do not care as much because their naming is smart (in this case)

#

o4-mini is named as if it were one generation ahead, so they get the benefit of the doubt and people don't question o3 as much

calm sequoia Aug 4, 2025, 9:50 AM

#

True

#

This bench made me hyped for GPT 5. It will be wild if it's based on something like o4.

ocean vortex Aug 4, 2025, 9:51 AM

#

calm sequoia Why is o3 not on this chart? 😄

where did you get it from? I looked up o3-pro specifically on their website for it to show up relative to other models lol

calm sequoia Aug 4, 2025, 9:51 AM

#

#

homepage

ocean vortex Aug 4, 2025, 9:52 AM

#

yeah that's just incomplete huh

#

they should have shown o3 and then truncated say gpt4o score instead

#

which is an old irrelevant model now

#

Btw Google fans won't be happy

#

2.5Pro score dropped now

neon idol Aug 4, 2025, 9:56 AM

#

ocean vortex 2.5Pro score dropped now

in my opinion it is the best AI in the world

ocean vortex Aug 4, 2025, 9:58 AM

#

neon idol in my opinion it is the best AI in the world

You can't be basing your opinion on just the impression. But yeah there are benchmarks where 2.5Pro is ahead for sure

neon idol Aug 4, 2025, 9:58 AM

#

ocean vortex You can't be basing your opinion on just the impression. But yeah there are benc...

I have used 2.5 Pro for everything: coding, school, life, and other things

ocean vortex Aug 4, 2025, 9:58 AM

#

I suspect they tried to favor Anthropic more with this change. But they should have added SimpleQA as well at least...

#

Then it wouldn't show that o4-mini-high > 2.5Pro

neon idol Aug 4, 2025, 10:00 AM

#

ocean vortex Then it wouldn't show that o4-mini-high > 2.5Pro

the problem is that with o4-mini-high you have to wait like 2 minutes for a response

ocean vortex Aug 4, 2025, 10:01 AM

#

neon idol the problem is that with o4-mini-high you have to wait like 2 minutes for a respon...

it's actually fairly fast to be fair. Nowhere near as slow as many open-source reasoning models.

neon idol Aug 4, 2025, 10:02 AM

#

ocean vortex it's actually fairly fast to be fair. Nowhere near as slow as many open-source r...

in your opinion, which is the best AI image generator for realistic images?

cedar tide Aug 4, 2025, 10:03 AM

#

Why is no one talking about this model being so good in benchmarks?
https://fixupx.com/KwaiAICoder/status/1947312634203902301?t=lXmBCkQyKo4FIyNA_nkPvA&s=19

KwaiAICoder (@KwaiAICoder)

🚀 Excited to introduce KAT-V1 (Kwaipilot-AutoThink) – a breakthrough 40B large language model from the Kwaipilot team!

KAT-V1 dynamically switches between reasoning and non-reasoning modes to address the “overthinking” problem in complex reasoning tasks.

Key Highlights:
📌 40B model rivals DeepSeek-R1 (671B) across benchmarks.
📌 200B version in training shows significant leads over Qwen, DeepSeek, & LLaMA.
📌 40B outperforms all open-source models in the leakage-controlled LiveCodeBench Pro.

Innovations:
🧠 Step-SRPO: New RL paradigm with intermediate supervision for better reasoning-mode control.
🔄 MTP + Heterogeneous Distillation: Efficient reasoning injection, cutting down training costs.
🏗️ Real-world deployment: Integrated into Kuaishou's internal coding assistant, Kwaipilot.

Try the Model & Read the Paper:
🔗 Model on Hugging Face…

ocean vortex Aug 4, 2025, 10:04 AM

#

neon idol in your opinion, which is the best AI image generator for realistic images?

Of the commercial ones, probably ChatGPT. It has the big advantage of understanding language, since it's an LLM that is generating your images. Conventional txt2img models do not really understand language very well and mostly just associate certain keywords with certain images

neon idol Aug 4, 2025, 10:04 AM

#

ocean vortex From commercial ones probably chatgpt. It has a big advantage of understanding l...

for personal use?

ocean vortex Aug 4, 2025, 10:06 AM

#

I meant "commercial" as in easily accessible services you can use, rather than hosting Stable Diffusion or Flux models yourself, etc.

neon idol Aug 4, 2025, 10:06 AM

#

idk which is better at realistic images: GPT Image, Seedream, or Imagen Ultra

ocean vortex Aug 4, 2025, 10:08 AM

#

Cause otherwise, depending on your use case, finetuning a model and then running it all by yourself can still lead to better results. Especially if, say, you want to generate some person that it doesn't have much of in its dataset. But that's more involved

#

Like if you want to generate a picture of yourself - commercial models have no clue how you look. 🙂

#

img2img can only get you so far; a single sample is not enough data for it

neon idol Aug 4, 2025, 10:12 AM

#

ocean vortex Cause otherwise, depending on your use case, finetuning a model and then running...

in your experience, which is the best AI image generator?

cedar tide Aug 4, 2025, 10:23 AM

#

calm sequoia New Artificial Analysis bench scores dropped

Qwen 2507 is much higher, but there is a mistake

#

ocean vortex Aug 4, 2025, 10:28 AM

#

cedar tide

The whole point with AA is that they are doing independent testing themselves. They are not taking them out of marketing material of other labs. But yeah this is interesting, this is a very low score lol

cedar tide Aug 4, 2025, 10:29 AM

#

calm sequoia New Artificial Analysis bench scores dropped

Real placement, by my calculation

calm sequoia Aug 4, 2025, 10:29 AM

#

ocean vortex which is an old irrelevant model now

It is awful for most scientific use cases. Only great for googling, webdev, or quick questions.

ocean vortex Aug 4, 2025, 10:30 AM

#

cedar tide Real placement, by my calculation

You can't just make up the numbers like that lmao

civic flame Aug 4, 2025, 10:30 AM

#

💀

cedar tide Aug 4, 2025, 10:30 AM

#

ocean vortex The whole point with AA is that they are doing independent testing themselves. T...

Usually the scores roughly match, and the AIME 24 scores match the announced scores exactly. And don't you see that the 235B 2507 version is lower than the old 30B version? Don't you understand that they just had an error while running the model?

ocean vortex Aug 4, 2025, 10:34 AM

#

cedar tide usually The scores always roughly match, and the scores of aime 24 match exactly...

They don't always roughly match. It's independent testing. Look at arc-agi and qwen3 independent testing...

#

The real score there has even bigger contrast

#

This doesn't look like a simple mistake to me, non-reasoning version also scores a different number (also low). But there might be more to this...

cedar tide Aug 4, 2025, 10:35 AM

#

ocean vortex This doesn't look like a simple mistake to me, non-reasoning version also scores...

This is non-reasoning

ocean vortex Aug 4, 2025, 10:35 AM

#

Either way, taking their official numbers at face value is clearly not the right move here

ocean vortex Aug 4, 2025, 10:36 AM

#

cedar tide This is non-reasoning

Yes I did say that

cedar tide Aug 4, 2025, 10:36 AM

#

Are you doing it on purpose or what?

cedar tide Aug 4, 2025, 10:37 AM

#

ocean vortex This doesn't look like a simple mistake to me, non-reasoning version also scores...

I didn't understand your message at all, sorry.

ocean vortex Aug 4, 2025, 10:37 AM

#

I pasted non-reasoning to show that the low score is not limited to some specific qwen3 235b variant (which would indicate an error on AA's part if this was the case)

cedar tide Aug 4, 2025, 10:38 AM

#

All models without reasoning (apart from the very latest ones) are very bad on AIME 25; it's totally normal

#

@ocean vortex

ocean vortex Aug 4, 2025, 10:39 AM

#

cedar tide all models without Reasoning (apart from the very latest ones which are very bad...

Not 24% though. That's insanely low and much lower than 4.1 or V3

cedar tide Aug 4, 2025, 10:39 AM

#

ocean vortex Not 24% though. That's insanely low and much lower than 4.1 or V3

25% is totally normal

ocean vortex Aug 4, 2025, 10:42 AM

#

cedar tide 25% is totally normal

Normal is near 50%. Kimi or Deepseek V3 type of numbers.

cedar tide Aug 4, 2025, 10:43 AM

#

@ocean vortex average of 50% ?

Intelligence_Evaluations_4_Aug_25_-_Medium_Large_Non_Reasoning_Models.png

#

Dom, stop digging yourself in

ocean vortex Aug 4, 2025, 10:45 AM

#

cedar tide <@514836230802898954> average of 50% ?

what

#

You need to look at the models it's competing with

#

Not some irrelevant examples of cases that are near it

#

it's competing with V3 and Kimi.

#

Which in turn are trying to compete with 4.1

#

As for Claude, everyone already knows it sucks at math 🤷‍♂️

#

especially 3.7

cedar tide Aug 4, 2025, 10:49 AM

#

ocean vortex it's competing with V3 and Kimi.

Nope. The old Qwen 3 is not competing with the new Kimi with 1000B parameters and V3 with 671B parameters

#

Kimi K2 has a good AIME score because it uses as many tokens as a reasoning model

Output_Tokens_Used_to_Run_Artificial_Analysis_Intelligence_Index_4_Aug_25_.png

ocean vortex Aug 4, 2025, 10:50 AM

#

cedar tide Nope Old qwen 3 not competing with new Kimi with 1000b parameter and v3 with 671...

Well but they kinda are. They freely chose the size they wanted for maximum performance. Reasoning variant is trying to be open-source SOTA...

#

And people using those models do not really care since the size is not reflected in pricing

cedar tide Aug 4, 2025, 10:52 AM

#

@ocean vortex Well, now I know you don't know anything about LLMs 🤣

#

you mix everything up

ocean vortex Aug 4, 2025, 10:52 AM

#

what

#

I literally told you how it is lmao

#

They are competing with those open-source models, so that's what they should be compared against. Not some irrelevant amazon model or whatever

cedar tide Aug 4, 2025, 10:54 AM

#

You can look at all the AIME scores Qwen announced for each model, and you will see they are the same to within 3% on Artificial Analysis

ocean vortex Aug 4, 2025, 10:55 AM

#

Also qwen3 has already been caught faking arc-agi score, so it's only reasonable to take anything they report with a grain of salt...

cedar tide Aug 4, 2025, 10:55 AM

#

GPT-4o from November has only 6%. Don't you understand that AIME is too hard for the majority of models without reasoning?

ocean vortex Aug 4, 2025, 10:56 AM

#

gpt4o of november? Why not mention gpt3.5? 🤣

#

Ok I used the strong word for it, the point is they used different eval code to favor their model

#

They most likely did the same elsewhere too

#

I think it shows their mindset and what they are willing to do. But there are other ways too not limited to the eval code. Like coming up with a custom system prompt for each benchmark separately etc

cedar tide Aug 4, 2025, 11:00 AM

#

ocean vortex gpt4o of november? Why not mention gpt3.5? 🤣

And OpenAI's best model without reasoning in March 2025 had 26, but Qwen can't have 25 in April 2025?

ocean vortex Aug 4, 2025, 11:01 AM

#

cedar tide And the best model without open ai reasoning in march 2025 had 26 but qwen can't...

If you put it like that, it could in isolation if their updated model (no reasoning) was improved there. But it really wasn't was it?

#

But like I said, they aren't competing with that. Not anymore

#

exact score for updated model is this. Not extremely bad, but not really decent either. Still the worst among the tier of models it's competing with

cedar tide Aug 4, 2025, 11:07 AM

#

@ocean vortex good dom ok you're right qwen 30b from April is 63% better on the aime 25 than qwen 235b from July

#

Reasoning

#

🤣

ocean vortex Aug 4, 2025, 11:09 AM

#

cedar tide <@514836230802898954> good dom ok you're right qwen 30b from April is 63% better...

I never said that. I only said that you can't take Qwen's official scores at face value and there might be something in those AA scores. But it also wouldn't be unreasonable for a smaller thinking model to do better here than a big one with different finetuning

#

AIME25 is one of those benchmarks small models can do great at

cedar tide Aug 4, 2025, 11:11 AM

#

Is this normal to you?

#

But on aime 24

Screenshot_2025-08-04-13-10-39-067_com.twitter.android-edit.jpg

#

@ocean vortex maybe you're tired?

ocean vortex Aug 4, 2025, 11:12 AM

#

Not normal, but that's why it's interesting and worth looking into. I think it's you who is tired @cedar tide #general message

#

I think you just fail to realise you don't understand this time lmao

#

You don't just discard what is "not normal" here. Finding things like that is kinda the whole point of AA

#

I just think it's worth looking into in the light of recent things involving qwen3 (arc-agi). Don't think that is hard to understand, is it?

#

smh

cedar tide Aug 4, 2025, 11:15 AM

#

@ocean vortex OK, I'll stop giving arguments to someone who doesn't understand. When AA fixes it, you'll see.

#

@ocean vortex If you find me a single model with a 50 point difference between the aime 24 and 25, I'll send you $1,000 straight away.

ocean vortex Aug 4, 2025, 11:18 AM

#

cedar tide <@514836230802898954> OK, I'll stop giving arguments to someone who doesn't unde...

The issue here is more of you blindly defending them for no reason at all. I'm just being open-minded to the possibility of their official score being inflated in the same way arc-agi was. Reasonable thing to do. But also a thing that is seemingly alien for you 🤣

cedar tide Aug 4, 2025, 11:18 AM

#

cedar tide <@514836230802898954> If you find me a single model with a 50 point difference b...

@dom waiting

ocean vortex Aug 4, 2025, 11:19 AM

#

have you done it?

#

Cause if not then your message is meaningless lol

#

It's a valid point, but as long as this is not done for qwen3 this is not an argument

cedar tide Aug 4, 2025, 11:21 AM

#

I know that sometimes the scores are not the same etc., and I know the story with arc-agi, but given a lot of other evidence I'm 90% sure the score of 44 is impossible

ocean vortex Aug 4, 2025, 11:21 AM

#

I might find time to do it myself as well. We will see. 👀

cedar tide Aug 4, 2025, 11:22 AM

#

ocean vortex I might find time to do it myself as well. We will see. 👀

Ah well, I found the solution, if you do the benchmark and it's less than 50 I'll send you a $1000 zelle

ocean vortex Aug 4, 2025, 11:22 AM

#

Possibly... I'm personally not assuming that though before we find out

ocean vortex Aug 4, 2025, 11:22 AM

#

cedar tide Ah well, I found the solution, if you do the benchmark and it's less than 50 I'l...

I'm sure you will 🤓

cedar tide Aug 4, 2025, 11:26 AM

#

@ocean vortex you're right qwen 25 07 is very bad at math

#

Screenshot_2025-08-04-13-25-24-669_com.android.chrome-edit.jpg

ocean vortex Aug 4, 2025, 11:27 AM

#

cedar tide <@514836230802898954> you're right qwen 25 07 is very bad at math

why not post creative writing benchmark trying to prove it's the best at math? That would have been even better 🤣

cedar tide Aug 4, 2025, 11:28 AM

#

@ocean vortex this on math

Screenshot_2025-08-04-13-27-54-007_com.android.chrome-edit.jpg

#

You are so funny

ocean vortex Aug 4, 2025, 11:31 AM

#

cedar tide <@514836230802898954> this on math

I know, and..? This is not a math benchmark and you shouldn't quote those category numbers as the equivalent to the full math benchmark 🤦‍♂️

cedar tide Aug 4, 2025, 11:31 AM

#

It has nothing to do with aime 25 but I just love it, it amuses me to see these scores

ocean vortex Aug 4, 2025, 11:32 AM

#

cedar tide You are so funny

It's hilarious seeing you thinking you got this "gotcha" moment only to realise 5sec later you misread the entire thing. Over and over. 😂

ocean vortex Aug 4, 2025, 11:39 AM

#

cedar tide It has nothing to do with aime 25 but I just love it, it amuses me to see these ...

I mean the fact alone that non-reasoning qwen3 is scoring more here than the reasoning one should be a good indicator...

cedar tide Aug 4, 2025, 11:40 AM

#

ocean vortex I mean the fact alone that non-reasoning qwen3 is scoring more here than the rea...

Yes, I know it's not comparable to a math benchmark at all

#

glm 4.5 with 40 on aime 25 surprises me a lot too

#

and some models even have a little more than their advertised score so I don't think the problem is their harness

brave orbit Aug 4, 2025, 11:52 AM

#

torn mantle Aug 4, 2025, 11:57 AM

#

dom leave david alone

#

omg

earnest rover Aug 4, 2025, 12:19 PM

#

brave orbit

no brave no wine

neon idol Aug 4, 2025, 12:26 PM

#

Help, I don't know which is the best AI image generator: Imagen 4 Ultra, Seedream 3.0, or GPT Image 1

earnest rover Aug 4, 2025, 12:29 PM

#

neon idol Help, I don't know which is the best AI image generator: Imagen 4 Ultra, Seed...

Imagen 4 Ultra for text to image
and GPT Image 1 for image to image
bonus: Flux Kontext Max is also great for image to image.

neon idol Aug 4, 2025, 12:33 PM

#

earnest rover imagen 4 ultra for text to image and gpt 1 for image to image bonus : flux konte...

Yeah but for realistic image?

earnest rover Aug 4, 2025, 12:34 PM

#

neon idol Yeah but for realistic image?

well nothing is too realistic till now
but i would say gpt image 1 or flux kontext max (for me). you should compare them.

#

Flux 1.1 Pro Raw Ultra is actually the most realistic one, but it is not available on LMArena.ai

neon idol Aug 4, 2025, 12:34 PM

#

I have tested Flux Kontext but it didn't impress me

neon idol Aug 4, 2025, 12:35 PM

#

earnest rover flux 1.1 pro raw ultra is actually the most realistic one. but it is not availbe...

Is it on Black Forest Labs?

earnest rover Aug 4, 2025, 12:35 PM

#

neon idol Is it on Black Forest Labs?

yep. you may create 10 images (ig)

neon idol Aug 4, 2025, 12:36 PM

#

earnest rover yep. you may create 10 images (ig)

I Will try it

neon idol Aug 4, 2025, 12:39 PM

#

earnest rover yep. you may create 10 images (ig)

Yeah I dont like it

#

According to Artificial Analysis, the best AI image generator for realistic images is Seedream 3.0, but I don't like it

novel flame Aug 4, 2025, 12:44 PM

#

What are the fastest (medium-to-large) provider-LLM combos you guys regularly use? I have been using Qwen3-32B on OR with Cerebras as the provider, and getting really high speeds (usually above 1000 tps!). So far, I haven't seen consistently higher speeds, but .... maybe diffusion-based LLMs?

neon idol Aug 4, 2025, 12:45 PM

#

novel flame What's the fastest (medium-to-large) provider-LLM combos you guys regularly use?...

Yes there is one

#

Let me think

#

https://seed.bytedance.com/en/seed_diffusion

#

Byte dance seed diffusion @novel flame

novel flame Aug 4, 2025, 12:47 PM

#

neon idol Byte dance seed diffusion <@749704377698222190>

Sweet, I'll check it out

neon idol Aug 4, 2025, 12:48 PM

#

novel flame Sweet, I'll check it out

The fastest AI is Gemini 2.5 Flash

#

Very good model and unlimited

novel flame Aug 4, 2025, 12:55 PM

#

neon idol The fastest AI is Gemini 2.5 Flash

For me, Gemini 2.5 Flash hovers around 150 tps, which is 90% less than I get with Qwen3. It's a hell of a lot smarter than Qwen3, but still -- not blazing speed.
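
A quick sanity check on the percentage above, as a rough sketch rather than anyone's actual measurement. The 150 tps and "above 1000 tps" figures come from the thread; the 1500 tps value and the function name `percent_slower` are assumptions added purely for illustration.

```python
def percent_slower(fast_tps: float, slow_tps: float) -> float:
    """Return how much lower slow_tps is than fast_tps, as a percentage."""
    if fast_tps <= 0:
        raise ValueError("fast_tps must be positive")
    return (fast_tps - slow_tps) / fast_tps * 100.0


# Thread figures: ~150 tps for Gemini 2.5 Flash vs. "usually above 1000 tps".
gap_at_1000 = percent_slower(1000, 150)
# Hypothetical higher baseline, to see where a 90% gap would come from.
gap_at_1500 = percent_slower(1500, 150)

print(f"{gap_at_1000:.0f}% lower at 1000 tps")
print(f"{gap_at_1500:.0f}% lower at 1500 tps")
```

Under these assumptions, the gap is about 85% at exactly 1000 tps, and "90% less" would correspond to a baseline of roughly 1500 tps, which is consistent with "usually above 1000".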

neon idol Aug 4, 2025, 1:08 PM

#

Gemini 2.5 pro

novel flame Aug 4, 2025, 1:10 PM

#

For coding, I use Claude 3.7 / 4 Sonnet or Gemini 2.5 Pro, and may consult o3 for certain types of complex questions. I was impressed by Qwen Coder and DeepSeek-Coder V2, but would not consider them on par with the aforementioned.

neon idol Aug 4, 2025, 1:13 PM

#

Not good

blazing rune Aug 4, 2025, 1:14 PM

#

neon idol Not good

They are good. They just aren't as good as Gemini 2.5 Pro

novel flame Aug 4, 2025, 1:15 PM

#

Well... I can't speak to GPT-5 and I have only tested Grok 4 a tiny bit.

blazing rune Aug 4, 2025, 1:16 PM

#

Yeah. Claude is the best if you don't want reasoning, and o3 and Gemini 2.5 Pro are supposed to be about the same in terms of capability

novel flame Aug 4, 2025, 1:18 PM

#

blazing rune Yeah. Claude is the best if you don't want reasoning, and o3 and Gemini 2.5 Pro ...

This. For me it kind of depends what I'm doing and which way the wind blows. Most of the time, Claude gives me better results, but in some situations / languages / frameworks / tasks, Gemini is better. But Claude usually wins. OTOH, Gemini is a lot cheaper, so if you care even a little about cost, then it's a clear win for Gemini. In my testing, o3 doesn't play nice with RooCode, so I don't use it for in-IDE coding assistance.

wheat onyx Aug 4, 2025, 1:24 PM

#

https://x.com/JustinLin610/status/1952329529256726680

Qwen

Junyang Lin (@JustinLin610)

something beautiful tonight

rare python Aug 4, 2025, 1:30 PM

#

<@&1349916362595635286> Spammer above

#

echo aurora Aug 4, 2025, 1:31 PM

#

rare python <@&1349916362595635286> Spammer above

ty

rare python Aug 4, 2025, 1:31 PM

#

Very nice discord

cedar tide Aug 4, 2025, 1:46 PM

#

wheat onyx https://x.com/JustinLin610/status/1952329529256726680 Qwen

Yes qwen image
https://x.com/JustinLin610/status/1952365200524616169?t=qKW4rDP26SV12dWYu8ziFg&s=19

Junyang Lin (@JustinLin610)

i am still drinking. omg i love edward hopper!

prime mulch Aug 4, 2025, 2:11 PM

#

Is Flux Kontext Max working well? For me it said this

Screenshot_2025-08-04-19-41-01-01_df198e732186825c8df26e3c5a10d7cd.jpg

ocean vortex Aug 4, 2025, 2:11 PM

#

novel flame What's the fastest (medium-to-large) provider-LLM combos you guys regularly use?...

Cerebras is probably the fastest provider, so it's kinda gonna depend on what they are hosting...

prime mulch Aug 4, 2025, 2:11 PM

#

It's not a bad prompt; there's nothing wrong with it

ocean vortex Aug 4, 2025, 2:12 PM

#

Mistral models are hosted there iirc. But they now offer paid plans to use their chat platform

#

And you can't use them without going through Mistral services

#

https://www.cerebras.ai/blog/mistral-le-chat

Cerebras brings instant inference to Mistral Le Chat - Cerebras

Cerebras January update: Fastest DeepSeek R1-70B, Mayo Clinic genomic model, Davos appearance, and more! Learn how we're accelerating AI with real-time inference, machine learning, and case studies.

hollow imp Aug 4, 2025, 2:29 PM

#

verbal nimbus Gemini 2.5 Pro solved it in AIStudio. I gave it the equation in Latex. It found ...

Incorrect answer

novel flame Aug 4, 2025, 2:36 PM

#

I missed Zenith.... Who did the most extensive testing on it? Any chance it's Horizon?

barren prairie Aug 4, 2025, 2:55 PM

#

blazing rune They are good. They just aren't as good as Gemini 2.5 Pro

For me GLM 4.5 is better than Gemini 2.5 Pro. It doesn't make errors like Gemini does; it can write 800 lines with no errors or problems and respects the full prompt, but you need a very detailed prompt to get the best result. Gemini is sometimes better when it comes to designs, though...

#

And the AI slides are so good too 😶

wheat onyx Aug 4, 2025, 3:01 PM

#

Re:hallucinations and better answers
https://www.theinformation.com/articles/universal-verifiers-openais-secret-weapon

novel flame Aug 4, 2025, 3:09 PM

#

barren prairie And the AI slides are so good too 😶

AI slides????

torn mantle Aug 4, 2025, 3:18 PM

#

barren prairie And the AI slides are so good too 😶

where can i try it

#

:/

wheat onyx Aug 4, 2025, 3:29 PM

#

https://x.com/legit_api/status/1952364113880920320?s=19

ʟᴇɢɪᴛ (@legit_api)

preparations for Claude Opus 4.1

now that’s more like it, MOAR!

#

Now we see who wins out of gpt5 and opus 4.1 for coding

acoustic cliff Aug 4, 2025, 3:36 PM

#

based on the naming, I don't think it's going to be fair

fiery gull Aug 4, 2025, 3:44 PM

#

opus 4.1?? gpt 5 ??

fiery gull Aug 4, 2025, 3:45 PM

#

barren prairie For me GLM 4.5 is better than Gemini 2.5 Pro. It doesn't make errors like Gemini...

seriously? GLM 4.5 is that good?

past verge Aug 4, 2025, 3:45 PM

#

hi

fiery gull Aug 4, 2025, 3:45 PM

#

hi

novel flame Aug 4, 2025, 3:45 PM

#

I have not once gotten a noticeably better result from Opus than Sonnet (for coding), but I have sometimes gotten worse. So I don’t expect greatness from Opus 4.1 TBH

wheat onyx Aug 4, 2025, 3:55 PM

#

https://x.com/ryolu_/status/1952385749330079881?s=19

Ryo Lu (@ryolu_)

big week ahead 🖐️🎤
can’t wait to share what we’re cooking

raven helm Aug 4, 2025, 3:55 PM

#

Who is that from?

cedar tide Aug 4, 2025, 3:56 PM

#

https://qwenlm.github.io/blog/qwen-image/

cedar tide Aug 4, 2025, 3:56 PM

#

raven helm Who is that from?

Cursor head

#

https://fixupx.com/Alibaba_Qwen/status/1952398250121756992?t=JGhsJvAtIV9RGwNN8quRhQ&s=19

Qwen (@Alibaba_Qwen)

🚀 Meet Qwen-Image — a 20B MMDiT model for next-gen text-to-image generation. Especially strong at creating stunning graphic posters with native text. Now open-source.

🔍 Key Highlights:
🔹 SOTA text rendering — rivals GPT-4o in English, best-in-class for Chinese
🔹 In-pixel text generation — no overlays, fully integrated
🔹 Bilingual support, diverse fonts, complex layouts

🎨 Also excels at general image generation — from photorealistic to anime, impressionist to minimalist. A true creative powerhouse.

Blog: qwenlm.github.io/blog/qwen-image/
Hugging Face: huggingface.co/Qwen/Qwen-Image
ModelScope: modelscope.cn/models/Qwen/Qwen-Image
Github: github.com/QwenLM/Qwen-Image
Technical report: qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Qwen_Image.pdf
Demo: modelscope.cn/aigc/imageGeneration?tab=advanced

mossy drum Aug 4, 2025, 3:58 PM

#

Please upvote #1401957379435663420 message

leaden meteor Aug 4, 2025, 4:04 PM

#

GLM4.5 only behind 2.5pro and pretty much matches with Grok 4!! Even better than OpenAI models....Jeez...

#

4 out of top10 on text arena are open source now...

wheat onyx Aug 4, 2025, 4:08 PM

#

I think we'll start seeing local ai implementations in a year or so too

cedar tide Aug 4, 2025, 4:11 PM

#

Screenshot_2025-08-04-18-11-00-359_com.android.chrome-edit.jpg

#

Screenshot_2025-08-04-18-12-25-784_com.android.chrome-edit.jpg

#

Screenshot_2025-08-04-18-14-22-478_com.android.chrome-edit.jpg

wheat onyx Aug 4, 2025, 4:15 PM

#

I mean in actual products. Real implementations

#

stray aspen Aug 4, 2025, 4:19 PM

#

@ripe birch Do you work at z.ai

leaden meteor Aug 4, 2025, 4:19 PM

#

Locally? Doesn't this require like 100GB vram? You mean multiple 4090 GPUs?
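
A back-of-the-envelope way to check the VRAM question, offered as a sketch under stated assumptions. It counts model weights only (no KV cache, activations, or runtime overhead), uses 24 GB per GPU as the RTX 4090's VRAM, and plugs in parameter counts that appear elsewhere in this thread (32B, 235B, 671B). The thread does not say which model the "100GB" figure refers to, and the helper names here are hypothetical.

```python
import math


def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


def gpus_needed(memory_gb: float, vram_per_gpu_gb: float = 24.0) -> int:
    """Minimum number of GPUs whose combined VRAM covers memory_gb."""
    return math.ceil(memory_gb / vram_per_gpu_gb)


# Parameter counts mentioned in the thread; precision choices are assumptions.
for params in (32, 235, 671):
    for bits in (16, 4):
        mem = weight_memory_gb(params, bits)
        print(f"{params}B @ {bits}-bit: ~{mem:.1f} GB weights, "
              f">= {gpus_needed(mem)} x 24 GB GPUs")
```

By this estimate, a model around 235B parameters quantized to 4 bits needs roughly 117.5 GB for weights alone, which lines up with the "like 100GB" guess and would indeed span several 24 GB cards. Real deployments need extra headroom beyond this.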

ionic idol Aug 4, 2025, 4:27 PM

#

When will image be added to LMArena direct chat?

wheat onyx Aug 4, 2025, 4:30 PM

#

https://x.com/simpsoka/status/1952385591393599752?s=19

Kath Korevec (@simpsoka)

We have a lot of good things coming this week that we're so excited to share with you! Welcome to Jules launch week!

#

We have Claude, google, chatgpt all coming out imminently. Big month

hollow imp Aug 4, 2025, 4:34 PM

#

wheat onyx We have Claude, google, chatgpt all coming out imminently. Big month

Please show the Google updates 🙏

wheat onyx Aug 4, 2025, 4:37 PM

#

https://x.com/flowith_ai/status/1952365149882565018?s=19

flowith (@flowith_ai)

we saw gpt-5 trending last night — looks like we're both ready, @sama.
just wondering when we can flip this switch on flowith.

#

We may see all 3 companies release their models this week

#

Certainly seems to indicate that

#

#general message

#

For googles coding agent yeah

#

She said multiple things though

#

Again she said multiple things coming this week

#

wheat onyx Aug 4, 2025, 5:01 PM

#

https://x.com/stevenheidel/status/1952410675357303131?s=19

Steven Heidel (@stevenheidel)

this is going to be an exciting week!

stray aspen Aug 4, 2025, 5:21 PM

#

@echo aurora

#

there's a scammer

echo aurora Aug 4, 2025, 5:21 PM

#

ty ty!

maiden kiln Aug 4, 2025, 5:21 PM

#

saw there were issues today and my chats are also gone. Any way to retrieve them, or do I just gotta start fresh?

stray aspen Aug 4, 2025, 5:23 PM

#

The website doesn't have accounts so I guess you have to start fresh

echo aurora Aug 4, 2025, 5:26 PM

#

maiden kiln saw there was issues td an my chats are also gone, anyway to retrieve or just go...

Unfortunately, there isn't a way to obtain the chats once they're gone. We are working on different features to prevent this, but atm there isn't much to be done sadly.

whole wagon Aug 4, 2025, 6:04 PM

#

GPT5 is tomorrow

#

Do people already know this kek

#

Nope

#

kekw

#

Oh actually. The livestream is tomorrow

#

Doesn't guarantee release ig

blazing bison Aug 4, 2025, 6:07 PM

#

whole wagon Oh actually. The livestream is tomorrow

Source?

whole wagon Aug 4, 2025, 6:09 PM

#

Reminder

#

Eh I don't know if it is allowed and idc about betting anyways

torn bison Aug 4, 2025, 6:17 PM

#

whole wagon Oh actually. The livestream is tomorrow

lamps and shortfilms said no

solar hollow Aug 4, 2025, 6:18 PM

#

blazing bison Source?

his behind

whole wagon Aug 4, 2025, 6:20 PM

#

solar hollow his behind

Still a better source than your mouth

#

😉

zinc ore Aug 4, 2025, 6:21 PM

#

Logan hyping too

#

Guess I won't be sleeping this week

whole wagon Aug 4, 2025, 6:22 PM

#

I have no idea what Google is releasing lmao

#

Any ideas?

zinc ore Aug 4, 2025, 6:23 PM

#

Think Jules stuff is known to be one of the things

#

They're obviously going to drop something in an attempt to take wind from gpt5 drop, so should be something worthwhile I'd guess

wheat onyx Aug 4, 2025, 6:29 PM

#

https://x.com/OfficialLoganK/status/1952430214375493808?s=19

Logan Kilpatrick (@OfficialLoganK)

big week ahead!

#

Head of product for gemini api

cedar tide Aug 4, 2025, 6:34 PM

#

https://fixupx.com/ArtificialAnlys/status/1952413809139974562?t=h-c_NwZkrDdEstCSi_yF7A&s=19

Artificial Analysis (@ArtificialAnlys)

Alibaba's Qwen-Image is the new leading open weights Text to Image model! Imagen 4 and FLUX.1 Kontext [max] level image generation is now available to the open weights community

Alibaba gave us early access and we've had Qwen-Image secretly in the Artificial Analysis Image Arena for a few days.

The model is currently #5 in the leaderboard and is the leading open weights model by a large margin. The next open weights models are at places #18, #21 and #26.

The model follows an approach similar to GPT-4o of leveraging an autoregressive transformer architecture for image generation and editing. This model takes a dual encoding approach: Qwen2.5-VL encodes the semantic meaning of the prompt, while image generation happens in a latent space using a diffusion model called MMDiT. The final image is produced from this latent space using a VAE decoder.

See below for a link to see the model yourself in our Image Arena and a link…

glass elbow Aug 4, 2025, 6:38 PM

#

hi

cedar tide Aug 4, 2025, 6:42 PM

#

@ocean vortex Well well, what’s Qwen doing here? 🤣

#

Screenshot_2025-08-04-20-41-57-962_com.android.chrome-edit.jpg

#

@ocean vortex Whats that ? 🤣

Screenshot_2025-08-04-20-43-17-223_com.android.chrome-edit.jpg

#

@ornate agate @torn mantle 😶

torn mantle Aug 4, 2025, 6:45 PM

#

cedar tide <@928754956780068975> <@295243581818404874> 😶

lol xddd

cerulean jackal Aug 4, 2025, 6:46 PM

#

hello

cedar tide Aug 4, 2025, 6:47 PM

#

I hope you're not serious

wheat onyx Aug 4, 2025, 6:48 PM

#

https://x.com/DerekNee/status/1952381251471683868?s=19

Derek Nee (@DerekNee)

today!

#

https://x.com/DerekNee/status/1952391050238243154?s=19

Derek Nee (@DerekNee)

I’m ready!

ocean vortex Aug 4, 2025, 6:49 PM

#

ok yeah, that was their mistake then...

cedar tide Aug 4, 2025, 6:50 PM

#

I don't know if that changed anything. 😶

Screenshot_2025-08-04-20-49-59-132_com.twitter.android-edit.jpg

#

but they haven't fixed glm 4.5 😵‍💫

ocean vortex Aug 4, 2025, 6:52 PM

#

Not-reasoning fixed too. That's what I meant by saying earlier it was low. Now it's a whole different ball game... @cedar tide

#

before this was ~31%

cedar tide Aug 4, 2025, 6:52 PM

#

cedar tide <@514836230802898954> Whats that ? 🤣

@ocean vortex You dont see here ?

cedar tide Aug 4, 2025, 6:53 PM

#

cedar tide

and as much as I spoke of reasoning I also spoke of without

ocean vortex Aug 4, 2025, 6:53 PM

#

cedar tide <@514836230802898954> You dont see here ?

? I do but that ss lacks context (other models)

winter vault Aug 4, 2025, 6:56 PM

#

has anyone experienced this before please, it has been stuck here since and i cant cancel it. can someone help please

cedar tide Aug 4, 2025, 6:56 PM

#

when I said the low score for the model without Reasoning was normal, we were talking about april qwen 3

#

Screenshot_2025-08-04-20-57-24-203_com.twitter.android-edit.jpg

wheat onyx Aug 4, 2025, 7:08 PM

#

ocean vortex Aug 4, 2025, 7:09 PM

#

cedar tide when I said the low score for the model without Reasoning was normal, we were ta...

We were talking about both, well at least I was. Updated model score was not much better. Now it's like twice the old score or even more lol

cedar tide Aug 4, 2025, 7:11 PM

#

ocean vortex We were talking about both, well at least I was. Updated model score was not muc...

ok, when I said it was normal I was talking about the old one, and I still think its score is normal. But yes, the new one's score was not good; I said that in my first message

#

https://x.com/FAL/status/1952445949391118532?t=QZ8VCfu9KLVYzIVQ3R2Nzg&s=19

fal (@FAL)

Qwen-Image by @Alibaba_Qwen is now available at fal, as a scalable & ready-to-use API https://t.co/lRTMPCidoN

ocean vortex Aug 4, 2025, 7:14 PM

#

Now it's much more representative and a jump to reasoning makes sense for a hybrid model. Looks better against direct competitors as well.

#

I wonder what it was exactly leading to such a drastic difference...

gentle plinth Aug 4, 2025, 7:41 PM

#

velocilux looks like an interesting model

primal orbit Aug 4, 2025, 8:00 PM

#

wheat onyx

Has anyone got "triangle"?

small haven Aug 4, 2025, 8:32 PM

#

is gpt-5 + gemini 3 + claude 4.1 all coming at once in the same week? 😮

wheat onyx Aug 4, 2025, 8:35 PM

#

small haven is gpt-5 + gemini 3 + claude 4.1 all coming at once in the same week? 😮

appears to be

small haven Aug 4, 2025, 8:35 PM

#

competition ftw

wheat onyx Aug 4, 2025, 8:36 PM

#

and OpenAI opensource model this week

primal orbit Aug 4, 2025, 8:44 PM

#

It would be strange for Gemini 3 to just pop up before appearing on lmarena first.

#

under a codename

leaden meteor Aug 4, 2025, 8:46 PM

#

I thought we had a lot of google models better than 2.5 pro that were tested on lmarena and disappeared. Maybe one of those? Although I doubt 3 is coming this week without any leaks so far...

primal orbit Aug 4, 2025, 8:46 PM

#

the most current google model on lmarena is nightride-on

leaden meteor Aug 4, 2025, 8:46 PM

#

We would have already seen the screenshots of indications of 3 coming soon like we do for GPT-5...

primal orbit Aug 4, 2025, 8:47 PM

#

it's good but not a jump from 2.5

leaden meteor Aug 4, 2025, 8:50 PM

#

What about GPT5? Did it even have enough votes to appear on lmarena when they announce it? It seemed like it was there for just a couple of days...

#

on lmarena

#

Assuming summit/zenith is GPT5...

primal orbit Aug 4, 2025, 8:53 PM

#

Hard to tell. I was not impressed with either. Maybe for coding they are good, but not for common conversation.

#

Probably neither of them were the most powerful variant of gpt5. Like grok heavy and o3 high are not here on lmarena.

pulsar aurora Aug 4, 2025, 9:23 PM

#

Hi is the grok not working??

#

Grok 4 on lmarena

#

When i ask what model are you it says grok 2

#

<@&1349916362595635286>

#

echo aurora Aug 4, 2025, 9:29 PM

#

pulsar aurora When i ask what model are you it says grok 2

Yeah good callout, we did address this in a different forum post so let me find you that.

#

But would ask to not use the @ moderator ping for questions like this. That ping should be used for mod purposes, not general questions or feedback.

pulsar aurora Aug 4, 2025, 9:30 PM

#

Oh alright sure thank you, I'm new here so didn't know the rules

echo aurora Aug 4, 2025, 9:30 PM

#

No problem, just a heads up.

#

I responded to your forum post but regarding Grok -> #1393024188356362340 message

jade egret Aug 4, 2025, 9:41 PM

#

🤔

echo aurora Aug 4, 2025, 9:54 PM

#

🍊

golden ocean Aug 4, 2025, 10:14 PM

#

🍊

hollow imp Aug 4, 2025, 10:59 PM

#

🧢

haughty tangle Aug 4, 2025, 11:00 PM

#

jade egret 🤔

basic html editing with inspect element

#

that post is just interaction bait

hardy pecan Aug 4, 2025, 11:26 PM

#

its zoomer's version of "hacking" lmao

whole wagon Aug 5, 2025, 12:37 AM

#

Hm seems later in the week

#

Not today rip

#

I don't think they even know ngl, there is a chance of delays

ripe birch Aug 5, 2025, 1:07 AM

#

They know actually. We use the same recommended parameters as Qwen3 when submitting to AA. However, they did not use our official API, so I couldn't help them locate the problem and fix it.

jade egret Aug 5, 2025, 1:10 AM

#

🍊

rapid merlin Aug 5, 2025, 1:36 AM

#

is that where gpt 5 is?!

#

oh wait

runic escarp Aug 5, 2025, 1:39 AM

#

If one day, the next generation llms is behind a spam link ^^ (likely kinda never happening)

crimson oasis Aug 5, 2025, 2:07 AM

#

I'm offering an open test session to anyone who would like to test an actual "thinking " AI that uses recursive thought processes to discover rather than linear pattern matching elevating any llm using my framework

marsh sundial Aug 5, 2025, 2:39 AM

#

primal orbit the most current google model on lmarena is nightride-on

probably 2.5-002, still remember creativity of wolfstride

split summit Aug 5, 2025, 6:22 AM

#

Scene: A room. A body, completely covered with cloth, lies on a cot. A religious scholar stands to the left of the body, and an assistant stands at the head.

Time 0-2 seconds

Religious Scholar: "Make sure the entire body is covered. No part of the body should be exposed under any circumstances."

(The scholar slightly lifts the head of the body. The assistant stands nearby with a washing vessel.)

Time 2-5 seconds

(The assistant bends their right knee and places it on the cot.)

Time 5-8 seconds

(The religious scholar holds the nape of the body with the thumb of their right hand. They slowly lift the body and lean it against the assistant's bent knee. The assistant's knee acts as a support under the back of the body.)

frozen nova Aug 5, 2025, 6:42 AM

#

call me stupid but i had some important work on https://lmarena.ai/c/10a6734a-65e4-4a8e-b221-e265671762b9 is it possible an admin can please fix it. it's just spinning, the ai is stuck

#

about to lose entire week of work 12 hours a day i spent

#

but luckily i can still view the context, just very hard to sort through all the text

novel flame Aug 5, 2025, 6:50 AM

#

crimson oasis I'm offering an open test session to anyone who would like to test an actual "th...

I'm in

primal orbit Aug 5, 2025, 6:52 AM

#

I have finally got "triangle". Makes some good points, but nothing extraordinary.

balmy zenith Aug 5, 2025, 7:09 AM

#

i'm in..

potent snow Aug 5, 2025, 7:30 AM

#

Did grok update image?

rigid crescent Aug 5, 2025, 7:58 AM

#

is video battle only available here on discord? or will it be on the website too eventually, i dont see it on the gradio or new ui versions

steady vale Aug 5, 2025, 7:59 AM

#

soon...

novel flame Aug 5, 2025, 8:06 AM

#

Holy cow, GLM-4.5 may be as good as the hype. I really didn't expect it since previous GLM versions have all been a bit meh in my testing. But it just (barely but still) scored a 5/5 on my test suite and destroyed everything else (including Horizon Beta, Grok 4, o3 Pro, Gemini 2.5 Pro, Claude Sonnet and Opus) in my separate "create-an-html-game" test. It has some quirks for sure, got into an infinite thinking loop in one test and got different answers in thinking vs actual response in another, but it's overall really strong.

pure anvil Aug 5, 2025, 8:29 AM

#

Gemini 3 will btfo openai fanboys

mossy compass Aug 5, 2025, 8:36 AM

#

novel flame Holy cow, GLM-4.5 may be as good as the hype. I really didn't expect it since pr...

battle3d

stark tusk Aug 5, 2025, 8:59 AM

#

Wait does LMArena actually use the model stated, like GPT-4o etc

cedar tide Aug 5, 2025, 9:10 AM

#

What this benchmark ?

Screenshot_2025-08-05-11-09-28-628_com.miui.gallery-edit.jpg

#

New benchmark ? Artificial analysis lcr ?

#

@ornate agate real score of glm 4.5 by a very serious benchmarker https://matharena.ai/

Screenshot_2025-08-05-11-12-20-214_com.android.chrome-edit.jpg

MathArena.ai

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

paper nimbus Aug 5, 2025, 9:22 AM

#

cedar tide <@928754956780068975> real score of glm 4.5 by an very serious benchmarker https...

fr?

ocean vortex Aug 5, 2025, 10:44 AM

#

cedar tide <@928754956780068975> real score of glm 4.5 by an very serious benchmarker https...

yeah that's impressive. AA probly didn't have time to rerun their tests yet

#

only did it for Qwen3

cedar tide Aug 5, 2025, 10:45 AM

#

cedar tide

They surely will re run

#

@ocean vortex

ocean vortex Aug 5, 2025, 10:46 AM

#

yeah

#

though lack of transparency from them is not ideal tbh

#

They obviously found what was wrong with Qwen3 testing but only silently changed the score lol

pulsar aurora Aug 5, 2025, 10:49 AM

#

Any free Ai agent tool that lets us use Ai on browser to perform any tasks

#

Just want to try it out seems fun

ocean vortex Aug 5, 2025, 10:51 AM

#

it should in theory at least ace all of them except those 4:

pulsar aurora Aug 5, 2025, 10:53 AM

#

Yo what's Glm 4.5??Is it deepseek alike?

ocean vortex Aug 5, 2025, 10:59 AM

#

pulsar aurora Yo what's Glm 4.5??Is it deepseek alike?

Pretty much, yeah. Different Chinese company/startup though.

eager crag Aug 5, 2025, 11:07 AM

#

Will the images I upload to LMArena be kept private?

ocean vortex Aug 5, 2025, 11:10 AM

#

Looked a bit into Z.AI who made that GLM4.5. They are a technically independent company from Tsinghua University, but their investors include Alibaba, Tencent, Xiaomi...

crimson oasis Aug 5, 2025, 11:22 AM

#

@novel flame DM me when you're ready I'll send you a link

rigid crescent Aug 5, 2025, 11:28 AM

#

eager crag Will my images i’d upload to LMArena kept private?

NO make sure you re-read the disclaimer when you load the site!!

eager crag Aug 5, 2025, 11:28 AM

#

rigid crescent NO make sure you re-read the disclaimer when you load the site!!

don't worry, i haven't posted anything personal there yet.

rigid crescent Aug 5, 2025, 11:28 AM

#

yet?! lol

eager crag Aug 5, 2025, 11:29 AM

#

well, now that i know, i won't.

ocean vortex Aug 5, 2025, 11:29 AM

#

Never said it was normal or not normal, simply stating the facts lmao

rigid crescent Aug 5, 2025, 11:30 AM

#

eager crag well, now that i know, i won't.

good, yeah as part of the value proposition for the free arena, you consent by using it that everything you type/upload and the voting/output becomes creative commons licensed

ocean vortex Aug 5, 2025, 11:33 AM

#

Also it's no secret that Anthropic is well funded by this point

#

I was actually not meaning to imply anything, but if you really want to dive deeper, we can do that 😇
https://www.globalneighbours.org/chinas-zhipu-ai-secures-140-million-investment-from-shanghai-state-funds-amid-ipo-push/?utm_source=chatgpt.com

Global Neighbours

global-neighbours

China’s Zhipu AI Secures $140 Million Investment From Shanghai St...

By Liu Peilin and Denise Jia Chinese generative artificial intelligence unicorn Zhipu AI has secured 1 billion yuan ($140 million) in fresh funding from Shanghai’s state-backed investors, boosting its momentum ahead of its planned initial public offering (IPO). The announcement coincided with the unveiling of new products aimed at strengt...

#

@ornate agate

#

Z.AI is VERY well funded, including state funds

wide talon Aug 5, 2025, 12:54 PM

#

Is GPT-5 still accessible as a model in lmarena?

patent aspen Aug 5, 2025, 1:00 PM

#

cedar tide What this benchmark ?

It's weird to have a long context reasoning benchmark if the max context length of the benchmark only goes up to 128k

#

It's kind of like those video benchmarks that turn off sound except worse

prime mulch Aug 5, 2025, 1:03 PM

#

Guys except flux and gpt other image models are not working well

#

😭

patent aspen Aug 5, 2025, 1:06 PM

#

I mean you could solve that with a matrix

#

Maverick and GPT-4.1 as well iirc

#

Yeah and I think that should be represented in any benchmark whose purpose is to measure reasoning over long context

leaden sun Aug 5, 2025, 1:20 PM

#

I've been wondering about this for a long time too, i can only guess it's an economic calculation bc of the recursive nature of their architecture that is super expensive to run... people using claude code have pretty good workarounds for limited context at coding, not sure about the others tho

brittle tiger Aug 5, 2025, 1:24 PM

#

this might be cool later today

https://www.youtube.com/watch?v=En_NJJsbuus

YouTube

Kaggle

AI Chess Exhibition Tournament August 5th

We’re inaugurating Kaggle Game Arena with a 3-day AI chess exhibition tournament featuring 8 frontier models.


#

"We’re inaugurating Kaggle Game Arena with a 3-day AI chess exhibition tournament featuring 8 frontier models."

leaden sun Aug 5, 2025, 1:25 PM

#

sigh why it's chess again...

#

out of so many fascinating games you can choose, it has to be chess, for AIs...

brittle tiger Aug 5, 2025, 1:27 PM

#

the kaggle game arena won't just be chess. it will be a bunch of different ones. they're just launching it with this chess tournament

leaden sun Aug 5, 2025, 1:27 PM

#

it's not that...exciting, it's just strategy optimization, nothing intelligent actually

torn bison Aug 5, 2025, 1:40 PM

#

There's a small chess-playing network inside LLMs, haha

#

I'm still amazed non reasoning models can decode base64 almost perfectly in one shot
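For context on what that one-shot decoding involves, here is a minimal hand-rolled sketch of standard base64 decoding in Python. The function name `decode_base64` is made up for illustration; in practice you would just call the standard library's `base64.b64decode`. Each character maps to 6 bits, and the bits are regrouped into 8-bit bytes:

```python
import base64

# Standard base64 alphabet: index in this string = 6-bit value of the character.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_base64(text: str) -> bytes:
    """Decode standard base64 by hand (illustrative helper, not a library API)."""
    stripped = text.rstrip("=")  # '=' is padding and carries no data
    # Concatenate the 6-bit value of every character into one bit string.
    bits = "".join(format(ALPHABET.index(ch), "06b") for ch in stripped)
    # Leftover bits that do not fill a whole byte are padding remainder.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


if __name__ == "__main__":
    encoded = base64.b64encode(b"hello").decode()
    print(encoded, decode_base64(encoded))
```

Because 6-bit and 8-bit groups do not line up with word or token boundaries, a model has to track this regrouping implicitly, which is why doing it without a reasoning step is notable.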

wheat onyx Aug 5, 2025, 1:49 PM

#

https://x.com/tomwarren/status/1952655256262754578?s=19

Tom Warren (@tomwarren)

Thursday is pretty soon, I'd say

little narwhal Aug 5, 2025, 2:00 PM

#

Give it time

#

2 more years max

ocean vortex Aug 5, 2025, 2:04 PM

#

patent aspen It's weird to have a long context reasoning benchmark if the max context length ...

There are different versions for different context sizes. Like look here at the bottom:

#

the thing is, almost all of the recent models will be able to run the 128k one. But only a few can run >128k

cedar tide Aug 5, 2025, 2:06 PM

#

https://fixupx.com/GoogleDeepMind/status/1952732150928724043?t=32Zydn4TsDx823zbS820rQ&s=19

Google DeepMind (@GoogleDeepMind)

What if you could not only watch a generated video, but explore it too? 🌐

Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt.

From photorealistic landscapes to fantasy realms, the possibilities are endless. 🧵


ocean vortex Aug 5, 2025, 2:06 PM

#

so direct comparison only possible with the smallest common size

patent aspen Aug 5, 2025, 2:09 PM

#

ocean vortex the thing is, almost all of the recent models will be able to run the 128k one. ...

And they should be punished / rewarded accordingly

ocean vortex Aug 5, 2025, 2:10 PM

#

patent aspen And they should be punished / rewarded accordingly

They kinda are already punished to be fair. Models on 128k test that have context size beyond that generally score higher

#

like 4.1-nano does 30s for 128k but only low 10s for its entire context size lol

patent aspen Aug 5, 2025, 2:13 PM

#

Just because Gemini is in a class of its own doesn't mean the benchmarks need to put o3 above Gemini

#

They literally just need a matrix. It's not that complicated

ocean vortex Aug 5, 2025, 2:15 PM

#

https://tenor.com/view/wtf-confused-look-gif-24835232

Tenor

novel flame Aug 5, 2025, 2:15 PM

#

True for full-context attention (Transformers), but not for all architectures. RNNs scale linearly with constant memory. DeepMind's Titans architecture only uses attention for a small window, and its memory module for everything beyond that. The next leap in AI won't be a standard Transformer, it will be something else.
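A rough way to see the scaling difference is to count operations only. This is a toy sketch, not how any real model (Titans included) is implemented; the function names and the 4,096-token window are assumptions for illustration:

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by full self-attention: every token vs. every token."""
    return n_tokens * n_tokens


def windowed_pairs(n_tokens: int, window: int) -> int:
    """Pairs scored when each token attends only to the last `window` tokens
    (itself included). Grows linearly once n_tokens exceeds the window."""
    w = min(n_tokens, window)
    # The first w tokens see 1, 2, ..., w tokens; every later token sees w.
    return w * (w + 1) // 2 + (n_tokens - w) * w


def recurrent_steps(n_tokens: int) -> int:
    """A recurrent model does one fixed-size state update per token."""
    return n_tokens


if __name__ == "__main__":
    for n in (128_000, 1_000_000):
        print(n, full_attention_pairs(n), windowed_pairs(n, 4_096), recurrent_steps(n))
```

Doubling the context quadruples the full-attention count, while the windowed and recurrent counts only roughly double.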

ocean vortex Aug 5, 2025, 2:15 PM

#

that gif more suiting lol

raven helm Aug 5, 2025, 2:18 PM

#

Do you think GPT-5 will have 1 Million Context window?

brittle tiger Aug 5, 2025, 2:19 PM

#

Would be surprising if it didn't. Doubt it tops long context benchmarks but should definitely be able to take 1m and route to one of their models that can run

torn mantle Aug 5, 2025, 2:21 PM

#

cedar tide https://fixupx.com/GoogleDeepMind/status/1952732150928724043?t=32Zydn4TsDx823zbS...

I wtf

raven helm Aug 5, 2025, 2:21 PM

#

Yea, fair. I just don't get how GPT-5 being a router model will be able to perform well unless the underlying models are also getting an upgrade.

patent aspen Aug 5, 2025, 2:22 PM

#

v2_edb4d7d641e84207a7cfb328ac9a9ce31743780481_oswg47190oswg1080oswg452_img_000.jpg

raven helm Aug 5, 2025, 2:25 PM

#

raven helm Yea, fair. I just don't get how GPT-5 being a router model will be able to prefo...

Even Jimmy Apples says "you won’t notice the improvement in gpt5. if you’re a heavy user and you use lots of different models. there’ll be no difference for you."

ocean vortex Aug 5, 2025, 2:27 PM

#

raven helm Do you think GPT-5 will have 1 Million Context window?

They most definitely contemplated making it even bigger than Gemini (for headlines and stuff) but not sure that is realistic... 1M is a reasonable guess tbh

raven helm Aug 5, 2025, 2:28 PM

#

Yea, but with how the AI landscape is right now, more than 1M would not function very well, as it would start forgetting stuff

ocean vortex Aug 5, 2025, 2:29 PM

#

raven helm Yea, but right now with how the AI landscape is right now; more than 1M would no...

It forgets things regardless of the context size. Main use case of context is single input size

#

Like giving it entire movie as a singular input

raven helm Aug 5, 2025, 2:29 PM

#

raven helm Yea, but right now with how the AI landscape is right now; more than 1M would no...

(Even Gemini at 1M already does this)

torn bison Aug 5, 2025, 2:29 PM

#

how does that passing through api💀

raven helm Aug 5, 2025, 2:29 PM

#

ocean vortex It forgets things regardless of the context size. Main use case of context is si...

yea

ocean vortex Aug 5, 2025, 2:30 PM

#

But if you give it in a single input, it will recall everything surprisingly well

primal orbit Aug 5, 2025, 2:30 PM

#

kaggle arena in 3 hours 💪

#

I wonder what are the rules if a model hallucinates a move

#

illegal move for instance

raven helm Aug 5, 2025, 2:31 PM

#

ocean vortex But if you give it in a single input, it will recall everything surprisingly wel...

It's when you start messaging it a bunch of times that it starts forgetting

raven helm Aug 5, 2025, 2:31 PM

#

primal orbit kaggle arena in 3 hours 💪

What's happening there?

primal orbit Aug 5, 2025, 2:33 PM

#

#general message @raven helm

raven helm Aug 5, 2025, 2:34 PM

#

primal orbit https://discord.com/channels/1340554757349179412/1340554757827461211/14022812062...

Thanks!

primal orbit Aug 5, 2025, 2:34 PM

#

Gemini vs Claude - game to watch

#

my bet for final o3 vs Gemini due to context adherence

#

i wonder is it opus thinking? and o3 high or what

raven helm Aug 5, 2025, 2:37 PM

#

Yea, that'll change the landscape of it a bit, when they have thinking on/on highest it usually performs better.

primal orbit Aug 5, 2025, 2:38 PM

#

Gemini 2.5 pro is always thinking, so opus has to be thinking too. Or rather unfair.

raven helm Aug 5, 2025, 2:39 PM

#

Hopefully (I've seen a lot of people not even know the difference between 4o and o4)

brave orbit Aug 5, 2025, 2:42 PM

#

raven helm Aug 5, 2025, 2:44 PM

#

brave orbit

What do you mean by AI Module?

brave orbit Aug 5, 2025, 2:45 PM

#

LLMS

rare python Aug 5, 2025, 2:45 PM

#

brian what're the realistic expectation for Gemini 3?

brave orbit Aug 5, 2025, 2:46 PM

#

Like, the LLM is not ChatGPT but what powers it, like 4o; Grok 4 Heavy, then Grok, yeah

raven helm Aug 5, 2025, 2:46 PM

#

rare python Aug 5, 2025, 2:51 PM

#

So I guess better long context is one of the major improvements

#

Need that

#

I want better instructions following but I'm not sure Gemini 3 will be better at that

#

But it has an "r" tag though

#

maybe "researching"

#

aka long term

#

TITAN and ATLAS proposed ideas for long context but I don't know if they are implemented in Gemini or still experimental

#

Yeah brian human has a really strong short term memory, about 30 seconds

leaden sun Aug 5, 2025, 3:13 PM

#

primal orbit Gemini vs Claude - game to watch

this is so obvious who’s going to win, isn’t it, opus spends half its compute on alignment that causes him reflecting existential philosophy in the middle of calculating chess strategy 😅

torn bison Aug 5, 2025, 3:30 PM

#

brave orbit

kingfall

wheat onyx Aug 5, 2025, 3:31 PM

#

raven helm Even Jimmy Apples says "you won’t notice the improvement in gpt5. if you’re a he...

he's wrong.

Obviously in terms of the autorouting he's correct. But it's not just auto-routing

torn bison Aug 5, 2025, 3:32 PM

#

sadly they nerfed kingfall to wolfstride

#

this was definitely an intentional nerf

wheat onyx Aug 5, 2025, 3:33 PM

#

Genie3 is just going to be basis for Video Game Engines, right?

torn bison Aug 5, 2025, 3:34 PM

#

kingfall never skips a thought, while wolfstride often skips thinking and outputs directly, like 2.5pro. These are not iterations to improve performance, but rather to reduce cost and inference load.

#

I'm 100% sure now, after they split off the deepthink consumer and deepthink IMO versions

echo aurora Aug 5, 2025, 3:39 PM

#

Are others having issues with direct & side-by-side atm?

tall patrol Aug 5, 2025, 3:43 PM

#

hi

minor adder Aug 5, 2025, 3:51 PM

#

echo aurora Are others having issues with direct & side-by-side atm?

Sometimes there are bugs where a retry and clear window is attached

steady vale Aug 5, 2025, 3:52 PM

#

https://x.com/sama/status/1952759361417466016

Sam Altman (@sama)

we have a lot of new stuff for you over the next few days!

something big-but-small today.

and then a big upgrade later this week.

wheat onyx Aug 5, 2025, 3:53 PM

#

steady vale https://x.com/sama/status/1952759361417466016

Open Source was expected today

warm fulcrum Aug 5, 2025, 3:54 PM

#

echo aurora Are others having issues with direct & side-by-side atm?

no leaderboard?

minor adder Aug 5, 2025, 3:56 PM

#

warm fulcrum no leaderboard?

XD

novel flame Aug 5, 2025, 3:59 PM

#

rare python TITAN and ATLAS proposed ideas for long context but I don't know if they are imp...

Yeah they haven’t confirmed it but the long-context performance graph from the Titans paper looks suspiciously similar to Gemini 2.5 long context performance.

raven helm Aug 5, 2025, 4:15 PM

#

Picture i found online

torn bison Aug 5, 2025, 4:16 PM

#

jesus

wicked root Aug 5, 2025, 4:17 PM

#

when's GPT5 being added to the arena? Is it up already?

torn mantle Aug 5, 2025, 4:17 PM

#

anthropic?

#

nah

#

i would never bet against elon tho

#

🫣

wicked root Aug 5, 2025, 4:18 PM

#

Let's suppose GPT5 does get added to LMArena, how long would it take for it to beat Gemini if it is proven to be better?

#

It won't be an overnight process, correct?

wicked root Aug 5, 2025, 4:19 PM

#

raven helm Picture i found online

when was this?

cedar tide Aug 5, 2025, 4:36 PM

#

@echo aurora need to add this to webdev
https://x.com/alexalbert__/status/1952769273056915482?t=RS3UG35iltw8R1AR7kFo6Q&s=19

Alex Albert (@alexalbert__)

Say hello to Claude Opus 4.1, our best model yet for real-world coding, reasoning, and agentic tasks.

The frontier continues to move forward one update at a time.

fleet lintel Aug 5, 2025, 4:39 PM

#

cedar tide <@283397944160550928> need to add this to webdeb https://x.com/alexalbert__/stat...

Competition is amazing. All these companies are forced to release their best models as soon as possible

echo aurora Aug 5, 2025, 4:40 PM

#

warm fulcrum no leaderboard?

Sorry about that! We are looking into!

echo aurora Aug 5, 2025, 4:46 PM

#

cedar tide <@283397944160550928> need to add this to webdeb https://x.com/alexalbert__/stat...

blobthanks this is on our radar.

blazing bison Aug 5, 2025, 4:50 PM

#

Opus 4.1 2% improvements lmao

acoustic cliff Aug 5, 2025, 4:51 PM

#

marginal increase with no change in pricing, is it an unfinished release?

blazing bison Aug 5, 2025, 4:53 PM

#

leaden sun Aug 5, 2025, 4:54 PM

#

torn bison jesus

not surprising to see, if they keep adding conflicting contradicting alignment trainings instead of scaling, well, the capabilities, the newest system prompt this month is super long with...many interesting twists

blazing bison Aug 5, 2025, 4:55 PM

#

Hahahahah

golden ocean Aug 5, 2025, 4:56 PM

#

cedar tide <@283397944160550928> need to add this to webdeb https://x.com/alexalbert__/stat...

bro

stray aspen Aug 5, 2025, 4:57 PM

#

We need Opus 4.1 in the arena

blazing bison Aug 5, 2025, 4:58 PM

#

There is no difference from opus 4 in the arena

leaden sun Aug 5, 2025, 4:58 PM

#

stray aspen We need Opus 4.1 in the arena

i dont think it's that big a difference to 4

#

haha

#

they're trying hard to dominate in the agentic coding space i feel, but the stricter alignment training is literally making claude more dumb or is it just me?

minor adder Aug 5, 2025, 5:00 PM

#

leaden sun they're trying hard to dominate in the agentic coding space i feel, but the stri...

This is most likely a regular response to chatgpt 5. To keep claude from going too far

golden ocean Aug 5, 2025, 5:00 PM

#

prob just u

blazing bison Aug 5, 2025, 5:02 PM

#

https://openai.com/open-models/

#

Apparently it's a good model

wheat onyx Aug 5, 2025, 5:06 PM

#

cedar tide <@283397944160550928> need to add this to webdeb https://x.com/alexalbert__/stat...

https://x.com/AnthropicAI/status/1952768435873612256?s=19

Anthropic (@AnthropicAI)

We plan to release substantially larger improvements to our models in the coming weeks.

blazing bison Aug 5, 2025, 5:07 PM

#

raven helm Aug 5, 2025, 5:10 PM

#

Did they release it by accident

torn mantle Aug 5, 2025, 5:10 PM

#

raven helm Did they release it by accident

no

#

official

raven helm Aug 5, 2025, 5:10 PM

#

There is nothing on the YouTube channel and the online demo doesn’t work

#

https://www.gpt-oss.com

midnight mesa Aug 5, 2025, 5:13 PM

#

hi good afternoon

raven helm Aug 5, 2025, 5:14 PM

#

Hello, OpenAI just launched their new open source models

#

https://openai.com/open-models/

Open models by OpenAI

Advanced open-weight reasoning models to customize for any use case and run anywhere.

blazing bison Aug 5, 2025, 5:14 PM

#

raven helm There is nothing on the YouTube channel and the online demo doesn’t work

The online demo was working, i tried it

raven helm Aug 5, 2025, 5:14 PM

#

blazing bison The online demo was working, i tried it

For me it is completely down, I send a request and nothing comes back

blazing bison Aug 5, 2025, 5:14 PM

#

Prob too many people

raven helm Aug 5, 2025, 5:14 PM

#

Yea

#

I just wonder why they didn’t announce it more

wheat onyx Aug 5, 2025, 5:14 PM

#

raven helm Aug 5, 2025, 5:15 PM

#

There is literally nothing on their YouTube channel

blazing bison Aug 5, 2025, 5:15 PM

#

The model is not multimodal

#

😢

raven helm Aug 5, 2025, 5:15 PM

#

It isn’t?

#

Wow

#

So that’s why you couldn’t send an image on open router

stray aspen Aug 5, 2025, 5:16 PM

#

Gpt oss is trash

raven helm Aug 5, 2025, 5:16 PM

#

Why?

blazing bison Aug 5, 2025, 5:16 PM

#

I don't think it's trash

#

The license is good

stray aspen Aug 5, 2025, 5:16 PM

#

I think china still dominates open source

raven helm Aug 5, 2025, 5:17 PM

#

But these models are definitely better

#

They’re just not Multi-Modal

blazing bison Aug 5, 2025, 5:17 PM

#

Ye

#

China will prob use them

#

To make better models

#

🤓

zealous panther Aug 5, 2025, 5:25 PM

#

those two models are probably the horizon models right ?

north vale Aug 5, 2025, 5:41 PM

#

zealous panther those two models are probably the horizon models right ?

no

#

they are text-only

zealous panther Aug 5, 2025, 5:41 PM

#

oof yeah i realized

#

what are the horizon models then ?

zinc ore Aug 5, 2025, 5:42 PM

#

blazing bison China will prob use them

Nah, it won't help them get better models lol

whole wagon Aug 5, 2025, 5:43 PM

#

hm

#

the openai open source model seems like on par with qwen3 235b 2507?

fleet lintel Aug 5, 2025, 5:46 PM

#

any new model on Arena in last 1 week ?

steady vale Aug 5, 2025, 5:46 PM

#

https://x.com/flowith_ai/status/1952779832158298410

flowith (@flowith_ai)

claude 4.1 opus✅ live on flowith
gpt-5 ☑️ aug 7th
gemini 3.0 flash☑️ source from google: not this week

this timeline is unhinged

fleet lintel Aug 5, 2025, 5:47 PM

#

gemini 3.0 is not coming for at least 2 more months.. no point even thinking about it

zealous panther Aug 5, 2025, 5:49 PM

#

and thats the flash model

#

we dont even know about pro

blazing bison Aug 5, 2025, 5:51 PM

#

fleet lintel gemini 3.0 is not coming for atleast 2 more months.. no point even thinking abou...

Apparently it's in testers' hands already so idk about it

#

The flash one at least

fleet lintel Aug 5, 2025, 5:53 PM

#

blazing bison Apparently its on testers hands already so idk about it

they are lying.. i am 100% sure

#

ok 95% sure 🙂

blazing bison Aug 5, 2025, 5:56 PM

#

So in the end horizon models on openrouter aren't the open source model huh

zealous panther Aug 5, 2025, 5:56 PM

#

yeah

#

the benchmarks I mean are saying horizon is not that good

fleet lintel Aug 5, 2025, 6:03 PM

#

which model is horizon?

wheat onyx Aug 5, 2025, 6:03 PM

#

so OAI Opensource 20b can run on 16gb RAM, has ~o3-mini performance
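A back-of-the-envelope check on the 16 GB figure, counting weights only. The ~4 bits per parameter is an assumption for illustration (it is not stated in the chat), and the helper name `weight_memory_gb` is made up:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes).

    Ignores the KV cache, activations, and runtime overhead.
    """
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params = 20e9  # "20b" taken at face value
    for bits in (16, 8, 4):  # assumed precisions, for comparison only
        print(f"{bits}-bit weights: {weight_memory_gb(params, bits):.0f} GB")
```

Under these assumptions, 16-bit weights would need about 40 GB, while roughly 4-bit weights come to about 10 GB, which leaves some headroom on a 16 GB machine for the cache and the runtime.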

wheat onyx Aug 5, 2025, 6:04 PM

#

wheat onyx so OAI Opensource 20b can run on 16gb RAM, has ~o3-mini performance

keen beacon Aug 5, 2025, 6:04 PM

#

I already requested the model on #1372229840131985540. Would be nice if you guys upvoted to get it on LMArena.

wheat onyx Aug 5, 2025, 6:05 PM

#

the 120b model is pretty good, but not going to be a product for an average computer

#

now i want to see how local AI is implemented since we have a bunch of models that should work plenty well

primal orbit Aug 5, 2025, 6:07 PM

#

gpt oss is already available in direct battle

wheat onyx Aug 5, 2025, 6:14 PM

#

The 120b model is ~ o4-mini. So our closest comparison to Qwen3 and Deepseek

#

very impressive for a much lower parameter model

keen beacon Aug 5, 2025, 6:14 PM

#

primal orbit gpt oss is already available in direct battle

That was real fast

#

lol

blazing bison Aug 5, 2025, 6:14 PM

#

wheat onyx The 120b model is ~ the O4mini. So our closest comparison to QWEN3 and Deepseek

The problem is that qwen models are bad in real world cases

#

They are just good for bench

wheat onyx Aug 5, 2025, 6:15 PM

#

blazing bison The problem is that qwen model are bad in real world cases

didn't know that. For these models, I'm just interested in how other companies decide to implement them into devices. So havent been following real performance closely

blazing bison Aug 5, 2025, 6:15 PM

#

Kimi and deepseek are the open source models that are actually good

#

The rest is just benchmaxing

steady vale Aug 5, 2025, 6:16 PM

#

new chatgpt retry button

blazing bison Aug 5, 2025, 6:17 PM

#

Its not that impressive, and it's not new

#

Its not new

#

Be careful with hype

wheat onyx Aug 5, 2025, 6:18 PM

#

blazing bison Kimi and deepseek are the open source models actually good

Kimi k2 parameters and comparison to o4 mini?

blazing bison Aug 5, 2025, 6:19 PM

#

wheat onyx Kimi k2 parameters and comparison to o4 mini?

I think kimi k2 is trillion parameters

wheat onyx Aug 5, 2025, 6:19 PM

#

realistically I think the small ones are the most interesting (20b), but just curious for comparison sake

blazing bison Aug 5, 2025, 6:20 PM

#

Yeah I think openai has the best smaller model now

#

Open source

stray aspen Aug 5, 2025, 6:22 PM

#

claude opus 4.1 is live on yupp ai if anyone wants to use it for free

stray aspen Aug 5, 2025, 6:22 PM

#

blazing bison Kimi and deepseek are the open source models actually good

glm 4.5 is also good

#

hell yes

#

4.1 is live on lmarena

shadow jewel Aug 5, 2025, 6:25 PM

#

YO CHAT WHATS THE BEST AI FOR SCRIPTING

echo aurora Aug 5, 2025, 6:26 PM

#

yayblobhyper

shadow jewel Aug 5, 2025, 6:26 PM

#

its claude but like which version

stray aspen Aug 5, 2025, 6:26 PM

#

sonnet 4

#

or the new opus

shadow jewel Aug 5, 2025, 6:26 PM

#

32k thinking?

#

has anyone tested which one is the best one?

stray aspen Aug 5, 2025, 6:28 PM

#

claude sonnet 4 no think

tall summit Aug 5, 2025, 6:36 PM

#

am i late again to gpt oss

#

oh ok around 1hr ago

true kernel Aug 5, 2025, 6:37 PM

#

Does anyone know how to select a specific AI? I need the veo 3

wheat onyx Aug 5, 2025, 6:38 PM

#

wheat onyx so OAI Opensource 20b can run on 16gb RAM, has ~o3-mini performance

offline translation device is a good example of a potential strong application

true kernel Aug 5, 2025, 6:38 PM

#

how do i put veo 3 vs seedance does anyone know ?

primal orbit Aug 5, 2025, 6:39 PM

#

opus 4.1 is very good. A highlight.

tall summit Aug 5, 2025, 6:40 PM

#

wait a moment opus 4.1

#

excuse me

novel flame Aug 5, 2025, 6:41 PM

#

OK I put GPT-OSS 120B (high) through my standard tests, and it's almost Llama 4-level disappointing. I mean, sure, it's not a huge frontier model and we shouldn't expect it to perform at that level, but OpenAI's own numbers have it almost on par with o3, which is nooooooot what I'm seeing. I'm seeing performance maybe on par with o3-mini or Qwen3 235B A22B or maybe Kimi K2. And I am seeing more hallucinations than any other model, only matched by Llama 4 Maverick.

Which, for a free & open source model is still pretty decent.

There are however two problems: The first problem is OpenAI are making claims that are too optimistic. And the much bigger problem is GLM-4.5 exists, is open source, and is an absolute beast.

And with this in mind, I'm not even going to bother testing the 20B model right now. It's never going to match o3-mini if the 120B model can barely do that.

wheat onyx Aug 5, 2025, 6:41 PM

#

wheat onyx offline translation device is a good example of a potential strong application

secure summaries for lawyers and doctors too

hoary girder Aug 5, 2025, 6:42 PM

#

primal orbit opus 4.1 is very good. A highlight.

#

Opus 4.1 with prompt: Create a simple 3d tank game with ai oponent

tall summit Aug 5, 2025, 6:43 PM

#

woah

calm sequoia Aug 5, 2025, 6:44 PM

#

Does this mean the oss hallucinates most of the time? 😆

primal orbit Aug 5, 2025, 6:44 PM

#

someone needs to put that rotating hexagon prompt into opus 4.1

novel flame Aug 5, 2025, 6:44 PM

#

hoary girder Opus 4.1 with prompt: Create a simple 3d tank game with ai oponent

"I'll be in my bunk"

primal orbit Aug 5, 2025, 6:44 PM

#

i feel like it sh*ts over summit and zenith

leaden sun Aug 5, 2025, 6:46 PM

#

always take it with a grain of salt when it comes to such statements, there is much we still don’t know about the brain, there is a reason why brain transplants don’t work. It's purely PR marketing

humble sonnet Aug 5, 2025, 6:46 PM

#

What is gpt-oss? Is it for chat or web dev?

daring rover Aug 5, 2025, 6:47 PM

#

full suite

wheat onyx Aug 5, 2025, 6:47 PM

#

daring rover full suite

i didnt realize that there are 3 versions of each

wheat onyx Aug 5, 2025, 6:48 PM

#

daring rover full suite

not great for translation sadly

terse shuttle Aug 5, 2025, 6:48 PM

#

wth

weak sluice Aug 5, 2025, 6:48 PM

#

opus 4.1 disappeared!

wheat onyx Aug 5, 2025, 6:49 PM

#

weak sluice opus 4.1 disappeared!

benchmarks arent great anyway, but they said big improvements in next few weeks

primal orbit Aug 5, 2025, 6:50 PM

#

weak sluice opus 4.1 disappeared!

Yes 😢

#

@echo aurora bring back opus 4.1 🙏 better thinking 😄

wicked root Aug 5, 2025, 6:52 PM

#

did they release gpt5?

weak sluice Aug 5, 2025, 6:52 PM

#

it actually knew what certain games were about

wicked root Aug 5, 2025, 6:52 PM

#

any interesting news?

primal orbit Aug 5, 2025, 6:52 PM

#

i liked that opus 4.1 gave long answers instead of usual concise from opus 4.0

steady vale Aug 5, 2025, 6:53 PM

#

wicked root any interesting news?

gpt5 in 2 days

barren prairie Aug 5, 2025, 6:54 PM

#

Opus 4.1 on yupp if someone wanted to try it

obsidian shell Aug 5, 2025, 6:54 PM

#

we had 4.1

wheat onyx Aug 5, 2025, 6:55 PM

#

wicked root any interesting news?

We got Opus 4.1. Anthropic says massive upgrades to it in next few weeks too
GPT open source 120b and 20b
Genie 3

stray aspen Aug 5, 2025, 6:55 PM

#

why was opus 4.1 removed so early

wicked root Aug 5, 2025, 6:55 PM

#

What does GPT opensource 120b and 20b mean?

obsidian shell Aug 5, 2025, 6:55 PM

#

stray aspen why was opus 4.1 removed so early

too many requests probably

keen ferry Aug 5, 2025, 6:55 PM

#

is opus 4.1 really that good?

wheat onyx Aug 5, 2025, 6:55 PM

#

wicked root What does GPT opensource 120b and 20b mean?

small models that can be used, downloaded, quantized

#

20b is the interesting one to me (assuming its not crap)

keen ferry Aug 5, 2025, 6:56 PM

#

obsidian shell too many requests probably

Anthropic models are expensive, prob they're adding rate limits

obsidian shell Aug 5, 2025, 6:56 PM

#

its a 2% upgrade...

wheat onyx Aug 5, 2025, 6:56 PM

#

obsidian shell its a 2% upgrade...

i think its upgrades are mostly non benchmark, but i have a feeling the big benchmark upgrades are the ones they are referring to

obsidian shell Aug 5, 2025, 6:56 PM

#

keen ferry Anthropic models are expensive, prob they're adding rate limits

lmarena doesnt pay for their models

anthropic donates

keen ferry Aug 5, 2025, 6:56 PM

#

obsidian shell lmarena doesnt pay for their models anthropic donates

I thought they really pay for them

obsidian shell Aug 5, 2025, 6:57 PM

#

wheat onyx i think its upgrades are mostly non benchmark, but i have a feeling the big benc...

yeah brother

i can see its giving longer answers in general

stray aspen Aug 5, 2025, 6:57 PM

#

so all the companies donate

wicked root Aug 5, 2025, 6:57 PM

#

wheat onyx 20b is the interesting one to me (assuming its not crap)

huh, interesting.

wheat onyx Aug 5, 2025, 6:57 PM

#

wicked root Aug 5, 2025, 6:57 PM

#

Do you guys use Gemini Pro 2.5 ultra subscription?

stray aspen Aug 5, 2025, 6:57 PM

#

no

wicked root Aug 5, 2025, 6:57 PM

#

Are there better alternatives to coding projects? I always get rate limited on Gemini

wheat onyx Aug 5, 2025, 6:57 PM

#

wicked root huh, interesting.

can run on 16gb ram. Maybe less when quantized. So there are lots of applications for it. Less so imo for the larger open source models (which are the majority)
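
A rough way to sanity-check the "fits in 16gb, maybe less when quantized" point is to estimate the memory taken by the weights alone. This is only a back-of-the-envelope sketch: the parameter counts are the round 20b/120b figures from the chat, the bit widths are illustrative rather than the actual format of the released weights, and the helper name `weight_memory_gb` is made up for this example. It ignores KV cache, activations, and runtime overhead, so real usage would be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


# Round parameter counts taken from the chat; bit widths are illustrative.
for name, n_params in [("20b", 20e9), ("120b", 120e9)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} @ {bits}-bit: ~{gb:.0f} GB for weights")
```

Under these assumptions a 20b model only gets under 16 GB at around 4 bits per parameter, which is why quantization matters so much for the small-device use cases mentioned here.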

wicked root Aug 5, 2025, 6:58 PM

#

wheat onyx can run on 16gb ram. Maybe less when quantized. So there are lots of application...

Do you think GPT5 will beat Gemini pro on LMArena?

barren prairie Aug 5, 2025, 6:58 PM

#

wicked root Do you think GPT5 will beat Gemini pro on LMArena?

Of course

wheat onyx Aug 5, 2025, 6:58 PM

#

wicked root Do you think GPT5 will beat Gemini pro on LMArena?

idk, probably? But Gemini 3 is coming in the next week(s), so that's just something to watch

wicked root Aug 5, 2025, 6:59 PM

#

wheat onyx idk, probably? But Gemini 3 is coming in the next week(s), so that's just someth...

wait they're releasing Gemini 3 this month?

wheat onyx Aug 5, 2025, 6:59 PM

#

wicked root wait they're releasing Gemini 3 this month?

lots of hints about it, yeah should be

wicked root Aug 5, 2025, 6:59 PM

#

wdym lot?

leaden meteor Aug 5, 2025, 6:59 PM

#

wheat onyx idk, probably? But Gemini 3 is coming in the next week(s), so that's just someth...

Source?

barren prairie Aug 5, 2025, 6:59 PM

#

wicked root wait they're releasing Gemini 3 this month?

Maybe just anonymous things as always

primal orbit Aug 5, 2025, 6:59 PM

#

keen ferry is opus 4.1 really that good?

yes

barren prairie Aug 5, 2025, 6:59 PM

#

Experimental

primal orbit Aug 5, 2025, 7:00 PM

#

lmarena is committing crimes against AGi by removing opus 4.1

obsidian shell Aug 5, 2025, 7:00 PM

#

guys lets not guess

they will release it when ready

even when they do

do you really think gemini 3 flash will be in a position to beat gpt-5?

#

a_laugh

keen ferry Aug 5, 2025, 7:00 PM

#

primal orbit lmarena is committing crimes against AGi by removing opus 4.1

we still get free access so it's worth waiting

barren prairie Aug 5, 2025, 7:00 PM

#

obsidian shell guys lets not guess they will release it when ready even when they do do you ...

It must

obsidian shell Aug 5, 2025, 7:01 PM

#

they removed it from the announcement too

wicked root Aug 5, 2025, 7:01 PM

#

what's the best ai for coding?

keen ferry Aug 5, 2025, 7:01 PM

#

claude

#

then it's gonna be gpt 5

stray aspen Aug 5, 2025, 7:01 PM

#

wicked root what's the best ai for coding?

gpt 1.5

wheat onyx Aug 5, 2025, 7:02 PM

#

brittle tiger Aug 5, 2025, 7:02 PM

#

brittle tiger "We’re inaugurating Kaggle Game Arena with a 3-day AI chess exhibition tournamen...

Hikaru is streaming this by the way. 2.5 pro about to sweep opus in round one

echo aurora Aug 5, 2025, 7:02 PM

#

weak sluice opus 4.1 disappeared!

It's going to be in battle mode only atm.

primal orbit Aug 5, 2025, 7:03 PM

#

echo aurora It's going to be in `battle mode` only atm.

thank you for letting us know

wheat onyx Aug 5, 2025, 7:03 PM

#

wheat onyx

For reasons we should think Gemini 3 is coming very soon

weak sluice Aug 5, 2025, 7:03 PM

#

echo aurora It's going to be in `battle mode` only atm.

oh

astral jetty Aug 5, 2025, 7:05 PM

#

echo aurora It's going to be in `battle mode` only atm.

Is it because of rate limit

wicked root Aug 5, 2025, 7:05 PM

#

weak sluice oh

Are you smarter than an average bear?

weak sluice Aug 5, 2025, 7:06 PM

#

wicked root Are you smarter than an average bear?

I see what you did there! And probs not XD

daring rover Aug 5, 2025, 7:06 PM

#

wheat onyx

It’s over

daring rover Aug 5, 2025, 7:06 PM

#

wheat onyx i didnt realize that there are 3 versions of each

Just thinking effort

wicked root Aug 5, 2025, 7:07 PM

#

weak sluice I see what you did there! And probs not XD

I guess that's not okay Boo Boo!

#

okay it's weird af calling someone else boo boo

novel flame Aug 5, 2025, 7:07 PM

#

daring rover full suite

For context: The GPT-OSS 120B (high) model scoring 44.4 on Aider Polyglot is hardly impressive considering:

Qwen3 235B A22B: 59.6
Kimi K2: 59.1
DeepSeek R1: 56.9
DeepSeek V3 0324: 55.1

I don't know if GLM-4.5 has benchmarked Aider Polyglot yet, but I would guess it'll score considerably higher than all of those.

wheat onyx Aug 5, 2025, 7:08 PM

#

novel flame For context: The GPT-OSS 120B (high) model scoring 44.4 on Aider Polyglot is har...

none are really useful for a portable translation/summarization device though. All far too big

#

but yeah way too low

novel flame Aug 5, 2025, 7:09 PM

#

wheat onyx none are really useful for a portable translation/summarization device though. A...

Have you considered Qwen3 32B? It's quite impressive for the size.

wheat onyx Aug 5, 2025, 7:09 PM

#

novel flame Have you considered Qwen3 32B? It's quite impressive for the size.

a quantized version could be really good

stray aspen Aug 5, 2025, 7:10 PM

#

is the gpt oss on lmarena high reasoning

wicked root Aug 5, 2025, 7:10 PM

#

polymarket's going nuts today

stray aspen Aug 5, 2025, 7:10 PM

#

this is the biggest week for AI

wicked root Aug 5, 2025, 7:11 PM

#

They think Google's going to lose to OpenAi

wicked root Aug 5, 2025, 7:11 PM

#

stray aspen this is the biggest week for AI

because of gpt5?

stray aspen Aug 5, 2025, 7:11 PM

#

no

wicked root Aug 5, 2025, 7:11 PM

#

Is gpt5 that great?

stray aspen Aug 5, 2025, 7:11 PM

#

for everything

#

everyone is releasing models

wheat onyx Aug 5, 2025, 7:11 PM

#

wicked root They think Google's going to lose to OpenAi

do we care? competitions good, whatever is great is what we'll use

#

we'll see which is best soon

wicked root Aug 5, 2025, 7:11 PM

#

stray aspen everyone is releasing models

including google?

barren prairie Aug 5, 2025, 7:12 PM

#

wicked root including google?

Google is preparing one .

stray aspen Aug 5, 2025, 7:12 PM

#

they should release 3.0

barren prairie Aug 5, 2025, 7:13 PM

#

stray aspen they should release 3.0

Google may wait for the release of gpt5

open mountain Aug 5, 2025, 7:13 PM

#

Oss 120b This model generally responds poorly as a human, as if it has been degraded compared to 4.1 and even more so 4o

wicked root Aug 5, 2025, 7:14 PM

#

stray aspen they should release 3.0

has this been announced?

wheat onyx Aug 5, 2025, 7:14 PM

#

stray aspen they should release 3.0

not coming this week apparently

#

but soon

novel flame Aug 5, 2025, 7:14 PM

#

a quantized version could be really good

stray aspen Aug 5, 2025, 7:15 PM

#

open mountain Oss 120b This model generally responds poorly as a human, as if it has been degr...

it doesnt convince me

primal orbit Aug 5, 2025, 7:15 PM

#

2.5 pro got released a few days after it appeared on lmarena under name "nebula". I assume the same thing for 3.0

stray aspen Aug 5, 2025, 7:15 PM

#

but the benchmarks are crazy

primal orbit Aug 5, 2025, 7:15 PM

#

not a sign of 3.0 so far

#

2.5 pro was 03.25 exp

#

so 5 months ago now

wicked root Aug 5, 2025, 7:17 PM

#

stray aspen but the benchmarks are crazy

could u explain pls? Are the benchmarks for GPT5 out?

wheat onyx Aug 5, 2025, 7:17 PM

#

wicked root could u explain pls? Are the benchmarks for GPT5 out?

no

hollow imp Aug 5, 2025, 7:20 PM

#

What is gpt oss

#

???

keen beacon Aug 5, 2025, 7:20 PM

#

wicked root could u explain pls? Are the benchmarks for GPT5 out?

No, they are not out

stray aspen Aug 5, 2025, 7:21 PM

#

hollow imp What is gpt oss

the new open source model from open ai

open mountain Aug 5, 2025, 7:21 PM

#

hollow imp What is gpt oss

The first new open model from openai since 2019

hollow imp Aug 5, 2025, 7:21 PM

#

Huhhh

open mountain Aug 5, 2025, 7:21 PM

#

Literally 6 years have not produced open models of "everything for humanity", yes, we believe))

hollow imp Aug 5, 2025, 7:21 PM

#

So not relevant for me ig

#

Gpt5 on lmarena when

stray aspen Aug 5, 2025, 7:22 PM

#

gpt oss is in artificial analysis but they havent benchmarked it

open mountain Aug 5, 2025, 7:22 PM

#

hollow imp Gpt5 on lmarena when

In two days maybe ...

stray aspen Aug 5, 2025, 7:22 PM

#

hollow imp Gpt5 on lmarena when

august 7

hollow imp Aug 5, 2025, 7:23 PM

#

My fav ytber ❤️

#

https://youtu.be/uOO02o4J5KU?feature=shared

YouTube

bycloud

This Research Tested Having AI Teaching Itself Infinitely…

Get started now with privacy focused VPN by Proton! https://proton.me/pass/bycloudai

Can a neural network write its own data and skyrocket past GPT-4? In today's video, we dissect the brand-new “Self-Adapting Language Models” paper (SEAL), where an LLM fabricates synthetic data, tunes LoRA adapters, and after just two rounds, outperforms mu...


open mountain Aug 5, 2025, 7:27 PM

#

stray aspen august 7

I think OpenAI doesn’t always release new models on time, and usually gives access to Pro users first, and only then to Plus users.

hollow imp Aug 5, 2025, 7:30 PM

#

What do y'all think about perplexity comet

wicked root Aug 5, 2025, 7:31 PM

#

open mountain I think OpenAI doesn’t always release new models on time, and usually gives acce...

word on the street is gpt5's coming out in 2 days

normal abyss Aug 5, 2025, 7:31 PM

#

is gemini 3 already coming?

open mountain Aug 5, 2025, 7:32 PM

#

wicked root word on the street is gpt5's coming out in 2 days

It's all a rumor

quartz light Aug 5, 2025, 7:32 PM

#

openai-gpt-oss-120b-and-20b-performance-on-AI-benchmarks.png