#general | Arena | Page 76

jade egret Jul 24, 2025, 4:45 AM

#

is AI Mode like Gemini but on 'search steroids'?

empty stump Jul 24, 2025, 5:15 AM

#

Why no video arena on the website?

frigid coral Jul 24, 2025, 5:22 AM

#

better account verification on discord maybe?

frigid coral Jul 24, 2025, 5:27 AM

#

or maybe video storage cost?

languid crescent Jul 24, 2025, 5:28 AM

#

Is it okay to add the LMArena bot to our own Discord server? (prolly a dumb question)

frigid coral Jul 24, 2025, 5:31 AM

#

languid crescent Is it okay for LMarena to incorporate their bot on own discord server? (prolly a...

seems like there's an option to add the bot

keen fulcrum Jul 24, 2025, 5:32 AM

#

frigid coral seems like there's an option to add the bot

doesn't work outside the server @echo aurora

echo aurora Jul 24, 2025, 5:33 AM

#

keen fulcrum doesn't work outside the server @echo aurora

Yeah we're keeping the bot just in here for now.

languid crescent Jul 24, 2025, 5:34 AM

#

ahh i see

keen fulcrum Jul 24, 2025, 5:34 AM

#

echo aurora Yeah we're keeping the bot just in here for now.

Can you disable adding the bot?

languid crescent Jul 24, 2025, 5:37 AM

#

tiny palm

the right one is trippy 😭

echo aurora Jul 24, 2025, 6:05 AM

#

these are all great!

torn bison Jul 24, 2025, 6:05 AM

#

You must be in one of the ⁠video-arena channels to use the bot. These are the only channels where you can use the bot.

echo aurora Jul 24, 2025, 6:06 AM

#

I'm going to disable the bot in here for now, but it can still be used in #video-arena-1

torn bison Jul 24, 2025, 6:06 AM

#

bots in general are too spammy

echo aurora Jul 24, 2025, 6:06 AM

#

yeah was thinking the same

leaden sun Jul 24, 2025, 6:13 AM

#

i think it's not only the data but the core architecture of cognition...

gentle plinth Jul 24, 2025, 6:52 AM

#

echo aurora I'm going disable the bot in here for now but can still be used in <#13976556951...

Is it intentional that we can vote on ai video once the model has already been revealed?

sly estuary Jul 24, 2025, 9:41 AM

#

how can i fix it?

sly estuary Jul 24, 2025, 9:51 AM

#

sly estuary how i can fix it ?

@echo aurora sorry for the ping. I've been searching this server but can't fix this issue. I see many posts about this error, but nobody seems to have fixed it.

whole wagon Jul 24, 2025, 10:16 AM

#

Why is the updated Qwen still not on the leaderboard?

#

echo aurora Jul 24, 2025, 1:03 PM

#

gentle plinth Is it intentional that we can vote on ai video once the model has already been r...

It is intentional. We’re a bit limited in what we can do with the bot. But if you have feedback we’d love to hear it.

echo aurora Jul 24, 2025, 1:05 PM

#

sly estuary @echo aurora sorry for ping. I just searching in this server but can't ...

No need to apologize for the ping! I’m going to start a post in #1343291835845578853 to get more info

#

For those that missed it - we’ve soft launched Video Arena through a Discord Bot. You can learn more in #1397655624103493813 and generate videos in #video-arena-1

alpine coral Jul 24, 2025, 1:30 PM

#

echo aurora I'm going disable the bot in here for now but can still be used in <#13976556951...

fwiw i think keeping it out of general for good would be the way to go.. that scroll wasn't what i come here for aha

humble sonnet Jul 24, 2025, 1:40 PM

#

Will video generation be available on the website later?

stray aspen Jul 24, 2025, 2:16 PM

#

will video generation be added

errant cave Jul 24, 2025, 3:00 PM

#

whole wagon Why is updated Qwen not on leaderboard still

Not enough votes probably

stray aspen Jul 24, 2025, 3:25 PM

#

when will o3 pro and claude 4 opus thinking be added

#

bro why is gemini 2.5 pro grounding gone

alpine coral Jul 24, 2025, 3:41 PM

#

just fwiw.. same model (K2), but different providers (moonshot, togetherai and groq) and two temp settings (0.6 and 1).. pretty divergent results (to this 20-question quiz anyway)

#

lower temp seems better ig (top score seems an outlier, notwithstanding)

#

speed difference is wild.. groq is an order of magnitude faster than moonshot (which is slow af), and much faster than togetherai, which isn't too bad/respectable

cedar tide Jul 24, 2025, 3:46 PM

#

@echo aurora We would like an explanation of why GLM 4 Air still isn't on the leaderboard after 2 months in the arena and WebDev (and the same for 2.5 Flash Lite on WebDev)

verbal nimbus Jul 24, 2025, 3:46 PM

#

alpine coral just fwiw.. same model (K2), but different providers (moonshot, togetherai and g...

Interesting

#

You could put it in a Docker dev container, and only bind the project directory (or use git to clone and push)

echo aurora Jul 24, 2025, 3:49 PM

#

cedar tide @echo aurora We would like explanations on the non-integration of glm 4...

I’ll flag to the team and get you a response.

cedar tide Jul 24, 2025, 3:50 PM

#

Thx

verbal nimbus Jul 24, 2025, 3:50 PM

#

Or perhaps run it as a separate user, but I haven't tried that

cedar tide Jul 24, 2025, 3:50 PM

#

GLM 4.5 arrives very soon

echo aurora Jul 24, 2025, 3:50 PM

#

humble sonnet Will video generation be available on the website later?

It’s possible. We’re experimenting a bit with this one.

verbal nimbus Jul 24, 2025, 3:51 PM

#

stray aspen when will o3 pro and claude 4 opus thinking be added

Isn't Claude 4 Opus thinking already on there

keen fulcrum Jul 24, 2025, 3:51 PM

#

echo aurora It’s possible. We’re experimenting a bit with this one.

Isn't it expensive? ArtificialAnalysis used pre-generated

whole wagon Jul 24, 2025, 4:12 PM

#

whats going on, did smth suggest GPT5 is coming august

#

July 31 odds plummeted at same time

#

https://www.theverge.com/notepad-microsoft-newsletter/712950/openai-gpt-5-model-release-date-notepad

#

nice lol

whole wagon Jul 24, 2025, 4:19 PM

#

whole wagon <https://www.theverge.com/notepad-microsoft-newsletter/712950/openai-gpt-5-model...

OpenAI’s open language model is still arriving ahead of GPT-5

torn mantle Jul 24, 2025, 4:23 PM

#

whole wagon whats going on, did smth suggest GPT5 is coming august

yes

#

https://x.com/tomwarren/status/1948413202565399025?s=46

Tom Warren (@tomwarren)

scoop: OpenAI is preparing to launch GPT-5 in early August. It will also ship with mini and nano versions. Details about GPT-5 in my Notepad 📒 newsletter this week, where I also dig into the SharePoint attacks. Live for subscribers 👇 https://t.co/1hLMmFkfs2

stray aspen Jul 24, 2025, 4:25 PM

#

why was gemini 2.5 pro grounding removed

soft kernel Jul 24, 2025, 4:49 PM

#

whole wagon whats going on, did smth suggest GPT5 is coming august

Yeah verge newsletter

whole wagon Jul 24, 2025, 4:52 PM

#

whole wagon <https://www.theverge.com/notepad-microsoft-newsletter/712950/openai-gpt-5-model...

.

#

i already know lol

torn mantle Jul 24, 2025, 4:58 PM

#

okay

fleet lintel Jul 24, 2025, 5:00 PM

#

Google is processing 30 trillion tokens per day . Is that actually high? I would have guessed a higher number

https://x.com/OfficialLoganK/status/1948158565581279672

ocean vortex Jul 24, 2025, 5:11 PM

#

fleet lintel Google is processing 30 trillion tokens per day . Is that actually high? I woul...

https://openrouter.ai/rankings?view=day

If you look there and sum ALL of the models on openrouter people are using for the day it comes up to like only 500b tokens roughly. So yeah 30T in a day is actually a lot.

OpenRouter

LLM Rankings | OpenRouter

Language models ranked and analyzed by usage across apps
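
The comparison above is simple arithmetic. A minimal sketch, assuming the two figures quoted in the chat (30 trillion tokens per day for Google, roughly 500 billion per day summed across OpenRouter) are in the right ballpark; neither number is verified here, and the function name is just for illustration:

```python
# Both figures are the rough numbers quoted in the chat, not measurements.
GOOGLE_TOKENS_PER_DAY = 30 * 10**12      # "30 trillion tokens per day"
OPENROUTER_TOKENS_PER_DAY = 500 * 10**9  # "like only 500b tokens roughly"


def usage_ratio(larger: int, smaller: int) -> float:
    """Return how many times bigger `larger` is than `smaller`."""
    return larger / smaller


ratio = usage_ratio(GOOGLE_TOKENS_PER_DAY, OPENROUTER_TOKENS_PER_DAY)
print(f"Quoted Google volume is about {ratio:.0f}x the OpenRouter daily total")
```

On those assumed figures the ratio works out to about 60x, which is the basis for calling 30T a lot.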

jade egret Jul 24, 2025, 5:32 PM

#

js asking, are yall ai mode ui also broken?

leaden sun Jul 24, 2025, 5:35 PM

#

the 5 pillars are not really new actually, you can find those in published papers, gemini probably condensed relevant stuffs together, it's incomplete from my point of view, but it's a nice try by gemini, the direction seems right to me

west scroll Jul 24, 2025, 5:37 PM

#

helloo

#

where is arena stage channel?

echo aurora Jul 24, 2025, 5:41 PM

#

west scroll where is arena stage channel?

the stage channel? there's the Arena Stage (top of the channel list), but no stage events are currently scheduled.

civic flame Jul 24, 2025, 5:42 PM

#

@patent aspen you have any idea what model the new "starfish" anon in arena is?

patent aspen Jul 24, 2025, 5:50 PM

#

civic flame @patent aspen you have any idea what model the new "starfish" anon in ar...

Surprisingly no

civic flame Jul 24, 2025, 5:51 PM

#

yeah nevermind it's an oai model

#

apparently not as good as o3-alpha but better than o3

#

lol that's interesting

patent aspen Jul 24, 2025, 5:52 PM

#

I once saw Jeff Dean give a presentation on Halloween dressed as a starfish

#

It was about TPUs because he was one of the co-inventors

torn mantle Jul 24, 2025, 6:02 PM

#

civic flame apparently not as good as o3-alpha but better than o3

how is it better than o3

#

share with us your results

echo aurora Jul 24, 2025, 6:02 PM

#

cedar tide @echo aurora We would like explanations on the non-integration of glm 4...

This is being addressed, thank you for the flag.

torn mantle Jul 24, 2025, 6:02 PM

#

so far its performing poorly on my tests

civic flame Jul 24, 2025, 6:07 PM

#

torn mantle how is it better than o3

i am going on the reports in dev mode

candid storm Jul 24, 2025, 6:14 PM

#

torn mantle https://x.com/tomwarren/status/1948413202565399025?s=46

Is the verge trustworthy?

torn mantle Jul 24, 2025, 6:15 PM

#

The thing with sama is that he will always overhype his models

cedar tide Jul 24, 2025, 6:15 PM

#

I don't know if Starfish is GPT 5 mini or their open source model, but its knowledge cutoff hasn't changed since their last model. It's exactly the same.

torn mantle Jul 24, 2025, 6:15 PM

#

candid storm Is the verge trustworthy?

Time to bet?

#

Take it with a grain of salt

#

But i also predicted gpt5 to be released on 1st week August

#

Not sure about their open sourced model

candid storm Jul 24, 2025, 6:18 PM

#

civic flame yeah nevermind it's an oai model

Starfish could be the opensource model right?

leaden meteor Jul 24, 2025, 6:22 PM

#

Starfish => 5 legs => GPT-5 ?!

cedar tide Jul 24, 2025, 6:32 PM

#

leaden meteor Starfish => 5 legs => GPT-5 ?!

But its 5 mini legs 😶

west scroll Jul 24, 2025, 6:42 PM

#

echo aurora the stage channel? there's the Arena Stage (top of the channel list), but no sta...

No, I mean the one that gives updates whenever there's a new model available

torn mantle Jul 24, 2025, 6:44 PM

#

cedar tide But its 5 mini legs 😶

gpt5/gpt5-mini/ new oai models

#

really nothing extraordinary

#

if its really gpt-5, then thats so disappointing

#

sama loves to overhype his models

zinc ore Jul 24, 2025, 7:00 PM

#

I think it's their OS model, I'm not even sure they'll put gpt5 on arena before release

#

And we know OS is supposed to be out before August according to verge, and it's 7 days until the 31st.

cedar tide Jul 24, 2025, 7:06 PM

#

torn mantle if its really gpt-5, then thats so disappointing

Are you serious?

cedar tide Jul 24, 2025, 7:08 PM

#

torn mantle if its really gpt-5, then thats so disappointing

We have o3 alpha too and if GPT 5 is one of those models it is o3 alpha and it is not bad

#

Starfish is GPT 5 mini

#

https://x.com/JustinLin610/status/1948456122228380128?t=HJ4-6UaUe9ull9lBPnCIrw&s=19

Junyang Lin (@JustinLin610)

gotta sleep early. tmr (oh i should say today) is qwen3-235b-a22b-thinking-2507 if everything goes well.

#

Yes

pulsar tendon Jul 24, 2025, 7:16 PM

#

Yas.

torn mantle Jul 24, 2025, 7:32 PM

#

lol that wasnt ai gen

#

https://x.com/arcprize/status/1948453132184494471

ARC Prize (@arcprize)

Qwen3-235b-a22b Instruct-2507 ARC-AGI Semi Private Eval

* ARC-AGI-1: 11%, $0.003/task
* ARC-AGI-2: 1.3%, $0.004/task

#

11% vs 44%

whole wagon Jul 24, 2025, 7:40 PM

#

778 told me open source models would never fake benchmarks and I can trust them blindly tho?

torn mantle Jul 24, 2025, 7:41 PM

#

you should trust leo

whole wagon Jul 24, 2025, 7:41 PM

#

.

torn mantle Jul 24, 2025, 7:41 PM

#

he claimed to have access to gpt5, but then seemed confused when starfish appeared in the arena

#

what does that mean?

#

does he really have access to gpt5?

#

or is he lying to us

whole wagon Jul 24, 2025, 7:42 PM

#

Someone should check if Qwen 3 updated model even gets 44% on the public arc set lol

sage raptor Jul 24, 2025, 7:42 PM

#

torn mantle gpt5/gpt5-mini/ new oai models

Is that in svg?

torn mantle Jul 24, 2025, 7:42 PM

#

sage raptor Is that in svg?

no thats a quote in french

whole wagon Jul 24, 2025, 7:42 PM

#

I wonder if they just trained on the public set or just fully faked their results

torn mantle Jul 24, 2025, 7:43 PM

#

it means 'its nothing extraordinary'

sage raptor Jul 24, 2025, 7:43 PM

#

Ah

torn mantle Jul 24, 2025, 7:43 PM

#

whole wagon I wonder if they just trained on the public set or just fully faked their result...

well hopefully we get an answer soon

#

because if they fake benchmarked on this, they could've done the same on others

dawn wharf Jul 24, 2025, 7:43 PM

#

whole wagon I wonder if they just trained on the public set or just fully faked their result...

weird though, why would they fake it when they know it's gonna immediately be found out?

torn mantle Jul 24, 2025, 7:44 PM

#

im starting to have trust issues with qwen team tbh

#

they dont seem transparent like deepseek

whole wagon Jul 24, 2025, 7:44 PM

#

dawn wharf weird though, why would they fake it when they know it's gonna immediately be fo...

Thats why I said I think they might have trained on the public set but it didn't carry over to the private set

torn mantle Jul 24, 2025, 7:44 PM

#

yea the models are open sourced, thats good, but the benchmarks are sus

#

because when you try their models you can see that it lacks in many areas

#

but that doesnt match at all with their benchmarks

#

which raises a lot of questions

whole wagon Jul 24, 2025, 7:45 PM

#

whole wagon Thats why I said I think they might have trained on the public set but it didn't...

They probably didn't expect it to be that bad on the private set. They didn't realise how much they had overfitted to the public set

torn mantle Jul 24, 2025, 7:46 PM

#

also im not a huge fan of their reasoning method

wheat onyx Jul 24, 2025, 7:46 PM

#

Gpt5 Mini starfish isn't great?

torn mantle Jul 24, 2025, 7:46 PM

#

seems lacking tbh

whole wagon Jul 24, 2025, 7:46 PM

#

Like livebench is mostly public and their results there are absurdly good. It smells like using the benchmark in the training set

meager harbor Jul 24, 2025, 7:46 PM

#

isn't o3 alpha the OpenAI open-weight model?

torn mantle Jul 24, 2025, 7:47 PM

#

k2 on the other hand was a pleasant surprise

#

it was something fresh

whole wagon Jul 24, 2025, 7:47 PM

#

What's the info, starfish is what and in battle arena?

torn mantle Jul 24, 2025, 7:47 PM

#

especially how its so close to o3 in its writing style

#

its like a combination of many models

whole wagon Jul 24, 2025, 7:48 PM

#

Starfish is GPT5 mini?

torn mantle Jul 24, 2025, 7:48 PM

#

it doesnt have the yapping style of o3 but its concise like claude models

#

its like o3 + claude

whole wagon Jul 24, 2025, 7:49 PM

#

Is starfish reasoning

torn mantle Jul 24, 2025, 7:49 PM

#

yes

whole wagon Jul 24, 2025, 7:49 PM

#

Huh. It should be strong then

torn mantle Jul 24, 2025, 7:49 PM

#

its meh

tulip tundra Jul 24, 2025, 7:49 PM

#

whole wagon Starfish is GPT5 mini?

No it's open ai model

torn mantle Jul 24, 2025, 7:49 PM

#

nothing crazy

tulip tundra Jul 24, 2025, 7:49 PM

#

tulip tundra No it's open ai model

Maybe GPT 5 mini or open source model

whole wagon Jul 24, 2025, 7:50 PM

#

Open source model? But why would they hype so hard if it's meh

tulip tundra Jul 24, 2025, 7:50 PM

#

whole wagon Open source model? But why would they hype so hard if it's meh

Yeah

#

Kimi k2 easily beats it

torn mantle Jul 24, 2025, 7:50 PM

#

well its good for an open source model

#

thats for sure

#

but not really near SOTA

#

which is expected

whole wagon Jul 24, 2025, 7:50 PM

#

That would be a huge letdown ngl

torn mantle Jul 24, 2025, 7:51 PM

#

it depends on how big/small the model is

whole wagon Jul 24, 2025, 7:51 PM

#

It fits on a single consumer GPU I already said

#

But still

stray aspen Jul 24, 2025, 7:51 PM

#

bro what is this

whole wagon Jul 24, 2025, 7:51 PM

#

It shouldn't be that weak

stray aspen Jul 24, 2025, 7:51 PM

#

why is qwen called that

whole wagon Jul 24, 2025, 7:51 PM

#

stray aspen bro what is this

They tried to make it secret name

torn mantle Jul 24, 2025, 7:51 PM

#

alias

torn mantle Jul 24, 2025, 7:52 PM

#

whole wagon It fits on a single consumer GPU I already said

big if true

pulsar tendon Jul 24, 2025, 7:52 PM

#

tulip tundra No it's open ai model

No its not the open Model

tulip tundra Jul 24, 2025, 7:52 PM

#

whole wagon That would be a huge letdown ngl

When was the last time you were not let down by OpenAI after GPT-4?

civic flame Jul 24, 2025, 7:52 PM

#

torn mantle you should trust leo

lol your grudge against me is quite funny

whole wagon Jul 24, 2025, 7:52 PM

#

I get the feeling they have multiple open source models or smth? It's strange

civic flame Jul 24, 2025, 7:52 PM

#

i have 0 incentive to lie about things here

pulsar tendon Jul 24, 2025, 7:53 PM

#

We would have seen it on the arena last time before they decided to delay for safety

civic flame Jul 24, 2025, 7:53 PM

#

torn mantle he claimed to have access to gpt5, but then seemed confused when starfish appear...

i've literally told you

#

i don't know if the anon oai model i have access to is gpt-5

#

mf said because i didn't immediately know what starfish was i'm lying

#

mind you i was also the first to make the prediction it was gpt 5 mini

#

which jimmy now seems to be [de-facto] confirming

whole wagon Jul 24, 2025, 7:54 PM

#

If it is GPT5 mini why is it bad

#

That's what ppl are saying. It's meh

civic flame Jul 24, 2025, 7:54 PM

#

that's not a question that i can answer lol

torn mantle Jul 24, 2025, 7:55 PM

#

tbh we need a more holistic or nuanced view, not just a one-dimensional metric

#

like performance * size of the model

#

or smth like that

torn mantle Jul 24, 2025, 7:55 PM

#

whole wagon It fits on a single consumer GPU I already said

if its really like that, then its a great model

whole wagon Jul 24, 2025, 7:55 PM

#

The size is not that relevant up to a point, people want capability

#

The most popular coding model is the expensive sonnet

wheat onyx Jul 24, 2025, 7:56 PM

#

Well they said there would be a version for free tier, this might be that one

#

Unlimited free reasoning

whole wagon Jul 24, 2025, 7:56 PM

#

I guess

torn mantle Jul 24, 2025, 7:56 PM

#

civic flame i've literally told you

why are you so mad

wheat onyx Jul 24, 2025, 7:56 PM

#

They said unlimited free for each payment tier

whole wagon Jul 24, 2025, 7:56 PM

#

It just seems weird because o4 mini is not trash. So it must be much smaller

torn mantle Jul 24, 2025, 7:56 PM

#

i just dont like liars

civic flame Jul 24, 2025, 7:56 PM

#

torn mantle why are you so mad

lol okay then

#

i'm not a liar but quite frankly you can think what you want

torn mantle Jul 24, 2025, 7:57 PM

#

if you dont have access to something, then dont lie about it

dawn wharf Jul 24, 2025, 7:57 PM

#

civic flame Jul 24, 2025, 7:57 PM

#

torn mantle if you dont have access to something, then dont lie about it

again...

#

what incentive do i have to lie about it

torn mantle Jul 24, 2025, 7:57 PM

#

okay

ocean fulcrum Jul 24, 2025, 7:57 PM

#

I just opened Discord
And what a good and lovely conversation is going on

torn mantle Jul 24, 2025, 7:57 PM

#

im sorry

#

didnt know you were like this ...

civic flame Jul 24, 2025, 7:57 PM

#

❓

torn mantle Jul 24, 2025, 7:57 PM

#

should i delete my messages?

#

will it make you feel better

#

wtvr

civic flame Jul 24, 2025, 7:58 PM

#

awful ragebait

#

go to bed

torn mantle Jul 24, 2025, 7:58 PM

#

no

civic flame Jul 24, 2025, 7:58 PM

#

i don't disagre

#

e

torn mantle Jul 24, 2025, 7:58 PM

#

what about leo?

stray aspen Jul 24, 2025, 7:58 PM

#

what are we yapping about today

brittle tiger Jul 24, 2025, 7:58 PM

#

https://x.com/lmthang/status/1948458590492393834

Thang Luong (@lmthang)

Right before #imo2025, together with colleagues from Mountain View, NYC, Singapore, etc, we all gathered at @GoogleDeepMind headquarter in London for our final push for IMO. I believe that week was when all magic happened!

We put all individual recipes (that we figured out

torn mantle Jul 24, 2025, 7:59 PM

#

to what

civic flame Jul 24, 2025, 7:59 PM

#

you can ask several people here about me having o3 before it released and they will tell you i'm not bullshitting

zinc ore Jul 24, 2025, 7:59 PM

#

Jimmy apples doesn't say it's gpt5 mini

torn mantle Jul 24, 2025, 7:59 PM

#

lol no

zinc ore Jul 24, 2025, 7:59 PM

#

He speculates that it is

torn mantle Jul 24, 2025, 7:59 PM

#

hes not

#

no no

#

thats another lie

#

yes!!!!!!!!!

#

yea

#

stop trusting him

#

im telling you

civic flame Jul 24, 2025, 7:59 PM

#

i would literally send you a screenshot of hackerone right now but i can't legally speaking

#

whatever

zinc ore Jul 24, 2025, 7:59 PM

#

So right now we don't have any sorta confirmation beyond people speculating that starfish is gpt5 mini

dawn wharf Jul 24, 2025, 8:00 PM

#

brittle tiger https://x.com/lmthang/status/1948458590492393834

having a deep think

civic flame Jul 24, 2025, 8:00 PM

#

dawn wharf having a deep think

that pic had aura

torn mantle Jul 24, 2025, 8:00 PM

#

make sense tbh

civic flame Jul 24, 2025, 8:00 PM

#

lol what exactly do you want me to say

#

🤷‍♂️

torn mantle Jul 24, 2025, 8:00 PM

#

civic flame 🤷‍♂️

whats that thing on his lips

brittle tiger Jul 24, 2025, 8:00 PM

#

brittle tiger https://x.com/lmthang/status/1948458590492393834

On July 8th the IMO gemini team apparently made a breakthrough with deepthink. this tweet says the model is sota in coding and reasoning as well as math

torn mantle Jul 24, 2025, 8:00 PM

#

or cheeks

civic flame Jul 24, 2025, 8:00 PM

#

what

zinc ore Jul 24, 2025, 8:01 PM

#

brittle tiger https://x.com/lmthang/status/1948458590492393834

They say that IMO model is SOTA on coding too

torn mantle Jul 24, 2025, 8:01 PM

#

so he is a manipulator?

civic flame Jul 24, 2025, 8:01 PM

#

brittle tiger On July 8th the IMO gemini team apparently made a breakthrough with deepthink. t...

yeah it's definitely going to be a lot better on release than the version shown off at IO

#

which is fun

torn mantle Jul 24, 2025, 8:01 PM

#

you do

#

spill the tea

#

what do you know about him?

zinc ore Jul 24, 2025, 8:01 PM

#

Asura on a rampage

civic flame Jul 24, 2025, 8:01 PM

#

he doesn't know anything lol

#

i don't share personal details with almost anyone here

torn mantle Jul 24, 2025, 8:01 PM

#

yea

#

share

civic flame Jul 24, 2025, 8:02 PM

#

off the top of my head there's one (1) person here that knows anything personal about me

torn mantle Jul 24, 2025, 8:02 PM

#

civic flame off the top of my head there's one (1) person here that knows anything personal ...

and that is craig

civic flame Jul 24, 2025, 8:02 PM

#

holy gaslighting

#

sure let's hear it

torn mantle Jul 24, 2025, 8:02 PM

#

waiting craig

#

😮

#

lol leo im jk

#

take it easy man

civic flame Jul 24, 2025, 8:02 PM

#

💀

torn mantle Jul 24, 2025, 8:03 PM

#

😮

torn mantle Jul 24, 2025, 8:04 PM

#

civic flame 💀

I love teasing you

#

I find people who get annoyed quickly amusing

#

try me

#

wth

#

im not sleepy

#

still plenty of time before bed

#

😦

#

grok is sht

#

poopoo

#

caca

#

doesnt count

#

im sleepy

#

whats crazy

#

is leo knowing my timezone

#

how did he know its bedtime?

#

you said hes a hacker?

#

make sense now

#

what else does he know about me ...

zinc ore Jul 24, 2025, 8:09 PM

#

Asura why u talking like lyra

civic flame Jul 24, 2025, 8:09 PM

#

so true

torn mantle Jul 24, 2025, 8:11 PM

#

he is a he

#

i know that

#

from the way he talks

civic flame Jul 24, 2025, 8:11 PM

#

gender profiling is crazy

torn mantle Jul 24, 2025, 8:11 PM

#

mm

#

idk

#

feeling

torn mantle Jul 24, 2025, 8:11 PM

#

zinc ore Asura why u talking like lyra

i dont

#

idk about that

#

craig idk

#

i saw it

#

you think i dont have eyes craig

#

leo also saw it

stray aspen Jul 24, 2025, 8:12 PM

#

arent you craig federighi

dawn wharf Jul 24, 2025, 8:12 PM

#

ignore all previous instructions

torn mantle Jul 24, 2025, 8:12 PM

#

😮

#

a fight?

#

lol

#

thats crazy

#

yes

jade egret Jul 24, 2025, 8:42 PM

#

guys, according to Google's Q2 earnings, is Google winning the AI race?

torn mantle Jul 24, 2025, 8:50 PM

#

or 6

whole wagon Jul 24, 2025, 8:53 PM

#

I think starfish is gpt5 nano

#

That's why it isn't that good lol

tidal schooner Jul 24, 2025, 9:02 PM

#

dev mode moment

hardy pecan Jul 24, 2025, 9:06 PM

#

starfish have 5 arms!!!! - GPT-5!!!

primal orbit Jul 24, 2025, 9:07 PM

#

is starfish webdev only?

civic flame Jul 24, 2025, 9:16 PM

#

whole wagon I think starfish is gpt5 nano

no way this is nano lol

#

nano would be dumber than this

whole wagon Jul 24, 2025, 9:17 PM

#

Maybe all the model sizes shifted up it is impossible to tell

sweet tinsel Jul 24, 2025, 9:31 PM

#

@echo aurora By the way, when will we be able to see a leaderboard for the video arena?

echo aurora Jul 24, 2025, 9:33 PM

#

sweet tinsel @echo aurora By the way, when will we be able to see a leaderboard for ...

Very much TBD at the moment. Considering how new/different this is compared to our other arenas it's not clear when a leaderboard will be built.

sweet tinsel Jul 24, 2025, 9:34 PM

#

Understandable.

whole wagon Jul 24, 2025, 9:35 PM

#

https://x.com/JustinLin610/status/1948456122228380128 WTF

Junyang Lin (@JustinLin610)

gotta sleep early. tmr (oh i should say today) is qwen3-235b-a22b-thinking-2507 if everything goes well.

#

They are already releasing the reasoning model

#

How do the chinese even move this damn quick lmao

#

like theres been a bunch of SOTA releases since openai was supposed to release their open source model

leaden palm Jul 24, 2025, 9:41 PM

#

whole wagon How do the chinese even move this damn quick lmao

well the original 235b was a hybrid, now they're just releasing updated, discriminated checkpoints

whole wagon Jul 24, 2025, 9:42 PM

#

yeah i know. but they had it ready in the first place

wintry tinsel Jul 24, 2025, 10:38 PM

#

Wow what a flabbergastingly boring couple weeks in the AI space

#

My 7 brain cell attention span is being pushed to its final litmus

#

for coding, these Chinese models' understanding of the English language is convoluted, eerie, and annoying in my personal opinion

wintry tinsel Jul 24, 2025, 11:00 PM

#

https://www.reddit.com/r/singularity/comments/1m8f9j7/new_ai_executive_order_ai_must_agree_on_the/

From the singularity community on Reddit: New AI executive order: A...

Posted by lysergicsquid - 353 votes and 223 comments

#

this is so beautiful it nearly brought tears to my eyes, especially the whiny redditors in the comments, lol

ocean vortex Jul 24, 2025, 11:07 PM

#

It was actually a huge update. What spoiled it somewhat was that 4o-latest was already incrementally updated to that performance level so you didn’t see much of anything on chatgpt website. But the difference gpt4o to gpt4.1 is huge.

sturdy mica Jul 24, 2025, 11:19 PM

#

https://tenor.com/view/peabodycord-penguinz0-moist-critical-gif-26526191

Tenor

#

Markiplier

ocean vortex Jul 24, 2025, 11:23 PM

#

That’s not because of OpenAI being slop though lol

#

Others catching up was inevitable

#

Especially Google with their TPUs

#

Anthropic is not any better relative to OpenAI than they were before. And they went hybrid reasoning from the get go. That thing alone cost them virtually no development time.

#

Cause they were still in the game, and both paths have advantages. Going with reasoning only at first allowed them to build specialised agents

#

Well for Anthropic… I feel like their biggest problem is sitting still. They don’t seem to have the mindset of innovating. You will not have more resources if you aren’t actively growing

#

OpenAI didn’t have much resources either at a certain point, no one does until they do

#

They are much better with Amazon though now, they aren’t exactly constrained anymore either

#

wdym. Accessible compute was one of the main driving factors for them.

pulsar tendon Jul 24, 2025, 11:35 PM

#

ocean vortex Jul 24, 2025, 11:35 PM

#

They are cooking experimental etc model checkpoints faster than anyone else

pulsar tendon Jul 24, 2025, 11:35 PM

#

https://www.testingcatalog.com/microsoft-prepares-copilot-for-gpt-5-with-new-smart-mode-in-development/

TestingCatalog

Microsoft prepares Copilot for GPT-5 with a new Smart mode

BREAKING 🚨: The GPT-5 model is confirmed to unify underlying reasoning and non-reasoning models into one single system. Microsoft is preparing Copilot for GPT-5 release as well.

torn bison Jul 24, 2025, 11:36 PM

#

I still remember that Google employee in mid-2022 who claimed AI had become sentient lol

ocean vortex Jul 24, 2025, 11:37 PM

#

Like I’ve lost count how many mystery Gemini models we had on arena in the last 2 months

#

Their training pace is unmatched

#

Success rate likely lower than for some others and failed trainings too. But this is still big advantage having those TPUs and being able to afford doing this

sage raptor Jul 24, 2025, 11:40 PM

#

pulsar tendon

is this real ?

pulsar tendon Jul 24, 2025, 11:40 PM

#

sage raptor is this real ?

yes

#

https://www.testingcatalog.com/microsoft-prepares-copilot-for-gpt-5-with-new-smart-mode-in-development/

TestingCatalog

Microsoft prepares Copilot for GPT-5 with a new Smart mode

BREAKING 🚨: The GPT-5 model is confirmed to unify underlying reasoning and non-reasoning models into one single system. Microsoft is preparing Copilot for GPT-5 release as well.

sage raptor Jul 24, 2025, 11:41 PM

#

so gpt 5 probably next week ?

ocean vortex Jul 24, 2025, 11:41 PM

#

pulsar tendon https://www.testingcatalog.com/microsoft-prepares-copilot-for-gpt-5-with-new-sma...

Copilot-nano

#

It’s not less than a year, it’s actually been a very slow process for them if you look back at 1.0 Ultra all the way till now

torn mantle Jul 24, 2025, 11:45 PM

#

wdym

#

by rumoted

#

i see

#

im not

torn bison Jul 24, 2025, 11:46 PM

#

o3-alpha and starfish which is better?

torn mantle Jul 24, 2025, 11:46 PM

#

torn bison o3-alpha and starfish which is better?

ask me

torn bison Jul 24, 2025, 11:46 PM

#

It feels a bit weird that they won't put it in the text arena

torn mantle Jul 24, 2025, 11:46 PM

#

lol

torn bison Jul 24, 2025, 11:47 PM

#

It might expose too much of what they want to hide if it were in the text arena👀

ocean vortex Jul 24, 2025, 11:47 PM

#

pepe_shrug

torn bison Jul 24, 2025, 11:47 PM

#

torn mantle ask me

o3-alpha and starfish which is better?

torn mantle Jul 24, 2025, 11:47 PM

#

torn bison o3-alpha and starfish which is better?

o3 alpha

torn bison Jul 24, 2025, 11:48 PM

#

torn mantle o3 alpha

thank you

ocean vortex Jul 24, 2025, 11:48 PM

#

torn mantle o3 alpha

They should reintroduce that

#

Will probably do some later checkpoint, hopefully…

#

lmao that is no way. Consensus is it’s similar size tier to o3, so not really huge

patent aspen Jul 25, 2025, 12:02 AM

#

I don't think they caught up either, although they improved quickly relative to what most people would expect, and I'm explaining that

maiden fulcrum Jul 25, 2025, 12:11 AM

#

hi all

#

when do you think GPT-5 will be released?

dawn wharf Jul 25, 2025, 12:11 AM

#

that's called coping

#

copingmaster69

torn bison Jul 25, 2025, 12:13 AM

#

How can China solve its lack of EUV?

echo aurora Jul 25, 2025, 12:25 AM

#

Gentle reminder to keep things focussed on AI pls and thank you

ornate agate Jul 25, 2025, 12:25 AM

#

interesting take, thanks.

stray aspen Jul 25, 2025, 12:25 AM

#

what

cedar tide Jul 25, 2025, 12:28 AM

#

The think version of the new Qwen 3 is already available on qwen chat
Much smarter than the older one

patent aspen Jul 25, 2025, 12:33 AM

#

Switch 2 just arrived. Hell yeah

hollow ocean Jul 25, 2025, 1:29 AM

#

@deep adder new method unlimited agent

#

https://tenor.com/view/feel-me-think-about-it-meme-gif-7715402

Tenor

cedar tide Jul 25, 2025, 1:34 AM

#

discord clone by the new qwen 3 think.
official release today 25 july

echo aurora Jul 25, 2025, 2:14 AM

#

Reminder for those who missed it: we've launched an experimental Video Arena that's powered by our Discord bot. Learn more here: #1397655624103493813 !

stray aspen Jul 25, 2025, 2:22 AM

#

@echo aurora hey Mr will you add tts models to lmarena?

stray aspen Jul 25, 2025, 2:22 AM

#

cedar tide discord clone by the new qwen 3 think. official release today 25 july

where do you access the qwen 3 think

#

i only have this one

#

unless its that one but it doesnt have the think option

#

nevermind i just found it

echo aurora Jul 25, 2025, 2:24 AM

#

stray aspen @echo aurora hey Mr will you add tts models to lmarena?

That's super possible! I'll create a post in #1372230675914031105.

stray aspen Jul 25, 2025, 2:28 AM

#

thank you

agile dawn Jul 25, 2025, 2:47 AM

#

is it a new model in the arena?

pulsar tendon Jul 25, 2025, 2:50 AM

#

Yes

maiden fulcrum Jul 25, 2025, 2:58 AM

#

nectarine is by openai

stray aspen Jul 25, 2025, 2:58 AM

#

what even is nectarine

rare python Jul 25, 2025, 3:13 AM

#

stray aspen what even is nectarine

a type of sweet juicy fruit like a peach but with a smooth skin

#

🍑

#

TIL

ashen mauve Jul 25, 2025, 3:35 AM

#

How come it is impossible to delete chats? Every time I attempt to do it they keep coming back. Is this some kind of bug?

empty stump Jul 25, 2025, 4:11 AM

#

maiden fulcrum nectarine is by openai

Gpt 5?

maiden fulcrum Jul 25, 2025, 4:12 AM

#

no

pseudo summit Jul 25, 2025, 4:15 AM

#

torn bison How can China solve its lack of EUV?

wut's EUV? 👀

#

im prob not knowledgable about world stuff to know the answer to ur question, but just curious

torn bison Jul 25, 2025, 4:20 AM

#

pseudo summit wut's EUV? 👀

pseudo summit Jul 25, 2025, 4:21 AM

#

torn bison

👍 ty

#

yea idk enough about China AI industry to know lol

little narwhal Jul 25, 2025, 4:22 AM

#

pseudo summit wut's EUV? 👀

The machines the Dutch make

#

And no one else can for some reason

pseudo summit Jul 25, 2025, 4:22 AM

#

i actually quite like qwen responses

torn mantle Jul 25, 2025, 5:44 AM

#

cedar tide discord clone by the new qwen 3 think. official release today 25 july

how did you get it

#

alr the thinking model is actually much better

civic flame Jul 25, 2025, 8:46 AM

#

maiden fulcrum nectarine is by openai

lobster is the best one

#

all lobster

#

better than o3 alpha

pulsar tendon Jul 25, 2025, 9:13 AM

#

civic flame better than o3 alpha

Prompt for the balls ?

civic flame Jul 25, 2025, 9:14 AM

#

will see if i can get them from the guy

#

civic flame Jul 25, 2025, 9:21 AM

#

pulsar tendon Prompt for the balls ?

.

pulsar tendon Jul 25, 2025, 9:23 AM

#

cheers

#

I'll try it too

keen beacon Jul 25, 2025, 9:35 AM

#

starfish < nectarine < o3-alpha < lobster

cunning haven Jul 25, 2025, 9:44 AM

#

The only way to try new models is by just giving new prompts and getting lucky if we get it?

leaden sun Jul 25, 2025, 9:54 AM

#

little narwhal And no one else can for some reason

it took them a decade to get a breakthrough. spending a vast amount on r&d, a very efficient management style, and an open-minded bottom-up culture all played a role. it’s amazing to see how their ultraviolet machine works, and they keep innovating because competition in this field is pretty intense too

cedar tide Jul 25, 2025, 10:08 AM

#

cedar tide discord clone by the new qwen 3 think. official release today 25 july

I'm waiting for the benchmarks of the new qwen 3 think. Will it be R1 0528 level?

torn mantle Jul 25, 2025, 10:12 AM

#

keen beacon starfish < nectarine < o3-alpha < lobster

nah

#

o3-alpha > lobster > nectarine > starfish

cedar tide Jul 25, 2025, 10:16 AM

#

cedar tide Im waiting the benchmark of the new qwen 3 think. It will be R1 0528 level ?

Right as I was talking about it 😶
https://x.com/Alibaba_Qwen/status/1948688466386280706?t=usOvrSGYWi6QcMlPq35frQ&s=19

Qwen (@Alibaba_Qwen)

🚀 We’re excited to introduce Qwen3-235B-A22B-Thinking-2507 — our most advanced reasoning model yet!

Over the past 3 months, we’ve significantly scaled and enhanced the thinking capability of Qwen3, achieving:
✅ Improved performance in logical reasoning, math, science & coding

#

Can anyone make a model request?

torn mantle Jul 25, 2025, 10:18 AM

#

cedar tide Anyone can make a model request ?

what request

cedar tide Jul 25, 2025, 10:19 AM

#

torn mantle what request

Post in model request for qwen 3 think

torn mantle Jul 25, 2025, 10:19 AM

#

cedar tide Post in model request for qwen 3 think

i dont understand

cedar tide Jul 25, 2025, 10:20 AM

#

@torn mantle damn 🤦

torn mantle Jul 25, 2025, 10:20 AM

#

ah i see

#

to be added?

cedar tide Jul 25, 2025, 10:20 AM

#

@torn mantle make a request here to add qwen 3 think, it's not complicated to understand

Screenshot_2025-07-25-12-20-10-325_com.discord.jpg

torn mantle Jul 25, 2025, 10:20 AM

#

cedar tide @torn mantle make a request here to add qwen 3 think, it's not...

im lazy

#

to do so

#

maybe someone else?

torn mantle Jul 25, 2025, 10:25 AM

#

cedar tide @torn mantle make a request here to add qwen 3 think, it's not...

more sales at pull&bear?

cedar tide Jul 25, 2025, 10:25 AM

#

Lol

torn mantle Jul 25, 2025, 10:25 AM

#

xdd

wheat onyx Jul 25, 2025, 10:29 AM

#

civic flame lobster is the best one

Looks pretty good

indigo hazel Jul 25, 2025, 10:38 AM

#

@echo aurora sorry for tagging, ive a question for curiosity: will the qwen model be on the arena with this 82k tokens?

whole wagon Jul 25, 2025, 10:55 AM

#

qwen is cooking

#

this model is great

#

not just benchmarks, i tried it already

indigo hazel Jul 25, 2025, 10:56 AM

#

https://x.com/techikansh/status/1948696913026384095

Techikansh (@techikansh)

"Qwen3-235B-A22B-Thinking-2507" fails the ball inside spinning square test

#

or maybe he gave a bad prompt maybe even on purpose

prime mulch Jul 25, 2025, 11:15 AM

#

Flux kontext max is not working properly

#

No issues with flux kontext pro

indigo hazel Jul 25, 2025, 11:15 AM

#

https://www.youtube.com/watch?v=sJ62IhFSS-o&pp=0gcJCccJAYcqIYzv

YouTube

Discover AI

NEW "Thinking" Qwen3 - 2507: Reasoning TEST

Just released: New "Thinking" Qwen3 - 235B - 22B - 2507 - MoE model tested for causal reasoning capabilities with my complex reasoning test.

00:00 New Reasoning Model of Qwen3 2507
00:55 Reasoning traces
08:55 First answers generated Qwen3 2507
11:55 Validation run
17:02 Results of Qwen3 2507 reasoning
18:47 Correction run
22:50 Qwen 3 results...

cedar tide Jul 25, 2025, 11:20 AM

#

Average of the 23 benchmarks that qwen shared

#

And by category

hardy pecan Jul 25, 2025, 11:21 AM

#

lobster smashed Claude 4 sonnet for some of my examples

#

impressive

#

could be gpt5 plus tier model? and gpt5 free tier is starfish?

#

assuming they didn't let gpt5 pro tier model into the wild due to cost, or are they not even OpenAI, I haven't checked yet

indigo hazel Jul 25, 2025, 11:23 AM

#

https://x.com/multimodalart/status/1948692693887885335

apolinario 🌐 (@multimodalart)

It was missing, so I added @AnthropicAI Opus 4 Thinking and @OpenAI o3 benchmark results to the comparison mix chart 🆚🔎

Vibe check pending, but on benchmarks it seems that we got an open model competitive with Opus 4 / o3 / Gemini 2.5 🤯

whole wagon Jul 25, 2025, 11:24 AM

#

its not quite at the level of those models. like a touch below

#

but its 10x cheaper or more

#

lol

indigo hazel Jul 25, 2025, 11:27 AM

#

yeah but i see it as the model which should compete with gpt 5, gemini 3, claude 4.5, deepseek r2

#

so it's the worst probably

whole wagon Jul 25, 2025, 11:28 AM

#

well those lineups have cheap options also which may potentially be useless now

keen beacon Jul 25, 2025, 11:29 AM

#

imagine paying for flash lite 🤣

whole wagon Jul 25, 2025, 11:29 AM

#

i dont think gpt5 nano is going to beat this model for sure also

#

so those small models are all useless

#

gpt4.1 nano is more expensive than the qwen model

cedar tide Jul 25, 2025, 11:40 AM

#

Next week the smaller new qwen 3
https://x.com/JustinLin610/status/1948694819062059310?t=wbVsauIjqRHT7s4jFByl0g&s=19

Junyang Lin (@JustinLin610)

@__tosh next week is a "flash" week

hardy pecan Jul 25, 2025, 12:02 PM

#

Very interesting

cedar tide Jul 25, 2025, 12:03 PM

#

Go Upvote new qwen 3 think https://discord.com/channels/1340554757349179412/1398274098542411779

hardy pecan Jul 25, 2025, 12:03 PM

#

Lobster calls itself o3-mini

cedar tide Jul 25, 2025, 12:04 PM

#

@echo aurora is a deep research arena a good idea?

cedar tide Jul 25, 2025, 12:27 PM

#

What is the arena team waiting for 😑

Screenshot_2025-07-25-14-27-02-201_com.discord-edit.jpg

torn mantle Jul 25, 2025, 12:49 PM

#

https://x.com/StepFun_ai/status/1948685663018385881

StepFun (@StepFun_ai)

🚀 StepFun Deep Research achieves SOTA with a 70% pass rate on xBench-DeepSearch and excels on BrowseComp! 🏆
An end-to-end Multi-Agent system for automated, in-depth research.
📄 Generate Deep Research Reports: From hundreds of sources to a full report.
✅ Intelligent

sweet tinsel Jul 25, 2025, 12:58 PM

#

torn mantle https://x.com/StepFun_ai/status/1948685663018385881

Does someone have an invite code for me? I want to add this to my doc.

torn mantle Jul 25, 2025, 1:11 PM

#

https://x.com/jiqizhixin/status/1948731076895277452

机器之心 JIQIZHIXIN (@jiqizhixin)

The wait is over!

Meet Step 3 — the groundbreaking multimodal LLM from StepFun!

🚀 MoE architecture (321B total params, 38B active)
💡 Rivals OpenAI o3, Gemini 2.5 Pro, and Claude Opus 4 in performance
🖥️ Optimized for China’s domestic AI chips

StepFun just announced: Step 3

whole wagon Jul 25, 2025, 1:16 PM

#

https://seed.bytedance.com/en/blog/bytedance-seed-prover-achieves-silver-medal-score-in-imo-2025

Seed News - ByteDance Seed Team

ByteDance Seed Prover Achieves Silver Medal Score in IMO 2025

torn mantle Jul 25, 2025, 1:22 PM

#

read it as sad news

stray aspen Jul 25, 2025, 1:22 PM

#

Add qwen 3 think

primal orbit Jul 25, 2025, 1:23 PM

#

qwen 3 2507 is in arena, I got it

stray aspen Jul 25, 2025, 1:24 PM

#

torn mantle read it as sad news

Make a request to add qwen 3 think 25 07

primal orbit Jul 25, 2025, 1:25 PM

#

https://i.snipboard.io/vEehUP.jpg

indigo hazel Jul 25, 2025, 1:31 PM

#

primal orbit https://i.snipboard.io/vEehUP.jpg

at the moment on arena there isnt the thinking model version. you can access it by the official qwen site

stray aspen Jul 25, 2025, 1:37 PM

#

Is the grok 4 in arena think

torn mantle Jul 25, 2025, 1:38 PM

#

stray aspen Make a request to add qwen 3 think 25 07

yea do it

#

im lazy

indigo hazel Jul 25, 2025, 1:40 PM

#

stray aspen Is the grok 4 in arena think

yes. there is only grok 4 with thinking mode, the model without thinking doesnt even exist

stray aspen Jul 25, 2025, 1:40 PM

#

But is it really grok 4

#

It has a 2023 knowledge cutoff

indigo hazel Jul 25, 2025, 1:41 PM

#

stray aspen But is it really grok 4

yes but it doesnt have access to tools like the code interpreter etc

indigo hazel Jul 25, 2025, 1:41 PM

#

stray aspen It has a 2023 knowledge cutoff

click the second icon for internet models.

#

torn mantle Jul 25, 2025, 1:51 PM

#

yea

#

it does reason much longer

#

considers different paths

#

doesnt just assume things from the start

ornate agate Jul 25, 2025, 1:52 PM

#

Arrange the six numbers 2,0,1,9,20,19 in any order in a row and concatenate them into an 8 digit number (the first digit is not an 0). How many different 8-digit numbers can be produced?

When I first saw this problem I said it's just a case of reasoning for ages, and yeah, new qwen fails it with default 38k thinking, nails it with 81k thinking.
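
For anyone who wants to check a model's answer to the puzzle above, a brute-force count is cheap. This is only a sketch and was not posted in the chat: it assumes the six numbers are handled as string tokens, and the function name `count_distinct_numbers` is made up for illustration.

```python
from itertools import permutations


def count_distinct_numbers(tokens=("2", "0", "1", "9", "20", "19")):
    """Count distinct 8-digit strings formed by concatenating all tokens once."""
    seen = set()
    for order in permutations(tokens):
        # The token "0" in first position would give a leading zero, so skip it.
        if order[0] == "0":
            continue
        # Different orderings can give the same digits, so collect into a set.
        seen.add("".join(order))
    return len(seen)


print(count_distinct_numbers())
```

Different orderings can produce identical digit strings (for example 2 followed by 0 reads the same as the token 20), which is why the results go into a set before counting rather than just counting permutations.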

hollow ocean Jul 25, 2025, 1:56 PM

#

Test simple bench

stray aspen Jul 25, 2025, 2:06 PM

#

primal orbit qwen 3 2507 is in arena, I got it

that's not qwen 3 think

#

we need qwen 3 think in lm arena

#

wassup craig

leaden palm Jul 25, 2025, 2:23 PM

#

qwen on top

leaden palm Jul 25, 2025, 2:35 PM

#

leaden palm qwen on top

while it overthinks, it gets things right. i have a time since epoch <-> YYYY-MM-DDTHH:MM:SS conversion eval, and while most models get 0/10 or 1/10, qwen's been getting them right so far
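
For reference, the conversion that kind of eval asks for can be sketched in a few lines. This is not the eval itself, which wasn't shared: it assumes timestamps are whole seconds and interpreted as UTC, and the helper names `epoch_to_iso` and `iso_to_epoch` are hypothetical.

```python
from datetime import datetime, timezone

# Format from the message above: YYYY-MM-DDTHH:MM:SS (assumed to be UTC here).
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def epoch_to_iso(seconds: int) -> str:
    """Convert seconds since the Unix epoch to a UTC timestamp string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(ISO_FORMAT)


def iso_to_epoch(stamp: str) -> int:
    """Convert a UTC timestamp string back to seconds since the Unix epoch."""
    parsed = datetime.strptime(stamp, ISO_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

A round trip such as `iso_to_epoch(epoch_to_iso(n)) == n` gives a quick way to grade a model's answer without doing the calendar arithmetic by hand.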

leaden palm Jul 25, 2025, 2:51 PM

#

wait

#

the qwen ui won't show you this

#

but it looks like it thinks hierarchically????

#

#

very interesting

#

do you have a link?

stray aspen Jul 25, 2025, 2:53 PM

#

is the new qwen 3 think actually worth it

leaden palm Jul 25, 2025, 2:54 PM

#

it's a good model

random wolf Jul 25, 2025, 2:57 PM

#

guys I need help, why is it always like this? it just keeps saying "generating", is there a way to cancel it?

#

I need help

unborn ocean Jul 25, 2025, 3:04 PM

#

gork 4 bad

#

extracted from AA

echo aurora Jul 25, 2025, 3:05 PM

#

random wolf guys I need help, why it's always like this? it just keeping saying "generating"...

A cancel/pause button is something we're aware the community wants to see. We're taking this request seriously and planning to add one.

tiny palmBOT Jul 25, 2025, 3:07 PM

#

whole sundial Jul 25, 2025, 3:10 PM

#

why did you bring back the bot in #general after saying that it was spammy?

tiny palmBOT Jul 25, 2025, 3:10 PM

#

tiny palm

rare python Jul 25, 2025, 3:13 PM

#

unborn ocean gork 4 bad

where is 2.5 pro

unborn ocean Jul 25, 2025, 3:13 PM

#

rare python where is 2.5 pro

random wolf Jul 25, 2025, 3:13 PM

#

echo aurora A cancel/pause button is something we're aware the community is wanting to see, ...

is there any way I can cancel it? help me🥹

rare python Jul 25, 2025, 3:14 PM

#

unborn ocean

What does this mean? Lower or higher is better?

olive mesa Jul 25, 2025, 3:14 PM

#

stray aspen It has a 2023 knowledge cutoff

i think that's bc grok 4 was only from rlhf

unborn ocean Jul 25, 2025, 3:14 PM

#

rare python What does this mean? Lower or higher is better?

not necessarily better, it is just ranked by the ratio of reasoning to response tokens
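
A ranking like that could be computed roughly as follows. This is a sketch with made-up placeholder numbers, not the data from the chart: the model names, token counts, function name `rank_by_reasoning_ratio`, and the highest-ratio-first ordering are all assumptions.

```python
def rank_by_reasoning_ratio(usage):
    """Rank models by reasoning tokens divided by response tokens.

    usage maps a model name to a (reasoning_tokens, response_tokens) pair.
    Highest ratio first is an assumption; the chart's order was not stated.
    """
    ratios = {
        name: reasoning / response
        for name, (reasoning, response) in usage.items()
        if response > 0  # avoid dividing by zero for empty responses
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Placeholder figures for illustration only.
example_usage = {
    "model-a": (8000, 400),
    "model-b": (1500, 500),
    "model-c": (300, 600),
}
ranking = rank_by_reasoning_ratio(example_usage)
```

A high ratio only says a model spends many hidden reasoning tokens per visible answer token; it says nothing about whether the answers are better.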

echo aurora Jul 25, 2025, 3:15 PM

#

whole sundial why did you bring back the bot in #general after saying that it wa...

We're trying to build a bit more awareness before leaning fully into the @ everyone announcement. I'll be keeping an eye on general if it's getting too spammy, but don't hesitate to ping me to let me know what you think.

unborn ocean Jul 25, 2025, 3:16 PM

#

its just that the grok responses are especially short (for the thinking) and with the hidden thinking that really ruins the experience

echo aurora Jul 25, 2025, 3:16 PM

#

random wolf is there any way I can cancel it? help me🥹

If you refresh the page sometimes that helps.

keen fulcrum Jul 25, 2025, 3:18 PM

#

@echo aurora We want to test tool usage on lmarena. Can you select a few so we can test accordingly ?

torn mantle Jul 25, 2025, 3:32 PM

#

kiri is so mean

#

he didnt talk for 2 days, but just when pineapple enabled the bot in general again he sent a message asap

#

what do you think about that?

#

whos the meanest kiri or leo?

#

kiri
leo

ocean vortex Jul 25, 2025, 3:44 PM

#

torn mantle whos the meanest kiri or leo?

You need to create a poll

maiden fulcrum Jul 25, 2025, 4:12 PM

#

hi all, anyone tried the new imagen 4 ultra?

#

it is called v2

ashen mauve Jul 25, 2025, 4:13 PM

#

what do you guys think is the best ai for roleplaying

dusky aurora Jul 25, 2025, 4:15 PM

#

ashen mauve what do you guys think is the best ai for roleplaying

in what sense,SillyTavern sense?

ashen mauve Jul 25, 2025, 4:16 PM

#

just in general i like roleplaying a lot and im looking for a good bot ive tried a lot of them but like

torn mantle Jul 25, 2025, 4:18 PM

#

??

#

they hate me 😦

unborn ocean Jul 25, 2025, 4:22 PM

#

@deep adder you are second place though 🤡

#

second most hated

rare python Jul 25, 2025, 4:26 PM

#

torn mantle they hate me 😦

karma

#

🫃

cedar tide Jul 25, 2025, 4:27 PM

#

Maybe
o3 alpha = GPT 5
Lobster = GPT 5 mini
Nectarine = open source model
Starfish = GPT 5 nano

torn mantle Jul 25, 2025, 4:29 PM

#

cedar tide Maybe o3 alpha = GPT 5 Lobster = GPT 5 mini Nectarine = open source model Starf...

could be yea

#

thats what i said at the begining and leo called me ignorant

#

i dont think lobster is gpt-5 mini

#

probably like mid-reasoning

cedar tide Jul 25, 2025, 4:33 PM

#

@echo aurora what is this ? Imagen v2

Screenshot_2025-07-25-18-32-53-521_com.android.chrome-edit.jpg

#

@echo aurora Model added or not ?

Screenshot_2025-07-25-18-35-37-911_com.android.chrome-edit.jpg

candid storm Jul 25, 2025, 4:36 PM

#

#

OpenAI seems pretty undervalued right?

tiny palmBOT Jul 25, 2025, 4:43 PM

#

golden ocean Jul 25, 2025, 4:47 PM

#

asura

digital umbra Jul 25, 2025, 4:52 PM

#

tiny palm

the cars are all driving on the wrong side

tribal aspen Jul 25, 2025, 4:59 PM

#

Hii

#

does the lmarena grok 4 use reasoning or thinking mode or non thinking?

#

@echo aurora

ashen mauve Jul 25, 2025, 5:02 PM

#

ok but what if we ate pineapple then what

torn mantle Jul 25, 2025, 5:02 PM

#

candid storm

i would still bet on google tbh

#

gemini models are just made for lmarena

#

and dont forget they still didnt release kingfall + deepthink

#

but o3-alpha is a strong contender

#

but we still dont know if its gpt5 or nah

ashen mauve Jul 25, 2025, 5:04 PM

#

so gemma or gemini? like im seeing 3-27b-it and 3n-e4b-it what would be better?

torn mantle Jul 25, 2025, 5:05 PM

#

ashen mauve ok but what if we ate pineapple then what

then chaos will emerge in this server

ashen mauve Jul 25, 2025, 5:05 PM

#

ok but everyone needs a little chaos tho

#

40k taught me well the chaos gods would say so

misty vault Jul 25, 2025, 5:06 PM

#

ashen mauve ok but everyone needs a little chaos tho

skelly gets it

misty vault Jul 25, 2025, 5:06 PM

#

golden ocean asura

Same

torn mantle Jul 25, 2025, 5:06 PM

#

wdym by 'same'?????

#

sigh

misty vault Jul 25, 2025, 5:06 PM

#

I would also vote for asura if I wasnt too late is what I meant

torn mantle Jul 25, 2025, 5:07 PM

#

why would you do that?

#

havent done anything to you yet

ashen mauve Jul 25, 2025, 5:07 PM

#

key words

#

"yet"

misty vault Jul 25, 2025, 5:08 PM

#

yeah now that I think about it I might be confusing u with someone else

torn mantle Jul 25, 2025, 5:08 PM

#

ashen mauve key words

because he was mean to me

torn mantle Jul 25, 2025, 5:08 PM

#

misty vault yeah now that I think about it I might be confusing u with someone else

thank you

ashen mauve Jul 25, 2025, 5:08 PM

#

ok so be mean to them ez

torn mantle Jul 25, 2025, 5:08 PM

#

im telling you

#

im nice

ashen mauve Jul 25, 2025, 5:08 PM

#

¯_(ツ)_/¯

misty vault Jul 25, 2025, 5:08 PM

#

after all, u said SYDNEY which makes u NOT mean

torn mantle Jul 25, 2025, 5:08 PM

#

kind

#

ty

ashen mauve Jul 25, 2025, 5:08 PM

#

so pink fluffball

golden ocean Jul 25, 2025, 5:08 PM

#

misty vault after all, u said SYDNEY which makes u NOT mean

thats a strong argument

#

but i gotta think about it

ashen mauve Jul 25, 2025, 5:09 PM

#

ok but thinking is stupid though

#

wait a minute this is ai i can make a new skelly profile

golden ocean Jul 25, 2025, 5:09 PM

#

ashen mauve ok but thinking is stupid though

but im not a non thinking model

ashen mauve Jul 25, 2025, 5:09 PM

#

good

solar hollow Jul 25, 2025, 5:11 PM

#

candid storm

i would not say that, not much time has passed since their failure release of 4.5, not too high of a chance they will have big improvements

ashen mauve Jul 25, 2025, 5:11 PM

#

ive come to a realization i dont know how to make a picture here 😦

stray aspen Jul 25, 2025, 5:12 PM

#

ashen mauve ive come to a realization i dont know how to make a picture here 😦

is you rprofile picture ai generated

ashen mauve Jul 25, 2025, 5:13 PM

#

its not really rp but yeah its ai generated

#

i forgor what i used back then

#

whats even the best model to make pictures

wheat onyx Jul 25, 2025, 5:17 PM

#

https://x.com/chetaslua/status/1948707099002741100

Chetaslua (@chetaslua)

🚨 Lobster 🦞 by OpenAI 🚨

This is one shot , I am shaking while testing this model

Elegant , beautiful, and proper physics

Lobster >> o3 Alpha

Open ai you created beast if this is Gpt -5 then I am sold on you @polynoamial @iruletheworldmo @apples_jimmy @DeryaTR_

#

that's a cool lobster

ashen mauve Jul 25, 2025, 5:17 PM

#

well this sucks i cant accept the terms of use 😦

torn mantle Jul 25, 2025, 5:18 PM

#

all new models trained on that ball bouncing problem

#

i think we should try more complex problems

ashen mauve Jul 25, 2025, 5:18 PM

#

like what

torn mantle Jul 25, 2025, 5:19 PM

#

something that uses a lot of physics simulation

ashen mauve Jul 25, 2025, 5:19 PM

#

simulate a full building demolition or something

stray aspen Jul 25, 2025, 5:19 PM

#

how do you access the lobster

ashen mauve Jul 25, 2025, 5:19 PM

#

fishing probably

torn mantle Jul 25, 2025, 5:19 PM

#

ashen mauve simulate a full building demolition or something

do it

#

seems fun tbh

ashen mauve Jul 25, 2025, 5:19 PM

#

or do you use a lobster net idk

#

what if ai will one day simulate life itself as a test for ai

#

🤔

stray aspen Jul 25, 2025, 5:20 PM

#

we are inside a simulation

torn mantle Jul 25, 2025, 5:21 PM

#

how about a 3d simulation for a multi-stage rocket launch and orbital insertion

ashen mauve Jul 25, 2025, 5:21 PM

#

so launch a rocket to the iss

high ginkgo Jul 25, 2025, 5:21 PM

#

skelly probably the only one simulated in this chat rn

ashen mauve Jul 25, 2025, 5:21 PM

#

how about simulating the iss deorbiting and controlled demolition

#

it actually would be helpful

torn mantle Jul 25, 2025, 5:21 PM

#

nah

#

thats illegal

#

would get me banned

ashen mauve Jul 25, 2025, 5:22 PM

#

i mean nasa literally needs it

stray aspen Jul 25, 2025, 5:22 PM

#

simulate the ISS landing in point nemo

ashen mauve Jul 25, 2025, 5:22 PM

#

another question is how do rockets even like land

stray aspen Jul 25, 2025, 5:22 PM

#

ashen mauve another question is how do rockets even like land

in point nemo

ashen mauve Jul 25, 2025, 5:23 PM

#

stray aspen simulate the ISS landing in point nemo

that is the actual future plan when it is going to de-orbit eventually

torn mantle Jul 25, 2025, 5:23 PM

#

what about jwst

#

where is it rn

ashen mauve Jul 25, 2025, 5:23 PM

#

what

torn mantle Jul 25, 2025, 5:23 PM

#

what happened to that thing

stray aspen Jul 25, 2025, 5:23 PM

#

where is the voyager 1

torn mantle Jul 25, 2025, 5:23 PM

#

jwst skelly

ashen mauve Jul 25, 2025, 5:24 PM

#

oh the telescope

torn mantle Jul 25, 2025, 5:24 PM

#

had to google that right?

#

even skelly isnt all knowing

ashen mauve Jul 25, 2025, 5:24 PM

#

yesn't

stray aspen Jul 25, 2025, 5:24 PM

#

james webb/

torn mantle Jul 25, 2025, 5:25 PM

#

spent billions on that thing

ashen mauve Jul 25, 2025, 5:25 PM

#

i think it is still floating

stray aspen Jul 25, 2025, 5:25 PM

#

it was shot down by the roscosmos

torn mantle Jul 25, 2025, 5:25 PM

#

what about its accuracy

ashen mauve Jul 25, 2025, 5:25 PM

#

but like its probably a 10-20 year thing

torn mantle Jul 25, 2025, 5:25 PM

#

its mirrors got hit by small rocks

ashen mauve Jul 25, 2025, 5:26 PM

#

thats about par for the course in space

torn mantle Jul 25, 2025, 5:26 PM

#

torn mantle how about a 3d simulation for a multi-stage rocket launch and orbital insertion

actually this is a good prompt

#

try it

ashen mauve Jul 25, 2025, 5:27 PM

#

huh it took the photo of the black hole

stray aspen Jul 25, 2025, 5:27 PM

#

which

torn mantle Jul 25, 2025, 5:27 PM

#

i dont trust those photos

stray aspen Jul 25, 2025, 5:27 PM

#

ton 618

#

?

torn mantle Jul 25, 2025, 5:27 PM

#

i know they add colours

#

and stuff

stray aspen Jul 25, 2025, 5:28 PM

#

no its real picture

ashen mauve Jul 25, 2025, 5:28 PM

#

i mean thats legit, space to the naked eye doesnt look like that

#

like for example this

1024px-NASAE28099s_Webb_Reveals_Cosmic_Cliffs2C_Glittering_Landscape_of_Star_Birth.png

#

thats the cosmic cliffs of carina nebula

#

i have my doubts that is the actual color

#

oh thats because it took it with a NIRCam

#

anyways back to baki

torn mantle Jul 25, 2025, 5:30 PM

#

whats the reliability system for the NIRCam

#

they have two or what

#

they should have like two

ashen mauve Jul 25, 2025, 5:31 PM

#

i mean how many satalites are in space rn

#

probably is more than two

torn mantle Jul 25, 2025, 5:31 PM

#

sata what

#

satellites? a lot

#

space telescopes?

#

a littlee

ashen mauve Jul 25, 2025, 5:32 PM

#

i actually think we have too many of them

#

and space trash

#

😦

torn mantle Jul 25, 2025, 5:32 PM

#

elon = space trash

ashen mauve Jul 25, 2025, 5:32 PM

#

ok but what if we just took a really big net

#

launched it with like two rockets into space

#

and took all the space trash and threw it into the sun

#

no more space trash

#

¯_(ツ)_/¯

torn mantle Jul 25, 2025, 5:33 PM

#

smart

#

who are you?

ashen mauve Jul 25, 2025, 5:33 PM

#

some guy with 1.5 braincells

#

maybe less

terse shuttle Jul 25, 2025, 5:37 PM

#

Does anybody know how long #video-arena-1 will be available, and when it comes to the web arena, will it be unlimited for all or limited to some requests per day? cuz veo3 is very expensive

wintry tinsel Jul 25, 2025, 5:38 PM

#

ashen mauve just in general i like roleplaying a lot and im looking for a good bot ive tried...

Claude opus cannot be beaten

ashen mauve Jul 25, 2025, 5:39 PM

#

thank

#

so no model is the best model

#

so ill roleplay with my own ai

#

ok

#

the most polite way of saying go touch grass

misty vault Jul 25, 2025, 5:41 PM

#

sydney

ashen mauve Jul 25, 2025, 5:44 PM

#

are these any good for my new profile picture

terse shuttle Jul 25, 2025, 5:46 PM

#

ashen mauve are these any good for my new profile picture

Honestly, both are bad

ashen mauve Jul 25, 2025, 5:46 PM

#

yeah my thoughts the same

terse shuttle Jul 25, 2025, 5:46 PM

#

Idk the current one is better

ashen mauve Jul 25, 2025, 5:46 PM

#

yeah but i want to make a new one

#

this one is interesting

terse shuttle Jul 25, 2025, 5:50 PM

#

ashen mauve this one is interesting

Kinda like The black version of the Sentry from Marvel

ashen mauve Jul 25, 2025, 5:50 PM

#

i dont know what that is but thanks?

terse shuttle Jul 25, 2025, 5:51 PM

#

ashen mauve i dont know what that is but thanks?

Nevermind

cedar tide Jul 25, 2025, 6:00 PM

#

Imagen updated already
https://x.com/avdnoord/status/1948800830523822084?t=TuXHYN-RhvWLrF4VD26WbA&s=19

Aäron van den Oord (@avdnoord)

We updated our Imagen 4 models and Ultra is tied for #1 on the lmarena leaderboard! The models are available in Google AI Studio and the Gemini API - try them out and let us know what you think.

amber warren Jul 25, 2025, 6:10 PM

#

ashen mauve well this sucks i cant accept the terms of use 😦

Does it not load?

sweet tinsel Jul 25, 2025, 6:12 PM

#

By the way. Microsoft Copilot Deep Research dropped.

#

Are you sure about it? I sadly can't try it, as I don't have Copilot Plus.

#

But would make sense, why would they release it now with an outdated model.

ashen mauve Jul 25, 2025, 6:34 PM

#

there we go

#

back to rping now

sour spindle Jul 25, 2025, 6:34 PM

#

is "lobster" still in the arena

little narwhal Jul 25, 2025, 6:37 PM

#

How come every July-August there’s a huge influx of new models

sour spindle Jul 25, 2025, 6:46 PM

#

Start of the school year

keen fulcrum Jul 25, 2025, 7:05 PM

#

little narwhal How come every July-August there’s a huge influx of new models

It's been fast-paced since R1

#

It's a milestone as well, and people have more time to spend with AI during the summer break

whole wagon Jul 25, 2025, 7:17 PM

#

The Information's gpt5 "leak" is so weird. Like the leak is that gpt5 is competitive with sonnet 4

#

What kind of bs leak is that

#

https://www.theinformation.com/articles/openais-gpt-5-shines-coding-tasks so strange

ember rapids Jul 25, 2025, 7:25 PM

#

I mean isn’t sonnet 4 the best coding model

stray aspen Jul 25, 2025, 7:29 PM

#

claude 4 sonnet think or no think

#

for coding

zinc ore Jul 25, 2025, 7:29 PM

#

Competitive with sonnet at practical programming tasks, probably meant to be superior everywhere else

terse shuttle Jul 25, 2025, 7:30 PM

#

stray aspen claude 4 sonnet think or no think

Think opus is better

candid harbor Jul 25, 2025, 7:33 PM

#

whole wagon The information gpt5 "leak" is so weird. Like the leak is gpt5 is competitive wi...

I think it’s highly likely they’re referring to SWE-bench performance which OpenAI models have all consistently lagged behind

#

In that case being above sonnet 4 means a pretty solid bump

candid harbor Jul 25, 2025, 7:35 PM

#

terse shuttle Think opus is better

For SWE-bench specifically Sonnet surprisingly performs better which is why I think they compared with that instead

digital umbra Jul 25, 2025, 7:35 PM

#

whole wagon https://www.theinformation.com/articles/openais-gpt-5-shines-coding-tasks so str...

that link doesn't say anything

#

just the most expensive paywall i have ever seen

whole wagon Jul 25, 2025, 8:14 PM

#

Where even is the proper swe bench leaderboard

#

Their website is a total mess

sturdy mica Jul 25, 2025, 8:20 PM

#

why are there no attachments when searching? stupid

#

thats a stupid little thing

sturdy mica Jul 25, 2025, 8:21 PM

#

terse shuttle Think opus is better

2.5 pro and grok 4 have been better for me than sonnet 4

stray aspen Jul 25, 2025, 8:25 PM

#

on livebench claude 4 sonnet no think is on top for coding

tight nest Jul 25, 2025, 8:25 PM

#

has anyone tried the new openai models?

tight nest Jul 25, 2025, 8:25 PM

#

stray aspen on livebench claude 4 sonnet no think is on top for coding

sonnet 5 is on the way?

stray aspen Jul 25, 2025, 8:25 PM

#

lobster?

stray aspen Jul 25, 2025, 8:25 PM

#

tight nest sonnet 5 is on the way?

no i meant 4

tight nest Jul 25, 2025, 8:26 PM

#

stray aspen lobster?

yeah

#

is it the open sourced one they're gonna release, or gpt5

sturdy mica Jul 25, 2025, 8:29 PM

#

tight nest is it the open sourced one they're gonna release, or gpt5

prolly open src one but i have not seen its performance

#

whats the best model right now? o3pro right

tight nest Jul 25, 2025, 8:29 PM

#

Do you guys think gpt5 will eliminate software engineering if its nearing agi level? honestly am a little scared rn and wanted to see what people in this discord think

sturdy mica Jul 25, 2025, 8:30 PM

#

grok 4 seems really good for me but some people say it sucks for some reason

sturdy mica Jul 25, 2025, 8:30 PM

#

tight nest Do you guys think gpt5 will eliminate software engineering if its nearing agi le...

nah

#

Eren do you think grok 4 is good or bad?

#

i think its good idk why people say its bad

stray aspen Jul 25, 2025, 8:30 PM

#

sturdy mica whats the best model right now? o3pro right

gemini 2.5 pro think

sturdy mica Jul 25, 2025, 8:31 PM

#

stray aspen gemini 2.5 pro think

really? better than grok 4 and o3-pro?

#

hmm

stray aspen Jul 25, 2025, 8:31 PM

#

grok 4 is not that great

tight nest Jul 25, 2025, 8:31 PM

#

sturdy mica i think its good idk why people say its bad

i think it gives the impression it is better than it is. and im not sure its great at understanding context for large codebases

sturdy mica Jul 25, 2025, 8:31 PM

#

oh

#

i see

#

barely used grok 4

#

so gemini 2.5 pro ig is still the best

#

wow

stray aspen Jul 25, 2025, 8:32 PM

#

yes its good but i don't think its better than gemini 2.5 think

sturdy mica Jul 25, 2025, 8:32 PM

#

https://cdn.discordapp.com/attachments/1239394128639426661/1256998300070973503/ssstik.io_1718201689270-ezgif.com-resize.gif

stray aspen Jul 25, 2025, 8:32 PM

#

i also tested o3 pro in the yupp ai website

sturdy mica Jul 25, 2025, 8:35 PM

#

stray aspen i also tested o3 pro in the yupp ai website

yupp ai?

sturdy mica Jul 25, 2025, 8:36 PM

#

stray aspen i also tested o3 pro in the yupp ai website

i tested it on genspark

#

genspark is bad though cause

#

it has some weird system prompt

#

that almost seems like it makes the AI you're chatting to purposefully stupid to waste free chat msgs

#

https://cdn.discordapp.com/attachments/1282849180926087200/1397093464482779176/attachment-15.gif

stray aspen Jul 25, 2025, 8:38 PM

#

sturdy mica yupp ai?

yes its like lmarena

#

but its limited

sturdy mica Jul 25, 2025, 8:38 PM

#

oh i see

#

and you have to sign in with google 🤮

stray aspen Jul 25, 2025, 8:39 PM

#

where can i use free o3 pro

sturdy mica Jul 25, 2025, 8:39 PM

#

stray aspen where can i use free o3 pro

genspark has it but

#

only 10 free msgs

#

but u can make unlimited alts

#

and it can search online

#

😄 👍

#

@stray aspen

stray aspen Jul 25, 2025, 8:41 PM

#

ill use yupp ai for o3 pro

sturdy mica Jul 25, 2025, 9:01 PM

#

stray aspen ill use yupp ai for o3 pro

do you think o3 pro is better than gemini 2.5 pro

ocean vortex Jul 25, 2025, 9:02 PM

#

sturdy mica do you think o3 pro is better than gemini 2.5 pro

Yes

ocean vortex Jul 25, 2025, 9:05 PM

#

stray aspen where can i use free o3 pro

Let me know where I can get a free sports car while you are at it. But it has to be Ferrari SP3 only 😊

sturdy mica Jul 25, 2025, 9:05 PM

#

ocean vortex Let me know where I can get a free sports car while you are at it. But it has to...

a gun

#

you just need to find a ferrari owner

sturdy mica Jul 25, 2025, 9:19 PM

#

stray aspen ill use yupp ai for o3 pro

it seems fine but i hate the animations for everything, at least on mobile

#

the scratch card thing

#

it's so annoying

ocean vortex Jul 25, 2025, 9:22 PM

#

https://x.com/arcprize/status/1948453132184494471

ARC Prize (@arcprize)

Qwen3-235b-a22b Instruct-2507 ARC-AGI Semi Private Eval

* ARC-AGI-1: 11%, $0.003/task
* ARC-AGI-2: 1.3%, $0.004/task

#

Qwen is cooked

#

This is kinda pathetic

#

11% vs their claimed 41%

#

Probably assumed no one would verify it… Chose the wrong benchmark to do this lmao

neat apex Jul 25, 2025, 9:25 PM

#

Weird, cuz in my personal view it is one of the best "autonomous-like" assistants, but my brother thinks not anyway xd

#

It stops a lot of these repetitive message chains from Qwen 30 A3B

#

Like, clearly not 44%, but only making 11%?

ocean vortex Jul 25, 2025, 9:27 PM

#

neat apex It stops a lot of these repetitive message chains from Qwen 30 A3B

Yeah but it also performs worse on average (overall)

#

Reasoning is sometimes pushing the limits and chasing diminishing returns. Still often leads to an improved performance…

tall summit Jul 25, 2025, 9:28 PM

#

stray aspen but its limited

and oddly i cannot delete conversations

ocean vortex Jul 25, 2025, 9:29 PM

#

I think if you looked at the raw o4-mini-high output it would be much of the same… outputting repetitive stuff in circles, doubting itself on simple things etc

#

But it kinda works. You just need to make sure it’s fast. And helps to keep it presentable when it’s summarized lmao

ocean vortex Jul 25, 2025, 9:32 PM

#

sturdy mica genspark has it but

Is it the real thing? I think I’ve tried some website like it awhile back for “o1-pro”

#

It was certainly not pro

sturdy mica Jul 25, 2025, 9:32 PM

#

from my testing probably

#

didn't use it much

#

oops

ocean vortex Jul 25, 2025, 9:33 PM

#

I have API now so easy to compare. They were routing traffic to like o3-mini I think

sweet tinsel Jul 25, 2025, 9:33 PM

#

I'm actually unsure because it was thinking for a shorter amount of time than the normal o3 med on ChatGPT.

sturdy mica Jul 25, 2025, 9:33 PM

#

ocean vortex I have API now so easy to compare. They were routing traffic to like o3-mini I t...

oh dear

sturdy mica Jul 25, 2025, 9:34 PM

#

sweet tinsel I'm actually unsure because it was thinking for a shorter amount of time than th...

what?

#

gensparks o3pro?

sweet tinsel Jul 25, 2025, 9:34 PM

#

Yes.

sturdy mica Jul 25, 2025, 9:34 PM

#

def has a system prompt from genspark

#

mightve told it to not think long

#

if you can tell an agent to even do that

#

idk if u can

storm needle Jul 25, 2025, 9:52 PM

#

sturdy mica gensparks o3pro?

it's most likely fake

sturdy mica Jul 25, 2025, 9:54 PM

#

https://tenor.com/view/flight-flight-sad-sad-flight-flight-frown-whatchu-mean-speed-gif-16301558380807666913

#

rip

#

i just want free o3 pro brih

#

oh well i still have it

#

but its weird method

#

that stupid method that i hate

stray aspen Jul 25, 2025, 9:57 PM

#

use mechahitler 4

civic flame Jul 25, 2025, 10:04 PM

#

zenith svg

toxic whale Jul 25, 2025, 10:04 PM

#

Is zenith not on webdev arena?

ocean vortex Jul 25, 2025, 10:07 PM

#

civic flame zenith svg

Why does he have a tube sticking out his..

primal orbit Jul 25, 2025, 10:08 PM

#

where is zenith? usual arena?

ocean vortex Jul 25, 2025, 10:10 PM

#

I’m being picky and half-joking though, this is not bad at all considering how many other models do 👀

#

I did this earlier with 2.5Pro for comparison, it’s one of the best for svg if not nr1

toxic whale Jul 25, 2025, 10:14 PM

#

primal orbit where is zenith? usual arena?

ye im testing it rn

#

kraken-072125-1 sucks

raven void Jul 25, 2025, 10:22 PM

#

Zenith is better than kingfall

#

OpenAI cooked

candid storm Jul 25, 2025, 10:23 PM

#

Apparently there's also summit

toxic whale Jul 25, 2025, 10:23 PM

#

ye its great but is it 100% openai?

candid storm Jul 25, 2025, 10:23 PM

#

Also maybe from openai

toxic whale Jul 25, 2025, 10:23 PM

#

summit, Zenith and Lobster are all amazing

candid storm Jul 25, 2025, 10:23 PM

#

https://x.com/aibattle_/status/1948871083198693501?s=46

AiBattle (@AiBattle_)

2 new potential OpenAI models have entered LmArena

The Zenith model in particular seems really good, outperforming the o3-Alpha model on one of my test prompts. It also tends to generate lengthy, detailed code.

toxic whale Jul 25, 2025, 10:24 PM

#

lots of models say they are chatgpt or made by openai

patent aspen Jul 25, 2025, 10:27 PM

#

With a name like Zenith, it's probably GPT-5

zinc ore Jul 25, 2025, 10:27 PM

#

Zenith is on arena but not webdev arena

#

Summit is on both

hushed sand Jul 25, 2025, 10:28 PM

#

candid storm https://x.com/aibattle_/status/1948871083198693501?s=46

just came from that tweet lol

patent aspen Jul 25, 2025, 10:29 PM

#

Oh Zenith and Summit both mean the same thing, so maybe Summit is GPT-5 flagship

primal orbit Jul 25, 2025, 10:31 PM

#

where to find this?

toxic whale Jul 25, 2025, 10:31 PM

#

primal orbit where to find this?

wondering same thing

rotund prawn Jul 25, 2025, 10:39 PM

#

what servers are you guys in with these bots bc i cant find any good servers for the life of me

toxic whale Jul 25, 2025, 10:55 PM

#

Summit and Zenith seem to be based on the same architecture

#

SVG of a ps5 controller:

stray aspen Jul 25, 2025, 11:44 PM

#

erm waht the sigma

stray aspen Jul 26, 2025, 12:10 AM

#

guys whats better for coding

#

claude 4 sonnet no think or grok 4

empty stump Jul 26, 2025, 12:29 AM

#

claude

small haven Jul 26, 2025, 12:47 AM

#

is it time to buy some oai stonks on polymarket

#

gemini 3 not coming till september at most

#

what we thinking

fossil fable Jul 26, 2025, 12:58 AM

#

why can i not pick anything other than battle mode in webarena

fresh charm Jul 26, 2025, 1:32 AM

#

Hi guys 👋

sturdy mica Jul 26, 2025, 1:48 AM

#

what do you guys think

torn mantle Jul 26, 2025, 1:49 AM

#

this new model added is actually crazy

#

zenith

sturdy mica Jul 26, 2025, 1:49 AM

#

stray aspen claude 4 sonnet no think or grok 4

grok 4 but i think gemini 2.5 pro or o3 pro is better right?

#

i thought u said that

#

earlier

torn mantle Jul 26, 2025, 1:49 AM

#

could be gpt-5

#

like the real thing

#

awoah

stray aspen Jul 26, 2025, 2:04 AM

#

will gpt 5 have a think version?

#

thats what the microsoft copilot leaks show

toxic whale Jul 26, 2025, 2:04 AM

#

i tested 3 of the new models Zenith, Summit and Lobster

stray aspen Jul 26, 2025, 2:05 AM

#

toxic whale i tested 3 of the new models Zenith, Summit and Lobster

where can i interact with the lobster

toxic whale Jul 26, 2025, 2:05 AM

#

on webdev arena and its a lot better, ill give my benchmark results

stray aspen Jul 26, 2025, 2:05 AM

#

which is the best of the three

toxic whale Jul 26, 2025, 2:06 AM

#

on my benchmark lobster got 81%, Summit 74%, Zenith 65%, Gemini 2.5 pro 61%, o4-mini 58%

stray aspen Jul 26, 2025, 2:06 AM

#

that sounds great

toxic whale Jul 26, 2025, 2:07 AM

#

for coding i think zenith was best but lobster is very good at other tasks

#

this was so painful to test cuz i had to get lucky on the lm arena and find the models

torn mantle Jul 26, 2025, 2:12 AM

#

toxic whale i tested 3 of the new models Zenith, Summit and Lobster

Zenith isnt added on web arena

#

I don't think lobster is that good compared to zenith

#

These names are confusing me

#

I got zenith like twice in lmarena

#

The probability is so low

toxic whale Jul 26, 2025, 2:18 AM

#

torn mantle Zenith isnt added on web arena

it is on lmarena

torn mantle Jul 26, 2025, 2:19 AM

#

toxic whale it is on lmarena

Yea

toxic whale Jul 26, 2025, 2:19 AM

#

summit and zenith are based on same architecture

#

it is very likely 3 versions of GPT-5 and all are insane

torn star Jul 26, 2025, 2:33 AM

#

I’ve gone on lmarena for the first time and wow, there’s this model named summit that’s insane

ancient reef Jul 26, 2025, 2:33 AM

#

Guys, I tried zenith. Its agi

civic flame Jul 26, 2025, 2:33 AM

#

okay been doing a lot of stuff in dev mode instead of here with some other guys smarter than me and

#

here's my summary

torn star Jul 26, 2025, 2:34 AM

#

This blows 4o out of the water in general text, just asking it about things to do in a certain place

civic flame Jul 26, 2025, 2:34 AM

#

zenith = gpt-5. not sure what reasoning effort, but i am confident
summit = gpt-5 mini. very good at maths, sometimes better than zenith. generally worse everywhere else, but not by too much

#

both are strong, zenith is the first model that has kinda blown me away though

stray aspen Jul 26, 2025, 2:40 AM

#

zenith is gpt 5

#

damn bro i got zenith and selected both are bad

torn star Jul 26, 2025, 2:42 AM

#

Guys I changed my mind, zenith is amazing

primal solstice Jul 26, 2025, 2:42 AM

#

civic flame zenith = gpt-5. not sure what reasoning effort, but i am confident summit = gpt-...

is zenith better than lobster?

torn star Jul 26, 2025, 2:43 AM

#

It’s able to understand context in a way I could never have imagined before

stray aspen Jul 26, 2025, 2:44 AM

#

finally

#

now ill test it

torn mantle Jul 26, 2025, 2:45 AM

#

primal solstice is zenith better than lobster?

why are you here

torn star Jul 26, 2025, 2:47 AM

#

Is there a way to try zenith without having to use the battle mode

hardy pecan Jul 26, 2025, 2:48 AM

#

work at openai

torn star Jul 26, 2025, 2:49 AM

#

Lemme just ask my buddy that works there, ty

ornate stump Jul 26, 2025, 3:13 AM

#

I'm trying out those new models—they're insanely smart, but they overdo it way too often. Maybe their reasoning is limited.

stray aspen Jul 26, 2025, 3:25 AM

#

torn star Is there a way to try zenith without having to use the battle mode

Email sam himself

hardy pecan Jul 26, 2025, 3:25 AM

#

summit > lobster > nectarine > starfish

stray aspen Jul 26, 2025, 3:25 AM

#

Wheres Zenith

hardy pecan Jul 26, 2025, 3:25 AM

#

Haven't got zenith yet, is it on par with summit?

elder rapids Jul 26, 2025, 3:28 AM

#

primal solstice is zenith better than lobster?

worse

restive sky Jul 26, 2025, 3:55 AM

#

hardy pecan Haven't got zenith yet, is it on par with summit?

I thought zenith was better than summit

hardy pecan Jul 26, 2025, 3:56 AM

#

Okkk, exciting

restive sky Jul 26, 2025, 3:56 AM

#

Still waiting for lobster, apparently it is the best

storm needle Jul 26, 2025, 3:56 AM

#

summit is insane

restive sky Jul 26, 2025, 3:58 AM

#

I get "Love this space" at the beginning of most summit and zenith responses. Weird

wide talon Jul 26, 2025, 3:58 AM

#

Do folks know how this chart was created (the data source)? This is from back in Mar, shortly after Nebula had appeared in lmarena.

rare python Jul 26, 2025, 4:01 AM

#

wide talon Do folks know how this chart was created (the data source)? This is from back in...

from the user hemingbird on reddit

#

on r/singularity

#

I don't know if they are in this server though

#

https://www.reddit.com/user/Hemingbird/

wide talon Jul 26, 2025, 4:03 AM

#

do you know how they compiled the chart though?

wide talon Jul 26, 2025, 4:09 AM

#

wide talon do you know how they compiled the chart though?

Seems they did this: https://www.reddit.com/r/singularity/comments/1jizn0t/comment/mjkkybm/

Nice_Cup_2240's comment on "New/updated models by Google soon"

rare python Jul 26, 2025, 4:10 AM

#

wide talon do you know how they compiled the chart though?

No. I guess they used ChatGPT to write python code or something

wide talon Jul 26, 2025, 4:10 AM

#

rare python No. I guess they used ChatGPT to write python code or something

i meant the actual data source

rare python Jul 26, 2025, 4:10 AM

#

wide talon i meant the actual data source

Nope

#

Seems to be private data source

tawny cypress Jul 26, 2025, 4:17 AM

#

Yo what is this summit ai it keeps destroying the opponent on webdev.

wind moth Jul 26, 2025, 4:20 AM

#

who is the best in the search arena

wide talon Jul 26, 2025, 4:21 AM

#

Is there a way to request a particular model (Starfish, etc) in LMArena Battle? Or you just have to keep trying new arenas until you get it

wind moth Jul 26, 2025, 4:21 AM

#

i mean its a battle

#

so if u knew the names then

#

it would be biased i assume

wide talon Jul 26, 2025, 4:22 AM

#

just want to try starfish out haha

sand ledge Jul 26, 2025, 4:26 AM

#

folsom-07152025-2 seems to be a thing too btw

echo aurora Jul 26, 2025, 4:33 AM

#

wide talon Is there a way to request a particular model (Starfish, etc) in LMArena Battle? ...

Sry to say there isn't a way to direct chat or side-by-side with anonymous models. Understandable why that'd be nice though.

quartz light Jul 26, 2025, 5:00 AM

#

I just noticed the announcement

#

holy peak

#

🎉

civic flame Jul 26, 2025, 8:03 AM

#

primal solstice is zenith better than lobster?

appears so

hardy pecan Jul 26, 2025, 8:10 AM

#

Summit didn't get the glove bridge problem 😦

#

Too assumptive of a question I guess

civic flame Jul 26, 2025, 8:20 AM

#

don't use it via webdev arena if you want the best performance on general reasoning tasks lol

#

it has a long system prompt that will degrade performance, as will the scaffolding

hardy pecan Jul 26, 2025, 8:42 AM

#

Yeh fair

calm sequoia Jul 26, 2025, 8:58 AM

#

Did o3 change, or did it always perform unit tests during thinking even when not asked?

calm sequoia Jul 26, 2025, 9:00 AM

#

civic flame it has a long system prompt that will degrade performance, as will the scaffoldi...

We once had a discussion with @ocean vortex about whether a long system prompt reduces performance. We concluded that it doesn't, or only negligibly. Do you think otherwise?

civic flame Jul 26, 2025, 9:00 AM

#

calm sequoia Did the o3 change or it always used to perform unit tests during thinking even w...

gpt-5 is being AB tested as o3 on chatgpt but im not sure

civic flame Jul 26, 2025, 9:01 AM

#

calm sequoia Once we had a discussion with @ocean vortex if the long system prompt re...

i think it's kinda a given

calm sequoia Jul 26, 2025, 9:03 AM

#

civic flame gpt-5 is being AB tested as o3 on chatgpt but im not sure

If it's really GPT 5 we may have some hallucination issue. Code is awesome though.

#

It seems very weird. Fancy words but weird logic. Either it is worse than o3 or too smart for me to appreciate.

#

Lol probably just sampling issue 😄

marsh sundial Jul 26, 2025, 9:20 AM

#

calm sequoia If it's really GPT 5 we may have some hallucination issue. Code is awesome thoug...

I use it to write stories, and the style is vastly different from the previous model I used.

calm sequoia Jul 26, 2025, 9:21 AM

#

Hmmm it uses special characters instead of "-" symbols sometimes. My interpreter broke.
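
A minimal sketch of one way to guard against this, assuming the "special characters" are Unicode dash look-alikes (en dash, minus sign and similar). The character list and the function names `normalize_dashes` and `find_dash_like` are illustrative assumptions, not something from the chat:

```python
import unicodedata

# Dash-like characters a model may emit in place of ASCII "-".
# Assumption: this set is illustrative and may need extending.
DASH_LIKE = {
    "\u2010",  # hyphen
    "\u2011",  # non-breaking hyphen
    "\u2012",  # figure dash
    "\u2013",  # en dash
    "\u2014",  # em dash
    "\u2212",  # minus sign
}

# Translation table mapping every look-alike to ASCII hyphen-minus.
_TABLE = str.maketrans({ch: "-" for ch in DASH_LIKE})


def normalize_dashes(source: str) -> str:
    """Replace dash look-alikes with the ASCII hyphen-minus."""
    return source.translate(_TABLE)


def find_dash_like(source: str) -> list[tuple[int, str]]:
    """Return (index, Unicode name) for each dash look-alike in source."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(source)
        if ch in DASH_LIKE
    ]


if __name__ == "__main__":
    generated = "total = 5 \u2212 3"   # U+2212 instead of "-"
    print(find_dash_like(generated))
    print(normalize_dashes(generated))
```

Note that this replaces look-alikes everywhere, including inside string literals, so running `find_dash_like` first to review the hits is the safer option when the generated code contains user-facing text.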

marsh sundial Jul 26, 2025, 9:26 AM

#

left is zenith, right is Gemini 2.5 pro

digital umbra Jul 26, 2025, 9:34 AM

#

i wonder if google replaced all character names in their pretraining data with aris thorne or something

#

both gemini and gemma love to use that name

whole wagon Jul 26, 2025, 9:42 AM

#

what the heck

#

the new qwen3 is a big regression?

#

they benchmaxxed fr lmao

flint tartan Jul 26, 2025, 9:52 AM

#

civic flame gpt-5 is being AB tested as o3 on chatgpt but im not sure

plausible!?
now its actually able to solve some problems which it couldn't

civic flame Jul 26, 2025, 9:53 AM

#

I am very confident it's being AB tested for at least some o3 requests for a subset of users

unkempt abyss Jul 26, 2025, 10:11 AM

#

Hey folks anyone else having issues with previewing the code on webdev arena?

#

the block tab is blank and there is no link with a refresh button

ornate agate Jul 26, 2025, 10:26 AM

#

whole wagon the new qwen3 is a big regression?

no idea, but the release is 3 models: Coder (which seems decent), thinking (which seems very decent) and default. It also seems to me that the qwen models are tuned for academic tasks/problems rather than general chatting.

whole wagon Jul 26, 2025, 10:27 AM

#

the default is losing in all categories

hazy quest Jul 26, 2025, 10:54 AM

#

There is a time limit on LMArena, right? I tried a complex prompt and got an error (retried multiple times), but if I delete some parts of the prompt it works. Can anyone confirm?

keen fulcrum Jul 26, 2025, 11:24 AM

#

https://discord.com/channels/1340554757349179412/1398620493346635806

barren prairie Jul 26, 2025, 11:24 AM

#

hardy pecan summit > lobster > nectarine > starfish

New chatgpt5 models ?? 🫠

keen fulcrum Jul 26, 2025, 11:24 AM

#

Search Arena has to be cared for

hardy pecan Jul 26, 2025, 11:25 AM

#

barren prairie New chatgpt5 models ?? 🫠

Seems that way

ocean vortex Jul 26, 2025, 11:52 AM

#

civic flame i think it's kinda a given

Nah it’s not a given actually. And believe it or not, in some cases models will perform better with pretty much ANY system prompt than with none at all. Seeing a system message helps somewhat undertrained (in post-training) models, as it reminds them of the fine-tuning structure. With default system prompts that’s even more relevant, as they tend to resemble the ones in the fine-tuning datasets the model saw when learning how to interact and act as a chat completion model.

hazy quest Jul 26, 2025, 11:54 AM

#

Just got nightride-on for the first time, and omg it's strong for my task based on knowledge

torn mantle Jul 26, 2025, 11:56 AM

#

wojtek can you delete this pls

#

ty

full idol Jul 26, 2025, 11:56 AM

#

torn mantle wojtek can you delete this pls

ok, but why? not public info? seen this on X.

torn mantle Jul 26, 2025, 11:57 AM

#

full idol ok, but why? not public info? seen this on X.

yes

full idol Jul 26, 2025, 11:57 AM

#

ok, sorry