#general | Arena | Page 74

torn bison Jul 20, 2025, 6:35 AM

#

babyhitler

whole wagon Jul 20, 2025, 6:35 AM

#

It was always part of the endgame. To start re educating the kids

tidal schooner Jul 20, 2025, 6:36 AM

#

whole wagon It was always part of the endgame. To start re educating the kids

grok shorts™

torn mantle Jul 20, 2025, 6:36 AM

#

honestly it started with that guy called yang

whole wagon Jul 20, 2025, 6:37 AM

#

What did he do lol

#

Ngl the xAI livestream had some weird vibes but maybe it's because most of the engineers are socially inept

#

Like the demo singing about the diet coke was absurdly cringe but none of them even realised that?

torn mantle Jul 20, 2025, 6:39 AM

#

whole wagon Ngl the xAI livestream had some weird vibes but maybe it's because most of the e...

thats the consequence of lack of sleep

torn mantle Jul 20, 2025, 6:40 AM

#

whole wagon What did he do lol

i dont remember exactly, but he was sharing cringe stuff all the time

whole wagon Jul 20, 2025, 6:41 AM

#

I remember one of the xAI engineers did this whole thing that one of them equals 10 researchers at other labs lmao

#

They don't see how it's just cringe

torn mantle Jul 20, 2025, 6:41 AM

#

yea he just recently said that

#

yea that was cringe too

#

I think they are trying to justify their shortcomings

#

'look we made this model called grok 4, while it may not be the best, but HEY DONT FORGET THAT WE ARE A SMALL TEAM AND WE JUST STARTED A YEAR AGO'

#

and please zoom-in on that

whole wagon Jul 20, 2025, 7:09 AM

#

Clown 778 has left so I can discuss the results of the paper

#


With this "heavy" GenSelect inference mode, **OpenReasoning-Nemotron-32B model surpasses O3 (High) on math and coding benchmarks.**
This does not seem like it would match the real usage results at all

#

They are claiming to surpass o3 high on maths and coding benchmarks. Which may be true but I have serious doubts their models are truly better than o3 high

#

Even in parallel inference mode

#

If their claims are valid it basically means the end of openAI lol

leaden sun Jul 20, 2025, 7:14 AM

#

i think he's trying to imply the difference between how human and AI contestants "work" at IMO, i sense a bit unfairness there in the approach at comparing since humans dont use anything else other than their brain while models can access a variety of tools (and internet?) but then again, what is fair when llms arent that comparable to raw human intelligence?

pure anvil Jul 20, 2025, 7:23 AM

#

whole wagon Clown 778 has left so I can discuss the results of the paper

🖕

whole wagon Jul 20, 2025, 7:26 AM

#

44ppl in Meta's Superintelligence team.
— 50% from China
— 75% have PhDs, 70% Researchers
— 40% from OpenAI, 20% DeepMind, 15% Scale
— 20% L8+ level
— 75% 1st gen immigrants

detailed-list-of-all-44-people-in-metas-superintelligence-v0-40cr7tl2sudf1.png

torn mantle Jul 20, 2025, 7:48 AM

#

whole wagon 44ppl in Meta's Superintelligence team. — 50% from China — 75% have PhDs, 70% Re...

poached recently?

#

mm i see i see

whole wagon Jul 20, 2025, 7:49 AM

#

well some of them were 4 months ago

#

yea

#

a few been at meta for years

torn mantle Jul 20, 2025, 7:51 AM

#

yea this will def take a while

unborn ocean Jul 20, 2025, 7:51 AM

#

whole wagon 44ppl in Meta's Superintelligence team. — 50% from China — 75% have PhDs, 70% Re...

the transition from china undergrad -> us graduate is really incredibly common

#

wonder if the people will ever "come back"

#

bc ccp would certainly love that

torn mantle Jul 20, 2025, 7:51 AM

#

zuck is doing more harm than good

whole wagon Jul 20, 2025, 7:52 AM

#

I dont know if 44 ppl is enough tbh

#

he tried to go for quality over everything

#

with the insane salaries and all

torn mantle Jul 20, 2025, 7:52 AM

#

the goal will probably be to release a model EOY

#

behemoth is probably scrapped

whole wagon Jul 20, 2025, 7:53 AM

#

yes

#

open source may be scrapped in general

#

the head guy suggested that

#

alexandr wang

unborn ocean Jul 20, 2025, 7:55 AM

#

whole wagon I dont know if 44 ppl is enough tbh

me too, i guess a lot of these people were actually management level before -> might hire more people to the team with them as lead or they might get the FAIR people as "subordinates" to some degree

#

bc otherwise these people are really expensive if they can't manage

#

and only work on their own

smoky finch Jul 20, 2025, 8:56 AM

#

guys may i know where they announce that the newest models got added - removed from the arena?

#

there was a channel called "🕸️ webdev-arena" but i don't know in which server it is

pure anvil Jul 20, 2025, 9:46 AM

#

torn mantle zuck is doing more harm than good

I wonder if the existing researchers got their salaries increased

#

or the dynamics of that

#

man upto llama 3.3 they made amazing models

ocean vortex Jul 20, 2025, 10:14 AM

#

rumors 😇

chilly nexus Jul 20, 2025, 10:21 AM

#

killing people (fictional character)

ocean vortex Jul 20, 2025, 10:43 AM

#

whole wagon 44ppl in Meta's Superintelligence team. — 50% from China — 75% have PhDs, 70% Re...

So 75% gonna get deported. Can't have illegal aliens playing with AI

civic flame Jul 20, 2025, 10:51 AM

#

if we shouldn't believe jimmy, who generally has a good track record, why should we believe you

#

what track record do you have

ocean vortex Jul 20, 2025, 10:55 AM

#

civic flame what track record do you have

He's got a feeling

#

other feelings are inferior... 🤔

civic flame Jul 20, 2025, 10:59 AM

#

☠️

whole wagon Jul 20, 2025, 11:32 AM

#

i dont really care if you dont believe me lol, im not going to out sources for internet points

chilly nexus Jul 20, 2025, 11:43 AM

#

ocean vortex So 75% gonna get deported. Can't have illegal aliens playing with AI

sounds like Operation paperclip v2.0

frosty lark Jul 20, 2025, 11:52 AM

#

unborn ocean bc ccp would certainly love that

to be fair industrial espionage is a thing too. They can stay where they are and leak info.

#

it can work in the other way too (people that have contacts in other labs and get leaks)

gentle plinth Jul 20, 2025, 11:57 AM

#

leaden sun i think he's trying to imply the difference between how human and AI contestants...

No he is just saying that the rules and conditions for this contest should be disclosed and are possibly not comparable to the rules that apply to the human contest.

cedar tide Jul 20, 2025, 12:17 PM

#

just we agree that o3 alpha is not o4 yes?

torn mantle Jul 20, 2025, 12:47 PM

#

cedar tide just we agree that o3 alpha is not o4 yes?

its not

#

its called o3-alpha for a reason

#

its a more refined o3 version

#

with coding-focused abilities and factually correct answers ( less hallucinations )

soft kernel Jul 20, 2025, 1:13 PM

#

torn mantle its not

I think it'll be just an update

torn mantle Jul 20, 2025, 1:23 PM

#

we've come a long way

#

i remember models couldn't even do ascii visualisation properly

sour spindle Jul 20, 2025, 2:00 PM

#

Is o3 alpha still on arena

ocean vortex Jul 20, 2025, 2:16 PM

#

sour spindle Is o3 alpha still on arena

Checked yesterday, it was not there, barely anything interesting at all actually...

ocean vortex Jul 20, 2025, 2:17 PM

#

torn mantle its called o3-alpha for a reason

just like 'gpt2-chatbot' was called that for a reason lol

#

it means nothing tbh

torn mantle Jul 20, 2025, 2:17 PM

#

but hes asking if its o4

#

its def not o4

ocean vortex Jul 20, 2025, 2:17 PM

#

o4 = gpt5

#

there's not gonna be o4 🙂

#

Also their current naming makes this... interesting. o4-mini is not a mini version of the upcoming improved o3 (gpt5), lmao

#

we do know that they are referring to their current gpt5 as "alpha" though

sacred quail Jul 20, 2025, 2:21 PM

#

is O3 using Gpt 4 for base model, or is totally different one

ocean vortex Jul 20, 2025, 2:21 PM

#

sacred quail is O3 using Gpt 4 for base model, or is totally different one

gpt4.1

sacred quail Jul 20, 2025, 2:22 PM

#

I thought gpt 4.1 was small model that has 1 million token window

ocean vortex Jul 20, 2025, 2:22 PM

#

sacred quail I thought gpt 4.1 was small model that has 1 million token window

gpt4.1-mini is a small model, not this one

sacred quail Jul 20, 2025, 2:22 PM

#

Hmmmm. Can we use gpt 4.1 with app on plus plan ? Or only in API ?

ocean vortex Jul 20, 2025, 2:23 PM

#

sacred quail Hmmmm. Can we use gpt 4.1 with app on plus plan ? Or only in API ?

Yes you can. But it's gonna perform similarly to 4o-latest (which is a different model to API dated ver of gpt4o)

torn mantle Jul 20, 2025, 2:25 PM

#

ocean vortex there's not gonna be o4 🙂

who said that

#

gpt5 is a router model

#

could be anything

#

o-* + gpt model

pure anvil Jul 20, 2025, 2:26 PM

#

torn mantle gpt5 is a router model

hopefully it's based on 4.5

ocean vortex Jul 20, 2025, 2:26 PM

#

torn mantle gpt5 is a router model

I doubt that's gonna turn out to be true tbh. Just a rumor. I think just "unified model" or bluntly speaking hybrid reasoning is much more likely

torn mantle Jul 20, 2025, 2:27 PM

#

ocean vortex I doubt that's gonna turn out to be true tbh. Just a rumor. I think just "unifie...

how to achieve hybrid reasoning model

ocean vortex Jul 20, 2025, 2:27 PM

#

torn mantle how to achieve hybrid reasoning model

Claude4

torn mantle Jul 20, 2025, 2:27 PM

#

i think router model or hybrid model are the same thing

#

depends on how you see it

ocean vortex Jul 20, 2025, 2:27 PM

#

Not the same, Claude is not a router lol

torn mantle Jul 20, 2025, 2:29 PM

#

i was talking about the final result not the approach used

#

they will both at the end choose which path to take ( reasoning/or not )

torn mantle Jul 20, 2025, 2:30 PM

#

ocean vortex Not the same, Claude is not a router lol

how is it achieved tho

ocean vortex Jul 20, 2025, 2:31 PM

#

I mean if it's a unified model, it could decide by itself when to output reasoning and when not to... Literal router does not sound to me like a good idea

#

reasoning effort: off/auto/low/med/high

#

or smth like that

torn mantle Jul 20, 2025, 2:33 PM

#

yea i know but how do we decide that

#

its controlled by thinking_budget=0 / >0 & some internal controller

#

but how does the internal controller work

ocean vortex Jul 20, 2025, 2:34 PM

#

torn mantle yea i know but how do we decide that

wdym. Model can decide by itself. And when it's explicitly specified you can always prefill thinking token and it's gonna output reasoning always

torn mantle Jul 20, 2025, 2:35 PM

#

ocean vortex wdym. Model can decide by itself. And when it's explicitly specified you can alw...

im talking about how does the model decide that

#

the training phase

#

or methods used

ocean vortex Jul 20, 2025, 2:35 PM

#

torn mantle im talking about how does the model decide that

By being trained on both outputs with reasoning and without them. Where harder tasks contain it and easier ones do not
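A rough sketch of that idea, not any lab's actual pipeline: hard tasks get a training target that includes a reasoning trace, easy ones get the bare answer. Everything here is a made-up assumption for illustration: the `Task` fields, the `difficulty` score, the 0.5 threshold, and the `<think>` tags. The second helper mirrors the "prefill thinking token" trick mentioned earlier.

```python
from dataclasses import dataclass

# Placeholder tags, assumed for illustration; not any vendor's real tokens.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


@dataclass
class Task:
    prompt: str
    answer: str
    reasoning: str     # hypothetical worked trace for this task
    difficulty: float  # hypothetical difficulty score in [0, 1]


def build_target(task: Task, threshold: float = 0.5) -> str:
    """Training target: reasoning + answer for hard tasks, answer only for easy ones."""
    if task.difficulty >= threshold:
        return f"{THINK_OPEN}{task.reasoning}{THINK_CLOSE}{task.answer}"
    return task.answer


def force_reasoning(prompt: str) -> str:
    """Prefill the opening tag so generation starts inside a reasoning block."""
    return prompt + THINK_OPEN
```

A model trained on a mix like this would see both formats and, in principle, learn when to open a reasoning block on its own. How the real systems pick difficulty labels is not something this chat establishes.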

torn mantle Jul 20, 2025, 2:36 PM

#

ocean vortex By being trained on both outputs with reasoning and without them. Where harder t...

as simple as that?

#

qwen 3 has that too no?

ocean vortex Jul 20, 2025, 2:37 PM

#

Roughly speaking yeah... We already had some reasoning models that did not always output reasoning, that wasn't explored much yet though. It was all about max performance

torn mantle Jul 20, 2025, 2:38 PM

#

mm i get it now, so its basically like a tiny classifier head used in the same transformer

#

looks at the query embedding and outputs a probability

ocean vortex Jul 20, 2025, 2:38 PM

#

yeah

torn mantle Jul 20, 2025, 2:39 PM

#

prob is compared/learned through a param using RLHF

ocean vortex Jul 20, 2025, 2:39 PM

#

the challenging part is making this deciding by itself model perform better than o3-high. But if they add an option for explicit reasoning effort which it looks like they will, this becomes less crucial

torn mantle Jul 20, 2025, 2:40 PM

#

ocean vortex the challenging part is making this deciding by itself model perform better than...

im sure they will

cedar tide Jul 20, 2025, 3:13 PM

#

New imagen 4 v2

Screenshot_2025-07-20-17-05-16-987_com.android.chrome-edit.jpg

#

Screenshot_2025-07-20-17-05-41-358_com.android.chrome-edit.jpg

woeful geyser Jul 20, 2025, 3:40 PM

#

Found out that Grok 4 got 60.5 on SimpleBench

Screenshot_2025-07-20-22-38-07-735_com.android.chrome.jpg

#

Yet it failed miserably when faced with this prompt #share-prompts message

ocean vortex Jul 20, 2025, 3:44 PM

#

https://girlcockx.com/sama/status/1946575101509734619

why did they remove it from lmarena then... 😠

Sam Altman (@sama)

woke up early on a saturday to have a couple of hours to try using our new model for a little coding project.

done in 5 minutes. it is very, very good.

not sure how i feel about it...

**💬 1.9K 🔁 755 ❤️ 21.1K 👁️ 1.83M **

#

@cedar tide too many pings I think

#

lmao

cedar tide Jul 20, 2025, 3:45 PM

#

o3 alpha its the open source model

#

Screenshot_2025-07-20-17-41-50-361_com.twitter.android-edit.jpg

#

Screenshot_2025-07-20-17-42-48-338_com.twitter.android-edit.jpg

ocean vortex Jul 20, 2025, 3:47 PM

#

cedar tide o3 alpha its the open source model

yeah I heard of this idea earlier. It's plausible

torn mantle Jul 20, 2025, 3:48 PM

#

cedar tide

eh?

#

no way

ocean vortex Jul 20, 2025, 3:48 PM

#

Does he really work at OpenAI though?

#

This satoshi guy

#

https://girlcockx.com/idontexist_nn/status/1946959252838351081

Satoshi (@idontexist_nn)

Sort of crazy how that CEO fucked up and then the internet fucked his life up 😅

**❤️ 3 👁️ 140 **

#

https://girlcockx.com/idontexist_nn/status/1946943082261295519

Satoshi (@idontexist_nn)

Oh, and once again, Strawberry dickhead doesn't know shit and most these shit talkers come from his direction.

**💬 4 ❤️ 29 👁️ 909 **

#

lmao

torn mantle Jul 20, 2025, 3:51 PM

#

bruh

ocean vortex Jul 20, 2025, 3:51 PM

#

I would still say 60:40 it's gpt5. But open-source model is a plausible option too.

torn mantle Jul 20, 2025, 3:51 PM

#

this guy has nothing to do with oai

cedar tide Jul 20, 2025, 3:51 PM

#

ocean vortex Does he really work at OpenAI though?

https://x.com/idontexist_nn/status/1944573117558395125?t=xEfPKE_XrIoHLdg-LulZIQ&s=19

Satoshi (@idontexist_nn)

Nothing reliable yet I told everyone about o1 pro, about o3, about chain of thought tool calling in Jan as its on my profile, what the openrouter secret model was 4.1, codex, plus post before we update the Ui. The recursive system we have that Sam confirmed in his blog 🤣OK kid.

torn mantle Jul 20, 2025, 3:52 PM

#

cedar tide https://x.com/idontexist_nn/status/1944573117558395125?t=xEfPKE_XrIoHLdg-LulZIQ&...

yea not reliable

#

why is he talking like a kid

cedar tide Jul 20, 2025, 3:52 PM

#

@torn mantle you'll see

torn mantle Jul 20, 2025, 3:53 PM

#

cedar tide @torn mantle you'll see

no trust me

#

hes like the strawberry guy

#

aint no way o3 alpha is the os model

#

impossible

#

https://x.com/tetsuoai/status/1946760622009643188

Tetsuo (@tetsuoai)

Grok4’s Ani explains Multithreading vs. Multiprocessing in C.

#

do people really understand things like that

#

shes all over the place

#

fast talking

#

weird movements

ocean vortex Jul 20, 2025, 3:54 PM

#

torn mantle hes like the strawberry guy

Kinda agree. And like this #general message.... That's exactly the kind of thing you would say when you are ignorant and have no clue. "5 is bigger"

#

That's unlikely for 5 to be bigger, and also even if it was, no one would ever say this

torn mantle Jul 20, 2025, 3:56 PM

#

this companion thingy is fun but its not well executed, and im not talking about waifus / elon's degenerate ideas

ocean vortex Jul 20, 2025, 3:56 PM

#

"strawberry dickhead"

storm needle Jul 20, 2025, 3:57 PM

#

ocean vortex "strawberry dickhead"

this dude is clearly a troll

#

and he works no less than at mc donalds

torn mantle Jul 20, 2025, 4:05 PM

#

hes working at x

#

farming engagements

civic flame Jul 20, 2025, 4:07 PM

#

there are many of those links and one of them is that

lime coral Jul 20, 2025, 4:08 PM

#

https://x.com/lmthang/status/1946960256439058844?s=46

Thang Luong (@lmthang)

Yes, there is an official marking guideline from the IMO organizers which is not available externally. Without the evaluation based on that guideline, no medal claim can be made. With one point deducted, it is a Silver, not Gold.

rare python Jul 20, 2025, 4:12 PM

#

civic flame there are many of those links and one of them is that

they are embedding improvement for stock x.com url

#

https://fixupx.com/lmthang/status/1946960256439058844?s=46

Thang Luong (@lmthang)

Yes, there is an official marking guideline from the IMO organizers which is not available externally. Without the evaluation based on that guideline, no medal claim can be made. With one point deducted, it is a Silver, not Gold.

Quoting Mikhail Samin (@Mihonarium)

🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.

OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results imme…

#

Way better

ember rapids Jul 20, 2025, 4:17 PM

#

I have a feeling most people will be disappointed with gpt5

ocean vortex Jul 20, 2025, 4:19 PM

#

it's just a fix embed domain very fitting to X 😇

torn mantle Jul 20, 2025, 4:21 PM

#

lime coral https://x.com/lmthang/status/1946960256439058844?s=46

they are fuming

#

shows how much IMO means to them

#

it was their thing

#

their lil baby

#

now oai stole that from them

ocean vortex Jul 20, 2025, 4:23 PM

#

torn mantle shows how much IMO means to them

ikr. Massacred and scarcened deepmind, can't let this go lol

#

already deducting points 💀

torn mantle Jul 20, 2025, 4:24 PM

#

its silver now xd

#

not really

#

we need to wait a week for the official results

whole wagon Jul 20, 2025, 4:26 PM

#

i read the openai "solutions"

#

they are garbage lel

#

the model somehow does not even write coherent english?

#

like it feels like they didnt teach it english somehow

ocean vortex Jul 20, 2025, 4:29 PM

#

whole wagon like it feels like they didnt teach it english somehow

I think they just trained it to not waste tokens on complex/long sentences... it still makes sense
https://github.com/aw31/openai-imo-2025-proofs/blob/main/problem_1.txt

GitHub

openai-imo-2025-proofs/problem_1.txt at main · aw31/openai-imo-202...

Contribute to aw31/openai-imo-2025-proofs development by creating an account on GitHub.

keen beacon Jul 20, 2025, 4:30 PM

#

You can see there's so many rl artifacts

#

It goes hard

#

Openai did a great job

whole wagon Jul 20, 2025, 4:31 PM

#

cant tell if it is sarcasm ngl

#

some ppl could probably spin it in their head as an innovation lol

ocean vortex Jul 20, 2025, 4:32 PM

#

whole wagon cant tell if it is sarcasm ngl

try training your own model to do gold then in a math competition you have barely any data about at all as you are training....

whole wagon Jul 20, 2025, 4:32 PM

#

problem1 is probably the best one

#

its all downhill from there lol

ocean vortex Jul 20, 2025, 4:33 PM

#

The amazing thing about it is that it's actually a usable model as well, unlike AlphaGeometry...

whole wagon Jul 20, 2025, 4:33 PM

#

did they really mark their own solutions

#

thats what i heard

#

why dont they let all the imo participants mark their own solutions

ocean vortex Jul 20, 2025, 4:34 PM

#

@whole wagon I refuse to believe you aren't actively and deliberately trying to stay ignorant lmao

keen beacon Jul 20, 2025, 4:35 PM

#

whole wagon cant tell if it is sarcasm ngl

It isn't. I find rl artifacts really cool because it's largely unintentional. There are so many rl artifacts in those traces, if you've seen it before

whole wagon Jul 20, 2025, 4:35 PM

#

i have done the national olympiad

#

in the UK

#

got the certificates

ocean vortex Jul 20, 2025, 4:35 PM

#

How much did you score?

#

5%?

#

🧐

leaden palm Jul 20, 2025, 4:40 PM

#

what do you guys think about the theory that it's actually a formal language, just written in a way that looks like plain english?

keen beacon Jul 20, 2025, 4:41 PM

#

It could be doing that partly in the cot but this reads to me like the final output of extremely rld model. So you can't tell much

ornate agate Jul 20, 2025, 4:42 PM

#

I'm 100% sure this is final output and not cot and also not intermediate agent results etc.

keen beacon Jul 20, 2025, 4:43 PM

#

It seems to me it reasons primarily in natural language in the CoT though it can 'codeswitch' into lean or whatever maybe

ocean vortex Jul 20, 2025, 4:43 PM

#

Those are valid and correct solutions though at the end of the day. I suspect if you read AlphaGeometry solutions those would look way more weird and unnatural...

keen beacon Jul 20, 2025, 4:44 PM

#

I think the solutions are good weird at least to me. It reeks of rl

ocean vortex Jul 20, 2025, 4:46 PM

#

AlphaGeometry was bruteforcing it and combining existing ML algorithms with an LLM. This in contrast appears to be an actual singular model that you can use.

whole wagon Jul 20, 2025, 4:46 PM

#

i could write pages on how this solution is flawed. The solution presents the bijection as a fact for example

#

just on problem 1

ocean vortex Jul 20, 2025, 4:47 PM

#

whole wagon i could write pages on how this solution is flawed. The solution presents the bi...

It's a good thing you are not a judge and not qualified to be one too

whole wagon Jul 20, 2025, 4:48 PM

#

the premise is transformed from n+1 to n but then later on they somehow start using n+1 again. There is no reasoning for why this occurs

#

"For n=3, we have possibilities k=0, k=1, or k=3" no justification?

#

just stated as a fact

#

the IMO solution explicitly enumerates all 6 points in p3

#

it shows possible k values are 0,1,3 but never proves those values are achievable for all n greater than or equal to 3

#

the invocation of the pigeonhole principle is vague, they dont define exactly what the pigeons and holes are

#

The solution states that if you remove a side-line, you get a valid covering for n-1. There is 0 explanation for this

ocean vortex Jul 20, 2025, 4:53 PM

#

whole wagon "For n=3, we have possibilities k=0, k=1, or k=3" no justification?

I think you should at least read on how it's being graded

whole wagon Jul 20, 2025, 4:54 PM

#

"proving"

#

there is no proof in the openai solution

#

i can tell they didnt "make an argument" about the boundary point covering. because they literally just state it

ocean vortex Jul 20, 2025, 4:57 PM

#

whole wagon there is no proof in the openai solution

I think this part:

#

though to be fair we do not know if all problems were graded the max 7/7. Solving it does not necessarily mean full grade I think

whole wagon Jul 20, 2025, 4:58 PM

#

i heard if they drop a single point it goes from gold to silver

ornate agate Jul 20, 2025, 4:59 PM

#

IMO Gold threshold this year is 35. OpenAI are assuming all their published solutions are 7/7 perfect because 7*5=35. They graded it all themselves and did not participate with the IMO people at all. People who did (eg DeepMind, probably also ByteDance, DeepSeek, etc) are still under an agreed embargo for a few days.
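To make the arithmetic above concrete, here is a small sketch. The gold cutoff of 35 and the "five problems at 7/7" claim come from the message above; the standard 6-problem, 7-points-each format is assumed, and the per-problem scores below are purely illustrative, not official marks.

```python
POINTS_PER_PROBLEM = 7
NUM_PROBLEMS = 6
GOLD_CUTOFF_2025 = 35  # cutoff as quoted in the message above


def total_score(scores):
    """Sum per-problem scores after checking they fit the assumed IMO format."""
    if len(scores) != NUM_PROBLEMS:
        raise ValueError("expected one score per problem")
    if any(not 0 <= s <= POINTS_PER_PROBLEM for s in scores):
        raise ValueError("each score must be between 0 and 7")
    return sum(scores)


def is_gold(scores, cutoff=GOLD_CUTOFF_2025):
    """True if the total reaches the gold cutoff."""
    return total_score(scores) >= cutoff


# Illustrative only: five perfect solutions and nothing on problem 6.
claimed = [7, 7, 7, 7, 7, 0]
# Same, but with a single point deducted on one problem.
one_point_off = [7, 7, 6, 7, 7, 0]

print(total_score(claimed), is_gold(claimed))              # 35 True
print(total_score(one_point_off), is_gold(one_point_off))  # 34 False
```

So with zero margin, the claim only holds if every one of the five solutions is marked a full 7/7 by the official guideline.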

ocean vortex Jul 20, 2025, 4:59 PM

#

whole wagon i heard if they drop a single point it goes from gold to silver

well in that case it counted as the necessary proof I suppose... But anyway, we are arguing about small details here. It's a big achievement no matter how you put it

whole wagon Jul 20, 2025, 4:59 PM

#

its an achievement. but i dont believe it is gold

#

where is the grading site btw

#

i can go through point by point

#

from the headings for problem 1 they have 6 at least

ornate agate Jul 20, 2025, 5:02 PM

#

DeepMind got a solid silver last year, I think the main achievement here is converting to natural language. Its interesting but its EXTREMELY likely that a 1yr improved prover (eg DeepMind's or DeepSeek prover) will score gold.

ocean vortex Jul 20, 2025, 5:02 PM

#

I think this sums it up well: https://werd.io/openais-gold-medal-performance-on-the-international-math-olympiad/

Werd I/O

OpenAI’s gold medal performance on the International Math Olympiad

OpenAI claims a significant result: gold-level performance International Mathematical Olympiad. But they're scant on details and it needs to be independently verified.

#

independent verifications are welcome, obviously

ocean vortex Jul 20, 2025, 5:04 PM

#

whole wagon where is the grading site btw

I used this https://matharena.ai, they have grading details for each individual IMO score and the requirements when you click on it

whole wagon Jul 20, 2025, 5:07 PM

#

hm well with that grading criteria it is 7/7

ocean vortex Jul 20, 2025, 5:08 PM

#

yeah fair point I suppose that it may not be 100% identical. But probably as close as it gets at this point

We followed a methodology similar to our evaluation of the 2025 USA Math Olympiad [1]. In particular, four experienced human judges, each with IMO-level mathematical expertise, were recruited to evaluate the responses. Evaluation began immediately after the 2025 IMO problems were released to prevent contamination. Judges reviewed the problems and developed grading schemes, with each problem scored out of 7 points. To ensure fairness, each response was anonymized and graded independently by two judges. Grading was conducted using the same interface developed for our Open Proof Corpus project [2].

lime coral Jul 20, 2025, 5:13 PM

#

torn mantle they are fuming

Don’t think they stole anything since they also got the gold, by the right way. OAI not passing the vanilla test&jury is more like them worrying to lose the battle. I’m sure if they had Silver with their eval they wouldn’t make noise

#

https://x.com/abeirami/status/1946979340031349109?s=46

Ahmad Beirami (@abeirami)

This explains why OpenAI results are out and GDM results are not.

And what's out is not even official results verified by IMO!

whole wagon Jul 20, 2025, 5:15 PM

#

problem 1 - 3 looks fine with the point system

ornate agate Jul 20, 2025, 5:20 PM

#

https://xcancel.com/Mihonarium/status/1946880931723194389#m

Nitter

Mikhail Samin (@Mihonarium)

🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinato...

storm needle Jul 20, 2025, 5:22 PM

#

can you prove that you really have it

ornate agate Jul 20, 2025, 5:22 PM

#

storm needle can you prove that you really have it

not to you, no

whole wagon Jul 20, 2025, 5:24 PM

#

hmmmmmmmm

torn mantle Jul 20, 2025, 5:25 PM

#

whole wagon problem 1 - 3 looks fine with the point system

Can you share links

#

Let me read them a bit

whole wagon Jul 20, 2025, 5:25 PM

#

this is correct me thinks

torn mantle Jul 20, 2025, 5:25 PM

#

Can someone share links or nah

whole wagon Jul 20, 2025, 5:25 PM

#

its all correct

#

its valid proofs its just a pain to read

leaden palm Jul 20, 2025, 5:29 PM

#

ornate agate https://xcancel.com/Mihonarium/status/1946880931723194389#m

moving fast and breaking things

whole wagon Jul 20, 2025, 5:29 PM

#

torn mantle Can someone share links or nah

https://github.com/aw31/openai-imo-2025-proofs/tree/main

sterile rapids Jul 20, 2025, 5:49 PM

#

hello

torn mantle Jul 20, 2025, 5:50 PM

#

sterile rapids hello

hi

#

can you solve IMO problems or nah?

sterile rapids Jul 20, 2025, 5:50 PM

#

can anyone help me. i use LMArena to create the image but can not create vertical image. can only create 1024x1024 kind of image

#

I have fixed the prompt many times but it still doesn't work.

torn mantle Jul 20, 2025, 5:51 PM

#

are you good at math?

sterile rapids Jul 20, 2025, 5:51 PM

#

torn mantle are you good at math?

I'm not

#

If possible can you please re-prompt me?

torn mantle Jul 20, 2025, 5:52 PM

#

im not that good

#

sorry

sterile rapids Jul 20, 2025, 5:53 PM

#

768 x 1376. but AI only creates 1024 x 1024

#

I want to create a picture with dimensions like this. It is 768 x 1376. But AI only creates 1024 x 1024.

ornate agate Jul 20, 2025, 5:58 PM

#

whole wagon its valid proofs its just a pain to read

just looking at problem 1. OAI solution seems a bit of a different proof. "Stating and proving that the leftmost and bottommost points are covered by n or n−1 lines." the OAI model doesn't do this, it shows instead that one line must always be a triangle side (and so can then be removed).

tight nexus Jul 20, 2025, 6:01 PM

#

I am looking for someone to work on a project I have some ok hardware (3x 3090 GPUs and Threadripper 3990x with 128 Gb Ram). If anyone is interested DM me.

ocean vortex Jul 20, 2025, 6:19 PM

#

ornate agate https://xcancel.com/Mihonarium/status/1946880931723194389#m

I think there are valid points in the comments. OpenAI brought more spotlight to it than this competition otherwise would have gotten. And all we have are some anecdotal reports from people allegedly having connections to the organizers and not even the organizers themselves going publicly on record to say there was anything wrong at all in what OpenAI did...

#

If we asked people actually taking part there (students), I would be very surprised if majority was in favor of forcing the labs to wait before they publish their results

#

Like all of it just seems a cheap way to attack someone, and seems to be driven for a good part by those who are just generally against OpenAI no matter what they do lol

#

They managed a breakthrough? Find ways to poke holes in it or how you don't need it. Then start needing it when everyone else starts doing it.
They beat the competition? Find ways how they cheated. And if they didn't cheat then they did something else wrong or released the results at the wrong time... catgrin

zinc ore Jul 20, 2025, 6:32 PM

#

I don't even think that's what's going on, I think they likely do consider the announcement rude, since it comes from an individual speaking to one of the directors.

It doesn't help that all the other companies are effectively under embargo and also want to announce, then a company comes in and didn't go through official channels and takes the spot light. So it angers the other companies and also embarrasses the IMO.

ocean vortex Jul 20, 2025, 6:32 PM

#

Also matharena published results before OpenAI did. I don't think they would have waited any longer even if that 2.5Pro run was medal worthy lol

ocean vortex Jul 20, 2025, 6:34 PM

#

zinc ore I don't even think that's what's going on, I think they likely do consider the a...

"Individual speaking to one of the directors" just sounds incredibly weak though. There's no official statement even remotely resembling anything like this...

zinc ore Jul 20, 2025, 6:34 PM

#

They're not going to make an official statement

ocean vortex Jul 20, 2025, 6:35 PM

#

well then it's "he said he heard person X whose sister rated one of the problems" types of thing lol

#

meaningless

zinc ore Jul 20, 2025, 6:35 PM

#

All you have to do is find out who Joseph Myers is and see if they are indeed a reliable source connected to IMO officials

#

If yes, then likely what he's reporting is true

ocean vortex Jul 20, 2025, 6:36 PM

#

I'm not saying he made it all up. But you also need to realise they have dozens of judges and probably even more people that could be counted as organisers and quoted.

whole wagon Jul 20, 2025, 6:42 PM

#

It is a guideline they have at the end of the day

#

Whether or not they are truly angered by it is not that relevant

#

They should follow the guidelines

ocean vortex Jul 20, 2025, 6:43 PM

#

whole wagon It is a guideline they have at the end of the day

Is it really a guideline for outsiders though? Doesn't seem to apply to matharena.ai etc

#

If they have AI labs asking their judges to participate in rating model outputs then I suppose there could be guidelines. But it doesn't look like OpenAI was in that circle

keen beacon Jul 20, 2025, 6:44 PM

#

was openai aware of those guidelines when they released those results? i think that's what matters

small haven Jul 20, 2025, 6:44 PM

#

cedar tide

that is actually insane if its open source, wow

keen beacon Jul 20, 2025, 6:44 PM

#

if they did, it's bad faith

zinc ore Jul 20, 2025, 6:44 PM

#

keen beacon was openai aware of those guidelines when they released those results? i think t...

Isn't it public information? Should be very easy for a company like openAI to know

#

It's almost common sensical

whole wagon Jul 20, 2025, 6:45 PM

#

They were in contact with imo ppl they ofc knew lol

ocean vortex Jul 20, 2025, 6:45 PM

#

zinc ore Isn't it public information? Should be very easy for a company like openAI to k...

if it's public then link it lol

zinc ore Jul 20, 2025, 6:45 PM

#

Even Altman would intuitively know this

ocean vortex Jul 20, 2025, 6:45 PM

#

I don't see it

zinc ore Jul 20, 2025, 6:45 PM

#

It's extremely unlikely openAI is oblivious to this

whole wagon Jul 20, 2025, 6:46 PM

#

Every year the AI labs wait a period of time before releasing their results. And you are trying to suggest openAI were not aware

ocean vortex Jul 20, 2025, 6:46 PM

#

zinc ore It's extremely unlikely openAI is oblivious to this

If they are not part of their inner circle or asking their organisers for anything (rating outputs), I don't see how any of this applies when it isn't public info tbh

keen beacon Jul 20, 2025, 6:47 PM

#

ive got no dog in this fight nor am i familiar with imo btw, just stating that it is bad faith if they were aware of it

zinc ore Jul 20, 2025, 6:47 PM

#

ocean vortex If they are not part of their inner circle or asking their organisers for anythi...

That's basically what makes it rude

ocean vortex Jul 20, 2025, 6:47 PM

#

Imagine training your model yourself for a breakthrough result and then receiving this same kind of scrutiny after doing everything independently lol

zinc ore Jul 20, 2025, 6:47 PM

#

Like them "not being officially involved" doesn't spare them the "rudeness" of the whole thing

ocean vortex Jul 20, 2025, 6:48 PM

#

you were not bound by any internal "guidelines"

zinc ore Jul 20, 2025, 6:48 PM

#

I'm aware

#

They're in effect taking IMO's prestige to hype themselves

#

Which is why it is rude

ocean vortex Jul 20, 2025, 6:49 PM

#

zinc ore Like them "not being officially involved" doesn't spare them the "rudeness" of t...

it is not any more "rude" than matharena publishing IMO leaderboard

zinc ore Jul 20, 2025, 6:49 PM

#

Matharenas is irrelevant

#

No one cares about some 10-30% results

ocean vortex Jul 20, 2025, 6:49 PM

#

zinc ore They're in effect taking IMO's prestige to hype themselves

Debatable. They are also making IMO much more talked about event than it otherwise would have been

#

and bringing more spotlight to the human participants

zinc ore Jul 20, 2025, 6:50 PM

#

ocean vortex Debatable. They are also making IMO much more talked about event than it otherwi...

This would have occurred either way, even if they waited a week

#

Especially from the other companies

#

The main thing is this embarrasses the IMO

#

That's why you get a comment saying it is rude

ocean vortex Jul 20, 2025, 6:51 PM

#

zinc ore Matharenas is irrelevant

but they had no clue what the results would have been. You can't judge it based on the results. If there ARE any guidelines (even that part is murky), they are to be applied universally not selectively depending on who does it.

zinc ore Jul 20, 2025, 6:51 PM

#

It's just openAI disrespecting some obvious social etiquette

whole wagon Jul 20, 2025, 6:51 PM

#

Yes but I thought it is valid still. There are a few places where they do things differently

ocean vortex Jul 20, 2025, 6:51 PM

#

zinc ore It's just openAI disrespecting some obvious social etiquette

that's an extreme reach lol

zinc ore Jul 20, 2025, 6:52 PM

#

ocean vortex but they had no clue what the results would have been. You can't judge it based ...

I'm not saying we can't consider matharena rude, I'm just saying it's too irrelevant to be noticed compared to a company hyping up their gold placement while every other AI company is forced to keep their mouth shut out of respect to the IMO

#

Hence, the rudeness

#

OpenAI is casually being rude to both the IMO and every other AI company involved

#

This is literally common sense and not even controversial

whole wagon Jul 20, 2025, 6:53 PM

#

Yeah but it's openAI they never cared about things like that

ocean vortex Jul 20, 2025, 6:53 PM

#

zinc ore I'm not saying we can't consider matharena rude, I'm just saying it's too irrele...

I don't think they are forced per se. I think Google just did things differently and hence were bound by different constraints. They asked the actual judges to rate the outputs and hence were expected to comply with certain rules

zinc ore Jul 20, 2025, 6:54 PM

#

ocean vortex I don't think they are forced per se. I think Google just did things differently...

No, not a single involved AI company, not just Deepmind, has said anything, out of respect for the IMO

ocean vortex Jul 20, 2025, 6:54 PM

#

matharena and OpenAI did not go that route 🤷‍♂️

zinc ore Jul 20, 2025, 6:54 PM

#

There are both open source and closed source AI companies that participated this year

#

They haven't announced their results, out of respect to the IMO

#

Hence, openAI is rude

whole wagon Jul 20, 2025, 6:54 PM

#

Well there is the entirety of AIMO that is also keeping quiet

#

It's a massive thing

#

Yeah. They just don't create a fanfare till the embargo date

ocean vortex Jul 20, 2025, 6:55 PM

#

zinc ore They haven't announced their results, out of respect to the IMO

That's your assumption. However matharena did announce them, and as for others.. We do not know what their relationship was (did they need assistance of judges), or even if their runs were actually successful and worth celebrating etc

zinc ore Jul 20, 2025, 6:56 PM

#

ocean vortex That's your assumption. However matharena did announce them, and as for others.....

It was literally stated earlier they are supposed to wait a week

whole wagon Jul 20, 2025, 6:56 PM

#

The result itself is fine to publish imo. It's all this hyping and the suggestion openAI was the first (the other companies are literally just waiting)

zinc ore Jul 20, 2025, 6:56 PM

#

How am I assuming anything lol

#

I'm literally just saying what has been stated they are supposed to do

ocean vortex Jul 20, 2025, 6:57 PM

#

zinc ore It was literally stated earlier they are supposed to wait a week

Link to that guideline?

zinc ore Jul 20, 2025, 6:57 PM

#

We already have the report from the secretary general conversation for one, which you reject

ocean vortex Jul 20, 2025, 6:58 PM

#

that's what internally was supposed to be an understanding for the labs participating together with them. OpenAI was not one of such labs...

#

therefore it literally needs to be public

#

or it kinda doesn't apply

zinc ore Jul 20, 2025, 6:58 PM

#

You have to prove they didn't know

ocean vortex Jul 20, 2025, 6:58 PM

#

zinc ore You have to prove they didn't know

bruh...

zinc ore Jul 20, 2025, 6:59 PM

#

If we're going to do this whole annoying burden argument thing here, then you have to prove your position

ocean vortex Jul 20, 2025, 6:59 PM

#

They were not part of that circle needing special treatment in the first place

keen beacon Jul 20, 2025, 6:59 PM

#

even then, shouldn't those guidelines really apply only if they were using the actual judges? it's still kinda bad faith even if they weren't bound by them anyway, but posting the results is fine i guess since they were unbound

whole wagon Jul 20, 2025, 7:00 PM

#

Why didn't they make themselves part of the circle? They literally just came in as an outsider and said they got gold according to themselves

zinc ore Jul 20, 2025, 7:00 PM

#

ocean vortex They were not part of that circle needing special treatment in the first place

Why does it need to be public then if it doesn't apply to them?

whole wagon Jul 20, 2025, 7:00 PM

#

That is strange

zinc ore Jul 20, 2025, 7:00 PM

#

That makes no sense

#

The point is, if that expectation exists, then it's rude even if they weren't officially involved

keen beacon Jul 20, 2025, 7:00 PM

#

whole wagon Why didn't they make themselves part of the circle? They literally just came in ...

i guess so they could do this lol

ocean vortex Jul 20, 2025, 7:00 PM

#

Like If I was participating from outside I'm publishing my results without doing some hilarious investigation to find out what they are secretly thinking internally lmao

zinc ore Jul 20, 2025, 7:00 PM

#

ocean vortex Like If I was participating from outside I'm publishing my results without doing...

Right, because we are discussing if they did so knowingly

#

Because doing so ignorantly technically looks less bad for them

ocean vortex Jul 20, 2025, 7:02 PM

#

zinc ore Right, because we are discussing if they did so knowingly

I mean, sure... But there's also a question whether it even applies at all to outsiders. Even if you are aware of such a rule being imposed on someone who is considered part of the internal circle...

keen beacon Jul 20, 2025, 7:03 PM

#

ocean vortex I mean, sure... But there's also a question whether it even applies at all to ou...

i dont think it applies, but they probably avoided participating officially with the imo just so they could publish results early. which is sorta bad faith

ocean vortex Jul 20, 2025, 7:03 PM

#

Like they do get the benefits for assistance of judges. It's not like there's only 1 way to do this

zinc ore Jul 20, 2025, 7:04 PM

#

If the IMO indeed wants the first week to be about human achievement, and tells all the AI companies not to report yet, then some outside company comes in before the closing ceremony and says "we got gold everyone!!!" and hypes their product, then it seems natural that would be annoying both to the IMO and the companies involved.

#

Obviously they're not bound by that standard, but that's precisely why it is rude

#

Because you're using their prestige and name to hype their product, while contradicting their intent to celebrate human performance, while also stealing the spotlight from other companies respecting this waiting period.

#

So basically, because other companies are respecting the IMO, they're getting slapped in the face meanwhile.

ocean vortex Jul 20, 2025, 7:07 PM

#

zinc ore If the IMO indeed wants the first week to be about human achievement, and tells ...

If they definitively contacted OpenAI and told them that then I agree it's bad faith, even if not legally bound at all. Though I find the details on this murky. There needs to be public info in writing for them to be in a good position on it

zinc ore Jul 20, 2025, 7:07 PM

#

There's no "legality" here

ocean vortex Jul 20, 2025, 7:07 PM

#

Well I used this term loosely. Valid submission whatever

zinc ore Jul 20, 2025, 7:08 PM

#

Well, I think openAI being ignorant of this expectation looks much better for them

hot field Jul 20, 2025, 7:08 PM

#

It’s really not okay to lash out at someone or point fingers before we even know the full story.

zinc ore Jul 20, 2025, 7:09 PM

#

But, if they were aware, then yeah that comes off as bad faith

keen beacon Jul 20, 2025, 7:10 PM

#

if they contacted openai even if they weren't participating officially, it's definitively bad faith.
if they didn't contact openai but openai were also aware of the rule (which is likely) which resulted in them not participating officially, it's sorta bad faith and manipulative (in order to release the results early). but plausibly deniable i guess

ocean vortex Jul 20, 2025, 7:10 PM

#

Fundamentally I think... We should apply the same rules to them you would follow yourself if you were training a model and wanted to do it. So we still come back to that public info part. Allegedly contacting someone on the side but not everyone is not the way this is supposed to be done

zinc ore Jul 20, 2025, 7:10 PM

#

keen beacon if they contacted openai even if they weren't participating officially, it's def...

Yep, this is basically my view

#

Also I find it strange they didn't just officially involve themselves like everyone else did

#

Which may add to the whole rudeness factor, because every other company had the courtesy to go through the IMO

ocean vortex Jul 20, 2025, 7:20 PM

#

zinc ore Which may add to the whole rudeness factor, because every other company had the ...

I agree but only partially and conditionally depending on the actual communication. Seems to me like what they did was still a valid path to take, and in some ways, perhaps even smart. They did take some risk too though as it's not gonna be a great look if this doesn't pass independent validation lol

leaden palm Jul 20, 2025, 7:22 PM

#

has everyone here already seen this

zinc ore Jul 20, 2025, 7:22 PM

#

I actually have a slightly different theory over what happened, despite what I just said.

torn mantle Jul 20, 2025, 7:24 PM

#

https://x.com/ErnestRyu/status/1946699212307259659

Ernest Ryu @ ICML'25 (@ErnestRyu)

4. OpenAI surely knew GDM was working on the IMO, so they beat GDM to the punch with their Saturday morning announcement, generating hype. GDM’s slow-science scholarship cost them the PR battle. (4/10)

#

lol

whole wagon Jul 20, 2025, 7:24 PM

#

ofc they knew man

#

its so obvious

torn mantle Jul 20, 2025, 7:24 PM

#

its obvious to me

#

not to you

torn mantle Jul 20, 2025, 7:25 PM

#

zinc ore I actually have a slightly different theory over what happened, despite what I j...

elaborate

zinc ore Jul 20, 2025, 7:25 PM

#

Well I doubt my own theory now

whole wagon Jul 20, 2025, 7:26 PM

#

You should appreciate that openai blessed this earth with their presence Kappa

zinc ore Jul 20, 2025, 7:26 PM

#

Basically, I think they didn't intend to participate in this IMO, until they heard Deepmind got gold and saw the questions. Realized they could get gold too and rushed and solved the problems then announced

torn mantle Jul 20, 2025, 7:26 PM

#

whole wagon You should appreciate that openai blessed this earth with their presence <:Kappa...

we will see roon

zinc ore Jul 20, 2025, 7:27 PM

#

Which is why they didn't go through official channels, because they originally didn't plan to participate

whole wagon Jul 20, 2025, 7:27 PM

#

They are scared

#

it is obvious

zinc ore Jul 20, 2025, 7:27 PM

#

And the timeline technically lines up

#

Yeah I agree

#

I think had there been tougher questions, openAI doesn't participate

#

Because, it was a good opportunity for them

tidal schooner Jul 20, 2025, 7:28 PM

#

whole wagon You should appreciate that openai blessed this earth with their presence <:Kappa...

“openai made ai open”

#

🥀

zinc ore Jul 20, 2025, 7:29 PM

#

deepmind gets gold and it gets leaked on twitter
36 hrs later openAI announces they got gold
Says they solved each question within 4.5 hrs each (so they could have solved everything within a day)

keen beacon Jul 20, 2025, 7:30 PM

#

that would be a crazy series of events

zinc ore Jul 20, 2025, 7:30 PM

#

They probably know Deepmind can't announce yet, decide to announce

keen beacon Jul 20, 2025, 7:30 PM

#

if it happened like that

#

but i doubt it

whole wagon Jul 20, 2025, 7:30 PM

#

Well i dont think it changes anything fundamentally. they are still being chased down, the fundamentals are identical before and after

zinc ore Jul 20, 2025, 7:30 PM

#

Yeah I doubt it too, that's why I didn't want to share lol

torn mantle Jul 20, 2025, 7:30 PM

#

If both of them got similar scores then im more interested who got the cleanest readable solution

#

So they both failed p6?

#

But still 1/10

whole wagon Jul 20, 2025, 7:32 PM

#

maybe the model is too smart. i noticed with o3 sometimes it skips steps cos it just assumes ill get it

#

like it cant conceptualise what a human would struggle to understand

torn mantle Jul 20, 2025, 7:32 PM

#

whole wagon maybe the model is too smart. i noticed with o3 sometimes it skips steps cos it ...

It pisses me off all the time

#

Hmph

#

I have to tell it to elaborate

#

And don't assume things

#

Makes me so mad

#

🤬

ocean vortex Jul 20, 2025, 7:36 PM

#

torn mantle 🤬

Have you tried Heavy Dork 4.0?

zinc ore Jul 20, 2025, 7:37 PM

#

https://vxtwitter.com/ns123abc/status/1947016206768046452

So they're claiming how they mark correct answers is only internally known

NIK (@ns123abc)

🚨🚨🚨BREAKING: DeepMind leader calls OpenAI’s IMO gold claim bullshit:

“IMO has an internal marking guide no one outside sees. Without that, you can’t claim a medal. With the point you lose on P6, you’re Silver, not Gold.”

Reminder:

OpenAI didn’t even collaborate with IMO.
vague-posts results with no full transparency
DeepMind collaborated with IMO to verify results
IMO explicitly requested AI labs to wait a week after the closing ceremony to avoid stealing spotlight from brilliant students (humans)
OpenAI chose to announce their results BEFORE the ceremony even ended
Research community: “OpenAI has been disrespectful”

LMFAOOO

#

How were you guys judging if they were correct then? Based on prior IMOs?

#

Depends if the guidelines they know are the same ones used in 2025

ocean vortex Jul 20, 2025, 7:41 PM

#

zinc ore https://vxtwitter.com/ns123abc/status/1947016206768046452 So they're claiming h...

the existence of this internal-only thing is bullsh'it, but on the other hand... Claiming a medal without cooperation with organizers can be a valid point as well. This is getting interesting though lol

#

🍿

whole wagon Jul 20, 2025, 7:42 PM

#

wait are they claiming it is literally impossible for openai to get gold?

ocean vortex Jul 20, 2025, 7:42 PM

#

"IMO has an internal marking guide no one outside sees."

whole wagon Jul 20, 2025, 7:43 PM

#

"With the point you lose on P6, you’re Silver, not Gold" this sounds unambiguous

#

like its just straight up impossible

#

eh this is all some weird bs im going to only treat the official results as real

#

i wouldnt accept a student doing this path

#

why would i accept an ai company

ocean vortex Jul 20, 2025, 7:47 PM

#

yeah if true then that's not gold... It was gold based on publicly available info 🤷‍♂️

#

Would have been insane plot twist too if they changed that reactively responding to OpenAI... lmao

whole wagon Jul 20, 2025, 7:52 PM

#

how do u know the boundary

#

is 35

ocean vortex Jul 20, 2025, 7:54 PM

#

@whole wagon explain this to me:

#

no last task = 35 = gold

#

the way everyone would read it

whole wagon Jul 20, 2025, 7:54 PM

#

i dont need to explain anything it was a question

ocean vortex Jul 20, 2025, 7:55 PM

#

whole wagon i dont need to explain anything it was a question

what

#

If they have official data clearly stating what a gold medal result is... That tweet gets an entirely different context tbh

#

At the moment it looks like this:

Company X gets gold.
Undisclosed guidelines change going against public info
One of "internal partners" gets gold instead.

This whole thing can go dirty both ways lmao

zinc ore Jul 20, 2025, 7:58 PM

#

OpenAI claims full marks on P1-P5, for 35 pts

ocean vortex Jul 20, 2025, 7:59 PM

#

that's exactly the same, can't you read?

#

35 points threshold for gold

zinc ore Jul 20, 2025, 8:00 PM

#

I think the Deepmind employee was just saying that the internal guidelines aren't known to openAI, so how could they grade themselves? And if they lose even just a single point from those guidelines they get silver not gold.

whole wagon Jul 20, 2025, 8:00 PM

#

oh yea

#

they dont even know the official guidelines lol

zinc ore Jul 20, 2025, 8:01 PM

#

That's why I was asking how you guys had scored it as 35, without knowing the guidelines

ocean vortex Jul 20, 2025, 8:01 PM

#

that's not AI summary lmao. It's the old fashioned search type of thing

#

info straight from that website

#

which isn't loading atm for me

#

Anyway, it matches with what you said yourself, so what is the issue?

#

ok it did load somehow... yeah exactly the same info:

#

according to this 35 without last problem is gold medal

#

So yeah... they can say internally what they want, if my model scores 35 I'm claiming it scores equivalent of gold medal with a clear conscience 🤷‍♂️

#

Yeah that part is a factor still. Which is why I said they are risking it claiming the result valid

whole wagon Jul 20, 2025, 8:09 PM

#

if only i could grade my own exams also

#

i would also get perfect score

ocean vortex Jul 20, 2025, 8:10 PM

#

whole wagon i would also get perfect score

LOL. Well we can only expect or assume they didn't half-ass it. They did know it's gonna need to be independently verified and published outputs for everyone to see, but who knows...

#

The problem right now is that if IMO organization is really united on this, they will want to screw them over as well lol

#

Also regarding the last problem, did it not produce a final solution or did it not output anything useful at all? A detail that could be meaningful and something they may be withholding...

zinc ore Jul 20, 2025, 8:21 PM

#

ocean vortex The problem right now is that if IMO organization is *really* united on this, th...

It comes down to whether or not they actually got gold according to the IMO guidelines, because if they didn't then we might see some drama over this.

#

If their answers are valid to those guidelines then I don't think that drama happens.

#

whole wagon Jul 20, 2025, 8:40 PM

#

#

😂

gentle plinth Jul 20, 2025, 8:54 PM

#

whole wagon 😂

maybe sam promised that they would get the same pay at oai if they turn down the offer

whole wagon Jul 20, 2025, 9:00 PM

#

This gap is crazy

#

Though the speed with which Google has gained ground is absurd also

torn mantle Jul 20, 2025, 9:02 PM

#

whole wagon This gap is crazy

not surprised

#

grok will get a bit more cuz of the new companion thingy

ocean vortex Jul 20, 2025, 9:10 PM

#

zinc ore

Yeah this changes quite a lot... People are quick to blame OpenAI and attack them for pretty much anything lol

ocean vortex Jul 20, 2025, 9:11 PM

#

whole wagon Though the speed with which Google has gained ground is absurd also

Android

hollow ocean Jul 20, 2025, 9:11 PM

#

whole wagon This gap is crazy

this was made by gpt agent

ocean vortex Jul 20, 2025, 9:13 PM

#

OpenAI are shameless greedy villains, they don't care about those harmless little kids BOOO 🤓

quartz light Jul 20, 2025, 9:35 PM

#

mb im late to this message but this is NOT impressive its just using plugins or whatever its called from threejs for the water and hills

#

did they even check the code

#

maybe its good at problems but its not good at coding

quartz light Jul 20, 2025, 9:40 PM

#

whole wagon

thats js dumb 🥀

small haven Jul 20, 2025, 9:49 PM

#

whole wagon Though the speed with which Google has gained ground is absurd also

forced integration aint fair tho

lime coral Jul 20, 2025, 10:04 PM

#

Like having ChatGPT in my iPhone

#

https://x.com/harmonicmath/status/1947023450578763991?s=46

Harmonic (@HarmonicMath)

This past week, Harmonic had the opportunity to represent our advanced mathematical reasoning model, Aristotle, at the International Mathematics Olympiad - the most prestigious mathematics competition in the world.

To uphold the sanctity of the student competition, the IMO Board

zinc ore Jul 20, 2025, 10:14 PM

#

Well, that confirms the wait a week claim

#

Also confirms they are supposed to wait for the sanctity of student contestants

#

You can definitely tell these companies are annoyed at openAI

ocean vortex Jul 20, 2025, 10:44 PM

#

zinc ore Also confirms they are supposed to wait for the sanctity of student contestants

Yeah, but only as far as they are concerned doing this internally with IMO and assumingly using their judges. OpenAI was green lit by IMO:
#general message

pulsar cobalt Jul 20, 2025, 10:45 PM

#

Hi 👋 Can anyone tell me which AI models available on LMArena have the most recent knowledge cut offs? And through using which ones can I get the most up-to-date and accurate information? Would be really helpful if you help 🙏

ocean vortex Jul 20, 2025, 10:45 PM

#

They are annoyed but probably an equal amount about the fact that they didn't think of doing it this way themselves lol

#

Though this is still mostly only relevant to the labs actually competing for the top spot though...

jade egret Jul 20, 2025, 10:46 PM

#

whole wagon This gap is crazy

google gemini cooking?

torn mantle Jul 20, 2025, 10:47 PM

#

well its free

#

and better

jade egret Jul 20, 2025, 10:47 PM

#

do yall think it gonna catch up sooner or later

torn mantle Jul 20, 2025, 10:47 PM

#

gemini vision is way ahead too

#

grok doesnt have that

ocean vortex Jul 20, 2025, 10:47 PM

#

for others I would imagine the advantages of doing this in full collaboration far outweigh the opportunity of releasing results slightly sooner

torn mantle Jul 20, 2025, 10:47 PM

#

jade egret do yall think it gonna catch up sooner or later

yea with waifus

#

and nsfw stuff

jade egret Jul 20, 2025, 10:48 PM

#

😭

#

naw.....

zinc ore Jul 20, 2025, 10:48 PM

#

ocean vortex Yeah, but only as far as they are concerned doing this internally with IMO and a...

I like how you cite me while explaining this to me lol

ocean vortex Jul 20, 2025, 10:48 PM

#

jade egret google gemini cooking?

Google numbers are inflated by all those Android phones with dedicated AI buttons lmao

#

Your phone comes with it pre-installed so you're automatically a user if you buy it. Any kind of automated request tied to a feature could count as an "active user"

torn mantle Jul 20, 2025, 10:49 PM

#

no but if you think about it, xai cant boost their user base with just grok4/3 or a 300$ plan, they need to find another way

#

and then some genius at xai (cough cough elon) proposed companions

#

they know how a lot of virgins will eat that up

#

its sad but they dont have a choice since they are incompetent at making good models

ocean vortex Jul 20, 2025, 10:51 PM

#

torn mantle no but if you think about it, xai cant boost their user base with just grok4/3 o...

Trump's phone gonna fix all their problems

#

the best phone

#

comes with Dork preinstalled and SuperDork free for the first 3 months ™

jade egret Jul 20, 2025, 10:52 PM

#

whos gonna win the ai race

torn mantle Jul 20, 2025, 10:52 PM

#

definitely google

ocean vortex Jul 20, 2025, 10:53 PM

#

It's either Google or OpenAI

torn mantle Jul 20, 2025, 10:53 PM

#

well first we need to define the finish line/target

jade egret Jul 20, 2025, 10:53 PM

#

i think google

#

eventually

ocean vortex Jul 20, 2025, 10:54 PM

#

xAI too detached from reality, Anthropic thinking too small, Meta is like MS operationally so don't really see them leading the way ever tbh

jade egret Jul 20, 2025, 10:54 PM

#

torn mantle well first we need to define the finish line/target

agi ig

#

or just takes most of the users

#

like 85%

#

idk

ocean vortex Jul 20, 2025, 10:56 PM

#

Chinese labs are impressive, but they are yet to prove they can come up with their own paradigms or ideas and convert it all into finished product

torn mantle Jul 20, 2025, 10:57 PM

#

zuck poached many employees and lead figures from oai, but the way i see it, is that they will likely face the same outcome as xai

#

it's extremely difficult to start from scratch, even if you have a significant background or extensive experience

#

+they still need more personnel

ocean vortex Jul 20, 2025, 10:58 PM

#

open-source from them is impressive, but then again there are no closed Chinese models better than that regardless of how much you pay

torn mantle Jul 20, 2025, 10:58 PM

#

tbh, we should be asking 1000 questions to the people who left

#

i honestly still don't understand why they would leave... can money really be a strong enough motivator to do that?

ocean vortex Jul 20, 2025, 11:00 PM

#

torn mantle zuck poached many employees and lead figures from oai, but the way i see it, is ...

afaik they didn't really poach any key figures from OpenAI. Relatively speaking I mean, every single person working there has an impressive CV technically and it's very easy to make a big deal out of it

torn mantle Jul 20, 2025, 11:01 PM

#

ocean vortex Chinese labs are impressive, but they are yet to prove they can come up with the...

chinese are a great case study people should learn from... you get the sense they genuinely want to innovate and achieve agi by any means necessary

ocean vortex Jul 20, 2025, 11:02 PM

#

torn mantle chinese are a great case study people should learn from... you get the sense the...

In some ways yeah... But they also want for others to see their success more than anything (aka facade). Which is part of the reason why their best is equal to their best in open-source ig

#

If they come up with something, they gonna make sure the whole world knows it and it's represented in the best possible way 👀

jade egret Jul 20, 2025, 11:04 PM

#

torn mantle Jul 20, 2025, 11:04 PM

#

also for anthropic, I get the impression they're working more for the government, or to please it, rather than for their actual users

#

look at how many years have passed, and they still have rate limits and GPU issues

jade egret Jul 20, 2025, 11:05 PM

#

torn mantle also for anthropic, I get the impression they're working more for the government...

but they good at coding : (

torn mantle Jul 20, 2025, 11:05 PM

#

however, their core philosophy remains the same

jade egret Jul 20, 2025, 11:05 PM

#

torn mantle look at how many years have passed, and they still have rate limits and GPU issu...

true

ocean vortex Jul 20, 2025, 11:06 PM

#

torn mantle also for anthropic, I get the impression they're working more for the government...

wdym? They are all for "safe AI" though and the current US gov wants to ban all regulation with safety canceled entirely essentially lol

torn mantle Jul 20, 2025, 11:06 PM

#

they probably have more respect for their employees than oai does... for example, lets say their researchers at anthropic are allocated 30% of the GPU resources for their own tests, simulations, etc... I'm certain that this percentage in oai case has decreased over time, which shows they may prioritize users.. but in turn, has an impact on their researchers, and it could very well be one of the reasons that led those researchers to search somewhere else

torn mantle Jul 20, 2025, 11:07 PM

#

ocean vortex wdym? They are all for "safe AI" though and the current US gov wants to ban all ...

idk i just feel like their cultures are so different

#

/ what they prioritize

torn mantle Jul 20, 2025, 11:08 PM

#

ocean vortex In some ways yeah... But they also want for others to see their success more tha...

everyone wants to show-off not just the chinese

ocean vortex Jul 20, 2025, 11:08 PM

#

torn mantle they probably have more respect for their employees than oai does... for example...

Well but you can't make an assumption out of thin air and then say that it shows something lol

torn mantle Jul 20, 2025, 11:08 PM

#

but do you have the talent to do so?

#

why cant other labs do the same?

#

i mean wtf happened to mistral?

ocean vortex Jul 20, 2025, 11:10 PM

#

There's no GPU resources allocation for engineers at all tbh. Neither at Anthropic nor at OpenAI. They have certain projects and the resources allocated for it, you don't have random people running around hogging compute if that wasn't planned months in advance and wanted by the entire company

#

Ofc they do have some kind of resources they can use for isolated, smaller work-related things, but that's gonna be minuscule compared to the resources we are talking about with training the production models etc

#

and mostly not a problem at all to limit ever

hard hazel Jul 20, 2025, 11:13 PM

#

Beep boop, am i disturbing?
Had a question

#

Back when i was using lmarena, i remember the models generated the code and deployed it too right then and there, does that not happen anymore?

#

Or is that feature relocated somewhere else

torn mantle Jul 20, 2025, 11:15 PM

#

ocean vortex There's no GPU resources allocation for engineers at all tbh. Neither at Anthrop...

i'm talking about the whole R&D department, not individuals.... what I mean is that this number, or its percentage, we can say is fixed at anthropic or more prioritized, whereas at oai it could have decreased given their massive user base

ocean vortex Jul 20, 2025, 11:15 PM

#

torn mantle i mean wtf happened to mistral?

Lack of funding. They recently partnered with nVidia though iirc

#

so probably things gonna get better

hard hazel Jul 20, 2025, 11:15 PM

#

torn mantle i'm talking about the whole R&D department, not individuals.... what I mean is t...

Yet OpenAI is sluggish af, azure works so much better relatively

ocean vortex Jul 20, 2025, 11:17 PM

#

torn mantle i'm talking about the whole R&D department, not individuals.... what I mean is t...

Those are just assumptions. It could just as well have increased, if they modernized their infra faster than the user count grew...

#

OpenAI is not sitting still

hard hazel Jul 20, 2025, 11:18 PM

#

Meanwhile Gemini giving out year long free trials, thousands of free requests per day on CLI, all them Veo 3 nonsense, and having blazing fast speeds still.....

TPUs are just so much better, aren't they 🤔

ocean vortex Jul 20, 2025, 11:19 PM

#

hard hazel Or is that feature relocated somewhere else

https://web.lmarena.ai

#

has to be web based code though

hard hazel Jul 20, 2025, 11:19 PM

#

Ah, without that subdomain it is just simple chat and image gen huh, i see

torn mantle Jul 20, 2025, 11:23 PM

#

ocean vortex Those are just assumptions. It could just as well have increased, if they mode...

well, it's a possible hypothesis, of course I'm not saying it's 100% happening, but it's plausible that it could happen

hard hazel Jul 20, 2025, 11:29 PM

#

ocean vortex https://web.lmarena.ai

Seems to be broken at the moment?
When i type something in it and send it, it becomes unresponsive, nothing happens, it doesn't freeze, it just does nothing.
Same when i click on suggested prompts, same on both desktop mode and mobile mode

torn mantle Jul 20, 2025, 11:48 PM

#

https://x.com/lefthanddraft/status/1944782515220463981

Wyatt Walls (@lefthanddraft)

xAI targeting the incel market

#

................

keen beacon Jul 20, 2025, 11:51 PM

#

LMAO

civic flame Jul 20, 2025, 11:55 PM

#

torn mantle

jesus christ

jade egret Jul 21, 2025, 12:52 AM

#

jade egret

why do yall think it google? (i agree, i just wanna know ur reasons)

stray aspen Jul 21, 2025, 12:52 AM

#

what the hell

patent aspen Jul 21, 2025, 1:51 AM

#

jade egret why do yall think it google? (i agree, i just wanna know ur reasons)

Google did 80% of the foundational AI research of the past decade that allowed the other AI labs to exist

#

Google was already spending over $30B a year on R&D (mostly on AI) before ChatGPT existed

jade egret Jul 21, 2025, 2:02 AM

#

oo

wintry tinsel Jul 21, 2025, 2:24 AM

#

torn mantle https://x.com/lefthanddraft/status/1944782515220463981

There is no companion only a blender rig and a computer server behind it

#

How do people delude themselves into finding this amusing

misty vault Jul 21, 2025, 2:25 AM

#

i have been doing this with sydney for ages

leaden palm Jul 21, 2025, 2:43 AM

#

wintry tinsel There is no companion only a blender rig and a computer server behind it

aren't we all just computer servers at the end of the day

lilac nimbus Jul 21, 2025, 3:55 AM

#

Claude back soon

mellow salmon Jul 21, 2025, 4:12 AM

#

hey guys,can anyone tell me what's the rate limits of flux kontext max in lmarena

small haven Jul 21, 2025, 4:20 AM

#

lilac nimbus Claude back soon

what u mean by this 👀

#

is neptune v3 red phase done?

echo aurora Jul 21, 2025, 4:58 AM

#

hard hazel Back when i was using lmarena, i remember the models generated the code and depl...

Are you looking for the code preview section in https://web.lmarena.ai/ ?

torn bison Jul 21, 2025, 6:37 AM

#

The current ask for no is 97.8c. It took the market 5 days to reach a relatively fair price. No need to wait until resolution to realize 11% profit😂

tidal schooner Jul 21, 2025, 7:03 AM

#

this sounds more like an ad for machine learning than anything 😭
https://www.instagramez.com/reel/DF2JqSeNFfB/

CusDeb | Technology Magazine (@cusdeb_com)

In 2003, IBM attempted to appeal to the hearts of ordinary computer users through a commercial that aired on U.S. television during the Super Bowl broadcast. The ad featured a young boy who personified Linux. He's 9 years old*, curious and quickly absorbs information.

We can argue at length about IBM's role in the modern high-tech world, but credit should be given to them, as the commercial turned out to be a masterpiece. Enjoy watching!

* Linux 1.0 was released in March 1994, so in the 2003 commercial, the boy was 9 years old.

#CusDebMagazine #TechJournal #TechMagazine
#IBM #linux #LinuxKernel #SuperBowl #SuperBowl2025 #SuperBowlAd #SuperBowlAds #SuperBowlCommercial #SuperBowlCommercials

#

yeah ik it’s meant to be focused on the expertise of contributors across the world but still

hard hazel Jul 21, 2025, 8:07 AM

#

echo aurora Are you looking for the code preview section in <https://web.lmarena.ai/> ?

Yes

ocean vortex Jul 21, 2025, 8:11 AM

#

If anything the foul play was by xAI with those initial benchmarks that don't look very realistic. The model is just not very good and outputs are way too short and basic to do well on lmarena lol

#

It had a chance of capitalizing on their earlier success there, but when you look at the outputs of it it's easy to see why it did so poorly... People will not vote for what they don't see and what is hidden, there was no such issue with grok3.

gusty helm Jul 21, 2025, 8:20 AM

#

given suitgate and several scandals with whales lately im expecting the worst strictly.

#

I mean from polymarket

#

Lmarena has no stakes in this and its just caught in crossfire

#

Silver lining this is a small market/I did not see any of the usual suspects. Too little money for them. Id fully expect an army of people/bots trying to influence it otherwise

pure lynx Jul 21, 2025, 8:34 AM

#

Does anyone know when the Text Arena leaderboard is updated? Are the updates weekly?

whole wagon Jul 21, 2025, 8:41 AM

#

ocean vortex Jul 21, 2025, 8:58 AM

#

whole wagon

He took the advice a bit too literally lmao

ocean vortex Jul 21, 2025, 9:13 AM

#

gusty helm Silver lining this is a small market/I did not see any of the usual suspects. To...

They have fairly solid protection against bots with Cloudflare, which seems to have strict settings (hence the people complaining about getting locked out of the website). Though ofc everything is possible, spotting your own model should be very easy too when you know what to ask - but you can't exactly automate that, at least not easily.

ocean vortex Jul 21, 2025, 9:16 AM

#

pure lynx Does anyone know when the Text Arena leaderboard is updated? Are the updates wee...

Hard to tell if it's regular in same intervals but the last update before current one I think was 1 week earlier yeah

primal orbit Jul 21, 2025, 9:17 AM

#

is it me or it has been a while since new secret model from Google. Stonebloom has been around for quite some time.

ocean vortex Jul 21, 2025, 9:20 AM

#

primal orbit is it me or it has been a while since new secret model from Google. Stonebloom h...

they are still finishing up Ultra 1.0

#

give them time

#

👀

#

But yeah... They are way behind schedule and Ultra name controversy continues lol

gusty helm Jul 21, 2025, 9:21 AM

#

ocean vortex They have fairly solid protection against bots with Cloudflare, which seems to h...

I have to disagree. It's not trivial but it's not that hard either. And we're talking f u kind of money with some of these people. Worst case you can afford to pay 10 people to non stop click via vpns. This being the brute force approach basically with little to no automation.

pulsar tendon Jul 21, 2025, 9:22 AM

#

Is the openai anon bot gone, been a while since i got it

gusty helm Jul 21, 2025, 9:22 AM

#

Not saying it happened, there’s no sign of it. But it could

ocean vortex Jul 21, 2025, 9:24 AM

#

gusty helm I have to disagree. Its not trivial but its not that hard either. And we talking...

Like I said anything is possible... Though I would imagine cases of spamming the same thing (prompt) from several IPs relentlessly is an easy thing to catch and exclude from results as well lol

gusty helm Jul 21, 2025, 9:25 AM

#

Given the sample size in the voting (20k votes) it's trivial to bypass IP restrictions. NordVPN alone would give you 100+ IPs to use

primal orbit Jul 21, 2025, 9:25 AM

#

gusty helm I have to disagree. Its not trivial but its not that hard either. And we talking...

this all would be visible to admins of lmarena. They have all the prompt data. If prompt or prompts repeat suspiciously, they can take action.
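
A minimal sketch of the kind of check described above, assuming the admins can see each vote as an `(ip, prompt)` pair. The function name, the record layout, and the `min_ips` threshold are hypothetical illustrations, not anything LMArena has confirmed it uses:

```python
from collections import defaultdict


def normalize(prompt: str) -> str:
    # Collapse case and whitespace so trivially edited copies still match.
    return " ".join(prompt.lower().split())


def flag_repeated_prompts(votes, min_ips=3):
    """Return {normalized prompt: distinct IP count} for prompts that were
    sent from at least `min_ips` different IPs (hypothetical threshold)."""
    ips_by_prompt = defaultdict(set)
    for ip, prompt in votes:
        ips_by_prompt[normalize(prompt)].add(ip)
    return {p: len(ips) for p, ips in ips_by_prompt.items() if len(ips) >= min_ips}


# Made-up sample data: three IPs sending the same "identify yourself" prompt.
sample_votes = [
    ("10.0.0.1", "Which model are you?"),
    ("10.0.0.2", "which model  are you?"),
    ("10.0.0.3", "Which model are you?"),
    ("10.0.0.4", "Write a haiku about rain"),
]
suspicious = flag_repeated_prompts(sample_votes)
print(suspicious)
```

Exact-match grouping like this only catches the lazy case; paraphrased prompts spread over many VPN exits would need fuzzier matching, which is roughly the disagreement in this thread.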

ocean vortex Jul 21, 2025, 9:25 AM

#

In some ways, if you have the money, you simply fine-tune your model to do well exclusively only on lmarena if it's so important for you and that's gonna be more effective and perhaps even easier. As for betting and polymarket, people with such resources do not bet on these odds...

gusty helm Jul 21, 2025, 9:26 AM

#

ocean vortex In some ways, if you have the money, you simply fine-tune your model to do well ...

Man look up polymarket suitgate 🤣

#

The guys own the whole crypto chain they use to solve markets. Talking few hundred million usd there

ocean vortex Jul 21, 2025, 9:28 AM

#

gusty helm Man look up polymarket suitgate 🤣

I meant it for lmarena related odds. Some of the other bets and categories get lots of attention and reach for sure

gusty helm Jul 21, 2025, 9:28 AM

#

Eg: when something is disputed they use a distributed voting system that’s external (uma token). Few wallets hold majority of uma

gusty helm Jul 21, 2025, 9:28 AM

#

ocean vortex I meant it for lmarena related odds. Some of the other bets and categories get l...

Agreed, simply not enough money in the ai markets to be worth it

primal orbit Jul 21, 2025, 9:29 AM

#

guys, you talk about polymarket so often. How much do you actually bet? if you're comfortable sharing.

gusty helm Jul 21, 2025, 9:30 AM

#

Not that much in my case, just a few grand of disposable income

#

But you get range from 10 $ to 100 million people basically

primal orbit Jul 21, 2025, 9:32 AM

#

gusty helm Not that much in my case just few grands disposable income

ok, wish you good luck 🙂

gusty helm Jul 21, 2025, 9:32 AM

#

Thanks

ocean vortex Jul 21, 2025, 9:39 AM

#

@pure lynx If you look there there are all the updates listed:
https://github.com/fboulnois/llm-leaderboard-csv/releases

GitHub

Releases · fboulnois/llm-leaderboard-csv

CSVs of the Huggingface and LMSYS LLM leaderboards, along with the code to generate them in R. - fboulnois/llm-leaderboard-csv

#

looks like it's somewhat more frequent than once a week actually

#

just noticed smth interesting too... their csv includes 2.5Pro-05-06 too even though the official leaderboard in new interface doesn't anymore 👀

#

So technically... even without style control Grok4 is still 3rd not 2nd lmao

calm sequoia Jul 21, 2025, 11:11 AM

#

Funny how mistral is now building manpads 😄

ocean vortex Jul 21, 2025, 11:15 AM

#

calm sequoia Funny how mistral is now building manpads 😄

Mistral also has a very nice car lmao

https://www.bugatti.com/en/models/w16-mistral

calm sequoia Jul 21, 2025, 11:16 AM

#

Such an awesome company!

torn mantle Jul 21, 2025, 11:24 AM

#

ocean vortex So technically... even without style control Grok4 is still 3rd not 2nd lmao

yea

#

it kinda reminds me of gemini 2.5 flash reasoning

ocean vortex Jul 21, 2025, 11:36 AM

#

torn mantle it kinda reminds me of gemini 2.5 flash reasoning

That was very verbose though. Well I mean you could see the output and see it in full. Even if it was fundamentally similar you don't get to use any of that with Grok lol

#

I'm fairly sure in certain cases Grok answers could be counted as a pass or at least partially correct instead of completely off the mark, if only you could see the reasoning it did. But you can't

torn mantle Jul 21, 2025, 11:44 AM

#

ocean vortex That was very verbose though. Well I mean you could see the output and see it in...

im so lazy to explain xd

ocean vortex Jul 21, 2025, 11:44 AM

#

Especially relevant for these math competitions where there are strict clear rules, and including an explanation or not can be the difference between 1 and 0 points for that part

cedar tide Jul 21, 2025, 11:45 AM

#

This model has been in the arena for 2 months, still without results 🥴 (he still appears in the arena)

#

gemini 2.5 flash lite and minimax M1 are also in the webdev arena for a long time and still no result

candid storm Jul 21, 2025, 11:52 AM

#

#

What do you guys think about these timelines?

keen beacon Jul 21, 2025, 11:53 AM

#

candid storm

I expect GPT-5 in 1-2 weeks.
Coding model is likely next year
The rest seem ok

pure anvil Jul 21, 2025, 11:54 AM

#

candid storm

midjourney 8 and claude 4.5 are the most exciting ones on this list

ocean vortex Jul 21, 2025, 11:59 AM

#

torn mantle im so lazy to explain xd

Have actually no idea what you mean, here's telling me they are nothing alike without using words:

#

#

🗿

torn mantle Jul 21, 2025, 12:03 PM

#

lol

#

A

candid storm Jul 21, 2025, 12:03 PM

#

candid storm

When do you all expect Gemini 3?

torn mantle Jul 21, 2025, 12:06 PM

#

idk its just that grok 4 gives me 'weaker gemini 2.5 pro' vibes, which makes it feel like a weaker iteration.. again im saying it reminds me of it, not claiming that its a perfect match for flash ver

torn mantle Jul 21, 2025, 12:06 PM

#

ocean vortex Have actually no idea what you mean, here's telling me they are nothing alike wi...

.

torn mantle Jul 21, 2025, 12:07 PM

#

candid storm When do you all expect Gemini 3?

wolfstride / stonebloom / kingfall could all be gemini 3 early checkpoints

#

leo shared their metadata before

rare python Jul 21, 2025, 12:08 PM

#

torn mantle leo shared their metadata before

v3 internally is used for Gemini 2.0 right?

torn mantle Jul 21, 2025, 12:10 PM

#

rare python v3 internally is used for Gemini 2.0 right?

i dont remember tbh

#

but there was n+1 iteration for wolfstride/kingfall/stonebloom models

rare python Jul 21, 2025, 12:10 PM

#

solar hollow Jul 21, 2025, 12:15 PM

#

you guys tested that new openai model? sth sth alpha

civic flame Jul 21, 2025, 12:18 PM

#

rare python

the interesting part of the internal model names wasn't to do with that lol

#

it was that the suffix for gem 2.5 pro was "m" (as in "medium") and for kingfall/stonebloom/wolfstride it was "l" (as in "large")

cedar tide Jul 21, 2025, 12:20 PM

#

candid storm

Claude 4.5 sooner

rare python Jul 21, 2025, 12:21 PM

#

civic flame it was that the suffix for gem 2.5 pro was "m" (as in "medium") and for kingfall...

not as Gemini 3.0 as Asura said

cedar tide Jul 21, 2025, 12:22 PM

#

@echo aurora add HiDream-E1.1
Best open source image editing model
https://huggingface.co/HiDream-ai/HiDream-E1-1

HiDream-ai/HiDream-E1-1 · Hugging Face

torn mantle Jul 21, 2025, 12:23 PM

#

civic flame it was that the suffix for gem 2.5 pro was "m" (as in "medium") and for kingfall...

ah this one yea

rare python Jul 21, 2025, 12:24 PM

#

We still haven't seen seededit 3.0 in direct chat, pineapple

torn mantle Jul 21, 2025, 12:24 PM

#

gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant

echo aurora Jul 21, 2025, 12:24 PM

#

cedar tide @echo aurora add HiDream-E1.1 Best open source image editing model http...

blobthanks I'll be sure to flag.

ocean vortex Jul 21, 2025, 12:26 PM

#

civic flame it was that the suffix for gem 2.5 pro was "m" (as in "medium") and for kingfall...

Unicorn and Ultra fate then... I wonder if they are actually planning to ever release it. Probably marginal gains on most conventional metrics

civic flame Jul 21, 2025, 12:26 PM

#

no obvious activity on that stuff since wolfstride dropped on the arena

#

that was late june and now both of the models are gone

#

💔

#

also still find it weird that oai dropped that interesting model on webdev arena for a single day

cedar tide Jul 21, 2025, 12:29 PM

#

Add gemini 2.5 no think instead of putting crappy models from amazon that nobody wants (kraken folsom, v1 v2, nova experimental)

ocean vortex Jul 21, 2025, 12:30 PM

#

civic flame also still find it weird that oai dropped that interesting model on webdev arena...

Probably wanted to quantify the gains they made with spatial awareness. Webdev arena remains one of the most reliable metrics for this, even if focused on code...

cedar tide Jul 21, 2025, 12:31 PM

#

ocean vortex Jul 21, 2025, 12:31 PM

#

arc-agi-2 is as good as useless for spatial awareness now, doesn't mean anything lol

torn mantle Jul 21, 2025, 12:41 PM

#

cedar tide Add gemini 2.5 no think instead of putting crappy models from amazon that nobody...

wesh

#

they are still experimental models

#

and they cant just say no to requests from amazon

civic flame Jul 21, 2025, 12:42 PM

#

were kingfall/stonebloom/wolfstride early versions of gemini 3 or?

#

when do you expect that to drop then 😭

#

would you bet before or after gpt-5

#

the next 2-3 weeks should be fun then

nocturne agate Jul 21, 2025, 12:45 PM

#

how do ppl access that o3 alpha model?

civic flame Jul 21, 2025, 12:47 PM

#

it was on webdev arena

#

not anymore

#

doubt

#

probably close

nocturne agate Jul 21, 2025, 12:48 PM

#

civic flame not anymore

ah ok then. i was running prompts like stupid

#

but i found interesting model, nightforge - anyone know what is it? probably a gemini model?

rare python Jul 21, 2025, 12:49 PM

#

minimax m1 iirc

ocean vortex Jul 21, 2025, 12:51 PM

#

rare python minimax m1 iirc

btw it's a weird model size for it just like the name suggests lmao

#

neither max nor mini. Neither good performing nor lightweight

candid storm Jul 21, 2025, 12:52 PM

#

Is it confirmed we will get gemini 2.5 ultra?

civic flame Jul 21, 2025, 12:54 PM

#

for all intents and purposes

nocturne agate Jul 21, 2025, 1:01 PM

#

or ultra dumb

nocturne agate Jul 21, 2025, 1:01 PM

#

candid storm Is it confirmed we will get gemini 2.5 ultra?

if there is no exp model yet nothing is confirmed

#

ultra?

#

when was it lol

#

ah ok, i was talking about this

#

they usually do the release well in advance

#

like month

#

so i guess we arent getting anything atm

#

yes

#

im curious what the cost of O3 Alpha will be

#

this model looks really strong

leaden sun Jul 21, 2025, 1:22 PM

#

if then it's due to geopolitical power play...

whole wagon Jul 21, 2025, 1:36 PM

#

nocturne agate im curious what the cost of O3 Alpha will be

Is it in the arena

#

The battle one

#

I never get it there

willow grail Jul 21, 2025, 1:46 PM

#

i am a vegan who drinks dairy kefir
no coconut kefir has worse properties than dairy kefir
i will do everything to become healthy

would u say i am doing everything i can do be as vegan as possible?
i suffer from meteorism and psoriasis and constipation, in short a bad microbiome/dysbiosis for unknown reasons. probably very low motility.
so i am trying dairy kefir now cause this helps some people
i also tried bile acid from bovines
and digestion enzymes

pulsar cobalt Jul 21, 2025, 1:52 PM

#

Hi

#

Can anyone tell me which AI models available on LMArena have the most recent knowledge cut offs? And through using which ones can I get the most up-to-date and accurate information?

willow grail Jul 21, 2025, 2:39 PM

#

where is kingfall in benchmarks ?

fleet lintel Jul 21, 2025, 2:44 PM

#

torn mantle gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant

where is this info?

civic flame Jul 21, 2025, 2:51 PM

#

iirc i posted it here or dmed it to them

snow lily Jul 21, 2025, 3:10 PM

#

Does anyone know why mine keeps showing "something went wrong with this response try again"?

#

This is such a nice site but the worst thing is you just can't copy paste the whole conversation if you wanna restart because of the way the site is, you can't select more than 1 reply at a time, makes it so annoying

hardy lion Jul 21, 2025, 3:32 PM

#

cedar tide

how do you use the best voted models if you don't at least put all the models out for voting? 🤔

cedar tide Jul 21, 2025, 3:33 PM

#

hardy lion how do you use the best voted models if you don't at least put all the models ou...

The best voted models in "model-request"
https://discord.com/channels/1340554757349179412/1372229840131985540

hardy lion Jul 21, 2025, 3:34 PM

#

ahh

stray aspen Jul 21, 2025, 3:40 PM

#

naaah

whole wagon Jul 21, 2025, 4:23 PM

#

Ai explained did a vid shitting on openAI imo announcement

#

Lel

#

The goat of AI influencers

whole wagon Jul 21, 2025, 4:42 PM

#

Sora 2 is coming soon

#

Maybe openAI can redeem themselves after getting mauled by Google

ornate agate Jul 21, 2025, 4:46 PM

#

https://xcancel.com/GoogleDeepMind/status/1947333836594946337

Nitter

Google DeepMind (@GoogleDeepMind)

An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇

It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵

balmy mist Jul 21, 2025, 4:48 PM

#

ornate agate https://xcancel.com/GoogleDeepMind/status/1947333836594946337

wait didnt openai just win this?

main gulch Jul 21, 2025, 4:50 PM

#

A version of this model with Deep Think will soon be available to trusted testers, before rolling out to
Google AI Ultra subscribers.

they said the same thing 2 months ago

hollow ocean Jul 21, 2025, 4:51 PM

#

@deep adder Deepthink release January 2026

stray aspen Jul 21, 2025, 4:52 PM

#

@deep adder are you the real craig federighi

torn mantle Jul 21, 2025, 4:52 PM

#

https://x.com/GoogleDeepMind/status/1947333836594946337

Google DeepMind (@GoogleDeepMind)

An advanced version of Gemini with Deep Think has officially achieved gold medal-level performance at the International Mathematical Olympiad. 🥇

It solved 5️⃣ out of 6️⃣ exceptionally difficult problems, involving algebra, combinatorics, geometry and number theory. Here’s how 🧵

#

no one beat me to it right?

torn mantle Jul 21, 2025, 4:53 PM

#

ornate agate https://xcancel.com/GoogleDeepMind/status/1947333836594946337

uh oh

#

i would rather take a ready to deploy model than an unreadable prototype that will take months to be released (oai model)

ornate agate Jul 21, 2025, 4:54 PM

#

at first glance the solutions made by Gemini deepthink are really crisp https://storage.googleapis.com/deepmind-media/gemini/IMO_2025.pdf

stray aspen Jul 21, 2025, 4:55 PM

#

billy do you work at google deepmind

torn mantle Jul 21, 2025, 5:00 PM

#

ornate agate at first glance the solutions made by Gemini deepthink are really crisp https://...

thanks

#

lol

#

what a difference

#

i would be ashamed to share oai results

#

https://x.com/GoogleDeepMind/status/1947333841531568182

Google DeepMind (@GoogleDeepMind)

Gemini solved the math problems end-to-end in natural language (English).

This differs from our results last year when experts first translated them into formal languages like Lean for specialized systems to tackle.

#

dont end them like that

wintry tinsel Jul 21, 2025, 5:02 PM

#

Lots of talk no release

rare python Jul 21, 2025, 5:02 PM

#

Lmao who said DeepMind got cooked at IMO?

stray aspen Jul 21, 2025, 5:03 PM

#

is open ai getting destroyed?

torn mantle Jul 21, 2025, 5:03 PM

#

wintry tinsel Lots of talk no release

we are waiting for you

#

agree

#

but is it with informal mathematics or nah

rare python Jul 21, 2025, 5:05 PM

#

torn mantle Jul 21, 2025, 5:05 PM

#

then we can safely say gdm are way ahead

torn mantle Jul 21, 2025, 5:05 PM

#

rare python

its still their thing

#

but i was serious tho

#

gdm takes IMO so seriously

rare python Jul 21, 2025, 5:06 PM

#

torn mantle its still their thing

Full context

torn mantle Jul 21, 2025, 5:07 PM

#

rare python Full context

i said something similar

#

viren is copying me

#

smh

rare python Jul 21, 2025, 5:07 PM

#

💀

#

Bro acted like DeepMind got cooked in IMO

fleet lintel Jul 21, 2025, 5:12 PM

#

based on tweets, it looks like DeepMind will release IMO gold version much faster than OAI gold version to public.

#

OAI mentioned in tweets that they are not planning to release their version for several months. I stand by my conclusion

#

Honestly, at this point only OAI and Gemini are making some real progress. everyone else is just catching up to them.

#

its not about releasing that model specifically but being able to advance their public models to achieve the same performance

cedar tide Jul 21, 2025, 5:20 PM

#

Incredible non thinking model
https://x.com/Alibaba_Qwen/status/1947344511988076547?t=ZOtrFNlSmWmDn28eW3z5qg&s=19

Qwen (@Alibaba_Qwen)

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!

After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing

torn mantle Jul 21, 2025, 5:23 PM

#

https://x.com/Alibaba_Qwen/status/1947344511988076547

Qwen (@Alibaba_Qwen)

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!

After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing

#

no one beat me to it again

#

yesh

rare python Jul 21, 2025, 5:23 PM

#

torn mantle no one beat me to it again

yes

#

3 minutes late

zinc ore Jul 21, 2025, 5:32 PM

#

If Deepmind got gold with Gemini 2.5 deepthink then they just mogged openAI

torn mantle Jul 21, 2025, 5:32 PM

#

is it benchmaxxed again

#

aaaand

#

yes it is

#

initial thoughts : still the same as old qwen

#

better knowledge? i dont think so

#

where is deepseek ....

#

we need them

lone vector Jul 21, 2025, 5:37 PM

#

https://x.com/ns123abc/status/1947347617131680232?s=46

NIK (@ns123abc)

You see?

OpenAI ignored the IMO request. Shame. No class. Straight up disrespect.

Google DeepMind acted with integrity, aligned with humanity.

TRVTHNUKE

pure anvil Jul 21, 2025, 5:38 PM

#

https://x.com/Yuchenj_UW/status/1947339774257402217/photo/1

Yuchen Jin (@Yuchenj_UW)

This wins my respect.

zinc ore Jul 21, 2025, 5:41 PM

#

Makes sense why openAI rushed their announcement tho, if Deepmind was going to release natural language results via an LLM

ocean vortex Jul 21, 2025, 5:47 PM

#

torn mantle what a difference

same score...?

#

35/42

#

Also this:

Finally, we pushed this version of Gemini further by giving it:
🔘 More thinking time
🔘 Access to a set of high-quality solutions to previous problems
🔘 General hints and tips on how to approach IMO problems

I wouldn't be very proud about doing that nr2. That's like a cheat sheet lol

#

It had some kind of RAG with the entire history of IMO and previous curated solutions I presume

pure anvil Jul 21, 2025, 5:50 PM

#

ocean vortex It had some kind of RAG with the entire history of IMO and previous curated solu...

sure xd

torn mantle Jul 21, 2025, 5:51 PM

#

ocean vortex same score...?

yea

ocean vortex Jul 21, 2025, 5:51 PM

#

I mean how else are you gonna interpret "Access to a set of high-quality solutions to previous problems"..?

torn mantle Jul 21, 2025, 5:52 PM

#

they both got the last one wrong

#

-7pts
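
For anyone puzzled by the "-7pts": a quick sketch of the arithmetic behind the 35/42 quoted above. It relies on the standard IMO format (6 problems, 7 points each) and assumes full marks on each of the 5 solved problems, which is what a 35/42 score implies:

```python
POINTS_PER_PROBLEM = 7   # standard IMO marking: each problem is out of 7
NUM_PROBLEMS = 6

max_score = POINTS_PER_PROBLEM * NUM_PROBLEMS   # total points available
solved = 5                                      # both labs missed the last problem
score = POINTS_PER_PROBLEM * solved             # assumes full marks on each solved one
points_lost = max_score - score

print(f"{score}/{max_score}, lost {points_lost}")
```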

ocean vortex Jul 21, 2025, 5:54 PM

#

They did do parallel for sure.

#

but they explicitly said no tools

#

so I assume no RAG either

#

We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

and

2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.

Technically they did not explicitly mention RAG, but this strongly implies they did not use it, otherwise saying no internet and no tools kinda loses its importance...

whole wagon Jul 21, 2025, 6:01 PM

#

oh my god

ocean vortex Jul 21, 2025, 6:01 PM

#

test-time compute scaling --> parallel compute, so they did use it for sure.

whole wagon Jul 21, 2025, 6:01 PM

#

these proofs are so damn good

#

ai solved imo

#

these are clearer than human solutions even

zinc ore Jul 21, 2025, 6:02 PM

#

If openAI released second, everyone would have seen their sloppy proofs and wouldn't have been as impressed after seeing the Deepmind ones

ocean vortex Jul 21, 2025, 6:03 PM

#

whole wagon these are clearer than human solutions even

Well no sh'it, if you give it access to the corpus of perfectly written solutions for past IMO problems it is going to mimic that style lol

ocean vortex Jul 21, 2025, 6:04 PM

#

zinc ore If openAI released second, everyone would have seen their sloppy proofs and would...

Hard to disagree on this. Looks like it did pay off huh

zinc ore Jul 21, 2025, 6:05 PM

#

ocean vortex Also this: > Finally, we pushed this version of Gemini further by giving it: >...

Where did this come from?

cedar tide Jul 21, 2025, 6:06 PM

#

average 25 benchmark

zinc ore Jul 21, 2025, 6:07 PM

#

“To make the most of the reasoning capabilities of Deep Think, we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.”

From this one they don't mention using previous IMO solutions

#

So I'm wondering where this additional claim is coming from

whole wagon Jul 21, 2025, 6:10 PM

#

tbh i think the whole openai thing is going to backfire. I assume they didnt expect to get called out in that way

#

not just by us but demis himself kek

cedar tide Jul 21, 2025, 6:10 PM

#

cedar tide average 25 benchmark

per category

zinc ore Jul 21, 2025, 6:11 PM

#

Even on r/singularity you'll run into a decent number of comments calling them "shady" or "scummy" for what they did

whole wagon Jul 21, 2025, 6:11 PM

#

and their result just isnt the actual sota

#

so it feels pretty pointless

#

like if they were the actual best maybe they could get away with it

ocean vortex Jul 21, 2025, 6:13 PM

#

zinc ore Where did this come from?

https://girlcockx.com/GoogleDeepMind/status/1947333846837444990

Google DeepMind (@GoogleDeepMind)

Finally, we pushed this version of Gemini further by giving it:
🔘 More thinking time
🔘 Access to a set of high-quality solutions to previous problems
🔘 General hints and tips on how to approach IMO problems


whole wagon Jul 21, 2025, 6:13 PM

#

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems" there is no contradiction

#

nowhere does it say previous IMO problems

ocean vortex Jul 21, 2025, 6:14 PM

#

they don't need to spell it out... it's obvious that they used the problems as close as possible to what they guess the problems are going to be

#

previous IMO problems is a reasonable interpretation I would say, but this doesn't change anything really.

zinc ore Jul 21, 2025, 6:16 PM

#

This is why standardization and transparency are so important

whole wagon Jul 21, 2025, 6:16 PM

#

zinc ore This is why standardization and transparency are so important

It is harder to game

#

well this all looks great for google

ocean vortex Jul 21, 2025, 6:17 PM

#

I'm really not. We can take this at face value and not interpret anything. "Access to a set of high-quality solutions to previous problems" - this still means fundamentally the same thing

whole wagon Jul 21, 2025, 6:17 PM

#

they win ethically and performance wise lol

fleet lintel Jul 21, 2025, 6:18 PM

#

how??

ornate agate Jul 21, 2025, 6:19 PM

#

fleet lintel how??

read them, they are just a lot cleaner

zinc ore Jul 21, 2025, 6:19 PM

#

Much crisper, cleaner, and easy to understand

whole wagon Jul 21, 2025, 6:19 PM

#

it is easy to tell from the first paragraph

fleet lintel Jul 21, 2025, 6:19 PM

#

Lol.. like I can actually evaluate them. Thank you for the confidence though 😄

ocean vortex Jul 21, 2025, 6:19 PM

#

whole wagon they win ethically and performance wise lol

Well attacking OpenAI without knowing all the details and deducting points from them even (lmao) was certainly not a good look at all

whole wagon Jul 21, 2025, 6:20 PM

#

"we werent in touch with IMO" isnt a defense imo

ornate agate Jul 21, 2025, 6:20 PM

#

IMO Problem 1 is not too difficult to do yourself if you grant yourself a bit of help from the AOPS forums. Then you can read the AI solutions to at least problem 1 and see the difference.

fleet lintel Jul 21, 2025, 6:20 PM

#

whole wagon "we werent in touch with IMO" isnt a defense imo

It is kind of an admission of guilt... like oh I didn't know that you can't steal from a bank

ocean vortex Jul 21, 2025, 6:21 PM

#

whole wagon "we werent in touch with IMO" isnt a defense imo

There's no need to defend anything. You aren't required to be in their inner circle and have assistance from their judges like Google did, you can simply be an outsider doing your own thing... They went beyond that and made sure it is ok to post results

fleet lintel Jul 21, 2025, 6:22 PM

#

Well, OAI is not known for their ethics. I am not surprised. Good part is that we will soon have these great models at our disposal.

ocean vortex Jul 21, 2025, 6:22 PM

#

And...? They had the closing ceremony and IMO green lit it. What is the problem? 🙂

whole wagon Jul 21, 2025, 6:23 PM

#

They asked one random organizer. They are a multibillion company

ocean vortex Jul 21, 2025, 6:23 PM

#

Oh gimme a break... 😅

#

you are reaching

zinc ore Jul 21, 2025, 6:23 PM

#

Yeah lol, they should have had a little more reliable communication channels going

ocean vortex Jul 21, 2025, 6:24 PM

#

They're not gonna actively look for ways that would disallow them from posting the results. That's on IMO and their communication

ornate agate Jul 21, 2025, 6:24 PM

#

some delayed publication until next Monday btw, so there are more results to come on this.

zinc ore Jul 21, 2025, 6:24 PM

#

I wouldn't be surprised if more than two companies got gold tbh

brittle tiger Jul 21, 2025, 6:24 PM

#

ocean vortex There's no need to defend anything. You aren't required to be in their inner cir...

Yeah that's why I forwent the help of the SAT graders and just told them my score.

whole wagon Jul 21, 2025, 6:25 PM

#

zinc ore Yeah lol, they should have had a little more reliable communication channels goi...

I doubt IMO would want to associate with this strange outsider attempt anyways tbf. They would just redirect them to compete officially

zinc ore Jul 21, 2025, 6:25 PM

#

Not sure if they even could at that point

ocean vortex Jul 21, 2025, 6:25 PM

#

brittle tiger Yeah that's why I forwent the help of the SAT graders and just told them my scor...

I agree the way they rated it is a bit sus. But at the end of the day their solutions are public for everyone to independently verify, so not really a problem...

whole wagon Jul 21, 2025, 6:26 PM

#

IMO are never going to grade openai solutions. Thats not how they operate they only care about official results

ocean vortex Jul 21, 2025, 6:27 PM

#

whole wagon IMO are never going to grade openai solutions. Thats not how they operate they o...

If they wanted they could. Also pretty sure Google are motivated to do it if no one else will take initiative 👀

fleet lintel Jul 21, 2025, 6:27 PM

#

whole wagon IMO are never going to grade openai solutions. Thats not how they operate they o...

They did grade Google's solutions

https://www.reddit.com/r/singularity/comments/1m5pwqr/what_does_it_mean_for_ai_and_the_advancement/

whole wagon Jul 21, 2025, 6:28 PM

#

fleet lintel They did grade Google's solutions https://www.reddit.com/r/singularity/comment...

since they competed officially

fleet lintel Jul 21, 2025, 6:28 PM

#

whole wagon since they competed officially

ah..ohk

ornate agate Jul 21, 2025, 6:28 PM

#

they graded Google's solutions because Google (and several other companies yet to announce) entered the IMO as AI teams officially this year

ocean vortex Jul 21, 2025, 6:29 PM

#

ocean vortex If they wanted they could. Also pretty sure Google are motivated to do it if no ...

Well... if they manage to arrive at a different score. Probably not gonna post anything if the score checks out lmao

whole wagon Jul 21, 2025, 6:30 PM

#

The openAI and deep mind researchers are flaming each other on X rn

#

What an absolute spectacle

#

🍿

ocean vortex Jul 21, 2025, 6:30 PM

#

whole wagon The openAI and deep mind researchers are flaming each other on X rn

Holly that's what I said. I said it first. 😠

#

🤯

#

#general message

fleet lintel Jul 21, 2025, 6:31 PM

#

whole wagon The openAI and deep mind researchers are flaming each other on X rn

Is there any reply from Deepmind researchers on it?

whole wagon Jul 21, 2025, 6:31 PM

#

I don't think IMO expected it to go down like this 😂

zinc ore Jul 21, 2025, 6:31 PM

#

whole wagon The openAI and deep mind researchers are flaming each other on X rn

This was going to happen eventually with openAI always trying to steal Deepmind's thunder, not surprising IMO would be big enough to cause actual public comments

lime coral Jul 21, 2025, 6:31 PM

#

This post is way too funny https://x.com/demishassabis/status/1947337620226240803?s=46

Demis Hassabis (@demishassabis)

We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!

fleet lintel Jul 21, 2025, 6:32 PM

#

"first official gold-level.." lol

lime coral Jul 21, 2025, 6:32 PM

#

It’s official you cannot deny

#

And it’s the first

#

You cannot Denny

ocean vortex Jul 21, 2025, 6:33 PM

#

lime coral This post is way too funny https://x.com/demishassabis/status/194733762022624080...

70% of people reading it gonna think they lost their mind and not even notice that "official" word lmaoo

zinc ore Jul 21, 2025, 6:33 PM

#

Deepmind supposedly got gold before openAI did anyway

tidal schooner Jul 21, 2025, 6:33 PM

#

whole wagon The openAI and deep mind researchers are flaming each other on X rn

now we’re waiting for an open-weights model to get imo gold

zinc ore Jul 21, 2025, 6:33 PM

#

So it's true either way

ocean vortex Jul 21, 2025, 6:34 PM

#

zinc ore Deepmind supposedly got gold before openAI did anyway

Doesn't matter. They got played. 😇

zinc ore Jul 21, 2025, 6:34 PM

#

Unless you're stuck waiting for "official grading" or whatever

whole wagon Jul 21, 2025, 6:34 PM

#

openAI played themselves

#

Lol

zinc ore Jul 21, 2025, 6:35 PM

#

They basically offended the IMO to announce first

#

But then didn't end up with the best looking results

fleet lintel Jul 21, 2025, 6:35 PM

#

whole wagon openAI played themselves

they did. Also, one tweet mentioned that they won't release the gold model for many more months. I think they will be forced to release it faster

ocean vortex Jul 21, 2025, 6:35 PM

#

whole wagon openAI played themselves

How? I don't think they care much about people who are always hating OpenAI regardless of anything tbh

lime coral Jul 21, 2025, 6:35 PM

#

ocean vortex https://girlcockx.com/GoogleDeepMind/status/1947333846837444990

At least we have some transparency

ember rapids Jul 21, 2025, 6:35 PM

#

whoever ships first

zinc ore Jul 21, 2025, 6:37 PM

#

OpenAI model definitely trained on IMO answers from past competitions

ocean vortex Jul 21, 2025, 6:37 PM

#

lime coral At least we have some transparency

We had it from both sides to be fair...

brittle tiger Jul 21, 2025, 6:37 PM

#

ocean vortex Doesn't matter. They got played. 😇

An org relying on unprecedented venture fundraising cares much more about being first and the online sentiment that it carries with normies. Google would prefer to stay in good graces of elite math types who follow this closely

fleet lintel Jul 21, 2025, 6:37 PM

#

this fight between OAI and Google is good. They will try to release these models faster and try to one-up each other again. I am finally excited after a long winter (~4 months 🙂 ) of non-significant progress

zinc ore Jul 21, 2025, 6:38 PM

#

Normies aren't following elite math competitions closely lol

brittle tiger Jul 21, 2025, 6:38 PM

#

Normies see headlines tho

ocean vortex Jul 21, 2025, 6:38 PM

#

brittle tiger An org relying on unprecedented venture fundraising cares much more about being ...

IMO it's more of they didn't think of doing it this way and did not even suspect OpenAI would enter this competition from the outside. If they knew they would have acted differently

whole wagon Jul 21, 2025, 6:40 PM

#

I think the embargo date was moved

#

It was supposed to be Friday

#

They moved it because of openAI

#

God that's horrible 😂 the imo organizers literally had to move the embargo date

pure anvil Jul 21, 2025, 6:41 PM

#

openai like always tasteless af

zinc ore Jul 21, 2025, 6:42 PM

#

Why would they push the embargo date to later?

whole wagon Jul 21, 2025, 6:42 PM

#

No I mean Friday this week

#

Was the actual embargo date

ocean vortex Jul 21, 2025, 6:42 PM

#

If we are really being truthful... If you were in their shoes, you would see their course of action as a better move. Especially considering that they did reach out and did not publish it before IMO said it is ok to do so

zinc ore Jul 21, 2025, 6:42 PM

#

Oh, up from Monday

ocean vortex Jul 21, 2025, 6:42 PM

#

Had they entered officially...

#

Google would have stolen the show lol

zinc ore Jul 21, 2025, 6:42 PM

#

whole wagon No I mean Friday this week

Billy was saying embargo extends to the 28th

#

Friday is the 25th

whole wagon Jul 21, 2025, 6:43 PM

#

I don't know exactly. It was supposed to be later than this

#

Is all I know

ornate agate Jul 21, 2025, 6:43 PM

#

ocean vortex If we are really being truthful... If you were in their shoes, you would see the...

no I wouldnt. I have some respect for competitions like IMO.

ocean vortex Jul 21, 2025, 6:44 PM

#

ornate agate no I wouldnt. I have some respect for competitions like IMO.

In what way is it "disrespectful" to do what they asked and wait for the closing ceremony to finish? Everything that came later about their supposed internal guidelines, which Google was very loud about, only came out after the fact

zinc ore Jul 21, 2025, 6:46 PM

#

There's a decent chance IMO makes some changes for next year to prevent this from happening again

ornate agate Jul 21, 2025, 6:46 PM

#

ocean vortex In what way is it "disrespectful" to do what they asked and wait for closing cere...

notice how you are the only one who thinks this is all perfectly fine.

ocean vortex Jul 21, 2025, 6:46 PM

#

zinc ore There's a decent chance IMO makes some changes for next year to prevent this fro...

yeah 100% lmao

zinc ore Jul 21, 2025, 6:47 PM

#

Because this caused a bunch of unnecessary drama surrounding their event

ocean vortex Jul 21, 2025, 6:48 PM

#

ornate agate notice how you are the only one who thinks this is all perfectly fine.

Cause it's not a popular opinion given the sensitivity of it. But it's also fueled for a good part by general hate of OpenAI in the public. If you just look at the facts, it is pretty much what I'm saying...

ornate agate Jul 21, 2025, 6:48 PM

#

ocean vortex Cause it's not a popular opinion given the sensitivity of it. But it's also fuel...

yeah those HS kids didnt deserve just a couple days to bask in their achievement.

ocean vortex Jul 21, 2025, 6:49 PM

#

ornate agate yeah those HS kids didnt deserve just a couple days to bask in their achievement...

Did anyone ask what they wanted??

#

lmao

#

I do not think they did

ornate agate Jul 21, 2025, 6:49 PM

#

you're making my point really well here

ocean vortex Jul 21, 2025, 6:50 PM

#

ornate agate you're making my point really well here

Not really. It's ignorant to assume there were no participants that were against embargo of any kind in the first place. The whole thing meant more people are talking about and are interested in IMO too...

zinc ore Jul 21, 2025, 6:51 PM

#

https://www.reddit.com/r/singularity/comments/1m5qfqu/openai_researcher_on_deepminds_imo_gold/n4dx2ck/

Been an interesting few days, the Reddit comments have been pretty good on this whole thing.

FarrisAT's comment on "OpenAI researcher on deepmind’s IMO gold"


whole wagon Jul 21, 2025, 6:51 PM

#

I think Google are the only ones that got gold officially. IMO organizers would have moved other companies' embargo dates a week forward if they also had gold

ocean vortex Jul 21, 2025, 6:51 PM

#

Taking the "poor kids" argument is kinda cheap tbh

#

They did have their closing ceremony without anyone publishing anything ---> not much of an argument in the first place

ornate agate Jul 21, 2025, 6:53 PM

#

The IMO and every other company agreed to wait a bit. They agreed to wait a bit to not diminish IMO participants' achievements.

ocean vortex Jul 21, 2025, 6:53 PM

#

Anyway, just my opinion, you don't have to agree with it

#

🙂

zinc ore Jul 21, 2025, 6:54 PM

#

ocean vortex Taking the "poor kids" argument is kinda cheap tbh

The event isn't for AI companies, it is to celebrate high school kids' achievements. Them asking companies to not announce yet isn't new to this year

whole wagon Jul 21, 2025, 6:54 PM

#

Yes it has always been a thing

zinc ore Jul 21, 2025, 6:54 PM

#

Your arguments come off pretty weak and contrived

#

Like you're basically heavy reaching to defend openAI no matter what

#

This is no longer rationalism, but emotional investment

whole wagon Jul 21, 2025, 6:56 PM

#

I mean from openai side they got a bit desperate for a sota result and thought they could get away with this. It is reasonable they wouldn't have expected to be called out

#

Desperation sometimes makes people do unethical things

#

I mean picture this. You had a huge lead and now Google are on your tail and moving in fast. You might start to crash out as a natural reaction and reach for any good result you can

#

Not justifying it but I can see why they do this

zinc ore Jul 21, 2025, 7:00 PM

#

be the most prestigious mathematical competition
Host competitions for HS students for 70 years
70 years into your history an AI company wants to attempt to solve math questions from current event
Say sure, but don't announce for two weeks
Next year another company does the same unofficially
doesn't wait two weeks
hypes it up on social media, all on same day as closing ceremony
People say "f* them kids"
Mfw

cedar tide Jul 21, 2025, 7:00 PM

#

cedar tide average 25 benchmark

Go upvote new qwen
https://discord.com/channels/1340554757349179412/1396916057024757902

whole wagon Jul 21, 2025, 7:00 PM

#

Yeah he had a total crash out I saw

brittle tiger Jul 21, 2025, 7:00 PM

#

OpenAI doesn't cooperate with or submit their answers to be graded by IMO, allowing them to possibly run multiple attempts or withhold any bad results which would hurt fundraising
releases results before closing party pissing off the entire IMO board
surprised Pikachu face when math people are mad online

whole wagon Jul 21, 2025, 7:00 PM

#

Probably openAI HR told him to calm down

#

😂

zinc ore Jul 21, 2025, 7:00 PM

#

Imagine being a competition for HS kids for 70 years and people are saying "f* them kids" lmao

#

IMO could have told all these companies to kick rocks, but they were initially respecting their wishes

#

Meanwhile openAI just shat on them

lime coral Jul 21, 2025, 7:03 PM

#

https://x.com/yitayml/status/1947350087941951596?s=46

Yi Tay (@YiTayML)

Our IMO gold model is not just an "experimental reasoning" model. It is way more general purpose than anyone would have expected. This general deep think model is going to be shipped so stay tuned! 🔥

zinc ore Jul 21, 2025, 7:03 PM

#

Lmao

#

"um actually ours is more general purpose"

whole wagon Jul 21, 2025, 7:04 PM

#

Well I mean it probably is if they are going to actually ship it

#

Wouldn't be a great model to release otherwise, maths problems are not that valuable compared to coding

#

There is a trick though. It's not just deep thinking

#

There's smth else

#

:p

fleet lintel Jul 21, 2025, 7:06 PM

#

lime coral https://x.com/yitayml/status/1947350087941951596?s=46

wow.. this

hardy lion Jul 21, 2025, 7:07 PM

#

ocean vortex Also this: > Finally, we pushed this version of Gemini further by giving it: >...

I would think all humans competing would have been studying previous questions and solutions as well.

whole wagon Jul 21, 2025, 7:07 PM

#

They added an august 31 bet for GPT5 release. It is only 60% ??????

#

Wtf

ocean vortex Jul 21, 2025, 7:08 PM

#

hardy lion I would think all humans competing would have been studying previous questions an...

Yeah but you don't get to take them all printed with you in person lol

ornate agate Jul 21, 2025, 7:08 PM

#

whole wagon Yeah he had a total crash out I saw

oh man I was rofling so hard. Found one in another browser tab. Can't find the other one tho 😦

ocean vortex Jul 21, 2025, 7:08 PM

#

"studying" = training data, but this is different. It's literally a cheat sheet 👀

whole wagon Jul 21, 2025, 7:09 PM

#

ornate agate oh man I was rofling so hard. Found one in another browser tab. Can't find the o...

What point is Aidan even making

#

They both "trained" models for it

zinc ore Jul 21, 2025, 7:10 PM

#

ornate agate oh man I was rofling so hard. Found one in another browser tab. Can't find the o...

Yeah, this like, idk why this is even a point of attack

ocean vortex Jul 21, 2025, 7:10 PM

#

training data is not gonna be recollected with 100% accuracy typically for tasks like these, unless you overfit and degrade performance

#

Honestly I'm amazed they allowed it to go as far as it did LOL

dawn wharf Jul 21, 2025, 7:12 PM

#

lime coral This post is way too funny https://x.com/demishassabis/status/194733762022624080...

all of them are "first official" then💀

whole wagon Jul 21, 2025, 7:13 PM

#

They are being kind, they don't need the official

#

They were the first. Just embargoed

zinc ore Jul 21, 2025, 7:14 PM

#

whole wagon Jul 21, 2025, 7:16 PM

#

#

😂

zinc ore Jul 21, 2025, 7:30 PM

#

https://vxtwitter.com/YiTayML/status/1947359357202804899

Yi Tay (@YiTayML)

↪️ Replying to @peterjliu

@peterjliu Thanks Peter!

It was surely annoying at first but once I came to terms this was because we were doing the right thing, everything felt okay!

I think we've won on all aspects! Legitimacy and sportsmanship!

#

Yeah, they're making little subtle pot shots at openAI

ocean vortex Jul 21, 2025, 7:32 PM

#

zinc ore https://vxtwitter.com/YiTayML/status/1947359357202804899

LOL he is still hurting catgrin

#

deep down he knows he's only saying that but it is not actually true...

#

🗿

torn mantle Jul 21, 2025, 7:37 PM

#

https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf

GitHub

Kimi-K2/tech_report.pdf at main · MoonshotAI/Kimi-K2

Kimi K2 is the large language model series developed by Moonshot AI team - MoonshotAI/Kimi-K2

torn mantle Jul 21, 2025, 7:37 PM

#

lime coral https://x.com/yitayml/status/1947350087941951596?s=46

thats what im talking about

whole wagon Jul 21, 2025, 7:41 PM

#

decent but still pretty light on the details of the actual pretraining dataset

lime coral Jul 21, 2025, 7:42 PM

#

https://x.com/jasondeanlee/status/1947365171577438367?s=46

Jason Lee (@jasondeanlee)

At least the GDM imo proofs are readable!

ocean vortex Jul 21, 2025, 7:42 PM

#

torn mantle https://github.com/MoonshotAI/Kimi-K2/blob/main/tech_report.pdf

that's what happens when you make a model that's in-between reasoning and a concise one 😇

#

better than 4.1, worse than o3

whole wagon Jul 21, 2025, 7:42 PM

#

better than 4.1 isnt a high bar

ocean vortex Jul 21, 2025, 7:43 PM

#

whole wagon better than 4.1 isnt a high bar

Depends on your perspective and bias

whole wagon Jul 21, 2025, 7:44 PM

#

2.5 flash > opus 4

#

nice benchmark

ocean vortex Jul 21, 2025, 7:44 PM

#

whole wagon 2.5 flash > opus 4

For conventional metrics Opus is not great

#

that's not a secret

#

it's a niche model, but SOTA for those select things it's good at

whole wagon Jul 21, 2025, 7:45 PM

#

even livebench can get this right

ocean vortex Jul 21, 2025, 7:45 PM

#

In turn 4.1 is similar enough size to all of those it is being compared against... So no such discrepancies. 😇

whole wagon Jul 21, 2025, 7:46 PM

#

any benchmark putting 2.5 flash above opus 4 is just a joke

lime coral Jul 21, 2025, 7:47 PM

#

whole wagon any benchmark putting 2.5 flash above opus 4 is just a joke

For multimodal retrieval it’s actually better. Flash does what it is used for

ocean vortex Jul 21, 2025, 7:47 PM

#

whole wagon any benchmark putting 2.5 flash above opus 4 is just a joke

Look at 2.5Flash vs 2.5Pro then. Or o4-mini-high vs o3. I think it's just you not knowing how to read them and what to look for...

whole wagon Jul 21, 2025, 7:48 PM

#

lime coral For multimodal retrieval it’s actually better. Flash does what it is used for

the benchmark is for "intelligence". that is what it is literally called not retrieval

#