#general

torn bison
#

babyhitler

whole wagon
#

It was always part of the endgame. To start re educating the kids

torn mantle
#

honestly it started with that guy called yang

whole wagon
#

What did he do lol

#

Ngl the xAI livestream had some weird vibes but maybe it's because most of the engineers are socially inept

#

Like the demo singing about the diet coke was absurdly cringe but none of them even realised that?

torn mantle
torn mantle
whole wagon
#

I remember one of the xAI engineers did this whole thing that one of them equals 10 researchers at other labs lmao

#

They don't see how it's just cringe

torn mantle
#

yea he just recently said that

#

yea that was cringe too

#

I think they are trying to justify their shortcomings

#

'look we made this model called grok 4, while it may not be the best, but HEY DONT FORGET THAT WE ARE A SMALL TEAM AND WE JUST STARTED A YEAR AGO'

#

and please zoom-in on that

whole wagon
#

Clown 778 has left so I can discuss the results of the paper

#

With this "heavy" GenSelect inference mode, **OpenReasoning-Nemotron-32B model surpasses O3 (High) on math and coding benchmarks.**
This does not seem like it would match the real usage results at all
#

They are claiming to surpass o3 high on maths and coding benchmarks. Which may be true but I have serious doubts their models are truly better than o3 high

#

Even in parallel inference mode

#

If their claims are valid it basically means the end of openAI lol

leaden sun
#

i think he's trying to imply the difference between how human and AI contestants "work" at IMO. i sense a bit of unfairness in the comparison, since humans don't use anything other than their brain while models can access a variety of tools (and internet?). but then again, what is fair when llms aren't that comparable to raw human intelligence?

whole wagon
#

44ppl in Meta's Superintelligence team.
— 50% from China
— 75% have PhDs, 70% Researchers
— 40% from OpenAI, 20% DeepMind, 15% Scale
— 20% L8+ level
— 75% 1st gen immigrants

torn mantle
#

mm i see i see

whole wagon
#

well some of them were 4 months ago

#

yea

#

a few been at meta for years

torn mantle
#

yea this will def take a while

unborn ocean
#

wonder if the people will ever "come back"

#

bc ccp would certainly love that

torn mantle
#

zuck is doing more harm than good

whole wagon
#

I dont know if 44 ppl is enough tbh

#

he tried to go for quality over everything

#

with the insane salaries and all

torn mantle
#

the goal will probably be to release a model EOY

#

behemoth is probably scrapped

whole wagon
#

yes

#

open source may be scrapped in general

#

the head guy suggested that

#

alexandr wang

unborn ocean
# whole wagon I dont know if 44 ppl is enough tbh

me too, i guess a lot of these people were actually management level before -> might hire more people to the team with them as lead or they might get the FAIR people as "subordinates" to some degree

#

bc otherwise these people are really expensive if they can't manage

#

and only work on their own

smoky finch
#

guys may i know where they announce that the newest models got added - removed from the arena?

#

there was a channel called "🕸️ webdev-arena" but i don't know in which server it is

pure anvil
#

or the dynamics of that

#

man upto llama 3.3 they made amazing models

ocean vortex
#

rumors 😇

chilly nexus
#

killing people (fictional character)

ocean vortex
civic flame
#

if we shouldn't believe jimmy, who generally has a good track record, why should we believe you

#

what track record do you have

ocean vortex
#

other feelings are inferior... 🤔

civic flame
#

☠️

whole wagon
#

i dont really care if you dont believe me lol, im not going to out sources for internet points

chilly nexus
frosty lark
#

it can work in the other way too (people that have contacts in other labs and get leaks)

gentle plinth
cedar tide
#

just we agree that o3 alpha is not o4 yes?

torn mantle
#

its called o3-alpha for a reason

#

its a more refined o3 version

#

with coding-focused abilities and factually correct answers ( less hallucinations )

soft kernel
torn mantle
#

we've come a long way

#

i remember models couldn't even do ascii visualisation properly

sour spindle
#

Is o3 alpha still on arena

ocean vortex
ocean vortex
#

it means nothing tbh

torn mantle
#

but hes asking if its o4

#

its def not o4

ocean vortex
#

o4 = gpt5

#

there's not gonna be o4 🙂

#

Also their current naming makes this... interesting. o4-mini is not a mini version of the upcoming improved o3 (gpt5), lmao

#

we do know that they are referring to their current gpt5 as "alpha" though

sacred quail
#

is O3 using Gpt 4 for base model, or is totally different one

sacred quail
#

I thought gpt 4.1 was small model that has 1 million token window

ocean vortex
sacred quail
#

Hmmmm. Can we use gpt 4.1 with app on plus plan ? Or only in API ?

ocean vortex
torn mantle
#

gpt5 is a router model

#

could be anything

#

o-* + gpt model

pure anvil
ocean vortex
torn mantle
ocean vortex
torn mantle
#

i think router model or hybrid model are the same thing

#

depends on how you see it

ocean vortex
#

Not the same, Claude is not a router lol

torn mantle
#

i was talking about the final result not the approach used

#

they will both in the end choose which path to take ( reasoning or not )

torn mantle
ocean vortex
#

I mean if it's an unified model, it could decide by itself when to output reasoning and when not to... Literal router does not sound to me like a good idea

#

reasoning effort: off/auto/low/med/high

#

or smth like that

torn mantle
#

yea i know but how do we decide that

#

its controlled by thinking_budget=0 / >0 & some internal controller

#

but how does the internal controller work

ocean vortex
torn mantle
#

the training phase

#

or methods used

ocean vortex
torn mantle
#

qwen 3 has that too no?

ocean vortex
#

Roughly speaking yeah... We already had some reasoning models that did not always output reasoning, that wasn't explored much yet though. It was all about max performance

torn mantle
#

mm i get it now, so its basically like a tiny classifier head used in the same transformer

#

looks at the query embedding and outputs a probability

ocean vortex
#

yeah

torn mantle
#

prob is compared/learned through a param using RLHF
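
A minimal sketch of the idea described above, assuming a plain logistic gate over a query embedding. The names `reasoning_probability`, `should_reason`, `weights`, `bias` and `threshold` are hypothetical and only illustrate the "tiny classifier head" picture; nothing here is confirmed as how any lab actually implements it. An explicit `thinking_budget` overrides the gate, matching the `thinking_budget=0 / >0` control mentioned earlier.

```python
import math


def reasoning_probability(query_embedding, weights, bias):
    """Hypothetical classifier head: map a query embedding to P(use reasoning)."""
    logit = sum(w * x for w, x in zip(weights, query_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def should_reason(query_embedding, weights, bias, threshold=0.5, thinking_budget=None):
    """Decide whether to emit reasoning tokens for this query.

    An explicit thinking_budget wins over the learned gate:
    0 forces reasoning off, any positive value forces it on.
    Otherwise the gate probability is compared with a threshold
    (assumed here to be a fixed or learned scalar).
    """
    if thinking_budget is not None:
        return thinking_budget > 0
    return reasoning_probability(query_embedding, weights, bias) >= threshold


if __name__ == "__main__":
    w, b = [1.0, -1.0], 0.0
    print(should_reason([2.0, 0.0], w, b))                     # gate decides
    print(should_reason([2.0, 0.0], w, b, thinking_budget=0))  # forced off
```

In this picture the gate parameters would be trained alongside the model (for example with RL), and the explicit budget is just an override for users who want to pick the effort themselves.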

ocean vortex
#

the challenging part is making this decide-by-itself model perform better than o3-high. But if they add an option for explicit reasoning effort, which it looks like they will, this becomes less crucial

cedar tide
#

New imagen 4 v2

woeful geyser
#

Found out that Grok 4 got 60.5 on SimpleBench

ocean vortex
#

@cedar tide too many pings I think

#

lmao

cedar tide
#

o3 alpha its the open source model

ocean vortex
torn mantle
#

no way

ocean vortex
#

Does he really work at OpenAI though?

#

This satoshi guy

#

lmao

torn mantle
#

bruh

ocean vortex
#

I would still say 60:40 it's gpt5. But open-source model is a plausible option too.

torn mantle
#

this guy has nothing to do with oai

cedar tide
torn mantle
#

why is he talking like a kid

cedar tide
#

@torn mantle you'll see

torn mantle
#

hes like the strawberry guy

#

aint no way o3 alpha is the os model

#

impossible

#

do people really understand things like that

#

shes all over the place

#

fast talking

#

weird movements

ocean vortex
#

That's unlikely for 5 to be bigger, and also even if it was, no one would ever say this

torn mantle
#

this companion thingy is fun but its not well executed, and im not talking about waifus / elon's degenerate ideas

ocean vortex
#

"strawberry dickhead"

storm needle
#

and he works no less than at mc donalds

torn mantle
#

hes working at x

#

farming engagements

civic flame
#

there are many of those links and one of them is that

lime coral
rare python
#

Yes, there is an official marking guideline from the IMO organizers which is not available externally. Without the evaluation based on that guideline, no medal claim can be made. With one point deducted, it is a Silver, not Gold.

Quoting Mikhail Samin (@Mihonarium)

🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.

OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results imme…

#

Way better

ember rapids
#

I have a feeling most people will be disappointed with gpt5

ocean vortex
#

it's just a fix embed domain very fitting to X 😇

torn mantle
#

shows how much IMO means to them

#

it was their thing

#

their lil baby

#

now oai stole that from them

ocean vortex
#

already deducting points 💀

torn mantle
#

its silver now xd

#

not really

#

we need to wait a week for the official results

whole wagon
#

i read the openai "solutions"

#

they are garbage lel

#

the model somehow does not even write coherent english?

#

like it feels like they didnt teach it english somehow

ocean vortex
keen beacon
#

You can see there's so many rl artifacts

#

It goes hard

#

Openai did a great job

whole wagon
#

cant tell if it is sarcasm ngl

#

some ppl could probably spin it in their head as an innovation lol

ocean vortex
whole wagon
#

problem1 is probably the best one

#

its all downhill from there lol

ocean vortex
#

The amazing thing about it is that it's actually a usable model as well, unlike AlphaGeometry...

whole wagon
#

did they really mark their own solutions

#

thats what i heard

#

why dont they let all the imo participants mark their own solutions

ocean vortex
#

@whole wagon I refuse to believe you aren't actively and deliberately trying to stay ignorant lmao

keen beacon
whole wagon
#

i have done the national olympiad

#

in the UK

#

got the certificates

ocean vortex
#

How much did you score?

#

5%?

#

🧐

leaden palm
#

what do you guys think about the theory that it's actually a formal language, just written in a way that looks like plain english?

keen beacon
#

It could be doing that partly in the cot but this reads to me like the final output of extremely rld model. So you can't tell much

ornate agate
#

I'm 100% sure this is final output and not cot and also not intermediate agent results etc.

keen beacon
#

It seems to me it reasons primarily in natural language in the CoT though it can 'codeswitch' into lean or whatever maybe

ocean vortex
#

Those are valid and correct solutions though at the end of the day. I suspect if you read AlphaGeometry solutions those would look way more weird and unnatural...

keen beacon
#

I think the solutions are good weird at least to me. It reeks of rl

ocean vortex
#

AlphaGeometry was bruteforcing it and combining existing ML algorithms with an LLM. This in contrast appears to be an actual singular model that you can use.

whole wagon
#

i could write pages on how this solution is flawed. The solution presents the bijection as a fact for example

#

just on problem 1

ocean vortex
whole wagon
#

the premise is transformed from n+1 to n but then later on they somehow start using n+1 again. There is no reasoning for why this occurs

#

"For n=3, we have possibilities k=0, k=1, or k=3" no justification?

#

just stated as a fact

#

the IMO solution explicitly enumerates all 6 points in p3

#

it shows possible k values are 0,1,3 but never proves those values are achievable for all n greater than or equal to 3

#

the invocation of the pigeonhole principle is vague, they dont define exactly what the pigeons and holes are

#

The solution states that if you remove a side-line, you get a valid covering for n-1. There is 0 explanation for this

ocean vortex
whole wagon
#

"proving"

#

there is no proof in the openai solution

#

i can tell they didnt "make an argument" about the boundary point covering. because they literally just state it

ocean vortex
#

though to be fair we do not know if all problems were graded the max 7/7. Solving it does not necessarily mean full grade, I think

whole wagon
#

i heard if they drop a single point it goes from gold to silver

ornate agate
#

IMO Gold threshold this year is 35. OpenAI are assuming all their published solutions are 7/7 perfect because 7*5=35. They graded it all themselves and did not participate with the IMO people at all. People who did (eg DeepMind, probably also ByteDance, DeepSeek, etc) are still under an agreed embargo for a few days.
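
The arithmetic above can be sketched as follows, assuming the figures quoted in this message (a gold cutoff of 35 and a maximum of 7 points per problem). The function names are hypothetical, and the example score lists are illustrations of the claim, not official marks.

```python
GOLD_CUTOFF = 35      # cutoff quoted in the message above
MAX_PER_PROBLEM = 7   # each IMO problem is scored out of 7


def total_score(scores):
    """Sum per-problem scores after checking each is within 0..7."""
    for s in scores:
        if not 0 <= s <= MAX_PER_PROBLEM:
            raise ValueError(f"score out of range: {s}")
    return sum(scores)


def meets_gold(scores, cutoff=GOLD_CUTOFF):
    """True if the total reaches the gold cutoff."""
    return total_score(scores) >= cutoff


if __name__ == "__main__":
    claimed = [7, 7, 7, 7, 7, 0]    # full marks on P1-P5, nothing on P6
    one_lost = [7, 7, 7, 7, 6, 0]   # a single point deducted anywhere
    print(total_score(claimed), meets_gold(claimed))
    print(total_score(one_lost), meets_gold(one_lost))
```

With no points on the sixth problem there is zero margin: the claim only holds if every one of the five solutions is marked a full 7.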

ocean vortex
whole wagon
#

its an achievement. but i dont believe it is gold

#

where is the grading site btw

#

i can go through point by point

#

from the headings for problem 1 they have 6 at least

ornate agate
#

DeepMind got a solid silver last year, I think the main achievement here is converting to natural language. Its interesting but its EXTREMELY likely that a 1yr improved prover (eg DeepMind's or DeepSeek prover) will score gold.

ocean vortex
#

independent verifications are welcome, obviously

ocean vortex
whole wagon
#

hm well with that grading criteria it is 7/7

ocean vortex
#

yeah fair point I suppose that it may not be 100% identical. But probably as close as it gets at this point

We followed a methodology similar to our evaluation of the 2025 USA Math Olympiad [1]. In particular, four experienced human judges, each with IMO-level mathematical expertise, were recruited to evaluate the responses. Evaluation began immediately after the 2025 IMO problems were released to prevent contamination. Judges reviewed the problems and developed grading schemes, with each problem scored out of 7 points. To ensure fairness, each response was anonymized and graded independently by two judges. Grading was conducted using the same interface developed for our Open Proof Corpus project [2].

lime coral
# torn mantle they are fuming

Don’t think they stole anything since they also got the gold, the right way. OAI not going through the vanilla test & jury looks more like them being worried about losing the battle. I’m sure if they had Silver with their eval they wouldn’t make noise

whole wagon
#

problem 1 - 3 looks fine with the point system

ornate agate
#
Nitter

🚨 According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.

According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinato...

storm needle
#

can you prove that you really have it

ornate agate
whole wagon
#

hmmmmmmmm

torn mantle
#

Let me read them a bit

whole wagon
#

this is correct me thinks

torn mantle
#

Can someone share links or nah

whole wagon
#

its all correct

#

its valid proofs its just a pain to read

leaden palm
sterile rapids
#

hello

torn mantle
#

can you solve IMO problems or nah?

sterile rapids
#

can anyone help me? i use LMArena to create images but can't create a vertical image. it can only create 1024x1024 images

#

I have fixed the prompt many times but it still doesn't work.

torn mantle
#

are you good at math?

sterile rapids
#

If possible can you please re-prompt me?

torn mantle
#

im not that good

#

sorry

sterile rapids
#

768 x 1376, but the AI only generates 1024 x 1024

#

I want to create a picture with dimensions like this. It is 768 x 1376. But AI only creates 1024 x 1024.

ornate agate
# whole wagon its valid proofs its just a pain to read

just looking at problem 1. OAI solution seems a bit of a different proof. "Stating and proving that the leftmost and bottommost points are covered by n or n−1 lines." the OAI model doesn't do this, it shows instead that one line must always be a triangle side (and so can then be removed).

tight nexus
#

I am looking for someone to work on a project I have some ok hardware (3x 3090 GPUs and Threadripper 3990x with 128 Gb Ram). If anyone is interested DM me.

ocean vortex
# ornate agate https://xcancel.com/Mihonarium/status/1946880931723194389#m

I think there are valid points in the comments. OpenAI brought more spotlight to it than this competition otherwise would have gotten. And all we have are some anecdotal reports from people allegedly having connections to the organizers and not even the organizers themselves going publicly on record to say there was anything wrong at all in what OpenAI did...

#

If we asked people actually taking part there (students), I would be very surprised if majority was in favor of forcing the labs to wait before they publish their results

#

Like all of it just seems a cheap way to attack someone, and seems to be driven for a good part by those who are just generally against OpenAI no matter what they do lol

#

They managed a breakthrough? Find ways to poke holes in it or how you don't need it. Then start needing it when everyone else starts doing it.
They beat the competition? Find ways how they cheated. And if they didn't cheat then they did something else wrong or released the results at the wrong time... catgrin

zinc ore
#

I don't even think that's what's going on, I think they likely do consider the announcement rude, since it comes from an individual speaking to one of the directors.

It doesn't help that all the other companies are effectively under embargo and also want to announce, then a company comes in and didn't go through official channels and takes the spot light. So it angers the other companies and also embarrasses the IMO.

ocean vortex
#

Also matharena published results before OpenAI did. I don't think they would have waited any longer even if that 2.5Pro run was medal worthy lol

ocean vortex
zinc ore
#

They're not going to make an official statement

ocean vortex
#

well then it's a "he said he heard person X whose sister rated one of the problems" type of thing lol

#

meaningless

zinc ore
#

All you have to do is find out who Joseph Myers is and see if they are indeed a reliable source connected to IMO officials

#

If yes, then likely what he's reporting is true

ocean vortex
#

I'm not saying he made it all up. But you also need to realise they have dozens of judges and probably even more people that could be counted as organisers and quoted.

whole wagon
#

It is a guideline they have at the end of the day

#

Whether or not they are truly angered by it is not that relevant

#

They should follow the guidelines

ocean vortex
#

If they have AI labs asking their judges to participate rating model outputs then I suppose there could be guidelines. But it doesn't look like OpenAI was in that circle

keen beacon
#

was openai aware of those guidelines when they released those results? i think that's what matters

small haven
keen beacon
#

if they did, it's bad faith

zinc ore
#

It's almost common sensical

whole wagon
#

They were in contact with imo ppl they ofc knew lol

ocean vortex
zinc ore
#

Even Altman would intuitively know this

ocean vortex
#

I don't see it

zinc ore
#

It's extremely unlikely openAI is oblivious to this

whole wagon
#

Every year the AI labs wait a period of time before releasing their results. And you are trying to suggest openAI were not aware

ocean vortex
keen beacon
#

ive got no dog in this fight nor am i familiar with imo btw, just stating that it is bad faith if they were aware of it

zinc ore
ocean vortex
#

Imagine training your model yourself for a breakthrough result and receiving this same kind of scrutiny after doing everything independently lol

zinc ore
#

Like them "not being officially involved" doesn't spare them the "rudeness" of the whole thing

ocean vortex
#

you were not bound by any internal "guidelines"

zinc ore
#

I'm aware

#

They're in effect taking IMOs prestige to hype themselves

#

Which is why it is rude

ocean vortex
zinc ore
#

Matharenas is irrelevant

#

No one cares about some 10-30% results

ocean vortex
#

and bringing more spotlight to the human participants

zinc ore
#

Especially from the other companies

#

The main thing is this embarrasses the IMO

#

That's why you get a comment saying it is rude

ocean vortex
# zinc ore Matharenas is irrelevant

but they had no clue what the results would have been. You can't judge it based on the results. If there ARE any guidelines (even that part is murky), they are to be applied universally not selectively depending on who does it.

zinc ore
#

It's just openAI disrespecting some obvious social etiquette

whole wagon
#

Yes but I thought it is valid still. There are a few places where they do things differently

ocean vortex
zinc ore
#

Hence, the rudeness

#

OpenAI is casually being rude to both the IMO and every other AI company involved

#

This is literally common sense and not even controversial

whole wagon
#

Yeah but it's openAI they never cared about things like that

ocean vortex
zinc ore
ocean vortex
#

matharena and OpenAI did not go that route 🤷‍♂️

zinc ore
#

There are both open source and closed source AI companies that participated this year

#

They haven't announced their results, out of respect to the IMO

#

Hence, openAI is rude

whole wagon
#

Well there is the entirety of AIMO that is also keeping quiet

#

It's a massive thing

#

Yeah. They just don't create a fanfare till the embargo date

ocean vortex
zinc ore
whole wagon
#

The result itself is fine to publish imo. The issue is all this hyping and the suggestion openAI was the first (the other companies are literally just waiting)

zinc ore
#

How am I assuming anything lol

#

I'm literally just saying what has been stated they are supposed to do

ocean vortex
zinc ore
#

We already have the report from the secretary general conversation for one, which you reject

ocean vortex
#

that's what internally was supposed to be an understanding for the labs participating together with them. OpenAI was not one of such labs...

#

therefore it literally needs to be public

#

or it kinda doesn't apply

zinc ore
#

You have to prove they didn't know

ocean vortex
zinc ore
#

If we're going to do this whole annoying burden argument thing here, then you have to prove your position

ocean vortex
#

They were not part of that circle needing special treatment in the first place

keen beacon
#

even then, shouldn't those guidelines really only apply if they were using the actual judges? it's still kinda bad faith even if they weren't bound by them anyway, but posting the results is fine i guess since they were unbound

whole wagon
#

Why didn't they make themselves part of the circle? They literally just came in as an outsider and said they got gold according to themselves

zinc ore
whole wagon
#

That is strange

zinc ore
#

That makes no sense

#

The point is, if that expectation exists, then it's rude even if they weren't officially involved

keen beacon
ocean vortex
#

Like If I was participating from outside I'm publishing my results without doing some hilarious investigation to find out what they are secretly thinking internally lmao

zinc ore
#

Because doing so ignorantly technically looks less bad for them

ocean vortex
keen beacon
ocean vortex
#

Like they do get the benefits for assistance of judges. It's not like there's only 1 way to do this

zinc ore
#

If the IMO indeed wants the first week to be about human achievement, and tells all the AI companies not to report yet, then some outside company comes in before the closing ceremony and says "we got gold everyone!!!" and hypes their product, then it seems natural that would be annoying both to the IMO and the companies involved.

#

Obviously they're not bound by that standard, but that's precisely why it is rude

#

Because you're using their prestige and name to hype their product, while contradicting their intent to celebrate human performance, while also stealing the spotlight from other companies respecting this waiting period.

#

So basically, because other companies are respecting the IMO, they're getting slapped in the face meanwhile.

ocean vortex
zinc ore
#

There's no "legality" here

ocean vortex
#

Well I used this term loosely. Valid submission whatever

zinc ore
#

Well, I think openAI being ignorant of this expectation looks much better for them

hot field
#

It’s really not okay to lash out at someone or point fingers before we even know the full story.

zinc ore
#

But, if they were aware, then yeah that comes off as bad faith

keen beacon
#

if they contacted openai even if they weren't participating officially, it's definitively bad faith.
if they didn't contact openai but openai were also aware of the rule (which is likely) which resulted in them not participating officially, it's sorta bad faith and manipulative (in order to release the results early). but plausibly deniable i guess

ocean vortex
#

Fundamentally I think... We should apply the same rules to them you would follow yourself if you were training a model and wanted to do it. So we still come back to that public info part. Allegedly contacting someone on the side but not everyone is not the way this is supposed to be done

zinc ore
#

Also I find it strange they didn't just officially involve themselves like everyone else did

#

Which may add to the whole rudeness factor, because every other company had the courtesy to go through the IMO

ocean vortex
leaden palm
#

has everyone here already seen this

zinc ore
#

I actually have a slightly different theory over what happened, despite what I just said.

torn mantle
#

lol

whole wagon
#

ofc they knew man

#

its so obvious

torn mantle
#

its obvious to me

#

not to you

zinc ore
#

Well I doubt my own theory now

whole wagon
#

You should appreciate that openai blessed this earth with their presence Kappa

zinc ore
#

Basically, I think they didn't intend to participate in this IMO, until they heard Deepmind got gold and saw the questions. Realized they could get gold too and rushed and solved the problems then announced

zinc ore
#

Which is why they didn't go through official channels, because they originally didn't plan to participate

whole wagon
#

They are scared

#

it is obvious

zinc ore
#

And the timeline technically lines up

#

Yeah I agree

#

I think had there been tougher questions, openAI doesn't participate

#

Because, it was a good opportunity for them

tidal schooner
#

🥀

zinc ore
#

deepmind gets gold and it gets leaked on twitter
36 hrs later openAI announces they got gold
Says they solved each question within 4.5 hrs each (so they could have solved everything within a day)

keen beacon
#

that would be a crazy series of events

zinc ore
#

They probably know Deepmind can't announce yet, decide to announce

keen beacon
#

if it happened like that

#

but i doubt it

whole wagon
#

Well i dont think it changes anything fundamentally. they are still being chased down the fundamentals are identical before and after

zinc ore
#

Yeah I doubt it too, that's why I didn't want to share lol

torn mantle
#

If both of them got similar scores then im more interested who got the cleanest readable solution

#

So they both failed p6?

#

But still 1/10

whole wagon
#

maybe the model is too smart. i noticed with o3 sometimes it skips steps cos it just assumes ill get it

#

like it cant conceptualise what a human would struggle to understand

torn mantle
#

Hmph

#

I have to tell it to elaborate

#

And don't assume things

#

Makes me so mad

#

🤬

ocean vortex
zinc ore
#

https://vxtwitter.com/ns123abc/status/1947016206768046452

So they're claiming how they mark correct answers is only internally known

🚨🚨🚨BREAKING: DeepMind leader calls OpenAI’s IMO gold claim bullshit:

“IMO has an internal marking guide no one outside sees. Without that, you can’t claim a medal. With the point you lose on P6, you’re Silver, not Gold.”

Reminder:

OpenAI didn’t even collaborate with IMO.
vague-posts results with no full transparency
DeepMind collaborated with IMO to verify results
IMO explicitly requested AI labs to wait a week after the closing ceremony
to avoid stealing spotlight from brilliant students (humans)
OpenAI chose to announce their results BEFORE the ceremony even ended
Research community: “OpenAI has been disrespectful”

LMFAOOO

#

How were you guys judging if they were correct then? Based on prior IMOs?

#

Depends if the guidelines they know are the same as the ones used in 2025

ocean vortex
#

🍿

whole wagon
#

wait are they claiming it is literally impossible for openai to get gold?

ocean vortex
#

"IMO has an internal marking guide no one outside sees."

whole wagon
#

"With the point you lose on P6, you’re Silver, not Gold" this sounds unambiguous

#

like its just straight up impossible

#

eh this is all some weird bs im going to only treat the official results as real

#

i wouldnt accept a student doing this path

#

why would i accept an ai company

ocean vortex
#

yeah if true then that's not gold... It was gold based on publicly available info 🤷‍♂️

#

Would have been insane plot twist too if they changed that reactively responding to OpenAI... lmao

whole wagon
#

how do u know the boundary

#

is 35

ocean vortex
#

@whole wagon explain this to me:

#

no last task = 35 = gold

#

the way everyone would read it

whole wagon
#

i dont need to explain anything it was a question

ocean vortex
#

If they have official data clearly stating what a gold medal result is... That tweet gets an entirely different context tbh

#

At the moment it looks like this:

  1. Company X gets gold.
  2. Undisclosed guidelines change going against public info
  3. One of "internal partners" gets gold instead.

This whole thing can go dirty both ways lmao

zinc ore
#

OpenAI claims full marks on P1-P5, for 35 pts

ocean vortex
#

that's exactly the same, can't you read?

#

35 points threshold for gold

zinc ore
#

I think the Deepmind employee was just saying that the internal guidelines aren't known to openAI, so how could they grade themselves? And if they lose even just a single point under those guidelines they get silver not gold.

whole wagon
#

oh yea

#

they dont even know the official guidelines lol

zinc ore
#

That's why I was asking how you guys had scored it as 35, without knowing the guidelines

ocean vortex
#

that's not AI summary lmao. It's the old fashioned search type of thing

#

info straight from that website

#

which isn't loading atm for me

#

Anyway, it matches with what you said yourself, so what is the issue?

#

ok it did load somehow... yeah exactly the same info:

#

according to this 35 without last problem is gold medal

#

So yeah... they can say internally what they want, if my model scores 35 I'm claiming it scores equivalent of gold medal with a clear conscience 🤷‍♂️

#

Yeah that part is a factor still. Which is why I said they are risking it claiming the result valid

whole wagon
#

if only i could grade my own exams also

#

i would also get perfect score

ocean vortex
# whole wagon i would also get perfect score

LOL. Well we can only expect or assume they didn't half-ass it. They did know it's gonna need to be independently verified and published outputs for everyone to see, but who knows...

#

The problem right now is that if IMO organization is really united on this, they will want to screw them over as well lol

#

Also regarding the last problem, did it not produce final solution or did it not output anything useful at all? A detail that could be meaningful and something they may be withholding...

zinc ore
#

If their answers are valid to those guidelines then I don't think that drama happens.

whole wagon
#

😂

gentle plinth
# whole wagon 😂

maybe sam promised that they would get the same pay at oai if they turn down the offer

whole wagon
#

This gap is crazy

#

Though the speed with which Google has gained ground is absurd also

torn mantle
#

grok will get a bit more cuz of the new companion thingy

ocean vortex
# zinc ore

Yeah this changes quite a lot... People are quick to blame OpenAI and attack them for pretty much anything lol

hollow ocean
ocean vortex
#

OpenAI are shameless greedy villains, they don't care about those harmless little kids BOOO 🤓

quartz light
#

mb im late to this message but this is NOT impressive its just using plugins or whatever its called from threejs for the water and hills

#

did they even check the code

#

maybe its good at problems but its not good at coding

quartz light
small haven
lime coral
#

Like having ChatGPT in my iPhone

zinc ore
#

Well, that confirms the wait a week claim

#

Also confirms they are supposed to wait for the sanctity of student contestants

#

You can definitely tell these companies are annoyed at openAI

ocean vortex
pulsar cobalt
#

Hi 👋 Can anyone tell me which AI models available on LMArena have the most recent knowledge cut offs? And through using which ones can I get the most up-to-date and accurate information? Would be really helpful if you help 🙏

ocean vortex
#

They are annoyed but probably an equal amount about the fact that they didn't think of doing it this way themselves lol

#

Though this is still mostly only relevant to the labs actually competing for the top spot though...

jade egret
torn mantle
#

well its free

#

and better

jade egret
#

do yall think it gonna catch up sooner or later

torn mantle
#

gemini vision is way ahead too

#

grok doesnt have that

ocean vortex
#

for others I would imagine the advantages of doing this in full collaboration far outweigh the opportunity of releasing results slightly sooner

torn mantle
#

and nsfw stuff

jade egret
#

😭

#

naw.....

zinc ore
ocean vortex
#

Your phone comes with it pre-installed so you're automatically a user if you buy it. Any kind of automated request tied to a feature could count as an "active user"

torn mantle
#

no but if you think about it, xai cant boost their user base with just grok4/3 or a 300$ plan, they need to find another way

#

and then some genius at xai (cough cough elon) proposed companions

#

they know how a lot of virgins will eat that up

#

its sad but they dont have a choice since they are incompetent at making good models

ocean vortex
#

the best phone

#

comes with Dork preinstalled and SuperDork free for the first 3 months ™

jade egret
#

whos gonna win the ai race

torn mantle
#

definitely google

ocean vortex
#

It's either Google or OpenAI

torn mantle
#

well first we need to define the finish line/target

jade egret
#

i think google

#

eventually

ocean vortex
#

xAI too detached from reality, Anthropic thinking too small, Meta is like MS operationally so don't really see them leading the way ever tbh

jade egret
#

or just takes most of the users

#

like 85%

#

idk

ocean vortex
#

Chinese labs are impressive, but they are yet to prove they can come up with their own paradigms or ideas and convert it all into finished product

torn mantle
#

zuck poached many employees and lead figures from oai, but the way i see it, is that they will likely face the same outcome as xai

#

it's extremely difficult to start from scratch, even if you have a significant background or extensive experience

#

+they still need more personnel

ocean vortex
#

open-source from them is impressive, but then again there are no closed Chinese models better than that regardless of how much you pay

torn mantle
#

tbh, we should be asking 1000 questions to the people who left

#

i honestly still don't understand why they would leave... can money really be a strong enough motivator to do that?

ocean vortex
torn mantle
ocean vortex
#

If they come up with something, they gonna make sure the whole world knows it and it's represented in the best possible way 👀

jade egret
torn mantle
#

also for anthropic, I get the impression they're working more for the government, or to please it, rather than for their actual users

#

look at how many years have passed, and they still have rate limits and GPU issues

torn mantle
#

however, their core philosophy remains the same

ocean vortex
torn mantle
#

they probably have more respect for their employees than oai does... for example, lets say their researchers at anthropic are allocated 30% of the GPU resources for their own tests, simulations, etc... I'm certain that this percentage in oai case has decreased over time, which shows they may prioritize users.. but in turn, has an impact on their researchers, and it could very well be one of the reasons that led those researchers to search somewhere else

torn mantle
#

/ what they prioritize

torn mantle
ocean vortex
torn mantle
#

but do you have the talent to do so?

#

why cant other labs do the same?

#

i mean wtf happened to mistral?

ocean vortex
#

There's no GPU resources allocation for engineers at all tbh. Neither at Anthropic nor at OpenAI. They have certain projects and the resources allocated for it, you don't have random people running around hogging compute if that wasn't planned months in advance and wanted by the entire company

#

Ofc they do have some kind of resources they can use for isolated work related smaller things, but that's gonna be minuscule compared to the resources we are talking about with training the production models etc

#

and mostly not a problem at all to limit ever

hard hazel
#

Beep boop, am i disturbing?
Had a question

#

Back when i was using lmarena, i remember the models generated the code and deployed it too right then and there, does that not happen anymore?

#

Or is that feature relocated somewhere else

torn mantle
ocean vortex
#

so probably things gonna get better

hard hazel
ocean vortex
#

OpenAI is not sitting still

hard hazel
#

Meanwhile Gemini giving out year long free trials, thousands of free requests per day on CLI, all them Veo 3 nonsense, and having blazing fast speeds still.....

TPUs are just so much better, aren't they 🤔

ocean vortex
#

has to be web based code though

hard hazel
#

Ah, without that subdomain it is just simple chat and image gen huh, i see

torn mantle
hard hazel
# ocean vortex https://web.lmarena.ai

Seems to be broken at the moment?
When i type something in it and send it, it becomes unresponsive, nothing happens, it doesn't freeze it just does nothing.
Same when i click on suggested prompts, same on both desktop mode and mobile mode

torn mantle
#

................

keen beacon
#

LMAO

civic flame
jade egret
# jade egret

why do yall think it google? (i agree, i just wanna know ur reasons)

stray aspen
#

what the hell

patent aspen
#

Google was already spending over $30B a year on R&D (mostly on AI) before ChatGPT existed

jade egret
#

oo

whole wagon
wintry tinsel
#

How do people delude themselves into finding this amusing

misty vault
#

i have been doing this with sydney for ages

leaden palm
zinc ore
lilac nimbus
#

Claude back soon

mellow salmon
#

hey guys,can anyone tell me what's the rate limits of flux kontext max in lmarena

small haven
#

is neptune v3 red phase done?

echo aurora
torn bison
#

The current ask for no is 97.8c. It took the market 5 days to reach a relatively fair price. No need to wait until resolution to realize 11% profit😂
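For anyone checking the numbers, a minimal sketch of the math, assuming the 11% is a return on the entry price and that a share pays out 100c if it resolves in your favor. The entry price is not stated in the message, so the value computed here is only implied, and both function names are hypothetical:

```python
def implied_entry_price(exit_price_cents, return_fraction):
    """Entry price that would yield the given return when selling at exit_price_cents."""
    return exit_price_cents / (1.0 + return_fraction)


def realized_return(entry_price_cents, exit_price_cents):
    """Fractional profit from buying at entry and selling (or resolving) at exit."""
    return (exit_price_cents - entry_price_cents) / entry_price_cents


# Selling "no" at the quoted 97.8c for an 11% gain implies buying around 88c.
entry = implied_entry_price(97.8, 0.11)

# Holding to resolution instead would pay 100c per share if "no" wins.
hold_return = realized_return(entry, 100.0)
sell_now_return = realized_return(entry, 97.8)
```

The gap between `hold_return` and `sell_now_return` is what you give up by exiting early, in exchange for not carrying the remaining resolution risk.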

tidal schooner
#

this sounds more like an ad for machine learning than anything 😭
https://www.instagramez.com/reel/DF2JqSeNFfB/


In 2003, IBM attempted to appeal to the hearts of ordinary computer users through a commercial that aired on U.S. television during the Super Bowl broadcast. The ad featured a young boy who personified Linux. He's 9 years old*, curious and quickly absorbs information.

We can argue at length about IBM's role in the modern high-tech world, but credit should be given to them, as the commercial turned out to be a masterpiece. Enjoy watching!

* Linux 1.0 was released in March 1994, so in the 2003 commercial, the boy was 9 years old.

#CusDebMagazine #TechJournal #TechMagazine
#IBM #linux #LinuxKernel #SuperBowl #SuperBowl2025 #SuperBowlAd #SuperBowlAds #SuperBowlCommercial #SuperBowlCommercials

#

yeah ik it’s meant to be focused on the expertise of contributors across the world but still

ocean vortex
#

If anything the foul play was by xAI with those initial benchmarks that don't look very realistic. The model is just not very good and outputs are way too short and basic to do well on lmarena lol

#

It had a chance of capitalizing on their earlier success there, but when you look at the outputs of it it's easy to see why it did so poorly... People will not vote for what they don't see and what is hidden, there was no such issue with grok3.

gusty helm
#

given suitgate and several scandals with whales lately im expecting the worst strictly.

#

I mean from polymarket

#

Lmarena has no stakes in this and its just caught in crossfire

#

Silver lining this is a small market/I did not see any of the usual suspects. Too little money for them. Id fully expect an army of people/bots trying to influence it otherwise

pure lynx
#

Does anyone know when the Text Arena leaderboard is updated? Are the updates weekly?

whole wagon
ocean vortex
ocean vortex
ocean vortex
primal orbit
#

is it me or it has been a while since new secret model from Google. Stonebloom has been around for quite some time.

ocean vortex
#

give them time

#

👀

#

But yeah... They are way behind schedule and Ultra name controversy continues lol

gusty helm
pulsar tendon
#

Is the openai anon bot gone, been a while since i got it

gusty helm
#

Not saying it happened, there’s no sign of it. But it could

ocean vortex
gusty helm
#

Given sample size in the voting (20k votes) its trivial to bypass IP restrictions. Only nord vpn would give you 100 + ips to use

primal orbit
ocean vortex
#

In some ways, if you have the money, you simply fine-tune your model to do well exclusively only on lmarena if it's so important for you and that's gonna be more effective and perhaps even easier. As for betting and polymarket, people with such resources do not bet on these odds...

gusty helm
#

The guys own the whole crypto chain they use to resolve markets. Talking few hundred million usd there

ocean vortex
gusty helm
#

Eg: when something is disputed they use a distributed voting system that’s external (uma token). Few wallets hold majority of uma

gusty helm
primal orbit
#

guys, you talk about polymarket so often How much do you actually bet? if you're comfortable to tell.

gusty helm
#

Not that much in my case just few grands disposable income

#

But you get range from 10 $ to 100 million people basically

primal orbit
gusty helm
#

Thanks

ocean vortex
#

looks like it's somewhat more frequent than once a week actually

#

just noticed smth interesting too... their csv includes 2.5Pro-05-06 too even though the official leaderboard in new interface doesn't anymore 👀

#

So technically... even without style control Grok4 is still 3rd not 2nd lmao

calm sequoia
#

Funny how mistral is now building manpads 😄

ocean vortex
calm sequoia
#

Such an awesome company!

torn mantle
#

it kinda reminds me of gemini 2.5 flash reasoning

ocean vortex
#

I'm fairly sure in certain cases Grok answers could be counted as a pass or at least partially correct instead of completely off the mark, if only you could see the reasoning it did. But you can't

ocean vortex
#

Especially relevant for these math competitions where there are strict, clear rules, and including an explanation versus not having it can be the difference between 1 or 0 points for that part

cedar tide
#

This model has been in the arena for 2 months, still without results 🥴 (he still appears in the arena)

#

gemini 2.5 flash lite and minimax M1 are also in the webdev arena for a long time and still no result

candid storm
#

What do you guys think about these timelines?

keen beacon
# candid storm

I expect GPT-5 in 1-2 weeks.
Coding model is likely next year
The rest seem ok

pure anvil
# candid storm

midjourney 8 and claude 4.5 are the most exciting ones on this list

ocean vortex
#

🗿

torn mantle
#

lol

#

A

candid storm
torn mantle
#

idk its just that grok 4 gives me 'weaker gemini 2.5 pro' vibes, which makes it feel like a weaker iteration.. again im saying it reminds me of it, not claiming that its a perfect match for flash ver

torn mantle
#

leo shared their metadata before

rare python
torn mantle
#

but there was n+1 iteration for wolfstride/kingfall/stonebloom models

rare python
solar hollow
#

you guys tested that new openai model? sth sth alpha

civic flame
# rare python

the interesting part of the internal model names wasn't to do with that lol

#

it was that the suffix for gem 2.5 pro was "m" (as in "medium") and for kingfall/stonebloom/wolfstride it was "l" (as in "large")

cedar tide
rare python
cedar tide
rare python
#

We still haven't seen seededit 3.0 in direct chat, pineapple

torn mantle
#

gemini-v3p1l-rev20-kingfall-sc__202505301__model__variant

ocean vortex
civic flame
#

no obvious activity on that stuff since wolfstride dropped on the arena

#

that was late june and now both of the models are gone

#

💔

#

also still find it weird that oai dropped that interesting model on webdev arena for a single day

cedar tide
#

Add gemini 2.5 no think instead of putting crappy models from amazon that nobody wants (kraken folsom, v1 v2, nova experimental)

ocean vortex
cedar tide
ocean vortex
#

arc-agi-2 is as good as useless for spatial awareness now, doesn't mean anything lol

torn mantle
#

they are still experimental models

#

and they cant just say no to requests from amazon

civic flame
#

were kingfall/stonebloom/wolfstride early versions of gemini 3 or?

#

when do you expect that to drop then 😭

#

would you bet before or after gpt-5

#

the next 2-3 weeks should be fun then

nocturne agate
#

how do ppl access that o3 alpha model?

civic flame
#

it was on webdev arena

#

not anymore

#

doubt

#

probably close

nocturne agate
#

but i found interesting model, nightforge - anyone know what is it? probably a gemini model?

rare python
#

minimax m1 iirc

ocean vortex
#

neither max nor mini. Neither good performing nor lightweight

candid storm
#

Is it confirmed we will get gemini 2.5 ultra?

civic flame
#

for all intents and purposes

nocturne agate
#

or ultra dumb

nocturne agate
#

ultra?

#

when was it lol

#

ah ok, i was talking about this

#

they usually do the release well in advance

#

like month

#

so i guess we arent getting anything atm

#

yes

#

im curious what the cost of O3 Alpha will be

#

this model looks really strong

leaden sun
#

if then it's due to geopolitical power play...

whole wagon
#

The battle one

#

I never get it there

willow grail
#

i am a vegan who drinks dairy kefir
no coconut kefir has worse properties than dairy kefir
i will do everything to become healthy

would u say i am doing everything i can do be as vegan as possible?
i suffer from meteorism and psoriasis and constipation, in short a bad microbiome/dysbiosis for unknown reasons. probably very low motility.
so i am trying dairy kefir now cause this helps some people
i also tried bile acid from bovines
and digestion enzymes

pulsar cobalt
#

Hi

#

Can anyone tell me which AI models available on LMArena have the most recent knowledge cut offs? And through using which ones can I get the most up-to-date and accurate information?

willow grail
#

where is kingfall in benchmarks ?

fleet lintel
civic flame
#

iirc i posted it here or dmed it to them

snow lily
#

Does anyone know why mines keep showing "something went wrong with this response try again"?

#

This is such a nice site but the worst thing is you just can't copy paste the whole conversation if you wanna restart because of the way the site is, you can't select more than 1 reply at a time, makes it so annoying

hardy lion
# cedar tide

how do you use the best voted models if you don't at least put all the models out for voting? 🤔

hardy lion
#

ahh

stray aspen
#

naaah

whole wagon
#

Ai explained did a vid shitting on openAI imo announcement

#

Lel

#

The goat of AI influencers

whole wagon
#

Sora 2 is coming soon

#

Maybe openAI can redeem themselves after getting mauled by Google

ornate agate
balmy mist
main gulch
#

A version of this model with Deep Think will soon be available to trusted testers, before rolling out to
Google AI Ultra subscribers.

they said the same thing 2 months ago

hollow ocean
#

@deep adder Deepthink release January 2026

stray aspen
#

@deep adder are you the real craig federighi

torn mantle
#

no one beat me to it right?

torn mantle
#

i would rather take a ready to deploy model than an unreadable prototype that will take months to be released (oai model)

ornate agate
stray aspen
#

billy do you work at google deepmind

torn mantle
#

lol

#

what a difference

#

i would be ashamed to share oai results

#

dont end them like that

wintry tinsel
#

Lots of talk no release

rare python
#

Lmao who said DeepMind got cooked at IMO?

stray aspen
#

is open ai geting destroyed?

torn mantle
#

agree

#

but is it with informal mathematics or nah

rare python
torn mantle
#

then we can safely say gdm are way ahead

torn mantle
#

but i was serious tho

#

gdm takes IMO so seriously

rare python
torn mantle
#

viren is copying me

#

smh

rare python
#

💀

#

Bro acted like DeepMind got cooked in IMO

fleet lintel
#

based on tweets, it looks like DeepMind will release IMO gold version much faster than OAI gold version to public.

#

OAI mentioned in tweets that they are not planning to release their version for several months. I stand by my conclusion

#

Honestly, at this point only OAI and Gemini are making some real progress. everyone else is just catching up to them.

#

its not about releasing that model specifically but being able to advance their public models so they achieve the same performance

cedar tide
torn mantle
#

Bye Qwen3-235B-A22B, hello Qwen3-235B-A22B-2507!

After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible. Today, we’re releasing

#

no one beat me to it again

#

yesh

rare python
#

3 minutes late

zinc ore
#

If Deepmind got gold with Gemini 2.5 deepthink then they just mogged openAI

torn mantle
#

is it benchmaxxed again

#

aaaand

#

yes it is

#

initial thoughts : still the same as old qwen

#

better knowledge? i dont think so

#

where is deepseek ....

#

we need them

lone vector
pure anvil
zinc ore
#

Makes sense why openAI rushed their announcement tho, if Deepmind was going to release natural language results via an LLM

ocean vortex
#

35/42

#

Also this:

Finally, we pushed this version of Gemini further by giving it:
🔘 More thinking time
🔘 Access to a set of high-quality solutions to previous problems
🔘 General hints and tips on how to approach IMO problems

I wouldn't be very proud about doing that nr2. That's like a cheat sheet lol

#

It had some kind of RAG with the entire history of IMO and previous curated solutions I presume

torn mantle
ocean vortex
#

I mean how else are you gonna interpret "Access to a set of high-quality solutions to previous problems"..?

torn mantle
#

they both got the last one wrong

#

-7pts

ocean vortex
#

They did do parallel for sure.

#

but they explicitly said no tools

#

so I assume no RAG either

#

We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.

and

2/N We evaluated our models on the 2025 IMO problems under the same rules as human contestants: two 4.5 hour exam sessions, no tools or internet, reading the official problem statements, and writing natural language proofs.

Technically they did not explicitly mention RAG, but this strongly implies they did not use it, otherwise saying no internet and no tools kinda loses its importance...

whole wagon
#

oh my god

ocean vortex
#

test-time compute scaling --> parallel compute, so they did use it for sure.
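To make "parallel compute" concrete, here is a toy sketch of one common form of test-time scaling: sample several independent attempts in parallel and keep the most frequent answer. This is only an illustration of the general idea, not a description of what OpenAI or Google actually ran; `toy_solver` is a made-up stand-in for a model call and every name here is hypothetical:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def sample_parallel(solve, problem, n):
    """Run n independent attempts at the same problem in parallel."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda seed: solve(problem, seed), range(n)))


def majority_vote(answers):
    """Pick the answer that appears most often among the attempts."""
    return Counter(answers).most_common(1)[0][0]


def toy_solver(problem, seed):
    """Hypothetical stand-in for a model: wrong on every third attempt."""
    correct = problem * 2
    return correct + 1 if seed % 3 == 0 else correct


# Nine attempts: three are wrong, six agree on the right answer.
attempts = sample_parallel(toy_solver, 21, 9)
final_answer = majority_vote(attempts)
```

Spending more compute here just means raising `n`. Majority voting only works when answers can be compared directly, so for full written proofs a lab would need some other selection step, for example a model that judges candidate solutions.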

whole wagon
#

these proofs are so damn good

#

ai solved imo

#

these are clearer than human solutions even

zinc ore
#

If openAI released second, everyone would have seen their sloppy proofs and wouldn't have been as impressed after seeing the Deepmind ones

ocean vortex
ocean vortex
cedar tide
#

average 25 benchmark

zinc ore
#

“To make the most of the reasoning capabilities of Deep Think, we additionally trained this version of Gemini on novel reinforcement learning techniques that can leverage more multi-step reasoning, problem-solving and theorem-proving data. We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions.”

From this one they don't mention using previous IMO solutions

#

So I'm wondering where this additional claim is coming from

whole wagon
#

tbh i think the whole openai thing is going to backfire. I assume they didnt expect to get called out in that way

#

not just by us but demis himself kek

cedar tide
zinc ore
#

Even on r/singularity you'll run into a decent number of comments calling them "shady" or "scummy" for what they did

whole wagon
#

and their result just isnt the actual sota

#

so it feels pretty pointless

#

like if they were the actual best maybe they could get away with it

ocean vortex
whole wagon
#

"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems" there is no contradiction

#

nowhere does it say previous IMO problems

ocean vortex
#

they don't need to spell it out... it's obvious that they used the problems as close as possible to what they guess the problems are going to be

#

previous IMO problems is a reasonable interpretation I would say, but this doesn't change anything really.

zinc ore
#

This is why standardization and transparency is so important

whole wagon
#

well this all looks great for google

ocean vortex
#

I'm really not. We can take this at face value and not interpret anything: "Access to a set of high-quality solutions to previous problems" - this still means fundamentally the same thing

whole wagon
#

they win ethically and performance wise lol

fleet lintel
#

how??

ornate agate
zinc ore
#

Much crisper, cleaner, and easy to understand

whole wagon
#

it is easy to tell from the first paragraph

fleet lintel
#

Lol.. like I can actually evaluate them. Thank you for the confidence though 😄

ocean vortex
whole wagon
#

"we werent in touch with IMO" isnt a defense imo

ornate agate
#

IMO Problem 1 is not too difficult to do yourself if you grant yourself a bit of help from the AOPS forums. Then you can read the AI solutions to at least problem 1 and see the difference.

fleet lintel
ocean vortex
# whole wagon "we werent in touch with IMO" isnt a defense imo

There's no need to defend anything. You aren't required to be in their inner circle and have assistance from their judges like Google did, you can simply be an outsider doing your own thing... They went beyond that and made sure it is ok to post results

fleet lintel
#

Well, OAI is not known for their ethics. I am not surprised. Good part is that we will soon have these great models at our disposal.

ocean vortex
#

And...? They had the closing ceremony and IMO green lit it. What is the problem? 🙂

whole wagon
#

They asked one random organizer. They are a multibillion company

ocean vortex
#

Oh gimme a break... 😅

#

you are reaching

zinc ore
#

Yeah lol, they should have had a little more reliable communication channels going

ocean vortex
#

They not gonna actively look for ways that would disallow them to post the results. That's on IMO and their communication

ornate agate
#

some delayed publication until next Monday btw, so there are more results to come on this.

zinc ore
#

I wouldn't be surprised if more than two companies got gold tbh

brittle tiger
whole wagon
zinc ore
#

Not sure if they even could at that point

ocean vortex
whole wagon
#

IMO are never going to grade openai solutions. Thats not how they operate they only care about official results

ocean vortex
whole wagon
fleet lintel
ornate agate
#

they graded Google's solutions because Google (and several other companies yet to announce) entered the IMO as AI teams officially this year

ocean vortex
whole wagon
#

The openAI and deep mind researchers are flaming each other on X rn

#

What an absolute spectacle

#

🍿

ocean vortex
#

🤯

fleet lintel
whole wagon
#

I don't think IMO expected it to go down like this 😂

zinc ore
lime coral
fleet lintel
#

"first official gold-level.." lol

lime coral
#

It’s official you cannot deny

#

And it’s the first

#

You cannot Denny

ocean vortex
zinc ore
#

Deepmind supposedly got gold before openAI did anyway

tidal schooner
zinc ore
#

So it's true either way

ocean vortex
zinc ore
#

Unless you're stuck waiting for "official grading" or whatever

whole wagon
#

openAI played themselves

#

Lol

zinc ore
#

They basically offended the IMO to announce first

#

But then didn't end up with the best looking results

fleet lintel
ocean vortex
lime coral
ember rapids
#

whoever ships first

zinc ore
#

OpenAI model definitely trained on IMO answers from past competitions

ocean vortex
brittle tiger
# ocean vortex Doesn't matter. They got played. 😇

An org relying on unprecedented venture fundraising cares much more about being first and the online sentiment that it carries with normies. Google would prefer to stay in good graces of elite math types who follow this closely

fleet lintel
#

this fight between OAI and Google is good. They will try to release these models faster and try to one-up each other again. I am finally excited after long winter (~4 months 🙂 ) of non-significant progress

zinc ore
#

Normies aren't following elite math competitions closely lol

brittle tiger
#

Normies see headlines tho

ocean vortex
whole wagon
#

I think the embargo date was moved

#

It was supposed to be Friday

#

They moved it because of openAI

#

God that's horrible 😂 the imo organizers literally had to move the embargo date

pure anvil
#

openai like always tasteless af

zinc ore
#

Why would they push the embargo date to later?

whole wagon
#

No I mean Friday this week

#

Was the actual embargo date

ocean vortex
#

If we are really being truthful... If you were in their shoes, you would see their course of action as a better move. Especially considering that they did reach out and did not publish it before IMO said it is ok to do so

zinc ore
#

Oh, up from Monday

ocean vortex
#

Have they entered officially...

#

Google would have stolen the show lol

zinc ore
#

Friday is the 25th

whole wagon
#

I don't know exactly. It was supposed to be later than this

#

Is all I know

ornate agate
ocean vortex
zinc ore
#

There's a decent chance IMO makes some changes for next year to prevent this from happening again

ornate agate
zinc ore
#

Because this caused a bunch of unnecessary drama surrounding their event

ocean vortex
ornate agate
ocean vortex
#

lmao

#

I do not think they did

ornate agate
#

you're making my point really well here

ocean vortex
zinc ore
whole wagon
#

I think Google are the only ones that got gold officially. IMO organizers would have moved other companies' embargo dates a week forward if they also had gold

ocean vortex
#

Taking the "poor kids" argument is kinda cheap tbh

#

They did have their closing ceremony without anyone publishing anything ---> not much of an argument in the first place

ornate agate
#

The IMO and every other company agreed to wait a bit. They agreed to wait a bit to not diminish IMO participants' achievements.

ocean vortex
#

Anyway, just my opinion, you don't have to agree with it

#

🙂

zinc ore
whole wagon
#

Yes it has always been a thing

zinc ore
#

Your arguments come off pretty weak and contrived

#

Like you're basically heavy reaching to defend openAI no matter what

#

This is no longer rationalism, but emotional investment

whole wagon
#

I mean from openai side they got a bit desperate for a sota result and thought they could get away with this. It is reasonable they wouldn't have expected to be called out

#

Desperation sometimes makes people do unethical things

#

I mean picture this. You had a huge lead and now Google are on your tail and moving in fast. You might start to crash out as a natural reaction and reach for any good result you can

#

Not justifying it but I can see why they do this

zinc ore
#

be the most prestigious mathematical competition
Host competitions for HS students for 70 years
70 years into your history an AI company wants to attempt to solve math questions from current event
Say sure, but don't announce for two weeks
Next year another company does the same unofficially
doesn't wait two weeks
hypes it up on social media, all on same day as closing ceremony
People say "f* them kids"
Mfw

whole wagon
#

Yeah he had a total crash out I saw

brittle tiger
#
  1. OpenAI doesn't cooperate with or submit their answers to be graded by IMO, allowing them to possibly run multiple attempts or withhold any bad results which would hurt fundraising
  2. releases results before closing party pissing off the entire IMO board
  3. surprised Pikachu face when math people are mad online
whole wagon
#

Probably openAI HR told him to calm down

#

😂

zinc ore
#

Imagine being a competition for HS kids for 70 years and people are saying "f* them kids" lmao

#

IMO could have told all these companies to kick rocks, but they were initially respecting their wishes

#

Meanwhile openAI just shat on them

lime coral
zinc ore
#

Lmao

#

"um actually ours is more general purpose"

whole wagon
#

Well I mean it probably is if they are going to actually ship it

#

Wouldn't be a great model to release otherwise, maths problems are not that valuable compared to coding

#

There is a trick though. It's not just deep thinking

#

There's smth else

#

:p

fleet lintel
hardy lion
whole wagon
#

They added an august 31 bet for GPT5 release. It is only 60% ??????

#

Wtf

ocean vortex
ornate agate
ocean vortex
#

"studying" = training data, but this is different. It's literally a cheat sheet 👀

whole wagon
#

They both "trained" models for it

zinc ore
ocean vortex
#

training data is not gonna be recollected with 100% accuracy typically for tasks like these, unless you overfit and degrade performance

#

Honestly I'm amazed they allowed it to go as far as it did LOL

dawn wharf
whole wagon
#

They are being kind, they don't need the official

#

They were the first. Just embargoed

zinc ore
whole wagon
#

😂

zinc ore
#

Yeah, they're making little subtle pot shots at openAI

ocean vortex
#

deep down he knows he's only saying that but it is not actually true...

#

🗿

torn mantle
torn mantle
whole wagon
#

decent but still pretty light on the details of the actual pretraining dataset

lime coral
ocean vortex
#

better than 4.1, worse than o3

whole wagon
#

better than 4.1 isnt a high bar

ocean vortex
whole wagon
#

2.5 flash > opus 4

#

nice benchmark

ocean vortex
#

that's not a secret

#

it's a niche model, but SOTA for those select things it's good at

whole wagon
#

even livebench can get this right

ocean vortex
#

In turn 4.1 is similar enough size to all of those it is being compared against... So no such discrepancies. 😇

whole wagon
#

any benchmark putting 2.5 flash above opus 4 is just a joke

lime coral
ocean vortex
whole wagon