#general

1 messages ยท Page 23 of 1

alpine coral
#

o3 or 04-mini - do you know yet?

torn mantle
#

This guy ia kinda funny

tall summit
#

what server

torn mantle
#

Hes always impressed with any new model released

keen beacon
#

lmao they were so ready for this

barren prairie
#

So fast replay

dapper storm
#

Do u guys think o3 will be top of Lmsys leaderboard

barren prairie
#

Nw is coming

keen beacon
#

no

#

๐Ÿ’”

#

yeah okay it still gets the hardest stuff wrong

#

no model can get that one right

#

agi cancelled

barren prairie
keen beacon
#

lmao ๐Ÿ˜ญ

misty vault
#

whats the correct answer

keen beacon
sage raptor
#

i think 2.5 is still better at coding

keen beacon
#

is that o1 pro?

#

oh

#

first model to beat it ๐Ÿ™

tall summit
#

what

keen beacon
#

yes

tall summit
#

o4 mini high > o3

#

sure..

keen beacon
#

there will be some slim scenarios where o4 mini high is better

#

lol

keen ferry
#

isnt o3 is a some what copy of manus ai? I heard it got tools

keen beacon
#

seems private models were o3

keen beacon
tall summit
mellow frigate
#

But c is supposed to be greater than b

tawdry meteor
#

is that your discord bot or is it public?

mellow frigate
#

that counter example doesn't work

tall summit
keen beacon
#

lmao

#

so close yet so far

barren prairie
#

If someone will see the new models on arena chatbot please tell us ๐Ÿ™‚ I want to catch them and test them

tawdry meteor
tall summit
#

o4-mini and o3 are in alpha ui

keen beacon
#

wait what

tall summit
#

LMAO

barren prairie
keen beacon
#

direct chat too?

#

wow

#

although

tall summit
keen beacon
#

they don't say if they're high or med or low

#

omg free o3 and o4 mini ๐Ÿคฃ

tall summit
keen beacon
#

@wooden mulch do you guys have plans to add -high variants of o3 and o4-mini to the arena? the differences in performance have historically been pretty significant

#

o3 and o4 mini gone from alpha direct chat lol

#

noooooooo

tawdry meteor
#

rip

keen beacon
#

hopefully just to add -high variants ๐Ÿ™

#

one can hope

#

o4 mini might be in direct chat in the future

#

i doubt o3 will be

#

yup

tall summit
#

why would you remove to add -high variants lmao

keen beacon
#

who knows

#

dont question it just believe

keen beacon
#

i don't think they're supposed to be on the alpha anyway lol

tall summit
#

i believe

balmy mist
#

anyone tried codex?

misty vault
#

I dont have them on alpha

tall summit
balmy mist
#

wonder if its cheaper than claude

oblique flint
#

Damn o4 mini pricing is good

keen beacon
#

fsm pathfinding

#

generated by what llm

#

roblox built in

#

bit off topic

#

but cool thumbsup_cai

#

oh i used gemini

#

2.5 pro

tall summit
#

finite state machine pathfinding

keen beacon
keen beacon
#

nice

#

heres a different example on drone

#

One of the most basic forms of Artificial Intelligence is a Finite State Machine, or FSM. In this video, I demonstrate the need for FSMs through making a simple delivery drone. This is the first video in a two part series. Hope you enjoy!

Copy the game to see the code: https://www.roblox.com/games/13085787752/Delivering-Drone-Finite-State-Mach...

โ–ถ Play video
#

mine is way better and optimized tho

ember rapids
tall summit
#

thanks for the info.

tall summit
keen beacon
#

is there anything better then gemini

#

or is gemini as best it gets rn

ember rapids
#

just got o3 access

#

theyre rolling it out quickly

keen beacon
#

all my homies hate staggered rollouts just give it to everyone at once and sit back and relax

calm sequoia
#

Does anyone know if the o3 is based on GPT 4 or 4.5 base model

patent bane
#

anyone got access to o3 via api?

keen beacon
#

o3 isn't based in 4o

#

it's based on 4.1

#

they retrained it

#

knowledge cutoff is june '24

#

yup

#

well

#

different base

calm sequoia
#

Do you know if planned GPT 5 will be based on 4.5?

keen beacon
keen beacon
calm sequoia
#

Sadly, would have good vibes

keen beacon
#

if u thought gpt 4.5 pricing was high, gpt 4.5 with reasoning ๐Ÿ’€

#

yeah if you're willing to remortgage ur home

calm sequoia
#

Unless the 4.1 outperforms 4.5

keen beacon
#

4.5 is several times larger

#

that was fast

#

at least, they claim 4.1 performs just as good as 4.5

calm sequoia
#

Probably they made the tests and its not worth it

keen beacon
#

lmao

#

people do it

#

depends on how heavily you use it

#

well yeah

#

i think they rate limit u if they see sus activity while they investigate something like that

#

what the hell

#

windsurf

#

it's freee

#

lmao

keen beacon
thorny bane
#

why were there no comparisons to gemini 2.5

#

in the o3 stream hmmm

zinc ore
#

2.5 is comparable with o3, but way cheaper

keen beacon
#

same base model different tuning

#

chatgpt 4o latest is more expensive though but it will fare better in chat scenarios i think

#

they really dont want the cot to leak lol

#

๐Ÿšจ SYSTEM PROMPT LEAK ๐Ÿšจ

New sys prompt from ChatGPT! My personal favorite addition has to be the new "Yap score" param ๐Ÿคฃ

PROMPT:
"""
You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-04-16

Over the course of

#

"The Yap score measures verbosity; aim for responses โ‰ค Yap words. Overly verbose responses when Yap is low (or overly terse when Yap is high) may be penalized. Today's Yap score is 8192."

#

Lmao

raven void
zinc ore
#

Also, does 2.5 use tools for its benchmarks? But anyway, you get different scores depending on which you're comparing with

keen fulcrum
#

Can you post privcing and benchmark

raven void
#

it's pretty good with tool use apparently I haven't tested it enough yet

keen beacon
#

good start o4 mini ๐Ÿ™„

brittle tiger
keen beacon
#

confirmed it was retrained

keen beacon
sage raptor
keen beacon
#

shucks

#

injected prompt

#

๐Ÿ˜”

#

most labs do

#

i know anthropic do

#

maybe its new at oai

keen beacon
keen beacon
balmy mist
misty vault
#

do u pay for the api or u have some trickery

keen beacon
#

huh that seems to be a recent addition

#

it's in the arena ๐Ÿ™

barren prairie
tall summit
keen beacon
#

wow this thing is really fast

keen beacon
raven void
#

I'm interested in their long context benchmarks

barren prairie
#

๐Ÿฅน๐Ÿฅน๐Ÿฅน๐Ÿฉท๐Ÿฉท๐Ÿฉท๐Ÿฉท๐Ÿฉต๐Ÿฉต

#

Let s gooooo

balmy mist
#

is it in webdev?

#

anyone got new ui for arena?

keen beacon
#

just checked

#

not yet

keen beacon
keen beacon
#

o3 and o4 mini in direct chat!!

patent bane
#

wait so now o3 supports tools use internally or we have to wait?

keen beacon
#

tool use isn't out yet

patent bane
#

i see

#

thanks

balmy mist
#

whats its output?

#

damn so openai really about to buy windsurf lmaoo

#

they came up

#

thats a massive w, i would sell lol

barren prairie
#

O3 failed my test ๐Ÿ˜๐Ÿ˜ deepSeek r1 get it , gemini 2.5 was ok
O3 trush

tall summit
misty vault
keen beacon
balmy mist
keen beacon
#

wow

tall summit
#

article says basically nothing more

#

and there's a reuter article saying only that "bloomberg said" openai is said to be in talks to buy windsurf for about 3 billion and nothing more

#

the state of news

#

i don't remember 4.1 being on alpha ui but anyway now it is

keen fulcrum
keen beacon
#

interesting

zinc ore
#

That's with tool use

#

Nvm disregard

#

I see they split it up

keen fulcrum
#

O3 kinda expensive

balmy mist
silk haven
balmy mist
balmy mist
#

they gonna release their cli

#

they kinda have to

keen beacon
#

win in my books

keen fulcrum
keen beacon
#

anything to do with world knowledge

leaden meteor
#

Why is OpenAI buying othe software companies for billions??! Can't they just use one of their models to build a similar platform. It's not like they don't have large userbase..

keen fulcrum
zinc ore
keen fulcrum
#

Buying a known brand is worth the cost

sonic tendon
#

^ cool imo

keen beacon
#

still waiting on qwen 3

keen beacon
keen fulcrum
#

I do hope r2 is cooking up smth

keen beacon
#

they understand the potential

sonic tendon
tall summit
sonic tendon
#

recent huggingface PR by them mentioned an 8B dense model and IIRC 15B 128-expert MoE

leaden meteor
sonic tendon
#

so, probably not topping the charts any time soon, unless they release max at the same time

keen beacon
tall summit
sonic tendon
keen beacon
#

but yeah im not expecting frontier level performance from the models released in the initial qwen 3 batch

dapper storm
#

According to the people given preview access o3 is super good and according to those who did not it's not
Really makes u think.

sonic tendon
keen fulcrum
keen beacon
calm sequoia
#

Holly ๐Ÿ‘€๐Ÿ‘€๐Ÿ‘€ It screens scientific articles without deep research

sonic tendon
balmy mist
keen beacon
#

lots of source quoting

#

mind you a fair bit of it was hallucinations

#

: (

sonic tendon
#

yeah :(

keen beacon
calm sequoia
keen beacon
#

lol

sonic tendon
#

remember the chat optimized llama? that did a lot of source quoting at the end (although it was definitely hallucination-ridden)

keen beacon
#

it kinda feels like it hallucinates more than o1

calm sequoia
#

Fun fact: three days ago we discussed the possibility of tool usage in thought process ๐Ÿ˜€

sonic tendon
#

maybe i just have oAI hype brainworms

keen beacon
#

lol it made this 0-shot but i have asked it 3 times to make sure alaska isn't completely wack and it just made it worse

#

still got a way to go until agi folks

sonic tendon
#

i wonder why proof-writing is so hard to optimize for. maybe it's hard to do RL on?

keen beacon
#

several months ago LMAO

sonic tendon
calm sequoia
sonic tendon
keen beacon
#

also idk if anyone else has noticed but

#

o4 mini seems to use a LOT of thinking tokens

keen beacon
sonic tendon
#

i have yet to hop on the API

keen beacon
#

woah what the hell

keen beacon
#

0-shot with o4 mini

#

didn't even need to ASK about alaska

calm sequoia
#

My prompt is still in progress for 5 minutes of thinking. Server overload? Stuck? In though deep research ๐Ÿ‘€๐Ÿ‘€๐Ÿ‘€?

sonic tendon
calm sequoia
sonic tendon
keen beacon
#

am checking now

#

but i think o3 was more accurate with the actual data

#

just worse at the code part

#

yeah ok

#

it said the last dem to win statewide in idaho was in 1974

#

correct year: 2002

#

๐Ÿ’”

sonic tendon
#

meow

keen beacon
#

actually now that i look at o3's attempt too

#

gah they both hallucinated some stuff

calm sequoia
#

Hows 2.5 pro at this?

keen beacon
#

it has the same alaska problem as o3

#

poor alaska

#

as for data iirc it does okay but still has a few hallucination issues

balmy mist
#

at coding it seems o4 mini is better

#

based on a few tests I ran

#

still testin

keen beacon
#

agreed

#

world knowledge -> o3

#

code -> o4 mini

leaden meteor
#

So, are we going to get o3 or o4mini to test in arena? How come we don't see them yet...

keen beacon
#

they're both in the arena

#

select them in direct chat

sonic tendon
leaden meteor
#

Oh, since when?

balmy mist
#

i am excited for o3 pro tho

keen beacon
#

its wild o3 is in direct chat though

sonic tendon
#

not in alpha, rip

keen beacon
sonic tendon
keen beacon
#

its on the main site

#

they didn't do that with any of the old o series models

#

those openai sponsor credits coming in clutch

#

it must be cheaper i think

#

than o1

sonic tendon
#

maybe oAI sponso

#

yeah

keen beacon
#

^ sponsors

fleet lintel
#

I dont know but I was expecting more from o3/o4 models. I think they are similar or just marginally better than Gemini

sonic tendon
#

gonna run my personal benchmarks

keen beacon
#

o4 mini is looking good

#

that hallucination issue was solved by setting temp to 0

#

o4 mini does seem quite sensitive to temperature

balmy mist
#

whats the best temp you noticed?

#

using 0 is wild lol

keen beacon
#

why

sonic tendon
keen beacon
#

still

keen beacon
sonic tendon
#

for math stuff i usually set it to 0 regardless

#

and/or lower top_p

keen beacon
#

u have to set it higher anyway to stop it from getting into loops

#

if you touch one generally don't touch the other

balmy mist
keen beacon
#

iirc chatgpt uses 0.7

#

its probably at 1 lol

barren prairie
keen beacon
#

yup

#

at least on the api iirc it defaults to 1

sonic tendon
#

both get the pebble test (not mine)

keen beacon
#

i forgot side by side existed

#

good idea

balmy mist
#

what does top_p do

calm sequoia
keen beacon
#

did they update deep research too?

calm sequoia
#

100 sources without deep research ๐Ÿ™‚

keen beacon
#

oh wow

sonic tendon
#

@keen beacon uh oh

spark shale
#

What probability would you say there is that o3 or o4 mini will beat gemini 2.5 on the leaderboard?

sonic tendon
#

never been sure what the ideal top_p value is

#

i've usually set it to like 0.4 for coding/math

keen beacon
#

i think the high variants would get it tbh

sonic tendon
keen beacon
#

i like using 0.7 and 0.95 top_p particularly for local reasoning models

#

weren't those the recommended settings for R1?

#

not sure, it was for qwq tho

#

seems neither can get "Every day at 12pm, Relaxed Voyages spaceship departs from Liverpool for Dublin. Simultaneously, another Relaxed Voyages spaceship starts journey from Dublin to Liverpool. The journey takes 503 full hours in both directions.

How many Relaxed Voyages spaceships, traveling to Liverpool, will the spaceship departing now at 1pm from Liverpool encounter?" yet

#

1.0 and 0.9 top_p for even smaller reasoning models, this allows them to not get stuck with varying degress of success in the result

#

answer is 43

sonic tendon
#

why is firefox not letting me click to copy

#

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
Avoid adding a system prompt; all instructions should be contained within the user prompt.
For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "<think>\n" at the beginning of every output.

#

@keen beacon flawless jar test tho

keen beacon
#

yeah

keen beacon
sonic tendon
#

i half-wonder if you've shared this with other AI people (not that i would mind, that would be cool as hell)

keen beacon
#

it's one of the very few questions left in a set that i have that hasn't been cracked by any model

sonic tendon
#

it's sort of saturated though

keen beacon
#

the jar q?

sonic tendon
#

yeah

tall summit
#

The Jar Test

keen beacon
#

i haven't shared it outside our dms

sonic tendon
#

ah, got it

#

thx

#

boink

keen beacon
#

meow

sonic tendon
#

mrrp

#

with DS parameter suggestions

tall summit
keen beacon
#

o4 mini is a strong model

sonic tendon
#

it's possible it's just sort of random, given the "aha" nature of the riddle

keen beacon
#

yeah

sonic tendon
#

but yeah, o4-mini is very very good

#

any idea what the model structure looks like?

keen beacon
#

its based on 4.1 mini

balmy mist
#

i just did simple bench test on mini and o3 and o4 mini just beat o3 lol

#

gonna rerun it

keen beacon
#

which is i suspect a cpt of 4o mini. it has characteristic qualities of such

sonic tendon
#

bloodbath in the polymarket comments section

balmy mist
sonic tendon
#

a tad

#

annoyingly, i bought 199 a while back and then sold early lol

#

sort of a victim of my indecisiveness

balmy mist
#

yo o4 mini is cooking

#

wow

#

its so good

keen fulcrum
# sonic tendon

Depending on whether google drops
Sad there is not Qwen 3 on the table at all

sonic tendon
#

ope, portfolio just went from 42 to 56 in a few seconds lmao

elder rapids
#

o3 has some personality, o4 mini seems kinda wack on regular tasks

sonic tendon
keen fulcrum
sonic tendon
sonic tendon
keen fulcrum
sonic tendon
#

might sell now and then go all in once the leaderboard updates

keen beacon
#

they have not

#

although it is very sporadic now

sonic tendon
#

maybe they just lowered the priority

#

or something

balmy mist
#

i dont really like o3 that much, it might be for someone else

sonic tendon
#

not sure exactly how the matching algo works

balmy mist
#

imma o4 mini type of guy lol

sonic tendon
sonic tendon
#

hmm

keen beacon
sonic tendon
#

that's actually really good

sonic tendon
#

or someone else's twitter screenshot

keen beacon
#

twitter ss

sonic tendon
#

ah

elder rapids
#

wtf? o3 gets a ton of puzzles wrong

sonic tendon
#

(actually claude, surprisingly)

keen beacon
#

its pretty normal

#

for claude to be overloaded

sonic tendon
#

yeah

#

L

elder rapids
#

o4 gets stuff right o3 can't

#

lol

#

crazy

sonic tendon
#

they seem to be having a lot of scaling difficulties

sonic tendon
#

even some o1 passes with flying colors

elder rapids
#

o3 has personality tho

#

o4 mini has a very synthetic verbosity

sage raptor
#

who is better, 2.5 pro or o4 mini high?

elder rapids
#

but it seems kinda mixed

sonic tendon
keen beacon
#

we have a new king..

elder rapids
#

o3 seems to be like 4.5 ish

zinc ore
#

Weird 2.5 does so poorly on their coding test compared to other coding benchmarks

elder rapids
#

ye

zinc ore
#

Main thing bringing it's average down

elder rapids
#

it used to be 82% global average I'm p sure

zinc ore
#

Yeah that's sus

thorny drum
#

they just made the coding part harder in the april update

#

apparently they think the questions were somehow contaminated as well

zinc ore
#

24% drop lol

sonic tendon
#

surprising

#

happens sometimes tho

#

(i bought in a couple hours ago)

zinc ore
elder rapids
#

I think it's kinda clear these models roles at these points

#

o4 mini is pretty narrow, and does really well at coding

thorny drum
#

gemini still #1 at math

elder rapids
#

o3 seems to be the replacement for o1

#

since o1 would basically brute force things

#

and have enough knowledge to get past

#

but o4 mini sucks at general tasks, and o3 does really well generally, but it's not the absolute best

balmy mist
#

it just got released, vibes test matter most, benchmarks say one thing

#

but you gotta feel them out a lil

olive mesa
#

wait o4-mini and o3 are out now?

balmy mist
#

yeah lol

olive mesa
#

oh i forgot it releases 2pm est today

elder rapids
#

I'm getting this from using it

#

benchmarks isn't gonna give me this information

balmy mist
#

but you need more time with it, its been like 2 hours lol

#

and my vibes say its good at general

elder rapids
balmy mist
#

"sucks" is kinda wild to say, what are you asking it?

#

yeah o4 mini

elder rapids
#

but regardless

#

it's super synthetic in the way it speaks

keen beacon
#

what model are we talking about here

elder rapids
#

o3 doesn't have this

elder rapids
keen beacon
#

ah

#

yeah o3 is really nice with how it speaks

#

sounds clever

#

o4 mini is a bit

#

less like that

elder rapids
#

o3 is just super good vibes

balmy mist
#

to say o4 mini sucks at vibes is just wild

#

there is a lot of models that sucks at vibes

#

like 2.5 pro at times can be weird with how it talks

#

but i would not say it sucks at vibes

elder rapids
#

crazy because 2.5 pro in my testing has the best vibes

#

but o3 seems like a really good competitor

#

prolly knows more too

balmy mist
#

like give me an example

#

i wanna get the same vibes lol

brittle tiger
#

Prompt: Match the names to the colored stick figures that their arrows are pointing to

o3 took 10 minutes and got wrong. I expected that but the thinking process and UI were very cool and impressive. It broke down into segments maybe a dozen times and reasoned over them

keen beacon
#

could it match any of the names

#

yeah

elder rapids
# balmy mist like give me an example

like a random discussion I'm sending to 2.5 and asking it to justify the different positions and stuff? not sure how those can be clean as examples lol, as I implied before, and you said, it's "vibes"

#

if you don't think it has the best vibes

#

that's up to you

#

but for me, 2.5 is crazy for adjustment

balmy mist
#

o4 mini lol

  • Bob โ†’ the red stickโ€‘figure (topโ€‘left)
  • Jack โ†’ the green stickโ€‘figure (center)
  • Jimmy โ†’ the orange/tan stickโ€‘figure (topโ€‘right)
  • Tom โ†’ the blue stickโ€‘figure (bottomโ€‘right)
  • Adam โ†’ the yellow stickโ€‘figure (bottomโ€‘left)
novel flame
#

In my coding tests, I'm seeing varying results.

In my good old "nontrivial real world PHP task" test:

  • o4-mini-high gives a good answer (slightly sassy, pushes back against the premise of the task, which is great, and proposes the most pragmatic approach possible, which is fantastic, then gives a solid implementation of this approach, and then goes on to provide the typical approach with a ....decidedly imperfect implementation?) Score: 9.5/10
  • o3 gives the best answer and code of any model I have tested, and it adds a bit of personality on top. Better than Gemini 2.5 Pro and Claude 3.7 Sonnet. Beautiful. Score: 10/10

In my new "browser game with a twist" test:

  • o4-mini-high does decently, though it is no match for 3.7 Sonnet (Thinking); it ends up scoring roughly the same on this test as Gemini 2.5 Pro, DeepSeek V3 0324, and Grok 3.
  • o3 absolutely cr@ps the bed on this one, generating an embarassingly bad game. It's flailing around below Llama 4 Mavericks for Zuck's sake.
olive mesa
brittle tiger
#

No model gets close. I didn't expect it to. Thinking process was v cool tho

keen beacon
#

@brittle tiger can you give o3 this image in chatgpt and the prompt "You are one of the best GeoGuessr players in the world. Where is this a street view image of? Give your answer as coordinates."

#

would be interested to see how the reason with images thing handles it

balmy mist
brittle tiger
keen beacon
novel flame
brittle tiger
elder rapids
#

wonder if 2.5 pro is ever gonna get native image gen

#

or have a high thinking mode

zinc ore
#

It regularly thinks 5 mins on the Gemini plays Pokemon twitch stream

#

When acting as a BTS Pathfinder

keen beacon
#

https://x.com/julieswangg/status/1912565819260956946 openai employee confirms o3 and o4 mini are on different bases

@legit_api @OpenAIDevs o3 and o4-mini are both our flagship reasoning models. they're built on different base models, and we expect them both to be extremely good at solving complex problems which require multiple steps.
o3 is the most powerful. o4-mini is faster and just as powerful in most cases..

elder rapids
#

I don't think this goes for the actual thinking length tho

zinc ore
#

And it actually succeeds with its pathfinding on puzzles

keen beacon
#

When will o4 drop?

sage raptor
elder rapids
# sage raptor

this is probably just because the high context ability

#

Gemini will absolutely be the best long context reasoner

#

for the next year

olive mesa
barren prairie
#

O5 when?

hardy pecan
#

Ran it through simplebench's 20 public questions

hardy pecan
#

o3 - 7/20
o4-mini - 3/20

hardy pecan
elder rapids
hardy pecan
#

@pass1

elder rapids
#

we'll wait

#

for simplebench

balmy mist
elder rapids
#

no

hardy pecan
#

This was just pass@1

#

for 20 questions

#

Simplebench is actually 200 questions @ pass 5 I believe

#

So, high variance

elder rapids
#

o3 will probably get above 50%

#

o4 mini will likely get way lower

hardy pecan
#

Does anyone know the limits for plus users?

elder rapids
#

probably the same as their predecessors

brittle tiger
olive mesa
#

woah

elder rapids
#

is that the answer

thorny drum
balmy mist
#

nahh

sage raptor
#

lol

elder rapids
#

damn

keen beacon
thorny drum
#

is it correct? I'd imagine these are the tools they were talking about right

keen beacon
#

the red point

#

is the actual location

#

the marker

#

is the guess

calm sequoia
#

They mentioned maps, maybe it have access to it

keen beacon
#

holy moly

calm sequoia
#

Seems like military will have some use for that

olive mesa
#

that's crazy

calm sequoia
#

Do you guys still believe the R2 gonna top benchmark? ๐Ÿ˜„

keen beacon
#

that is almost a 5k if it was a geoguessr game

#

flawless guess

sage raptor
#

how did it get right

keen beacon
#

would be interested

#

if it could use the map/view the area like in geoguessr it could probably nail it

keen beacon
#

that would be cool asf

olive mesa
keen beacon
#

it's a bit further off but still closer than any other llm by far

#

2.5 pro gets the closest but it's still hunderds of miles off lmao

brittle tiger
keen beacon
#

wow what

#

that's genuinely

#

woah

#

i might sub to chatgpt for the first time in forever for this

elder rapids
keen beacon
#

what were the coords

elder rapids
#

Latitude: 53.5969, Longitude: -2.0173

keen beacon
#

interesting

elder rapids
#

yo are you sure you guys are using 2.5 pro right

keen beacon
#

yeah it's about as far away as o3's guess

#

what temperature did you have it on

elder rapids
#

I just hopped on AI studio

#

lol

#

created a new chat and boom

olive mesa
#

isnt the default temp 1

elder rapids
#

also, o3 has its own grounding

keen beacon
#

less buildings

#

will find something 2.5 pro flops

elder rapids
#

this is what I get with grounding too

#

same exact thing

#

not sure these tests are that impressive

keen beacon
#

slow down buckaroo

elder rapids
#

especially when these were probably built around even things like Google earth in terms of geographic knowledge

keen beacon
#

i've tested models extensively with geoguessr style tasks

#

and 2.5 pro does have big misses sometimes

#

so i will find a better image to use

zinc ore
#

Basically gotta stress test both then

elder rapids
#

and then boom

#

5000 point geoguessr models

#

going from vague, to absolutely knowing the answer

#

and then it's solved

north silo
#

Google has to drop its new model like nightwhisper now right

olive mesa
#

maybe in a couple days or a weekish

elder rapids
#

kinda making me wonder about r2

#

not sure if it's gonna actually be that good

keen beacon
#

i think i may have stumbled across a good street view one to test o3 with

#

checking

elder rapids
#

yep

elder rapids
#

oh nvm

#

ye I think o3 pro will handedly replace it

keen beacon
#

yup found a good one

elder rapids
#

but it's gonna be like how o3 mini was to o1 pro again

keen beacon
#

"You are one of the best GeoGuessr players in the world. Where is this a street view image of? Give your answer as coordinates." @brittle tiger

#

or @deep adder i don't mind lol

keen beacon
#

anyway brb

balmy mist
#

the way it reasons is so interesting

#

its like a person almost lol

balmy mist
torn mantle
#

can we guess whos the first AI lab to copy openai image reasoning zoom feature?

#

i will go with XAI

elder rapids
#

2.5 pro can effectively do the same exact thing

#

it just wouldn't output the zoomed in image itself

balmy mist
#

this is what my 2.5 pro got

elder rapids
#

@keen beacon what's the answer

sage raptor
# balmy mist -23.960579, 25.163853

Analysis:
Sky: Partly cloudy, suggests a climate that isn't perpetually overcast but also not necessarily pure desert blue. The haze could indicate dust or distance.
Landscape: Arid or semi-arid. Low, rolling hills in the background. The ground cover is sparse.
Vegetation: Dominated by scattered, low-lying, drought-resistant bushes and small trees. The type looks somewhat like acacia or similar scrub found in dry regions.
Overall Feel: This points towards a dry, potentially warm climate. Possible locations include parts of Australia, Southern Africa, the American Southwest, Mexico, or perhaps parts of the Mediterranean or Middle East. The specific look of the scrub and the hills feels quite characteristic of parts of inland Australia or Southern Africa.
Finding the Exact Location (using simulated reverse image search):
A reverse image search pinpoints this location.
Evidence:
The Google Street View image matches views found within Goobang National Park, New South Wales, Australia. The characteristic hills, dry vegetation, and sky are consistent with this region.
Coordinates:
Based on matching the Street View imagery within Goobang National Park on Google Maps, the approximate coordinates for this view are:
-32.8557, 148.3851 (or approximately 32ยฐ51'20.5"S 148ยฐ23'06.4"E)

balmy mist
#

tbh if its as close as o3 then 2.5 is a beast bc o3 takes forever

#

o3: Approximate coordinates: 29.9ย ยฐย N,ย โ€“103.1ย ยฐย W

That places the view in the northern Chihuahuanโ€‘Desert hills just north of Bigย Bend Nationalย Park, Westย Texas (USย Highwayย 385 / TXโ€‘118 area, Persimmonย Gap vicinity).

sonic tendon
#

unfortunately definitely not possible in current lmarena lol

elder rapids
sonic tendon
elder rapids
#

cuz it's the only model that's actually able to use tools like that

sonic tendon
#

sort of UI-dependent too

calm sequoia
#

Try using your own photos. The guesses are way off.

sonic tendon
#

model + tool integration benchmark, but even that's fishy

#

eh

balmy mist
#

you dont have sub for gpt?

ocean vortex
balmy mist
#

gemini couldnt do this:
make me a movie i can download that involves superheroes. figure out how to do it with the tools you have.

#

but o3 can

keen beacon
#

every single guess from o3 and 2.5 was way off โ˜น๏ธ

sonic tendon
balmy mist
#

very interesting

sonic tendon
#

new benchmark???

balmy mist
elder rapids
keen beacon
#

although o3 was a little closer.. the answer was north west peru (not at my laptop rn)

sonic tendon
elder rapids
#

it can try via varying SVG or something

#

but it won't think that's what you want

#

or think that long

sonic tendon
keen beacon
#

a benchmark with a harness where it can also see the map and move it/zoom/etc would be very cool

#

yeah

sonic tendon
keen beacon
#

its definitely possible with current models rn

sonic tendon
#

I was just gonna pull a giant database of geotagged images or something

keen beacon
#

hopefully computer use + the image reasoning stuff gets integrated into one cool ass agent soon enough

balmy mist
sonic tendon
#

or create them with google maps, but that'd be more finicky

keen beacon
#

doesnt geoguessr use google maps data?

#

street view etc

#

yup

sonic tendon
#

yeah

sonic tendon
#

but i'm sorta tired

#

mayb tomorrow

keen beacon
#

if u ran the benchmark it would be extremely expensive for a single run though

#

honestly my nerd ass just finds the way it reasons through the image interesting asf to watch

sonic tendon
keen beacon
balmy mist
#

what are our limits for plus o3?

sonic tendon
#

for oAI, maybe just beg on my knees for sponsorships, cuz otherwise it ain't happening

balmy mist
#

im starting to like o3 lol

tall summit
balmy mist
keen beacon
#

id be up to write it but ya not viable to actually run unless u have funding lol

balmy mist
#

i actually like this one

#

it added the characters flying in the city

sonic tendon
tall summit
#

how did it even make that

keen beacon
#

you dont need to tbh

sonic tendon
tall summit
sonic tendon
hardy pecan
#

it got my geolocation of my image correctly, just the co-ords were off

tall summit
#

there already is that famous geoguessr ai

sonic tendon
balmy mist
#

it would be cool if o3 could use 4o image gen

keen beacon
#

ive seen it do it

balmy mist
#

nahh i love o3 now, its a good model lol

keen beacon
#

it just calls gpt 4o image gen tho

sonic tendon
balmy mist
#

how did you prompt it?

keen beacon
#

not me i saw a screenshot of it happening lol

keen beacon
sonic tendon
#

i gtg but can look at it in a sec

keen beacon
#

unfortunately the model isn't public

tall summit
balmy mist
#

nahh o3 might be a cooker, i told it to use 4o image gen for the movie lets see what it does lol

keen beacon
#

might deplete ur 4o image gen quota quickly tho

#

idk how much it is

elder rapids
#

that would be kinda sick

balmy mist
#

it said it will make 8 images

#

and add it to movie

#

i doubt it will tho

#

but ill let it cook

#

damn my wifi keeps dropping

wintry tinsel
#

What did I miss is O3 a new frontier?

balmy mist
#

can someone else try this?

tall summit
balmy mist
#

prompt: o3, make me a movie i can download that involves superheroes. figure out how to do it with the tools you have, also use gpt 4o image generation for assets

tall summit
#

o3 and o4-mini are also in alpha ui!

wintry tinsel
#

Is the full o4 chat gpt 5?

tall summit
#

i don't know whether anyone's said that but it's nice

elder rapids
#

ditching o3 when it's โ‰ˆ 2.5 pro was the right move

#

btw the gap between 3.5 and 4, with 4 โ†’ o1 was surpassed

balmy mist
#

wild

elder rapids
#

now they're really trying to make gpt 5 into a monster

tall summit
# balmy mist

i asked o3 on lmarena which of course is text only and it gave me multiple full workflows and a script + plot beats which is kinda really cool?

balmy mist
#

thats dope

#

i see what openai is tryign to do now

#

i told it to update the prompt for me lol

sonic tendon
#

not sure if the direct chat does

tall summit
#

can it?

balmy mist
#

nahh this is dope man, it can really create images for you and put them together into a movie lol, o3 feels like a human tbh

sonic tendon
#

ohh

balmy mist
#

like its thinking

sonic tendon
#

i mean, you can ask it to write an svg, but no

balmy mist
#

kinda scary

tall summit
tall summit
balmy mist
#

lol

elder rapids
#

crazy how anthropic might just go poof

balmy mist
#

imma buy the $200 once we get o3 pro

elder rapids
#

in terms of enterprise, models

#

everything

tall summit
#

uhh why

#

what

balmy mist
#

nahh the integration with tools is actuall fire af

elder rapids
#

they kinda don't have anything, and they're playing catch up with everything else like distribution and utility

tall summit
elder rapids
#

how tho?

#

this is kinda reminiscent of when 3.7 released too, especially with their thinking model, even if they had suddenly the best chat model

#

they won't have the means to distribute it and maintain these things

balmy mist
#

yo wth lol

elder rapids
#

he asked it to

#

and I think it's hallucinating the stitch thing

balmy mist
#

wym ?

elder rapids
#

?

keen beacon
#

its not

#

it can do that

balmy mist
#

i told it to make movie with using 4o image gen

tall summit
balmy mist
#

then while it was generating one of the images

elder rapids
#

it can output the stitched frames?

keen beacon
#

yea it has python tools etc

balmy mist
#

it got an error saying:
I hit a snag: when I tried to generate the superhero artwork, the image service flagged my prompt as violating policy, so it wouldnโ€™t let me create the assets. Could you give me a fresh description (or tweak the one you had in mind) that stays safely โ€œPGโ€โ€”e.g., no explicit violence or gore? Once I have an approved prompt, I can generate the images and stitch them into a downloadable short movie for you.

#

theni said: can you update the prompt for me and make the movie

elder rapids
keen beacon
#

yup

balmy mist
#

and then it gave me that last output, it managed one image, but the second one got caught

elder rapids
#

yeah but those aren't like

balmy mist
#

o3 is amazing

elder rapids
#

what makes it stitchable

balmy mist
#

someone try it with o4 mini

elder rapids
#

I wanna see this fr

keen beacon
elder rapids
#

ngl I had no idea it had so much tools

keen beacon
#

https://xcancel.com/emollick/status/1912597487287705965 sm1 did it already but not with gpt 4o image gen

Nitter

"o3, make me a movie i can download that involves an otter and an airplane. figure out how to do it with the tools you have."

o3 has no movie capability, so It improvises decides to draw each frame and then stitch them together into a GIF to download, this was all first shot

elder rapids
#

ye I saw that

#

and that's what Im basing this off of

#

that's different from generating the image and being able to stitch them together in house tho

#

so I'm kinda skeptical

#

but I do wanna see

keen beacon
#

well it has all the tools in the context window to do so

#

but im not sure if it can access the generated images in the python environment without you reuploading them

elder rapids
#

ye but, it has to also intermittently analyze 4os own output

keen beacon
#

its a product / integration thing at that point though

elder rapids
#

yes

#

they said o3 was trained to specifically use these tools in certain ways

torn mantle
#

so far

#

o4 mini

#

isnt good at frontend coding

#

small tasks

elder rapids
#

fr?

#

o4 mini and o3 both seem to suck at following inductive tasks

keen beacon
# elder rapids ye but, it has to also intermittently analyze 4os own output

i dont think this is a problem with them especially having image manipulation tools (so it can analyze images/call tools subsequently). it can do that but the product might not be integrated in a way that currently makes this possible. (e.g. generated images being inaccessible in the python env until the user reuploads them)

elder rapids
#

pretty badly too

keen beacon
#

this is as middle of nowhere as middle of nowhere gets bro ๐Ÿ™๐Ÿ˜ญ

tall summit
#

unnamed road

#

thats brilliant

keen beacon
#

o4 mini gets it, o3 says namibia

balmy mist
tall summit
#

O4 MINI GETS IT HOW

torn mantle
#

not good so far

#

maybe i need to do more tests

elder rapids
#

I've been doing a TON of tests

keen beacon
elder rapids
#

this seems to be the only problem

#

but that's still a major flaw icl

balmy mist
#

Critical analysis of the two most powerful new models behind ChatGPT, o3 and o4-mini. Not just the system cards, benchmarks, and my own tests, but some you may not have seen before. Yes, they can whip up amazing front-end in a few seconds, but you always have to ask what is in their data. Either way, they prove the gains from RL are just beginni...

โ–ถ Play video
elder rapids
#

the fact it won't easily adjust to the user is kinda bad

tall summit
torn mantle
#

they finetuned it on threejs apps

#

physical simulations

#

with complex reasoning

#

frontend/backend its meh

#

python its meh

#

c# its the usual

#

general reasoning its pretty good

#

but thats it

elder rapids
#

ye, but honestly with that, 2.5 is still the general king

#

o3 is just so smart

torn mantle
#

2.5 is pretty good overall

#

well balanced

elder rapids
#

but it's constantly missing things

torn mantle
#

the reasoning approach used by google is different

#

so we have a slight edge to oai reasoning method

#

since its better ( for now )

elder rapids
#

this is so crazy to me tbh

#

how did Google do that

#

they didn't even release 2.0 pro, they just got rid of it

zinc ore
#

Google's reasoning method is a bit more flexible

torn mantle
#

high quality data
trial-error for the best RL algorithm ( they just mentioned that recently )
a lot of experiments giving how much TPUs they have
smart team ( gdm )

#

we built a system that used RL to discover its own RL algorithms.

this AI-designed system outperformed all human-created RL algorithms developed over the years.

elder rapids
#

wonder how this correlates to where they are now

#

in the beginning it was super super slow

#

around 1.5~ yrs ago

ornate stump
#

Just got back from work and I thought we were gonna have a huge leap, but I still kinda like Gemini 2.5 as a "PhD-level science assistant," at least. am i biased ?

elder rapids
#

and then they went from 1.5 pro โ†’ 1.5 pro 002, which was a large leap, and then with little time, to 1206, and then 2.0 pro, and then ditching for 2.5 pro

elder rapids
balmy mist
#

lol from o4 mini

tall summit
#

i got jumpscared

#

somehow

zinc ore
#

Yeah my brain totally processed all those images in a split second

ornate stump
#

Yeah, ChatGPT has a better output format. If I were still a student, I would definitely use that, but Gemini seems really sophisticated.

keen beacon
balmy mist
#

yeah

keen beacon
#

oh thats cool

#

maybe ask it to make assets then animate instead of trying to make whole scenes usign gpt 4o image gen

balmy mist
#

oohh okay, ill try when i get back, home gotta head out

keen beacon
ornate stump
#

Are they going to raise the deep research limits now? That's one thing I still prefer from OpenAI, but I haven't checked if Google upgraded it.

quiet pollen
#

o3 feels pretty smart - what do you all think?

brittle tiger
# keen beacon

Hard to compare. If question is just webdev it's easy but we havent seen nightwhisper outside of webdev

quiet pollen
#

I am wondering if it is due to the knowledge cut off date

zinc ore
#

Is night better at programming than o3 full and o4 mini?

#

What about dragontail?

torn mantle
#

yea

#

physical simulations, it will nail it

#

but its still struggling on what makes a design good and whatnot

#

oh btw

#

im talking about o4-mini

#

i havent tried o3 full yet

quiet pollen
#

do you think usage for non-reasoning models will decrease?

#

like GPT 4.1 etc

torn mantle
#

i mean it depends on the use case tbh

#

for example gemini 2.5 thinking achieved a similar performance at coding tasks to sonnet 3.5 only after applying reasoning

quiet pollen
#

o3 mini is cheaper than GPT-4.1 lol

tall summit
#

o3 mini is old

torn mantle
#

o4-mini vs nightwhisper

#

a simple prompt assessing stylistic choices/organization/colours...

quiet pollen
torn mantle
#

yea

#

not for o4 mini

#

i had to guide it

#

let me see if i still have 1st o4-mini output

tall summit
#

how'd nightwhisper make that

torn mantle
#

like this resume may seem easy to clone by any model, but trust me ive tried all models with the same prompt and even guiding them and they dont come near nightwhisper

#

just the vertical line on the bullet point if you can make any model do it centered in one shot i will give you whatever you like

#

they all messed that up

calm sequoia
quiet pollen
#

thanks for sharing nightwhisper

#

just googled it and found that it could be a stealth model from Google

#

Google models been a huge winnder for me

ocean vortex
#

there's no way released o3 is anywhere near this now renamed "preview" lol

quiet pollen
barren prairie
thorny drum
#

yeah this version is like 100x cheaper lol

barren prairie
keen beacon
#

on the 4.1 base, and the arc agi folks confirmed it

thorny drum
#

o3-preview (high): 87.5%, $34.4k/task

ocean vortex
keen beacon
#

no

ocean vortex
#

I feel like they kinda scammed everyone with that initial o3 announcement

thorny drum
#

34.4k/task

keen beacon
#

its worse

torn mantle
ocean vortex
keen beacon
#

they said o3 preview was closer to o3 pro

#

arc agi folks

ocean vortex
keen beacon
#

at least its worse on stuff that requires a lot of compute which openai doesn't want to serve

#

o3 preview to arc agi was served with an unrealistic level of compute

thorny drum
#

o3 high is 1000x cheaper than o3-preview high lol

ocean vortex
#

if you look at o1-pro it scored double on arc-agi-1 compared to o1-high

#

so it makes sense

#

they still scammed people implying that it was normal o3 lol

keen beacon
#

if they benchmarked the retrained o3 with the new 4.1 base with as much compute, itd probably score higher

ocean vortex
#

but it's also kinda pointless too

#

to do it

#

they made it look initially like o3 is a HUGE improvement

keen beacon
#

so yeah normal o3 i expect it do worse since its not juiced with that level of compute they gave o3 preview

keen beacon
#

i think thats worth something

ocean vortex
keen beacon
#

i dont think their plans for o3 were that fully fleshed at the time

#

i dont think it was malicious

#

but i agree

ocean vortex
keen beacon
#

i mean look at how theyre continuing chatgpt 4o despite it on the 4.1 base, openai has committed every naming sin possible

#

they didnt even rename 4o on chatgpt so its even more confusing against o4 mini etc

ember rapids
#

i feel like nightwhisper will def be better at coding

keen beacon
#

what was the point of renaming it to 4.1 on just the api ๐Ÿ˜ญ

ocean vortex
#

they already had the pro model line. So what they showed I doubt it was actually ever referred to internally as just o3

keen beacon
#

was o1 pro even out at the tie

#

i think o3 was still early in developmment and they reached a milestone and wanted to share it despite it not being fleshed out

ocean vortex
torn mantle
#

gemini 2.5 pro general knowledge >>>>>

ocean vortex
#

and they adopted the same system for o3 benchmarks while knowing it's o3-pro ๐Ÿ’€

torn mantle
#

sonnet 3.5/3.7 general knowledge >>>>>>>>

keen beacon
#

i dont even think the level of compute they used for o3 preview would even match anything close to what they use for o3 pro tbh

#

they probably used even more than o3 pro

calm sequoia
#

They used more than the pro. And this term "pro" was quite new at that time.

#

As I understand they gave unlimited compute for some time. Therefore, the approaches can't be compared.

keen beacon
#

yea

torn mantle
#

o3 full initial vibes are kinda off for me

keen beacon
#

nah i think the vibes are good

#

mini has worse vibes than full but better at non-web coding tasks

ocean vortex
thorny drum
#

wasnt it like unlimited compute and then like thousand model majority voting

keen beacon
ocean vortex
#

imagine them announcing standard o1 with benchmarks that are higher than o1-pro

keen beacon
#

what would you propose then

#

u were early in the dev process and wanted to share results, even though they were unrealistic, and committed to o3

thorny drum
#

i mean they got a model with a similar quality 1000x cheaper in like 4 months