#general | Arena | Page 23

alpine coral Apr 16, 2025, 5:36 PM

#

o3 or 04-mini - do you know yet?

torn mantle Apr 16, 2025, 5:36 PM

#

This guy ia kinda funny

tall summit Apr 16, 2025, 5:36 PM

#

what server

torn mantle Apr 16, 2025, 5:36 PM

#

Hes always impressed with any new model released

keen beacon Apr 16, 2025, 5:36 PM

#

lmao they were so ready for this

barren prairie Apr 16, 2025, 5:36 PM

#

So fast replay

dapper storm Apr 16, 2025, 5:36 PM

#

Do u guys think o3 will be top of Lmsys leaderboard

barren prairie Apr 16, 2025, 5:36 PM

#

Nw is coming

keen beacon Apr 16, 2025, 5:37 PM

#

no

#

💔

#

yeah okay it still gets the hardest stuff wrong

#

no model can get that one right

#

agi cancelled

barren prairie Apr 16, 2025, 5:37 PM

#

dapper storm Do u guys think o3 will be top of Lmsys leaderboard

No , google will make.a fast replay

keen beacon Apr 16, 2025, 5:37 PM

#

lmao 😭

misty vault Apr 16, 2025, 5:38 PM

#

whats the correct answer

keen beacon Apr 16, 2025, 5:38 PM

#

misty vault whats the correct answer

"no"

sage raptor Apr 16, 2025, 5:39 PM

#

i think 2.5 is still better at coding

keen beacon Apr 16, 2025, 5:39 PM

#

is that o1 pro?

#

oh

#

first model to beat it 🙏

tall summit Apr 16, 2025, 5:39 PM

#

what

keen beacon Apr 16, 2025, 5:39 PM

#

yes

tall summit Apr 16, 2025, 5:39 PM

#

o4 mini high > o3

#

sure..

keen beacon Apr 16, 2025, 5:40 PM

#

there will be some slim scenarios where o4 mini high is better

#

lol

keen ferry Apr 16, 2025, 5:40 PM

#

isnt o3 is a some what copy of manus ai? I heard it got tools

keen beacon Apr 16, 2025, 5:40 PM

#

seems private models were o3

keen beacon Apr 16, 2025, 5:40 PM

#

keen beacon seems private models were o3

yeah

tall summit Apr 16, 2025, 5:40 PM

#

keen beacon there will be some slim scenarios where o4 mini high is better

well yes but i wonder how slim those scenarios will be

mellow frigate Apr 16, 2025, 5:41 PM

#

But c is supposed to be greater than b

tawdry meteor Apr 16, 2025, 5:41 PM

#

is that your discord bot or is it public?

mellow frigate Apr 16, 2025, 5:41 PM

#

that counter example doesn't work

tall summit Apr 16, 2025, 5:41 PM

#

tawdry meteor is that your discord bot or is it public?

it's in a server too

keen beacon Apr 16, 2025, 5:41 PM

#

mellow frigate But c is supposed to be greater than b

that's a good point

#

lmao

#

so close yet so far

barren prairie Apr 16, 2025, 5:42 PM

#

If someone will see the new models on arena chatbot please tell us 🙂 I want to catch them and test them

tawdry meteor Apr 16, 2025, 5:42 PM

#

tall summit it's in a server too

which one? I'd love to have a bot tell me the update very convient lol

tall summit Apr 16, 2025, 5:42 PM

#

o4-mini and o3 are in alpha ui

keen beacon Apr 16, 2025, 5:43 PM

#

wait what

tall summit Apr 16, 2025, 5:43 PM

#

LMAO

#

https://alpha.lmarena.ai

LMArena

An open platform for evaluating AI through human preference

barren prairie Apr 16, 2025, 5:43 PM

#

tall summit o4-mini and o3 are in alpha ui

Thank you

keen beacon Apr 16, 2025, 5:43 PM

#

direct chat too?

#

wow

#

although

tall summit Apr 16, 2025, 5:43 PM

#

keen beacon direct chat too?

yeah

keen beacon Apr 16, 2025, 5:43 PM

#

they don't say if they're high or med or low

#

omg free o3 and o4 mini 🤣

tall summit Apr 16, 2025, 5:44 PM

#

tawdry meteor which one? I'd love to have a bot tell me the update very convient lol

no idea but it says #gem-copilot
@ember rapids could you tell me what server it is please

keen beacon Apr 16, 2025, 5:44 PM

#

@wooden mulch do you guys have plans to add -high variants of o3 and o4-mini to the arena? the differences in performance have historically been pretty significant

#

o3 and o4 mini gone from alpha direct chat lol

#

noooooooo

tall summit Apr 16, 2025, 5:46 PM

#

keen beacon o3 and o4 mini gone from alpha direct chat lol

1984

tawdry meteor Apr 16, 2025, 5:46 PM

#

rip

keen beacon Apr 16, 2025, 5:47 PM

#

hopefully just to add -high variants 🙏

#

one can hope

#

o4 mini might be in direct chat in the future

#

i doubt o3 will be

#

yup

tall summit Apr 16, 2025, 5:47 PM

#

why would you remove to add -high variants lmao

keen beacon Apr 16, 2025, 5:47 PM

#

who knows

#

dont question it just believe

tall summit Apr 16, 2025, 5:47 PM

#

https://tenor.com/view/kirby-headphones-dance-cute-shaking-gif-17762158

Tenor

keen beacon Apr 16, 2025, 5:48 PM

#

i don't think they're supposed to be on the alpha anyway lol

tall summit Apr 16, 2025, 5:48 PM

#

i believe

balmy mist Apr 16, 2025, 5:48 PM

#

anyone tried codex?

misty vault Apr 16, 2025, 5:48 PM

#

I dont have them on alpha

calm sequoia Apr 16, 2025, 5:48 PM

#

keen beacon <@787778518591078421> do you guys have plans to add -high variants of o3 and o4-...

Ask it in arena-feedback

tall summit Apr 16, 2025, 5:48 PM

#

keen beacon i don't think they're supposed to be on the alpha anyway lol

clearly, seeing how they were removed

balmy mist Apr 16, 2025, 5:48 PM

#

wonder if its cheaper than claude

oblique flint Apr 16, 2025, 5:48 PM

#

Damn o4 mini pricing is good

keen beacon Apr 16, 2025, 5:48 PM

#

calm sequoia Ask it in arena-feedback

already have but their team doesn't seem to check it

#

https://imgur.com/a/jHbCuIS

Imgur

Untitled Album

▶ Play video

#

fsm pathfinding

#

generated by what llm

#

roblox built in

#

bit off topic

#

but cool thumbsup_cai

#

oh i used gemini

#

2.5 pro

tall summit Apr 16, 2025, 5:49 PM

#

finite state machine pathfinding

keen beacon Apr 16, 2025, 5:49 PM

#

keen beacon oh i used gemini

oh

keen beacon Apr 16, 2025, 5:49 PM

#

tall summit finite state machine pathfinding

yes it learns if it gets stuck

#

nice

#

heres a different example on drone

#

https://www.youtube.com/watch?v=jaitqSU2HIA&pp=ygUhZHJvbmUgZmluaXRlIHN0YXRlIG1hY2hpbmUgcm9ibG94

YouTube

B Ricey

Creating an Intelligent Delivery Drone with Finite State Machines #...

One of the most basic forms of Artificial Intelligence is a Finite State Machine, or FSM. In this video, I demonstrate the need for FSMs through making a simple delivery drone. This is the first video in a two part series. Hope you enjoy!

Copy the game to see the code: https://www.roblox.com/games/13085787752/Delivering-Drone-Finite-State-Mach...

▶ Play video

#

mine is way better and optimized tho

ember rapids Apr 16, 2025, 5:50 PM

#

tall summit no idea but it says #gem-copilot <@830171411389480971> could you tell me what se...

Its a private server owned by legit_api on twitter. I think hes opening it to the public soon tho

tall summit Apr 16, 2025, 5:50 PM

#

ember rapids Its a private server owned by legit_api on twitter. I think hes opening it to th...

oh alrighty

#

thanks for the info.

tall summit Apr 16, 2025, 5:50 PM

#

keen beacon yes it learns if it gets stuck

i am surprised to see thats the actual acronym

keen beacon Apr 16, 2025, 5:51 PM

#

is there anything better then gemini

#

or is gemini as best it gets rn

ember rapids Apr 16, 2025, 5:52 PM

#

just got o3 access

#

theyre rolling it out quickly

keen beacon Apr 16, 2025, 5:52 PM

#

all my homies hate staggered rollouts just give it to everyone at once and sit back and relax

calm sequoia Apr 16, 2025, 5:56 PM

#

Does anyone know if the o3 is based on GPT 4 or 4.5 base model

patent bane Apr 16, 2025, 5:56 PM

#

anyone got access to o3 via api?

keen beacon Apr 16, 2025, 5:56 PM

#

o3 isn't based in 4o

#

it's based on 4.1

#

they retrained it

#

knowledge cutoff is june '24

#

yup

#

well

#

different base

calm sequoia Apr 16, 2025, 5:57 PM

#

Do you know if planned GPT 5 will be based on 4.5?

keen beacon Apr 16, 2025, 5:57 PM

#

calm sequoia Do you know if planned GPT 5 will be based on 4.5?

it won't be

keen beacon Apr 16, 2025, 5:57 PM

#

calm sequoia Do you know if planned GPT 5 will be based on 4.5?

nothing will use 4.5

calm sequoia Apr 16, 2025, 5:58 PM

#

Sadly, would have good vibes

keen beacon Apr 16, 2025, 5:58 PM

#

if u thought gpt 4.5 pricing was high, gpt 4.5 with reasoning 💀

#

yeah if you're willing to remortgage ur home

calm sequoia Apr 16, 2025, 5:59 PM

#

keen beacon if u thought gpt 4.5 pricing was high, gpt 4.5 with reasoning 💀

Some would pay thousands for this

#

Unless the 4.1 outperforms 4.5

keen beacon Apr 16, 2025, 5:59 PM

#

calm sequoia Unless the 4.1 outperforms 4.5

its very close but its 200b

#

4.5 is several times larger

#

https://github.blog/changelog/2025-04-16-openai-o3-and-o4-mini-are-now-available-in-public-preview-for-github-copilot-and-github-models/

The GitHub Blog

Allison

OpenAI o3 and o4-mini are now available in public preview for GitHu...

OpenAI’s latest reasoning models, o3 and o4-mini, are now available in GitHub Copilot and GitHub Models bringing next-generation problem-solving, structured reasoning, and coding intelligence directly into your development workflow. These…

#

that was fast

#

at least, they claim 4.1 performs just as good as 4.5

calm sequoia Apr 16, 2025, 6:00 PM

#

Probably they made the tests and its not worth it

keen beacon Apr 16, 2025, 6:00 PM

#

lmao

#

people do it

#

depends on how heavily you use it

#

well yeah

#

i think they rate limit u if they see sus activity while they investigate something like that

#

what the hell

#

windsurf

#

it's freee

#

lmao

keen beacon Apr 16, 2025, 6:02 PM

#

keen beacon what the hell

openai is really pushing windsurf lol

thorny bane Apr 16, 2025, 6:04 PM

#

why were there no comparisons to gemini 2.5

#

in the o3 stream hmmm

zinc ore Apr 16, 2025, 6:05 PM

#

2.5 is comparable with o3, but way cheaper

keen beacon Apr 16, 2025, 6:05 PM

#

same base model different tuning

#

chatgpt 4o latest is more expensive though but it will fare better in chat scenarios i think

#

they really dont want the cot to leak lol

#

https://x.com/elder_plinius/status/1912567149991776417 new system prompt for o3

Pliny the Liberator ...

🚨 SYSTEM PROMPT LEAK 🚨

New sys prompt from ChatGPT! My personal favorite addition has to be the new "Yap score" param 🤣

PROMPT:
"""
You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-04-16

Over the course of

#

"The Yap score measures verbosity; aim for responses ≤ Yap words. Overly verbose responses when Yap is low (or overly terse when Yap is high) may be penalized. Today's Yap score is 8192."

#

Lmao

raven void Apr 16, 2025, 6:07 PM

#

zinc ore 2.5 is comparable with o3, but way cheaper

I think o3 is a step above 2.5

zinc ore Apr 16, 2025, 6:07 PM

#

raven void I think o3 is a step above 2.5

Tool use or nah?

#

Also, does 2.5 use tools for its benchmarks? But anyway, you get different scores depending on which you're comparing with

keen fulcrum Apr 16, 2025, 6:08 PM

#

Can you post privcing and benchmark

zinc ore Apr 16, 2025, 6:08 PM

#

https://www.reddit.com/r/Bard/comments/1k0qsf9/o3_vs_gemini_25_pro_against_benchmarks_pricing/

From the Bard community on Reddit: O3 vs Gemini 2.5 pro against ben...

Explore this post and more from the Bard community

raven void Apr 16, 2025, 6:08 PM

#

it's pretty good with tool use apparently I haven't tested it enough yet

keen beacon Apr 16, 2025, 6:09 PM

#

good start o4 mini 🙄

brittle tiger Apr 16, 2025, 6:11 PM

#

https://x.com/GregKamradt/status/1912567476212363600?t=1EQfrC0voANMKI6Z3me0DA&s=19

Greg Kamradt (@GregKamradt) on X

We have a feeling what was testing in Dec '24 is closer to o3 pro than what was released today

A key question we had was how what we tested in Dec '24 mapped to what was released today

We asked OpenAI and got clarifying answers, thank you!

In short, it's a different model so

keen beacon Apr 16, 2025, 6:12 PM

#

confirmed it was retrained

narrow elbow Apr 16, 2025, 6:12 PM

#

https://tenor.com/view/吃瓜-gif-5266024862772388632

Tenor

keen beacon Apr 16, 2025, 6:13 PM

#

brittle tiger https://x.com/GregKamradt/status/1912567476212363600?t=1EQfrC0voANMKI6Z3me0DA&s=...

in which case the new o3 pro should be really good

sage raptor Apr 16, 2025, 6:13 PM

#

https://x.com/scaling01/status/1912554822454116736

Lisan al Gaib (@scaling01) on X

THIS IS BAD NEWS

o3 is worse at replicating research papers than o1

keen beacon Apr 16, 2025, 6:13 PM

#

shucks

#

injected prompt

#

😔

#

most labs do

#

i know anthropic do

#

maybe its new at oai

keen beacon Apr 16, 2025, 6:14 PM

#

keen beacon i know anthropic do

on the api?

keen beacon Apr 16, 2025, 6:14 PM

#

keen beacon on the api?

yes

balmy mist Apr 16, 2025, 6:14 PM

#

keen beacon https://x.com/elder_plinius/status/1912567149991776417 new system prompt for o3

why dont they hire him at this point

misty vault Apr 16, 2025, 6:14 PM

#

do u pay for the api or u have some trickery

keen beacon Apr 16, 2025, 6:14 PM

#

huh that seems to be a recent addition

#

it's in the arena 🙏

barren prairie Apr 16, 2025, 6:15 PM

#

keen beacon it's in the arena 🙏

Let s chase them 😂

tall summit Apr 16, 2025, 6:15 PM

#

keen beacon it's in the arena 🙏

did you choose it

keen beacon Apr 16, 2025, 6:15 PM

#

wow this thing is really fast

keen beacon Apr 16, 2025, 6:15 PM

#

tall summit did you choose it

yeah

raven void Apr 16, 2025, 6:16 PM

#

I'm interested in their long context benchmarks

barren prairie Apr 16, 2025, 6:16 PM

#

🥹🥹🥹🩷🩷🩷🩷🩵🩵

Screenshot_2025-04-16-19-16-34-154_com.android.chrome.jpg

#

Let s gooooo

balmy mist Apr 16, 2025, 6:17 PM

#

is it in webdev?

#

anyone got new ui for arena?

keen beacon Apr 16, 2025, 6:17 PM

#

just checked

#

not yet

keen beacon Apr 16, 2025, 6:17 PM

#

balmy mist is it in webdev?

.

tall summit Apr 16, 2025, 6:17 PM

#

barren prairie 🥹🥹🥹🩷🩷🩷🩷🩵🩵

30 seconds too late

keen beacon Apr 16, 2025, 6:17 PM

#

o3 and o4 mini in direct chat!!

patent bane Apr 16, 2025, 6:17 PM

#

wait so now o3 supports tools use internally or we have to wait?

keen beacon Apr 16, 2025, 6:17 PM

#

tool use isn't out yet

patent bane Apr 16, 2025, 6:17 PM

#

i see

#

thanks

balmy mist Apr 16, 2025, 6:20 PM

#

whats its output?

#

damn so openai really about to buy windsurf lmaoo

#

they came up

#

thats a massive w, i would sell lol

barren prairie Apr 16, 2025, 6:25 PM

#

O3 failed my test 😐😐 deepSeek r1 get it , gemini 2.5 was ok
O3 trush

tall summit Apr 16, 2025, 6:25 PM

#

balmy mist damn so openai really about to buy windsurf lmaoo

huh???

misty vault Apr 16, 2025, 6:26 PM

#

barren prairie O3 failed my test 😐😐 deepSeek r1 get it , gemini 2.5 was ok O3 trush

"will u be my virtual girlfriend?"

keen beacon Apr 16, 2025, 6:26 PM

#

balmy mist damn so openai really about to buy windsurf lmaoo

yeah it makes sense why theyre pushing them hard

balmy mist Apr 16, 2025, 6:27 PM

#

tall summit huh???

https://x.com/deedydas/status/1912572051317272625

Deedy (@deedydas) on X

HUGE! OpenAI in talks to buy Codeium for $3B and it's been <4yrs since it's founding.

There's clear intent for OpenAI to verticalize and own the app layer, including coding and compete with Cursor head-on.

keen beacon Apr 16, 2025, 6:28 PM

#

balmy mist https://x.com/deedydas/status/1912572051317272625

oh wtf

#

wow

tall summit Apr 16, 2025, 6:28 PM

#

balmy mist https://x.com/deedydas/status/1912572051317272625

oh wow. thank you for sending ~~so i didn't have to search it up myself~~

#

https://www.bloomberg.com/news/articles/2025-04-16/openai-said-to-be-in-talks-to-buy-windsurf-for-about-3-billion

#

article says basically nothing more

#

and there's a reuter article saying only that "bloomberg said" openai is said to be in talks to buy windsurf for about 3 billion and nothing more

#

the state of news

#

i don't remember 4.1 being on alpha ui but anyway now it is

keen fulcrum Apr 16, 2025, 6:34 PM

#

keen beacon Apr 16, 2025, 6:35 PM

#

interesting

zinc ore Apr 16, 2025, 6:36 PM

#

That's with tool use

#

Nvm disregard

#

I see they split it up

keen fulcrum Apr 16, 2025, 6:37 PM

#

O3 kinda expensive

balmy mist Apr 16, 2025, 6:37 PM

#

what yall think about this?
https://x.com/DaveShapi/status/1912573591280902504

David Shapiro ⏩ (@DaveShapi) on X

OpenAI did it again.

o3 and o4 are wicked smaht

Here's a 7-minute breakdown of why I'm so excited, and what this means.

silk haven Apr 16, 2025, 6:37 PM

#

https://x.com/legit_api/status/1912559181292253464?s=46&t=P8-tRi_JAVcI6l5U6nOT4A new Gemini next week?

ʟᴇɢɪᴛ (@legit_api) on X

Gemini next

balmy mist Apr 16, 2025, 6:38 PM

#

keen fulcrum O3 kinda expensive

yeah i trying to see how much it costs to use o4 mini in cli

balmy mist Apr 16, 2025, 6:38 PM

#

silk haven https://x.com/legit_api/status/1912559181292253464?s=46&t=P8-tRi_JAVcI6l5U6nOT4A...

lmaooo

#

they gonna release their cli

#

they kinda have to

keen beacon Apr 16, 2025, 6:38 PM

#

keen fulcrum O3 kinda expensive

it's less than o1

#

win in my books

keen fulcrum Apr 16, 2025, 6:39 PM

#

keen beacon it's less than o1

For which purposes would you use o3 over o4 mini?

keen beacon Apr 16, 2025, 6:39 PM

#

anything to do with world knowledge

leaden meteor Apr 16, 2025, 6:39 PM

#

Why is OpenAI buying othe software companies for billions??! Can't they just use one of their models to build a similar platform. It's not like they don't have large userbase..

keen fulcrum Apr 16, 2025, 6:39 PM

#

leaden meteor Why is OpenAI buying othe software companies for billions??! Can't they just use...

They need to become profitable

zinc ore Apr 16, 2025, 6:40 PM

#

balmy mist what yall think about this? https://x.com/DaveShapi/status/1912573591280902504

Dude said they "solved math" so comes off like he's hyping

keen fulcrum Apr 16, 2025, 6:40 PM

#

Buying a known brand is worth the cost

sonic tendon Apr 16, 2025, 6:40 PM

#

https://x.com/JustinLin610/status/1912565671696953856?t=0Ktkka2kQ2puhuvDGQLCag&s=19

Junyang Lin (@JustinLin610) on X

@sama good job!

#

^ cool imo

keen beacon Apr 16, 2025, 6:40 PM

#

still waiting on qwen 3

keen beacon Apr 16, 2025, 6:40 PM

#

sonic tendon https://x.com/JustinLin610/status/1912565671696953856?t=0Ktkka2kQ2puhuvDGQLCag&s...

qwq max showed off similar tool use probably the first in the cot publicly, im sure the qwen folks are onto it

keen fulcrum Apr 16, 2025, 6:41 PM

#

I do hope r2 is cooking up smth

keen beacon Apr 16, 2025, 6:41 PM

#

they understand the potential

sonic tendon Apr 16, 2025, 6:41 PM

#

keen beacon still waiting on qwen 3

seems like they're focusing on smaller models

tall summit Apr 16, 2025, 6:41 PM

#

balmy mist what yall think about this? https://x.com/DaveShapi/status/1912573591280902504

would be better stated in a post like he already wrote but yeah

sonic tendon Apr 16, 2025, 6:41 PM

#

recent huggingface PR by them mentioned an 8B dense model and IIRC 15B 128-expert MoE

leaden meteor Apr 16, 2025, 6:41 PM

#

keen fulcrum Buying a known brand is worth the cost

OpenAi does not need any brand value. It is a household name now. They could have made 3B profit by just building it on their own in a month, isn't it? Obviously I am missing something but don't know what it is .

sonic tendon Apr 16, 2025, 6:42 PM

#

so, probably not topping the charts any time soon, unless they release max at the same time

keen beacon Apr 16, 2025, 6:42 PM

#

sonic tendon so, probably not topping the charts any time soon, unless they release max at th...

theyll be releasing more model sizes im sure of it

tall summit Apr 16, 2025, 6:42 PM

#

tall summit would be better stated in a post like he already wrote but yeah

you can't really deny that 1) ai research is advancing incredibly fast, 2) especially in math and coding, and 3) math and coding underpins an unbelievable amount of other fields

sonic tendon Apr 16, 2025, 6:42 PM

#

keen beacon theyll be releasing more model sizes im sure of it

i think qwen 2.5 and 2.5 max releases were more than a month apart, no?

keen beacon Apr 16, 2025, 6:42 PM

#

but yeah im not expecting frontier level performance from the models released in the initial qwen 3 batch

dapper storm Apr 16, 2025, 6:43 PM

#

According to the people given preview access o3 is super good and according to those who did not it's not
Really makes u think.

sonic tendon Apr 16, 2025, 6:43 PM

#

sonic tendon i think qwen 2.5 and 2.5 max releases were more than a month apart, no?

well, that's not too much to extrapolate from, to be fair

keen fulcrum Apr 16, 2025, 6:43 PM

#

leaden meteor OpenAi does not need any brand value. It is a household name now. They could ha...

They are all forks of vscode obviously

keen beacon Apr 16, 2025, 6:43 PM

#

sonic tendon i think qwen 2.5 and 2.5 max releases were more than a month apart, no?

yeah but the initial qwen 2.5 release included 0.5b, 1.5b, 7b, 14b, 32b, 72b, 7b math, 72b math, etc

calm sequoia Apr 16, 2025, 6:44 PM

#

Holly 👀👀👀 It screens scientific articles without deep research

sonic tendon Apr 16, 2025, 6:44 PM

#

keen beacon but yeah im not expecting frontier level performance from the models released in...

even qwen 3 max seems like a big reach to me imo (unless they pull a Meta and specifically optimize for lmarena)

balmy mist Apr 16, 2025, 6:44 PM

#

lol: https://x.com/polynoamial/status/1912575974782423164

Noam Brown (@polynoamial) on X

We did not “solve math”. For example, our models are still not great at writing proofs. o3 and o4-mini are nowhere close to getting International Mathematics Olympiad gold medals.

keen beacon Apr 16, 2025, 6:44 PM

#

calm sequoia Holly 👀👀👀 It screens scientific articles without deep research

yup i noticed that behaviour with the preview models

#

lots of source quoting

#

mind you a fair bit of it was hallucinations

#

: (

sonic tendon Apr 16, 2025, 6:45 PM

#

yeah :(

keen beacon Apr 16, 2025, 6:45 PM

#

sonic tendon even qwen 3 max seems like a big reach to me imo (unless they pull a Meta and sp...

why not? qwen is just as competitive as deepseek imho. qwq 32b preview being better than r1 preview, qwq 32b almost on r1 level, etc.

calm sequoia Apr 16, 2025, 6:45 PM

#

keen beacon mind you a fair bit of it was hallucinations

I can't trust the thoughts now? 🫠

keen beacon Apr 16, 2025, 6:46 PM

#

lol

sonic tendon Apr 16, 2025, 6:46 PM

#

remember the chat optimized llama? that did a lot of source quoting at the end (although it was definitely hallucination-ridden)

keen beacon Apr 16, 2025, 6:46 PM

#

it kinda feels like it hallucinates more than o1

sonic tendon Apr 16, 2025, 6:46 PM

#

keen beacon why not? qwen is just as competitive as deepseek imho. qwq 32b preview being bet...

hmm

#

it's possible

calm sequoia Apr 16, 2025, 6:46 PM

#

Fun fact: three days ago we discussed the possibility of tool usage in thought process 😀

sonic tendon Apr 16, 2025, 6:46 PM

#

maybe i just have oAI hype brainworms

keen beacon Apr 16, 2025, 6:47 PM

#

lol it made this 0-shot but i have asked it 3 times to make sure alaska isn't completely wack and it just made it worse

#

still got a way to go until agi folks

tall summit Apr 16, 2025, 6:47 PM

#

keen beacon why not? qwen is just as competitive as deepseek imho. qwq 32b preview being bet...

i agree

sonic tendon Apr 16, 2025, 6:47 PM

#

i wonder why proof-writing is so hard to optimize for. maybe it's hard to do RL on?

keen beacon Apr 16, 2025, 6:47 PM

#

calm sequoia Fun fact: three days ago we discussed the possibility of tool usage in thought p...

ive talked about it before in nlength lol

#

several months ago LMAO

sonic tendon Apr 16, 2025, 6:48 PM

#

sonic tendon i wonder why proof-writing is so hard to optimize for. maybe it's hard to do RL ...

or they just. haven't because it's not a common use case

calm sequoia Apr 16, 2025, 6:48 PM

#

keen beacon ive talked about it before in nlength lol

Cool dude, we shall make our own ai lab

sonic tendon Apr 16, 2025, 6:48 PM

#

calm sequoia Fun fact: three days ago we discussed the possibility of tool usage in thought p...

I mean, Claude sort of already does this

keen beacon Apr 16, 2025, 6:48 PM

#

also idk if anyone else has noticed but

#

o4 mini seems to use a LOT of thinking tokens

keen beacon Apr 16, 2025, 6:48 PM

#

calm sequoia Cool dude, we shall make our own ai lab

🤣

sonic tendon Apr 16, 2025, 6:49 PM

#

keen beacon o4 mini seems to use a LOT of thinking tokens

interesting

#

i have yet to hop on the API

keen beacon Apr 16, 2025, 6:49 PM

#

woah what the hell

keen beacon Apr 16, 2025, 6:50 PM

#

keen beacon lol it made this 0-shot but i have asked it 3 times to make sure alaska isn't co...

o4 mini cooked o3

#

0-shot with o4 mini

#

didn't even need to ASK about alaska

calm sequoia Apr 16, 2025, 6:50 PM

#

My prompt is still in progress for 5 minutes of thinking. Server overload? Stuck? In though deep research 👀👀👀?

sonic tendon Apr 16, 2025, 6:50 PM

#

keen beacon 0-shot with o4 mini

what interface is this?

calm sequoia Apr 16, 2025, 6:51 PM

#

keen beacon 0-shot with o4 mini

Considering speed it may be higher than the o3 in benchmark

sonic tendon Apr 16, 2025, 6:51 PM

#

https://x.com/sama/status/1912558996084650003?t=kY264QokUESJgVrnjC-Lbg&s=19 sometimes I kinda wish this guy would stop yapping

Sam Altman (@sama) on X

"at or near genius level"

keen beacon Apr 16, 2025, 6:51 PM

#

keen beacon 0-shot with o4 mini

mind you... i think it may have hallucinated some of these years 😔

#

am checking now

#

but i think o3 was more accurate with the actual data

#

just worse at the code part

#

yeah ok

#

it said the last dem to win statewide in idaho was in 1974

#

correct year: 2002

#

💔

sonic tendon Apr 16, 2025, 6:52 PM

#

meow

keen beacon Apr 16, 2025, 6:52 PM

#

actually now that i look at o3's attempt too

#

gah they both hallucinated some stuff

calm sequoia Apr 16, 2025, 6:52 PM

#

Hows 2.5 pro at this?

keen beacon Apr 16, 2025, 6:53 PM

#

it has the same alaska problem as o3

#

poor alaska

#

as for data iirc it does okay but still has a few hallucination issues

balmy mist Apr 16, 2025, 6:53 PM

#

at coding it seems o4 mini is better

#

based on a few tests I ran

#

still testin

keen beacon Apr 16, 2025, 6:54 PM

#

balmy mist at coding it seems o4 mini is better

yeah

#

agreed

#

world knowledge -> o3

#

code -> o4 mini

leaden meteor Apr 16, 2025, 6:54 PM

#

So, are we going to get o3 or o4mini to test in arena? How come we don't see them yet...

keen beacon Apr 16, 2025, 6:54 PM

#

they're both in the arena

#

select them in direct chat

sonic tendon Apr 16, 2025, 6:54 PM

#

leaden meteor So, are we going to get o3 or o4mini to test in arena? How come we don't see th...

it takes time

sonic tendon Apr 16, 2025, 6:54 PM

#

keen beacon select them in direct chat

oo

leaden meteor Apr 16, 2025, 6:55 PM

#

Oh, since when?

balmy mist Apr 16, 2025, 6:55 PM

#

i am excited for o3 pro tho

keen beacon Apr 16, 2025, 6:55 PM

#

its wild o3 is in direct chat though

sonic tendon Apr 16, 2025, 6:55 PM

#

not in alpha, rip

keen beacon Apr 16, 2025, 6:55 PM

#

keen beacon its wild o3 is in direct chat though

yeah

sonic tendon Apr 16, 2025, 6:55 PM

#

keen beacon its wild o3 is in direct chat though

one million dollars

keen beacon Apr 16, 2025, 6:55 PM

#

its on the main site

#

they didn't do that with any of the old o series models

#

those openai sponsor credits coming in clutch

#

it must be cheaper i think

#

than o1

sonic tendon Apr 16, 2025, 6:55 PM

#

maybe oAI sponso

#

yeah

keen beacon Apr 16, 2025, 6:55 PM

#

sonic tendon maybe oAI sponso

they do

#

#

^ sponsors

fleet lintel Apr 16, 2025, 6:56 PM

#

I dont know but I was expecting more from o3/o4 models. I think they are similar or just marginally better than Gemini

sonic tendon Apr 16, 2025, 6:56 PM

#

gonna run my personal benchmarks

keen beacon Apr 16, 2025, 6:56 PM

#

o4 mini is looking good

#

that hallucination issue was solved by setting temp to 0

#

o4 mini does seem quite sensitive to temperature

balmy mist Apr 16, 2025, 6:57 PM

#

keen beacon that hallucination issue was solved by setting temp to 0

you use temp at 0?

#

whats the best temp you noticed?

#

using 0 is wild lol

keen beacon Apr 16, 2025, 6:57 PM

#

why

sonic tendon Apr 16, 2025, 6:58 PM

#

keen beacon why

CoT models can sometimes get stuck in loops

keen beacon Apr 16, 2025, 6:58 PM

#

still

keen beacon Apr 16, 2025, 6:58 PM

#

balmy mist whats the best temp you noticed?

for o4 mini it seems to be better at 0.4 or less

sonic tendon Apr 16, 2025, 6:58 PM

#

for math stuff i usually set it to 0 regardless

#

and/or lower top_p

keen beacon Apr 16, 2025, 6:58 PM

#

u have to set it higher anyway to stop it from getting into loops

#

if you touch one generally don't touch the other

balmy mist Apr 16, 2025, 6:59 PM

#

keen beacon for o4 mini it seems to be better at 0.4 or less

wow, what do you think is being used on openai site?

keen beacon Apr 16, 2025, 6:59 PM

#

iirc chatgpt uses 0.7

#

its probably at 1 lol

barren prairie Apr 16, 2025, 6:59 PM

#

keen beacon they didn't do that with any of the old o series models

O3 mini was and still on direct chat 🙂

keen beacon Apr 16, 2025, 6:59 PM

#

yup

#

at least on the api iirc it defaults to 1

sonic tendon Apr 16, 2025, 6:59 PM

#

both get the pebble test (not mine)

keen beacon Apr 16, 2025, 6:59 PM

#

i forgot side by side existed

#

good idea

balmy mist Apr 16, 2025, 6:59 PM

#

what does top_p do

calm sequoia Apr 16, 2025, 6:59 PM

#

calm sequoia My prompt is still in progress for 5 minutes of thinking. Server overload? Stuck...

Somewhat 100 sources pulled. This is it for me. Don't need better models for research.

keen beacon Apr 16, 2025, 7:00 PM

#

did they update deep research too?

calm sequoia Apr 16, 2025, 7:00 PM

#

100 sources without deep research 🙂

keen beacon Apr 16, 2025, 7:00 PM

#

oh wow

sonic tendon Apr 16, 2025, 7:01 PM

#

@keen beacon uh oh

spark shale Apr 16, 2025, 7:01 PM

#

What probability would you say there is that o3 or o4 mini will beat gemini 2.5 on the leaderboard?

sonic tendon Apr 16, 2025, 7:01 PM

#

never been sure what the ideal top_p value is

#

i've usually set it to like 0.4 for coding/math

keen beacon Apr 16, 2025, 7:02 PM

#

sonic tendon <@456226577798135808> uh oh

that's one wacky ass answer

#

i think the high variants would get it tbh

sonic tendon Apr 16, 2025, 7:02 PM

#

keen beacon that's one wacky ass answer

might have set temp too high? will double-check

keen beacon Apr 16, 2025, 7:02 PM

#

i like using 0.7 and 0.95 top_p particularly for local reasoning models

#

weren't those the recommended settings for R1?

#

not sure, it was for qwq tho

#

seems neither can get "Every day at 12pm, Relaxed Voyages spaceship departs from Liverpool for Dublin. Simultaneously, another Relaxed Voyages spaceship starts journey from Dublin to Liverpool. The journey takes 503 full hours in both directions.

How many Relaxed Voyages spaceships, traveling to Liverpool, will the spaceship departing now at 1pm from Liverpool encounter?" yet

#

1.0 and 0.9 top_p for even smaller reasoning models, this allows them to not get stuck with varying degress of success in the result

#

answer is 43

sonic tendon Apr 16, 2025, 7:04 PM

#

https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file#usage-recommendations

GitHub

GitHub - deepseek-ai/DeepSeek-R1

Contribute to deepseek-ai/DeepSeek-R1 development by creating an account on GitHub.

#

why is firefox not letting me click to copy

#

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:
Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
Avoid adding a system prompt; all instructions should be contained within the user prompt.
For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
When evaluating model performance, it is recommended to conduct multiple tests and average the results.
Additionally, we have observed that the DeepSeek-R1 series models tend to bypass thinking pattern (i.e., outputting "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend enforcing the model to initiate its response with "<think>\n" at the beginning of every output.

#

@keen beacon flawless jar test tho

keen beacon Apr 16, 2025, 7:05 PM

#

yeah

keen beacon Apr 16, 2025, 7:05 PM

#

sonic tendon <@456226577798135808> flawless jar test tho

🥳

sonic tendon Apr 16, 2025, 7:06 PM

#

i half-wonder if you've shared this with other AI people (not that i would mind, that would be cool as hell)

keen beacon Apr 16, 2025, 7:06 PM

#

it's one of the very few questions left in a set that i have that hasn't been cracked by any model

keen beacon Apr 16, 2025, 7:06 PM

#

sonic tendon i half-wonder if you've shared this with other AI people (not that i would mind,...

shared what?

sonic tendon Apr 16, 2025, 7:06 PM

#

it's sort of saturated though

keen beacon Apr 16, 2025, 7:06 PM

#

the jar q?

sonic tendon Apr 16, 2025, 7:06 PM

#

yeah

tall summit Apr 16, 2025, 7:06 PM

#

The Jar Test

keen beacon Apr 16, 2025, 7:06 PM

#

i haven't shared it outside our dms

sonic tendon Apr 16, 2025, 7:06 PM

#

ah, got it

#

thx

#

boink

keen beacon Apr 16, 2025, 7:09 PM

#

meow

sonic tendon Apr 16, 2025, 7:09 PM

#

mrrp

#

with DS parameter suggestions

tall summit Apr 16, 2025, 7:10 PM

#

sonic tendon with DS parameter suggestions

o4 mini wins every time

keen beacon Apr 16, 2025, 7:11 PM

#

o4 mini is a strong model

sonic tendon Apr 16, 2025, 7:11 PM

#

it's possible it's just sort of random, given the "aha" nature of the riddle

keen beacon Apr 16, 2025, 7:11 PM

#

yeah

sonic tendon Apr 16, 2025, 7:11 PM

#

but yeah, o4-mini is very very good

#

any idea what the model structure looks like?

keen beacon Apr 16, 2025, 7:12 PM

#

its based on 4.1 mini

balmy mist Apr 16, 2025, 7:12 PM

#

i just did simple bench test on mini and o3 and o4 mini just beat o3 lol

#

gonna rerun it

keen beacon Apr 16, 2025, 7:12 PM

#

which is i suspect a cpt of 4o mini. it has characteristic qualities of such

sonic tendon Apr 16, 2025, 7:12 PM

#

#

bloodbath in the polymarket comments section

balmy mist Apr 16, 2025, 7:13 PM

#

sonic tendon

did you make money?

sonic tendon Apr 16, 2025, 7:13 PM

#

balmy mist did you make money?

yeah

#

a tad

#

annoyingly, i bought 199 a while back and then sold early lol

#

sort of a victim of my indecisiveness

balmy mist Apr 16, 2025, 7:14 PM

#

yo o4 mini is cooking

#

wow

#

its so good

keen fulcrum Apr 16, 2025, 7:14 PM

#

sonic tendon

Depending on whether google drops
Sad there is not Qwen 3 on the table at all

sonic tendon Apr 16, 2025, 7:15 PM

#

ope, portfolio just went from 42 to 56 in a few seconds lmao

elder rapids Apr 16, 2025, 7:15 PM

#

o3 has some personality, o4 mini seems kinda wack on regular tasks

sonic tendon Apr 16, 2025, 7:15 PM

#

keen fulcrum Depending on whether google drops Sad there is not Qwen 3 on the table at all

yeah, dragontail and friends will probably move the needle a bit

keen fulcrum Apr 16, 2025, 7:16 PM

#

sonic tendon yeah, dragontail and friends will probably move the needle a bit

Wanna see some tests between o3 and dragontail

sonic tendon Apr 16, 2025, 7:16 PM

#

keen fulcrum Depending on whether google drops Sad there is not Qwen 3 on the table at all

i mean, qwen could maybe still release this month

sonic tendon Apr 16, 2025, 7:16 PM

#

keen fulcrum Wanna see some tests between o3 and dragontail

i think they've taken it off of the arena, unfortunately

keen fulcrum Apr 16, 2025, 7:16 PM

#

sonic tendon i mean, qwen could maybe still release this month

Planned

sonic tendon Apr 16, 2025, 7:17 PM

#

might sell now and then go all in once the leaderboard updates

keen beacon Apr 16, 2025, 7:17 PM

#

sonic tendon i think they've taken it off of the arena, unfortunately

nope

#

they have not

#

although it is very sporadic now

sonic tendon Apr 16, 2025, 7:17 PM

#

maybe they just lowered the priority

#

or something

balmy mist Apr 16, 2025, 7:17 PM

#

i dont really like o3 that much, it might be for someone else

sonic tendon Apr 16, 2025, 7:17 PM

#

not sure exactly how the matching algo works

balmy mist Apr 16, 2025, 7:17 PM

#

imma o4 mini type of guy lol

sonic tendon Apr 16, 2025, 7:18 PM

#

sonic tendon maybe they just lowered the priority

there's a property called baseSampleWeight in the web arena json that makes me think this is a thing

sonic tendon Apr 16, 2025, 7:18 PM

#

sonic tendon might sell now and then go all in once the leaderboard updates

but gemini 2.5 flash might still be problematic, if it's close

#

hmm

keen beacon Apr 16, 2025, 7:19 PM

#

sassy

sonic tendon Apr 16, 2025, 7:19 PM

#

keen beacon sassy

LMAOOOO

#

that's actually really good

sonic tendon Apr 16, 2025, 7:19 PM

#

keen beacon sassy

also, what account is that?

#

or someone else's twitter screenshot

keen beacon Apr 16, 2025, 7:20 PM

#

twitter ss

sonic tendon Apr 16, 2025, 7:20 PM

#

ah

elder rapids Apr 16, 2025, 7:22 PM

#

wtf? o3 gets a ton of puzzles wrong

sonic tendon Apr 16, 2025, 7:22 PM

#

sassy

#

(actually claude, surprisingly)

keen beacon Apr 16, 2025, 7:22 PM

#

its pretty normal

#

for claude to be overloaded

sonic tendon Apr 16, 2025, 7:23 PM

#

yeah

#

L

elder rapids Apr 16, 2025, 7:23 PM

#

o4 gets stuff right o3 can't

#

lol

#

crazy

sonic tendon Apr 16, 2025, 7:23 PM

#

they seem to be having a lot of scaling difficulties

sonic tendon Apr 16, 2025, 7:23 PM

#

elder rapids o4 gets stuff right o3 can't

it's really surprising, yeah

#

even some o1 passes with flying colors

elder rapids Apr 16, 2025, 7:23 PM

#

o3 has personality tho

#

o4 mini has a very synthetic verbosity

sage raptor Apr 16, 2025, 7:23 PM

#

who is better, 2.5 pro or o4 mini high?

elder rapids Apr 16, 2025, 7:24 PM

#

but it seems kinda mixed

sonic tendon Apr 16, 2025, 7:24 PM

#

sage raptor who is better, 2.5 pro or o4 mini high?

subjective, but ultimately wait and see

keen beacon Apr 16, 2025, 7:24 PM

#

we have a new king..

elder rapids Apr 16, 2025, 7:24 PM

#

sage raptor who is better, 2.5 pro or o4 mini high?

2.5 pro seems to still dominate In general tasks, but o4 mini is really nice in specific tasks

#

o3 seems to be like 4.5 ish

zinc ore Apr 16, 2025, 7:25 PM

#

Weird 2.5 does so poorly on their coding test compared to other coding benchmarks

elder rapids Apr 16, 2025, 7:25 PM

#

ye

zinc ore Apr 16, 2025, 7:25 PM

#

Main thing bringing it's average down

elder rapids Apr 16, 2025, 7:25 PM

#

zinc ore Main thing bringing it's average down

if they didn't change it and bring it down, it would be above o3

#

it used to be 82% global average I'm p sure

zinc ore Apr 16, 2025, 7:26 PM

#

Yeah that's sus

thorny drum Apr 16, 2025, 7:26 PM

#

they just made the coding part harder in the april update

#

apparently they think the questions were somehow contaminated as well

zinc ore Apr 16, 2025, 7:26 PM

#

24% drop lol

sonic tendon Apr 16, 2025, 7:26 PM

#

surprising

#

happens sometimes tho

#

(i bought in a couple hours ago)

zinc ore Apr 16, 2025, 7:27 PM

#

thorny drum apparently they think the questions were somehow contaminated as well

Basically, if you don't like the results certain models get, you can change the questions and retest for different results.

elder rapids Apr 16, 2025, 7:28 PM

#

thorny drum apparently they think the questions were somehow contaminated as well

tbh

#

I think it's kinda clear these models roles at these points

#

o4 mini is pretty narrow, and does really well at coding

thorny drum Apr 16, 2025, 7:29 PM

#

gemini still #1 at math

elder rapids Apr 16, 2025, 7:29 PM

#

o3 seems to be the replacement for o1

#

since o1 would basically brute force things

#

and have enough knowledge to get past

#

but o4 mini sucks at general tasks, and o3 does really well generally, but it's not the absolute best

balmy mist Apr 16, 2025, 7:38 PM

#

elder rapids but o4 mini sucks at general tasks, and o3 does really well generally, but it's ...

i wouldn't say it sucks lol

#

it just got released, vibes test matter most, benchmarks say one thing

#

but you gotta feel them out a lil

olive mesa Apr 16, 2025, 7:39 PM

#

wait o4-mini and o3 are out now?

balmy mist Apr 16, 2025, 7:39 PM

#

yeah lol

olive mesa Apr 16, 2025, 7:39 PM

#

oh i forgot it releases 2pm est today

elder rapids Apr 16, 2025, 7:39 PM

#

balmy mist it just got released, vibes test matter most, benchmarks say one thing

this is a vibe only thing I'm talking about tho wym?

#

I'm getting this from using it

#

benchmarks isn't gonna give me this information

balmy mist Apr 16, 2025, 7:40 PM

#

but you need more time with it, its been like 2 hours lol

#

and my vibes say its good at general

elder rapids Apr 16, 2025, 7:40 PM

#

balmy mist and my vibes say its good at general

o4 mini?

balmy mist Apr 16, 2025, 7:40 PM

#

"sucks" is kinda wild to say, what are you asking it?

#

yeah o4 mini

elder rapids Apr 16, 2025, 7:41 PM

#

balmy mist "sucks" is kinda wild to say, what are you asking it?

"what are you asking it" "it's vibes based"

#

but regardless

#

it's super synthetic in the way it speaks

keen beacon Apr 16, 2025, 7:41 PM

#

what model are we talking about here

elder rapids Apr 16, 2025, 7:41 PM

#

o3 doesn't have this

elder rapids Apr 16, 2025, 7:42 PM

#

keen beacon what model are we talking about here

o4 mini

keen beacon Apr 16, 2025, 7:42 PM

#

ah

#

yeah o3 is really nice with how it speaks

#

sounds clever

#

o4 mini is a bit

#

less like that

elder rapids Apr 16, 2025, 7:42 PM

#

o3 is just super good vibes

balmy mist Apr 16, 2025, 7:42 PM

#

to say o4 mini sucks at vibes is just wild

#

there is a lot of models that sucks at vibes

#

like 2.5 pro at times can be weird with how it talks

#

but i would not say it sucks at vibes

elder rapids Apr 16, 2025, 7:43 PM

#

crazy because 2.5 pro in my testing has the best vibes

#

but o3 seems like a really good competitor

#

prolly knows more too

balmy mist Apr 16, 2025, 7:44 PM

#

elder rapids crazy because 2.5 pro in my testing has the best vibes

what are your tests?

#

like give me an example

#

i wanna get the same vibes lol

brittle tiger Apr 16, 2025, 7:44 PM

#

Prompt: Match the names to the colored stick figures that their arrows are pointing to

o3 took 10 minutes and got wrong. I expected that but the thinking process and UI were very cool and impressive. It broke down into segments maybe a dozen times and reasoned over them

keen beacon Apr 16, 2025, 7:45 PM

#

brittle tiger Prompt: Match the names to the colored stick figures that their arrows are point...

got it all wrong?

balmy mist Apr 16, 2025, 7:45 PM

#

brittle tiger Prompt: Match the names to the colored stick figures that their arrows are point...

did it get any character right?

keen beacon Apr 16, 2025, 7:45 PM

#

could it match any of the names

#

yeah

elder rapids Apr 16, 2025, 7:46 PM

#

balmy mist like give me an example

like a random discussion I'm sending to 2.5 and asking it to justify the different positions and stuff? not sure how those can be clean as examples lol, as I implied before, and you said, it's "vibes"

#

if you don't think it has the best vibes

#

that's up to you

#

but for me, 2.5 is crazy for adjustment

balmy mist Apr 16, 2025, 7:46 PM

#

o4 mini lol

Bob → the red stick‑figure (top‑left)
Jack → the green stick‑figure (center)
Jimmy → the orange/tan stick‑figure (top‑right)
Tom → the blue stick‑figure (bottom‑right)
Adam → the yellow stick‑figure (bottom‑left)

novel flame Apr 16, 2025, 7:46 PM

#

In my coding tests, I'm seeing varying results.

In my good old "nontrivial real world PHP task" test:

o4-mini-high gives a good answer (slightly sassy, pushes back against the premise of the task, which is great, and proposes the most pragmatic approach possible, which is fantastic, then gives a solid implementation of this approach, and then goes on to provide the typical approach with a ....decidedly imperfect implementation?) Score: 9.5/10
o3 gives the best answer and code of any model I have tested, and it adds a bit of personality on top. Better than Gemini 2.5 Pro and Claude 3.7 Sonnet. Beautiful. Score: 10/10

In my new "browser game with a twist" test:

o4-mini-high does decently, though it is no match for 3.7 Sonnet (Thinking); it ends up scoring roughly the same on this test as Gemini 2.5 Pro, DeepSeek V3 0324, and Grok 3.
o3 absolutely cr@ps the bed on this one, generating an embarassingly bad game. It's flailing around below Llama 4 Mavericks for Zuck's sake.

olive mesa Apr 16, 2025, 7:46 PM

#

brittle tiger Prompt: Match the names to the colored stick figures that their arrows are point...

did you test that with gemini 2.5??

brittle tiger Apr 16, 2025, 7:46 PM

#

No model gets close. I didn't expect it to. Thinking process was v cool tho

keen beacon Apr 16, 2025, 7:46 PM

#

@brittle tiger can you give o3 this image in chatgpt and the prompt "You are one of the best GeoGuessr players in the world. Where is this a street view image of? Give your answer as coordinates."

#

would be interested to see how the reason with images thing handles it

balmy mist Apr 16, 2025, 7:47 PM

#

novel flame In my coding tests, I'm seeing varying results. In my good old "nontrivial real...

are you using a specific system prompt?

brittle tiger Apr 16, 2025, 7:47 PM

#

olive mesa did you test that with gemini 2.5??

Yes no model comes close

keen beacon Apr 16, 2025, 7:47 PM

#

keen beacon <@266308552111554560> can you give o3 this image in chatgpt and the prompt "You ...

(this is a random street view image btw)

novel flame Apr 16, 2025, 7:47 PM

#

balmy mist are you using a specific system prompt?

I am not -- I am trying to compare different models out of the box, and providing a special system prompt might skew the comparison

brittle tiger Apr 16, 2025, 7:48 PM

#

olive mesa did you test that with gemini 2.5??

Tho I think native multimodal enabled 2.5 might be a possible contender

elder rapids Apr 16, 2025, 7:48 PM

#

wonder if 2.5 pro is ever gonna get native image gen

#

or have a high thinking mode

zinc ore Apr 16, 2025, 7:49 PM

#

It regularly thinks 5 mins on the Gemini plays Pokemon twitch stream

#

When acting as a BTS Pathfinder

keen beacon Apr 16, 2025, 7:50 PM

#

https://x.com/julieswangg/status/1912565819260956946 openai employee confirms o3 and o4 mini are on different bases

Julie Wang (@julieswangg) on X

@legit_api @OpenAIDevs o3 and o4-mini are both our flagship reasoning models. they're built on different base models, and we expect them both to be extremely good at solving complex problems which require multiple steps.
o3 is the most powerful. o4-mini is faster and just as powerful in most cases..

elder rapids Apr 16, 2025, 7:50 PM

#

zinc ore It regularly thinks 5 mins on the Gemini plays Pokemon twitch stream

prob

#

I don't think this goes for the actual thinking length tho

zinc ore Apr 16, 2025, 7:50 PM

#

And it actually succeeds with its pathfinding on puzzles

keen beacon Apr 16, 2025, 7:50 PM

#

When will o4 drop?

sage raptor Apr 16, 2025, 7:51 PM

#

elder rapids Apr 16, 2025, 7:51 PM

#

sage raptor

this is probably just because the high context ability

#

Gemini will absolutely be the best long context reasoner

#

for the next year

olive mesa Apr 16, 2025, 7:52 PM

#

keen beacon When will o4 drop?

o4 pro when?

barren prairie Apr 16, 2025, 7:52 PM

#

O5 when?

hardy pecan Apr 16, 2025, 7:52 PM

#

Ran it through simplebench's 20 public questions

balmy mist Apr 16, 2025, 7:52 PM

#

hardy pecan Ran it through simplebench's 20 public questions

both?

hardy pecan Apr 16, 2025, 7:53 PM

#

o3 - 7/20
o4-mini - 3/20

hardy pecan Apr 16, 2025, 7:53 PM

#

balmy mist both?

yep

elder rapids Apr 16, 2025, 7:53 PM

#

hardy pecan o3 - 7/20 o4-mini - 3/20

nahhh

hardy pecan Apr 16, 2025, 7:53 PM

#

@pass1

elder rapids Apr 16, 2025, 7:53 PM

#

we'll wait

#

for simplebench

balmy mist Apr 16, 2025, 7:53 PM

#

hardy pecan o3 - 7/20 o4-mini - 3/20

this an average score?/

elder rapids Apr 16, 2025, 7:53 PM

#

no

hardy pecan Apr 16, 2025, 7:53 PM

#

This was just pass@1

#

for 20 questions

#

Simplebench is actually 200 questions @ pass 5 I believe

#

So, high variance

elder rapids Apr 16, 2025, 7:54 PM

#

o3 will probably get above 50%

#

o4 mini will likely get way lower

hardy pecan Apr 16, 2025, 7:55 PM

#

Does anyone know the limits for plus users?

elder rapids Apr 16, 2025, 7:55 PM

#

probably the same as their predecessors

brittle tiger Apr 16, 2025, 7:55 PM

#

keen beacon <@266308552111554560> can you give o3 this image in chatgpt and the prompt "You ...

53.6740° N, 2.0550° W

olive mesa Apr 16, 2025, 7:56 PM

#

woah

elder rapids Apr 16, 2025, 7:56 PM

#

is that the answer

thorny drum Apr 16, 2025, 7:56 PM

#

brittle tiger 53.6740° N, 2.0550° W

can you show the thinking

balmy mist Apr 16, 2025, 7:56 PM

#

nahh

sage raptor Apr 16, 2025, 7:57 PM

#

#

lol

elder rapids Apr 16, 2025, 7:57 PM

#

damn

keen beacon Apr 16, 2025, 7:58 PM

#

brittle tiger 53.6740° N, 2.0550° W

WHAT

thorny drum Apr 16, 2025, 7:58 PM

#

is it correct? I'd imagine these are the tools they were talking about right

keen beacon Apr 16, 2025, 7:58 PM

#

#

the red point

#

is the actual location

#

the marker

#

is the guess

calm sequoia Apr 16, 2025, 7:58 PM

#

They mentioned maps, maybe it have access to it

keen beacon Apr 16, 2025, 7:58 PM

#

holy moly

calm sequoia Apr 16, 2025, 7:59 PM

#

Seems like military will have some use for that

olive mesa Apr 16, 2025, 7:59 PM

#

that's crazy

calm sequoia Apr 16, 2025, 7:59 PM

#

Do you guys still believe the R2 gonna top benchmark? 😄

keen beacon Apr 16, 2025, 7:59 PM

#

that is almost a 5k if it was a geoguessr game

#

flawless guess

sage raptor Apr 16, 2025, 7:59 PM

#

how did it get right

keen beacon Apr 16, 2025, 8:00 PM

#

brittle tiger 53.6740° N, 2.0550° W

can you show me its reasoning

#

would be interested

#

if it could use the map/view the area like in geoguessr it could probably nail it

keen beacon Apr 16, 2025, 8:00 PM

#

keen beacon if it could use the map/view the area like in geoguessr it could probably nail i...

oh imagine

#

that would be cool asf

olive mesa Apr 16, 2025, 8:00 PM

#

calm sequoia Seems like military will have some use for that

the military's probably using o5 rn for that lmao

keen beacon Apr 16, 2025, 8:01 PM

#

it's a bit further off but still closer than any other llm by far

#

2.5 pro gets the closest but it's still hunderds of miles off lmao

brittle tiger Apr 16, 2025, 8:01 PM

#

keen beacon can you show me its reasoning

https://imgur.com/a/vXCBMVX

The image reasoning is very cool. It breaks up the images into small segments to analyze. What caught me off guard on first prompt I shared

Imgur

Untitled Album

keen beacon Apr 16, 2025, 8:01 PM

#

wow what

#

that's genuinely

#

woah

#

i might sub to chatgpt for the first time in forever for this

elder rapids Apr 16, 2025, 8:02 PM

#

keen beacon it's a bit further off but still closer than any other llm by far

just tested it with Gemini 2.5 pro and it seems to get a similar answer

keen beacon Apr 16, 2025, 8:02 PM

#

what were the coords

elder rapids Apr 16, 2025, 8:03 PM

#

Latitude: 53.5969, Longitude: -2.0173

keen beacon Apr 16, 2025, 8:03 PM

#

interesting

elder rapids Apr 16, 2025, 8:03 PM

#

yo are you sure you guys are using 2.5 pro right

keen beacon Apr 16, 2025, 8:03 PM

#

yeah it's about as far away as o3's guess

#

what temperature did you have it on

elder rapids Apr 16, 2025, 8:03 PM

#

I just hopped on AI studio

#

lol

#

created a new chat and boom

olive mesa Apr 16, 2025, 8:03 PM

#

isnt the default temp 1

elder rapids Apr 16, 2025, 8:03 PM

#

also, o3 has its own grounding

keen beacon Apr 16, 2025, 8:04 PM

#

brittle tiger https://imgur.com/a/vXCBMVX The image reasoning is very cool. It breaks up the ...

do you mind if i give you another that's a bit harder

#

less buildings

#

will find something 2.5 pro flops

elder rapids Apr 16, 2025, 8:04 PM

#

this is what I get with grounding too

#

same exact thing

#

not sure these tests are that impressive

keen beacon Apr 16, 2025, 8:05 PM

#

slow down buckaroo

elder rapids Apr 16, 2025, 8:05 PM

#

especially when these were probably built around even things like Google earth in terms of geographic knowledge

keen beacon Apr 16, 2025, 8:05 PM

#

i've tested models extensively with geoguessr style tasks

#

and 2.5 pro does have big misses sometimes

#

so i will find a better image to use

zinc ore Apr 16, 2025, 8:06 PM

#

Basically gotta stress test both then

elder rapids Apr 16, 2025, 8:06 PM

#

keen beacon and 2.5 pro does have big misses sometimes

ye but, it only needs to be good until a certain point, and then if it effectively can search online

#

and then boom

#

5000 point geoguessr models

#

going from vague, to absolutely knowing the answer

#

and then it's solved

north silo Apr 16, 2025, 8:07 PM

#

Google has to drop its new model like nightwhisper now right

olive mesa Apr 16, 2025, 8:08 PM

#

north silo Google has to drop its new model like nightwhisper now right

yeah

#

maybe in a couple days or a weekish

elder rapids Apr 16, 2025, 8:08 PM

#

kinda making me wonder about r2

#

not sure if it's gonna actually be that good

keen beacon Apr 16, 2025, 8:09 PM

#

i think i may have stumbled across a good street view one to test o3 with

#

checking

elder rapids Apr 16, 2025, 8:09 PM

#

yep

elder rapids Apr 16, 2025, 8:09 PM

#

keen beacon i think i may have stumbled across a good street view one to test o3 with

can you check the thought process

#

oh nvm

#

ye I think o3 pro will handedly replace it

keen beacon Apr 16, 2025, 8:10 PM

#

yup found a good one

elder rapids Apr 16, 2025, 8:10 PM

#

but it's gonna be like how o3 mini was to o1 pro again

keen beacon Apr 16, 2025, 8:10 PM

#

"You are one of the best GeoGuessr players in the world. Where is this a street view image of? Give your answer as coordinates." @brittle tiger

#

or @deep adder i don't mind lol

keen beacon Apr 16, 2025, 8:11 PM

#

keen beacon "You are one of the best GeoGuessr players in the world. Where is this a street ...

2.5 pro gets the wrong continent lmao

#

anyway brb

balmy mist Apr 16, 2025, 8:13 PM

#

the way it reasons is so interesting

#

its like a person almost lol

#

https://x.com/emollick/status/1912597487287705965

Ethan Mollick (@emollick) on X

"o3, make me a movie i can download that involves an otter and an airplane. figure out how to do it with the tools you have."

o3 has no movie capability, so It improvises decides to draw each frame and then stitch them together into a GIF to download, this was all first shot

balmy mist Apr 16, 2025, 8:16 PM

#

keen beacon 2.5 pro gets the wrong continent lmao

o3 takes a minute to reason but it is entertaining to watch

sage raptor Apr 16, 2025, 8:16 PM

#

keen beacon "You are one of the best GeoGuessr players in the world. Where is this a street ...

2.5 pro

torn mantle Apr 16, 2025, 8:16 PM

#

can we guess whos the first AI lab to copy openai image reasoning zoom feature?

#

i will go with XAI

elder rapids Apr 16, 2025, 8:17 PM

#

torn mantle can we guess whos the first AI lab to copy openai image reasoning zoom feature?

I don't think it's that important tho as a feature

#

2.5 pro can effectively do the same exact thing

#

it just wouldn't output the zoomed in image itself

balmy mist Apr 16, 2025, 8:18 PM

#

sage raptor 2.5 pro

-23.960579, 25.163853

#

this is what my 2.5 pro got

elder rapids Apr 16, 2025, 8:18 PM

#

@keen beacon what's the answer

sage raptor Apr 16, 2025, 8:18 PM

#

balmy mist -23.960579, 25.163853

Analysis:
Sky: Partly cloudy, suggests a climate that isn't perpetually overcast but also not necessarily pure desert blue. The haze could indicate dust or distance.
Landscape: Arid or semi-arid. Low, rolling hills in the background. The ground cover is sparse.
Vegetation: Dominated by scattered, low-lying, drought-resistant bushes and small trees. The type looks somewhat like acacia or similar scrub found in dry regions.
Overall Feel: This points towards a dry, potentially warm climate. Possible locations include parts of Australia, Southern Africa, the American Southwest, Mexico, or perhaps parts of the Mediterranean or Middle East. The specific look of the scrub and the hills feels quite characteristic of parts of inland Australia or Southern Africa.
Finding the Exact Location (using simulated reverse image search):
A reverse image search pinpoints this location.
Evidence:
The Google Street View image matches views found within Goobang National Park, New South Wales, Australia. The characteristic hills, dry vegetation, and sky are consistent with this region.
Coordinates:
Based on matching the Street View imagery within Goobang National Park on Google Maps, the approximate coordinates for this view are:
-32.8557, 148.3851 (or approximately 32°51'20.5"S 148°23'06.4"E)

balmy mist Apr 16, 2025, 8:18 PM

#

tbh if its as close as o3 then 2.5 is a beast bc o3 takes forever

#

o3: Approximate coordinates: 29.9 ° N, –103.1 ° W

That places the view in the northern Chihuahuan‑Desert hills just north of Big Bend National Park, West Texas (US Highway 385 / TX‑118 area, Persimmon Gap vicinity).

sonic tendon Apr 16, 2025, 8:19 PM

#

balmy mist https://x.com/emollick/status/1912597487287705965

now this is an interesting benchmark

#

unfortunately definitely not possible in current lmarena lol

elder rapids Apr 16, 2025, 8:19 PM

#

sonic tendon now this is an interesting benchmark

o3 is kinda alone tho

sonic tendon Apr 16, 2025, 8:19 PM

#

sonic tendon unfortunately definitely not possible in current lmarena lol

well, as far as I'm aware

elder rapids Apr 16, 2025, 8:19 PM

#

cuz it's the only model that's actually able to use tools like that

sonic tendon Apr 16, 2025, 8:20 PM

#

sort of UI-dependent too

sonic tendon Apr 16, 2025, 8:20 PM

#

elder rapids cuz it's the only model that's actually able to use tools like that

yeah

calm sequoia Apr 16, 2025, 8:20 PM

#

Try using your own photos. The guesses are way off.

sonic tendon Apr 16, 2025, 8:20 PM

#

model + tool integration benchmark, but even that's fishy

#

eh

balmy mist Apr 16, 2025, 8:20 PM

#

you dont have sub for gpt?

ocean vortex Apr 16, 2025, 8:21 PM

#

sonic tendon model + tool integration benchmark, but even that's fishy

yeah you should only pay attention to no tools basically lol

#

balmy mist Apr 16, 2025, 8:21 PM

#

gemini couldnt do this:
make me a movie i can download that involves superheroes. figure out how to do it with the tools you have.

#

but o3 can

keen beacon Apr 16, 2025, 8:21 PM

#

every single guess from o3 and 2.5 was way off ☹️

sonic tendon Apr 16, 2025, 8:21 PM

#

calm sequoia Try using your own photos. The guesses are way off.

this could be fun (and relatively easy) to do in bulk

balmy mist Apr 16, 2025, 8:21 PM

#

very interesting

sonic tendon Apr 16, 2025, 8:21 PM

#

new benchmark???

balmy mist Apr 16, 2025, 8:21 PM

#

sonic tendon new benchmark???

yup

elder rapids Apr 16, 2025, 8:21 PM

#

balmy mist gemini couldnt do this: make me a movie i can download that involves superheroe...

because it doesn't have those tools lol

keen beacon Apr 16, 2025, 8:22 PM

#

although o3 was a little closer.. the answer was north west peru (not at my laptop rn)

sonic tendon Apr 16, 2025, 8:22 PM

#

sonic tendon model + tool integration benchmark, but even that's fishy

okay yeah i'm wrong here
it's an interesting demo, at least

elder rapids Apr 16, 2025, 8:22 PM

#

it can try via varying SVG or something

#

but it won't think that's what you want

#

or think that long

sonic tendon Apr 16, 2025, 8:22 PM

#

sonic tendon this could be fun (and relatively easy) to do in bulk

GeoGuessMark

keen beacon Apr 16, 2025, 8:22 PM

#

a benchmark with a harness where it can also see the map and move it/zoom/etc would be very cool

#

yeah

sonic tendon Apr 16, 2025, 8:22 PM

#

keen beacon a benchmark with a harness where it can also see the map and move it/zoom/etc wo...

well, that seems very involved, but interesting

keen beacon Apr 16, 2025, 8:22 PM

#

its definitely possible with current models rn

sonic tendon Apr 16, 2025, 8:23 PM

#

I was just gonna pull a giant database of geotagged images or something

keen beacon Apr 16, 2025, 8:23 PM

#

hopefully computer use + the image reasoning stuff gets integrated into one cool ass agent soon enough

balmy mist Apr 16, 2025, 8:23 PM

#

kinda bad but the fact that it can do this is cool

sonic tendon Apr 16, 2025, 8:23 PM

#

or create them with google maps, but that'd be more finicky

keen beacon Apr 16, 2025, 8:23 PM

#

doesnt geoguessr use google maps data?

#

street view etc

#

yup

tall summit Apr 16, 2025, 8:23 PM

#

balmy mist kinda bad but the fact that it can do this is cool

woah

sonic tendon Apr 16, 2025, 8:23 PM

#

yeah

sonic tendon Apr 16, 2025, 8:24 PM

#

sonic tendon I was just gonna pull a giant database of geotagged images or something

could probably whip that up tonight, even

#

but i'm sorta tired

#

mayb tomorrow

keen beacon Apr 16, 2025, 8:24 PM

#

if u ran the benchmark it would be extremely expensive for a single run though

#

honestly my nerd ass just finds the way it reasons through the image interesting asf to watch

sonic tendon Apr 16, 2025, 8:24 PM

#

keen beacon if u ran the benchmark it would be extremely expensive for a single run though

yeah, might have to abuse Google's free rate limits

keen beacon Apr 16, 2025, 8:25 PM

#

keen beacon if u ran the benchmark it would be extremely expensive for a single run though

multiple turns for tool use, e.g. zooming/handling the map

balmy mist Apr 16, 2025, 8:25 PM

#

what are our limits for plus o3?

sonic tendon Apr 16, 2025, 8:25 PM

#

for oAI, maybe just beg on my knees for sponsorships, cuz otherwise it ain't happening

balmy mist Apr 16, 2025, 8:25 PM

#

im starting to like o3 lol

#

tall summit Apr 16, 2025, 8:25 PM

#

balmy mist im starting to like o3 lol

the first one was better lmao

balmy mist Apr 16, 2025, 8:25 PM

#

tall summit the first one was better lmao

idk bro

keen beacon Apr 16, 2025, 8:26 PM

#

id be up to write it but ya not viable to actually run unless u have funding lol

balmy mist Apr 16, 2025, 8:26 PM

#

i actually like this one

#

it added the characters flying in the city

sonic tendon Apr 16, 2025, 8:26 PM

#

keen beacon multiple turns for tool use, e.g. zooming/handling the map

on that note, maybe you'd just want to train a transformer (or CNN or something) from scratch, but that's obviously a much much bigger endeavor lol

tall summit Apr 16, 2025, 8:26 PM

#

how did it even make that

keen beacon Apr 16, 2025, 8:26 PM

#

you dont need to tbh

sonic tendon Apr 16, 2025, 8:26 PM

#

sonic tendon on that note, maybe you'd just want to train a transformer (or CNN or something)...

geolocate any image with reasonable accuracy

tall summit Apr 16, 2025, 8:26 PM

#

balmy mist it added the characters flying in the city

well now my eyes hurt more

sonic tendon Apr 16, 2025, 8:26 PM

#

keen beacon you dont need to tbh

?

hardy pecan Apr 16, 2025, 8:27 PM

#

it got my geolocation of my image correctly, just the co-ords were off

#

But it searched and found the new houses in the picture on the realesate website lol
https://chatgpt.com/share/6800111a-0848-8003-834e-161aeeaea951

ChatGPT

ChatGPT - Auckland Location Identified

Shared via ChatGPT

sonic tendon Apr 16, 2025, 8:27 PM

#

keen beacon id be up to write it but ya not viable to actually run unless u have funding lol

oh, nice

tall summit Apr 16, 2025, 8:27 PM

#

there already is that famous geoguessr ai

sonic tendon Apr 16, 2025, 8:27 PM

#

tall summit there already is that famous geoguessr ai

?

balmy mist Apr 16, 2025, 8:27 PM

#

it would be cool if o3 could use 4o image gen

keen beacon Apr 16, 2025, 8:27 PM

#

ive seen it do it

balmy mist Apr 16, 2025, 8:27 PM

#

nahh i love o3 now, its a good model lol

keen beacon Apr 16, 2025, 8:27 PM

#

it just calls gpt 4o image gen tho

sonic tendon Apr 16, 2025, 8:28 PM

#

sonic tendon geolocate any image with reasonable accuracy

well, an image to text transformer would be good for elaborating and making multiple conditional guesses

balmy mist Apr 16, 2025, 8:28 PM

#

keen beacon it just calls gpt 4o image gen tho

really?

#

how did you prompt it?

keen beacon Apr 16, 2025, 8:28 PM

#

not me i saw a screenshot of it happening lol

keen beacon Apr 16, 2025, 8:28 PM

#

sonic tendon geolocate any image with reasonable accuracy

this was actually done with pigeon, built by some Stanford students, about a year ago. it cooked a GeoGuessr pro in a very entertaining video

#

https://lukashaas.github.io/PIGEON-CVPR24/

PIGEON: Predicting Image Geolocations

PIGEON CVPR 2024 Project Website

sonic tendon Apr 16, 2025, 8:28 PM

#

keen beacon this was actually done with pigeon, built by some Stanford students, about a yea...

oooo

#

i gtg but can look at it in a sec

keen beacon Apr 16, 2025, 8:28 PM

#

unfortunately the model isn't public

tall summit Apr 16, 2025, 8:29 PM

#

keen beacon https://lukashaas.github.io/PIGEON-CVPR24/

this is what i was referring to as well

balmy mist Apr 16, 2025, 8:31 PM

#

nahh o3 might be a cooker, i told it to use 4o image gen for the movie lets see what it does lol

keen beacon Apr 16, 2025, 8:31 PM

#

might deplete ur 4o image gen quota quickly tho

#

idk how much it is

elder rapids Apr 16, 2025, 8:32 PM

#

balmy mist nahh o3 might be a cooker, i told it to use 4o image gen for the movie lets see ...

if it can actually use 4o like that

#

that would be kinda sick

balmy mist Apr 16, 2025, 8:33 PM

#

lol

#

it said it will make 8 images

#

and add it to movie

#

i doubt it will tho

#

but ill let it cook

#

damn my wifi keeps dropping

wintry tinsel Apr 16, 2025, 8:33 PM

#

What did I miss is O3 a new frontier?

balmy mist Apr 16, 2025, 8:33 PM

#

#

can someone else try this?

tall summit Apr 16, 2025, 8:34 PM

#

wintry tinsel What did I miss is O3 a new frontier?

yes

balmy mist Apr 16, 2025, 8:34 PM

#

prompt: o3, make me a movie i can download that involves superheroes. figure out how to do it with the tools you have, also use gpt 4o image generation for assets

tall summit Apr 16, 2025, 8:36 PM

#

o3 and o4-mini are also in alpha ui!

wintry tinsel Apr 16, 2025, 8:36 PM

#

Is the full o4 chat gpt 5?

tall summit Apr 16, 2025, 8:36 PM

#

i don't know whether anyone's said that but it's nice

elder rapids Apr 16, 2025, 8:37 PM

#

wintry tinsel Is the full o4 chat gpt 5?

it's probably gonna be ye

#

ditching o3 when it's ≈ 2.5 pro was the right move

#

btw the gap between 3.5 and 4, with 4 → o1 was surpassed

balmy mist Apr 16, 2025, 8:42 PM

#

#

wild

elder rapids Apr 16, 2025, 8:42 PM

#

now they're really trying to make gpt 5 into a monster

tall summit Apr 16, 2025, 8:43 PM

#

balmy mist

i asked o3 on lmarena which of course is text only and it gave me multiple full workflows and a script + plot beats which is kinda really cool?

balmy mist Apr 16, 2025, 8:45 PM

#

tall summit i asked o3 on lmarena which of course is text only and it gave me multiple full ...

wow

#

thats dope

#

i see what openai is tryign to do now

#

#

i told it to update the prompt for me lol

sonic tendon Apr 16, 2025, 8:45 PM

#

tall summit i asked o3 on lmarena which of course is text only and it gave me multiple full ...

well, lmarena does let you attach images

#

not sure if the direct chat does

tall summit Apr 16, 2025, 8:46 PM

#

sonic tendon well, lmarena does let you attach images

it cant send them back.

#

can it?

balmy mist Apr 16, 2025, 8:46 PM

#

nahh this is dope man, it can really create images for you and put them together into a movie lol, o3 feels like a human tbh

sonic tendon Apr 16, 2025, 8:46 PM

#

ohh

balmy mist Apr 16, 2025, 8:46 PM

#

like its thinking

sonic tendon Apr 16, 2025, 8:46 PM

#

i mean, you can ask it to write an svg, but no

balmy mist Apr 16, 2025, 8:46 PM

#

kinda scary

tall summit Apr 16, 2025, 8:46 PM

#

sonic tendon i mean, you can ask it to write an svg, but no

thought so

tall summit Apr 16, 2025, 8:47 PM

#

balmy mist nahh this is dope man, it can really create images for you and put them together...

ill just wait.. a year until it becomes freely available

balmy mist Apr 16, 2025, 8:47 PM

#

lol

elder rapids Apr 16, 2025, 8:47 PM

#

crazy how anthropic might just go poof

balmy mist Apr 16, 2025, 8:47 PM

#

imma buy the $200 once we get o3 pro

elder rapids Apr 16, 2025, 8:47 PM

#

in terms of enterprise, models

#

everything

tall summit Apr 16, 2025, 8:48 PM

#

uhh why

#

what

balmy mist Apr 16, 2025, 8:48 PM

#

nahh the integration with tools is actuall fire af

elder rapids Apr 16, 2025, 8:48 PM

#

they kinda don't have anything, and they're playing catch up with everything else like distribution and utility

tall summit Apr 16, 2025, 8:48 PM

#

balmy mist nahh the integration with tools is actuall fire af

ikr lucky people who have access to it

tall summit Apr 16, 2025, 8:48 PM

#

elder rapids they kinda don't have anything, and they're playing catch up with everything els...

ok but youre being dramatic

elder rapids Apr 16, 2025, 8:48 PM

#

how tho?

#

this is kinda reminiscent of when 3.7 released too, especially with their thinking model, even if they had suddenly the best chat model

#

they won't have the means to distribute it and maintain these things

balmy mist Apr 16, 2025, 8:50 PM

#

yo wth lol

elder rapids Apr 16, 2025, 8:50 PM

#

he asked it to

#

and I think it's hallucinating the stitch thing

balmy mist Apr 16, 2025, 8:51 PM

#

wym ?

elder rapids Apr 16, 2025, 8:51 PM

#

?

keen beacon Apr 16, 2025, 8:51 PM

#

its not

#

it can do that

balmy mist Apr 16, 2025, 8:51 PM

#

i told it to make movie with using 4o image gen

tall summit Apr 16, 2025, 8:51 PM

#

elder rapids he asked it to

oh oops didnt see

balmy mist Apr 16, 2025, 8:51 PM

#

then while it was generating one of the images

elder rapids Apr 16, 2025, 8:52 PM

#

it can output the stitched frames?

keen beacon Apr 16, 2025, 8:52 PM

#

yea it has python tools etc

balmy mist Apr 16, 2025, 8:52 PM

#

it got an error saying:
I hit a snag: when I tried to generate the superhero artwork, the image service flagged my prompt as violating policy, so it wouldn’t let me create the assets. Could you give me a fresh description (or tweak the one you had in mind) that stays safely “PG”—e.g., no explicit violence or gore? Once I have an approved prompt, I can generate the images and stitch them into a downloadable short movie for you.

#

theni said: can you update the prompt for me and make the movie

elder rapids Apr 16, 2025, 8:52 PM

#

keen beacon yea it has python tools etc

hollon what

keen beacon Apr 16, 2025, 8:52 PM

#

yup

balmy mist Apr 16, 2025, 8:52 PM

#

and then it gave me that last output, it managed one image, but the second one got caught

elder rapids Apr 16, 2025, 8:52 PM

#

yeah but those aren't like

balmy mist Apr 16, 2025, 8:53 PM

#

o3 is amazing

elder rapids Apr 16, 2025, 8:53 PM

#

what makes it stitchable

balmy mist Apr 16, 2025, 8:53 PM

#

someone try it with o4 mini

elder rapids Apr 16, 2025, 8:53 PM

#

I wanna see this fr

keen beacon Apr 16, 2025, 8:53 PM

#

elder rapids what makes it stitchable

why not? u can do it in python and it has access to a python interpreter

elder rapids Apr 16, 2025, 8:54 PM

#

ngl I had no idea it had so much tools

keen beacon Apr 16, 2025, 8:54 PM

#

https://xcancel.com/emollick/status/1912597487287705965 sm1 did it already but not with gpt 4o image gen

Nitter

Ethan Mollick (@emollick)

"o3, make me a movie i can download that involves an otter and an airplane. figure out how to do it with the tools you have."

o3 has no movie capability, so It improvises decides to draw each frame and then stitch them together into a GIF to download, this was all first shot

elder rapids Apr 16, 2025, 8:54 PM

#

ye I saw that

#

and that's what Im basing this off of

#

that's different from generating the image and being able to stitch them together in house tho

#

so I'm kinda skeptical

#

but I do wanna see

keen beacon Apr 16, 2025, 8:55 PM

#

well it has all the tools in the context window to do so

#

but im not sure if it can access the generated images in the python environment without you reuploading them

elder rapids Apr 16, 2025, 8:56 PM

#

ye but, it has to also intermittently analyze 4os own output

elder rapids Apr 16, 2025, 8:56 PM

#

keen beacon but im not sure if it can access the generated images in the python environment ...

ye

keen beacon Apr 16, 2025, 8:57 PM

#

its a product / integration thing at that point though

elder rapids Apr 16, 2025, 8:57 PM

#

yes

#

they said o3 was trained to specifically use these tools in certain ways

torn mantle Apr 16, 2025, 8:57 PM

#

so far

#

o4 mini

#

isnt good at frontend coding

#

small tasks

elder rapids Apr 16, 2025, 8:57 PM

#

fr?

#

o4 mini and o3 both seem to suck at following inductive tasks

keen beacon Apr 16, 2025, 8:58 PM

#

elder rapids ye but, it has to also intermittently analyze 4os own output

i dont think this is a problem with them especially having image manipulation tools (so it can analyze images/call tools subsequently). it can do that but the product might not be integrated in a way that currently makes this possible. (e.g. generated images being inaccessible in the python env until the user reuploads them)

elder rapids Apr 16, 2025, 8:58 PM

#

pretty badly too

keen beacon Apr 16, 2025, 8:58 PM

#

this is as middle of nowhere as middle of nowhere gets bro 🙏😭

tall summit Apr 16, 2025, 8:58 PM

#

unnamed road

#

thats brilliant

keen beacon Apr 16, 2025, 8:59 PM

#

o4 mini gets it, o3 says namibia

torn mantle Apr 16, 2025, 8:59 PM

#

elder rapids o4 mini and o3 both seem to suck at following inductive tasks

yea

balmy mist Apr 16, 2025, 8:59 PM

#

keen beacon https://xcancel.com/emollick/status/1912597487287705965 sm1 did it already but n...

yeah thats where i got the idea from lol

tall summit Apr 16, 2025, 8:59 PM

#

O4 MINI GETS IT HOW

torn mantle Apr 16, 2025, 8:59 PM

#

not good so far

#

maybe i need to do more tests

elder rapids Apr 16, 2025, 8:59 PM

#

I've been doing a TON of tests

keen beacon Apr 16, 2025, 8:59 PM

#

tall summit O4 MINI GETS IT HOW

not the right part of botswana but i was surprised it even got the right country

elder rapids Apr 16, 2025, 8:59 PM

#

this seems to be the only problem

#

but that's still a major flaw icl

balmy mist Apr 16, 2025, 8:59 PM

#

https://www.youtube.com/watch?v=3aRRYQEb99s&ab_channel=AIExplained

YouTube

AI Explained

o3 and o4-mini - they’re great, but easy to over-hype

Critical analysis of the two most powerful new models behind ChatGPT, o3 and o4-mini. Not just the system cards, benchmarks, and my own tests, but some you may not have seen before. Yes, they can whip up amazing front-end in a few seconds, but you always have to ask what is in their data. Either way, they prove the gains from RL are just beginni...

▶ Play video

elder rapids Apr 16, 2025, 9:00 PM

#

the fact it won't easily adjust to the user is kinda bad

tall summit Apr 16, 2025, 9:00 PM

#

balmy mist https://www.youtube.com/watch?v=3aRRYQEb99s&ab_channel=AIExplained

first ai explained video that i dont need because im in this server now 🎉

torn mantle Apr 16, 2025, 9:08 PM

#

they finetuned it on threejs apps

#

physical simulations

#

with complex reasoning

#

frontend/backend its meh

#

python its meh

#

c# its the usual

#

general reasoning its pretty good

#

but thats it

elder rapids Apr 16, 2025, 9:09 PM

#

ye, but honestly with that, 2.5 is still the general king

#

o3 is just so smart

torn mantle Apr 16, 2025, 9:09 PM

#

2.5 is pretty good overall

#

well balanced

elder rapids Apr 16, 2025, 9:10 PM

#

but it's constantly missing things

torn mantle Apr 16, 2025, 9:10 PM

#

the reasoning approach used by google is different

#

so we have a slight edge to oai reasoning method

#

since its better ( for now )

elder rapids Apr 16, 2025, 9:10 PM

#

this is so crazy to me tbh

#

how did Google do that

#

they didn't even release 2.0 pro, they just got rid of it

zinc ore Apr 16, 2025, 9:11 PM

#

Google's reasoning method is a bit more flexible

torn mantle Apr 16, 2025, 9:11 PM

#

high quality data
trial-error for the best RL algorithm ( they just mentioned that recently )
a lot of experiments giving how much TPUs they have
smart team ( gdm )

#

https://x.com/slow_developer/status/1911907316099916038

Haider. (@slow_developer) on X

Google DeepMind, David Silver reveals:

we built a system that used RL to discover its own RL algorithms.

this AI-designed system outperformed all human-created RL algorithms developed over the years.

#

we built a system that used RL to discover its own RL algorithms.

this AI-designed system outperformed all human-created RL algorithms developed over the years.

elder rapids Apr 16, 2025, 9:13 PM

#

wonder how this correlates to where they are now

#

in the beginning it was super super slow

#

around 1.5~ yrs ago

ornate stump Apr 16, 2025, 9:13 PM

#

Just got back from work and I thought we were gonna have a huge leap, but I still kinda like Gemini 2.5 as a "PhD-level science assistant," at least. am i biased ?

elder rapids Apr 16, 2025, 9:13 PM

#

and then they went from 1.5 pro → 1.5 pro 002, which was a large leap, and then with little time, to 1206, and then 2.0 pro, and then ditching for 2.5 pro

elder rapids Apr 16, 2025, 9:14 PM

#

ornate stump Just got back from work and I thought we were gonna have a huge leap, but I stil...

this seems to be the consensus tbh

balmy mist Apr 16, 2025, 9:14 PM

#

lol from o4 mini

tall summit Apr 16, 2025, 9:15 PM

#

i got jumpscared

#

somehow

zinc ore Apr 16, 2025, 9:15 PM

#

Yeah my brain totally processed all those images in a split second

ornate stump Apr 16, 2025, 9:15 PM

#

Yeah, ChatGPT has a better output format. If I were still a student, I would definitely use that, but Gemini seems really sophisticated.

keen beacon Apr 16, 2025, 9:16 PM

#

balmy mist lol from o4 mini

it can access the generated images in the python env?

balmy mist Apr 16, 2025, 9:16 PM

#

yeah

keen beacon Apr 16, 2025, 9:16 PM

#

oh thats cool

#

maybe ask it to make assets then animate instead of trying to make whole scenes usign gpt 4o image gen

balmy mist Apr 16, 2025, 9:17 PM

#

oohh okay, ill try when i get back, home gotta head out

keen beacon Apr 16, 2025, 9:19 PM

#

ornate stump Apr 16, 2025, 9:20 PM

#

Are they going to raise the deep research limits now? That's one thing I still prefer from OpenAI, but I haven't checked if Google upgraded it.

quiet pollen Apr 16, 2025, 9:22 PM

#

o3 feels pretty smart - what do you all think?

brittle tiger Apr 16, 2025, 9:22 PM

#

keen beacon

Hard to compare. If question is just webdev it's easy but we havent seen nightwhisper outside of webdev

quiet pollen Apr 16, 2025, 9:22 PM

#

I am wondering if it is due to the knowledge cut off date

zinc ore Apr 16, 2025, 9:25 PM

#

Is night better at programming than o3 full and o4 mini?

#

What about dragontail?

torn mantle Apr 16, 2025, 9:29 PM

#

quiet pollen o3 feels pretty smart - what do you all think?

reasoning

#

yea

#

physical simulations, it will nail it

#

but its still struggling on what makes a design good and whatnot

#

oh btw

#

im talking about o4-mini

#

i havent tried o3 full yet

quiet pollen Apr 16, 2025, 9:31 PM

#

do you think usage for non-reasoning models will decrease?

#

like GPT 4.1 etc

torn mantle Apr 16, 2025, 9:31 PM

#

i mean it depends on the use case tbh

#

for example gemini 2.5 thinking achieved a similar performance at coding tasks to sonnet 3.5 only after applying reasoning

quiet pollen Apr 16, 2025, 9:32 PM

#

o3 mini is cheaper than GPT-4.1 lol

tall summit Apr 16, 2025, 9:34 PM

#

o3 mini is old

torn mantle Apr 16, 2025, 9:35 PM

#

o4-mini vs nightwhisper

#

a simple prompt assessing stylistic choices/organization/colours...

quiet pollen Apr 16, 2025, 9:35 PM

#

torn mantle o4-mini vs nightwhisper

wow is this 1 shot?

torn mantle Apr 16, 2025, 9:35 PM

#

yea

#

not for o4 mini

#

i had to guide it

#

let me see if i still have 1st o4-mini output

#

tall summit Apr 16, 2025, 9:37 PM

#

how'd nightwhisper make that

torn mantle Apr 16, 2025, 9:37 PM

#

like this resume may seem easy to clone by any model, but trust me ive tried all models with the same prompt and even guiding them and they dont come near nightwhisper

#

just the vertical line on the bullet point if you can make any model do it centered in one shot i will give you whatever you like

#

they all messed that up

calm sequoia Apr 16, 2025, 9:39 PM

#

keen beacon

Why would you steal my name

quiet pollen Apr 16, 2025, 9:39 PM

#

thanks for sharing nightwhisper

#

just googled it and found that it could be a stealth model from Google

#

Google models been a huge winnder for me

ocean vortex Apr 16, 2025, 9:39 PM

#

there's no way released o3 is anywhere near this now renamed "preview" lol

quiet pollen Apr 16, 2025, 9:40 PM

#

torn mantle like this resume may seem easy to clone by any model, but trust me ive tried all...

how do you use nightwhisper? I can't find it in the arena

barren prairie Apr 16, 2025, 9:41 PM

#

quiet pollen how do you use nightwhisper? I can't find it in the arena

It was there

thorny drum Apr 16, 2025, 9:41 PM

#

yeah this version is like 100x cheaper lol

barren prairie Apr 16, 2025, 9:41 PM

#

quiet pollen how do you use nightwhisper? I can't find it in the arena

But google removed it maybe to improve it

keen beacon Apr 16, 2025, 9:41 PM

#

ocean vortex there's no way released o3 is anywhere near this now renamed "preview" lol

it was retrained

#

on the 4.1 base, and the arc agi folks confirmed it

thorny drum Apr 16, 2025, 9:42 PM

#

o3-preview (high): 87.5%, $34.4k/task

ocean vortex Apr 16, 2025, 9:42 PM

#

keen beacon on the 4.1 base, and the arc agi folks confirmed it

so it's now better

keen beacon Apr 16, 2025, 9:43 PM

#

no

ocean vortex Apr 16, 2025, 9:43 PM

#

I feel like they kinda scammed everyone with that initial o3 announcement

thorny drum Apr 16, 2025, 9:43 PM

#

34.4k/task

keen beacon Apr 16, 2025, 9:43 PM

#

its worse

torn mantle Apr 16, 2025, 9:43 PM

#

quiet pollen how do you use nightwhisper? I can't find it in the arena

they removed it

ocean vortex Apr 16, 2025, 9:43 PM

#

keen beacon its worse

it's not worse

keen beacon Apr 16, 2025, 9:43 PM

#

they said o3 preview was closer to o3 pro

#

arc agi folks

torn mantle Apr 16, 2025, 9:43 PM

#

quiet pollen just googled it and found that it could be a stealth model from Google

yea its from google

ocean vortex Apr 16, 2025, 9:43 PM

#

keen beacon they said o3 preview was closer to o3 pro

because it was. Scamming part was not saying this is pro

keen beacon Apr 16, 2025, 9:43 PM

#

at least its worse on stuff that requires a lot of compute which openai doesn't want to serve

#

o3 preview to arc agi was served with an unrealistic level of compute

thorny drum Apr 16, 2025, 9:44 PM

#

o3 high is 1000x cheaper than o3-preview high lol

ocean vortex Apr 16, 2025, 9:44 PM

#

if you look at o1-pro it scored double on arc-agi-1 compared to o1-high

#

so it makes sense

#

they still scammed people implying that it was normal o3 lol

keen beacon Apr 16, 2025, 9:45 PM

#

if they benchmarked the retrained o3 with the new 4.1 base with as much compute, itd probably score higher

ocean vortex Apr 16, 2025, 9:45 PM

#

keen beacon if they benchmarked the retrained o3 with the new 4.1 base with as much compute,...

yeah for sure, that's my point

#

but it's also kinda pointless too

#

to do it

#

they made it look initially like o3 is a HUGE improvement

keen beacon Apr 16, 2025, 9:46 PM

#

so yeah normal o3 i expect it do worse since its not juiced with that level of compute they gave o3 preview

keen beacon Apr 16, 2025, 9:47 PM

#

ocean vortex they made it look initially like o3 is a HUGE improvement

well despite the unrealistic levels of compute (that wouldn't ever be served to the public) it reached that level tho

#

i think thats worth something

ocean vortex Apr 16, 2025, 9:48 PM

#

keen beacon well despite the unrealistic levels of compute (that wouldn't ever be served to ...

it's a skewed reference. When they are referring to it as a standard non-pro model while running pro model setup

keen beacon Apr 16, 2025, 9:48 PM

#

i dont think their plans for o3 were that fully fleshed at the time

#

i dont think it was malicious

#

but i agree

ocean vortex Apr 16, 2025, 9:49 PM

#

keen beacon so yeah normal o3 i expect it do worse since its not juiced with that level of c...

I would guess o3-high to do like 40-45% think

keen beacon Apr 16, 2025, 9:49 PM

#

i mean look at how theyre continuing chatgpt 4o despite it on the 4.1 base, openai has committed every naming sin possible

#

they didnt even rename 4o on chatgpt so its even more confusing against o4 mini etc

ember rapids Apr 16, 2025, 9:50 PM

#

i feel like nightwhisper will def be better at coding

keen beacon Apr 16, 2025, 9:50 PM

#

what was the point of renaming it to 4.1 on just the api 😭

ocean vortex Apr 16, 2025, 9:50 PM

#

keen beacon they didnt even rename 4o on chatgpt so its *even* more confusing against o4 min...

this is not a naming problem though. It's simply them being misleading on purpose lol

#

they already had the pro model line. So what they showed I doubt it was actually ever referred to internally as just o3

keen beacon Apr 16, 2025, 9:51 PM

#

was o1 pro even out at the tie

#

i think o3 was still early in developmment and they reached a milestone and wanted to share it despite it not being fleshed out

ocean vortex Apr 16, 2025, 9:52 PM

#

keen beacon was o1 pro even out at the tie

yeah it was out when they announced o3

torn mantle Apr 16, 2025, 9:52 PM

#

gemini 2.5 pro general knowledge >>>>>

ocean vortex Apr 16, 2025, 9:53 PM

#

and they adopted the same system for o3 benchmarks while knowing it's o3-pro 💀

torn mantle Apr 16, 2025, 9:53 PM

#

sonnet 3.5/3.7 general knowledge >>>>>>>>

keen beacon Apr 16, 2025, 9:53 PM

#

i dont even think the level of compute they used for o3 preview would even match anything close to what they use for o3 pro tbh

#

they probably used even more than o3 pro

calm sequoia Apr 16, 2025, 9:54 PM

#

They used more than the pro. And this term "pro" was quite new at that time.

#

As I understand they gave unlimited compute for some time. Therefore, the approaches can't be compared.

keen beacon Apr 16, 2025, 9:55 PM

#

yea

torn mantle Apr 16, 2025, 9:55 PM

#

o3 full initial vibes are kinda off for me

keen beacon Apr 16, 2025, 9:56 PM

#

nah i think the vibes are good

#

mini has worse vibes than full but better at non-web coding tasks

ocean vortex Apr 16, 2025, 9:56 PM

#

keen beacon they probably used even more than o3 pro

yeah which makes it even worse 💀

thorny drum Apr 16, 2025, 9:56 PM

#

wasnt it like unlimited compute and then like thousand model majority voting

keen beacon Apr 16, 2025, 9:57 PM

#

thorny drum wasnt it like unlimited compute and then like thousand model majority voting

yeah it was something ridiculous

ocean vortex Apr 16, 2025, 9:57 PM

#

imagine them announcing standard o1 with benchmarks that are higher than o1-pro

keen beacon Apr 16, 2025, 9:57 PM

#

what would you propose then

#

u were early in the dev process and wanted to share results, even though they were unrealistic, and committed to o3

thorny drum Apr 16, 2025, 9:57 PM

#

i mean they got a model with a similar quality 1000x cheaper in like 4 months