#general | Arena | Page 28

keen beacon Apr 19, 2025, 5:57 PM

#

i think its close, but o3 has the edge

elder rapids Apr 19, 2025, 5:57 PM

#

keen beacon i think its close, but o3 has the edge

ye but you and I both know there are cases where that's not true at all

keen beacon Apr 19, 2025, 5:58 PM

#

ofc

elder rapids Apr 19, 2025, 5:58 PM

#

whether that's more meaningful than your benchmarks

#

but that's beside the point

#

therefore it's not the general intelligence you think it is

keen beacon Apr 19, 2025, 5:58 PM

#

nah , i just use them both in programming for example, o3 has bigger edge
for general stuff its usually gemini 2.5

elder rapids Apr 19, 2025, 5:59 PM

#

specific coding performance tasks are held back by total coding capability

keen beacon Apr 19, 2025, 5:59 PM

#

elder rapids therefore it's not the general intelligence you think it is

I think both gemini and especially o3, are in the non-zero agi % metric

elder rapids Apr 19, 2025, 6:00 PM

#

so ye

#

i don't think it's good to compare them with algorithms

elder rapids Apr 19, 2025, 6:01 PM

#

keen beacon I think both gemini and especially o3, are in the non-zero agi % metric

yeah

ember rapids Apr 19, 2025, 6:12 PM

#

dayhush is rly good wow

novel flame Apr 19, 2025, 6:23 PM

#

Correct.

I think the problem for many is that it’s easy to mistake greedy decoding (which is deterministic and also boring to the point where it would be practically useless) for temperature=0 (which still has inherent randomness but the probability distribution is more biased towards the top logits).

Mathematically, temperature=0 isn’t even possible since it would cause a divide-by-zero so LLMs just change it to a very small value >0 behind the scenes.
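
To make the divide-by-zero point concrete, here is a minimal sketch of temperature-scaled softmax. It is only an illustration of the math, not any provider's implementation; the function name and the example logits are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits (illustrative only)."""
    # Dividing by the temperature is the step that fails for a literal 0.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical logits for three tokens
for t in (1.0, 0.5, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

As the temperature shrinks, the probability mass piles onto the top logit; passing `0` to this naive version raises `ZeroDivisionError`, which is why an implementation has to either special-case zero or substitute some other value.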

ocean vortex Apr 19, 2025, 6:26 PM

#

that's the thing, chatgpt-latest was above o1 on lmarena even when they were using the old gpt4o base model

#

It does very well on lmarena, openai reasoning models much less so... till now at least, if this doesn't change I expect o3 to be no higher than 3rd

ocean vortex Apr 19, 2025, 6:28 PM

#

novel flame Correct. I think the problem for many is that it’s easy to mistake greedy decod...

yeah I was actually confused af by the temp 0 at first, it made no sense to me lmao

#

but I understand why they implemented it this way. It's just much faster and easier than writing smth like 0.00000001

keen beacon Apr 19, 2025, 6:34 PM

#

novel flame Correct. I think the problem for many is that it’s easy to mistake greedy decod...

Where did you hear that? 0 temperature does not get changed to a small temperature lol. Assuming no other sampling options are used, it's greedy decoding

#

Bruh the misinformation wtf???

keen beacon Apr 19, 2025, 6:35 PM

#

ocean vortex but I understand why they implemented it this way. It's just much faster and eas...

?????

keen beacon Apr 19, 2025, 6:35 PM

#

ocean vortex yeah I was actually confused af by the temp 0 at first, it made no sense to me l...

????

calm sequoia Apr 19, 2025, 6:35 PM

#

ocean vortex It does very well on lmarena, openai reasoning models much less so... till now a...

I get it. The o1 was best but took only the lower ranks.

leaden palm Apr 19, 2025, 6:35 PM

#

keen beacon Where did you hear that? 0 temperature does not get changed to a small temperatu...

on many platforms they don't know how to greedy decode ig

#

i've seen a few docs that say something along the lines of 0 temperature is changed to epsilon

keen beacon Apr 19, 2025, 6:36 PM

#

Huh really?

ocean vortex Apr 19, 2025, 6:36 PM

#

keen beacon ?????

???????

keen beacon Apr 19, 2025, 6:36 PM

#

Check vllm Lol

ocean vortex Apr 19, 2025, 6:36 PM

#

you can't divide by 0 lol

keen beacon Apr 19, 2025, 6:36 PM

#

Omg

#

They check if it's 0

#

Omfg

#

If it's 0 they switch to greedy assuming no other sampling options are enabled. Check vllm docs
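
A rough sketch of the kind of zero check being described. This is not vLLM's actual code; `sample_next_token` and its parameters are hypothetical names used only to show the control flow.

```python
import math
import random

def sample_next_token(logits, temperature, rng=random):
    """Return a token index; temperature == 0 is routed to argmax (greedy)."""
    if temperature == 0:
        # Special case: no division happens at all, just take the top logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one index in proportion to the (unnormalized) softmax weights.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [0.1, 3.0, 2.9]  # hypothetical logits
print(sample_next_token(logits, 0))    # greedy: index of the largest logit
print(sample_next_token(logits, 1.0))  # sampled: can differ between calls
```

With the zero branch in place the temperature never reaches the denominator, so the request does not error out and the choice is a pure argmax over the logits that were computed.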

#

Vllm is industry practice

ocean vortex Apr 19, 2025, 6:37 PM

#

keen beacon They check if it's 0

that's why it's custom implementation. If there was no custom implementation and no handling 0 differently than any other value, it wouldn't be possible

keen beacon Apr 19, 2025, 6:37 PM

#

ocean vortex that's why it's custom implementation. If there was no custom implementation and...

This is industry standard

#

Don't change the goalpost since you thought they changed it to a small value lmfaoooo

ocean vortex Apr 19, 2025, 6:37 PM

#

it is not handled the same way as other value, that's the main and only point

#

I don't get why you got so fussed about it lmao

keen beacon Apr 19, 2025, 6:38 PM

#

leaden palm i've seen a few docs that say something along the lines of 0 temperature is chan...

Can you link me this is news to me

ocean vortex Apr 19, 2025, 6:38 PM

#

keen beacon Don't change the goalpost since you thought they changed it to a small value lmf...

???????

keen beacon Apr 19, 2025, 6:38 PM

#

ocean vortex I don't get why you got so fussed about it lmao

It's one of the most hilarious things Ive read all day

ocean vortex Apr 19, 2025, 6:39 PM

#

my point was to show what you would need to do if 0 wasn't handled explicitly

#

nothing less

#

nothing more

#

maybe take a chill pill @keen beacon I have no clue what's gotten into you LMAO

leaden palm Apr 19, 2025, 6:40 PM

#

keen beacon Can you link me this is news to me

exa is down 😔

#

novel flame Apr 19, 2025, 6:41 PM

#

Temp=0 in the math is not possible, if implementations just change to greedy decoding then there’s no longer a temperature value involved in the inference.

But I genuinely thought the standard was to sub a small epsilon value. I could be wrong about that for sure.

keen beacon Apr 19, 2025, 6:41 PM

#

novel flame Temp=0 in the math is not possible, if implementations just change to greedy dec...

Yes obviously I thought everyone knew 0 was greedy decoding lmao

#

Otherwise what would you even do there

ocean vortex Apr 19, 2025, 6:41 PM

#

novel flame Temp=0 in the math is not possible, if implementations just change to greedy dec...

yeah exactly. And when you are treating it different to any other value that's basically just confirming that 0 is not possible technically speaking

ocean vortex Apr 19, 2025, 6:45 PM

#

keen beacon Yes obviously I thought everyone knew 0 was greedy decoding lmao

this doesn't make the temp 0 magically valid lmfao. All it means is that they are interpreting your technically invalid parameter input to provide you with a closest match different solution

keen beacon Apr 19, 2025, 6:46 PM

#

ocean vortex this doesn't make the temp 0 magically valid lmfao. All it means is that they ar...

The industry has used 0 for that for an extremely long time

#

You would know this yourself

#

Otherwise you would not be able to use 0

ocean vortex Apr 19, 2025, 6:46 PM

#

keen beacon The industry has used 0 for that for an extremely long time

that doesn't change a single thing or disprove what I was saying really...

keen beacon Apr 19, 2025, 6:47 PM

#

It is a special value for all intents and purposes

keen beacon Apr 19, 2025, 6:48 PM

#

ocean vortex this doesn't make the temp 0 magically valid lmfao. All it means is that they ar...

It is not setting it to a small non zero value though. It just picks the top token

ocean vortex Apr 19, 2025, 6:48 PM

#

that's what we were saying before you replied here with "??????" for no reason at all lmfao. If there was no custom interpretation and temp parameter was strictly temperature directly applied, you would have to use it differently

keen beacon Apr 19, 2025, 6:49 PM

#

But everyone does that lmao. I wouldn't consider it a custom interpretation lmao. Not setting 0 to that would be a custom implementation at this point, no one expects it

ocean vortex Apr 19, 2025, 6:50 PM

#

keen beacon But everyone does that lmao. I wouldn't consider it a custom interpretation lmao...

omg... that's because temp is not getting passed as "0" and it doesn't use it the same way like if you inputted any higher value.

keen beacon Apr 19, 2025, 6:50 PM

#

ocean vortex omg... that's because temp is not getting passed as "0" and it doesn't use it th...

Did you just figure that out just today lol

ocean vortex Apr 19, 2025, 6:51 PM

#

keen beacon Did you just figure that out just today lol

we were already talking about it before you showed up catgrin

#

so no

keen beacon Apr 19, 2025, 6:52 PM

#

ocean vortex we were already talking about it before you showed up catgrin

You agreed that it was set to a small non zero value though

dark oak Apr 19, 2025, 6:52 PM

#

how to vote for leaderboard?

ocean vortex Apr 19, 2025, 6:53 PM

#

keen beacon You agreed that it was set to a small non zero value though

what I meant was 0 was much easier than setting a smallest possible value that would limit it to the only possible highest probability, effectively greedy decoding, which in practice probably is not even possible in most cases to express that as a number...

tall summit Apr 19, 2025, 7:00 PM

#

hello friends!

keen beacon Apr 19, 2025, 7:09 PM

#

ocean vortex maybe take a chill pill @keen beacon I have no clue what's gotten into ...

I definitely overreacted on something that is a trivial correction lol. I apologize I wanted to get mad at smthing

novel flame Apr 19, 2025, 7:09 PM

#

OK I had to check... and @keen beacon is wrong. I just made four identical requests to Claude 3.7 Sonnet on Bedrock:

{
  "prompt": "How much wood could a woodchuck chuck?",
  "model": "bedrock/claude-3.7-sonnet",
  "config": {
    "temperature": 0,
    "max_tokens": 220
  }
}

keen beacon Apr 19, 2025, 7:10 PM

#

See this

Screenshot_2025-04-20-02-10-02-419_com.android.chrome-edit.jpg

tawdry phoenix Apr 19, 2025, 7:11 PM

#

Installing mac os on my laptop

novel flame Apr 19, 2025, 7:12 PM

#

keen beacon See this

vLLM is a library, not how LLMs are built

keen beacon Apr 19, 2025, 7:12 PM

#

novel flame vLLM is a library, not how LLMs are built

It is the most used open inference engine.

novel flame Apr 19, 2025, 7:16 PM

#

keen beacon It is the most used open inference engine.

And if you are building toy LLMs at home then I'm sure you're using vLLM, but that does not make it the industry standard, nor does it mean anyone else is following its design decisions

keen beacon Apr 19, 2025, 7:21 PM

#

That is a ridiculous claim

#

Vllm optimizations and how throughput changed dramatically against other providers: https://developers.redhat.com/articles/2025/03/19/how-we-optimized-vllm-deepseek-r1#deepseek_r1__a_complex_model

#

Also iirc deepseek uses a fork of vllm are they building toy llms

#

I can go on about this but I'm going to stop here since you're making ridiculous claims

keen beacon Apr 19, 2025, 7:25 PM

#

keen beacon Also iirc deepseek uses a fork of vllm are they building toy llms

https://github.com/deepseek-ai/open-infra-index/tree/main/OpenSourcing_DeepSeek_Inference_Engine

GitHub

open-infra-index/OpenSourcing_DeepSeek_Inference_Engine at main · ...

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation - deepseek-ai/open-infra-index

golden ocean Apr 19, 2025, 7:29 PM

#

I love lmarena beef

novel flame Apr 19, 2025, 7:34 PM

#

keen beacon https://github.com/deepseek-ai/open-infra-index/tree/main/OpenSourcing_DeepSeek_...

Congratulations, you found one example of a major lab that uses (checks notes) a fork of vLLM which may or may not use the method of replacing temp=0 with greedy decoding. I tested the APIs from OpenAI, Google and Amazon Bedrock and all of them did not. I just don't have the time to discuss what 'industry standard' means, nor to try all the non-SOTA labs' APIs to find out what I already know.

keen beacon Apr 19, 2025, 7:35 PM

#

novel flame Congratulations, you found one example of a major lab that uses (checks notes) a...

Why tf would they replace the greedy decoding code for no reason lmao

keen beacon Apr 19, 2025, 7:35 PM

#

novel flame Congratulations, you found one example of a major lab that uses (checks notes) a...

0 temperature/greedy decoding is not deterministic. Anthropic's API even notes it lmao

#

You can't even set a seed on the anthropic api lol

novel flame Apr 19, 2025, 7:37 PM

#

keen beacon Why tf would they replace the greedy decoding code for no reason lmao

I'm not saying they did, I am saying you don't know they didn't and your entire argument seems to be "vLLM docs say vLLM does this, so this is industry standard". But evidently not Google or OpenAI or Claude on Bedrock, so you are grasping

novel flame Apr 19, 2025, 7:39 PM

#

keen beacon You can't even set a seed on the anthropic api lol

OMG are you just trolling? Anthropic's API docs say exactly what I've been saying: "Note that even with temperature of 0.0, the results will not be fully deterministic." That's my whole point. But they do not say "greedy decoding" because they do not switch to greedy decoding, they use a small epsilon which leaves it non-deterministic.

keen beacon Apr 19, 2025, 7:40 PM

#

novel flame I'm not saying they did, I am saying you don't know they didn't and your entire ...

https://152334h.github.io/blog/non-determinism-in-gpt-4/ fyi

152334H

Non-determinism in GPT-4 is caused by Sparse MoE

It’s well-known at this point that GPT-4/GPT-3.5-turbo is non-deterministic, even at temperature=0.0. This is an odd behavior if you’re used to dense decoder-only models, where temp=0 should imply greedy sampling which should imply full determinism, because the logits for the next token should be a pure function of the input sequence & the m...

#

Google announced they used Moe too

#

While we don't know for sure for the others its highly likely to be Moe too

keen beacon Apr 19, 2025, 7:41 PM

#

novel flame OMG are you just trolling? Anthropics API docs say exactly what I've been saying...

Link me to a resource anywhere where this behavior occurs and is documented

#

That they change with epsilon

keen beacon Apr 19, 2025, 7:43 PM

#

keen beacon https://152334h.github.io/blog/non-determinism-in-gpt-4/ fyi

While this isn't definitive I found it convincing a while back at least. I dont recall it much

novel flame Apr 19, 2025, 7:43 PM

#

keen beacon While we don't know for sure for the others its highly likely to be Moe too

"...under capacity constraints". That is describing anomalous behavior. I just ran tests across three APIs and in just four requests (no cherry-picking) I got different results from all of them.

keen beacon Apr 19, 2025, 7:45 PM

#

novel flame "...under capacity constraints". That is describing anomalous behavior. I just r...

Please show me anything that supports your hypothesis

#

Actual documentation

novel flame Apr 19, 2025, 7:52 PM

#

keen beacon Please show me anything that supports your hypothesis

It's not a hypothesis -- in a Transformer LLM, temperature is used in the denominator and you cannot divide by zero.

#

See the 'T's right there. This is how I "support my hypothesis". By knowing the actual math.
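
For reference, the temperature-scaled softmax is usually written as below. This is assumed to be the formula being pointed at; the symbols $z_i$ (logits) and $T$ (temperature) are chosen here for illustration.

```latex
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad T > 0

\lim_{T \to 0^+} p_i =
\begin{cases}
1 & \text{if } i = \arg\max_j z_j \text{ (unique maximum)} \\
0 & \text{otherwise}
\end{cases}
```

The expression is undefined at $T = 0$ itself, but its limit as $T \to 0^+$ is exactly the one-hot distribution that greedy decoding picks, which is why the two sides of this argument describe the same limiting behaviour.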

keen beacon Apr 19, 2025, 7:58 PM

#

novel flame It's not a hypothesis -- in a Transformer LLM, temperature is used in the denomi...

There's literally a check to see if it's 0. Do you see your api calls error out lmao

#

If it's 0 it goes into greedy decoding if no other sampling options are enabled

#

Keep on believing what you want it's a weird hill to die on

novel flame Apr 19, 2025, 8:03 PM

#

keen beacon If it's 0 it goes into greedy decoding if no other sampling options are enabled

There is... in vLLM specifically. And vLLM switches to greedy decoding, which removes (99.99%-100% of) the non-determinism. So since these are not deterministic, they are not using greedy decoding.

calm sequoia Apr 19, 2025, 8:22 PM

#

It's so nice seeing a real talking instead of "look what svg pokemon llm drew for me"

novel flame Apr 19, 2025, 8:28 PM

#

I apologize on my part for the heated debate here tonight, this was not a productive contribution to the channel.

tall summit Apr 19, 2025, 8:30 PM

#

novel flame I apologize on my part for the heated debate here tonight, this was not a produc...

it was

#

i learned a bit by following it

novel flame Apr 19, 2025, 8:43 PM

#

keen beacon While this isn't definitive I found it convincing a while back at least. I dont ...

Actually re-reading this, this can indeed explain non-deterministic outputs even if the implementation replaces the zero temp with greedy decoding. MoE (which they're probably all using) concurrency means partial results may arrive/recombine in different order each time, and that is a totally plausible explanation for small differences in the resulting logprobs. As for whether individual LLM providers do one or the other, who knows; the labs don't reveal the details. I heard about the 'epsilon' trick from Andrew Ng, who is a fairly reliable source of how things are done; but my evidence was hinged on observing non-deterministic results, and the MoE explanation is as valid as the epsilon explanation. I concede the argument.
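
A toy illustration of that mechanism, under loud assumptions: the magnitudes are exaggerated on purpose, `accumulate` and the `competitor` logit are hypothetical, and nothing here models a real provider's serving stack. It only shows that summing the same partial results in a different order can change a logit enough to flip an argmax.

```python
def accumulate(partials):
    """Sum partial results left to right, in the order they 'arrive'."""
    total = 0.0
    for p in partials:
        total += p
    return total

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# The same three partial contributions to one logit, in two arrival orders.
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

logit_a = accumulate(order_a)  # the 1.0 is lost to rounding next to 1e16
logit_b = accumulate(order_b)  # the 1.0 survives

competitor = 0.5  # hypothetical logit of a rival token
print(logit_a, argmax([logit_a, competitor]))
print(logit_b, argmax([logit_b, competitor]))
```

Floating-point addition is not associative, so even a pure argmax over the logits can pick different tokens from run to run if the reduction order varies. In practice the differences are tiny and only matter when two candidates are nearly tied.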

torn mantle Apr 19, 2025, 8:57 PM

#

https://3000-ifqrxeuj2b1psco73w9h5-2266d380.e2b-foxtrot.dev

#

do you think this is accurate enough?

#

dayhush vs sonnet

ocean vortex Apr 19, 2025, 9:02 PM

#

tall summit i learned a bit by following it

Continued there so as not to flood this chat: #1363229951851237376 ~~(ignore the thread name 😂)~~

torn mantle Apr 19, 2025, 9:03 PM

#

torn mantle dayhush vs sonnet

functionality wise, sonnet is kinda better, collapsible are working + you can edit the code

keen beacon Apr 19, 2025, 9:11 PM

#

sonnet 3.7 still probably best for coding

#

despite benchmarks

ocean vortex Apr 19, 2025, 9:15 PM

#

keen beacon despite benchmarks

then make your own benchmark to prove it or it's just not true lol

#

or heck, at least a singular easily reproducible coding challenge where it would be clear 2.5Pro, o3 and o4-mini all output a notably worse solution

#

though we probably need to add some non-reasoning model too if you are to do this... gpt4.1 and/or updated grok3 should do

keen fulcrum Apr 19, 2025, 9:19 PM

#

keen beacon sonnet 3.7 still probably best for coding

Bs
o3

ocean vortex Apr 19, 2025, 9:20 PM

#

for some specific coding tasks, reasoning models are at a disadvantage kind of by design 👀

#

especially if you are asking smth that you could easily google for and fit the answer with minimal changes. Reasoning can work against it trying to reinvent the wheel sometimes

barren prairie Apr 19, 2025, 9:36 PM

#

I just tried this now with Gemini pro 2.5 , Gemini flash 2.5 , O3 , O4 mini

Give me two words with those letters FP DAS .
Answer : pf sad (pf : sound )

Surprised that the winner was Gemini 2.5 flash

novel flame Apr 19, 2025, 9:39 PM

#

ocean vortex or heck, at least a singular easily reproducible coding challenge where it would...

I sent you mine. I wish the others would match their benchmarks, but Sonnet is still overall coding king for my tests 😕

balmy mist Apr 19, 2025, 9:43 PM

#

has anyone been testin 4.1? how good is it?

ocean vortex Apr 19, 2025, 9:49 PM

#

yeah I wrote it too generally looking back at it. 🤷‍♂️ For coding tasks that involve visuals 3.7 sonnet can indeed be the best still. Like even just drawing (coding) in svg it is gonna be better than competition. Overall though if we consider everything coding related and not just this, sonnet is not the best after all said and done

keen fulcrum Apr 19, 2025, 9:49 PM

#

Dom, 3.7 performs really bad at coding

ocean vortex Apr 19, 2025, 9:50 PM

#

keen fulcrum Dom, 3.7 performs really bad at coding

it's behind in many things coding, but visual coding / web design or my mentioned svg... try it and you will see. I would still classify it the best in this more specific aspect

#

like for this thing: https://discordapp.com/channels/1340554757349179412/1340554757827461211/1362774913038946394

this is what 3.7 sonnet did:

[image attachment]

#

the way they fine-tuned it they are kind of playing to their strengths as well. I had thinking enabled and reasoning was way longer than for o1/o3 with high reasoning effort for this prompt. Chat model is already doing it better, and thinking enabled (since I'm comparing it against reasoning models) just amplifies this

small haven Apr 19, 2025, 11:19 PM

#

WTF

tawdry phoenix Apr 19, 2025, 11:32 PM

#

#

Windows on the Samsung

torn mantle Apr 19, 2025, 11:39 PM

#

ocean vortex like for this thing: https://discordapp.com/channels/1340554757349179412/1340554...

woah

#

prompt?

raven void Apr 20, 2025, 1:03 AM

#

O3 is better for code architecture and bug fixing not for writing code

verbal nimbus Apr 20, 2025, 4:08 AM

#

I didn't know Artificial Analysis had an image arena:

#

Midjourney V6 and V7 are on there too, although idk why it's so low

thorn arrow Apr 20, 2025, 4:10 AM

#

When will o3 be put on the leaderboard?

worthy thunder Apr 20, 2025, 4:12 AM

#

As requested by several - OpenAI-MRCR results for o3: https://x.com/DillonUzar/status/1913806594199990582

Finally got access. They had some server-side issue with our org account, got resolved today.

Overall, pretty strong performance over its context window! Really strong up to 64k context, then drastically falls after, but that's to be expected (it hasn't specifically been improved for long context like the GPT-4.1 series, or Gemini 2.5). Should be really interesting to see GPT-4.1 applied to o-series!

I've also rerun the bench against all 21 models I've tested so far, tracking costs and individual test results, will post results on that over the next couple of days.

Enjoy

calm sequoia Apr 20, 2025, 7:40 AM

#

Is there an LLM tool which allows an agent to crawl through pages of a specific website?

small haven Apr 20, 2025, 8:05 AM

#

o3 sux

torn mantle Apr 20, 2025, 8:06 AM

#

claybrook has a lot of rendering issues on webdev

tall summit Apr 20, 2025, 9:24 AM

#

verbal nimbus I didn't know Artificial Analysis had an image arena:

never heard of it before now but artificial analysis just seems like one of infinite benchmark compilation websites

brittle tiger Apr 20, 2025, 10:23 AM

#

Tomay model from goog?

keen beacon Apr 20, 2025, 10:28 AM

#

tomay?

#

have you tested it

torn mantle Apr 20, 2025, 10:51 AM

#

brittle tiger Tomay model from goog?

So similar to claybrook

#

Output wise

#

Claybrook & tomay outputs are more close to eo than to gemini 2.5 pro 03

plain zinc Apr 20, 2025, 10:54 AM

#

brittle tiger Tomay model from goog?

Where did you find it?

#

I don't see him.

#

Is it of a rare level?

torn mantle Apr 20, 2025, 10:54 AM

#

plain zinc Is it of a rare level?

Yes

brittle tiger Apr 20, 2025, 10:54 AM

#

I haven't seen. saw a tweet with image of regular arena

https://x.com/AiBattle_/status/1913899487455572283

AiBattle (@AiBattle_) on X

Another new Gemini model "Tomay" dropped in LMarena! What is Google cooking 🤔

torn mantle Apr 20, 2025, 10:56 AM

#

Its a thinking model + so fast

keen fulcrum Apr 20, 2025, 10:57 AM

#

Which companies released stealth models on lmarena so far?

torn mantle Apr 20, 2025, 11:02 AM

#

brittle tiger I haven't seen. saw a tweet with image of regular arena https://x.com/AiBattle_...

Yea def flash version

hardy pecan Apr 20, 2025, 11:02 AM

#

To may

torn mantle Apr 20, 2025, 11:02 AM

#

Could be even smaller

#

Like flash lite

hardy pecan Apr 20, 2025, 11:02 AM

#

It releases it may then gg!

fleet lintel Apr 20, 2025, 12:29 PM

#

I haven't seen Tomay yet. Is it any good?

brittle tiger Apr 20, 2025, 12:32 PM

#

o3 and o4-mini are on Minecraft bench now

harsh flume Apr 20, 2025, 12:34 PM

#

Did google remove the option to fine tune a model on AI studio or just hid it in a way im not seeing?

brittle tiger Apr 20, 2025, 12:38 PM

#

harsh flume Did google remove the option to fine tune a model on AI studio or just hid it in...

i think they're sunsetting it for aistudio and leaving for api. aistudio can only do 1.5 flash still and it is hidden

https://aistudio.google.com/u/0/tune

Sign in - Google Accounts

harsh flume Apr 20, 2025, 12:38 PM

#

Ohh I just found it

#

They moved it to vertex ai studio on google cloud

balmy mist Apr 20, 2025, 1:10 PM

#

wait we got a new model?

keen beacon Apr 20, 2025, 1:12 PM

#

brittle tiger o3 and o4-mini are on Minecraft bench now

whats this bench xD , link ?

brittle tiger Apr 20, 2025, 1:13 PM

#

keen beacon whats this bench xD , link ?

https://mcbench.ai/

MC-Bench

Evaluating AI with Minecraft

balmy mist Apr 20, 2025, 1:18 PM

#

brittle tiger https://mcbench.ai/

this new?

#

this is such a dope benchmark

brittle tiger Apr 20, 2025, 1:18 PM

#

o3 and o4-mini are new additions. it's been around a couple months i think

keen beacon Apr 20, 2025, 1:24 PM

#

brittle tiger https://mcbench.ai/

cool

tall summit Apr 20, 2025, 2:13 PM

#

brittle tiger I haven't seen. saw a tweet with image of regular arena https://x.com/AiBattle_...

are they in this server i wonder

#

all his tweets are them using lmarena or webarena

tall summit Apr 20, 2025, 2:16 PM

#

worthy thunder As requested by several - OpenAI-MRCR results for o3: https://x.com/DillonUzar/s...

thank you for this!

plain zinc Apr 20, 2025, 2:50 PM

#

Tomay better in coding?

ocean vortex Apr 20, 2025, 3:15 PM

#

torn mantle prompt?

it's below the linked message. Sonnet generated like 32k tokens and went into the mode of "what can be improved" as it sometimes likes to do lmao
But even without that, I'm not an advocate for claude and do not see this model as best overall, but yeah I will still say it is the best with spatial awareness. Competing models are either optimized for cost much more (smaller, like gpt4.1), or they are simply way too extremely big and in effect undertrained and lacking this skill comparatively speaking for other reasons (gpt4.5). In this aspect gpt4.5 is still better than gpt4.1, but it is not as good as sonnet.

#

essentially, spatial awareness is a skill that is very rarely captured in benchmarks. Arc-agi does it, but for some reason until very recently it was not a very popular benchmark to use or reference at all. So this tends to get compromised sooner than their usual target evals.

brittle tiger Apr 20, 2025, 3:18 PM

#

claude 3.7 is svg king for sure. i think that was purposeful on their part. i think minecraft is better for spatial awareness than svg though

ocean vortex Apr 20, 2025, 3:18 PM

#

I suspect Anthropic's internal evals do test for it, while OpenAI does lack proper internal testing of this

knotty anchor Apr 20, 2025, 3:19 PM

#

Good day everyone
Happy Easter Sunday

ocean vortex Apr 20, 2025, 3:20 PM

#

or they willingly sacrificed it for lower cost, though that is somewhat less likely...

#

especially in the light of them going crazy lately with things like o1-pro lmao

ocean vortex Apr 20, 2025, 3:23 PM

#

knotty anchor Good day everyone Happy Easter Sunday

yeah you too, happy Easter! CryptoZooEggToken

knotty anchor Apr 20, 2025, 3:24 PM

#

ocean vortex yeah you too, happy Easter! CryptoZooEggToken

It’s nice seeing y’all here

keen fulcrum Apr 20, 2025, 3:34 PM

#

earnest parcel Apr 20, 2025, 3:36 PM

#

keen fulcrum

yeap, it's a clusterf. even people who interact with AI daily get confused (and I accidentally named some files 4o instead o4)

keen fulcrum Apr 20, 2025, 3:37 PM

#

I mean google trains new models daily as well
its just why do they all need to be released to the public?

keen beacon Apr 20, 2025, 4:04 PM

#

o3 > 2.5

tall summit Apr 20, 2025, 4:10 PM

#

keen fulcrum

makes sense to me now

plain zinc Apr 20, 2025, 4:10 PM

#

keen beacon o3 > 2.5

Gemini 2.5 Pro with Canvas >> o3 😁

tall summit Apr 20, 2025, 4:10 PM

#

plain zinc Gemini 2.5 Pro with Canvas >> o3 😁

what's canvas?

plain zinc Apr 20, 2025, 4:10 PM

#

tall summit what's canvas?

In Gemini Web/Mobile app

#

He's there.

plain zinc Apr 20, 2025, 4:11 PM

#

tall summit what's canvas?

Oops, Canvas is such a thing designed for coding

worthy thunder Apr 20, 2025, 4:52 PM

#

plain zinc Oops, Canvas is such a thing designed for coding

And for small documents before exporting to docs 😉

balmy mist Apr 20, 2025, 5:36 PM

#

I think that’s a Pokémon

#

That’s on arena?

#

Hmm

#

That’s cool

#

Came out today?

#

Also does anybody know what the best free model on open router is with favorable rate limits?

keen beacon Apr 20, 2025, 5:38 PM

#

woaah

balmy mist Apr 20, 2025, 5:38 PM

#

I miss having quasar on open router lol

#

Did you try them?

keen beacon Apr 20, 2025, 5:39 PM

#

balmy mist I miss having quasar on open router lol

free unlimited gpt 4.1 🤣

balmy mist Apr 20, 2025, 5:39 PM

#

Yeah I think they lowered it cause I hit rate limits so fast with 2.5 free now

balmy mist Apr 20, 2025, 5:39 PM

#

keen beacon free unlimited gpt 4.1 🤣

Fr man we did not appreciate that as much as we should have

#

What company you think backing it?

tall summit Apr 20, 2025, 5:41 PM

#

balmy mist Also does anybody know what the best free model on open router is with favorable...

deepseek

balmy mist Apr 20, 2025, 5:41 PM

#

keen beacon free unlimited gpt 4.1 🤣

What do you think is the best free model rn in terms of api and good rate limits?(so 2.5 is not included)

balmy mist Apr 20, 2025, 5:41 PM

#

tall summit deepseek

Really?

#

I actually never used deep seek cause I was scared lol

#

Wait nvm I used v3

keen beacon Apr 20, 2025, 5:42 PM

#

maybe u can use chutes but ur data is not really private

balmy mist Apr 20, 2025, 5:42 PM

#

And deepseek doesn’t rate limit heavy?

keen beacon Apr 20, 2025, 5:42 PM

#

w openrouter u can use chutes hosted models but 1000 rpd i think

balmy mist Apr 20, 2025, 5:42 PM

#

1000 is good enough

balmy mist Apr 20, 2025, 5:43 PM

#

keen beacon w openrouter u can use chutes hosted models but 1000 rpd i think

Chutes? Hmm I didn’t see that provider

keen beacon Apr 20, 2025, 5:43 PM

#

rpm of 2.5 flash not high enough?

keen beacon Apr 20, 2025, 5:43 PM

#

balmy mist Chutes? Hmm I didn’t see that provider

ya deepseek v3 r1 i think etc

balmy mist Apr 20, 2025, 5:43 PM

#

Flash is not free tho

keen beacon Apr 20, 2025, 5:43 PM

#

the free variants

#

theres a free tier

balmy mist Apr 20, 2025, 5:43 PM

#

Only pro is

keen beacon Apr 20, 2025, 5:43 PM

#

really hmm?

balmy mist Apr 20, 2025, 5:43 PM

#

From what I looked at I only see thinking and normal

keen beacon Apr 20, 2025, 5:43 PM

#

or use 300 usd in credits on vertex

#

i think u can do that

balmy mist Apr 20, 2025, 5:44 PM

#

Wait wym 300 usd?

#

Lmaoo

keen beacon Apr 20, 2025, 5:44 PM

#

balmy mist Wait wym 300 usd?

ya free credits

balmy mist Apr 20, 2025, 5:44 PM

#

Wait these Pokémon models

#

I love it

keen beacon Apr 20, 2025, 5:44 PM

#

who made them btw

#

did u ask

#

might reveal smthing

balmy mist Apr 20, 2025, 5:44 PM

#

keen beacon ya free credits

Okay I will try that provider, that’s in open router right?

keen beacon Apr 20, 2025, 5:45 PM

#

which one chutes?

balmy mist Apr 20, 2025, 5:45 PM

#

No the vertex thing you said

keen beacon Apr 20, 2025, 5:45 PM

#

vertex u have to go direct to them

sage raptor Apr 20, 2025, 5:45 PM

#

https://x.com/OfficialLoganK/status/1913989074043404564

Logan Kilpatrick (@OfficialLoganK) on X

Remember when AI was supposedly hitting a wall 😆

keen beacon Apr 20, 2025, 5:45 PM

#

google vertex

#

u can use it as a provider on openrouter but no free credits

#

meh lol

balmy mist Apr 20, 2025, 5:46 PM

#

Oh I see, so vertex has their own api key? I mean I might as well use the one from Google cause they give 300 as well, yeah I’ll just use that until I run out but I’m tryna plan for when I do run out lmaoo

#

If they gonna use Pokémon names they gotta produce heat

#

Ask why is it called charizard lol

keen beacon Apr 20, 2025, 5:47 PM

#

balmy mist Oh I see, so vertex has their own api key? I mean I might as well use the one fr...

it might be the same thing tbh im so confused with the google offerings

balmy mist Apr 20, 2025, 5:48 PM

#

keen beacon it might be the same thing tbh im so confused with the google offerings

Fr lol, all these platforms but as long as i get free credits im cool lol, i got like 4 accounts lol

keen beacon Apr 20, 2025, 5:48 PM

#

dont u need a cc to claim the 300 usd in credits?

#

u used like 4 ccs lmao 🤣?

balmy mist Apr 20, 2025, 5:49 PM

#

keen beacon dont u need a cc to claim the 300 usd in credits?

Nahh the same cc

keen beacon Apr 20, 2025, 5:49 PM

#

u can do that??

#

huh

balmy mist Apr 20, 2025, 5:49 PM

#

They don’t have restrictions on it

keen beacon Apr 20, 2025, 5:49 PM

#

Lmao

balmy mist Apr 20, 2025, 5:49 PM

#

Yeah Lmaoo

#

I might just farm this

#

But idk they might ban me and I will be devastated

keen beacon Apr 20, 2025, 5:50 PM

#

hallucination

#

i doubt they trained the cut off in

keen beacon Apr 20, 2025, 5:51 PM

#

balmy mist But idk they might ban me and I will be devastated

ya i personally wouldnt test it myself ig. but if it was allowed lol

brittle tiger Apr 20, 2025, 5:52 PM

#

Are labs specifically trolling this discord now 🤔

balmy mist Apr 20, 2025, 5:53 PM

#

Ask this:
You are facing north. A rectangular tank is filled with water. The west and east faces of the tank are each painted to look like an island. A small toy boat floats in the tank. On the boat is a small figurine, which is facing north. You lift the right side of the tank. From the point of view of the figurine, which island appears to rise: the west side, the east side, both, or none?

sage raptor Apr 20, 2025, 5:53 PM

#

maybe its from meta

balmy mist Apr 20, 2025, 5:53 PM

#

Yo did yall try the meta glasses?

#

East side

#

Smh

#

Why put crap on lmarena

#

They need to start screening these models before going on arena

keen beacon Apr 20, 2025, 5:55 PM

#

balmy mist They need to start screening these models before going on arena

if ur a big company u can put any model into the arena it seems lol

#

cohere are employing the meta strategy of releasing 10 equally horrific models to the arena

#

dw

balmy mist Apr 20, 2025, 5:56 PM

#

Lmarena doesn’t group or section models based on size? Like separate leaderboards for diff sizes? I don’t usually use arena so this might be an obvious question, I’m a web dev guy

keen beacon Apr 20, 2025, 5:56 PM

#

it doesn't

balmy mist Apr 20, 2025, 5:57 PM

#

How is ray?

upper wolf Apr 20, 2025, 5:57 PM

#

balmy mist Fr lol, all these platforms but as long as i get free credits im cool lol, i got...

they banned me for doing this 😭

balmy mist Apr 20, 2025, 5:57 PM

#

keen beacon it doesn't

That seems like an obvious feature to have tho

balmy mist Apr 20, 2025, 5:57 PM

#

upper wolf they banned me for doing this 😭

What

upper wolf Apr 20, 2025, 5:58 PM

#

Alts for free credits

balmy mist Apr 20, 2025, 5:58 PM

#

Yeah don’t even waste your time, giving these companies free benchmarking lol

balmy mist Apr 20, 2025, 5:58 PM

#

upper wolf Alts for free credits

How many alts you had?

upper wolf Apr 20, 2025, 5:58 PM

#

only two

keen beacon Apr 20, 2025, 5:58 PM

#

plain zinc Gemini 2.5 Pro with Canvas >> o3 😁

o3 with tools >> Gemini 2.5 Pro with Canvas >> o3 😁

balmy mist Apr 20, 2025, 5:58 PM

#

Lmaoooo

keen beacon Apr 20, 2025, 5:58 PM

#

new contender for world's worst thinking model if true

balmy mist Apr 20, 2025, 5:58 PM

#

keen beacon o3 with tools >> Gemini 2.5 Pro with Canvas >> o3 😁

How do you use canvas? On the Gemini app?

#

Show us

keen beacon Apr 20, 2025, 5:59 PM

#

balmy mist How do you use canvas? On the Gemini app?

idk i dont use gemini, im openai fanboy

balmy mist Apr 20, 2025, 5:59 PM

#

Do more tests like that

balmy mist Apr 20, 2025, 5:59 PM

#

keen beacon idk i dont use gemini, im openai fanboy

How you know about canvas then?

keen beacon Apr 20, 2025, 5:59 PM

#

keen beacon idk i dont use gemini, im openai fanboy

idk why I read that as openai femboy

balmy mist Apr 20, 2025, 6:00 PM

#

Are the models at least fast?

keen beacon Apr 20, 2025, 6:00 PM

#

😂

#

im making use of o3 to build repos via the subscription, similar to how one would use cursor or windsurf + o3 via api $$$
but its way cheaper (200$)

#

if rayquaza is from another lab it seems the lmsys staff are naming them and having fun

#

they really ARE pulling a meta. gonna be a 200b model that bombs every meaningful benchmark 🔥🔥🔥

balmy mist Apr 20, 2025, 6:03 PM

#

keen beacon im making use of o3 to build repos via the subscription, similar to how one woul...

How are you able to use o3 from api? They said I had to have an enterprise

balmy mist Apr 20, 2025, 6:03 PM

#

upper wolf Alts for free credits

Wait so how are you banned?

keen beacon Apr 20, 2025, 6:04 PM

#

balmy mist How are you able to use o3 from api? They said I had to have an enterprise

true, nvm i wouldnt even be able to use + would be super expensive.
thats why i use it via the subscription , but i have made a script that make questions and extract answers.
so i use it as if i had api access

balmy mist Apr 20, 2025, 6:05 PM

#

keen beacon true, nvm i wouldnt even be able to use + would be super expensive. thats why i...

U pay $200 a month?

#

Damn I was gonna wait to get that until o3 pro comes

keen beacon Apr 20, 2025, 6:05 PM

#

balmy mist U pay $200 a month?

worth it

#

i split it with another dev so 100$

balmy mist Apr 20, 2025, 6:06 PM

#

Not a bad idea

#

Hmm I might do that actually makes an account specifically for that

#

And tell people to pay me to use it

keen beacon Apr 20, 2025, 6:06 PM

#

yeah some do it that way

#

some scam that way

balmy mist Apr 20, 2025, 6:06 PM

#

Lmaoo

#

How they scam?

#

Google models

#

Try dayhush

#

That’s current best model that we have access to

keen beacon Apr 20, 2025, 6:09 PM

#

balmy mist How they scam?

pay me 100$ to use gpt , we share win win
ok here are 100$
ok bye

#

bro we just had 2.5 flash

balmy mist Apr 20, 2025, 6:13 PM

#

Out of all models?

kind charm Apr 20, 2025, 6:14 PM

#

I see claybrook not giving output on webarena quite often, especially on complex questions.
I think it takes more time & there seems to be some timeout setup in the webarena, which ends before it gives an output

keen beacon Apr 20, 2025, 6:22 PM

#

better than 2.5 ? coding or general model ?

#

tbh im rooting for o3 to get nr 1, then on 30 april 2.5 ultra to be nr 1, thatd be hella fun

#

I dont like that its taking too long for openai models to get in leaderboard. They probably have the votes, just making sure they aint biased.

leaden palm Apr 20, 2025, 6:26 PM

#

keen beacon I dont like that its taking too long for openai models to get in leaderboard. Th...

they aren't arena tuned so they probably didn't go through the "run it privately and release the elo asap" process

balmy mist Apr 20, 2025, 6:32 PM

#

@keen beacon that $10 thing applies to all free models on open router

keen beacon Apr 20, 2025, 6:32 PM

#

yea now

balmy mist Apr 20, 2025, 6:32 PM

#

so you get 1000 rpd with 2.5 pro

keen beacon Apr 20, 2025, 6:32 PM

#

u dont get it with 2.5 pro

balmy mist Apr 20, 2025, 6:32 PM

#

thats cracked

keen beacon Apr 20, 2025, 6:32 PM

#

its special i think

#

or maybe u do now

balmy mist Apr 20, 2025, 6:32 PM

#

yeah you do bro, they just said it

#

yeah now

keen beacon Apr 20, 2025, 6:32 PM

#

oh

#

thats great

balmy mist Apr 20, 2025, 6:32 PM

#

thats crazy

#

how can they do that?

keen beacon Apr 20, 2025, 6:33 PM

#

they were getting hammered on free requests before i guess their quota is now sufficient enough with that criteria

balmy mist Apr 20, 2025, 6:33 PM

#

for a regular ai user that does not use ai heavy in products, you never have to pay for SOTA models ever

keen beacon Apr 20, 2025, 6:33 PM

#

things are extremely competitive now lol

balmy mist Apr 20, 2025, 6:33 PM

#

besides o3 lol

keen beacon Apr 20, 2025, 6:33 PM

#

(u can still use it in direct chat lmao)

balmy mist Apr 20, 2025, 6:33 PM

#

lmaoooo

#

truee

#

it makes it so hard to pay for ai, thats why i cancelled all my subs

#

only will pay for o3 pro when it drops

#

but even that has me hesitant

keen beacon Apr 20, 2025, 6:35 PM

#

2.5 pro is still unbeatable in my own personal experience lol for the things im using it on

#

its incredible

balmy mist Apr 20, 2025, 6:35 PM

#

yeah thats what i am saying, for a general model its perfect

#

so this mean that a year from now we should have models better than 2.5 pro but basically free and fast like how we treat flash lite

#

2.5 is already free basically

#

but a year from now it will be free for production environments or dirt cheap

#

thats crazy

#

any model better than 2.5 is overkill for most usecases, so we basically made it if we assume we will get better and better models

#

i mean i will use it, im just saying its wild that our models are so good now that we have SOTA models that are free, where you get 1000 RPD, for openai plus you get like 50 o4 mini per day i believe lol

torn mantle Apr 20, 2025, 6:39 PM

#

kind charm I see claybrook not giving output on webarena quite often, especially on complex...

its a server-side issue

#

coming from webdev itself

#

not the model

#

it didnt even send the prompt

#

it could be interfering with JSON format on the POST data

#

not sure

keen beacon Apr 20, 2025, 6:42 PM

#

Every day at 12pm, Relaxed Voyages spaceship departs from Liverpool for Dublin. Simultaneously, another Relaxed Voyages spaceship starts journey from Dublin to Liverpool. The journey takes 503 full hours in both directions.

How many Relaxed Voyages spaceships, traveling to Liverpool, will the spaceship departing now at 1pm from Liverpool encounter?

opaque adder Apr 20, 2025, 6:51 PM

#

what's the latest best gemini mode?

#

in arena

#

is it dayhush or nw

torn mantle Apr 20, 2025, 6:56 PM

#

opaque adder is it dayhush or nw

nw

kind charm Apr 20, 2025, 6:56 PM

#

Currently in arena, I would say claybrook.

opaque adder Apr 20, 2025, 6:56 PM

#

is this a joke?

torn mantle Apr 20, 2025, 6:56 PM

#

better instruction following
better on reasoning
better on colors/ui choices

opaque adder Apr 20, 2025, 6:56 PM

#

why the same models are battling eachother
and the one on the right is waaay worse

#

tf

leaden palm Apr 20, 2025, 6:57 PM

#

opaque adder why the same models are battling eachother and the one on the right is waaay wor...

mini

opaque adder Apr 20, 2025, 6:57 PM

#

mini is ass then

torn mantle Apr 20, 2025, 6:57 PM

#

opaque adder why the same models are battling eachother and the one on the right is waaay wor...

gpt models arent good at coding

#

i would just choose

#

"both are bad"

#

there is an option for that

leaden palm Apr 20, 2025, 6:58 PM

#

torn mantle "both are bad"

no????

#

why would you do that

#

that says that it's a tie

torn mantle Apr 20, 2025, 6:58 PM

#

because both of them look bad to me

opaque adder Apr 20, 2025, 6:58 PM

#

bruh

#

ok well it was gemini 2.0 flash

#

so understandable

leaden palm Apr 20, 2025, 6:59 PM

#

opaque adder is this a joke?

if you care about the actual rankings you would say that left is better because it has more details (assuming nothing was cut off)

opaque adder Apr 20, 2025, 6:59 PM

#

i know, i did put left

torn mantle Apr 20, 2025, 7:01 PM

#

torn mantle

i think i found the issue, i may override the POST data see if it fixes the problem

opaque adder Apr 20, 2025, 7:01 PM

#

wtf...

#

no fuking way 2.5 pro isnt nerfed in the left side

#

dayhush is seriously insane

#

im sure 2.5 pro is nerfed here somewat

torn mantle Apr 20, 2025, 7:02 PM

#

opaque adder

both looks bad

#

what the hell are these color choices

#

dayhush is so bad at choosing color palettes

opaque adder Apr 20, 2025, 7:02 PM

#

right is significantly better dude

torn mantle Apr 20, 2025, 7:02 PM

#

NW was so good at it

#

yea but it still looks bad

keen beacon Apr 20, 2025, 7:02 PM

#

torn mantle what the hell are these color choices

why are you such a hater dawg 😭😭

#

there's nothing wrong with the colour scheme

opaque adder Apr 20, 2025, 7:03 PM

#

obviously they don't look like it was made by a 10 year experience developer

#

doesnt mean right isnt better

torn mantle Apr 20, 2025, 7:03 PM

#

idk

#

looks bad to me

keen beacon Apr 20, 2025, 7:03 PM

#

nightwhisper generated much better results

torn mantle Apr 20, 2025, 7:03 PM

#

yea

keen beacon Apr 20, 2025, 7:03 PM

#

but it could be the prompt

torn mantle Apr 20, 2025, 7:03 PM

#

finally

#

someone

#

agrees with me

#

high-taste user?

keen beacon Apr 20, 2025, 7:03 PM

#

the right is definitely better tho

torn mantle Apr 20, 2025, 7:04 PM

#

wild?

keen beacon Apr 20, 2025, 7:04 PM

#

🤣

torn mantle Apr 20, 2025, 7:04 PM

#

are you one of them?

keen beacon Apr 20, 2025, 7:04 PM

#

maybe 🤣

earnest parcel Apr 20, 2025, 7:04 PM

#

i prefer the vertical selection, but other than that right is better in almost everything

#

dayhush aint half bad

keen beacon Apr 20, 2025, 7:11 PM

#

unfortunately the answer is 43

#

no LLM has got it right 😔

earnest parcel Apr 20, 2025, 7:22 PM

#

keen beacon unfortunately the answer is 43

o1 = 42, gpt-4.5 = 41, o4 mini = 42, grok 3 = 41, o3-mini high = 42, 2.5 pro = 41, 3.7 sonnet thinking = 42, R1 = 41

#

from a single attempt just now

leaden palm Apr 20, 2025, 7:25 PM

#

https://websim.ai/@wonderousfog95826785/spaceship-encounter-simulator/6 still yet to figure out why this is incorrect

Spaceship Encounter Simulator

keen beacon Apr 20, 2025, 7:26 PM

#

earnest parcel o1 = 42, gpt-4.5 = 41, o4 mini = 42, grok 3 = 41, o3-mini high = 42, 2.5 pro = 4...

ive never tried o3 high but o3 med gets it wrong

keen beacon Apr 20, 2025, 7:26 PM

#

leaden palm https://websim.ai/@wonderousfog95826785/spaceship-encounter-simulator/6 still ye...

cc @ocean vortex

leaden palm Apr 20, 2025, 7:34 PM

#

o3 high

📎 message.md

#

qwq

📎 message.md

torn mantle Apr 20, 2025, 7:38 PM

#

why is it 43?

sonic tendon Apr 20, 2025, 7:41 PM

#

keen beacon Every day at 12pm, Relaxed Voyages spaceship departs from Liverpool for Dublin. ...

oo, saved

sage raptor Apr 20, 2025, 7:47 PM

#

https://x.com/Hangsiin/status/1914036008829767943/photo/1

NomoreID (@Hangsiin) on X

Scores for o3 and o4-mini have been uploaded for USAMO 2025.

Gemini 2.5 pro: 24.40% / $6.23
o3 (high): 21.73% / $24.17
o4-mini (high): 19.05% / $2.21

leaden palm Apr 20, 2025, 7:47 PM

#

keen beacon unfortunately the answer is 43

keen beacon Apr 20, 2025, 7:51 PM

#

i don't think that's a particularly good way of figuring out the right answer tbf

torn mantle Apr 20, 2025, 7:56 PM

#

yea but why is it 43

keen beacon Apr 20, 2025, 7:56 PM

#

dawg

#

i mentioned dom for a reason

#

it's his question

torn mantle Apr 20, 2025, 7:57 PM

#

keen beacon unfortunately the answer is 43

you said the answer is 43

keen beacon Apr 20, 2025, 7:57 PM

#

yes, because that's what dom states the answer is

#

duh 😭

torn mantle Apr 20, 2025, 7:57 PM

#

so you didnt try to solve that on ur own

keen beacon Apr 20, 2025, 7:57 PM

#

it's not a question i use often

#

you can go and try that if you want

#

but i trust his judgement

earnest parcel Apr 20, 2025, 7:58 PM

#

I tried solving it on my own, I also don't get 43. might be a trick if anything. either way, without provided correct solution it's kinda pointless

torn mantle Apr 20, 2025, 7:59 PM

#

i tried it as well

#

i think llms are rounding the numbers

#

using days instead

#

of hours

#

"k ≤ (503 – 1)/24 = 502/24 ≈ 20.92 → k = 0,1,2,…,20. So there are 21 such ships already on their way when we leave."

#

thats o3

#

but on the 21st day its 21 * 24 + 1 = 505 >= 503

#

so it shouldnt count

earnest parcel Apr 20, 2025, 8:01 PM

#

still makes no sense, then the models would overshoot, not undershoot

torn mantle Apr 20, 2025, 8:02 PM

#

leaden palm o3 high

⇒ –20.916…≤k≤21
same here

#

and they are also taking into account the ship that will fly at the same time as our ship

north vale Apr 20, 2025, 8:03 PM

#

there's 20 ships that are on the road when it departs, 1 ship that departs at the same time, and then 20 ships that depart between the time it leaves and the time it arrives right?

torn mantle Apr 20, 2025, 8:03 PM

#

yea

north vale Apr 20, 2025, 8:03 PM

#

so 41?

torn mantle Apr 20, 2025, 8:03 PM

#

yea exactly

#

41 is the correct one imo

north vale Apr 20, 2025, 8:05 PM

#

there's a ship that arrived an hour before our ship left, and a ship that leaves 1 hour after our ship arrives? if it's 43 it could just be that the guy considers that with 1 hour of wiggleroom our ship would encounter those ships in the dock

#

that would be a bit dumb but it's my guess

torn mantle Apr 20, 2025, 8:08 PM

#

north vale there's 20 ships that are on the road when it departs, 1 ship that departs at th...

sorry its 21 on depart since 0-20 day = 21 days

#

so it could be 41 or 42

#

if you count the ship that departs at the same time

torn mantle Apr 20, 2025, 8:09 PM

#

earnest parcel o1 = 42, gpt-4.5 = 41, o4 mini = 42, grok 3 = 41, o3-mini high = 42, 2.5 pro = 4...

thats why some concluded 42/41

north vale Apr 20, 2025, 8:09 PM

#

torn mantle sorry its 21 on depart since 0-20 day = 21 days

could you expand on this reasoning bc i don't follow

torn mantle Apr 20, 2025, 8:09 PM

#

some of them counted the ship that departed at the same time

north vale Apr 20, 2025, 8:10 PM

#

right i'm counting it in an extra category

#

20 already departed, 1 departs at the same time, 20 depart after our departure but before our arrival

brittle tiger Apr 20, 2025, 8:10 PM

#

torn mantle Apr 20, 2025, 8:11 PM

#

brittle tiger

nah aistudio gave me 41

#

xd

#

wtf

north vale Apr 20, 2025, 8:11 PM

#

same

#

gdm cooked

torn mantle Apr 20, 2025, 8:12 PM

#

north vale 20 already departed, 1 departs at the same time, 20 depart after our departure b...

21 already departed

north vale Apr 20, 2025, 8:12 PM

#

how though

#

could you explain that in more detail? where you don't count the boat that departs at the same time

ember rapids Apr 20, 2025, 8:13 PM

#

Happy Easter to all those who celebrate

#

#

Google is cooking

torn mantle Apr 20, 2025, 8:14 PM

#

north vale how though

we need ships that departed before day 0 @ 1pm and haven't completed the 503-hour journey yet so its x <= 503 :

day 0 @ 12 PM: 1 hr. (1 <= 503) -> 1 ship
day -1 @ 12 PM: (1*24)+1 = 25 hrs. (25 <= 503) -> 1 ship
day -20 @ 12 PM: (20*24)+1 = 481 hrs. (481 <= 503) -> 1 ship
day -21 @ 12 PM: (21*24)+1 = 505 hrs. (505 > 503) -> 0 ships

counting these days gives: 0 - (-20) + 1 = 21 days.
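
The tally above can be sketched in a few lines of Python. This is only an illustration of the counting in this message, not anyone's actual solution: `JOURNEY_HOURS`, `hours_en_route` and `still_flying` are made-up names, and it assumes one departure from Dublin every day at noon with our ship leaving at 1 pm.

```python
JOURNEY_HOURS = 503  # journey time from the riddle


def hours_en_route(days_ago: int) -> int:
    """Hours a ship that left at noon `days_ago` days ago has travelled by 1 pm today."""
    return 24 * days_ago + 1


# A ship is still on its way if it has travelled no more than the full journey time.
# Scanning 30 days back is more than enough for a 503-hour trip.
still_flying = [d for d in range(0, 30) if hours_en_route(d) <= JOURNEY_HOURS]

print(len(still_flying))  # 21
print(still_flying[-1])   # 20
```

Day -21 drops out because 505 hours is already longer than the trip, which is the cutoff used in the list above.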

north vale Apr 20, 2025, 8:16 PM

#

oh it's departing at 1pm i see

torn mantle Apr 20, 2025, 8:16 PM

#

yea

#

so its either 41 or 42

north vale Apr 20, 2025, 8:17 PM

#

so 21 before it departs, 0 while it departs, 20 after it departs

#

oh

#

hmm

#

yeah no that seems right

torn mantle Apr 20, 2025, 8:18 PM

#

0 while it departs = is what differentiated other LLMs

north vale Apr 20, 2025, 8:18 PM

#

it is obviously right bc it departs at 1pm and the others depart at noon

torn mantle Apr 20, 2025, 8:18 PM

#

yea

#

thats right

north vale Apr 20, 2025, 8:20 PM

#

tbc you think it's 41? or are you unsure and think it might be either 41 or 42?

north vale Apr 20, 2025, 8:20 PM

#

torn mantle so it could be 41 or 42

^wrt to this comment

torn mantle Apr 20, 2025, 8:25 PM

#

north vale tbc you think it's 41? or are you unsure and think it might be either 41 or 42?

yea not sure tbh

#

maybe im missing something as well

north vale Apr 20, 2025, 8:25 PM

#

okok

torn mantle Apr 20, 2025, 8:26 PM

#

north vale okok

https://3000-iba7djux2o4w7ckonvrhh-7f5cc07c.e2b-foxtrot.dev

#

xd

keen fulcrum Apr 20, 2025, 8:27 PM

#

Mistral offers access to news (AFP)

#

What is AFP?

north vale Apr 20, 2025, 8:28 PM

#

torn mantle https://3000-iba7djux2o4w7ckonvrhh-7f5cc07c.e2b-foxtrot.dev

agi

torn mantle Apr 20, 2025, 8:28 PM

#

north vale agi

xddd

#

kinda impressive from dayhush

#

sonnet and other models failed to visualise the solution

#

and they also failed to provide the answer

leaden palm Apr 20, 2025, 8:29 PM

#

keen fulcrum What is AFP?

https://google.com/search?q=afp mistral
https://google.com/search?q=afp news

torn mantle Apr 20, 2025, 8:30 PM

#

north vale Apr 20, 2025, 8:30 PM

#

not agi

torn mantle Apr 20, 2025, 8:31 PM

#

😭

#

its telling me to review logic

#

tf does that mean

keen beacon Apr 20, 2025, 8:38 PM

#

~~the answer is 43 i thinnk~~ (i reread it again, it is not)

#

~~i might be very wrong~~ i was

torn mantle Apr 20, 2025, 8:38 PM

#

https://3000-iqghyddl9grrzhiyxd9s8-59d5f765.e2b-foxtrot.dev

#

improved version

north vale Apr 20, 2025, 8:38 PM

#

feel free to explain your reasoning wild

#

ppl are betting about the correct solution here lol https://manifold.markets/Bayesian/the-riddle-of-spaceships-encounters?r=QmF5ZXNpYW4

torn mantle Apr 20, 2025, 8:53 PM

#

tf

ocean vortex Apr 20, 2025, 8:54 PM

#

keen beacon cc @ocean vortex

Yeah did not expect to see my prompt randomly here lmfao. The thing here is that the formula is really 2*days+1. And +1 is because it will encounter additional ship arriving just as it is leaving. The original prompt is already tricky enough for LLMs though can still be trained for. But when you add complexity they tend to default to their default earlier reasoning patterns (indicating that they didn't really understand it all that well to begin with still)

torn mantle Apr 20, 2025, 8:55 PM

#

yes

#

2.0

north vale Apr 20, 2025, 8:55 PM

#

I made the market, ppl can bet on it (with fake money)

torn mantle Apr 20, 2025, 8:55 PM

#

cool

keen fulcrum Apr 20, 2025, 8:55 PM

#

torn mantle

Spacex sim?
wow
Release it
One of those was very popular

north vale Apr 20, 2025, 8:55 PM

#

not sure yet, open to ideas

north vale Apr 20, 2025, 8:56 PM

#

ocean vortex Yeah did not expect to see my prompt randomly here lmfao. The thing here is that...

do you think the correct answer is 43?

#

lol ok

#

there obviously are lots of tenable ideas

ocean vortex Apr 20, 2025, 8:57 PM

#

north vale do you think the correct answer is 43?

yeah it is

north vale Apr 20, 2025, 8:58 PM

#

1. ask 10 people what they think the answer is, resolve to majority voting
2. wait for a mathematical proof that is formally verifiable, of the answer to the riddle
etc.

north vale Apr 20, 2025, 8:58 PM

#

ocean vortex yeah it is

i don't think you justified that appropriately and i think you're wrong but if you don't wanna explain in more detail that's fine tbc you didn't ask for your riddle to be talked about ig

ocean vortex Apr 20, 2025, 8:59 PM

#

north vale 1. ask 10 people what they think the answer is, resolve to majority voting 2. wa...

not really necessary since it's a variation of the known riddle with just basic additional math added on top

north vale Apr 20, 2025, 8:59 PM

#

that would be much worse because whoever has more mana can dictate any solution

#

it doesn't have a truth-based schelling point

north vale Apr 20, 2025, 8:59 PM

#

ocean vortex not really necessary since it's a variation of the known riddle with just basic ...

could you point me to the known riddle?

zinc ore Apr 20, 2025, 8:59 PM

#

The riddle involves ambiguities (at least one).

keen beacon Apr 20, 2025, 9:00 PM

#

its a modification of the 15 ship one

zinc ore Apr 20, 2025, 9:00 PM

#

Which imo isn't good for a mathematical prompt

north vale Apr 20, 2025, 9:00 PM

#

i agree ambiguities make a math prompt worse but could you point out the ambiguities?

small haven Apr 20, 2025, 9:00 PM

#

now do o1 pro vs o3

zinc ore Apr 20, 2025, 9:00 PM

#

So different people use different assumptions to get the answers they get

torn mantle Apr 20, 2025, 9:00 PM

#

keen fulcrum Spacex sim? wow Release it One of those was very popular

ocean vortex Apr 20, 2025, 9:01 PM

#

north vale could you point me to the known riddle?

How many ships belonging to the same shipping company, traveling in the opposite direction, will the steamship departing today at noon from Le Havre encounter? Answer: 15. Many models answer with 13 or 14 to this day

small haven Apr 20, 2025, 9:01 PM

#

marginal cost is $0, less expensive than o3

keen beacon Apr 20, 2025, 9:01 PM

#

lmao polymarket polls

small haven Apr 20, 2025, 9:01 PM

#

no its unlimited usage

zinc ore Apr 20, 2025, 9:01 PM

#

north vale i agree ambiguities make a math prompt worse but could you point out the ambigui...

Do you count the ship that left an hour early or not? What constitutes meeting other ships in that situation pre launch? Does it need to meet them on the air strip? What if our ship is in the storage bay still when the other one launches? This information isn't provided.

small haven Apr 20, 2025, 9:02 PM

#

yes i love that statement

#

wtf is going on loool

#

o3 pro gon be amazing

ocean vortex Apr 20, 2025, 9:03 PM

#

zinc ore Do you count the ship that left an hour early or not? What constitutes meeting ...

if it's departing at the same time another one is arriving they will for a fact meet. Departure would take preparation (boarding the passengers, waiting for clearance to depart etc) so mere logic and common sense dictates for this to happen

zinc ore Apr 20, 2025, 9:04 PM

#

ocean vortex if it's departing at the same time another one is arriving they will for a fact ...

One arrived an hour before you launched.

#

No wait my bad

north vale Apr 20, 2025, 9:04 PM

#

zinc ore Do you count the ship that left an hour early or not? What constitutes meeting ...

what ship left an hour early? pre-launch and post-arrival encounters do not count. storage bay encounters do not count. they do not count because they are not specified as happening. we don't know the ships encounter in the bay.

zinc ore Apr 20, 2025, 9:04 PM

#

I was actually talking about the one that left an hour before yours, not arriving

north vale Apr 20, 2025, 9:04 PM

#

right you would not encounter that one bc it left before you and will arrive before you

zinc ore Apr 20, 2025, 9:04 PM

#

Yeah, but that's your assumption

north vale Apr 20, 2025, 9:05 PM

#

i suppose so ok

zinc ore Apr 20, 2025, 9:05 PM

#

"if they are on the strip together, it doesn't count"

#

So even if they are in proximity you don't count it.

north vale Apr 20, 2025, 9:05 PM

#

zinc ore "if they are on the strip together, it doesn't count"

did someone say this

zinc ore Apr 20, 2025, 9:06 PM

#

One of the ships left at 12, yours left at 1

small haven Apr 20, 2025, 9:07 PM

#

huh i literally have 4 tabs running o1 pro at once, never got banned for months

#

only got limited via o3 yesterday

#

they gave it back

north vale Apr 20, 2025, 9:09 PM

#

ocean vortex yeah it is

Does this modified riddle also have 43 as its answer in your opinion? I made it more similar to the original riddle but with the differences i think exist in your riddle

Every day at noon, a steamship departs from Le Havre for New York. The journey takes exactly 503 hours (20 days and 23 hours) in both directions.
How many ships belonging to the same shipping company, traveling in the opposite direction, will a steamship departing today at 1 pm from New York encounter during its trip?

zinc ore Apr 20, 2025, 9:10 PM

#

Based on the last sentence, I agree that you don't count any grounded flights before pre-launch.

small haven Apr 20, 2025, 9:12 PM

#

yea o4 mini > o4 mini high, very very cool

ember rapids Apr 20, 2025, 9:16 PM

#

small haven o3 pro gon be amazing

Ultra coming late may to compete with o3 pro

#

Possibly

north vale Apr 20, 2025, 9:17 PM

#

A shipping company operates a route between Le Havre and New York.

Every day at 12:00 noon local time, a company ship departs Le Havre bound for New York.

The journey time between the two ports is exactly 503 hours in either direction.

Consider a specific ship from this company that departs New York for Le Havre today at 1:00 PM local time.

During its entire 503-hour voyage (from the moment it leaves the New York port until it arrives at the Le Havre port), how many of the company's other ships, traveling in the opposite direction (from Le Havre towards New York), will it pass at sea?

ember rapids Apr 20, 2025, 9:17 PM

#

Google is in their bag rn who knows

#

Google io is may 20th

small haven Apr 20, 2025, 9:18 PM

#

ember rapids Ultra coming late may to compete with o3 pro

i think o3 pro > ultra if it ever happens

ember rapids Apr 20, 2025, 9:18 PM

#

Logan is confident

#

Hard to say we’ll see

#

Anyway exciting time

#

Progress ain’t halting

small haven Apr 20, 2025, 9:25 PM

#

i hate that o3 cant give the full code content

#

i asked for full updated code file, but it shows only the changed section, unlike o1 pro which gives the entirety

torn mantle Apr 20, 2025, 9:35 PM

#

small haven i asked for full updated code file, but it shows only the changed section, unlik...

Same thing with its usual outputs

#

Its a smart model but it doesn't explain itself too much

#

It will use technical jargon without definitions

torn mantle Apr 20, 2025, 9:37 PM

#

small haven i asked for full updated code file, but it shows only the changed section, unlik...

I think they deliberately asked the model to keep it short since it's costly

small haven Apr 20, 2025, 9:39 PM

#

torn mantle It will use technical jargon without definitions

u think its better to go with asking the diff code? or does it also hallucinate that

small haven Apr 20, 2025, 9:39 PM

#

torn mantle I think they deliberately asked the model to keep it short since it's costly

yea i heard about its yap score..

#

ok asking for diff code aint too bad; feed it to 4.1 in cursor to apply it

torn mantle Apr 20, 2025, 9:46 PM

#

small haven u think its better to go with asking the diff code? or does it also hallucinate ...

It gave me that feeling of "figure out the rest urself"

#

Its really a different experience from other models

#

Im not talking about how its concise /short

#

But just the way its talking and providing infos

zinc ore Apr 20, 2025, 10:16 PM

#

~~So I did that riddle manually, and I also got 43. So I agree with Dom, the answer must be 43.~~

hardy pecan Apr 20, 2025, 10:49 PM

#

I got 42 doing it manually

zinc ore Apr 20, 2025, 10:55 PM

#

Yeah, it's 42

#

I read the riddle again, for some reason I thought it had said a ship arrived right when your ship was leaving /.-

keen beacon Apr 20, 2025, 11:02 PM

#

the problem was reworded in a poor manner that doesnt work as intended i think. i dont think the answer is 43 anymore, i saw it was a modification of the 15 ship one upon first glance but didnt actually read it properly lol

hardy pecan Apr 20, 2025, 11:09 PM

#

ocean vortex ```Every day at noon, a steamship departs from Le Havre for New York. At the sam...

This logic doesn't apply to your puzzle you created though, they both aren't leaving at 12pm, and the journey isn't a perfect 21 days, its a non integer 20.9583333. The 2*days+1 that is. So I think the important detail here is there is a 1 hour offset which skips a pass/meet
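
A rough way to check this is to brute-force it. The sketch below assumes constant speed, so two opposite ships meet exactly when the other ship's departure time `d` (in hours relative to ours) satisfies `-journey <= d <= journey`; the two boundary cases are the "meetings" right at a port. `count_encounters` and its parameters are hypothetical names, and the seven-day crossing used for the classic riddle is an assumption, since that sentence is not quoted in this chat.

```python
def count_encounters(journey_hours: int, offset_hours: int = 0,
                     count_port_meetings: bool = True) -> int:
    """Count opposite-direction ships met by a ship leaving `offset_hours` after noon.

    Opposite ships leave every day at noon, i.e. at d = 24*k - offset_hours
    relative to our departure. With equal constant speeds the two ships cross
    at time (journey + d) / 2, which lies inside both trips iff |d| <= journey.
    """
    max_days = journey_hours // 24 + 2  # scan a little past the journey length
    count = 0
    for k in range(-max_days, max_days + 1):
        d = 24 * k - offset_hours
        if count_port_meetings:
            meets = abs(d) <= journey_hours   # include meetings exactly at a port
        else:
            meets = abs(d) < journey_hours    # only meetings strictly en route
        count += meets
    return count


# Classic riddle, assumed 7-day crossing, both ships leave at noon.
print(count_encounters(7 * 24))                              # 15
print(count_encounters(7 * 24, count_port_meetings=False))   # 13

# Variant from this chat: 503 hours, our ship leaves at 1 pm.
print(count_encounters(503, offset_hours=1))                             # 42
print(count_encounters(503, offset_hours=1, count_port_meetings=False))  # 41
```

Under these assumptions the 1-hour offset means no ship arrives at the exact moment ours departs, so 2*days+1 overshoots by one for the 503-hour version.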

keen beacon Apr 20, 2025, 11:32 PM

#

keen beacon the problem was reworded in a poor manner that doesnt work as intended i think. ...

i think if its changed to 505 it works though

zinc ore Apr 20, 2025, 11:39 PM

#

505 is still 42

#

No nvm

zinc ore Apr 21, 2025, 12:05 AM

#

505 I think is 43.
504 I think is 42.

balmy mist Apr 21, 2025, 12:58 AM

#

lol

#

https://x.com/im_roy_lee/status/1914061483149001132

Roy (@im_roy_lee) on X

Cluely is out. cheat on everything.

#

promo vid funny, but it seems to just be a software for your laptop and not actually a glasses device

zinc ore Apr 21, 2025, 2:29 AM

#

https://youtu.be/qUbx5RC8ro4?si=c62U1LNBaZHceyxR

YouTube

60 Minutes

Google DeepMind CEO demonstrates world-building AI model Genie 2

Google DeepMind CEO Demis Hassabis showed 60 Minutes Genie 2, an AI model that generates 3D interactive environments, which could be used to train robots in the not-so-distant future.

"60 Minutes" is the most successful television broadcast in history. Offering hard-hitting investigative reports, interviews, feature segments and profiles of peo...

▶ Play video

#

Just dropped 3 hrs ago

verbal nimbus Apr 21, 2025, 3:07 AM

#

o1 Pro scored lower than Bing? 🤣

#

Are o3 or o4-mini in the Web Arena yet?

leaden palm Apr 21, 2025, 3:16 AM

#

verbal nimbus o1 Pro scored lower than Bing? 🤣

where bing is sydney "precise" mode

elder rapids Apr 21, 2025, 3:42 AM

#

crazy

alpine coral Apr 21, 2025, 3:43 AM

#

north vale > A shipping company operates a route between Le Havre and New York. > > Every ...

this adds ambiguity. previously, it was:

> Every day at noon, a steamship departs from Le Havre for New York. At the same time, another steamship belonging to the same company sets sail from New York to Le Havre

So i interpret that to mean that at 1200hrs (/noon) Le Havre local time, ships simultaneously depart (meaning it's 0600hrs in New York when that ship departs )

But in this variation:

> Consider a specific ship from this company that departs New York for Le Havre today at 1:00 PM local time.

So it's not departing 1hr later than in the original formulation, it's departing 7hrs later.

Also:

> how many of the company's other ships...will it pass at sea?

"at sea" has a specific meaning; imo it precludes any 'encounters' that occur at / around the zero time instant when one ship is arriving and the other departing from the calculation. Neither ship would strictly be 'at sea' [i.e. "...in navigable waters beyond the immediate confines of the port area."] sonn-3.7

#

fwiw i think both the original and that tweaked version are problematic - too much ambiguity

elder rapids Apr 21, 2025, 3:46 AM

#

I just did it

#

it is 42

#

absolutely

#

no question about it

alpine coral Apr 21, 2025, 3:46 AM

#

yeah i get 42 too

#

using the two different time zones anyway

elder rapids Apr 21, 2025, 3:46 AM

#

I'm lying

#

I used 2.5 pro to solve it

keen beacon Apr 21, 2025, 3:47 AM

#

alpine coral fwiw i think both the original and that tweaked version are problematic - too mu...

the original is a well known puzzle tho

#

well relatively well known

alpine coral Apr 21, 2025, 3:49 AM

#

keen beacon well relatively well known

yeah the creator (a French mathematician) of it is apparently quite famous for these

#

but still, i don't think this is one of his finest

keen beacon Apr 21, 2025, 3:50 AM

#

maybe it was poorly translated

alpine coral Apr 21, 2025, 3:50 AM

#

top rated answer here says there are multiple answers depending on interpretation https://puzzling.stackexchange.com/questions/30012/a-ship-departs-from-le-havre

keen beacon Apr 21, 2025, 3:52 AM

#

anyways it seems doms reworded one was poorly done and doesnt have the intended answer

alpine coral Apr 21, 2025, 3:57 AM

#

~~nah fwiw I think dom's reformulation seems more or less the same as the original (at least as it appears in English here)~~. I think it's a crappy question, as the comments highlight - like there's several non-trivial flaws / alternative interpretations imo
[edit: the spaceship reformulation is flawed.. i was confused by what was being referenced]

keen beacon Apr 21, 2025, 3:57 AM

#

alpine coral ~~nah fwiw I think dom's reformulation seems more or less the same as the origin...

u got 43?

elder rapids Apr 21, 2025, 3:58 AM

#

this one is 15

#

it's different

keen beacon Apr 21, 2025, 3:58 AM

#

yes but doms formulation is 43

#

the intended answer

elder rapids Apr 21, 2025, 3:58 AM

#

no

keen beacon Apr 21, 2025, 3:58 AM

#

but its actually 42 i think

elder rapids Apr 21, 2025, 3:59 AM

#

the intended answer is 42

keen beacon Apr 21, 2025, 3:59 AM

#

no its 43

elder rapids Apr 21, 2025, 3:59 AM

#

and it's the answer

keen beacon Apr 21, 2025, 3:59 AM

#

according to dom

elder rapids Apr 21, 2025, 3:59 AM

#

lol

elder rapids Apr 21, 2025, 3:59 AM

#

keen beacon according to dom

prob tried recalling the answer from 42

#

and mistook it for 43

keen beacon Apr 21, 2025, 3:59 AM

#

Claybrook vs nightwh8sper

#

What's better

elder rapids Apr 21, 2025, 3:59 AM

#

41 is more valid than 43

keen beacon Apr 21, 2025, 3:59 AM

#

no

#

u can see his reasoning lol

elder rapids Apr 21, 2025, 4:00 AM

#

send

keen beacon Apr 21, 2025, 4:01 AM

#

it doesnt make sense

elder rapids Apr 21, 2025, 4:01 AM

#

alpine coral top rated answer here says there are multiple answers depending on interpretatio...

it is 15 ye, I think this comes from deliberately choosing other interpretation though

#

"every day"

#

"conversely"

#

"exactly 7 days"

#

so might be a skill issue or non native English speakers

#

tbf not everyone cares that much about the wording and assume it'll be super intuitive

keen beacon Apr 21, 2025, 4:02 AM

#

alpine coral ~~nah fwiw I think dom's reformulation seems more or less the same as the origin...

im surprised you got 43 how??

#

specific mechanics in le havre dont apply because of the changes he made

alpine coral Apr 21, 2025, 4:04 AM

#

keen beacon im surprised you got 43 how??

i was referring to this formulation (which is apparently the original ) #general message

I got 42 for this one #general message (but the wording is more flawed imo)

#

can you link to the version from Dom where 43 is the putative answer

keen beacon Apr 21, 2025, 4:05 AM

#

keen beacon it doesnt make sense

this?

elder rapids Apr 21, 2025, 4:05 AM

#

keen beacon it doesnt make sense

this is his reasoning?

keen beacon Apr 21, 2025, 4:05 AM

#

#general message he also justified here

alpine coral Apr 21, 2025, 4:05 AM

#

keen beacon this?

thanks

keen beacon Apr 21, 2025, 4:05 AM

#

his version doesnt make sense

alpine coral Apr 21, 2025, 4:05 AM

#

i glazed over the spaceship versions ha

keen beacon Apr 21, 2025, 4:05 AM

#

if u change it to 505 i think it works though

#

but models still get it anyway

#

so this type of question isnt particularly challenging anymore

#

Is it better than o3

#

?

#

?

#

The Google model

#

How does it compare

#

which google model??

elder rapids Apr 21, 2025, 4:08 AM

#

tbh

keen beacon Apr 21, 2025, 4:08 AM

#

The new one on webdev?

elder rapids Apr 21, 2025, 4:08 AM

#

y'all gotta be comparing o4 mini

#

to 2.5 pro

#

not o3

keen beacon Apr 21, 2025, 4:08 AM

#

y

balmy mist Apr 21, 2025, 4:10 AM

#

whats with this 43 i keep seeing

elder rapids Apr 21, 2025, 4:10 AM

#

balmy mist whats with this 43 i keep seeing

just an old puzzle thing

balmy mist Apr 21, 2025, 4:10 AM

#

oh lol

elder rapids Apr 21, 2025, 4:10 AM

#

the server seems stuck on

#

someone proposed the wrong answer initially

#

and the models were actually getting it right

#

but nobody could really figure it out without second guessing themselves because the models seem to get answers that are very close

#

just not the specific answer 43

keen beacon Apr 21, 2025, 4:12 AM

#

dom made a mistake when rewording it, and assumed an incorrect ground truth. this type of question seems conquered by models tho

elder rapids Apr 21, 2025, 4:12 AM

#

ye

#

and how strong the model forgiveness is somewhat decides how it's going to approach it

#

since the questions are flawed in general

keen beacon Apr 21, 2025, 4:18 AM

#

Is claybrook a coding model or a new general purpose model

narrow elbow Apr 21, 2025, 5:08 AM

#

Google has so many models, I'm a bit suspicious if the names of those models come from Google? Shouldn't we come up with some more fun names like: temusk, closesam?🤪

elder rapids Apr 21, 2025, 5:33 AM

#

ion think it's a truly massive model though, doesn't seem to know more than 2.0 pro or 1.5 pro

#

and running a model like that wouldn't be practical and somewhat dismisses the fact 2.5 flash exists

#

which in comparison to 2.5 pro, isn't a major downgrade, yet is the cheapest and best reasoning/non reasoning model by far

#

we can infer flash is less than a 100b due to its 8b variant, since it would be redundant and concede the proposition the original flash makes by being fast and small, why create size extremities and therefore room for an intermediate "flash" model

#

but it doesn't exist

elder rapids Apr 21, 2025, 5:56 AM

#

me saying cheapest is relevant to that entire claim

#

its not saying it's the "best"

#

no?

#

bruh what

#

you're getting confused

#

2.5 FLASH

#

depends

#

2.5 flash is LITERALLY just a smaller 2.5 pro

#

same techniques prob

#

if 2.5 pro is so good generally, since it knows a lot

#

then it's gonna be good generally

#

if 2.5 flash is attempting to be general, but doesn't know as much

#

I think it's gonna simply appear worse

#

because there's nothing specific that stands out

#

whereas o3 mini vs o1, coding

#

o4 mini vs o3, coding/puzzle tasks

#

ye the gap is getting closer and closer for the narrow tasks

#

although o3 and o4 mini hallucinate a ton

#

and o4 mini doesn't quite comprehend some stuff

#

I think it generally does better

#

than o3

#

for direct if then processes

#

ye depends tho

keen beacon Apr 21, 2025, 6:05 AM

#

it hallucinates significantly more than o1

elder rapids Apr 21, 2025, 6:05 AM

#

still, depends

keen beacon Apr 21, 2025, 6:06 AM

#

lmao don't try and downplay it

elder rapids Apr 21, 2025, 6:06 AM

#

how come 2.5 pro can so effortlessly process this stuff

#

without hallucinating at all

#

I don't want to bring up an example this way

#

but it's def clear

#

p sure this is still the flaw of the model

#

o3 has better reasoning retention than o1

#

and that's why it's doing better at context (just not as good as some benchmarks suggest)

#

but if it's still failing hard at recalling instruction and iterating through it

#

then is it truly a good model

keen beacon Apr 21, 2025, 6:09 AM

#

the way they have trained the model to respond differs from o1 in that o3 is eager to provide specific facts and figures and sources, which is all well and good but it just ends up all false

elder rapids Apr 21, 2025, 6:10 AM

#

ion think so anymore

#

I think they released it to dismiss o3

#

if a summer release is true

#

they have time to simply do this with o4

elder rapids Apr 21, 2025, 6:12 AM

#

keen beacon the way they have trained the model to respond differs from o1 in that o3 is eag...

ye, I read the paper

#

in my testing 4.1 stomps deepseek v3

#

but still

#

2.0 pro just seems to be the best

#

as far as just being smart goes

sonic tendon Apr 21, 2025, 6:12 AM

#

keen beacon the way they have trained the model to respond differs from o1 in that o3 is eag...

lmaoo

elder rapids Apr 21, 2025, 6:13 AM

#

so?

#

not at all lmao

#

4.1 is really just

#

good

#

but it just doesn't understand stuff implicitly

#

hn

#

2.0 pro is way better

#

it's not even close tbh

#

and deepseek v3.1 is also mid

#

hallucinates way too much

#

awfully assumptive since my primary concern is practicality

#

not puzzles or narrow tasks lol

#

I just happen to use puzzles for maybe mathematical reasoning

#

or testing how it processes a claim

#

ye

keen beacon Apr 21, 2025, 6:17 AM

#

You can exclude providers on or

elder rapids Apr 21, 2025, 6:17 AM

#

2.5 flash 💔

#

.60 lmao

#

why would you reason using it

#

yet it's better than 2.0 flash

#

while flash 2.0 has always been the absolute best price for performance

#

this just doesn't follow

#

bruh?

keen beacon Apr 21, 2025, 6:18 AM

#

They were working on it for a while so I don't think so

narrow elbow Apr 21, 2025, 6:19 AM

#

Prefer to hand over data to U.S. companies, which are more trustworthy?

elder rapids Apr 21, 2025, 6:19 AM

#

def not lmao

keen beacon Apr 21, 2025, 6:19 AM

#

It's also used as the base in the retrained o3

elder rapids Apr 21, 2025, 6:19 AM

#

narrow elbow Prefer to hand over data to U.S. companies, which are more trustworthy?

ye, historically and legally

narrow elbow Apr 21, 2025, 6:19 AM

#

🤣

elder rapids Apr 21, 2025, 6:19 AM

#

not "barely"

#

it's pretty visible

#

2.0 flash is hella straightforward

#

albeit concise

#

2.5 flash retains conciseness while giving insight

#

it's obviously a major upgrade

#

including instruction tasks

narrow elbow Apr 21, 2025, 6:20 AM

#

racism😏

elder rapids Apr 21, 2025, 6:20 AM

#

although it doesn't get "better" after practically perfect in retaining instructions

keen beacon Apr 21, 2025, 6:21 AM

#

2.5 flash isn't good?

elder rapids Apr 21, 2025, 6:21 AM

#

it's good asf

#

I've been using it

keen beacon Apr 21, 2025, 6:21 AM

#

Do you use the non thinking mode too

elder rapids Apr 21, 2025, 6:21 AM

#

a ton

elder rapids Apr 21, 2025, 6:21 AM

#

keen beacon Do you use the non thinking mode too

I use only the non thinking mode

keen beacon Apr 21, 2025, 6:21 AM

#

Oh

#

Hmm I was considering it for a task later

elder rapids Apr 21, 2025, 6:22 AM

#

ye, this pretty much goes for all smaller models

#

if you want very specific feedback

#

go for them

#

W utility

#

def efficient

#

🙏

keen beacon Apr 21, 2025, 6:23 AM

#

Qwen max isn't small tho lmao Im interested in them open sourcing it

#

Depends tbh I heard prompt processing can be slow on macs

elder rapids Apr 21, 2025, 6:23 AM

#

keen beacon Qwen max isn't small tho lmao Im interested in them open sourcing it

he's joking but ngl I wouldn't stick with it regardless

#

qwen 2.5 seems the right choice

keen beacon Apr 21, 2025, 6:24 AM

#

Qwen 3 is about to come out anyway

elder rapids Apr 21, 2025, 6:24 AM

#

try hooking up full deepseek r1

keen beacon Apr 21, 2025, 6:24 AM

#

Qwen 2.5 is super old

elder rapids Apr 21, 2025, 6:25 AM

#

this is the most efficient you can get

keen beacon Apr 21, 2025, 6:25 AM

#

You can probably run qwq well tho and fast on a mac

elder rapids Apr 21, 2025, 6:25 AM

#

0 temp deepseek r1 for low context output goes crazy

#

🙏

keen beacon Apr 21, 2025, 6:26 AM

#

Original qwq wasn't that finicky I think

hardy pecan Apr 21, 2025, 6:26 AM

#

#

This pretty much sums it up unfortunately

keen beacon Apr 21, 2025, 6:27 AM

#

Eh usable even if it's somewhat slow I think

#

Definitely spoiled by api speed lol

#

O lol

#

It makes more sense with Mac mini s I guess

#

Yeah much better compute compared to a Mac I think and higher bandwidth

#

It'll do well if it can fit

#

Yea

keen beacon Apr 21, 2025, 6:39 AM

#

hardy pecan

exactly

#

o3 was a step forward for everything except in combatting hallucinations

hardy pecan Apr 21, 2025, 6:39 AM

#

We all feel it right

#

hallucinates more, and more lazy

#

o3 benefits with a great prompt

#

if you are clear and concise and verbose with your prompt, it'll do well, but it definitely will cut corners if you don't watch it

novel flame Apr 21, 2025, 7:17 AM

#

What makes you say that? Their bet on JEPA?

plain zinc Apr 21, 2025, 7:18 AM

#

Gemini 2.5 Pro with Canvas just WOW

📎 canvas.html

fleet lintel Apr 21, 2025, 7:49 AM

#

what are the incoming models that folks are excited about ?

hardy pecan Apr 21, 2025, 8:12 AM

#

Tomay appears to be a thinkerrrr

earnest parcel Apr 21, 2025, 8:13 AM

#

fleet lintel what are the incoming models that folks are excited about ?

Qwen3, DeepSeek-R2 for me.

torn mantle Apr 21, 2025, 8:17 AM

#

Yea def r2

fleet lintel Apr 21, 2025, 8:27 AM

#

earnest parcel Qwen3, DeepSeek-R2 for me.

Same for R2. Will it top the charts? Better than 2.5 or o3?

hardy pecan Apr 21, 2025, 8:50 AM

#

I wont lie, I despise R1 when it comes up in lmarena, since im sitting there waiting for 4 mins for an output..

#

I hope r2 is faster

cedar tide Apr 21, 2025, 9:26 AM

#

earnest parcel Qwen3, DeepSeek-R2 for me.

And llama reasoning and behemoth

#

New model the v5 and V6 of "cobalt-exp-beta"
(Amazon model)

#

GPT 4.1 mini and nano are in the arena?

torn mantle Apr 21, 2025, 10:13 AM

#

cedar tide New model the v5 and V6 of "cobalt-exp-beta" (Amazon model)

yea was added yesterday

#

i think two models were added

kind cloud Apr 21, 2025, 10:33 AM

#

by Amazon

Screenshot_2025-04-20-19-21-35-227-edit_com.android.chrome.jpg

#

(Yesterday's screenshot)

cedar tide Apr 21, 2025, 10:38 AM

#

kind cloud by Amazon

Its good ? Its finally amazon premier ?

#

Apricot v1 was on the arena a month ago

keen fulcrum Apr 21, 2025, 10:45 AM

#

plain zinc Apr 21, 2025, 10:59 AM

#

Claybrook(LMarena)
o3

torn mantle Apr 21, 2025, 11:03 AM

#

cedar tide Its good ? Its finally amazon premier ?

no

calm sequoia Apr 21, 2025, 11:41 AM

#

Im often very impressed with the qwen 32B. I wonder what could they achieve with normal size models and inference compute

alpine coral Apr 21, 2025, 11:55 AM

#

keen beacon but its actually 42 i think

yeah you're spot on

#

i don't get 43 for that liverpool-dublin spaceship formulation.. (it's indeed botched)

#

funny thing is.. by changing it from Le Havre/NY to Liverpool/Dublin, it actually removes one of the chief ambiguities in the original version (timezones), as both cities are in the same time zone. Then the other change made, rather than both routes departing at the same time, in this version there is 1hr delay – so there is no overlap (putting aside the problematic aspect of whether that overlap actually constitutes 'encountering' another company ship). But..using 503 hours.. yeah.. I get 42. With 505hrs, there is sufficient time that the 1hr delay means you encounter an additional Liverpool-bound spaceship, bringing it to 43. The stuff about spending time in the port with people boarding etc is nonsense and cannot be used as part of the 'answer' lol
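
A quick way to sanity-check the counts above is brute-force enumeration. This is a minimal sketch, assuming constant speed, one opposing departure every 24 hours, and a trip of the same length in both directions. The function name `count_encounters` and its parameters are hypothetical, made up for illustration, and the exact wording of the reworded puzzle may not match this model:

```python
def count_encounters(trip_hours, offset_hours=0, period_hours=24,
                     include_port_meetings=True):
    """Count opposing ships met by one ship that departs at time 0.

    Assumed model (not taken from the puzzle text): opposing ships depart
    at s = k * period_hours + offset_hours for every integer k, and all
    ships travel at constant speed for trip_hours. Two ships are at the
    same position at time (trip_hours + s) / 2, which falls inside our
    trip exactly when -trip_hours <= s <= trip_hours.
    """
    count = 0
    # Generous search window: departures further away cannot overlap our trip.
    k_max = trip_hours // period_hours + 2
    for k in range(-k_max, k_max + 1):
        s = k * period_hours + offset_hours
        if include_port_meetings:
            met = -trip_hours <= s <= trip_hours
        else:
            # Strict "at sea" reading: skip meetings exactly at departure/arrival.
            met = -trip_hours < s < trip_hours
        if met:
            count += 1
    return count


if __name__ == "__main__":
    print("7 days, no offset:", count_encounters(168))
    for hours in (503, 504, 505):
        print(hours, "hours, 1h offset:", count_encounters(hours, offset_hours=-1))
```

Under these assumptions the classic 7-day version comes out to 15, while a 1-hour offset gives 42 for 503 and 504 hours and 43 for 505 hours. Passing `include_port_meetings=False` drops the 503-hour case to 41, which is why the answer hinges on how "encounter" is read.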

#

what a mess..anyway.. let's move on ha

hardy pecan Apr 21, 2025, 11:58 AM

#

yep

torn mantle Apr 21, 2025, 11:58 AM

#

Dom gave us quite the entertainment for the past 2 days

hardy pecan Apr 21, 2025, 11:58 AM

#

Turns out LLMS actually DO know their stuff, and the question creator is confused XD

#

agi confirmed

brittle tiger Apr 21, 2025, 12:23 PM

#

AGI would question the premise of spaceship travel between Liverpool and Dublin and also note two launching rockets could see the other while reaching orbit due to proximity

ember rapids Apr 21, 2025, 12:29 PM

#

Will titans be used in Gemini 3.0?

#

Or perhaps 2.5 ultra if we get it

novel flame Apr 21, 2025, 12:59 PM

#

ember rapids Will titans be used in Gemini 3.0?

The question is if it was aka used in 2.5…. They would have had the research internally for a while before it was published

balmy mist Apr 21, 2025, 1:07 PM

#

do we have any interesting news this week?

#

the only thing about having good weeks is that the next week is usually dry lmaoo

alpine coral Apr 21, 2025, 1:13 PM

#

not sure but feels like there's a bit of momentum and dynamism atm that was lacking since reasoning models were first introduced

#

just an unrelated curiosity... anyone have any thoughts why India seems to be kinda absent from the AI landscape? (I say that.. though perhaps it's misplaced)

#

not: where is India's deepseek moment?.. just: why doesn't India seem to appear in the picture at all? big country; lots clever minds etc..

wintry tinsel Apr 21, 2025, 1:19 PM

#

alpine coral not: where is India's deepseek moment?.. just: why doesn't India seem to appear ...

They all come to the US where they get paid better

fleet lintel Apr 21, 2025, 1:20 PM

#

cedar tide New model the v5 and V6 of "cobalt-exp-beta" (Amazon model)

is amazon serious about their models? Is it worth giving them any benefit of doubt?

hardy pecan Apr 21, 2025, 1:21 PM

#

fleet lintel is amazon serious about their models? Is it worth giving them any benefit of dou...

They are just so bad, like bottom of the barrel bad..

alpine coral Apr 21, 2025, 1:21 PM

#

fleet lintel is amazon serious about their models? Is it worth giving them any benefit of dou...

ha yeah but still.. i don't find that explanation satisfactory.. like there's still plenty of talent regardless of that kinda brain drain.. and there's capital (increasingly so)

wintry tinsel Apr 21, 2025, 1:21 PM

#

Google, Claude, Open AI, Grok, Deep Seek, everyone else is chump change

alpine coral Apr 21, 2025, 1:21 PM

#

alpine coral ha yeah but still.. i don't find that explanation satisfactory.. like there's st...

oh this was meant to be @wintry tinsel

wintry tinsel Apr 21, 2025, 1:22 PM

#

alpine coral oh this was meant to be @wintry tinsel

Their country is in a lot of political unrest now, they seem to be sort of disconnected from the rest of the world like South America, they are their own cultural eco system satisfied with their own internal state of affairs

fleet lintel Apr 21, 2025, 1:22 PM

#

alpine coral not: where is India's deepseek moment?.. just: why doesn't India seem to appear ...

India? lol. you will not see any Deepseek moment from India.
They will use models from others to create applications. But there is no hope that they will be able to build a competitive model

wintry tinsel Apr 21, 2025, 1:23 PM

#

They are not worried about winning any AI race, Indians are not very creative, hard working and smart, but slow to new technology

alpine coral Apr 21, 2025, 1:24 PM

#

wintry tinsel Their country is in a lot of political unrest now, they seem to be sort of disco...

yeah i don't wanna go down a political direction.. but i'm fairly familiar with India's politics - i get the crony capitalism under Modi (but also how the BJP was dealt a blow at the last elections.. only scraping by.. that said.. the INC is still a mess)

fleet lintel Apr 21, 2025, 1:24 PM

#

wintry tinsel They are not worried about winning any AI race, Indians are not very creative, ...

They dont have any govt support and they dont have good VC capital money in research areas.
They are creative. But creativity shines when they are in countries like US

alpine coral Apr 21, 2025, 1:24 PM

#

i still don't find it satisfactory as an explanation (brain drain / domestic politics).. there's so much Indian pride...

#

i thought maybe the diversity of the country

#

there's so many languages

#

unlike China. where [the central govt] is just like: you're all Han, and speak Mandarin

wintry tinsel Apr 21, 2025, 1:25 PM

#

Hopefully this helps clarify

#

wintry tinsel Apr 21, 2025, 1:26 PM

#

fleet lintel They dont have any govt support and they dont have good VC capital money in rese...

Perhaps Indians are capable of creativity, but the country itself does not foster or support it very much culturally

fleet lintel Apr 21, 2025, 1:27 PM

#

wintry tinsel Perhaps Indians are capable of creativity, but the country itself does not foste...

I agree.

wintry tinsel Apr 21, 2025, 1:28 PM

#

Japan is also a country known for its hard working and smart tech industry adopting technologies like the bullet train before most other major urban centers, yet they are absent from AI space too

alpine coral Apr 21, 2025, 1:28 PM

#

wintry tinsel

shame to see

fleet lintel Apr 21, 2025, 1:28 PM

#

I am concerned a bit about EU though. With america ruled by mad king, EU might be in trouble

wintry tinsel Apr 21, 2025, 1:29 PM

#

Japan and India should collaborate on AI, now that Japan has opened floodgates to immigration

#

It won’t happen though

#

Samsung in South Korea is also in panic mode as they don’t have any competitive AI models and their new phone chips are being outperformed by others with better AI implementation

fleet lintel Apr 21, 2025, 1:31 PM

#

Samsung is too dependent on Android and Google

#

and I think they will continue to

#

Any news in the Industry? Which models are businesses adopting more of ? OAI models?

ember rapids Apr 21, 2025, 1:33 PM

#

novel flame The question is if it was aka used in 2.5…. They would have had the research int...

We’d know if 2.5 pro had it wouldn’t we?

novel flame Apr 21, 2025, 1:54 PM

#

ember rapids We’d know if 2.5 pro had it wouldn’t we?

One of the strongest properties of Titans is loooong context recall performance, like drastically better than other models. Have you seen the Gemini 2.5 needle-in-a-haystack results? It fits.

novel flame Apr 21, 2025, 1:56 PM

#

fleet lintel Any news in the Industry? Which models are businesses adopting more of ? OAI m...

I think enterprises are using OpenAI models (and MS Copilot, which has to be a derivative of OpenAI?) if they are in the MS ecosystem; and Google models if they are in the Google ecosystem. Apart from that I'm pretty confident a lot of businesses have accounts with OpenAI just because that's what they know / what came first, and if they build software a lot of them are probably paying for GitHub Copilot (again, MS/OpenAI). Anthropic is a developer favorite of course, but that's a narrow segment.

I don't know to what extent any of the other big labs have captured the enterprise market.

cedar tide Apr 21, 2025, 2:05 PM

#

fleet lintel is amazon serious about their models? Is it worth giving them any benefit of dou...

llama 2, gemini 1.5 and grok 2 were the 3 really bad ones, while now meta google and xai have really interesting models and with a very good price-quality ratio

opaque adder Apr 21, 2025, 2:06 PM

#

dayhush or nw?

cedar tide Apr 21, 2025, 2:08 PM

#

but I don't have the impression that Amazon really wants to be in the race, but maybe they only want to have their AI for their chatbot and for Alexa

fleet lintel Apr 21, 2025, 2:13 PM

#

cedar tide but I don't have the impression that Amazon really wants to be in the race, but ...

what's the point of a half-hearted effort? they are still spending like a billion+ per model and producing garbage.
What's the long term game here?

balmy mist Apr 21, 2025, 2:13 PM

#

people think that we might get deepseek this week right?

#

r2*

fleet lintel Apr 21, 2025, 2:14 PM

#

Hard to say but I think it's likely

balmy mist Apr 21, 2025, 2:14 PM

#

fleet lintel what's the point of doing half-hearted effort? they are still spending like bil...

and i thought amazon partnered with claude?

#

its like what happened with microsoft and OAI

keen beacon Apr 21, 2025, 2:14 PM

#

balmy mist people think that we might get deepseek this week right?

yeah i think it makes sense

balmy mist Apr 21, 2025, 2:14 PM

#

they still ended up making their own model

keen beacon Apr 21, 2025, 2:14 PM

#

either this week or next week for sure

fleet lintel Apr 21, 2025, 2:15 PM

#

both Google and Amazon partnered with Claude. But Amazon also creates garbage like nova

balmy mist Apr 21, 2025, 2:15 PM

#

keen beacon yeah i think it makes sense

i really hope man!! that would make this month top tier

keen beacon Apr 21, 2025, 2:15 PM

#

same goes for qwen 3

#

i do have pretty high hopes for R2

#

a few little birdies tell me it gives o3 a run for its money

balmy mist Apr 21, 2025, 2:15 PM

#

keen beacon i do have pretty high hopes for R2

you think it will beat 2.5?

keen beacon Apr 21, 2025, 2:15 PM

#

it will be behind in some things and better in others

#

2.5 is very good

balmy mist Apr 21, 2025, 2:15 PM

#

last time it was pretty much o1 right? during o1 phase?

keen beacon Apr 21, 2025, 2:15 PM

#

yeah

balmy mist Apr 21, 2025, 2:16 PM

#

so r2 should be o3 ish

#

wow

#

that would be insane

balmy mist Apr 21, 2025, 2:16 PM

#

keen beacon a few little birdies tell me it gives o3 a run for its money

whatt

fleet lintel Apr 21, 2025, 2:16 PM

#

I'll be disappointed if they dont beat 2.5 Pro in half of the categories

balmy mist Apr 21, 2025, 2:16 PM

#

your connects talking

balmy mist Apr 21, 2025, 2:16 PM

#

fleet lintel I'll be disappointed if they dont beat 2.5 Pro in half of the categories

right now can we reliably say that o3 is better than 2.5 pro?

keen beacon Apr 21, 2025, 2:17 PM

#

either way it'll be state of the art as an open model by a big margin

#

i don't know how much of a threat qwen 3 will pose though

#

don't have any connections there lmao

novel flame Apr 21, 2025, 2:17 PM

#

fleet lintel is amazon serious about their models? Is it worth giving them any benefit of dou...

It's a good question. Their Titan series of models was utter trash; then they suddenly came out with an entirely new set of models - Nova, which is not SOTA nor within spitting distance of it, but let's be fair here, Nova Pro was not terrible either, and it was a gargantuan leap up from Titan. If they can make another leap like that then they could potentially join the top labs. But that's a big if.

keen beacon Apr 21, 2025, 2:17 PM

#

speaking of nova titan

fleet lintel Apr 21, 2025, 2:17 PM

#

balmy mist right now can we reliably say that o3 is better than 2.5 pro?

nope.. I consider 2.5 > o3 (but barely)

keen beacon Apr 21, 2025, 2:17 PM

#

they said early 2025

#

where the hell is it

balmy mist Apr 21, 2025, 2:17 PM

#

@keen beacon are your birdies letting you test out the model 👀

keen beacon Apr 21, 2025, 2:17 PM

#

if cobalt on the arena is titan it sucks

keen beacon Apr 21, 2025, 2:17 PM

#

balmy mist @keen beacon are your birdies letting you test out the model 👀

nope

balmy mist Apr 21, 2025, 2:17 PM

#

fleet lintel nope.. I consider 2.5 > o3 (but barely)

i think they are better in different stuff tbh

balmy mist Apr 21, 2025, 2:18 PM

#

keen beacon nope

nahh we need to make you the official model tester

keen beacon Apr 21, 2025, 2:18 PM

#

they don't really hire people outside of china for that stuff

fleet lintel Apr 21, 2025, 2:18 PM

#

balmy mist i think they are better in different stuff tbh

+1. For me 2.5 is slightly better but o3 is slightly better in many other areas as well

balmy mist Apr 21, 2025, 2:18 PM

#

you can be an unbiased source for all companies

keen beacon Apr 21, 2025, 2:18 PM

#

for national security reasons mostly

#

the ccp aren't taking any chances

balmy mist Apr 21, 2025, 2:19 PM

#

fleet lintel +1. For me 2.5 is slightly better but o3 is slightly in many other areas as we...

yeah reasoning o3 def wins

#

but dont forget yall, o3 pro coming out next week

#

and if its the same gap from o1 to o1 pro

fleet lintel Apr 21, 2025, 2:19 PM

#

novel flame It's a good question. Their Titan series of models was utter trash; then they su...

umm.. may be they come up with something interesting.. yet to see. Right now, i dont consider them in AI race. They are in the top of AI infra provider though

keen beacon Apr 21, 2025, 2:19 PM

#

i may re-sub to chatgpt pro next week

#

depends on how good o3 pro is

balmy mist Apr 21, 2025, 2:19 PM

#

i think we dont see a model beat o3 pro until end of year

balmy mist Apr 21, 2025, 2:20 PM

#

keen beacon i may re-sub to chatgpt pro next week

fr

keen beacon Apr 21, 2025, 2:21 PM

#

balmy mist i think we dont see a model beat o3 pro until end of year

we will

fleet lintel Apr 21, 2025, 2:21 PM

#

keen beacon i may re-sub to chatgpt pro next week

pro is 200$ per week? Yes, i am never subbing it unless it is like crazy better over 2.5 pro

keen beacon Apr 21, 2025, 2:21 PM

#

trust me

keen beacon Apr 21, 2025, 2:21 PM

#

fleet lintel pro is 200$ per week? Yes, i am never subbing it unless it is like crazy better...

month

fleet lintel Apr 21, 2025, 2:21 PM

#

yes, month.. wow, they are so generous 😄

keen beacon Apr 21, 2025, 2:21 PM

#

to be fair you do get practically unlimited o3, o4 mini, 4o + image gen

#

if only they didn't cap o3 pro

balmy mist Apr 21, 2025, 2:21 PM

#

keen beacon we will

hmm i wont be mad if we do lmaoo

balmy mist Apr 21, 2025, 2:22 PM

#

keen beacon to be fair you do get practically unlimited o3, o4 mini, 4o + image gen

and sora lol

keen beacon Apr 21, 2025, 2:22 PM

#

sora is unfortunately pretty bad

balmy mist Apr 21, 2025, 2:22 PM

#

its def worth the price imo

keen beacon Apr 21, 2025, 2:22 PM

#

hopefully sora 2 drops this year

novel flame Apr 21, 2025, 2:22 PM

#

fleet lintel umm.. may be they come up with something interesting.. yet to see. Right now, i...

In my testing, Nova Pro ended up in the same tier (third tier) as Grok 3, Gemini 2.0 Flash, o3-mini-high, Mistral Large, and QwQ-32B. So given its price point (very low) it did have some merit there for a while.

keen beacon Apr 21, 2025, 2:22 PM

#

oh yeah and imagen 4 is coming in the next 1-2 months btw

#

it's in internal testing

balmy mist Apr 21, 2025, 2:23 PM

#

but whats this talk about OpenAI branching out so much into shopping, social media, etc..

fleet lintel Apr 21, 2025, 2:23 PM

#

novel flame In my testing, Nova Pro ended up in the same tier (third tier) as Grok 3, Gemini...

i thought nova pro was like 5x+ more expensive than flash.. let me check

ocean vortex Apr 21, 2025, 2:24 PM

#

balmy mist but whats this talk about openAi branching out so much into shopping, social med...

I think they are looking into avenues that actually make money, instead of, you know... a money-sinking business

fleet lintel Apr 21, 2025, 2:24 PM

#

Actually nova pro is 8x more expensive than 2.0 flash

fleet lintel Apr 21, 2025, 2:25 PM

#

balmy mist but whats this talk about openAi branching out so much into shopping, social med...

i dont get it? why not focus on a couple of things and make them kick ass instead of 10 mediocre things?

#

each extra venture takes effort and energy away from other things

#

i have seen businesses fail multiple times because of expanding too fast. it is a common pitfall

ocean vortex Apr 21, 2025, 2:27 PM

#

fleet lintel i dont get it? why not focus on a couple of things and make them kick ass instead o...

I don't think there's a single frontier AI LLM company out there that is actually making money

sonic tendon Apr 21, 2025, 2:27 PM

#

keen beacon a few little birdies tell me it gives o3 a run for its money

no way

#

i hope

fleet lintel Apr 21, 2025, 2:27 PM

#

sonic tendon no way

i also have my doubts about r2 >>> o3

sonic tendon Apr 21, 2025, 2:28 PM

#

fleet lintel i also have my doubts about r2 >>> o3

i mean, i was half joking

ocean vortex Apr 21, 2025, 2:28 PM

#

keen beacon a few little birdies tell me it gives o3 a run for its money

I wouldn't expect it any other way tbh

novel flame Apr 21, 2025, 2:28 PM

#

fleet lintel i thought nova pro was like 5x+ more expensive than flash.. let me check

Yes, it had merit before Flash came out

sonic tendon Apr 21, 2025, 2:28 PM

#

i think they have a pretty good chance

fleet lintel Apr 21, 2025, 2:28 PM

#

ocean vortex I don't think there's a single frontier AI LLM company out there that is actuall...

yes and spreading in more directions == more losses and less focus

ocean vortex Apr 21, 2025, 2:28 PM

#

o3 is really not that impressive despite the hype and "feelings"

#

the biggest practical change is how they made it now use every tool that they have, quite extensively

brittle tiger Apr 21, 2025, 2:30 PM

#

balmy mist right now can we reliably say that o3 is better than 2.5 pro?

Definitely depends on use case. Bigger context makes 2.5 shine in many cases. Tooling and fewer restrictions vault o3 over others. I was talking to some crazy girl this weekend and she and all her friends have started using o3 and deep research to analyze guys on dating apps. o3 is much more willing to dig up social profiles than Gemini. I tested it myself and o3's output freaked me out. Gemini didn't touch socials.

fleet lintel Apr 21, 2025, 2:31 PM

#

brittle tiger Definitely depends on use case. Bigger context makes 2.5 shine in many cases. To...

damn.. never thought people would be using deep research this way 🙂

balmy mist Apr 21, 2025, 2:32 PM

#

as long as r2 is near o3 its a win, an o3 thats around r1 price is nuts

#

it doesnt have to be better

keen beacon Apr 21, 2025, 2:32 PM

#

r2 is absolutely near o3

#

well more than near